Outline: Statistical framework · Empirical Risk Minimization · Cross-validation · Results · Conclusion
Segmentation of the mean of heteroscedastic data via cross-validation
Alain Celisse
1. UMR 8524 CNRS - Université Lille 1
2. SSB Group, Paris
joint work with Sylvain Arlot
GDR Statistique et Santé
Paris, October 21, 2009
Illustration: Original signal
[Figure: original signal; x-axis Position t, y-axis Signal]
Illustration: Observed signal (discretized)
[Figure: discretized signal (n = 100 observations); x-axis Position t, y-axis Signal]
Illustration: Find breakpoints
[Figure: the same signal with unknown breakpoint locations marked "?"; x-axis Position t, y-axis Signal]
Illustration: True regression function
[Figure: signal and true regression function (legend: Signal, Reg. func.); x-axis Position t, y-axis Signal]
Statistical framework: Change-point detection
$(t_1, Y_1), \dots, (t_n, Y_n) \in [0,1] \times \mathcal{Y}$ independent, with $Y_i = s(t_i) + \sigma_i \varepsilon_i \in \mathcal{Y} = \mathbb{R}$

Instants $t_i$: deterministic (e.g. $t_i = i/n$).
$s$: piecewise constant.
Residuals: $\mathbb{E}[\varepsilon_i] = 0$ and $\mathbb{E}[\varepsilon_i^2] = 1$.
Noise level: $\sigma_i$ (heteroscedastic).
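The model above is straightforward to simulate. A minimal sketch, where the jump locations and the noise profile are illustrative choices, not the ones used in the talk:

```python
import numpy as np

# Simulate Y_i = s(t_i) + sigma_i * eps_i on the deterministic design t_i = i/n.
rng = np.random.default_rng(0)
n = 100
t = np.arange(1, n + 1) / n

def s(x):
    # hypothetical piecewise-constant regression function with three jumps
    return np.select([x < 0.25, x < 0.5, x < 0.75], [0.0, 1.0, -0.5], 0.5)

sigma = np.where(t < 0.5, 0.1, 0.4)   # heteroscedastic: noisier right half
eps = rng.standard_normal(n)          # E[eps_i] = 0, E[eps_i^2] = 1
Y = s(t) + sigma * eps
```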
Estimation versus Identification
Purpose:
Estimate s to recover most of the important jumps w.r.t. the noise level −→ Estimation purpose.
[Figure: zoom on positions 55-100 of the signal Y and the regression function s]
Strategy:
1. Use piecewise constant functions.
2. Adopt the model selection point of view.
Model selection
Models:
$(I_\lambda)_{\lambda \in \Lambda_m}$: partition of $[0,1]$.
$S_m$: linear space of piecewise constant functions on $(I_\lambda)_{\lambda \in \Lambda_m}$.

Strategy:
$(S_m)_{m \in \mathcal{M}_n} \longrightarrow (\widehat{s}_m)_{m \in \mathcal{M}_n} \longrightarrow \widehat{s}_{\widehat{m}}$ ???

Goal:
Oracle inequality (in expectation, or with large probability):
$\|s - \widehat{s}_{\widehat{m}}\|^2 \le C \inf_{m \in \mathcal{M}_n} \left\{ \|s - \widehat{s}_m\|^2 + R(m, n) \right\}$
Least-squares estimator
Empirical risk minimizer over $S_m$ (= model):
$\widehat{s}_m \in \operatorname{argmin}_{u \in S_m} P_n \gamma(u) = \operatorname{argmin}_{u \in S_m} \frac{1}{n} \sum_{i=1}^n (u(t_i) - Y_i)^2.$

Regressogram:
$\widehat{s}_m = \sum_{\lambda \in \Lambda_m} \widehat{\beta}_\lambda \mathbf{1}_{I_\lambda}, \qquad \widehat{\beta}_\lambda = \frac{1}{\operatorname{Card}\{t_i \in I_\lambda\}} \sum_{t_i \in I_\lambda} Y_i.$

Oracle:
$m^* := \operatorname{argmin}_{m \in \mathcal{M}_n} \|s - \widehat{s}_m\|^2.$
$\longrightarrow \widehat{s}_{m^*}$: best estimator among $\{\widehat{s}_m \mid m \in \mathcal{M}_n\}$.
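The regressogram is just a per-segment average. A minimal sketch, with an illustrative partition and data:

```python
import numpy as np

# On each interval I_lambda of the partition, the least-squares fit is the
# mean of the Y_i whose t_i falls in I_lambda.
def regressogram(t, Y, breakpoints):
    """Fitted values s_hat_m(t_i); `breakpoints` are interior endpoints in (0, 1)."""
    edges = np.concatenate(([0.0], np.asarray(breakpoints, dtype=float), [1.0]))
    fitted = np.empty(len(Y), dtype=float)
    for a, b in zip(edges[:-1], edges[1:]):
        mask = (t >= a) & ((t < b) if b < 1.0 else (t <= b))
        fitted[mask] = Y[mask].mean()            # beta_hat_lambda on I_lambda
    return fitted

t = np.arange(1, 11) / 10
Y = np.array([0.0] * 5 + [1.0] * 5)
fitted = regressogram(t, Y, [0.55])
print(fitted)   # each observation is replaced by its segment mean
```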
Empirical Risk Minimization (ERM)
Assumption:
The number D − 1 of breakpoints is known.
Question:
Find the locations of the D − 1 breakpoints (D is given).
Strategy:
The best segmentation in $D$ pieces is obtained by applying the ERM algorithm over $S_D = \bigcup_{m : D_m = D} S_m$.

ERM algorithm:
$\widehat{m}_{\mathrm{ERM}}(D) = \operatorname{argmin}_{m : D_m = D} P_n \gamma(\widehat{s}_m).$
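This argmin over all segmentations in D pieces can be computed exactly by dynamic programming, a standard change-point technique (the talk does not specify the algorithm; this is one common implementation choice). A sketch with illustrative data:

```python
import numpy as np

def best_segmentation(Y, D):
    """Least-squares segmentation of Y into D pieces: (cost, interior breakpoints)."""
    n = len(Y)
    c1 = np.concatenate(([0.0], np.cumsum(Y, dtype=float)))
    c2 = np.concatenate(([0.0], np.cumsum(np.asarray(Y, dtype=float) ** 2)))

    def seg_cost(i, j):  # residual sum of squares of Y[i:j] around its mean
        s1 = c1[j] - c1[i]
        return (c2[j] - c2[i]) - s1 * s1 / (j - i)

    cost = np.full((D + 1, n + 1), np.inf)   # cost[d, j]: best d-piece fit of Y[:j]
    argm = np.zeros((D + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for d in range(1, D + 1):
        for j in range(d, n + 1):
            for i in range(d - 1, j):
                c = cost[d - 1, i] + seg_cost(i, j)
                if c < cost[d, j]:
                    cost[d, j], argm[d, j] = c, i
    bps, j = [], n                           # backtrack the optimal breakpoints
    for d in range(D, 1, -1):
        j = argm[d, j]
        bps.append(j)
    return cost[D, n], sorted(bps)

print(best_segmentation(np.array([0.0, 0.1, -0.1, 2.0, 2.1, 1.9]), 2))
```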
ERM segmentation: Homoscedastic
[Figure: segmentation (homoscedastic): data Y_i, signal, oracle and ERM segmentations; x-axis Position t, y-axis Signal]
−→ ERM is close to the oracle
Expectations
Homoscedastic:
$R(\widehat{s}_m) = \operatorname{dist}(s, S_m) + \sigma^2 \frac{D_m}{n} + \text{cste}, \qquad \mathbb{E}[P_n \gamma(\widehat{s}_m)] = \operatorname{dist}(s, S_m) - \sigma^2 \frac{D_m}{n} + \text{cste}.$

Conclusions:
1. The variance term $\sigma^2 D_m / n$ does not matter,
2. The $S_m$ are only distinguished according to $\operatorname{dist}(s, S_m)$.
ERM segmentation: Heteroscedastic
[Figure: segmentation (heteroscedastic): data Y_i, signal, oracle and ERM segmentations; x-axis Position t, y-axis Signal]
−→ ERM overfits in noisy regions
ERM overfitting: Expectations
Heteroscedastic:
$R(\widehat{s}_m) = \operatorname{dist}(s, S_m) + \frac{1}{n} \sum_\lambda (\sigma^r_\lambda)^2 + \text{cste},$
$\mathbb{E}[P_n \gamma(\widehat{s}_m)] = \operatorname{dist}(s, S_m) - \frac{1}{n} \sum_\lambda (\sigma^r_\lambda)^2 + \text{cste},$
with $(\sigma^r_\lambda)^2 := \frac{1}{n_\lambda} \sum_{i=1}^n \sigma_i^2 \mathbf{1}_{I_\lambda}(t_i)$ and $n_\lambda := \operatorname{Card}(\{i \mid t_i \in I_\lambda\}).$

Conclusions:
1. The variance term is different across models $S_m$ of the same dimension $D$,
2. ERM rather puts breakpoints in the noise.
Cross-validation principle
[Figure: illustration of the cross-validation principle (repeated training/validation splits of the data)]
Cross-validation
Leave-$p$-out (Lpo): $\forall\, 1 \le p \le n - 1$,
$\widehat{R}_p(\widehat{s}_m) = \binom{n}{p}^{-1} \sum_{D^{(t)} \in \mathcal{E}_p} \frac{1}{p} \sum_{Z_i \in D^{(v)}} \left( \widehat{s}_m^{\,D^{(t)}}(t_i) - Y_i \right)^2,$
where $\mathcal{E}_p = \left\{ D^{(t)} \subset \{Z_1, \dots, Z_n\} \mid \operatorname{Card} D^{(t)} = n - p \right\}$ and $D^{(v)} := \{Z_1, \dots, Z_n\} \setminus D^{(t)}$.
Algorithmic complexity: exponential.

Theorem (C. Ph.D. (2008))
$\widehat{R}_p(\widehat{s}_m) = \sum_{\lambda \in \Lambda(m)} \left( S_{\lambda,2}\, A_\lambda + \big(S_{\lambda,1}^2 - S_{\lambda,2}\big)\, B_\lambda \right),$
where $S_{\lambda,1} := \sum_{i=1}^n Y_i \mathbf{1}_{I_\lambda}(t_i)$, $S_{\lambda,2} := \sum_{i=1}^n Y_i^2 \mathbf{1}_{I_\lambda}(t_i)$, and $A_\lambda$, $B_\lambda$ are known functions.
Algorithmic complexity: $O(n)$.
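The coefficients $A_\lambda$, $B_\lambda$ are not reproduced here, but for $p = 1$ the tractability is easy to verify directly: on a regressogram, removing one point shifts its segment mean in closed form, so no refitting is needed. A hedged sketch with illustrative data and segments (standard leave-one-out algebra, not the talk's general formula):

```python
import numpy as np

def loo_risk_closed_form(Y, segments):
    total = 0.0
    for lo, hi in segments:                      # each segment needs >= 2 points
        y = Y[lo:hi]
        n_lam, b = len(y), y.mean()
        # mean without Y_i is (n_lam*b - Y_i)/(n_lam - 1), hence the held-out
        # prediction error is (n_lam / (n_lam - 1)) * (b - Y_i)
        total += np.sum((n_lam / (n_lam - 1) * (b - y)) ** 2)
    return total / len(Y)

def loo_risk_naive(Y, segments):
    total = 0.0
    for lo, hi in segments:
        y = Y[lo:hi]
        for i in range(len(y)):
            total += (np.delete(y, i).mean() - y[i]) ** 2   # refit without i
    return total / len(Y)

Y = np.array([0.0, 0.2, -0.1, 1.0, 1.3, 0.9, 1.1])
segs = [(0, 3), (3, 7)]
print(loo_risk_closed_form(Y, segs), loo_risk_naive(Y, segs))  # the two agree
```

The theorem above generalizes this idea to every $p \le n - 1$ without enumerating splits.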
Applicability of cross-validation
Lpo-based model selection procedure:
1. Lpo is computationally tractable:
   C. and Robin (2008), CSDA: density;
   C. and Robin (2008), arXiv: multiple testing;
   C. Ph.D. (2008), TEL: density, regression;
   C. (2009), arXiv: density.
2. As computationally expensive as ERM.

Lpo segmentation of dimension $D$: for every $1 \le p \le n - 1$,
$\widehat{m}_p(D) = \operatorname{argmin}_{m : D_m = D} \widehat{R}_p(\widehat{s}_m).$
Taking variance into account: Lpo expectation
Theorem (C. Ph.D. (2008))
Homoscedastic:
$\mathbb{E}\big[\widehat{R}_p(\widehat{s}_m)\big] \approx \operatorname{dist}(s, S_m) + \sigma^2 \frac{D_m}{n - p} + \sigma^2.$
Heteroscedastic:
$\mathbb{E}\big[\widehat{R}_p(\widehat{s}_m)\big] \approx \operatorname{dist}(s, S_m) + \frac{1}{n - p} \sum_\lambda (\sigma^r_\lambda)^2 + \text{cste}.$
Recall: $R(\widehat{s}_m) = \operatorname{dist}(s, S_m) + \frac{1}{n} \sum_\lambda (\sigma^r_\lambda)^2 + \text{cste}.$
Leave-one-out (Lpo with p = 1): An alternative to ERM
Strategy:
Replace ERM by leave-one-out (Loo) to take variance into account.
Loo algorithm:
$\widehat{m}_1(D) = \operatorname{argmin}_{m : D_m = D} \widehat{R}_1(\widehat{s}_m).$

Conclusion:
Loo prevents overfitting.
[Figure: two panels comparing segmentations: oracle vs. ERM (left), oracle vs. Loo vs. ERM (right)]
Quality of the segmentations w.r.t. D
[Figure: average loss vs. number of breakpoints for ERM and Loo; homoscedastic (left) and heteroscedastic (right) settings, N = 300 trials]
Quality of the best segmentation
Table: average of $\mathbb{E}\big[\inf_D \|s - \widehat{s}_{\widehat{A}(D)}\|^2\big] / \mathbb{E}\big[\inf_m \|s - \widehat{s}_m\|^2\big]$ over 10 000 samples; $A$ denotes either ERM or Loo.

    s     sigma   ERM           Loo
    s2    c       2.88 ± 0.01   2.93 ± 0.01
          pc,1    1.31 ± 0.02   1.16 ± 0.02
          pc,3    3.09 ± 0.03   2.52 ± 0.03
    s3    c       3.18 ± 0.01   3.25 ± 0.01
          pc,1    3.00 ± 0.01   2.67 ± 0.02
          pc,3    4.41 ± 0.02   3.97 ± 0.02

−→ Same results when D is chosen by VFCV.
Summary
1. Lpo takes variance into account:
   −→ outperforms ERM (heteroscedastic),
   −→ close to ERM (homoscedastic).
2. Lpo is fully tractable (closed-form expressions):
   −→ as computationally expensive as ERM.
3. Similar results when D is chosen by V-fold cross-validation.
Conclusion:
Cross-validation is a robust (to heteroscedasticity) and reliable alternative to ERM.
−→ Arlot and C. (2009), arXiv
The BT474 cell lines
These are epithelial cells
Obtained from human breast cancer tumors
A test genome is compared to a reference male genome
We only consider chromosomes 1 and 9
Results: Chromosome 9
Homoscedastic model (Picard et al. (05))
LOO + VFCV
Results: Chromosome 1
Homoscedastic model (Picard et al. (05))
Heteroscedastic model (Picard et al. (05))
LOO + VFCV
Prospects
1. Optimality results for segmentation procedures.
2. Other resampling schemes (bootstrap, Rademacher penalties, ...).
3. Extension to the multivariate setting: the Detect ANR project.
   Biology: multi-patient CGH profile segmentation.
   Computer vision: video segmentation.
Thank you.