
Outline: Statistical framework · Empirical Risk Minimization · Cross-validation · Results · Conclusion

Segmentation of the mean of heteroscedastic data via cross-validation

Alain Celisse
UMR 8524 CNRS - Université Lille 1; SSB Group, Paris

Joint work with Sylvain Arlot

GDR Statistique et Santé, Paris, October 21, 2009

Illustration: Original signal

[Figure: the signal plotted against position $t$, $0 \le t \le 100$.]

Illustration: Observed signal (discretized)

[Figure: the discretized signal, $n = 100$ observations, plotted against position $t$.]

Illustration: Find breakpoints

[Figure: the same signal, with the unknown breakpoint locations marked "?".]

Illustration: True regression function

[Figure: the observed signal overlaid with the piecewise-constant regression function.]

Statistical framework: Change-point detection

$(t_1, Y_1), \ldots, (t_n, Y_n) \in [0, 1] \times \mathcal{Y}$ independent, with
$Y_i = s(t_i) + \sigma_i \epsilon_i \in \mathcal{Y} = \mathbb{R}.$

$s$: piecewise constant.
Residuals: $E[\epsilon_i] = 0$ and $E[\epsilon_i^2] = 1$.
Noise level: $\sigma_i$ (heteroscedastic).

Statistical framework: Change-point detection

$(t_1, Y_1), \ldots, (t_n, Y_n) \in [0, 1] \times \mathcal{Y}$ independent, with
$Y_i = s(t_i) + \sigma_i \epsilon_i \in \mathcal{Y} = \mathbb{R}.$

Instants $t_i$: deterministic (e.g. $t_i = i/n$).
$s$: piecewise constant.
Residuals: $E[\epsilon_i] = 0$ and $E[\epsilon_i^2] = 1$.
Noise level: $\sigma_i$ (heteroscedastic).
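The framework above is easy to simulate; a minimal sketch (the breakpoint locations, levels, and noise profile are illustrative choices, not values from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100
t = np.arange(1, n + 1) / n  # deterministic design t_i = i/n

# Hypothetical piecewise-constant mean s with jumps at t = 0.2, 0.5, 0.8
levels = [0.0, 1.0, -0.5, 0.5]
conds = [t < 0.2, (t >= 0.2) & (t < 0.5), (t >= 0.5) & (t < 0.8), t >= 0.8]
s = np.piecewise(t, conds, levels)

# Heteroscedastic noise level: the second half is noisier (illustrative choice)
sigma = 0.1 + 0.4 * (t > 0.5)
eps = rng.standard_normal(n)   # E[eps_i] = 0, E[eps_i^2] = 1
y = s + sigma * eps            # observed signal Y_i = s(t_i) + sigma_i eps_i
```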

Estimation versus Identification

Purpose: estimate $s$ so as to recover most of the jumps that are important with respect to the noise level $\longrightarrow$ estimation purpose.

[Figure: zoom on positions 55-100, signal $Y$ overlaid with the regression function $s$.]

Strategy:
1. Use piecewise constant functions.
2. Adopt the model selection point of view.

Model selection

Models:
$(I_\lambda)_{\lambda \in \Lambda_m}$: partition of $[0, 1]$,
$S_m$: linear space of piecewise-constant functions on $(I_\lambda)_{\lambda \in \Lambda_m}$.

Strategy:
$(S_m)_{m \in \mathcal{M}_n} \longrightarrow (\widehat{s}_m)_{m \in \mathcal{M}_n} \longrightarrow \widehat{s}_{\widehat{m}}$?

Goal: an oracle inequality (in expectation, or with large probability):
$\| s - \widehat{s}_{\widehat{m}} \|^2 \le C \inf_{m \in \mathcal{M}_n} \left\{ \| s - \widehat{s}_m \|^2 + R(m, n) \right\}.$


Least-squares estimator

Empirical risk minimizer over $S_m$ (= model):
$\widehat{s}_m \in \operatorname{argmin}_{u \in S_m} P_n \gamma(u) = \operatorname{argmin}_{u \in S_m} \frac{1}{n} \sum_{i=1}^{n} (u(t_i) - Y_i)^2.$

Regressogram:
$\widehat{s}_m = \sum_{\lambda \in \Lambda_m} \widehat{\beta}_\lambda 1_{I_\lambda}$, with $\widehat{\beta}_\lambda = \frac{1}{\operatorname{Card}\{t_i \in I_\lambda\}} \sum_{t_i \in I_\lambda} Y_i.$

Oracle:
$m^* := \operatorname{argmin}_{m \in \mathcal{M}_n} \| s - \widehat{s}_m \|^2$
$\longrightarrow \widehat{s}_{m^*}$: best estimator among $\{ \widehat{s}_m \mid m \in \mathcal{M}_n \}$.
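The regressogram is just a per-cell average; a small sketch (the `edges` encoding of the partition is an assumption of this example, not notation from the talk):

```python
import numpy as np

def regressogram(t, y, edges):
    """Least-squares fit over a partition of [0, 1]: on each interval
    I_lambda the estimate is the mean of the Y_i with t_i in I_lambda."""
    labels = np.searchsorted(edges, t, side="right")  # piece index of each t_i
    fitted = np.empty_like(y, dtype=float)
    for lam in np.unique(labels):
        in_piece = labels == lam
        fitted[in_piece] = y[in_piece].mean()         # beta_hat_lambda
    return fitted

t = np.linspace(0, 1, 8, endpoint=False)
y = np.array([0., 0., 1., 1., 1., 1., 2., 2.])
print(regressogram(t, y, edges=np.array([0.25, 0.75])))
# -> [0. 0. 1. 1. 1. 1. 2. 2.]
```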

Empirical Risk Minimization (ERM)

Assumption: the number $D - 1$ of breakpoints is known.

Question: find the locations of the $D - 1$ breakpoints ($D$ is given).

Strategy: the best segmentation in $D$ pieces is obtained by applying the ERM algorithm over $\{ S_m \mid D_m = D \}$:
$\widehat{m}_{\mathrm{ERM}}(D) = \operatorname{argmin}_{m \,\mid\, D_m = D} P_n \gamma(\widehat{s}_m).$
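The slides do not spell out how the minimizer over all $D$-piece segmentations is computed; a standard option (an assumption of this sketch, not claimed by the talk) is dynamic programming on cumulative sums:

```python
import numpy as np

def best_segmentation(y, D):
    """ERM over all segmentations of y into D pieces, via the classical
    dynamic-programming recursion on within-segment sums of squares."""
    n = len(y)
    csum = np.concatenate([[0.0], np.cumsum(y)])
    csum2 = np.concatenate([[0.0], np.cumsum(y ** 2)])

    def cost(a, b):  # residual sum of squares of y[a:b] around its mean
        m = b - a
        return csum2[b] - csum2[a] - (csum[b] - csum[a]) ** 2 / m

    INF = float("inf")
    # dp[d][b]: best cost of segmenting y[:b] into d pieces
    dp = [[INF] * (n + 1) for _ in range(D + 1)]
    arg = [[0] * (n + 1) for _ in range(D + 1)]
    dp[0][0] = 0.0
    for d in range(1, D + 1):
        for b in range(d, n + 1):
            for a in range(d - 1, b):
                c = dp[d - 1][a] + cost(a, b)
                if c < dp[d][b]:
                    dp[d][b], arg[d][b] = c, a

    bps, b = [], n          # backtrack the breakpoint indices
    for d in range(D, 0, -1):
        b = arg[d][b]
        bps.append(b)
    return sorted(bps)[1:]  # drop the leading 0

y = np.array([0., 0., 0., 5., 5., 5.])
print(best_segmentation(y, 2))  # -> [3]: the jump between indices 2 and 3
```

The cubic-time triple loop is the textbook version; pruned or segment-neighbourhood variants are used in practice for large $n$.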

ERM segmentation: Homoscedastic

[Figure: homoscedastic data with the oracle and ERM segmentations overlaid.]

$\longrightarrow$ ERM is close to the oracle.

Expectations

Homoscedastic:
$R(\widehat{s}_m) = \operatorname{dist}(s, S_m) + \sigma^2 \frac{D_m}{n} + \text{const},$
$E[P_n \gamma(\widehat{s}_m)] = \operatorname{dist}(s, S_m) - \sigma^2 \frac{D_m}{n} + \text{const}.$

Conclusions:
1. The variance term $\sigma^2 D_m / n$ does not matter,
2. The models $S_m$ are distinguished only according to $\operatorname{dist}(s, S_m)$.

ERM segmentation: Heteroscedastic

[Figure: heteroscedastic data with the oracle and ERM segmentations overlaid.]

$\longrightarrow$ ERM overfits in noisy regions.

ERM overfitting: Expectations

Heteroscedastic:
$R(\widehat{s}_m) = \operatorname{dist}(s, S_m) + \frac{1}{n} \sum_\lambda (\sigma^r_\lambda)^2 + \text{const},$
$E[P_n \gamma(\widehat{s}_m)] = \operatorname{dist}(s, S_m) - \frac{1}{n} \sum_\lambda (\sigma^r_\lambda)^2 + \text{const},$
with $(\sigma^r_\lambda)^2 := \frac{1}{n_\lambda} \sum_{i=1}^{n} \sigma_i^2 1_{I_\lambda}(t_i)$ and $n_\lambda := \operatorname{Card}(\{ i \mid t_i \in I_\lambda \})$.

Conclusions:
1. The variance term differs across models $S_m$ of the same dimension $D$,
2. ERM tends to put breakpoints in the noisy regions.

Cross-validation principle

[Figure: a sample repeatedly split into training and validation sets.]

Cross-validation

Leave-$p$-out (Lpo), $1 \le p \le n - 1$:
$\widehat{R}_p(\widehat{s}_m) = \binom{n}{p}^{-1} \sum_{D^{(t)} \in \mathcal{E}_p} \frac{1}{p} \sum_{Z_i \in D^{(v)}} \left( \widehat{s}_m^{\,D^{(t)}}(X_i) - Y_i \right)^2,$
where $\mathcal{E}_p = \{ D^{(t)} \subset \{ Z_1, \ldots, Z_n \} \mid \operatorname{Card} D^{(t)} = n - p \}$ and $D^{(v)}$ denotes the corresponding validation set.

Algorithmic complexity: exponential.

Theorem (C. Ph.D. (2008))
$\widehat{R}_p(\widehat{s}_m) = \sum_{\lambda \in \Lambda(m)} \left[ S_{\lambda,2}\, A_\lambda + \left( S_{\lambda,1}^2 - S_{\lambda,2} \right) B_\lambda \right],$
where $S_{\lambda,1} := \sum_{i=1}^{n} Y_i 1_{I_\lambda}(t_i)$, $S_{\lambda,2} := \sum_{i=1}^{n} Y_i^2 1_{I_\lambda}(t_i)$, and $A_\lambda$, $B_\lambda$ are known functions.

Algorithmic complexity: $O(n)$.
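The closed-form idea can be checked on the simplest case, a single cell with $p = 1$: the leave-one-out risk of the cell mean depends on the data only through the sums $S_{\lambda,1}$ and $S_{\lambda,2}$. (The general $A_\lambda$, $B_\lambda$ weights from the theorem are not reproduced here.)

```python
import numpy as np

y = np.array([1.0, 2.0, 4.0, 5.0])   # observations falling in one cell
n = len(y)

# Exhaustive Loo: refit the cell mean without observation i, predict Y_i.
loo_exhaustive = np.mean(
    [(np.delete(y, i).mean() - y[i]) ** 2 for i in range(n)]
)

# Same quantity from the cell sums S1 = sum Y_i, S2 = sum Y_i^2 alone:
# the held-out mean is (S1 - Y_i) / (n - 1), so no refitting is needed.
S1 = y.sum()
loo_closed = np.mean(((S1 - y) / (n - 1) - y) ** 2)

assert np.isclose(loo_exhaustive, loo_closed)
```

The same sufficiency of $(S_{\lambda,1}, S_{\lambda,2})$ per cell is what makes the general Lpo formula computable in $O(n)$ instead of enumerating the $\binom{n}{p}$ splits.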

Applicability of cross-validation

Lpo-based model selection procedure:
1. Lpo is computationally tractable:
   C. and Robin (2008), CSDA: density;
   C. and Robin (2008), arXiv: multiple testing;
   C. Ph.D. (2008), TEL: density, regression;
   C. (2009), arXiv: density.
2. As computationally expensive as ERM.

Lpo segmentation of dimension $D$: for every $1 \le p \le n - 1$,
$\widehat{m}_p(D) = \operatorname{argmin}_{m \,\mid\, D_m = D} \widehat{R}_p(\widehat{s}_m).$

Taking variance into account: Lpo expectation

Theorem (C. Ph.D. (2008))

Homoscedastic:
$E\big[\widehat{R}_p(\widehat{s}_m)\big] \approx \operatorname{dist}(s, S_m) + \sigma^2 \frac{D_m}{n - p} + \sigma^2.$

Heteroscedastic:
$E\big[\widehat{R}_p(\widehat{s}_m)\big] \approx \operatorname{dist}(s, S_m) + \frac{1}{n - p} \sum_\lambda (\sigma^r_\lambda)^2 + \text{const},$
to be compared with the risk
$R(\widehat{s}_m) = \operatorname{dist}(s, S_m) + \frac{1}{n} \sum_\lambda (\sigma^r_\lambda)^2 + \text{const}.$

Leave-one-out (Lpo with $p = 1$): An alternative to ERM

Strategy: replace ERM by leave-one-out (Loo) to take the variance into account.

Loo algorithm:
$\widehat{m}_1(D) = \operatorname{argmin}_{m \,\mid\, D_m = D} \widehat{R}_1(\widehat{s}_m).$

Conclusion: Loo prevents overfitting.

[Figures: oracle vs. ERM segmentation, and oracle vs. Loo vs. ERM segmentation.]
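A minimal sketch of this Loo selection step over a few candidate partitions (the exhaustive `loo_risk` helper and the candidate list are illustrative assumptions, not the talk's $O(n)$ closed form):

```python
import numpy as np

def loo_risk(t, y, edges):
    """Exhaustive leave-one-out risk of the regressogram on a partition."""
    labels = np.searchsorted(edges, t, side="right")
    errs = []
    for i in range(len(y)):
        mask = labels == labels[i]
        mask[i] = False                 # refit the cell mean without (t_i, Y_i)
        if not mask.any():              # singleton cell: prediction undefined
            continue
        errs.append((y[mask].mean() - y[i]) ** 2)
    return np.mean(errs)

t = np.linspace(0, 1, 10, endpoint=False)
y = np.array([0., 0., 0., 0., 0., 3., 3., 3., 3., 3.]) + 0.01 * np.arange(10)

# Three candidate one-breakpoint partitions; Loo picks the one at the jump.
candidates = [np.array([0.3]), np.array([0.5]), np.array([0.7])]
best = min(candidates, key=lambda e: loo_risk(t, y, e))
print(best)  # -> [0.5]
```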

Quality of the segmentations w.r.t. $D$

[Figures: average loss vs. number of breakpoints over N = 300 trials, ERM vs. Loo, in the homoscedastic and the heteroscedastic setting.]

Quality of the best segmentation

s     σ       ERM           Loo
s2    c       2.88 ± 0.01   2.93 ± 0.01
s2    pc,1    1.31 ± 0.02   1.16 ± 0.02
s2    pc,3    3.09 ± 0.03   2.52 ± 0.03
s3    c       3.18 ± 0.01   3.25 ± 0.01
s3    pc,1    3.00 ± 0.01   2.67 ± 0.02
s3    pc,3    4.41 ± 0.02   3.97 ± 0.02

Table: average of $E\big[ \inf_D \| s - \widehat{s}_{A(D)} \|^2 \big] \,/\, E\big[ \inf_m \| s - \widehat{s}_m \|^2 \big]$ over 10 000 samples. $A$ denotes either ERM or Loo.

$\longrightarrow$ Same results when $D$ is chosen by VFCV.


Summary

1. Lpo takes the variance into account:
   $\longrightarrow$ outperforms ERM (heteroscedastic),
   $\longrightarrow$ close to ERM (homoscedastic).
2. Lpo is fully tractable (closed-form expressions):
   $\longrightarrow$ as computationally expensive as ERM.
3. Similar results when $D$ is chosen by $V$-fold cross-validation.

Conclusion: cross-validation is a robust (to heteroscedasticity) and reliable alternative to ERM.
$\longrightarrow$ Arlot and C. (2009), arXiv

The Bt474 cell lines

These are epithelial cells, obtained from human breast cancer tumors. A test genome is compared to a reference male genome. We only consider chromosomes 1 and 9.

Results: Chromosome 9

[Figures: segmentation of the chromosome 9 CGH profile by the homoscedastic model (Picard et al. (05)) and by Loo + VFCV.]

Results: Chromosome 1

[Figures: segmentation of the chromosome 1 CGH profile by the homoscedastic and heteroscedastic models (Picard et al. (05)) and by Loo + VFCV.]

Prospects

1. Optimality results for segmentation procedures.
2. Other resampling schemes (bootstrap, Rademacher penalties, ...).
3. Extension to the multivariate setting (Detect ANR project):
   Biology: multi-patient CGH profile segmentation.
   Computer vision: video segmentation.

Thank you.


Keywords: change-point detection, model selection, cross-validation, heteroscedastic data, CGH profile segmentation.