Outline: Statistical framework · Empirical Risk Minimization · Cross-validation · Results · Conclusion
Segmentation of the mean of heteroscedastic data via cross-validation
Alain Celisse
1. UMR 8524 CNRS - Université Lille 1
2. SSB Group, Paris
joint work with Sylvain Arlot
GDR Statistique et Santé
Paris, October 21, 2009
Illustration: Original signal
[Figure: original signal; x-axis Position t, y-axis Signal]
Illustration: Observed signal (discretized)
[Figure: discretized signal (n = 100 observations); x-axis Position t, y-axis Signal]
Illustration: Find breakpoints
[Figure: the same signal with unknown breakpoint locations marked "?"; x-axis Position t, y-axis Signal]
Illustration: True regression function
[Figure: signal and true regression function (legend: Signal, Reg. func.); x-axis Position t, y-axis Signal]
Statistical framework: Change-point detection
$(t_1, Y_1), \dots, (t_n, Y_n) \in [0,1] \times \mathcal{Y}$ independent, with $Y_i = s(t_i) + \sigma_i \varepsilon_i \in \mathcal{Y} = \mathbb{R}$

Instants $t_i$: deterministic (e.g. $t_i = i/n$).
$s$: piecewise constant.
Residuals: $\mathbb{E}[\varepsilon_i] = 0$ and $\mathbb{E}[\varepsilon_i^2] = 1$.
Noise level: $\sigma_i$ (heteroscedastic).
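The model above is straightforward to simulate. A minimal sketch, where the jump locations and the noise profile are illustrative choices, not the ones used in the talk:

```python
import numpy as np

# Simulate Y_i = s(t_i) + sigma_i * eps_i on the deterministic design t_i = i/n.
rng = np.random.default_rng(0)
n = 100
t = np.arange(1, n + 1) / n

def s(x):
    # hypothetical piecewise-constant regression function with three jumps
    return np.select([x < 0.25, x < 0.5, x < 0.75], [0.0, 1.0, -0.5], 0.5)

sigma = np.where(t < 0.5, 0.1, 0.4)   # heteroscedastic: noisier right half
eps = rng.standard_normal(n)          # E[eps_i] = 0, E[eps_i^2] = 1
Y = s(t) + sigma * eps
```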
Estimation versus Identification
Purpose:
Estimate s to recover most of the important jumps w.r.t. the noise level −→ Estimation purpose.
[Figure: zoom on positions 55-100 of the signal Y and the regression function s]
Strategy:
1. Use piecewise constant functions.
2. Adopt the model selection point of view.
Model selection
Models:
$(I_\lambda)_{\lambda \in \Lambda_m}$: partition of $[0,1]$.
$S_m$: linear space of piecewise constant functions on $(I_\lambda)_{\lambda \in \Lambda_m}$.

Strategy:
$(S_m)_{m \in \mathcal{M}_n} \longrightarrow (\widehat{s}_m)_{m \in \mathcal{M}_n} \longrightarrow \widehat{s}_{\widehat{m}}$ ???

Goal:
Oracle inequality (in expectation, or with large probability):
$\|s - \widehat{s}_{\widehat{m}}\|^2 \le C \inf_{m \in \mathcal{M}_n} \left\{ \|s - \widehat{s}_m\|^2 + R(m, n) \right\}$
Least-squares estimator
Empirical risk minimizer over $S_m$ (= model):
$\widehat{s}_m \in \operatorname{argmin}_{u \in S_m} P_n \gamma(u) = \operatorname{argmin}_{u \in S_m} \frac{1}{n} \sum_{i=1}^n (u(t_i) - Y_i)^2.$

Regressogram:
$\widehat{s}_m = \sum_{\lambda \in \Lambda_m} \widehat{\beta}_\lambda \mathbf{1}_{I_\lambda}, \qquad \widehat{\beta}_\lambda = \frac{1}{\operatorname{Card}\{t_i \in I_\lambda\}} \sum_{t_i \in I_\lambda} Y_i.$

Oracle:
$m^* := \operatorname{argmin}_{m \in \mathcal{M}_n} \|s - \widehat{s}_m\|^2.$
$\longrightarrow \widehat{s}_{m^*}$: best estimator among $\{\widehat{s}_m \mid m \in \mathcal{M}_n\}$.
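The regressogram is just a per-segment average. A minimal sketch, with an illustrative partition and data:

```python
import numpy as np

# On each interval I_lambda of the partition, the least-squares fit is the
# mean of the Y_i whose t_i falls in I_lambda.
def regressogram(t, Y, breakpoints):
    """Fitted values s_hat_m(t_i); `breakpoints` are interior endpoints in (0, 1)."""
    edges = np.concatenate(([0.0], np.asarray(breakpoints, dtype=float), [1.0]))
    fitted = np.empty(len(Y), dtype=float)
    for a, b in zip(edges[:-1], edges[1:]):
        mask = (t >= a) & ((t < b) if b < 1.0 else (t <= b))
        fitted[mask] = Y[mask].mean()            # beta_hat_lambda on I_lambda
    return fitted

t = np.arange(1, 11) / 10
Y = np.array([0.0] * 5 + [1.0] * 5)
fitted = regressogram(t, Y, [0.55])
print(fitted)   # each observation is replaced by its segment mean
```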
Empirical Risk Minimization (ERM)
Assumption:
The number D − 1 of breakpoints is known.
Question:
Find the locations of the D − 1 breakpoints (D is given).
Strategy:
The best segmentation in $D$ pieces is obtained by applying the ERM algorithm over $S_D = \bigcup_{m : D_m = D} S_m$.

ERM algorithm:
$\widehat{m}_{\mathrm{ERM}}(D) = \operatorname{argmin}_{m : D_m = D} P_n \gamma(\widehat{s}_m).$
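This argmin over all segmentations in D pieces can be computed exactly by dynamic programming, a standard change-point technique (the talk does not specify the algorithm; this is one common implementation choice). A sketch with illustrative data:

```python
import numpy as np

def best_segmentation(Y, D):
    """Least-squares segmentation of Y into D pieces: (cost, interior breakpoints)."""
    n = len(Y)
    c1 = np.concatenate(([0.0], np.cumsum(Y, dtype=float)))
    c2 = np.concatenate(([0.0], np.cumsum(np.asarray(Y, dtype=float) ** 2)))

    def seg_cost(i, j):  # residual sum of squares of Y[i:j] around its mean
        s1 = c1[j] - c1[i]
        return (c2[j] - c2[i]) - s1 * s1 / (j - i)

    cost = np.full((D + 1, n + 1), np.inf)   # cost[d, j]: best d-piece fit of Y[:j]
    argm = np.zeros((D + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for d in range(1, D + 1):
        for j in range(d, n + 1):
            for i in range(d - 1, j):
                c = cost[d - 1, i] + seg_cost(i, j)
                if c < cost[d, j]:
                    cost[d, j], argm[d, j] = c, i
    bps, j = [], n                           # backtrack the optimal breakpoints
    for d in range(D, 1, -1):
        j = argm[d, j]
        bps.append(j)
    return cost[D, n], sorted(bps)

print(best_segmentation(np.array([0.0, 0.1, -0.1, 2.0, 2.1, 1.9]), 2))
```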
ERM segmentation: Homoscedastic
[Figure: segmentation (homoscedastic): data Y_i, signal, oracle and ERM segmentations; x-axis Position t, y-axis Signal]
−→ ERM is close to the oracle
Expectations
Homoscedastic:
$R(\widehat{s}_m) = \operatorname{dist}(s, S_m) + \sigma^2 \frac{D_m}{n} + \text{cste}, \qquad \mathbb{E}[P_n \gamma(\widehat{s}_m)] = \operatorname{dist}(s, S_m) - \sigma^2 \frac{D_m}{n} + \text{cste}.$

Conclusions:
1. The variance term $\sigma^2 D_m / n$ does not matter,
2. The $S_m$ are only distinguished according to $\operatorname{dist}(s, S_m)$.
ERM segmentation: Heteroscedastic
[Figure: segmentation (heteroscedastic): data Y_i, signal, oracle and ERM segmentations; x-axis Position t, y-axis Signal]
−→ ERM overfits in noisy regions
ERM overfitting: Expectations
Heteroscedastic:
$R(\widehat{s}_m) = \operatorname{dist}(s, S_m) + \frac{1}{n} \sum_\lambda (\sigma^r_\lambda)^2 + \text{cste},$
$\mathbb{E}[P_n \gamma(\widehat{s}_m)] = \operatorname{dist}(s, S_m) - \frac{1}{n} \sum_\lambda (\sigma^r_\lambda)^2 + \text{cste},$
with $(\sigma^r_\lambda)^2 := \frac{1}{n_\lambda} \sum_{i=1}^n \sigma_i^2 \mathbf{1}_{I_\lambda}(t_i)$ and $n_\lambda := \operatorname{Card}(\{i \mid t_i \in I_\lambda\}).$

Conclusions:
1. The variance term is different across models $S_m$ of the same dimension $D$,
2. ERM rather puts breakpoints in the noise.
Cross-validation principle
[Figure: illustration of the cross-validation principle (repeated training/validation splits of the data)]
Cross-validation
Leave-$p$-out (Lpo): $\forall\, 1 \le p \le n - 1$,
$\widehat{R}_p(\widehat{s}_m) = \binom{n}{p}^{-1} \sum_{D^{(t)} \in \mathcal{E}_p} \frac{1}{p} \sum_{Z_i \in D^{(v)}} \left( \widehat{s}_m^{\,D^{(t)}}(t_i) - Y_i \right)^2,$
where $\mathcal{E}_p = \left\{ D^{(t)} \subset \{Z_1, \dots, Z_n\} \mid \operatorname{Card} D^{(t)} = n - p \right\}$ and $D^{(v)} := \{Z_1, \dots, Z_n\} \setminus D^{(t)}$.
Algorithmic complexity: exponential.

Theorem (C. Ph.D. (2008))
$\widehat{R}_p(\widehat{s}_m) = \sum_{\lambda \in \Lambda(m)} \left( S_{\lambda,2}\, A_\lambda + \big(S_{\lambda,1}^2 - S_{\lambda,2}\big)\, B_\lambda \right),$
where $S_{\lambda,1} := \sum_{i=1}^n Y_i \mathbf{1}_{I_\lambda}(t_i)$, $S_{\lambda,2} := \sum_{i=1}^n Y_i^2 \mathbf{1}_{I_\lambda}(t_i)$, and $A_\lambda$, $B_\lambda$ are known functions.
Algorithmic complexity: $O(n)$.
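The coefficients $A_\lambda$, $B_\lambda$ are not reproduced here, but for $p = 1$ the tractability is easy to verify directly: on a regressogram, removing one point shifts its segment mean in closed form, so no refitting is needed. A hedged sketch with illustrative data and segments (standard leave-one-out algebra, not the talk's general formula):

```python
import numpy as np

def loo_risk_closed_form(Y, segments):
    total = 0.0
    for lo, hi in segments:                      # each segment needs >= 2 points
        y = Y[lo:hi]
        n_lam, b = len(y), y.mean()
        # mean without Y_i is (n_lam*b - Y_i)/(n_lam - 1), hence the held-out
        # prediction error is (n_lam / (n_lam - 1)) * (b - Y_i)
        total += np.sum((n_lam / (n_lam - 1) * (b - y)) ** 2)
    return total / len(Y)

def loo_risk_naive(Y, segments):
    total = 0.0
    for lo, hi in segments:
        y = Y[lo:hi]
        for i in range(len(y)):
            total += (np.delete(y, i).mean() - y[i]) ** 2   # refit without i
    return total / len(Y)

Y = np.array([0.0, 0.2, -0.1, 1.0, 1.3, 0.9, 1.1])
segs = [(0, 3), (3, 7)]
print(loo_risk_closed_form(Y, segs), loo_risk_naive(Y, segs))  # the two agree
```

The theorem above generalizes this idea to every $p \le n - 1$ without enumerating splits.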
Applicability of cross-validation
Lpo-based model selection procedure:
1. Lpo is computationally tractable:
   C. and Robin (2008), CSDA: density;
   C. and Robin (2008), arXiv: multiple testing;
   C. Ph.D. (2008), TEL: density, regression;
   C. (2009), arXiv: density.
2. As computationally expensive as ERM.

Lpo segmentation of dimension $D$: for every $1 \le p \le n - 1$,
$\widehat{m}_p(D) = \operatorname{argmin}_{m : D_m = D} \widehat{R}_p(\widehat{s}_m).$
Taking variance into account: Lpo expectation
Theorem (C. Ph.D. (2008))
Homoscedastic:
$\mathbb{E}\big[\widehat{R}_p(\widehat{s}_m)\big] \approx \operatorname{dist}(s, S_m) + \sigma^2 \frac{D_m}{n - p} + \sigma^2.$
Heteroscedastic:
$\mathbb{E}\big[\widehat{R}_p(\widehat{s}_m)\big] \approx \operatorname{dist}(s, S_m) + \frac{1}{n - p} \sum_\lambda (\sigma^r_\lambda)^2 + \text{cste}.$
Recall: $R(\widehat{s}_m) = \operatorname{dist}(s, S_m) + \frac{1}{n} \sum_\lambda (\sigma^r_\lambda)^2 + \text{cste}.$
Leave-one-out (Lpo with p = 1): An alternative to ERM
Strategy:
Replace ERM by leave-one-out (Loo) to take variance into account.
Loo algorithm:
$\widehat{m}_1(D) = \operatorname{argmin}_{m : D_m = D} \widehat{R}_1(\widehat{s}_m).$

Conclusion:
Loo prevents overfitting.
[Figure: two panels comparing segmentations: oracle vs. ERM (left), oracle vs. Loo vs. ERM (right)]
Quality of the segmentations w.r.t. D
[Figure: average loss vs. number of breakpoints for ERM and Loo; homoscedastic (left) and heteroscedastic (right) settings, N = 300 trials]
Quality of the best segmentation
Table: average of $\mathbb{E}\big[\inf_D \|s - \widehat{s}_{\widehat{A}(D)}\|^2\big] / \mathbb{E}\big[\inf_m \|s - \widehat{s}_m\|^2\big]$ over 10 000 samples; $A$ denotes either ERM or Loo.

    s     sigma   ERM           Loo
    s2    c       2.88 ± 0.01   2.93 ± 0.01
          pc,1    1.31 ± 0.02   1.16 ± 0.02
          pc,3    3.09 ± 0.03   2.52 ± 0.03
    s3    c       3.18 ± 0.01   3.25 ± 0.01
          pc,1    3.00 ± 0.01   2.67 ± 0.02
          pc,3    4.41 ± 0.02   3.97 ± 0.02

−→ Same results when D is chosen by VFCV.
Summary
1. Lpo takes variance into account:
   −→ outperforms ERM (heteroscedastic),
   −→ close to ERM (homoscedastic).
2. Lpo is fully tractable (closed-form expressions):
   −→ as computationally expensive as ERM.
3. Similar results when D is chosen by V-fold cross-validation.
Conclusion:
Cross-validation is a robust (to heteroscedasticity) and reliable alternative to ERM.
−→ Arlot and C. (2009), arXiv
The BT474 cell lines
These are epithelial cells
Obtained from human breast cancer tumors
A test genome is compared to a reference male genome
We only consider chromosomes 1 and 9
Results: Chromosome 9
Homoscedastic model (Picard et al. (05))
LOO + VFCV
Results: Chromosome 1
Homoscedastic model (Picard et al. (05))
Heteroscedastic model (Picard et al. (05))
LOO + VFCV
Prospects
1. Optimality results for segmentation procedures.
2. Other resampling schemes (bootstrap, Rademacher penalties, ...).
3. Extension to the multivariate setting: the Detect ANR project.
   Biology: multi-patient CGH profile segmentation.
   Computer vision: video segmentation.
Thank you.