Segmentation in the mean of heteroscedastic data via resampling
Alain Celisse
1. AgroParisTech, Paris
2. University Paris-Sud 11, Orsay
3. SSB Group, Paris
Change-point detection methods and Applications, AgroParisTech, Paris, September 11, 2008
Joint work with Sylvain Arlot
Statistical framework: regression on a fixed design
$(t_1, Y_1), \ldots, (t_n, Y_n) \in [0,1] \times \mathcal{Y}$ independent, with
$$Y_i = s(t_i) + \sigma_i \varepsilon_i \in \mathcal{Y} = [0,1] \text{ or } \mathbb{R}$$
Instants $t_i$: deterministic ($t_i = i/n$)
Noise: $\mathbb{E}[\varepsilon_i] = 0$ and $\mathbb{E}[\varepsilon_i^2] = 1$
Noise level: $\sigma_i$ (heteroscedastic)
Goal: estimate s
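To fix ideas, here is a minimal simulation sketch of this framework in Python; the particular signal $s$, noise profile $\sigma$, and sample size $n$ are illustrative assumptions, not the simulation design used in the talk.

```python
# A minimal simulation sketch of the framework above (assumed signal,
# noise profile and sample size; illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n = 200
t = np.arange(1, n + 1) / n                      # fixed design t_i = i/n

def s(x):
    # hypothetical piecewise-constant signal with a few high jumps
    return np.where(x < 0.3, 0.0, np.where(x < 0.6, 1.5, -1.0))

sigma = np.where(t < 0.5, 0.2, 0.8)              # heteroscedastic noise level sigma_i
eps = rng.standard_normal(n)                     # E[eps_i] = 0, E[eps_i^2] = 1
Y = s(t) + sigma * eps                           # Y_i = s(t_i) + sigma_i * eps_i
```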
Change-point detection framework
$$Y_i = s(t_i) + \sigma_i \varepsilon_i$$
$s$: piecewise constant with high jumps; heteroscedastic noise
Purpose and strategy:
Estimate $s$ so as to recover most of the jumps that are significant w.r.t. the noise level.
We choose our estimator among piecewise constant functions.
Estimation vs. Selection
A change-point in a noisy region: we do not systematically want to recover it.
Use the quadratic risk: detected change-points are meaningful.
Loss function, least-squares risk and contrast
Loss function:
$$\ell(s, u) = \|s - u\|_n^2 := \frac{1}{n} \sum_{i=1}^{n} \left( s(t_i) - u(t_i) \right)^2$$
Least-squares risk of an estimator $\widehat{s}$:
$$R_n(\widehat{s}) := \mathbb{E}\left[ \ell(s, \widehat{s}) \right]$$
Empirical risk:
$$P_n \gamma(u) := \frac{1}{n} \sum_{i=1}^{n} \gamma\left( u, (t_i, Y_i) \right), \quad \text{with } \gamma(u, (x, y)) = (u(x) - y)^2$$
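These quantities translate directly into code; a minimal sketch (the arrays `s_vals`, `u_vals`, `Y` follow the conventions of the simulation sketch above):

```python
import numpy as np

def quadratic_loss(s_vals, u_vals):
    """Loss l(s, u) = ||s - u||_n^2 = (1/n) * sum_i (s(t_i) - u(t_i))^2."""
    return float(np.mean((s_vals - u_vals) ** 2))

def empirical_risk(u_vals, Y):
    """Empirical risk P_n gamma(u) = (1/n) * sum_i (u(t_i) - Y_i)^2."""
    return float(np.mean((u_vals - Y) ** 2))
```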
Least-squares estimator
$(I_\lambda)_{\lambda \in \Lambda_m}$: partition of $[0, 1]$
$S_m$: linear space of piecewise constant functions on $(I_\lambda)_{\lambda \in \Lambda_m}$
Empirical risk minimizer over $S_m$ (= model):
$$\widehat{s}_m \in \arg\min_{u \in S_m} P_n \gamma(u, \cdot) = \arg\min_{u \in S_m} \frac{1}{n} \sum_{i=1}^{n} (u(t_i) - Y_i)^2.$$
Regressogram:
$$\widehat{s}_m = \sum_{\lambda \in \Lambda_m} \widehat{\beta}_\lambda \mathbf{1}_{I_\lambda}, \qquad \widehat{\beta}_\lambda = \frac{1}{\operatorname{Card}\{ t_i \in I_\lambda \}} \sum_{t_i \in I_\lambda} Y_i.$$
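A minimal sketch of the regressogram, assuming the partition is encoded by index breakpoints ($0 = \texttt{bps}[0] < \cdots < \texttt{bps}[-1] = n$); this encoding is a choice of the sketch, not of the talk:

```python
import numpy as np

def regressogram(Y, bps):
    """Least-squares estimator over the partition encoded by the index
    breakpoints 0 = bps[0] < ... < bps[-1] = n: on each interval
    I_lambda the fit is beta_hat_lambda, the mean of the Y_i it contains."""
    fitted = np.empty(len(Y))
    for lo, hi in zip(bps[:-1], bps[1:]):
        fitted[lo:hi] = Y[lo:hi].mean()          # beta_hat_lambda
    return fitted

# e.g. a 3-piece fit on the simulated data:
# fitted = regressogram(Y, [0, 60, 120, len(Y)])
```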
Data $(t_1, Y_1), \ldots, (t_n, Y_n)$
Goal: reconstruct the signal
How many breakpoints?
[Figure: candidate fits with $D = 1$ vs. $D = 3$]
The oracle
Model selection
$$(S_m)_{m \in \mathcal{M}_n} \longrightarrow (\widehat{s}_m)_{m \in \mathcal{M}_n} \longrightarrow \widehat{s}_{\widehat{m}} \;???$$
Goals:
Oracle inequality (in expectation, or with a large probability):
$$\ell(s, \widehat{s}_{\widehat{m}}) \le C \inf_{m \in \mathcal{M}_n} \left\{ \ell(s, \widehat{s}_m) + R(m, n) \right\}$$
Adaptivity (provided $(S_m)_{m \in \mathcal{M}_n}$ is well chosen)
Collection of models
For any $m \in \mathcal{M}_n$, $(I_\lambda)_{\lambda \in \Lambda_m}$ denotes a partition of $[0, 1]$ such that $I_\lambda = [t_{i_k}, t_{i_{k+1}})$
$S_m$: linear space of piecewise constant functions on $(I_\lambda)_{\lambda \in \Lambda_m}$, with $\dim(S_m) = D_m$
$$\forall\, 1 \le D \le n - 1, \quad \operatorname{Card}\{ m \in \mathcal{M}_n \mid D_m = D \} = \binom{n-1}{D-1}$$
The collection complexity idea
Usual approach: bias-variance tradeoff via Mallows' $C_p$:
$$C_p(m) = P_n \gamma(\widehat{s}_m) + 2\sigma^2 \frac{D_m}{n}$$
This approach is useless here:
Mallows' $C_p$ overfits (see figure)
The complexity of the collection is involved in this phenomenon
[Figure: $C_p$ criterion vs. dimension; oracle dimension: 5]
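For reference, a one-line sketch of the criterion, assuming the variance `sigma2` is known (in practice it must be estimated):

```python
import numpy as np

def mallows_cp(Y, fitted, D, sigma2):
    """C_p(m) = P_n gamma(s_hat_m) + 2 * sigma^2 * D_m / n."""
    n = len(Y)
    return float(np.mean((fitted - Y) ** 2) + 2.0 * sigma2 * D / n)
```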
Model collection complexity and overfitting
[Figure: four panels; regular partitions, and partitions with at least 15, 10, and 5 points per segment]
Birgé and Massart (2001): Homoscedastic
Algorithm 1:
$$\forall m, \; \widehat{s}_m = \arg\min_{u \in S_m} P_n \gamma(u), \qquad \widehat{m} = \arg\min_{m \in \mathcal{M}} \left\{ P_n \gamma(\widehat{s}_m) + \operatorname{pen}(D_m) \right\}$$
Algorithm 1': (= Algorithm 1)
$$\forall D, \; \widehat{s}_{\widehat{m}(D)} = \arg\min_{u \in \widetilde{S}_D} P_n \gamma(u), \qquad \widehat{D} = \arg\min_{D \in \mathcal{D}} \left\{ P_n \gamma(\widehat{s}_{\widehat{m}(D)}) + \operatorname{pen}(D) \right\}$$
where $\widetilde{S}_D = \bigcup_{m \mid D_m = D} S_m$
Complexity measure in the homoscedastic setup
[Figure: average estimated loss vs. dimension (Reg = 1, S = 2)]
Complexity of $S_m$: of order $D_m / n$
$\widetilde{S}_D = \bigcup_{m \mid D_m = D} S_m$: more complex than any $S_m$
Effective complexity measure: curves may be superimposed
Complexity of $\widetilde{S}_D$:
$$\operatorname{pen}(m) = c_1 \frac{D_m}{n} + c_2 \frac{D_m}{n} \log\left( \frac{n}{D_m} \right)$$
What about the heteroscedastic setting?
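A sketch of this penalty shape; the constants `c1` and `c2` below are placeholder values, since in practice they must be calibrated:

```python
import numpy as np

def bm_penalty(D, n, c1=2.0, c2=2.0):
    """Penalty of the form c1 * D/n + c2 * (D/n) * log(n/D).
    c1 and c2 are placeholder values: they must be calibrated in practice."""
    return c1 * D / n + c2 * (D / n) * np.log(n / D)
```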
Heteroscedastic data
[Figure: average estimated loss vs. dimension (Reg = 1, S = 5)]
Strategy: Resampling with rich collections
With NOT TOO RICH collections of models:
Homoscedastic: penalties like $c_1 D_m / n$ are reliable complexity measures
Heteroscedastic: resampling outperforms penalties $c_1 D_m / n$ (Arlot (2008))
With RICH collections of models:
Homoscedastic: $c_1 \frac{D_m}{n} + c_2 \frac{D_m}{n} \log\left( \frac{n}{D_m} \right)$ is a reliable complexity measure
Heteroscedastic: we will use resampling to measure the complexity
Cross-validation principle
The training set: Computation of the estimator
The test set: Quality assessment
V-fold cross-validation (VFCV)
Randomly split the data into $V$ disjoint subsets $(B_j)$ of size $p \approx n/V$.
For each $1 \le j \le V$:
$$\underbrace{(X_1, Y_1), \ldots, (X_{n-p}, Y_{n-p})}_{\text{Training set}}, \; \underbrace{(X_{n-p+1}, Y_{n-p+1}), \ldots, (X_n, Y_n)}_{\text{Test set } B_j}$$
$$\widehat{s}_m^{\,(-j)} = \arg\min_{u \in S_m} \left\{ \frac{1}{n-p} \sum_{i=1}^{n-p} \gamma(u, (X_i, Y_i)) \right\}$$
$$P_n^{(j)} = \frac{1}{p} \sum_{i=n-p+1}^{n} \delta_{(X_i, Y_i)} \longrightarrow P_n^{(j)} \gamma\left( \widehat{s}_m^{\,(-j)} \right)$$
$$\text{VFCV} \longrightarrow \widehat{m} \in \arg\min_{m \in \mathcal{M}_n} \left\{ \frac{1}{V} \sum_{j=1}^{V} P_n^{(j)} \gamma\left( \widehat{s}_m^{\,(-j)} \right) \right\}$$
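A minimal VFCV sketch for the regressogram, reusing the `Y` and `bps` conventions of the earlier sketches; it refits the interval means on the training points of each fold and assumes every interval keeps at least one training point per fold:

```python
import numpy as np

def vfcv_risk(Y, bps, V=5, seed=0):
    """V-fold CV risk estimate for the regressogram on the partition `bps`:
    on each fold, the interval means are refitted on the training points
    only, then evaluated on the held-out points (P_n^(j) gamma(...))."""
    n = len(Y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), V)
    risks = []
    for test in folds:
        train = np.ones(n, dtype=bool)
        train[test] = False                      # training set of size n - p
        errs = []
        for lo, hi in zip(bps[:-1], bps[1:]):
            tr = train[lo:hi]
            beta = Y[lo:hi][tr].mean()           # fitted on training points only
            errs.append((Y[lo:hi][~tr] - beta) ** 2)
        risks.append(np.mean(np.concatenate(errs)))
    return float(np.mean(risks))                 # (1/V) * sum_j P_n^(j) gamma(...)
```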
BM, VFCV: choosing the number of breakpoints
First step:
$$\forall D, \quad \widehat{m}(D) = \operatorname{Argmin}_{m \mid D_m = D} P_n \gamma(\widehat{s}_m)$$
Second step:
1. BM (homoscedastic):
$$P_n \gamma(\widehat{s}_{\widehat{m}(D)}) + \operatorname{pen}(D) \approx \mathbb{E}\left[ \left\| s - \widehat{s}_{\widehat{m}(D)} \right\|^2 \right]$$
2. VFCV (homoscedastic and heteroscedastic):
$$\frac{1}{V} \sum_{j=1}^{V} P_n^{(j)} \gamma\left( \widehat{s}_{\widehat{m}^{(-j)}(D)}^{\,(-j)} \right) \approx \mathbb{E}\left[ \left\| s - \widehat{s}_{\widehat{m}(D)} \right\|^2 \right]$$
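Since the least-squares cost is additive over intervals, the first-step minimizer $\widehat{m}(D)$ can be computed exactly for every $D$ at once by dynamic programming. A sketch (quadratic-time in $n$ for clarity, not tuned for speed):

```python
import numpy as np

def erm_cost(Y):
    """Additive ERM interval cost: residual sum of squares of Y[lo:hi]
    around its mean, computed from cumulative sums."""
    S = np.concatenate(([0.0], np.cumsum(Y)))
    S2 = np.concatenate(([0.0], np.cumsum(Y ** 2)))
    def sse(lo, hi):
        return S2[hi] - S2[lo] - (S[hi] - S[lo]) ** 2 / (hi - lo)
    return sse

def segment_dp(cost, n, Dmax):
    """Bellman recursion: for every D <= Dmax, the partition of {0,...,n}
    into D intervals minimizing the additive criterion sum of cost(lo, hi)."""
    best = np.full((Dmax + 1, n + 1), np.inf)
    arg = np.zeros((Dmax + 1, n + 1), dtype=int)
    for j in range(1, n + 1):
        best[1, j] = cost(0, j)
    for D in range(2, Dmax + 1):
        for j in range(D, n + 1):
            for i in range(D - 1, j):            # last change-point candidate
                c = best[D - 1, i] + cost(i, j)
                if c < best[D, j]:
                    best[D, j], arg[D, j] = c, i
    segs = {}
    for D in range(1, Dmax + 1):                 # backtrack the breakpoints
        bps, j = [n], n
        for d in range(D, 1, -1):
            j = arg[d, j]
            bps.append(j)
        segs[D] = [0] + bps[::-1]
    return segs

# m_hat(D) for all dimensions at once: segs = segment_dp(erm_cost(Y), len(Y), 20)
```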
BM, VFCV: performance (number of breakpoints)
[Figure, left: average estimated loss (Reg = 1, S = 2), homoscedastic; curves: Ideal, VFCV, BM]
[Figure, right: average estimated loss (Reg = 2, S = 5), heteroscedastic; curves: Ideal, VFCV, BM]

σ | Ideal       | VFCV        | BM          | C_p
2 | 2.19 ± 0.28 | 3.35 ± 0.35 | 2.96 ± 0.36 | 62.1 ± 7.1
3 | 4.09 ± 0.24 | 5.12 ± 0.3  | 4.84 ± 0.32 | 19.3 ± 1.2
5 | 4.04 ± 0.57 | 7.33 ± 0.9  | 38.5 ± 2.8  | 131 ± 14
6 | 2.13 ± 0.18 | 3.47 ± 0.28 | 5.1 ± 0.39  | 94.7 ± 7.7
Summary of main ideas
Resampling turns out to be reliable with rich collections.
Homoscedastic setup: resampling performs almost as well as BM.
Heteroscedastic setup: resampling strongly outperforms BM.
Overfitting of ERM
ERM only takes the fit to the data into account. Overfitting may occur when
$$\forall D, \quad \widehat{m}(D) = \operatorname{Argmin}_{m \mid D_m = D} P_n \gamma(\widehat{s}_m)$$
Overfitting is all the stronger as $\widetilde{S}_D$ is large. In our setting,
$$\operatorname{Card}\left( \{ m \mid D_m = D \} \right) = \binom{n-1}{D-1}$$
Idea: ERM may be replaced by resampling to provide $\{ \widehat{m}(D) \}_D$.
Algorithms
Goal: see whether resampling outperforms ERM.
Alternatives:
ERM:
$$\forall D, \quad \widehat{m}(D) = \arg\min_{m \mid D_m = D} P_n \gamma(\widehat{s}_m)$$
Leave-one-out (LOO):
$$\forall D, \quad \widehat{m}(D) = \arg\min_{m \mid D_m = D} \frac{1}{n} \sum_{i=1}^{n} P_n^{(i)} \gamma\left( \widehat{s}_m^{\,(-i)} \right)$$
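For regressograms the LOO criterion has a closed form: removing $Y_i$ from its interval changes the interval mean from $\widehat{\beta}_\lambda$ to $(n_\lambda \widehat{\beta}_\lambda - Y_i)/(n_\lambda - 1)$, so the LOO residual is $(Y_i - \widehat{\beta}_\lambda)\, n_\lambda / (n_\lambda - 1)$. A sketch, valid when every interval contains at least two points:

```python
import numpy as np

def loo_risk(Y, bps):
    """Exact LOO risk of the regressogram on the partition `bps`:
    (1/n) * sum_i (Y_i - beta_hat^(-i))^2, using the closed form
    (Y_i - beta_hat) * n_lam / (n_lam - 1) for the LOO residual.
    Every interval must contain at least two points."""
    errs = []
    for lo, hi in zip(bps[:-1], bps[1:]):
        m = hi - lo
        beta = Y[lo:hi].mean()
        errs.append(((Y[lo:hi] - beta) * m / (m - 1)) ** 2)
    return float(np.mean(np.concatenate(errs)))
```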
Quality of the segmentations: Homoscedastic
[Figure: average estimated loss vs. dimension (r = 3, b = 2); curves: ERM, LOO]
Quality of the segmentations: Heteroscedastic
[Figure, left: average estimated loss (r = 3, b = 12); curves: ERM, LOO]
[Figure, right: average estimated loss (r = 3, b = 7); curves: ERM, LOO]
Overfitting with ERM: dimensions 6 to 10
[Figures: ERM, LOO, and oracle segmentations for dimensions 6, 7, 8, 9, and 10]
Global algorithm description
1. LOO at the first step, to choose the best segmentation for each dimension:
$$\forall D, \quad \widehat{m}(D) = \arg\min_{m \mid D_m = D} \frac{1}{n} \sum_{i=1}^{n} P_n^{(i)} \gamma\left( \widehat{s}_m^{\,(-i)} \right)$$
2. VFCV to choose the number of breakpoints:
$$\widehat{D} = \arg\min_{D} \frac{1}{V} \sum_{j=1}^{V} P_n^{(j)} \gamma\left( \widehat{s}_{\widehat{m}^{(-j)}(D)}^{\,(-j)} \right)$$
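The LOO criterion above is also additive over intervals, so step 1 can reuse the same dynamic program as ERM with the LOO interval cost plugged in. The sketch below assumes the `segment_dp` and `vfcv_risk` helpers from the earlier sketches are in scope, and it simplifies step 2 by keeping $\widehat{m}(D)$ fixed across folds, whereas the procedure above recomputes the segmentation on each training set:

```python
import numpy as np

def loo_cost(Y):
    """Additive LOO interval cost: sum of squared closed-form LOO residuals
    over the interval (inf on singletons, where LOO is undefined)."""
    def cost(lo, hi):
        m = hi - lo
        if m < 2:
            return np.inf
        beta = Y[lo:hi].mean()
        return float(np.sum(((Y[lo:hi] - beta) * m / (m - 1)) ** 2))
    return cost

def segment_signal(Y, Dmax=20, V=5):
    """Two-step sketch: (1) best segmentation of each dimension D by LOO,
    via the segment_dp helper above; (2) dimension chosen by VFCV, via the
    vfcv_risk helper above (simplified: m_hat(D) is kept fixed across folds)."""
    segs = segment_dp(loo_cost(Y), len(Y), Dmax)     # step 1: m_hat(D) for all D
    crit = {D: vfcv_risk(Y, bps, V=V) for D, bps in segs.items()}
    D_hat = min(crit, key=crit.get)                  # step 2: D_hat by VFCV
    return segs[D_hat]
```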
The Bt474 cell lines
Epithelial cells obtained from human breast cancer tumors
A test genome is compared to a reference male genome
We only consider chromosomes 1 and 9
Results: Chromosome 9
Homoscedastic model (Picard et al. (2005))
LOO+VFCV
Results: Chromosome 1
Homoscedastic model (Picard et al. (2005))