Segmentation in the mean of heteroscedastic data via resampling
Alain Celisse
1. AgroParisTech, Paris
2. University Paris-Sud 11, Orsay
3. SSB Group, Paris
Change-point detection methods and Applications, AgroParisTech, Paris, September 11, 2008
Joint work with Sylvain Arlot
Statistical framework: regression on a fixed design
$(t_1, Y_1), \ldots, (t_n, Y_n) \in [0,1] \times \mathcal{Y}$ independent, with
$$Y_i = s(t_i) + \sigma_i \varepsilon_i \in \mathcal{Y} = [0,1] \text{ or } \mathbb{R}$$
Instants $t_i$: deterministic ($t_i = i/n$)
Noise: $\mathbb{E}[\varepsilon_i] = 0$ and $\mathbb{E}[\varepsilon_i^2] = 1$
Noise level: $\sigma_i$ (heteroscedastic)
Goal: estimate s
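To fix ideas, here is a minimal simulation sketch of this framework in Python; the particular signal $s$, noise profile $\sigma$, and sample size $n$ are illustrative assumptions, not the simulation design used in the talk.

```python
# A minimal simulation sketch of the framework above (assumed signal,
# noise profile and sample size; illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n = 200
t = np.arange(1, n + 1) / n                      # fixed design t_i = i/n

def s(x):
    # hypothetical piecewise-constant signal with a few high jumps
    return np.where(x < 0.3, 0.0, np.where(x < 0.6, 1.5, -1.0))

sigma = np.where(t < 0.5, 0.2, 0.8)              # heteroscedastic noise level sigma_i
eps = rng.standard_normal(n)                     # E[eps_i] = 0, E[eps_i^2] = 1
Y = s(t) + sigma * eps                           # Y_i = s(t_i) + sigma_i * eps_i
```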
Change-point detection framework
$$Y_i = s(t_i) + \sigma_i \varepsilon_i$$
$s$: piecewise constant with high jumps; heteroscedastic noise
Purpose and strategy:
Estimate $s$ so as to recover most of the jumps that are significant w.r.t. the noise level.
We choose our estimator among piecewise constant functions.
Estimation vs. Selection
A change-point in a noisy region: we do not systematically want to recover it.
Use the quadratic risk: detected change-points are meaningful.
Loss function, least-squares risk and contrast
Loss function:
$$\ell(s, u) = \|s - u\|_n^2 := \frac{1}{n} \sum_{i=1}^{n} \left( s(t_i) - u(t_i) \right)^2$$
Least-squares risk of an estimator $\widehat{s}$:
$$R_n(\widehat{s}) := \mathbb{E}\left[ \ell(s, \widehat{s}) \right]$$
Empirical risk:
$$P_n \gamma(u) := \frac{1}{n} \sum_{i=1}^{n} \gamma\left( u, (t_i, Y_i) \right), \quad \text{with } \gamma(u, (x, y)) = (u(x) - y)^2$$
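These quantities translate directly into code; a minimal sketch (the arrays `s_vals`, `u_vals`, `Y` follow the conventions of the simulation sketch above):

```python
import numpy as np

def quadratic_loss(s_vals, u_vals):
    """Loss l(s, u) = ||s - u||_n^2 = (1/n) * sum_i (s(t_i) - u(t_i))^2."""
    return float(np.mean((s_vals - u_vals) ** 2))

def empirical_risk(u_vals, Y):
    """Empirical risk P_n gamma(u) = (1/n) * sum_i (u(t_i) - Y_i)^2."""
    return float(np.mean((u_vals - Y) ** 2))
```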
Least-squares estimator
$(I_\lambda)_{\lambda \in \Lambda_m}$: partition of $[0, 1]$
$S_m$: linear space of piecewise constant functions on $(I_\lambda)_{\lambda \in \Lambda_m}$
Empirical risk minimizer over $S_m$ (= model):
$$\widehat{s}_m \in \arg\min_{u \in S_m} P_n \gamma(u, \cdot) = \arg\min_{u \in S_m} \frac{1}{n} \sum_{i=1}^{n} (u(t_i) - Y_i)^2.$$
Regressogram:
$$\widehat{s}_m = \sum_{\lambda \in \Lambda_m} \widehat{\beta}_\lambda \mathbf{1}_{I_\lambda}, \qquad \widehat{\beta}_\lambda = \frac{1}{\operatorname{Card}\{ t_i \in I_\lambda \}} \sum_{t_i \in I_\lambda} Y_i.$$
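A minimal sketch of the regressogram, assuming the partition is encoded by index breakpoints ($0 = \texttt{bps}[0] < \cdots < \texttt{bps}[-1] = n$); this encoding is a choice of the sketch, not of the talk:

```python
import numpy as np

def regressogram(Y, bps):
    """Least-squares estimator over the partition encoded by the index
    breakpoints 0 = bps[0] < ... < bps[-1] = n: on each interval
    I_lambda the fit is beta_hat_lambda, the mean of the Y_i it contains."""
    fitted = np.empty(len(Y))
    for lo, hi in zip(bps[:-1], bps[1:]):
        fitted[lo:hi] = Y[lo:hi].mean()          # beta_hat_lambda
    return fitted

# e.g. a 3-piece fit on the simulated data:
# fitted = regressogram(Y, [0, 60, 120, len(Y)])
```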
Data $(t_1, Y_1), \ldots, (t_n, Y_n)$
Goal: reconstruct the signal
How many breakpoints?
[Figure: candidate fits with $D = 1$ vs. $D = 3$]
The oracle
Model selection
$$(S_m)_{m \in \mathcal{M}_n} \longrightarrow (\widehat{s}_m)_{m \in \mathcal{M}_n} \longrightarrow \widehat{s}_{\widehat{m}} \;???$$
Goals:
Oracle inequality (in expectation, or with a large probability):
$$\ell(s, \widehat{s}_{\widehat{m}}) \le C \inf_{m \in \mathcal{M}_n} \left\{ \ell(s, \widehat{s}_m) + R(m, n) \right\}$$
Adaptivity (provided $(S_m)_{m \in \mathcal{M}_n}$ is well chosen)
Collection of models
For any $m \in \mathcal{M}_n$, $(I_\lambda)_{\lambda \in \Lambda_m}$ denotes a partition of $[0, 1]$ such that $I_\lambda = [t_{i_k}, t_{i_{k+1}})$
$S_m$: linear space of piecewise constant functions on $(I_\lambda)_{\lambda \in \Lambda_m}$, with $\dim(S_m) = D_m$
$$\forall\, 1 \le D \le n - 1, \quad \operatorname{Card}\{ m \in \mathcal{M}_n \mid D_m = D \} = \binom{n-1}{D-1}$$
The collection complexity idea
Usual approach: bias-variance tradeoff via Mallows' $C_p$:
$$C_p(m) = P_n \gamma(\widehat{s}_m) + 2\sigma^2 \frac{D_m}{n}$$
This approach is useless here:
Mallows' $C_p$ overfits (see figure)
The complexity of the collection is involved in this phenomenon
[Figure: $C_p$ criterion vs. dimension; oracle dimension: 5]
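For reference, a one-line sketch of the criterion, assuming the variance `sigma2` is known (in practice it must be estimated):

```python
import numpy as np

def mallows_cp(Y, fitted, D, sigma2):
    """C_p(m) = P_n gamma(s_hat_m) + 2 * sigma^2 * D_m / n."""
    n = len(Y)
    return float(np.mean((fitted - Y) ** 2) + 2.0 * sigma2 * D / n)
```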
Model collection complexity and overfitting
[Figure: four panels; regular partitions, and partitions with at least 15, 10, and 5 points per segment]
Birgé and Massart (2001): Homoscedastic
Algorithm 1:
$$\forall m, \; \widehat{s}_m = \arg\min_{u \in S_m} P_n \gamma(u), \qquad \widehat{m} = \arg\min_{m \in \mathcal{M}} \left\{ P_n \gamma(\widehat{s}_m) + \operatorname{pen}(D_m) \right\}$$
Algorithm 1': (= Algorithm 1)
$$\forall D, \; \widehat{s}_{\widehat{m}(D)} = \arg\min_{u \in \widetilde{S}_D} P_n \gamma(u), \qquad \widehat{D} = \arg\min_{D \in \mathcal{D}} \left\{ P_n \gamma(\widehat{s}_{\widehat{m}(D)}) + \operatorname{pen}(D) \right\}$$
where $\widetilde{S}_D = \bigcup_{m \mid D_m = D} S_m$
Complexity measure in the homoscedastic setup
[Figure: average estimated loss vs. dimension (Reg = 1, S = 2)]
Complexity of $S_m$: of order $D_m / n$
$\widetilde{S}_D = \bigcup_{m \mid D_m = D} S_m$: more complex than any $S_m$
Effective complexity measure: curves may be superimposed
Complexity of $\widetilde{S}_D$:
$$\operatorname{pen}(m) = c_1 \frac{D_m}{n} + c_2 \frac{D_m}{n} \log\left( \frac{n}{D_m} \right)$$
What about the heteroscedastic setting?
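A sketch of this penalty shape; the constants `c1` and `c2` below are placeholder values, since in practice they must be calibrated:

```python
import numpy as np

def bm_penalty(D, n, c1=2.0, c2=2.0):
    """Penalty of the form c1 * D/n + c2 * (D/n) * log(n/D).
    c1 and c2 are placeholder values: they must be calibrated in practice."""
    return c1 * D / n + c2 * (D / n) * np.log(n / D)
```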
Heteroscedastic data
[Figure: average estimated loss vs. dimension (Reg = 1, S = 5)]
Strategy: Resampling with rich collections
With NOT TOO RICH collections of models:
Homoscedastic: penalties like $c_1 D_m / n$ are reliable complexity measures
Heteroscedastic: resampling outperforms penalties $c_1 D_m / n$ (Arlot (2008))
With RICH collections of models:
Homoscedastic: $c_1 \frac{D_m}{n} + c_2 \frac{D_m}{n} \log\left( \frac{n}{D_m} \right)$ is a reliable complexity measure
Heteroscedastic: we will use resampling to measure the complexity
Cross-validation principle
The training set: Computation of the estimator
The test set: Quality assessment
V-fold cross-validation (VFCV)
Randomly split the data into $V$ disjoint subsets $(B_j)$ of size $p \approx n/V$.
For each $1 \le j \le V$:
$$\underbrace{(X_1, Y_1), \ldots, (X_{n-p}, Y_{n-p})}_{\text{Training set}}, \; \underbrace{(X_{n-p+1}, Y_{n-p+1}), \ldots, (X_n, Y_n)}_{\text{Test set } B_j}$$
$$\widehat{s}_m^{\,(-j)} = \arg\min_{u \in S_m} \left\{ \frac{1}{n-p} \sum_{i=1}^{n-p} \gamma(u, (X_i, Y_i)) \right\}$$
$$P_n^{(j)} = \frac{1}{p} \sum_{i=n-p+1}^{n} \delta_{(X_i, Y_i)} \longrightarrow P_n^{(j)} \gamma\left( \widehat{s}_m^{\,(-j)} \right)$$
$$\text{VFCV} \longrightarrow \widehat{m} \in \arg\min_{m \in \mathcal{M}_n} \left\{ \frac{1}{V} \sum_{j=1}^{V} P_n^{(j)} \gamma\left( \widehat{s}_m^{\,(-j)} \right) \right\}$$
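A minimal VFCV sketch for the regressogram, reusing the `Y` and `bps` conventions of the earlier sketches; it refits the interval means on the training points of each fold and assumes every interval keeps at least one training point per fold:

```python
import numpy as np

def vfcv_risk(Y, bps, V=5, seed=0):
    """V-fold CV risk estimate for the regressogram on the partition `bps`:
    on each fold, the interval means are refitted on the training points
    only, then evaluated on the held-out points (P_n^(j) gamma(...))."""
    n = len(Y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), V)
    risks = []
    for test in folds:
        train = np.ones(n, dtype=bool)
        train[test] = False                      # training set of size n - p
        errs = []
        for lo, hi in zip(bps[:-1], bps[1:]):
            tr = train[lo:hi]
            beta = Y[lo:hi][tr].mean()           # fitted on training points only
            errs.append((Y[lo:hi][~tr] - beta) ** 2)
        risks.append(np.mean(np.concatenate(errs)))
    return float(np.mean(risks))                 # (1/V) * sum_j P_n^(j) gamma(...)
```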
BM, VFCV: choosing the number of breakpoints
First step:
$$\forall D, \quad \widehat{m}(D) = \operatorname{Argmin}_{m \mid D_m = D} P_n \gamma(\widehat{s}_m)$$
Second step:
1. BM (homoscedastic):
$$P_n \gamma(\widehat{s}_{\widehat{m}(D)}) + \operatorname{pen}(D) \approx \mathbb{E}\left[ \left\| s - \widehat{s}_{\widehat{m}(D)} \right\|^2 \right]$$
2. VFCV (homoscedastic and heteroscedastic):
$$\frac{1}{V} \sum_{j=1}^{V} P_n^{(j)} \gamma\left( \widehat{s}_{\widehat{m}^{(-j)}(D)}^{\,(-j)} \right) \approx \mathbb{E}\left[ \left\| s - \widehat{s}_{\widehat{m}(D)} \right\|^2 \right]$$
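Since the least-squares cost is additive over intervals, the first-step minimizer $\widehat{m}(D)$ can be computed exactly for every $D$ at once by dynamic programming. A sketch (quadratic-time in $n$ for clarity, not tuned for speed):

```python
import numpy as np

def erm_cost(Y):
    """Additive ERM interval cost: residual sum of squares of Y[lo:hi]
    around its mean, computed from cumulative sums."""
    S = np.concatenate(([0.0], np.cumsum(Y)))
    S2 = np.concatenate(([0.0], np.cumsum(Y ** 2)))
    def sse(lo, hi):
        return S2[hi] - S2[lo] - (S[hi] - S[lo]) ** 2 / (hi - lo)
    return sse

def segment_dp(cost, n, Dmax):
    """Bellman recursion: for every D <= Dmax, the partition of {0,...,n}
    into D intervals minimizing the additive criterion sum of cost(lo, hi)."""
    best = np.full((Dmax + 1, n + 1), np.inf)
    arg = np.zeros((Dmax + 1, n + 1), dtype=int)
    for j in range(1, n + 1):
        best[1, j] = cost(0, j)
    for D in range(2, Dmax + 1):
        for j in range(D, n + 1):
            for i in range(D - 1, j):            # last change-point candidate
                c = best[D - 1, i] + cost(i, j)
                if c < best[D, j]:
                    best[D, j], arg[D, j] = c, i
    segs = {}
    for D in range(1, Dmax + 1):                 # backtrack the breakpoints
        bps, j = [n], n
        for d in range(D, 1, -1):
            j = arg[d, j]
            bps.append(j)
        segs[D] = [0] + bps[::-1]
    return segs

# m_hat(D) for all dimensions at once: segs = segment_dp(erm_cost(Y), len(Y), 20)
```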
BM, VFCV: performance (number of breakpoints)
[Figure, left: average estimated loss (Reg = 1, S = 2), homoscedastic; curves: Ideal, VFCV, BM]
[Figure, right: average estimated loss (Reg = 2, S = 5), heteroscedastic; curves: Ideal, VFCV, BM]

σ | Ideal       | VFCV        | BM          | C_p
2 | 2.19 ± 0.28 | 3.35 ± 0.35 | 2.96 ± 0.36 | 62.1 ± 7.1
3 | 4.09 ± 0.24 | 5.12 ± 0.3  | 4.84 ± 0.32 | 19.3 ± 1.2
5 | 4.04 ± 0.57 | 7.33 ± 0.9  | 38.5 ± 2.8  | 131 ± 14
6 | 2.13 ± 0.18 | 3.47 ± 0.28 | 5.1 ± 0.39  | 94.7 ± 7.7
Summary of main ideas
Resampling turns out to be reliable with rich collections.
Homoscedastic setup: resampling performs almost as well as BM.
Heteroscedastic setup: resampling strongly outperforms BM.
Overfitting of ERM
ERM only takes the fit to the data into account. Overfitting may occur when
$$\forall D, \quad \widehat{m}(D) = \operatorname{Argmin}_{m \mid D_m = D} P_n \gamma(\widehat{s}_m)$$
Overfitting is all the stronger as $\widetilde{S}_D$ is large. In our setting,
$$\operatorname{Card}\left( \{ m \mid D_m = D \} \right) = \binom{n-1}{D-1}$$
Idea: ERM may be replaced by resampling to provide $\{ \widehat{m}(D) \}_D$.
Algorithms
Goal: see whether resampling outperforms ERM.
Alternatives:
ERM:
$$\forall D, \quad \widehat{m}(D) = \arg\min_{m \mid D_m = D} P_n \gamma(\widehat{s}_m)$$
Leave-one-out (LOO):
$$\forall D, \quad \widehat{m}(D) = \arg\min_{m \mid D_m = D} \frac{1}{n} \sum_{i=1}^{n} P_n^{(i)} \gamma\left( \widehat{s}_m^{\,(-i)} \right)$$
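For regressograms the LOO criterion has a closed form: removing $Y_i$ from its interval changes the interval mean from $\widehat{\beta}_\lambda$ to $(n_\lambda \widehat{\beta}_\lambda - Y_i)/(n_\lambda - 1)$, so the LOO residual is $(Y_i - \widehat{\beta}_\lambda)\, n_\lambda / (n_\lambda - 1)$. A sketch, valid when every interval contains at least two points:

```python
import numpy as np

def loo_risk(Y, bps):
    """Exact LOO risk of the regressogram on the partition `bps`:
    (1/n) * sum_i (Y_i - beta_hat^(-i))^2, using the closed form
    (Y_i - beta_hat) * n_lam / (n_lam - 1) for the LOO residual.
    Every interval must contain at least two points."""
    errs = []
    for lo, hi in zip(bps[:-1], bps[1:]):
        m = hi - lo
        beta = Y[lo:hi].mean()
        errs.append(((Y[lo:hi] - beta) * m / (m - 1)) ** 2)
    return float(np.mean(np.concatenate(errs)))
```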
Quality of the segmentations: Homoscedastic
[Figure: average estimated loss vs. dimension (r = 3, b = 2); curves: ERM, LOO]
Quality of the segmentations: Heteroscedastic
[Figure, left: average estimated loss (r = 3, b = 12); curves: ERM, LOO]
[Figure, right: average estimated loss (r = 3, b = 7); curves: ERM, LOO]
Overfitting with ERM: dimensions 6 to 10
[Figures: ERM, LOO, and oracle segmentations for dimensions 6, 7, 8, 9, and 10]
Global algorithm description
1. LOO at the first step, to choose the best segmentation for each dimension:
$$\forall D, \quad \widehat{m}(D) = \arg\min_{m \mid D_m = D} \frac{1}{n} \sum_{i=1}^{n} P_n^{(i)} \gamma\left( \widehat{s}_m^{\,(-i)} \right)$$
2. VFCV to choose the number of breakpoints:
$$\widehat{D} = \arg\min_{D} \frac{1}{V} \sum_{j=1}^{V} P_n^{(j)} \gamma\left( \widehat{s}_{\widehat{m}^{(-j)}(D)}^{\,(-j)} \right)$$
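The LOO criterion above is also additive over intervals, so step 1 can reuse the same dynamic program as ERM with the LOO interval cost plugged in. The sketch below assumes the `segment_dp` and `vfcv_risk` helpers from the earlier sketches are in scope, and it simplifies step 2 by keeping $\widehat{m}(D)$ fixed across folds, whereas the procedure above recomputes the segmentation on each training set:

```python
import numpy as np

def loo_cost(Y):
    """Additive LOO interval cost: sum of squared closed-form LOO residuals
    over the interval (inf on singletons, where LOO is undefined)."""
    def cost(lo, hi):
        m = hi - lo
        if m < 2:
            return np.inf
        beta = Y[lo:hi].mean()
        return float(np.sum(((Y[lo:hi] - beta) * m / (m - 1)) ** 2))
    return cost

def segment_signal(Y, Dmax=20, V=5):
    """Two-step sketch: (1) best segmentation of each dimension D by LOO,
    via the segment_dp helper above; (2) dimension chosen by VFCV, via the
    vfcv_risk helper above (simplified: m_hat(D) is kept fixed across folds)."""
    segs = segment_dp(loo_cost(Y), len(Y), Dmax)     # step 1: m_hat(D) for all D
    crit = {D: vfcv_risk(Y, bps, V=V) for D, bps in segs.items()}
    D_hat = min(crit, key=crit.get)                  # step 2: D_hat by VFCV
    return segs[D_hat]
```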
The Bt474 cell lines
Epithelial cells obtained from human breast cancer tumors
A test genome is compared to a reference male genome
We only consider chromosomes 1 and 9
Results: Chromosome 9
Homoscedastic model (Picard et al. (2005))
LOO+VFCV
Results: Chromosome 1
Homoscedastic model (Picard et al. (2005))