Segmentation in the mean of heteroscedastic data via resampling


Alain Celisse
AgroParisTech, Paris; University Paris-Sud 11, Orsay; SSB Group, Paris

Change point detection methods and Applications, AgroParisTech, Paris, September 11, 2008

Joint work with Sylvain Arlot

Statistical framework: regression on a fixed design

$(t_1, Y_1), \ldots, (t_n, Y_n) \in [0,1] \times \mathcal{Y}$ independent, with
$$Y_i = s(t_i) + \sigma_i \varepsilon_i, \qquad \mathcal{Y} = [0,1] \text{ or } \mathbb{R}.$$

Instants $t_i$: deterministic ($t_i = i/n$).
Noise: $\mathbb{E}[\varepsilon_i] = 0$ and $\mathbb{E}[\varepsilon_i^2] = 1$.
Noise level: $\sigma_i$ (heteroscedastic).

Goal: estimate $s$.

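For concreteness, a minimal Python sketch of data drawn from this model; the particular signal $s$ and noise profile $\sigma$ are my own illustrative choices, not taken from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
t = np.arange(1, n + 1) / n                  # fixed design: t_i = i/n

# Illustrative choices (not from the talk): piecewise-constant signal,
# noise level that jumps at t = 0.5 (heteroscedastic)
s = np.where(t < 0.3, 0.0, np.where(t < 0.7, 1.5, -0.5))
sigma = 0.1 + 0.4 * (t > 0.5)

eps = rng.standard_normal(n)                 # E[eps_i] = 0, E[eps_i^2] = 1
Y = s + sigma * eps                          # Y_i = s(t_i) + sigma_i * eps_i
```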

Change-point detection framework

$$Y_i = s(t_i) + \sigma_i \varepsilon_i$$

$s$: piecewise constant with high jumps; heteroscedastic noise.

Purpose and strategy: estimate $s$ so as to recover most of the jumps that are significant w.r.t. the noise level. We choose our estimator among piecewise constant functions.


Estimation vs. Selection

A change-point in a noisy region: we do not systematically want to recover it.
Use the quadratic risk: detected change-points are then meaningful.


Loss function, least-squares risk and contrast

Loss function:
$$\ell(s, u) = \|s - u\|_n^2 := \frac{1}{n} \sum_{i=1}^{n} \big(s(t_i) - u(t_i)\big)^2$$

Least-squares risk of an estimator $\widehat{s}$: $R_n(\widehat{s}) := \mathbb{E}[\ell(s, \widehat{s})]$

Empirical risk:
$$P_n \gamma(u) := \frac{1}{n} \sum_{i=1}^{n} \gamma(u, (t_i, Y_i)), \quad \text{with } \gamma(u, (x, y)) = (u(x) - y)^2$$

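In code, the empirical risk is just a mean of squared residuals; a minimal helper (the name is mine), used by the sketches that follow:

```python
import numpy as np

def empirical_risk(u_vals: np.ndarray, Y: np.ndarray) -> float:
    """P_n gamma(u) = (1/n) sum_i (u(t_i) - Y_i)^2, with u_vals[i] = u(t_i)."""
    return float(np.mean((u_vals - Y) ** 2))
```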

Least-squares estimator

$(I_\lambda)_{\lambda \in \Lambda_m}$: partition of $[0,1]$.
$S_m$: linear space of piecewise constant functions on $(I_\lambda)_{\lambda \in \Lambda_m}$.

Empirical risk minimizer over $S_m$ (= model):
$$\widehat{s}_m \in \arg\min_{u \in S_m} P_n \gamma(u, \cdot) = \arg\min_{u \in S_m} \frac{1}{n} \sum_{i=1}^{n} (u(t_i) - Y_i)^2.$$

Regressogram:
$$\widehat{s}_m = \sum_{\lambda \in \Lambda_m} \widehat{\beta}_\lambda \mathbf{1}_{I_\lambda}, \qquad \widehat{\beta}_\lambda = \frac{1}{\mathrm{Card}\{t_i \in I_\lambda\}} \sum_{t_i \in I_\lambda} Y_i.$$

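A minimal sketch of the regressogram on a given segmentation, indexed by interior breakpoint positions (this representation of a partition is my assumption):

```python
import numpy as np

def regressogram(Y: np.ndarray, breaks: list[int]) -> np.ndarray:
    """Piecewise-constant least-squares fit: each segment gets its mean.

    `breaks` are interior breakpoint indices; segments are
    [0, b1), [b1, b2), ..., [bk, n).
    """
    edges = [0] + list(breaks) + [len(Y)]
    fit = np.empty_like(Y, dtype=float)
    for lo, hi in zip(edges[:-1], edges[1:]):
        fit[lo:hi] = Y[lo:hi].mean()   # beta_hat_lambda = segment mean
    return fit
```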

[Figure] Data $(t_1, Y_1), \ldots, (t_n, Y_n)$

[Figure] Goal: reconstruct the signal

[Figure] How many breakpoints? ($D = 1$ vs. $D = 3$)

[Figure] The oracle

Model selection

$$(S_m)_{m \in \mathcal{M}_n} \longrightarrow (\widehat{s}_m)_{m \in \mathcal{M}_n} \longrightarrow \widehat{s}_{\widehat{m}} \;???$$

Goals:
Oracle inequality (in expectation, or with a large probability):
$$\ell(s, \widehat{s}_{\widehat{m}}) \le C \inf_{m \in \mathcal{M}_n} \{\ell(s, \widehat{s}_m) + R(m, n)\}$$
Adaptivity (provided $(S_m)_{m \in \mathcal{M}_n}$ is well chosen).


Collection of models

For any $m \in \mathcal{M}_n$, $(I_\lambda)_{\lambda \in \Lambda_m}$ denotes a partition of $[0,1]$ such that $I_\lambda = [t_{i_k}, t_{i_{k+1}})$.

$S_m$: linear space of piecewise constant functions on $(I_\lambda)_{\lambda \in \Lambda_m}$, with $\dim(S_m) = D_m$.

$$\forall\, 1 \le D \le n-1, \quad \mathrm{Card}\{m \in \mathcal{M}_n \mid D_m = D\} = \binom{n-1}{D-1}$$

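To see how rich this collection is, a quick computation with the standard library (a sketch):

```python
from math import comb

n = 100
for D in (1, 5, 10, 20):
    print(D, comb(n - 1, D - 1))  # number of segmentations of dimension D
# For n = 100, D = 10 already gives about 1.7e12 candidate segmentations.
```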

The collection complexity idea

Usual approach: bias-variance tradeoff. Mallows' $C_p$:
$$C_p(m) = P_n \gamma(\widehat{s}_m) + 2\sigma^2 \frac{D_m}{n}$$

This approach is useless here: Mallows' $C_p$ overfits (see figure), and the collection complexity is involved in this phenomenon.

[Figure] $C_p$ criterion; oracle dimension: 5.

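In code the criterion is one line; a sketch, where `fit` holds the regressogram values at the design points and $\sigma^2$ is assumed known:

```python
import numpy as np

def mallows_cp(Y: np.ndarray, fit: np.ndarray, D: int, sigma2: float) -> float:
    """C_p(m) = P_n gamma(s_hat_m) + 2 * sigma^2 * D_m / n."""
    n = len(Y)
    return float(np.mean((Y - fit) ** 2) + 2.0 * sigma2 * D / n)
```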

Model collection complexity and overfitting

[Figure] Panels: regular partitions; at least 15 points; at least 10 points; at least 5 points.

Birgé and Massart (2001): Homoscedastic

Algorithm 1:
$$\forall m, \quad \widehat{s}_m = \arg\min_{u \in S_m} P_n \gamma(u), \qquad \widehat{m} = \arg\min_{m \in \mathcal{M}_n} \{P_n \gamma(\widehat{s}_m) + \mathrm{pen}(D_m)\}$$

Algorithm 1' (= Algorithm 1):
$$\forall D, \quad \widehat{s}_{\widehat{m}(D)} = \arg\min_{u \in \widetilde{S}_D} P_n \gamma(u), \qquad \widehat{D} = \arg\min_{D \in \mathcal{D}} \{P_n \gamma(\widehat{s}_{\widehat{m}(D)}) + \mathrm{pen}(D)\}$$
where $\widetilde{S}_D = \bigcup_{m \mid D_m = D} S_m$.


Complexity measure in the homoscedastic setup

[Figure] Average estimated loss (Reg = 1, S = 2); curves may be superimposed.

Complexity of $S_m$: of order $D_m/n$.
$\widetilde{S}_D = \bigcup_{m \mid D_m = D} S_m$ is more complex than any single $S_m$, so an effective complexity measure is needed. Complexity of $\widetilde{S}_D$:
$$\mathrm{pen}(m) = c_1 \frac{D_m}{n} + c_2 \frac{D_m}{n} \log\left(\frac{n}{D_m}\right)$$

What about the heteroscedastic setting?

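A sketch of the corresponding dimension selection, assuming the best empirical risk per dimension has already been computed; the constants `c1`, `c2` are illustrative placeholders and would need calibration in practice:

```python
import numpy as np

def select_dimension_bm(rss_per_D: dict[int, float], n: int,
                        c1: float = 2.0, c2: float = 2.0) -> int:
    """Pick D minimizing P_n gamma(s_hat_{m(D)}) + c1*D/n + c2*(D/n)*log(n/D).

    rss_per_D[D] is the empirical risk of the best D-segment fit.
    """
    def crit(D: int) -> float:
        pen = c1 * D / n + c2 * (D / n) * np.log(n / D)
        return rss_per_D[D] + pen
    return min(rss_per_D, key=crit)
```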

Heteroscedastic data

[Figure] Average estimated loss (Reg = 1, S = 5).

Strategy: Resampling with rich collections

With NOT TOO RICH collections of models:
Homoscedastic: penalties like $c_1 D_m/n$ are reliable complexity measures.
Heteroscedastic: resampling outperforms penalties $c_1 D_m/n$ (Arlot (08)).

With RICH collections of models:
Homoscedastic: $c_1 \frac{D_m}{n} + c_2 \frac{D_m}{n} \log\left(\frac{n}{D_m}\right)$ is a reliable complexity measure.
Heteroscedastic: we will use resampling to measure the complexity.


Cross-validation principle

[Figure] The training set: computation of the estimator.

[Figure] The test set: quality assessment.

V-fold cross-validation (VFCV)

Randomly split the data into $V$ disjoint subsets $(B_j)_{1 \le j \le V}$ of size $p \approx n/V$. For each $1 \le j \le V$:
$$\underbrace{(X_1, Y_1), \ldots, (X_{n-p}, Y_{n-p})}_{\text{Training set}}, \quad \underbrace{(X_{n-p+1}, Y_{n-p+1}), \ldots, (X_n, Y_n)}_{\text{Test set } B_j}$$

$$\widehat{s}_m^{(-j)} = \arg\min_{u \in S_m} \left\{ \frac{1}{n-p} \sum_{i=1}^{n-p} \gamma(u, (X_i, Y_i)) \right\}$$

$$P_n^{(j)} = \frac{1}{p} \sum_{i=n-p+1}^{n} \delta_{(X_i, Y_i)} \;\longrightarrow\; P_n^{(j)} \gamma\big(\widehat{s}_m^{(-j)}\big)$$

$$\text{VFCV:} \quad \widehat{m} \in \arg\min_{m \in \mathcal{M}_n} \left\{ \frac{1}{V} \sum_{j=1}^{V} P_n^{(j)} \gamma\big(\widehat{s}_m^{(-j)}\big) \right\}$$

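A minimal VFCV sketch for a fixed segmentation fitted by regressogram; the fold layout and the handling of segments left empty by a split are my own crude choices:

```python
import numpy as np

def vfold_cv_score(Y: np.ndarray, breaks: list[int], V: int = 5,
                   seed: int = 0) -> float:
    """Average test risk of the segmentation `breaks` over V random folds."""
    n = len(Y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), V)
    edges = [0] + list(breaks) + [n]
    score = 0.0
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        for lo, hi in zip(edges[:-1], edges[1:]):
            seg_train = train[(train >= lo) & (train < hi)]
            seg_test = test[(test >= lo) & (test < hi)]
            if len(seg_train) == 0 or len(seg_test) == 0:
                continue  # segment empty in this fold: skip (a crude choice)
            beta = Y[seg_train].mean()            # fit on training points
            score += np.sum((Y[seg_test] - beta) ** 2)
    return score / n  # each point is tested in exactly one fold
```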

BM, VFCV: choosing the number of breakpoints

First step:
$$\forall D, \quad \widehat{m}(D) = \mathop{\mathrm{Argmin}}_{m \mid D_m = D} P_n \gamma(\widehat{s}_m)$$

Second step:
1. BM (homoscedastic):
$$P_n \gamma\big(\widehat{s}_{\widehat{m}(D)}\big) + \mathrm{pen}(D) \approx \mathbb{E}\left[ \big\| s - \widehat{s}_{\widehat{m}(D)} \big\|^2 \right]$$
2. VFCV (homoscedastic and heteroscedastic):
$$\frac{1}{V} \sum_{j=1}^{V} P_n^{(j)} \gamma\Big(\widehat{s}^{(-j)}_{\widehat{m}^{(-j)}(D)}\Big) \approx \mathbb{E}\left[ \big\| s - \widehat{s}_{\widehat{m}(D)} \big\|^2 \right]$$

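The talk does not spell out how the first-step argmin over all $\binom{n-1}{D-1}$ segmentations is computed; the standard tool is least-squares segmentation by dynamic programming. A sketch (cost $O(D_{\max} \cdot n^2)$; names are mine):

```python
import numpy as np

def best_segmentations(Y: np.ndarray, D_max: int) -> dict[int, list[int]]:
    """For each dimension D <= D_max, the breakpoints of the segmentation
    minimizing the empirical risk, via dynamic programming."""
    n = len(Y)
    c1 = np.concatenate(([0.0], np.cumsum(Y)))
    c2 = np.concatenate(([0.0], np.cumsum(Y ** 2)))

    def cost(i: int, j: int) -> float:
        # RSS of the best constant fit on Y[i:j]
        s, s2, m = c1[j] - c1[i], c2[j] - c2[i], j - i
        return s2 - s * s / m

    dp = np.full((D_max + 1, n + 1), np.inf)   # dp[D][j]: best RSS on Y[:j]
    arg = np.zeros((D_max + 1, n + 1), dtype=int)
    dp[0][0] = 0.0
    for D in range(1, D_max + 1):
        for j in range(D, n + 1):
            for i in range(D - 1, j):
                v = dp[D - 1][i] + cost(i, j)
                if v < dp[D][j]:
                    dp[D][j], arg[D][j] = v, i
    out = {}
    for D in range(1, D_max + 1):
        bks, j = [], n
        for d in range(D, 1, -1):              # backtrack the D - 1 breakpoints
            j = arg[d][j]
            bks.append(j)
        out[D] = sorted(bks)
    return out
```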

BM, VFCV: performance (number of breakpoints)

[Figure] Average estimated loss, Ideal vs. VFCV vs. BM: homoscedastic (Reg = 1, S = 2) and heteroscedastic (Reg = 2, S = 5) cases.

σ(·)   Ideal          VFCV          BM            C_p
2      2.19 ± 0.28    3.35 ± 0.35   2.96 ± 0.36   62.1 ± 7.1
3      4.09 ± 0.24    5.12 ± 0.3    4.84 ± 0.32   19.3 ± 1.2
5      4.04 ± 0.57    7.33 ± 0.9    38.5 ± 2.8    131 ± 14
6      2.13 ± 0.18    3.47 ± 0.28   5.1 ± 0.39    94.7 ± 7.7

Summary of main ideas

Resampling turns out to be reliable with rich collections.
Homoscedastic setup: resampling performs almost as well as BM.
Heteroscedastic setup: resampling strongly outperforms BM.


Overfitting of ERM

ERM only takes into account the fit to the data. Overfitting may occur when
$$\forall D, \quad \widehat{m}(D) = \mathop{\mathrm{Argmin}}_{m \mid D_m = D} P_n \gamma(\widehat{s}_m)$$

Overfitting becomes stronger as $\widetilde{S}_D$ gets larger. In our setting,
$$\mathrm{Card}(\{m \mid D_m = D\}) = \binom{n-1}{D-1}$$

Idea: ERM may be replaced by resampling to provide $\{\widehat{m}(D)\}_D$.


Algorithms

Goal: see whether resampling outperforms ERM.

Alternatives:
ERM:
$$\forall D, \quad \widehat{m}(D) = \arg\min_{m \mid D_m = D} P_n \gamma(\widehat{s}_m)$$
Leave-one-out (LOO):
$$\forall D, \quad \widehat{m}(D) = \arg\min_{m \mid D_m = D} \frac{1}{n} \sum_{i=1}^{n} P_n^{(i)} \gamma\big(\widehat{s}_m^{(-i)}\big)$$

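For regressograms the LOO criterion needs no refitting: removing one point perturbs only its own segment mean, giving the closed form $Y_i - \widehat{\beta}_{-i} = \frac{m}{m-1}(Y_i - \widehat{\beta})$ on a segment of $m$ points. A sketch exploiting this (assumes every segment has at least two points; names are mine):

```python
import numpy as np

def loo_risk(Y: np.ndarray, breaks: list[int]) -> float:
    """Leave-one-out risk of a segmentation, using the closed form for
    segment means. Assumes every segment contains at least 2 points."""
    n = len(Y)
    edges = [0] + list(breaks) + [n]
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        seg = Y[lo:hi]
        m = len(seg)
        resid = seg - seg.mean()
        total += np.sum((m / (m - 1) * resid) ** 2)
    return total / n
```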

Quality of the segmentations: Homoscedastic

[Figure] Average estimated loss, ERM vs. LOO (r = 3, b = 2).

Quality of the segmentations: Heteroscedastic

[Figure] Average estimated loss, ERM vs. LOO (r = 3, b = 12 and r = 3, b = 7).

Overfitting with ERM: dimensions 6 to 10

[Figures] ERM vs. LOO vs. Oracle segmentations for dimensions 6, 7, 8, 9 and 10.

Global algorithm description

1. LOO at the first step, to choose the best segmentation for each dimension:
$$\forall D, \quad \widehat{m}(D) = \arg\min_{m \mid D_m = D} \frac{1}{n} \sum_{i=1}^{n} P_n^{(i)} \gamma\big(\widehat{s}_m^{(-i)}\big)$$
2. VFCV at the second step, to choose the number of breakpoints:
$$\widehat{D} = \arg\min_{D} \left\{ \frac{1}{V} \sum_{j=1}^{V} P_n^{(j)} \gamma\Big(\widehat{s}^{(-j)}_{\widehat{m}^{(-j)}(D)}\Big) \right\}$$

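Putting the two steps together, a compact orchestration sketch; the criteria are passed as callables (e.g. the LOO and VFCV sketches above), and the candidate segmentations per dimension would come from a search such as the dynamic programming sketched earlier:

```python
def two_step_selection(Y, candidates_per_D, loo_risk, vfcv_risk):
    """Step 1 (LOO): pick the best segmentation m_hat(D) for each dimension.
    Step 2 (VFCV): pick the dimension D_hat among the step-1 winners.

    candidates_per_D: dict mapping D to a list of breakpoint lists.
    """
    m_hat = {D: min(cands, key=lambda b: loo_risk(Y, b))
             for D, cands in candidates_per_D.items()}
    D_hat = min(m_hat, key=lambda D: vfcv_risk(Y, m_hat[D]))
    return D_hat, m_hat[D_hat]
```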

The Bt474 cell lines

Epithelial cells obtained from human breast cancer tumors.
A test genome is compared to a reference male genome.
We only consider chromosomes 1 and 9.


Results: Chromosome 9

[Figure] Homoscedastic model (Picard et al. (05)) vs. LOO+VFCV.

Results: Chromosome 1

[Figure] Homoscedastic model (Picard et al. (05)) vs. LOO+VFCV.

Conclusion

We have designed a resampling-based procedure.
It can be effectively applied to the change-point detection problem.
It behaves almost as well as BM in a homoscedastic framework, and works well in a fully heteroscedastic setup.
This methodology may be extended to other resampling schemes.
The algorithm relies on the dimension as a criterion to gather models; we may think about alternative criteria.
