
Data-driven penalties for model selection

Sylvain Arlot

CNRS
École Normale Supérieure (Paris), LIENS, Willow team

Mathematical Statistics Seminar, WIAS, Berlin, 22/04/2009


Statistical framework: regression on a random design

(X_1, Y_1), ..., (X_n, Y_n) ∈ X × Y i.i.d., (X_i, Y_i) ∼ P unknown

Y = s(X) + σ(X) ε,  X ∈ X ⊂ R^d,  Y ∈ Y = [0, 1] or R

Noise: E[ε | X] = 0 and E[ε² | X] = 1; noise level σ(X).

Predictor t : X → Y ?

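As a concrete illustration of this framework (my sketch, not part of the talk), the following Python snippet draws a sample from such a model; the choices s(x) = sin(πx) and σ(x) = x match the simulation settings used later in the talk:

    import numpy as np

    rng = np.random.default_rng(0)

    def generate_data(n, s, sigma):
        """Draw (X_i, Y_i), i = 1..n, with Y = s(X) + sigma(X) * eps,
        X ~ U([0, 1]), E[eps | X] = 0, E[eps^2 | X] = 1."""
        X = rng.uniform(0.0, 1.0, size=n)
        eps = rng.standard_normal(n)      # N(0, 1): centered, unit variance
        Y = s(X) + sigma(X) * eps
        return X, Y

    # Heteroscedastic example used in the talk's simulations.
    X, Y = generate_data(200, lambda x: np.sin(np.pi * x), lambda x: x)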

Loss function, least-squares estimator

Least-squares risk: E[γ(t, (X, Y))] = P γ(t, ·) with γ(t, (x, y)) = (t(x) − y)².

Loss function: ℓ(s, t) = P γ(t, ·) − P γ(s, ·) = E[(t(X) − s(X))²].

Empirical risk minimizer on S_m (= model):

ŝ_m ∈ argmin_{t ∈ S_m} P_n γ(t, ·) = argmin_{t ∈ S_m} (1/n) Σ_{i=1}^n (t(X_i) − Y_i)².

E.g., histograms on a partition (I_λ)_{λ ∈ Λ_m} of X:

ŝ_m = Σ_{λ ∈ Λ_m} β̂_λ 1_{I_λ}  with  β̂_λ = (1 / Card{X_i ∈ I_λ}) Σ_{X_i ∈ I_λ} Y_i.

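A minimal sketch of this histogram least-squares estimator (mine, for illustration), assuming X = [0, 1] and a regular partition into D bins:

    import numpy as np

    def regressogram(X, Y, D):
        """Least-squares fit on the model of histograms with D regular bins
        on [0, 1]: beta_hat_lambda is the mean of the Y_i falling in bin
        lambda (0 for empty bins)."""
        bins = np.minimum((X * D).astype(int), D - 1)   # bin index of each X_i
        beta = np.zeros(D)
        for lam in range(D):
            in_bin = bins == lam
            if in_bin.any():
                beta[lam] = Y[in_bin].mean()
        return beta

    def predict(beta, x):
        """Evaluate the fitted histogram estimator at the points x."""
        D = len(beta)
        return beta[np.minimum((x * D).astype(int), D - 1)]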

Model selection

(S_m)_{m ∈ M} → (ŝ_m)_{m ∈ M} → ŝ_{m̂} ???

Goals:

Oracle inequality (in expectation, or with a large probability):

ℓ(s, ŝ_{m̂}) ≤ C inf_{m ∈ M} { ℓ(s, ŝ_m) + R(m, n) }

Adaptivity (provided (S_m)_{m ∈ M_n} is well chosen), e.g., to the smoothness of s or to the variations of σ.


Penalization

m̂ ∈ argmin_{m ∈ M} { P_n γ(ŝ_m) + pen(m) }

pen(m) = 2 σ² D_m / n (Mallows 1973); pen(m) = 2 σ̂² D_m / n, or K̂ D_m.

And several other penalties (global or local Rademacher complexities, bootstrap or resampling penalties, etc.).

⇒ Ideal penalty: pen_id(m) = (P − P_n)(γ(ŝ_m, ·)).

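For instance, Mallows' C_p selection over regular histograms takes a few lines; this sketch (mine) reuses the hypothetical regressogram/predict helpers above and assumes σ² is known (replace it with an estimate σ̂² for the plug-in variant):

    import numpy as np

    def mallows_cp_select(X, Y, dims, sigma2):
        """Pick D minimizing P_n gamma(s_hat_m) + 2 * sigma^2 * D / n
        (Mallows 1973) over the candidate bin counts in `dims`."""
        n = len(X)
        crits = []
        for D in dims:
            beta = regressogram(X, Y, D)
            emp_risk = np.mean((predict(beta, X) - Y) ** 2)  # P_n gamma(s_hat_m)
            crits.append(emp_risk + 2.0 * sigma2 * D / n)
        return dims[int(np.argmin(crits))]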

Data-driven calibration of the penalty

Assume that we know (or have estimated) pen_0 such that K* pen_0(m) ≈ E[pen_id(m)] (K* unknown).

Examples: pen_0(m) = D_m, Rademacher complexity, etc.

m̂(K) ∈ argmin_{m ∈ M_n} { P_n γ(ŝ_m) + K pen_0(m) }

⇒ How to choose K?


Dimension jump

[figure]

Efficiency as a function of K

[figure]

Algorithm (Birgé, Massart 2007; A., Massart, JMLR 2009)

1. For every K > 0, compute m̂(K) ∈ argmin_{m ∈ M_n} { P_n γ(ŝ_m) + K pen_0(m) }.

2. Find K̂_min such that D_{m̂(K)} is "very large" when K < K̂_min and "reasonably small" when K > K̂_min.

3. Choose the model m̂ = m̂(2 K̂_min).

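A sketch of this algorithm (my illustration; detecting K̂_min via the largest jump of K ↦ D_{m̂(K)} is one concrete implementation of the loosely specified step 2):

    import numpy as np

    def slope_heuristics_select(emp_risk, pen0, K_grid):
        """emp_risk[m] = P_n gamma(s_hat_m), pen0[m] = pen_0(m), both numpy
        arrays indexed by model m; K_grid = increasing grid of K values.
        Returns the model selected with penalty 2 * K_hat_min * pen_0."""
        # Step 1: trajectory K -> m_hat(K).
        traj = [int(np.argmin(emp_risk + K * pen0)) for K in K_grid]
        dims = np.array([pen0[m] for m in traj])   # non-increasing in K
        # Step 2: K_hat_min located at the largest jump of the trajectory.
        jumps = dims[:-1] - dims[1:]
        K_min = K_grid[int(np.argmax(jumps)) + 1]
        # Step 3: final choice with twice the minimal penalty.
        return int(np.argmin(emp_risk + 2.0 * K_min * pen0))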

The slope heuristics

[Sequence of figures for K = 0, 0.45 K*, 0.5 K*, 0.55 K*, 0.75 K*, and K*.]

The slope heuristics: informal argument

[figure]

Two theorems

Assumptions:
Histograms; "small" number of models (Card(M_n) ≤ ♦ n);
Bounded data: ||Y||_∞ ≤ A < ∞;
Noise level bounded from below: 0 < σ_min ≤ σ(X);
Smooth s: non-constant, α-Hölderian.

Theorem (Minimal penalty; A. and Massart, JMLR 2009). If 0 ≤ K < K*/2, with probability at least 1 − ♦ n^{−2},

ℓ(s, ŝ_{m̂(K)}) ≥ ln(n) inf_{m ∈ M} { ℓ(s, ŝ_m) }  and  D_{m̂(K)} ≥ ♦ n / ln(n).

Theorem (Optimal penalty; A. and Massart, JMLR 2009). If K > K*/2, with probability at least 1 − ♦ n^{−2},

ℓ(s, ŝ_{m̂(K)}) ≤ C_n(K) inf_{m ∈ M} { ℓ(s, ŝ_m) }  and  D_{m̂(K)} ≤ n^{1−η},

where C_n(K) ≤ C(K), C_n(K*) ≤ 1 + ln(n)^{−1/5}, and η > 0 may depend on the smoothness of s.

The slope heuristics: sketch of proof

Prediction error: P γ(ŝ_m) = P γ(s_m) + P(γ(ŝ_m) − γ(s_m))

Empirical risk: P_n γ(ŝ_m) = P_n γ(s_m) − P_n(γ(s_m) − γ(ŝ_m))

Key point: P_n(γ(s_m) − γ(ŝ_m)) ≈ P(γ(ŝ_m) − γ(s_m))

Ingredients of the proof: estimation of the expectations; concentration inequalities.
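Making the argument explicit (my elaboration, for histograms with homoscedastic noise): the two centered terms above concentrate around the same quantity D_m σ²/n, so

    % Decompose (P - P_n)\gamma(\hat{s}_m); the cross term (P - P_n)\gamma(s_m)
    % is centered, and each remaining term concentrates around D_m\sigma^2/n:
    \mathbb{E}\big[\mathrm{pen}_{\mathrm{id}}(m)\big]
      = \mathbb{E}\big[(P - P_n)\,\gamma(\hat{s}_m)\big]
      \approx \mathbb{E}\big[P(\gamma(\hat{s}_m) - \gamma(s_m))\big]
            + \mathbb{E}\big[P_n(\gamma(s_m) - \gamma(\hat{s}_m))\big]
      \approx \frac{D_m \sigma^2}{n} + \frac{D_m \sigma^2}{n}
      = 2\,\frac{D_m \sigma^2}{n}

Penalizing with only the second term, about D_m σ²/n, is thus the minimal penalty (it merely cancels the optimism of the empirical risk), and the ideal penalty is roughly twice that: this is why the algorithm selects with 2 K̂_min.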

Illustration: s(x) = sin(πx), n = 200, σ ≡ 1

[Figure: data and target function on [0, 1].]

pen_0(m) = D_m. Benchmark: E[ℓ(s, ŝ_{m̂})] / E[inf_{m ∈ M} ℓ(s, ŝ_m)], computed over 1000 samples.

Model selection method   Efficiency
Mallows (σ)              2.03 ± 0.04
Mallows (σ̂)              1.93 ± 0.04
Slope (threshold)        1.88 ± 0.03

Related results

Birgé and Massart (2007): similar theoretical results when the noise is Gaussian homoscedastic (either polynomial or exponential collections of models). Successfully applied to change-point detection (Lebarbier, 2005).

The slope heuristics experimentally works in several other frameworks: mixture models (Maugis and Michel, 2008), clustering (Baudry, 2007), spatial statistics (Verzelen, 2008), estimation of oil reserves (Lepez, 2002), genomics (Villers, 2007).

Limitations of linear penalties: illustration

Y = X + (1 + 1_{X ≤ 1/2}) ε, n = 1000 data points.

Regular histograms on [0; 1/2] (D_{m,1} pieces), then regular histograms on [1/2; 1] (D_{m,2} pieces).

[Figure.] The ideal penalty is not a linear function of the dimension.

Limitations of linear penalties: m̂(K*) ≠ m*

[Figure: density of (D_{m̂(K*),1}, D_{m̂(K*),2}) and (D_{m*,1}, D_{m*,2}) over N = 1000 samples.]

Limitations of linear penalties: theory

Y = X + σ(X) ε with X ∼ U([0; 1]), E[ε | X] = 0, E[ε² | X] = 1, and

∫_0^{1/2} σ(x)² dx ≠ ∫_{1/2}^1 σ(x)² dx.

Regular histograms on [0; 1/2] (1 ≤ D_{m,1} ≤ n/(2 ln(n)²) pieces), then regular histograms on [1/2; 1] (1 ≤ D_{m,2} ≤ n/(2 ln(n)²) pieces).

Theorem (A. 2008, arXiv:0812.3141). There exist absolute constants C, η > 0 and an event of probability at least 1 − C n^{−2} on which

∀ K > 0, ∀ m̂(K) ∈ argmin_{m ∈ M_n} { P_n γ(ŝ_m) + K D_m },

ℓ(s, ŝ_{m̂(K)}) ≥ (1 + η) inf_{m ∈ M_n} { ℓ(s, ŝ_m) }.

Resampling heuristics (bootstrap, Efron 1979)

Real world:       P --(sampling)--> P_n ==> ŝ_m
Bootstrap world:  P_n --(resampling)--> P_n^W ==> ŝ_m^W

pen_id(m) = (P − P_n) γ(ŝ_m) = F(P, P_n) ≈ F(P_n, P_n^W) = (P_n − P_n^W) γ(ŝ_m^W)

V-fold (subsampling): P_n^W = (1 / (n − Card(B_J))) Σ_{i ∉ B_J} δ_{(X_i, Y_i)} with J ∼ U(1, ..., V)

V-fold penalization

Ideal penalty: (P − P_n)(γ(ŝ_m)).

V-fold penalty:

pen(m) = (C / V) Σ_{j=1}^V [ (P_n − P_n^{(−j)})(γ(ŝ_m^{(−j)})) ],

where ŝ_m^{(−j)} ∈ argmin_{t ∈ S_m} P_n^{(−j)} γ(t) and C ≥ V − 1 is to be chosen; C = V − 1 estimates the ideal penalty (almost) unbiasedly.

The final estimator is ŝ_{m̂} with m̂ ∈ argmin_{m ∈ M} { P_n γ(ŝ_m) + pen(m) }.
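A sketch of this penalty for the regressogram case (my illustration, again assuming the hypothetical regressogram/predict helpers above; folds are taken deterministic for simplicity):

    import numpy as np

    def vfold_penalty(X, Y, D, V, C=None):
        """V-fold penalty (C/V) * sum_j (P_n - P_n^(-j)) gamma(s_hat_m^(-j))
        for the D-bin regressogram; C = V - 1 is (almost) unbiased."""
        n = len(X)
        C = (V - 1) if C is None else C
        folds = np.arange(n) % V
        total = 0.0
        for j in range(V):
            train = folds != j
            beta = regressogram(X[train], Y[train], D)     # s_hat_m^(-j)
            resid = (predict(beta, X) - Y) ** 2
            total += resid.mean() - resid[train].mean()    # P_n - P_n^(-j)
        return C * total / V

The final estimator then minimizes the empirical risk plus this penalty over the collection of models, as in the display above.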

Non-asymptotic pathwise oracle inequality

Fixed V, or V = n; C ≈ V − 1.
Histograms; "small" number of models (Card(M_n) ≤ ♦ n);
Bounded data: ||Y||_∞ ≤ A < ∞;
Noise level bounded from below: 0 < σ_min ≤ σ(X);
Smooth s: non-constant, α-Hölderian.

Theorem (A. 2008, arXiv:0802.0566). Under a "reasonable" set of assumptions on P, with probability at least 1 − ♦ n^{−2},

ℓ(s, ŝ_{m̂}) ≤ (1 + ln(n)^{−1/5}) inf_{m ∈ M} { ℓ(s, ŝ_m) }.

Simulation framework

Y_i = s(X_i) + σ(X_i) ε_i, with X_i i.i.d. U([0; 1]) and ε_i i.i.d. N(0, 1).

M_n: regular histograms on [0, 1/2] (D_1 pieces) and on [1/2, 1] (D_2 pieces), with 1 ≤ D_1, D_2 ≤ n / (2 log(n)).

⇒ Benchmark: C_classical = E[ℓ(s, ŝ_{m̂})] / E[inf_{m ∈ M} ℓ(s, ŝ_m)], computed with N = 1000 samples.

Simulations: sin, n = 200, σ(x) = x, 2 bin sizes

[Figures: data and target function.]

Mallows            3.69 ± 0.07
2-fold             2.54 ± 0.05
5-fold             2.58 ± 0.06
10-fold            2.60 ± 0.06
20-fold            2.58 ± 0.06
leave-one-out      2.59 ± 0.06
pen 2-f            3.06 ± 0.07
pen 5-f            2.75 ± 0.06
pen 10-f           2.65 ± 0.06
pen Loo            2.59 ± 0.06
Mallows ×1.25      3.17 ± 0.07
pen 2-f ×1.25      2.75 ± 0.06
pen 5-f ×1.25      2.38 ± 0.06
pen 10-f ×1.25     2.28 ± 0.05
pen Loo ×1.25      2.21 ± 0.05

Other resampling-based penalties

Efron's bootstrap penalties (Efron 1983, Shibata 1997):

pen(m) = E[ (P_n − P_n^W)(γ(ŝ_m^W)) | (X_i, Y_i)_{1 ≤ i ≤ n} ]

General resampling penalties (A. 2008, hal-00262478).

Rademacher complexities (Koltchinskii 2001; Bartlett, Boucheron, Lugosi 2002): subsampling,

pen_id(m) ≤ pen_id^glo(m) = sup_{t ∈ S_m} (P − P_n) γ(t, ·);

idem with general exchangeable weights (Fromont 2004).

Local Rademacher complexities (Bartlett, Bousquet, ...)

Cross-validation procedures

Hold-out, cross-validation, leave-one-out, V-fold cross-validation: I ⊂ {1, ..., n} random sub-sample of size q (VFCV: q = n(V − 1)/V).

V-fold cross-validation is biased ⇒ suboptimal model selection when V is fixed as n → ∞ (A. 2008, arXiv:0802.0566).

V-fold penalization with C = V − 1 ⇔ Burman's corrected V-fold cross-validation (1989).

Conclusion

Shape of the penalty: estimated by resampling (V-fold, bootstrap, exchangeable bootstrap, ...) ⇒ adaptation to unknown variations of the noise level.

Multiplying constant: estimated thanks to the slope heuristics (model-selection based estimator) ⇒ oracle inequalities with constant 1 + ε_n, even when pen_0(m) is a V-fold or resampling penalty, inside the slope heuristics algorithm.

Cross-validation and resampling penalties can also be used for change-point detection, i.e., for detecting changes in the mean of a heteroscedastic sequence (joint work with A. Celisse, arXiv:0902.3977).

Change-point detection via cross-validation

∀ 1 ≤ i ≤ n, Y_i = s(t_i) + σ(t_i) ε_i with E[ε_i] = 0 and E[ε_i²] = 1.

Goal: detect changes in the mean s of the signal Y ⇒ model selection.

No assumption on the variance σ(t_i)².

Birgé and Massart's penalty (assumes σ(t_i) ≡ σ):

pen(m) = (C D_m / n) (5 + 2 log(n / D_m))

Fixed D, homoscedastic data; n = 100, σ = 0.25, D = 4

[Figure: (t_i, Y_i) with the ERM and oracle segmentations.]

Fixed D, heteroscedastic data; n = 100, ||σ|| = 0.30, D = 6

[Figures: (t_i, Y_i) with the ERM, oracle, and Loo segmentations.]

Homoscedastic data: loss as a function of D

[Figure: ERM, Loo, Lpo20, Lpo50.]

Heteroscedastic data: loss as a function of D

[Figure: ERM, Loo, Lpo20, Lpo50.]

Homoscedastic data: estimation of the loss for every D

[Figure: Loss, VF5, BM.]

Heteroscedastic data: estimation of the loss for every D

[Figure: Loss, VF5, BM.]

A family of two-step change-point detection algorithms

1. ∀ D ∈ {1, ..., D_max}, select a model m̂(D) of dimension D:

   m̂(D) ∈ argmin_{m ∈ M_n, D_m = D} { crit_1(m; (t_i, Y_i)_i) }

   Examples of crit_1: empirical risk, leave-p-out or V-fold estimators of the risk.

2. Select D̂:

   D̂ ∈ argmin_{D ∈ {1, ..., D_max}} { crit_2(D; (t_i, Y_i)_i; crit_1(·)) }

   Examples of crit_2: penalized empirical criterion, V-fold cross-validation estimator of the risk (a sketch of both steps is given after this list).
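A minimal sketch of this two-step family (my illustration: step 1 instantiated with crit_1 = empirical risk via dynamic programming, step 2 with crit_2 = Birgé and Massart's penalized criterion; the talk's point is that resampling-based choices of crit_1/crit_2 behave better under heteroscedasticity):

    import numpy as np

    def best_rss_per_dimension(Y, D_max):
        """Step 1 with crit_1 = empirical risk: for each D = 1..D_max, the
        residual sum of squares of the best segmentation of Y into D
        pieces, computed by dynamic programming."""
        n = len(Y)
        c1 = np.concatenate([[0.0], np.cumsum(Y)])
        c2 = np.concatenate([[0.0], np.cumsum(np.asarray(Y) ** 2)])
        def cost(i, j):                   # RSS of Y[i:j] fitted by its mean
            s = c1[j] - c1[i]
            return c2[j] - c2[i] - s * s / (j - i)
        best = np.full((D_max + 1, n + 1), np.inf)
        best[0, 0] = 0.0
        for D in range(1, D_max + 1):
            for j in range(D, n + 1):
                best[D, j] = min(best[D - 1, i] + cost(i, j)
                                 for i in range(D - 1, j))
        return best[1:, n]                # entry D - 1 corresponds to D pieces

    def select_dimension(rss, n, sigma2):
        """Step 2 with crit_2 = Birge and Massart's penalized criterion
        (illustrative; a VFCV estimate of the risk can be used instead)."""
        D = np.arange(1, len(rss) + 1)
        crit = rss / n + sigma2 * (D / n) * (5.0 + 2.0 * np.log(n / D))
        return int(np.argmin(crit)) + 1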

Simulation results

Deterministic (s, σ):

σ      [Emp, VF5]    [Loo, VF5]    [Lpo20, VF5]   [Emp, BM]
cst    4.41 ± 0.02   4.54 ± 0.02   4.62 ± 0.02    4.39 ± 0.01
p-c    6.32 ± 0.02   5.74 ± 0.02   5.81 ± 0.02    8.47 ± 0.03
sine   5.97 ± 0.02   5.72 ± 0.02   5.86 ± 0.02    7.59 ± 0.03

Random (s, σ):

σ   [Emp, VF5]    [Loo, VF5]    [Lpo20, VF5]   [Emp, BM]
A   4.78 ± 0.03   4.65 ± 0.03   4.78 ± 0.03    6.82 ± 0.03
B   5.09 ± 0.03   4.88 ± 0.03   4.91 ± 0.03    7.21 ± 0.04
C   7.17 ± 0.05   6.61 ± 0.05   6.49 ± 0.05    13.49 ± 0.07

Bias of cross-validation

Ideal criterion: P γ(ŝ_m).

Regression on a model of histograms with D_m pieces (σ(X) ≡ σ for simplicity):

E[P γ(ŝ_m)] ≈ P γ(s_m) + D_m σ² / n

E[P_n^{(j)} γ(ŝ_m^{(−j)})] = E[P γ(ŝ_m^{(−j)})] ≈ P γ(s_m) + (V / (V − 1)) D_m σ² / n

⇒ bias if V is fixed ("overpenalization").

Suboptimality of V-fold cross-validation

Y = X + σ ε with ε bounded and σ > 0; M: family of regular histograms on X = [0, 1]; V fixed.

Theorem (A. 2008). With probability at least 1 − ♦ n^{−2},

ℓ(s, ŝ_{m̂}) ≥ (1 + κ(V)) inf_{m ∈ M} { ℓ(s, ŝ_m) }, with κ(V) > 0.


Simulations: HeaviSine, n = 2048, σ ≡ 1

[Figure: HeaviSine signal.]

Models: dyadic regular histograms.

2-fold          1.002 ± 0.003
5-fold          1.014 ± 0.003
10-fold         1.021 ± 0.003
20-fold         1.029 ± 0.004
leave-one-out   1.034 ± 0.004

Choice of V

Optimal performance when V = V*: a trade-off between variability and bias (V* is difficult to find from the data).

Large signal-to-noise ratio: V* → ∞ when n → ∞ (suboptimality result if V is fixed), but then V* is too large for computations.

Small signal-to-noise ratio: V* = 2 is possible, but unsatisfactory (highly variable).

V should also be chosen according to the available computation time.
