Data-driven penalties for model selection

Sylvain Arlot

CNRS; École Normale Supérieure (Paris), LIENS, Willow Team

Mathematical Statistics Seminar, WIAS, Berlin, 22/04/2009

Statistical framework: regression on a random design

$(X_1, Y_1), \ldots, (X_n, Y_n) \in \mathcal{X} \times \mathcal{Y}$ i.i.d., $(X_i, Y_i) \sim P$ unknown

$Y = s(X) + \sigma(X)\,\varepsilon$, with $X \in \mathcal{X} \subset \mathbb{R}^d$ and $Y \in \mathcal{Y} = [0;1]$ or $\mathbb{R}$

noise $\varepsilon$: $\mathbb{E}[\varepsilon \mid X] = 0$, $\mathbb{E}[\varepsilon^2 \mid X] = 1$

noise level $\sigma(X)$

predictor $t : \mathcal{X} \mapsto \mathcal{Y}$ ?

Loss function, least-square estimator

Least-square risk:

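$$\mathbb{E}\,\gamma(t,(X,Y)) = P\gamma(t,\cdot) \quad\text{with}\quad \gamma(t,(x,y)) = (t(x)-y)^2$$

Empirical risk minimizer on $S_m$ (= model):

$$\widehat{s}_m \in \arg\min_{t\in S_m} P_n\gamma(t,\cdot) = \arg\min_{t\in S_m} \frac{1}{n}\sum_{i=1}^{n} (t(X_i)-Y_i)^2 .$$

E.g., histograms on a partition $(I_\lambda)_{\lambda\in\Lambda_m}$ of $\mathcal{X}$:

$$\widehat{s}_m = \sum_{\lambda\in\Lambda_m} \widehat{\beta}_\lambda \mathbf{1}_{I_\lambda}, \qquad \widehat{\beta}_\lambda = \frac{1}{\mathrm{Card}\{X_i \in I_\lambda\}} \sum_{X_i \in I_\lambda} Y_i .$$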
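The histogram estimator above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the design lives on $[0,1]$, the partition is regular, empty bins receive the value 0 by convention, and the function names (`fit_histogram`, `empirical_risk`) and the simulated data are hypothetical, not taken from the talk.

```python
import numpy as np

def fit_histogram(x, y, n_bins):
    """Least-squares histogram estimator on the regular partition of [0, 1].

    Returns beta_hat, the mean of the Y_i in each bin. Empty bins get 0.0,
    an arbitrary convention chosen for this sketch.
    """
    # Bin index of each X_i; the minimum sends x == 1.0 to the last bin.
    idx = np.minimum((x * n_bins).astype(int), n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    sums = np.bincount(idx, weights=y, minlength=n_bins)
    return np.divide(sums, counts, out=np.zeros(n_bins), where=counts > 0)

def empirical_risk(beta_hat, x, y):
    """P_n gamma(s_hat_m): mean squared residual of the histogram fit."""
    n_bins = len(beta_hat)
    idx = np.minimum((x * n_bins).astype(int), n_bins - 1)
    return np.mean((beta_hat[idx] - y) ** 2)

# Simulated data (assumption): s(x) = sin(pi x), sigma = 1, n = 200.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0.0, 1.0, n)
y = np.sin(np.pi * x) + rng.normal(0.0, 1.0, n)

beta_hat = fit_histogram(x, y, n_bins=5)
risk = empirical_risk(beta_hat, x, y)
```

Since constant functions belong to every histogram model, the empirical risk of the fit can never exceed the empirical variance of the $Y_i$.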

Loss function:

$$\ell(s,t) = P\gamma(t,\cdot) - P\gamma(s,\cdot) = \mathbb{E}\bigl[(t(X)-s(X))^2\bigr]$$

Model selection

$$(S_m)_{m\in\mathcal{M}} \longrightarrow (\widehat{s}_m)_{m\in\mathcal{M}} \longrightarrow \widehat{s}_{\widehat{m}} \;\; ???$$

Goals:

Oracle inequality (in expectation, or with a large probability):

$$\ell(s,\widehat{s}_{\widehat{m}}) \le C \inf_{m\in\mathcal{M}} \bigl\{\ell(s,\widehat{s}_m) + R(m,n)\bigr\}$$

Adaptivity (provided $(S_m)_{m\in\mathcal{M}_n}$ is well chosen), e.g., to the smoothness of $s$ or to the variations of $\sigma$

Penalization

$$\widehat{m} \in \arg\min_{m\in\mathcal{M}} \bigl\{P_n\gamma(\widehat{s}_m) + \mathrm{pen}(m)\bigr\}$$

$\mathrm{pen}(m) = \dfrac{2\sigma^2 D_m}{n}$ (Mallows 1973)

$\mathrm{pen}(m) = \dfrac{2\widehat{\sigma}^2 D_m}{n}$ or $\widehat{K} D_m$

And several other penalties (global or local Rademacher complexities, bootstrap or resampling penalties, etc.)

⇒ Ideal penalty: $\mathrm{pen}_{\mathrm{id}}(m) = (P - P_n)(\gamma(\widehat{s}_m,\cdot))$
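As a minimal sketch of the penalization principle, the snippet below selects the number of bins of a regular histogram with Mallows' penalty $2\sigma^2 D_m/n$. It assumes that $\sigma^2$ is known, that the design lies in $[0,1]$, and that the collection of models is the regular histograms with 1 to 40 pieces; the helper names and the simulated data are hypothetical.

```python
import numpy as np

def hist_empirical_risk(x, y, n_bins):
    """Empirical risk P_n gamma(s_hat_m) of the regular histogram with n_bins pieces."""
    idx = np.minimum((x * n_bins).astype(int), n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    sums = np.bincount(idx, weights=y, minlength=n_bins)
    beta = np.divide(sums, counts, out=np.zeros(n_bins), where=counts > 0)
    return np.mean((y - beta[idx]) ** 2)

def select_by_mallows(x, y, dims, sigma2):
    """Return (selected dimension, criterion values) for pen(m) = 2 sigma^2 D_m / n."""
    n = len(y)
    crit = np.array([hist_empirical_risk(x, y, d) + 2.0 * sigma2 * d / n
                     for d in dims])
    return dims[int(np.argmin(crit))], crit

# Simulated data (assumption): s(x) = sin(pi x), sigma = 1, n = 200.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0.0, 1.0, n)
y = np.sin(np.pi * x) + rng.normal(0.0, 1.0, n)

dims = list(range(1, 41))
d_hat, crit = select_by_mallows(x, y, dims, sigma2=1.0)
```

Replacing `sigma2` by an estimate gives the second variant listed above; the quality of the selection then depends on that estimate.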

Data-driven calibration of the penalty

Assume that we know (or have estimated) $\mathrm{pen}_0$ such that $K^\star\,\mathrm{pen}_0(m) \approx \mathbb{E}[\mathrm{pen}_{\mathrm{id}}(m)]$ ($K^\star$ unknown)

Examples: $\mathrm{pen}_0(m) = D_m$, Rademacher complexity, etc.

$$\widehat{m}(K) \in \arg\min_{m\in\mathcal{M}_n} \bigl\{P_n\gamma(\widehat{s}_m) + K\,\mathrm{pen}_0(m)\bigr\}$$

⇒ how to choose $K$?

[Figure: Dimension jump]

[Figure: Efficiency as a function of $K$]

Algorithm (Birgé, Massart 2007; A., Massart, JMLR 2009)

1. For every $K > 0$, compute $\widehat{m}(K) \in \arg\min_{m\in\mathcal{M}_n} \bigl\{P_n\gamma(\widehat{s}_m) + K\,\mathrm{pen}_0(m)\bigr\}$.

2. Find $\widehat{K}_{\min}$ such that $D_{\widehat{m}(K)}$ is “very large” when $K < \widehat{K}_{\min}$ and “reasonably small” when $K > \widehat{K}_{\min}$.

3. Choose the model $\widehat{m} = \widehat{m}(2\widehat{K}_{\min})$.
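The three steps can be sketched as follows. This is one possible reading, not the authors' implementation: $K$ is scanned over a finite logarithmic grid rather than "every $K > 0$", $\widehat{K}_{\min}$ is taken at the largest drop of the selected complexity between consecutive grid points, and the models are regular histograms on $[0,1]$ with $\mathrm{pen}_0(m) = D_m$. The function names, the grid and the simulated data are assumptions.

```python
import numpy as np

def hist_empirical_risk(x, y, n_bins):
    """Empirical risk of the regular histogram estimator with n_bins pieces."""
    idx = np.minimum((x * n_bins).astype(int), n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    sums = np.bincount(idx, weights=y, minlength=n_bins)
    beta = np.divide(sums, counts, out=np.zeros(n_bins), where=counts > 0)
    return np.mean((y - beta[idx]) ** 2)

def slope_heuristics(emp_risk, pen0, k_grid):
    """Return (K_min_hat, index of the selected model, path of m_hat(K))."""
    # Step 1: m_hat(K) for every K on the grid.
    crit = emp_risk[None, :] + k_grid[:, None] * pen0[None, :]
    path = np.argmin(crit, axis=1)
    # Step 2: K_min_hat sits right after the largest drop in complexity.
    pen_path = pen0[path]
    drops = pen_path[:-1] - pen_path[1:]
    k_min = k_grid[int(np.argmax(drops)) + 1]
    # Step 3: final model selected with twice the minimal constant.
    m_hat = int(np.argmin(emp_risk + 2.0 * k_min * pen0))
    return k_min, m_hat, path

# Simulated data (assumption): s(x) = sin(pi x), sigma = 1, n = 200.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0.0, 1.0, n)
y = np.sin(np.pi * x) + rng.normal(0.0, 1.0, n)

dims = np.arange(1, 41)
emp_risk = np.array([hist_empirical_risk(x, y, d) for d in dims])
pen0 = dims.astype(float)                 # pen_0(m) = D_m
k_grid = np.logspace(-5, 1, 300)          # hypothetical grid of constants K
k_min, m_hat, path = slope_heuristics(emp_risk, pen0, k_grid)
d_hat = dims[m_hat]
```

With $\mathrm{pen}_0(m) = D_m$ and homoscedastic noise, one would expect `k_min` to be of order $\sigma^2/n$; a thresholded variant (smallest $K$ for which the selected dimension falls below a fixed level) is another common way to locate the jump.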

[Figures: The slope heuristics, for $K = 0$, $0.45K^\star$, $0.5K^\star$, $0.55K^\star$, $0.75K^\star$ and $K^\star$]

The slope heuristics: informal argument

Two theorems

Histograms; “small” number of models ($\mathrm{Card}(\mathcal{M}_n) \le \diamond\, n^{\diamond}$)

Bounded data: $\|Y\|_\infty \le A < \infty$

Noise level lower bounded: $0 < \sigma_{\min} \le \sigma(X)$

Smooth $s$: non-constant, $\alpha$-Hölderian

Theorem (Optimal penalty; A. and Massart, JMLR 2009)
If $K > K^\star/2$, with probability at least $1 - \diamond\, n^{-2}$,

$$\ell(s,\widehat{s}_{\widehat{m}(K)}) \le C_n(K) \inf_{m\in\mathcal{M}} \{\ell(s,\widehat{s}_m)\} \quad\text{and}\quad D_{\widehat{m}(K)} \le n^{1-\eta}$$

where $C_n(K) \le C(K)$, $C_n(K^\star) \le 1 + \ln(n)^{-1/5}$ and $\eta > 0$ may depend on the smoothness of $s$.

Theorem (Minimal penalty; A. and Massart, JMLR 2009)
If $0 \le K < K^\star/2$, with probability at least $1 - \diamond\, n^{-2}$,

$$\ell(s,\widehat{s}_{\widehat{m}(K)}) \ge \ln(n) \inf_{m\in\mathcal{M}} \{\ell(s,\widehat{s}_m)\} \quad\text{and}\quad D_{\widehat{m}(K)} \ge \diamond\, \frac{n}{\ln(n)}$$

The slope heuristics: sketch of proof

prediction error: $P\gamma(\widehat{s}_m) = P\gamma(s_m) + P(\gamma(\widehat{s}_m) - \gamma(s_m))$

empirical risk: $P_n\gamma(\widehat{s}_m) = P_n\gamma(s_m) - P_n(\gamma(s_m) - \gamma(\widehat{s}_m))$

$$P_n(\gamma(s_m) - \gamma(\widehat{s}_m)) \approx P(\gamma(\widehat{s}_m) - \gamma(s_m))$$

Ingredients of the proof:

estimation of the expectations

concentration inequalities
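The factor 2 between the minimal and the optimal penalty can be read off the three relations above. The following heuristic computation is a sketch, assuming that $s_m$ denotes the minimizer of $P\gamma(t)$ over $t \in S_m$ (a deterministic function, so that $(P - P_n)\gamma(s_m)$ is centered) and ignoring fluctuations around expectations.

```latex
\begin{align*}
\mathrm{pen}_{\mathrm{id}}(m)
  &= (P - P_n)\gamma(\widehat{s}_m) \\
  &= P\bigl(\gamma(\widehat{s}_m) - \gamma(s_m)\bigr)
   + P_n\bigl(\gamma(s_m) - \gamma(\widehat{s}_m)\bigr)
   + (P - P_n)\gamma(s_m) \\
  &\approx 2\, P_n\bigl(\gamma(s_m) - \gamma(\widehat{s}_m)\bigr)
   \qquad \text{(in expectation)} .
\end{align*}
% With pen(m) = c * E[P_n(gamma(s_m) - gamma(s_hat_m))], the penalized criterion is
%   P_n gamma(s_hat_m) + pen(m)  ~  P_n gamma(s_m) - (1 - c) E[P_n(gamma(s_m) - gamma(s_hat_m))].
% For c < 1 the criterion keeps decreasing with the size of the model (largest models selected);
% c = 1 is the minimal penalty, and c = 2 recovers the ideal penalty in expectation.
```

This is why the algorithm multiplies $\widehat{K}_{\min}$ by 2 in its last step.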

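Illustration: $s(x) = \sin(\pi x)$, $n = 200$, $\sigma \equiv 1$

[Figure: data plot; horizontal axis from 0 to 1, vertical axis from −4 to 4]

$\mathrm{pen}_0(m) = D_m$

Efficiency: $\mathbb{E}[\ell(s,\widehat{s}_{\widehat{m}})] \,/\, \mathbb{E}[\inf_{m\in\mathcal{M}} \{\ell(s,\widehat{s}_m)\}]$, computed over 1000 samples.

Model selection method         Efficiency
Mallows ($\sigma$)             2.03 ± 0.04
Mallows ($\widehat{\sigma}$)   1.93 ± 0.04
Slope (threshold)              1.88 ± 0.03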

Related results

Birgé and Massart (2007): similar theoretical results when the noise is Gaussian homoscedastic (either polynomial or exponential collections of models).

Successfully applied to change-point detection (Lebarbier, 2005).

The slope heuristics experimentally works in several other frameworks: mixture models (Maugis and Michel, 2008), clustering (Baudry, 2007), spatial statistics (Verzelen, 2008), estimation of oil reserves (Lepez, 2002), genomics (Villers, 2007).

Limitations of linear penalties: illustration

$Y = X + (1 + \mathbf{1}_{X \le 1/2})\,\varepsilon$

$n = 1000$ data points

Regular histograms on $[0; \tfrac{1}{2}]$ ($D_{m,1}$ pieces), then regular histograms on $[\tfrac{1}{2}; 1]$ ($D_{m,2}$ pieces).

The ideal penalty is not a linear function of the dimension.

Limitations of linear penalties: $\widehat{m}(K^\star) \neq m^\star$

[Figure: density of $(D_{\widehat{m}(K^\star),1}, D_{\widehat{m}(K^\star),2})$ and $(D_{m^\star,1}, D_{m^\star,2})$ according to $N = 1000$ samples]

Limitations of linear penalties: theory

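$Y = X + \sigma(X)\,\varepsilon$ with $X \sim \mathcal{U}([0;1])$, $\mathbb{E}[\varepsilon \mid X] = 0$, $\mathbb{E}[\varepsilon^2 \mid X] = 1$ and

$$\int_0^{1/2} (\sigma(x))^2\,dx \neq \int_{1/2}^{1} (\sigma(x))^2\,dx$$

Regular histograms on $[0; \tfrac{1}{2}]$ ($1 \le D_{m,1} \le n/(2\ln(n)^2)$ pieces), then regular histograms on $[\tfrac{1}{2}; 1]$ ($1 \le D_{m,2} \le n/(2\ln(n)^2)$ pieces).

Theorem (A. 2008, arXiv:0812.3141)
There exist absolute constants $C, \eta > 0$ and an event of probability at least $1 - Cn^{-2}$ on which

$$\forall K > 0,\ \forall\, \widehat{m}(K) \in \arg\min_{m\in\mathcal{M}_n} \bigl\{P_n\gamma(\widehat{s}_m) + K D_m\bigr\}, \qquad \ell(s,\widehat{s}_{\widehat{m}(K)}) \ge (1+\eta) \inf_{m\in\mathcal{M}_n} \{\ell(s,\widehat{s}_m)\} .$$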

Resampling heuristics (bootstrap, Efron 1979)

Real world: $P \xrightarrow{\ \text{sampling}\ } P_n \Longrightarrow \widehat{s}_m$

Bootstrap world: $P_n \xrightarrow{\ \text{resampling}\ } P_n^W \Longrightarrow \widehat{s}_m^W$

$$\mathrm{pen}_{\mathrm{id}}(m) = (P - P_n)\gamma(\widehat{s}_m) = F(P, P_n) \;\rightsquigarrow\; F(P_n, P_n^W) = (P_n - P_n^W)\gamma(\widehat{s}_m^W)$$

V-fold (subsampling):

$$P_n^W = \frac{1}{n - \mathrm{Card}(B_J)} \sum_{i \notin B_J} \delta_{(X_i, Y_i)} \quad\text{with}\quad J \sim \mathcal{U}(1, \ldots, V)$$

V -fold penalization

Ideal penalty:

$$(P - P_n)(\gamma(\widehat{s}_m))$$

V-fold penalty:

$$\mathrm{pen}(m) = \frac{C}{V} \sum_{j=1}^{V} \Bigl[(P_n - P_n^{(-j)})\bigl(\gamma(\widehat{s}_m^{(-j)})\bigr)\Bigr], \qquad \widehat{s}_m^{(-j)} \in \arg\min_{t\in S_m} P_n^{(-j)}\gamma(t)$$

with $C \ge V - 1$ to be chosen

$C = V - 1$ for estimating (almost) unbiasedly the ideal penalty

The final estimator is $\widehat{s}_{\widehat{m}}$ with

$$\widehat{m} \in \arg\min_{m\in\mathcal{M}} \bigl\{P_n\gamma(\widehat{s}_m) + \mathrm{pen}(m)\bigr\}$$
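The V-fold penalty can be sketched for regular histograms on $[0,1]$ as follows. The assumptions of this illustration: the blocks $B_j$ come from a random partition of the indices into $V$ blocks of (almost) equal size, empty bins receive the value 0, and $C = V - 1$. The helper names and the simulated data are hypothetical.

```python
import numpy as np

def bin_index(x, n_bins):
    return np.minimum((x * n_bins).astype(int), n_bins - 1)

def fit_hist(x, y, n_bins):
    """Bin means of the regular histogram estimator (0.0 in empty bins)."""
    idx = bin_index(x, n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    sums = np.bincount(idx, weights=y, minlength=n_bins)
    return np.divide(sums, counts, out=np.zeros(n_bins), where=counts > 0)

def risk(beta, x, y):
    """Mean squared residual of the histogram with bin values beta on (x, y)."""
    return np.mean((y - beta[bin_index(x, len(beta))]) ** 2)

def vfold_penalty(x, y, n_bins, blocks, C):
    """pen(m) = (C / V) * sum_j (P_n - P_n^{(-j)}) gamma(s_hat_m^{(-j)})."""
    total = 0.0
    for block in blocks:
        train = np.ones(len(y), dtype=bool)
        train[block] = False                      # remove block B_j
        beta_j = fit_hist(x[train], y[train], n_bins)
        # P_n: all the data; P_n^{(-j)}: the data outside block B_j.
        total += risk(beta_j, x, y) - risk(beta_j, x[train], y[train])
    return C * total / len(blocks)

# Simulated heteroscedastic data (assumption): s(x) = sin(pi x), sigma(x) = x.
rng = np.random.default_rng(0)
n, V = 200, 5
x = rng.uniform(0.0, 1.0, n)
y = np.sin(np.pi * x) + x * rng.normal(0.0, 1.0, n)

blocks = np.array_split(rng.permutation(n), V)
dims = np.arange(1, 21)
pens = np.array([vfold_penalty(x, y, d, blocks, C=V - 1) for d in dims])
crit = np.array([risk(fit_hist(x, y, d), x, y) for d in dims]) + pens
d_hat = dims[int(np.argmin(crit))]
```

Taking `C` larger than `V - 1` overpenalizes; the constant is exposed as a parameter because the slides leave it "to be chosen".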

Non-asymptotic pathwise oracle inequality

Fixed $V$ or $V = n$; $C \approx V - 1$

Histograms; “small” number of models ($\mathrm{Card}(\mathcal{M}_n) \le \diamond\, n^{\diamond}$)

Bounded data: $\|Y\|_\infty \le A < \infty$

Noise level lower bounded: $0 < \sigma_{\min} \le \sigma(X)$

Smooth $s$: non-constant, $\alpha$-Hölderian

Theorem (A. 2008, arXiv:0802.0566)
Under a “reasonable” set of assumptions on $P$, with probability at least $1 - \diamond\, n^{-2}$,

$$\ell(s,\widehat{s}_{\widehat{m}}) \le \bigl(1 + \ln(n)^{-1/5}\bigr) \inf_{m\in\mathcal{M}} \{\ell(s,\widehat{s}_m)\}$$

Simulation framework

$Y_i = s(X_i) + \sigma(X_i)\,\varepsilon_i$, with $X_i \overset{\text{i.i.d.}}{\sim} \mathcal{U}([0;1])$ and $\varepsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1)$

$\mathcal{M}_n$: histograms regular on $[0, 1/2]$ ($D_1$ pieces), and on $[1/2, 1]$ ($D_2$ pieces), with $1 \le D_1, D_2 \le \dfrac{n}{2\log(n)}$.

⇒ Benchmark:

$$C_{\mathrm{classical}} = \frac{\mathbb{E}[\ell(s,\widehat{s}_{\widehat{m}})]}{\mathbb{E}[\inf_{m\in\mathcal{M}} \ell(s,\widehat{s}_m)]} \quad\text{computed with } N = 1000 \text{ samples}$$

Simulations: $\sin$, $n = 200$, $\sigma(x) = x$, 2 bin sizes

[Figures: two plots; horizontal axis from 0 to 1, vertical axis from −4 to 4]

Mallows              3.69 ± 0.07
2-fold               2.54 ± 0.05
5-fold               2.58 ± 0.06
10-fold              2.60 ± 0.06
20-fold              2.58 ± 0.06
leave-one-out        2.59 ± 0.06
pen 2-f              3.06 ± 0.07
pen 5-f              2.75 ± 0.06
pen 10-f             2.65 ± 0.06
pen Loo              2.59 ± 0.06
Mallows ×1.25        3.17 ± 0.07
pen 2-f ×1.25        2.75 ± 0.06
pen 5-f ×1.25        2.38 ± 0.06
pen 10-f ×1.25       2.28 ± 0.05
pen Loo ×1.25        2.21 ± 0.05

Other resampling-based penalties

Efron’s bootstrap penalties (Efron 1983, Shibata 1997):

$$\mathrm{pen}(m) = \mathbb{E}\Bigl[(P_n - P_n^W)\bigl(\gamma(\widehat{s}_m^W)\bigr) \Bigm| (X_i, Y_i)_{1\le i\le n}\Bigr]$$

General resampling penalties (A. 2008, hal-00262478)

Rademacher complexities (Koltchinskii 2001; Bartlett, Boucheron, Lugosi 2002): subsampling

$$\mathrm{pen}_{\mathrm{id}}(m) \le \mathrm{pen}_{\mathrm{id}}^{\mathrm{glo}}(m) = \sup_{t\in S_m} (P - P_n)\gamma(t,\cdot)$$

idem with general exchangeable weights (Fromont 2004)

Local Rademacher complexities (Bartlett, Bousquet,

Cross-validation procedures

Hold-out, Cross-validation, Leave-one-out, V-fold cross-validation:

$I \subset \{1, \ldots, n\}$ random sub-sample of size $q$ (VFCV: $q = \dfrac{n(V-1)}{V}$).

V-fold cross-validation is biased
⇒ suboptimal model selection when $V$ is fixed as $n \to \infty$ (A. 2008, arXiv:0802.0566)

V-fold penalization with $C = V - 1$
⇔ Burman’s corrected V-fold cross-validation (1989).

Conclusion

Shape of the penalty: estimated by resampling (V-fold, bootstrap, exchangeable bootstrap...)
⇒ adaptation to unknown variations of the noise level

Multiplying constant: estimated thanks to the slope heuristics (model-selection based estimator)
⇒ oracle inequalities with constant $1 + \varepsilon_n$, even when $\mathrm{pen}_0(m)$ is a V-fold or resampling penalty, inside the slope heuristics algorithm

Cross-validation and resampling penalties can also be used for change-point detection, i.e., for detecting changes in the mean of a heteroscedastic sequence (joint work with A. Celisse, arXiv:0902.3977)

Change-point detection via cross-validation

$\forall\, 1 \le i \le n$, $Y_i = s(t_i) + \sigma(t_i)\,\varepsilon_i$ with $\mathbb{E}[\varepsilon_i] = 0$, $\mathbb{E}[\varepsilon_i^2] = 1$

Goal: detect changes in the mean $s$ of the signal $Y$
⇒ model selection

No assumption on the variance $\sigma(t_i)^2$

Birgé and Massart’s penalty (assumes $\sigma(t_i) \equiv \sigma$):

$$\mathrm{pen}(m) = \frac{C D_m}{n} \Bigl(5 + 2\log\frac{n}{D_m}\Bigr)$$

[Figure: Fixed $D$, homoscedastic data; $n = 100$, $\sigma = 0.25$, $D = 4$. Axes: $t_i$ (horizontal), $Y_i$ (vertical). Curves: ERM, Oracle.]

[Figure: Fixed $D$, heteroscedastic data; $n = 100$, $\|\sigma\| = 0.30$, $D = 6$. Axes: $t_i$ (horizontal), $Y_i$ (vertical). Curves: Loo, ERM, Oracle.]

[Figure: Homoscedastic data: loss as a function of $D$. Curves: ERM, Loo, Lpo20, Lpo50.]

[Figure: Heteroscedastic data: loss as a function of $D$. Horizontal axis: dimension $D$. Curves: ERM, Loo, Lpo20, Lpo50.]

[Figure: Homoscedastic data: estimation of the loss for every $D$. Curves: Loss, VF_5, BM.]

[Figure: Heteroscedastic data: estimation of the loss for every $D$. Curves: Loss, VF_5, BM.]

A family of two-step change-point detection algorithms

1. $\forall D \in \{1, \ldots, D_{\max}\}$, select a model $\widehat{m}(D)$ of dimension $D$:

$$\widehat{m}(D) \in \arg\min_{m\in\mathcal{M}_n,\ D_m = D} \bigl\{\mathrm{crit}_1(m; (t_i, Y_i)_i)\bigr\}$$

Examples of $\mathrm{crit}_1$: empirical risk, leave-p-out or V-fold estimators of the risk

2. Select $\widehat{D}$:

$$\widehat{D} \in \arg\min_{D\in\{1,\ldots,D_{\max}\}} \bigl\{\mathrm{crit}_2(D; (t_i, Y_i)_i; \mathrm{crit}_1(\cdot))\bigr\}$$

Examples of $\mathrm{crit}_2$: penalized empirical criterion, V-fold cross-validation estimator of the risk
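One member of this family, with the empirical risk as $\mathrm{crit}_1$ and a penalized empirical criterion of Birgé-Massart form as $\mathrm{crit}_2$, can be sketched with dynamic programming. This is an illustration under assumptions, not the procedure evaluated in the talk: the constant `C` is simply passed in by the user (its calibration is outside this sketch), the function names are hypothetical, and the test signal is made up.

```python
import numpy as np

def best_segmentations(y, d_max):
    """Step 1 with crit_1 = empirical risk.

    opt[d, j] is the smallest residual sum of squares when y[:j] is split
    into d + 1 segments; arg[d, j] is the start index of the last segment.
    """
    n = len(y)
    s1 = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def sse(i, j):
        # Residual sum of squares of y[i:j] around its mean (i < j).
        return s2[j] - s2[i] - (s1[j] - s1[i]) ** 2 / (j - i)

    opt = np.full((d_max, n + 1), np.inf)
    arg = np.zeros((d_max, n + 1), dtype=int)
    for j in range(1, n + 1):
        opt[0, j] = sse(0, j)
    for d in range(1, d_max):
        for j in range(d + 1, n + 1):
            cands = [opt[d - 1, i] + sse(i, j) for i in range(d, j)]
            k = int(np.argmin(cands))
            opt[d, j] = cands[k]
            arg[d, j] = d + k
    return opt, arg

def breakpoints(arg, n_segments, n):
    """Start indices of segments 2..D of the best model with D = n_segments."""
    cuts, j = [], n
    for d in range(n_segments - 1, 0, -1):
        j = int(arg[d, j])
        cuts.append(j)
    return sorted(cuts)

# Made-up signal: two changes in the mean, homoscedastic noise.
rng = np.random.default_rng(0)
n = 100
s = np.concatenate((np.zeros(30), np.ones(40), np.zeros(30)))
y = s + 0.1 * rng.normal(size=n)

d_max = 10
opt, arg = best_segmentations(y, d_max)
dims = np.arange(1, d_max + 1)
C = 0.05  # user-chosen constant (assumption)
# Step 2 with crit_2 = empirical risk + C * D / n * (5 + 2 log(n / D)).
crit2 = opt[:, n] / n + C * dims / n * (5.0 + 2.0 * np.log(n / dims))
d_hat = int(dims[np.argmin(crit2)])
cuts = breakpoints(arg, d_hat, n)
```

Swapping `sse` for a leave-p-out or V-fold estimate of the risk of each segment, and `crit2` for a V-fold cross-validation estimate, would give the other combinations listed above.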

Simulation results

Deterministic $(s, \sigma)$:

σ       [Emp, VF_5]     [Loo, VF_5]     [Lpo_20, VF_5]   [Emp, BM]
cst     4.41 ± 0.02     4.54 ± 0.02     4.62 ± 0.02      4.39 ± 0.01
p-c     6.32 ± 0.02     5.74 ± 0.02     5.81 ± 0.02      8.47 ± 0.03
sine    5.97 ± 0.02     5.72 ± 0.02     5.86 ± 0.02      7.59 ± 0.03

Random $(s, \sigma)$:

σ       [Emp, VF_5]     [Loo, VF_5]     [Lpo_20, VF_5]   [Emp, BM]
A       4.78 ± 0.03     4.65 ± 0.03     4.78 ± 0.03      6.82 ± 0.03
B       5.09 ± 0.03     4.88 ± 0.03     4.91 ± 0.03      7.21 ± 0.04
C       7.17 ± 0.05     6.61 ± 0.05     6.49 ± 0.05      13.49 ± 0.07

Bias of cross-validation

Ideal criterion: $P\gamma(\widehat{s}_m)$

Regression on a model of histograms with $D_m$ pieces ($\sigma(X) \equiv \sigma$ for simplicity):

$$\mathbb{E}[P\gamma(\widehat{s}_m)] \approx P\gamma(s_m) + \frac{D_m\sigma^2}{n}$$

$$\mathbb{E}\Bigl[P_n^{(j)}\gamma\bigl(\widehat{s}_m^{(-j)}\bigr)\Bigr] = \mathbb{E}\Bigl[P\gamma\bigl(\widehat{s}_m^{(-j)}\bigr)\Bigr] \approx P\gamma(s_m) + \frac{V}{V-1}\,\frac{D_m\sigma^2}{n}$$

⇒ bias if $V$ is fixed (“overpenalization”)
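The factor $V/(V-1)$ comes from the training sample size. As a short worked step, assuming blocks of equal size $n/V$, the estimator $\widehat{s}_m^{(-j)}$ is built from $n(V-1)/V$ points, so the first approximation applies with $n$ replaced by $n(V-1)/V$:

```latex
\mathbb{E}\Bigl[P\gamma\bigl(\widehat{s}_m^{(-j)}\bigr)\Bigr]
  \approx P\gamma(s_m) + \frac{D_m\sigma^2}{n(V-1)/V}
  = P\gamma(s_m) + \frac{V}{V-1}\,\frac{D_m\sigma^2}{n} .
% Inflation factor V/(V-1) of the variance term:
%   V = 2  : 2
%   V = 5  : 1.25
%   V = 10 : 10/9, about 1.11
%   V = n  : n/(n-1), which tends to 1 (leave-one-out)
```

The variance term is therefore overestimated by a constant factor whenever $V$ stays fixed, and the overestimation only vanishes when $V$ grows with $n$.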

Suboptimality of V -fold cross-validation

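$Y = X + \sigma\varepsilon$ with $\varepsilon$ bounded and $\sigma > 0$

$\mathcal{M}$: family of regular histograms on $\mathcal{X} = [0, 1]$

$V$ fixed

Theorem (A. 2008)
With probability at least $1 - \diamond\, n^{-2}$,

$$\ell(s,\widehat{s}_{\widehat{m}}) \ge (1 + \kappa(V)) \inf_{m\in\mathcal{M}} \{\ell(s,\widehat{s}_m)\}$$

with $\kappa(V) > 0$.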

Simulations: $\sin$, $n = 200$, $\sigma(x) = x$, 2 bin sizes

[Figures: horizontal axis from 0 to 1, vertical axis from −4 to 4]

Mallows              3.69 ± 0.07
2-fold               2.54 ± 0.05
5-fold               2.58 ± 0.06
10-fold              2.60 ± 0.06
20-fold              2.58 ± 0.06
leave-one-out        2.59 ± 0.06

Simulations: HeaviSine, $n = 2048$, $\sigma \equiv 1$

[Figure: horizontal axis from 0 to 1, vertical axis from −8 to 8]

Models: dyadic regular histograms

2-fold               1.002 ± 0.003
5-fold               1.014 ± 0.003
10-fold              1.021 ± 0.003
20-fold              1.029 ± 0.004
leave-one-out        1.034 ± 0.004

Choice of V

optimal performance when $V = V^\star$: trade-off variability–bias (difficult to find $V^\star$ from the data)

SNR large:
⇒ $V^\star \to \infty$ when $n \to \infty$ (suboptimality result if $V$ fixed)
⇒ $V^\star$ too large for computations

SNR small:
⇒ $V^\star = 2$ is possible
⇒ unsatisfactory (highly variable)

$V$ should be chosen according to computation time also