
Optimal model selection

Sylvain Arlot

1. CNRS
2. École Normale Supérieure (Paris), LIENS, WILLOW team

EMS 2009, Toulouse, 23/07/2009

2/23

Statistical framework: regression on a random design

$(X_1, Y_1), \ldots, (X_n, Y_n) \in \mathcal{X} \times \mathcal{Y}$ i.i.d., with $(X_i, Y_i) \sim P$ unknown

$Y = s(X) + \sigma(X)\,\varepsilon$, where $X \in \mathcal{X} \subset \mathbb{R}^d$ and $Y \in \mathcal{Y} = [0, 1]$ or $\mathbb{R}$

Noise $\varepsilon$: $\mathbb{E}[\varepsilon \mid X] = 0$, $\mathbb{E}[\varepsilon^2 \mid X] = 1$; noise level $\sigma(X)$

A predictor $t : \mathcal{X} \to \mathcal{Y}$?
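To make the framework concrete, here is a minimal simulation sketch. The slides only fix the first two conditional moments of $\varepsilon$; the Gaussian noise, and the sin regression function with $\sigma(x) = x$ borrowed from the simulations later in the talk, are assumptions on my part.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, s, sigma):
    """Draw (X_1, Y_1), ..., (X_n, Y_n) i.i.d. from Y = s(X) + sigma(X) * eps, X ~ U([0, 1])."""
    X = rng.uniform(0.0, 1.0, size=n)
    eps = rng.standard_normal(n)       # satisfies E[eps | X] = 0 and E[eps^2 | X] = 1
    return X, s(X) + sigma(X) * eps

# Hypothetical choices echoing the "sin, sigma(x) = x" simulations shown later:
s = lambda x: np.sin(np.pi * x)
sigma = lambda x: x
X, Y = simulate(200, s, sigma)
```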


3/23

Loss function, least-squares estimator

Least-squares risk: $P\gamma(t, \cdot) = \mathbb{E}[\gamma(t, (X, Y))]$ with $\gamma(t, (x, y)) = (t(x) - y)^2$

Loss function: $\ell(s, t) = P\gamma(t, \cdot) - P\gamma(s, \cdot) = \mathbb{E}\big[(t(X) - s(X))^2\big]$

Empirical risk minimizer on $S_m$ (= model):
$$\hat{s}_m \in \arg\min_{t \in S_m} P_n \gamma(t, \cdot) = \arg\min_{t \in S_m} \frac{1}{n} \sum_{i=1}^n (t(X_i) - Y_i)^2 .$$

E.g., histograms on a partition $(I_\lambda)_{\lambda \in \Lambda_m}$ of $\mathcal{X}$:
$$\hat{s}_m = \sum_{\lambda \in \Lambda_m} \hat{\beta}_\lambda \mathbf{1}_{I_\lambda}, \qquad \hat{\beta}_\lambda = \frac{1}{\mathrm{Card}\{X_i \in I_\lambda\}} \sum_{X_i \in I_\lambda} Y_i .$$
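A minimal sketch of the histogram (regressogram) estimator above: on each cell of the partition, $\hat{s}_m$ is the mean of the $Y_i$ whose $X_i$ falls in that cell. Returning 0 on empty cells is a convention I add here, not something stated on the slide.

```python
import numpy as np

def regressogram(X, Y, edges):
    """Least-squares fit over histograms on the partition of [edges[0], edges[-1]]
    defined by `edges`: beta_hat_lambda = mean of the Y_i with X_i in cell I_lambda."""
    D = len(edges) - 1
    cell = np.clip(np.searchsorted(edges, X, side="right") - 1, 0, D - 1)
    counts = np.bincount(cell, minlength=D)
    sums = np.bincount(cell, weights=Y, minlength=D)
    beta = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)  # 0 on empty cells (convention)

    def predict(x):
        j = np.clip(np.searchsorted(edges, np.asarray(x), side="right") - 1, 0, D - 1)
        return beta[j]

    return predict

# e.g., a regular partition of [0, 1] into D cells: regressogram(X, Y, np.linspace(0, 1, D + 1))
```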


4/23

Model selection

$(S_m)_{m \in \mathcal{M}} \longrightarrow (\hat{s}_m)_{m \in \mathcal{M}} \longrightarrow \hat{s}_{\hat{m}}$ ???

Goals:

Oracle inequality (in expectation, or with a large probability):
$$\ell(s, \hat{s}_{\hat{m}}) \le C \inf_{m \in \mathcal{M}} \{\ell(s, \hat{s}_m) + R(m, n)\}$$

Adaptivity (provided $(S_m)_{m \in \mathcal{M}_n}$ is well chosen), e.g., to the smoothness of $s$ or to the variations of $\sigma$
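In simulations, where $s$ is known, the oracle model $m^\star \in \arg\min_m \ell(s, \hat{s}_m)$ can be computed directly, which is how the benchmark ratios later in the talk are obtained. A sketch under the assumption $X \sim \mathcal{U}([0, 1])$, so the excess loss is approximated by a grid average; `fitters` is a hypothetical list of fit functions, one per model $S_m$.

```python
import numpy as np

def excess_loss(s, predict, grid=np.linspace(0.0, 1.0, 10_001)):
    """Approximate ell(s, t) = E[(t(X) - s(X))^2] for X ~ U([0, 1]) by a grid average."""
    return float(np.mean((predict(grid) - s(grid)) ** 2))

def oracle(s, X, Y, fitters):
    """Best model in hindsight: argmin_m ell(s, s_hat_m). Uses the true s,
    so it is only available in simulation experiments."""
    losses = [excess_loss(s, fit(X, Y)) for fit in fitters]
    return int(np.argmin(losses)), min(losses)
```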


5/23

Penalization

$$\hat{m} \in \arg\min_{m \in \mathcal{M}} \{P_n \gamma(\hat{s}_m) + \mathrm{pen}(m)\}$$

Unbiased risk estimation principle
⇒ Ideal penalty: $\mathrm{pen}_{\mathrm{id}}(m) = (P - P_n)(\gamma(\hat{s}_m, \cdot))$

$\mathrm{pen}(m) = 2\sigma^2 D_m / n$ (Mallows 1973), $\mathrm{pen}(m) = 2\hat{\sigma}^2 D_m / n$ or $\hat{K} D_m$
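A sketch of penalized selection with Mallows' penalty $2\sigma^2 D_m / n$, assuming a known noise level $\sigma^2$ (in practice an estimate $\hat{\sigma}^2$, or a data-driven constant $\hat{K}$, would take its place). The names `fitters`, `dims` and `sigma2` are hypothetical: one fit function and one dimension $D_m$ per model.

```python
import numpy as np

def select_penalized(X, Y, fitters, pens):
    """m_hat = argmin_m { P_n gamma(s_hat_m) + pen(m) }."""
    crit = [np.mean((fit(X, Y)(X) - Y) ** 2) + pen for fit, pen in zip(fitters, pens)]
    return int(np.argmin(crit))

def mallows_penalties(dims, sigma2, n):
    """Mallows' C_p penalty: pen(m) = 2 * sigma^2 * D_m / n."""
    return [2.0 * sigma2 * D / n for D in dims]

# Usage: m_hat = select_penalized(X, Y, fitters, mallows_penalties(dims, sigma2=1.0, n=len(X)))
```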

6/23

Limitations of linear penalties: illustration

$Y = X + (1 + \mathbf{1}_{X \le 1/2})\,\varepsilon$, $n = 1000$ data points

Regular histograms on $[0, 1/2]$ ($D_{m,1}$ bins), then regular histograms on $[1/2, 1]$ ($D_{m,2}$ bins).

[Figures: a data sample; the oracle choice $D_{m,1} = 1$, $D_{m,2} = 3$]

The ideal penalty is not a linear function of the dimension.

7/23

Limitations of linear penalties: illustration

[Figure]

8/23

Limitations of linear penalties: $\hat{m}(K^\star) \ne m^\star$

[Figure: density of $(D_{\hat{m}(K^\star),1}, D_{\hat{m}(K^\star),2})$ and of $(D_{m^\star,1}, D_{m^\star,2})$ over $N = 1000$ samples]

9/23

Limitations of linear penalties: theory

$Y = X + \sigma(X)\varepsilon$ with $X \sim \mathcal{U}([0, 1])$, $\mathbb{E}[\varepsilon \mid X] = 0$, $\mathbb{E}[\varepsilon^2 \mid X] = 1$, and
$$\int_0^{1/2} \sigma(x)^2 \, dx \ne \int_{1/2}^1 \sigma(x)^2 \, dx$$

Regular histograms on $[0, 1/2]$ ($1 \le D_{m,1} \le n/(2\ln(n)^2)$ bins), then regular histograms on $[1/2, 1]$ ($1 \le D_{m,2} \le n/(2\ln(n)^2)$ bins).

Theorem (A. 2008, arXiv:0812.3141)
There exist constants $C, \eta > 0$ (depending only on $\sigma(\cdot)$) and an event of probability at least $1 - Cn^{-2}$ on which
$$\forall K > 0, \ \forall \hat{m}(K) \in \arg\min_{m \in \mathcal{M}_n} \{P_n \gamma(\hat{s}_m) + K D_m\}, \qquad \ell(s, \hat{s}_{\hat{m}(K)}) \ge (1 + \eta) \inf_{m \in \mathcal{M}_n} \{\ell(s, \hat{s}_m)\} .$$

10/23

Cross-validation

Training: $(X_1, Y_1), \ldots, (X_q, Y_q)$; Validation: $(X_{q+1}, Y_{q+1}), \ldots, (X_n, Y_n)$

$$\hat{s}_m^{(e)} \in \arg\min_{t \in S_m} \Big\{ \sum_{i=1}^q \gamma(t, (X_i, Y_i)) \Big\}$$

$$P_n^{(v)} = \frac{1}{n - q} \sum_{i=q+1}^n \delta_{(X_i, Y_i)} \quad \Rightarrow \quad P_n^{(v)} \gamma\big(\hat{s}_m^{(e)}\big)$$

$V$-fold cross-validation: $(B_j)_{1 \le j \le V}$ partition of $\{1, \ldots, n\}$
$$\Rightarrow \quad \hat{m} \in \arg\min_{m \in \mathcal{M}} \Big\{ \frac{1}{V} \sum_{j=1}^V P_n^{(j)} \gamma\big(\hat{s}_m^{(-j)}\big) \Big\}, \qquad \tilde{s} = \hat{s}_{\hat{m}}$$
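A minimal sketch of $V$-fold cross-validation selection as defined above. The random partition into $V$ blocks is my choice; the slide only requires some partition $(B_j)_{1 \le j \le V}$.

```python
import numpy as np

def vfold_cv_select(X, Y, fitters, V, seed=0):
    """m_hat = argmin_m (1/V) * sum_j P_n^(j) gamma(s_hat_m^(-j)).

    `fitters` is a list of functions fit(X, Y) -> predict, one per model S_m."""
    n = len(X)
    rng = np.random.default_rng(seed)
    blocks = np.array_split(rng.permutation(n), V)         # partition B_1, ..., B_V of {1, ..., n}
    crit = np.zeros(len(fitters))
    for m, fit in enumerate(fitters):
        for B in blocks:
            train = np.setdiff1d(np.arange(n), B)
            pred = fit(X[train], Y[train])                 # s_hat_m^(-j), trained outside block j
            crit[m] += np.mean((pred(X[B]) - Y[B]) ** 2)   # P_n^(j) gamma(s_hat_m^(-j))
    return int(np.argmin(crit / V))
```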

11/23

Bias of cross-validation

Ideal criterion: $P\gamma(\hat{s}_m)$

Regression on a model of histograms with $D_m$ bins ($\sigma(X) \equiv \sigma$ for simplicity):
$$\mathbb{E}[P\gamma(\hat{s}_m)] \approx P\gamma(s_m) + \frac{D_m \sigma^2}{n}$$
$$\mathbb{E}\big[P_n^{(j)} \gamma\big(\hat{s}_m^{(-j)}\big)\big] = \mathbb{E}\big[P\gamma\big(\hat{s}_m^{(-j)}\big)\big] \approx P\gamma(s_m) + \frac{V}{V - 1} \, \frac{D_m \sigma^2}{n}$$

⇒ bias if $V$ is fixed ("overpenalization")
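As a numerical illustration of the factor above: with $V = 5$ each training sample has size $4n/5$, so the variance term is inflated by $V/(V - 1) = 5/4$, i.e., the criterion overpenalizes by about 25%; the bias shrinks as $V$ grows and becomes negligible for leave-one-out ($V = n$).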


12/23

Suboptimality of V-fold cross-validation

$Y = X + \sigma\varepsilon$ with $\varepsilon$ bounded and $\sigma > 0$; $\mathcal{M}$: family of regular histograms on $\mathcal{X} = [0, 1]$; $\hat{m}$ selected by $V$-fold cross-validation with $V$ fixed as $n$ grows.

Theorem (A. 2008, arXiv:0802.0566)
With probability at least $1 - \diamondsuit\, n^{-2}$,
$$\ell(s, \hat{s}_{\hat{m}}) \ge (1 + \kappa(V)) \inf_{m \in \mathcal{M}} \{\ell(s, \hat{s}_m)\}$$
with $\kappa(V) > 0$.

13/23

Simulations: sin, n = 200, σ(x) = x, 2 bin sizes

[Two figures: data samples on $[0, 1]$, y-range roughly $[-4, 4]$]

Models: regular histograms on $[0, 1/2]$, then regular histograms on $[1/2, 1]$.

$$\frac{\mathbb{E}[\ell(s, \hat{s}_{\hat{m}})]}{\mathbb{E}[\inf_{m \in \mathcal{M}} \{\ell(s, \hat{s}_m)\}]} \quad \text{computed over 1000 samples:}$$

Mallows          3.69 ± 0.07
2-fold           2.54 ± 0.05
5-fold           2.58 ± 0.06
10-fold          2.60 ± 0.06
20-fold          2.58 ± 0.06
leave-one-out    2.59 ± 0.06

14/23

Resampling heuristics (bootstrap, Efron 1979)

Real world: $P \xrightarrow{\ \text{sampling}\ } P_n \Rightarrow \hat{s}_m$

Bootstrap world: $P_n \xrightarrow{\ \text{resampling}\ } P_n^W \Rightarrow \hat{s}_m^W$

$$\mathrm{pen}_{\mathrm{id}}(m) = (P - P_n)\gamma(\hat{s}_m) = F(P, P_n) \ \rightsquigarrow \ F(P_n, P_n^W) = (P_n - P_n^W)\gamma\big(\hat{s}_m^W\big), \quad \text{where } P_n^W = n^{-1} \sum_{i=1}^n W_i \, \delta_{(X_i, Y_i)} .$$

15/23

Resampling penalization

Ideal penalty: $(P - P_n)(\gamma(\hat{s}_m))$

Resampling penalty:
$$\mathrm{pen}(m) = C \, \mathbb{E}\Big[(P_n - P_n^W)\gamma\big(\hat{s}_m^W\big) \,\Big|\, (X_i, Y_i)_{1 \le i \le n}\Big], \qquad \hat{s}_m^W \in \arg\min_{t \in S_m} P_n^W \gamma(t),$$
with $C \ge C_W$ to be chosen (no bias if $C = C_W$).

The final estimator is $\hat{s}_{\hat{m}}$ with $\hat{m} \in \arg\min_{m \in \mathcal{M}} \{P_n \gamma(\hat{s}_m) + \mathrm{pen}(m)\}$.
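A Monte-Carlo sketch of the resampling penalty with Efron bootstrap weights $W \sim \mathrm{Multinomial}(n; 1/n, \ldots, 1/n)$. The number of resamples `B` and the choice of multinomial weights are my assumptions; `fit` stands for the empirical risk minimizer on the model $S_m$ (for least squares, refitting on the resampled points coincides with the weighted ERM $\hat{s}_m^W$).

```python
import numpy as np

def resampling_penalty(X, Y, fit, C, B=200, seed=0):
    """Estimate C * E[(P_n - P_n^W) gamma(s_hat_m^W) | (X_i, Y_i)_i] by Monte-Carlo."""
    n = len(X)
    rng = np.random.default_rng(seed)
    pen = 0.0
    for _ in range(B):
        W = rng.multinomial(n, np.full(n, 1.0 / n))   # Efron bootstrap weights, sum(W) = n
        idx = np.repeat(np.arange(n), W)              # resampled data set (weighted ERM)
        pred_w = fit(X[idx], Y[idx])                  # s_hat_m^W
        res = (pred_w(X) - Y) ** 2                    # gamma(s_hat_m^W, (X_i, Y_i))
        pen += np.mean(res) - np.mean(W * res)        # (P_n - P_n^W) gamma(s_hat_m^W)
    return C * pen / B

# Selection: m_hat = argmin_m { P_n gamma(s_hat_m) + resampling_penalty(X, Y, fit_m, C) }
```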


16/23

Resampling penalization with heteroscedastic data

[Figure]

17/23

Other resampling-based penalties

Efron's bootstrap penalties (Efron, 1983; Shibata, 1997):
$$\mathrm{pen}(m) = \mathbb{E}\Big[(P_n - P_n^W)\big(\gamma\big(\hat{s}_m^W\big)\big) \,\Big|\, (X_i, Y_i)_{1 \le i \le n}\Big]$$

Rademacher complexities (Koltchinskii, 2001; Bartlett, Boucheron and Lugosi, 2002): subsampling,
$$\mathrm{pen}_{\mathrm{id}}(m) \le \mathrm{pen}_{\mathrm{id}}^{\mathrm{glo}}(m) = \sup_{t \in S_m} (P - P_n)\gamma(t, \cdot)$$

Idem with general exchangeable weights (Fromont, 2004)

Local Rademacher complexities (Bartlett, Bousquet and Mendelson, 2004; Koltchinskii, 2006)

...

18/23

Non-asymptotic pathwise oracle inequality

$W$ exchangeable (e.g., bootstrap or subsampling); $C \approx C_W$
Histograms; "small" number of models ($\mathrm{Card}(\mathcal{M}_n) \le \diamondsuit\, n$)
Bounded data: $\|Y\|_\infty \le A < \infty$
Noise level bounded from below: $0 < \sigma_{\min} \le \sigma(X)$
Smooth $s$: non-constant, $\alpha$-Hölderian

Theorem (A. 2009, EJS)
Under a "reasonable" set of assumptions on $P$, with probability at least $1 - \diamondsuit\, n^{-2}$,
$$\ell(s, \hat{s}_{\hat{m}}) \le \big(1 + \ln(n)^{-1/5}\big) \inf_{m \in \mathcal{M}} \{\ell(s, \hat{s}_m)\}$$

A similar result was recently obtained in density estimation (Lerasle, 2009).


19/23

V-fold penalization

V-fold penalty:
$$\mathrm{pen}_{\mathrm{VF}}(m) = \frac{C}{V} \sum_{j=1}^V \Big[(P_n - P_n^{(-j)})\big(\gamma\big(\hat{s}_m^{(-j)}\big)\big)\Big], \qquad \hat{s}_m^{(-j)} \in \arg\min_{t \in S_m} P_n^{(-j)} \gamma(t),$$
with $C \ge V - 1$ to be chosen (no bias if $C = V - 1$; see also Burman, 1989).

The final estimator is $\hat{s}_{\hat{m}}$ with $\hat{m} \in \arg\min_{m \in \mathcal{M}} \{P_n \gamma(\hat{s}_m) + \mathrm{pen}_{\mathrm{VF}}(m)\}$

⇒ oracle inequality with constant $1 + \ln(n)^{-1/5}$ if $V = O(1)$ or $V = n$ (A. 2008, arXiv:0802.0566)
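A sketch of $V$-fold penalization as defined above: full-sample empirical risk plus the $V$-fold penalty, with $C = V - 1$ as the default (unbiased) choice. As before, `fitters` is a hypothetical list of per-model fit functions.

```python
import numpy as np

def vfold_pen_select(X, Y, fitters, V, C=None, seed=0):
    """m_hat = argmin_m { P_n gamma(s_hat_m) + pen_VF(m) } with
    pen_VF(m) = (C / V) * sum_j (P_n - P_n^(-j)) gamma(s_hat_m^(-j))."""
    n = len(X)
    C = (V - 1) if C is None else C
    rng = np.random.default_rng(seed)
    blocks = np.array_split(rng.permutation(n), V)
    crit = np.zeros(len(fitters))
    for m, fit in enumerate(fitters):
        risk = np.mean((fit(X, Y)(X) - Y) ** 2)           # P_n gamma(s_hat_m)
        pen = 0.0
        for B in blocks:
            train = np.setdiff1d(np.arange(n), B)
            res = (fit(X[train], Y[train])(X) - Y) ** 2   # gamma(s_hat_m^(-j), (X_i, Y_i))
            pen += np.mean(res) - np.mean(res[train])     # (P_n - P_n^(-j)) gamma(s_hat_m^(-j))
        crit[m] = risk + C * pen / V
    return int(np.argmin(crit))
```

If the "×1.25" rows of the simulation tables denote the penalty multiplied by 1.25 (my reading), they correspond to calling this with C = 1.25 * (V - 1), i.e., deliberate slight overpenalization.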

20/23

V-fold penalization in the general framework

Resampling and V-fold penalization are well-defined in the general framework.

Constant $C_W$ or $V - 1$: could be estimated with the slope heuristics (A. and Massart, JMLR 2009).

Constant $V - 1$ for V-fold penalization:
$$\mathrm{pen}_{\mathrm{VF}}(m, n) = \frac{C}{V} \sum_{j=1}^V \big(P_n - P_n^{(-j)}\big) \gamma\big(\hat{s}_m^{(-j)}\big)$$
$$\Rightarrow \quad \mathbb{E}[\mathrm{pen}_{\mathrm{VF}}(m, n)] = \frac{C}{V} \, \mathbb{E}\Big[\mathrm{pen}_{\mathrm{id}}\Big(m, \frac{n(V - 1)}{V}\Big)\Big] = \frac{C}{V - 1} \, \mathbb{E}[\mathrm{pen}_{\mathrm{id}}(m, n)] \quad \text{if } \mathbb{E}[\mathrm{pen}_{\mathrm{id}}(m, n)] \approx \frac{\alpha(m)}{n}$$

21/23

Simulations: sin, n = 200, σ(x) = x, 2 bin sizes

Mallows            3.69 ± 0.07
2-fold             2.54 ± 0.05
5-fold             2.58 ± 0.06
10-fold            2.60 ± 0.06
20-fold            2.58 ± 0.06
leave-one-out      2.59 ± 0.06
pen 2-f            3.06 ± 0.07
pen 5-f            2.75 ± 0.06
pen 10-f           2.65 ± 0.06
pen Loo            2.59 ± 0.06
Mallows ×1.25      3.17 ± 0.07
pen 2-f ×1.25      2.75 ± 0.06
pen 5-f ×1.25      2.38 ± 0.06
pen 10-f ×1.25     2.28 ± 0.05
pen Loo ×1.25      2.21 ± 0.05

22/23

Simulations: change-point detection, n = 200

N = 5000 samples generated

5-fold             1.436 ± 0.008
10-fold            1.400 ± 0.008
20-fold            1.372 ± 0.008
pen 5-f            1.615 ± 0.011
pen 10-f           1.444 ± 0.009
pen 20-f           1.390 ± 0.008
pen 5-f ×1.25      1.462 ± 0.008
pen 10-f ×1.25     1.379 ± 0.008
pen 20-f ×1.25     1.315 ± 0.007

23/23

Conclusion

Usual model selection procedures ($C_p$, $V$-fold cross-validation) are suboptimal in some realistic frameworks.

Resampling and V-fold penalties are (first-order) optimal and robust to unknown variations of the noise level.

Theoretical results are available for regressograms (and recently in density estimation, by Lerasle, see CPS 49), but these procedures are well-defined in the general framework, rely on a widely valid heuristic, and perform well experimentally.
