Optimal model selection

Sylvain Arlot
CNRS; École Normale Supérieure (Paris), LIENS, Willow Team

EMS 2009, Toulouse, 23/07/2009
Statistical framework: regression on a random design

Data: $(X_1, Y_1), \dots, (X_n, Y_n) \in \mathcal{X} \times \mathcal{Y}$ i.i.d., $(X_i, Y_i) \sim P$ unknown.

Model: $Y = s(X) + \sigma(X)\,\varepsilon$, with $X \in \mathcal{X} \subset \mathbb{R}^d$ and $Y \in \mathcal{Y} = [0, 1]$ or $\mathbb{R}$.

Noise $\varepsilon$: $\mathbb{E}[\varepsilon \mid X] = 0$, $\mathbb{E}[\varepsilon^2 \mid X] = 1$; noise level $\sigma(X)$.

Goal: build a predictor $t : \mathcal{X} \to \mathcal{Y}$.
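A minimal simulation sketch of this framework (Python); the particular choices of $s$, $\sigma$ and of a uniform design are assumptions of the sketch, loosely following the simulation setting quoted at the end of the talk:

```python
import numpy as np

def simulate(n, s, sigma, seed=0):
    """Draw (X_1, Y_1), ..., (X_n, Y_n) i.i.d. from Y = s(X) + sigma(X) * eps,
    with X uniform on [0, 1] (an assumption of this sketch) and eps standard
    Gaussian, so that E[eps | X] = 0 and E[eps^2 | X] = 1."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=n)
    eps = rng.standard_normal(n)
    return X, s(X) + sigma(X) * eps

# Example: a heteroscedastic setting in the spirit of the simulations below
X, Y = simulate(n=200, s=np.sin, sigma=lambda x: x)
```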
Loss function, least-squares estimator

Least-squares risk: $\mathbb{E}[\gamma(t, (X, Y))] = P\gamma(t, \cdot)$ with $\gamma(t, (x, y)) = (t(x) - y)^2$.

Loss function: $\ell(s, t) = P\gamma(t, \cdot) - P\gamma(s, \cdot) = \mathbb{E}\big[(t(X) - s(X))^2\big]$.

Empirical risk minimizer on $S_m$ (= model):
$$\hat{s}_m \in \arg\min_{t \in S_m} P_n \gamma(t, \cdot) = \arg\min_{t \in S_m} \frac{1}{n} \sum_{i=1}^{n} (t(X_i) - Y_i)^2 .$$

E.g., histograms on a partition $(I_\lambda)_{\lambda \in \Lambda_m}$ of $\mathcal{X}$:
$$\hat{s}_m = \sum_{\lambda \in \Lambda_m} \hat{\beta}_\lambda \mathbf{1}_{I_\lambda}, \qquad \hat{\beta}_\lambda = \frac{1}{\operatorname{Card}\{i : X_i \in I_\lambda\}} \sum_{i : X_i \in I_\lambda} Y_i .$$
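A short sketch of this histogram (regressogram) estimator, assuming $\mathcal{X} = [0, 1]$ and a partition given by its bin edges; the helper name is hypothetical:

```python
import numpy as np

def regressogram(X, Y, edges):
    """Least-squares histogram estimator on the partition of [0, 1] defined by
    `edges`: on each cell I_lambda the estimator equals the mean of the Y_i
    whose X_i fall in that cell (beta_hat_lambda); empty cells are set to 0."""
    inner = edges[1:-1]                       # interior cell boundaries
    idx = np.digitize(X, inner)               # cell index of each X_i
    D = len(edges) - 1
    beta = np.zeros(D)
    for lam in range(D):
        in_cell = (idx == lam)
        if in_cell.any():
            beta[lam] = Y[in_cell].mean()     # beta_hat_lambda
    def s_hat(x):
        return beta[np.digitize(x, inner)]
    return s_hat, beta
```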
Model selection

$(S_m)_{m \in \mathcal{M}} \longrightarrow (\hat{s}_m)_{m \in \mathcal{M}} \longrightarrow \hat{s}_{\hat{m}}$ ???

Goals:

Oracle inequality (in expectation, or with a large probability):
$$\ell(s, \hat{s}_{\hat{m}}) \leq C \inf_{m \in \mathcal{M}} \big\{ \ell(s, \hat{s}_m) + R(m, n) \big\}$$

Adaptivity (provided $(S_m)_{m \in \mathcal{M}_n}$ is well chosen), e.g., to the smoothness of $s$ or to the variations of $\sigma$.
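To make the oracle comparison concrete, a sketch of how such a ratio can be checked on simulated data, where $s$ is known; the Monte Carlo approximation of the loss and the uniform design are assumptions of the sketch:

```python
import numpy as np

def excess_loss(s, s_hat, n_mc=100_000, seed=1):
    """Monte Carlo approximation of l(s, s_hat) = E[(s_hat(X) - s(X))^2],
    assuming X uniform on [0, 1]; s must be known, so this is for
    simulated data only."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=n_mc)
    return float(np.mean((s_hat(x) - s(x)) ** 2))

# Oracle-inequality check: with losses[m] = excess_loss(s, s_hat_m) for every
# model m and m_hat the selected model,
#     ratio = losses[m_hat] / min(losses.values())
# should stay within a constant C of 1 for a good selection procedure.
```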
Penalization

$$\hat{m} \in \arg\min_{m \in \mathcal{M}} \big\{ P_n \gamma(\hat{s}_m) + \operatorname{pen}(m) \big\}$$

Unbiased risk estimation principle
$\Rightarrow$ ideal penalty: $\operatorname{pen}_{\mathrm{id}}(m) = (P - P_n)\big(\gamma(\hat{s}_m, \cdot)\big)$

$\operatorname{pen}(m) = \dfrac{2 \sigma^2 D_m}{n}$ (Mallows 1973), $\operatorname{pen}(m) = \dfrac{2 \hat{\sigma}^2 D_m}{n}$, or $K \widehat{D}_m$.
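A sketch of this penalized selection with Mallows' $C_p$ over regular histogram models; it reuses the `regressogram` helper sketched earlier, and takes the noise variance `sigma2` as known (an assumption):

```python
import numpy as np

def select_mallows(X, Y, dims, sigma2):
    """Penalized least-squares model selection with Mallows' C_p:
    pen(m) = 2 * sigma2 * D_m / n, over regular histograms on [0, 1]
    with D_m bins (sigma2 = noise variance, assumed known here)."""
    n = len(X)
    crit = {}
    for D in dims:
        edges = np.linspace(0.0, 1.0, D + 1)
        s_hat, _ = regressogram(X, Y, edges)         # empirical risk minimizer
        emp_risk = np.mean((s_hat(X) - Y) ** 2)      # P_n gamma(s_hat_m)
        crit[D] = emp_risk + 2.0 * sigma2 * D / n    # penalized criterion
    return min(crit, key=crit.get)                   # m_hat
```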
Limitations of linear penalties: illustration

$Y = X + (1 + \mathbf{1}_{X \leq 1/2})\,\varepsilon$, $n = 1000$ data points.

Models: regular histograms on $[0, 1/2]$ ($D_{m,1}$ bins), then regular histograms on $[1/2, 1]$ ($D_{m,2}$ bins).

[Figure: a data sample; oracle model $D_{m,1} = 1$, $D_{m,2} = 3$.]

The ideal penalty is not a linear function of the dimension.
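A sketch of the model collection used in this illustration, indexed by the pair $(D_{m,1}, D_{m,2})$ of bin counts on the two halves of $[0, 1]$; the helper name and the grid of pairs are assumptions:

```python
import numpy as np

def two_piece_edges(D1, D2):
    """Bin edges of the model m = (D_{m,1}, D_{m,2}): D1 regular bins on
    [0, 1/2] followed by D2 regular bins on [1/2, 1]; dimension D_m = D1 + D2."""
    left = np.linspace(0.0, 0.5, D1 + 1)
    right = np.linspace(0.5, 1.0, D2 + 1)
    return np.concatenate([left, right[1:]])

# Model collection of the illustration: all pairs (D1, D2) in a finite grid.
models = [(D1, D2) for D1 in range(1, 11) for D2 in range(1, 11)]
```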
Limitations of linear penalties: $\hat{m}(K^\star) \neq m^\star$

[Figure: density of $(D_{\hat{m}(K^\star),1}, D_{\hat{m}(K^\star),2})$ and of $(D_{m^\star,1}, D_{m^\star,2})$ over $N = 1000$ samples; panels: $\hat{m}(K^\star)$ and $m^\star$.]
Limitations of linear penalties: theory

$Y = X + \sigma(X)\,\varepsilon$ with $X \sim \mathcal{U}([0, 1])$, $\mathbb{E}[\varepsilon \mid X] = 0$, $\mathbb{E}[\varepsilon^2 \mid X] = 1$, and
$$\int_0^{1/2} (\sigma(x))^2 \, dx \neq \int_{1/2}^{1} (\sigma(x))^2 \, dx .$$

Models: regular histograms on $[0, 1/2]$ ($1 \leq D_{m,1} \leq n / (2 \ln(n)^2)$ bins), then regular histograms on $[1/2, 1]$ ($1 \leq D_{m,2} \leq n / (2 \ln(n)^2)$ bins).

Theorem (A. 2008, arXiv:0812.3141)
There exist constants $C, \eta > 0$ (depending only on $\sigma(\cdot)$) and an event of probability at least $1 - C n^{-2}$ on which
$$\forall K > 0, \quad \forall \hat{m}(K) \in \arg\min_{m \in \mathcal{M}_n} \big\{ P_n \gamma(\hat{s}_m) + K D_m \big\}, \qquad \ell(s, \hat{s}_{\hat{m}(K)}) \geq (1 + \eta) \inf_{m \in \mathcal{M}_n} \big\{ \ell(s, \hat{s}_m) \big\} .$$
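A small experiment sketch of what the theorem rules out: scan a grid of constants $K$ and compare the loss of $\hat{m}(K)$ with the oracle loss. It reuses the `regressogram`, `two_piece_edges` and `excess_loss` helpers sketched earlier; the grid choices are assumptions:

```python
import numpy as np

def best_ratio_over_K(X, Y, s, models, K_grid):
    """For each constant K, select m_hat(K) minimizing P_n gamma(s_hat_m) + K * D_m
    over the two-piece histogram models, then return the best achievable ratio
    min_K  l(s, s_hat_{m_hat(K)}) / inf_m l(s, s_hat_m).
    The theorem states that this ratio stays >= 1 + eta with high probability."""
    emp, loss, dim = {}, {}, {}
    for (D1, D2) in models:
        s_hat, _ = regressogram(X, Y, two_piece_edges(D1, D2))
        emp[(D1, D2)] = float(np.mean((s_hat(X) - Y) ** 2))   # P_n gamma(s_hat_m)
        loss[(D1, D2)] = excess_loss(s, s_hat)                 # l(s, s_hat_m)
        dim[(D1, D2)] = D1 + D2                                # D_m
    oracle = min(loss.values())
    return min(loss[min(emp, key=lambda m: emp[m] + K * dim[m])] / oracle
               for K in K_grid)
```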
Cross-validation

Split the data: $(X_1, Y_1), \dots, (X_q, Y_q)$ (training) and $(X_{q+1}, Y_{q+1}), \dots, (X_n, Y_n)$ (validation).

$$\hat{s}_m^{(e)} \in \arg\min_{t \in S_m} \bigg\{ \sum_{i=1}^{q} \gamma(t, (X_i, Y_i)) \bigg\}, \qquad P_n^{(v)} = \frac{1}{n - q} \sum_{i=q+1}^{n} \delta_{(X_i, Y_i)} \quad \Rightarrow \quad P_n^{(v)} \gamma\big(\hat{s}_m^{(e)}\big)$$

V-fold cross-validation: $(B_j)_{1 \leq j \leq V}$ partition of $\{1, \dots, n\}$
$$\Rightarrow \quad \hat{m} \in \arg\min_{m \in \mathcal{M}} \frac{1}{V} \sum_{j=1}^{V} P_n^{(j)} \gamma\big(\hat{s}_m^{(-j)}\big), \qquad \tilde{s} = \hat{s}_{\hat{m}} .$$
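A sketch of V-fold cross-validation for choosing the number of bins of a regular histogram; it reuses the `regressogram` helper above, and the random fold assignment is just one possible choice of the partition $(B_j)$:

```python
import numpy as np

def vfold_select(X, Y, dims, V=5, seed=2):
    """V-fold cross-validation over regular histograms with D bins on [0, 1]:
    for each model, average the validation risks P_n^{(j)} gamma(s_hat_m^{(-j)})
    over a random partition (B_j) of {1, ..., n}, then pick the minimizer."""
    n = len(X)
    folds = np.random.default_rng(seed).permutation(n) % V    # B_1, ..., B_V
    cv = {}
    for D in dims:
        edges = np.linspace(0.0, 1.0, D + 1)
        risks = []
        for j in range(V):
            train, val = folds != j, folds == j
            s_hat, _ = regressogram(X[train], Y[train], edges)   # s_hat_m^{(-j)}
            risks.append(np.mean((s_hat(X[val]) - Y[val]) ** 2)) # validation risk
        cv[D] = np.mean(risks)
    return min(cv, key=cv.get)                                   # m_hat
```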
Bias of cross-validation

Ideal criterion: $P\gamma(\hat{s}_m)$.

Regression on a model of histograms with $D_m$ bins ($\sigma(X) \equiv \sigma$ for simplicity):
$$\mathbb{E}[P\gamma(\hat{s}_m)] \approx P\gamma(s_m) + \frac{D_m \sigma^2}{n}$$
$$\mathbb{E}\Big[P_n^{(j)} \gamma\big(\hat{s}_m^{(-j)}\big)\Big] = \mathbb{E}\Big[P\gamma\big(\hat{s}_m^{(-j)}\big)\Big] \approx P\gamma(s_m) + \frac{V}{V - 1} \cdot \frac{D_m \sigma^2}{n}$$

$\Rightarrow$ bias if $V$ is fixed ("overpenalization").
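A quick numeric illustration of this bias factor:

```python
# Inflation of the variance term D_m * sigma^2 / n in the expected V-fold
# criterion, relative to the ideal criterion: factor V / (V - 1).
for V in (2, 5, 10, 20):
    print(f"V = {V:2d}: overpenalization factor = {V / (V - 1):.3f}")
# V = 2 doubles the variance term; the bias only vanishes as V grows.
```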
Suboptimality of V-fold cross-validation

$Y = X + \sigma\varepsilon$ with $\varepsilon$ bounded and $\sigma > 0$; $\mathcal{M}$: family of regular histograms on $\mathcal{X} = [0, 1]$; $\hat{m}$ selected by V-fold cross-validation with $V$ fixed as $n$ grows.

Theorem (A. 2008, arXiv:0802.0566)
With probability at least $1 - \diamond\, n^{-2}$,
$$\ell(s, \hat{s}_{\hat{m}}) \geq (1 + \kappa(V)) \inf_{m \in \mathcal{M}} \big\{ \ell(s, \hat{s}_m) \big\}$$
with $\kappa(V) > 0$.
Simulations: sin, $n = 200$, $\sigma(x) = x$, 2 bin sizes

[Figure: two panels, $x \in [0, 1]$, $y \in [-4, 4]$.]