V-fold cross-validation improved: V-fold penalization
Sylvain Arlot
University Paris-Sud XI, Orsay
Inria Saclay, Projet Select
Journées Statistiques du Sud, INSA Toulouse, June 18, 2008
Statistical framework: regression on a random design
(X_1, Y_1), ..., (X_n, Y_n) ∈ X × Y i.i.d., (X_i, Y_i) ∼ P unknown
Y = s(X) + σ(X)ε,  X ∈ X ⊂ R^d,  Y ∈ Y = [0, 1] or R
Noise: E[ε | X] = 0, noise level σ(X). Predictor t : X → Y ?
Loss function, least-square estimator
Least-square risk: E[γ(t, (X, Y))] = Pγ(t, ·), with γ(t, (x, y)) = (t(x) − y)²
Loss function: ℓ(s, t) = Pγ(t, ·) − Pγ(s, ·) = E[(t(X) − s(X))²]
Empirical risk minimizer on S_m (= model):
ŝ_m ∈ argmin_{t ∈ S_m} P_n γ(t, ·) = argmin_{t ∈ S_m} (1/n) Σ_{i=1}^n (t(X_i) − Y_i)²
e.g. histograms on a partition (I_λ)_{λ ∈ Λ_m} of X:
ŝ_m = Σ_{λ ∈ Λ_m} β̂_λ 1_{I_λ},  with β̂_λ = (1 / Card{i : X_i ∈ I_λ}) Σ_{i : X_i ∈ I_λ} Y_i
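To make the histogram least-squares estimator concrete, here is a minimal numpy sketch for a regular partition of X = [0, 1] into D cells; the function names (histogram_estimator, predict) are illustrative assumptions, not code from the talk. Each β̂_λ is simply the average of the Y_i whose X_i falls in cell I_λ.

import numpy as np

def histogram_estimator(X, Y, D):
    """Least-squares fit on the model of piecewise-constant functions over the
    regular partition of [0, 1] into D cells: beta_hat[k] is the mean of the Y_i
    whose X_i falls into cell k (0 if the cell is empty)."""
    cells = np.clip((X * D).astype(int), 0, D - 1)   # index of the cell I_lambda containing X_i
    beta_hat = np.zeros(D)
    for k in range(D):
        in_cell = cells == k
        if in_cell.any():
            beta_hat[k] = Y[in_cell].mean()
    return beta_hat

def predict(beta_hat, X_new):
    """Evaluate the piecewise-constant estimator at new design points."""
    D = len(beta_hat)
    return beta_hat[np.clip((X_new * D).astype(int), 0, D - 1)]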
Model selection
(S_m)_{m ∈ M} → (ŝ_m)_{m ∈ M} → ŝ_m̂ ???
Oracle inequality (in expectation, or with a large probability):
ℓ(s, ŝ_m̂) ≤ C inf_{m ∈ M} {ℓ(s, ŝ_m) + R(m, n)}
Adaptivity (e.g. to α if s is α-Hölder, to σ(X) in the heteroscedastic framework)
Cross-validation
Split the sample: (X_1, Y_1), ..., (X_q, Y_q) [training] and (X_{q+1}, Y_{q+1}), ..., (X_n, Y_n) [validation]
ŝ_m^(t) ∈ argmin_{t ∈ S_m} { (1/q) Σ_{i=1}^q γ(t, (X_i, Y_i)) }
P_n^(v) = (1/(n − q)) Σ_{i=q+1}^n δ_{(X_i, Y_i)}  ⇒  criterion P_n^(v) γ(ŝ_m^(t))
V-fold cross-validation: (B_j)_{1 ≤ j ≤ V} partition of {1, ..., n}
⇒ m̂ ∈ argmin_{m ∈ M} (1/V) Σ_{j=1}^V P_n^(j) γ(ŝ_m^(−j)),  s̃ = ŝ_m̂
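As an illustration of the V-fold criterion above, a minimal sketch reusing the hypothetical histogram_estimator and predict helpers from the earlier block; the fold assignment below is an illustrative choice and assumes the data are in random order.

import numpy as np

def vfold_cv_criterion(X, Y, D, V):
    """Classical V-fold criterion for the D-cell histogram model:
    (1/V) sum_j P_n^(j) gamma(s_hat_m^(-j)), i.e. the average validation error
    of the estimators trained on each fold's complement."""
    n = len(X)
    folds = np.arange(n) % V            # B_1, ..., B_V (data assumed in random order)
    errors = []
    for j in range(V):
        train, val = folds != j, folds == j
        beta = histogram_estimator(X[train], Y[train], D)   # s_hat_m^(-j)
        errors.append(np.mean((predict(beta, X[val]) - Y[val]) ** 2))
    return float(np.mean(errors))

def select_by_vfold_cv(X, Y, dims, V=10):
    """m_hat: the candidate bin count minimizing the V-fold criterion."""
    return min(dims, key=lambda D: vfold_cv_criterion(X, Y, D, V))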
Bias of cross-validation
Ideal criterion: Pγ(ŝ_m)
Regression on a histogram model of dimension D_m, when σ(X) ≡ σ:
E[Pγ(ŝ_m)] ≈ Pγ(s_m) + D_m σ² / n
E[P_n^(j) γ(ŝ_m^(−j))] = E[Pγ(ŝ_m^(−j))] ≈ Pγ(s_m) + (V/(V − 1)) · D_m σ² / n
⇒ bias if V is fixed
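A one-line heuristic for the V/(V − 1) factor, assuming (as on this slide) that the estimation error of a D_m-cell histogram built from q observations scales like D_m σ²/q: each ŝ_m^(−j) is trained on only q = n(V − 1)/V points, so

\mathbb{E}\bigl[P\gamma(\widehat{s}_m^{(-j)})\bigr]
  \;\approx\; P\gamma(s_m) + \frac{D_m\sigma^2}{n(V-1)/V}
  \;=\; P\gamma(s_m) + \frac{V}{V-1}\,\frac{D_m\sigma^2}{n}.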
Suboptimality of V-fold cross-validation
Y = X + σε with ε bounded and σ > 0
M_n: family of regular histograms on X = [0, 1]
V fixed
Theorem
With probability at least 1 − ♦ n^{−2},
ℓ(s, ŝ_m̂) ≥ (1 + κ(V)) inf_{m ∈ M} {ℓ(s, ŝ_m)}
with κ(V) > 0.
Choice of V
Bias: decreases with V (can be corrected: Burman 1989)
Variability: large if V is small (V = 2), or sometimes when V is very large (V = n, unstable algorithms)
Computation time: complexity proportional to V
⇒ trade-off
⇒ classical conclusion: "V = 10 is fine"
Simulation framework
Y_i = s(X_i) + σ(X_i) ε_i,  X_i i.i.d. U([0, 1]),  ε_i i.i.d. N(0, 1)
M_n = { regular histograms with D pieces, 1 ≤ D ≤ n / log(n), such that min_{λ ∈ Λ_m} Card{X_i ∈ I_λ} ≥ 2 }
⇒ Benchmark: C_classical = E[ℓ(s, ŝ_m̂)] / E[inf_{m ∈ M} ℓ(s, ŝ_m)], computed with N = 1000 samples
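A minimal sketch of this simulation design; the function names and the way the model family is enumerated are illustrative assumptions, not the talk's code.

import numpy as np

def simulate_sample(n, s, sigma, rng):
    """One sample: X_i ~ U([0,1]) i.i.d., eps_i ~ N(0,1) i.i.d.,
    Y_i = s(X_i) + sigma(X_i) * eps_i."""
    X = rng.uniform(0.0, 1.0, n)
    Y = s(X) + sigma(X) * rng.standard_normal(n)
    return X, Y

rng = np.random.default_rng(0)
n = 200
s = lambda x: np.sin(np.pi * x)        # first experiment: s(x) = sin(pi x)
sigma = lambda x: np.ones_like(x)      # homoscedastic noise, sigma == 1
X, Y = simulate_sample(n, s, sigma, rng)

# model family: regular histograms with D pieces, 1 <= D <= n/log(n),
# keeping only partitions with at least 2 observations per cell
dims = [D for D in range(1, int(n / np.log(n)) + 1)
        if np.bincount(np.clip((X * D).astype(int), 0, D - 1), minlength=D).min() >= 2]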
Simulations: s(x) = sin(πx), n = 200, σ ≡ 1
[Figure: sample data]
Benchmark C_classical:
2-fold          2.08 ± 0.04
5-fold          2.14 ± 0.04
10-fold         2.10 ± 0.05
20-fold         2.09 ± 0.04
leave-one-out   2.08 ± 0.04
Simulations: HeaviSine, n = 2048, σ ≡ 1
[Figure: sample data]
Benchmark C_classical:
2-fold          1.002 ± 0.003
5-fold          1.014 ± 0.003
10-fold         1.021 ± 0.003
20-fold         1.029 ± 0.004
leave-one-out   1.034 ± 0.004
The penalization viewpoint
Penalization: m̂ ∈ argmin_{m ∈ M} {P_n γ(ŝ_m) + pen(m)}
Ideal penalty: pen_id(m) = Pγ(ŝ_m) − P_n γ(ŝ_m)
V-fold cross-validation is overpenalizing:
E[(1/V) Σ_{j=1}^V P_n^(j) γ(ŝ_m^(−j)) − P_n γ(ŝ_m)] / E[pen_id(m)] ≈ 1 + 1/(2(V − 1))
Non-asymptotic phenomenon: it is better to overpenalize when the signal-to-noise ratio n/σ² is small.
Overpenalization (s = sin, σ ≡ 1, n = 200, Mallows' Cp)
Conclusions on V-fold cross-validation
asymptotically suboptimal if V is fixed
optimal V*: trade-off variability–overpenalization; V* = 2 can happen for prediction
difficult to find V* from the data (+ complexity issue)
low signal-to-noise ratio ⇒ V* unsatisfactory (highly variable)
large signal-to-noise ratio ⇒ V* too large (computation time)
Penalization
m̂ ∈ argmin_{m ∈ M} {P_n γ(ŝ_m) + pen(m)}
Ideal penalty: pen_id(m) = (P − P_n)(γ(ŝ_m, ·))
pen(m) = 2σ² D_m / n  (Mallows 1973)
pen(m) = 2σ̂² D_m / n
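A sketch of the Mallows-type rule with an estimated variance, again reusing the hypothetical histogram helpers above; how σ̂² is obtained is left open here, since the slides only write σ̂².

import numpy as np

def empirical_risk(X, Y, D):
    """P_n gamma(s_hat_m): training error of the D-cell histogram fit."""
    beta = histogram_estimator(X, Y, D)
    return float(np.mean((predict(beta, X) - Y) ** 2))

def select_mallows(X, Y, dims, sigma2_hat):
    """Mallows' C_p: minimize P_n gamma(s_hat_m) + 2 * sigma2_hat * D_m / n."""
    n = len(X)
    return min(dims, key=lambda D: empirical_risk(X, Y, D) + 2.0 * sigma2_hat * D / n)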
Penalization
Ideal penalty: pen_id(m) = (P − P_n)(γ(ŝ_m, ·))
Theorem (Suboptimality of linear penalties, A., 2008)
X = [0, 1], Y = X + σ(X)ε, σ(x) = 1_{x ≤ 1/2} + 3 · 1_{x > 1/2}
M_n: regular histograms on [0, 1/2] and [1/2, 1]
With probability at least 1 − ♦ n^{−2}, for every K ≥ 0 and
m̂(K) ∈ argmin_{m ∈ M_n} {P_n γ(ŝ_m) + K D_m}, ...
Resampling heuristics (bootstrap, Efron 1979)
Real world:       P  --sampling-->   P_n   ⇒  ŝ_m
Bootstrap world:  P_n  --resampling-->  P_n^W  ⇒  ŝ_m^W
pen_id(m) = (P − P_n)γ(ŝ_m) = F(P, P_n)  ⇝  F(P_n, P_n^W) = (P_n − P_n^W)γ(ŝ_m^W)
V-fold (subsampling): P_n^W = (1/(n − Card(B_J))) Σ_{i ∉ B_J} δ_{(X_i, Y_i)}, with J ∼ U({1, ..., V})
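The same heuristic with Efron bootstrap weights gives a Monte Carlo penalty; here is a minimal sketch (up to the multiplicative calibration discussed in the talk), reusing the hypothetical histogram helpers from the earlier blocks.

import numpy as np

def bootstrap_penalty(X, Y, D, B=50, rng=None):
    """Monte Carlo version of the resampling heuristic: replace (P - P_n) by
    (P_n - P_n^W) with P_n^W a bootstrap empirical measure, and average
    (P_n - P_n^W) gamma(s_hat_m^W) over B bootstrap resamples."""
    rng = rng or np.random.default_rng()
    n = len(X)
    pens = []
    for _ in range(B):
        idx = rng.integers(0, n, n)                          # Efron bootstrap resample
        beta_w = histogram_estimator(X[idx], Y[idx], D)      # s_hat_m^W
        risk_pn = np.mean((predict(beta_w, X) - Y) ** 2)               # P_n   gamma(s_hat_m^W)
        risk_pnw = np.mean((predict(beta_w, X[idx]) - Y[idx]) ** 2)    # P_n^W gamma(s_hat_m^W)
        pens.append(risk_pn - risk_pnw)
    return float(np.mean(pens))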
V-fold penalization
Ideal penalty: (P − P_n)(γ(ŝ_m))
V-fold penalty:
pen(m) = (C/V) Σ_{j=1}^V [(P_n − P_n^(−j))(γ(ŝ_m^(−j)))]
with ŝ_m^(−j) ∈ argmin_{t ∈ S_m} P_n^(−j) γ(t), and C ≥ V − 1 to be chosen
(C = V − 1 ⇒ we recover Burman's corrected V-fold, 1989)
The final estimator is ŝ_m̂ with m̂ ∈ argmin_{m ∈ M} {P_n γ(ŝ_m) + pen(m)}
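A minimal sketch of the V-fold penalization rule, reusing the hypothetical helpers introduced above (histogram_estimator, predict, empirical_risk); as in the earlier cross-validation sketch, the fold assignment assumes the data are in random order.

import numpy as np

def vfold_penalty(X, Y, D, V, C=None):
    """V-fold penalty pen(m) = (C/V) * sum_j (P_n - P_n^(-j)) gamma(s_hat_m^(-j)),
    with C = V - 1 by default (Burman's corrected V-fold)."""
    n = len(X)
    C = V - 1 if C is None else C
    folds = np.arange(n) % V
    terms = []
    for j in range(V):
        train = folds != j
        beta = histogram_estimator(X[train], Y[train], D)              # s_hat_m^(-j)
        risk_pn = np.mean((predict(beta, X) - Y) ** 2)                 # P_n      gamma(s_hat_m^(-j))
        risk_pnj = np.mean((predict(beta, X[train]) - Y[train]) ** 2)  # P_n^(-j) gamma(s_hat_m^(-j))
        terms.append(risk_pn - risk_pnj)
    return C * float(np.mean(terms))

def select_by_vfold_penalization(X, Y, dims, V=10, C=None):
    """m_hat minimizes empirical risk + V-fold penalty; the final estimator is s_hat_{m_hat}."""
    return min(dims, key=lambda D: empirical_risk(X, Y, D) + vfold_penalty(X, Y, D, V, C))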
Some references on model selection and resampling
Hold-out, cross-validation, leave-one-out, V-fold cross-validation:
I ⊂ {1, ..., n} random sub-sample of size q (VFCV: q = n(V − 1)/V)
Efron's bootstrap penalties (Efron 1983, Shibata 1997):
pen(m) = E[(P_n − P_n^W)(γ(ŝ_m^W)) | (X_i, Y_i)_{1 ≤ i ≤ n}]
Rademacher complexities (Koltchinskii 2001; Bartlett, Boucheron, Lugosi 2002): subsampling
pen_id(m) ≤ pen_id^glo(m) = sup_{t ∈ S_m} (P − P_n)γ(t, ·)
Non-asymptotic pathwise oracle inequality
C ≈ V − 1
Histogram regression on a random design
Small number of models (at most polynomial in n)
Model pre-selection: remove m when min_{λ ∈ Λ_m} Card{X_i ∈ I_λ} ≤ 1
Fixed V, or V = n
Theorem
Under a "reasonable" set of assumptions on P, with probability at least 1 − ♦ n^{−2},
ℓ(s, ŝ_m̂) ≤ (1 + ln(n)^{−1/5}) inf_{m ∈ M} {ℓ(s, ŝ_m)}
Sufficient assumptions
Reminder: the procedure does not use any of these assumptions.
Bounded data: ‖Y‖_∞ ≤ A < ∞
Minimal noise level: 0 < σ_min ≤ σ(X)
Smoothness of the regression function s: non-constant, belongs to some Hölder ball H(α, R)
Regularity of the partition: min_λ P(X ∈ I_λ) ≥ ♦ D_m^{−1}
... and they can be relaxed.
Corollaries
Classical oracle inequality:
E[ℓ(s, ŝ_m̂)] ≤ (1 + ln(n)^{−1/5}) E[inf_{m ∈ M} {ℓ(s, ŝ_m)}] + ♦ n^{−2}
Asymptotic optimality if C ∼ V − 1 as n → +∞:
ℓ(s, ŝ_m̂) / inf_{m ∈ M} {ℓ(s, ŝ_m)} → 1 a.s. as n → +∞
Adaptation to Hölder regularity in a heteroscedastic framework (regular histograms):
s ∈ H(α, R), α ∈ (0, 1], X ⊂ R^k, (···), Y bounded
⇒ rate ‖σ‖_{L²(Leb)}^{4α/(2α+k)} R^{2k/(2α+k)} n^{−2α/(2α+k)}
Model selection methods
Mallows: pen(m) = 2 σ̂² D_m n^{−1}
"Classical" V-fold cross-validation (V ∈ {2, 5, 10, 20, n}):
m̂ ∈ argmin_{m ∈ M} (1/V) Σ_{j=1}^V P_n^(j) γ(ŝ_m^(−j), ·),  s̃ = ŝ_m̂
V-fold penalties (V ∈ {2, 5, 10, n}), C = V − 1
Simulations: s(x) = sin(πx), n = 200, σ ≡ 1
[Figure: sample data]
Benchmark C_classical:
Mallows           1.93 ± 0.04
2-fold            2.08 ± 0.04
5-fold            2.14 ± 0.04
10-fold           2.10 ± 0.05
20-fold           2.09 ± 0.04
leave-one-out     2.08 ± 0.04
pen 2-f           2.58 ± 0.06
pen 5-f           2.22 ± 0.05
pen 10-f          2.12 ± 0.05
pen Loo           2.08 ± 0.05
Mallows × 1.25    1.80 ± 0.03
pen 2-f × 1.25    2.17 ± 0.05
pen 5-f × 1.25    1.91 ± 0.05
pen 10-f × 1.25   1.87 ± 0.03
Simulations: sin, n = 200, σ(x) = x, 2 bin sizes
[Figure: sample data]
Benchmark C_classical:
Mallows           3.69 ± 0.07
2-fold            2.54 ± 0.05
5-fold            2.58 ± 0.06
10-fold           2.60 ± 0.06
20-fold           2.58 ± 0.06
leave-one-out     2.59 ± 0.06
pen 2-f           3.06 ± 0.07
pen 5-f           2.75 ± 0.06
pen 10-f          2.65 ± 0.06
pen Loo           2.59 ± 0.06
Mallows × 1.25    3.17 ± 0.07
pen 2-f × 1.25    2.75 ± 0.06
Conclusions on V-fold penalization
asymptotically optimal, even if V is fixed
optimal V*: the largest possible one ⇒ easier to balance with the computational cost
low signal-to-noise ratio ⇒ easy to overpenalize and decrease variability (keep V large)
large signal-to-noise ratio ⇒ possible to stay unbiased with a small V (for computational reasons)
flexibility improves V-fold cross-validation (according to both theoretical results and simulations)
theory can be extended to exchangeable weighted bootstrap penalties (e.g. bootstrap, i.i.d. Rademacher, leave-one-out, leave-p-out with p = αn)
Some open problems: consistency when C ≫ V − 1, ...
Thank you for your attention !
Preprint: arXiv:0802.0566
Part I
Appendix
Limitations of a linear penalty
[Figure: ideal penalty on this example]
Y = X + σ(X)ε,  σ(X) = 1_{X ≥ 1/2},  ε ∼ N(0, 1)
Regular histograms on [0, 1/2] (D_{m,1} pieces), then regular histograms on [1/2, 1] (D_{m,2} pieces).
⇒ pen_id(m) is not a linear function of D_m
Limitations of a linear penalty: m̂(K) ≠ m*
[Figure: selected dimensions, with D_{m,2} on one axis]
Sketch of the proof
For each m ∈ M_n,
pen_id(m) ≈ E[pen_id(m)] ∝ E[pen(m)] ≈ pen(m), with remainder terms ≪ ℓ(s, ŝ_m) when D_m → +∞:
Explicit computation of pen_id and pen
Comparison of expectations: E(pen_id) ∝ E(pen) (if min_{λ ∈ Λ_m} {n P(X ∈ I_λ)} → +∞)
Moment inequalities (Boucheron, Bousquet, Lugosi, Massart 2003) ⇒ concentration inequalities (for pen_id and pen)
Assumptions ⇒ control of the remainders in terms of ℓ(s, ŝ_m).
Overpenalization (HeaviSine, n = 2048, σ ≡ 1)
Simulations: HeaviSine, n = 2048, σ ≡ 1
[Figure: sample data]
Benchmark C_classical:
Mallows           1.015 ± 0.003
2-fold            1.002 ± 0.003
5-fold            1.014 ± 0.003
10-fold           1.021 ± 0.003
20-fold           1.029 ± 0.004
leave-one-out     1.034 ± 0.004
pen 2-f           1.038 ± 0.004
pen 5-f           1.037 ± 0.004
pen 10-f          1.034 ± 0.004
pen Loo           1.034 ± 0.004
Mallows × 1.25    1.002 ± 0.003
pen 2-f × 1.25    1.011 ± 0.003
pen 5-f × 1.25    1.006 ± 0.003
pen 10-f × 1.25   1.005 ± 0.003
Simulations: HeaviSine, n = 2048, σ(x) = x, 2 bin sizes
[Figure: sample data]
Benchmark C_classical:
Mallows           1.373 ± 0.010
2-fold            1.184 ± 0.004
5-fold            1.115 ± 0.005
10-fold           1.109 ± 0.004
20-fold           1.105 ± 0.004
leave-one-out     1.105 ± 0.004
pen 2-f           1.103 ± 0.005
pen 5-f           1.104 ± 0.004
pen 10-f          1.104 ± 0.004
pen Loo           1.105 ± 0.004
Mallows × 1.25    1.411 ± 0.008
pen 2-f × 1.25    1.106 ± 0.004
pen 5-f × 1.25    1.102 ± 0.004
pen 10-f × 1.25   1.098 ± 0.004
Simulations: sin, variable n and σ, regular histograms
                      n = 200, σ = 1    n = 1000, σ = 1    n = 200, σ = 0.1
Mallows (K = 2)       1.93 ± 0.04       1.67 ± 0.04        1.40 ± 0.02
2-fold                2.08 ± 0.04       1.67 ± 0.04        1.39 ± 0.02
10-fold               2.10 ± 0.05       1.75 ± 0.04        1.38 ± 0.02
pen 10-fold           2.12 ± 0.05       1.78 ± 0.05        1.37 ± 0.02
Mallows (K = 2.5)     1.80 ± 0.03       1.62 ± 0.03        1.43 ± 0.02
pen 10-fold × 1.25    1.87 ± 0.03       1.63 ± 0.04        1.38 ± 0.02