
V-fold cross-validation improved: V-fold penalization

Sylvain Arlot

University Paris-Sud XI, Orsay
Inria Saclay, Projet Select

Journées Statistiques du Sud, INSA Toulouse, June 18, 2008

Statistical framework: regression on a random design

(X_1, Y_1), ..., (X_n, Y_n) ∈ X × Y i.i.d., with (X_i, Y_i) ~ P unknown

Y = s(X) + σ(X)ε,   X ∈ X ⊂ R^d,   Y ∈ Y = [0; 1] or R

noise ε: E[ε | X] = 0;   noise level σ(X);   goal: a predictor t : X → Y?

Loss function, least-squares estimator

Least-squares risk: E[γ(t, (X, Y))] = Pγ(t, ·), with γ(t, (x, y)) = (t(x) − y)².

Loss function:
ℓ(s, t) = Pγ(t, ·) − Pγ(s, ·) = E[(t(X) − s(X))²]

Empirical risk minimizer on S_m (= model):
ŝ_m ∈ arg min_{t ∈ S_m} P_n γ(t, ·) = arg min_{t ∈ S_m} (1/n) Σ_{i=1}^n (t(X_i) − Y_i)²

e.g. histograms on a partition (I_λ)_{λ ∈ Λ_m} of X:
ŝ_m = Σ_{λ ∈ Λ_m} β̂_λ 1_{I_λ},   β̂_λ = (1 / Card{i : X_i ∈ I_λ}) Σ_{i : X_i ∈ I_λ} Y_i
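To make the histogram (regressogram) estimator concrete, here is a minimal sketch in Python/NumPy for a regular partition of X = [0, 1] into D pieces; the function name fit_regressogram, the regular partition, and the convention of value 0 on empty cells are illustrative assumptions, not part of the slides. This helper is reused in the later sketches.

import numpy as np

def fit_regressogram(X, Y, D):
    # Least-squares histogram estimator on the regular partition of [0, 1] into D cells.
    # Returns the vector of local means (beta_hat_lambda) and a prediction function.
    edges = np.linspace(0.0, 1.0, D + 1)
    cells = np.clip(np.digitize(X, edges[1:-1]), 0, D - 1)  # cell index I_lambda of each X_i
    beta_hat = np.zeros(D)
    for lam in range(D):
        mask = cells == lam
        if mask.any():
            beta_hat[lam] = Y[mask].mean()  # beta_hat_lambda = mean of the Y_i with X_i in I_lambda
    def predict(x):
        c = np.clip(np.digitize(x, edges[1:-1]), 0, D - 1)
        return beta_hat[c]
    return beta_hat, predict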

Model selection

(S_m)_{m ∈ M}  →  (ŝ_m)_{m ∈ M}  →  ŝ_m̂.   How should m̂ be chosen?

Oracle inequality (in expectation, or with a large probability):
ℓ(s, ŝ_m̂) ≤ C inf_{m ∈ M} { ℓ(s, ŝ_m) + R(m, n) }

Adaptivity (e.g., to α if s is α-Hölder, or to σ(X) in the heteroscedastic framework)

Cross-validation

Hold-out: split the data into a training sample (X_1, Y_1), ..., (X_q, Y_q) and a validation sample (X_{q+1}, Y_{q+1}), ..., (X_n, Y_n).

Train on the first q points:
ŝ_m^{(t)} ∈ arg min_{t ∈ S_m} (1/q) Σ_{i=1}^q γ(t, (X_i, Y_i))

Validate on the remaining n − q points:
P_n^{(v)} = (1/(n − q)) Σ_{i=q+1}^n δ_{(X_i, Y_i)}   ⇒   hold-out criterion P_n^{(v)} γ(ŝ_m^{(t)})

V-fold cross-validation: let (B_j)_{1 ≤ j ≤ V} be a partition of {1, ..., n}; with ŝ_m^{(−j)} trained without block B_j and P_n^{(j)} the empirical distribution of block B_j,

m̂ ∈ arg min_{m ∈ M} (1/V) Σ_{j=1}^V P_n^{(j)} γ(ŝ_m^{(−j)}),   final estimator s̃ = ŝ_m̂.
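A minimal sketch of classical V-fold cross-validation for choosing the number of bins D of a regressogram, reusing fit_regressogram from the sketch above; the fold construction, the grid of D values, and retraining the final estimator on the full sample are illustrative assumptions.

import numpy as np

def vfold_cv_select(X, Y, D_grid, V=5, rng=None):
    # Classical V-fold cross-validation: pick D minimizing the averaged validation risk.
    rng = np.random.default_rng(rng)
    n = len(X)
    folds = np.array_split(rng.permutation(n), V)  # (B_j)_{1 <= j <= V}: partition of {1, ..., n}
    crit = {}
    for D in D_grid:
        risks = []
        for j in range(V):
            val = folds[j]
            train = np.setdiff1d(np.arange(n), val)
            _, predict = fit_regressogram(X[train], Y[train], D)    # s_hat_m^{(-j)}
            risks.append(np.mean((predict(X[val]) - Y[val]) ** 2))  # P_n^{(j)} gamma(s_hat_m^{(-j)})
        crit[D] = np.mean(risks)                                    # V-fold CV criterion
    D_hat = min(crit, key=crit.get)
    _, predict_final = fit_regressogram(X, Y, D_hat)                # final estimator on all the data
    return D_hat, predict_final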

Bias of cross-validation

Ideal criterion: Pγ(ŝ_m)

Regression on a histogram model of dimension D_m, when σ(X) ≡ σ:

E[Pγ(ŝ_m)] ≈ Pγ(s_m) + D_m σ² / n

E[P_n^{(j)} γ(ŝ_m^{(−j)})] = E[Pγ(ŝ_m^{(−j)})] ≈ Pγ(s_m) + (V / (V − 1)) · D_m σ² / n

⇒ bias if V is fixed
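As a worked illustration (an elaboration, not on the slides): the inflation factor V/(V − 1) in front of the variance term D_m σ²/n equals 2 for V = 2, 1.25 for V = 5, and about 1.11 for V = 10, so 2-fold cross-validation overestimates this term by 100% while 10-fold overestimates it by roughly 11%; the bias only becomes negligible when V grows with n.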

Suboptimality of V-fold cross-validation

Setting: Y = X + σε with ε bounded and σ > 0; M_n: family of regular histograms on X = [0, 1]; V fixed.

Theorem
With probability at least 1 − ♦ n^{−2},
ℓ(s, ŝ_m̂) ≥ (1 + κ(V)) inf_{m ∈ M} ℓ(s, ŝ_m)
with κ(V) > 0.

Choice of V

Bias: decreases with V (and can be corrected: Burman 1989)
Variability: large if V is small (V = 2), and sometimes when V is very large (V = n, unstable algorithms)
Computation time: complexity proportional to V

⇒ trade-off
⇒ classical conclusion: "V = 10 is fine"

Simulation framework

Y_i = s(X_i) + σ(X_i) ε_i,   X_i i.i.d. U([0; 1]),   ε_i i.i.d. N(0, 1)

M_n = { regular histograms with D pieces, 1 ≤ D ≤ n / log(n), such that min_{λ ∈ Λ_m} Card{X_i ∈ I_λ} ≥ 2 }

⇒ Benchmark:
C_classical = E[ℓ(s, ŝ_m̂)] / E[inf_{m ∈ M} ℓ(s, ŝ_m)],   computed with N = 1000 samples
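A minimal sketch of the data-generating process and of the excess loss used by the benchmark, assuming s(x) = sin(πx) and σ ≡ 1 as in the first experiment; approximating ℓ(s, ŝ_m) by an average over a fine grid is an implementation choice not specified on the slides.

import numpy as np

def simulate_data(n, s=lambda x: np.sin(np.pi * x), sigma=lambda x: 1.0, rng=None):
    # One sample from Y = s(X) + sigma(X) * eps, with X ~ U([0, 1]) and eps ~ N(0, 1).
    rng = np.random.default_rng(rng)
    X = rng.uniform(0.0, 1.0, n)
    Y = s(X) + sigma(X) * rng.standard_normal(n)
    return X, Y

def excess_loss(predict, s, grid=np.linspace(0.0, 1.0, 10_000)):
    # Grid approximation of l(s, t) = E[(t(X) - s(X))^2] for X ~ U([0, 1]).
    return np.mean((predict(grid) - s(grid)) ** 2)

The benchmark C_classical is then the average over N samples of excess_loss for the selected model, divided by the average of the smallest excess_loss over the model family.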

Simulations: s(x) = sin(πx), n = 200, σ ≡ 1 (values of C_classical)

[figure]

2-fold          2.08 ± 0.04
5-fold          2.14 ± 0.04
10-fold         2.10 ± 0.05
20-fold         2.09 ± 0.04
leave-one-out   2.08 ± 0.04

Simulations: HeaviSine, n = 2048, σ ≡ 1 (values of C_classical)

[figure]

2-fold          1.002 ± 0.003
5-fold          1.014 ± 0.003
10-fold         1.021 ± 0.003
20-fold         1.029 ± 0.004
leave-one-out   1.034 ± 0.004

The penalization viewpoint

penalization: m̂ ∈ arg min_{m ∈ M} { P_n γ(ŝ_m) + pen(m) }

ideal penalty: pen_id(m) = Pγ(ŝ_m) − P_n γ(ŝ_m)

V-fold cross-validation is overpenalizing:

E[ (1/V) Σ_{j=1}^V P_n^{(j)} γ(ŝ_m^{(−j)}) − P_n γ(ŝ_m) ] / E[pen_id(m)]  ≈  1 + 1 / (2(V − 1))

non-asymptotic phenomenon: it is better to overpenalize when the signal-to-noise ratio n/σ² is small.

Overpenalization (s = sin, σ ≡ 1, n = 200, Mallows' C_p)

[figure]

Conclusions on V-fold cross-validation

asymptotically suboptimal if V is fixed
optimal V⋆: trade-off between variability and overpenalization; V⋆ = 2 can happen for prediction
difficult to find V⋆ from the data (+ complexity issue)
low signal-to-noise ratio ⇒ V⋆ unsatisfactory (highly variable)
large signal-to-noise ratio ⇒ V⋆ too large (computation time)

Penalization

m̂ ∈ arg min_{m ∈ M} { P_n γ(ŝ_m) + pen(m) }

Ideal penalty: pen_id(m) = (P − P_n)(γ(ŝ_m, ·))

Mallows' C_p: pen(m) = 2σ² D_m / n (Mallows 1973), in practice pen(m) = 2σ̂² D_m / n (a sketch follows after this slide).

Theorem (Suboptimality of linear penalties, A., 2008)
X = [0, 1], Y = X + σ(X)ε, σ(x) = 1_{x ≤ 1/2} + 3 · 1_{x > 1/2}
M_n: regular histograms on [0; 1/2] and on [1/2; 1]
With probability at least 1 − ♦ n^{−2}, for every K ≥ 0 and m̂(K) ∈ arg min_{m ∈ M_n} { P_n γ(ŝ_m) + K D_m }, ...
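A minimal sketch of penalized model selection with Mallows' C_p for the regressogram family, reusing fit_regressogram from above; estimating σ̂² from the residuals of the finest model in the grid is an illustrative choice, the slides only write pen(m) = 2σ̂²D_m/n.

import numpy as np

def mallows_cp_select(X, Y, D_grid):
    # Penalized selection: D_hat minimizing P_n gamma(s_hat_m) + 2 * sigma_hat^2 * D_m / n.
    n = len(X)
    # Illustrative variance estimate: residual variance of the finest model in the grid.
    _, predict_big = fit_regressogram(X, Y, max(D_grid))
    sigma2_hat = np.mean((Y - predict_big(X)) ** 2)
    crit = {}
    for D in D_grid:
        _, predict = fit_regressogram(X, Y, D)
        emp_risk = np.mean((predict(X) - Y) ** 2)       # P_n gamma(s_hat_m)
        crit[D] = emp_risk + 2.0 * sigma2_hat * D / n   # + pen(m)
    return min(crit, key=crit.get)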

Resampling heuristics (bootstrap, Efron 1979)

Real world:        P  --(sampling)-->  P_n  ⇒  ŝ_m
Bootstrap world:   P_n  --(resampling; here, subsampling)-->  P_n^W  ⇒  ŝ_m^W

pen_id(m) = (P − P_n) γ(ŝ_m) = F(P, P_n)   ⇝   F(P_n, P_n^W) = (P_n − P_n^W) γ(ŝ_m^W)

V-fold: P_n^W = (1 / (n − Card(B_J))) Σ_{i ∉ B_J} δ_{(X_i, Y_i)},   with J ~ U(1, ..., V)

V-fold penalization

Ideal penalty: (P − P_n)(γ(ŝ_m))

V-fold penalty:
pen(m) = (C / V) Σ_{j=1}^V [ (P_n − P_n^{(−j)})(γ(ŝ_m^{(−j)})) ]
with ŝ_m^{(−j)} ∈ arg min_{t ∈ S_m} P_n^{(−j)} γ(t), and C ≥ V − 1 to be chosen
(C = V − 1 ⇒ we recover Burman's corrected V-fold, 1989)

The final estimator is ŝ_m̂ with
m̂ ∈ arg min_{m ∈ M} { P_n γ(ŝ_m) + pen(m) }
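A minimal sketch of V-fold penalization for the regressogram family, reusing fit_regressogram from above; the default C = V − 1 (Burman's corrected V-fold) and the fold construction are illustrative choices.

import numpy as np

def vfold_penalty_select(X, Y, D_grid, V=5, C=None, rng=None):
    # V-fold penalization: D_hat minimizing P_n gamma(s_hat_m) + pen(m),
    # with pen(m) = (C / V) * sum_j (P_n - P_n^{(-j)}) gamma(s_hat_m^{(-j)}).
    rng = np.random.default_rng(rng)
    n = len(X)
    if C is None:
        C = V - 1  # Burman's corrected V-fold
    folds = np.array_split(rng.permutation(n), V)
    crit = {}
    for D in D_grid:
        _, predict_full = fit_regressogram(X, Y, D)
        emp_risk = np.mean((predict_full(X) - Y) ** 2)                    # P_n gamma(s_hat_m)
        pen = 0.0
        for j in range(V):
            train = np.setdiff1d(np.arange(n), folds[j])
            _, predict_j = fit_regressogram(X[train], Y[train], D)        # s_hat_m^{(-j)}
            risk_all = np.mean((predict_j(X) - Y) ** 2)                   # P_n gamma(s_hat_m^{(-j)})
            risk_train = np.mean((predict_j(X[train]) - Y[train]) ** 2)   # P_n^{(-j)} gamma(s_hat_m^{(-j)})
            pen += risk_all - risk_train
        crit[D] = emp_risk + (C / V) * pen
    return min(crit, key=crit.get)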

Some references on model selection and resampling

Hold-out, cross-validation, leave-one-out, V-fold cross-validation: I ⊂ {1, ..., n} random sub-sample of size q (VFCV: q = n(V − 1)/V).

Efron's bootstrap penalties (Efron 1983, Shibata 1997):
pen(m) = E[ (P_n − P_n^W)(γ(ŝ_m^W)) | (X_i, Y_i)_{1 ≤ i ≤ n} ]

Rademacher complexities (Koltchinskii 2001; Bartlett, Boucheron, Lugosi 2002): subsampling;
pen_id(m) ≤ pen_id^glo(m) = sup_{t ∈ S_m} (P − P_n) γ(t, ·)

Non-asymptotic pathwise oracle inequality

Assumptions: C ≈ V − 1; histogram regression on a random design; small number of models (at most polynomial in n); model pre-selection: remove m when min_{λ ∈ Λ_m} Card{X_i ∈ I_λ} ≤ 1; fixed V, or V = n.

Theorem
Under a "reasonable" set of assumptions on P, with probability at least 1 − ♦ n^{−2},
ℓ(s, ŝ_m̂) ≤ (1 + ln(n)^{−1/5}) inf_{m ∈ M} ℓ(s, ŝ_m)

Sufficient assumptions

Reminder: the procedure does not use any of these assumptions.

Bounded data: ||Y||_∞ ≤ A < ∞
Minimal noise level: 0 < σ_min ≤ σ(X)
Smoothness of the regression function s: non-constant, belongs to some Hölder ball H_α(R)
Regularity of the partition: min_λ P(X ∈ I_λ) ≥ ♦ D_m^{−1}

... and they can be relaxed.

Corollaries

Classical oracle inequality:
E[ℓ(s, ŝ_m̂)] ≤ (1 + ln(n)^{−1/5}) E[inf_{m ∈ M} ℓ(s, ŝ_m)] + ♦ n^{−2}

Asymptotic optimality if C ~ V − 1 as n → +∞:
ℓ(s, ŝ_m̂) / inf_{m ∈ M} ℓ(s, ŝ_m) → 1 a.s. as n → +∞

Adaptation to Hölder regularity in a heteroscedastic framework (regular histograms):
s ∈ H(α, R), α ∈ (0, 1], X ⊂ R^k, (...), Y bounded
⇒ rate ||σ||_{L²(Leb)}^{4α/(2α+k)} R^{2k/(2α+k)} n^{−2α/(2α+k)}

Simulation framework (reminder): Y_i = s(X_i) + σ(X_i)ε_i, X_i i.i.d. U([0; 1]), ε_i i.i.d. N(0, 1); M_n the family of regular histograms defined above; benchmark C_classical = E[ℓ(s, ŝ_m̂)] / E[inf_{m ∈ M} ℓ(s, ŝ_m)], computed with N = 1000 samples.

Model selection methods

Mallows: pen(m) = 2σ̂² D_m n^{−1}

"Classical" V-fold cross-validation (V ∈ {2, 5, 10, 20, n}):
m̂ ∈ arg min_{m ∈ M} (1/V) Σ_{j=1}^V P_n^{(j)} γ(ŝ_m^{(−j)}),   s̃ = ŝ_m̂

V-fold penalties (V ∈ {2, 5, 10, n}), with C = V − 1
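For orientation, a hypothetical usage sketch combining the helpers from the earlier sketches on one simulated sample (the function names come from those sketches, not from the slides; the constraint of at least two observations per cell is not enforced here):

import numpy as np

n = 200
X, Y = simulate_data(n)                              # s(x) = sin(pi x), sigma = 1
D_grid = range(1, 1 + int(n / np.log(n)))            # 1 <= D <= n / log(n)
D_cv, _ = vfold_cv_select(X, Y, D_grid, V=10)        # classical 10-fold cross-validation
D_cp = mallows_cp_select(X, Y, D_grid)               # Mallows' C_p
D_pen = vfold_penalty_select(X, Y, D_grid, V=10)     # 10-fold penalization, C = V - 1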

Simulations: s(x) = sin(πx), n = 200, σ ≡ 1 (values of C_classical)

[figure]

Mallows              1.93 ± 0.04
2-fold               2.08 ± 0.04
5-fold               2.14 ± 0.04
10-fold              2.10 ± 0.05
20-fold              2.09 ± 0.04
leave-one-out        2.08 ± 0.04
pen 2-fold           2.58 ± 0.06
pen 5-fold           2.22 ± 0.05
pen 10-fold          2.12 ± 0.05
pen Loo              2.08 ± 0.05
Mallows × 1.25       1.80 ± 0.03
pen 2-fold × 1.25    2.17 ± 0.05
pen 5-fold × 1.25    1.91 ± 0.05
pen 10-fold × 1.25   1.87 ± 0.03

Simulations: sin, n = 200, σ(x) = x, 2 bin sizes (values of C_classical)

[figure]

Mallows              3.69 ± 0.07
2-fold               2.54 ± 0.05
5-fold               2.58 ± 0.06
10-fold              2.60 ± 0.06
20-fold              2.58 ± 0.06
leave-one-out        2.59 ± 0.06
pen 2-fold           3.06 ± 0.07
pen 5-fold           2.75 ± 0.06
pen 10-fold          2.65 ± 0.06
pen Loo              2.59 ± 0.06
Mallows × 1.25       3.17 ± 0.07
pen 2-fold × 1.25    2.75 ± 0.06

Conclusions on V-fold penalization

asymptotically optimal, even if V is fixed
optimal V⋆: the largest possible one ⇒ easier to balance against the computational cost
low signal-to-noise ratio ⇒ easy to overpenalize and decrease variability (keep V large)
large signal-to-noise ratio ⇒ possible to stay unbiased with a small V (for computational reasons)
this flexibility improves on V-fold cross-validation (according to both theoretical results and simulations)
the theory can be extended to exchangeable weighted bootstrap penalties (e.g. bootstrap, i.i.d. Rademacher, leave-one-out, leave-p-out with p = αn)

Some open problems: consistency when C ≫ V − 1, ...

Thank you for your attention!

Preprint: arXiv:0802.0566

Appendix

Limitations of a linear penalty

[figure: ideal penalty]

Y = X + σ(X)ε,   ε ~ N(0, 1),   σ(x) = 1_{x ≤ 1/2} + 3 · 1_{x > 1/2}

Regular histograms on [0; 1/2] (D_{m,1} pieces), then regular histograms on [1/2; 1] (D_{m,2} pieces).

⇒ pen_id(m) is not a linear function of the dimension D_m

Limitations of a linear penalty: m̂(K) ≠ m⋆

[figure: axis D_{m,2}]

Sketch of the proof

For each m ∈ M_n,
pen_id(m) ≈ E[pen_id(m)] ∝ E[pen(m)] ≈ pen(m),
with remainder terms ≪ ℓ(s, ŝ_m) when D_m → +∞:

Explicit computation of pen_id and pen
Comparison of expectations: E[pen_id] ∝ E[pen] (if min_{λ ∈ Λ_m} { n P(X ∈ I_λ) } → +∞)
Moment inequalities (Boucheron, Bousquet, Lugosi, Massart 2003) ⇒ concentration inequalities (for pen_id and pen)
Assumptions ⇒ control of the remainders in terms of ℓ(s, ŝ_m).

Overpenalization (HeaviSine, n = 2048, σ ≡ 1)

[figure]

Simulations: HeaviSine, n = 2048, σ ≡ 1 (values of C_classical)

[figure]

Mallows              1.015 ± 0.003
2-fold               1.002 ± 0.003
5-fold               1.014 ± 0.003
10-fold              1.021 ± 0.003
20-fold              1.029 ± 0.004
leave-one-out        1.034 ± 0.004
pen 2-fold           1.038 ± 0.004
pen 5-fold           1.037 ± 0.004
pen 10-fold          1.034 ± 0.004
pen Loo              1.034 ± 0.004
Mallows × 1.25       1.002 ± 0.003
pen 2-fold × 1.25    1.011 ± 0.003
pen 5-fold × 1.25    1.006 ± 0.003
pen 10-fold × 1.25   1.005 ± 0.003

Simulations: HeaviSine, n = 2048, σ(x) = x, 2 bin sizes (values of C_classical)

[figure]

Mallows              1.373 ± 0.010
2-fold               1.184 ± 0.004
5-fold               1.115 ± 0.005
10-fold              1.109 ± 0.004
20-fold              1.105 ± 0.004
leave-one-out        1.105 ± 0.004
pen 2-fold           1.103 ± 0.005
pen 5-fold           1.104 ± 0.004
pen 10-fold          1.104 ± 0.004
pen Loo              1.105 ± 0.004
Mallows × 1.25       1.411 ± 0.008
pen 2-fold × 1.25    1.106 ± 0.004
pen 5-fold × 1.25    1.102 ± 0.004
pen 10-fold × 1.25   1.098 ± 0.004

Simulations: sin, variable n and σ, regular histograms (values of C_classical)

                      n = 200, σ = 1    n = 1000, σ = 1    n = 200, σ = 0.1
Mallows (K = 2)       1.93 ± 0.04       1.67 ± 0.04        1.40 ± 0.02
2-fold                2.08 ± 0.04       1.67 ± 0.04        1.39 ± 0.02
10-fold               2.10 ± 0.05       1.75 ± 0.04        1.38 ± 0.02
pen 10-fold           2.12 ± 0.05       1.78 ± 0.05        1.37 ± 0.02
Mallows (K = 2.5)     1.80 ± 0.03       1.62 ± 0.03        1.43 ± 0.02
pen 10-fold × 1.25    1.87 ± 0.03       1.63 ± 0.04        1.38 ± 0.02
