Cross-validation for estimator selection

Sylvain Arlot (joint works with Alain Celisse, Matthieu Lerasle, Nelo Magalhães)

CNRS
École Normale Supérieure (Paris), DI/ENS, Équipe Sierra

IHP, Paris, April 9th, 2015

Main reference (survey): arXiv:0907.4728
Outline
1 Estimator selection
2 Cross-validation
3 Cross-validation for risk estimation
4 Cross-validation for estimator selection
5 Large M
6 Conclusion
Regression: data (X_1, Y_1), …, (X_n, Y_n)

[figure: scatter plot of the noisy observations]

Goal: predict Y given X, i.e., denoising

[figure: the same data with a denoised curve]
Estimator selection | Cross-validation | CV for risk estimation | CV for estimator selection | Large M | Conclusion
Prediction problem / regression

Data D_n: (X_1, Y_1), …, (X_n, Y_n) ∈ X × Y (i.i.d. ∼ P)
Contrast γ(t; (x, y)) measures how well t(x) "predicts" y
Goal: learn t ∈ S = {measurable functions X → Y} s.t.

    E_{(X,Y)∼P}[γ(t; (X, Y))] =: Pγ(t)  is minimal.

Example: regression, Y = ℝ,
least-squares contrast γ(t; (x, y)) = (t(x) − y)²
s⋆ ∈ argmin_{t∈S} Pγ(t) is the regression function: s⋆(X) = E[Y | X]

⇒ excess loss

    ℓ(s⋆, t) := Pγ(t) − Pγ(s⋆) = E[(t(X) − s⋆(X))²]
Density estimation: data ξ_1, …, ξ_n

[figure: sample points on [0, 1]]

Goal: estimate the common density s⋆ of the ξ_i
Problem: density estimation

Data D_n: ξ_1, …, ξ_n ∈ Ξ (i.i.d. ∼ P, density s⋆ w.r.t. µ)
Least-squares contrast γ(t; ξ) = ‖t‖²_{L²(µ)} − 2t(ξ)
Goal: learn t ∈ S = {measurable functions Ξ → ℝ} s.t.

    E_{ξ∼P}[γ(t; ξ)] =: Pγ(t)  is minimal.

    Pγ(t) = ∫ t² dµ − 2 ∫ t s⋆ dµ = ∫ (t − s⋆)² dµ − ‖s⋆‖²_{L²(µ)}

⇒ the true density s⋆ ∈ argmin_{t∈S} Pγ(t) and the excess loss is

    ℓ(s⋆, t) := Pγ(t) − Pγ(s⋆) = ‖t − s⋆‖²_{L²(µ)}
General setting

Data ξ_1, …, ξ_n ∈ Ξ i.i.d. with distribution P
    prediction: ξ_i = (X_i, Y_i) ∈ X × Y
Goal: estimate some feature s⋆ ∈ S of P
    density, regression function, Bayes predictor…
Contrast function γ : S × Ξ → ℝ such that

    s⋆ ∈ argmin_{t∈S} Pγ(t)   with   Pγ(t) := E_{ξ∼P}[γ(t; ξ)]

Excess loss

    ℓ(s⋆, t) := Pγ(t) − Pγ(s⋆) ≥ 0

Examples

Prediction: ξ_i = (X_i, Y_i)
    given X_{n+1}, "predict" Y_{n+1} with t(X_{n+1})
    γ(t; (x, y)) quantifies the "distance" between t(x) and y
Regression (Y = ℝ), least squares:
    γ(t; (x, y)) = (t(x) − y)²,   s⋆(X) = E[Y | X]
Binary classification (Y = {0, 1}), 0–1 contrast:
    γ(t; (x, y)) = 1_{t(x) ≠ y}
Density estimation (reference measure µ):
    least squares: γ(t; ξ) = ‖t‖²_{L²(µ)} − 2t(ξ)
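To make the contrasts concrete, here is a minimal Python sketch of the regression and classification contrasts and of the empirical risk P_nγ(t); the function names are ours, for illustration only:

```python
import numpy as np

# Least-squares regression contrast: gamma(t; (x, y)) = (t(x) - y)^2
def ls_regression_contrast(t, x, y):
    return (t(x) - y) ** 2

# 0-1 classification contrast: gamma(t; (x, y)) = 1_{t(x) != y}
def zero_one_contrast(t, x, y):
    return float(t(x) != y)

# Empirical risk P_n gamma(t) = (1/n) sum_i gamma(t; xi_i)
def empirical_risk(contrast, t, X, Y):
    return np.mean([contrast(t, x, y) for x, y in zip(X, Y)])

X = np.array([0.0, 0.5, 1.0])
Y = np.array([0.1, 0.4, 1.1])
t = lambda x: x  # a candidate predictor
risk = empirical_risk(ls_regression_contrast, t, X, Y)
```

Minimizing `empirical_risk` over a model S_m gives the least-squares estimator discussed later.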
Estimator selection (regression): regular regressograms

[figures: regressogram fits for four bin widths]
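A regular regressogram is the least-squares estimator on the model of piecewise-constant functions over a regular partition of [0, 1]; a sketch (illustrative code, with the bin count D playing the role of the model index m):

```python
import numpy as np

def regressogram(X, Y, D):
    """Piecewise-constant least-squares estimator on D regular bins of [0, 1]."""
    edges = np.linspace(0.0, 1.0, D + 1)
    # np.digitize maps each X_i to its bin in 1..D; clip keeps boundary points inside
    idx = np.clip(np.digitize(X, edges) - 1, 0, D - 1)
    means = np.array([Y[idx == j].mean() if np.any(idx == j) else 0.0
                      for j in range(D)])
    def s_hat(x):
        j = np.clip(np.digitize(x, edges) - 1, 0, D - 1)
        return means[j]
    return s_hat

rng = np.random.default_rng(0)
X = rng.uniform(size=200)
Y = np.sin(2 * np.pi * X) + 0.3 * rng.normal(size=200)
s_hat = regressogram(X, Y, D=8)
pred = s_hat(np.array([0.05, 0.55]))
```

Small D underfits (coarse bins), large D overfits (one point per bin), which is exactly the selection problem the slides illustrate.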
Estimator selection (regression): kernel ridge

[figures: kernel ridge fits for several values of the regularization parameter]
Estimator selection (regression): k nearest neighbours

[figures: k-NN fits for four values of k]
Estimator selection (regression): Nadaraya-Watson

[figures: Nadaraya-Watson fits for several bandwidths]
Estimator selection (density): regular histograms

[figures: histogram density estimates for four bin widths]
Estimator selection (density): Parzen, Gaussian kernel

[figures: Gaussian kernel density estimates for several bandwidths]
Estimator selection

Estimator/learning algorithm: ŝ : D_n ↦ ŝ(D_n) ∈ S
Example: least-squares estimator on some model S_m ⊂ S

    ŝ_m ∈ argmin_{t∈S_m} {P_n γ(t)}   where   P_n γ(t) := (1/n) Σ_{ξ∈D_n} γ(t; ξ)

Examples of models: histograms, span{φ_1, …, φ_D}

Estimator collection (ŝ_m)_{m∈M} ⇒ how to choose m̂ = m̂(D_n)?

Examples:
    model selection
    calibration of tuning parameters (choosing k or the distance for k-NN, choice of a regularization parameter, choice of a kernel, etc.)
    choice between different methods, e.g., k-NN vs. smoothing splines?
Estimator selection: two possible goals

Estimation goal: minimize the risk of the final estimator, i.e., prove an oracle inequality (in expectation or with a large probability):

    ℓ(s⋆, ŝ_m̂) ≤ C inf_{m∈M} ℓ(s⋆, ŝ_m) + R_n

Identification goal: select the (asymptotically) best model/estimator, assuming it is well-defined, i.e., prove selection consistency:

    P(m̂(D_n) = m⋆) → 1  as n → ∞.

Equivalent to estimation in the parametric setting.

Both goals with the same procedure (the AIC-BIC dilemma)? No in general (Yang, 2005). Sometimes possible.
Estimation goal: bias-variance trade-off

    E[ℓ(s⋆, ŝ_m)] = Bias + Variance

Bias, or approximation error:

    ℓ(s⋆, s_m⋆) = inf_{t∈S_m} ℓ(s⋆, t)

Variance, or estimation error; for OLS in regression:

    σ² dim(S_m) / n

⇔ avoid both overfitting and underfitting
Estimation goal: bias-variance trade-off

[figure: approximation error, estimation error, and E[Risk] as functions of the dimension D]
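The trade-off in the figure can be reproduced numerically with the generic shape risk(D) ≈ approximation error + σ²D/n; the approximation-error decay below is a toy choice of ours:

```python
import numpy as np

n, sigma2 = 1000, 1.0
D = np.arange(1, 101)
approx_error = 1.0 / D**2          # toy bias term, decreasing in D
estim_error = sigma2 * D / n       # OLS-type variance term, increasing in D
risk = approx_error + estim_error
D_star = D[np.argmin(risk)]        # dimension balancing the two terms
```

The minimizer sits where the two error curves are of comparable size, neither at D = 1 (underfitting) nor at D = 100 (overfitting).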
Validation principle

[figure: the full data sample]

Validation principle: learning sample

[figure: an estimator fitted on the learning sample]

Validation principle: validation sample

[figure: the fitted estimator evaluated on the held-out validation sample]
Cross-validation

    (X_1, Y_1), …, (X_{n_t}, Y_{n_t})   |   (X_{n_t+1}, Y_{n_t+1}), …, (X_n, Y_n)
    Training set D_n^(t) ⇒ ŝ_m^(t) = ŝ_m(D_n^(t))   |   Validation set D_n^(v) ⇒ evaluate the risk

hold-out estimator of the risk:

    P_n^(v) γ(ŝ_m^(t)) = (1/n_v) Σ_{ξ∈D_n^(v)} γ(ŝ_m^(t); ξ),   n_v = |D_n^(v)| = n − n_t

cross-validation: average several hold-out estimators

    R̂_cv(ŝ_m; D_n; (I_j^(t))_{1≤j≤B}) = (1/B) Σ_{j=1}^B P_n^(v,j) γ(ŝ_m^(t,j)),   D_n^(t,j) = (ξ_i)_{i∈I_j^(t)}

estimator selection:

    m̂ ∈ argmin_{m∈M} { R̂_cv(ŝ_m; D_n) }
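The hold-out and CV risk estimators above translate directly into code; a generic sketch where `fit(X, Y)` returns an estimator and `contrast` is γ (all names are ours, not from the slides):

```python
import numpy as np

def holdout_risk(fit, contrast, X, Y, train_idx):
    """Hold-out estimator: train on D^(t), average the contrast on D^(v)."""
    mask = np.zeros(len(X), dtype=bool)
    mask[train_idx] = True
    s_hat = fit(X[mask], Y[mask])                        # s_hat_m^(t)
    return np.mean(contrast(s_hat, X[~mask], Y[~mask]))  # P_n^(v) gamma

def cv_risk(fit, contrast, X, Y, train_idx_list):
    """CV estimator: average of hold-out estimators over the splits I_j^(t)."""
    return np.mean([holdout_risk(fit, contrast, X, Y, I) for I in train_idx_list])

# Toy example: constant predictor (empirical mean), least-squares contrast
fit = lambda Xt, Yt: (lambda x: np.full_like(np.asarray(x, float), Yt.mean()))
contrast = lambda s, x, y: (s(x) - y) ** 2

rng = np.random.default_rng(0)
X = rng.uniform(size=100)
Y = 2.0 + 0.5 * rng.normal(size=100)
splits = [rng.choice(100, size=80, replace=False) for _ in range(10)]
r = cv_risk(fit, contrast, X, Y, splits)
```

Estimator selection then picks the m minimizing `cv_risk` over the collection (ŝ_m)_{m∈M}.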
Cross-validation: examples

Exhaustive data splitting: all possible subsets of size n_t
⇒ leave-one-out (n_t = n − 1):

    R̂_loo(ŝ_m; D_n) = (1/n) Σ_{j=1}^n γ(ŝ_m^(−j); ξ_j)

⇒ leave-p-out (n_t = n − p)

V-fold cross-validation: B = (B_j)_{1≤j≤V} partition of {1, …, n}

    ⇒ R̂_vf(ŝ_m; D_n; B) = (1/V) Σ_{j=1}^V P_{n_j} γ(ŝ_m^(−j))

Monte-Carlo CV / repeated learning-testing: I_1^(t), …, I_B^(t) i.i.d. uniform
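The three split schemes differ only in how the training index sets I_j^(t) are drawn; a sketch (function names ours):

```python
import numpy as np

def loo_splits(n):
    """Leave-one-out: n splits, each training set omits one point."""
    return [np.delete(np.arange(n), j) for j in range(n)]

def vfold_splits(n, V, seed=0):
    """V-fold: partition {0,...,n-1} into V blocks, train on all but one block."""
    rng = np.random.default_rng(seed)
    blocks = np.array_split(rng.permutation(n), V)
    return [np.concatenate([b for i, b in enumerate(blocks) if i != j])
            for j in range(V)]

def mccv_splits(n, n_t, B, seed=0):
    """Monte-Carlo CV: B i.i.d. uniform training sets of size n_t."""
    rng = np.random.default_rng(seed)
    return [rng.choice(n, size=n_t, replace=False) for _ in range(B)]

splits = vfold_splits(12, V=4)
```

Any of these lists can be fed to a generic CV risk estimator, so the scheme is just a parameter of the procedure.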
Bias of cross-validation

In this talk, we always assume: ∀j, Card(D_n^(t,j)) = n_t.
For V-fold CV: Card(B_j) = n/V.

Ideal criterion: Pγ(ŝ_m(D_n)). General analysis for the bias:

    E[R̂_cv(ŝ_m; D_n; (I_j^(t))_{1≤j≤B})] = E[Pγ(ŝ_m(D_{n_t}))]

⇒ everything depends on the map n ↦ E[Pγ(ŝ_m(D_n))]

Note: the bias can be corrected in some settings (Burman, 1989).
Note: D_n ↦ ŝ_m(D_n) must be fixed before seeing any data; otherwise, stronger bias.
Bias of cross-validation: generic example

Assume

    E[Pγ(ŝ_m(D_n))] = α(m) + β(m)/n

(e.g., LS/ridge/k-NN regression, LS/kernel density estimation).

    ⇒ E[R̂_cv(ŝ_m; D_n; (I_j^(t))_{1≤j≤B})] = α(m) + (n/n_t) · β(m)/n

⇒ Bias:
    decreases as a function of n_t,
    minimal for n_t = n − 1,
    negligible if n_t ∼ n.

⇒ V-fold: the bias decreases when V increases and vanishes as V → +∞.
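Under the generic shape E[Pγ(ŝ_m(D_n))] = α(m) + β(m)/n, the CV bias is just the factor n/n_t applied to the β(m)/n term; a numerical illustration (α, β values are ours):

```python
n = 1000
alpha, beta = 0.5, 20.0              # illustrative alpha(m), beta(m)
true_risk = alpha + beta / n         # E[P gamma(s_hat_m(D_n))]

biases = {}
for V in (2, 5, 10, n):              # V-fold CV trains on n_t = n (1 - 1/V) points
    n_t = n * (1 - 1 / V)
    cv_mean = alpha + (n / n_t) * (beta / n)   # = alpha + beta / n_t
    biases[V] = cv_mean - true_risk            # = beta (1/n_t - 1/n) >= 0
```

The bias is largest for V = 2 and shrinks toward zero as V grows (leave-one-out at V = n), matching the slide's conclusion.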
Variance of cross-validation

Hold-out (Nadeau & Bengio, 2003):

    var(P_n^(v) γ(ŝ_m^(t))) = (1/n_v) E[ var(γ(u; ξ)) |_{u = ŝ_m^(t)} ] + var(Pγ(ŝ_m(D_{n_t})))

Monte-Carlo CV and number of splits (p = n − n_t):

    var(R̂_cv(ŝ_m; D_n; (I_j^(t))_{1≤j≤B})) = var(R̂_lpo(ŝ_m; D_n)) + (1/B) E[ var_{I^(t)}(P_n^(v) γ(ŝ_m^(t)) | D_n) ]

where the last term is the permutation variance.

V-fold CV: B, n_t, n_v are related.
leave-one-out: related to stability? (empirical results)
Variance of the V-fold CV criterion

Least-squares density estimation (A. & Lerasle 2012), exact non-asymptotic computation:

    var(R̂_vf(ŝ_m; D_n; B)) = (1 + o(1))/n · var_P(s_m⋆) + (2/n²)(1 + 4/(V−1) + O(1/V + 1/n)) A(m)

(simplified formula, histogram model with bin size d_m^{−1}, A(m) ≈ d_m)

Linear regression, specific setting, asymptotic formula (Burman, 1989):

    var(R̂_vf(ŝ_m; D_n; B)) = 2σ²/n + (4σ⁴/n²)(4 + 4/(V−1) + 2/(V−1)² + 1/(V−1)³) + o(n⁻²)
Risk estimation and estimator selection are different goals

    m̂ ∈ argmin_{m∈M} { R̂_cv(ŝ_m) }   vs.   m⋆ ∈ argmin_{m∈M} { Pγ(ŝ_m(D_n)) }

For any Z (deterministic or random),

    m̂ ∈ argmin_{m∈M} { R̂_cv(ŝ_m) + Z }

⇒ bias and variance of the risk estimator alone are meaningless.

Perfect ranking among (ŝ_m)_{m∈M} ⇔ ∀m, m′ ∈ M,

    sign(R̂_cv(ŝ_m) − R̂_cv(ŝ_{m′})) = sign(Pγ(ŝ_m) − Pγ(ŝ_{m′}))

⇒ E[R̂_cv(ŝ_m) − R̂_cv(ŝ_{m′})] should have the right sign (unbiased risk estimation heuristic: AIC, C_p, leave-one-out…)
⇒ var(R̂_vf(ŝ_m) − R̂_vf(ŝ_{m′})) should be minimal (detailed heuristic: A. & Lerasle 2012)
Bias and estimator selection: generic example

    m̂ ∈ argmin_{m∈M} { R̂_vf(ŝ_m) }   vs.   m⋆ ∈ argmin_{m∈M} { Pγ(ŝ_m(D_n)) }

Assume

    E[Pγ(ŝ_m(D_n))] = α(m) + β(m)/n

(e.g., LS/ridge/k-NN regression, LS/kernel density estimation). Key quantities:

    E[Pγ(ŝ_m) − Pγ(ŝ_{m′})] = α(m) − α(m′) + (β(m) − β(m′))/n
    E[R̂_cv(ŝ_m) − R̂_cv(ŝ_{m′})] = α(m) − α(m′) + (n/n_t) · (β(m) − β(m′))/n

⇒ CV favours m with smaller complexity β(m), more and more as n_t decreases.
CV with an estimation goal: the big picture (M "small")

At first order, the bias drives the performance of:
    leave-p-out, V-fold CV, Monte-Carlo CV if B ≫ n², or if n_v is large enough (including hold-out).

CV performs similarly to

    argmin_{m∈M} { E[Pγ(ŝ_m(D_{n_t}))] }

⇒ first-order optimality if n_t ∼ n
⇒ suboptimal otherwise, e.g., V-fold CV with V fixed.

Theoretical results for least-squares regression and density estimation at least.
Bias-corrected VFCV / V-fold penalization

Bias-corrected V-fold CV (Burman, 1989):

    R̂_vf,corr(ŝ_m; D_n; B) := R̂_vf(ŝ_m; D_n; B) + P_n γ(ŝ_m) − (1/V) Σ_{j=1}^V P_n γ(ŝ_m^(−j))
                             = P_n γ(ŝ_m) + pen_VF(ŝ_m; D_n; B)

where pen_VF is the V-fold penalty (A. 2008).

In least-squares density estimation (A. & Lerasle, 2012):

    R̂_vf(ŝ_m; D_n; B) = P_n γ(ŝ_m(D_n)) + [1 + 1/(2(V−1))] · pen_VF(ŝ_m; D_n; B)

    R̂_lpo(ŝ_m; D_n; B) = P_n γ(ŝ_m(D_n)) + [1 + 1/(2(n/p − 1))] · pen_VF(ŝ_m; D_n; B_loo)

with the bracketed terms being overpenalization factors.
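Burman's correction amounts to adding back the empirical risk of the full-sample fit and subtracting the average in-sample risk of the fold estimators; a sketch reusing generic `fit`/`contrast` callables (names ours):

```python
import numpy as np

def vfold_corrected(fit, contrast, X, Y, V, seed=0):
    """Bias-corrected V-fold CV (Burman, 1989):
    R_vf + P_n gamma(s_hat_m) - (1/V) sum_j P_n gamma(s_hat_m^{(-j)})."""
    n = len(X)
    rng = np.random.default_rng(seed)
    blocks = np.array_split(rng.permutation(n), V)
    pn_full = np.mean(contrast(fit(X, Y), X, Y))          # P_n gamma(s_hat_m)
    r_vf = pn_folds = 0.0
    for j in range(V):
        tr = np.concatenate([b for i, b in enumerate(blocks) if i != j])
        s_j = fit(X[tr], Y[tr])                           # s_hat_m^{(-j)}
        r_vf += np.mean(contrast(s_j, X[blocks[j]], Y[blocks[j]])) / V
        pn_folds += np.mean(contrast(s_j, X, Y)) / V      # P_n gamma(s_hat^{(-j)})
    return r_vf + pn_full - pn_folds, r_vf                # (corrected, plain VF)

fit = lambda Xt, Yt: (lambda x: np.full_like(np.asarray(x, float), Yt.mean()))
contrast = lambda s, x, y: (s(x) - y) ** 2
rng = np.random.default_rng(1)
X = rng.uniform(size=100)
Y = 1.0 + 0.5 * rng.normal(size=100)
corrected, plain = vfold_corrected(fit, contrast, X, Y, V=5)
```

For this least-squares toy model the correction can only lower the criterion, since the full-sample fit minimizes the empirical risk over constants.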
Variance and estimator selection

    Δ(m, m′, V) = R̂_vf,corr(ŝ_m) − R̂_vf,corr(ŝ_{m′})

Theorem (A. & Lerasle 2012, least-squares density estimation)

    var(Δ(m, m′, V)) = 4(1 + 2/n + 1/n²) · var_P(s_m⋆ − s_{m′}⋆)/n + 2(1 + 4/(V−1) − 1/n) · B(m, m′)/n²

with B(m, m′) ≥ 0. If S_m ⊂ S_{m′} are two histogram models with constant bin sizes d_m^{−1}, d_{m′}^{−1}, then B(m, m′) ∝ ‖s_m⋆ − s_{m′}⋆‖ d_m. The two terms are of the same order if ‖s_m⋆ − s_{m′}⋆‖ ≈ d_m/n.
Variance of R̂_vf,corr(ŝ_m) − R̂_vf,corr(ŝ_{m⋆}) vs. (d_m, V)

[figure: variance as a function of the dimension d_m, for LOO, 10-fold, 5-fold, 2-fold, and E[pen_id]]
Probability of selection of every m

[figure: P(m is selected) as a function of the dimension, for LOO, 10-fold, 5-fold, 2-fold, and E[pen_id]]
Experiment (LS density estimation): V-fold CV

[figure: risk ratio (range ≈ 1.9–2.5) as a function of the overpenalization factor C (1 to 2), for V = n (LOO), V = 10, V = 5, V = 2]