
Cross-validation for estimator selection

Sylvain Arlot (joint works with Alain Celisse, Matthieu Lerasle, Nelo Magalhães)

1. CNRS
2. École Normale Supérieure (Paris), DI/ENS, Équipe Sierra

IHP, Paris, April 9th, 2015

Main reference (survey): arXiv:0907.4728

Outline

1. Estimator selection
2. Cross-validation
3. Cross-validation for risk estimation
4. Cross-validation for estimator selection
5. Large M
6. Conclusion

Regression: data (X1, Y1), . . . , (Xn, Yn)

[Figure: noisy data sample (Xi, Yi) on [0, 1]]

Goal: predict Y given X, i.e., denoising

[Figure: the same sample, illustrating the denoising goal]

Prediction problem / regression

Data Dn: (X1, Y1), . . . , (Xn, Yn) ∈ X × Y (i.i.d. ∼ P)
Contrast: γ(t; (x, y)) measures how well t(x) "predicts" y
Goal: learn t ∈ S = {measurable functions X → Y} such that
    E_(X,Y)∼P[ γ(t; (X, Y)) ] =: Pγ(t) is minimal.

Example: regression (Y = R), least-squares contrast γ(t; (x, y)) = (t(x) − y)²
    s* ∈ argmin_{t∈S} Pγ(t) is the regression function: s*(X) = E[Y | X]

⇒ excess loss: ℓ(s*, t) := Pγ(t) − Pγ(s*) = E[ (t(X) − s*(X))² ]
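As a concrete illustration of the contrast and excess-loss definitions above, here is a minimal numerical sketch. The data-generating process, the regression function s*, and the candidate predictor t are hypothetical choices made for illustration; they are not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: regression function s*(x) = sin(2*pi*x), Gaussian noise.
def s_star(x):
    return np.sin(2 * np.pi * x)

n = 500
X = rng.uniform(0, 1, n)
Y = s_star(X) + rng.normal(0, 0.5, n)

def contrast(t, x, y):
    """Least-squares contrast gamma(t; (x, y)) = (t(x) - y)^2."""
    return (t(x) - y) ** 2

# Empirical risk P_n gamma(t) of a (slightly biased) candidate predictor t
t = lambda x: 0.9 * np.sin(2 * np.pi * x)
empirical_risk = contrast(t, X, Y).mean()

# Monte-Carlo approximation of the excess loss l(s*, t) = E[(t(X) - s*(X))^2]
X_fresh = rng.uniform(0, 1, 100_000)
excess_loss = ((t(X_fresh) - s_star(X_fresh)) ** 2).mean()
```

Here the excess loss is E[(0.1 sin(2πX))²] = 0.005, while the empirical risk also contains the irreducible noise level σ² = 0.25.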

Density estimation: data ξ1, . . . , ξn

[Figure: sample of points on [0, 1]]

Goal: estimate the common density s* of the ξi

Problem: density estimation

Data Dn: ξ1, . . . , ξn ∈ Ξ (i.i.d. ∼ P, with density s* w.r.t. µ)
Least-squares contrast: γ(t; ξ) = ‖t‖²_{L²(µ)} − 2 t(ξ)
Goal: learn t ∈ S = {measurable functions Ξ → R} such that
    E_ξ∼P[ γ(t; ξ) ] =: Pγ(t) is minimal.

Pγ(t) = ∫ t² dµ − 2 ∫ t s* dµ = ∫ (t − s*)² dµ − ‖s*‖²_{L²(µ)}

⇒ the true density s* ∈ argmin_{t∈S} Pγ(t), and the excess loss is
    ℓ(s*, t) := Pγ(t) − Pγ(s*) = ‖t − s*‖²_{L²(µ)}
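The least-squares density contrast is easy to evaluate empirically, since Pn γ(t) = ‖t‖² − (2/n) Σᵢ t(ξᵢ) needs no knowledge of s*. A minimal sketch for histogram estimators on [0, 1] (the Beta(2, 2) true density is a hypothetical choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: true density s* = Beta(2, 2) on [0, 1].
xi = rng.beta(2, 2, size=2000)

def histogram_estimator(data, n_bins):
    """Least-squares histogram density estimator on [0, 1]."""
    counts, edges = np.histogram(data, bins=n_bins, range=(0.0, 1.0))
    heights = counts / (len(data) * np.diff(edges))  # piecewise-constant density
    return heights, edges

def empirical_contrast(heights, edges, data):
    """P_n gamma(t) = ||t||^2_{L2} - (2/n) * sum_i t(xi_i)."""
    sq_norm = np.sum(heights ** 2 * np.diff(edges))
    idx = np.clip(np.searchsorted(edges, data, side="right") - 1,
                  0, len(heights) - 1)
    return sq_norm - 2.0 * heights[idx].mean()
```

Since Pγ(t) = ‖t − s*‖² − ‖s*‖², the empirical contrast of a reasonable estimator is negative, close to −‖s*‖² (here ‖s*‖² = 1.2).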

General setting

Data ξ1, . . . , ξn ∈ Ξ, i.i.d. with distribution P
    (prediction: ξi = (Xi, Yi) ∈ X × Y)
Goal: estimate some feature s* ∈ S of P
    (density, regression function, Bayes predictor, ...)
Contrast function γ : S × Ξ → R such that
    s* ∈ argmin_{t∈S} Pγ(t)   with   Pγ(t) := E_ξ∼P[ γ(t; ξ) ]
Excess loss: ℓ(s*, t) := Pγ(t) − Pγ(s*) ≥ 0

Examples

Prediction: ξi = (Xi, Yi); given Xn+1, "predict" Yn+1 with t(Xn+1);
    γ(t; (x, y)) quantifies the "distance" between t(x) and y
Regression (Y = R), least squares:
    γ(t; (x, y)) = (t(x) − y)²,   s*(X) = E[Y | X]
Binary classification (Y = {0, 1}), 0–1 contrast:
    γ(t; (x, y)) = 1_{t(x) ≠ y}
Density estimation (reference measure µ), least squares:
    γ(t; ξ) = ‖t‖²_{L²(µ)} − 2 t(ξ)

Estimator selection (regression): regular regressograms
[Figure: regressogram fits of the same data for several bin widths]

Estimator selection (regression): kernel ridge
[Figure: kernel ridge fits for several regularization parameters]

Estimator selection (regression): k nearest neighbours
[Figure: k-NN fits for several values of k]

Estimator selection (regression): Nadaraya–Watson
[Figure: Nadaraya–Watson fits for several bandwidths]

Estimator selection (density): regular histograms
[Figure: histogram density estimates for several bin widths]

Estimator selection (density): Parzen, Gaussian kernel
[Figure: Gaussian kernel density estimates for several bandwidths]

Estimator selection

Estimator / learning algorithm: ŝ : Dn ↦ ŝ(Dn) ∈ S
Example: least-squares estimator on some model Sm ⊂ S:
    ŝm ∈ argmin_{t∈Sm} { Pn γ(t) }   where   Pn γ(t) := (1/n) Σ_{ξ∈Dn} γ(t; ξ)
Examples of models: histograms, span{ϕ1, . . . , ϕD}

Estimator collection (ŝm)_{m∈M} ⇒ how to choose m̂ = m̂(Dn)?

Examples:
    model selection
    calibration of tuning parameters (choosing k or the distance for k-NN,
    choice of a regularization parameter, choice of a kernel, etc.)
    choice between different methods (e.g., k-NN vs. smoothing splines?)
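To make the notion of an estimator collection concrete, here is a minimal sketch of the regressogram example: the least-squares estimator ŝm on the model Sm of piecewise-constant functions over m regular bins, computed for several m. The data-generating process is a hypothetical choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def regressogram(X, Y, m):
    """Least-squares estimator on S_m = {piecewise-constant functions
    on m regular bins of [0, 1]}: the empirical-risk minimizer is the
    vector of bin means."""
    edges = np.linspace(0.0, 1.0, m + 1)
    idx = np.clip(np.searchsorted(edges, X, side="right") - 1, 0, m - 1)
    means = np.zeros(m)
    for j in range(m):
        in_bin = idx == j
        means[j] = Y[in_bin].mean() if in_bin.any() else 0.0
    def s_hat(x):
        k = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, m - 1)
        return means[k]
    return s_hat

n = 400
X = rng.uniform(0, 1, n)
Y = np.sin(2 * np.pi * X) + rng.normal(0, 0.3, n)

# An estimator collection (s_hat_m), m in M: one estimator per bin count.
collection = {m: regressogram(X, Y, m) for m in (1, 2, 5, 10, 50, 200)}
```

Note that the training risk alone cannot choose m: it decreases mechanically as the model grows, which is exactly why a selection procedure such as cross-validation is needed.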

Estimator selection: two possible goals

Estimation goal: minimize the risk of the final estimator, i.e., prove an
oracle inequality (in expectation or with large probability):
    ℓ(s*, ŝ_m̂) ≤ C inf_{m∈M} ℓ(s*, ŝm) + Rn

Identification goal: select the (asymptotically) best model/estimator,
assuming it is well defined, i.e., selection consistency:
    P( m̂(Dn) = m* ) → 1 as n → ∞.
Equivalent to estimation in the parametric setting.

Both goals with the same procedure (the AIC–BIC dilemma)?
No in general (Yang, 2005). Sometimes possible.

Estimation goal: bias–variance trade-off

E[ ℓ(s*, ŝm) ] = Bias + Variance

Bias, or approximation error: ℓ(s*, s*m) = inf_{t∈Sm} ℓ(s*, t)
Variance, or estimation error: for OLS in regression, σ² dim(Sm) / n

⇔ avoid both overfitting and underfitting

[Figure: approximation error, estimation error, and E[Risk] as functions of the dimension D]
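The U-shaped risk curve of the figure can be reproduced numerically for the regressogram: the approximation error decreases with the dimension m while the estimation error grows like σ²m/n. The regression function and the noise level are hypothetical choices for illustration.

```python
import numpy as np

# Hypothetical setup: s*(x) = sin(2*pi*x), noise sigma = 0.5, n = 200 points.
sigma, n = 0.5, 200
s_star = lambda x: np.sin(2 * np.pi * x)

def approx_error(m, grid=np.linspace(0, 1, 10_000)):
    """l(s*, s_m*): distance from s* to its best piecewise-constant
    approximation on m regular bins of [0, 1]."""
    idx = np.minimum((grid * m).astype(int), m - 1)
    bin_means = np.array([s_star(grid[idx == j]).mean() for j in range(m)])
    return ((s_star(grid) - bin_means[idx]) ** 2).mean()

dims = np.array([1, 2, 4, 8, 16, 32, 64])
# Expected risk = approximation error + estimation error sigma^2 * m / n
risk = np.array([approx_error(m) + sigma**2 * m / n for m in dims])
best_m = dims[risk.argmin()]
```

The minimizer sits strictly between the smallest and the largest model: too few bins underfit, too many overfit.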


Validation principle

[Figure: the full data sample]

Validation principle: learning sample

[Figure: the subsample used to fit the estimator]

Validation principle: validation sample

[Figure: the held-out subsample used to evaluate the fitted estimator]

Cross-validation

Split the data into
    (X1, Y1), . . . , (Xnt, Ynt): training set Dn(t) ⇒ ŝm(t) = ŝm(Dn(t))
    (Xnt+1, Ynt+1), . . . , (Xn, Yn): validation set Dn(v) ⇒ evaluate the risk

Hold-out estimator of the risk:
    Pn(v) γ(ŝm(t)) = (1/nv) Σ_{ξ∈Dn(v)} γ(ŝm(t); ξ),   where nv = |Dn(v)| = n − nt

Cross-validation: average several hold-out estimators over splits (Ij(t))_{1≤j≤B}:
    R̂cv( ŝm; Dn; (Ij(t))_{1≤j≤B} ) = (1/B) Σ_{j=1}^B Pn(v,j) γ(ŝm(t,j)),
    where Dn(t,j) = (ξi)_{i∈Ij(t)}

Estimator selection: m̂ ∈ argmin_{m∈M} { R̂cv(ŝm; Dn) }
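The hold-out estimator and its cross-validated average translate almost line-for-line into code. A minimal sketch with the least-squares contrast; the constant-fit learner and the data-generating process are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

def holdout_risk(fit, X, Y, train_idx):
    """Hold-out estimator: train on D_n^(t), average the least-squares
    contrast on the complementary validation set D_n^(v)."""
    val = np.setdiff1d(np.arange(len(X)), train_idx)
    s_hat = fit(X[train_idx], Y[train_idx])
    return np.mean((s_hat(X[val]) - Y[val]) ** 2)

def cv_risk(fit, X, Y, splits):
    """Cross-validation: average of hold-out estimators over splits I_j^(t)."""
    return np.mean([holdout_risk(fit, X, Y, tr) for tr in splits])

# Hypothetical learner: fit a constant (the mean of Y).
fit_const = lambda Xtr, Ytr: (lambda x: np.full(len(x), Ytr.mean()))

n, nt, B = 300, 240, 20
X = rng.uniform(0, 1, n)
Y = 2.0 + rng.normal(0, 1.0, n)

# Monte-Carlo CV: B random training sets of size n_t
splits = [rng.choice(n, size=nt, replace=False) for _ in range(B)]
risk = cv_risk(fit_const, X, Y, splits)
```

For this learner the true risk is σ²(1 + 1/nt) ≈ 1, and the CV estimate lands near it.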

Cross-validation: examples

Exhaustive data splitting: all possible subsets of size nt
⇒ leave-one-out (nt = n − 1):
    R̂loo(ŝm; Dn) = (1/n) Σ_{j=1}^n γ( ŝm^(−j); ξj )
⇒ leave-p-out (nt = n − p)

V-fold cross-validation: B = (Bj)_{1≤j≤V} a partition of {1, . . . , n}
⇒ R̂vf(ŝm; Dn; B) = (1/V) Σ_{j=1}^V Pnj γ( ŝm^(−j) )

Monte-Carlo CV / repeated learning-testing: I1(t), . . . , IB(t) i.i.d. uniform
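The V-fold scheme above can be sketched in a few lines: partition the indices into V folds, and for each fold train on its complement and validate on the fold. This is a minimal sketch (libraries such as scikit-learn provide equivalent utilities); the least-squares contrast is assumed.

```python
import numpy as np

def vfold_splits(n, V, rng):
    """Partition {0, ..., n-1} into V folds B_j; each training set is the
    complement of one fold."""
    perm = rng.permutation(n)
    folds = np.array_split(perm, V)
    return [(np.concatenate(folds[:j] + folds[j + 1:]), folds[j])
            for j in range(V)]

def vfold_risk(fit, X, Y, V, rng):
    """V-fold CV criterion: mean over folds j of the validation risk
    P_{n_j} gamma(s_hat^(-j))."""
    risks = []
    for train, val in vfold_splits(len(X), V, rng):
        s_hat = fit(X[train], Y[train])
        risks.append(np.mean((s_hat(X[val]) - Y[val]) ** 2))
    return np.mean(risks)
```

Leave-one-out is recovered as the special case V = n, and Monte-Carlo CV replaces the partition by B independent uniform splits.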


Bias of cross-validation

In this talk, we always assume: ∀j, Card(Dn(t,j)) = nt
(for V-fold CV: Card(Bj) = n/V).

Ideal criterion: Pγ(ŝm(Dn)).

General analysis of the bias:
    E[ R̂cv( ŝm; Dn; (Ij(t))_{1≤j≤B} ) ] = E[ Pγ(ŝm(Dnt)) ]
⇒ everything depends on the mapping n ↦ E[ Pγ(ŝm(Dn)) ].

Note: the bias can be corrected in some settings (Burman, 1989).
Note: Dn ↦ ŝm(Dn) must be fixed before seeing any data; otherwise, stronger bias.

Bias of cross-validation: generic example

Assume
    E[ Pγ(ŝm(Dn)) ] = α(m) + β(m)/n
(e.g., LS/ridge/k-NN regression, LS/kernel density estimation).

⇒ E[ R̂cv( ŝm; Dn; (Ij(t))_{1≤j≤B} ) ] = α(m) + (n/nt) · β(m)/n = α(m) + β(m)/nt

⇒ Bias: decreases as a function of nt, minimal for nt = n − 1, negligible if nt ∼ n.
⇒ V-fold: the bias decreases when V increases, and vanishes as V → +∞.
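The formula E[CV] = α(m) + β(m)/nt can be checked by simulation in the simplest case. For the sample-mean estimator in regression with s* = 0 and noise variance σ², one has α(m) = σ² and β(m) = σ², so hold-out with nt training points should average to σ²(1 + 1/nt). This setup is a hypothetical choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

sigma, n, nt, reps = 1.0, 100, 50, 20_000

# Each row is one dataset; train the sample mean on the first n_t points,
# evaluate the least-squares contrast on the remaining n - n_t points.
Y = rng.normal(0, sigma, (reps, n))
mu = Y[:, :nt].mean(axis=1, keepdims=True)
cv_vals = ((Y[:, nt:] - mu) ** 2).mean(axis=1)

cv_mean = cv_vals.mean()
expected = sigma**2 + sigma**2 / nt   # alpha(m) + beta(m)/n_t = 1.02
```

With nt = n/2 the hold-out criterion overestimates E[Pγ(ŝ(Dn))] = σ²(1 + 1/n) by β(m)(1/nt − 1/n), exactly the bias discussed above.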

Variance of cross-validation

Hold-out (Nadeau & Bengio, 2003):
    var( Pn(v) γ(ŝm(t)) ) = (1/nv) E[ var_ξ( γ(u; ξ) )|_{u=ŝm(t)} ] + var( Pγ(ŝm(Dnt)) )

Monte-Carlo CV and the number of splits (p = n − nt):
    var( R̂cv( ŝm; Dn; (Ij(t))_{1≤j≤B} ) )
        = var( R̂lpo(ŝm; Dn) ) + (1/B) E[ var_{I(t)}( Pn(v) γ(ŝm(t)) | Dn ) ]
    (the second term is the "permutation variance")

V-fold CV: B, nt, nv are all tied to V.
Leave-one-out: related to stability? (empirical results)

Variance of the V-fold CV criterion

Least-squares density estimation (A. & Lerasle 2012), exact non-asymptotic computation:
    var( R̂vf(ŝm; Dn; B) ) = (1 + o(1)) varP(s*m)/n
        + (2/n²) ( 1 + 4/(V − 1) + O(1/V + 1/n) ) A(m)
(simplified formula, histogram model with bin size d_m^{-1}, A(m) ∝ d_m).

Linear regression, specific setting, asymptotic formula (Burman, 1989):
    var( R̂vf(ŝm; Dn; B) ) = 2σ²/n
        + (4σ⁴/n²) ( 4 + 4/(V − 1) + 2/(V − 1)² + 1/(V − 1)³ ) + o(n⁻²)


Risk estimation and estimator selection are different goals

m̂ ∈ argmin_{m∈M} { R̂cv(ŝm) }   vs.   m* ∈ argmin_{m∈M} { Pγ(ŝm(Dn)) }

For any Z (deterministic or random),
    m̂ ∈ argmin_{m∈M} { R̂cv(ŝm) + Z }
⇒ the bias and variance of R̂cv alone are meaningless for selection.

Perfect ranking among (ŝm)_{m∈M} ⇔ ∀m, m′ ∈ M,
    sign( R̂cv(ŝm) − R̂cv(ŝm′) ) = sign( Pγ(ŝm) − Pγ(ŝm′) )

⇒ E[ R̂cv(ŝm) − R̂cv(ŝm′) ] should be of the good sign
    (unbiased risk estimation heuristic: AIC, Cp, leave-one-out, ...)
⇒ var( R̂vf(ŝm) − R̂vf(ŝm′) ) should be minimal
    (detailed heuristic: A. & Lerasle 2012)

Bias and estimator selection: generic example

m̂ ∈ argmin_{m∈M} { R̂vf(ŝm) }   vs.   m* ∈ argmin_{m∈M} { Pγ(ŝm(Dn)) }

Assume
    E[ Pγ(ŝm(Dn)) ] = α(m) + β(m)/n
(e.g., LS/ridge/k-NN regression, LS/kernel density estimation).

Key quantities:
    E[ Pγ(ŝm) − Pγ(ŝm′) ] = α(m) − α(m′) + (β(m) − β(m′))/n
    E[ R̂cv(ŝm) − R̂cv(ŝm′) ] = α(m) − α(m′) + (n/nt) · (β(m) − β(m′))/n

⇒ CV favours m with smaller complexity β(m), more and more as nt decreases.

CV with an estimation goal: the big picture (M "small")

At first order, the bias drives the performance of leave-p-out, V-fold CV, and
Monte-Carlo CV with B ≫ n² or with nv large enough (including hold-out):
CV performs similarly to
    argmin_{m∈M} E[ Pγ(ŝm(Dnt)) ]
⇒ first-order optimality if nt ∼ n
⇒ suboptimal otherwise, e.g., V-fold CV with V fixed.

Theoretical results, at least for least-squares regression and density estimation.

Bias-corrected VFCV / V-fold penalization

Bias-corrected V-fold CV (Burman, 1989):
    R̂vf,corr(ŝm; Dn; B) := R̂vf(ŝm; Dn; B) + Pnγ(ŝm) − (1/V) Σ_{j=1}^V Pnγ( ŝm^(−j) )
                          = Pnγ(ŝm) + penVF(ŝm; Dn; B)
where penVF is the V-fold penalty (A. 2008).

In least-squares density estimation (A. & Lerasle, 2012):
    R̂vf(ŝm; Dn; B) = Pnγ(ŝm(Dn)) + [ 1 + 1/(2(V − 1)) ] penVF(ŝm; Dn; B)
    R̂lpo(ŝm; Dn; B) = Pnγ(ŝm(Dn)) + [ 1 + (1/2)(n/p − 1)^{-1} ] penVF(ŝm; Dn; Bloo)
where the bracketed factors are overpenalization factors.
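Burman's bias correction is straightforward to implement once the V-fold splits are available: it adds the full-sample training risk and subtracts the average training risk of the fold estimators. A minimal sketch, generic in the learner and contrast (both passed as functions; the constant-fit learner used below is a hypothetical example):

```python
import numpy as np

def burman_corrected_vfold(fit, contrast, X, Y, V, rng):
    """Burman's bias-corrected V-fold CV:
    R_vf + P_n gamma(s_hat) - (1/V) sum_j P_n gamma(s_hat^(-j))."""
    n = len(X)
    perm = rng.permutation(n)
    folds = np.array_split(perm, V)
    s_full = fit(X, Y)
    full_risk = contrast(s_full, X, Y).mean()          # P_n gamma(s_hat)
    vf, refit_full = [], []
    for j in range(V):
        train = np.concatenate([folds[i] for i in range(V) if i != j])
        s_j = fit(X[train], Y[train])                  # s_hat^(-j)
        vf.append(contrast(s_j, X[folds[j]], Y[folds[j]]).mean())
        refit_full.append(contrast(s_j, X, Y).mean())  # P_n gamma(s_hat^(-j))
    return np.mean(vf) + full_risk - np.mean(refit_full)
```

The correction removes the first-order bias β(m)(1/nt − 1/n) of plain V-fold CV while reusing the same fold estimators.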

Variance and estimator selection

∆(m, m′, V) := R̂vf,corr(ŝm) − R̂vf,corr(ŝm′)

Theorem (A. & Lerasle 2012, least-squares density estimation)

    var( ∆(m, m′, V) ) = 4 ( 1 + 2/n + 1/n² ) varP( s*m − s*m′ ) / n
        + 2 ( 1 + 4/(V − 1) − 1/n ) B(m, m′) / n²,   with B(m, m′) ≥ 0.

If Sm ⊂ Sm′ are two histogram models with constant bin sizes d_m^{-1}, d_{m′}^{-1},
then B(m, m′) ∝ ‖s*m − s*m′‖² d_m. The two terms are of the same order if
‖s*m − s*m′‖² ≈ d_m/n.

Variance of R̂vf,corr(ŝm) − R̂vf,corr(ŝm*) vs. (d_m, V)

[Figure: variance vs. model dimension for LOO, 10-fold, 5-fold, 2-fold, and E[penid]]

Probability of selection of every m

[Figure: P(m is selected) vs. dimension for LOO, 10-fold, 5-fold, 2-fold, and E[penid]]

Experiment (LS density estimation): V-fold CV

[Figure: risk ratio (corrected) vs. overpenalization factor C, for V = n (LOO), V = 10, V = 5, V = 2]
