Cross-validation for estimator selection

Sylvain Arlot (joint works with Alain Celisse, Matthieu Lerasle, Nelo Magalhães)

CNRS
École Normale Supérieure (Paris), DI/ENS, Équipe Sierra

IHP, Paris, April 9th, 2015

Main reference (survey): arXiv:0907.4728
Outline
1 Estimator selection
2 Cross-validation
3 Cross-validation for risk estimation
4 Cross-validation for estimator selection
5 Large M
6 Conclusion
Regression: data (X_1, Y_1), …, (X_n, Y_n)

[figure: scatter plot of the noisy observations]

Goal: predict Y given X, i.e., denoising

[figure: the same data with a denoised curve]
Estimator selection | Cross-validation | CV for risk estimation | CV for estimator selection | Large M | Conclusion
Prediction problem / regression

Data D_n: (X_1, Y_1), …, (X_n, Y_n) ∈ X × Y (i.i.d. ∼ P)
Contrast γ(t; (x, y)) measures how well t(x) "predicts" y
Goal: learn t ∈ S = {measurable functions X → Y} s.t.

    E_{(X,Y)∼P}[γ(t; (X, Y))] =: Pγ(t)  is minimal.

Example: regression, Y = ℝ,
least-squares contrast γ(t; (x, y)) = (t(x) − y)²
s⋆ ∈ argmin_{t∈S} Pγ(t) is the regression function: s⋆(X) = E[Y | X]

⇒ excess loss

    ℓ(s⋆, t) := Pγ(t) − Pγ(s⋆) = E[(t(X) − s⋆(X))²]
Density estimation: data ξ_1, …, ξ_n

[figure: sample points on [0, 1]]

Goal: estimate the common density s⋆ of the ξ_i
Problem: density estimation

Data D_n: ξ_1, …, ξ_n ∈ Ξ (i.i.d. ∼ P, density s⋆ w.r.t. µ)
Least-squares contrast γ(t; ξ) = ‖t‖²_{L²(µ)} − 2t(ξ)
Goal: learn t ∈ S = {measurable functions Ξ → ℝ} s.t.

    E_{ξ∼P}[γ(t; ξ)] =: Pγ(t)  is minimal.

    Pγ(t) = ∫ t² dµ − 2 ∫ t s⋆ dµ = ∫ (t − s⋆)² dµ − ‖s⋆‖²_{L²(µ)}

⇒ the true density s⋆ ∈ argmin_{t∈S} Pγ(t) and the excess loss is

    ℓ(s⋆, t) := Pγ(t) − Pγ(s⋆) = ‖t − s⋆‖²_{L²(µ)}
General setting

Data ξ_1, …, ξ_n ∈ Ξ i.i.d. with distribution P
    prediction: ξ_i = (X_i, Y_i) ∈ X × Y
Goal: estimate some feature s⋆ ∈ S of P
    density, regression function, Bayes predictor…
Contrast function γ : S × Ξ → ℝ such that

    s⋆ ∈ argmin_{t∈S} Pγ(t)   with   Pγ(t) := E_{ξ∼P}[γ(t; ξ)]

Excess loss

    ℓ(s⋆, t) := Pγ(t) − Pγ(s⋆) ≥ 0

Examples

Prediction: ξ_i = (X_i, Y_i)
    given X_{n+1}, "predict" Y_{n+1} with t(X_{n+1})
    γ(t; (x, y)) quantifies the "distance" between t(x) and y
Regression (Y = ℝ), least squares:
    γ(t; (x, y)) = (t(x) − y)²,   s⋆(X) = E[Y | X]
Binary classification (Y = {0, 1}), 0–1 contrast:
    γ(t; (x, y)) = 1_{t(x) ≠ y}
Density estimation (reference measure µ):
    least squares: γ(t; ξ) = ‖t‖²_{L²(µ)} − 2t(ξ)
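To make the contrasts concrete, here is a minimal Python sketch of the regression and classification contrasts and of the empirical risk P_nγ(t); the function names are ours, for illustration only:

```python
import numpy as np

# Least-squares regression contrast: gamma(t; (x, y)) = (t(x) - y)^2
def ls_regression_contrast(t, x, y):
    return (t(x) - y) ** 2

# 0-1 classification contrast: gamma(t; (x, y)) = 1_{t(x) != y}
def zero_one_contrast(t, x, y):
    return float(t(x) != y)

# Empirical risk P_n gamma(t) = (1/n) sum_i gamma(t; xi_i)
def empirical_risk(contrast, t, X, Y):
    return np.mean([contrast(t, x, y) for x, y in zip(X, Y)])

X = np.array([0.0, 0.5, 1.0])
Y = np.array([0.1, 0.4, 1.1])
t = lambda x: x  # a candidate predictor
risk = empirical_risk(ls_regression_contrast, t, X, Y)
```

Minimizing `empirical_risk` over a model S_m gives the least-squares estimator discussed later.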
Estimator selection (regression): regular regressograms

[figures: regressogram fits for four bin widths]
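A regular regressogram is the least-squares estimator on the model of piecewise-constant functions over a regular partition of [0, 1]; a sketch (illustrative code, with the bin count D playing the role of the model index m):

```python
import numpy as np

def regressogram(X, Y, D):
    """Piecewise-constant least-squares estimator on D regular bins of [0, 1]."""
    edges = np.linspace(0.0, 1.0, D + 1)
    # np.digitize maps each X_i to its bin in 1..D; clip keeps boundary points inside
    idx = np.clip(np.digitize(X, edges) - 1, 0, D - 1)
    means = np.array([Y[idx == j].mean() if np.any(idx == j) else 0.0
                      for j in range(D)])
    def s_hat(x):
        j = np.clip(np.digitize(x, edges) - 1, 0, D - 1)
        return means[j]
    return s_hat

rng = np.random.default_rng(0)
X = rng.uniform(size=200)
Y = np.sin(2 * np.pi * X) + 0.3 * rng.normal(size=200)
s_hat = regressogram(X, Y, D=8)
pred = s_hat(np.array([0.05, 0.55]))
```

Small D underfits (coarse bins), large D overfits (one point per bin), which is exactly the selection problem the slides illustrate.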
Estimator selection (regression): kernel ridge

[figures: kernel ridge fits for several values of the regularization parameter]
Estimator selection (regression): k nearest neighbours

[figures: k-NN fits for four values of k]
Estimator selection (regression): Nadaraya-Watson

[figures: Nadaraya-Watson fits for several bandwidths]
Estimator selection (density): regular histograms

[figures: histogram density estimates for four bin widths]
Estimator selection (density): Parzen, Gaussian kernel

[figures: Gaussian kernel density estimates for several bandwidths]
Estimator selection

Estimator/learning algorithm: ŝ : D_n ↦ ŝ(D_n) ∈ S
Example: least-squares estimator on some model S_m ⊂ S

    ŝ_m ∈ argmin_{t∈S_m} {P_n γ(t)}   where   P_n γ(t) := (1/n) Σ_{ξ∈D_n} γ(t; ξ)

Examples of models: histograms, span{φ_1, …, φ_D}

Estimator collection (ŝ_m)_{m∈M} ⇒ how to choose m̂ = m̂(D_n)?

Examples:
    model selection
    calibration of tuning parameters (choosing k or the distance for k-NN, choice of a regularization parameter, choice of a kernel, etc.)
    choice between different methods, e.g., k-NN vs. smoothing splines?
Estimator selection: two possible goals

Estimation goal: minimize the risk of the final estimator, i.e., prove an oracle inequality (in expectation or with a large probability):

    ℓ(s⋆, ŝ_m̂) ≤ C inf_{m∈M} ℓ(s⋆, ŝ_m) + R_n

Identification goal: select the (asymptotically) best model/estimator, assuming it is well-defined, i.e., prove selection consistency:

    P(m̂(D_n) = m⋆) → 1  as n → ∞.

Equivalent to estimation in the parametric setting.

Both goals with the same procedure (the AIC-BIC dilemma)? No in general (Yang, 2005). Sometimes possible.
Estimation goal: bias-variance trade-off

    E[ℓ(s⋆, ŝ_m)] = Bias + Variance

Bias, or approximation error:

    ℓ(s⋆, s_m⋆) = inf_{t∈S_m} ℓ(s⋆, t)

Variance, or estimation error; for OLS in regression:

    σ² dim(S_m) / n

⇔ avoid both overfitting and underfitting
Estimation goal: bias-variance trade-off

[figure: approximation error, estimation error, and E[Risk] as functions of the dimension D]
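The trade-off in the figure can be reproduced numerically with the generic shape risk(D) ≈ approximation error + σ²D/n; the approximation-error decay below is a toy choice of ours:

```python
import numpy as np

n, sigma2 = 1000, 1.0
D = np.arange(1, 101)
approx_error = 1.0 / D**2          # toy bias term, decreasing in D
estim_error = sigma2 * D / n       # OLS-type variance term, increasing in D
risk = approx_error + estim_error
D_star = D[np.argmin(risk)]        # dimension balancing the two terms
```

The minimizer sits where the two error curves are of comparable size, neither at D = 1 (underfitting) nor at D = 100 (overfitting).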
Validation principle

[figure: the full data sample]

Validation principle: learning sample

[figure: an estimator fitted on the learning sample]

Validation principle: validation sample

[figure: the fitted estimator evaluated on the held-out validation sample]
Cross-validation

    (X_1, Y_1), …, (X_{n_t}, Y_{n_t})   |   (X_{n_t+1}, Y_{n_t+1}), …, (X_n, Y_n)
    Training set D_n^(t) ⇒ ŝ_m^(t) = ŝ_m(D_n^(t))   |   Validation set D_n^(v) ⇒ evaluate the risk

hold-out estimator of the risk:

    P_n^(v) γ(ŝ_m^(t)) = (1/n_v) Σ_{ξ∈D_n^(v)} γ(ŝ_m^(t); ξ),   n_v = |D_n^(v)| = n − n_t

cross-validation: average several hold-out estimators

    R̂_cv(ŝ_m; D_n; (I_j^(t))_{1≤j≤B}) = (1/B) Σ_{j=1}^B P_n^(v,j) γ(ŝ_m^(t,j)),   D_n^(t,j) = (ξ_i)_{i∈I_j^(t)}

estimator selection:

    m̂ ∈ argmin_{m∈M} { R̂_cv(ŝ_m; D_n) }
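The hold-out and CV risk estimators above translate directly into code; a generic sketch where `fit(X, Y)` returns an estimator and `contrast` is γ (all names are ours, not from the slides):

```python
import numpy as np

def holdout_risk(fit, contrast, X, Y, train_idx):
    """Hold-out estimator: train on D^(t), average the contrast on D^(v)."""
    mask = np.zeros(len(X), dtype=bool)
    mask[train_idx] = True
    s_hat = fit(X[mask], Y[mask])                        # s_hat_m^(t)
    return np.mean(contrast(s_hat, X[~mask], Y[~mask]))  # P_n^(v) gamma

def cv_risk(fit, contrast, X, Y, train_idx_list):
    """CV estimator: average of hold-out estimators over the splits I_j^(t)."""
    return np.mean([holdout_risk(fit, contrast, X, Y, I) for I in train_idx_list])

# Toy example: constant predictor (empirical mean), least-squares contrast
fit = lambda Xt, Yt: (lambda x: np.full_like(np.asarray(x, float), Yt.mean()))
contrast = lambda s, x, y: (s(x) - y) ** 2

rng = np.random.default_rng(0)
X = rng.uniform(size=100)
Y = 2.0 + 0.5 * rng.normal(size=100)
splits = [rng.choice(100, size=80, replace=False) for _ in range(10)]
r = cv_risk(fit, contrast, X, Y, splits)
```

Estimator selection then picks the m minimizing `cv_risk` over the collection (ŝ_m)_{m∈M}.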
Cross-validation: examples

Exhaustive data splitting: all possible subsets of size n_t
⇒ leave-one-out (n_t = n − 1):

    R̂_loo(ŝ_m; D_n) = (1/n) Σ_{j=1}^n γ(ŝ_m^(−j); ξ_j)

⇒ leave-p-out (n_t = n − p)

V-fold cross-validation: B = (B_j)_{1≤j≤V} partition of {1, …, n}

    ⇒ R̂_vf(ŝ_m; D_n; B) = (1/V) Σ_{j=1}^V P_{n_j} γ(ŝ_m^(−j))

Monte-Carlo CV / repeated learning-testing: I_1^(t), …, I_B^(t) i.i.d. uniform
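The three split schemes differ only in how the training index sets I_j^(t) are drawn; a sketch (function names ours):

```python
import numpy as np

def loo_splits(n):
    """Leave-one-out: n splits, each training set omits one point."""
    return [np.delete(np.arange(n), j) for j in range(n)]

def vfold_splits(n, V, seed=0):
    """V-fold: partition {0,...,n-1} into V blocks, train on all but one block."""
    rng = np.random.default_rng(seed)
    blocks = np.array_split(rng.permutation(n), V)
    return [np.concatenate([b for i, b in enumerate(blocks) if i != j])
            for j in range(V)]

def mccv_splits(n, n_t, B, seed=0):
    """Monte-Carlo CV: B i.i.d. uniform training sets of size n_t."""
    rng = np.random.default_rng(seed)
    return [rng.choice(n, size=n_t, replace=False) for _ in range(B)]

splits = vfold_splits(12, V=4)
```

Any of these lists can be fed to a generic CV risk estimator, so the scheme is just a parameter of the procedure.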
Bias of cross-validation

In this talk, we always assume: ∀j, Card(D_n^(t,j)) = n_t.
For V-fold CV: Card(B_j) = n/V.

Ideal criterion: Pγ(ŝ_m(D_n)). General analysis for the bias:

    E[R̂_cv(ŝ_m; D_n; (I_j^(t))_{1≤j≤B})] = E[Pγ(ŝ_m(D_{n_t}))]

⇒ everything depends on the map n ↦ E[Pγ(ŝ_m(D_n))]

Note: the bias can be corrected in some settings (Burman, 1989).
Note: D_n ↦ ŝ_m(D_n) must be fixed before seeing any data; otherwise, stronger bias.
Bias of cross-validation: generic example

Assume

    E[Pγ(ŝ_m(D_n))] = α(m) + β(m)/n

(e.g., LS/ridge/k-NN regression, LS/kernel density estimation).

    ⇒ E[R̂_cv(ŝ_m; D_n; (I_j^(t))_{1≤j≤B})] = α(m) + (n/n_t) · β(m)/n

⇒ Bias:
    decreases as a function of n_t,
    minimal for n_t = n − 1,
    negligible if n_t ∼ n.

⇒ V-fold: the bias decreases when V increases and vanishes as V → +∞.
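Under the generic shape E[Pγ(ŝ_m(D_n))] = α(m) + β(m)/n, the CV bias is just the factor n/n_t applied to the β(m)/n term; a numerical illustration (α, β values are ours):

```python
n = 1000
alpha, beta = 0.5, 20.0              # illustrative alpha(m), beta(m)
true_risk = alpha + beta / n         # E[P gamma(s_hat_m(D_n))]

biases = {}
for V in (2, 5, 10, n):              # V-fold CV trains on n_t = n (1 - 1/V) points
    n_t = n * (1 - 1 / V)
    cv_mean = alpha + (n / n_t) * (beta / n)   # = alpha + beta / n_t
    biases[V] = cv_mean - true_risk            # = beta (1/n_t - 1/n) >= 0
```

The bias is largest for V = 2 and shrinks toward zero as V grows (leave-one-out at V = n), matching the slide's conclusion.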
Variance of cross-validation

Hold-out (Nadeau & Bengio, 2003):

    var(P_n^(v) γ(ŝ_m^(t))) = (1/n_v) E[ var(γ(u; ξ)) |_{u = ŝ_m^(t)} ] + var(Pγ(ŝ_m(D_{n_t})))

Monte-Carlo CV and number of splits (p = n − n_t):

    var(R̂_cv(ŝ_m; D_n; (I_j^(t))_{1≤j≤B})) = var(R̂_lpo(ŝ_m; D_n)) + (1/B) E[ var_{I^(t)}(P_n^(v) γ(ŝ_m^(t)) | D_n) ]

where the last term is the permutation variance.

V-fold CV: B, n_t, n_v are related.
leave-one-out: related to stability? (empirical results)
Variance of the V-fold CV criterion

Least-squares density estimation (A. & Lerasle 2012), exact non-asymptotic computation:

    var(R̂_vf(ŝ_m; D_n; B)) = (1 + o(1))/n · var_P(s_m⋆) + (2/n²)(1 + 4/(V−1) + O(1/V + 1/n)) A(m)

(simplified formula, histogram model with bin size d_m^{−1}, A(m) ≈ d_m)

Linear regression, specific setting, asymptotic formula (Burman, 1989):

    var(R̂_vf(ŝ_m; D_n; B)) = 2σ²/n + (4σ⁴/n²)(4 + 4/(V−1) + 2/(V−1)² + 1/(V−1)³) + o(n⁻²)
Risk estimation and estimator selection are different goals

    m̂ ∈ argmin_{m∈M} { R̂_cv(ŝ_m) }   vs.   m⋆ ∈ argmin_{m∈M} { Pγ(ŝ_m(D_n)) }

For any Z (deterministic or random),

    m̂ ∈ argmin_{m∈M} { R̂_cv(ŝ_m) + Z }

⇒ bias and variance of the risk estimator alone are meaningless.

Perfect ranking among (ŝ_m)_{m∈M} ⇔ ∀m, m′ ∈ M,

    sign(R̂_cv(ŝ_m) − R̂_cv(ŝ_{m′})) = sign(Pγ(ŝ_m) − Pγ(ŝ_{m′}))

⇒ E[R̂_cv(ŝ_m) − R̂_cv(ŝ_{m′})] should have the right sign (unbiased risk estimation heuristic: AIC, C_p, leave-one-out…)
⇒ var(R̂_vf(ŝ_m) − R̂_vf(ŝ_{m′})) should be minimal (detailed heuristic: A. & Lerasle 2012)
Bias and estimator selection: generic example

    m̂ ∈ argmin_{m∈M} { R̂_vf(ŝ_m) }   vs.   m⋆ ∈ argmin_{m∈M} { Pγ(ŝ_m(D_n)) }

Assume

    E[Pγ(ŝ_m(D_n))] = α(m) + β(m)/n

(e.g., LS/ridge/k-NN regression, LS/kernel density estimation). Key quantities:

    E[Pγ(ŝ_m) − Pγ(ŝ_{m′})] = α(m) − α(m′) + (β(m) − β(m′))/n
    E[R̂_cv(ŝ_m) − R̂_cv(ŝ_{m′})] = α(m) − α(m′) + (n/n_t) · (β(m) − β(m′))/n

⇒ CV favours m with smaller complexity β(m), more and more as n_t decreases.
CV with an estimation goal: the big picture (M "small")

At first order, the bias drives the performance of:
    leave-p-out, V-fold CV, Monte-Carlo CV if B ≫ n², or if n_v is large enough (including hold-out).

CV performs similarly to

    argmin_{m∈M} { E[Pγ(ŝ_m(D_{n_t}))] }

⇒ first-order optimality if n_t ∼ n
⇒ suboptimal otherwise, e.g., V-fold CV with V fixed.

Theoretical results for least-squares regression and density estimation at least.
Bias-corrected VFCV / V-fold penalization

Bias-corrected V-fold CV (Burman, 1989):

    R̂_vf,corr(ŝ_m; D_n; B) := R̂_vf(ŝ_m; D_n; B) + P_n γ(ŝ_m) − (1/V) Σ_{j=1}^V P_n γ(ŝ_m^(−j))
                             = P_n γ(ŝ_m) + pen_VF(ŝ_m; D_n; B)

where pen_VF is the V-fold penalty (A. 2008).

In least-squares density estimation (A. & Lerasle, 2012):

    R̂_vf(ŝ_m; D_n; B) = P_n γ(ŝ_m(D_n)) + [1 + 1/(2(V−1))] · pen_VF(ŝ_m; D_n; B)

    R̂_lpo(ŝ_m; D_n; B) = P_n γ(ŝ_m(D_n)) + [1 + 1/(2(n/p − 1))] · pen_VF(ŝ_m; D_n; B_loo)

with the bracketed terms being overpenalization factors.
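Burman's correction amounts to adding back the empirical risk of the full-sample fit and subtracting the average in-sample risk of the fold estimators; a sketch reusing generic `fit`/`contrast` callables (names ours):

```python
import numpy as np

def vfold_corrected(fit, contrast, X, Y, V, seed=0):
    """Bias-corrected V-fold CV (Burman, 1989):
    R_vf + P_n gamma(s_hat_m) - (1/V) sum_j P_n gamma(s_hat_m^{(-j)})."""
    n = len(X)
    rng = np.random.default_rng(seed)
    blocks = np.array_split(rng.permutation(n), V)
    pn_full = np.mean(contrast(fit(X, Y), X, Y))          # P_n gamma(s_hat_m)
    r_vf = pn_folds = 0.0
    for j in range(V):
        tr = np.concatenate([b for i, b in enumerate(blocks) if i != j])
        s_j = fit(X[tr], Y[tr])                           # s_hat_m^{(-j)}
        r_vf += np.mean(contrast(s_j, X[blocks[j]], Y[blocks[j]])) / V
        pn_folds += np.mean(contrast(s_j, X, Y)) / V      # P_n gamma(s_hat^{(-j)})
    return r_vf + pn_full - pn_folds, r_vf                # (corrected, plain VF)

fit = lambda Xt, Yt: (lambda x: np.full_like(np.asarray(x, float), Yt.mean()))
contrast = lambda s, x, y: (s(x) - y) ** 2
rng = np.random.default_rng(1)
X = rng.uniform(size=100)
Y = 1.0 + 0.5 * rng.normal(size=100)
corrected, plain = vfold_corrected(fit, contrast, X, Y, V=5)
```

For this least-squares toy model the correction can only lower the criterion, since the full-sample fit minimizes the empirical risk over constants.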
Variance and estimator selection

    Δ(m, m′, V) = R̂_vf,corr(ŝ_m) − R̂_vf,corr(ŝ_{m′})

Theorem (A. & Lerasle 2012, least-squares density estimation)

    var(Δ(m, m′, V)) = 4(1 + 2/n + 1/n²) · var_P(s_m⋆ − s_{m′}⋆)/n + 2(1 + 4/(V−1) − 1/n) · B(m, m′)/n²

with B(m, m′) ≥ 0. If S_m ⊂ S_{m′} are two histogram models with constant bin sizes d_m^{−1}, d_{m′}^{−1}, then B(m, m′) ∝ ‖s_m⋆ − s_{m′}⋆‖ d_m. The two terms are of the same order if ‖s_m⋆ − s_{m′}⋆‖ ≈ d_m/n.
Variance of R̂_vf,corr(ŝ_m) − R̂_vf,corr(ŝ_{m⋆}) vs. (d_m, V)

[figure: variance as a function of the dimension d_m, for LOO, 10-fold, 5-fold, 2-fold, and E[pen_id]]
Probability of selection of every m

[figure: P(m is selected) as a function of the dimension, for LOO, 10-fold, 5-fold, 2-fold, and E[pen_id]]
Experiment (LS density estimation): V-fold CV

[figure: risk ratio (range ≈ 1.9–2.5) as a function of the overpenalization factor C (1 to 2), for V = n (LOO), V = 10, V = 5, V = 2]