Estimator selection Cross-validation CV for risk estimation CV for estimator selection Conclusion on CV Agghoo
Cross-validation for estimator selection or aggregation
Sylvain Arlot (joint works with Alain Celisse, Matthieu Lerasle, Nelo Magalhães, Guillaume Maillard)
Université Paris-Sud, Laboratoire de Mathématiques d’Orsay
Data and Analytics for Short-Term Operations, Cambridge February 27, 2019
Survey: arXiv:0907.4728 (& arXiv:1703.03167). CV in $L^2$ density estimation: arXiv:1210.5830
Aggregated hold-out: arXiv:1709.03702
Outline
1 Estimator selection
2 Cross-validation
3 Cross-validation for risk estimation
4 Cross-validation for estimator selection
5 Conclusion on CV
6 Combining cross-validation with aggregation
Regression: data $(X_1, Y_1), \dots, (X_n, Y_n)$

[Figure: scatter plot of a noisy sample $(X_i, Y_i)_{1 \le i \le n}$ with $X_i \in [0, 1]$]
Goal: predict $Y$ given $X$, i.e., denoising

[Figure: the same noisy sample]
General setting: prediction
Data: $D_n = (X_i, Y_i)_{1 \le i \le n} \in (\mathcal{X} \times \mathcal{Y})^n$, assumed i.i.d. $\sim P$

Predictor: $f : \mathcal{X} \to \mathcal{Y}$ ($\mathcal{F}$: set of all predictors)

Risk (prediction error): $\mathcal{R}(f) = \mathbb{E}\bigl[c\bigl(f(X), Y\bigr)\bigr]$, minimal for $f = f^\star$

LS regression: $c(y, y') = (y - y')^2$, $f^\star(X) = \mathbb{E}[Y \mid X]$ and $\mathcal{R}(f) - \mathcal{R}(f^\star) = \mathbb{E}\bigl[(f(X) - f^\star(X))^2\bigr]$

Goal: from $D_n$ only, find $f \in \mathcal{F}$ with $\mathcal{R}(f)$ (nearly) minimal.
Examples: regression, classification
More general setting possible, including density estimation with LS or KL risk.
Estimator selection (regression): regular regressograms
[Figure: four regressogram fits on regular partitions with different numbers of bins]
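The regular regressogram above can be sketched in a few lines of numpy. This is an illustrative reimplementation, not the code behind the slides; the function name and the restriction of $X$ to $[0, 1]$ are assumptions.

```python
import numpy as np

def regressogram(x_train, y_train, x_new, n_bins):
    """Least-squares fit that is piecewise constant on a regular partition
    of [0, 1] into n_bins cells: predict, in each cell, the mean of the
    training responses falling in it (0 for empty cells)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Cell index of each point; clip so that x == 1 lands in the last cell.
    bin_of = lambda x: np.clip(np.searchsorted(edges, x, side="right") - 1,
                               0, n_bins - 1)
    means = np.zeros(n_bins)
    idx_train = bin_of(x_train)
    for b in range(n_bins):
        in_cell = idx_train == b
        if in_cell.any():
            means[b] = y_train[in_cell].mean()
    return means[bin_of(x_new)]
```

Varying `n_bins` produces the four panels above, from underfitting (few bins) to overfitting (many bins).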
Estimator selection (regression): kernel ridge
[Figure: four kernel ridge regression fits with different tuning parameters]
Estimator selection (regression): k nearest neighbours
[Figure: four $k$-nearest-neighbours fits with different values of $k$]
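A minimal $k$-nearest-neighbours regressor in the same spirit (an illustrative brute-force 1-D sketch; the function name is an assumption):

```python
import numpy as np

def knn_regress(x_train, y_train, x_new, k):
    """Predict, at each query point, the mean response of its k nearest
    training points (brute-force pairwise distances, 1-D inputs)."""
    dist = np.abs(x_new[:, None] - x_train[None, :])   # all pairwise |x - x'|
    nearest = np.argsort(dist, axis=1)[:, :k]          # k closest indices
    return y_train[nearest].mean(axis=1)
```

Here $k$ tunes the bias-variance trade-off: $k = 1$ interpolates the data, $k = n$ predicts the global mean everywhere.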
Estimator selection

Estimator / learning algorithm: $\hat f : D_n \mapsto \hat f(D_n) \in \mathcal{F}$

Example: least-squares estimator on some model $S_m \subset \mathcal{F}$:
$$\hat f_m \in \operatorname*{arg\,min}_{f \in S_m} \bigl\{ \widehat{\mathcal{R}}_n(f) \bigr\} \quad \text{where} \quad \widehat{\mathcal{R}}_n(f) := \frac{1}{n} \sum_{(X_i, Y_i) \in D_n} c\bigl(f(X_i), Y_i\bigr)$$

Examples of models: histograms, $\operatorname{span}\{\varphi_1, \dots, \varphi_D\}$

Estimator collection $(\hat f_m)_{m \in \mathcal{M}}$ ⇒ choose $\hat m = \hat m(D_n)$?

Examples:
model selection
calibration of tuning parameters (choosing $k$ or the distance for $k$-NN, choice of a regularization parameter, etc.)
choice between different methods (e.g., random forests vs. SVM?)
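The least-squares estimator on a model $S_m = \operatorname{span}\{\varphi_1, \dots, \varphi_D\}$ reduces to an ordinary linear least-squares problem. A sketch, where the polynomial family used as the collection $(\hat f_m)_{m}$ is an assumed example:

```python
import numpy as np

def ls_on_model(X, Y, basis):
    """Empirical risk minimizer over S_m = span{phi_1, ..., phi_D}:
    solve min_f (1/n) sum_i (f(X_i) - Y_i)^2 by ordinary least squares."""
    design = lambda x: np.column_stack([phi(x) for phi in basis])
    coef, *_ = np.linalg.lstsq(design(X), Y, rcond=None)
    return lambda x: design(x) @ coef

# An estimator collection (f_hat_m)_{m in M}: polynomial models of degree m.
collection = {m: [lambda x, d=d: x**d for d in range(m + 1)] for m in range(4)}
```

Each `collection[m]` is one model $S_m$; selecting $m$ is exactly the estimator-selection problem of this slide.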
Estimator selection: two possible goals

Estimation goal: minimize the risk of the final estimator, i.e., oracle inequality (in expectation or with a large probability):
$$\mathcal{R}\bigl(\hat f_{\hat m}\bigr) - \mathcal{R}(f^\star) \le C \inf_{m \in \mathcal{M}} \bigl\{ \mathcal{R}(\hat f_m) - \mathcal{R}(f^\star) \bigr\} + R_n$$

Identification goal: select the (asymptotically) best model/estimator, assuming it is well defined, i.e., selection consistency:
$$\mathbb{P}\bigl(\hat m(D_n) = m^\star\bigr) \xrightarrow[n \to \infty]{} 1.$$
Equivalent to estimation in the parametric setting.

Both goals with the same procedure (AIC-BIC dilemma)? No in general (Yang, 2005). Sometimes possible.
Estimation goal: bias-variance trade-off

$$\mathbb{E}\bigl[\mathcal{R}(\hat f_m)\bigr] - \mathcal{R}(f^\star) = \text{Bias} + \text{Variance}$$

Bias, or approximation error: $\mathcal{R}(f_m^\star) - \mathcal{R}(f^\star) = \inf_{f \in S_m} \mathcal{R}(f) - \mathcal{R}(f^\star)$

Variance, or estimation error; for OLS in regression: $\sigma^2 \dim(S_m) / n$

Bias-variance trade-off ⇔ avoid overfitting and underfitting
Validation principle: data splitting

[Figure sequence: the sample is split into a training/learning sample, on which the estimator is fit, and a validation sample, on which its risk is evaluated]
Cross-validation

Split the data: $(X_1, Y_1), \dots, (X_{n_t}, Y_{n_t})$ = training set $D_n^{(t)}$ ⇒ $\hat f_m^{(t)} = \hat f_m\bigl(D_n^{(t)}\bigr)$; $(X_{n_t+1}, Y_{n_t+1}), \dots, (X_n, Y_n)$ = validation set $D_n^{(v)}$ ⇒ evaluate the risk.

Hold-out estimator of the risk:
$$\widehat{\mathcal{R}}_n^{(v)}\bigl(\hat f_m^{(t)}\bigr) = \frac{1}{n_v} \sum_{(X_i, Y_i) \in D_n^{(v)}} c\bigl(\hat f_m^{(t)}(X_i), Y_i\bigr), \qquad n_v = \bigl|D_n^{(v)}\bigr| = n - n_t$$

Cross-validation: average several hold-out estimators:
$$\widehat{\mathcal{R}}^{\mathrm{cv}}\bigl(\hat f_m; D_n; (I_j^{(t)})_{1 \le j \le V}\bigr) = \frac{1}{V} \sum_{j=1}^{V} \widehat{\mathcal{R}}_n^{(v,j)}\bigl(\hat f_m^{(t,j)}\bigr), \qquad D_n^{(t,j)} = (X_i, Y_i)_{i \in I_j^{(t)}}$$

Estimator selection:
$$\hat m^{\mathrm{cv}}\bigl(D_n; (I_j^{(t)})_{1 \le j \le V}\bigr) \in \operatorname*{arg\,min}_{m \in \mathcal{M}} \bigl\{ \widehat{\mathcal{R}}^{\mathrm{cv}}(\hat f_m; D_n) \bigr\} \;\Rightarrow\; \hat f_{\hat m^{\mathrm{cv}}(D_n; (I_j^{(t)})_{1 \le j \le V})}(D_n)$$
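The hold-out and CV risk estimators translate directly into code. A generic sketch, where `learn` stands for the learning rule $\hat f_m$ and `loss` for the contrast $c$ (both hypothetical callables, not an API from the talk):

```python
import numpy as np

def holdout_risk(learn, X, Y, train_idx, loss):
    """Train on D^(t) = (X_i, Y_i)_{i in train_idx}, then average the loss
    over the remaining n_v = n - n_t validation points."""
    mask = np.zeros(len(X), dtype=bool)
    mask[train_idx] = True
    f_t = learn(X[mask], Y[mask])                       # f_hat^(t)
    return float(np.mean(loss(f_t(X[~mask]), Y[~mask])))

def cv_risk(learn, X, Y, train_sets, loss):
    """Cross-validation: average the hold-out estimators over the V splits."""
    return float(np.mean([holdout_risk(learn, X, Y, I, loss)
                          for I in train_sets]))

# A trivial learner (always predicts 0) and the squared loss:
const_zero = lambda X, Y: (lambda x: np.zeros_like(x, dtype=float))
sq = lambda a, b: (a - b) ** 2
X, Y = np.arange(6.0), np.ones(6)
print(cv_risk(const_zero, X, Y,
              [np.array([0, 1, 2]), np.array([3, 4, 5])], sq))  # → 1.0
```

Selecting $\hat m^{\mathrm{cv}}$ is then just the arg-min of `cv_risk` over the collection.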
Cross-validation: examples

Exhaustive data splitting: all possible subsets of size $n_t$:
leave-one-out ($n_t = n - 1$): $\widehat{\mathcal{R}}^{\mathrm{loo}}\bigl(\hat f_m; D_n\bigr) = \frac{1}{n} \sum_{j=1}^{n} c\bigl(\hat f_m^{(-j)}(X_j), Y_j\bigr)$
leave-$p$-out ($n_t = n - p$)

$V$-fold cross-validation: $B = (B_j)_{1 \le j \le V}$ partition of $\{1, \dots, n\}$:
$$\widehat{\mathcal{R}}^{\mathrm{vf}}\bigl(\hat f_m; D_n; B\bigr) = \frac{1}{V} \sum_{j=1}^{V} \widehat{\mathcal{R}}_n^{j}\bigl(\hat f_m^{(-j)}\bigr)$$

Monte-Carlo CV / repeated learning-testing: $I_1^{(t)}, \dots, I_V^{(t)}$ i.i.d. uniform
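The three schemes differ only in how the training index sets $I_j^{(t)}$ are drawn; a sketch (function names are assumptions):

```python
import numpy as np

def loo_splits(n):
    """Leave-one-out: the n training sets of size n - 1."""
    return [np.delete(np.arange(n), j) for j in range(n)]

def vfold_splits(n, V, rng):
    """V-fold: complements of the V blocks of a random partition of {0..n-1}."""
    blocks = np.array_split(rng.permutation(n), V)
    return [np.setdiff1d(np.arange(n), B) for B in blocks]

def mccv_splits(n, n_t, V, rng):
    """Monte-Carlo CV: V i.i.d. uniform training sets of size n_t."""
    return [rng.choice(n, size=n_t, replace=False) for _ in range(V)]
```

Any of these lists can be fed to a generic CV routine that averages hold-out risk estimates over the splits.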
Bias of cross-validation

In this talk, we always assume: $\forall j$, $\operatorname{Card}\bigl(D_n^{(t,j)}\bigr) = n_t$. For $V$-fold CV: $\operatorname{Card}(B_j) = n / V$ ⇒ $n_t = n(V - 1)/V$.

Ideal criterion: $\mathcal{R}\bigl(\hat f_m(D_n)\bigr)$

General analysis of the bias:
$$\mathbb{E}\Bigl[\widehat{\mathcal{R}}^{\mathrm{cv}}\bigl(\hat f_m; D_n; (I_j^{(t)})_{1 \le j \le V}\bigr)\Bigr] = \mathbb{E}\Bigl[\mathcal{R}\bigl(\hat f_m(D_{n_t})\bigr)\Bigr]$$
⇒ everything depends on the map $n \mapsto \mathbb{E}\bigl[\mathcal{R}\bigl(\hat f_m(D_n)\bigr)\bigr]$

Note: the bias can be corrected in some settings (Burman, 1989).
Note: the learning rule $D_n \mapsto \hat f_m(D_n)$ must be fixed before seeing any data; otherwise (e.g., a data-driven model $m$), the bias is stronger.
Bias of cross-validation: generic example

Assume:
$$\mathbb{E}\bigl[\mathcal{R}\bigl(\hat f_m(D_n)\bigr)\bigr] = \alpha(m) + \frac{\beta(m)}{n}$$
(e.g., LS/ridge/$k$-NN regression, LS/kernel density estimation). Then
$$\mathbb{E}\Bigl[\widehat{\mathcal{R}}^{\mathrm{cv}}\bigl(\hat f_m; D_n; (I_j^{(t)})_{1 \le j \le V}\bigr)\Bigr] = \alpha(m) + \frac{n}{n_t} \cdot \frac{\beta(m)}{n}$$

⇒ Bias: decreases as a function of $n_t$, minimal for $n_t = n - 1$, negligible if $n_t \sim n$.
⇒ $V$-fold: the bias decreases when $V$ increases, and vanishes as $V \to +\infty$.
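The $n/n_t$ inflation of the $\beta(m)/n$ term can be checked on the simplest learner, the sample mean of $Y = \mu + \varepsilon$ under squared loss, for which $\alpha(m) = \beta(m) = \sigma^2$ exactly (an illustrative simulation, not from the talk):

```python
import numpy as np

# Sample mean fitted on n_t points: the exact risk of the constant c is
# (c - mu)^2 + sigma^2, so E[risk] = sigma^2 + sigma^2/n_t
#                                  = alpha + (n/n_t) * beta/n.
rng = np.random.default_rng(0)
sigma, n, n_t, reps = 1.0, 10, 5, 200_000

c_hat = (sigma * rng.standard_normal((reps, n_t))).mean(axis=1)
mc_risk = (c_hat ** 2 + sigma ** 2).mean()              # Monte-Carlo estimate
predicted = sigma ** 2 + (n / n_t) * (sigma ** 2 / n)   # alpha + (n/n_t)*beta/n
print(round(mc_risk, 2), predicted)                     # both close to 1.2
```

With $n_t = n/2$ the $\beta/n$ term is doubled, exactly the bias of 2-fold CV in this toy model.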
Variance of cross-validation: general case

Hold-out (Nadeau & Bengio, 2003):
$$\operatorname{var}\Bigl(\widehat{\mathcal{R}}_n^{(v)}\bigl(\hat f_m^{(t)}\bigr)\Bigr) = \frac{1}{n_v} \mathbb{E}\Bigl[\operatorname{var}\Bigl(c\bigl(f(X), Y\bigr) \Bigm| f = \hat f_m^{(t)}\Bigr)\Bigr] + \operatorname{var}\Bigl(\mathcal{R}\bigl(\hat f_m(D_{n_t})\bigr)\Bigr)$$

Monte-Carlo CV and number of splits ($p = n - n_t$):
$$\operatorname{var}\Bigl(\widehat{\mathcal{R}}^{\mathrm{cv}}\bigl(\hat f_m; D_n; (I_j^{(t)})_{1 \le j \le V}\bigr)\Bigr) = \operatorname{var}\Bigl(\widehat{\mathcal{R}}^{\ell\mathrm{po}}\bigl(\hat f_m; D_n\bigr)\Bigr) + \underbrace{\frac{1}{V} \mathbb{E}\Bigl[\operatorname{var}_{I^{(t)}}\Bigl(\widehat{\mathcal{R}}_n^{(v)}\bigl(\hat f_m^{(t)}\bigr) \Bigm| D_n\Bigr)\Bigr]}_{\text{permutation variance}}$$

$V$-fold CV: $V$, $n_t$, $n_v$ are related.
Leave-one-out: related to stability? (empirical results)
Variance of the $V$-fold CV criterion

Least-squares density estimation (A. & Lerasle, 2016), exact non-asymptotic computation:
$$\operatorname{var}\Bigl(\widehat{\mathcal{R}}^{\mathrm{vf}}\bigl(\hat f_m; D_n; B\bigr)\Bigr) = \frac{1 + o(1)}{n} \operatorname{var}_P(f_m^\star) + \frac{2}{n^2} \Bigl(1 + \frac{4}{V - 1} + O\Bigl(\frac{1}{V} + \frac{1}{n}\Bigr)\Bigr) A(m)$$
(simplified formula, for a histogram model with bin size $d_m^{-1}$, $A(m) \approx d_m$)

Linear regression, asymptotic formula (Burman, 1989):
$$\operatorname{var}\Bigl(\widehat{\mathcal{R}}^{\mathrm{vf}}\bigl(\hat f_m; D_n; B\bigr)\Bigr) = \frac{2\sigma^2}{n} + \frac{4\sigma^4}{n^2} \Bigl(4 + \frac{4}{V - 1} + \frac{2}{(V - 1)^2} + \frac{1}{(V - 1)^3}\Bigr) + o(n^{-2})$$

⇒ decreasing with $V$; the dependence on $V$ appears only in second-order terms.
Risk estimation and estimator selection are different goals

$$\hat m \in \operatorname*{arg\,min}_{m \in \mathcal{M}} \bigl\{ \widehat{\mathcal{R}}^{\mathrm{cv}}(\hat f_m) \bigr\} \qquad \text{vs.} \qquad m^\star \in \operatorname*{arg\,min}_{m \in \mathcal{M}} \bigl\{ \mathcal{R}\bigl(\hat f_m(D_n)\bigr) \bigr\}$$

For any $Z$ (deterministic or random), $\hat m \in \operatorname*{arg\,min}_{m \in \mathcal{M}} \bigl\{ \widehat{\mathcal{R}}^{\mathrm{cv}}(\hat f_m) + Z \bigr\}$ ⇒ the bias and variance of the risk estimator are meaningless by themselves.

Perfect ranking among $(\hat f_m)_{m \in \mathcal{M}}$ ⇔ $\forall m, m' \in \mathcal{M}$,
$$\operatorname{sign}\Bigl(\widehat{\mathcal{R}}^{\mathrm{cv}}(\hat f_m) - \widehat{\mathcal{R}}^{\mathrm{cv}}(\hat f_{m'})\Bigr) = \operatorname{sign}\Bigl(\mathcal{R}(\hat f_m) - \mathcal{R}(\hat f_{m'})\Bigr)$$
⇒ $\mathbb{E}\bigl[\widehat{\mathcal{R}}^{\mathrm{cv}}(\hat f_m) - \widehat{\mathcal{R}}^{\mathrm{cv}}(\hat f_{m'})\bigr]$ should have the right sign (unbiased risk estimation heuristic: AIC, $C_p$, leave-one-out, ...)
⇒ $\operatorname{var}\bigl(\widehat{\mathcal{R}}^{\mathrm{cv}}(\hat f_m) - \widehat{\mathcal{R}}^{\mathrm{cv}}(\hat f_{m'})\bigr)$ should be minimal (detailed heuristic: A. & Lerasle, 2016)
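The first point, that any shift $Z$ of the criterion leaves the selected $\hat m$ unchanged, so the estimator's own bias and variance are not what matters, is one line of code (illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
cv_crit = rng.standard_normal(10)     # CV risk estimates of 10 estimators
Z = rng.standard_normal()             # any shift, deterministic or random
# The arg-min, i.e. the selected estimator m_hat, is shift-invariant:
print(np.argmin(cv_crit) == np.argmin(cv_crit + Z))   # → True
```

Only the differences between criterion values, their signs and fluctuations, affect which estimator is picked.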
CV with an estimation goal: the big picture ($\mathcal{M}$ “small”)

At first order, the bias drives the performance of: leave-$p$-out; $V$-fold CV; Monte-Carlo CV if $V \gg n^2$; or any CV with $n_v$ large enough (including hold-out). CV then performs similarly to
$$\operatorname*{arg\,min}_{m \in \mathcal{M}} \Bigl\{ \mathbb{E}\bigl[\mathcal{R}\bigl(\hat f_m(D_{n_t})\bigr)\bigr] \Bigr\}$$
⇒ first-order optimality if $n_t \sim n$
⇒ suboptimal otherwise, e.g., $V$-fold CV with $V$ fixed.

Theoretical results at least for least-squares regression and density estimation.
Bias-corrected VFCV / $V$-fold penalization

Bias-corrected $V$-fold CV (Burman, 1989):
$$\widehat{\mathcal{R}}^{\mathrm{vf,corr}}\bigl(\hat f_m; D_n; B\bigr) := \widehat{\mathcal{R}}^{\mathrm{vf}}\bigl(\hat f_m; D_n; B\bigr) + \widehat{\mathcal{R}}_n\bigl(\hat f_m\bigr) - \frac{1}{V} \sum_{j=1}^{V} \widehat{\mathcal{R}}_n\bigl(\hat f_m^{(-j)}\bigr) = \widehat{\mathcal{R}}_n\bigl(\hat f_m(D_n)\bigr) + \underbrace{\operatorname{pen}_{\mathrm{VF}}\bigl(\hat f_m; D_n; B\bigr)}_{V\text{-fold penalty (A., 2008)}}$$

In least-squares density estimation (A. & Lerasle, 2016):
$$\widehat{\mathcal{R}}^{\mathrm{vf}}\bigl(\hat f_m; D_n; B\bigr) = \widehat{\mathcal{R}}_n\bigl(\hat f_m(D_n)\bigr) + \underbrace{\Bigl(1 + \frac{1}{2(V - 1)}\Bigr)}_{\text{overpenalization factor}} \operatorname{pen}_{\mathrm{VF}}\bigl(\hat f_m; D_n; B\bigr)$$
$$\widehat{\mathcal{R}}^{\ell\mathrm{po}}\bigl(\hat f_m; D_n\bigr) = \widehat{\mathcal{R}}_n\bigl(\hat f_m(D_n)\bigr) + \Bigl(1 + \frac{1}{2(n/p - 1)}\Bigr) \operatorname{pen}_{\mathrm{VF}}\bigl(\hat f_m; D_n; B_{\mathrm{loo}}\bigr)$$
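Burman's corrected criterion can be sketched generically; `learn` and `loss` again stand for the learning rule and the contrast $c$ (hypothetical callables, not code from the talk):

```python
import numpy as np

def vfold_corrected(learn, X, Y, blocks, loss):
    """Bias-corrected V-fold CV (Burman, 1989):
    R_vf + R_n(f_hat) - (1/V) * sum_j R_n(f_hat^(-j)),
    equivalently R_n(f_hat) plus the V-fold penalty."""
    n, V = len(X), len(blocks)
    emp_risk = lambda f: float(np.mean(loss(f(X), Y)))  # R_n on the full sample
    r_vf = r_fold = 0.0
    for B in blocks:                      # B: validation indices of fold j
        mask = np.ones(n, dtype=bool)
        mask[B] = False
        f_j = learn(X[mask], Y[mask])     # f_hat^(-j)
        r_vf += float(np.mean(loss(f_j(X[B]), Y[B])))    # hold-out term
        r_fold += emp_risk(f_j)                          # R_n(f_hat^(-j))
    return r_vf / V + emp_risk(learn(X, Y)) - r_fold / V
```

Sanity check of the correction: for a learning rule that ignores the data, every term coincides with $\widehat{\mathcal{R}}_n$ and the criterion reduces exactly to the empirical risk, as it should.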
Variance and estimator selection

$$\Delta(m, m', V) = \widehat{\mathcal{R}}^{\mathrm{vf,corr}}\bigl(\hat f_m\bigr) - \widehat{\mathcal{R}}^{\mathrm{vf,corr}}\bigl(\hat f_{m'}\bigr)$$

Theorem (A. & Lerasle, 2016, least-squares density estimation):
$$\operatorname{var}\bigl(\Delta(m, m', V)\bigr) = 4 \Bigl(1 + \frac{2}{n} + \frac{1}{n^2}\Bigr) \frac{\operatorname{var}_P(f_m^\star - f_{m'}^\star)}{n} + 2 \Bigl(1 + \frac{4}{V - 1} - \frac{1}{n}\Bigr) \underbrace{\frac{B(m, m')}{n^2}}_{> 0}$$

If $S_m \subset S_{m'}$ are two histogram models with constant bin sizes $d_m^{-1}$, $d_{m'}^{-1}$, then $B(m, m') \propto \|f_m^\star - f_{m'}^\star\| \, d_m$.
The two terms are of the same order if $\|f_m^\star - f_{m'}^\star\| \approx d_m / n$.