Optimal data-driven estimator selection with minimal penalties

(1)

Sylvain Arlot (joint works with F. Bach, P. Massart & M. Solnon)

¹CNRS

²École Normale Supérieure (Paris), DI/ENS, Équipe Sierra

CIRM, December 12–13, 2013

(2)

Plan

1 Motivation

2 Slope heuristics for ordinary least-squares
   Framework
   Optimal model selection
   Minimal penalty and the slope heuristics
   Theoretical results
   Variance estimation
   Practical considerations

3 Generalization: minimal penalties
   Linear estimators
   Slope heuristics for linear estimators
   Minimal penalty algorithm for linear estimators
   Minimal penalty algorithm in general

4 Application to multi-task learning

5 Conclusion

(4)

Regression: data $(x_1, Y_1), \ldots, (x_n, Y_n)$

(5)

Goal: find the signal (denoising)


(6)

Estimators: example: regressogram


(7)

Estimators: regressogram, ridge, k-NN, NW

(8)

Estimator selection: kernel ridge


(9)

Estimator selection

Estimator collection $(\widehat{F}_m)_{m \in \mathcal{M}}$ ⇒ $\widehat{m}(Y)$? Goal: minimize the risk

Examples:

model selection

parameter tuning (choosing $k$ or the distance for k-NN, choice of a regularization parameter, choice of a kernel, etc.)

choice between different methods, e.g., k-NN vs. kernel ridge?

Classical approaches and their limitations:

cross-validation: computational cost

penalization: unknown constants

elbow heuristics: no clear definition/justification


(12)

Penalties known up to a constant factor

$$\widehat{m}(Y) \in \operatorname*{argmin}_{m \in \mathcal{M}} \left\{ \text{Emp. risk}(\widehat{F}_m) + \operatorname{pen}(m) \right\}$$

Optimal penalties depending on the noise level $\sigma^2$ (Mallows, 1973):

$$\operatorname{pen}_{C_p}(m) = \frac{2\sigma^2 D_m}{n} \qquad \operatorname{pen}_{C_L}(m) = \frac{2\sigma^2 \operatorname{tr}(A_m)}{n}$$

Remark: various methods for estimating $\sigma^2$ or avoiding its estimation (FPE, Akaike, 1970; GCV, Craven & Wahba, 1978; Baraud, Giraud & Huet, 2009).

Optimal penalty known asymptotically (AIC; Akaike, 1973)

Resampling-based penalties

Optimal constant unknown even in theory (change-point detection, mixture models, global/local Rademacher complexities, ...)

Goals: estimation of the optimal constant (e.g., $\sigma^2$) for estimator selection, under minimal assumptions, without overfitting


(15)

“L-curve” and elbow heuristics?

[Figure: empirical error; horizontal axis 0 to 1000, vertical scale ×10⁻³]

(16)

“L-curve” and elbow heuristics?

[Figure: empirical / generalization error; vertical scale ×10⁻³]


(18)

Statistical framework: regression, least-squares loss

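Observations: $Y = (Y_1, \ldots, Y_n) \in \mathbb{R}^n$

$Y_i = F_i + \varepsilon_i$ (e.g., $F_i = F(x_i)$) with $Y_i \in \mathbb{R}$, $(\varepsilon_i)_{1 \le i \le n}$ i.i.d.

Fixed design: $x_i \in \mathcal{X}$ deterministic

Least-squares loss of a predictor $t \in \mathbb{R}^n$ (“$t_i = t(x_i)$”):

$$\frac{1}{n}\|t - F\|^2 = \frac{1}{n}\sum_{i=1}^{n} (t_i - F_i)^2$$

⇒ Estimator $\widehat{F}(Y) \in \mathbb{R}^n$?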


(22)

Least-squares estimators

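Natural idea: minimize an estimator of the risk $\frac{1}{n}\|t - F\|^2$

Least-squares criterion:

$$\frac{1}{n}\|t - Y\|^2 = \frac{1}{n}\sum_{i=1}^{n} (t_i - Y_i)^2$$

$$\forall t \in \mathbb{R}^n, \quad \mathbb{E}\left[\frac{1}{n}\|t - Y\|^2\right] = \frac{1}{n}\|t - F\|^2 + \frac{1}{n}\mathbb{E}\left[\|\varepsilon\|^2\right]$$

Model: $S \subset \mathbb{R}^n$ ⇒ Least-squares estimator on $S$:

$$\widehat{F}_S \in \operatorname*{argmin}_{t \in S} \left\{\frac{1}{n}\|t - Y\|^2\right\} = \operatorname*{argmin}_{t \in S} \left\{\frac{1}{n}\sum_{i=1}^{n} (t_i - Y_i)^2\right\}$$

so that $\widehat{F}_S = \Pi_S(Y)$ (orthogonal projection)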
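As a minimal sketch of the projection identity above, the following NumPy snippet builds the projector onto a model $S$ and checks that it gives the same fit as a direct least-squares solve. The quadratic-polynomial model, the sine signal, the noise level and the helper name `projection_matrix` are illustrative assumptions, not part of the slides.

```python
import numpy as np

def projection_matrix(basis):
    """Orthogonal projector onto the column span of `basis` (n x D, full column rank)."""
    q, _ = np.linalg.qr(basis)   # orthonormal basis of the model S
    return q @ q.T               # Pi_S = Q Q^T

rng = np.random.default_rng(0)
n = 50
x = np.arange(1, n + 1) / n                        # fixed design points
basis = np.column_stack([np.ones(n), x, x ** 2])   # hypothetical model: quadratic polynomials
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)

pi_s = projection_matrix(basis)
f_hat = pi_s @ y                                   # least-squares estimator on S

# Same fit obtained by minimising the least-squares criterion directly.
coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
f_hat_lstsq = basis @ coef
```

The trace of `pi_s` equals the dimension of the model, which is the quantity $D_m$ entering the penalties discussed later.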


(25)

Model examples

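histograms on some partition $\Lambda$ of $\mathcal{X}$ ⇒ the least-squares estimator (regressogram) can be written

$$\widehat{F}_m(x_i) = \sum_{\lambda \in \Lambda} \widehat{\beta}_\lambda \mathbf{1}_{x_i \in \lambda}, \qquad \widehat{\beta}_\lambda = \frac{1}{\operatorname{Card}\{x_i \in \lambda\}} \sum_{x_i \in \lambda} Y_i$$

subspace generated by a subset of an orthogonal basis of $L^2(\mu)$ (Fourier, wavelets, ...)

variable selection: $x_i = \left(x_i^{(1)}, \ldots, x_i^{(p)}\right) \in \mathbb{R}^p$ gathers $p$ variables that can (linearly) explain $Y_i$

$$\forall m \subset \{1, \ldots, p\}, \quad S_m = \operatorname{span}\left\{ x^{(j)} \text{ s.t. } j \in m \right\}$$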
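The regressogram formula above can be sketched in a few lines. This version assumes a regular partition of $[0, 1]$ into `n_bins` intervals and one-dimensional design points; the function name `regressogram` and the toy data are hypothetical.

```python
import numpy as np

def regressogram(x, y, n_bins):
    """Fitted values of the regressogram on a regular partition of [0, 1]."""
    # Index of the cell containing each x_i (x = 1 is put in the last cell).
    cells = np.minimum((x * n_bins).astype(int), n_bins - 1)
    sums = np.bincount(cells, weights=y, minlength=n_bins)
    counts = np.bincount(cells, minlength=n_bins)
    # beta_lambda = mean of the Y_i whose x_i falls in cell lambda (0 for empty cells).
    beta = np.divide(sums, counts, out=np.zeros(n_bins), where=counts > 0)
    return beta[cells]

x_toy = np.array([0.1, 0.2, 0.6, 0.9])
y_toy = np.array([1.0, 3.0, 2.0, 4.0])
fitted_toy = regressogram(x_toy, y_toy, n_bins=2)   # cell means: 2.0 on [0, 0.5), 3.0 on [0.5, 1]
```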

(26)

Model selection: regular regressograms


(27)

Model selection

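Model collection $(S_m)_{m \in \mathcal{M}}$ ⇒ $(\widehat{F}_m)_{m \in \mathcal{M}}$ ⇒ $\widehat{m}(Y)$? where $\widehat{F}_m = \Pi_m Y = \Pi_{S_m} Y$

Goal: minimize the risk, i.e., oracle inequality (in expectation or with a large probability):

$$\frac{1}{n}\left\|\widehat{F}_{\widehat{m}} - F\right\|^2 \le C \inf_{m \in \mathcal{M}} \left\{\frac{1}{n}\left\|\widehat{F}_m - F\right\|^2\right\} + R_n$$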

(28)

Bias-variance trade-off

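$$\mathbb{E}\left[\frac{1}{n}\left\|\widehat{F}_m - F\right\|^2\right] = \text{Bias} + \text{Variance}$$

Bias or approximation error: $\frac{1}{n}\|F_m - F\|^2 = \frac{1}{n}\|\Pi_m F - F\|^2$

Variance or estimation error: $\frac{\sigma^2 \dim(S_m)}{n}$

Bias-variance trade-off ⇔ avoid overfitting and underfitting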


(30)

Why should the empirical risk be penalized?

[Figure: curves of the bias, the excess risk and E[empirical risk]]

(31)

Penalization

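$$\widehat{m} \in \operatorname*{argmin}_{m \in \mathcal{M}} \left\{\frac{1}{n}\left\|\widehat{F}_m - Y\right\|^2 + \operatorname{pen}(m)\right\}$$

Ideal penalty:

$$\operatorname{pen}_{\mathrm{id}}(m) := \frac{1}{n}\left\|\widehat{F}_m - F\right\|^2 - \frac{1}{n}\left\|\widehat{F}_m - Y\right\|^2 = \text{Risk} - \text{Empirical risk}$$

Mallows’ heuristic: $\operatorname{pen}(m) \approx \mathbb{E}[\operatorname{pen}_{\mathrm{id}}(m)]$

⇒ oracle inequality if $\operatorname{Card}(\mathcal{M})$ not too large (+ concentration inequalities)

⇒ $C_p$: $2\sigma^2 D_m / n$ (Mallows, 1973)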
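A minimal sketch of the penalized selection rule with Mallows’ $C_p$ penalty, assuming the noise level $\sigma^2$ is known and the empirical risks and dimensions of the models have already been computed. The function name and the numbers in the example are made up for illustration.

```python
import numpy as np

def mallows_cp_select(empirical_risks, dims, sigma2, n):
    """Index of the model minimising empirical risk + 2 * sigma^2 * D_m / n."""
    crit = np.asarray(empirical_risks, dtype=float) \
        + 2.0 * sigma2 * np.asarray(dims, dtype=float) / n
    return int(np.argmin(crit))

# Hypothetical values: the empirical risk keeps decreasing with the dimension,
# but the penalized criterion is 1.1, 0.7, 0.75, 0.84 and picks the second model.
m_cp = mallows_cp_select([1.0, 0.5, 0.45, 0.44], [1, 2, 3, 4], sigma2=1.0, n=20)
```

Without the penalty, the largest model would always be selected, since the empirical risk is non-increasing along nested models.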


(34)

Oracle inequality

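Theorem (Birgé & Massart 2007, reformulated)

Assumptions:

$\operatorname{pen}(m) = \frac{2\sigma^2 D_m}{n}$

Gaussian noise: $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$

Then, for every $\gamma > 0$, with probability at least $1 - 4\operatorname{Card}(\mathcal{M})\, n^{-\gamma}$, if $n \ge n_0(\gamma)$, for every $\eta \in (0, 1)$,

$$\frac{1}{n}\left\|\widehat{F}_{\widehat{m}} - F\right\|^2 \le (1 + \eta) \inf_{m \in \mathcal{M}} \left\{\frac{1}{n}\left\|\widehat{F}_m - F\right\|^2\right\} + \frac{80\gamma \log(n)\sigma^2}{\eta n}$$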

(35)

$\mathbb{E}[\text{Empirical risk}] + 0 \times \sigma^2 D_m n^{-1}$ (OLS)

[Figure: E[empirical error + 0 × pen min]; horizontal axis 0 to 1000, vertical scale ×10⁻³]

(36)

$\mathbb{E}[\text{Empirical risk}] + 0.8 \times \sigma^2 D_m n^{-1}$ (OLS)

[Figure: E[empirical error + 0.8 × pen min]; vertical scale ×10⁻³]

(37)

$\mathbb{E}[\text{Empirical risk}] + 0.9 \times \sigma^2 D_m n^{-1}$ (OLS)

[Figure: horizontal axis 0 to 1000, vertical scale ×10⁻³]

(38)

$\mathbb{E}[\text{Empirical risk}] + 1.1 \times \sigma^2 D_m n^{-1}$ (OLS)

[Figure: vertical scale ×10⁻³]

(39)

$\mathbb{E}[\text{Empirical risk}] + 1.2 \times \sigma^2 D_m n^{-1}$ (OLS)

[Figure: horizontal axis 0 to 1000, vertical scale ×10⁻³]

(40)

$\mathbb{E}[\text{Empirical risk}] + 2 \times \sigma^2 D_m n^{-1}$ (OLS)

[Figure: E[empirical error + 2 × pen min]; vertical scale ×10⁻³]

(41)

OLS: Dimension jump

[Figure: dimension of $\widehat{m}(C)$ as a function of $C$]

(42)

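OLS: slope heuristics algorithm (Birgé & Massart 2007)

1 for every $C > 0$, compute

$$\widehat{m}(C) \in \operatorname*{argmin}_{m \in \mathcal{M}_n} \left\{\frac{1}{n}\left\|\widehat{F}_m - Y\right\|^2 + C \frac{D_m}{n}\right\}$$

2 find $\widehat{C}_{\mathrm{jump}}$ such that $D_{\widehat{m}(C)}$ is “very large” when $C < \widehat{C}_{\mathrm{jump}}$ and “reasonably small” when $C > \widehat{C}_{\mathrm{jump}}$

3 select $\widehat{m} = \widehat{m}(2\widehat{C}_{\mathrm{jump}})$

Practical use: package CAPUSHE (Baudry, Maugis & Michel, 2011)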
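The three steps above can be sketched as follows. This is not the CAPUSHE implementation: it assumes a finite grid of constants `c_grid` and localizes the jump as the largest drop of the selected dimension between two consecutive grid values, which is only one of several possible conventions. The toy data (nested models spanned by the first $D$ vectors of an orthonormal basis, ten large coefficients, $\sigma = 1$) and all function names are assumptions made for illustration.

```python
import numpy as np

def selected_dimensions(emp_risks, dims, n, c_grid):
    """D_{m(C)} for each C in c_grid: minimise empirical risk + C * D_m / n."""
    crit = emp_risks[None, :] + c_grid[:, None] * dims[None, :] / n
    return dims[np.argmin(crit, axis=1)]

def slope_heuristics(emp_risks, dims, n, c_grid):
    sel = selected_dimensions(emp_risks, dims, n, c_grid)   # step 1
    drops = sel[:-1] - sel[1:]
    c_jump = c_grid[np.argmax(drops) + 1]                   # step 2: largest jump
    final_crit = emp_risks + 2.0 * c_jump * dims / n        # step 3: penalty 2 * C_jump * D_m / n
    return c_jump, int(np.argmin(final_crit)), sel

# Toy example in the coordinates of an orthonormal basis of R^n.
rng = np.random.default_rng(0)
n, sigma = 1000, 1.0
theta = np.zeros(n)
theta[:10] = 10.0                                   # signal carried by the first 10 coordinates
y_coef = theta + sigma * rng.standard_normal(n)

dims = np.arange(n + 1)                             # model D keeps the first D coordinates
cum = np.concatenate([[0.0], np.cumsum(y_coef ** 2)])
emp_risks = (cum[-1] - cum) / n                     # (1/n) * sum_{j > D} y_j^2

c_grid = np.linspace(0.05, 3.0, 60)
c_jump, m_hat, sel = slope_heuristics(emp_risks, dims, n, c_grid)
```

In this simulation `c_jump` should land close to $\sigma^2 = 1$, and plotting `sel` against `c_grid` gives the visual check of the jump.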

(43)

Theorem (1): Dimension jump / Minimal penalty

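$\exists m_0 \in \mathcal{M}$, $S_{m_0} = \mathbb{R}^n$, i.e., $\widehat{F}_{m_0} = Y$,

$\inf_{m \in \mathcal{M}} \left\{\mathbb{E}\left[\frac{1}{n}\|\widehat{F}_m - F\|^2\right]\right\} \le \sigma^2 \delta_n$, $\delta_n \le 1/20$,

Gaussian noise: $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$.

Then, $\forall \gamma > 0$, $n \ge n_0(\gamma)$, w.p. at least $1 - 4\operatorname{Card}(\mathcal{M})\, n^{-\gamma}$,

$$\forall C < (1 - \eta_n^-)\sigma^2, \quad D_{\widehat{m}(C)} \ge \frac{9n}{10}$$

$$\forall C > (1 + \eta_n^+)\sigma^2, \quad D_{\widehat{m}(C)} \le \frac{n}{10}$$

with $\eta_n^- = 81\sqrt{\frac{\gamma \log(n)}{n}}$, $\eta_n^+ = \eta_n^- + 20\delta_n$. In the first case,

$$\frac{1}{n}\left\|\widehat{F}_{\widehat{m}(C)} - F\right\|^2 \ge \frac{7\sigma^2}{8} \gg \inf_{m \in \mathcal{M}_n} \left\{\frac{1}{n}\left\|\widehat{F}_m - F\right\|^2\right\}.$$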

(44)

Theorem (1’): Dimension jump / Minimal penalty

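Under the same assumptions, on the same event, $\forall a_n, b_n$ such that $2n\delta_n + 16.2\sqrt{\gamma \log(n)\, n} < b_n < a_n < n$,

$$\forall C < (1 - \eta_n^-)\sigma^2, \quad D_{\widehat{m}(C)} \ge a_n$$

$$\forall C > (1 + \eta_n^+)\sigma^2, \quad D_{\widehat{m}(C)} \le b_n$$

with

$$\eta_n^- = \left(1 - \frac{a_n}{n}\right)^{-1} \sqrt{\frac{\gamma \log(n)}{n}}, \qquad \eta_n^+ = \frac{n}{b_n - n\delta_n}\left(\delta_n + 8.1\sqrt{\frac{\gamma \log(n)}{n}}\right)$$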

(45)

Theorem (2): Oracle inequality

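$$\widehat{m} \in \operatorname*{argmin}_{m \in \mathcal{M}} \left\{\frac{1}{n}\left\|\widehat{F}_m - Y\right\|^2 + 2\widehat{C}_{\mathrm{jump}} \frac{D_m}{n}\right\}$$

$\exists m_0 \in \mathcal{M}$, $S_{m_0} = \mathbb{R}^n$, i.e., $\widehat{F}_{m_0} = Y$,

$\inf_{m \in \mathcal{M}} \left\{\mathbb{E}\left[\frac{1}{n}\|\widehat{F}_m - F\|^2\right]\right\} \le \sigma^2 \delta_n$, $\delta_n \le 1/20$,

Gaussian noise: $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$.

Then, with probability at least $1 - 4\operatorname{Card}(\mathcal{M})\, n^{-\gamma}$, if $n \ge n_0(\gamma)$, for every $\eta \ge 2\max\{\eta_n^-, \eta_n^+\}$,

$$\frac{1}{n}\left\|\widehat{F}_{\widehat{m}} - F\right\|^2 \le (1 + 3\eta) \inf_{m \in \mathcal{M}} \left\{\frac{1}{n}\left\|\widehat{F}_m - F\right\|^2\right\} + \frac{880\sigma^2 \gamma \log(n)}{\eta n}$$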

(46)

Variance estimation

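Slope heuristics: with probability at least $1 - 4\operatorname{Card}(\mathcal{M})\, n^{-\gamma}$,

$$1 - 81\sqrt{\frac{\gamma \log(n)}{n}} \le \frac{\widehat{C}_{\mathrm{jump}}}{\sigma^2} \le 1 + 20\delta_n + 81\sqrt{\frac{\gamma \log(n)}{n}}$$

Naive estimator:

$$\widehat{\sigma}_m^2 := \frac{1}{n - D_m}\left\|Y - \widehat{F}_m\right\|^2 \quad \Rightarrow \quad \mathbb{E}\left[\widehat{\sigma}_m^2\right] = \sigma^2 + \frac{\|(I_n - \Pi_m)F\|^2}{n - D_m}$$

Best variance estimator for $\mathbb{E}[(\widehat{\sigma} - \sigma)^2]$ is not necessarily the best for model selection.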
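The bias of the naive estimator can be illustrated numerically. The sketch below evaluates the expectation formula above for two hypothetical models: one that contains the signal (no bias) and one that does not (upward bias). The sine signal, the design and the function names are assumptions made for the example.

```python
import numpy as np

def naive_variance(y, basis):
    """||y - Pi_m y||^2 / (n - D_m) for the model spanned by the columns of `basis`."""
    n, d = basis.shape
    q, _ = np.linalg.qr(basis)
    residual = y - q @ (q.T @ y)
    return residual @ residual / (n - d)

def expected_naive_variance(f, basis, sigma2):
    """sigma^2 + ||(I - Pi_m) F||^2 / (n - D_m): the expectation stated above."""
    return sigma2 + naive_variance(f, basis)

n = 100
x = (np.arange(n) + 0.5) / n
f = np.sin(2 * np.pi * x)                     # hypothetical signal
basis_small = np.ones((n, 1))                 # constants only: F is not in the model
basis_rich = np.column_stack([np.ones(n), np.sin(2 * np.pi * x), np.cos(2 * np.pi * x)])

e_small = expected_naive_variance(f, basis_small, sigma2=0.25)   # biased upwards
e_rich = expected_naive_variance(f, basis_rich, sigma2=0.25)     # equals sigma^2
```

Plugging the estimate from a fixed, too-small model into a penalty therefore overpenalizes.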


(49)

Data-driven penalties

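Naive estimator with some fixed $m_0$ + plug in:

$$\operatorname{crit}(m) = \frac{1}{n}\left\|Y - \widehat{F}_m\right\|^2 + \frac{2\widehat{\sigma}_{m_0}^2 D_m}{n}$$

Drawbacks: choice of $m_0$? unknown bias (overpenalization)

FPE (Akaike, 1970; Baraud, Giraud & Huet, 2009)

$$\operatorname{crit}_{\mathrm{FPE}}(m) = \frac{1}{n}\left\|Y - \widehat{F}_m\right\|^2 + \frac{2\widehat{\sigma}_m^2 D_m}{n} = \frac{1}{n}\left\|Y - \widehat{F}_m\right\|^2 \left(1 + \frac{2D_m}{n - D_m}\right)$$

$$\operatorname{crit}_{\mathrm{BGH}}(m) = \frac{1}{n}\left\|Y - \widehat{F}_m\right\|^2 \left(1 + \frac{\operatorname{pen}(m)}{n - D_m}\right)$$

Drawbacks: deal carefully with the largest models

oracle inequalities (Baraud, Giraud & Huet, 2009) hold assuming an upper bound on $\max_{m \in \mathcal{M}} D_m$ (FPE) or for a new penalty, very large for the largest models (BGH)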


(51)

Generalized cross-validation (Craven & Wahba, 1978)

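$$\operatorname{crit}_{\mathrm{GCV}}(m) = \frac{1}{n}\left\|Y - \widehat{F}_m\right\|^2 \left(1 - \frac{D_m}{n}\right)^{-2}$$

If $D_m \ll n$,

$$\operatorname{crit}_{\mathrm{GCV}}(m) \approx \frac{1}{n}\left\|Y - \widehat{F}_m\right\|^2 \frac{n + D_m}{n - D_m} = \operatorname{crit}_{\mathrm{FPE}}(m)$$

Drawbacks: deal carefully with the largest models

⇒ e.g., for smoothing splines, oracle inequality assumes the effective dimension is $\le n/5$ for all $m$ (Cao & Golubev, 2006)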
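A short numerical check of the approximation above: the FPE and GCV criteria are written as multiplicative corrections of the empirical risk, and compared for a small and a large dimension. The function names and the values of $n$ and $D_m$ are illustrative assumptions.

```python
def crit_fpe(emp_risk, d, n):
    """Empirical risk times (1 + 2 D / (n - D)) = (n + D) / (n - D)."""
    return emp_risk * (1.0 + 2.0 * d / (n - d))

def crit_gcv(emp_risk, d, n):
    """Empirical risk times (1 - D / n)^(-2)."""
    return emp_risk * (1.0 - d / n) ** (-2)

n_obs = 1000
fpe_small, gcv_small = crit_fpe(1.0, 10, n_obs), crit_gcv(1.0, 10, n_obs)     # D << n: nearly equal
fpe_large, gcv_large = crit_fpe(1.0, 500, n_obs), crit_gcv(1.0, 500, n_obs)   # D = n/2: 3.0 vs 4.0
```

The two criteria agree to first order in $D_m / n$ and separate for the largest models, which is where both need care.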

(52)

Practical qualities of the algorithm

visual checking of existence of a jump

calibration independent from the choice of some $m_0$

too strong overfitting almost impossible

one remaining parameter: how to localize the jump

(53)

How to localize the jump in practice?

Dimension jump: largest jump? jump on a geometrical window? complexity threshold?

Estimation of the slope of the empirical risk as a function of the dimension: computed with which models? robust regression?

Jump vs. slope? Take both!

⇒ package Capushe (Baudry, Maugis & Michel, 2011) http://www.math.univ-toulouse.fr/~maugis/CAPUSHE.html


(56)

CAPUSHE (Baudry, Maugis & Michel, 2011): jump

(57)

CAPUSHE (Baudry, Maugis & Michel, 2011): slope


(59)

Kernel ridge estimator (λ = 0.01)


(60)

k-nearest-neighbours estimator (k = 20)

(61)

Nadaraya-Watson estimator (σ = 0.01)


(62)

Linear estimators

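OLS: $\widehat{F}_m = \Pi_{S_m} Y$ (projection onto $S_m$)

(kernel) ridge regression, spline smoothing (Wahba, 1990):

$$\widehat{F}_i = \widehat{f}(x_i) \quad \text{with} \quad \widehat{f} \in \operatorname*{argmin}_{f \in \mathcal{F}_K} \left\{\frac{1}{n}\sum_{i=1}^{n} (Y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{F}_K}^2\right\}$$

⇒ $\widehat{F}_{\lambda, K} = K(K + n\lambda I)^{-1} Y$ where $K = (K(x_i, x_j))_{1 \le i, j \le n}$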

k-nearest neighbours

Nadaraya-Watson estimators
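All the estimators listed above are of the form $\widehat{F} = A Y$ for a matrix $A$ that does not depend on $Y$. The sketch below builds such matrices for kernel ridge, k-nearest neighbours and Nadaraya-Watson on a one-dimensional design. The Gaussian kernel, the bandwidths and the function names are assumptions for illustration. The ridge matrix is written with $n\lambda$, which is what the $1/n$-normalised criterion yields; absorbing $n$ into $\lambda$ gives the unnormalised form.

```python
import numpy as np

def gaussian_gram(x, bandwidth):
    """Gram matrix of a Gaussian kernel on 1-D points (an assumed kernel choice)."""
    diff = x[:, None] - x[None, :]
    return np.exp(-0.5 * (diff / bandwidth) ** 2)

def kernel_ridge_matrix(x, lam, bandwidth):
    n = len(x)
    gram = gaussian_gram(x, bandwidth)
    # (K + n*lam*I)^{-1} K equals K (K + n*lam*I)^{-1} because the two factors commute.
    return np.linalg.solve(gram + n * lam * np.eye(n), gram)

def knn_matrix(x, k):
    n = len(x)
    dist = np.abs(x[:, None] - x[None, :])
    nearest = np.argsort(dist, axis=1, kind="stable")[:, :k]   # each point is its own neighbour
    a = np.zeros((n, n))
    np.put_along_axis(a, nearest, 1.0 / k, axis=1)
    return a

def nadaraya_watson_matrix(x, bandwidth):
    weights = gaussian_gram(x, bandwidth)
    return weights / weights.sum(axis=1, keepdims=True)        # rows sum to one

x_design = np.linspace(0.0, 1.0, 40)
a_ridge = kernel_ridge_matrix(x_design, lam=0.01, bandwidth=0.1)
a_knn = knn_matrix(x_design, k=5)
a_nw = nadaraya_watson_matrix(x_design, bandwidth=0.05)
```

Given observations `y`, the fitted values are simply `a_ridge @ y`, `a_knn @ y` or `a_nw @ y`.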

(63)

Estimator selection: kernel ridge


(64)

Estimator selection: k nearest neighbours


(65)

Estimator selection: Nadaraya-Watson


(66)

Slope heuristics for linear estimators?

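OLS

$$\operatorname{pen}_{C_p}(m) = \frac{2\sigma^2 D_m}{n} \qquad \operatorname*{argmin}_{m \in \mathcal{M}} \left\{\frac{1}{n}\left\|\widehat{F}_m - Y\right\|^2 + C \frac{D_m}{n}\right\}$$

⇒ $D_{\widehat{m}(C)}$ “jumps” at $\widehat{C}_{\mathrm{jump}} \approx \sigma^2$

⇒ optimal choice with $\widehat{m}(2\widehat{C}_{\mathrm{jump}})$

Linear estimators

$$\operatorname{pen}_{C_L}(m) = \frac{2\sigma^2 \operatorname{tr}(A_m)}{n} \qquad \operatorname*{argmin}_{m \in \mathcal{M}} \left\{\frac{1}{n}\left\|\widehat{F}_m - Y\right\|^2 + C \frac{\operatorname{tr}(A_m)}{n}\right\}$$

Does $\operatorname{tr}(A_{\widehat{m}(C)})$ jump at $\widehat{C}_{\mathrm{jump}} \approx \sigma^2$?

optimal choice with $\widehat{m}(2\widehat{C}_{\mathrm{jump}})$?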


(69)

No dimension jump with a penalty ∝ $\operatorname{tr}(A_m)$

[Figure: $\operatorname{tr}(A_{\widehat{\lambda}(C)})$; legend: optimal penalty / 2]

(70)

Minimal penalties for linear estimators

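$$\mathbb{E}\left[\frac{1}{n}\left\|\widehat{F}_m - F\right\|^2\right] = \frac{1}{n}\|(I - A_m)F\|^2 + \frac{\operatorname{tr}(A_m^\top A_m)\sigma^2}{n} = \text{bias} + \text{variance}$$

$$\mathbb{E}\left[\frac{1}{n}\left\|\widehat{F}_m - Y\right\|^2\right] = \sigma^2 + \frac{1}{n}\|(I - A_m)F\|^2 - \frac{\left(2\operatorname{tr}(A_m) - \operatorname{tr}(A_m^\top A_m)\right)\sigma^2}{n}$$

⇒ optimal penalty $\frac{\left(2\operatorname{tr}(A_m)\right)\sigma^2}{n}$

⇒ minimal penalty $\frac{\left(2\operatorname{tr}(A_m) - \operatorname{tr}(A_m^\top A_m)\right)\sigma^2}{n}$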
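The two expectations and the two penalties above are tied together by simple identities, which the following sketch checks numerically for an arbitrary matrix. The random matrix `a_demo`, the signal `f_demo` and the function names are hypothetical.

```python
import numpy as np

def bias_term(a, f):
    r = f - a @ f
    return r @ r / len(f)                         # (1/n) ||(I - A) F||^2

def expected_risk(a, f, sigma2):
    return bias_term(a, f) + np.sum(a * a) * sigma2 / len(f)      # tr(A^T A) = sum of A_ij^2

def expected_empirical_risk(a, f, sigma2):
    n = len(f)
    return sigma2 + bias_term(a, f) - (2.0 * np.trace(a) - np.sum(a * a)) * sigma2 / n

def pen_opt(a, sigma2):
    return 2.0 * np.trace(a) * sigma2 / a.shape[0]

def pen_min(a, sigma2):
    return (2.0 * np.trace(a) - np.sum(a * a)) * sigma2 / a.shape[0]

rng = np.random.default_rng(0)
a_demo = 0.1 * rng.standard_normal((30, 30))      # any matrix works for these identities
f_demo = rng.standard_normal(30)
s2 = 0.5

# E[emp. risk] + pen_opt = E[risk] + sigma^2: the optimal penalty debiases the empirical risk.
lhs_opt = expected_empirical_risk(a_demo, f_demo, s2) + pen_opt(a_demo, s2)
# E[emp. risk] + pen_min = bias + sigma^2: with the minimal penalty only the bias remains.
lhs_min = expected_empirical_risk(a_demo, f_demo, s2) + pen_min(a_demo, s2)
```

Below the minimal penalty the criterion decreases in expectation as the bias decreases, which favours the most complex estimators.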


(75)

“Dimension” jump (ridge regression)

[Figure: $\operatorname{tr}(A_{\widehat{\lambda}(C)})$; legend: minimal penalty, optimal penalty / 2]

(76)

Penalty calibration algorithm (A. & Bach 2009)

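1 for every $C > 0$, compute

$$\widehat{m}_{\min}(C) \in \operatorname*{argmin}_{m \in \mathcal{M}} \left\{\frac{1}{n}\left\|Y - \widehat{F}_m\right\|^2 + C \frac{2\operatorname{tr}(A_m) - \operatorname{tr}(A_m^\top A_m)}{n}\right\}$$

2 find $\widehat{C}_{\mathrm{jump}}$ such that $\operatorname{tr}(A_{\widehat{m}_{\min}(C)})$ is “too large” when $C < \widehat{C}_{\mathrm{jump}}$ and “reasonably small” when $C > \widehat{C}_{\mathrm{jump}}$,

3 select

$$\widehat{m} \in \operatorname*{argmin}_{m \in \mathcal{M}} \left\{\frac{1}{n}\left\|Y - \widehat{F}_m\right\|^2 + \frac{2\widehat{C}_{\mathrm{jump}} \operatorname{tr}(A_m)}{n}\right\}$$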
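A minimal sketch of the calibration steps above for a finite collection of smoothing matrices. It assumes a grid of constants, localizes the jump as the largest drop of $\operatorname{tr}(A_{\widehat{m}_{\min}(C)})$ along the grid (one possible convention among several), and uses a toy collection made of the identity matrix plus Gaussian-kernel ridge matrices. The data, kernel, grids and function name are all assumptions for illustration.

```python
import numpy as np

def calibrate_and_select(y, matrices, c_grid):
    n = len(y)
    emp = np.array([np.sum((y - a @ y) ** 2) / n for a in matrices])
    tr = np.array([np.trace(a) for a in matrices])
    tr2 = np.array([np.sum(a * a) for a in matrices])          # tr(A^T A)
    shape_min = (2.0 * tr - tr2) / n                           # minimal-penalty shape
    # Step 1: minimiser of emp. risk + C * shape_min for each C.
    crit = emp[None, :] + c_grid[:, None] * shape_min[None, :]
    sel_tr = tr[np.argmin(crit, axis=1)]
    # Step 2: C_jump located at the largest drop of tr(A) along the grid.
    c_jump = c_grid[np.argmax(sel_tr[:-1] - sel_tr[1:]) + 1]
    # Step 3: final choice with the penalty 2 * C_jump * tr(A_m) / n.
    m_hat = int(np.argmin(emp + 2.0 * c_jump * tr / n))
    return c_jump, m_hat, sel_tr

rng = np.random.default_rng(0)
n, sigma = 100, 0.5
x = (np.arange(n) + 0.5) / n
y = np.sin(2 * np.pi * x) + sigma * rng.standard_normal(n)

gram = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.2) ** 2)
lambdas = np.logspace(-4, 1, 20)
matrices = [np.eye(n)] + [np.linalg.solve(gram + n * lam * np.eye(n), gram) for lam in lambdas]

c_grid = np.linspace(0.01, 1.0, 100)
c_jump, m_hat, sel_tr = calibrate_and_select(y, matrices, c_grid)
```

Here `c_jump` plays the role of an estimate of $\sigma^2 = 0.25$, and `m_hat` indexes the selected matrix in `matrices` (index 0 being the identity, which should not be selected at the final step).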

(77)

Theorem for linear estimators

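Theorem (A. & Bach 2009–2011)

Assumptions:

$\forall m \in \mathcal{M}$, $|||A_m||| \le L_1$ and $\operatorname{tr}(A_m^\top A_m) \le \operatorname{tr}(A_m) \le n$

$\exists m_0, m_1 \in \mathcal{M}$, $A_{m_1} = I_n$, $D_{m_0} \le \sqrt{n}$ and $\frac{1}{n}\|F_{m_0} - F\|^2 \le \sigma^2 \sqrt{\log(n)/n}$.

Gaussian noise: $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$

Then, $\forall \gamma > 0$, $n \ge n_0(\gamma)$, w.p. at least $1 - 6\operatorname{Card}(\mathcal{M})\, n^{-\gamma}$,

$$\forall C < (1 - \eta_n^-)\sigma^2, \quad D_{\widehat{m}(C)} \ge \frac{n}{3}$$

$$\forall C > (1 + \eta_n^+)\sigma^2, \quad D_{\widehat{m}(C)} \le \frac{n}{10}$$

with $\eta_n^- = \eta_n^+ = L_2 \delta \sqrt{\frac{\log(n)}{n}}$.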

(78)

Theorem for linear estimators

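Theorem (A. & Bach 2009–2011)

Assumptions:

$\forall m \in \mathcal{M}$, $|||A_m||| \le L_1$ and $\operatorname{tr}(A_m^\top A_m) \le \operatorname{tr}(A_m) \le n$

$\exists m_0, m_1 \in \mathcal{M}$, $A_{m_1} = I_n$, $D_{m_0} \le \sqrt{n}$ and $\frac{1}{n}\|F_{m_0} - F\|^2 \le \sigma^2 \sqrt{\log(n)/n}$.

Gaussian noise: $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$

Then, $\forall \gamma > 0$, $n \ge n_0(\gamma)$, w.p. at least $1 - 6\operatorname{Card}(\mathcal{M})\, n^{-\gamma}$, $\forall C \in (1 - \eta_n^-, 1 + \eta_n^+)$, $\eta \in (0, 2)$,

$$\forall \widehat{m} \in \operatorname*{argmin}_{m \in \mathcal{M}} \left\{\frac{1}{n}\left\|Y - \widehat{F}_m\right\|^2 + \frac{2C \operatorname{tr}(A_m)}{n}\right\},$$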

(79)

Comparison with least-squares

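Linear estimators:

$$\operatorname{pen}_{\min}(m) = \frac{\sigma^2\left(2\operatorname{tr}(A_m) - \operatorname{tr}(A_m^\top A_m)\right)}{n} \qquad \operatorname{pen}_{\mathrm{opt}}(m) = \frac{\sigma^2\left(2\operatorname{tr}(A_m)\right)}{n}$$

$$\frac{\operatorname{pen}_{\mathrm{opt}}(m)}{\operatorname{pen}_{\min}(m)} = \frac{2\operatorname{tr}(A_m)}{2\operatorname{tr}(A_m) - \operatorname{tr}(A_m^\top A_m)} \in (1, 2]$$

Least-squares case:

$$A_m^\top A_m = A_m \quad \Rightarrow \quad \frac{\operatorname{pen}_{\mathrm{opt}}(m)}{\operatorname{pen}_{\min}(m)} = 2 \quad \Rightarrow \quad \text{Slope heuristics}$$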
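To make the ratio concrete, the sketch below evaluates it for a projection matrix (least-squares case) and for a pure shrinkage matrix. Both example matrices and the function name `penalty_ratio` are illustrative assumptions.

```python
import numpy as np

def penalty_ratio(a):
    """pen_opt / pen_min = 2 tr(A) / (2 tr(A) - tr(A^T A))."""
    tr = np.trace(a)
    tr2 = np.sum(a * a)            # tr(A^T A)
    return 2.0 * tr / (2.0 * tr - tr2)

projector = np.diag([1.0, 1.0, 0.0, 0.0])    # A^T A = A: least-squares case
shrinkage = 0.5 * np.eye(4)                  # eigenvalues strictly between 0 and 1

ratio_projector = penalty_ratio(projector)   # 2 * 2 / (4 - 2) = 2
ratio_shrinkage = penalty_ratio(shrinkage)   # 2 * 2 / (4 - 1) = 4/3
```

The factor 2 of the slope heuristics is thus specific to projections; for general linear estimators the optimal penalty has to be recomputed from $\operatorname{tr}(A_m)$ once the constant is calibrated.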


(81)

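The k-nearest neighbours case

$$\forall i, j \in \{1, \ldots, n\}, \quad A_{i,j} \in \left\{0, \frac{1}{k}\right\}$$

$$\forall i \in \{1, \ldots, n\}, \quad A_{i,i} = \frac{1}{k} \quad \text{and} \quad \sum_{j=1}^{n} A_{i,j} = 1$$

$$\Rightarrow \operatorname{tr}(A) = \frac{n}{k} = \operatorname{tr}(A^\top A)$$

⇒ $\operatorname{pen}_{\mathrm{opt}} = 2 \operatorname{pen}_{\min}$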
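The trace identity above can be checked on a small example. The sketch assumes a one-dimensional design and counts each point among its own $k$ nearest neighbours; the function name and the values of $n$ and $k$ are hypothetical.

```python
import numpy as np

def knn_smoothing_matrix(x, k):
    """k-NN matrix: weight 1/k on the k nearest design points (the point itself included)."""
    dist = np.abs(x[:, None] - x[None, :])
    nearest = np.argsort(dist, axis=1, kind="stable")[:, :k]
    a = np.zeros((len(x), len(x)))
    np.put_along_axis(a, nearest, 1.0 / k, axis=1)
    return a

n_pts, k_nn = 30, 5
a_k = knn_smoothing_matrix(np.linspace(0.0, 1.0, n_pts), k_nn)

trace_a = np.trace(a_k)             # n / k
trace_ata = np.sum(a_k * a_k)       # n * k * (1 / k^2) = n / k
ratio_knn = 2.0 * trace_a / (2.0 * trace_a - trace_ata)   # pen_opt / pen_min
```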


(83)

Simulations: N-W, $F_i = \sin(25\pi x_i^3)$, $n = 200$

[Figure: mean(error / error oracle); legend: min. pen. largest, min. pen. n/2, Mallows, GCV]

(84)

Simulations: ridge, $F_i = \sin(25\pi x_i^3)$, $n = 200$

[Figure: mean(error / error oracle); legend: min. pen. largest, min. pen. n/2, Mallows, GCV]

(85)

General framework

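Goal: find from data $t \in \mathbb{S}$ with $\mathcal{R}(t)$ minimal.

Empirical risk $\widehat{\mathcal{R}}_n(t)$

Collection of estimators $(\widehat{s}_m)_{m \in \mathcal{M}}$

Oracle inequality:

$$\mathcal{R}(\widehat{s}_{\widehat{m}}) - \mathcal{R}(s^\star) \le K_n \inf_{m \in \mathcal{M}} \left\{\mathcal{R}(\widehat{s}_m) - \mathcal{R}(s^\star)\right\} + R_n$$

where $\mathcal{R}(s^\star) := \inf_{t \in \mathbb{S}} \mathcal{R}(t)$.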