• Aucun résultat trouvé

Optimal data-driven estimator selection with minimal penalties

N/A
N/A
Protected

Academic year: 2022

Partager "Optimal data-driven estimator selection with minimal penalties"

Copied!
131
0
0

Texte intégral

(1)

Optimal data-driven estimator selection with minimal penalties

Sylvain Arlot (joint works with F. Bach, P. Massart & M.

Solnon)

1Cnrs

2Ecole Normale Sup´´ erieure (Paris),DI/ENS, ´EquipeSierra

CIRM, December, 12–13, 2013

(2)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Plan

1 Motivation

2 Slope heuristics for ordinary least-squares Framework

Optimal model selection

Minimal penalty and the slope heuristics Theoretical results

Variance estimation Practical considerations

3 Generalization: minimal penalties Linear estimators

Slope heuristics for linear estimators

Minimal penalty algorithm for linear estimators

(3)

Outline

1 Motivation

2 Slope heuristics for ordinary least-squares Framework

Optimal model selection

Minimal penalty and the slope heuristics Theoretical results

Variance estimation Practical considerations

3 Generalization: minimal penalties Linear estimators

Slope heuristics for linear estimators

Minimal penalty algorithm for linear estimators Minimal penalty algorithm in general

4 Application to multi-task learning

(4)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Regression: data (x

1

, Y

1

), . . . , (x

n

, Y

n

)

−2

−1 0 1 2 3 4

(5)

Goal: find the signal (denoising)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

−3

−2

−1 0 1 2 3 4

(6)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Estimators: example: regressogram

−2

−1 0 1 2 3 4

(7)

Estimators: regressogram, ridge, k -NN, NW

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

−3

−2

−1 0 1 2 3 4

−3

−2

−1 0 1 2 3 4

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

−3

−2

−1 0 1 2 3 4

−3

−2

−1 0 1 2 3 4

(8)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Estimator selection: kernel ridge

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

−3

−2

−1 0 1 2 3 4

0 1 2 3 4

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

−3

−2

−1 0 1 2 3 4

0 1 2 3 4

(9)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Estimator selection

Estimator collection (Fbm)m∈M ⇒m(Yb ) ? Goal: minimize the risk

Examples:

model selection

parameter tuning(choosingk or the distance fork-NN, choice of a regularization parameter, choice of a kernel, etc.)

choice betweendifferent methods ex.: k-NN vs. kernel ridge?

Classical approaches and their limitations: cross-validation: computational cost penalization: unknown constants

elbow heuristics: no clear definition/justification

(10)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Estimator selection

Estimator collection (Fbm)m∈M ⇒m(Yb ) ? Goal: minimize the risk

Examples:

model selection

parameter tuning(choosingk or the distance fork-NN, choice of a regularization parameter, choice of a kernel, etc.)

choice betweendifferent methods ex.: k-NN vs. kernel ridge?

Classical approaches and their limitations: cross-validation: computational cost penalization: unknown constants

elbow heuristics: no clear definition/justification

(11)

Estimator selection

Estimator collection (Fbm)m∈M ⇒m(Yb ) ? Goal: minimize the risk

Examples:

model selection

parameter tuning(choosingk or the distance fork-NN, choice of a regularization parameter, choice of a kernel, etc.)

choice betweendifferent methods ex.: k-NN vs. kernel ridge?

Classical approaches and their limitations:

cross-validation: computational cost penalization: unknown constants

elbow heuristics: no clear definition/justification

(12)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Penalties known up to a constant factor

b

m(Y)∈argmin

m∈M

n

Emp. risk(Fbm) + pen(m) o

Optimal penalties depending on the noise level σ2 (Mallows, 1973):

penCp(m) = 2σ2Dm

n penCL(m) = 2σ2tr(Am) n Rk: various methods for estimatingσ2 or avoiding its

estimation (FPE, Akaike, 1970; GCV, Craven & Wahba, 1978;

Baraud, Giraud & Huet, 2009).

Optimal penalty known asymptotically(AIC; Akaike, 1973) Resampling-based penalties

Optimal constant unknown even in theory (change-point detection, mixture models, global/local Rademacher complexities, ...)

Goals: estimation of the optimal constant (e.g.,σ2) for estimator selection, under minimal assumptions, without overfitting

(13)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Penalties known up to a constant factor

b

m(Y)∈argmin

m∈M

n

Emp. risk(Fbm) + pen(m) o

Optimal penalties depending on the noise level σ2 (Mallows, 1973):

penCp(m) = 2σ2Dm

n penCL(m) = 2σ2tr(Am) n

Optimal penalty known asymptotically(AIC; Akaike, 1973) Resampling-based penalties

Optimal constant unknown even in theory (change-point detection, mixture models, global/local Rademacher complexities, ...)

Goals: estimation of the optimal constant (e.g.,σ2) for estimator selection, under minimal assumptions, without overfitting

(14)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Penalties known up to a constant factor

b

m(Y)∈argmin

m∈M

n

Emp. risk(Fbm) + pen(m) o

Optimal penalties depending on the noise level σ2 (Mallows, 1973):

penCp(m) = 2σ2Dm

n penCL(m) = 2σ2tr(Am) n

Optimal penalty known asymptotically(AIC; Akaike, 1973) Resampling-based penalties

Optimal constant unknown even in theory (change-point detection, mixture models, global/local Rademacher

(15)

“L-curve” and elbow heuristics?

0 200 400 600 800 1000

0 0.5 1 1.5

2x 10−3

empirical error

(16)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

“L-curve” and elbow heuristics?

0.5 1 1.5

2x 10−3

empirical / generalization error

(17)

Outline

1 Motivation

2 Slope heuristics for ordinary least-squares Framework

Optimal model selection

Minimal penalty and the slope heuristics Theoretical results

Variance estimation Practical considerations

3 Generalization: minimal penalties Linear estimators

Slope heuristics for linear estimators

Minimal penalty algorithm for linear estimators Minimal penalty algorithm in general

4 Application to multi-task learning

(18)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Statistical framework: regression, least-squares loss

Observations: Y = (Y1, . . . ,Yn)∈Rn

Yi =Fii (e.g., Fi =F(xi)) with Yi ∈R, (εi)1≤i≤ni.i.d.

Fixed design: xi ∈ X deterministic

Least-squares loss of a predictort ∈Rn (“ti =t(xi)”): 1

nkt−Fk2 = 1 n

Xn i=1

(ti −Fi)2

⇒ Estimator Fb(Y)∈Rn?

(19)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Statistical framework: regression, least-squares loss

Observations: Y = (Y1, . . . ,Yn)∈Rn

Yi =Fii (e.g., Fi =F(xi)) with Yi ∈R, (εi)1≤i≤ni.i.d.

Fixed design: xi ∈ X deterministic

Least-squares loss of a predictort ∈Rn (“ti =t(xi)”): 1

nkt−Fk2 = 1 n

Xn i=1

(ti −Fi)2

⇒ Estimator Fb(Y)∈Rn?

(20)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Statistical framework: regression, least-squares loss

Observations: Y = (Y1, . . . ,Yn)∈Rn

Yi =Fii (e.g., Fi =F(xi)) with Yi ∈R, (εi)1≤i≤ni.i.d.

Fixed design: xi ∈ X deterministic

Least-squares loss of a predictort ∈Rn (“ti =t(xi)”):

1

nkt−Fk2 = 1 n

Xn i=1

(ti −Fi)2

⇒ Estimator Fb(Y)∈Rn?

(21)

Statistical framework: regression, least-squares loss

Observations: Y = (Y1, . . . ,Yn)∈Rn

Yi =Fii (e.g., Fi =F(xi)) with Yi ∈R, (εi)1≤i≤ni.i.d.

Fixed design: xi ∈ X deterministic

Least-squares loss of a predictort ∈Rn (“ti =t(xi)”):

1

nkt−Fk2 = 1 n

Xn i=1

(ti −Fi)2

⇒ Estimator Fb(Y)∈Rn?

(22)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Least-squares estimators

Natural idea: minimize an estimator of the risk 1nkt−Fk2

Least-squares criterion: 1

nkt−Yk2 = 1 n

Xn i=1

(ti−Yi)2

∀t ∈Rn , E 1

nkt−Yk2

= 1

n kt−Fk2+ 1 nE

hkεk2i

Model: S ⊂Rn ⇒ Least-squares estimator on S: FbS ∈argmin

t∈S

1

nkt−Yk2

= argmin

t∈S

( 1 n

Xn i=1

(ti −Yi)2 )

so that

FbS = ΠS(Y) (orthogonal projection)

(23)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Least-squares estimators

Natural idea: minimize an estimator of the risk 1nkt−Fk2 Least-squares criterion:

1

nkt−Yk2 = 1 n

Xn i=1

(ti−Yi)2

∀t ∈Rn , E 1

nkt−Yk2

= 1

n kt−Fk2+ 1 nE

hkεk2i

Model: S ⊂Rn ⇒ Least-squares estimator on S: FbS ∈argmin

t∈S

1

nkt−Yk2

= argmin

t∈S

( 1 n

Xn i=1

(ti −Yi)2 )

so that

FbS = ΠS(Y) (orthogonal projection)

(24)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Least-squares estimators

Natural idea: minimize an estimator of the risk 1nkt−Fk2 Least-squares criterion:

1

nkt−Yk2 = 1 n

Xn i=1

(ti−Yi)2

∀t ∈Rn , E 1

nkt−Yk2

= 1

n kt−Fk2+ 1 nE

hkεk2i

Model: S ⊂Rn ⇒ Least-squares estimatoronS: FbS ∈argmin

t∈S

1

nkt−Yk2

= argmin

t∈S

(1 n

Xn i=1

(ti −Yi)2 )

(25)

Model examples

histogramson some partition Λ of X

⇒ the least-squares estimator (regressogram) can be written Fbm(xi) =X

λ∈Λ

βbλ1xi∈λ βbλ= 1 Card{xi ∈λ}

X

xi∈λ

Yi

subspace generated by a subset of an orthogonal basis of L2(µ) (Fourier, wavelets, ...)

variable selection: xi =

xi(1), . . . ,xi(p)

∈Rp gathersp variables that can (linearly) explain Yi

∀m⊂ {1, . . . ,p} , Sm= vect n

x(j) s.t. j ∈m o

(26)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Model selection: regular regressograms

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

−3

−2

−1 0 1 2 3 4

0 1 2 3 4

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

−3

−2

−1 0 1 2 3 4

0 1 2 3 4

(27)

Model selection

Model collection (Sm)m∈M ⇒ (Fbm)m∈M ⇒ m(Yb ) ? Fbm= ΠmY = ΠSmY

Goal: minimize the risk, i.e.,

Oracle inequality(in expectation or with a large probability):

1 n

bFmb −F

2≤C inf

m∈M

1 n

bFm−F

2

+Rn

(28)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Bias-variance trade-off

E 1

n bFm−F

2

= Bias + Variance Bias or Approximation error

1

nkFm−Fk2 = 1

nkΠmF −Fk2 Variance or Estimation error

σ2dim(Sm) n

Bias-variance trade-off

⇔avoid overfittingandunderfitting

(29)

Bias-variance trade-off

E 1

n bFm−F

2

= Bias + Variance Bias or Approximation error

1

nkFm−Fk2 = 1

nkΠmF −Fk2 Variance or Estimation error

σ2dim(Sm) n

Bias-variance trade-off

⇔avoid overfittingandunderfitting

(30)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Why should the empirical risk be penalized?

−0.1

−0.05 0 0.05 0.1 0.15

Biais Exces de risque E[Risque empirique]

(31)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Penalization

b

m∈argmin

m∈M

1 n

bFm−Y

2+ pen(m)

Ideal penalty: penid(m) := 1

n bFm−F

2− 1

n

bFm−Y

2 = Risk−Empirical risk

Mallows’ heuristic: pen(m)≈E[ penid(m) ]

⇒ oracle inequality if Card(M) not too large (+ concentration inequalities)

⇒ Cp: 2σ2Dm/n (Mallows, 1973)

(32)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Penalization

b

m∈argmin

m∈M

1 n

bFm−Y

2+ pen(m)

Ideal penalty:

penid(m) := 1 n

bFm−F

2− 1

n

bFm−Y

2 = Risk−Empirical risk

Mallows’ heuristic: pen(m)≈E[ penid(m) ]

⇒ oracle inequality if Card(M) not too large (+ concentration inequalities)

⇒ Cp: 2σ2Dm/n (Mallows, 1973)

(33)

Penalization

b

m∈argmin

m∈M

1 n

bFm−Y

2+ pen(m)

Ideal penalty:

penid(m) := 1 n

bFm−F

2− 1

n

bFm−Y

2 = Risk−Empirical risk

Mallows’ heuristic: pen(m)≈E[ penid(m) ]

⇒ oracle inequality if Card(M) not too large (+ concentration inequalities)

⇒ Cp: 2σ2Dm/n (Mallows, 1973)

(34)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Oracle inequality

Theorem (Birg´e & Massart 2007, reformulated) Assumptions:

pen(m) = 2nDm

Gaussian noise: εi ∼ N(0, σ2)

Then, for everyγ >0, with probability at least 1−4Card(M)n−γ, if n≥n0(γ), for every η∈( 0,1 ),

1 n

bFmb −F

2 ≤( 1 +η) inf

m∈M

1 n

bFm−F

2

+80γlog(n)σ2 ηn

(35)

E [ Empirical risk ] + 0 × σ

2

D

m

n

−1

(OLS)

0 200 400 600 800 1000

0 0.5 1 1.5

2x 10−3

E[empirical error + 0* pen min]

(36)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

E [ Empirical risk ] + 0.8 × σ

2

D

m

n

−1

(OLS)

0.5 1 1.5

2x 10−3

E[empirical error + 0.8* pen min]

(37)

E [ Empirical risk ] + 0.9 × σ

2

D

m

n

−1

(OLS)

0 200 400 600 800 1000

0 0.5 1 1.5

2x 10−3

E[empirical error + 0.9* pen min]

(38)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

E [ Empirical risk ] + 1.1 × σ

2

D

m

n

−1

(OLS)

0.5 1 1.5

2x 10−3

E[empirical error + 1.1* pen min]

(39)

E [ Empirical risk ] + 1.2 × σ

2

D

m

n

−1

(OLS)

0 200 400 600 800 1000

0 0.5 1 1.5

2x 10−3

E[empirical error + 1.2* pen min]

(40)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

E [ Empirical risk ] + 2 × σ

2

D

m

n

−1

(OLS)

0.5 1 1.5

2x 10−3

E[empirical error + 2* pen min]

(41)

OLS: Dimension jump

0 1 2 3 4 5

0 200 400 600 800 1000

2

dimension of m(C)

(42)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

OLS: slope heuristics algorithm (Birg´ e & Massart 2007)

1 for every C >0, compute b

m(C)∈argmin

m∈Mn

1 n

bFm−Y

2+CDm n

2 findCbjump such thatD

m(Cb ) is “very large” when C <Cbjump and “reasonably small” whenC >Cbjump

3 select mb =mb

2Cbjump

Practical use: package (Baudry, Maugis & Michel, 2011)

(43)

Theorem (1): Dimension jump / Minimal penalty

Theorem (Birg´e & Massart 2007, reformulated) Assumptions:

∃m0 ∈ M, Sm0 =Rn, i.e., Fbm0 =Y ,

infm∈M{E[1nkFbm−Fk2]} ≤σ2δnn≤1/20, Gaussian noise: εi ∼ N(0, σ2).

Then, ∀γ >0,n≥n0(γ), w.p. at least 1−4 Card(M)n−γ,

∀C < 1−ηn

σ2, D

m(C)b ≥ 9n 10

∀C > 1 +ηn+

σ2, Dm(C)b ≤ n 10 with ηn= 81

qγlog(n)

nn+n + 20δn. In the first case,

1

nkbFm(C)−Fk282 infm∈Mn{1nkFbm−Fk2}.

(44)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Theorem (1’): Dimension jump / Minimal penalty

Under the same assumptions, on the same event,∀an,bn such that 2nδn+ 16.2p

γlog(n)n <bn<an <n ,

∀C < 1−ηn

σ2, D

m(Cb )≥an

∀C > 1 +ηn+

σ2, D

m(C)b ≤bn

with ηn= 1−an

n

−1r

γlog(n) n

ηn+= n

bn−nδn

δn+ 8.1

rγlog(n) n

!

(45)

Theorem (2): Oracle inequality

Theorem (Birg´e & Massart 2007, reformulated) Assumptions:

b

m∈argminm∈M{n1kbFm−Yk2+ 2bCjumpDnm}

∃m0 ∈ M, Sm0 =Rn, i.e., Fbm0 =Y , infm∈M{E[1n

bFm−F

2]} ≤σ2δnn≤1/20, Gaussian noise: εi ∼ N(0, σ2).

Then, with probability at least1−4 Card(M)n−γ, if n≥n0(γ), for everyη≥2 max{ηn, η+n },

1 n

bFmb −F

2 ≤( 1 + 3η) inf

m∈M

1 n

bFm−F

2

+880σ2γlog(n) ηn

(46)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Variance estimation

Slope heuristics: with probability 1−4 Card(M)n−γ, 1−81

rγlog(n)

n ≤ Cbjump

σ2 ≤1 +20δn+ 81

rγlog(n) n

Naive estimator: b

σm2 := 1 n−Dm

Y −Fbm

2

⇒ E b σm2

2+k(In−Πm)Fk2 n−Dm

Best variance estimator for E[ (σb−σ)2] is not necessarily the best for model selection.

(47)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Variance estimation

Slope heuristics: with probability 1−4 Card(M)n−γ, 1−81

rγlog(n)

n ≤ Cbjump

σ2 ≤1 + 20δn+ 81

rγlog(n) n Naive estimator:

b

σm2 := 1 n−Dm

Y −Fbm

2

⇒ E b σm2

2+ k(In−Πm)Fk2 n−Dm

Best variance estimator for E[ (σb−σ)2] is not necessarily the best for model selection.

(48)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Variance estimation

Slope heuristics: with probability 1−4 Card(M)n−γ, 1−81

rγlog(n)

n ≤ Cbjump

σ2 ≤1 + 20δn+ 81

rγlog(n) n Naive estimator:

b

σm2 := 1 n−Dm

Y −Fbm

2

⇒ E b σm2

2+ k(In−Πm)Fk2 n−Dm

Best variance estimator for E[ (σb−σ)2] is not necessarily the

(49)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Data-driven penalties

Naive estimator with some fixedm0 + plug in:

crit(m) = 1 n

Y −Fbm

2+ 2bσm20Dm

n

Drawbacks: choice ofm0? unknown bias (overpenalization)

FPE (Akaike, 1970; Baraud, Giraud & Huet, 2009) critFPE(m) = 1

n

Y −Fbm

2+2bσm2Dm

n = 1 n

Y −Fbm

2

1 + 2Dm

n−Dm

critBGH(m) = 1 n

Y −Fbm 2

1 + pen(m) n−Dm

Drawbacks: deal carefully with the largest models oracle inequalities (Baraud, Giraud & Huet, 2009) hold

assuming an upper bound on maxm∈MDm (FPE) or for a new penalty, very large for the largest models (BGH)

(50)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Data-driven penalties

Naive estimator with some fixedm0 + plug in:

crit(m) = 1 n

Y −Fbm

2+ 2bσm20Dm

n

Drawbacks: choice ofm0? unknown bias (overpenalization) FPE (Akaike, 1970; Baraud, Giraud & Huet, 2009)

critFPE(m) = 1 n

Y −Fbm

2+2bσm2Dm

n = 1 n

Y −Fbm 2

1 + 2Dm

n−Dm

critBGH(m) = 1 n

Y −Fbm 2

1 + pen(m) n−Dm

Drawbacks: deal carefully with the largest models

(51)

Generalized cross-validation (Craven & Wahba, 1978)

critGCV(m) = 1 n

Y −Fbm

2

1−Dm n

−2

IfDm n,

critGCV(m)≈ 1 n

Y −Fbm

2 n+Dm n−Dm

=critFPE(m) Drawbacks: deal carefully with the largest models

⇒e.g., for smoothing splines, oracle inequality assumes the effective dimension is≤n/5 for all m(Cao & Golubev, 2006)

(52)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Practical qualities of the algorithm

visual checking of existence of a jump

calibration independent from the choice of somem0 too strongoverfitting almost impossible

one remaining parameter: how tolocalize the jump

(53)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

How to localize the jump in practice?

Dimension jump: largest jump? jump on a geometrical window? complexity threshold?

Estimation of the slope of the empirical risk as a function of the dimension:

computed with which models? robust regression? Jump vs. slope? Take both!

⇒ package Capushe(Baudry, Maugis & Michel, 2011) http://www.math.univ-toulouse.fr/~maugis/CAPUSHE.html

(54)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

How to localize the jump in practice?

Dimension jump: largest jump? jump on a geometrical window? complexity threshold?

Estimation of the slope of the empirical risk as a function of the dimension:

computed with which models? robust regression?

Jump vs. slope? Take both!

⇒ package Capushe(Baudry, Maugis & Michel, 2011) http://www.math.univ-toulouse.fr/~maugis/CAPUSHE.html

(55)

How to localize the jump in practice?

Dimension jump: largest jump? jump on a geometrical window? complexity threshold?

Estimation of the slope of the empirical risk as a function of the dimension:

computed with which models? robust regression?

Jump vs. slope? Take both!

⇒ package Capushe(Baudry, Maugis & Michel, 2011) http://www.math.univ-toulouse.fr/~maugis/CAPUSHE.html

(56)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

CAPUSHE (Baudry, Maugis & Michel, 2011): jump

(57)

CAPUSHE (Baudry, Maugis & Michel, 2011): slope

(58)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Outline

1 Motivation

2 Slope heuristics for ordinary least-squares Framework

Optimal model selection

Minimal penalty and the slope heuristics Theoretical results

Variance estimation Practical considerations

3 Generalization: minimal penalties Linear estimators

Slope heuristics for linear estimators

Minimal penalty algorithm for linear estimators

(59)

Kernel ridge estimator (λ = 0.01)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

−3

−2

−1 0 1 2 3 4

(60)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

k -nearest-neighbours estimator (k = 20)

−2

−1 0 1 2 3 4

(61)

Nadaraya-Watson estimator (σ = 0.01)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

−3

−2

−1 0 1 2 3 4

(62)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Linear estimators

OLS:Fbm= ΠSmY (projection onto Sm)

(kernel) ridge regression, spline smoothing (Wahba, 1990):

Fbi =bf(xi) with bf ∈argmin

f∈FK

(1 n

Xn i=1

(Yi−f(xi) )2+λkfk2F

K

)

⇒ Fbλ,K =K(K +λI)−1Y where K = (K(xi,xj))1≤i,j≤n

k-nearest neighbours

Nadaraya-Watson estimators

(63)

Estimator selection: kernel ridge

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

−3

−2

−1 0 1 2 3 4

−3

−2

−1 0 1 2 3 4

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

−3

−2

−1 0 1 2 3 4

−3

−2

−1 0 1 2 3 4

(64)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Estimator selection: k nearest neighbours

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

−3

−2

−1 0 1 2 3 4

0 1 2 3 4

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

−3

−2

−1 0 1 2 3 4

0 1 2 3 4

(65)

Estimator selection: Nadaraya-Watson

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

−3

−2

−1 0 1 2 3 4

−3

−2

−1 0 1 2 3 4

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

−3

−2

−1 0 1 2 3 4

−3

−2

−1 0 1 2 3 4

(66)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Slope heuristics for linear estimators?

OLS

penCp(m) = 2σ2Dm

n argmin

m∈M

1

nkbFm−Yk2+CDm

n

⇒ D

m(Cb ) “jumps” at Cbjump≈σ2

⇒ optimal choice withm(2bb Cjump)

Linear estimators penCL(m) = 2σ2tr(Am)

n

argmin

m∈M

1

nkFbm−Yk2+Ctr(Am) n

Does tr(A

m(Cb )) jump at Cbjump≈σ2?

optimal choice withm(2b Cbjump) ?

(67)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Slope heuristics for linear estimators?

OLS

penCp(m) = 2σ2Dm

n argmin

m∈M

1

nkbFm−Yk2+CDm

n

⇒ D

m(Cb ) “jumps” at Cbjump≈σ2

⇒ optimal choice withm(2bb Cjump)

Linear estimators penCL(m) = 2σ2tr(Am)

n

argmin

m∈M

1

nkFbm−Yk2+Ctr(Am) n

Does tr(A

m(Cb )) jump at Cbjump≈σ2?

optimal choice withm(2b Cbjump) ?

(68)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Slope heuristics for linear estimators?

OLS

penCp(m) = 2σ2Dm

n argmin

m∈M

1

nkbFm−Yk2+CDm

n

⇒ D

m(Cb ) “jumps” at Cbjump≈σ2

⇒ optimal choice withm(2bb Cjump)

Linear estimators penCL(m) = 2σ2tr(Am)

n

argmin

m∈M

1

nkbFm−Yk2+Ctr(Am) n

Does tr(A

m(Cb )) jump at Cbjump≈σ2?

optimal choice withm(2b Cbjump) ?

(69)

No dimension jump with a penalty ∝ tr(A

m

)

−60 −4 −2 0 2 4 6

200 400 600 800 1000

2

tr(Aλ(C))

optimal penalty / 2

(70)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Minimal penalties for linear estimators

E 1

n bFm−F

2

= 1

nk(I−Am)Fk2+tr(A>mAm2

n = bias + variance

E 1

n

bFm−Y 2

2+1

nk(I−Am)Fk2− 2 tr(Am)−tr(A>mAm) σ2 n

⇒ optimal penalty ( 2 tr(Am) )σ2 n

⇒ minimal penalty 2 tr(Am)−tr(A>mAm) σ2 n

(71)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Minimal penalties for linear estimators

E 1

n bFm−F

2

= 1

nk(I−Am)Fk2+tr(A>mAm2

n = bias + variance

E 1

n

bFm−Y 2

2+1

nk(I−Am)Fk2− 2 tr(Am)−tr(A>mAm) σ2 n

⇒ optimal penalty ( 2 tr(Am) )σ2 n

⇒ minimal penalty 2 tr(Am)−tr(A>mAm) σ2 n

(72)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Minimal penalties for linear estimators

E 1

n bFm−F

2

= 1

nk(I−Am)Fk2+tr(A>mAm2

n = bias + variance

E 1

n

bFm−Y 2

2+1

nk(I−Am)Fk2− 2 tr(Am)−tr(A>mAm) σ2 n

⇒ optimal penalty ( 2 tr(Am) )σ2 n

⇒ minimal penalty 2 tr(Am)−tr(A>mAm) σ2 n

(73)

Minimal penalties for linear estimators

E 1

n bFm−F

2

= 1

nk(I−Am)Fk2+tr(A>mAm2

n = bias + variance

E 1

n

bFm−Y 2

2+1

nk(I−Am)Fk2− 2 tr(Am)−tr(A>mAm) σ2 n

⇒ optimal penalty ( 2 tr(Am) )σ2 n

⇒ minimal penalty 2 tr(Am)−tr(A>mAm) σ2 n

(74)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Minimal penalties for linear estimators

E 1

n bFm−F

2

= 1

nk(I−Am)Fk2+tr(A>mAm2

n = bias + variance

E 1

n

bFm−Y 2

2+1

nk(I−Am)Fk2− 2 tr(Am)−tr(A>mAm) σ2 n

⇒ optimal penalty ( 2 tr(Am) )σ2 n

(75)

“Dimension” jump (ridge regression)

−60 −4 −2 0 2 4 6

200 400 600 800 1000

2

tr(Aλ(C))

minimal penalty optimal penalty / 2

(76)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Penalty calibration algorithm (A. & Bach 2009)

1 for every C >0, compute b

mmin(C)∈argmin

m∈M

(1 n

Y −Fbm

2+C 2 tr(Am)−tr(A>mAm) n

)

2 findCbjump such thattr(Ambmin(C)) is “too large” when C <Cbjump and “reasonably small” whenC >Cbjump,

3 select

(1 2 2Cb tr(A ))

(77)

Theorem for linear estimators

Theorem (A. & Bach 2009–2011) Assumptions:

∀m∈ M,|||Am||| ≤L1 and tr(A>mAm)≤tr(Am)≤n

∃m0,m1∈ M, Am1 =In, Dm0 ≤√ n and

1

nkFm0−Fk2≤σ2p

log(n)/n . Gaussian noise: εi ∼ N(0, σ2)

Then, ∀γ >0,n≥n0(γ), w.p. at least 1−6 Card(M)n−γ,

∀C < 1−ηn

σ2, D

m(C)b ≥ n 3

∀C > 1 +ηn+

σ2, D

m(C)b ≤ n 10 with ηnn+=L2δ

qlog(n) n .

(78)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Theorem for linear estimators

Theorem (A. & Bach 2009–2011) Assumptions:

∀m∈ M,|||Am||| ≤L1 and tr(A>mAm)≤tr(Am)≤n

∃m0,m1∈ M, Am1 =In, Dm0 ≤√ n and

1

nkFm0−Fk2≤σ2p

log(n)/n . Gaussian noise: εi ∼ N(0, σ2)

Then, ∀γ >0,n≥n0(γ), w.p. at least 1−6 Card(M)n−γ,

∀C ∈(1−ηn,1 +ηn+),η∈(0,2),

∀mb ∈argmin

m∈M

1 n

Y −Fbm

2+ 2Ctr(Am) n

,

(79)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Comparison with least-squares

Linear estimators:

penmin(m) = σ2 2 tr(Am)−tr(A>mAm) n

penopt(m) = σ2( 2 tr(Am) ) n penopt(m)

penmin(m) = 2 tr(Am)

2 tr(Am)−tr(A>mAm) ∈(1,2]

Least-squares case:

A>mAm =Am ⇒ penopt(m)

penmin(m) = 2 ⇒ Slope heuristics

(80)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Comparison with least-squares

Linear estimators:

penmin(m) = σ2 2 tr(Am)−tr(A>mAm) n

penopt(m) = σ2( 2 tr(Am) ) n penopt(m)

penmin(m) = 2 tr(Am)

2 tr(Am)−tr(A>mAm) ∈(1,2]

Least-squares case:

pen (m)

(81)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

The k -nearest neighbours case

∀i,j ∈ {1, . . . ,n}, Ai,j

0,1 k

∀i ∈ {1, . . . ,n} , Ai,i = 1 k and

Xn j=1

Ai,j = 1

⇒ tr(A) = n

k = tr(A>A)

⇒ penopt = 2 penmin

(82)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

The k -nearest neighbours case

∀i,j ∈ {1, . . . ,n}, Ai,j

0,1 k

∀i ∈ {1, . . . ,n} , Ai,i = 1 k and

Xn j=1

Ai,j = 1

⇒ tr(A) = n

k = tr(A>A)

⇒ pen = 2 pen

(83)

Simulations: N-W, F

i

= sin(25πx

i3

), n = 200

3 4 5 6

0.5 1 1.5 2 2.5 3 3.5 4

mean( error / error oracle )

min. pen. largest min. pen. n/2 Mallows GCV

(84)

Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion

Simulations: ridge, F

i

= sin(25πx

i3

), n = 200

1 1.5 2 2.5 3 3.5 4

mean( error / error oracle )

min. pen. largest min. pen. n/2 Mallows GCV

(85)

General framework

Goal: find from data t∈Swith R(t) minimal.

Empirical risk Rbn(t)

Collection of estimators (bsm)m∈M

Oracle inequality:

R(bs

mb)− R(s?)≤Kn inf

m∈M{ R(bsm)− R(s?)}+Rn whereR(s?) := inft∈SR(t).

Références

Documents relatifs

We present a new algorithm to estimate this covariance matrix, based on the concept of minimal penalty, which was previously used in the single-task regression framework to estimate

Considering a capacity assumption of the space H [32, 5], and a general source condition [1] of the target function f H , we show high-probability, optimal convergence results in

For the ridge estimator and the ordinary least squares estimator, and their variants, we provide new risk bounds of order d/n without logarithmic factor unlike some stan- dard

Keywords: nonparametric regression, heteroscedastic noise, random design, optimal model selection, slope heuristics, hold-out penalty..

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des

Wavelet block thresholding for samples with random design: a minimax approach under the L p risk, Electronic Journal of Statistics 1, 331-346..

Penalized least squares regression is often used for signal denoising and inverse problems, and is commonly interpreted in a Bayesian framework as a Maximum A Posteriori

Motivated by estimation of stress-strain curves, an application from mechanical engineer- ing, we consider in this paper weighted Least Squares estimators in the problem of