Optimal data-driven estimator selection with minimal penalties
Sylvain Arlot (joint works with F. Bach, P. Massart & M.
Solnon)
1Cnrs
2Ecole Normale Sup´´ erieure (Paris),DI/ENS, ´EquipeSierra
CIRM, December, 12–13, 2013
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Plan
1 Motivation
2 Slope heuristics for ordinary least-squares Framework
Optimal model selection
Minimal penalty and the slope heuristics Theoretical results
Variance estimation Practical considerations
3 Generalization: minimal penalties Linear estimators
Slope heuristics for linear estimators
Minimal penalty algorithm for linear estimators
Outline
1 Motivation
2 Slope heuristics for ordinary least-squares Framework
Optimal model selection
Minimal penalty and the slope heuristics Theoretical results
Variance estimation Practical considerations
3 Generalization: minimal penalties Linear estimators
Slope heuristics for linear estimators
Minimal penalty algorithm for linear estimators Minimal penalty algorithm in general
4 Application to multi-task learning
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Regression: data (x
1, Y
1), . . . , (x
n, Y
n)
−2
−1 0 1 2 3 4
Goal: find the signal (denoising)
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
−3
−2
−1 0 1 2 3 4
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Estimators: example: regressogram
−2
−1 0 1 2 3 4
Estimators: regressogram, ridge, k -NN, NW
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
−3
−2
−1 0 1 2 3 4
−3
−2
−1 0 1 2 3 4
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
−3
−2
−1 0 1 2 3 4
−3
−2
−1 0 1 2 3 4
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Estimator selection: kernel ridge
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
−3
−2
−1 0 1 2 3 4
0 1 2 3 4
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
−3
−2
−1 0 1 2 3 4
0 1 2 3 4
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Estimator selection
Estimator collection (Fbm)m∈M ⇒m(Yb ) ? Goal: minimize the risk
Examples:
model selection
parameter tuning(choosingk or the distance fork-NN, choice of a regularization parameter, choice of a kernel, etc.)
choice betweendifferent methods ex.: k-NN vs. kernel ridge?
Classical approaches and their limitations: cross-validation: computational cost penalization: unknown constants
elbow heuristics: no clear definition/justification
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Estimator selection
Estimator collection (Fbm)m∈M ⇒m(Yb ) ? Goal: minimize the risk
Examples:
model selection
parameter tuning(choosingk or the distance fork-NN, choice of a regularization parameter, choice of a kernel, etc.)
choice betweendifferent methods ex.: k-NN vs. kernel ridge?
Classical approaches and their limitations: cross-validation: computational cost penalization: unknown constants
elbow heuristics: no clear definition/justification
Estimator selection
Estimator collection (Fbm)m∈M ⇒m(Yb ) ? Goal: minimize the risk
Examples:
model selection
parameter tuning(choosingk or the distance fork-NN, choice of a regularization parameter, choice of a kernel, etc.)
choice betweendifferent methods ex.: k-NN vs. kernel ridge?
Classical approaches and their limitations:
cross-validation: computational cost penalization: unknown constants
elbow heuristics: no clear definition/justification
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Penalties known up to a constant factor
b
m(Y)∈argmin
m∈M
n
Emp. risk(Fbm) + pen(m) o
Optimal penalties depending on the noise level σ2 (Mallows, 1973):
penCp(m) = 2σ2Dm
n penCL(m) = 2σ2tr(Am) n Rk: various methods for estimatingσ2 or avoiding its
estimation (FPE, Akaike, 1970; GCV, Craven & Wahba, 1978;
Baraud, Giraud & Huet, 2009).
Optimal penalty known asymptotically(AIC; Akaike, 1973) Resampling-based penalties
Optimal constant unknown even in theory (change-point detection, mixture models, global/local Rademacher complexities, ...)
Goals: estimation of the optimal constant (e.g.,σ2) for estimator selection, under minimal assumptions, without overfitting
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Penalties known up to a constant factor
b
m(Y)∈argmin
m∈M
n
Emp. risk(Fbm) + pen(m) o
Optimal penalties depending on the noise level σ2 (Mallows, 1973):
penCp(m) = 2σ2Dm
n penCL(m) = 2σ2tr(Am) n
Optimal penalty known asymptotically(AIC; Akaike, 1973) Resampling-based penalties
Optimal constant unknown even in theory (change-point detection, mixture models, global/local Rademacher complexities, ...)
Goals: estimation of the optimal constant (e.g.,σ2) for estimator selection, under minimal assumptions, without overfitting
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Penalties known up to a constant factor
b
m(Y)∈argmin
m∈M
n
Emp. risk(Fbm) + pen(m) o
Optimal penalties depending on the noise level σ2 (Mallows, 1973):
penCp(m) = 2σ2Dm
n penCL(m) = 2σ2tr(Am) n
Optimal penalty known asymptotically(AIC; Akaike, 1973) Resampling-based penalties
Optimal constant unknown even in theory (change-point detection, mixture models, global/local Rademacher
“L-curve” and elbow heuristics?
0 200 400 600 800 1000
0 0.5 1 1.5
2x 10−3
empirical error
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
“L-curve” and elbow heuristics?
0.5 1 1.5
2x 10−3
empirical / generalization error
Outline
1 Motivation
2 Slope heuristics for ordinary least-squares Framework
Optimal model selection
Minimal penalty and the slope heuristics Theoretical results
Variance estimation Practical considerations
3 Generalization: minimal penalties Linear estimators
Slope heuristics for linear estimators
Minimal penalty algorithm for linear estimators Minimal penalty algorithm in general
4 Application to multi-task learning
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Statistical framework: regression, least-squares loss
Observations: Y = (Y1, . . . ,Yn)∈Rn
Yi =Fi+εi (e.g., Fi =F(xi)) with Yi ∈R, (εi)1≤i≤ni.i.d.
Fixed design: xi ∈ X deterministic
Least-squares loss of a predictort ∈Rn (“ti =t(xi)”): 1
nkt−Fk2 = 1 n
Xn i=1
(ti −Fi)2
⇒ Estimator Fb(Y)∈Rn?
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Statistical framework: regression, least-squares loss
Observations: Y = (Y1, . . . ,Yn)∈Rn
Yi =Fi+εi (e.g., Fi =F(xi)) with Yi ∈R, (εi)1≤i≤ni.i.d.
Fixed design: xi ∈ X deterministic
Least-squares loss of a predictort ∈Rn (“ti =t(xi)”): 1
nkt−Fk2 = 1 n
Xn i=1
(ti −Fi)2
⇒ Estimator Fb(Y)∈Rn?
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Statistical framework: regression, least-squares loss
Observations: Y = (Y1, . . . ,Yn)∈Rn
Yi =Fi+εi (e.g., Fi =F(xi)) with Yi ∈R, (εi)1≤i≤ni.i.d.
Fixed design: xi ∈ X deterministic
Least-squares loss of a predictort ∈Rn (“ti =t(xi)”):
1
nkt−Fk2 = 1 n
Xn i=1
(ti −Fi)2
⇒ Estimator Fb(Y)∈Rn?
Statistical framework: regression, least-squares loss
Observations: Y = (Y1, . . . ,Yn)∈Rn
Yi =Fi+εi (e.g., Fi =F(xi)) with Yi ∈R, (εi)1≤i≤ni.i.d.
Fixed design: xi ∈ X deterministic
Least-squares loss of a predictort ∈Rn (“ti =t(xi)”):
1
nkt−Fk2 = 1 n
Xn i=1
(ti −Fi)2
⇒ Estimator Fb(Y)∈Rn?
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Least-squares estimators
Natural idea: minimize an estimator of the risk 1nkt−Fk2
Least-squares criterion: 1
nkt−Yk2 = 1 n
Xn i=1
(ti−Yi)2
∀t ∈Rn , E 1
nkt−Yk2
= 1
n kt−Fk2+ 1 nE
hkεk2i
Model: S ⊂Rn ⇒ Least-squares estimator on S: FbS ∈argmin
t∈S
1
nkt−Yk2
= argmin
t∈S
( 1 n
Xn i=1
(ti −Yi)2 )
so that
FbS = ΠS(Y) (orthogonal projection)
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Least-squares estimators
Natural idea: minimize an estimator of the risk 1nkt−Fk2 Least-squares criterion:
1
nkt−Yk2 = 1 n
Xn i=1
(ti−Yi)2
∀t ∈Rn , E 1
nkt−Yk2
= 1
n kt−Fk2+ 1 nE
hkεk2i
Model: S ⊂Rn ⇒ Least-squares estimator on S: FbS ∈argmin
t∈S
1
nkt−Yk2
= argmin
t∈S
( 1 n
Xn i=1
(ti −Yi)2 )
so that
FbS = ΠS(Y) (orthogonal projection)
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Least-squares estimators
Natural idea: minimize an estimator of the risk 1nkt−Fk2 Least-squares criterion:
1
nkt−Yk2 = 1 n
Xn i=1
(ti−Yi)2
∀t ∈Rn , E 1
nkt−Yk2
= 1
n kt−Fk2+ 1 nE
hkεk2i
Model: S ⊂Rn ⇒ Least-squares estimatoronS: FbS ∈argmin
t∈S
1
nkt−Yk2
= argmin
t∈S
(1 n
Xn i=1
(ti −Yi)2 )
Model examples
histogramson some partition Λ of X
⇒ the least-squares estimator (regressogram) can be written Fbm(xi) =X
λ∈Λ
βbλ1xi∈λ βbλ= 1 Card{xi ∈λ}
X
xi∈λ
Yi
subspace generated by a subset of an orthogonal basis of L2(µ) (Fourier, wavelets, ...)
variable selection: xi =
xi(1), . . . ,xi(p)
∈Rp gathersp variables that can (linearly) explain Yi
∀m⊂ {1, . . . ,p} , Sm= vect n
x(j) s.t. j ∈m o
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Model selection: regular regressograms
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
−3
−2
−1 0 1 2 3 4
0 1 2 3 4
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
−3
−2
−1 0 1 2 3 4
0 1 2 3 4
Model selection
Model collection (Sm)m∈M ⇒ (Fbm)m∈M ⇒ m(Yb ) ? Fbm= ΠmY = ΠSmY
Goal: minimize the risk, i.e.,
Oracle inequality(in expectation or with a large probability):
1 n
bFmb −F
2≤C inf
m∈M
1 n
bFm−F
2
+Rn
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Bias-variance trade-off
E 1
n bFm−F
2
= Bias + Variance Bias or Approximation error
1
nkFm−Fk2 = 1
nkΠmF −Fk2 Variance or Estimation error
σ2dim(Sm) n
Bias-variance trade-off
⇔avoid overfittingandunderfitting
Bias-variance trade-off
E 1
n bFm−F
2
= Bias + Variance Bias or Approximation error
1
nkFm−Fk2 = 1
nkΠmF −Fk2 Variance or Estimation error
σ2dim(Sm) n
Bias-variance trade-off
⇔avoid overfittingandunderfitting
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Why should the empirical risk be penalized?
−0.1
−0.05 0 0.05 0.1 0.15
Biais Exces de risque E[Risque empirique]
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Penalization
b
m∈argmin
m∈M
1 n
bFm−Y
2+ pen(m)
Ideal penalty: penid(m) := 1
n bFm−F
2− 1
n
bFm−Y
2 = Risk−Empirical risk
Mallows’ heuristic: pen(m)≈E[ penid(m) ]
⇒ oracle inequality if Card(M) not too large (+ concentration inequalities)
⇒ Cp: 2σ2Dm/n (Mallows, 1973)
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Penalization
b
m∈argmin
m∈M
1 n
bFm−Y
2+ pen(m)
Ideal penalty:
penid(m) := 1 n
bFm−F
2− 1
n
bFm−Y
2 = Risk−Empirical risk
Mallows’ heuristic: pen(m)≈E[ penid(m) ]
⇒ oracle inequality if Card(M) not too large (+ concentration inequalities)
⇒ Cp: 2σ2Dm/n (Mallows, 1973)
Penalization
b
m∈argmin
m∈M
1 n
bFm−Y
2+ pen(m)
Ideal penalty:
penid(m) := 1 n
bFm−F
2− 1
n
bFm−Y
2 = Risk−Empirical risk
Mallows’ heuristic: pen(m)≈E[ penid(m) ]
⇒ oracle inequality if Card(M) not too large (+ concentration inequalities)
⇒ Cp: 2σ2Dm/n (Mallows, 1973)
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Oracle inequality
Theorem (Birg´e & Massart 2007, reformulated) Assumptions:
pen(m) = 2σ2nDm
Gaussian noise: εi ∼ N(0, σ2)
Then, for everyγ >0, with probability at least 1−4Card(M)n−γ, if n≥n0(γ), for every η∈( 0,1 ),
1 n
bFmb −F
2 ≤( 1 +η) inf
m∈M
1 n
bFm−F
2
+80γlog(n)σ2 ηn
E [ Empirical risk ] + 0 × σ
2D
mn
−1(OLS)
0 200 400 600 800 1000
0 0.5 1 1.5
2x 10−3
E[empirical error + 0* pen min]
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
E [ Empirical risk ] + 0.8 × σ
2D
mn
−1(OLS)
0.5 1 1.5
2x 10−3
E[empirical error + 0.8* pen min]
E [ Empirical risk ] + 0.9 × σ
2D
mn
−1(OLS)
0 200 400 600 800 1000
0 0.5 1 1.5
2x 10−3
E[empirical error + 0.9* pen min]
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
E [ Empirical risk ] + 1.1 × σ
2D
mn
−1(OLS)
0.5 1 1.5
2x 10−3
E[empirical error + 1.1* pen min]
E [ Empirical risk ] + 1.2 × σ
2D
mn
−1(OLS)
0 200 400 600 800 1000
0 0.5 1 1.5
2x 10−3
E[empirical error + 1.2* pen min]
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
E [ Empirical risk ] + 2 × σ
2D
mn
−1(OLS)
0.5 1 1.5
2x 10−3
E[empirical error + 2* pen min]
OLS: Dimension jump
0 1 2 3 4 5
0 200 400 600 800 1000
2
dimension of m(C)
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
OLS: slope heuristics algorithm (Birg´ e & Massart 2007)
1 for every C >0, compute b
m(C)∈argmin
m∈Mn
1 n
bFm−Y
2+CDm n
2 findCbjump such thatD
m(Cb ) is “very large” when C <Cbjump and “reasonably small” whenC >Cbjump
3 select mb =mb
2Cbjump
Practical use: package (Baudry, Maugis & Michel, 2011)
Theorem (1): Dimension jump / Minimal penalty
Theorem (Birg´e & Massart 2007, reformulated) Assumptions:
∃m0 ∈ M, Sm0 =Rn, i.e., Fbm0 =Y ,
infm∈M{E[1nkFbm−Fk2]} ≤σ2δn,δn≤1/20, Gaussian noise: εi ∼ N(0, σ2).
Then, ∀γ >0,n≥n0(γ), w.p. at least 1−4 Card(M)n−γ,
∀C < 1−ηn−
σ2, D
m(C)b ≥ 9n 10
∀C > 1 +ηn+
σ2, Dm(C)b ≤ n 10 with ηn−= 81
qγlog(n)
n ,ηn+=η−n + 20δn. In the first case,
1
nkbFm(C)−Fk2 ≥ 7σ82 infm∈Mn{1nkFbm−Fk2}.
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Theorem (1’): Dimension jump / Minimal penalty
Under the same assumptions, on the same event,∀an,bn such that 2nδn+ 16.2p
γlog(n)n <bn<an <n ,
∀C < 1−η−n
σ2, D
m(Cb )≥an
∀C > 1 +ηn+
σ2, D
m(C)b ≤bn
with ηn−= 1−an
n
−1r
γlog(n) n
ηn+= n
bn−nδn
δn+ 8.1
rγlog(n) n
!
Theorem (2): Oracle inequality
Theorem (Birg´e & Massart 2007, reformulated) Assumptions:
b
m∈argminm∈M{n1kbFm−Yk2+ 2bCjumpDnm}
∃m0 ∈ M, Sm0 =Rn, i.e., Fbm0 =Y , infm∈M{E[1n
bFm−F
2]} ≤σ2δn,δn≤1/20, Gaussian noise: εi ∼ N(0, σ2).
Then, with probability at least1−4 Card(M)n−γ, if n≥n0(γ), for everyη≥2 max{ηn−, η+n },
1 n
bFmb −F
2 ≤( 1 + 3η) inf
m∈M
1 n
bFm−F
2
+880σ2γlog(n) ηn
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Variance estimation
Slope heuristics: with probability 1−4 Card(M)n−γ, 1−81
rγlog(n)
n ≤ Cbjump
σ2 ≤1 +20δn+ 81
rγlog(n) n
Naive estimator: b
σm2 := 1 n−Dm
Y −Fbm
2
⇒ E b σm2
=σ2+k(In−Πm)Fk2 n−Dm
Best variance estimator for E[ (σb−σ)2] is not necessarily the best for model selection.
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Variance estimation
Slope heuristics: with probability 1−4 Card(M)n−γ, 1−81
rγlog(n)
n ≤ Cbjump
σ2 ≤1 + 20δn+ 81
rγlog(n) n Naive estimator:
b
σm2 := 1 n−Dm
Y −Fbm
2
⇒ E b σm2
=σ2+ k(In−Πm)Fk2 n−Dm
Best variance estimator for E[ (σb−σ)2] is not necessarily the best for model selection.
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Variance estimation
Slope heuristics: with probability 1−4 Card(M)n−γ, 1−81
rγlog(n)
n ≤ Cbjump
σ2 ≤1 + 20δn+ 81
rγlog(n) n Naive estimator:
b
σm2 := 1 n−Dm
Y −Fbm
2
⇒ E b σm2
=σ2+ k(In−Πm)Fk2 n−Dm
Best variance estimator for E[ (σb−σ)2] is not necessarily the
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Data-driven penalties
Naive estimator with some fixedm0 + plug in:
crit(m) = 1 n
Y −Fbm
2+ 2bσm20Dm
n
Drawbacks: choice ofm0? unknown bias (overpenalization)
FPE (Akaike, 1970; Baraud, Giraud & Huet, 2009) critFPE(m) = 1
n
Y −Fbm
2+2bσm2Dm
n = 1 n
Y −Fbm
2
1 + 2Dm
n−Dm
critBGH(m) = 1 n
Y −Fbm 2
1 + pen(m) n−Dm
Drawbacks: deal carefully with the largest models oracle inequalities (Baraud, Giraud & Huet, 2009) hold
assuming an upper bound on maxm∈MDm (FPE) or for a new penalty, very large for the largest models (BGH)
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Data-driven penalties
Naive estimator with some fixedm0 + plug in:
crit(m) = 1 n
Y −Fbm
2+ 2bσm20Dm
n
Drawbacks: choice ofm0? unknown bias (overpenalization) FPE (Akaike, 1970; Baraud, Giraud & Huet, 2009)
critFPE(m) = 1 n
Y −Fbm
2+2bσm2Dm
n = 1 n
Y −Fbm 2
1 + 2Dm
n−Dm
critBGH(m) = 1 n
Y −Fbm 2
1 + pen(m) n−Dm
Drawbacks: deal carefully with the largest models
Generalized cross-validation (Craven & Wahba, 1978)
critGCV(m) = 1 n
Y −Fbm
2
1−Dm n
−2
IfDm n,
critGCV(m)≈ 1 n
Y −Fbm
2 n+Dm n−Dm
=critFPE(m) Drawbacks: deal carefully with the largest models
⇒e.g., for smoothing splines, oracle inequality assumes the effective dimension is≤n/5 for all m(Cao & Golubev, 2006)
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Practical qualities of the algorithm
visual checking of existence of a jump
calibration independent from the choice of somem0 too strongoverfitting almost impossible
one remaining parameter: how tolocalize the jump
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
How to localize the jump in practice?
Dimension jump: largest jump? jump on a geometrical window? complexity threshold?
Estimation of the slope of the empirical risk as a function of the dimension:
computed with which models? robust regression? Jump vs. slope? Take both!
⇒ package Capushe(Baudry, Maugis & Michel, 2011) http://www.math.univ-toulouse.fr/~maugis/CAPUSHE.html
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
How to localize the jump in practice?
Dimension jump: largest jump? jump on a geometrical window? complexity threshold?
Estimation of the slope of the empirical risk as a function of the dimension:
computed with which models? robust regression?
Jump vs. slope? Take both!
⇒ package Capushe(Baudry, Maugis & Michel, 2011) http://www.math.univ-toulouse.fr/~maugis/CAPUSHE.html
How to localize the jump in practice?
Dimension jump: largest jump? jump on a geometrical window? complexity threshold?
Estimation of the slope of the empirical risk as a function of the dimension:
computed with which models? robust regression?
Jump vs. slope? Take both!
⇒ package Capushe(Baudry, Maugis & Michel, 2011) http://www.math.univ-toulouse.fr/~maugis/CAPUSHE.html
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
CAPUSHE (Baudry, Maugis & Michel, 2011): jump
CAPUSHE (Baudry, Maugis & Michel, 2011): slope
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Outline
1 Motivation
2 Slope heuristics for ordinary least-squares Framework
Optimal model selection
Minimal penalty and the slope heuristics Theoretical results
Variance estimation Practical considerations
3 Generalization: minimal penalties Linear estimators
Slope heuristics for linear estimators
Minimal penalty algorithm for linear estimators
Kernel ridge estimator (λ = 0.01)
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
−3
−2
−1 0 1 2 3 4
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
k -nearest-neighbours estimator (k = 20)
−2
−1 0 1 2 3 4
Nadaraya-Watson estimator (σ = 0.01)
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
−3
−2
−1 0 1 2 3 4
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Linear estimators
OLS:Fbm= ΠSmY (projection onto Sm)
(kernel) ridge regression, spline smoothing (Wahba, 1990):
Fbi =bf(xi) with bf ∈argmin
f∈FK
(1 n
Xn i=1
(Yi−f(xi) )2+λkfk2F
K
)
⇒ Fbλ,K =K(K +λI)−1Y where K = (K(xi,xj))1≤i,j≤n
k-nearest neighbours
Nadaraya-Watson estimators
Estimator selection: kernel ridge
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
−3
−2
−1 0 1 2 3 4
−3
−2
−1 0 1 2 3 4
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
−3
−2
−1 0 1 2 3 4
−3
−2
−1 0 1 2 3 4
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Estimator selection: k nearest neighbours
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
−3
−2
−1 0 1 2 3 4
0 1 2 3 4
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
−3
−2
−1 0 1 2 3 4
0 1 2 3 4
Estimator selection: Nadaraya-Watson
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
−3
−2
−1 0 1 2 3 4
−3
−2
−1 0 1 2 3 4
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
−3
−2
−1 0 1 2 3 4
−3
−2
−1 0 1 2 3 4
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Slope heuristics for linear estimators?
OLS
penCp(m) = 2σ2Dm
n argmin
m∈M
1
nkbFm−Yk2+CDm
n
⇒ D
m(Cb ) “jumps” at Cbjump≈σ2
⇒ optimal choice withm(2bb Cjump)
Linear estimators penCL(m) = 2σ2tr(Am)
n
argmin
m∈M
1
nkFbm−Yk2+Ctr(Am) n
Does tr(A
m(Cb )) jump at Cbjump≈σ2?
optimal choice withm(2b Cbjump) ?
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Slope heuristics for linear estimators?
OLS
penCp(m) = 2σ2Dm
n argmin
m∈M
1
nkbFm−Yk2+CDm
n
⇒ D
m(Cb ) “jumps” at Cbjump≈σ2
⇒ optimal choice withm(2bb Cjump)
Linear estimators penCL(m) = 2σ2tr(Am)
n
argmin
m∈M
1
nkFbm−Yk2+Ctr(Am) n
Does tr(A
m(Cb )) jump at Cbjump≈σ2?
optimal choice withm(2b Cbjump) ?
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Slope heuristics for linear estimators?
OLS
penCp(m) = 2σ2Dm
n argmin
m∈M
1
nkbFm−Yk2+CDm
n
⇒ D
m(Cb ) “jumps” at Cbjump≈σ2
⇒ optimal choice withm(2bb Cjump)
Linear estimators penCL(m) = 2σ2tr(Am)
n
argmin
m∈M
1
nkbFm−Yk2+Ctr(Am) n
Does tr(A
m(Cb )) jump at Cbjump≈σ2?
optimal choice withm(2b Cbjump) ?
No dimension jump with a penalty ∝ tr(A
m)
−60 −4 −2 0 2 4 6
200 400 600 800 1000
2
tr(Aλ(C))
optimal penalty / 2
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Minimal penalties for linear estimators
E 1
n bFm−F
2
= 1
nk(I−Am)Fk2+tr(A>mAm)σ2
n = bias + variance
E 1
n
bFm−Y 2
=σ2+1
nk(I−Am)Fk2− 2 tr(Am)−tr(A>mAm) σ2 n
⇒ optimal penalty ( 2 tr(Am) )σ2 n
⇒ minimal penalty 2 tr(Am)−tr(A>mAm) σ2 n
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Minimal penalties for linear estimators
E 1
n bFm−F
2
= 1
nk(I−Am)Fk2+tr(A>mAm)σ2
n = bias + variance
E 1
n
bFm−Y 2
=σ2+1
nk(I−Am)Fk2− 2 tr(Am)−tr(A>mAm) σ2 n
⇒ optimal penalty ( 2 tr(Am) )σ2 n
⇒ minimal penalty 2 tr(Am)−tr(A>mAm) σ2 n
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Minimal penalties for linear estimators
E 1
n bFm−F
2
= 1
nk(I−Am)Fk2+tr(A>mAm)σ2
n = bias + variance
E 1
n
bFm−Y 2
=σ2+1
nk(I−Am)Fk2− 2 tr(Am)−tr(A>mAm) σ2 n
⇒ optimal penalty ( 2 tr(Am) )σ2 n
⇒ minimal penalty 2 tr(Am)−tr(A>mAm) σ2 n
Minimal penalties for linear estimators
E 1
n bFm−F
2
= 1
nk(I−Am)Fk2+tr(A>mAm)σ2
n = bias + variance
E 1
n
bFm−Y 2
=σ2+1
nk(I−Am)Fk2− 2 tr(Am)−tr(A>mAm) σ2 n
⇒ optimal penalty ( 2 tr(Am) )σ2 n
⇒ minimal penalty 2 tr(Am)−tr(A>mAm) σ2 n
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Minimal penalties for linear estimators
E 1
n bFm−F
2
= 1
nk(I−Am)Fk2+tr(A>mAm)σ2
n = bias + variance
E 1
n
bFm−Y 2
=σ2+1
nk(I−Am)Fk2− 2 tr(Am)−tr(A>mAm) σ2 n
⇒ optimal penalty ( 2 tr(Am) )σ2 n
“Dimension” jump (ridge regression)
−60 −4 −2 0 2 4 6
200 400 600 800 1000
2
tr(Aλ(C))
minimal penalty optimal penalty / 2
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Penalty calibration algorithm (A. & Bach 2009)
1 for every C >0, compute b
mmin(C)∈argmin
m∈M
(1 n
Y −Fbm
2+C 2 tr(Am)−tr(A>mAm) n
)
2 findCbjump such thattr(Ambmin(C)) is “too large” when C <Cbjump and “reasonably small” whenC >Cbjump,
3 select
(1 2 2Cb tr(A ))
Theorem for linear estimators
Theorem (A. & Bach 2009–2011) Assumptions:
∀m∈ M,|||Am||| ≤L1 and tr(A>mAm)≤tr(Am)≤n
∃m0,m1∈ M, Am1 =In, Dm0 ≤√ n and
1
nkFm0−Fk2≤σ2p
log(n)/n . Gaussian noise: εi ∼ N(0, σ2)
Then, ∀γ >0,n≥n0(γ), w.p. at least 1−6 Card(M)n−γ,
∀C < 1−ηn−
σ2, D
m(C)b ≥ n 3
∀C > 1 +ηn+
σ2, D
m(C)b ≤ n 10 with ηn−=ηn+=L2δ
qlog(n) n .
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Theorem for linear estimators
Theorem (A. & Bach 2009–2011) Assumptions:
∀m∈ M,|||Am||| ≤L1 and tr(A>mAm)≤tr(Am)≤n
∃m0,m1∈ M, Am1 =In, Dm0 ≤√ n and
1
nkFm0−Fk2≤σ2p
log(n)/n . Gaussian noise: εi ∼ N(0, σ2)
Then, ∀γ >0,n≥n0(γ), w.p. at least 1−6 Card(M)n−γ,
∀C ∈(1−η−n,1 +ηn+),η∈(0,2),
∀mb ∈argmin
m∈M
1 n
Y −Fbm
2+ 2Ctr(Am) n
,
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Comparison with least-squares
Linear estimators:
penmin(m) = σ2 2 tr(Am)−tr(A>mAm) n
penopt(m) = σ2( 2 tr(Am) ) n penopt(m)
penmin(m) = 2 tr(Am)
2 tr(Am)−tr(A>mAm) ∈(1,2]
Least-squares case:
A>mAm =Am ⇒ penopt(m)
penmin(m) = 2 ⇒ Slope heuristics
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Comparison with least-squares
Linear estimators:
penmin(m) = σ2 2 tr(Am)−tr(A>mAm) n
penopt(m) = σ2( 2 tr(Am) ) n penopt(m)
penmin(m) = 2 tr(Am)
2 tr(Am)−tr(A>mAm) ∈(1,2]
Least-squares case:
pen (m)
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
The k -nearest neighbours case
∀i,j ∈ {1, . . . ,n}, Ai,j ∈
0,1 k
∀i ∈ {1, . . . ,n} , Ai,i = 1 k and
Xn j=1
Ai,j = 1
⇒ tr(A) = n
k = tr(A>A)
⇒ penopt = 2 penmin
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
The k -nearest neighbours case
∀i,j ∈ {1, . . . ,n}, Ai,j ∈
0,1 k
∀i ∈ {1, . . . ,n} , Ai,i = 1 k and
Xn j=1
Ai,j = 1
⇒ tr(A) = n
k = tr(A>A)
⇒ pen = 2 pen
Simulations: N-W, F
i= sin(25πx
i3), n = 200
3 4 5 6
0.5 1 1.5 2 2.5 3 3.5 4
mean( error / error oracle )
min. pen. largest min. pen. n/2 Mallows GCV
Motivation Slope heuristics for OLS Minimal penalties Multi-task Conclusion
Simulations: ridge, F
i= sin(25πx
i3), n = 200
1 1.5 2 2.5 3 3.5 4
mean( error / error oracle )
min. pen. largest min. pen. n/2 Mallows GCV
General framework
Goal: find from data t∈Swith R(t) minimal.
Empirical risk Rbn(t)
Collection of estimators (bsm)m∈M
Oracle inequality:
R(bs
mb)− R(s?)≤Kn inf
m∈M{ R(bsm)− R(s?)}+Rn whereR(s?) := inft∈SR(t).