Introduction Lin. estim. selection Penalty calibration (OLS) Linear estimators Multi-task Conclusion
Data-driven calibration of linear estimators with minimal penalties, with an application to multi-task regression

Sylvain Arlot (joint works with F. Bach & M. Solnon)
1. CNRS
2. École Normale Supérieure (Paris), LIENS, Équipe SIERRA

Cambridge, November 4th, 2011
Regression: data (x_1, Y_1), \dots, (x_n, Y_n)

[Figure: scatter plot of the data points (x_i, Y_i)]
Goal: find the signal (denoising)

[Figure: the same data, with the signal to be recovered]
Statistical framework: regression, least-squares loss

Observations: Y = (Y_1, \dots, Y_n) \in \mathbb{R}^n,
Y_i = F_i + \varepsilon_i (e.g., F_i = F(x_i)), with Y_i \in \mathbb{R} and (\varepsilon_i)_{1 \le i \le n} i.i.d.
Fixed design: the x_i are deterministic.
Least-squares loss of a predictor t \in \mathbb{R}^n ("t_i = t(x_i)"):
\frac{1}{n} \| t - F \|^2 = \frac{1}{n} \sum_{i=1}^n (t_i - F_i)^2
⇒ Estimator \hat{F}(Y) \in \mathbb{R}^n?
Estimators: example: regressogram

[Figure: regressogram fit on the data]
Least-squares estimators

Natural idea: minimize an estimator of the risk \frac{1}{n} \| t - F \|^2.
Least-squares criterion:
\frac{1}{n} \| t - Y \|^2 = \frac{1}{n} \sum_{i=1}^n (t_i - Y_i)^2
∀ t \in \mathbb{R}^n, \quad \mathbb{E}\left[ \frac{1}{n} \| t - Y \|^2 \right] = \frac{1}{n} \| t - F \|^2 + \frac{1}{n} \mathbb{E}\left[ \| \varepsilon \|^2 \right]
Model: S \subset \mathbb{R}^n ⇒ least-squares estimator on S:
\hat{F}_S \in \arg\min_{t \in S} \left\{ \frac{1}{n} \| t - Y \|^2 \right\} = \arg\min_{t \in S} \left\{ \frac{1}{n} \sum_{i=1}^n (t_i - Y_i)^2 \right\}
so that \hat{F}_S = \Pi_S(Y) (orthogonal projection).
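As a quick illustration of \hat{F}_S = \Pi_S(Y), here is a minimal numpy sketch (the design matrix, coefficients, and noise level are all illustrative, not from the talk):

```python
import numpy as np

# Least-squares estimator on S = span of the columns of X is the orthogonal
# projection Pi_S(Y) of the observations onto S.
rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 3))          # basis of the model S (hypothetical design)
F = X @ np.array([1.0, -2.0, 0.5])   # true signal, here inside S
Y = F + 0.3 * rng.normal(size=n)     # noisy observations

# Projection matrix Pi_S = X (X^T X)^{-1} X^T, then F_hat_S = Pi_S Y
Pi_S = X @ np.linalg.solve(X.T @ X, X.T)
F_hat = Pi_S @ Y

# Same result via a least-squares solve (numerically preferable)
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

The residual Y − \hat{F}_S is orthogonal to S, which is the defining property of the projection.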
Model examples

- Histograms on some partition Λ of X ⇒ the least-squares estimator (regressogram) can be written
  \hat{F}_m(x_i) = \sum_{\lambda \in \Lambda} \hat{\beta}_\lambda 1_{x_i \in \lambda}, \quad \hat{\beta}_\lambda = \frac{1}{\operatorname{Card}\{ x_i \in \lambda \}} \sum_{x_i \in \lambda} Y_i
- Subspace generated by a subset of an orthogonal basis of L^2(\mu) (Fourier, wavelets, and so on)
- Variable selection: x_i = (x_i^{(1)}, \dots, x_i^{(p)}) \in \mathbb{R}^p gathers p variables that can (linearly) explain Y_i:
  ∀ m \subset \{1, \dots, p\}, \quad S_m = \operatorname{vect}\{ x^{(j)} : j \in m \}.
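The regressogram formula above can be sketched as follows (a minimal illustration assuming a regular partition of [0, 1] into D bins; the function name and data are invented):

```python
import numpy as np

def regressogram(x, y, D):
    """Piecewise-constant least-squares fit on a regular D-bin partition of [0, 1]:
    beta_lambda is the mean of the Y_i whose x_i fall in bin lambda."""
    bins = np.minimum((x * D).astype(int), D - 1)      # bin index of each x_i
    beta = np.array([y[bins == b].mean() if np.any(bins == b) else 0.0
                     for b in range(D)])
    return beta[bins]                                   # F_hat_m evaluated at the x_i

rng = np.random.default_rng(1)
x = rng.uniform(size=200)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=200)
fit = regressogram(x, y, D=8)
```

Within each bin the fit is constant and equal to the local mean of the responses.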
k-nearest-neighbours estimator (k = 20)

[Figure: k-NN fit on the data]
Nadaraya-Watson estimator (σ = 0.01)

[Figure: Nadaraya-Watson fit on the data]
Linear estimators

- OLS: \hat{F}_m = \Pi_{S_m} Y (projection onto S_m)
- (Kernel) ridge regression, spline smoothing:
  \hat{F}_i = \hat{f}(x_i) with \hat{f} \in \arg\min_{f \in \mathcal{F}_K} \left\{ \frac{1}{n} \sum_{i=1}^n (Y_i - f(x_i))^2 + \lambda \| f \|_{\mathcal{F}_K}^2 \right\}
  ⇒ \hat{F}_{\lambda, K} = K (K + n \lambda I)^{-1} Y where K = (K(x_i, x_j))_{1 \le i, j \le n}
- k-nearest neighbours
- Nadaraya-Watson estimators

In all cases, \hat{F} = A Y where A does not depend on Y.
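A minimal sketch of the kernel ridge smoother matrix A = K(K + nλI)^{-1} (Gaussian kernel and all parameters are illustrative, not the authors' code):

```python
import numpy as np

# Kernel ridge regression as a linear estimator F_hat = A Y.
rng = np.random.default_rng(2)
n, lam = 100, 0.1
x = np.sort(rng.uniform(size=n))
K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * 0.1**2))   # Gaussian kernel matrix
A = K @ np.linalg.inv(K + n * lam * np.eye(n))             # smoother matrix

Y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)
F_hat = A @ Y          # A does not depend on Y: a linear estimator
```

Since K and (K + nλI)^{-1} commute, A is symmetric with eigenvalues in (0, 1), so tr(A) lies strictly between 0 and n.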
Estimator selection: regular regressograms

[Figures: four regressogram fits on the same data]
Estimator selection: k nearest neighbours

[Figures: four k-NN fits on the same data]
Estimator selection: Nadaraya-Watson

[Figures: four Nadaraya-Watson fits on the same data]
Estimator selection

Estimator collection (\hat{F}_m)_{m \in \mathcal{M}} ⇒ \hat{m}(Y)? Example: \hat{F}_m = A_m Y.
Goal: minimize the risk, i.e., prove an oracle inequality (in expectation or with a large probability):
\frac{1}{n} \| \hat{F}_{\hat{m}} - F \|^2 \le C \inf_{m \in \mathcal{M}} \left\{ \frac{1}{n} \| \hat{F}_m - F \|^2 \right\} + R_n
Examples:
- model selection
- calibration (choosing k or the distance for k-NN, choice of a regularization parameter, choice of a kernel, etc.)
- choice between methods different in nature (e.g., k-NN vs. smoothing splines?)
Bias-variance trade-off

\mathbb{E}\left[ \frac{1}{n} \| \hat{F}_m - F \|^2 \right] = \text{Bias} + \text{Variance}
- Bias (approximation error): \frac{1}{n} \| F_m - F \|^2 = \frac{1}{n} \| A_m F - F \|^2
- Variance (estimation error): \frac{\sigma^2 \operatorname{tr}(A_m^\top A_m)}{n}; OLS: \frac{\sigma^2 \dim(S_m)}{n}

Bias-variance trade-off ⇔ avoid overfitting and underfitting.
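This decomposition can be checked by Monte Carlo for a fixed smoother matrix (a sketch with illustrative data; A here is a kernel ridge smoother):

```python
import numpy as np

# Check E[(1/n)||A Y - F||^2] = (1/n)||A F - F||^2 + sigma^2 tr(A^T A)/n
# by averaging over many noise draws (toy example, all parameters illustrative).
rng = np.random.default_rng(7)
n, sigma = 30, 0.5
x = np.sort(rng.uniform(size=n))
K = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.3)
A = K @ np.linalg.inv(K + n * 0.05 * np.eye(n))   # a fixed linear smoother
F = np.cos(2 * np.pi * x)

bias = np.sum((A @ F - F)**2) / n                       # (1/n)||A F - F||^2
var = sigma**2 * np.trace(A.T @ A) / n                  # sigma^2 tr(A^T A)/n

risks = [np.sum((A @ (F + sigma * rng.normal(size=n)) - F)**2) / n
         for _ in range(20000)]
mc = np.mean(risks)    # Monte Carlo estimate of the expected risk
```

The Monte Carlo average matches bias + variance up to simulation error.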
Why should the empirical risk be penalized?

[Figure: as a function of the model dimension, the bias, the excess risk, and E[empirical risk]]
Penalization

\hat{m} \in \arg\min_{m \in \mathcal{M}} \left\{ \frac{1}{n} \| \hat{F}_m - Y \|^2 + \operatorname{pen}(m) \right\}
Ideal penalty:
\operatorname{pen}_{\mathrm{id}}(m) := \frac{1}{n} \| \hat{F}_m - F \|^2 - \frac{1}{n} \| \hat{F}_m - Y \|^2 = \text{Risk} - \text{Empirical risk}
Mallows' heuristic: \operatorname{pen}(m) \approx \mathbb{E}[\operatorname{pen}_{\mathrm{id}}(m)]
⇒ oracle inequality if \operatorname{Card}(\mathcal{M}) is not too large (+ concentration inequalities)
⇒ OLS: C_p: 2 \sigma^2 D_m / n (Mallows, 1973)
⇒ Linear estimators: C_L: 2 \sigma^2 \operatorname{tr}(A_m) / n (Mallows, 1973)
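Mallows' C_p rule can be sketched for nested OLS models as follows (toy data with σ² assumed known; the design, coefficients, and function name are all illustrative):

```python
import numpy as np

# Mallows' C_p for nested OLS models: empirical risk + 2 sigma^2 D / n.
rng = np.random.default_rng(3)
n, sigma = 200, 0.5
X = rng.normal(size=(n, 20))
F = X[:, :4] @ np.array([2.0, -1.0, 1.5, 0.8])    # true signal uses 4 variables
Y = F + sigma * rng.normal(size=n)

def cp_criterion(D):
    """C_p criterion of the OLS fit on the first D columns of X."""
    XD = X[:, :D]
    resid = Y - XD @ np.linalg.lstsq(XD, Y, rcond=None)[0]
    return resid @ resid / n + 2 * sigma**2 * D / n

D_hat = min(range(1, 21), key=cp_criterion)
```

Dropping one of the four informative variables raises the empirical risk far more than the penalty saves, so the selected dimension is at least 4.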
Oracle inequality

Theorem (Birgé & Massart 2007, A. & Bach 2009–2011)
Assumptions:
- \operatorname{pen}(m) = \frac{2 C \operatorname{tr}(A_m)}{n} with \left| C \sigma^{-2} - 1 \right| \le L_0 \sqrt{\ln(n)/n}
- ∀ m \in \mathcal{M}, ||| A_m ||| \le L_1 and \operatorname{tr}(A_m^\top A_m) \le \operatorname{tr}(A_m) \le n
- Gaussian noise: \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
Then, with probability at least 1 - 3 \operatorname{Card}(\mathcal{M}) n^{-\delta}, if n \ge n_0, for every \eta \in (0, 1),
\frac{1}{n} \| \hat{F}_{\hat{m}} - F \|^2 \le (1 + \eta) \inf_{m \in \mathcal{M}} \left\{ \frac{1}{n} \| \hat{F}_m - F \|^2 \right\} + \frac{K(\delta, L_0, L_1) \ln(n) \sigma^2}{\eta n}

Note: ridge, λ ∈ [0, +∞] ⇔ \operatorname{Card}(\mathcal{M}) ∝ n
Motivation (1): penalties known up to a constant factor

Ex.: \operatorname{pen}_{C_p}(m) = \frac{2 \sigma^2 D_m}{n}, \quad \operatorname{pen}_{C_L}(m) = \frac{2 \sigma^2 \operatorname{tr}(A_m)}{n}
Classical estimators of σ²:
- \hat{\sigma}^2_{m_0} := \| Y - \hat{F}_{m_0} \|^2 / (n - D_{m_0}) (OLS); problem: choice of m_0?
- \hat{\sigma}^2_m ⇒ FPE (Akaike, 1970) and GCV (Craven & Wahba, 1979); problem: avoiding the largest models
Goals: estimation of σ² for model selection, under minimal assumptions, without overfitting.
Motivation (2): "L-curve" and elbow heuristics?

[Figure: empirical and generalization errors as a function of the dimension]
\mathbb{E}[\text{Empirical risk}] + C \times \sigma^2 D_m n^{-1} (OLS), for C ∈ {0, 0.8, 0.9, 1.1, 1.2, 2}

[Figures: \mathbb{E}[\text{empirical error} + C \operatorname{pen}_{\min}] as a function of the dimension, one plot per value of C]
OLS: Dimension jump

[Figure: dimension of \hat{m}(C) as a function of C / \sigma^2]
OLS: algorithm (Birgé & Massart 2007)

1. For every C > 0, compute \hat{m}(C) \in \arg\min_{m \in \mathcal{M}} \left\{ \frac{1}{n} \| \hat{F}_m - Y \|^2 + \frac{C D_m}{n} \right\}
2. Find \hat{C}_{\min} such that D_{\hat{m}(C)} is "very large" when C < \hat{C}_{\min} and "reasonably small" when C > \hat{C}_{\min}
3. Select \hat{m} = \hat{m}(2 \hat{C}_{\min})

Practical use: Capushe package (Baudry, Maugis & Michel, 2010)
http://www.math.univ-toulouse.fr/~maugis/CAPUSHE.html
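The dimension-jump step can be sketched as follows — a deliberately idealized toy where the empirical risk curve is replaced by its zero-bias expectation σ²(n − D)/n, so the jump lands exactly at σ²; the Capushe package implements much more careful jump-localization rules:

```python
import numpy as np

# Toy dimension jump: track D_hat(C) on a grid of C and locate the largest drop.
n, sigma = 500, 1.0
dims = np.arange(1, n // 2)
# Idealized E[empirical risk] of a zero-bias model of dimension D (an assumption).
emp_risk = sigma**2 * (n - dims) / n

def D_hat(C):
    """Dimension selected by empirical risk + C * D / n."""
    return dims[np.argmin(emp_risk + C * dims / n)]

grid = np.linspace(0.01, 3.0, 300)
selected = np.array([D_hat(C) for C in grid])
jump = np.argmax(selected[:-1] - selected[1:])   # largest drop of the dimension
C_min = grid[jump + 1]                           # estimate of the minimal constant
```

In this idealized setting the selected dimension collapses from its maximum to 1 at C = σ², and 2 × C_min recovers Mallows' constant.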
Practical qualities of the algorithm

- visual checking of the existence of a jump
- calibration independent of the choice of some m_0
- too strong overfitting almost impossible
- one remaining parameter: how to localize the jump
Theorem (1): Dimension jump / Minimal penalty

Theorem (Birgé & Massart 2007, A. & Bach 2009–2011)
Assumptions:
- ∃ m_0 \in \mathcal{M}, D_{m_0} \le \sqrt{n} and \frac{1}{n} \| F_{m_0} - F \|^2 \le \sigma^2 \sqrt{\ln(n)/n}
- Gaussian noise: \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
Then, with probability at least 1 - 3 \operatorname{Card}(\mathcal{M}) n^{-\delta}, if n \ge n_0(\delta),
∀ C < \left( 1 - L_3 \delta \sqrt{\ln(n)/n} \right) \sigma^2, \quad D_{\hat{m}(C)} \ge \frac{n}{3}
∀ C > \left( 1 + L_3 \delta \sqrt{\ln(n)/n} \right) \sigma^2, \quad D_{\hat{m}(C)} \le \frac{n}{10}
and in the first case, \frac{1}{n} \| \hat{F}_{\hat{m}(C)} - F \|^2 \gg \inf_{m \in \mathcal{M}} \left\{ \frac{1}{n} \| \hat{F}_m - F \|^2 \right\}.
Theorem (2): Oracle inequality

Theorem (Birgé & Massart 2007, A. & Bach 2009–2011)
Assumptions:
- \hat{m} \in \arg\min_{m \in \mathcal{M}} \left\{ \frac{1}{n} \| \hat{F}_m - Y \|^2 + \frac{2 \hat{C}_{\min} D_m}{n} \right\}
- ∃ m_0 \in \mathcal{M}, D_{m_0} \le \sqrt{n} and \frac{1}{n} \| F_{m_0} - F \|^2 \le \sigma^2 \sqrt{\ln(n)/n}
- Gaussian noise: \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
Then, with probability at least 1 - 3 \operatorname{Card}(\mathcal{M}) n^{-\delta}, if n \ge n_0(\delta), for every \eta > 0,
\frac{1}{n} \| \hat{F}_{\hat{m}} - F \|^2 \le (1 + \eta) \inf_{m \in \mathcal{M}} \left\{ \frac{1}{n} \| \hat{F}_m - F \|^2 \right\} + \frac{K(\delta) \ln(n) \sigma^2}{\eta n}
Generalization of minimal penalties to linear estimators?

OLS:
- \operatorname{pen}_{C_p}(m) = \frac{2 \sigma^2 D_m}{n}
- \arg\min_{m \in \mathcal{M}} \left\{ \frac{1}{n} \| \hat{F}_m - Y \|^2 + \frac{C D_m}{n} \right\}
- ⇒ D_{\hat{m}(C)} "jumps" at \hat{C}_{\min} \approx \sigma^2
- ⇒ optimal choice with \hat{m}(2 \hat{C}_{\min})

Linear estimators:
- \operatorname{pen}_{C_L}(m) = \frac{2 \sigma^2 \operatorname{tr}(A_m)}{n}
- \arg\min_{m \in \mathcal{M}} \left\{ \frac{1}{n} \| \hat{F}_m - Y \|^2 + \frac{C \operatorname{tr}(A_m)}{n} \right\}
- Does \operatorname{tr}(A_{\hat{m}(C)}) jump at \hat{C}_{\min} \approx \sigma^2?
- Optimal choice with \hat{m}(2 \hat{C}_{\min})?
No dimension jump with a penalty ∝ tr(A_m)

[Figure: \operatorname{tr}(A_\lambda(C)) as a function of \log(C / \sigma^2), with the optimal penalty / 2]
Minimal penalties for linear estimators

\mathbb{E}\left[ \frac{1}{n} \| \hat{F}_m - F \|^2 \right] = \frac{1}{n} \| (I - A_m) F \|^2 + \frac{\operatorname{tr}(A_m^\top A_m) \sigma^2}{n} = \text{bias} + \text{variance}
\mathbb{E}\left[ \frac{1}{n} \| \hat{F}_m - Y \|^2 \right] = \sigma^2 + \frac{1}{n} \| (I - A_m) F \|^2 - \left( 2 \operatorname{tr}(A_m) - \operatorname{tr}(A_m^\top A_m) \right) \frac{\sigma^2}{n}
⇒ optimal penalty \left( 2 \operatorname{tr}(A_m) \right) \frac{\sigma^2}{n}
⇒ minimal penalty \left( 2 \operatorname{tr}(A_m) - \operatorname{tr}(A_m^\top A_m) \right) \frac{\sigma^2}{n}
\hat{m}(C) \in \arg\min_{m \in \mathcal{M}} \left\{ \frac{1}{n} \| \hat{F}_m - Y \|^2 + C \times \frac{2 \operatorname{tr}(A_m) - \operatorname{tr}(A_m^\top A_m)}{n} \right\}
"Dimension" jump (ridge regression)

[Figure: \operatorname{tr}(A_\lambda(C)) as a function of \log(C / \sigma^2), with the minimal penalty and the optimal penalty / 2]
Penalty calibration algorithm (A. & Bach 2009)

1. For every C > 0, compute \hat{m}_{\min}(C) \in \arg\min_{m \in \mathcal{M}} \left\{ \frac{1}{n} \| Y - \hat{F}_m \|^2 + C \, \frac{2 \operatorname{tr}(A_m) - \operatorname{tr}(A_m^\top A_m)}{n} \right\}
2. Find \hat{C}_{\min} such that \operatorname{tr}(A_{\hat{m}_{\min}(C)}) is "too large" when C < \hat{C}_{\min} and "reasonably small" when C > \hat{C}_{\min}
3. Select \hat{m} \in \arg\min_{m \in \mathcal{M}} \left\{ \frac{1}{n} \| Y - \hat{F}_m \|^2 + \frac{2 \hat{C}_{\min} \operatorname{tr}(A_m)}{n} \right\}

⇒ \left| \hat{C}_{\min} \sigma^{-2} - 1 \right| \le L_4 \sqrt{\ln(n)/n} and oracle inequality (same assumptions as before).
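The three steps above can be sketched for a family of kernel ridge smoothers as follows (illustrative data and a crude grid-based jump localization; not the authors' implementation):

```python
import numpy as np

# Penalty calibration for ridge smoothers A_lambda = K (K + n*lambda*I)^{-1}.
rng = np.random.default_rng(5)
n, sigma = 100, 1.0
x = np.sort(rng.uniform(size=n))
K = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.2)
F = np.sin(2 * np.pi * x)
Y = F + sigma * rng.normal(size=n)

lambdas = np.logspace(-6, 2, 60)
A = [K @ np.linalg.inv(K + lam * n * np.eye(n)) for lam in lambdas]
emp = np.array([np.sum((Y - a @ Y)**2) / n for a in A])   # empirical risks
tr1 = np.array([np.trace(a) for a in A])                  # tr(A_m)
tr2 = np.array([np.trace(a @ a.T) for a in A])            # tr(A_m^T A_m)
pen_min_shape = (2 * tr1 - tr2) / n                       # minimal-penalty shape

def tr_selected(C):
    """Step 1: tr(A) of the minimizer of emp + C * minimal-penalty shape."""
    return tr1[np.argmin(emp + C * pen_min_shape)]

# Step 2: crude jump localization on a grid of C.
grid = np.linspace(0.05, 5.0, 200)
traces = np.array([tr_selected(C) for C in grid])
C_min = grid[np.argmax(traces[:-1] - traces[1:]) + 1]

# Step 3: select with the optimal penalty 2 * C_min * tr(A_m) / n.
m_hat = np.argmin(emp + 2 * C_min * tr1 / n)
```

Small C selects a nearly interpolating smoother (large trace), large C a heavily regularized one (small trace), and the jump in between estimates σ².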
Comparison with least-squares

Linear estimators:
\operatorname{pen}_{\min}(m) = \sigma^2 \, \frac{2 \operatorname{tr}(A_m) - \operatorname{tr}(A_m^\top A_m)}{n}, \quad \operatorname{pen}_{\mathrm{opt}}(m) = \sigma^2 \, \frac{2 \operatorname{tr}(A_m)}{n}
\frac{\operatorname{pen}_{\mathrm{opt}}(m)}{\operatorname{pen}_{\min}(m)} = \frac{2 \operatorname{tr}(A_m)}{2 \operatorname{tr}(A_m) - \operatorname{tr}(A_m^\top A_m)} \in (1, 2]
Least-squares case: A_m^\top A_m = A_m ⇒ \frac{\operatorname{pen}_{\mathrm{opt}}(m)}{\operatorname{pen}_{\min}(m)} = 2 ⇒ slope heuristics
The k-nearest neighbours case

∀ i, j \in \{1, \dots, n\}, \quad A_{i,j} \in \left\{ 0, \frac{1}{k} \right\}
∀ i \in \{1, \dots, n\}, \quad A_{i,i} = \frac{1}{k} \quad \text{and} \quad \sum_{j=1}^n A_{i,j} = 1
⇒ \operatorname{tr}(A) = \frac{n}{k} = \operatorname{tr}(A^\top A)
⇒ \operatorname{pen}_{\mathrm{opt}} = 2 \operatorname{pen}_{\min}
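This computation is easy to check numerically; a sketch with an illustrative k-NN smoother matrix where each point counts itself among its own k nearest neighbours:

```python
import numpy as np

# Build the k-NN smoother matrix A: A_ij = 1/k if x_j is one of the k nearest
# neighbours of x_i (including x_i itself), so A_ii = 1/k and rows sum to 1.
rng = np.random.default_rng(6)
n, k = 60, 7
x = rng.uniform(size=n)                              # distinct points a.s.
d = np.abs(x[:, None] - x[None, :])                  # pairwise distances
A = np.zeros((n, n))
for i in range(n):
    A[i, np.argsort(d[i])[:k]] = 1.0 / k             # k nearest (incl. self)
```

Every entry is 0 or 1/k with exactly k nonzeros per row, so tr(A) = n/k and tr(A^T A) = n k (1/k)^2 = n/k, as on the slide.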
Simulation study (ridge regression, choice of λ)

[Figure: mean(error / oracle error) as a function of \log(n), for 10-fold CV, GCV, and the minimal-penalty algorithm]
Multi-task regression

We want to solve p ≥ 2 regression problems simultaneously.
Observations: Y_1, \dots, Y_n \in \mathbb{R}^p,
Y_i^j = F_i^j + \varepsilon_i^j, \quad j = 1, \dots, p (e.g., F_i^j = F^j(x_i)), with (\varepsilon_i)_{1 \le i \le n} i.i.d. \mathcal{N}(0, \Sigma), \Sigma \in \mathcal{S}_p^+(\mathbb{R})
Implicit assumption: the p problems are "similar".
Least-squares loss of a predictor t \in \mathbb{R}^{np} ("t_i^j = t^j(x_i)"):
\frac{1}{np} \| t - F \|^2 = \frac{1}{np} \sum_{i=1}^n \sum_{j=1}^p \left( t_i^j - F_i^j \right)^2
⇒ Estimator \hat{F}(Y_1, \dots, Y_n) \in \mathbb{R}^{np}?
Ridge multi-task regression

\hat{F} = (\hat{F}_i^j)_{1 \le i \le n, 1 \le j \le p} with \hat{F}_i^j = \hat{f}^j(x_i) and \hat{f} defined by:
If we consider the tasks separately:
\arg\min_{f \in \mathcal{F}_K^p} \left\{ \frac{1}{np} \sum_{i=1}^n \sum_{j=1}^p \left( Y_i^j - f^j(x_i) \right)^2 + \sum_{j=1}^p \lambda_j \| f^j \|_{\mathcal{F}_K}^2 \right\}
A possible multi-task approach (Evgeniou et al., 2005):
\arg\min_{f \in \mathcal{F}_K^p} \left\{ \frac{1}{np} \sum_{i=1}^n \sum_{j=1}^p \left( Y_i^j - f^j(x_i) \right)^2 + \lambda \sum_{j=1}^p \| f^j \|_{\mathcal{F}_K}^2 + \mu \sum_{j \ne \ell} \| f^j - f^\ell \|_{\mathcal{F}_K}^2 \right\}
More generally, for M \in \mathcal{S}_p^+(\mathbb{R}):
\arg\min_{f \in \mathcal{F}_K^p} \left\{ \frac{1}{np} \sum_{i=1}^n \sum_{j=1}^p \left( Y_i^j - f^j(x_i) \right)^2 + \sum_{j, \ell} M_{j,\ell} \left\langle f^j, f^\ell \right\rangle_{\mathcal{F}_K} \right\}
Multi-task estimator selection

⇒ Estimator collection (\hat{F}_M)_{M \in \mathcal{M}}, \mathcal{M} \subset \mathcal{S}_p^+(\mathbb{R}), with \hat{F}_M = A_M Y and
A_M = \left( M^{-1} \otimes K \right) \left( M^{-1} \otimes K + np \, I_{np} \right)^{-1}
Goal: select \hat{M} \in \mathcal{M} such that \frac{1}{np} \| \hat{F}_{\hat{M}} - F \|^2 is minimal.
Expectation of the ideal penalty:
\mathbb{E}[\operatorname{pen}_{\mathrm{id}}(M)] = \frac{2}{np} \operatorname{tr}\left( A_M (\Sigma \otimes I_n) \right)
Problem: how to estimate Σ?
Estimating the covariance matrix: idea (p = 2)

[Figures, built over three overlays: the covariance Σ for p = 2, the variances σ_1, σ_2, σ_3 in three directions, their estimates \hat{c}_1 σ, \hat{c}_2 σ, \hat{c}_3 σ, and the resulting estimate \hat{Σ}]
Estimating the covariance matrix: algorithm

- For every j ∈ {1, \dots, p}, apply the "minimal penalties" algorithm to the data set (Y_i^j)_{1 \le i \le n} ⇒ estimator a(e_j) of \Sigma_{j,j}
- For every j ≠ ℓ ∈ {1, \dots, p}, apply the "minimal penalties" algorithm to the data set (Y_i^j + Y_i^\ell)_{1 \le i \le n} ⇒ estimator a(e_j + e_\ell) of \Sigma_{j,j} + \Sigma_{\ell,\ell} + 2 \Sigma_{j,\ell}
- Recover an estimator \hat{\Sigma} of Σ:
  \hat{\Sigma} = J\left( a(e_1), \dots, a(e_p), a(e_1 + e_2), \dots, a(e_{p-1} + e_p) \right)
  where J is the unique linear map \mathbb{R}^{p(p+1)/2} \to \mathcal{S}_p(\mathbb{R}) such that
  \Sigma = J\left( \Sigma_{1,1}, \dots, \Sigma_{p,p}, \Sigma_{1,1} + \Sigma_{2,2} + 2 \Sigma_{1,2}, \dots, \Sigma_{p-1,p-1} + \Sigma_{p,p} + 2 \Sigma_{p-1,p} \right)
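The map J amounts to a polarization identity: \Sigma_{j,\ell} = \left( a(e_j + e_\ell) - a(e_j) - a(e_\ell) \right) / 2. A minimal sketch (the function name and inputs are illustrative):

```python
import numpy as np

def recover_sigma(diag, sums):
    """Recover Sigma from variance estimates along the directions e_j and e_j + e_l.

    diag[j] estimates Sigma_jj; sums[(j, l)] estimates Var(Y^j + Y^l)
    = Sigma_jj + Sigma_ll + 2 Sigma_jl. Off-diagonals follow by polarization.
    """
    p = len(diag)
    S = np.diag(np.asarray(diag, dtype=float))
    for (j, l), v in sums.items():
        S[j, l] = S[l, j] = (v - diag[j] - diag[l]) / 2.0
    return S

# Usage on exact inputs (no estimation noise), p = 2
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
diag = [Sigma[0, 0], Sigma[1, 1]]
sums = {(0, 1): Sigma[0, 0] + Sigma[1, 1] + 2 * Sigma[0, 1]}
assert np.allclose(recover_sigma(diag, sums), Sigma)
```

With exact inputs the recovery is exact; with the p(p + 1)/2 minimal-penalty estimates it inherits their accuracy, as quantified by the theorem below.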
Theorem: Estimating the covariance matrix

Theorem (Solnon, A. & Bach, 2011)
If for every j = 1, \dots, p, some \lambda_j > 0 exists such that \operatorname{tr}(A_{\lambda_j}) \le \sqrt{n} and
\frac{1}{n} \left\| (I_n - A_{\lambda_j}) F^j \right\|^2 \le \Sigma_{j,j} \sqrt{\frac{\ln(n)}{n}}, \quad \text{where } A_{\lambda_j} = K (K + n \lambda_j I_n)^{-1},
then, with probability at least 1 - L_5 p^2 n^{-\delta}, if n \ge n_0(\delta),
(1 - \eta) \Sigma \preceq \hat{\Sigma} \preceq (1 + \eta) \Sigma \quad \text{with} \quad \eta := L (2 + \delta) c(\Sigma)^2 p \sqrt{\frac{\ln(n)}{n}},
where c(\Sigma) = \max(\operatorname{Sp}(\Sigma)) / \min(\operatorname{Sp}(\Sigma)).
⇒ sufficient condition for consistency