Introduction Lin. estim. selection Penalty calibration (OLS) Linear estimators Multi-task Conclusion
Data-driven calibration of linear estimators with minimal penalties, with an application to multi-task regression

Sylvain Arlot (joint works with F. Bach & M. Solnon)
1. CNRS
2. École Normale Supérieure (Paris), LIENS, Équipe SIERRA

Cambridge, November 4th, 2011
Regression: data (x_1, Y_1), \dots, (x_n, Y_n)

[Figure: scatter plot of the data points (x_i, Y_i)]
Goal: find the signal (denoising)

[Figure: the same data, with the signal to be recovered]
Statistical framework: regression, least-squares loss

Observations: Y = (Y_1, \dots, Y_n) \in \mathbb{R}^n,
Y_i = F_i + \varepsilon_i (e.g., F_i = F(x_i)), with Y_i \in \mathbb{R} and (\varepsilon_i)_{1 \le i \le n} i.i.d.
Fixed design: the x_i are deterministic.
Least-squares loss of a predictor t \in \mathbb{R}^n ("t_i = t(x_i)"):
\frac{1}{n} \| t - F \|^2 = \frac{1}{n} \sum_{i=1}^n (t_i - F_i)^2
⇒ Estimator \hat{F}(Y) \in \mathbb{R}^n?
Estimators: example: regressogram

[Figure: regressogram fit on the data]
Least-squares estimators

Natural idea: minimize an estimator of the risk \frac{1}{n} \| t - F \|^2.
Least-squares criterion:
\frac{1}{n} \| t - Y \|^2 = \frac{1}{n} \sum_{i=1}^n (t_i - Y_i)^2
∀ t \in \mathbb{R}^n, \quad \mathbb{E}\left[ \frac{1}{n} \| t - Y \|^2 \right] = \frac{1}{n} \| t - F \|^2 + \frac{1}{n} \mathbb{E}\left[ \| \varepsilon \|^2 \right]
Model: S \subset \mathbb{R}^n ⇒ least-squares estimator on S:
\hat{F}_S \in \arg\min_{t \in S} \left\{ \frac{1}{n} \| t - Y \|^2 \right\} = \arg\min_{t \in S} \left\{ \frac{1}{n} \sum_{i=1}^n (t_i - Y_i)^2 \right\}
so that \hat{F}_S = \Pi_S(Y) (orthogonal projection).
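As a quick illustration of \hat{F}_S = \Pi_S(Y), here is a minimal numpy sketch (the design matrix, coefficients, and noise level are all illustrative, not from the talk):

```python
import numpy as np

# Least-squares estimator on S = span of the columns of X is the orthogonal
# projection Pi_S(Y) of the observations onto S.
rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 3))          # basis of the model S (hypothetical design)
F = X @ np.array([1.0, -2.0, 0.5])   # true signal, here inside S
Y = F + 0.3 * rng.normal(size=n)     # noisy observations

# Projection matrix Pi_S = X (X^T X)^{-1} X^T, then F_hat_S = Pi_S Y
Pi_S = X @ np.linalg.solve(X.T @ X, X.T)
F_hat = Pi_S @ Y

# Same result via a least-squares solve (numerically preferable)
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

The residual Y − \hat{F}_S is orthogonal to S, which is the defining property of the projection.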
Model examples

- Histograms on some partition Λ of X ⇒ the least-squares estimator (regressogram) can be written
  \hat{F}_m(x_i) = \sum_{\lambda \in \Lambda} \hat{\beta}_\lambda 1_{x_i \in \lambda}, \quad \hat{\beta}_\lambda = \frac{1}{\operatorname{Card}\{ x_i \in \lambda \}} \sum_{x_i \in \lambda} Y_i
- Subspace generated by a subset of an orthogonal basis of L^2(\mu) (Fourier, wavelets, and so on)
- Variable selection: x_i = (x_i^{(1)}, \dots, x_i^{(p)}) \in \mathbb{R}^p gathers p variables that can (linearly) explain Y_i:
  ∀ m \subset \{1, \dots, p\}, \quad S_m = \operatorname{vect}\{ x^{(j)} : j \in m \}.
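The regressogram formula above can be sketched as follows (a minimal illustration assuming a regular partition of [0, 1] into D bins; the function name and data are invented):

```python
import numpy as np

def regressogram(x, y, D):
    """Piecewise-constant least-squares fit on a regular D-bin partition of [0, 1]:
    beta_lambda is the mean of the Y_i whose x_i fall in bin lambda."""
    bins = np.minimum((x * D).astype(int), D - 1)      # bin index of each x_i
    beta = np.array([y[bins == b].mean() if np.any(bins == b) else 0.0
                     for b in range(D)])
    return beta[bins]                                   # F_hat_m evaluated at the x_i

rng = np.random.default_rng(1)
x = rng.uniform(size=200)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=200)
fit = regressogram(x, y, D=8)
```

Within each bin the fit is constant and equal to the local mean of the responses.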
k-nearest-neighbours estimator (k = 20)

[Figure: k-NN fit on the data]
Nadaraya-Watson estimator (σ = 0.01)

[Figure: Nadaraya-Watson fit on the data]
Linear estimators

- OLS: \hat{F}_m = \Pi_{S_m} Y (projection onto S_m)
- (Kernel) ridge regression, spline smoothing:
  \hat{F}_i = \hat{f}(x_i) with \hat{f} \in \arg\min_{f \in \mathcal{F}_K} \left\{ \frac{1}{n} \sum_{i=1}^n (Y_i - f(x_i))^2 + \lambda \| f \|_{\mathcal{F}_K}^2 \right\}
  ⇒ \hat{F}_{\lambda, K} = K (K + n \lambda I)^{-1} Y where K = (K(x_i, x_j))_{1 \le i, j \le n}
- k-nearest neighbours
- Nadaraya-Watson estimators

In all cases, \hat{F} = A Y where A does not depend on Y.
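A minimal sketch of the kernel ridge smoother matrix A = K(K + nλI)^{-1} (Gaussian kernel and all parameters are illustrative, not the authors' code):

```python
import numpy as np

# Kernel ridge regression as a linear estimator F_hat = A Y.
rng = np.random.default_rng(2)
n, lam = 100, 0.1
x = np.sort(rng.uniform(size=n))
K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * 0.1**2))   # Gaussian kernel matrix
A = K @ np.linalg.inv(K + n * lam * np.eye(n))             # smoother matrix

Y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)
F_hat = A @ Y          # A does not depend on Y: a linear estimator
```

Since K and (K + nλI)^{-1} commute, A is symmetric with eigenvalues in (0, 1), so tr(A) lies strictly between 0 and n.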
Estimator selection: regular regressograms

[Figures: four regressogram fits on the same data]
Estimator selection: k nearest neighbours

[Figures: four k-NN fits on the same data]
Estimator selection: Nadaraya-Watson

[Figures: four Nadaraya-Watson fits on the same data]
Estimator selection

Estimator collection (\hat{F}_m)_{m \in \mathcal{M}} ⇒ \hat{m}(Y)? Example: \hat{F}_m = A_m Y.
Goal: minimize the risk, i.e., prove an oracle inequality (in expectation or with a large probability):
\frac{1}{n} \| \hat{F}_{\hat{m}} - F \|^2 \le C \inf_{m \in \mathcal{M}} \left\{ \frac{1}{n} \| \hat{F}_m - F \|^2 \right\} + R_n
Examples:
- model selection
- calibration (choosing k or the distance for k-NN, choice of a regularization parameter, choice of a kernel, etc.)
- choice between methods different in nature (e.g., k-NN vs. smoothing splines?)
Bias-variance trade-off

\mathbb{E}\left[ \frac{1}{n} \| \hat{F}_m - F \|^2 \right] = \text{Bias} + \text{Variance}
- Bias (approximation error): \frac{1}{n} \| F_m - F \|^2 = \frac{1}{n} \| A_m F - F \|^2
- Variance (estimation error): \frac{\sigma^2 \operatorname{tr}(A_m^\top A_m)}{n}; OLS: \frac{\sigma^2 \dim(S_m)}{n}

Bias-variance trade-off ⇔ avoid overfitting and underfitting.
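This decomposition can be checked by Monte Carlo for a fixed smoother matrix (a sketch with illustrative data; A here is a kernel ridge smoother):

```python
import numpy as np

# Check E[(1/n)||A Y - F||^2] = (1/n)||A F - F||^2 + sigma^2 tr(A^T A)/n
# by averaging over many noise draws (toy example, all parameters illustrative).
rng = np.random.default_rng(7)
n, sigma = 30, 0.5
x = np.sort(rng.uniform(size=n))
K = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.3)
A = K @ np.linalg.inv(K + n * 0.05 * np.eye(n))   # a fixed linear smoother
F = np.cos(2 * np.pi * x)

bias = np.sum((A @ F - F)**2) / n                       # (1/n)||A F - F||^2
var = sigma**2 * np.trace(A.T @ A) / n                  # sigma^2 tr(A^T A)/n

risks = [np.sum((A @ (F + sigma * rng.normal(size=n)) - F)**2) / n
         for _ in range(20000)]
mc = np.mean(risks)    # Monte Carlo estimate of the expected risk
```

The Monte Carlo average matches bias + variance up to simulation error.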
Why should the empirical risk be penalized?

[Figure: as a function of the model dimension, the bias, the excess risk, and E[empirical risk]]
Penalization

\hat{m} \in \arg\min_{m \in \mathcal{M}} \left\{ \frac{1}{n} \| \hat{F}_m - Y \|^2 + \operatorname{pen}(m) \right\}
Ideal penalty:
\operatorname{pen}_{\mathrm{id}}(m) := \frac{1}{n} \| \hat{F}_m - F \|^2 - \frac{1}{n} \| \hat{F}_m - Y \|^2 = \text{Risk} - \text{Empirical risk}
Mallows' heuristic: \operatorname{pen}(m) \approx \mathbb{E}[\operatorname{pen}_{\mathrm{id}}(m)]
⇒ oracle inequality if \operatorname{Card}(\mathcal{M}) is not too large (+ concentration inequalities)
⇒ OLS: C_p: 2 \sigma^2 D_m / n (Mallows, 1973)
⇒ Linear estimators: C_L: 2 \sigma^2 \operatorname{tr}(A_m) / n (Mallows, 1973)
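Mallows' C_p rule can be sketched for nested OLS models as follows (toy data with σ² assumed known; the design, coefficients, and function name are all illustrative):

```python
import numpy as np

# Mallows' C_p for nested OLS models: empirical risk + 2 sigma^2 D / n.
rng = np.random.default_rng(3)
n, sigma = 200, 0.5
X = rng.normal(size=(n, 20))
F = X[:, :4] @ np.array([2.0, -1.0, 1.5, 0.8])    # true signal uses 4 variables
Y = F + sigma * rng.normal(size=n)

def cp_criterion(D):
    """C_p criterion of the OLS fit on the first D columns of X."""
    XD = X[:, :D]
    resid = Y - XD @ np.linalg.lstsq(XD, Y, rcond=None)[0]
    return resid @ resid / n + 2 * sigma**2 * D / n

D_hat = min(range(1, 21), key=cp_criterion)
```

Dropping one of the four informative variables raises the empirical risk far more than the penalty saves, so the selected dimension is at least 4.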
Oracle inequality

Theorem (Birgé & Massart 2007, A. & Bach 2009–2011)
Assumptions:
- \operatorname{pen}(m) = \frac{2 C \operatorname{tr}(A_m)}{n} with \left| C \sigma^{-2} - 1 \right| \le L_0 \sqrt{\ln(n)/n}
- ∀ m \in \mathcal{M}, ||| A_m ||| \le L_1 and \operatorname{tr}(A_m^\top A_m) \le \operatorname{tr}(A_m) \le n
- Gaussian noise: \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
Then, with probability at least 1 - 3 \operatorname{Card}(\mathcal{M}) n^{-\delta}, if n \ge n_0, for every \eta \in (0, 1),
\frac{1}{n} \| \hat{F}_{\hat{m}} - F \|^2 \le (1 + \eta) \inf_{m \in \mathcal{M}} \left\{ \frac{1}{n} \| \hat{F}_m - F \|^2 \right\} + \frac{K(\delta, L_0, L_1) \ln(n) \sigma^2}{\eta n}

Note: ridge, λ ∈ [0, +∞] ⇔ \operatorname{Card}(\mathcal{M}) ∝ n
Motivation (1): penalties known up to a constant factor

Ex.: \operatorname{pen}_{C_p}(m) = \frac{2 \sigma^2 D_m}{n}, \quad \operatorname{pen}_{C_L}(m) = \frac{2 \sigma^2 \operatorname{tr}(A_m)}{n}
Classical estimators of σ²:
- \hat{\sigma}^2_{m_0} := \| Y - \hat{F}_{m_0} \|^2 / (n - D_{m_0}) (OLS); problem: choice of m_0?
- \hat{\sigma}^2_m ⇒ FPE (Akaike, 1970) and GCV (Craven & Wahba, 1979); problem: avoiding the largest models
Goals: estimation of σ² for model selection, under minimal assumptions, without overfitting.
Motivation (2): "L-curve" and elbow heuristics?

[Figure: empirical and generalization errors as a function of the dimension]
\mathbb{E}[\text{Empirical risk}] + C \times \sigma^2 D_m n^{-1} (OLS), for C ∈ {0, 0.8, 0.9, 1.1, 1.2, 2}

[Figures: \mathbb{E}[\text{empirical error} + C \operatorname{pen}_{\min}] as a function of the dimension, one plot per value of C]
OLS: Dimension jump

[Figure: dimension of \hat{m}(C) as a function of C / \sigma^2]
OLS: algorithm (Birgé & Massart 2007)

1. For every C > 0, compute \hat{m}(C) \in \arg\min_{m \in \mathcal{M}} \left\{ \frac{1}{n} \| \hat{F}_m - Y \|^2 + \frac{C D_m}{n} \right\}
2. Find \hat{C}_{\min} such that D_{\hat{m}(C)} is "very large" when C < \hat{C}_{\min} and "reasonably small" when C > \hat{C}_{\min}
3. Select \hat{m} = \hat{m}(2 \hat{C}_{\min})

Practical use: Capushe package (Baudry, Maugis & Michel, 2010)
http://www.math.univ-toulouse.fr/~maugis/CAPUSHE.html
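The dimension-jump step can be sketched as follows — a deliberately idealized toy where the empirical risk curve is replaced by its zero-bias expectation σ²(n − D)/n, so the jump lands exactly at σ²; the Capushe package implements much more careful jump-localization rules:

```python
import numpy as np

# Toy dimension jump: track D_hat(C) on a grid of C and locate the largest drop.
n, sigma = 500, 1.0
dims = np.arange(1, n // 2)
# Idealized E[empirical risk] of a zero-bias model of dimension D (an assumption).
emp_risk = sigma**2 * (n - dims) / n

def D_hat(C):
    """Dimension selected by empirical risk + C * D / n."""
    return dims[np.argmin(emp_risk + C * dims / n)]

grid = np.linspace(0.01, 3.0, 300)
selected = np.array([D_hat(C) for C in grid])
jump = np.argmax(selected[:-1] - selected[1:])   # largest drop of the dimension
C_min = grid[jump + 1]                           # estimate of the minimal constant
```

In this idealized setting the selected dimension collapses from its maximum to 1 at C = σ², and 2 × C_min recovers Mallows' constant.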
Practical qualities of the algorithm

- visual checking of the existence of a jump
- calibration independent of the choice of some m_0
- too strong overfitting almost impossible
- one remaining parameter: how to localize the jump
Theorem (1): Dimension jump / Minimal penalty

Theorem (Birgé & Massart 2007, A. & Bach 2009–2011)
Assumptions:
- ∃ m_0 \in \mathcal{M}, D_{m_0} \le \sqrt{n} and \frac{1}{n} \| F_{m_0} - F \|^2 \le \sigma^2 \sqrt{\ln(n)/n}
- Gaussian noise: \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
Then, with probability at least 1 - 3 \operatorname{Card}(\mathcal{M}) n^{-\delta}, if n \ge n_0(\delta),
∀ C < \left( 1 - L_3 \delta \sqrt{\ln(n)/n} \right) \sigma^2, \quad D_{\hat{m}(C)} \ge \frac{n}{3}
∀ C > \left( 1 + L_3 \delta \sqrt{\ln(n)/n} \right) \sigma^2, \quad D_{\hat{m}(C)} \le \frac{n}{10}
and in the first case, \frac{1}{n} \| \hat{F}_{\hat{m}(C)} - F \|^2 \gg \inf_{m \in \mathcal{M}} \left\{ \frac{1}{n} \| \hat{F}_m - F \|^2 \right\}.
Theorem (2): Oracle inequality

Theorem (Birgé & Massart 2007, A. & Bach 2009–2011)
Assumptions:
- \hat{m} \in \arg\min_{m \in \mathcal{M}} \left\{ \frac{1}{n} \| \hat{F}_m - Y \|^2 + \frac{2 \hat{C}_{\min} D_m}{n} \right\}
- ∃ m_0 \in \mathcal{M}, D_{m_0} \le \sqrt{n} and \frac{1}{n} \| F_{m_0} - F \|^2 \le \sigma^2 \sqrt{\ln(n)/n}
- Gaussian noise: \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
Then, with probability at least 1 - 3 \operatorname{Card}(\mathcal{M}) n^{-\delta}, if n \ge n_0(\delta), for every \eta > 0,
\frac{1}{n} \| \hat{F}_{\hat{m}} - F \|^2 \le (1 + \eta) \inf_{m \in \mathcal{M}} \left\{ \frac{1}{n} \| \hat{F}_m - F \|^2 \right\} + \frac{K(\delta) \ln(n) \sigma^2}{\eta n}
Generalization of minimal penalties to linear estimators?

OLS:
- \operatorname{pen}_{C_p}(m) = \frac{2 \sigma^2 D_m}{n}
- \arg\min_{m \in \mathcal{M}} \left\{ \frac{1}{n} \| \hat{F}_m - Y \|^2 + \frac{C D_m}{n} \right\}
- ⇒ D_{\hat{m}(C)} "jumps" at \hat{C}_{\min} \approx \sigma^2
- ⇒ optimal choice with \hat{m}(2 \hat{C}_{\min})

Linear estimators:
- \operatorname{pen}_{C_L}(m) = \frac{2 \sigma^2 \operatorname{tr}(A_m)}{n}
- \arg\min_{m \in \mathcal{M}} \left\{ \frac{1}{n} \| \hat{F}_m - Y \|^2 + \frac{C \operatorname{tr}(A_m)}{n} \right\}
- Does \operatorname{tr}(A_{\hat{m}(C)}) jump at \hat{C}_{\min} \approx \sigma^2?
- Optimal choice with \hat{m}(2 \hat{C}_{\min})?
No dimension jump with a penalty ∝ tr(A_m)

[Figure: \operatorname{tr}(A_\lambda(C)) as a function of \log(C / \sigma^2), with the optimal penalty / 2]
Minimal penalties for linear estimators

\mathbb{E}\left[ \frac{1}{n} \| \hat{F}_m - F \|^2 \right] = \frac{1}{n} \| (I - A_m) F \|^2 + \frac{\operatorname{tr}(A_m^\top A_m) \sigma^2}{n} = \text{bias} + \text{variance}
\mathbb{E}\left[ \frac{1}{n} \| \hat{F}_m - Y \|^2 \right] = \sigma^2 + \frac{1}{n} \| (I - A_m) F \|^2 - \left( 2 \operatorname{tr}(A_m) - \operatorname{tr}(A_m^\top A_m) \right) \frac{\sigma^2}{n}
⇒ optimal penalty \left( 2 \operatorname{tr}(A_m) \right) \frac{\sigma^2}{n}
⇒ minimal penalty \left( 2 \operatorname{tr}(A_m) - \operatorname{tr}(A_m^\top A_m) \right) \frac{\sigma^2}{n}
\hat{m}(C) \in \arg\min_{m \in \mathcal{M}} \left\{ \frac{1}{n} \| \hat{F}_m - Y \|^2 + C \times \frac{2 \operatorname{tr}(A_m) - \operatorname{tr}(A_m^\top A_m)}{n} \right\}
"Dimension" jump (ridge regression)

[Figure: \operatorname{tr}(A_\lambda(C)) as a function of \log(C / \sigma^2), with the minimal penalty and the optimal penalty / 2]
Penalty calibration algorithm (A. & Bach 2009)

1. For every C > 0, compute \hat{m}_{\min}(C) \in \arg\min_{m \in \mathcal{M}} \left\{ \frac{1}{n} \| Y - \hat{F}_m \|^2 + C \, \frac{2 \operatorname{tr}(A_m) - \operatorname{tr}(A_m^\top A_m)}{n} \right\}
2. Find \hat{C}_{\min} such that \operatorname{tr}(A_{\hat{m}_{\min}(C)}) is "too large" when C < \hat{C}_{\min} and "reasonably small" when C > \hat{C}_{\min}
3. Select \hat{m} \in \arg\min_{m \in \mathcal{M}} \left\{ \frac{1}{n} \| Y - \hat{F}_m \|^2 + \frac{2 \hat{C}_{\min} \operatorname{tr}(A_m)}{n} \right\}

⇒ \left| \hat{C}_{\min} \sigma^{-2} - 1 \right| \le L_4 \sqrt{\ln(n)/n} and oracle inequality (same assumptions as before).
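The three steps above can be sketched for a family of kernel ridge smoothers as follows (illustrative data and a crude grid-based jump localization; not the authors' implementation):

```python
import numpy as np

# Penalty calibration for ridge smoothers A_lambda = K (K + n*lambda*I)^{-1}.
rng = np.random.default_rng(5)
n, sigma = 100, 1.0
x = np.sort(rng.uniform(size=n))
K = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.2)
F = np.sin(2 * np.pi * x)
Y = F + sigma * rng.normal(size=n)

lambdas = np.logspace(-6, 2, 60)
A = [K @ np.linalg.inv(K + lam * n * np.eye(n)) for lam in lambdas]
emp = np.array([np.sum((Y - a @ Y)**2) / n for a in A])   # empirical risks
tr1 = np.array([np.trace(a) for a in A])                  # tr(A_m)
tr2 = np.array([np.trace(a @ a.T) for a in A])            # tr(A_m^T A_m)
pen_min_shape = (2 * tr1 - tr2) / n                       # minimal-penalty shape

def tr_selected(C):
    """Step 1: tr(A) of the minimizer of emp + C * minimal-penalty shape."""
    return tr1[np.argmin(emp + C * pen_min_shape)]

# Step 2: crude jump localization on a grid of C.
grid = np.linspace(0.05, 5.0, 200)
traces = np.array([tr_selected(C) for C in grid])
C_min = grid[np.argmax(traces[:-1] - traces[1:]) + 1]

# Step 3: select with the optimal penalty 2 * C_min * tr(A_m) / n.
m_hat = np.argmin(emp + 2 * C_min * tr1 / n)
```

Small C selects a nearly interpolating smoother (large trace), large C a heavily regularized one (small trace), and the jump in between estimates σ².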
Comparison with least-squares

Linear estimators:
\operatorname{pen}_{\min}(m) = \sigma^2 \, \frac{2 \operatorname{tr}(A_m) - \operatorname{tr}(A_m^\top A_m)}{n}, \quad \operatorname{pen}_{\mathrm{opt}}(m) = \sigma^2 \, \frac{2 \operatorname{tr}(A_m)}{n}
\frac{\operatorname{pen}_{\mathrm{opt}}(m)}{\operatorname{pen}_{\min}(m)} = \frac{2 \operatorname{tr}(A_m)}{2 \operatorname{tr}(A_m) - \operatorname{tr}(A_m^\top A_m)} \in (1, 2]
Least-squares case: A_m^\top A_m = A_m ⇒ \frac{\operatorname{pen}_{\mathrm{opt}}(m)}{\operatorname{pen}_{\min}(m)} = 2 ⇒ slope heuristics
The k-nearest neighbours case

∀ i, j \in \{1, \dots, n\}, \quad A_{i,j} \in \left\{ 0, \frac{1}{k} \right\}
∀ i \in \{1, \dots, n\}, \quad A_{i,i} = \frac{1}{k} \quad \text{and} \quad \sum_{j=1}^n A_{i,j} = 1
⇒ \operatorname{tr}(A) = \frac{n}{k} = \operatorname{tr}(A^\top A)
⇒ \operatorname{pen}_{\mathrm{opt}} = 2 \operatorname{pen}_{\min}
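This computation is easy to check numerically; a sketch with an illustrative k-NN smoother matrix where each point counts itself among its own k nearest neighbours:

```python
import numpy as np

# Build the k-NN smoother matrix A: A_ij = 1/k if x_j is one of the k nearest
# neighbours of x_i (including x_i itself), so A_ii = 1/k and rows sum to 1.
rng = np.random.default_rng(6)
n, k = 60, 7
x = rng.uniform(size=n)                              # distinct points a.s.
d = np.abs(x[:, None] - x[None, :])                  # pairwise distances
A = np.zeros((n, n))
for i in range(n):
    A[i, np.argsort(d[i])[:k]] = 1.0 / k             # k nearest (incl. self)
```

Every entry is 0 or 1/k with exactly k nonzeros per row, so tr(A) = n/k and tr(A^T A) = n k (1/k)^2 = n/k, as on the slide.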
Simulation study (ridge regression, choice of λ)

[Figure: mean(error / oracle error) as a function of \log(n), for 10-fold CV, GCV, and the minimal-penalty algorithm]
Multi-task regression

We want to solve p ≥ 2 regression problems simultaneously.
Observations: Y_1, \dots, Y_n \in \mathbb{R}^p,
Y_i^j = F_i^j + \varepsilon_i^j, \quad j = 1, \dots, p (e.g., F_i^j = F^j(x_i)), with (\varepsilon_i)_{1 \le i \le n} i.i.d. \mathcal{N}(0, \Sigma), \Sigma \in \mathcal{S}_p^+(\mathbb{R})
Implicit assumption: the p problems are "similar".
Least-squares loss of a predictor t \in \mathbb{R}^{np} ("t_i^j = t^j(x_i)"):
\frac{1}{np} \| t - F \|^2 = \frac{1}{np} \sum_{i=1}^n \sum_{j=1}^p \left( t_i^j - F_i^j \right)^2
⇒ Estimator \hat{F}(Y_1, \dots, Y_n) \in \mathbb{R}^{np}?
Ridge multi-task regression

\hat{F} = (\hat{F}_i^j)_{1 \le i \le n, 1 \le j \le p} with \hat{F}_i^j = \hat{f}^j(x_i) and \hat{f} defined by:
If we consider the tasks separately:
\arg\min_{f \in \mathcal{F}_K^p} \left\{ \frac{1}{np} \sum_{i=1}^n \sum_{j=1}^p \left( Y_i^j - f^j(x_i) \right)^2 + \sum_{j=1}^p \lambda_j \| f^j \|_{\mathcal{F}_K}^2 \right\}
A possible multi-task approach (Evgeniou et al., 2005):
\arg\min_{f \in \mathcal{F}_K^p} \left\{ \frac{1}{np} \sum_{i=1}^n \sum_{j=1}^p \left( Y_i^j - f^j(x_i) \right)^2 + \lambda \sum_{j=1}^p \| f^j \|_{\mathcal{F}_K}^2 + \mu \sum_{j \ne \ell} \| f^j - f^\ell \|_{\mathcal{F}_K}^2 \right\}
More generally, for M \in \mathcal{S}_p^+(\mathbb{R}):
\arg\min_{f \in \mathcal{F}_K^p} \left\{ \frac{1}{np} \sum_{i=1}^n \sum_{j=1}^p \left( Y_i^j - f^j(x_i) \right)^2 + \sum_{j, \ell} M_{j,\ell} \left\langle f^j, f^\ell \right\rangle_{\mathcal{F}_K} \right\}
Multi-task estimator selection

⇒ Estimator collection (\hat{F}_M)_{M \in \mathcal{M}}, \mathcal{M} \subset \mathcal{S}_p^+(\mathbb{R}), with \hat{F}_M = A_M Y and
A_M = \left( M^{-1} \otimes K \right) \left( M^{-1} \otimes K + np \, I_{np} \right)^{-1}
Goal: select \hat{M} \in \mathcal{M} such that \frac{1}{np} \| \hat{F}_{\hat{M}} - F \|^2 is minimal.
Expectation of the ideal penalty:
\mathbb{E}[\operatorname{pen}_{\mathrm{id}}(M)] = \frac{2}{np} \operatorname{tr}\left( A_M (\Sigma \otimes I_n) \right)
Problem: how to estimate Σ?
Estimating the covariance matrix: idea (p = 2)

[Figures, built over three overlays: the covariance Σ for p = 2, the variances σ_1, σ_2, σ_3 in three directions, their estimates \hat{c}_1 σ, \hat{c}_2 σ, \hat{c}_3 σ, and the resulting estimate \hat{Σ}]
Estimating the covariance matrix: algorithm

- For every j ∈ {1, \dots, p}, apply the "minimal penalties" algorithm to the data set (Y_i^j)_{1 \le i \le n} ⇒ estimator a(e_j) of \Sigma_{j,j}
- For every j ≠ ℓ ∈ {1, \dots, p}, apply the "minimal penalties" algorithm to the data set (Y_i^j + Y_i^\ell)_{1 \le i \le n} ⇒ estimator a(e_j + e_\ell) of \Sigma_{j,j} + \Sigma_{\ell,\ell} + 2 \Sigma_{j,\ell}
- Recover an estimator \hat{\Sigma} of Σ:
  \hat{\Sigma} = J\left( a(e_1), \dots, a(e_p), a(e_1 + e_2), \dots, a(e_{p-1} + e_p) \right)
  where J is the unique linear map \mathbb{R}^{p(p+1)/2} \to \mathcal{S}_p(\mathbb{R}) such that
  \Sigma = J\left( \Sigma_{1,1}, \dots, \Sigma_{p,p}, \Sigma_{1,1} + \Sigma_{2,2} + 2 \Sigma_{1,2}, \dots, \Sigma_{p-1,p-1} + \Sigma_{p,p} + 2 \Sigma_{p-1,p} \right)
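The map J amounts to a polarization identity: \Sigma_{j,\ell} = \left( a(e_j + e_\ell) - a(e_j) - a(e_\ell) \right) / 2. A minimal sketch (the function name and inputs are illustrative):

```python
import numpy as np

def recover_sigma(diag, sums):
    """Recover Sigma from variance estimates along the directions e_j and e_j + e_l.

    diag[j] estimates Sigma_jj; sums[(j, l)] estimates Var(Y^j + Y^l)
    = Sigma_jj + Sigma_ll + 2 Sigma_jl. Off-diagonals follow by polarization.
    """
    p = len(diag)
    S = np.diag(np.asarray(diag, dtype=float))
    for (j, l), v in sums.items():
        S[j, l] = S[l, j] = (v - diag[j] - diag[l]) / 2.0
    return S

# Usage on exact inputs (no estimation noise), p = 2
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
diag = [Sigma[0, 0], Sigma[1, 1]]
sums = {(0, 1): Sigma[0, 0] + Sigma[1, 1] + 2 * Sigma[0, 1]}
assert np.allclose(recover_sigma(diag, sums), Sigma)
```

With exact inputs the recovery is exact; with the p(p + 1)/2 minimal-penalty estimates it inherits their accuracy, as quantified by the theorem below.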
Theorem: Estimating the covariance matrix

Theorem (Solnon, A. & Bach, 2011)
If for every j = 1, \dots, p, some \lambda_j > 0 exists such that \operatorname{tr}(A_{\lambda_j}) \le \sqrt{n} and
\frac{1}{n} \left\| (I_n - A_{\lambda_j}) F^j \right\|^2 \le \Sigma_{j,j} \sqrt{\frac{\ln(n)}{n}}, \quad \text{where } A_{\lambda_j} = K (K + n \lambda_j I_n)^{-1},
then, with probability at least 1 - L_5 p^2 n^{-\delta}, if n \ge n_0(\delta),
(1 - \eta) \Sigma \preceq \hat{\Sigma} \preceq (1 + \eta) \Sigma \quad \text{with} \quad \eta := L (2 + \delta) c(\Sigma)^2 p \sqrt{\frac{\ln(n)}{n}},
where c(\Sigma) = \max(\operatorname{Sp}(\Sigma)) / \min(\operatorname{Sp}(\Sigma)).
⇒ sufficient condition for consistency