Beyond Least-Squares: Fast Rates for Regularized Empirical Risk Minimization through Self-Concordance

Ulysse Marteau-Ferey, June 27, 2019

Joint work with Dmitrii Ostrovskii, Francis Bach and Alessandro Rudi

INRIA Paris, ENS Paris (CS department), PSL Research University

(2)

Presentation of the problem

(3)

Learning Problem

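Setting: input $X$, output $Y \in \mathcal{Y}$

Linear predictor: $f(x) = \theta \cdot \Phi(x)$, with feature map $\Phi(x) \in \mathcal{H}$, $\mathcal{H}$ infinite dimensional

Problem: find

$$\theta^\star \in \arg\min_{\theta \in \mathcal{H}} L(\theta), \qquad L(\theta) = \mathbb{E}\big[\ell(Y, \theta \cdot \Phi(X))\big]$$

$\ell(\cdot,\cdot)$: loss function; the distribution of $(X, Y)$ is unknown; $n$ i.i.d. samples $(x_i, y_i)_{1 \le i \le n}$.

Basic assumption: $\mathcal{H}$ is a Hilbert space; $Y$ and $\Phi(X)$ are bounded.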

(4)

Regularized Empirical Risk minimization

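Problem

$$\theta^\star \in \arg\min_{\theta \in \mathcal{H}} L(\theta), \qquad L(\theta) = \mathbb{E}\big[\ell(Y, \theta \cdot \Phi(X))\big]$$

Classical estimator: regularized empirical risk minimizer

$$\hat\theta_\lambda = \arg\min_{\theta \in \mathcal{H}} \; \hat L(\theta) + \frac{\lambda}{2}\|\theta\|^2, \qquad \hat L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, \theta \cdot \Phi(x_i))$$

$\lambda$: regularization parameter → controls overfitting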
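The estimator above can be illustrated in finite dimension. The following is a minimal sketch, assuming the logistic loss, a two-dimensional identity feature map, synthetic data, and plain gradient descent; none of these choices come from the slides, and the helper names (`fit`, `objective`, `gradient`) are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(u):
    # numerically stable logistic function
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def logistic_loss(y, s):
    # ell(y, y') = log(1 + exp(-y y')), computed without overflow
    z = -y * s
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def objective(theta, xs, ys, lam):
    # empirical risk plus the penalty (lam / 2) * ||theta||^2
    risk = sum(logistic_loss(y, dot(theta, x)) for x, y in zip(xs, ys)) / len(xs)
    return risk + 0.5 * lam * dot(theta, theta)

def gradient(theta, xs, ys, lam):
    n = len(xs)
    g = [lam * t for t in theta]
    for x, y in zip(xs, ys):
        # d/ds log(1 + exp(-y s)) = -y * sigmoid(-y s)
        w = -y * sigmoid(-y * dot(theta, x)) / n
        for j, xj in enumerate(x):
            g[j] += w * xj
    return g

def fit(xs, ys, lam, steps=500):
    # the objective is (B^2 / 4 + lam)-smooth when ||Phi(x)|| <= B,
    # so a step of 1 / (B^2 / 4 + lam) is a safe choice for gradient descent
    bound_sq = max(dot(x, x) for x in xs)
    step = 1.0 / (bound_sq / 4.0 + lam)
    theta = [0.0] * len(xs[0])
    for _ in range(steps):
        g = gradient(theta, xs, ys, lam)
        theta = [t - step * gj for t, gj in zip(theta, g)]
    return theta

# Synthetic data (assumption, for illustration only): bounded inputs, labels in {-1, +1}
random.seed(0)
xs = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(200)]
ys = [1 if x[0] + 0.5 * x[1] + random.gauss(0, 0.3) > 0 else -1 for x in xs]
lam = 0.1
theta_hat = fit(xs, ys, lam)
print(theta_hat, objective(theta_hat, xs, ys, lam))
```

A larger `lam` shrinks `theta_hat` toward zero, which is the sense in which the regularization parameter controls overfitting.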

(5)

Existing results

(6)

A first general result : slow rates

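Assumption: $\ell(y,\cdot)$ is Lipschitz for every $y \in \mathcal{Y}$, with Lipschitz constant $R$.

Slow rates in $O(1/\sqrt{n})$ (Sridharan et al., 2009)

Bias-variance decomposition

$$L(\hat\theta_\lambda) - L(\theta^\star) \le \|\theta^\star\|^2 \lambda + \frac{R^2\|\Phi\|^2}{\lambda n}$$

$$L(\hat\theta_\lambda) - L(\theta^\star) \le \frac{C}{\sqrt{n}}, \qquad \lambda = \frac{c}{\sqrt{n}}$$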
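The $1/\sqrt{n}$ scaling of $\lambda$ comes from balancing the two terms of the bound. A short worked derivation, taking the bound on the slide as given; the explicit values of $c$ and $C$ below follow from this calculation and are not stated in the slides:

```latex
\[
g(\lambda) = \|\theta^\star\|^2 \lambda + \frac{R^2\|\Phi\|^2}{\lambda n},
\qquad
g'(\lambda) = \|\theta^\star\|^2 - \frac{R^2\|\Phi\|^2}{\lambda^2 n} = 0
\iff
\lambda = \frac{R\|\Phi\|}{\|\theta^\star\|}\cdot\frac{1}{\sqrt{n}},
\]
\[
g\!\left(\frac{R\|\Phi\|}{\|\theta^\star\|\sqrt{n}}\right)
= \frac{R\|\Phi\|\|\theta^\star\|}{\sqrt{n}} + \frac{R\|\Phi\|\|\theta^\star\|}{\sqrt{n}}
= \frac{2R\|\Phi\|\|\theta^\star\|}{\sqrt{n}} .
\]
```

Under this reading, $c = R\|\Phi\|/\|\theta^\star\|$ and $C = 2R\|\Phi\|\|\theta^\star\|$.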

(7)

Fast rates for Least-Squares

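Assumption: square loss $\ell(y,y') = \tfrac{1}{2}(y-y')^2$.

Covariance operator: $\Sigma = \mathbb{E}[\Phi(X)\otimes\Phi(X)]$, $\Sigma_\lambda = \Sigma + \lambda I$

Two main quantities

- $b_\lambda = \lambda^2\|\Sigma_\lambda^{-1/2}\theta^\star\|^2 \le \lambda\|\theta^\star\|^2$ → bias: regularity of $\theta^\star$
- $\mathrm{df}_\lambda = \mathrm{Tr}(\Sigma_\lambda^{-1}\Sigma) \le \|\Phi\|^2/\lambda$ → variance: effective dimension

Fast rates up to $O(1/n)$ (Caponnetto and De Vito, 2007)

Bias-variance decomposition

$$L(\hat\theta_\lambda) - L(\theta^\star) \le b_\lambda + \frac{\sigma^2\,\mathrm{df}_\lambda}{n},$$

where $\sigma^2$ is bounded in terms of $\|\theta^\star\|^2\|\Phi\|^2$ and $\|Y\|^2$.

Example: for $\mathrm{df}_\lambda \approx d$,

$$L(\hat\theta_\lambda) - L(\theta^\star) \le \frac{2\sigma^2 d}{n}, \qquad \lambda = \frac{\sigma^2 d}{\|\theta^\star\|^2}\cdot\frac{1}{n}$$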
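Both quantities are easy to compute once $\Sigma$ is diagonalized. The sketch below evaluates them in an eigenbasis and checks the two upper bounds quoted on the slide; the spectrum $\sigma_i = i^{-2}$ and the coefficients of $\theta^\star$ are hypothetical values chosen only for illustration, and $\mathrm{Tr}(\Sigma)$ stands in for the bound $\|\Phi\|^2$.

```python
def effective_dimension(eigenvalues, lam):
    # df_lambda = Tr(Sigma_lambda^{-1} Sigma) = sum_i sigma_i / (sigma_i + lam)
    return sum(s / (s + lam) for s in eigenvalues)

def bias(eigenvalues, coeffs, lam):
    # b_lambda = lam^2 * ||Sigma_lambda^{-1/2} theta*||^2, written in the eigenbasis of Sigma
    return lam ** 2 * sum(c * c / (s + lam) for s, c in zip(eigenvalues, coeffs))

# Hypothetical spectrum and coefficients of theta* (illustration only)
eigenvalues = [i ** -2.0 for i in range(1, 2001)]
coeffs = [1.0 / i for i in range(1, 2001)]
trace = sum(eigenvalues)              # Tr(Sigma) = E||Phi(X)||^2 <= ||Phi||^2
norm_sq = sum(c * c for c in coeffs)  # ||theta*||^2

results = []
for lam in (1e-1, 1e-2, 1e-3):
    df = effective_dimension(eigenvalues, lam)
    b = bias(eigenvalues, coeffs, lam)
    results.append((lam, df, trace / lam, b, lam * norm_sq))
    print(f"lam={lam:g}  df={df:.3f} <= {trace / lam:.3f}   b={b:.5f} <= {lam * norm_sq:.5f}")
```

As `lam` decreases the bias shrinks while the effective dimension grows, which is the trade-off resolved by the choice of $\lambda$ in the example.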

(17)

Interpretation of the key quantities

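Eigen-decomposition: $\Sigma = \sum_{i=0}^{+\infty} \sigma_i\, \psi_i \otimes \psi_i$ with $\sigma_i \searrow 0$, and $\theta^\star = \sum_{i=0}^{+\infty} \langle\theta^\star, \psi_i\rangle\, \psi_i$

$b_\lambda$ → bias: regularity of $\theta^\star$ w.r.t. $\Sigma$

$$b_\lambda \le L^2\lambda^{1+2r} \quad\leftrightarrow\quad \sum_{i=0}^{+\infty} \frac{\langle\theta^\star, \psi_i\rangle^2}{\sigma_i^{2r}} < \infty$$

$\mathrm{df}_\lambda$ → variance: eigenvalue decay of $\Sigma$

$$\mathrm{df}_\lambda \le Q^2\lambda^{-1/\alpha} \quad\leftrightarrow\quad \sigma_i = O(i^{-\alpha})$$

Fast rates (Caponnetto and De Vito, 2007)

$$L(\hat\theta_\lambda) - L(\theta^\star) \le C\, n^{-\gamma}, \qquad \lambda = c\, n^{-\beta}, \qquad \gamma \in [1/2, 1],$$

with $\gamma = \dfrac{\alpha(1+2r)}{\alpha(1+2r)+1}$, $\beta = \dfrac{\alpha}{\alpha(1+2r)+1}$, $c = (\sigma Q/L)^{2\beta}$ and $C = (\sigma^{\gamma} Q^{\gamma} L^{1-\gamma})^2$.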
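The exponents $\gamma$ and $\beta$ can be recovered by plugging the two bounds into the least-squares bias-variance decomposition and equating the two terms. A worked sketch, assuming the decomposition $b_\lambda + \sigma^2\,\mathrm{df}_\lambda/n$ from the earlier slide and ignoring numerical constants:

```latex
\begin{align*}
L(\hat\theta_\lambda) - L(\theta^\star)
  &\le b_\lambda + \frac{\sigma^2\,\mathrm{df}_\lambda}{n}
   \le L^2\lambda^{1+2r} + \frac{\sigma^2 Q^2 \lambda^{-1/\alpha}}{n}, \\
L^2\lambda^{1+2r} = \frac{\sigma^2 Q^2 \lambda^{-1/\alpha}}{n}
  &\iff \lambda^{\frac{\alpha(1+2r)+1}{\alpha}} = \frac{\sigma^2 Q^2}{L^2 n}
   \iff \lambda = \Big(\frac{\sigma Q}{L}\Big)^{2\beta} n^{-\beta},
   \qquad \beta = \frac{\alpha}{\alpha(1+2r)+1}, \\
L^2\lambda^{1+2r}
  &= L^2 \Big(\frac{\sigma Q}{L}\Big)^{2\gamma} n^{-\gamma}
   = \big(\sigma^{\gamma} Q^{\gamma} L^{1-\gamma}\big)^2 n^{-\gamma},
   \qquad \gamma = \beta(1+2r) = \frac{\alpha(1+2r)}{\alpha(1+2r)+1}.
\end{align*}
```

With this choice of $\lambda$ both terms are equal, so the excess risk is at most twice the last line. For $r \in [0, 1/2]$ and $\alpha \ge 1$ the exponent $\gamma$ ranges over $[1/2, 1)$, approaching $1$ as $\alpha \to \infty$.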

(18)

Our contribution

(19)

Generalized Self Concordant functions

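Regression: $\ell(y,y') = \psi(y-y')$

- Square loss: $\psi(t) = \tfrac{1}{2}t^2$
- Huber loss 1: $\psi(t) = \sqrt{1+t^2} - 1$
- Huber loss 2: $\psi(t) = \log\dfrac{e^{t}+e^{-t}}{2}$

Classification:

- Logistic loss: $\ell(y,y') = \log(1+e^{-yy'})$
- GLMs: $\ell(y,y') = -y'\cdot y + \log\int_{\mathcal{Y}} \exp(y'\cdot\tilde{y})\, d\mu(\tilde{y})$

Definition: GSC functions (Bach, 2010)

$$\forall y \in \mathcal{Y}, \qquad |\ell^{(3)}(y,\cdot)| \le R\, \ell''(y,\cdot)$$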

[Figure: plot of $f(t)$ against $t$ for $t \in [-4, 4]$, with curves $f(t) = \log(1+e^{t})$, $f(t) = \log((e^{t}+e^{-t})/2)$, $f(t) = \sqrt{1+t^2}-1$ and $f(t) = t^2/2$.]
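The GSC condition can be checked numerically on a grid for the one-dimensional losses listed above. The sketch below uses closed-form second and third derivatives and reports the largest observed ratio $|\psi'''|/\psi''$; the constants 1, 2 and 3/2 follow from these formulas and are not stated in the slides, and a grid check is an illustration, not a proof.

```python
import math

def logistic_derivs(t):
    # phi(t) = log(1 + exp(-t)); with s = 1 / (1 + exp(-t)):
    # phi''(t) = s (1 - s),  phi'''(t) = s (1 - s) (1 - 2 s)
    s = 1.0 / (1.0 + math.exp(-t))
    second = s * (1.0 - s)
    return second, second * (1.0 - 2.0 * s)

def logcosh_derivs(t):
    # psi(t) = log((e^t + e^-t) / 2):
    # psi''(t) = 1 - tanh(t)^2,  psi'''(t) = -2 tanh(t) (1 - tanh(t)^2)
    th = math.tanh(t)
    second = 1.0 - th * th
    return second, -2.0 * th * second

def sqrt_derivs(t):
    # psi(t) = sqrt(1 + t^2) - 1:
    # psi''(t) = (1 + t^2)^(-3/2),  psi'''(t) = -3 t (1 + t^2)^(-5/2)
    u = 1.0 + t * t
    return u ** -1.5, -3.0 * t * u ** -2.5

def max_ratio(derivs, grid):
    # largest value of |third derivative| / second derivative over the grid
    best = 0.0
    for t in grid:
        second, third = derivs(t)
        best = max(best, abs(third) / second)
    return best

grid = [k / 100.0 for k in range(-1000, 1001)]  # t in [-10, 10]
ratios = {
    "logistic": max_ratio(logistic_derivs, grid),
    "log-cosh (Huber 2)": max_ratio(logcosh_derivs, grid),
    "sqrt (Huber 1)": max_ratio(sqrt_derivs, grid),
}
for name, value in ratios.items():
    print(name, value)
```

The square loss has a vanishing third derivative, so it satisfies the condition with any $R \ge 0$.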

(22)

Fast rates for GSC functions

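Assumption: $\ell$ is GSC

Hessian at optimum: $H = \mathbb{E}[\ell''(Y, \theta^\star\cdot\Phi(X))\, \Phi(X)\otimes\Phi(X)]$, $H_\lambda = H + \lambda I$

Fisher information: $G = \mathbb{E}[\ell'(Y, \theta^\star\cdot\Phi(X))^2\, \Phi(X)\otimes\Phi(X)]$

Two main quantities

- $b_\lambda = \lambda^2\|H_\lambda^{-1/2}\theta^\star\|^2 \le \lambda\|\theta^\star\|^2$ → bias: regularity of $\theta^\star$
- $\mathrm{df}_\lambda = \mathrm{Tr}(H_\lambda^{-1/2} G H_\lambda^{-1/2}) \le C/\lambda$ → variance: effective dimension

Fast rates up to $O(1/n)$ (Marteau-Ferey et al., 2019)

Bias-variance decomposition

$$L(\hat\theta_\lambda) - L(\theta^\star) \le b_\lambda + \frac{\mathrm{df}_\lambda}{n}$$

Example: for $\mathrm{df}_\lambda \approx \sigma^2 d$,

$$L(\hat\theta_\lambda) - L(\theta^\star) \le \frac{2\sigma^2 d}{n}, \qquad \lambda = \frac{\sigma^2 d}{\|\theta^\star\|^2\, n}$$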
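The example follows by substituting the stated $\lambda$ into the two terms of the decomposition. A short check, using only the bound $b_\lambda \le \lambda\|\theta^\star\|^2$ and the assumption $\mathrm{df}_\lambda \approx \sigma^2 d$ from the slide:

```latex
\[
b_\lambda \le \lambda\|\theta^\star\|^2
  = \frac{\sigma^2 d}{\|\theta^\star\|^2\, n}\,\|\theta^\star\|^2
  = \frac{\sigma^2 d}{n},
\qquad
\frac{\mathrm{df}_\lambda}{n} \approx \frac{\sigma^2 d}{n},
\qquad\text{so}\qquad
b_\lambda + \frac{\mathrm{df}_\lambda}{n} \lesssim \frac{2\sigma^2 d}{n}.
\]
```

The regularization parameter is chosen precisely so that the bias bound matches the variance term.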

(26)


Link with least-squares

Assumption: $\ell(y,y') = \tfrac{1}{2}(y-y')^2$, $\Sigma = \mathbb{E}[\Phi(X)\otimes\Phi(X)]$

Hessian at optimum: $H = \mathbb{E}[1\cdot\Phi(X)\otimes\Phi(X)] = \Sigma$

Fisher information: $G = \mathbb{E}[(Y - \Phi(X)\cdot\theta^\star)^2\, \Phi(X)\otimes\Phi(X)] \preceq \sigma^2\Sigma$, with $\sigma^2$ determined by $\|\theta^\star\|^2\|\Phi\|^2$ and $\|Y\|^2$

Two main quantities

- $b_\lambda = \lambda^2\|\Sigma_\lambda^{-1/2}\theta^\star\|^2 = b^{\mathrm{ls}}_\lambda$ → regularity of $\theta^\star$
- $\mathrm{df}_\lambda \le \sigma^2\,\mathrm{Tr}(\Sigma_\lambda^{-1/2}\Sigma\,\Sigma_\lambda^{-1/2}) = \sigma^2\,\mathrm{df}^{\mathrm{ls}}_\lambda$ → effective dimension

Fast rates up to $O(1/n)$ (Marteau-Ferey et al., 2019)

Bias-variance decomposition

$$L(\hat\theta_\lambda) - L(\theta^\star) \le b_\lambda + \frac{\mathrm{df}_\lambda}{n}$$
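The reduction to the least-squares quantities can be checked on simulated data. The sketch below is an illustration under stated assumptions: a two-dimensional identity feature map, a hypothetical `theta_star`, bounded zero-mean noise, empirical averages standing in for expectations, and $\sigma^2$ taken as the largest squared residual. It uses the identity $\mathrm{Tr}(H_\lambda^{-1/2} G H_\lambda^{-1/2}) = \mathrm{Tr}(H_\lambda^{-1} G)$.

```python
import random

def outer_mean(xs, weights):
    # (1/n) * sum_i w_i * x_i x_i^T for 2-dimensional x_i,
    # returned as the triple (m11, m12, m22) of a symmetric matrix
    n = len(xs)
    m11 = sum(w * x[0] * x[0] for x, w in zip(xs, weights)) / n
    m12 = sum(w * x[0] * x[1] for x, w in zip(xs, weights)) / n
    m22 = sum(w * x[1] * x[1] for x, w in zip(xs, weights)) / n
    return m11, m12, m22

def trace_inv_product(a, b):
    # Tr(A^{-1} B) for symmetric 2x2 matrices stored as (m11, m12, m22)
    det = a[0] * a[2] - a[1] * a[1]
    return (a[2] * b[0] - 2.0 * a[1] * b[1] + a[0] * b[2]) / det

random.seed(0)
theta_star = (1.0, -0.5)  # hypothetical optimum, for illustration only
xs = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(1000)]
ys = [theta_star[0] * x[0] + theta_star[1] * x[1] + random.uniform(-0.2, 0.2) for x in xs]

# For the square loss: ell'' = 1 and ell' = theta* . Phi(x) - y
residual_sq = [(y - theta_star[0] * x[0] - theta_star[1] * x[1]) ** 2 for x, y in zip(xs, ys)]
sigma_sq = max(residual_sq)

H = outer_mean(xs, [1.0] * len(xs))  # Hessian at optimum, equal to Sigma
G = outer_mean(xs, residual_sq)      # Fisher information

lam = 0.01
H_lam = (H[0] + lam, H[1], H[2] + lam)
df_gsc = trace_inv_product(H_lam, G)  # Tr(H_lam^{-1} G)
df_ls = trace_inv_product(H_lam, H)   # Tr(Sigma_lam^{-1} Sigma)
print(df_gsc, sigma_sq * df_ls)
```

Because every squared residual is at most `sigma_sq`, the matrix `sigma_sq * H - G` is positive semidefinite, which gives `df_gsc <= sigma_sq * df_ls`.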

(33)

Conclusion

Thank you for your attention! Poster 175
