slides

(1)

Beyond Least-Squares: Fast Rates for Regularized Empirical Risk Minimization through Self-Concordance

Ulysse Marteau-Ferey June 27, 2019

Joint work with Dmitrii Ostrovskii, Francis Bach and Alessandro Rudi

INRIA Paris – ENS Paris, CS department – PSL Research University

(2)

Presentation of the problem

(3)

Learning Problem

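Setting: input $X$, output $Y \in \mathcal{Y}$

Linear predictor: $f(x) = \theta \cdot \Phi(x)$, with feature map $\Phi(x) \in \mathcal{H}$, $\mathcal{H}$ infinite dimensional

Problem: find

$\theta^\star \in \arg\min_{\theta \in \mathcal{H}} L(\theta), \qquad L(\theta) = \mathbb{E}[\ell(Y, \theta \cdot \Phi(X))]$

$\ell(\cdot,\cdot)$: loss function; distribution of $(X, Y)$ unknown; $n$ i.i.d. samples $(x_i, y_i)_{1 \le i \le n}$.

Basic assumption: $\mathcal{H}$ Hilbert space, $Y$ and $\Phi(X)$ bounded.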

(4)

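Regularized Empirical Risk Minimization

Problem:

$\theta^\star \in \arg\min_{\theta \in \mathcal{H}} L(\theta), \qquad L(\theta) = \mathbb{E}[\ell(Y, \theta \cdot \Phi(X))]$

Classical estimator: regularized empirical risk minimizer

$\hat\theta_\lambda = \arg\min_{\theta \in \mathcal{H}} \widehat{L}(\theta) + \frac{\lambda}{2}\|\theta\|^2, \qquad \widehat{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, \theta \cdot \Phi(x_i))$

$\lambda$: regularization parameter → controls overfitting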
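A minimal sketch of the regularized empirical risk minimizer, under assumptions that are not on the slide: the feature map is finite dimensional (the rows of `feats` play the role of $\Phi(x_i)$), the loss is the logistic loss, and the solver is plain gradient descent. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def logistic_objective(theta, feats, y, lam):
    # Empirical risk (1/n) sum log(1 + exp(-y_i theta.phi_i)) plus (lam/2) ||theta||^2.
    margins = y * (feats @ theta)
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * lam * theta @ theta

def logistic_gradient(theta, feats, y, lam):
    margins = y * (feats @ theta)
    sig = 1.0 / (1.0 + np.exp(margins))  # sigmoid(-margin)
    return -(feats.T @ (y * sig)) / len(y) + lam * theta

def regularized_erm(feats, y, lam, n_steps=500):
    # Step size 1/smoothness, where the objective's Hessian is bounded by
    # 0.25 * max ||phi_i||^2 + lam (assumption: logistic loss).
    smoothness = 0.25 * np.max(np.sum(feats**2, axis=1)) + lam
    theta = np.zeros(feats.shape[1])
    for _ in range(n_steps):
        theta = theta - logistic_gradient(theta, feats, y, lam) / smoothness
    return theta

# Hypothetical synthetic data with bounded features and labels in {-1, +1}.
rng = np.random.default_rng(0)
n, d = 200, 5
feats = rng.normal(size=(n, d))
feats /= np.maximum(1.0, np.linalg.norm(feats, axis=1, keepdims=True))
y = np.where(rng.random(n) < 0.5, 1.0, -1.0)
lam = 0.1
theta_hat = regularized_erm(feats, y, lam)
```

Because the objective is $\lambda$-strongly convex, the gradient at `theta_hat` should be numerically zero after enough iterations, which is a simple way to check that the minimizer was reached.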

(5)

Existing results

(6)

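A first general result: slow rates

Assumption: $\ell(y,\cdot)$ is Lipschitz for every $y \in \mathcal{Y}$, with Lipschitz constant $R$.

Slow rates in $O(1/\sqrt{n})$ (Sridharan et al., 2009)

Bias-variance decomposition:

$L(\hat\theta_\lambda) - L(\theta^\star) \le \|\theta^\star\|^2 \lambda + \frac{R^2 \|\Phi\|_\infty^2}{\lambda n}$

$L(\hat\theta_\lambda) - L(\theta^\star) \le C \frac{1}{\sqrt{n}}, \qquad \lambda = c \frac{1}{\sqrt{n}}$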
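One way to see where the $1/\sqrt{n}$ rate comes from is to balance the two terms of the bias-variance bound. The explicit constants below are not stated on the slide; they are only what minimizing the displayed right-hand side over $\lambda$ gives.

```latex
\[
\min_{\lambda > 0}\; \|\theta^\star\|^2 \lambda + \frac{R^2 \|\Phi\|_\infty^2}{\lambda n}
\quad\Longrightarrow\quad
\lambda = \frac{R \|\Phi\|_\infty}{\|\theta^\star\|} \cdot \frac{1}{\sqrt{n}},
\qquad
L(\hat\theta_\lambda) - L(\theta^\star) \le \frac{2 R \|\Phi\|_\infty \|\theta^\star\|}{\sqrt{n}} .
\]
```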

(7)

Fast rates for Least-Squares

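Assumption: square loss $\ell(y, y') = \frac{1}{2}(y - y')^2$.

Covariance operator: $\Sigma = \mathbb{E}[\Phi(X) \otimes \Phi(X)]$, $\Sigma_\lambda = \Sigma + \lambda I$

Two main quantities

- $b_\lambda = \lambda^2 \|\Sigma_\lambda^{-1/2} \theta^\star\|^2 \le \lambda \|\theta^\star\|^2$ → bias: regularity of $\theta^\star$
- $\mathrm{df}_\lambda = \mathrm{Tr}(\Sigma_\lambda^{-1} \Sigma) \le \|\Phi\|_\infty^2 / \lambda$ → variance: effective dimension

Fast rates up to $O(1/n)$ (Caponnetto and De Vito, 2007)

Bias-variance decomposition:

$L(\hat\theta_\lambda) - L(\theta^\star) \le b_\lambda + \sigma^2 \frac{\mathrm{df}_\lambda}{n}, \qquad \sigma^2 \le \|\theta^\star\|^2 \|\Phi\|_\infty^2 \|Y\|_\infty^2$

Example: for $\mathrm{df}_\lambda \approx d$,

$L(\hat\theta_\lambda) - L(\theta^\star) \le \frac{2 \sigma^2 d}{n}, \qquad \lambda = \frac{\sigma^2 d}{\|\theta^\star\|^2} \cdot \frac{1}{n}$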
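The two least-squares quantities can be computed directly in finite dimension. The sketch below assumes an empirical covariance matrix as a stand-in for $\Sigma$ and a randomly drawn vector as a stand-in for $\theta^\star$; both, and the function name, are hypothetical and only illustrate the definitions and their upper bounds.

```python
import numpy as np

def bias_and_effective_dimension(cov, theta, lam):
    d = cov.shape[0]
    cov_lam = cov + lam * np.eye(d)
    # b_lambda = lam^2 * ||Sigma_lam^{-1/2} theta||^2 = lam^2 * theta^T Sigma_lam^{-1} theta
    b_lam = lam**2 * theta @ np.linalg.solve(cov_lam, theta)
    # df_lambda = Tr(Sigma_lam^{-1} Sigma)
    df_lam = np.trace(np.linalg.solve(cov_lam, cov))
    return b_lam, df_lam

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 8))
# Rescale so that ||Phi(x)|| <= 1, i.e. ||Phi||_inf <= 1 in the slide's notation.
feats /= np.maximum(1.0, np.linalg.norm(feats, axis=1, keepdims=True))
cov = feats.T @ feats / len(feats)   # empirical stand-in for Sigma
theta_star = rng.normal(size=8)      # hypothetical target parameter
lam = 1e-2
b_lam, df_lam = bias_and_effective_dimension(cov, theta_star, lam)
```

In this finite-dimensional setting `df_lam` can never exceed the ambient dimension (8 here), which is the sense in which it acts as an effective dimension.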

(12)

Interpretation of the key quantities

Eigen-decomposition:

$\Sigma = \sum_{i=0}^{+\infty} \sigma_i \, \psi_i \otimes \psi_i, \quad \sigma_i \searrow 0, \qquad \theta^\star = \sum_{i=0}^{+\infty} \langle \theta^\star, \psi_i \rangle \, \psi_i$

$b_\lambda$ → bias: regularity of $\theta^\star$ w.r.t. $\Sigma$

$b_\lambda \le L^2 \lambda^{1+2r} \;\leftrightarrow\; \sum_{i=0}^{+\infty} \frac{\langle \theta^\star, \psi_i \rangle^2}{\sigma_i^{2r}} < \infty$

$\mathrm{df}_\lambda$ → variance: eigenvalue decay of $\Sigma$

$\mathrm{df}_\lambda \le Q^2 \lambda^{-1/\alpha} \;\leftrightarrow\; \sigma_i = O(i^{-\alpha})$

Bias-variance decomposition:

$L(\hat\theta_\lambda) - L(\theta^\star) \le b_\lambda + \sigma^2 \frac{\mathrm{df}_\lambda}{n} \le L^2 \lambda^{1+2r} + \sigma^2 \frac{Q^2 \lambda^{-1/\alpha}}{n}$

Fast rates (Caponnetto and De Vito, 2007)

$L(\hat\theta_\lambda) - L(\theta^\star) \le C n^{-\gamma}, \qquad \lambda = c \, n^{-\beta}, \qquad \gamma \in [1/2, 1]$

$\gamma = \frac{\alpha(1+2r)}{\alpha(1+2r)+1}$, $\beta = \frac{\alpha}{\alpha(1+2r)+1}$, $c = (\sigma Q / L)^{2\beta}$ and $C = (\sigma^\gamma Q^\gamma L^{1-\gamma})^2$
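The exponents $\gamma$ and $\beta$ follow from equating the bias bound $L^2\lambda^{1+2r}$ with the variance bound $\sigma^2 Q^2 \lambda^{-1/\alpha}/n$. A small sketch of that bookkeeping is given below; the function names and the numerical values of $\alpha$, $r$, $\sigma$, $Q$, $L$ and $n$ are hypothetical.

```python
import math

def fast_rate_exponents(alpha, r):
    # gamma: exponent of the rate n^{-gamma}; beta: exponent of lambda = c n^{-beta}.
    denom = alpha * (1 + 2 * r) + 1
    return alpha * (1 + 2 * r) / denom, alpha / denom

def regularization(n, alpha, r, sigma, Q, L):
    # lambda = c * n^{-beta} with c = (sigma * Q / L)^{2 beta}.
    _, beta = fast_rate_exponents(alpha, r)
    return (sigma * Q / L) ** (2 * beta) * n ** (-beta)

# Hypothetical values, chosen only to illustrate the balance.
alpha, r, sigma, Q, L, n = 2.0, 0.5, 2.0, 3.0, 1.5, 1000
gamma, beta = fast_rate_exponents(alpha, r)
lam = regularization(n, alpha, r, sigma, Q, L)
bias_bound = L**2 * lam ** (1 + 2 * r)
variance_bound = sigma**2 * Q**2 * lam ** (-1 / alpha) / n
```

With this choice of $\lambda$ the two bounds coincide, and the slowest case $\alpha = 1$, $r = 0$ gives $\gamma = \beta = 1/2$, matching the lower end of the interval $[1/2, 1]$.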

(18)

Our contribution

(19)

Generalized Self Concordant functions

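Regression: $\ell(y, y') = \psi(y - y')$

- Square loss: $\psi(t) = \frac{1}{2} t^2$
- Huber loss 1: $\psi(t) = \sqrt{1 + t^2} - 1$
- Huber loss 2: $\psi(t) = \log \frac{e^{t} + e^{-t}}{2}$

Classification:

- Logistic loss: $\ell(y, y') = \log(1 + e^{-y y'})$
- GLMs: $\ell(y, y') = -y' \cdot y + \log \int_{\mathcal{Y}} \exp(y' \cdot \tilde{y}) \, d\mu(\tilde{y})$

Definition: GSC functions (Bach, 2010)

$\forall y \in \mathcal{Y}, \qquad |\ell^{(3)}(y, \cdot)| \le R \, \ell''(y, \cdot)$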

[Figure: the losses $f(t) = \log(1 + e^{t})$, $f(t) = \log((e^{t} + e^{-t})/2)$, $f(t) = \sqrt{1 + t^2} - 1$ and $f(t) = t^2/2$, plotted for $t \in [-4, 4]$ with $f(t)$ ranging from 0 to 6.]
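The GSC inequality can be checked numerically for the four plotted losses. The sketch below uses closed-form second and third derivatives; the constants $R$ in the table are worked out for this illustration and are not quoted from the slides.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# name -> (second derivative, third derivative, candidate constant R)
LOSSES = {
    # f(t) = t^2 / 2
    "square": (lambda t: np.ones_like(t), lambda t: np.zeros_like(t), 0.0),
    # f(t) = sqrt(1 + t^2) - 1
    "huber1": (lambda t: (1 + t**2) ** -1.5,
               lambda t: -3 * t * (1 + t**2) ** -2.5, 1.5),
    # f(t) = log((e^t + e^-t) / 2) = log cosh t
    "huber2": (lambda t: 1 - np.tanh(t) ** 2,
               lambda t: -2 * np.tanh(t) * (1 - np.tanh(t) ** 2), 2.0),
    # f(t) = log(1 + e^t)
    "logistic": (lambda t: sigmoid(t) * (1 - sigmoid(t)),
                 lambda t: sigmoid(t) * (1 - sigmoid(t)) * (1 - 2 * sigmoid(t)), 1.0),
}

t = np.linspace(-4, 4, 801)
# Check |f'''(t)| <= R f''(t) on the grid, with a small tolerance for rounding.
gsc_holds = {
    name: bool(np.all(np.abs(d3(t)) <= R * d2(t) + 1e-12))
    for name, (d2, d3, R) in LOSSES.items()
}
```

A grid check of this kind is only a sanity test on $[-4, 4]$, not a proof; for these four losses the inequality can also be verified by hand from the closed forms.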

(22)

Fast rates for GSC functions

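Assumption: $\ell$ is GSC

Hessian at optimum: $H = \mathbb{E}[\ell''(Y, \theta^\star \cdot \Phi(X)) \, \Phi(X) \otimes \Phi(X)]$, $H_\lambda = H + \lambda I$

Fisher information: $G = \mathbb{E}[\ell'(Y, \theta^\star \cdot \Phi(X))^2 \, \Phi(X) \otimes \Phi(X)]$

Two main quantities

- $b_\lambda = \lambda^2 \|H_\lambda^{-1/2} \theta^\star\|^2 \le \lambda \|\theta^\star\|^2$ → bias: regularity of $\theta^\star$
- $\mathrm{df}_\lambda = \mathrm{Tr}(H_\lambda^{-1/2} G H_\lambda^{-1/2}) \le C / \lambda$ → variance: effective dimension

Fast rates up to $O(1/n)$ (Marteau-Ferey et al., 2019)

Bias-variance decomposition:

$L(\hat\theta_\lambda) - L(\theta^\star) \le b_\lambda + \frac{\mathrm{df}_\lambda}{n}$

Example: for $\mathrm{df}_\lambda \approx \sigma^2 d$,

$L(\hat\theta_\lambda) - L(\theta^\star) \le \frac{2 \sigma^2 d}{n}, \qquad \lambda = \frac{\sigma^2 d}{\|\theta^\star\|^2 n}$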
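For a concrete GSC loss, the Hessian $H$, the Fisher information $G$ and the two quantities built from them can be estimated by replacing expectations with sample averages. The sketch below does this for the logistic loss with finite-dimensional features; the plug-in parameter `theta`, the synthetic data and the function name are assumptions made for illustration, since $\theta^\star$ is unknown in practice.

```python
import numpy as np

def gsc_quantities(feats, y, theta, lam):
    """Plug-in estimates of b_lambda and df_lambda for l(y, y') = log(1 + exp(-y y'))."""
    n, d = feats.shape
    margins = y * (feats @ theta)
    sig = 1.0 / (1.0 + np.exp(margins))   # sigmoid(-margin)
    first = -y * sig                      # l'(y, y'), derivative in y'
    second = sig * (1.0 - sig)            # l''(y, y'), using y^2 = 1
    H = (feats.T * second) @ feats / n    # empirical Hessian at theta
    G = (feats.T * first**2) @ feats / n  # empirical Fisher information at theta
    H_lam = H + lam * np.eye(d)
    b_lam = lam**2 * theta @ np.linalg.solve(H_lam, theta)
    # Tr(H_lam^{-1/2} G H_lam^{-1/2}) = Tr(H_lam^{-1} G) by cyclicity of the trace.
    df_lam = np.trace(np.linalg.solve(H_lam, G))
    return b_lam, df_lam, H, G

# Hypothetical synthetic data drawn from a logistic model.
rng = np.random.default_rng(0)
n, d = 2000, 4
feats = rng.normal(size=(n, d))
feats /= np.maximum(1.0, np.linalg.norm(feats, axis=1, keepdims=True))
theta = np.array([1.0, -0.5, 0.25, 0.0])  # stand-in for theta_star
probs = 1.0 / (1.0 + np.exp(-(feats @ theta)))
y = np.where(rng.random(n) < probs, 1.0, -1.0)
lam = 1e-2
b_lam, df_lam, H, G = gsc_quantities(feats, y, theta, lam)
```

Since $H_\lambda \succeq \lambda I$, the computed values satisfy $b_\lambda \le \lambda\|\theta\|^2$ and $\mathrm{df}_\lambda \le \mathrm{Tr}(G)/\lambda$, which mirrors the two upper bounds on the slide.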

(26)

Link with least-squares

Assumption: $\ell(y, y') = \frac{1}{2}(y - y')^2$, $\Sigma = \mathbb{E}[\Phi(X) \otimes \Phi(X)]$

Hessian at optimum: $H = \mathbb{E}[\ell''(Y, \theta^\star \cdot \Phi(X)) \, \Phi(X) \otimes \Phi(X)] = \mathbb{E}[1 \cdot \Phi(X) \otimes \Phi(X)] = \Sigma$

Fisher information: $G = \mathbb{E}[\ell'(Y, \theta^\star \cdot \Phi(X))^2 \, \Phi(X) \otimes \Phi(X)] = \mathbb{E}[(Y - \Phi(X) \cdot \theta^\star)^2 \, \Phi(X) \otimes \Phi(X)] \preceq \sigma^2 \Sigma$, with $\sigma^2 = \|\theta^\star\|^2 \|\Phi\|_\infty^2 \|Y\|_\infty^2$

Two main quantities

- $b_\lambda = \lambda^2 \|H_\lambda^{-1/2} \theta^\star\|^2 = \lambda^2 \|\Sigma_\lambda^{-1/2} \theta^\star\|^2 = b_\lambda^{\mathrm{ls}}$ → regularity of $\theta^\star$
- $\mathrm{df}_\lambda = \mathrm{Tr}(H_\lambda^{-1/2} G H_\lambda^{-1/2}) \le \sigma^2 \, \mathrm{Tr}(\Sigma_\lambda^{-1/2} \Sigma \Sigma_\lambda^{-1/2}) = \sigma^2 \, \mathrm{df}_\lambda^{\mathrm{ls}}$ → effective dimension

Fast rates up to $O(1/n)$ (Marteau-Ferey et al., 2019)

Bias-variance decomposition:

$L(\hat\theta_\lambda) - L(\theta^\star) \le b_\lambda + \frac{\mathrm{df}_\lambda}{n}$
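The step from the general effective dimension to the least-squares one can be written out as follows. This is a sketch of the reasoning under the slide's assumptions ($H = \Sigma$ and $G \preceq \sigma^2 \Sigma$), using only monotonicity and cyclicity of the trace; it recovers the form $\mathrm{Tr}(\Sigma_\lambda^{-1} \Sigma)$ used in the least-squares slides.

```latex
\[
\mathrm{df}_\lambda
= \mathrm{Tr}\big(\Sigma_\lambda^{-1/2} \, G \, \Sigma_\lambda^{-1/2}\big)
\le \sigma^2 \, \mathrm{Tr}\big(\Sigma_\lambda^{-1/2} \, \Sigma \, \Sigma_\lambda^{-1/2}\big)
= \sigma^2 \, \mathrm{Tr}\big(\Sigma_\lambda^{-1} \Sigma\big)
= \sigma^2 \, \mathrm{df}^{\mathrm{ls}}_\lambda .
\]
```

Plugging this into $b_\lambda + \mathrm{df}_\lambda / n$ gives back the least-squares bound $b^{\mathrm{ls}}_\lambda + \sigma^2 \, \mathrm{df}^{\mathrm{ls}}_\lambda / n$.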

(33)

Conclusion

Thank you for your attention! Poster 175