Beyond Least-Squares: Fast Rates for Regularized Empirical Risk Minimization through Self-Concordance
Ulysse Marteau-Ferey June 27, 2019
Joint work with Dmitrii Ostrovskii, Francis Bach and Alessandro Rudi
INRIA Paris – ENS Paris, CS department – PSL Research University
Presentation of the problem
Learning Problem
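Setting: input $X$, output $Y \in \mathcal{Y}$
Linear predictor: $f(x) = \theta \cdot \Phi(x)$, $\Phi(x) \in \mathcal{H}$ feature map, $\mathcal{H}$ infinite dimensional
Problem: find $\theta^\star \in \arg\min_{\theta \in \mathcal{H}} L(\theta)$, $L(\theta) = \mathbb{E}[\ell(Y, \theta \cdot \Phi(X))]$
$\ell(\cdot,\cdot)$ loss function, $(X, Y)$ unknown, $n$ i.i.d. samples $(x_i, y_i)_{1 \le i \le n}$.
Basic assumption: $\mathcal{H}$ Hilbert space, $Y$, $\Phi(X)$ bounded.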
Regularized Empirical Risk minimization
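Problem: $\theta^\star \in \arg\min_{\theta \in \mathcal{H}} L(\theta)$, $L(\theta) = \mathbb{E}[\ell(Y, \theta \cdot \Phi(X))]$
Classical estimator: Regularized Empirical Risk Minimizer
$\hat\theta_\lambda = \arg\min_{\theta \in \mathcal{H}} \hat L(\theta) + \frac{\lambda}{2}\|\theta\|^2$, $\hat L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, \theta \cdot \Phi(x_i))$
$\lambda$: regularization parameter → controls overfitting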
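The estimator above can be sketched numerically. The following is a minimal illustration, not the method used in the talk: it assumes a finite-dimensional feature matrix `Phi` (rows standing in for $\Phi(x_i)$), labels in $\{-1, +1\}$, the logistic loss, and plain gradient descent. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def regularized_risk(theta, Phi, y, lam):
    # hat L(theta) + (lam / 2) * ||theta||^2 with the logistic loss log(1 + exp(-y * y'))
    return np.logaddexp(0.0, -y * (Phi @ theta)).mean() + 0.5 * lam * (theta @ theta)

def risk_gradient(theta, Phi, y, lam):
    margins = y * (Phi @ theta)
    # d/dy' log(1 + exp(-y * y')) = -y / (1 + exp(y * y')), written with tanh for stability
    dloss = -y * 0.5 * (1.0 - np.tanh(0.5 * margins))
    return Phi.T @ dloss / len(y) + lam * theta

def fit_regularized_erm(Phi, y, lam, n_steps=3000):
    # Gradient descent with step 1 / (smoothness bound); l'' <= 1/4 for the logistic loss
    step = 1.0 / (0.25 * np.max(np.sum(Phi**2, axis=1)) + lam)
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_steps):
        theta = theta - step * risk_gradient(theta, Phi, y, lam)
    return theta

# Synthetic data (assumption: logistic model with a known parameter)
rng = np.random.default_rng(0)
Phi = rng.normal(size=(200, 5))
theta_true = np.array([1.0, -2.0, 0.5, 0.0, 0.0])
prob_pos = 0.5 * (1.0 + np.tanh(0.5 * (Phi @ theta_true)))  # sigmoid of the score
y = np.where(rng.random(200) < prob_pos, 1.0, -1.0)

theta_hat = fit_regularized_erm(Phi, y, lam=0.1)
print(regularized_risk(theta_hat, Phi, y, 0.1))
```

Increasing `lam` shrinks `theta_hat` towards zero, which is the sense in which $\lambda$ controls overfitting.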
Existing results
A first general result : slow rates
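Assumption: $\ell(y,\cdot)$ is Lipschitz for all $y \in \mathcal{Y}$, with Lipschitz constant $R$.
Slow rates in $O(1/\sqrt{n})$ (Sridharan et al., 2009)
Bias-variance decomposition: $L(\hat\theta_\lambda) - L(\theta^\star) \le \|\theta^\star\|^2 \lambda + \frac{R^2 \|\Phi\|_\infty^2}{\lambda n}$
$L(\hat\theta_\lambda) - L(\theta^\star) \le C \frac{1}{\sqrt{n}}$, $\lambda = c \frac{1}{\sqrt{n}}$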
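The choice $\lambda \propto 1/\sqrt{n}$ comes from balancing the two terms of the decomposition. The sketch below takes the displayed bound at face value (unit numerical constants, which is an assumption) and checks that the balancing value of $\lambda$ gives $2\|\theta^\star\| R \|\Phi\|_\infty / \sqrt{n}$. The function names and numbers are illustrative only.

```python
import math

def slow_rate_bound(lam, n, theta_norm, R, phi_inf):
    # ||theta*||^2 * lam + R^2 * ||Phi||_inf^2 / (lam * n)
    return theta_norm**2 * lam + (R * phi_inf) ** 2 / (lam * n)

def balanced_lambda(n, theta_norm, R, phi_inf):
    # Setting the derivative in lam to zero makes the two terms equal
    return R * phi_inf / (theta_norm * math.sqrt(n))

n, theta_norm, R, phi_inf = 10_000, 2.0, 1.0, 1.0
lam_star = balanced_lambda(n, theta_norm, R, phi_inf)
best = slow_rate_bound(lam_star, n, theta_norm, R, phi_inf)
print(lam_star, best)
```

Any other value of $\lambda$ gives a larger bound, since the bound is convex in $\lambda$ with a single minimizer.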
Fast rates for Least-Squares
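Assumption: square loss $\ell(y, y') = \frac{1}{2}(y - y')^2$.
Covariance operator: $\Sigma = \mathbb{E}[\Phi(X) \otimes \Phi(X)]$, $\Sigma_\lambda = \Sigma + \lambda I$
Two main quantities
- $b_\lambda = \lambda^2 \|\Sigma_\lambda^{-1/2} \theta^\star\|^2 \le \lambda \|\theta^\star\|^2$ → bias, regularity of $\theta^\star$
- $df_\lambda = \mathrm{Tr}(\Sigma_\lambda^{-1} \Sigma) \le \|\Phi\|_\infty^2 / \lambda$ → variance, effective dimension
Fast rates up to $O(1/n)$ (Caponnetto and De Vito, 2007)
Bias-variance decomposition: $L(\hat\theta_\lambda) - L(\theta^\star) \le b_\lambda + \sigma^2 \frac{df_\lambda}{n}$, $\sigma^2 \le \|\theta^\star\|^2 \|\Phi\|_\infty^2 \|Y\|_\infty^2$
Example: for $df_\lambda \approx d$, $L(\hat\theta_\lambda) - L(\theta^\star) \le 2\sigma^2 \frac{d}{n}$, $\lambda = \frac{\sigma^2 d}{\|\theta^\star\|^2} \frac{1}{n}$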
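Both quantities are easy to evaluate once $\Sigma$ is diagonalized. The sketch below assumes a finite spectrum `eigvals` of $\Sigma$ and the coordinates `coeffs` of $\theta^\star$ in the corresponding eigenbasis; both arrays are hypothetical inputs chosen for illustration.

```python
import numpy as np

def bias_term(eigvals, coeffs, lam):
    # b_lambda = lam^2 * ||Sigma_lam^{-1/2} theta*||^2, computed in the eigenbasis of Sigma
    return lam**2 * np.sum(coeffs**2 / (eigvals + lam))

def effective_dimension(eigvals, lam):
    # df_lambda = Tr(Sigma_lam^{-1} Sigma) = sum_i sigma_i / (sigma_i + lam)
    return np.sum(eigvals / (eigvals + lam))

d = 20
eigvals = np.ones(d)                     # hypothetical: d equal eigenvalues
coeffs = np.full(d, 1.0 / np.sqrt(d))    # theta* with unit norm

for lam in (1e-1, 1e-3, 1e-5):
    print(lam, bias_term(eigvals, coeffs, lam), effective_dimension(eigvals, lam))
```

With a flat spectrum of size $d$, `effective_dimension` approaches $d$ as $\lambda \to 0$, which is the regime $df_\lambda \approx d$ of the example.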
Interpretation of the key quantities
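Eigen-decomposition: $\Sigma = \sum_{i=0}^{+\infty} \sigma_i \, \psi_i \otimes \psi_i$, $\sigma_i \searrow 0$, $\theta^\star = \sum_{i=0}^{+\infty} \langle \theta^\star, \psi_i \rangle \psi_i$
$b_\lambda$ → bias: regularity of $\theta^\star$ w.r.t. $\Sigma$
$b_\lambda \le L^2 \lambda^{1+2r}$ ↔ $\sum_{i=0}^{+\infty} \frac{\langle \theta^\star, \psi_i \rangle^2}{\sigma_i^{2r}} < \infty$
$df_\lambda$ → variance: eigenvalue decay of $\Sigma$
$df_\lambda \le Q^2 \lambda^{-1/\alpha}$ ↔ $\sigma_i = O(i^{-\alpha})$
Fast rates (Caponnetto and De Vito, 2007)
$L(\hat\theta_\lambda) - L(\theta^\star) \le C n^{-\gamma}$, $\lambda = c n^{-\beta}$, $\gamma \in [1/2, 1]$.
$\gamma = \frac{\alpha(1+2r)}{\alpha(1+2r)+1}$, $\beta = \frac{\alpha}{\alpha(1+2r)+1}$, $c = (\sigma Q / L)^{2\beta}$ and $C = (\sigma^\gamma Q^\gamma L^{1-\gamma})^2$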
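The exponents and constants follow from equating the bias bound $L^2 \lambda^{1+2r}$ with the variance bound $\sigma^2 Q^2 \lambda^{-1/\alpha} / n$. A short sketch of that computation is given below; the function name and the sample values are hypothetical, and the check only confirms that the two terms coincide at $\lambda = c\,n^{-\beta}$.

```python
import math

def fast_rate_constants(alpha, r, sigma=1.0, Q=1.0, L=1.0):
    # gamma, beta, c, C as defined on the slide
    denom = alpha * (1 + 2 * r) + 1
    gamma = alpha * (1 + 2 * r) / denom
    beta = alpha / denom
    c = (sigma * Q / L) ** (2 * beta)
    C = (sigma**gamma * Q**gamma * L ** (1 - gamma)) ** 2
    return gamma, beta, c, C

alpha, r, sigma, Q, L, n = 2.0, 0.25, 0.5, 2.0, 3.0, 100
gamma, beta, c, C = fast_rate_constants(alpha, r, sigma, Q, L)
lam = c * n ** (-beta)

bias_bound = L**2 * lam ** (1 + 2 * r)
variance_bound = sigma**2 * Q**2 * lam ** (-1 / alpha) / n
print(gamma, beta, bias_bound, variance_bound, C * n ** (-gamma))
```

For $\alpha = 1$ and $r = 0$ the function returns $\gamma = \beta = 1/2$, the lower end of the range $\gamma \in [1/2, 1]$.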
Our contribution
Generalized Self Concordant functions
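Regression: $\ell(y, y') = \psi(y - y')$
- Square loss: $\psi(t) = \frac{1}{2} t^2$
- Huber loss 1: $\psi(t) = \sqrt{1 + t^2} - 1$
- Huber loss 2: $\psi(t) = \log \frac{e^t + e^{-t}}{2}$
Classification:
- Logistic loss: $\ell(y, y') = \log(1 + e^{-y y'})$
- GLMs: $\ell(y, y') = -y' \cdot y + \log \int_{\mathcal{Y}} \exp(y' \cdot \tilde y) \, d\mu(\tilde y)$
Definition: GSC functions (Bach, 2010)
$\forall y \in \mathcal{Y}$, $|\ell^{(3)}(y, \cdot)| \le R \, \ell''(y, \cdot)$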
[Figure: $f(t)$ plotted against $t$ for $f(t) = \log(1 + e^t)$, $f(t) = \log((e^t + e^{-t})/2)$, $f(t) = \sqrt{1 + t^2} - 1$ and $f(t) = t^2/2$.]
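The GSC inequality can be checked numerically for the scalar losses listed above. The sketch below uses closed-form second and third derivatives (derived by hand, so treat them as an assumption to verify) and reports the largest ratio $|f'''(t)| / f''(t)$ on a grid, which is a grid-based estimate of the smallest admissible $R$. The dictionary keys are illustrative names.

```python
import numpy as np

def sigmoid(t):
    return 0.5 * (1.0 + np.tanh(0.5 * t))

# name -> (second derivative, third derivative) of the scalar function f
LOSSES = {
    # f(t) = t^2 / 2
    "square": (lambda t: np.ones_like(t), lambda t: np.zeros_like(t)),
    # f(t) = sqrt(1 + t^2) - 1
    "huber1": (lambda t: (1 + t**2) ** -1.5, lambda t: -3 * t * (1 + t**2) ** -2.5),
    # f(t) = log((e^t + e^-t) / 2)
    "huber2": (lambda t: 1 / np.cosh(t) ** 2, lambda t: -2 * np.tanh(t) / np.cosh(t) ** 2),
    # f(t) = log(1 + e^t)
    "logistic": (lambda t: sigmoid(t) * (1 - sigmoid(t)),
                 lambda t: sigmoid(t) * (1 - sigmoid(t)) * (1 - 2 * sigmoid(t))),
}

def gsc_constant(name, grid=np.linspace(-10.0, 10.0, 2001)):
    # Largest |f'''| / f'' over the grid
    d2, d3 = LOSSES[name]
    return float(np.max(np.abs(d3(grid)) / d2(grid)))

constants = {name: gsc_constant(name) for name in LOSSES}
print(constants)
```

On this grid the ratios stay below 3/2 for Huber loss 1, below 2 for Huber loss 2 and below 1 for the logistic loss, while the square loss has a zero third derivative.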
Fast rates for GSC functions
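Assumption: $\ell$ is GSC
Hessian at optimum: $H = \mathbb{E}[\ell''(Y, \theta^\star \cdot \Phi(X)) \, \Phi(X) \otimes \Phi(X)]$, $H_\lambda = H + \lambda I$
Fisher information: $G = \mathbb{E}[\ell'(Y, \theta^\star \cdot \Phi(X))^2 \, \Phi(X) \otimes \Phi(X)]$
Two main quantities
- $b_\lambda = \lambda^2 \|H_\lambda^{-1/2} \theta^\star\|^2 \le \lambda \|\theta^\star\|^2$ → bias, regularity of $\theta^\star$
- $df_\lambda = \mathrm{Tr}(H_\lambda^{-1/2} G H_\lambda^{-1/2}) \le C / \lambda$ → variance, effective dimension
Fast rates up to $O(1/n)$ (Marteau-Ferey et al., 2019)
Bias-variance decomposition: $L(\hat\theta_\lambda) - L(\theta^\star) \le b_\lambda + \frac{df_\lambda}{n}$
Example: for $df_\lambda \approx \sigma^2 d$, $L(\hat\theta_\lambda) - L(\theta^\star) \le 2\sigma^2 \frac{d}{n}$, $\lambda = \frac{\sigma^2 d}{\|\theta^\star\|^2 n}$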
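As a rough illustration of these definitions, the sketch below forms plug-in versions of $H$, $G$, $b_\lambda$ and $df_\lambda$ for the logistic loss. It is not the estimator analysed in the talk: expectations are replaced by sample averages, the features are finite-dimensional, and a known parameter `theta` stands in for $\theta^\star$. All names and data are hypothetical.

```python
import numpy as np

def gsc_quantities(Phi, y, theta, lam):
    n, d = Phi.shape
    s = 0.5 * (1.0 + np.tanh(0.5 * y * (Phi @ theta)))  # sigmoid(y * theta . Phi(x))
    d1 = -y * (1.0 - s)        # l'(y, y') for l = log(1 + exp(-y * y'))
    d2 = s * (1.0 - s)         # l''(y, y') when y is in {-1, +1}
    H = (Phi * d2[:, None]).T @ Phi / n          # empirical Hessian at theta
    G = (Phi * (d1**2)[:, None]).T @ Phi / n     # empirical Fisher information
    H_lam = H + lam * np.eye(d)
    bias = lam**2 * (theta @ np.linalg.solve(H_lam, theta))
    # Tr(H_lam^{-1/2} G H_lam^{-1/2}) = Tr(H_lam^{-1} G) by cyclicity of the trace
    df = np.trace(np.linalg.solve(H_lam, G))
    return bias, df, G

rng = np.random.default_rng(0)
n, lam = 500, 1e-2
theta = np.array([1.0, -1.0, 0.5, 0.0])
Phi = rng.normal(size=(n, 4))
prob_pos = 0.5 * (1.0 + np.tanh(0.5 * (Phi @ theta)))
y = np.where(rng.random(n) < prob_pos, 1.0, -1.0)

bias, df, G = gsc_quantities(Phi, y, theta, lam)
print(bias, df)
```

The two inequalities on the slide can be read off directly: `bias` is at most `lam * theta @ theta`, and `df` is at most `np.trace(G) / lam`.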
Link with least-squares
Assumption: $\ell(y, y') = \frac{1}{2}(y - y')^2$, $\Sigma = \mathbb{E}[\Phi(X) \otimes \Phi(X)]$
Hessian at optimum: $H = \mathbb{E}[1 \cdot \Phi(X) \otimes \Phi(X)] = \Sigma$
Fisher information: $G = \mathbb{E}[(Y - \Phi(X) \cdot \theta^\star)^2 \, \Phi(X) \otimes \Phi(X)] \preceq \sigma^2 \Sigma$, $\sigma^2 = \|\theta^\star\|^2 \|\Phi\|_\infty^2 \|Y\|_\infty^2$
Two main quantities
- $b_\lambda = \lambda^2 \|H_\lambda^{-1/2} \theta^\star\|^2 = \lambda^2 \|\Sigma_\lambda^{-1/2} \theta^\star\|^2 = b_\lambda^{\mathrm{ls}}$ → regularity of $\theta^\star$
- $df_\lambda = \mathrm{Tr}(H_\lambda^{-1/2} G H_\lambda^{-1/2}) \le \sigma^2 \, \mathrm{Tr}(\Sigma_\lambda^{-1/2} \Sigma \Sigma_\lambda^{-1/2}) = \sigma^2 df_\lambda^{\mathrm{ls}}$ → effective dimension
Fast rates up to $O(1/n)$ (Marteau-Ferey et al., 2019)
Bias-variance decomposition: $L(\hat\theta_\lambda) - L(\theta^\star) \le b_\lambda + \frac{df_\lambda}{n}$
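The comparison between the GSC effective dimension and the least-squares one can be checked on synthetic data. In the sketch below, sample averages replace expectations, and the largest squared residual is used as a stand-in for $\sigma^2$; that substitution is an assumption made for the illustration, not the constant from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 400, 6, 1e-2
Phi = rng.normal(size=(n, d))
theta = rng.normal(size=d)
y = Phi @ theta + rng.uniform(-1.0, 1.0, size=n)   # bounded noise

residuals = y - Phi @ theta          # equals -l'(y, theta . Phi(x)) for the square loss
Sigma = Phi.T @ Phi / n              # H = Sigma because l'' = 1
G = (Phi * (residuals**2)[:, None]).T @ Phi / n
Sigma_lam = Sigma + lam * np.eye(d)

# Traces written as Tr(Sigma_lam^{-1} M), equal to Tr(Sigma_lam^{-1/2} M Sigma_lam^{-1/2})
df_gsc = np.trace(np.linalg.solve(Sigma_lam, G))
df_ls = np.trace(np.linalg.solve(Sigma_lam, Sigma))
sigma2 = np.max(residuals**2)        # stand-in for sigma^2 (assumption)

print(df_gsc, sigma2 * df_ls)
```

Because every squared residual is at most `sigma2`, the empirical $G$ is dominated by `sigma2 * Sigma` in the positive semi-definite order, so `df_gsc` cannot exceed `sigma2 * df_ls`.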
Conclusion
Thank you for your attention! Poster 175