Second Order Strikes Back: Globally Convergent Newton Methods for Ill-conditioned Generalized Self-concordant Losses
U. Marteau-Ferey, A. Rudi and F. Bach — INRIA, École Normale Supérieure, PSL Research University
Ill-conditioned problems: second-order methods?

Data: $n$ points $(x_i, y_i)_{1 \le i \le n} \in \mathcal{H} \times \mathbb{R}$, with $\mathcal{H}$ a Hilbert space and $\sup_i \|x_i\| \le R$.

Goal:
$$\omega^\star_\lambda = \arg\min_{\omega \in \mathcal{H}} \ \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle \omega, x_i \rangle) + \frac{\lambda}{2} \|\omega\|^2 \qquad (1)$$

(i) Logistic loss: $\ell(y, y') = \log(1 + e^{-y y'})$
(ii) Potentially very small regularizer $\lambda \ll 1$, e.g. $\lambda \approx 10^{-12}$

Key property: generalized self-concordance (GSC),
$$\forall y \in \mathcal{Y}, \quad |\ell^{(3)}(y, \cdot)| \le \ell^{(2)}(y, \cdot).$$

Notations:
• $g_\lambda(\omega) = g(\omega) + \frac{\lambda}{2} \|\omega\|^2$ and $H_\lambda(\omega) = \nabla^2 g_\lambda(\omega)$
• For any p.s.d. operator $A$ on $\mathcal{H}$, $\|\cdot\|_A = \|A^{1/2} \cdot\|_{\mathcal{H}}$.
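For intuition (a standard check, not stated on the poster), the GSC inequality can be verified directly for the logistic loss with labels $y \in \{-1, 1\}$. Writing $\sigma(u) = 1/(1+e^{-u})$ and differentiating $\ell(y, z) = \log(1 + e^{-yz})$ in $z$:
$$\ell^{(2)}(y, z) = y^2\,\sigma(yz)\,\sigma(-yz), \qquad \ell^{(3)}(y, z) = -y^3\,\sigma(yz)\,\sigma(-yz)\,\bigl(1 - 2\sigma(-yz)\bigr),$$
so $|\ell^{(3)}(y, z)| \le |y|\,\ell^{(2)}(y, z) = \ell^{(2)}(y, z)$, since $|1 - 2\sigma(\cdot)| \le 1$ and $|y| = 1$.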
Approximate Newton Methods (ANM)

Solving ill-conditioned logistic regression: start at $\omega_0$ and iterate Approximate Newton Steps (ANS); a code sketch of one step follows this panel.
• Hessian sketch: $\widetilde{H}_\lambda(\omega_t)$ satisfying $\frac{1}{2} \widetilde{H}_\lambda \preceq H_\lambda \preceq 2 \widetilde{H}_\lambda$
• Step: $\omega_{t+1} = \omega_t - s_t$, with $s_t = \widetilde{H}_\lambda^{-1}(\omega_t)\, \nabla g_\lambda(\omega_t)$

Linear convergence near the optimum: if $\omega_0 \in D_\lambda$, where
$$D_\lambda := \left\{ \omega \ : \ R\, \|\nabla g_\lambda(\omega)\|_{H_\lambda^{-1}(\omega)} \le \sqrt{\lambda} \right\},$$
then $g_\lambda(\omega_t) - g_\lambda(\omega^\star_\lambda) \le 2^{-t}$.

Problem: is it possible to find $\omega \in D_\lambda$?
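A minimal numpy sketch of one ANS for the regularized logistic objective (1), assuming labels in $\{-1, 1\}$ and an explicit feature matrix; the `sketch` argument stands in for any Hessian approximation satisfying the $\frac{1}{2}$/$2$ bounds above (the exact Hessian is used by default, purely for illustration):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def grad_hess(w, X, y, lam):
    """Gradient and Hessian of g_lam(w) = (1/n) sum_i log(1 + exp(-y_i <w, x_i>)) + (lam/2) ||w||^2."""
    n = len(y)
    z = y * (X @ w)
    grad = -(X.T @ (y * sigmoid(-z))) / n + lam * w
    weights = sigmoid(z) * sigmoid(-z)                   # ell''(y_i, <w, x_i>) for y_i in {-1, 1}
    hess = (X.T * weights) @ X / n + lam * np.eye(X.shape[1])
    return grad, hess

def ans_step_logistic(w, X, y, lam, sketch=lambda H: H):
    """One Approximate Newton Step: w <- w - H_tilde^{-1} grad g_lam(w)."""
    grad, hess = grad_hess(w, X, y, lam)
    return w - np.linalg.solve(sketch(hess), grad)
```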
Getting inside $D_\lambda$

Reaching $D_\lambda$ in $O(\log(\mu_0/\lambda))$ approximate Newton steps: $\omega_K \in D_\lambda$ for $K = O(\log_q(\mu_0/\lambda))$, with $1/q \approx 1 - 1/(R\,\|\omega^\star_\lambda\|)$. A sketch of the resulting scheme follows.
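A compact sketch of the resulting globally convergent scheme: decrease the regularizer geometrically from a large $\mu_0$ down to $\lambda$, taking one approximate Newton step per stage, then exploit the $2^{-t}$ linear rate inside $D_\lambda$. The `shrink` factor plays the role of $1/q$ above; its concrete value here is illustrative, not the poster's:

```python
def globally_convergent_newton(w0, ans_step, lam, mu0=1.0, shrink=0.5, t_final=30):
    """Phase 1: follow the regularization path mu0 -> lam, one ANS per stage,
    which keeps the iterate inside D_mu (K = O(log(mu0/lam)) stages).
    Phase 2: t_final ANS at mu = lam, where convergence is linear (2^{-t}).
    `ans_step` is a callable (w, mu) -> new w, e.g.
    ans_step = lambda w, mu: ans_step_logistic(w, X, y, mu) from the sketch above."""
    w, mu = w0, mu0
    while mu > lam:
        w = ans_step(w, mu)
        mu = max(shrink * mu, lam)
    for _ in range(t_final):
        w = ans_step(w, lam)
    return w
```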
Statistical goal

• Data: $(x_i, y_i)_{1 \le i \le n} \in X \times \mathbb{R}$, i.i.d. with distribution $\rho$
• Feature map: $\Phi : X \to \mathcal{H}$, $\mathcal{H}$ Hilbert
• Linear predictor: $f(x) \leftrightarrow \langle f, \Phi(x) \rangle_{\mathcal{H}}$, $f \in \mathcal{H}$
• Expected loss: $L(f) := \mathbb{E}_{(x,y) \sim \rho}[\ell(y, f(x))]$

Construct $\hat{f}$ such that $L(\hat{f}) - \inf_{f \in \mathcal{H}} L(f)$ is small with high probability.
Learning with GSC functions

Classical estimator: Empirical Risk Minimization (ERM),
$$\hat{f}_\lambda = \arg\min_{f \in \mathcal{H}} \ \hat{L}_\lambda(f) := \frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i)) + \frac{\lambda}{2} \|f\|^2_{\mathcal{H}};$$
a minimal evaluation sketch follows.
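For concreteness, a minimal sketch of evaluating $\hat{L}_\lambda$ for a linear predictor, with an explicit feature matrix standing in for $\Phi$ (an illustrative finite-dimensional surrogate, not the poster's setting):

```python
import numpy as np

def erm_objective(f, features, y, lam):
    """L_hat_lambda(f) for f(x) = <f, Phi(x)>, rows of `features` holding Phi(x_i)."""
    z = y * (features @ f)
    return np.log1p(np.exp(-z)).mean() + 0.5 * lam * f @ f
```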
Statistical performance of $\hat{f}_\lambda$

• Assumptions: $\ell$ GSC, $K(\cdot, \cdot) \le R$, $\exists f^\star \in \arg\min_{f \in \mathcal{H}_K} L(f)$
• Notations: $H^\star = \nabla^2 L(f^\star)$; $H^\star_\lambda = H^\star + \lambda I$

Key quantities (illustrated in the sketch below):
• Bias $b_\lambda := \lambda^2 \|f^\star\|^2_{H^{\star\,-1}_\lambda}$ — measures the regularity of $f^\star$
• Variance $d_\lambda := \mathrm{Tr}(H^{\star\,-1/2}_\lambda H^\star H^{\star\,-1/2}_\lambda)$ — effective dimension of $\mathcal{H}_K$

Performance of $\hat{f}_\lambda$: with probability $1 - \delta$,
$$L(\hat{f}_\lambda) - L(f^\star) \le C \left( b_\lambda + \frac{d_\lambda}{n} \right) \log\frac{1}{\delta}, \quad \text{if } b_\lambda, \ \frac{d_\lambda}{n} \le \frac{\lambda}{R^2}. \qquad (2)$$

Problem: finding $\hat{f}_\lambda$ is an $n$-dimensional problem.
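A small sketch of how $b_\lambda$ and $d_\lambda$ behave, computed from an assumed finite-dimensional spectral decomposition of $H^\star$; purely illustrative, since these quantities are not observed in practice:

```python
import numpy as np

def effective_dimension(eigvals, lam):
    """d_lam = Tr((H*+lam I)^{-1/2} H* (H*+lam I)^{-1/2}) = sum_i s_i / (s_i + lam)."""
    return np.sum(eigvals / (eigvals + lam))

def bias_term(eigvals, fstar_coeffs, lam):
    """b_lam = lam^2 ||f*||^2_{(H*+lam I)^{-1}}, f* expanded in the eigenbasis of H*."""
    return lam**2 * np.sum(fstar_coeffs**2 / (eigvals + lam))

# Example: polynomially decaying spectrum s_i = i^{-2}; d_lam grows as lam -> 0.
s = 1.0 / np.arange(1, 1001) ** 2
print(effective_dimension(s, 1e-3), effective_dimension(s, 1e-6))
```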
Dimension reduction with Nyström

Finding a smaller set of candidate functions $\mathcal{H}_M$ (see the sampling sketch after this panel):
• Subsample $M$ points $(\tilde{x}_j)$ from the $(x_i)_{1 \le i \le n}$, with $M \ll n$
• Find the best possible $f$ of the form $f = \sum_{j=1}^M \alpha_j \Phi(\tilde{x}_j)$, $\alpha \in \mathbb{R}^M$:
$$\hat{f}_{\lambda,M} := \arg\min_{f \in \mathcal{H}_M} \hat{L}_\lambda(f), \qquad \mathcal{H}_M = \left\{ \sum_{j=1}^M \alpha_j \Phi(\tilde{x}_j) \ : \ \alpha \in \mathbb{R}^M \right\}.$$

$\hat{f}_{\lambda,M}$ has the same performance (2) as $\hat{f}_\lambda$ if
(a) $M \ge (1/\lambda) \log(c/\lambda\delta)$ (uniform sampling), or
(b) $M \ge d_\lambda \log(c/\lambda\delta)$ (Nyström leverage scores).
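A minimal sketch of option (a), uniform Nyström subsampling, building the kernel matrices used in problem (3) below; the Gaussian kernel is an illustrative choice, and leverage-score sampling (option (b)) would replace the uniform draw:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2)), a bounded kernel."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma**2))

def nystrom_matrices(X, M, seed=0):
    """Uniformly subsample M Nystrom centers and form K_nM and K_MM."""
    idx = np.random.default_rng(seed).choice(len(X), size=M, replace=False)
    KnM = gaussian_kernel(X, X[idx])
    KMM = gaussian_kernel(X[idx], X[idx])
    return KnM, KMM, idx
```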
ANS for the Nyström problem

Optimization problem: $\hat{f}_{\lambda,M} = \sum_{j=1}^M \hat{\alpha}_j \Phi(\tilde{x}_j)$, with
$$\hat{\alpha} = \arg\min_{\alpha \in \mathbb{R}^M} \ g_\lambda(\alpha) := \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle \alpha, K_{nM}^\top e_i \rangle) + \frac{\lambda}{2} \alpha^\top K_{MM} \alpha, \qquad (3)$$
where $(K_{MM})_{ij} = \langle \Phi(\tilde{x}_i), \Phi(\tilde{x}_j) \rangle$ and $(K_{nM})_{ij} = \langle \Phi(x_i), \Phi(\tilde{x}_j) \rangle$.

Form of the Hessian: $H_\mu := K_{nM}^\top W_n K_{nM} + \mu K_{MM}$, with $W_n$ diagonal.

Sketching the Hessian using Nyström: if (a) or (b) holds, then for all $\mu \ge \lambda$, defining
$$\widetilde{H}_\mu = K_{MM} W_M K_{MM} + \mu K_{MM}, \quad \text{it holds that} \quad \frac{1}{2} \widetilde{H}_\lambda \preceq H_\lambda \preceq 2 \widetilde{H}_\lambda.$$

Complexity (a code sketch of one such step follows this panel):
• Computing $\widetilde{H}_\mu$: time $O(M^3)$, memory $O(M^2)$
• Computing $\nabla g_\mu$: time $O(nM + M^2)$, memory $O(n + M^2)$
• Computing an ANS: time $O(nM + M^3)$, memory $O(n + M^2)$
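A numpy sketch of one ANS for problem (3), combining the full-data gradient with the Nyström-sketched Hessian. The $1/M$ scaling of $W_M$ and the use of predictions at the centers are our reading of the poster, so treat those details as assumptions:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def nystrom_ans_step(alpha, KnM, KMM, y, y_sub, mu):
    """alpha <- alpha - H_tilde_mu^{-1} grad g_mu(alpha) for problem (3),
    assuming labels in {-1, 1}; y_sub = y[idx] from `nystrom_matrices` above."""
    n, M = KnM.shape
    z = y * (KnM @ alpha)
    grad = -(KnM.T @ (y * sigmoid(-z))) / n + mu * (KMM @ alpha)   # O(nM + M^2)
    z_sub = y_sub * (KMM @ alpha)                  # predictions at the M centers
    w_M = sigmoid(z_sub) * sigmoid(-z_sub) / M     # diagonal of W_M (assumed scaling)
    H_tilde = (KMM * w_M) @ KMM + mu * KMM         # O(M^3) to form and factor
    return alpha - np.linalg.solve(H_tilde, grad)
```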
Second Order Strikes Back

Logistic regression on Susy and Higgs ($n \approx 10^7$):

[Figure: classification error and distance to optimum versus passes over the data on both datasets, comparing the second-order method with K-SVRG.]

Algorithm (GS) to solve (3) with precision $c/n$: returns $\alpha_{\mathrm{alg}}$ such that $g_\lambda(\alpha_{\mathrm{alg}}) - g_\lambda(\hat{\alpha}) \le c/n$, with predictor $f_{\mathrm{alg}} = \sum_{j=1}^M \alpha_j^{\mathrm{alg}} \Phi(\tilde{x}_j)$.

• $f_{\mathrm{alg}}$ has the same guarantees (2) as $\hat{f}_\lambda$
• Time complexity: $O\!\left(T \left[ n d_\lambda + d_\lambda^3 \right]\right)$, with $T = R\|f^\star\| \log\frac{\mu_0}{\lambda} + \log\frac{n}{c}$
• Memory complexity: $O(n + d_\lambda^2)$

Code and results: https://github.com/umarteau/Newton-Method-for-GSC-losses-
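To make the step count concrete, with assumed, purely illustrative values $R\|f^\star\| = 10$, $\mu_0 = 1$, $\lambda = 10^{-12}$, $n = 10^7$, $c = 1$ (and natural logarithms):
$$T \approx 10 \cdot \log(10^{12}) + \log(10^7) \approx 10 \times 27.6 + 16.1 \approx 290 \text{ approximate Newton steps.}$$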
Main References

• Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. FALKON: An optimal large scale kernel method. In Advances in Neural Information Processing Systems 30, pages 3888–3898, 2017.
• Ulysse Marteau-Ferey, Dmitrii Ostrovskii, Francis Bach, and Alessandro Rudi. Beyond least-squares: Fast rates for regularized empirical risk minimization through self-concordance. In Proceedings of the Conference on Computational Learning Theory, 2019.