Second Order Strikes Back: Globally Convergent Newton Methods for Ill-conditioned Generalized Self-concordant Losses
U. Marteau-Ferey, A. Rudi and F. Bach — INRIA, École Normale Supérieure, PSL Research University
Ill-conditioned problems: second-order methods?

Data: $n$ points $(x_i, y_i)_{1 \le i \le n} \in \mathcal{H} \times \mathbb{R}$, with $\mathcal{H}$ a Hilbert space and $\sup_i \|x_i\| \le R$.

Goal:
$$\omega^\star_\lambda = \arg\min_{\omega \in \mathcal{H}} \ \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle \omega, x_i \rangle) + \frac{\lambda}{2} \|\omega\|^2 \qquad (1)$$

(i) Logistic loss: $\ell(y, y') = \log(1 + e^{-y y'})$
(ii) Potentially very small regularizer $\lambda \ll 1$, e.g. $\lambda \approx 10^{-12}$

Key property: generalized self-concordance (GSC),
$$\forall y \in \mathcal{Y}, \quad |\ell^{(3)}(y, \cdot)| \le \ell^{(2)}(y, \cdot).$$

Notations:
• $g_\lambda(\omega) = g(\omega) + \frac{\lambda}{2} \|\omega\|^2$ and $H_\lambda(\omega) = \nabla^2 g_\lambda(\omega)$
• For any p.s.d. operator $A$ on $\mathcal{H}$, $\|\cdot\|_A = \|A^{1/2} \cdot\|_{\mathcal{H}}$.
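For intuition (a standard check, not stated on the poster), the GSC inequality can be verified directly for the logistic loss with labels $y \in \{-1, 1\}$. Writing $\sigma(u) = 1/(1+e^{-u})$ and differentiating $\ell(y, z) = \log(1 + e^{-yz})$ in $z$:
$$\ell^{(2)}(y, z) = y^2\,\sigma(yz)\,\sigma(-yz), \qquad \ell^{(3)}(y, z) = -y^3\,\sigma(yz)\,\sigma(-yz)\,\bigl(1 - 2\sigma(-yz)\bigr),$$
so $|\ell^{(3)}(y, z)| \le |y|\,\ell^{(2)}(y, z) = \ell^{(2)}(y, z)$, since $|1 - 2\sigma(\cdot)| \le 1$ and $|y| = 1$.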
Approximate Newton Methods (ANM)

Solving ill-conditioned logistic regression: start at $\omega_0$ and iterate Approximate Newton Steps (ANS); a code sketch of one step follows this panel.
• Hessian sketch: $\widetilde{H}_\lambda(\omega_t)$ satisfying $\frac{1}{2} \widetilde{H}_\lambda \preceq H_\lambda \preceq 2 \widetilde{H}_\lambda$
• Step: $\omega_{t+1} = \omega_t - s_t$, with $s_t = \widetilde{H}_\lambda^{-1}(\omega_t)\, \nabla g_\lambda(\omega_t)$

Linear convergence near the optimum: if $\omega_0 \in D_\lambda$, where
$$D_\lambda := \left\{ \omega \ : \ R\, \|\nabla g_\lambda(\omega)\|_{H_\lambda^{-1}(\omega)} \le \sqrt{\lambda} \right\},$$
then $g_\lambda(\omega_t) - g_\lambda(\omega^\star_\lambda) \le 2^{-t}$.

Problem: is it possible to find $\omega \in D_\lambda$?
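A minimal numpy sketch of one ANS for the regularized logistic objective (1), assuming labels in $\{-1, 1\}$ and an explicit feature matrix; the `sketch` argument stands in for any Hessian approximation satisfying the $\frac{1}{2}$/$2$ bounds above (the exact Hessian is used by default, purely for illustration):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def grad_hess(w, X, y, lam):
    """Gradient and Hessian of g_lam(w) = (1/n) sum_i log(1 + exp(-y_i <w, x_i>)) + (lam/2) ||w||^2."""
    n = len(y)
    z = y * (X @ w)
    grad = -(X.T @ (y * sigmoid(-z))) / n + lam * w
    weights = sigmoid(z) * sigmoid(-z)                   # ell''(y_i, <w, x_i>) for y_i in {-1, 1}
    hess = (X.T * weights) @ X / n + lam * np.eye(X.shape[1])
    return grad, hess

def ans_step_logistic(w, X, y, lam, sketch=lambda H: H):
    """One Approximate Newton Step: w <- w - H_tilde^{-1} grad g_lam(w)."""
    grad, hess = grad_hess(w, X, y, lam)
    return w - np.linalg.solve(sketch(hess), grad)
```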
Getting inside $D_\lambda$

Reaching $D_\lambda$ in $O(\log(\mu_0/\lambda))$ approximate Newton steps: $\omega_K \in D_\lambda$ for $K = O(\log_q(\mu_0/\lambda))$, with $1/q \approx 1 - 1/(R\,\|\omega^\star_\lambda\|)$. A sketch of the resulting scheme follows.
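A compact sketch of the resulting globally convergent scheme: decrease the regularizer geometrically from a large $\mu_0$ down to $\lambda$, taking one approximate Newton step per stage, then exploit the $2^{-t}$ linear rate inside $D_\lambda$. The `shrink` factor plays the role of $1/q$ above; its concrete value here is illustrative, not the poster's:

```python
def globally_convergent_newton(w0, ans_step, lam, mu0=1.0, shrink=0.5, t_final=30):
    """Phase 1: follow the regularization path mu0 -> lam, one ANS per stage,
    which keeps the iterate inside D_mu (K = O(log(mu0/lam)) stages).
    Phase 2: t_final ANS at mu = lam, where convergence is linear (2^{-t}).
    `ans_step` is a callable (w, mu) -> new w, e.g.
    ans_step = lambda w, mu: ans_step_logistic(w, X, y, mu) from the sketch above."""
    w, mu = w0, mu0
    while mu > lam:
        w = ans_step(w, mu)
        mu = max(shrink * mu, lam)
    for _ in range(t_final):
        w = ans_step(w, lam)
    return w
```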
Statistical goal

• Data: $(x_i, y_i)_{1 \le i \le n} \in X \times \mathbb{R}$, i.i.d. with distribution $\rho$
• Feature map: $\Phi : X \to \mathcal{H}$, $\mathcal{H}$ Hilbert
• Linear predictor: $f(x) \leftrightarrow \langle f, \Phi(x) \rangle_{\mathcal{H}}$, $f \in \mathcal{H}$
• Expected loss: $L(f) := \mathbb{E}_{(x,y) \sim \rho}[\ell(y, f(x))]$

Construct $\hat{f}$ such that $L(\hat{f}) - \inf_{f \in \mathcal{H}} L(f)$ is small with high probability.
Learning with GSC functions

Classical estimator: Empirical Risk Minimization (ERM),
$$\hat{f}_\lambda = \arg\min_{f \in \mathcal{H}} \ \hat{L}_\lambda(f) := \frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i)) + \frac{\lambda}{2} \|f\|^2_{\mathcal{H}};$$
a minimal evaluation sketch follows.
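For concreteness, a minimal sketch of evaluating $\hat{L}_\lambda$ for a linear predictor, with an explicit feature matrix standing in for $\Phi$ (an illustrative finite-dimensional surrogate, not the poster's setting):

```python
import numpy as np

def erm_objective(f, features, y, lam):
    """L_hat_lambda(f) for f(x) = <f, Phi(x)>, rows of `features` holding Phi(x_i)."""
    z = y * (features @ f)
    return np.log1p(np.exp(-z)).mean() + 0.5 * lam * f @ f
```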
Statistical performance of $\hat{f}_\lambda$

• Assumptions: $\ell$ GSC, $K(\cdot, \cdot) \le R$, $\exists f^\star \in \arg\min_{f \in \mathcal{H}_K} L(f)$
• Notations: $H^\star = \nabla^2 L(f^\star)$; $H^\star_\lambda = H^\star + \lambda I$

Key quantities (illustrated in the sketch below):
• Bias $b_\lambda := \lambda^2 \|f^\star\|^2_{H^{\star\,-1}_\lambda}$ — measures the regularity of $f^\star$
• Variance $d_\lambda := \mathrm{Tr}(H^{\star\,-1/2}_\lambda H^\star H^{\star\,-1/2}_\lambda)$ — effective dimension of $\mathcal{H}_K$

Performance of $\hat{f}_\lambda$: with probability $1 - \delta$,
$$L(\hat{f}_\lambda) - L(f^\star) \le C \left( b_\lambda + \frac{d_\lambda}{n} \right) \log\frac{1}{\delta}, \quad \text{if } b_\lambda, \ \frac{d_\lambda}{n} \le \frac{\lambda}{R^2}. \qquad (2)$$

Problem: finding $\hat{f}_\lambda$ is an $n$-dimensional problem.
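A small sketch of how $b_\lambda$ and $d_\lambda$ behave, computed from an assumed finite-dimensional spectral decomposition of $H^\star$; purely illustrative, since these quantities are not observed in practice:

```python
import numpy as np

def effective_dimension(eigvals, lam):
    """d_lam = Tr((H*+lam I)^{-1/2} H* (H*+lam I)^{-1/2}) = sum_i s_i / (s_i + lam)."""
    return np.sum(eigvals / (eigvals + lam))

def bias_term(eigvals, fstar_coeffs, lam):
    """b_lam = lam^2 ||f*||^2_{(H*+lam I)^{-1}}, f* expanded in the eigenbasis of H*."""
    return lam**2 * np.sum(fstar_coeffs**2 / (eigvals + lam))

# Example: polynomially decaying spectrum s_i = i^{-2}; d_lam grows as lam -> 0.
s = 1.0 / np.arange(1, 1001) ** 2
print(effective_dimension(s, 1e-3), effective_dimension(s, 1e-6))
```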
Dimension reduction with Nyström

Finding a smaller set of candidate functions $\mathcal{H}_M$ (see the sampling sketch after this panel):
• Subsample $M$ points $(\tilde{x}_j)$ from the $(x_i)_{1 \le i \le n}$, with $M \ll n$
• Find the best possible $f$ of the form $f = \sum_{j=1}^M \alpha_j \Phi(\tilde{x}_j)$, $\alpha \in \mathbb{R}^M$:
$$\hat{f}_{\lambda,M} := \arg\min_{f \in \mathcal{H}_M} \hat{L}_\lambda(f), \qquad \mathcal{H}_M = \left\{ \sum_{j=1}^M \alpha_j \Phi(\tilde{x}_j) \ : \ \alpha \in \mathbb{R}^M \right\}.$$

$\hat{f}_{\lambda,M}$ has the same performance (2) as $\hat{f}_\lambda$ if
(a) $M \ge (1/\lambda) \log(c/\lambda\delta)$ (uniform sampling), or
(b) $M \ge d_\lambda \log(c/\lambda\delta)$ (Nyström leverage scores).
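A minimal sketch of option (a), uniform Nyström subsampling, building the kernel matrices used in problem (3) below; the Gaussian kernel is an illustrative choice, and leverage-score sampling (option (b)) would replace the uniform draw:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2)), a bounded kernel."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma**2))

def nystrom_matrices(X, M, seed=0):
    """Uniformly subsample M Nystrom centers and form K_nM and K_MM."""
    idx = np.random.default_rng(seed).choice(len(X), size=M, replace=False)
    KnM = gaussian_kernel(X, X[idx])
    KMM = gaussian_kernel(X[idx], X[idx])
    return KnM, KMM, idx
```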
ANS for the Nyström problem

Optimization problem: $\hat{f}_{\lambda,M} = \sum_{j=1}^M \hat{\alpha}_j \Phi(\tilde{x}_j)$, with
$$\hat{\alpha} = \arg\min_{\alpha \in \mathbb{R}^M} \ g_\lambda(\alpha) := \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle \alpha, K_{nM}^\top e_i \rangle) + \frac{\lambda}{2} \alpha^\top K_{MM} \alpha, \qquad (3)$$
where $(K_{MM})_{ij} = \langle \Phi(\tilde{x}_i), \Phi(\tilde{x}_j) \rangle$ and $(K_{nM})_{ij} = \langle \Phi(x_i), \Phi(\tilde{x}_j) \rangle$.

Form of the Hessian: $H_\mu := K_{nM}^\top W_n K_{nM} + \mu K_{MM}$, with $W_n$ diagonal.

Sketching the Hessian using Nyström: if (a) or (b) holds, then for all $\mu \ge \lambda$, defining
$$\widetilde{H}_\mu = K_{MM} W_M K_{MM} + \mu K_{MM}, \quad \text{it holds that} \quad \frac{1}{2} \widetilde{H}_\lambda \preceq H_\lambda \preceq 2 \widetilde{H}_\lambda.$$

Complexity (a code sketch of one such step follows this panel):
• Computing $\widetilde{H}_\mu$: time $O(M^3)$, memory $O(M^2)$
• Computing $\nabla g_\mu$: time $O(nM + M^2)$, memory $O(n + M^2)$
• Computing an ANS: time $O(nM + M^3)$, memory $O(n + M^2)$
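A numpy sketch of one ANS for problem (3), combining the full-data gradient with the Nyström-sketched Hessian. The $1/M$ scaling of $W_M$ and the use of predictions at the centers are our reading of the poster, so treat those details as assumptions:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def nystrom_ans_step(alpha, KnM, KMM, y, y_sub, mu):
    """alpha <- alpha - H_tilde_mu^{-1} grad g_mu(alpha) for problem (3),
    assuming labels in {-1, 1}; y_sub = y[idx] from `nystrom_matrices` above."""
    n, M = KnM.shape
    z = y * (KnM @ alpha)
    grad = -(KnM.T @ (y * sigmoid(-z))) / n + mu * (KMM @ alpha)   # O(nM + M^2)
    z_sub = y_sub * (KMM @ alpha)                  # predictions at the M centers
    w_M = sigmoid(z_sub) * sigmoid(-z_sub) / M     # diagonal of W_M (assumed scaling)
    H_tilde = (KMM * w_M) @ KMM + mu * KMM         # O(M^3) to form and factor
    return alpha - np.linalg.solve(H_tilde, grad)
```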
Second Order Strikes Back

Logistic regression on Susy and Higgs ($n \approx 10^7$):

[Figure: classification error and distance to optimum versus passes over the data on both datasets, comparing the second-order method with K-SVRG.]

Algorithm (GS) to solve (3) with precision $c/n$: returns $\alpha_{\mathrm{alg}}$ such that $g_\lambda(\alpha_{\mathrm{alg}}) - g_\lambda(\hat{\alpha}) \le c/n$, with predictor $f_{\mathrm{alg}} = \sum_{j=1}^M \alpha_j^{\mathrm{alg}} \Phi(\tilde{x}_j)$.

• $f_{\mathrm{alg}}$ has the same guarantees (2) as $\hat{f}_\lambda$
• Time complexity: $O\!\left(T \left[ n d_\lambda + d_\lambda^3 \right]\right)$, with $T = R\|f^\star\| \log\frac{\mu_0}{\lambda} + \log\frac{n}{c}$
• Memory complexity: $O(n + d_\lambda^2)$

Code and results: https://github.com/umarteau/Newton-Method-for-GSC-losses-
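To make the step count concrete, with assumed, purely illustrative values $R\|f^\star\| = 10$, $\mu_0 = 1$, $\lambda = 10^{-12}$, $n = 10^7$, $c = 1$ (and natural logarithms):
$$T \approx 10 \cdot \log(10^{12}) + \log(10^7) \approx 10 \times 27.6 + 16.1 \approx 290 \text{ approximate Newton steps.}$$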
Main References

• Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. FALKON: An optimal large scale kernel method. In Advances in Neural Information Processing Systems 30, pages 3888–3898, 2017.
• Ulysse Marteau-Ferey, Dmitrii Ostrovskii, Francis Bach, and Alessandro Rudi. Beyond least-squares: Fast rates for regularized empirical risk minimization through self-concordance. In Proceedings of the Conference on Computational Learning Theory, 2019.