
Second Order Strikes Back : Globally Convergent Newton Methods for Ill-conditioned Generalized Self-concordant Losses

U. Marteau-Ferey, A. Rudi and F. Bach — INRIA - École Normale Supérieure - PSL Research University

Data: $n$ points $(x_i, y_i)_{1 \le i \le n} \in \mathcal{H} \times \mathbb{R}$, $\mathcal{H}$ a Hilbert space, $\sup_i \|x_i\| \le R$.

Goal:
$$\omega_\lambda^\star = \arg\min_{\omega \in \mathcal{H}} \; \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, \langle \omega, x_i\rangle) + \frac{\lambda}{2}\|\omega\|^2 \qquad (1)$$

(i) Logistic loss $\ell(y, y') = \log(1 + e^{-y y'})$.
(ii) Potentially very small regularizer $\lambda \ll 1$, e.g. $\lambda \approx 10^{-12}$.

Key property: Generalized Self-Concordance (GSC)
$$\forall y \in \mathcal{Y}, \quad |\ell^{(3)}(y, \cdot)| \le \ell^{(2)}(y, \cdot).$$

Notations:
$$g_\lambda(\omega) = g(\omega) + \frac{\lambda}{2}\|\omega\|^2, \qquad H_\lambda(\omega) = \nabla^2 g_\lambda(\omega).$$

• For any p.s.d. operator $A$ on $\mathcal{H}$, $\|\cdot\|_A = \|A^{1/2}\cdot\|_{\mathcal{H}}$.
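
As a sanity check that the logistic loss satisfies the GSC inequality above (a standard computation for $y \in \{-1, 1\}$, written with the sigmoid $\sigma(t) = 1/(1 + e^{-t})$):
\begin{align*}
  \ell^{(1)}(y, z) &= -\,y\,\sigma(-yz), \\
  \ell^{(2)}(y, z) &= y^2\,\sigma(yz)\,\sigma(-yz), \\
  \ell^{(3)}(y, z) &= y^3\,\sigma(yz)\,\sigma(-yz)\,\bigl(1 - 2\sigma(yz)\bigr).
\end{align*}
Since $|y| = 1$ and $|1 - 2\sigma(yz)| \le 1$, the third derivative is bounded by the second, which is exactly the GSC inequality.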

Ill-conditioned problems: a second-order method?

Solving ill-conditioned logistic regression
Start at $\omega_0$.
Approximate Newton Step (ANS): use a Hessian sketch $\widetilde{H}_\lambda(\omega_t)$ satisfying $\tfrac{1}{2}\widetilde{H}_\lambda \preceq H_\lambda \preceq 2\widetilde{H}_\lambda$.
Step: $\omega_{t+1} = \omega_t - s_t$, with $s_t = \widetilde{H}_\lambda^{-1}(\omega_t)\,\nabla g_\lambda(\omega_t)$.

Linear convergence near the optimum. If $\omega_0 \in D_\lambda$, where
$$D_\lambda := \bigl\{\omega : R\,\|\nabla g_\lambda(\omega)\|_{H_\lambda^{-1}(\omega)} \le \sqrt{\lambda}\bigr\},$$
then $g_\lambda(\omega_t) - g_\lambda(\omega_\lambda^\star) \le 2^{-t}$.

Problem: is it possible to find $\omega \in D_\lambda$?

Getting inside $D_\lambda$

Approximate Newton Methods (ANM): reach $D_\lambda$ in $O(\log(\mu_0/\lambda))$ approximate Newton steps,
$$\omega_K \in D_\lambda, \qquad K = O\bigl(\log_q(\mu_0/\lambda)\bigr), \qquad 1/q \approx 1 - 1/(R\,\|\omega_\lambda^\star\|).$$
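
To make the two-phase scheme concrete, here is a minimal self-contained numpy sketch for a finite-dimensional logistic problem. It is not the authors' implementation: the exact Hessian stands in for a (trivially valid) sketch, the geometric factor `q = 0.9` is a hypothetical default rather than the value $1 - 1/(R\|\omega_\lambda^\star\|)$ from the analysis, and the stopping test is a simple Newton-decrement check.

```python
import numpy as np

def logistic_grad_hess(w, X, y, lam):
    """Gradient and Hessian of g_lam(w) = (1/n) sum_i log(1+exp(-y_i <w, x_i>)) + lam/2 ||w||^2."""
    n, d = X.shape
    z = y * (X @ w)                      # margins y_i <w, x_i>
    s = 1.0 / (1.0 + np.exp(z))          # sigma(-y_i <w, x_i>)
    grad = -(X.T @ (y * s)) / n + lam * w
    curv = s * (1.0 - s)                 # second derivatives l''(y_i, <w, x_i>)
    hess = (X.T * curv) @ X / n + lam * np.eye(d)
    return grad, hess

def approximate_newton(X, y, lam, mu0=1.0, q=0.9, max_steps=50):
    """Phase 1: decrease the regularization mu geometrically from mu0 to lam,
    one Newton step per level, to enter D_lam.
    Phase 2: Newton steps at the target lambda (linear convergence regime)."""
    w = np.zeros(X.shape[1])
    mu = mu0
    while mu > lam:                      # phase 1: follow the regularization path
        g, H = logistic_grad_hess(w, X, y, mu)
        w = w - np.linalg.solve(H, g)
        mu = max(lam, q * mu)
    for _ in range(max_steps):           # phase 2: steps at lambda
        g, H = logistic_grad_hess(w, X, y, lam)
        step = np.linalg.solve(H, g)
        w = w - step
        if g @ step < 1e-14:             # squared Newton decrement is tiny: stop
            break
    return w
```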

Statistical goal

Data: $(x_i, y_i)_{1 \le i \le n} \in \mathcal{X} \times \mathbb{R}$, i.i.d. with distribution $\rho$.
Feature map: $\Phi : \mathcal{X} \to \mathcal{H}$, $\mathcal{H}$ a Hilbert space.
Linear predictor: $f(x) \leftrightarrow \langle f, \Phi(x)\rangle_{\mathcal{H}}$, $f \in \mathcal{H}$.
Expected loss: $L(f) := \mathbb{E}_{(x,y)\sim\rho}\,[\ell(y, f(x))]$.
Construct $\hat f$ such that $L(\hat f) - \inf_{f \in \mathcal{H}} L(f)$ is small with high probability.

Classical estimator: Empirical Risk Minimization
$$\hat f_\lambda = \arg\min_{f \in \mathcal{H}} \; \hat L_\lambda(f) := \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, f(x_i)) + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2.$$

Learning with GSC functions

Assumptions: $\ell$ is GSC, $K(\cdot, \cdot) \le R^2$, and $\exists f^\star \in \arg\min_{f \in \mathcal{H}_K} L(f)$.
Notations: $H^\star = \nabla^2 L(f^\star)$; $H^\star_\lambda = H^\star + \lambda I$.

Key quantities:
• Bias: $b_\lambda := \lambda^2\,\|f^\star\|^2_{(H^\star_\lambda)^{-1}}$, which measures the regularity of $f^\star$.
• Variance: $d_\lambda := \operatorname{Tr}\bigl((H^\star_\lambda)^{-1/2} H^\star (H^\star_\lambda)^{-1/2}\bigr)$, the effective dimension of $\mathcal{H}_K$.

Statistical performance of $\hat f_\lambda$: with probability $1 - \delta$,
$$L(\hat f_\lambda) - L(f^\star) \le C\,\Bigl(b_\lambda + \frac{d_\lambda}{n}\Bigr)\log\frac{1}{\delta}, \qquad \text{if } b_\lambda,\ d_\lambda \le \frac{n\lambda}{R^2}. \qquad (2)$$

Problem: finding $\hat f_\lambda$ is an $n$-dimensional problem.
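
For intuition, both quantities are easy to compute once a finite-dimensional surrogate of $H^\star$ is available. The numpy sketch below uses `H_star` and `f_star` as illustrative stand-ins for $\nabla^2 L(f^\star)$ and $f^\star$, together with the identity $\operatorname{Tr}\bigl((H^\star_\lambda)^{-1/2} H^\star (H^\star_\lambda)^{-1/2}\bigr) = \operatorname{Tr}\bigl(H^\star (H^\star_\lambda)^{-1}\bigr)$.

```python
import numpy as np

def bias_and_effective_dimension(H_star, f_star, lam):
    """b_lam = lam^2 ||f*||^2_{(H*+lam I)^{-1}} and
    d_lam = Tr(H* (H*+lam I)^{-1}), for a finite-dimensional surrogate of H*."""
    d = H_star.shape[0]
    H_lam = H_star + lam * np.eye(d)
    b_lam = lam**2 * (f_star @ np.linalg.solve(H_lam, f_star))
    d_lam = np.trace(np.linalg.solve(H_lam, H_star))
    return b_lam, d_lam

# Illustrative usage with a synthetic spectrum decaying like 1/k^2.
rng = np.random.default_rng(0)
H_star = np.diag(1.0 / np.arange(1, 101) ** 2)
f_star = rng.standard_normal(100)
print(bias_and_effective_dimension(H_star, f_star, lam=1e-3))
```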

Dimension reduction with Nyström

Finding a smaller set of candidate functions $\mathcal{H}_M$:
• Subsample $M$ points $(\tilde x_j)$ from the $(x_i)_{1 \le i \le n}$, with $M \ll n$.
• Find the best possible $f$ of the form $f = \sum_{j=1}^{M} \alpha_j \Phi(\tilde x_j)$, $\alpha \in \mathbb{R}^M$:
$$\hat f_{\lambda,M} := \arg\min_{f \in \mathcal{H}_M} \hat L_\lambda(f), \qquad \mathcal{H}_M = \Bigl\{\textstyle\sum_{j=1}^{M} \alpha_j \Phi(\tilde x_j) : \alpha \in \mathbb{R}^M\Bigr\}.$$

$\hat f_{\lambda,M}$ has the same performance (2) as $\hat f_\lambda$ if
(a) $M \ge (1/\lambda)\log(c/\lambda\delta)$ (uniform sampling), or
(b) $M \ge d_\lambda \log(c/\lambda\delta)$ (Nyström leverage scores).
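
A minimal numpy sketch of option (a), uniform Nyström subsampling, is given below; the Gaussian kernel and its bandwidth are illustrative choices, not prescribed by the poster (option (b) would replace the uniform draw by leverage-score sampling).

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def nystrom_matrices(X, M, seed=0, sigma=1.0):
    """Uniformly subsample M Nystrom centers and build K_MM and K_nM."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=M, replace=False)
    X_tilde = X[idx]
    K_MM = gaussian_kernel(X_tilde, X_tilde, sigma)   # M x M
    K_nM = gaussian_kernel(X, X_tilde, sigma)         # n x M
    return K_MM, K_nM, idx
```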

ANS for the Nyström problem

Optimization problem: $\hat f_{\lambda,M} = \sum_{j=1}^{M} \hat\alpha_j \Phi(\tilde x_j)$, with
$$\hat\alpha = \arg\min_{\alpha \in \mathbb{R}^M} \; g_\lambda(\alpha) := \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, \langle \alpha, K_{nM}^\top e_i\rangle) + \frac{\lambda}{2}\alpha^\top K_{MM}\alpha, \qquad (3)$$
where $(K_{MM})_{ij} = \langle \Phi(\tilde x_i), \Phi(\tilde x_j)\rangle$ and $(K_{nM})_{ij} = \langle \Phi(x_i), \Phi(\tilde x_j)\rangle$.

Form of the Hessian: $H_\mu := K_{nM}^\top W_n K_{nM} + \mu K_{MM}$, with $W_n$ diagonal.

Sketching the Hessian using Nyström: if (a) or (b) holds, then for all $\mu \ge \lambda$, defining
$$\widetilde H_\mu = K_{MM} W_M K_{MM} + \mu K_{MM},$$
it holds that $\tfrac{1}{2}\widetilde H_\lambda \preceq H_\lambda \preceq 2\widetilde H_\lambda$.

Complexity:
  Computing $\widetilde H_\mu$:   time $O(M^3)$,        memory $O(M^2)$
  Computing $\nabla g_\mu$:       time $O(nM + M^2)$,   memory $O(n + M^2)$
  Computing an ANS:               time $O(nM + M^3)$,   memory $O(n + M^2)$
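
Putting the pieces together, the numpy sketch below computes $\nabla g_\lambda(\alpha)$ in $O(nM)$ and one ANS using only the $M \times M$ sketched Hessian $\widetilde H_\lambda$. The weights `W_M` used here (pointwise logistic curvature at the M subsampled points, rescaled by 1/M) are an illustrative assumption rather than the paper's exact definition.

```python
import numpy as np

def sketched_newton_step(alpha, K_nM, K_MM, idx, y, lam):
    """One approximate Newton step on objective (3), for y in {-1, +1}.
    Exact Hessian:    H_lam  = K_nM^T W_n K_nM + lam K_MM   (costs O(n M^2) to form)
    Sketched Hessian: Ht_lam = K_MM W_M K_MM + lam K_MM     (only M x M work)"""
    n, M = K_nM.shape
    z = y * (K_nM @ alpha)                        # margins y_i f(x_i)
    s = 1.0 / (1.0 + np.exp(z))                   # sigma(-y_i f(x_i))
    grad = -(K_nM.T @ (y * s)) / n + lam * (K_MM @ alpha)   # O(nM)
    curv = s * (1.0 - s)                          # l''(y_i, f(x_i)) at all n points
    W_M = curv[idx] / M                           # curvature at the M Nystrom points
    H_tilde = (K_MM * W_M) @ K_MM + lam * K_MM    # M x M sketched Hessian, O(M^3)
    return alpha - np.linalg.solve(H_tilde, grad)
```

The point of the design is visible in the costs: the full Hessian requires touching all $n$ points quadratically in $M$, while the sketched step only needs one pass for the gradient plus $O(M^3)$ linear algebra, matching the complexity table above.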

Logistic regression on Susy and Higgs ($n \approx 10^7$)

[Figure: classification error and distance to optimum as a function of passes over the data, comparing the second-order method with K-SVRG on both datasets.]

Algorithm (GS) to solve (3) with precision $c/n$: returns $\alpha_{\mathrm{alg}}$ such that $g_\lambda(\alpha_{\mathrm{alg}}) - g_\lambda(\hat\alpha) \le c/n$, and the predictor $f_{\mathrm{alg}} = \sum_{j=1}^{M} \alpha_j^{\mathrm{alg}} \Phi(\tilde x_j)$.

Code and results: https://github.com/umarteau/Newton-Method-for-GSC-losses-

$f_{\mathrm{alg}}$ has the same guarantees (2) as $\hat f_\lambda$.
Time complexity: $O\bigl(T\,[n d_\lambda + d_\lambda^3]\bigr)$, with $T = R\|f^\star\|\log(\mu_0/\lambda) + \log(cn)$.
Memory complexity: $O(n + d_\lambda^2)$.


Main References

Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. FALKON: An optimal large scale kernel method. In Advances in Neural Information Processing Systems 30, pages 3888–3898, 2017.

Ulysse Marteau-Ferey, Dmitrii Ostrovskii, Francis Bach, and Alessandro Rudi. Beyond least-squares: Fast rates for regularized empirical risk minimization through self-concordance. In Proceedings of the Conference on Learning Theory, 2019.
