Sharpness, Restart, Acceleration
Vincent Roulet, Alexandre d’Aspremont
OSL, Les Houches, April 13, 2017
Motivation
- Goal:
  minimize $f(x)$, with $f : \mathbb{R}^n \to \mathbb{R}$ convex
- Some algorithms use past information to build the next iterate:
  - Accelerated Gradient Method
  - Universal Fast Gradient Method
  - Quasi-Newton methods
  - ...
- Idea: restart these algorithms when past information is "no longer relevant"
- This does not apply to memoryless methods, e.g. gradient descent with line search
How to characterize past information?
- Take an algorithm $\mathcal{A}$ that outputs points $x = \mathcal{A}(x_0, \theta, t)$, where
  - $x_0$ is the initial point,
  - $\theta$ are the parameters of the algorithm,
  - $t$ is the number of iterations.
- Look at the convergence rate
  $$f(x) - f^* \le \frac{c\, d(x_0, X^*)^q}{t^p}$$
  where
  - $d(x_0, X^*)$ is the Euclidean distance from $x_0$ to the set of minimizers $X^*$,
  - $c, p, q$ are constants depending on the problem.
- The bound increases with $d(x_0, X^*)$; intuition:
  $x_0$ close to $X^*$ → good initialization, so fast convergence.
- Can we exploit information on $d(x_0, X^*)$?
Plan
- Sharpness
- Scheduled restarts
  - General strategy
  - Scheduled restarts for smooth convex problems
  - Scheduled restarts for non-smooth or Hölder smooth convex problems
- Restarts with termination criterion
- Composite problems & Bregman divergences
- Numerical experiments
- Conclusion
Sharpness
Definition
A function $f$ satisfies the sharpness property on a set $K$ if there exist $r \ge 1$, $\mu > 0$, s.t.
$$\mu\, d(x, X^*)^r \le f(x) - f^*, \quad \text{for every } x \in K \qquad \text{(Sharp)}$$
Examples
- Strongly convex functions ($r = 2$)
- Gradient dominated functions ($r = 2$)
- Matrix game problems, e.g. $\min_x \max_y x^T A y$ ($r = 1$)
- Real analytic functions ($r$ unknown)
- Subanalytic functions ($r$ unknown)
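To make the definition concrete, here is a quick check on two of these cases (my own illustration, not on the slides):

```latex
% f(x) = |x|^r on \mathbb{R}, r >= 1: X^* = {0}, f^* = 0, and
\[
  f(x) - f^* \;=\; |x|^r \;=\; d(x, X^*)^r ,
\]
% so (Sharp) holds globally with \mu = 1. Likewise, \mu-strong convexity gives
\[
  f(x) - f^* \;\ge\; \tfrac{\mu}{2}\, d(x, X^*)^2 ,
\]
% i.e. (Sharp) with r = 2 and constant \mu/2.
```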
Sharpness for real analytic functions
For $f$ real analytic, $x \in \mathbb{R}$ and $x^* \in X^*$,
$$f(x) - f^* = \sum_{k=q}^{\infty} \frac{f^{(k)}(x^*)}{k!}\, (x - x^*)^k$$
where $q \ge 0$ is the smallest order for which $f^{(q)}(x^*) \ne 0$.
There is an interval $V$ around $x^*$ s.t.
$$\frac{1}{2}\, \frac{|f^{(q)}(x^*)|}{q!}\, |x - x^*|^q \le f(x) - f^*.$$
Setting $x^* = \Pi_{X^*}(x)$, this yields (Sharp) on $V$ with $r = q$ and $\mu = \frac{f^{(q)}(x^*)}{2\, q!}$.
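For instance (my example, not on the slides), $f(x) = x^4$ has $q = 4$ at $x^* = 0$, and

```latex
\[
  \mu \;=\; \frac{f^{(4)}(0)}{2 \cdot 4!} \;=\; \frac{24}{48} \;=\; \frac{1}{2},
  \qquad\text{and indeed}\qquad
  \tfrac{1}{2}|x|^4 \;\le\; x^4 \quad\text{for all } x,
\]
```

so (Sharp) holds with $r = 4$, here even globally.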
Sharpness for subanalytic functions
Łojasiewicz inequality
- The sharpness property is known to hold for real analytic functions through the Łojasiewicz inequality [Łojasiewicz 1963].
- It was recently generalized to a broad class of non-smooth convex functions called subanalytic [Bolte et al. 2007].
- Subanalytic functions are functions whose epigraph can be expressed as a subanalytic set.
- The proofs rely on topological arguments, so $(r, \mu)$ are mostly unknown.
Smoothness
Definition
A function $f$ satisfies the smoothness property on a set $J$ if there exist $s \in [1, 2]$, $L > 0$ s.t.
$$\|\nabla f(x) - \nabla f(y)\|_2 \le L \|x - y\|_2^{s-1}, \quad \text{for every } x, y \in J \qquad \text{(Smooth)}$$
Examples
- Non-smooth functions ($s = 1$)
- Smooth functions ($s = 2$)
- Hölder smooth functions ($s \in (1, 2)$)
Sharpness and smoothness
If $f$ satisfies (Smooth), for every $x \in \mathbb{R}^n$ and $y = \Pi_{X^*}(x)$,
$$f(x) \le f(y) + \nabla f(y)^T (x - y) + \frac{L}{s} \|x - y\|_2^s = f^* + \frac{L}{s}\, d(x, X^*)^s.$$
Combined with (Sharp), $\mu\, d(x, X^*)^r \le f(x) - f^*$, this yields
$$0 < \frac{s\mu}{L} \le d(x, X^*)^{s-r}.$$
Taking $x \to X^*$, necessarily
$$s \le r.$$
Moreover, if $s < r$, the last inequality can only be valid on a bounded set: either smoothness or sharpness (or both) fails to hold on the whole space.
Condition numbers
We denote
$$\tau = 1 - \frac{s}{r}$$
a condition number on the ratio of powers, s.t. $0 \le \tau < 1$, and
$$\kappa = L^{2/s} / \mu^{2/r}$$
a generalized condition number.
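As a consistency check (mine, not on the slides), the smooth and strongly convex case recovers the classical condition number:

```latex
% s = 2 (smooth) and r = 2 (strongly convex):
\[
  \tau = 1 - \tfrac{2}{2} = 0,
  \qquad
  \kappa = \frac{L^{2/2}}{\mu^{2/2}} = \frac{L}{\mu}.
\]
```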
General strategy
- Take an algorithm $\mathcal{A}$ that outputs points $x = \mathcal{A}(x_0, \theta, t)$, where
  - $x_0$ is the initial point,
  - $\theta$ are the parameters of the algorithm,
  - $t$ is the number of iterations.
- Look at the convergence rate: if $f$ satisfies (Sharp),
  $$f(x) - f^* \le \frac{c\, d(x_0, X^*)^q}{t^p} \le \frac{c' (f(x_0) - f^*)^{q/r}}{t^p},$$
  using $d(x_0, X^*) \le ((f(x_0) - f^*)/\mu)^{1/r}$, with $c' = c\, \mu^{-q/r}$.
- Given $\gamma \ge 0$, compute analytically $t$ s.t.
  $$f(x) - f^* \le e^{-\gamma} (f(x_0) - f^*).$$
- Iterate and compute the total complexity.
General formulation
Given an algorithm $\mathcal{A}$ that outputs points $x = \mathcal{A}(x_0, \theta, t)$.
Scheduled restart schemes:
  Inputs: $x_0$, sequence $\theta_k$, sequence $t_k$
  for $k = 1 \ldots R$ do
    $x_k = \mathcal{A}(x_{k-1}, \theta_k, t_k)$
  end for
  Output: $\hat{x} = x_R$
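A minimal Python sketch of this scheme; `A`, `thetas` and `ts` are placeholders for whichever base algorithm and schedules one plugs in:

```python
def scheduled_restart(A, x0, thetas, ts):
    """Generic scheduled restart scheme (sketch).

    A(x, theta, t): any algorithm returning an iterate after t inner steps.
    thetas, ts: per-round parameter and iteration-count schedules.
    """
    x = x0
    for theta_k, t_k in zip(thetas, ts):
        # Each round restarts A from the previous output, discarding
        # A's internal state (momentum, memory, ...).
        x = A(x, theta_k, t_k)
    return x
```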
General analysis
Lemma
Given $\gamma \ge 0$, suppose setting
$$t_k = C e^{\alpha k}, \quad \text{with } C > 0,\ \alpha \ge 0,$$
ensures
$$f(x_k) - f^* \le M e^{-\gamma k}, \quad \text{with } M > 0.$$
Writing $N = \sum_{k=1}^{R} t_k$ the total number of iterations, we get
$$f(\hat{x}) - f^* \le M \exp(-\gamma C^{-1} N), \quad \text{when } \alpha = 0,$$
$$f(\hat{x}) - f^* \le \frac{M}{(\alpha e^{-\alpha} C^{-1} N + 1)^{\gamma/\alpha}}, \quad \text{when } \alpha > 0.$$
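A short sketch of where these two bounds come from (my reconstruction from the statement, not spelled out on the slides):

```latex
% \alpha = 0: t_k = C, so N = RC and
\[
  f(\hat{x}) - f^* \;\le\; M e^{-\gamma R} \;=\; M \exp(-\gamma C^{-1} N).
\]
% \alpha > 0: summing the geometric schedule and using e^\alpha - 1 \ge \alpha,
\[
  N \;=\; C \sum_{k=1}^{R} e^{\alpha k}
    \;=\; C e^{\alpha}\, \frac{e^{\alpha R} - 1}{e^{\alpha} - 1}
    \;\le\; \frac{C e^{\alpha}}{\alpha}\,\left(e^{\alpha R} - 1\right),
  \quad\text{so}\quad
  e^{\alpha R} \;\ge\; \alpha e^{-\alpha} C^{-1} N + 1,
\]
\[
  f(\hat{x}) - f^* \;\le\; M e^{-\gamma R}
    \;=\; M \left(e^{\alpha R}\right)^{-\gamma/\alpha}
    \;\le\; M \left(\alpha e^{-\alpha} C^{-1} N + 1\right)^{-\gamma/\alpha}.
\]
```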
Smooth convex problems
- If $f$ is convex and smooth ($s = 2$, $L$), an optimal algorithm is the accelerated gradient method Acc.
- Given $x_0$, it outputs after $t$ iterations a point $x = \mathrm{Acc}(x_0, t)$ s.t.
  $$f(x) - f^* \le \frac{cL}{t^2}\, d(x_0, X^*)^2,$$
  where $c$ is a universal constant.
- Assume that $f$ satisfies (Sharp) with $(r, \mu)$ on a set $K$:
  $$\mu\, d(x, X^*)^r \le f(x) - f^*, \quad \text{for every } x \in K.$$
- Assume we are given $x_0 \in \mathbb{R}^n$ s.t. $\{x : f(x) \le f(x_0)\} \subset K$.
Optimal scheme
Proposition (1st part)
Assume $f$ convex, smooth ($s = 2$, $L$) and sharp ($r$, $\mu$) on a set $K$. Run scheduled restarts of Acc with
$$t_k = C_{\tau,\kappa}\, e^{\tau k}, \qquad C_{\tau,\kappa} = e^{1-\tau} (c\kappa)^{1/2} (f(x_0) - f^*)^{-\tau/2}.$$
Then for every outer iteration $k \ge 0$,
$$f(x_k) - f^* \le e^{-2k} (f(x_0) - f^*).$$
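A small numerical sketch of this schedule (assuming $\kappa$, $\tau$, the initial gap and the constant $c$ are known, which the adaptive scheme below avoids):

```python
import math

def smooth_restart_schedule(kappa, tau, gap0, c=1.0, R=8):
    """Iterate schedule t_k = C_{tau,kappa} * exp(tau * k) (sketch).

    kappa: generalized condition number; tau = 1 - 2/r; gap0: f(x0) - f*;
    c: the universal constant of Acc (its value is not given on the slides).
    Returns rounded iteration counts t_1, ..., t_R.
    """
    C = math.exp(1 - tau) * math.sqrt(c * kappa) * gap0 ** (-tau / 2)
    return [max(1, math.ceil(C * math.exp(tau * k))) for k in range(1, R + 1)]
```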
Optimal scheme
Proposition (2nd part)
Denote $N$ the total number of iterations at the output $\hat{x}$. Then, when $\tau = 0$,
$$f(\hat{x}) - f^* \le \exp\left(-2 e^{-1} (c\kappa)^{-1/2} N\right) (f(x_0) - f^*) = O\left(\exp(-\kappa^{-1/2} N)\right),$$
while, when $\tau > 0$,
$$f(\hat{x}) - f^* \le \frac{f(x_0) - f^*}{\left(\tau e^{-1} (f(x_0) - f^*)^{\tau/2} (c\kappa)^{-1/2} N + 1\right)^{2/\tau}} = O\left(\kappa^{1/\tau} N^{-2/\tau}\right).$$
Note: optimal for this class of problems [Optimal methods of smooth convex optimization, A. Nemirovski, Y. Nesterov, 1985].
Adaptive scheme
- In practice, $(r, \mu)$ are unknown.
- Given a fixed total number of iterations $N$, run the following schemes (see the sketch after this list):
  $$S_{i,j}: \text{scheduled restart with } t_k = C_i\, e^{\tau_j k}, \text{ where } C_i = 2^i \text{ and } \tau_j = 2^{-j},$$
  with $i \in \{1, \ldots, \lfloor \log_2 N \rfloor\}$, $j \in \{0, \ldots, \lceil \log_2 N \rceil\}$.
- Achieves the optimal bounds up to a constant factor 4.
- Costs a factor $\log_2(N)^2$ compared to running $N$ iterations of the optimal scheme.
- This yields an adaptive algorithm.
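A Python sketch of this log-scale grid search; how the best candidate is selected is not specified on the slides, so here it is simply the lowest final objective value (assuming $f$ can be evaluated):

```python
import math

def adaptive_restart(A, f, x0, N):
    """Grid search over restart schedules S_{i,j} (sketch).

    Each candidate runs scheduled restarts with t_k = C_i * exp(tau_j * k),
    C_i = 2^i, tau_j = 2^{-j}, within the same budget of N inner iterations.
    A(x, t) is a placeholder for one run of the base method for t steps.
    """
    best_x, best_val = x0, f(x0)
    for i in range(1, math.floor(math.log2(N)) + 1):
        for j in range(0, math.ceil(math.log2(N)) + 1):
            C, tau = 2.0 ** i, 2.0 ** (-j)
            x, used, k = x0, 0, 1
            while used < N:
                t_k = min(max(1, math.ceil(C * math.exp(tau * k))), N - used)
                x = A(x, t_k)  # restart the base method from x for t_k steps
                used += t_k
                k += 1
            if f(x) < best_val:
                best_x, best_val = x, f(x)
    return best_x
```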
Non-smooth or Hölder smooth convex problems
- If $f$ is convex and satisfies (Smooth) with $(s, L)$ on a set $J$, i.e.
  $$\|\nabla f(x) - \nabla f(y)\|_2 \le L \|x - y\|_2^{s-1}, \quad \text{for every } x, y \in J, \qquad \text{(Smooth)}$$
  an optimal algorithm is the universal fast gradient method $\mathcal{U}$ [Nesterov, 2015].
- Given $\epsilon$ and $x_0$, it outputs after $t$ iterations a point $x = \mathcal{U}(x_0, \epsilon, t)$ s.t.
  $$f(x) - f^* \le \frac{\epsilon}{2} + \frac{c\, L^{2/s}\, d(x_0, X^*)^2}{\epsilon^{2/s}\, t^{2\rho/s}}\, \frac{\epsilon}{2},$$
  where
  $$\rho = \frac{3s - 2}{2}$$
  is the optimal rate for this class of functions.
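As a sanity check on $\rho$ (my computation): the two extreme values of $s$ recover the familiar rates:

```latex
\[
  s = 2 \;\Rightarrow\; \rho = \frac{3 \cdot 2 - 2}{2} = 2
  \quad (\text{accelerated } O(1/t^2) \text{ rate}),
  \qquad
  s = 1 \;\Rightarrow\; \rho = \frac{1}{2}
  \quad (O(1/\sqrt{t}) \text{ non-smooth rate}).
\]
```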
Hölder smooth convex problems: strategy
- Assume that we have access to $\epsilon_0 \ge f(x_0) - f^*$ for a given $x_0 \in \mathbb{R}^n$.
- Given $\gamma \ge 0$, run scheduled restarts with the sequence of target accuracies
  $$\epsilon_k = e^{-\gamma k} \epsilon_0.$$
- Choose $t_k$ to ensure
  $$f(x_k) - f^* \le \epsilon_k.$$
Optimal scheme
Proposition (1st part)
Assume $f$ convex, Hölder smooth ($s$, $L$) and sharp ($r$, $\mu$) on a set $K$. Run scheduled restarts of $\mathcal{U}$ with
$$\epsilon_k = e^{-\rho k} \epsilon_0, \qquad t_k = C_{\tau,\kappa,\rho}\, e^{\tau k}, \qquad C_{\tau,\kappa,\rho} = e^{1-\tau} (c\kappa)^{s/(2\rho)} \epsilon_0^{-\tau/\rho}.$$
Then for every outer iteration $k \ge 0$,
$$f(x_k) - f^* \le e^{-\rho k} \epsilon_0.$$
Optimal scheme
Proposition (2nd part)
Denote $N$ the total number of iterations at the output $\hat{x}$. Then, when $\tau = 0$,
$$f(\hat{x}) - f^* \le \exp\left(-\rho e^{-1} (c\kappa)^{-s/(2\rho)} N\right) \epsilon_0 = O\left(\exp(-\kappa^{-s/(2\rho)} N)\right),$$
while, when $\tau > 0$,
$$f(\hat{x}) - f^* \le \frac{\epsilon_0}{\left(\tau e^{-1} (c\kappa)^{-s/(2\rho)} \epsilon_0^{\tau/\rho} N + 1\right)^{\rho/\tau}} = O\left(\kappa^{s/(2\tau)} N^{-\rho/\tau}\right).$$
Note: optimal for this class of problems [Optimal methods of smooth convex optimization, A. Nemirovski, Y. Nesterov, 1985].
General convex problems
- 3 parameters for the schedule: $\gamma$, $C$, $\alpha$.
- Grid search is inefficient if $r$ or $s$ is unknown.
- Otherwise, grid search on $C$ works.
- Can be used for
  → non-smooth ($s = 1$), gradient dominated functions ($r = 2$),
  → non-smooth ($s = 1$), sharp functions ($r = 1$).
Strategy
- Assume $f^*$ is known (e.g. zero-sum matrix game problems, projection onto a convex set, ...).
- Given an accuracy $\epsilon$, denote $t_\epsilon$ the number of iterations needed to observe that $x = \mathcal{U}(x_0, \epsilon, t_\epsilon)$ satisfies
  $$f(x) - f^* \le \epsilon.$$
  → Stop when the target accuracy is reached.
  → Restart with a reduced target accuracy.
Formulation
Given the universal fast gradient method $\mathcal{U}$ that outputs $x = \mathcal{U}(x_0, \epsilon, t)$.
Restarts with termination criterion:
  Inputs: $x_0$, $\gamma$, $f^*$
  $\epsilon_0 = f(x_0) - f^*$
  for $k = 1 \ldots R$ do
    $\epsilon_k = e^{-\gamma} \epsilon_{k-1}$
    $x_k = \mathcal{U}(x_{k-1}, \epsilon_k, t_{\epsilon_k})$
  end for
  Output: $\hat{x} = x_R$
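A minimal Python sketch of this scheme; `U` is a placeholder whose inner loop is assumed to stop as soon as the observed gap falls below the target (the termination criterion):

```python
import math

def restart_with_termination(U, f, x0, f_star, gamma=1.0, R=10):
    """Restarts with termination criterion (sketch), assuming f* is known.

    U(x, eps): universal fast gradient method run until f(x) - f_star <= eps.
    gamma: rate at which the target accuracy shrinks at each restart.
    """
    eps = f(x0) - f_star
    x = x0
    for _ in range(R):
        eps *= math.exp(-gamma)  # reduced target accuracy
        x = U(x, eps)            # inner loop stops once the target is reached
    return x
```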
Restarts with termination criterion
Assume $f$ convex, Hölder smooth ($s$, $L$) and sharp ($r$, $\mu$) on a set $K$. Run restarts with termination criterion with $\gamma = \rho$.
Denote $N$ the total number of iterations at the output $\hat{x}$. Then, when $\tau = 0$,
$$f(\hat{x}) - f^* \le \exp\left(-\rho e^{-1} (c\kappa)^{-s/(2\rho)} N\right) \epsilon_0 = O\left(\exp(-\kappa^{-s/(2\rho)} N)\right),$$
while, when $\tau > 0$,
$$f(\hat{x}) - f^* \le \frac{\epsilon_0}{\left(\tau e^{-1} (c\kappa)^{-s/(2\rho)} \epsilon_0^{\tau/\rho} N + 1\right)^{\rho/\tau}} = O\left(\kappa^{s/(2\tau)} N^{-\rho/\tau}\right).$$
Note: restarts are robust to the choice of $\gamma$; taking $\gamma = 1$ is optimal up to a small constant factor.
General setting
- Extension to
  $$\text{minimize } f(x) = \phi(x) + g(x)$$
  where
  - $\phi$ satisfies (Smooth) w.r.t. a generic norm $\|\cdot\|$:
    $$\|\nabla \phi(x) - \nabla \phi(y)\| \le L \|x - y\|^{s-1}, \quad \text{for every } x, y \in J, \qquad \text{(Smooth)}$$
  - we have access to a prox function $h$, 1-strongly convex w.r.t. $\|\cdot\|$, defining a Bregman divergence
    $$D_h(z; x) = h(z) - h(x) - \nabla h(x)^T (z - x),$$
  - $g$ is simple, in the sense that we can easily solve
    $$\min_z\ y^T z + g(z) + \lambda D_h(z; x).$$
- Covers a whole class of problems, such as sparse or constrained ones (see the sketch below).
- Requires an appropriate notion of sharpness w.r.t. $\|\cdot\|$.
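For instance, a minimal sketch of the "simple" subproblem in the Euclidean case: with $h(x) = \frac{1}{2}\|x\|_2^2$ (so $D_h(z; x) = \frac{1}{2}\|z - x\|_2^2$) and $g = \lambda_1 \|\cdot\|_1$ as in the Lasso, the solution is soft-thresholding. The names and parametrization are illustrative:

```python
import numpy as np

def simple_step(x, grad, lam1, lam):
    """Solve min_z grad^T z + lam1 * ||z||_1 + lam * D_h(z; x)  (sketch),
    with D_h(z; x) = 0.5 * ||z - x||^2.

    Setting the subgradient to zero gives soft-thresholding of the
    gradient step x - grad / lam at level lam1 / lam.
    """
    z = x - grad / lam
    return np.sign(z) * np.maximum(np.abs(z) - lam1 / lam, 0.0)
```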
Relative sharpness
Definition
A convex function $f$ is called relatively sharp with respect to a strictly convex function $h$ on a set $K \subset \mathrm{dom}(f)$ if there exist $r \ge 1$, $\mu > 0$ such that
$$2\mu\, D_h(x; X^*)^{r/2} \le f(x) - f^* \quad \text{for any } x \in K, \qquad \text{(Relative Sharpness)}$$
where $D_h(x; X^*) = \min_{x^* \in X^*} D_h(x; x^*)$ and $D_h$ is the Bregman divergence associated to $h$.
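A quick consistency check (mine, not on the slides): with the Euclidean prox function, (Relative Sharpness) reduces to (Sharp) up to a constant:

```latex
% With h(x) = \tfrac{1}{2}\|x\|_2^2, D_h(x; X^*) = \tfrac{1}{2} d(x, X^*)^2, so
\[
  2\mu\, D_h(x; X^*)^{r/2}
  \;=\; 2\mu \left(\tfrac{1}{2}\, d(x, X^*)^2\right)^{r/2}
  \;=\; 2^{1 - r/2}\, \mu\, d(x, X^*)^r .
\]
```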
Numerical Experiments
- Classification problems on the UCI Sonar data set with various losses.
- Check convergence of the best method found by grid search, Adap.
- Compare against
  - gradient descent Grad,
  - accelerated gradient descent Acc,
  - restarts enforcing monotonicity Mono, i.e., restart whenever $f(x_{k+1}) > f(x_k)$ in the inner iterations.
Least Squares and Logistic
Figure: $f(x) - f^*$ versus number of iterations for Grad, Acc, Mono and Adap; least squares loss (left) and logistic loss (right). Large dots represent restart iterations.
Lasso and Dual SVM
Figure: $f(x) - f^*$ versus number of iterations for Grad, Acc, Mono and Adap; Lasso (left) and dual SVM (right) problems. Large dots represent restart iterations.
Contributions
- Open the black-box model by adding a generic assumption on the behavior of the function around its minimizers.
- Convergence analysis of restart schemes.
- Optimal schemes for smooth, Hölder smooth and non-smooth convex optimization.
- Adaptive scheme for smooth convex optimization.
Future work
Sharpness analysis
- Sharpness reads
  $$\mu\, d(x, X^*)^r \le f(x) - f^*, \quad \text{for every } x \in K.$$
- $\mu$ generally depends on $K$; see the thorough analysis in
  From error bounds to the complexity of first-order descent methods for convex functions, J. Bolte et al., 2017.
- Local adaptivity of restart schemes?
- If $f^*$ is known, restart with termination criterion is adaptive.
  → Approximate $f^*$?
Practical algorithm
- Grid search shows robustness but is not very practical.
- Restarting from a combination of points, see
  Restarting accelerated gradient methods with a rough strong convexity estimate, O. Fercoq, Z. Qu, 2016.