Machine Learning 8:

Convex Optimization for Machine Learning

Master 2 Computer Science

Aurélien Garivier 2018-2019

Convex functions in $\mathbb{R}^d$

Convex functions and subgradients

Convex function

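Let $\mathcal{X} \subset \mathbb{R}^d$ be a convex set. The function $f : \mathcal{X} \to \mathbb{R}$ is convex if

$$\forall x, y \in \mathcal{X},\ \forall \lambda \in [0,1],\quad f\big((1-\lambda)x + \lambda y\big) \le (1-\lambda) f(x) + \lambda f(y).$$

Subgradients

A vector $g \in \mathbb{R}^d$ is a subgradient of $f$ at $x \in \mathcal{X}$ if for any $y \in \mathcal{X}$, $f(y) \ge f(x) + \langle g, y - x\rangle$.

The set of subgradients of $f$ at $x$ is denoted $\partial f(x)$.

Proposition

• If $\partial f(x) \neq \emptyset$ for all $x \in \mathcal{X}$, then $f$ is convex.

• If $f$ is convex, then $\forall x \in \mathring{\mathcal{X}}$, $\partial f(x) \neq \emptyset$.

• If $f$ is convex and differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$.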

Convex functions and optimization

Proposition

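Let $f$ be convex. Then

• $x$ is a local minimum of $f$ iff $0 \in \partial f(x)$,

• and in that case, $x$ is a global minimum of $f$;

• if $\mathcal{X}$ is closed and $f$ is differentiable on $\mathcal{X}$, then $x \in \arg\min_{x' \in \mathcal{X}} f(x')$ iff $\forall y \in \mathcal{X}$, $\langle \nabla f(x), y - x\rangle \ge 0$.

Black-box optimization model

The set $\mathcal{X}$ is known, $f : \mathcal{X} \to \mathbb{R}$ is unknown but accessible through:

• a zeroth-order oracle: given $x \in \mathcal{X}$, yields $f(x)$,

• and possibly a first-order oracle: given $x \in \mathcal{X}$, yields $g \in \partial f(x)$.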
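
The two oracles of the black-box model can be sketched as follows. This is only an illustration: the objective $f(x) = \|x\|_1$ and the function names are assumptions chosen for the example, not part of the slides.

```python
from typing import List

Vector = List[float]


def zeroth_order_oracle(x: Vector) -> float:
    """Return f(x) for the illustrative objective f(x) = ||x||_1 (an assumption)."""
    return sum(abs(xi) for xi in x)


def first_order_oracle(x: Vector) -> Vector:
    """Return one subgradient of f at x.

    sign(x_i) is the derivative where x_i != 0; at x_i = 0 any value in
    [-1, 1] is a valid choice, and 0 is picked here.
    """
    return [float((xi > 0) - (xi < 0)) for xi in x]


def satisfies_subgradient_inequality(x: Vector, y: Vector) -> bool:
    """Check f(y) >= f(x) + <g, y - x> for the subgradient g returned at x."""
    g = first_order_oracle(x)
    inner = sum(gi * (yi - xi) for gi, xi, yi in zip(g, x, y))
    return zeroth_order_oracle(y) >= zeroth_order_oracle(x) + inner - 1e-12
```

An optimization algorithm in this model only ever calls these two functions; it never inspects the formula of $f$.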

Gradient Descent

Gradient Descent algorithms

A memoryless algorithm for first-order black-box optimization:

Algorithm: Gradient Descent

Input: convex function $f$, step sizes $\gamma_t$, initial point $x_0$

for $t = 0, \dots, T-1$ do

    Compute $g_t \in \partial f(x_t)$

    $x_{t+1} \leftarrow x_t - \gamma_t g_t$

return $x_T$ or $\frac{x_0 + \dots + x_{T-1}}{T}$

Questions:

• $x_T \to x^* \overset{\text{def}}{=} \arg\min f$ as $T \to \infty$?

• $f(x_T) \to f(x^*) = \min f$ as $T \to \infty$?

• under which conditions?

• what about $\frac{x_0 + \dots + x_{T-1}}{T}$?

• at what speed?

• does it work in high dimension?

• do some properties help?

• can other algorithms do better?
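
A minimal Python sketch of the gradient descent loop, returning both the last iterate and the average of $x_0, \dots, x_{T-1}$. The function name, its signature, and the quadratic test objective are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def gradient_descent(
    subgradient: Callable[[Vector], Vector],
    step_size: Callable[[int], float],
    x0: Vector,
    num_steps: int,
) -> Tuple[Vector, Vector]:
    """Run x_{t+1} = x_t - gamma_t * g_t; return (x_T, average of x_0..x_{T-1})."""
    x = list(x0)
    running_sum = [0.0] * len(x0)
    for t in range(num_steps):
        running_sum = [s + xi for s, xi in zip(running_sum, x)]  # accumulate x_t
        g = subgradient(x)  # first-order oracle call
        gamma = step_size(t)
        x = [xi - gamma * gi for xi, gi in zip(x, g)]
    average = [s / num_steps for s in running_sum]
    return x, average


# Illustrative run on f(x) = 0.5 * ||x||^2, whose gradient at x is x itself.
last, avg = gradient_descent(lambda x: list(x), lambda t: 0.5, [4.0, -2.0], 3)
```

With the constant step 0.5, each iteration halves the iterate, so `last` is `[0.5, -0.25]` after three steps.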

Monotonicity of gradient

Property

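Let $f$ be a convex function on $\mathcal{X}$, and let $x, y \in \mathcal{X}$. For every $g_x \in \partial f(x)$ and every $g_y \in \partial f(y)$,

$$\langle g_x - g_y, x - y\rangle \ge 0.$$

In fact, a differentiable mapping $f$ is convex iff

$$\forall x, y \in \mathcal{X},\quad \langle \nabla f(x) - \nabla f(y), x - y\rangle \ge 0.$$

In particular, $\langle g_x, x - x^*\rangle \ge 0$ (take $y = x^*$ and $g_y = 0 \in \partial f(x^*)$).

$\Longrightarrow$ the negative gradient does not point in the wrong direction.

Under some assumptions (to come), this inequality can be strengthened, making gradient descent more relevant.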

Convergence of GD for Convex-Lipschitz functions

Lipschitz Assumption

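For every $x \in \mathcal{X}$ and every $g \in \partial f(x)$, $\|g\| \le L$.

This implies, for any $x, y \in \mathcal{X}$ and any $g \in \partial f(y)$,

$$f(y) - f(x) \le \langle g, y - x\rangle \le L\|y - x\|.$$

Theorem

Under the Lipschitz assumption, GD with $\gamma_t \equiv \gamma = \frac{R}{L\sqrt{T}}$, where $R \ge \|x_0 - x^*\|$, satisfies

$$f\left(\frac{1}{T}\sum_{i=0}^{T-1} x_i\right) - f(x^*) \le \frac{RL}{\sqrt{T}}.$$

• Of course, one can return $\arg\min_{1 \le i \le T} f(x_i)$ instead (not always better).

• It requires $T \approx \frac{R^2 L^2}{\epsilon^2}$ steps to ensure precision $\epsilon$.

• Online version $\gamma_t = \frac{R}{L\sqrt{t}}$: bound in $3RL/\sqrt{T}$ (see Hazan).

• Works just as well for constrained optimization with $x_{t+1} \leftarrow \Pi_{\mathcal{X}}\big(x_t - \gamma_t \nabla f(x_t)\big)$, thanks to the Pythagorean projection theorem.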
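
The bound $RL/\sqrt{T}$ can be checked numerically on a toy problem. The sketch below assumes the one-dimensional objective $f(x) = L|x|$ with $x_0 = R$ and $x^* = 0$; the objective and the function name are illustrative choices, not taken from the slides.

```python
import math


def averaged_gd_gap(R: float, L: float, T: int) -> float:
    """Run GD on f(x) = L*|x| from x_0 = R; return f(average iterate) - f(x*)."""
    gamma = R / (L * math.sqrt(T))  # constant step size from the theorem
    x, total = R, 0.0
    for _ in range(T):
        total += x  # accumulate x_0, ..., x_{T-1}
        g = L * ((x > 0) - (x < 0))  # a subgradient of L*|x|, with |g| <= L
        x -= gamma * g
    return L * abs(total / T)  # f(x*) = 0 at x* = 0


R, L, T = 2.0, 3.0, 100
gap = averaged_gd_gap(R, L, T)
bound = R * L / math.sqrt(T)
```

Here `gap` stays below `bound`, as the theorem guarantees for any convex $L$-Lipschitz objective.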

Intuition: what can happen?

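The step must be large enough to reach the region of the minimum, but not too large, to avoid skipping over it.

Let $\mathcal{X} = B(0, R) \subset \mathbb{R}^2$ and

$$f(x^1, x^2) = \frac{R}{\sqrt{2}\,\gamma T}|x^1| + L|x^2|,$$

which is $L$-Lipschitz for $\gamma \ge \frac{R}{\sqrt{2}\,LT}$. Then, if $x_0 = \left(\frac{R}{\sqrt{2}}, \frac{3L\gamma}{4}\right) \in \mathcal{X}$,

• $x_t^1 = \frac{R}{\sqrt{2}} - \frac{Rt}{\sqrt{2}\,T}$ and $\bar{x}_T^1 \simeq \frac{R}{2\sqrt{2}}$;

• $x_{2s+1}^2 = \frac{3L\gamma}{4} - \gamma L = -\frac{L\gamma}{4}$, $x_{2s}^2 = \frac{3L\gamma}{4}$, and $\bar{x}_T^2 \simeq \frac{L\gamma}{4}$.

Hence

$$f\left(\bar{x}_T^1, \bar{x}_T^2\right) \simeq \frac{R}{\sqrt{2}\,\gamma T} \cdot \frac{R}{2\sqrt{2}} + L \cdot \frac{L\gamma}{4} = \frac{1}{4}\left(\frac{R^2}{T\gamma} + L^2\gamma\right),$$

which is minimal for $\gamma = \frac{R}{L\sqrt{T}}$, where $f\left(\bar{x}_T^1, \bar{x}_T^2\right) \approx \frac{RL}{2\sqrt{T}}$.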

Proof

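Cosine theorem = generalized Pythagorean theorem = Al-Kashi's theorem:

$$2\langle a, b\rangle = \|a\|^2 + \|b\|^2 - \|a - b\|^2.$$

Hence for every $0 \le t < T$:

$$\begin{aligned}
f(x_t) - f(x^*) &\le \langle g_t, x_t - x^*\rangle \\
&= \frac{1}{\gamma}\langle x_t - x_{t+1}, x_t - x^*\rangle \\
&= \frac{1}{2\gamma}\left(\|x_t - x^*\|^2 + \|x_t - x_{t+1}\|^2 - \|x_{t+1} - x^*\|^2\right) \\
&= \frac{1}{2\gamma}\left(\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2\right) + \frac{\gamma}{2}\|g_t\|^2,
\end{aligned}$$

and hence

$$\sum_{t=0}^{T-1} \big(f(x_t) - f(x^*)\big) \le \frac{1}{2\gamma}\left(\|x_0 - x^*\|^2 - \|x_T - x^*\|^2\right) + \frac{L^2\gamma T}{2} \le \frac{L\sqrt{T}R^2}{2R} + \frac{L^2 R T}{2L\sqrt{T}} = RL\sqrt{T},$$

and by convexity

$$f\left(\frac{1}{T}\sum_{i=0}^{T-1} x_i\right) \le \frac{1}{T}\sum_{t=0}^{T-1} f(x_t).$$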

Smoothness

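Smoothness

Definition

A continuously differentiable function $f$ is $\beta$-smooth if the gradient $\nabla f$ is $\beta$-Lipschitz, that is if for all $x, y \in \mathcal{X}$,

$$\|\nabla f(y) - \nabla f(x)\| \le \beta\|y - x\|.$$

Property

If $f$ is $\beta$-smooth, then for any $x, y \in \mathcal{X}$:

$$\big|f(y) - f(x) - \langle \nabla f(x), y - x\rangle\big| \le \frac{\beta}{2}\|y - x\|^2.$$

• If $f$ is convex, then $f$ is $\beta$-smooth iff $x \mapsto \frac{\beta}{2}\|x\|^2 - f(x)$ is convex iff $\forall x, y \in \mathcal{X}$,

$$f(x) + \langle \nabla f(x), y - x\rangle \le f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \frac{\beta}{2}\|y - x\|^2.$$

• If $f$ is twice differentiable, then $f$ is $\beta$-smooth iff all the eigenvalues of the Hessian of $f$ are at most $\beta$ in absolute value.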

Convergence of GD for smooth convex functions

Theorem

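Let $f$ be a convex and $\beta$-smooth function on $\mathbb{R}^d$. Then GD with $\gamma_t \equiv \gamma = \frac{1}{\beta}$ satisfies:

$$f(x_T) - f(x^*) \le \frac{2\beta\|x_0 - x^*\|^2}{T + 4}.$$

Thus it requires $T \approx \frac{2\beta R^2}{\epsilon}$ steps, with $R = \|x_0 - x^*\|$, to ensure precision $\epsilon$.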
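
As a quick numerical illustration of the $2\beta\|x_0 - x^*\|^2/(T+4)$ rate, the sketch below runs GD with step $1/\beta$ on a poorly conditioned quadratic. The objective $f(x) = \frac{1}{2}\sum_i a_i x_i^2$ and the chosen constants are assumptions made for the example.

```python
from typing import List


def quadratic(x: List[float], a: List[float]) -> float:
    """f(x) = 0.5 * sum_i a_i * x_i^2, convex and max(a)-smooth when a_i >= 0."""
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))


a = [0.01, 1.0]  # curvatures; the minimizer is x* = 0 with f(x*) = 0
beta = max(a)  # smoothness constant of this quadratic
x0 = [10.0, 1.0]
T = 50

x = list(x0)
for _ in range(T):
    # Gradient step with gamma = 1/beta; the gradient is (a_i * x_i)_i.
    x = [xi - (ai * xi) / beta for ai, xi in zip(a, x)]

gap = quadratic(x, a)  # f(x_T) - f(x*)
bound = 2 * beta * sum(xi * xi for xi in x0) / (T + 4)
```

The flat direction (curvature 0.01) converges slowly, yet `gap` remains well below `bound`.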

Majoration/minoration

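Taking $\gamma = \frac{1}{\beta}$ is a "safe" choice ensuring progress:

$$x^+ \overset{\text{def}}{=} x - \frac{1}{\beta}\nabla f(x) = \arg\min_y \left\{ f(x) + \langle \nabla f(x), y - x\rangle + \frac{\beta}{2}\|y - x\|^2 \right\}$$

is such that $f(x^+) \le f(x) - \frac{1}{2\beta}\|\nabla f(x)\|^2$. Indeed,

$$f(x^+) - f(x) \le \langle \nabla f(x), x^+ - x\rangle + \frac{\beta}{2}\|x^+ - x\|^2 = -\frac{1}{\beta}\|\nabla f(x)\|^2 + \frac{1}{2\beta}\|\nabla f(x)\|^2 = -\frac{1}{2\beta}\|\nabla f(x)\|^2.$$

$\Longrightarrow$ Descent method.

Moreover, $x^+$ is "on the same side of $x^*$ as $x$" (no overshooting): since $\|\nabla f(x)\| = \|\nabla f(x) - \nabla f(x^*)\| \le \beta\|x - x^*\|$,

$$\langle \nabla f(x), x - x^*\rangle \le \|\nabla f(x)\|\,\|x - x^*\| \le \beta\|x - x^*\|^2,$$

and thus

$$\langle x^* - x^+, x^* - x\rangle = \|x^* - x\|^2 + \frac{1}{\beta}\langle \nabla f(x), x^* - x\rangle \ge 0.$$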
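
Both properties of the step $x^+ = x - \frac{1}{\beta}\nabla f(x)$ can be verified on a small example. The quadratic objective with curvatures $(1, 4)$ and the starting point are assumptions chosen so that the arithmetic is easy to follow by hand.

```python
from typing import List


def f(x: List[float], a: List[float]) -> float:
    """Illustrative objective f(x) = 0.5 * sum_i a_i * x_i^2 (minimum at x* = 0)."""
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))


a = [1.0, 4.0]
beta = max(a)  # smoothness constant
x = [2.0, 1.0]

grad = [ai * xi for ai, xi in zip(a, x)]  # gradient (2, 4)
x_plus = [xi - gi / beta for xi, gi in zip(x, grad)]  # x^+ = x - grad / beta

decrease = f(x, a) - f(x_plus, a)  # actual progress f(x) - f(x^+)
guaranteed = sum(gi * gi for gi in grad) / (2 * beta)  # ||grad||^2 / (2 * beta)

# No overshooting: <x* - x^+, x* - x> with x* = 0 reduces to <x^+, x>.
same_side = sum(pi * xi for pi, xi in zip(x_plus, x))
```

Here `decrease` is 2.875 against a guaranteed 2.5, and `same_side` is nonnegative.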

Lemma: the gradient shoots in the right direction

Lemma

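For every $x \in \mathcal{X}$,

$$\langle \nabla f(x), x - x^*\rangle \ge \frac{1}{\beta}\|\nabla f(x)\|^2.$$

We already know that $f(x^*) \le f\left(x - \frac{1}{\beta}\nabla f(x)\right) \le f(x) - \frac{1}{2\beta}\|\nabla f(x)\|^2$. In addition, taking $z = x^* + \frac{1}{\beta}\nabla f(x)$:

$$\begin{aligned}
f(x^*) &= f(z) + f(x^*) - f(z) \\
&\ge f(x) + \langle \nabla f(x), z - x\rangle - \frac{\beta}{2}\|z - x^*\|^2 \\
&= f(x) + \langle \nabla f(x), x^* - x\rangle + \langle \nabla f(x), z - x^*\rangle - \frac{1}{2\beta}\|\nabla f(x)\|^2 \\
&= f(x) + \langle \nabla f(x), x^* - x\rangle + \frac{1}{2\beta}\|\nabla f(x)\|^2.
\end{aligned}$$

Thus

$$f(x) + \langle \nabla f(x), x^* - x\rangle + \frac{1}{2\beta}\|\nabla f(x)\|^2 \le f(x^*) \le f(x) - \frac{1}{2\beta}\|\nabla f(x)\|^2.$$

In fact, this lemma is a corollary of the co-coercivity of the gradient: $\forall x, y \in \mathcal{X}$,

$$\langle \nabla f(x) - \nabla f(y), x - y\rangle \ge \frac{1}{\beta}\|\nabla f(x) - \nabla f(y)\|^2,$$

which holds iff the convex, differentiable function $f$ is $\beta$-smooth.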

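Proof step 1: the iterates get closer to $x^*$

Applying the preceding lemma to $x = x_t$, we get

$$\begin{aligned}
\|x_{t+1} - x^*\|^2 &= \left\|x_t - \frac{1}{\beta}\nabla f(x_t) - x^*\right\|^2 \\
&= \|x_t - x^*\|^2 - \frac{2}{\beta}\langle \nabla f(x_t), x_t - x^*\rangle + \frac{1}{\beta^2}\|\nabla f(x_t)\|^2 \\
&\le \|x_t - x^*\|^2 - \frac{1}{\beta^2}\|\nabla f(x_t)\|^2 \\
&\le \|x_t - x^*\|^2.
\end{aligned}$$

→ it's good, but it can be slow...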

Proof step 2: the values of the iterates converge

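We have seen that $f(x_{t+1}) - f(x_t) \le -\frac{1}{2\beta}\|\nabla f(x_t)\|^2$. Hence, if $\delta_t = f(x_t) - f(x^*)$, then

$$\delta_0 = f(x_0) - f(x^*) \le \frac{\beta}{2}\|x_0 - x^*\|^2$$

and $\delta_{t+1} \le \delta_t - \frac{1}{2\beta}\|\nabla f(x_t)\|^2$. But

$$\delta_t \le \langle \nabla f(x_t), x_t - x^*\rangle \le \|\nabla f(x_t)\|\,\|x_t - x^*\|.$$

Therefore, since $\|x_t - x^*\|$ is non-increasing in $t$ (step 1), $\|\nabla f(x_t)\| \ge \delta_t / \|x_0 - x^*\|$ and

$$\delta_{t+1} \le \delta_t - \frac{\delta_t^2}{2\beta\|x_0 - x^*\|^2}.$$

Thinking of the corresponding ODE, one sets $u_t = 1/\delta_t$, which yields:

$$u_{t+1} \ge \frac{u_t}{1 - \frac{1}{2\beta\|x_0 - x^*\|^2 u_t}} \ge u_t\left(1 + \frac{1}{2\beta\|x_0 - x^*\|^2 u_t}\right) = u_t + \frac{1}{2\beta\|x_0 - x^*\|^2}.$$

Hence,

$$u_T \ge u_0 + \frac{T}{2\beta\|x_0 - x^*\|^2} \ge \frac{2}{\beta\|x_0 - x^*\|^2} + \frac{T}{2\beta\|x_0 - x^*\|^2} = \frac{T + 4}{2\beta\|x_0 - x^*\|^2}.$$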

Strong convexity

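Strong convexity

Definition

$f : \mathcal{X} \to \mathbb{R}$ is $\alpha$-strongly convex if for all $x, y \in \mathcal{X}$ and for any $g_x \in \partial f(x)$,

$$f(y) \ge f(x) + \langle g_x, y - x\rangle + \frac{\alpha}{2}\|y - x\|^2.$$

• $f$ is $\alpha$-strongly convex iff $x \mapsto f(x) - \frac{\alpha}{2}\|x\|^2$ is convex.

• $\alpha$ measures the curvature of $f$.

• If $f$ is twice differentiable, then $f$ is $\alpha$-strongly convex iff all the eigenvalues of the Hessian of $f$ are at least $\alpha$.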

Faster rates for Lipschitz functions through strong convexity

Theorem

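Let $f$ be an $\alpha$-strongly convex and $L$-Lipschitz function. Then GD with $\gamma_t = \frac{1}{\alpha(t+1)}$ satisfies:

$$f\left(\frac{1}{T}\sum_{i=0}^{T-1} x_i\right) - f(x^*) \le \frac{L^2\log(T)}{\alpha T}.$$

Note: returning another weighted average with $\gamma_t = \frac{2}{\alpha(t+1)}$ yields:

$$f\left(\sum_{i=0}^{T-1} \frac{2(i+1)}{T(T+1)}\,x_i\right) - f(x^*) \le \frac{2L^2}{\alpha(T+1)}.$$

Thus it requires $T \approx \frac{2L^2}{\alpha\epsilon}$ steps to ensure precision $\epsilon$.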
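
A small numerical check of the $L^2\log(T)/(\alpha T)$ rate, as a sketch under stated assumptions: the objective $f(x) = \frac{1}{2}x^2 + |x - 1|$ is an illustrative choice (it is 1-strongly convex, minimized at $x^* = 1$, and its subgradients are bounded by $L = 2$ on $[0, 1]$, where the iterates stay).

```python
import math


def strongly_convex_gd_gap(T: int) -> float:
    """GD with gamma_t = 1/(alpha*(t+1)) on f(x) = 0.5*x^2 + |x - 1| (alpha = 1)."""

    def f(x: float) -> float:
        return 0.5 * x * x + abs(x - 1.0)

    x, total = 0.0, 0.0
    for t in range(T):
        total += x  # accumulate x_0, ..., x_{T-1}
        g = x + ((x > 1.0) - (x < 1.0))  # a subgradient of f at x
        x -= g / (t + 1)  # alpha = 1, so gamma_t = 1/(t+1)
    return f(total / T) - f(1.0)  # gap at the averaged iterate; x* = 1


T = 100
L = 2.0  # |g| <= |x| + 1 <= 2 along the iterates, which stay in [0, 1]
gap = strongly_convex_gd_gap(T)
bound = L * L * math.log(T) / T  # alpha = 1
```

On this example the iterates are $x_t = 1 - 1/t$ for $t \ge 1$, so the averaged iterate approaches $x^*$ and `gap` is far below `bound`.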

Proof

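$$2\langle a, b\rangle = \|a\|^2 + \|b\|^2 - \|a - b\|^2.$$

Hence for every $0 \le t < T$, by $\alpha$-strong convexity:

$$\begin{aligned}
f(x_t) - f(x^*) &\le \langle g_t, x_t - x^*\rangle - \frac{\alpha}{2}\|x_t - x^*\|^2 \\
&= \frac{1}{\gamma_t}\langle x_t - x_{t+1}, x_t - x^*\rangle - \frac{\alpha}{2}\|x_t - x^*\|^2 \\
&= \frac{1}{2\gamma_t}\left(\|x_t - x^*\|^2 + \|x_t - x_{t+1}\|^2 - \|x_{t+1} - x^*\|^2\right) - \frac{\alpha}{2}\|x_t - x^*\|^2 \\
&= \frac{t\alpha}{2}\|x_t - x^*\|^2 - \frac{(t+1)\alpha}{2}\|x_{t+1} - x^*\|^2 + \frac{1}{2(t+1)\alpha}\|g_t\|^2
\end{aligned}$$

since $\gamma_t = \frac{1}{\alpha(t+1)}$, and hence

$$\sum_{t=0}^{T-1} \big(f(x_t) - f(x^*)\big) \le \frac{0 \times \alpha}{2}\|x_0 - x^*\|^2 - \frac{T\alpha}{2}\|x_T - x^*\|^2 + \frac{L^2}{2\alpha}\sum_{t=0}^{T-1}\frac{1}{t+1} \le \frac{L^2\big(1 + \log(T)\big)}{2\alpha},$$

which is at most $\frac{L^2\log(T)}{\alpha}$ as soon as $T \ge 3$, and by convexity $f\left(\frac{1}{T}\sum_{i=0}^{T-1} x_i\right) \le \frac{1}{T}\sum_{t=0}^{T-1} f(x_t)$.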

Outline

Convex functions in $\mathbb{R}^d$

Gradient Descent

Smoothness

Strong convexity

Smooth and Strongly Convex Functions

Lower bounds for Lipschitz convex optimization

What more?

Stochastic Gradient Descent

Smoothness and strong convexity: sandwiching f by squares

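Let $x \in \mathcal{X}$.

For every $y \in \mathcal{X}$, $\beta$-smoothness implies, with $x^+ = x - \frac{1}{\beta}\nabla f(x)$:

$$f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \frac{\beta}{2}\|y - x\|^2 \overset{\text{def}}{=} \bar{f}(y) = \bar{f}(x^+) + \frac{\beta}{2}\|y - x^+\|^2.$$

Moreover, $\alpha$-strong convexity implies, with $x^- = x - \frac{1}{\alpha}\nabla f(x)$,

$$f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle + \frac{\alpha}{2}\|y - x\|^2 \overset{\text{def}}{=} \underline{f}(y) = \underline{f}(x^-) + \frac{\alpha}{2}\|y - x^-\|^2.$$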

Convergence of GD for smooth and strongly convex functions

Theorem

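Let $f$ be a $\beta$-smooth and $\alpha$-strongly convex function. Then GD with the choice $\gamma_t \equiv \gamma = \frac{1}{\beta}$ satisfies

$$f(x_T) - f(x^*) \le e^{-T/\kappa}\big(f(x_0) - f(x^*)\big),$$

where $\kappa = \frac{\beta}{\alpha} \ge 1$ is the condition number of $f$.

Linear convergence: it requires $T = \kappa\log\left(\frac{\mathrm{osc}(f)}{\epsilon}\right)$ steps to ensure precision $\epsilon$.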
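
Linear convergence can be observed on a two-dimensional quadratic. The objective $f(x) = \frac{1}{2}(\alpha x_1^2 + \beta x_2^2)$ with $\alpha = 1$, $\beta = 10$ is an assumption chosen for illustration; its minimum value is $0$ at $x^* = 0$.

```python
import math
from typing import List

alpha, beta = 1.0, 10.0
kappa = beta / alpha  # condition number


def f(x: List[float]) -> float:
    """Illustrative alpha-strongly convex, beta-smooth quadratic with f(x*) = 0."""
    return 0.5 * (alpha * x[0] ** 2 + beta * x[1] ** 2)


x = [1.0, 1.0]
T = 20
gaps = [f(x)]  # gaps[t] = f(x_t) - f(x*)
for _ in range(T):
    grad = [alpha * x[0], beta * x[1]]
    x = [xi - gi / beta for xi, gi in zip(x, grad)]  # step gamma = 1/beta
    gaps.append(f(x))

bound = math.exp(-T / kappa) * gaps[0]
```

Each entry of `gaps` is at most $(1 - \alpha/\beta)$ times the previous one, which is the per-step contraction behind the exponential bound.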

Proof: every step fills a constant part of the gap

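In particular, with the choice $\gamma = \frac{1}{\beta}$,

$$f(x_{t+1}) = f(x_t^+) \le \bar{f}(x_t^+) = f(x_t) - \frac{1}{2\beta}\|\nabla f(x_t)\|^2,$$

and

$$f(x^*) \ge \underline{f}(x^*) \ge \underline{f}(x_t^-) = f(x_t) - \frac{1}{2\alpha}\|\nabla f(x_t)\|^2.$$

Hence, every step fills at least a fraction $\frac{\alpha}{\beta}$ of the gap:

$$f(x_t) - f(x_{t+1}) \ge \frac{\alpha}{\beta}\big(f(x_t) - f(x^*)\big).$$

It follows that

$$f(x_T) - f(x^*) \le \left(1 - \frac{\alpha}{\beta}\right)\big(f(x_{T-1}) - f(x^*)\big) \le \left(1 - \frac{\alpha}{\beta}\right)^T\big(f(x_0) - f(x^*)\big) \le e^{-\frac{\alpha}{\beta}T}\big(f(x_0) - f(x^*)\big).$$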

Using coercivity

Lemma

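If $f$ is $\alpha$-strongly convex then for all $x, y \in \mathcal{X}$,

$$\langle \nabla f(x) - \nabla f(y), x - y\rangle \ge \alpha\|x - y\|^2.$$

Proof: monotonicity of the gradient of the convex function $x \mapsto f(x) - \alpha\|x\|^2/2$.

Lemma

If $f$ is $\alpha$-strongly convex and $\beta$-smooth, then for all $x, y \in \mathcal{X}$,

$$\langle \nabla f(x) - \nabla f(y), x - y\rangle \ge \frac{\alpha\beta}{\alpha + \beta}\|x - y\|^2 + \frac{1}{\alpha + \beta}\|\nabla f(x) - \nabla f(y)\|^2.$$

Proof: co-coercivity of the $(\beta - \alpha)$-smooth and convex function $x \mapsto f(x) - \alpha\|x\|^2/2$.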

Stronger result by coercivity

Theorem

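Let $f$ be a $\beta$-smooth and $\alpha$-strongly convex function. Then GD with the choice $\gamma_t \equiv \gamma = \frac{2}{\alpha + \beta}$ satisfies

$$\|x_T - x^*\|^2 \le e^{-\frac{4T}{\kappa + 1}}\|x_0 - x^*\|^2,$$

where $\kappa = \frac{\beta}{\alpha} \ge 1$ is the condition number of $f$.

Corollary: since by $\beta$-smoothness $f(x_T) - f(x^*) \le \frac{\beta}{2}\|x_T - x^*\|^2$, this bound implies

$$f(x_T) - f(x^*) \le \frac{\beta}{2}\exp\left(-\frac{4T}{\kappa + 1}\right)\|x_0 - x^*\|^2.$$

NB: Bolder jumps: $\gamma = \left(\frac{\alpha + \beta}{2}\right)^{-1} \ge \beta^{-1}$.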

Proof

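Using the coercivity inequality,

$$\begin{aligned}
\|x_t - x^*\|^2 &= \|x_{t-1} - \gamma\nabla f(x_{t-1}) - x^*\|^2 \\
&= \|x_{t-1} - x^*\|^2 - 2\gamma\langle \nabla f(x_{t-1}), x_{t-1} - x^*\rangle + \gamma^2\|\nabla f(x_{t-1})\|^2 \\
&\le \left(1 - \frac{2\alpha\beta\gamma}{\alpha + \beta}\right)\|x_{t-1} - x^*\|^2 + \underbrace{\left(\gamma^2 - \frac{2\gamma}{\alpha + \beta}\right)}_{=0}\|\nabla f(x_{t-1})\|^2 \\
&= \left(1 - \frac{2}{\kappa + 1}\right)^2\|x_{t-1} - x^*\|^2 \\
&\le \exp\left(-\frac{4t}{\kappa + 1}\right)\|x_0 - x^*\|^2,
\end{aligned}$$

where the last inequality follows by iterating the bound $t$ times and using $1 - u \le e^{-u}$.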

Lower bounds for Lipschitz convex optimization

Lower bounds

General first-order black-box optimization algorithm = sequence of maps $(x_0, g_0, \dots, x_t, g_t) \mapsto x_{t+1}$. We assume:

• $x_0 = 0$,

• $x_{t+1} \in \mathrm{Span}(g_0, \dots, g_t)$.

Theorem

For every $T \ge 1$ and $L, R > 0$ there exists a convex and $L$-Lipschitz function $f$ on $\mathbb{R}^{T+1}$ such that for any black-box procedure as above,

$$\min_{0 \le t \le T} f(x_t) - \min_{\|x\| \le R} f(x) \ge \frac{RL}{2\left(1 + \sqrt{T + 1}\right)}.$$

• Minimax lower bound: $f$ and even $d$ depend on $T$...

• ... but not limited to gradient descent algorithms.

• For a fixed dimension, exponential rates are always possible by other means (e.g. center of gravity method).

• $\Longrightarrow$ the above GD algorithm is minimax rate-optimal!

Proof

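Let $d = T + 1$, $\rho = \frac{L\sqrt{d}}{1 + \sqrt{d}}$ and $\alpha = \frac{L}{R\left(1 + \sqrt{d}\right)}$, and let

$$f(x) = \rho\max_{1 \le i \le d} x^i + \frac{\alpha}{2}\|x\|^2.$$

Then

$$\partial f(x) = \alpha x + \rho\,\mathrm{Conv}\left\{e_i : i \text{ s.t. } x^i = \max_{1 \le j \le d} x^j\right\}.$$

If $\|x\| \le R$, then $\forall g \in \partial f(x)$, $\|g\| \le \alpha R + \rho$, which means that $f$ is $(\alpha R + \rho) = L$-Lipschitz. For simplicity of notation, we assume that the first-order oracle returns $\alpha x + \rho e_i$ where $i$ is the first coordinate such that $x^i = \max_{1 \le j \le d} x^j$.

• Thus $x_1 \in \mathrm{Span}(e_1)$, and by induction $x_t \in \mathrm{Span}(e_1, \dots, e_t)$.

• Hence for every $j \in \{t + 1, \dots, d\}$, $x_t^j = 0$, and $f(x_t) \ge 0$ for all $t \le T = d - 1$.

• $f$ reaches its minimum at $x^* = \left(-\frac{\rho}{\alpha d}, \dots, -\frac{\rho}{\alpha d}\right)$ since $0 \in \partial f(x^*)$, $\|x^*\|^2 = \frac{\rho^2}{\alpha^2 d} = R^2$ and

$$f(x^*) = -\frac{\rho^2}{\alpha d} + \frac{\alpha}{2}\cdot\frac{\rho^2}{\alpha^2 d} = -\frac{\rho^2}{2\alpha d} = -\frac{RL}{2\left(1 + \sqrt{T + 1}\right)}.$$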
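
The constants of the hard instance can be checked numerically. The sketch below builds $f$, $\rho$, $\alpha$ and $x^*$ as defined in the proof; the helper name `hard_instance` and the values $T = 8$, $L = 2$, $R = 3$ are assumptions picked so that $\sqrt{d} = 3$ is exact.

```python
import math
from typing import Callable, List, Tuple


def hard_instance(
    T: int, L: float, R: float
) -> Tuple[Callable[[List[float]], float], List[float], float, float]:
    """Build f(x) = rho * max_i x^i + (alpha/2) * ||x||^2 in dimension d = T + 1."""
    d = T + 1
    rho = L * math.sqrt(d) / (1 + math.sqrt(d))
    alpha = L / (R * (1 + math.sqrt(d)))

    def f(x: List[float]) -> float:
        return rho * max(x) + 0.5 * alpha * sum(xi * xi for xi in x)

    x_star = [-rho / (alpha * d)] * d  # minimizer claimed in the proof
    return f, x_star, rho, alpha


T, L, R = 8, 2.0, 3.0
f, x_star, rho, alpha = hard_instance(T, L, R)
norm_x_star = math.sqrt(sum(xi * xi for xi in x_star))
predicted_min = -R * L / (2 * (1 + math.sqrt(T + 1)))
```

With these values, $\rho = 1.5$, $\alpha = 1/6$, every coordinate of $x^*$ equals $-1$, and $f(x^*) = -0.75$ matches the predicted minimum, while $f(0) = 0$.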

Other lower bounds

For $\alpha$-strongly convex and Lipschitz functions, lower bound in $\frac{L^2}{\alpha T}$.

$\Longrightarrow$ GD is order-optimal.

For $\beta$-smooth convex functions, the lower bound is in $\frac{\beta\|x_0 - x^*\|^2}{T^2}$.

$\Longrightarrow$ room for improvement over GD, which reaches $\frac{2\beta\|x_0 - x^*\|^2}{T + 4}$.

For $\alpha$-strongly convex and $\beta$-smooth functions, lower bound in $\|x_0 - x^*\|^2 e^{-T/\sqrt{\kappa}}$.

$\Longrightarrow$ room for improvement over GD, which reaches $\|x_0 - x^*\|^2 e^{-T/\kappa}$.

For proofs, see [Bubeck].

What more?

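Need more?

• Constrained optimization

  • projected gradient descent: $y_{t+1} = x_t - \gamma_t g_t$, $x_{t+1} = \Pi_{\mathcal{X}}(y_{t+1})$.

  • Frank-Wolfe: $y_{t+1} = \arg\min_{y \in \mathcal{X}} \langle \nabla f(x_t), y\rangle$, $x_{t+1} = (1 - \gamma_t)x_t + \gamma_t y_{t+1}$.

• Nesterov acceleration: $y_{t+1} = x_t - \frac{1}{\beta}\nabla f(x_t)$, $x_{t+1} = \left(1 + \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)y_{t+1} - \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\,y_t$.

• Second-order methods, Newton and quasi-Newton: $x_{t+1} = x_t - \left[\nabla^2 f(x_t)\right]^{-1}\nabla f(x_t)$.

• Mirror descent: for a given convex potential $\Phi$, $\nabla\Phi(y_{t+1}) = \nabla\Phi(x_t) - \gamma g_t$, $x_{t+1} \in \Pi^{\Phi}_{\mathcal{X}}(y_{t+1})$.

• Structured optimization, proximal methods

  • Example: $f(x) = L(x) + \lambda\|x\|_1$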
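
As an illustration of the first item, here is a minimal sketch of projected gradient descent when $\mathcal{X}$ is a Euclidean ball, for which the projection has a closed form. The function names, the constant step size, and the test problem (minimizing $\frac{1}{2}\|x - c\|^2$ over the unit ball) are assumptions made for the example.

```python
import math
from typing import Callable, List

Vector = List[float]


def project_onto_ball(y: Vector, radius: float) -> Vector:
    """Euclidean projection onto the ball {x : ||x|| <= radius}."""
    norm = math.sqrt(sum(yi * yi for yi in y))
    if norm <= radius:
        return list(y)
    return [radius * yi / norm for yi in y]


def projected_gradient_descent(
    grad: Callable[[Vector], Vector],
    gamma: float,
    x0: Vector,
    radius: float,
    num_steps: int,
) -> Vector:
    """Iterate y_{t+1} = x_t - gamma * g_t, then x_{t+1} = projection of y_{t+1}."""
    x = list(x0)
    for _ in range(num_steps):
        g = grad(x)
        y = [xi - gamma * gi for xi, gi in zip(x, g)]
        x = project_onto_ball(y, radius)
    return x


# Illustrative problem: minimize 0.5 * ||x - c||^2 over the unit ball, c = (3, 4).
c = [3.0, 4.0]
x_hat = projected_gradient_descent(
    lambda x: [xi - ci for xi, ci in zip(x, c)], 0.5, [0.0, 0.0], 1.0, 20
)
```

The constrained minimizer is the projection of $c$ onto the ball, namely $(0.6, 0.8)$, which is what `x_hat` converges to.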

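Nesterov Momentum acceleration for $\beta$-smooth convex functions

Taken from https://blogs.princeton.edu/imabandit

Algorithm: Nesterov Accelerated Gradient Descent

Input: convex function $f$, initial point $x_0$

$d_0 \leftarrow 0$, $\lambda_0 \leftarrow 1$;

for $t = 0, \dots, T-1$ do

    $y_t \leftarrow x_t + d_t$;

    $x_{t+1} \leftarrow y_t - \frac{1}{\beta}\nabla f(y_t)$;

    $\lambda_{t+1} \leftarrow$ largest solution of $\lambda_{t+1}^2 - \lambda_{t+1} = \lambda_t^2$;

    $d_{t+1} \leftarrow \frac{\lambda_t - 1}{\lambda_{t+1}}\left(x_{t+1} - x_t\right)$;

return $x_T$

• $d_t$ = momentum term ("heavy ball"), a well-known practical trick to accelerate convergence, with intensity $\frac{\lambda_t - 1}{\lambda_{t+1}} \nearrow 1$.

• $\lambda_t \simeq t/2 + 1$: let $\delta_t = \lambda_t - \lambda_{t-1} \ge 0$ and observe that $\lambda_t^2 - \lambda_{t-1}^2 = \delta_t(2\lambda_t - \delta_t) = \lambda_t$, from which one deduces that $1/2 \le \delta_t = \frac{1}{2 - \delta_t/\lambda_t} \le 1$, thus $1 + t/2 \le \lambda_t \le 1 + t$, hence $\delta_t \le \frac{1}{2 - 1/(1 + t/2)} \le 1/2 + 1/(t + 1)$ and $1 + t/2 \le \lambda_t \le t/2 + \log(t + 1) + 1$.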
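
A minimal Python sketch of the accelerated scheme, assuming a gradient callable and a known smoothness constant. The largest root of $\lambda_{t+1}^2 - \lambda_{t+1} = \lambda_t^2$ is $\lambda_{t+1} = \frac{1 + \sqrt{1 + 4\lambda_t^2}}{2}$. The function name, the quadratic test objective, and its constants are illustrative assumptions.

```python
import math
from typing import Callable, List

Vector = List[float]


def nesterov_agd(
    grad: Callable[[Vector], Vector], beta: float, x0: Vector, num_steps: int
) -> Vector:
    """Nesterov accelerated gradient descent with momentum term d_t."""
    x = list(x0)
    d = [0.0] * len(x0)  # d_0 = 0
    lam = 1.0  # lambda_0 = 1
    for _ in range(num_steps):
        y = [xi + di for xi, di in zip(x, d)]  # y_t = x_t + d_t
        g = grad(y)
        x_next = [yi - gi / beta for yi, gi in zip(y, g)]  # gradient step at y_t
        lam_next = (1 + math.sqrt(1 + 4 * lam * lam)) / 2  # largest root
        d = [(lam - 1) / lam_next * (xn - xi) for xn, xi in zip(x_next, x)]
        x, lam = x_next, lam_next
    return x


# Illustrative objective f(x) = 0.5 * sum_i a_i * x_i^2 with x* = 0, f(x*) = 0.
a = [0.01, 1.0]
beta = max(a)
x0 = [10.0, 1.0]
T = 50
x_T = nesterov_agd(lambda y: [ai * yi for ai, yi in zip(a, y)], beta, x0, T)
gap = 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x_T))
bound = 2 * beta * sum(xi * xi for xi in x0) / T**2
```

On this instance the accelerated guarantee $2\beta\|x_0 - x^*\|^2/T^2$ is about 0.08, much smaller than the $2\beta\|x_0 - x^*\|^2/(T+4) \approx 3.7$ guaranteed for plain GD.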

Nesterov Acceleration

Theorem

Let $f$ be a convex and $\beta$-smooth function on $\mathbb{R}^d$. Then the Nesterov Accelerated Gradient Descent algorithm satisfies:

$$f(x_T) - f(x^*) \le \frac{2\beta\|x_0 - x^*\|^2}{T^2}.$$

• Thus it requires $T \approx \frac{\sqrt{2\beta}\,R}{\sqrt{\epsilon}}$ steps, with $R = \|x_0 - x^*\|$, to ensure precision $\epsilon$.

• Nesterov acceleration also works for $\beta$-smooth, $\alpha$-strongly convex functions and makes it possible to reach the minimax rate $\|x_0 - x^*\|^2 e^{-T/\sqrt{\kappa}}$: see for example [Bubeck].

Proof

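Let $\delta_t = f(x_t) - f(x^*)$. Denoting $g_t = -\beta^{-1}\nabla f(x_t + d_t)$, one has:

$$\begin{aligned}
\delta_{t+1} - \delta_t &= f(x_{t+1}) - f(x_t + d_t) + f(x_t + d_t) - f(x_t) \\
&\le -\frac{1}{2\beta}\|\nabla f(x_t + d_t)\|^2 + \langle \nabla f(x_t + d_t), d_t\rangle = -\frac{\beta}{2}\left(\|g_t\|^2 + 2\langle g_t, d_t\rangle\right),
\end{aligned}$$

and

$$\begin{aligned}
\delta_{t+1} &= f(x_{t+1}) - f(x_t + d_t) + f(x_t + d_t) - f(x^*) \\
&\le -\frac{1}{2\beta}\|\nabla f(x_t + d_t)\|^2 + \langle \nabla f(x_t + d_t), x_t + d_t - x^*\rangle = -\frac{\beta}{2}\left(\|g_t\|^2 + 2\langle g_t, x_t + d_t - x^*\rangle\right).
\end{aligned}$$

Hence,

$$\begin{aligned}
(\lambda_t - 1)(\delta_{t+1} - \delta_t) + \delta_{t+1} &\le -\frac{\beta}{2}\left(\lambda_t\|g_t\|^2 + 2\langle g_t, x_t + \lambda_t d_t - x^*\rangle\right) \\
&= -\frac{\beta}{2\lambda_t}\left(\|\lambda_t g_t + x_t + \lambda_t d_t - x^*\|^2 - \|x_t + \lambda_t d_t - x^*\|^2\right) \\
&= -\frac{\beta}{2\lambda_t}\left(\|x_{t+1} + \lambda_{t+1} d_{t+1} - x^*\|^2 - \|x_t + \lambda_t d_t - x^*\|^2\right),
\end{aligned}$$

since the choice of the momentum intensity precisely ensures that

$$x_t + \lambda_t g_t + \lambda_t d_t = x_{t+1} + (\lambda_t - 1)(g_t + d_t) = x_{t+1} + (\lambda_t - 1)(x_{t+1} - x_t) = x_{t+1} + \lambda_{t+1}\underbrace{\frac{\lambda_t - 1}{\lambda_{t+1}}(x_{t+1} - x_t)}_{d_{t+1}}.$$

It follows from the choice of $\lambda_t$ that

$$\lambda_t^2\delta_{t+1} - \lambda_{t-1}^2\delta_t = \lambda_t^2\delta_{t+1} - (\lambda_t^2 - \lambda_t)\delta_t \le -\frac{\beta}{2}\left(\|x_{t+1} + \lambda_{t+1} d_{t+1} - x^*\|^2 - \|x_t + \lambda_t d_t - x^*\|^2\right)$$

and hence, summing over $t$, since $\lambda_{-1} = 0$ and $\lambda_t \ge (t + 1)/2$:

$$\left(\frac{T}{2}\right)^2\delta_T \le \lambda_{T-1}^2\delta_T \le \frac{\beta}{2}\|x_0 + \lambda_0 d_0 - x^*\|^2 = \frac{\beta\|x_0 - x^*\|^2}{2}.$$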

Research article 4

Incremental Majorization-Minimization Optimization with Application to Large-Scale Machine Learning, by Julien Mairal

SIAM Journal on Optimization, Vol. 25, Issue 2, 2015, pages 829-855

Stochastic Gradient Descent

Motivation

Big data: an evaluation of $f$ can be very expensive, and useless (especially at the beginning)!

$$L(\theta) = \frac{1}{m}\sum_{i=1}^{m} \ell\big(y_i\langle\theta, x_i\rangle\big).$$

Src: https://arxiv.org/pdf/1606.04838.pdf

→ often faster and cheaper for the required precision.

Research article 5

The Tradeoffs of Large Scale Learning

by Léon Bottou and Olivier Bousquet

Advances in Neural Information Processing Systems, NIPS Foundation (http://books.nips.cc) (2008), pp. 161-168

NeurIPS 2018 award: "test of time"

Stochastic Gradient Descent Algorithms

We consider a function to minimize $f(x) = \frac{1}{m}\sum_{i=1}^{m} f_i(x)$:

Algorithm: Stochastic Gradient Descent

Input: convex function $f$, step sizes $\gamma_t$, initial point $x_0$

for $t = 0, \dots, T-1$ do

    Pick $I_t \sim \mathcal{U}(\{1, \dots, m\})$

    Compute $g_t \in \partial f_{I_t}(x_t)$

    $x_{t+1} \leftarrow x_t - \gamma_t g_t$

return $x_T$ or $\frac{x_0 + \dots + x_{T-1}}{T}$

• $x_T \to x^* \overset{\text{def}}{=} \arg\min f$ as $T \to \infty$?

• $f(x_T) \to f(x^*) = \min f$ as $T \to \infty$?

• under which conditions?

• what about $\frac{x_0 + \dots + x_{T-1}}{T}$?

• at what speed?

• does it work in high dimension?

• do some properties help?

• can other algorithms do better?
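
A minimal sketch of the SGD loop on a toy finite sum, assuming components $f_i(x) = \frac{1}{2}(x - a_i)^2$ so that $f$ is minimized at the mean of the $a_i$. The function name, the data, the seed, and the step size $\gamma_t = 1/(t+1)$ (natural here because $f$ is 1-strongly convex) are assumptions made for illustration.

```python
import random
from typing import List


def sgd_mean_estimation(samples: List[float], num_steps: int, seed: int = 0) -> float:
    """SGD on f(x) = (1/m) * sum_i 0.5 * (x - a_i)^2 with gamma_t = 1/(t+1)."""
    rng = random.Random(seed)  # seeded generator, for reproducibility
    x = 0.0
    for t in range(num_steps):
        i = rng.randrange(len(samples))  # I_t uniform on {0, ..., m-1}
        g = x - samples[i]  # gradient of f_{I_t} at x_t
        x -= g / (t + 1)  # step size gamma_t = 1/(t+1)
    return x


data = [1.0, 2.0, 3.0, 6.0]
estimate = sgd_mean_estimation(data, num_steps=1000)
```

With this step size, $x_T$ is exactly the running average of the sampled $a_{I_t}$, so it always lies between the smallest and largest data point and concentrates around the mean of `data` as the number of steps grows.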

Noisy Gradient Descent

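Let $\mathcal{F}_t = \sigma(I_0, \dots, I_t)$, where $\mathcal{F}_{-1} = \{\Omega, \emptyset\}$. Note that $x_t$ is $\mathcal{F}_{t-1}$-measurable, i.e. $x_t$ depends only on $I_0, \dots, I_{t-1}$.

Lemma

For all $t \ge 0$,

$$\mathbb{E}\left[g_t \mid \mathcal{F}_{t-1}\right] \in \partial f(x_t).$$

Proof: let $y \in \mathcal{X}$. Since $g_t \in \partial f_{I_t}(x_t)$, $f_{I_t}(y) \ge f_{I_t}(x_t) + \langle g_t, y - x_t\rangle$. Taking expectation conditional on $\mathcal{F}_{t-1}$ (i.e. integrating on $I_t$), and using that $x_t$ is $\mathcal{F}_{t-1}$-measurable, one obtains:

$$f(y) \ge f(x_t) + \mathbb{E}\left[\langle g_t, y - x_t\rangle \mid \mathcal{F}_{t-1}\right] = f(x_t) + \left\langle \mathbb{E}\left[g_t \mid \mathcal{F}_{t-1}\right], y - x_t\right\rangle.$$

More generally, SGD applies to the optimization of functions $f$ that are accessible through a noisy first-order oracle, i.e. for which it is possible to obtain at every point an independent, unbiased estimate of the gradient.

Two distinct objective functions:

$$L_S(\theta) = \frac{1}{m}\sum_{i=1}^{m} \ell\big(h_\theta(x_i), y_i\big) \quad\text{and}\quad L_D(\theta) = \mathbb{E}\left[\ell\big(h_\theta(X), Y\big)\right].$$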

Convergence for Lipschitz convex functions

Theorem

Assume that for all $i$, all $x \in \mathcal{X}$ and all $g \in \partial f_i(x)$, $\|g\| \le L$. Then SGD with $\gamma_t \equiv \gamma = \frac{R}{L\sqrt{T}}$ satisfies

$$\mathbb{E}\left[f\left(\frac{1}{T}\sum_{i=0}^{T-1} x_i\right)\right] - f(x^*) \le \frac{RL}{\sqrt{T}}.$$

• Exactly the same bound as for GD in the Lipschitz convex case.

• As before, it requires $T \approx \frac{R^2 L^2}{\epsilon^2}$ steps to ensure precision $\epsilon$.

• Bound only in expectation.

• In contrast to the deterministic case, smoothness does not improve the speed of convergence in general.

Proof: exactly the same as for GD

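$$2\langle a, b\rangle = \|a\|^2 + \|b\|^2 - \|a - b\|^2.$$

Hence for every $0 \le t < T$, since $\mathbb{E}\left[g_t \mid \mathcal{F}_{t-1}\right] \in \partial f(x_t)$,

$$\begin{aligned}
\mathbb{E}\left[f(x_t) - f(x^*) \mid \mathcal{F}_{t-1}\right] &\le \left\langle \mathbb{E}\left[g_t \mid \mathcal{F}_{t-1}\right], x_t - x^*\right\rangle \\
&= \frac{1}{\gamma}\left\langle \mathbb{E}\left[x_t - x_{t+1} \mid \mathcal{F}_{t-1}\right], x_t - x^*\right\rangle \\
&= \mathbb{E}\left[\frac{1}{2\gamma}\left(\|x_t - x^*\|^2 + \|x_t - x_{t+1}\|^2 - \|x_{t+1} - x^*\|^2\right) \,\middle|\, \mathcal{F}_{t-1}\right] \\
&= \mathbb{E}\left[\frac{1}{2\gamma}\left(\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2\right) + \frac{\gamma}{2}\|g_t\|^2 \,\middle|\, \mathcal{F}_{t-1}\right],
\end{aligned}$$

and hence, taking expectation and dropping the nonpositive term $-\mathbb{E}\left[\|x_T - x^*\|^2\right]$:

$$\mathbb{E}\left[\sum_{t=0}^{T-1} \big(f(x_t) - f(x^*)\big)\right] \le \frac{1}{2\gamma}\left(\|x_0 - x^*\|^2 - \mathbb{E}\left[\|x_T - x^*\|^2\right]\right) + \frac{L^2\gamma T}{2} \le \frac{L\sqrt{T}R^2}{2R} + \frac{L^2 R T}{2L\sqrt{T}} = RL\sqrt{T}.$$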

A lot more to know

• Faster rate for the strongly convex case:

same proof as before.

• No improvement in general by using smoothness only.

• Ruppert-Polyak averaging.

• Improvement for sums of smooth and strongly convex functions, etc.

• Analysis in expectation only is rather weak.

• Mini-batch SGD: the best of both worlds.

• Beyond SGD methods: momentum, simulated annealing, etc.

Convergence in quadratic mean

Theorem

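Let $(\mathcal{F}_t)_t$ be an increasing family of $\sigma$-fields. For every $t \ge 0$, let $f_t$ be a convex, differentiable, $\beta$-smooth, square-integrable, $\mathcal{F}_t$-measurable function on $\mathcal{X}$. Further, assume that for every $x \in \mathcal{X}$ and every $t \ge 1$, $\mathbb{E}\left[\nabla f_t(x) \mid \mathcal{F}_{t-1}\right] = \nabla f(x)$, where $f$ is an $\alpha$-strongly convex function reaching its minimum at $x^* \in \mathcal{X}$. Also assume that for all $t \ge 0$, $\mathbb{E}\left[\|\nabla f_t(x^*)\|^2 \mid \mathcal{F}_{t-1}\right] \le \sigma^2$. Then, denoting $\kappa = \frac{\beta}{\alpha}$, the SGD with $\gamma_t = \frac{1}{\alpha\left(t + 1 + 2\kappa^2\right)}$ satisfies:

$$\mathbb{E}\left[\|x_T - x^*\|^2\right] \le \frac{2\kappa^2\|x_0 - x^*\|^2 + \frac{2\sigma^2}{\alpha^2}\log\left(\frac{T}{2\kappa^2} + 1\right)}{T + 2\kappa^2}.$$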

Proof 1/2: induction formula for the quadratic risk

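We observe that

$$\begin{aligned}
\mathbb{E}\left[\|\nabla f_{I_t}(x_{t-1})\|^2 \mid \mathcal{F}_{t-1}\right] &\le 2\,\mathbb{E}\left[\|\nabla f_{I_t}(x_{t-1}) - \nabla f_{I_t}(x^*)\|^2 \mid \mathcal{F}_{t-1}\right] + 2\,\mathbb{E}\left[\|\nabla f_{I_t}(x^*)\|^2 \mid \mathcal{F}_{t-1}\right] \\
&\le 2\beta^2\|x_{t-1} - x^*\|^2 + 2\sigma^2.
\end{aligned}$$

Hence,

$$\begin{aligned}
\mathbb{E}\left[\|x_t - x^*\|^2 \mid \mathcal{F}_{t-1}\right] &= \|x_{t-1} - x^*\|^2 - 2\gamma_{t-1}\langle x_{t-1} - x^*, \nabla f(x_{t-1})\rangle + \gamma_{t-1}^2\,\mathbb{E}\left[\|\nabla f_{I_t}(x_{t-1})\|^2 \mid \mathcal{F}_{t-1}\right] \\
&\le \|x_{t-1} - x^*\|^2 - 2\gamma_{t-1}\alpha\|x_{t-1} - x^*\|^2 + \gamma_{t-1}^2\,\mathbb{E}\left[\|\nabla f_{I_t}(x_{t-1})\|^2 \mid \mathcal{F}_{t-1}\right] \\
&\le \left(1 - 2\alpha\gamma_{t-1} + 2\beta^2\gamma_{t-1}^2\right)\|x_{t-1} - x^*\|^2 + 2\sigma^2\gamma_{t-1}^2 \\
&\le \left(1 - \alpha\gamma_{t-1}\right)\|x_{t-1} - x^*\|^2 + 2\sigma^2\gamma_{t-1}^2
\end{aligned}$$

thanks to the fact that for all $t \ge 0$, $\alpha\gamma_t \ge 2\beta^2\gamma_t^2 \iff \gamma_t \le \alpha/(2\beta^2) = \gamma_{-1}$, and $\gamma_t$ is decreasing in $t$. Hence, denoting $\delta_t = \mathbb{E}\left[\|x_t - x^*\|^2\right]$, by taking expectation we obtain that

$$\delta_t \le \left(1 - \alpha\gamma_{t-1}\right)\delta_{t-1} + 2\sigma^2\gamma_{t-1}^2.$$

Note that unfolding the induction formula leads to an explicit upper bound for $\delta_t$:

$$\delta_t \le \delta_0\prod_{k=0}^{t-1}\left(1 - \alpha\gamma_k\right) + 2\sigma^2\sum_{k=0}^{t-1}\gamma_k^2\prod_{i=k+1}^{t-1}\left(1 - \alpha\gamma_i\right).$$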

Proof 2/2: solving the induction

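One may either use the closed form for $\delta_t$, or (with the hint of the corresponding ODE) set $u_t = (t + 2\kappa^2)\delta_t$ and note that

$$\begin{aligned}
u_t &= (t + 2\kappa^2)\delta_t \\
&\le (t + 2\kappa^2)\left[\left(1 - \alpha\gamma_{t-1}\right)\frac{u_{t-1}}{t - 1 + 2\kappa^2} + 2\sigma^2\gamma_{t-1}^2\right] \\
&\le (t + 2\kappa^2)\,\frac{t - 1 + 2\kappa^2}{t + 2\kappa^2}\cdot\frac{u_{t-1}}{t - 1 + 2\kappa^2} + \frac{2\sigma^2(t + 2\kappa^2)}{\alpha^2(t + 2\kappa^2)^2} \\
&= u_{t-1} + \frac{2\sigma^2}{\alpha^2}\cdot\frac{1}{t + 2\kappa^2} \\
&\le u_0 + \frac{2\sigma^2}{\alpha^2}\sum_{s=1}^{t}\frac{1}{s + 2\kappa^2} \\
&\le 2\kappa^2\delta_0 + \frac{2\sigma^2}{\alpha^2}\log\frac{t + 2\kappa^2}{2\kappa^2}.
\end{aligned}$$

Hence for every $t$,

$$\delta_t \le \frac{2\kappa^2\|x_0 - x^*\|^2 + \frac{2\sigma^2}{\alpha^2}\log\frac{t + 2\kappa^2}{2\kappa^2}}{t + 2\kappa^2}.$$

Remark: with some more technical work, the analysis works for all $\gamma_t$, possibly of the form $\gamma_t = t^{-\beta}$ for $\beta \le 1$: see [Bach & Moulines '11].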
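
The induction can also be unrolled numerically and compared with the closed-form bound. The sketch below iterates the worst case of the recursion $\delta_t \le (1 - \alpha\gamma_{t-1})\delta_{t-1} + 2\sigma^2\gamma_{t-1}^2$ (taken with equality); the function name and the constants $\alpha = 1$, $\beta = 2$, $\sigma^2 = 1$, $\delta_0 = 4$ are illustrative assumptions.

```python
import math
from typing import Tuple


def recursion_vs_bound(
    alpha: float, beta: float, sigma2: float, delta0: float, T: int
) -> Tuple[float, float]:
    """Iterate the recursion on delta_t and return (delta_T, closed-form bound)."""
    kappa2 = (beta / alpha) ** 2  # kappa^2
    delta = delta0
    for t in range(1, T + 1):
        gamma = 1.0 / (alpha * (t + 2 * kappa2))  # gamma_{t-1}
        delta = (1 - alpha * gamma) * delta + 2 * sigma2 * gamma * gamma
    log_term = math.log(T / (2 * kappa2) + 1)
    bound = (2 * kappa2 * delta0 + 2 * sigma2 / alpha**2 * log_term) / (T + 2 * kappa2)
    return delta, bound


delta_T, bound_T = recursion_vs_bound(alpha=1.0, beta=2.0, sigma2=1.0, delta0=4.0, T=1000)
```

The only slack between the two values comes from bounding the harmonic-type sum $\sum_{s=1}^{t} \frac{1}{s + 2\kappa^2}$ by the logarithm.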
