
Machine Learning 8:

Convex Optimization for Machine Learning

Master 2 Computer Science

Aurélien Garivier, 2018-2019

Table of contents

1. Convex functions in $\mathbb{R}^d$
2. Gradient Descent
3. Smoothness
4. Strong convexity
5. Lower bounds for Lipschitz convex optimization
6. What more?
7. Stochastic Gradient Descent

Convex functions in $\mathbb{R}^d$

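Convex functions and subgradients

Convex function

Let $\mathcal{X} \subset \mathbb{R}^d$ be a convex set. The function $f : \mathcal{X} \to \mathbb{R}$ is convex if

$$\forall x, y \in \mathcal{X},\ \forall \lambda \in [0,1], \quad f\big((1-\lambda)x + \lambda y\big) \le (1-\lambda) f(x) + \lambda f(y).$$

Subgradients

A vector $g \in \mathbb{R}^d$ is a subgradient of $f$ at $x \in \mathcal{X}$ if for any $y \in \mathcal{X}$, $f(y) \ge f(x) + \langle g, y - x\rangle$.

The set of subgradients of $f$ at $x$ is denoted $\partial f(x)$.

Proposition

• If $\partial f(x) \neq \emptyset$ for all $x \in \mathcal{X}$, then $f$ is convex.

• If $f$ is convex, then $\partial f(x) \neq \emptyset$ for every $x$ in the interior of $\mathcal{X}$.

• If $f$ is convex and differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$.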

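Convex functions and optimization

Proposition

Let $f$ be convex. Then

• $x^*$ is a local minimum of $f$ iff $0 \in \partial f(x^*)$,

• and in that case, $x^*$ is a global minimum of $f$;

• if $\mathcal{X}$ is closed and if $f$ is differentiable on $\mathcal{X}$, then $x^* = \arg\min_{x \in \mathcal{X}} f(x)$ iff $\forall y \in \mathcal{X},\ \langle \nabla f(x^*), y - x^*\rangle \ge 0$.

Black-box optimization model

The set $\mathcal{X}$ is known; $f : \mathcal{X} \to \mathbb{R}$ is unknown but accessible through:

• a zeroth-order oracle: given $x \in \mathcal{X}$, it yields $f(x)$,

• and possibly a first-order oracle: given $x \in \mathcal{X}$, it yields some $g \in \partial f(x)$.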

Gradient Descent

Gradient Descent algorithms

A memoryless algorithm for first-order black-box optimization:

Algorithm: Gradient Descent

Input: convex function $f$, step sizes $\gamma_t$, initial point $x_0$

for $t = 0, \dots, T-1$ do

    Compute $g_t \in \partial f(x_t)$

    $x_{t+1} \leftarrow x_t - \gamma_t g_t$

return $x_T$ or $\frac{x_0 + \dots + x_{T-1}}{T}$

Questions:

• does $x_T \to x^* := \arg\min f$ as $T \to \infty$?

• does $f(x_T) \to f(x^*) = \min f$ as $T \to \infty$?

• under which conditions?

• what about $\frac{x_0 + \dots + x_{T-1}}{T}$?

• at what speed?

• does it work in high dimension?

• do some properties help?

• can other algorithms do better?
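The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `gradient_descent`, `l1_norm`, and `l1_subgradient` are hypothetical, and the objective (the $\ell_1$ norm in dimension 2) and the constant step size 0.1 were chosen for the example, not taken from the slides. The function returns both candidate outputs of the algorithm, the last iterate and the average of $x_0, \dots, x_{T-1}$.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def gradient_descent(
    subgradient: Callable[[Vector], Vector],
    x0: Sequence[float],
    step_size: Callable[[int], float],
    num_steps: int,
) -> Tuple[Vector, Vector]:
    """Run T subgradient steps; return (x_T, average of x_0, ..., x_{T-1})."""
    x = list(x0)
    running_sum = [0.0] * len(x)
    for t in range(num_steps):
        # Accumulate x_t before the update so the average covers x_0..x_{T-1}.
        running_sum = [s + xi for s, xi in zip(running_sum, x)]
        g = subgradient(x)  # one call to the first-order oracle
        gamma = step_size(t)
        x = [xi - gamma * gi for xi, gi in zip(x, g)]
    average = [s / num_steps for s in running_sum]
    return x, average


def sign(v: float) -> float:
    if v > 0:
        return 1.0
    if -v > 0:
        return -1.0
    return 0.0


def l1_norm(x: Vector) -> float:
    """Illustrative objective f(x) = |x_1| + ... + |x_d|, minimized at 0."""
    return sum(abs(xi) for xi in x)


def l1_subgradient(x: Vector) -> Vector:
    """One valid subgradient of the l1 norm (0 is chosen at the kinks)."""
    return [sign(xi) for xi in x]


x_last, x_avg = gradient_descent(l1_subgradient, [1.0, -0.5], lambda t: 0.1, 100)
```

With a constant step the last iterate keeps oscillating around the minimizer at a scale comparable to the step size, which is one reason the averaged iterate is of interest.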

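Monotonicity of gradient

Property

Let $f$ be a convex function on $\mathcal{X}$, and let $x, y \in \mathcal{X}$. For every $g_x \in \partial f(x)$ and every $g_y \in \partial f(y)$,

$$\langle g_x - g_y, x - y\rangle \ge 0.$$

In fact, a differentiable mapping $f$ is convex iff

$$\forall x, y \in \mathcal{X}, \quad \langle \nabla f(x) - \nabla f(y), x - y\rangle \ge 0.$$

In particular, $\langle g_x, x - x^*\rangle \ge 0$ (take $y = x^*$ and $g_y = 0$).

$\Longrightarrow$ the negative gradient does not point in the wrong direction.

Under some assumptions (to come), this inequality can be strengthened, making gradient descent more relevant.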

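Convergence of GD for Convex-Lipschitz functions

Lipschitz Assumption

For every $x \in \mathcal{X}$ and every $g \in \partial f(x)$, $\|g\| \le L$.

This implies, for $g \in \partial f(y)$,

$$f(y) - f(x) \le \langle g, y - x\rangle \le L\|y - x\|.$$

Theorem

Under the Lipschitz assumption, if $\|x_0 - x^*\| \le R$, GD with $\gamma_t \equiv \gamma = \frac{R}{L\sqrt{T}}$ satisfies

$$f\left(\frac{1}{T}\sum_{i=0}^{T-1} x_i\right) - f(x^*) \le \frac{RL}{\sqrt{T}}.$$

• Of course, one can return $\arg\min_{1 \le i \le T} f(x_i)$ instead (not always better).

• It requires $T_\epsilon \approx \frac{R^2 L^2}{\epsilon^2}$ steps to ensure precision $\epsilon$.

• Online version $\gamma_t = \frac{R}{L\sqrt{t}}$: bound in $3RL/\sqrt{T}$ (see Hazan).

• Works just as well for constrained optimization with $x_{t+1} \leftarrow \Pi_{\mathcal{X}}\big(x_t - \gamma_t \nabla f(x_t)\big)$,

thanks to the Pythagorean projection theorem.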
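The projected variant and the $RL/\sqrt{T}$ guarantee can be checked numerically on a toy problem. The sketch below assumes $\mathcal{X} = B(0, R)$ with $R = 1$, starts at $x_0 = 0$, and uses an illustrative objective $f(x) = |x^1 - 0.3| + |x^2 + 0.2|$, whose subgradients have norm at most $L = \sqrt{2}$ and whose minimum value is 0. All function names and numerical values are hypothetical choices made for this example.

```python
import math
from typing import Callable, List

Vector = List[float]


def project_onto_ball(x: Vector, radius: float) -> Vector:
    """Euclidean projection onto the centered ball B(0, radius)."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    if radius >= norm:
        return list(x)
    return [radius * xi / norm for xi in x]


def projected_subgradient_average(
    subgradient: Callable[[Vector], Vector],
    x0: Vector,
    radius: float,
    lipschitz: float,
    num_steps: int,
) -> Vector:
    """Projected subgradient descent with gamma = R / (L sqrt(T)).

    Returns the average of x_0, ..., x_{T-1}.
    """
    gamma = radius / (lipschitz * math.sqrt(num_steps))
    x = project_onto_ball(x0, radius)
    running_sum = [0.0] * len(x)
    for _ in range(num_steps):
        running_sum = [s + xi for s, xi in zip(running_sum, x)]
        g = subgradient(x)
        x = project_onto_ball([xi - gamma * gi for xi, gi in zip(x, g)], radius)
    return [s / num_steps for s in running_sum]


TARGET = [0.3, -0.2]  # illustrative minimizer, chosen inside the unit ball


def sign(v: float) -> float:
    if v > 0:
        return 1.0
    if -v > 0:
        return -1.0
    return 0.0


def objective(x: Vector) -> float:
    return sum(abs(xi - ti) for xi, ti in zip(x, TARGET))


def objective_subgradient(x: Vector) -> Vector:
    return [sign(xi - ti) for xi, ti in zip(x, TARGET)]


R, L, T = 1.0, math.sqrt(2.0), 400
x_bar = projected_subgradient_average(objective_subgradient, [0.0, 0.0], R, L, T)
gap = objective(x_bar)          # equals f(x_bar) - f(x*) since f(x*) = 0
bound = R * L / math.sqrt(T)    # the theorem's guarantee
```

On this instance the observed gap is expected to sit below the bound, as the theorem guarantees; the bound is a worst-case statement and is usually loose on easy problems.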

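Intuition: what can happen?

The step must be large enough to reach the region of the minimum, but not so large that it skips over it.

Let $\mathcal{X} = B(0,R) \subset \mathbb{R}^2$ and

$$f(x^1, x^2) = \frac{R}{\sqrt{2}\,\gamma T}\,|x^1| + L\,|x^2|,$$

which is $L$-Lipschitz (up to a constant factor) for $\gamma \ge \frac{R}{\sqrt{2}\,LT}$. Then, if $x_0 = \left(\frac{R}{\sqrt{2}}, \frac{3L\gamma}{4}\right) \in \mathcal{X}$,

• $x_t^1 = \frac{R}{\sqrt{2}} - \frac{Rt}{\sqrt{2}\,T}$ and $\bar{x}_T^1 \simeq \frac{R}{2\sqrt{2}}$;

• $x_{2s+1}^2 = \frac{3L\gamma}{4} - \gamma L = -\frac{L\gamma}{4}$, $x_{2s}^2 = \frac{3L\gamma}{4}$, and $\bar{x}_T^2 \simeq \frac{L\gamma}{4}$.

Hence

$$f\left(\bar{x}_T^1, \bar{x}_T^2\right) \simeq \frac{R}{\sqrt{2}\,\gamma T}\cdot\frac{R}{2\sqrt{2}} + L\cdot\frac{L\gamma}{4} = \frac{1}{4}\left(\frac{R^2}{T\gamma} + L^2\gamma\right),$$

which is minimal for $\gamma = \frac{R}{L\sqrt{T}}$, where $f\left(\bar{x}_T^1, \bar{x}_T^2\right) \simeq \frac{RL}{2\sqrt{T}}$.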

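Proof

Cosine theorem = generalized Pythagorean theorem = Al-Kashi's theorem:

$$2\langle a, b\rangle = \|a\|^2 + \|b\|^2 - \|a - b\|^2.$$

Hence for every $0 \le t \le T-1$:

$$\begin{aligned}
f(x_t) - f(x^*) \le \langle g_t, x_t - x^*\rangle \\
= \frac{1}{\gamma}\langle x_t - x_{t+1}, x_t - x^*\rangle \\
= \frac{1}{2\gamma}\left(\|x_t - x^*\|^2 + \|x_t - x_{t+1}\|^2 - \|x_{t+1} - x^*\|^2\right) \\
= \frac{1}{2\gamma}\left(\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2\right) + \frac{\gamma}{2}\|g_t\|^2,
\end{aligned}$$

and hence

$$\sum_{t=0}^{T-1}\big(f(x_t) - f(x^*)\big) \le \frac{1}{2\gamma}\left(\|x_0 - x^*\|^2 - \|x_T - x^*\|^2\right) + \frac{L^2\gamma T}{2} \le \frac{L\sqrt{T}\,R^2}{2R} + \frac{L^2 R\,T}{2L\sqrt{T}} = RL\sqrt{T},$$

and by convexity

$$f\left(\frac{1}{T}\sum_{i=0}^{T-1} x_i\right) \le \frac{1}{T}\sum_{t=0}^{T-1} f(x_t).$$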

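Smoothness

Definition

A continuously differentiable function $f$ is $\beta$-smooth if the gradient $\nabla f$ is $\beta$-Lipschitz, that is if for all $x, y \in \mathcal{X}$,

$$\|\nabla f(y) - \nabla f(x)\| \le \beta\|y - x\|.$$

Property

If $f$ is $\beta$-smooth, then for any $x, y \in \mathcal{X}$:

$$\big|f(y) - f(x) - \langle \nabla f(x), y - x\rangle\big| \le \frac{\beta}{2}\|y - x\|^2.$$

• A convex function $f$ is $\beta$-smooth iff $x \mapsto \frac{\beta}{2}\|x\|^2 - f(x)$ is convex iff for all $x, y \in \mathcal{X}$,

$$f(x) + \langle \nabla f(x), y - x\rangle \le f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \frac{\beta}{2}\|y - x\|^2.$$

• If $f$ is convex and twice differentiable, then $f$ is $\beta$-smooth iff all the eigenvalues of the Hessian of $f$ are at most equal to $\beta$.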

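Convergence of GD for smooth convex functions

Theorem

Let $f$ be a convex and $\beta$-smooth function on $\mathbb{R}^d$. Then GD with $\gamma_t \equiv \gamma = \frac{1}{\beta}$ satisfies:

$$f(x_T) - f(x^*) \le \frac{2\beta\|x_0 - x^*\|^2}{T + 4}.$$

Thus it requires $T_\epsilon \approx \frac{2\beta R^2}{\epsilon}$ steps to ensure precision $\epsilon$, where $R = \|x_0 - x^*\|$.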
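As a sanity check of this rate, one can run GD with $\gamma = 1/\beta$ on a simple quadratic and compare the optimality gap to the bound at every iteration. The sketch assumes the illustrative objective $f(x) = \frac{1}{2}(x_1^2 + 10 x_2^2)$, for which $\beta = 10$, $x^* = 0$ and $f(x^*) = 0$; the names `CURVATURES`, `objective`, and `gd_gaps` are hypothetical.

```python
from typing import List

CURVATURES = [1.0, 10.0]   # Hessian eigenvalues of the illustrative quadratic
BETA = max(CURVATURES)     # smoothness constant


def objective(x: List[float]) -> float:
    """f(x) = 0.5 * sum_i a_i x_i^2, with minimizer x* = 0 and f(x*) = 0."""
    return 0.5 * sum(a * xi * xi for a, xi in zip(CURVATURES, x))


def gd_gaps(x0: List[float], num_steps: int) -> List[float]:
    """Return the gaps f(x_t) - f(x*) for t = 0..T under gamma = 1/beta."""
    x = list(x0)
    gaps = [objective(x)]
    for _ in range(num_steps):
        gradient = [a * xi for a, xi in zip(CURVATURES, x)]
        x = [xi - gi / BETA for xi, gi in zip(x, gradient)]
        gaps.append(objective(x))
    return gaps


x0 = [1.0, 1.0]
squared_distance = sum(xi * xi for xi in x0)   # ||x_0 - x*||^2
gaps = gd_gaps(x0, 50)
bounds = [2.0 * BETA * squared_distance / (t + 4) for t in range(len(gaps))]
```

This quadratic is also strongly convex, so the observed gaps shrink much faster than the $O(1/T)$ bound, which only uses convexity and smoothness.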

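Majoration/minoration

Taking $\gamma = \frac{1}{\beta}$ is a "safe" choice ensuring progress:

$$x^+ := x - \frac{1}{\beta}\nabla f(x) = \arg\min_y \left\{ f(x) + \langle \nabla f(x), y - x\rangle + \frac{\beta}{2}\|y - x\|^2 \right\}$$

is such that $f(x^+) \le f(x) - \frac{1}{2\beta}\|\nabla f(x)\|^2$. Indeed,

$$f(x^+) - f(x) \le \langle \nabla f(x), x^+ - x\rangle + \frac{\beta}{2}\|x^+ - x\|^2 = -\frac{1}{\beta}\|\nabla f(x)\|^2 + \frac{1}{2\beta}\|\nabla f(x)\|^2 = -\frac{1}{2\beta}\|\nabla f(x)\|^2.$$

$\Longrightarrow$ Descent method.

Moreover, $x^+$ is "on the same side of $x^*$ as $x$" (no overshooting): since $\|\nabla f(x)\| = \|\nabla f(x) - \nabla f(x^*)\| \le \beta\|x - x^*\|$,

$$\big|\langle \nabla f(x), x - x^*\rangle\big| \le \|\nabla f(x)\|\,\|x - x^*\| \le \beta\|x - x^*\|^2,$$

and thus

$$\langle x^* - x^+, x^* - x\rangle = \|x^* - x\|^2 + \frac{1}{\beta}\langle \nabla f(x), x^* - x\rangle \ge 0.$$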

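Lemma: the gradient shoots in the right direction

Lemma

For every $x \in \mathcal{X}$,

$$\langle \nabla f(x), x - x^*\rangle \ge \frac{1}{\beta}\|\nabla f(x)\|^2.$$

We already know that $f(x^*) \le f\left(x - \frac{1}{\beta}\nabla f(x)\right) \le f(x) - \frac{1}{2\beta}\|\nabla f(x)\|^2$. In addition, taking $z = x^* + \frac{1}{\beta}\nabla f(x)$:

$$\begin{aligned}
f(x^*) = f(z) + f(x^*) - f(z) \\
\ge f(x) + \langle \nabla f(x), z - x\rangle - \frac{\beta}{2}\|z - x^*\|^2 \\
= f(x) + \langle \nabla f(x), x^* - x\rangle + \langle \nabla f(x), z - x^*\rangle - \frac{1}{2\beta}\|\nabla f(x)\|^2 \\
= f(x) + \langle \nabla f(x), x^* - x\rangle + \frac{1}{2\beta}\|\nabla f(x)\|^2.
\end{aligned}$$

Thus

$$f(x) + \langle \nabla f(x), x^* - x\rangle + \frac{1}{2\beta}\|\nabla f(x)\|^2 \le f(x^*) \le f(x) - \frac{1}{2\beta}\|\nabla f(x)\|^2.$$

In fact, this lemma is a corollary of the co-coercivity of the gradient: $\forall x, y \in \mathcal{X}$,

$$\langle \nabla f(x) - \nabla f(y), x - y\rangle \ge \frac{1}{\beta}\|\nabla f(x) - \nabla f(y)\|^2,$$

which holds iff the convex, differentiable function $f$ is $\beta$-smooth.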

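Proof step 1: the iterates get closer to $x^*$

Applying the preceding lemma to $x = x_t$, we get

$$\begin{aligned}
\|x_{t+1} - x^*\|^2 = \left\|x_t - \frac{1}{\beta}\nabla f(x_t) - x^*\right\|^2 \\
= \|x_t - x^*\|^2 - \frac{2}{\beta}\langle \nabla f(x_t), x_t - x^*\rangle + \frac{1}{\beta^2}\|\nabla f(x_t)\|^2 \\
\le \|x_t - x^*\|^2 - \frac{1}{\beta^2}\|\nabla f(x_t)\|^2 \\
\le \|x_t - x^*\|^2.
\end{aligned}$$

$\Longrightarrow$ it's good, but it can be slow...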

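Proof step 2: the values of the iterates converge

We have seen that $f(x_{t+1}) - f(x_t) \le -\frac{1}{2\beta}\|\nabla f(x_t)\|^2$. Hence, if $\delta_t = f(x_t) - f(x^*)$, then

$$\delta_0 = f(x_0) - f(x^*) \le \frac{\beta}{2}\|x_0 - x^*\|^2 \quad\text{and}\quad \delta_{t+1} \le \delta_t - \frac{1}{2\beta}\|\nabla f(x_t)\|^2.$$

But

$$\delta_t \le \langle \nabla f(x_t), x_t - x^*\rangle \le \|\nabla f(x_t)\|\,\|x_t - x^*\|.$$

Therefore, since $\|x_t - x^*\|$ is decreasing with $t$ (step 1),

$$\delta_{t+1} \le \delta_t - \frac{\delta_t^2}{2\beta\|x_0 - x^*\|^2}.$$

Thinking of the corresponding ODE, one sets $u_t = 1/\delta_t$, which yields:

$$u_{t+1} \ge \frac{u_t}{1 - \frac{1}{2\beta\|x_0 - x^*\|^2 u_t}} \ge u_t\left(1 + \frac{1}{2\beta\|x_0 - x^*\|^2 u_t}\right) = u_t + \frac{1}{2\beta\|x_0 - x^*\|^2}.$$

Hence,

$$u_T \ge u_0 + \frac{T}{2\beta\|x_0 - x^*\|^2} \ge \frac{2}{\beta\|x_0 - x^*\|^2} + \frac{T}{2\beta\|x_0 - x^*\|^2} = \frac{T + 4}{2\beta\|x_0 - x^*\|^2}.$$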

Strong convexity

Definition

$f : \mathcal{X} \to \mathbb{R}$ is $\alpha$-strongly convex if for all $x, y \in \mathcal{X}$ and for any $g_x \in \partial f(x)$,

$$f(y) \ge f(x) + \langle g_x, y - x\rangle + \frac{\alpha}{2}\|y - x\|^2.$$

• $f$ is $\alpha$-strongly convex iff $x \mapsto f(x) - \frac{\alpha}{2}\|x\|^2$ is convex.

• $\alpha$ measures the curvature of $f$.

• If $f$ is twice differentiable, then $f$ is $\alpha$-strongly convex iff all the eigenvalues of the Hessian of $f$ are at least $\alpha$.

Faster rates for Lipschitz functions through strong convexity

Theorem

Let $f$ be $\alpha$-strongly convex and $L$-Lipschitz. Then GD with $\gamma_t = \frac{1}{\alpha(t+1)}$ satisfies:

$$f\left(\frac{1}{T}\sum_{i=0}^{T-1} x_i\right) - f(x^*) \le \frac{L^2\log(T)}{\alpha T}.$$

Note: returning another weighted average with $\gamma_t = \frac{2}{\alpha(t+1)}$ yields:

$$f\left(\sum_{i=0}^{T-1} \frac{2(i+1)}{T(T+1)}\,x_i\right) - f(x^*) \le \frac{2L^2}{\alpha(T+1)}.$$

Thus it requires $T_\epsilon \approx \frac{2L^2}{\alpha\epsilon}$ steps to ensure precision $\epsilon$.

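Proof

Cosine theorem = generalized Pythagorean theorem = Al-Kashi's theorem:

$$2\langle a, b\rangle = \|a\|^2 + \|b\|^2 - \|a - b\|^2.$$

Hence for every $0 \le t \le T-1$, by $\alpha$-strong convexity:

$$\begin{aligned}
f(x_t) - f(x^*) \le \langle g_t, x_t - x^*\rangle - \frac{\alpha}{2}\|x_t - x^*\|^2 \\
= \frac{1}{\gamma_t}\langle x_t - x_{t+1}, x_t - x^*\rangle - \frac{\alpha}{2}\|x_t - x^*\|^2 \\
= \frac{1}{2\gamma_t}\left(\|x_t - x^*\|^2 + \|x_t - x_{t+1}\|^2 - \|x_{t+1} - x^*\|^2\right) - \frac{\alpha}{2}\|x_t - x^*\|^2 \\
= \frac{t\alpha}{2}\|x_t - x^*\|^2 - \frac{(t+1)\alpha}{2}\|x_{t+1} - x^*\|^2 + \frac{1}{2(t+1)\alpha}\|g_t\|^2
\end{aligned}$$

since $\gamma_t = \frac{1}{\alpha(t+1)}$, and hence

$$\sum_{t=0}^{T-1}\big(f(x_t) - f(x^*)\big) \le \frac{0 \times \alpha}{2}\|x_0 - x^*\|^2 - \frac{T\alpha}{2}\|x_T - x^*\|^2 + \frac{L^2}{2\alpha}\sum_{t=0}^{T-1}\frac{1}{t+1} \le \frac{L^2\big(1 + \log(T)\big)}{2\alpha},$$

which is at most $\frac{L^2\log(T)}{\alpha}$ as soon as $T \ge 3$, and by convexity

$$f\left(\frac{1}{T}\sum_{i=0}^{T-1} x_i\right) \le \frac{1}{T}\sum_{t=0}^{T-1} f(x_t).$$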

Smooth and Strongly Convex Functions

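Smoothness and strong convexity: sandwiching $f$ by squares

Let $x \in \mathcal{X}$.

For every $y \in \mathcal{X}$, $\beta$-smoothness implies, with $x^+ = x - \frac{1}{\beta}\nabla f(x)$:

$$f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \frac{\beta}{2}\|y - x\|^2 =: \bar{f}(y) = \bar{f}(x^+) + \frac{\beta}{2}\|y - x^+\|^2.$$

Moreover, $\alpha$-strong convexity implies, with $x^- = x - \frac{1}{\alpha}\nabla f(x)$:

$$f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle + \frac{\alpha}{2}\|y - x\|^2 =: \underline{f}(y) = \underline{f}(x^-) + \frac{\alpha}{2}\|y - x^-\|^2.$$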

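Convergence of GD for smooth and strongly convex functions

Theorem

Let $f$ be a $\beta$-smooth and $\alpha$-strongly convex function. Then GD with the choice $\gamma_t \equiv \gamma = \frac{1}{\beta}$ satisfies

$$f(x_T) - f(x^*) \le e^{-T/\kappa}\big(f(x_0) - f(x^*)\big),$$

where $\kappa = \frac{\beta}{\alpha} \ge 1$ is the condition number of $f$.

Linear convergence: it requires $T_\epsilon = \kappa\log\left(\frac{\mathrm{osc}(f)}{\epsilon}\right)$ steps to ensure precision $\epsilon$.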

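Proof: every step fills a constant part of the gap

In particular, with the choice $\gamma = \frac{1}{\beta}$,

$$f(x_{t+1}) = f(x_t^+) \le \bar{f}(x_t^+) = f(x_t) - \frac{1}{2\beta}\|\nabla f(x_t)\|^2,$$

and

$$f(x^*) \ge \underline{f}(x^*) \ge \underline{f}(x_t^-) = f(x_t) - \frac{1}{2\alpha}\|\nabla f(x_t)\|^2.$$

Hence, every step fills at least a fraction $\frac{\alpha}{\beta}$ of the gap:

$$f(x_t) - f(x_{t+1}) \ge \frac{\alpha}{\beta}\big(f(x_t) - f(x^*)\big).$$

It follows that

$$f(x_T) - f(x^*) \le \left(1 - \frac{\alpha}{\beta}\right)\big(f(x_{T-1}) - f(x^*)\big) \le \left(1 - \frac{\alpha}{\beta}\right)^T\big(f(x_0) - f(x^*)\big) \le e^{-\frac{\alpha}{\beta}T}\big(f(x_0) - f(x^*)\big).$$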

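Using coercivity

Lemma

If $f$ is $\alpha$-strongly convex then for all $x, y \in \mathcal{X}$,

$$\langle \nabla f(x) - \nabla f(y), x - y\rangle \ge \alpha\|x - y\|^2.$$

Proof: monotonicity of the gradient of the convex function $x \mapsto f(x) - \alpha\|x\|^2/2$.

Lemma

If $f$ is $\alpha$-strongly convex and $\beta$-smooth, then for all $x, y \in \mathcal{X}$,

$$\langle \nabla f(x) - \nabla f(y), x - y\rangle \ge \frac{\alpha\beta}{\alpha + \beta}\|x - y\|^2 + \frac{1}{\alpha + \beta}\|\nabla f(x) - \nabla f(y)\|^2.$$

Proof: co-coercivity of the $(\beta - \alpha)$-smooth and convex function $x \mapsto f(x) - \alpha\|x\|^2/2$.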

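Stronger result by coercivity

Theorem

Let $f$ be a $\beta$-smooth and $\alpha$-strongly convex function. Then GD with the choice $\gamma_t \equiv \gamma = \frac{2}{\alpha + \beta}$ satisfies

$$\|x_T - x^*\|^2 \le e^{-\frac{4T}{\kappa + 1}}\|x_0 - x^*\|^2,$$

where $\kappa = \frac{\beta}{\alpha} \ge 1$ is the condition number of $f$.

Corollary: since by $\beta$-smoothness $f(x_T) - f(x^*) \le \frac{\beta}{2}\|x_T - x^*\|^2$, this bound implies

$$f(x_T) - f(x^*) \le \frac{\beta}{2}\exp\left(-\frac{4T}{\kappa + 1}\right)\|x_0 - x^*\|^2.$$

NB: Bolder jumps: $\gamma = \left(\frac{\alpha + \beta}{2}\right)^{-1} \ge \beta^{-1}$.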
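The two linear rates (step $1/\beta$ on function values, step $2/(\alpha+\beta)$ on distances) can be compared on a small example. The sketch below assumes an illustrative diagonal quadratic with Hessian eigenvalues 1 and 10, so that $\alpha = 1$, $\beta = 10$, $\kappa = 10$, and $x^* = 0$; the names `run_gd`, `x_safe`, and `x_bold` are hypothetical.

```python
import math
from typing import List

CURVATURES = [1.0, 10.0]          # Hessian eigenvalues (illustrative values)
ALPHA, BETA = min(CURVATURES), max(CURVATURES)
KAPPA = BETA / ALPHA              # condition number


def objective(x: List[float]) -> float:
    """f(x) = 0.5 * sum_i a_i x_i^2, with x* = 0 and f(x*) = 0."""
    return 0.5 * sum(a * xi * xi for a, xi in zip(CURVATURES, x))


def squared_norm(x: List[float]) -> float:
    return sum(xi * xi for xi in x)


def run_gd(x0: List[float], gamma: float, num_steps: int) -> List[float]:
    """Constant-step gradient descent; the gradient of f is (a_i x_i)_i."""
    x = list(x0)
    for _ in range(num_steps):
        x = [xi - gamma * a * xi for a, xi in zip(CURVATURES, x)]
    return x


x0, T = [1.0, 1.0], 50
x_safe = run_gd(x0, 1.0 / BETA, T)             # "safe" step 1/beta
x_bold = run_gd(x0, 2.0 / (ALPHA + BETA), T)   # "bolder" step 2/(alpha+beta)

value_bound = math.exp(-T / KAPPA) * objective(x0)
distance_bound = math.exp(-4.0 * T / (KAPPA + 1.0)) * squared_norm(x0)
```

On this instance the bolder step contracts both coordinates by the same factor $\frac{\kappa - 1}{\kappa + 1}$ per iteration, which is exactly the balance that the choice $\gamma = \frac{2}{\alpha + \beta}$ aims for.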

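Proof

Using the coercivity inequality,

$$\begin{aligned}
\|x_t - x^*\|^2 = \|x_{t-1} - \gamma\nabla f(x_{t-1}) - x^*\|^2 \\
= \|x_{t-1} - x^*\|^2 - 2\gamma\langle \nabla f(x_{t-1}), x_{t-1} - x^*\rangle + \gamma^2\|\nabla f(x_{t-1})\|^2 \\
\le \left(1 - \frac{2\alpha\beta\gamma}{\alpha + \beta}\right)\|x_{t-1} - x^*\|^2 + \underbrace{\left(\gamma^2 - \frac{2\gamma}{\alpha + \beta}\right)}_{=0}\|\nabla f(x_{t-1})\|^2 \\
= \left(1 - \frac{2}{\kappa + 1}\right)^2\|x_{t-1} - x^*\|^2,
\end{aligned}$$

and by induction

$$\|x_t - x^*\|^2 \le \left(1 - \frac{2}{\kappa + 1}\right)^{2t}\|x_0 - x^*\|^2 \le \exp\left(-\frac{4t}{\kappa + 1}\right)\|x_0 - x^*\|^2.$$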

Lower bounds for Lipschitz convex optimization

Lower bounds

General first-order black-box optimization algorithm = sequence of maps $(x_0, g_0, \dots, x_t, g_t) \mapsto x_{t+1}$. We assume:

• $x_0 = 0$,

• $x_{t+1} \in \mathrm{Span}(g_0, \dots, g_t)$.

Theorem

For every $T \ge 1$ and $L, R > 0$, there exists a convex and $L$-Lipschitz function $f$ on $\mathbb{R}^{T+1}$ such that for any black-box procedure as above,

$$\min_{0 \le t \le T} f(x_t) - \min_{\|x\| \le R} f(x) \ge \frac{RL}{2\left(1 + \sqrt{T + 1}\right)}.$$

• Minimax lower bound: $f$ and even $d$ depend on $T$...

• ... but not limited to gradient descent algorithms.

• For a fixed dimension, exponential rates are always possible by other means (e.g. center of gravity method).

• $\Longrightarrow$ the above GD algorithm is minimax rate-optimal!

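Proof

Let $d = T + 1$, $\rho = \frac{L\sqrt{d}}{1 + \sqrt{d}}$ and $\alpha = \frac{L}{R\left(1 + \sqrt{d}\right)}$, and let

$$f(x) = \rho\max_{1 \le i \le d} x^i + \frac{\alpha}{2}\|x\|^2.$$

Then

$$\partial f(x) = \alpha x + \rho\,\mathrm{Conv}\left\{e_i : i \text{ s.t. } x^i = \max_{1 \le j \le d} x^j\right\}.$$

If $\|x\| \le R$, then $\forall g \in \partial f(x)$, $\|g\| \le \alpha R + \rho$, which means that $f$ is $(\alpha R + \rho) = L$-Lipschitz. For simplicity of notation, we assume that the first-order oracle returns $\alpha x + \rho e_i$ where $i$ is the first coordinate such that $x^i = \max_{1 \le j \le d} x^j$.

Thus $x_1 \in \mathrm{Span}(e_1)$, and by induction $x_t \in \mathrm{Span}(e_1, \dots, e_t)$.

Hence for every $j \in \{t + 1, \dots, d\}$, $x_t^j = 0$, and $f(x_t) \ge 0$ for all $t \le T = d - 1$.

$f$ reaches its minimum at $x^* = \left(-\frac{\rho}{\alpha d}, \dots, -\frac{\rho}{\alpha d}\right)$ since $0 \in \partial f(x^*)$, $\|x^*\|^2 = \frac{\rho^2}{\alpha^2 d} = R^2$ and

$$f(x^*) = -\frac{\rho^2}{\alpha d} + \frac{\alpha}{2}\cdot\frac{\rho^2}{\alpha^2 d} = -\frac{\rho^2}{2\alpha d} = -\frac{RL}{2\left(1 + \sqrt{T + 1}\right)}.$$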

Other lower bounds

For $\alpha$-strongly convex and Lipschitz functions, the lower bound is in $\frac{L^2}{\alpha T}$.

$\Longrightarrow$ GD is order-optimal.

For $\beta$-smooth convex functions, the lower bound is in $\frac{\beta\|x_0 - x^*\|^2}{T^2}$.

$\Longrightarrow$ room for improvement over GD, which reaches $\frac{2\beta\|x_0 - x^*\|^2}{T + 4}$.

For $\alpha$-strongly convex and $\beta$-smooth functions, the lower bound is in $\|x_0 - x^*\|^2 e^{-T/\sqrt{\kappa}}$.

$\Longrightarrow$ room for improvement over GD, which reaches $\|x_0 - x^*\|^2 e^{-T/\kappa}$.

For proofs, see [Bubeck].

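What more?

Need more?

• Constrained optimization

    • projected gradient descent: $y_{t+1} = x_t - \gamma_t g_t$, $x_{t+1} = \Pi_{\mathcal{X}}(y_{t+1})$.

    • Frank-Wolfe: $y_{t+1} = \arg\min_{y \in \mathcal{X}} \langle \nabla f(x_t), y\rangle$, $x_{t+1} = (1 - \gamma_t)x_t + \gamma_t y_{t+1}$.

• Nesterov acceleration: $y_{t+1} = x_t - \frac{1}{\beta}\nabla f(x_t)$, $x_{t+1} = \left(1 + \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)y_{t+1} - \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\,y_t$.

• Second-order methods, Newton and quasi-Newton: $x_{t+1} = x_t - \left[\nabla^2 f(x_t)\right]^{-1}\nabla f(x_t)$.

• Mirror descent: for a given convex potential $\Phi$, $\nabla\Phi(y_{t+1}) = \nabla\Phi(x_t) - \gamma g_t$, $x_{t+1} \in \Pi^{\Phi}_{\mathcal{X}}(y_{t+1})$.

• Structured optimization, proximal methods

    • Example: $f(x) = L(x) + \lambda\|x\|_1$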
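For the last example, $f(x) = L(x) + \lambda\|x\|_1$, one standard proximal method alternates a gradient step on the smooth part $L$ with soft-thresholding, which is the proximal operator of the $\ell_1$ norm. The slides only name the family of methods, so the sketch below is an assumption about what such a method can look like, not the deck's own algorithm. It uses the illustrative smooth part $L(x) = \frac{1}{2}\|x - b\|^2$ (which is 1-smooth), and the names `soft_threshold` and `proximal_gradient` are hypothetical.

```python
import math
from typing import Callable, List

Vector = List[float]


def soft_threshold(v: Vector, threshold: float) -> Vector:
    """Proximal operator of threshold * ||.||_1: shrink each entry toward 0."""
    return [math.copysign(max(abs(vi) - threshold, 0.0), vi) for vi in v]


def proximal_gradient(
    grad_smooth: Callable[[Vector], Vector],
    x0: Vector,
    step: float,
    lam: float,
    num_steps: int,
) -> Vector:
    """Gradient step on the smooth part, then the prox of step * lam * ||.||_1."""
    x = list(x0)
    for _ in range(num_steps):
        g = grad_smooth(x)
        x = soft_threshold([xi - step * gi for xi, gi in zip(x, g)], step * lam)
    return x


B = [3.0, -0.5, 1.0]   # illustrative data for L(x) = 0.5 * ||x - B||^2


def grad_least_squares(x: Vector) -> Vector:
    return [xi - bi for xi, bi in zip(x, B)]


# With step 1 = 1/beta, the iterates reach soft_threshold(B, lam), the exact
# minimizer of 0.5 * ||x - B||^2 + lam * ||x||_1, and then stay there.
x_hat = proximal_gradient(grad_least_squares, [0.0, 0.0, 0.0], 1.0, 1.0, 10)
```

The entries of `B` with magnitude at most $\lambda$ are set to zero, which illustrates why the $\ell_1$ penalty is used to obtain sparse solutions.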

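Nesterov Momentum acceleration for $\beta$-smooth convex functions

Taken from https://blogs.princeton.edu/imabandit

Algorithm: Nesterov Accelerated Gradient Descent

Input: convex function $f$, initial point $x_0$

$d_0 \leftarrow 0$, $\lambda_0 \leftarrow 1$

for $t = 0, \dots, T-1$ do

    $y_t \leftarrow x_t + d_t$

    $x_{t+1} \leftarrow y_t - \frac{1}{\beta}\nabla f(y_t)$

    $\lambda_{t+1} \leftarrow$ largest solution of $\lambda_{t+1}^2 - \lambda_{t+1} = \lambda_t^2$

    $d_{t+1} \leftarrow \frac{\lambda_t - 1}{\lambda_{t+1}}\left(x_{t+1} - x_t\right)$

return $x_T$

• $d_t$ = momentum term ("heavy ball"), a well-known practical trick to accelerate convergence, with intensity $\frac{\lambda_t - 1}{\lambda_{t+1}} \nearrow 1$.

• $\lambda_t \simeq t/2 + 1$: let $\delta_t = \lambda_t - \lambda_{t-1} \ge 0$ and observe that $\lambda_t^2 - \lambda_{t-1}^2 = \delta_t(2\lambda_t - \delta_t) = \lambda_t$, from which one deduces that $\frac{1}{2} \le \delta_t = \frac{1}{2 - \delta_t/\lambda_t} \le 1$, thus $1 + t/2 \le \lambda_t \le 1 + t$, hence $\delta_t \le \frac{1}{2 - 1/(1 + t/2)} \le \frac{1}{2} + \frac{1}{t + 1}$ and $1 + t/2 \le \lambda_t \le t/2 + \log(t + 1) + 1$.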
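The accelerated scheme can be transcribed almost line by line. The sketch below follows the update order of the algorithm above, using the closed form $\lambda_{t+1} = \frac{1 + \sqrt{1 + 4\lambda_t^2}}{2}$ for the largest root of $\lambda_{t+1}^2 - \lambda_{t+1} = \lambda_t^2$. The test problem, an illustrative quadratic with $\beta = 10$ and $x^* = 0$, and the function names are assumptions made for this example.

```python
import math
from typing import Callable, List

Vector = List[float]


def nesterov_agd(
    gradient: Callable[[Vector], Vector],
    x0: Vector,
    beta: float,
    num_steps: int,
) -> Vector:
    """Nesterov accelerated gradient descent for a beta-smooth convex function."""
    x = list(x0)
    d = [0.0] * len(x)   # momentum term d_0 = 0
    lam = 1.0            # lambda_0 = 1
    for _ in range(num_steps):
        y = [xi + di for xi, di in zip(x, d)]
        g = gradient(y)
        x_next = [yi - gi / beta for yi, gi in zip(y, g)]
        # Largest root of lam_next^2 - lam_next = lam^2.
        lam_next = (1.0 + math.sqrt(1.0 + 4.0 * lam * lam)) / 2.0
        d = [(lam - 1.0) / lam_next * (xn - xi) for xn, xi in zip(x_next, x)]
        x, lam = x_next, lam_next
    return x


CURVATURES = [1.0, 10.0]   # illustrative quadratic f(x) = 0.5 * sum a_i x_i^2
BETA = max(CURVATURES)


def objective(x: Vector) -> float:
    return 0.5 * sum(a * xi * xi for a, xi in zip(CURVATURES, x))


def objective_gradient(x: Vector) -> Vector:
    return [a * xi for a, xi in zip(CURVATURES, x)]


x0, T = [1.0, 1.0], 50
x_T = nesterov_agd(objective_gradient, x0, BETA, T)
gap = objective(x_T)   # f(x*) = 0 for this quadratic
accelerated_bound = 2.0 * BETA * sum(xi * xi for xi in x0) / T ** 2
```

Unlike plain gradient descent, this scheme is not a descent method: individual steps may increase $f$, even though the accelerated $O(1/T^2)$ guarantee holds.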

Nesterov Acceleration

Theorem

Let $f$ be a convex and $\beta$-smooth function on $\mathbb{R}^d$. Then the Nesterov Accelerated Gradient Descent algorithm satisfies:

$$f(x_T) - f(x^*) \le \frac{2\beta\|x_0 - x^*\|^2}{T^2}.$$

• Thus it requires $T_\epsilon \approx R\sqrt{\frac{2\beta}{\epsilon}}$ steps to ensure precision $\epsilon$, where $R = \|x_0 - x^*\|$.

• Nesterov acceleration also works for $\beta$-smooth, $\alpha$-strongly convex functions and makes it possible to reach the minimax rate $\|x_0 - x^*\|^2 e^{-T/\sqrt{\kappa}}$: see for example [Bubeck].

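Proof

Let $\delta_t = f(x_t) - f(x^*)$. Denoting $g_t = -\beta^{-1}\nabla f(x_t + d_t)$, one has:

$$\begin{aligned}
\delta_{t+1} - \delta_t = f(x_{t+1}) - f(x_t + d_t) + f(x_t + d_t) - f(x_t) \\
\le -\frac{1}{2\beta}\|\nabla f(x_t + d_t)\|^2 + \langle \nabla f(x_t + d_t), d_t\rangle = -\frac{\beta}{2}\left(\|g_t\|^2 + 2\langle g_t, d_t\rangle\right),
\end{aligned}$$

and

$$\begin{aligned}
\delta_{t+1} = f(x_{t+1}) - f(x_t + d_t) + f(x_t + d_t) - f(x^*) \\
\le -\frac{1}{2\beta}\|\nabla f(x_t + d_t)\|^2 + \langle \nabla f(x_t + d_t), x_t + d_t - x^*\rangle = -\frac{\beta}{2}\left(\|g_t\|^2 + 2\langle g_t, x_t + d_t - x^*\rangle\right).
\end{aligned}$$

Hence,

$$\begin{aligned}
(\lambda_t - 1)(\delta_{t+1} - \delta_t) + \delta_{t+1} \le -\frac{\beta}{2}\left(\lambda_t\|g_t\|^2 + 2\langle g_t, x_t + \lambda_t d_t - x^*\rangle\right) \\
= -\frac{\beta}{2\lambda_t}\left(\|\lambda_t g_t + x_t + \lambda_t d_t - x^*\|^2 - \|x_t + \lambda_t d_t - x^*\|^2\right) \\
= -\frac{\beta}{2\lambda_t}\left(\|x_{t+1} + \lambda_{t+1}d_{t+1} - x^*\|^2 - \|x_t + \lambda_t d_t - x^*\|^2\right),
\end{aligned}$$

since the choice of the momentum intensity precisely ensures that

$$x_t + \lambda_t g_t + \lambda_t d_t = x_{t+1} + (\lambda_t - 1)(g_t + d_t) = x_{t+1} + (\lambda_t - 1)(x_{t+1} - x_t) = x_{t+1} + \lambda_{t+1}\underbrace{\frac{\lambda_t - 1}{\lambda_{t+1}}(x_{t+1} - x_t)}_{d_{t+1}}.$$

It follows from the choice of $\lambda_t$ that

$$\lambda_t^2\delta_{t+1} - \lambda_{t-1}^2\delta_t = \lambda_t^2\delta_{t+1} - (\lambda_t^2 - \lambda_t)\delta_t \le -\frac{\beta}{2}\left(\|x_{t+1} + \lambda_{t+1}d_{t+1} - x^*\|^2 - \|x_t + \lambda_t d_t - x^*\|^2\right),$$

and hence, since $\lambda_{-1} = 0$ and $\lambda_t \ge (t + 1)/2$:

$$\left(\frac{T}{2}\right)^2\delta_T \le \lambda_{T-1}^2\delta_T \le \frac{\beta}{2}\|x_0 + \lambda_0 d_0 - x^*\|^2 = \frac{\beta\|x_0 - x^*\|^2}{2}.$$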

Research article 4

Incremental Majorization-Minimization Optimization with Application to Large-Scale Machine Learning, by Julien Mairal

SIAM Journal on Optimization, Vol. 25, Issue 2, 2015, pages 829-855

Stochastic Gradient Descent

Motivation

Big data: an evaluation of $f$ can be very expensive, and useless (especially at the beginning)!

$$L(\theta) = \frac{1}{m}\sum_{i=1}^{m} \ell\big(y_i\langle \theta, x_i\rangle\big).$$

Src: https://arxiv.org/pdf/1606.04838.pdf

$\to$ often faster and cheaper for the required precision.

Research article 5

The Tradeoffs of Large Scale Learning, by Léon Bottou and Olivier Bousquet

Advances in Neural Information Processing Systems, NIPS Foundation (http://books.nips.cc) (2008), pp. 161-168

NeurIPS 2018 award: "test of time"

Stochastic Gradient Descent Algorithms

We consider a function to minimize, $f(x) = \frac{1}{m}\sum_{i=1}^{m} f_i(x)$:

Algorithm: Stochastic Gradient Descent

Input: convex function $f$, step sizes $\gamma_t$, initial point $x_0$

for $t = 0, \dots, T-1$ do

    Pick $I_t \sim \mathcal{U}\{1, \dots, m\}$

    Compute $g_t \in \partial f_{I_t}(x_t)$

    $x_{t+1} \leftarrow x_t - \gamma_t g_t$

return $x_T$ or $\frac{x_0 + \dots + x_{T-1}}{T}$

Questions:

• does $x_T \to x^* := \arg\min f$ as $T \to \infty$?

• does $f(x_T) \to f(x^*) = \min f$ as $T \to \infty$?

• under which conditions?

• what about $\frac{x_0 + \dots + x_{T-1}}{T}$?

• at what speed?

• does it work in high dimension?

• do some properties help?

• can other algorithms do better?
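A minimal sketch of this loop, written for a one-dimensional variable to keep it short. Everything specific here is an assumption made for illustration: the component functions $f_i(x) = |x - a_i|$ with anchors $a = (-1, 0, 1)$, the constant step 0.05, the seed, and the function names. The objective $f$ is then minimized at the median of the anchors, $x^* = 0$.

```python
import random
from typing import Callable, List, Tuple

ANCHORS = [-1.0, 0.0, 1.0]   # illustrative data: f_i(x) = |x - a_i|


def make_subgradient(anchor: float) -> Callable[[float], float]:
    """Return a subgradient oracle for f_i(x) = |x - anchor|."""
    def subgradient(x: float) -> float:
        if x > anchor:
            return 1.0
        if anchor > x:
            return -1.0
        return 0.0
    return subgradient


def objective(x: float) -> float:
    """f(x) = (1/m) * sum_i |x - a_i|."""
    return sum(abs(x - a) for a in ANCHORS) / len(ANCHORS)


def stochastic_gradient_descent(
    component_subgradients: List[Callable[[float], float]],
    x0: float,
    step_size: Callable[[int], float],
    num_steps: int,
    rng: random.Random,
) -> Tuple[float, float]:
    """Return (x_T, average of x_0, ..., x_{T-1})."""
    x = x0
    running_sum = 0.0
    for t in range(num_steps):
        running_sum += x
        i = rng.randrange(len(component_subgradients))  # I_t uniform on the indices
        g = component_subgradients[i](x)                # g_t in the subdifferential of f_{I_t}
        x = x - step_size(t) * g
    return x, running_sum / num_steps


GAMMA = 0.05
oracles = [make_subgradient(a) for a in ANCHORS]
x_last, x_avg = stochastic_gradient_descent(
    oracles, 1.0, lambda t: GAMMA, 2000, random.Random(0)
)
```

Each iteration touches a single component $f_{I_t}$, so its cost does not depend on $m$; the price is that the iterates are random and only the averaged behaviour is controlled.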

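Noisy Gradient Descent

Let $\mathcal{F}_t = \sigma(I_0, \dots, I_t)$, where $\mathcal{F}_{-1} = \{\Omega, \emptyset\}$. Note that $x_t$ is $\mathcal{F}_{t-1}$-measurable, i.e. $x_t$ depends only on $I_0, \dots, I_{t-1}$.

Lemma

For all $t \ge 0$,

$$\mathbb{E}\big[g_t \,\big|\, \mathcal{F}_{t-1}\big] \in \partial f(x_t).$$

Proof: let $y \in \mathcal{X}$. Since $g_t \in \partial f_{I_t}(x_t)$, $f_{I_t}(y) \ge f_{I_t}(x_t) + \langle g_t, y - x_t\rangle$. Taking expectation conditional on $\mathcal{F}_{t-1}$ (i.e. integrating on $I_t$), and using that $x_t$ is $\mathcal{F}_{t-1}$-measurable, one obtains:

$$f(y) \ge f(x_t) + \mathbb{E}\big[\langle g_t, y - x_t\rangle \,\big|\, \mathcal{F}_{t-1}\big] = f(x_t) + \Big\langle \mathbb{E}\big[g_t \,\big|\, \mathcal{F}_{t-1}\big], y - x_t\Big\rangle.$$

More generally, SGD applies to the optimization of functions $f$ that are accessible through a noisy first-order oracle, i.e. for which it is possible to obtain at every point an independent, unbiased estimate of the gradient.

Two distinct objective functions:

$$L_S(\theta) = \frac{1}{m}\sum_{i=1}^{m} \ell\big(h_\theta(x_i), y_i\big) \quad\text{and}\quad L_D(\theta) = \mathbb{E}\big[\ell\big(h_\theta(X), Y\big)\big].$$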
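The lemma can be checked by hand on a small finite sum, since the conditional expectation over a uniform index is just the plain average of the component gradients. The sketch below assumes the illustrative components $f_i(x) = \frac{1}{2}(x - a_i)^2$ in dimension one; the anchors and the query point are arbitrary values chosen so that the arithmetic is exact.

```python
from typing import List

ANCHORS: List[float] = [-1.0, 0.0, 4.0]   # illustrative data, f_i(x) = 0.5 * (x - a_i)^2


def component_gradient(i: int, x: float) -> float:
    """Gradient of f_i at x."""
    return x - ANCHORS[i]


def full_gradient(x: float) -> float:
    """Gradient of f = (1/m) * sum_i f_i at x."""
    return x - sum(ANCHORS) / len(ANCHORS)


def expected_stochastic_gradient(x: float) -> float:
    """E[g_t | x_t = x] when I_t is uniform: the average over all indices."""
    m = len(ANCHORS)
    return sum(component_gradient(i, x) for i in range(m)) / m


x = 2.0
expected = expected_stochastic_gradient(x)   # (3 + 2 - 2) / 3
exact = full_gradient(x)                     # 2 - 1
```

Individual component gradients (here 3, 2 and -2) can be far from the full gradient, and can even have the opposite sign; only their average is guaranteed to match it.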

Convergence for Lipschitz convex functions

Theorem

Assume that for all $i$, all $x \in \mathcal{X}$ and all $g \in \partial f_i(x)$, $\|g\| \le L$. Then SGD with $\gamma_t \equiv \gamma = \frac{R}{L\sqrt{T}}$ satisfies

$$\mathbb{E}\left[f\left(\frac{1}{T}\sum_{i=0}^{T-1} x_i\right)\right] - f(x^*) \le \frac{RL}{\sqrt{T}}.$$

• Exactly the same bound as for GD in the Lipschitz convex case.

• As before, it requires $T_\epsilon \approx \frac{R^2 L^2}{\epsilon^2}$ steps to ensure precision $\epsilon$.

• Bound only in expectation.

• In contrast to the deterministic case, smoothness does not improve the speed of convergence in general.

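Proof: exactly the same as for GD

Cosine theorem = generalized Pythagorean theorem = Al-Kashi's theorem:

$$2\langle a, b\rangle = \|a\|^2 + \|b\|^2 - \|a - b\|^2.$$

Hence for every $0 \le t \le T-1$, since $\mathbb{E}\big[g_t \,\big|\, \mathcal{F}_{t-1}\big] \in \partial f(x_t)$,

$$\begin{aligned}
\mathbb{E}\big[f(x_t) - f(x^*) \,\big|\, \mathcal{F}_{t-1}\big] \le \Big\langle \mathbb{E}\big[g_t \,\big|\, \mathcal{F}_{t-1}\big], x_t - x^*\Big\rangle \\
= \frac{1}{\gamma}\Big\langle \mathbb{E}\big[x_t - x_{t+1} \,\big|\, \mathcal{F}_{t-1}\big], x_t - x^*\Big\rangle \\
= \mathbb{E}\left[\frac{1}{2\gamma}\left(\|x_t - x^*\|^2 + \|x_t - x_{t+1}\|^2 - \|x_{t+1} - x^*\|^2\right) \,\Big|\, \mathcal{F}_{t-1}\right] \\
= \mathbb{E}\left[\frac{1}{2\gamma}\left(\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2\right) + \frac{\gamma}{2}\|g_t\|^2 \,\Big|\, \mathcal{F}_{t-1}\right],
\end{aligned}$$

and hence, taking expectation and dropping the nonpositive term $-\mathbb{E}\big[\|x_T - x^*\|^2\big]$:

$$\mathbb{E}\left[\sum_{t=0}^{T-1}\big(f(x_t) - f(x^*)\big)\right] \le \frac{1}{2\gamma}\left(\|x_0 - x^*\|^2 - \mathbb{E}\big[\|x_T - x^*\|^2\big]\right) + \frac{L^2\gamma T}{2} \le \frac{L\sqrt{T}\,R^2}{2R} + \frac{L^2 R\,T}{2L\sqrt{T}} = RL\sqrt{T}.$$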

A lot more to know

• Faster rate for the strongly convex case: same proof as before.

• No improvement in general by using smoothness only.

• Ruppert-Polyak averaging.

• Improvement for sums of smooth and strongly convex functions, etc.

• Analysis in expectation only is rather weak.

• Mini-batch SGD: the best of both worlds.

• Beyond SGD methods: momentum, simulated annealing, etc.

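Convergence in quadratic mean

Theorem

Let $(\mathcal{F}_t)_t$ be an increasing family of $\sigma$-fields. For every $t \ge 0$, let $f_t$ be a convex, differentiable, $\beta$-smooth, square-integrable, $\mathcal{F}_t$-measurable function on $\mathcal{X}$. Further, assume that for every $x \in \mathcal{X}$ and every $t \ge 1$, $\mathbb{E}\big[\nabla f_t(x) \,\big|\, \mathcal{F}_{t-1}\big] = \nabla f(x)$, where $f$ is an $\alpha$-strongly convex function reaching its minimum at $x^* \in \mathcal{X}$. Also assume that for all $t \ge 0$, $\mathbb{E}\big[\|\nabla f_t(x^*)\|^2 \,\big|\, \mathcal{F}_{t-1}\big] \le \sigma^2$. Then, denoting $\kappa = \frac{\beta}{\alpha}$, the SGD with

$$\gamma_t = \frac{1}{\alpha\left(t + 1 + 2\kappa^2\right)}$$

satisfies:

$$\mathbb{E}\big[\|x_T - x^*\|^2\big] \le \frac{2\kappa^2\|x_0 - x^*\|^2 + \frac{2\sigma^2}{\alpha^2}\log\left(\frac{T}{2\kappa^2} + 1\right)}{T + 2\kappa^2}.$$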

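Proof 1/2: induction formula for the quadratic risk

We observe that

$$\begin{aligned}
\mathbb{E}\Big[\|\nabla f_{I_t}(x_{t-1})\|^2 \,\Big|\, \mathcal{F}_{t-1}\Big] \le 2\,\mathbb{E}\Big[\|\nabla f_{I_t}(x_{t-1}) - \nabla f_{I_t}(x^*)\|^2 \,\Big|\, \mathcal{F}_{t-1}\Big] + 2\,\mathbb{E}\Big[\|\nabla f_{I_t}(x^*)\|^2 \,\Big|\, \mathcal{F}_{t-1}\Big] \\
\le 2\beta^2\|x_{t-1} - x^*\|^2 + 2\sigma^2.
\end{aligned}$$

Hence,

$$\begin{aligned}
\mathbb{E}\Big[\|x_t - x^*\|^2 \,\Big|\, \mathcal{F}_{t-1}\Big] = \|x_{t-1} - x^*\|^2 - 2\gamma_{t-1}\langle x_{t-1} - x^*, \nabla f(x_{t-1})\rangle + \gamma_{t-1}^2\,\mathbb{E}\Big[\|\nabla f_{I_t}(x_{t-1})\|^2 \,\Big|\, \mathcal{F}_{t-1}\Big] \\
\le \|x_{t-1} - x^*\|^2 - 2\gamma_{t-1}\alpha\|x_{t-1} - x^*\|^2 + \gamma_{t-1}^2\,\mathbb{E}\Big[\|\nabla f_{I_t}(x_{t-1})\|^2 \,\Big|\, \mathcal{F}_{t-1}\Big] \\
\le \left(1 - 2\alpha\gamma_{t-1} + 2\beta^2\gamma_{t-1}^2\right)\|x_{t-1} - x^*\|^2 + 2\sigma^2\gamma_{t-1}^2 \\
\le \left(1 - \alpha\gamma_{t-1}\right)\|x_{t-1} - x^*\|^2 + 2\sigma^2\gamma_{t-1}^2,
\end{aligned}$$

thanks to the fact that for all $t \ge 0$, $\alpha\gamma_t \ge 2\beta^2\gamma_t^2 \iff \gamma_t \le \alpha/(2\beta^2) = \gamma_{-1}$, and $\gamma_t$ is decreasing in $t$. Hence, denoting $\delta_t = \mathbb{E}\big[\|x_t - x^*\|^2\big]$, by taking expectation we obtain that

$$\delta_t \le \left(1 - \alpha\gamma_{t-1}\right)\delta_{t-1} + 2\sigma^2\gamma_{t-1}^2.$$

Note that unfolding the induction formula leads to an explicit upper bound for $\delta_t$:

$$\delta_t \le \delta_0\prod_{k=0}^{t-1}\left(1 - \alpha\gamma_k\right) + 2\sigma^2\sum_{k=0}^{t-1}\gamma_k^2\prod_{i=k+1}^{t-1}\left(1 - \alpha\gamma_i\right).$$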

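Proof 2/2: solving the induction

One may either use the closed form for $\delta_t$, or (with the hint of the corresponding ODE) set $u_t = (t + 2\kappa^2)\delta_t$ and note that

$$\begin{aligned}
u_t = (t + 2\kappa^2)\delta_t \le (t + 2\kappa^2)\left[\left(1 - \alpha\gamma_{t-1}\right)\frac{u_{t-1}}{t - 1 + 2\kappa^2} + 2\sigma^2\gamma_{t-1}^2\right] \\
= (t + 2\kappa^2)\,\frac{t - 1 + 2\kappa^2}{t + 2\kappa^2}\cdot\frac{u_{t-1}}{t - 1 + 2\kappa^2} + \frac{2\sigma^2(t + 2\kappa^2)}{\alpha^2(t + 2\kappa^2)^2} \\
= u_{t-1} + \frac{2\sigma^2}{\alpha^2}\cdot\frac{1}{t + 2\kappa^2} \\
\le u_0 + \frac{2\sigma^2}{\alpha^2}\sum_{s=1}^{t}\frac{1}{s + 2\kappa^2} \\
\le 2\kappa^2\delta_0 + \frac{2\sigma^2}{\alpha^2}\log\left(\frac{t + 2\kappa^2}{2\kappa^2}\right).
\end{aligned}$$

Hence for every $t$,

$$\delta_t \le \frac{2\kappa^2\|x_0 - x^*\|^2 + \frac{2\sigma^2}{\alpha^2}\log\left(\frac{t + 2\kappa^2}{2\kappa^2}\right)}{t + 2\kappa^2}.$$

Remark: with some more technical work, the analysis works for all $\gamma_t$, possibly of the form $\gamma_t = t^{-\beta}$ for $\beta \le 1$: see [Bach & Moulines '11].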
