Machine Learning 8:
Convex Optimization for Machine Learning
Master 2 Computer Science
Aur´elien Garivier 2018-2019
Table of contents
1. Convex functions inRd 2. Gradient Descent 3. Smoothness 4. Strong convexity
5. Lower bounds lower bound for Lipschitz convex optimization 6. What more?
7. Stochastic Gradient Descent
1
Convex functions in R
dConvex functions and subgradients
Convex function
LetX ⊂Rd be a convex set. The functionf :X →Risconvexif
∀x,y∈X,∀λ∈[0,1], f (1−λ)x+λy
≤(1−λ)f(x) +λf(y).
Subgradients
A vectorg ∈Rn is asubgradientoff atx∈ X if for anyy∈ X, f(y)≥f(x) +hg,y−xi.
The set of subgradients off atx is denoted∂f(x).
Proposition
• If∂f(x)6=∅ for allx∈ X, thenf is convex.
• Iff is convex, then∀x ∈X˚, ∂f(x)6=∅.
• Iff is convex and differentiable atx, then∂f(x) =
∇f(x) .
2
Convex functions and optimization
Proposition
Letf be convex. Then
• x is local minimum off iff 0∈∂f(x),
• and in that case,x is aglobalminimum off;
• if X is closed and iff is differentiable onX, then x = arg min
x∈X
f(x) iff ∀y ∈ X,
∇f(x),y−x
≥0.
Black-box optimization model
The setX is known,f :X →Ris unknown but accessible thru:
• a zeroth-order oracle: givenx ∈ X, yieldsf(x),
• and possibly a first-order oracle: givenx∈ X, yields g ∈∂f(x).
3
Gradient Descent
Gradient Descent algorithms
A memoryless algorithm for first-order black-box optimization:
Algorithm:Gradient Descent
Input: convex function f, step sizeγt, initial pointx0
1 for t = 0. . .T −1do
2 Compute gt ∈∂f(xt)
3 xt+1←xt−γtgt
4 returnxT or x0+· · ·+xT−1 T Questions:
• xT →
T→∞x∗def= arg minf?
• f(xT) →
T→∞f(x∗) = minf?
• under which conditions?
• what about x0+· · ·+xT−1
T ?
• at what speed?
• works in high dimension?
• do some properties help?
• can other algorithms do better?
4
Monotonicity of gradient
Property
Letf be a convex function on X, and let x,y∈ X. For every gx ∈∂f(x) and everygy ∈∂f(y),
gx −gy,x−y
≥0. In fact, a differentiable mappingf is convex iff
∀x,y ∈ X,
∇f(x)− ∇f(y),x−y
≥0.
In particular,hgx,x−x∗i ≥0.
=⇒ the negative gradient does not point the the wrong direction.
Under some assumptions (to come), this inequality can be strenghtened, making gradient descent more relevant.
5
Convergence of GD for Convex-Lipschitz functions
Lipschitz Assumption
For everyx∈ X and everyg ∈∂f(x),kgk ≤L.
This implies
f(y)−f(x) ≤
hg,y−xi
≤Lky−xk.
Theorem
Under the Lipschitz assumption, GD withγt ≡γ= R
L√
T satisfies f 1
T
T−1
X
i=0
xi
!
−f(x∗)≤ RL
√ T .
• Of course, can return arg min1≤i≤Tf(xi) instead (not always better).
• It requiresT≈ R22L2 steps to ensure precision.
• Online version γt = LR√t: bound in 3RL/√
T (see Hazan).
• Works just as well for constrained optimization with xt+1←ΠX xt−γt∇f(xt)
thanks to Pythagore projection theorem.
6
Intuition: what can happen?
The step must be large enough to reach the region of the minimum, but not too large too avoid skip- ping over it.
LetX =B(0,R)⊂R2and f(x1,x2) = R
√2γT|x1|+L|x2| which isL-Lipschitz forγ≥ √R
2LT. Then, ifx0=
√R 2,3Lγ4
∈ X,
• xt1= √R
2−√Rt2T and ¯xT1 ' 2R√2;
• x2s+12 = 3Lγ4 −γL=−Lγ4,x2s2 = 3Lγ4 , and ¯xT2 'Lγ4. Hence
f x¯T1,x¯T2 ' R
√2γT R 2√
2 +LγL 4 = 1
4 R2
Tγ +L2γ
, which is minimal forγ= R
L√
T wheref x¯T1,x¯T2
≈ RL
2√ T.
7
Proof
Cosinus theorem = generalized Pythagore theorem = Alkashi’s theorem:
2ha,bi=kak2+kbk2− ka−bk2. Hence for every 0≤t <T:
f(xt)−f(x∗)≤ hgt,xt−x∗i
= 1
γhxt−xt+1,xt−x∗i
= 1 2γ
kxt−x∗k2+kxt−xt+1k2− kxt+1−x∗k2
= 1 2γ
kxt−x∗k2− kxt+1−x∗k2 +γ
2kgtk2, and hence
T−1
X
t=0
f(xt)−f(x∗)≤ 1 2γ
kx0−x∗k2−kxT−x∗k2 +L2γT
2 ≤ L√ T R2
2R +L2RT 2L√
T
and by convexity f 1 T
T−1
X
i=0
xi
!
≤ 1 T
T−1
X
t=0
f(xt).
8
Smoothness
Smoothness
Definition
A continuously differentiable functionf isβ-smooth if the gradient∇f isβ-Lipschitz, that is if for allx,y ∈ X,
k∇f(y)− ∇f(x)k ≤βky−xk.
Property
Iff isβ-smooth, then for anyx,y∈ X: f(y)−f(x)− h∇f(x),y−xi
≤ β
2ky−xk2.
• f is convex andβ-smooth iffx7→ β
2kxk2−f(x) is convex iff∀x,y ∈ X f(x) +h∇f(x),y−xi ≤f(y)≤f(x) +h∇f(x),y−xi+β
2ky−xk2.
• Iff is twice differentiable, thenf isα-strongly convex iff all the
eigenvalues of the Hessian off are at most equal toβ. 9
Convergence of GD for smooth convex functions
Theorem
Letf be a convex andβ-smooth function onRd. Then GD with γt ≡γ= β1 satisfies:
f(xT)−f(x∗)≤2βkx0−x∗k2 T + 4 .
Thus it requiresT≈ 2βR steps to ensure precision .
10
Majoration/minoration
Takingγ= β1 is a ”safe” choice ensuring progress:
x+def= x−1
β∇f(x) = arg min
y
f(x) +
∇f(x),y−x +β
2ky−xk2 is such thatf(x+)≤f(x)− 1
2β
∇f(x)
2.Indeed, f(x+)−f(x)≤
∇f(x),x+−x +β
2kx+−xk2
=−1 β
∇f(x)
2+ 1 2β
∇f(x)
2=− 1 2β
∇f(x)
2.
=⇒ Descentmethod.
Moreover,x+is ”on the same side ofx∗ asx” (no overshooting): since ∇f(x)k=
∇f(x)− ∇f(x∗)k ≤βkx−x∗k, ∇f(x),x−x∗
≤
∇f(x)
kx−x∗k ≤βkx−x∗k2and thus x∗−x+,x∗−x
=
x∗−xk2+1
β∇f(x),x∗−x
≥0.
11
Lemma: the gradient shoots in the right direction
Lemma
For everyx∈ X,
∇f(x),x−x∗
≥ 1 β
∇f(x) 2.
We already now thatf(x∗)≤f x−β1∇f(x)
≤f(x)−2β1
∇f(x)
2. In addition, taking z=x∗+β1∇f(x):
f(x∗) =f(z) +f(x∗)−f(z)
≥f(x) +
∇f(x),z−x
−β
2kz−x∗k2
=f(x) +
∇f(x),x∗−x+
∇f(x),z−x∗
− 1 2β
∇f(x)
2
=f(x) +
∇f(x),x∗−x + 1
2β ∇f(x)
2.
Thusf(x) +
∇f(x),x∗−x+ 1 2β
∇f(x)
2≤f(x∗)≤f(x)− 1 2β
∇f(x) 2. In fact, this lemma is a corrolary of theco-coercivity of the gradient:∀x,y∈ X,
∇f(x)− ∇f(y),x−y
≥ 1 β
∇f(x)− ∇f(y) 2, which holds iff the convex, differentiable functionf isβ-smooth.
12
Proof step 1: the iterates get closer to x
∗Applying the preceeding lemma tox=xt, we get kxt+1−x∗k2=
xt− 1
β∇f(xt)−x∗
2
= xt−x∗
2− 2 β
∇f(xt),xt−x∗ + 1
β2 ∇f(xt)
2
≤ xt−x∗
2− 1 β2
∇f(xt) 2
≤ xt−x∗
2 .
→it’s good, but it can be slow...
13
Proof step 2: the values of the iterates converge
We have seen thatf(xt+1)−f(xt)≤ − 1 2β
∇f(xt)
2. Hence, ifδt=f(xt)−f(x∗), then
δ0=f(x0)−f(x∗)≤ β 2
x0−x∗ 2
andδt+1≤δt− 1 2β
∇f(xt) 2. But δt≤
∇f(xt),xt−x∗
≤
∇f(xt)
xt−x∗ .
Therefore, sinceδtis decreasing witht,δt+1≤δt− δt2
2βkx0−x∗k2. Thinking to the coresponding ODE, one setsut= 1/δt, which yields:
ut+1≥ ut
1−2βkx 1
0−x∗ k2ut
≥ut
1 + 1
2βkx0−x∗k2ut
=ut+ 1
2βkx0−x∗k2
Hence,uT ≥u0+ T
2βkx0−x∗k2 ≥ 2
βkx0−x∗k2+ T
2βkx0−x∗k2= T+ 4 2βkx0−x∗k2.
14
Strong convexity
Strong convexity
Definition
f :X →Risα-strongly convex if for allx,y ∈ X, for anygx ∈∂f(x), f(y)≥f(x) +
gx,y−x +α
2ky−xk2.
• f isα-strongly convex ifff(x)−α2kxk2 is convex.
• αmeasures thecurvature off.
• Iff is twice differentiable, thenf isα-strongly convex iff all the eigenvalues of the Hessian off are larger thanα.
15
Faster rates for Lipschitz functions through strong convexity
Theorem
Letf be aα-strongly convex andL-Lipschitz. Under the Lipschitz assumption, GD withγt =α(t+1)1 satisfies:
f 1 T
T−1
X
i=0
xi
!
−f(x∗)≤ L2log(T)
αT .
Note : returning another weighted average withγt= α(t+1)2 yields:
f
T−1
X
i=0
2(i+ 1) T(T+ 1)xi
!
−f(x∗)≤ 2L2 α(T+ 1) .
Thus it requiresT≈ 2Lα2 steps to ensure precision.
16
Proof
Cosinus theorem = generalized Pythagore theorem = Alkashi’s theorem:
2ha,bi=kak2+kbk2− ka−bk2. Hence for every 0≤t <T, byα-strong convexity:
f(xt)−f(x∗)≤ hgt,xt−x∗i−α
2kxt−x∗k2
= 1 γt
hxt−xt+1,xt−x∗i −α
2kxt−x∗k2
= 1 2γt
kxt−x∗k2+kxt−xt+1k2− kxt+1−x∗k2
−α
2kxt−x∗k2
= tα
2 kxt−x∗k2−(t+ 1)α
2 kxt+1−x∗k2+ 1
2(t+ 1)αkgtk2 sinceγt= α(t+1)1 , and hence
T−1
X
t=0
f(xt)−f(x∗)≤ 0×α
2 kx0−x∗k2−Tα
2 kxT−x∗k2+L2 2α
T−1
X
t=0
1
t+ 1 ≤L2log(T) 2α and by convexityf
1 T
PT−1 i=0 xi
≤T1PT−1
t=0 f(xt). 17
Outline
Convex functions inRd Gradient Descent Smoothness Strong convexity
Smooth and Strongly Convex Functions
Lower bounds lower bound for Lipschitz convex optimization What more?
Stochastic Gradient Descent
18
Smoothness and strong convexity: sandwiching f by squares
Letx∈ X.
For everyy ∈ X,
β-smoothness implies:
f(y)≤f(x) +
∇f(x),y−x +β
2ky−xk2
def= ¯f(x) = ¯f(x+) +β 2
y−x+
2.
Moreoever,α-strong convexity implies, with x−=x−α1∇f(x), f(y)≥f(x) +
∇f(x),y−x +α
2ky−xk2
def= f(x) =f(x−) +α 2
y−x−
2.
19
Convergence of GD for smooth and strongly convex functions
Theorem
Letf be aβ-smooth andα-strongly convex function. Then GD with the choiceγt≡γ=β1 satisfies
f(xT)−f(x∗)≤e−Tκ f(x0)−f(x∗) ,
whereκ=βα ≥1 is thecondition number off. Linear convergence: it requiresT=κlog osc(f )
steps to ensure precision.
20
Proof: every step fills a constant part of the gap
In particular, with the choiceγ=β1,
f(xt+1) =f(xt+)≤f¯(xt+) =f(xt)− 1 2β
∇f(xt)
2,
and
f(x∗)≥f(x∗)≥f(xt−) =f(xt)− 1 2α
∇f(xt)
2.
Hence, every step fills at least a part αβ of the gap:
f(xt)−f(xt+1)≥ α
β f(xt)−f(x∗) .
It follows that f(xT)−f(x∗)≤
1−α
β
f(xT−1)−f(x∗)
≤
1−α β
T
f(x0)−f(x∗)
≤e−αβT f(x0)−f(x∗) .
21
Using coercivity
Lemma
Iff isα-strongly convex then for allx,y∈ X, ∇f(x)− ∇f(y),x−y
≥αkx−yk2.
Proof:monotonicity of the gradient of the convex functionx7→f(x)−αkxk2/2.
Lemma
Iff isα-strongly convex andβ-smooth, then for allx,y ∈ X, ∇f(x)− ∇f(y),x−y
≥ αβ
α+βkx−yk2+ 1 α+β
∇f(x)− ∇f(y)
2.
Proof:co-coercivity of the (β−α)-smooth and convex functionx7→f(x)−αkxk2/2.
22
Stronger result by coercivity
Theorem
Letf be aβ-smooth andα-strongly convex function. Then GD with the choiceγt≡γ=α+β2 satisfies
kxT−x∗k2≤e−κ+14T kx0−x∗k2,
whereκ=βα ≥1 is thecondition number off. Corollary: since byβ-smoothnessf(xT)−f(x∗)≤ β
2kxT−x∗k2, this bound implies
f(xT)−f(x∗)≤β 2 exp
− 4T κ+ 1
kx0−x∗k2.
NB:Bolder jumps: γ=
α+β 2
−1
≥β−1.
23
Proof
Using the coercivity inequality, kxt−x∗k2=
xt−1−γ∇f(xt−1)−x∗
2
=
xt−1−x∗
2−2γ
∇f(xt−1),xt−1−x∗ +γ2
∇f(xt−1)
2
≤
1−2 αβγ α+β
kxt−1−x∗k2+
γ2− 2γ α+β
| {z }
=0
∇f(xt−1)
2
=
1− 2 κ+ 1
2
kxt−1−x∗k2
≤exp
− 4t κ+ 1
kx0−x∗k2.
24
Lower bounds lower bound for
Lipschitz convex optimization
Lower bounds
General first-order black-box optimization algorithm = sequence of maps (x0,g0, . . . ,xt,gt)7→xt+1. We assume:
• x0= 0
• xt+1∈Span g0, . . . ,gt , Theorem
For everyT ≥1,L,R>0 there exists a convex andL-Lipschitz functionf onRT+1 such that for any black-box procedure as above,
0≤t≤Tmin f(xt)− min
kxk≤Rf(x)≥ RL 2 1 +√
T + 1 .
• Minimax lower bound: f and evend depend onT. . .
• . . . but not limited to gradient descent algorithms.
• For a fixed dimension, exponential rates are always possible by other means (e.g. center of gravity method).
• =⇒ the above GD algorithm is minimax rate-optimal! 25
Proof
Letd=T+ 1,ρ= L
√d 1+√
d andα= L
R 1+√
d, and let f(x) =ρ max
1≤i≤dxi+α 2kxk2. Then
∂f(x) =αx+ρConv n
ei:is.t.xi= max
1≤j≤dxjo .
Ifkxk ≤R, then∀g∈∂f(x),kgk ≤αR+ρwhich means thatf isαR+ρ=L-Lipschitz. For simplicity of notation, we assume that the first-order oracle returnsαx+ρeiwhereiis thefirst coordinate such thatxi= max1≤j≤dxj.
• Thusx1∈Span(e1), and by inductionxt∈Span(e1, . . . ,et).
• Hence for everyj∈ {t+ 1, . . . ,d},xtj= 0, andf(xt)≥0 for allt≤T=d−1.
• f reaches its minimum atx∗= −αdρ , . . . ,−αdρ
since 0∈∂f(x∗),kx∗k2= ρ2
α2d =R2 and
f(x∗) =−ρ2 αd+α
2 ρ2 α2d =− ρ2
2αd=− RL
2 1 +√ T+ 1.
26
Other lower bounds
Forα-strongly convex and Lipschitz functions, lower bound in L2 αT.
=⇒ GD is order-optimal.
Forβ-smooth convex functions, the lower bound is in βkx0−x∗k2
T2 .
=⇒ room for improvement over GD with reaches 2βkx0−x∗k2 T+ 4 . Forα-strongly convex andβ-smooth functions, lower bound in kx0−x∗k2e−√Tκ.
=⇒ room for improvement over GD which reacheskx0−x∗k2e−Tκ. For proofs, see [Bubeck].
27
What more?
Need more?
• Constrained optimization
• projected gradient descent
yt =xt−γtgt, xt+1= ΠX(yt+1).
• Frank-Wolfe yt+1= arg min
y∈X
∇f(xt),y
, xt+1= (1−γt)xt+γtyt+1.
• Nesterov acceleration yt+1=xt−1
β∇f(xt), xt+1=
1 +
√κ−1
√κ+ 1
yt+1−
√κ−1
√κ+ 1yt .
• Second-order methods, Newton and quasi-Newton xt+1=xt−
∇2f(x)−1
∇f(x).
• Mirror descent: for a given convex potention Φ,
∇Φ(yt+1) =∇Φ(xt)−γgt, xt+1∈ΠΦX(yt+1).
• Structured optimization, proximal methods
• Example: f(x) =L(x) +λkxk1 28
Nesterov Momentum acceleration forβ-smooth convex functions Taken fromhttps://blogs.princeton.edu/imabandit
Algorithm:Nesterov Accelerated Gradient Descent Input: convex function f, initial pointx0
1 d0←0,λ0←1 ;
2 fort = 0. . .T−1 do
3 yt←xt+dt;
4 xt+1←yt−β1∇f(yt);
5 λt+1 ←largest solution of λ2t+1−λt+1=λ2t;
6 dt+1 ←λλt−1
t+1 xt+1−xt
;
7 returnxT
• dt = momentum term (”heavy ball”), well-known practical trick to accelerate convergence, intensity λλt−1
t+1 /1.
• λt 't/2 + 1: letδt=λt−λt−1≥0 and observe thatλ2t−λ2t−1=δt(2λt−δt) =λt, from which one deduces that 1/2≤δt=2−δ1
t/λt ≤1, thus 1 +t/2≤λt≤1 +t, hence δt≤ 2−1/(1+t/2)1 ≤1/2 + 1/(t+ 1) and 1 +t/2≤λt≤t/2 + log(t+ 1) + 1.
29
Nesterov Acceleration
Theorem
Letf be a convex andβ-smooth function onRd. Then Nesterov Accelerated Gradient descent algorithm satisfies:
f(xT)−f(x∗)≤2βkx0−x∗k2
T2 .
• Thus it requiresT≈ 2βR√ steps to ensure precision .
• Nesterov acceleration also works forβ-smooth, α-strongly convex functions and permits to reach the minimax rate kx0−x∗k2e−√Tκ: see for example [Bubeck].
30
Proof
Letδt=f(xt)−f(x∗). Denotinggt=−β−1∇f(xt+dt), one has:
δt+1−δt=f(xt+1)−f(xt+dt) +f(xt+dt)−f(xt)
≤ −2 β
∇f(xt+dt) 2+
∇f(xt+dt),dt=−β 2
kgtk2+ 2gt,dt ,
andδt+1=f(xt+1)−f(xt+dt) +f(xt+dt)−f(x∗)
≤ −2 β
∇f(xt+dt) 2+
∇f(xt+dt),xt+dt−x∗
=−β 2
kgtk2+ 2
gt,xt+dt−x∗ .
Hence, λt−1δt+1−δt+δt+1≤ −β 2
λtkgtk2+ 2gt,xt+λtdt−x∗
=− β 2λt
λtgt+xt+λtdt−x∗ 2−
xt+λtdt−x∗ 2
=− β 2λt
xt+1+λt+1dt+1−x∗ 2−
xt+λtdt−x∗ 2
, since the choice of the momentum intensity is precisely ensuring thatxt+λtgt+λtdt= xt+1+ (λt−1)(gt+dt) =xt+1+ (λt−1)(xt+1−xt) =xt+1+λt+1λt−1
λt+1 (xt+1−xt)
| {z }
dt+1
. It follows from the choice ofλtthat
λ2tδt+1−λ2t−1δt=λ2tδt+1−(λ2t−λt)δt≤ −β 2
xt+1+λt+1dt+1−x∗
2−
xt+λtdt−x∗
2
and hence, sinceλ−1= 0 andλt≥(t+ 1)/2:
T 2
2
δT≤λ2T−1δT≤ β 2
x0+λ0d0−x∗ 2=β
x0−x∗ 2
2 .
31
Research article 4
Incremental Majorization- Minimization Optimization with Application to Large- Scale Machine Learning by Julien Mairal
SIAM Journal on Optimization Vol. 25 Issue 2, 2015 Pages 829- 855
32
Stochastic Gradient Descent
Motivation
Big data: an evaluation off can be very expensive, and useless!
(especially at the begining).
L(θ) = 1 m
m
X
i=1
` yihθ,xii .
Src:https://arxiv.org/pdf/1606.04838.pdf
→often faster and cheaper for the required precision.
33
Research article 5
The Tradeoffs of Large Scale Learning
by L´eon Bottou and Olivier Bousquet Advances in Neural Information Pro- cessing Systems, NIPS Foundation (http://books.nips.cc) (2008), pp. 161-168
NeurIPS 2018 award: ”test of time”
34
Stochastic Gradient Descent Algorithms
We consider a function to miminizef(x) = 1 m
m
X
i=1
fi(x):
Algorithm:Stochastic Gradient Descent Input: convex function f, step sizeγt,
initial pointx0 1 for t = 0. . .T −1do
2 PickIt∼ U {1, . . . ,m}
3 Compute gt ∈∂fIt(xt)
4 xt+1←xt−γtgt
5 returnxT or x0+· · ·+xT−1 T
• xT →
T→∞x∗def= arg minf?
• f(xT) →
T→∞f(x∗) = minf?
• under which conditions?
• what about x0+· · ·+xT−1
T ?
• at what speed?
• works in high dimension?
• do some properties help?
• can other algorithms do
better? 35
Noisy Gradient Descent
LetFt =σ(I0, . . . ,It), whereF−1={Ω,∅}. Note thatxt is Ft−1-measurable, i.e. xt depends only onI0, . . . ,It−1.
Lemma For allt ≥0,
E
gt|Ft−1
∈∂f(xt).
Proof: lety ∈ X. Sincegt ∈∂fIt(xt),fIt(y)≥fIt(xt) +hgt,y−xti. Taking expectation conditionnal onFt−1 (i.e. integrating onIt), and using thatxt is Ft−1-measurable, one obtains:
f(y)≥f(xt) +E
hgt,y−xt|Ft−1i
=f(xt) + D
E gt|Ft−1
,y−xt
E . More generally, SGD for the optimization of functionsf that are accessible by anoisy first-order example, i.e. for which it is possible to obtain at every point an independent, unbiased estimate of the gradient.
Two distinct objective functions:
LS(θ) = 1 m
m
X
i=1
`i hθ(xi),yi
andLD(θ) =E
h` hθ(X),Yi
. 36
Convergence for Lipschitz convex functions
Theorem
Assume that for alli, allx ∈ X and allg ∈∂fi(x),kgtk ≤L. Then SGD withγt ≡γ= R
L√
T satisfies E
"
f 1 T
T−1
X
i=0
xi
!#
−f(x∗)≤ RL
√T .
• Exactly the same bound as for GD in the Lipschitz convex case.
• As before, it requires T≈R22L2 steps to ensure precision .
• Bound only in expectation.
• In contrast to the deterministic case, smoothness does not improve the speed of convergence in general.
37
Proof: exactly the same as for GD
Cosinus theorem = generalized Pythagore theorem = Alkashi’s theorem:
2ha,bi=kak2+kbk2− ka−bk2. Hence for every 0≤t <T, sinceE
gt|Ft−1
∈∂f(xt), E
f(xt)−f(x∗)|Ft−1
≤D E
gt|Ft−1
,xt−x∗E
= 1 γ D
E
xt−xt+1|Ft−1
,xt−x∗E
=E 1
2γ
kxt−x∗k2+kxt−xt+1k2− kxt+1−x∗k2 Ft−1
=E 1
2γ
kxt−x∗k2− kxt+1−x∗k2 +γ
2kgtk2 Ft−1
,
and hence, taking expectation:
E
"T−1 X
t=0
f(xt)−f(x∗)
#
≤ 1 2γ
kx0−x∗k2−
((((((( E
kxT−x∗k2
+L2γT 2
≤L√ T R2
2R +L2RT 2L√
T . 38
A lot more to know
• Faster rate for the strongly convex case:
same proof as before.
• No improvement in general by using smoothness only.
• Ruppert-Polyak averaging.
• Improvement for sums of smooth and strongly convex functions, etc.
• Analysis in expectation only is rather weak.
• Mini-batch SGD: the best of the two worlds.
• Beyond SGD methods: momentum, simulated annealing, etc.
39
Convergence in quadratic mean
Theorem
Let (Ft)t be an increasing family ofσ-fields. For everyt≥0, letft be a convex, differentiable,β-smooth, square-integrable,Ft-measurable function onX. Further, assume that for everyx ∈ X and everyt≥1, E
∇ft(x)|Ft−1
=∇f(x), wheref is anα-strongly convex function reaching its minimum atx∗∈ X. Also assume that for allt ≥0, E
k∇ft(x∗)k2 Ft−1
≤σ2. Then, denotingκ=βα, the SGD with
γt = 1
α t+1+2κ2 satisfies:
E
hkxT−x∗k2i
≤ 2κ2kx0−x∗k2+2σα22log 2κT2 + 1
T + 2κ2 .
40
Proof 1/2: induction formula for the quadratic risk
We observe that E
h
∇fIt(xt−1)
2 Ft−1
i≤2E h
∇fIt(xt−1)−fIt(x∗)
2 Ft−1
i+ 2E h
∇fIt(x∗)
2 Ft−1
i
≤2β2
xt−1−x∗ 2+ 2σ2. Hence,
E h
xt−x∗ 2
Ft−1
i
=
xt−1−x∗
2−2γt−1
xt−1−x∗,∇f(xt−1) +γt−12 E
h
∇fIt(xt−1) 2
Ft−1
i
≤
xt−1−x∗
2−2γt−1α
xt−1−x∗ 2+γ2t−1E
h
∇fIt(xt−1) 2
Ft−1
i
≤ 1−2αγt−1+ 2β2γ2t−1
xt−1−x∗
2+ 2σ2γt−12
≤ 1−αγt−1
xt−1−x∗
2+ 2σ2γt−12
thanks to the fact that for allt≥0,αγt≥2β2γ2t ⇐⇒ γt≤α/(2β2) =γ−1, andγtis decreasing int. Hence, denotingδt=E
kxt−x∗k2
, by taking expectation we obtain that δt≤ 1−αγt−1
δt−1+ 2σ2γt−12 .
Note that unfolding the induction formula leads to an explicit upper-bound forδt:
δt≤
t−1
Y
k=0
1−µγk + 2σ2
t−1
X
k=0
γk2
t−1
Y
i=k+1
1−µγi .
41
Proof 2/2: solving the induction
One may either use the closed form forδt, or (with the hint of the corresponding ODE) set ut= (t+ 2κ2)δtand note that
ut= (t+ 2κ2)δt
≤(t+ 2κ2)
1−αγt−1 ut−1
(t−1 + 2κ2)+ 2σ2γ2t−1
≤(t+ 2κ2)t−1 + 2κ2 t+ 2κ2
ut−1
(t−1 + 2κ2)+2σ2(t+ 2κ2) α2(t+ 2κ2)2
=ut−1+2σ2 α2
1 (t+ 2κ2)
≤u0+2σ2 α2
t
X
s=1
1 (s+ 2κ2)
≤2κ2δ0+2σ2
α2 logt+ 2κ2 2κ2 . Hence for everyt
δt≤ 2κ2kx0−x∗k2+2σ2
α2 logt+2κ2
2κ2
t+ 2κ2 .
Remark: with some more technical work, the analysis works for allγt, possibly of the form γt=t−βforβ≤1 : see [Bach&Moulines ’11].
42