HAL Id: hal-01618379 (https://hal.archives-ouvertes.fr/hal-01618379). Preprint submitted on 17 Oct 2017. Also available as arXiv:1706.07270v1 [math.OC], 22 Jun 2017.

NONLINEAR ACCELERATION OF STOCHASTIC ALGORITHMS

DAMIEN SCIEUR, ALEXANDRE D’ASPREMONT, AND FRANCIS BACH

ABSTRACT. Extrapolation methods use the last few iterates of an optimization algorithm to produce a better estimate of the optimum. They were shown to achieve optimal convergence rates in a deterministic setting using simple gradient iterates. Here, we study extrapolation methods in a stochastic setting, where the iterates are produced by either a simple or an accelerated stochastic gradient algorithm. We first derive convergence bounds for arbitrary, potentially biased perturbations, then produce asymptotic bounds using the ratio between the variance of the noise and the accuracy of the current point. Finally, we apply this acceleration technique to stochastic algorithms such as SGD, SAGA, SVRG and Katyusha in different settings, and show significant performance gains.

Key words and phrases: acceleration, stochastic, extrapolation.

1. INTRODUCTION

We focus on the problem

\[
\min_{x \in \mathbb{R}^d} f(x)  \qquad (1)
\]

where f(x) is a smooth and strongly convex function with respect to the Euclidean norm. We consider a stochastic first-order oracle, which gives a noisy estimate of the gradient of f(x), with

\[
\nabla_\varepsilon f(x) = \nabla f(x) + \varepsilon,  \qquad (2)
\]

where ε is a noise term with bounded variance. This is the case, for example, when f is a sum of strongly convex functions and we only have access to the gradient of one randomly selected function. Stochastic optimization (2) is typically challenging, as classical algorithms are not convergent (for example, gradient descent or Nesterov's accelerated gradient). Even the averaged version of stochastic gradient descent with constant step size does not converge to the solution of (1), but to another point whose proximity to the real minimizer depends on the step size [Nedić and Bertsekas, 2001; Moulines and Bach, 2011].
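For concreteness, here is a minimal NumPy sketch (ours, not from the paper) of such an oracle for a finite sum of quadratics, where the noise ε is the gap between the sampled gradient and the full gradient:

```python
import numpy as np

def stochastic_gradient(A, b, x, rng):
    """Stochastic oracle for f(x) = (1/(2N)) * sum_i (a_i^T x - b_i)^2.

    Sampling one row uniformly gives an unbiased estimate of the full
    gradient (1/N) A^T (A x - b); the deviation from it plays the role
    of the noise term epsilon in (2)."""
    i = rng.integers(A.shape[0])        # index of the sampled function f_i
    return A[i] * (A[i] @ x - b[i])     # gradient of the sampled f_i at x

# usage: rng = np.random.default_rng(); g = stochastic_gradient(A, b, x, rng)
```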

When f is a finite sum of N functions, algorithms such as SAG [Schmidt et al., 2013], SAGA [Defazio et al., 2014], SDCA [Shalev-Shwartz and Zhang, 2013] and SVRG [Johnson and Zhang, 2013] accelerate convergence using a variance reduction technique akin to control variates in Monte-Carlo methods. Their rate of convergence depends on 1 − µ/L and thus does not exhibit an accelerated rate on par with the deterministic setting (in 1 − √(µ/L)). Recently, a generic acceleration algorithm called Catalyst [Lin et al., 2015], based on the proximal point method, improved this rate of convergence, at least in theory. Unfortunately, numerical experiments show this algorithm to be conservative, thus limiting its practical performance. On the other hand, recent papers, for example [Shalev-Shwartz and Zhang, 2014] (Accelerated SDCA) and [Allen-Zhu, 2016] (Katyusha), propose algorithms with accelerated convergence rates, provided the strong convexity parameter is given.

When f is a quadratic function, averaged SGD converges, but the rate of decay of the initial conditions is very slow. Recently, some results have focused on accelerated versions of SGD for quadratic optimization, showing that with a two-step recursion it is possible to enjoy both the optimal rate for the bias term and for the variance term [Flammarion and Bach, 2015], given an estimate of the ratio between the distance to the solution and the variance of ε.

A novel generic acceleration technique was recently proposed by Scieur et al. [2016] in the deterministic setting. It uses iterates from a slow algorithm to extrapolate estimates of the solution with an asymptotically optimal convergence rate. Moreover, this rate is reached without prior knowledge of the strong convexity constant, whose online estimation is still a challenge, even in the deterministic case [Fercoq and Qu, 2016].

Convergence bounds are derived by Scieur et al. [2016] by tracking the difference between the deterministic first-order oracle of (1) and iterates from a linearized model. The main contribution of this paper is to extend this analysis to arbitrary perturbations, including stochastic ones, and to present numerical results when this acceleration method is used to speed up stochastic optimization algorithms.

In Section 2 we recall the extrapolation algorithm, and quickly summarize its main convergence bounds in Section 3. In Section 4, we consider a stochastic oracle and analyze its asymptotic convergence in Section 5. Finally, in Section 6 we describe numerical experiments which confirm the theoretical bounds and show the practical efficiency of this acceleration.

2. REGULARIZED NONLINEAR ACCELERATION

Consider the optimization problem

\[
\min_{x \in \mathbb{R}^d} f(x)
\]

where f(x) is an L-smooth and µ-strongly convex function [Nesterov, 2013]. Applying the fixed-step gradient method to this problem yields the following iterates

\[
\tilde{x}_{t+1} = \tilde{x}_t - \tfrac{1}{L}\nabla f(\tilde{x}_t).  \qquad (3)
\]

Let x^* be the unique optimal point; this algorithm is proved to converge with

\[
\|\tilde{x}_t - x^*\| \le (1 - \kappa)^t \|\tilde{x}_0 - x^*\|  \qquad (4)
\]

where ‖·‖ stands for the ℓ₂ norm and κ = µ/L ∈ [0, 1[ is the (inverse of the) condition number of f [Nesterov, 2013]. Using a two-step recurrence, the accelerated gradient descent of Nesterov [2013] achieves an improved convergence rate

\[
\|\tilde{x}_t - x^*\| \le O\big((1 - \sqrt{\kappa})^t \|\tilde{x}_0 - x^*\|\big).  \qquad (5)
\]

Indeed, (5) converges faster than (4), but the accelerated algorithm requires the knowledge of µ and L. Extrapolation techniques, however, obtain a similar convergence rate without needing estimates of the parameters µ and L. The idea is based on the comparison between the process followed by the x̃_i and a linearized model around the optimum, written

\[
x_{t+1} = x_t - \tfrac{1}{L}\big(\nabla f(x^*) + \nabla^2 f(x^*)(x_t - x^*)\big), \quad x_0 = \tilde{x}_0,
\]

which can be rewritten as

\[
x_{t+1} - x^* = \big(I - \tfrac{1}{L}\nabla^2 f(x^*)\big)(x_t - x^*), \quad x_0 = \tilde{x}_0.  \qquad (6)
\]

A better estimate of the optimum in (6) can be obtained by forming a linear combination of the iterates (see [Anderson, 1965; Cabay and Jackson, 1976; Mešina, 1977]), with

\[
\Big\|\sum_{i=0}^{t} c_i x_i - x^*\Big\| \ll \|x_t - x^*\|,
\]

for some specific c_i (either data driven, or derived from Chebyshev polynomials). These procedures were limited to quadratic functions only, i.e. when x̃_i = x_i, but this was recently extended to generic convex problems by Scieur et al. [2016]; we briefly recall these results below.

To simplify the notation, we define the function g(x) and the step

\[
\tilde{x}_{t+1} = g(\tilde{x}_t),  \qquad (7)
\]

where g(x) is differentiable, Lipschitz-continuous with constant (1 − κ) < 1, g(x^*) = x^* and ∇g(x^*) is symmetric. For example, the gradient method (3) matches this definition exactly, with g(x) = x − ∇f(x)/L.

Running k steps of (7) produces a sequence {x̃_0, ..., x̃_{k+1}}, which we extrapolate using Algorithm 1 from Scieur et al. [2016].

Algorithm 1: Regularized Nonlinear Acceleration (RNA)

Input: Iterates x̃_0, x̃_1, ..., x̃_{k+1} ∈ ℝ^d produced by (7), and a regularization parameter λ > 0.
1: Compute R̃ = [r̃_0, ..., r̃_k], where r̃_i = x̃_{i+1} − x̃_i is the i-th residual.
2: Solve
\[
\tilde{c}^\lambda = \mathop{\mathrm{argmin}}_{c^T \mathbf{1} = 1} \|\tilde{R}c\|^2 + \lambda\|c\|^2,
\]
or equivalently solve (R̃^T R̃ + λI)z = 1 and set c̃^λ = z/(1^T z).
Output: Approximation of x^* computed as ∑_{i=0}^{k} c̃_i^λ x̃_i.

For a good choice of λ, the output of Algorithm 1 is a much better estimate of the optimum than x̃_{k+1} (or any other point of the sequence). Using a simple grid search on a few values of λ is usually sufficient to improve convergence (see [Scieur et al., 2016] for more details).
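Algorithm 1 is short enough to write out directly. Below is a minimal NumPy sketch (the function name and interface are ours); it mirrors the two steps above: build the residual matrix, then solve the regularized linear system and normalize.

```python
import numpy as np

def rna(xs, lam):
    """Regularized Nonlinear Acceleration (Algorithm 1).

    xs  : sequence of iterates x_0, ..., x_{k+1}, shape (k+2, d)
    lam : regularization parameter lambda > 0
    Returns the extrapolated approximation of the optimum.
    """
    X = np.asarray(xs)
    R = np.diff(X, axis=0)                  # residuals r_i = x_{i+1} - x_i
    RtR = R @ R.T                           # (k+1) x (k+1) Gram matrix R~^T R~
    ones = np.ones(len(RtR))
    z = np.linalg.solve(RtR + lam * np.eye(len(RtR)), ones)
    c = z / z.sum()                         # normalize so that c^T 1 = 1
    return c @ X[:-1]                       # extrapolation sum_i c_i x_i
```

A grid search on λ then simply repeats the (cheap) solve for a few values and keeps the best extrapolated point.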

3. CONVERGENCE OF REGULARIZED NONLINEAR ACCELERATION

We quickly summarize the argument behind the convergence of Algorithm 1. The theoretical bound compares the x̃_i to the iterates produced by the linearized model

\[
x_{t+1} = x^* + \nabla g(x^*)(x_t - x^*), \quad x_0 = \tilde{x}_0.  \qquad (8)
\]

We write c^λ for the coefficients computed by Algorithm 1 from the "linearized" sequence {x_0, ..., x_{k+1}}, and the error term can be decomposed into three parts,

\[
\Big\|\sum_{i=0}^{k} \tilde{c}_i^\lambda \tilde{x}_i - x^*\Big\|
\le \underbrace{\Big\|\sum_{i=0}^{k} c_i^\lambda x_i - x^*\Big\|}_{\text{Acceleration}}
+ \underbrace{\Big\|\sum_{i=0}^{k} (\tilde{c}_i^\lambda - c_i^\lambda)(x_i - x^*)\Big\|}_{\text{Stability}}
+ \underbrace{\Big\|\sum_{i=0}^{k} \tilde{c}_i^\lambda (\tilde{x}_i - x_i)\Big\|}_{\text{Nonlinearity}}.  \qquad (9)
\]

Convergence is guaranteed as long as the errors (x̃_i − x^*) and (x_i − x̃_i) converge to zero fast enough, which ensures a good rate of decay for the regularization parameter λ, leading to an asymptotic rate equivalent to the accelerated rate (5).

The Stability term (in c̃^λ − c^λ) is bounded using the perturbation matrix

\[
P \triangleq R^T R - \tilde{R}^T \tilde{R}  \qquad (10)
\]

where R and R̃ are the matrices of residuals,

\[
R \triangleq [r_0, \ldots, r_k], \quad r_t = x_{t+1} - x_t,  \qquad (11)
\]
\[
\tilde{R} \triangleq [\tilde{r}_0, \ldots, \tilde{r}_k], \quad \tilde{r}_t = \tilde{x}_{t+1} - \tilde{x}_t.  \qquad (12)
\]

The proofs of the following propositions were obtained by Scieur et al. [2016].

Proposition 3.1 (Stability). Let Δc^λ = c̃^λ − c^λ be the gap between the coefficients computed by Algorithm 1 using the sequences {x̃_i} and {x_i} with regularization parameter λ. Let P = R^T R − R̃^T R̃ be defined in (10), (11) and (12). Then

\[
\|\Delta c^\lambda\| \le \frac{\|P\|}{\lambda}\|c^\lambda\|.  \qquad (13)
\]

This implies that the stability term is bounded by

\[
\Big\|\sum_{i=0}^{k} \Delta c_i^\lambda (x_i - x^*)\Big\| \le \frac{\|P\|}{\lambda}\|c^\lambda\|\, O(\|x_0 - x^*\|).  \qquad (14)
\]

The Nonlinearity term is bounded by the norm of the coefficients c̃^λ (controlled thanks to the regularization parameter) times the norm of the noise matrix

\[
E = [x_0 - \tilde{x}_0,\; x_1 - \tilde{x}_1,\; \ldots,\; x_k - \tilde{x}_k].  \qquad (15)
\]

Proposition 3.2 (Nonlinearity). Let c̃^λ be computed by Algorithm 1 using the sequence {x̃_0, ..., x̃_{k+1}} with regularization parameter λ, and let R̃ be defined in (12). The norm of c̃^λ is bounded by

\[
\|\tilde{c}^\lambda\| \le \sqrt{\frac{\lambda + \|\tilde{R}\|^2}{(k+1)\lambda}} \le \frac{1}{\sqrt{k+1}}\sqrt{1 + \frac{\|\tilde{R}\|^2}{\lambda}}.  \qquad (16)
\]

This bounds the nonlinearity term because

\[
\Big\|\sum_{i=0}^{k} \tilde{c}_i^\lambda (\tilde{x}_i - x_i)\Big\| \le \sqrt{1 + \frac{\|\tilde{R}\|^2}{\lambda}}\, \frac{\|E\|}{\sqrt{k+1}},  \qquad (17)
\]

where E is defined in (15).

These two propositions show that the regularization in Algorithm 1 limits the impact of the noise: the higher λ is, the smaller these terms are. It remains to control the acceleration term. We introduce the normalized regularization value λ̄, written

\[
\bar{\lambda} \triangleq \frac{\lambda}{\|x_0 - x^*\|^2}.  \qquad (18)
\]

For small λ̄, this term decreases as fast as the accelerated rate (5), as shown in the following proposition.

Proposition 3.3 (Acceleration). Let P_k be the subspace of polynomials of degree at most k and let S_κ(k, α) be the solution of the Regularized Chebyshev Polynomial problem,

\[
S_\kappa(k, \alpha) \triangleq \min_{p \in \mathcal{P}_k} \max_{x \in [0, 1-\kappa]} p^2(x) + \alpha\|p\|^2 \quad \text{s.t.} \quad p(1) = 1.  \qquad (19)
\]

Let λ̄ be the normalized value of λ defined in (18). The acceleration term is bounded by

\[
\Big\|\sum_{i=0}^{k} c_i^\lambda x_i - x^*\Big\| \le \frac{1}{\kappa}\sqrt{S_\kappa(k, \bar{\lambda})\|x_0 - x^*\|^2 - \lambda\|c^\lambda\|^2}.  \qquad (20)
\]

We also get the following corollary, which will be useful for the asymptotic analysis of the rate of convergence of Algorithm 1.

Corollary 3.4. If λ → 0, the bound (20) becomes

\[
\Big\|\sum_{i=0}^{k} c_i^\lambda x_i - x^*\Big\| \le \frac{1}{\kappa}\left(\frac{1 - \sqrt{\kappa}}{1 + \sqrt{\kappa}}\right)^{k} \|x_0 - x^*\|.
\]

These last results controlling stability, nonlinearity and acceleration are proved by Scieur et al. [2016]. We now refine the final step of Scieur et al. [2016] to produce a global bound on the error that will allow us to extend these results to the stochastic setting in the next sections.

Theorem 3.5. If Algorithm 1 is applied to the sequence x̃_i with regularization parameter λ, it converges with rate

\[
\Big\|\sum_{i=0}^{k} \tilde{c}_i^\lambda \tilde{x}_i - x^*\Big\| \le \|x_0 - x^*\|\, S_\kappa(k, \bar{\lambda})\sqrt{\frac{1}{\kappa^2} + O(\|x_0 - x^*\|^2)\frac{\|P\|^2}{\lambda^3}} + \frac{\|E\|}{\sqrt{k+1}}\sqrt{1 + \frac{\|\tilde{R}\|^2}{\lambda}}.  \qquad (21)
\]

Proof. The proof is inspired by Scieur et al. [2016] and is also very similar to the proof of Proposition 5.2. We can bound (9) using (14) (Stability), (17) (Nonlinearity) and (20) (Acceleration). It remains to maximize over the value of ‖c^λ‖ using the result of Proposition 8.2.

This last bound is not very explicit, but an asymptotic analysis simplifies it considerably. The next proposition, which is new, shows that when x_0 is close to x^*, extrapolation converges as fast as (5) in some cases.

Proposition 3.6. Assume ‖R̃‖ = O(‖x_0 − x^*‖), ‖E‖ = O(‖x_0 − x^*‖²) and ‖P‖ = O(‖x_0 − x^*‖³), which is satisfied when the fixed-step gradient method is applied to a twice differentiable, smooth and strongly convex function with Lipschitz-continuous Hessian. If we choose λ = O(‖x_0 − x^*‖^s) with s ∈ ]2, 8/3[, then the bound (21) becomes

\[
\lim_{\|x_0 - x^*\| \to 0} \frac{\big\|\sum_{i=0}^{k} \tilde{c}_i^\lambda \tilde{x}_i - x^*\big\|}{\|x_0 - x^*\|} \le \frac{1}{\kappa}\left(\frac{1 - \sqrt{\kappa}}{1 + \sqrt{\kappa}}\right)^{k}.
\]

Proof. The proof is based on the fact that λ decreases slowly enough to ensure that the Stability and Nonlinearity terms vanish over time, but fast enough to have λ̄ → 0. If λ = ‖x_0 − x^*‖^s, equation (21) becomes

\[
\frac{\big\|\sum_{i=0}^{k} \tilde{c}_i^\lambda \tilde{x}_i - x^*\big\|}{\|x_0 - x^*\|} \le S_\kappa\big(k, \|x_0 - x^*\|^{s-2}\big)\sqrt{\frac{1}{\kappa^2} + \frac{O(1)\|P\|^2}{\|x_0 - x^*\|^{3s-2}}} + \frac{\|E\|}{\|x_0 - x^*\|\sqrt{k+1}}\sqrt{1 + \frac{\|\tilde{R}\|^2}{\|x_0 - x^*\|^s}}.
\]

By the assumptions on ‖R̃‖, ‖E‖ and ‖P‖, the previous bound becomes

\[
\frac{\big\|\sum_{i=0}^{k} \tilde{c}_i^\lambda \tilde{x}_i - x^*\big\|}{\|x_0 - x^*\|} \le S_\kappa\big(k, \|x_0 - x^*\|^{s-2}\big)\sqrt{\frac{1}{\kappa^2} + O\big(\|x_0 - x^*\|^{8-3s}\big)} + O\Big(\sqrt{\|x_0 - x^*\|^2 + \|x_0 - x^*\|^{4-s}}\Big).
\]

Because s ∈ ]2, 8/3[, all exponents of ‖x_0 − x^*‖ are positive. Consequently,

\[
\lim_{\|x_0 - x^*\| \to 0} \frac{\big\|\sum_{i=0}^{k} \tilde{c}_i^\lambda \tilde{x}_i - x^*\big\|}{\|x_0 - x^*\|} \le \frac{1}{\kappa} S_\kappa(k, 0).
\]

Finally, the desired result is obtained by using Corollary 3.4.

The efficiency of Algorithm 1 is thus ensured by two conditions. First, we need to be able to bound ‖R̃‖, ‖P‖ and ‖E‖ by decreasing quantities. Second, we have to find a proper rate of decay for λ and λ̄ such that the stability and nonlinearity terms go to zero when the perturbations also go to zero. If these two conditions are met, then the accelerated rate in Proposition 3.6 holds.

4. NONLINEAR AND NOISY UPDATES

In (7) we defined g(x) to be nonlinear, which generates a sequence x̃_i. We now consider noisy iterates

\[
\tilde{x}_{t+1} = g(\tilde{x}_t) + \eta_{t+1}  \qquad (22)
\]

where η_t is a stochastic noise. To simplify the notation, we write (22) as

\[
\tilde{x}_{t+1} = x^* + G(\tilde{x}_t - x^*) + \varepsilon_{t+1},  \qquad (23)
\]

where ε_t is a stochastic noise (potentially correlated with the iterates x_i) with bounded mean ν_t, ‖ν_t‖ ≤ ν, and bounded covariance Σ_t ⪯ (σ²/d)I. We also assume 0 ⪯ G ⪯ (1 − κ)I and G symmetric. For example, (23) can be linked to (22) if we set ε_t = η_t + O(‖x̃_t − x^*‖²). This corresponds to the combination of the noise η_{t+1} with the Taylor remainder of g(x) around x^*.

The recursion (23) is also valid when we apply the stochastic gradient method to the quadratic problem

\[
\min_x \; \tfrac{1}{2}\|Ax - b\|^2.
\]

This corresponds to (23) with G = I − hA^T A and mean ν = 0. For the theoretical results, we will compare the x̃_t with their noiseless counterparts to control convergence,

\[
x_{t+1} = x^* + G(x_t - x^*), \quad x_0 = \tilde{x}_0.  \qquad (24)
\]
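As an illustration (our toy construction, not from the paper), the following snippet generates the perturbed sequence (23) and its noiseless counterpart (24) side by side for a random quadratic problem, with G = I − hAᵀA:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, T, sigma = 20, 0.05, 100, 0.1
A = rng.standard_normal((d, d)) / np.sqrt(d)
b = rng.standard_normal(d)
x_star = np.linalg.solve(A.T @ A, A.T @ b)    # minimizer of (1/2)||Ax - b||^2
G = np.eye(d) - h * (A.T @ A)                 # linear part of recursions (23)-(24)

x_noisy = x_clean = rng.standard_normal(d)    # common starting point x_0
for t in range(T):
    eps = sigma / np.sqrt(d) * rng.standard_normal(d)   # mean 0, Cov = (sigma^2/d) I
    x_noisy = x_star + G @ (x_noisy - x_star) + eps     # recursion (23)
    x_clean = x_star + G @ (x_clean - x_star)           # recursion (24)
```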

5. CONVERGENCE ANALYSIS WHEN ACCELERATING STOCHASTIC ALGORITHMS

We will control convergence in expectation. Bound (9) now becomes

\[
\mathbb{E}\Big[\Big\|\sum_{i=0}^{k} \tilde{c}_i^\lambda \tilde{x}_i - x^*\Big\|\Big] \le \Big\|\sum_{i=0}^{k} c_i^\lambda x_i - x^*\Big\| + O(\|x_0 - x^*\|)\,\mathbb{E}\big[\|\Delta c^\lambda\|\big] + \mathbb{E}\big[\|\tilde{c}^\lambda\|\|E\|\big].  \qquad (25)
\]

We now need to enforce the bounds (14), (17) and (20) in expectation. For simplicity, we will omit all constants in what follows.

Proposition 5.1. Consider the sequences x_i and x̃_i generated by (22) and (24). Then

\[
\mathbb{E}[\|\tilde{R}\|] \le O(\|x_0 - x^*\|) + O(\nu + \sigma),  \qquad (26)
\]
\[
\mathbb{E}[\|E\|] \le O(\nu + \sigma),  \qquad (27)
\]
\[
\mathbb{E}[\|P\|] \le O\big((\sigma + \nu)\|x_0 - x^*\|\big) + O\big((\nu + \sigma)^2\big).  \qquad (28)
\]

Proof. First, we have to form the matrices R̃, E and P. We begin with E, defined in (15). Indeed,

\[
E_i = x_i - \tilde{x}_i \;\Rightarrow\; E_1 = \varepsilon_1, \quad E_2 = \varepsilon_2 + G\varepsilon_1, \quad \ldots, \quad E_k = \sum_{i=1}^{k} G^{k-i}\varepsilon_i.
\]

This means that each ‖E_i‖ = O(‖ε_i‖). Using (33),

\[
\mathbb{E}[\|E\|] \le \sum_i \mathbb{E}[\|E_i\|] \le \sum_i \big(\mathbb{E}[\|E_i - \nu_i\|] + \|\nu_i\|\big) \le O(\nu + \sigma).
\]

For R̃, we notice that

\[
\tilde{r}_t = \tilde{x}_{t+1} - \tilde{x}_t = r_t + E_{t+1} - E_t.
\]

We get (26) by splitting the norm,

\[
\mathbb{E}[\|\tilde{R}\|] \le \|R\| + O\big(\mathbb{E}[\|E\|]\big) \le O(\|x_0 - x^*\|) + O(\nu + \sigma).
\]

Finally, by definition of P,

\[
\|P\| \le 2\|E\|\|R\| + \|E\|^2.
\]

Taking the expectation leads to the desired result,

\[
\mathbb{E}[\|P\|] \le 2\,\mathbb{E}\big[\|E\|\|R\|\big] + \mathbb{E}\big[\|E\|^2\big] \le 2\|R\|\,\mathbb{E}[\|E\|] + \mathbb{E}\big[\|E\|_F^2\big] \le O\big(\|x_0 - x^*\|(\sigma + \nu)\big) + O\big((\sigma + \nu)^2\big).
\]

We define the following stochastic condition number

\[
\tau \triangleq \frac{\nu + \sigma}{\|x_0 - x^*\|}.
\]

Proposition 5.2 gives the result of injecting these bounds into (25).

Proposition 5.2. The accuracy of the extrapolation Algorithm 1, applied to the sequence {x̃_0, ..., x̃_k} generated by (22), is bounded by

\[
\frac{\mathbb{E}\big[\|\sum_{i=0}^{k} \tilde{c}_i^\lambda \tilde{x}_i - x^*\|\big]}{\|x_0 - x^*\|} \le S_\kappa(k, \bar{\lambda})\sqrt{\frac{1}{\kappa^2} + \frac{O\big(\tau^2(1+\tau)^2\big)}{\bar{\lambda}^3}} + O\left(\sqrt{\tau^2 + \frac{\tau^2(1+\tau)^2}{\bar{\lambda}}}\right).  \qquad (29)
\]

Proof. We start with (25), then use (13):

\[
\mathbb{E}\Big[\Big\|\sum_{i=0}^{k} \tilde{c}_i^\lambda \tilde{x}_i - x^*\Big\|\Big] \le \Big\|\sum_{i=0}^{k} c_i^\lambda x_i - x^*\Big\| + O(\|x_0 - x^*\|)\,\mathbb{E}\big[\|\Delta c^\lambda\|\big] + \mathbb{E}\big[\|\tilde{c}^\lambda\|\|E\|\big]
\]
\[
\le \Big\|\sum_{i=0}^{k} c_i^\lambda x_i - x^*\Big\| + O(\|x_0 - x^*\|)\frac{\|c^\lambda\|}{\lambda}\,\mathbb{E}\big[\|P\|\big] + \sqrt{\mathbb{E}\big[\|\tilde{c}^\lambda\|^2\big]\,\mathbb{E}\big[\|E\|^2\big]}.
\]

The first term can be bounded by (20),

\[
\Big\|\sum_{i=0}^{k} c_i^\lambda x_i - x^*\Big\| \le \frac{1}{\kappa}\sqrt{S_\kappa(k, \bar{\lambda})\|x_0 - x^*\|^2 - \lambda\|c^\lambda\|^2}.
\]

We combine this bound with the second term by maximizing over ‖c^λ‖. The optimal value is given in (34), so

\[
\Big\|\sum_{i=0}^{k} c_i^\lambda x_i - x^*\Big\| + O(\|x_0 - x^*\|)\frac{\|c^\lambda\|}{\lambda}\,\mathbb{E}\big[\|P\|\big] \le \|x_0 - x^*\|\,S_\kappa(k, \bar{\lambda})\sqrt{\frac{1}{\kappa^2} + O(\|x_0 - x^*\|^2)\frac{\mathbb{E}[\|P\|]^2}{\lambda^3}},
\]

where λ̄ = λ/‖x_0 − x^*‖². Since, by Proposition 5.1,

\[
\mathbb{E}[\|P\|]^2 \le O\big((\nu + \sigma)^2(\|x_0 - x^*\| + \nu + \sigma)^2\big),
\]

we have

\[
\Big\|\sum_{i=0}^{k} c_i^\lambda x_i - x^*\Big\| + O(\|x_0 - x^*\|)\frac{\|c^\lambda\|}{\lambda}\,\mathbb{E}\big[\|P\|\big] \le \|x_0 - x^*\|\,S_\kappa(k, \bar{\lambda})\sqrt{\frac{1}{\kappa^2} + \frac{O\big(\|x_0 - x^*\|^2(\nu + \sigma)^2\big)(\|x_0 - x^*\| + \nu + \sigma)^2}{\lambda^3}}.  \qquad (30)
\]

The last term can be bounded using (16),

\[
\sqrt{\mathbb{E}\big[\|\tilde{c}^\lambda\|^2\big]\,\mathbb{E}\big[\|E\|^2\big]}
\le O\Big(\Big(\sum_{i=0}^{k} \mathbb{E}[\|E_i\|^2]\Big)^{1/2}\mathbb{E}\big[\|\tilde{c}^\lambda\|^2\big]^{1/2}\Big)
\le O\Big((\nu + \sigma)\sqrt{\mathbb{E}\big[\|\tilde{c}^\lambda\|^2\big]}\Big)
\le O\left((\nu + \sigma)\sqrt{\mathbb{E}\Big[\frac{1 + \|\tilde{R}\|^2}{\lambda}\Big]}\right)
\le O\left((\nu + \sigma)\sqrt{\frac{1 + \mathbb{E}\big[\|\tilde{R}\|_F^2\big]}{\lambda}}\right).
\]

However,

\[
\mathbb{E}\big[\|\tilde{R}\|_F^2\big] = \sum_{i=0}^{k} \mathbb{E}\big[\|\tilde{r}_i\|^2\big] = \sum_{i=0}^{k}\Big(\|r_i\|^2 + \mathbb{E}\big[r_i^T E_i + \|E_i\|^2\big]\Big) \le O\big(\|x_0 - x^*\|^2 + (\nu + \sigma)\|x_0 - x^*\| + (\nu + \sigma)^2\big) \le O\big((\|x_0 - x^*\| + \nu + \sigma)^2\big).
\]

Finally,

\[
\sqrt{\mathbb{E}\big[\|\tilde{c}^\lambda\|^2\big]\,\mathbb{E}\big[\|E\|^2\big]} \le O\left((\nu + \sigma)\sqrt{1 + \frac{(\|x_0 - x^*\| + \nu + \sigma)^2}{\lambda}}\right).  \qquad (31)
\]

We get (29) by summing (30) and (31), then replacing (ν + σ)/‖x_0 − x^*‖ by τ and λ/‖x_0 − x^*‖² by λ̄.

Consider a situation where τ is small, e.g., when using stochastic gradient descent with a fixed step size and x_0 far from x^*. The following proposition details the dependence between λ̄ and τ which ensures that the upper convergence bound remains stable as τ goes to zero.

Proposition 5.3. When τ → 0, if λ̄ = Θ(τ^s) with s ∈ ]0, 2/3[, we have the accelerated rate

\[
\mathbb{E}\Big[\Big\|\sum_{i=0}^{k} \tilde{c}_i^\lambda \tilde{x}_i - x^*\Big\|\Big] \le \frac{1}{\kappa}\left(\frac{1 - \sqrt{\kappa}}{1 + \sqrt{\kappa}}\right)^{k}\|x_0 - x^*\|.  \qquad (32)
\]

Moreover, if λ → ∞, we recover the averaged gradient,

\[
\mathbb{E}\Big[\Big\|\sum_{i=0}^{k} \tilde{c}_i^\lambda \tilde{x}_i - x^*\Big\|\Big] = \mathbb{E}\Big[\Big\|\frac{1}{k+1}\sum_{i=0}^{k} \tilde{x}_i - x^*\Big\|\Big].
\]

Proof. Let λ̄ = Θ(τ^s); using (29) we have

\[
\mathbb{E}\Big[\Big\|\sum_{i=0}^{k} \tilde{c}_i^\lambda \tilde{x}_i - x^*\Big\|\Big] \le \|x_0 - x^*\|\,S_\kappa(k, \tau^s)\sqrt{\frac{1}{\kappa^2} + O\big(\tau^{2-3s}(1+\tau)^2\big)} + \|x_0 - x^*\|\,O\Big(\sqrt{\tau^2 + \tau^{2-3s}(1+\tau)^2}\Big).
\]

Because s ∈ ]0, 2/3[, we have 2 − 3s > 0, thus lim_{τ→0} τ^{2−3s} = 0. The limit when τ → 0 is thus exactly (32).

If λ → ∞, we also have

\[
\lim_{\lambda \to \infty} \tilde{c}^\lambda = \lim_{\lambda \to \infty}\mathop{\mathrm{argmin}}_{c:\,\mathbf{1}^T c = 1}\|\tilde{R}c\|^2 + \lambda\|c\|^2 = \mathop{\mathrm{argmin}}_{c:\,\mathbf{1}^T c = 1}\|c\|^2 = \frac{\mathbf{1}}{k+1},
\]

which yields the desired result.
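The λ → ∞ limit is easy to check numerically: for λ much larger than ‖R̃ᵀR̃‖, the solution of (R̃ᵀR̃ + λI)z = 1 is nearly proportional to 1, so the extrapolation weights collapse to uniform averaging. A quick sanity check (ours):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((12, 5))             # any k+2 = 12 iterates in dimension 5
R = np.diff(X, axis=0)                       # k+1 = 11 residuals
lam = 1e12 * np.linalg.norm(R @ R.T)         # lambda >> ||R~^T R~||
z = np.linalg.solve(R @ R.T + lam * np.eye(11), np.ones(11))
c = z / z.sum()
assert np.allclose(c, np.full(11, 1 / 11), atol=1e-8)   # uniform weights 1/(k+1)
```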

Proposition 5.3 shows that Algorithm 1 is thus asymptotically optimal provided λ is well chosen, since it recovers the accelerated rate for smooth and strongly convex functions when the perturbations go to zero. Moreover, we recover Proposition 3.6 when ε_t is the Taylor remainder, i.e. with ν = O(‖x_0 − x^*‖²) and σ = 0, which matches the deterministic results.

Algorithm 1 is particularly efficient when combined with a restart scheme [Scieur et al., 2016], sketched below. From a theoretical point of view, the acceleration peak arises for small values of k. Empirically, the improvement is usually more important at the beginning, i.e. when k is small. Finally, the algorithmic complexity is O(k²d), which is linear in the problem dimension when k remains bounded.
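A minimal sketch of the restarted scheme (our own loop structure, reusing the rna function from Section 2): run k+1 steps of the base method, extrapolate, and restart the method from the extrapolated point.

```python
def rna_restart(step, x0, k=10, lam=1e-6, n_restarts=20):
    """Restarted extrapolation.

    step : one pass of the base algorithm, mapping x_t to x_{t+1}
           (e.g. an epoch of SGD, SAGA or SVRG)
    Each window of k+1 steps is extrapolated with RNA and the result
    becomes the new starting point.
    """
    x = x0
    for _ in range(n_restarts):
        window = [x]
        for _ in range(k + 1):
            x = step(x)
            window.append(x)                # collect x_0, ..., x_{k+1}
        x = rna(window, lam)                # extrapolate, then restart
    return x
```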

The benefits of extrapolation are limited in a regime where the noise dominates. However, when τ is relatively small, we can expect a significant speedup. This condition is satisfied in many cases, for example during the initial phase of stochastic gradient descent, or when optimizing a sum of functions with variance reduction techniques such as SAGA or SVRG.

6. NUMERICAL EXPERIMENTS

6.1. Stochastic gradient descent. We want to solve the least-squares problem

\[
\min_{x \in \mathbb{R}^d} F(x) = \tfrac{1}{2}\|Ax - b\|^2,
\]

where A^T A satisfies µI ⪯ A^T A ⪯ LI. To solve this problem, we have access to the stochastic first-order oracle

\[
\nabla_\varepsilon F(x) = \nabla F(x) + \varepsilon,
\]

where ε is a zero-mean noise with covariance matrix Σ ⪯ (σ²/d)I. We compare several methods.

• SGD. Fixed step size, x_{t+1} = x_t − (1/L)∇_ε F(x_t).
• Averaged SGD. Iterate x_k is the mean of the first k iterations of SGD.
• AccSGD. The optimal two-step algorithm of Flammarion and Bach [2015], with optimal parameters (this implies that ‖x_0 − x^*‖ and σ are known exactly).
• RNA+SGD. The regularized nonlinear acceleration Algorithm 1 applied to a sequence of k iterates of SGD, with k = 10 and λ = 10⁻⁶‖R̃ᵀR̃‖.

By Proposition 5.2, we know that RNA+SGD will not converge to arbitrary precision because the noise is additive with non-vanishing variance. However, Proposition 5.3 predicts an improvement of the convergence at the beginning of the process. We illustrate this behavior in Figure 3: at the beginning, the performance of RNA+SGD is comparable to that of the optimal accelerated algorithm. However, because of the restart strategy, in the regime where the level of noise becomes more important the acceleration becomes less effective and the convergence eventually stalls, as for SGD. Of course, for practical purposes, the first regime is the most important because it effectively minimizes the generalization error [Défossez and Bach, 2015; Jain et al., 2016].

Figure 1 (plots omitted). Error f(x) − f(x^*) vs. iteration for SGD, averaged SGD, accelerated SGD and RNA+SGD. Left: σ = 10, κ = 10⁻². Center: σ = 1000, κ = 10⁻². Right: σ = 1000, κ = 10⁻⁶.

Figure 2 (plots omitted). Same comparison. Left: σ = 10, κ = 1/d. Center: σ = 100, κ = 1/d. Right: σ = 1000, κ = 1/d.

Figure 3. Comparison of performance between SGD, averaged SGD, Accelerated SGD [Flammarion and Bach, 2015] and RNA+SGD. We tested the performance on a matrix AᵀA of size d = 500, with (top, Figure 1) random eigenvalues between κ and 1 and (bottom, Figure 2) eigenvalues decaying from 1 to 1/d. We start at ‖x_0 − x^*‖ = 10⁴, where x_0 and x^* are generated randomly.

6.2. Finite sums of functions. We focus on the composite problem

\[
\min_{x \in \mathbb{R}^d} F(x) = \sum_{i=1}^{N} \tfrac{1}{N} f_i(x) + \tfrac{\mu}{2}\|x\|^2,
\]

where the f_i are convex and L-smooth functions and µ is the regularization parameter. We use classical methods for minimizing F(x), such as SGD (with fixed step size), SAGA [Defazio et al., 2014] and SVRG [Johnson and Zhang, 2013], as well as the accelerated algorithm Katyusha [Allen-Zhu, 2016]. We compare their performance with and without the (potential) acceleration provided by Algorithm 1 with a restart every k iterations. The parameter λ is found by a grid search of size k, the size of the input sequence, but this adds only one data pass at each extrapolation; a sketch of this search follows. The grid search could be made faster by approximating F(x) with fewer samples, but we choose to present Algorithm 1 in its simplest version. We set k = 10 for all the experiments.
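The grid search can be sketched as follows (our code; it assumes the rna function above, and F evaluates the objective once per candidate λ, which is the single extra data pass mentioned above):

```python
import numpy as np

def rna_grid_search(xs, F, k=10):
    """Extrapolate with k log-spaced values of lambda (relative to
    ||R~^T R~||) and keep the point with the smallest objective F."""
    X = np.asarray(xs)
    R = np.diff(X, axis=0)
    scale = np.linalg.norm(R @ R.T)
    candidates = [rna(X, scale * 10.0 ** -p) for p in range(k)]
    return min(candidates, key=F)
```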

In order to balance the complexity of the extrapolation algorithm with that of the optimization method, we wait several data queries before adding the current point (the "snapshot") of the method to the sequence. Indeed, the extrapolation algorithm has a complexity of O(k²d) + O(N) (computing the coefficients c̃^λ and the grid search over λ). If we wait at least O(N) updates, then the extrapolation method is of the same order of complexity as the optimization algorithm.

• SGD. We add the current point after N data queries (i.e. one epoch), so k snapshots of SGD cost kN data queries.
• SAGA. We compute the gradient table exactly, then we add a new point after N queries, so k snapshots of SAGA cost (k+1)N queries. Since we optimize a sum of quadratic or logistic losses, we use the version of SAGA which stores O(N) scalars.
• SVRG. We compute the gradient exactly, then perform N queries (the inner loop of SVRG), so k snapshots of SVRG cost 2kN queries.
• Katyusha. We compute the gradient exactly, then perform 4N gradient calls (the inner loop of Katyusha), so k snapshots of Katyusha cost 3kN queries.

We compare these various methods for solving least-squares regression and logistic regression on several datasets (Table 1), with several condition numbers κ: well conditioned (κ = 100/N), moderately conditioned (κ = 1/N) and badly conditioned (κ = 1/(100N)). In this section, we present the numerical results on Sid (the Sido0 dataset, where N = 12678 and d = 4932) with bad conditioning; see Figure 4. The other experiments are given in the supplementary material.

Figure 4. Optimization of the quadratic loss (top) and of the logistic loss (bottom) with several algorithms (SAGA, SGD, SVRG, Katyusha and their accelerated versions AccSAGA, AccSGD, AccSVRG, AccKat.), using the Sid dataset with bad conditioning. The experiments are done in Matlab. Left: error f(x) − f(x^*) vs. epoch number. Right: error vs. time in seconds. (Plots omitted.)

In Figure 4, we clearly see that both SGD and AccSGD do not converge. This is mainly due to the fact that we do not average the points. In any case, except for quadratic problems, the averaged version of SGD does not converge to the minimum of F with arbitrary precision.

We also notice that Algorithm 1 is unable to accelerate Katyusha. This issue was already raised by Scieur et al. [2016]: when the algorithm has a momentum term (like Nesterov's method), the underlying dynamical system is harder to extrapolate.

Because the iterates of SAGA and SVRG have low variance, their accelerated versions converge faster to the optimum, and their performance is then comparable to that of Katyusha. In our experiments, Katyusha was faster than AccSAGA only once, when solving a least-squares problem on Sido0 with a bad condition number. Recall, however, that the acceleration Algorithm 1 does not require the specification of the strong convexity parameter, unlike Katyusha.

7. ACKNOWLEDGMENTS

The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7-PEOPLE-2013-ITN) under grant agreement no. 607290 SpaRTaN, as well as support from ERC SIPA and the chaire Économie des nouvelles données with the data science joint research initiative with the fonds AXA pour la recherche.


8. APPENDIX

8.1. Missing propositions.

Proposition 8.1. Let E be a matrix formed by [ε_1, ε_2, ..., ε_k], where ε_i has mean ν_i with ‖ν_i‖ ≤ ν and variance Σ_i ⪯ (σ²/d)I. By the triangle inequality and then Jensen's inequality, we have

\[
\mathbb{E}[\|E\|] \le \sum_{i=0}^{k} \mathbb{E}[\|\varepsilon_i\|] \le \sum_{i=0}^{k} \sqrt{\mathbb{E}[\|\varepsilon_i\|^2]} \le O(\nu + \sigma).  \qquad (33)
\]

Proposition 8.2. Consider the function

\[
f(x) = \frac{1}{\kappa}\sqrt{a - \lambda x^2} + bx,
\]

defined for x ∈ [0, √(a/λ)]. Its maximum is attained at

\[
x_{\mathrm{opt}} = \frac{b\sqrt{a}}{\sqrt{\lambda^2/\kappa^2 + \lambda b^2}},
\]

and its maximal value is thus, if x_opt ∈ [0, √(a/λ)],

\[
f_{\max} = \sqrt{a}\sqrt{\frac{1}{\kappa^2} + \frac{b^2}{\lambda}}.  \qquad (34)
\]

Proof. The (positive) root of the derivative of f satisfies

\[
b\sqrt{a - \lambda x^2} - \frac{1}{\kappa}\lambda x = 0 \;\Longleftrightarrow\; x = \frac{b\sqrt{a}}{\sqrt{\lambda^2/\kappa^2 + \lambda b^2}}.
\]

Injecting this solution into f, we obtain its maximal value:

\[
f(x_{\mathrm{opt}}) = \frac{1}{\kappa}\sqrt{a - \lambda\left(\frac{b\sqrt{a}}{\sqrt{\lambda^2/\kappa^2 + \lambda b^2}}\right)^2} + b\,\frac{b\sqrt{a}}{\sqrt{\lambda^2/\kappa^2 + \lambda b^2}}
= \frac{1}{\kappa}\sqrt{\frac{a\,\lambda^2/\kappa^2}{\lambda^2/\kappa^2 + \lambda b^2}} + \frac{b^2\sqrt{a}}{\sqrt{\lambda^2/\kappa^2 + \lambda b^2}}
= \sqrt{a}\,\frac{\lambda/\kappa^2 + b^2}{\sqrt{\lambda^2/\kappa^2 + \lambda b^2}}
= \sqrt{a}\sqrt{\frac{1}{\kappa^2} + \frac{b^2}{\lambda}}.
\]

The simplification by λ in the last equality concludes the proof.

8.2. Additional numerical experiments.

8.2.1. Legend. The curves in the figures below correspond to SAGA, SGD, SVRG, Katyusha and their accelerated versions AccSAGA, AccSGD, AccSVRG, AccKat.

8.2.2. Datasets.

              Sonar UCI (Son)   Madelon UCI (Mad)   Random (Ran)   Sido0 (Sid)
# samples N   208               2000                4000           12678
Dimension d   60                500                 1500           4932

Table 1. Datasets used in the experiments.

8.2.3. Quadratic loss.

Sonar dataset.

Figure 5. Quadratic loss with (top to bottom) good, moderate and bad conditioning, using the Son dataset; error f(x) − f(x^*) vs. epoch (left) and vs. time in seconds (right). (Plots omitted.)

Madelon dataset.

Figure 6. Quadratic loss with (top to bottom) good, moderate and bad conditioning, using the Mad dataset. (Plots omitted.)

Random dataset.

Figure 7. Quadratic loss with (top to bottom) good, moderate and bad conditioning, using the Ran dataset. (Plots omitted.)

Sido0 dataset.

Figure 8. Quadratic loss with (top to bottom) good, moderate and bad conditioning, using the Sid dataset. (Plots omitted.)

8.2.4. Logistic loss.

Sonar dataset.

Figure 9. Logistic loss with (top to bottom) good, moderate and bad conditioning, using the Son dataset. (Plots omitted.)

Madelon dataset.

Figure 10. Logistic loss with (top to bottom) good, moderate and bad conditioning, using the Mad dataset. (Plots omitted.)

Random dataset.

Figure 11. Logistic loss with (top to bottom) good, moderate and bad conditioning, using the Ran dataset. (Plots omitted.)

Sido0 dataset.

Figure 12. Logistic loss with (top to bottom) good, moderate and bad conditioning, using the Sid dataset. (Plots omitted.)

REFERENCES

Allen-Zhu, Z. [2016], 'Katyusha: The first direct acceleration of stochastic gradient methods', arXiv preprint arXiv:1603.05953.

Anderson, D. G. [1965], 'Iterative procedures for nonlinear integral equations', Journal of the ACM (JACM) 12(4), 547–560.

Cabay, S. and Jackson, L. [1976], 'A polynomial extrapolation method for finding limits and antilimits of vector sequences', SIAM Journal on Numerical Analysis 13(5), 734–752.

Defazio, A., Bach, F. and Lacoste-Julien, S. [2014], Saga: A fast incremental gradient method with support for non-strongly convex composite objectives, in 'Advances in Neural Information Processing Systems', pp. 1646–1654.

Défossez, A. and Bach, F. [2015], Averaged least-mean-squares: Bias-variance trade-offs and optimal sampling distributions, in 'Artificial Intelligence and Statistics', pp. 205–213.

Fercoq, O. and Qu, Z. [2016], 'Restarting accelerated gradient methods with a rough strong convexity estimate', arXiv preprint arXiv:1609.07358.

Flammarion, N. and Bach, F. [2015], From averaging to acceleration, there is only a step-size, in 'Conference on Learning Theory', pp. 658–695.

Jain, P., Kakade, S. M., Kidambi, R., Netrapalli, P. and Sidford, A. [2016], 'Parallelizing stochastic approximation through mini-batching and tail-averaging', arXiv preprint arXiv:1610.03774.

Johnson, R. and Zhang, T. [2013], Accelerating stochastic gradient descent using predictive variance reduction, in 'Advances in Neural Information Processing Systems', pp. 315–323.

Lin, H., Mairal, J. and Harchaoui, Z. [2015], A universal catalyst for first-order optimization, in 'Advances in Neural Information Processing Systems', pp. 3384–3392.

Mešina, M. [1977], 'Convergence acceleration for the iterative solution of the equations X = AX + f', Computer Methods in Applied Mechanics and Engineering 10(2), 165–173.

Moulines, E. and Bach, F. R. [2011], Non-asymptotic analysis of stochastic approximation algorithms for machine learning, in 'Advances in Neural Information Processing Systems', pp. 451–459.

Nedić, A. and Bertsekas, D. [2001], Convergence rate of incremental subgradient algorithms, in 'Stochastic optimization: algorithms and applications', Springer, pp. 223–264.

Nesterov, Y. [1983], A method of solving a convex programming problem with convergence rate O(1/k²), in 'Soviet Mathematics Doklady', Vol. 27, pp. 372–376.

Nesterov, Y. [2013], Introductory lectures on convex optimization: A basic course, Vol. 87, Springer Science & Business Media.

Schmidt, M., Le Roux, N. and Bach, F. [2013], 'Minimizing finite sums with the stochastic average gradient', Mathematical Programming pp. 1–30.

Scieur, D., d'Aspremont, A. and Bach, F. [2016], Regularized nonlinear acceleration, in 'Advances in Neural Information Processing Systems', pp. 712–720.

Shalev-Shwartz, S. and Zhang, T. [2013], 'Stochastic dual coordinate ascent methods for regularized loss minimization', Journal of Machine Learning Research 14(Feb), 567–599.

Shalev-Shwartz, S. and Zhang, T. [2014], Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization, in 'ICML', pp. 64–72.

INRIA & D.I., École Normale Supérieure, Paris, France. E-mail address: damien.scieur@inria.fr

CNRS & D.I., UMR 8548, École Normale Supérieure, Paris, France. E-mail address: aspremon@ens.fr

INRIA & D.I., École Normale Supérieure, Paris, France. E-mail address: francis.bach@inria.fr
