A Stochastic Path-Integrated Differential EstimatoR Expectation Maximization Algorithm

(1)

HAL Id: hal-03029700

https://hal.archives-ouvertes.fr/hal-03029700

Submitted on 28 Nov 2020

HAL

is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire

HAL, est

destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

A Stochastic Path-Integrated Differential EstimatoR Expectation Maximization Algorithm

Gersende Fort, Eric Moulines, Hoi-To Wai

To cite this version:

Gersende Fort, Eric Moulines, Hoi-To Wai. A Stochastic Path-Integrated Differential EstimatoR Ex-

pectation Maximization Algorithm. NeurIPS 2020 - 34th Conference on Neural Information Processing

Systems, Dec 2020, Vancouver / Virtuel, Canada. �hal-03029700�

(2)

A Stochastic Path-Integrated Differential EstimatoR Expectation Maximization Algorithm

Gersende Fort

Institut de Math´ematiques de Toulouse Universit´e de Toulouse; CNRS

UPS, Toulouse, France

[email protected]

Eric Moulines

Centre de Math´ematiques Appliqu´ees Ecole Polytechnique, France CS Dpt, HSE University, Russian Federation

[email protected]

Hoi-To Wai Department of SEEM

The Chinese University of Hong Kong Shatin, Hong Kong

[email protected]

Abstract

The Expectation Maximization (EM) algorithm is of key importance for inference in latent variable models including mixture of regressors and experts, missing observations. This paper introduces a novel EM algorithm, calledSPIDER-EM, for inference from a training set of sizen,n ≫ 1. At the core of our algorithm is an estimator of the full conditional expectation in theE-step, adapted from the stochastic path-integrated differential estimator (SPIDER) technique. We derive finite-time complexity bounds for smooth non-convex likelihood: we show that for convergence to an ǫ-approximate stationary point, the complexity scales as KOpt(n, ǫ) = O(ǫ⁻¹)andKCE(n, ǫ) =n+√nO(ǫ⁻¹), whereKOpt(n, ǫ)and KCE(n, ǫ)are respectively the number ofM-steps and the number of per-sample conditional expectations evaluations. This improves over the state-of-the-art algorithms. Numerical results support our findings.

This paper is close to the final version accepted for publication in the Conference on Neural Infor- mation Processing Systems (NeurIPS 2020). The final version can be found at

https://papers.nips.cc/paper/2020/hash/c589c3a8f99401b24b9380e86d939842-Abstract.html

1 Introduction

Expectation Maximization (EM) is a key algorithm in machine-learning and statistics [20]. Ap- plications are numerous including clustering, natural language processing, parameter estimation in mixed models, missing data, to give just a few. The common feature of all these applications is the introduction of latent variables: the “incomplete” likelihoodp(y;θ)whereθ ∈Θ⊆R^dis defined by marginalizing the “complete-data” likelihoodp(y, z;θ)defined as the joint distribution of the observationyand a non-observed latent variablez∈Z, i.e.p(y;θ) =R

p(y, z;θ)µ(dz)whereZis the latent space andµis a measure onZ. We focus in this paper on the case wherep(y, z;θ)belongs to a curved exponential family, given by

p(y, z;θ)^def= ρ(y, z) exp

hs(y, z), φ(θ)i −ψ(θ) ; (1) wheres(y, z) ∈ R^q is the complete data sufficient statistics, φ : Θ → R^q andψ : Θ → R, ρ : Y×Z → R⁺are vector/scalar functions. Given a training set ofnindependent observations

(3)

{yi}ⁿi=1, our goal is to minimize the negated penalized log-likelihood with respect toθ∈Θ:

minθ∈ΘF(θ)^def= 1 n

Xn i=1

Lⁱ(θ) +R(θ), Lⁱ(θ)^def= −logp(yi;θ), (2) such thatR(θ)is a regularizer. A popular solution approach to (2) is the EM algorithm [10] which is a special instance of the Majorize-Minimization (MM) algorithm. It alternates between two steps:

in the Expectation (E) step, using the current value of the iterateθcurr, we compute a majorizing functionθ7→Q(θ, θcurr)given up to an additive constant by

Q(θ, θcurr)^def= − hs(θ¯ curr), φ(θ)i+ψ(θ) +R(θ) where s(θ)¯ ^def= 1 n

Xn i=1

¯

si(θ) ; (3) ands¯i(θ)is theith sample conditional expectation of the complete data sufficient statistics:

¯ si(θ)^def=

Z

s(yi, z)p(z|yi;θ)µ(dz), p(z|yi;θ)^def= p(yi, z;θ)/p(yi;θ). (4) As for the Maximization (M) step, a new value of θcurr is computed as a minimizer of θ 7→

Q(θ, θcurr). The majorizing function is then updated with the newθcurr. This process is iterated until convergence. One of the distinctive advantage of EM algorithms with respect to (w.r.t.) first- order methods stems from the fact that it is invariant by change of parameterization and that EM is, by construction, monotone; see [20].

The conventional EM algorithm is not suitable for analyzing the increasingly large data sets, such as those that could be considered as big data in volumes [5,14]: in such case, the explicit computation ofs(θ¯ curr)ineachE-stepof the EM algorithm involves evaluatingnconditional expectations [20]. As a remedy,incrementalmethods were designed which reduce the number of samples used per iteration to a mini-batch. Among the incremental methods, the first approach to cope with large-scale EM setting is the incremental EM (iEM) algorithm [21] (also see [22] for a refined algorithm). At each iteration,iEMselects a minibatchB^currof sizeband updates the associated statistic

¯

si(θcurr), i∈ B^curr, in the current estimateSbcurrof¯s(θcurr); and then updates the parameters by a classicalM-step. Later, an alternative approach was proposed in [6] as theOnline EMalgorithm, which shares some similarities with stochastic gradient descent [4] even thoughOnline EMis not a first-order method. Recent papers have proposed improvements toOnline EMby combining it with variance reduction techniques. For instance, [7] and [18] proposed respectively the stochastic EM with variance reduction (sEM-vr) and the fast incremental EM (FIEM) algorithms. These methods are extensions to the EM algorithm of the SVRG [15] and the SAGA [8] techniques.

The complexity of these algorithms have been analyzed under the assumption thatF(θ)is smooth but possibly non-convex. They are expressed as the number ofM-steps updates,KOpt(n, ǫ), and the number of per-sample conditional expectations evaluationsKCE(n, ǫ), in order to find an ǫ- approximate stationary point ofF(θ); see (11) for the definition. It was established in [18] that KOpt(n, ǫ) = KCE(n, ǫ) = n+n^2/3O(ǫ⁻¹)updates/evaluations are needed for thesEM-vrand FIEMalgorithms (the rate forFIEMcan be sharpened, see [12]). These complexity bounds match those of the SVRG and the SAGA algorithms for smooth non-convex optimization [25].

For smooth non-convex problems, the Stochastic Path-Integrated Differential EstimatoR(SPIDER) technique has recently been introduced by [11] (see also [27] for SPIDER-BOOST and [24] for SARAH), which established an n+√nO(ǫ⁻¹)bound of calls to first order oracles to find an ǫ- approximate stationary solution of a general finite sum optimization problem. Furthermore, the

√n-dependence was proven to be optimal. This motivates the current work to explore new EM algorithms with reduced complexity. Our contributions are:

• We propose a novelSPIDER-EMalgorithm, inspired by theSPIDERestimator in [11] and tailored to the EM framework for curved exponential family class of distributions. The SPIDER-EM uses an outer loop to maintain acontrol variatethat requires a full scan of the dataset to compute

¯

s(θcurr), and inner loops which perform low complexity updates by drawing random minibatches of samples.

• We introduce a unified framework ofstochastic approximation (SA) within EM which covers the convergence analysis ofOnline EM,sEM-vr,FIEM,SPIDER-EM. In this general framework, SPIDER-EMmay be seen as a stochastic approximation algorithm using variance reduced esti- mateSbcurr.

(4)

• Using the SA analysis framework, we prove that the complexity bounds for SPIDER-EMare KOpt(n, ǫ) =O(ǫ⁻¹),KCE(n, ǫ) =n+√nO(ǫ⁻¹). Among the incremental-EM techniques, we provide state of the art complexity bounds that overpass all the previous ones.

• The EM is not a first-order method contrary toSPIDER. Therefore, the convergence analysis of SPIDER-EMmethods require specific mathematical developments which differ significantly from the originalSPIDERanalysis. In addition, the analysis ofSPIDER-EMdiffers from previous ones for incremental EM algorithms, since it involvesbiasedapproximations, which makes the proof more challenging (seesection 9, Lemma11).

• We provide a new perspective to interpretSPIDER-EMas an equivalent algorithm to a perturbed Online-EMwhere the perturbation acts as a control variate to reduce variance - seealgorithm 7.

Furthermore, theSPIDER-EMalgorithm operates with a significantly lower memory footprint than iEMandFIEM, and the memory footprint is on par withsEM-vrandOnline EM. To our best knowl- edge, the proposed algorithm offers the best of both worlds – having a low complexity bounds and a low memory footprint. Lastly, we support the theoretical findings with numerical experiments and show thatSPIDER-EMperforms favorably compared to existing algorithms.

Notations. For two vectors a, b ∈ R^r, ha, bi denotes the usual Euclidean product andkak the associated norm. By convention, vectors are column vectors. For a vectorxwith components (x1, . . . , xr),xi:jdenotes the sub-vector with components(xi, xi+1, . . . , xj−1, xj). For two matri- cesA ∈R^r¹×r2andB ∈R^r³×r4,A⊗B denotes the Kronecker product. Iris ther×ridentity matrix.A^T is the transpose ofA.

2 EM Algorithm and its Variants using Stochastic Approximation

We formulate the model assumptions and introduce theSPIDER-EMalgorithm. Recall the definition of the negated penalized log-likelihoodF(θ)from (2) and consider a few regulatory assumptions:

H1. Θ⊆R^dis a measurable convex set. (Z,Z)is a measurable space andµis aσ-finite positive measure onZ. The functionsR : Θ → R,φ : Θ → R^q,ψ : Θ → R, andρ(yi,·) : Z → R₊, s(yi,·) :Z→R^qfori∈ {1, . . . , n}are measurable functions. For anyθ∈Θandi∈ {1, . . . , n}, the log-likelihood is bounded as−∞<Lⁱ(θ)<∞.

H2. For allθ∈Θandi∈ {1, . . . , n}, the conditional expectations¯i(θ)is well-defined.

H3. For anys ∈R^q, the maps 7→Argminθ∈Θ {ψ(θ) +R(θ)− hs, φ(θ)i}exists and is unique;

the singleton is denoted by{T(s)}.

As discussed in the Introduction, the EM algorithm is an MM algorithm associated with the ma- jorization functions{θ 7→ Q(θ, θcurr), θcurr ∈ Θ}. Thus, the EM algorithm defines a sequence {θk, k ≥0}that can be computed recursively asθk+1 =T◦s(θ¯ k), where the mapTis defined in H3ands¯is defined in (3). On the other hand, the EM algorithm can be defined through a mapping in the complete data sufficient statistics, referred to as theexpectation space. In this setting, the EM iteration defines a sequence inR^q {Sbk, k ≥ 0}given bySbk+1 = ¯s◦T(Sbk). To summarize, we observe that the EM algorithm admits two equivalent representations:

(Parameter space) θk+1=T◦¯s(θk); (Expectation space) Sbk+1 = ¯s◦T(Sbk). (5) In this paper, we focus on the expectation space representation. Letθ⋆

def= T(s⋆)wheres⋆∈R^q. It has been shown in [9] that ifs⋆is a fixed point to the EM algorithm in the expectation space, then θ⋆ =T(s⋆)is a fixed point of the EM algorithm in the parameter space, i.e.,θ⋆=T◦s(θ¯ ⋆). Note that the converse is also true. The limit points of the EM algorithm in the expectation space are the roots of themean field

h(s)^def= ¯s◦T(s)−s, s∈R^q . (6) Consider the following assumption.

H4. 1. The functionsφ, ψandRare continuously differentiable onΘ^v. IfΘis open, thenΘ^v= Θ, otherwiseΘ^vis a neighborhood ofΘ.Tis continuously differentiable onR^q.

2. The function F is continuously differentiable on Θ^v and for any θ ∈ Θ, ∇F(θ) =

−∇φ(θ)^⊤s(θ) +¯ ∇ψ(θ) +∇R(θ).

(5)

3. For anys∈R^q,B(s)^def= ∇(φ◦T)(s)is a symmetric matrix with positive minimal eigenvalue.

These assumptions are classical, see for example, [18] and the references therein.

A key property of the EM algorithm is that it ismonotone: in the parameter spaceθk+1=T◦s(θ¯ k) decreases the objective function withF(θk+1)≤F(θk). The same monotone property also holds in the expectation space. Define

W(s)^def= F◦T(s) = 1 n

Xn i=1

Lⁱ(T(s)) +R(T(s)), s∈R^q . (7) It can be shown thatF(θk+1)≤F(θk)impliesW( ˆSk+1)≤W( ˆSk). In addition, [9] showed that:

Proposition 1. Under H1, H2, H3and H4,W(s)is continuously differentiable onR^q and for any s∈R^q,∇W(s) =−B(s)h(s).

Hence, s⋆ is a fixed point to the EM algorithm in expectation space, with s⋆ = ¯s◦T(s⋆)and h(s⋆) = 0if and only ifs⋆is a stationary point satisfying∇W(s⋆) = 0. This property has made it possible to develop a new class of algorithms that preserve desirable properties of the EM (e.g, invariant in the choice of parameterization) while replacing the computation of¯s(θ)by a stochastic approximation (SA) scheme; see [26,2,3] for a survey on SA. This scheme has been exploited in [9] to deal with the case where the computation of the conditional expectations(θ)¯ is intractable.

We consider yet another form of intractability in this work which is linked with the size of the dataset n ≫ 1. To alleviate this problem, theOnline EMalgorithm [6] defines a sequence{Sbk, k ≥ 0} with the recursion:

Sbk+1=Sbk+γk+1

¯

s_Bk+1◦T(Sbk)−Sbk

, (8)

where{γk+1, k ≥0}is a deterministic sequence of step sizes,B^k+1is a mini-batch ofbexamples sampled at random in{1, . . . , n}and for a mini-batchBof sizeb, we sets¯_B^def= b⁻¹P

i∈B¯si. TheOnline EMalgorithm can be viewed as an SA scheme designed for finding the roots of the mean-field h; indeed, the mean-field of Online EMsatisfies E[¯s_B_k+1 ◦T(Sbk)−Sbk] = h(Sbk).

Hence, the possible limiting points ofOnline EMare the roots ofh(s), such a roots⋆is a stationary point ofW(see Proposition1and (7)), andT(s⋆)corresponds to a stationary point of the penalized likelihood (2); see [6] for a precise statement and [17] for a detailed convergence analysis.

Variance Reduction for SA with EM Algorithm. For the finite-sum problem (2), more efficient algorithms can be developed by introducing a control variate in order to achieve variance reduction.

Suppose that we have a random variable (r.v.) U and our aim is to estimateu ^def= E[U]. For any zero-mean r.v. V, the sumU +V is an unbiased estimator ofu. Now, ifV is negatively correlated withU andVar(V²)≤ −2 Cov(U, V), then the variance ofU +V will be lower than that of the standalone estimatorU;V is acontrol variate.

This approach has been proven to be effective for stochastic gradient algorithms: emblematic examples are Stochastic Variance Reduced Gradient (SVRG) introduced by [15] and SAGA introduced by [8]. Whereas control variates have been originally designed to the stochastic gradient framework, similar ideas can be applied to SA procedures for finite-sum optimization. ForOnline EM, variance reduction amounts to expressing the mean-field ash(s) =E[¯s_B◦T(s)−s+V]whereV is a control variate. These methods differ in the way the control variate is constructed. The efficiency of such variance reduction methods improves with the correlation ofV with¯s_B◦T(s)−s.

An SVRG-like algorithm is theStochastic EM with Variance Reduction (sEM-vr)algorithm [7]. InsEM-vr, the control variate is reset in an outer loop everykiniterations: in the outer loop#tfort∈ {1, . . . , kout}, and the inner loop#(k+ 1)fork∈ {0, . . . , kin−2}, the complete data sufficient statistic is updated usingOnline EMand a recursively defined control variate

Sbt,k+1=Sbt,k+γt,k+1(¯s_B_t,k+1◦T(bSt,k)−Sbt,k+Vt,k+1), (9) Vt,k+1= ¯s◦T(Sbt−1,kin−1)−s¯_Bt,k+1◦T(Sbt−1,kin−1). (10) Whenk= 0, the complete data sufficient statisticSbt,0is obtained by performing first a full-pass on the datasetSet,0= ¯s◦T(Sbt−1,kin−1)and then updatingSbt,0=Sbt−1,kin−1+γt,0(Set,0−Sbt−1,kin−1).

(6)

An SAGA-like version is theFast Incremental EM (FIEM)algorithm proposed in [18]. The construction of the control variate for FIEMis more involved; for details, see algorithm 5 in the supplementary material.

In [18], thesEM-VRandFIEMalgorithms have been analyzed with a randomized terminating iteration(τ, ξ), uniformly selected from{1, . . . , kout} × {0, . . . , kin−1} wherekin (resp. kout) is the number of inner loops per outer one, andkoutis the total number of outer loops. The random termination is inspired by [13] which enables one to show non-asymptotic convergence of stochastic gradient methods to a stationary point. Consider firstsEM-VR. For anyn, ǫ, we defineK(n, ǫ)⊂N³ such that, for any(kin, kout,b)∈ K(n, ǫ),

E[kh(Sbτ,ξ)k²]^def= kmax−1Pkout

t=1

Pkin−1

k=0 E[kh(Sbt,k)k²]≤ǫ , (11) wherekmax = kinkout. In words, the randomly terminated algorithm computes a solution Sbτ,ξ

such that the expected squared norm of the mean field is less thanǫ; see [13]. The finite sample complexity in terms of the number ofM-steps isK_Opt^sEM-VR(n, ǫ) = inf_K(n,ǫ)kinkout.

The complexity in terms of the total number of per-sample conditional expectations evaluations, is defined asK_CE^sEM-VR(n, ǫ,b) = inf_K_(n,ǫ){n+koutn+bkinkout+ (n∧(bkin))kout}. Similar results can be derived forFIEMand other incremental EM algorithms (seesection 6). In such case, define by kmax=kmax(n, ǫ)the minimal number of iterations such that (11) is satisfied and setK_Opt^FIEM(n, ǫ) = kmax(n, ǫ) andK_CE^FIEM(n, ǫ) = 2kmax(n, ǫ)b. It can be shown (see [18] and the supplementary material) thatK_Opt^sEM-VR(n, ǫ) = K_Opt^FIEM(n, ǫ) = n^2/3O(ǫ⁻¹)andK_CE^sEM-VR(n, ǫ) = K_CE^FIEM(n, ǫ) = n+n^2/3O(ǫ⁻¹). These bounds exhibit anO(ǫ⁻¹)growth as the stationarity requirementǫdecreases.

Such a rate is comparable to a deterministic gradient method for smooth and non-convex objective functions. However, the complexity ofM-step computations as well as of conditional expectations evaluations grow at the rate ofn^2/3, which can be undesirable ifn≫1. Hereafter, we aim to design a novel algorithm with better finite-time complexities.

3 The SPIDER-EM Algorithm

To reduce the dependence onnand the overall complexity, we propose to design a new control variate, and to optimize the size of theminibatch. To this regard, we borrow from [11,27] (see also [24] and the algorithmSARAH) a new technique called Stochastic Path-Integrated Differential Estimator (SPIDER) to generate the control variates for estimating the conditional expectation of the complete data for the full dataset.

Algorithm Description.We propose theSPIDER-EMalgorithm formulated in the expectation space.

The outer loop is the same as that ofsEM-vr. The difference lays in the update ofSbkas follows:

Data:kin∈N_⋆,kout ∈N_⋆,Sbinit∈R^q,{γt,k+1, t≥1, k≥0}positive sequence.

Result:TheSPIDER-EMsequence:Sbt,k, t= 1, . . . , koutandk= 0, . . . , kin−1

1 Sb1,0=Sb1,−1=Sbinit, S_1,0= ¯s◦T(Sb1,−1);

2 fort= 1, . . . , koutdo

3 fork= 0, . . . , kin−2do

4 Sample a mini-batchB^t,k+1in{1, . . . , n}of sizeb, with or without replacement;

5 S_t,k+1=S_t,k+ ¯s_Bt,k+1◦T(Sbt,k)−¯s_Bt,k+1◦T(Sbt,k−1) ;

6 Sbt,k+1=Sbt,k+γt,k+1 S_t,k+1−Sbt,k

7 Sbt+1,−1=Sbt,kin−1;

8 S_t+1,0= ¯s◦T(Sbt+1,−1);

9 Sbt+1,0=Sbt,kin−1+γt,kin S_t+1,0−Sbt,kin−1

Algorithm 1:TheSPIDER-EMalgorithm.

We discuss the design considerations of theSPIDER-EMalgorithm and provide insights on how it can accelerate convergence as follows.

(7)

Control Variate and Variance Reduction. We shall analyzeSPIDER-EMas an SA scheme with control variate to reduce variance. While the description ofSPIDER-EMalgorithm in the above does not present the control variates explicitly, it is possible to re-interpret the inner loop (line4–line6) with a control variate defined, fort∈N_⋆andk∈ {0, . . . , kin−2}, as

Vt,k+1 =Vt,k+ ¯s_Bt,k◦T(Sbt,k−1)−¯s_Bt,k+1◦T(Sbt,k−1)

=Pk

j=0{s¯_Bt,j◦T(Sbt,j−1)−¯s_Bt,j+1◦T(Sbt,j−1)}, (12) whereVt,0 = 0is reset at every outer iteration and, by convention,B^t,0 ^def= {1, . . . , n}. It is seen that line6can be rewritten as (see Lemma3in the supplementary material)

Sbt,k+1=Sbt,k+γt,k+1 ¯s_B_t,k+1◦T(Sbt,k)−Sbt,k+Vt,k+1

. (13)

Note that, by construction, the control variateVt,k is zero mean because,E[¯s_Bt,j ◦T(Sbt,j−1)] = E[¯s_Bt,j+1 ◦T(Sbt,j−1)] = E[¯s◦T(Sbt,j−1)]. Eq. (12) shows howSPIDER-EMconstructs a control variate by accumulating information – similar toSPIDERandSARAHin the gradient descent setting.

Comparing (12)-(13) to (9)-(10), theSPIDER-EMalgorithm differs fromsEM-vronly in the construction of the control variate. To obtain insights about their performance, let us denote the filtration asF^t,k ^def= σ(Sbinit,B^1,1, . . . ,B^1,kⁱⁿ−1, . . . ,B^t,1, . . . ,B^t,k). Observe that the conditional variances (givenF^t,k) ofSbt,k+1of thesEM-VRandSPIDER-EMalgorithms are:

Var bS_t,k+1^sEM⁻^vr|F^t,k

=γ_t,k+1² Var[¯s_Bt,k+1◦T(Sbt,k)−s¯_Bt,k+1◦T(Sbt−1,kin−1)|F^t,k], Var bS_t,k+1^SPIDER⁻^EM|F^t,k

=γ_t,k+1² Var[¯s_B_t,k+1◦T(Sbt,k)−s¯_B_t,k+1◦T(Sbt,k−1)|F^t,k]. As a comparison, the variance ofSb(t−1)kin+k+1for theOnline EMis given by

γ_(t²₋_1)k_in_+k+1Var

¯

s_B_(t−1)k_{in +}_k+1◦T(Sb(t−1)kin+k)|F(t^O⁻−^EM1)kin+k

.

Here,Fτ^O⁻^EM

def= σ(Sbinit,B¹, . . . ,B^τ). In this sense, bothsEM-vrandSPIDER-EMare variance- reduced versions of the Online EM. Additionally, SPIDER-EMand sEM-VR are designed to ex- ploit two valuesSbt,k,Sbt,k−1 andSbt,k,Sbt−1,kin−1, respectively. The former thus takes the benefit of a stronger correlation between two successive values of{Sbt,k, k ≥ 1} than betweenSbt,k and Sbt−1,kin−1in the variance reduction step. As a result, SPIDER-EMshould inherit a better rate of convergence – an intuition which is established will be Theorem2.

Step Size and Memory Footprint. The SPIDER-EMalgorithm is described with a positive step size sequence {γt,k+1, t ≥ 1, k ≥ 0}. Different strategies are allowed: (a)a constant step size γt,k+1 =γfor anyk≥0, or(b)a random sequence. We focus on case(a)in the following, while we refer the readers to [11] for such a strategy in the gradient setting. Lastly, we observe that the SPIDER-EMalgorithm has the same memory footprint requirement as thesEM-vralgorithm.

Convergence Analysis.Let(τ, ξ)be uniform r.v. on{1, . . . , kout} × {0, . . . , kin−1}, independent of theSPIDER-EMsequence{Sbt,k, t = 1,· · · , kout;k =−1,· · ·, kin−1}. Our goal is to derive explicit upper bounds forE[kh(Sbτ,ξ−1)k²]for theSPIDER-EMsequence given byalgorithm 1with a constant step size (γt,k+1=γfor anyt≥1,k≥0). We strengthen the assumption H4as follows:

H5.(a) There exist0 < vmin ≤ vmax <∞such that for alls ∈ R^q, the spectrum ofB(s)is in [vmin, vmax];B(s)is defined in H4.

(b) For anyi∈ {1, . . . , n}, the maps¯i◦Tis globally Lipschitz onR^q with constantLi. (c) The functions7→ ∇W(s) =−B(s)h(s)is globally Lipschitz onR^qwith constantL_∇W.

From H5-(a)and Proposition1, we haveE

kh(Sbτ,ξ−1)k²

≥ v⁻_max² E

k∇W(Sbτ,ξ−1)k²

so that a control ofE

kh(Sbτ,ξ−1)k²

provides a control ofE

k∇W(Sbτ,ξ−1)k²

. The convergence result for SPIDER-EMis summarized below:

(8)

Theorem 2. Assume H1, H2, H3, H4and H5and setL^{2 def}= n⁻¹Pn

i=1L²_i. Fixkout, kin ∈N_⋆, b∈N_⋆and setγt,k

def= α/Lfor anyt, k >0whereα∈(0, vmin/µ⋆(kin,b))with µ⋆(kin,b)^def= vmax

pkin/b+L_∇W/(2L). (14) TheSPIDER-EMsequence{Sbt,k, t≥1, k≥0}given byalgorithm 1satisfies

Eh

kh(Sbτ,ξ−1)k²i

≤ 1

kin

+α² b

2L

α{vmin−αµ⋆(kin,b)} 1 kout

E[W(Sbinit)]−min W . Our analysis, whose detail can be found in the supplementary material, shares some similarities with the one inSPIDER-Boost[27]. Nevertheless, there are a number of differences because(a) SPIDER-EMalgorithm recursion uses two spaces (the expectation space and the parameter space) which are connected by the mapss¯andT; (b)SPIDER-EMis not a gradient algorithm in the expectation space, but an SA scheme to obtain a root forh;(c)there is a Lyapunov functionW(s) where∇W(s)6=−h(s), but which satisfiesh∇W(s), h(s)i ≤ −vminkh(s)k². In addition, in rela- tion to the above points, our analysis took insights from [16,17] to analyzeSPIDER-EMas a biased SA scheme. Our challenge lies in carefully controlling the bias/variance of theSPIDERestimator employed, which is not reported in the prior literature.

Proof Sketch. While we shall omit the proof details, an outline of the proof is provided. Set Ht,k+1

def= γ_t,k+1⁻¹ (Sbt,k+1−Sbt,k). A key property is the following descent condition for the Lya- punov functionW. There exist positive sequencesΛt,k, βt,ksuch that for anyt≥1,k≥0,

W(Sbt,k+1)≤W(Sbt,k)−Λt,k+1kHt,k+1k²+γt,k+1 v²_max

2β_t,k+1² kHt,k+1−h(Sbt,k)k². It holds for anyt≥1and0≤k≤kin−2,

Eh

kHt,k+1−h(Sbt,k)k²|F^t−1,kin−1

i≤ ^L^b²Pk

j=0γ_t,j² E

kHt,jk²|F^t−1,kin−1

. (15) The above conditions can be combined to yield

Pkout

t=1

Pkin−1 k=0 At,kE

kHt,kk²

≤E h

W(Sbinit)i

−min W where theAt,k’s are positive. Dividing both sides of the inequality byPkout

t=1

Pkin−1

k=0 At,k leads to a bound onE[kHΞk²]for some r.v.Ξon{1, . . . , kout} × {0, . . . , kin−1}. For the concerned case whenγt,k =γ, we haveAt,k =AandΞ = (τ, ξ)is the uniform distribution, thus the convergence rate forE[kHτ,ξk²]isO(1/kinkout). Lastly, we obtain a bound for the mean field kh(Sbτ,ξ−1)k² using the standard inequality(a+b)²≤2a²+ 2b²and (15) again.

Choice of kin,b, kout and Complexity Bounds. The maximum of α{vmin −αµ⋆(kin,b)} on (0, vmin/µ⋆(kin,b))isα⋆(kin,b)^def= vmin/{2µ⋆(kin,b)}which yieldsγ = vmin/{2µ⋆(kin,b)L} and the upper bound

E

hkh(Sbτ,ξ−1)k²i

≤

µ⋆(kin,b)

v²_min + kin

4µ⋆(kin,b)b 8L

kinkout

(E[W(Sbinit)]−min W). The number of parameter updates is1 +kout +kinkout. The number of per-sample conditional expectation computations isn+koutn+ 2bkinkout. Assume thatnandǫ > 0 are given. Set for simplicityb =kin = ⌈√n⌉which means that the number of per-sample conditional expectations evaluations in the inner loop is equal ton, i.e., is an epoch (seesubsection 9.3for a discussion on other strategies). With this choice, we getµ⋆(kin,b) =m⋆

def= vmax+L_∇W/(2L). Taking kout≥

m⋆

v_min² +_4m¹_⋆

√8Lnǫ (E[W(Sbinit)]−min W),

then we haveE[kh(Sbτ,ξ−1)k²] ≤ǫ. With these choices ofkin, kout,b, the complexity in terms of the number of per-sample conditional expectations evaluationss¯iisKCE(n, ǫ) =n+√nLO(ǫ⁻¹).

The number of parameter updates isKOpt(n, ǫ) =O(ǫ⁻¹). Note that the step size is chosen to be γ=α⋆(kin,b)/L, which is independent of the targeted accuracyǫ.

Linear convergence rate. Insection 10, we provide a modification ofSPIDER-EMwhich exhibits a linear convergence rate when Wsatisfies a Polyak-Lojasiewicz inequality. Note that the latter condition (or its variants) has been used in a few recent works, e.g., [1,7].

(9)

10² 10³ 10⁴ Problem size (n) 10²

10⁴ 10⁶

No of examples

SPIDER-EM SEM-VR iEM FIEM O(n^2/3) O(n)

10² 10³ 10⁴

Problem size (n) 10²

10³ 10⁴ 10⁵ 10⁶

No of examples

SPIDER-EM SEM-VR iEM FIEM O(n^1/2) O(n^2/3) O(n)

Figure 1: [Left] Median estimated number of parameter updatesKOpt(n, ǫ)needed to reach an accuracy of2.5×10⁻⁵[Right] Median estimated number of per-sample conditional expectations KCE(n, ǫ)−nneeded to reach an accuracy of2.5×10⁻⁵. The median is taken from a Monte-Carlo simulation among 50 trials.

4 Numerical illustration

Synthetic Data. We evaluate the efficiency ofSPIDER-EMagainst the problem size. We generate a synthetic dataset withnobservations from a scalar two-components Gaussian mixture model (GMM) with0.2N(0.5,1) + 0.8N(−0.5,1). The variances and the weights are assumed known.

We fit the meansµ1, µ2of a GMM to the observed data. ForSPIDER-EM, we setb =⌈√n/20⌉, kin = ⌈n/b⌉and a fixed step size γk = 0.01. We defineτemp = tempkin+kemp as the total number of updates ofSbk evaluated, such that temp,kemp are the indices of outer, inner iteration, respectively. To estimate KOpt(n, ǫ)andKCE(n, ǫ), we run theSPIDER-EMalgorithm until the first iterationτempwhen the solution satisfies kh(Sbtemp,kemp)k² ≤ǫ = 2.5×10⁻⁵. We take the median ofτempover 50 runs to give an estimate ofKOpt(n, ǫ); similarly, we take the median of ntemp+ 2bτempto give an estimate ofKCE(n, ǫ). Note that the conditional expectations computed during the initialization step are ignored.

Figure1comparesSPIDER-EMto the state-of-the-art incremental EM algorithms for different set- tings of n. The results illustrate that the empirical performance of SPIDER-EMagrees with the theoretical analysis. In particular, we observe that forSPIDER-EM, the estimatedKOpt(n, ǫ)is independent of the problem sizenwhileKCE(n, ǫ)−ngrows at the rate of√

n.

MNIST Dataset. We perform experiment on the MNIST dataset to illustrate the effectiveness of SPIDER-EMon real data; this example is taken from [23, Section 5]. The dataset consists ofn= 6×10⁴images of handwritten digits, each with784pixels. We pre-process the dataset as follows.

First, we eliminate the uninformative pixels (67pixels are always zero) across all images to obtain a dense representation withddense = 717pixels per image. Second, we apply principal component analysis (PCA) to further reduce the data dimension. We keep thedPC= 20principal components (PCs) of each observation.

We estimate a multivariate GMM model withg = 12components. Unlike in the previous experiment, here the parameterθcollects the mixture’s weights{αℓ,1≤ℓ≤g}, the expectations of each component and a pulled full covariance matrix. SPIDER-EMis compared toiEM[21],Online EM [6],FIEM[18], andsEM-vr[7]. Details on the multivariate Gaussian mixture model are given in the supplementary material,section 11, where we give technical conditions required to verify the assumptions ofTheorem 2.

In Figure2, we display the sequence of parameter estimates{θτ}, the objective function{−W(Sbτ)} and the squared norm of the mean field{kh(Sbτ)k²}. Figure3gives insights on the distribution of kh(Sbt,k)k² along SPIDER-EMpaths. The mini-batches {B^τ}^τ are independent, and sampled at random in{1, . . . , n}with replacement. For a fair comparison, we use the same seed to sample the minibatches{B^k}; another seed is used forFIEMwhich requires a second sequence of minibatches {B^τ}^τ. The minibatch size is set to beb= 100and the stepsizeγτ = 5×10⁻³except foriEMwhere γτ= 1. The same initial valueSbinitis used for all experiments. We have implemented the procedure of [19] in order to obtain the initializationθinit and then we setSbinit def

= ¯s(θinit)(−W(Sbinit) =

(10)

0 20 40 60 80 100 120 0

0.1 0.2 0.3

Weight

0 20 40 60 80 100 120

Epoch 0

0.1 0.2 0.3

Weight

5 10 15 20 25

Epoch -34.5

-34 -33.5 -33 -32.5 -32 -31.5

Objective value {-W(Sk )}

EM iEM Online EM FIEM sEM-vr SPIDER-EM

Figure 2: [Left] Evolution of the estimates of the weightsαℓforℓ= 1, . . . , gbyOnline EM(top) andSPIDER-EM (bottom) vs the number of epochs. [Right] Evolution of the objective function

−W(Sbk)vs the number of epochs.

20 40 60 80 100 120 140 Epoch

10^-20 10^-15 10^-10 10^-5 10⁰

{ ||h(Sk)||2 } - 25th quantile ^iEM

Online EM FIEM sEM-vr SPIDER-EM

20 40 60 80 100 120 140 Epoch

10^-20 10^-15 10^-10 10^-5 10⁰

{ ||h(Sk )||2 } - 75th quantile ^iEM

Online EM FIEM sEM-vr SPIDER-EM

Figure 3: [Left] Quantile0.25and [Right] quantile0.75of the distribution ofkh(Sbt,−1)k² vs the number of epochst; the quantiles are estimated from40independent samples of this distribution.

−58.3). The plots illustrate thatSPIDER-EMreduces the variability ofOnline EMand compares favorably toiEMandFIEM. Additional details and results are given in the Supplementary material.

5 Conclusions

We have introduced theSPIDER-EMalgorithm for large-scale inference. The algorithm offers low memory footprint and improved complexity bounds compared to the state-of-the-art, which is veri- fied by theoretical analysis and numerical experiments.

(11)

Broader Impact This work does not present any foreseeable societal consequence.

Acknowledgments and Disclosure of Funding

The work of G. Fort is partially supported by theFondation Simone et Cino del Ducaunder the project OpSiMorE. The work of E. Moulines is partially supported by ANR-19-CHIA-0002-01 / chaire SCAI. It was partially prepared within the framework of the HSE University Basic Research Program. The work of H.-T. Wai is partially supported by the CUHK Direct Grant #4055113.

References

[1] S. Balakrishnan, M. J. Wainwright, and B. Yu. Statistical guarantees for the EM algorithm:

From population to sample-based analysis. Ann. Statist., 45(1):77–120, 2017.

[2] A. Benveniste, M. M´etivier, and P. Priouret. Adaptive Algorithms and Stochastic Approxima- tions. Springer Verlag, 1990.

[3] V. S. Borkar. Stochastic approximation. Cambridge University Press, Cambridge; Hindustan Book Agency, New Delhi, 2008. A dynamical systems viewpoint.

[4] L. Bottou and Y. Le Cun. Large scale online learning. In S. Thrun, L. K. Saul, and B. Sch¨olkopf, editors, Advances in Neural Information Processing Systems 16, pages 217–

224. MIT Press, 2004.

[5] P. B¨uhlmann, P. Drineas, M. Kane, and M. van der Laan. Handbook of Big Data. Chap- man & Hall/CRC Handbooks of Modern Statistical Methods. CRC Press, 2016. ISBN 9781482249088.

[6] O. Capp´e and E. Moulines. On-line expectation-maximization algorithm for latent data models.

J. R. Stat. Soc. Ser. B Stat. Methodol., 71(3):593–613, 2009.

[7] J. Chen, J. Zhu, Y. Teh, and T. Zhang. Stochastic expectation maximization with variance reduction. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Gar- nett, editors,Advances in Neural Information Processing Systems 31, pages 7967–7977. 2018.

[8] A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors,Advances in Neural Information Processing Systems 27, pages 1646–1654. Curran Associates, Inc., 2014.

[9] B. Delyon, M. Lavielle, and E. Moulines. Convergence of a stochastic approximation version of the EM algorithm. Ann. Statist., 27(1):94–128, 1999. ISSN 0090-5364. doi: 10.1214/aos/

1018031103. URLhttps://doi.org/10.1214/aos/1018031103.

[10] A. Dempster, N. Laird, and D. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. J. Roy. Stat. Soc. B Met., 39(1):1–38, 1977.

[11] C. Fang, C. Li, Z. Lin, and T. Zhang. SPIDER: Near-Optimal Non-Convex Optimization via Stochastic Path-Integrated Differential Estimator. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors,Advances in Neural Information Pro- cessing Systems 31, pages 689–699. Curran Associates, Inc., 2018.

[12] G. Fort, P. Gach, and E. Moulines. Fast Incremental Expectation Maximization for non-convex optimization: non asymptotic convergence bounds. Technical report, HAL-02617725v1, 2020.

[13] S. Ghadimi and G. Lan. Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming.SIAM J. Optimiz., 23(4):2341–2368, 2013.

[14] W. H¨ardle, H. H.-S. Lu, and X. Shen.Handbook of big data analytics. Springer, 2018.

(12)

[15] R. Johnson and T. Zhang. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors,Advances in Neural Information Processing Systems 26, pages 315–323. Curran As- sociates, Inc., 2013.

[16] B. Karimi, M. Lavielle, and E. Moulines. On the Convergence Properties of the Mini-Batch EM and MCEM Algorithms. Technical report, hal-02334485, 2019.

[17] B. Karimi, B. Miasojedow, E. Moulines, and H.-T. Wai. Non-asymptotic Analysis of Biased Stochastic Approximation Scheme. InCOLT, 2019.

[18] B. Karimi, H.-T. Wai, E. Moulines, and M. Lavielle. On the Global Convergence of (Fast) In- cremental Expectation Maximization Methods. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alch´e Buc, E. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems 32, pages 2837–2847. Curran Associates, Inc., 2019.

[19] W. Kwedlo. A new random approach for initialization of the multiple restart EM algorithm for Gaussian model-based clustering.Pattern Anal. Applic., 18:757–770, 2015.

[20] G. McLachlan and T. Krishnan.The EM algorithm and extensions. Wiley series in probability and statistics. Wiley, 2008.

[21] R. M. Neal and G. E. Hinton. A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355–368.

Springer Netherlands, Dordrecht, 1998.

[22] S. K. Ng and G. J. McLachlan. On the choice of the number of blocks with the incremental EM algorithm for the fitting of normal mixtures. Stat. Comput., 13(1):45–55, 2003.

[23] H. Nguyen, F. Forbes, and G. McLachlan. Mini-batch learning of exponential family finite mixture models. Stat. Comput., 2020.

[24] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Tak´aˇc. Sarah: A novel method for machine learning problems using stochastic recursive gradient. InProceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, page 2613–2621. JMLR.org, 2017.

[25] S. Reddi, S. Sra, B. P´oczos, and A. Smola. Fast Incremental Method for Smooth Nonconvex Optimization. In2016 IEEE 55th Conference on Decision and Control (CDC), pages 1971–

1977, 2016.

[26] H. Robbins and S. Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.

[27] Z. Wang, K. Ji, Y. Zhou, Y. Liang, and V. Tarokh. SpiderBoost and Momentum: Faster Stochas- tic Variance Reduction Algorithms. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’ Alch´e- Buc, E. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems 32, pages 2406–2416. 2019.

(13)

Supplementary materials for “A Stochastic Path-Integrated Differential EstimatoR Expectation

Maximization Algorithm”

Gersende Fort

Institut de Math´ematiques de Toulouse Universit´e de Toulouse; CNRS UPS, F-31062 Toulouse Cedex 9, France [email protected]

Eric Moulines

Centre de Math´ematiques Appliqu´ees Ecole Polytechnique, France

CS Departement

HSE University, Russian Federation [email protected]

Hoi-To Wai Department of SEEM

The Chinese University of Hong Kong Shatin, Hong Kong

[email protected]

Notations. For two vectors a, b ∈ R^r, ha, bi denotes the usual Euclidean product andkak the associated norm. By convention, vectors are column vectors. For a vectorxwith components (x1, . . . , xr),xi:jdenotes the sub-vector with components(xi, xi+1, . . . , xj−1, xj).

For two matricesA ∈ R^r¹×r2 andB ∈ R^r³×r4,A⊗Bdenotes the Kronecker product. Iris the r×ridentity matrix.A^T is the transpose ofA.

6 Complexity of incremental EM-based methods for smooth non-convex finite sum optimization

We first compare the complexities of the incremental EM based methods using the following table which summarizes the state-of-the-art results.

algorithm γ KOpt KCE OptimalKCE

EM[10] - 1 +kmax n+nkmax N/A

online-EM[6] decaying;O(L⁻¹k⁻^1/2) 1 +kmax n+bkmax ǫ⁻²

iEM[21] 1 1 +kmax n+bkmax ǫ⁻¹n

sEM-vr[7,18] O(L⁻¹n⁻^2/3) 1 +kinkout n(1 +kout) + (bkin+n)kout ǫ⁻¹n^2/3 FIEM[18] O(L⁻¹n⁻^2/3) 1 +kmax n+ 2bkmax ǫ⁻¹n^2/3 FIEM[12] O(L⁻¹n⁻^1/3k⁻max^1/3) 1 +kmax n+ 2bkmax ǫ⁻^3/2√n SPIDER-EM O(L⁻¹) 1 +kinkout n+koutn+ 2bkinkout ǫ⁻¹√

n Table 1: Comparison between different EM-based algorithms for smooth non convex finite sum optimization. ExceptsEM-vrandSPIDER-EMwhich have nested loops (koutis the maximal number of outer loops andkinis the number of inner loops per outer loop),kmaxis the maximal number of iterations. The last column is the optimal complexity to reach anǫ-approximate stationary point.

Next, we provide the psuedo-codes of several existing incremental EM-based algorithms, following the notations defined in the main paper.

(14)

Data:kmax∈N_⋆,Sbinit∈R^q

Result:The EM sequence:Sbk, k= 0, . . . , kmax 1 Sb0= ¯s◦T(Sbinit);

2 fork= 0, . . . , kmax−1do

3 Sbk+1= ¯s◦T(Sbk)

Algorithm 2:The EM algorithm in the expectation space.

Data:kmax∈N_⋆,Sbinit∈R^q,γk ∈(0,∞)fork= 1, . . . , kmax

Result:The SA sequence:Sbk, k= 0, . . . , kmax 1 Sb0= ¯s◦T(Sbinit);

2 fork= 0, . . . , kmax−1do

3 Sample a mini-batchB^k+1in{1, . . . , n}of sizeb, with replacement ;

4 Sbk+1=Sbk+γk+1

¯

s_Bk+1◦T(Sbk)−Sbk

.

Algorithm 3:The Online EM algorithm.

Result:The iEM sequence:Sbk, k= 0, . . . , kmax 1 S_0,i= ¯si◦T(bSinit)for alli= 1, . . . , n;

2 Sb0=Se0=n⁻¹Pn i=1S_0,i;

3 fork= 0, . . . , kmax−1do

5 S_k+1,i=S_k,ifori /∈ B^k+1;

6 S_k+1,i= ¯si◦T(Sbk)fori∈ B^k+1;

7 Sek+1=Sek+n⁻¹P

i∈B^k+1(Sk+1,i−S_k,i);

8 Sbk+1=Sbk+γk+1(Sek+1−Sb^k)

Algorithm 4:The Incremental EM (iEM) algorithm.

Result:The FIEM sequence:Sbk, k= 0, . . . , kmax 1 S_0,i= ¯si◦T(Sbinit)for alli= 1, . . . , n;

2 Sb0=Se0=n⁻¹Pn i=1S_0,i;

3 fork= 0, . . . , kmax−1do

5 S_k+1,i=S_k,ifori /∈ B^k+1;

6 S_k+1,i= ¯si◦T(Sbk)fori∈ B^k+1;

7 Sek+1=Sek+n⁻¹P

i∈B^k+1(S_k+1,i−S_k,i) ;

8 Sample a mini-batchB^′k+1in{1, . . . , n}of sizeb, with replacement ;

9 Vk+1 =Sek+1−b⁻¹P

i∈B^′k+1S_k+1,i;

10 Sbk+1=Sbk+γk+1(¯s_B^′_k+1◦T(Sbk)−Sbk+Vk+1)

Algorithm 5:The Fast Incremental EM (FIEM) algorithm.

(15)

Data:kin∈N_⋆,kout ∈N_⋆,Sbinit∈R^q,γt,k∈(0,∞)fort≥1, k≥1 Result:The sEM-vr sequence:Sbt,k, t= 1, . . . , koutandk= 0, . . . , kin−1

1 S_1,0= ¯s◦T(Sbinit);

2 Sb1,0=Sbinit;

4 fork= 0, . . . , kin−2do

5 Sample a mini-batchB^t,k+1in{1, . . . , n}of sizeb, with replacement ;

6 Vt,k+1=S_t,0−s¯_Bt,k+1◦T(Sbt−1,kin−1);

7 Sbt,k+1=Sbt,k+γt,k+1

¯

s_Bt,k+1◦T(Sbt,k)−Sbt,k+Vt,k+1

8 S_t+1,0= ¯s◦T(Sbt,kin−1);

9 Sbt+1,0=Sbt,kin−1+γt,kin

S_t+1,0−Sbt,kin−1

Algorithm 6:The sEM-vr algorithm.

7 An equivalent definition of the SPIDER-EM algorithm

Using Lemma3below this page, we deduce thatSPIDER-EMcan be equivalently described by the followingalgorithm 7.

Data:kin∈N_⋆,kout ∈N_⋆,Sbinit∈R^q, a positive sequence{γt,k, t, k≥1}. Result:The SPIDER-EM sequence:Sbt,k,t= 1, . . . , kout,k= 0, . . . , kin−1

1 Sb1,−1=Sbinit;

2 Se1,0= ¯s◦T(Sbinit);

4 Vt,0= 0 ;

5 fork= 0, . . . , kin−2do

6 Sample a mini-batchB^t,k+1in{1, . . . , n}of sizeb, with or without replacement ;

7 Vt,k+1=Vt,k+Set,k−¯s_B_t,k+1◦T(Sbt,k−1) ;

8 Set,k+1= ¯s_Bt,k+1◦T(Sbt,k) ;

9 Sbt,k+1=Sbt,k+γt,k+1

Set,k+1−Sbt,k+Vt,k+1

10 Set+1,0= ¯s◦T(Sbt,kin−1);

11 Sbt+1,0=Sbt,kin−1+γt,kin

Set+1,0−Sbt,kin−1

Algorithm 7:The SPIDER-EM algorithm (equivalent description)

Lemma 3. Let{γk, k≥1}be a positive deterministic sequence and{B^k, t, k ≥1}be a family of mini-batches sampled from{1, . . . , n}. FixSb₋1,Sb0andS₀. Define fork= 0,· · ·, kin−2

( S_k+1^def= S_k+ ¯s_B_k+1◦T(Sbk)−s¯_B_k+1◦T(bSk−1), Sbk+1

def= Sbk+γk+1

S_k+1−Sbk

. SetSe₋1

def= Sb₋1,Se0

def= Sb0,V0

def= 0and define fork= 0, . . . , kin−2, ( Vk+1

def= Vk+ ¯s_B_k◦T(Sek−1)−¯s_B_k+1◦T(Sek−1), Sek+1

def= Sek+γk+1

¯s_B_k+1◦T(Sek)−Sek+Vk+1

; by convention, set¯s_B0◦T(Se₋1) =S₀.

Then for anyk=−1, . . . , kin−1,Sek=Sbk.