
HAL Id: hal-02509621

https://hal.archives-ouvertes.fr/hal-02509621v2

Preprint submitted on 8 Feb 2021


Fast incremental expectation-maximization algorithm: √N iterations for an epsilon-stationary point?

Gersende Fort, Eric Moulines, Pierre Gach

To cite this version:

Gersende Fort, Eric Moulines, Pierre Gach. Fast incremental expectation-maximization algorithm: √N iterations for an epsilon-stationary point?. 2021. ⟨hal-02509621v2⟩


FAST INCREMENTAL EXPECTATION-MAXIMIZATION ALGORITHM: √N ITERATIONS FOR AN ε-STATIONARY POINT?

G. Fort (1), E. Moulines (2), P. Gach (1)

(1) IMT, Université de Toulouse & CNRS, F-31062 Toulouse, France.

(2) CMAP, École Polytechnique, Route de Saclay, 91128 Palaiseau Cedex, France.

This paper was submitted to the 2020 IEEE Statistical Signal Processing Workshop in March 2020. The conference was then postponed to July 2021 and the reviewing process for 2020 was canceled. Therefore, this paper is no longer under submission and its content is part of HAL-02617725.

P. Gach contributed to part of the results given in Sections 2.3, 2.4 and 3.

ABSTRACT

Fast Incremental Expectation Maximization (FIEM) is an iterative algorithm, based on the Expectation Maximization (EM) algorithm, which was introduced to adapt EM to the large scale learning framework, so that the full data set does not have to be processed at each iteration. In this paper, we first recast this algorithm in the Stochastic Approximation (SA) within EM framework. Then, we provide non asymptotic convergence rates as a function of the batch size $n$ and of the maximal number of iterations $K_{\max}$ fixed by the user. This allows a complexity analysis: in order to reach an $\varepsilon$-approximate solution, how does $K_{\max}$ depend upon $n$ and $\varepsilon$?

Index Terms— Statistical Learning, Large Scale Learning, Non convex optimization, Iterative Expectation Maximization algorithm, Accelerated Stochastic Approximation, Control Variate.

1. INTRODUCTION

EM [1, 2] is a very popular computational tool, designed to solve non convex minimization problems on $\mathbb{R}^d$ when the objective function is not explicit but defined as $F(\theta) = -\log \int_{\mathsf{Z}} G(z;\theta)\, d\mu(z)$ for a positive function $G$. EM is a Majorize-Minimization algorithm which, based on the current value of the point $\theta_c$, defines a majorizing function $\theta \mapsto Q(\theta, \theta_c)$ through a Kullback-Leibler argument; then, the new point is chosen as the/a minimum of $Q$. The computation of such a function at each iteration can be greedy and even intractable; in many applications, $Q$ has a special form: there exist (known and explicit) functions $\psi, \phi, s$ such that $Q(\cdot, \theta_c) = \psi(\cdot) - \langle \bar{s}(\theta_c), \phi(\cdot)\rangle$, where $\bar{s}(\tau)$ is the expectation of the function $s$ with respect to (w.r.t.) the probability distribution proportional to $G(\cdot;\tau)\, d\mu$. In these cases, the definition of $Q$ boils down to the computation of the vector $\bar{s}(\theta_c)$.

It may happen that the expectation $\bar{s}(\theta_c)$ is not explicit (see e.g. [3, Section 6]); a natural idea is to substitute an approximation, possibly random, for $\bar{s}$. Many stochastic EM versions were proposed and studied: among them, let us cite Monte Carlo EM [4, 5], where $\bar{s}$ is approximated by a Monte Carlo sum, and SA EM ([6, 7]), where $\bar{s}$ is approximated by a SA scheme [8]. With the Big Data era, EM applied to statistical learning evolved into online versions and large scale versions: the objective function is a loss function associated to a set of observations (also called examples); in online versions, the data are not stored and are processed online, which means that the objective function is time-varying (see e.g. [9, 10]); in large scale versions, a batch of observations is given but it is too large to be processed at each iteration of EM.

This work is partially supported by the French Agence Nationale de la Recherche (ANR), project under reference ANR-PRC-CE23 MASDOL, and by the Fondation Simone et Cino Del Duca through the project OpSiMorE.

This paper is devoted to the convergence analysis of FIEM [11], an EM-based algorithm designed for large scale learning in a non convex setting. FIEM is also a SA within EM algorithm, with a SA scheme for the approximation of $\bar{s}(\theta_c)$ which combines (i) a random selection of a single (or a few) observation(s) in the large batch, and (ii) a variance reduction technique inspired from SAGA [12] (see also sEM-VR [13], which uses SVRG [14]). In Section 2, some SA within EM algorithms are described, and FIEM is explicitly recast as such an algorithm. Our contribution is essentially the content of Section 3: we provide non asymptotic convergence rates as a function of the batch size $n$ and of the maximal number of iterations $K_{\max}$ fixed by the user. In this non convex optimization setting, errors are relative to $L^1$-convergence when stopping FIEM at a random time $K$, prior to $K_{\max}$ and chosen independently of the FIEM sources of randomness. We recover the rates previously obtained by [11] in the case where $K$ has a uniform distribution, and improve them from $n^{2/3} K_{\max}^{-1}$ to $n^{1/3} K_{\max}^{-2/3}$; our analysis also includes a definition of the step sizes in the SA scheme, with an explicit dependence upon $n$ and $K_{\max}$. A corollary of these bounds is a complexity analysis: to reach an $\varepsilon$-approximate solution, we show that either $K_{\max} = O(n^{2/3}\varepsilon^{-1})$ and the step size is constant, scaling as $O(n^{-2/3})$, or $K_{\max} = O(n^{1/2}\varepsilon^{-3/2})$ and the step size is constant, scaling as $O(n^{-1/2})$. The last contribution establishes that the same convergence rates are obtained with any random stopping rule $K$ (bounded by $K_{\max}$), up to the choice of an adequate time-varying step size. In Section 4, through a toy example, we explore an extension of FIEM.

2. INCREMENTAL EM ALGORITHMS

This paper addresses explicit convergence rates for an algorithm designed to solve the optimization problem

$$\operatorname{Argmin}_{\theta\in\Theta} F(\theta), \qquad F(\theta) \stackrel{\text{def}}{=} \frac{1}{n}\sum_{i=1}^{n} L_i(\theta) + R(\theta), \tag{1}$$

when $\Theta \subseteq \mathbb{R}^d$ and $F$ cannot be explicitly evaluated (nor its gradient or derivatives of higher order, when they exist). Two levels of intractability of $F(\theta)$ are considered. The first one is motivated by the large scale learning setting, where the number $n$ is so large that computations involving a sum over $n$ terms are not allowed, or have to be quite rare along the run of the optimization algorithm. The second one is motivated by the latent variable statistical framework, where for all $i$ the function $L_i$ is not explicit and is defined through an integral. A large class of computational learning problems is covered by this framework: it includes for example the situations where $n$ is the number of examples, $L_i$ is a loss function associated to example #$i$ and $R$ is a penalty term. In this paper, a specific expression for $L_i$ is considered: we restrict our attention to the case when

$$L_i(\theta) \stackrel{\text{def}}{=} -\log \int_{\mathsf{Z}} h_i(z)\, \exp\left(\langle s_i(z), \phi(\theta)\rangle - \psi(\theta)\right)\, \mu(dz). \tag{2}$$

$n^{-1}\sum_{i=1}^{n} L_i$ can be seen as the normalized negative log-likelihood of $n$ observations in a latent variable model: (i) conditionally to the missing variables, the observations are independent, thus yielding an additive expression of the log-likelihood; (ii) the model for the joint distribution of the observation $y_i$ and its associated missing variable is a curved exponential family w.r.t. a dominating positive measure $\mu$ on a measurable set $(\mathsf{Z}, \mathcal{Z})$. In (2), the dependence upon the observation $y_i$ is omitted (but appears implicitly through the dependence upon $i$ of the functions $h_i, s_i$). Let us introduce rigorous conditions on the problem at hand. Denote by $[\![n]\!] \stackrel{\text{def}}{=} \{1, \ldots, n\}$.

A1. $\Theta \subseteq \mathbb{R}^d$ is an open set. $(\mathsf{Z}, \mathcal{Z})$ is a measurable space and $\mu$ is a $\sigma$-finite positive measure on $\mathcal{Z}$. The functions $R: \Theta \to \mathbb{R}$, $\phi: \Theta \to \mathbb{R}^q$, $\psi: \Theta \to \mathbb{R}$, $s_i: \mathsf{Z} \to \mathbb{R}^q$, $h_i: \mathsf{Z} \to \mathbb{R}_+$ for all $i \in [\![n]\!]$ are measurable functions. Finally, for any $\theta \in \Theta$ and $i \in [\![n]\!]$, $-\infty < L_i(\theta) < \infty$.

Under A1, for any $\theta \in \Theta$ and $i \in [\![n]\!]$, the quantity $p_i(\cdot;\theta)\, d\mu$, where

$$p_i(z;\theta) \stackrel{\text{def}}{=} h_i(z)\, \exp\left(\langle s_i(z), \phi(\theta)\rangle - \psi(\theta) + L_i(\theta)\right), \tag{3}$$

defines a probability distribution on $\mathsf{Z}$.

A2. The expectation $\bar{s}_i(\theta) \stackrel{\text{def}}{=} \int_{\mathsf{Z}} s_i(z)\, p_i(z;\theta)\, \mu(dz)$ exists for all $\theta \in \Theta$ and $i \in [\![n]\!]$. For any $s \in \mathbb{R}^q$, the following set is a (non empty) singleton, denoted by $\{T(s)\}$:

$$\operatorname{Argmin}_{\theta\in\Theta}\left(\psi(\theta) - \langle s, \phi(\theta)\rangle + R(\theta)\right).$$

Define

$$\bar{s}(\theta) \stackrel{\text{def}}{=} n^{-1}\sum_{i=1}^{n} \bar{s}_i(\theta). \tag{4}$$

A3. The functions $\phi$, $\psi$ and $R$ are continuously differentiable on $\Theta$. $T$ is continuously differentiable on $\mathbb{R}^q$. For any $s \in \mathbb{R}^q$, $B(s) \stackrel{\text{def}}{=} \nabla(\phi\circ T)(s)$ is a symmetric $q\times q$ matrix and there exist $0 < v_{\min} \le v_{\max} < \infty$ such that, for all $s \in \mathbb{R}^q$, the spectrum of $B(s)$ is in $[v_{\min}, v_{\max}]$. For any $i \in [\![n]\!]$, $\bar{s}_i\circ T$ is globally Lipschitz on $\mathbb{R}^q$ with constant $L_i$. Set $L^2 \stackrel{\text{def}}{=} n^{-1}\sum_{i=1}^{n} L_i^2$. The function $s \mapsto B(s)^{\mathrm{T}}\,\left(\bar{s}\circ T(s) - s\right)$ is globally Lipschitz on $\mathbb{R}^q$ with constant $L_{\dot V}$.

2.1. An EM algorithm in the statistic-space

In this framework (even including a penalty term $R$ in the objective function), the EM algorithm is usually described as an iterative algorithm in the $\Theta$-space: given a current value $\tau_k \in \Theta$, the next point is obtained by the combination of an expectation step, which here computes $\bar{s}(\tau_k)$, and a maximization step through the map $T$, yielding $\tau_{k+1} \stackrel{\text{def}}{=} T\circ\bar{s}(\tau_k)$. Equivalently (see [7]), by using the map $T$ from $\mathbb{R}^q$ to $\Theta$, it can be described in the $\mathbb{R}^q$-space: given the current value $\hat{s}_k \in \bar{s}(\Theta)$, set $\hat{s}_{k+1} \stackrel{\text{def}}{=} \bar{s}\circ T(\hat{s}_k)$. In this second point of view, which is adopted throughout this paper, the limiting points are characterized as the roots of the field $h$:

$$\{s \in \bar{s}(\Theta) : h(s) = 0\}, \qquad h(s) \stackrel{\text{def}}{=} \bar{s}\circ T(s) - s. \tag{5}$$

Assumption A3 implies that the roots of $h$ are the zeros of $\nabla(F\circ T)$, i.e. the critical points of the objective function. Unfortunately, in the large scale learning setting, EM cannot be used since each iteration involves the computation of a sum over $n$ terms through $\bar{s}$.

2.2. A SA-based algorithm

A natural idea to overcome this intractability is the use of an iterative SA scheme for finding the roots of $h$, upon noting that

$$h(s) = \frac{1}{n}\sum_{i=1}^{n} \bar{s}_i\circ T(s) - s = \mathbb{E}\left[\bar{s}_I\circ T(s) - s\right],$$

where $I \sim \mathcal{U}([\![n]\!])$ is a uniform random variable (r.v.) on $[\![n]\!]$. This yields Algorithm 1, where $K_{\max}$ is the total number of iterations, $\widehat{S}_0$ is the initial value and $\{\gamma_k, k \ge 1\}$ are positive step sizes.

Data: $K_{\max} \in \mathbb{N}$, $\widehat{S}_0 \in \mathbb{R}^q$, $\gamma_k \in (0,\infty)$ for $k \in [\![K_{\max}]\!]$
Result: the SA sequence $\widehat{S}_k$, $k = 0, \ldots, K_{\max}$
1 for $k = 0, \ldots, K_{\max}-1$ do
2   $I_{k+1} \sim \mathcal{U}([\![n]\!])$;
3   $\widehat{S}_{k+1} = \widehat{S}_k + \gamma_{k+1}\,\big(\bar{s}_{I_{k+1}}\circ T(\widehat{S}_k) - \widehat{S}_k\big)$

Algorithm 1: The Stochastic Approximation (SA) algorithm
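To make the recursion concrete, here is a minimal, self-contained Python sketch of Algorithm 1. The map $\bar{s}_i\circ T$ is model specific; the function s_bar_T below is a synthetic linear stand-in introduced only so that the sketch runs end to end (it is not taken from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the model-specific map s_bar_i o T : R^q -> R^q.
# In a real model it returns the conditional expectation of the sufficient
# statistic of example #i evaluated at the parameter T(S); here a linear
# contraction towards a data-dependent point is used only to make the
# sketch runnable.
n, q = 100, 3
y = rng.normal(size=(n, q))

def s_bar_T(i, S):
    return 0.5 * S + y[i]

def sa_within_em(S0, K_max, gamma):
    """Algorithm 1: S_{k+1} = S_k + gamma * (s_bar_{I_{k+1}} o T(S_k) - S_k)."""
    S = S0.copy()
    for _ in range(K_max):
        I = rng.integers(n)                      # Line 2: I_{k+1} ~ U([[n]])
        S = S + gamma * (s_bar_T(I, S) - S)      # Line 3: SA update
    return S

# With this stand-in, the root of h satisfies mean_i s_bar_T(i, S) = S,
# i.e. S* = 2 * mean(y).
print(sa_within_em(np.zeros(q), K_max=5000, gamma=0.05))
```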

2.3. An Incremental EM algorithm (i-EM)

Another idea, introduced by [15] and studied in [16], can be seen as a two-level SA scheme, where an auxiliary level is introduced in order to mimic the computation of $n^{-1}\sum_{i=1}^{n} \bar{s}_i\circ T(\widehat{S}_k)$ at each iteration of the algorithm; $\widehat{S}_k$ is the current point of the algorithm at iteration $k$. i-EM is described by Algorithm 2, with a slight adaptation (in the original algorithm, $\gamma_{k+1} = 1$). Line 8 defines the i-EM sequence; the update rule is based on $\widetilde{S}_{k+1}$, defined through Lines 4 to 7, which satisfies for any $k \ge 0$

$$\widetilde{S}_k = \frac{1}{n}\sum_{i=1}^{n} S_{k,i}\,, \qquad S_{k,i} \stackrel{\text{def}}{=} \bar{s}_i\circ T(\widehat{S}_{<k,i})\,,$$

where $\widehat{S}_{<0,i} \stackrel{\text{def}}{=} \widehat{S}_0$ for all $i \in [\![n]\!]$ and, for $k \ge 0$, $\widehat{S}_{<k+1,i} = \widehat{S}_\ell$, where $\ell$ is the index of the statistic $\widehat{S}$ at the last "visit" to observation #$i$: $\ell = k$ if $I_{k+1} = i$; $\ell \in [\![0, k-1]\!]$ if $I_{k+1} \ne i, \ldots, I_{\ell+2} \ne i$ and $I_{\ell+1} = i$; $\ell = 0$ otherwise. The auxiliary quantity $\widetilde{S}_{k+1}$ is an online estimate of $\bar{s}\circ T(\widehat{S}_k)$, where at iteration $k+1$ only the contribution of observation #$I_{k+1}$ to the sum is updated. Note however that this algorithm, while avoiding a sum over $n$ terms at each iteration, necessitates the storage of a vector $S_{k,\cdot} \in \mathbb{R}^{qn}$ whose length is proportional to $n$.


Data: $K_{\max} \in \mathbb{N}$, $\widehat{S}_0 \in \mathbb{R}^q$, $\gamma_k \in (0,\infty)$ for $k \in [\![K_{\max}]\!]$
Result: the i-EM sequence $\widehat{S}_k$, $k = 0, \ldots, K_{\max}$
1 $S_{0,i} = \bar{s}_i\circ T(\widehat{S}_0)$ for all $i \in [\![n]\!]$;
2 $\widetilde{S}_0 = n^{-1}\sum_{i=1}^{n} S_{0,i}$;
3 for $k = 0, \ldots, K_{\max}-1$ do
4   $I_{k+1} \sim \mathcal{U}([\![n]\!])$;
5   $S_{k+1,i} = S_{k,i}$ for $i \ne I_{k+1}$;
6   $S_{k+1,I_{k+1}} = \bar{s}_{I_{k+1}}\circ T(\widehat{S}_k)$;
7   $\widetilde{S}_{k+1} = \widetilde{S}_k + n^{-1}\,\big(S_{k+1,I_{k+1}} - S_{k,I_{k+1}}\big)$;
8   $\widehat{S}_{k+1} = \widehat{S}_k + \gamma_{k+1}\,\big(\widetilde{S}_{k+1} - \widehat{S}_k\big)$

Algorithm 2: The incremental EM (i-EM) algorithm
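The following Python sketch (again with the synthetic stand-in s_bar_T, which is not from the paper) shows how Lines 1 to 8 of Algorithm 2 translate into code; note that the running-average update of Line 7 must read the stored value $S_{k,I_{k+1}}$ before it is overwritten.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 100, 3
y = rng.normal(size=(n, q))

def s_bar_T(i, S):
    # Synthetic stand-in for s_bar_i o T (same convention as in the SA sketch).
    return 0.5 * S + y[i]

def iem(S0, K_max, gamma=1.0):
    """Algorithm 2: incremental EM with a per-example memory of size n x q."""
    S = S0.copy()
    S_mem = np.stack([s_bar_T(i, S0) for i in range(n)])   # Line 1
    S_tilde = S_mem.mean(axis=0)                           # Line 2
    for _ in range(K_max):
        I = rng.integers(n)                                # Line 4
        new = s_bar_T(I, S)                                # Line 6
        S_tilde = S_tilde + (new - S_mem[I]) / n           # Line 7: running average
        S_mem[I] = new                                     # Lines 5-6: only entry #I changes
        S = S + gamma * (S_tilde - S)                      # Line 8 (gamma = 1 in [15])
    return S

print(iem(np.zeros(q), K_max=5000))
```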

2.4. A Fast Incremental EM algorithm (FIEM)

More recently, [11] introduced FIEM, another incremental EM algorithm; they showed that it is faster than i-EM and SA. FIEM combines the two-level SA idea of i-EM with an acceleration technique inspired from SAGA [12]. The algorithm is described by Algorithm 3. It can be seen as a mix of the SA update (see the term $T_{k+1} \stackrel{\text{def}}{=} \bar{s}_{J_{k+1}}\circ T(\widehat{S}_k) - \widehat{S}_k$ in Line 9) and the (centered) control variate $V_{k+1} \stackrel{\text{def}}{=} \widetilde{S}_{k+1} - S_{k+1,J_{k+1}}$, which is correlated to $T_{k+1}$ through the random index $J_{k+1}$. The auxiliary quantity $\widetilde{S}_{k+1}$ is the same as the one introduced in i-EM (see Section 2.3).

Data: $K_{\max} \in \mathbb{N}$, $\widehat{S}_0 \in \mathbb{R}^q$, $\gamma_k \in (0,\infty)$ for $k \in [\![K_{\max}]\!]$
Result: the FIEM sequence $\widehat{S}_k$, $k = 0, \ldots, K_{\max}$
1 $S_{0,i} = \bar{s}_i\circ T(\widehat{S}_0)$ for all $i \in [\![n]\!]$;
2 $\widetilde{S}_0 = n^{-1}\sum_{i=1}^{n} S_{0,i}$;
3 for $k = 0, \ldots, K_{\max}-1$ do
4   $I_{k+1} \sim \mathcal{U}([\![n]\!])$;
5   $S_{k+1,i} = S_{k,i}$ for $i \ne I_{k+1}$;
6   $S_{k+1,I_{k+1}} = \bar{s}_{I_{k+1}}\circ T(\widehat{S}_k)$;
7   $\widetilde{S}_{k+1} = \widetilde{S}_k + n^{-1}\,\big(S_{k+1,I_{k+1}} - S_{k,I_{k+1}}\big)$;
8   $J_{k+1} \sim \mathcal{U}([\![n]\!])$;
9   $\widehat{S}_{k+1} = \widehat{S}_k + \gamma_{k+1}\,\big(\bar{s}_{J_{k+1}}\circ T(\widehat{S}_k) - \widehat{S}_k + \widetilde{S}_{k+1} - S_{k+1,J_{k+1}}\big)$

Algorithm 3: The Fast Incremental EM (FIEM) algorithm
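A hedged Python sketch of Algorithm 3 follows; it reuses the same synthetic stand-in for $\bar{s}_i\circ T$ as in the previous sketches and is only meant to show how the SA term and the centered control variate of Line 9 combine.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 100, 3
y = rng.normal(size=(n, q))

def s_bar_T(i, S):
    # Synthetic stand-in for s_bar_i o T (same convention as in the previous sketches).
    return 0.5 * S + y[i]

def fiem(S0, K_max, gamma):
    """Algorithm 3: SA update plus a SAGA-like centered control variate."""
    S = S0.copy()
    S_mem = np.stack([s_bar_T(i, S0) for i in range(n)])   # Line 1
    S_tilde = S_mem.mean(axis=0)                           # Line 2
    for _ in range(K_max):
        I = rng.integers(n)                                # Line 4
        new = s_bar_T(I, S)                                # Line 6
        S_tilde = S_tilde + (new - S_mem[I]) / n           # Line 7
        S_mem[I] = new                                     # Lines 5-6
        J = rng.integers(n)                                # Line 8: second, independent index
        T_term = s_bar_T(J, S) - S                         # T_{k+1}, mean h(S) over J
        V_term = S_tilde - S_mem[J]                        # V_{k+1}, zero mean over J
        S = S + gamma * (T_term + V_term)                  # Line 9
    return S

print(fiem(np.zeros(q), K_max=5000, gamma=0.05))
```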

3. FIEM: NON ASYMPTOTIC CONVERGENCE RATES

The originality of our contribution is to provide new explicit non asymptotic error rates for FIEM. The proofs of the statements below can be found in [17]. Since problem (1) is most often non convex, convergence is considered here in terms of the rate at which the following quantities $E_0$ to $E_2$ decay to zero, as a function of the sample size $n$ and of the total number of iterations $K_{\max}$ chosen by the user. As in [11] (see also [18]), we derive $L^1$-error rates along a FIEM sequence stopped at a random time $K$, chosen independently of the sequence. The quantities of interest are

$$E_0 \stackrel{\text{def}}{=} \mathbb{E}\left[\|\nabla V(\widehat{S}_K)\|^2\right], \qquad V \stackrel{\text{def}}{=} F\circ T,$$
$$E_1 \stackrel{\text{def}}{=} \mathbb{E}\left[\|\bar{s}\circ T(\widehat{S}_K) - \widehat{S}_K\|^2\right] = \mathbb{E}\left[\|h(\widehat{S}_K)\|^2\right],$$
$$E_2 \stackrel{\text{def}}{=} \mathbb{E}\left[\|\widetilde{S}_{K+1} - \bar{s}\circ T(\widehat{S}_K)\|^2\right].$$

They respectively quantify, at the random stopping time $K$: the mean squared norm of the gradient of the objective function $F$ (when seen as a function on $\mathbb{R}^q$ through the map $T$); the mean squared (kind of) distance to the set of limiting points (see (5)); and the mean squared error when approximating $\bar{s}\circ T(\widehat{S}_K)$ by $\widetilde{S}_{K+1}$.

Proposition 1 provides an upper bound on the quantities $E_i$ in the case where $K$ is sampled uniformly on $\{0, \ldots, K_{\max}-1\}$: as in [11], the control is proportional to $n^{2/3}/K_{\max}$ and the dependence upon $n$ of the step size is $O(n^{-2/3})$; the constant in the control, and the exact value of the step size, are improved w.r.t. [11] (see Section 4). Proposition 2 shows that, by using another strategy for the step size, while still keeping it constant over iterations, the control of the errors evolves as $n^{1/3}/K_{\max}^{2/3}$. To the best of our knowledge, this is a new result in the literature. Set $\Delta V \stackrel{\text{def}}{=} \mathbb{E}\left[V(\widehat{S}_0) - V(\widehat{S}_{K_{\max}})\right]$ and, for $n \ge 2$ and $C \in (0,1)$,

$$f_n(C) \stackrel{\text{def}}{=} \frac{L_{\dot V}}{2L}\left(\frac{1}{n^{2/3}} + \frac{1}{1-n^{-1/3}}\left(\frac{1}{n}+1\right)\frac{1}{1-\sqrt{C}}\right).$$
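In practice, $C$ (and hence the constant step size used in Proposition 1 below) can be obtained numerically, since $C \mapsto \sqrt{C}\,f_n(C)$ is increasing on $(0,1)$. The short Python sketch below does this by bisection; the numerical values of $v_{\min}, L, L_{\dot V}, \mu, n$ are placeholders, not taken from the paper.

```python
import numpy as np

# Placeholder constants (in the paper they come from the model, see A3); the
# functional form of f_n is the one given above.
v_min, L, L_Vdot, mu, n = 0.5, 1.0, 2.0, 0.5, 1000

def f_n(C):
    return (L_Vdot / (2.0 * L)) * (
        n ** (-2.0 / 3.0)
        + (1.0 / (1.0 - n ** (-1.0 / 3.0))) * (1.0 / n + 1.0) / (1.0 - np.sqrt(C))
    )

def phi(C):
    # sqrt(C) * f_n(C) increases from 0 to +infinity on (0, 1), so phi has a
    # unique root: the constant C of equation (6) below.
    return np.sqrt(C) * f_n(C) - mu * v_min

lo, hi = 1e-12, 1.0 - 1e-12
for _ in range(200):                      # plain bisection
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if phi(mid) < 0 else (lo, mid)
C = 0.5 * (lo + hi)
gamma = np.sqrt(C) * n ** (-2.0 / 3.0) / L   # constant step size of Proposition 1
print(C, gamma)
```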

Proposition 1. Assume A1 to A3 and choose $\mu \in (0,1)$. Let $K$ be a $\{0, \ldots, K_{\max}-1\}$-valued uniform r.v. Run FIEM with a constant step size $\gamma_\ell = \sqrt{C}\, n^{-2/3} L^{-1}$, where $C \in (0,1)$ is the unique solution of

$$\sqrt{C}\, f_n(C) = \mu\, v_{\min}. \tag{6}$$

Then, for any $n \ge 2$ and $K_{\max} \ge 1$,

$$v_{\max}^{-2}\, E_0 \le E_1 + \frac{L_{\dot V}}{2L}\,\frac{\sqrt{C}}{v_{\min}\, n^{2/3}}\, E_2 \le \frac{n^{2/3}}{K_{\max}}\,\frac{L\, \Delta V}{\sqrt{C}\,(1-\mu)\, v_{\min}}\,.$$

The constant $C$ in (6) is upper bounded by the unique point $C_+$ in $(0,1)$ solving $v_{\min} L (1-x) - x L_{\dot V} = 0$; thus showing that $L_{\dot V}\,(1-C_+)^{-1}/(2L) \le f_n(C) \le \sup_n f_n(C_+) < \infty$. Hence, the constant $C$ can be lower bounded and upper bounded (away from $0$ and $1$) by a constant depending only upon $v_{\min}$, $L$ and $L_{\dot V}$.

Proposition 2. Assume A1 to A3 and choose $\mu \in (0,1)$. Let $K$ be a $\{0, \ldots, K_{\max}-1\}$-valued uniform r.v. Run FIEM with a constant step size $\gamma_\ell = \sqrt{C}\, n^{-1/3} K_{\max}^{-1/3} L^{-1}$, where $C > 0$ satisfies

$$\sqrt{C}\,\frac{L_{\dot V}}{2L}\left(1 + \sqrt{C}\left(1 + \frac{1}{1-\lambda}\right)\right) \le \mu\, v_{\min}, \tag{7}$$

for some $\lambda \in (0,1)$. Then, for any $n, K_{\max} \ge 1$ such that $n^{1/3} K_{\max}^{-2/3} \le \lambda/\sqrt{C}$,

$$v_{\max}^{-2}\, E_0 \le E_1 + \frac{L_{\dot V}}{2L}\,\frac{\sqrt{C}}{v_{\min}\, n^{1/3} K_{\max}^{1/3}}\, E_2 \le \frac{n^{1/3}}{K_{\max}^{2/3}}\,\frac{L\, \Delta V}{\sqrt{C}\,(1-\mu)\, v_{\min}}\,.$$

As a corollary of Proposition 2, it may be shown that there exists a constant $M \in (1,+\infty)$, depending upon $v_{\min}, L, L_{\dot V}, \mu$, such that for any $\varepsilon \in (0,1)$ we have

$$\frac{n^{1/3}}{K_{\max}^{2/3}}\,\frac{L}{\sqrt{C}\,(1-\mu)\, v_{\min}} \le \varepsilon,$$

by setting $K_{\max} = M\,\sqrt{n}\,\varepsilon^{-3/2}$; indeed, with this choice, $n^{1/3} K_{\max}^{-2/3} = M^{-2/3}\varepsilon$. This in turn implies that $\gamma_\ell$ scales as $1/\sqrt{n}$. From Proposition 1, the complexity is $K_{\max} = O(n^{2/3}\varepsilon^{-1})$. Proposition 2 thus provides a new rate which improves on the known result $n^{2/3}$, but at a cost in the dependence upon the precision $\varepsilon$. These two propositions are complementary, one providing a better strategy than the other depending on how $\sqrt{\varepsilon}$ compares with $n^{-1/6}$.
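For the record, the exponent bookkeeping behind this choice of $K_{\max}$ can be verified directly; the following display is our own check, not part of the original text.

```latex
% With K_max = M * sqrt(n) * eps^(-3/2):
\[
  \frac{n^{1/3}}{K_{\max}^{2/3}}
    = \frac{n^{1/3}}{M^{2/3}\, n^{1/3}\, \varepsilon^{-1}}
    = \frac{\varepsilon}{M^{2/3}},
  \qquad
  \gamma_\ell
    = \frac{\sqrt{C}}{L\, n^{1/3} K_{\max}^{1/3}}
    = \frac{\sqrt{C}\,\varepsilon^{1/2}}{L\, M^{1/3}\, \sqrt{n}},
\]
% so the displayed quantity is below eps as soon as
% M^(2/3) >= L / (sqrt(C) (1 - mu) v_min), and the step size scales as 1/sqrt(n).
```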

We conclude this section with another point of view: given a probability distribution $p_0, \ldots, p_{K_{\max}-1}$ for the random stopping time $K$, how should the step sizes $\gamma_k$ be chosen in order to reach the same controls (in $n$ and $K_{\max}$) as in the above propositions? We restrict ourselves here to the "mirror" of Proposition 1. For $C \in (0,1)$ and $n \ge 2$, define the function

$$F_{n,C}: x \mapsto \frac{1}{L\, n^{2/3}}\; x\,\left(v_{\min} - x\, f_n(C)\right).$$

$F_{n,C}$ is positive, increasing and continuous on $(0, v_{\min}/(2 f_n(C))]$.

Proposition 3. Assume A1 to A3. Let $K$ be a $\{0, \ldots, K_{\max}-1\}$-valued r.v. with distribution $p_0, \ldots, p_{K_{\max}-1}$, $\inf_k p_k > 0$. Let $C \in (0,1)$ solve

$$\sqrt{C}\, f_n(C) = \frac{1}{2}\, v_{\min}. \tag{8}$$

For any $n \ge 2$ and $K_{\max} \ge 1$, we have

$$v_{\max}^{-2}\, E_0 \le E_1 + \frac{L_{\dot V}}{2L}\,\frac{\min_k \upsilon_k}{\sqrt{C}\, v_{\min}\, n^{2/3}}\, E_2 \le n^{2/3}\, \max_k p_k\; \frac{2 L\, \Delta V}{\sqrt{C}\, v_{\min}}\,,$$

where $\upsilon_k \stackrel{\text{def}}{=} g_k^2\, \max_\ell p_\ell / p_k$ and FIEM is run with

$$\gamma_{k+1} \stackrel{\text{def}}{=} \frac{g_k}{n^{2/3}\, L}\,, \qquad g_k \stackrel{\text{def}}{=} F_{n,C}^{-1}\left(\frac{p_k}{\max_\ell p_\ell}\,\frac{v_{\min}\,\sqrt{C}}{2L}\,\frac{1}{n^{2/3}}\right).$$

kpk = 1, we havemaxkpk Kmax−1 thus showing that among the distributions{pk, k = 0, . . . , Kmax 1}, maxkpkis minimal with the uniform distribution. In that case, Proposition 1 applied withµ = 1/2and Proposition 3 provide exactly the same control: (i)the control evolves asn2/3/Kmax;(ii)the constantC solving (6) in the caseµ= 1/2is the same as the one solving (8);

(iii)sinceFn,C−1(vmin2

Cn−2/3/(2L)) =

C, thengk2 = C and υk=Cso that the controls ofE2are the same in both propositions;

(iv)the step sizes are equal sincegk= C.

The choice of the constant $C$ is also crucial from a numerical point of view, since it defines the step size $\gamma_\ell$: a large value may cause instability, while a small one slows down the convergence (see Section 4). We provided here simple conditions for finding $C$, but there are more intricate conditions than (6), (7), (8) yielding larger constants $C$. For example, Proposition 1 holds for any $C$ satisfying: there exists $\lambda \in (0,1)$ s.t. $n^{-1/3} < \lambda/\sqrt{C}$ and

$$\sqrt{C}\,\frac{L_{\dot V}}{2L}\left(\frac{1}{n^{2/3}} + \frac{\sqrt{C}}{\lambda - \sqrt{C}\,n^{-1/3}}\left(\frac{1}{n}+1\right)\frac{1}{1-\lambda}\right) = \mu\, v_{\min}. \tag{9}$$

4. NUMERICAL INVESTIGATION

Consider a toy example in order to (i) illustrate the role of the step size on the efficiency of FIEM and compare different definitions of it; (ii) illustrate the interest of the variance reduction technique by comparing SA, FIEM and a third strategy, called FIEM-coeff below. Both SA and FIEM update $\widehat{S}_k$ by a scheme of the form $\widehat{S}_{k+1} = \widehat{S}_k + \gamma_{k+1} H_{k+1}$, where $H_{k+1} = T_{k+1} + \lambda_{k+1} V_{k+1}$ with $\lambda_{k+1} = 1$ for FIEM, and $H_{k+1} = T_{k+1}$ for SA (see Section 2.4 for the definition of $T_{k+1}$ and $V_{k+1}$). The use of $V_{k+1}$ can be seen as a control variate approach [19]: as such, the optimal coefficient $\lambda^\star_{k+1}$, defined as the quantity minimizing $\mathbb{E}\left[\|H_{k+1}\|^2 \mid \widehat{S}_k\right]$, depends on the correlation of $V_{k+1}$ and $T_{k+1}$, and is not always equal to one. Below, under the name FIEM-coeff, we explore the benefit of the strategy $\lambda_{k+1} = \lambda^\star_{k+1}$, which, in a realistic example, has a non negligible computational cost and would necessitate the design of an approximation.
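As an illustration of this control-variate viewpoint, the sketch below estimates the optimal coefficient from Monte Carlo draws. The draws of $T_{k+1}$ and $V_{k+1}$ are synthetic placeholders (in FIEM they would be produced by the random indices at the current iterate); only the elementary formula $\lambda^\star = -\mathbb{E}\langle T, V\rangle / \mathbb{E}\|V\|^2$, obtained by minimizing a quadratic in $\lambda$, is the point here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Optimal control-variate coefficient: minimizing E||T + l*V||^2 over l gives
# lambda_star = -E<T, V> / E||V||^2 (V has zero conditional mean in FIEM, so
# changing l does not bias the update). T and V below are synthetic draws.
m, q = 10_000, 5
V = rng.normal(size=(m, q))
T = -0.8 * V + 0.3 * rng.normal(size=(m, q))    # correlated with V

lam_star = -np.sum(T * V) / np.sum(V * V)
H_fiem = T + 1.0 * V                            # FIEM: lambda = 1
H_coeff = T + lam_star * V                      # FIEM-coeff: lambda = lambda*
print(lam_star,
      np.mean(np.sum(H_fiem ** 2, axis=1)),
      np.mean(np.sum(H_coeff ** 2, axis=1)))
```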

$F$ is a penalized negative log-likelihood function: the $n = 10^3$ observations are modeled as independent; each observation $Y_i$, conditionally to a latent variable $Z_i$, has a $\mathbb{R}^{15}$-valued Normal distribution with mean $A Z_i$ and covariance matrix $I$; $Z_i$ has a $\mathbb{R}^{10}$-valued Normal distribution with mean $X\theta_{\mathrm{true}}$ and covariance matrix $I$; $A$ and $X$ are known; $\theta_{\mathrm{true}} \in \mathbb{R}^{20}$ is unknown. The penalty term is $R(\theta) = 0.1\,\|\theta\|^2$. In this toy example, $F$ possesses a unique minimum, which has an explicit expression in terms of $A$, $X$ and $n^{-1}\sum_{i=1}^{n} Y_i$; the constants $v_{\min}, v_{\max}, L, L_{\dot V}$ are also explicit. We have $s_i(z) = X^{\mathrm{T}} z$, $\phi(\theta) = \theta$ and $\psi(\theta) = \theta^{\mathrm{T}} X^{\mathrm{T}} X \theta/2$; $p_i(z;\theta)$ is a Normal density with explicit parameters; $T(s) = (\lambda I + X^{\mathrm{T}} X)^{-1} s$.
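A possible implementation of this toy model is sketched below. The matrices $A$ and $X$ are drawn at random (the paper does not specify them), $\lambda = 0.2$ follows from differentiating the penalized M-step objective with $R(\theta) = 0.1\|\theta\|^2$, and the expression of $\bar{s}_i$ uses standard Gaussian conjugacy, which the paper does not spell out; plugging these two functions into the FIEM sketch of Section 2.4 reproduces the experiment qualitatively.

```python
import numpy as np

rng = np.random.default_rng(1)

# Dimensions from the text; A, X and theta_true are drawn at random here,
# since the paper does not specify them.
n, dy, dz, dtheta = 1000, 15, 10, 20
A = rng.normal(size=(dy, dz))
X = rng.normal(size=(dz, dtheta))
theta_true = rng.normal(size=dtheta)
Z = X @ theta_true + rng.normal(size=(n, dz))     # Z_i ~ N(X theta_true, I)
Y = Z @ A.T + rng.normal(size=(n, dy))            # Y_i | Z_i ~ N(A Z_i, I)

lam = 0.2                                         # derivative of R(theta) = 0.1 ||theta||^2 is 0.2 theta
T_mat = np.linalg.inv(lam * np.eye(dtheta) + X.T @ X)
Sigma = np.linalg.inv(np.eye(dz) + A.T @ A)       # posterior covariance of Z_i given Y_i

def T(s):
    # T(s) = (lam I + X^T X)^{-1} s
    return T_mat @ s

def s_bar_i(i, theta):
    # E[s_i(Z_i) | Y_i; theta] with s_i(z) = X^T z, by Gaussian conjugacy:
    # Z_i | Y_i ~ N(Sigma (A^T Y_i + X theta), Sigma).
    return X.T @ (Sigma @ (A.T @ Y[i] + X @ theta))

# One exact EM step in the statistic space, for reference:
s = np.mean([s_bar_i(i, theta_true) for i in range(n)], axis=0)
print(np.linalg.norm(T(s) - theta_true))
```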

[Figure 1 is not reproduced here: its panels compare step sizes $\gamma \in \{1, 10^{-1}, 10^{-2}, \ldots\}$ with $\gamma_{\mathrm{GFM}}$ and $\gamma_{\mathrm{KM}}$ (top), and SA, FIEM and FIEM-coeff (middle and bottom).]

Fig. 1. [top] Role of the step size; [middle] SA, FIEM and FIEM-coeff; [bottom] FIEM and FIEM-coeff.

Figure 1 [top] displays the error $k \mapsto \|\theta_k - \theta_\star\|$ along FIEM paths (top, right), with a zoom on the first iterations (top, left). The paths are run with different step sizes: $\gamma \in \{1, 0.1, 0.01, \ldots\}$; $\gamma_{\mathrm{GFM}} = \sqrt{C_{\mathrm{GFM}}}\, n^{-2/3} L^{-1}$, where $C_{\mathrm{GFM}}$ solves (9); and $\gamma_{\mathrm{KM}}$ given in [11]. Here, $\gamma_{\mathrm{GFM}} \approx 1.4 \cdot 10^{-3}$ and $\gamma_{\mathrm{KM}} \approx 3.4 \cdot 10^{-6}$. We observe that large step sizes may cause numerical instability; and here, $\gamma_{\mathrm{GFM}}$ is larger than $\gamma_{\mathrm{KM}}$ by a factor $300$, thus yielding faster convergence. Figure 1 [middle] displays the boxplots of $\|\theta_k - \theta_\star\|$ at iterations $k \in \{3\mathrm{e}3, 6\mathrm{e}3, 8\mathrm{e}3, 10\mathrm{e}3, 12\mathrm{e}3\}$; on the left subplot, SA and FIEM are compared (resp. left/right boxplot) and on the right subplot, FIEM and FIEM-coeff are compared. FIEM clearly improves on SA; and in the convergence phase ($k \ge 6\mathrm{e}3$), FIEM and FIEM-coeff look similar. The boxplots are obtained from $100$ independent realizations. Figure 1 [bottom, left] displays $k \mapsto \lambda_k$, for FIEM ($\lambda_k = 1$) and for FIEM-coeff ($\lambda_k = \lambda^\star_k$). The plot is the mean value over $10^3$ independent runs (the quantiles $0.25$ and $0.75$ are displayed in dotted lines). It shows that $\lambda^\star_k$ is smaller than $0.85$ in the transient phase $k \in [5\mathrm{e}2, 3.5\mathrm{e}3]$. On Figure 1 [bottom, right], a Monte Carlo estimate (over $10^3$ independent runs) of $\mathbb{E}\left[\|H_{k+1}\|^2\right]$ for SA, FIEM and FIEM-coeff is displayed for $k \in \{1.5\mathrm{e}3, \ldots, 4.5\mathrm{e}3\}$: FIEM-coeff can improve on FIEM by up to $15\%$ in the transient phase. In this example, $\lambda^\star_{k+1}$ is explicit; its benefit will be investigated in future works, taking into account the numerical cost of its approximation. $\mathbb{E}\left[\|H_{k+1}\|^2\right]$ is a crucial quantity, since the convergence rates derived in Section 3 are obtained by upper bounding this quantity (see [17]).

5. REFERENCES

[1] A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," J. Roy. Stat. Soc. B Met., vol. 39, no. 1, pp. 1–38, 1977.

[2] C.F.J. Wu, "On the Convergence Properties of the EM Algorithm," Ann. Statist., vol. 11, no. 1, pp. 95–103, 1983.

[3] G.J. McLachlan and T. Krishnan, The EM Algorithm and Extensions, Wiley Series in Probability and Statistics. Wiley, 2008.

[4] G.C.G. Wei and M.A. Tanner, "A Monte Carlo Implementation of the EM Algorithm and the Poor Man's Data Augmentation Algorithms," J. Am. Stat. Assoc., vol. 85, no. 411, pp. 699–704, 1990.

[5] G. Fort and E. Moulines, "Convergence of the Monte Carlo expectation maximization for curved exponential families," Ann. Statist., vol. 31, no. 4, pp. 1220–1259, 2003.

[6] G. Celeux and J. Diebolt, "The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem," Computational Statistics Quarterly, vol. 2, pp. 73–82, 1985.

[7] B. Delyon, M. Lavielle, and E. Moulines, "Convergence of a Stochastic Approximation version of the EM algorithm," Ann. Statist., vol. 27, no. 1, pp. 94–128, 1999.

[8] A. Benveniste, M. Métivier, and P. Priouret, Adaptive Algorithms and Stochastic Approximations, Springer Verlag, 1990.

[9] O. Cappé and E. Moulines, "On-line Expectation Maximization algorithm for latent data models," J. Roy. Stat. Soc. B Met., vol. 71, no. 3, pp. 593–613, 2009.

[10] S. Le Corff and G. Fort, "Online Expectation Maximization based algorithms for inference in Hidden Markov Models," Electron. J. Statist., vol. 7, pp. 763–792, 2013.

[11] B. Karimi, H.-T. Wai, E. Moulines, and M. Lavielle, "On the Global Convergence of (Fast) Incremental Expectation Maximization Methods," in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., pp. 2837–2847. Curran Associates, Inc., 2019.

[12] A. Defazio, F. Bach, and S. Lacoste-Julien, "SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives," in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger, Eds., pp. 1646–1654. Curran Associates, Inc., 2014.

[13] J. Chen, J. Zhu, Y.W. Teh, and T. Zhang, "Stochastic Expectation Maximization with Variance Reduction," in Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., pp. 7967–7977. Curran Associates, Inc., 2018.

[14] R. Johnson and T. Zhang, "Accelerating Stochastic Gradient Descent using Predictive Variance Reduction," in Advances in Neural Information Processing Systems 26, C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, Eds., pp. 315–323. Curran Associates, Inc., 2013.

[15] R.M. Neal and G.E. Hinton, A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants, pp. 355–368, Springer Netherlands, Dordrecht, 1998.

[16] A. Gunawardana and W. Byrne, "Convergence theorems for generalized alternating minimization procedures," J. Mach. Learn. Res., vol. 6, pp. 2049–2073, 2005.

[17] P. Gach, G. Fort, and E. Moulines, "(to be confirmed) Fast Incremental Expectation-Maximization algorithm: non asymptotic convergence rates for an ε-stationary point," Tech. Rep., https://perso.math.univ-toulouse.fr/gfort, 2020. A preliminary version of the paper is available upon request (write to the second author, Pr G. Fort) and the paper should be accessible from the webpage of G. Fort by April 1st, 2020.

[18] S. Ghadimi and G. Lan, "Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming," SIAM J. Optimiz., vol. 23, no. 4, pp. 2341–2368, 2013.

[19] P. Glasserman, Monte Carlo Methods in Financial Engineering, Springer, New York, 2004.

Références

Documents relatifs

ML estimation in the presence of a nuisance vector Using the Expectation Max- imization (EM) algorithm Or the Bayesian Expectation Maximization (BEM) algorithm... ML estimation in

ECMPR- rigid was applied to simulated data to assess the performance of the method with respect to (i) the initialization of the method’s parameters, (ii) the amount of Gaussian

Index Terms—Point registration, feature matching, articulated object tracking, hand tracking, object pose, robust statistics, outlier detection, expectation maximization, EM,

In this paper, we have proved that the exponential- ization process does not converge in general towards the maximum likelihood of the initial model using the SAEM or

Our convergence results are global and non-asymptotic, and we offer two complimentary views on the existing stochastic EM methods — one interprets iEM method as an incremental

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des

More specifically, for com- parison, we employ methods that derive the seed set using the static network, the temporal network, the diffusion cascades, and their combi- nation.. A

To this regard, we borrow from [11, 27] (see also [24] and the algorithm SARAH) a new technique called Stochastic Path-Integrated Differential Estimator (SPIDER) to generate the