
HAL Id: hal-02334485

https://hal.inria.fr/hal-02334485

Preprint submitted on 26 Oct 2019


On the Convergence Properties of the Mini-Batch EM and MCEM Algorithms

Belhal Karimi, Marc Lavielle, Éric Moulines

To cite this version:

Belhal Karimi, Marc Lavielle, Éric Moulines. On the Convergence Properties of the Mini-Batch EM and MCEM Algorithms. 2019. ⟨hal-02334485⟩


On the Convergence Properties of the Mini-Batch EM and MCEM Algorithms

Belhal Karimi, Marc Lavielle, Éric Moulines

CMAP, École Polytechnique, Université Paris-Saclay, 91128 Palaiseau, France
INRIA Saclay, 1 Rue Honoré d'Estienne d'Orves, 91120 Palaiseau, France

belhal.karimi@polytechnique.edu

October 26, 2019

Abstract

The EM algorithm is one of the most popular algorithms for inference in latent data models. For large datasets, each iteration of the algorithm can be numerically involved.

To alleviate this problem, (Neal and Hinton, 1998) have proposed an incremental version in which the conditional expectation of the latent data (E-step) is computed on a mini-batch of observations. In this paper, we analyse this variant and propose and analyse a Monte Carlo version of the incremental EM in which the conditional expectation is evaluated by Markov Chain Monte Carlo (MCMC). We establish the almost-sure convergence of these algorithms, covering both the mini-batch EM and its stochastic version. Various numerical applications are introduced in this article to illustrate our findings.

1 Introduction

Many problems in computational statistics reduce to maximising a function, defined on a feasible set Θ, of the following form:

$$g(\theta) \triangleq \int_{\mathsf{Z}} f(z, \theta)\,\mu(\mathrm{d}z)\,, \qquad (1)$$

where $f : \mathsf{Z} \times \Theta \to \mathbb{R}_+$ is a positive function and $\mu$ is a $\sigma$-finite measure. In the incomplete data framework, the function $g$ is the incomplete data likelihood, $z$ is the missing data vector and $f$ stands for the complete data likelihood, that is the joint likelihood of the observations and the missing data.

When the direct optimisation of the function $g$ is difficult, the EM algorithm may be an option. The EM algorithm iteratively computes a sequence of estimates $\{\theta^k,\ k \in \mathbb{N}\}$ starting from some initial parameter $\theta^0$. Each iteration of the EM algorithm may be decomposed into two steps. In the E-step, the following surrogate function

$$\theta \mapsto Q(\theta, \theta^{k-1}) \triangleq \int_{\mathsf{Z}} \log f(z, \theta)\, p(z, \theta^{k-1})\,\mu(\mathrm{d}z)$$

is computed, where $p(z, \theta^{k-1}) \triangleq f(z, \theta^{k-1})/g(\theta^{k-1})$ is the conditional probability density of the latent variables given the observations at the current fit of the parameter $\theta^{k-1}$. In the M-step, this surrogate function is maximised, yielding a new fit of the parameter $\theta^k = \operatorname{argmax}_{\theta \in \Theta} Q(\theta, \theta^{k-1})$. The EM algorithm has been the object of considerable interest since its introduction in (Dempster et al., 1977); the scope of the algorithm and many applications are presented in the reference book (McLachlan and Krishnan, 2008).
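To make the two steps concrete, the vanilla EM iteration can be written as a short loop. The sketch below is only an illustration: the `e_step` and `m_step` routines are hypothetical, model-specific placeholders (they are not defined in this paper), and a scalar parameter is assumed for the stopping test.

```python
# Minimal sketch of the EM iteration (illustration only, not the paper's code).
# `e_step(theta)` returns the surrogate theta -> Q(theta, theta_prev) and
# `m_step(Q)` returns its maximiser; both are hypothetical placeholders that
# must be supplied for a given model.
def em(theta0, e_step, m_step, n_iter=100, tol=1e-8):
    theta = theta0
    for _ in range(n_iter):
        Q = e_step(theta)                  # E-step: build the surrogate at the current fit
        theta_new = m_step(Q)              # M-step: maximise the surrogate
        if abs(theta_new - theta) < tol:   # scalar parameter assumed for the stopping test
            return theta_new
        theta = theta_new
    return theta
```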

The EM algorithm has a number of interesting features: it is monotone (at each iteration, the algorithm improves the objective function or leaves it unchanged if a local maximum has been reached), it is invariant under the choice of the parametrisation, it is numerically very stable (when the optimisation set is well defined) and it is easy to implement for a large class of models.

Many possible improvements have been proposed. In a landmark paper, (Neal and Hinton, 1998) have proposed an incremental version of the algorithm. In many applications, log f(z, θ) can be written as a large sum of functions: it is therefore possible to update at each iteration only a subsample of the terms in this sum and then to perform the M-step. Since this algorithm makes use of the new information immediately, it is expected to improve the convergence of the EM algorithm in this context.

This algorithm has had an enormous impact in applied statistics and machine learning; see, among many others, (Thiesson et al., 2001) for maximum likelihood estimation with missing data in large datasets, (Hsiao et al., 2006) for PET tomographic reconstruction, (Vlassis and Likas, 2002; Ng and McLachlan, 2003) for Gaussian mixture learning, (Likas and Galatsanos, 2004) for blind image deconvolution, (Ng and McLachlan, 2004) for segmentation of magnetic resonance images, (Blei et al., 2017) for variational inference and (Ablin et al., 2018) for Independent Component Analysis. A closely related version of the EM has been introduced in (Cappé and Moulines, 2009; Cappé, 2011); the objective is however slightly different, since in this case the observations are processed online.

In a recent paper, closing the gap between the practical use of EM and its theoretical understanding, (Balakrishnan et al., 2017) developed a "sample-splitting EM" algorithm where the parameters are obtained, at each iteration, using a subset of the observations. The authors give a quantitative characterisation of the region of attraction around the global optimum, for both finite sample sets and the idealised limit of infinite samples. Even though this comprehensive work gives strong theoretical guarantees for the EM algorithm and its subsample-based variant, it does not deal with the incremental version of the EM algorithm we are studying here, since the update of this "sample-splitting" EM algorithm does not include any terms from previous iterations.

The convergence of the incremental version of the EM algorithm was first tackled by (Gunawardana and Byrne, 2005), exploiting the interpretation of the EM algorithm as an alternating minimisation procedure under the information geometric framework developed in (Csiszár and Tusnády, 1984). Nevertheless, (Gunawardana and Byrne, 2005) assume that the latent variables take only a finite number of values and that the order in which the observations are processed remains the same from one pass to the other. There is no obvious way to extend this analysis to more general latent variables and sampling schemes.


In the high-dimensional setting, the exact maximisation performed in the M-step turns out to be very time consuming, or even ill-posed; gradient EM (Wang et al., 2014) can then be very attractive. In (Zhu et al., 2017), the authors propose a novel high-dimensional EM algorithm by incorporating variance reduction into the stochastic gradient method for EM. In this algorithm, the full gradient is calculated incrementally, updating only a mini-batch of components at each iteration. This method is largely inspired by recent advances in stochastic optimisation (Roux et al., 2012; Defazio et al., 2014) and can be appealing when the dimension is much larger than the sample size. Their algorithm has an improved overall computational complexity over the gradient EM of (Wang et al., 2014) and converges at a linear rate to a local optimum (see the results and proofs therein).

The Monte Carlo EM (MCEM) algorithm (Wei and Tanner, 1990) has been proposed for cases where the quantity computed in the E-step involves infeasible computations. It consists in splitting the E-step into a first step where the latent variables are simulated, followed by a Monte Carlo integration of the intractable expectation of the complete log-likelihood. The M-step remains unchanged. The MCEM algorithm has been successfully applied in mixed effects models (McCulloch, 1997; Hughes, 1999; Baey et al., 2016) and for inference in joint modelling of time-to-event data coming from clinical trials (Chakraborty and Das, 2010), among other applications.

This algorithm was initially studied in (Chan and Ledolter, 1995), followed by many authors: (Sherman et al., 1999) show the convergence of the MCEM when the Monte Carlo integration is performed using independent Markov chains generated by a Gibbs sampler, (Levine and Casella, 2001; Booth et al., 2001) give details on the implementation of the MCEM, (Fort et al., 2003) generalise the results of (Sherman et al., 1999) to a wide class of MCMC simulation techniques, and further accounts can be found in (McLachlan and Krishnan, 2008) and (Neath et al., 2013).

In this contribution, we propose the stochastic version of the mini-batch EM algorithm, called the mini-batch MCEM (MBMCEM). The mini-batch of surrogate functions computed at each iteration is no longer deterministic but is approximated using Monte Carlo integration. The incremental framework developed in (Mairal, 2015), called MISO (Minimisation by Incremental Surrogate Optimisation), which is used to analyse the MBEM in this paper, cannot be applied in this context; we therefore provide, in this article, an extension of this framework so that convergence guarantees on the objective function and a similar stationary point condition are established.

We summarise the main contributions of this paper as follows. Using the MISO framework (Mairal, 2015) we establish, under mild assumptions on the incomplete model and the auxiliary Q-function, the almost-sure convergence of the mini-batch EM algorithm by constructing suitable surrogate functions. We then establish the almost-sure convergence of the mini-batch version of the MCEM using an extension of the MISO framework when the surrogate functions are stochastic.

The remainder of this paper is organised as follows. Section 2 provides the assumptions on the model, the MBEM algorithm and sets out the convergence of the objective function.

Section 3 introduces the MBMCEM algorithm and its convergence theorem. Each section also provides the corresponding algorithm when the complete model belongs to the curved exponential family. Section 4 investigates, through a simulation study on a mixed effects model and a logistic regression, how these algorithms converge with respect to the mini-batch size. Section 5 is devoted to the technical proofs of our results.

2 Convergence of the mini-batch EM algorithm

2.1 Model assumptions and notations

M 1. The parameter set $\Theta$ is a closed convex subset of $\mathbb{R}^p$.

Let $N$ be an integer and, for $i \in [\![1, N]\!]$, let $\mathsf{Z}_i$ be a subset of $\mathbb{R}^{m_i}$, $\mu_i$ be a $\sigma$-finite measure on the Borel $\sigma$-algebra $\mathcal{Z}_i = \mathcal{B}(\mathsf{Z}_i)$ and $\{f_i(z_i, \theta),\ \theta \in \Theta\}$ be a family of positive $\mu_i$-integrable Borel functions on $\mathsf{Z}_i$. Set $z = (z_i \in \mathsf{Z}_i,\ i \in [\![1, N]\!]) \in \mathsf{Z}$, where $\mathsf{Z} = \times_{i=1}^{N} \mathsf{Z}_i$ and $\mu$ is the product of the measures $(\mu_i,\ i \in [\![1, N]\!])$.

Define, for all $i \in [\![1, N]\!]$ and $\theta \in \Theta$:

$$g_i(\theta) \triangleq \int_{\mathsf{Z}_i} f_i(z_i, \theta)\,\mu_i(\mathrm{d}z_i) \quad\text{and}\quad p_i(z_i, \theta) \triangleq \begin{cases} f_i(z_i, \theta)/g_i(\theta) & \text{if } g_i(\theta) \neq 0\,,\\ 0 & \text{otherwise}\,. \end{cases} \qquad (2)$$

Note that $p_i(z_i, \theta)$ defines a probability density function with respect to $\mu_i$. Thus $\mathcal{P}_i = \{p_i(z_i, \theta);\ \theta \in \Theta\}$ is a family of probability densities. We denote by $\{\mathrm{P}_{i,\theta};\ \theta \in \Theta\}$ the associated family of probability measures. For all $\theta \in \Theta$, we set

$$f(z, \theta) = \prod_{i=1}^{N} f_i(z_i, \theta)\,, \quad g(\theta) = \prod_{i=1}^{N} g_i(\theta) \quad\text{and}\quad p(z, \theta) = \prod_{i=1}^{N} p_i(z_i, \theta)\,. \qquad (3)$$

Remark 1. An example of this setting is the incomplete data framework. In this case, we consider $N$ independent observations $(y_i \in \mathsf{Y}_i,\ i \in [\![1, N]\!])$, where $\mathsf{Y}_i$ is a subset of $\mathbb{R}^{\ell_i}$, and missing data $(z_i \in \mathsf{Z}_i,\ i \in [\![1, N]\!])$. In this framework,

• $f_i(z_i, \theta)$ is the complete data likelihood, that is the likelihood of the observed data $y_i$ augmented with the missing data $z_i$;

• $g_i(\theta)$ is the incomplete data likelihood, that is the likelihood of the observed data $y_i$;

• $p_i(z_i, \theta)$ is the posterior distribution of the missing data $z_i$ given the observed data $y_i$.

Our objective is to maximise the function $\theta \mapsto \log g(\theta)$ or, equivalently, to minimise the objective function $\ell : \Theta \to \mathbb{R}$ defined as:

$$\ell(\theta) \triangleq -\log g(\theta) = \sum_{i=1}^{N} \ell_i(\theta) \quad \text{for all } \theta \in \Theta\,, \qquad (4)$$

where $\ell_i(\theta) \triangleq -\log g_i(\theta)$. The EM algorithm is an iterative optimisation algorithm that minimises the function $\theta \mapsto \ell(\theta)$ when its direct minimisation is difficult. Denote by $\theta^{k-1}$ the current fit of the parameter at iteration $k$. The $k$-th step of the EM algorithm might be decomposed into two steps. The E-step consists in computing the surrogate function defined for all $\theta \in \Theta$ as:

$$Q(\theta, \theta^{k-1}) \triangleq -\int_{\mathsf{Z}} p(z, \theta^{k-1}) \log f(z, \theta)\,\mu(\mathrm{d}z) \qquad (5)$$
$$= -\sum_{i=1}^{N} \int_{\mathsf{Z}_i} p_i(z_i, \theta^{k-1}) \log f_i(z_i, \theta)\,\mu_i(\mathrm{d}z_i) = \sum_{i=1}^{N} Q_i(\theta, \theta^{k-1})\,, \qquad (6)$$

where:

$$Q_i(\theta, \theta^{k-1}) \triangleq -\int_{\mathsf{Z}_i} p_i(z_i, \theta^{k-1}) \log f_i(z_i, \theta)\,\mu_i(\mathrm{d}z_i)\,. \qquad (7)$$

In the M-step, the value of $\theta$ minimising $Q(\theta, \theta^{k-1})$ is calculated. This yields the new parameter estimate $\theta^k$. These two steps are repeated until convergence. The essence of the EM algorithm is that decreasing $Q(\theta, \theta^{k-1})$ forces a decrease of the function $\theta \mapsto \ell(\theta)$ (Dempster et al., 1977). The mini-batch version of the EM algorithm is described as follows:

Algorithm 1 mini-batch EM algorithm

Initialisation: given an initial parameter estimate $\theta^0$, for all $i \in [\![1, N]\!]$ compute a surrogate function $\vartheta \mapsto R_i^0(\vartheta) = Q_i(\vartheta, \theta^0)$ defined by (7).

Iteration k: given the current estimate $\theta^{k-1}$:

1. Pick a set $I_k$ uniformly on $\{A \subset [\![1, N]\!],\ \mathrm{card}(A) = p\}$.

2. For all $i \in I_k$, compute $\vartheta \mapsto Q_i(\vartheta, \theta^{k-1})$ defined by (7).

3. Set $\theta^k \in \arg\min_{\vartheta \in \Theta} \sum_{i=1}^{N} R_i^k(\vartheta)$, where the $R_i^k(\vartheta)$ are defined recursively as follows:

$$R_i^k(\vartheta) = \begin{cases} Q_i(\vartheta, \theta^{k-1}) & \text{if } i \in I_k\,,\\ R_i^{k-1}(\vartheta) & \text{otherwise}\,. \end{cases} \qquad (8)$$

We remark that, for all $i \in [\![1, N]\!]$ and $\theta \in \Theta$:

$$R_i^k(\theta) = Q_i(\theta, \theta^{\tau_{i,k}})\,, \qquad (9)$$

where, for all $i \in [\![1, N]\!]$, $\tau_{i,0} = 0$ and for $k \geq 1$ the indices $\tau_{i,k}$ are defined recursively as follows:

$$\tau_{i,k} = \begin{cases} k - 1 & \text{if } i \in I_k\,,\\ \tau_{i,k-1} & \text{otherwise}\,. \end{cases} \qquad (10)$$
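In an implementation, step 3 of Algorithm 1 only requires storing, for each index $i$, the surrogate computed the last time $i$ was drawn, which is exactly the bookkeeping encoded by the indices $\tau_{i,k}$. The sketch below illustrates this mechanism; `compute_Q` and `minimise_sum` are hypothetical placeholders standing for the model-specific E-step and joint M-step, and are not part of the paper.

```python
import numpy as np

# Sketch of the bookkeeping of Algorithm 1 (hedged illustration).
# `compute_Q(i, theta)` returns the surrogate function Q_i(., theta) and
# `minimise_sum(R)` returns argmin_theta sum_i R[i](theta); both are
# model-specific placeholders assumed to be supplied by the user.
def mini_batch_em(theta0, N, p, compute_Q, minimise_sum, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    R = [compute_Q(i, theta0) for i in range(N)]    # R_i^0 = Q_i(., theta^0)
    tau = np.zeros(N, dtype=int)                    # tau_{i,0} = 0
    theta = theta0
    for k in range(1, n_iter + 1):
        I_k = rng.choice(N, size=p, replace=False)  # uniform mini-batch of size p
        for i in I_k:
            R[i] = compute_Q(i, theta)              # refresh Q_i(., theta^{k-1})
            tau[i] = k - 1                          # record the refresh time tau_{i,k}
        theta = minimise_sum(R)                     # theta^k in argmin sum_i R_i^k
    return theta
```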

As noted in (Gunawardana and Byrne, 2005) and (Neal and Hinton, 1998), there is no guarantee, unlike the EM algorithm, that the objective function $\theta \mapsto \ell(\theta)$ decreases at each iteration. We also remark that we recover the full EM algorithm when the mini-batch size $p$ is set to be equal to $N$. Let $\mathrm{T}(\Theta)$ be a neighborhood of $\Theta$. To study the convergence of the MBEM algorithm we consider the following assumptions:

M 2. For all $i \in [\![1, N]\!]$, assume that:

a. For all $\theta \in \Theta$ and $z_i \in \mathsf{Z}_i$, $f_i(z_i, \theta)$ is strictly positive, the function $\theta \mapsto f_i(z_i, \theta)$ is two-times differentiable on $\mathrm{T}(\Theta)$ for $\mu_i$-almost every $z_i$ and, for all $\vartheta \in \Theta$:

$$\int_{\mathsf{Z}_i} |\nabla f_i(z_i, \theta)|\,\mu_i(\mathrm{d}z_i) < \infty \quad\text{and}\quad \int_{\mathsf{Z}_i} p_i(z_i, \vartheta)\,|\log f_i(z_i, \theta)|\,\mu_i(\mathrm{d}z_i) < \infty\,. \qquad (11)$$

b. For all $\theta \in \Theta$, there exist $\delta > 0$ and a measurable function $\psi_\theta : \mathsf{Z}_i \to \mathbb{R}$ such that $\sup_{\|\vartheta - \theta\| \leq \delta} |\nabla^2 f_i(z_i, \vartheta)| \leq \psi_\theta(z_i)$ for $\mu_i$-almost every $z_i$, with $\int_{\mathsf{Z}_i} \psi_\theta(z_i)\,\mu_i(\mathrm{d}z_i) < \infty$.

c. There exist a measurable function $\phi_i : \mathsf{Z}_i \to \mathbb{R}$ and $L_i < \infty$ such that $\sup_{\theta \in \Theta} |\nabla^2 \log f_i(z_i, \theta)| \leq \phi_i(z_i)$ for $\mu_i$-almost every $z_i$, with $\sup_{\theta \in \Theta} \int_{\mathsf{Z}_i} p_i(z_i, \theta)\,\phi_i(z_i)\,\mu_i(\mathrm{d}z_i) \leq L_i$.

d. For all $i \in [\![1, N]\!]$ and $\theta \in \Theta$, $\sup_{\theta \in \Theta} |\nabla^2 \ell_i(\theta)| < \infty$.

It is easily checked that these assumptions imply, for all $i \in [\![1, N]\!]$, that:

1. The function $\theta \mapsto g_i(\theta)$ is continuously differentiable on $\mathrm{T}(\Theta)$ and the Fisher identity (Fisher, 1925) holds:

$$\nabla \ell_i(\theta) = -\int_{\mathsf{Z}_i} p_i(z_i, \theta)\,\nabla \log f_i(z_i, \theta)\,\mu_i(\mathrm{d}z_i)\,. \qquad (12)$$

2. For all $\vartheta \in \Theta$, the function $\theta \mapsto Q_i(\theta, \vartheta)$ is continuously differentiable on $\mathrm{T}(\Theta)$ and is $L_i$-smooth, i.e., for all $(\theta, \theta') \in \Theta^2$ and $L_i > 0$:

$$\|\nabla Q_i(\theta, \vartheta) - \nabla Q_i(\theta', \vartheta)\| \leq L_i\, \|\theta - \theta'\|\,. \qquad (13)$$

3. For all $i \in [\![1, N]\!]$ and $\theta \in \Theta$, the Louis formula (Louis, 1982) yields that:

$$\nabla^2 \ell_i(\theta) = -\int_{\mathsf{Z}_i} p_i(z_i, \theta)\,\nabla^2 \log f_i(z_i, \theta)\,\mu_i(\mathrm{d}z_i) \qquad (14)$$
$$\quad - \int_{\mathsf{Z}_i} p_i(z_i, \theta)\,\nabla \log f_i(z_i, \theta)\,\big(\nabla \log f_i(z_i, \theta)\big)^{\top}\,\mu_i(\mathrm{d}z_i) \qquad (15)$$
$$\quad + \int_{\mathsf{Z}_i} p_i(z_i, \theta)\,\nabla \log f_i(z_i, \theta)\,\mu_i(\mathrm{d}z_i)\,\left( \int_{\mathsf{Z}_i} p_i(z_i, \theta)\,\nabla \log f_i(z_i, \theta)\,\mu_i(\mathrm{d}z_i) \right)^{\top}\,. \qquad (16)$$

Thus, sufficient conditions to verify M 2d. are M 2c. and the following condition: there exists a measurable function $N_i : \mathsf{Z}_i \to \mathbb{R}$ such that, for all $\theta \in \Theta$, $|\nabla \log f_i(z_i, \theta)| \leq N_i(z_i)$ for $\mu_i$-almost every $z_i$, with $\int_{\mathsf{Z}_i} p_i(z_i, \theta)\,N_i^2(z_i)\,\mu_i(\mathrm{d}z_i) < \infty$.

M 3. For all $i \in [\![1, N]\!]$, the objective function $\ell_i$ is bounded from below, i.e. there exists $M_i \in \mathbb{R}$ such that, for all $\theta \in \Theta$:

$$\ell_i(\theta) \geq M_i\,. \qquad (17)$$

For $\theta \in \Theta$, we say that a function $B_{i,\theta}$ is a surrogate of $\ell_i$ at $\theta$ if the three following properties are satisfied:

S.1 the function $\vartheta \mapsto B_{i,\theta}(\vartheta)$ is continuously differentiable on $\mathrm{T}(\Theta)$;

S.2 for all $\vartheta \in \Theta$, $B_{i,\theta}(\vartheta) \geq \ell_i(\vartheta)$;

S.3 $B_{i,\theta}(\theta) = \ell_i(\theta)$ and $\nabla B_{i,\theta}(\vartheta)\big|_{\vartheta=\theta} = \nabla \ell_i(\vartheta)\big|_{\vartheta=\theta}$.

For all $i \in [\![1, N]\!]$ and $(\theta, \theta') \in \Theta^2$, define the Kullback–Leibler divergence from $\mathrm{P}_{i,\theta'}$ to $\mathrm{P}_{i,\theta}$ as:

$$\mathrm{KL}(\mathrm{P}_{i,\theta}\,\|\,\mathrm{P}_{i,\theta'}) \triangleq \int_{\mathsf{Z}_i} p_i(z_i, \theta) \log \frac{p_i(z_i, \theta)}{p_i(z_i, \theta')}\,\mu_i(\mathrm{d}z_i) \qquad (18)$$

and the negated entropy function $\mathrm{H}_i(\theta)$ as:

$$\mathrm{H}_i(\theta) \triangleq \int_{\mathsf{Z}_i} p_i(z_i, \theta) \log p_i(z_i, \theta)\,\mu_i(\mathrm{d}z_i)\,. \qquad (19)$$

To analyse the MBEM algorithm, we introduce, for $i \in [\![1, N]\!]$ and $\theta \in \Theta$, the function $\vartheta \mapsto B_{i,\theta}(\vartheta)$ defined by:

$$B_{i,\theta}(\vartheta) \triangleq Q_i(\vartheta, \theta) + \mathrm{H}_i(\theta)\,. \qquad (20)$$

We will show below that, for $i \in [\![1, N]\!]$ and $\theta \in \Theta$, $B_{i,\theta}$ is a surrogate of $\ell_i$ at $\theta$. Let us note that this function can be rewritten as follows:

$$B_{i,\theta}(\vartheta) = \int_{\mathsf{Z}_i} p_i(z_i, \theta) \log \frac{p_i(z_i, \theta)}{f_i(z_i, \vartheta)}\,\mu_i(\mathrm{d}z_i) \qquad (21)$$
$$= \int_{\mathsf{Z}_i} p_i(z_i, \theta) \log \frac{p_i(z_i, \theta)}{p_i(z_i, \vartheta)}\,\mu_i(\mathrm{d}z_i) + \ell_i(\vartheta) \qquad (22)$$
$$= \mathrm{KL}(\mathrm{P}_{i,\theta}\,\|\,\mathrm{P}_{i,\vartheta}) + \ell_i(\vartheta)\,. \qquad (23)$$

We verify S.1 using assumption M 2. Since $\vartheta \mapsto \mathrm{KL}(\mathrm{P}_{i,\theta}\,\|\,\mathrm{P}_{i,\vartheta})$ is always nonnegative and is equal to zero if $\theta = \vartheta$, we verify S.2 and the first part of S.3. The second part of S.3 follows from the Fisher identity (12). The difference between the surrogate function and the objective function, denoted for all $\vartheta \in \Theta$ by $h_i(\vartheta) \triangleq B_{i,\theta}(\vartheta) - \ell_i(\vartheta)$, plays a key role in the convergence analysis. Here, for all $i \in [\![1, N]\!]$ and $\vartheta \in \Theta$, the error reads $h_i(\vartheta) = \mathrm{KL}(\mathrm{P}_{i,\theta}\,\|\,\mathrm{P}_{i,\vartheta})$. Under M 2c. and M 2d., we obtain that, for all $i \in [\![1, N]\!]$, the function $\vartheta \mapsto h_i(\vartheta)$ is $L_i$-smooth. Since for all $i \in [\![1, N]\!]$ and $\theta \in \Theta$ the surrogate function $\vartheta \mapsto B_{i,\theta}(\vartheta)$ is equal to $\vartheta \mapsto Q_i(\vartheta, \theta)$ up to a constant, the MBEM algorithm is equivalent to the following theoretical algorithm:

Algorithm 2 Theoretical MBEM algorithm

Initialisation: given an initial parameter estimate $\theta^0$, for all $i \in [\![1, N]\!]$ compute a surrogate function $\vartheta \mapsto A_i^0(\vartheta) = B_{i,\theta^0}(\vartheta)$ defined by (21).

Iteration k: given the current estimate $\theta^{k-1}$:

1. Pick a set $I_k$ uniformly on $\{A \subset [\![1, N]\!],\ \mathrm{card}(A) = p\}$.

2. For all $i \in I_k$, compute a surrogate function $\vartheta \mapsto B_{i,\theta^{k-1}}(\vartheta)$ defined by (21).

3. Set $\theta^k \in \arg\min_{\vartheta \in \Theta} \sum_{i=1}^{N} A_i^k(\vartheta)$, where the $A_i^k(\vartheta)$ are defined recursively as follows:

$$A_i^k(\vartheta) = \begin{cases} B_{i,\theta^{k-1}}(\vartheta) & \text{if } i \in I_k\,,\\ A_i^{k-1}(\vartheta) & \text{otherwise}\,. \end{cases} \qquad (24)$$

We remark that, for all $i \in [\![1, N]\!]$ and $\vartheta \in \Theta$:

$$A_i^k(\vartheta) = B_{i,\theta^{\tau_{i,k}}}(\vartheta)\,, \qquad (25)$$

using the notation introduced in (10). Denote by $\langle \cdot, \cdot \rangle$ the scalar product. We now state the convergence theorem of the MBEM algorithm:

Theorem 1. Assume M1–M3. Let $(\theta^k)_{k \geq 1}$ be a sequence generated from $\theta^0 \in \Theta$ by the iterative application described by Algorithm 1. Then:

(i) $(\ell(\theta^k))_{k \geq 1}$ converges almost surely.

(ii) $(\theta^k)_{k \geq 1}$ satisfies the Asymptotic Stationary Point Condition, i.e.

$$\liminf_{k \to \infty}\ \inf_{\theta \in \Theta} \frac{\langle \nabla \ell(\theta^k),\ \theta - \theta^k \rangle}{\|\theta - \theta^k\|_2} \geq 0\,. \qquad (26)$$

Proof. The proof is postponed to Section 5.1.

We observe that in the unconstrained case we have:

$$\inf_{\theta \in \mathbb{R}^d} \frac{\langle \nabla \ell(\theta^k),\ \theta - \theta^k \rangle}{\|\theta - \theta^k\|_2} = -\|\nabla \ell(\theta^k)\|\,, \qquad (27)$$

which yields $\lim_{k \to \infty} \|\nabla \ell(\theta^k)\| = 0$.


2.2 MBEM for a curved exponential family

In the particular case where, for all $i \in [\![1, N]\!]$ and $z_i \in \mathsf{Z}_i$, the function $\theta \mapsto f_i(z_i, \theta)$ belongs to the curved exponential family, we assume that:

E 1. For all $i \in [\![1, N]\!]$ and $\theta \in \Theta$:

$$\log f_i(z_i, \theta) = H_i(z_i) - \psi_i(\theta) + \langle \tilde{S}_i(z_i),\ \phi_i(\theta) \rangle\,, \qquad (28)$$

where $\psi_i : \Theta \to \mathbb{R}$ and $\phi_i : \Theta \to \mathbb{R}^{d_i}$ are twice continuously differentiable functions of $\theta$, $H_i : \mathsf{Z}_i \to \mathbb{R}$ is a twice continuously differentiable function of $z_i$ and $\tilde{S}_i : \mathsf{Z}_i \to \mathcal{S}_i$ is a statistic taking its values in a convex subset $\mathcal{S}_i$ of $\mathbb{R}^{d_i}$ and such that $\int_{\mathsf{Z}_i} |\tilde{S}_i(z_i)|\,p_i(z_i, \theta)\,\mu_i(\mathrm{d}z_i) < \infty$.

Define, for all $\theta \in \Theta$ and $i \in [\![1, N]\!]$, the function $\bar{s}_i : \Theta \to \mathcal{S}_i$ as:

$$\bar{s}_i(\theta) \triangleq \int_{\mathsf{Z}_i} \tilde{S}_i(z_i)\,p_i(z_i, \theta)\,\mu_i(\mathrm{d}z_i)\,. \qquad (29)$$

Define, for all $\theta \in \Theta$ and $s = (s_i,\ i \in [\![1, N]\!]) \in \mathcal{S}$, where $\mathcal{S} = \times_{i=1}^{N} \mathcal{S}_i$, the function $L(s; \theta)$ by:

$$L(s; \theta) \triangleq \sum_{i=1}^{N} \psi_i(\theta) - \sum_{i=1}^{N} \langle s_i,\ \phi_i(\theta) \rangle\,. \qquad (30)$$

E 2. There exists a function $\hat{\theta} : \mathcal{S} \to \Theta$ such that, for all $s \in \mathcal{S}$ and $\theta \in \Theta$:

$$L(s;\ \hat{\theta}(s)) \leq L(s;\ \theta)\,. \qquad (31)$$

In many models of practical interest, for all $s \in \mathcal{S}$, $\theta \mapsto L(s, \theta)$ has a unique minimum.

In the context of the curved exponential family, the MBEM algorithm can be formulated as follows:

Algorithm 3 mini-batch EM for a curved exponential family

Initialisation: given an initial parameter estimate $\theta^0$, for all $i \in [\![1, N]\!]$ compute $s_i^0 = \bar{s}_i(\theta^0)$.

Iteration k: given the current estimate $\theta^{k-1}$:

1. Pick a set $I_k$ uniformly on $\{A \subset [\![1, N]\!],\ \mathrm{card}(A) = p\}$.

2. For $i \in [\![1, N]\!]$, compute $s_i^k$ such that:

$$s_i^k = \begin{cases} \bar{s}_i(\theta^{k-1}) & \text{if } i \in I_k\,,\\ s_i^{k-1} & \text{otherwise}\,. \end{cases} \qquad (32)$$

3. Set $\theta^k = \hat{\theta}(s^k)$, where $s^k = (s_i^k,\ i \in [\![1, N]\!])$.


Example 1. We observe $N$ independent and identically distributed (i.i.d.) random variables $(y_i,\ i \in [\![1, N]\!])$. Each one of these observations is distributed according to a mixture model. Denote by $(c_j,\ j \in [\![1, J]\!])$ the distributions of the components of the mixture and $(\pi_j,\ j \in [\![1, J]\!])$ the associated weights. Consider the complete data likelihood for each individual, $f_i(z_i, \theta)$:

$$f_i(z_i, \theta) = \prod_{j=1}^{J} \big(\pi_j\, c_j(y_i, \delta)\big)^{\mathbb{1}_{z_i = j}}\,. \qquad (33)$$

We restrict this study to a mixture of Gaussian distributions. In such a case, $\theta = ((\pi_j, \mu_j, \sigma_j),\ j \in [\![1, J]\!])$ and the individual complete log-likelihood is expressed as:

$$\log f_i(z_i, \theta) = \sum_{j=1}^{J} \mathbb{1}_{z_i = j} \log(\pi_j) + \sum_{j=1}^{J} \mathbb{1}_{z_i = j} \left[ -\frac{(y_i - \mu_j)^2}{2\sigma_j^2} - \frac{1}{2} \log \sigma_j^2 \right]\,. \qquad (34)$$

The complete data sufficient statistics are given, for all $i \in [\![1, N]\!]$ and $j \in [\![1, J]\!]$, by $\tilde{S}_{1,j}^i(y_i, z_i) \triangleq \mathbb{1}_{z_i = j}$, $\tilde{S}_{2,j}^i(y_i, z_i) \triangleq \mathbb{1}_{z_i = j}\, y_i$ and $\tilde{S}_{3,j}^i(y_i, z_i) \triangleq \mathbb{1}_{z_i = j}\, y_i^2$. At each iteration $k$, Algorithm 3 consists in picking a set $I_k$ and, for $i \in I_k$, computing the following quantities:

$$(\bar{s}_i^k)_{1,j} = \int_{\mathsf{Z}_i} \mathbb{1}_{z_i = j}\, p_i(z_i, \theta^{k-1})\,\mu_i(\mathrm{d}z_i) = p_{ij}(\theta^{k-1})\,, \qquad (35)$$
$$(\bar{s}_i^k)_{2,j} = \int_{\mathsf{Z}_i} \mathbb{1}_{z_i = j}\, y_i\, p_i(z_i, \theta^{k-1})\,\mu_i(\mathrm{d}z_i) = p_{ij}(\theta^{k-1})\, y_i\,, \qquad (36)$$
$$(\bar{s}_i^k)_{3,j} = \int_{\mathsf{Z}_i} \mathbb{1}_{z_i = j}\, y_i^2\, p_i(z_i, \theta^{k-1})\,\mu_i(\mathrm{d}z_i) = p_{ij}(\theta^{k-1})\, y_i^2\,, \qquad (37)$$

where the quantity $p_{ij}(\theta^{k-1}) \triangleq \mathrm{P}_{i,\theta^{k-1}}(z_i = j)$ is obtained using the Bayes rule:

$$p_{ij}(\theta^{k-1}) = \frac{\mathrm{P}_i(z_i = j)\, p_i(y_i \mid z_i = j;\ \theta^{k-1})}{p_i(y_i;\ \theta^{k-1})} = \frac{\pi_j^{k-1}\, c_j(y_i;\ \mu_j^{k-1}, \sigma_j^{k-1})}{\sum_{l=1}^{J} \pi_l^{k-1}\, c_l(y_i;\ \mu_l^{k-1}, \sigma_l^{k-1})}\,. \qquad (38)$$

For $i \notin I_k$, $j \in [\![1, J]\!]$ and $d \in [\![1, 3]\!]$, $(\bar{s}_i^k)_{d,j} = (\bar{s}_i^{k-1})_{d,j}$. Finally, the maximisation step yields:

$$\pi_j^k = \frac{\sum_{i=1}^{N} (\bar{s}_i^k)_{1,j}}{N}\,, \qquad (39)$$
$$\mu_j^k = \frac{\sum_{i=1}^{N} (\bar{s}_i^k)_{2,j}}{\sum_{i=1}^{N} (\bar{s}_i^k)_{1,j}}\,, \qquad (40)$$
$$\sigma_j^k = \frac{\sum_{i=1}^{N} (\bar{s}_i^k)_{3,j}}{\sum_{i=1}^{N} (\bar{s}_i^k)_{1,j}} - (\mu_j^k)^2\,. \qquad (41)$$
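For concreteness, the updates (35)–(41) can be implemented in a few lines. The sketch below is a minimal numpy illustration for a univariate Gaussian mixture (the quantities denoted sigma2 are the component variances $\sigma_j^2$ of (34)); it is an assumed toy implementation, not the code used for the experiments of Section 4.

```python
import numpy as np

def gaussian_pdf(y, mu, sigma2):
    return np.exp(-0.5 * (y - mu) ** 2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)

def mini_batch_em_gmm(y, J, p, n_iter=200, seed=0):
    """Mini-batch EM for a univariate Gaussian mixture (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    N = y.shape[0]
    pi, mu, sigma2 = np.full(J, 1.0 / J), rng.choice(y, size=J), np.full(J, np.var(y))
    # Per-individual statistics: s1 = E[1_{z_i=j}], s2 = E[1_{z_i=j}] y_i, s3 = E[1_{z_i=j}] y_i^2
    lik = pi * gaussian_pdf(y[:, None], mu, sigma2)
    s1 = lik / lik.sum(axis=1, keepdims=True)            # responsibilities at theta^0, eq. (38)
    s2, s3 = s1 * y[:, None], s1 * y[:, None] ** 2
    for _ in range(n_iter):
        I_k = rng.choice(N, size=p, replace=False)        # mini-batch of indices
        lik = pi * gaussian_pdf(y[I_k, None], mu, sigma2)
        resp = lik / lik.sum(axis=1, keepdims=True)       # p_{ij}(theta^{k-1}), eq. (38)
        s1[I_k], s2[I_k], s3[I_k] = resp, resp * y[I_k, None], resp * y[I_k, None] ** 2
        pi = s1.sum(axis=0) / N                           # eq. (39)
        mu = s2.sum(axis=0) / s1.sum(axis=0)              # eq. (40)
        sigma2 = s3.sum(axis=0) / s1.sum(axis=0) - mu ** 2  # eq. (41), component variances
    return pi, mu, sigma2

# Toy usage: two well-separated components, mini-batches of 10% of the data.
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(5.0, 1.0, 500)])
print(mini_batch_em_gmm(y, J=2, p=100))
```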


3 Convergence of the mini-batch MCEM algorithm

We now consider the stochastic version of the MBEM algorithm, called the mini-batch MCEM algorithm. At iteration $k$, the MBMCEM approximates the quantity defined by (7) by Monte Carlo integration, i.e. for all $i \in I_k$, $\vartheta \in \Theta$ and $k \geq 1$:

$$\hat{Q}_i^k(\vartheta, \theta^{k-1}) \triangleq \frac{1}{M_k} \sum_{m=0}^{M_k - 1} \log f_i(z_i^{k,m}, \vartheta)\,, \qquad (42)$$

where $\{z_i^{k,m}\}_{m=0}^{M_k-1}$ is a Monte Carlo batch. In simple scenarios, the samples $\{z_i^{k,m}\}_{m=0}^{M_k-1}$ are conditionally independent and identically distributed with distribution $p_i(z_i, \theta^{k-1})$. Nevertheless, in most cases, sampling exactly from this distribution is not an option and the Monte Carlo batch is sampled by a Markov Chain Monte Carlo (MCMC) algorithm. MCMC algorithms are a class of methods allowing to sample from a complex distribution over a (possibly) large dimensional space.

Recall that a Markov kernel $P$ on a measurable space $(\mathsf{E}, \mathcal{E})$ is an application on $\mathsf{E} \times \mathcal{E}$, taking values in $[0, 1]$, such that for any $z \in \mathsf{E}$, $P(z, \cdot)$ is a probability measure on $\mathcal{E}$ and for any $A \in \mathcal{E}$, $P(\cdot, A)$ is measurable. We denote by $P^k$ the $k$-th iterate of $P$, defined recursively as $P^0(z, A) \triangleq \mathbb{1}_A(z)$ and, for $k \geq 1$, $P^k(z, A) \triangleq \int P^{k-1}(z, \mathrm{d}z')\, P(z', A)$. The probability $\pi$ is said to be stationary for $P$ if $\int_{\mathsf{E}} \pi(\mathrm{d}z)\, P(z, A) = \pi(A)$ for any $A \in \mathcal{E}$. We refer the reader to (Meyn and Tweedie, 2012) for the definitions of the basic properties of Markov chains.

For $i \in [\![1, N]\!]$ and $\theta \in \Theta$, let $P_{i,\theta}$ be a Markov kernel with stationary distribution $\pi_{i,\theta}(A_i) = \int_{A_i} p_i(z_i, \theta)\,\mu_i(\mathrm{d}z_i)$, where $A_i \in \mathcal{Z}_i$. For example, $P_{i,\theta}$ might be either a Gibbs or a Metropolis–Hastings sampler with target distribution $\pi_{i,\theta}$. For $\theta \in \Theta$, let $\lambda_{i,\theta}$ be a probability measure on $(\mathsf{Z}_i, \mathcal{Z}_i)$. We will use $\lambda_{i,\theta}$ as an initial distribution and allow this initial distribution to depend on the parameter $\theta$. For example, $\lambda_{i,\theta}$ might be the Dirac mass at some given point, but more clever choices can be made. We denote by $\mathbb{E}_{i,\theta}$ the expectation of the canonical Markov chain $\{z_i^m\}_{m=0}^{\infty}$ with initial distribution $\lambda_{i,\theta}$ and transition kernel $P_{i,\theta}$.

In this setting, the Monte Carlo mini-batch $\{z_i^{k,m}\}_{m=0}^{M_k-1}$ is a realisation of a Markov chain with initial distribution $\lambda_{i,\theta^{k-1}}$ and transition kernel $P_{i,\theta^{k-1}}$. The MBMCEM algorithm can be summarised as follows:


Algorithm 4 mini-batch MCEM algorithm

Initialisation: given an initial parameter estimate $\theta^0$, for all $i \in [\![1, N]\!]$ and $m \in [\![0, M_0 - 1]\!]$, sample a Markov chain $\{z_i^{0,m}\}_{m=0}^{M_0-1}$ with initial distribution $\lambda_{i,\theta^0}$ and transition kernel $P_{i,\theta^0}$, and compute a function $\vartheta \mapsto \hat{R}_i^0(\vartheta) = \hat{Q}_i^0(\vartheta, \theta^0)$ defined by (42).

Iteration k: given the current estimate $\theta^{k-1}$:

1. Pick a set $I_k$ uniformly on $\{A \subset [\![1, N]\!],\ \mathrm{card}(A) = p\}$.

2. For all $i \in I_k$ and $m \in [\![0, M_k - 1]\!]$, sample a Markov chain $\{z_i^{k,m}\}_{m=0}^{M_k-1}$ with initial distribution $\lambda_{i,\theta^{k-1}}$ and transition kernel $P_{i,\theta^{k-1}}$.

3. For all $i \in I_k$, compute the function $\vartheta \mapsto \hat{Q}_i^k(\vartheta, \theta^{k-1})$ defined by (42).

4. Set $\theta^k \in \arg\min_{\vartheta \in \Theta} \sum_{i=1}^{N} \hat{R}_i^k(\vartheta)$, where the $\hat{R}_i^k(\vartheta)$ are defined recursively as follows:

$$\hat{R}_i^k(\vartheta) = \begin{cases} \hat{Q}_i^k(\vartheta, \theta^{k-1}) & \text{if } i \in I_k\,,\\ \hat{R}_i^{k-1}(\vartheta) & \text{otherwise}\,. \end{cases} \qquad (43)$$
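The Monte Carlo E-step (42) can be sketched as follows; `log_f` and `mcmc_step` are hypothetical placeholders for the complete log-likelihood $\log f_i$ and for one transition of a Markov kernel $P_{i,\theta}$ with stationary density $p_i(\cdot, \theta)$, so the snippet is illustrative only.

```python
import numpy as np

# Sketch of the Monte Carlo approximation (42) of the surrogate Q_i (illustration).
# `log_f(i, z, theta)` evaluates log f_i(z, theta) and `mcmc_step(z, i, theta)`
# performs one transition of a Markov kernel with stationary density p_i(., theta);
# both are hypothetical, model-specific placeholders.
def q_hat(i, theta_prev, z_init, M_k, log_f, mcmc_step):
    chain, z = [], z_init
    for _ in range(M_k):
        z = mcmc_step(z, i, theta_prev)   # z_i^{k,m} drawn from the kernel P_{i,theta^{k-1}}
        chain.append(z)
    # Return the function theta -> (1/M_k) * sum_m log f_i(z_i^{k,m}, theta).
    return lambda theta: np.mean([log_f(i, zm, theta) for zm in chain])
```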

Whether we use Markov Chain Monte Carlo or direct simulation, we need to control the supremum norm of the fluctuations of the Monte Carlo approximation. Let $i \in [\![1, N]\!]$, let $\{q_i(z_i, \vartheta),\ z_i \in \mathsf{Z}_i,\ \vartheta \in \Theta\}$ be a family of measurable functions and $\lambda_i$ a probability measure on $(\mathsf{Z}_i, \mathcal{Z}_i)$. We define:

$$C_i(q_i) \triangleq \sup_{\theta \in \Theta}\ \sup_{M > 0}\ M^{-1/2}\ \mathbb{E}_{i,\theta}\left[ \sup_{\vartheta \in \Theta} \left\| \sum_{m=0}^{M-1} \left( q_i(z_i^m, \vartheta) - \int_{\mathsf{Z}_i} q_i(z_i, \vartheta)\, p_i(z_i, \theta)\,\lambda_i(\mathrm{d}z_i) \right) \right\| \right]\,. \qquad (44)$$

M 4. For all $i \in [\![1, N]\!]$:

$$C_i(\log f_i) < \infty \quad\text{and}\quad C_i(\nabla \log f_i) < \infty\,. \qquad (45)$$

The assumption M 4 is based on a maximal inequality for beta-mixing sequences obtained in (Doukhan et al., 1995). This condition can be translated in terms of drift and minorisation conditions (see (Meyn and Tweedie, 2012)). Finally, we consider the following assumption on the number of simulations:

M 5. $\{M_k\}_{k \geq 0}$ is a non-decreasing sequence of integers which satisfies $\sum_{k=0}^{\infty} M_k^{-1/2} < \infty$.

We now state the convergence theorem for the MBMCEM algorithm:

Theorem 2. Assume M1–M5. Let $(\theta^k)_{k \geq 1}$ be a sequence generated from $\theta^0 \in \Theta$ by the iterative application described by Algorithm 4. Then:

(i) $(\ell(\theta^k))_{k \geq 1}$ converges almost surely.

(ii) $(\theta^k)_{k \geq 1}$ satisfies the Asymptotic Stationary Point Condition.

Proof. The proof is postponed to Section 5.2.

3.1 MBMCEM for a curved exponential family

Using the notation introduced in Section 2.2, the mini-batch MCEM algorithm can be described as follows:

Algorithm 5 mini-batch MCEM for a curved exponential family

Initialisation: given an initial parameter estimate θ

0

, for all i ∈ J 1, N K and m ∈ J 0, M

0

− 1 K , sample a Markov Chain {z

i0,m

}

Mm=00−1

with initial distribution λ

i,θ0

and transition kernel P

i,θ0

and compute s

0i

=

M1

0

P

M0

m=1

S ˜

i

(z

0,mi

).

Iteration k: given the current estimate θ

k−1

:

1. Pick a set I

k

uniformly on {A ⊂ J 1, N K , card(A) = p}.

2. For all i ∈ I

k

and m ∈ J 0, M

k

− 1 K , sample a Markov Chain {z

k,mi

}

Mm=0k−1

with initial distribution λ

i,θk−1

and transition kernel P

i,θk−1

.

3. Compute s

ki

such as:

s

ki

= (

1

Mk

P

Mk−1

m=1

S ˜

i

(z

ik,m

) if i ∈ I

k

s

k−1i

otherwise (46)

4. Set θ

k

= ˆ θ(s

k

) where s

k

= (s

ki

, i ∈ J 1, N K )

Example 2. For illustration purposes, we can apply the MBMCEM to the Gaussian mixture model example of Section 2.2, even though the conditional expectation (38) is tractable.

At iteration $k$ of the MBMCEM algorithm, pick a set $I_k$ and, for $i \in I_k$, $m \in [\![0, M_k - 1]\!]$ and $j \in [\![1, J]\!]$, draw a Monte Carlo batch $\{z_i^{k,m}\}_{m=0}^{M_k-1}$ from the conditional posterior distribution $p_i(z_i, \theta^{k-1})$ and update the sufficient statistics as $\tilde{S}_{1,j}^i(y_i, z_i^{k,m}) \triangleq \mathbb{1}_{z_i^{k,m} = j}$, $\tilde{S}_{2,j}^i(y_i, z_i^{k,m}) \triangleq \mathbb{1}_{z_i^{k,m} = j}\, y_i$ and $\tilde{S}_{3,j}^i(y_i, z_i^{k,m}) \triangleq \mathbb{1}_{z_i^{k,m} = j}\, y_i^2$. For $i \notin I_k$, $j \in [\![1, J]\!]$, $m \in [\![0, M_k - 1]\!]$ and $d \in [\![1, 3]\!]$, $z_i^{k,m} = z_i^{k-1,m}$ and $\tilde{S}_{d,j}^i(y_i, z_i^{k,m}) = \tilde{S}_{d,j}^i(y_i, z_i^{k-1,m})$. Finally, the maximisation step yields:

$$\pi_j^k = \frac{\sum_{i=1}^{N} \tilde{S}_{1,j}^i(y_i, z_i^{k,m})}{N}\,, \qquad (47)$$
$$\mu_j^k = \frac{\sum_{i=1}^{N} \tilde{S}_{2,j}^i(y_i, z_i^{k,m})}{\sum_{i=1}^{N} \tilde{S}_{1,j}^i(y_i, z_i^{k,m})}\,, \qquad (48)$$
$$\sigma_j^k = \frac{\sum_{i=1}^{N} \tilde{S}_{3,j}^i(y_i, z_i^{k,m})}{\sum_{i=1}^{N} \tilde{S}_{1,j}^i(y_i, z_i^{k,m})} - (\mu_j^k)^2\,. \qquad (49)$$
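Since the posterior (38) is a simple categorical distribution in this example, the Monte Carlo batch can be drawn directly. The sketch below computes the Monte Carlo averages of the three sufficient statistics for a single individual, as in step 3 of Algorithm 5; it is illustrative only and the variable names are chosen for this sketch.

```python
import numpy as np

# Illustrative sketch for Example 2: Monte Carlo estimates of the sufficient
# statistics of one individual, with the labels drawn directly from the
# (here tractable) posterior p_{ij}(theta^{k-1}) of eq. (38).
def mc_sufficient_stats(y_i, pi, mu, sigma2, M_k, rng):
    lik = pi * np.exp(-0.5 * (y_i - mu) ** 2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)
    post = lik / lik.sum()                         # p_{ij}(theta^{k-1})
    z = rng.choice(len(pi), size=M_k, p=post)      # Monte Carlo batch z_i^{k,m}
    s1 = np.bincount(z, minlength=len(pi)) / M_k   # (1/M_k) sum_m 1_{z_i^{k,m} = j}
    return s1, s1 * y_i, s1 * y_i ** 2             # Monte Carlo averages of S_1, S_2, S_3

rng = np.random.default_rng(0)
print(mc_sufficient_stats(1.3, np.array([0.5, 0.5]), np.array([0.0, 5.0]),
                          np.array([1.0, 1.0]), M_k=200, rng=rng))
```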

4 Numerical examples

4.1 A Linear mixed effects model

4.1.1 The model

We consider, in this section, a linear mixed effects model (Lavielle, 2014). We denote by $y = (y_i \in \mathbb{R}^{n_i},\ i \in [\![1, N]\!])$ the observations, where for all $i \in [\![1, N]\!]$:

$$y_i = A_i \theta + B_i z_i + \varepsilon_i\,. \qquad (50)$$

$A_i \in \mathbb{R}^{n_i \times p}$ and $B_i \in \mathbb{R}^{n_i \times m}$ are design matrices, $\theta \in \mathbb{R}^p$ is a vector of parameters and $z_i \in \mathbb{R}^m$ are the latent data (i.e. the random effects in the context of mixed effects models), which are assumed to be distributed according to a multivariate Gaussian distribution $\mathcal{N}(0, \Omega)$ where $\Omega \in \mathbb{R}^{m \times m}$. We also assume that the residual errors $\varepsilon_i \in \mathbb{R}^{n_i}$ are distributed according to $\mathcal{N}(0, \Sigma)$, where $\Sigma \in \mathbb{R}^{n_i \times n_i}$, and that the sequences of variables $(z_i,\ i \in [\![1, N]\!])$ and $(\varepsilon_i,\ i \in [\![1, N]\!])$ are i.i.d. and mutually independent. The covariance matrices $\Omega$ and $\Sigma$ are assumed to be known. For all $i \in [\![1, N]\!]$, the conditional distributions of the observations given the latent variables, $y_i \mid z_i$, and of the latent variables given the observations, $z_i \mid y_i$, are respectively given by:

$$y_i \mid z_i \sim \mathcal{N}(A_i \theta + B_i z_i,\ \Sigma)\,, \qquad (51)$$
$$z_i \mid y_i \sim \mathcal{N}(\mu_i,\ \Gamma_i)\,, \qquad (52)$$

where:

$$\Gamma_i = (B_i^{\top} \Sigma^{-1} B_i + \Omega^{-1})^{-1}\,, \qquad (53)$$
$$\mu_i = \Gamma_i B_i^{\top} \Sigma^{-1} (y_i - A_i \theta)\,. \qquad (54)$$

This model belongs to the curved exponential family introduced in Section 2.2, where for all $i \in [\![1, N]\!]$:

$$\tilde{S}_i(z_i) \triangleq z_i \quad\text{and}\quad \bar{s}_i(\theta) = \Gamma_i B_i^{\top} \Sigma^{-1} (y_i - A_i \theta)\,, \qquad (55)$$
$$\psi_i(\theta) \triangleq (y_i - A_i \theta)^{\top} \Sigma^{-1} (y_i - A_i \theta)\,, \qquad (56)$$
$$\phi_i(\theta) \triangleq B_i^{\top} \Sigma^{-1} (y_i - A_i \theta)\,. \qquad (57)$$


Figure 1: Convergence of the vector of parameter estimates $\theta^k = (\theta_1^k, \theta_2^k)$ as a function of passes over the data (legend: batch size in %: 0.01, 50, 100).

Maximising $L(s, \theta)$, defined in (30), with respect to $\theta$ yields the following maximisation function for all $s = (s_i \in \mathbb{R}^m,\ i \in [\![1, N]\!])$:

$$\hat{\theta}(s) \triangleq \left( \sum_{i=1}^{N} A_i^{\top} \Sigma^{-1} A_i \right)^{-1} \sum_{i=1}^{N} A_i^{\top} \Sigma^{-1} (y_i - B_i s_i)\,.$$

Thus, the $k$-th update of the MBEM algorithm consists in sampling a subset of indices $I_k$ and computing $\theta^k = \hat{\theta}(s^k)$, where:

$$s_i^k = \begin{cases} \bar{s}_i(\theta^{k-1}) & \text{if } i \in I_k\,,\\ s_i^{k-1} & \text{otherwise}\,. \end{cases}$$
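Both the posterior moments (53)–(54) and the maximisation function $\hat{\theta}(s)$ are available in closed form, so one MBEM update reduces to a few linear-algebra operations. The sketch below is an assumed illustration (a common residual covariance $\Sigma$ is used for all individuals); it is not the code used for the simulations.

```python
import numpy as np

# Illustrative sketch of the closed-form MBEM quantities for the linear mixed
# effects model of this section (assumed convention: a common residual
# covariance Sigma for all individuals).  Not the paper's code.
def s_bar_i(y_i, A_i, B_i, theta, Omega_inv, Sigma_inv):
    Gamma_i = np.linalg.inv(B_i.T @ Sigma_inv @ B_i + Omega_inv)   # eq. (53)
    return Gamma_i @ B_i.T @ Sigma_inv @ (y_i - A_i @ theta)       # eq. (54), i.e. s_bar_i(theta)

def theta_hat(s, y, A, B, Sigma_inv):
    lhs = sum(A_i.T @ Sigma_inv @ A_i for A_i in A)                # sum_i A_i^T Sigma^{-1} A_i
    rhs = sum(A_i.T @ Sigma_inv @ (y_i - B_i @ s_i)
              for A_i, B_i, y_i, s_i in zip(A, B, y, s))
    return np.linalg.solve(lhs, rhs)                               # closed-form M-step
```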

4.1.2 Simulation and runs

We generate a synthetic dataset with $d = 2$, $\theta = (\theta_1 = 4,\ \theta_2 = 9)$, $N = 10000$ and, for all $i \in [\![1, N]\!]$, $n_i = 10$ observations per individual, with random design matrices $(A_i,\ i \in [\![1, N]\!])$ and $(B_i,\ i \in [\![1, N]\!])$. Two runs of the MBEM are executed starting from different initial values ($(\theta_1^0 = 1,\ \theta_2^0 = 5)$ and $(\theta_1^0 = 3,\ \theta_2^0 = 7)$) to study the convergence behaviour of these algorithms depending on the initialisation. Figure 1 shows the convergence of the vector of parameter estimates $(\theta_1^k, \theta_2^k)_{k=0}^{K}$ over passes of the EM algorithm, the MBEM algorithm where half of the data is considered at each iteration, and the incremental EM algorithm (i.e. a single data point is considered at each iteration). The speed of convergence is a monotone function of the batch size in this case: the smaller the batch, the faster the convergence.


4.2 Logistic regression for a binary variable

4.2.1 The model

Let $y = (y_i,\ i \in [\![1, N]\!])$ be the vector of binary responses, where for each individual $i$, $y_i = (y_{ij},\ j \in [\![1, n_i]\!])$ is a sequence of conditionally independent random variables taking values in $\{0, 1\}$ corresponding to the $j$-th response of the $i$-th subject. We consider a logistic regression problem in which the parameters depend upon each individual $i$. Denote by $z_i \in \mathbb{R}^m$ the vector of regression coefficients (the latent data) for individual $i$ and by $(d_{ij} \in \mathbb{R}^m,\ j \in [\![1, n_i]\!])$ the associated explanatory variables. The conditional distribution of the observations $y_i$ given the latent variables $z_i$ is given by:

$$\mathrm{logit}\big(\mathbb{P}(y_{ij} = 1 \mid z_i)\big) = d_{ij}^{\top} z_i\,.$$

For all $i \in [\![1, N]\!]$, we assume that the $z_i$ are independent and marginally distributed according to $\mathcal{N}(\beta, \Omega)$, where $\beta \in \mathbb{R}^m$ and $\Omega \in \mathbb{R}^{m \times m}$. The complete log-likelihood is expressed as:

$$\log f(z, \theta) \propto \sum_{i=1}^{N} \sum_{j=1}^{n_i} \left\{ y_{ij}\, d_{ij}^{\top} z_i - \log\big(1 + e^{d_{ij}^{\top} z_i}\big) \right\} \qquad (58)$$
$$\quad - \sum_{i=1}^{N} \left\{ \frac{1}{2} \log(|\Omega|) + \frac{1}{2} \mathrm{Tr}\Big( \Omega^{-1} (z_i - \beta)(z_i - \beta)^{\top} \Big) \right\}\,. \qquad (59)$$

Since the expectation of the complete log-likelihood with respect to the conditional distribution of the latent variables given the observations is intractable, we use the MCEM and the MBMCEM algorithms, which require simulating random draws from this conditional distribution. We use the saemix R package (Comets et al., 2017) to run a Metropolis–Hastings within Gibbs sampler (Brooks et al., 2011), where for all $i \in [\![1, N]\!]$ we denote the vector of regression coefficients by $z_i = (z_{i,d} \in \mathbb{R},\ d \in [\![1, m]\!])$. Thus, for each dimension $d \in [\![1, m]\!]$ of the parameter, the Markov chain is constructed as follows:


Algorithm 6 Random Walk Metropolis

Initialisation: given the current parameter estimates $(\beta_d^{k-1}, \omega_d^{k-1})$, set $T_d^{(0)} = \omega_d^{k-1}$ and sample the initial state $z_{i,d}^{(0)} \sim \mathcal{N}(\beta_d^{k-1},\ T_d^{(0)})$.

Iteration t: given the current chain state $z_{i,d}^{(t-1)}$:

1. Sample a candidate state:

$$z_{i,d}^{(c)} \sim \mathcal{N}\big(z_{i,d}^{(t-1)},\ T_d^{(t-1)}\big)\,. \qquad (60)$$

2. Accept with probability $\min\big(1,\ \alpha^{(t)}(z_{i,d}^{(c)},\ z_{i,d}^{(t-1)})\big)$.

3. Update the variance of the proposal as follows:

$$T_d^{(t)} = T_d^{(t-1)}\big(1 + \delta(\alpha^{(t)} - \alpha^{\star})\big)\,, \qquad (61)$$

where $\delta = 0.4$, $\alpha^{(t)}$ is the MH acceptance ratio at iteration $t$ and $\alpha^{\star} = 0.4$ is the optimal acceptance ratio (Robert and Casella, 2005). This model belongs to the curved exponential family introduced in Section 2.2 where, for all $i \in [\![1, N]\!]$, $\tilde{S}_i(z_i) \triangleq (z_i,\ z_i z_i^{\top})$.
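A minimal sketch of Algorithm 6 for one coordinate $z_{i,d}$ is given below, using the multiplicative proposal-variance adaptation as reconstructed in (61); `log_target` is a hypothetical placeholder for the (unnormalised) log posterior density of $z_{i,d}$ given $y_i$.

```python
import numpy as np

# Illustrative sketch of Algorithm 6 for one coordinate z_{i,d}.
# `log_target(z)` is a hypothetical placeholder for the unnormalised log
# posterior density of z_{i,d}; delta and alpha_star follow the text above.
def adaptive_rwm(log_target, z0, T0, n_iter=1000, delta=0.4, alpha_star=0.4, seed=0):
    rng = np.random.default_rng(seed)
    z, T, chain = z0, T0, []
    for _ in range(n_iter):
        z_cand = rng.normal(z, np.sqrt(T))                  # candidate state, eq. (60)
        alpha = np.exp(min(0.0, log_target(z_cand) - log_target(z)))  # MH acceptance ratio
        if rng.uniform() < alpha:
            z = z_cand                                      # accept the candidate
        T = T * (1.0 + delta * (alpha - alpha_star))        # adapt the proposal variance, cf. (61)
        chain.append(z)
    return np.array(chain), T
```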

At iteration $k$, the MBMCEM algorithm consists in:

1. Picking a set $I_k$ uniformly on $\{A \subset [\![1, N]\!],\ \mathrm{card}(A) = p\}$.

2. For all $i \in I_k$ and $m \in [\![0, M_k - 1]\!]$, sampling a Markov chain $\{z_i^{k,m}\}_{m=0}^{M_k-1}$ using Algorithm 6.

3. Computing $s_i^k = (s_i^{1,k},\ s_i^{2,k})$ such that:

$$(s_i^{1,k},\ s_i^{2,k}) = \begin{cases} \left( \dfrac{1}{M_k} \displaystyle\sum_{m=0}^{M_k-1} z_i^{k,m},\ \dfrac{1}{M_k} \displaystyle\sum_{m=0}^{M_k-1} z_i^{k,m} (z_i^{k,m})^{\top} \right) & \text{if } i \in I_k\,,\\ (s_i^{1,k-1},\ s_i^{2,k-1}) & \text{otherwise}\,. \end{cases} \qquad (62)$$

4. Updating the parameters as follows:

$$\beta^k = \frac{1}{N} \sum_{i=1}^{N} s_i^{1,k}\,, \qquad (63)$$
$$\Omega^k = \frac{1}{N} \sum_{i=1}^{N} s_i^{2,k} - \beta^k (\beta^k)^{\top}\,. \qquad (64)$$
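Given the per-individual Monte Carlo statistics of (62), the parameter update (63)–(64) is immediate. The sketch below assumes `s1` is an N × m array of posterior-mean estimates and `s2` an N × m × m array of second-moment (outer-product) estimates; this array layout is an assumed convention for the sketch.

```python
import numpy as np

# Sketch of the update (63)-(64): s1 is an N x m array of Monte Carlo posterior
# means and s2 an N x m x m array of Monte Carlo second moments (outer products).
def update_population_parameters(s1, s2):
    beta = s1.mean(axis=0)                          # eq. (63)
    Omega = s2.mean(axis=0) - np.outer(beta, beta)  # eq. (64)
    return beta, Omega
```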

4.2.2 Simulation and runs

In the sequel, $m = 3$, $N = 1200$ and, for all $i \in [\![1, N]\!]$, $n_i = 15$. For all $i \in [\![1, N]\!]$ and $j \in [\![1, n_i]\!]$, the design vector $d_{ij}$ is defined as $d_{ij} = (1,\ -20 + 5(j-1),\ 10\lceil 3i/N \rceil)$.


Figure 2: Convergence of the vector of fixed parameters $\beta = (\beta_1, \beta_2, \beta_3)$ for different batch sizes, as a function of passes over the data (legend: batch size in %: 25, 35, 50, 85, 100).

We generate a synthetic dataset using the following generating values for the fixed and random effects: $\beta_1 = -4$, $\beta_2 = -0.5$, $\beta_3 = 1$, $\omega_1 = 0.3$, $\omega_2 = 0.2$, $\omega_3 = 0.2$. We run the MBMCEM algorithm with the size of the Monte Carlo batch increasing polynomially, $M_k \triangleq M_0 + k^2$ with $M_0 = 50$. Figure 2 shows the convergence of the fixed effects $(\beta_1, \beta_2, \beta_3)$ estimates obtained with both the MCEM and the MBMCEM algorithms for different batch sizes. The effect of the batch size on the convergence rate differs from the previous example. Whereas smaller batches implied faster convergence for the mini-batch EM algorithm, here an optimal batch size of 50% accelerates the algorithm. Figure 2 highlights a non-monotonic evolution of the convergence rate with respect to the size of the batch.

5 Proofs

5.1 Proof of Theorem 1

5.1.1 Proof of (i)

First, let us define, for $\theta \in \Theta$,

$$\bar{A}^k(\theta) \triangleq \sum_{i=1}^{N} A_i^k(\theta)\,, \qquad (65)$$

where, for all $i \in [\![1, N]\!]$, $A_i^k$ is defined in (24). For any $k \geq 1$ and for all $\theta \in \Theta$, the following decomposition plays a key role:

$$\bar{A}^k(\theta) = \bar{A}^{k-1}(\theta) + \sum_{i \in I_k} B_{i,\theta^{k-1}}(\theta) - \sum_{i \in I_k} A_i^{k-1}(\theta)\,. \qquad (66)$$

Since by construction $\bar{A}^k(\theta^k) \leq \bar{A}^k(\theta^{k-1})$, we get:

$$\bar{A}^k(\theta^k) \leq \bar{A}^{k-1}(\theta^{k-1}) + \sum_{i \in I_k} B_{i,\theta^{k-1}}(\theta^{k-1}) - \sum_{i \in I_k} A_i^{k-1}(\theta^{k-1})\,. \qquad (67)$$

Since, for $i \in I_k$, $B_{i,\theta^{k-1}}$ is a surrogate of $\ell_i$ at $\theta^{k-1}$, we get that $B_{i,\theta^{k-1}}(\theta^{k-1}) = \ell_i(\theta^{k-1})$. On the other hand, for $i \in [\![1, N]\!]$, $A_i^{k-1} \equiv B_{i,\theta^{\tau_{i,k-1}}}$ and $B_{i,\theta^{\tau_{i,k-1}}}$ is a surrogate of $\ell_i$ at $\theta^{\tau_{i,k-1}}$; thus we obtain that $\ell_i(\theta^{k-1}) - A_i^{k-1}(\theta^{k-1}) \leq 0$. Plugging these two relations into (67) we obtain:

$$\bar{A}^k(\theta^k) \leq \bar{A}^{k-1}(\theta^{k-1}) + \sum_{i \in I_k} \ell_i(\theta^{k-1}) - \sum_{i \in I_k} A_i^{k-1}(\theta^{k-1}) \qquad (68)$$
$$\leq \bar{A}^{k-1}(\theta^{k-1})\,. \qquad (69)$$

As a result, the sequence $(\bar{A}^k(\theta^k))_{k \geq 0}$ is monotonically decreasing. Since, under assumption M 3, this quantity is bounded from below with probability one, we obtain its almost sure convergence. Taking the expectations with respect to the sampling distributions in the previous inequalities implies the convergence of the (deterministic) sequence $(\mathbb{E}[\bar{A}^k(\theta^k)])_{k \geq 0}$. Let us denote, for all $\theta \in \Theta$ and a subset $J \subset [\![1, N]\!]$:

$$\ell_J(\theta) \triangleq \sum_{i \in J} \ell_i(\theta)\,, \qquad (70)$$
$$A_J^{k-1}(\theta) \triangleq \sum_{i \in J} A_i^{k-1}(\theta)\,. \qquad (71)$$

Inequality (67) gives:

$$0 \leq \sum_{k=1}^{n} \Big( A_{I_k}^{k-1}(\theta^{k-1}) - \ell_{I_k}(\theta^{k-1}) \Big) \leq \sum_{k=1}^{n} \Big( \bar{A}^{k-1}(\theta^{k-1}) - \bar{A}^k(\theta^k) \Big) = \bar{A}^0(\theta^0) - \bar{A}^n(\theta^n)\,. \qquad (72)$$

Consequently, the sum of positive terms $\big(\sum_{k=1}^{n} A_{I_k}^{k-1}(\theta^{k-1}) - \ell_{I_k}(\theta^{k-1})\big)_{n \geq 1}$ converges almost surely and $\big(A_{I_k}^{k-1}(\theta^{k-1}) - \ell_{I_k}(\theta^{k-1})\big)_{k \geq 1}$ converges almost surely to zero. The Beppo Levi theorem and the tower property of the conditional expectation imply:

$$M \triangleq \mathbb{E}\left[ \sum_{k=0}^{\infty} A_{I_k}^{k-1}(\theta^{k-1}) - \ell_{I_k}(\theta^{k-1}) \right] = \sum_{k=0}^{\infty} \mathbb{E}\Big[ A_{I_k}^{k-1}(\theta^{k-1}) - \ell_{I_k}(\theta^{k-1}) \Big] \qquad (73)$$
$$= \sum_{k=0}^{\infty} \mathbb{E}\Big[ \mathbb{E}\big[ A_{I_k}^{k-1}(\theta^{k-1}) - \ell_{I_k}(\theta^{k-1}) \,\big|\, \mathcal{F}_{k-1} \big] \Big]\,, \qquad (74)$$

with

$$\mathbb{E}\big[ \ell_{I_k}(\theta^{k-1}) \,\big|\, \mathcal{F}_{k-1} \big] = \frac{p}{N}\, \ell(\theta^{k-1})\,, \qquad \mathbb{E}\big[ A_{I_k}^{k-1}(\theta^{k-1}) \,\big|\, \mathcal{F}_{k-1} \big] = \frac{p}{N} \sum_{i=1}^{N} A_i^{k-1}(\theta^{k-1}) = \frac{p}{N}\, \bar{A}^{k-1}(\theta^{k-1})\,,$$

where $\mathcal{F}_{k-1} = \sigma(I_j,\ j \leq k-1)$ is the filtration generated by the sampling of the indices. We thus obtain:

$$M = \frac{p}{N} \sum_{k=0}^{\infty} \mathbb{E}\Big[ \bar{A}^{k-1}(\theta^{k-1}) - \ell(\theta^{k-1}) \Big] = \frac{p}{N}\, \mathbb{E}\left[ \sum_{k=0}^{\infty} \bar{A}^{k-1}(\theta^{k-1}) - \ell(\theta^{k-1}) \right] < \infty\,. \qquad (75)$$

This last equation shows that:

$$\lim_{k \to \infty}\ \bar{A}^k(\theta^k) - \ell(\theta^k) = 0 \quad \text{a.s.}\,, \qquad (76)$$

which implies the almost sure convergence of $(\ell(\theta^k))_{k \geq 0}$.

5.1.2 Proof of (ii)

Let us define, for all $k \geq 0$, $\bar{h}^k$ as:

$$\bar{h}^k : \vartheta \mapsto \sum_{i=1}^{N} \big( A_i^k(\vartheta) - \ell_i(\vartheta) \big)\,. \qquad (77)$$

$\bar{h}^k$ is $L$-smooth with $L = \sum_{i=1}^{N} L_i$, since each of its components is $L_i$-smooth by definition of the surrogate functions. Using the particular parameter $\vartheta^k = \theta^k - \frac{1}{L} \nabla \bar{h}^k(\theta^k)$, we have the following classical inequality for smooth functions (cf. Lemma 1.2.3 in (Nesterov, 2007)):

$$0 \leq \bar{h}^k(\vartheta^k) \leq \bar{h}^k(\theta^k) - \frac{1}{2L}\, \|\nabla \bar{h}^k(\theta^k)\|_2^2 \qquad (78)$$
$$\implies \|\nabla \bar{h}^k(\theta^k)\|_2^2 \leq 2L\, \bar{h}^k(\theta^k)\,. \qquad (79)$$

Using (76), we conclude that $\lim_{k \to \infty} \|\nabla \bar{h}^k(\theta^k)\|_2 = 0$ a.s. Then, the decomposition of $\langle \nabla \ell(\theta^k),\ \theta - \theta^k \rangle$ for any $\theta \in \Theta$ yields:

$$\langle \nabla \ell(\theta^k),\ \theta - \theta^k \rangle = \langle \nabla \bar{A}^k(\theta^k),\ \theta - \theta^k \rangle - \langle \nabla \bar{h}^k(\theta^k),\ \theta - \theta^k \rangle\,. \qquad (80)$$

Note that $\theta^k$ is the result of the minimisation of the sum of surrogates $\bar{A}^k(\theta)$ on the constrained set $\Theta$, therefore $\langle \nabla \bar{A}^k(\theta^k),\ \theta - \theta^k \rangle \geq 0$. Thus we obtain, using the Cauchy–Schwarz inequality:

$$\langle \nabla \ell(\theta^k),\ \theta - \theta^k \rangle \geq -\langle \nabla \bar{h}^k(\theta^k),\ \theta - \theta^k \rangle \qquad (81)$$
$$\geq -\|\nabla \bar{h}^k(\theta^k)\|_2\, \|\theta - \theta^k\|_2\,. \qquad (82)$$

By minimising over $\Theta$ and taking the infimum limit in $k$, we get:

$$\liminf_{k \to \infty}\ \inf_{\theta \in \Theta} \frac{\langle \nabla \ell(\theta^k),\ \theta - \theta^k \rangle}{\|\theta - \theta^k\|_2} \geq -\lim_{k \to \infty} \|\nabla \bar{h}^k(\theta^k)\|_2 = 0\,, \qquad (83)$$

which is the Asymptotic Stationary Point Condition (ASPC).


5.2 Proof of Theorem 2

We preface the proof with the following lemma, which is of independent interest:

Lemma 1. Let $(V_k)_{k \geq 0}$ be a non-negative sequence of random variables such that $\mathbb{E}[V_0] < \infty$. Let $(X_k)_{k \geq 0}$ be a non-negative sequence of random variables and $(E_k)_{k \geq 0}$ be a sequence of random variables such that $\sum_{k=0}^{\infty} \mathbb{E}[|E_k|] < \infty$. If, for any $k \geq 1$:

$$V_k \leq V_{k-1} - X_k + E_k\,, \qquad (84)$$

then:

(i) for all $k \geq 0$, $\mathbb{E}[V_k] < \infty$ and the sequence $(V_k)_{k \geq 0}$ converges a.s. to a finite limit $V_\infty$;

(ii) the sequence $(\mathbb{E}[V_k])_{k \geq 0}$ converges and $\lim_{k \to \infty} \mathbb{E}[V_k] = \mathbb{E}[V_\infty]$;

(iii) the series $\sum_{k=0}^{\infty} X_k$ converges almost surely and $\sum_{k=0}^{\infty} \mathbb{E}[X_k] < \infty$.

Remark 2. Note that the result still holds if $(V_k)_{k \geq 0}$ is a sequence of random variables which is bounded from below by a deterministic quantity $M \in \mathbb{R}$.

Proof. We first show that, for all $k \geq 0$, $\mathbb{E}[V_k] < \infty$. Note indeed that:

$$0 \leq V_k \leq V_0 - \sum_{j=1}^{k} X_j + \sum_{j=1}^{k} E_j \leq V_0 + \sum_{j=1}^{k} E_j\,, \qquad (85)$$

showing that $\mathbb{E}[V_k] \leq \mathbb{E}[V_0] + \mathbb{E}\big[\sum_{j=1}^{k} E_j\big] < \infty$. Since $0 \leq X_k \leq V_{k-1} - V_k + E_k$, we also obtain, for all $k \geq 0$, $\mathbb{E}[X_k] < \infty$. Moreover, since $\mathbb{E}\big[\sum_{j=1}^{\infty} |E_j|\big] < \infty$, the series $\sum_{j=1}^{\infty} E_j$ converges a.s. We may therefore define:

$$W_k = V_k + \sum_{j=k+1}^{\infty} E_j\,. \qquad (86)$$

Note that $\mathbb{E}[|W_k|] \leq \mathbb{E}[V_k] + \mathbb{E}\big[\sum_{j=k+1}^{\infty} |E_j|\big] < \infty$. For all $k \geq 1$, we get:

$$W_k \leq V_{k-1} - X_k + \sum_{j=k}^{\infty} E_j \leq W_{k-1} - X_k \leq W_{k-1}\,, \qquad (87)$$
$$\mathbb{E}[W_k] \leq \mathbb{E}[W_{k-1}] - \mathbb{E}[X_k]\,. \qquad (88)$$

Hence the sequences $(W_k)_{k \geq 0}$ and $(\mathbb{E}[W_k])_{k \geq 0}$ are non-increasing. Since for all $k \geq 0$, $W_k \geq -\sum_{j=1}^{\infty} |E_j| > -\infty$ and $\mathbb{E}[W_k] \geq -\sum_{j=1}^{\infty} \mathbb{E}[|E_j|] > -\infty$, the (random) sequence $(W_k)_{k \geq 0}$ converges almost surely to a finite limit.
