
1.3 Maximum Likelihood Estimation in Latent Data Models


1.3.1 Latent Data Models

In this section, we formally introduce a general class of models, called Latent Data Models, that is used during the modeling phase of the learning procedure. Let Z be a subset of ℝ^m, let µ be a σ-finite measure on the Borel σ-algebra 𝒵 = B(Z), and let {f(z, y; θ), θ ∈ Θ} be a family of positive µ-integrable Borel functions on Z, where y ∈ Y denotes the vector of observations.

Define, for all θ ∈ Θ:

$$
g(y;\theta) := \int_{\mathsf{Z}} f(z,y;\theta)\,\mu(\mathrm{d}z), \qquad
p(z|y;\theta) :=
\begin{cases}
\dfrac{f(z,y;\theta)}{g(y;\theta)} & \text{if } g(y;\theta) \neq 0,\\[1ex]
0 & \text{otherwise.}
\end{cases}
\tag{1.3.1}
$$

Note that p(z|y;θ) defines a probability density function with respect to µ, and P = {p(z|y;θ); θ ∈ Θ, (y, z) ∈ Y × Z} a family of probability densities. We denote by {P_θ; θ ∈ Θ} the associated family of probability measures. Naturally, the loss function L(θ) is defined for all θ ∈ Θ as follows:

$$
L(\theta) := -\log g(y;\theta). \tag{1.3.2}
$$

Example 1.1 We now give some examples of latent structure:

• In the presence of missing data, y stands for the observed data and the latent variables z are the missing data.

• For mixed effects models, the latent variables z are the random effects, and identifying the structure of the latent data mainly amounts to capturing the inter-individual variability among the individuals of the dataset. This setting is presented in Section 1.4 and studied in Chapter 6.

• For mixture models, the latent variables correspond to the unknown mixture labels, taking values in a discrete finite set (an explicit instance is given below). This setting is studied in Chapters 4, 5 and 7.
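As an illustration of the last item (the notation ω_k, m_k, σ_k² is ours, chosen for this sketch), a K-component Gaussian mixture fits the formalism of (1.3.1) with Z = {1, …, K}, µ the counting measure, and

$$
f(z,y;\theta) = \omega_z\, \mathcal{N}(y;\, m_z, \sigma_z^2), \qquad
g(y;\theta) = \sum_{k=1}^{K} \omega_k\, \mathcal{N}(y;\, m_k, \sigma_k^2),
$$

where θ = (ω_k, m_k, σ_k²)_{k=1}^K collects the mixture weights, means and variances, and p(z|y;θ) is the posterior probability of the label z given the observation y.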

Remark 1.1 In this thesis, we are interested in an empirical approach to the Maximum Likelihood Estimation problem. We consider n independent and not necessarily identically distributed vectors of observations (y_i ∈ Y_i, i ∈ ⟦1, n⟧), where Y_i is a subset of ℝ^{l_i}, and latent data (z_i ∈ Z_i, i ∈ ⟦1, n⟧). For all θ ∈ Θ, we set

$$
f(z,y;\theta) = \prod_{i=1}^{n} f_i(z_i, y_i;\theta), \qquad
g(y;\theta) = \prod_{i=1}^{n} g_i(y_i;\theta), \qquad
p(z|y;\theta) = \prod_{i=1}^{n} p_i(z_i|y_i;\theta). \tag{1.3.3}
$$

Thus, the objective function (1.3.2) reads:

$$
L(\theta) := -\sum_{i=1}^{n} \log g_i(y_i;\theta) = \sum_{i=1}^{n} L_i(\theta), \qquad \text{with } L_i(\theta) := -\log g_i(y_i;\theta). \tag{1.3.4}
$$

Note that, in order to avoid the singularities and degeneracies of the MLE highlighted in [Fraley and Raftery, 2007], one can regularize the objective function through a prior distribution over the model parameters; see Chapter 4 for an illustrative example.
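As a minimal sketch of such a regularization (the prior density π(θ) is an assumption here, not fixed by the thesis), the penalized loss replaces (1.3.2) with

$$
L^{\mathrm{pen}}(\theta) := -\log g(y;\theta) - \log \pi(\theta),
$$

whose minimizer is the maximum a posteriori (MAP) estimate rather than the MLE.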

1.3.2 The EM Algorithm

A popular class of inference algorithms to minimize (1.3.2) is the Expectation-Maximization (EM) algorithm, developed in the pioneering work by Dempster et al. [1977].

The EM is an iterative procedure that minimizes the function θ → L(θ) when its direct minimization is difficult. Denote by θ^{(k−1)} the current fit of the parameter at iteration k; the k-th step of the EM algorithm can then be decomposed into two steps. The E-step consists of computing the surrogate function, defined for all θ ∈ Θ as:

$$
Q(\theta,\theta^{(k-1)}) := -\int_{\mathsf{Z}} p(z|y;\theta^{(k-1)})\, \log f(z,y;\theta)\, \mu(\mathrm{d}z). \tag{1.3.5}
$$

In the M-step, the value of θ minimizing Q(θ, θ^{(k−1)}) is computed and set as the new parameter estimate θ^{(k)}. These two steps are repeated until convergence. The essence of the EM algorithm is that decreasing Q(θ, θ^{(k−1)}) forces a decrease of the function θ → L(θ); see [McLachlan and Krishnan, 2007] and the references therein.
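To make the two steps concrete, here is a minimal sketch (ours, not taken from the thesis) of EM for a one-dimensional Gaussian mixture with unit component variances, a toy instance of the latent data model in which z_i is the unknown label of y_i; all names are illustrative.

```python
import numpy as np

def em_gmm(y, K, n_iter=100, seed=0):
    """Toy EM for a 1-D Gaussian mixture with unit component variances.

    Latent data: z_i = component label of observation y_i.
    Parameters theta = (weights w, means mu).
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = len(y)
    w = np.full(K, 1.0 / K)                    # initial mixture weights
    mu = rng.choice(y, size=K, replace=False)  # initial means
    for _ in range(n_iter):
        # E-step: posterior p(z_i = k | y_i; theta), cf. (1.3.1)
        log_r = np.log(w) - 0.5 * (y[:, None] - mu[None, :]) ** 2
        log_r -= log_r.max(axis=1, keepdims=True)   # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed-form minimizer of the surrogate Q of (1.3.5)
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r * y[:, None]).sum(axis=0) / nk
    return w, mu

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    y = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])
    print(em_gmm(y, K=2))   # weights near (0.5, 0.5), means near (-2, 3)
```

The M-step is available in closed form here, which is the typical situation for exponential-family models (see Remark 1.3 below).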

Remark 1.2 Using the concavity of the logarithmic function and the Jensen inequality, we can show that Q(θ,θ^{(k−1)}) is, up to an additive constant, a majorizing surrogate function for the objective L(θ) at θ^{(k−1)}. This scheme falls nicely into the MM principle introduced in Section 1.2.1 and is exploited in [Gunawardana and Byrne, 2005]. Chapter 5 expands on this remark and develops a global analysis of an incremental variant of the EM, introduced by Neal and Hinton [1998].

Remark 1.3 A common assumption regarding the direct applicability of the EM to latent data models (see, in particular, the discussion in Dempster et al. [1977]) is that the complete model belongs to the curved exponential family, i.e., for all θ ∈ Θ:

$$
\log f(z,y;\theta) = -\psi(\theta) + \langle \tilde{S}(z,y), \phi(\theta) \rangle, \tag{1.3.6}
$$

where ψ : Θ → ℝ and φ : Θ → ℝ^q are twice continuously differentiable functions of θ, and S̃ is a statistic taking its values in a convex subset S of ℝ^q. Then, both steps of the EM can be written in terms of these sufficient statistics. Note that this assumption is not very restrictive, as many models of interest in machine learning satisfy it.
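As a worked instance (the model and notation are ours, chosen for illustration): take a latent Gaussian mean model, z ∼ N(θ, 1) and y | z ∼ N(z, σ²) with σ² known. Then

$$
\log f(z,y;\theta) = -\frac{\theta^2}{2} + \Big\langle \Big(-\frac{(y-z)^2}{2\sigma^2} - \frac{z^2}{2} - \log(2\pi\sigma),\; z\Big),\, (1,\theta) \Big\rangle,
$$

which is of the form (1.3.6) with ψ(θ) = θ²/2, φ(θ) = (1, θ), and S̃(z, y) the vector in the first argument of the inner product; the E-step then reduces to computing the conditional expectation of z under p(z|y; θ^{(k−1)}).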

Prior Work The EM method has been the subject of considerable interest since its formalization in [Dempster et al., 1977]. Most prior works studying the convergence of EM methods consider the asymptotic and/or local behaviors, thereby avoiding global assumptions on the non-convex objective. The global convergence to a stationary point (either a local minimum or a saddle point) of the EM method has been established by Wu [1983] as an extension of prior work developed in [Dempster et al., 1977]. This global convergence is a direct consequence of the EM method being monotone, i.e., the likelihood never decreases along the iterations. Locally and under regularity conditions, a linear convergence rate to a stationary point has been studied in [McLachlan and Krishnan, 2007, Chapters 3 and 4]. Following Remark 1.1, a natural enhancement of those methods consists in constructing cheaper updates at each iteration. For instance, the convergence of the Incremental EM (iEM) method was first tackled by Gunawardana and Byrne [2005], exploiting the interpretation of the method as an alternating minimization procedure under the information-geometric framework developed in [Csiszár and Tusnády, 1984]. More recently, the local but non-asymptotic convergence of EM methods has been studied in several works. These results typically require the initialization to lie within a neighborhood of an isolated stationary point and the (negated) log-likelihood function to be locally strongly convex. Such conditions are either difficult to verify in general or have been derived only for specific models; see for example [Balakrishnan et al., 2017, Wang et al., 2015a, Xu et al., 2016a] and the references therein. The local convergence of a variance-reduced EM method (called sEM-VR), which is not exactly an incremental method but rather builds upon a control variate principle, has been studied in [Chen et al., 2018, Theorem 1] under a pathwise global stability condition.

Our Contributions It is thus of utmost importance to improve and analyze EM variants in order to address the two challenges mentioned at the very beginning of this Introduction, i.e., the increasing number of data points and the non-convexity of the objective.

Chapter 5 of this manuscript proposes and analyzes several variance-reduced and incremental versions of the EM algorithm, such as the iEM and the sEM-VR mentioned above, in order to scale to large numbers of data points and to speed up convergence. In particular, this chapter considers the iEM as in [Gunawardana and Byrne, 2005] and extends their result under mild assumptions, allowing for a non-convex objective and requiring only uniform sampling of the indices, performed independently throughout the passes over the data.

1.3.3 The SAEM Algorithm

In many situations, the expectation step of the EM algorithm (1.3.5) can be numerically involved or even intractable. To address that issue, Wei and Tanner [1990a] propose to replace the expectation by a Monte Carlo integration, leading to the so-called Monte Carlo EM (MCEM). Another option, developed in [Delyon et al., 1999a], is the Stochastic Approximation of the EM (SAEM), which proceeds as follows (a toy implementation is sketched after the three steps below):

1. Simulation step: draw a Monte Carlo batch of latent variables {z_m^{(k)}}_{m=1}^{M^{(k)}} from the posterior distribution p(z|y; θ^{(k−1)}).

2. Stochastic approximation step: update the approximation, denoted Q_k(θ), of the conditional expectation of the complete-data log-likelihood, i.e., of −Q(θ, θ^{(k−1)}) defined in (1.3.5):

$$
Q_k(\theta) = Q_{k-1}(\theta) + \gamma_k \left( \frac{1}{M^{(k)}} \sum_{m=1}^{M^{(k)}} \log f(z_m^{(k)}, y;\theta) - Q_{k-1}(\theta) \right), \tag{1.3.7}
$$

where {γ_k}_{k>0} is a sequence of decreasing stepsizes with γ_1 = 1.

3. Maximization step:

$$
\theta^{(k)} = \arg\max_{\theta\in\Theta} Q_k(\theta). \tag{1.3.8}
$$
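As a minimal runnable sketch of these three steps (ours, not taken from the thesis), consider again the toy latent Gaussian mean model of the worked instance after (1.3.6): exact posterior sampling is available there and, by Remark 1.3, Q_k is summarized by a single running sufficient statistic. All names are illustrative.

```python
import numpy as np

def saem_latent_mean(y, sigma2=0.5, n_iter=200, M=5, K1=50, a=0.7, seed=0):
    """Toy SAEM for: z_i ~ N(theta, 1) latent, y_i | z_i ~ N(z_i, sigma2).

    Exponential-family model: the surrogate Q_k is summarized by a running
    stochastic approximation s of the sufficient statistic mean(z_i).
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    theta, s = 0.0, 0.0
    prec = 1.0 + 1.0 / sigma2        # posterior precision of z_i given y_i
    for k in range(1, n_iter + 1):
        # 1. Simulation: exact draws from p(z | y; theta), Gaussian here,
        #    with a Monte Carlo batch of size M at every iteration
        post_mean = (theta + y / sigma2) / prec
        z = post_mean + rng.normal(size=(M, len(y))) / np.sqrt(prec)
        stat = z.mean()              # Monte Carlo estimate of E[mean(z) | y]
        # 2. Stochastic approximation, cf. (1.3.7): gamma_k = 1 during the
        #    first K1 iterations, then decreasing as 1/(k - K1)^a
        gamma = 1.0 if k <= K1 else (k - K1) ** (-a)
        s += gamma * (stat - s)
        # 3. Maximization: for this model the maximizer of Q_k is s itself
        theta = s
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    z_true = rng.normal(2.0, 1.0, size=1000)              # true theta = 2
    y = z_true + rng.normal(0.0, np.sqrt(0.5), size=1000)
    print(saem_latent_mean(y))                            # close to mean(y)
```

For this model the fixed point of the recursion is the empirical mean of the observations, which is indeed the MLE of θ since y_i ∼ N(θ, 1 + σ²) marginally.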

During the stochastic approximation phase, the simulated latent variables z are imputed to obtain a complete dataset, from which the complete-data log-likelihood is computed and averaged into Q_k; see [Kuhn and Lavielle, 2004].

In the simulation step, since the relation between the observed data and the latent data can be nonlinear, sampling from the posterior distribution is hard and often requires an inference algorithm. Kuhn and Lavielle [2004] proved almost-sure convergence of the sequence of parameters obtained by this algorithm coupled with an MCMC procedure during the simulation step. Indeed, {z_m^{(k)}}_{m=1}^{M^{(k)}} is a Monte Carlo batch. In simple scenarios, the samples {z_m^{(k)}}_{m=1}^{M^{(k)}} are conditionally independent and identically distributed with distribution p(z|y; θ^{(k−1)}). Nevertheless, in most cases, sampling exactly from this distribution is not an option and the Monte Carlo batch is produced by a Markov chain Monte Carlo (MCMC) algorithm.

Figure 1.2 – Metropolis-Hastings (MH) algorithm: representation of a proposal q(z) and a target π(z) distribution in one dimension.

MCMC algorithms are a class of methods for sampling from complex distributions over (possibly) high-dimensional spaces. An important class of samplers, the Metropolis-Hastings (MH) algorithms, iteratively draws candidates from a proposal distribution q, the distribution of the newly drawn candidate depending only on the current state of the chain. With some probability, the candidate is either accepted as the new state of the chain or rejected.

It is well known, see [Mengersen and Tweedie, 1996, Roberts and Rosenthal, 2011], that the Independent Sampler, a sub-class of MH samplers in which the proposal is independent of the current state of the chain, is geometrically ergodic if and only if, for a given ε,

$$
\inf_{z\in\mathbb{R}^p} \frac{q(z)}{\pi(z)} \;\geq\; \varepsilon > 0,
$$

where π(z) is the target distribution. More generally, it is shown in [Roberts and Rosenthal, 2011] that the mixing rate in total variation depends on the expectation of the acceptance ratio under the proposal distribution, which is itself directly related to the ratio of the proposal to the target. This observation naturally suggests finding a proposal that approximates the target. Figure 1.2 illustrates this remark, with a simple Gaussian distribution as the proposal. From this figure, one can see that the efficiency of the sampler is driven by the similarity between the two distributions of interest (e.g., whether they belong to the same family of distributions).
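A minimal sketch of the Independent Sampler (ours; the bimodal toy target and all names are assumptions for illustration). For an independent proposal, the MH acceptance ratio reduces to a ratio of importance weights π(z′)/q(z′) over π(z)/q(z), so the geometric-ergodicity condition above is exactly the requirement that these weights be bounded.

```python
import numpy as np

def independent_mh(log_target, proposal_sample, log_proposal, n_samples,
                   init, seed=0):
    """Independent Metropolis-Hastings: the proposal q does not depend on
    the current state; acceptance uses the ratio of importance weights."""
    rng = np.random.default_rng(seed)
    z = init
    log_w = log_target(z) - log_proposal(z)    # current importance weight
    chain = np.empty(n_samples)
    for t in range(n_samples):
        z_new = proposal_sample(rng)
        log_w_new = log_target(z_new) - log_proposal(z_new)
        if np.log(rng.uniform()) < log_w_new - log_w:   # accept / reject
            z, log_w = z_new, log_w_new
        chain[t] = z
    return chain

# Toy usage: wide Gaussian proposal targeting a bimodal mixture, cf. Fig. 1.2
if __name__ == "__main__":
    log_target = lambda z: np.logaddexp(-0.5 * (z + 2.0) ** 2,
                                        -0.5 * (z - 2.0) ** 2)
    log_proposal = lambda z: -0.5 * (z / 3.0) ** 2      # N(0, 9), unnormalized
    chain = independent_mh(log_target, lambda rng: 3.0 * rng.normal(),
                           log_proposal, n_samples=10_000, init=0.0)
    print(chain.mean(), chain.std())
```

With the wide N(0, 9) proposal, the weights π/q are bounded, so the chain is geometrically ergodic; a narrow proposal missing one of the two modes would break the condition above and stall the sampler.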

In the stochastic approximation step, the sequence of decreasing positive stepsizes {γ_k}_{k>0} controls the convergence of the algorithm. In practice, γ_k is set equal to 1 during the first K_1 iterations, to let the algorithm explore the parameter space without memory and converge quickly to a neighborhood of the ML estimate. The stochastic approximation is performed during the final K_2 iterations, where γ_k = 1/k^a with, in general, a = 0.7, ensuring the almost-sure convergence of the estimate.
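Written out (the offset of the iteration counter after the exploration phase is one common convention, assumed here), the schedule reads

$$
\gamma_k = \begin{cases} 1 & \text{if } k \le K_1, \\ (k - K_1)^{-a} & \text{if } K_1 < k \le K_1 + K_2, \end{cases} \qquad a \in (1/2, 1],
$$

the constraint on a reflecting the usual Robbins-Monro-type trade-off between $\sum_k \gamma_k = \infty$ and $\sum_k \gamma_k^2 < \infty$.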

Prior Work The SAEM algorithm has been shown theoretically to converge to a maximum of the likelihood of the observations under very general conditions [Delyon et al., 1999a]. As already mentioned, this result has been extended by Kuhn and Lavielle [2004] to include an MCMC sampling scheme in the simulation phase. Recent work by Allassonnière and Chevallier [2019] exhibits a new class of algorithms where the simulation step is performed using an annealed version of the posterior distribution, motivated by the problem of escaping saddle points.

Our Contributions We consider the SAEM algorithm in Chapters 6 and 7. In particular, Chapter 6 contains an improvement of the aforementioned sampling procedure.

We exploit the remark about the Independent Sampler by introducing an efficient MH proposal based on a Laplace approximation of the incomplete log-likelihood. A linearization of the structural model is shown to be equivalent for the class of continuous data models. Chapter 7 exploits the finite-sum structure of the objective function (following Remark 1.1) and proposes an incremental variant of the SAEM, whose asymptotic behavior is established both theoretically and experimentally.
