
analysis for the MISSO method. Finally, Section 3.4 presents numerical applications that illustrate our findings, including parameter inference for logistic regression with missing covariates and variational inference for Bayesian neural networks.

Notations. We denote $\llbracket 1, n \rrbracket = \{1, \ldots, n\}$. Unless otherwise specified, $\|\cdot\|$ denotes the standard Euclidean norm and $\langle \cdot \mid \cdot \rangle$ the inner product in Euclidean space. For any function $f : \Theta \to \mathbb{R}$, $f'(\theta, d)$ is the directional derivative of $f$ at $\theta$ along the direction $d$, i.e.,
$$
f'(\theta, d) := \lim_{t \to 0^+} \frac{f(\theta + t d) - f(\theta)}{t}. \qquad (3.1.2)
$$

The directional derivative is assumed to exist for the functions introduced throughout this paper.
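For instance (a small worked example of our own, not from the text), for the non-smooth function $f(\theta) = \|\theta\|_1$ on $\mathbb{R}^d$, the directional derivative exists everywhere; at $\theta = 0$ it equals
$$
f'(0, d) = \lim_{t \to 0^+} \frac{\|t d\|_1 - 0}{t} = \|d\|_1 .
$$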

3.2 Incremental Minimization of Finite Sum Non-convex Functions

The objective function in (3.1.1) is composed of a finite sum of possibly non-smooth and non-convex functions. A popular approach here is to apply the MM method, which tackles (3.1.1) by alternating between two steps: (i) minimizing a surrogate function which upper bounds the original objective function; and (ii) updating the surrogate function to tighten the upper bound.
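As a concrete illustration of these two steps (a minimal sketch under our own assumptions, not code from the text), the loop below applies MM to a generic $L$-smooth objective using the standard quadratic majorizer; the handles f and grad_f are placeholders.

```python
import numpy as np

def mm_quadratic(grad_f, theta0, L, n_iters=100):
    """Generic MM loop with the quadratic upper bound
    f(t) + <grad_f(t), x - t> + (L/2) * ||x - t||^2 anchored at the current iterate t."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        # (i) minimize the current surrogate: its minimizer is a gradient step of size 1/L
        theta = theta - grad_f(theta) / L
        # (ii) the surrogate is then re-anchored (tightened) at the new iterate
    return theta
```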

As mentioned in the Introduction, the MISO method proposed by Mairal [2015b] is developed as an iterative scheme that only updates the surrogate functions partially at each iteration. Formally, for any $i \in \llbracket 1, n \rrbracket$, we consider a surrogate function $\widehat{\mathcal{L}}_i(\theta;\overline{\theta})$ which satisfies:

S3.1 For all $i \in \llbracket 1, n \rrbracket$ and $\overline{\theta} \in \Theta$, the function $\widehat{\mathcal{L}}_i(\theta;\overline{\theta})$ is convex w.r.t. $\theta$, and it holds
$$
\widehat{\mathcal{L}}_i(\theta;\overline{\theta}) \ge \mathcal{L}_i(\theta), \quad \forall\, \theta \in \Theta, \qquad (3.2.1)
$$
where the equality holds when $\theta = \overline{\theta}$.

S3.2 For any $\theta_i \in \Theta$, $i \in \llbracket 1, n \rrbracket$, and some $\epsilon > 0$, the difference function
$$
\widehat{e}(\theta;\{\theta_i\}_{i=1}^n) := \frac{1}{n}\sum_{i=1}^n \widehat{\mathcal{L}}_i(\theta;\theta_i) - \mathcal{L}(\theta)
$$
is defined for all $\theta \in \Theta$ and differentiable for all $\theta \in \overline{\Theta}$, where $\overline{\Theta} = \{\theta \in \mathbb{R}^d : \inf_{\theta' \in \Theta} \|\theta - \theta'\| < \epsilon\}$ is an $\epsilon$-neighborhood set of $\Theta$. Moreover, for some constant $L$, the gradient satisfies
$$
\|\nabla \widehat{e}(\theta;\{\theta_i\}_{i=1}^n)\|^2 \le 2 L\, \widehat{e}(\theta;\{\theta_i\}_{i=1}^n), \quad \forall\, \theta \in \Theta. \qquad (3.2.2)
$$

S3.1 is a common condition used for surrogate optimization, see [Mairal, 2015b, Section 2.3]. Meanwhile, S3.2 is satisfied when the difference function $\widehat{e}(\theta;\{\theta_i\}_{i=1}^n)$ is $L$-smooth for all $\theta \in \mathbb{R}^d$; this implication follows from [Razaviyayn et al., 2013, Proposition 1].
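For completeness, here is the standard one-line argument behind this implication (a sketch under the stated assumptions): since $\widehat{e}(\cdot;\{\theta_i\}_{i=1}^n)$ is non-negative by S3.1 and assumed $L$-smooth on $\mathbb{R}^d$, the descent lemma gives, for any $\theta$,
$$
0 \;\le\; \widehat{e}\Big(\theta - \tfrac{1}{L}\nabla \widehat{e}(\theta;\{\theta_i\}_{i=1}^n);\{\theta_i\}_{i=1}^n\Big) \;\le\; \widehat{e}(\theta;\{\theta_i\}_{i=1}^n) - \frac{1}{2L}\,\big\|\nabla \widehat{e}(\theta;\{\theta_i\}_{i=1}^n)\big\|^2,
$$
and rearranging yields (3.2.2).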

Algorithm 3.1 MISO method [Mairal, 2015b]

1: Input: initialization $\theta^{(0)}$.
2: Initialize the surrogate functions as $\mathcal{A}_i^0(\theta) := \widehat{\mathcal{L}}_i(\theta;\theta^{(0)})$, $i \in \llbracket 1, n \rrbracket$.
3: for $k = 0, 1, \ldots$ do
4:   Pick $i_k$ uniformly from $\llbracket 1, n \rrbracket$.
5:   Update $\mathcal{A}_i^{k+1}(\theta)$ as:
$$
\mathcal{A}_i^{k+1}(\theta) = \begin{cases} \widehat{\mathcal{L}}_i(\theta;\theta^{(k)}), & \text{if } i = i_k, \\ \mathcal{A}_i^{k}(\theta), & \text{otherwise.} \end{cases}
$$
6:   Set $\theta^{(k+1)} \in \arg\min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^n \mathcal{A}_i^{k+1}(\theta)$.
7: end for

The inequality (3.2.1) implies $\widehat{\mathcal{L}}_i(\theta;\overline{\theta}) \ge \mathcal{L}_i(\theta) > -\infty$ for any $\theta \in \Theta$. The MISO method is an incremental version of the MM method, as summarized in Algorithm 3.1. As seen in the pseudo code, the MISO method maintains an iteratively updated set of surrogate upper-bound functions $\{\mathcal{A}_i^k(\theta)\}_{i=1}^n$ and updates the iterate by minimizing the average of these surrogate functions.

In particular, only one out of the $n$ surrogate functions is updated at each iteration [cf. Line 5], and the sum function $\frac{1}{n}\sum_{i=1}^n \mathcal{A}_i^{k+1}(\theta)$ is designed to be 'easy to optimize'; for example, it can be a sum of quadratic functions. As such, the MISO method is suitable for large-scale optimization as the computation cost per iteration is independent of $n$.

Moreover, under S3.1 and S3.2, it was shown that the MISO method converges almost surely to a stationary point of (3.1.1) [Mairal, 2015b, Proposition 3.1].
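To make the per-iteration cost explicit, here is a minimal sketch (our own illustration, not the thesis code) of Algorithm 3.1 when each surrogate is the quadratic majorizer $\mathcal{L}_i(\overline{\theta}) + \langle \nabla \mathcal{L}_i(\overline{\theta}) \mid \theta - \overline{\theta}\rangle + \frac{L}{2}\|\theta - \overline{\theta}\|^2$ of an $L$-smooth component (cf. (3.2.12) below): each stored surrogate is summarized by its anchor point and gradient, and the average surrogate has a closed-form minimizer. The list grads of per-component gradient functions is a placeholder.

```python
import numpy as np

def miso_quadratic(grads, theta0, L, n_iters=1000, rng=None):
    """MISO (Algorithm 3.1) with quadratic surrogates; Theta = R^d (unconstrained)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(grads)
    theta = np.asarray(theta0, dtype=float)
    anchors = np.tile(theta, (n, 1))                 # theta_i anchoring surrogate i
    grad_tab = np.array([g(theta) for g in grads])   # grad L_i(theta_i)
    for _ in range(n_iters):
        ik = int(rng.integers(n))                    # Line 4: pick i_k uniformly
        anchors[ik] = theta                          # Line 5: refresh surrogate i_k
        grad_tab[ik] = grads[ik](theta)
        # Line 6: the average quadratic surrogate is minimized in closed form
        theta = anchors.mean(axis=0) - grad_tab.mean(axis=0) / L
    return theta
```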

We now consider the case where the surrogate functions $\widehat{\mathcal{L}}_i(\theta;\overline{\theta})$ are intractable. Let $\mathsf{Z}$ be a measurable set, $p_i : \mathsf{Z} \times \Theta \to \mathbb{R}_+$ a pdf, $r_i : \Theta \times \Theta \times \mathsf{Z} \to \mathbb{R}$ a measurable function, and $\mu_i$ a $\sigma$-finite measure. We consider surrogate functions satisfying S3.1 and S3.2 that can be expressed as an expectation:
$$
\widehat{\mathcal{L}}_i(\theta;\overline{\theta}) := \int_{\mathsf{Z}} r_i(\theta;\overline{\theta}, z_i)\, p_i(z_i;\overline{\theta})\, \mu_i(\mathrm{d}z_i), \quad \forall\, (\theta, \overline{\theta}) \in \Theta \times \Theta. \qquad (3.2.3)
$$
Plugging (3.2.3) into the MISO method is not feasible since the update step in Line 6 involves the minimization of an expectation. Several motivating examples of (3.1.1) are given later in Section 3.2.

We propose the Minimization by Incremental Stochastic Surrogate Optimization (MISSO) method, which replaces the expectation in (3.2.3) by a Monte Carlo integration and then optimizes (3.1.1) incrementally. Denote by $M \in \mathbb{N}$ the Monte Carlo batch size and let $z_{i,m} \in \mathsf{Z}$, $m = 1, \ldots, M$, be a set of samples for all $i \in \llbracket 1, n \rrbracket$. These samples can be drawn (Case 1) i.i.d. from the distribution $p_i(\cdot;\overline{\theta})$ or (Case 2) from a Markov chain with stationary distribution $p_i(\cdot;\overline{\theta})$; see Section 4.2.1 for illustrations. To this end, we define
$$
\widetilde{\mathcal{L}}_i(\theta;\overline{\theta},\{z_{i,m}\}_{m=1}^{M}) := \frac{1}{M}\sum_{m=1}^{M} r_i(\theta;\overline{\theta}, z_{i,m}) \qquad (3.2.4)
$$
and we summarize the proposed MISSO method in Algorithm 3.2. As seen, the procedure is similar to the MISO method but it involves two types of randomness. The first comes from the selection of $i_k$ in Line 5. The second is that a set of Monte-Carlo approximated functions $\widetilde{\mathcal{A}}_i^k(\theta)$ is used in lieu of $\mathcal{A}_i^k(\theta)$ when optimizing for the next iterate $\theta^{(k+1)}$. We now discuss two applications of the MISSO method.
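Before turning to the applications, the following minimal sketch (our own illustration, with hypothetical interfaces) shows how the pieces fit together: sample_z(i, theta, M) is assumed to draw M Monte-Carlo samples from $p_i(\cdot;\theta)$, r(i, theta, anchor, z) evaluates $r_i(\theta;\overline{\theta}, z)$, and the average surrogate is minimized with a generic numerical solver; in practice one would use surrogates with closed-form minimizers, as in Example 2 below.

```python
import numpy as np
from scipy.optimize import minimize

def misso(n, sample_z, r, theta0, batch_sizes, rng=None):
    """Sketch of the MISSO iteration: one surrogate per component, refreshed one at a time.
    The constraint theta in Theta is ignored here (unconstrained minimization)."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    anchors = [theta.copy() for _ in range(n)]                      # anchor of surrogate i
    samples = [sample_z(i, theta, batch_sizes[0]) for i in range(n)]
    for k in range(1, len(batch_sizes)):
        ik = int(rng.integers(n))                                   # pick the component to refresh
        anchors[ik] = theta.copy()
        samples[ik] = sample_z(ik, theta, batch_sizes[k])           # M_(k) samples from p_ik(.;theta)
        # minimize the average of the Monte-Carlo surrogates (3.2.4)
        def avg_surrogate(th):
            return np.mean([np.mean([r(i, th, anchors[i], z) for z in samples[i]])
                            for i in range(n)])
        theta = minimize(avg_surrogate, theta).x
    return theta
```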

Example 1: Maximum Likelihood Estimation for Latent Variable Models. Latent variable models [Bishop, 2006] are constructed by introducing unobserved (latent) variables which help explain the observed data. We consider $n$ independent observations $((y_i, z_i),\, i \in \llbracket 1, n \rrbracket)$, possibly non-identically distributed, where $y_i$ is observed and $z_i$ is latent. In this incomplete-data framework, define $\{f_i(z_i, y_i, \theta),\, \theta \in \Theta\}$ to be the complete-data likelihood models, i.e., the joint likelihood of the observations and latent variables. Let
$$
g_i(y_i, \theta) := \int_{\mathsf{Z}} f_i(z_i, y_i, \theta)\, \mu_i(\mathrm{d}z_i), \quad i \in \llbracket 1, n \rrbracket, \qquad (3.2.7)
$$
denote the incomplete-data likelihood, i.e., the marginal likelihood of the observations. For ease of notation, the dependence on the observations is made implicit. The maximum likelihood (ML) estimation problem takes $\mathcal{L}_i(\theta)$ to be the $i$th negated incomplete-data

Algorithm 3.2 MISSO method

1: Input: initialization $\theta^{(0)}$; a sequence of non-negative numbers $\{M_{(k)}\}_{k \ge 0}$.
2: For all $i \in \llbracket 1, n \rrbracket$, draw $M_{(0)}$ Monte-Carlo samples with the stationary distribution $p_i(\cdot;\theta^{(0)})$.
3: Initialize the surrogate functions as
$$
\widetilde{\mathcal{A}}_i^0(\theta) := \widetilde{\mathcal{L}}_i(\theta;\theta^{(0)},\{z_{i,m}^{(0)}\}_{m=1}^{M_{(0)}}), \quad i \in \llbracket 1, n \rrbracket. \qquad (3.2.5)
$$
4: for $k = 0, 1, \ldots$ do
5:   Pick a function index $i_k$ uniformly on $\llbracket 1, n \rrbracket$.
6:   Draw $M_{(k)}$ Monte-Carlo samples with the stationary distribution $p_{i_k}(\cdot;\theta^{(k)})$.
7:   Update the individual surrogate functions recursively as:
$$
\widetilde{\mathcal{A}}_i^{k+1}(\theta) = \begin{cases} \widetilde{\mathcal{L}}_i(\theta;\theta^{(k)},\{z_{i,m}^{(k)}\}_{m=1}^{M_{(k)}}), & \text{if } i = i_k, \\ \widetilde{\mathcal{A}}_i^{k}(\theta), & \text{otherwise.} \end{cases} \qquad (3.2.6)
$$
8:   Set $\theta^{(k+1)} \in \arg\min_{\theta \in \Theta} \widetilde{\mathcal{L}}^{(k+1)}(\theta) := \frac{1}{n}\sum_{i=1}^n \widetilde{\mathcal{A}}_i^{k+1}(\theta)$.
9: end for

log-likelihood $\mathcal{L}_i(\theta) := -\log g_i(y_i, \theta)$.

Assuming without loss of generality that $g_i(y_i, \theta) \neq 0$ for all $\theta \in \Theta$, we define by $p_i(z_i | y_i, \theta) := f_i(z_i, y_i, \theta)/g_i(y_i, \theta)$ the conditional distribution of the latent variable $z_i$ given the observation $y_i$. A surrogate function $\widehat{\mathcal{L}}_i(\theta;\overline{\theta})$ satisfying S3.1 can be obtained by writing $f_i(z_i, y_i, \theta) = \frac{f_i(z_i, y_i, \theta)}{p_i(z_i | y_i, \overline{\theta})}\, p_i(z_i | y_i, \overline{\theta})$ and applying the Jensen inequality:
$$
\widehat{\mathcal{L}}_i(\theta;\overline{\theta}) = \int_{\mathsf{Z}} \underbrace{\log\big( p_i(z_i | y_i, \overline{\theta}) / f_i(z_i, y_i, \theta) \big)}_{=\, r_i(\theta;\overline{\theta}, z_i)}\, p_i(z_i | y_i, \overline{\theta})\, \mu_i(\mathrm{d}z_i). \qquad (3.2.8)
$$
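To spell out the Jensen step (using the decomposition above), the concavity of the logarithm gives
$$
\mathcal{L}_i(\theta) = -\log \int_{\mathsf{Z}} \frac{f_i(z_i, y_i, \theta)}{p_i(z_i | y_i, \overline{\theta})}\, p_i(z_i | y_i, \overline{\theta})\, \mu_i(\mathrm{d}z_i) \;\le\; \int_{\mathsf{Z}} \log \frac{p_i(z_i | y_i, \overline{\theta})}{f_i(z_i, y_i, \theta)}\, p_i(z_i | y_i, \overline{\theta})\, \mu_i(\mathrm{d}z_i) = \widehat{\mathcal{L}}_i(\theta;\overline{\theta}),
$$
with equality at $\theta = \overline{\theta}$, since the ratio inside the logarithm is then constant and equal to $1/g_i(y_i, \overline{\theta})$.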

We note that S3.2 can also be verified for common distribution models. We can apply the MISSO method following the above specification of $r_i(\theta;\overline{\theta}, z_i)$ and $p_i(z_i | y_i, \overline{\theta})$.
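As a small illustration (our own sketch, with placeholder handles log_f for $\log f_i$ and log_post for $\log p_i(\cdot | y_i, \cdot)$), the Monte-Carlo surrogate (3.2.4) in this latent-variable case reduces, up to a term constant in $\theta$, to an averaged negated complete-data log-likelihood, much like an incremental Monte-Carlo EM step:

```python
import numpy as np

def r_latent(log_f, log_post, theta, anchor, z, y):
    """r_i(theta; anchor, z) = log p_i(z | y, anchor) - log f_i(z, y, theta), cf. (3.2.8)."""
    return log_post(z, y, anchor) - log_f(z, y, theta)

def surrogate_latent(log_f, log_post, theta, anchor, z_samples, y):
    """Monte-Carlo surrogate (3.2.4): average of r_i over samples drawn from p_i(. | y, anchor).
    Only the -log f_i term depends on theta, so minimizing in theta maximizes the
    averaged complete-data log-likelihood."""
    return float(np.mean([r_latent(log_f, log_post, theta, anchor, z, y) for z in z_samples]))
```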

Example 2: Variational Inference. Let $((x_i, y_i),\, i \in \llbracket 1, n \rrbracket)$ be i.i.d. input-output pairs and $w \in \mathsf{W} \subseteq \mathbb{R}^d$ be a latent variable. When conditioned on the input $x = (x_i,\, i \in \llbracket 1, n \rrbracket)$, the joint distribution of $y = (y_i,\, i \in \llbracket 1, n \rrbracket)$ and $w$ is given by:
$$
p(y, w | x) = \pi(w) \prod_{i=1}^n p(y_i | x_i, w). \qquad (3.2.9)
$$
Our goal is to compute the posterior distribution $p(w | y, x)$. In most cases, the posterior distribution $p(w | y, x)$ is intractable and is approximated using a family of parametric distributions $\{q(w;\theta),\, \theta \in \Theta\}$. The variational inference (VI) problem [Blei et al., 2017a] boils down to minimizing the KL divergence between $q(w;\theta)$ and the posterior distribution $p(w | y, x)$, as follows:
$$
\min_{\theta \in \Theta}\; \mathcal{L}(\theta) := \mathrm{KL}\big( q(w;\theta)\,\|\, p(w | y, x) \big) := \mathbb{E}_{q(w;\theta)}\big[ \log\big( q(w;\theta)/p(w | y, x) \big) \big]. \qquad (3.2.10)
$$
Using (3.2.9), we decompose $\mathcal{L}(\theta) = n^{-1}\sum_{i=1}^n \mathcal{L}_i(\theta) + \mathrm{const}$, where:

$$
\mathcal{L}_i(\theta) := -\mathbb{E}_{q(w;\theta)}\big[ \log p(y_i | x_i, w) \big] + \frac{1}{n}\,\mathbb{E}_{q(w;\theta)}\big[ \log\big( q(w;\theta)/\pi(w) \big) \big] = r_i(\theta) + d(\theta). \qquad (3.2.11)
$$
Directly optimizing the finite-sum objective function in (3.2.10) can be difficult. First, with $n \gg 1$, evaluating the objective function $\mathcal{L}(\theta)$ requires a full pass over the entire dataset. Second, for some complex models, the expectations in (3.2.11) can be intractable even if we assume a simple parametric model for $q(w;\theta)$. Assume that $\mathcal{L}_i$ is $L$-smooth, i.e., $\mathcal{L}_i$ is differentiable on $\Theta$ and its gradient $\nabla \mathcal{L}_i$ is $L$-Lipschitz. We apply the MISSO method with a quadratic surrogate function defined as:

$$
\widehat{\mathcal{L}}_i(\theta;\overline{\theta}) := \mathcal{L}_i(\overline{\theta}) + \big\langle \nabla_\theta \mathcal{L}_i(\overline{\theta}) \,\big|\, \theta - \overline{\theta} \big\rangle + \frac{L}{2}\,\|\theta - \overline{\theta}\|^2. \qquad (3.2.12)
$$
It is easily checked that $\widehat{\mathcal{L}}_i(\theta;\overline{\theta})$ satisfies S3.1 and S3.2. To compute the gradient $\nabla \mathcal{L}_i(\overline{\theta})$, we apply the re-parametrization technique suggested in [Blundell et al., 2015, Kingma and Welling, 2014, Paisley et al., 2012]. Let $t : \mathbb{R}^d \times \Theta \to \mathbb{R}^d$ be a function, differentiable w.r.t. $\theta \in \Theta$, designed such that the law of $w = t(z, \theta)$, where $z \sim \mathcal{N}_d(0, \mathrm{I})$, is $q(\cdot;\theta)$. By [Blundell et al., 2015, Proposition 1], the gradient of $-r_i(\cdot)$ in (3.2.11) is:
$$
\nabla_\theta\, \mathbb{E}_{q(w;\theta)}\big[ \log p(y_i | x_i, w) \big] = \mathbb{E}_{z \sim \mathcal{N}_d(0, \mathrm{I})}\big[ \mathrm{J}_t^{\theta}(z, \theta)\, \nabla_w \log p(y_i | x_i, w)\big|_{w = t(z, \theta)} \big], \qquad (3.2.13)
$$
where for each $z \in \mathbb{R}^d$, $\mathrm{J}_t^{\theta}(z, \theta)$ is the Jacobian of the function $t(z, \cdot)$ with respect to $\theta$, evaluated at $\theta$. In addition, in most cases the term $\nabla d(\theta)$ can be evaluated in closed form.

We thus define, for all $z \in \mathbb{R}^d$:
$$
r_i(\theta;\overline{\theta}, z) := \big\langle \nabla d(\overline{\theta}) - \mathrm{J}_t^{\theta}(z, \overline{\theta})\, \nabla_w \log p(y_i | x_i, w)\big|_{w = t(z, \overline{\theta})} \,\big|\, \theta - \overline{\theta} \big\rangle + \frac{L}{2}\,\|\theta - \overline{\theta}\|^2. \qquad (3.2.14)
$$
Finally, using (3.2.12) and (3.2.14), the surrogate function (3.2.4) is given by $\widetilde{\mathcal{L}}_i(\theta;\overline{\theta},\{z_m\}_{m=1}^{M}) := M^{-1}\sum_{m=1}^{M} r_i(\theta;\overline{\theta}, z_m)$, where $\{z_m\}_{m=1}^{M}$ is an i.i.d. sample from $\mathcal{N}_d(0, \mathrm{I})$.
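A minimal sketch of the reparametrized gradient estimate in (3.2.13), assuming a mean-field Gaussian variational family $q(\cdot;\theta) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ with $\theta = (\mu, \rho)$ and $\sigma = \log(1 + e^{\rho})$ as in Blundell et al. [2015]; the handle grad_w_loglik (the gradient of $\log p(y_i | x_i, w)$ in $w$) is a placeholder:

```python
import numpy as np

def reparam_grad(mu, rho, grad_w_loglik, M, rng=None):
    """Monte-Carlo estimate of -grad_theta E_q[log p(y_i|x_i, w)] with w = t(z, theta)."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.log1p(np.exp(rho))              # softplus link, so sigma > 0
    g_mu = np.zeros_like(mu)
    g_rho = np.zeros_like(rho)
    for _ in range(M):
        z = rng.standard_normal(mu.shape)      # z ~ N(0, I)
        w = mu + sigma * z                     # w = t(z, theta)
        g = grad_w_loglik(w)                   # grad_w log p(y_i | x_i, w) at w = t(z, theta)
        # chain rule through t: dw/dmu = I, dw/drho = z * sigmoid(rho)
        g_mu += g
        g_rho += g * z / (1.0 + np.exp(-rho))
    return -g_mu / M, -g_rho / M               # Monte-Carlo estimate of -grad_theta E_q[log p]
```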
