
analysis for the MISSO method. Finally, Section 3.4 presents numerical applications that illustrate our findings, including parameter inference for logistic regression with missing covariates and variational inference for Bayesian neural networks.

Notations. We denote $\llbracket 1, n \rrbracket = \{1, \ldots, n\}$. Unless otherwise specified, $\|\cdot\|$ denotes the standard Euclidean norm and $\langle \cdot \mid \cdot \rangle$ the inner product in Euclidean space. For any function $f : \Theta \to \mathbb{R}$, $f'(\theta, d)$ is the directional derivative of $f$ at $\theta$ along the direction $d$, i.e.,
$$
f'(\theta, d) := \lim_{t \to 0^+} \frac{f(\theta + t d) - f(\theta)}{t}. \qquad (3.1.2)
$$

The directional derivative is assumed to exist for the functions introduced throughout this paper.
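For instance (a small worked example of our own, not from the text), for the non-smooth function $f(\theta) = \|\theta\|_1$ on $\mathbb{R}^d$, the directional derivative exists everywhere; at $\theta = 0$ it equals
$$
f'(0, d) = \lim_{t \to 0^+} \frac{\|t d\|_1 - 0}{t} = \|d\|_1 .
$$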

3.2 Incremental Minimization of Finite Sum Non-convex Functions

The objective function in (3.1.1) is composed of a finite sum of possibly non-smooth and non-convex functions. A popular approach here is to apply the MM method, which tackles (3.1.1) by alternating between two steps: (i) minimizing a surrogate function which upper bounds the original objective function; and (ii) updating the surrogate function to tighten the upper bound.
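As a concrete illustration of these two steps (a minimal sketch under our own assumptions, not code from the text), the loop below applies MM to a generic $L$-smooth objective using the standard quadratic majorizer; the handles f and grad_f are placeholders.

```python
import numpy as np

def mm_quadratic(grad_f, theta0, L, n_iters=100):
    """Generic MM loop with the quadratic upper bound
    f(t) + <grad_f(t), x - t> + (L/2) * ||x - t||^2 anchored at the current iterate t."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        # (i) minimize the current surrogate: its minimizer is a gradient step of size 1/L
        theta = theta - grad_f(theta) / L
        # (ii) the surrogate is then re-anchored (tightened) at the new iterate
    return theta
```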

As mentioned in the Introduction, the MISO method proposed by Mairal [2015b] is developed as an iterative scheme that only updates the surrogate functions partially at each iteration. Formally, for any $i \in \llbracket 1, n \rrbracket$, we consider a surrogate function $\widehat{\mathcal{L}}_i(\theta;\overline{\theta})$ which satisfies:

S3.1 For all $i \in \llbracket 1, n \rrbracket$ and $\overline{\theta} \in \Theta$, the function $\widehat{\mathcal{L}}_i(\theta;\overline{\theta})$ is convex w.r.t. $\theta$, and it holds
$$
\widehat{\mathcal{L}}_i(\theta;\overline{\theta}) \ge \mathcal{L}_i(\theta), \quad \forall\, \theta \in \Theta, \qquad (3.2.1)
$$
where the equality holds when $\theta = \overline{\theta}$.

S3.2 For any $\theta_i \in \Theta$, $i \in \llbracket 1, n \rrbracket$, and some $\epsilon > 0$, the difference function
$$
\widehat{e}(\theta;\{\theta_i\}_{i=1}^n) := \frac{1}{n}\sum_{i=1}^n \widehat{\mathcal{L}}_i(\theta;\theta_i) - \mathcal{L}(\theta)
$$
is defined for all $\theta \in \Theta$ and differentiable for all $\theta \in \overline{\Theta}$, where $\overline{\Theta} = \{\theta \in \mathbb{R}^d : \inf_{\theta' \in \Theta} \|\theta - \theta'\| < \epsilon\}$ is an $\epsilon$-neighborhood set of $\Theta$. Moreover, for some constant $L$, the gradient satisfies
$$
\|\nabla \widehat{e}(\theta;\{\theta_i\}_{i=1}^n)\|^2 \le 2 L\, \widehat{e}(\theta;\{\theta_i\}_{i=1}^n), \quad \forall\, \theta \in \Theta. \qquad (3.2.2)
$$

S3.1 is a common condition used for surrogate optimization, see [Mairal, 2015b, Section 2.3]. Meanwhile, S3.2 is satisfied when the difference function $\widehat{e}(\theta;\{\theta_i\}_{i=1}^n)$ is $L$-smooth for all $\theta \in \mathbb{R}^d$; this implication follows from [Razaviyayn et al., 2013, Proposition 1].
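For completeness, here is the standard one-line argument behind this implication (a sketch under the stated assumptions): since $\widehat{e}(\cdot;\{\theta_i\}_{i=1}^n)$ is non-negative by S3.1 and assumed $L$-smooth on $\mathbb{R}^d$, the descent lemma gives, for any $\theta$,
$$
0 \;\le\; \widehat{e}\Big(\theta - \tfrac{1}{L}\nabla \widehat{e}(\theta;\{\theta_i\}_{i=1}^n);\{\theta_i\}_{i=1}^n\Big) \;\le\; \widehat{e}(\theta;\{\theta_i\}_{i=1}^n) - \frac{1}{2L}\,\big\|\nabla \widehat{e}(\theta;\{\theta_i\}_{i=1}^n)\big\|^2,
$$
and rearranging yields (3.2.2).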

Algorithm 3.1 MISO method [Mairal, 2015b]

1: Input: initialization $\theta^{(0)}$.
2: Initialize the surrogate functions as $\mathcal{A}_i^0(\theta) := \widehat{\mathcal{L}}_i(\theta;\theta^{(0)})$, $i \in \llbracket 1, n \rrbracket$.
3: for $k = 0, 1, \ldots$ do
4:   Pick $i_k$ uniformly from $\llbracket 1, n \rrbracket$.
5:   Update $\mathcal{A}_i^{k+1}(\theta)$ as:
$$
\mathcal{A}_i^{k+1}(\theta) = \begin{cases} \widehat{\mathcal{L}}_i(\theta;\theta^{(k)}), & \text{if } i = i_k, \\ \mathcal{A}_i^{k}(\theta), & \text{otherwise.} \end{cases}
$$
6:   Set $\theta^{(k+1)} \in \arg\min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^n \mathcal{A}_i^{k+1}(\theta)$.
7: end for

The inequality (3.2.1) implies $\widehat{\mathcal{L}}_i(\theta;\overline{\theta}) \ge \mathcal{L}_i(\theta) > -\infty$ for any $\theta \in \Theta$. The MISO method is an incremental version of the MM method, as summarized in Algorithm 3.1. As seen in the pseudo code, the MISO method maintains an iteratively updated set of surrogate upper-bound functions $\{\mathcal{A}_i^k(\theta)\}_{i=1}^n$ and updates the iterate by minimizing the average of these surrogate functions.

In particular, only one out of the $n$ surrogate functions is updated at each iteration [cf. Line 5], and the sum function $\frac{1}{n}\sum_{i=1}^n \mathcal{A}_i^{k+1}(\theta)$ is designed to be 'easy to optimize'; for example, it can be a sum of quadratic functions. As such, the MISO method is suitable for large-scale optimization as the computation cost per iteration is independent of $n$.

Moreover, under S3.1 and S3.2, it was shown that the MISO method converges almost surely to a stationary point of (3.1.1) [Mairal, 2015b, Proposition 3.1].
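To make the per-iteration cost explicit, here is a minimal sketch (our own illustration, not the thesis code) of Algorithm 3.1 when each surrogate is the quadratic majorizer $\mathcal{L}_i(\overline{\theta}) + \langle \nabla \mathcal{L}_i(\overline{\theta}) \mid \theta - \overline{\theta}\rangle + \frac{L}{2}\|\theta - \overline{\theta}\|^2$ of an $L$-smooth component (cf. (3.2.12) below): each stored surrogate is summarized by its anchor point and gradient, and the average surrogate has a closed-form minimizer. The list grads of per-component gradient functions is a placeholder.

```python
import numpy as np

def miso_quadratic(grads, theta0, L, n_iters=1000, rng=None):
    """MISO (Algorithm 3.1) with quadratic surrogates; Theta = R^d (unconstrained)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(grads)
    theta = np.asarray(theta0, dtype=float)
    anchors = np.tile(theta, (n, 1))                 # theta_i anchoring surrogate i
    grad_tab = np.array([g(theta) for g in grads])   # grad L_i(theta_i)
    for _ in range(n_iters):
        ik = int(rng.integers(n))                    # Line 4: pick i_k uniformly
        anchors[ik] = theta                          # Line 5: refresh surrogate i_k
        grad_tab[ik] = grads[ik](theta)
        # Line 6: the average quadratic surrogate is minimized in closed form
        theta = anchors.mean(axis=0) - grad_tab.mean(axis=0) / L
    return theta
```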

We now consider the case where the surrogate functions $\widehat{\mathcal{L}}_i(\theta;\overline{\theta})$ are intractable. Let $\mathsf{Z}$ be a measurable set, $p_i : \mathsf{Z} \times \Theta \to \mathbb{R}_+$ a pdf, $r_i : \Theta \times \Theta \times \mathsf{Z} \to \mathbb{R}$ a measurable function, and $\mu_i$ a $\sigma$-finite measure. We consider surrogate functions satisfying S3.1 and S3.2 that can be expressed as an expectation:
$$
\widehat{\mathcal{L}}_i(\theta;\overline{\theta}) := \int_{\mathsf{Z}} r_i(\theta;\overline{\theta}, z_i)\, p_i(z_i;\overline{\theta})\, \mu_i(\mathrm{d}z_i), \quad \forall\, (\theta, \overline{\theta}) \in \Theta \times \Theta. \qquad (3.2.3)
$$
Plugging (3.2.3) into the MISO method is not feasible since the update step in Line 6 involves the minimization of an expectation. Several motivating examples of (3.1.1) are given later in Section 3.2.

We propose the Minimization by Incremental Stochastic Surrogate Optimization (MISSO) method, which replaces the expectation in (3.2.3) by a Monte Carlo integration and then optimizes (3.1.1) incrementally. Denote by $M \in \mathbb{N}$ the Monte Carlo batch size and let $z_{i,m} \in \mathsf{Z}$, $m = 1, \ldots, M$, be a set of samples for all $i \in \llbracket 1, n \rrbracket$. These samples can be drawn (Case 1) i.i.d. from the distribution $p_i(\cdot;\overline{\theta})$ or (Case 2) from a Markov chain with stationary distribution $p_i(\cdot;\overline{\theta})$; see Section 4.2.1 for illustrations. To this end, we define
$$
\widetilde{\mathcal{L}}_i(\theta;\overline{\theta},\{z_{i,m}\}_{m=1}^{M}) := \frac{1}{M}\sum_{m=1}^{M} r_i(\theta;\overline{\theta}, z_{i,m}) \qquad (3.2.4)
$$
and we summarize the proposed MISSO method in Algorithm 3.2. As seen, the procedure is similar to the MISO method but it involves two types of randomness. The first comes from the selection of $i_k$ in Line 5. The second is that a set of Monte-Carlo approximated functions $\widetilde{\mathcal{A}}_i^k(\theta)$ is used in lieu of $\mathcal{A}_i^k(\theta)$ when optimizing for the next iterate $\theta^{(k+1)}$. We now discuss two applications of the MISSO method.
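Before turning to the applications, the following minimal sketch (our own illustration, with hypothetical interfaces) shows how the pieces fit together: sample_z(i, theta, M) is assumed to draw M Monte-Carlo samples from $p_i(\cdot;\theta)$, r(i, theta, anchor, z) evaluates $r_i(\theta;\overline{\theta}, z)$, and the average surrogate is minimized with a generic numerical solver; in practice one would use surrogates with closed-form minimizers, as in Example 2 below.

```python
import numpy as np
from scipy.optimize import minimize

def misso(n, sample_z, r, theta0, batch_sizes, rng=None):
    """Sketch of the MISSO iteration: one surrogate per component, refreshed one at a time.
    The constraint theta in Theta is ignored here (unconstrained minimization)."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    anchors = [theta.copy() for _ in range(n)]                      # anchor of surrogate i
    samples = [sample_z(i, theta, batch_sizes[0]) for i in range(n)]
    for k in range(1, len(batch_sizes)):
        ik = int(rng.integers(n))                                   # pick the component to refresh
        anchors[ik] = theta.copy()
        samples[ik] = sample_z(ik, theta, batch_sizes[k])           # M_(k) samples from p_ik(.;theta)
        # minimize the average of the Monte-Carlo surrogates (3.2.4)
        def avg_surrogate(th):
            return np.mean([np.mean([r(i, th, anchors[i], z) for z in samples[i]])
                            for i in range(n)])
        theta = minimize(avg_surrogate, theta).x
    return theta
```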

Example 1: Maximum Likelihood Estimation for Latent Variable Models. Latent variable models [Bishop, 2006] are constructed by introducing unobserved (latent) variables which help explain the observed data. We consider $n$ independent observations $((y_i, z_i),\, i \in \llbracket 1, n \rrbracket)$, possibly non-identically distributed, where $y_i$ is observed and $z_i$ is latent. In this incomplete-data framework, define $\{f_i(z_i, y_i, \theta),\, \theta \in \Theta\}$ to be the complete-data likelihood models, i.e., the joint likelihood of the observations and latent variables. Let
$$
g_i(y_i, \theta) := \int_{\mathsf{Z}} f_i(z_i, y_i, \theta)\, \mu_i(\mathrm{d}z_i), \quad i \in \llbracket 1, n \rrbracket, \qquad (3.2.7)
$$
denote the incomplete-data likelihood, i.e., the marginal likelihood of the observations. For ease of notation, the dependence on the observations is made implicit. The maximum likelihood (ML) estimation problem takes $\mathcal{L}_i(\theta)$ to be the $i$th negated incomplete-data

Algorithm 3.2 MISSO method

1: Input: initialization $\theta^{(0)}$; a sequence of non-negative numbers $\{M_{(k)}\}_{k \ge 0}$.
2: For all $i \in \llbracket 1, n \rrbracket$, draw $M_{(0)}$ Monte-Carlo samples with the stationary distribution $p_i(\cdot;\theta^{(0)})$.
3: Initialize the surrogate functions as
$$
\widetilde{\mathcal{A}}_i^0(\theta) := \widetilde{\mathcal{L}}_i(\theta;\theta^{(0)},\{z_{i,m}^{(0)}\}_{m=1}^{M_{(0)}}), \quad i \in \llbracket 1, n \rrbracket. \qquad (3.2.5)
$$
4: for $k = 0, 1, \ldots$ do
5:   Pick a function index $i_k$ uniformly on $\llbracket 1, n \rrbracket$.
6:   Draw $M_{(k)}$ Monte-Carlo samples with the stationary distribution $p_{i_k}(\cdot;\theta^{(k)})$.
7:   Update the individual surrogate functions recursively as:
$$
\widetilde{\mathcal{A}}_i^{k+1}(\theta) = \begin{cases} \widetilde{\mathcal{L}}_i(\theta;\theta^{(k)},\{z_{i,m}^{(k)}\}_{m=1}^{M_{(k)}}), & \text{if } i = i_k, \\ \widetilde{\mathcal{A}}_i^{k}(\theta), & \text{otherwise.} \end{cases} \qquad (3.2.6)
$$
8:   Set $\theta^{(k+1)} \in \arg\min_{\theta \in \Theta} \widetilde{\mathcal{L}}^{(k+1)}(\theta) := \frac{1}{n}\sum_{i=1}^n \widetilde{\mathcal{A}}_i^{k+1}(\theta)$.
9: end for

log-likelihood $\mathcal{L}_i(\theta) := -\log g_i(y_i, \theta)$.

Assuming without loss of generality that $g_i(y_i, \theta) \neq 0$ for all $\theta \in \Theta$, we define by $p_i(z_i | y_i, \theta) := f_i(z_i, y_i, \theta)/g_i(y_i, \theta)$ the conditional distribution of the latent variable $z_i$ given the observation $y_i$. A surrogate function $\widehat{\mathcal{L}}_i(\theta;\overline{\theta})$ satisfying S3.1 can be obtained by writing $f_i(z_i, y_i, \theta) = \frac{f_i(z_i, y_i, \theta)}{p_i(z_i | y_i, \overline{\theta})}\, p_i(z_i | y_i, \overline{\theta})$ and applying the Jensen inequality:
$$
\widehat{\mathcal{L}}_i(\theta;\overline{\theta}) = \int_{\mathsf{Z}} \underbrace{\log\big( p_i(z_i | y_i, \overline{\theta}) / f_i(z_i, y_i, \theta) \big)}_{=\, r_i(\theta;\overline{\theta}, z_i)}\, p_i(z_i | y_i, \overline{\theta})\, \mu_i(\mathrm{d}z_i). \qquad (3.2.8)
$$
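To spell out the Jensen step (using the decomposition above), the concavity of the logarithm gives
$$
\mathcal{L}_i(\theta) = -\log \int_{\mathsf{Z}} \frac{f_i(z_i, y_i, \theta)}{p_i(z_i | y_i, \overline{\theta})}\, p_i(z_i | y_i, \overline{\theta})\, \mu_i(\mathrm{d}z_i) \;\le\; \int_{\mathsf{Z}} \log \frac{p_i(z_i | y_i, \overline{\theta})}{f_i(z_i, y_i, \theta)}\, p_i(z_i | y_i, \overline{\theta})\, \mu_i(\mathrm{d}z_i) = \widehat{\mathcal{L}}_i(\theta;\overline{\theta}),
$$
with equality at $\theta = \overline{\theta}$, since the ratio inside the logarithm is then constant and equal to $1/g_i(y_i, \overline{\theta})$.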

We note that S3.2 can also be verified for common distribution models. We can apply the MISSO method following the above specification of $r_i(\theta;\overline{\theta}, z_i)$ and $p_i(z_i | y_i, \overline{\theta})$.
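As a small illustration (our own sketch, with placeholder handles log_f for $\log f_i$ and log_post for $\log p_i(\cdot | y_i, \cdot)$), the Monte-Carlo surrogate (3.2.4) in this latent-variable case reduces, up to a term constant in $\theta$, to an averaged negated complete-data log-likelihood, much like an incremental Monte-Carlo EM step:

```python
import numpy as np

def r_latent(log_f, log_post, theta, anchor, z, y):
    """r_i(theta; anchor, z) = log p_i(z | y, anchor) - log f_i(z, y, theta), cf. (3.2.8)."""
    return log_post(z, y, anchor) - log_f(z, y, theta)

def surrogate_latent(log_f, log_post, theta, anchor, z_samples, y):
    """Monte-Carlo surrogate (3.2.4): average of r_i over samples drawn from p_i(. | y, anchor).
    Only the -log f_i term depends on theta, so minimizing in theta maximizes the
    averaged complete-data log-likelihood."""
    return float(np.mean([r_latent(log_f, log_post, theta, anchor, z, y) for z in z_samples]))
```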

Example 2: Variational Inference. Let $((x_i, y_i),\, i \in \llbracket 1, n \rrbracket)$ be i.i.d. input-output pairs and $w \in \mathsf{W} \subseteq \mathbb{R}^d$ be a latent variable. When conditioned on the input $x = (x_i,\, i \in \llbracket 1, n \rrbracket)$, the joint distribution of $y = (y_i,\, i \in \llbracket 1, n \rrbracket)$ and $w$ is given by:
$$
p(y, w | x) = \pi(w) \prod_{i=1}^n p(y_i | x_i, w). \qquad (3.2.9)
$$
Our goal is to compute the posterior distribution $p(w | y, x)$. In most cases, the posterior distribution $p(w | y, x)$ is intractable and is approximated using a family of parametric distributions $\{q(w;\theta),\, \theta \in \Theta\}$. The variational inference (VI) problem [Blei et al., 2017a] boils down to minimizing the KL divergence between $q(w;\theta)$ and the posterior distribution $p(w | y, x)$, as follows:
$$
\min_{\theta \in \Theta}\; \mathcal{L}(\theta) := \mathrm{KL}\big( q(w;\theta)\,\|\, p(w | y, x) \big) := \mathbb{E}_{q(w;\theta)}\big[ \log\big( q(w;\theta)/p(w | y, x) \big) \big]. \qquad (3.2.10)
$$
Using (3.2.9), we decompose $\mathcal{L}(\theta) = n^{-1}\sum_{i=1}^n \mathcal{L}_i(\theta) + \mathrm{const}$, where:

$$
\mathcal{L}_i(\theta) := -\mathbb{E}_{q(w;\theta)}\big[ \log p(y_i | x_i, w) \big] + \frac{1}{n}\,\mathbb{E}_{q(w;\theta)}\big[ \log\big( q(w;\theta)/\pi(w) \big) \big] = r_i(\theta) + d(\theta). \qquad (3.2.11)
$$
Directly optimizing the finite-sum objective function in (3.2.10) can be difficult. First, with $n \gg 1$, evaluating the objective function $\mathcal{L}(\theta)$ requires a full pass over the entire dataset. Second, for some complex models, the expectations in (3.2.11) can be intractable even if we assume a simple parametric model for $q(w;\theta)$. Assume that $\mathcal{L}_i$ is $L$-smooth, i.e., $\mathcal{L}_i$ is differentiable on $\Theta$ and its gradient $\nabla \mathcal{L}_i$ is $L$-Lipschitz. We apply the MISSO method with a quadratic surrogate function defined as:

$$
\widehat{\mathcal{L}}_i(\theta;\overline{\theta}) := \mathcal{L}_i(\overline{\theta}) + \big\langle \nabla_\theta \mathcal{L}_i(\overline{\theta}) \,\big|\, \theta - \overline{\theta} \big\rangle + \frac{L}{2}\,\|\theta - \overline{\theta}\|^2. \qquad (3.2.12)
$$
It is easily checked that $\widehat{\mathcal{L}}_i(\theta;\overline{\theta})$ satisfies S3.1 and S3.2. To compute the gradient $\nabla \mathcal{L}_i(\overline{\theta})$, we apply the re-parametrization technique suggested in [Blundell et al., 2015, Kingma and Welling, 2014, Paisley et al., 2012]. Let $t : \mathbb{R}^d \times \Theta \to \mathbb{R}^d$ be a function, differentiable w.r.t. $\theta \in \Theta$, designed such that the law of $w = t(z, \theta)$, where $z \sim \mathcal{N}_d(0, \mathrm{I})$, is $q(\cdot;\theta)$. By [Blundell et al., 2015, Proposition 1], the gradient of $-r_i(\cdot)$ in (3.2.11) is:
$$
\nabla_\theta\, \mathbb{E}_{q(w;\theta)}\big[ \log p(y_i | x_i, w) \big] = \mathbb{E}_{z \sim \mathcal{N}_d(0, \mathrm{I})}\big[ \mathrm{J}_t^{\theta}(z, \theta)\, \nabla_w \log p(y_i | x_i, w)\big|_{w = t(z, \theta)} \big], \qquad (3.2.13)
$$
where for each $z \in \mathbb{R}^d$, $\mathrm{J}_t^{\theta}(z, \theta)$ is the Jacobian of the function $t(z, \cdot)$ with respect to $\theta$, evaluated at $\theta$. In addition, in most cases the term $\nabla d(\theta)$ can be evaluated in closed form.

We thus define, for all $z \in \mathbb{R}^d$:
$$
r_i(\theta;\overline{\theta}, z) := \big\langle \nabla d(\overline{\theta}) - \mathrm{J}_t^{\theta}(z, \overline{\theta})\, \nabla_w \log p(y_i | x_i, w)\big|_{w = t(z, \overline{\theta})} \,\big|\, \theta - \overline{\theta} \big\rangle + \frac{L}{2}\,\|\theta - \overline{\theta}\|^2. \qquad (3.2.14)
$$
Finally, using (3.2.12) and (3.2.14), the surrogate function (3.2.4) is given by $\widetilde{\mathcal{L}}_i(\theta;\overline{\theta},\{z_m\}_{m=1}^{M}) := M^{-1}\sum_{m=1}^{M} r_i(\theta;\overline{\theta}, z_m)$, where $\{z_m\}_{m=1}^{M}$ is an i.i.d. sample from $\mathcal{N}_d(0, \mathrm{I})$.
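A minimal sketch of the reparametrized gradient estimate in (3.2.13), assuming a mean-field Gaussian variational family $q(\cdot;\theta) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ with $\theta = (\mu, \rho)$ and $\sigma = \log(1 + e^{\rho})$ as in Blundell et al. [2015]; the handle grad_w_loglik (the gradient of $\log p(y_i | x_i, w)$ in $w$) is a placeholder:

```python
import numpy as np

def reparam_grad(mu, rho, grad_w_loglik, M, rng=None):
    """Monte-Carlo estimate of -grad_theta E_q[log p(y_i|x_i, w)] with w = t(z, theta)."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.log1p(np.exp(rho))              # softplus link, so sigma > 0
    g_mu = np.zeros_like(mu)
    g_rho = np.zeros_like(rho)
    for _ in range(M):
        z = rng.standard_normal(mu.shape)      # z ~ N(0, I)
        w = mu + sigma * z                     # w = t(z, theta)
        g = grad_w_loglik(w)                   # grad_w log p(y_i | x_i, w) at w = t(z, theta)
        # chain rule through t: dw/dmu = I, dw/drho = z * sigmoid(rho)
        g_mu += g
        g_rho += g * z / (1.0 + np.exp(-rho))
    return -g_mu / M, -g_rho / M               # Monte-Carlo estimate of -grad_theta E_q[log p]
```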
