
1.3 Approximate Likelihood Inference via Laplace Approximation

In this manuscript we will limit the discussion of estimation methods to those based on the likelihood, hence the need to provide a characterization of this function. As in many latent variable problems, the expressions of the marginal PDF/PMF f_θ(y_ij) of the observed outcomes, which are at the basis of the likelihood function, are obtained by integrating the random effects out of the joint distribution of [y_ij, u_i]. Using ϕ to denote the density of the multivariate standard normal random vector u_i and assuming independence between observational units, the likelihood for model (1.1) is a product of multivariate integrals:

L(θ) = ∏_{i=1}^{n} ∫_{ℝ^q} ∏_{j=1}^{n_i} f_θ(y_ij | u_i) ϕ(u_i) du_i.   (1.9)

Moreover, taking advantage of the expression of the multivariate standard normal density and the structure of the exponential family, and defining the functions ℓ_i as:

ℓ_i(θ, u_i) := ℓ_i(u_i, θ; φ) = −(1/n_i) [ ∑_{j=1}^{n_i} log f_θ(y_ij | u_i) − (1/2) u_i^T u_i − (q/2) log(2π) ],

and using the conventions on integral notation, the product (1.9) is equal to either of the following integrals:

L(θ) = ∏_{i=1}^{n} ∫_{ℝ^q} exp[−n_i ℓ_i(θ, u_i)] du_i   (1.10)
     = ∫_{ℝ^{nq}} exp[−∑_{i=1}^{n} n_i ℓ_i(θ, u_i)] du.   (1.11)

Except for some particular models, e.g. the LMM of Section 1.2.1, these integrals are non-analytic and therefore need to be approximated numerically.

One such approximation results from the use of Gaussian Quadrature methods.

Rabe-Hesketh et al. [2002], for instance, show that the likelihood contributions can be written as products of univariate integrals by exploiting the independence of the standardized random effects. These integrals can then be approximated by weighted sums of the integrand evaluated at quadrature points determined by a Gauss-Hermite quadrature rule. Additional accuracy can be obtained with the Adaptive version of Gaussian Quadrature (AGQ), which consists in allocating the quadrature points, and hence more weight, to the regions of higher density. Many studies have shown the good properties of the inference based on these approximations as the number of quadrature points increases, see e.g. Rabe-Hesketh et al. [2002], Rabe-Hesketh et al. [2005], Rabe-Hesketh and Skrondal [2008], Pinheiro and Chao [2012], but the implementations are limited to models with simple random structures because of the excessive amount of computational resources required otherwise. Moreover, even very efficient implementations take a long time to return accurate estimates, see e.g. Huber et al. [2004], making them unappealing for repeated fits in the spirit of bootstrap inference.
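To make the quadrature idea concrete, the following sketch approximates a single likelihood contribution of the form (1.10) for a hypothetical random-intercept logistic model by adaptive Gauss-Hermite quadrature; the data, parameter values and function names are illustrative and are not taken from any of the implementations cited above. With a single quadrature point the rule reduces to the Laplace approximation discussed below.

# Sketch: one likelihood contribution of a random-intercept logistic GLMM,
# approximated by adaptive Gauss-Hermite quadrature (AGQ).  All inputs
# (y_i, eta_i, sigma_u) are illustrative.
import numpy as np
from scipy.special import expit
from scipy.optimize import minimize_scalar

y_i     = np.array([1, 0, 1, 1, 0])                  # outcomes of unit i
eta_i   = np.array([0.2, -0.5, 1.0, 0.3, -0.1])      # fixed parts x_ij^T beta
sigma_u = 0.8                                        # random-intercept standard deviation

def log_integrand(u):
    """log[ f(y_i | u) * phi(u) ]: Bernoulli log-likelihood plus the N(0,1) log-density."""
    mu = np.clip(expit(eta_i + sigma_u * u), 1e-12, 1 - 1e-12)
    return (np.sum(y_i * np.log(mu) + (1 - y_i) * np.log(1 - mu))
            - 0.5 * u**2 - 0.5 * np.log(2 * np.pi))

def contribution_agq(n_points=10):
    # centre and scale of the adaptive rule: mode and curvature of the integrand
    u_hat = minimize_scalar(lambda u: -log_integrand(u)).x
    eps = 1e-5
    curv = -(log_integrand(u_hat + eps) - 2 * log_integrand(u_hat)
             + log_integrand(u_hat - eps)) / eps**2
    s_hat = 1.0 / np.sqrt(curv)
    # Gauss-Hermite nodes/weights, recentred at the mode and rescaled
    x, w = np.polynomial.hermite.hermgauss(n_points)
    vals = np.array([np.exp(log_integrand(u_hat + np.sqrt(2) * s_hat * t) + t**2) for t in x])
    return np.sqrt(2) * s_hat * np.sum(w * vals)

print(contribution_agq(1), contribution_agq(15))     # 1 point ~ Laplace; more points refine it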

Another possibility is to consider a Quasi-Likelihood approach. Following Green [1987], Breslow and Clayton [1993] highlight the fact that the exponents n_i ℓ_i(θ, u_i) in (1.10) can be written as the sum of the conditional log-likelihood contributions and a penalty term on the Euclidean norm of u_i. Hence, they propose replacing the first term in this sum with a characterization of the relationship between the conditional expectation µ_ij and the variance v(µ_ij), by means of the derivative of the integrated quasi-likelihood function d_ij(y_ij, µ_ij) := −2 ∫_{y_ij}^{µ_ij} [y_ij − x]/v(x) dx, yielding a Penalized Quasi-Likelihood (PQL) objective function. In its classic implementation, the PQL is optimized sequentially with respect to u_i and β, yielding values of the linear predictor and of the conditional expectation evaluated at the optima, η̂_ij and µ̂_ij. On the basis of these quantities, the estimation problem can be written as an LMM y_ij^w = x_ij^T β + z_ij^T D_σ u_i + ϵ_ij for a working response y_ij^w = g(µ̂_ij) + (y_ij − µ̂_ij) g'(µ̂_ij) ≈ g(y_ij), with ϵ_ij ∼ N(0, w_ij) and w_ij = v(µ̂_ij)[g'(µ̂_ij)]^2, which can then be fit with appropriate methods for LMM, known for being less computationally intensive and easier to implement. However, in spite of this convenience, the resulting estimates have been shown to present systematic biases, especially for the variance component parameters and in the presence of very discrete outcomes, see e.g. Breslow and Lin [1995], Jang and Lim [2006]. Owing to this inconsistency, we shall not base our proposals on this method.
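As an illustration of the working-response construction just described, the short sketch below computes y_ij^w and w_ij for a logistic link from hypothetical fitted values µ̂_ij; all names and numbers are illustrative, and the iterative PQL fit itself is not reproduced here.

# Sketch: PQL working response and working variances for the logit link,
# g(mu) = log(mu / (1 - mu)), so that g'(mu) = 1 / (mu (1 - mu)) and v(mu) = mu (1 - mu).
# The outcomes y and fitted values mu_hat below are illustrative.
import numpy as np

def working_response_logit(y, mu_hat):
    g_prime = 1.0 / (mu_hat * (1.0 - mu_hat))
    y_w = np.log(mu_hat / (1.0 - mu_hat)) + (y - mu_hat) * g_prime   # g(mu_hat) + (y - mu_hat) g'(mu_hat)
    w   = mu_hat * (1.0 - mu_hat) * g_prime**2                       # v(mu_hat) [g'(mu_hat)]^2
    return y_w, w

y      = np.array([1, 0, 1, 1, 0])
mu_hat = np.array([0.7, 0.4, 0.6, 0.8, 0.3])
y_w, w = working_response_logit(y, mu_hat)
# y_w is then treated as the response of an LMM with error variances w; the cycle
# (fit the LMM, update mu_hat, recompute y_w and w) is iterated until convergence.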

A widespread alternative consists in applying the Laplace Approximation Method for Integrals (see e.g. De Bruijn 1970, Barndorff-Nielsen and Cox 1989, Ch. 6), which is a popular way of approximating multiple integrals of the form:

I(α) = ∫_D exp[−α h(v)] dv,

where α is a large positive parameter and the function h attains a unique minimum at an interior point ṽ of the integration region D. In such a situation, the integral can be approximated by the value of the integrand around ṽ, a procedure that can be seen as applying AGQ after choosing only one quadrature point. In order to characterize the approximation of the likelihood of GLMM with this method, we shall use the notational conventions of Magnus et al. [1995] and Raudenbush et al. [2000]. Specifically, let ℓ_i^(k)(θ, u_i) = ∂_{u_i}^(k) ℓ_i(θ, u_i) = ∂ vec[ℓ_i^(k−1)(θ, u_i)]/∂u_i^T denote the Partial Derivative, or Jacobian, of k-th order of the function ℓ_i with respect to u_i. Under this convention, the Gradient of ℓ_i with respect to u_i is the transpose of the first Jacobian, ∇_{u_i} ℓ_i(θ, u_i) = [ℓ_i^(1)(θ, u_i)]^T, and a multivariate Taylor Series expansion of ℓ_i around a point u_0 can be written as follows:

ℓ_i(θ, u_i) = ℓ_i(θ, u_0) + ℓ_i^(1)(θ, u_0) [u_i − u_0] + (1/2) [u_i − u_0]^T ℓ_i^(2)(θ, u_0) [u_i − u_0] + ∑_{k≥3} T_ik,   (1.12)

with the following characterization for the terms of order k ≥ 3:

T_ik = T_ik(θ, u_0) := (1/k!) [⊗^{k−1} (u_i − u_0)^T] ℓ_i^(k)(θ, u_0) (u_i − u_0),   (1.13)

where ⊗v^k = v ⊗ v ⊗ ··· ⊗ v (k times) represents a k-fold Kronecker product of a vector v. Writing ũ_i for the minimizer of ℓ_i, sometimes called the Mode of the joint PMF/PDF of [y_i^T, u_i^T]^T, i.e. ũ_i := ũ_i(θ) = argmax_{u_i} [−n_i ℓ_i(θ, u_i)], and carrying out the expansion around this value, the second term of the expansion (1.12) vanishes, while the quadratic form in the third term recalls the exponent in the density of a normal random vector, yielding the following characterization after exponentiation of the series and subsequent integration:

L_i(θ) = (2π)^{q/2} |V_i(θ)|^{1/2} exp[−n_i ℓ̃_i(θ)] exp[ε_i(θ)],   (1.14)

where ℓ̃_i^(k)(θ) := ℓ_i^(k)(θ, ũ_i; φ), V_i(θ) := [n_i ℓ̃_i^(2)(θ)]^{−1}, R_i := −n_i ∑_{k≥3} T̃_ik with T̃_ik = T_ik(θ, ũ_i), and ε_i(θ) := log E[exp(R_i)], the expectation being taken over the density of a N[ũ_i, V_i(θ)] random vector. With these considerations, the Laplace-approximated contributions can be formulated as follows:

log L_i(θ) = (q/2) log(2π) + (1/2) log|V_i(θ)| − n_i ℓ̃_i(θ) + ε_i(θ),   (1.15)

which, after neglecting the approximation error in the contributions, yields the Laplace-approximated log-Likelihood (LALL):

log L̃(θ) = ∑_{i=1}^{n} [ (q/2) log(2π) + (1/2) log|V_i(θ)| − n_i ℓ̃_i(θ) ].   (1.16)

This function can then be optimized with respect to θ, in the spirit of the Maximum Likelihood (ML) approach, to obtain what we shall call the Laplace-approximated Maximum Likelihood Estimators (LAMLE) of the model parameters. Hence, this strategy entails the following two-step procedure:

• Step 1: Optimization of ℓ_i(θ, u_i) with θ fixed at its current value θ̂, to obtain the modes ũ_i(θ̂).

• Step 2: Optimization of log L̃(θ) to update the values of the estimates.

In actual implementations, these two steps can be performed as separate routines to accelerate estimation. Moreover, it is of course possible to improve the approximation (and therefore the inference) by taking into account higher-order terms of the Taylor expansion of ℓ_i(θ, u_i) in equation (1.12), yielding a variety of higher-order approximations, see e.g. Lindley [1980], Liu and Pierce [1993], Raudenbush et al. [2000]; yet most modern implementations rely on the first-order approximation for computational simplicity.
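For concreteness, the sketch below carries out the two-step procedure for a hypothetical random-intercept logistic model: Step 1 is nested inside the objective function, and Step 2 is a generic optimizer over θ. This naive version (simulated data, finite-difference curvature, Nelder-Mead search) is only meant to illustrate the structure of the LALL in (1.16); it does not reproduce the numerics of the implementations cited in this chapter.

# Sketch of the two-step, first-order Laplace-approximated ML procedure (eps_i
# neglected) for a random-intercept logistic GLMM with q = 1.  All data and
# names are illustrative.
import numpy as np
from scipy.special import expit
from scipy.optimize import minimize, minimize_scalar

rng = np.random.default_rng(0)
n, m = 30, 8                                        # units and observations per unit
x = rng.normal(size=(n, m))
y = rng.binomial(1, expit(0.5 + x + 0.7 * rng.normal(size=(n, 1))))

def nll_joint_i(u, theta, i):
    """n_i * l_i(theta, u): minus conditional log-lik of unit i minus the log N(0,1) density of u."""
    beta0, beta1, log_sigma = theta
    mu = np.clip(expit(beta0 + beta1 * x[i] + np.exp(log_sigma) * u), 1e-12, 1 - 1e-12)
    cond = np.sum(y[i] * np.log(mu) + (1 - y[i]) * np.log(1 - mu))
    return -(cond - 0.5 * u**2 - 0.5 * np.log(2 * np.pi))

def negative_lall(theta):
    total = 0.0
    for i in range(n):
        # Step 1: mode u_tilde_i of the joint density of unit i at the current theta
        opt = minimize_scalar(nll_joint_i, args=(theta, i))
        u_hat, f_hat = opt.x, opt.fun
        eps = 1e-5
        hess = (nll_joint_i(u_hat + eps, theta, i) - 2 * f_hat
                + nll_joint_i(u_hat - eps, theta, i)) / eps**2      # n_i * l_i^(2) at the mode
        # contribution (1.15) without eps_i, for q = 1
        total += 0.5 * np.log(2 * np.pi) - 0.5 * np.log(hess) - f_hat
    return -total

# Step 2: optimize the LALL over theta = (beta0, beta1, log sigma_u)
fit = minimize(negative_lall, x0=np.zeros(3), method="Nelder-Mead")
print(fit.x)                                        # LAMLE of the illustrative model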

It is important to point out that the asymptotics of the procedure have long been assessed, both theoretically, see e.g. Shun and McCullagh [1995], and via simulations, see e.g. Joe [2008]. For instance, Douc et al. [2004] show that the estimates obtained on the grounds of an approximate log-likelihood such as (1.16) have the correct asymptotic distribution as long as the approximation error ε_n(θ) := ∑_{i=1}^{n} ε_i(θ) converges to zero in probability. Other works, such as a recent preprint by Ogden [2016], show that this inference can have first-order accuracy under conditions on the absolute error of the Score functions, which in the case of the LALL is given by the uniform norm of the gradient of the approximation error, i.e. δ_n := sup_{θ∈Θ} ‖∇_θ ε_n(θ)‖. More specifically, Ogden shows that when δ_n = o_p(r_n), among other conditions, the LAMLE converge in distribution to that of the MLE at a rate r_n^{1/2} which depends on the conditional distribution of the outcome [Ogden, 2016, Theorem 2].

The only case where the likelihood (1.9) has a closed-form expression is when the outcome is Gaussian, i.e. in the context of LMM. To see this, rewrite model (1.5) as y_i = X_i β + ε_i, where the ε_i = Z_i D_σ u_i + ϵ_i are drawn independently from a N_{n_i}(0, Σ_i) distribution with Σ_i = φ I_{n_i} + Z_i D_σ D_σ^T Z_i^T, as a consequence of multivariate normality and of the independence between the vectors ϵ_i and u_i. On the basis of this consideration, the literature on LMM proposes two competing likelihood-based estimation methods, namely the Maximum Likelihood (ML) and the Residual or Restricted Maximum Likelihood (REML) approaches. While the ML estimates can be obtained directly by optimizing the closed-form likelihood with respect to the model parameters using a gradient-based algorithm, it is possible to obtain the same estimates with the LALL, since the Laplace approximation is exact in the context of Gaussian responses. To illustrate this point, let us define ρ^2 from the terms in the exponential of the integrand in equation (1.11), as in:

ℓ(θ, u) ∝ ∑_{i=1}^{n} [ ‖y_i − X_i β − Z_i D_σ u_i‖_2^2 + φ ‖u_i‖_2^2 ] = ρ^2(u, β; φ, σ).   (1.17)

It is straightforward to see that the optimization with respect to u and β, in the spirit of the Laplace method, implies the optimization of ρ, an operation that is, in the words of Bates [2010], a Penalized Least Squares problem yielding Henderson's Estimating Equations [Henderson, 1950] for fixed σ and φ. This procedure is at the core of implementations such as the R packages nlme [Pinheiro and Bates, 2009] and lme4 [Bates et al., 2015].
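As a rough illustration of this last point, the following sketch solves the penalized least squares problem (1.17) for fixed σ and φ as a single augmented linear system, one of several equivalent ways of writing Henderson's estimating equations; the designs, dimensions and parameter values are invented for the example, and the sparse, profiled formulations used by nlme and lme4 are considerably more refined.

# Sketch: joint minimization of rho^2(u, beta; phi, sigma) in (1.17) for fixed
# sigma and phi, written as one ridge-type augmented system (a dense version of
# Henderson's estimating equations).  All inputs are illustrative.
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(1)
n, n_i, p, q = 5, 6, 3, 2                             # units, obs./unit, fixed effects, RE per unit
X = rng.normal(size=(n * n_i, p))                     # stacked fixed-effects design
Z_blocks = [rng.normal(size=(n_i, q)) for _ in range(n)]
D_sigma = np.diag([0.9, 0.5])                         # factor of the random-effects covariance
phi = 1.3                                             # residual variance
y = rng.normal(size=n * n_i)

Zt = block_diag(*[Z_i @ D_sigma for Z_i in Z_blocks]) # stacked blocks Z_i D_sigma
A = np.block([[X.T @ X,  X.T @ Zt],
              [Zt.T @ X, Zt.T @ Zt + phi * np.eye(n * q)]])   # penalty phi * ||u||^2 acts only on u
b = np.concatenate([X.T @ y, Zt.T @ y])
sol = np.linalg.solve(A, b)
beta_hat, u_hat = sol[:p], sol[p:]                    # joint minimizer of rho^2 for this (sigma, phi)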