
HAL Id: hal-01113421

https://hal.archives-ouvertes.fr/hal-01113421

Preprint submitted on 5 Feb 2015


To cite this version:

Peter Green, Krzysztof Łatuszyński, Marcelo Pereyra, Christian Robert. Bayesian computation: a perspective on the current state, and sampling backwards and forwards. 2015. ⟨hal-01113421⟩

Bayesian computation: a perspective on the current state, and sampling backwards and forwards

Peter J. Green · Krzysztof Łatuszyński · Marcelo Pereyra · Christian P. Robert

Abstract The past decades have seen enormous improvements in computational inference based on statistical models, with continual enhancement of a wide range of competing computational tools. In Bayesian inference, first and foremost, MCMC techniques continue to evolve, moving from random walk proposals to Langevin drift, to Hamiltonian Monte Carlo, and so on, with both theoretical and algorithmic inputs opening wider access to practitioners. However, this impressive evolution in capacity is confronted by an even steeper increase in the complexity of the models and datasets to be addressed. The difficulties of modelling, and then handling, ever more complex datasets most likely call for a new type of tool for computational inference that dramatically reduces the dimension and size of the raw data while capturing its essential aspects. Approximate models and algorithms may thus be at the core of the next computational revolution.

Keywords Bayesian analysis · MCMC algorithms · ABC techniques · optimisation

Supported in part by “SuSTaIn”, EPSRC grant EP/D063485/1, at the University of Bristol, and “i-like”, EPSRC grant EP/K014463/1, at the University of Warwick. Krzysztof Łatuszyński holds a Royal Society University Research Fellowship, and Marcelo Pereyra a Marie Curie Intra-European Fellowship for Career Development. Peter Green also holds a Distinguished Professorship at UTS, Sydney, and Christian Robert a chair at Ceremade, Université Paris-Dauphine.

Peter Green and Marcelo Pereyra
School of Mathematics, University of Bristol
E-mail: P.J.Green, Marcelo.Pereyra@bristol.ac.uk

Krzysztof Łatuszyński and Christian P. Robert
Dept. of Statistics, University of Warwick
E-mail: K.G.Latuszynski@warwick.ac.uk, robert@ensae.fr

1 Introduction

One may reasonably balk at the terms “computational statistics” and “Bayesian computation” since, from its very start, statistics has always involved some computational step to extract information, something manageable like an estimator or a prediction, from raw data. This incomplete review of the recent past, current state, and immediate future of MCMC and related algorithms thus first requires us to explain what we mean by computation in a statistical context, before turning to what we perceive as medium-term solutions and possible dead ends.

Computations are an issue in statistics whenever processing a dataset becomes a difficulty, a liability, or even an impossibility. Obviously, the computational challenge varies according to the time when it is faced: what was an issue in the 19th century is most likely not one any longer (take for instance the derivation of the moment estimates of a mixture of two normal distributions so painstakingly set by Pearson (1894) for estimating the ratio of “forehead” breadth to body length on a dataset of 1,000 crabs, or the intense algebraic derivations found in the analysis of variance of the 1950’s and 1960’s (Searle et al. 1992)).

The introduction of simulation tools in the 1940’s followed hard on the heels of the invention of the computer and certainly contributed an impetus towards faster and better computers, at least in the first decade of this revolution. This shows that these tools were both needed, and unavailable without electronic calculators. The introduction of Markov chain Monte Carlo is harder to pin down as some partial versions can be traced all the way back to 1944–45 and the Manhattan project at Los Alamos (Metropolis 1987). It is surprisingly much later, i.e., only by the early 1990’s, that such methods became part of the Bayesian toolbox, that is, some time after the devising of other computer-dependent tools like the bootstrap or the EM algorithm, and despite the availability of personal computers that considerably eased programming and experimenting (Robert and Casella 2010). It is presumably pointless to try to attribute this delay to a definite cause, but a certain lack of probabilistic culture within the statistics community is probably partly to blame.

What makes this time-lag in MCMC methods becoming assimilated into the statistics community even more surprising is the fact that Bayesian inference having a significant role in statistical practice was really on hold pending the discovery of flexible computational tools that (implicitly or explicitly) delivered values for the medium- to high-dimensional integrals that underpin the calculation of posterior distributions, in all but toy problems where conjugacy provided explicit answers. In fact, until Bayesians discovered MCMC, the only computational methodology that seemed to offer much chance of making Bayesian statistics practical was the portfolio of quadrature methods developed under Adrian Smith’s leadership at Nottingham (Naylor and Smith 1982; Smith et al. 1985, 1987). The very first article in the first issue of Statistics and Computing, whose quarter-century we celebrate in this special issue, was (to the journal’s credit!) on Bayesian analysis, and was precisely in this direction of using clever quadrature methods to approach moderately high-dimensional posterior analysis (Dellaportas and Wright 1991). By the next (second) issue, sampling-based methods had started to appear, with three papers out of five in the issue on or related to Gibbs sampling (Verdinelli and Wasserman 1991; Carlin and Gelfand 1991; Wakefield et al. 1991).

Now, reflecting upon the evolution of MCMC methods over the 25 or so years they have been at the forefront of Bayesian inference, the focus has evolved a long way, from the hierarchical models that extended the linear, mixed and generalised linear models (Albert 1988; Carlin et al. 1992; Bennett et al. 1996) which were initially the focus, and graphical models that stemmed from image analysis (Geman and Geman 1984) and artificial intelligence, to dynamical models driven by ODE’s (Wilkinson 2011b) and diffusions (Roberts and Stramer 2001; Dellaportas et al. 2004; Beskos et al. 2006), hidden trees (Larget and Simon 1999; Huelsenbeck and Ronquist 2001; Chipman et al. 2008; Aldous et al. 2008) and graphs, as well as decision making in highly complex graphical models. While research on MCMC theory and methodology is still active and continually branching (Papaspiliopoulos et al. 2007; Andrieu and Roberts 2009; Łatuszyński et al. 2011; Douc and Robert 2011), progressively incorporating the capacities of parallel processors and GPUs (Lee et al. 2009; Jacob et al. 2011; Strid 2010; Suchard et al. 2010; Scott et al. 2013; Calderhead 2014), we wonder if we are not currently facing a new era where those methods are no longer appropriate to undertake the analysis of new models, and of new formulations where models are no longer completely defined. We indeed believe that imprecise models, incomplete information and summarised data will become, if not already, a central aspect of statistical analysis, due to the massive influx of data and the need to provide non-statisticians with efficient tools. This is why we also cover in this survey the notion of approximate Bayesian computation (ABC) and comment on the use of optimisation tools.

The plan of the paper is that in Sections 2 and 3 we discuss recent progress and current issues in Markov chain Monte Carlo and approximate Bayesian computation respectively. In Section 4, we highlight some areas of modern optimisation that, through lack of familiarity, are making less impact in the mainstream of Bayesian computation than we think justified. Our Discussion in Section 5 raises issues about data science and relevance to applications, and looks to the future.

2 MCMC, approximate simulations from an exact target

When MCMC techniques were introduced to the mainstream statistical (Bayesian) community in 1990, they were received with skepticism that they could one day become the central tool of Bayesian inference. For instance, despite the insurance provided by the ergodic theorem, many researchers thought at first that the convergence of those algorithms was a mere theoretical anticipation rather than a practical reality, in contrast to traditional Monte Carlo methods, and hence that they could not be trusted to provide “exact” answers. This perspective is obviously obsolete by now, when MCMC output is considered as “exact” as regular Monte Carlo, if possibly less efficient in some settings. Nowadays, MCMC is again attracting more attention (than in the past decade, say, where developments were more about alternatives, some of which are described in the following sections), both because of methodological developments linked to better theoretical tools, for instance in the handling of stochastic processes, and because of new advances in accelerated computing via parallel and cloud computing.


2.1 Basics on MCMC

The introduction of Markov chain based methods within Monte Carlo thus took a certain degree of argument to reach the mainstream statistical community, when compared with other groups who were using MCMC methods 10 to 30 years earlier. It may sound unlikely at the current stage of our knowledge, but using methods that (a) generated correlated output, (b) required some burn-in time to remove the impact of the initial distribution and (c) did not lead to a closed-form expression for asymptotic variances were indeed met with resistance at first. As often, the immense computing advantages offered by this versatile tool soon overcame the reluctance to accept those methods as similarly “exact” as other Monte Carlo techniques, applications driving the move from the early 1990’s. We reproduce below the generic version of the “all variables at once” Metropolis–Hastings algorithm (Metropolis et al. 1953; Hastings 1970; Besag et al. 1995; Robert and Casella 2011) as it (still) constitutes in our opinion a fundamental advance in computational statistics, namely that, given a computable density π (up to a normalising constant) and a proposal Markov kernel q(·|·), there exists a universal machine that returns a Markov chain with the proper stationary distribution, hence an associated operational MCMC algorithm.

Algorithm 1 Metropolis–Hastings algorithm (generic version)

Choose a starting value θ(0)
for t = 1 to N do
    Generate θ∗ from a proposal q(·|θ(t−1))
    Compute the acceptance probability
        ρ(t) = 1 ∧ [π(θ∗) q(θ(t−1)|θ∗)] / [π(θ(t−1)) q(θ∗|θ(t−1))]
    Generate ut ∼ U(0, 1) and take θ(t) = θ∗ if ut ≤ ρ(t), θ(t) = θ(t−1) otherwise
end for
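To fix ideas, here is a minimal runnable sketch of Algorithm 1 in Python (our illustration, not code from the paper), for the common special case of a symmetric random walk proposal, q(·|θ) = N(θ, s²I), in which the q-ratio cancels from ρ(t); `log_target` is an assumed user-supplied function returning log π up to an additive constant.

import numpy as np

def metropolis_hastings(log_target, theta0, n_iter=10_000, scale=1.0, rng=None):
    """Random walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = np.random.default_rng() if rng is None else rng
    d = len(theta0)
    theta, log_pi = np.asarray(theta0, float), log_target(theta0)
    chain = np.empty((n_iter, d))
    for t in range(n_iter):
        prop = theta + scale * rng.standard_normal(d)   # θ* ∼ q(·|θ(t−1))
        log_pi_prop = log_target(prop)
        # log ρ(t); the symmetric proposal cancels in the ratio
        if np.log(rng.uniform()) <= log_pi_prop - log_pi:
            theta, log_pi = prop, log_pi_prop           # accept θ*
        chain[t] = theta                                # else keep θ(t−1)
    return chain

# usage, e.g. for a standard bivariate normal target:
# chain = metropolis_hastings(lambda th: -0.5 * np.sum(th**2), np.zeros(2))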

The first observation about the Metropolis–Hastings algorithm is that the flexibility in choosing q is a blessing, but also a curse since the choice determines the performance of the algorithm. Hence a large part of the research on MCMC over the past 30 years (if we arbitrarily set the starting date at Geman and Geman (1984)) has been on the choice of the proposal q to improve the efficiency of the algorithm, and on characterising its convergence properties. This typically requires gathering or computing additional information about π and we discuss some of the fundamental strategies in subsequent sections. Algorithm 1, and its variants in which variables are updated singly or in blocks according to some schedule, remains a keystone in standard use of MCMC methodology, even though the newer Hamiltonian Monte Carlo approach (see Section 2.3) may sooner or later come to replace it. While there is nothing intrinsically unique to the nature of this algorithm, or optimal in its convergence properties (other than the result of Peskun (1973) on the optimality of the acceptance ratio), attempts to bypass Metropolis–Hastings are few and limited. For instance, the birth-and-death process developed by Stephens (2000) used a continuous-time jump process to explore a set of models, only to be later shown (Cappé et al. 2002) to be equivalent to the (Metropolis–Hastings) reversible jump approach of Green (1995).

Another aspect of the generic Metropolis–Hastings algorithm that became central more recently is that while the accept–reject step does overcome the need to know the normalising constant, it still requires π, unnormalised, and this may be too expensive to compute or even intractable for complicated models and large datasets. Much recent research effort has been devoted to the design and understanding of appropriate modifications that use estimators or approximations of π instead, and we take the opportunity to summarise some of the progress in this direction.

2.2 MALA and Manifold MALA

Stochastic differential equations (SDEs) have been and still are informing Monte Carlo development in a number of seminal ways. A key insight is that the Langevin diffusion solving

dXt = (1/2) ∇ log π(Xt) dt + dBt    (1)

has π as its stationary and limiting distribution. Here Bt is the standard Brownian motion and ∇ denotes gradient. The crude approach of sampling an Euler discretisation (Kloeden and Platen (1992)) of (1) and using it as an approximate sample from π was introduced in the applied literature (Ermak (1975); Doll and Dion (1976)). The method results in a Markov chain evolving according to the dynamics

Xn+1 | Xn = x ∼ Q(x, ·) := x + (h/2) ∇ log π(x) + h^{1/2} N(0, I_{d×d}),    (2)

for a chosen discretisation step h. There is a delicate tradeoff between accuracy of the approximation, improving as h → 0, and sampling efficiency (as measured e.g. by the effective sample size), improving when h increases. It was soon followed by its Metropolised version (Rossky et al. (1978)) that uses the Euler approximation (2) to produce a proposal in the Metropolis–Hastings algorithm 1, by letting

q(·|θ(t−1)) := θ(t−1) + (h/2) ∇ log π(θ(t−1)) + h^{1/2} N(0, I_{d×d}).

While in the probability community Langevin diffusions and their equilibrium distributions had also been around for some time (Kent (1978)), it was the Roberts and Tweedie (1996a) paper that brought the approach to the centre of interest of the computational statistics community and sparked a systematic study, development and applications of Metropolis adjusted Langevin algorithms (hence MALA) and their cousins.
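A minimal sketch of one MALA transition in Python (ours; `log_target` and `grad_log_target` are assumed user-supplied functions returning log π and ∇ log π) makes the asymmetry of the Gaussian proposal, and hence the full Metropolis–Hastings correction, explicit:

import numpy as np

def mala_step(theta, log_target, grad_log_target, h, rng):
    """One Metropolis-adjusted Langevin step with discretisation step h."""
    d = len(theta)
    mean_fwd = theta + 0.5 * h * grad_log_target(theta)
    prop = mean_fwd + np.sqrt(h) * rng.standard_normal(d)
    mean_bwd = prop + 0.5 * h * grad_log_target(prop)
    # log q(θ|θ*) − log q(θ*|θ): the drift makes the kernel asymmetric
    log_q_ratio = (-np.sum((theta - mean_bwd) ** 2)
                   + np.sum((prop - mean_fwd) ** 2)) / (2 * h)
    log_alpha = log_target(prop) - log_target(theta) + log_q_ratio
    if np.log(rng.uniform()) <= log_alpha:
        return prop        # accept the Langevin proposal
    return theta           # reject and stay put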

There is a large body of empirical evidence that, at the extra price of computing the gradient, MALA algorithms typically provide a substantial speed-up in convergence on certain types of problems. However, for very light-tailed distributions the drift term may grow to infinity and cause additional instability. More precisely, for distributions with sufficiently smooth contours, MALA is geometrically ergodic (c.f. Roberts and Rosenthal (2004)) if the tails of π decay as exp{−|θ|^β} with β ∈ [1, 2], while the random walk Metropolis algorithm is geometrically ergodic for all β ≥ 1 (Roberts and Tweedie (1996a); Mengersen and Tweedie (1996)). This lack of geometric ergodicity has been precisely quantified by Bou-Rabee and Hairer (2012).

Various refinements and extensions have been proposed. These include optimal scaling and choice of the discretisation step h, adaptive versions (both discussed in Section 2.4), combinations with proximal operators (Pereyra 2013; Schreck et al. 2013), and applications and algorithm development for the infinite-dimensional context (Pillai et al. 2012; Cotter et al. 2013). One particular direction of active research considers a more general version of equation (1) with state-dependent drift and diffusion coefficient,

dXt = [ (σ(Xt)/2) ∇ log π(Xt) + γ(Xt)/2 ] dt + √σ(Xt) dBt ,    (3)

γ_i(Xt) = Σ_j ∂σ_ij(Xt)/∂X_j ,

which also has π as invariant distribution (Xifara et al. (2014), c.f. Kent (1978)). The resulting proposals are

q(·|θ(t−1)) := θ(t−1) + (h/2) [ σ(θ(t−1)) ∇ log π(θ(t−1)) + γ(θ(t−1)) ] + h^{1/2} N(0, σ(θ(t−1))).

Choosing an appropriate σ for improved ergodicity is however nontrivial. The idea has been explored in Stramer and Tweedie (1999a,b); Roberts and Stramer (2002) and, more recently, Girolami and Calderhead (2010) introduced a mathematically-coherent approach relating σ to a metric tensor on a Riemannian manifold of probability distributions. The resulting algorithms are termed Manifold MALA (MMALA), Simplified MMALA (Girolami and Calderhead 2010), and position-dependent MALA (PMALA) (Xifara et al. 2014), and differ in implementation cost, depending on how precise a use they make of versions of equation (3). The approach still leaves the specification of the metric to be used in the space of probability distributions to the user; however, there are some natural choices. For example, building the metric tensor from the spectrally-positive version of the Hessian of π and randomising the discretisation step size h results in an algorithm that is as robust as random walk Metropolis, in the sense that it is geometrically ergodic for targets with tail decay of exp{−|θ|^β} for β > 1 (see Wolny (2014)).

2.3 Hamiltonian Monte Carlo

As with many improvements in the literature, starting with the very notion of MCMC, Hamiltonian (or hybrid) Monte Carlo (HMC) stems from physics (Duane et al. 1987). After a slow emergence into the statistical community (Neal 1999), it is now central in statistical software like STAN (Stan Development Team 2014). For a complete account of this important flavour of MCMC, the reader is referred to Neal (2013), which inspired the description below; see also Betancourt et al. (2014) for a highly mathematical differential-geometry approach to HMC.

This method can be seen as a particular and efficient instance of auxiliary variables (see, e.g., Besag and Green 1993 and Rubinstein 1981), in which we apply a deterministic-proposal Metropolis method to the augmented target. In physical terms, the idea behind HMC is to add a “kinetic energy” term to the “potential energy” (negative log-target), leading to the Hamiltonian

H(q, p) = −log π(q) + p^T M^{-1} p/2 ,

where q denotes the object to be simulated (i.e., the parameter), p its speed or momentum, and M a mass matrix. In more statistical language, HMC creates an auxiliary variable p such that moving according to Hamilton’s equations

dq/dt = ∂H/∂p = M^{-1} p ,
dp/dt = −∂H/∂q = ∇_q log π(q) ,

preserves the joint distribution with density exp{−H(q, p)}, hence the marginal distribution of q, that is, π(q). Hence, if we could simulate exactly this joint distribution of (q, p), a sample from π(q) would be a by-product. However, in practice, the equations are solved approximately and hence require a Metropolis correction. As discussed in, e.g., Neal (2013), the dynamics induced by Hamilton’s equations is reversible and volume-preserving in the (q, p) space, which means in practice that there is no need for a Jacobian in Metropolis updates. The practical implementation is called the leapfrog approximation (Girolami and Calderhead 2011) as it relies on a small step level ε, updating p and q via a modified Euler’s method called the leapfrog that is reversible and preserves volume as well. This discretised update can be repeated for an arbitrary number of steps.

When considering the implementation via a Metropolis algorithm, a new value of the momentum p is drawn from the pseudo-prior π(p) ∝ exp{−p^T M^{-1} p/2} and it is followed by a Metropolis step, whose proposal is driven by the leapfrog approximation to the Hamiltonian dynamics on (q, p) and whose acceptance is governed by the Metropolis acceptance probability. What makes the potential strength of this augmentation (or dis-integration) scheme is that the value of H(q, p) hardly changes during the Metropolis move, which means that it is most likely to be accepted and that it may produce a very different value of π(q) without modifying the overall acceptance probability. In other words, moving along level sets is almost energy-free, but if the move proceeds for long “enough”, the chain can reach far-away regions of the parameter space, thus avoiding the myopia of standard MCMC algorithms. As explained in Neal (2013), this means that Hamiltonian Monte Carlo avoids the inefficient random walk behaviour of most Metropolis–Hastings algorithms. What drives the exploration of the different values of H(q, p) is therefore the simulation of the momentum, which makes its calibration both quite influential and delicate as it depends on the unknown normalising constant of the target. (By calibration, we mean primarily the choice of the time discretisation step ε in the leapfrog approximation and of the number L of leapfrog leaps, but also the choice of the precision matrix M.)
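The following sketch of a single HMC transition (ours, with an identity mass matrix M = I and assumed user functions `log_target` and `grad_log_target`) makes the leapfrog updates and the energy-based acceptance explicit:

import numpy as np

def hmc_step(q, log_target, grad_log_target, eps, L, rng):
    """One HMC transition with identity mass matrix."""
    p = rng.standard_normal(len(q))             # refresh momentum p ~ N(0, I)
    H0 = -log_target(q) + 0.5 * p @ p           # current energy
    q_new, p_new = q.copy(), p.copy()
    # leapfrog: half step in p, alternating full steps, final half step in p
    p_new += 0.5 * eps * grad_log_target(q_new)
    for _ in range(L - 1):
        q_new += eps * p_new
        p_new += eps * grad_log_target(q_new)
    q_new += eps * p_new
    p_new += 0.5 * eps * grad_log_target(q_new)
    H1 = -log_target(q_new) + 0.5 * p_new @ p_new
    # accept with probability 1 ∧ exp(H0 − H1); H is nearly conserved,
    # so acceptance is high even for distant proposals
    if np.log(rng.uniform()) <= H0 - H1:
        return q_new
    return q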

2.4 Optimal scaling and Adaptive MCMC

The convergence of the Metropolis–Hastings algorithm 1 depends crucially on the choice of the proposal distribution q, as does the performance of both more complex MCMC and SMC algorithms, which often are hybrids using Metropolis–Hastings as simulation substeps.

Optimising over all implementable q appears to be a “disaster problem” due to its infinite-dimensional character, the lack of clarity about what is implementable and what is not, and the fact that this optimal q must depend in a complex way on the target π, to which we have only limited access. In particular, MALA provides a particular approach to constructing π-tailored proposals, and HMC can be viewed as a combination of Gibbs and specific Metropolis moves for an extended target.

In this optimisation context, it is thus reasonable to restrict ourselves to some parametric family of proposals q_ξ, or more generally of Markov transition kernels P_ξ, where ξ ∈ Ξ is a tuning parameter, possibly high-dimensional.

The aim of adaptive Markov chain Monte Carlo is conceptually very simple. One expects that there is a set Ξ_π ⊂ Ξ of good parameters ξ for which the kernel P_ξ converges quickly to π, and one allows the algorithm to search for Ξ_π “on the fly” and redesign the transition kernel during the simulation as more and more information about π becomes available. Thus an adaptive MCMC algorithm would apply the kernel P_{ξ(t)} to sample θ(t) given θ(t−1), where the tuning parameter ξ(t) is itself a random variable which may depend on the whole history θ(0), . . . , θ(t−1) and ξ(t−1). Adaptive MCMC rests on the hope that the adaptive parameter ξ(t) will find Ξ_π, stay there essentially forever and inherit good convergence properties.

There are at least two fundamental difficulties in executing this strategy in practice. First, standard measures of efficiency of Markovian kernels, like the total variation convergence rate (c.f. Meyn and Tweedie (2009); Roberts and Rosenthal (2004)), the L²(π) spectral gap (Diaconis and Stroock (1991); Roberts (1996); Saloff-Coste (1997); Levin et al. (2009)) or the asymptotic variance (Peskun (1973); Geyer (1992); Tierney (1998)) in the Markov chain central limit theorem, will not be available explicitly, and their estimation from a Markov chain trajectory is often an even more challenging task than the underlying MCMC estimation problem itself. Secondly, when executing an adaptive strategy and trying to improve the transition kernel on the fly, the Markov property of the process is violated, therefore standard theoretical tools do not apply, and establishing the validity of the approach becomes significantly more difficult. While the approach has been successfully applied in some very challenging practical problems (Solonen et al. (2012); Richardson et al. (2010); Griffin et al. (2014)), there are examples of seemingly reasonable adaptive algorithms that fail to converge to the intended target distribution (Bai et al. (2011); Łatuszyński et al. (2013)), indicating that, compared to standard MCMC, even more care must be taken to ensure the validity of inferential conclusions.

While heuristics-based adaptive algorithms have been considered already in Gilks et al. (1994), a remarkable result providing a tool to coherently address the difficulty of optimising Markovian kernels is the Roberts et al. (1997) paper that considers settings of increasing dimensionality and investigates the efficiency of the random walk Metropolis algorithm as a function of its average acceptance rate. More specifically, given a sequence of targets with iid components constructed from a conveniently smooth marginal f,

π_d(θ) := ∏_{i=1}^{d} f(θ_i), for d = 1, 2, . . . ,    (4)

consider a sequence of Markov chains X_d, d = 1, 2, . . . , where the chain X_d = (X_d(t))_{t=0,1,...} is a random walk Metropolis targeting π_d with proposal increments distributed as N(0, σ_d^2 I_{d×d}).

It then turns out that the only sensible scaling of the proposal as dimensionality increases is to take σ_d^2 = l^2 d^{-1}. In this regime the sequence of time-rescaled first coordinate processes

Z_d(t) := X_{d,1}(⌊td⌋), for d = 1, 2, . . . ,

converges in a suitable sense to the solution Z of the stochastic differential equation

dZ_t = h(l)^{1/2} dB_t + (1/2) h(l) ∇ log f(Z_t) dt.

Hence maximising the speed h(l) of the above diffusion is equivalent to maximising the efficiency of the algorithm as the dimension goes to infinity. Surprisingly, there is a one-to-one correspondence between the value l_opt = argmax h(l) and the mean acceptance probability of 0.234.

The magic number 0.234 does not depend on f and gives a universal tuning recipe, to be used for example in adaptive algorithms: choose the scale of the increment so that approximately 23% of the proposals are accepted.
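As an illustration of this recipe (ours, one of many possible adaptation schemes), a single log-scale can be nudged towards the 0.234 acceptance rate by a Robbins–Monro update after every iteration, with a decaying rate t^{-0.6} so that the adaptation is diminishing in the sense formalised below:

import numpy as np

def adaptive_rwm(log_target, theta0, n_iter=50_000, target_rate=0.234, rng=None):
    """Random walk Metropolis with its scale adapted towards a 0.234
    acceptance rate; the t**-0.6 factor makes the adaptation diminish."""
    rng = np.random.default_rng() if rng is None else rng
    d = len(theta0)
    theta, log_pi = np.asarray(theta0, float), log_target(theta0)
    log_scale = 0.0
    chain = np.empty((n_iter, d))
    for t in range(1, n_iter + 1):
        prop = theta + np.exp(log_scale) * rng.standard_normal(d)
        log_pi_prop = log_target(prop)
        accept = np.log(rng.uniform()) <= log_pi_prop - log_pi
        if accept:
            theta, log_pi = prop, log_pi_prop
        # Robbins-Monro step on the log-scale, targeting 23% acceptance
        log_scale += t ** -0.6 * (float(accept) - target_rate)
        chain[t - 1] = theta
    return chain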

The result, established under restrictive assumptions, has been empirically verified to hold much more generally, for non-iid targets and also in medium- and even low-dimensional examples with d as small as 5. It has also been combined with the relative efficiency loss due to a mismatch between the proposal and target covariance matrices (see Roberts and Rosenthal (2001)). A large body of theoretical work extends optimal scaling formally to different and more general scenarios. For example, Metropolis for smooth non-iid targets has been addressed e.g. by Bédard (2007), and in infinite-dimensional settings by Beskos et al. (2009). Discrete and other discontinuous targets have been considered in Roberts (1998) and Neal et al. (2012). For MALA algorithms an optimal acceptance rate of 0.574 has been established in Roberts and Rosenthal (1998) and confirmed in infinite-dimensional settings in Pillai et al. (2012), along with the stepsize σ_d^2 = l^2 d^{-1/3}. Hybrid Monte Carlo (see Section 2.3) has been analysed in a similar spirit by Beskos et al. (2013), resulting in an optimal acceptance rate of 0.651 and leapfrog step size h = l × d^{-1/4}. These results not only inform about optimal tuning, but also provide an efficiency ordering on the algorithms in d dimensions. Metropolis algorithms need O(d) steps to explore the state space, while MALA and HMC need respectively O(d^{1/3}) and O(d^{1/4}).

Further extensions include the transient phase before reaching stationarity (Christensen et al. (2005); Jourdain et al. (2012, 2014)), scaling of multiple-try MCMC (Bédard et al. (2012)), delayed rejection MCMC (Bédard et al. (2014)), and the temperature scale of parallel tempering type algorithms (Atchadé et al. (2011); Roberts and Rosenthal (2014)). Interestingly, the optimal scaling of the pseudo-marginal algorithms discussed in Section 2.5, as obtained in Sherlock et al. (2014), suggests an acceptance rate of just 0.07.

While each of these numerous optimal scaling results gives rise, at least in principle, to an adaptive MCMC design, the pioneering and most successful algorithm is the Adaptive Metropolis of Haario et al. (2001). With its increasing popularity in applications, this has fuelled the development of the field.

Here one considers a normal increment proposal that estimates the target covariance matrix from past samples and applies an appropriate dimension-dependent scaling and covariance shrinkage. Precisely, the proposal takes the form

q(·|θ(t−1)) = N(θ(t−1), C(t)),    (5)

with the covariance matrix

C(t) = (2.38²/d) ( ĉov(θ(0), . . . , θ(t−1)) + ε I_{d×d} ),    (6)

which is efficiently computed using a recursive formula. Versions and refinements of the adaptive Metropolis algorithm (Roberts and Rosenthal (2009); Andrieu and Thoms (2008)) have served well in applications and motivated much of the theoretical development. These include, among many other contributions, adaptive Metropolis, delayed rejection adaptive Metropolis (Haario et al. (2006)), regional adaptation and parallel chains (Craiu et al. (2009)), and the robust version of Vihola (2012), estimating the shape of the distribution rather than its covariance matrix and hence suitable for heavy-tailed targets.
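The recursive formula can be sketched as follows (our illustration of the standard rank-one update; `mean` and `cov` are the running mean and 1/n-normalised empirical covariance of the first t draws):

import numpy as np

def am_update(mean, cov, theta_t, t, d, eps=1e-6):
    """Update the running mean and covariance of θ(0), ..., θ(t-1)
    after observing θ(t), then rescale as in equation (6);
    t is the number of draws summarised so far (t >= 1)."""
    new_mean = mean + (theta_t - mean) / (t + 1)
    # rank-one recursion for the (1/n)-normalised empirical covariance
    new_cov = (t * cov + np.outer(theta_t - mean, theta_t - new_mean)) / (t + 1)
    C_t = (2.38 ** 2 / d) * (new_cov + eps * np.eye(d))
    return new_mean, new_cov, C_t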

Analogous development of adaptive MALA algorithms in Atchadé (2006); Marshall and Roberts (2012), and of adaptive Hamiltonian and Riemannian manifold Monte Carlo in Wang et al. (2013), building on the adaptive scaling theory, resulted in a similar drastic mixing improvement as the original Adaptive Metropolis.


Another substantial and still unexplored area where adaptive algorithms are applied, for very high-dimensional and multimodal problems, is model and variable selection (Nott and Kohn (2005); Richardson et al. (2010); Lamnisos et al. (2013); Ji and Schmidler (2013); Griffin et al. (2014)). These algorithms can incorporate reversible jump moves (Green 1995) and are guided by scaling limits for discrete distributions as well as the temperature spacing of parallel tempering to address multimodality. Successful implementations allow for fully Bayesian variable selection in models with over 20,000 variables for which otherwise only ad hoc heuristic approaches have been used in the literature.

To address the second difficulty with adaptive algorithms, several approaches have been developed to establish their theoretical underpinning. While for standard MCMC convergence in total variation and laws of large numbers are obtained almost trivially, and the effort concentrates on stronger results, like CLTs, geometric convergence, nonasymptotic analysis and, maybe most importantly, comparison and ordering of algorithms, adaptive samplers are intrinsically difficult. The most elegant and theoretically-valid strategy is to change the underlying Markovian kernel at regeneration times only (Gilks et al. (1998)). Unfortunately, this is not very appealing for practitioners since regenerations are difficult to identify in more complex settings and are essentially impractically rare in high dimensions. The original Adaptive Metropolis of Haario et al. (2001) has been validated (under some restrictive additional conditions) by controlling the dependencies introduced by the adaptation and using convergence results for mixingales. The approach has been further developed in Atchadé and Rosenthal (2005) and Atchadé (2006) to verify its ergodicity under weaker assumptions and to apply the mixingale approach to adaptive MALA. Another successful approach (Andrieu and Moulines (2006), refined in Saksman and Vihola (2010)) rests on martingale difference approximations and martingale limit theorems to obtain, under suitable technical assumptions, versions of LLNs and CLTs. There are close links between analysing adaptive MCMC and stochastic approximation algorithms, and in particular the adaptation step can often be written as a mean field of the stochastic approximation procedure; Andrieu and Robert (2001); Atchadé et al. (2011); Andrieu et al. (2015) contribute to this direction of analysis. Fort et al. (2011) develop an approach where both adaptive and interacting MCMC algorithms can be treated in the same framework. This allows one to address “external adaptation” algorithms such as the interacting tempering algorithm (a simplified version of the celebrated equi-energy sampler of Kou et al. (2006)) or adaptive parallel tempering in Miasojedow et al. (2013).

We present here the rather general but fairly simple coupling approach (Roberts and Rosenthal (2007)) to establishing convergence. Successfully applied to a variety of adaptive Metropolis samplers under weak regularity conditions (Bai et al. (2011)), and to adaptive Gibbs and adaptive Metropolis-within-adaptive-Gibbs samplers (Łatuszyński et al. (2013)), it shows that the two properties Diminishing Adaptation and Containment are sufficient to guarantee that an adaptive MCMC algorithm will converge asymptotically to the correct target distribution. To this end, recall the total variation distance between two measures, defined as ‖ν(·) − µ(·)‖ := sup_{A∈F} |ν(A) − µ(A)|, and for every Markov transition kernel P_ξ, ξ ∈ Ξ, and every starting point θ ∈ Θ, define the ε-convergence function M_ε : Θ × Ξ → N as

M_ε(θ, ξ) := inf{ t ≥ 1 : ‖P_ξ^t(θ, ·) − π(·)‖ ≤ ε }.

Let {(θ(t), ξ(t))}_{t=0}^∞ be the corresponding adaptive MCMC algorithm and denote by A(t)((θ, ξ), ·) its marginal distribution at time t, i.e.

A(t)((θ, ξ), B) := P(θ(t) ∈ B | θ(0) = θ, ξ(0) = ξ).

The adaptive algorithm is ergodic for all starting values of θ and ξ if lim_{t→∞} ‖A(t)((θ, ξ), ·) − π(·)‖ = 0. The two conditions guaranteeing ergodicity are

Definition 1 (Diminishing Adaptation) The adaptive algorithm with starting values θ(0) = θ and ξ(0) = ξ satisfies Diminishing Adaptation if

lim_{t→∞} D(t) = 0 in probability, where D(t) := sup_{θ∈Θ} ‖P_{ξ(t+1)}(θ, ·) − P_{ξ(t)}(θ, ·)‖.

Definition 2 (Containment) The adaptive algorithm with starting values θ(0) = θ and ξ(0) = ξ satisfies Containment if for all ε > 0 the sequence {M_ε(θ(t), ξ(t))}_{t=0}^∞ is bounded in probability.

While Diminishing Adaptation is a standard requirement, Containment is subject to some discussion. On one hand, it may seem difficult to verify in practice; on the other, it may appear restrictive in the context of ergodicity results under some weaker conditions (c.f. Fort et al. (2011)). However, it turns out (Łatuszyński and Rosenthal (2014)) that if Containment is not satisfied, then the algorithm may still converge, but with positive probability it will be asymptotically less efficient than any nonadaptive ergodic MCMC scheme. Hence algorithms that do not satisfy Containment are termed AdapFail and are best avoided. The condition has been further studied in Bai et al. (2011) and is in particular implied by simultaneous geometric or polynomial drift conditions of the adaptive kernels.

Given that adaptive algorithms may be incorporated in essentially any sampling scheme, their introduction seems to be one of the most important innovations of the last two decades. However, despite substantial effort and many ingenious contributions, the theory of adaptive MCMC lags behind practice even more than may be the case in other computational areas. While theory always matters, the numerous unexpected and counterintuitive examples of transient adaptive algorithms suggest that in this area theory matters even more for healthy development.

For adaptive MCMC to become a routine tool, a clear-cut result is needed saying that, under some easily verifiable conditions, these algorithms are valid and perform not much worse than their nonadaptive counterparts with fixed parameters. Such a result is yet to be established and may require a deeper understanding of how to construct stable adaptive MCMC, rather than aiming heavy technical artillery at algorithms currently in use without modifying them.

2.5 Estimated likelihoods and pseudo-marginals

There are numerous settings of interest where the target density π(·|x) is not available in closed form. For instance, in latent variable models, the likelihood function ℓ(θ|x) is often only available as an intractable integral

ℓ(θ|x) = ∫_Z g(z, x|θ) dz ,

which leads to

π(θ|x) ∝ π(θ) ∫_Z g(z, x|θ) dz

being equally intractable. A solution proposed from the early days of MCMC (Tanner and Wong 1987) is to consider z as an auxiliary variable and to simulate the joint distribution π(θ, z|x) by a standard method, leading to simulation from the marginal density π(·|x) as a by-product. However, when the dimension of the auxiliary variable z grows with the sample size, this technique may run into difficulties as the induced MCMC algorithms are more and more likely to have convergence issues. An illustration of this case is provided by hidden Markov models, which have eventually to resort to particle filters as Markov chain algorithms become ineffective (Chopin 2007; Fearnhead and Clifford 2003). Another situation where the target density π(·|x) cannot be directly computed is the case of the “doubly intractable” likelihood

(Murray et al. 2006a), when the likelihood function itself, ℓ(θ|x) ∝ g(x|θ), contains a term that is intractable and makes the normalising constant

Z(θ) = ∫_X g(x|θ) dx

impossible to compute. Examples of this kind abound in Markov random field models, as for instance the Ising model (Murray et al. 2006b; Møller et al. 2006).

Both the approaches of Murray et al. (2006a) and Møller et al. (2006) require sampling data from the likelihood, which limits their applicability. The latter in addition uses an importance sampling function and can retrospectively be reinterpreted as Grouped Independence Metropolis–Hastings (GIMH, Andrieu and Roberts (2009), see below) with sample size 1. When perfect sampling from the likelihood is impossible, Girolami et al. (2013) develop an approach, also in the framework of GIMH, where the likelihoods are unbiasedly estimated by random truncation of their series expansions.

Andrieu and Roberts (2009) propose a more general resolution of such problems by designing a Metropolis–Hastings algorithm that replaces the intractable target density π(·|x) with an unbiased estimator, following an idea of Beaumont (2003). Proper changes to the Metropolis–Hastings acceptance ratio are sufficient to guarantee that the stationary density of the corresponding Markov chain remains equal to π(·|x). Indeed, if π̂(θ|x, z) is an unbiased estimator of π(θ|x) when z ∼ q(·|θ), it is rather straightforward to check that the acceptance ratio

[ π̂(θ∗|x, z∗) q(θ∗, θ) q(z|θ) ] / [ π̂(θ|x, z) q(θ, θ∗) q(z∗|θ∗) ]

preserves stationarity with respect to an extended target (see Andrieu and Roberts (2009) for details) when z ∼ q(·|θ), z∗ ∼ q(·|θ∗), and θ∗|θ ∼ q(θ, θ∗).
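In code, the only change relative to the sketch of Algorithm 1 is that log π(θ|x) is replaced by the log prior plus the log of an unbiased likelihood estimate, drawn afresh for each proposed value and recycled, never recomputed, for the current one. A minimal sketch (ours), where `loglik_hat(theta, rng)` and `log_prior` are hypothetical user functions, the former returning the log of an unbiased likelihood estimate:

import numpy as np

def pseudo_marginal_mh(loglik_hat, log_prior, theta0, n_iter, scale, rng):
    """Pseudo-marginal MH with a symmetric random walk proposal: the
    estimate attached to the current state is stored and reused, which
    is what preserves the exact target distribution."""
    theta = np.asarray(theta0, float)
    log_post = log_prior(theta) + loglik_hat(theta, rng)
    chain = np.empty((n_iter, len(theta)))
    for t in range(n_iter):
        prop = theta + scale * rng.standard_normal(len(theta))
        log_post_prop = log_prior(prop) + loglik_hat(prop, rng)
        if np.log(rng.uniform()) <= log_post_prop - log_post:
            theta, log_post = prop, log_post_prop
        chain[t] = theta
    return chain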

The performance of the approach will depend on the quality of the estimators π̂, and so both stabilising them and understanding this relationship is an active area of current development. In particular, the improvements from using multiple samples of z to estimate π can be deduced from Andrieu and Vihola (2012), where the efficiency of the algorithm is studied in terms of its spectral gap and CLT asymptotic variance. Sherlock et al. (2014), on the other hand, investigate the efficiency as a function of the acceptance rate and the variance of the noise.

Design and understanding of pseudo-marginal algorithms is a dynamic direction of methodological development that in the coming years will be further fuelled not only by complex models with intractable likelihoods, but also by the need for MCMC algorithms for Big Data, in contexts where the likelihood function cannot be evaluated for the whole dataset (Korattikara et al. 2013; Bardenet et al. 2014; Teh et al. 2014; Maclaurin and Adams 2014; Minsker et al. 2014).

2.6 Particle MCMC

While we refrain from covering particle filters here, since others (names?!) are focussing on this technique, a recent advance at the interface between MCMC, pseudo-marginals, and particle filtering is the notion of particle MCMC (or pMCMC), developed by Andrieu et al. (2011). This is in fact a specific case of a pseudo-marginal algorithm, taking advantage of the state-space models used by particle filters. And it differs from particle filters in that it targets (mostly) the marginal posterior distribution of the parameters.

The simplest setting in which pMCMC applies is that of a state-space model where a latent sequence x_{0:T} is a Markov chain with joint density

p_0(x_0|θ) p_1(x_1|x_0, θ) · · · p_T(x_T|x_{T−1}, θ) ,

and is associated with an observed sequence y_{1:T} such that

y_{1:T} | x_{1:T}, θ ∼ ∏_{i=1}^{T} q_i(y_i|x_i, θ) .
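For concreteness, here is a toy instance of such a model in Python (ours): a Gaussian random walk observed in Gaussian noise, with θ = (τ, σ) collecting the transition and observation standard deviations. These hypothetical helper functions are reused in the particle filter sketch below.

import numpy as np

# Toy state-space model: x_t = x_{t-1} + tau * eta_t, y_t = x_t + sigma * eps_t

def sample_x0(theta, N, rng):
    # N particles drawn from the initial law p_0(x_0|θ) = N(0, 1)
    return rng.standard_normal(N)

def sample_transition(x, theta, rng):
    # propagate all particles through p_t(x_t | x_{t-1}, θ)
    tau, _ = theta
    return x + tau * rng.standard_normal(len(x))

def loglik_obs(y_t, x, theta):
    # log q_t(y_t | x_t, θ) for Gaussian observation noise, per particle
    _, sigma = theta
    return -0.5 * ((y_t - x) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)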

The iterations of pMCMC are MCMC-like in that, at iteration t, a new value θ′ of θ is proposed from an arbitrary transition kernel h(·|θ(t)) and then a new value of the latent series x′_{0:T} is generated from a particle filter approximation of p(x_{0:T}|θ′, y_{1:T}). Since the particle filter returns as a by-product (Del Moral et al. 2006) an unbiased estimator q̂(y_{1:T}|θ′) of the marginal likelihood of y_{1:T}, this estimator can be used as such in the Metropolis–Hastings ratio

[ q̂(y_{1:T}|θ′) π(θ′) h(θ(t)|θ′) ] / [ q̂(y_{1:T}|θ(t)) π(θ(t)) h(θ′|θ(t)) ] ∧ 1 .

Its validation follows from the general argument of Andrieu and Roberts (2009), although some additional (notational) effort is needed to demonstrate that all random variables used therein are correctly assessed (see Andrieu et al. 2011 and Wilkinson 2011a, the latter providing a very progressive introduction to the notions of pMCMC and particle Gibbs, which helped greatly in composing this section).
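A compact sketch of the marginal likelihood estimation step (ours), a bootstrap particle filter built on the toy model functions above; the returned quantity is the log of the unbiased estimator q̂(y_{1:T}|θ) that enters the ratio:

import numpy as np

def log_marginal_estimate(theta, y, N, sample_x0, sample_transition,
                          loglik_obs, rng):
    """Bootstrap particle filter returning the log of an unbiased
    estimate of q(y_{1:T}|theta)."""
    x = sample_x0(theta, N, rng)                     # N particles at time 0
    log_qhat = 0.0
    for t in range(len(y)):
        x = sample_transition(x, theta, rng)         # propagate particles
        logw = loglik_obs(y[t], x, theta)            # weight by q_t(y_t|x_t)
        m = logw.max()
        w = np.exp(logw - m)
        log_qhat += m + np.log(w.mean())             # running likelihood factor
        x = x[rng.choice(N, size=N, p=w / w.sum())]  # multinomial resampling
    return log_qhat

# One pMCMC move then follows the pseudo-marginal pattern: propose θ' from
# h(·|θ), run the filter to obtain log q̂(y_{1:T}|θ'), and accept with
# probability 1 ∧ exp{ log q̂' + log π(θ') + log h(θ|θ')
#                      − log q̂ − log π(θ) − log h(θ'|θ) }.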

This approach is being used increasingly in complex dynamic models like those found in signal processing (Whiteley et al. 2010), dynamical systems like the ODEs in biochemical kinetics (Wilkinson 2011b) and probabilistic graphical models (Lindsten et al. 2014). An extension to approximating the sequential filtering distribution is found in Chopin et al. (2013).

2.7 Parallel MCMC

Since MCMC relies on local updating based on the current value of a Markov chain, opportunities for exploiting parallel resources, either CPU or GPU, would seem quite limited. In fact, the possibilities reach far beyond the basic notion of running independent or coupled MCMC chains on several processors. For instance, Craiu and Meng (2005) construct parallel antithetic coupling to create negatively correlated MCMC chains (see also Frigessi et al. 2000), while Craiu et al. (2009) use parallel exploration of the sample space to tune an adaptive MCMC algorithm. Jacob et al. (2011) exploit GPU facilities to improve by Rao-Blackwellisation the Monte Carlo approximations produced by a Markov chain, even though the parallelisation does not improve the convergence of the chain. See also Lee et al. (2009) and Suchard et al. (2010) for more detailed contributions on the appeal of using GPUs towards massive parallelisation, and Wilkinson (2005) for a general survey on the topic.

Another recently-explored direction is “prefetching”. Based on Brockwell (2006), this approach computes the 2², 2³, . . . , 2^k values of the posterior that will be needed 2, 3, . . . , k sweeps ahead by simulating the possible “futures” of the Markov chain in parallel. Running a regular Metropolis–Hastings algorithm then means building a decision tree back to the current iteration and drawing 2, 3, . . . , k uniform variates to go down the tree to the appropriate branch. As noted by Brockwell (2006), “in the case where one can guess whether or not acceptance probabilities will be ‘high’ or ‘low’, the tree could be made deeper down ‘high’ probability paths and shallower in the ‘low’ probability paths.” This idea is exploited in Angelino et al. (2014), by creating “speculative moves” that consider the reject branch of the prefetching tree more often than not, based on some preliminary or dynamic evaluation of the acceptance rate. Using a fast but close-enough approximation to the true target (and a fixed sequence of uniforms) may also produce a “single most likely path” on which prefetched simulations can be run. The basic idea is thus to run simulations and costly likelihood computations on many parallel processors along a prefetched path, a path that has been prefetched for its high approximate likelihood. There are obviously instances where this speculative simulation is not helpful because the actual chain ends up following another path with the genuine target. Angelino et al. (2014) actually go further by constructing sequences of approximations for the precomputations. The proposition for the sequence found therein is to subsample the original data and use a normal approximation to the difference of the log (sub-)likelihoods. See Strid (2010) for related ideas.

A different use of parallel capabilities is found in Calderhead (2014). At each iteration of Calderhead’s algorithm, N proposed values are generated conditional on the “current” value of the Markov chain, which actually consists of (N+1) components and from which one component is drawn at random to serve as a seed for the next proposal distribution and the simulation of N other values. In other words, this is a data-augmentation scheme with the index I on the one side and the N modified components on the other side. The neat trick in the proposal (and the reason for the gain in efficiency) is that the stationary distribution of the auxiliary variable can be determined and hence used (N+1) times in updating the vector of (N+1) components. (Note that picking the index at random means computing all (N+1) possible transitions from one component to the N others, hence a potential increase in the computing cost, even though what costs the most is usually the likelihood computation, dispatched on the parallel processors.) While there are (N+1) terms involved at each step, the genuine Markov chain is truly a single chain and the N other proposed values are not recycled. An interesting feature of this approach arises when the original Metropolis–Hastings algorithm is expressed as a finite state space Markov chain on the set of indices {1, . . . , N+1}. Conditional on the values of the (N+1)-dimensional vector, the stationary distribution of that sub-chain is no longer uniform. Hence, picking (N+1) indices from the stationary distribution helps in selecting the most appropriate images, which explains why the rejection rate decreases. The paper indeed evaluates the impact of increasing the number of proposals in terms of effective sample size (ESS), acceptance rate, and mean squared jump distance, based on two examples. Since this proposal is an almost free bonus resulting from using N processors, when compared with more complex coupled chains, it sounds worth investigating and comparing with those more complex parallel schemes.

Neiswanger et al. (2013) introduced the notion of embarrassingly parallel MCMC, where “embarrassing” refers to the “embarrassingly simple” solution proposed therein, namely to solve the difficulty in handling very large datasets by running completely independent parallel MCMC samplers on parallel threads or computers and using the outcomes of those samplers as density estimates, pulled together as a product towards an approximation of the true posterior density. In other words, the idea is to break the posterior as

p(θ|x) = ∏_{i=1}^{m} p_i(θ|x)    (7)

and to use the estimate

p̂(θ|x) = ∏_{i=1}^{m} p̂_i(θ|x) ,

where the individual estimates are obtained, say, nonparametrically. The method is then “asymptotically exact” in the weak (and unsurprising) sense of converging in the number of MCMC iterations. Still, there is a theoretical justification that is not found in previous parallel methods that mixed all resulting samples without accounting for the subsampling. And the point is made that, in many cases, running MCMC samplers with subsamples produces faster convergence. The decomposition of p(·) into its components is done by partitioning the iid data into m subsets and taking a power 1/m of the prior in each case. (This may induce issues about impropriety.) However, the subdivision is arbitrary and can thus be implemented in cases other than the fairly restrictive iid setting. Because each (subsample) nonparametric estimate involves T terms, the resulting overall estimate contains T^m terms and the authors suggest using an independent Metropolis sampler to handle this complexity. This is in fact necessary for producing a final sample from the (approximate) true posterior distribution.
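A minimal sketch of the decomposition (ours; for readability each subposterior p̂_i is combined through a simple Gaussian approximation rather than the kernel estimates of Neiswanger et al. (2013), so this is closer in spirit to consensus-type merging):

import numpy as np

def combine_subposteriors(subposterior_samples):
    """Combine m subposterior sample sets (each of shape (T, d), d >= 2)
    by multiplying Gaussian approximations: the precision-weighted
    product-of-Gaussians rule."""
    precisions, weighted_means = [], []
    for s in subposterior_samples:
        Sigma_inv = np.linalg.inv(np.cov(s, rowvar=False))
        precisions.append(Sigma_inv)
        weighted_means.append(Sigma_inv @ s.mean(axis=0))
    Sigma = np.linalg.inv(sum(precisions))
    mu = Sigma @ sum(weighted_means)
    return mu, Sigma          # Gaussian approximation of p(θ|x)

# Each worker i would run, e.g., the metropolis_hastings sketch of Section
# 2.1 on log p_i(θ|x_i) = (1/m) log p(θ) + log L(θ|x_i) before merging.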

In a closely related way, Wang and Dunson (2013) start from the same product representation of the target (posterior), namely (7). However, they criticise the choice made by Neiswanger et al. (2013) to use MCMC approximations for each component of the product, for the following reasons:

1. Curse of dimensionality in the number of parameters p;
2. Curse of dimensionality in the number of subsets m;
3. Tail degeneration;
4. Support inconsistency and mode misspecification.

While point 1 is clearly relevant, although there may be other ways than kernel estimation to mix samples from the terms in the product (the terms Neiswanger et al. (2013) called the subposteriors), and this is also a drawback of the current method, point 2 is not such a clear-cut drawback: the T^m explosion in the number of terms in a product of m sums of T terms sounds self-defeating, but Neiswanger et al. (2013) use a clever device to avoid the combinatorial explosion, namely operating on one component at a time. Having non-manageable targets is not such an issue in the post-MCMC era. Point 3 is formally correct, in that the kernel tail behaviour induces the kernel estimate tail behaviour, most likely disconnected from the true target tail behaviour, but this feature is true for any non-parametric estimate, even for the Weierstrass transform defined below, and hence is maybe not so relevant in practice. In fact, by lifting the tails up, the simulation from the subposteriors should help in visiting the tails of the true target. Finally, point 4 does not seem to be serious: assuming the true target can be computed up to a normalising constant, the value of the target for every simulated parameter could be computed, eliminating those outside the support of the product and highlighting modal regions.

The Weierstrass transform of a density f is a convolution of f and an arbitrary kernel K. Wang and Dunson (2013) propose to simulate from the product of the Weierstrass transforms, using a multi-tiered Gibbs sampler. Hence, the parameter is only simulated once and from a controlled kernel, while the random effects from the convolution are related with each subposterior. While the method requires coordination between the parallel threads, the components of the target are separately computed on a single thread. The clearest perspective on the Weierstrass transform may possibly be the rejection sampling version, where simulations from the subposteriors are merged together into a normal proposal on θ, to be accepted with a probability depending on the subposterior simulations.

VanDerwerken and Schmidler (2013) keep with the spirit of parallel papers like consensus Bayes (Scott et al. 2013), embarrassingly parallel MCMC (Neiswanger et al. 2013) and Weierstrass MCMC (Wang and Dunson 2013), namely that the computation of the likelihood can be broken into batches and MCMC run over those batches independently. The idea of the authors is to replace an exploration of the whole space operated via a single Markov chain (or by parallel chains acting independently, which all have to “converge”) with parallel and independent explorations of parts of the space by separate Markov chains. The motivation is that “small is beautiful”: it takes a shorter while to explore each set of the partition, hence to converge, and, more importantly, each chain can work in parallel with the others. More specifically, given a partition of the space into sets A_i with posterior weights w_i, parallel chains are associated with targets equal to the original target restricted to those A_i's. This is therefore an MCMC version of partitioned sampling. With regard to the shortcomings listed in the quote above, the authors consider that there does not need to be a bijection between the partition sets and the chains, in that a chain can move across partitions and thus contribute to several integral evaluations simultaneously. It is somewhat unclear (a) whether or not this impacts ergodicity (it all depends on the way the chain is constructed, i.e., against which target), as it could lead to an over-representation of some boundary regions, and (b) whether or not it improves the overall convergence properties of the chain(s). A more delicate issue with the partitioned MCMC approach stands with the partitioning itself. Indeed, in a complex and high-dimensional model, the construction of the appropriate partition is a challenge in itself as we often have no prior idea where the modal areas are. Waiting for a correct exploration of the modes is indeed faster than waiting for crossings between modes, provided all modes are represented and the chain for each partition set A_i has enough energy to explore this set. It actually sounds unlikely that a target with huge gaps between modes will see a considerable improvement from the partitioned version when the partition sets A_i are selected on the go, because some of the boundaries between the partition sets may be hard to reach with an off-the-shelf proposal. A last comment about this innovative paper is that the adaptive construction of the partition has much in common with Wang–Landau schemes (Wang and Landau 2001; Atchadé and Liu 2004; Lee et al. 2005; Jacob and Ryder 2014).

3 ABC and other acronyms, exact simulations from an approximate target

Motivated by highly complex models where MCMC algorithms and other Monte Carlo methods were too inefficient by far, approximate methods have emerged where the output cannot be considered as simulations from the genuine posterior, even under idealised situations of infinite computing power. These methods include ABC techniques, described in more detail below, but also variational Bayes (Jaakkola and Jordan 2000), empirical likelihood (Owen 2001), INLA (Rue et al. 2009) and other solutions that rely on pseudo-models, or on summarised versions of the data, or both. It is quite important to signal this evolution as we think that it may be a central feature of computational Bayesian statistics in the coming years. From a statistical perspective, it also induces a somewhat paradoxical attitude where loss of information is balanced by improvement in precision, for a given computational budget. This perspective is not only interesting at the computational level but forces us (as statisticians) to re-evaluate in depth the nature of a statistical model, and could produce a paradigm shift in the near future by giving a brand new meaning to George Box’s motto that “all models are wrong”.

3.1 ABC per se

It seems important to discuss ABC (approximate Bayesian computation) in this partial tour of Bayesian computational techniques as (a) they provide the only approach to their model for some Bayesians, (b) they deliver samples in the parameter space that are exact simulations from a posterior of some kind (Wilkinson 2013), π_ABC(θ|x), if not the original posterior π(θ|x), (c) they may be more intuitive to some researchers outside statistics, as they entail simulating from the inferred model, i.e., going forward from parameter to data, rather than backward, from data to parameter, as in traditional Bayesian inference, (d) they can be merged with MCMC algorithms, and (e) they allow drawing inference directly from summaries of the data rather than the data itself.

ABC techniques play a role in the 2000s that MCMC methods did in the 1990s, in that they handle new models for which earlier (e.g., MCMC) algorithms were at a loss, in the same way the latter (MCMC) were able to handle models that regular Monte Carlo approaches could not reach, such as latent variable models (Tanner and Wong 1987; Diebolt and Robert 1994; Richardson and Green 1997). New models for which ABC unlocked the gate include Markov random fields, Kingman’s coalescent for phylogeographical data, likelihood models with an intractable normalising constant, and models defined by their quantile function or their characteristic function. While the ABC approach first appeared as a “quick-and-dirty” solution, to be considered only until more elaborate representations could be found, those algorithms have been progressively incorporated into the statistician’s toolbox as a novel form of generic nonparametric inference handling partly-defined statistical models. They are therefore attractive as much for this reason as for being handy computational solutions when everything else fails.

A statistically intriguing feature of those methods is that they customarily require, for greater efficiency, replacing the data with (much) smaller-dimension summaries¹, or summary statistics, because of the complexity of the former. In almost every case calling for ABC, those summaries are not sufficient statistics, and the method thus implies from the start a loss of statistical information, at least at a formal level, since relying on the raw data is out of the question and therefore the additional information it provides is moot. This imposed reduction of the statistical information raises many relevant questions, from the choice of summary statistics (Blum et al. 2013) to the consistency of the ensuing inference (Robert et al. 2011).

¹ Maybe due to their initial introduction in population genetics, the oxymoron 'summary statistics' is now prevalent in descriptions of ABC algorithms, including in the statistical literature, where the (linguistically sufficient) term 'statistic' would suffice.

Although it has now diffused into a wide range of applications, the technique of Approximate Bayesian Computation (ABC) was first introduced by and for population genetics (Tavaré et al. 1997; Pritchard et al. 1999), to handle ancestry models driven by Kingman's coalescent and with strictly intractable likelihoods (Beaumont 2010). The likelihood function of such genetic models is indeed "intractable" in the sense that, while derived from a fully defined and parameterised probability model, this function cannot be computed (at all, or within a manageable time) for a single value of the parameter and for the given data. Bypassing the original example to avoid getting mired in the details of population genetics, examples of intractable likelihoods include densities with intractable normalising constants, i.e., f(x|θ) = g(x|θ)/Z(θ), as in Potts (Potts 1952) and auto-exponential (Besag 1972) models, and pseudo-likelihood models (Cucala et al. 2009).
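To make this intractability concrete, the following minimal Python sketch (ours, not from the paper) evaluates the unnormalised Potts density g(z|θ) on a toy grid and computes Z(θ) by brute force; the point is that the sum defining Z(θ) ranges over q^n configurations, already 2^9 = 512 terms for a 3×3 binary grid and astronomically many for realistic lattices.

import itertools
import numpy as np

def potts_unnorm(z, theta, edges):
    """g(z|theta): exponential of theta times the number of agreeing neighbour pairs."""
    return np.exp(theta * sum(z[i] == z[j] for i, j in edges))

# Brute-force Z(theta) on a tiny 3x3 grid with q = 2 colours.
n_side, q, theta = 3, 2, 0.8
idx = {(r, c): r * n_side + c for r in range(n_side) for c in range(n_side)}
edges = [(idx[(r, c)], idx[(r, c + 1)]) for r in range(n_side) for c in range(n_side - 1)]
edges += [(idx[(r, c)], idx[(r + 1, c)]) for r in range(n_side - 1) for c in range(n_side)]
Z = sum(potts_unnorm(z, theta, edges)
        for z in itertools.product(range(q), repeat=n_side ** 2))
print(Z)  # q**(n_side**2) terms: feasible here, hopeless for a 100x100 grid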

Example 1 A very simple illustration of an intractable likelihood is provided by Bayesian inference based on the median and median absolute deviation (mad) statistics of a sample x1, . . . , xn of i.i.d. draws from an arbitrary location-scale family, σ⁻¹g(σ⁻¹{x − µ}), as the joint distribution of this pair of statistics is not available in closed form. ◭
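Forward simulation of this summary pair is nonetheless immediate, which is precisely what ABC exploits; a minimal Python sketch, taking g to be the standard normal density for illustration:

import numpy as np

def med_mad(x):
    """The (median, mad) summary pair of Example 1."""
    med = np.median(x)
    return med, np.median(np.abs(x - med))

rng = np.random.default_rng(0)
mu, sigma, n = 3.0, 2.0, 100
x = mu + sigma * rng.standard_normal(n)   # one draw from the location-scale model
print(med_mad(x))                          # easy to simulate, no closed-form density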

The concept at the core of ABC methods can be seen as both very naïve and intrinsically related to the foundations of Bayesian statistics as inverse probability (Rubin 1984). This concept is that data x simulated conditional on values of the parameter close to the "true" value of the parameter should look more similar to the actual data x0 than data x simulated conditional on values of the parameter far from the "true" value. ABC actually involves an acceptance/rejection step, in that parameters simulated from the prior are accepted only when

d(x, x0) < ε ,

where d(·, ·) is a distance and ε > 0 is called the tolerance. It can be shown that the algorithm exactly samples the posterior when ε = 0, but this is very rarely achievable in practice (Grelaud et al. 2009). An algorithmic representation is as follows:


Algorithm 2 ABC (basic version)

for t = 1 to N do
    repeat
        Generate θ∗ from the prior π(·)
        Generate x∗ from the model f(·|θ∗)
        Compute the distance ρ(x0, x∗)
        Accept θ∗ if ρ(x0, x∗) < ε
    until acceptance
end for
return the N accepted values of θ∗
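A direct Python transcription of Algorithm 2 may help; the callables prior_rvs, model_rvs and rho are user-supplied placeholders (hypothetical names, not from the paper), and the inner loop will not terminate if ε is set below the reachable distances.

import numpy as np

def abc_basic(x0, prior_rvs, model_rvs, rho, eps, N, rng):
    """Basic ABC rejection sampler (Algorithm 2)."""
    accepted = []
    for _ in range(N):
        while True:                     # repeat ... until acceptance
            theta = prior_rvs(rng)      # theta* ~ pi(.)
            x = model_rvs(theta, rng)   # x* ~ f(.|theta*)
            if rho(x0, x) < eps:        # keep theta* if within tolerance
                accepted.append(theta)
                break
    return np.array(accepted)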

Calibration of the ABC method in Algorithm 2 involves selecting the distance ρ(·, ·) and deducing the tolerance from computational cost constraints. However, in realistic settings, ABC is never implemented as such, because comparing raw data to simulated raw data is rarely efficient, noise dominating signal (see, e.g., Marin et al. (2011) for toy examples). It is therefore natural that one first considers dimension-reduction techniques to bypass this curse of dimensionality. For instance, if rudimentary estimates S(x) of the parameter θ are available, they are good candidates. In the ABC literature, they are called summary statistics, a term that does not impose any constraint on their form and hence leaves open the question of performance, as discussed in Marin et al. (2011); Blum et al. (2013). A more practical version of the ABC algorithm is shown in Algorithm 3 below, with a different output for each choice of the summary statistic. We stress in this version of the algorithm the construction of the tolerance ε as a quantile of the simulated distances ρ(S(x0), S(x(t))), rather than as an additional parameter of the method.

Algorithm 3 ABC (version with summary)

for t = 1 to Nref do
    Generate θ(t) from the prior π(·)
    Generate x(t) from the model f(·|θ(t))
    Compute dt = ρ(S(x0), S(x(t)))
end for
Order the distances d(1) ≤ d(2) ≤ . . . ≤ d(Nref)
return the values θ(t) associated with the k smallest distances
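In Python, Algorithm 3 can be sketched as follows, again with hypothetical user-supplied callables; note that the tolerance is implicit, being the k-th smallest simulated distance.

import numpy as np

def abc_reference_table(x0, prior_rvs, model_rvs, summary, rho, N_ref, k, rng):
    """ABC with summaries (Algorithm 3): keep the k closest simulations."""
    s0 = summary(x0)
    thetas, dists = [], []
    for _ in range(N_ref):
        theta = prior_rvs(rng)              # theta(t) ~ pi(.)
        x = model_rvs(theta, rng)           # x(t) ~ f(.|theta(t))
        thetas.append(theta)
        dists.append(rho(s0, summary(x)))   # d_t
    keep = np.argsort(dists)[:k]            # tolerance = k-th smallest distance
    return np.asarray(thetas)[keep]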

An immediate question about this approximate algorithm is how much it remains connected with the original posterior distribution and, in case it does not, where it draws its legitimacy. A first remark in this connection is that it constitutes at best a convergent approximation to the posterior distribution π(θ|S(x0)). It can easily be seen that ABC generates outcomes from a genuine posterior distribution when the data is randomised with scale ε (Wilkinson 2013; Fearnhead and Prangle 2012). This interpretation indicates a decrease in the precision of the inference, but it does not provide a universal validation of the method. A second perspective on the ABC output is that it is based on a non-parametric approximation of the sampling distribution (Blum 2010; Blum and François 2010), connected with both indirect inference (Drovandi et al. 2011) and k-nearest neighbour estimation (Biau et al. 2014). While a purely Bayesian nonparametric analysis of this aspect has not yet emerged, this brings additional, if cautious, support for the method.
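For reference, the distribution actually targeted by the accepted draws admits a standard explicit representation (a well-known identity in the ABC literature, with the indicator possibly replaced by a smoother kernel):

πABC(θ | x0) ∝ π(θ) ∫ f(x | θ) 1{ρ(S(x), S(x0)) < ε} dx ,

which converges to π(θ | S(x0)) as ε → 0, in line with the convergent-approximation remark above.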

Example 2 Continuing from the previous example of a location-scale sample only monitored through the pair median plus mad statistic, we consider the special case of a normal sample x1, . . . , xn ∼ N(µ, σ²), with n = 100. Using a conjugate prior µ ∼ N(0, 10), σ⁻² ∼ Ga(2, 5), we generated 10⁶ parameter values, along with the corresponding pairs of summary statistics. When creating the distance ρ(·, ·), we used the two following versions:

ρ1(S(x0), S(x)) = |med(x0) − med(x)| / mad(med(X)) + |mad(x0) − mad(x)| / mad(mad(X)) ,

ρ2(S(x0), S(x)) = |med(x0) − med(x)| / mad(med(X)) + |log mad(x0) − log mad(x)| / mad(log mad(X)) ,

where the denominators are computed from the reference table in order to scale the components properly. Figure 1 shows the impact of the choice of this distance, but even more clearly the discrepancy between inference based on the ABC output and the true inference on (µ, σ²).
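A sketch of these two scaled distances in Python (our transcription; med_ref and mad_ref stand for the arrays of simulated medians and mads from the reference table, and s0, s are (median, mad) pairs):

import numpy as np

def mad(v):
    """Median absolute deviation, used here purely as a scale estimate."""
    return np.median(np.abs(v - np.median(v)))

def rho1(s0, s, med_ref, mad_ref):
    return (abs(s0[0] - s[0]) / mad(med_ref)
            + abs(s0[1] - s[1]) / mad(mad_ref))

def rho2(s0, s, med_ref, mad_ref):
    return (abs(s0[0] - s[0]) / mad(med_ref)
            + abs(np.log(s0[1]) - np.log(s[1])) / mad(np.log(mad_ref)))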

The discrepancy can however be completely eliminated by post-processing: Figure 2 reproduces Figure 1 by comparing the histograms of an ABC sample with the version corrected by Beaumont et al.'s (2002) local regression, as the latter is essentially equivalent to a weighted linear regression adjustment of the simulated parameters on the simulated summaries.
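For concreteness, here is a minimal sketch of this local-linear correction (our rendering of the regression adjustment, with Epanechnikov weights; variable names are ours): the accepted θ's are regressed on the accepted summaries s and shifted to the observed summary s0.

import numpy as np

def beaumont_adjust(thetas, S, s0, dists, eps):
    """Local-linear ABC correction: theta* = theta - beta_hat^T (s - s0)."""
    w = np.clip(1.0 - (np.asarray(dists) / eps) ** 2, 0.0, None)  # Epanechnikov weights
    D = np.column_stack([np.ones(len(S)), S - s0])                # intercept + centred summaries
    WD = D * w[:, None]                                           # weighted design
    beta = np.linalg.solve(D.T @ WD, WD.T @ np.asarray(thetas))   # weighted least squares
    return np.asarray(thetas) - (S - s0) @ beta[1:]               # adjusted parameter sample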


Fig. 1 Comparison of the posterior distributions on µ (left) and σ (right) when using ABC Algorithm 3 with distance ρ1 (top) and ρ2 (central), and when using a standard Gibbs sampler (bottom). All three samples are based on the same number of subsampled parameters. The dataset is a N(3, 2²) sample and the tolerance value ε corresponds to α = .5% of the reference table.

Fig. 2 Comparison of the posterior distributions on µ (left) and σ (right) when using ABC Algorithm 3 with distance ρ1 (top), a post-processed version by Beaumont et al.'s (2002) local regression (central), and when using a standard Gibbs sampler (bottom). The simulation setting is the same as in Figure 1.

Barber et al. (2013) study the rate of convergence for ABC algorithms through the mean square error when approximating a posterior moment. They show the convergence rate is of order O(n^{−2/(q+4)}), where q is the dimension of the ABC summary statistic, associated with an optimal tolerance in O(n^{−1/4}). Those rates are connected with the nonparametric nature of ABC, as already suggested in the earlier literature: for instance, Blum (2010) links ABC with standard kernel density non-parametric estimation and finds a tolerance (re-expressed as a bandwidth) of order n^{−1/(q+4)} and an MSE of order n^{−2/(q+4)} as well, while Fearnhead and Prangle (2012) obtain similar rates, with a tolerance of order n^{−1/(q+2)} for noisy ABC. See also Calvet and Czellar (2014). Similarly, Biau et al. (2014) obtain precise convergence rates for ABC interpreted as a k-nearest-neighbour estimator.
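These bandwidth-type orders can be motivated by a crude bias-variance balance (a heuristic sketch under standard second-order kernel assumptions, not the precise conditions of the papers above): a tolerance ε contributes a squared bias of order ε^4, while the Monte Carlo variance of a moment estimate based on n proposals is of order 1/(n ε^q), so that

MSE(ε) ≍ c1 ε^4 + c2 / (n ε^q) ,   minimised at ε* ≍ n^{−1/(q+4)} ,

which recovers the bandwidth order found by Blum (2010).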

Lee and Latuszyński (2014) have also produced precise characterisations of the geometric ergodicity, or lack thereof, of ABC-MCMC algorithms. They consider four versions of ABC algorithms, ranging from the standard ABC-MCMC (with N replicates of the simulated pseudo-data for each simulated parameter value) to versions involving simulations of the replicates repeated at the subsequent step, the use of a stopping rule in the generation of the pseudo-data, and a "gold-standard" algorithm based on the (unavailable) measure of an ε-ball around the data. Based on a result by Roberts and Tweedie (1996b), also used in Mengersen and Tweedie (1996), namely that an MCMC chain cannot be geometrically ergodic when there exist almost-absorbing states, they derive that (under some technical assumptions) the first two versions above cannot be variance-bounding (i.e., their spectral gap is zero), while the last two versions can be both variance-bounding and geometrically ergodic under some appropriate conditions on the prior and the above ball measure. This result is rather striking in that simulating a random number of auxiliary variables is sufficient to produce geometric ergodicity. We note that this result does not contradict the parallel result of Bornn et al. (2014), who establish that there is no efficiency gain in simulating N > 1 replicates of the pseudo-data, since there is no randomness involved in that approach. However, the latter approach only applies to functions with finite variances.

When testing hypotheses and selecting models, the Bayesian approach relies on modelling hypotheses and model indices as part of the parameter, and hence ABC naturally operates at this level as well, as demonstrated in Algorithm 4, following Cornuet et al. (2008), Grelaud et al. (2009) and Toni et al. (2009). In fields like population genetics, model choice and hypothesis validation is presumably the primary motivation for using ABC methods, as exemplified in Belle et al. (2008); Cornuet et al. (2010); Excoffier et al. (2009); Ghirotto et al.
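A sketch of this model-choice scheme in Python (our illustration in the spirit of the algorithm referenced above, not a transcription of it; all callables are hypothetical placeholders): the model index is simulated along with the parameters, and posterior model probabilities are estimated by acceptance frequencies.

import numpy as np

def abc_model_choice(x0, models, model_prior, summary, rho, N_ref, k, rng):
    """models: list of (prior_rvs, model_rvs) pairs, one per candidate model."""
    s0 = summary(x0)
    ms, dists = [], []
    for _ in range(N_ref):
        m = rng.choice(len(models), p=model_prior)   # model index ~ prior
        prior_rvs, model_rvs = models[m]
        x = model_rvs(prior_rvs(rng), rng)           # simulate within model m
        ms.append(m)
        dists.append(rho(s0, summary(x)))
    keep = np.argsort(dists)[:k]                     # k closest simulations
    return np.bincount(np.asarray(ms)[keep], minlength=len(models)) / k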

