Bayesian inversion by parallel interacting Markov chains


HAL Id: hal-00505406

https://hal-mines-paristech.archives-ouvertes.fr/hal-00505406

Submitted on 23 Jul 2010



Thomas Romary

To cite this version:

Thomas Romary. Bayesian inversion by parallel interacting Markov chains. Inverse Problems in Science and Engineering, Taylor & Francis, 2010, 18 (1), pp.111-130. ⟨10.1080/17415970903234620⟩.

⟨hal-00505406⟩


Thomas Romary∗

Centre de Géosciences, Equipe Géostatistique, Ecole des Mines de Paris,

35 rue Saint-Honoré,

77300 Fontainebleau, France

April 15, 2009

Abstract

Markov chain Monte-Carlo (MCMC) methods are known to produce samples of virtually any distribution. They have already been widely used in the resolution of non-linear inverse problems where no analytical expression for the forward relation between data and model parameters is available, and where linearization is unsuccessful. However, in Bayesian inversion, the total number of simulations we can afford is highly related to the computational cost of the forward model. Hence, the complete browsing of the support of the posterior distribution is hardly performed at final time, especially when the posterior is high dimensional and/or multimodal. In the latter case, the chain may stay stuck in one of the modes. Recently, the idea of making several Markov chains interact at different temperatures has been explored. These methods improve the mixing properties of classical single MCMC. Furthermore, these methods can make efficient use of large CPU clusters, without increasing the global computational cost with respect to classical MCMC.

Keywords

Inverse problem; Bayesian inversion; MCMC; interacting Markov chains; tempering; history matching

∗ Corresponding author. Email: thomas.romary@mines-paristech.fr

1 Introduction

Monte-Carlo methods are becoming increasingly important for the solution of nonlinear inverse problems. Typically, the inverse problem is formulated as a search for solutions fitting the data within a certain tolerance, given by data uncertainties. In a non-probabilistic setting this means that we search for solutions with calculated data whose distance from the observed data is less than a fixed, positive number. In a Bayesian context, the tolerance is soft: a large number of samples of statistically near-independent models from the a posteriori probability distribution are sought. Such solutions are consistent with data and prior information, as they fit the data within error bars, and adhere to soft prior constraints given by a prior probability distribution.

Precisely, we consider the study of a system $X \in \mathcal{X}$, on which we have an indirect measurement $d$, that is a function of the state of $X$, modeled by $F(X)$, and some a priori information under the form of the prior distribution $P(X)$. We also consider that the measurement $d$ is affected by an error and that we know how to simulate $F$ up to an approximation error, both errors being accounted for by $P(d \mid X)$. We also define the joint distribution $P(d, X)$. Then, assuming that all these distributions admit a density with respect to the Lebesgue measure, denoted $f(\cdot)$, the conditional density of $X$ with respect to $d$ takes the following form:

$$f(X \mid d) = \frac{f(d \mid X)\, f(X)}{\int_{\mathcal{X}} f(d, X)\, dX}. \qquad (1)$$

This is the Bayesian formulation of the inverse problem, and $P(X \mid d)$, whose density is $f(X \mid d)$, is the posterior distribution, see [1]. Formula (1) shows that this problem can be viewed as a classical statistical inference problem, where we want to sample independent realizations from the posterior distribution. Note that the normalization constant in (1) is generally intractable in high-dimensional problems. Therefore, we consider that the posterior is known up to a constant, being defined from the prior knowledge on the system studied and the data with its associated measurement error.
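As a minimal sketch of what "known up to a constant" means in practice, the following Python fragment evaluates the numerator of (1) on a log scale. The scalar forward model `forward`, the Gaussian prior and error model, and all numerical values are illustrative assumptions, not taken from the paper.

```python
def forward(x):
    """Hypothetical toy forward model F (an assumption, not from the paper)."""
    return x ** 2

def log_prior(x, mean=0.0, std=2.0):
    """Log of an assumed Gaussian prior density f(X), up to an additive constant."""
    return -0.5 * ((x - mean) / std) ** 2

def log_likelihood(x, d, noise_std=0.5):
    """Log of f(d | X) for an assumed additive Gaussian error, up to a constant."""
    return -0.5 * ((d - forward(x)) / noise_std) ** 2

def log_posterior_unnorm(x, d):
    """log f(d | X) + log f(X): the numerator of (1); the denominator is never computed."""
    return log_likelihood(x, d) + log_prior(x)

d_obs = 4.0  # hypothetical observed datum
# The toy forward model is even, so x = 2 and x = -2 explain the datum equally well:
# the resulting posterior has two separated modes.
values = {x: log_posterior_unnorm(x, d_obs) for x in (-2.0, 0.0, 2.0)}
```

Even in this one-dimensional toy case the posterior is bimodal, which is the situation the rest of the paper is concerned with.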

There exist several methods for solving (1), such as the Kitanidis-Oliver algorithm (see [2] and [3]), developed for petroleum engineering applications, and the neighbourhood algorithm ([4] and [5]), developed for geophysical inverse problems. In spite of its universality, the speed of convergence of the first one is controversial: it consists in performing a large number of optimizations with an observed datum perturbed according to its measurement error. It is particularly difficult to know how many optimizations should be performed. The second one seems to be limited to low-dimensional problems: it can be seen as a geometric version of an iterated importance sampling scheme (see e.g. [6], chapter 14). This article focuses on Markov chain Monte-Carlo (MCMC) methods for their universality and the relative ease of their implementation.

MCMC methods are indeed particularly well suited to this problem, as they are known to produce samples of virtually any posterior distribution. Two problems may then arise. On the one hand, the dimension of the problem may be so large that the chain has to be run for an intractable number of iterations to converge and to achieve an efficient sampling of the posterior; such chains are said to have weak mixing properties. On the other hand, an evaluation of the forward operator $F$ can be very computationally demanding, so that the practitioner wishes to minimize the number of iterations. Moreover, when the posterior has several disconnected modes in a high-dimensional space, which is often the case in nonlinear Bayesian inversion, the problem of exploring the whole support of the posterior is a difficult one. It can be shown that even for very simple problems most classical Markov chain algorithms can fail at identifying the main modes of the posterior, because of their lack of mixing (see [7]).


[...] collection of chains in parallel at different temperatures and allowing them to interact. This method is not more computationally demanding than classical MCMC since it can be easily parallelized.

This paper aims at providing researchers and engineers with some recipes to apply interacting MCMC methods. Thus, it begins in section 2 with the basics of MCMC methods, some examples of classical algorithms and earlier attempts to improve mixing properties, like annealing and tempering techniques, which rely on the same basic principles as the interacting MCMC techniques exposed in section 3. In section 4, we will show an application to a reservoir engineering problem. The paper ends with some conclusions and perspectives for future work.

2 Markov chains Monte-Carlo methods

MCMC, introduced by Metropolis et al. [8], is a popular method for generating samples from virtually any distribution $\pi$ defined on $(\mathcal{X}, \mathcal{B}(\mathcal{X}))$, where $\mathcal{B}(\mathcal{X})$ stands for the Borel sets of $\mathcal{X}$. In particular there is no need for the normalizing constant of $\pi$ to be known and the space $\mathcal{X} \subseteq \mathbb{R}^d$ (for some integer $d$) on which it is defined can be high dimensional. We recall here some classical results on MCMC methods. For a comprehensive review of MCMC, see [6], chapters 6 to 13. For a more detailed account of Markov chain theory, see [9].

2.1 Principles

The method consists in simulating an ergodic Markov chain $\{X_n, n \geq 0\}$ on $\mathcal{X}$ with transition probability $P$ such that $\pi$ is a stationary density for this chain, i.e. $\forall A \in \mathcal{B}(\mathcal{X})$:

$$\int_{\mathcal{X}} P(x, A)\, \pi(x)\, dx = \pi(A). \qquad (2)$$

Such samples can be used e.g. to compute integrals

$$\pi(h) = \int_{\mathcal{X}} h(x)\, \pi(x)\, dx, \qquad (3)$$

estimating this quantity by

$$S_n(h) = \frac{1}{n} \sum_{i=1}^{n} h(X_i), \qquad (4)$$

for some $h : \mathcal{X} \to \mathbb{R}$. A very useful concept in constructing ergodic Markov chains is reversibility. A Markov chain is reversible if it satisfies the detailed balance condition:

$$P(x, dy)\, \pi(dx) = P(y, dx)\, \pi(dy). \qquad (5)$$

This means that, if started in stationarity, the Markov chain has the same chance of starting at $x$ and jumping to $y$ as starting at $y$ and jumping to $x$.
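To make the link between reversibility (5) and stationarity (2) concrete, here is a small numerical check on a finite state space, where the integrals become sums. The three-state target `pi` and the uniform symmetric proposal are illustrative assumptions; the paper itself works on a continuous space.

```python
pi = [0.2, 0.3, 0.5]          # hypothetical target distribution on 3 states
n = len(pi)

# Metropolis kernel with a symmetric proposal: pick one of the other states uniformly.
P = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i != j:
            q_ij = 1.0 / (n - 1)
            P[i][j] = q_ij * min(1.0, pi[j] / pi[i])
    P[i][i] = 1.0 - sum(P[i])   # rejected proposals keep the chain in place

# Discrete analogue of detailed balance (5): pi_i P_ij = pi_j P_ji for every pair.
detailed_balance = all(
    abs(pi[i] * P[i][j] - pi[j] * P[j][i]) < 1e-12
    for i in range(n) for j in range(n)
)

# Discrete analogue of stationarity (2): sum_i pi_i P_ij = pi_j.
pi_next = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
stationary = all(abs(a - b) < 1e-12 for a, b in zip(pi_next, pi))
```

Summing the detailed balance identity over the starting state gives stationarity, which is why reversible kernels are a convenient way to build chains with a prescribed invariant distribution.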

We illustrate the principles of MCMC with the Metropolis-Hastings (MH) update. It requires the choice of a proposal distribution $q$. The role of $q$ consists in proposing potential transitions for the Markov chain. Given that the chain is currently at $x$, a candidate $y$ is accepted with probability $\alpha(x, y)$ defined as:

$$\alpha(x, y) = \begin{cases} \min\left\{1, \dfrac{\pi(y)}{\pi(x)} \dfrac{q(y, x)}{q(x, y)}\right\} & \text{if } \pi(x)\, q(x, y) > 0, \\ 1 & \text{otherwise.} \end{cases} \qquad (6)$$


Otherwise, it is rejected and the Markov chain stays at its current location $x$. The transition kernel $P$ of this Markov chain takes the form, for $(x, A) \in \mathcal{X} \times \mathcal{B}(\mathcal{X})$:

$$P(x, A) = \int_A \alpha(x, y)\, q(x, y)\, dy + \mathbb{1}_A(x) \int_{\mathcal{X}} (1 - \alpha(x, y))\, q(x, y)\, dy. \qquad (7)$$

The Markov chain defined by $P$ is reversible with respect to $\pi$ and therefore admits $\pi$ as invariant distribution. Conditions on the proposal distribution $q$ that guarantee irreducibility and positive recurrence are easy to meet and many satisfactory choices are possible.
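The MH update can be sketched in a few lines of Python. This is a minimal illustration working on the log scale; the function names, the standard normal target and the proposal scale `SIGMA` are assumptions made for the example, and the sketch supposes that the target is strictly positive wherever the chain goes, so the "otherwise" branch of (6) is not needed.

```python
import math
import random

def metropolis_hastings(log_pi, propose, log_q, x0, n_iter, rng):
    """Generic MH chain. log_pi is the unnormalized log target,
    propose(x, rng) draws y from q(x, .), log_q(x, y) returns log q(x, y)."""
    x, chain, n_accept = x0, [], 0
    for _ in range(n_iter):
        y = propose(x, rng)
        # log of the ratio pi(y) q(y, x) / (pi(x) q(x, y)) appearing in (6)
        log_ratio = log_pi(y) + log_q(y, x) - log_pi(x) - log_q(x, y)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x, n_accept = y, n_accept + 1   # accept the candidate
        chain.append(x)                      # on rejection the chain stays at x
    return chain, n_accept / n_iter

# Illustrative target: standard normal, known only up to its normalizing constant.
def log_pi(x):
    return -0.5 * x * x

SIGMA = 1.0  # hypothetical proposal scale

def propose(x, rng):
    return x + rng.gauss(0.0, SIGMA)

def log_q(x, y):
    return -0.5 * ((y - x) / SIGMA) ** 2     # symmetric, so it cancels in the ratio

chain, acc_rate = metropolis_hastings(log_pi, propose, log_q, 0.0, 20000, random.Random(0))
mean = sum(chain) / len(chain)               # estimator (4) with h(x) = x
var = sum((v - mean) ** 2 for v in chain) / len(chain)
```

Only differences of `log_pi` enter the acceptance test, which is why the normalizing constant of the target is never required.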

2.2 Some examples of Metropolis-Hastings samplers

The arbitrariness of the choice of $q(x, \cdot)$ allows considerable freedom to design a multitude of different chains, each with stationary distribution $\pi$, although in the Bayesian inversion framework, $q$ should rely on the a priori distribution. Some examples include (see [6], chapter 7, for more examples):

1. the independent sampler (IMH): $q(x, y) = q(y)$, where $q$ is generally the prior in Bayesian inversion,

2. the symmetric increments random-walk sampler (SIMH): $q(x, y) = q(|y - x|)$, where $q$ can be a zero-mean version of the prior,

3. the Langevin sampler (LMH): assuming that $\pi$ is differentiable on $\mathcal{X}$, it takes advantage of the gradient information to give the sampling direction; $q$ takes the form:

$$q(x, y) \sim \mathcal{N}\left(x + \frac{h^2}{2} \nabla \log(\pi(x)),\ h^2 I_d\right), \qquad (8)$$

where $h$ is a parameter to choose according to e.g. [10] or [11]. Note that a bad choice of $h$ can induce erratic behaviour of the chain,

4. the adaptive algorithm of [12] (ASIMH): in this algorithm, $y$ is proposed according to $q_{\theta_n}(x, \cdot) = \mathcal{N}(x, \Gamma_n)$, where $\theta = (\mu, \Gamma)$. We also consider a non-increasing sequence of positive step sizes $\{\gamma_n\}$, such that $\sum_{n=1}^{\infty} \gamma_n = \infty$ and $\sum_{n=1}^{\infty} \gamma_n^{1+\delta} < \infty$ for some $\delta > 0$. In practice, we generally use $\gamma_n = 1/n$, as suggested in [12]. The parameter estimation algorithm takes the following form:

$$\begin{aligned} \mu_{n+1} &= \mu_n + \gamma_{n+1} (X_{n+1} - \mu_n), \quad n \geq 0, \\ \Gamma_{n+1} &= \Gamma_n + \gamma_{n+1} \left( (X_{n+1} - \mu_n)(X_{n+1} - \mu_n)^{tr} - \Gamma_n \right), \end{aligned} \qquad (9)$$

5. the Gibbs sampler: here $\mathcal{X} = \mathcal{X}_1 \times \ldots \times \mathcal{X}_d$, and $q = q_i$ leaves all coordinates fixed except the $i$th one, which it proposes according to the conditional distribution $(x_i \mid \{x_j\}_{j \neq i})$. This implies that $\alpha(x, y) = 1$ for all $x$ and $y$, so there are no rejections. If the resulting $i$th component Gibbs sampler is called $P_i$, then these components can be combined to yield the random-scan Gibbs sampler, which is the average $P_{RS} = \frac{1}{d}(P_1 + \ldots + P_d)$, or the deterministic-scan Gibbs sampler, which is the product $P_{DU} = P_1 \cdots P_d$.
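As an illustration of the adaptive sampler in the list above, the recursion (9) can be written as a short update function. The sketch below is restricted to the scalar case $d = 1$ (an assumption made to avoid matrix code, so the outer product reduces to a square) and is fed with a hypothetical list of chain values rather than an actual adaptive chain.

```python
def adapt(mu, cov, x_new, step):
    """One step of recursion (9) in the scalar case (d = 1)."""
    diff = x_new - mu                     # X_{n+1} - mu_n: the old mean is used in both updates
    mu_next = mu + step * diff
    cov_next = cov + step * (diff * diff - cov)
    return mu_next, cov_next

samples = [1.0, 3.0, 2.0, 6.0]            # hypothetical chain values X_1, ..., X_4
mu, cov = 0.0, 1.0                         # arbitrary initial theta_0 = (mu_0, Gamma_0)
for n, x in enumerate(samples):
    mu, cov = adapt(mu, cov, x, step=1.0 / (n + 1))   # gamma_{n+1} = 1 / (n + 1)
```

With the step sizes $\gamma_n = 1/n$, the first update overwrites $\mu_0$ and $\mu_n$ is then exactly the running mean of the chain values seen so far, which gives some intuition for why the proposal covariance progressively adapts to the shape of the target.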

2.3 Comments

One of the problems with Metropolis-Hastings algorithms is the abundance of choice available for choosing the proposal distribution $q(x, \cdot)$. For instance, even if the type of algorithm [...] appropriate for $\pi(\cdot)$. Such a problem is known as a scaling problem. To make this question more concrete, consider the following problem. Suppose that $q(x, \cdot)$ is distributed as the $d$-dimensional normal distribution $\mathcal{N}(x, \sigma^2 I_d)$, for some $\sigma^2 > 0$. We recall that the acceptance probabilities for this algorithm are given by (6). For very small values of $\sigma^2$, small jumps are attempted by the algorithm, and because of the form of (6), these moves are almost always accepted. The Markov chain mixes very slowly because its increments are so small. On the other hand, if $\sigma^2$ is chosen to be very large, long distance jumps are attempted by the algorithm, most of which are rejected. The algorithm therefore spends long periods of time in the same state, and thus still converges slowly. For this problem, "very large" and "very small" have to be interpreted in a way related to the particular form of $\pi$. It seems reasonable that "moderate" values of $\sigma^2$ should be preferred. However, it is difficult to see how to figure out what values are "moderate", especially if $\pi$ is very complicated. In the Bayesian inversion context, the random-walk type algorithms, like the SIMH or the LMH, generally fail at identifying different modes. In high-dimensional spaces, the scaling factor generally has to be "small" so as to get an acceptable acceptance rate (6). Therefore, these two algorithms generally perform a local exploration and are hardly able to jump from one mode to another. We will then refer to them as "local" samplers. Note that they are also generally really slow to converge towards the stationary regime in the Bayesian inversion context.
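The scaling problem can be observed numerically. The sketch below runs a random-walk Metropolis chain on a one-dimensional standard normal target (an illustrative assumption) for three proposal scales and records the acceptance rate; the particular values of `sigma`, the chain length and the seed are arbitrary choices made for the example.

```python
import math
import random

def rw_acceptance_rate(sigma, n_iter=5000, seed=0):
    """Acceptance rate of random-walk Metropolis on an assumed N(0, 1) target."""
    rng = random.Random(seed)
    x, n_accept = 0.0, 0
    for _ in range(n_iter):
        y = x + rng.gauss(0.0, sigma)               # proposal N(x, sigma^2)
        log_ratio = -0.5 * y * y + 0.5 * x * x      # log pi(y) - log pi(x)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x, n_accept = y, n_accept + 1
    return n_accept / n_iter

# "very small", "moderate" and "very large" scales relative to the target's unit spread
rates = {sigma: rw_acceptance_rate(sigma) for sigma in (0.01, 2.4, 100.0)}
```

A high acceptance rate is therefore not a sign of good mixing on its own: the smallest scale accepts almost every move while barely displacing the chain.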

Conversely, the IMH does not need any tuning. It will explore largely the surface of the posterior distribution and we will refer to it as a "global" sampler. Nevertheless, in practical applications, unless $q$ is the posterior distribution, the transitions will obviously almost always be rejected.

Finally, we can notice here that the chain generated by the adaptive algorithm is no longer homogeneous, but it can be proved (see [12], [13] and [14] in a more general framework) that it has the correct ergodic properties. The idea of adaptive sampling is to improve the proposal efficiency, making it as close as possible to the posterior density. However, it should be stressed here that the algorithm presented above generally fails in a multi-modal context for a low number of iterations (see e.g. [7]). Regarding the Gibbs sampler, it does not seem to be well adapted to the Bayesian inversion problem: the large number of calls to the forward model limits its relevance.

Due to the sequential nature of MCMC algorithms and to tackle multi-modality problems, MCMC practitioners generally use several chains that they run in parallel. By simulating several chains, variability and dependence on the initial value are reduced, and it should be easier to control convergence to the stationary distribution by comparing the estimates of quantities of interest obtained from different chains. However, good performance of these parallel methods requires a degree of a priori knowledge on the distribution of interest $\pi$, in order to construct an initial distribution on $\mathcal{X}$ which takes into account the features of $\pi$ (modes, shape of high density regions, etc.). This is rarely the case in Bayesian inversion. Moreover, in highly non-linear setups, like in Bayesian inversion, a slow mixing chain will presumably stay in the neighborhood of the starting point with a high probability (see [6], chapter 12, for a more thorough discussion).

Due to the complexity of the posterior distribution (e.g. multi-modality and/or disconnected support) in Bayesian inversion problems and the classical limitations of MH algorithms, methods other than the classical MH algorithm should be investigated. Simulated annealing and tempering, which are presented in the next paragraph, consist in studying modified versions of the posterior.

2.4 Simulated annealing and tempering

The simulated annealing algorithm was introduced by [8], then generalized by [15] for optimization problems. It can be applied to both optimization and simulation problems (see [6] and references therein). Simulated tempering was introduced independently in [16] and [17].

The fundamental idea of these algorithms is that a change of scale, named temperature, allows larger moves on the surface of the distribution to explore, compared with classical MCMC methods. Indeed, this change of scale prevents the chain from remaining trapped in a local mode.

The name and inspiration of the first one come from annealing in metallurgy, a technique involving heating and controlled cooling of a material to increase the size of its crystals and reduce their defects. The heat causes the atoms to become unstuck from their initial positions (a local minimum of the internal energy) and wander randomly through states of higher energy; the slow cooling gives them more chances of finding configurations with lower internal energy than the initial one. Conversely, tempering is a brutal cooling followed by a controlled reheating of the workpiece to a temperature below its lower critical temperature. Precipitation hardening alloys, like many grades of aluminum and superalloys, are tempered to precipitate intermetallic particles which strengthen the metal.

These two methods aim particularly at generating samples from Gibbs distributions.

Definition 2.1 An $\mathcal{X}$-valued random field $X$ is a Gibbs field of energy $E$ if its probability density function (with respect to the Lebesgue measure) is:

$$f(x) = \frac{1}{Z} e^{-E(x)}, \quad Z = \int_{\mathcal{X}} e^{-E(x)}\, dx, \qquad (10)$$

named Gibbs density.

For practical problems, the constant $Z$ is generally intractable due to the dimension of $\mathcal{X}$. Note that in a wide variety of inverse problems, the posterior distribution (1) takes the form (10); for instance when both prior and measurement error are assumed Gaussian. Note also that simulated annealing and tempering are not confined to Gibbs distributions. We present here both algorithms in this framework for the sake of simplicity. For other kinds of target distributions $\pi$, the practitioner has to consider flattened versions given by $\pi_T = \pi^{1/T}$.
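To see why a Gaussian prior and a Gaussian measurement error lead to the Gibbs form (10), one can write the energy explicitly. The following worked expression assumes an additive error model $d = F(X) + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \Sigma)$ and a prior $X \sim \mathcal{N}(m, C)$; the symbols $\varepsilon$, $\Sigma$, $m$ and $C$ are notations introduced here for illustration and do not appear in the paper.

```latex
f(X \mid d) \propto \exp\bigl(-E(X)\bigr), \qquad
E(X) = \tfrac{1}{2}\,(d - F(X))^{tr}\,\Sigma^{-1}\,(d - F(X))
     + \tfrac{1}{2}\,(X - m)^{tr}\,C^{-1}\,(X - m).
```

Under these assumptions the flattened density $\pi^{1/T}$ is proportional to $\exp(-E(X)/T)$, so raising the temperature amounts to inflating both the error and the prior covariances by the factor $T$.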

2.4.1 Simulated annealing

Given a positive temperature $T$, a Markov chain $X$ is generated from the following Gibbs density:

$$\pi_T(x) \propto \exp(-E(x)/T). \qquad (11)$$

The simulated annealing is performed by gradually lowering the temperature $T$ from a high value to near-zero. Close to $T = 0$ the Gibbs distribution approximates a delta function at the global minimum of $E(x)$ (if it is unique). For simulation purposes, the cooling can be stopped at the value $T = 1$.

This algorithm can be viewed as a non-homogeneous version of the MH algorithm. Indeed, since $T$ decreases along the algorithm, the kernel of the chain varies with time. Classical theoretical results on Markov chains do not apply to this algorithm. Heuristic rules are generally applied to ensure the validity of simulated annealing: the starting temperature must be high enough and its decrease slow enough. A convergence result exists, for optimization purposes, under the condition that $T$ decreases as $1/\log(n)$. In practice, a geometrically decreasing sequence is generally used.
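A minimal sketch of simulated annealing with a geometric cooling schedule is given below. The double-well energy, the starting temperature, the cooling rate and the proposal step are all illustrative assumptions chosen for the example, not values recommended by the paper.

```python
import math
import random

def energy(x):
    """Hypothetical double-well energy: local minimum near x = 1, global minimum near x = -1."""
    return (x * x - 1.0) ** 2 + 0.3 * x

def simulated_annealing(x0, t0=5.0, rate=0.998, n_iter=5000, step=0.5, seed=1):
    rng = random.Random(seed)
    x, temp = x0, t0
    for _ in range(n_iter):
        y = x + rng.gauss(0.0, step)               # symmetric random-walk proposal
        # MH acceptance for the tempered density (11), pi_T(x) proportional to exp(-E(x) / T)
        log_ratio = -(energy(y) - energy(x)) / temp
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = y
        temp *= rate                                # geometric cooling: T_{n+1} = rate * T_n
    return x, temp

x_final, t_final = simulated_annealing(x0=2.0)
```

Because the temperature changes at every iteration, the kernel differs from one step to the next, which is the non-homogeneity mentioned above. For simulation rather than optimization, the loop would instead stop cooling once the temperature reaches 1.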

2.4.2 Simulated tempering

The principle of simulated tempering is linked to that of simulated annealing in the sense that we will again consider Gibbs distributions scaled by a temperature parameter $T$. However, this algorithm aims at sampling from a Gibbs distribution $\pi$ rather than minimizing the energy of the system. We consider here a finite sequence of temperatures and the associated Gibbs distributions. In this algorithm, we allow the chain to change temperature level according to a given probability. This will allow the chain to go back to higher temperatures, escaping possible local modes of the target distribution, which results in better mixing.

We first define an increasing sequence of temperatures $1 = T_0 < \ldots < T_K$, with its associated Gibbs densities $\pi_i(x) \propto e^{-E(x)/T_i}$, an auxiliary $\{0, \ldots, K\}$-valued variable $M$, and the joint distribution:

$$\mu(x, m) = \rho_m\, \pi_m(x), \quad \sum_{m=0}^{K} \rho_m = 1. \qquad (12)$$

We also define the probabilities $p_U$ and $p_D$ of moving "up", from $m$ to $m + 1$, and "down", from $m$ to $m - 1$, with only $T$ changing, the chain being at temperature $T_m$, and the probability of choosing a fixed level move $(1 - p_U - p_D)$.

The principle is to simulate an $\mathcal{X} \times \{0, \ldots, K\}$-valued chain $(X_n, M_n)$. Denoting respectively $q_{i \to i+1}(X_{n+1} \mid X_n = x_n)$ and $q_{i+1 \to i}(X_{n+1} \mid X_n = x_n)$ the probabilities of transition proposition towards the superior and the inferior temperature level, the acceptance probability of a transition from $T_i$ to $T_{i+1}$ is proportional to:

$$\rho_{i \to i+1}(x_n, x_{n+1}) = \frac{p_D}{p_U}\, \frac{\pi_{i+1}}{\pi_i}\, \frac{q_{i \to i+1}(X_{n+1} = x_{n+1} \mid X_n = x_n)}{q_{i+1 \to i}(X_{n+1} = x_{n+1} \mid X_n = x_n)}. \qquad (13)$$

To maintain the detailed balance condition, it is then necessary that $\rho_{i+1 \to i}(x_{n+1}, x_n) = 1/\rho_{i \to i+1}(x_n, x_{n+1})$ and to choose the proposition distributions $q_{i \to i+1}(X_{n+1} \mid X_n = x_n)$ and $q_{i+1 \to i}(X_{n+1} \mid X_n = x_n)$ accordingly. Still, it is important to notice that (13) depends on the normalization constants of $\pi_{i+1}$ and $\pi_i$. We can bypass this difficulty by designing an equal number of moves from $m$ to $m + 1$ and from $m + 1$ to $m$ and by accepting the entire sequence as a single proposal, thus canceling the normalizing constants in the acceptance probability, as described e.g. in [18].
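The dependence of (13) on the normalization constants can be made explicit by expanding the ratio of two Gibbs densities at the same point $x$. The notation $Z_i$ for the normalizing constant at temperature $T_i$ is introduced here by analogy with (10) and is not used in the paper.

```latex
\frac{\pi_{i+1}(x)}{\pi_i(x)}
  = \frac{Z_i}{Z_{i+1}}\,
    \exp\!\left(-E(x)\left(\frac{1}{T_{i+1}} - \frac{1}{T_i}\right)\right),
\qquad
Z_i = \int_{\mathcal{X}} e^{-E(x)/T_i}\,dx .
```

The exponential factor only requires the energy at $x$, whereas the ratio $Z_i / Z_{i+1}$ is unknown in general. A sequence containing as many upward as downward moves between the same pair of levels multiplies this ratio by its inverse, which is how the constants cancel.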

2.4.3 Comments

Attempts to improve the mixing properties of the chain by simulated annealing generally fail because of the monotonic decrease in temperature; if the chain gets into a local mode, it may be impossible to escape it if the temperature is already too low.

Concerning simulated tempering, the potential gain in a better exploration of the support of the target distribution, that is to say, a better mixing, does not seem to compensate for the increased number of forward operator evaluations for inverse problems ($2K$ times bigger, for the scheme presented above). However, the presentation of this method is a good introduction to the interacting Markov chains algorithms exposed in the next section.

3 Parallel interacting Markov chains

The principle of making Markov chains interact first appeared in [19] under the name parallel tempering (PT). It has been mostly applied in physico-chemical simulations, see [20] and references therein. It is known in the literature under different names such as: exchange