
A versatile MCMC strategy for sampling posterior distributions of analytically intractable models

R. S. Stoica

Université Lille 1, Laboratoire Paul Painlevé, 59655 Villeneuve d'Ascq Cedex, France

P. Gregori

and J. Mateu

Universitat Jaume I

Campus Riu Sec, E-12071 Castellon, Spain

Abstract

This paper proposes a new versatile strategy for sampling posterior distributions of analytically intractable models. Building such samplers using Markov Chain Monte Carlo methodology usually leads to algorithms which are rather expensive from a computational point of view, and hence of limited use in practical applications. The strategy we propose overcomes this drawback and is easy to use. A direct application of the proposed method is shown and discussed: maximum likelihood estimation of the parameters of a spatial point process.

Résumé

This article proposes a flexible methodology for simulating the posterior laws of models whose normalising constants cannot be computed analytically. A naïve use of the Monte Carlo philosophy for this type of problem leads to algorithms that are costly in computation time and therefore of little practical use. The strategy we propose removes this difficulty and is, moreover, easy to use. We show and discuss a practical application: parameter estimation of a point process by the maximum likelihood method.

Mathematics Subject Classifications (2000): 60J22, 60G55

Keywords: computational methods in Markov chains, maximum likelihood estimation, point processes.

Note: part of the work of the first author was done during his stays at University Jaume I and INRA Avignon.


1 General context and presentation of the problem

Modern probability theory, together with the increasing performance of computers, nowadays allows the use of complex stochastic models in numerous application domains such as environmental sciences, astronomy or image analysis. The dynamics of a disappearing species, the distribution of galaxies in our universe, or brain activity images can be seen as realisations of stochastic processes controlled by some parameters. Scientists wish to answer two major questions. The first one is how the model behaves knowing that it is governed by some given parameters. The second one is the reverse of the first: what are the parameters of the model that can mimic an observed phenomenon?

In probabilistic terms, answers to the second question can be formulated if one can sample from the posterior law, i.e. the law of the model parameters conditioned on the observation of the phenomenon of interest.

There is always a balance between the capacity of a model to "catch" reality and its mathematical properties. Usually, realistic models are rather "complicated", hence their mathematical properties may look "not so appealing". One of the most common drawbacks exhibited by such models is that they are often analytically intractable. Therefore, computations related to such models cannot be done using exact mathematical formulae. The solution to that problem is to perform these computations by means of computer simulations.

These are the reasons why there is today a challenge in developing statistical methodology allowing inference for such complex models. The proposed solutions have to meet several requirements: mathematical rigour, simplicity and computational efficiency.

This is the context within which our paper is situated. In the following, the problem we address is introduced: building a versatile sampler for posterior distributions of analytically intractable models.

Let ν be the Lebesgue measure in R^d, Θ a compact subset of R^d of positive measure, and (Θ, T_Θ, ν) the natural restriction of the measure space (R^d, T, ν).

Let us consider a stochastic process Y, that is defined on a probability space (Ω, F, µ) and characterised by the probability density p(y|θ) with respect to the reference measure µ; θ ∈ Θ represents the model parameter vector. For the probability density we write

$$p(y|\theta) = \frac{f(y;\theta)}{c(\theta)},$$

with f(y;θ) : Ω × Θ → R_+ and c(θ) the normalising constant given by

$$c(\theta) = \int f(y;\theta)\,d\mu(y).$$

The model p(y|θ) is said to be analytically intractable if an analytical expression is available for f(y;θ) but not for c(θ). The usual conditions that f(y;θ) has to meet are: (a) positivity; (b) continuity with respect to θ; (c) µ-integrability, i.e. 0 < c(θ) < ∞. A typical example of stochastic processes meeting all these requirements is the exponential family, with unnormalised probability densities

$$f(y;\theta) = \exp\langle t(y), \theta\rangle, \qquad (1)$$

with t(y) the sufficient statistic vector and ⟨·,·⟩ the inner product.

The Bayesian framework enables us to write the posterior distribution of such models as follows

$$p(\theta|y) \propto p(y|\theta)\,p(\theta), \qquad (2)$$

where p(θ) is the prior distribution on the model parameters.

From a theoretical point of view, straightforward strategies for sampling p(θ|y) are available through the Monte Carlo philosophy. For instance, let us consider the Metropolis-Hastings algorithm. Assuming the system is in the state θ, this algorithm first chooses a new value ψ according to a proposal density q(θ → ψ). The value ψ is then accepted with probability α_i(θ → ψ) given by

$$\alpha_i(\theta\to\psi) = \min\left(1,\; \frac{p(\psi|y)}{p(\theta|y)}\,\frac{q(\psi\to\theta)}{q(\theta\to\psi)}\right). \qquad (3)$$

The transition kernel of the Markov chain simulated by such a Metropolis-Hastings dynamics has the following expression

$$P_i(\theta, A) = \int_\Theta \alpha_i(\theta\to\psi)\,q(\theta\to\psi)\,\mathbf{1}\{\psi\in A\}\,d\psi \;+\; \mathbf{1}\{\theta\in A\}\left[1 - \int_\Theta \alpha_i(\theta\to\psi)\,q(\theta\to\psi)\,d\psi\right].$$

The conditions that the proposal density q(θ → ψ) has to meet, so that the simulated Markov chain has as equilibrium distribution

$$\pi_i(A) = \int_A p(\theta|y)\,d\nu(\theta),$$

are rather mild ([10, 13, 14]). Nevertheless, a trivial choice for the proposal density runs into an important drawback: the computation of the normalising constants ratio in (3). The considered models are analytically intractable, so the quantity c(θ)/c(ψ) cannot be computed exactly. The usual remedy is to approximate the normalising constants ratio by its Monte Carlo counterpart.
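To make the drawback explicit, substituting the Bayesian decomposition (2) into the acceptance ratio of (3) gives (this display is merely the above argument spelled out)

$$\frac{p(\psi|y)}{p(\theta|y)}\,\frac{q(\psi\to\theta)}{q(\theta\to\psi)} = \frac{f(y;\psi)\,p(\psi)}{f(y;\theta)\,p(\theta)}\,\frac{c(\theta)}{c(\psi)}\,\frac{q(\psi\to\theta)}{q(\theta\to\psi)},$$

so that, for a generic proposal, the intractable ratio c(θ)/c(ψ) remains in the acceptance probability.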

So, in theory, such a sampling dynamics simulates a Markov chain having π_i as stationary distribution. Throughout this paper, we call this kind of Markov chain an ideal Markov chain. Yet, its implementation requires inner computations based again on Monte Carlo techniques, which makes the overall strategy computationally heavy.

Now, let us consider the following proposal density

$$q(\theta\to\psi) = q(\theta\to\psi\,|\,x) = \frac{f(x;\psi)/c(\psi)}{I(\theta,\Delta,x)}\,\mathbf{1}_{b(\theta,\Delta/2)}\{\psi\}, \qquad (4)$$

with ∆ > 0 and x an outcome of the stochastic process on (Ω, F, µ) driven by the probability density p(x|θ). Here 1_{b(θ,∆/2)}{·} is the indicator function over b(θ,∆/2), the ball of centre θ and radius ∆/2, and I(θ,∆,x) is the quantity given by the integral

$$I(\theta,\Delta,x) = \int_{b(\theta,\Delta/2)} \frac{f(x;\phi)}{c(\phi)}\,d\phi.$$

This choice for q(θ → ψ) guarantees the convergence of the chain towards π_i and avoids the evaluation of the normalising constant ratio c(θ)/c(ψ) in (3). Still, such a choice requires the computation of integrals like I(θ,∆,x). In fact, the Markov chain obtained by using (4) is again an ideal chain.

Yet, the construction of such an ideal chain is the starting point of our paper. In the following, we will show that a subtle "manipulation" of the ideal chain given by (4) leads to another Markov chain that avoids the previous computational problem while preserving the required convergence properties. This new chain is able to follow the ideal chain as closely as desired; therefore, throughout this paper, we will call it the shadow chain. Since the shadow chain is simpler to simulate and exhibits good convergence properties, we will use it to build a sampler for the posterior distribution of interest.

The authors in [2, 12] present the first MCMC sampler of p(θ|y) which does not require the computation of the normalising constants ratio. The principle of their solution is simple and elegant: instead of sampling directly from p(θ|y), another variable x is introduced and a sampler from p(θ, x|y) is built. That technique is known in the literature as the auxiliary variable method. Yet, as the authors themselves explain, the presented solution has some drawbacks. First, there is a need for drawing exact samples from p(x|θ). This can make the method computationally heavy if the range of the model parameters is not carefully limited ([9]). Second, for the choice of the auxiliary density p(x|θ, y) there are almost no theoretical constraints or guidelines. From a theoretical point of view this may be seen as a very elegant characteristic of the method. Nevertheless, this choice is model dependent and affects the mixing properties of the chain. For instance, whenever posteriors of point processes are considered, finding the appropriate auxiliary density becomes the delicate point of the auxiliary variable method [2].

The method we present here, while clearly inspired by the previously cited work, adopts a completely different philosophy, as will be shown later. Nevertheless, it may be considered that our strategy has a common point with the auxiliary variable method, since it uses an extra variable too. In comparison, the method we propose does not exhibit the mentioned difficulties and is really easy to use.

The paper continues as follows. First, the construction of the shadow Markov chain is given. In the following section, convergence issues are discussed. A straightforward application to spatial point process models is then presented: maximum likelihood estimation for the parameters of the Strauss model. Finally, conclusions and perspectives are given.


2 Construction of the shadow Markov chain

The shadow chain is obtained by "transforming" the ideal chain, "passing to the limit" in the proposal density (4). This operation is justified by the theoretical results presented in this section.

In the following, some more notation is needed. V denotes the volume of any ball of radius ∆/2 with centre a point in Θ, and

$$U = U(\theta\to\psi) = \frac{1}{V}\,\mathbf{1}_{b(\theta,\Delta/2)}\{\psi\}$$

is the uniform probability density over the ball of centre θ and radius ∆/2. The probability density p(x|φ) is assumed to be a continuously differentiable function in φ, so we can write p(x|·) ∈ C^1(Θ).

Theorem 1. Let x be a point in Ω such that the function p(x|φ) is strictly positive and continuous with respect to φ. Then:

(i) The probability distributions given by the proposal densities q(θ → ·) and U(θ → ·) are uniformly as close as desired in θ ∈ Θ as ∆ approaches 0. That is,

$$\lim_{\Delta\to 0^+} \int_A |q(\theta\to\psi) - U(\theta\to\psi)|\,d\psi = 0$$

uniformly in A ∈ T_Θ and θ ∈ Θ.

(ii) For any fixed θ ∈ Θ, the quotient functions q(θ → ·)/q(· → θ) and

$$\frac{\dfrac{f(x;\cdot)}{c(\cdot)}\,\mathbf{1}_{b(\theta,\Delta/2)}(\cdot)}{\dfrac{f(x;\theta)}{c(\theta)}\,\mathbf{1}_{b(\cdot,\Delta/2)}(\theta)}$$

are uniformly as close as desired in Θ as ∆ approaches 0, i.e.,

$$\lim_{\Delta\to 0^+}\;\sup_{\psi\in\Theta}\left|\frac{q(\theta\to\psi|x)}{q(\psi\to\theta|x)} - \frac{\dfrac{f(x;\psi)}{c(\psi)}\,\mathbf{1}_{b(\theta,\Delta/2)}(\psi)}{\dfrac{f(x;\theta)}{c(\theta)}\,\mathbf{1}_{b(\psi,\Delta/2)}(\theta)}\right| = 0$$

uniformly in θ ∈ Θ.

If p(x|·) ∈ C^1(Θ), then rates of convergence can be provided.

Proof:

(i) Both density functions vanish outside the ball b(θ,∆/2). For ψ ∈ b(θ,∆/2), the integral mean value theorem applied to the denominator of q leads to

$$q(\theta\to\psi) = \frac{f(x;\psi)/c(\psi)}{V\,f(x;\theta^*)/c(\theta^*)}$$

for some θ* ∈ b(θ,∆/2). The positivity and the continuity of the density p (uniform continuity indeed, since Θ is compact) allow us to do the following. Let m(x) := inf_{φ∈Θ} p(x|φ) > 0, which is actually a minimum. For A ∈ T_Θ we have

$$\int_A |q(\theta\to\psi) - U(\theta\to\psi)|\,d\psi = \int_{A\cap b(\theta,\Delta/2)} \left|\frac{f(x;\psi)/c(\psi)}{V\,f(x;\theta^*)/c(\theta^*)} - \frac{1}{V}\right| d\psi$$
$$\le \frac{1}{V}\,\sup_{\phi\in\Theta}\frac{c(\phi)}{f(x;\phi)} \int_{A\cap b(\theta,\Delta/2)} \left|\frac{f(x;\psi)}{c(\psi)} - \frac{f(x;\theta^*)}{c(\theta^*)}\right| d\psi$$
$$\le \frac{\mu(A\cap b(\theta,\Delta/2))}{V}\, m(x)^{-1} \sup_{d(\psi,\theta)<\Delta} \left|\frac{f(x;\psi)}{c(\psi)} - \frac{f(x;\theta)}{c(\theta)}\right| \le m(x)^{-1} \sup_{d(\psi,\theta)<\Delta} \left|\frac{f(x;\psi)}{c(\psi)} - \frac{f(x;\theta)}{c(\theta)}\right|,$$

where the last supremum is independent of ψ and θ, and approaches 0 as ∆ does. With the regularity condition on p(x|·), we can tune up the inequality, using the (differential) mean value theorem, to

$$\int_A |q(\theta\to\psi) - U(\theta\to\psi)|\,d\psi \le m(x)^{-1}\,\Delta\,\sup_{\psi\in\Theta}\|D_\Theta\, p(x|\psi)\| := C_1(x,p,\Theta)\,\Delta,$$

where C_1(x,p,Θ) is a constant depending on x, p and Θ.

(ii) As previously, the use of the integral mean value theorem gives the result, since

$$\sup_{\psi\in\Theta}\left|\frac{q(\theta\to\psi|x)}{q(\psi\to\theta|x)} - \frac{\dfrac{f(x;\psi)}{c(\psi)}\,\mathbf{1}_{b(\theta,\Delta/2)}(\psi)}{\dfrac{f(x;\theta)}{c(\theta)}\,\mathbf{1}_{b(\psi,\Delta/2)}(\theta)}\right| \le \sup_{\psi\in b(\theta,\Delta/2)}\left|\frac{\dfrac{f(x;\psi)}{c(\psi)}\Big/\Big(V\,\dfrac{f(x;\theta^*)}{c(\theta^*)}\Big)}{\dfrac{f(x;\theta)}{c(\theta)}\Big/\Big(V\,\dfrac{f(x;\psi^*)}{c(\psi^*)}\Big)} - \frac{f(x;\psi)/c(\psi)}{f(x;\theta)/c(\theta)}\right|$$
$$\le \sup_{\psi\in b(\theta,\Delta/2)} \frac{f(x;\psi)/c(\psi)}{f(x;\theta)/c(\theta)}\,\left|\frac{f(x;\psi^*)/c(\psi^*)}{f(x;\theta^*)/c(\theta^*)} - 1\right| \le M(x)\,m(x)^{-2} \sup_{d(\psi^*,\theta^*)\le 2\Delta}\left|\frac{f(x;\psi^*)}{c(\psi^*)} - \frac{f(x;\theta^*)}{c(\theta^*)}\right|,$$

where M(x) := sup_{φ∈Θ} p(x|φ) < ∞ is a maximum and θ* ∈ b(θ,∆/2), ψ* ∈ b(ψ,∆/2) are obtained from the integral mean value theorem. As in (i), under the regularity condition on p(x|·), this inequality can evolve to

$$\sup_{\psi\in\Theta}\left|\frac{q(\theta\to\psi|x)}{q(\psi\to\theta|x)} - \frac{\dfrac{f(x;\psi)}{c(\psi)}\,\mathbf{1}_{b(\theta,\Delta/2)}(\psi)}{\dfrac{f(x;\theta)}{c(\theta)}\,\mathbf{1}_{b(\psi,\Delta/2)}(\theta)}\right| \le M(x)\,m(x)^{-2}\,\Delta\,\sup_{\psi\in\Theta}\|D_\Theta\, p(x|\psi)\| := C_2(x,p,\Theta)\,\Delta,$$

where C_2(x,p,Θ) is a constant depending on x, p and Θ.

Theorem 1 leads to the construction of the shadow Markov chain. The first part of the result enables us to use the uniform proposal U(θ → ψ) instead of the density q(θ → ψ) used for the ideal chain. The second part of Theorem 1 allows the "simplification" of the normalising constants involved in the expression of the acceptance probability. So, for small values of ∆, a new Markov chain comes out if these changes are applied to the proposal and the acceptance probability of the ideal chain. This is the shadow chain. Clearly, simulating the shadow chain is far simpler than simulating the ideal chain. The algorithm simulating the shadow chain is the following: suppose the system is in a state θ; a new value ψ is chosen uniformly in the ball b(θ,∆/2), and the state ψ is accepted with probability

$$\alpha_s(\theta\to\psi) = \min\left(1,\; \frac{p(\psi|y)}{p(\theta|y)} \times \frac{f(x;\theta)\,c(\psi)\,\mathbf{1}_{b(\psi,\Delta/2)}\{\theta\}}{f(x;\psi)\,c(\theta)\,\mathbf{1}_{b(\theta,\Delta/2)}\{\psi\}}\right). \qquad (5)$$

Note that the normalising constants appearing explicitly in (5) cancel those hidden in the posterior ratio p(ψ|y)/p(θ|y), so that α_s can be evaluated exactly.

By construction, the shadow chain is irreducible and aperiodic ([11]). Yet, we do not have any knowledge about the existence and the uniqueness of its equilibrium distribution.

This section ends with the following corollary, a direct consequence of Theorem 1.

Corollary 1. The acceptance probabilities of the ideal and shadow chains, given by (3) and (5) respectively, are uniformly as close as desired as ∆ approaches 0.
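As a concrete illustration, the following minimal Python sketch implements one shadow-chain transition with acceptance probability (5), under the assumptions of a model in the exponential-family form (1), a uniform prior on a box Θ, and a Euclidean ball (so the indicator ratio in (5) equals one whenever ψ is proposed inside b(θ,∆/2)). The function name and its arguments are ours, not from the paper.

```python
import numpy as np

def shadow_step(theta, t_y, t_x, delta, theta_lo, theta_hi, rng):
    """One transition of the shadow chain, acceptance probability (5),
    for an exponential-family model f(.;theta) = exp<t(.),theta> and a
    uniform prior on the box Theta = [theta_lo, theta_hi]."""
    d = len(theta)
    # propose psi uniformly in the ball b(theta, delta/2)
    while True:
        u = rng.uniform(-delta / 2, delta / 2, size=d)
        if np.linalg.norm(u) <= delta / 2:
            break
    psi = theta + u
    # uniform prior: proposals leaving the parameter space are rejected
    if np.any(psi < theta_lo) or np.any(psi > theta_hi):
        return theta
    # in (5) the normalising constants cancel against the posterior ratio,
    # and the indicator ratio is one for a Euclidean ball, leaving
    # exp(<t(y), psi - theta> + <t(x), theta - psi>)
    log_ratio = np.dot(t_y, psi - theta) + np.dot(t_x, theta - psi)
    if np.log(rng.random()) < min(0.0, log_ratio):
        return psi
    return theta
```

In the sampler of Section 4, this elementary step is iterated with a slowly decreasing ∆ and a refreshed auxiliary variable x, following the compound construction of the next section.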

3 Convergence issues

The results presented in this section analyse the asymptotic behaviour of the ideal and the shadow chains, assuming they are both started simultaneously from the same initial state.

In the following, for ease of reading, the notations P_{i,∆}^{(n)}(θ, A) = P_i^{(n)} and P_{s,∆}^{(n)}(θ, A) = P_s^{(n)} will be used.

Lemma 1. Let P_{i,∆} and P_{s,∆} be the transition kernels for the ideal and the shadow Markov chains using a general ∆ > 0, respectively. Then for every ε > 0 and every n ∈ N, there exists ∆_0 = ∆_0(ε, n) > 0 such that for every ∆ ≤ ∆_0 we have |P_{i,∆}^{(n)}(θ, A) − P_{s,∆}^{(n)}(θ, A)| < ε uniformly in θ ∈ Θ and A ∈ T_Θ. If p(x|θ) ∈ C^1(Θ), then a description of ∆_0(ε, n) can be provided.

Proof: For n = 1, the definition of the transition kernels, the addition and subtraction of the term U(θ → ψ)α_i(θ → ψ), the triangle inequality, and the boundedness of the functions 1_A(·), α_i(·) and α_s(·) allow us to write

$$|P_i(\theta,A) - P_s(\theta,A)| \le \int_{\psi\in b(\theta,\Delta/2)} |q(\theta\to\psi)\,\alpha_i(\theta\to\psi) - U(\theta\to\psi)\,\alpha_s(\theta\to\psi)|\,d\psi$$
$$+\; \mathbf{1}_A(\theta)\int_{\psi\in b(\theta,\Delta/2)} \big|q(\theta\to\psi)[1-\alpha_i(\theta\to\psi)] - U(\theta\to\psi)[1-\alpha_s(\theta\to\psi)]\big|\,d\psi$$
$$\le\; 3\int_{\psi\in b(\theta,\Delta/2)} |q(\theta\to\psi) - U(\theta\to\psi)|\,d\psi \;+\; 2\int_{\psi\in b(\theta,\Delta/2)} U(\theta\to\psi)\,|\alpha_i(\theta\to\psi) - \alpha_s(\theta\to\psi)|\,d\psi.$$

Under the hypothesis of Theorem 1, applying Theorem 1(i) and Corollary 1, the transition kernels of the ideal and the shadow Markov chains are uniformly close as well: for any ε > 0, there exists ∆_0(ε, 1) > 0 such that |P_s(θ, A) − P_i(θ, A)| < ε whenever ∆ ≤ ∆_0(ε, 1), independently of θ and A.

If p(x|·) ∈ C^1(Θ), then the previous inequalities can be completed to

$$|P_i(\theta,A) - P_s(\theta,A)| \le 3\,C_1(x,p,\Theta)\,\Delta\,V + 2\,C_2(x,p,\Theta)\,\Delta\,V := C_3(x,p,\Theta)\,\Delta^{d+1},$$

giving a candidate expression ∆_0(ε, 1) := C_4(x,p,Θ) ε^{1/(d+1)}. Here C_3 and C_4 are constants depending on x, p and Θ.

For n > 1 we get by induction that

$$|P_i^{(n)} - P_s^{(n)}| \le |P_i^{(n)} - P_i^{(n-1)}P_s^{(1)}| + |P_i^{(n-1)}P_s^{(1)} - P_s^{(n)}|$$
$$\le |P_i^{(n-1)}|\,|P_i^{(1)} - P_s^{(1)}| + |P_i^{(n-1)} - P_s^{(n-1)}|\,|P_s^{(1)}|$$
$$\le |P_i^{(1)} - P_s^{(1)}| + |P_i^{(n-1)} - P_s^{(n-1)}| \le \cdots \le n\,|P_i^{(1)} - P_s^{(1)}| < \epsilon$$

uniformly in θ and A, for all ∆ such that

$$\Delta \le \Delta_0(\epsilon, n) := \Delta_0(\epsilon/n, 1) = C_4(x,p,\Theta)\left(\frac{\epsilon}{n}\right)^{\frac{1}{d+1}}. \qquad (6)$$

This first result shows that if the ideal and the shadow chains are started from the same initial state and run during a fixed time n, it is possible to find a positive ∆ such that the two chains evolve as close as desired.

Knowing that π_i, the distribution we are interested in, is also the equilibrium distribution of the ideal chain, the following question arises: how is it possible to converge towards π_i using the ideal chain, while the shadow chain follows it as closely as desired? The next results give the answer to that question.

Lemma 2. For every ε > 0 and every sequence of positive integers {n_j}_{j=1}^∞ there exists a sequence {∆_j}_{j=1}^∞ such that the inequality

$$\big|P_{i,\Delta_1}^{(n_1)}\cdots P_{i,\Delta_k}^{(n_k)} - P_{s,\Delta_1}^{(n_1)}\cdots P_{s,\Delta_k}^{(n_k)}\big| < \epsilon$$

holds for every k ∈ N, uniformly in θ ∈ Θ and A ∈ T_Θ.

Proof: Let us first take any sequence {α_j}_{j=1}^∞ of positive real numbers whose series converges to 1. Then for each j, we take α_j ε > 0 and n_j and apply the previous lemma, choosing ∆_j = ∆_0(α_j ε, n_j) > 0 such that |P_{i,∆_j}^{(n_j)} − P_{s,∆_j}^{(n_j)}| < α_j ε. Now, by induction,

$$|P_i^{(n_1)}\cdots P_i^{(n_k)} - P_s^{(n_1)}\cdots P_s^{(n_k)}| \le |P_i^{(n_1)}\cdots P_i^{(n_{k-1})}|\,|P_i^{(n_k)} - P_s^{(n_k)}| + |P_i^{(n_1)}\cdots P_i^{(n_{k-1})} - P_s^{(n_1)}\cdots P_s^{(n_{k-1})}|\,|P_s^{(n_k)}|$$
$$\le |P_i^{(n_k)} - P_s^{(n_k)}| + |P_i^{(n_1)}\cdots P_i^{(n_{k-1})} - P_s^{(n_1)}\cdots P_s^{(n_{k-1})}| \le \cdots \le \sum_{j=1}^{k} |P_i^{(n_j)} - P_s^{(n_j)}| \le \left(\sum_{j=1}^{k}\alpha_j\right)\epsilon \le \epsilon.$$

From the definitions at the beginning of the paper, it is perfectly possible to use an infinite {∆_j} sequence to construct an infinite sequence of ideal and shadow transition kernels. On the basis of the result obtained in Lemma 2, it is possible to find such an infinite sequence of ∆'s so that the composition of the induced ideal and shadow transition kernels leads to two inhomogeneous Markov chains evolving as close as desired.

In fact, that result describes the asymptotic behaviour of the inhomogeneous chains given by Q_i^k = P_{i,∆_1}^{(n_1)} ··· P_{i,∆_k}^{(n_k)} and Q_s^k = P_{s,∆_1}^{(n_1)} ··· P_{s,∆_k}^{(n_k)}, with {n_j}_{j=1}^k, {∆_j}_{j=1}^k and k ∈ N_+. In the following we call the two chains induced by Q_i^k and Q_s^k the compound ideal and shadow chain, respectively.

The question to be answered now is: how are the convergence properties of the compound ideal chain influenced by the {∆_j} sequence?

Theorem 2. Let π_i be the invariant distribution of the ideal chain and Q_i^k the compound ideal chain constructed using a {∆_j} sequence as in Lemma 2. If the {∆_j} sequence has a divergent series, then for any positive measure ν_0 on (Θ, T_Θ) we have

$$\lim_{k\to\infty} \nu_0 Q_i^k = \pi_i.$$

Proof: By contradiction, suppose that the conclusion holds with another positive measure π̃_i in place of π_i. Then, taking ν_0 = π_i, we would have lim_{k→∞} π_i Q_i^k = π̃_i. As π_i is the invariant distribution of every Q_i^k, it results that π̃_i = π_i.

The convergence of the infinite compound ideal chain towards π_i can be obtained only if Q_i^k(θ, A) > 0 for any θ and A with π_i(A) > 0, and ∆_k → 0 as k → ∞. That condition is fulfilled if the associated {∆_j} sequence verifies Σ_k ∆_k = ∞.

Lemma 2 and Theorem 2 state that the compound ideal and shadow chains can evolve asymptotically close and that the distribution of the compound ideal chain converges towards π_i. The following result then follows.

Theorem 3. For every ε > 0, one can choose a particular sequence {n_j}_{j=1}^∞ and get a sequence {∆_j}_{j=1}^∞ such that for any positive measure ν_0 on (Θ, T_Θ) we have

$$\lim_{k\to\infty} \big|\nu_0 Q_s^k(A) - \pi_i(A)\big| < \epsilon$$

uniformly in A ∈ T_Θ.

Proof: The result follows easily from the previous theorem and Lemma 2, once a sequence {∆_j}_{j=1}^∞ with a divergent series is found using ε > 0 and {n_j}_{j=1}^∞.

For instance, let us take the constant sequence {n_j} = N for all positive integers j, and {α_j}_{j=1}^∞ of Lemma 2 proportional to {1/j^κ}_{j=1}^∞, with κ > 1 since its series must be convergent. Then we get

$$\Delta_j = \Delta_0(\alpha_j\epsilon, N) = \Delta_0(\alpha_j\epsilon/N, 1) \propto \left(\frac{\epsilon}{N}\right)^{\frac{1}{d+1}}\left(\frac{1}{j}\right)^{\frac{\kappa}{d+1}},$$

whose series diverges whenever κ ≤ d + 1. Other choices for the sequences {n_j}_{j=1}^∞ and {α_j}_{j=1}^∞ are available, provided they verify

$$\sum_j \alpha_j = 1 \quad\text{and}\quad \sum_j \left(\frac{\alpha_j}{n_j}\right)^{\frac{1}{d+1}} = \infty.$$

The obtained result is important because of the direct applications that can be derived from it: whenever sampling from the ideal chain is needed, sampling using the shadow chain can be performed instead. The immediate advantage of doing that is the simplicity in the computation and in the implementation of the shadow chain.

It may be noticed that all these results were proved using compound ideal and shadow chains built with the same variable x. The proofs stand equally well if the compound ideal and shadow chains are built using a sequence of variables {x_j}_{j=1}^∞ associated with the sequences {∆_j, n_j}_{j=1}^∞. That possibility makes a link with the auxiliary variable method [2, 12].


4 Application: maximum likelihood estimation of point process parameters

4.1 Set-up of the problem

In this section, the previous theoretical development is used to build an algorithm able to sample from the posterior law of a spatial point process. The direct application of such a sampler is the maximum likelihood estimation of the parameters of the considered model.

The point process model considered here is the Strauss process ([8, 15]).

This model was introduced in order to analyse point patterns y = {y_1, ..., y_n} exhibiting a clustering tendency. The points y_i are all situated in a bounded region K ⊂ R^2. Its unnormalised probability density is given by

$$f(y;\theta) = \exp\left[n(y)\log\beta + s_r(y)\log\gamma\right],$$

with n(y) the total number of points in y, s_r(y) the number of pairs of points in y within a distance less than or equal to r, β > 0 the intensity parameter (the chemical activity in statistical physics terminology) and γ ∈ (0, 1] the interaction parameter. The equivalence with (1) is immediate since t(y) = (n(y), s_r(y)) and θ = (log β, log γ).
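For concreteness, here is a minimal sketch (in Python, assuming NumPy) of how the sufficient statistic vector t(y) = (n(y), s_r(y)) can be computed for a pattern stored as an (n, 2) array of coordinates; the function name is ours.

```python
import numpy as np

def strauss_stats(points, r):
    """Sufficient statistics t(y) = (n(y), s_r(y)) of the Strauss model:
    the number of points and the number of pairs at distance <= r."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    if n < 2:
        return np.array([n, 0.0])
    # pairwise distances, each unordered pair counted once
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    s_r = np.count_nonzero(dist[np.triu_indices(n, k=1)] <= r)
    return np.array([n, s_r], dtype=float)
```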

Parameter estimation of a Strauss point process can be formulated as follows: a point pattern y, supposed to be the outcome of a Strauss process, is fully observed in the window K; the interaction radius r is considered known, so the statistics t(y) can be computed; we want to estimate the model parameter θ̂ that ensures an average behaviour of the model equivalent to the observations.

The maximum likelihood framework can be used to answer this question. So, the Maximum Likelihood Estimate (MLE) θ̂ of the Strauss model parameters is given by

$$\hat\theta = \arg\max_{\theta\in\Theta} l_y(\theta) = \arg\max_{\theta\in\Theta} \frac{f(y;\theta)}{c(\theta)} = \arg\max_{\theta\in\Theta} \frac{\exp\langle t(y),\theta\rangle}{c(\theta)},$$

where l_y(θ) = f(y;θ)/c(θ) is the likelihood function ([3, 7, 5, 4]). It can be shown that the MLE verifies

$$t(y) = \mathbb{E}_{\hat\theta}\, t(Y). \qquad (7)$$

4.2 Computation of the MLE

The MLE computation can be considered solved if the likelihood function l_y(θ) = p(y|θ) can be evaluated over the entire parameter space Θ. Such an exploration can be performed while sampling from the posterior law p(θ|y) ∝ l_y(θ)p(θ), whenever a uniform prior p(θ) over the parameter space Θ is used. In that case, the MLE is also the Maximum A Posteriori (MAP) estimate.

To do this, a shadow chain is constructed. The shadow chain works as follows: assuming the chain is in the state θ, a new value ψ is chosen uniformly in the ball b(θ,∆/2), and the state ψ is accepted with probability

$$\alpha_s(\theta\to\psi) = \min\left(1,\; \frac{\exp\langle t(y),\psi\rangle}{\exp\langle t(y),\theta\rangle} \times \frac{\exp\langle t(x),\theta\rangle\,\mathbf{1}_{b(\psi,\Delta/2)}\{\theta\}}{\exp\langle t(x),\psi\rangle\,\mathbf{1}_{b(\theta,\Delta/2)}\{\psi\}}\right). \qquad (8)$$

The modifications in (8) are due to the fact that we introduce in (5) the probability densities of the Strauss model.

Under these considerations, the theoretical development presented in the previous section can be used to build an algorithm able to sample from the considered posterior law:

Algorithm

Step 0: Observe the sufficient statistics t(y) and choose the sequences {∆_j}, {n_j} and {x_j};

Step 1: Choose for the initial condition any value θ_0 ∈ Θ and set k = 1;

Step 2: Run the shadow chain using ∆_k, n_k, x_k and pick up for θ_k the final obtained value;

Step 3: Set θ_0 = θ_k and increment k;

Step 4: Go to Step 2, or to Step 5 if a sufficiently high number of θ's has been obtained;

Step 5: Compute the MLE component-wise by taking the mode of the histogram corresponding to the considered parameter.
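To fix ideas, here is a compact, self-contained Python sketch of this algorithm for the Strauss posterior, anticipating the practical choices described in Section 4.3 below (constant n_j, a few birth-death Metropolis-Hastings moves to refresh each x_j starting from x_{j-1}, and the ∆_j schedule introduced in (9)). It is a sketch under our own assumptions, not the authors' code: all names, default constants and the specific birth-death dynamics for x are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

R, SIDE = 0.1, 1.5                        # interaction radius, window side
AREA = SIDE * SIDE
THETA_LO = np.array([0.0, -10.0])         # Theta = [0, 10] x [-10, 0]
THETA_HI = np.array([10.0, 0.0])

def strauss_stats(pts, r=R):
    """t(x) = (n(x), s_r(x)) for a pattern stored as an (n, 2) array."""
    n = len(pts)
    if n < 2:
        return np.array([n, 0.0])
    d = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
    return np.array([n, np.count_nonzero(d[np.triu_indices(n, 1)] <= r)],
                    dtype=float)

def refresh_x(pts, theta, n_moves=10):
    """A few birth/death Metropolis-Hastings moves targeting the Strauss
    density f(x; theta) = exp<t(x), theta> on K (not an exact sampler)."""
    log_b, log_g = theta
    pts = [tuple(p) for p in pts]
    for _ in range(n_moves):
        if rng.random() < 0.5:                                  # birth
            u = tuple(rng.uniform(0.0, SIDE, 2))
            close = sum(np.hypot(u[0]-p[0], u[1]-p[1]) <= R for p in pts)
            log_r = log_b + close * log_g + np.log(AREA / (len(pts) + 1))
            if np.log(rng.random()) < log_r:
                pts.append(u)
        elif pts:                                               # death
            i = rng.integers(len(pts))
            u, rest = pts[i], pts[:i] + pts[i+1:]
            close = sum(np.hypot(u[0]-p[0], u[1]-p[1]) <= R for p in rest)
            log_r = -(log_b + close * log_g) + np.log(len(pts) / AREA)
            if np.log(rng.random()) < log_r:
                pts = rest
    return np.array(pts) if pts else np.empty((0, 2))

def shadow_run(theta, t_y, t_x, delta, n_steps):
    """n_steps transitions of the shadow chain, acceptance probability (8)."""
    for _ in range(n_steps):
        while True:                       # psi uniform in b(theta, delta/2)
            u = rng.uniform(-delta / 2, delta / 2, 2)
            if np.hypot(*u) <= delta / 2:
                break
        psi = theta + u
        if np.any(psi < THETA_LO) or np.any(psi > THETA_HI):
            continue                      # uniform prior restricted to Theta
        log_a = np.dot(t_y, psi - theta) + np.dot(t_x, theta - psi)
        if np.log(rng.random()) < min(0.0, log_a):
            theta = psi
    return theta

def sample_posterior(t_y, n_outer=2000, n_inner=100, N=100, alpha=0.34):
    """Steps 0-4 of the algorithm: returns a sequence of theta samples."""
    theta = np.array([5.0, -5.0])         # Step 1: arbitrary initial value
    x = np.empty((0, 2))                  # x_0: the empty pattern
    samples = []
    for j in range(1, n_outer + 1):
        delta = 1.0 / (N * j ** alpha)    # schedule (9)
        x = refresh_x(x, theta)           # next auxiliary pattern x_j
        theta = shadow_run(theta, t_y, strauss_stats(x), delta, n_inner)
        samples.append(theta.copy())      # Steps 2-4
    return np.array(samples)
```

For example, with the observed statistics of Section 4.3 one would call `samples = sample_posterior(np.array([24.19, 2.69]))`; Step 5 then amounts to taking, component-wise, the mode of a histogram of `samples`.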

4.3 How to use the algorithm

The performance of the previously presented algorithm depends on the choice of the three sequences {∆_j}, {n_j} and {x_j}. In this section, we explain our choices for them.

For the sake of simplicity, the {n_j} sequence was set to a constant value; we have chosen n_j = 100 for all j.

The theoretical requirements on the {x_j} sequence are very mild: positivity and continuity of p(x_j|θ) with respect to θ. For our purpose we have generated each element of the {x_j} sequence by sampling p(x_j|θ_{j−1}). That sampling was done by means of a Metropolis-Hastings algorithm ([6, 4]). It is important to notice that the theoretical development of our method does not require the {x_j} sequence to be made of exact samples of any probability distribution. Hence, in practice only a few iterations of the previously mentioned Metropolis-Hastings dynamics are sufficient to obtain one element of the {x_j} sequence. For our experiments, we have used only 10 iterations of such a Metropolis-Hastings algorithm to generate one element x_j of the sequence. Each time, the initial state of the Metropolis-Hastings dynamics was x_{j−1}, while x_0 was the empty set. The motivation for this construction of the {x_j} sequence is that it allows a good mixing behaviour of the algorithm while the computation time remains reasonable.

The theoretical development at the basis of our algorithm requires the {∆_j} sequence to have a divergent series while its elements approach 0. Under those circumstances, we have opted for the following schedule generating the {∆_j} sequence:

$$\Delta_j = \frac{1}{N\,j^{\alpha}}, \qquad (9)$$

with j ∈ N_+, N = 100 and α ∈ [1/3, 1]. An experiment testing the performance of the algorithm depending on the α parameter in the {∆_j} sequence (9) was carried out.
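The interval for α follows from Section 3: spelling out the construction in the proof of Theorem 3 for the present case (a sketch of the computation, with d = 2 since θ = (log β, log γ)),

$$\Delta_j \propto \left(\frac{1}{j}\right)^{\frac{\kappa}{d+1}} = \left(\frac{1}{j}\right)^{\frac{\kappa}{3}}, \qquad \kappa\in(1,\,d+1] = (1,3],$$

so schedule (9) corresponds to exponents α = κ/3 ∈ (1/3, 1]: the elements ∆_j go to 0 while the series Σ_j ∆_j diverges.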

The first step of that experiment was to simulate 1000 samples of a Strauss process with parameters θ = (3.91, −1.60). The interaction radius parameter was r = 0.1. The parameter space was set to Θ = [0, 10] × [−10, 0], since it covers a very large class of Strauss models. The bounded region K was the square [0, 1.5] × [0, 1.5]. In order to reduce edge effects, the sufficient statistics were considered only in the sub-window [0.25, 1.25] × [0.25, 1.25]. The obtained average of the sufficient statistics was (24.19, 2.69). That value was given as the input to the algorithm. By (7) we expect the algorithm result to be close to the true model parameters, hence providing a check on the outputs of the proposed method.

The second step of the experiment was to run our algorithm using different values for α in (9): 0.34, 0.5 and 1.0. Each time, the initial value ∆_0 was set to 1. The algorithm was run for 10^7 iterations. To reduce correlation, samples of θ were picked every 100 iterations of the algorithm, so only 10^5 values of θ were finally kept. The θ paths and the component-wise histograms computed for each α are shown in Figure 1. The MLE estimates calculated from these histograms are shown in Table 1.

α          0.34     0.5      1.0
log β̂     3.94     3.94     3.80
log γ̂     -1.71    -1.78    -1.99

Table 1: α dependency of the MLE estimates.

Statistics for the MLE estimates were also computed. To do this, each MLE estimate was computed using only 1000 samples of θ. Hence, using all the samples, we get 100 MLE estimates for each value of α. So, the α influence on the performance of the algorithm was observed via the statistics computed for the MLE estimates. The obtained results are shown in Table 2.

The best performance of the algorithm is obtained for α close to 1/3. Tables 1 and 2 show that for α = 0.34 the MLE estimate is close to the theoretical one, while the MLE statistics indicate a small bias and variability. Moreover, the θ path and the histograms indicate good mixing properties of the chain simulated by the algorithm.

Figure 1: α dependency of the algorithm illustrated by the obtained θ path and corresponding component-wise histograms: a)-c) α = 0.34; d)-f) α = 0.5; g)-i) α = 1.0.

α              0.34     0.5      1.0
E[log β̂]      3.92     3.94     3.73
Var[log β̂]    0.009    0.02     0.003
E[log γ̂]      -1.74    -1.80    -2.02
Var[log γ̂]    0.04     0.2      0.004

Table 2: α dependency of the MLE statistics.

If α approaches 1, the bias in estimation increases. That is explained by the fact that, whenever α approaches 1, the corresponding {∆_j} series diverges more slowly. Hence, the chain simulated by the algorithm loses its good mixing behaviour, allowing the possibility that the chain gets "locked" in a certain region of the parameter space. If α is smaller than 1/3, then the {∆_j} sequence becomes "rapidly" divergent, so that the θ path no longer approximates samples from π_i. Here again, the experiments confirmed the theoretical development.

The dependency on initial conditions was studied by running the algorithm from three different starting points. In all three cases, the α parameter was set to 0.34 while exactly the same conditions for the Strauss model were used.

The parameter paths and the associated histograms are presented in Figure 2, while the statistics of the obtained MLE estimates are shown in Table 3. It can be observed that the initial conditions play no role for the shadow chain. From a numerical point of view the statistics are the "same". It can also be noticed that the parameter path graphs explore "identical" regions in the parameter space and give "identical" histograms.

θ_0 = (log β_0, log γ_0)    (4, -1.5)    (0, -10)    (10, 0)
E[log β̂]                   3.93         3.94        3.93
Var[log β̂]                 0.009        0.007       0.011
E[log γ̂]                   -1.74        -1.73       -1.73
Var[log γ̂]                 0.04         0.05        0.02

Table 3: Initial conditions dependency of the MLE statistics obtained using an algorithm with α = 0.34.

All these experiments suggest that the proposed method has to be used in the same spirit as the simulated annealing algorithm. In both cases there is a thorough theoretical background behind the algorithm construction, while both techniques need a "cooling schedule" that is of major importance for their performance. At the same time, an analogy with a Gibbs-within-Metropolis-Hastings algorithm can also be found, taking into account the way the {x_j} sequence is used. It is important to notice that, in all cases, the obtained results indicate that the algorithm is able to reach the parameter region where the MLE lies. So, in the worst case, such an approach can be used to improve the performance of the classical Monte Carlo MLE.

4.4 Simulation study

Parameter estimation based on Monte Carlo maximum likelihood for point processes encounters several difficulties. Among them we recall: the edge effects, the dependence on the initial conditions and the variance of the observed sufficient statistics.

Figure 2: Initial conditions dependency of the parameter path obtained using an algorithm with α = 0.34: a)-c) θ_0 = (4, −1.5); d)-f) θ_0 = (0, −10); g)-i) θ_0 = (10, 0).

Several strategies exist in the literature to deal with the first two mentioned drawbacks [1, 4]. In general, minus sampling and re-starting the maximum likelihood procedure until convergence are the usual choices adopted by the majority of practitioners to solve those two problems, respectively.

To our knowledge, the problem of the variance of the observed sufficient statistics remains a delicate one. The classical solution generally adopted by practitioners is to re-start the estimation procedure and to compute statistics on the obtained estimates. In the context of Monte Carlo maximum likelihood computation, doing statistics on MLE estimates is clearly time consuming.

The algorithm we have proposed in the previous section can be used to proceed with such a study. For this experiment, we have considered six Strauss models with parameters indicated in Table 4. The rest of the settings were identical to those used in the previous section.

log β    3.91     3.91     3.91     4.60     4.60     4.60
log γ    -0.69    -1.60    -2.99    -0.69    -1.60    -2.99

Table 4: Set of parameters used in the simulation study for the Strauss model with the interaction radius r = 0.1.

Concerning the algorithm parameters, they were again chosen exactly as in the previous experiments, with the exception of the {∆_j} sequence, for which we have chosen a fixed value ∆ = 0.01. That choice gives a sequence rather close to the sequence with parameter α = 0.34 used in the previous section. Keeping in mind the analogy with the simulated annealing algorithm, this is equivalent to running the algorithm at a "fixed temperature".

For each model a procedure consisting of several steps was implemented. First, 1000 samples were drawn using a Coupling From The Past (CFTP) algorithm and the corresponding sufficient statistics kept. Each value t(y) of the sufficient statistics was considered as an input for our algorithm. For each value of the sufficient statistics, 10^5 iterations of the algorithm were carried out. To reduce correlation, the samples of θ were picked every 100 iterations of the algorithm. In total, this procedure gives 1000 samples of p(θ|y) for each value of t(y). For these samples a histogram was calculated and the mode of this histogram was computed. This value approximates the maximum likelihood estimate of the model parameters given the observed statistic t(y). The procedure is then continued for the rest of the sufficient statistics. At the end of the procedure, for each model we get 1000 such estimates.

Histograms and statistics were computed for the obtained estimates. The histograms are shown in Figures 3 and 4, while the statistics are presented in Table 5. In all the cases, it can be observed that their expectation is very close to the true parameter value of the model.

The variance of the obtained estimates depends on the model parameters.

The variance of an estimated parameter increases whenever the associated sufficient statistic has a small value. The same is also true whenever the true parameter value strongly penalises the considered interaction. That is the case for the models with log γ = −2.99: these models are almost hard-core processes, since γ = 0.05.

This drawback is encountered when performing Monte Carlo maximum likelihood computation, too. Its direct consequence is that the convergence of the algorithm will be very slow. Still, in our situation, looking at the mode of the histograms of the MLE estimates, we can recover the true value of the parameters.

log β          3.91     3.91     3.91     4.60     4.60     4.60
log γ          -0.69    -1.60    -2.99    -0.69    -1.60    -2.99
E[log β̂]      3.93     3.91     3.88     4.64     4.61     4.59
Var[log β̂]    0.09     0.09     0.10     0.11     0.11     0.10
E[log γ̂]      -0.79    -2.00    -4.52    -0.76    -1.73    -3.74
Var[log γ̂]    0.22     2.02     6.60     0.08     0.30     3.84

Table 5: The mean and the variance of the maximum likelihood estimates (log β̂, log γ̂).

5 Conclusion and perspectives

In this paper we have presented a versatile Monte Carlo strategy for sampling the posterior law of analytically intractable models. The strategy rests on a sound theoretical basis leading to the construction of sampling algorithms. The implementation of such an algorithm was used for a practical application: point process parameter estimation through maximum likelihood. The performance of the method is at least equivalent to that of the classical Monte Carlo maximum likelihood and of the auxiliary variable method applied to point processes [4, 2, 12]. Its strong points are its simplicity, the independence from the initial conditions and the capacity for statistical inference. Furthermore, the method does not exhibit the difficulties encountered by earlier methods ([2, 12]).

Even if theoretical and experimental arguments were given for the optimal choice of the algorithm parameters, that question remains an open problem. Another interesting perspective is the development of new simple methodologies able to estimate, for instance, the interaction radius for a Strauss model. That would be a step further towards the “dream” of the statistics community: simple methodologies which are not model dependent ([7, 16]).

Figure 3: Histogram of the estimates corresponding to the true model: a),b) (3.91, −0.69); c),d) (3.91, −1.60); e),f) (3.91, −2.99).

Figure 4: Histogram of the estimates corresponding to the true model: a),b) (4.60, −0.69); c),d) (4.60, −1.60); e),f) (4.60, −2.99).


Acknowledgements

The first author is grateful to D. Allard, J. Chadœuf, N. Dessasis, A. Kret- zschmar and S. Soubeyrand for helpful and interesting discussions.

References

[1] A. J. Baddeley. A crash course in stochastic geometry. In O. Barndorff-Nielsen, W. S. Kendall, and M. N. M. van Lieshout, editors, Stochastic geometry, likelihood and computation. CRC Press/Chapman and Hall, Boca Raton, 1999.

[2] K. K. Berthelsen and J. Møller. Bayesian analysis of Markov point processes. In A. Baddeley, P. Gregori, J. Mateu, R. Stoica, and D. Stoyan, editors, Case studies in spatial point process modeling. Springer, Lecture Notes in Statistics 185, 2006.

[3] A. E. Gelfand and B. P. Carlin. Maximum-likelihood estimation for constrained- or missing-data models. Canadian Journal of Statistics, 21:303–311, 1993.

[4] C. J. Geyer. Likelihood inference for spatial point processes. In O. Barndorff-Nielsen, W. S. Kendall, and M. N. M. van Lieshout, editors, Stochastic geometry, likelihood and computation. CRC Press/Chapman and Hall, Boca Raton, 1999.

[5] C. J. Geyer. On the convergence of Monte Carlo maximum likelihood calculations. Journal of the Royal Statistical Society, Series B, 54(1):261–274, 1994.

[6] C. J. Geyer and J. Møller. Simulation procedures and likelihood inference for spatial point processes. Scandinavian Journal of Statistics, 21:359–373, 1994.

[7] C. J. Geyer and E. A. Thompson. Constrained Monte Carlo maximum likelihood for dependent data. Journal of the Royal Statistical Society, Series B, 54(3):657–699, 1992.

[8] F. P. Kelly and B. D. Ripley. A note on Strauss's model for clustering. Biometrika, 63(2):357–360, 1976.

[9] M. N. M. van Lieshout and R. S. Stoica. Perfect simulation for marked point processes. Computational Statistics and Data Analysis, 51:679–698, 2006.

[10] K. L. Mengersen and R. L. Tweedie. Rates of convergence of the Hastings and Metropolis algorithms. The Annals of Statistics, 24(1):101–121, 1996.

[11] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Springer Verlag, 1993.

[12] J. Møller, A. N. Pettitt, K. K. Berthelsen, and R. W. Reeves. An efficient Markov chain Monte Carlo method for distributions with intractable normalizing constants. Biometrika, 93:451–458, 2006.

[13] G. O. Roberts and A. F. M. Smith. Simple conditions for the convergence of the Gibbs sampler and Metropolis-Hastings algorithms. Stochastic Processes and their Applications, 49:207–216, 1994.

[14] G. O. Roberts and R. L. Tweedie. Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika, 83(1):95–110, 1996.

[15] D. J. Strauss. A model for clustering. Biometrika, 62(2):467–475, 1975.

[16] L. Tierney. Markov chains for exploring posterior distributions (with discussion). The Annals of Statistics, 22(4):1701–1762, 1994.
