HAL Id: inria-00577793
https://hal.inria.fr/inria-00577793
Submitted on 18 Mar 2011
HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or
L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires
Markov fields and variational EM
Lamiae Azizi, Florence Forbes, Senan Doyle, Myriam Charras-Garrido, David Abrial
To cite this version:
Lamiae Azizi, Florence Forbes, Senan Doyle, Myriam Charras-Garrido, David Abrial. Spatial risk mapping for rare disease with hidden Markov fields and variational EM. [Research Report] RR-7572, INRIA. 2011. �inria-00577793�
a p p o r t
d e r e c h e r c h e
0249-6399ISRNINRIA/RR--7572--FR+ENG
Optimization, Learning and Statistical Methods
Spatial risk mapping for rare disease with hidden Markov fields and variational EM
Lamiae Azizi — Florence Forbes — Senan Doyle — Myriam Charras-Garrido — David Abrial
N° 7572
March 2011
Centre de recherche INRIA Grenoble – Rhône-Alpes
Lamiae Azizi∗† , Florence Forbes∗, Senan Doyle∗ , Myriam Charras-Garrido† , David Abrial†
Theme : Optimization, Learning and Statistical Methods Équipes-Projets Unité d'épidémiologie Animale, INRA
Clermont-Ferrand-Theix et MISTIS
Rapport de recherche n° 7572 March 2011 33 pages
Abstract: We recast the disease mapping issue of automatically classifying geographical units into risk classes as a clustering task using a discrete hidden Markov model and Poisson class-dependent distributions. The designed hidden Markov prior is non standard and consists of a variation of the Potts model where the interaction parameter can depend on the risk classes. The model parameters are estimated using an EM algorithm and the mean eld approxi- mation. This provides a way to face the intractability of the standard EM in this spatial context, with a computationally ecient alternative to more intensive simulation based Monte Carlo Markov Chain (MCMC) procedures. We then focus on the issue of dealing with very low risk values and small numbers of observed cases and population sizes. We address the problem of nding good initial parameter values in this context and develop a new initialization strat- egy appropriate for spatial Poisson mixtures in the case of not so well separated classes as encountered in animal disease risk analysis. Using both simulated and real data, we compare this strategy to other standard strategies and show that it performs well in a lot of situations.
Key-words: Classication; Discrete hidden Markov random eld; Disease mapping; Poisson mixtures; Potts model; Variational EM.
∗MISTIS team
†Unité d'épidémiologie Animale, INRA Clermont-Ferrand-Theix
cachés
Résumé : Nous abordons la cartographie automatique d' unités géographiques en classes de risque comme un problème de clustering à l'aide de modèles de Markov cachés discrets et de modèles de mélange de Poisson. Le modèle de Markov caché proposé est une variante du modèle de Potts, où le paramètre d'interaction dépend des classes de risque. An d'estimer les paramètres du modèle, nous utilisons l'algorithme EM combiné à une approche variationnelle champ-moyen. Cette approche nous permet d'appliquer l'algorithme EM dans un cadre spatial et présente une alternative ecace aux méthodes d'estimation basées sur des simulations intensives de type Markov chain Monte Carlo (MCMC).
Nous abordons également les problèmes d'initialisation, spécialement quand les taux de risque sont petits (cas des maladies animales). Nous proposons une nouvelle stratégie d'initialisation appropriée aux modèles de mélange de Poisson quand les classes sont mal séparées. Pour illustrer notre méthodolgie, nous présentons des résultats d'application sur des données épidémiologiques réelles et simulées et montrons la performance de la stratégie d'initialisation présentée en comparaison à celles utilisées usuellement.
Mots-clés : Classication, Champs aléatoires de Markov cachés discrets, Cartographie du risque, Mélanges de Poisson, Modèle de Potts, EM variationnel.
Contents
1 Introduction 4
2 Designing hidden Markov elds for spatial disease mapping 6 2.1 A spatial hidden structure model . . . . 7 2.2 An observation model for count data . . . . 9 3 Estimating disease maps using variational EM variants 10 3.1 A tractable EM variant . . . 10 3.2 The Search/Run/Select strategy . . . 12
4 A search procedure for initializing EM 13
4.1 Searching the risk parameters space using the EM trajectories properties . . . 14 4.2 A full search procedure . . . 15
5 Illustrations 16
5.1 Simulated data sets . . . 17 5.2 The BSE data set . . . 20
6 Discussion 20
1 Introduction
The analysis of the geographical variations of a disease and their representa- tion on a map is an important step in epidemiology. The goal is to identify homogeneous regions in terms of disease risk and to gain better insights into the mechanisms underlying the spread of the disease. Traditionally, the region under study is partitioned into a number of areas on which the observed cases of a given disease are counted and compared to the population size in this area. It has long ago become clear that spatial dependencies between counts had to be taken into account when analyzing such location dependent data. The number of observed cases are usually modelled by Poisson distributions. Most statisti- cal methods for risk mapping of aggregated data (e.g. Mollie, 1999; Richardson et al., 1995; Pascutto et al., 2000; Lawson et al., 2000) are based on a Poisson log-linear mixed model and follow the one proposed by Besag, York & Mollie (1991). The so-called BYM model introduced by Besag et al. (1991), extended by Clayton & Bernadinelli (1992) and called the convolution model by Mollie (1996), is one of the most popular approaches and used extensively in this con- text. This model corresponds to a Bayesian hierarchical modelling approach.
It is based on an Hidden Markov Random Field (HMRF) model where the latent intrinsic risk eld is modelled by a Markov eld with continuous state space, namely a Gaussian Conditionally Auto-Regressive (CAR) model. Recent developments in this context concern in particular spatio-temporal mapping (Knorr-Held & Richardson, 2003; Robertson et al., 2010; Lawson & Song, 2010) and multivariate disease mapping (Knorr-Held et al., 2002; MacNab, 2010). For all these procedures, the model inference therefore results in a real-valued es- timation of the risk at each location and one of the main reported limitations (e.g. by Green & Richardson, 2002) is that local discontinuities in the risk eld are not modelled leading to potentially oversmoothed risk maps. Also, in some cases, coarser representations where areas with similar risk values are grouped are desirable (Abrial et al., 2005). Grouped representations have the advantage of providing clearly delimited areas for dierent risk levels, which is helpful for decision-makers to interpret the risk structure and determine protection mea- sures. These areas at risk can be viewed as clusters as in (Knorr-Held & Rasser, 2000), but we prefer to interpret them as risk classes, as Green & Richardson (2002) and Alfo et al. (2009), since geographically separated areas can have similar risks and be grouped in the same class. In consequence, the classes can be less numerous than the clusters and their interpretation by decision-makers easier in terms of risk value. Using the BYM model, it is possible to derive from the output such a grouping, using either xed risk ranges (usually dicult to choose in practice) or more automated clustering techniques (e.g. Fraley &
Raftery, 2007). In any case this post-processing step is likely to be sub-optimal.
By contrast, in this work, we investigate procedures that include such a risk classication.
There have been several attempts to take into account the presence of dis- continuities in the spatial structure of the risk. Within hierarchical approaches, one possibility is to move the spatial dependence one level higher in the hierar- chy. Green & Richardson (2002) proposed to replace the continuous risk eld by a partition model involving the introduction of a nite number of risk levels and allocation variables to assign each area under study to one of these levels.
Spatial dependencies are then taken into account by modelling the allocation
variables as a discrete state-space Markov eld, namely a spatial Potts model.
This results in a discrete HMRF modelling. The general eect is also to re- cast the disease mapping issue into a clustering task using spatial nite Poisson mixtures. In the same spirit, Fernandez & Green (2002) proposed another class of spatial mixture models, in which the spatial dependence is pushed yet one level higher. Of course, the higher the spatial dependencies in the hierarchy the more exible the model but also the more dicult the parameter estimation.
As regards inference, these various attempts have in common the use of MCMC techniques which can seriously limit even prevent from applying them to large data sets in a reasonable time.
Following the idea of using a discrete HMRF model for disease mapping, we propose to use for inference, as an alternative to simulation based techniques, an Expectation Maximization (EM) framework (Dempster et al., 1977; McLachlan
& Peel, 2000). This framework is commonly used to solve clustering tasks (Fra- ley & Raftery, 2007) but leads to intractable computation when considering non trivial Markov dependencies. However, approximation techniques are available and among them, we propose to investigate variational approximations for their computational eciency and good performance in practice. In particular, we consider the so-called mean eld principle that provides a deterministic way to deal with intractable Markov Random Field (MRF) models (Celeux et al., 2003) and has proven to perform well in a number of applications, e.g. (Forbes et al., 2010; Vignes & Forbes, 2009; Blanchet & Forbes, 2008).
An attempt in this direction has been recently made by Alfo et al. (2009) but with a rather limited consideration for experimental validation and robustness of their setting. The approach in (Alfo et al., 2009) has been tested on a single data set regarding human heart disease. Human disease data usually has the particularity that the populations under consideration are large and the risk values relatively high. This is not fully representative of epidemiological stud- ies, especially non contagious animal disease studies. When considering animal epidemiology, we may have to instead face low size populations and risk levels much smaller than 1, typically1e−5to1e−3. Diculties in applying techniques that work in the human case to data sets in the animal case have not been investigated. In addition, the authors in (Alfo et al., 2009) report no diculties regarding initialization and model selection. This is far from being the case in all practical problems. In this paper we propose to go beyond the work of Alfo et al. (2009) and to address a number of related issues. More specically, we investigate in more detail the model behavior. We pay special attention to one of the main inherent issues when using EM procedures, namely algorithm initialization. The model selection issue, e.g. the determination of the number of classes, is addressed using some previous work (Forbes & Peyrard, 2003) in which a mean eld approximation of the Bayesian Information Criterion (BIC) is provided for HMRF models. The EM solution can depend highly on its start- ing position. We show that in contrast to the example in (Alfo et al., 2009), simple initializations do not always work, especially for rare disease for which the risks are small. We then propose and compare dierent initialization strate- gies in order to get a robust way of initializing for most situations arising in practice.
In addition, we build on the standard hidden Markov eld model used in (Green & Richardson, 2002; Alfo et al., 2009) by considering a more general formulation that is able to encode more complex interactions than the standard
Potts model. In particular, we are able to encode the fact that risk levels in neighboring regions cannot be too dierent, whereas the standard Potts model penalizes neighboring risks equally, whatever the amplitude of their dierence.
The paper is structured as follows. In Section 2, we present the HMRF setting and in particular a general Potts model formulation that allows us to dene an instance of the model more appropriate for disease mapping than the standard version of (Green & Richardson, 2002; Alfo et al., 2009). In Section 3, we explain how disease mapping can be addressed in a more computation- ally ecient way using variational approximations within an EM framework for inference. Important issues related to the EM procedure are then addressed in Section 4. We propose a new ecient initialization strategy and a criterion for selecting the number of risk classes if necessary. Results are reported and compared in Section 5 on both simulated and real data sets. A discussion ends the paper in Section 6.
2 Designing hidden Markov elds for spatial dis- ease mapping
Discrete HMRF have been widely used for a number of classication tasks. Most applications are related to image analysis but other examples include population genetics (Francois et al., 2006), bioinformatics (Vignes & Forbes, 2009), among others. In a clustering or classication context, it has the advantage to pro- vide some insight and control on the clustering regularity through a meaningful and easy to understand parametric model. Hidden structure models and more specically mixture models are among the most statistically mature methods of clustering. In this paper, we recast the disease mapping issue into a clustering task. A clustering or labelling problem is specied in terms of a set of sites S and a set of labels L. A site often represents an item, a point or a region. A set of sites may be categorized in terms of their regularity. Sites on a lattice are considered as spatially regular (e.g. the pixels of a 2D image). Sites which do not present spatial regularity are considered as irregular. This is the usual case when sites represent geographic locations. A label is an event that may happen to a site. In our disease mapping problem, a typical event is the association of a region to a certain risk level. We will consider only the case where there are a nite number of such risk levels, i.e. the labels assume discrete values in a set ofK labels. In the following, it is convenient to considerLas the set of K-dimensional indicator vectors L={e1, . . . , eK} where eachek has all its components set to 0 except thekth which is 1.
Based on count data for a rare phenomenon observed in a predened set S of N areas (e.g. geographical regions), the goal is to assign to each region a risk level among a nite set ofK possible levels{λ1, . . . , λK}when these risk levels are themselves unknown and need to be estimated. In addition, as already mentioned, we are interested in accounting for spatial dependencies between counts in various regions in order to get spatially consistent risk estimation.
In general, risks are expected to be more similar in nearby areas than in areas that are far apart. The idea is to exploit the risk information from neighboring areas to provide more reliable risk estimates in each area. In each area, two values are usually available, the number yi (i = 1, . . . , N) of observed cases of
the given disease and the population sizeni. A common assumption is that for an area indexed by i ∈S, the number of casesyi is a realization of a Poisson distribution whose parameter depends on the risk level assigned to the area.
The unknown risk assignment and risk level compose respectively the hidden part and unknown parameters of the model.
Therefore, the data is naturally divided into observed datay={y1, . . . , yN} and unobserved or missing membership data z={z1, . . . , zN}. The latter are considered as random variables denoted by Z={Z1, . . . , ZN}. When the Zi's are independent, the model reduces to a standard mixture model. When the Zi's are not independent, the inter-relationship between sites can be maintained by a neighborhood system usually dened through a graph. Two neighboring sites correspond to two nodes of the graph linked by an edge.
For disease mapping, we usually consider the simplest possible graph structure connecting contiguous locations: regions i and j are neighbors if and only if they are spatially contiguous.
2.1 A spatial hidden structure model
The dependencies between neighboring Zi's are then modelled by further as- suming that the joint distribution of {Z1, . . . , ZN} is a discrete MRF on this specic graph:
P(z;β) = W(β)−1 exp(−H(z;β)) (1) whereβ is a set of parameters,W(β)is a normalizing constant andH is a function assumed to be of the following form (we restrict to pair-wise interac- tions),
H(z;β) =X
i∈S
Vi(zi;β) +X
i,j i∼j
Vij(zi, zj;β),
where the Vi's andVij's are respectively functions referred to as singleton and pair-wise potentials. We writei∼j when areasi andj are neighbors, so that the second sum above is over neighboring areas.
The set of parametersβ consists of two setsβ = (α,IB)whereαand IB are dened as follows. We can consider pair-wise potentials Vij that depend onzi
andzjbut also possibly oniandj. Since thezi's can only take a nite numberK of values, for eachiandj, we can dene aK×Kmatrix IBij= (IBij(k, l))1≤k,l≤K
and write without loss of generality Vij(zi, zj;β) =−IBij(k, l) if zi = ek and zj =el. Using the indicator vector notation and denoting zit the transpose of vectorzi, it is equivalent to writeVij(zi, zj;β) =−ztiIBijzj. This latter notation has the advantage to make sense also when the vectors are arbitrary and not necessarily indicators. This will be useful when describing the algorithms of Section 3.
Similarly we consider singleton potentialsVithat may depend onziand oni, so that denoting byαiaK−dimensional vector, we can writeVi(zi, β) =−αi(k) ifzi =ek, whereαi(k)is the kthcomponent of αi, or equivalently Vi(zi, β) =
−ztiαi. This vector αi acts as weights for the dierent values of zi. Whenαi
is zero, no risk level is favored, i.e. for a given areai, if no information on the neighboring areas is available, then all risk levels have the same probability. If in addition, for all i and j, IBij = b×IK where b is a real scalar and IK is the K×K identity matrix, parametersβ reduce to a single scalar interaction
parameterb, we get the Potts model traditionally used for image segmentation.
In this case, parameterbcan be interpreted as a strength of interaction between neighbors. The higher b, the more weight is given to the neighbors. Ifb is set to0, only the individual features are taken into account, reducing our model to the independent non spatial case.
Note that the standard Potts model is most of the time appropriate for classication since it tends to favor neighbors that are in the same class (i.e have the same risk level). However this model penalizes pairs that have dierent risk levels with the same penalty whatever the values of these risk levels. In practice, it may be more appropriate from a disease mapping point of view to encode higher penalties when the risk levels are further apart so as to model abrupt changes in the risk level of neighboring regions are undesirable. It follows that cases where the IBij's are far fromb×IK can be useful in situations where encoding ner interactions between regions is desirable. This is part of the exibility and modelling capabilities of the model we describe.
In practice, these parameters can be tuned according to experts, a pri- ori knowledge, or they can be estimated from the data. In the rst case, our model can deal with the most general parametrization, namely β = {αi, IBij, i, j = 1, . . . , N}. In the latter case, the part to be estimated is usually assumed independent of the region indices iand j, so that in what follows we will reduce αand IB respectively to a single vector and a single matrix. Note that formulated as such the model is not identiable in the sense that dierent values of the parameters, namely(α,IB)and(α+γ111K,IB+γ211K×K)lead to
the same probability distribution. Notationsγ1andγ2are scalars and11K and 11K×K respectively denote theK component vector andK×Kmatrix with all components being 1. This issue is generally easily handled by imposing some additional constraint such asα(k) = 0and IB(k, l) = 0for one of the pairs(k, l). In the disease mapping context, we propose to use for IB a matrix with three non zero diagonals dened for some positive real valueb by:
IB(k, k) = b for allk= 1, . . . , K
IB(k, l) = b/2 for all(k, l)such that|k−l|= 1
IB(k, l) = 0 otherwise. (2)
The idea is to favor rst, neighbors in the same risk class and then neighbors in risk classes that are close. All other pairs of risk classes being equally weighted.
This is the simplest non standard IB structure that can encode smooth variations in the risk level.
The diculty with MRF models is that the computation of P(z;β)in (1) is not possible due to the normalizing constantW(β)which involvesKN terms and is intractable except in some trivial cases. By contrast the conditional prob- ability p(zi|zN(i);β), wherezN(i) denotes the values{zj, j∈ N(i)} associated to the setN(i)of neighbors ofi, is easily computed using the following formula:
P(zi|zN(i);β) =
exp(zit(α+IB P
j∈N(i)
zj))
K
P
k=1
exp(α(k) +etkIB P
j∈N(i)
zj)
. (3)
2.2 An observation model for count data
The previous lines dene the hidden structure of the model. For the model to be fully dened, the observation model needs also to be specied. The class dependent distribution P(yi|zi)is usually a standard distribution, typically in rare disease mapping, a Poisson distribution P(yi; ni λzi), where ni is the population size in areaiand thezisubscript inλziindicates that the distribution parameter depends on the specic value ofzi which determines the level of the risk and thus the risk value among a nite number of them{λ1, . . . , λK}. With our vectorial notation, we will write zitλforλzi with λ= [λ1, . . . , λK]t. In the one dimensional Poisson case, this corresponds to:
P(Yi=yi|Zi=zi;λ) = P(yi; nizitλ)
= exp(−niztiλ) (nizitλ)yi
yi! (4)
Note that in practice, epidemiologists usually prefer to consider relative risks rather than absolute risks. Relative risk correspond to the ratio between a local and an overall risk. However, in the case of a unique population without any structure, the use of relative risk is equivalent to the use of absolute risk.
For the distribution of the observed variables y given the classication z, the usual conditional independence assumption leads to:
P(Y=y|Z=z;λ) =Y
i∈S
P(yi;ni zitλ).
It follows that the energy of the hidden eldzgiven the observed eldyis:
H(z|y;λ, β) =H(z;β)−X
i∈S
logP(yi;niztiλ),
and its conditional probability is:
P(z|y;λ, β) =W(β)−1exp(−H(z;β) +X
i∈S
logP(yi;niztiλ)).
The unknown parameters (to be estimated) of this model are then denoted by Ψ= (λ, α,IB).
In labelling problems, most approaches fall into two categories. The rst ones focus on nding the best z using a Bayesian decision principle such as Maximum A Posteriori (MAP) or Maximum Posterior Marginal (MPM) rules.
This explicitly involves the use ofP(z|y)and uses the fact that the conditional eld denoted byZ|Y=yis a Markov eld. This includes methods such as ICM (Besag, 1986) and Simulated Annealing (Geman & Geman, 1984) which dier in the way they deal with the intractableP(z|y)and use its Markovianity. A sec- ond type of approaches is related to a missing data point of view. Originally, the focus is on estimating parameters using a maximum likelihood principle when some of the data are missing (the zi's here). The reference algorithm in such cases is the Expectation-Maximization (EM) algorithm (Dempster et al., 1977).
In addition to providing estimates of the parameters, the EM algorithm pro- vides also a classicationzby oering the possibility to restore the missing data.
However, when applied to HMRFs, the algorithm is not tractable and requires
approximations. It follows variations such as the Gibbsian EM of Chalmond (1989), the MCEM algorithm and a generalization of it (Qian & Titterington, 1991), the PPL-EM algorithm of Qian & Titterington (1991) and various Mean Field like approximations of EM (Celeux et al., 2003). We focus on the latter in the next section for their computational eciency and good results in practice.
3 Estimating disease maps using variational EM variants
In disease mapping, the question of interest is to recover the unknown map z interpreted as a classication into a nite number K of labels representing K risk classes. These labels are unobserved and have to be considered as missing data. Our aim is therefore to classify each region in one of the K risk classes.
To do so, we consider a MPM principle consisting in assigning each regionito the classek that maximizes P(Zi =ek|y;Ψ). Such maximizations depend on the model parameters Ψ which is usually unknown (or partly unknown when prior knowledge can be incorporated) and has to be estimated. In this paper, to deal with the spatial dependence structure, we use the EM algorithm with some of the approximations presented in (Celeux et al., 2003). These approx- imations are based on the mean eld principle which consists in replacing the intractable Markov distributions by factorized ones for which the exact EM can be carried out. It corresponds to one of the simplest variational approximations and allows to take the Markovian structure into account while preserving the good features of EM. The work in (Celeux et al., 2003) generalizes the mean eld principle and introduces dierent factorized models resulting in dierent procedures. Note that in practice, these algorithms have to be extended to incorporate the estimation of matrix IB and to include irregular neighborhood structure coming from arbitrary graphs and not from regular pixel grids like in image analysis.
3.1 A tractable EM variant
Briey, these algorithms can be presented as follow. They are based on the EM algorithm which is an iterative procedure aiming at maximizing the log- likelihood (for the observed variables y) of the model by maximizing at each iteration the expectation of the complete log-likelihood (for the observed and hidden variablesyandz) knowing the data and a current estimate of the model parameters. When the model is an Hidden Markov Model with parameters Ψ, there are two diculties in evaluating this expectation. Both the normal- izing constant W(β) in (1) and the conditional probabilities P(zi | y;Ψ) and P(zi, zj |y;Ψ)forjin the neighborhoodN(i)ofi, cannot be computed exactly.
Informally, the mean eld approach consists in approximating the intractable probabilities by neglecting uctuations from the mean in the neighborhood of each region i. This is obtained by assuming the neighboring zj forj in N(i), xed to their mean values. More generally, we talk about mean eld-like ap- proximations when the valuezidoes not depend on the valueszjforj6=iwhich are all set to constants (not necessarily to the means) independently of the value ofzi. These constant values denoted byz˜={˜z1, . . . ,z˜N}are not arbitrary but satisfy some appropriate consistency conditions (see Celeux et al., 2003). It
follows that P(zi|y;Ψ)is approximated by
P(zi |y,z˜N(i);Ψ)∝ P(yi;ni ztiλ)P(zi|˜zN(i);β)
∝ P(yi;nizitλ) exp(zti(α+IB X
j∈N(i)
˜
zj)). (5)
The normalizing constant is not specied but its computation is not an issue as it involves only a sum over Kterms. Then, for all j ∈ N(i), P(zi, zj |y;Ψ)is approximated by P(zi | y,z˜N(i);Ψ)P(zj | y,z˜N(j);Ψ). Both approximations are easy to compute. Using such approximations leads to algorithms which in their general form consist in repeating the following two steps. At iterationq, (1) Create from the data y and some current parameter estimates Ψ(q−1) a
congurationz˜(q) ={z˜(q)1 , . . . ,z˜(q)N }, i.e. values for theZi's that satisfy some consistency conditions (see Celeux et al., 2003). Replace the Markov distributionP(z;β)of (1) by the factorized distribution Q
i∈S
P(zi|˜zN(q)(i);β) using expression (3). It follows that the joint distributionP(y,z;Ψ)can also be approximated by a factorized distribution:
Y
i∈S
P(yi; ni ztiλ)P(zi|˜z(q)N(i);β)
and the two problems encountered when considering the EM algorithm with the exact joint distribution disappear. The second step is therefore, (2) Apply the EM algorithm for this factorized model with starting values
Ψ(q−1), to get updated estimatesΨ(q)of the parameters.
In particular the so-called mean eld and simulated eld algorithms per- form step (1) in two dierent ways. They correspond to two dierent sets of consistency conditions that can be satised by iterating some straightforward equations. The mean eld algorithm consists of updating thez˜i(q)'s by setting, for all i = 1, . . . , N, z˜i(q) to the mean of distribution P(zi | y,z˜N(i)(q) ;Ψ(q−1)). Note that as zi is an indicator vector, the mean valuez˜i(q) is a vector made of the respective probabilities to be in each of theK classes. In the simulated eld algorithm, z˜i(q) is simulated from P(zi | y,z˜N(q)(i);Ψ(q−1)). Note also that to save additional notation, the updating described above is synchronous while we actually implemented a sequential updating of thez˜i(q)'s: each siteiis updated in turn using the new values of the other sites as soon as they become avail- able rather than waiting until all sites have been updated. Step (1) can then be iterated a number of times to guarantee that the corresponding consistency equations are satised. In practice the xed point is reached in few iterations.
Similarly, in step (2), performing a small number of EM iterations is usually enough.The approximate EM step decomposes to the following E and M steps:
The E-step consists in computing, using equation (5), the approximate posteriors denoted by˜t(q)ik:
for alliandk, ˜t(q)ik =P(Zi=ek|y,z˜N(i)(q) ;Ψ(q−1)).
The M-step consists of updating the parameters. The risk updates are available in closed-form (equation (6) below) while the MRF prior parameters need to be computed numerically (equation (7)):
for allk, λ(q)k = P
i∈S
t˜(q)ik yi
P
i∈S
niyi
, (6)
and
β(q) = arg max
β
X
i∈S K
X
k=1
˜t(q)ik log ˜π(q)ik(β), (7)
whereπ˜(q)ik(β) =P(Zi=ek|˜zN(q)(i);β).
Computing the gradient in (7) withβ = (α,IB)and IB of the form specied in equation (2), the solutionsαandbof (7) satisfy:
fork= 1, . . . , K, X
i∈S
˜
πik(q)(α, b) =X
i∈S
˜t(q)ik ,
and, P
i∈S K
P
k=1
˜t(q)ik −˜πik(q)(α, b) P
j∈N(i)
˜
zj(q)(k) +12 (˜zj(q)(k−1) + ˜zj(q)(k+ 1))
!
= 0 The HMRF estimation provides us with estimations for the risk values λk, for k = 1, . . . , K, but also for the hidden eld parameters, i.e matrix IB and vector α. Once the parameters are estimated, approximations of the P(Zi = ek|y;Ψ)values required to classify each region using the MPM principle can be calculated. The area i is assigned to the class k for which the later posterior probability is the highest. Note that when using the mean eld approximation principle, MAP and MPM lead to the same solution due to the product form of the approximation in step (1) above.
3.2 The Search/Run/Select strategy
The likelihood function to be maximized generally possesses many stationary points of dierent natures (including local maxima and minima). Consequently, convergence to the global maximum with the EM algorithm may strongly de- pend on the starting parameters values. To anticipate this initialization issue, we cast the EM algorithm above into a more general procedure. As in (Bier- nacki et al., 2003), we adopt a three stage Search/Run/Select strategy whose goal is to identify in a reasonable amount of time the highest likelihood:
Search. Build a search method for generating M sets of initial parameters values. These sets can be either generated at random or using some ini- tialization strategy (see Section4).
Run. For each initial position from the search step, run the variational EM algorithm described in Section 3.1 until a stopping criterion (e.g. a xed number of iterations or stabilization of the log-likelihood) is satised. As
specied in Section 3.1, two variantes of the algorithm are possible de- pending on the approximation chosen in step (1).
Select. Select the set of estimated parameter values that provides the highest likelihood among theM trials. Build from this selected solution the risk map as specify at the end of Section 3.1.
In the next Section, we focus on the Search step and describe the initial- ization strategy we propose for a more ecient exploration of the parameter space.
4 A search procedure for initializing EM
The sensitivity of the EM algorithm to starting values is a well documented issue. It is particularly critical and very likely to strike when classes are not well separated. To overcome this sensitivity, dierent initialization strategies have been proposed and investigated in the context of independent Gaussian mixtures (see for instance McLachlan & Peel (2000) chap.2) or Biernacki et al.
(2003)). For Gaussian mixtures, Biernacki et al. (2003) propose a strategy based on randomly drawing the mixture model parameter values a number of times to provide random initializations of either Classication EM (CEM), Stochastic EM (SEM) or EM itself stopped after a xed number of iterations. The resulting set of parameter values leading to the highest likelihood is then selected and the EM algorithm ran until convergence starting from this selected position. Other initialization strategies have been investigated in (Karlis & Xekalaki, 2003) for both Gaussian and Poisson mixtures leading also to the conclusion that it was advisable to start from several dierent initial values to ensure more reliable results.
More generally, most strategies can be divided into two categories: the ones based on initial parameter values and the ones based on an initial partition of the data. As the EM goal is primary to estimate parameters, it produces a sequence of parameter estimates and it is therefore natural to initialize the algorithm with initial parameter values. However, as mentioned in (Biernacki, 2004) for the Gaussian case, at each iteration of the algorithm, the E and M steps produce estimated values that are not arbitrary but linked through some equations. The sequence of such estimates corresponds then to an EM trajec- tory in the parameter space. Then, parameter values randomly drawn do not necessarily belong to an EM trajectory and this can result in computationally inecient strategies as the maximum likelihood solution necessarily belongs to one of the possible EM trajectories. In the case of standard Gaussian mixtures in a non spatial context, Biernacki (2004) shows that all EM trajectories are included in a space dened by two equations linking the parameter estimates.
An initialization strategy is then derived and showed in (Biernacki, 2004) to be more ecient than exploring widely the parameter space without any account of the EM trajectories.
A second type of strategies for choosing starting values is to use random partitions of the data into K groups and then to compute for each group, the parameter estimates, here the risk level as the ratio of the observed counts in the group over the population size of the group. These values are then used as starting values. They are by construction in the EM trajectory space but they
tend to provide values close to each others and then not to explore the space eciently (see Figure 1 (d) for an illustration).
In the disease mapping context, the initialization issue has not really been addressed. In this context, Alfo et al in (Alfo et al., 2009) use 500 runs of a short- length CEM algorithm before their approximate EM. They report satisfying results with this initialization procedure although it is mentioned in (Biernacki et al., 2003) that this choice is generally not a stable strategy because CEM is actually more sensitive to the starting value than EM itself. We suspect then that the example in (Alfo et al., 2009) is so that there is no real initialization problem and any strategy would provide a satisfying solution. In this paper, we address initialization following the EM trajectory-based idea developed in (Biernacki, 2004) for Gaussian mixtures in a non spatial case. We show that it is possible to derive, for our spatial Poisson mixture, constraints on the parameter estimates to ensure that they belong to an EM trajectory. We propose this way a new initializing strategy and dene our Search procedure.
4.1 Searching the risk parameters space using the EM tra- jectories properties
Compared to the work in (Biernacki, 2004), our task is complicated by the addition of a spatial Markov prior whose parameters need also to be initialized.
Our rst approach is then to focus on the initialization of the risks or Poisson distributions parametersλ. It is interesting to note that whatever the model for the spatial prior an equation similar to that in (Biernacki, 2004) can be found that links theλk's values. Letn= P
i∈S
ni be the total population size. At each iterationq, we denote byn(q)k the quantity:
n(q)k = P
i∈S
˜t(q)ikni
n ,
which can be interpreted as the proportion of the population in the kth risk level. It follows easily that PK
k=1
n(q)k = 1. Using then equation (6) for the current risk level estimations, it comes,
K
X
k=1
n(q)k λ(q)k = P
i∈S
yi
n = ¯λ. (8)
λ¯ can be interpreted as an average risk and has the property to depend on the observed data only. At each iteration of the algorithm, the current parameter estimates λ(q)k satisfy this equation. Consequently all EM trajectories are in- cluded in the space dened by this equation. The idea is then to produce values for theλk's by sampling in this space. A simple way to achieve this is to follow the simulation steps below:
Step 1. Values for the n(0)k 's are rst drawn using a Dirichlet distribution D(π, . . . , π)with π= 1 for a uniform sampling on the space dened by
K
Pn(0)k = 1.
Step 2. Then k is chosen at random in the set {1, . . . , K} and the λ(0)l 's for l 6= k are drawn uniformly and without replication in the sample {yn1
1, . . . ,nyN
N}. The lastλ(0)k is set to verify:
λ(0)k = λ¯−P
l6=kn(0)k λ(0)l n(0)k
. (9)
These steps generate a vector of random parameters. The number of initial val- ues generated this way is set by the user. Note however, that as in (Biernacki, 2004) for Gaussian parameters, the later equation (9) in step 2 does not guar- antee thatλ(0)k is strictly positive. If this is not the case, the simulated sample is discarded and the procedure restarted from step 1.
To illustrate the proposed strategy and the dierences with other initial- izations using random parameter values or random partitions, we consider a simple non spatial two-class case where the true parameter values areλ1= 0.1, λ2 = 0.2 and the proportions for the two classes are π1 =π2 = 0.5. We con- sider a hundred of sites N = 100 and create values for the population sizes ni by sampling at random with replication among the integers between 10 and 109. We then simulate the yi's from a two component Poisson mixture model using equation (4). The histogram of the yi's is shown in Figure 1 (a). The correspondingλ¯ value is around 0.141.
In the two-class case, it follows from equation (8) that the EM trajectories and therefore the maximum likelihood estimates of λ1 and λ2 are constrained to live in the greyed area of Figure 1 (b) delimited by horizontal lines λ1 = ¯λ andλ2= ¯λ. The latter Figure also shows 100 initial values ofλ1andλ2drawn uniformly at random between 0 and 0.4 and ordered so thatλ1< λ2. Among the 100 points, only 42 lie within the EM trajectory area. The point corresponding to the true valuesλ1= 0.1andλ2= 0.2is marked with an X". For comparison, we show respectively in Figures 1 (c) and (d), 100 values obtained by using our strategy and starting from 100 random partitions. As expected, random partition initializations tend to produce values close to ¯λand fail in exploring the space eciently. By contrast, our strategy better explores the grey area and in particular around the true values (0.1, 0.2). The phenomenon is even more striking as we increase the number of simulated values for initialization.
In Figure 1 (e), (f) and (g), we then show the(λ1, λ2)values obtained af- ter 1 iteration of EM starting from the initial values obtained with the three initializing strategies. To better visualize the dierences on the plots, we show only 20 such values. They lie all within the grey area but only our proposed strategy tends to produce values near the true values. Random partition values are somewhat clustered at the wrong place and values from random parameter values are more dispersed. Among the 20 points, more points are likely not to lead close to the desired solution.
4.2 A full search procedure
To complete our Search procedure, we need in addition to initialλvalues, start- ing positions for the Markov prior parametersβ = (α,IB). When IB reduced to some valueb(denition (2)), our full Search procedure decomposes then in two steps as follows: