To do (RL):
- harmonize the notation with FR's macros;
- replace IS by SIS throughout, and stress that "sequential" means "along the construction of the tree";
- separate more clearly the derivation of the recursion, the exact π's and the π̂'s;
- put the keywords in blue;
- add the term p(Hτ) = stationary distribution of allelic states wherever it is necessary.
To do (FR):
1) Cover the backward equation in an earlier lecture (and other points about my presentation of the backward equation)
2) The differential operator Φj is not made explicit => more complicated sentences
Master 2 Biostatistics module: population genetics models
Likelihood-based demographic inference using the coalescent
Raphaël Leblois & François Rousset
Centre de Biologie pour la Gestion des Populations (CBGP, Montpellier), Institut des Sciences de l'Evolution (ISEM, Montpellier)
January 2017
Introduction
Likelihoods under the coalescent
Felsenstein et al.'s MCMC
  Metropolis-Hastings
  MsVar example
  Conclusions on MCMC
Griffiths et al.'s IS
  Griffiths et al.'s recursion
  Old IS scheme
  New IS scheme
  A general method based on diffusion
  Approximations of π
Simulation tests
  Precision
  Validation
  Robustness
  MCMC vs. IS
Conclusions
Typical biological question:
• There is demographic evidence that orang-utan population sizes have collapsed
→ but what is the major cause of the decline, when did it start and how strong is it?
• Can population genetics help?
- Can we infer the time of the event?
- Can we infer the strength of the population size decrease?
Methods based on coalescence simulations (Reminder...)
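[Figure: genealogy of the population, followed forward in time, versus the genealogy of the sample (the coalescent tree), traced backward in time.]
Time during which k lineages coexist, and number m of mutations on a branch of length t:
$$P(T_k = t) \approx \frac{k(k-1)}{2N}\, e^{-t\,\frac{k(k-1)}{2N}}, \qquad P(m \mid t) = \frac{(\mu t)^m e^{-\mu t}}{m!}$$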
Two different ways to use the coalescent theory
• Exploratory approaches & simulation tests
- The coalescent allows efficient simulation of genetic variability under various demo-genetic models (sample vs. population)
- Workflow: specify the model and parameter values → coalescent process → simulated data sets
• Inferential approach
- The coalescent allows the inference of population-level evolutionary parameters (genetic, demographic, reproductive, ...); some of these methods use all the information contained in the genetic data (likelihood-based methods)
- Workflow: a real data set → coalescent process → infer the model parameters
Likelihood-based inference under the coalescent
• Inferential approaches are based on the modeling of population genetic processes. Each population genetic model is characterized by a set of demographic and genetic parameters P
• The aim is to infer those parameters from a polymorphism data set (i.e. a genetic sample)
• The genetic sample is then considered as the realization ("output") of a stochastic process defined by the demo-genetic model
Likelihood-based inference under the coalescent
• First, compute or estimate the likelihood L(P∗; D), i.e. the probability P(D; P∗) of observing the data D for some parameter values P∗
• Second, either infer the likelihood surface over all parameter values, find the set of parameter values that maximizes it and compute confidence intervals (maximum likelihood method), or compute posterior distributions and compare them with the priors (Bayesian approach).
Likelihood computations under the coalescent
• Problem: most of the time, the likelihood P(D; P∗) of a genetic sample cannot be computed because there is no explicit mathematical expression for it
• However, the probability P(D; P∗ | Gk) of observing the data D given a specific genealogy Gk can be computed for some parameter values P∗.
• The likelihood is then the sum of these genealogy-specific likelihoods over the whole genealogical space, each weighted by the probability of the genealogy given the parameters:
$$L(P^*; D) = \int_G P(D; P^* \mid G)\, P(G; P^*)\, dG$$
• Genealogies are missing data: they are needed to compute the likelihood, but there is no interest in estimating them.
→ very different from phylogenetic approaches
• In the integrand, P(D; P∗ | G) is governed by the mutation process and P(G; P∗) by demography (the coalescent).
... but it is usually impossible to sum over all possible genealogies.
→ Monte Carlo simulations are used: a large number K of genealogies are simulated according to P(G; P∗), and the mean over those simulations is taken as an estimate of the expectation of P(D; P∗ | G):
$$L(P^*; D) = E_{P(G; P^*)}\big[P(D; P^* \mid G)\big] \approx \frac{1}{K} \sum_{k=1}^{K} P(D; P^* \mid G_k)$$
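The Monte Carlo average can be illustrated on a deliberately simple case. The sketch below is a hypothetical example, not code from the course: it assumes a constant-size population, coalescence times T_k that are exponential with rate k(k−1)/(2N), and data summarized by the total number S of mutations on the tree (Poisson with mean µ times the total branch length). All function names are illustrative.

```python
import math
import random


def simulate_total_branch_length(n, N, rng):
    """Draw the total branch length of a coalescent tree for n genes.

    While k lineages coexist, the waiting time to the next coalescence is
    exponential with rate k(k-1)/(2N); it contributes k times its length.
    """
    total = 0.0
    for k in range(n, 1, -1):
        rate = k * (k - 1) / (2.0 * N)
        total += k * rng.expovariate(rate)
    return total


def poisson_pmf(m, lam):
    """Poisson probability of m events with mean lam (computed on the log scale)."""
    if lam <= 0.0:
        return 1.0 if m == 0 else 0.0
    return math.exp(-lam + m * math.log(lam) - math.lgamma(m + 1))


def mc_likelihood(s_obs, n, N, mu, K=20000, seed=1):
    """Estimate P(S = s_obs) as the average of P(S = s_obs | G_k) over K simulated trees."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(K):
        length = simulate_total_branch_length(n, N, rng)
        acc += poisson_pmf(s_obs, mu * length)
    return acc / K
```

For n = 2 the integral has the closed form m^s / (1 + m)^(s+1) with m = 2Nµ, which gives a convenient check of the estimator.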
A very large number of genealogies is needed for a good estimate of the likelihood.
Naive Monte Carlo simulation is often inefficient because too many of the simulated genealogies give an extremely low probability of observing the data. More efficient algorithms are therefore used to explore the genealogical space and to focus on genealogies that are well supported by the data.
Likelihood computations under the coalescent
• Two main approaches have been developed, using more efficient algorithms that explore genealogies in proportion to their probability of explaining the data, P(D; P | G).
MCMC: Markov chain Monte Carlo on the genealogical and parameter spaces, based on Felsenstein's pruning algorithm (1973, 1981)
Felsenstein, J. (1981). "Evolutionary trees from DNA sequences: a maximum likelihood approach". J. of Mol. Evol. 17 (6): 368-376.
IS: Importance Sampling on genealogies, based on the work of Griffiths & Tavaré (1994)
Griffiths, R.C. and S. Tavaré (1994). Simulating probability distributions in the coalescent. Theor. Pop. Biol., 46: 131-159.
Likelihood computations under the coalescent
• More efficient algorithms that explore genealogies in proportion to their probability of explaining the data, P(D; P | G):
MCMC: Felsenstein's pruning algorithm
- Easier to implement, can easily consider various models
- Implemented in many software packages (LAMARC, Batwing, MsVar, MIGRATE, IM)
IS: Griffiths & Tavaré's coalescent recursion
- Extension to different models may be difficult
- Implemented in fewer software packages (Genetree, Migraine)
The approach of Felsenstein et al.
• Based on (1) the availability of approximate exponential distributions for the time intervals between events (coalescence, migration and recombination) and (2) the separation of demographic and mutational processes:
1. The probability of a genealogy given the parameters of the demographic model, P(Gk; P∗demo), can be computed from the distributions of the times between events.
2. The probability of the data given a genealogy and the mutational parameters, P(D; P∗mut | Gk), can be computed from the mutation model parameters, the mutation rate, the tree topology and the branch lengths.
• From this, an efficient algorithm to explore the genealogical and the parameter spaces should allow the inference of the likelihood over the two spaces.
→ MCMC
MCMC with Metropolis-Hastings sampler
• Full conditional distributions cannot be computed, so classical MCMC samplers that rely on them (e.g. Gibbs) cannot be used
→ Markov chain Monte Carlo (MCMC) simulations using the Metropolis-Hastings (MH) algorithm
- to explore the genealogy space (G)
- and the parameter space (P = Pdemo + Pmut)
All algorithms based on the 'Felsenstein et al.' approach use similar MH/MCMC algorithms, with slight differences in the MCMC update steps.
Metropolis-Hastings sampling for the coalescent
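For the Metropolis-Hastings algorithm, we need the ratio of the probability of the proposed state to that of the current state:
1. Computation of P(Gk; Pdemo), where γ(t) is the rate of events at time t and t_0 < t_1 < ... are the successive event times back to the MRCA:
$$P(G_k; P_{demo}) = \prod_{i=0}^{MRCA-1} \gamma(t_{i+1})\, e^{-\int_{t_i}^{t_{i+1}} \gamma(t)\, dt}$$
- Example for a stable WF population (coalescence only, time homogeneous), with k_{i+1} lineages between t_i and t_{i+1}:
$$P(G_k; P_{demo}) = \prod_{i=0}^{MRCA-1} \frac{k_{i+1}(k_{i+1}-1)}{2}\, e^{-(t_{i+1}-t_i)\,\frac{k_{i+1}(k_{i+1}-1)}{2}}$$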
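As a minimal sketch of the stable WF example, the function below evaluates the product on the log scale. It is an illustration only: it assumes that time is already scaled so that k lineages coalesce at total rate k(k−1)/2, that t_0 = 0, and that the only events are the n−1 coalescences. The function name is hypothetical.

```python
import math


def log_prob_coalescence_times(event_times, n):
    """Log of prod_i rate_i * exp(-rate_i * (t_{i+1} - t_i)) for a stable WF population.

    event_times: increasing times t_1 < ... < t_{n-1} of the n-1 coalescences,
    in scaled time units; n: sample size. The number of lineages starts at n
    and decreases by one at each event.
    """
    if len(event_times) != n - 1:
        raise ValueError("a sample of n genes has exactly n - 1 coalescence events")
    log_p = 0.0
    t_prev = 0.0
    k = n
    for t in event_times:
        if t <= t_prev:
            raise ValueError("event times must be strictly increasing and positive")
        rate = k * (k - 1) / 2.0
        log_p += math.log(rate) - rate * (t - t_prev)
        t_prev = t
        k -= 1
    return log_p
```

Working on the log scale avoids numerical underflow when the product runs over many events.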
2. Then compute the probability P(D; Pmut | Gk) of the data D given the genealogy Gk, by going from the MRCA to the leaves and considering, on each branch b of length t_b, the probability of the number m_b of mutations and of their effect (i.e. the transition between genetic states x → y):
$$P(D; P_{mut} \mid G_k) = \prod_{b=1}^{n_{branch}} \underbrace{P(y \mid x, m_b)}_{\text{effect of mutations}} \cdot \underbrace{P(m_b \mid t_b)}_{\text{number of mutations}} = \prod_{b=1}^{2(n-1)} \big((Mat_{mut})^{m_b}\big)_{x,y}\, \frac{(\mu t_b)^{m_b} e^{-\mu t_b}}{m_b!}$$
where Mat_mut is the mutation matrix, ((Mat_mut)^{m_b})_{x,y} is the transition probability between the genetic states x and y after m_b mutations, and the last factor is the Poisson probability of the m_b mutations.
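The contribution of a single branch to this product can be sketched as follows. This is an illustrative helper, not part of any of the programs cited in these slides; it assumes the mutation matrix is given as a list of rows and that allelic states are coded 0, 1, 2, ...

```python
import math


def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    size = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(size)) for c in range(size)]
            for r in range(size)]


def mat_pow(mat, power):
    """mat raised to a non-negative integer power (identity for power 0)."""
    size = len(mat)
    result = [[1.0 if r == c else 0.0 for c in range(size)] for r in range(size)]
    for _ in range(power):
        result = mat_mul(result, mat)
    return result


def branch_probability(x, y, m_b, t_b, mu, mat_mut):
    """((Mat_mut)^m_b)[x][y] times the Poisson probability of m_b mutations in time t_b."""
    transition = mat_pow(mat_mut, m_b)[x][y]
    lam = mu * t_b
    poisson = math.exp(-lam) * lam ** m_b / math.factorial(m_b)
    return transition * poisson
```

With a two-allele matrix in which every mutation changes the state, for instance, an odd number of mutations is required to go from state 0 to state 1, so the branch probability is zero for even m_b.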
3. These probabilities are plugged into the MH formula for the acceptance probability of candidate changes to the next state of the Markov chain.
Reminder: the quantity written P(D; P | Gk) in the MH ratio is the product P(D; Pmut | Gk) P(Gk; Pdemo)
Metropolis-Hastings sampling for the coalescent
• For each update, the new state (P′ or G′) is accepted or rejected according to the Metropolis-Hastings ratio,
• the MH ratio is chosen so that the chain converges towards the desired stationary distribution (the posterior distribution given the data D), e.g. for a parameter update
$$r_{MH} = \frac{P(D; P' \mid G)\, Prior(P')}{P(D; P \mid G)\, Prior(P)} \cdot \frac{P(P' \to P)}{P(P \to P')}$$
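The accept/reject step can be sketched generically. The code below is an assumption-laden illustration, not the update scheme of MsVar or of any other program: `mh_accept` applies the MH ratio on the log scale, and `run_toy_chain` uses it on a hypothetical toy target (a standard normal density with a symmetric random-walk proposal) only to show the mechanics.

```python
import math
import random


def mh_accept(log_target_new, log_target_cur, log_q_reverse, log_q_forward, rng):
    """Accept with probability min(1, r_MH), where
    log r_MH = (log target' - log target) + (log q(new -> cur) - log q(cur -> new))."""
    log_r = (log_target_new - log_target_cur) + (log_q_reverse - log_q_forward)
    if log_r >= 0.0:
        return True
    return rng.random() < math.exp(log_r)


def run_toy_chain(n_steps, seed=1, half_width=2.0):
    """Random-walk MH chain targeting a standard normal density (toy example)."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.uniform(-half_width, half_width)
        # Symmetric proposal: the two proposal terms cancel, so both are set to 0.
        if mh_accept(-0.5 * proposal ** 2, -0.5 * x ** 2, 0.0, 0.0, rng):
            x = proposal
        samples.append(x)
    return samples
```

In a coalescent sampler the log target would be log P(D; Pmut | G) + log P(G; Pdemo) + log Prior(P), and the proposal terms would no longer cancel when the update is asymmetric.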
Coalescent-based MCMC example : MsVar
• One example of a coalescent-based MCMC algorithm: MsVar
Beaumont, M. 1999. Detecting Population Expansion and Decline Using Microsatellites. Genetics.
• Biological context: past changes in population size (cf. orang-utans)
- Details of the demographic and mutation models
- A few results on the orang-utan data set
Coalescent-based MCMC example : MsVar
• Demographic model: a single isolated panmictic (WF) population with an exponential past change in population size (population contraction or expansion).
- 3 demographic parameters: N, T, Nanc
- 1 mutation parameter: µ
- 3 scaled parameters (diffusion approximation): θ, D, θanc
Coalescent-based MCMC example : MsVar
• Mutation model: Stepwise Mutation Model (SMM)
• Aim: infer those parameters (P or Pscaled) from a single present-day genetic sample using coalescent-based MCMC algorithms
P = (N, T, Nanc, µ), Pscaled = (θ, D, θanc)
MH/MCMC of MsVar
1. Initialization step: build a genealogy that is compatible with the data
→ Starting from the sample, choose a set of events according to the starting values of the parameters; the events are also chosen to be compatible with the data
2. MCMC steps: explore the parameter and genealogical spaces
→ Either update the demographic parameters (θact, D, θanc), or update the genealogy (the sequence and times Ti of coalescence and mutation events)
Both updates are made using the Metropolis-Hastings algorithm
MCMC updates in MsVar
Notation: Ti = times of coalescence and mutation events; r = θact/θanc, the population size ratio; tf = D, the time of the population size change.
M. Beaumont : “This scheme was devised by trial and error to obtain good rates of convergence.”
Analyses of MsVar results
• First check that the chains have mixed and converged properly
→ Visual check (very useful)
• Traces of likelihood / parameters
• Autocorrelation
→ Compute convergence criteria among chains (GR, i.e. Gelman-Rubin, ...); not always useful...
→ Run several chains and check the concordance between their results
Problem: convergence is often poor with such coalescent-based MCMC algorithms ... but simulation tests show that posterior distributions are generally correct (at least with the mode as point estimate), even without clear indications of convergence.
Analyses of MsVar results
• Bayesian method → compare posteriors (solid lines) and priors (dashed lines)
... and test different priors
Analyses of MsVar results
• Bayesian method → compute a Bayes factor to check for a contraction or expansion signal
$$BF = \frac{\text{Posterior prob. model 1}}{\text{Posterior prob. model 2}} \cdot \frac{\text{Prior prob. model 2}}{\text{Prior prob. model 1}}$$
• With equal priors for models 1 and 2, the Bayes factor for a contraction is thus
$$BF = \frac{\text{Posterior } P(N_{anc}/N_{act} > 1)}{\text{Posterior } P(N_{anc}/N_{act} < 1)} = \frac{\#\ \text{MCMC steps where } N_{anc}/N_{act} > 1}{\#\ \text{MCMC steps where } N_{anc}/N_{act} < 1}$$
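The counting form of the Bayes factor is easy to sketch. The function below is a hypothetical helper that assumes the MCMC output is available as two equally long lists of sampled values for Nanc and Nact, and that the priors on contraction and expansion are equal; it is not part of MsVar.

```python
def bayes_factor_contraction(n_anc_samples, n_act_samples):
    """Ratio of the number of MCMC steps with Nanc/Nact > 1 to those with Nanc/Nact < 1.

    Assumes equal prior probabilities for contraction and expansion.
    Returns infinity if no sampled step supports an expansion.
    """
    above = 0
    below = 0
    for n_anc, n_act in zip(n_anc_samples, n_act_samples):
        ratio = n_anc / n_act
        if ratio > 1.0:
            above += 1
        elif ratio < 1.0:
            below += 1
    if below == 0:
        return float("inf")
    return above / below
```

For example, `bayes_factor_contraction([10, 20, 5, 8], [5, 5, 10, 2])` counts three steps supporting a contraction against one supporting an expansion. In practice the burn-in steps would be discarded before counting.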
An application of MsVar: Orang-Utans and the deforestation of Borneo
Does the genome of orang-utans carry the signature of population bottlenecks?
(Goossens et al. 2006 PLoS Biology)
(Delgado & Van Schaik, 2001 Evol. Anthropology)
Population sizes have collapsed: what is the cause?
Can population genetics help?
• The data
• MsVar results
→ MsVar efficiently detects a past decrease in population size...
... and allows the beginning of the decrease to be dated: massive forest exploitation seems to be the most likely cause
Timeline markers: FE = beginning of massive forest exploitation; F = first farmers; HG = first hunter-gatherers
Conclusions about MsVar / MCMC approaches
• Coalescent theory provides a powerful framework for statistical inference
→ It makes it possible to infer past history from a single present-day sample (which was impossible with moment-based methods)
• Gene genealogies are missing data (but important...)
→ MCMC algorithms with coalescent simulations are "difficult" to run
• But how robust are they to model assumptions?
• Mutational processes
(e.g. large mutation steps → long branches)
• Population structure
(e.g. immigrants → long branches)
Likelihood computations under the coalescent
• More efficient algorithms that explore genealogies better (i.e. in proportion to P(D; P | G)):
MCMC: Felsenstein's pruning algorithm
- Easier to implement, can consider various models
- Implemented in many software packages (LAMARC, Batwing, MsVar, MIGRATE, IM)
IS: Griffiths & Tavaré's coalescent recursion (cf. Ewens' recursion)
- Extension to different models may be difficult
- Implemented in fewer software packages (Genetree, Migraine)
The approach of Griffiths et al.
• The coalescent-based likelihood at a given point of the parameter space is an integral over all possible histories (genealogies with mutations) leading to the present genetic sample
• A Monte Carlo scheme is used to compute this integral
• Histories are built backward in time, event by event, starting from the present sample
• But the computation of exact backward transition probabilities is often too difficult
→ an IS scheme is used to compute the likelihoods by simulation
Recursions for sampling distributions
Ewens 1972: a Wright-Fisher, infinite-allele model
General recursion at stationarity [with e_j = (0, ..., 0, 1, 0, ..., 0), the 1 being in the j-th position]:
$$P_n(a) = \frac{\theta}{n-1+\theta}\, P_{n-1}(a - e_1) + \frac{n-1}{n-1+\theta} \sum_{j:\, a_{j+1} > 0} \frac{j(a_j+1)}{n-1}\, P_{n-1}(a + e_j - e_{j+1})$$
where, given that a coalescence occurs and that the descendant sample has (..., a_j, a_{j+1}, ...), the ancestral one has (..., a_j + 1, a_{j+1} − 1, ...), and the probability that one of the a_j + 1 alleles with j gene copies is chosen to duplicate is j(a_j + 1)/(n − 1).
Reminder: the sample configuration (vector a) is indexed by allele counts, i.e. it is the allele frequency spectrum: a_j is the number of alleles with j copies in the sample.
Griffiths and Tavaré: recursion for mutation models defined by a matrix (p_ij) of mutation rates from i to j
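The Ewens recursion is one of the few cases that can be solved directly, which makes it a useful check. The sketch below is an illustration written for these notes, with hypothetical function names: it evaluates the recursion by memoized recursion down to a sample of size 1 and compares the result with the closed-form Ewens sampling formula.

```python
import math
from functools import lru_cache


def ewens_recursion(a, theta):
    """P_n(a) from the recursion; a[j-1] is the number of alleles with j copies."""

    @lru_cache(maxsize=None)
    def prob(cfg):
        n = sum((idx + 1) * count for idx, count in enumerate(cfg))
        if n == 1:
            return 1.0
        total = 0.0
        # Last event was a mutation: the ancestral sample has one singleton fewer.
        if cfg[0] > 0:
            anc = list(cfg)
            anc[0] -= 1
            total += theta / (n - 1 + theta) * prob(tuple(anc))
        # Last event was a coalescence: an allele with j copies gained one copy.
        for j in range(1, len(cfg)):
            if cfg[j] > 0:  # a_{j+1} > 0
                anc = list(cfg)
                anc[j - 1] += 1
                anc[j] -= 1
                total += j * (cfg[j - 1] + 1) / (n - 1 + theta) * prob(tuple(anc))
        return total

    return prob(tuple(a))


def ewens_formula(a, theta):
    """Closed-form Ewens sampling formula, used here only as a check."""
    n = sum((idx + 1) * count for idx, count in enumerate(a))
    log_p = math.lgamma(n + 1) - sum(math.log(theta + r) for r in range(n))
    for j, count in enumerate(a, start=1):
        log_p += count * math.log(theta / j) - math.lgamma(count + 1)
    return math.exp(log_p)
```

For a sample of two genes, for instance, the configuration with two singletons has probability θ/(1 + θ) under both functions.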
The recursion of Griffiths et al.
• The coalescent-based likelihood at a given point of the parameter space is an integral over all possible histories (genealogies with mutations)
H = {Hk; k = 0, ..., τ}
corresponding to all coalescence or mutation events that occurred from H0, the current sample state, to Hτ, the allelic state of the most recent common ancestor (MRCA) of the sample.
The recursion of Griffiths et al.
• Then for any given state Hk of the history (cf. Ewens):
$$p(H_k) = \sum_{\{H_{k'}\}} p(H_k \mid H_{k'})\, p(H_{k'})$$
where Hk′ is the ancestral sample state (i.e. the state before the last event) and p(Hk | Hk′) are the forward transition probabilities (i.e. from the ancestral to the current state)
The recursion of Griffiths et al.
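• Griffiths & Tavaré 1994: example for a single population
$$p(H_k = \eta) = \frac{1}{\frac{n(n-1)}{2N} + n\mu} \left[ n\mu \sum_i \sum_{j:\, n_j > 0,\, j \neq i} \frac{n_i+1}{n}\, p_{ij}\, p(H_{k'} = \eta - e_j + e_i) + \frac{n(n-1)}{2N} \sum_{j:\, n_j > 1} \frac{n_j - 1}{n-1}\, p(H_{k'} = \eta - e_j) \right]$$
- Setting θ = 2Nµ and β = n(n − 1 + θ), we have
$$p(H_k = \eta) = \frac{1}{\beta} \left[ \theta \sum_i \sum_{j:\, n_j > 0,\, j \neq i} (n_i+1)\, p_{ij}\, p(H_{k'} = \eta - e_j + e_i) + n \sum_{j:\, n_j > 1} (n_j - 1)\, p(H_{k'} = \eta - e_j) \right]$$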
• Such recursions are too difficult to solve except for very simple models (WF + IAM, cf. Ewens)
→ Griffiths & Tavaré (1994) proposed a Monte Carlo approach that uses sequential importance sampling on past histories to solve the recursion.
Inference of the likelihood by simulation
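• Griffiths & Tavaré 1994: the recursion
$$p(H_k = \eta) = \frac{1}{\beta} \left[ \theta \sum_i \sum_{j:\, n_j > 0,\, j \neq i} (n_i+1)\, p_{ij}\, p(H_{k'} = \eta - e_j + e_i) + n \sum_{j:\, n_j > 1} (n_j - 1)\, p(H_{k'} = \eta - e_j) \right]$$
can equivalently be written
$$p(H_k) = w_{GT}(H_k) \left( \sum_{i,j:\, n_j>0,\, j\neq i} M_{ij}(H_k)\, p(H_k - e_j + e_i) + \sum_{j:\, n_j > 1} C_j(H_k)\, p(H_k - e_j) \right)$$
where the M_ij(Hk) and C_j(Hk) are the coefficients of the recursion rescaled so that they sum to one, and w_GT(Hk) is the corresponding normalizing factor.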
• Griffiths & Tavaré 1994: a backward absorbing Markov chain based on forward transition probabilities
→ Histories are built backward, event by event, using an absorbing Markov chain (absorbing state = MRCA) based on forward transition probabilities ("uniform sampling" based on M_ij(Hk) and C_j(Hk)) among all possible events.
w_GT(Hk) is the weight associated with the IS proposal.
Inference of the likelihood by simulation
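• Expanding the recursion p(Hk) = Σ_{Hk′} p(Hk | Hk′) p(Hk′) over all possible ancestral histories of the current sample leads to
$$p(H_0) = E\big[p(H_0 \mid H_1) \cdots p(H_{\tau-1} \mid H_\tau)\, p(H_\tau)\big]$$
Then, with L histories H_h simulated under the proposal distribution f_GT,
$$L(P; D) = p(H_0) = \int_H W_{GT}(H)\, f_{GT}(H)\, dH \approx \frac{1}{L} \sum_{h=1}^{L} W_{GT}(H_h) \approx \frac{1}{L} \sum_{h=1}^{L} \prod_{k=0}^{\tau} w_{GT}\big((H_h)_k\big).$$
This IS scheme f_GT(H) is not very efficient because it does not appropriately take into account that some backward transitions are more likely than others given the current state (example: mutations under the SMM).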
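The scheme can be sketched on the smallest non-trivial case. The code below is a hypothetical illustration, not the Genetree implementation: it assumes a symmetric two-allele model (p_12 = p_21 = 1, zero diagonal), uses the single-population recursion with β = n(n − 1 + θ), proposes each backward event in proportion to its coefficient in the recursion, and stops at a single gene, whose stationary probability is 1/2. Under these assumptions the population allele frequency is Beta(θ, θ) distributed, so the exact sample probability is beta-binomial and serves as a check.

```python
import math
import random


def gt_is_likelihood(sample, theta, n_histories=20000, seed=1):
    """Importance-sampling estimate of p(sample) for a symmetric two-allele model."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_histories):
        counts = list(sample)
        weight = 1.0
        while sum(counts) > 1:
            n = sum(counts)
            beta = n * (n - 1 + theta)
            events = []  # pairs (coefficient in the recursion, ancestral state)
            for j in (0, 1):
                i = 1 - j
                if counts[j] > 0:  # last event: mutation from i to j
                    anc = counts[:]
                    anc[j] -= 1
                    anc[i] += 1
                    events.append((theta * (counts[i] + 1), anc))
                if counts[j] > 1:  # last event: coalescence of two type-j genes
                    anc = counts[:]
                    anc[j] -= 1
                    events.append((n * (counts[j] - 1), anc))
            coef_sum = sum(coef for coef, _ in events)
            weight *= coef_sum / beta  # w_GT for the current state
            u = rng.random() * coef_sum
            acc = 0.0
            chosen = events[-1][1]
            for coef, anc in events:
                acc += coef
                if u < acc:
                    chosen = anc
                    break
            counts = chosen
        total += weight * 0.5  # stationary probability of the MRCA allele
    return total / n_histories


def exact_two_allele(sample, theta):
    """Beta-binomial probability of the sample when the frequency is Beta(theta, theta)."""
    n1, n2 = sample
    log_p = (math.lgamma(n1 + n2 + 1) - math.lgamma(n1 + 1) - math.lgamma(n2 + 1)
             + math.lgamma(theta + n1) + math.lgamma(theta + n2)
             - math.lgamma(2 * theta + n1 + n2)
             + math.lgamma(2 * theta) - 2 * math.lgamma(theta))
    return math.exp(log_p)
```

With θ = 1 every configuration of three genes has exact probability 1/4, which the estimate should approach. For larger samples or mutation models such as the SMM, the weights of this proposal become highly variable, which is the inefficiency noted above.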
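Towards a better IS scheme
(Stephens & Donnelly 2000, de Iorio & Griffiths 2004)
→ A better Importance Sampling (IS) scheme should be used. Let Q(Hk′) be a given distribution; then
$$p(H_k) = \sum_{\{H_{k'}\}} \frac{p(H_k \mid H_{k'})}{Q(H_{k'})}\, Q(H_{k'})\, p(H_{k'})$$
and, expanding from the current sample H0 back to the MRCA,
$$p(H_0) = E_Q\left[\frac{p(H_0 \mid H_1)}{Q(H_1)} \cdots \frac{p(H_{\tau-1} \mid H_\tau)}{Q(H_\tau)}\, p(H_\tau)\right]$$
where E_Q is the expectation over the distribution of full histories induced by Q. This means that Q may be viewed as a proposal distribution in a sequential IS algorithm with matching weights w(Hk) = p(Hk | Hk′)/Q(Hk′).
• The problem is then to find the proposal distribution that minimizes the variance of the likelihood estimates
$$\frac{1}{L} \sum_{h=1}^{L} \prod_{k=0}^{\tau} w\big((H_h)_k\big).$$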
Towards a better IS scheme
(Stephens & Donnelly 2000, de Iorio & Griffiths 2004)
• The ideal proposal is the backward transition probability p(Hk′ | Hk), because the IS weights are then
$$\frac{p(H_k \mid H_{k'})}{Q(H_{k'})} = \frac{p(H_k \mid H_{k'})}{p(H_{k'} \mid H_k)} = \frac{p(H_k)}{p(H_{k'})}$$
by Bayes' rule, so that their product along any history telescopes to p(H0)/p(Hτ); multiplied by p(Hτ), it is always the sample likelihood p(H0).
→ a single tree reconstruction allows exact likelihood computation (null variance).
• However, the backward transition probabilities p(Hk′ | Hk) are generally unknown
Aim: find good approximations p̂(Hk′ | Hk) of p(Hk′ | Hk)
Towards a better IS scheme
(Stephens & Donnelly 2000, de Iorio & Griffiths 2004)
• The likelihood at a given point is an integral over all possible histories H = {Hk; k = 0, ..., τ}.
• Markov coalescent process → p(Hk) = Σ p(Hk | Hk′) p(Hk′) and p(H0) = E[p(H0 | H1) ... p(Hτ−1 | Hτ) p(Hτ)].
• However, forward transition probabilities p(Hk | Hk′) are not efficient in a backward process
• Importance sampling techniques based on an approximation p̂(Hk′ | Hk) of p(Hk′ | Hk) are used to build more likely histories
$$p(H_0) = E_{\hat p}\left[\frac{p(H_0 \mid H_1)}{\hat p(H_1 \mid H_0)} \cdots \frac{p(H_{\tau-1} \mid H_\tau)}{\hat p(H_\tau \mid H_{\tau-1})}\, p(H_\tau)\right].$$
Linking optimal weights to addition of a gene to a sample
Represent the sample probability p(n) as an integral over the joint distribution f(x) of allele frequencies in the population:
$$p(n) = \int_x p(n \mid x)\, f(x)\, dx = E_f\left[\binom{n}{\mathbf{n}} \prod_i X_i^{n_i}\right], \quad \text{where } \binom{n}{\mathbf{n}} = \frac{n!}{\prod_i n_i!}$$
is the multinomial coefficient.
Then the joint probability that we have a sample n and that an additional gene copy is of type j is
$$E_f\left[X_j \binom{n}{\mathbf{n}} \prod_i X_i^{n_i}\right] = \frac{n_j + 1}{n + 1}\, p(n + e_j).$$
We write this joint probability as p(n) times π(j | n), where π is thus the probability that an additional gene is of type j, given that we have already drawn the sample n from the population. Thus if Hk and Hk′ differ by the addition of one gene of type j, we can write the optimal IS weight as
$$\frac{p(H_{k'} = n)}{p(H_k = n + e_j)} = \frac{n_j + 1}{n + 1}\, \frac{1}{\pi(j \mid n)}$$
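The relation between π(j | n) and the sample probabilities can be checked numerically in a case where everything is known. The sketch below assumes, purely for illustration, that the population allele frequencies follow a Dirichlet distribution with parameters α_i (as under parent-independent mutation); sample probabilities are then Dirichlet-multinomial and π(j | n) should equal (n_j + α_j)/(n + Σα). The function names are hypothetical.

```python
import math


def log_dirichlet_multinomial(counts, alphas):
    """Log probability of an unordered sample when frequencies are Dirichlet(alphas)."""
    n = sum(counts)
    a = sum(alphas)
    log_p = math.lgamma(n + 1) + math.lgamma(a) - math.lgamma(a + n)
    for count, alpha in zip(counts, alphas):
        log_p += math.lgamma(alpha + count) - math.lgamma(alpha) - math.lgamma(count + 1)
    return log_p


def pi_from_sample_probs(j, counts, alphas):
    """pi(j | n) = (n_j + 1)/(n + 1) * p(n + e_j) / p(n)."""
    n = sum(counts)
    bigger = list(counts)
    bigger[j] += 1
    ratio = math.exp(log_dirichlet_multinomial(bigger, alphas)
                     - log_dirichlet_multinomial(counts, alphas))
    return (counts[j] + 1) / (n + 1) * ratio
```

The factor (n_j + 1)/(n + 1) is exactly the ratio of the two multinomial coefficients, which is why it cancels in this closed-form case.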
Towards a better IS scheme : the π’s
• Let π(· | Hk) be the conditional distribution of the allelic type of an (n+1)-th gene, given Hk, the configuration (i.e. allelic types) of the first n genes of the sample.
• Then the optimal IS distribution (exact backward transition probabilities) is, for a single population:
$$p(H_{k'} \mid H_k) = \frac{1}{\beta}\, \theta\, n_j\, \frac{\pi(i \mid H_k - e_j)}{\pi(j \mid H_k - e_j)}\, P_{ij} \quad \text{for } H_{k'} = H_k - e_j + e_i$$
$$p(H_{k'} \mid H_k) = \frac{1}{\beta}\, \frac{n_j (n_j - 1)}{\pi(j \mid H_k - e_j)} \quad \text{for } H_{k'} = H_k - e_j$$
Towards a better IS scheme: the π̂'s
• Unfortunately, the π's are generally unknown
→ Stephens & Donnelly (2000) proposed a good approximation π̂ of the π's for a single WF population.
→ de Iorio & Griffiths (2004) proposed a general method for approximating the π's under different mutational and demographic models
• Approximate backward transition probabilities based on the π̂'s are then used:
$$\hat p(H_{k'} \mid H_k) = \frac{1}{\beta}\, \theta\, n_j\, \frac{\hat\pi(i \mid H_k - e_j)}{\hat\pi(j \mid H_k - e_j)}\, P_{ij} \quad \text{for } H_{k'} = H_k - e_j + e_i$$
$$\hat p(H_{k'} \mid H_k) = \frac{1}{\beta}\, \frac{n_j (n_j - 1)}{\hat\pi(j \mid H_k - e_j)} \quad \text{for } H_{k'} = H_k - e_j$$
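The backward equation for f(X_t | X_0 = x)
For a diffusion process, the probability density f of allele frequencies satisfies the Kolmogorov backward equation, which describes the changes in f over time in the form
$$\frac{d f(X_t \mid X_0 = x)}{dt} = \Phi(f(x)),$$
where Φ is a differential operator that here takes the form
$$\Phi = \frac{1}{2} \sum_{i \in E} \sum_{j \in E} x_i (\delta_{ij} - x_j)\, \frac{\partial^2}{\partial x_i \partial x_j} + \sum_{j \in E} \Big( \sum_{i \in E} x_i r_{ij} \Big) \frac{\partial}{\partial x_j} = \sum_{j \in E} \Phi_j\, \frac{\partial}{\partial x_j}$$
with E the set of allelic types and
$$R = \{r_{ij}\} \equiv \frac{\theta}{2} (P - I)$$
where P = {p_ij} is the mutation matrix and I the identity matrix. Matching the two expressions of Φ term by term gives
$$\Phi_j = \frac{1}{2} \sum_{i \in E} x_i (\delta_{ij} - x_j)\, \frac{\partial}{\partial x_i} + \sum_{i \in E} x_i r_{ij}.$$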
The backward equation for E[g(X_t) | X_0 = x]
In the same way as
$$\frac{d f(X_t \mid X_0 = x)}{dt} = \Phi(f(x)),$$
the following "generator equation" (Karlin and Taylor, 1981, p. 215) holds for any function g(x) with bounded second derivatives:
$$\lim_{t \to 0} \frac{E[g(X_t) \mid X_0 = x] - g(x)}{t} = \Phi(g(x)).$$
We will apply this result with g the sample probability given the population allele frequencies x.
π̂'s computation
To obtain a recursion on the sample probabilities p(n), with n = H0, we write p(n) in the form E[g(X)]:
$$p(n) = E\left[\binom{n}{\mathbf{n}} \prod_i X_i^{n_i}\right], \quad \text{where } \binom{n}{\mathbf{n}} = \frac{n!}{\prod_i n_i!}.$$
We therefore have
$$\frac{d\, p(n)}{dt} = \Phi[p(n)].$$
At stationary equilibrium, d p(n)/dt is zero. Expanding the expression for Φ[p(n)] then yields the recursion between the p(n).
Explicit recursions in terms of π
Applying the previous arguments then leads to a relation between the probabilities of samples that differ by one coalescence or mutation event:
$$N n \left(\frac{n-1}{N} + \mu\right) p(n) = N \sum_j \frac{n (n_j - 1)}{N}\, p(n - e_j) + N\mu \sum_j \sum_i P_{ij}\, (n_i + 1 - \delta_{ij})\, p(n - e_j + e_i).$$
Expressing all p(·) in terms of the p(n − e_j) for distinct j:
$$N \sum_j \left(\frac{n-1}{N} + \mu\right) \pi(j \mid n - e_j)\, n\, p(n - e_j) = N \sum_j \frac{n (n_j - 1)}{N}\, p(n - e_j) + N\mu \sum_j \sum_i P_{ij}\, n\, \pi(i \mid n - e_j)\, p(n - e_j)$$
... a huge system of linear equations, not easier to solve in this form.
π̂'s computation
Note that Φ[p(n)] can be written in the form
$$\sum_{j \in E} \Phi_j\, \frac{\partial}{\partial x_j} [p(n)].$$
The approximation technique developed by de Iorio & Griffiths is to approximate the π's (derived above from p(n), the solution of Φ[p(n)] = 0) by π̂'s derived from the solutions of
$$E\left[\Phi_j\, \frac{\partial p(n)}{\partial x_j}\right] = E\left[\Phi_j\, \frac{\partial}{\partial x_j} \binom{n}{\mathbf{n}} \prod_i x_i^{n_i}\right] = 0, \quad \text{for each } j \in E.$$
For a panmictic population, this gives, for each j ∈ E,
$$n_j (n - 1 + \theta)\, \hat p(n) = n (n_j - 1)\, \hat p(n - e_j) + \sum_{i \in E} \theta P_{ij}\, (n_i + 1 - \delta_{ij})\, \hat p(n - e_j + e_i)$$
π̂'s computation
Reminder: π(j | n) can be expressed in terms of p(n) and p(n + e_j):
$$\pi(j \mid n)\, p(n) = \frac{n_j + 1}{n + 1}\, p(n + e_j).$$
If we assume that this relation also holds for the π̂ and p̂, which will generally not be the case, we have
$$\hat\pi(j \mid n)\, \hat p(n) = \frac{n_j + 1}{n + 1}\, \hat p(n + e_j)$$
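Using $\hat\pi(j \mid n)\, \hat p(n) = \frac{n_j + 1}{n + 1}\, \hat p(n + e_j)$ in the panmictic recursion for the p̂, and then replacing n by n + e_j, we obtain for each j ∈ E:
$$(n + \theta)\, \hat\pi(j \mid n) = n_j + \sum_{i \in E} \theta P_{ij}\, \hat\pi(i \mid n)$$
This is the linear system that allows the π̂(j | n) to be computed for a Wright-Fisher model.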
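A small numerical sketch of this linear system is given below, written in the form (n + θ) π̂(j | n) = n_j + θ Σ_i P_ij π̂(i | n), for which the π̂(j | n) sum to one. It is not the Migraine implementation: it solves the system by fixed-point iteration (a contraction because θ/(n + θ) < 1 when P is row-stochastic), which is an arbitrary choice made here for brevity, and the function name is hypothetical.

```python
def solve_pi_hat(counts, theta, mutation_matrix, tol=1e-14, max_iter=10000):
    """Solve (n + theta) * pi[j] = counts[j] + theta * sum_i P[i][j] * pi[i] for pi.

    counts: allele counts n_j of the sample; mutation_matrix: row-stochastic
    matrix P as a list of rows, P[i][j] being the probability of mutating from i to j.
    """
    n = sum(counts)
    k = len(counts)
    pi = [1.0 / k] * k
    for _ in range(max_iter):
        new = [(counts[j] + theta * sum(mutation_matrix[i][j] * pi[i] for i in range(k)))
               / (n + theta) for j in range(k)]
        diff = max(abs(a - b) for a, b in zip(new, pi))
        pi = new
        if diff < tol:
            break
    return pi
```

Under parent-independent mutation (all rows of P equal to the same vector P_j) the solution reduces to (n_j + θ P_j)/(n + θ), which gives a simple check; for other matrices, such as a stepwise model, the solution has no such closed form but still sums to one.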
New IS scheme with the π̂'s
To do: make two summary slides of the new IS scheme, reusing parts of slides 63, 64, 65, 66, 70 and 71.
A much better IS scheme based on the π̂'s
• Drastic gain in efficiency with this new IS scheme (old IS: millions of trees)
→ exact backward transition probabilities for a WF model with parent-independent mutation (i.e. KAM)
→ only 30 histories are necessary for a good estimate of the likelihood under more complex models (structured populations & KAM)
• but efficiency decreases slightly with non-parent-independent mutation models, e.g. the stepwise mutation model (200 histories for structured populations & SMM)
• and efficiency is still limited for time-inhomogeneous demographic models, e.g. one population with a past size change (cf. the Orang-Utan example)
→ up to 20,000 histories are necessary for strong disequilibrium scenarios (e.g. a quick change in population size)
Implementations of IS : Genetree and Migraine
• Genetree (Bahlo & Griffiths 2000, old IS algorithm)
- 2 to 4 populations with migration (ISM)
• Migraine (Rousset & Leblois 2007-2014, new IS algorithms)
- One single stable population (KAM, SMM, GSM, ISM)
- One population with past size variation (KAM, SMM, GSM, ISM)
- 2 populations with migration (KAM, SMM, ISM)
- Isolation By Distance in 1D and 2D (KAM)
Implementation of IS in Migraine
1. C++ core: IS computations
• Stratified random sampling of parameter points
• Estimation of the likelihood at each point using IS
2. R code for "post-treatment"
• Likelihood surface interpolation by Kriging
• Inference of MLEs and CIs
• Plots of 1D and 2D likelihood profiles
Simulation tests
Can we trust the demographic / historical inferences made with those methods?
• Bias, RMSE, coverage properties of confidence intervals
• Robustness to realistic but "uninteresting" mis-specifications
→ to this aim, we tested by simulation:
- the performance of Migraine in inferring dispersal under IBD
- the performance of MsVar and Migraine in detecting and measuring past population size changes
A few interesting results...