Two different ways to use the coalescent theory

(1)

To do, RL:

harmonize the notation with FR's macros,

replace IS with SIS throughout, stressing that "sequential" means along the construction of the tree,

separate more clearly the derivation of the recursion, the exact π's and the π̂'s,

put the keywords in blue,

add the term p(H_τ) = stationary distribution of the allelic states wherever it is needed.

FR:

1) Cover "backward" in an earlier lecture... (and other points about my presentation of backward)

2) the differential operator φ_j is not written out explicitly => more complicated sentences

(2)

Master 2 Biostatistics module: population genetics models

Likelihood-based demographic inference using the coalescent

Raphaël Leblois & François Rousset

Centre de Biologie pour la Gestion des Populations (CBGP, Montpellier)
Institut des Sciences de l’Evolution (ISEM, Montpellier)

January 2017

(3)

Introduction

Likelihoods under the coalescent

Felsenstein et al.’s MCMC
- Metropolis-Hastings
- MsVar example
- Conclusions on MCMC

Griffiths et al.’s IS
- Griffiths et al.’s recursion
- Old IS scheme
- New IS scheme
- A general method based on diffusion
- Approximations of π

Simulation tests
- Precision
- Validation
- Robustness
- MCMC vs. IS

Conclusions

(4)

Introduction

Likelihoods under the coalescent
Felsenstein et al.’s MCMC
Griffiths et al.’s IS
Simulation tests
Conclusions

(5)

Typical biological question :

• There is demographic evidence that orang-utan population sizes have collapsed

→ but what is the major cause of the decline, when did it start, and how strong is it?

• Can population genetics help?
- Can we infer the time of the event?
- Can we infer the strength of the population size decrease?

(6)

Methods based on coalescence simulations (Reminder...)

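[Figure: genealogy of the population (forward in time) versus genealogy of the sample, i.e. the coalescent tree (backward in time)]

Time between coalescence events while k lineages remain:

$$P(T_k = t) \approx \frac{k(k-1)}{2N}\, e^{-t\,\frac{k(k-1)}{2N}}$$

Number m of mutations on a branch of length t:

$$P(m \mid t) = \frac{(\mu t)^m e^{-\mu t}}{m!}$$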
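As a reminder of how these two distributions are used in coalescent simulations, here is a minimal Python sketch. It draws the waiting times T_k from exponential distributions with rate k(k−1)/(2N), then draws the total number of mutations on the tree from a Poisson distribution. The function names, and the choice of counting mutations over the whole tree rather than branch by branch, are illustrative assumptions and not part of the slides.

```python
import random

def simulate_coalescent_times(n, N, rng):
    """Waiting times T_n, ..., T_2 (in generations) for a sample of n genes."""
    times = []
    for k in range(n, 1, -1):
        rate = k * (k - 1) / (2 * N)      # coalescence rate while k lineages remain
        times.append(rng.expovariate(rate))
    return times

def poisson(mean, rng):
    """Poisson draw obtained by counting unit-rate exponential arrivals in [0, mean]."""
    count, total = 0, rng.expovariate(1.0)
    while total < mean:
        count += 1
        total += rng.expovariate(1.0)
    return count

def simulate_total_mutations(n, N, mu, rng):
    """Number of mutations on a simulated coalescent tree (hypothetical helper)."""
    times = simulate_coalescent_times(n, N, rng)
    # While k lineages remain, the total branch length grows at rate k.
    total_length = sum(k * t for k, t in zip(range(n, 1, -1), times))
    return poisson(mu * total_length, rng)
```

Averaged over many replicates, the number of mutations should approach 2Nµ(1 + 1/2 + ... + 1/(n−1)), which gives a quick sanity check of the simulator.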

(7)

Two different ways to use the coalescent theory

• Exploratory approaches & simulation tests

- The coalescent allows efficient simulation of genetic variability under various demo-genetic models (sample vs. population)

Specify the model and parameter values → Coalescent process → Simulated data sets

• Inferential approach

- The coalescent allows the inference of population-level evolutionary parameters (genetic, demographic, reproductive, ...); some of these methods use all the information contained in the genetic data (likelihood-based methods)

A real data set → Coalescent process → infer the model parameters

(9)

Likelihood-based inference under the coalescent

• Inferential approaches are based on the modeling of population genetic processes. Each population genetic model is characterized by a set of demographic and genetic parameters P

• The aim is to infer those parameters from a polymorphism data set (i.e. a genetic sample)

• The genetic sample is then considered as the realization ("output") of a stochastic process defined by the demo-genetic model

(10)

Likelihood-based inference under the coalescent

• First, compute or estimate the likelihood L(P^∗;D), i.e. the probability P(D;P^∗) of observing the data D for some parameter values P^∗

• Second, infer the likelihood surface over all parameter values, find the set of parameter values that maximizes it, and compute confidence intervals (maximum likelihood method),

or compute posterior distributions and compare them with the priors (Bayesian approach).

(11)

Introduction

Likelihoods under the coalescent

Felsenstein et al.’s MCMC
Griffiths et al.’s IS

(12)

Likelihood computations under the coalescent

• Problem: most of the time, the likelihood P(D;P^∗) of a genetic sample cannot be computed because there is no explicit mathematical expression for it

• However, the probability P(D;P^∗∣G_k) of observing the data D given a specific genealogy G_k can be computed for some parameter values P^∗.

• The likelihood is then the sum of all genealogy-specific likelihoods over the whole genealogical space, weighted by the probability of the genealogy given the parameters:

L(P^∗;D) = ∫_G P(D;P^∗∣G)P(G;P^∗)dG

(13)

Likelihood computations under the coalescent

• The likelihood can be written as the sum of P(D;P^∗∣G_k) over the genealogical space (all possible genealogies):

L(P^∗;D) = ∫_G P(D;P^∗∣G)P(G;P^∗)dG

• Genealogies are missing data: they are needed to compute the likelihood, but there is no interest in estimating them.

→ very different from the phylogenetic approaches

Mutation Demography

(Coalescent)

(14)

Likelihood computations under the coalescent

L(P^∗;D) = ∫_GP(D;P^∗∣G)P(G;P^∗)dG

...Usually impossible to sum over all possible genealogies...

→ Monte Carlo simulations are used: a large number K of genealogies are simulated according to P(G;P^∗), and the mean over those simulations is taken as an estimate of the expectation of P(D;P^∗∣G):

$$L(P^*; D) = E_{P(G; P^*)}\big[P(D; P^* \mid G)\big] \approx \frac{1}{K} \sum_{k=1}^{K} P(D; P^* \mid G_k)$$
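The naive Monte Carlo estimator above can be illustrated on a toy case where the exact answer is known. The sketch below assumes a sample of two genes, an infinite-sites mutation model, and data D summarized by the number s of differences between the two sequences; none of these choices come from the slides. The genealogy is then a single coalescence time t, drawn with rate 1/N, and P(D ∣ G) is a Poisson probability with mean 2µt.

```python
import math
import random

def poisson_pmf(m, mean):
    """Poisson probability of m events for a given mean."""
    return math.exp(-mean) * mean ** m / math.factorial(m)

def mc_likelihood(s_obs, N, mu, K, rng):
    """Average P(D | G_k) over K genealogies drawn from P(G)."""
    total = 0.0
    for _ in range(K):
        t = rng.expovariate(1.0 / N)              # coalescence time of 2 lineages
        total += poisson_pmf(s_obs, 2 * mu * t)   # mutations on 2 branches of length t
    return total / K

def exact_likelihood(s_obs, N, mu):
    """Closed form for this toy case: geometric distribution with theta = 2*N*mu."""
    theta = 2 * N * mu
    return (1 / (1 + theta)) * (theta / (1 + theta)) ** s_obs
```

In this two-gene case every simulated genealogy contributes a non-negligible term, so the estimator converges quickly; the inefficiency discussed in these slides appears with larger samples, where most simulated genealogies are nearly incompatible with the data.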

(15)

Likelihood computations under the coalescent

L(P^∗;D) = ∫_GP(D;P^∗∣G)P(G;P^∗)dG

...Usually impossible to sum over all possible genealogies...

→ Monte Carlo simulations are used:

$$L(P^*; D) = E_{P(G; P^*)}\big[P(D; P^* \mid G)\big] \approx \frac{1}{K} \sum_{k=1}^{K} P(D; P^* \mid G_k)$$

A very large number of genealogies is necessary for a good estimate of the likelihood...

(16)

Likelihood computations under the coalescent

• Monte Carlo simulations are used:

$$L(P^*; D) = E_{P(G; P^*)}\big[P(D; P^* \mid G)\big] \approx \frac{1}{K} \sum_{k=1}^{K} P(D; P^* \mid G_k)$$

Plain Monte Carlo simulations are often not very efficient, because too many genealogies give extremely low probabilities of observing the data. More efficient algorithms are therefore used to explore the genealogical space and focus on the genealogies that are well supported by the data.

(17)

Likelihood computations under the coalescent

• Two main approaches have been developed, using more efficient algorithms that allow a better exploration of the genealogies, in proportion to their probability of explaining the data P(D;P∣G).

MCMC: Markov chain Monte Carlo on the genealogical and parameter spaces, based on Felsenstein’s pruning algorithm (1973, 1981)

Felsenstein, J. (1981). "Evolutionary trees from DNA sequences: a maximum likelihood approach". J. Mol. Evol. 17(6): 368-376.

IS: Importance Sampling on genealogies, based on the work of Griffiths & Tavaré (1994).

Griffiths, R.C. and S. Tavaré (1994). Simulating probability distributions in the coalescent. Theor. Pop. Biol. 46: 131-159.

(18)

Likelihood computations under the coalescent

• More efficient algorithms that allow a better exploration of the genealogies, in proportion to their probability of explaining the data P(D;P∣G)

MCMC: Felsenstein’s pruning algorithm
- Easier to implement, can easily consider various models
- Implemented in many software packages (LAMARC, Batwing, MsVar, MIGRATE, IM)

IS: Griffiths & Tavaré’s coalescent recursion
- Extension to different models may be difficult
- Implemented in fewer software packages (Genetree, Migraine)

(19)

Likelihood computations under the coalescent

• More efficient algorithms that allow a better exploration of the genealogies, in proportion to their probability of explaining the data P(D;P∣G)

MCMC: Felsenstein’s pruning algorithm (quick overview)
- Easier to implement, can consider various models
- Implemented in many software packages (LAMARC, Batwing, MsVar, MIGRATE, IM)

IS: Griffiths & Tavaré’s coalescent recursion
- Extension to different models may be difficult

(20)

Introduction

Likelihoods under the coalescent
Felsenstein et al.’s MCMC
Griffiths et al.’s IS
Simulation tests
Conclusions

(21)

The approach of Felsenstein et al.

• Based (1) on the availability of approximate exponential distributions for the time intervals between events (coalescence, migration and recombination) and (2) on the separation of demographic and mutational processes:

1. The probability of a genealogy given the parameters of the demographic model, P(G_k;P_demo^∗), can be computed from the distributions of the times between events.

2. The probability of the data given a genealogy and the mutational parameters, P(D;P_mut^∗∣G_k), can be computed from the mutation model parameters, the mutation rate, the tree topology and the branch lengths.

(22)

The approach of Felsenstein et al.

• Based (1) on the availability of approximate exponential distributions for the time intervals between events (coalescence, migration and recombination) and (2) on the separation of demographic and mutational processes:

1. P(G_k;P_demo^∗) is computed from the distributions of the times between events.

2. P(D;P_mut^∗∣G_k) is computed from the mutation parameters, the tree topology and the branch lengths.

• From this, an efficient algorithm exploring both the genealogical and the parameter spaces should allow the likelihood to be inferred over the two spaces.

→ MCMC

(23)

Felsenstein et al.’s MCMC
Metropolis-Hastings
MsVar example
Conclusions on MCMC

(24)

MCMC with Metropolis-Hastings sampler

• Full conditional distributions cannot be computed, so classical MCMC samplers (e.g. Gibbs) cannot be used

→ Markov chain Monte Carlo (MCMC) simulations using the Metropolis-Hastings (MH) algorithm

- to explore the genealogy space (G)

- and the parameter space (P = P_demo + P_mut)

All algorithms based on the "Felsenstein et al." approach use similar MH/MCMC algorithms, with slight differences in the MCMC update steps.

(25)

Metropolis-Hastings sampling for the coalescent

For the Metropolis-Hastings algorithm, we need to compute the ratio of the probability of the proposed update to that of the current state:

1. Computation of P(G_k;P_demo):

$$P(G_k; P_{demo}) = \prod_{i=0}^{MRCA-1} \gamma(t_{i+1})\, e^{-\int_{t_i}^{t_{i+1}} \gamma(t)\, dt}$$

- Example for a stable WF population (coalescence only, time homogeneous):

$$P(G_k; P_{demo}) = \prod_{i=0}^{MRCA-1} \frac{k_{i+1}(k_{i+1}-1)}{2}\, e^{-(t_{i+1}-t_i)\,\frac{k_{i+1}(k_{i+1}-1)}{2}}$$
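For the stable WF example, the product above is easiest to evaluate on the log scale. The following Python sketch assumes that time is measured in coalescent units (so that the rate with k lineages is k(k−1)/2, as in the formula) and that the genealogy is passed as the list of successive inter-coalescence intervals; the function name and input format are illustrative assumptions.

```python
import math

def log_prob_genealogy(interval_lengths, n):
    """Log of the product over intervals of rate * exp(-rate * duration).

    interval_lengths[i] is the time during which n - i lineages coexist,
    for i = 0, ..., n - 2 (from the sample back to the MRCA).
    """
    logp = 0.0
    for i, dt in enumerate(interval_lengths):
        k = n - i                      # number of lineages during this interval
        rate = k * (k - 1) / 2         # coalescence rate with k lineages
        logp += math.log(rate) - rate * dt
    return logp
```

Working with logarithms avoids numerical underflow when the sample is large, since the product then contains many small factors.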

(26)

Metropolis-Hastings sampling for the coalescent

1. Computation of P(G_k;P_demo):

$$P(G_k; P_{demo}) = \prod_{i=0}^{MRCA-1} \gamma(t_{i+1})\, e^{-\int_{t_i}^{t_{i+1}} \gamma(t)\, dt}$$

2. Then compute the probability P(D;P_mut∣G_k) of the data D given the genealogy G_k, by going from the MRCA to the leaves and considering the probability of occurrence of all mutations on each branch of length t_b and their effects (i.e. transitions among genetic states x → y):

(27)

Metropolis-Hastings sampling for the coalescent

2. Then compute the probability P(D;P_mut∣G_k):

$$P(D; P_{mut} \mid G_k) = \prod_{b=1}^{\text{nb branch}} \overbrace{P(y \mid x, m_b)}^{\text{effect of mutations}} \cdot \overbrace{P(m_b \mid t_b)}^{\text{number of mutations}} = \prod_{b=1}^{2(n-1)} \big((Mat_{mut})^{m_b}\big)_{x,y}\, \frac{(\mu t_b)^{m_b} e^{-\mu t_b}}{m_b!}$$

where Mat_mut is the mutation matrix: ((Mat_mut)^{m_b})_{x,y} is the transition probability between genetic states (x, y), and the second factor is the Poisson probability of the m_b mutations.

(28)

Metropolis-Hastings sampling for the coalescent

1. Computation of P(G_k;P_demo):

$$P(G_k; P_{demo}) = \prod_{i=0}^{MRCA-1} \gamma(t_{i+1})\, e^{-\int_{t_i}^{t_{i+1}} \gamma(t)\, dt}$$

2. Then compute P(D;P_mut∣G_k):

$$P(D; P_{mut} \mid G_k) = \prod_{b=1}^{2(n-1)} \big((Mat_{mut})^{m_b}\big)_{x,y}\, \frac{(\mu t_b)^{m_b} e^{-\mu t_b}}{m_b!}$$

3. These probabilities are plugged into the MH formula for the acceptance probabilities of candidate changes for the next state of the Markov chain.

Reminder: P(D;P∣G_k) = P(D;P_mut∣G_k) P(G_k;P_demo)

(29)

Metropolis-Hastings sampling for the coalescent

• For each update, the new state (P′ or G′) is accepted or rejected according to the Metropolis-Hastings ratio,

• the MH ratio is chosen so that the chain converges towards the correct stationary distribution P(D;P), e.g.

$$r_{MH} = \frac{P(D; P' \mid G)\, Prior(P')}{P(D; P \mid G)\, Prior(P)} \cdot \frac{P(P' \to P)}{P(P \to P')}$$
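The accept/reject rule based on the MH ratio can be sketched generically in Python. This is a toy illustration, not the MsVar update scheme: the target is an arbitrary Exponential(1) density standing in for likelihood × prior, the proposal is a symmetric Gaussian random walk (so the two proposal terms cancel), and all function names are hypothetical.

```python
import math
import random

def mh_accept(log_post_current, log_post_proposed, log_q_forward, log_q_backward, rng):
    """Accept or reject a proposed state with the Metropolis-Hastings rule (log scale)."""
    # log r_MH = log[target(new) / target(old)] + log[q(new -> old) / q(old -> new)]
    log_r = (log_post_proposed - log_post_current) + (log_q_backward - log_q_forward)
    u = 1.0 - rng.random()                 # uniform on (0, 1], avoids log(0)
    return math.log(u) <= log_r

def log_target(x):
    """Toy unnormalised target: an Exponential(1) density on x > 0."""
    return -x if x > 0 else -math.inf

def run_chain(n_steps, step_sd, rng, x0=1.0):
    """Random-walk MH chain on the toy target; returns the visited states."""
    x, samples = x0, []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_sd)   # symmetric proposal: q terms cancel
        if mh_accept(log_target(x), log_target(proposal), 0.0, 0.0, rng):
            x = proposal
        samples.append(x)
    return samples
```

In a coalescent sampler the same rule would be applied alternately to parameter updates and genealogy updates, with log_target replaced by the log of P(D;P∣G) times the prior.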

(30)

Conclusions on MCMC

(31)

Coalescent-based MCMC example : MsVar

• One example of a coalescent-based MCMC algorithm : MsVar

Beaumont, M. 1999. Detecting Population Expansion and Decline Using Microsatellites. Genetics.

• Biological context: past changes in population sizes (cf. orang-utans)
- Details of the demographic and mutation models
- A few results on the orang-utan data set

(32)

Coalescent-based MCMC example : MsVar

• Demographic model: a single isolated panmictic (WF) population with an exponential past change in population size.

(33)

Coalescent-based MCMC example : MsVar

Population contraction or expansion

(34)

Coalescent-based MCMC example : MsVar

3 demographic parameters: N, T, N_anc

+ 1 mutation parameter: µ

3 scaled parameters (diffusion approx.): θ, D, θ_anc

(35)

Coalescent-based MCMC example : MsVar

• Mutation model: Stepwise Mutation Model (SMM)

(36)

Coalescent-based MCMC example : MsVar

• Aim: infer those parameters (P or P_scaled) from a single present-day genetic sample using coalescent-based MCMC algorithms

P = (N, T, N_anc, µ)
P_scaled = (θ, D, θ_anc)

(37)

MH/MCMC of MsVar

1. Initialization step : Build a genealogy that is compatible with the data

→ Starting with the sample, choose a set of events depending on starting values of the parameters ; the events are also chosen to be compatible with the data

2. MCMC steps : Explore the parameter and the genealogical space

→ Update the parameters for population sizes (θ_act, D, θ_anc), or update the genealogy (sequence and times of coalescence and mutation events, T_i)

Both updates are made using the Metropolis-Hastings algorithm

(38)

MCMC updates in MsVar

T_i = times of coalescence and mutation events, r = θ_act/θ_anc = population size ratio, t_f = D = time of the population size change

M. Beaumont : “This scheme was devised by trial and error to obtain good rates of convergence.”

(39)

Analyses of MsVar results

• First check that the chains mixed and converged properly

→ Visual check (very useful)

• Traces of likelihood / parameters

• Autocorrelation

→ Compute convergence criteria among chains (GR, ...): not always useful...

→ Run different chains and check concordance between results

Problem: convergence is often pretty bad with such coalescent-based MCMC algorithms... but simulation tests show that posterior distributions are generally correct (at least the mode as a point estimate) despite the absence of clear convergence indices...

(40)

Analyses of MsVar results

• Bayesian method → compare posteriors (solid lines) and priors (dashed lines)

... and test different priors

(41)

Analyses of MsVar results

• Bayesian method → compute a Bayes factor to check for a contraction or expansion signal

$$BF = \frac{\text{Posterior prob. model 1}}{\text{Posterior prob. model 2}} \cdot \frac{\text{Prior prob. model 2}}{\text{Prior prob. model 1}}$$

• With equal priors for models 1 and 2, the Bayes factor for a contraction is thus

$$BF = \frac{\text{Posterior } P(N_{anc}/N_{act} > 1)}{\text{Posterior } P(N_{anc}/N_{act} < 1)} \approx \frac{\#\ \text{MCMC steps where } N_{anc}/N_{act} > 1}{\#\ \text{MCMC steps where } N_{anc}/N_{act} < 1}$$
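With equal prior probabilities for contraction and expansion, the Bayes factor reduces to a ratio of counts over the retained MCMC steps. A minimal Python sketch, assuming the chain output is available as a plain list of sampled N_anc/N_act values (a hypothetical input format, not the MsVar output format):

```python
def bayes_factor_contraction(ratios):
    """Bayes factor for contraction versus expansion under equal model priors.

    ratios: sampled values of N_anc / N_act, one per retained MCMC step.
    """
    contraction = sum(1 for r in ratios if r > 1)   # steps supporting a contraction
    expansion = sum(1 for r in ratios if r < 1)     # steps supporting an expansion
    if expansion == 0:
        return float("inf")                         # no sampled step supports expansion
    return contraction / expansion
```

For example, bayes_factor_contraction([2.0, 3.0, 0.5, 4.0]) counts three steps with a ratio above 1 and one step below 1.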

(42)

An application of MsVar :

Orang-Utans and the deforestation of Borneo

Does the genome of Orang-utans carry the signature of population bottlenecks ?

(Goossens et al. 2006 PLoS Biology)

(43)

An application of MsVar :

Orang-Utans and the deforestation of Borneo

(Delgado & Van Schaik, 2001 Evol. Anthropology)

Population sizes have collapsed : what is the cause ?

Can population genetics help ?

(44)

An application of MsVar :

Orang-Utans and the deforestation of Borneo

• The data

(45)

An application of MsVar :

Orang-Utans and the deforestation of Borneo

• MsVar results

→ MsVar efficiently detects a past decrease in population size

(46)

An application of MsVar :

Orang-Utans and the deforestation of Borneo

FE : beginning of massive forest exploitation F : first farmers

HG : first hunter-gatherers

• MsVar results

→ MsVar efficiently detects a past decrease in population size...

... and allows the beginning of the decrease to be dated: massive forest exploitation seems to be the most likely cause

(47)

Conclusions on MCMC

(48)

Conclusions about MsVar / MCMC approaches

• Coalescent theory provides a powerful framework for statistical inference

→ Allows past history to be inferred from a single present-day sample! (this was impossible with moment-based methods)

• Gene genealogies are missing data (but important...)

→ MCMCs with coalescent simulations are “difficult” (to run)

• But how robust is the inference to model assumptions?

• Mutational processes

(e.g. large mutation steps→ long branches)

• Population structure

(e.g. immigrants→long branches)

(49)

Introduction

(50)

Likelihood computations under the coalescent

• More efficient algorithms that allow a better exploration of the genealogies (i.e. in proportion to P(D;P∣G)).

MCMC: Felsenstein’s pruning algorithm
- Easier to implement, can consider various models
- Implemented in many software packages (LAMARC, Batwing, MsVar, MIGRATE, IM)

IS: Griffiths & Tavaré’s coalescent recursion (cf. Ewens’ recursion)
- Extension to different models may be difficult

(51)

The approach of Griffiths et al.

• The coalescent-based likelihood at a given point of the parameter space is an integral over all possible histories (genealogies with mutations) leading to the present genetic sample

• A Monte Carlo scheme is used to compute this integral

• Histories are built backward in time, event by event, starting from the present sample

• But computing exact backward transition probabilities is often too difficult

→ an IS scheme is used to compute the likelihoods by simulation

(52)

Griffiths et al.’s IS

New IS scheme

(53)

Recursions for sampling distributions

Ewens 1972: a Wright-Fisher, infinite-allele model

General recursion at stationarity [with e_j = (0, ⋯, 0, 1, 0, ⋯, 0), the 1 being in the j-th position]:

$$P_n(a) = \frac{\theta}{n-1+\theta}\, P_{n-1}(a - e_1) + \frac{n-1}{n-1+\theta} \sum_{j:\, a_{j+1} > 0} \frac{j(a_j+1)}{n-1}\, P_{n-1}(a + e_j - e_{j+1})$$

where, given that a coalescence occurs and that the descendant sample has (⋯, a_j, a_{j+1}, ⋯), the ancestral one has (⋯, a_j + 1, a_{j+1} − 1, ⋯), and the probability that one of the a_j + 1 alleles with j gene copies is chosen to duplicate is j(a_j + 1)/(n − 1).

Reminder: the sample is summarized by the vector a of allele counts ordered by copy number (the allele frequency spectrum): a_j is the number of alleles with j copies in the sample.

Griffiths and Tavaré: recursion for mutation models defined by a matrix (p_ij) of mutation rates from i to j
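Ewens' recursion can be checked numerically against the closed-form Ewens sampling formula. The Python sketch below is an illustration under stated assumptions: the configuration a is stored as a tuple (a_1, ..., a_n) of length exactly n, the base case is P_1((1,)) = 1, and the function names are hypothetical.

```python
from functools import lru_cache
from math import factorial

@lru_cache(maxsize=None)
def ewens_recursion(a, theta):
    """P_n(a) from the recursion; a = (a_1, ..., a_n), a_j = number of alleles with j copies."""
    n = sum(j * a_j for j, a_j in enumerate(a, start=1))
    if n == 1:
        return 1.0
    total = 0.0
    if a[0] > 0:                                   # last event was a mutation: remove a singleton
        anc = list(a)
        anc[0] -= 1
        total += theta / (n - 1 + theta) * ewens_recursion(tuple(anc[:n - 1]), theta)
    for j in range(1, n):                          # last event: an allele went from j to j+1 copies
        if a[j] > 0:                               # a[j] stores a_{j+1}
            anc = list(a)
            anc[j - 1] += 1
            anc[j] -= 1
            weight = j * (a[j - 1] + 1) / (n - 1)
            total += (n - 1) / (n - 1 + theta) * weight * ewens_recursion(tuple(anc[:n - 1]), theta)
    return total

def ewens_formula(a, theta):
    """Ewens sampling formula: n! / theta_(n) * prod_j (theta / j)^a_j / a_j!."""
    n = sum(j * a_j for j, a_j in enumerate(a, start=1))
    rising = 1.0
    for i in range(n):
        rising *= theta + i                        # rising factorial theta (theta+1) ... (theta+n-1)
    prob = factorial(n) / rising
    for j, a_j in enumerate(a, start=1):
        prob *= (theta / j) ** a_j / factorial(a_j)
    return prob
```

The two functions should agree up to rounding error for any valid configuration, which is the sense in which this simple model (WF + infinite alleles) can be solved exactly.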

(54)

The recursion of Griffiths et al.

• Coalescent-based likelihood at a given point of the parameter space is an integral over all possible histories (genealogies with mutations)

$$H = \{H_k;\ k = 0, \ldots, \tau\}$$

corresponding to all coalescent or mutation events that occurred from H_0, the current sample state, to H_τ, the allelic state of the most recent common ancestor (MRCA) of the sample.

(55)

The recursion of Griffiths et al.

• Then for any given state H_k of the history (cf. Ewens):

$$p(H_k) = \sum_{\{H_k'\}} p(H_k \mid H_k')\, p(H_k')$$

where H_k′ is the ancestral sample state (i.e. the state before the last event) and p(H_k∣H_k′) are the forward transition probabilities (i.e. from the ancestral to the current state)

(56)

The recursion of Griffiths et al.

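• Griffiths & Tavaré 1994: example for a single population

$$p(H_k = \eta) = \frac{1}{\frac{n(n-1)}{2N} + n\mu}\left[ n\mu \sum_i \sum_{j:\, n_j>0,\, j\neq i} \frac{n_i+1}{n}\, p_{ij}\, p(H_k' = \eta - e_j + e_i) + \frac{n(n-1)}{2N} \sum_{j:\, n_j>1} \frac{n_j-1}{n-1}\, p(H_k' = \eta - e_j) \right].$$

- Multiplying numerator and denominator by 2N and setting θ = 2Nµ and β = n(n − 1 + θ), we have

$$p(H_k = \eta) = \frac{1}{\beta}\left[ \theta \sum_i \sum_{j:\, n_j>0,\, j\neq i} (n_i+1)\, p_{ij}\, p(H_k' = \eta - e_j + e_i) + n \sum_{j:\, n_j>1} (n_j-1)\, p(H_k' = \eta - e_j) \right],$$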

(57)

The recursion of Griffiths et al.

• Griffiths & Tavaré 1994: example for a single population

$$p(H_k = \eta) = \frac{1}{\beta}\left[ \theta \sum_i \sum_{j:\, n_j>0,\, j\neq i} (n_i+1)\, p_{ij}\, p(H_k' = \eta - e_j + e_i) + n \sum_{j:\, n_j>1} (n_j-1)\, p(H_k' = \eta - e_j) \right],$$

• Such recursions are too difficult to solve except for very simple models (WF + IAM, cf. Ewens)

→ Griffiths & Tavaré (1994) proposed a Monte Carlo approach using sequential importance sampling on past histories to solve the recursion.

(58)

New IS scheme

(59)

Inference of the likelihood by simulation

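• Griffiths & Tavaré 1994:

$$p(H_k = \eta) = \frac{1}{\beta}\left[ \theta \sum_i \sum_{j:\, n_j>0,\, j\neq i} (n_i+1)\, p_{ij}\, p(H_k' = \eta - e_j + e_i) + n \sum_{j:\, n_j>1} (n_j-1)\, p(H_k' = \eta - e_j) \right],$$

or equivalently

$$p(H_k) = w_{GT}(H_k) \left( \sum_{i,j:\, n_j>0,\, j \neq i} M_{ij}(H_k)\, p(H_k - e_j + e_i) + \sum_{j:\, n_j>1} C_j(H_k)\, p(H_k - e_j) \right)$$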

(60)

Inference of the likelihood by simulation

• Griffiths & Tavaré 1994: backward absorbing Markov chain based on forward transition probabilities

$$p(H_k) = w_{GT}(H_k) \left( \sum_{i,j:\, n_j>0,\, j \neq i} M_{ij}(H_k)\, p(H_k - e_j + e_i) + \sum_{j:\, n_j>1} C_j(H_k)\, p(H_k - e_j) \right)$$

→ Histories are built backward, event by event, using an absorbing Markov chain (absorbing state = MRCA) based on forward transition probabilities ("uniform sampling" based on M_ij(H_k) and C_j(H_k)) among all possible events.

w_GT(H_k) is the weight associated with the IS proposal.

(61)

Inference of the likelihood by simulation

• Expanding the recursion p(H_k) = ∑_{H_k′} p(H_k∣H_k′) p(H_k′) over all possible ancestral histories of a current sample leads to

$$p(H_0) = E\big[p(H_0 \mid H_1) \cdots p(H_{\tau-1} \mid H_\tau)\, p(H_\tau)\big]$$

Then

$$L(P; D) = p(H_0) = \int_H W_{GT}(H)\, f_{GT}(H)\, dH \approx \frac{1}{L} \sum_{h=1}^{L} W_{GT}(H^h) \approx \frac{1}{L} \sum_{h=1}^{L} \prod_{k=0}^{\tau} w_{GT}\big((H^h)_k\big).$$

This IS scheme f_GT(H) is not very efficient because it does not appropriately account for the fact that some backward transitions are more likely than others given the current state (example: SMM mutation).

(62)

New IS scheme

(63)

Towards a better IS scheme

(Stephens & Donnelly 2000, de Iorio & Griffiths 2004)

→ A better Importance Sampling (IS) scheme should be used: let Q(H_k′) be a given distribution, then

$$p(H_k) = \sum_{\{H_k'\}} \frac{p(H_k \mid H_k')}{Q(H_k')}\, Q(H_k')\, p(H_k')$$

and therefore

$$p(H_0) = E_Q\left[\frac{p(H_0 \mid H_1)}{Q(H_1)} \cdots \frac{p(H_{\tau-1} \mid H_\tau)}{Q(H_\tau)}\, p(H_\tau)\right]$$

where E_Q is the expectation over the distribution of full histories induced by Q. This means that Q may be viewed as a proposal distribution in a sequential IS algorithm with matching weights p(H_k∣H_k′)/Q(H_k′).

• The problem is then to find the proposal distribution that minimizes the variance of the likelihood estimates

$$\frac{1}{L} \sum_{h=1}^{L} \prod_{k=0}^{\tau} w_{GT}\big((H^h)_k\big).$$

(64)

Towards a better IS scheme

→ A better Importance Sampling (IS) scheme should be used: let Q(H_k′) be a given distribution, then

$$p(H_0) = E_Q\left[\frac{p(H_0 \mid H_1)}{Q(H_1)} \cdots \frac{p(H_{\tau-1} \mid H_\tau)}{Q(H_\tau)}\, p(H_\tau)\right]$$

where E_Q is the expectation over the distribution of full histories induced by Q. This means that Q may be viewed as a proposal distribution in a sequential IS algorithm with matching weights p(H_k∣H_k′)/Q(H_k′).

• The problem is then to find the proposal distribution that minimizes the variance of the likelihood estimates

$$\frac{1}{L} \sum_{h=1}^{L} \prod_{k=0}^{\tau} w_{GT}\big((H^h)_k\big).$$

(65)

Towards a better IS scheme

• The ideal proposal is the backward transition probability p(H_k′∣H_k), because the IS weights are then

$$\frac{p(H_k \mid H_k')}{Q(H_k')} = \frac{p(H_k \mid H_k')}{p(H_k' \mid H_k)} = \frac{p(H_k)}{p(H_k')}$$

and thus their product along any history telescopes to the sample likelihood:

$$\prod_{k=0}^{\tau-1} \frac{p(H_k)}{p(H_{k+1})}\; p(H_\tau) = p(H_0).$$

→ a single tree reconstruction allows exact likelihood computation (null variance).

• However, backward transition probabilities p(H_k′∣H_k) are generally unknown

Aim: find good approximations p̂(H_k′∣H_k) of p(H_k′∣H_k)

(66)

Towards a better IS scheme

• The likelihood at a given point is an integral over all possible histories H = {H_k;k=0, ..., τ}.

• Markov coalescent process → p(H_k) = ∑ p(H_k∣H_k′) p(H_k′) and p(H_0) = E[p(H_0∣H_1) ... p(H_{τ−1}∣H_τ) p(H_τ)].

• However, forward transition probabilities p(H_k∣H_k′) are not efficient proposals in a backward process

• Importance sampling techniques based on an approximation p̂(H_k′∣H_k) of p(H_k′∣H_k) are used to build more likely histories

$$p(H_0) = E_{\hat p}\left[\frac{p(H_0 \mid H_1)}{\hat p(H_1 \mid H_0)} \cdots \frac{p(H_{\tau-1} \mid H_\tau)}{\hat p(H_\tau \mid H_{\tau-1})}\, p(H_\tau)\right].$$

(67)

Linking optimal weights to addition of a gene to a sample

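Represent the sample probability p(n) as an integral over the joint distribution f(x) of allele frequencies in the population:

$$p(\mathbf{n}) = \int_x p(\mathbf{n} \mid x)\, f(x)\, dx = E_f\left[\binom{n}{\mathbf{n}} \prod_i X_i^{n_i}\right], \quad \text{where } \binom{n}{\mathbf{n}} = \frac{n!}{\prod_i n_i!} \text{ is the multinomial coefficient.}$$

Then the joint probability that we have a sample n and that an additional gene copy is of type j is

$$E_f\left[X_j \binom{n}{\mathbf{n}} \prod_i X_i^{n_i}\right] = \frac{n_j+1}{n+1}\, p(\mathbf{n} + e_j).$$

We write this joint probability as p(n) times π(j∣n), where π is thus the probability that an additional gene is of type j, given that we have already drawn the sample n from the population. Thus, if H_k and H_k′ differ by the addition of one gene of type j, the ratio that appears in the optimal IS weight can be written as

$$\frac{p(H_k' = \mathbf{n})}{p(H_k = \mathbf{n} + e_j)} = \frac{n_j+1}{n+1}\, \frac{1}{\pi(j \mid \mathbf{n})}$$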

(68)

Linking optimal weights to addition of a gene to a sample

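Represent the sample probability p(n) as an integral over the joint distribution f(x) of allele frequencies in the population:

$$p(\mathbf{n}) = \int_x p(\mathbf{n} \mid x)\, f(x)\, dx = E_f\left[\binom{n}{\mathbf{n}} \prod_i X_i^{n_i}\right]$$

Then the joint probability that we have a sample n and that an additional gene copy is of type j is

$$E_f\left[X_j \binom{n}{\mathbf{n}} \prod_i X_i^{n_i}\right] = \frac{n_j+1}{n+1}\, p(\mathbf{n} + e_j).$$

We write this joint probability as p(n) times π(j∣n), where π is thus the probability that an additional gene is of type j, given that we have already drawn the sample n from the population. Thus, if H_k and H_k′ differ by the addition of one gene of type j, the ratio that appears in the optimal IS weight can be written as

$$\frac{p(H_k' = \mathbf{n})}{p(H_k = \mathbf{n} + e_j)} = \frac{n_j+1}{n+1}\, \frac{1}{\pi(j \mid \mathbf{n})}$$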

(69)

Linking optimal weights to addition of a gene to a sample

Then the joint probability that we have a sample n and that an additional gene copy is of type j is

$$E_f\left[X_j \binom{n}{\mathbf{n}} \prod_i X_i^{n_i}\right] = \frac{n_j+1}{n+1}\, p(\mathbf{n} + e_j).$$

We write this joint probability as p(n) times π(j∣n), where π is thus the probability that an additional gene is of type j, given that we have already drawn the sample n from the population. Thus, if H_k and H_k′ differ by the addition of one gene of type j, the ratio that appears in the optimal IS weight can be written as

$$\frac{p(H_k' = \mathbf{n})}{p(H_k = \mathbf{n} + e_j)} = \frac{n_j+1}{n+1}\, \frac{1}{\pi(j \mid \mathbf{n})}$$

(70)

Towards a better IS scheme : the π’s

• Let π(⋅∣H_k) be the conditional distribution of the allelic type of an (n+1)-th gene, given H_k, the configuration (i.e. allelic types) of the first n genes of the sample.

• Then the optimal IS distribution (exact backward transition probabilities) is, for a single population:

$$p(H_k' \mid H_k) = \frac{1}{\beta}\, \theta\, n_j\, \frac{\pi(i \mid H_k - e_j)}{\pi(j \mid H_k - e_j)}\, P_{ij} \quad \text{for } H_k' = H_k - e_j + e_i$$

$$p(H_k' \mid H_k) = \frac{1}{\beta}\, \frac{n_j(n_j-1)}{\pi(j \mid H_k - e_j)} \quad \text{for } H_k' = H_k - e_j$$
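To make these backward transition probabilities concrete, the Python sketch below enumerates all ancestral states of a sample and checks that the probabilities sum to one. It relies on assumptions that are not in the slides: parent-independent mutation (P_ij = P_j), for which π(j∣n) = (n_j + θP_j)/(n + θ) is taken as the exact expression, and mutation events with i = j (leaving the sample unchanged) included in the enumeration. The function names are hypothetical.

```python
def pi_pim(j, counts, theta, freqs):
    """pi(j | counts) under parent-independent mutation (hypothetical helper)."""
    return (counts[j] + theta * freqs[j]) / (sum(counts) + theta)

def backward_probabilities(counts, theta, P, pi):
    """Return {event: probability} over all ancestral states H' of the sample H = counts."""
    n = sum(counts)
    beta = n * (n - 1 + theta)
    events = {}
    for j, n_j in enumerate(counts):
        if n_j == 0:
            continue
        reduced = list(counts)
        reduced[j] -= 1                            # H - e_j
        pi_j = pi(j, reduced)
        for i in range(len(counts)):               # H' = H - e_j + e_i (mutation i -> j)
            events[("mutation", i, j)] = theta * n_j * pi(i, reduced) / pi_j * P[i][j] / beta
        if n_j > 1:                                # H' = H - e_j (coalescence of two type-j genes)
            events[("coalescence", j)] = n_j * (n_j - 1) / pi_j / beta
    return events
```

A history is then built by drawing one event from this dictionary, updating the sample, and repeating until a single gene remains. With an approximate π̂ in place of π the same enumeration applies, but the values generally no longer sum exactly to one and need to be normalized before sampling.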

(71)

Towards a better IS scheme: the π̂’s

• Unfortunately, the π’s are generally unknown

→ Stephens & Donnelly (2000) proposed a good approximation π̂ for the π’s for a single WF population.

→ de Iorio & Griffiths (2004) proposed a general method for approximating the π’s under different mutational and demographic models

• Approximate backward transition probabilities based on the π̂’s are then used:

$$\hat p(H_k' \mid H_k) = \frac{1}{\beta}\, \theta\, n_j\, \frac{\hat\pi(i \mid H_k - e_j)}{\hat\pi(j \mid H_k - e_j)}\, P_{ij} \quad \text{for } H_k' = H_k - e_j + e_i$$

$$\hat p(H_k' \mid H_k) = \frac{1}{\beta}\, \frac{n_j(n_j-1)}{\hat\pi(j \mid H_k - e_j)} \quad \text{for } H_k' = H_k - e_j$$

(72)

New IS scheme

(73)

The backward equation for f(X_t ∣ X_0 = x)

For a diffusion process, the probability density f of the allele frequencies satisfies the Kolmogorov backward equation, which describes the changes of f over time in the form

$$\frac{d f(X_t \mid X_0 = x)}{dt} = \Phi(f(x)),$$

where Φ is a differential operator that here takes the form

$$\Phi = \frac{1}{2} \sum_{i \in E} \sum_{j \in E} x_i(\delta_{ij} - x_j)\, \frac{\partial^2}{\partial x_i \partial x_j} + \sum_{j \in E} \Big(\sum_{i \in E} x_i r_{ij}\Big) \frac{\partial}{\partial x_j} = \sum_{j \in E} \Phi_j\, \frac{\partial}{\partial x_j}$$

with

$$R = \{r_{ij}\} \equiv \frac{\theta}{2}(P - I)$$

where P = {p_ij} is the mutation matrix and I the identity matrix.

(74)

The backward equation for E[g(X_t) ∣ X_0 = x]

In the same way as

$$\frac{d f(X_t \mid X_0 = x)}{dt} = \Phi(f(x)),$$

the following "generator equation" (Karlin and Taylor, 1981, p. 215) holds for any function g(x) with bounded second derivatives:

$$\lim_{t \to 0} \frac{E[g(X_t) \mid X_0 = x] - g(x)}{t} = \Phi(g(x)).$$

We will apply this result with g the sample probability given population allele frequencies x.

(75)

π̂’s computation

To obtain a recursion on the sample probabilities p(n), with n = H_0, we write p(n) in the form E[g(x)]:

$$p(\mathbf{n}) = E\left[\binom{n}{\mathbf{n}} \prod_i X_i^{n_i}\right], \quad \text{where } \binom{n}{\mathbf{n}} = \frac{n!}{\prod_i n_i!}.$$

We therefore have

$$\frac{d\, p(\mathbf{n})}{dt} = \Phi[p(\mathbf{n})].$$

At stationary equilibrium, d p(n)/dt is zero. Expanding the expression for Φ[p(n)] then yields the recursion between the p(n).

(76)

Explicit recursions in terms of π

Applying the previous arguments then leads to a relation between probabilities of samples that differ by one coalescence or mutation event :

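$$N n \Big(\frac{n-1}{N} + \mu\Big)\, p(\mathbf{n}) = N \sum_j n\, \frac{n_j - 1}{N}\, p(\mathbf{n} - e_j) + N\mu \sum_j \sum_i P_{ij}\,(n_i + 1 - \delta_{ij})\, p(\mathbf{n} - e_j + e_i).$$

Expressing all p(.) in terms of the p(n − e_j)’s for distinct j’s:

$$N \sum_j \Big(\frac{n-1}{N} + \mu\Big)\, \pi(j \mid \mathbf{n} - e_j)\, n\, p(\mathbf{n} - e_j) = N \sum_j n\, \frac{n_j - 1}{N}\, p(\mathbf{n} - e_j) + N\mu \sum_j \sum_i P_{ij}\, n\, \pi(i \mid \mathbf{n} - e_j)\, p(\mathbf{n} - e_j)$$

... a huge system of linear equations, not easier to solve in this form.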

(77)

New IS scheme

(78)

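π̂’s computation

Note that Φ[p(n)] can be written in the form

$$\sum_{j \in E} \Phi_j\, \frac{\partial}{\partial x_j}\, [p(\mathbf{n})].$$

The approximation technique developed by de Iorio & Griffiths is to approximate the π (derived above from p(n), the solution of Φ[p(n)] = 0) by π̂ derived from the solutions of

$$E\left[\Phi_j\, \frac{\partial p(\mathbf{n})}{\partial x_j}\right] = E\left[\Phi_j\, \frac{\partial}{\partial x_j} \binom{n}{\mathbf{n}} \prod_i x_i^{n_i}\right] = 0, \quad \text{for each } j \in E.$$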

(79)

π̂’s computation

The approximation technique developed by de Iorio & Griffiths is to approximate the π (derived above from p(n), the solution of Φ[p(n)] = 0) by π̂ derived from the solutions of

$$E\left[\Phi_j\, \frac{\partial}{\partial x_j} \binom{n}{\mathbf{n}} \prod_i x_i^{n_i}\right] = 0, \quad \text{for each } j \in E,$$

which gives, for a panmictic population, for each j ∈ E:

$$n_j (n - 1 + \theta)\, \hat p(\mathbf{n}) = n (n_j - 1)\, \hat p(\mathbf{n} - e_j) + \sum_{i \in E} \theta P_{ij}\, (n_i + 1 - \delta_{ij})\, \hat p(\mathbf{n} - e_j + e_i)$$

(80)

π̂’s computation

Reminder: π(j∣n) can be expressed as a function of p(n) and p(n + e_j):

$$\pi(j \mid \mathbf{n})\, p(\mathbf{n}) = \frac{n_j + 1}{n + 1}\, p(\mathbf{n} + e_j).$$

If we assume that this relation also holds for the π̂ and p̂, which will generally not be the case, we have

$$\hat\pi(j \mid \mathbf{n})\, \hat p(\mathbf{n}) = \frac{n_j + 1}{n + 1}\, \hat p(\mathbf{n} + e_j)$$

(81)

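π̂’s computation

Approximate the p(n), solutions of Φ[p(n)] = 0, by the p̂(n), solutions of

$$E\left[\Phi_j\, \frac{\partial}{\partial x_j} \binom{n}{\mathbf{n}} \prod_i x_i^{n_i}\right] = 0, \quad \text{for each } j \in E,$$

which gives, for a panmictic population, for each j ∈ E:

$$n_j (n - 1 + \theta)\, \hat p(\mathbf{n}) = n (n_j - 1)\, \hat p(\mathbf{n} - e_j) + \sum_{i \in E} \theta P_{ij}\, (n_i + 1 - \delta_{ij})\, \hat p(\mathbf{n} - e_j + e_i)$$

Using π̂(j∣n) p̂(n) = ((n_j + 1)/(n + 1)) p̂(n + e_j) and replacing n by n + e_j (so that the sample size n becomes n + 1 and the factor n − 1 + θ becomes n + θ), we obtain for each j ∈ E:

$$(n + \theta)\, \hat\pi(j \mid \mathbf{n}) = n_j + \sum_{i \in E} \theta P_{ij}\, \hat\pi(i \mid \mathbf{n})$$

This is the linear system that allows the π̂(j∣n) to be computed for a Wright-Fisher model.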
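The linear system for the π̂(j∣n) is small (one equation per allelic type) and can be solved by simple fixed-point iteration. The Python sketch below is one possible way to do this, not the method used in any particular software. It assumes the normalizing factor n + θ, which is the value for which the π̂(j∣n) sum to one when the rows of P sum to one, and it assumes n ≥ 1 so that the iteration is a contraction with factor θ/(n + θ). The function name and tolerance are illustrative.

```python
def pi_hat(counts, theta, P, tol=1e-14, max_iter=10000):
    """Solve (n + theta) * pi[j] = counts[j] + theta * sum_i P[i][j] * pi[i] by iteration.

    counts: allele counts n_j of the sample (with sum(counts) >= 1),
    P: mutation matrix with rows summing to one, P[i][j] = probability of i -> j.
    """
    n = sum(counts)
    K = len(counts)
    pi = [1.0 / K] * K                              # any starting distribution works
    for _ in range(max_iter):
        new = [
            (counts[j] + theta * sum(P[i][j] * pi[i] for i in range(K))) / (n + theta)
            for j in range(K)
        ]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi
```

For parent-independent mutation (P_ij = P_j) the solution reduces to (n_j + θP_j)/(n + θ), which provides a convenient check; for a stepwise-type matrix the system has no such shortcut and the iteration (or a direct linear solve) is needed.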

(82)

New IS scheme with the π̂’s

To do: make two summary slides on the new IS scheme, reusing parts of slides 63, 64, 65, 66, 70 and 71

(83)

A much better IS scheme based on the π̂’s

• Drastic gain in efficiency with this new IS scheme (old IS: millions of trees)

→ exact backward transition probabilities for a WF model with parent-independent mutation (i.e. KAM)

→ only 30 histories necessary for a good estimate of the likelihood for more complex models (structured populations & KAM)

• but efficiency slightly decreases with non-parent-independent mutation models, e.g. the stepwise mutation model (200 histories for structured populations & SMM)

• and efficiency is still limited for time-inhomogeneous demographic models, e.g. one population with a past size change (cf. orang-utan example)

→ up to 20,000 histories necessary for strong disequilibrium scenarios (e.g. quick change in population size)

(84)

Implementations of IS : Genetree and Migraine

• Genetree (Bahlo & Griffiths 2000, old IS algorithm)
- 2 to 4 populations with migration (ISM)

• Migraine (Rousset & Leblois 2007-2014, new IS algorithms)
- One single stable population (KAM, SMM, GSM, ISM)
- One pop. with past size variation (KAM, SMM, GSM, ISM)
- 2 populations with migration (KAM, SMM, ISM)
- Isolation By Distance in 1D and 2D (KAM)

(85)

Implementation of IS in Migraine

1. C++ core IS computations

• Stratified random sampling of parameter points

• Estimation of the likelihood at each point using IS

2. R code for "post-treatment"

• Likelihood surface interpolation by Kriging

• Inference of MLEs and CIs

• Plots of 1D and 2D likelihood profiles

(86)

Introduction

(87)

Simulation tests

Can we trust the demographic / historical inferences made with those methods ?

• Bias, RMSE, coverage properties of confidence intervals

• robustness to realistic but “uninteresting” mis-specifications

→ to this aim, we tested by simulation :

- The performance of Migraine in inferring dispersal under IBD
- The performance of MsVar and Migraine in detecting and measuring past population size changes

A few interesting results...