
(1)

To do (RL):

harmonize the notation with FR's macros,

replace IS with SIS everywhere; insist on "sequential" = along the construction of the tree,

better separate the derivation of the recursion, the exact π's and the π̂'s,

highlight the keywords in blue,

add the term p(Hτ) = stationary distribution of the allelic states wherever necessary.

FR:

1) Present "backward" in an earlier lecture... (and other points on my presentation of backward)

2) the differential operator Φj is not made explicit => more complicated sentences

(2)

Master 2 Biostatistics module: population genetics models

Likelihood-based demographic inference using the coalescent

Raphaël Leblois & François Rousset

Centre de Biologie pour la Gestion des Populations (CBGP, Montpellier) & Institut des Sciences de l'Evolution (ISEM, Montpellier)

January 2017

(3)

Introduction

Likelihoods under the coalescent

Felsenstein et al.'s MCMC
- Metropolis-Hastings
- MsVar example
- Conclusions on MCMC

Griffiths et al.'s IS
- Griffiths et al.'s recursion
- Old IS scheme
- New IS scheme
- A general method based on diffusion
- Approximations of π

Simulation tests
- Precision
- Validation
- Robustness
- MCMC vs. IS

Conclusions

(4)

Introduction

Likelihoods under the coalescent

Felsenstein et al.'s MCMC

Griffiths et al.'s IS

Simulation tests

Conclusions

(5)

Typical biological question:

There is demographic evidence that orang-utan population sizes have collapsed

→ but what is the major cause of the decline, when did it start and how strong is it?

Can population genetics help?
- Can we infer the time of the event?
- Can we infer the strength of the population size decrease?

(6)

Methods based on coalescence simulations (Reminder...)

[Figure: genealogy of the population vs. genealogy of the sample (coalescent tree); forward in time vs. backward in time]

P(T_k = t) \approx \frac{k(k-1)}{2N}\, e^{-t\,\frac{k(k-1)}{2N}}

P(m \mid t) = \frac{(\mu t)^m e^{-\mu t}}{m!}
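As a reminder of how these two distributions are used in practice, here is a minimal simulation sketch; it is illustrative only (not code from the course), and all names are ours.

import numpy as np

rng = np.random.default_rng(1)

def coalescence_interval_lengths(n, N):
    """Durations of the intervals with k = n, n-1, ..., 2 lineages:
    T_k ~ Exponential(rate = k(k-1)/(2N)), going backward in time to the MRCA."""
    return {k: rng.exponential(2 * N / (k * (k - 1))) for k in range(n, 1, -1)}

def mutations_on_branch(t, mu):
    """Number of mutations on a branch of length t: m ~ Poisson(mu * t)."""
    return rng.poisson(mu * t)

intervals = coalescence_interval_lengths(n=10, N=1000)
t_mrca = sum(intervals.values())
print("TMRCA:", t_mrca, "mutations on a branch of that length:", mutations_on_branch(t_mrca, mu=1e-3))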

(7)

Two different ways to use the coalescent theory

Exploratory approaches & simulation tests
- The coalescent allows efficient simulation of genetic variability under various demo-genetic models (sample vs. population):
specify the model and parameter values → (coalescent process) → simulated data sets

Inferential approach
- The coalescent allows inference of population evolutionary parameters (genetic, demographic, reproductive, ...); some of these methods use all the information contained in the genetic data (likelihood-based methods):
a real data set → (coalescent process) → infer the model parameters
(8)

Two different ways to use the coalescent theory

Exploratory approaches & simulation tests
- The coalescent allows efficient simulation of genetic variability under various demo-genetic models (sample vs. population):
specify the model and parameter values → (coalescent process) → simulated data sets

Inferential approach
- The coalescent allows inference of population evolutionary parameters (genetic, demographic, reproductive, ...); some of these methods use all the information contained in the genetic data (likelihood-based methods):
a real data set → (coalescent process) → infer the model parameters

(9)

Likelihood-based inference under the coalescent

Inferential approaches are based on the modeling of population genetic processes. Each population genetic model is characterized by a set of demographic and genetic parameters P.

The aim is to infer those parameters from a polymorphism data set (i.e. a genetic sample).

The genetic sample is then considered as the realization ("output") of a stochastic process defined by the demo-genetic model.

(10)

Likelihood-based inference under the coalescent

First, compute or estimate the likelihood L(P;D), i.e. the probability P(D;P) of observing the data D for some parameter values P.

Second, infer the likelihood surface over all parameter values, find the set of parameter values that maximizes it, and compute CIs (maximum likelihood method),

or compute posterior distributions and compare them with the priors (Bayesian approach).

(11)

Introduction

Likelihoods under the coalescent

Felsenstein et al.'s MCMC

Griffiths et al.'s IS

Simulation tests

Conclusions

(12)

Likelihood computations under the coalescent

Problem: most of the time, the likelihood P(D;P) of a genetic sample cannot be computed because there is no explicit mathematical expression for it.

However, the probability P(D;P | G_k) of observing the data D given a specific genealogy G_k can be computed for some parameter values P.

Then we take the sum of the genealogy-specific likelihoods over the whole genealogical space, weighted by the probability of each genealogy given the parameters:

L(P;D) = \int_G P(D;P \mid G)\, P(G;P)\, dG

(13)

Likelihood computations under the coalescent

The likelihood can be written as the sum of P(D;P | G_k) over the genealogical space (all possible genealogies):

L(P;D) = \int_G P(D;P \mid G)\, P(G;P)\, dG

Genealogies are missing data: they are important for the computation of the likelihood, but there is no interest in estimating them.

→ very different from phylogenetic approaches

[Figure: Mutation / Demography (Coalescent)]

(14)

Likelihood computations under the coalescent

The likelihood can be written as the sum of P(D;P | G_k) over the genealogical space (all possible genealogies):

L(P;D) = \int_G P(D;P \mid G)\, P(G;P)\, dG

...Usually impossible to sum over all possible genealogies...

→ Monte Carlo simulations are used: a large number K of genealogies is simulated according to P(G;P), and the mean over those simulations is taken as an estimate of the expectation of P(D;P | G):

L(P;D) = E_{P(G;P)}[P(D;P \mid G)] \approx \frac{1}{K} \sum_{k=1}^{K} P(D;P \mid G_k)
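A toy sketch of this naive Monte Carlo average (not the implementation of any of the programs cited later); simulate_genealogy and prob_data_given_genealogy are hypothetical user-supplied functions drawing G ~ P(G;P) and returning P(D;P|G).

def monte_carlo_likelihood(data, params, K, simulate_genealogy, prob_data_given_genealogy):
    """Naive estimator: L(P;D) ~ (1/K) * sum_k P(D;P | G_k) with G_k ~ P(G;P)."""
    total = 0.0
    for _ in range(K):
        G = simulate_genealogy(params)                        # draw a genealogy from the coalescent prior
        total += prob_data_given_genealogy(data, G, params)   # probability of the data on that genealogy
    return total / K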

(15)

Likelihood computations under the coalescent

The likelihood can be written as the sum of P(D;P | G_k) over the genealogical space (all possible genealogies):

L(P;D) = \int_G P(D;P \mid G)\, P(G;P)\, dG

...Usually impossible to sum over all possible genealogies...

→ Monte Carlo simulations are used:

L(P;D) = E_{P(G;P)}[P(D;P \mid G)] \approx \frac{1}{K} \sum_{k=1}^{K} P(D;P \mid G_k)

Many, many genealogies are necessary for a good estimation of the likelihood...

(16)

Likelihood computations under the coalescent

Monte Carlo simulations are used:

L(P;D) = E_{P(G;P)}[P(D;P \mid G)] \approx \frac{1}{K} \sum_{k=1}^{K} P(D;P \mid G_k)

Monte Carlo simulations are often not very efficient because far too many of the simulated genealogies give extremely low probabilities of observing the data; more efficient algorithms are therefore used to explore the genealogical space and focus on genealogies well supported by the data.

(17)

Likelihood computations under the coalescent

Two main approaches have been developed, using more efficient algorithms that allow a better exploration of the genealogies, proportionally to their probability of explaining the data P(D;P | G):

MCMC Markov chain Monte Carlo on the genealogical and parameter spaces, based on Felsenstein's pruning algorithm (1973, 1981)

Felsenstein, J. (1981). "Evolutionary trees from DNA sequences: a maximum likelihood approach". J. of Mol. Evol. 17(6): 368-376.

IS Importance sampling on genealogies, based on the work of Griffiths & Tavaré 1994.

Griffiths, R.C. and S. Tavaré (1994). Simulating probability distributions in the coalescent. Theor. Pop. Biol., 46: 131-159.

(18)

Likelihood computations under the coalescent

More efficient algorithms that allow a better exploration of the genealogies, proportionally to their probability of explaining the data P(D;P | G):

MCMC Felsenstein's pruning algorithm.
- Easier to implement, can easily consider various models
- Implemented in many software packages (LAMARC, Batwing, MsVar, MIGRATE, IM)

IS Griffiths & Tavaré's coalescent recursion
- Extension to different models may be difficult
- Implemented in fewer software packages (Genetree, Migraine)

(19)

Likelihood computations under the coalescent

More efficient algorithms that allow a better exploration of the genealogies, proportionally to their probability of explaining the data P(D;P | G):

MCMC Felsenstein's pruning algorithm (quick overview)
- Easier to implement, can consider various models
- Implemented in many software packages (LAMARC, Batwing, MsVar, MIGRATE, IM)

IS Griffiths & Tavaré's coalescent recursion
- Extension to different models may be difficult
- Implemented in fewer software packages (Genetree, Migraine)

(20)

Introduction

Likelihoods under the coalescent

Felsenstein et al.'s MCMC

Griffiths et al.'s IS

Simulation tests

Conclusions

(21)

The approach of Felsenstein et al.

Based on (1) the availability of approximate exponential distributions for the time intervals between events (coalescence, migration and recombination) and (2) the separation of demographic and mutational processes:

1. The probability of a genealogy given the parameters of the demographic model, P(G_k;P_demo), can be computed from the distributions of times between events.

2. The probability of the data given a genealogy and the mutational parameters, P(D;P_mut | G_k), can be computed from the mutation model parameters, the mutation rate, the tree topology and the branch lengths.

(22)

The approach of Felsenstein et al.

Based on (1) the availability of approximate exponential distributions for the time intervals between events (coalescence, migration and recombination) and (2) the separation of demographic and mutational processes:

1. P(G_k;P_demo) computed from the distributions of times between events.

2. P(D;P_mut | G_k) computed from the mutation parameters, the tree topology and the branch lengths.

From this, an efficient algorithm to explore the genealogical and parameter spaces should allow the inference of the likelihood over the two spaces.

→ MCMC

(23)

Felsenstein et al.'s MCMC
- Metropolis-Hastings
- MsVar example
- Conclusions on MCMC

(24)

MCMC with Metropolis-Hastings sampler

Full conditional distributions cannot be computed, so classical MCMC samplers (e.g. Gibbs) cannot be used.

→ Markov chain Monte Carlo (MCMC) simulations using the Metropolis-Hastings (MH) algorithm

- to explore the genealogy space (G)
- and the parameter space (P = P_demo + P_mut)

All algorithms based on the 'Felsenstein et al.' approach use similar MH/MCMC algorithms, with slight differences in the MCMC update steps.

(25)

Metropolis-Hastings sampling for the coalescent

For the Metropolis-Hastings algorithm, we need to compute the ratio of the probability of the proposed update to that of the current state:

1. Computation of P(G_k;P_demo):

P(G_k;P_demo) = \prod_{i=0}^{MRCA-1} \gamma(t_{i+1})\, e^{-\int_{t_i}^{t_{i+1}} \gamma(t)\,dt}

- Example for a stable WF population (coalescence only, time homogeneous):

P(G_k;P_demo) = \prod_{i=0}^{MRCA-1} \frac{k_{i+1}(k_{i+1}-1)}{2N}\, e^{-(t_{i+1}-t_i)\,\frac{k_{i+1}(k_{i+1}-1)}{2N}}
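A minimal sketch of the constant-size case above, assuming the genealogy is summarized by its ordered coalescence times for a sample of n gene copies; this is an illustration, not MsVar's internal code.

import math

def log_prob_genealogy_constant_N(coal_times, n, N):
    """log P(G_k; P_demo) for a stable WF population: over each interval with k lineages,
    density [k(k-1)/(2N)] * exp(-(t_{i+1} - t_i) * k(k-1)/(2N)), with k = n, n-1, ..., 2."""
    logp, t_prev, k = 0.0, 0.0, n
    for t in coal_times:            # coal_times[i] = time of the (i+1)-th coalescence, increasing
        rate = k * (k - 1) / (2 * N)
        logp += math.log(rate) - rate * (t - t_prev)
        t_prev, k = t, k - 1
    return logp

print(log_prob_genealogy_constant_N([50.0, 300.0, 1200.0], n=4, N=1000))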

(26)

Metropolis-Hastings sampling for the coalescent

1. Computation of P(G_k;P_demo):

P(G_k;P_demo) = \prod_{i=0}^{MRCA-1} \gamma(t_{i+1})\, e^{-\int_{t_i}^{t_{i+1}} \gamma(t)\,dt}

2. Then compute the probability P(D;P_mut | G_k) of the data D given the genealogy G_k, by going from the MRCA to the leaves and considering the probability of occurrence of all mutations on each branch of length t_b, and their effects (i.e. transitions among genetic states x → y):

(27)

Metropolis-Hastings sampling for the coalescent

Mutation matrix: transition probabilities between genetic states (x, y); Poisson probability for the m_b mutations.

2. Then compute the probability P(D;P_mut | G_k):

P(D;P_mut \mid G_k) = \prod_{b=1}^{\text{nb branches}} \overbrace{P(y \mid x, m_b)}^{\text{effect of mutations}} \cdot \overbrace{P(m_b \mid t_b)}^{\text{number of mutations}} = \prod_{b=1}^{2(n-1)} \left((Mat_{mut})^{m_b}\right)_{x,y}\, \frac{(\mu t_b)^{m_b} e^{-\mu t_b}}{m_b!}
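A small numerical sketch of one factor of this product, for a bounded stepwise-like mutation matrix (a true SMM is unbounded; the reflecting 5-allele matrix below is only an illustration); all names are ours.

import numpy as np
from math import exp, factorial

def branch_factor(mut_matrix, m_b, x, y, mu, t_b):
    """((Mat_mut)^{m_b})_{x,y} * (mu*t_b)^{m_b} e^{-mu*t_b} / m_b! :
    m_b mutations occur on a branch of length t_b and turn state x into state y."""
    effect = np.linalg.matrix_power(mut_matrix, m_b)[x, y]          # effect of the mutations
    poisson = (mu * t_b) ** m_b * exp(-mu * t_b) / factorial(m_b)   # number of mutations
    return effect * poisson

K = 5                                    # 5 allelic states, +/-1 steps, reflecting at the edges
M = np.zeros((K, K))
for a in range(K):
    for step in (-1, 1):
        M[a, min(max(a + step, 0), K - 1)] += 0.5
print(branch_factor(M, m_b=2, x=1, y=3, mu=1e-3, t_b=500.0))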

(28)

Metropolis-Hastings sampling for the coalescent

1. Computation of P(G_k;P_demo):

P(G_k;P_demo) = \prod_{i=0}^{MRCA-1} \gamma(t_{i+1})\, e^{-\int_{t_i}^{t_{i+1}} \gamma(t)\,dt}

2. Then compute P(D;P_mut | G_k):

P(D;P_mut \mid G_k) = \prod_{b=1}^{2(n-1)} \left((Mat_{mut})^{m_b}\right)_{x,y}\, \frac{(\mu t_b)^{m_b} e^{-\mu t_b}}{m_b!}

3. These probabilities are plugged into the MH formula for the acceptance probabilities of candidate changes for the next state of the Markov chain.

Reminder: P(D;P | G_k) = P(D;P_mut | G_k) P(G_k;P_demo)

(29)

Metropolis-Hastings sampling for the coalescent

For each update, the new state (P' or G') is accepted or rejected according to the Metropolis-Hastings ratio.

The MH ratio is chosen so that the chain converges towards the correct stationary distribution P(D;P), e.g.

r_{MH} = \frac{P(D;P' \mid G')\, Prior(P')}{P(D;P \mid G)\, Prior(P)} \cdot \frac{P(P' \to P)}{P(P \to P')}
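A schematic acceptance step matching this ratio (on the log scale, as commonly done to avoid underflow); the function name and arguments are illustrative, not those of any of the programs cited here.

import math, random

def mh_accept(log_lik_prop, log_prior_prop, log_lik_cur, log_prior_cur,
              log_q_reverse, log_q_forward):
    """Accept the proposed state with probability min(1, r_MH), where
    r_MH = [P(D;P'|G') Prior(P') / P(D;P|G) Prior(P)] * [P(P'->P) / P(P->P')]."""
    log_r = (log_lik_prop + log_prior_prop) - (log_lik_cur + log_prior_cur) \
            + log_q_reverse - log_q_forward
    return math.log(random.random()) < min(0.0, log_r)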

(30)

Felsenstein et al.'s MCMC
- Metropolis-Hastings
- MsVar example
- Conclusions on MCMC

(31)

Coalescent-based MCMC example: MsVar

One example of a coalescent-based MCMC algorithm: MsVar

Beaumont, M. 1999. Detecting Population Expansion and Decline Using Microsatellites. Genetics.

Biological context:
- past changes in population sizes (cf. Orang-Utans)
- details of the demographic and mutation models
- a few results on the Orang-Utan data set

(32)

Coalescent-based MCMC example : MsVar

Demographic model: a single isolated panmictic (WF) population with an exponential past change in population size.

(33)

Coalescent-based MCMC example : MsVar

Demographic model: a single isolated panmictic (WF) population with an exponential past change in population size.

Population contraction or expansion

(34)

Coalescent-based MCMC example : MsVar

Demographic model: a single isolated panmictic (WF) population with an exponential past change in population size.

3 demographic parameters: N, T, N_anc, plus 1 mutation parameter: µ

→ 3 scaled parameters (diffusion approx.): θ, D, θ_anc

(35)

Coalescent-based MCMC example : MsVar

Mutation model: Stepwise Mutation Model (SMM)

(36)


Coalescent-based MCMC example: MsVar

Aim: infer these parameters (P or P_scaled) from a single present-day genetic sample using coalescent-based MCMC algorithms

P = {N, T, N_anc, µ}    P_scaled = {θ, D, θ_anc}

(37)

MH/MCMC of MsVar

1. Initialization step: build a genealogy that is compatible with the data

Starting from the sample, choose a sequence of events depending on the starting values of the parameters; the events are also chosen to be compatible with the data.

2. MCMC steps: explore the parameter and the genealogical spaces

Update the population size parameters (θ_act, D, θ_anc), or update the genealogy (sequence and times of the coalescence and mutation events (T_i)).

Both updates are made using the Metropolis-Hastings algorithm.

(38)

MCMC updates in MsVar

[Figure: MCMC update scheme; T_i = times of coalescence and mutation events, r = θ_act/θ_anc = population size ratio, t_f = D = time of the population size change]

M. Beaumont: "This scheme was devised by trial and error to obtain good rates of convergence."

(39)

Analyses of MsVar results

First check that the chains mixed and converged properly:

→ Visual checks (very useful): traces of the likelihood / of the parameters, autocorrelations

→ Compute convergence criteria among chains (Gelman-Rubin, ...), not always useful...

→ Run different chains and check the concordance between their results

Problem: convergence is often pretty bad with such coalescent-based MCMC algorithms... but simulation tests show that the posterior distributions are generally correct (at least the mode, as a point estimate), despite the lack of clear convergence diagnostics...
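Two of these checks are easy to sketch on the recorded traces (a simple lag autocorrelation and a simplified Gelman-Rubin R-hat); this is illustrative code on generic arrays, not part of MsVar.

import numpy as np

def autocorrelation(trace, lag):
    """Lag-k autocorrelation of one trace (likelihood or parameter values)."""
    x = np.asarray(trace, dtype=float) - np.mean(trace)
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

def gelman_rubin(chains):
    """Simplified Gelman-Rubin R-hat over several chains of equal length;
    values close to 1 are necessary (but not sufficient) for convergence."""
    c = np.asarray(chains, dtype=float)        # shape (number of chains, iterations)
    m, n = c.shape
    W = c.var(axis=1, ddof=1).mean()           # mean within-chain variance
    B = n * c.mean(axis=1).var(ddof=1)         # between-chain variance
    return float(np.sqrt(((n - 1) / n * W + B / n) / W))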

(40)

Analyses of MsVar results

Bayesian method → compare posteriors (solid lines) and priors (dashed lines)

... and test different priors

(41)

Analyses of MsVar results

Bayesian method → compute a Bayes factor to check for a contraction or expansion signal:

BF = \frac{\text{Posterior prob. model 1}}{\text{Posterior prob. model 2}} \cdot \frac{\text{Prior prob. model 2}}{\text{Prior prob. model 1}}

With equal priors for models 1 and 2, the Bayes factor for a contraction is thus

BF = \frac{\text{Posterior } P(N_{anc}/N_{act} > 1)}{\text{Posterior } P(N_{anc}/N_{act} < 1)} = \frac{\#\text{ MCMC steps where } N_{anc}/N_{act} > 1}{\#\text{ MCMC steps where } N_{anc}/N_{act} < 1}
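With equal priors, this reduces to a ratio of counts over the MCMC output, e.g. as sketched below (array names are ours).

import numpy as np

def bayes_factor_contraction(N_anc_samples, N_act_samples):
    """BF = (# MCMC steps with N_anc/N_act > 1) / (# MCMC steps with N_anc/N_act < 1)."""
    ratio = np.asarray(N_anc_samples, dtype=float) / np.asarray(N_act_samples, dtype=float)
    return np.count_nonzero(ratio > 1) / np.count_nonzero(ratio < 1)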

(42)

An application of MsVar :

Orang-Utans and the deforestation of Borneo

Does the genome of Orang-utans carry the signature of population bottlenecks ?

(Goossens et al. 2006 PLoS Biology)

(43)

An application of MsVar :

Orang-Utans and the deforestation of Borneo

(Delgado & Van Schaik, 2001 Evol. Anthropology)

Population sizes have collapsed : what is the cause ?

Can population genetics help ?

(44)

An application of MsVar :

Orang-Utans and the deforestation of Borneo

The data

(45)

An application of MsVar :

Orang-Utans and the deforestation of Borneo

MsVar results

→ MsVar efficiently detects a past decrease in population size

(46)

An application of MsVar :

Orang-Utans and the deforestation of Borneo

FE: beginning of massive forest exploitation; F: first farmers; HG: first hunter-gatherers

MsVar results

→ MsVar efficiently detects a past decrease in population size...

... and allows dating of the beginning of the decrease:

massive forest exploitation seems to be the most likely cause

(47)

Felsenstein et al.'s MCMC
- Metropolis-Hastings
- MsVar example
- Conclusions on MCMC

(48)

Conclusions about MsVar / MCMC approaches

Coalescent theory provides a powerful framework for statistical inference.

It allows one to infer past history from a single present-day sample! (this was impossible with moment-based methods)

Gene genealogies are missing data (but important ones...).

MCMCs with coalescent simulations are "difficult" (to run).

But what is the robustness to the model assumptions:

- mutational processes (e.g. large mutation steps → long branches)

- population structure (e.g. immigrants → long branches)

(49)

Introduction

Likelihoods under the coalescent

Felsenstein et al.'s MCMC

Griffiths et al.'s IS

Simulation tests

Conclusions

(50)

Likelihood computations under the coalescent

More efficient algorithms that allow a better exploration of the genealogies (i.e. proportionally to P(D;P | G)):

MCMC Felsenstein's pruning algorithm.
- Easier to implement, can consider various models
- Implemented in many software packages (LAMARC, Batwing, MsVar, MIGRATE, IM)

IS Griffiths & Tavaré's coalescent recursion (cf. Ewens' recursion)
- Extension to different models may be difficult
- Implemented in fewer software packages (Genetree, Migraine)

(51)

The approach of Griffiths et al.

The coalescent-based likelihood at a given point of the parameter space is an integral over all possible histories (genealogies with mutations) leading to the present genetic sample.

A Monte Carlo scheme is used to compute this integral.

Histories are built backward in time, event by event, starting from the present sample.

But the computation of the exact backward transition probabilities is often too difficult

→ an IS scheme is used to compute the likelihoods by simulation.

(52)

Griffiths et al.'s IS
- Griffiths et al.'s recursion
- Old IS scheme
- New IS scheme
- A general method based on diffusion
- Approximations of π

(53)

Recursions for sampling distributions

Ewens 1972: a Wright-Fisher, infinite-allele model

General recursion at stationarity [with e_j = (0, ..., 0, 1_{jth}, 0, ..., 0)]:

P_n(a) = \frac{\theta}{n-1+\theta}\, P_{n-1}(a - e_1) + \frac{n-1}{n-1+\theta} \sum_{a_{j+1}>0} \frac{j(a_j+1)}{n-1}\, P_{n-1}(a + e_j - e_{j+1})

where, given that a coalescence occurs and that the descendant sample has (..., a_j, a_{j+1}, ...), the ancestral one has (..., a_j + 1, a_{j+1} - 1, ...), and the probability that one of the a_j + 1 alleles with j gene copies is chosen to duplicate is j(a_j+1)/(n-1).

Reminder: the sample counts (the vector a) are ordered by allele counts (allele frequency spectrum): a_j is the number of alleles with j copies in the sample.

Griffiths and Tavaré: recursion for mutation models defined by a matrix (p_ij) of mutation rates from i to j.

(54)

The recursion of Griffiths et al.

The coalescent-based likelihood at a given point of the parameter space is an integral over all possible histories (genealogies with mutations)

H = {H_k; k = 0, ..., τ}

corresponding to all the coalescence and mutation events that occurred from H_0, the current sample state, to H_τ, the allelic state of the most recent common ancestor (MRCA) of the sample.

(55)

The recursion of Griffiths et al.

Then, for any given state H_k of the history (cf. Ewens):

p(H_k) = \sum_{\{H_{k+1}\}} p(H_k \mid H_{k+1})\, p(H_{k+1})

where H_{k+1} is the ancestral sample state (i.e. the state before the last event),

and the p(H_k | H_{k+1}) are the forward transition probabilities (i.e. from the ancestral to the current state).

(56)

The recursion of Griffiths et al.

Griffiths & Tavaré 1994: example for a single population

p(H_k = \eta) = \frac{1}{\frac{n(n-1)}{2N} + n\mu} \left[ n\mu \sum_i \sum_{j:\, n_j>0,\, j\neq i} \frac{n_i+1}{n}\, p_{ij}\, p(H_{k+1} = \eta - e_j + e_i) + \frac{n(n-1)}{2N} \sum_{j:\, n_j>1} \frac{n_j-1}{n-1}\, p(H_{k+1} = \eta - e_j) \right]

- Setting θ = 4Nµ and β = n(n-1+θ), we have

p(H_k = \eta) = \frac{1}{\beta} \left[ \theta \sum_i \sum_{j:\, n_j>0,\, j\neq i} (n_i+1)\, p_{ij}\, p(H_{k+1} = \eta - e_j + e_i) + n \sum_{j:\, n_j>1} (n_j-1)\, p(H_{k+1} = \eta - e_j) \right]

(57)

The recursion of Griffiths et al.

Griffiths & Tavaré 1994: example for a single population

p(H_k = \eta) = \frac{1}{\beta} \left[ \theta \sum_i \sum_{j:\, n_j>0,\, j\neq i} (n_i+1)\, p_{ij}\, p(H_{k+1} = \eta - e_j + e_i) + n \sum_{j:\, n_j>1} (n_j-1)\, p(H_{k+1} = \eta - e_j) \right]

Such recursions are too difficult to solve, except for very simple models (WF + IAM, cf. Ewens).

→ Griffiths & Tavaré (1994) proposed to use a Monte Carlo approach based on sequential importance sampling on past histories to solve the recursion.

(58)

Griffiths et al.'s IS
- Griffiths et al.'s recursion
- Old IS scheme
- New IS scheme
- A general method based on diffusion
- Approximations of π

(59)

Inference of the likelihood by simulation

Griffiths & Tavaré 1994:

p(H_k = \eta) = \frac{1}{\beta} \left[ \theta \sum_i \sum_{j:\, n_j>0,\, j\neq i} (n_i+1)\, p_{ij}\, p(H_{k+1} = \eta - e_j + e_i) + n \sum_{j:\, n_j>1} (n_j-1)\, p(H_{k+1} = \eta - e_j) \right]

or equivalently

p(H_k) = w_{GT}(H_k) \left( \sum_{i,j:\, n_j>0,\, j\neq i} M_{ij}(H_k)\, p(H_k - e_j + e_i) + \sum_{j:\, n_j>1} C_j(H_k)\, p(H_k - e_j) \right)

(60)

Inference of the likelihood by simulation

Griffiths & Tavaré 1994: a backward absorbing Markov chain based on the forward transition probabilities

p(H_k) = w_{GT}(H_k) \left( \sum_{i,j:\, n_j>0,\, j\neq i} M_{ij}(H_k)\, p(H_k - e_j + e_i) + \sum_{j:\, n_j>1} C_j(H_k)\, p(H_k - e_j) \right)

→ Histories are built backward, event by event, using an absorbing Markov chain (absorbing state = MRCA) based on the forward transition probabilities ("uniform sampling" based on the M_ij(H_k) and C_j(H_k)) among all possible events.

w_{GT}(H_k) is the weight associated with the IS proposal.

(61)

Inference of the likelihood by simulation

Expanding the recursion p(H_k) = \sum_{\{H_{k+1}\}} p(H_k \mid H_{k+1})\, p(H_{k+1}) over all possible ancestral histories of the current sample leads to

p(H_0) = E[p(H_0 \mid H_1) \ldots p(H_{\tau-1} \mid H_\tau)\, p(H_\tau)]

Then

L(P;D) = p(H_0) = \int_H W_{GT}(H)\, f_{GT}(H)\, dH \approx \frac{1}{L} \sum_{h=1}^{L} W_{GT}(H_h) = \frac{1}{L} \sum_{h=1}^{L} \prod_{k=0}^{\tau} w_{GT}((H_h)_k)

This IS scheme f_{GT}(H) is not very efficient, because it does not appropriately take into account that some backward transitions are much more likely than others given the current state (example: SMM mutation).
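A generic sketch of such a sequential IS estimator (the same skeleton applies to the Griffiths-Tavaré proposal and to the π̂-based proposal introduced below); propose_previous_event, prob_mrca_state and is_mrca are hypothetical user-supplied functions.

def sis_likelihood(sample_state, params, L, propose_previous_event, prob_mrca_state, is_mrca):
    """Estimate p(H_0) by (1/L) * sum_h [prod_k w((H_h)_k)] * p(H_tau):
    simulate L histories backward, event by event, multiplying the IS weights."""
    total = 0.0
    for _ in range(L):
        state, weight = sample_state, 1.0
        while not is_mrca(state):
            state, w = propose_previous_event(state, params)  # one backward coalescence or mutation event
            weight *= w                                       # accumulate the IS weight of that event
        total += weight * prob_mrca_state(state, params)      # p(H_tau): probability of the MRCA allelic state
    return total / L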

(62)

Griffiths et al.'s IS
- Griffiths et al.'s recursion
- Old IS scheme
- New IS scheme
- A general method based on diffusion
- Approximations of π

(63)


Towards a better IS scheme

(Stephens & Donnelly 2000, de Iorio & Griffiths 2004)

→ A better importance sampling (IS) scheme should be used. Let Q be a given proposal distribution; then

p(H_k) = \sum_{\{H_{k+1}\}} \frac{p(H_k \mid H_{k+1})}{Q(H_{k+1} \mid H_k)}\, Q(H_{k+1} \mid H_k)\, p(H_{k+1})

p(H_0) = E_Q\left[ \frac{p(H_0 \mid H_1)}{Q(H_1 \mid H_0)} \cdots \frac{p(H_{\tau-1} \mid H_\tau)}{Q(H_\tau \mid H_{\tau-1})}\, p(H_\tau) \right]

where E_Q is the expectation over the distribution of full histories induced by Q. This means that Q may be viewed as a proposal distribution in a sequential IS algorithm with matching weights p(H_k | H_{k+1}) / Q(H_{k+1} | H_k).

The problem is then to find the proposal distribution that minimizes the variance of the likelihood estimate

\frac{1}{L} \sum_{h=1}^{L} \prod_{k=0}^{\tau} w((H_h)_k).

(64)

Towards a better IS scheme

(Stephens & Donnelly 2000, de Iorio & Griffiths 2004)

→ A better importance sampling (IS) scheme should be used. Let Q be a given proposal distribution; then

p(H_0) = E_Q\left[ \frac{p(H_0 \mid H_1)}{Q(H_1 \mid H_0)} \cdots \frac{p(H_{\tau-1} \mid H_\tau)}{Q(H_\tau \mid H_{\tau-1})}\, p(H_\tau) \right]

where E_Q is the expectation over the distribution of full histories induced by Q. This means that Q may be viewed as a proposal distribution in a sequential IS algorithm with matching weights p(H_k | H_{k+1}) / Q(H_{k+1} | H_k).

The problem is then to find the proposal distribution that minimizes the variance of the likelihood estimate

\frac{1}{L} \sum_{h=1}^{L} \prod_{k=0}^{\tau} w((H_h)_k).

(65)

Towards a better IS scheme

(Stephens & Donnelly 2000, de Iorio & Griffiths 2004)

The ideal proposal is the true backward transition probability p(H_{k+1} | H_k), because the IS weights are then

\frac{p(H_k \mid H_{k+1})}{Q(H_{k+1} \mid H_k)} = \frac{p(H_k \mid H_{k+1})}{p(H_{k+1} \mid H_k)} = \frac{p(H_k)}{p(H_{k+1})}

and their product telescopes to the sample likelihood: \prod_{k=0}^{\tau-1} \frac{p(H_k)}{p(H_{k+1})}\, p(H_\tau) = p(H_0).

→ a single history reconstruction then gives an exact likelihood computation (zero variance).

However, the backward transition probabilities p(H_{k+1} | H_k) are generally unknown.

Aim: find good approximations p̂(H_{k+1} | H_k) of p(H_{k+1} | H_k).

(66)

Towards a better IS scheme

(Stephens & Donnelly 2000, de Iorio & Griffiths 2004)

The likelihood at a given point is an integral over all possible histories H = {H_k; k = 0, ..., τ}.

Markov coalescent process → p(H_k) = \sum p(H_k \mid H_{k+1})\, p(H_{k+1}) and p(H_0) = E[p(H_0 \mid H_1) \ldots p(H_{\tau-1} \mid H_\tau)\, p(H_\tau)].

However, the forward transition probabilities p(H_k | H_{k+1}) are not efficient proposals in a backward process.

Importance sampling techniques based on an approximation p̂(H_{k+1} | H_k) of the backward probabilities p(H_{k+1} | H_k) are used to build more likely histories:

p(H_0) = E_{\hat{p}}\left[ \frac{p(H_0 \mid H_1)}{\hat{p}(H_1 \mid H_0)} \cdots \frac{p(H_{\tau-1} \mid H_\tau)}{\hat{p}(H_\tau \mid H_{\tau-1})}\, p(H_\tau) \right].

(67)


Linking optimal weights to the addition of a gene to a sample

Represent the sample probability p(n) as an integral over the joint distribution f(x) of the allele frequencies in the population:

p(n) = \int_x p(n \mid x)\, f(x)\, dx = E_f\left[\binom{n}{\mathbf{n}} \prod_i X_i^{n_i}\right]

where \binom{n}{\mathbf{n}} = \frac{n!}{\prod_i n_i!} is the multinomial coefficient.

Then the joint probability that we have a sample n and that an additional gene copy is of type j is

E_f\left[X_j \binom{n}{\mathbf{n}} \prod_i X_i^{n_i}\right] = \frac{n_j+1}{n+1}\, p(n + e_j).

We write this joint probability as p(n) times π(j | n), where π is thus the probability that an additional gene is of type j, given that we have already drawn the sample n from the population. Thus, if H_{k+1} and H_k differ by the addition of one gene of type j (H_{k+1} = n, H_k = n + e_j), we can write the optimal IS weight as

\frac{p(H_{k+1} = n)}{p(H_k = n + e_j)} = \frac{n_j+1}{n+1}\, \frac{1}{\pi(j \mid n)}

(68)


Linking optimal weights to the addition of a gene to a sample

Represent the sample probability p(n) as an integral over the joint distribution f(x) of the allele frequencies in the population:

p(n) = \int_x p(n \mid x)\, f(x)\, dx = E_f\left[\binom{n}{\mathbf{n}} \prod_i X_i^{n_i}\right]

Then the joint probability that we have a sample n and that an additional gene copy is of type j is

E_f\left[X_j \binom{n}{\mathbf{n}} \prod_i X_i^{n_i}\right] = \frac{n_j+1}{n+1}\, p(n + e_j).

We write this joint probability as p(n) π(j | n), where π(j | n) is thus the probability that an additional gene is of type j, given that we have already drawn the sample n from the population. Thus, if H_{k+1} and H_k differ by the addition of one gene of type j (H_{k+1} = n, H_k = n + e_j), we can write the optimal IS weight as

\frac{p(H_{k+1} = n)}{p(H_k = n + e_j)} = \frac{n_j+1}{n+1}\, \frac{1}{\pi(j \mid n)}

(69)

Linking optimal weights to the addition of a gene to a sample

Then the joint probability that we have a sample n and that an additional gene copy is of type j is

E_f\left[X_j \binom{n}{\mathbf{n}} \prod_i X_i^{n_i}\right] = \frac{n_j+1}{n+1}\, p(n + e_j).

We write this joint probability as p(n) times π(j | n), where π is thus the probability that an additional gene is of type j, given that we have already drawn the sample n from the population. Thus, if H_{k+1} and H_k differ by the addition of one gene of type j (H_{k+1} = n, H_k = n + e_j), we can write the optimal IS weight as

\frac{p(H_{k+1} = n)}{p(H_k = n + e_j)} = \frac{n_j+1}{n+1}\, \frac{1}{\pi(j \mid n)}

(70)

Towards a better IS scheme: the π's

Let π(· | H_k) be the conditional distribution of the allelic type of an (n+1)-th gene, given H_k, the configuration (i.e. the allelic types) of the first n genes of the sample.

Then the optimal IS distribution (the exact backward transition probabilities) is, for a single population:

p(H_{k+1} \mid H_k) = \frac{1}{\beta}\, \theta\, n_j\, \frac{\pi(i \mid H_k - e_j)}{\pi(j \mid H_k - e_j)}\, P_{ij} \quad \text{for } H_{k+1} = H_k - e_j + e_i

p(H_{k+1} \mid H_k) = \frac{1}{\beta}\, \frac{n_j(n_j-1)}{\pi(j \mid H_k - e_j)} \quad \text{for } H_{k+1} = H_k - e_j

(71)

Towards a better IS scheme: the π̂'s

Unfortunately, the π's are generally unknown.

Stephens & Donnelly (2000) proposed a good approximation π̂ of the π's for a single WF population.

de Iorio & Griffiths (2004) proposed a general method for approximating the π's under different mutational and demographic models.

Approximate backward transition probabilities based on the π̂'s are then used:

\hat{p}(H_{k+1} \mid H_k) = \frac{1}{\beta}\, \theta\, n_j\, \frac{\hat\pi(i \mid H_k - e_j)}{\hat\pi(j \mid H_k - e_j)}\, P_{ij} \quad \text{for } H_{k+1} = H_k - e_j + e_i

\hat{p}(H_{k+1} \mid H_k) = \frac{1}{\beta}\, \frac{n_j(n_j-1)}{\hat\pi(j \mid H_k - e_j)} \quad \text{for } H_{k+1} = H_k - e_j
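A sketch of how these p̂'s define a backward proposal over candidate events (single population), assuming a user-supplied pi_hat(counts, theta, P) returning the vector of π̂(i | counts), for instance by solving the linear system given later in this deck; the 1/β factor, common to all events, is left out here and reappears in the IS weights once the proposal is normalized.

import numpy as np

def backward_event_scores(counts, theta, P, pi_hat):
    """Unnormalized scores of the candidate backward events from H_k = counts:
      ('mut', i, j): one type-j gene was a type-i gene,  theta * n_j * pihat(i|n-e_j)/pihat(j|n-e_j) * P[i,j]
      ('coal', j):   two type-j genes coalesce,          n_j * (n_j - 1) / pihat(j|n-e_j)."""
    counts = np.asarray(counts, dtype=float)
    scores = []
    for j in np.flatnonzero(counts):
        parent = counts.copy()
        parent[j] -= 1                                   # the configuration H_k - e_j
        pih = pi_hat(parent, theta, P)
        if counts[j] > 1:
            scores.append((("coal", int(j)), counts[j] * (counts[j] - 1) / pih[j]))
        for i in range(len(counts)):
            if i != j:
                scores.append((("mut", i, int(j)), theta * counts[j] * pih[i] / pih[j] * P[i, j]))
    return scores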

(72)

Griffiths et al.'s IS
- Griffiths et al.'s recursion
- Old IS scheme
- New IS scheme
- A general method based on diffusion
- Approximations of π

(73)

The backward equation for f(X_t | X_0 = x)

For a diffusion process, the probability density f of the allele frequencies satisfies the Kolmogorov backward equation, which describes the changes of f over time in the form

\frac{df(X_t \mid X_0 = x)}{dt} = \Phi(f(x)),

where Φ is a differential operator which here takes the form

\Phi = \frac{1}{2} \sum_{i \in E} \sum_{j \in E} x_i(\delta_{ij} - x_j)\, \frac{\partial^2}{\partial x_i \partial x_j} + \sum_{j \in E} \left( \sum_{i \in E} x_i r_{ij} \right) \frac{\partial}{\partial x_j} = \sum_{j \in E} \Phi_j\, \frac{\partial}{\partial x_j}

with

R = \{r_{ij}\} \equiv \frac{\theta}{2}(P - I)

where P = \{p_{ij}\} is the mutation matrix and I the identity matrix.

(74)

The backward equation for E[g(X_t) | X_0 = x]

In the same way as

\frac{df(X_t \mid X_0 = x)}{dt} = \Phi(f(x)),

the following "generator equation" (Karlin and Taylor, 1981, p. 215) holds for any function g(x) with bounded second derivatives:

\lim_{t \to 0} \frac{E[g(X_t) \mid X_0 = x] - g(x)}{t} = \Phi(g(x)).

We will apply this result with g the sample probability given the population allele frequencies x.

(75)

π̂'s computation

To obtain a recursion on the probabilities p(n), with n = H_0 the sample, we write p(n) in the form E[g(X)]:

p(n) = E\left[\binom{n}{\mathbf{n}} \prod_i X_i^{n_i}\right] \quad \text{where } \binom{n}{\mathbf{n}} = \frac{n!}{\prod_i n_i!}.

We therefore have

\frac{d\,p(n)}{dt} = \Phi[p(n)].

At stationary equilibrium, d p(n)/dt is zero. Expanding the expression for Φ[p(n)] then recovers the recursion between the p(n)'s.

(76)

Explicit recursions in terms of π

Applying the previous arguments then leads to a relation between the probabilities of samples that differ by one coalescence or mutation event:

N\left(n\left(\frac{n-1}{N} + \mu\right)\right) p(n) = N \sum_j \frac{n(n_j-1)}{N}\, p(n - e_j) + N\mu \sum_j \sum_i P_{ij}\,(n_i + 1 - \delta_{ij})\, p(n - e_j + e_i).

Expressing all the p(·) in terms of the p(n - e_j)'s for distinct j's:

N \sum_j \left(\frac{n-1}{N} + \mu\right) \pi(j \mid n - e_j)\, n\, p(n - e_j) = N \sum_j \frac{n(n_j-1)}{N}\, p(n - e_j) + N\mu \sum_j \sum_i P_{ij}\, n\, \pi(i \mid n - e_j)\, p(n - e_j)

... a huge system of linear equations, not easier to solve in this form.

(77)

Griffiths et al.'s IS
- Griffiths et al.'s recursion
- Old IS scheme
- New IS scheme
- A general method based on diffusion
- Approximations of π

(78)

π̂'s computation

We note that Φ[p(n)] can be written in the form

\sum_{j \in E} \Phi_j\, \frac{\partial}{\partial x_j}[p(n)].

The approximation technique developed by de Iorio & Griffiths consists in approximating the π's (derived above from p(n), the solution of Φ[p(n)] = 0) by π̂'s derived from the solutions of

E\left[\Phi_j\, \frac{\partial p(n)}{\partial x_j}\right] = E\left[\Phi_j\, \frac{\partial}{\partial x_j} \binom{n}{\mathbf{n}} \prod_i x_i^{n_i}\right] = 0, \quad \text{for each } j \in E.

(79)

π̂'s computation

The approximation technique developed by de Iorio & Griffiths consists in approximating the π's (derived above from p(n), the solution of Φ[p(n)] = 0) by π̂'s derived from the solutions of

E\left[\Phi_j\, \frac{\partial}{\partial x_j} \binom{n}{\mathbf{n}} \prod_i x_i^{n_i}\right] = 0, \quad \text{for each } j \in E,

which gives, for a panmictic population, for each j ∈ E:

n_j(n - 1 + \theta)\, \hat{p}(n) = n(n_j - 1)\, \hat{p}(n - e_j) + \sum_{i \in E} \theta P_{ij}\,(n_i + 1 - \delta_{ij})\, \hat{p}(n - e_j + e_i)

(80)

π̂'s computation

Reminder: π(j | n) can be expressed in terms of p(n) and p(n + e_j):

\pi(j \mid n)\, p(n) = \frac{n_j+1}{n+1}\, p(n + e_j).

If we assume that this relation also holds for the π̂'s and p̂'s, which will generally not be exactly the case, we have

\hat\pi(j \mid n)\, \hat{p}(n) = \frac{n_j+1}{n+1}\, \hat{p}(n + e_j)

(81)

π̂'s computation

Approximate the p(n), solutions of Φ[p(n)] = 0, by the p̂(n), solutions of E\left[\Phi_j \frac{\partial}{\partial x_j} \binom{n}{\mathbf{n}} \prod_i x_i^{n_i}\right] = 0 for each j ∈ E, which gives, for a panmictic population, for each j ∈ E:

n_j(n - 1 + \theta)\, \hat{p}(n) = n(n_j - 1)\, \hat{p}(n - e_j) + \sum_{i \in E} \theta P_{ij}\,(n_i + 1 - \delta_{ij})\, \hat{p}(n - e_j + e_i)

Using \hat\pi(j \mid n)\, \hat{p}(n) = \frac{n_j+1}{n+1}\, \hat{p}(n + e_j) and replacing n by n + e_j, we thus obtain, for each j ∈ E:

(n + \theta)\, \hat\pi(j \mid n) = n_j + \sum_{i \in E} \theta P_{ij}\, \hat\pi(i \mid n)

This is the linear system allowing the computation of the π̂(j | n) for a Wright-Fisher model.
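A minimal numerical sketch of this linear system for a K-allele model (not the Migraine implementation); counts is the sample configuration n and P the mutation matrix, and both names are ours.

import numpy as np

def pi_hat(counts, theta, P):
    """Solve (n + theta) * pihat(j|n) = n_j + theta * sum_i pihat(i|n) * P[i, j] for all j,
    i.e. the row-vector system  pihat @ ((n + theta) I - theta P) = counts."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    K = len(counts)
    A = (n + theta) * np.eye(K) - theta * np.asarray(P, dtype=float)
    return np.linalg.solve(A.T, counts)     # pihat, which sums to 1 since P is stochastic

K = 3                                       # example: 3-allele KAM, equal mutation to the other alleles
P = (np.ones((K, K)) - np.eye(K)) / (K - 1)
print(pi_hat([5, 3, 0], theta=0.5, P=P))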

(82)

New IS scheme with the π̂'s

[Note to self: make two summary slides for the new IS scheme, reusing pieces of slides 63, 64, 65, 66, 70 and 71]

(83)

A much better IS scheme based on the π̂'s

Drastic gain in efficiency with this new IS scheme (old IS scheme: millions of trees):

- exact backward transition probabilities for a WF model with parent-independent mutation (i.e. KAM)

- only 30 histories are necessary for a good estimation of the likelihood for more complex models (structured populations & KAM)

- but efficiency slightly decreases for non-parent-independent mutation models, e.g. the stepwise mutation model (200 histories for structured populations & SMM)

- and efficiency is still limited for time-inhomogeneous demographic models, e.g. one population with a past size change (cf. the Orang-Utan example): up to 20,000 histories are necessary for strong disequilibrium scenarios (e.g. a quick change in population size)

(84)

Implementations of IS: Genetree and Migraine

Genetree (Bahlo & Griffiths 2000, old IS algorithm)
- 2 to 4 populations with migration (ISM)

Migraine (Rousset & Leblois 2007-2014, new IS algorithms)
- one single stable population (KAM, SMM, GSM, ISM)
- one population with past size variation (KAM, SMM, GSM, ISM)
- 2 populations with migration (KAM, SMM, ISM)
- isolation by distance in 1D and 2D (KAM)

(85)

Implementation of IS in Migraine

1. C++ core: IS computations
- stratified random sampling of parameter points
- estimation of the likelihood at each point using IS

2. R code for "post-treatment"
- likelihood surface interpolation by kriging
- inference of MLEs and CIs
- plots of 1D and 2D likelihood profiles

(86)

Introduction

Likelihoods under the coalescent

Felsenstein et al.'s MCMC

Griffiths et al.'s IS

Simulation tests

Conclusions

(87)


Simulation tests

Can we trust the demographic / historical inferences made with those methods?

- Bias, RMSE, coverage properties of confidence intervals

- Robustness to realistic but "uninteresting" mis-specifications

→ To this aim, we tested by simulation:
- the performance of Migraine to infer dispersal under IBD
- the performance of MsVar and Migraine to detect and measure past population size changes

A few interesting results...
