Maximum likelihood inference comparing sequence and allelic markers in a population of variable size: an Importance Sampling approach


HAL Id: hal-02932322

https://hal.inrae.fr/hal-02932322

Submitted on 7 Sep 2020


Champak Reddy, Francois Rousset, Pierre Pudlo, Raphaël Leblois

To cite this version:

Champak Reddy, Francois Rousset, Pierre Pudlo, Raphaël Leblois. Maximum likelihood inference comparing sequence and allelic markers in a population of variable size: an Importance Sampling approach. Mathematical and Computational Evolutionary Biology 2012, Jun 2012, Montpellier, France. 2012. ⟨hal-02932322⟩


The demographic model: a single exponentially contracting population

Maximum likelihood inference comparing sequence and allelic markers in a population of variable size: an Importance Sampling approach

Champak Beeravolu Reddy (1), François Rousset (2), Pierre Pudlo (1,3) & Raphaël Leblois (1)

(1) CBGP, INRA, Campus International de Baillarguet CS 30016, 34988 Montferrier-sur-Lez cedex

(2) ISEM, CNRS, Université de Montpellier II - CC 065, 34095 Montpellier Cedex 05

(3) I3M, Université de Montpellier II - CC 051, 34095 Montpellier Cedex 05

Introduction & Objectives

Building on the approach of Griffiths & Tavaré (1994a,b,c), Stephens & Donnelly (2000) introduced an efficient importance sampling scheme for computing the likelihood of a genetic sample. Their method consists of approximating the conditional probability of the allelic type of an additional gene given those currently in the sample. De Iorio & Griffiths (2004a,b) extended this technique to a subdivided population framework for various mutation models between gene types.

For sequence loci, the results of De Iorio & Griffiths (DIG) represent an improvement in efficiency over those of Bahlo & Griffiths (2000), but this has not been tested on more than a single panmictic population. DIG consider the DNA sequences to be evolving under an Infinitely-many Sites Mutation model (ISM). For allelic loci (microsatellites) evolving under a Stepwise Mutation model (SMM), we used the importance sampling proposal distributions of De Iorio et al. (2005).

Here, we used results from Griffiths & Tavaré (1994c) to implement the case of a single exponentially contracting population (see box: The demographic model) and contrasted the estimation performance obtained from sequence and allelic markers. This approach gave us insight into the potential complementarity of the two marker types for improving the precision of the demographic and mutational estimates.

References:

Griffiths, R.C. & Tavaré, S. (1994a) Simulating probability distributions in the coalescent. Theor. Popn. Biol. 46, 131–159.

Griffiths, R.C. & Tavaré, S. (1994b) Ancestral inference in population genetics. Stat. Sci. 9, 307–319.

Griffiths, R.C. & Tavaré, S. (1994c) Sampling theory for neutral alleles in a varying environment. Phil. Trans. R. Soc. Lond. B 344, 403–410.

Bahlo, M. & Griffiths, R.C. (2000) Inference from gene trees in a subdivided population. Theor. Popn. Biol. 57, 79–95.

Stephens, M. & Donnelly, P. (2000) Inference in molecular population genetics. J. Roy. Statist. Soc. B 62, 605–655.

De Iorio, M. & Griffiths, R.C. (2004a) Importance sampling on coalescent histories, I. Adv. Appl. Probab. 36, 417–433.

De Iorio, M. & Griffiths, R.C. (2004b) Importance sampling on coalescent histories, II. Subdivided population models. Adv. Appl. Probab. 36, 434–454.

De Iorio, M., Griffiths, R.C., Leblois, R. & Rousset, F. (2005) Stepwise mutation likelihood computation by sequential importance sampling in subdivided population models. Theor. Popn. Biol. 68, 41–53.

Leblois, R., Estoup, A. & Rousset, F. (2009) IBDSIM: a computer program to simulate genotypic data under isolation by distance. Molecular Ecology Resources 9, 107–109.

Affiliations & Sponsors:

Software & Simulations

We have developed software that simulates data in the well-known Genepop (for allelic data) and Nexus (for sequence data) formats. The simulations were generated using IBDSim (http://raphael.leblois.free.fr/#softwares), which follows an exact coalescent algorithm. The data were then analysed with the Migraine software (http://kimura.univ-montp2.fr/~rousset/Migraine.htm) to infer the demographic and mutational parameters (using DIG's method). Note that the current version of Migraine can also handle more complex demographic models than the one presented here, such as Isolation by Distance (IBD), based on DIG's method.

We chose a simulation strategy that varies the time since the contraction began (D = T/2N) while keeping the other parameters constant. This gave a total of 10 demographic scenarios, each of which was repeated 200 times. The initial size of the population was set to 200 000 haploid individuals, which progressively diminishes to 200 individuals. Each sample consists of 100 individuals genotyped at 10 loci.
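The scenario grid described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the IBDSim or Migraine implementation: the exponential form N(t) = N (N_anc/N)^(t/T) for t generations before sampling is assumed (the poster only says the contraction is exponential), N in D = T/2N is read as the current size of 200, the ten D values are those printed on the T/2N axis of the estimation performance figure, and the function names are hypothetical.

```python
# Population sizes given in the poster text (haploid individuals).
N_CURRENT = 200          # size at sampling time
N_ANCESTRAL = 200_000    # size before the contraction began

# D = T/2N values as printed on the T/2N axis of the performance figure.
D_VALUES = [0.025, 0.0625, 0.125, 0.25, 0.5, 1.25, 2.5, 3.5, 5, 7.5]


def contraction_duration(d, n_current=N_CURRENT):
    """Generations since the contraction began, from D = T / (2N)."""
    return d * 2 * n_current


def population_size(t, duration, n_current=N_CURRENT, n_ancestral=N_ANCESTRAL):
    """Population size t generations before sampling.

    ASSUMPTION: exponential change between the current and ancestral
    sizes over `duration` generations, constant ancestral size before.
    """
    if t >= duration:
        return float(n_ancestral)
    return n_current * (n_ancestral / n_current) ** (t / duration)


if __name__ == "__main__":
    for d in D_VALUES:
        T = contraction_duration(d)
        midpoint = population_size(T / 2, T)
        print(f"D = {d:<6} T = {T:7.1f} generations, N(T/2) = {midpoint:9.1f}")
```

Under this assumed form the size halfway through the contraction is the geometric mean of the two sizes, whatever the value of D.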

Example of Migraine's output (ISM, D = 1.25)

[Figure: three likelihood surfaces showing ln(L) against (2Nµ, D) in the panel titled 2N_anc µ = 4.2, against (2Nµ, 2N_anc µ) in the panel titled D = 2.332, and against (D, 2N_anc µ) in the panel titled 2Nµ = 0.204; followed by three profile likelihood ratio contour plots for (2Nµ, D), (2Nµ, 2N_anc µ) and (D, 2N_anc µ), with both axes on a log scale.]

Overall estimation performance

Differences between markers

Likelihood ratio tests (LRT)

Perspectives

The current version of Migraine includes implementations of 1D and 2D IBD models (including the island model as a subcase), as well as the two-population model described in De Iorio et al. (2005) and the elementary one-population model for allelic data. We are currently testing novel importance sampling proposals for the Generalized Stepwise Mutation model (GSM) and DIG's algorithms for sequence data (ISM). Mutation models for other markers, such as Single Nucleotide Polymorphisms (SNPs), will be available shortly. Planned developments also include Isolation with Migration models and continent-island models. Finally, a user-friendly graphical interface is underway in order to ease the use of Migraine (no pun intended).

An extreme scenario (ISM, D = 0.025)

[Figure: two profile likelihood ratio contour plots, one for (2Nµ, 2N_anc µ) and one for (2Nµ, D), with both axes on a log scale.]

We accounted for the differences in the evolution of the genetic markers as follows: for microsatellites under an SMM, we assumed a mutation rate of 10⁻³, while for DNA sequences under an ISM we chose 5 × 10⁻⁵ per sequence. The simulated data had the following number of haplotypes/alleles (N_a) and expected heterozygosity (H_exp) for the cases presented here:

ISM, D = 1.25: N_a = 2.6, H_exp = 0.39

ISM, D = 0.025: N_a = 7.44, H_exp = 0.65

SMM, D = 1.25: N_a = 4.4, H_exp = 0.58

SMM, D = 0.025: N_a = 13.4, H_exp = 0.87
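As an illustration of the two summary statistics quoted above, the following Python sketch computes the number of distinct alleles (or haplotypes) and the expected heterozygosity for one locus. The poster does not say which estimator was used; the sample-size-corrected gene diversity n/(n-1) (1 - Σ p_i²) is assumed here, and the function name and toy sample are hypothetical.

```python
from collections import Counter


def allele_summary(alleles):
    """Return (number of distinct alleles, expected heterozygosity).

    `alleles` is a list with one allelic state (or haplotype label) per
    sampled gene copy at a single locus.
    ASSUMPTION: H_exp is the sample-size-corrected gene diversity,
    n / (n - 1) * (1 - sum of squared allele frequencies).
    """
    n = len(alleles)
    if n < 2:
        raise ValueError("at least two gene copies are needed")
    counts = Counter(alleles)
    sum_sq_freq = sum((c / n) ** 2 for c in counts.values())
    h_exp = n / (n - 1) * (1.0 - sum_sq_freq)
    return len(counts), h_exp


if __name__ == "__main__":
    # Toy microsatellite locus: repeat numbers for 100 gene copies.
    sample = [10] * 50 + [11] * 30 + [12] * 20
    n_alleles, h_exp = allele_summary(sample)
    print(f"N_a = {n_alleles}, H_exp = {h_exp:.3f}")
```

Values such as those in the list above would then be obtained by averaging the per-locus statistics over loci and simulated data sets.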

Results & Conclusions

Very recent and extreme contractions result in flat profile likelihoods (right) or saddle-point-like situations (left). Here, the surfaces indicate that the markers carry little information about the bottleneck and thus about the current population size. This results in very wide confidence intervals for the parameters D and N. Finally, large N values are confounded with the true values of N_anc, which also results in wide confidence intervals.

[Figure: ECDF of the LRT p-values, four panels, each annotated with a KS value and with the relative bias and relative MSE of the estimates.]

2Nµ = 0.02: KS: 0.00835; rel. bias, rel. MSE: 5.32, 14.4

D = 0.125: KS: 0.474; rel. bias, rel. MSE: 1.14, 6.44

2N_anc µ = 2: KS: 0.0411; rel. bias, rel. MSE: 0.0147, 0.211

Nratio = 0.01: KS: 0.0766; DR: 0.92 (0); rel. bias, rel. MSE: 54, 596

LRTs were performed on each of the estimated parameters for a given bottleneck scenario. This consists of calculating 200 LR statistics (one for each repetition) using the likelihood values at the true parameters.

As the LR statistic approximately follows a chi-square distribution, we calculated the corresponding p-values. The figures on the left plot the empirical cdf of the obtained p-values. This lets us check whether the distribution of the p-values for each parameter is uniform, as expected if the estimated likelihood surface is accurate.
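The check described above can be sketched as follows in Python, using only the standard library. This is a minimal illustration, not Migraine's code: one degree of freedom per tested parameter is assumed (the poster does not state it), the uniformity check is expressed as the Kolmogorov-Smirnov distance between the empirical cdf and the uniform cdf, and the function names and log-likelihood values are hypothetical.

```python
import math


def lrt_pvalue(lnl_max, lnl_true):
    """p-value of the LR statistic 2 * (lnL_max - lnL_true).

    ASSUMPTION: chi-square reference distribution with 1 degree of
    freedom, whose survival function is erfc(sqrt(x / 2)).
    """
    stat = max(0.0, 2.0 * (lnl_max - lnl_true))
    return math.erfc(math.sqrt(stat / 2.0))


def ks_distance_to_uniform(pvalues):
    """Largest gap between the empirical cdf of the p-values and U[0, 1]."""
    xs = sorted(pvalues)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


if __name__ == "__main__":
    # Hypothetical (maximised, at-true-value) log-likelihood pairs,
    # one pair per simulated data set.
    runs = [(-106.2, -106.9), (-110.4, -110.5), (-108.0, -109.8), (-107.3, -107.4)]
    pvals = [lrt_pvalue(lnl_max, lnl_true) for lnl_max, lnl_true in runs]
    print("p-values:", [round(p, 3) for p in pvals])
    print("KS distance to uniform:", round(ks_distance_to_uniform(pvals), 3))
```

With 200 repetitions per scenario, a small distance indicates that the empirical cdf of the p-values lies close to the diagonal, which is what the ECDF plots display.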

The bottleneck scenario implemented here:

1° was successfully detected over a wide range of scenarios;

2° was inferred with reasonable precision for intermediate scenarios. The SMM estimates performed better than their ISM counterparts, but this can be understood in terms of the heterozygosity and the number of alleles/haplotypes present in the simulated data (see box: Differences between markers);

3° the time taken, in comparison to typical "MCMC" approaches, was very short (~4 hrs for a single repetition). Slightly longer runs improve estimation performance (not shown here);

4° for all scenarios, the LRTs indicate very good conformity of the likelihood surface generated by Migraine to the true likelihood.

CBR & RL are supported by the IM-MODEL@CORALFISH project while CBR, FR, PP & RL are supported by the EMILE project, both funded by the French National Research Agency.

[Figure: overall estimation performance, two panels. Each panel plots the relative bias and relative MSE of 2Nµ, T/2N and 2N_anc µ against T/2N, flags the cases where P(LRT p-value ~ U[0;1]) < 0.05 or < 0.01, and reports the Bottleneck Detection Rate (BDR, i.e. power) for each scenario.]

T/2N:          0.025   0.0625   0.125   0.25   0.5    1.25   2.5   3.5    5      7.5

BDR, panel 1:  0.76    0.98     1       1      1      1      1     0.98   0.98   0.5

BDR, panel 2:  0.42    0.78     0.92    0.96   0.99   0.99   1     0.96   0.8    0.46