
Maximum likelihood inference of population size contractions from microsatellite data

by

Raphaël Leblois, Pierre Pudlo, Joseph Néron, François Bertaux, Champak Reddy Beeravolu, Renaud Vitalis, François Rousset

A. Details on the likelihood computations and settings of the inference method


A.1. Coalescent-based IS algorithms and disequilibrium models

In this section, we give a more detailed overview of the method used to compute likelihoods of genetic data at a given locus, and technical details regarding the Monte Carlo algorithm.

The likelihood at a given point of the parameter space is estimated using the importance sampling (IS) approach of Stephens and Donnelly (2000) and de Iorio and Griffiths (2004a). An ancestral history, i.e. a coalescence tree with mutations, is defined as the set of all ancestral configurations $H = \{H_k;\ k = 0, -1, \ldots, -m\}$, corresponding to all coalescent or mutation events that occurred from $H_0$, the current sample state (i.e. the sample allelic configuration, or allelic counts), to $H_{-m}$, the allelic state of the most recent common ancestor (MRCA) of the sample. The Markov nature of the backward coalescent process implies that $p(H_k) = \sum_{\{H_{k-1}\}} p(H_k|H_{k-1})\,p(H_{k-1})$, and expanding the recursion over possible ancestral histories of a current sample leads to $p(H_0) = E_p[p(H_0|H_{-1}) \cdots p(H_{-m+1}|H_{-m})]$. However, forward transition probabilities $p(H_k|H_{k-1})$ cannot be used directly in a backward process, and backward transition probabilities $p(H_{k-1}|H_k)$ are unknown, except in some specific simple models such as parent independent mutations (PIM) in a single stable panmictic population. Importance sampling techniques based on an approximation $\hat{p}(H_{k-1}|H_k)$ of $p(H_{k-1}|H_k)$ are thus used to derive the probability of a sample over possible histories

$$p(H_0) = E_{\hat{p}}\left[\frac{p(H_0|H_{-1})}{\hat{p}(H_{-1}|H_0)} \cdots \frac{p(H_{-m+1}|H_{-m})}{\hat{p}(H_{-m}|H_{-m+1})}\right]. \qquad (1)$$

The likelihood of the data is then estimated as the average value of the probability of a sample configuration $H_0$ given an ancestral history $H_i$, over $n_H$ independent simulations, backward in time, of possible ancestral histories: $p(H_0) \approx \frac{1}{n_H}\sum_{i=1}^{n_H} p(H_0|H_i)$. The distribution of the possible ancestral histories is generated by an absorbing Markov chain with transition probabilities $\hat{p}(H_{k-1}|H_k)$, and the likelihood is estimated by averaging the product of sequential importance weights corresponding to the ratio $\frac{p(H_{i,0}|H_{i,-1})}{\hat{p}(H_{i,-1}|H_{i,0})} \cdots \frac{p(H_{i,-m+1}|H_{i,-m})}{\hat{p}(H_{i,-m}|H_{i,-m+1})}$ of forward and backward transition probabilities obtained for each history $H_i$. Computation of the backward transition probabilities $\hat{p}$ relies on Stephens and Donnelly's $\hat{\pi}$ approximations of the unknown probability $\pi$ that an additional gene sampled from a population is of a given allelic type conditional on a previous sample configuration (see A.4 for an example of $\hat{\pi}$ computations).

Non-standard abbreviations: IS (importance sampling), MCMC (Markov chain Monte Carlo), ML (maximum likelihood), RMSE (root mean square error), KS (Kolmogorov-Smirnov test), LRT (likelihood ratio test), CI (confidence or credibility intervals), KAM (K-allele model), SMM (stepwise mutation model), GSM (generalized stepwise mutation model), SNP (single nucleotide polymorphism), IBD (isolation by distance), BDR (bottleneck detection rate), FBDR (false bottleneck detection rate), FEDR (false expansion detection rate).

For efficient importance sampling distributions (i.e. approximate backward transition probabilities $\hat{p}$ close to the exact $p$), precise estimation of the likelihood can be obtained with very few histories explored. For example, under a parent independent mutation model (e.g. a K-allele model, KAM), the importance sampling scheme of Stephens and Donnelly (2000) for a single isolated population is optimal because the $\hat{\pi}$'s computed following Stephens and Donnelly (2000) are equal to the true $\pi$'s and all the ratios $p(H_k|H_{k-1})/\hat{p}(H_{k-1}|H_k)$ cancel out. In such an ideal case, consideration of a single ancestral history is thus sufficient to get the exact likelihood (de Iorio and Griffiths, 2004a,b; Stephens and Donnelly, 2000). However, departures from parent independent mutation, from panmixia and from time-homogeneity decrease the efficiency of IS proposals, and precise estimation of likelihoods then requires exploring more ancestral histories. Under a time-homogeneous model of isolation by distance, Rousset and Leblois (2007, 2012) found that 30 replicates of the absorbing Markov chain, rebuilding 30 independent possible ancestral histories, are enough to get perfect LRT-Pvalue distributions. Here, for the first time, de Iorio and Griffiths (2004a)'s IS algorithm is applied to a time-inhomogeneous demographic model using $\hat{\pi}$ probabilities computed from a time-homogeneous demographic model, as described in the main text. Our work clearly shows that IS proposals computed as described in the main text are less and less efficient for demographic scenarios with increasing disequilibrium.
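The estimator described in this section can be illustrated with a minimal sketch. It assumes that, for each simulated history, the forward probabilities $p$ and the proposal probabilities $\hat{p}$ of every event are already available as two lists; the function names and the toy numbers are hypothetical and are not taken from Migraine. The standard error returned alongside the estimate shows why an exact proposal (all weights equal) needs a single history, whereas a poor proposal needs many.

```python
import math

def history_weight(forward_probs, proposal_probs):
    """Importance weight of one history: product over events of p / p_hat.

    The product is accumulated as a sum of logs to limit underflow.
    """
    log_w = sum(math.log(p) - math.log(q)
                for p, q in zip(forward_probs, proposal_probs))
    return math.exp(log_w)

def is_likelihood(histories):
    """Average the weights over histories; return (estimate, standard error)."""
    weights = [history_weight(p, q) for p, q in histories]
    n = len(weights)
    mean = sum(weights) / n
    if n < 2:
        return mean, float("nan")
    var = sum((w - mean) ** 2 for w in weights) / (n - 1)
    return mean, math.sqrt(var / n)

# Two toy histories, each given as (forward probabilities, proposal probabilities).
toy = [([0.5, 0.2], [1.0, 0.5]), ([0.1, 0.4], [0.5, 0.2])]
estimate, std_error = is_likelihood(toy)

# Idealised case: every history has the same weight, so the standard error vanishes.
ideal_estimate, ideal_error = is_likelihood([([0.3], [1.0])] * 3)
```

In this toy case the two weights are 0.2 and 0.4, so the spread between them, not their number alone, drives the precision of the estimate.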

A.2. Some efficient modifications of the original IS algorithm

To speed up the computations, we can stop the simulation of the genealogies during the IS algorithm before reaching the MRCA (see Jasra, De Iorio and Chadeau-Hyam, 2011; Rousset and Leblois, 2007). Here, the algorithm is stopped when reaching demographic equilibrium (i.e., after time $T$, when $N(t) = N_{anc}$) and we finalize the current IS estimates of the likelihood with the PAC-likelihood of the ancestral lineages (Li and Stephens, 2003; Cornuet and Beaumont, 2007; Rousset and Leblois, 2007, 2012). This scheme will be called PACanc. Using analytical formulas for the exact probability of the last pair of genes, computed as in Rousset (2004), also slightly decreases the computation time, in the same vein as in de Iorio et al. (2005) and Rousset and Leblois (2007, 2012). The IS scheme can thus be stopped when reaching an ancestral sample of size 2 and finalized with this exact formula, hereafter called 2ID. The combination of both PACanc and 2ID will be called PACanc2ID. A detailed comparison between the four schemes (strict IS, 2ID, PACanc and PACanc2ID) is presented in Table S1 for the baseline scenarios under a SMM and under a GSM (case [0], [A] to [F] under a SMM; case [G] to [L] under a GSM with p = 0.22; and case [K] to [M] under a GSM with p = 0.74). For the GSM, data sets are simulated under a GSM with 40 allelic states but analyzed with a GSM with 50 possible allelic states. This slight mis-specification of the mutation model has a relatively strong influence when p = 0.74: LRT-Pvalue distributions are not close to the 1:1 line, regardless of the number of loci. Such a strong effect is also probably due to the consideration of large $\theta_{anc}$ values. Apart from this mutation model effect, our results show that performances are similar in all cases for the different algorithms. All simulations with a GSM, i.e. for the tests of the effect of mutational processes and population structure, are analyzed using the PACanc2ID scheme, unless otherwise specified.
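The four schemes differ only in when the backward simulation of a history is interrupted and how the estimate is completed. The sketch below encodes that as a stopping rule under stated assumptions: the function name, the returned labels and the precedence given to the PAC-likelihood when both conditions hold at once are hypothetical choices made for illustration, not a description of the Migraine implementation.

```python
def stopping_action(scheme, n_lineages, time, t_equilibrium):
    """Return how to finish the current history, or None to keep simulating.

    scheme        -- one of "IS", "2ID", "PACanc", "PACanc2ID"
    n_lineages    -- current number of ancestral lineages
    time          -- current backward time
    t_equilibrium -- time T after which N(t) = N_anc
    """
    if n_lineages == 1:
        return "mrca"                      # strict IS ends at the MRCA
    use_pac = scheme in ("PACanc", "PACanc2ID")
    use_2id = scheme in ("2ID", "PACanc2ID")
    if use_pac and time >= t_equilibrium:
        return "pac_likelihood"            # finish with the PAC-likelihood
    if use_2id and n_lineages == 2:
        return "exact_pair_formula"        # finish with the last-pair formula
    return None
```

For instance, `stopping_action("PACanc", 5, 2.0, 1.0)` returns `"pac_likelihood"`, whereas the same call with scheme `"IS"` returns `None` and the simulation continues.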

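Table S1: Effects of the algorithm used and the number of loci on the performance of estimations for our baseline simulation with θ = 0.4, D = 1.25 and θanc = 40.0 under a SMM, a GSM with p = 0.22 and p = 0.74. L̂: type of algorithm used; n_ℓ: number of loci; IS: importance sampling; 2ID: exact computation of the likelihood for the last pair of genes; PACanc: PAC-likelihood used in the ancestral stable population. BDR: bottleneck detection rate. FEDR: false expansion detection rate. For each parameter, the three values are the relative bias, the relative RMSE (RRMSE) and the KS test P-value.

Case | L̂ | n_ℓ | p: rel. bias, RRMSE, KS | θ: rel. bias, RRMSE, KS | D: rel. bias, RRMSE, KS | θanc: rel. bias, RRMSE, KS | BDR (FEDR)

SMM
[0] | IS | 10 | NA, NA, NA | 0.035, 0.56, 0.056 | 0.062, 0.27, 0.068 | 0.046, 0.47, 0.46 | 1 (0)
[J] | 2ID | 10 | NA, NA, NA | 0.038, 0.054, 0.060 | 0.069, 0.27, 0.054 | 0.043, 0.46, 0.81 | 1 (0)
[K] | PACanc | 10 | NA, NA, NA | 0.061, 0.55, 0.051 | 0.059, 0.27, 0.048 | 0.048, 0.47, 0.685 | 1 (0)
[L] | PACanc2ID | 10 | NA, NA, NA | 0.057, 0.544, 0.032 | 0.069, 0.27, 0.014 | 0.055, 0.48, 0.59 | 1 (0)
[A] | IS | 25 | NA, NA, NA | 0.0066, 0.31, 0.35 | 0.0079, 0.16, 0.73 | 0.0055, 0.30, 0.84 | 0.986 (0)
[M] | PACanc2ID | 25 | NA, NA, NA | 0.012, 0.31, 0.48 | 0.016, 0.17, 0.50 | 0.0052, 0.3, 0.947 | 0.995 (0)
[B] | IS | 50 | NA, NA, NA | 0.015, 0.23, 0.62 | 0.0016, 0.12, 0.69 | -0.0071, 0.22, 0.51 | 0.982 (0)
[N] | PACanc2ID | 50 | NA, NA, NA | 0.050, 0.23, 0.41 | 0.036, 0.12, 0.77 | 0.021, 0.24, 0.81 | 0.99 (0)

GSM, p = 0.22
[C] | IS | 10 | 0.26, 0.91, 0.16 | 0.033, 0.51, 0.12 | 0.16, 0.66, 0.14 | 0.22, 1.33, 0.657 | 0.990 (0)
[O] | 2ID | 10 | 0.24, 0.88, 0.0080 | 0.0047, 0.51, 0.377 | 0.091, 0.48, 0.1 | 0.10, 1.06, 0.028 | 0.99 (0)
[P] | PACanc | 10 | 0.21, 0.86, 0.39 | 0.052, 0.52, 0.36 | 0.17, 0.59, 0.37 | 0.21, 1.16, 0.40 | 0.985 (0)
[Q] | PACanc2ID | 10 | 0.23, 0.85, 0.82 | 0.021, 0.51, 0.096 | 0.11, 0.49, 0.53 | 0.11, 1.0, 0.675 | 0.99 (0)
[D] | IS | 50 | 0.17, 0.47, 0.12 | 0.059, 0.25, 0.44 | 0.012, 0.14, 0.75 | -0.085, 0.39, 0.082 | 1.0 (0)
[R] | PACanc2ID | 50 | 0.19, 0.48, 0.67 | 0.070, 0.26, 0.063 | 0.0016, 0.15, 0.85 | -0.13, 0.37, 0.11 | 1 (0)

GSM, p = 0.74
[E] | IS | 10 | 0.016, 0.14, 0.0.094 | 0.137, 0.52, 0.11 | 0.42, 0.67, <10^-12 | 2.46, 3.4, <10^-12 | 0.965 (0)
[S] | PACanc2ID | 10 | -0.022, 0.18, 0.31 | 0.072, 0.56, 0.0691 | 0.27, 0.69, <10^-12 | 1.6, 2.8, 6.5·10^-11 | 0.96 (0)
[F] | IS | 50 | 0.045, 0.081, 3.8·10^-5 | 0.34, 0.44, <10^-12 | 0.40, 0.49, <10^-12 | 1.6, 2.4, <10^-12 | 1.0 (0)
[T] | PACanc2ID | 50 | 0.0047, 0.070, 0.56 | 0.24, 0.40, <10^-12 | 0.30, 0.56, <10^-12 | 2.04, 3.7, <10^-12 | 1.0 (0)

KAM
[G] | IS | 10 | NA, NA, NA | -0.070, 0.64, 0.011 | 0.14, 0.71, 0.000034 | 2.11, 4.8, 0.012 | 0.84 (0)
[H] | IS | 25 | NA, NA, NA | -0.027, 0.49, 0.54 | -0.058, 0.69, 0.54 | 0.61, 2.6, 0.041 | 0.97 (0)
[I] | IS | 50 | NA, NA, NA | -0.084, 0.32, 0.085 | -0.22, 0.51, 0.19 | 0.402, 2.74, 0.0675 | 1.0 (0)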

A.3. Detailed parameterization of the inference method

Analysis of the simulated data sets with Migraine is done in two or three automatic iterations, depending on the demographic scenario. For the first iteration, $n_p$ parameter points, with replicates for one in every thirty of them, are sampled from an initial large range of parameter values set to θ = [0.001, 100], D = [0.001, 20], and $\theta_{anc}$ = [0.001, 600], for all simulations. For analyses under the GSM, the parameter p is sampled from an initial range of [0, 0.9]. Such large exploration ranges are not very efficient in terms of parameter space exploration, but they allow us to automatically obtain good likelihood surface inferences for all simulated scenarios without the need for manual tuning for each scenario. Likelihoods are then estimated for these first $n_p(1 + 1/30)$ points, and a likelihood surface is inferred by Kriging. From this likelihood surface, $n_p$ additional points, again with replicates for one in every thirty of them, are computed inside a convex envelope including the whole P = 0.001 confidence region, to limit or extend the parameter space previously explored. For the second iteration, likelihoods are estimated for these additional $n_p(1 + 1/30)$ points and a second likelihood surface is inferred from all $2n_p(1 + 1/30)$ points. When a third step is considered, the final likelihood surface is then inferred from all $3n_p(1 + 1/30)$ points. This iterative procedure, described in detail in Rousset and Leblois (2012) and in the Migraine user manual, appears extremely efficient as it ensures that many points are sampled near the top of the likelihood surface. For this work, we chose $n_p = 600$ for models with three parameters (i.e. under the SMM), and $n_p = 2,400$ for analyses under the GSM with four parameters.
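As a worked illustration of the point counts implied by this procedure, the short sketch below evaluates $n_p(1 + 1/30)$ per iteration for the two settings used here. The function name is hypothetical; the only inputs are the values of $n_p$ and the one-in-thirty replication rate given above.

```python
def points_after(n_p, iterations, replicate_every=30):
    """Number of likelihood evaluations after a given number of iterations.

    Each iteration adds n_p new points plus one replicate for every
    `replicate_every` of them, i.e. n_p * (1 + 1/30) with the default.
    """
    per_iteration = n_p + n_p // replicate_every
    return per_iteration * iterations

smm_two_iterations = points_after(600, 2)     # three-parameter models
smm_three_iterations = points_after(600, 3)
gsm_two_iterations = points_after(2400, 2)    # four-parameter models
gsm_three_iterations = points_after(2400, 3)
```

With $n_p = 600$ this gives 620 points per iteration, hence 1,240 or 1,860 points in total; with $n_p = 2,400$ it gives 2,480 per iteration, hence 4,960 or 7,440.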

As discussed in the main text (Section 3.2), IS distributions are more or less efficient depending on the level of "disequilibrium" in the demographic model (e.g. the rate of instantaneous population size change). For this reason, we chose to simulate a high number of ancestral histories ($n_H = 2,000$) to obtain good CI coverage properties in almost all scenarios explored. Doing so, we probably lost a substantial amount of computation time when analyzing scenarios for which much smaller $n_H$ values are sufficient for good likelihood inference. However, it allows us to avoid choosing the minimal number of histories compatible with each scenario. The exception was a few situations with very recent and strong contractions (θ = 0.4, $\theta_{anc}$ = 400, D ≤ 0.25), for which up to 200,000 ancestral histories were considered. The efficiency of de Iorio and Griffiths (2004a)'s IS algorithms for time-inhomogeneous demographic models is further discussed in the main text.

Finally, analyses of 200 data sets and the consideration of a large number $n_p$ of parameter points and/or multiple iterative analyses may be time consuming by IS when a large number of histories $n_H$ has to be considered. For this reason, we used the fastest approximation, the PACanc2ID-likelihood (see A.2 for details on the PACanc2ID-likelihood), when testing the effect of population structure with a GSM mutational model, because inference under models with 4 parameters requires exploring a much larger number of points than with 3 parameters (i.e. $n_p = 2,400$ vs 600 for each iteration, see above).

A.4. $\hat{\pi}$ computation under a Generalized Stepwise Mutation Model (GSM)

The main computation step using de Iorio and Griffiths' approach is to solve a system of recursive equations (eq. 10 in de Iorio and Griffiths, 2004b) which gives the $\hat{\pi}$s (see A.1 for a definition) for a given sample configuration $\mathbf{n}$ as a function of the demographic and mutational parameters of the model considered. For a single population with a scaled mutation rate θ = 2Nµ and for any mutation model represented by the matrix of allelic state transitions $P$ with $K$ possible allelic states, and $\sum_j p_{ij} = 1$, the system of recursive equations (eq. 2) is expressed in terms of the allelic counts $n_j$, with $n = \sum_{j=1}^{K} n_j$ the total number of genes in the sample. In de Iorio et al. (2005), we considered an unbounded strict stepwise mutation model (SMM) on $i \in \{\ldots, -2, -1, 0, 1, 2, \ldots\}$ with $P_{ij} = 0.5$ if $|i-j| = 1$ and zero otherwise. We solved the above system of equations using Fourier transforms (eq. 3.5 to 3.9 in de Iorio et al., 2005), and obtained the solution referred to below as eq. 3, in which the terms $c_\ell(\rho)$ are defined by eq. 4.

In a computer implementation, considering an infinite number of possible allelic states is not really tractable. However, using eq. 3 with a sufficiently large number of alleles (e.g. K > 200) leads to efficient computations as well as almost perfect coverage properties of confidence intervals, as shown in this paper.

The aim here is to extend the $\hat{\pi}$ computations to a generalized stepwise mutation model (GSM), in which each mutation event leads with equal probability to the gain or the loss of $X$ repeats, with $X$ being geometrically distributed with parameter $g$ (named p in the main text). Under these conditions, $P_{ij} \propto g^{|i-j|}$ for $|i-j| \neq 0$ and $P_{ij} = 0$ for $i = j$, and Fourier transforms can also be used to solve the system for an infinite number of alleles.
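For a practically usable model with a finite number $K$ of allelic states, the mutation matrix can be built explicitly. The sketch below is one possible construction, given here only as an illustration: the function name is hypothetical, and renormalising each row so that it sums to one is an assumed way of handling the boundaries, not necessarily the treatment used in Migraine. Step sizes are weighted by $g^{|i-j|-1}$, which is proportional to $g^{|i-j|}$ for $g > 0$ and reduces to the SMM when $g = 0$.

```python
def bounded_gsm_matrix(K, g):
    """Row-stochastic K x K GSM matrix on allelic states 0..K-1 (K >= 2).

    Off-diagonal entries are proportional to g**(|i-j| - 1); the diagonal
    is zero. Each row is renormalised to sum to 1 (boundary assumption).
    """
    P = []
    for i in range(K):
        row = [0.0 if i == j else g ** (abs(i - j) - 1) for j in range(K)]
        total = sum(row)
        P.append([x / total for x in row])
    return P

P_gsm = bounded_gsm_matrix(5, 0.22)   # GSM with g = 0.22
P_smm = bounded_gsm_matrix(5, 0.0)    # g = 0 gives one-step mutations only
```

In the interior of the allelic range, gains and losses of the same size have equal probability and each additional repeat multiplies the probability by $g$; near the boundaries the renormalisation breaks this symmetry, which is one reason why the unbounded formulas are only approximations of bounded models.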

Then equation (3.17) of de Iorio et al. (2005) for the computation of the $\hat{\pi}$ under a two-population model becomes here an expression in which α and β are indices representing the two populations. Following de Iorio et al. (2005), we get a solution expressed with $c_\ell$ as previously defined (eq. 4).

Write $R_1$ and $R_2$ for the parameter defined in (7) when $\rho = \rho_1$ and $\rho = \rho_2$, respectively.

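Using (8) and (10), we obtain

$$
\begin{aligned}
a_i(k,j) = \frac{1}{D_i}\Big\{ & q_\alpha^{-1} n_{\alpha k}\big[(n_\beta q_\beta^{-1} + m_\beta + \theta)(1+g^2) + \theta g(1-g)\big]\, c_{k-j}(R_i) \\
& - q_\alpha^{-1} n_{\alpha k}\big[g(n_\beta q_\beta^{-1} + m_\beta + \theta) + \theta(1-g)/2\big]\big(c_{k-j+1}(R_i) + c_{k-j-1}(R_i)\big) \\
& + q_\beta^{-1} n_{\beta k}\, m_\alpha \big[(1+g^2)\, c_{k-j}(R_i) - g\big(c_{k-j+1}(R_i) + c_{k-j-1}(R_i)\big)\big]\Big\}
\end{aligned}
$$

with $D_i = 1 + g^2 + \rho_i\, g(1-g)$. Finally,

$$
\begin{aligned}
a_i(k,j) = \frac{1}{D_i}\Big\{ & q_\alpha^{-1} n_{\alpha k}\big[(n_\beta q_\beta^{-1} + m_\beta)(1+g^2) + \theta(1+g)\big]\, c_{k-j}(R_i) \\
& - q_\alpha^{-1} n_{\alpha k}\big[g(n_\beta q_\beta^{-1} + m_\beta) + \theta(1+g)/2\big]\big(c_{k-j+1}(R_i) + c_{k-j-1}(R_i)\big) \\
& + q_\beta^{-1} n_{\beta k}\, m_\alpha \big[(1+g^2)\, c_{k-j}(R_i) - g\big(c_{k-j+1}(R_i) + c_{k-j-1}(R_i)\big)\big]\Big\}. \qquad (11)
\end{aligned}
$$

When $g = 0$, the mutation model is exactly the SMM. Setting $g = 0$ in (11) leads to

$$
a_i(k,j) = q_\alpha^{-1} n_{\alpha k}\big[n_\beta q_\beta^{-1} + m_\beta + \theta\big]\, c_{k-j}(\rho_i) - q_\alpha^{-1} n_{\alpha k}\,[\theta/2]\,\big(c_{k-j+1}(\rho_i) + c_{k-j-1}(\rho_i)\big) + q_\beta^{-1} n_{\beta k}\, m_\alpha\, c_{k-j}(\rho_i)
$$

since $D_i = 1$ and $R_i = \rho_i$ in this case. This is exactly (3.19) of de Iorio et al. (2005).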

For a single population model, i.e. setting all migration terms $\{m_i\}$ to 0 and only keeping terms with α indices, the expression for $\hat{\pi}$ reduces to eq. 3 except that the $c_\ell(\rho)$ terms are replaced by

$$I_\ell(\rho) = \frac{(1+g^2)\, c_\ell(R) - g\,\big(c_{\ell+1}(R) + c_{\ell-1}(R)\big)}{1 + g^2 + \rho g(1-g)}, \qquad (12)$$

with

$$R = \frac{2g + \rho(1-g)}{1 + g^2 + \rho g(1-g)} \qquad (13)$$

and $c_\ell(\rho)$ defined in eq. 4. As expected, $\hat{\pi}$ values decrease more slowly with mutational distance under the GSM than under the SMM, and increasingly so as $g$ increases. The above expressions are increasingly poor approximations of bounded, and practically usable, GSMs as $g$ increases.
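Equations (12) and (13) translate directly into a few lines of code. The sketch below is an illustration under explicit assumptions: the function names are hypothetical, ρ is treated as a free parameter in (0, 1), and because eq. 4 is not reproduced here the function $c_\ell(\cdot)$ must be supplied by the caller. The stand-in `placeholder_c` is NOT eq. 4; it only exercises the code.

```python
def gsm_denominator(rho, g):
    """Denominator shared by eqs. (12) and (13): 1 + g^2 + rho*g*(1 - g)."""
    return 1.0 + g * g + rho * g * (1.0 - g)

def gsm_R(rho, g):
    """Parameter R of eq. (13)."""
    return (2.0 * g + rho * (1.0 - g)) / gsm_denominator(rho, g)

def gsm_I(ell, rho, g, c):
    """I_ell(rho) of eq. (12); c(ell, x) must implement eq. 4 (caller-supplied)."""
    R = gsm_R(rho, g)
    numerator = (1.0 + g * g) * c(ell, R) - g * (c(ell + 1, R) + c(ell - 1, R))
    return numerator / gsm_denominator(rho, g)

def placeholder_c(ell, x):
    """Hypothetical stand-in for eq. 4, used only to run the example."""
    return x ** abs(ell)

# With g = 0 the GSM terms collapse to the SMM ones: R = rho and I_ell = c_ell(rho).
smm_limit = gsm_I(2, 0.3, 0.0, placeholder_c)
```

A useful sanity check is that $1 + g^2 + \rho g(1-g) - 2g - \rho(1-g) = (1-g)^2(1-\rho)$, so $R$ stays strictly between 0 and 1 whenever ρ and $g$ do.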

For a single population model, an alternative is to use the approach of Stephens and Donnelly (2000) to solve the system of recursive equations. Writing $\hat{\Pi}$ for the vector of $\hat{\pi}(j|\mathbf{n})$, their approach is based on a matrix representation of the system of recursive equations, in which $I$ is the identity matrix of dimension $K$.
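As an illustration of this matrix formulation, the sketch below uses the form given by Stephens and Donnelly (2000) in their own parameterization, $\hat{\Pi} = \frac{\mathbf{n}}{n}\,(1-\lambda)(I - \lambda P)^{-1}$ with $\lambda = \theta/(n+\theta)$; the scaling of θ and the treatment of population size used in this supplement may differ, so the weight λ should be read as an assumption. Instead of an explicit matrix inversion, the code sums the equivalent geometric series $(1-\lambda)\sum_m \lambda^m P^m$, which keeps the example free of external libraries. The function name is hypothetical.

```python
def pi_hat_series(counts, theta, P, tol=1e-14):
    """pi_hat(j | n) for all j, via the series form of (1 - lam) (I - lam P)^{-1}.

    counts -- allelic counts n_j of the current sample
    theta  -- scaled mutation rate (Stephens and Donnelly's scaling assumed)
    P      -- K x K row-stochastic mutation matrix (list of lists)
    """
    n = sum(counts)
    lam = theta / (n + theta)
    K = len(counts)
    v = [c / n for c in counts]      # type of a gene drawn from the sample
    pi_hat = [0.0] * K
    weight = 1.0 - lam               # probability of zero mutations
    while weight > tol:
        for j in range(K):
            pi_hat[j] += weight * v[j]
        # apply one more mutation step: v <- v P
        v = [sum(v[i] * P[i][j] for i in range(K)) for j in range(K)]
        weight *= lam
    return pi_hat

# Parent independent mutation: every row of P equals the same distribution nu.
nu = [0.2, 0.3, 0.5]
counts = [3, 1, 0]
pim_pi_hat = pi_hat_series(counts, 2.0, [nu, nu, nu])
```

Under parent independent mutation the result reduces to $(n_j + \theta\nu_j)/(n+\theta)$, the case in which the proposal is exact.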

Using matrix inversion to solve the system is straightforward, but it is efficient only for mutation models with a relatively small number of alleles and for time-homogeneous models. It is computationally highly demanding for time-inhomogeneous models, as for the case of a population with variable size considered here, because it requires matrix inversions at almost every step of the backward history reconstruction. Rather than inverting a matrix each time the θ, $n$ or $q$ values have changed, it is more efficient to compute the eigenvalues and eigenvectors of $P$ once at the beginning of a run (actually, only once for each new value of $g$). Then, each time θ, $n$ or $q$ changes, we compute a product of the form $R \cdot (I - s\Lambda)^{-1} \cdot L$, where $s$ is a scalar that depends on θ, $n$ and $q$, Λ is the diagonal matrix of eigenvalues of $P$, $R$ is the matrix of its right eigenvectors and $L$ that of its left eigenvectors. Moreover, the eigendecomposition allows one to evaluate only the required $\hat{\pi}(j|\mathbf{n})$, not the full $\hat{\Pi}$ vector.
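The benefit of reusing an eigendecomposition can be seen on a toy case where it is known in closed form. The sketch below assumes a circular SMM on $K \geq 3$ states ($P_{i,i\pm 1 \bmod K} = 0.5$), chosen only because its eigenvectors are the discrete Fourier modes and its eigenvalues are $\cos(2\pi k/K)$; it is not the mutation matrix used in the paper, and the scalar `s` is a hypothetical coefficient standing for whatever function of θ, $n$ and $q$ multiplies $P$. The function names are likewise hypothetical.

```python
import math

def circular_smm_eigen(K):
    """Eigenvalues cos(2*pi*k/K) of the circular SMM matrix; computed once."""
    return [math.cos(2.0 * math.pi * k / K) for k in range(K)]

def resolvent_entry(i, j, s, eigenvalues):
    """Single entry (i, j) of (I - s P)^{-1} from the stored eigenvalues.

    Only the scalar factors 1 / (1 - s * lambda_k) depend on s, so a new
    value of s costs O(K) per entry instead of a full matrix inversion.
    """
    K = len(eigenvalues)
    total = sum(math.cos(2.0 * math.pi * k * (i - j) / K) / (1.0 - s * lam)
                for k, lam in enumerate(eigenvalues))
    return total / K

eigenvalues = circular_smm_eigen(6)                       # done once per run
entry_a = resolvent_entry(0, 2, 0.4, eigenvalues)         # first value of s
entry_b = resolvent_entry(0, 2, 0.7, eigenvalues)         # s changed, no new decomposition
```

Each call evaluates one entry only, which mirrors the remark that the required $\hat{\pi}(j|\mathbf{n})$ can be obtained without forming the full vector.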

More details are as follows. Let $\lambda_k$, $R^{(k)}$ and $L^{(k)}$ be the kth eigenvalue and the kth right and left eigenvectors of $P$, respectively. Let $a = (a_l)$ be the vector of coordinates of

Reconsidering the initial system (eq. 2), we can write

Using the definitions of eigenvectors and eigenvalues, we have $\sum_{j=1}^{K} p_{ij} R^{(m)}_j = \lambda_m R^{(m)}_i$, and

B. Comparison with MsVar


Table S2: Simulated data sets from Girod et al. (2011)

D (T) | θ (N) | θanc (Nanc)
0.025 (10) | 0.4 (200) | 4.0 (2,000)
0.025 (10) | 0.4 (200) | 40.0 (20,000)
0.025 (10) | 0.4 (200) | 400.0 (200,000)
0.125 (50) | 0.4 (200) | 4.0 (2,000)
0.125 (50) | 0.4 (200) | 40.0 (20,000)
0.125 (50) | 0.4 (200) | 400.0 (200,000)
0.25 (100) | 0.4 (200) | 4.0 (2,000)
0.25 (100) | 0.4 (200) | 40.0 (20,000)
0.25 (100) | 0.4 (200) | 400.0 (200,000)
1.25 (500) | 0.4 (200) | 4.0 (2,000)
1.25 (500) | 0.4 (200) | 40.0 (20,000)
1.25 (500) | 0.4 (200) | 400.0 (200,000)

In Girod et al. (2011), five data sets were simulated for each scenario with the SimCoal2 software (Laval and Excoffier, 2004) and analyzed with the MsVar software. For each scenario, 5 data sets of $n_g = 100$ genes, genotyped at $n_\ell = 10$ unlinked microsatellite loci, were analyzed.
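The correspondence between the scaled and the natural parameters in Table S2 can be checked with a few lines of arithmetic. The sketch below assumes θ = 2Nµ as defined in A.4, and the relation D = T/(2N), which is inferred here from the tabulated values rather than stated in this supplement; the mutation rate µ = 0.001 is likewise the value implied by θ = 0.4 with N = 200. The function name is hypothetical.

```python
MU = 1e-3  # mutation rate implied by theta = 2*N*mu with theta = 0.4, N = 200

def scaled_parameters(T, N, N_anc, mu=MU):
    """Scaled parameters (D, theta, theta_anc) from generations and gene numbers.

    Assumes D = T / (2N), theta = 2*N*mu and theta_anc = 2*N_anc*mu.
    """
    return {
        "D": T / (2 * N),
        "theta": 2 * N * mu,
        "theta_anc": 2 * N_anc * mu,
    }

first_row = scaled_parameters(T=10, N=200, N_anc=2000)
last_row = scaled_parameters(T=500, N=200, N_anc=200000)
```

Under these assumptions the first and last rows of the table are recovered: D = 0.025, θ = 0.4, θanc = 4.0, and D = 1.25, θ = 0.4, θanc = 400.0.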

For comparison with MsVar, we reanalyzed all 5 data sets (100 genes genotyped at 10 loci) considered in each of the 12 contraction scenarios of Girod et al. (2011) described in Table S2. All results are presented in Fig. S1, S2, and S3. Note that MsVar is implemented in a Bayesian framework. We thus compare Bayes factors vs. LRT for bottleneck detection and HPD intervals vs. confidence intervals for parameter inference. First, Migraine and MsVar globally give similar results, with a small but clear advantage of Migraine over MsVar both in terms of higher BDRs and smaller FEDRs. For parameter inference, it is more difficult to draw clear conclusions from our comparison of the two methods because of the small number of data sets analyzed for each scenario. The analysis of such a limited number of simulated data sets in Girod et al. (2011) was due to the large computation times of MsVar.

Over all demographic scenarios, the strongest differences in terms of parameter inference are due (1) to differences in the detection of past contraction for some data sets and (2) to the different behavior of the two methods for demographic situations with strong and recent changes in population size. First, when a contraction is detected with Migraine but not with MsVar, parameter inference with Migraine is, as expected, more precise than with MsVar. This phenomenon can be seen in Fig. S1 for some data sets for D = 0.025, 0.125 and 0.25. Second, for strong and recent contractions, MsVar and Migraine both have difficulties in correctly inferring likelihood surfaces and often give biased estimates for some parameters, as well as too narrow CIs, which often do not contain the simulated value (Fig. S3). On the one hand, MsVar shows important convergence issues for those scenarios, as shown in Girod et al. (2011), often underestimates $\theta_{anc}$ and gives too narrow CIs. On the other hand, for the same scenarios, the IS algorithms implemented in Migraine are much less efficient than for other, less extreme scenarios (see Section 3.2). Migraine results reported in Fig. S3 were obtained by simulation of $n_H = 200,000$ ancestral histories for D = 0.125 and 0.25, and $n_H = 2,000$ for all other cases. However, even with such a large exploration of ancestral histories, Migraine still gives biased estimates but, contrary to MsVar, it gives CIs that almost always contain the expected value.

To further compare inferences without considering those two phenomena, we focused on combinations of data sets and parameters for which there is enough information to obtain reasonable estimates with both methods and for which there are neither MCMC convergence nor IS inefficiency issues. In such situations, we note that inferences of D and $\theta_{anc}$ using Migraine and MsVar give similar results in terms of point estimates and CIs (e.g., Fig. S2 for $\theta_{anc}$ = 40.0 and D > 0.125, Fig. S3 for $\theta_{anc}$ = 400.0 and D = 1.25). However, compared to Migraine's estimates, MsVar gives slightly lower $\theta_{anc}$ estimates with CIs that more often do not contain the simulated value. Both methods also give similar results for inference of θ in terms of point estimates and upper bounds of CIs (e.g. Fig. S2 with D = 1.25), but MsVar sometimes infers lower CI bounds that are well below those obtained with Migraine (e.g. data sets #2 and #4).

Figure S1: Comparison of the results obtained with Migraine (black) and MsVar (gray) for the analyses of the 20 contraction data sets from Girod et al. (2011) with θ = 0.4, θanc = 4.0 and D = {0.025; 0.125; 0.25; 1.25} (see Table S2). Horizontal lines indicate the parameter value used for the simulation. For each data set, numbered from 1 to 5, point estimates are represented by a square, and Migraine confidence intervals and MsVar credibility intervals are represented by vertical lines. Dotted lines indicate an infinite bound for Migraine confidence intervals. BDR: bottleneck detection rate; FEDR: false expansion detection rate; NC: proportion of data sets for which MsVar did not converge after 3 months. See Girod et al. (2011) for details about MsVar analyses.

Figure S2: Comparison of the results obtained with Migraine (black) and MsVar (gray) for the analyses of the 20 contraction data sets from Girod et al. (2011) with θ = 0.4, θanc = 40.0 and D = {0.025; 0.125; 0.25; 1.25} (see Table S2).

Figure S3: Comparison of the results obtained with Migraine (black) and MsVar (gray) for the analyses of the 20 contraction data sets from Girod et al. (2011) with θ = 0.4, θanc = 400.0 and D = {0.025; 0.125; 0.25; 1.25} (see Table S2).

C. LRT-Pvalue cumulative distributions for all scenarios considered in the manuscript

Figure S4: case A. ECDF of P-values for each parameter. 2Nmu = 0.4: KS 0.345; rel. bias 0.00656, rel. RMSE 0.314. D = 1.25: KS 0.732; rel. bias 0.00691, rel. RMSE 0.161. 2Nancmu = 40: KS 0.837; rel. bias 0.00545, rel. RMSE 0.301. Nratio = 0.01: KS 0.814, DR 0.986 (0); rel. bias 0.058, rel. RMSE 0.39.

Figure S5: case B. ECDF of P-values for each parameter. 2Nmu = 0.4: KS 0.623; rel. bias 0.0152, rel. RMSE 0.232. D = 1.25: KS 0.686; rel. bias 0.00161, rel. RMSE 0.118. 2Nancmu = 40: KS 0.505; rel. bias −0.00715, rel. RMSE 0.216. Nratio = 0.01: KS 0.781, DR 0.982 (0); rel. bias 0.0531, rel. RMSE 0.282.

Figure S24: case 1. ECDF of P-values for each parameter. 2Nmu = 0.4: KS 8.82e−06; rel. bias 9.94, rel. RMSE 20. D = 0.025: KS 0.286; rel. bias 2.8, rel. RMSE 12.4. 2Nancmu = 40: KS 0.109; rel. bias 0.102, rel. RMSE 0.427. Nratio = 0.01: KS 0.225, DR 0.764 (0); rel. bias 6.11, rel. RMSE 13.7.

Figure S25: case 2. ECDF of P-values for each parameter. 2Nmu = 0.4: KS 0.0103; rel. bias 7.35, rel. RMSE 13.2. D = 0.0625: KS 3.63e−07; rel. bias 0.53, rel. RMSE 1.52. 2Nancmu = 40: KS 0.0351; rel. bias 0.082, rel. RMSE 0.287. Nratio = 0.01: KS 3.23e−09, DR 0.976 (0); rel. bias 5.35, rel. RMSE 10.6.

Figure S56: case [0.4, 0.25, 400]. Cumulative distributions of LRT-Pvalues for a recent and very strong contraction scenario, with θ = 0.4, D = 0.25 and θanc = 400.0, with (a) nH = 2,000, (b) nH = 20,000 and (c) nH = 200,000.

References

Cornuet JM, Beaumont MA, 2007. A note on the accuracy of PAC-likelihood inference with microsatellite data. Theor. Popul. Biol., 71:12–19.

de Iorio M, Griffiths RC, 2004a. Importance sampling on coalescent histories. Advances in Applied Probability, 36:417–433.

de Iorio M, Griffiths RC, 2004b. Importance sampling on coalescent histories. II. Subdivided population models. Advances in Applied Probability, 36:434–454.

de Iorio M, Griffiths RC, Leblois R, Rousset F, 2005. Stepwise mutation likelihood computation by sequential importance sampling in subdivided population models. Theoretical Population Biology, 68:41–53.

Girod C, Vitalis R, Leblois R, Fréville H, 2011. Inferring population decline and expansion from microsatellite data: a simulation-based evaluation of the Msvar method. Genetics, 188:165–179.

Jasra A, De Iorio M, Chadeau-Hyam M, 2011. The time machine: a simulation approach for stochastic trees. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Science, 467:2350–2368.

Laval G, Excoffier L, 2004. SIMCOAL 2.0: a program to simulate genomic diversity over large recombining regions in a subdivided population with a complex history. Bioinformatics, 20:2485–2487.

Li N, Stephens M, 2003. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics, 165:2213–2233.

Rousset F, 2004. Genetic structure and selection in subdivided populations. Princeton, New Jersey: Princeton Univ. Press.

Rousset F, Leblois R, 2007. Likelihood and approximate likelihood analyses of genetic structure in a linear habitat: performance and robustness to model mis-specification. Mol. Biol. Evol., 24:2730–2745.

Rousset F, Leblois R, 2012. Likelihood-based inferences under a coalescent model of isolation by distance: two-dimensional habitats and confidence intervals. Mol. Biol. Evol., 29:957–973.

Stephens M, Donnelly P, 2000. Inference in molecular population genetics (with discussion). J. R. Stat. Soc. B, 62:605–655.
