by
Raphaël Leblois, Pierre Pudlo, Joseph Néron, François Bertaux, Champak Reddy Beeravolu, Renaud Vitalis, François Rousset
A. Details on the likelihood computations and settings of the inference method
A.1. Coalescent-based IS algorithms and disequilibrium models
In this section, we give a more detailed overview of the method used to compute likelihoods of genetic data at a given locus and technical details regarding the Monte Carlo algorithm.
The likelihood at a given point of the parameter space is estimated using the importance sampling approach of Stephens and Donnelly (2000) and de Iorio and Griffiths (2004a). An ancestral history, i.e. a coalescence tree with mutations, is defined as the set of all ancestral configurations $H = \{H_k;\ k = 0, -1, \ldots, -m\}$, corresponding to all coalescent or mutation events that occurred from $H_0$, the current sample state (i.e. the sample allelic configuration, or allelic counts), to $H_{-m}$, the allelic state of the most recent common ancestor (MRCA) of the sample. The Markov nature of the backward coalescent process implies that $p(H_k) = \sum_{\{H_{k-1}\}} p(H_k|H_{k-1})\,p(H_{k-1})$, and expanding the recursion over possible ancestral histories of a current sample leads to $p(H_0) = E_p[p(H_0|H_{-1}) \cdots p(H_{-m+1}|H_{-m})]$. However, forward transition probabilities $p(H_k|H_{k-1})$ cannot directly be used in a backward process, and backward transition probabilities $p(H_{k-1}|H_k)$ are unknown, except in some specific simple models such as parent independent mutations (PIM) in a single stable panmictic population. Importance sampling techniques based on an approximation $\hat{p}(H_{k-1}|H_k)$ of $p(H_{k-1}|H_k)$ are thus used to derive the probability of a sample over possible histories:

$$p(H_0) = E_{\hat{p}}\left[\frac{p(H_0|H_{-1})}{\hat{p}(H_{-1}|H_0)} \cdots \frac{p(H_{-m+1}|H_{-m})}{\hat{p}(H_{-m}|H_{-m+1})}\right]. \qquad (1)$$
The likelihood of the data is then estimated as the average value of the probability of a sample configuration $H_0$ given an ancestral history $H_i$, over $n_H$ independent simulations, backward in time, of possible ancestral histories: $p(H_0) \approx \frac{1}{n_H}\sum_{i=1}^{n_H} p(H_0|H_i)$. The distribution of the possible ancestral histories is generated by an absorbing Markov chain with transition probabilities $\hat{p}(H_{k-1}|H_k)$, and the likelihood is estimated by averaging the product of sequential importance weights, corresponding to the ratios $\frac{p(H_{i,0}|H_{i,-1})}{\hat{p}(H_{i,-1}|H_{i,0})} \cdots \frac{p(H_{i,-m+1}|H_{i,-m})}{\hat{p}(H_{i,-m}|H_{i,-m+1})}$ of forward and backward transition probabilities obtained for each history $H_i$. Computation of the backward transition probabilities $\hat{p}$ relies on Stephens and Donnelly's $\hat{\pi}$ approximations of the unknown probability $\pi$ that an additional gene sampled from a population is of a given allelic type conditional on a previous sample configuration (see A.4 for an example of $\hat{\pi}$ computations).

Non-standard abbreviations: IS (importance sampling), MCMC (Monte Carlo Markov Chain), ML (maximum likelihood), RMSE (root mean square error), KS (Kolmogorov-Smirnov test), LRT (likelihood ratio test), CI (confidence or credibility intervals), KAM (K-allele model), SMM (stepwise mutation model), GSM (generalized stepwise mutation model), SNP (single nucleotide polymorphism), IBD (isolation by distance), BDR (bottleneck detection rate), FBDR (false bottleneck detection rate), FEDR (false expansion detection rate).
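The estimator described above can be illustrated on a deliberately simple stand-in for the coalescent. The sketch below is not the Migraine implementation: it assumes a hypothetical forward chain that starts at state 0 and jumps by +1 with probability `a` or by +2 otherwise, and it estimates the probability of visiting a target state by simulating backward histories under a proposal and averaging the products of forward/backward ratios. All function names are illustrative.

```python
import random


def hit_probability(target, a):
    """Exact probability that the toy forward chain started at 0 visits `target`."""
    f = [1.0, a]  # f[0] = 1 (start state); f[1] = a (reached only by a +1 step)
    for t in range(2, target + 1):
        f.append(a * f[t - 1] + (1.0 - a) * f[t - 2])
    return f[target]


def history_weight(target, a, backward, rng):
    """Simulate one backward history from `target` to 0 and return its weight.

    `backward(t)` is the proposal probability of stepping from t to t - 1
    (the alternative is t - 2). The weight is the product of the ratios
    forward probability / proposal probability along the history.
    """
    t, weight = target, 1.0
    while t > 0:
        b = 1.0 if t == 1 else backward(t)  # from state 1 the only move is to 0
        if rng.random() < b:
            weight *= a / b                  # forward step (t - 1) -> t
            t -= 1
        else:
            weight *= (1.0 - a) / (1.0 - b)  # forward step (t - 2) -> t
            t -= 2
    return weight


def is_estimate(target, a, backward, n_hist, seed=0):
    """Average the importance weights of `n_hist` independent backward histories."""
    rng = random.Random(seed)
    total = sum(history_weight(target, a, backward, rng) for _ in range(n_hist))
    return total / n_hist


def exact_backward(a):
    """Exact backward probabilities of the toy chain, obtained from Bayes' rule."""
    return lambda t: a * hit_probability(t - 1, a) / hit_probability(t, a)


if __name__ == "__main__":
    print(hit_probability(6, 0.3))                      # exact value
    print(is_estimate(6, 0.3, lambda t: 0.5, 20000))    # crude proposal, many histories
    print(is_estimate(6, 0.3, exact_backward(0.3), 1))  # exact proposal, one history
```

With the crude proposal the estimate only approaches the exact value as the number of histories grows, whereas with the exact backward probabilities every history carries the same weight.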
For efficient importance sampling distributions (i.e. approximate backward transition probabilities $\hat{p}$ close to the exact $p$), precise estimation of the likelihood can be obtained with very few histories explored. For example, under a parent independent mutation model (e.g. a K-allele model, KAM), the importance sampling scheme of Stephens and Donnelly (2000) for a single isolated population is optimal because the $\hat{\pi}$'s computed following Stephens and Donnelly (2000) are equal to the true $\pi$'s and all the ratios $\frac{p(H_k|H_{k-1})}{\hat{p}(H_{k-1}|H_k)}$ cancel out. In such an ideal case, consideration of a single ancestral history is thus sufficient to get the exact likelihood (de Iorio and Griffiths, 2004a,b; Stephens and Donnelly, 2000). However, departures from parent independent mutation, from panmixia and from time-homogeneity decrease the efficiency of IS proposals, and precise estimation of likelihoods then requires exploring more ancestral histories. Under a time-homogeneous model of isolation by distance, Rousset and Leblois (2007, 2012) found that 30 replicates of the absorbing Markov chain, rebuilding 30 independent possible ancestral histories, are enough to get perfect LRT-Pvalue distributions. Here, for the first time, de Iorio and Griffiths (2004a)'s IS algorithm is applied to a time-inhomogeneous demographic model using $\hat{\pi}$ probabilities computed from a time-homogeneous demographic model, as described in the main text. Our work clearly shows that IS proposals computed in this way become less and less efficient for demographic scenarios with increasing disequilibrium.
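The cancellation invoked for the parent independent case can be written out in one step. The following is a sketch under two assumptions stated here and not taken from eq. 1: the proposal equals the exact backward probability given by Bayes' rule, and the probability $p(H_{-m})$ of the MRCA state is included as the last factor of the weight.

```latex
\hat{p}(H_{k-1}|H_k) = p(H_{k-1}|H_k) = \frac{p(H_k|H_{k-1})\,p(H_{k-1})}{p(H_k)}
\quad\Longrightarrow\quad
\frac{p(H_k|H_{k-1})}{\hat{p}(H_{k-1}|H_k)} = \frac{p(H_k)}{p(H_{k-1})},
\\[1ex]
p(H_{-m}) \prod_{k=-m+1}^{0} \frac{p(H_k|H_{k-1})}{\hat{p}(H_{k-1}|H_k)}
= p(H_{-m}) \prod_{k=-m+1}^{0} \frac{p(H_k)}{p(H_{k-1})}
= p(H_0).
```

The product telescopes, so the weight no longer depends on the simulated history, which is why a single history suffices in that case.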
A.2. Some efficient modifications of the original IS algorithm
To speed up the computations, we can stop the simulation of the genealogies during the IS algorithm before reaching the MRCA (see Jasra, De Iorio and Chadeau-Hyam, 2011; Rousset and Leblois, 2007). Here, the algorithm is stopped when reaching demographic equilibrium (i.e., after time $T$, when $N(t) = N_{anc}$) and we finalize the current IS estimates of the likelihood with the PAC-likelihood of the ancestral lineages (Li and Stephens, 2003; Cornuet and Beaumont, 2007; Rousset and Leblois, 2007, 2012). This scheme will be called PACanc. Using analytical formulas for the exact probability of the last pair of genes, computed as in Rousset (2004), also slightly decreases the computation time, in the same vein as in de Iorio et al. (2005) and Rousset and Leblois (2007, 2012). The IS scheme can thus be stopped when reaching an ancestral sample of size 2 and finalized with this exact formula, hereafter called 2ID. The combination of both PACanc and 2ID will be called PACanc2ID. A detailed comparison between the four schemes (strict IS, 2ID, PACanc and PACanc2ID) is presented in Table S1 for the baseline scenarios under a SMM and under a GSM (case [0], [A] to [F] under a SMM; case [G] to [L] under a GSM with p = 0.22; and case [K] to [M] under a GSM with p = 0.74). For the GSM, data sets are simulated under a GSM with 40 allelic states but analyzed with a GSM with 50 possible allelic states. This slight mis-specification of the mutation model has a relatively strong influence when p = 0.74: LRT-Pvalue distributions are not close to the 1:1 line, regardless of the number of loci. Such a strong effect is also probably due to the consideration of large $\theta_{anc}$ values. Apart from this mutation model effect, our results show that performances are similar in all cases for the different algorithms. All simulations with a GSM, i.e. for the tests of the effect of mutational processes and population structure, are analyzed using PACanc2ID, unless otherwise specified.
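As a rough illustration of what a PAC-likelihood is, the sketch below multiplies approximate conditional probabilities of each gene given the genes already considered (the product-of-approximate-conditionals idea of Li and Stephens, 2003). It is not the PACanc scheme itself. It assumes a parent independent mutation model in a single stable population, for which the conditional takes the urn form $(n_j + \theta\nu_j)/(n + \theta)$; the scaling of `theta` is the one for which this form holds and may differ by a constant factor from the θ of the text. The names `pim_conditional` and `pac_likelihood` are hypothetical.

```python
def pim_conditional(allele, counts, theta, nu):
    """Probability that an additional gene is `allele`, given the `counts` seen so far.

    Assumes parent independent mutation with stationary frequencies `nu`
    (urn form; see the stated assumption on the scaling of theta).
    """
    n = sum(counts.values())
    return (counts.get(allele, 0) + theta * nu[allele]) / (n + theta)


def pac_likelihood(sample, conditional):
    """Product of conditionals, adding the genes of `sample` one at a time."""
    counts = {}
    likelihood = 1.0
    for allele in sample:
        likelihood *= conditional(allele, counts)
        counts[allele] = counts.get(allele, 0) + 1
    return likelihood


if __name__ == "__main__":
    nu = {"a": 0.5, "b": 0.5}
    cond = lambda allele, counts: pim_conditional(allele, counts, theta=0.4, nu=nu)
    print(pac_likelihood(["a", "a", "b"], cond))
```

In this parent independent case the conditionals are exact, so the product does not depend on the order in which genes are added; with approximate conditionals it generally does.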
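Table S1: Effects of the algorithm used and the number of loci on the performance of estimations for our baseline simulation with θ = 0.4, D = 1.25 and θanc = 40.0 under a SMM, a GSM with p = 0.22 and p = 0.74. L̂: type of algorithm used; nℓ: number of loci; IS: importance sampling; 2ID: exact computation of the likelihood for the last pair of genes; PACanc: PAC-likelihood used in the ancestral stable population. BDR: bottleneck detection rate. FEDR: false expansion detection rate.

| Model | Case | L̂ | nℓ | p rel. bias | p RRMSE | p KS | θ rel. bias | θ RRMSE | θ KS | D rel. bias | D RRMSE | D KS | θanc rel. bias | θanc RRMSE | θanc KS | BDR (FEDR) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SMM | [0] | IS | 10 | NA | NA | NA | 0.035 | 0.56 | 0.056 | 0.062 | 0.27 | 0.068 | 0.046 | 0.47 | 0.46 | 1 (0) |
| SMM | [J] | 2ID | 10 | NA | NA | NA | 0.038 | 0.054 | 0.060 | 0.069 | 0.27 | 0.054 | 0.043 | 0.46 | 0.81 | 1 (0) |
| SMM | [K] | PACanc | 10 | NA | NA | NA | 0.061 | 0.55 | 0.051 | 0.059 | 0.27 | 0.048 | 0.048 | 0.47 | 0.685 | 1 (0) |
| SMM | [L] | PACanc2ID | 10 | NA | NA | NA | 0.057 | 0.544 | 0.032 | 0.069 | 0.27 | 0.014 | 0.055 | 0.48 | 0.59 | 1 (0) |
| SMM | [A] | IS | 25 | NA | NA | NA | 0.0066 | 0.31 | 0.35 | 0.0079 | 0.16 | 0.73 | 0.0055 | 0.30 | 0.84 | 0.986 (0) |
| SMM | [M] | PACanc2ID | 25 | NA | NA | NA | 0.012 | 0.31 | 0.48 | 0.016 | 0.17 | 0.50 | 0.0052 | 0.3 | 0.947 | 0.995 (0) |
| SMM | [B] | IS | 50 | NA | NA | NA | 0.015 | 0.23 | 0.62 | 0.0016 | 0.12 | 0.69 | -0.0071 | 0.22 | 0.51 | 0.982 (0) |
| SMM | [N] | PACanc2ID | 50 | NA | NA | NA | 0.050 | 0.23 | 0.41 | 0.036 | 0.12 | 0.77 | 0.021 | 0.24 | 0.81 | 0.99 (0) |
| GSM, p = 0.22 | [C] | IS | 10 | 0.26 | 0.91 | 0.16 | 0.033 | 0.51 | 0.12 | 0.16 | 0.66 | 0.14 | 0.22 | 1.33 | 0.657 | 0.990 (0) |
| GSM, p = 0.22 | [O] | 2ID | 10 | 0.24 | 0.88 | 0.0080 | 0.0047 | 0.51 | 0.377 | 0.091 | 0.48 | 0.1 | 0.10 | 1.06 | 0.028 | 0.99 (0) |
| GSM, p = 0.22 | [P] | PACanc | 10 | 0.21 | 0.86 | 0.39 | 0.052 | 0.52 | 0.36 | 0.17 | 0.59 | 0.37 | 0.21 | 1.16 | 0.40 | 0.985 (0) |
| GSM, p = 0.22 | [Q] | PACanc2ID | 10 | 0.23 | 0.85 | 0.82 | 0.021 | 0.51 | 0.096 | 0.11 | 0.49 | 0.53 | 0.11 | 1.0 | 0.675 | 0.99 (0) |
| GSM, p = 0.22 | [D] | IS | 50 | 0.17 | 0.47 | 0.12 | 0.059 | 0.25 | 0.44 | 0.012 | 0.14 | 0.75 | -0.085 | 0.39 | 0.082 | 1.0 (0) |
| GSM, p = 0.22 | [R] | PACanc2ID | 50 | 0.19 | 0.48 | 0.67 | 0.070 | 0.26 | 0.063 | 0.0016 | 0.15 | 0.85 | -0.13 | 0.37 | 0.11 | 1 (0) |
| GSM, p = 0.74 | [E] | IS | 10 | 0.016 | 0.14 | 0.0.094 | 0.137 | 0.52 | 0.11 | 0.42 | 0.67 | <10^−12 | 2.46 | 3.4 | <10^−12 | 0.965 (0) |
| GSM, p = 0.74 | [S] | PACanc2ID | 10 | -0.022 | 0.18 | 0.31 | 0.072 | 0.56 | 0.0691 | 0.27 | 0.69 | <10^−12 | 1.6 | 2.8 | 6.5·10^−11 | 0.96 (0) |
| GSM, p = 0.74 | [F] | IS | 50 | 0.045 | 0.081 | 3.8·10^−5 | 0.34 | 0.44 | <10^−12 | 0.40 | 0.49 | <10^−12 | 1.6 | 2.4 | <10^−12 | 1.0 (0) |
| GSM, p = 0.74 | [T] | PACanc2ID | 50 | 0.0047 | 0.070 | 0.56 | 0.24 | 0.40 | <10^−12 | 0.30 | 0.56 | <10^−12 | 2.04 | 3.7 | <10^−12 | 1.0 (0) |
| KAM | [G] | IS | 10 | NA | NA | NA | -0.070 | 0.64 | 0.011 | 0.14 | 0.71 | 0.000034 | 2.11 | 4.8 | 0.012 | 0.84 (0) |
| KAM | [H] | IS | 25 | NA | NA | NA | -0.027 | 0.49 | 0.54 | -0.058 | 0.69 | 0.54 | 0.61 | 2.6 | 0.041 | 0.97 (0) |
| KAM | [I] | IS | 50 | NA | NA | NA | -0.084 | 0.32 | 0.085 | -0.22 | 0.51 | 0.19 | 0.402 | 2.74 | 0.0675 | 1.0 (0) |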
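For readers who want to reproduce summary statistics of this kind from their own replicate analyses, here is a minimal sketch. The definitions are assumptions, since the supplement does not spell them out: relative bias and relative RMSE are taken as the mean and root mean square of (estimate − true)/true, and the Kolmogorov-Smirnov function returns only the distance between the empirical distribution of P-values and the uniform distribution, not a test P-value (the KS columns above presumably report the latter).

```python
import math


def relative_bias(estimates, true_value):
    """Mean of (estimate - true) / true over replicates (assumed definition)."""
    return sum((e - true_value) / true_value for e in estimates) / len(estimates)


def relative_rmse(estimates, true_value):
    """Root mean square of (estimate - true) / true over replicates (assumed definition)."""
    squares = [((e - true_value) / true_value) ** 2 for e in estimates]
    return math.sqrt(sum(squares) / len(squares))


def ks_distance_to_uniform(p_values):
    """Largest gap between the ECDF of `p_values` and the uniform CDF on [0, 1]."""
    xs = sorted(p_values)
    n = len(xs)
    d_plus = max((i + 1) / n - x for i, x in enumerate(xs))
    d_minus = max(x - i / n for i, x in enumerate(xs))
    return max(d_plus, d_minus)
```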
A.3. Detailed parameterization of the inference method
Analysis of the simulated data sets with Migraine is done in two or three automatic iterations depending on the demographic scenario. For the first iteration, $n_p$ parameter points, with replicates for one in every thirty of them, are sampled from an initial large range of parameter values set to θ = [0.001 − 100], D = [0.001 − 20], and θanc = [0.001 − 600] for all simulations. For analyses under the GSM, the parameter p is sampled from an initial range of [0 − 0.9]. Such large exploration ranges are not very efficient in terms of parameter space exploration but allow us to automatically obtain good likelihood surface inferences for all simulated scenarios without the need for manual tuning for each scenario. Likelihoods are then estimated for these first $n_p(1 + 1/30)$ points, and a likelihood surface is inferred by Kriging. From this likelihood surface, $n_p$ additional points, again with replicates for one in every thirty of them, are computed inside a convex envelope including the whole P = 0.001 confidence region, to limit or extend the parameter space previously explored. For the second iteration, likelihoods are estimated for these additional $n_p(1 + 1/30)$ points and a second likelihood surface is inferred from all $2n_p(1 + 1/30)$ points. When a third step is considered, the final likelihood surface is then inferred from all $3n_p(1 + 1/30)$ points. This iterative procedure, described in detail in Rousset and Leblois (2012) and in the Migraine user manual, appears extremely efficient as it ensures that many points are sampled near the top of the likelihood surface. For this work, we chose $n_p$ = 600 for models with three parameters (i.e. under the SMM), and $n_p$ = 2,400 for analyses under the GSM with four parameters.
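The point counts implied by this procedure follow from simple arithmetic. The sketch below only does that bookkeeping; the function names are hypothetical and nothing here reproduces how Migraine actually samples points.

```python
def points_per_iteration(n_p, replicate_every=30):
    """Number of likelihood evaluations in one iteration: n_p * (1 + 1/30)."""
    return n_p + n_p // replicate_every


def cumulative_points(n_p, n_iterations, replicate_every=30):
    """Points available for inferring the likelihood surface after each iteration."""
    per_iteration = points_per_iteration(n_p, replicate_every)
    return [k * per_iteration for k in range(1, n_iterations + 1)]


if __name__ == "__main__":
    print(cumulative_points(600, 3))   # three-parameter analyses (SMM)
    print(cumulative_points(2400, 3))  # four-parameter analyses (GSM)
```

With $n_p$ = 600 this gives 620 points per iteration and 1,860 points after three iterations; with $n_p$ = 2,400 it gives 2,480 points per iteration.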
As discussed in the main text (Section 3.2), IS distributions are more or less efficient depending on the level of “disequilibrium” in the demographic model (e.g. the rate of instantaneous population size change). For this reason, we chose to simulate a large number of ancestral histories ($n_H$ = 2,000) to obtain good CI coverage properties in almost all scenarios explored. Doing so, we probably lost a substantial amount of computation time when analyzing scenarios for which much smaller $n_H$ values are sufficient for good likelihood inference. However, it allowed us to avoid choosing the minimal number of histories compatible with each scenario. The exception was a few situations with very recent and strong contractions (θ = 0.4, θanc = 400, D ≤ 0.25), for which up to 200,000 ancestral histories were considered. The efficiency of de Iorio and Griffiths (2004a)'s IS algorithms for time-inhomogeneous demographic models is further discussed in the main text.
Finally, analyses of 200 data sets and the consideration of a large number $n_p$ of parameter points and/or multiple iterative analyses may be time consuming by IS when a large number of histories $n_H$ has to be considered. For this reason, we used the fastest PACanc2ID-likelihood approximation (see A.2 for details on PACanc2ID-likelihood) when testing the effect of population structure with a GSM mutational model, because inference under models with 4 parameters requires exploring a much larger number of points than with 3 parameters (i.e. $n_p$ = 2,400 vs 600 for each iteration, see above).
A.4. $\hat{\pi}$'s computation under a Generalized Stepwise Mutation Model (GSM)
The main computation step using de Iorio and Griffiths' approach is to solve a system of recursive equations (eq. 10 in de Iorio and Griffiths, 2004b), which gives the $\hat{\pi}$s (see A.1 above for a definition) for a given sample configuration $n$ as a function of the demographic and mutational parameters of the model considered. For a single population with a scaled mutation rate $\theta = 2N\mu$ and for any mutation model represented by the matrix of allelic state transition $P$ with $K$ possible allelic states, and $\sum_j p_{ij} = 1$, the system of recursive equations [...] $\sum_j n_j$ the total number of genes in the sample. In de Iorio et al. (2005), we considered an unbounded strict stepwise mutation model (SMM) on $i \in \{\ldots, -2, -1, 0, 1, 2, \ldots\}$ with $P_{ij} = 0.5$ if $|i-j| = 1$ and zero otherwise. We solved the above system of equations using Fourier transforms (eq. 3.5 to 3.9 in de Iorio et al., 2005), and obtained the following solution: [...]
In a computer implementation, considering an infinite number of possible allelic states is not really tractable. However, using eq. 3 with a sufficiently large number of alleles (e.g. K > 200) leads to efficient computations as well as almost perfect coverage properties of confidence intervals, as shown in this paper.
The aim here is to extend the $\hat{\pi}$ computations to consider a generalized stepwise mutation model (GSM), in which each mutation event leads with equal probability to the gain or the loss of $X$ repeats, with $X$ being geometrically distributed with parameter $g$ (named $p$ in the main text). Under these conditions, $P_{ij} \propto g^{|i-j|}$ for $|i-j| \neq 0$ and zero for $i = j$, and Fourier transforms can also be used to solve the system for an infinite number of alleles [...] Then equation (3.17) of de Iorio et al. (2005) for computation of the $\hat{\pi}$ under a two-population model becomes here [...] where $\alpha$ and $\beta$ are indices representing the two populations, and [...] Following de Iorio et al. (2005), we get [...] with $c_\ell$ as previously defined (eq. 4).
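Write $R_1$ and $R_2$ for the parameter defined in (7) when $\rho = \rho_1$ and $\rho = \rho_2$ respectively. Using (8) and (10), we obtain

$$\begin{aligned} a_i(k,j) = \frac{1}{D_i}\Big\{ & q_\alpha^{-1} n_{\alpha k}\big[(n_\beta q_\beta^{-1} + m_\beta + \theta)(1+g^2) + \theta g(1-g)\big]\, c_{k-j}(R_i) \\ & - q_\alpha^{-1} n_{\alpha k}\big[g(n_\beta q_\beta^{-1} + m_\beta + \theta) + \theta(1-g)/2\big]\big(c_{k-j+1}(R_i) + c_{k-j-1}(R_i)\big) \\ & + q_\beta^{-1} n_{\beta k} m_\alpha \big[(1+g^2)\,c_{k-j}(R_i) - g\big(c_{k-j+1}(R_i) + c_{k-j-1}(R_i)\big)\big]\Big\} \end{aligned}$$

with $D_i = 1 + g^2 + \rho_i g(1-g)$. Finally,

$$\begin{aligned} a_i(k,j) = \frac{1}{D_i}\Big\{ & q_\alpha^{-1} n_{\alpha k}\big[(n_\beta q_\beta^{-1} + m_\beta)(1+g^2) + \theta(1+g)\big]\, c_{k-j}(R_i) \\ & - q_\alpha^{-1} n_{\alpha k}\big[g(n_\beta q_\beta^{-1} + m_\beta) + \theta(1+g)/2\big]\big(c_{k-j+1}(R_i) + c_{k-j-1}(R_i)\big) \\ & + q_\beta^{-1} n_{\beta k} m_\alpha \big[(1+g^2)\,c_{k-j}(R_i) - g\big(c_{k-j+1}(R_i) + c_{k-j-1}(R_i)\big)\big]\Big\}. \end{aligned} \qquad (11)$$

When $g = 0$, the mutation model is exactly the SMM. Setting $g = 0$ in (11) leads to

$$a_i(k,j) = q_\alpha^{-1} n_{\alpha k}\big[n_\beta q_\beta^{-1} + m_\beta + \theta\big]\, c_{k-j}(\rho_i) - q_\alpha^{-1} n_{\alpha k}\,[\theta/2]\,\big(c_{k-j+1}(\rho_i) + c_{k-j-1}(\rho_i)\big) + q_\beta^{-1} n_{\beta k} m_\alpha\,[c_{k-j}(\rho_i)]$$

since $D_i = 1$ and $R_i = \rho_i$ in this case. This is exactly (3.19) of de Iorio et al. (2005).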
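The step from the first expression for $a_i(k,j)$ to eq. (11) only regroups the $\theta$ terms inside the first two brackets. As a check, writing $x = n_\beta q_\beta^{-1} + m_\beta$ (a shorthand introduced here for convenience only):

```latex
(x + \theta)(1+g^2) + \theta g(1-g)
  = x(1+g^2) + \theta\,(1 + g^2 + g - g^2)
  = x(1+g^2) + \theta(1+g),
\\[1ex]
g(x + \theta) + \frac{\theta(1-g)}{2}
  = g x + \frac{\theta\,(2g + 1 - g)}{2}
  = g x + \frac{\theta(1+g)}{2}.
```

The third bracket does not involve $\theta$ and is unchanged between the two expressions.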
For a single population model, i.e. setting all migration terms $\{m_i\}$ to 0 and only keeping terms with $\alpha$ indices, the expression for $\hat{\pi}$ reduces to eq. 3 except that the $c_\ell(\rho)$ terms are replaced by

$$I_\ell(\rho) = \frac{(1+g^2)\,c_\ell(R) - g\,\big(c_{\ell+1}(R) + c_{\ell-1}(R)\big)}{1 + g^2 + \rho g(1-g)}, \qquad (12)$$

with

$$R = \frac{2g + \rho(1-g)}{1 + g^2 + \rho g(1-g)} \qquad (13)$$

and $c_\ell(\rho)$ defined in eq. 4. As expected, the higher $g$ is, the more slowly $\hat{\pi}$ values decrease for distant mutations under the GSM compared with the SMM. The above expressions are increasingly poor approximations of bounded, and practically usable, GSMs as $g$ increases.
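Equations (12) and (13) translate directly into code once a function for $c_\ell$ is available. In the sketch below, `c` is passed in as a callable because eq. 4 is not reproduced here; the `toy_c` used in the usage lines is a purely hypothetical stand-in, not the $c_\ell$ of the paper, and the function names are illustrative.

```python
def gsm_denominator(g, rho):
    """D = 1 + g^2 + rho * g * (1 - g), the denominator shared by eqs. (12) and (13)."""
    return 1.0 + g * g + rho * g * (1.0 - g)


def gsm_R(g, rho):
    """R of eq. (13)."""
    return (2.0 * g + rho * (1.0 - g)) / gsm_denominator(g, rho)


def gsm_I(ell, g, rho, c):
    """I_ell(rho) of eq. (12); `c(ell, r)` must implement c_ell(r) of eq. 4."""
    r = gsm_R(g, rho)
    numerator = (1.0 + g * g) * c(ell, r) - g * (c(ell + 1, r) + c(ell - 1, r))
    return numerator / gsm_denominator(g, rho)


if __name__ == "__main__":
    toy_c = lambda ell, r: r ** abs(ell)  # hypothetical placeholder for c_ell
    print(gsm_R(0.0, 0.4), gsm_I(2, 0.0, 0.4, toy_c))  # g = 0: R = rho, I = c
    print(gsm_R(0.22, 0.4), gsm_I(2, 0.22, 0.4, toy_c))
```

Setting `g = 0` gives `gsm_R(0, rho) == rho` and `gsm_I` equal to `c(ell, rho)`, which mirrors the reduction to the SMM noted for eq. (11).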
For a single population model, an alternative is to use the approach of Stephens and Donnelly (2000) to solve the system of recursive equations. Denoting by $\hat{\Pi}$ the vector of $\hat{\pi}(j|n)$, their approach is based on the following representation of the system of recursive equations: [...] where $I$ is the identity matrix of dimension $K$.
Using matrix inversion computations to solve the system is straightforward, but it is efficient only for mutation models with a relatively small number of alleles and for time-homogeneous models. It is computationally highly demanding for time-inhomogeneous models, as for the case of a population with variable size considered here, because it requires matrix inversions at almost every step of the backward history reconstruction, [...] or $q$ values have changed, it is more efficient to compute the eigenvalues and eigenvectors of $P$ once at the beginning of a run (actually, only once for each new value of $g$). Then, each time $\theta$, $n$ or $q$ changes, we compute $R \cdot \left(I - \frac{\theta}{n/q + \theta}\Lambda\right)^{-1} \cdot L$, where $\Lambda$ is the diagonal matrix of eigenvalues of $P$, $R$ is the matrix of its right eigenvectors and $L$ that of its left eigenvectors. Moreover, the eigendecomposition allows one to evaluate only the required $\hat{\pi}(j|n)$, not the full $\hat{\Pi}$ vector.
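The saving described above can be sketched with NumPy. This is an illustration under explicit assumptions, not the Migraine code: `bounded_gsm_matrix` builds a hypothetical bounded GSM matrix in which step lengths are weighted by $g^{|i-j|-1}$, steps leaving the allelic range are dropped and rows are renormalised; `c` stands for whatever scalar coefficient multiplies $P$ in the system (it depends on $\theta$, $n$ and $q$) and is assumed to lie strictly between 0 and 1.

```python
import numpy as np


def bounded_gsm_matrix(K, g):
    """Row-stochastic K x K mutation matrix for a bounded GSM (assumed boundary rule)."""
    idx = np.arange(K)
    dist = np.abs(idx[:, None] - idx[None, :])
    weights = np.where(dist > 0, float(g) ** (np.maximum(dist, 1) - 1), 0.0)
    return weights / weights.sum(axis=1, keepdims=True)


class Resolvent:
    """Caches the eigendecomposition of P to evaluate (I - c P)^{-1} for many c."""

    def __init__(self, P):
        # Columns of `right` are right eigenvectors; rows of `left` are left eigenvectors.
        self.eigenvalues, self.right = np.linalg.eig(P)
        self.left = np.linalg.inv(self.right)

    def full(self, c):
        """Whole matrix R (I - c Lambda)^{-1} L, without a new matrix inversion."""
        scaled = self.right / (1.0 - c * self.eigenvalues)  # scales column k by 1/(1 - c*lambda_k)
        return (scaled @ self.left).real

    def entry(self, i, j, c):
        """Single entry (i, j), computed without forming the full matrix."""
        terms = self.right[i, :] * self.left[:, j] / (1.0 - c * self.eigenvalues)
        return np.sum(terms).real


if __name__ == "__main__":
    P = bounded_gsm_matrix(50, 0.22)
    res = Resolvent(P)          # done once per value of g
    for c in (0.1, 0.5, 0.9):   # repeated cheaply as the coefficient changes
        print(c, res.entry(10, 12, c))
```

The decomposition is computed once per mutation matrix, and each subsequent evaluation costs a matrix product (or a single sum for one entry) instead of a fresh inversion.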
More details are as follows. Let $\lambda_k$, $R^{(k)}$ and $L^{(k)}$ be the $k$th eigenvalue and the $k$th right and left eigenvectors of $P$, respectively. Let $a = (a_l)$ be the vector of coordinates of [...] Reconsidering the initial system (eq. 2), we can write [...] Using the definitions of eigenvectors and eigenvalues, we have $\sum_{j=1}^{K} p_{ij} R^{(m)}_j = \lambda_m \cdot R^{(m)}_i$, and [...]
B. Comparison with MsVar
Table S2: Simulated data sets from Girod et al. (2011)

| D (T) | θ (N) | θanc (Nanc) |
|---|---|---|
| 0.025 (10) | 0.4 (200) | 4.0 (2,000) |
| 0.025 (10) | 0.4 (200) | 40.0 (20,000) |
| 0.025 (10) | 0.4 (200) | 400.0 (200,000) |
| 0.125 (50) | 0.4 (200) | 4.0 (2,000) |
| 0.125 (50) | 0.4 (200) | 40.0 (20,000) |
| 0.125 (50) | 0.4 (200) | 400.0 (200,000) |
| 0.25 (100) | 0.4 (200) | 4.0 (2,000) |
| 0.25 (100) | 0.4 (200) | 40.0 (20,000) |
| 0.25 (100) | 0.4 (200) | 400.0 (200,000) |
| 1.25 (500) | 0.4 (200) | 4.0 (2,000) |
| 1.25 (500) | 0.4 (200) | 40.0 (20,000) |
| 1.25 (500) | 0.4 (200) | 400.0 (200,000) |

In Girod et al. (2011), five data sets were simulated for each of the scenarios with the SimCoal2 software (Laval and Excoffier, 2004) and analyzed with the MsVar software. For each scenario, 5 data sets of $n_g$ = 100 genes, genotyped at $n_\ell$ = 10 unlinked microsatellite loci, were analyzed.
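The correspondence between the scaled and unscaled values in Table S2 can be checked with a few lines. This sketch assumes θ = 2Nμ (as in A.4), D = T/(2N) and a per-generation mutation rate μ = 10^−3; the last two are inferred from the tabulated values rather than stated in this supplement, and the function name is hypothetical.

```python
def scaled_parameters(N, N_anc, T, mu=1e-3):
    """Scaled parameters from population sizes, time in generations and mutation rate.

    Assumptions (inferred from Table S2): theta = 2*N*mu, D = T/(2*N), mu = 1e-3.
    """
    return {
        "theta": 2 * N * mu,
        "D": T / (2 * N),
        "theta_anc": 2 * N_anc * mu,
    }


if __name__ == "__main__":
    for T in (10, 50, 100, 500):
        for N_anc in (2000, 20000, 200000):
            print(T, N_anc, scaled_parameters(200, N_anc, T))
```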
For comparison with MsVar, we reanalyzed all 5 data sets (100 genes genotyped at 10 loci) considered in each of the 12 contraction scenarios of Girod et al. (2011) described in Table S2. All results are presented in Fig. S1, S2, and S3. Note that MsVar is implemented in a Bayesian framework. We thus compare Bayes factors vs. LRT for bottleneck detection and HPD intervals vs. confidence intervals for parameter inference. First, Migraine and MsVar globally give similar results, with a small but clear advantage of Migraine over MsVar in terms of both higher BDRs and smaller FEDRs. For parameter inference, it is more difficult to draw clear conclusions from our comparison of the two methods because of the small number of data sets analyzed for each scenario. The analysis of such a limited number of simulated data sets in Girod et al. (2011) was due to the large computation times of MsVar.
Over all demographic scenarios, the strongest differences in terms of parameter inference are due (1) to differences in the detection of past contraction for some data sets and (2) to the different behavior of the two methods for demographic situations with strong and recent changes in population size. First, when a contraction is detected with Migraine but not with MsVar, parameter inference with Migraine is, as expected, more precise than with MsVar. This phenomenon can be seen in Fig. S1 for some data sets for D = 0.025, 0.125 and 0.25. Second, for strong and recent contractions, MsVar and Migraine both have difficulties in correctly inferring likelihood surfaces and often give biased estimates for some parameters, as well as CIs that are too narrow and often do not contain the simulated value (Fig. S3). On the one hand, MsVar shows important convergence issues for those scenarios, as shown in Girod et al. (2011), often underestimates $\theta_{anc}$ and gives CIs that are too narrow. On the other hand, for the same scenarios, the IS algorithms implemented in Migraine are much less efficient than for other less extreme scenarios (see Section 3.2). Migraine results reported in Fig. S3 were obtained by simulation of $n_H$ = 200,000 ancestral histories for D = 0.125 and 0.25, and $n_H$ = 2,000 for all other cases. However, even with such a large exploration of ancestral histories, Migraine still gives biased estimates but, contrary to MsVar, it gives CIs that almost always contain the expected value.
To further compare inferences without considering those two phenomena, we focused on combinations of data sets and parameters for which there is enough information to obtain reasonable estimates with both methods and for which there are neither MCMC convergence nor IS inefficiency issues. In such situations, we note that inferences of D and $\theta_{anc}$ using Migraine and MsVar give similar results in terms of point estimates and CIs (e.g., Fig. S2 for $\theta_{anc}$ = 40.0 and D > 0.125, Fig. S3 for $\theta_{anc}$ = 400.0 and D = 1.25). However, compared to Migraine's estimates, MsVar gives slightly lower $\theta_{anc}$ estimates with CIs that more often do not contain the simulated value. Both methods also give similar results for inference of θ in terms of point estimates and upper bounds of CIs (e.g. Fig. S2 with D = 1.25), but MsVar sometimes infers lower CI bounds that are well below those obtained with Migraine (e.g. data sets #2 and #4).
Figure S1: Comparison of the results obtained with Migraine (black) and MsVar (gray) for the analyses of the 20 contraction data sets from Girod et al. (2011) with θ = 0.4, θanc = 4.0 and D = {0.025; 0.125; 0.25; 1.25} (see Table S2). Horizontal lines indicate the parameter value used for the simulation. For each data set numbered from 1 to 5, point estimates are represented by a square, and Migraine confidence intervals and MsVar credibility intervals are represented by vertical lines. Dotted lines indicate an infinite bound for Migraine confidence intervals. BDR: bottleneck detection rate; FEDR: false expansion detection rate; NC: proportion of data sets for which MsVar did not converge after 3 months. See Girod et al. (2011) for details about MsVar analyses.

Figure S2: Comparison of the results obtained with Migraine (black) and MsVar (gray) for the analyses of the 20 contraction data sets from Girod et al. (2011) with θ = 0.4, θanc = 40.0 and D = {0.025; 0.125; 0.25; 1.25} (see Table S2).

Figure S3: Comparison of the results obtained with Migraine (black) and MsVar (gray) for the analyses of the 20 contraction data sets from Girod et al. (2011) with θ = 0.4, θanc = 400.0 and D = {0.025; 0.125; 0.25; 1.25} (see Table S2).
C. LRT-Pvalue cumulative distributions for all scenarios considered in the manuscript
Figure S4: case A. ECDF of P-values, four panels: 2Nmu = 0.4 (KS: 0.345; rel. bias, rel. RMSE: 0.00656, 0.314); D = 1.25 (KS: 0.732; rel. bias, rel. RMSE: 0.00691, 0.161); 2Nancmu = 40 (KS: 0.837; rel. bias, rel. RMSE: 0.00545, 0.301); Nratio = 0.01 (KS: 0.814; DR: 0.986 (0); rel. bias, rel. RMSE: 0.058, 0.39).
Figure S5: case B. ECDF of P-values, four panels: 2Nmu = 0.4 (KS: 0.623; rel. bias, rel. RMSE: 0.0152, 0.232); D = 1.25 (KS: 0.686; rel. bias, rel. RMSE: 0.00161, 0.118); 2Nancmu = 40 (KS: 0.505; rel. bias, rel. RMSE: −0.00715, 0.216); Nratio = 0.01 (KS: 0.781; DR: 0.982 (0); rel. bias, rel. RMSE: 0.0531, 0.282).
Figure S24: case 1. ECDF of P-values, four panels: 2Nmu = 0.4 (KS: 8.82e−06; rel. bias, rel. RMSE: 9.94, 20); D = 0.025 (KS: 0.286; rel. bias, rel. RMSE: 2.8, 12.4); 2Nancmu = 40 (KS: 0.109; rel. bias, rel. RMSE: 0.102, 0.427); Nratio = 0.01 (KS: 0.225; DR: 0.764 (0); rel. bias, rel. RMSE: 6.11, 13.7).
Figure S25: case 2. ECDF of P-values, four panels: 2Nmu = 0.4 (KS: 0.0103; rel. bias, rel. RMSE: 7.35, 13.2); D = 0.0625 (KS: 3.63e−07; rel. bias, rel. RMSE: 0.53, 1.52); 2Nancmu = 40 (KS: 0.0351; rel. bias, rel. RMSE: 0.082, 0.287); Nratio = 0.01 (KS: 3.23e−09; DR: 0.976 (0); rel. bias, rel. RMSE: 5.35, 10.6).
Figure S56: case [0.4, 0.25, 400]. Cumulative distributions of LRT-Pvalues for a recent and very strong contraction scenario, with θ = 0.4, D = 0.25 and θanc = 400.0, with (a) nH = 2,000, (b) nH = 20,000 and (c) nH = 200,000.
References
Cornuet JM, Beaumont MA, 2007. A note on the accuracy of PAC-likelihood inference with microsatellite data. Theor. Popul. Biol., 71:12–19.
de Iorio M, Griffiths RC, 2004a. Importance sampling on coalescent histories. Advances in Applied Probability, 36:417–433.
de Iorio M, Griffiths RC, 2004b. Importance sampling on coalescent histories. II. Subdivided population models. Advances in Applied Probability, 36:434–454.
de Iorio M, Griffiths RC, Leblois R, Rousset F, 2005. Stepwise mutation likelihood computation by sequential importance sampling in subdivided population models. Theoretical Population Biology, 68:41–53.
Girod C, Vitalis R, Leblois R, Fréville H, 2011. Inferring population decline and expansion from microsatellite data: a simulation-based evaluation of the Msvar method. Genetics, 188:165–179.
Jasra A, De Iorio M, Chadeau-Hyam M, 2011. The time machine: a simulation approach for stochastic trees. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Science, 467:2350–2368.
Laval G, Excoffier L, 2004. SIMCOAL 2.0: a program to simulate genomic diversity over large recombining regions in a subdivided population with a complex history. Bioinformatics, 20:2485–2487.
Li N, Stephens M, 2003. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics, 165:2213–2233.
Rousset F, 2004. Genetic structure and selection in subdivided populations. Princeton, New Jersey: Princeton Univ. Press.
Rousset F, Leblois R, 2007. Likelihood and approximate likelihood analyses of genetic structure in a linear habitat: performance and robustness to model mis-specification. Mol. Biol. Evol., 24:2730–2745.
Rousset F, Leblois R, 2012. Likelihood-based inferences under a coalescent model of isolation by distance: two-dimensional habitats and confidence intervals. Mol. Biol. Evol., 29:957–973.
Stephens M, Donnelly P, 2000. Inference in molecular population genetics (with discussion). J. R. Stat. Soc., 62:605–655.