
Interpretation of Variation Across Marker Loci as Evidence of Selection


Academic year: 2021


HAL Id: halsde-00342552

https://hal.archives-ouvertes.fr/halsde-00342552

Submitted on 27 Jan 2021


Interpretation of Variation Across Marker Loci as Evidence of Selection

Renaud Vitalis, Kevin Dawson, Pierre Boursot

To cite this version:

Renaud Vitalis, Kevin Dawson, Pierre Boursot. Interpretation of Variation Across Marker Loci as Evidence of Selection. Genetics, Genetics Society of America, 2001, 158 (4), pp.1811-1823. ⟨halsde-00342552⟩

Renaud Vitalis*,†,‡, Kevin Dawson*,§ and Pierre Boursot*

* Laboratoire Génome, Populations et Interactions, C.C. 063, Université Montpellier II, 34095 Montpellier Cedex 05, France

† Laboratoire Génétique et Environnement, C.C. 065, Institut des Sciences de l'Évolution de Montpellier, Université Montpellier II, 34095 Montpellier Cedex 05, France

‡ Station Biologique de la Tour du Valat, Le Sambuc, 13200 Arles, France

§ I.A.C.R. Long Ashton Research Station, Department of Agricultural Science, University of Bristol, Bristol BS41 9AF, United Kingdom

Evidence of selection from marker loci

Keywords: Population divergence, genetic drift, coalescent theory, neutrality tests, IBD probabilities

Corresponding author: Renaud VITALIS
Laboratoire Génétique et Environnement, C.C. 065
Institut des Sciences de l'Évolution de Montpellier
Université Montpellier II
Place Eugène Bataillon, 34095 Montpellier Cedex 05, France
Tel.: +33 4 67 14 32 50
Fax.: +33 4 67 14 36 22

Population structure and history have similar effects on the genetic diversity at all neutral loci. However, some marker loci may also have been strongly influenced by natural selection. Selection shapes genetic diversity in a locus-specific manner. If we could identify those loci that have responded to selection during the divergence of populations, then we may obtain better estimates of the parameters of population history, by excluding these loci. Previous attempts have been made to identify outlier loci from the distribution of sample statistics under neutral models of population structure and history. Unfortunately these methods depend on assumptions about population structure and history, and these are usually unknown. In this paper, we define new population-specific parameters of population divergence, and construct sample statistics which are estimators of these parameters. We then use the joint distribution of these estimators to identify outlier loci that may be subject to selection. We found that outlier loci are easier to recognize when this joint distribution is conditioned on the total number of allelic states in the pooled sample, at each locus. This is because the conditional distribution is less sensitive to the values of nuisance parameters.

Presumed neutral polymorphic loci are commonly used in making inferences about patterns of differentiation within or among populations of the same or closely related species. For this purpose, genetic distances (see, e.g., Nei, 1972) or Wright's (1951) F-statistics are estimated from allele-frequency data. Under particular models of population structure, these parameters are related to demographic or historical parameters, such as the effective population size, the rate of migration between populations or the time since the populations diverged from their common ancestral population. However, misinterpretations can occur if one is not able to clearly distinguish between the patterns generated by random genetic drift and those generated by natural selection. The problem is that selective processes can also affect neutral loci. A locus which is neutral will respond to selection whenever it is in linkage disequilibrium (statistical association among allelic states at different loci) with other loci which are subject to selection. Such associations may arise by chance in small populations (Hill and Robertson, 1966, 1968; Ohta and Kimura, 1969). For example, stabilizing or balancing selection operating at a locus tends to maintain an elevated level of variation at closely linked neutral loci (Hudson and Kaplan, 1988; Strobeck, 1983). Selection acting on any locus has an effect on loosely linked loci, which resembles a reduction of effective population size (Barton, 1995, 1998; Robertson, 1961). Local adaptation tends to increase population differentiation at loci where selection acts, and very high $F_{ST}$ values may be found at closely linked neutral loci (Charlesworth et al., 1997). The substitution of advantageous mutations at a locus may also reduce neutral variation at linked loci

background selection, caused by the selection against deleterious mutations (Charlesworth et al., 1993), results in a reduced effective population size for neutral genes in the region of the chromosome where this selection is acting. Background selection may also increase the apparent population differentiation (Charlesworth et al., 1997).

Therefore, it is of prime interest to identify loci that are responding to selection in order to exclude them from the genetic analysis of population structure or history. It was recognized early on by Cavalli-Sforza (1966) that any form of selection will affect some regions of the genome more than others, whereas population history, demography, migration and the mating system will affect the whole genome in the same way. Accordingly, Lewontin and Krakauer (1973) proposed two tests of selective neutrality. Both tests are based on the sampling distribution of a statistic $\hat{F}$, the standardized variance of gene frequency, which is an estimator of the parameter $F_{ST}$. Their first test is a goodness of fit test comparing the observed distribution of $\hat{F}$ estimates (one estimate from each locus) to a $\chi^2$ distribution with $(n-1)$ degrees of freedom, where $n$ is the number of populations sampled. The second test is based on the comparison of the observed variance of $\hat{F}$ (across loci), noted $s_F^2$, with the theoretical variance approximated as

$$\sigma^2 = \frac{k\bar{F}^2}{n-1} \qquad (1)$$

where $\bar{F}$ is the mean value of $\hat{F}$ averaged across loci, and $k$ is a constant which, according to Lewontin and Krakauer (1973), should not exceed 2 whatever the underlying distribution of allelic frequency. The ratio $s_F^2/\sigma^2$ should be distributed approximately as a $\chi^2/d.f.$, the number of degrees of freedom $d.f.$ being determined by the number of bi-allelic loci.
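As a rough illustration of the second test, the variance ratio can be computed directly from a list of per-locus estimates. The sketch below assumes that the per-locus $\hat{F}$ values are already available; the function name `lewontin_krakauer_ratio` and the numerical values are hypothetical, and $k = 2$ is simply the upper bound quoted above.

```python
from statistics import mean, variance

def lewontin_krakauer_ratio(f_hats, n_pops, k=2.0):
    """Return (s2_F, sigma2, ratio) for a list of per-locus F estimates.

    s2_F   : observed sample variance of the estimates across loci
    sigma2 : theoretical variance from equation (1), k * Fbar^2 / (n - 1)
    ratio  : s2_F / sigma2, to be compared with a chi-square / d.f.
    """
    f_bar = mean(f_hats)
    s2_f = variance(f_hats)                  # unbiased variance across loci
    sigma2 = k * f_bar ** 2 / (n_pops - 1)   # equation (1)
    return s2_f, sigma2, s2_f / sigma2

# Made-up estimates for five loci scored in five populations.
f_hats = [0.02, 0.05, 0.04, 0.10, 0.03]
s2_f, sigma2, ratio = lewontin_krakauer_ratio(f_hats, n_pops=5)
print(round(s2_f, 6), round(sigma2, 6), round(ratio, 3))
```

A ratio well above one would point to more variation across loci than equation (1) allows, subject to the caveats on the test discussed in the text.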

However, since populations of the same species share, to a certain extent, a common history and since populations are connected through the dispersal of individuals, $\hat{F}$ values will be correlated across loci. For example, the geographic and historical relationships between populations may have a hierarchical structure if populations have derived from a common ancestral population as a sequence of successive splits. This is the pattern expected when the fragmentation of a species range occurred as a sequence of population subdivisions. The effect of such a population history is always to increase the expected variance of $\hat{F}$ (Robertson, 1975a,b). Moreover, even simple models of divergence by drift (Nei and Chakravarti, 1977), island models (Nei et al., 1977), or stepping stone models of dispersal (Nei and Maruyama, 1975) inflate the expected variance, making Lewontin and Krakauer's (1973) test unreliable in most cases (Lewontin and Krakauer, 1975).

More recently, Bowcock et al. (1991) studied the world-wide human genetic differentiation based on DNA polymorphism. Simulating a reasonably well supported evolutionary scenario of divergence, they evaluated the theoretical distribution of $F_{ST}$ conditional on initial gene frequencies. Among 100 nuclear RFLP markers a number of genes exhibited lower or, more often, higher variation than expected under neutrality. In an important paper, Beaumont and Nichols (1996) proposed a method based on the analysis of the expected distribution of $F_{ST}$ conditional on heterozygosity rather than allele frequency. The conditional distribution, built under an island model

models (colonisation, stepping-stone). Interestingly, departures from equilibrium do not alter much the expected distribution, whenever $F_{ST}$ is less than 0.5. Yet, unequal numbers of immigrants per generation over the whole population generated some discrepancies with the symmetric island model for heterozygosities in the range [0.1, 0.5] (see Figure 3d in Beaumont and Nichols, 1996).

Thus, their approach might be flawed whenever the true population history consists of repeated branching events, or when the connectivity of populations is uneven. However, we cannot infer patterns of migration or historical branching, and test for the homogeneity of the markers, with the same data. This is what Felsenstein (1982) described as the "infinitely many parameters" problem. A solution to this problem is to restrict attention to simple but realistic scenarios which may apply to any pair of populations (Robertson, 1975b; Tsakas and Krimbas, 1976). This reduces the number of parameters in the model. Here, we develop a model of population divergence. We define population-specific parameters, as functions of probabilities of identity for pairs of genes taken within or among populations. These parameters are simply related to the ratio of divergence time over effective population size. We construct simple estimators of these population-specific parameters. We then examine the expected joint distribution of these estimators, under a wide range of neutral scenarios of divergence. This suggests a new method to assess the homogeneity of response of genetic markers from empirical data. Finally, we apply our new method to a data set of allozyme loci from Drosophila simulans populations, and compare our results

THE MODEL

We consider two haploid populations of constant sizes $N_1$ and $N_2$, which completely separated $\tau$ generations ago, from a single population of stationary size $N_0$. By complete separation, we mean that the populations did not exchange any migrants between the time of the split and the present. We do not assume that the common ancestral population was at equilibrium when it split. Instead, we allow the ancestral population to have gone through a bottleneck, $\tau_0$ generations before present (with $\tau_0 > \tau$). Before this, the ancestral population was at mutation-drift equilibrium, with constant size $N_e$. Generations do not overlap. New mutations arise at a rate $\mu$, and follow the infinite allele model (IAM). This model of population divergence is illustrated in Figure 1.

[Figure 1 about here.]

Let $Q_{w,i}$ be the probability that two genes sampled at random within population $i$ are identical by descent (IBD) and $Q_a$ the probability that a gene sampled at random from population 1 is IBD to a gene sampled at random from population 2. IBD probabilities are defined as the probabilities that two genes have not mutated since their most recent common ancestor (Malécot, 1975). The probability that a pair of genes are IBD is equal to the probability that these genes are identical in state (IIS), whenever the mutation process follows the IAM.

More generally, let $Q_h$ denote the IBD probability of any pair of genes: $h = (w,i)$ when two genes are sampled within population $i$, or $h = a$ when

for $Q_h$, as a function of the coalescence time (Slatkin, 1991). Under a continuous time approximation (Hudson, 1990)

$$Q_h = \int_0^{\infty} \gamma^{t}\, c_h(t)\, dt \qquad (2)$$

where $c_h(t)$ is the probability of coalescence at $t$ for a pair of genes of type $h$, and $\gamma = (1-\mu)^2$. The waiting time for a coalescent event in a population of size $N_i$ has an exponential distribution with mean $N_i$. The IBD probability for a pair of genes in population $i$ reduces to

$$Q_{w,i} = \int_0^{\tau} \frac{\gamma^{t}}{N_i}\, e^{-t/N_i}\, dt + (1 - C_i)\, Q_0 \qquad (3)$$

where $Q_0$ is the IBD probability for two genes sampled at random from the common ancestral population at time $\tau$ (just before the split), and $(1 - C_i) = \gamma^{\tau} e^{-\tau/N_i}$ is the probability that the two genes neither coalesce nor mutate in the $i$th population, in the time-interval $0 < t \leq \tau$. The first term on the right-hand side of equation (3) is the probability that the two genes coalesce in the time-period $0 < t \leq \tau$, and are IBD. Following equation (2), the IBD probability for a pair of genes sampled at random from the common ancestral population just before the split at time $\tau$ is given by

$$Q_0 = \int_{\tau}^{\tau_0} \frac{\gamma^{t-\tau}}{N_0}\, e^{-(t-\tau)/N_0}\, dt + (1 - C_0) \int_{\tau_0}^{\infty} \frac{\gamma^{t-\tau_0}}{N_e}\, e^{-(t-\tau_0)/N_e}\, dt \qquad (4)$$

where $(1 - C_0) = \gamma^{\tau_0-\tau} e^{-(\tau_0-\tau)/N_0}$ is the probability that the two genes neither coalesce nor mutate in the time-interval $\tau < t \leq \tau_0$

occurring during the population bottleneck. During this time-interval ($\tau < t \leq \tau_0$) the waiting time for a coalescent event is exponentially distributed with mean $N_0$. The last term in equation (4) averages over coalescent events occurring in the ancestral population, at mutation-drift equilibrium. This last term represents the IBD probability for two randomly sampled genes in a stationary population of size $N_e$, which is $1/(1+\theta)$, with $\theta = 2N_e\mu$. Solving the integrals in the low-mutation limit (where $\gamma^{t} \approx e^{-2\mu t}$), we find that the solution of equation (3) is

$$Q_{w,i} \approx \frac{1}{\theta_i+1}\left(1 - e^{-T_i(\theta_i+1)}\right) + e^{-T_i(\theta_i+1)}\, Q_0 \qquad (5)$$

where $\theta_i = 2N_i\mu$ and $T_i = \tau/N_i$. The value of $Q_0$ is given by the solution of equation (4)

$$Q_0 \approx \frac{1}{\theta_0+1}\left(1 - e^{-T_0(\theta_0+1)}\right) + e^{-T_0(\theta_0+1)} \left(\frac{1}{\theta+1}\right) \qquad (6)$$

where $\theta_0 = 2N_0\mu$ and $T_0 = (\tau_0-\tau)/N_0$. The probability for a gene in population 1 to be IBD with a gene in population 2 is just given by

$$Q_a = \gamma^{\tau}\, Q_0 \qquad (7)$$

Obviously, two such genes cannot coalesce during the $\tau$ generations between the moment of divergence and the present. They are IBD only if their respective ancestors are IBD when populations 1 and 2 diverge, and furthermore, if they do not undergo mutation during the divergence. Now, it is useful to consider the parameter

$$F_i \equiv \frac{Q_{w,i} - Q_a}{1 - Q_a} \qquad (8)$$
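To make the chain of formulas concrete, the closed-form expressions (5) to (8) can be evaluated numerically. The sketch below is only an illustration: the function name `ibd_probabilities`, its argument names and the parameter values are hypothetical, and it uses the low-mutation approximations of equations (5) and (6) together with $\gamma = (1-\mu)^2$ in equation (7).

```python
import math

def ibd_probabilities(mu, tau, tau0, size_i, size_0, size_e):
    """Evaluate equations (5)-(8) for one daughter population.

    mu      : mutation rate per generation
    tau     : divergence time, tau0 : time of the bottleneck (tau0 > tau)
    size_i  : size of daughter population i
    size_0  : size of the ancestral population during the bottleneck
    size_e  : size of the ancestral population at equilibrium
    """
    theta, theta0, theta_i = 2 * size_e * mu, 2 * size_0 * mu, 2 * size_i * mu
    t0, t_i = (tau0 - tau) / size_0, tau / size_i
    decay0 = math.exp(-t0 * (theta0 + 1))
    q0 = (1 - decay0) / (theta0 + 1) + decay0 / (theta + 1)   # equation (6)
    decay_i = math.exp(-t_i * (theta_i + 1))
    q_w = (1 - decay_i) / (theta_i + 1) + decay_i * q0        # equation (5)
    q_a = ((1 - mu) ** 2) ** tau * q0                         # equation (7)
    f_i = (q_w - q_a) / (1 - q_a)                             # equation (8)
    return q0, q_w, q_a, f_i

# Hypothetical values giving T_i = 0.1, T_0 = 0.1 and theta = 2.
q0, q_w, q_a, f_i = ibd_probabilities(1e-6, 100, 200, 1000, 1000, 1_000_000)
print(q0, q_w, q_a, f_i)
```

With a small mutation rate, the returned value of `f_i` should stay close to $1 - e^{-T_i}$ whatever values are chosen for the ancestral parameters.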

It is worth noting that the weighted sum of $F_i$ over the two populations gives the intraclass correlation for the probability of identity by descent for genes within populations, relative to genes between populations. This is of particular interest, because the properties of the intraclass correlations for the probability of identity in state ("IIS correlations") (Cockerham and Weir, 1987) can be deduced from the properties of the corresponding intraclass IBD correlations, in the low-mutation limit (Rousset, 1996). Indeed, such ratios of identity probabilities of the form of equation (8) give the same low-mutation limit, whether one considers the infinite allele model or other mutation models (Rousset, 1996, 1997).

If we neglect new mutations arising during the divergence process, $Q_a$ reduces to $Q_0$ and $Q_{w,i} = C_i(1 - Q_0) + Q_0$. Thus

$$F_i \approx 1 - e^{-T_i} \qquad (9)$$

Note that equation (9) gives a well known result when both daughter populations are assumed to have the same size $N$, so that $F_1 = F_2 = F \approx 1 - e^{-\tau/N}$ (see, e.g., Reynolds et al., 1983). Hereafter, the parameter $T_i$ will be referred to as the "branch length" of population $i$. An important result is that, in the low-mutation limit, the new parameters $F_1$ and $F_2$ do not depend on the "nuisance parameters" $\theta$ or $T_0$. This suggests that a simple moment-based estimator $\hat{T}_i$ of the branch length is

$$\hat{T}_i = -\ln(1 - \hat{F}_i) \qquad (10)$$

where $\hat{F}_i$ is an estimator of $F_i$ (see Appendix for details).
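Equation (10) is a one-line transformation, and a short sketch shows how it behaves at the edges of its domain. The function name `branch_length` is hypothetical, and returning `nan` when the estimate reaches 1 is a choice made here for illustration, not something taken from the paper.

```python
import math

def branch_length(f_hat):
    """Moment-based branch length of equation (10): T = -ln(1 - F).

    The logarithm is undefined when f_hat >= 1 (an allele fixed in the
    population and absent elsewhere), so nan is returned in that case.
    A negative f_hat yields a negative, hence meaningless, branch length.
    """
    if f_hat >= 1.0:
        return math.nan
    return -math.log(1.0 - f_hat)

# Round trip through equation (9): F = 1 - exp(-T) with T = 0.1.
print(branch_length(1.0 - math.exp(-0.1)))
# A negative estimate of F gives a negative branch length.
print(branch_length(-0.05))
```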

PROPERTIES

Simulation procedure: For each set of parameter values, a sequence of artificial data sets was generated using standard coalescent simulations, as described by, e.g., Hudson (1990). The simulations were performed as follows (see Figure 1 for an illustrated example of one simulated genealogy). For each population, the genealogy of a sample of $n_i$ genes is generated for a period of time ranging from present to $\tau$ generations in the past. During this period, all the coalescent events are separated by exponentially distributed time-intervals, with means $N_1/\binom{n_1}{2}$ in population 1 and $N_2/\binom{n_2}{2}$ in population 2 (see equation 3). At time $\tau$, the number $n_0$ of lineages that remain represents the ancestors of all the genes sampled in populations 1 and 2. The genealogy of these lineages is generated for the time-period $[\tau, \tau_0]$, and all the coalescence events are separated by exponentially distributed time-intervals, with mean $N_0/\binom{n_0}{2}$ (see the first term in the right-hand side of equation 4). At time $\tau_0$, the lineages that remain are the ancestors of all the genes sampled in populations 1 and 2. The genealogy of these $n_e$ genes is generated for the period $[\tau_0, +\infty]$, with all coalescent events separated by exponentially distributed time-intervals with mean $N_e/\binom{n_e}{2}$ (see the second term in the right-hand side of equation 4). Once the complete genealogy is obtained, the mutation events are superimposed on the coalescent tree of lineages. In the results which follow, each artificial data set consisted of two

population 2.
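The three-epoch procedure described above can be sketched in a few lines. This is not the authors' program: instead of building the full genealogy and then superimposing mutations, the sketch lets coalescence and mutation compete as exponential events and drops a lineage as soon as it mutates, which yields the same allelic states under the infinite allele model. The names `run_epoch` and `simulate_locus`, and all parameter values, are hypothetical.

```python
import itertools
import math
import random

def run_epoch(lineages, pop_size, mu, duration, alleles, labels, rng):
    """Follow lineages backward in one population of constant size.

    Each lineage is a list of sample ids. With k lineages, coalescence
    occurs at rate C(k,2)/N and mutation at rate k*mu. A mutated lineage
    gives a new allele to all its descendants and is then dropped.
    """
    t = 0.0
    while lineages:
        k = len(lineages)
        coal_rate = k * (k - 1) / 2.0 / pop_size
        mut_rate = k * mu
        t += rng.expovariate(coal_rate + mut_rate)
        if t >= duration:
            break
        if rng.random() < coal_rate / (coal_rate + mut_rate):
            a, b = rng.sample(range(k), 2)        # two distinct lineages merge
            lineages[a] = lineages[a] + lineages[b]
            lineages.pop(b)
        else:
            hit = lineages.pop(rng.randrange(k))  # mutation on one lineage
            new_allele = next(labels)
            for sample in hit:
                alleles[sample] = new_allele
    return lineages

def simulate_locus(n1, n2, N1, N2, N0, Ne, tau, tau0, mu, rng):
    """Return the allelic states of the two samples at one locus."""
    if mu <= 0 or tau0 <= tau:
        raise ValueError("this sketch needs mu > 0 and tau0 > tau")
    alleles, labels = {}, itertools.count()
    pop1 = run_epoch([[i] for i in range(n1)], N1, mu, tau, alleles, labels, rng)
    pop2 = run_epoch([[n1 + i] for i in range(n2)], N2, mu, tau, alleles, labels, rng)
    merged = run_epoch(pop1 + pop2, N0, mu, tau0 - tau, alleles, labels, rng)
    run_epoch(merged, Ne, mu, math.inf, alleles, labels, rng)  # equilibrium phase
    return ([alleles[i] for i in range(n1)],
            [alleles[n1 + i] for i in range(n2)])

rng = random.Random(1)
sample1, sample2 = simulate_locus(10, 12, 1000, 1000, 1000, 10000, 100, 200, 1e-4, rng)
print(sample1, sample2)
```

Repeating the call with a fresh locus each time gives one artificial data set; the integer labels carry no meaning beyond distinguishing allelic states.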

Simulation results: By calculating the estimators $\hat{F}_1$ and $\hat{F}_2$ for each of these artificial data sets, it was possible to obtain a close approximation to the expected distribution of these estimators (see Appendix for details). Figure 2 shows this expected joint distribution of $\hat{F}_1$ and $\hat{F}_2$, for various combinations of the nuisance parameters $\theta$ and $T_0$. In this case, the true branch lengths were $T_1 = T_2 = 0.1$ (hence $F_1 = F_2 \approx 0.0953$). The expected value of the estimator $\hat{F}_1$ (resp. $\hat{F}_2$) was always close to the value of the parameter $F_1$ (resp. $F_2$). One can show that, by construction, the points $(\hat{F}_1, \hat{F}_2)$ lie within the upper-right triangle with vertices (1,1), (-1,1) and (1,-1). The joint distribution of these two statistics has a negative correlation. Most importantly, it is clear from this figure that the joint distribution of $\hat{F}_1$ and $\hat{F}_2$ depends strongly on the nuisance parameters, even though their expectations remain close to the true values of $F_1$ and $F_2$.

[Figure 2 about here.]
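The estimators themselves are defined in the Appendix, which is not reproduced here. As a stand-in, the sketch below plugs simple sample frequencies of identity in state into the form of equation (8); this is an assumption made for illustration and may differ from the authors' estimators. All function names are hypothetical.

```python
from collections import Counter

def identity_within(sample):
    """Proportion of pairs of distinct genes in a sample that are identical in state."""
    n = len(sample)
    counts = Counter(sample)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

def identity_among(sample1, sample2):
    """Proportion of pairs (one gene per sample) that are identical in state."""
    c1, c2 = Counter(sample1), Counter(sample2)
    shared = sum(c1[a] * c2[a] for a in c1)   # Counter gives 0 for absent alleles
    return shared / (len(sample1) * len(sample2))

def f_estimates(sample1, sample2):
    """Plug-in version of equation (8) for each of the two populations."""
    q_a = identity_among(sample1, sample2)
    return tuple((identity_within(s) - q_a) / (1 - q_a) for s in (sample1, sample2))

def allelic_states(sample1, sample2):
    """Number k of distinct allelic states in the pooled sample."""
    return len(set(sample1) | set(sample2))

s1 = ["A", "A", "A", "B"]
s2 = ["A", "B", "B", "C"]
f1, f2 = f_estimates(s1, s2)
print(f1, f2, allelic_states(s1, s2))
```

In this toy example the second estimate is negative because the genes of the second sample are less often identical to each other than to genes of the first sample.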

It can be seen that, for smaller values of $T_0$, the joint distribution becomes tighter as $\theta$ increases. On the other hand, for larger values of $\theta$, the distribution is found to widen as $T_0$ increases. In both cases, it is the level of variation that remains before divergence which is crucial in shaping the joint distribution. With small $\theta$ and large $T_0$, the lineages coalesce rapidly before the divergence, and the number of distinct mutations (allelic states) that can be maintained is small. In this case, the variance of the estimates of population branch lengths is large, as illustrated by the wide joint distribution of $\hat{F}_1$ and $\hat{F}_2$. Therefore, the joint distribution of $\hat{F}_1$ and $\hat{F}_2$ is not ideal for investigating the homogeneity of results of a set of molecular markers. Indeed, other factors such as heterogeneous mutation rates across loci may be invoked to explain disparities of branch length estimates among markers. Fortunately, this problem can be overcome by considering the joint distribution of $\hat{F}_1$ and $\hat{F}_2$, conditional upon the total number $k$ of allelic states in the pooled sample at each locus. Figure 3 shows the estimated joint distribution for $T_1 = T_2 = 0.1$ (hence $F_1 = F_2 \approx 0.0953$), conditioned on $k = 4$. The combinations of nuisance parameter values are the same as in Figure 2.

[Figure 3 about here.]

The expected joint conditional distribution appears to be almost independent of the nuisance parameters. So, given the observed values for the parameters $F_1$ and $F_2$, and given the number of alleles in the sample, one can obtain the conditional joint distribution, and then a high probability region that should contain 95% of the observed pairwise $\hat{F}_i$ values. This result provides the justification for using the conditional distributions to analyze the homogeneity in the patterns of genetic differentiation revealed by a (large) set of markers.

APPLICATIONS

In this section, we present a methodology for identifying outlier loci by a pairwise analysis of populations. For each pair of populations $(i, j)$, we suggest the following protocol:

1. For all loci, the statistics $\hat{F}_i$ and $\hat{F}_j$

weighted by the heterozygosities $(1 - \hat{Q}_i)$ and $(1 - \hat{Q}_j)$, respectively (see Appendix). This corresponds to the weighting of loci suggested by Weir and Cockerham (1984) for the multilocus estimator of $F_{ST}$.

3. The expected joint distribution of $\hat{F}_i$ and $\hat{F}_j$ is generated by performing 10000 coalescent simulations for a given set of nuisance parameter values. This is repeated using a wide range of values for the nuisance parameters. In the Drosophila simulans data set discussed below, all the pairwise combinations for $\theta$ and $T_0$ were performed, with $\theta = 1$, 5 or 10, and $T_0 = 0.01$, 0.1 or 1. Thus, a total of 90000 coalescent simulations were performed in this example. The simulated sample sizes are chosen to be representative of those actually realized in the real data set.

4. For each expected joint distribution of $\hat{F}_i$ and $\hat{F}_j$, we construct all the distributions, conditional on the number of allelic states $k$ in the pooled sample, for $k = 2, 3, \ldots$ (The pooled sample is the sample obtained by pooling the samples from populations $i$ and $j$). Remember, there is one expected distribution for each set of nuisance parameter values. For each conditional distribution, we identify the "high probability" or high density region, in the range of the points $\hat{F}_i$ and $\hat{F}_j$, where 95% of the data is expected to lie (see Appendix for the construction of this high probability region).

5. For each value of the number of allelic states in the pooled sample, we superimpose a scatter plot of the observed data points (pairs of $\hat{F}_1$ and $\hat{F}_2$ values) over an outline of the 95% high probability region, in order to identify outlier loci.
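Steps 4 and 5 of the protocol can be sketched as follows. The paper's own construction of the high probability region is given in its Appendix and is not reproduced here; the grid-based approximation below (keep the densest cells until they hold 95% of the simulated points) is a swapped-in technique used only for illustration. The function names, the grid resolution and the example points are hypothetical, and the simulated points are assumed to have already been restricted to a single value of $k$.

```python
from collections import Counter

def high_density_cells(points, coverage=0.95, bins=20, lo=-1.0, hi=1.0):
    """Approximate a high density region on a regular grid over [lo, hi]^2.

    Returns the set of grid cells retained and the function mapping a
    point (f_i, f_j) to its cell, so that observed loci can be classified.
    """
    width = (hi - lo) / bins

    def cell(point):
        return tuple(min(bins - 1, max(0, int((x - lo) / width))) for x in point)

    counts = Counter(cell(p) for p in points)
    region, mass = set(), 0
    for c, n in counts.most_common():      # densest cells first
        if mass >= coverage * len(points):
            break
        region.add(c)
        mass += n
    return region, cell

def is_outlier(point, region, cell):
    """An observed locus is flagged when its cell lies outside the region."""
    return cell(point) not in region

# Toy "simulated" cloud for one value of k: 96 points in one cell, 4 elsewhere.
simulated = [(0.15, 0.15)] * 96 + [(0.95, -0.55)] * 4
region, cell = high_density_cells(simulated)
print(is_outlier((0.12, 0.18), region, cell), is_outlier((0.95, -0.55), region, cell))
```

In practice one region would be built per value of $k$ and per set of nuisance parameter values, and a locus would be compared with the region matching its own number of allelic states.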

simulans data set, described in Singh et al. (1987) and Choudhary et al. (1992). The raw data set was kindly provided by R. S. Singh and R. A. Morton. Among 111 allozyme loci, 43 were found to be polymorphic in the 5 populations studied in Europe and Africa. The samples consisted of isofemale lines maintained in the laboratory. The haploid sample sizes ranged from $n = 26$ to $n = 55$. Figure 4 shows the analysis performed on a particular pair of populations (France and Tunisia). The multilocus estimates of the parameters $F_1$ (French population) and $F_2$ (Tunisian population) were 0.0064 and 0.0617, respectively. The expected distributions with these averaged values, conditioned on the number of alleles in the pooled sample, are plotted with the actual monolocus pairwise $(\hat{F}_1, \hat{F}_2)$ estimates.

[Figure 4 about here.]

In the great majority of cases, the points fall within the 95% confidence region. With 43 loci we would expect two ($0.05 \times 43 \approx 2$) to lie outside the region by chance. But considering the joint distributions for loci with 3 or more alleles, we found 4 loci that clearly lie outside. Caution is required in the case of loci which lie on the borders of the possible range (Figure 4B). These correspond to loci that have an allele fixed in one population. Slight variations in the nuisance parameters can increase or decrease the relative proportion of loci that may fix one allele in a population. Indeed, we found some conditions under which the 95% envelope contained these two loci. This problem can remain even when we condition on the observed number of alleles. On the other hand, two other loci (coding for Glutamate Pyruvate Transaminase and Carbonic Anhydrase-3) are clear outliers of the expected

the French population, these two loci fell either outside, or on the edges of the 95% high probability region.

[Figure 5 about here.]

In all the pairs which included the population from Congo, two loci coding respectively for the Larval Protein-10 (Pt-10) and the Phosphoglucomutase (PGM) were found to lie outside or on the limit of the 95% high probability region (Figure 5). The locus coding for the Larval Protein-10 systematically gives a longer estimated branch length for this African population than do all other loci, while it gives similar branch lengths to other loci for the other populations. This suggests that genetic variation has been severely reduced by a factor other than genetic drift in this African population. The locus coding for Phosphoglucomutase gives a longer branch length estimate than the other loci in three cases (Figures 5A-C), and a shorter one in one case (Figure 5D). The locus coding for Phosphoglucomutase was also found to lie outside the limit of the 95% high probability region, in all the pairs which included the population from Seychelle Island (Figure 6). In order to strengthen our presumption that these loci were outside the limit allowed by a neutral model, we checked whether these loci also lie outside the limit of the 99% high probability region. The same results were obtained. For these loci, we did not find any plausible neutral scenario of divergence by drift which could provide such a scatter of points. We thus conclude that natural selection may have acted on these loci, or on closely linked regions within the genome.

Pyruvate Transaminase and Carbonic Anhydrase-3 have been or are subject to selection. These loci are clear outliers in some pairwise comparisons involving the French population, but only fall in the limits of the confidence region in other comparisons. Moreover, when considering 99% confidence regions instead of 95% confidence regions, some loci were no longer detected as outliers, but rather as lying on the edges of the confidence limit. The locus coding for isocitrate dehydrogenase-1 was found to be an outlier in three (out of four) pairs which included the population from Seychelle Island. Overall, six more loci were detected as outliers, in single pairwise comparisons. Therefore, we should be very careful in considering those latter loci as being under selection. Indeed, if a locus has responded to selection in one particular contemporary population since it became isolated, then we expect this locus to show up as an outlier in all (or most) comparisons involving this population. This pattern is exactly what we found for the two loci coding for Larval Protein-10 and Phosphoglucomutase in the Congo and Seychelle Island populations.

Evaluating the robustness of this method to the assumptions of the model: In the dataset discussed above, it is likely that the populations of D. simulans have exchanged migrants after divergence. More generally, one can wonder whether complete isolation and divergence by random drift accurately describes natural situations. An alternative approach would be to develop a new model of population divergence, that allows subsequent migration after separation. But if we want to make inferences about a more realistic (and hence a more complex) model of divergence, then we need to

(i) recent separation followed by very little migration or (ii) ancient separation followed by a moderate amount of migration. This is a difficult task, that would require more powerful methods for inferring parameter values (e.g., maximum likelihood; see Nielsen and Slatkin, 2000) that would be much more time consuming. Further note that Nielsen and Slatkin (2000) assume that the mutation rate is zero.

So, we are interested in testing if our method (which assumes evolution in complete isolation after divergence) is undermined when applied to pairs of populations that still exchange genes after divergence. It should be borne in mind that gene flow, like genetic drift, affects the whole genome in the same way. We generated artificial datasets under neutral models of population divergence, including high mutation rates and moderate levels of migration between populations. We used a modified version of the algorithm described by Hudson (1990), that accounts for symmetric migration between populations. Considering populations 1 and 2 altogether, the time-intervals between successive events (coalescence and migration) are exponentially distributed with mean

$$\frac{N_1 N_2}{N_2\binom{n_1}{2} + N_1\binom{n_2}{2} + m(n_1+n_2)N_1N_2}$$

where $m$ is the backward migration rate (Nordborg, 2001). Conditionally on the occurrence of one event, two genes coalesce in population 1 (resp. population 2) with probability

$$\frac{N_2\binom{n_1}{2}}{N_2\binom{n_1}{2} + N_1\binom{n_2}{2} + m(n_1+n_2)N_1N_2} \quad \left(\text{resp. } \frac{N_1\binom{n_2}{2}}{N_2\binom{n_1}{2} + N_1\binom{n_2}{2} + m(n_1+n_2)N_1N_2}\right)$$

or one gene migrates from population 2 to population 1 (resp. from population 1 to population 2) with probability

$$\frac{m\,n_1 N_1 N_2}{N_2\binom{n_1}{2} + N_1\binom{n_2}{2} + m(n_1+n_2)N_1N_2} \quad \left(\text{resp. } \frac{m\,n_2 N_1 N_2}{N_2\binom{n_1}{2} + N_1\binom{n_2}{2} + m(n_1+n_2)N_1N_2}\right),$$

so that the four probabilities sum to one.
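The competing events of the two-population model with migration can be tabulated with a small helper. This is a sketch only: the function name `event_probabilities` and the parameter values are hypothetical, and the migration terms are written with the factor $N_1 N_2$ in the numerator, which is the normalization under which the four probabilities sum to one (an assumption about the intended formula).

```python
from math import comb

def event_probabilities(n1, n2, N1, N2, m):
    """Mean waiting time to the next event and probability of each event type.

    n1, n2 : numbers of lineages currently in populations 1 and 2
    N1, N2 : haploid population sizes
    m      : backward migration rate per lineage per generation
    """
    coal1 = N2 * comb(n1, 2)          # coalescence in population 1
    coal2 = N1 * comb(n2, 2)          # coalescence in population 2
    mig1 = m * n1 * N1 * N2           # a lineage of population 1 migrates
    mig2 = m * n2 * N1 * N2           # a lineage of population 2 migrates
    denom = coal1 + coal2 + mig1 + mig2
    probs = {
        "coalescence_1": coal1 / denom,
        "coalescence_2": coal2 / denom,
        "migration_1": mig1 / denom,
        "migration_2": mig2 / denom,
    }
    return N1 * N2 / denom, probs

mean_wait, probs = event_probabilities(n1=50, n2=50, N1=1000, N2=1000, m=0.001)
print(mean_wait, probs)
```

With these made-up values, coalescence dominates at the start of the genealogy because the number of pairs of lineages is large; migration events become relatively more frequent as lineages are lost.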

For each set of parameters, we generated 20 datasets composed of two samples ($n_1 = n_2 = 50$) of 50 loci each. The parameter values are given in Table 1. For each dataset, we applied our method as described above. We generated joint distributions, conditional on the number of alleles, according to the actual numbers of alleles in each sample. For all sets of parameters, we grouped loci with 8 alleles and more in a single class. The number of joint conditional distributions generated per artificial dataset (i.e., the number of classes for different numbers of alleles) ranged from 3 to 7. For each dataset, over all the joint conditional distributions taken together, we expected to detect $0.05 \times 50 = 2.5$ outlier loci, just by chance. We performed Wilcoxon's signed-rank tests (see, e.g., Mendenhall et al., 1990) to determine if the distribution of the number of detected outlier loci was shifted to the right of 2.5 (one-tailed test).

[Table 1 about here.]

Table 1 shows the total observed number of outlier loci (mean and median over 20 independent simulated datasets) detected for a range of nuisance parameter values (low and high mutation rates, short or long divergence by random drift, with or without migration). In no case could we reject the null hypothesis that the number of detected outlier loci was equal to 2.5 (against the alternative hypothesis that the number of detected outliers was greater than 2.5). Thus, our approach is conservative in the sense that the 95% confidence region contains at least 95% of the loci generated by a truly neutral model. At the level of 5% we do not (falsely) detect outlier loci in a sample of neutral markers (type I error).

applied Beaumont and Nichols's (1996) procedure to the D. simulans dataset. Based on a preliminary examination of the data, 3 loci (coding for α-Fucosidase, Dipeptidase-1 and Mannose Phosphatase Isomerase) were found to lie outside the 95% confidence region of the conditional joint distribution of $\hat{F}_{ST}$ and mean heterozygosity. The percentiles were determined as described in Beaumont and Nichols (1996). Surprisingly, none of these 3 loci were detected as outliers using our method. There may be several reasons for this. We suspect that, in the present case, the inclusion of a very distant insular population (Seychelle Island) may bias their analysis. Indeed, populations heterogeneous with respect to their demographic parameters (effective population sizes and migration rates) have been shown to strongly affect their method (Beaumont and Nichols, 1996). Isolation (low migration rates), together with population bottlenecks, can introduce a further bias. Consider as an extreme case the fixation of a private allele at some locus in one population. This may be unexpected for a polymorphic locus in a mutation-migration-drift equilibrium model, unless there is a strong asymmetry, with some populations being smaller and receiving fewer immigrants than others. However, this is not unexpected for a model of separation and isolation, where there have been population bottlenecks. This may boost the $F_{ST}$ estimate at some locus, and thus exclude it from the 95% high probability region. So, isolated populations should probably be excluded from Beaumont and Nichols's (1996) analysis.

Moreover, in general, the loci which were outliers in our analysis gave small values of (global) $F_{ST}$

$F_{ST}$ analysis is likely to detect outlier loci which exhibit unusually large $F_{ST}$ values. However, a process which would cause an apparent decrease of genetic variation at one locus in a single local population, without leading to a decrease of the variation over all populations, would not be detected by Beaumont and Nichols's (1996) procedure. In other words, if selection acts on one locus at a local scale, pairwise comparisons of populations are more likely to be efficient for detecting outlier loci.

DISCUSSION

Using population-specific estimators of branch lengths: Conventional pairwise genetic distances or pairwise measures of population differentiation are based on the assumption that the sizes of populations are equal and constant through time or that dispersal, if any, is symmetric. For example, the pairwise $F_{ST}$ parameter is defined as a ratio of identity probabilities within and among populations. But the within-population term is taken as an average over the pair of populations. Thus, the definition of the parameter implicitly assumes that both populations share the same demographic parameters. Weir and Cockerham's (1984) estimator $\hat{\theta}$ of $F_{ST}$ is constructed to have low bias and variance, assuming that the populations are independent replicates of the same stochastic process. This means that populations are supposed to have the same size, and that they do not exchange migrants. Without these assumptions, $\hat{\theta}$ would be a complex function of unequal (within-population) identity probabilities.

In contrast, the $F_i$ [...] the two populations have separated, they remain completely isolated. From the estimation of the $F_i$'s for a pair of populations, we can infer the branch lengths. The ratio of these branch length estimates is inversely proportional to the ratio of effective population sizes. Thus, these estimates may be seen as measures of the intensity of genetic drift that has occurred since population divergence. The main drawback to this approach is that when estimates of IIS probabilities are smaller within populations than among (i.e., $\hat{Q}_{w,i} < \hat{Q}_a$), $\hat{F}_i$ becomes negative, and the moment-based estimator of branch length fails. Although this can arise just by chance for some loci, averaging $\hat{Q}$ estimates over loci reduces the problem.
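The link between $F_i$ and branch length can be illustrated with a minimal sketch. It assumes the pure-drift relationship $F_i = 1 - (1 - 1/N_i)^{\tau} \approx 1 - e^{-T_i}$, with $T_i = \tau/N_i$; this relationship is not restated in this section, but it is consistent with the value $F_i \approx 0.0953$ quoted in the legend of Figure 2 for $\tau = 50$ and $N_i = 500$. The function names are hypothetical, and returning `None` for a negative estimate is only one possible way to flag the failure described above.

```python
import math


def expected_f(tau, n_eff):
    """Expected F_i after tau generations of pure drift in a population of size n_eff.

    Assumption: F_i = 1 - (1 - 1/N_i)**tau (no mutation, no migration).
    """
    return 1.0 - (1.0 - 1.0 / n_eff) ** tau


def branch_length(f_hat):
    """Moment-based branch length T_i = -ln(1 - F_i), inverting F = 1 - exp(-T).

    Returns None when f_hat is negative (Q_w,i estimated below Q_a) or >= 1,
    since no meaningful branch length can be derived in those cases.
    """
    if f_hat < 0.0 or f_hat >= 1.0:
        return None
    return -math.log(1.0 - f_hat)


# Two daughter populations that diverged 50 generations ago (illustrative values).
f_1 = expected_f(tau=50, n_eff=500)
f_2 = expected_f(tau=50, n_eff=100)
t_1, t_2 = branch_length(f_1), branch_length(f_2)

# The ratio of branch lengths is close to the inverse ratio of effective sizes.
size_ratio = t_2 / t_1  # close to N_1 / N_2 = 5
```

With these values the smaller population carries the longer branch, which is the sense in which the estimates measure the intensity of drift since divergence.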

Provided that we obtain good estimates of branch lengths for a pair of populations (which requires the pooling of information from many independent loci) we may be able to evaluate the consistency of locus-specific estimates. Indeed, the joint distribution of branch length estimates, conditioned on the number of alleles in the pooled sample, depends only weakly on nuisance parameters of the simple model of divergence by drift. In particular, this conditional distribution is not sensitive to departures from mutation-drift equilibrium before isolation, or to differences in mutation rates.

Detection of selection acting on genetic markers: We saw from the analysis of the D. simulans dataset that the great majority of loci always fall in the confidence region of the conditional pairwise distributions of branch length estimates, while some loci do not. Overall, we identified two loci that were probably subject to selection in the population from Congo. We concluded that the distribution of variability at these loci may be shaped by [...] loci that either lie on the edges, or fall just outside the high probability region of the expected conditional distribution in the French population, although we should be cautious about these latter loci. It is noteworthy that our estimation of the density of the $F_i$ parameters (see Appendix) is discontinuous, because of the discrete nature of the data (the allele counts). This is particularly true when the number of alleles on which the distribution is conditioned is small (for a given set of parameters, the lower the number of allelic states, the more discontinuous the null distribution: see Figure 4). Using discrete distributions is clearly preferable to using some (unnecessary) continuous approximation to them. Moreover, whenever the null distribution is based on the same number of allelic states and the same number of genes as in the sample, there is no tendency for loci to show up as outliers just because of the discrete nature of the distribution (i.e., a locus cannot, by construction, show up between the arc-shaped areas located at the edge of some distributions). Yet, when an apparent outlier lies very close to the 95% high probability region, it is highly advisable to check whether this locus also lies outside the 99% high probability region.

The main criticisms of Lewontin and Krakauer's (1973) attempts to interpret across-loci heterogeneity of $F_{ST}$ values arose from their failure to consider allele frequencies as random variables, whose distribution depends on the underlying model of population structure and history. Indeed, uneven patterns of dispersal among populations (Nei and Maruyama, 1975) or sequences of population splits within the species (Robertson, 1975a,b) may strongly undermine the approach. Lewontin and Krakauer (1975) [...] population structure did not depart too much from the island model. However, conditioning the distribution of $F_{ST}$ on the heterozygosity (Beaumont and Nichols, 1996) or on gene frequency for biallelic loci (Bowcock et al., 1991) has been shown to give surprisingly robust results, in the sense that strong departures from the model assumptions do not alter the distribution very much. The strongest effect on the joint expected distribution of $F_{ST}$ and heterozygosity occurs when populations are heterogeneous with respect to their demographic parameters (Beaumont and Nichols, 1996), for example when populations are founded by very different numbers of individuals, or when populations are arranged in an irregular stepping-stone lattice. However, Beaumont and Nichols (1996) considered a large number $d$ of subpopulations in the metapopulation ($d = 100$) and this parameter strongly influences the expected heterozygosity [$H_e \approx 4Nd\mu/(1 + 4Nd\mu)$, for diploids]. In addition, at a local scale, $F_{ST}$ is only weakly influenced by the total population size $Nd$ (Rousset, 2001). The number of populations has a stronger role than acknowledged by Beaumont and Nichols (1996) in determining whether mutation has an effect on $F_{ST}$ or not. It has been shown that, considering smaller numbers of populations, $F_{ST}$ estimates may be reduced by mutation, especially with a stepwise mutation model (see Flint et al., 1999). With $d = 100$ islands, the sets of parameters used in Beaumont and Nichols (1996) did not account for any case where mutation may depress $F_{ST}$.
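To give a sense of how strongly the number of subpopulations $d$ drives the expected heterozygosity, the expression $H_e \approx 4Nd\mu/(1 + 4Nd\mu)$ can be evaluated for a few values of $d$. The sketch below is only illustrative: the deme size and mutation rate are made-up values, not parameters taken from Beaumont and Nichols (1996), and the function name is hypothetical.

```python
def expected_heterozygosity(n_deme, d, mu):
    """H_e ~ 4*N*d*mu / (1 + 4*N*d*mu) for diploids.

    n_deme: size N of each subpopulation; d: number of subpopulations;
    mu: mutation rate per generation.
    """
    scaled = 4.0 * n_deme * d * mu
    return scaled / (1.0 + scaled)


# Illustrative (assumed) values: N = 100 and mu = 1e-5.
for d in (2, 10, 100):
    print(d, round(expected_heterozygosity(100, d, 1e-5), 4))
```

Under these assumed values, going from a pair of populations to $d = 100$ islands raises the expected heterozygosity by more than an order of magnitude.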

As already suggested by Tsakas and Krimbas (1976), restricting Lewontin and Krakauer's (1973) tests to pairs of populations removes all kinds [...] history, two populations ultimately descend from a single ancestral one in the past. Still, nuisance parameters may broaden the joint distribution of pairwise $F_i$'s (Figure 2). However, conditioning on the number of alleles (Figure 3) also gives distributions that are robust enough to variations in the values of nuisance parameters. It is obvious that for each analysis of a pair of populations, we deliberately discard the information brought by other populations, which may decrease the power of the method (Tsakas and Krimbas, 1976). But we believe that this enables us to explain a wider range of patterns than any symmetrical model, such as the island model. In this respect, our approach is conservative. Moreover, we found that low or moderate gene flow did not undermine our approach, in the sense that we did not falsely detect outlier loci when they were neutral (Table 1). We compared the performance of our method with that of Beaumont and Nichols's (1996) using the empirical data from Singh et al. (1987) and Choudhary et al. (1992). We further tested whether our method would falsely reject neutral loci (type I error) under a wide range of nuisance parameter values (see Table 1). In particular, since the method assumes that the mutations arising after divergence can be neglected, we checked that high mutation rates do not weaken the approach.

We have found that patterns such as those identified in the Tunisia vs Congo dataset as evidence of selection can be produced by neutral models where the coalescent process occurs independently at each locus. Indeed, similar scatters of points could be obtained whenever the parameters $F_1$ and $F_2$ [...] of unlinked neutral loci, some of which having been strongly influenced by selection (recall that the effect of selection resembles a reduction in the effective population size experienced by these loci, as described by Barton, 1995, 1998; Robertson, 1961). So, it is certainly plausible that the patterns which we have identified in the Tunisia vs Congo data set were produced by selection. A thorough investigation of the conditions under which our method fails to identify selected loci (type II error) would be desirable. However, this is not feasible, as the range of models which incorporate selection is very large.

An important task for the future is to consider a more general neutral model of the divergence of two populations, where gene flow may continue after the moment of separation. It is also desirable to extend this approach to more elaborate neutral models, incorporating recombination. More sophisticated estimators of the divergence parameters (branch lengths) would then be required. We assumed that the mutation process follows the IAM and we allowed a wide range of possible mutation rates. In the IAM, genes that are identical in state are also identical by descent. This may not be the case with other mutation models such as the K allele or stepwise mutation processes, which can produce IIS genes that are not IBD (homoplasy). The IAM is probably an adequate model for allozyme data. It is certainly less appropriate for potentially more variable markers, such as microsatellites. Recent studies reveal that the mutation processes of microsatellite markers may be more complex than previously thought and may vary greatly among loci (Estoup and Angers, 1998). Furthermore, the effect of homoplasy on measures of population subdivision is not simple (Rousset, 1996).

[...] method across different classes of nuclear markers that differ in processes of mutation. Clearly, if a whole class of marker loci, which are known to have a very distinct mutation process, are identified as outliers by our analysis, then this class of markers should be interpreted with caution.

If we could identify those marker loci that have responded to selection during the process of divergence, then we may be able to obtain improved estimates of the parameters of population structure and history, by excluding these loci (Ross et al., 1999). Our method differs from previous ones in allowing selection to be detected in particular populations, and in some pairwise comparisons but not others. This opens up the possibility that markers may be discarded only in the analysis of those populations where there is evidence that they have responded to selection. It is also of interest to use this approach to screen the genome for regions that have responded to strong selection in the recent past. If populations have diverged phenotypically and if this has been caused by selection, then it may even be possible to identify candidate regions for the Quantitative Trait Loci (QTL) underlying this adaptive divergence.

ACKNOWLEDGEMENTS

We are very grateful to R.S. Singh and R.A. Morton for providing the Drosophila simulans data set. We thank I. Olivieri for helpful comments on a previous draft of this manuscript and S. Billiard for valuable discussions about the structured coalescent. We are grateful to two anonymous reviewers who constructively commented on and criticized the manuscript. This work [...] the European Communities (DG XII) to P.B., and R.V. was also partially funded by the Fondation Sansouire. This is publication number 2001-XXX of the Institut des Sciences de l'Évolution de Montpellier.

LITERATURE CITED

Barton, N. H., 1995 Linkage and the limits to natural selection. Genetics, 140: 821–841.

Barton, N. H., 1998 The effect of hitch-hiking on neutral genealogies. Genet. Res., 72: 123–133.

Beaumont, M. A., and R. A. Nichols, 1996 Evaluating loci for use in the genetic analysis of population structure. Proc. R. Soc. Lond. B, 263: 1619–1626.

Bowcock, A. M., J. R. Kidd, J. L. Mountain, J. M. Hebert, L. Carotenuto, K. K. Kidd and L. L. Cavalli-Sforza, 1991 Drift, admixture, and selection in human evolution: A study with DNA polymorphisms. Proc. Natl. Acad. Sci. USA, 88: 839–843.

Cavalli-Sforza, L. L., 1966 Population structure and human evolution. Proc. R. Soc. Lond. B, 164: 362–379.

Charlesworth, B., M. T. Morgan and D. Charlesworth, 1993 The effect of deleterious mutations on neutral molecular variation. Genetics, 134: 1289–1303.

Charlesworth, B., M. Nordborg and D. Charlesworth, 1997 The effects of local selection, balanced polymorphism and background selection on equilibrium patterns of genetic diversity in subdivided populations. Genet. Res., 70: 155–174.

Choudhary, M., M. B. Coulthart and R. S. Singh, 1992 A comprehensive study of genic variation in natural populations of Drosophila melanogaster and its sibling species, D. simulans. Genetics, 130: 843–853.

Cockerham, C. C., 1973 Analyses of gene frequencies. Genetics, 74: 697–700.

Cockerham, C. C., and B. S. Weir, 1987 Correlations, descent measures: drift with migration and mutation. Proc. Natl. Acad. Sci. USA, 84: 8512–8514.

Estoup, A., and B. Angers, 1998 Microsatellites and minisatellites for molecular ecology: Theoretical and empirical considerations, pp. 55–86 in Advances in Molecular Ecology, edited by G. R. Carvalho. IOS Press, Amsterdam.

Felsenstein, J., 1982 How can we infer geography and history from gene frequencies? J. Theor. Biol., 96: 9–20.

Flint, J., J. Bond, D. C. Rees, A. J. Boyce, J. M. Roberts-Thomson, L. Excoffier, J. B. Clegg, M. A. Beaumont, R. A. Nichols and R. M. Harding, 1999 Minisatellite mutational processes reduce $F_{ST}$ estimates. Hum. Genet., 105: 567–576.

Hill, W. G., and A. Robertson, 1966 The effect of linkage on limits to artificial selection. Genet. Res., 8: 269–294.

Hill, W. G., and A. Robertson, 1968 Linkage disequilibrium in finite populations. Theor. Appl. Genet., 38: 226–231.

Hudson, R. R., 1990 Gene genealogies and the coalescent process. Oxf. Surv. Evol. Biol., 7: 1–44.

[...] with selection and recombination. Genetics, 120: 831–840.

Kaplan, N. L., R. R. Hudson and C. H. Langley, 1989 The hitchhiking effect revisited. Genetics, 123: 887–899.

Lewontin, R. C., and J. Krakauer, 1973 Distribution of gene frequency as a test of the theory of the selective neutrality of polymorphism. Genetics, 74: 175–195.

Lewontin, R. C., and J. Krakauer, 1975 Testing the heterogeneity of F values. Genetics, 80: 397–398.

Malécot, G., 1975 Heterozygosity and relationship in regularly subdivided populations. Theor. Popul. Biol., 8: 212–241.

Maynard Smith, J., and J. Haigh, 1974 The hitch-hiking effect of a favourable gene. Genet. Res., 23: 23–35.

Mendenhall, W. M., D. D. Wackerly and R. L. Scheaffer, 1990 Mathematical statistics with applications. PWS-KENT Publishing Company, Boston.

Nei, M., 1972 Genetic distance between populations. Am. Nat., 106: 283–292.

Nei, M., and A. Chakravarti, 1977 Drift variance of $F_{ST}$ and $G_{ST}$ statistics obtained from a finite number of isolated populations. Theor. Popul. Biol., 11: 307–325.

[...] $F_{ST}$ in a finite number of incompletely isolated populations. Theor. Popul. Biol., 11: 291–306.

Nei, M., and T. Maruyama, 1975 Lewontin-Krakauer test for neutral genes. Genetics, 80: 395.

Nielsen, R., and M. Slatkin, 2000 Likelihood analysis of ongoing gene flow and historical association. Evolution, 54: 44–50.

Nordborg, M., 2001 Coalescent theory, pp. 179–212 in Handbook of statistical genetics, edited by D. J. Balding, M. Bishop and C. Cannings. John Wiley & Sons, Ltd, Chichester.

Ohta, T., and M. Kimura, 1969 Linkage disequilibrium due to random genetic drift. Genet. Res., 13: 47–55.

Reynolds, J., B. S. Weir and C. C. Cockerham, 1983 Estimation of the coancestry coefficient: Basis for a short-term genetic distance. Genetics, 105: 767–779.

Robertson, A., 1961 Inbreeding in artificial selection programmes. Genet. Res., 2: 189–194.

Robertson, A., 1975a Gene frequency distribution as a test of selective neutrality. Genetics, 81: 775–785.

Robertson, A., 1975b Remarks on the Lewontin-Krakauer test. Genetics, 80: 396.

[...] L. Keller, 1999 Assessing genetic structure with multiple classes of molecular markers: A case study involving the introduced fire ant Solenopsis invicta. Mol. Biol. Evol., 16: 525–543.

Rousset, F., 1996 Equilibrium values of measures of population subdivision for stepwise mutation processes. Genetics, 142: 1357–1362.

Rousset, F., 1997 Genetic differentiation and estimation of gene flow from F-statistics under isolation by distance. Genetics, 145: 1219–1228.

Rousset, F., 2001 Inferences from spatial population genetics, pp. 179–212 in Handbook of statistical genetics, edited by D. J. Balding, M. Bishop and C. Cannings. John Wiley & Sons, Ltd, Chichester.

Singh, R. S., M. Choudhary and J. R. David, 1987 Contrasting patterns of geographic variation in the cosmopolitan sibling species Drosophila melanogaster and D. simulans. Biochem. Genet., 25: 27–40.

Slatkin, M., 1991 Inbreeding coefficients and coalescence times. Genet. Res., 58: 167–175.

Strobeck, C., 1983 Expected linkage disequilibrium for a neutral locus linked to a chromosomal arrangement. Genetics, 103: 545–555.

Strobeck, C., 1987 Average number of nucleotide differences in a sample from a single subpopulation: A test for population subdivision. Genetics, 117: 149–153.

Takahata, N., 1988 The coalescent in two partially isolated diffusion populations. Genet. Res., 52: 213–222.

[...] values: A suggestion and a correction. Genetics, 84: 399–401.

Weir, B. S., and C. C. Cockerham, 1984 Estimating F-statistics for the analysis of population structure. Evolution, 38: 1358–1370.

Wright, S., 1951 The genetical structure of populations. Ann. Eugen., 15: 323–354.

APPENDIX

Parameter estimation: For any given allele $u$, we use the indicator variable $x_{iju}$ to describe the state of the $j$th gene in the $i$th population, with $i = (1, 2)$: $x_{iju} = 1$ if the allelic type is $u$, and $x_{iju} = 0$ otherwise. Let $p_{iu}$ be the frequency of allele $u$ in the $i$th population. Then $p_{iu} = E(x_{iju} \mid p)$, where $E(\cdot \mid p)$ denotes the expectation, conditional on the array $p$ of all the allele frequencies. Considering the second moments of the random variable $x_{iju}$, it follows that $E(x_{iju}^2 \mid p) = p_{iu}$ and, since individuals are sampled independently from the $i$th population, $E(x_{iju} x_{ij'u} \mid p) = p_{iu}^2$ for $j' \neq j$. Then, summing over all alleles gives the probability for two genes in population $i$ to be identical in state (IIS)

$$Q_{w,i} = E\left( \sum_{u=1}^{k} p_{iu}^2 \right) \qquad (A1)$$

where $E$ now denotes the expectation over the distribution of allele frequencies $p$ and $k$ is the number of alleles in the population. The IIS probability for two genes taken respectively in populations 1 and 2 is given by

$$Q_a = E\left[ \sum_{u=1}^{k} p_{1u}\, p_{2u} \right] \qquad (A2)$$

An unbiased estimator of the frequency of allele $u$ among $n_i$ sampled individuals from the $i$th population is simply given by $\hat{p}_{iu} = \sum_{j=1}^{n_i} x_{iju} / n_i$. Expanding the square of this expression, and then taking expectation, gives $E(\hat{p}_{iu}^2 \mid p) = [p_{iu} + (n_i - 1)\, p_{iu}^2] / n_i$. Therefore,

$$\hat{Q}_{w,i} = \sum_{u=1}^{k} \frac{\hat{p}_{iu}\,(n_i \hat{p}_{iu} - 1)}{n_i - 1} \qquad (A3)$$

is an unbiased estimator of the probability for two genes in population $i$ to be identical in state, with $k$ being the number of alleles in the sample. Similarly

$$\hat{Q}_a = \sum_{u=1}^{k} \hat{p}_{1u}\, \hat{p}_{2u} \qquad (A4)$$

is an unbiased estimator of the IIS probability of two genes taken in the ancestral population, before divergence. Approximating the expectation of a ratio by the ratio of expectations, an estimator of $F_i$ is given by

$$\hat{F}_i = \frac{\sum_{u=1}^{k} \left[ \hat{p}_{iu}\,(n_i \hat{p}_{iu} - 1)/(n_i - 1) - \hat{p}_{1u}\, \hat{p}_{2u} \right]}{1 - \sum_{u=1}^{k} \hat{p}_{1u}\, \hat{p}_{2u}} \qquad (A5)$$
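As a minimal sketch of how equations (A3) to (A5) translate into a computation, the functions below work directly on allele counts, using the fact that $\hat{p}_{iu}(n_i \hat{p}_{iu} - 1)/(n_i - 1)$ equals $c(c - 1)/[n_i(n_i - 1)]$ for an allele observed $c$ times. The multilocus form takes the ratio of summed numerators over summed denominators. Function names and the example counts are hypothetical, and the sketch does not guard against a zero denominator (both samples fixed for the same allele at every locus).

```python
def q_within(counts):
    """Unbiased within-population IIS estimate, equation (A3).

    counts: allele counts at one locus in one sample (n_i * p_hat_iu).
    """
    n = sum(counts)
    # p_hat * (n * p_hat - 1) / (n - 1) simplifies to c * (c - 1) / (n * (n - 1)).
    return sum(c * (c - 1) for c in counts) / (n * (n - 1))


def q_among(counts_1, counts_2):
    """Between-population IIS estimate, equation (A4)."""
    n_1, n_2 = sum(counts_1), sum(counts_2)
    return sum(a * b for a, b in zip(counts_1, counts_2)) / (n_1 * n_2)


def f_hat(loci, i):
    """F_i estimate (A5), pooled over loci as a ratio of sums.

    loci: list of (counts_1, counts_2) pairs, alleles listed in the same order.
    i: 1 or 2, the population whose F_i is estimated.
    """
    numerator = denominator = 0.0
    for counts_1, counts_2 in loci:
        q_a = q_among(counts_1, counts_2)
        q_w = q_within(counts_1 if i == 1 else counts_2)
        numerator += q_w - q_a
        denominator += 1.0 - q_a
    return numerator / denominator


# Made-up counts for one biallelic locus, 50 genes sampled per population.
example = [([30, 20], [10, 40])]
f_1, f_2 = f_hat(example, 1), f_hat(example, 2)
```

For these made-up counts the two estimates are about 0.125 and 0.417: the second sample is closer to fixation, so it gets the larger population-specific value.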

When combining the information brought by all alleles at more than one locus, a multilocus estimator is defined as the ratio of the sum of locus-specific numerators over the sum of locus-specific denominators (see, e.g., Weir and Cockerham, 1984). It is worth noting that, when daughter population sizes are equal, this simple way to estimate parameters (i.e., equating $Q$s to $\hat{Q}$s in equation (8) to get $\hat{F}$) directly yields Cockerham's estimators (Cockerham, 1973; Weir and Cockerham, 1984) developed with the methods of analysis of variance (see Rousset, 2001, for a thorough demonstration of the equivalence between estimator formulas based on analyses of variance and expressions in terms of frequency of identical genes). Our estimator differs from previous ones (e.g., Reynolds et al., 1983) in allowing separate parameters $F_i$ [...]

[...] parameter values, coalescent simulations were performed, thus generating artificial data sets. Each artificial data set yields a pair of estimates $\hat{F}_1$ and $\hat{F}_2$. An approximation to the expected joint distribution was obtained as follows. First, a 2-dimensional histogram was constructed. Recall that the points $(\hat{F}_1, \hat{F}_2)$ are constrained to lie within the upper-right triangle of a square with vertices (-1,-1), (1,-1), (-1,1) and (1,1). The whole square region was covered by a 2-dimensional array (or mesh) of 100 × 100 square cells. Each cell thus has sides of length 0.02. Each observation $(\hat{F}_1, \hat{F}_2)$ was binned in the appropriate cell. The cell counts were divided by the total number of observations, to obtain a discrete probability distribution over the 2-dimensional array. This discrete distribution is a close approximation to the expected joint distribution of the estimators $(\hat{F}_1, \hat{F}_2)$. The $q$-level "high probability region" ($q$ = 95%, or any other value) is constructed as follows. The cells are sorted in order of decreasing probability. Then, starting from the cells with the highest associated probabilities, cells are sequentially added to the confidence region, until the cumulative probability of the whole set of cells obtained is equal to (or just exceeds) the chosen $q$-value.

From this procedure, we obtain for each simulation a region within which a proportion $q$ of the data lies. Notice that this confidence region is not necessarily continuous. Constructing the high probability region using the discrete distribution is clearly preferable to using some (unnecessary) continuous approximation to it.
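The construction of the high probability region can be sketched in a few lines. The code below is an illustration of the procedure as described, not the authors' program: the function names are hypothetical, simulated points are assumed to be supplied as pairs $(\hat{F}_1, \hat{F}_2)$, and ties between equally probable cells are broken arbitrarily. Cell counts are compared with $q$ times the total number of points, which is equivalent to accumulating probabilities.

```python
def cell_of(point, n_cells=100, lo=-1.0, hi=1.0):
    """Index of the square cell containing a point of the [lo, hi] x [lo, hi] mesh."""
    width = (hi - lo) / n_cells
    # Clamp so that a coordinate exactly equal to `hi` falls in the last cell.
    ix = min(int((point[0] - lo) / width), n_cells - 1)
    iy = min(int((point[1] - lo) / width), n_cells - 1)
    return ix, iy


def high_probability_region(points, q=0.95, n_cells=100):
    """Set of cells forming the q-level high probability region.

    Cells are sorted by decreasing count and added one by one until the
    cumulative proportion of simulated points reaches (or just exceeds) q.
    """
    counts = {}
    for point in points:
        cell = cell_of(point, n_cells)
        counts[cell] = counts.get(cell, 0) + 1
    region, cumulative = set(), 0
    for cell, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        if cumulative >= q * len(points):
            break
        region.add(cell)
        cumulative += count
    return region


def is_outlier(point, region, n_cells=100):
    """True when an observed pair of estimates falls outside the region."""
    return cell_of(point, n_cells) not in region


# Toy "simulated" points: 90 in one cell, 6 in a second, 4 in a third.
simulated = [(0.11, 0.11)] * 90 + [(0.31, 0.31)] * 6 + [(-0.51, 0.71)] * 4
region = high_probability_region(simulated, q=0.95)
```

In this toy case the region contains the two most populated cells (96% of the points), and a locus falling in the third cell would be flagged. Because the region is a set of cells rather than a contour, it can be disconnected, as noted above.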

Table 1. Results from applications to various divergence scenarios

                             Detected outliers
μ         θ     T_0          mean    median   p value

No migration: m = 0
10^-5     1     1            1.85    2.0      0.98
10^-5     10    10^-2        1.15    1.0      1.00
10^-3     1     1            2.60    3.0      0.28
10^-3     10    10^-2        1.75    2.0      0.76

Low migration: m = 0.01
10^-5     1     1            2.30    2.5      0.79
10^-5     10    10^-2        2.25    2.0      0.77
10^-3     1     1            2.00    2.0      0.99
10^-3     10    10^-2        1.20    1.0      1.00

Moderate migration: m = 0.1
10^-5     1     1            2.30    2.0      0.87
10^-5     10    10^-2        2.05    2.0      0.96
10^-3     1     1            2.25    2.0      0.89
10^-3     10    10^-2        1.85    2.0      0.98

For all sets of parameters, 50 loci were scored among 100 haploid sampled individuals (50 in each population). The mean number of detected outlier loci is given, as well as the median of the distribution of that number. We provide the p value of Wilcoxon's signed-rank tests, performed on the distributions of detected outliers, to determine whether this distribution was shifted to the right of 2.5 (one-tailed test).
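The reference value of 2.5 corresponds to 5% of 50 loci, that is, the number of neutral loci expected to fall outside a 95% region by chance alone. The short sketch below makes this explicit and adds the binomial probability of observing at least a given number of such loci. It assumes that loci are independent and that exactly 5% of the null distribution lies outside the region; since the region is built so that its probability equals or just exceeds $q$, the true rate is at most 5%, so these figures are upper bounds. The function name is hypothetical.

```python
from math import comb


def prob_at_least(k, n_loci=50, alpha=0.05):
    """P(X >= k) for X ~ Binomial(n_loci, alpha).

    X is the number of neutral loci falling outside the (1 - alpha) region,
    assuming independent loci.
    """
    return sum(
        comb(n_loci, j) * alpha ** j * (1.0 - alpha) ** (n_loci - j)
        for j in range(k, n_loci + 1)
    )


expected_outliers = 50 * 0.05      # 2.5 loci expected under neutrality
p_one_or_more = prob_at_least(1)   # chance of at least one false outlier
```

Seeing one or two apparent outliers among 50 loci is therefore unremarkable on its own, which is why the simulations compare the whole distribution of detected outliers with 2.5.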

Figure 1. A gene genealogy under our model, for $n = 10$ genes sampled in each population. In this example, the parameter values are $N_1 = N_2 = 100$, $N_0 = 500$, $N_e = 1000$, $\tau = 50$, $\tau_0 = 150$ and $\mu = 10^{-3}$.

Figure 2. Expected distribution of pairs of $\hat{F}_1$ and $\hat{F}_2$ estimates, for wide ranges of values of the nuisance parameters $\theta = 2N_e\mu$ and $T_0$. $T_i = \tau/N_i$ is 0.10 for both daughter populations (with $\tau = 50$ and $N_1 = N_2 = 500$), giving an expected value $F_i \approx 0.0953$, as indicated by the dotted lines. For all parameter sets, $\mu = 10^{-4}$ and $N_0 = 1000$. One hundred individuals are sampled in each daughter population. The light gray area defines a region in which 95% of the simulated points are expected to lie (see Appendix for details).

Figure 3. Expected distribution of pairs of $\hat{F}_1$ and $\hat{F}_2$ estimates conditioned on a number of alleles in the sample equal to 4. As in Figure 2, wide ranges of values have been used for the nuisance parameters. The dotted lines indicate the expected values for $F_1$ and $F_2$.

Figure 4. $\hat{F}_1$ and $\hat{F}_2$ values estimated from 43 loci in Drosophila simulans for the pairwise comparison of the populations from France ($n = 55$) and Tunisia ($n = 52$). $n$ is the number of isofemale lines typed for each enzymatic system (haploid sample size). Each locus is represented with a black dot. The averaged values are $\hat{F}_1 = 0.0064$ and $\hat{F}_2 = 0.0617$, as indicated by the dotted lines. Thin lines enclose a region in which 95% of the simulated data points are expected to lie. Four distributions are shown, conditioned on the number of allelic states in the whole sample. A. Expected distribution of pairwise $F$ [...] $k = 3$. C. idem with $k = 4$. D. idem with $k = 5$. Black arrows indicate outlier loci. The loci coding for the Glutamate Pyruvate Transaminase (GPT) and Carbonic Anhydrase-3 (Ca-3) are shown respectively in C and D.

Figure 5. $\hat{F}_1$ and $\hat{F}_2$ values estimated from 43 loci in Drosophila simulans for all the pairwise comparisons involving the population from Congo ($n = 45$). A. Expected distribution for the populations from France ($n = 55$) and Congo. B. Tunisia ($n = 52$) vs Congo. C. Congo vs Cape Town, South Africa ($n = 32$). D. Congo vs Seychelle Island ($n = 26$). All distributions are conditioned on $k = 4$. Each locus is represented with a black dot. Dotted lines give the expected values for $\hat{F}_1$ and $\hat{F}_2$. For each expected conditional distribution, black arrows indicate the loci coding for the Larval Protein-10 (Pt-10) and Phosphoglucomutase (PGM).

Figure 6. $\hat{F}_1$ and $\hat{F}_2$ values estimated from 43 loci in Drosophila simulans for all the pairwise comparisons involving the population from Seychelle Island ($n = 26$). A. Expected distribution for the populations from France ($n = 55$) and Seychelle Island. B. Tunisia ($n = 52$) vs Seychelle Island. C. Congo ($n = 45$) vs Seychelle Island. D. Cape Town, South Africa ($n = 32$) vs Seychelle Island. Distributions in A and C are conditioned on $k = 4$ and distributions in B and D are conditioned on $k = 3$. Each locus is represented with a black dot. Dotted lines give the expected values for $\hat{F}_1$ and $\hat{F}_2$. For each expected conditional distribution, black arrows indicate the locus coding for Phosphoglucomutase (PGM).

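[Figure 1: schematic of the divergence model. Labels: Population 1, Population 2; Time axis marked with $\tau$, $\tau_0$ and Present; Effective size axis marked with $N_e$ and $N_0$.]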

[Figure 2: grid of panels for $T_0$ = 0.01, 0.1, 1 and $\theta$ = 1, 5, 10; axes $F_1$ and $F_2$, each running from -1 to 1.]

[Figure 3: grid of panels for $T_0$ = 0.01, 0.1, 1 and $\theta$ = 1, 5, 10; axes $F_1$ and $F_2$, each running from -1 to 1.]

[Figure 4: panels A to D; axes $F_1$ and $F_2$, each running from -1 to 1; labelled loci: Ca-3 and GPT.]

[Figure 5: panels A to D; axes $F_1$ and $F_2$, each running from -1 to 1; labelled loci in each panel: PGM and Pt-10.]

[Figure 6: panels A to D; axes $F_1$ and $F_2$, each running from -1 to 1; labelled locus in each panel: PGM.]
