HAL Id: hal-01604024
https://hal.archives-ouvertes.fr/hal-01604024
Submitted on 5 Jun 2020
HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
Inferring Population Size History from Large Samples of Genome-Wide Molecular Data - An Approximate
Bayesian Computation Approach.
Simon Boitard, Willy Rodríguez, Flora Jay, Stefano Mona, Frederic Austerlitz
To cite this version:
Simon Boitard, Willy Rodríguez, Flora Jay, Stefano Mona, Frederic Austerlitz. Inferring Population Size History from Large Samples of Genome-Wide Molecular Data - An Approximate Bayesian Compu- tation Approach.. conférence Jaques Monod ”Coalescence des approches théoriques et expérimentales en génomique évolutive et biologie des systèmes”, 2016, Roscoff, France. �hal-01604024�
Inferring Population Size History from Large Samples of Genome-Wide Molecular Data -
An Approximate Bayesian Computation Approach
Simon Boitard
1, Willy Rodríguez
2, Flora Jay
3, Stefano Mona
4, Frédéric Austerlitz
5(1) GenPhySE, Castanet-Tolosan, France. (2) IMT, Toulouse, France. (3) LRI, Orsay, France. (4) ISYEB, Paris, France. (5) UMR 7206, Paris, France.
ABC E STIMATION APPROACH
• Model : Kingman’s coalescent with mutation and recombination, piecewise constant population size with 21 fixed time windows.
• Simulated data : 450,000 samples of n haploid genomes, one genome = 100 2Mb-long “contigs”. Population sizes log-uniformly distributed from 10 to 100,000, with correlation between consecu- tive windows.
• Summary statistics : Folded Allele Frequency Spectrum (AFS) and average Linkage Disequilibrium (LD) for 18 bins of physical dis- tance between SNP (from 300bp to 1.4Mbp).
• ABC analysis : 0.5% samples accepted, neuralnet regression (Blum and François, 2010) from abc R package (Csilléry et al, 2012).
P REDICTION ERROR ( RANDOM HISTORIES )
Estimation of demographic history improved by :
(i) combining LD and AFS statistics (ii) increasing sample size
0.00.20.40.60.81.0
generations before present (log scale)
population size prediction error
0 10 100 1,000 10,000 100,000
●
● ● ● ● ● ● ● ● ● ● ●
●
●
●
●
●
●
●
●
●
● LD statistics
AFS statistics without the proportion of SNP AFS statistics
AFS and LD statistics
0.00.20.40.60.81.0
generations before present (log scale)
population size prediction error
0 10 100 1,000 10,000 100,000
n=10 n=15 n=20 n=25 n=50
E STIMATION FOR SPECIFIC HISTORIES
Recent demography recovered and credible intervals, in contrast to PSMC (Li and Durbin, 2011) and MSMC (Schiffels and Durbin, 2014).
ABC, n=50
years before present (log scale)
effective population size (log scale)
10 100 1,000 10,000 100,000
1001,00010,000100,000
domestication breed formation
PSMC : n=2
10 100 1,000 10,000 100,000
1001,00010,000100,000
domestication breed formation
MSMC : n=4
10 100 1,000 10,000 100,000
1001,00010,000100,000
domestication breed formation
generations before present (log scale)
effective population size (log scale)
true value
estimations for one replicate 90% CI for one replicate
0 10 100 1,000 10,000 100,000
1001,00010,000100,000
0 10 100 1,000 10,000 100,000
1001,00010,000100,000
0 10 100 1,000 10,000 100,000
1001,00010,000100,000
0 10 100 1,000 10,000 100,000
1001,00010,000100,000
0 10 100 1,000 10,000 100,000
1001,00010,000100,000
0 10 100 1,000 10,000 100,000
1001,00010,000100,000
C OMPLEMENTARITY OF AFS AND LD STATISTICS
Depending on the scenario, LD or AFS is more informative.
AFS statistics
generations before present (log scale)
effective population size (log scale)
0 10 100 1,000 10,000 100,000
1001,00010,000100,000
E[TMRCA]
LD statistics
0 10 100 1,000 10,000 100,000
1001,00010,000100,000
E[TMRCA]
AFS and LD statistics
0 10 100 1,000 10,000 100,000
1001,00010,000100,000
E[TMRCA]
0 10 100 1,000 10,000 100,000
1001,00010,000100,000
E[TMRCA]
0 10 100 1,000 10,000 100,000
1001,00010,000100,000
E[TMRCA]
0 10 100 1,000 10,000 100,000
1001,00010,000100,000
E[TMRCA]
0 10 100 1,000 10,000 100,000
1001,00010,000100,000
E[TMRCA]
0 10 100 1,000 10,000 100,000
1001,00010,000100,000
E[TMRCA]
0 10 100 1,000 10,000 100,000
1001,00010,000100,000
E[TMRCA]
A PPLICATION TO NGS DATA IN CATTLE
Use genotype LD because of phas- ing errors and common alleles be- cause of sequencing errors.
After domestication, contiuous de- cline and clustering of breed histo- ries by area of origin.
years before present (log scale)
effective population size (log scale)
0 10 100 1,000 10,000 100,000
1003001,0003,00010,00030,000100,000
●
●
● ● ● ● ●
● ●
● ●
● ●
● ●
●
●
● ●
●
● haplotype LD, all SNP genotype LD, all SNP genotype LD, MAF 10%
genotype LD, MAF 20%
years before present (log scale)
effective population size (log scale)
10 100 1,000 10,000 100,000
1003001,0003,00010,00030,000100,000
domestication breed formation
BrownSwiss Fleckvieh Holstein SwedishRed
Charolais Limousin Hereford Jersey
C ONCLUSIONS
• Accurate estimation of population size history, including the re- cent past, through the use of large samples of genomes and the combination of AFS and LD information.
• Based on unphased data and robust to sequencing errors.
• Ref PLoS Genet 12(3): e1005877, source code at https://forge- dga.jouy.inra.fr/projects/popsizeabc/
• Perspectives : simulate longer genomes through the use of faster simulation software (msprime, Kelleher et al, 2016), decoupling of observed times and change times.