Inferring Population Size History from Large Samples of Genome-Wide Molecular Data - An Approximate Bayesian Computation Approach.

(1)

HAL Id: hal-01604024

https://hal.archives-ouvertes.fr/hal-01604024

Submitted on 5 Jun 2020

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Inferring Population Size History from Large Samples of Genome-Wide Molecular Data - An Approximate

Bayesian Computation Approach.

Simon Boitard, Willy Rodríguez, Flora Jay, Stefano Mona, Frederic Austerlitz

To cite this version:

Simon Boitard, Willy Rodríguez, Flora Jay, Stefano Mona, Frederic Austerlitz. Inferring Population Size History from Large Samples of Genome-Wide Molecular Data - An Approximate Bayesian Compu- tation Approach.. conférence Jaques Monod ”Coalescence des approches théoriques et expérimentales en génomique évolutive et biologie des systèmes”, 2016, Roscoff, France. �hal-01604024�

(2)

Inferring Population Size History from Large Samples of Genome-Wide Molecular Data -

An Approximate Bayesian Computation Approach

Simon Boitard

¹

, Willy Rodríguez

²

, Flora Jay

³

, Stefano Mona

⁴

, Frédéric Austerlitz

⁵

(1) GenPhySE, Castanet-Tolosan, France. (2) IMT, Toulouse, France. (3) LRI, Orsay, France. (4) ISYEB, Paris, France. (5) UMR 7206, Paris, France.

ABC E STIMATION APPROACH

• Model : Kingman’s coalescent with mutation and recombination, piecewise constant population size with 21 fixed time windows.

• Simulated data : 450,000 samples of n haploid genomes, one genome = 100 2Mb-long “contigs”. Population sizes log-uniformly distributed from 10 to 100,000, with correlation between consecu- tive windows.

• Summary statistics : Folded Allele Frequency Spectrum (AFS) and average Linkage Disequilibrium (LD) for 18 bins of physical dis- tance between SNP (from 300bp to 1.4Mbp).

• ABC analysis : 0.5% samples accepted, neuralnet regression (Blum and François, 2010) from abc R package (Csilléry et al, 2012).

P REDICTION ERROR ( RANDOM HISTORIES )

Estimation of demographic history improved by :

(i) combining LD and AFS statistics (ii) increasing sample size

0.00.20.40.60.81.0

generations before present (log scale)

population size prediction error

0 10 100 1,000 10,000 100,000

●

● ● ● ● ● ● ● ● ● ● ●

●

● LD statistics

AFS statistics without the proportion of SNP AFS statistics

AFS and LD statistics

0.00.20.40.60.81.0

population size prediction error

0 10 100 1,000 10,000 100,000

n=10 n=15 n=20 n=25 n=50

E STIMATION FOR SPECIFIC HISTORIES

Recent demography recovered and credible intervals, in contrast to PSMC (Li and Durbin, 2011) and MSMC (Schiffels and Durbin, 2014).

ABC, n=50

years before present (log scale)

effective population size (log scale)

10 100 1,000 10,000 100,000

1001,00010,000100,000

domestication breed formation

PSMC : n=2

10 100 1,000 10,000 100,000

1001,00010,000100,000

MSMC : n=4

10 100 1,000 10,000 100,000

1001,00010,000100,000

true value

estimations for one replicate 90% CI for one replicate

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

C OMPLEMENTARITY OF AFS ^AND LD ^STATISTICS

Depending on the scenario, LD or AFS is more informative.

AFS statistics

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

E[TMRCA]

LD statistics

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

E[TMRCA]

AFS and LD statistics

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

E[TMRCA]

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

E[TMRCA]

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

E[TMRCA]

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

E[TMRCA]

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

E[TMRCA]

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

E[TMRCA]

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

E[TMRCA]

A PPLICATION TO NGS DATA IN CATTLE

Use genotype LD because of phas- ing errors and common alleles because of sequencing errors.

After domestication, contiuous de- cline and clustering of breed histories by area of origin.

0 10 100 1,000 10,000 100,000

1003001,0003,00010,00030,000100,000

●

● ● ● ● ●

● ●

●

● ●

●

● haplotype LD, all SNP genotype LD, all SNP genotype LD, MAF 10%

genotype LD, MAF 20%

10 100 1,000 10,000 100,000

1003001,0003,00010,00030,000100,000

BrownSwiss Fleckvieh Holstein SwedishRed

Charolais Limousin Hereford Jersey

C ^ONCLUSIONS

• Accurate estimation of population size history, including the re- cent past, through the use of large samples of genomes and the combination of AFS and LD information.

• Based on unphased data and robust to sequencing errors.

• Ref PLoS Genet 12(3): e1005877, source code at https://forge- dga.jouy.inra.fr/projects/popsizeabc/

• Perspectives : simulate longer genomes through the use of faster simulation software (msprime, Kelleher et al, 2016), decoupling of observed times and change times.

Inferring Population Size History from Large Samples of Genome-Wide Molecular Data - An Approximate Bayesian Computation Approach.

Inferring Population Size History from Large Samples of Genome-Wide Molecular Data -

An Approximate Bayesian Computation Approach

Simon Boitard

, Willy Rodríguez

, Flora Jay

, Stefano Mona

, Frédéric Austerlitz

ABC E STIMATION APPROACH

P REDICTION ERROR ( RANDOM HISTORIES )

E STIMATION FOR SPECIFIC HISTORIES

C OMPLEMENTARITY OF AFS AND LD STATISTICS

A PPLICATION TO NGS DATA IN CATTLE

C ONCLUSIONS

C OMPLEMENTARITY OF AFS ^AND LD ^STATISTICS

C ^ONCLUSIONS