• Aucun résultat trouvé

Inferring Population Size History from Large Samples of Genome-Wide Molecular Data - An Approximate Bayesian Computation Approach.

N/A
N/A
Protected

Academic year: 2021

Partager "Inferring Population Size History from Large Samples of Genome-Wide Molecular Data - An Approximate Bayesian Computation Approach."

Copied!
2
0
0

Texte intégral

(1)

HAL Id: hal-01604024

https://hal.archives-ouvertes.fr/hal-01604024

Submitted on 5 Jun 2020

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Inferring Population Size History from Large Samples of Genome-Wide Molecular Data - An Approximate

Bayesian Computation Approach.

Simon Boitard, Willy Rodríguez, Flora Jay, Stefano Mona, Frederic Austerlitz

To cite this version:

Simon Boitard, Willy Rodríguez, Flora Jay, Stefano Mona, Frederic Austerlitz. Inferring Population Size History from Large Samples of Genome-Wide Molecular Data - An Approximate Bayesian Compu- tation Approach.. conférence Jaques Monod ”Coalescence des approches théoriques et expérimentales en génomique évolutive et biologie des systèmes”, 2016, Roscoff, France. �hal-01604024�

(2)

Inferring Population Size History from Large Samples of Genome-Wide Molecular Data -

An Approximate Bayesian Computation Approach

Simon Boitard

1

, Willy Rodríguez

2

, Flora Jay

3

, Stefano Mona

4

, Frédéric Austerlitz

5

(1) GenPhySE, Castanet-Tolosan, France. (2) IMT, Toulouse, France. (3) LRI, Orsay, France. (4) ISYEB, Paris, France. (5) UMR 7206, Paris, France.

ABC E STIMATION APPROACH

Model : Kingman’s coalescent with mutation and recombination, piecewise constant population size with 21 fixed time windows.

Simulated data : 450,000 samples of n haploid genomes, one genome = 100 2Mb-long “contigs”. Population sizes log-uniformly distributed from 10 to 100,000, with correlation between consecu- tive windows.

Summary statistics : Folded Allele Frequency Spectrum (AFS) and average Linkage Disequilibrium (LD) for 18 bins of physical dis- tance between SNP (from 300bp to 1.4Mbp).

ABC analysis : 0.5% samples accepted, neuralnet regression (Blum and François, 2010) from abc R package (Csilléry et al, 2012).

P REDICTION ERROR ( RANDOM HISTORIES )

Estimation of demographic history improved by :

(i) combining LD and AFS statistics (ii) increasing sample size

0.00.20.40.60.81.0

generations before present (log scale)

population size prediction error

0 10 100 1,000 10,000 100,000

● ● ● ● ● ● ● ●

LD statistics

AFS statistics without the proportion of SNP AFS statistics

AFS and LD statistics

0.00.20.40.60.81.0

generations before present (log scale)

population size prediction error

0 10 100 1,000 10,000 100,000

n=10 n=15 n=20 n=25 n=50

E STIMATION FOR SPECIFIC HISTORIES

Recent demography recovered and credible intervals, in contrast to PSMC (Li and Durbin, 2011) and MSMC (Schiffels and Durbin, 2014).

ABC, n=50

years before present (log scale)

effective population size (log scale)

10 100 1,000 10,000 100,000

1001,00010,000100,000

domestication breed formation

PSMC : n=2

10 100 1,000 10,000 100,000

1001,00010,000100,000

domestication breed formation

MSMC : n=4

10 100 1,000 10,000 100,000

1001,00010,000100,000

domestication breed formation

generations before present (log scale)

effective population size (log scale)

true value

estimations for one replicate 90% CI for one replicate

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

C OMPLEMENTARITY OF AFS AND LD STATISTICS

Depending on the scenario, LD or AFS is more informative.

AFS statistics

generations before present (log scale)

effective population size (log scale)

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

E[TMRCA]

LD statistics

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

E[TMRCA]

AFS and LD statistics

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

E[TMRCA]

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

E[TMRCA]

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

E[TMRCA]

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

E[TMRCA]

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

E[TMRCA]

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

E[TMRCA]

0 10 100 1,000 10,000 100,000

1001,00010,000100,000

E[TMRCA]

A PPLICATION TO NGS DATA IN CATTLE

Use genotype LD because of phas- ing errors and common alleles be- cause of sequencing errors.

After domestication, contiuous de- cline and clustering of breed histo- ries by area of origin.

years before present (log scale)

effective population size (log scale)

0 10 100 1,000 10,000 100,000

1003001,0003,00010,00030,000100,000

● ●

● ●

● ●

● ●

haplotype LD, all SNP genotype LD, all SNP genotype LD, MAF 10%

genotype LD, MAF 20%

years before present (log scale)

effective population size (log scale)

10 100 1,000 10,000 100,000

1003001,0003,00010,00030,000100,000

domestication breed formation

BrownSwiss Fleckvieh Holstein SwedishRed

Charolais Limousin Hereford Jersey

C ONCLUSIONS

Accurate estimation of population size history, including the re- cent past, through the use of large samples of genomes and the combination of AFS and LD information.

• Based on unphased data and robust to sequencing errors.

Ref PLoS Genet 12(3): e1005877, source code at https://forge- dga.jouy.inra.fr/projects/popsizeabc/

Perspectives : simulate longer genomes through the use of faster simulation software (msprime, Kelleher et al, 2016), decoupling of observed times and change times.

Références

Documents relatifs

• Details on quantitative analyses (e.g., data treatment and statistical scripts in R, bioinformatic pipeline scripts, etc.) and details concerning simulations (scripts,

• Simulated samples are summarized using the same statistics as the observed sample, and the best 0.5% of population size histories (i.e. those providing statistics that are the

Simon Boitard, Willy Rodríguez, Flora Jay, Stefano Mona, Frédéric Austerlitz. Inferring Popula- tion Size History from Large Samples of Genome-Wide Molecular Data - An

To do so, we merged our Cuban dataset with publicly available whole-genome and genome-wide SNP datasets for European, African, and Native American ancestries (called

Outside this window however, population size histo- ries estimated by MSMC often had a totally different trend than the true history, (see for instance the small constant size or

There are at least two different ways of adaptation for a population: selection can either act on a new mutation (hard selective sweep), or on preexisting alleles that

Inferring the ancestral dynamics of population size from genome wide molecular data - an ABC approach..

Genome wide sequence data contains rich information about population size history, cf PSMC (Li and Durbin, 2011).... Development of an