Inferring the ancestral dynamics of population size from genome wide molecular data - an ABC approach

(1)

HAL Id: hal-02739431

https://hal.inrae.fr/hal-02739431

Submitted on 2 Jun 2020

Simon Boitard, Stanislas Sochacki

To cite this version:

Simon Boitard, Stanislas Sochacki. Inferring the ancestral dynamics of population size from genome wide molecular data - an ABC approach. 10. World Congress on Genetics Applied to Livestock Production (WCGALP), Aug 2014, Vancouver, Canada. American Society of Animal Science, 2014, Proceedings 10th World Congress of Genetics Applied to Livestock Production. ⟨hal-02739431⟩

(2)

Simon Boitard (1,2), Stanislas Sochacki (1)

1: UMR 7205 ISYEB (EPHE - MNHN - CNRS - UPMC), Paris.

2: UMR 1313 GABI (INRA - AgroParisTech), Jouy-en-Josas.

WCGALP 2014

(3)

Motivation

Genome-wide sequence data contain rich information about population size history, cf. PSMC (Li and Durbin, 2011).

(4)

Development of an ABC approach

Several estimation methods:

Sequentially Markovian Coalescent: PSMC (Li and Durbin, 2011), diCal (Sheehan et al, 2013), MSMC (Schiffels and Durbin, 2014).

Runs of Homozygosity: MacLeod et al (2013), Harris and Nielsen (2013).

So far limited to small sample sizes (n = 1 to ≈ 5 diploid individuals).

→ low accuracy for recent history estimation.

Approximate Bayesian Computation (ABC) could take advantage of both genome-wide data and large sample size.

(5)

Outline

1 Methods

2 Simulation Results

3 Application to bovine NGS data

(7)

Methods

Principles of Approximate Bayesian Computation (ABC)

Model with parameter θ (multi-dimensional), dataset D.

Approximate P(θ|D) by P(θ|S), for a set S of (meaningful!) summary statistics.

Estimate P(θ|S) by intensive simulations:

1 Compute S = f(D).

2 For i from 1 to I:

    1 Sample parameter value θ_i from the prior distribution of θ.

    2 Simulate dataset D_i from the model with parameter value θ_i.

    3 Compute S_i = f(D_i).

    4 Keep θ_i if dist(S_i, S) < ε.

3 Estimate P(θ|S) from the selected θ_i values, by simple counting or by regression approaches.
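The steps above can be sketched as a generic rejection sampler. This is only a minimal illustration: the function name `abc_rejection`, its arguments, and the toy Gaussian model used to exercise it are assumptions made for this example, not the population-size model or the code used in the talk.

```python
import random
import statistics


def abc_rejection(observed, prior_sampler, simulator, summarize,
                  distance, epsilon, n_iter, rng):
    """Generic ABC rejection sampler (hypothetical helper)."""
    s_obs = summarize(observed)               # step 1: S = f(D)
    accepted = []
    for _ in range(n_iter):                   # step 2: for i from 1 to I
        theta = prior_sampler(rng)            # 2.1 sample theta_i from the prior
        data = simulator(theta, rng)          # 2.2 simulate D_i given theta_i
        s_sim = summarize(data)               # 2.3 S_i = f(D_i)
        if distance(s_sim, s_obs) < epsilon:  # 2.4 keep theta_i if close enough
            accepted.append(theta)
    return accepted                           # step 3 works on these values


# Toy model (an assumption for illustration): infer the mean of a Gaussian.
rng = random.Random(42)
observed = [rng.gauss(3.0, 1.0) for _ in range(200)]

accepted = abc_rejection(
    observed,
    prior_sampler=lambda r: r.uniform(0.0, 10.0),
    simulator=lambda theta, r: [r.gauss(theta, 1.0) for _ in range(200)],
    summarize=statistics.fmean,
    distance=lambda a, b: abs(a - b),
    epsilon=0.1,
    n_iter=5000,
    rng=rng,
)

# "Simple counting": summarize the accepted values directly.
posterior_mean = statistics.fmean(accepted)
```

Here the posterior is summarized by the plain mean of the accepted values; a regression adjustment would replace only that last step.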

(8)

Model

D = n diploid genomes.

Coalescent model with mutation and recombination.

One single panmictic population (no structure).

Piecewise constant effective population size, m fixed time windows.

(9)

Prior distributions

Per generation per bp mutation rate: μ = 1e-8.

Per generation per bp recombination rate: r ∼ U(0.1e-8, 1e-8).

Population size:

log(N_0) ∼ U(1, 5).

log(N_{i+1}) = log(N_i) + α, α ∼ U(−1, 1).

1 ≤ log(N_i) ≤ 5.
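One way to draw from these priors is sketched below. Two points are assumptions not stated on the slide: the bound 1 ≤ log(N_i) ≤ 5 is enforced here by redrawing α until the new value falls inside the interval, and the logarithm is taken to be base 10, in line with the log10(Ne) axes of the result plots. The function names are hypothetical.

```python
import random


def sample_log_sizes(m, rng, low=1.0, high=5.0, max_step=1.0):
    """Draw log population sizes for m time windows.

    Assumption: a step alpha that would leave [low, high] is simply redrawn.
    """
    log_n = [rng.uniform(low, high)]                # log(N_0) ~ U(1, 5)
    while len(log_n) < m:
        alpha = rng.uniform(-max_step, max_step)    # alpha ~ U(-1, 1)
        candidate = log_n[-1] + alpha
        if low <= candidate <= high:                # keep 1 <= log(N_i) <= 5
            log_n.append(candidate)
    return log_n


def sample_recombination_rate(rng):
    """Per generation per bp recombination rate, r ~ U(0.1e-8, 1e-8)."""
    return rng.uniform(0.1e-8, 1e-8)


rng = random.Random(1)
log_sizes = sample_log_sizes(21, rng)       # m = 21 windows, as in the talk
sizes = [10 ** x for x in log_sizes]        # assumes base-10 logarithms
r = sample_recombination_rate(rng)
mu = 1e-8                                   # fixed mutation rate
```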

(10)

Summary statistics - Allele Frequency Spectrum (AFS)

Proportion of polymorphic sites over the genome.

Proportion of sites with i copies of the minor allele, for i from 1 to n.

[Figure: example allele frequency spectrum. x-axis: minor allele count; y-axis: frequency.]
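The two AFS statistics described on this slide could be computed along the following lines. This is a sketch under assumptions: the input layout (one list of 0/1 alleles per site, over 2n haplotypes), the function name `afs_statistics`, and the choice to normalize the spectrum by the number of polymorphic sites are all illustrative and are not taken from the authors' scripts.

```python
def afs_statistics(sites, n_total_sites):
    """Return (proportion of polymorphic sites, folded spectrum).

    sites: list of sites, each a list of 0/1 alleles over the 2n haplotypes.
    n_total_sites: total number of sites surveyed, including monomorphic
    sites that are not listed in `sites`.
    """
    n_hap = len(sites[0])
    n = n_hap // 2                       # number of diploid individuals
    counts = [0] * (n + 1)               # counts[i]: sites with minor allele count i
    for alleles in sites:
        ones = sum(alleles)
        mac = min(ones, n_hap - ones)    # minor allele count, between 0 and n
        if mac > 0:                      # skip monomorphic sites
            counts[mac] += 1
    n_poly = sum(counts)
    prop_poly = n_poly / n_total_sites
    # Assumption: proportions are taken among polymorphic sites.
    spectrum = [c / n_poly for c in counts[1:]] if n_poly else [0.0] * n
    return prop_poly, spectrum


# Two diploid individuals (4 haplotypes), 4 listed sites out of 10 surveyed.
example_sites = [
    [0, 0, 0, 1],   # minor allele count 1
    [1, 1, 0, 0],   # minor allele count 2
    [1, 1, 1, 0],   # minor allele count 1
    [0, 0, 0, 0],   # monomorphic
]
prop_poly, spectrum = afs_statistics(example_sites, n_total_sites=10)
```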

(11)

Summary statistics - Linkage Disequilibrium (LD)

Correlation r² between genetic data at two polymorphic sites.

Expected r² at recombination distance c is related to population size at time t = 1/(2c) generations ago (Hayes et al, 2003).

Mean r² for m bins of distance between sites (one per time window).
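The LD statistics can be illustrated with a few small helpers. They are a sketch only: the function names are hypothetical, the recombination distance is assumed to be c = r × d for sites d bp apart (c in Morgans), and the distance bins in the example are arbitrary rather than those used in the talk.

```python
def r_squared(x, y):
    """Squared Pearson correlation between allele (or genotype) vectors at
    two polymorphic sites; both vectors must show some variation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov * cov / (var_x * var_y)


def time_for_distance(c):
    """Time (generations ago) informed by recombination distance c: t = 1/(2c)."""
    return 1.0 / (2.0 * c)


def distance_for_time(t, r_per_bp):
    """Physical distance (bp) matching time t, assuming c = r_per_bp * d."""
    return 1.0 / (2.0 * t * r_per_bp)


def mean_r2_per_bin(pairs, bins):
    """pairs: list of (distance_bp, r2); bins: list of (low, high) intervals."""
    means = []
    for low, high in bins:
        values = [r2 for d, r2 in pairs if low <= d < high]
        means.append(sum(values) / len(values) if values else float("nan"))
    return means


full_ld = r_squared([0, 0, 1, 1], [0, 0, 1, 1])      # identical sites
no_ld = r_squared([0, 1, 0, 1], [0, 0, 1, 1])        # uncorrelated sites
t_example = time_for_distance(0.005)                 # c = 0.005 Morgans
d_example = distance_for_time(100, 1e-8)             # bp informing t = 100
binned = mean_r2_per_bin([(100, 0.8), (200, 0.6), (5000, 0.1)],
                         [(0, 1000), (1000, 10000)])
```

Choosing one distance bin per time window, as on the slide, amounts to applying `distance_for_time` to the window boundaries.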

(12)

Implementation

Observed data in 4 cattle breeds:

Whole genome of n unrelated animals in Angus (n = 25), Fleckvieh (n = 25), Holstein (n = 35) and Jersey (n = 15).

NGS data from the 1000 bull genomes project, run 2 (http://www.1000bullgenomes.com/).

Phased data from Daetwyler et al. (2014).

Simulated data:

n diploid individuals (n = 15, 25 or 35), each with 100 independent 2 Mb segments.

400 000 samples simulated with ms, with m = 21 time windows.

Summary statistics for observed and simulated samples computed with the same Python scripts.

Final statistical analysis with the R package abc.
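As an illustration of how one parameter draw might be turned into an ms call, the sketch below builds an argument list using the standard ms options `-t` (θ = 4 N0 μ L), `-r` (ρ = 4 N0 r (L − 1), followed by the number of sites) and `-eN` (size change at a time given in units of 4 N0 generations, with the new size relative to N0). The helper name, the window start times and the population sizes in the example are hypothetical; the slides give neither the window boundaries nor the exact command used.

```python
def ms_command(n_diploid, n_segments, seg_len_bp, mu, r, sizes, window_starts):
    """Build an ms argument list for a piecewise constant population size.

    sizes[i] is the population size in the window starting at
    window_starts[i] generations ago; window_starts[0] must be 0.
    """
    n0 = sizes[0]
    theta = 4 * n0 * mu * seg_len_bp          # scaled mutation rate per segment
    rho = 4 * n0 * r * (seg_len_bp - 1)       # scaled recombination rate
    args = ["ms", str(2 * n_diploid), str(n_segments),
            "-t", f"{theta:.10g}",
            "-r", f"{rho:.10g}", str(seg_len_bp)]
    for start, size in zip(window_starts[1:], sizes[1:]):
        # ms measures time in units of 4*N0 generations, sizes relative to N0
        args += ["-eN", f"{start / (4 * n0):.10g}", f"{size / n0:.10g}"]
    return args


# Hypothetical values for illustration only (three windows instead of 21).
cmd = ms_command(n_diploid=25, n_segments=100, seg_len_bp=2_000_000,
                 mu=1e-8, r=5e-9,
                 sizes=[1000, 2000, 500],
                 window_starts=[0, 100, 1000])
command_line = " ".join(cmd)
```

The resulting list can be passed to `subprocess.run` when an ms binary is available on the path.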

(13)

Simulation Results

(14)

Cross validation, n = 25

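[Figure: estimation error (y-axis, 0 to 1) for each population size parameter, from N0 to N130000 (x-axis: generations), using all statistics.]

Estimation error: $\frac{\sum_i (\theta_i - \hat{\theta}_i)^2}{I \cdot \mathrm{Var}(\theta_i)}$, based on 500 CV replicates.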
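The error measure above can be computed for one parameter as in the following sketch. The function name is hypothetical, and the use of the population variance of the true values for Var(θ_i) is an assumption; with that choice, an estimator that always returns the mean of the true values scores exactly 1, and a perfect estimator scores 0.

```python
import statistics


def normalized_error(true_values, estimates):
    """Sum of squared errors divided by I * Var(true values)."""
    n_rep = len(true_values)                          # I cross-validation replicates
    squared = sum((t - e) ** 2 for t, e in zip(true_values, estimates))
    return squared / (n_rep * statistics.pvariance(true_values))


true_theta = [1.0, 2.0, 3.0, 4.0]
perfect = normalized_error(true_theta, true_theta)     # estimates equal the truth
baseline = normalized_error(true_theta, [2.5] * 4)     # always predict the mean
```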

(15)

Cross validation, n = 25

[Figure: estimation error (y-axis, 0 to 1) for each population size parameter, from N0 to N130000 (x-axis: generations), with three curves: all statistics, allele frequencies, linkage disequilibrium.]

(17)

Application to bovine NGS data

Influence of phasing and sequencing errors (Holstein)

[Figure: log10(Ne) (y-axis, 2.0 to 5.0) against log10(years) (x-axis, 0 to 6), with three curves: haplotype LD, MAF 5%; genotype LD, MAF 5%; genotype LD, MAF 20%.]

(18)

Final estimation (genotype LD, MAF 20%)

[Figure: log10(Ne) (y-axis, 2.0 to 5.0) against log10(years) (x-axis, 0 to 6), with one curve per breed: Angus, Fleckvieh, Holstein, Jersey.]

(19)

Conclusions and perspectives - Methodology

ABC provides accurate estimation of population size dynamics from the present up to at least 20 000 generations b.p.

Combining AFS and LD is useful.

ABC can be applied to a wide range of data types: large sample size, unphased data, RAD sequencing...

ABC requires few assumptions about the underlying evolutionary model.

(20)

Conclusions and perspectives - Cattle history

Two strong bottlenecks, starting 30 000 years b.p. (domestication) and 1 200 years b.p. (diversification?).

General shape consistent with the previous estimation from MacLeod et al (2013) in Holstein.

Quantitative differences (larger population size, older second bottleneck) likely due to differences in the recombination rate per bp (fixed at 10⁻⁸ in their study, estimated at 2 × 10⁻⁹ in ours).
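A short worked computation may make the direction of the time shift concrete. It assumes that the relation t = 1/(2c) from the LD slide applies with c = r d for two sites d bp apart, and that a comparable scaling holds for the method of MacLeod et al (2013); both are assumptions made for illustration.

```latex
t = \frac{1}{2c} = \frac{1}{2 r d}
\quad\Longrightarrow\quad
\frac{t_{\text{ours}}}{t_{\text{theirs}}}
= \frac{r_{\text{theirs}}}{r_{\text{ours}}}
= \frac{10^{-8}}{2 \times 10^{-9}} = 5 .
```

Under these assumptions, the same physical distance is mapped to a time five times older with r = 2 × 10⁻⁹, which goes in the direction of an older second bottleneck.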

(21)

Acknowledgements

1000 bull genomes project.

Genotoul Bioinformatics Platform.

Willy Rodriguez (INSA Toulouse).

Bertrand Servin (INRA, Toulouse).

Lounes Chikhi (University Toulouse III), Olivier Mazet, Simona Grusea (INSA Toulouse).

ANR Demochips: Frédéric Austerlitz, Stefano Mona...

(22)

Simulations under a bottleneck scenario

(MacLeod et al, 2013)

[Figure: log10(Ne) (y-axis, 2.0 to 5.0) against log10(years) (x-axis, 0 to 6), with three curves: true value, average estimation, estimation for one sample.]
