Inferring the ancestral dynamics of population size from genome wide molecular data - an ABC approach

(1)

HAL Id: hal-02802676

https://hal.inrae.fr/hal-02802676

Submitted on 5 Jun 2020

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Inferring the ancestral dynamics of population size from

genome wide molecular data - an ABC approach

Simon Boitard

To cite this version:

Simon Boitard. Inferring the ancestral dynamics of population size from genome wide molecular data - an ABC approach. Stochastic models in ecology, evolution and genetics (SMEEG), Dec 2013, Angers, France. ⟨hal-02802676⟩

(2)

Inferring the ancestral dynamics of population size from

genome wide molecular data - an ABC approach

Simon Boitard

UMR 7205 OSEB (EPHE - MNHN - CNRS), Paris.

UMR 1313 GABI (INRA - AgroParisTech), Jouy en Josas

(3)

Motivation

Genome wide sequence data contain rich information about population size history, cf. PSMC (Li and Durbin, 2011).

(4)

Pairwise Sequentially Markovian Coalescent (PSMC)

Markov chain for T_2 (the coalescence time of the two sequences) along the genome, based on the Sequentially Markovian Coalescent (SMC); transitions depend on N(t).

Estimation through a Hidden Markov Model (HMM).

Limited to one individual (n = 2) → not efficient for recent times.

(5)

Development of an ABC approach

Several estimation methods exist (Drummond et al, 2012; MacLeod et al, 2013; Sheehan et al, 2013), but they are limited to n = 2 or to small genomic regions.

ABC could take advantage of both genome wide data and large n.

Few assumptions are required concerning the underlying model.

(6)

Application to farm animal species

Many genome sequences are now available (pig, cattle, sheep, chicken), and a huge number of animals have dense genotyping data.

Several bottlenecks expected along their history :

Last glaciation : -25 000 – -60 000 years

Domestication : -10 000 years.

Creation of modern breeds and intensive selection : -200 years.

Here 25 unrelated animals (n = 50) from the Holstein cattle breed

(www.1000bullgenomes.com)

(7)

Outline

1 Methods

2 Results

Simulations

Application to Holstein data

3 Conclusions and perspectives

(9)

Principles of ABC (Approximate Bayesian Computation)

To estimate the parameters θ of a model from a dataset D, we approximate the posterior probability P(θ|D) by the quantity P(θ|S), for a set S of (meaningful!) summary statistics.

We estimate P(θ|S) by simulations, with the following procedure:

1 Compute S = f(D).

2 For i from 1 to I:

    1 Sample parameter θ_i from the prior distribution of θ.

    2 Simulate dataset D_i from the model with parameter θ_i.

    3 Compute S_i = f(D_i).

    4 Select the simulation if dist(S_i, S) < ε.

3 Estimate the posterior distribution of θ from the selected θ_i values, by simple counting or other approaches (regression).
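The rejection procedure above can be sketched in a few lines of Python. This is only a toy illustration: it assumes a one-parameter model (the mean of a normal distribution) in place of the coalescent, the sample mean as the single summary statistic, and a Euclidean distance. The function `abc_rejection` and all of its arguments are hypothetical names, not part of the talk.

```python
import numpy as np

def abc_rejection(observed, prior_sampler, simulator, summary, n_sims, eps, rng):
    """Basic ABC rejection: keep the theta_i whose summaries fall within eps of S."""
    s_obs = np.atleast_1d(summary(observed))        # step 1: S = f(D)
    accepted = []
    for _ in range(n_sims):                         # step 2: I simulations
        theta = prior_sampler(rng)                  # 2.1 sample theta_i from the prior
        data = simulator(theta, rng)                # 2.2 simulate D_i
        s_sim = np.atleast_1d(summary(data))        # 2.3 S_i = f(D_i)
        if np.linalg.norm(s_sim - s_obs) < eps:     # 2.4 keep if dist(S_i, S) < eps
            accepted.append(theta)
    return np.array(accepted)                       # step 3: approximate posterior sample

# Toy model (an assumption for illustration, not the coalescent model of the talk).
rng = np.random.default_rng(0)
observed = rng.normal(3.0, 1.0, size=200)
posterior = abc_rejection(
    observed,
    prior_sampler=lambda g: g.uniform(0.0, 10.0),
    simulator=lambda th, g: g.normal(th, 1.0, size=200),
    summary=np.mean,
    n_sims=5000,
    eps=0.1,
    rng=rng,
)
print(len(posterior), np.median(posterior))
```

The accepted values play the role of the "simple counting" estimate; a regression adjustment would post-process the same accepted set.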

(10)

Model

Coalescent with mutation and recombination, n = 50 haplotypes.

No structure.

Piecewise constant effective population size.

(11)

Intervals are defined from a previous PSMC analysis ...

[Figure: Ne (axis from 5000 to 35000) as a function of log10(generations) (axis from 2.5 to 5.5).]

(12)

... as well as breeding history

[Figure: Ne (axis from 5000 to 35000) as a function of log10(generations) (axis from 0 to 6).]

(13)

Prior distributions

Per generation per bp mutation rate: μ = 2.5e-8.

Per generation per bp recombination rate: r ∼ U(0.2e-8, 1e-8).

Population size:

log(N_0) ∼ U(1, 5).

log(N_{i+1}) = log(N_i) + α, α ∼ U(−1, 1).

1 ≤ log(N_i) ≤ 5.
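A draw from these priors could look like the sketch below. Two points are assumptions rather than statements from the slide: the bound 1 ≤ log(N_i) ≤ 5 is enforced here by redrawing α until the new value is in range (the slide does not say how it is enforced), and ten population sizes are drawn, matching the ten N labels (N0 to N20000) that appear in the plots. The function name `sample_prior` is hypothetical.

```python
import numpy as np

MU = 2.5e-8  # fixed mutation rate per generation per bp

def sample_prior(n_intervals, rng, log_min=1.0, log_max=5.0):
    """Draw r and a piecewise constant log population size trajectory."""
    r = rng.uniform(0.2e-8, 1e-8)              # recombination rate per generation per bp
    log_n = [rng.uniform(log_min, log_max)]    # log(N_0) ~ U(1, 5)
    for _ in range(n_intervals - 1):
        # Assumption: redraw alpha ~ U(-1, 1) until the bound on log(N_i) holds.
        while True:
            candidate = log_n[-1] + rng.uniform(-1.0, 1.0)
            if log_min <= candidate <= log_max:
                break
        log_n.append(candidate)
    return r, np.array(log_n)

rng = np.random.default_rng(1)
r, log_n = sample_prior(n_intervals=10, rng=rng)
print(r, log_n)
```

The random walk on the log scale keeps consecutive sizes within one log unit of each other, so very abrupt size changes are excluded a priori.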

(14)

Summary statistics - Allele Frequency Spectrum (AFS)

Frequency of polymorphic sites over the genome.

Frequency of sites with i copies of the minor allele, for i from 1 to n/2.

[Figure: frequency (axis from 0.02 to 0.14) as a function of minor allele count (axis from 5 to 20).]

Variance of these frequencies over the genome.
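As a rough illustration of the first two statistics, the folded spectrum can be computed from a 0/1 haplotype matrix as below. The input layout (haplotypes in rows, sites in columns, monomorphic sites included) and the function name `folded_afs` are assumptions made for this sketch; the variance statistic is not covered here.

```python
import numpy as np

def folded_afs(haplotypes):
    """Return (AFS over minor allele counts 1..n//2, proportion of polymorphic sites).

    haplotypes: 0/1 array of shape (n_haplotypes, n_sites).
    """
    n = haplotypes.shape[0]
    ones = haplotypes.sum(axis=0)
    minor = np.minimum(ones, n - ones)         # minor allele count at each site
    polymorphic = minor > 0
    counts = np.bincount(minor[polymorphic], minlength=n // 2 + 1)[1:]
    afs = counts / counts.sum()                # frequencies among polymorphic sites
    return afs, polymorphic.mean()

haps = np.array([[0, 1, 1, 0, 0],
                 [0, 0, 1, 0, 1],
                 [0, 0, 1, 1, 1],
                 [0, 0, 0, 0, 0]])
afs, snp_fraction = folded_afs(haps)
print(afs, snp_fraction)   # [0.75 0.25] 0.8
```

With n = 50 haplotypes the same function returns 25 frequency classes.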

(16)

Summary statistics - Linkage Disequilibrium (LD)

Correlation between allelic data at two polymorphic sites.

Mean and variance of LD for several distances between sites.

LD at distance d is related to population size at time t = 1/(2c(d)), where c(d) is the genetic distance (in Morgans) corresponding to d.
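A minimal sketch of these LD statistics, assuming LD is measured by the squared correlation r² between the 0/1 alleles at two sites and that pairs are binned by physical distance with a tolerance. The function names, the tolerance parameter and the toy data are hypothetical.

```python
import numpy as np

def r_squared(site_a, site_b):
    """Squared correlation between the 0/1 alleles at two polymorphic sites."""
    return np.corrcoef(site_a, site_b)[0, 1] ** 2

def ld_summary(haplotypes, positions, d, tol):
    """Mean and variance of r^2 over pairs of sites separated by d +/- tol bp."""
    values = []
    n_sites = haplotypes.shape[1]
    for i in range(n_sites):
        for j in range(i + 1, n_sites):
            if abs((positions[j] - positions[i]) - d) <= tol:
                values.append(r_squared(haplotypes[:, i], haplotypes[:, j]))
    values = np.array(values)
    return values.mean(), values.var()

# Toy data: 4 haplotypes, 3 polymorphic sites at 0, 1000 and 2000 bp.
haps = np.array([[0, 0, 0],
                 [0, 0, 1],
                 [1, 1, 0],
                 [1, 1, 1]])
mean_ld, var_ld = ld_summary(haps, positions=[0, 1000, 2000], d=1000, tol=100)
print(mean_ld, var_ld)   # 0.5 0.25
```

The double loop is quadratic in the number of sites, so a real pipeline on 2 Mb segments would presumably subsample pairs.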

(17)

Implementation

Simulations :

Haplotype data simulated with ms. One sample = 50 independent 2 Mb segments.

500 000 simulated samples, ≈ 40 h on a cluster with 500 jobs in parallel (4 min per sample on average).

Holstein data :

Several pre-processing steps required to obtain haplotype data

(sequencing, alignment, genotype calling, haplotype estimation).

Haplotype data processed with the same Python program.

Final statistical analysis with the R package abc.
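For a piecewise constant history, an ms command line could be assembled as sketched below. It uses the standard ms scaling (θ = 4 N_0 μ L, ρ = 4 N_0 r L, times in units of 4 N_0 generations, sizes relative to N_0) and the `-t`, `-r` and `-eN` options. Whether the pipeline of the talk used exactly these options is not stated; the function name and the example history are hypothetical.

```python
def ms_command(n_haplotypes, n_segments, seg_length, mu, r, sizes, times):
    """Build the ms argument list for a piecewise constant population size.

    sizes[k] is the population size from times[k] generations ago backwards;
    times[0] must be 0, so sizes[0] is the present size N_0.
    """
    n0 = sizes[0]
    theta = 4 * n0 * mu * seg_length           # scaled mutation rate per segment
    rho = 4 * n0 * r * seg_length              # scaled recombination rate per segment
    cmd = ["ms", str(n_haplotypes), str(n_segments),
           "-t", f"{theta:g}", "-r", f"{rho:g}", str(seg_length)]
    for t, n in zip(times[1:], sizes[1:]):
        # -eN t x: size becomes x * N_0 at time t (in units of 4 N_0 generations)
        cmd += ["-eN", f"{t / (4 * n0):g}", f"{n / n0:g}"]
    return cmd

# Hypothetical history: N = 1000 for the last 200 generations, 10000 before.
cmd = ms_command(50, 50, 2_000_000, mu=2.5e-8, r=0.5e-8,
                 sizes=[1000, 10000], times=[0, 200])
print(" ".join(cmd))
# ms 50 50 -t 200 -r 40 2000000 -eN 0.05 10
```

The list form can be passed directly to `subprocess.run` on a machine where ms is installed.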

(20)

Cross validation

[Figure: estimation error (axis from 0 to 0.25) for each time interval, with boundaries at 0, 10, 40, 200, 400, 800, 1200, 2000, 5000 and 2E4 generations. Legend: rejection median and ridge median, each at 0.02, 0.01, 0.005, 0.002 and 0.001.]

Estimation error: Σ_i (θ_i − θ̂_i)² / (I ∗ Var(θ_i)), based on 100 CV replicates.
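The estimation error is the sum of squared differences between true and estimated values, divided by I times the variance of the true values. A short sketch, assuming the true and estimated values for one parameter are available as two arrays of length I (the function name is hypothetical):

```python
import numpy as np

def cv_error(theta_true, theta_hat):
    """Sum of squared errors divided by I * Var(theta_true)."""
    theta_true = np.asarray(theta_true, dtype=float)
    theta_hat = np.asarray(theta_hat, dtype=float)
    sse = np.sum((theta_true - theta_hat) ** 2)
    return sse / (theta_true.size * np.var(theta_true))

theta_true = [1.0, 2.0, 3.0, 4.0]
err = cv_error(theta_true, [1.5, 2.0, 3.0, 3.5])
baseline = cv_error(theta_true, [2.5, 2.5, 2.5, 2.5])
print(err, baseline)   # 0.1 1.0
```

With this normalisation, always predicting the mean of the true values gives an error of 1, so values well below 1 indicate that the statistics carry information about the parameter.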

(21)

Influence of AFS and LD statistics - Cross Validation

[Figure: cross validation error (axis from 0 to 0.30) for each time interval, with boundaries at 0, 10, 40, 200, 400, 800, 1200, 2000, 5000 and 2E4 generations. Legend: all stat, AFS, LD.]

(22)

Influence of AFS and LD statistics - Cross Validation

[Figure: cross validation error (axis from 0 to 0.30) for each time interval, with boundaries at 0, 10, 40, 200, 400, 800, 1200, 2000, 5000 and 2E4 generations. Legend: all stat, no VAR_AFS, no VAR_LD.]

(23)

Influence of AFS and LD statistics - PLS regression

[Figure: PLS plot on components 1 and 2 (Comp 1, Comp 2, both axes from −1 to 1). Plotted labels: the summary statistics AFS_1 to AFS_25, V_AFS_1 to V_AFS_25, LD and V_LD at nine distances (2000, 417, 167, 83, 50, 31, 14, 4, 1) and SNPf, and the parameters r, N0, N10, N40, N200, N400, N800, N1200, N2000, N5000 and N20000.]

(25)

Prior check

[Figure: scatter plot on the first two principal components, PC 1 (axis from −10 to 20) and PC 2 (axis from −5 to 20).]

(26)

Estimated dynamics

[Figure: log10(Ne) (axis from 2.5 to 5.0) as a function of log10(generations) (axis from 0 to 5). Legend: ridge 0.01 90%, ridge 0.01 median, ridge 0.01 10%.]

(27)

Data is informative

[Figure: density (axis from 0 to 1) as a function of log10(Ne) (axis from 1 to 5), for three time intervals: 0 to 10 generations, 40 to 200 generations, > 20 000 generations.]

(28)

Comparison with PSMC

[Figure: Ne (axis from 0 to 60000) as a function of log10(generations) (axis from 0 to 5).]

(31)

Conclusions

The approach seems to work (low cross validation errors, sensible

credible intervals).

Combining AFS and LD is useful.

Variance of AFS is useful, but variance of LD is not.

Estimated demography is quite consistent with PSMC, but credible

intervals are rather large.

The estimate of recent population size seems too large (> 1000).

Influence of sequencing errors (MacLeod et al, 2013)?

(32)

Perspectives

Objective definition of time intervals.

ABC with more segments (L = 100)?

ABC based on more replicates? Second more local step?

Estimation with PLS?

(33)

Acknowledgements

Stanislas Sochacki (Ecole Polytechnique).

Lounes Chikhi (University Toulouse III), Willy Rodriguez, Olivier

Mazet, Simona Grusea (INSA Toulouse).

Bertrand Servin (INRA, Toulouse).

1000 bull genomes project.

(34)

Credible Intervals

[Figure: proportion of true values in the 80% credible interval (axis from 65 to 95) for each time interval, with boundaries at 0, 10, 40, 200, 400, 800, 1200, 2000, 5000 and 2E4 generations. Legend: rejection 0.002 median, rejection 0.001 median, ridge 0.02 median, ridge 0.01 median, ridge 0.005 median.]

Proportion in CI 80: (1/I) Σ_i 1(q̂_{10,i} ≤ θ_i ≤ q̂_{90,i}).
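The proportion in the 80% credible interval is the fraction of cross validation replicates whose true value lies between the estimated 10% and 90% posterior quantiles. A small sketch, with hypothetical names and made-up numbers:

```python
import numpy as np

def coverage(theta_true, q_low, q_high):
    """Fraction of replicates with q_low <= theta_true <= q_high."""
    theta_true, q_low, q_high = map(np.asarray, (theta_true, q_low, q_high))
    return np.mean((q_low <= theta_true) & (theta_true <= q_high))

# Made-up values for five replicates (illustration only).
theta_true = [2.0, 3.0, 4.0, 4.5, 1.5]
q10 = [1.5, 3.2, 3.5, 4.0, 1.0]
q90 = [2.5, 3.8, 4.2, 5.0, 2.0]
cov = coverage(theta_true, q10, q90)
print(cov)   # 0.8
```

For well calibrated intervals this proportion should be close to 0.8.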

(35)

Summary statistics

Proportion of SNPs: f = P(x > 0), x number of copies of the minor allele.

Allele frequency spectrum (AFS): P(x = i | x > 0) for i from 1 to 25.

Variance of AFS: std(d_i) ∗ f for i from 1 to 25, d_i distance between two consecutive sites with i copies of the minor allele.

Linkage disequilibrium (LD): E[r²(d)] and std[r²(d)], r²(d) LD between SNPs at distance d.

d = 1 kb, 4 kb, ..., 2 Mb, corresponding to time intervals in the model.

Ex: d = 1 kb → c = 10^−5 M → t = 1/(2c) = 50000.
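The mapping from distance to time can be checked numerically. The sketch below assumes a recombination rate of 1e-8 Morgan per bp, which is the value implied by the example (1 kb corresponds to 10^−5 M); the function name is hypothetical.

```python
def ld_time(distance_bp, rec_rate_per_bp=1e-8):
    """Time (in generations) informed by LD at a given distance: t = 1 / (2c)."""
    c = distance_bp * rec_rate_per_bp          # genetic distance in Morgans
    return 1.0 / (2.0 * c)

for d in (1_000, 4_000, 2_000_000):
    print(d, round(ld_time(d)))
# 1000 50000
# 4000 12500
# 2000000 25
```

Short distances therefore inform on ancient population sizes and long distances on recent ones.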

(36)

Number of segments

For each position i, S_i i.i.d. with E[S_i | θ], Var(S_i | θ).

Our statistics are averages, i.e. S_L = (1/L) Σ_{i=1}^{L} S_i

→ E[S_L | θ] = E[S_i | θ], Var(S_L | θ) = (1/L) Var(S_i | θ)

Var(S_genome | θ) = 1/(3∗10^9) Var(S_i | θ)

Var(S_{50∗2Mb} | θ) = (1/5) Var(S_{10∗2Mb} | θ)

When does this variance become too large?
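The 1/L scaling can be illustrated with a quick simulation. This is a toy check only: the per-segment statistic is drawn from an exponential distribution, an arbitrary choice standing in for any i.i.d. statistic, and averages over 10 and 50 segments are compared.

```python
import numpy as np

rng = np.random.default_rng(2)
n_rep = 20000
# Toy i.i.d. per-segment statistic (arbitrary distribution, for illustration).
per_segment = rng.exponential(1.0, size=(n_rep, 50))

var_10 = per_segment[:, :10].mean(axis=1).var()   # average over 10 segments
var_50 = per_segment.mean(axis=1).var()           # average over 50 segments
ratio = var_10 / var_50
print(var_10, var_50, ratio)                      # ratio close to 5
```

The ratio of about 5 matches Var(S_{50∗2Mb} | θ) = (1/5) Var(S_{10∗2Mb} | θ); the question on the slide is how small L can be before this sampling variance swamps the signal.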
