• Aucun résultat trouvé

Estimating the evolution history of species from genome sequences

N/A
N/A
Protected

Academic year: 2021

Partager "Estimating the evolution history of species from genome sequences"

Copied!
36
0
0

Texte intégral

(1)

Estimating the evolution history of species from

genome sequences

Simon Boitard1, Bertrand Servin1, Olivier Mazet2

1 : INRA, G´en´etique Physiologie et Syst`emes d’Elevage (GenPhySE), Toulouse 2 : INSA, Institut de Math´ematiques de Toulouse

CIMI, Semestre Math´ematiques et Informatique pour les sciences du vivant

(2)

Population genetics

Inference of evolutionary history over short time scale, typically within species (in contrast to phylogeny). Typical questions (humans): “out of Africa” hypothesis, Neandertal introgression within modern humans . . .

Typical questions (breeding species): domestication process, impact of intensive selection since the 50’s . . .

(3)

Outline

1 Data and objectives

2 Single population model

3 Estimation of population size from single locus data

(4)

Outline

1 Data and objectives

2 Single population model

3 Estimation of population size from single locus data

4 Estimation of population size from whole-genome sequences

(5)

Genetic data

n DNA sequences of length L from the same species. Mostly similar, but at some positions different alleles exist (genetic polymorphism).

(6)

Single Nucleotide Polymorphism (SNP)

Only one nucleotide is changed. Very common on the genome.

Result from a single mutation event during evolution → only two distinct alleles, one ancestral (denoted 0) and one derived (denoted 1).

(7)

Single Nucleotide Polymorphism (SNP)

Only one nucleotide is changed. Very common on the genome.

Result from a single mutation event during evolution → only two distinct alleles, one ancestral (denoted 0) and one derived (denoted 1).

(8)

General evolution model

(9)
(10)

General evolution model

(11)

Outline

1 Data and objectives

2 Single population model

3 Estimation of population size from single locus data

(12)

the Wright-Fisher process

(13)
(14)

the Wright-Fisher process

(15)
(16)

The coalescent process

(17)
(18)

Important properties

At each generation, probability that no coalescence occurs is

qN(n) = n−1 Y i =1 (1 − i N) = 1 − n(n − 1) 2N + O( 1 N2)

Coalescence time TkN (2 ≤ k ≤ n) has geometric distribution

P(TkN > t) = (qN(k))t

All lineages coalesce at the same rate.

Number of mutations on a branch of length t is Binomial B(t, µ), µ mutation rate per meiosis and per nucleotide (bioliogically known).

(19)

Kingman’s coalescent (1982)

N → +∞, rescaled time τ = Nt

Coalescence time TkN tends to Tk, with exponential

distribution

P(Tk > τ ) = e−

k(k−1)

2 τ

Coalescence events imply one single pair of lineages. Number of mutations on a branch of length τ is Poisson P(θ2τ ), with θ = 2Nµ (population scaled mutation rate).

(20)

Advantages of the coalescent approach

Very efficient way to simulate genetic data. Easily extended to more complex models (variable population sizes, structured populations . . . ).

Used to express the likelihood of observed genetic data. Provides conceptual framework to understand the influence of some evolutionary parameters on observed genetic data.

(21)

Outline

1 Data and objectives

2 Single population model

3 Estimation of population size from single locus data

(22)

Constant population size : intuition

larger N → longer coalescence times → more mutations.

(23)

Constant population size : the Watterson estimator

Sn number of polymorphic sites in a sample of n DNA

sequences of length L. θW = 1 LSn( n−1 X k=1 1 k) −1 E[θW] = θ = 2Nµ.

As µ is known, this provides an unbiased estimator of N. Var(θW) = (θ2 Pn−1 k=1 k12 +Lθ Pn−1 k=1 1k)( Pn−1 k=1k1) −2

(24)

Variable population size : intuition

Population expansion → larger coalescence times in the recent past → higher proportion of derived alleles at low frequency.

Population decline → larger coalescence times in the distant past → higher proportion of derived alleles at intermediate frequency.

(25)

Variable population size : estimation approaches

Likelihood: P(D | N()) = X T P(D | T )P(T | N())

D observed sequences, N() population size history, T coalescence tree. No analytical expression P(D | N()) ≈ I X i =1 P(D | Ti) Ti simulated from P(T | N()). High dimension of T

(26)

Outline

1 Data and objectives

2 Single population model

3 Estimation of population size from single locus data

4 Estimation of population size from whole-genome sequences

(27)

Recombination

In diploid species (e.g. humans), recombination during meiosis → each gamete is a mixture of two sequences, one

inherited from the mother and one from the father.

Negligible at short distance (single locus), but important when studying whole-genome sequences.

(28)

The Ancestral Recombination Graph (ARG)

(29)

Consequences for inference

Coalescence trees at two distinct loci are neither similar nor independent.

Complex correlation structure.

Dimension of the space of genealogies explodes, previous MCMC or IS approaches no longer possible.

(30)

The Approximate Bayesian Computation (ABC) approach

First proposed by Beaumont (2002), allows estimating

parameters θ of a model when likelihood cannot be evaluated. Approximate the posterior P(θ|D) by the posterior P(θ|S), for a set S of (meaningfull!) summary statistics.

Estimate P(θ|S) using intensive simulations:

1 Compute S = f (D) 2 For i from 1 to I:

1 Sample parameter θi from a prior distribution.

2 Simulate dataset Di from the model with parameter θi.

3 Compute Si = f (Di).

4 Select the simulation if dist(Si, S) < .

3 Estimate the posterior distribution of θ from the empirical distribution of selected θi values.

(31)

Application to population sizes (Boitard et al, 2016)

Panmictic population with population size history N(). Large sample of whole-genome sequences from this population.

Approximates P(N()|S) using ABC.

Set of ≈ 50 summary statistics, describing (among others) the distribution of allele frequencies.

(32)

Simulation results

ABC, n=50

generations before present (log scale)

eff ectiv e population siz e (log scale) 0 10 100 1,000 10,000 100,000 100 1,000 10,000 100,000 true value estimations for one replicate 90% CI for one replicate

PSMC : n=2 0 10 100 1,000 10,000 100,000 100 1,000 10,000 100,000 MSMC : n=4 0 10 100 1,000 10,000 100,000 100 1,000 10,000 100,000 ABC, n=25

generations before present (log scale)

eff ectiv e population siz e (log scale) 0 10 100 1,000 10,000 100,000 100 1,000 10,000 100,000 true value estimations for one replicate 90% CI for one replicate

PSMC : n=1 0 10 100 1,000 10,000 100,000 100 1,000 10,000 100,000 MSMC : n=2 0 10 100 1,000 10,000 100,000 100 1,000 10,000 100,000 25 / 29

(33)

Analysis of cattle genomes

eff ectiv e population siz e (log scale) 300 1,000 3,000 10,000 30,000 100,000 BrownSwiss Fleckvieh Holstein SwedishRed Charolais Limousin Hereford Jersey Common trajectory before domestication. Continuous decline since domestication.

Ranking of recent sizes consistent with current knowledge of these breeds.

(34)

Influence of population structure

In real life, populations not isolated.

Relationship between populations affect population size estimations.

→ Identifiability issue: can we distinguish population size changes and population structure from genetic data? Mazet et al (2016): not from two genomes, because population size change models can reproduce every possible distribution of T2.

Important conclusion, because one popular estimation method (Li and Durbin, 2011) is based on this distribution.

Distinction would be in theory possible from the joint distribution of (T2, T3) (Grusea et al, in prep.).

(35)

Conclusions

Genetic data informative about species history.

Population genetics: a very active field of research, interface between biology and applied mathematics.

Contributes to answer fundamental questions about human (and other species) history.

Many theoretical and computational issues to be solved. Massive amount of data to be analyzed.

(36)

Our research group in Toulouse

INRA: Bertrand Servin, Simon Boitard

INSA/IMT: Olivier Mazet, Simona Grusea, Willy Rodriguez, Didier Pinchon

EDB : Loun`es Chikhi

Références

Documents relatifs

Tibetan Terriers formed two clusters: cluster (1) includes the native population and its F1 cross with western Tibetan Terriers, which are grouped together with the more ancient

Under differential fixation of duplicated loci, populations with lower poly- morphism levels, such as the inbred white koi (gene diversity of 0.13 using AFLP), can still be assessed

Inferring the ancestral dynamics of population size from genome wide molecular data - an ABC approach..

Genome wide sequence data contains rich information about population size history, cf PSMC (Li and Durbin, 2011).... Development of an

Simon Boitard, Willy Rodríguez, Flora Jay, Stefano Mona, Frédéric Austerlitz. Inferring Popula- tion Size History from Large Samples of Genome-Wide Molecular Data - An

Inferring Population Size History from Large Samples of Genome-Wide Molecular Data - An Approximate Bayesian Compu- tation Approach.. conférence Jaques Monod ”Coalescence des

To do so, we merged our Cuban dataset with publicly available whole-genome and genome-wide SNP datasets for European, African, and Native American ancestries (called

Outside this window however, population size histo- ries estimated by MSMC often had a totally different trend than the true history, (see for instance the small constant size or