• Aucun résultat trouvé

Accounting for Linkage Disequilibrium in genome scans for selection without individual genotypes : the local score approach

N/A
N/A
Protected

Academic year: 2021

Partager "Accounting for Linkage Disequilibrium in genome scans for selection without individual genotypes : the local score approach"

Copied!
33
0
0

Texte intégral

(1)

HAL Id: hal-02789038

https://hal.inrae.fr/hal-02789038

Submitted on 5 Jun 2020

HAL

is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire

HAL, est

destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Accounting for Linkage Disequilibrium in genome scans for selection without individual genotypes : the local

score approach

Maria Inès Fariello Rico, Simon Boitard, Sabine Mercier, Magali San Cristobal

To cite this version:

Maria Inès Fariello Rico, Simon Boitard, Sabine Mercier, Magali San Cristobal. Accounting for

Linkage Disequilibrium in genome scans for selection without individual genotypes : the local score

approach. Methods in Ecology and Evolution, May 2017, Montpellier, France. �hal-02789038�

(2)

Accounting for Linkage Disequilibrium in genome scans for selection without individual genotypes:

the local score approach

Mar´ıa In´ es Fariello

1

, Simon Boitard

2

, Sabine Mercier

3

, Magali San Cristobal

4

1: Facultad de Ingenier´ıa, Universidad de la Rep´ublica, Montevideo, Uruguay

2:en´etique, Physiologie et Syst`emes d’Elevage (GenPhySE), INRA Toulouse

3: Institut de Math´ematiques de Toulouse (IMT), Universit´e de Toulouse

4: Dynamiques et ´ecologie des paysages agriforestiers (Dynafor), INRA Toulouse

Models in Ecology and Evolution, May 30-31, 2017

(3)

Outline

1

Motivations

2

The FLK & hapFLK approaches

3

The local score approach

4

Simulation results

5

Examples

6

Conclusions

2 / 30

(4)

Outline

1

Motivations

2

The FLK & hapFLK approaches

3

The local score approach

4

Simulation results

5

Examples

6

Conclusions

(5)

Genome scans for selection

Most genomic regions are neutral, but some of them are (or have been) under selection (natural or artificial).

Detecting the regions under selection is important for theory (evolution) and applications (medicine, agronomy).

Genome wide scans for selection now possible from dense genotyping (SNP chips) or sequencing (NGS) data.

Focus on positive (adaptative) selection.

4 / 30

(6)

Population differentiation approach

Look for markers with contrasted allele frequencies between

populations.

(7)

Population differentiation approach

Look for markers with contrasted allele frequencies between populations.

5 / 30

(8)

Linkage Disequilibrium (LD) helps!

Single-marker statistics have a large variance, high values can be reached just by chance due to drift.

Due to LD, markers in the neighborhood of a selected locus also show elevated differentiation between populations.

Account for LD in selection scans by:

1

using haplotype tests

2

looking for clusters of markers with high differentiation

(9)

Windowing approaches

Cut the genome into fixed windows and computes a summary of the single-marker statistics within each window.

Summarize each window using:

the average of single-marker statistics (Weiret al, 2005).

the number of markers exceeding a given threshold (Myleset al, 2008).

the number of markers differentially fixed between populations (Johanssonet al, 2010).

Individual genotypes not required (pooled sequencing).

Limitations:

How to choosewindow size? thesingle-marker threshold?

How to decide that awindowisunder selection?

Overcome these issues using the statistical local score theory.

7 / 30

(10)

Outline

1

Motivations

2

The FLK & hapFLK approaches

3

The local score approach

4

Simulation results

5

Examples

6

Conclusions

(11)

F

ST

based tests

p

= (p

1, . . . ,pi, . . . ,pn

): allele frequencies at one SNP in several populations.

p

and

sp2

: observed mean and variance of

p.

FST

=

s

p2

¯p(1−¯p)

H0

: “neutral evolution” (genetic drift)

vs

H1

: “positive selection in one (or more) population ”.

H0

rejected if

FST

too large.

9 / 30

(12)

Lewontin et Krakauer (LK) test (1973)

LK

=

n−

1

F

¯

ST FST

• LK

distribution under

H0

is

χ2

with

n−

1 degrees of freedom.

But, only true if populations have a star like phylogeny with

equal population sizes.

(13)

FLK test (Bonhomme et al, 2010)

Extension of LK accounting for

differences in effective size between populations.

differences in correlations between population pairs.

(first estimated from genome wide data)

11 / 30

(14)

hapFLK test (Fariello et al, 2013)

Define local haplotypes around each SNP position using the model of Scheet and Stephens (2006).

Compute haplotype frequencies in each population.

Apply FLK, considering haplotypes as alleles.

(15)

Detection power

4 populations with hierarchical structure, 1 under selection.

13 / 30

(16)

Outline

1

Motivations

2

The FLK & hapFLK approaches

3

The local score approach

4

Simulation results

5

Examples

6

Conclusions

(17)

Definition

For each marker

m, define the

score:

Xm

=

−log

10(p

m

)

−ξ

pm

p-value of a test for selection, fixed threshold.

Low p-value =

H0

(neutral evolution) unlikely = high score.

Cumulate scores using the so-called Lindley process:

h0

= 0,

hm

=

max

(0,

hm−1

+

Xm

)

Look for local maxima of the Lindley process, which are asociated to genomic regions that are enriched in high scores / low p-values.

Here

pm

is the p-value of FLK.

15 / 30

(18)

Example

The Lindley process (black line) has several excursions above 0 (local maxima).

The global maximum (H

L

) is called the local score.

Each excurion is associated to an interval enriched in high

scores (in green).

(19)

choosing ξ

p-value threshold in log10 scale.

Ex:

ξ

= 2 cumulates p-values below 10

−2

.

For high

ξ, only

most significant markers contribute:

similar to single point approach.

strong selection.

For low

ξ, more markers contribute:

longer intervals.

recent selection.

17 / 30

(20)

Statistical evidence for selection

How likely is a given excursion under neutrality?

Depends on:

the number of markers in the sequence (M).

the correlation between scores (ρ).

We provided two approaches allowing to compute significance thresholds for excursions :

1 analytical formula: valid if single-marker p-values are unifrom under neutrality.

2 re-sampling approach: valid for all datasets, but requires some computing time.

(21)

Outline

1

Motivations

2

The FLK & hapFLK approaches

3

The local score approach

4

Simulation results

5

Examples

6

Conclusions

19 / 30

(22)

Simulation procedure

Two populations with same effective size, one neutral and one under selection.

Genomic region of 10Mb with one selected site.

Several statistics compared, in different scenarios.

Detection threshold of each statistic such that selection is detected in 5% of the neutral samples (type I error 5%).

For the local score, also computed using our re-sampling approach

observed type I error 6%.

Tunning parameters (window size,

ξ

. . . ) chosen to optimize

detection power.

(23)

Results

21 / 30

(24)

Results

(25)

Outline

1

Motivations

2

The FLK & hapFLK approaches

3

The local score approach

4

Simulation results

5

Examples

6

Conclusions

22 / 30

(26)

Lactase region in Humans

Test of selection based on HapMap genotypes (Europea and Asia).

●●●

●●●●

●●●●●

●●

●●●●

●●

●● ●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●● ●●●

●●

●●

●●

●●

●●

●●●●

●●●

●●

●●●●

●●

●●

●● ●● ●●●● ●●●●

1.34e+08 1.35e+08 1.36e+08 1.37e+08 1.38e+08

01234

Position

−log10(p−val)

1.34e+08 1.35e+08 1.36e+08 1.37e+08 1.38e+08

02468

Position

Lindley Process

●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●

●●

●●

●●●●●●●●●●●●

●●●●●●●●●●● ●●●●●●●●●●●●●●

● ●●●●●●●●

●●●● ●●●●●●

●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●● ●●●●●●●●●●●●●●●● ●

●●●●●

●●

●●●●●●●●●●●●●● ●●●

●●

●●●●●●●●●●

●●●●●

●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●● ●●●●●●●● ●●●●●●●

●●●●●

●●●●●●

●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●● ●●● ●●●● ●●●●

1.34e+08 1.35e+08 1.36e+08 1.37e+08 1.38e+08

12345

Position

hapFLK

(27)

Divergent selection experiment on behaviour in Quail

Pooled DNA from each line sequenced at generation 50 Strong drift (F = 0.4).

24 / 30

(28)

Selection scan on chromosome 1

(29)

Significant regions genome-wide

Chr.

Position L (kb) Genes

1 92,963,481-93,182,440 219 NSUN3, ARL13B

2 1,584,033-1,688,400 104 VIPR1

3 61,586,217-61,604,464 19 ECHDC1, RNF146

3 75,088,250-75,170,494 82 MMS22L

4 11,412,372-11,452,609 40 GLOD5

4 90,953,044-91,008,245 56 CTNNA2

6 35,234,870-35,336,720 102 FOXI2, PTPRE 6 6,311,718-6,644,395 333 UBE2D1, CISD1, IPMK 10 17,825,157-17,825,227 0.07

25 1,296,647-1,296,706 0.059

Genes in bold have been associated to autistic disorders or behavorial traits in Humans.

26 / 30

(30)

Outline

1

Motivations

2

The FLK & hapFLK approaches

3

The local score approach

4

Simulation results

5

Examples

6

Conclusions

(31)

Detecting selection using the local score

Accounts for LD whithout individual genotypes.

One single tunning parameter,

ξ, with intuitive interpretation.

ξ

= 1 recommended for detection power.

Statistical significance of candidate regions easy to compute.

Increased detection power compared to single-marker, window-based or haplotype-based tests.

Convincing results on 2 real datasets with different features.

Can be applied to any single-marker test providing p-values, for selection scans or any other context.

Ref: Fariello

et al, Molecular Ecology 2017.

28 / 30

(32)

Acknowledgements

Quail husbandry and sampling:

C´ecile Arnould & Christine Leterrier, Unit´e de Physiologie de la Reproduction et des Comportements, INRA Tours

Julien Recoquillay, Unit´e de Recherches Avicoles, INRA Tours David Gourichon, Pˆole

d’Exp´erimentation Avicole, INRA Tours

Computing Facilities:

Genotoul bioinformatics platform Toulouse Midi-Pyr´en´ees.

DNA preparation and sequencing:

Olivier Bouchez & G´erald Salin, GeT-PlaGe Genotoul, INRA Toulouse Sophie Leroux & Fr´ed´erique Pitel, GenPhySE, INRA Toulouse

Bioinformatic and statistic analyses:

Patrice Dehais, SIGENAE, INRA Toulouse

David Robelin & Thomas Faraut, GenPhySE, INRA Toulouse

(33)

Advertising

PhD position available at Toulouse, from september 2017.

Supervised by Loun` es Chikhi (Evolution et Diversit´ e Biologique) and Olivier Mazet (INSA).

Influence of population structure on past population size estimation (Mazet

et al, Heredity 2017).

30 / 30

Références

Documents relatifs

The markers found to surround Bru1 within a 12 cM window were surveyed in 405 international sugarcane cultivars that were also phenotyped for rust resistance in Réunion Island

Marker haplotypes explain a much higher proportion of variance than single markers, especially at low F values. The third marker added significantly to both P 2 and P

VarLD graph, LD heatmaps, and LD variation graphical results for the region showing a signal in all indicine/taurus comparisons on BTA6, between 81.5 and 81.7 Mb, inside the

Here, we present the LD decay curves for SNPs on bo- vine autosomes and the X chromosome for three genetic groups of cattle breeds: Bos taurus (taurine), Bos indicus (indicine) and

The design of two half sib families each of size 64 was further investigated for a hypothetical population with e ffective size of 1000 simulated for 6000 generations with a

Methods: In this study, we used the Illumina OvineSNP50 BeadChip array to estimate and compare LD (measured by r 2 and D′), N e , heterozygosity, F ST and ROH in five

To cite this version : Fariello Rico, María Inés and Boitard, Simon and Mercier, Sabine and San Cristobal, Magali Integrating single marker tests in genome scans for selection :

Unité de recherche INRIA Rennes, Irisa, Campus universitaire de Beaulieu, 35042 RENNES Cedex Unité de recherche INRIA Rhône-Alpes, 655, avenue de l’Europe, 38330 MONTBONNOT ST