
Genotyping-By-Sequencing (GBS): a simple method for a multitude of markers


HAL Id: hal-02789479

https://hal.inrae.fr/hal-02789479
Submitted on 5 Jun 2020

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Genotyping-By-Sequencing (GBS): a simple method for a multitude of markers

Philippe Barre, Tom Ruttink

To cite this version:

Philippe Barre, Tom Ruttink. Genotyping-By-Sequencing (GBS): a simple method for a multitude of markers. Training workshop “Genotyping, phenotyping, data management and analysis in plant breeding”, 2019. ⟨hal-02789479⟩


In the frame of the H2020 EUCLEG project “Breeding forage and grain legumes to increase EU’s and China’s protein self-sufficiency”

This project has received funding from the European Union’s Horizon 2020 Programme for Research & Innovation under grant agreement n°727312.

Genotyping-By-Sequencing (GBS): a simple method for a multitude of markers

Training workshop “Genotyping, phenotyping, data management and analysis in plant breeding” 31 October & 1 November 2019 – Brno, Czech Republic

Sequencing as a tool for genotyping

Getting the genotypic classes or the allele frequencies at different loci of a sample

Necessity to reduce the complexity of the genome

Possibility to re-sequence the whole genome if you have enough money, once you have a reference sequence:

- 1000 $ for 2X = 6.4 × 10^9 bp for human
- 562 $ for 4X = 3.6 × 10^9 bp for alfalfa
- 93 $ for 2X = 0.6 × 10^9 bp for flax
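The three prices quoted above imply a roughly constant cost per sequenced base. The following is a minimal sketch of that arithmetic, using only the figures from the slide; the genome sizes it prints are derived by dividing the sequenced amount by the coverage and are not stated in the original.

```python
# Figures quoted on the slide: (cost in $, coverage, total bp sequenced)
QUOTES = {
    "human": (1000, 2, 6.4e9),
    "alfalfa": (562, 4, 3.6e9),
    "flax": (93, 2, 0.6e9),
}


def cost_per_gb(cost_usd, total_bp):
    """Cost in $ per 10^9 bp sequenced."""
    return cost_usd / (total_bp / 1e9)


def implied_genome_size(coverage, total_bp):
    """Genome size (bp) implied by 'coverage X = total bp sequenced'."""
    return total_bp / coverage


for species, (cost, coverage, total_bp) in QUOTES.items():
    print(
        species,
        round(cost_per_gb(cost, total_bp)), "$ per Gb,",
        implied_genome_size(coverage, total_bp) / 1e9, "Gb genome",
    )
```

Under these figures the price is close to 156 $ per Gb in all three cases, which is why reducing the fraction of the genome that is sequenced reduces the cost per sample.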

Necessity to reduce the complexity of the genome

How to reduce the complexity:

- Genes: sequencing of mRNA (choice of plant material)
- Targeting specific loci: amplicon sequencing or capture (design of primers or probes)
- Random sequences: use of restriction enzymes

BUT patents from KEYGENE: SNP identification by using restriction enzymes and high-throughput sequencing methods

EP 1910562, EP 1966393, EP 2002017, EP 2292788, EP 2789696, EP 2963127, EP 2302070

Principle of GBS

1/ Digestion of genomic DNA at the restriction enzyme sites
2/ Ligation of adapters, each with a barcode specific to a sample (for example, a barcode specific to sample 1)
3/ Amplification of the fragments < 450 bp
4/ Sequencing

[Figure: the same region in sample 1 and sample 2, with restriction enzyme sites, adapters and primer. A Single Nucleotide Polymorphism (SNP) distinguishes the samples: A in sample 1, G in sample 2. Loss of a restriction site in a sample: MD (missing data).]

AFLP (Amplified Fragment Length Polymorphism) → Electrophoresis:

Principle of GBS

5/ Demultiplexing with barcodes: many sequences for each sample, in files .fastq

[Figure: a .fastq record, with its header, sequence and quality lines]

6/ Cleaning
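Step 5/ works on .fastq records, each made of four lines: a header, the sequence, a "+" separator and a quality string. The sketch below shows one way reads could be assigned to samples from a barcode at the start of the read. The barcodes, sample names and reads are invented for illustration, and the matching is exact, whereas dedicated demultiplexing tools also handle sequencing errors in the barcode.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from a 4-line-per-record .fastq stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def demultiplex(records, barcodes):
    """Group reads per sample by exact barcode match at the 5' end, removing the barcode."""
    per_sample = {sample: [] for sample in barcodes.values()}
    unassigned = []
    for header, seq, qual in records:
        for barcode, sample in barcodes.items():
            if seq.startswith(barcode):
                n = len(barcode)
                per_sample[sample].append((header, seq[n:], qual[n:]))
                break
        else:
            unassigned.append((header, seq, qual))
    return per_sample, unassigned


# Hypothetical barcodes and reads, for illustration only
BARCODES = {"ACGT": "sample1", "TTAGC": "sample2"}
FASTQ = (
    "@r1\nACGTCTGCAGGA\n+\nIIIIIIIIIIII\n"
    "@r2\nTTAGCCTGCAGT\n+\nIIIIIIIIIIII\n"
    "@r3\nGGGGCTGCAGTT\n+\nIIIIIIIIIIII\n"
)
per_sample, unassigned = demultiplex(read_fastq(io.StringIO(FASTQ)), BARCODES)
```

Barcodes of different lengths, as in this toy set, are one reason the barcode has to be removed before alignment: what remains starts at the restriction site remnant.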

Principle of GBS

7/ Alignment on a reference sequence: files .sam and .bam

[Figure: reads aligned on the reference, illustrating read depth and coverage]

Fields annotated in the .sam records: sequence name, reference name, position, CIGAR (alignment)

HISEQ8:C5VAYACXX:4:1101:9138:7833 16 MsNRG000127 383005 0 88M * 0 0 CTGCGCTGATTAGCCTGAGAAGAGCACTGACTTGTAGCAGTCTGAACTGCATATGAGCACACTGGGTCCGAGCAGGAAACCAGCGCAG DDDDCDDDDCDDCCEDCEEEFFFFFFHHHEGHJJJIJIJJJJJJJJJIIGJJJJJIHFBGHGGHIHGJJJIJJGIJIIGGHHHGFFFF RG:Z:readscleanadpt.sai XT:A:R NM:i:1 X0:i:2 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:4A83 XA:Z:MsNRG000506,+909328,88M,1;

HISEQ8:C5VAYACXX:4:1101:9015:7835 16 MsNRG000138 746536 0 97M * 0 0 ATGACATTTGTTCTTGACCGTCGGAAAGATTAAGTACGGCATTTTCCAACAGATAGACTAAATGCTTGATATTATTTTCTTCTGATACTGTGTGCTG DEDCDDDCDDDCDDDDDDDFEFGHHHHIE@GJJJIIJJJIHF<JIJJJJJJJIIJJIJJJIHAJJJJJGGJJIJJJJIHHEJIJIIGIHHHHHBFFF RG:Z:readscleanadpt.sai XT:A:R NM:i:1 X0:i:3 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:93T3 XA:Z:MsNRG000119,+373466,97M,1;MsNRG000673,-603245,97M,1;
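To make the annotated fields concrete, here is a minimal sketch that splits the first alignment record shown above into the eleven mandatory SAM fields and its optional tags. It splits on whitespace because the record is printed with spaces here; real .sam files are tab-separated, and in practice a library such as pysam or the samtools command line would normally be used instead of hand-written parsing.

```python
import re

# First alignment record from the slide
SAM_LINE = (
    "HISEQ8:C5VAYACXX:4:1101:9138:7833 16 MsNRG000127 383005 0 88M * 0 0 "
    "CTGCGCTGATTAGCCTGAGAAGAGCACTGACTTGTAGCAGTCTGAACTGCATATGAGCACACTGGGTCCGAGCAGGAAACCAGCGCAG "
    "DDDDCDDDDCDDCCEDCEEEFFFFFFHHHEGHJJJIJIJJJJJJJJJIIGJJJJJIHFBGHGGHIHGJJJIJJGIJIIGGHHHGFFFF "
    "RG:Z:readscleanadpt.sai XT:A:R NM:i:1 X0:i:2 X1:i:0 XM:i:1 XO:i:0 XG:i:0 "
    "MD:Z:4A83 XA:Z:MsNRG000506,+909328,88M,1;"
)

SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]


def parse_sam_line(line):
    """Split one alignment line into the 11 mandatory SAM fields plus optional tags."""
    parts = line.split()
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    # Optional fields have the form TAG:TYPE:VALUE, for example NM:i:1
    record["tags"] = {}
    for field in parts[11:]:
        tag, value_type, value = field.split(":", 2)
        record["tags"][tag] = int(value) if value_type == "i" else value
    return record


def parse_cigar(cigar):
    """Return the CIGAR string as a list of (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


rec = parse_sam_line(SAM_LINE)
is_reverse = bool(rec["flag"] & 0x10)  # FLAG bit 0x10: read aligned to the reverse strand
```

For this read, the CIGAR 88M means 88 aligned bases without insertion or deletion, and the tag MD:Z:4A83 locates the single mismatch counted by NM:i:1: 4 matching bases, a reference A at the mismatch, then 83 matching bases. The mapping quality of 0 together with the XA tag indicates that the read also aligns elsewhere.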

Principle of GBS

8/ Variant calling: files .vcf

[Figure: stacks of reads compared with the reference, some identical to the reference (= ref) and some loci with missing data (MD). At position 452, sample 1 has 1 read = ref and 4 reads = alt.]

- Not enough reads: the minimum number of reads has to be defined
- Interpretation of the allele frequencies
- Focus on biallelic SNP

Variant calling: sample 1 at position 452, ref: A, alt: G, 1 ref read, 4 alt reads, genotype: GG or 1/1

What about the positions with enough reads but identical to the reference? File .gvcf

508 218848 . A G 3834.26 . BaseQRankSum=0.913;ClippingRankSum=-0.308;DP=302;MLEAC=2,0;MLEAF=0.500,0.00;MQRankSum=-2.242;RAW_MQ=413438.00;ReadPosRankSum=-6.469 GT:AD:DP:GQ:PL:SB 1/1:187,115:302:18:3861,18,0,356,6642,4094,346,559,6786,4421,905,6989,4980,7335,11410:115,72,115,0

508 218849 . C . . END=218873 GT:DP:GQ:MIN_DP:PL 0/0:304:50:278:0,50,120,241,1800
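The role of the minimum number of reads can be illustrated with a deliberately naive genotype caller for one biallelic SNP in a diploid sample. This is only a sketch: the function name, the depth threshold and the allele-fraction cut-offs are assumptions chosen for illustration, and real variant callers rely on probabilistic models of sequencing error rather than fixed cut-offs.

```python
def call_genotype(ref_reads, alt_reads, min_reads=5, het_low=0.25, het_high=0.75):
    """Return a VCF-style diploid genotype from read counts at one biallelic SNP.

    './.' (missing data) is returned when the read depth is below min_reads.
    All thresholds are illustrative assumptions, not values from the slides.
    """
    depth = ref_reads + alt_reads
    if depth < min_reads:
        return "./."
    alt_fraction = alt_reads / depth
    if alt_fraction < het_low:
        return "0/0"  # homozygous for the reference allele
    if alt_fraction > het_high:
        return "1/1"  # homozygous for the alternative allele
    return "0/1"      # heterozygous


for counts in [(1, 4), (6, 0), (5, 5), (3, 0)]:
    print(counts, call_genotype(*counts))
```

With these assumed thresholds, 1 ref read and 4 alt reads give 1/1, while 3 reads in total are reported as missing data. Raising min_reads makes calls safer but increases the amount of missing data.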

Principle of GBS

[Figure: read counts per allele for sample 1 and sample 2 at position 400 and position 452 (for example 6 = ref, 0 = alt; 5 = ref, 0 = alt; 1 = ref, 4 = alt; 0 = ref, 5 = alt), with missing data (MD) at some loci. The .vcf file reports the variant positions only, whereas the .gvcf file also reports the positions identical to the reference.]

Principle of GBS

9/ Merge variant calling files for all samples: files .vcf

Annotated columns: chromosome, position, ref, alt, then one field per sample (for example sample 1 and sample 4)

scaffold_0|ref0040988 166641 . A C . . . GT:AD:DP:GC 0/0:146,0:146:85.5387987028 0/1:106,180:286:228.995630731 0/0:345,0:345:204.97487344 1/1:0,205:205:120.911257016 1/1:0,202:202:119.11357202 0/0:11,0:11:5.32928395111 0/0:231,0:231:136.513314158 1/1:22,973:995:999 0/0:100,0:100:58.0077078195 ./.:3,0:3:.

scaffold_0|ref0040988 166656 . A T . . . GT:AD:DP:GC 0/0:145,0:145:84.9367387115 0/0:287,0:287:170.134951164 0/0:347,0:347:206.176493884 0/0:207,0:207:122.11119102 0/0:202,0:202:119.11357202 0/0:11,0:11:5.32928395111 0/0:231,0:231:136.513314158 0/0:1001,1:1002:57.4056478282 0/0:100,0:100:58.0077078195 ./.:3,0:3:.

10/ Filtering: check the missing data per SNP and per sample

            Chro_position1   Chro_position2
Sample 1    0                1

Ready for genetic analyses!
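Steps 9/ and 10/ can be sketched on the two merged records shown above: each sample field is decoded using the FORMAT column, genotypes are recoded as the number of alt alleles, and the proportion of missing data is computed per SNP and per sample. The function names and the 0/1/2 coding are assumptions made for this example; a real .vcf file is tab-separated and has header lines, which a dedicated library would handle.

```python
VCF_LINES = [
    "scaffold_0|ref0040988 166641 . A C . . . GT:AD:DP:GC "
    "0/0:146,0:146:85.5387987028 0/1:106,180:286:228.995630731 "
    "0/0:345,0:345:204.97487344 1/1:0,205:205:120.911257016 "
    "1/1:0,202:202:119.11357202 0/0:11,0:11:5.32928395111 "
    "0/0:231,0:231:136.513314158 1/1:22,973:995:999 "
    "0/0:100,0:100:58.0077078195 ./.:3,0:3:.",
    "scaffold_0|ref0040988 166656 . A T . . . GT:AD:DP:GC "
    "0/0:145,0:145:84.9367387115 0/0:287,0:287:170.134951164 "
    "0/0:347,0:347:206.176493884 0/0:207,0:207:122.11119102 "
    "0/0:202,0:202:119.11357202 0/0:11,0:11:5.32928395111 "
    "0/0:231,0:231:136.513314158 0/0:1001,1:1002:57.4056478282 "
    "0/0:100,0:100:58.0077078195 ./.:3,0:3:.",
]

# Number of alt alleles per genotype; './.' is absent on purpose and decodes to None
GT_CODES = {"0/0": 0, "0/1": 1, "1/1": 2}


def parse_record(line):
    """Return (SNP name, list of per-sample dicts keyed by the FORMAT column)."""
    fields = line.split()
    chrom, pos = fields[0], fields[1]
    keys = fields[8].split(":")
    samples = [dict(zip(keys, sample.split(":"))) for sample in fields[9:]]
    return f"{chrom}_{pos}", samples


def genotype_matrix(lines):
    """Return {SNP name: [0, 1, 2 or None for each sample]}."""
    matrix = {}
    for line in lines:
        name, samples = parse_record(line)
        matrix[name] = [GT_CODES.get(sample["GT"]) for sample in samples]
    return matrix


def missing_rate_per_snp(matrix):
    return {snp: calls.count(None) / len(calls) for snp, calls in matrix.items()}


def missing_rate_per_sample(matrix):
    rows = list(matrix.values())
    return [sum(row[i] is None for row in rows) / len(rows) for i in range(len(rows[0]))]


matrix = genotype_matrix(VCF_LINES)
```

In this example each SNP has 10% missing data, and the last sample, with only 3 reads at both positions, is missing everywhere: it would be a candidate for removal or re-sequencing.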

Principle of GBS

Libraries preparation
1/ Digestion
2/ Ligation of adapters
3/ Amplification

Sequencing
4/ Sequencing: file .fastq for all samples

Bioinformatics
5/ Demultiplexing: files .fastq per sample
6/ Cleaning = trimming: files .fastq per sample
7/ Alignment on a reference sequence: files .sam and .bam per sample
8/ Variant calling: files .vcf or .gvcf per sample
9/ Merge variant calling: file .vcf for all samples
10/ Filtering: file .csv or .txt for all samples

To be done by or with the geneticist

Library preparation

Library preparation (Elshire et al., 2011)

Several PCR could be performed on

Essential links between :

Sequencing Illumina Hiseq 2000

[Figure: sequencing on the Illumina flow cell, with labels O2, single read and read 2]

Library preparation: an example of adapters and primers

[Figure: library construct made of adapter P1, barcode, insert and adapter P2, with primer P1, O1, index 5 and index 7, and the primers for sequencing]

For dactylis:
- 29 barcodes
- 5 i7
- 2 i5
This gives 29 × 5 × 2 = 290 samples. Barcodes and indexes have to be compatible.

1 lane of Novaseq (about 3.9 × 10^9 reads, about 11 k€): on average 13 × 10^6 reads per sample, ranging from 2 to 25 M. Sequence again some samples.

Per sample, if the target is 10 M reads: library + sequencing about 60 €, without staff.
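The multiplexing arithmetic of this example can be checked with a short sketch. The barcode and index names are placeholders invented here; only the counts (29, 5, 2), the lane output and the lane price come from the slide.

```python
from itertools import product

# Placeholder names: only the numbers of barcodes and indexes come from the slide
barcodes = [f"BC{i:02d}" for i in range(1, 30)]  # 29 inline barcodes
i7_indexes = [f"i7_{i}" for i in range(1, 6)]    # 5 i7 indexes
i5_indexes = [f"i5_{i}" for i in range(1, 3)]    # 2 i5 indexes

# Every sample gets a unique (barcode, i7, i5) combination
combinations = list(product(barcodes, i7_indexes, i5_indexes))
n_samples = len(combinations)

reads_per_lane = 3.9e9   # about 3.9 × 10^9 reads for one lane
lane_cost_eur = 11_000   # about 11 k€

mean_reads_per_sample = reads_per_lane / n_samples
sequencing_cost_per_sample = lane_cost_eur / n_samples

print(n_samples, "samples")
print(round(mean_reads_per_sample / 1e6, 1), "M reads per sample on average")
print(round(sequencing_cost_per_sample, 1), "€ of sequencing per sample")
```

An even split of the lane gives about 13.4 M reads per sample. The observed range of 2 to 25 M shows how uneven the real distribution is, which is why some samples have to be sequenced again.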

Library preparation (Elshire et al., 2011)

- Amount of adapters
- Clean: discard small fragments

Choice of restriction enzyme and number of reads

Enzymes compared: ApeKI (5 bp, W = A or T) and PstI (6 bp)

[Figure: number of loci as a function of the number of reads (millions), according to the read depth; stacks of reads in sample 1 and sample 2, with missing data (MD)]

- About 50000 loci with few MD
- About 200000 loci with many MD

Citation: Byrne S, Czaban A, Studer B, Panitz F, Bendixen C, Asp T (2013) Genome Wide Allele Frequency Fingerprints (GWAFFs) of Populations via Genotyping by Sequencing. PLoS ONE 8(3): e57438. https://doi.org/10.1371/journal.pone.0057438
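Why a 5 bp cutter yields more loci than a 6 bp cutter can be approximated with a simple probability model. The sketch below assumes equal base frequencies and independent positions, and uses the recognition sites GCWGC for ApeKI and CTGCAG for PstI, which are not written out on the slide. The 1 Gb genome size is a hypothetical value. Real numbers of loci also depend on base composition, methylation sensitivity and fragment size selection, so this gives orders of magnitude only.

```python
# Number of bases matched by each IUPAC code used below
IUPAC_MATCHES = {"A": 1, "C": 1, "G": 1, "T": 1, "W": 2, "S": 2, "R": 2, "Y": 2, "N": 4}


def site_probability(site):
    """Probability that a random position matches the site, with equal base frequencies."""
    probability = 1.0
    for code in site:
        probability *= IUPAC_MATCHES[code] / 4
    return probability


def expected_sites(site, genome_size_bp):
    """Expected number of recognition sites in a random genome of the given size."""
    return genome_size_bp * site_probability(site)


GENOME_SIZE = 1_000_000_000  # hypothetical 1 Gb genome
for enzyme, site in [("ApeKI", "GCWGC"), ("PstI", "CTGCAG")]:
    print(enzyme, site, round(expected_sites(site, GENOME_SIZE)))
```

Under this model ApeKI sites are expected once every 512 bp and PstI sites once every 4096 bp, an eightfold difference. With a fixed number of reads, more loci means fewer reads per locus and therefore more missing data.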

Choice of restriction enzyme and number of reads

Depends on:

- the number of markers needed
- the capacity to predict MD
- the funding

Test of different restriction enzymes and pairs of enzymes

[Figure: number of stacks with read depth > 10 as a function of the number of reads, for Dactylis and alfalfa]

- PstI: 10000 stacks with 10 M reads
- PstI-MseI: 17000 stacks with 10 M reads

GBS in order to estimate allele frequencies on pools

[Figure: two scatter plots comparing the allele frequencies estimated on 48 individuals with the allele frequencies estimated on a pool of DNA from 48 individuals, and on a pool of leaves from 48 individuals]

Regression lines shown on the plots:
Y = 0.93X + 0.05, R² = 0.92, prediction interval: ± 0.12
Y = 0.96X + 0.02, R² = 0.98, prediction interval: ± 0.10

- Good estimation of allele frequencies on pools
- Not a good estimation for AF < 0.05 or > 0.95
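On a pool, the allele frequency at a SNP is usually taken as the share of reads carrying the alt allele. A minimal sketch, assuming that simple read-count estimator and using the 0.05 and 0.95 limits quoted above to flag estimates that should not be trusted; the function names and the example counts are invented for illustration.

```python
def pool_allele_frequency(ref_reads, alt_reads):
    """Alt allele frequency in a pool, estimated as alt reads / total reads."""
    depth = ref_reads + alt_reads
    if depth == 0:
        return None  # no read at this position: missing data
    return alt_reads / depth


def is_reliable(allele_frequency, low=0.05, high=0.95):
    """True when the estimate lies in the range reported as well estimated on pools."""
    return allele_frequency is not None and low <= allele_frequency <= high


# Hypothetical read counts (ref, alt) at three SNPs of one pool
for ref_reads, alt_reads in [(75, 25), (98, 2), (0, 0)]:
    af = pool_allele_frequency(ref_reads, alt_reads)
    print(ref_reads, alt_reads, af, is_reliable(af))
```

The precision of such an estimate depends on the read depth at the position, so a minimum depth filter is normally applied before the frequencies are compared between pools.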

Example of using GBS on pools of leaves in alfalfa

[Figure: differentiation among alfalfa varieties, with labels Estonia, Germany, France, Canada, Grazing and Turf]

FST: 0.075 to 0.203, P < 0.05

Julier, B. et al. (2018) Use of GBS markers to distinguish among lucerne varieties, with comparison to morphological traits. Molecular Breeding 38: 133-145

Bioinformatics

Bioinformatics: a multitude of software tools on Linux

Possibility with Windows: CLC genomics, but there is a problem when merging many samples

Demultiplexing (often performed by the sequence provider), giving the sequences per sample: sabre, GBSX, Stacks (process_radtags)

Cleaning:

- Remove adapters: cutadapt, scythe
- Remove bad quality reads or parts of reads: sickle
- Remove too short reads: sickle

1 file of 10 M reads: 2.5 GB (800 MB compressed)
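The cleaning step can be illustrated with a simplified quality trimmer. This sketch is not the algorithm of sickle or of any other tool listed above: it assumes Phred+33 quality encoding, cuts the read at the first window whose mean quality falls below a threshold, and discards reads that end up too short. The thresholds and the example read are arbitrary.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 encoding assumed)."""
    return [ord(char) - offset for char in quality]


def trim_3prime(seq, qual, min_quality=20, window=4):
    """Cut the read at the first window whose mean quality drops below min_quality."""
    scores = phred_scores(qual)
    if not scores:
        return seq, qual
    cut = len(seq)
    for start in range(max(len(scores) - window + 1, 1)):
        chunk = scores[start:start + window]
        if sum(chunk) / len(chunk) < min_quality:
            cut = start
            break
    return seq[:cut], qual[:cut]


def clean(records, min_length=30, **trim_options):
    """Trim every (header, seq, qual) record and drop the reads that become too short."""
    kept = []
    for header, seq, qual in records:
        seq, qual = trim_3prime(seq, qual, **trim_options)
        if len(seq) >= min_length:
            kept.append((header, seq, qual))
    return kept


# Hypothetical read: 'I' encodes quality 40 and '#' encodes quality 2
example = [("@read1", "ACGTACGTAC", "IIIIII####")]
print(clean(example, min_length=4))
```

Trimming shortens reads unevenly, so the minimum length filter matters for alignment: very short reads map to many places in the genome.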

Bioinformatics: a multitude of software tools on Linux

Alignment on a reference sequence: BWA aln or MEM, GSNAP

Samtools, Picard for files management

Importance of the reference sequence: 30% of the alfalfa reads mapped on M. truncatula, 80% mapped on M. sativa

Possibility without a reference sequence

[Figure labels: Unified Genotyper, Haplotype Caller]

Veeckman et al. (2019) DNA Research, 2019, 26(1), 1–12 doi: 10.1093/dnares/dsy033
Garsmeur et al. (2018) Nature Communications 2638

Bioinformatics: a multitude of software tools on Linux

Variant calling: GATK, freebayes, mpileup, SNAPE (for pools), bam-readcount

VariantMetaCaller: Gézsi et al. BMC Genomics (2015) 16:875 DOI 10.1186/s12864-015-2050-y

[Figure label: 33%]

Necessity of many CPU (central processing unit) and large RAM (random access memory)

Veeckman et al. (2019) DNA Research, 2019, 26(1), 1–12 doi: 10.1093/dnares/dsy033

Bioinformatics: Stack Mapping Anchor Point (SMAP)

For individuals and pools:

- Polymorphism of RE sites
- Polymorphism of insertion / deletion
- Polymorphism of SNP

T. Ruttink et al. In progress

Bioinformatics: SMAP

[Figure: stacks of reads with their stack anchor points and SNPs]

Bioinformatics: SMAP Haplotype

          SNP 1   SNP 2   SNP 3
Hap. 1    A       A       G
Hap. 2    A       T       G
Hap. 3    C       T       C
Hap. 4    C       A       C

Bioinformatics: SMAP Haplotype

Steps shown on the flowchart:

- Haplotype count: table with haplotype counts per sample
- Calculate haplotype frequencies
- Filter min. haplotype frequencies: remove noise
- Re-calculate haplotype frequencies
- Discrete call
- Filter for diploid
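The flowchart steps can be sketched for one locus in one diploid individual. This is not the SMAP implementation, which the slides describe as in progress: the function names, the minimum frequency and the thresholds used for the discrete call are assumptions, and the haplotype strings and read counts are illustrative.

```python
def haplotype_frequencies(counts):
    """Convert {haplotype: read count} for one locus and one sample into frequencies."""
    total = sum(counts.values())
    return {hap: n / total for hap, n in counts.items()} if total else {}


def filter_and_recalculate(counts, min_frequency=0.05):
    """Drop low-frequency haplotypes (treated as noise), then recompute the frequencies."""
    frequencies = haplotype_frequencies(counts)
    kept = {hap: counts[hap] for hap, f in frequencies.items() if f >= min_frequency}
    return haplotype_frequencies(kept)


def discrete_call_diploid(frequencies, het_low=0.25, het_high=0.75):
    """Turn frequencies into haplotype copy numbers; None if they do not sum to 2."""
    calls = {}
    for hap, f in frequencies.items():
        if f > het_high:
            calls[hap] = 2
        elif f >= het_low:
            calls[hap] = 1
    return calls if sum(calls.values()) == 2 else None


# Illustrative read counts per haplotype at one locus
counts = {"AAG": 48, "ATG": 50, "CTC": 2}
frequencies = filter_and_recalculate(counts)
call = discrete_call_diploid(frequencies)
print(frequencies, call)
```

Here the rare haplotype CTC is removed as noise and the individual is called heterozygous AAG / ATG. A locus with three haplotypes at similar frequencies returns None, which corresponds to the filter for diploid: such a locus cannot be explained by two haplotype copies.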

Variants of GBS

RRL: reduced representation libraries
RAD-seq: restriction site associated DNA sequencing
GBS: genotyping by sequencing

[Figure: library preparation steps compared between RRL, RAD-seq and GBS: digest, pool, random shear, ligate adapters, size selection, PCR, paired-end sequencing]

Davey et al. (2011) Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nature Reviews Genetics 12: 499-510

Conclusions

GBS could provide a multitude of markers (but not targeted):

- For a relatively low cost (should decrease, but patent!)
- For any species
- Without too many missing data (if the restriction enzyme is adapted to the number of reads, that is, to the budget)
- Relatively quickly (with a good sequence provider and a bioinformatic team / pipeline)

In the frame of the H2020 EUCLEG project

Horizon 2020 of European Union: Call 2016, SFS 44: “A joint plant breeding programme to decrease the EU's and China's dependency on protein imports”

CIGAR code in .sam files