HAL Id: hal-02789479
https://hal.inrae.fr/hal-02789479
Submitted on 5 Jun 2020
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Genotyping-By-Sequencing (GBS): a simple method for a multitude of markers
Philippe Barre, Tom Ruttink
To cite this version:
Philippe Barre, Tom Ruttink. Genotyping-By-Sequencing (GBS): a simple method for a multitude of markers. Training workshop “Genotyping, phenotyping, data management and analysis in plant breeding”, 2019. ⟨hal-02789479⟩
In the frame of the H2020 EUCLEG project “Breeding forage and grain legumes to increase EU’s and China’s protein self-sufficiency”
This project has received funding from the European Union’s Horizon 2020 Programme for Research & Innovation under grant agreement n°727312.
Genotyping-By-Sequencing (GBS): a simple method for a multitude of markers
Training workshop “Genotyping, phenotyping, data management and analysis in plant breeding” 31 October & 1 November 2019 – Brno, Czech Republic
Sequencing as a tool for genotyping
Getting the genotypic classes or the allele frequencies at different
loci of a sample
Necessity to reduce the complexity of the genome
• Possibility to re-sequence the whole genome if you have enough money, once you have a reference sequence:
– 1000 $ for 2X = 6.4 × 10^9 bp for human
– 562 $ for 4X = 3.6 × 10^9 bp for alfalfa
– 93 $ for 2X = 0.6 × 10^9 bp for flax
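The three prices quoted above correspond to a nearly constant price per sequenced base. A minimal sketch of that arithmetic in Python, using only the figures from the slide (the function and variable names are illustrative):

```python
# Re-sequencing prices quoted above: (price in $, sequenced bases in bp).
quotes = {
    "human": (1000, 6.4e9),
    "alfalfa": (562, 3.6e9),
    "flax": (93, 0.6e9),
}


def price_per_gbp(price_usd, bases_bp):
    """Return the price in $ per 10^9 sequenced bases."""
    return price_usd / (bases_bp / 1e9)


for species, (price, bases) in quotes.items():
    print(f"{species}: {price_per_gbp(price, bases):.0f} $ per Gbp")
```

All three species come out at about 155 to 156 $ per 10^9 bp, so the total cost is driven by genome size and depth, which is the motivation for reducing the complexity of the genome.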
• How to reduce the complexity
– Genes: sequencing of mRNA
• Choice of plant material
– Targeting specific loci: amplicon sequencing or capture
• Design of primers or probes
– Random sequences: use of restriction enzymes
BUT patents from KEYGENE: SNP identification by using restriction enzymes and high-throughput sequencing methods
EP 1910562, EP 1966393, EP 2002017, EP 2292788, EP 2789696, EP 2963127, EP 2302070
Principle of GBS
1/ Digestion of genomic DNA at the restriction enzyme sites
2/ Ligation of adapters, with a barcode specific to each sample
3/ Amplification of fragments < 450 bp
4/ Sequencing
[Figure: sample 1 carries A and sample 2 carries G at a Single Nucleotide Polymorphism (SNP). Loss of a restriction site in a sample: MD (missing data).]
AFLP (Amplified Fragment Length Polymorphism) → Electrophoresis
5/ Demultiplexing with barcodes: many sequences for each sample, in .fastq files (each record holds a header, a sequence and a quality string)
6/ Cleaning
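Demultiplexing with barcodes can be sketched as follows. This is a minimal illustration that assumes inline barcodes at the start of each read and exact matching; the barcode sequences, sample names and reads are hypothetical, and dedicated tools handle mismatches and restriction-site remnants more carefully.

```python
from collections import defaultdict


def read_fastq(lines):
    """Yield (header, sequence, quality) from an iterable of FASTQ lines."""
    it = iter(lines)
    for header in it:
        sequence = next(it).rstrip("\n")
        next(it)  # separator line starting with '+'
        quality = next(it).rstrip("\n")
        yield header.rstrip("\n"), sequence, quality


def demultiplex(records, barcodes):
    """Assign reads to samples by exact match of a barcode at the read start.

    `barcodes` maps barcode sequence -> sample name (hypothetical values).
    The barcode is removed from the sequence and from the quality string.
    """
    per_sample = defaultdict(list)
    # Try longer barcodes first so a short barcode cannot shadow a longer one.
    ordered = sorted(barcodes, key=len, reverse=True)
    for header, sequence, quality in records:
        for barcode in ordered:
            if sequence.startswith(barcode):
                n = len(barcode)
                per_sample[barcodes[barcode]].append(
                    (header, sequence[n:], quality[n:])
                )
                break
        else:
            per_sample["unassigned"].append((header, sequence, quality))
    return dict(per_sample)


fastq_lines = [
    "@read1", "ACGTCTGCAGTTGA", "+", "I" * 14,
    "@read2", "TTAGCTGCAGGGCA", "+", "I" * 14,
    "@read3", "GGGGCTGCAGAAAA", "+", "I" * 14,
]
barcodes = {"ACGT": "sample_1", "TTAG": "sample_2"}
result = demultiplex(read_fastq(fastq_lines), barcodes)
for sample, reads in result.items():
    print(sample, len(reads))
```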
7/ Alignment on a reference sequence: files .sam and .bam
[Figure: reads stacked on the reference, showing read depth and coverage]
Example .sam records. Fields labelled on the slide: sequence name, reference name, position, CIGAR (alignment).
HISEQ8:C5VAYACXX:4:1101:9138:7833 16 MsNRG000127 383005 0 88M * 0 0 CTGCGCTGATTAGCCTGAGAAGAGCACTGACTTGTAGCAGTCTGAACTGCATATGAGCACACTGGGTCCGAGCAGGAAACCAGCGCAG DDDDCDDDDCDDCCEDCEEEFFFFFFHHHEGHJJJIJIJJJJJJJJJIIGJJJJJIHFBGHGGHIHGJJJIJJGIJIIGGHHHGFFFF RG:Z:readscleanadpt.sai XT:A:R NM:i:1 X0:i:2 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:4A83 XA:Z:MsNRG000506,+909328,88M,1;
HISEQ8:C5VAYACXX:4:1101:9015:7835 16 MsNRG000138 746536 0 97M * 0 0 ATGACATTTGTTCTTGACCGTCGGAAAGATTAAGTACGGCATTTTCCAACAGATAGACTAAATGCTTGATATTATTTTCTTCTGATACTGTGTGCTG DEDCDDDCDDDCDDDDDDDFEFGHHHHIE@GJJJIIJJJIHF<JIJJJJJJJIIJJIJJJIHAJJJJJGGJJIJJJJIHHEJIJIIGIHHHHHBFFF RG:Z:readscleanadpt.sai XT:A:R NM:i:1 X0:i:3 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:93T3 XA:Z:MsNRG000119,+373466,97M,1;MsNRG000673,-603245,97M,1;
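The fields labelled on the slide (sequence name, reference name, position, CIGAR) are the tab-separated columns of a .sam record. A minimal sketch of reading one record in Python; the example line is a shortened, hypothetical version of the first record above, and the function names are illustrative rather than part of any existing tool.

```python
import re


def parse_sam_line(line):
    """Split one SAM alignment line into its main fields plus optional tags."""
    fields = line.rstrip("\n").split("\t")
    record = {
        "read_name": fields[0],
        "flag": int(fields[1]),
        "reference_name": fields[2],
        "position": int(fields[3]),  # 1-based leftmost mapping position
        "mapq": int(fields[4]),
        "cigar": fields[5],
        "sequence": fields[9],
        "quality": fields[10],
        "tags": {},
    }
    for tag in fields[11:]:
        name, _type, value = tag.split(":", 2)
        record["tags"][name] = value
    return record


def cigar_operations(cigar):
    """Turn a CIGAR string such as '88M' into [(88, 'M')]."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def is_reverse_strand(flag):
    """Bit 0x10 of the SAM flag marks a read mapped to the reverse strand."""
    return bool(flag & 0x10)


# Shortened, hypothetical record built with explicit tab separators.
example = "\t".join([
    "read1", "16", "MsNRG000127", "383005", "0", "8M", "*", "0", "0",
    "CTGCGCTG", "DDDDCDDD", "NM:i:1", "MD:Z:4A3",
])
record = parse_sam_line(example)
print(record["reference_name"], record["position"], cigar_operations(record["cigar"]))
print("reverse strand:", is_reverse_strand(record["flag"]))
```

In the two records shown on the slide the flag is 16, so both reads map to the reverse strand, and the CIGAR strings 88M and 97M mean that every base of the read is aligned without insertion or deletion.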
8/ Variant calling: files .vcf
[Figure: reads of one sample aligned on the reference. Each position is either identical to the reference (= ref), missing (MD) or without enough reads; the minimum number of reads has to be defined. At position 452: 1 read = ref, 4 reads = alt.]
Interpretation of the allele frequencies. Focus on biallelic SNPs.
Variant calling: sample 1 at position 452, ref: A, alt: G, 1 ref read, 4 alt reads, genotype: GG or 1/1
What about the positions with enough reads but identical to the reference? File .gvcf
508 218848 . A G 3834.26 . BaseQRankSum=0.913;ClippingRankSum=-0.308;DP=302;MLEAC=2,0;MLEAF=0.500,0.00;MQRankSum=-2.242;RAW_MQ=413438.00;ReadPosRankSum=-6.469 GT:AD:DP:GQ:PL:SB 1/1:187,115:302:18:3861,18,0,356,6642,4094,346,559,6786,4421,905,6989,4980,7335,11410:115,72,115,0
508 218849 . C . . END=218873 GT:DP:GQ:MIN_DP:PL 0/0:304:50:278:0,50,120,241,1800
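The read counts at a position translate into an allele frequency only when enough reads are available. A minimal sketch in Python, assuming a biallelic SNP; the threshold of 5 reads is an arbitrary illustrative value, since the slide only states that a minimum number of reads has to be defined, and this is not the model used by a variant caller.

```python
def alt_allele_frequency(ref_reads, alt_reads, min_reads=5):
    """Return the alternative allele frequency, or None when depth is too low.

    `min_reads` is an arbitrary illustrative threshold.
    """
    depth = ref_reads + alt_reads
    if depth < min_reads:
        return None  # not enough reads: treated as missing data
    return alt_reads / depth


print(alt_allele_frequency(1, 4))  # position 452 in sample 1: 4 alt reads out of 5
print(alt_allele_frequency(2, 1))  # only 3 reads: below the threshold
```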
[Figure: read counts per position for sample 1 and sample 2, around position 400 and position 452. Examples of labels: 1 = ref, 4 = alt; 0 = ref, 5 = alt; 6 = ref, 0 = alt; 5 = ref, 0 = alt; MD where data are missing. The .vcf file holds the variant positions, the .gvcf file also holds the positions identical to the reference.]
9/ Merge variant calling files for all samples: files .vcf
Columns labelled on the slide: chromosome, position, ref, alt, then one field per sample (Sample 1, ..., Sample 4, ...)
scaffold_0|ref0040988 166641 . A C . . . GT:AD:DP:GC 0/0:146,0:146:85.5387987028 0/1:106,180:286:228.995630731 0/0:345,0:345:204.97487344 1/1:0,205:205:120.911257016 1/1:0,202:202:119.11357202 0/0:11,0:11:5.32928395111 0/0:231,0:231:136.513314158 1/1:22,973:995:999 0/0:100,0:100:58.0077078195 ./.:3,0:3:.
scaffold_0|ref0040988 166656 . A T . . . GT:AD:DP:GC 0/0:145,0:145:84.9367387115 0/0:287,0:287:170.134951164 0/0:347,0:347:206.176493884 0/0:207,0:207:122.11119102 0/0:202,0:202:119.11357202 0/0:11,0:11:5.32928395111 0/0:231,0:231:136.513314158 0/0:1001,1:1002:57.4056478282 0/0:100,0:100:58.0077078195 ./.:3,0:3:.
10/ Filtering: check the missing data per SNP and per sample
[Table: genotype matrix with one row per sample and one column per SNP (Chro_position1, Chro_position2, ...), e.g. Sample 1: 0, 1]
Ready for genetic analyses!
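Going from the merged .vcf to a genotype matrix, and checking the missing data per SNP, can be sketched as below. This is a minimal Python illustration that assumes biallelic SNPs and whitespace-separated fields; the example line keeps only four of the sample fields shown above, with shortened values, and the function names are illustrative.

```python
def genotype_to_dosage(sample_field):
    """Convert a sample field such as '0/1:106,180:286:228.9' to an alt-allele count.

    Returns None for a missing genotype ('./.'). Valid for biallelic SNPs only.
    """
    genotype = sample_field.split(":")[0]
    alleles = genotype.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return sum(int(a) for a in alleles)


def parse_vcf_line(line):
    """Return (chromosome, position, [dosage per sample]) for one VCF data line."""
    fields = line.split()
    chromosome, position = fields[0], int(fields[1])
    # Fields 0-8 are the fixed VCF columns; samples start at index 9.
    dosages = [genotype_to_dosage(s) for s in fields[9:]]
    return chromosome, position, dosages


def missing_rate(dosages):
    """Proportion of samples without a genotype call."""
    return sum(d is None for d in dosages) / len(dosages)


line = (
    "scaffold_0|ref0040988 166641 . A C . . . GT:AD:DP:GC "
    "0/0:146,0:146:85.5 0/1:106,180:286:228.9 1/1:0,205:205:120.9 ./.:3,0:3:."
)
chromosome, position, dosages = parse_vcf_line(line)
print(chromosome, position, dosages)
print("missing rate:", missing_rate(dosages))
```

The same `missing_rate` function can be applied to a column of the matrix (one SNP across samples) or to a row (one sample across SNPs), which covers both checks named in step 10.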
Principle of GBS
Library preparation
1/ Digestion
2/ Ligation of adapters
3/ Amplification
Sequencing
4/ Sequencing: file .fastq for all samples
Bioinformatics
5/ Demultiplexing: files .fastq per sample
6/ Cleaning = trimming: files .fastq per sample
7/ Alignment on a reference sequence: files .sam and .bam per sample
8/ Variant calling: files .vcf or .gvcf per sample
9/ Merge variant calling: file .vcf for all samples
10/ Filtering: file .csv or .txt for all samples
To be done by or with the geneticist
Library preparation
Library preparation (Elshire et al., 2011)
Several PCR could be performed on
Essential links between :
Sequencing Illumina HiSeq 2000
[Figure: sequencing on the Illumina HiSeq 2000; labels recovered: O2, single read, read 2]
Library preparation: an example of adapters and primers
[Figure: structure of a library fragment; labels recovered: Adapter P1 with barcode, insert, Adapter P2, Primer P1, O1, index 5, index 7, primers for sequencing]
For Dactylis:
- 29 barcodes
- 5 i7
- 2 i5
These give 29 × 5 × 2 = 290 samples. They have to be compatible.
1 lane of NovaSeq (about 3.9 × 10^9 reads, about 11 k€)
On average 13 × 10^6 reads per sample, ranging from 2 to 25 M: some samples have to be sequenced again.
Per sample, with a target of 10 M reads: library + sequencing about 60 €, without staff
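The number of samples that can share one lane is the product of the numbers of barcodes and indexes. A minimal sketch of that calculation in Python, assuming every combination is usable and reads are spread evenly across samples (the observed range of 2 to 25 M shows this is only an approximation); the tag names are placeholders.

```python
from itertools import product

# Counts quoted above for Dactylis; the tag names themselves are placeholders.
barcodes = [f"barcode_{i}" for i in range(1, 30)]  # 29 barcodes
i7_indexes = [f"i7_{i}" for i in range(1, 6)]      # 5 i7 indexes
i5_indexes = [f"i5_{i}" for i in range(1, 3)]      # 2 i5 indexes

# One unique (barcode, i7, i5) combination per sample.
combinations = list(product(barcodes, i7_indexes, i5_indexes))
n_samples = len(combinations)

lane_reads = 3.9e9      # about 3.9 x 10^9 reads for one lane
lane_cost_eur = 11_000  # about 11 k EUR for one lane

reads_per_sample = lane_reads / n_samples
sequencing_cost_per_sample = lane_cost_eur / n_samples

print(n_samples, "samples")
print(f"{reads_per_sample / 1e6:.1f} M reads per sample on average")
print(f"{sequencing_cost_per_sample:.0f} EUR of sequencing per sample")
```

The sequencing share per sample obtained this way is lower than the quoted total of about 60 € because that total also includes library preparation.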
Library preparation (Elshire et al., 2011)
Amount of adapters
Clean: discard small fragments
Choice of restriction enzyme and number of reads
ApeKI (5 bp; W = A or T), PstI (6 bp)
[Figure: number of loci as a function of read depth and of the number of reads (millions). About 50000 loci with few MD; about 200000 loci with many MD.]
[Figure: stacks of reads for sample 1 and sample 2, with MD where a stack is absent]
Citation: Byrne S, Czaban A, Studer B, Panitz F, Bendixen C, Asp T (2013) Genome Wide Allele Frequency Fingerprints (GWAFFs) of Populations via Genotyping by Sequencing. PLoS ONE 8(3): e57438.
https://doi.org/10.1371/journal.pone.0057438
The choice depends on:
- the number of markers needed
- the capacity to predict MD
- the funding
Test of different restriction enzymes and pairs of enzymes
[Figure: number of stacks with read depth > 10 as a function of the number of reads, for Dactylis and alfalfa]
PstI-MseI: 17000 stacks with 10 M reads
PstI: 10000 stacks with 10 M reads
GBS in order to estimate allele frequencies on pools
[Figure: allele frequencies estimated on 48 individuals compared with allele frequencies estimated on a pool of DNA from 48 individuals and on a pool of leaves from 48 individuals]
Y = 0.93X + 0.05, R² = 0.92, prediction interval: ± 0.12
Y = 0.96X + 0.02, R² = 0.98, prediction interval: ± 0.10
Not a good estimation for AF < 0.05 or > 0.95
Good estimation of allele frequencies on pools
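The two quantities compared on this slide can be computed as follows. This is a minimal Python sketch for one biallelic SNP; the genotypes and read counts are hypothetical and are not taken from the slides, and the pool estimate is simply the fraction of reads carrying the alternative allele.

```python
def allele_frequency_from_individuals(dosages, ploidy=2):
    """Alt-allele frequency from individual genotypes given as alt-allele counts.

    Individuals without a genotype call (None) are ignored.
    """
    called = [d for d in dosages if d is not None]
    return sum(called) / (ploidy * len(called))


def allele_frequency_from_pool(ref_reads, alt_reads):
    """Alt-allele frequency of a pool, estimated as the fraction of alt reads."""
    return alt_reads / (ref_reads + alt_reads)


# Hypothetical data for one SNP.
individual_dosages = [0, 1, 2, 1, 0, None, 1, 2]
pool_ref_reads, pool_alt_reads = 52, 48

af_individuals = allele_frequency_from_individuals(individual_dosages)
af_pool = allele_frequency_from_pool(pool_ref_reads, pool_alt_reads)
print(af_individuals, af_pool)
```

The `ploidy` argument is there because the same comparison applies to tetraploid species such as alfalfa, where each individual carries four copies of a locus.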
Example of using GBS on pools of leaves in alfalfa
[Figure: differentiation among lucerne varieties; labels recovered: Estonia, Germany, France, Canada, Grazing, Turf]
FST: 0.075 to 0.203, P < 0.05
Julier, B. et al. (2018) Use of GBS markers to distinguish among lucerne varieties, with comparison to morphological traits. Molecular Breeding 38: 133-145
Bioinformatics
Bioinformatics: a multitude of software on Linux
• Demultiplexing: often performed by the sequence provider
sabre, GBSX, Stacks (process_radtags) …
• Cleaning:
– Remove adapters: cutadapt, scythe
– Remove bad quality reads or parts of reads: sickle
– Remove too short reads: sickle
Sequences per sample: 1 file of 10 M reads is 2.5 GB (800 MB compressed)
Possibility with Windows: CLC Genomics, but it is a problem to merge many samples
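The last two cleaning steps, removing bad quality parts of reads and then removing reads that became too short, can be sketched in Python as below. This is a deliberately simple 3' end trimmer, assuming Phred+33 quality encoding; it is not the algorithm of sickle or of any other dedicated tool, and the thresholds are illustrative defaults.

```python
def trim_read(sequence, quality, min_quality=20, min_length=30, phred_offset=33):
    """Trim low-quality bases from the 3' end, then apply a length filter.

    Returns (sequence, quality), or None when the trimmed read is too short.
    Thresholds are illustrative defaults, not values from the slides.
    """
    end = len(sequence)
    # Walk back from the 3' end while the base quality is below the threshold.
    while end > 0 and ord(quality[end - 1]) - phred_offset < min_quality:
        end -= 1
    if end < min_length:
        return None
    return sequence[:end], quality[:end]


# 'I' encodes Phred 40 and '#' encodes Phred 2 with the Phred+33 offset.
print(trim_read("ACGTACGTAC", "IIIIIII###", min_length=5))
print(trim_read("ACGTACGTAC", "II########", min_length=5))
```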
• Alignment on a reference sequence
– BWA aln or MEM, GSNAP
– Samtools, Picard for file management
Importance of the reference sequence: 30% of alfalfa reads mapped on M. truncatula, 80% mapped on M. sativa
Possibility without a reference sequence
Veeckman et al. (2019) DNA Research 26(1), 1–12, doi: 10.1093/dnares/dsy033
Garsmeur et al. (2018) Nature Communications 2638
• Variant calling
– GATK (Unified Genotyper, Haplotype Caller), freebayes, mpileup, SNAPE for pools
– bam-readcount
– VariantMetaCaller: Gézsi et al. (2015) BMC Genomics 16:875, doi: 10.1186/s12864-015-2050-y
[Figure: only the label "33%" is recoverable]
Necessity of many CPUs (central processing units) and large RAM (random access memory)
Veeckman et al. (2019) DNA Research 26(1), 1–12, doi: 10.1093/dnares/dsy033
Bioinformatics: Stack Mapping Anchor Point (SMAP)
• For individuals and pools
• Polymorphism of RE sites
• Polymorphism of insertion / deletion
• Polymorphism of SNP
T. Ruttink et al., in progress
Bioinformatics: SMAP
[Figure: stack anchor points and SNPs]
Bioinformatics: SMAP Haplotype
[Figure: haplotypes (Hap. 1 to Hap. 4) defined by the combination of alleles at SNP 1, SNP 2 and SNP 3 within a stack]
Bioinformatics: SMAP Haplotype
Workflow shown on the slide:
- Haplotype count: table with haplotype counts per sample
- Calculate haplotype frequencies
- Filter on a minimum haplotype frequency: remove noise
- Re-calculate haplotype frequencies
- Filter for diploid: discrete call
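The count, filter and re-calculate steps named on this slide can be illustrated with a short Python sketch. It is not the SMAP implementation: the minimum frequency of 0.1, the rounding rule used for the discrete call and the read counts are all assumptions made for the example.

```python
def haplotype_frequencies(counts):
    """Convert {haplotype: read count} into {haplotype: frequency}."""
    total = sum(counts.values())
    return {hap: n / total for hap, n in counts.items()} if total else {}


def filter_and_call(counts, min_frequency=0.1, ploidy=2):
    """Remove low-frequency haplotypes, recompute frequencies, make discrete calls.

    `min_frequency` and the rounding rule are illustrative assumptions.
    """
    frequencies = haplotype_frequencies(counts)
    # Haplotypes below the threshold are treated as noise and dropped.
    kept = {hap: counts[hap] for hap, f in frequencies.items() if f >= min_frequency}
    refreshed = haplotype_frequencies(kept)
    # Discrete call: copies of each haplotype, rounded to the nearest integer.
    calls = {hap: round(f * ploidy) for hap, f in refreshed.items()}
    return refreshed, calls


# Hypothetical read counts per haplotype for one sample at one locus.
counts = {"AAG": 48, "ATG": 50, "CTC": 2}
frequencies, calls = filter_and_call(counts)
print(frequencies)
print(calls)
```

With these counts the rare haplotype is removed as noise and the sample is called as carrying one copy of each of the two remaining haplotypes.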
Variants of GBS
RRL: reduced representation libraries
RAD-seq: restriction site associated DNA sequencing
GBS: genotyping by sequencing
[Figure: library preparation steps compared for RRL, RAD-seq and GBS; labels recovered: pool, digest, random shear, ligate adapters, size selection, PCR, paired-end sequencing]
Davey et al. (2011) Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nature Reviews Genetics 12: 499-510
Conclusions
• GBS could provide a multitude of markers, but not targeted
• For a relatively low cost: it should decrease, but patent!
• For any species
• Without too many missing data, if the restriction enzyme is adapted to the number of reads (budget)
• Relatively quickly, with a good sequence provider and a bioinformatic team / pipeline
In the frame of the H2020 EUCLEG project
Horizon 2020 of European Union:
Call 2016, SFS 44: “A joint plant breeding programme to decrease the EU's and China's dependency on protein imports”