The Cacao Criollo
Genome v2.0
An Improved Version of the Genome for Genetic and
Functional Genomic Studies
X. Argout
Cacao Advanced Omics Workshop. PAG 2017
Introduction
Introduction : Genome v1.0
•
Strategy:
Introduction : Genome v1.0
•
Assembly:
473.8
178
Introduction : Genome v1.0
•
Assembly:
Introduction
•
Why improve the Criollo genome?
•
Gene coverage : 98%
•
Genome anchored : 66.8%
•
Many genes (5269) located in the unknown chromosome
•
Important for genetic studies (GWAS, QTL resolution)
•
Candidate genes studies
•
Important for genomic selection
Introduction
•
Why was the Criollo genome fragmented?
[Figure: genome, contig assembly, contigs. Repeated sequences (TEs = 35.4%) stop the assembly when scaffolding with small insert size libraries.]
Introduction
•
Solution
[Figure: genome, contig assembly, contigs, repeated sequences]
Scaffolding with large insert size libraries
OR
Criollo genome V2
•
Materials
•
Assembly V1 contigs
•
Illumina mate-paired libraries :
•3-5kb : cov. 23x
•5-8kb : cov. 21x
•8-11kb : cov. 11x
•11-15kb : cov. 6x
•
8x error-corrected PacBio data
•
BAC ends v1
Criollo genome V2
•
Materials
•
Progeny UF676 x ICS95 held in French Guiana
•
Genotyping By Sequencing data for 450 individuals
•
4 857 SNPs for scaffold anchoring
•
Cocoa chloroplast genome (Kane et al., 2012)
•
Cotton mitochondrion genome (Liu et al., 2012)
Criollo genome V2
•
Step 1 : Chloroplast and mitochondrion contig identification
•
Chloroplast
•
Sequence homology search > 80%
•
37 contigs v1 removed
•
Mitochondrion
•
No cocoa mitochondrion available yet
•
Cotton sequence homology search
•
21 contigs v1 removed
•
Removed contigs < 1000 bp
•
25 527 contigs kept out of 25 912
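The filtering in Step 1 can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the function name `filter_contigs`, the in-memory dictionary of sequences, and the set of organelle hits are assumptions. In practice the hit list would come from a homology search against the chloroplast and cotton mitochondrion references.

```python
def filter_contigs(contigs, organelle_ids, min_length=1000):
    """Drop organelle-derived contigs and contigs shorter than min_length.

    contigs: dict mapping contig name -> sequence string (hypothetical layout)
    organelle_ids: set of contig names flagged by the homology search
    min_length: contigs strictly shorter than this are removed
    """
    kept = {}
    for name, seq in contigs.items():
        if name in organelle_ids:
            continue  # chloroplast or mitochondrion hit
        if len(seq) < min_length:
            continue  # too short to be useful for scaffolding
        kept[name] = seq
    return kept


# Toy data: one organelle hit, one short contig, one contig that is kept.
toy_contigs = {
    "ctg1": "A" * 1500,
    "ctg2": "C" * 2000,  # flagged as chloroplast in this toy example
    "ctg3": "G" * 400,   # below the length threshold
}
toy_kept = filter_contigs(toy_contigs, organelle_ids={"ctg2"})
print(sorted(toy_kept))
```

If the three filters remove disjoint sets of contigs, the counts on the slide imply 25 912 - 37 - 21 - 25 527 = 327 contigs dropped for length.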
Criollo genome V2
•
Step 2 : contigs v1 with PE 3-5kb
•
Scaffolding test with SSPACE
•
Scaffold validation with genetic data
•
Composite scaffolds (scaffolds with genetic markers located in
different linkage groups)
•
Hypothesis : composite contigs v1?
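The composite-scaffold check described above can be sketched like this. It is an illustration under stated assumptions, not the ScaffRemodler implementation: the input layout (one `(scaffold, linkage_group)` pair per marker) and the `min_markers` threshold, which ignores a linkage group supported by a single stray marker, are both hypothetical.

```python
from collections import defaultdict


def find_composite_scaffolds(markers, min_markers=2):
    """Return scaffolds whose markers fall in more than one linkage group.

    markers: iterable of (scaffold_name, linkage_group) pairs
    min_markers: minimum markers needed for a linkage group to count
                 (hypothetical guard against isolated genotyping errors)
    """
    counts = defaultdict(lambda: defaultdict(int))
    for scaffold, linkage_group in markers:
        counts[scaffold][linkage_group] += 1

    composite = {}
    for scaffold, per_group in counts.items():
        supported = sorted(g for g, n in per_group.items() if n >= min_markers)
        if len(supported) > 1:
            composite[scaffold] = supported
    return composite


toy_markers = [
    ("scaf1", 1), ("scaf1", 1), ("scaf1", 4), ("scaf1", 4),  # two groups
    ("scaf2", 2), ("scaf2", 2), ("scaf2", 3),                # one stray marker
]
toy_composite = find_composite_scaffolds(toy_markers)
print(toy_composite)
```

A scaffold flagged this way is a candidate for splitting. Whether the break lies in the scaffold join or inside a v1 contig still has to be checked against the contig boundaries.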
Criollo genome V2
•
Step 2 : contigs v1 with PE 3-5kb
•
ScaffRemodler
Criollo genome V2
•
Step 2 : contigs v1 with PE 3-5kb
•
Identification of 25 composite contigs
Criollo genome V2
•
Step 3 : Scaffolding PE 3-5kb
•
Consistent with genetic data
•
24 scaffolds to split (ScaffRemodler)
Inserted contig file:
  Total number of contigs = 25527
  Sum (bp) = 290549573
  Total number of N's = 294
  Sum (bp) no N's = 290549279
  GC Content = 34.20%
  Max contig size = 189922
  Min contig size = 1001
  Average contig size = 11382
  N25 = 36708
  N50 = 19777
  N75 = 9714
After scaffolding lib3-5:
  Total number of scaffolds = 4383
  Sum (bp) = 303905568
  Total number of N's = 13371700
  Sum (bp) no N's = 290533868
  GC Content = 34.20%
  Max scaffold size = 2464003
  Min scaffold size = 1004
  Average scaffold size = 69337
  N25 = 378444
  N50 = 189145
  N75 = 86916
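For readers unfamiliar with the N25 / N50 / N75 values in these summaries, here is a minimal sketch of the usual definition: the length of the sequence at which the running total, taken from the longest sequence downward, first reaches the given fraction of the assembly. The function name `nx` is an assumption, and the scaffolder's own implementation may differ in details.

```python
def nx(lengths, fraction):
    """Return the Nx statistic (e.g. fraction=0.5 for N50).

    lengths: list of contig or scaffold lengths in bp
    fraction: share of the total assembly length to reach (0 < fraction <= 1)
    """
    threshold = sum(lengths) * fraction
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    raise ValueError("lengths must be non-empty")


# Toy assembly of five sequences, 30 bp in total.
toy_lengths = [10, 8, 6, 4, 2]
toy_n25 = nx(toy_lengths, 0.25)
toy_n50 = nx(toy_lengths, 0.50)
toy_n75 = nx(toy_lengths, 0.75)
print(toy_n25, toy_n50, toy_n75)
```

Because long sequences dominate the running total, N50 rises as scaffolds are merged even though the number of non-N bases stays almost unchanged.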
Criollo genome V2
•
Ex
Criollo genome V2
•
Step 4 : Scaffolding PE 5-8 kb
•
Consistent with genetic data
Inserted contig file:
  Total number of contigs = 4407
  Sum (bp) = 303893817
  Total number of N's = 13367168
  Sum (bp) no N's = 290526649
  GC Content = 34.20%
  Max contig size = 2464003
  Min contig size = 1004
  Average contig size = 68957
  N25 = 378444
  N50 = 188269
  N75 = 86304
After scaffolding lib5-8:
  Total number of scaffolds = 1906
  Sum (bp) = 312317588
  Total number of N's = 21790939
  Sum (bp) no N's = 290526649
  GC Content = 34.20%
  Max scaffold size = 2803292
  Min scaffold size = 1004
  Average scaffold size = 163860
  N25 = 894026
  N50 = 439422
  N75 = 226508
Criollo genome V2
•
Step 5 : Scaffolding PE 8-11kb
•
Consistent with genetic data
Inserted contig file:
  Total number of contigs = 1910
  Sum (bp) = 312312945
  Total number of N's = 21786304
  Sum (bp) no N's = 290526641
  GC Content = 34.20%
  Max contig size = 2803292
  Min contig size = 1004
  Average contig size = 163514
  N25 = 894026
  N50 = 439422
  N75 = 226508
After scaffolding lib8-11:
  Total number of scaffolds = 1271
  Sum (bp) = 315916265
  Total number of N's = 25389624
  Sum (bp) no N's = 290526641
  GC Content = 34.20%
  Max scaffold size = 3771893
  Min scaffold size = 1004
  Average scaffold size = 248557
  N25 = 1211276
  N50 = 709384
  N75 = 343075
Criollo genome V2
•
Step 6 : Scaffolding PE 11-15kb
•
Consistent with genetic data
Inserted contig file:
  Total number of contigs = 1271
  Sum (bp) = 315916265
  Total number of N's = 25389624
  Sum (bp) no N's = 290526641
  GC Content = 34.20%
  Max contig size = 3771893
  Min contig size = 1004
  Average contig size = 248557
  N25 = 1211276
  N50 = 709384
  N75 = 343075
After scaffolding lib11-15:
  Total number of scaffolds = 980
  Sum (bp) = 318241244
  Total number of N's = 27714603
  Sum (bp) no N's = 290526641
  GC Content = 34.20%
  Max scaffold size = 4705272
  Min scaffold size = 1004
  Average scaffold size = 324735
  N25 = 1580640
  N50 = 906533
  N75 = 467617
Criollo genome V2
•
Step 7 : Scaffolding BAC ends
•
Consistent with genetic data
Inserted contig file:
  Total number of contigs = 981
  Sum (bp) = 318235570
  Total number of N's = 27708931
  Sum (bp) no N's = 290526639
  GC Content = 34.20%
  Max contig size = 4705272
  Min contig size = 1004
  Average contig size = 324399
  N25 = 1580640
  N50 = 906533
  N75 = 467617
After scaffolding Libsanger:
  Total number of scaffolds = 554
  Sum (bp) = 325168055
  Total number of N's = 34641416
  Sum (bp) no N's = 290526639
  GC Content = 34.20%
  Max scaffold size = 14867920
  Min scaffold size = 1004
  Average scaffold size = 586945
  N25 = 9230816
  N50 = 5324109
  N75 = 2107648
Criollo genome V2
•
Step 8 : Gap closing
•
554 scaffolds = 325 Mb
•
Ns = 10.6% of the assembly
•
Mix of PacBio + Illumina PE : 5.6% in final assembly
•
Next step, anchor scaffolds to the 10 chromosomes
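The share of unknown bases quoted for the 554 scaffolds can be checked with a short sketch. The helper `n_fraction` is a hypothetical name used only for illustration. The two counts are the "Total number of N's" and "Sum (bp)" values reported after the BAC-end scaffolding step.

```python
def n_fraction(sequence):
    """Return the fraction of positions in a sequence that are N (gap or unknown)."""
    if not sequence:
        raise ValueError("empty sequence")
    n_count = sum(1 for base in sequence.upper() if base == "N")
    return n_count / len(sequence)


# Toy scaffold: 20 bases, 6 of them N.
toy_scaffold = "ACGTNNNNACGTACGTNNAC"
toy_fraction = n_fraction(toy_scaffold)

# Same ratio computed from the reported totals of the 554-scaffold assembly.
total_n = 34641416
total_bp = 325168055
assembly_fraction = total_n / total_bp

print(f"toy: {toy_fraction:.2f}, assembly: {assembly_fraction:.4f}")
```

The ratio from the reported totals is roughly 0.1065, in line with the figure of about 10.6% given on the slide.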
Criollo genome V2
•
Step 9 : Chromosome anchoring
•
4857 SNP markers
•
Grouping with JoinMap software
•
Pairwise data export and study of recombination frequencies
between markers to anchor and orient scaffolds (ScaffHunter
program)
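The idea of using pairwise recombination frequencies to orient a scaffold can be sketched as below. This is not how JoinMap or ScaffHunter compute their values. It assumes a simplified coding in which each marker is a list of 0/1 genotypes per individual (`None` for missing data), and it handles unknown phase by taking the smaller of r and 1 - r. The function names are hypothetical.

```python
def recombination_fraction(marker_a, marker_b):
    """Estimate the recombination fraction between two markers.

    Each marker is a list of 0/1 genotype codes, one per individual,
    with None for missing calls. Returns None if no individual is
    informative for both markers.
    """
    pairs = [(a, b) for a, b in zip(marker_a, marker_b)
             if a is not None and b is not None]
    if not pairs:
        return None
    r = sum(1 for a, b in pairs if a != b) / len(pairs)
    return min(r, 1 - r)  # phase unknown: keep the smaller estimate


def orient_scaffold(neighbour_marker, first_marker, last_marker):
    """Return '+' if the scaffold's first marker is closer to the neighbour, else '-'."""
    r_first = recombination_fraction(neighbour_marker, first_marker)
    r_last = recombination_fraction(neighbour_marker, last_marker)
    return "+" if r_first <= r_last else "-"


# Toy genotypes for 8 individuals.
neighbour = [0, 0, 1, 1, 0, 1, 0, 1]
first_end = [0, 0, 1, 1, 0, 1, 1, 1]  # 1 mismatch with neighbour
last_end = [1, 0, 1, 0, 0, 1, 1, 1]   # 3 mismatches with neighbour

toy_r = recombination_fraction(neighbour, first_end)
toy_orientation = orient_scaffold(neighbour, first_end, last_end)
print(toy_r, toy_orientation)
```

With 450 genotyped individuals, differences between such estimates become far more reliable than in this 8-individual toy case.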
Criollo genome V2
•
Step 9 : Chromosome anchoring
•
10 chromosomes
Comparison V1 - V2
•
Final assembly : 554 scaffolds (4,792 V1)
•
N50 : 6.5 Mb (0.47 Mb V1)
•
Genome anchored : 314.2 Mb (218.4 Mb V1) = 96.7% anchored
•
Large reduction of unknown Chr : 10.5 Mb (108.5 Mb V1)
•
Unknown sites (Ns) : 5.7% (10.8% V1)
•
99% of genes anchored to chromosomes
V1
Tc00 integration into v2