HAL Id: hal-01231793
https://hal.inria.fr/hal-01231793
Submitted on 20 Nov 2015
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Improvement of the assembly of heterozygous genomes of non-model organisms
Anaïs Gouin, Anthony Bretaudeau, Emmanuelle d’Alençon, Claire Lemaitre, Fabrice Legeai
To cite this version:
Anaïs Gouin, Anthony Bretaudeau, Emmanuelle d’Alençon, Claire Lemaitre, Fabrice Legeai. Improvement of the assembly of heterozygous genomes of non-model organisms. Genome Informatics, Oct 2015, Cold Spring Harbor Laboratory, United States. 2015. ⟨hal-01231793⟩
Anaïs GOUIN (1), Anthony BRETAUDEAU (2), Emmanuelle d'Alençon (3), Claire LEMAITRE (1) and Fabrice LEGEAI (2)
(1) INRIA/IRISA/GenScale, Campus de Beaulieu, 35042 Rennes cedex, France
(2) INRA, Institut de Génétique, Environnement et Protection des Plantes (IGEPP), Domaine de la Motte, 35653 Le Rheu
(3) INRA DGIMI, université de Montpellier 1, 34000 Montpellier
Motivation: Some heterozygous regions have a significant divergence between the two haplotypes, and the assembly process can lead to the construction of two different contigs instead of one consensus sequence.
Objective: Set up a strategy to detect and correct false duplications in already-built assemblies.
METHOD
[Figure: read depth along scaffold_a and scaffold_b compared with the expected read depth, and the resulting superscaffold_c. Histogram of the number of scaffolds by coverage, with the expected coverage and the potential erroneous duplications marked.]
Inputs:
- Fasta file of the assembly (TEs masked)
- BAM file (reads mapped onto the assembly)
- Fasta file and GFF of annotated proteins
Workflow:
- Fast self whole genome alignment: pre-selection of pairs of “similar” scaffolds, with at least one hit with e-value ≤ 1e-100 and hit length ≥ 1 kb (or 80% of smallest scaffold)
- Re-alignment of selected pairs of scaffolds: alignments + chaining of hits to get longer alignments, then filtering of small chains: ≤ 1 kb (or 80% of smallest scaffold)
- Identification of mis-assemblies, based on 3 main criteria:
  - topology: “included” or “border”
  - read depth
  - uniqueness: filtering duplications by checking uniqueness of matches
- Genome correction
- Re-annotation of lost genes
Outputs:
- Fasta file of the corrected assembly
- New GFF of gene annotations
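The pre-selection step of the workflow can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Hit` record, the function names and the reading of "1 kb (or 80% of smallest scaffold)" as two alternative length thresholds are assumptions made for illustration. Only the numeric thresholds come from the poster.

```python
from dataclasses import dataclass

E_VALUE_MAX = 1e-100   # from the poster: e-value <= 1e-100
MIN_LENGTH_BP = 1_000  # from the poster: hit length >= 1 kb
MIN_FRACTION = 0.80    # from the poster: or 80% of the smallest scaffold


@dataclass(frozen=True)
class Hit:
    """One local alignment between two scaffolds (hypothetical record)."""
    query: str
    target: str
    length: int      # aligned length in bp
    e_value: float


def long_enough(length, query_len, target_len):
    """Assumed reading: keep hits of >= 1 kb, or >= 80% of the smaller scaffold."""
    smallest = min(query_len, target_len)
    return length >= MIN_LENGTH_BP or length >= MIN_FRACTION * smallest


def preselect_pairs(hits, scaffold_lengths):
    """Return the scaffold pairs that have at least one qualifying hit."""
    pairs = set()
    for hit in hits:
        if hit.query == hit.target:
            continue  # a scaffold aligned to itself says nothing about duplication
        if hit.e_value > E_VALUE_MAX:
            continue  # not significant enough
        if long_enough(hit.length,
                       scaffold_lengths[hit.query],
                       scaffold_lengths[hit.target]):
            # store each pair once, whatever the alignment direction
            pairs.add(tuple(sorted((hit.query, hit.target))))
    return pairs
```

Called on the parsed hits of the self-alignment and a dictionary of scaffold lengths, `preselect_pairs` would return the candidate pairs passed on to the re-alignment step. The 80% alternative mainly matters for scaffolds shorter than 1.25 kb, which could never reach a 1 kb hit covering less than 80% of their length.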
[Figure: alignment topologies and thresholds: dist1 ≤ A bp, dist2 ≤ A bp, ≥ B% of query, lim1 ≤ cumulated read depth ≤ lim2]
- “included”: the smallest scaffold is deleted
- “border”: the scaffolds are linked by their extremities, keeping the allele located on the longest scaffold of the pair
Relocation and merging of supernumerary gene annotations:
- alignment of the impacted gene onto the remaining allele (Exonerate):
  - NO => delete allele
  - YES => 3 distinct cases: “synonymous”, “no intersection”, “intersection”
[Figure: the three cases. Legend: segment in the corrected genome; deleted segment; both alleles annotated: no need to re-annotate the lost gene; new prediction using Augustus]
APPLICATION
Spodoptera frugiperda genome
Genome correction
Expected size: ~400 Mb

                  Initial assembly (Allpaths)   Corrected assembly   Haplomerger
Total size (Mb)   526.0                         434.9                369.5
Nb. scaffolds     48,272                        41,577               37,797
N50 (kb)          39.6                          52.8                 58.4

BUSCO statistics: Benchmarking sets of Universal Single Copy Orthologs (2,675 for Arthropoda species) [6]

              Initial assembly   Corrected assembly   Haplomerger
Missing       363                336                  562
Single copy   1,246              1,586                1,242
Fragmented    476                457                  771
Duplicated    590                296                  100

Annotation stats
Previous release: 25,041 genes ==> 3,746 genes to reannotate

Case                # genes   % success
“no alignment”      34        0
“synonymous”        747       100
“no intersection”   643       45.4
“intersection”      2,322     86.3

==> Overall success of 80% / New release: 21,578 genes
[Figure labels: addition of a new gene in the remaining region; modification of an already annotated gene]
[Figure: Read depth analysis: before/after correction]
Tools used in the workflow: Plast [3], Lastz, AxtChain [2], Exonerate [4], Augustus [5]
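As a sanity check, the overall re-annotation success and the relative size reduction can be re-derived from the figures reported in this section. The short Python sketch below only does arithmetic on the poster's numbers. The function names are illustrative, and treating the overall success as a gene-count-weighted mean of the per-category rates is an assumption about how the figure was obtained.

```python
# (case, number of genes, success rate in %) as reported in the annotation stats
CASES = [
    ("no alignment", 34, 0.0),
    ("synonymous", 747, 100.0),
    ("no intersection", 643, 45.4),
    ("intersection", 2322, 86.3),
]


def overall_success(rows):
    """Return (total genes, success in %) as a gene-count-weighted mean."""
    total = sum(count for _, count, _ in rows)
    recovered = sum(count * rate / 100.0 for _, count, rate in rows)
    return total, 100.0 * recovered / total


def relative_reduction(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


total_genes, success = overall_success(CASES)
print(f"{total_genes} genes to re-annotate, overall success {success:.1f}%")
print(f"size reduction, corrected assembly: {relative_reduction(526.0, 434.9):.1f}%")
print(f"size reduction, Haplomerger: {relative_reduction(526.0, 369.5):.1f}%")
```

Under that assumption the weighted mean comes out slightly above 81%, in line with the roughly 80% quoted, and the four categories add up to the 3,746 genes to re-annotate.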
[1] Huang S. et al., Genome Research, 2012
[2] Kent W.J. et al., Proceedings of the National Academy of Sciences, 2003
[3] Nguyen V.H. et al., BMC Bioinformatics, 2009
[4] Slater G.S. et al., BMC Bioinformatics, 2005
[5] Stanke M. et al., Genome Biol, 2006
[6] Waterhouse et al., Nucleic Acids Research, 2013
Comparison with another method: Haplomerger [1]
Improvement of the initial assembly for both methods
Haplomerger merged more regions, leading to a smaller final assembly
Reduction of the genome size (17%), increase of the N50 and more single copies for important genes
Smaller reduction than with Haplomerger, but gain of numerous BUSCO genes
Our method: more conservative, preserves genome consistency and allows easier re-annotation of impacted genes
* best result by category