Improvement of the assembly of heterozygous genomes of non-model organisms


HAL Id: hal-01231793

https://hal.inria.fr/hal-01231793

Submitted on 20 Nov 2015

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.



Anaïs Gouin, Anthony Bretaudeau, Emmanuelle d’Alençon, Claire Lemaitre, Fabrice Legeai

To cite this version:

Anaïs Gouin, Anthony Bretaudeau, Emmanuelle d’Alençon, Claire Lemaitre, Fabrice Legeai. Improvement of the assembly of heterozygous genomes of non-model organisms. Genome Informatics, Oct 2015, Cold Spring Harbor Laboratory, United States. 2015. ⟨hal-01231793⟩


Anaïs GOUIN (1), Anthony BRETAUDEAU (2), Emmanuelle d'Alençon (3), Claire LEMAITRE (1) and Fabrice LEGEAI (2)

(1) INRIA/IRISA/GenScale, Campus de Beaulieu, 35042 Rennes cedex, France
(2) INRA, Institut de Génétique, Environnement et Protection des Plantes (IGEPP), Domaine de la Motte, 35653 Le Rheu
(3) INRA DGIMI, université de Montpellier 1, 34000 Montpellier

Motivation: Some heterozygous regions have a significant divergence between the two haplotypes, and the assembly process can lead to the construction of two different contigs instead of one consensus sequence.

Objective: Set up a strategy to detect and correct false duplications in already-built assemblies.


METHOD

[Figure: read depth along scaffold_a, scaffold_b and superscaffold_c compared with the expected read depth, highlighting potential erroneous duplications.]

[Figure: histogram of the number of scaffolds by coverage, marking the expected coverage and the potential duplications.]

Inputs:
- Fasta file of the assembly (TEs masked)
- BAM file (reads mapped onto the assembly)
- GFF and Fasta file of annotated proteins

Steps:
1. Fast self whole genome alignment
   Pre-selection of pairs of "similar" scaffolds, with at least one hit with:
   - e-value ≤ 1e-100
   - hit length ≥ 1 kb (or 80% of smallest scaffold)
2. Re-alignment of selected pairs of scaffolds
   - alignments + chaining of hits to get longer alignments
   - filtering small chains: ≤ 1 kb (or 80% of smallest scaffold)
3. Identification of mis-assemblies
4. Genome correction
5. Re-annotation of lost genes

Outputs:
- Fasta file of the corrected assembly
- New GFF of gene annotations

Identification of mis-assemblies is based on 3 main criteria:
- topology: "included" or "border"
- read depth
- uniqueness: filtering duplications by checking uniqueness of matches
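The pre-selection rule and the three identification criteria described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Hit` and `Chain` records, the function names and the example threshold values are hypothetical. The sketch assumes that the two hit-length conditions are alternatives, that "included" means the chain covers at least B% of the smaller scaffold, that "border" means the chain lies within A bp of an extremity on both scaffolds, that the cumulated read depth is the sum of the depths of the two aligned regions, and that a match is unique when neither scaffold appears in any other chain. A, B, lim1 and lim2 are the threshold names used on the poster's diagram; the poster gives no values for them.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Hit:
    """One local alignment from the self alignment (hypothetical record)."""
    query: str
    subject: str
    evalue: float
    length: int  # aligned length in bp


@dataclass(frozen=True)
class Chain:
    """One chained alignment between two scaffolds (hypothetical record)."""
    query: str              # smaller scaffold of the pair
    target: str             # larger scaffold of the pair
    query_len: int          # length of the query scaffold in bp
    aligned_len: int        # bp of the query covered by the chain
    dist1: int              # bp between the chain and the query extremity
    dist2: int              # bp between the chain and the target extremity
    cumulated_depth: float  # assumed: summed read depth of both aligned regions


def preselect_pairs(hits: List[Hit], scaffold_len: Dict[str, int],
                    max_evalue: float = 1e-100, min_len: int = 1000,
                    min_frac: float = 0.8) -> Set[Tuple[str, ...]]:
    """Keep scaffold pairs that share at least one sufficiently strong hit."""
    pairs = set()
    for h in hits:
        if h.query == h.subject or h.evalue > max_evalue:
            continue  # skip self hits and weak hits
        smallest = min(scaffold_len[h.query], scaffold_len[h.subject])
        # Assumption: either length condition is enough on its own.
        if h.length >= min_len or h.length >= min_frac * smallest:
            pairs.add(tuple(sorted((h.query, h.subject))))
    return pairs


def topology(c: Chain, a_bp: int, b_pct: float) -> Optional[str]:
    """Return "included", "border" or None (assumed reading of the diagram)."""
    if c.aligned_len * 100 >= b_pct * c.query_len:
        return "included"
    if c.dist1 <= a_bp and c.dist2 <= a_bp:
        return "border"
    return None


def identify(chains: List[Chain], a_bp: int, b_pct: float,
             lim1: float, lim2: float) -> Dict[Tuple[str, str], str]:
    """Apply the topology, read depth and uniqueness criteria to each chain."""
    usage = Counter(name for c in chains for name in (c.query, c.target))
    calls = {}
    for c in chains:
        topo = topology(c, a_bp, b_pct)
        depth_ok = lim1 <= c.cumulated_depth <= lim2
        unique = usage[c.query] == 1 and usage[c.target] == 1
        if topo is not None and depth_ok and unique:
            calls[(c.query, c.target)] = topo
    return calls


# Toy data; every value below is invented for illustration.
SCAFFOLD_LEN = {"s1": 50000, "s2": 1200, "s3": 900}
HITS = [
    Hit("s1", "s2", 1e-120, 1100),  # long enough, strong e-value
    Hit("s1", "s3", 1e-150, 750),   # shorter than 1 kb but > 80% of s3
    Hit("s2", "s3", 1e-50, 2000),   # e-value too weak
    Hit("s1", "s1", 0.0, 50000),    # self hit
]
CHAINS = [
    Chain("s2", "s1", 1200, 1150, 10, 20000, 58.0),   # covers most of s2
    Chain("s4", "s3", 30000, 4000, 150, 300, 61.0),   # overlap at extremities
    Chain("s6", "s5", 5000, 4900, 0, 0, 120.0),       # read depth too high
    Chain("s8", "s7", 2000, 1900, 5, 5, 60.0),        # s7 matched twice
    Chain("s9", "s7", 2500, 2400, 5, 5, 60.0),        # s7 matched twice
]
```

Calling `preselect_pairs(HITS, SCAFFOLD_LEN)` keeps the first two pairs only, and `identify(CHAINS, a_bp=500, b_pct=90, lim1=45.0, lim2=75.0)` keeps one "included" and one "border" candidate. In a real run the chains would come from the re-alignment step and the read depth from the BAM file.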

[Figure: thresholds used for the criteria. Topology: dist1 ≤ A bp and dist2 ≤ A bp, alignment covering ≥ B% of the query. Read depth: lim1 ≤ cumulated read depth ≤ lim2.]

Genome correction, depending on the topology:
- "included": the smallest scaffold is deleted.
- "border": the scaffolds are linked by their extremities, keeping the allele located on the longest scaffold of the pair.

Relocation and merging of supernumerary gene annotations:
- Does the impacted gene align onto the remaining allele (Exonerate)?
  - NO => delete allele
  - YES => 3 distinct cases: "synonymous", "no intersection", "intersection"

[Figure: the three cases, showing the segment in the corrected genome and the deleted segment. Legend: both alleles annotated, no need to re-annotate the lost gene; new prediction using Augustus.]

APPLICATION

Spodoptera frugiperda genome

Genome correction

                   Initial assembly (Allpaths)   Corrected assembly   Haplomerger
Total size (Mb)    526.0                         434.9                369.5
Nb. scaffolds      48,272                        41,577               37,797
N50 (kb)           39.6                          52.8                 58.4

Expected size: ~400 Mb

BUSCO statistics: Benchmarking sets of Universal Single Copy Orthologs (2,675 for Arthropoda species) [6]

                   Initial assembly   Corrected assembly   Haplomerger
Missing            363                336                  562
Single copy        1,246              1,586                1,242
Fragmented         476                457                  771
Duplicated         590                296                  100

Software used in the workflow: Plast [3], Lastz, AxtChain [2], Exonerate [4], Augustus [5]

Annotation stats

Previous release: 25,041 genes ==> 3,746 genes to reannotate

                     # genes   % success
"no alignment"       34        0
"synonymous"         747       100
"no intersection"    643       45.4
"intersection"       2,322     86.3

==> Overall success of 80% / New release: 21,578 genes

[Figure legend: addition of a new gene in the remaining region; modification of an already annotated gene.]
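The figures in these tables can be cross-checked with a little arithmetic. The sketch below copies the values from the poster and assumes that the overall re-annotation success is a gene-count-weighted average of the per-category rates; the poster does not state how the 80% was computed, and the function names are illustrative only.

```python
# Values copied from the poster's tables.
TOTAL_SIZE_MB = {"initial": 526.0, "corrected": 434.9, "haplomerger": 369.5}

BUSCO = {
    "initial": {"missing": 363, "single_copy": 1246,
                "fragmented": 476, "duplicated": 590},
    "corrected": {"missing": 336, "single_copy": 1586,
                  "fragmented": 457, "duplicated": 296},
    "haplomerger": {"missing": 562, "single_copy": 1242,
                    "fragmented": 771, "duplicated": 100},
}

# category -> (number of genes, % success)
REANNOTATION = {
    "no alignment": (34, 0.0),
    "synonymous": (747, 100.0),
    "no intersection": (643, 45.4),
    "intersection": (2322, 86.3),
}


def size_reduction_pct(before_mb: float, after_mb: float) -> float:
    """Relative decrease in assembly size, in percent."""
    return 100.0 * (before_mb - after_mb) / before_mb


def overall_success_pct(categories: dict) -> float:
    """Weighted average of per-category success (assumed definition)."""
    total = sum(n for n, _ in categories.values())
    recovered = sum(n * pct / 100.0 for n, pct in categories.values())
    return 100.0 * recovered / total


if __name__ == "__main__":
    print(round(size_reduction_pct(526.0, 434.9), 1))
    print({name: sum(counts.values()) for name, counts in BUSCO.items()})
    print(round(overall_success_pct(REANNOTATION), 1))
```

Under these assumptions the corrected assembly is about 17% smaller than the initial one, each BUSCO column sums to the 2,675 orthologs of the Arthropoda set, the four categories add up to the 3,746 genes to reannotate, and the weighted success rate comes to about 81%, close to the reported overall success of 80%.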

[Figure: read depth analysis, before/after correction]

Reduction of the genome size (17%), increase of the N50 and more single copies for important genes.

Comparison with another method: Haplomerger [1]

- Improvement of the initial assembly for both methods.
- Haplomerger merged more regions, leading to a smaller final assembly.
- Our method reduces the assembly less than Haplomerger and gains numerous BUSCO genes.

Our method: more conservative, preserves genome consistency and allows easier re-annotation of impacted genes.

* best result by category

References

[1] Huang S. et al., Genome Research, 2012.
[2] Kent W.J. et al., Proceedings of the National Academy of Sciences, 2003.
[3] Nguyen V.H. et al., BMC Bioinformatics, 2009.
[4] Slater G.S. et al., BMC Bioinformatics, 2005.
[5] Stanke M. et al., Genome Biol, 2006.
[6] Waterhouse et al., Nucleic Acids Research, 2013.

* * *