• Aucun résultat trouvé

Debugging long-read genome and metagenome assemblies using string graph analysis

N/A
N/A
Protected

Academic year: 2021

Partager "Debugging long-read genome and metagenome assemblies using string graph analysis"

Copied!
2
0
0

Texte intégral

(1)

HAL Id: hal-01574824

https://hal.inria.fr/hal-01574824

Submitted on 16 Aug 2017

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Debugging long-read genome and metagenome assemblies using string graph analysis

Pierre Marijon, Jean-Stéphane Varré, Rayan Chikhi

To cite this version:

Pierre Marijon, Jean-Stéphane Varré, Rayan Chikhi. Debugging long-read genome and metagenome

assemblies using string graph analysis. JOBIM 2017- Journées Ouvertes en Biologie, Informatique et

Mathématiques, Jul 2017, Lille, France. �hal-01574824�

(2)

Debugging long-read genome and metagenome assemblies using string graph analysis

Pierre M ARIJON 1 , Jean St´ephane V ARR E ´ 2 and Rayan C HIKHI 2

1

Inria, Universit´ e de Lille, CNRS, Centrale Lille, UMR 9189 - CRIStAL - Centre de Recherche en Informatique Signal et Automatique de Lille, F-59000 Lille, France

2

Univ. Lille, CNRS, Centrale Lille, Inria, UMR 9189 - CRIStAL - Centre de Recherche en Informatique Signal et Automatique de Lille, F-59000 Lille, France

Corresponding author: [email protected]

Third-generation long-read sequencing technologies tackle the repeat problem in genome assembly by pro- ducing reads that are long enough to span most repeat instances. In principle one expects that with such reads most bacterial genomes will be assembled into a single contig [1]. However in practice, some datasets fail to be perfectly assembled even with leading assemblers, and are fragmented into a handful of contigs. As a mean to investigate those cases, we consider the string graphs that are generated by assemblers during intermediate stages of the assembly process. We seek to establish a coherent framework for analyzing these graphs, in the hope that they will help us determine the biological causes that led the assembler to output shorter contigs. This poster presents some preliminary results of such an analysis.

We visualized, analyzed and compared assembly graphs generated by Canu [2] and Miniasm assemblers [3] on biological (MBRAC-26 [4]) and synthetic datasets (created with LongISLND [5]). We introduce the concept of graph projection of an assembly graph onto another, taking advantage of the recent GFA format.

We are thus able to observe how reads that are neighbors of contigs extremities overlap, in terms of error rate and overlap length. We implemented an automatic and user-friendly snakemake pipeline that generates a HTML report for each assembly. We identified cases of contigs that were not joined by the assembler despite indications in the string graph that such joins could have been made. These cases highlight potential directions on how to improve the assembly process. In future work we will take advantage of this investigation to propose alternative assembly hypotheses based on string graph analysis.

References

[1] Sergey Koren and Adam M Phillippy. One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Current Opinion in Microbiology, 23:110–120, 2015.

[2] Sergey Koren, Brian P. Walenz, Konstantin Berlin, Jason R. Miller, Nicholas H. Bergman, and Adam M. Phillippy.

Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Re- search, page gr.215087.116, 2017.

[3] Heng Li. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32(14):2103–2110, 2016.

[4] Esther Singer, Bill Andreopoulos, Robert M. Bowers, Janey Lee, Shweta Deshpande, Jennifer Chiniquy, Doina Ciobanu, Hans-Peter Klenk, Matthew Zane, Christopher Daum, Alicia Clum, Jan-Fang Cheng, Alex Copeland, and Tanja Woyke. Next generation sequencing data of a defined microbial mock community. Scientific Data, 3:160081, 2016.

[5] Bayo Lau, Marghoob Mohiyuddin, John C. Mu, Li Tai Fang, Narges Bani Asadi, Carolina Dallett, and Hugo Y. K.

Lam. LongISLND: in silico sequencing of lengthy and noisy datatypes. Bioinformatics, 32(24):3829–3832, 2016.

Références

Documents relatifs

Whole metagenome analysis with metagWGS Joanna Fourquet, Adeline Chaubet, Hélène Chiapello, Christine Gaspin, Marisa Haenni, Christophe Klopp, Agnese Lupo, Jean Mainguy, Celine

A Long Read project to find optimal technologic combinations for genome assembly and variability, epigenetic marks detection and metagenomic analysis.. Carole Iampietro, Camille

2) with at least one incorrect transition (and no correct tran- sition). The transitions contained in this type of neighbour- hood highlight a choice that can lead to behaviours

Finally, Section 4 provides an illustration of our approach using numerical simulations, and Section 5 describes an application for examining interactions between the genomic markers

Results: In the present work, five clusters have been studied: a cluster of thirteen genes characterized by an F-box domain localized on chromosome 9, a cluster of six genes related

We use Minimap2 [10] (version 2.16-r922), with default parameters, as it is a fast and accurate mapper, specifically designed for long erroneous reads... Two corresponding to

Method We propose a novel method that aims at assigning a genotype for a set of already known SVs in a given individual sample sequenced with long reads data. In other words, the

Here, we present Dog10K_Boxer_Tasha_1.0, an improved chromosome- level highly contiguous genome assembly of Tasha created with long-read technologies that in- creases