• Aucun résultat trouvé

MINTIA : Metagenomic INserT BIoinformatic Annotation in Galaxy

N/A
N/A
Protected

Academic year: 2021

Partager "MINTIA : Metagenomic INserT BIoinformatic Annotation in Galaxy"

Copied!
2
0
0

Texte intégral

(1)

HAL Id: hal-01886424

https://hal.archives-ouvertes.fr/hal-01886424

Submitted on 5 Jun 2020

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Copyright

MINTIA : Metagenomic INserT BIoinformatic Annotation in Galaxy

Sandrine Laguerre, Sarah Maman, Sabrina Legoueix, Elisabeth Laville, Gabrielle Veronese, Christophe Klopp

To cite this version:

Sandrine Laguerre, Sarah Maman, Sabrina Legoueix, Elisabeth Laville, Gabrielle Veronese, et al..

MINTIA : Metagenomic INserT BIoinformatic Annotation in Galaxy. 16th European Conference on Computational Biology, Jul 2017, Prague, Czech Republic. �hal-01886424�

(2)

MINTIA

Metagenomic INsertT BIoinformatic Annotation in Galaxy

Functional metagenomics is used to understand who is doing what in microbial ecosystems. DNA sequencing can be prioritized by activity-based screening of libraries obtained by cloning and expressing metagenomic DNA fragments in an heterologous host. When large insert libraries are used, allowing a direct access to the functions encoded by entire metagenomic loci sizing several dozens of kbp, NGS is required to identify the genes that are responsible for the screened function. MINTIA is an easy to use pipeline assembling and annotating metagenomic inserts.

Sandrine Laguerre 1,2,3 , Sarah Maman 4 , Sabrina Legoueix 5 , Élisabeth Laville 1,2,3 , Gabrielle Potocki-Véronèse 1,2,3 & Christophe Klopp 4,6

1

Université de Toulouse; INSA, UPS, INP, 135 Avenue de Rangueil, F-31077, Toulouse, France,

2

INRA, UMR792 Ingénierie des Systèmes Biologiques et des Procédés, LISBP, F-31400, Toulouse, France,

3

CNRS, UMR5504, F-31400, Toulouse, France,

4

SIGENAE, GenPhySE, Université de Toulouse, INRA, INPT, ENV, Castanet Tolosan, France,

5

TWB, Université de Toulouse, INRA, INSA, CNRS, Ramonville Saint-Agne, France,

6

Plate-forme bio-informatique Genotoul, Mathématiques et Informatique Appliquées de Toulouse, INRA, Castanet Tolosan, France

Pipeline modules : 1/ metagenomic insert assembly & 2/ annotation.

The assembly module (1) named "DNA inserts reconstruction on MiSeq or proton data" uploads the user raw data read files in Galaxy and assembles, cleans and

extracts the longest and most covered contigs for each DNA insert. It handles reads (454, ion torrent,...) or read pairs (Illumina,...). This tools is not able to process PacBio or Oxford Nanopore reads. It sub-selects by default 300X read coverage depending on the sequencing platform and assembles them, removes cloning vector and selects

best contigs. Only contigs with length and average depth over given thresholds are kept. The produced HTML report includes a dynamic graphic with contig length and coverage for each sample.

First results on simulated datasets

Fosmides.zip

HTML zip report HTML table report and

Alignments

Alignments of contigs and ORFs vs. NR and SP db

.game files for Apollo genome viewer Functional and

taxonomical annotations Megan

Licence file

Tree image Megan

- ORFs description tabular - Nucleotidic ORFs fasta - Proteic ORFs fasta

Metagene

Alignments Alignment

against user DB

Database User (fasta file) RPSBlast COGs

- Alignments

- Filtered alignments - graphical view

(HTML table report) E-value

max

- Identity % min - Length min - E-value max

Submission information file

Submission  files

EMBL files

The annotation module (2) aims at obtaining main gene functions and a functional classification. The pipeline launches at least metagene ORF detection, generating fasta files for genes and proteins as well as a

tabular file containing ORFs description.

Depending on additional options selected, contigs and ORFs are

aligned against NCBI NR (Non Redundant) as well as SP (SwissProt) and COGss databases. These blast or RPSBlast results used for functional and taxonomical annotations produce .game formatted files up-loadable in Apollo for visualization and Megan functional classification tree to find the

corresponding organisms. This module can also generate a GeneBank submission file.

15 sets of reads were simulated for paired end MiSeq and single end Proton from Bifidobacterium adolescentis ATCC15703 whole genome and Bacteroides stercoris ATCC43183 scaffold_02_12 and scaffold_02_16. Those sets were analyzed using MINTIA module 1 and 2.

Length of simulated and assembled fosmid sequences are very close.

For all simulations, the assembled and cleaned sequence covers more than 99% of the simulated sequence with a percentage of similarity of 100% showing the high reliability of MINTIA module 1. The graph hereover on the left side compares the length between simulated and assembled sequence. The one on the right shows the length of the non corresponding part in base pairs.

Most of the times MINTIA module2 finds all the CDS of the simulated sequences. The graph on the left shows the number of missing CDS in the assembled sequences.

Close to all the CDS annotated with COGs on the simulated sequences are also annotated with MINTIA module 2. The graph on the right shows the CDS having different annotation between simulated and assembled sequences.

You can contact sandrine.laguerre@insa-toulouse.fr to get access to the modules in their actual state. They will be deposited on the Toolshed later this year.

1

2

The result file only includes selected cleaned contigs

Check input files Vector

sequence (fasta)

Verify : number of input files, valid fastq files, and if data are paired (check R1/R2) or single.

HTML report with:

- Data informations for each fastq file.

- Linux command lines for each steps.

-Graphic : contigs lenght and coverage per sample.

 Links to files for each sample

Determine datatype

Vector detection Assembly

Contig filtering Create VN and VM files

SPAdes command

are differents if data are paired or single.

For each sample :

The assembly is perform by SPAdes

Upload ZIP Input files "MiSeq.zip"  (paired files)

"Proton.zip" (Single)

uploaded in a zipped directory

Extract contigs fullfilling length and depth criteria, filter and clean.

Vector detection using cross-match script

Generate Galaxy outputs

Filter 100X

Sub-selection

Minimal length and

depth

Sequence of

assembled contigs Lenght depth of assembled contigs

Contigs with vector

cleaned/masked

Selected contigs (on length and depth) Vector position

in contigs Vector position

in contigs

Vector position in contigs

Sequence of assembled contigs

Sequence of assembled

contigs

Lenght depth of assembled contigs Lenght depth

of assembled contigs

Contigs with vector

cleaned/masked

Contigs with vector

cleaned/masked

Selected contigs (on length and depth)

Selected contigs (on length and depth)

fastq file ZIP

ZIP directory

with selected

contigs assembled

Références

Documents relatifs

These flight paths show that the bees do in fact learn two feeder locations, recruit their vector memories of the route flights between the hive and the two feeders, and perform

Research supported by Deutsche Forschungsgemeinschaft and SFB 256 (University of Bonn)... This clearly requires a connection on E as an additional structure, i.e. Extending our work

As for the Bacchus pipeline, it has been created at INRA Montpellier (France) to investigate clonal diversity in grapevine genomes. For this task, many softwares have been wrapped

Frick and Grohe [14,15,17] have defined Fixed Parameter Tractable (FPT for short) algorithms for FO model-checking problems on classes of graphs that may have unbounded degree

Asmare Muse, B, Rahman, M, Nagy, C, Cleve, A, Khomh, F & Antoniol, G 2020, On the Prevalence, Impact, and Evolution of SQL code smells in Data-Intensive Systems. in Proceedings

The linear graph of size n is a subgraph of a graph X with n vertices iff X has an Hamiltonian path. The decision problem whether a graph has a Hamiltonian path

The PARSEME Shared Task on the automatic identification of verbal multiword expressions (VMWEs) was the first collaborative study on the subject to cover a wide and diverse range

Because these genes, when mutated, may cause variable degrees of re- productive and olfactory dysfunction, as well as diverse nonreproductive, nonolfactory anomalies [as