• Aucun résultat trouvé

GATB: Toolbox for developing efficient NGS software

N/A
N/A
Protected

Academic year: 2021

Partager "GATB: Toolbox for developing efficient NGS software"

Copied!
2
0
0

Texte intégral

(1)

HAL Id: hal-01088828

https://hal.archives-ouvertes.fr/hal-01088828

Submitted on 4 Dec 2014

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

GATB: Toolbox for developing efficient NGS software

Erwan Drezen, G Rizk, R Chikhi, Charles Deltel, C Lemaitre, P Peterlongo, D Lavenier

To cite this version:

Erwan Drezen, G Rizk, R Chikhi, Charles Deltel, C Lemaitre, et al.. GATB: Toolbox for develop-ing efficient NGS software. 9th Brazilian Symposium on Bioinformatics, BSB 2014, Oct 2014, Belo Honrizonte, Brazil. �hal-01088828�

(2)

GATB-CORE GATB-TOOLS

GATB-PIPELINE PARTIESTHIRD

GATB's de Bruijn graph: a basis for families of tools ► Data error correction

► Assembly

► Biological motif detection

Several tools based on GATB are already available

Bloocoo K-mer spectrum based read error corrector for large datasets

Minia Short read assembler based on a de Bruijn graph. Results are

of similar contiguity and accuracy to other de Bruijn assemblers (e.g. Velvet)

DiscoSNP Discover Single Nucleotide Polymorphism (SNP) from

non-assembled reads

TakeABreak Detects inversion breakpoints without a reference genome by

looking for fixed size topological patterns in the de Bruijn graph

5. GATB helps you as a NGS user

The GATB C++ library gives you the opportunity to quickly develop new NGS tools that fit your needs.

Major facts about the GATB C++ library ► Object Oriented Design

► Simple and powerful graph API

► Simple and powerful multithreading model ► HDF5 usage for data storage

► Fully documented with numerous code samples

► Complete test suite

6. GATB helps you as a NGS developer

Here is a typical workflow when working with GATB

GATB-CORE transforms the reads into a de Bruijn graph, saves it in a HDF5 file that can be opened by other tools developed with the GATB-CORE API.

The core data structure of GATB is a de Bruijn graph that encodes the main information from the sequencing reads.

Strength of GATB

GATB makes this graph compact by using a Bloom filter (a space efficient probabilistic

data structure) and by using a CFP additional structure that avoids false positive answers from the Bloom filter due to its probabilistic nature.

The GATB philosophy proposes a 3-layer construction to analyze NGS datasets

GATB

GATB

Tool Box for developing

Tool Box for developing

efficient NGS software

efficient NGS software

9th Brazilian Symposium on Bioinformatics, BSB 2014

9th Brazilian Symposium on Bioinformatics, BSB 2014

1. What is GATB ?

Motivation

NGS technologies produce terabytes of data. Efficient and fast NGS algorithms are essential to analyze them.

Objective

The Genome Assembly Tool Box (GATB) ► is an open-source software

► provides an easy way to develop efficient and fast NGS tools ► is based on data structure with a very low memory footprint

► allows complex genomes to be processed on desktop computers

Erwan Drézen

1

, Guillaume Rizk

1

, Rayan Chikhi

2

, Charles Deltel

1

, Claire Lemaitre

1

, Pierre Peterlongo

1

and Dominique Lavenier

1

1 INRIA/IRISA/GenScale, Campus de Beaulieu, 35042 Rennes cedex 2 Department of Computer Science and Engineering, Pennsylvania State University, USA

2. Software Solution

3. Compact de Bruijn graph data structure

4. Workflow

Partners

Publications

1. GATB-CORE: a C++ library holding all the services needed for developing software dedicated to NGS data.

2. GATB-TOOLS: a set of elementary NGS tools mainly built upon the GATB library (k-mer counter, contiger, scaffolder, variant detection, etc.).

3. GATB-PIPELINE: a set of NGS pipeline that links together tools from the previous layer.

License & Web Site

GATB is released under the GNU Affero General Public License.

Proprietary licencing for software editors or services providers is currently being studied.

For more details on GATB:

http://gatb.inria.fr

a whole human genome sequencing reads can be handled with 5 GBytes of memory

How to

How to

Analyze Complex

Analyze Complex

Genomes on a

Genomes on a

Simple Desktop

Simple Desktop

Computer ?

Computer ?

>read 1 ACGACGACGTAGACGACTAGCA AAACTACGATCGACTAT >read 2 ACTACTACGATCGATGGTCGCG CTGCTCGCTCTCTCGCT ... >read 100.000.000 TCTCCTAGCGCGGCGTATACGC TCGCTAGCTACGTAGCT ... CGC GCT CCG TCC CTA TAT ATC CGA AGC ATT GAG AAA GGA TGG TTG

Critical False Positive (additional structure) Real de Bruijn graph node False Positive G ra p h (H D F 5) R ea d s (F A S TA ) >14_G1511837 AGTCGGCTAGCATAG TGCTCAGGAGCTTAA ACATGCATGAGAG >14_G1517637 ATCGACTTCTCTTCT TTTCGAGCTTAGCTA ATCA >14_G1517621 CCGATCGTAGAATTA ATAGATATATAA Library API Binaries GATB-CORE Tool 2 Tool 1 Tool 3 GATB-TOOLS

G. Rizk, D. Lavenier, R. Chikhi, DSK: k-mer counting with very low memory

usage, Bioinformatics, 2013 Mar 1;29(5):652-3

R. Chikhi, G. Rizk. Space-efficient and exact de Bruijn graph representation

based on a Bloom filter, Algorithms for Molecular Biology 2013, 8:22

G. Collet, G. Rizk, R. Chikhi, D. Lavenier, Minia on Raspberry Pi, assembling

a 100 Mbp genome on a Credit Card Sized Computer, Poster at the JOBIM

conference, 2013 Jul 1-4 (Toulouse) Best poster award.

K.l Salikhov, G. Sacomoto, G. Kucherov, Using Cascading Bloom Filters to

Improve the Memory Usage for de Brujin Graphs, Algorithms in Bioinformatics,

Références

Documents relatifs

In this programme, particular prominence is being given to coastal and estuarine areas, and there is special emphasis on monitoring the conservation status of critical marine

We compared the speed of emBayesR with BayesR and fastBayesB using three criteria: the time complexity of each iteration (the function in terms of number of SNPs and individuals

The replacement of Asp470 by Ile or Leu contributes with Leu382 and others to a non-polar lining of the substrate channel, which benefits the binding of the aliphatic aldehydes

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des

Encore peu connue en France, l’éthique narrative pour- rait cependant trouver sa place dans la formation des médecins, dans le cadre d’une réforme qui cherche, depuis

In support of these pre-clinical results, liver biopsies from patients with moderate and severe hepatic steatosis showed increased VLDL receptor levels and reduced PPARβ/δ mRNA

En particulier, cette étude montre la plus grande stabilité des couches d’oxyde PECVD, qui présentent une contrainte résiduelle relativement basse après recuit

Chapitre 4 : « Effets de l’image scientifique dans la compréhension des textes en biologie ».. Description du contexte