• Aucun résultat trouvé

Speeding up NGS software development

N/A
N/A
Protected

Academic year: 2021

Partager "Speeding up NGS software development"

Copied!
2
0
0

Texte intégral

(1)

HAL Id: hal-01088683

https://hal.archives-ouvertes.fr/hal-01088683

Submitted on 16 Jan 2015

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Speeding up NGS software development

Erwan Drezen, Guillaume Rizk, Rayan Chikhi, Charles Deltel, Claire Lemaitre, Pierre Peterlongo, Dominique Lavenier

To cite this version:

Erwan Drezen, Guillaume Rizk, Rayan Chikhi, Charles Deltel, Claire Lemaitre, et al.. Speeding up NGS software development. Sequencing, Finishing and Analysis in the Future Meeting, May 2014, Santa Fé, United States. �hal-01088683�

(2)

GATB-CORE GATB-TOOLS

GATB-PIPELINE PARTIESTHIRD

GATB's de Bruijn graph: a basis for families of tools ► Data error correction

► Assembly

► Biological motif detection

Several tools based on GATB are already available

Bloocoo K-mer spectrum based read error corrector for large datasets

Minia Short read assembler based on a de Bruijn graph. Results are

of similar contiguity and accuracy to other de Bruijn assemblers (e.g. Velvet)

DiscoSNP Discover Single Nucleotide Polymorphism (SNP) from

non-assembled reads

TakeABreak Detects inversion breakpoints without a reference genome by

looking for fixed size topological patterns in the de Bruijn graph

5. GATB helps you as a NGS user

The GATB C++ library gives you the opportunity to quickly develop new NGS tools that fit your needs.

Major facts about the GATB C++ library ► Object Oriented Design

► Simple and powerful graph API

► Simple and powerful multithreading model ► HDF5 usage for data storage

► Fully documented with numerous code samples

► Complete test suite

6. GATB helps you as a NGS developer

Here is a typical workflow when working with GATB

GATB-CORE transforms the reads into a de Bruijn graph, saves it in a HDF5 file that can be opened by other tools developed with the GATB-CORE API.

The core data structure of GATB is a de Bruijn graph that encodes the main information from the sequencing reads.

Strength of GATB

GATB makes this graph compact by using a Bloom filter (a space efficient probabilistic

data structure) and by using a CFP additional structure that avoids false positive answers from the Bloom filter due to its probabilistic nature.

The GATB philosophy proposes a 3-layer construction to analyze NGS datasets

GATB

GATB

Speeding-up NGS

Speeding-up NGS

software development

software development

Sequencing, Finishing and Analysis in the Future Meeting, Santa Fé, USA, May 2014

Sequencing, Finishing and Analysis in the Future Meeting, Santa Fé, USA, May 2014

1. What is GATB ?

Motivation

NGS technologies produce terabytes of data. Efficient and fast NGS algorithms are essential to analyze them.

Objective

The Genome Assembly Tool Box (GATB) ► is an open-source software

► provides an easy way to develop efficient and fast NGS tools ► is based on data structure with a very low memory footprint

► allows complex genomes to be processed on desktop computers

Erwan Drézen

1

, Guillaume Rizk

1

, Rayan Chikhi

2

, Charles Deltel

1

, Claire Lemaitre

1

, Pierre Peterlongo

1

and Dominique Lavenier

1

1 INRIA/IRISA/GenScale, Campus de Beaulieu, 35042 Rennes cedex 2 Department of Computer Science and Engineering, Pennsylvania State University, USA

2. Software Solution

3. Compact de Bruijn graph data structure

4. Workflow

Partners

Publications

1. GATB-CORE: a C++ library holding all the services needed for developing software dedicated to NGS data.

2. GATB-TOOLS: a set of elementary NGS tools mainly built upon the GATB library (k-mer counter, contiger, scaffolder, variant detection, etc.).

3. GATB-PIPELINE: a set of NGS pipeline that links together tools from the previous layer.

License & Web Site

GATB is released under the GNU Affero General Public License.

Proprietary licencing for software editors or services providers is currently being studied.

For more details on GATB:

http://gatb.inria.fr

a whole human genome sequencing reads can be handled with 5 GBytes of memory

How to

How to

Analyze Complex

Analyze Complex

Genomes on a

Genomes on a

Simple Desktop

Simple Desktop

Computer ?

Computer ?

>read 1 ACGACGACGTAGACGACTAGCA AAACTACGATCGACTAT >read 2 ACTACTACGATCGATGGTCGCG CTGCTCGCTCTCTCGCT ... >read 100.000.000 TCTCCTAGCGCGGCGTATACGC TCGCTAGCTACGTAGCT ... CGC GCT CCG TCC CTA TAT ATC CGA AGC ATT GAG AAA GGA TGG TTG

Critical False Positive (additional structure) Real de Bruijn graph node False Positive G ra p h (H D F 5) R ea d s (F A S TA ) >14_G1511837 AGTCGGCTAGCATAG TGCTCAGGAGCTTAA ACATGCATGAGAG >14_G1517637 ATCGACTTCTCTTCT TTTCGAGCTTAGCTA ATCA >14_G1517621 CCGATCGTAGAATTA ATAGATATATAA Library API Binaries GATB-CORE Tool 2 Tool 1 Tool 3 GATB-TOOLS

G. Rizk, D. Lavenier, R. Chikhi, DSK: k-mer counting with very low memory

usage, Bioinformatics, 2013 Mar 1;29(5):652-3

R. Chikhi, G. Rizk. Space-efcient and exact de Bruijn graph representation

based on a Bloom filter, Algorithms for Molecular Biology 2013, 8:22

G. Collet, G. Rizk, R. Chikhi, D. Lavenier, Minia on Raspberry Pi, assembling

a 100 Mbp genome on a Credit Card Sized Computer, Poster at the JOBIM

conference, 2013 Jul 1-4 (Toulouse) Best poster award.

K.l Salikhov, G. Sacomoto, G. Kucherov, Using Cascading Bloom Filters to

Improve the Memory Usage for de Brujin Graphs, Algorithms in Bioinformatics,

Références

Documents relatifs

We therefore propose that the rules underpinning the molecular evolution of cytoplasmic genomes (i.e. genome size, substitution rate and number of copies per cell), is in

In Theorems 5.1 and 6.3 , additionally, all the different error components (time and space discretizations, linearization, algebraic) are identified, leading to a proposition of a

In order to obtain the radiation pattern of the overall RRA, the array factor is computed from the actual phase responses of each active cell.. The phase response

Im Bezug der Lernstrategien, und -techniken lässt sich die Meinung von Rampillon und Bimmel 2000 zitieren, die der Meinung sind, dass die Lernstrategien die wichtigste

In this work, we presented an analogy between privacy aspects of sharing photos and sharing genomes, and proposed an informed model to share genomic data according to

Provided that all records are time-synchronized, mode shapes can be identified by inspecting the modal displacement time histories Modal displacements are

However, finding the gateways in an automated fashion (e.g., by the formulation of appropriate queries), could increase the efficiency and usability of CHOOSE for

Maximum likelihood phylogenetic tree of all the sequences of the TrV1-B (A) and TrV1-C (B) strains obtained after Sanger sequencing (brown tips) or MinION sequencing followed by