
Minimal perfect hash functions in large scale bioinformatics Problem


HAL Id: hal-01341718

https://hal.archives-ouvertes.fr/hal-01341718

Submitted on 4 Jul 2016



Antoine Limasset, Camille Marchet, Pierre Peterlongo, Lucie Bittner

To cite this version:

Antoine Limasset, Camille Marchet, Pierre Peterlongo, Lucie Bittner. Minimal perfect hash functions in large scale bioinformatics Problem. JOBIM 2016, Jun 2016, Lyon, France. ⟨hal-01341718⟩


Minimal perfect hash functions in large scale bioinformatics

Antoine Limasset, Camille Marchet and Pierre Peterlongo

INRIA/IRISA/GenScale, Campus de Beaulieu, 35042 Rennes Cedex, France



Short Read Connector tools (SRC Counter, SRC Linker)

Given two sets of reads, A and B:

- Output the reads in A that have T k-mers in common with the reads of B, and estimate their coverage.

- Output the reads in A that have T k-mers that appear in set B.
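The selection described above can be sketched in a few lines of Python. This is only an illustration, not the Short Read Connector implementation: the function names `kmers` and `reads_sharing_kmers` are hypothetical, "T k-mers" is read here as "at least T distinct k-mers", a plain Python set stands in for the compact index, and coverage estimation is left out.

```python
def kmers(read, k):
    """Return the set of all length-k substrings (k-mers) of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def reads_sharing_kmers(reads_a, reads_b, k, t):
    """Return the reads of A with at least t distinct k-mers present in B."""
    # Index every k-mer of B. A plain set is exact but memory-hungry;
    # it is an assumption standing in for a compact index.
    index_b = set()
    for read in reads_b:
        index_b |= kmers(read, k)

    selected = []
    for read in reads_a:
        shared = len(kmers(read, k) & index_b)
        if shared >= t:
            selected.append(read)
    return selected


reads_a = ["ACGTACGT", "TTTTTTTT", "ACGTTTTT"]
reads_b = ["GGACGTAC", "CGTACGTA"]
print(reads_sharing_kmers(reads_a, reads_b, k=4, t=3))  # ['ACGTACGT']
```

Only the first read of A shares enough 4-mers with B; lowering `t` to 1 would also keep "ACGTTTTT", which shares the single 4-mer "ACGT".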

BBHash library

- Memory efficient (less than 3 bits per key)

- Fast queries (200 ns)

- Fast to construct (even for billions of elements)
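To make the idea of a minimal perfect hash function concrete, here is a toy Python sketch built from a cascade of bit arrays. It is not BBHash's code or API: the class `CascadeMPHF`, the helper `hash64`, and the parameters `gamma` and `max_levels` are assumptions made for this example, and the linear-time rank computation would be far too slow at the scale the poster targets.

```python
import hashlib


def hash64(key, seed):
    """Deterministic 64-bit hash of a string key; each seed gives a new function."""
    digest = hashlib.blake2b(key.encode(), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


class CascadeMPHF:
    """Toy minimal perfect hash function built from a cascade of bit arrays."""

    def __init__(self, keys, gamma=2.0, max_levels=64):
        remaining = sorted(set(keys))  # keys must be distinct
        self.levels = []               # one list of 0/1 flags per level
        while remaining:
            level = len(self.levels)
            if level == max_levels:
                raise RuntimeError("too many levels for this toy sketch")
            size = max(1, int(gamma * len(remaining)))
            hits = [0] * size
            for key in remaining:
                hits[hash64(key, level) % size] += 1
            # A bit is set only where exactly one key landed.
            self.levels.append([1 if h == 1 else 0 for h in hits])
            # Keys that collided are retried at the next level.
            remaining = [k for k in remaining
                         if hits[hash64(k, level) % size] > 1]
        # offsets[i] = number of keys placed in the levels before level i.
        self.offsets, placed = [], 0
        for bits in self.levels:
            self.offsets.append(placed)
            placed += sum(bits)

    def lookup(self, key):
        """Return the index of an indexed key in [0, n).

        A key outside the indexed set may get a valid-looking index or None.
        """
        for level, bits in enumerate(self.levels):
            pos = hash64(key, level) % len(bits)
            if bits[pos]:
                # Rank of this bit among all set bits (linear scan here).
                return self.offsets[level] + sum(bits[:pos])
        return None


keys = [f"kmer_{i}" for i in range(500)]
mphf = CascadeMPHF(keys)
indices = sorted(mphf.lookup(k) for k in keys)
print(indices == list(range(len(keys))))
```

Each indexed key ends up with a distinct index between 0 and n-1, so values can be stored in a flat array of n cells without storing the keys themselves.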

Quasi-dictionary: put a fingerprint of the key in the value and check it at query time.
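The fingerprint check can be sketched as follows. This is a minimal illustration under stated assumptions, not the quasi_dictionary library: `QuasiDictionary`, `build_tiny_mphf`, and `hash64` are hypothetical names, and the hash function is found by brute-force seed search, which only works for a handful of keys and merely stands in for a real minimal perfect hash function.

```python
import hashlib


def hash64(key, seed, person=b"slot"):
    """Deterministic 64-bit hash; `person` separates slot and fingerprint hashes."""
    digest = hashlib.blake2b(key.encode(), digest_size=8,
                             salt=seed.to_bytes(8, "little"),
                             person=person).digest()
    return int.from_bytes(digest, "little")


def build_tiny_mphf(keys):
    """Brute-force a seed that maps n distinct keys one-to-one onto 0..n-1."""
    keys = list(keys)
    n = len(keys)
    if n == 0:
        raise ValueError("need at least one key")
    seed = 0
    while len({hash64(k, seed) % n for k in keys}) != n:
        seed += 1
    return lambda key: hash64(key, seed) % n


class QuasiDictionary:
    """Stores (fingerprint, value) per slot; the keys themselves are not kept."""

    def __init__(self, items, fingerprint_bits=12):
        self.mphf = build_tiny_mphf(items)
        self.mask = (1 << fingerprint_bits) - 1
        self.slots = [None] * len(items)
        for key, value in items.items():
            self.slots[self.mphf(key)] = (self._fingerprint(key), value)

    def _fingerprint(self, key):
        return hash64(key, 0, person=b"fingerprint") & self.mask

    def get(self, key, default=None):
        fingerprint, value = self.slots[self.mphf(key)]
        # A stranger key lands in some slot; the fingerprint check rejects it
        # unless its fingerprint happens to equal the stored one.
        if fingerprint != self._fingerprint(key):
            return default
        return value


counts = {"ACGT": 12, "CGTA": 7, "GTAC": 3, "TACG": 9, "TTTT": 1}
qd = QuasiDictionary(counts)
print(qd.get("GTAC"))  # an indexed key always returns its own value
print(qd.get("GGGG"))  # a stranger key is rejected unless fingerprints collide
```

Indexed keys are always found, while a key outside the set is reported as present only when its fingerprint collides with the one stored in the slot it hashes to.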

Problem

(Meta)genomic data: billions of short sequences, each hundreds of base pairs long, from one or multiple genomes.

Questions

Dataset comparison: detection of similar reads between or within datasets.

Indexing huge sets of elements.

Quasi-dictionary

False positive rate versus memory consumption:

Bits/element: 10, FP rate: 1/10²

Bits/element: 20, FP rate: 1/10

Bits/element: 30, FP rate: 1/10

...
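The trade-off between fingerprint size and false positives follows from a standard argument: if fingerprints are uniformly distributed over f bits, a key outside the indexed set matches the stored fingerprint with probability 2^-f. The small Python sketch below computes that rate and its inverse; the function names are hypothetical, and the uniformity of fingerprints is an assumption.

```python
import math


def false_positive_rate(fingerprint_bits):
    """Probability that a stranger key matches a uniform f-bit fingerprint."""
    return 2.0 ** -fingerprint_bits


def fingerprint_bits_for(target_rate):
    """Smallest fingerprint size whose false positive rate is <= target_rate."""
    return math.ceil(math.log2(1.0 / target_rate))


print(false_positive_rate(10))     # 0.0009765625, i.e. 1/1024
print(fingerprint_bits_for(0.01))  # 7
```

The poster does not say how its bits-per-element budget is split between the hash function and the fingerprint. As a guess only: about 3 bits for the hash function would leave 7 bits of fingerprint in a 10-bit budget, which is consistent with a rate close to 1/10².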

BUT a minimal perfect hash function alone has two limitations:

- No membership operation

- A 'stranger' key (one absent from the indexed set) can be associated with a value

Links

BBHash library: github.com/rizkg/BBHash

Quasi-dictionary: github.com/pierrepeterlongo/quasi_dictionary

Short Read Connector: github.com/GATB/rconnector

References

Bowtie2: Langmead, Ben, and Steven L. Salzberg. "Fast gapped-read alignment with Bowtie 2." Nature methods 9.4 (2012): 357-359.

BWA: Li, Heng, and Richard Durbin. "Fast and accurate short read alignment with Burrows–Wheeler transform." Bioinformatics 25.14 (2009): 1754-1760.

Starcode: Zorita, Eduard, Pol Cuscó, and Guillaume J. Filion. "Starcode: sequence clustering based on all-pairs search." Bioinformatics 31.12 (2015): 1913-1919.

BLAST: Altschul, Stephen F., et al. "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic acids research 25.17 (1997): 3389-3402.

SRC: http://arxiv.org/pdf/1605.08319.pdf

Results

The qualitative aspects of our methods still have to be assessed.

Figure: time for indexing and querying 1M reads with SRC Linker.

Figure: time for indexing and querying 100M reads with SRC Counter.

Figure: memory used for indexing 100M reads with SRC Counter.
