Seed design framework for mapping SOLiD reads


Laurent Noé, Marta Gîrdea, Gregory Kucherov

LIFL (CNRS and Université Lille 1), INRIA Lille - Nord Europe

SHD 2010, ENS Paris, March 24, 2010

Outline

Background and motivation
Seed design
  Background
  Position-restricted seeds
  General approach
  Lossy seeds
  Lossless seeds
  Experiments
Conclusions and perspectives

High-throughput sequencing technologies

454 Life Sciences, Illumina/Solexa, Applied Biosystems (SOLiD), ..., Helicos (Heliscope), ..., IBM, DNA Nanoarrays, ...

Sequencing a human genome: >$100 million in 2001, ..., yesterday $48,000, today $4,400, tomorrow $100 (?)

“Reading” the genome by short reads of 25-250 bp with redundancy

Central problem of this talk: mapping reads to a reference genomic sequence


SOLiD™ system (Applied Biosystems)

2-base encoding of 35 bp reads ⇒ error-correcting capability that helps to reduce the error rate and to better distinguish between sequencing errors and SNPs

Mappings of color sequences must be implicitly interpreted as nucleotide alignments

[Figure: 2-base color code. Each of the four colors encodes a set of (1st base, 2nd base) pairs: {AA, CC, GG, TT}, {AC, CA, GT, TG}, {AG, CT, GA, TC}, {AT, CG, GC, TA}. Example template sequences: CATCTTG and CAACTTG.]

Properties and artifacts of SOLiD technology

SNPs correspond to 2 adjacent color mismatches
Reading errors tend to occur periodically, at a distance of 5 positions, and more often towards the end of the read

[Figure: Reading error probability correlation per position distance. x-axis: distance between read positions (0 to 35); y-axis: correlation coefficient (0.25 to 0.65).]

[Figure: Reading error probability per position. x-axis: position on the read (0 to 35); y-axis: average reading error probability (0.01 to 0.026).]
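The claim that a SNP shows up as 2 adjacent color mismatches can be checked with a small sketch of the 2-base encoding. The numeric labels are an assumption made for illustration: bases A, C, G, T are coded 0 to 3 and the color of two adjacent bases is the XOR of their codes, which reproduces the four pair groups of the color code. The function names are hypothetical.

```python
# Assumed numeric labels: bases A, C, G, T are coded 0..3, and the color of a
# pair of adjacent bases is the XOR of the two codes (colors 0..3).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(dna):
    """Translate a nucleotide string into its list of colors (one per adjacent pair)."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(dna, dna[1:])]


def color_mismatches(ref, read):
    """Positions at which the color sequences of two equal-length strings differ."""
    pairs = zip(to_colors(ref), to_colors(read))
    return [i for i, (x, y) in enumerate(pairs) if x != y]


reference = "CATCTTG"
with_snp = "CAACTTG"  # single substitution T -> A at index 2

print(to_colors(reference))                   # [1, 3, 2, 2, 0, 1]
print(to_colors(with_snp))                    # [1, 0, 1, 2, 0, 1]
print(color_mismatches(reference, with_snp))  # [1, 2]
```

The single nucleotide change alters exactly the two colors that involve the substituted base, while a lone color change would not correspond to any single substitution.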

Read mapping software

Numerous tools proposed since 2008:

Eland, SOCS, PatMaN, MAQ, ZOOM, SHRiMP, MOSAIK, PASS, PerM, RazerS, Bowtie, BWA, SOAP2, segemehl, MPSCAN, BFAST, ...

Many of them are based on seeding

Our “edge”: using advanced seed design techniques finely tuned to the statistical properties of SOLiD reads


Spaced seeds: background

Seed = pattern of matching characters which is defined to be evidence of a significant alignment

Ex: the seed ##### does not hit this alignment, whereas the spaced seed ##-##-# does

##-##-#
ATCAGTGCAATGCTCAAGA
||.||.||.||||:||.||
ATTAGCGCGATGCGCAGGA
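The example above can be replayed with a short sketch. It assumes that only the `|` symbol of the middle line counts as a match (so `.` and `:` are both mismatches) and that `#` marks a required match while `-` is a don't-care position. The function name is hypothetical.

```python
def seed_hits(seed, alignment):
    """Start offsets at which every '#' of the seed faces a match ('|')."""
    offsets = [i for i, c in enumerate(seed) if c == "#"]
    last_start = len(alignment) - len(seed)
    return [
        start
        for start in range(last_start + 1)
        if all(alignment[start + o] == "|" for o in offsets)
    ]


alignment = "||.||.||.||||:||.||"

print(seed_hits("#####", alignment))    # []
print(seed_hits("##-##-#", alignment))  # [0, 3, 6, 11]
```

The longest run of matches has length 4, so no contiguous seed of weight 5 can hit, while the spaced seed of the same weight hits at four offsets.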

Spaced seeds: background

Spaced seeds are more likely to hit an alignment than contiguous seeds of the same weight (= number of #) ⇒ more sensitive search [PatternHunter 2002, Yass 2004, ...]

Using seed families (several seeds simultaneously, such that a hit of at least one of them is sufficient) further improves the performance [PatternHunter II 2003, Buhler & Sun 2004].

Ex: {###-##, #--##---#-#}

Price: the memory needed for hash tables is multiplied.

Spaced seeds can (and should) be adapted to the search situation, depending on various statistical characteristics of the searched sequences, technological artifacts, desired selectivity (directly affecting speed), etc.


Seed design framework

Iedera software (http://bioinfo.lifl.fr/yass/iedera)

Computes the seed sensitivity with a dynamic programming algorithm as described in [Kucherov et al., 2006]
“Good” mappings are modeled by a Hidden Markov Model with emitting transitions
A seed, or a seed family, is modeled by a seed automaton
Example: the automaton Q of the spaced seed π = #-##
[Figure: automaton Q with states q0 to q5 and transitions labeled 0 and 1.]
Generates seed patterns and selects the most sensitive seed families
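To make "seed sensitivity computed by dynamic programming" concrete, here is a minimal sketch. It is not the algorithm of Iedera: it assumes the simplest alignment model (each position is a match independently with probability p) rather than an HMM, and it tracks the last span − 1 alignment symbols rather than the states of a seed automaton. All function names are hypothetical, and a brute-force enumeration is included only as a cross-check for short alignments.

```python
from itertools import product


def offsets(seed):
    """Offsets of the '#' (must-match) positions of a seed."""
    return [i for i, c in enumerate(seed) if c == "#"]


def sensitivity_dp(seeds, n, p):
    """P(at least one seed hits) for a random 0/1 alignment of length n."""
    span = max(len(s) for s in seeds)
    # window of the last span-1 symbols -> probability mass of unhit prefixes
    states = {(): 1.0}
    for _ in range(n):
        nxt = {}
        for window, mass in states.items():
            for bit, q in ((1, p), (0, 1.0 - p)):
                w = window + (bit,)
                hit = any(
                    len(s) <= len(w)
                    and all(w[len(w) - len(s) + o] == 1 for o in offsets(s))
                    for s in seeds
                )
                if hit:
                    continue  # a hit is absorbing: drop this mass
                key = w[-(span - 1):] if span > 1 else ()
                nxt[key] = nxt.get(key, 0.0) + mass * q
        states = nxt
    return 1.0 - sum(states.values())


def sensitivity_brute(seeds, n, p):
    """Same quantity by enumerating all 2**n alignments (small n only)."""
    total = 0.0
    for aln in product((0, 1), repeat=n):
        hit = any(
            all(aln[start + o] == 1 for o in offsets(s))
            for s in seeds
            for start in range(n - len(s) + 1)
        )
        if hit:
            k = sum(aln)
            total += p ** k * (1.0 - p) ** (n - k)
    return total


family = ["###-##", "#--##---#-#"]
print(sensitivity_dp(["#-##"], 20, 0.7))
print(sensitivity_dp(family, 20, 0.7))
print(abs(sensitivity_dp(family, 14, 0.7) - sensitivity_brute(family, 14, 0.7)) < 1e-9)
```

The number of DP states grows as 2^(span − 1), which is why an automaton-based formulation is preferable for long seeds or seed families.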


Position-restricted seeds

Motivation:

Reads are short sequences of fixed length
Reminder: the reading error probability increases towards the end of the read, implying that a search for similarity within the last positions of the read could lead to erroneous results or no results at all

Idea:

Favor hits at the positions of the read where matches are more likely to be significant

Position-restricted seeds

Position-restricted seed: a seed π designed jointly with a set of positions P to which it is applied on the read.

Example: π = #-##, Pπ = {0, 3, 9, 13, 18}

Alignment        1110011101011101101100100
Positioned seeds #-##
                    #-##
                          #-##
                              #-##
                                   #-##
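A minimal sketch of how a position-restricted seed is applied to the example alignment above, assuming 1 denotes a match and 0 a mismatch. The function name is hypothetical.

```python
def positioned_hits(seed, positions, alignment):
    """Allowed start positions at which every '#' of the seed faces a '1'."""
    offsets = [i for i, c in enumerate(seed) if c == "#"]
    return [
        p
        for p in sorted(positions)
        if p + len(seed) <= len(alignment)
        and all(alignment[p + o] == "1" for o in offsets)
    ]


alignment = "1110011101011101101100100"

# Restricted to P = {0, 3, 9, 13, 18}
print(positioned_hits("#-##", {0, 3, 9, 13, 18}, alignment))  # [9, 13]

# Same seed tried at every position, for comparison
print(positioned_hits("#-##", range(len(alignment)), alignment))  # [9, 13, 16]
```

The unrestricted seed also hits at position 16, but that hit is ignored because 16 is not in P.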

Position-restricted seeds

Restricting the seed

To take into account the set P of allowed positions, we compute the product of Q with an automaton λP consisting of a linear chain of m + 1 states (m = read length) whose final states are $F = \{q_i : i - s \in P\}$ (s = the span of the seed π concerned).

Example: π = #-## (span s = 4), Pπ = {0, 1, 3, 5}, m = 10

[Figure: linear chain q0 (start), q1, ..., q10 with transitions labeled *, and the seed #-## placed at positions 0, 1, 3 and 5.]
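The effect of the product with the linear chain can be sketched without building automata explicitly: after reading i symbols the chain is in state q_i, and a seed occurrence ending there counts only if q_i is final. This is an illustrative simplification with hypothetical function names, not the construction used in the actual software.

```python
def final_states(positions, span, m):
    """Indices i of final chain states q_i, i.e. those with i - span in P."""
    return sorted(p + span for p in positions if p + span <= m)


def accepted_hits(seed, positions, alignment):
    """Indices i of final states q_i reached while a seed occurrence ends there."""
    span = len(seed)
    offsets = [k for k, c in enumerate(seed) if c == "#"]
    finals = set(final_states(positions, span, len(alignment)))
    reached = []
    for i in range(span, len(alignment) + 1):  # i symbols read: chain is in q_i
        start = i - span
        seed_matches = all(alignment[start + o] == "1" for o in offsets)
        if seed_matches and i in finals:
            reached.append(i)
    return reached


print(final_states({0, 1, 3, 5}, 4, 10))                # [4, 5, 7, 9]
print(accepted_hits("#-##", {0, 1, 3, 5}, "0011110111"))  # [9]
```

In the second call the seed also matches at start position 2 (ending in q6), but q6 is not final, so only the occurrence starting at position 5 is accepted.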


Designing seeds for SOLiD read mapping: lossy vs. lossless

Lossy seeds: the goal is to detect most of the target alignments (better seeds have higher sensitivity)

Lossless seeds: the goal is to detect all the alignments with up to a given number of errors (or a given score threshold)

Both settings are used in practice, e.g.

SHRiMP: lossy
ZOOM, PerM, MAQ: lossless

Designing seeds for SOLiD read mapping: challenges

There are two independent sources of errors in reads with respect to the reference genome:

reading errors (misread colors)
SNPs/indels, i.e., bona fide differences between the reference genome and the sequenced data

Both error types must be handled, and their superposition considered in the design process

Our contribution to seed design for mapping SOLiD reads

We design position-restricted seeds for mapping SOLiD reads, both in the lossy and lossless settings

In the lossy framework:

We represent each of the two error sources (SNPs and reading errors) by a separate Hidden Markov Model; the two are combined in a model which allows all error types to accumulate in the resulting sequences
We design sensitive seeds w.r.t. this combined model

In the lossless framework:

We can distinguish between reading errors and SNPs when specifying what a seed is lossless for (e.g. lossless for 1 SNP and 2 reading errors)
This distinction is possible thanks to an automaton that restricts the set of alignments to those with the specified numbers of errors
We apply a fast algorithm that verifies the lossless property directly on the seed automaton to design lossless seeds



Lossy framework: seed design

Select the most sensitive seeds w.r.t. “good” read mappings

“Good” mappings are modeled by a combination of two HMMs representing the biological variation and the reading errors respectively

Lossy framework: Biological variations model

DNA modifications reflected in the color sequence:

G C A T C T T G A T G A A T G G A G T A T C G
 | | . . . | | . | . | | . - - | | | - - | |
G C A C A T T G C G G A A - - G A G - - T C G

Consecutive mutations: bases $B_1 B_2 B_3 B_4 B_5$ with colors $c_1 c_2 c_3 c_4$ become $B_1 B'_2 B'_3 B'_4 B_5$ with colors $c'_1 c'_2 c'_3 c'_4$

$c_1 \neq c'_1$, $c_4 \neq c'_4$
for $i = 2, 3$: $c_i \neq c'_i$ in 3/4 cases

Consecutive indels: bases $B_1 B_2 B_3 B_4 B_5$ with colors $c_1 c_2 c_3 c_4$ become $B_1 B_5$ with color $c'_1$

$c'_1 \neq c_1$ in 3/4 cases

Lossy framework: Biological variations model (M_SNP/I)

[Figure: HMM with three states, Match, SNP and Indel. Each transition is labeled with its probability (pSNP, pIndel or 1 − pSNP − pIndel) and with the emission probabilities pcm, pce, pci listed in the legend.]

States refer to the DNA alignment
Emitted symbols refer to the color alignment

Legend (transitions):

color matches: pcm = 1, pce = 0, pci = 0
color mismatches: pcm = 0, pce = 1, pci = 0
1/4 color matches + 3/4 mismatches: pcm = 0.25, pce = 0.75, pci = 0
color indels: pcm = 0, pce = 0, pci = 1

Lossy framework: Reading errors model (M_RE)

Reminder: the reading error probability increases towards the end of the read
Reminder: errors tend to appear with a periodicity of 5

[Figure: HMM with states labeled 0/5, 1/5, 2/5, 3/5, 4/5 and 5/5 reading errors; transitions are labeled ps.]

Lossy framework: Combined model

The model which combines both error sources is the product of M_SNP/I and M_RE.

How errors accumulate (example):

M_SNP/I        MMMEEMMMEEMMMEIMMMMMIMMMMMM
× M_RE         MMMMMMMMMEMMMMEMMMMEMEMEMEE
= M_(SNP/I)×RE MMMEEMMMEEMMMEIMMMMEIEMEMEE
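The accumulation shown in the example can be reproduced symbol by symbol. The rule below is read off the example only and is an assumption about how the slide combines the two sequences, not a statement about the full product HMM: an I in the biological sequence stays I, otherwise an E in either sequence gives E, otherwise M. The function name is hypothetical.

```python
def combine(snp_indel, reading_errors):
    """Superpose two per-position annotations over the alphabet {M, E, I}."""
    if len(snp_indel) != len(reading_errors):
        raise ValueError("both sequences must have the same length")
    out = []
    for a, b in zip(snp_indel, reading_errors):
        if a == "I":
            out.append("I")   # indels come only from the biological model
        elif a == "E" or b == "E":
            out.append("E")   # an error from either source is an error
        else:
            out.append("M")
    return "".join(out)


m_snp_i = "MMMEEMMMEEMMMEIMMMMMIMMMMMM"
m_re = "MMMMMMMMMEMMMMEMMMMEMEMEMEE"

print(combine(m_snp_i, m_re))  # MMMEEMMMEEMMMEIMMMMEIEMEMEE
```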

Lossy framework: Seed selection according to sensitivity

The sensitivity of a seed (or seed family) is defined as the probability that at least one of the seeds hits a read alignment, with respect to a given probabilistic model of the alignment [Ma et al., 2002, Keich et al., 2004].

Using the dynamic programming technique of [Kucherov et al., 2006] within Iedera, we select the most sensitive seeds w.r.t. the specified model.


Lossless framework: Computing the lossless property

Lossless seeds have the capacity to hit all alignments containing up to a specified number of errors:

lossless for 2 mismatches,
lossless for 1 mismatch and 1 indel,
lossless for 1 SNP and 3 reading errors, ...

Lossless framework: Computing the lossless property

Straightforward way: construct a deterministic automaton recognizing the set of all target alignments and test whether the language of this automaton is included in the language of the automaton Q of the seed. This is infeasible in practice.

We propose an efficient dynamic programming algorithm, applied directly to Q, that can verify the inclusion:

Time complexity: O(|Q| · read length); space complexity: O(|Q|)
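The idea of checking the lossless property by dynamic programming can be sketched as follows. This is a simplified stand-in, not the algorithm of the talk: it handles mismatches only, and it uses the last span − 1 alignment symbols as DP states instead of the states of the seed automaton Q. For each state it keeps the smallest number of mismatches of any prefix that no seed has hit; the seed family is lossless for k mismatches exactly when escaping all hits needs more than k mismatches. Function names are hypothetical, and the brute-force routine is only a cross-check for short reads.

```python
from itertools import product


def offsets(seed):
    return [i for i, c in enumerate(seed) if c == "#"]


def min_mismatches_to_escape(seeds, m, positions=None):
    """Fewest mismatches in a length-m alignment that no seed hits.

    positions: allowed seed start positions (None means every position).
    """
    span = max(len(s) for s in seeds)
    best = {(): 0}  # window of last span-1 symbols -> fewest mismatches so far
    for i in range(1, m + 1):  # i = number of symbols read so far
        nxt = {}
        for window, cost in best.items():
            for bit in (1, 0):
                w = window + (bit,)
                hit = any(
                    len(s) <= len(w)
                    and (positions is None or i - len(s) in positions)
                    and all(w[len(w) - len(s) + o] == 1 for o in offsets(s))
                    for s in seeds
                )
                if hit:
                    continue  # this prefix is detected, so it cannot escape
                key = w[-(span - 1):] if span > 1 else ()
                new_cost = cost + (1 - bit)
                if new_cost < nxt.get(key, m + 1):
                    nxt[key] = new_cost
        best = nxt
    return min(best.values())


def is_lossless(seeds, m, k, positions=None):
    """True if every length-m alignment with at most k mismatches is hit."""
    return min_mismatches_to_escape(seeds, m, positions) > k


def brute_force(seeds, m, positions=None):
    """Same minimum by enumerating all 2**m alignments (small m only)."""
    best = m + 1
    for aln in product((0, 1), repeat=m):
        hit = any(
            all(aln[p + o] == 1 for o in offsets(s))
            for s in seeds
            for p in range(m - len(s) + 1)
            if positions is None or p in positions
        )
        if not hit:
            best = min(best, m - sum(aln))
    return best


print(min_mismatches_to_escape(["###"], 6))  # 2
print(is_lossless(["###"], 6, 1))            # True
print(is_lossless(["###"], 5, 1))            # False
```

With a read of length 6, a single mismatch always leaves a run of at least 3 matches, so the contiguous seed ### is lossless for 1 mismatch; with length 5, a mismatch in the middle escapes it.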

Lossless framework: Separating reading errors and SNPs

The method can be extended in order to distinguish reading errors from SNPs.

Example: automaton for 1 SNP and 2 color substitutions

[Figure: automaton with a start state and a fail state; transitions are labeled 0 and 1.]

Designing lossless seeds for k SNPs and h color substitutions
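What "lossless for k SNPs and h color substitutions" asks for can be illustrated by exhaustive enumeration, which is the straightforward approach rather than the efficient automaton-based one. The sketch assumes that each SNP produces two adjacent color mismatches, that each reading error produces one color mismatch, and that superposed errors simply form the union of mismatching positions. The function name and these modeling shortcuts are assumptions for illustration.

```python
from itertools import combinations


def is_lossless_snp_re(seed, positions, m, k_snp, h_err):
    """True if the positioned seed hits every color alignment of length m
    obtained from at most k_snp SNPs and at most h_err reading errors."""
    offs = [i for i, c in enumerate(seed) if c == "#"]
    starts = [p for p in positions if p + len(seed) <= m]
    for n_snp in range(k_snp + 1):
        for snps in combinations(range(m - 1), n_snp):
            for n_err in range(h_err + 1):
                for errs in combinations(range(m), n_err):
                    bad = set(errs)
                    for s in snps:
                        bad.update((s, s + 1))  # a SNP spoils two adjacent colors
                    hit = any(all(p + o not in bad for o in offs) for p in starts)
                    if not hit:
                        return False  # found an alignment that escapes the seed
    return True


# Contiguous seed ### tried at every position of the read
print(is_lossless_snp_re("###", range(5), 7, 1, 0))  # True
print(is_lossless_snp_re("###", range(4), 6, 1, 0))  # False
print(is_lossless_snp_re("###", range(5), 7, 1, 1))  # False
```

The enumeration grows combinatorially with the read length and the error budget, which is the practical reason for working directly on the seed automaton instead.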


Lossy seeds restricted to 10 positions

1-Lossy-10p (sensitivity 0.9543): seed #######--### applied at 10 read positions
2-Lossy-10p (sensitivity 0.9627): seeds ########## and ######----####, each applied at 5 read positions

[Figure: placement of the seeds along the read, positions 1 to 30.]

Lossy seeds restricted to 12 positions

1-Lossy-12p (sensitivity 0.9626): seed #######--### applied at 12 read positions
2-Lossy-12p (sensitivity 0.9685): seeds ########## and ####--##--####, each applied at 6 read positions

[Figure: placement of the seeds along the read, positions 1 to 30.]

Lossless seeds for 1 SNP and 2 reading errors

1-Lossless-14p: seed ####--#---####--# applied at 14 read positions
2-Lossless-8p: seeds ########## and #####---#####, each applied at 4 read positions

[Figure: placement of the seeds along the read, positions 1 to 30.]

Comparative performance of position-restricted seeds

Theoretical sensitivity/selectivity of single seeds


Seed comparison

Data: 100000 reads of length 34 from S. cerevisiae
Scoring scheme: +1 for match, 0 for color mismatch or SNP, -2 for gaps
Results: the number of read/reference alignments hit by each (single or double) seed, with scores varying from 28 to 34

[Figure: bar chart "Alignments hit by each seed family", y-axis from 20000 to 70000. Legend: 1-Lossy-10p, 1-Lossy-12p, 2-Lossy-10p, 2-Lossy-12p, 2-Lossless-8p, 1-Lossless-14p. Bar values:]

63708 53551 45048 36617 27643 18662
65267 54275 45327 36682 27653 18662
65960 54476 45404 36728 27660 18662
65960 54476 45404 36728 27660 18662
64142 53643 45181 36682 27656 18662
62134 52629 44772 36663 27656 18662

Comparison with other software

Our read mapping software: uses a SIMD bandwidth alignment filter.
SHRiMP [Rumble et al., PLoS Comp Bio 2009]: uses a spaced seed family, a multi-hit method and a SIMD filter; hashes the reads rather than the reference genome.
PerM [Chen et al., Bioinformatics 2009]: uses lossless “periodic” seeds.

Experimental results

Comparable setup for the three programs. Cut-off score 46 under scoring system (2, −3, −7, −4).

Seed set ID     Number  Patterns                    Positions
SHRiMP-default  4       #####---#######             All
                        ####--###--#--####
                        ###--#--#---###--####
                        ####--#----#---#--#--####
PerM-F3-S20     2       ###-#--#---###-#--#---##    All
                        ####--#----####--#----##
4-Lossy-12      4       ######-######               All
                        ####-####--####
                        ####-##---###--###
                        ######--#----#-####
3-Lossy-12      3       ####-####-####              All
                        ####-###--#----####
                        ####----##--##-####
4-Lossy-10      4       ####-######                 All
                        ###-####--###
                        ####-##---#--###
                        #####---##-#--##

In order to create similar setups for all tools, PerM was configured to accept up to 5 mismatches (default is 3), and our program was launched with the following parameters, equivalent to the default SHRiMP settings:

Parameter                                          Value
match score                                        2
mismatch score                                     -3
gap opening penalty                                -7
gap extension penalty                              -4
minimum score accepted by the SIMD filter          40
minimum score accepted by the alignment algorithm  46

Program   Seed set               Mapped reads      Execution time
SHRiMP    SHRiMP-default         663,923 (51.85%)  31m07s
PerM      lossless 5 mismatches  618,554 (48.30%)  0m25s
our tool  PerM-F3-S20            539,772 (42.15%)  2m02s
our tool  SHRiMP-default         675,308 (52.74%)  5m06s
our tool  3-Lossy-12             677,043 (52.87%)  4m40s
our tool  4-Lossy-12             678,455 (52.98%)  6m02s
our tool  4-Lossy-10             679,802 (53.09%)  30m17s

Table 1. Comparison with SHRiMP and PerM. Dataset: S. cerevisiae (1,280,536 reads)


Contributions

A seed design framework for mapping SOLiD reads to a reference genomic sequence
The concept of position-restricted seeds, particularly suitable for short alignments with non-uniform error distribution
A model that captures the statistical characteristics of the SOLiD reads, used for the evaluation of lossy seeds
An efficient dynamic programming algorithm for verifying the lossless property of seeds, with the capacity to distinguish between SNPs and reading errors in seed design
A selection of “ready-to-use” seeds (seed families) (cf. http://www.lifl.fr/yass/iedera solid)
An experimental read mapping software (to be released)

Acknowledgments

ANR project CoCoGen (BLAN07-1 185484) – funding for Laurent Noé.

Valentina Boeva and Emmanuel Barillot (Institut Marie Curie Paris) – for helpful discussions and for providing the dataset of Saccharomyces cerevisiae reads that we used as a test set in our study.

Martin Figeac (Institut national de la santé et de la recherche médicale) – for sharing insightful knowledge about the SOLiD technology.

Thank you!

Questions?

More details in:

Noé, L., Gîrdea, M., and Kucherov, G. (2010). Seed design framework for mapping SOLiD reads. Proceedings of the 14th Annual International Conference on Computational Molecular Biology (RECOMB) (accepted).

References

Biosystems, A. (2008). A Theoretical Understanding of 2 Base Color Codes and Its Application to Annotation, Error Detection, and Error Correction. Methods for Annotating 2 Base Color Encoded Reads in the SOLiD™ System.

Brejova, B., Brown, D., and Vinar, T. (2003). Optimal spaced seeds for Hidden Markov Models, with application to homologous coding regions. Lecture Notes in Computer Science, 2676:42–54.

Keich, U., Li, M., Ma, B., and Tromp, J. (2004). On spaced seeds for similarity search. Discrete Applied Mathematics, 138(3):253–263. (preliminary version in 2002).

Kucherov, G., Noé, L., and Roytberg, M. (2006). A unifying framework for seed sensitivity and its application to subset seeds. Journal of Bioinformatics and Computational Biology, 4(2):553–570.

Ma, B., Tromp, J., and Li, M. (2002). PatternHunter: Faster and more sensitive homology search. Bioinformatics, 18(3):440–445.

Sun, Y. and Buhler, J. (2005).

Zhou, L., Stanton, J., and Florea, L. (2008). Universal seeds for cDNA-to-genome comparison. BMC Bioinformatics, 9(1):36.
