Suffix Tree and Suffix Array of an Alignment

1 Sejong University, Korea

2 Hanyang University, Korea

3 Seoul National University, Korea

4 Normandie Université, Université de Rouen, LITIS EA 4108, France

SeqBio 2013

November 25th-26th 2013 – Montpellier, France

Outline

1 Indexing Structures

2 Suffix tree of an alignment

3 Suffix array of an alignment

4 Conclusion & Perspectives

TL (LITIS) STA and SAA SeqBio 2013 2 / 36

Similar data

More and more sequencing of individual genomes

Need for efficient storage, indexing and pattern matching structures

Non Compact Suffix Tree or Suffix Trie

y = atatgat$

Suffixes of y, by starting position:

0 atatgat$
1 tatgat$
2 atgat$
3 tgat$
4 gat$
5 at$
6 t$
7 $

[Figure: suffix trie of y, one letter per edge, with leaves labelled by the starting positions 0 to 7 of the suffixes.]

(Compact) Suffix Tree

[Figure: the suffix trie of y = atatgat$ (one letter per edge) next to the compact suffix tree of y, whose edges carry the factors at, t, gat$, atgat$ and $; both have leaves 0 to 7.]

(Compact) Suffix Tree

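0 1 2 3 4 5 6 7
a t a t g a t $

[Figure: the compact suffix tree of y drawn twice, once with explicit edge labels (at, t, gat$, atgat$, $) and once with each label stored as a (position, length) pair into y: at = (0,2), t = (1,1), gat$ = (4,4), atgat$ = (2,6), $ = (7,1).]

[Weiner’73, McCreight’76, Ukkonen’92, Farach’97]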

(Compact) Suffix Tree

0 1 2 3 4 5 6 7
a t a t g a t $

[Figure: compact suffix tree of y with edges labelled by (position, length) pairs.]

ta is a factor of y; tt is not.

at occurs 3 times, at positions 0, 2 and 5.
t occurs 3 times, at positions 1, 3 and 6.

[Figure: compact suffix tree of y, annotated with suffix links.]

Suffix link: s(au) = u

Suffix Array

0 1 2 3 4 5 6 7
y = a t a t g a t $

Suffixes of y, by starting position:

0 atatgat$
1 tatgat$
2 atgat$
3 tgat$
4 gat$
5 at$
6 t$
7 $

SA: sort of the suffixes

7 $
5 at$
0 atatgat$
2 atgat$
4 gat$
6 t$
1 tatgat$
3 tgat$

Space: n log n bits

[Manber & Myers’90, Kärkkäinen & Sanders’03, Kim et al.’03, Ko & Aluru’03]
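The suffix array above, and the occurrence queries shown earlier on the suffix tree, can be reproduced with a short sketch. It is a naive illustration only: it sorts the suffixes directly instead of using any of the cited linear-time constructions, and the function names `suffix_array` and `occurrences` are chosen here for the example.

```python
def suffix_array(text):
    """Starting positions of the suffixes of text, in lexicographic order.

    Naive construction by direct sorting; '$' sorts before letters in ASCII.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def occurrences(text, sa, pattern):
    """Sorted starting positions of pattern in text, by binary search on sa."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose length-m prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


y = "atatgat$"
sa = suffix_array(y)
print(sa)                        # [7, 5, 0, 2, 4, 6, 1, 3]
print(occurrences(y, sa, "at"))  # [0, 2, 5]
print(occurrences(y, sa, "t"))   # [1, 3, 6]
print(occurrences(y, sa, "tt"))  # []
```

All occurrences of a pattern form one contiguous block of the array, which is why two binary searches are enough.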

These structures can be extended to several sequences:

Generalized Suffix Tree (GST)
Generalized Suffix Array (GSA)

Background

Mäkinen et al. [RECOMB 2009, J. Comput. Biol. 2010]: first proposed an index for similar strings using run-length encoding, a suffix array, and BWT

Huang et al. [AAIM 2010]: proposed an index of size O(n + N log N) bits

  n is the total length of common parts in one string
  N is the total length of different parts in all strings

building data structures separately for common parts and for non-common parts

Kuruppu et al. [SPIRE 2010]: relative Lempel-Ziv compression and pattern search

Similar sequences consist of a succession of:

common regions
non-common regions

Example

x = αβγ and y = αδγ

The alignment of x and y is written α(β/δ)γ
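As a rough illustration of the notation α(β/δ)γ, the sketch below splits two strings into a common prefix, two differing middle parts and a common suffix. Taking the longest common prefix and then the longest non-overlapping common suffix is an assumption made for this example; the slides take the alignment as given. The name `align_pair` is hypothetical.

```python
def align_pair(x, y):
    """Split x and y as x = alpha+beta+gamma and y = alpha+delta+gamma.

    Assumption for this sketch: alpha is the longest common prefix and
    gamma the longest common suffix that does not overlap alpha.
    """
    limit = min(len(x), len(y))
    p = 0
    while p < limit and x[p] == y[p]:
        p += 1
    s = 0
    # The common suffix must not overlap the common prefix in either string.
    while s < limit - p and x[len(x) - 1 - s] == y[len(y) - 1 - s]:
        s += 1
    alpha = x[:p]
    beta = x[p:len(x) - s]
    delta = y[p:len(y) - s]
    gamma = x[len(x) - s:]
    return alpha, beta, delta, gamma


alpha, beta, delta, gamma = align_pair("actaatat", "actatcat")
print(f"{alpha}({beta}/{delta}){gamma}$")  # acta(at/tc)at$
```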

Similar sequences

More than 2 sequences

More than 1 non-common region

Suffix tree of an alignment

Intuition

The suffix tree of the alignment α(β/δ)γ$ of x = αβγ and y = αδγ is the suffix tree obtained by removing redundancy from the GST of x and y

The suffixes of the alignment are of four types:

1 suffixes starting in γ$
2 suffixes starting in α*β
3 suffixes starting in α*δ
4 suffixes starting in α(α*)^{-1}, the part of α that precedes its suffix α*

where:

α* is the longest among α*_x and α*_y
α*_x is the longest suffix of α occurring at least twice in x
α*_y is the longest suffix of α occurring at least twice in y

Example

x = actaatat and y = actatcat

α = acta, β = at, γ = at$ and δ = tc

The alignment is thus acta(at/tc)at$

α*_x = ta and α*_y = a, thus α* = ta
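The values in this example can be checked with a brute-force sketch. Here α* denotes the longest suffix of α occurring at least twice in x or in y, as defined earlier. The code counts occurrences naively and does not use a suffix tree; the helper names are chosen for illustration.

```python
def count_occurrences(text, pattern):
    """Number of (possibly overlapping) occurrences of a non-empty pattern."""
    return sum(text.startswith(pattern, i)
               for i in range(len(text) - len(pattern) + 1))


def longest_repeated_suffix(alpha, text):
    """Longest suffix of alpha occurring at least twice in text ('' if none)."""
    for length in range(len(alpha), 0, -1):
        suffix = alpha[len(alpha) - length:]
        if count_occurrences(text, suffix) >= 2:
            return suffix
    return ""


x, y, alpha = "actaatat", "actatcat", "acta"
alpha_x = longest_repeated_suffix(alpha, x)
alpha_y = longest_repeated_suffix(alpha, y)
alpha_star = max(alpha_x, alpha_y, key=len)
print(alpha_x, alpha_y, alpha_star)  # ta a ta
```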

Example

x = ac ta at at and y = ac ta tc at

Type 1 suffixes (suffixes starting in γ$):

$
t$
at$

Type 2 suffixes (suffixes starting in α*β):

tat$
atat$
aatat$
taatat$

Type 3 suffixes (suffixes starting in α*δ):

cat$
tcat$
atcat$
tatcat$

Type 4 suffixes (suffixes starting in α(α*)^{-1}):

cta(at/tc)at$
acta(at/tc)at$
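The four lists can be generated mechanically once α, β, δ, γ and α* (the longest suffix of α repeated in x or y, here ta) are known. The following is a small sketch of that enumeration; the function name and the way type 4 suffixes are printed with the (β/δ) notation are choices made for this example.

```python
def alignment_suffixes(alpha, beta, delta, gamma, alpha_star):
    """Return the type 1, 2, 3 and 4 suffixes of alpha(beta/delta)gamma$."""
    g = gamma + "$"
    k = len(alpha) - len(alpha_star)   # length of the part of alpha before alpha_star
    xs = alpha_star + beta + g         # tail of x from the start of alpha_star
    ys = alpha_star + delta + g        # tail of y from the start of alpha_star
    type1 = [g[i:] for i in range(len(g))]
    type2 = [xs[i:] for i in range(len(alpha_star) + len(beta))]
    type3 = [ys[i:] for i in range(len(alpha_star) + len(delta))]
    tail = f"({beta}/{delta}){g}"
    type4 = [alpha[i:] + tail for i in range(k)]
    return type1, type2, type3, type4


for number, suffixes in enumerate(
        alignment_suffixes("acta", "at", "tc", "at", "ta"), start=1):
    print(f"type {number}:", suffixes)
```

Suffixes of y that start in γ$ are not listed a second time: they are identical to the type 1 suffixes, which is the redundancy the alignment removes.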

Definition

Compact suffix tree with suffixes of types 1, 2, 3 and 4

Suffix tree of an alignment

Construction

insert suffixes of x (type 1, 2 and 4 suffixes) → ST(x)
compute α*
insert suffixes starting in α*δ (type 3 suffixes)

Use a doubling technique: check if the suffix of α of length 1, 2, 4, 8, ... occurs more than once in x (represented by at least two leaves) [McCreight’76]

Let α(h) be the first suffix of α, of length h, found by the doubling technique to occur only once in x; then h/2 ≤ |α*_x| < h

Find α*_x from α(h) using suffix links

Can be done in O(|α|) time
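The doubling idea can be sketched without a suffix tree. The code below is a simplification under stated assumptions: occurrences are counted naively rather than read from the leaves of ST(x), and the final refinement is a plain scan of the interval [h/2, h) instead of a walk along suffix links, so it does not achieve the O(|α|) bound. It only illustrates how doubling brackets the length of the longest suffix of α repeated in x.

```python
def count_occurrences(text, pattern):
    """Number of (possibly overlapping) occurrences of a non-empty pattern."""
    return sum(text.startswith(pattern, i)
               for i in range(len(text) - len(pattern) + 1))


def repeated_suffix_length(alpha, x):
    """Length of the longest suffix of alpha occurring at least twice in x."""
    def repeated(length):
        return count_occurrences(x, alpha[len(alpha) - length:]) >= 2

    # Doubling phase: test lengths 1, 2, 4, 8, ... until a suffix occurs only once
    # (or the tested length exceeds |alpha|).
    h = 1
    while h <= len(alpha) and repeated(h):
        h *= 2
    # The answer now satisfies h/2 <= answer < h. Being repeated is monotone
    # (every suffix of a repeated suffix is repeated), so scan until the first failure.
    best = h // 2
    for length in range(h // 2 + 1, min(h, len(alpha) + 1)):
        if not repeated(length):
            break
        best = length
    return best


print(repeated_suffix_length("acta", "actaatat"))  # 2, i.e. the suffix "ta"
print(repeated_suffix_length("acta", "actatcat"))  # 1, i.e. the suffix "a"
```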

Computation of α*

insert suffixes of y starting in α*_x δ → T′
T′ has all the information for computing α*
determine α* in T′ using the doubling technique

Theorem

The suffix tree of the alignment α(β/δ)γ$ of x = αβγ and y = αδγ can be computed in time O(|x| + |α*δγ′|), where γ′ is a prefix of γ satisfying some conditions.

Recall that the GST of x and y needs O(|x| + |αδγ|) time.

References

J. C. Na, H. Park, M. Crochemore, J. Holub, C. S. Iliopoulos, L. Mouchard, and K. Park.
Suffix tree of an alignment: An efficient index for similar data.
In T. L. and L. Mouchard, editors, Proceedings of the 24th International Workshop on Combinatorial Algorithms (IWOCA 2013), number 8288 in Lecture Notes in Computer Science, Rouen, France, 2013. Springer-Verlag, Berlin, to appear.

Suffix Array of an alignment

Definition

Indices of suffixes of types 1, 2, 3 and 4 sorted in lexicographic order

Example

x = actaatat and y = actatcat

All 4 types of suffixes:

$
t$
at$
tat$
atat$
aatat$
taatat$
cat$
tcat$
atcat$
tatcat$
cta(at/tc)at$
acta(at/tc)at$

Example

x = actaatat and y = actatcat

Suffix Array of the Alignment (SAA):

(0,8) $
(1,3) aatat$
(0,0) acta(at/tc)at$
(0,6) at$
(1,4) atat$
(2,3) atcat$
(2,5) cat$
(0,1) cta(at/tc)at$
(0,7) t$
(1,2) taatat$
(1,5) tat$
(2,2) tatcat$
(2,4) tcat$
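A brute-force sketch of this array follows. Two points are assumptions read off the example rather than stated on the slide: each entry is taken to be a pair (source, position) with source 0 for suffixes shared by x and y (types 1 and 4), 1 for suffixes of x only (type 2) and 2 for suffixes of y only (type 3); and type 4 suffixes are compared through their spelling in x, which is safe when the comparison is decided inside α. The order is strict lexicographic order with $ smallest, which places taatat$ before tat$ and tatcat$ before tcat$.

```python
def suffix_array_of_alignment(alpha, beta, delta, gamma, alpha_star):
    """Naive SAA: list of ((source, position), displayed suffix), sorted."""
    x = alpha + beta + gamma + "$"
    y = alpha + delta + gamma + "$"
    k = len(alpha) - len(alpha_star)   # first position covered by alpha_star
    tail = f"({beta}/{delta}){gamma}$"
    entries = []                       # (sort key, label, displayed suffix)
    for i in range(k):                                # type 4, shared
        entries.append((x[i:], (0, i), alpha[i:] + tail))
    for i in range(k, len(alpha) + len(beta)):        # type 2, from x
        entries.append((x[i:], (1, i), x[i:]))
    for i in range(k, len(alpha) + len(delta)):       # type 3, from y
        entries.append((y[i:], (2, i), y[i:]))
    for i in range(len(alpha) + len(beta), len(x)):   # type 1, positions in x
        entries.append((x[i:], (0, i), x[i:]))
    entries.sort(key=lambda entry: entry[0])
    return [(label, shown) for _, label, shown in entries]


for label, shown in suffix_array_of_alignment("acta", "at", "tc", "at", "ta"):
    print(label, shown)
```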

α* is the longest among α*_x and α*_y where
α*_x is the longest suffix of α occurring at least twice in x
α*_y is the longest suffix of α occurring at least twice in y

γ* is the longest among γ*_x and γ*_y where
γ*_x is the longest prefix of γ occurring at least twice in x
γ*_y is the longest prefix of γ occurring at least twice in y

Suffix array of an alignment: construction

1 construct the GSA of x and α*δγ*d, where d is the symbol following γ* in γ

2 delete suffixes of γ*d
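The two steps can be tried on the running example with a naive sketch. It rests on a reading of the slide that should be treated as an assumption: the second string of the GSA is α*δγ*d, where α* (respectively γ*) is the longest suffix of α (respectively prefix of γ) occurring at least twice in x or y, and d is the symbol following γ* in γ$. The GSA is built by plain sorting, not by a linear-time algorithm, and all function names are illustrative.

```python
def count_occurrences(text, pattern):
    """Number of (possibly overlapping) occurrences of a non-empty pattern."""
    return sum(text.startswith(pattern, i)
               for i in range(len(text) - len(pattern) + 1))


def longest_repeated(candidates, texts):
    """Longest candidate occurring at least twice in some text ('' if none)."""
    good = [c for c in candidates
            if any(count_occurrences(t, c) >= 2 for t in texts)]
    return max(good, key=len, default="")


def saa_by_deletion(alpha, beta, delta, gamma):
    """Step 1: GSA of x and alpha*.delta.gamma*.d; step 2: delete suffixes of gamma*.d."""
    x = alpha + beta + gamma + "$"
    y = alpha + delta + gamma + "$"
    alpha_star = longest_repeated([alpha[i:] for i in range(len(alpha))], [x, y])
    gamma_star = longest_repeated([gamma[:i] for i in range(1, len(gamma) + 1)], [x, y])
    d = (gamma + "$")[len(gamma_star)]      # symbol following gamma_star in gamma$
    z = alpha_star + delta + gamma_star + d
    gsa = sorted([(x[i:], 1, i) for i in range(len(x))] +
                 [(z[i:], 2, i) for i in range(len(z))])
    cut = len(alpha_star) + len(delta)      # suffixes of z from here on lie in gamma_star + d
    return [(s, src, i) for s, src, i in gsa if not (src == 2 and i >= cut)]


for suffix, source, position in saa_by_deletion("acta", "at", "tc", "at"):
    print(source, position, suffix)
```

On this example γ* = at and d = $, so the second string is tatcat$ and the three deleted suffixes are at$, t$ and $, leaving the same 13 suffixes as in the example above.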

Symmetrically, compute |γ*| by searching for γ in the suffix array of (αxδγx)

Suffix array of an alignment

Theorem

The suffix array of the alignment α(β/δ)γ$ of x = αβγ and y = αδγ can be computed in time O(|x| + |α*δγ*|).

Recall that the GSA of x and y needs O(|x| + |αδγ|) time.

J. C. Na, H. Park, S. Lee, M. Hong, T. L., L. Mouchard, and K. Park.
Suffix array of alignment: A practical index for similar data.
In O. Kurland, M. Lewenstein, and E. Porat, editors, Proceedings of the 20th International Symposium on String Processing and Information Retrieval (SPIRE 2013), number 8214 in Lecture Notes in Computer Science, pages 243–254, Jerusalem, Israel, 2013. Springer-Verlag, Berlin.

Conclusion & Perspectives

suffix tree of an alignment (done)
suffix array of an alignment (done)
FM-index of an alignment (in progress)

THANK YOU FOR YOUR ATTENTION!
