1 Sejong University, Korea
2 Hanyang University, Korea
3 Seoul National University, Korea
4 Normandie Université, Université de Rouen, LITIS EA 4108, France
SeqBio 2013
November 25th-26th 2013 – Montpellier, France
Outline
1 Indexing Structures
2 Suffix tree of an alignment
3 Suffix array of an alignment
4 Conclusion & Perspectives
TL (LITIS) STA and SAA SeqBio 2013 2 / 36
Similar data
More and more sequencing of individual genomes
Need for efficient storage, indexing and pattern matching structures
Non Compact Suffix Tree or Suffix Trie
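position 0 1 2 3 4 5 6 7
symbol   a t a t g a t $
Suffixes of y = atatgat$, by starting position:
0 atatgat$
1 tatgat$
2 atgat$
3 tgat$
4 gat$
5 at$
6 t$
7 $
[Figure: suffix trie of y, one symbol per edge and one leaf per suffix; leaves are numbered 0 to 7 by starting position]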
(Compact) Suffix Tree
[Figure: the suffix trie of y = atatgat$ and the compact suffix tree obtained from it by merging each chain of single-child nodes into one edge; the resulting edge labels are at, t, gat$, atgat$ and $]
(Compact) Suffix Tree
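position 0 1 2 3 4 5 6 7
symbol   a t a t g a t $
[Figure: compact suffix tree of y = atatgat$ with leaves 0 to 7, drawn once with edge labels written as strings and once with each label replaced by a (position, length) pair into y]
Edge labels as (position, length) pairs:
at     = (0,2)
t      = (1,1)
gat$   = (4,4)
$      = (7,1)
atgat$ = (2,6)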
[Weiner’73,McCreight’76,Ukkonen’92,Farach’97]
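The (position, length) encoding of edge labels can be checked with a few lines of Python. This is only a minimal sketch: the function name `edge_label` is hypothetical, and the pairs are the ones read off the figure.

```python
y = "atatgat$"

# Edge labels of the compact suffix tree, stored as (position, length) pairs into y.
edges = [(0, 2), (1, 1), (4, 4), (7, 1), (2, 6)]

def edge_label(text, position, length):
    """Return the factor of text denoted by the pair (position, length)."""
    return text[position:position + length]

labels = [edge_label(y, position, length) for position, length in edges]
print(labels)  # ['at', 't', 'gat$', '$', 'atgat$']
```

Each edge then costs two integers whatever the length of its label, which is what keeps the compact tree linear in size.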
(Compact) Suffix Tree
position 0 1 2 3 4 5 6 7
symbol   a t a t g a t $
[Figure: compact suffix tree of y = atatgat$ with (position, length) edge labels, used for pattern matching]
ta is a factor of y; tt is not
at occurs 3 times, at positions 0, 2 and 5
t occurs 3 times, at positions 1, 3 and 6
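The queries above can be reproduced with a small Python sketch. It assumes a naive suffix trie (quadratic space) rather than the compact suffix tree of the slide, and the names `build_suffix_trie` and `occurrences` are hypothetical; only the way a pattern is spelled from the root is the same.

```python
def build_suffix_trie(text):
    """Naive suffix trie: every node records the start positions of the
    suffixes whose path goes through it."""
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for symbol in text[start:]:
            node = node["children"].setdefault(
                symbol, {"children": {}, "starts": []})
            node["starts"].append(start)
    return root

def occurrences(root, pattern):
    """Spell the (non-empty) pattern from the root; return its sorted
    start positions, or an empty list if it is not a factor."""
    node = root
    for symbol in pattern:
        node = node["children"].get(symbol)
        if node is None:
            return []
    return sorted(node["starts"])

trie = build_suffix_trie("atatgat$")
print(occurrences(trie, "at"))  # [0, 2, 5]
print(occurrences(trie, "t"))   # [1, 3, 6]
print(occurrences(trie, "tt"))  # []
```

In the compact suffix tree the positions are not stored at each node; they are read from the leaves below the node reached by the pattern.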
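[Figure: compact suffix tree of y = atatgat$ with (position, length) edge labels]
Suffix link: s(au) = u
The suffix link of the node spelling au (a a letter, u a word) points to the node spelling u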
Suffix Array
position 0 1 2 3 4 5 6 7
symbol   a t a t g a t $
Suffixes of y = atatgat$, by starting position:
0 atatgat$
1 tatgat$
2 atgat$
3 tgat$
4 gat$
5 at$
6 t$
7 $
SA: starting positions of the suffixes, sorted in lexicographic order
7 $
5 at$
0 atatgat$
2 atgat$
4 gat$
6 t$
1 tatgat$
3 tgat$
Space: n log n bits
[Manber & Myers’90, Kärkkäinen & Sanders’03, Kim et al.’03, Ko & Aluru’03]
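As an illustration, the suffix array of the example can be built and queried in Python. This is a sketch only: it sorts the suffixes directly, which is far from the linear-time constructions cited above, and the names `suffix_array` and `find_occurrences` are hypothetical. It relies on `$` being smaller than the letters in ASCII.

```python
def suffix_array(text):
    """Start positions of the suffixes of text in lexicographic order
    (naive sort, not a linear-time construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Binary search for the block of suffixes that start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:                      # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])

y = "atatgat$"
sa = suffix_array(y)
print(sa)                             # [7, 5, 0, 2, 4, 6, 1, 3]
print(find_occurrences(y, sa, "at"))  # [0, 2, 5]
```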
These structures can be extended to several sequences:
Generalized Suffix Tree (GST)
Generalized Suffix Array (GSA)
Background
Mäkinen et al. [RECOMB 2009, J. Comput. Biol. 2010]: first proposed an index for similar strings, using run-length encoding, a suffix array, and the BWT
Huang et al. [AAIM 2010]: proposed an index of size O(n + N log N) bits
  n is the total length of the common parts in one string
  N is the total length of the different parts in all strings
  data structures are built separately for the common parts and for the non-common parts
Kuruppu et al. [SPIRE 2010]: relative Lempel-Ziv compression and pattern search
Similar sequences consist of a succession of:
common regions
non-common regions
Example
x = αβγ and y = αδγ
The alignment of x and y is written α(β/δ)γ
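One simple way to obtain such a decomposition for two strings is sketched below in Python. It is an assumption of this sketch, not something stated in the talk, that α is taken as the longest common prefix and γ as the longest common suffix that does not overlap it; the name `split_alignment` is hypothetical.

```python
def split_alignment(x, y):
    """Return (alpha, beta, delta, gamma) with x = alpha+beta+gamma and
    y = alpha+delta+gamma, taking alpha as the longest common prefix and
    gamma as the longest common suffix that does not overlap alpha."""
    shortest = min(len(x), len(y))
    p = 0
    while p < shortest and x[p] == y[p]:
        p += 1
    s = 0
    while s < shortest - p and x[len(x) - 1 - s] == y[len(y) - 1 - s]:
        s += 1
    return x[:p], x[p:len(x) - s], y[p:len(y) - s], x[len(x) - s:]

alpha, beta, delta, gamma = split_alignment("actaatat", "actatcat")
print(f"{alpha}({beta}/{delta}){gamma}")  # acta(at/tc)at
```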
Similar sequences
More than 2 sequences
More than 1 non-common region
Suffix tree of an alignment
Intuition
The suffix tree of the alignment α(β/δ)γ$ of x = αβγ and y = αδγ is the suffix tree obtained by removing redundancy from the GST of x and y
1 suffixes starting in γ$
2 suffixes starting in α*β
3 suffixes starting in α*δ
4 suffixes starting in α(α*)^-1, that is, in the prefix of α that remains once its suffix α* is removed
where:
α* is the longest of α_x and α_y
α_x is the longest suffix of α occurring at least twice in x
α_y is the longest suffix of α occurring at least twice in y
Example
x = actaatat and y = actatcat
α = acta, β = at, γ = at$ and δ = tc
The alignment is thus acta(at/tc)at$
α_x = ta and α_y = a, thus α* = ta
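The values of α_x, α_y and α* in this example can be checked by brute force. The Python sketch below simply tries the suffixes of α from longest to shortest and counts occurrences; it is not the suffix-tree method of the talk, and the function names are hypothetical.

```python
def count_occurrences(text, factor):
    """Number of (possibly overlapping) occurrences of factor in text."""
    return sum(1 for i in range(len(text) - len(factor) + 1)
               if text.startswith(factor, i))

def longest_repeated_suffix(word, text):
    """Longest suffix of word occurring at least twice in text ('' if none)."""
    for length in range(len(word), 0, -1):
        suffix = word[len(word) - length:]
        if count_occurrences(text, suffix) >= 2:
            return suffix
    return ""

x, y, alpha = "actaatat", "actatcat", "acta"
alpha_x = longest_repeated_suffix(alpha, x)
alpha_y = longest_repeated_suffix(alpha, y)
alpha_star = max(alpha_x, alpha_y, key=len)
print(alpha_x, alpha_y, alpha_star)  # ta a ta
```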
Example
x = ac ta at at and y = ac ta tc at (split as α(α*)^-1, α*, β or δ, γ)
Type 1 suffixes (suffixes starting in γ$)
$
t$
at$
Type 2 suffixes (suffixes starting in α*β)
tat$
atat$
aatat$
taatat$
Type 3 suffixes (suffixes starting in α*δ)
cat$
tcat$
atcat$
tatcat$
Type 4 suffixes (suffixes starting in α(α*)^-1)
cta(at/tc)at$
acta(at/tc)at$
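The four lists above can be regenerated from α, β, δ, γ and α*. The following Python sketch takes α* as an input rather than computing it, and the name `suffix_types` is hypothetical.

```python
def suffix_types(alpha, beta, delta, gamma, alpha_star):
    """List the suffixes of the alignment alpha(beta/delta)gamma by type.
    gamma is expected to end with the terminator '$'."""
    x = alpha + beta + gamma
    y = alpha + delta + gamma
    cut = len(alpha) - len(alpha_star)   # where alpha_star starts inside alpha
    type1 = [gamma[i:] for i in range(len(gamma))]
    type2 = [x[i:] for i in range(cut, len(alpha) + len(beta))]
    type3 = [y[i:] for i in range(cut, len(alpha) + len(delta))]
    type4 = [f"{alpha[i:]}({beta}/{delta}){gamma}" for i in range(cut)]
    return type1, type2, type3, type4

t1, t2, t3, t4 = suffix_types("acta", "at", "tc", "at$", "ta")
print(t1)  # ['at$', 't$', '$']
print(t2)  # ['taatat$', 'aatat$', 'atat$', 'tat$']
print(t3)  # ['tatcat$', 'atcat$', 'tcat$', 'cat$']
print(t4)  # ['acta(at/tc)at$', 'cta(at/tc)at$']
```

The alignment thus has 13 suffixes to index, against 18 for the two strings taken separately.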
Definition
Compact suffix tree with suffixes of types 1, 2, 3 and 4
Suffix tree of an alignment
Construction
1 insert the suffixes of x (type 1, 2 and 4 suffixes) → ST(x)
2 compute α*
3 insert the suffixes starting in α*δ (type 3 suffixes)
Use a doubling technique: check whether the suffix of α of length 1, 2, 4, 8, ... occurs more than once (represented by at most two leaves) [McCreight’76]
Let α(h) be the suffix of α of length h, where h is the first length tried by the doubling technique for which the suffix occurs only once in x; then h/2 ≤ |α_x| < h
Find α_x from α(h) using suffix links
Can be done in O(|α|) time
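The doubling idea can be sketched in Python as follows. Two assumptions are made loudly here: occurrences are tested with plain `str.find` instead of descending in the suffix tree, and the final refinement between h/2 and h is a linear scan instead of a walk along suffix links, so the sketch does not achieve the O(|α|) bound. The function names are hypothetical.

```python
def occurs_twice(text, factor):
    """True if the non-empty factor has at least two (possibly overlapping)
    occurrences in text."""
    first = text.find(factor)
    return first != -1 and text.find(factor, first + 1) != -1

def alpha_x_by_doubling(alpha, x):
    """Longest suffix of alpha occurring at least twice in x."""
    n = len(alpha)
    h = 1
    # Doubling phase: try lengths 1, 2, 4, 8, ... while the suffix repeats.
    while h <= n and occurs_twice(x, alpha[n - h:]):
        h *= 2
    # Now the answer has length at least h // 2 and less than h.
    best = h // 2
    for length in range(best + 1, min(h, n + 1)):
        if not occurs_twice(x, alpha[n - length:]):
            break
        best = length
    return alpha[n - best:]

print(alpha_x_by_doubling("acta", "actaatat"))  # ta
print(alpha_x_by_doubling("acta", "actatcat"))  # a
```

The scan can stop at the first failure because every suffix of a repeated suffix is itself repeated.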
Computation of α*
insert the suffixes of y starting in α_x δ → T′
T′ has all the information needed for computing α*
determine α* in T′ using the doubling technique
Theorem
The suffix tree of the alignment α(β/δ)γ$ of x = αβγ and y = αδγ can be computed in time O(|x| + |α*δγ′|), where γ′ is a prefix of γ satisfying some conditions.
Recall that the GST of x and y needs O(|x| + |αδγ|) time.
References
J. C. Na, H. Park, M. Crochemore, J. Holub, C. S. Iliopoulos, L. Mouchard, and K. Park. Suffix tree of an alignment: An efficient index for similar data. In T. L. and L. Mouchard, editors, Proceedings of the 24th International Workshop on Combinatorial Algorithms (IWOCA 2013), number 8288 in Lecture Notes in Computer Science, Rouen, France, 2013. Springer-Verlag, Berlin, to appear.
Suffix Array of an alignment
Definition
Indices of the suffixes of types 1, 2, 3 and 4, sorted in lexicographic order
Example
x = actaatat and y = actatcat
All 4 types of suffixes
$
t$
at$
tat$
atat$
aatat$
taatat$
cat$
tcat$
atcat$
tatcat$
cta(at/tc)at$
acta(at/tc)at$
Example
x = actaatat and y = actatcat
Suffix Array of the Alignment (SAA)
(0,8) $
(1,3) aatat$
(0,0) acta(at/tc)at$
(0,6) at$
(1,4) atat$
(2,3) atcat$
(2,5) cat$
(0,1) cta(at/tc)at$
(0,7) t$
(1,2) taatat$
(1,5) tat$
(2,2) tatcat$
(2,4) tcat$
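The table can be rebuilt by sorting the suffixes of the four types directly. The Python sketch below rests on a reading of the index pairs that is inferred from the example and not stated in the talk: (0, i) for a suffix shared by x and y, (1, i) for a suffix of x only, (2, i) for a suffix of y only, with i the starting position. Shared suffixes are sorted by their text in x, and the name `suffix_array_of_alignment` is hypothetical.

```python
def suffix_array_of_alignment(alpha, beta, delta, gamma, alpha_star):
    """Sorted index pairs of the alignment alpha(beta/delta)gamma.
    gamma must end with '$'; alpha_star is given, not computed."""
    x = alpha + beta + gamma
    y = alpha + delta + gamma
    cut = len(alpha) - len(alpha_star)
    entries = []
    for i in range(cut):                                 # type 4, shared
        entries.append((x[i:], (0, i)))
    for i in range(cut, len(alpha) + len(beta)):         # type 2, x only
        entries.append((x[i:], (1, i)))
    for i in range(cut, len(alpha) + len(delta)):        # type 3, y only
        entries.append((y[i:], (2, i)))
    for i in range(len(gamma)):                          # type 1, shared
        entries.append((gamma[i:], (0, len(alpha) + len(beta) + i)))
    entries.sort()                                       # '$' < letters in ASCII
    return [index for _, index in entries]

saa = suffix_array_of_alignment("acta", "at", "tc", "at$", "ta")
print(saa)
```

Positions of type 1 suffixes are given relative to x; in this example β and δ have the same length, so they coincide with the positions in y.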
α* is the longest of α_x and α_y, where
α_x is the longest suffix of α occurring at least twice in x
α_y is the longest suffix of α occurring at least twice in y
γ* is the longest of γ_x and γ_y, where
γ_x is the longest prefix of γ occurring at least twice in x
γ_y is the longest prefix of γ occurring at least twice in y
Suffix array of an alignment: construction
1 construct the GSA of x and α*δγ*d, where d is the symbol following γ* in γ
2 delete the suffixes of γ*d
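The two steps can be tried on the running example with the Python sketch below. Several things are assumptions of the sketch: the GSA is built by naive sorting, x is taken with its terminator `$`, pairs are (1, i) for position i in x and (2, i) for position i in the short string α*δγ*d, and the result is only checked on this example, not in general. The function names are hypothetical.

```python
def longest_repeated_prefix(word, text):
    """Longest prefix of word occurring at least twice in text ('' if none)."""
    for length in range(len(word), 0, -1):
        first = text.find(word[:length])
        if first != -1 and text.find(word[:length], first + 1) != -1:
            return word[:length]
    return ""

def saa_via_gsa(x, alpha_star, delta, gamma_star, d):
    """Step 1: sort the suffixes of x and of alpha_star+delta+gamma_star+d.
    Step 2: drop the suffixes of the second string starting in gamma_star+d."""
    second = alpha_star + delta + gamma_star + d
    keep = len(alpha_star) + len(delta)
    gsa = sorted([(x[i:], 1, i) for i in range(len(x))] +
                 [(second[i:], 2, i) for i in range(len(second))])
    return [(sid, pos) for _, sid, pos in gsa if sid == 1 or pos < keep]

x, y, gamma = "actaatat$", "actatcat$", "at$"
gamma_star = max(longest_repeated_prefix(gamma, x),
                 longest_repeated_prefix(gamma, y), key=len)
d = gamma[len(gamma_star)]            # symbol following gamma_star in gamma
result = saa_via_gsa(x, "ta", "tc", gamma_star, d)
print(gamma_star, d)                  # at $
print(result)
```

On this example the second string is tatcat$, of length 7 instead of 9 for y, and the surviving entries come out in the same order as the 13 suffixes of the alignment.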
symmetrically, compute |γ | by searching for γ in the suffix array of (α_x δ γ_x)
Suffix array of an alignment
Theorem
The suffix array of the alignment α(β/δ)γ$ of x = αβγ and y = αδγ can be computed in time O(|x| + |α*δγ*|).
Recall that the GSA of x and y needs O(|x| + |αδγ|) time.
J. C. Na, H. Park, S. Lee, M. Hong, T. L., L. Mouchard, and K. Park. Suffix array of alignment: A practical index for similar data. In O. Kurland, M. Lewenstein, and E. Porat, editors, Proceedings of the 20th International Symposium on String Processing and Information Retrieval (SPIRE 2013), number 8214 in Lecture Notes in Computer Science, pages 243–254, Jerusalem, Israel, 2013. Springer-Verlag, Berlin.
suffix tree of an alignment (done)
suffix array of an alignment (done)
FM-index of an alignment (in progress)
THANK YOU FOR YOUR ATTENTION!