Bit-parallel (γ,δ)-matching and suffix automata


HAL Id: hal-00619564

https://hal-upec-upem.archives-ouvertes.fr/hal-00619564

Submitted on 15 Jul 2013



Maxime Crochemore, Costas S. Iliopoulos, Gonzalo Navarro, Yoan Pinzon, Alejandro Salinger

To cite this version:

Maxime Crochemore, Costas S. Iliopoulos, Gonzalo Navarro, Yoan Pinzon, Alejandro Salinger. Bit-parallel (γ,δ)-matching and suffix automata. Journal of Discrete Algorithms, Elsevier, 2005, 3 (2-4), pp.198-214. ⟨10.1016/j.jda.2004.08.005⟩. ⟨hal-00619564⟩


Bit-parallel (δ, γ)-Matching and Suffix Automata ⋆

Maxime Crochemore a,b,1, Costas S. Iliopoulos b, Gonzalo Navarro c,2,3, Yoan J. Pinzon b,d,2, and Alejandro Salinger c

a Institut Gaspard-Monge, Université de Marne-la-Vallée, France. mac@univ-mlv.fr

b Dept. of Computer Science, King’s College, London, England. {csi,pinzon}@dcs.kcl.ac.uk

c Center for Web Research, Dept. of Computer Science, University of Chile, Chile. {gnavarro,asalinge}@dcc.uchile.cl

d Lab. de Cómputo Especializado, Univ. Autónoma de Bucaramanga, Colombia.

Abstract

(δ, γ)-matching is a string matching problem with applications to music retrieval. The goal is, given a pattern P_{1...m} and a text T_{1...n} on an alphabet of integers, to find the occurrences P′ of the pattern in the text such that (i) ∀ 1 ≤ i ≤ m, |P_i − P′_i| ≤ δ, and (ii) Σ_{1≤i≤m} |P_i − P′_i| ≤ γ. The problem makes sense for δ ≤ γ ≤ δm. Several techniques for (δ, γ)-matching have been proposed, based on bit-parallelism or on skipping characters. We first present an O(mn log(γ)/w) worst-case time and O(n) average-case time bit-parallel algorithm (where w is the number of bits in the computer word). It improves the previous O(mn log(δm)/w) worst-case time algorithm of the same type. Second, we combine our bit-parallel algorithm with suffix automata to obtain the first algorithm that skips characters using both δ and γ. This algorithm examines fewer characters than any previous approach, as the others do just δ-matching and check the γ-condition on the candidates. We implemented our algorithms and obtained experimental results on real music, showing that our algorithms are superior to current alternatives for high values of δ.

Key words: Bit-parallelism, approximate string matching, MIDI music retrieval.

⋆ A conference version of this paper appeared in [12].

1 Partly supported by CNRS and NATO.

2 Supported by CYTED VII.19 RIBIDI Project.

3 Funded by Millennium Nucleus Center for Web Research, Grant P01-029-F, Mideplan, Chile.


1 Introduction

The string matching problem is to find all the occurrences of a given pattern P1...m in a large text T1...n, both being sequences of characters drawn from a finite character set Σ. This problem is fundamental in computer science and is a basic need of many applications, such as text retrieval, music retrieval, computational biology, data mining, network security, etc. Several of these applications require, however, more sophisticated forms of searching, in the sense of extending the basic paradigm of the pattern being a simple sequence of characters.

In this paper we are interested in music retrieval. A musical score can be viewed as a string: at a very rudimentary level, the alphabet could simply be the set of notes in the chromatic or diatonic notation, or the set of intervals that appear between notes (e.g. pitch may be represented as MIDI numbers and pitch intervals as number of semitones). It is known that exact matching cannot be used to find occurrences of a particular melody, so one resorts to different forms of approximate matching, where a limited amount of differences of diverse kinds are permitted between the search pattern and its occurrence in the text.

The approximate matching problem has been used for a variety of musical applications [16,9,20,21,6]. Most computer-aided musical applications adopt an absolute numeric pitch representation (most commonly MIDI pitch and pitch intervals in semitones; duration is also encoded in numeric form). The absolute pitch encoding, however, may be insufficient for applications in tonal music as it disregards tonal qualities of pitches and pitch-intervals (e.g., a tonal transposition from a major to a minor key results in a different encoding of the musical passage and thus exact matching cannot detect the similarity between the two passages). One way to account for similarity between closely related but non-identical musical strings is to permit a difference of at most δ units between the pattern character and its corresponding text character in an occurrence, e.g., a C-major {60,64,65,67} and a C-minor {60,63,65,67} sequence can be matched if a tolerance δ = 1 is allowed in the matching process. Additionally, we require that the total number of differences across all the pattern positions does not exceed γ, in order to limit the total number of differences while keeping sufficient flexibility at individual positions.

The formalization of the above problem is called (δ, γ)-matching. The problem is defined as follows: the alphabet Σ is assumed to be a set of integer numbers, Σ ⊂ Z. Apart from the pattern P and the text T, two extra parameters, δ, γ ∈ N, are given. The goal is to find all the occurrences P′ of P in T such that (i) ∀ 1 ≤ i ≤ m, |P_i − P′_i| ≤ δ, and (ii) Σ_{1≤i≤m} |P_i − P′_i| ≤ γ. Note that the problem makes sense for δ ≤ γ ≤ δm: if γ < δ then the limit on the sum of differences is smaller than the limit on any single difference, so one should set δ ← γ; and if γ > δm then condition (i) implies (ii) and we should set γ ← δm.
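As a point of reference, the definition above can be checked directly with a brute-force scan. The following is only a minimal sketch in Python (the function name `delta_gamma_occurrences` and the 0-based start positions are choices made for this illustration, not part of the paper); it takes O(mn) time and assumes a non-empty pattern.

```python
def delta_gamma_occurrences(P, T, delta, gamma):
    """Return the 0-based start positions s such that T[s:s+m] (delta, gamma)-matches P."""
    m, n = len(P), len(T)
    occ = []
    for s in range(n - m + 1):
        diffs = [abs(p - t) for p, t in zip(P, T[s:s + m])]
        # condition (i): every difference is at most delta
        # condition (ii): the differences add up to at most gamma
        if max(diffs) <= delta and sum(diffs) <= gamma:
            occ.append(s)
    return occ


c_major = [60, 64, 65, 67]
c_minor = [60, 63, 65, 67]
print(delta_gamma_occurrences(c_major, c_minor, 1, 1))  # [0]
print(delta_gamma_occurrences(c_major, c_minor, 0, 0))  # []
```

The C-major and C-minor sequences of the introduction match with δ = γ = 1 and do not match exactly.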

Several recent algorithms exist to solve this problem. These can be classified as follows:

Bit-parallel: The idea is to take advantage of the intrinsic parallelism of the bit operations inside a computer word of w bits [1], so as to pack several values in a single word and manage to update them all in one shot. In [7,8] this approach was used to obtain Shift-Plus, an O(nm log(δm)/w) worst-case time algorithm. The algorithm packs m counters whose maximum value is mδ, hence it needs m⌈log₂(δm + 1)⌉ bits overall and O(m log(δm)/w) computer words have to be updated for each text character.

Occurrence heuristics: Inspired by Boyer-Moore techniques [5,22], they skip some text characters according to the position of some characters in the pattern. In [7], several algorithms of this type were proposed for δ-matching (a restricted case where γ = δm), and they were extended to general (δ, γ)-matching in [10]. The extension is done by checking the γ-condition on each candidate that δ-matches the pattern. These algorithms are Tuned-Boyer-Moore, Skip-Search and Maximal-Shift, each of which has a counterpart in exact string matching. These algorithms are faster than the bit-parallel ones, as they are simple and skip text characters.

Substring heuristics: Based on suffix automata [14,13], these algorithms skip text characters according to the position of some pattern substrings. In [10,11], three algorithms of this type, called δ-BM1, δ-BM2 and δ-BM3, are proposed. They try to generalize the suffix automata to δ-matching, but they obtain only an approximation that accepts more occurrences than necessary, and these have to be verified later. They also verify the γ-condition over each δ-matching candidate.

In this paper we present two new (δ, γ)-matching algorithms:

• We improve Shift-Plus in two aspects. First, we show that its worst case complexity can be reduced to O(nm log(γ)/w) by means of a more sophisticated counter management scheme that needs only ⌈1 + log₂(γ + 1)⌉ bits per counter. Second, we show how its average-case complexity can be reduced to O(n).

• We combine our bit-parallel algorithm with suffix automata, as already done with other string matching problems [18,19], so as to obtain the first algorithm able to skip text characters based on both the δ- and γ-conditions. All previous algorithms skip characters using the δ-condition only. Moreover, our suffix automaton accepts exactly the suffixes of strings that (δ, γ)-match our pattern, so no candidate verification is necessary at all. Our algorithm examines fewer characters than any previous technique.


The algorithms are very efficient and simple to implement. Our experimental results on real music data show that they improve previous work when δ is large (so that their dependence on γ rather than on δ shows up). For short patterns, of length up to 20, the character skipping algorithm is the best, otherwise our simple bit-parallel algorithm dominates.

2 Basic Concepts

In this section we present the concepts our paper builds on: bit-parallelism and suffix automata. We start by introducing some terminology.

A string x ∈Σ^∗ is a factor (or substring) of P if P can be written P =uxv, u, v ∈Σ^∗. A factor x of P is called a suffix (prefix) of P if P =ux (P =xu), u∈Σ^∗.

A bit mask of length r is simply a sequence of bits, denoted b_r ... b_1. We use exponentiation to denote bit repetition (e.g. 0³1 = 0001). The length of the computer word is w bits, so the mask of length r ≤ w is stored somewhere inside the computer word. Also, we write [x]_r to denote the binary representation of number x < 2^r using r bits. We also use C-like syntax for operations on the bits of computer words: “|” is the bitwise-or, “&” is the bitwise-and, and “∼” complements all the bits. The shift-left operation, “<<”, moves the bits to the left and enters zeros from the right, that is, b_m b_{m−1} ... b_2 b_1 << r = b_{m−r} ... b_2 b_1 0^r. Finally, we can perform arithmetic operations on the bits, such as addition and subtraction, which operate the masks as numbers. For instance, b_r ... b_x 10000 − 1 = b_r ... b_x 01111.
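The notation can be tried out with ordinary integers. The snippet below is only an illustrative sketch in Python: since Python integers are unbounded, a mask of length r is emulated by and-ing with `full = 2^r − 1`, and the helper `bits` (a name chosen here) prints [x]_r.

```python
def bits(x, r):
    """[x]_r: the binary representation of x using exactly r bits, as a string."""
    assert 0 <= x < (1 << r)
    return format(x, "0{}b".format(r))


r = 7
mask = 0b1011001
full = (1 << r) - 1                   # 1^r, used to keep results within r bits

print(bits(mask, r))                  # 1011001
print(bits((mask << 2) & full, r))    # 1100100  (shift-left enters zeros from the right)
print(bits(~mask & full, r))          # 0100110  (complement of all r bits)
print(bits(0b1010000 - 1, r))         # 1001111  (subtraction treats the mask as a number)
```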

2.1 Bit-Parallelism

In [2,24], a new approach to text searching was proposed. It is based on bit-parallelism [1], a technique that takes advantage of the intrinsic parallelism of the bit operations inside a computer word. By cleverly exploiting this fact, the number of operations that an algorithm performs can be cut down by a factor of at most w, the number of bits in the computer word. Since in current architectures w is 32 or 64, the speedup is very significant in practice.

The Shift-And algorithm [24] uses bit-parallelism to simulate the operation of a nondeterministic automaton that searches the text for the pattern (see Fig. 1). A plain simulation of that automaton takes time O(mn), and Shift-And achieves O(mn/w) worst-case time (optimal speedup).

The algorithm first builds a table B which for each character c ∈ Σ stores a bit mask B[c] = b_m ... b_1, so that b_i = 1 if and only if P_i = c. The state of the search is kept in a bit mask D = d_m ... d_1, where d_i = 1 whenever the state numbered i in Fig. 1 is active. That is, after having scanned text position j, we have d_i = 1 whenever P_{1...i} = T_{j−i+1...j}. Therefore, we report a match whenever d_m is set.

Fig. 1. A nondeterministic automaton to search a text for the pattern P = "abcdefg". The initial state is 0. (States 0 to 7 form a chain with transitions labeled a, b, c, d, e, f, g; state 0 has a self-loop labeled Σ.)

We start with D = 0^m and, for each new text character T_j, update D using the formula

D ← ((D << 1) | 0^{m−1}1) & B[T_j]

because each state may be activated by the previous state as long as T_j matches the corresponding arrow. The “| 0^{m−1}1” after the shift corresponds to the self-loop at the beginning of the automaton (as state 0 is not represented in D).

Seen another way, the i-th bit is set if and only if the (i−1)-th bit was set for the previous text character and the new text character matches the pattern at position i. In other words, T_{j−i+1...j} = P_{1...i} if and only if T_{j−i+1...j−1} = P_{1...i−1} and T_j = P_i.

The cost of this algorithm is O(n). For patterns longer than the computer word (m > w), the algorithm uses ⌈m/w⌉ computer words for the simulation, with a worst-case cost of O(mn/w). By managing to update only those computer words that have some active state, an average case cost of O(n) is achieved.
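The update formula translates almost literally into code. The following is a minimal sketch of Shift-And in Python, under the assumptions that bit i−1 of an integer stands for d_i, that positions are 0-based, and that a single unbounded Python integer plays the role of the computer word (so the multi-word handling discussed above is not modeled). The function name `shift_and` is chosen for this illustration.

```python
def shift_and(P, T):
    """Return the 0-based start positions of the exact occurrences of P in T."""
    m = len(P)
    B = {}
    for i, c in enumerate(P):
        B[c] = B.get(c, 0) | (1 << i)       # bit i set iff P[i] == c
    top = 1 << (m - 1)                      # bit of d_m
    D = 0
    occ = []
    for j, c in enumerate(T):
        # shift, re-activate the first state (self-loop of state 0), filter by B[T_j]
        D = ((D << 1) | 1) & B.get(c, 0)
        if D & top:
            occ.append(j - m + 1)
    return occ


print(shift_and("abcdefg", "xxabcdefgxabcdefg"))  # [2, 10]
```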

It is very easy to extend Shift-And to handle classes of characters, where each pattern position does not match just a single character but a set thereof. If C_i is the set of characters at position i in the pattern, then we set the i-th bit of B[c] for all c ∈ C_i.

2.2 Suffix Automata

We describe the BDM pattern matching algorithm [13,14], which is based on a suffix automaton. A suffix automaton on a pattern P_{1...m} is a deterministic finite automaton that recognizes the suffixes of P. The nondeterministic version of this automaton has a very regular structure (see Fig. 2).

The (deterministic) suffix automaton is well known [13]. Its size, counting both nodes and edges, is O(m), and it can be built in O(m) time [13]. A very important fact is that this automaton can also be used to recognize the factors of P: the automaton is active as long as we have read a factor of P.

Fig. 2. A nondeterministic suffix automaton for the pattern P = "abcdefg". Dashed lines represent ε-transitions. The initial state is I.

This structure is used in [13,14] to design a pattern matching algorithm called BDM, which is optimal on average (O(n log_{|Σ|}(m)/m) time on uniformly distributed text). To search a text T for P, the suffix automaton of P^r = P_m P_{m−1} ... P_1 (the pattern read backwards) is built. A window of length m is slid along the text, from left to right. The algorithm reads the window right to left and feeds the suffix automaton with the characters read. During this process, if a final state is reached, this means that the window suffix we have traversed is a prefix of P (because suffixes of P^r are reversed prefixes of P). Then we store the current window position in a variable last, possibly overwriting its previous value. The backward window traversal ends in two possible forms:

(1) We fail to recognize a factor, that is, we reach a character σ that does not have a transition in the automaton (see Fig. 3). In this case the window suffix read is not a factor of P and therefore it cannot be contained in any occurrence. We can actually shift the window to the right, aligning its starting position to last, which corresponds to the longest prefix of P seen in the window. We cannot miss an occurrence because in that case the suffix automaton would have found its prefix in the window.

(2) We reach the beginning of the window, therefore recognizing the pattern P. We report the occurrence, and shift the window exactly as in the previous case (we have the previous last value).

Fig. 3. Basic search with the suffix automaton. (Panels: search for a factor with the suffix automaton inside the window; failure to recognize a factor at σ; safe shift to the new window.)

It is possible to simulate the suffix automaton in nondeterministic form by using bit-parallelism [18,19], so as to obtain very efficient and simple algorithms.
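To make the window mechanics concrete, here is a minimal sketch in Python of the bit-parallel simulation of the nondeterministic suffix automaton for exact matching, in the spirit of [18,19]. It is an illustration under assumptions rather than the authors' code: positions are 0-based, one Python integer stands for the computer word, and the name `bndm` is chosen here. The variables `j` and `last` play the roles described above.

```python
def bndm(P, T):
    """Return the 0-based start positions of the exact occurrences of P in T."""
    m, n = len(P), len(T)
    full = (1 << m) - 1
    top = 1 << (m - 1)
    B = {}
    for i, c in enumerate(P):
        B[c] = B.get(c, 0) | (1 << (m - 1 - i))   # P_1 in the highest bit, P_m in the lowest
    occ = []
    pos = 0
    while pos <= n - m:
        j, last = m, m
        D = full                                  # every factor start is still possible
        while D != 0:
            D &= B.get(T[pos + j - 1], 0)         # read the window right to left
            j -= 1
            if D & top:                           # the window suffix read is a prefix of P
                if j > 0:
                    last = j                      # remember the leftmost prefix start
                else:
                    occ.append(pos)               # the whole window was recognized
            D = (D << 1) & full
        pos += last                               # safe shift
    return occ


print(bndm("aba", "ababa"))  # [0, 2]
```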


3 Improving the Bit-Parallel Algorithm

First of all, notice that δ-matching is trivial under the bit-parallel approach, as it can be accommodated using the ability to search for classes of characters. We define that pattern character c matches text characters c − δ ... c + δ. Hence, if B[c] = b_m ... b_1, we set b_i = 1 if and only if |P_i − c| ≤ δ. The rest of the algorithm is unchanged and the same complexities are obtained.
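A minimal sketch of this reduction in Python follows. The only change with respect to plain Shift-And is how the table B is filled. As assumptions of this illustration, the table is built only for the characters that actually occur in T (rather than for all of Σ), positions are 0-based, and the name `delta_shift_and` is hypothetical.

```python
def delta_shift_and(P, T, delta):
    """Return the 0-based start positions where T delta-matches P (integer sequences)."""
    m = len(P)
    B = {}
    for c in set(T):
        mask = 0
        for i, p in enumerate(P):
            if abs(p - c) <= delta:        # b_i = 1 iff |P_i - c| <= delta
                mask |= 1 << i
        B[c] = mask
    top = 1 << (m - 1)
    D = 0
    occ = []
    for j, c in enumerate(T):
        D = ((D << 1) | 1) & B[c]
        if D & top:
            occ.append(j - m + 1)
    return occ


print(delta_shift_and([60, 64, 65, 67], [59, 60, 63, 65, 67, 70], 1))  # [1]
```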

The real challenge is to do (δ, γ)-matching. The solution we present is an improvement over that of [7,8] and it has some resemblances with that of [3] for Hamming distance.

Let us focus for a moment on γ-matching alone. Instead of storing just one bit d_i to tell whether P_{1...i} matches T_{j−i+1...j}, we store a counter c_i to record the sum of the absolute differences between the corresponding characters. That is

$$c_i = \sum_{1 \le k \le i} |P_k - T_{j-i+k}| \tag{1}$$

and we wish to report text positions where c_m ≤ γ.

The next Lemma shows how to update the ci values for a new text position, and suggests an O(mn) time γ-matching algorithm.

Lemma 1. Assume that we want to compute the counters c_1 ... c_m according to Eq. (1) for text position j, and have computed c′_1 ... c′_m for position j−1. The c_i values satisfy

$$c_i = \sum_{1 \le k \le i} |P_k - T_{j-i+k}| = c'_{i-1} + |P_i - T_j|$$

assuming c′_0 = 0.

Proof. Immediate by substitution of c′_{i−1} according to Eq. (1). □

The update technique given in Lemma 1 is good for a bit-parallel approach.
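Before packing the counters into bit masks, the O(mn) algorithm suggested by Lemma 1 can be written with a plain array. This is only a sketch in Python under stated assumptions: the counters start at γ + 1 so that no match is reported before m characters have been read, positions are 0-based, and the name `gamma_match_counters` is chosen for the illustration.

```python
def gamma_match_counters(P, T, gamma):
    """Return the 0-based start positions where the sum of differences is at most gamma."""
    m = len(P)
    c = [gamma + 1] * (m + 1)     # c[1..m]: above gamma until enough text has been read
    c[0] = 0                      # c'_0 = 0 by convention
    occ = []
    for j, t in enumerate(T):
        # go from i = m down to 1 so that c[i - 1] still holds the value for position j - 1
        for i in range(m, 0, -1):
            c[i] = c[i - 1] + abs(P[i - 1] - t)
        if c[m] <= gamma:
            occ.append(j - m + 1)
    return occ


print(gamma_match_counters([1, 2], [1, 3, 2, 2, 0, 2], 1))  # [0, 2, 4]
```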

Let us assume that each c_i counter will be represented using ℓ bits, where ℓ will be specified later. Hence the state of the search will be expressed using the bit mask

$$D = [c_m]_\ell \, [c_{m-1}]_\ell \ldots [c_2]_\ell \, [c_1]_\ell. \tag{2}$$

We precompute a mask B[c] of counters [b_m]_ℓ ... [b_1]_ℓ, so that b_i = |P_i − c|. Then, the following Lemma establishes the bit-parallel formula to update D.

Lemma 2. Assume that we want to compute bit mask D according to Eq. (2) for text position j, and have computed D′ for position j−1. Then

$$D = (D' \mathbin{<<} \ell) + B[T_j]. \tag{3}$$

Proof. The i-th counter of D′ is c′_i. After the shift-left (“<<”) the i-th counter becomes c′_{i−1}. The i-th counter of B[T_j] is |P_i − T_j|. Hence the i-th counter of the right hand side of the equality is c′_{i−1} + |P_i − T_j|. According to Lemma 1, this is c_i. □

This gives us a solution for γ-matching. Start with D= ([γ+ 1]ℓ)^m (to avoid matching before reading Tm) and update it according to Eq. (3). Every time we have cm ≤ γ, report the last text position processed as the end of an occurrence.

In order to include δ-matching in the picture, we change slightly the definition of B[c]. The goal is that if, at any position, it holds |P_i − T_j| > δ, then we ensure that the corresponding occurrence is discarded. For this sake, it is enough to redefine B[c] = [b_m]_ℓ ... [b_1]_ℓ as follows:

$$b_i = \begin{cases} |P_i - c| & \text{if } |P_i - c| \le \delta, \\ \gamma + 1 & \text{otherwise.} \end{cases} \tag{4}$$

The next Lemma establishes the suitability of the above formulas for (δ, γ)-matching.

Lemma 3. If the update formula of Eq. (3) is applied and B[c] is defined according to Eq. (4), then after processing text position j it holds that c_m ≤ γ if and only if T_{j−m+1...j} (δ, γ)-matches P.

Proof. By Lemmas 1 and 2 and Eq. (1) we have that, if the original definition b_i = |P_i − c| is used, then c_m = Σ_{1≤k≤m} |P_k − T_{j−m+k}| after processing text position j. The only difference if the definition of Eq. (4) is used is that, if any of the |P_k − T_{j−m+k}| was larger than δ, then b_k > γ for B[T_{j−m+k}], and therefore c_k > γ after processing text position j−m+k. Since counters only increase as they get shifted and added in Eq. (3), that counter c_k at position j−m+k will become counter c_m at position j, without decreasing. Thus c_m > γ after processing text position j. Therefore c_m ≤ γ if and only if T_{j−m+1...j} (δ, γ)-matches P. □

Let us consider now the ℓ value. In principle, using B[c] as in Eq. (4), counter c_m can be as large as m(γ + 1), since b_i ≤ γ + 1 (recall that δ ≤ γ). However, recall that counter values never decrease as they get shifted over D. This means that, once they become larger than γ, we do not need to know how much larger they are. Thus, instead of storing the real c_i value, we would rather store min(c_i, γ + 1), and then need only ⌈log₂(γ + 2)⌉ bits per counter.

In principle, whenever c_i exceeds γ, we store γ + 1 for it. The problem is how to restore this invariant after adding B[c] to the counters, and also how to avoid overflows in that summation. We show now that we can handle both problems by using the following number of bits per counter:

$$\ell = 1 + \lceil \log_2(\gamma + 1) \rceil. \tag{5}$$

Thus, our bit mask D needs mℓ = m(1 + ⌈log₂(γ + 1)⌉) bits and our simulation needs O(m log(γ)/w) computer words.

Instead of representing counter c_i as [c_i]_ℓ, we represent it as

$$c_i \longrightarrow [c_i + 2^{\ell-1} - (\gamma + 1)]_\ell. \tag{6}$$

This guarantees that the highest bit of the counter will be set if and only if c_i ≥ γ + 1, as its representation will be ≥ 2^{ℓ−1}.

Before adding B[T_j], we will record all those highest bits in a bit mask H = D & (10^{ℓ−1})^m, and clear those highest bits from D. Once its highest bit is cleared, every counter representation is smaller than 2^{ℓ−1} and we can safely add b_i without overflowing the counters, since the resulting value is at most (2^{ℓ−1} − 1) + (γ + 1) = 2^{ℓ−1} + γ ≤ 2^ℓ − 1. The last inequality holds because Eq. (5) guarantees γ + 1 ≤ 2^{ℓ−1}, and 2^ℓ − 1 is the largest value an ℓ-bit counter can hold. After adding B[T_j] we restore those highest bits set in H.

Note that it is not strictly true that we maintain min(ci, γ+ 1), but it is true that the highest bit of the representation of ci is set if and only if ci > γ, and this is enough for the correctness of the algorithm. The next Lemma establishes this correctness.
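A small numeric illustration of Eqs. (5) and (6) may help. The sketch below, in Python, uses two helper names chosen here (`counter_width` and `represent`); it relies on the fact that ⌈log₂(γ + 1)⌉ equals the bit length of γ for any integer γ ≥ 0, so no floating point is needed.

```python
def counter_width(gamma):
    """Number of bits per counter, l = 1 + ceil(log2(gamma + 1)), as in Eq. (5)."""
    return 1 + gamma.bit_length()


def represent(c, gamma):
    """Stored value for a counter c <= gamma + 1, following Eq. (6)."""
    l = counter_width(gamma)
    return c + (1 << (l - 1)) - (gamma + 1)


gamma = 5
l = counter_width(gamma)          # 4 bits per counter
top = 1 << (l - 1)                # highest bit of a counter
for c in range(gamma + 2):
    r = represent(c, gamma)
    print(c, format(r, "04b"), bool(r & top))
# the last two lines printed are "5 0111 False" and "6 1000 True"
```

For γ = 5 the offset is 2^{ℓ−1} − (γ + 1) = 2, so the highest bit turns on exactly when the counter reaches γ + 1 = 6.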

Lemma 4. Assume that c_i is represented as in Eq. (6) if c_i ≤ γ, and as [2^{ℓ−1} + x]_ℓ otherwise, for some x ≥ 0. Then, if the update formula of Lemma 2 is applied with the exception that the highest bits set in the counters are removed before and restored after adding B[T_j], then it holds that the representation is maintained after processing T_j.

Proof. If c_i already exceeded γ before adding b_i, it will exceed γ after adding b_i. In this case, the representation of c_i was 2^{ℓ−1} + x and thus it already had its highest bit set. This bit will be restored after adding b_i. Thus, regardless of which value it actually stores, the representation will correctly maintain its highest bit set, that is, it will be of the form 2^{ℓ−1} + x for some x ≥ 0.

On the other hand, if c_i did not exceed γ before adding b_i, then its representation was c_i + 2^{ℓ−1} − (γ + 1) and the highest bit was not set. Thus the manipulation of highest bits will not affect its result. After the summation the representation will hold c_i + b_i + 2^{ℓ−1} − (γ + 1). This is a correct representation for the new value c_i + b_i, whether c_i + b_i ≤ γ or c_i + b_i > γ, as in the latter case the representation is of the form 2^{ℓ−1} + x, where x = c_i + b_i − (γ + 1) ≥ 0. □

Fig. 4 depicts the algorithm. It is called Forward-Scan to distinguish it from our next algorithms that scan windows backward. The preprocessing consists of computing ℓ according to Eq. (5) and table B according to Eq. (4). Pattern P is processed backwards so as to arrange B[c] in the right order [b_m]_ℓ ... [b_1]_ℓ. Line 10 initializes the search by setting c_i = γ + 1 in D, according to the representation of Eq. (6). Occurrences are reported in lines 12–13, whenever c_m ≤ γ, that is, the highest bit of the representation of c_m is not set. Line 14 is the equivalent to D ← D << ℓ, except that the counter c_0 = 0 that is moved to the position of c_1 must be represented as 2^{ℓ−1} − (γ + 1) according to Eq. (6). Line 15 computes H as explained, to record the highest bits. Line 16 completes the computation of Eq. (3), by removing bits set in H from D and restoring them after the summation with B[T_j].

Assuming that the bit masks fit in a computer word, that is, mℓ ≤ w, the algorithm complexity is O(m|Σ| + n). If several computer words are needed, the search complexity becomes O(mn log(γ)/w). However, we defer the details of handling longer bit masks to Section 5, as it is possible to obtain O(n) search time on average.

Forward-Scan (P_{1...m}, T_{1...n}, δ, γ)
1.   Preprocessing
2.      ℓ ← 1 + ⌈log₂(γ + 1)⌉
3.      for c ∈ Σ do
4.         B[c] ← ([0]_ℓ)^m
5.         for i ∈ m ... 1 do
6.            if |c − P_i| ≤ δ then
7.               B[c] ← (B[c] << ℓ) | (|c − P_i|)
8.            else B[c] ← (B[c] << ℓ) | (γ + 1)
9.   Search
10.     D ← (10^{ℓ−1})^m
11.     for j ∈ 1 ... n do
12.        if D & 10^{mℓ−1} = 0^{mℓ} then
13.           Report an occurrence at j − m + 1
14.        D ← (D << ℓ) | (2^{ℓ−1} − (γ + 1))
15.        H ← D & (10^{ℓ−1})^m
16.        D ← ((D & ∼H) + B[T_j]) | H

Fig. 4. Bit-parallel algorithm for (δ, γ)-matching. Constant values are precomputed.
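The scheme of Fig. 4 can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' implementation: an unbounded Python integer stands for the computer word (so the restriction mℓ ≤ w is not modeled and the shift is masked to mℓ bits explicitly), B is built only for the characters occurring in T, δ is clamped to γ as the introduction recommends, and positions are 0-based. Unlike Fig. 4, the sketch tests the highest bit right after updating D for T_j, so that an occurrence ending at the last text character is also reported. The function name `forward_scan` is chosen for this illustration.

```python
def forward_scan(P, T, delta, gamma):
    """Return the 0-based start positions where T (delta, gamma)-matches P."""
    m = len(P)
    delta = min(delta, gamma)              # the problem only makes sense for delta <= gamma
    l = 1 + gamma.bit_length()             # Eq. (5)
    high = 1 << (l - 1)                    # 10^(l-1): highest bit of one counter
    zero = high - (gamma + 1)              # representation of the value 0, Eq. (6)
    HM = 0
    for _ in range(m):
        HM = (HM << l) | high              # (10^(l-1))^m
    full = (1 << (m * l)) - 1
    top = 1 << (m * l - 1)                 # highest bit of counter c_m

    def table_entry(c):
        mask = 0
        for p in reversed(P):              # P_m first, so that b_1 ends in the lowest counter
            d = abs(c - p)
            mask = (mask << l) | (d if d <= delta else gamma + 1)   # Eq. (4)
        return mask

    B = {c: table_entry(c) for c in set(T)}

    D = HM                                 # every counter starts at gamma + 1
    occ = []
    for j, c in enumerate(T):
        D = ((D << l) & full) | zero       # shift the counters, c_0 = 0 enters from the right
        H = D & HM                         # remember which counters already exceed gamma
        D = ((D & ~H) + B[c]) | H          # add without overflow, then restore the flags
        if not (D & top):                  # c_m <= gamma
            occ.append(j - m + 1)
    return occ


print(forward_scan([60, 64, 65, 67], [59, 60, 63, 65, 67, 70], 1, 1))  # [1]
```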


4 Using Suffix Automata

As demonstrated in [18,19], the suffix automaton approach of Section 2.2 can be extended to search for more complex patterns by combining it with bit-parallelism. In this section we combine our bit-parallel approach of Section 3 with the suffix automaton concept to obtain an algorithm that does not inspect all the text characters.

Imagine that we process a text window T_{pos+1...pos+m} right to left. Our goal is that, after having processed T_{pos+j}, we have computed

$$c_i = \sum_{0 \le k \le m-j} |P^r_{i-(m-j)+k} - T_{\mathit{pos}+m-k}| = \sum_{0 \le k \le m-j} |P_{2m+1-i-j-k} - T_{\mathit{pos}+m-k}| \tag{7}$$

for m−j+1 ≤ i ≤ m. This can be obtained by initializing c_i = 0 before starting processing the window and then updating the c_i values according to the following Lemma.

Lemma 5. Assume that we have values c′_i computed for T_{pos+j+1} according to Eq. (7). Then values c_i for T_{pos+j} satisfy

$$c_i = c'_{i-1} + |P^r_i - T_{\mathit{pos}+j}|.$$

Proof. It is immediate by rewriting c′_{i−1} according to Eq. (7). □

If we maintain values c_i computed according to Eq. (7), then, after processing T_{pos+j}, (i) if c_m ≤ γ, then Σ_{0≤k≤m−j} |P_{1+m−j−k} − T_{pos+m−k}| ≤ γ, that is, P_{1...m−j+1} γ-matches window suffix T_{pos+j...pos+m}; (ii) if c_i > γ for all m−j+1 ≤ i ≤ m, then the window suffix T_{pos+j...pos+m} does not γ-match any pattern substring P_{m−i+1...(m−i+1)+m−j}, and therefore no occurrence can contain T_{pos+j...pos+m}.

Therefore, a BDM-like algorithm would be as follows. Process text window T_{pos+1...pos+m} by reading it right to left and maintaining c_i values. Every time c_m ≤ γ, mark the current window position last so as to remember the last time a window suffix γ-matched a pattern prefix. If, at some moment, c_i > γ for all i, then shift the window to start at position last and restart. If all the window is traversed and still c_m ≤ γ, then report the window as an occurrence and also shift it to start at position last. The correctness of this scheme should be obvious from Section 2.2.

A bit-parallel computation of the c_i values is very similar to the one developed in Section 3, as the update formulas of Lemmas 1 and 5 are so close. In order to work on P^r, we simply store B[c] in reverse fashion. Vector c_i is initialized at c_i = 0 according to Eq. (7). To determine whether c_m ≤ γ we simply test the highest bit of its representation. To determine whether c_i > γ for all i we test all highest bits simultaneously. To account also for δ-matching we change the preprocessing of B[c] just as in Eq. (4).

Fig. 5 depicts the algorithm, called “backward-scanning” because of the way windows are processed. The preprocessing is identical to Fig. 4 except that the pattern is processed left to right. D is initialized in line 13 with c_i = 0 considering the representation of Eq. (6). Line 14 continues processing the window as long as c_i ≤ γ for some i. The update to D is as in Fig. 4, except that the first shift left (“<<”) of each window is omitted to avoid losing the first c_1 value. Condition c_m ≤ γ is tested in line 18. When it holds, we update last unless we have processed all the window, in which case it means that we found an occurrence and also must maintain the previous last value. Line 21 shifts D and introduces values γ + 1 from the right, to ensure that the relevant i values are m−j+1 ≤ i ≤ m and that the loop will terminate after m iterations. Finally, the window is shifted by last.

Backward-Scan (P_{1...m}, T_{1...n}, δ, γ)
1.   Preprocessing
2.      ℓ ← 1 + ⌈log₂(γ + 1)⌉
3.      for c ∈ Σ do
4.         B[c] ← ([0]_ℓ)^m
5.         for i ∈ 1 ... m do
6.            if |c − P_i| ≤ δ then
7.               B[c] ← (B[c] << ℓ) | (|c − P_i|)
8.            else B[c] ← (B[c] << ℓ) | (γ + 1)
9.   Search
10.     pos ← 0
11.     while pos ≤ n − m do
12.        j ← m, last ← m
13.        D ← ([2^{ℓ−1} − (γ + 1)]_ℓ)^m
14.        while D & (10^{ℓ−1})^m ≠ (10^{ℓ−1})^m do
15.           H ← D & (10^{ℓ−1})^m
16.           D ← ((D & ∼H) + B[T_{pos+j}]) | H
17.           j ← j − 1
18.           if D & 10^{mℓ−1} = 0^{mℓ} then
19.              if j > 0 then last ← j
20.              else Report an occurrence at pos + 1
21.           D ← (D << ℓ) | 10^{ℓ−1}
22.        pos ← pos + last

Fig. 5. Backward scanning algorithm for (δ, γ)-matching. Constant values are precomputed.

Note that, given the invariants we maintain, we can report occurrences without any further verification. Moreover, we shift the window as soon as the window suffix read does not (δ, γ)-match a pattern substring. This is the first character-skipping algorithm with these properties. Previous ones only approximate this property and require verifying candidate occurrences. Consequently, we inspect fewer characters than previous algorithms.
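The backward-scanning scheme of Fig. 5 can be sketched in Python along the same lines. As before, this is an illustration under assumptions, not the authors' code: a Python integer stands for the computer word, the shift is masked to mℓ bits, B covers only the characters present in T, δ is clamped to γ, and positions are 0-based (the window read is `T[pos : pos + m]`, and `T[pos + j - 1]` corresponds to T_{pos+j}). The name `backward_scan` is chosen here.

```python
def backward_scan(P, T, delta, gamma):
    """Return the 0-based start positions where T (delta, gamma)-matches P."""
    m, n = len(P), len(T)
    delta = min(delta, gamma)
    l = 1 + gamma.bit_length()             # Eq. (5)
    high = 1 << (l - 1)                    # 10^(l-1), also the representation of gamma + 1
    zero = high - (gamma + 1)              # representation of the value 0, Eq. (6)
    HM = D0 = 0
    for _ in range(m):
        HM = (HM << l) | high              # (10^(l-1))^m
        D0 = (D0 << l) | zero              # all counters at 0
    full = (1 << (m * l)) - 1
    top = 1 << (m * l - 1)

    def table_entry(c):
        mask = 0
        for p in P:                        # P_1 first: the table is stored for the reversed pattern
            d = abs(c - p)
            mask = (mask << l) | (d if d <= delta else gamma + 1)
        return mask

    B = {c: table_entry(c) for c in set(T)}

    occ = []
    pos = 0
    while pos <= n - m:
        j, last = m, m
        D = D0
        while (D & HM) != HM:              # some counter is still <= gamma
            H = D & HM
            D = ((D & ~H) + B[T[pos + j - 1]]) | H
            j -= 1
            if not (D & top):              # window suffix matches a pattern prefix
                if j > 0:
                    last = j
                else:
                    occ.append(pos)
            D = ((D << l) & full) | high   # gamma + 1 enters from the right
        pos += last
    return occ


print(backward_scan([1, 2], [1, 3, 2, 2, 0, 2], 1, 1))  # [0, 2, 4]
```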

5 Handling Longer Patterns

Both algorithms presented are limited by the length of the computer word. They work for m(1 + ⌈log₂(γ + 1)⌉) ≤ w. However, in most cases this condition is not fulfilled, so we must handle longer patterns.

5.1 Active Computer Words

The first idea is just to use as many computer words as needed to represent D. In each computer word we store the maximum amount of counters that fully fit into w bits. So we keep κ = ⌊w/ℓ⌋ counters in each word (except the last one, which may be underfilled), needing ⌈m/⌊w/ℓ⌋⌉ words to represent D. Each time we update D we have to process all the words simulating the bit-parallel operations. With this approach, the forward scanning takes time O(nm log(γ)/w). We remark that previous forward scanning algorithms [7,8] required O(nm log(mδ)/w) time, which is strictly worse than our complexity. The difference is that we have managed to keep the counters below 2(γ + 1) instead of letting them grow up to mδ. This alternative is called simply “Forward” in the experiments.

A key improvement can be made to the forward scanning algorithm by noticing that it is not necessary to update all the computer words at each iteration. As counter c_i stores the sum of all the differences between characters P_{1...i} and their corresponding characters in the text (Eq. (1)), depending on the values of δ and γ and on the size of the alphabet, most of the time the highest counters will have surpassed γ. Once a counter surpasses γ we only require that it stays larger than γ (recall Section 3 and Lemma 4), so it is not even necessary to update it. Let us say that a computer word is active when at least one of its counters stores some c_i ≤ γ. The improvement works as follows:

• At each iteration of the algorithm we update the computer words only until the one we have marked as the last active word.

• As we update each word we check whether it is active or not, remembering the new last active word.

• Finally, we check if the last counter of the last active word is ≤ γ. In that case, the word that follows the last active word must be the new last active word, as in the next iteration its first counter may become less than γ + 1, and hence we may need to process it.

This algorithm has the same worst-case complexity as the basic one, but the average case is significantly improved. Consider a random text and pattern following a zero-order model (that is, character probabilities are independent of the neighbor characters). Say that p_s is the probability of the s-th character of the alphabet. Then the probability that P_i δ-matches a random text character is π_i = Σ_{P_i−δ ≤ s ≤ P_i+δ} p_s. The probability of P_{1...i} δ-matching T_{j−i+1...j} for a random text position j is w_i = Π_{1≤k≤i} π_k.

The first computer word will be always active; the second will be active only if P_{1...κ} matches T_{j−κ+1...j}; the third will be active only if P_{1...2κ} matches T_{j−2κ+1...j}; and so on. Hence the average number of computer words active at a random text position is at most

$$1 + w_\kappa + w_{2\kappa} + \ldots = \sum_{i \ge 0} w_{i\kappa}$$

and this is O(1) provided π = max_{1≤i≤m} π_i < 1, as in this case w_i ≤ π^i and the average number of active words is Σ_{i≥0} w_{iκ} ≤ Σ_{i≥0} π^{iκ} = 1/(1 − π^κ). ⁴ Hence, as we update O(1) computer words on average, the average search time is O(n). Note that this holds even without considering γ, which in practice reduces the constant factor. The lower the values of δ and γ are, the better the performance will be. This alternative is called “Forward last” in the experiments.
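The bound can be evaluated numerically for a given zero-order model. The following Python sketch is an illustration only: the function name `expected_active_words`, the dictionary `probs` mapping each integer symbol to its probability, and the example distribution (uniform over ten symbols) are all assumptions made here, not data from the paper. It sums w_{tκ} over the words that actually exist for a pattern of length m and compares the result with 1/(1 − π^κ).

```python
def expected_active_words(P, probs, delta, kappa):
    """Return (upper bound on the average number of active words, 1 / (1 - pi^kappa))."""
    # pi_i: probability that P_i delta-matches a random text character
    pis = [sum(q for s, q in probs.items() if abs(s - p) <= delta) for p in P]
    w = 1.0
    total = 1.0                          # w_0 = 1: the first word is always active
    for i, pi in enumerate(pis, start=1):
        w *= pi                          # w_i = pi_1 * ... * pi_i
        if i % kappa == 0 and i < len(P):
            total += w                   # word number i / kappa + 1 needs P_1..i to match
    pi_max = max(pis)
    bound = 1.0 / (1.0 - pi_max ** kappa) if pi_max < 1 else float("inf")
    return total, bound


uniform = {s: 0.1 for s in range(10)}    # hypothetical alphabet {0, ..., 9}
total, bound = expected_active_words([5] * 8, uniform, 1, 4)
print(round(total, 4), round(bound, 4))  # 1.0081 1.0082
```

With π = 0.3 and κ = 4, the second word is expected to be active at fewer than one text position in a hundred.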

Yet another improvement can be made to this algorithm by combining it with the basic single-word algorithm. We store the first word used to represent D in a register and run the algorithm using just this word, as long as this one is the only active word. Whenever the second word becomes active, we switch to the multiple word algorithm. We switch back to the basic algorithm whenever the first word becomes the last active word. The use of a register to store the first word yields a better performance, as we have to make fewer memory accesses. The longer the first word remains the only active word, the more significant the improvement is. This alternative is called “Forward register” in the experiments.

Unfortunately this idea cannot be applied to the backward-scanning algorithm, as in this one we will have counters ≤ γ uniformly distributed across all the computer words. This happens because $c_i \le \gamma$ after reading $T_{pos+j}$ if $P_{m-i+1\ldots(m-i+1)+m-j}$ (δ, γ)-matches $T_{pos+j\ldots pos+m}$ (Eq. (7)), and this probability does not necessarily decrease with i (actually it is independent of i on a uniform distribution). The plain multi-word backward scanning algorithm is called simply “Backward” in the experiments.

⁴ This also holds if there are O(1) values of i such that $\pi_i = 1$.

5.2 Pattern Partitioning

Another idea to handle long patterns is to partition them into pieces short enough to be handled with the basic algorithm. Notice that if P (δ, γ)-matches $T_{j-m+1\ldots j}$, and we partition P into j disjoint pieces of length ⌊m/j⌋ and ⌈m/j⌉, then at least one piece has to (δ, γ′)-match its corresponding substring of $T_{j-m+1\ldots j}$, where γ′ = ⌊γ/j⌋. The reason is that, otherwise, each piece adds up at least γ′ + 1 differences, and the total is at least j(γ′ + 1) = j(⌊γ/j⌋ + 1) > j(γ/j) = γ, and then γ-matching is not possible.

Hence we run j (δ, γ′)-searches for shorter patterns and check every match of a piece for a complete occurrence. The check is simple and does not even need bit-parallelism. Note that if δ > γ′, we can actually do (γ′, γ′)-matching.
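The filter-and-verify scheme can be sketched as below. This is a minimal illustration, assuming j ≤ m so that no piece is empty; the naive scan of each piece stands in for the bit-parallel search, and all function names are hypothetical.

```python
def dg_match(P, W, delta, gamma):
    """True if P (delta, gamma)-matches the equal-length window W."""
    diffs = [abs(p - t) for p, t in zip(P, W)]
    return max(diffs) <= delta and sum(diffs) <= gamma


def split_pattern(P, j):
    """Split P into j pieces of length floor(m/j) or ceil(m/j), as (offset, piece) pairs."""
    base, extra = divmod(len(P), j)
    pieces, start = [], 0
    for k in range(j):
        length = base + (1 if k < extra else 0)
        pieces.append((start, P[start:start + length]))
        start += length
    return pieces


def partitioned_search(P, T, delta, gamma, j):
    """Return the sorted 0-based start positions of (delta, gamma)-matches of P in T."""
    m, n = len(P), len(T)
    gamma_p = gamma // j                 # gamma' = floor(gamma / j)
    delta_p = min(delta, gamma_p)        # (gamma', gamma')-matching when delta > gamma'
    found = set()
    for offset, piece in split_pattern(P, j):
        for s in range(n - len(piece) + 1):      # naive stand-in for the piece search
            if not dg_match(piece, T[s:s + len(piece)], delta_p, gamma_p):
                continue
            start = s - offset                   # where the whole pattern would begin
            if 0 <= start <= n - m and dg_match(P, T[start:start + m], delta, gamma):
                found.add(start)
    return sorted(found)
```

Because every candidate is verified against the whole pattern, the pieces only act as a filter: a looser piece search costs time in verification but never changes the reported occurrences.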

We must choose the smallest j such that

$$\lceil m/j \rceil \, \bigl(1 + \lceil \log_2(\lfloor \gamma/j \rfloor + 1) \rceil\bigr) \le w$$

and hence we perform j = O(m log(γ)/w) searches. For forward scanning, each such search costs O(n). Piece verification time is negligible on average.

Hence the average search time of this approach is O(nm log(γ)/w), which is not attractive compared to the worst-case search time of the basic approach.

However, each of these searches can be made using registers for D, so in practice it could be relevant. It could also be relevant for backward matching, where using D in registers is not possible for long patterns.
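The choice of j from the inequality above can be sketched as follows. Since the left-hand side does not increase as j grows, the sketch scans j = 1, 2, ... and returns the first value for which a piece fits in a word of w bits, which gives the fewest searches. The function names are hypothetical.

```python
def ceil_log2(x):
    """Ceiling of log2(x) for an integer x >= 1."""
    return (x - 1).bit_length()


def choose_pieces(m, gamma, w):
    """Smallest j with ceil(m/j) * (1 + ceil(log2(floor(gamma/j) + 1))) <= w."""
    for j in range(1, m + 1):
        piece_len = -(-m // j)                       # ceil(m / j)
        bits_per_counter = 1 + ceil_log2(gamma // j + 1)
        if piece_len * bits_per_counter <= w:
            return j
    raise ValueError("no partition fits in a word of w bits")
```

For m = 20, γ = 30 and w = 32, the products for j = 1, 2, 3 are 120, 50 and 35, all above 32, while j = 4 gives 5 · 4 = 20, so four pieces are used.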

Furthermore, the pieces can be grouped and searched for together using so-called “superimposition” [4,17]. By making groups of r pieces each, we perform ⌈j/r⌉ searches. For each search, counter $b_i$ of B[c] will store the minimum difference between c and the i-th character of any piece in the group, or γ′ + 1 if none of these differences is smaller than γ′ + 1. Every time we find a match of the whole group we check the occurrence of each of the substrings forming that group. For all the pieces that matched we check the occurrence of the whole pattern at the corresponding position. The greater r is, the fewer searches we perform, but the more time we spend checking occurrences. The time spent in checking occurrences also increases with δ and γ. Because of this, the optimum r depends on δ, γ and m.
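The superimposed table entry for one text character can be sketched as follows, with the counters kept as a plain list rather than packed into a word. The sketch assumes that all pieces in the group have the same length and that differences above δ are treated as "no match", as in the basic table; both are assumptions of this example, and the function name is hypothetical.

```python
def superimposed_counters(pieces, c, delta, gamma_p):
    """Counters b_1..b_len of B[c] for a group of equal-length pieces."""
    cap = gamma_p + 1                    # value stored when no piece is close enough
    counters = []
    for i in range(len(pieces[0])):
        best = cap
        for piece in pieces:
            d = abs(c - piece[i])
            if d <= delta:               # assumption: differences above delta do not count
                best = min(best, d)
        counters.append(best)
    return counters
```

With the pieces [60, 62] and [64, 65], δ = 2 and γ′ = 2, the character 63 yields the counters [1, 1], since it is within distance 1 of 64 in the first position and of 62 in the second. A group match therefore does not say which piece matched, which is why each piece of the group has to be checked afterwards.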

These algorithms are called “Forward superp” and “Backward superp” in the experiments. These include the case r = 1, where no superimposition is done.


6 Experimental Results

In this section we show experimental evidence comparing the different versions of our algorithms against δ-BM2 [10,11], which is the most efficient alternative (δ, γ)-matching algorithm.

The tests were performed using a Pentium IV, 2 GHz, 512 MB RAM and 512 KB cache running SuSE Linux with w = 32. We used the GNU gcc compiler version 2.95.3. Each data point represents the median of 100 trials.

We ran our experiments using real music data obtained from a database of MIDI files of classical music, totaling 10.5 MB of absolute pitches. We focused on typical parameter values for music searching, namely 2–4 for δ, 1.5m–2.0m for γ, and 10–200 for m.

The results for forward algorithms are shown in Fig. 6. The variants are called “Forward” (plain multiword forward), “Forward last” (same but updating only up to the last relevant word), “Forward register” (same but switching to single-word mode when possible), and “Forward superp” (partition plus superimposing in the way that gives the best results).

As expected, Forward and Forward-superp are the slowest and their cost grows linearly with m. Forward-superp shows a better constant factor and it is attractive for very short patterns, but soon its linear dependence on m renders it useless. Superimposition alleviates this only partially, as the optimum was to superimpose 2 to 7 patterns for δ = 2 and 2 to 5 for δ = 4. These numbers grow slowly as m increases and stay at a maximum of 5 or 6, making the whole scheme linear in m anyway.

Forward-last and Forward-register, on the other hand, display their O(n) average-case time, independent of m. As expected, Forward-register is by far the best. We will consider only this alternative to compare against backward algorithms.

Fig. 7 compares backward algorithms (which include the relevant competing alternatives) and Forward-register. The backward algorithm only has variants “Backward” (plain single- or multi-word, as needed) and “Backward superp” (partition plus superimposing in the best possible way). δ-BM2 is the best existing alternative algorithm.

We observe that partitioning (including superimposition) is also a bad choice for backward scanning. The reasons are the same as for the forward version. In general, backward searching does not behave competitively when many computer words are involved. Backward was better than Forward-register when the whole (superimposed) representation fit in a single computer word. As more than a single word is necessary, Forward-register becomes superior. The reason is that backward searching needs to effectively update all its computer words, while the forward versions do so only for a few active computer words.

Fig. 6. Timing figures for forward algorithms (Forward, Forward last, Forward register, Forward superp), in seconds per megabyte, as a function of m (0 to 200). Panels: (a) δ = 2 and γ = 1.5m; (b) δ = 4 and γ = 1.5m; (c) δ = 2 and γ = 2m; (d) δ = 4 and γ = 2m.

With respect to the competing algorithm, it can be seen that δ-BM2 is faster than ours for small δ = 2, but as we use a larger δ = 4 it is no longer competitive, as can be expected from its δ-only filtration scheme. Our algorithms are the only ones that can filter using δ and γ simultaneously.

Finally, we notice that the dependence on δ is significant to the extent that going from δ = 2 to δ = 4 can double the search time. The dependence on γ, on the other hand, is not very significant. We note, however, that Forward-register is rather insensitive to both δ and γ, becoming a strong and stable choice for general (δ, γ)-matching.

Fig. 7. Timing figures for backward algorithms and the best forward algorithm (Forward register, Backward, Backward superp, BM2), in seconds per megabyte, as a function of m (0 to 200). Panels: (a) δ = 2 and γ = 1.5m; (b) δ = 4 and γ = 1.5m; (c) δ = 2 and γ = 2m; (d) δ = 4 and γ = 2m.

7 Conclusions

We have presented new bit-parallel algorithms for (δ, γ)-matching, an extended string matching problem with applications in music retrieval. Our new algorithms make use of bit-parallelism and suffix automata and have several advantages over the previous approaches: they make better use of the bits of the computer word, they inspect fewer text characters, and they are simple, extensible, and robust.

Especially important is that our algorithms are the first truly (δ, γ) character-skipping algorithms, as they skip characters using both criteria. Existing approaches do just δ-matching and check the candidates for the γ-condition.

This makes our algorithms a stronger and more stable choice for this problem.

We have also presented several ideas to handle longer patterns, as the algorithms are limited by the length of the computer word. The fastest choice is an algorithm that uses several computer words and updates only those that hold relevant values, switching to single-word mode when possible.


We have shown that our algorithms are the best choice in practice when δ is not small enough to make up a good filter by itself. In this case, the ability of our algorithms to filter with γ at the same time becomes crucial.

We plan to investigate further the more sophisticated matching problems that arise in music retrieval. For example, it would be good to extend (δ, γ)-matching in order to permit insertions and deletions of symbols, as well as transposition invariance. Bit-parallel approaches handling those options, albeit not (δ, γ)-matching at the same time, have recently appeared [15].

Another challenging problem is to consider text indexing, that is, preprocessing the musical strings to speed up searches later. A simple solution is the use of a suffix tree of the text combined with backtracking, which yields search times that are exponential in the pattern length but independent of the text length [23].

References

[1] R. Baeza-Yates. Text retrieval: Theory and practice. In 12th IFIP World Computer Congress, volume I, pages 465–476. Elsevier Science, September 1992.

[2] R. Baeza-Yates and G. Gonnet. A new approach to text searching. Comm. ACM, 35(10):74–82, October 1992.

[3] R. Baeza-Yates and G. Gonnet. Fast string matching with mismatches. Information and Computation, 108(2):187–199, 1994.

[4] R. Baeza-Yates and G. Navarro. Faster approximate string matching. Algorithmica, 23(2):127–158, 1999.

[5] R. S. Boyer and J. S. Moore. A fast string searching algorithm. Comm. ACM, 20(10):762–772, 1977.

[6] E. Cambouropoulos, T. Crawford, and C. Iliopoulos. Pattern processing in melodic sequences: Challenges, caveats and prospects. In Proc. Artificial Intelligence and Simulation of Behaviour (AISB’99) Convention, pages 42–47, 1999.

[7] E. Cambouropoulos, M. Crochemore, C. Iliopoulos, L. Mouchard, and Y. J. Pinzon. Algorithms for computing approximate repetitions in musical sequences. In Proc. 10th Australasian Workshop on Combinatorial Algorithms (AWOCA’99), pages 129–144, 1999.

[8] E. Cambouropoulos, M. Crochemore, C. S. Iliopoulos, L. Mouchard, and Y. J. Pinzon. Algorithms for computing approximate repetitions in musical sequences. Int. J. Comput. Math., 79(11):1135–1148, 2002.


[9] T. Crawford, C. Iliopoulos, and R. Raman. String matching techniques for musical similarity and melodic recognition. Computing in Musicology, 11:73–100, 1998.

[10] M. Crochemore, C. Iliopoulos, T. Lecroq, Y. J. Pinzon, W. Plandowski, and W. Rytter. Occurrence and substring heuristics for δ-matching. Fundamenta Informaticae, 55:1–15, 2003.

[11] M. Crochemore, C. Iliopoulos, T. Lecroq, W. Plandowski, and W. Rytter. Three heuristics for δ-matching: δ-BM algorithms. In Proc. 13th Ann. Symp. on Combinatorial Pattern Matching (CPM’02), LNCS v. 2373, pages 178–189, 2002.

[12] M. Crochemore, C. Iliopoulos, G. Navarro, and Y. Pinzon. A bit-parallel suffix automaton approach for (δ, γ)-matching in music retrieval. In Proc. 10th Intl. Symp. on String Processing and Information Retrieval (SPIRE’03), LNCS 2857, 2003.

[13] M. Crochemore and W. Rytter. Text algorithms. Oxford University Press, 1994.

[14] A. Czumaj, M. Crochemore, L. Gasieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter. Speeding up two string-matching algorithms. Algorithmica, 12:247–267, 1994.

[15] K. Lemström and G. Navarro. Flexible and efficient bit-parallel techniques for transposition invariant approximate matching in music retrieval. In Proc. 10th Intl. Symp. on String Processing and Information Retrieval (SPIRE’03), LNCS 2857, 2003.

[16] P. McGettrick. MIDIMatch: Musical Pattern Matching in Real Time. MSc. Dissertation, York University, U.K., 1997.

[17] G. Navarro and R. Baeza-Yates. Improving an algorithm for approximate string matching. Algorithmica, 30(4):473–502, 2001.

[18] G. Navarro and M. Raffinot. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics (JEA), 5(4), 2000.

[19] G. Navarro and M. Raffinot. Flexible Pattern Matching in Strings – Practical on-line search algorithms for texts and biological sequences. Cambridge University Press, 2002.

[20] P. Roland and J. Ganascia. Musical pattern extraction and similarity assessment. In E. Miranda, editor, Readings in Music and Artificial Intelligence, pages 115–144. Harwood Academic Publishers, 2000.

[21] L. A. Smith, E. F. Chiu, and B. L. Scott. A speech interface for building musical score collections. In Proc. 5th ACM conference on Digital Libraries, pages 165–173. ACM Press, 2000.


[22] D. Sunday. A very fast substring searching algorithm. Comm. ACM, 33(8):132–142, August 1990.

[23] E. Ukkonen. Approximate string matching over suffix trees. In Proc. 4th Ann. Symp. on Combinatorial Pattern Matching (CPM’93), pages 228–242, 1993.

[24] S. Wu and U. Manber. Fast text searching allowing errors. Comm. ACM, 35(10):83–91, October 1992.