
Boyer-Moore algorithm



4.2.3 Boyer-Moore algorithm

Boyer and Moore proposed their algorithm for string matching [2] around the same time that Knuth, Morris, and Pratt came out with theirs, in 1977. Both algorithms became historically famous in the research and development of string matching, mainly because of their application to text processing.

Although the computational complexity of both algorithms is linear on the average, the Boyer-Moore algorithm is likely to be more efficient than the Knuth-Morris-Pratt algorithm for a relatively long pattern p and a reasonably large alphabet Σ.

The key insight of the Boyer-Moore algorithm is that some of the characters in the text can be skipped entirely, without comparing them with the pattern, because it can be shown that they can never contribute to an occurrence of the pattern in the text. In the Boyer-Moore algorithm, although the text is scanned left to right, comparisons of the pattern and the text are done backwards, right to left, along the search window, reading the longest suffix of the search window that is also a suffix of the pattern.

The first comparison is made between the last pattern character p_m and the text character t_m, where m is the length of the pattern p. If p_m mismatches with t_m and the character t_m does not appear in the pattern p at all, then it is wasteful to compare the first m - 1 characters of the pattern with the first m - 1 characters of the text, since the pattern cannot occur in any of the first m positions of the text. As a result, the pattern can safely be shifted m places to the right, so that the next comparison happens between p_m and t_{2m}. Consider searching for a pattern, say 'ababz', in a text which does not contain the character 'z' in any of its positions. The total number of comparisons in the text will then be only about n/m instead of n. This is a significant performance improvement compared to prefix comparison-based string matching, such as the Knuth-Morris-Pratt or the finite automaton-based algorithms.
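As a rough illustration of this best case (a Python sketch, not taken from the book), the following snippet counts how many text characters are inspected when none of the characters of the text occur in the pattern, so that every alignment advances by the full pattern length m; the function name and the example strings are chosen here for illustration only.

# Hypothetical illustration: when the inspected text character never occurs
# in the pattern, each comparison lets the window jump m positions, so only
# about n/m text characters are ever touched.
def inspections_with_full_skips(text, pattern):
    m, n = len(pattern), len(text)
    assert not (set(text) & set(pattern)), "demo assumes no shared characters"
    inspections = 0
    i = m - 1                 # 0-indexed text position aligned with p_m
    while i < n:
        inspections += 1      # p_m is compared with this text character and mismatches
        i += m                # the character is absent from the pattern: shift by m
    return inspections

# About n/m inspections instead of n: here 1000 / 5 = 200.
print(inspections_with_full_skips("cd" * 500, "ababz"))   # prints 200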

In general, if p_m does not match with t_i and t_i does not appear in the pattern p = p_1 p_2 ... p_m, then we simply ignore comparing all the previous m - 1 text characters and shift the pattern m places to the right of t_i in the text. This is illustrated with an example in Fig. 4.8(a) for a pattern 'bcbab' of length five, aligned with the text beginning at index 12. Here p_5 = 'b' does not match with t_16 = 'd', and 'd' does not appear in any position of the pattern 'bcbab'. Hence the pattern is shifted right by five places and aligned with the text beginning at index 17, as shown in Fig. 4.8(b), so that further comparison resumes from this location.

On the other hand, if p_m ≠ t_i and t_i does appear in the pattern such that the rightmost appearance of t_i in the pattern is p_{m-j}, then the pattern can safely be shifted by j places to the right of t_i in the text in order to align p_{m-j} with t_i. Thereafter, comparison of p_m starts again with t_{i+j}. As an example, p_m = p_5 = 'b' does not match with t_i = t_21 = 'c', as shown in Fig. 4.8(b). However, 'c' appears in the pattern, and its rightmost appearance is p_{m-j} = p_2 = 'c'.

Hence the pattern 'bcbab' is shifted right by j = 3 places in order to align p_2 = 'c' with t_21 = 'c', as indicated by the curved arrow in Fig. 4.8(c), and further comparison of p_5 resumes with the text character t_{i+j} = t_24.


Fig. 4.8 Example of skipping character comparisons in the Boyer-Moore algorithm for pattern matching: (a) current pattern position; (b) pattern shifted completely to the right because 'd' does not appear in the pattern; (c) pattern shifted by three positions to align with character 'c'.

If a match is found between p_m and t_i, then the preceding characters in the text from t_i are compared sequentially, right to left, with the corresponding positions in the pattern until there is a mismatch or the pattern is completely matched. If the pattern is completely matched, this implies that the pattern occurs in the text ending at position i. Hence the pattern is shifted by one place to the right, and the matching procedure resumes.

The number of positions to slide forward, upon mismatch, depends on the character t_i being matched with the rightmost character p_m of the pattern.

These numbers can be stored in an array or table, say skip, with |Σ| entries in the table, where Σ is the alphabet over the text and the pattern. The entry for a symbol σ ∈ Σ in the skip table is skip(σ) = m - j when p_j is the rightmost occurrence of σ in the pattern p, and skip(σ) = m if σ does not appear in the pattern at all. Hence we can compute the skip table using the following algorithm.


GENERATE-SKIP-TABLE(Σ, p)
1. Set pattern length, m ← |p|;
2. Initialize skip table, skip(σ) ← m for all symbols σ ∈ Σ;
3. Initialize pattern index, j ← 1;
4. For the jth character p_j in the pattern, set skip(p_j) ← m - j;
5. Increment pattern index, j ← j + 1;
6. If j ≤ m (i.e., the pattern is not yet completely scanned), then go to step 4;
7. Stop.
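A direct transcription of GENERATE-SKIP-TABLE in Python might look as follows (a sketch under the conventions above; a dictionary stands in for the |Σ|-entry array, with absent symbols falling back to m via a default lookup):

# Sketch of GENERATE-SKIP-TABLE: skip[c] = m - j for the rightmost occurrence
# p_j of symbol c; symbols not in the pattern implicitly keep the value m.
def generate_skip_table(pattern):
    m = len(pattern)
    skip = {}
    # Steps 3-6: scan p_1 ... p_m left to right; a later (more rightward)
    # occurrence of a symbol overwrites an earlier one.
    for j in range(1, m + 1):            # j = 1, 2, ..., m (1-indexed as in the text)
        skip[pattern[j - 1]] = m - j
    return skip

# For the pattern 'bcbab' of Fig. 4.8 (m = 5) this gives {'b': 0, 'c': 3, 'a': 1};
# a symbol not in the pattern, such as 'd', skips m = 5 (Fig. 4.8(b)), and the
# mismatching 'c' of Fig. 4.8(c) skips 3 positions, as in the figure.
print(generate_skip_table("bcbab"))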

The nature of the shift of the pattern is explained with an example in Fig. 4.9, which finds the occurrences of the pattern string 'match' in the text string 'one of them matches and others mismatch from'. The procedure requires only 19 character comparisons, as opposed to 44 or more comparisons by the Knuth-Morris-Pratt or the finite automaton-based string matching algorithms.

When a match is found between p_m and t_i, subsequent comparisons are made with the preceding characters in the text from t_i, sequentially right to left, against the corresponding positions in the pattern. If a mismatch is found at p_j (i.e., p_j ≠ t_{i-m+j}), then the suffix u = p_{j+1} p_{j+2} ... p_m of length m - j of the pattern is said to match the text substring u = t_{i-m+j+1} ... t_i. If the rightmost occurrence of the mismatching character t_{i-m+j} in the pattern is p_{m-k}, then the pattern is shifted by k positions to the right from the mismatching position in the text, to align p_{m-k} with t_{i-m+j}, and the matching procedure resumes. However, the shift will be only one position to the right if j < m - k, in order to avoid a negative shift when aligning p_{m-k} with t_{i-m+j}.

It is also possible that a greater shift is obtained, compared to the above case, when a mismatch occurs after a partial match of a substring. The idea is to find the suffix u = p_{j+1} p_{j+2} ... p_m occurring, at another position, as a factor of p. Then the pattern can safely be shifted forward to the right, so that u = t_{i-m+j+1} ... t_i in the text matches with the next occurrence of u in the pattern. If no such factor exists in the pattern, we cannot, in general, safely move the whole pattern to the right of the mismatching character. In this case, the algorithm computes the longest prefix v of p that is also a proper suffix of u. The pattern is then shifted by m - |v| positions, so that its prefix v aligns with the end of u in the text. The possible shifts can be precomputed from the pattern itself and stored in an array or 'shift' table.
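The following Python sketch precomputes this good-suffix shift directly from the definition just given; it checks every candidate shift explicitly and therefore runs in O(m^2) time, unlike the linear-time precomputation normally used, and the function name is chosen here for illustration only.

# Sketch of the good-suffix 'shift' table: for a mismatch at pattern position j
# (1-indexed) after the suffix u = p_{j+1} ... p_m has matched, shift[j] is the
# smallest s >= 1 that realigns u with another occurrence of u as a factor of p,
# or with a prefix v of p that is a proper suffix of u; otherwise it is m.
def generate_shift_table(pattern):
    m = len(pattern)
    shift = {}
    for j in range(1, m + 1):
        for s in range(1, m + 1):
            # shift s is safe if every matched position k = j+1 ... m either
            # falls off the left end of the pattern or agrees with p_{k-s}
            if all(k - s < 1 or pattern[k - s - 1] == pattern[k - 1]
                   for k in range(j + 1, m + 1)):
                shift[j] = s
                break
    return shift

# For 'bcbab': a mismatch at p_4 (matched suffix 'b') gives shift 2, realigning
# with the 'b' at p_3; a mismatch at p_5 (empty suffix) gives shift 1.
print(generate_shift_table("bcbab"))     # {1: 4, 2: 4, 3: 4, 4: 2, 5: 1}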

During the search stage, the shift for a mismatch at pattern location p_j and mismatching text character t_i is chosen as max{skip(t_i), shift(j)}.
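Putting the two tables together, a search loop along the lines described above might look as follows. This is a sketch, not the book's code: it reuses the generate_skip_table and generate_shift_table sketches given earlier, takes the bad-character skip for the text character aligned with the last pattern position (as in the discussion preceding the skip table), and shifts by one place after a complete match, as stated earlier.

# Sketch of the combined Boyer-Moore search using the skip and shift tables.
def boyer_moore_search(text, pattern):
    m, n = len(pattern), len(text)
    skip = generate_skip_table(pattern)     # bad-character table (sketch above)
    shift = generate_shift_table(pattern)   # good-suffix table (sketch above)
    occurrences = []
    i = m                                   # 1-indexed text position aligned with p_m
    while i <= n:
        j = m                               # compare right to left from p_m
        while j >= 1 and pattern[j - 1] == text[i - m + j - 1]:
            j -= 1
        if j == 0:                          # complete match ending at t_i
            occurrences.append(i - m + 1)   # 1-indexed starting position
            i += 1                          # shift by one place and resume
        else:                               # mismatch at p_j
            bad_char = skip.get(text[i - 1], m)   # character aligned with p_m
            i += max(bad_char, shift[j])
    return occurrences

# The example of Fig. 4.9: 'match' is reported at positions 13 and 35, and a
# hand trace of this sketch performs 19 character comparisons on this text.
print(boyer_moore_search("one of them matches and others mismatch from", "match"))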

The Boyer-Moore search algorithm has a worst-case computational complexity of the order O(mn). However, it is sublinear on the average. Many variations of the Boyer-Moore idea have been proposed to define worst-case algorithms of linear order [2].


Fig. 4.9 Example of Boyer-Moore pattern matching.

Although it theoretically provides high performance, the Boyer-Moore algorithm (as well as the Knuth-Morris-Pratt algorithm) requires complicated preprocessing of the pattern before the actual search for occurrences of the pattern in the string can begin. Hence the Boyer-Moore algorithm, in spite of its promise of sublinear performance on the average, has not been used in many applications in its original form.

Horspool was the first to propose a very simplified version of the Boyer-Moore algorithm [3], dispensing with the computation and use of the shift array altogether. It uses a variation of the original skip array only, and it ensures linear-order computational complexity on the average as well. This is popularly known as the Boyer-Moore-Horspool algorithm for string matching.
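A minimal sketch of the Horspool simplification, assuming the standard formulation of the algorithm (this is not code from the book): only a skip-style table is kept, built over the first m - 1 pattern characters so that the shift is always positive, and the shift is always taken from the text character aligned with the last pattern position, whether or not a partial match occurred.

# Sketch of Boyer-Moore-Horspool search: bad-character skips only, no shift table.
def horspool_search(text, pattern):
    m, n = len(pattern), len(text)
    hskip = {}
    for j in range(1, m):                   # skip built over p_1 ... p_{m-1} only
        hskip[pattern[j - 1]] = m - j
    occurrences = []
    i = m                                   # 1-indexed text position aligned with p_m
    while i <= n:
        j = m
        while j >= 1 and pattern[j - 1] == text[i - m + j - 1]:
            j -= 1
        if j == 0:
            occurrences.append(i - m + 1)
        i += hskip.get(text[i - 1], m)      # shift from the char aligned with p_m
    return occurrences

print(horspool_search("one of them matches and others mismatch from", "match"))   # [13, 35]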
