PrefixSpan - PrefixSpan: A Pattern-growth, Depth-first Search Method

2.3 PrefixSpan: A Pattern-growth, Depth-first Search Method

2.3.2 PrefixSpan

Let us first introduce the concepts of prefix and suffix which are essential in PrefixSpan.

Definition 2.5 (Prefix). Suppose all the items within an element are listed alphabetically. For a given sequence α = e₁e₂···e_n, where each e_i (1 ≤ i ≤ n) is an element, a sequence β = e′₁e′₂···e′_m (m ≤ n) is called a prefix of α if (1) e′_i = e_i for i ≤ m−1; (2) e′_m ⊆ e_m; and (3) all items in (e_m − e′_m) are alphabetically after those in e′_m.

For example, consider sequence s = a(abc)(ac)d(cf). Sequences a, aa, a(ab), and a(abc) are prefixes of s, but neither ab nor a(bc) is a prefix.
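The three conditions of Definition 2.5 can be checked mechanically. The following is a minimal sketch, not part of the original text. It assumes a sequence is encoded as a list of strings, each string holding the single-character items of one element in alphabetical order, and that every element is non-empty. The function name is_prefix is hypothetical.

```python
def is_prefix(beta, alpha):
    """Check Definition 2.5: is beta a prefix of alpha?

    Assumed encoding: a sequence is a list of elements, and each element
    is a non-empty string of single-character items in alphabetical order.
    The empty sequence is treated as "not a prefix" (an assumption).
    """
    m = len(beta)
    if m == 0 or m > len(alpha):
        return False
    # (1) the first m-1 elements must be identical
    if beta[:m - 1] != alpha[:m - 1]:
        return False
    last_b, last_a = set(beta[-1]), set(alpha[m - 1])
    # (2) the last element of beta must be a subset of the m-th element of alpha
    if not last_b <= last_a:
        return False
    # (3) every remaining item must come alphabetically after all items of last_b
    return all(x > max(last_b) for x in last_a - last_b)


# The running example s = a(abc)(ac)d(cf)
s = ["a", "abc", "ac", "d", "cf"]
print(is_prefix(["a", "ab"], s))  # a(ab) is a prefix
print(is_prefix(["a", "bc"], s))  # a(bc) is not: item a would precede b and c
```

Condition (3) is what rejects ab and a(bc): in both cases the item a is left over in the element (abc) but does not come after the items already taken.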

Definition 2.6 (Suffix). Consider a sequence α = e₁e₂···e_n, where each e_i (1 ≤ i ≤ n) is an element. Let β = e′₁e′₂···e′_{m−1}e′_m (m ≤ n) be a subsequence of α. Sequence γ = e″_l e_{l+1}···e_n is the suffix of α with respect to prefix β, denoted as γ = α/β, if

1. l = i_m such that there exist 1 ≤ i₁ < ··· < i_m ≤ n such that e′_j ⊆ e_{i_j} (1 ≤ j ≤ m), and i_m is minimized. In other words, e₁···e_l is the shortest prefix of α which contains e′₁e′₂···e′_{m−1}e′_m as a subsequence; and

2. e″_l is the set of items in e_l − e′_m that are alphabetically after all items in e′_m.

If e″_l is not empty, the suffix is also denoted as (_ items in e″_l)e_{l+1}···e_n. Note that if β is not a subsequence of α, the suffix of α with respect to β is empty.

Example 2.7 (Prefix and suffix). In our running example, for the sequence s = a(abc)(ac)d(cf), (abc)(ac)d(cf) is the suffix with respect to a, (_c)(ac)d(cf) is the suffix with respect to ab, and (ac)d(cf) is the suffix with respect to (ac).
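The suffixes in Example 2.7 can be reproduced with a short sketch of Definition 2.6. This is an illustration under assumptions, not the authors' code. Sequences are lists of strings, one string of alphabetically ordered single-character items per element. A leading "_" marks the partial element e″_l. The function name suffix is hypothetical. Matching each element of β greedily from the left yields the minimal i_m required by the definition.

```python
def suffix(alpha, beta):
    """Suffix of alpha with respect to beta, following Definition 2.6.

    Assumed encoding: list of elements, each a string of sorted
    single-character items. A leading "_" marks the partial element.
    Returns [] when beta is not a subsequence of alpha.
    """
    if not beta:
        return list(alpha)
    pos = -1
    for element in beta:
        pos += 1
        # leftmost element of alpha, after the previous match, containing `element`
        while pos < len(alpha) and not set(element) <= set(alpha[pos]):
            pos += 1
        if pos == len(alpha):
            return []  # beta is not a subsequence of alpha
    # items of e_l that come alphabetically after every item of the last element of beta
    leftover = "".join(x for x in alpha[pos] if x > max(beta[-1]))
    return (["_" + leftover] if leftover else []) + alpha[pos + 1:]


s = ["a", "abc", "ac", "d", "cf"]
print(suffix(s, ["a"]))       # ['abc', 'ac', 'd', 'cf']
print(suffix(s, ["a", "b"]))  # ['_c', 'ac', 'd', 'cf']
print(suffix(s, ["ac"]))      # ['ac', 'd', 'cf']
```

For the prefix (ac), the match lands on the element (abc); the remaining item b does not come after c, so no partial element is kept and the suffix starts at the next element.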

Based on the concepts of prefix and suffix, the problem of mining sequential patterns can be decomposed into a set of subproblems as follows.

1. Let {x₁, x₂, ..., x_n} be the complete set of length-1 sequential patterns in a sequence database S. The complete set of sequential patterns in S can be divided into n disjoint subsets. The i-th subset (1 ≤ i ≤ n) is the set of sequential patterns with prefix x_i.

2. Let α be a length-l sequential pattern and {β₁, β₂, ..., β_m} be the set of all length-(l+1) sequential patterns with prefix α. The complete set of sequential patterns with prefix α, except for α itself, can be divided into m disjoint subsets. The j-th subset (1 ≤ j ≤ m) is the set of sequential patterns prefixed with β_j.

This recursive partitioning of the sequential pattern mining problem forms a divide-and-conquer framework. The partitioning process can be visualized as a sequence enumeration tree.

Example 2.8 (Sequence enumeration tree). Let the set of items I = {a, b, c, d}. Figure 2.2 shows a sequence enumeration tree which enumerates all possible sequences formed using the items.

The divide-and-conquer partitioning process in PrefixSpan conducts a depth-first search of the sequence enumeration tree.
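To make the depth-first traversal concrete, the sketch below enumerates the nodes of a sequence enumeration tree over I = {a, b, c, d} up to a given pattern length. It is an illustration only. The child ordering is an assumption and may differ from the layout of Figure 2.2, and the names children and enumerate_dfs are hypothetical. A node has two kinds of children: those that append a new single-item element, and those that add a larger item to the last element.

```python
def children(pattern, items):
    """Child nodes of `pattern` in the enumeration tree (assumed ordering)."""
    # append x as a new element
    kids = [pattern + (x,) for x in items]
    if pattern:
        last = pattern[-1]
        # add x to the last element; only items after its largest item,
        # so that each element is generated once, in alphabetical order
        kids += [pattern[:-1] + (last + x,) for x in items if x > last[-1]]
    return kids


def enumerate_dfs(items, max_len, pattern=()):
    """Yield all patterns with at most max_len items, depth first."""
    for kid in children(pattern, items):
        yield kid
        if sum(len(element) for element in kid) < max_len:
            yield from enumerate_dfs(items, max_len, kid)


nodes = list(enumerate_dfs("abcd", 2))
print(len(nodes))   # 4 length-1 nodes + 16 + 6 length-2 nodes = 26
print(nodes[:3])    # [('a',), ('a', 'a'), ('a', 'b')]
```

PrefixSpan visits only the part of this tree whose nodes are frequent: as soon as a node fails the support threshold, its whole subtree is skipped.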

To mine the subsets of sequential patterns, the corresponding projected databases can be constructed.

Definition 2.9 (Projected database). Let α be a sequential pattern in a sequence database S. The α-projected database, denoted as S|α, is the collection of suffixes of sequences in S with respect to prefix α.
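As a small illustration of Definition 2.9, the sketch below builds the projected database for a length-1 prefix. It is restricted to single-item prefixes to stay short. The four sequences in DB are an assumed reconstruction of Table 2.1, which is not reproduced in this section. The list-of-strings encoding, the "_" marker for a partial element, and the name project_on_item are likewise assumptions.

```python
# Assumed contents of Table 2.1 (not shown in this section).
DB = [
    ["a", "abc", "ac", "d", "cf"],    # a(abc)(ac)d(cf)
    ["ad", "c", "bc", "ae"],          # (ad)c(bc)(ae)
    ["ef", "ab", "df", "c", "b"],     # (ef)(ab)(df)cb
    ["e", "g", "af", "c", "b", "c"],  # eg(af)cbc
]


def project_on_item(db, x):
    """Collect the non-empty suffixes of the sequences in db w.r.t. prefix x."""
    projected = []
    for seq in db:
        for i, element in enumerate(seq):
            if x in element:  # first occurrence of x in this sequence
                # items sharing the element with x and sorted after it
                leftover = "".join(y for y in element if y > x)
                suffix = (["_" + leftover] if leftover else []) + seq[i + 1:]
                if suffix:
                    projected.append(suffix)
                break
    return projected


for suffix in project_on_item(DB, "d"):
    print(suffix)
# ['cf']
# ['c', 'bc', 'ae']
# ['_f', 'c', 'b']
```

Sequences that do not contain the prefix contribute nothing, so the fourth sequence is absent from the d-projected database.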

Let us examine how to use the prefix-based projection approach to mine sequential patterns.

Example 2.10 (PrefixSpan). For the same sequence database S in Table 2.1 with min_sup = 2, sequential patterns in S can be mined by a prefix-projection method in the following steps.

prefix | projected (suffix) database | sequential patterns
a | (abc)(ac)d(cf), (_d)c(bc)(ae), (_b)(df)cb, (_f)cbc | a, aa, ab, a(bc), a(bc)a, aba, abc, (ab), (ab)c, (ab)d, (ab)f, (ab)dc, ac, aca, acb, acc, ad, adc, af
b | (_c)(ac)d(cf), (_c)(ae), (df)cb, c | b, ba, bc, (bc), (bc)a, bd, bdc, bf
c | (ac)d(cf), (bc)(ae), b, bc | c, ca, cb, cc
d | (cf), c(bc)(ae), (_f)cb | d, db, dc, dcb
e | (_f)(ab)(df)cb, (af)cbc | e, ea, eab, eac, eacb, eb, ebc, ec, ecb, ef, efb, efc, efcb
f | (ab)(df)cb, cbc | f, fb, fbc, fc, fcb

Table 2.2. Projected databases and sequential patterns

1. Find length-1 sequential patterns. Scan S once to find all the frequent items in sequences. Each of these frequent items is a length-1 sequential pattern. They are a: 4, b: 4, c: 4, d: 3, e: 3, and f: 3, where the notation “pattern: count” represents the pattern and its associated support count.

2. Divide search space. The complete set of sequential patterns can be partitioned into the following six subsets according to the six prefixes: (1) the ones with prefix a, (2) the ones with prefix b, ..., and (6) the ones with prefix f.

3. Find subsets of sequential patterns. The subsets of sequential patterns can be mined by constructing the corresponding set of projected databases and mining each recursively. The projected databases as well as sequential patterns found in them are listed in Table 2.2, while the mining process is explained as follows.

a) Find sequential patterns with prefix a. Only the sequences containing a should be collected. Moreover, in a sequence containing a, only the subsequence prefixed with the first occurrence of a should be considered. For example, in sequence (ef)(ab)(df)cb, only the subsequence (_b)(df)cb should be considered for mining sequential patterns prefixed with a. Notice that (_b) means that the last element in the prefix, which is a, together with b, form one element.

The sequences in S containing a are projected with respect to a to form the a-projected database, which consists of four suffix sequences: (abc)(ac)d(cf), (_d)c(bc)(ae), (_b)(df)cb, and (_f)cbc.

By scanning the a-projected database once, its locally frequent items are a: 2, b: 4, _b: 2, c: 4, d: 2, and f: 2, where _b denotes b occurring in the same element as the last item of the prefix. Thus all the length-2 sequential patterns prefixed with a are found, and they are: aa: 2, ab: 4, (ab): 2, ac: 4, ad: 2, and af: 2.

Recursively, all sequential patterns with prefix a can be partitioned into 6 subsets: (1) those prefixed with aa, (2) those with ab, ..., and finally, (6) those with af. These subsets can be mined by constructing respective projected databases and mining each recursively as follows.

i. The aa-projected database consists of two non-empty (suffix) subsequences prefixed with aa: {(_bc)(ac)d(cf), (_e)}. Since there is no hope of generating any frequent subsequence from this projected database, the processing of the aa-projected database terminates.

ii. The ab-projected database consists of the following three suffix sequences: (_c)(ac)d(cf), (_c)a, and c. Recursively mining the ab-projected database returns four sequential patterns: (_c), (_c)a, a, and c (that is, a(bc), a(bc)a, aba, and abc). They form the complete set of sequential patterns prefixed with ab.

iii. The (ab)-projected database contains the following two sequences: (_c)(ac)d(cf) and (df)cb, which leads to the finding of the following sequential patterns prefixed with (ab): c, d, f, and dc.

iv. The ac-, ad-, and af-projected databases can be constructed and recursively mined similarly. The sequential patterns found are shown in Table 2.2.

b) Find sequential patterns with prefix b, c, d, e, and f, respectively. This can be done by constructing the b-, c-, d-, e-, and f-projected databases and mining them respectively. The projected databases as well as the sequential patterns found are shown in Table 2.2.

4. The set of sequential patterns is the collection of patterns found in the above recursive mining process. One can verify that it returns exactly the same set of sequential patterns as GSP does.
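The two counting scans in the example (step 1 on S, and step 3a on the a-projected database) can be sketched as follows. This is an illustration under assumptions: DB is an assumed reconstruction of Table 2.1, which is not reproduced in this section; sequences are lists of strings of sorted single-character items; a leading "_" marks a partial element; and the function names are hypothetical. An item reported as "_x" joins the last element of the prefix, while a plain "x" starts a new element.

```python
from collections import Counter

# Assumed contents of Table 2.1 (not shown in this section).
DB = [
    ["a", "abc", "ac", "d", "cf"],
    ["ad", "c", "bc", "ae"],
    ["ef", "ab", "df", "c", "b"],
    ["e", "g", "af", "c", "b", "c"],
]

# The a-projected database as listed in step 3a.
A_PROJECTED = [
    ["abc", "ac", "d", "cf"],
    ["_d", "c", "bc", "ae"],
    ["_b", "df", "c", "b"],
    ["_f", "c", "b", "c"],
]


def frequent_items(db, min_sup):
    """Step 1: support of each item, counted once per sequence."""
    counts = Counter()
    for seq in db:
        counts.update({item for element in seq for item in element})
    return {item: n for item, n in counts.items() if n >= min_sup}


def local_frequent(projected, last, min_sup):
    """Step 3a: locally frequent items in a projected database.

    `last` is the last element of the prefix (for prefix a, the string "a").
    """
    counts = Counter()
    for suffix in projected:
        found = set()
        for element in suffix:
            if element.startswith("_"):
                # remainder of the element that matched the prefix
                found.update("_" + x for x in element[1:])
            else:
                found.update(element)  # x may start a new element
                if set(last) <= set(element):
                    # a later element containing `last` can also be extended
                    found.update("_" + x for x in element if x > max(last))
        counts.update(found)
    return {item: n for item, n in counts.items() if n >= min_sup}


print(frequent_items(DB, 2))
print(local_frequent(A_PROJECTED, "a", 2))
```

With min_sup = 2, the second call reports a: 2, b: 4, _b: 2, c: 4, d: 2, and f: 2, which correspond to the length-2 patterns aa, ab, (ab), ac, ad, and af.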

Based on the above discussion, the algorithm of PrefixSpan is presented in Figure 2.3.

Input: A sequence database S, and the minimum support threshold min_support.

Output: The complete set of sequential patterns.

Method: Call PrefixSpan(∅, 0, S).

Subroutine PrefixSpan(α, l, S|α)

The parameters are (1) α is a sequential pattern; (2) l is the length of α; and (3) S|α is the α-projected database if α ≠ ∅, otherwise, it is the sequence database S.

Method:

1. Scan S|α once, find each frequent item b such that

a) b can be assembled to the last element of α to form a sequential pattern; or

b) b can be appended to α to form a sequential pattern.

2. For each frequent item b, append it to α to form a sequential pattern α′, and output α′;

3. For each α′, construct the α′-projected database S|α′, and call PrefixSpan(α′, l+1, S|α′).

Fig. 2.3. Algorithm PrefixSpan.
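The subroutine of Figure 2.3 can be sketched in Python as follows. This is a minimal, unoptimized illustration under stated assumptions, not the authors' implementation. Sequences are lists of strings of sorted single-character items, and a leading "_" marks a partial element. DB is an assumed reconstruction of Table 2.1. The length parameter l is dropped because it can be derived from the pattern. Infrequent items are not pruned from the suffixes. All function names are hypothetical.

```python
from collections import Counter

# Assumed contents of Table 2.1 (not shown in this section).
DB = [
    ["a", "abc", "ac", "d", "cf"],
    ["ad", "c", "bc", "ae"],
    ["ef", "ab", "df", "c", "b"],
    ["e", "g", "af", "c", "b", "c"],
]


def extensions(suffix, last):
    """Items that can grow the pattern in one suffix (step 1 of Fig. 2.3).

    "_x": x is assembled into the last element; "x": x is appended.
    """
    found = set()
    for element in suffix:
        if element.startswith("_"):
            found.update("_" + x for x in element[1:])
        else:
            found.update(element)
            if last and set(last) <= set(element):
                found.update("_" + x for x in element if x > max(last))
    return found


def project(suffix, last, ext):
    """Suffix that remains after growing the pattern by ext (step 3)."""
    joined, item = ext.startswith("_"), ext[-1]
    need = set(last) | {item} if joined else {item}
    for i, element in enumerate(suffix):
        if element.startswith("_"):
            hit = joined and item in element
        else:
            hit = need <= set(element)
        if hit:
            leftover = "".join(x for x in element.lstrip("_") if x > item)
            return (["_" + leftover] if leftover else []) + suffix[i + 1:]
    return []


def prefixspan(db, min_sup, pattern=(), results=None):
    """Return {pattern: support}; a pattern is a tuple of element strings."""
    if results is None:
        results = {}
    last = pattern[-1] if pattern else ""
    counts = Counter()
    for suffix in db:  # one scan of the projected database
        counts.update(extensions(suffix, last))
    for ext, sup in sorted(counts.items()):
        if sup < min_sup:
            continue
        if ext.startswith("_"):
            grown = pattern[:-1] + (last + ext[1],)
        else:
            grown = pattern + (ext,)
        results[grown] = sup  # step 2: output the grown pattern
        projected = [s for s in (project(x, last, ext) for x in db) if s]
        prefixspan(projected, min_sup, grown, results)
    return results


patterns = prefixspan(DB, 2)
print(patterns[("a", "b")])   # support of ab
print(patterns[("ab",)])      # support of (ab)
```

Each recursive call scans only its own projected database, so no candidate that is absent from the data is ever counted.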

Let us analyze the efficiency of the algorithm.

• No candidate sequence needs to be generated by PrefixSpan. Unlike Apriori-like algorithms, PrefixSpan only grows longer sequential patterns from the shorter frequent ones. It neither generates nor tests any candidate sequence non-existent in a projected database. Compared with GSP, which generates and tests a substantial number of candidate sequences, PrefixSpan searches a much smaller space.

• Projected databases keep shrinking. It is easy to see that a projected database is smaller than the original one because only the suffix subsequences of a frequent prefix are projected into a projected database. In practice, the shrinking factors can be significant because (1) usually, only a small set of sequential patterns grow quite long in a sequence database, and thus the number of sequences in a projected database usually reduces substantially when the prefix grows; and (2) projection only takes the suffix portion with respect to a prefix.

• The major cost of PrefixSpan is the construction of projected databases. In the worst case, PrefixSpan constructs a projected database for every sequential pattern. If there exist a good number of sequential patterns, the cost is non-trivial. Techniques for reducing the number of projected databases will be discussed in the next subsection.

Theoretically, the problem of mining the complete set of sequential patterns is #P-complete [33]. Therefore, it is impossible to have a polynomial time algorithm unless P = NP. Even if P = NP, it is still unclear whether a polynomial time algorithm exists.

Interestingly, we can show that the PrefixSpan algorithm is pseudo-polynomial. That is, the complexity of PrefixSpan is linear with respect to the number of sequential patterns, since each projection generates at least one sequential pattern, and the projection cost is upper bounded by the time of scanning the database once, and counting frequent items in the suffixes.

In the document Sequence Data Mining (pages 34-38)