
PrefixSpan

From the document Sequence Data Mining (pages 34-38)

2.3 PrefixSpan: A Pattern-growth, Depth-first Search Method

2.3.2 PrefixSpan

Let us first introduce the concepts of prefix and suffix which are essential in PrefixSpan.

Definition 2.5 (Prefix). Suppose all the items within an element are listed alphabetically. For a given sequence α = e_1 e_2 ··· e_n, where each e_i (1 ≤ i ≤ n) is an element, a sequence β = e′_1 e′_2 ··· e′_m (m ≤ n) is called a prefix of α if (1) e′_i = e_i for i ≤ m − 1; (2) e′_m ⊆ e_m; and (3) all items in (e_m − e′_m) are alphabetically after those in e′_m.

For example, consider sequence s = a(abc)(ac)d(cf). Sequences a, aa, a(ab) and a(abc) are prefixes of s, but neither ab nor a(bc) is a prefix.
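The three conditions of Definition 2.5 can be checked mechanically. The following is a minimal sketch, not part of the original text: the function name `is_prefix` and the encoding of a sequence as a list of strings (one string of alphabetically ordered single-letter items per element) are assumptions made for illustration.

```python
def is_prefix(beta, alpha):
    """Return True if beta is a prefix of alpha in the sense of Definition 2.5.

    Assumed encoding: a sequence is a list of elements, and each element is a
    string of single-letter items in alphabetical order, so that
    a(abc)(ac)d(cf) becomes ["a", "abc", "ac", "d", "cf"].
    """
    m = len(beta)
    # The definition is stated for 1 <= m <= n; an empty beta or an empty
    # last element is rejected here by assumption.
    if m == 0 or m > len(alpha) or not beta[-1]:
        return False
    # (1) the first m-1 elements must be identical
    if beta[:m - 1] != alpha[:m - 1]:
        return False
    last_b, last_a = set(beta[-1]), set(alpha[m - 1])
    # (2) the last element of beta must be a subset of the m-th element of alpha
    if not last_b <= last_a:
        return False
    # (3) every leftover item must sort after all items of beta's last element
    return all(x > max(last_b) for x in last_a - last_b)


s = ["a", "abc", "ac", "d", "cf"]  # the sequence a(abc)(ac)d(cf)
for candidate in (["a"], ["a", "a"], ["a", "ab"], ["a", "abc"], ["a", "b"], ["a", "bc"]):
    print(candidate, is_prefix(candidate, s))
```

The last two candidates, ab and a(bc), fail condition (3) because item a of the element (abc) is left over and does not sort after b or c.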

Definition 2.6 (Suffix). Consider a sequence α = e_1 e_2 ··· e_n, where each e_i (1 ≤ i ≤ n) is an element. Let β = e′_1 e′_2 ··· e′_{m−1} e′_m (m ≤ n) be a subsequence of α. Sequence γ = e″_l e_{l+1} ··· e_n is the suffix of α with respect to prefix β, denoted as γ = α/β, if

1. l = i_m such that there exist 1 ≤ i_1 < ··· < i_m ≤ n such that e′_j ⊆ e_{i_j} (1 ≤ j ≤ m), and i_m is minimized. In other words, e_1 ··· e_l is the shortest prefix of α which contains e′_1 e′_2 ··· e′_{m−1} e′_m as a subsequence; and

2. e″_l is the set of items in e_l − e′_m that are alphabetically after all items in e′_m.

If e″_l is not empty, the suffix is also denoted as (_ items in e″_l) e_{l+1} ··· e_n. Note that if β is not a subsequence of α, the suffix of α with respect to β is empty.

Example 2.7 (Prefix and suffix). In our running example, for the sequence s = a(abc)(ac)d(cf), (abc)(ac)d(cf) is the suffix with respect to a, (_c)(ac)d(cf) is the suffix with respect to ab, and (ac)d(cf) is the suffix with respect to (ac).
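Definition 2.6 can be turned into a short procedure: match the elements of β greedily from the left, which minimizes the index of the last matched element, and then keep what remains. The sketch below is an illustration under assumptions, not the book's code: the names `suffix` and `show`, the list-of-strings encoding, and the choice to return `None` when β is not a subsequence (the text calls that suffix empty) are all hypothetical.

```python
def suffix(alpha, beta):
    """Suffix of alpha with respect to beta, following Definition 2.6.

    Assumed encoding: each sequence is a list of strings, one string of
    alphabetically ordered single-letter items per element.
    Returns (partial, rest): `partial` holds the leftover items of the last
    matched element (possibly ""), `rest` is the list of later elements.
    Returns None when beta is not a subsequence of alpha.
    """
    if not beta:
        return "", list(alpha)
    pos = -1
    for elem in beta:
        need = set(elem)
        pos += 1
        # Leftmost match of each element keeps the final index minimal.
        while pos < len(alpha) and not need <= set(alpha[pos]):
            pos += 1
        if pos == len(alpha):
            return None
    last = set(beta[-1])
    # Items of the matched element that sort after all items of beta's last element.
    partial = "".join(x for x in alpha[pos] if x not in last and x > max(last))
    return partial, alpha[pos + 1:]


def show(partial, rest):
    """Render a suffix in the text's notation, e.g. (_c)(ac)d(cf)."""
    head = ["(_" + partial + ")"] if partial else []
    return "".join(head + [e if len(e) == 1 else "(" + e + ")" for e in rest])


s = ["a", "abc", "ac", "d", "cf"]  # the sequence a(abc)(ac)d(cf)
for beta in (["a"], ["a", "b"], ["ac"]):
    print(beta, "->", show(*suffix(s, beta)))
```

The three calls correspond to the three suffixes listed in Example 2.7.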

Based on the concepts of prefix and suffix, the problem of mining sequential patterns can be decomposed into a set of subproblems as follows.

1. Let {x_1, x_2, ..., x_n} be the complete set of length-1 sequential patterns in a sequence database S. The complete set of sequential patterns in S can be divided into n disjoint subsets. The i-th subset (1 ≤ i ≤ n) is the set of sequential patterns with prefix x_i.

2. Let α be a length-l sequential pattern and {β_1, β_2, ..., β_m} be the set of all length-(l + 1) sequential patterns with prefix α. The complete set of sequential patterns with prefix α, except for α itself, can be divided into m disjoint subsets. The j-th subset (1 ≤ j ≤ m) is the set of sequential patterns prefixed with β_j.

The above recursive partitioning of the sequential pattern mining problem forms a divide-and-conquer framework. The partitioning process can be visualized as a sequence enumeration tree.

Example 2.8 (Sequence enumeration tree). Let the set of items I = {a, b, c, d}. Figure 2.2 shows a sequence enumeration tree which enumerates all possible sequences formed using the items.

The divide-and-conquer partitioning process in PrefixSpan conducts a depth-first search of the sequence enumeration tree.
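Figure 2.2 is not reproduced in this section, so the following sketch shows only one plausible way to generate such a tree. It assumes that a node's children are obtained either by appending a new single-item element or by adding a larger item to the last element; the names `children` and `dfs`, the tuple encoding, and the order in which children are visited are illustrative assumptions.

```python
def children(pattern, items):
    """Child nodes of `pattern` in a sequence enumeration tree over `items`.

    A pattern is a tuple of elements; each element is a tuple of items in
    alphabetical order. Two kinds of growth are assumed:
      - append a new single-item element to the sequence;
      - add an item that sorts after the last item of the final element.
    """
    items = sorted(items)
    kids = [pattern + ((x,),) for x in items]
    if pattern:
        last = pattern[-1]
        kids += [pattern[:-1] + (last + (x,),) for x in items if x > last[-1]]
    return kids


def dfs(pattern, items, max_len, out):
    """Depth-first walk of the tree, stopping at sequences with max_len items."""
    for kid in children(pattern, items):
        out.append(kid)
        if sum(len(e) for e in kid) < max_len:
            dfs(kid, items, max_len, out)
    return out


nodes = dfs((), "abcd", 2, [])
print(len(nodes))   # number of sequences with one or two items
print(nodes[:3])    # the walk descends below a before moving on to b
```

Because every sequence has exactly one parent under these two growth rules, the subtrees are disjoint, which is the property the partitioning above relies on.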

To mine the subsets of sequential patterns, the corresponding projected databases can be constructed.

Definition 2.9 (Projected database). Let α be a sequential pattern in a sequence database S. The α-projected database, denoted as S|α, is the collection of suffixes of sequences in S with respect to prefix α.

Let us examine how to use the prefix-based projection approach to mine sequential patterns.

Example 2.10 (PrefixSpan). For the same sequence database S in Table 2.1 with min_sup = 2, sequential patterns in S can be mined by a prefix-projection method in the following steps.

| prefix | projected (suffix) database | sequential patterns |
| --- | --- | --- |
| a | (abc)(ac)d(cf), (_d)c(bc)(ae), (_b)(df)cb, (_f)cbc | a, aa, ab, a(bc), a(bc)a, aba, abc, (ab), (ab)c, (ab)d, (ab)f, (ab)dc, ac, aca, acb, acc, ad, adc, af |
| b | (_c)(ac)d(cf), (_c)(ae), (df)cb, c | b, ba, bc, (bc), (bc)a, bd, bdc, bf |
| c | (ac)d(cf), (bc)(ae), b, bc | c, ca, cb, cc |
| d | (cf), c(bc)(ae), (_f)cb | d, db, dc, dcb |
| e | (_f)(ab)(df)cb, (af)cbc | e, ea, eab, eac, eacb, eb, ebc, ec, ecb, ef, efb, efc, efcb |
| f | (ab)(df)cb, cbc | f, fb, fbc, fc, fcb |

Table 2.2. Projected databases and sequential patterns

1. Find length-1 sequential patterns. Scan S once to find all the frequent items in sequences. Each of these frequent items is a length-1 sequential pattern. They are a : 4, b : 4, c : 4, d : 3, e : 3, and f : 3, where the notation "pattern : count" represents the pattern and its associated support count.

2. Divide search space. The complete set of sequential patterns can be partitioned into the following six subsets according to the six prefixes: (1) the ones with prefix a, (2) the ones with prefix b, ..., and (6) the ones with prefix f.

3. Find subsets of sequential patterns. The subsets of sequential patterns can be mined by constructing the corresponding set of projected databases and mining each recursively. The projected databases as well as sequential patterns found in them are listed in Table 2.2, while the mining process is explained as follows.

a) Find sequential patterns with prefix a. Only the sequences containing a should be collected. Moreover, in a sequence containing a, only the subsequence prefixed with the first occurrence of a should be considered. For example, in sequence (ef)(ab)(df)cb, only the subsequence (_b)(df)cb should be considered for mining sequential patterns prefixed with a. Notice that (_b) means that the last element in the prefix, which is a, together with b, form one element.

The sequences in S containing a are projected with respect to a to form the a-projected database, which consists of four suffix sequences: (abc)(ac)d(cf), (_d)c(bc)(ae), (_b)(df)cb and (_f)cbc.

By scanning the a-projected database once, its locally frequent items are a : 2, b : 4, _b : 2, c : 4, d : 2, and f : 2. Thus all the length-2 sequential patterns prefixed with a are found, and they are: aa : 2, ab : 4, (ab) : 2, ac : 4, ad : 2, and af : 2.

Recursively, all sequential patterns with prefix a can be partitioned into 6 subsets: (1) those prefixed with aa, (2) those with ab, ..., and finally, (6) those with af. These subsets can be mined by constructing respective projected databases and mining each recursively as follows.

i. The aa-projected database consists of two non-empty (suffix) subsequences prefixed with aa: {(_bc)(ac)d(cf), (_e)}. Since there is no hope of generating any frequent subsequence from this projected database, the processing of the aa-projected database terminates.

ii. The ab-projected database consists of the following three suffix sequences: (_c)(ac)d(cf), (_c)a, and c. Recursively mining the ab-projected database returns four sequential patterns: (_c), (_c)a, a, and c (that is, a(bc), a(bc)a, aba, and abc). They form the complete set of sequential patterns prefixed with ab.

iii. The (ab)-projected database contains the following two sequences: (_c)(ac)d(cf) and (df)cb, which leads to the finding of the following sequential patterns prefixed with (ab): c, d, f, and dc.

iv. The ac-, ad- and af-projected databases can be constructed and recursively mined similarly. The sequential patterns found are shown in Table 2.2.

b) Find sequential patterns with prefix b, c, d, e and f, respectively. This can be done by constructing the b-, c-, d-, e- and f-projected databases and mining them respectively. The projected databases as well as the sequential patterns found are shown in Table 2.2.

4. The set of sequential patterns is the collection of patterns found in the above recursive mining process. One can verify that it returns exactly the same set of sequential patterns as GSP does.
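Steps 1 and 3(a) of the example can be replayed in a few lines. This is a sketch under stated assumptions: Table 2.1 is not reproduced in this section, so the four-sequence database `DB` below is an assumption, chosen to be consistent with the suffixes and counts quoted in the example. The function names and the "_x" key convention are hypothetical, only a single-item prefix is handled, and infrequent items are not pruned from the suffixes.

```python
from collections import Counter

# Assumed running-example database (not taken from this section's text).
# Each element is a string of alphabetically ordered single-letter items.
DB = [
    ["a", "abc", "ac", "d", "cf"],
    ["ad", "c", "bc", "ae"],
    ["ef", "ab", "df", "c", "b"],
    ["e", "g", "af", "c", "b", "c"],
]
MIN_SUP = 2


def frequent_items(db, min_sup):
    """Step 1: an item's support is the number of sequences containing it."""
    counts = Counter()
    for seq in db:
        counts.update(set("".join(seq)))
    return {x: c for x, c in counts.items() if c >= min_sup}


def project_on_item(db, item):
    """Non-empty suffixes with respect to a single-item prefix.

    Each suffix is a (partial, rest) pair: `partial` holds the items that
    share an element with the first occurrence of `item` and sort after it
    (written (_x) in the text); `rest` is the list of later elements.
    """
    projected = []
    for seq in db:
        for i, elem in enumerate(seq):
            if item in elem:
                partial = "".join(x for x in elem if x > item)
                if partial or seq[i + 1:]:
                    projected.append((partial, seq[i + 1:]))
                break
    return projected


def local_counts(projected, last_item):
    """Count candidate extensions in the projected database of a one-item prefix.

    Keys "_x" mean x joins the prefix's element; plain keys "x" mean x starts
    a new element. Each key is counted at most once per sequence.
    """
    counts = Counter()
    for partial, rest in projected:
        seen = {"_" + x for x in partial}
        for elem in rest:
            seen.update(elem)
            if last_item in elem:  # this element could also host the prefix item
                seen.update("_" + x for x in elem if x > last_item)
        counts.update(seen)
    return counts


print(frequent_items(DB, MIN_SUP))
a_db = project_on_item(DB, "a")
print({k: v for k, v in local_counts(a_db, "a").items() if v >= MIN_SUP})
```

Note that _b is counted for the first sequence even though its suffix does not start with (_b): the element (abc) later in the suffix contains both a and b, so it also supports the pattern (ab).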

Based on the above discussion, the algorithm of PrefixSpan is presented in Figure 2.3.

Input: A sequence database S, and the minimum support threshold min_support.

Output: The complete set of sequential patterns.

Method: Call PrefixSpan(∅, 0, S).

Subroutine PrefixSpan(α, l, S|α)

The parameters are (1) α is a sequential pattern; (2) l is the length of α; and (3) S|α is the α-projected database if α ≠ ∅, otherwise, it is the sequence database S.

Method:

1. Scan S|α once, find each frequent item b such that

a) b can be assembled to the last element of α to form a sequential pattern; or

b) b can be appended to α to form a sequential pattern.

2. For each frequent item b, append it to α to form a sequential pattern α′, and output α′;

3. For each α′, construct the α′-projected database S|α′, and call PrefixSpan(α′, l + 1, S|α′).

Fig. 2.3. Algorithm PrefixSpan.
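The subroutine in Figure 2.3 can be sketched in Python as follows. This is an illustration under assumptions rather than the authors' implementation: the database `DB` stands in for Table 2.1, which is not reproduced here; sequences are encoded as lists of strings; and, as a design choice of this sketch, each projected database is stored as a list of (sequence index, position) pointers instead of copied suffixes. The names `prefixspan`, `mine`, `grow` and `fmt` are hypothetical.

```python
from collections import Counter


def prefixspan(db, min_sup):
    """Return {pattern: support} for all patterns with support >= min_sup.

    A pattern is a tuple of element strings, e.g. ("a", "bc", "a") for a(bc)a.
    """
    patterns = {}

    def mine(alpha, proj):
        last = set(alpha[-1]) if alpha else None
        seq_ext, item_ext = Counter(), Counter()
        # Step 1: scan the projected database once and count both kinds of growth.
        for sid, pos in proj:
            seq = db[sid]
            s_items, i_items = set(), set()
            for j in range(pos + 1, len(seq)):
                s_items.update(seq[j])  # b appended as a new element
            if last:
                for j in range(pos, len(seq)):
                    if last <= set(seq[j]):  # b assembled into the last element
                        i_items.update(x for x in seq[j] if x > alpha[-1][-1])
            seq_ext.update(s_items)
            item_ext.update(i_items)
        # Steps 2 and 3: form each longer pattern, output it, project, recurse.
        for b in sorted(x for x, c in item_ext.items() if c >= min_sup):
            grow(alpha[:-1] + (alpha[-1] + b,), proj, 0)
        for b in sorted(x for x, c in seq_ext.items() if c >= min_sup):
            grow(alpha + (b,), proj, 1)

    def grow(new_alpha, proj, offset):
        need = set(new_alpha[-1])
        new_proj = []
        for sid, pos in proj:
            seq = db[sid]
            # The earliest element that can hold the new last element.
            for j in range(pos + offset, len(seq)):
                if need <= set(seq[j]):
                    new_proj.append((sid, j))
                    break
        patterns[new_alpha] = len(new_proj)
        mine(new_alpha, new_proj)

    mine((), [(sid, -1) for sid in range(len(db))])
    return patterns


def fmt(pattern):
    """Render a pattern in the text's notation, e.g. (ab)dc."""
    return "".join(e if len(e) == 1 else "(" + e + ")" for e in pattern)


# Assumed running-example database (not taken from this section's text).
DB = [
    ["a", "abc", "ac", "d", "cf"],
    ["ad", "c", "bc", "ae"],
    ["ef", "ab", "df", "c", "b"],
    ["e", "g", "af", "c", "b", "c"],
]
result = prefixspan(DB, 2)
print(len(result))
print(sorted(fmt(p) for p in result if p[0][0] == "a"))
```

Under the assumed database and min_sup = 2, the patterns whose first element starts with a can be compared against the first row of Table 2.2.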

Let us analyze the efficiency of the algorithm.

No candidate sequence needs to be generated by PrefixSpan. Unlike Apriori-like algorithms, PrefixSpan only grows longer sequential patterns from the shorter frequent ones. It neither generates nor tests any candidate sequence that does not exist in a projected database. Compared with GSP, which generates and tests a substantial number of candidate sequences, PrefixSpan searches a much smaller space.

Projected databases keep shrinking. It is easy to see that a projected database is smaller than the original one because only the suffix subsequences of a frequent prefix are projected into a projected database. In practice, the shrinking factors can be significant because (1) usually, only a small set of sequential patterns grow quite long in a sequence database, and thus the number of sequences in a projected database usually reduces substantially as the prefix grows; and (2) projection only takes the suffix portion with respect to a prefix.

The major cost of PrefixSpan is the construction of projected databases. In the worst case, PrefixSpan constructs a projected database for every sequential pattern. If there exist a good number of sequential patterns, the cost is non-trivial. Techniques for reducing the number of projected databases will be discussed in the next subsection.

Theoretically, the problem of mining the complete set of sequential patterns is #P-complete [33]. Therefore, it is impossible to have a polynomial time algorithm unless P = NP. Even if P = NP, it is still unclear whether a polynomial time algorithm exists.

Interestingly, we can show that the PrefixSpan algorithm is pseudo-polynomial. That is, the complexity of PrefixSpan is linear with respect to the number of sequential patterns, since each projection generates at least one sequential pattern, and the projection cost is upper bounded by the time of scanning the database once, and counting frequent items in the suffixes.
