The ConSGapMiner Algorithm

From the document Sequence Data Mining (pages 127-134)

6.2 Class-Characteristics Distinguishing Sequence Patterns

6.2.2 The ConSGapMiner Algorithm

We now consider the ConSGapMiner algorithm for solving the g-MDS mining problem. It has the following three main subroutines:

i) a tree-based depth-first search framework to find a set of distinguishing sequences (containing all minimal distinguishing sequences),

ii) bitset-based support and gap calculation, and

iii) post-processing (minimization).

The first routine computes a set of distinguishing sequence patterns that contains all minimal distinguishing sequences. We call such a set of distinguishing sequence patterns a g-SMDS set, as defined below.

Definition 6.2.4 (g-SMDS set) A Semi-Minimal Distinguishing Subsequence set with g maximum gap constraint, g-SMDS set for short, is a superset of the g-MDS set, such that elements in the g-SMDS set but not in the g-MDS set are sequence patterns that satisfy the frequency and infrequency conditions, but not necessarily the minimality condition.

A g-SMDS set may also contain some non-minimal distinguishing sequences, which will be removed in the minimization process in a batch manner. This choice was made because performing minimization whenever a new distinguishing sequence is generated is more expensive than batch-based minimization.

We now discuss each of the three routines in turn.

SMDS set generation

ConSGapMiner performs a depth-first search in a lexicographic sequence tree.

In this tree, each node contains a sequence S, a value for countpos(S, g) and a value for countneg(S, g). Each node is the max-prefix (see footnote 2) of each of its children. During the depth-first search, we extend the current node by a single item from the alphabet, according to a certain lexicographic order. For each newly generated node v, we calculate the supports of v's associated sequence from pos and from neg. Part of the lexicographic tree for mining the data of Table 6.2 is given in Figure 6.1. Observe that the branches of the lexicographic tree terminate at nodes whose countpos = 0.

Two basic pruning strategies can be used to reduce the size of the search space of the tree. These will be applied in the candidate generation process.

Non-Minimal Distinguishing Pruning: This strategy is based on the fact that any supersequence of a distinguishing sequence cannot be a minimal one. Suppose we encounter a node representing sequence S, where c is the last item in S, supppos(S, g) ≥ δ, and suppneg(S, g) ≤ α. Then i) we never need to extend S and ii) we never need to extend any of the sibling nodes of S by the item c. Such an extension would lead to a supersequence of S, which cannot be an MDS.

For Figure 6.1, since supppos(AACC) > 0 and suppneg(AACC) = 0, AACC must be distinguishing. Moreover, we know that in the subtree of its sibling AACB, suppneg(AACBC) must be 0, too. So AACBC cannot be an MDS.

Footnote 2: The max-prefix of a sequence S = s1...sn is s1...sn−1, formed by removing the last item in S. For example, ABC is the max-prefix of ABCD, but AB is not.

Fig. 6.1. Part of the lexicographic tree for mining Table 6.2

Max-Prefix Infrequency Pruning: Whenever a candidate is not frequent in pos, none of its descendants in the tree can be frequent. Thus, whenever we come across a sequence S at a node satisfying supppos(S, g) < δ, we do not need to extend this node any further. For example, in Figure 6.1, it is not necessary to extend AAB (which has support zero in pos), since no frequent sequence can be found in its subtree.

It is worth noting that this technique does not generalize to full Apriori-like pruning: "if a subsequence is infrequent in pos, then no supersequence of it can be frequent". Such a statement is not true, because the gap constraint is not class preserved [130]. This means that an infrequent sequence's supersequence is not necessarily infrequent, which increases the difficulty of the MDS mining problem. Indeed, extending an infrequent subsequence by appending will not lead to a frequent sequence, but extensions that insert items in the middle of the subsequence may lead to a frequent sequence.

An example situation is given next. For Figure 6.1, suppose δ = 1/3 and g = 1. Then AAB is not a frequent pattern because countpos(AAB, 1) = 0. But looking at AAB's sibling, the subtree rooted at AAC, we see that countpos(AACB, 1) = 1. So the supersequence AACB is frequent, but its subsequence AAB is infrequent.
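The counts in this example can be checked with a brute-force matcher. The following is a minimal sketch, not the bitset method used by ConSGapMiner. Table 6.2 is not reproduced in this excerpt, so the three positive sequences below are an assumption, chosen to be consistent with the bitsets and counts quoted in the text; the function names are also hypothetical.

```python
def occurs_with_gap(pattern, seq, g):
    """True if `pattern` is a subsequence of `seq` such that at most g
    other items lie between any two consecutive matched items."""
    if not pattern:
        return True
    # Positions in seq where the part of the pattern matched so far can end.
    ends = {i for i, item in enumerate(seq) if item == pattern[0]}
    for item in pattern[1:]:
        ends = {j
                for e in ends
                for j in range(e + 1, min(e + g + 2, len(seq)))
                if seq[j] == item}
        if not ends:
            return False
    return True


def count(pattern, dataset, g):
    """Number of sequences in `dataset` containing `pattern` under gap g."""
    return sum(occurs_with_gap(pattern, s, g) for s in dataset)


# Assumed positive class (Table 6.2 is not shown in this excerpt).
pos = ["CBAB", "AACCB", "BBAAC"]

print(count("AAB", pos, 1))   # 0: AAB is infrequent for delta = 1/3
print(count("AACB", pos, 1))  # 1: its supersequence AACB is frequent
```

Inserting C between the second A and the B shortens the gap that the B has to bridge, which is why the supersequence can match where the subsequence cannot.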

The SMDS set generation algorithm is given in Figure 6.2. The algorithm is called at the root of the search tree by SMDS Gen({}, g, I, δ, α), with SM initialized to the empty set.

Algorithm: SMDS Gen(S, g, I, δ, α);

Assumption: S is a sequence, g is the maximum gap constraint, I is the alphabet, δ is the minimal support for pos, α is the maximum support for neg;
CDS is a local variable storing the children distinguishing sequences of S;
SM is a global variable containing all computed distinguishing subsequences;

Method:
1: initialize CDS to {};
2: for each x ∈ I do
3:     let S' = S.x (appending x to S);
4:     if S' is not a supersequence of any sequence in CDS then
5:         supppos = Support Count(S', g, pos);
6:         suppneg = Support Count(S', g, neg);
7:         if (supppos ≥ δ AND suppneg ≤ α) then
8:             CDS = CDS ∪ {S'};
9:         elsif (supppos ≥ δ) then
10:            SMDS Gen(S', g, I, δ, α);
11: SM = SM ∪ CDS;

Fig. 6.2. The SMDS Gen routine
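The routine of Figure 6.2 can be sketched in Python as follows. This is a fairly literal transcription under several assumptions: support is computed by brute force rather than with bitsets, pos and neg are passed explicitly instead of being global, δ is assumed to be greater than 0 (otherwise the recursion would not terminate), and the five example sequences are assumed values consistent with the counts quoted in the text, since Table 6.2 is not reproduced here. All function names are hypothetical.

```python
def occurs(pattern, seq, g):
    """Gap-constrained containment: at most g items between matched items."""
    ends = {i for i, c in enumerate(seq) if c == pattern[0]}
    for c in pattern[1:]:
        ends = {j for e in ends
                for j in range(e + 1, min(e + g + 2, len(seq)))
                if seq[j] == c}
    return bool(ends)


def support(pattern, dataset, g):
    """Fraction of sequences in `dataset` that contain `pattern`."""
    return sum(occurs(pattern, s, g) for s in dataset) / len(dataset)


def is_subsequence(short, long):
    """Plain (unconstrained) subsequence test."""
    it = iter(long)
    return all(c in it for c in short)


def smds_gen(S, g, alphabet, delta, alpha, pos, neg, SM):
    CDS = []                                   # step 1
    for x in alphabet:                         # step 2
        S1 = S + x                             # step 3: S' = S.x
        # Step 4: skip S' if it contains a sequence already in CDS.
        if any(is_subsequence(d, S1) for d in CDS):
            continue
        sp = support(S1, pos, g)               # step 5
        sn = support(S1, neg, g)               # step 6
        if sp >= delta and sn <= alpha:        # steps 7-8: distinguishing
            CDS.append(S1)
        elif sp >= delta:                      # steps 9-10: frequent, extend
            smds_gen(S1, g, alphabet, delta, alpha, pos, neg, SM)
        # Otherwise S' is infrequent in pos and its subtree is pruned.
    SM.update(CDS)                             # step 11


# Assumed data (Table 6.2 is not shown in this excerpt).
pos = ["CBAB", "AACCB", "BBAAC"]
neg = ["BCAB", "ABACB"]
SM = set()
smds_gen("", 1, "ABC", 1 / 3, 0.0, pos, neg, SM)
print(sorted(SM, key=lambda p: (len(p), p)))
```

On this data the result contains both CC and its supersequence ACC, which illustrates why the output is only semi-minimal.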

Support Calculation and Gap Checking

For each newly generated candidate S, countpos(S, g) and countneg(S, g) must be computed. The main challenge comes in checking satisfaction of the gap constraint. A candidate can occur many times within a single sequence. A straightforward idea for gap checking would be to record the occurrences of each candidate in a separate list. When extending the candidate, a scan of the list determines whether or not the extension is legal, by checking, for each occurrence, whether the gap between the end position and the item being appended is no larger than the (maximum) gap constraint value. This idea becomes ineffective in situations with a small alphabet, a small support threshold and many long sequences to be checked, since the occurrence list becomes unmanageably large. Instead, a more efficient method for gap checking can be used, based on a bitset representation of subsequences and the use of boolean operations. This technique is described next.

Definition 6.2.5 (Bitset) A bitset is a sequence of binary bits. An n-bitset X contains n binary bits, and X[i] refers to the i-th bit of X.

A bitset can be used to describe how a sequence occurs within another sequence. Suppose we have two sequences S = s1...sn and S' = s'1...s'm, where m ≤ n. The occurrence(s) of S' in S can be represented by an n-bitset. This n-bitset BS' is defined as follows: if there exists a supersequence of S' of the form s1...si such that si = s'm (the last item of S'), then BS'[i] is set to 1; otherwise it is set to 0. For example, if S = BACACBCCB, the 9-bitset representing S' = AB is 000001001. This indicates how the subsequence AB can occur in BACACBCCB, with a '1' being turned on in each final position where the subsequence AB could be embedded. If S' is not a subsequence of S, then the bitset representing the occurrences of S' consists of all zeros.

For the special case where S' = s is a single item, BS'[i] is set to 1 if si = s. For S = BACACBCCB, the 9-bitset representing S' = C is 001010110.
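The definition can be turned into a few lines of Python to reproduce both examples. This is a sketch only: bitsets are represented as strings of '0' and '1' for readability, the function names are hypothetical, and no gap constraint is applied here, which matches the AB example above (the final B is marked although no A is adjacent to it).

```python
def is_subsequence(short, long):
    """Plain subsequence test: items of `short` appear in order in `long`."""
    it = iter(long)
    return all(c in it for c in short)


def occurrence_bitset(sub, seq):
    """Bit i is 1 when seq[i] equals the last item of `sub` and the prefix
    seq[0..i] contains `sub` as a subsequence (no gap constraint)."""
    return "".join(
        "1" if seq[i] == sub[-1] and is_subsequence(sub, seq[:i + 1]) else "0"
        for i in range(len(seq))
    )


print(occurrence_bitset("AB", "BACACBCCB"))   # 000001001
print(occurrence_bitset("C", "BACACBCCB"))    # 001010110
print(occurrence_bitset("AAA", "BACACBCCB"))  # 000000000 (not a subsequence)
```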

It will be necessary to compare a given subsequence against multiple other sequences. In this case, the subsequence will be associated with an array of bitsets, where the k-th bitset describes the occurrences of the subsequence in the k-th sequence.

Initial Bitset Construction: Before mining begins, it is necessary to construct the bitsets that describe how each item of the alphabet occurs in each sequence from the pos and neg datasets. So, each item x has associated with it an array of |pos| + |neg| bitsets; the number of bitsets in x's array which contain one or more 1s is equal to count(x, g).

For the data in Table 6.2, the bitset array for A contains 5 bitsets, namely [0010, 11000, 00110, 0010, 10100]. Moreover, countpos(A, g) = 3 and countneg(A, g) = 2.
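As a small illustration, the initial bitset array of an item can be built with one pass over each sequence. Table 6.2 is not reproduced in this excerpt, so the sequences below are an assumption chosen so that the array for A matches the one quoted above (only the last sequence, ABACB, is stated explicitly in the text); item_bitsets is a hypothetical helper name.

```python
def item_bitsets(item, sequences):
    """One bitset (as a '0'/'1' string) per sequence; bit i is 1 when the
    i-th item of that sequence equals `item`."""
    return ["".join("1" if c == item else "0" for c in s) for s in sequences]


# Assumed data, consistent with the bitset array quoted in the text.
pos = ["CBAB", "AACCB", "BBAAC"]
neg = ["BCAB", "ABACB"]

arr = item_bitsets("A", pos + neg)
print(arr)  # ['0010', '11000', '00110', '0010', '10100']

# The count of an item is the number of bitsets containing at least one 1.
count_pos = sum("1" in b for b in arr[:len(pos)])
count_neg = sum("1" in b for b in arr[len(pos):])
print(count_pos, count_neg)  # 3 2
```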

Bitset Checking: Each candidate sequence S in the lexicographic tree has a bitset array associated with it, which describes how S can occur in each of the |pos| + |neg| sequences. This bitset array can be used directly to compute countpos(S, g) and countneg(S, g) (i.e. countpos(S, g) is just the number of bitsets in the array that describe positive sequences and are not equal to zero). During mining, we extend a sequence S (at a node) to get a new candidate S', by appending some item x. Before computing countpos(S', g) and countneg(S', g), we first need to compute the bitset array for S'. The bitset array for S' is calculated from the bitset array for S and the bitset array for item x, in two stages.

Stage 1: Using the bitset array for S, we generate another array of corresponding mask bitsets. Each mask bitset captures all the valid extensions of S, with respect to the gap constraint, for a particular sequence in pos ∪ neg.

Suppose the maximum gap is g. For a given bitset b in the bitset array of S, we perform g + 1 successive right shifts by one position, filling the leftmost bits with 0s. This results in g + 1 intermediate bitsets, one for each stage of the shift.

By ORing together all the intermediate bitsets, we obtain the final mask bitset m derived from b. The mask bitset array for S consists of all such mask bitsets.

Example 6.2.5 Taking the last bitset 10100 in the previous example and setting g = 1, the process is:

    10100 >> 01010
    01010 >> 00101
    OR       01111

01111 is the mask bitset derived from bitset 10100.

Intuitively, a mask bitset m generated from a bitset b closes all 1s in b (by setting them to 0) and opens the following g + 1 bits (by setting them to 1). In this way, m can accept only 1s within a distance of g + 1 from the 1s in b.
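Stage 1 maps directly onto integer shift and OR operations. The sketch below assumes bitsets are written as strings whose leftmost character is position 1, so that parsing the string as a binary integer makes Python's >> operator move bits toward later positions; mask_bitset is a hypothetical name.

```python
def mask_bitset(bits, g):
    """OR together the g + 1 successive right shifts of `bits`."""
    n = len(bits)
    b = int(bits, 2)
    m = 0
    for k in range(1, g + 2):   # shift distances 1 .. g + 1
        m |= b >> k             # bits shifted past the last position vanish
    return format(m, f"0{n}b")  # pad back to n characters


print(mask_bitset("10100", 1))  # 01111, as in Example 6.2.5
```

With g = 0 only the position immediately after each 1 is opened, which corresponds to requiring contiguous matches.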

Stage 2: We use the mask bitset array for S and the bitset array for item x to calculate the bitset array for S', which is the result of appending x to S.

Consider a sequence X in pos ∪ neg, and suppose the mask bitset describing it is m and the bitset for item x is t. The bitset describing the occurrence of S' in X is equal to m AND t. If the bitset of the new candidate S' does not contain any 1, we can conclude that this candidate is not a subsequence of X with g-gap constraint.

Example 6.2.6 ANDing 01111 (the mask bitset for sequence A) from the last example with C's bitset 00010 gives us AC's bitset 00010.

Taking the last sequence in Table 6.2, ABACB, B's 5-bitset is 01001 and its mask 5-bitset is:

    01001 >> 00100
    00100 >> 00010
    OR       00110

So BB's bitset is: 00110 AND 01001 = 00000. This means BB is not a subsequence of ABACB with 1-gap constraint.
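Both calculations in this example can be reproduced by combining the two stages for a single sequence. This is a sketch under the same string-to-integer convention as the text's examples (leftmost character is position 1); extend_bitset is a hypothetical name.

```python
def extend_bitset(prefix_bits, item_bits, g):
    """Bitset of S' = S.x in one sequence, from the bitset of S and the
    bitset of item x: build the mask (stage 1), then AND it (stage 2)."""
    n = len(prefix_bits)
    b = int(prefix_bits, 2)
    mask = 0
    for k in range(1, g + 2):
        mask |= b >> k
    return format(mask & int(item_bits, 2), f"0{n}b")


# Sequence ABACB: A = 10100, B = 01001, C = 00010.
print(extend_bitset("10100", "00010", 1))  # 00010 -> AC occurs, ending at position 4
print(extend_bitset("01001", "01001", 1))  # 00000 -> BB does not occur with g = 1
```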

Example 6.2.7 Figure 6.3 shows the process of getting the bitset array for BB from that for B. The two tables on either side of the arrow ⇒ show how the masks for B are obtained from the bitset array for B. The & operation is then applied between the bitset array for B and the array of masks, yielding the bitset array for BB.

From the figure we can see that countpos(BB, 1) = 2 and countneg(BB, 1) = 0.

Fig. 6.3. The generation of BB's bitset array.

The task of computing bitset arrays can be done very efficiently. Modern computer architectures have very fast implementations of shift operations and logical operations. Since the maximum gaps are usually small (e.g. less than 20), the total number of right shifts and logical operations needed is not too large. Consequently, calculating supppos(S, g) and suppneg(S, g) can be done extremely quickly. The algorithm for support counting is given in Figure 6.4.
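A minimal Python sketch of the support-counting step follows, in the spirit of the Support Count routine of Figure 6.4. It assumes bitsets are stored as Python integers (one per sequence) and that the arrays for the prefix S and for the appended item x are already available. The sequences are assumed values consistent with the counts quoted in the text, since Table 6.2 is not reproduced in this excerpt, and the function names are hypothetical.

```python
def item_barray(item, sequences):
    """Bitset array of a single item: one integer per sequence, with the
    first position of the sequence stored in the most significant bit."""
    return [int("".join("1" if c == item else "0" for c in s), 2)
            for s in sequences]


def support_count(barray_prefix, barray_item, g):
    """Return (support, bitset array of S') for S' = S.x, given the bitset
    arrays of the prefix S and of the appended item x on one dataset."""
    barray_new = []
    for b, t in zip(barray_prefix, barray_item):
        mask = 0
        for k in range(1, g + 2):      # stage 1: build the mask
            mask |= b >> k
        barray_new.append(mask & t)    # stage 2: AND with the item bitset
    count = sum(1 for b in barray_new if b != 0)
    return count / len(barray_new), barray_new


# Assumed data (Table 6.2 is not shown in this excerpt).
pos = ["CBAB", "AACCB", "BBAAC"]
neg = ["BCAB", "ABACB"]

b_pos, b_neg = item_barray("B", pos), item_barray("B", neg)
supp_pos, bb_pos = support_count(b_pos, b_pos, 1)   # BB on pos
supp_neg, bb_neg = support_count(b_neg, b_neg, 1)   # BB on neg

print(sum(b != 0 for b in bb_pos))  # 2 positive sequences contain BB
print(sum(b != 0 for b in bb_neg))  # 0 negative sequences contain BB
```

Returning the new bitset array alongside the support lets a caller reuse it when the candidate is extended further, so each extension costs only g + 1 shifts, g + 1 ORs and one AND per sequence.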

Algorithm: Support Count(S', g, D);

Assumption: g is the maximum gap, S is the max-prefix of S', the bitset array BARRAY_S for S is available, the bitset array BARRAY_x for the final element x of S' is available, D is the dataset;

Output: suppD(S', g) and BARRAY_S' (the bitset array for S');

Method:
1: generate the mask bitsets Mask_S from BARRAY_S for g (stage 1 above);
2: do bitwise AND of Mask_S and BARRAY_x to get BARRAY_S' (stage 2 above);
3: let count be the number of bitsets in BARRAY_S' which contain 1;
4: return suppD(S', g) = count/|D| and BARRAY_S';

Fig. 6.4. Support Count(S', g, D): calculate suppD(S', g)

Minimization

We have already seen how non-minimal distinguishing pruning eliminates non-minimal candidates during tree expansion. However, the pattern set returned by the SMDS Gen routine (Figure 6.2) is only semi-minimal, i.e. an SMDS set. For example, in Figure 6.1, we will get ACC, which is a supersequence of the distinguishing sequence CC. Thus, in order to get the g-MDS set, a post-processing minimization step is needed.

A naïve idea for removing non-minimal sequences is to check each against all the others, removing it if it is a supersequence of at least one other. For n sequences, this leads to an O(n^2) algorithm, which is expensive if n is large.

Two ideas can be used to make it more efficient. Firstly, observe that it is not necessary to check if a sequence is a supersequence of any longer sequence.

To take advantage of this, we cluster the sequence patterns according to their length, when they are output during mining.

Secondly, we use a prefix tree for carrying out minimization. Sequences are inserted into the tree in ascending order of length. Each sequence S to be inserted into the tree is compared against the sequences already there. This is easily done by stepping through each prefix of S, at each stage identifying the nodes of the tree which are subsequences of the prefix so far. The process terminates when a leaf node or the end of S is reached. If a subsequence of S in the tree is found, then S is discarded. Otherwise, S must be minimal and it is inserted.

Compared to the naïve O(n^2) method, using a prefix tree can help avoid some duplicate comparisons, particularly in situations where there is substantial similarity between the sequential patterns, since each sequential pattern prefix is stored only once. For example, consider two shorter patterns P1 = ABCC, P2 = ABCF and a longer pattern P3 = ABCDE. To check whether P3 is minimal in the naïve way, we compare P3 with P1 itemwise for 5 comparisons and with P2 itemwise for 5 comparisons to conclude that P3 is minimal. By using the prefix tree, ABC is built once and compared once, which takes 3 itemwise comparisons, and then another 2 comparisons to check the other two items D and E. Finally we know that P3 is minimal, because no leaf is found. This takes 5 itemwise comparisons in total, rather than the 10 comparisons of the naïve way.
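One possible way to realise the two ideas is sketched below: patterns are processed in ascending order of length, and the minimal ones found so far are stored in a prefix tree (here nested dictionaries) that is searched for any stored pattern that is a subsequence of the new one. The traversal details, the use of None as an end-of-pattern marker and the function names are assumptions of this sketch, not taken from the text.

```python
def contains_stored_subsequence(node, seq, start=0):
    """True if some pattern stored below `node` is a subsequence of
    seq[start:]. Matching each item at its earliest position is sufficient."""
    if None in node:                   # a stored pattern ends at this node
        return True
    for item, child in node.items():
        j = seq.find(item, start)
        if j != -1 and contains_stored_subsequence(child, seq, j + 1):
            return True
    return False


def minimize(patterns):
    """Keep only patterns that have no other pattern as a subsequence."""
    root, minimal = {}, []
    # Shorter patterns first: a pattern can only contain shorter ones.
    for p in sorted(set(patterns), key=lambda q: (len(q), q)):
        if contains_stored_subsequence(root, p):
            continue                   # p is a supersequence: discard it
        node = root
        for item in p:                 # shared prefixes are stored once
            node = node.setdefault(item, {})
        node[None] = True              # mark the end of a stored pattern
        minimal.append(p)
    return minimal


print(minimize(["ACC", "CC", "BB", "AACC", "ABCC", "ABCF", "ABCDE"]))
# ['BB', 'CC', 'ABCF', 'ABCDE']
```

In this run ACC, AACC and ABCC are discarded because they contain CC, while ABCDE is kept because the search along the stored branch A, B, C stops without reaching the end of a stored pattern.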

The complete ConSGapMiner algorithm is provided in Figure 6.5.

Algorithm: ConSGapMiner(pos, neg, g, δ, α)

Assumption: I is the alphabet list, g is the maximum gap constraint, δ is the minimal support in pos, α is the maximal support in neg, a global set SMDS is used to contain the patterns generated by SMDS Gen;

Output: g-MDS set MDS;

Method:
1: SMDS ← {};
2: set S to the empty sequence;
3: SMDS Gen(S, g, I, δ, α);
4: let MDS be the result of minimizing SMDS as described above;
5: return MDS;

Fig. 6.5. The ConSGapMiner algorithm

6.2.3 Extending ConSGapMiner: Minimum Gap Constraints
