• Aucun résultat trouvé

Problem Definition

Dans le document Sequence Data Mining (Page 101-104)

5.1 Mining Frequent Closed Partial Orders

5.1.1 Problem Definition

Apartial orderis a binary relation that is reflexive, antisymmetric, and tran-sitive. A total order (also called linear order) is a partial orderR such that for any two items xandy, ifx=y then either R(x, y) orR(y, x) holds.

A partial orderR can be expressed as a directed acyclic graph (DAG for short): the items are the vertices in the graph and x→ y is an edge if and only if (x, y)∈Randx=y. We also write an edgex→yas (x, y) orxy. For example, Figure 5.2(a) shows a partial orderR, which has 13 edges.

a

b c

d

e f

(a) A partial order (b) The transitive reduction a

b c

d

e f

Fig. 5.2. A partial order and its transitive reduction.

Since a partial order is transitive, some edges can be derived from the others and thus are redundant. For example, in Figure 5.2(a), edgea→dis redundant given edgesa→bandb→d. Generally, an edgex→yisredundant if there is a path from xto y that does not contain the edge. For a partial order R, thetransitive reduction of R can be drawn in aHasse diagram: for (x, y)∈Randx=y,xis positioned higher thany; edgex→yis drawn if and only if the edge is not redundant. Figure 5.2(b) shows the transitive reduction of the same partial orderRin Figure 5.2(a). The transitive reduction has only 6 edges. For an orderR, the transitive reduction may have much fewer edges.

In this chapter, we draw a partial order in a Hasse diagram, that is, its transitive reduction, and omit the isolated vertices. For example, Figure 5.3 shows four partial ordersR1,R2,R3 andR4, andR1is further a total order.

LetV be a set of items, which serves as the domain of our string database.

A string defines a global order on a subset of V. Here, we focus on strings

d

instead of general sequences, and assume that each item appears in a string at most once, but not necessarily every item appears in a string.

A string can be written ass=x1· · ·xl, wherex1, . . . ,xl∈V.lis called the lengthof strings, i.e.,len(s) =l. For stringss=x1· · ·xl ands =y1· · ·ym, s is called a super-string of s and s a sub-string of s if (1) m l and (2) there exist integers 1i1<· · ·< iml such thatxij =yj (1jm). We also say scontainss. For a string databaseSDB, thesupport of a strings, denoted bysup(s), is the number of strings inSDB that are super-strings of s.

The total order defined by stringscan be written in thetransitive closure of s, denoted by C(s) = {(xi, xj)|1 i < j l}. Please note that, in the transitive closure, we omit the trivial pairs (xi, xi). For example, for string s = abcd, len(s) = 4. The transitive closure is C(s) = {(a, b), (a, c), (a, d), (b, c), (b, d), (c, d)}. Here, we omit the trivial pairs (a, a), (b, b), (c, c), and (d, d).

Theorder containment relationis defined as, for two partial ordersR1and R2, ifR1⊂R2, thenR1is said to beweakerthanR2andR2 isstrongerthan R1. By intuition, a partially ordered set (or poset for short) satisfyingR2will also satisfyR1. For example, in Figure 5.3,R4⊂R3⊂R2⊂R1. Please note that R4 covers fewer items than the other three partial orders. Trivially, we can add the missing items into the DAG as isolated vertices so that every DAG covers the same set of items. To keep the DAG simple and easy to read, we omit such isolated items.

Frequent Closed Partial Orders (FCPO)

Astring databaseSDBis a multiset of strings. For a partial orderR, a strings is said tosupportRifR⊆ C(s). Thesupport ofRinSDB, denoted bysup(R),

5.1 Mining Frequent Closed Partial Orders 93 is the number of strings in SDB that supportR. Given a minimum support thresholdmin sup, a partial orderRis calledfrequent ifsup(R)min sup.

Following the related definitions and the order containment relation, we have the following result.

Property 5.2 (Anti-monotonicity of frequent partial orders). For a string databaseSDB and partial orders R andR such that R ⊂R, it is the case that sup(R)sup(R). Therefore, ifR is frequent, then R is also frequent.

To avoid the triviality, instead of reporting all frequent partial orders, we can mine the representative ones only.

Example 5.3 (Frequent closed partial orders). Let us consider string database DBin Table 5.2 again. The four sequential patterns discussed in Example 5.1 can be regarded as frequent partial orders which are supported by strings 1, 2 and 4. As discussed before, given that the partial orderR in Figure 5.1 is also supported by strings 1, 2 and 4, the four sequential patterns as frequent partial orders are redundant.

There does not exist another partial orderRsuch thatRis stronger than Rin Figure 5.1 and is also supported by strings 1, 2 and 4. In other words,R is the strongest one among all frequent partial orders supported by strings 1, 2 and 4. Thus, the partial order Ris not redundant and can be used as the representative of the frequent partial orders supported by strings 1, 2 and 4.

Technically,Ris a frequent closed partial order.

A partial order R is closed in a string databaseSDB if there exists no partial order R R such that sup(R) = sup(R). A partial order R is a frequent closed partial order if it is both frequent and closed.

Problem Definition.The problem of mining frequent closed partial orders from strings is to find the complete set of frequent closed partial orders in a given string databaseSDB with respect to a minimum support threshold min sup.

Various Types of Frequent Patterns from Strings

For a string database SDB and a minimum support threshold min sup, in addition to frequent closed partial orders, we can mine some other types of frequent patterns as follows.

Frequent itemsets [2] and frequent closed itemsets [84] . If the ordering information in a string is ignored, a string can be treated as a set of items.

For a set of items X ⊆I,sup(X) is the number of strings in SDB in which X appears.X is afrequent itemsetifsup(X)min sup. A frequent itemset X is afrequent closed itemset if there exists noX ⊃X such thatsup(X) = sup(X).

Sequential patterns [3] and closed sequential patterns [125]. For string s, sup(s) is the number of strings inSDB which containsas a substring.sis a sequential patternifsup(s)min sup. In other words, a sequential pattern is a frequent total order on a subset of items. A sequential patternsis aclosed sequential pattern if there exists no proper super-sequence s of s such that sup(s) =sup(s).

Graph patterns [53] and closed graph patterns [124]. Since a string defines a total order, its transitive closure can be viewed as a DAG. For a DAG G, sup(G) is the number of graphs in which G is an embedded subgraph. G is a frequent graph pattern if sup(G) min sup. A frequent graph pattern G is a frequent closed graph pattern if there exists no graph G such that sup(G) =sup(G) andGis an embedded subgraph ofG.

Then, what are the relationships among the above types of frequent pat-terns? We have the following results based on the related definitions.

Corollary 5.4 (FPO, frequent itemsets and sequential patterns).The set of items in a frequent partial order is a frequent itemset. Moreover, if R is a frequent partial order, then, every path sin R is a sequential pattern.

In a directed acyclic graph (DAG), a vertexv is asinkif no edge leavesv.

A vertex v is asource if no edge entersv. The relationships among frequent closed partial orders, closed sequential patterns and transitive closure graph patterns are described in the following theorems. We leave the proof to the interested readers as an exercise.

Theorem 5.5 (FCPO and closed sequential patterns). Let R be a fre-quent closed partial order, and s1, . . . , sk be all the paths in R’s transitive closure graph such that each path is from a source to a sink. Then, a strings supportsR if and only if it simultaneously supports s1, . . . , sk.

Theorem 5.6 (CPO and transitive closure graph patterns).A partial orderRis a frequent (closed) partial order in a string database if and only if it is a frequent (closed) graph pattern in the corresponding database of transitive closures of strings.

5.1.2 How Is Frequent Closed Partial Order Mining Different from

Dans le document Sequence Data Mining (Page 101-104)