Unexpected Patterns and Rules - tel-00431117, version 1

In recent years, unexpectedness measures have been widely studied in various approaches to pattern and rule discovery.

Liu and Hsu studied the unexpected structures of discovered rules in [LH96]. In the proposed approach, the existing rules (denoted as E) from prior knowledge are regarded as fuzzy rules by using fuzzy set theory, and the newly discovered rules (denoted as B) are matched against the existing fuzzy rules in the post-analysis process. A rule consists of the condition and the consequent. Given two rules Bi and Ej, if the conditional parts of Bi and Ej are similar but the consequents of the two rules are quite different, then Bi is considered to have an unexpected consequent; the inverse case is considered an unexpected condition. The similarity computed in the matching is based on attribute names and values. The same techniques are extended to find unexpected patterns in [LHML99].

Moreover, in [LMY01], Liu et al. investigated the problem of finding unexpected information in the context of Web content mining. The proposed approach aims to discover the Web pages that are relevant but unknown to the user (i.e., on a competitor Web site) with respect to the existing knowledge of the user (i.e., the user Web site). The vector space model with TF-IDF (Term Frequency - Inverse Document Frequency) weights is used to compare two Web sites: the approach first finds the corresponding pages between the two Web sites by counting the keywords in the pages, then compares the term weights in both documents in order to obtain unexpected terms, and finally extracts unexpected pages and unexpected concepts by ranking the discovered unexpected terms.

tel-00431117, version 1 - 10 Nov 2009

Suzuki et al. systematically studied exception rules in the context of association rule mining [SS96, Suz96, Suz97, SK98, HLSL00, SZ05, Suz06]. An association rule can be classified into two categories: a common sense rule, which is a description of a regularity for numerous objects, and an exception rule, which represents, for a relatively small number of objects, a regularity different from a common sense rule [SS96, Suz96]. In [SS96, Suz96, Suz97, SK98], exception rules are considered with respect to common sense rules within the rule pair r(µ, ν) defined as follows:

$$ r(\mu, \nu) = \begin{cases} A_\mu \Rightarrow c \\ A_\mu \wedge B_\nu \Rightarrow c' \end{cases}, $$

where Aµ, Bν are itemsets and c, c′ are items. Following the notions presented in [SK98], Aµ ⇒ c, Aµ ∧ Bν ⇒ c′, and Bν ⇒ c′ are respectively called a common sense rule, an exception rule, and a reference rule. Such a rule pair can be interpreted as “if Aµ then c, however if Aµ and Bν then c′”. The discovery of rule pairs r(µ, ν) evolved from [SS96] to [SK98]. In [SS96], an approach based on the average compressed entropy (ACE) is proposed, where the average compressed entropy of c and Aµ is defined as

$$ \mathrm{ACE}(c, A_\mu) = p(c, A_\mu) \log_2 \frac{p(c \mid A_\mu)}{p(c)} + p(\overline{c}, A_\mu) \log_2 \frac{p(\overline{c} \mid A_\mu)}{p(\overline{c})}, $$

where $\overline{c}$ denotes the negation of c, and the interestingness measure of an exception rule is defined as the average compressed entropy product (ACEP) of the rule pair:

$$ \mathrm{ACEP}(c, A_\mu, c', B_\nu) = \mathrm{ACE}(c, A_\mu) \cdot \mathrm{ACE}(c', A_\mu \wedge B_\nu). $$
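As an illustration only, the two measures can be sketched in Python. The sketch assumes that the second term of ACE ranges over the negation of c (the J-measure form), that probabilities are supplied directly as numbers, and that a term with zero joint probability contributes nothing; the function and parameter names are hypothetical and do not come from [SS96].

```python
from math import log2


def ace(p_c_and_a: float, p_a: float, p_c: float) -> float:
    """Average compressed entropy of c and A from p(c, A), p(A) and p(c).

    Assumption: the second term is taken over the negation of c, and a
    term whose joint probability is zero is treated as 0 (0 * log 0 := 0).
    """
    total = 0.0
    # (joint probability with A, prior probability) for c and for not-c
    terms = ((p_c_and_a, p_c), (p_a - p_c_and_a, 1.0 - p_c))
    for joint, prior in terms:
        if joint > 0.0:
            conditional = joint / p_a  # p(. | A)
            total += joint * log2(conditional / prior)
    return total


def acep(ace_common: float, ace_exception: float) -> float:
    """ACEP of a rule pair: product of ACE(c, A) and ACE(c', A and B)."""
    return ace_common * ace_exception
```

Under these assumptions, ace returns 0 when c and Aµ are independent, which matches the intuition that such a rule carries no information.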

With these definitions, it is not difficult to see that an exception rule holds for a relatively small number of examples (i.e., it has low support) in a database. In order to reduce the number of potentially interesting exception rules, the notions of reliable exception rules and surprising exception rules are addressed in [Suz97] and [SK98], based on probabilistic and statistical models. The notion of rule pair is extended to rule triplet in [SZ05, Suz06], where a negative rule is regarded as the reference rule. Following the above form of a rule pair, a rule triplet is represented as

$$ (A_\mu \Rightarrow c,\; A_\mu \wedge B_\nu \Rightarrow c',\; B_\nu \not\Rightarrow c'), $$

where the rule Bν ⇏ c′ is the reference rule. In summary, the approaches to discovering exception rules proposed by Suzuki et al. are probabilistic, and their performance depends on the selection of c. The advantage of Suzuki’s approaches is that they can discover highly unexpected patterns, since they also discover the common sense rules.

In [DL98], Dong and Li proposed neighborhood-based interestingness for association rules, which is based on the distance between rules and the neighborhoods of rules. The neighborhood-based interestingness of a rule is defined in terms of the pattern of fluctuation of confidences or the density of discovered rules in some of its neighborhoods. The distance between rules is defined syntactically: given two rules r₁ = X₁ ⇒ Y₁ and r₂ = X₂ ⇒ Y₂, the syntax distance between r₁ and r₂ is defined as

$$ dist_{iset}(r_1, r_2) = \delta_1 |(X_1 Y_1) \ominus (X_2 Y_2)| + \delta_2 |X_1 \ominus X_2| + \delta_3 |Y_1 \ominus Y_2|, $$

where X ⊖ Y denotes the symmetric difference between two itemsets X and Y (i.e., (X − Y) ∪ (Y − X)), and δ1, δ2, δ3 are non-negative real numbers that reflect users’ preferences for the contributions of itemsets. The k-neighborhood (k > 0) of a rule r, denoted as N(r, k), is therefore defined as the set

$$ N(r, k) = \{ r' \mid dist_{iset}(r, r') \le k \}. $$

Suppose M is a set of discovered rules and r ∈ M is a reference rule. The average confidence of the k-neighborhood of r is defined as the average confidence avg(r, k) of the rules in the set M ∩ N(r, k) − {r}; the standard deviation of the k-neighborhood of r is defined as the standard deviation std(r, k) of the confidences of the rules in the set M ∩ N(r, k) − {r}. If the value | |conf(r) − avg(r, k)| − std(r, k) | is larger than a given threshold, then the rule r is said to be interesting with unexpected confidence in its k-neighborhood.
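The syntax distance and the unexpected-confidence test described above can be sketched as follows. This is a minimal illustration, not the algorithm of [DL98]: rules are represented as pairs of frozensets, the rule set M is assumed to be given as a dictionary from rules to confidences, the population standard deviation is used (the text does not say which variant is meant), and all function names are hypothetical.

```python
from statistics import mean, pstdev


def dist_iset(r1, r2, d1=1.0, d2=1.0, d3=1.0):
    """Syntax distance between two rules given as (X, Y) pairs of frozensets."""
    (x1, y1), (x2, y2) = r1, r2
    # The ^ operator on sets is the symmetric difference.
    return (d1 * len((x1 | y1) ^ (x2 | y2))
            + d2 * len(x1 ^ x2)
            + d3 * len(y1 ^ y2))


def neighborhood_stats(r, conf, k):
    """Return (avg, std) of confidences over (M & N(r, k)) - {r}, or None.

    `conf` maps every rule of M to its confidence (hypothetical format).
    """
    values = [c for other, c in conf.items()
              if other != r and dist_iset(r, other) <= k]
    if not values:
        return None
    return mean(values), pstdev(values)  # population std: an assumption


def has_unexpected_confidence(r, conf, k, threshold):
    """Test | |conf(r) - avg(r, k)| - std(r, k) | > threshold."""
    stats = neighborhood_stats(r, conf, k)
    if stats is None:
        return False  # empty neighborhood: nothing to compare against
    avg, std = stats
    return abs(abs(conf[r] - avg) - std) > threshold
```

With all δ weights set to 1, two rules that differ by a single item in the antecedent are at distance 2, because the item counts once in the combined itemset and once in the antecedent.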

Padmanabhan and Tuzhilin proposed a semantics-based belief-driven approach [PT98, PT00, PT02, PT06] to discover unexpected patterns (rules¹) in the context of association rules. In [PT98], Padmanabhan and Tuzhilin first proposed that a rule A ⇒ B is unexpected with respect to a belief X ⇒ Y in a given database D if: (1) B ∧ Y ⊨ FALSE, which means that the two patterns B and Y logically contradict each other (i.e., ∄ R in D such that B ∪ Y ⊆ R); (2) A ∧ X holds on a statistically large subset of tuples in D (e.g., with respect to a given minimum support, the pattern A ∪ X is frequent in the database D); (3) the rule A ∧ X ⇒ B holds and the rule A ∧ X ⇒ Y does not hold (e.g., the support and confidence of A ∧ X ⇒ B satisfy the given minimum support and minimum confidence but those of A ∧ X ⇒ Y do not). For example, given a belief professional ⇒ weekend (professionals shopped on weekends), if the rule (professional, December) ⇒ weekday (professionals shopped on weekdays in December) holds but the rule (professional, December) ⇒ weekend (professionals shopped on weekends in December) does not, then the rule December ⇒ weekday is unexpected relative to the belief professional ⇒ weekend. Notice that in this approach, the logical contradiction between patterns is defined by domain experts.

¹ In [PT98, PT00, PT02, PT06], Padmanabhan and Tuzhilin use the terms pattern and rule interchangeably.
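The three conditions can be checked mechanically on a small transaction database. The following is a minimal sketch under stated assumptions: transactions and rule parts are sets of items, "holds" is read as meeting minimum support and minimum confidence, and contradiction is approximated by the data-based reading (no transaction contains both B and Y) rather than by expert-defined constraints. All names are hypothetical and not taken from [PT98].

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(body, head, transactions):
    """Confidence of body => head; 0.0 if the body never occurs."""
    body_count = sum(body <= t for t in transactions)
    if body_count == 0:
        return 0.0
    return sum((body | head) <= t for t in transactions) / body_count


def holds(body, head, transactions, min_sup, min_conf):
    """A rule holds if it meets both the support and confidence thresholds."""
    return (support(body | head, transactions) >= min_sup
            and confidence(body, head, transactions) >= min_conf)


def is_unexpected(rule, belief, transactions, min_sup, min_conf):
    """Check conditions (1)-(3) for rule A => B against belief X => Y."""
    (a, b), (x, y) = rule, belief
    # (1) B and Y contradict: no transaction contains both (assumption).
    contradictory = not any((b | y) <= t for t in transactions)
    # (2) A and X together are frequent.
    frequent = support(a | x, transactions) >= min_sup
    # (3) A and X => B holds, while A and X => Y does not.
    return (contradictory and frequent
            and holds(a | x, b, transactions, min_sup, min_conf)
            and not holds(a | x, y, transactions, min_sup, min_conf))
```

In a real application, condition (1) would more likely be supplied as a list of mutually exclusive item pairs by a domain expert, since absence of co-occurrence in the data does not by itself prove a logical contradiction.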

In [PT00, PT06], the minimal set of unexpected patterns (rules) is addressed, and the refinement of beliefs by discovered unexpected patterns is further proposed in [PT02]. The notion of minimal set of unexpected patterns is defined based on the monotonicity assumption ⊨M of rules, that is, rule (A ⇒ B) ⊨M (C ⇒ D) if A ⊆ C and B = D. Then, given a rule set R, the set R′ is the minimal set of R if and only if the following conditions hold: (1) R′ ⊆ R; (2) ∀r ∈ R, ∃r′ ∈ R′ such that r′ ⊨M r; (3) ∀r′₁, r′₂ ∈ R′ with r′₁ ≠ r′₂, r′₁ ⊭M r′₂. The computational task is therefore to discover the minimal set of rules among all discovered unexpected patterns (rules).
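For a finite rule set, the minimal set can be obtained by discarding every rule that is entailed by a different rule. The sketch below illustrates this under the assumption that rules are pairs of frozensets (antecedent, consequent); it is a naive quadratic filter written for clarity, not the discovery algorithm of [PT00, PT06], and the function names are hypothetical.

```python
def entails_m(general, specific):
    """(A => B) entails (C => D) under monotonicity if A is a subset of C and B == D."""
    (a, b), (c, d) = general, specific
    return a <= c and b == d


def minimal_set(rules):
    """Keep the rules that are not entailed by any other rule of the set."""
    rules = set(rules)
    return {r for r in rules
            if not any(other != r and entails_m(other, r) for other in rules)}
```

Every discarded rule is still covered by some kept rule, because an entailing rule with the same consequent has a strictly smaller antecedent, so chains of entailment always end in a kept rule.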

In [Spi99], Spiliopoulou presented a belief-driven approach to find unexpected sequence rules based on the notion of generalized sequences². A generalized sequence (or g-sequence) is a sequence of the form g1 ∗ g2 ∗ … ∗ gn, where g1, g2, …, gn are elements contained in the sequence (e.g., itemsets) and ∗ is a wild-card (i.e., unknown elements). A sequence rule is then built by splitting a given g-sequence into two adjacent parts: premise (lhs) and conclusion (rhs), denoted as lhs ↪ rhs.

Further, a belief over g-sequences is defined as a tuple ⟨lhs, rhs, CL, C⟩, where lhs ↪ rhs is a sequence rule, CL is a conjunction of constraints on the frequency of lhs, and C is a conjunction of constraints on the frequency of elements in lhs and rhs. For example, a belief in the above form can be given as ⟨a ∗ b, c, CL, C⟩ with CL = (support(a ∗ b) ≥ 0.4 ∧ confidence(a, b) ≥ 0.8) and C = (confidence(a ∗ b, c) ≥ 0.9). That is, the belief proposed by Spiliopoulou is based on the statistical frequency of the elements contained in a g-sequence with respect to a predefined structure (e.g., a ∗ b ↪ c). Let B be a collection of predefined beliefs and r = lhs ↪ rhs be a sequence rule discovered in a given database; then r is expected if there exists a belief b = ⟨lhs′, rhs, CL, C⟩ ∈ B such that r and b can be matched by verifying CL and C; otherwise the rule r is unexpected.

In [WJL03], Wang et al. studied unexpected association rules with respect to the value of attributes. In this approach, a rule is addressed in the form

$$ A_1 = a_1, A_2 = a_2, \ldots, A_k = a_k \Rightarrow C = c, $$

where Ai is a non-target attribute, ai is a domain value for Ai, C is the target attribute, and c is a domain value for C. A1 = a1, A2 = a2, …, Ak = ak is called the body and C = c is called the head of the rule, respectively denoted as b(r) and h(r) for a given rule r. Given a database, a tuple matches a rule r if b(r) holds on the tuple; a tuple satisfies a rule r if both b(r) and h(r) hold on the tuple. Given a rule r and a data tuple t, the violation of r by t, denoted as v(t, r), is defined as

$$ v(t, r) = \begin{cases} \overline{hm}(t, r) \times bm(t, r) & \text{if } bm(t, r) \ge \sigma \wedge \overline{hm}(t, r) \ge \sigma' \\ 0 & \text{otherwise,} \end{cases} $$

where bm(t, r) measures the body match degree and hm(t, r) measures the head match degree between t and r, $\overline{hm}(t, r) = 1 - hm(t, r)$, and σ, σ′ are given thresholds. Further, the user knowledge, denoted as K, is defined from the rules with respect to the preference model, which specifies the user’s knowledge about how to apply knowledge rules to a given scenario or a tuple.

² Notice that the notion of generalized sequences proposed by Spiliopoulou is different from the same term that we will present in Chapter 6: Generalizations in Unexpected Sequence Discovery.

Thus, the violation vK(t) of user knowledge K by a data tuple t is defined as

$$ v_K(t) = agg(\{ v(t, r) \mid r \in C_t \}), $$

where Ct is the covering knowledge of t (i.e., the set of rules that represent the user knowledge on the data containing t) and agg is a well-behaved aggregate function (i.e., min(V) ≤ agg(V) ≤ max(V) on a vector V of attribute-value pairs). Therefore, given a database D and a discovered rule r, let S be the set of all tuples that satisfy the rule r; the unexpectedness support of the rule r is defined as

$$ Usup(r) = \frac{\sum \{ v_K(t) \mid t \in S \}}{|S|}. $$

Wang et al. further defined the unexpectedness confidence and the unexpectedness of a rule r as

$$ Uconf(r) = \frac{Usup(r)}{|\{ t \in D \mid t \text{ satisfies } b(r) \}|} $$

and

$$ Unexp(r) = \frac{Usup(r)}{|\{ t \in D \mid t \text{ satisfies } r \}|}. $$

Hence, the problem of mining unexpected rules is to find all rules that satisfy user-defined thresholds on unexpectedness support, unexpectedness confidence, and unexpectedness.
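The violation and unexpectedness support defined above can be illustrated with a small sketch. Several simplifications are assumed here and are not taken from [WJL03]: the body match degree is the fraction of matching attribute-value pairs, the head match degree is 1 for an exact match and 0 otherwise, the head factor in v(t, r) is read as the head mismatch 1 − hm(t, r), the covering knowledge Ct is passed in explicitly as a list of rules, and agg defaults to max. All names are hypothetical.

```python
def body_match(t, rule):
    """Hypothetical bm(t, r): fraction of body pairs that t matches."""
    body, _ = rule  # body is assumed to be a non-empty dict
    return sum(t.get(a) == v for a, v in body.items()) / len(body)


def head_match(t, rule):
    """Hypothetical hm(t, r): 1.0 on an exact head match, else 0.0."""
    _, (attr, value) = rule
    return 1.0 if t.get(attr) == value else 0.0


def violation(t, rule, sigma, sigma_prime):
    """v(t, r): head mismatch times body match, gated by both thresholds."""
    bm = body_match(t, rule)
    hm_bar = 1.0 - head_match(t, rule)  # head mismatch (assumed reading)
    return hm_bar * bm if bm >= sigma and hm_bar >= sigma_prime else 0.0


def knowledge_violation(t, covering_rules, sigma, sigma_prime, agg=max):
    """v_K(t): aggregate of v(t, r) over the covering knowledge of t."""
    values = [violation(t, r, sigma, sigma_prime) for r in covering_rules]
    return agg(values) if values else 0.0


def unexpectedness_support(satisfying, covering_rules, sigma, sigma_prime):
    """Usup as stated in the text: mean of v_K(t) over the tuples in S."""
    if not satisfying:
        return 0.0
    total = sum(knowledge_violation(t, covering_rules, sigma, sigma_prime)
                for t in satisfying)
    return total / len(satisfying)
```

Using max as the aggregate is one well-behaved choice; a mean would be another, and the choice changes how strongly a single violated knowledge rule affects the result.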

In [BT97], Berger and Tuzhilin discussed the notion of unexpected patterns in infinite temporal databases, where the unexpectedness is determined from the occurrences of a pattern. In [JS05], Jaroszewicz and Scheffer proposed a Bayesian-network-based approach to discover unexpected patterns, that is, to find the patterns with the strongest discrepancies between the network and the database. These two approaches can also be regarded as frequency-based, where unexpectedness is defined by whether itemsets in the database are much more or much less frequent than the background knowledge suggests.
