Discussion - tel-00431117, version 1

16 CHAPTER 2. RELATED WORK as

v(t, r) =

( hm(t, r)×bm(t, r) if bm(t, r)≥σ∧hm(t, r)≥σ^′

0 otherwise ,

where bm(t, r) measures the body match degree and hm(t, r) measures the head match degree between t and r, hm(t, r) = 1 − hm(t, r), and σ, σ^′ are given thresholds. Further, the user knowledge, denoted as K, is defined from the rules with respect to the preference model, which species the user’s knowledge about how to apply knowledge rules to a given scenario or a tuple.

Thus, the violation v_K(t)of user knowledge K by a data tuple t is defined as vK(t) =agg({v(t, r)|r∈ Ct}),

whereCt is the covering knowledge oft(i.e., a set of rules that represent the user knowledge on the data contain t) and agg is a well-behaved aggregate function (i.e., max(V)≤ agg(V) ≤max(V) on a vector V of attribute-value pairs). Therefore, given a database D and discovered rule r, let S be the set of all tuples that satisfy the rule r, theunexpectedness supportof the ruler is defined as

Usup(r) =

P{vK(t)|t∈S}

|S| .

Wang et al. further defined the unexpectedness confidence and theunexpectedness of a ruler as Uconf(r) = Usup(r)

|{t ∈ D |t satisfies b(r)}|

and

Unexp(r) = Usup(r)

|{t∈ D |t satisfies r}|.

Hence, the problem of mining unexpected rules is to find all rules with respect to user defined thresholds on unexpectedness support, unexpectedness confidence, and unexpectedness.

In [BT97], Berger and Tuzhilin discussed the notion of unexpected patterns in infinite temporal databases, where the unexpectedness is determined from the occurrences of a pattern. In [JS05], Jaroszewicz and Scheffer proposed a Bayesian network based approach to discover unexpected patterns, that is, to find the patterns with the strongest discrepancies between the network and the database. These two approaches can also be regarded as frequency based, where unexpectedness is defined from whether itemsets in the database are much more, or much less frequent than the background knowledge suggests.

2.4. DISCUSSION 17

As listed in Table 2.2, most of the existing approaches to discover unexpected patterns and rules are essentially considered within the context of association rules. To the best of our knowledge, before our work, the approaches proposed in [Spi99] and [BT97] are the only ones that concentrates on sequence data. Indeed, although this work considers the unexpected sequences and rules, it is however very different from our problem in the measures and the notions of unexpectedness contained in sequence data.

Approach Data Model Measure Unexpected Structure

[LH96] Association rule Pattern similarity Association rule [SS96] – [SZ05] Association rule Probabilistic Association rule [BT97] Sequence Propositional Temporal Logic Pattern

[DL98] Association rule Distance + Frequency Rule confidence [PT98] – [PT06] Association rule Belief/Semantics Association rule [LHML99] Association rule Pattern similarity Pattern

[Spi99] Sequence Belief/Frequency Sequence rule

[LMY01] Text, Web content VSM/TF-IDF Text/Term

[WJL03] Association rule Rule match Association rule

[JS05] Frequent pattern Bayesian network/Frequency Pattern

Table 2.2: A comparison of unexpected pattern and/or rule mining approaches.

In this thesis, the unexpectedness is stated by the semantics of sequence data, instead of the statistical frequency or distance.

We consider the unexpectedness within the context of domain knowledge and the aspect valid within the context of the classical notions of support and confidence. With summarizing previous approaches, we can find that the detection of unexpectedness is often based on rules, that is, the unexpectedness is considered as the facts that contradict existing rules on data. Therefore, before proposing the notions of unexpected sequences with respect to the belief system on sequence data, in the next chapter, we above all formalize the forms of sequence rules.

tel-00431117, version 1 - 10 Nov 2009

18 CHAPTER 2. RELATED WORK

tel-00431117, version 1 - 10 Nov 2009

Chapter 3 Sequence Rules

In most literatures, unexpectedness is the facts that contradict existing rules on data. Therefore, in order to define the belief system on sequence data for discovering unexpected sequences, in this section, we formalized the notions of sequence rules.

3.1 Introduction

Rule mining is an important topic in data mining research and applications, where the most studied problem is mining association rules [AIS93, AS94, SA95, BMUT97, CA97, DL98, GKM⁺03, DP06, CG07, HCXY07, CW08, KZC08], which finds frequent association relations between the patterns contained in transactional databases. In this chapter, we formalized two categories of sequence rules, which stand for the fundamentals of the belief system proposed in this thesis. We consider two types of sequences rules in beliefs: non-predictive sequence ruleswithout occurrence constraint and predictive sequence rules with occurrence constraint, where we propose the forms of sequence association rulesand predictive sequence implication rules.

The sequence rule mining problems and the definitions of sequence rules are very variant in comparison with association rules.

Mannila proposed the notion of episode rules [MTV97] that can be regarded as in the form sα →sβ of sequence association rules, where the constraint sα ⊑sβ is applied on episode rules for depicting that the occurrence of the sequence sα implies the occurrence of the sequence sβ.

In [DLM⁺98], time and occurrence constraints are applied to sequence rules to analyze time series. Two basic forms ¹ on the shapes discretized from time series are proposed: (1) A =⇒^t B, which depicts that “if shapeAoccurs, then shapeBoccurs within timet”; (2)A₁∧A₂∧. . .∧Am

=v,t⇒ B, which depicts that “if shapesA₁,A₂, . . .Am occur withinv units of times, then shape B occurs within time t”. In [HD04], the constraint on time lags, which is similar to the form proposed in

1We use the notations→or ⇒for denoting a sequence rule as how it was defined the original literature.

tel-00431117, version 1 - 10 Nov 2009

20 CHAPTER 3. SEQUENCE RULES [DLM⁺98], is applied to episode rules [MTV97].

In [CCH02], sequence rules are considered with intermediate elements (e.g.,h(ab)∗(c)iwhere∗ denote a sequence of unknown items between(ab)and(c)in two directions: a forward rulesα →sβ

depicts that the occurrence of sα ⊑ s implies the occurrence of sα·sβ ⊑ s, and a backward rule sα←sβ depicts that the occurrence sβ ⊑s implies the occurrence of sα·sβ ⊑s

In [HS05], a sequence rule formsα

=⇒t w

sβ is proposed in the context of evolutionary computing and genetic programming with a specialized pattern matching hardware from time series, where w is the minimum distance and t is the maximum distance between the sequences sα and sβ

contained in a rule; in [LKL08], the sequence rules in the form sα → sβ is studied in terms of recurrent rules.

Therefore, in order to benefit from existing approaches to sequence rule mining, in this chapter, we formalize sequence rules in terms of the notions of sequence association rules and predictive sequence implication rules.

In this thesis, based on the sequence data model introduced in Section 2.1, we further consider the following supplementary concepts and operations on sequence data.

The length of a sequence s is the number of itemsets contained in s, denoted as |s|; the size of a sequence s is the number of all items contained in s, denoted as ksk. An empty sequence is denoted as∅, where |∅|=k∅k= 0. A sequence of lengthkis called ak-length sequence; a sequence with size n is called a n-size sequence, or simply a n-sequence.

The concatenation of sequences is denoted as s1 ·s2, and the result is the sequence obtained by appending s2 to the end ofs1, so that we have|s1·s2|=|s1|+|s2|and ks1·s2k=ks1k+ks2k.

For example, h(a)(b)i · h(b)(c)i=h(a)(b)(b)(c)i.

The rest of this chapter is organized as follows. In Section 3.2, we formalize the form of sequence association rule. In Section 3.3, we propose the notion of predictive sequence implication rule. In Section 3.4 we propose the notion of consistent sequence rule set. Section 3.5 is a discussion on sequence rules.

Dans le document tel-00431117, version 1 - 10 Nov 2009 (Page 40-44)