
Surprising Sequence Patterns

In the document Sequence Data Mining (Page 138-141)

Roughly speaking, a surprising sequence pattern is one whose (frequency of) occurrence deviates greatly from the prior expectation of such occurrence.

Definition 6.3.1 Given a dataset D of sequences, a surprisingness measure θ on sequence patterns, and a surprisingness threshold minSurp, a sequence pattern S is called a surprising sequence pattern if θ(S) ≥ minSurp holds.

Essentially, the surprisingness θ(S) of a sequence pattern S can be defined as the difference between the actual frequency and the expected frequency of S.

A pattern S's actual frequency of occurrence can be surprising if it is much higher or much lower than the expected frequency. A pattern can be considered surprising with respect to a dataset, with respect to a sequence, or with respect to a window of a sequence, depending on how the actual frequency is calculated: in the first case it is calculated from the entire dataset, in the second from the given sequence, and in the third from a window of the given sequence. The first case can be useful for identifying unexpected behavior of a class of sequences (e.g. the sequences of a species). The second and third cases are useful for predicting the surprisingness of (windows of) sequences.

Several issues need to be addressed for surprisingness analysis:

One issue is how to define the frequency (actual or expected) of sequence patterns. Possibilities include (1) a per-sequence definition, where all occurrences of a pattern in an entire sequence are considered, and (2) a per-window definition, where all occurrences of a pattern in a window of a given sequence are considered. Per-sequence/per-window frequencies can then be aggregated to determine frequencies over a whole dataset if desired.

Another issue is how to estimate the expected frequency (or probability) of a sequence pattern.

A third issue is how to choose the minSurp threshold. This issue is important for avoiding too many false-positive “surprising sequence patterns”, especially in alarm/fraud detection applications.

Below we discuss how to estimate the expected frequency for the window-based definition of frequency. This essentially follows the approach of [41]. (Reference [129] deals with the same issue for periodic sequence patterns.) A window size is a positive integer w.

Definition 6.3.2 Let S = s1 … sm be a sequence pattern and let T = t1 … tn be a sequence. Then the w-window frequency of S in T is defined as the total number, denoted C(n, w, m), of occurrences (matches) {i1, …, im} of S in T such that i1 < … < im, sj = tij for all 1 ≤ j ≤ m, and im − i1 ≤ w (i.e. the distance between the first matching position and the last matching position of the pattern in the sequence is at most w).
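The w-window frequency of Definition 6.3.2 can be computed directly by enumerating candidate index tuples. The following is a minimal brute-force sketch (the function name is ours; exhaustive enumeration is exponential in the pattern length, so this is only practical for short patterns and sequences):

```python
from itertools import combinations

def window_frequency(pattern, seq, w):
    """Count occurrences {i1, ..., im} of `pattern` as a subsequence of
    `seq` whose span im - i1 is at most `w` (Definition 6.3.2)."""
    m = len(pattern)
    count = 0
    for idxs in combinations(range(len(seq)), m):
        # The chosen positions must spell out the pattern ...
        if all(seq[i] == p for i, p in zip(idxs, pattern)):
            # ... and the match must fit in a window of size w.
            if idxs[-1] - idxs[0] <= w:
                count += 1
    return count
```

For example, window_frequency("ab", "axbab", 3) returns 2: it counts the matches at positions {0, 2} and {3, 4}, and rejects {0, 4}, whose span 4 exceeds w = 3.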

To estimate the expected window-based frequency of S in D, it is convenient to think of D as having just one sequence. This can be done by concatenating all sequences of D into one sequence. Moreover, it is convenient to assume that the concatenated sequence is generated by a memoryless (Bernoulli) source. It is worth remembering that the expected frequency is just an estimate, although one wants the estimate to be accurate.

For each item a, let prob(a) be the estimated probability of a in the dataset D. For example, prob(a) can be estimated as

prob(a) = (number of occurrences of a in D) / Σ_{X∈D} |X|.
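This estimator is simply each item's share of all item occurrences in D. A minimal sketch (the function name is ours):

```python
from collections import Counter

def item_probs(D):
    """Estimate prob(a) for every item a as
    (occurrences of a in D) / (sum of |X| over all sequences X in D)."""
    counts = Counter(a for X in D for a in X)
    total = sum(len(X) for X in D)
    return {a: c / total for a, c in counts.items()}
```

For instance, over D = ["ab", "abb"] the total length is 5, giving prob(a) = 2/5 and prob(b) = 3/5.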

Let p(w, m) denote the probability that a window of size w contains at least one occurrence of the sequence pattern S of size m as a subsequence.

Then the expected value of C(n, w, m) can be estimated as

E(C(n, w, m)) = n × p(w, m).

Reference [41] shows that, for sufficiently large w, p(w, m) can be approximated by an expression in terms of prob(a) for all items a in the alphabet. Hence E(C(n, w, m)) can be approximated by an expression in terms of prob(a) for all items a in the alphabet as well.
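The closed-form approximation of [41] is not reproduced here, but under the same memoryless assumption p(w, m) can also be approximated by simulation. The sketch below (the function names and the Monte Carlo approach are ours, not from [41]) draws random windows of size w from the estimated item probabilities and counts how often the pattern occurs as a subsequence:

```python
import random

def contains_subsequence(window, pattern):
    """Greedy left-to-right check that `pattern` is a subsequence of `window`."""
    pos = 0
    for x in window:
        if pos < len(pattern) and x == pattern[pos]:
            pos += 1
    return pos == len(pattern)

def p_window_monte_carlo(pattern, probs, w, trials=20000, seed=0):
    """Estimate p(w, m): the probability that a window of size w drawn from a
    memoryless source with item probabilities `probs` contains `pattern`."""
    rng = random.Random(seed)
    items = list(probs)
    weights = [probs[a] for a in items]
    hits = sum(
        contains_subsequence(rng.choices(items, weights=weights, k=w), pattern)
        for _ in range(trials)
    )
    return hits / trials
```

E(C(n, w, m)) is then approximated as n times this estimate, per the formula above.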

Reference [41] also computes the variance of C(n, w, m), and then shows that C(n, w, m) is normally distributed. This allows us to set either an upper threshold τu(w, m) (for over-represented surprising patterns) or a lower threshold τℓ(w, m) (for under-represented surprising patterns). More precisely, for a given significance level β, the choice of τu(w, m) and τℓ(w, m) ensures that either P(C(n, w, m)/n > τu(w, m)) ≤ β or P(C(n, w, m)/n < τℓ(w, m)) ≤ β. That is, if the observed fraction C(n, w, m)/n of windows containing the sequence pattern S exceeds τu(w, m) (upper threshold) or falls below τℓ(w, m) (lower threshold), it is highly unlikely that such a number was generated by the memoryless source, which implies that S is surprising. The interested reader should consult [41] for details.
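Given the normal approximation, the thresholds follow from the mean and variance via the normal quantile. A sketch assuming the mean and variance of C(n, w, m)/n under the memoryless model are supplied as inputs (the variance formula from [41] is not reproduced here, and the function name is ours):

```python
from statistics import NormalDist

def surprise_thresholds(mean, variance, beta):
    """Upper and lower thresholds on C(n, w, m)/n such that, under the
    memoryless model, each tail has probability at most beta."""
    z = NormalDist().inv_cdf(1 - beta)  # one-sided normal quantile
    sd = variance ** 0.5
    return mean + z * sd, mean - z * sd  # (tau_u, tau_lower)
```

For example, with mean 0.2, variance 0.0004 (sd 0.02), and β = 0.025, the quantile z ≈ 1.96 gives τu ≈ 0.239 and τℓ ≈ 0.161.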

In addition to the memoryless model of the data source, one can also use other models, such as Markov models or hidden Markov models (HMMs).

We end this section by discussing the related topic of “rare case mining” [120], which applies to all types of data. Informally, a case corresponds to a region of the instance space that is meaningful with respect to the domain under study, and a rare case is a case that covers a small region of the instance space and relatively few training examples. As a concrete example, with respect to the class bird, non-flying bird is a rare case, since very few birds (e.g., ostriches) do not fly. Rare cases can also be rare associations; for example, mop and broom form a rare association (i.e., case) in the context of supermarket sales [66]. For classification tasks, rare cases may manifest themselves as small disjuncts (namely rules that cover few training examples, where a rule is the set of conditions on a path from the root of a decision tree to a leaf of the tree). In the study of rarity, one needs to decide whether the rarity of a case should be determined with respect to some absolute threshold on the number of training examples (absolute rarity) or with respect to the relative frequency of occurrence in the underlying distribution of the data (relative rarity).
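The absolute/relative distinction at the end of the paragraph amounts to one predicate per criterion. A toy sketch (the function names and threshold values are illustrative, not from [120]):

```python
def is_rare_absolute(n_covered, abs_threshold=5):
    """Absolute rarity: the case covers fewer than abs_threshold examples."""
    return n_covered < abs_threshold

def is_rare_relative(n_covered, n_total, rel_threshold=0.01):
    """Relative rarity: the case's share of the data is below rel_threshold."""
    return n_covered / n_total < rel_threshold
```

A case covering 5 of 1,000 examples is rare under the relative criterion (0.5% < 1%) but not under the absolute one; the two criteria can disagree either way.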
