
Frequency based feature selection: The features with high frequencies, namely those having frequency over a given threshold, are selected. One may use window-based frequency or whole-sequence frequency.

Discrimination based feature selection: Features with relatively higher frequency at a desired site or in some selected classes than elsewhere are preferred. To find features for identifying a desired site, one prefers features which occur more frequently around the site than elsewhere. To find features for identifying classes, one prefers features which occur more frequently in one class than in other classes.

Some method is needed to determine patterns’ frequency differences as discussed above. This can be done by directly comparing a site against other parts of the sequences, or by comparing one class against other classes. In the biological literature, it is often the case that one actually computes the frequency at the desired site or in a class, but uses the so-called “background” probabilities to determine the frequency away from the given site or the given class. The background probabilities can be sampled from large datasets, e.g. all potential sequences in the application.

Chapter 6 gives more details on mining patterns that distinguish a site/class against other parts of the sequences or other classes.
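To make the two selection criteria above concrete, the following sketch selects k-grams whose whole-sequence frequency in a target class is over a threshold and noticeably higher than a background frequency. It is only an illustration: the function names, the use of k-grams as candidate features, the ratio threshold, and the add-one smoothing of the background are assumptions, not prescribed by the text.

```python
from collections import Counter

def kgram_counts(sequences, k):
    """Whole-sequence frequency: the number of sequences containing each k-gram."""
    counts = Counter()
    for s in sequences:
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts

def select_features(target_seqs, background_seqs, k=3, min_freq=0.2, min_ratio=2.0):
    """Keep k-grams that are frequent in the target class (frequency based)
    and over-represented relative to the background (discrimination based)."""
    t_counts = kgram_counts(target_seqs, k)
    b_counts = kgram_counts(background_seqs, k)
    n_t, n_b = len(target_seqs), len(background_seqs)
    selected = []
    for gram, c in t_counts.items():
        t_freq = c / n_t
        b_freq = (b_counts.get(gram, 0) + 1) / (n_b + 1)  # smoothed background frequency
        if t_freq >= min_freq and t_freq / b_freq >= min_ratio:
            selected.append(gram)
    return selected
```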

3.3 Distance Functions over Sequences

Sequence distance functions are designed to measure sequence (dis)similarities. Many have been proposed for different sequence characteristics and purposes. After an overview, this section will discuss several major types of sequence distance functions.

3.3.1 Overview on Sequence Distance Functions

A sequence distance is a function d mapping pairs of sequences to non-negative real numbers so that the following four properties are satisfied:

d(x, y) > 0 for sequences x and y such that x ≠ y,

d(x, x) = 0 for all sequences x,

d(x, y) = d(y, x) for all sequences x and y,

d(x, y) ≤ d(x, z) + d(z, y) for all sequences x, y and z.

Moreover, d is a normalized distance function if 0 ≤ d(x, y) ≤ 1 for all sequences x and y. These properties and definitions are special cases of those for general distance functions.

Sequence distance functions can be used for many different purposes, including clustering, finding known sequences similar to a new sequence (to help infer the properties of new sequences), determining the phylogenetic tree (evolution history) of the species, etc.

Roughly speaking, distance functions can be character (alignment) based, feature based, information theoretic (e.g. Kolmogorov complexity) based [64], conditional probability distribution based [126], etc. In the feature based approach, one would first extract features from the sequences, and then compute the distance (e.g. Euclidean or cosine) between the sequences by computing the distance between the feature vectors of the sequences. The character alignment based ones can be local window based or whole sequence based; they can also be edit distances or more general pairwise similarity score based distances. Details on most of these approaches are given below.

3.3.2 Edit, Hamming, and Alignment based Distances

The edit distance, also called the Levenshtein distance, between two sequences S1 and S2 is defined to be the minimum number of edit operations to transform S1 to S2. The edit operations include changing a letter to another, inserting a letter, and deleting a letter. For example, the edit distance between “school” and “spool” is 2 (we need one deletion and one change), and the edit distance between “park” and “make” is 3.
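The minimum number of edit operations can be computed with the standard dynamic-programming recurrence. The sketch below is one common formulation, not taken from this text, and it reproduces the two examples just given.

```python
def edit_distance(s1, s2):
    """Levenshtein distance: the minimum number of single-letter
    changes, insertions, and deletions transforming s1 into s2."""
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                              # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j                              # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # change (or match)
    return dp[m][n]

print(edit_distance("school", "spool"))  # 2
print(edit_distance("park", "make"))     # 3
```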

Some authors [126] argue that the edit distance may not be the ideal solution to measure sequence similarity. Consider the following three sequences: aaaabbb, bbbaaaa, and abcdefg. The edit distance between aaaabbb and bbbaaaa is 6, and the edit distance between aaaabbb and abcdefg is also 6. Intuitively aaaabbb and bbbaaaa are more similar to each other than aaaabbb and abcdefg are, but the edit distance cannot tell the difference.

The Hamming distance between two sequences is limited to cases when the two sequences have identical lengths, and is defined to be the number of positions where the two sequences are different. For example, the Hamming distance is 1 for “park” and “mark”, it is 3 for “park” and “make”, and it is 4 between “abcd” and “bcde”.
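A minimal sketch of the Hamming distance, assuming equal-length inputs as the definition requires:

```python
def hamming_distance(s1, s2):
    """Number of positions at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires sequences of identical length")
    return sum(a != b for a, b in zip(s1, s2))

print(hamming_distance("park", "mark"))  # 1
print(hamming_distance("park", "make"))  # 3
print(hamming_distance("abcd", "bcde"))  # 4
```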

The edit and Hamming distances charge each mismatch one unit of cost in dissimilarity. In some applications where different mismatches are viewed differently, different costs should be charged to different mismatches. Those costs are usually given as a matrix; examples include PAM [122] (a transition probability matrix) and BLOSUM [49] (blocks substitution matrix). Moreover, insertions/deletions can also be charged differently from changes. It is also a common practice to charge more on the first insertion than subsequent insertions, and to charge more on the first deletion than subsequent deletions.
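The unit-cost recurrence generalizes directly once mismatch and insertion/deletion costs become parameters. The sketch below is only an illustration: the toy cost table stands in for a real substitution matrix such as PAM or BLOSUM, indels are charged a flat cost, and the separate first-versus-subsequent (gap opening versus gap extension) charges mentioned above are not implemented, since they require a more elaborate recurrence.

```python
def weighted_edit_distance(s1, s2, sub_cost, indel_cost=1.0):
    """Edit distance where substituting a by b costs sub_cost[(a, b)]
    and each insertion or deletion costs indel_cost."""
    m, n = len(s1), len(s2)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * indel_cost
    for j in range(1, n + 1):
        dp[0][j] = j * indel_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a, b = s1[i - 1], s2[j - 1]
            change = 0.0 if a == b else sub_cost.get((a, b), 1.0)
            dp[i][j] = min(dp[i - 1][j] + indel_cost,
                           dp[i][j - 1] + indel_cost,
                           dp[i - 1][j - 1] + change)
    return dp[m][n]

# Toy DNA cost table: transitions (A<->G, C<->T) are charged less than other changes.
costs = {('A', 'G'): 0.5, ('G', 'A'): 0.5, ('C', 'T'): 0.5, ('T', 'C'): 0.5}
print(weighted_edit_distance("GATTACA", "GACTATA", costs))
```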

Many fast software tools exist for finding similar sequences through sequence alignment; examples include BLAST and its variants [4], FASTA [85], and Smith-Waterman [98]. These algorithms compare a given sequence against sequences from a given set, to find homologous (highly similar) sequences.

PSI-BLAST [4] and HMM-based methods [59] can be used to find remote homologous (less similar) sequences.

3.3.3 Conditional Probability Distribution based Distance

Reference [126] considers a conditional probability distribution (CPD) based distance. The idea is to use the CPD of the next symbol (right after a segment of some fixed length L) to characterize the structural properties of a given sequence (or set of sequences). The distance between two sequences is then defined in terms of the difference between the two CPDs of the two sequences. The similarity between two CPDs can be measured by the variational distance or the Kullback-Leibler divergence between the CPDs. Let A be a given alphabet. Let S1 and S2 be two sequences. Let Ω be the set of sequences of length L which occur in either S1 or S2. Let Pi denote the CPD for Si. The variational distance between S1 and S2 is then defined as

Σ_{X ∈ Ω, x ∈ A} |P1(x|X) − P2(x|X)|

and the Kullback-Leibler divergence between S1 and S2 is defined as

Σ_{X ∈ Ω, x ∈ A} (P1(x|X) − P2(x|X)) · log (P1(x|X) / P2(x|X)).

This distance can be easily extended to a distance between sets (clusters) of sequences. The computation of the CPD based distance can be expensive for large L. The clustering algorithm given in [126] computes the distance of a sequence S with each cluster C; that distance is estimated as the error of predicting the next symbol of the sequence S using the CPD of the cluster C. Only frequent sequences in a cluster C are used in defining CPDs; if a sequence of length L is not frequent in C, its longest frequent suffix is used in its place.
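The basic idea can be sketched as follows; this is not the optimized clustering algorithm of [126], and the add-one smoothing and the omission of the frequent-suffix fallback are simplifying assumptions made here. Each sequence's CPD of the next symbol given the preceding length-L segment is estimated from counts, and the variational distance is accumulated over the contexts occurring in either sequence.

```python
from collections import Counter, defaultdict

def next_symbol_cpd(seq, L):
    """Counts of the symbol that follows each length-L segment of seq."""
    cpd = defaultdict(Counter)
    for i in range(len(seq) - L):
        cpd[seq[i:i + L]][seq[i + L]] += 1
    return cpd

def variational_distance(s1, s2, L, alphabet):
    """Sum over contexts X and symbols x of |P1(x|X) - P2(x|X)|,
    with add-one smoothing so both CPDs are defined for every context."""
    cpd1, cpd2 = next_symbol_cpd(s1, L), next_symbol_cpd(s2, L)
    contexts = set(cpd1) | set(cpd2)
    total = 0.0
    for X in contexts:
        n1 = sum(cpd1[X].values()) + len(alphabet)
        n2 = sum(cpd2[X].values()) + len(alphabet)
        for x in alphabet:
            p1 = (cpd1[X][x] + 1) / n1
            p2 = (cpd2[X][x] + 1) / n2
            total += abs(p1 - p2)
    return total

print(variational_distance("aaaabbb", "bbbaaaa", L=2, alphabet="ab"))
print(variational_distance("aaaabbb", "abababa", L=2, alphabet="ab"))
```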

3.3.4 An Example of Feature based Distance: d2

An example of feature based sequence distance is d2 [108], which uses k-grams as features. It uses two parameters: a window length W and a gram (word) length w. It considers the frequency of all w-grams occurring in a window of length W. The distance between two windows is the Euclidean distance between the two vectors of frequencies of w-grams in the two windows. The d2 distance between two sequences is the minimum of the window distances between any two windows, one from each sequence.

GACTTCTATGTCACCTCAGAGGTAGATAGA
CGAATTCTCTGTAACACTAAGCTCTCTTCC

Fig. 3.2. Two DNA sequences

For example, consider the two sequences given in Figure 3.2 for W = 8 and w = 2. To compute the d2 distance between the two sequences, we need to compute the window distance between all pairs of windows. Consider the following pair of windows: W1 = ACTTCTAT (starting from the second position of the first sequence) and W2 = AATTCTCT (starting from the third position of the second sequence). The two vectors of frequencies of 2-grams are:

   AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
W1  0  1  0  1  0  0  0  2  0  0  0  0  1  1  0  1
W2  1  0  0  1  0  0  0  2  0  0  0  0  0  2  0  1

So d2(W1, W2) = √((0 − 1)² + (1 − 0)² + (1 − 0)² + (1 − 2)²) = 2.

We should note that this is a simplified definition; the more general definition allows one to consider an interval of gram lengths.

The d2 distance is not really a distance metric, since it violates the triangle inequality. Consider the three sequences aabb, abbc, and bbcc for W = 3 and w = 2. Then d2(aabb, abbc) = 0, d2(abbc, bbcc) = 0 and d2(aabb, bbcc) = √2; so d2(aabb, abbc) + d2(abbc, bbcc) < d2(aabb, bbcc).
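A brute-force sketch of this simplified d2 is below; the single gram length w, the exhaustive comparison of all window pairs, and the helper names are illustrative choices. It reproduces the window distance computed above and the triangle-inequality counterexample.

```python
from collections import Counter
from itertools import product
from math import sqrt

def gram_vector(window, w):
    """Frequencies of all w-grams occurring in a window."""
    return Counter(window[i:i + w] for i in range(len(window) - w + 1))

def window_distance(win1, win2, w):
    """Euclidean distance between the w-gram frequency vectors of two windows."""
    v1, v2 = gram_vector(win1, w), gram_vector(win2, w)
    return sqrt(sum((v1[g] - v2[g]) ** 2 for g in set(v1) | set(v2)))

def d2(s1, s2, W, w):
    """Simplified d2: the minimum window distance over all pairs of length-W windows."""
    wins1 = [s1[i:i + W] for i in range(len(s1) - W + 1)]
    wins2 = [s2[i:i + W] for i in range(len(s2) - W + 1)]
    return min(window_distance(a, b, w) for a, b in product(wins1, wins2))

print(window_distance("ACTTCTAT", "AATTCTCT", 2))          # 2.0, the window pair above
print(d2("aabb", "abbc", 3, 2), d2("abbc", "bbcc", 3, 2))  # 0.0 0.0
print(d2("aabb", "bbcc", 3, 2))                            # 1.414..., i.e. the square root of 2
```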

3.3.5 Web Session Similarity

Web session sequences are sequences of URLs. Since each URL (see Figure 3.3) can be viewed as a sequence, web session sequences need to be treated as nested sequences. So distance functions over web session sequences need to consider both similarity between individual URLs (representing web pages) and similarity between sequences of URLs. Reference [118] considers a distance defined following such an approach.

On URL similarity, [118] determines a similarity weight between two given URLs in two steps. In step 1, the two URLs are compared to get the longest common prefix. Here, each level in a URL (a level is a nonempty string between “/”) is viewed as a token. In step 2, let M be the length of the longer URL of the two, where each level contributes 1 to the length. Then a weight is given to each level of this longer URL: the last level is given the weight of 1, the second to the last is given the weight of 2, etc. The similarity between the two URLs is defined as the sum of the weights of the longest common prefix divided by the sum of the total weights. Observe that, if the two pages are totally different, i.e. the common prefix is the empty sequence, their similarity is 0.0. If the two URLs are exactly the same, their similarity is 1.0.
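A sketch of this two-step computation, splitting on "/" to obtain the levels; the example URLs below are made up for illustration and are not the ones in Figure 3.3.

```python
def url_tokens(url):
    """Split a URL into its nonempty levels (tokens between '/' characters)."""
    return [t for t in url.split('/') if t]

def url_similarity(url1, url2):
    """Weight of the longest common prefix of levels divided by the total weight
    of the longer URL, where the last level weighs 1, the next-to-last 2, etc."""
    t1, t2 = url_tokens(url1), url_tokens(url2)
    M = max(len(t1), len(t2))
    weights = list(range(M, 0, -1))  # first level weighs M, ..., last level weighs 1
    common = 0
    for a, b, wgt in zip(t1, t2, weights):
        if a != b:
            break
        common += wgt
    return common / sum(weights)

# Hypothetical URLs in the spirit of Figure 3.3 (not the actual URL#1/#2/#3).
print(url_similarity("www.cs.wright.edu/labs/datamine/current.html",
                     "www.cs.wright.edu/labs/datamine/projects.html"))  # high: same lab directory
print(url_similarity("www.cs.wright.edu/labs/datamine/current.html",
                     "www.cs.wright.edu/admissions/index.html"))        # low: only the server matches
```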

For example, consider the three URLs given in Figure 3.3. URL#1 and URL#2 are more similar to each other than URL#1 and URL#3 are. The similarity between URL#1 and URL#2 is obvious: they are very similar pages, both concerning the research work in the Data Mining Laboratory of Wright State University. The similarity between URL#1 and URL#3 is weaker, simply reflecting the fact that both pages come from the same server.

Using the approach discussed above on URL#1 and URL#2, the weights for the tokens current.html, datamine, labs, and www.cs.wright.edu are 1, 2, 3, and 4, respectively.
