
\[
\mathbf{X} =
\begin{bmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,n} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,n} \\
\vdots  & \vdots  & \ddots & \vdots  \\
x_{N,1} & x_{N,2} & \cdots & x_{N,n}
\end{bmatrix}. \tag{1.1}
\]

In pattern recognition terminology, the rows of $\mathbf{X}$ are called patterns or objects, the columns are called features or attributes, and $\mathbf{X}$ is called the pattern matrix. In this book, $\mathbf{X}$ is often referred to simply as the data matrix. The meaning of the rows and columns of $\mathbf{X}$ with respect to reality depends on the context. In medical diagnosis, for instance, the rows of $\mathbf{X}$ may represent patients, and the columns are then symptoms or laboratory measurements for the patients. When clustering is applied to the modelling and identification of dynamic systems, the rows of $\mathbf{X}$ contain samples of time signals, and the columns are, for instance, physical variables observed in the system (position, velocity, temperature, etc.).

In system identification, the purpose of clustering is to find relationships between independent system variables, called the regressors, and future values of dependent variables, called the regressands. One should, however, realize that the relations revealed by clustering are just acausal associations among the data vectors, and as such do not yet constitute a prediction model of the given system. To obtain such a model, additional steps are needed, which will be presented in Section 4.3.

Data can be given in the form of a so-called dissimilarity matrix:

\[
\begin{bmatrix}
0 & d(1,2) & d(1,3) & \cdots & d(1,N) \\
  & 0      & d(2,3) & \cdots & d(2,N) \\
  &        & 0      & \ddots & \vdots \\
  &        &        &        & 0
\end{bmatrix} \tag{1.2}
\]

where $d(i,j)$ denotes the measure of dissimilarity (distance) between objects $\mathbf{x}_i$ and $\mathbf{x}_j$. Because $d(i,i) = 0, \forall i$, the main diagonal contains only zeros, and the matrix is symmetric because $d(i,j) = d(j,i)$. There are clustering algorithms that use this form of data directly (e.g., hierarchical methods). If the data are given in the form of (1.1), the first step is to transform them into the dissimilarity matrix form.
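For illustration, a minimal sketch of this transformation step, assuming NumPy and the Euclidean distance introduced in Section 1.3 (the function name dissimilarity_matrix and the toy data are ours, not from the book):

```python
import numpy as np

def dissimilarity_matrix(X):
    """Pairwise Euclidean distances d(i, j) between the rows (patterns) of X."""
    sq = np.sum(X ** 2, axis=1)
    # Broadcasting gives all squared distances: ||xi||^2 + ||xj||^2 - 2 xi.xj
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.maximum(D2, 0.0, out=D2)   # guard against tiny negative round-off values
    return np.sqrt(D2)            # symmetric, with zeros on the main diagonal

# Example: N = 4 patterns with n = 2 features, arranged as in (1.1)
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 2.0],
              [3.0, 4.0]])
D = dissimilarity_matrix(X)
assert np.allclose(D, D.T) and np.allclose(np.diag(D), 0.0)
```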

1.3 Similarity Measures

Since similarity is fundamental to the definition of a cluster, a measure of the similarity between two patterns drawn from the same feature space is essential to most clustering procedures. Because of the variety of feature types and scales, the distance measure (or measures) must be chosen carefully. It is most common to calculate the dissimilarity between two patterns using a distance measure defined on the feature space. We will focus on the well-known distance measures used for patterns whose features are all continuous.

The most popular metric for continuous features is the Euclidean distance

\[
d_2(\mathbf{x}_i,\mathbf{x}_j) = \left( \sum_{k=1}^{d} (x_{i,k}-x_{j,k})^2 \right)^{1/2} = \|\mathbf{x}_i-\mathbf{x}_j\|_2, \tag{1.3}
\]

which is a special case ($p = 2$) of the Minkowski metric

\[
d_p(\mathbf{x}_i,\mathbf{x}_j) = \left( \sum_{k=1}^{d} |x_{i,k}-x_{j,k}|^p \right)^{1/p} = \|\mathbf{x}_i-\mathbf{x}_j\|_p. \tag{1.4}
\]

The Euclidean distance has an intuitive appeal, as it is commonly used to evaluate the proximity of objects in two- or three-dimensional space. It works well when a data set has “compact” or “isolated” clusters [186]. The drawback of the direct use of the Minkowski metrics is the tendency of the largest-scaled feature to dominate the others. Solutions to this problem include normalization of the continuous features (to a common range or variance) or other weighting schemes. Linear correlation among features can also distort distance measures; this distortion can be alleviated by applying a whitening transformation to the data or by using the squared Mahalanobis distance

\[
d_M(\mathbf{x}_i,\mathbf{x}_j) = (\mathbf{x}_i-\mathbf{x}_j)\,\mathbf{F}^{-1}(\mathbf{x}_i-\mathbf{x}_j)^T, \tag{1.5}
\]

where the patterns $\mathbf{x}_i$ and $\mathbf{x}_j$ are assumed to be row vectors, and $\mathbf{F}$ is the sample covariance matrix of the patterns or the known covariance matrix of the pattern-generation process; $d_M(\cdot,\cdot)$ assigns different weights to different features based on their variances and pairwise linear correlations. Here it is implicitly assumed that the class-conditional densities are unimodal and characterized by multidimensional spread, i.e., that the densities are multivariate Gaussian. The regularized Mahalanobis distance was used in [186] to extract hyperellipsoidal clusters. Recently, several researchers [78, 123] have used the Hausdorff distance in a point set matching context.
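To make these formulas concrete, here is a minimal sketch of the Minkowski distance (1.4) and the squared Mahalanobis distance (1.5), assuming NumPy (function names and the toy data are ours, not from the book):

```python
import numpy as np

def minkowski(xi, xj, p=2):
    """Minkowski distance (1.4); p = 2 reduces to the Euclidean distance (1.3)."""
    return np.sum(np.abs(xi - xj) ** p) ** (1.0 / p)

def mahalanobis_sq(xi, xj, F):
    """Squared Mahalanobis distance (1.5); F is the sample covariance matrix."""
    diff = xi - xj
    return float(diff @ np.linalg.solve(F, diff))

# Rows are patterns, columns are features; the two features are strongly correlated.
X = np.array([[1.0, 2.0],
              [2.0, 4.1],
              [3.0, 6.2],
              [4.0, 7.9]])
F = np.cov(X, rowvar=False)            # sample covariance matrix of the patterns
d_euc = minkowski(X[0], X[3])          # sensitive to the scale of each feature
d_mah = mahalanobis_sq(X[0], X[3], F)  # weights features by variance and correlation
```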

The norm metric influences the clustering criterion by changing the measure of dissimilarity. The Euclidean norm induces hyperspherical clusters, i.e., clusters whose surfaces of constant membership are hyperspheres. Both the diagonal and the Mahalanobis norm generate hyperellipsoidal clusters; the difference is that with the diagonal norm the axes of the hyperellipsoids are parallel to the coordinate axes, while with the Mahalanobis norm the orientation of the hyperellipsoids is arbitrary, as shown in Figure 1.2.

Figure 1.2: Different distance norms used in fuzzy clustering.

Some clustering algorithms work on a matrix of proximity values instead of on the original pattern set. It is useful in such situations to precompute all the $N(N-1)/2$ pairwise distance values for the $N$ patterns and store them in a (symmetric) matrix (see Section 1.2).

Computation of distances between patterns with some or all features being non-continuous is problematic, since the different types of features are not comparable and (as an extreme example) the notion of proximity is effectively binary-valued for nominal-scaled features. Nonetheless, practitioners (especially those in machine learning, where mixed-type patterns are common) have developed proximity measures for heterogeneous-type patterns. A recent example is [283], which proposes a combination of a modified Minkowski metric for continuous features and a distance based on counts (population) for nominal attributes. A variety of other metrics have been reported in [70] and [124] for computing the similarity between patterns represented using quantitative as well as qualitative features.

Patterns can also be represented using string or tree structures [155]. Strings are used in syntactic clustering [90]. Several measures of similarity between strings are described in [34]. A good summary of similarity measures between trees is given by Zhang [301]. A comparison of syntactic and statistical approaches for pattern recognition using several criteria was presented in [259] and the conclusion was that syntactic methods are inferior in every aspect. Therefore, we do not consider syntactic methods further.

There are some distance measures reported in the literature [100, 134] that take into account the effect of surrounding or neighboring points. These surrounding points are called context in [192]. The similarity between two points $\mathbf{x}_i$ and $\mathbf{x}_j$, given this context, is given by

\[
s(\mathbf{x}_i,\mathbf{x}_j) = f(\mathbf{x}_i,\mathbf{x}_j,E), \tag{1.6}
\]

where $E$ is the context (the set of surrounding points). One metric defined using context is the mutual neighbor distance (MND), proposed in [100], which is given by

\[
\mathrm{MND}(\mathbf{x}_i,\mathbf{x}_j) = \mathrm{NN}(\mathbf{x}_i,\mathbf{x}_j) + \mathrm{NN}(\mathbf{x}_j,\mathbf{x}_i), \tag{1.7}
\]

where $\mathrm{NN}(\mathbf{x}_i,\mathbf{x}_j)$ is the neighbor number of $\mathbf{x}_j$ with respect to $\mathbf{x}_i$. The MND is not a metric (it does not satisfy the triangle inequality [301]).

In spite of this, the MND has been successfully applied in several clustering applications [99]. This observation supports the viewpoint that a dissimilarity measure does not need to be a metric.
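As an illustration, here is a minimal sketch of the MND computation on a small pattern set, assuming NumPy; the ranking convention used here (each pattern is its own rank-0 neighbor, its nearest neighbor has rank 1) is our assumption and may differ from the exact neighbor-numbering convention of [100]:

```python
import numpy as np

def mutual_neighbor_distance(X, i, j):
    """MND(xi, xj) = NN(xi, xj) + NN(xj, xi), cf. (1.7).

    NN(a, b) is taken here as the rank of pattern b among the neighbors of
    pattern a, ordered by Euclidean distance (rank 1 = nearest neighbor).
    """
    def nn_rank(a, b):
        dists = np.linalg.norm(X - X[a], axis=1)
        order = np.argsort(dists)                # pattern a itself comes first (distance 0)
        return int(np.where(order == b)[0][0])   # position of b in that ordering
    return nn_rank(i, j) + nn_rank(j, i)

X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.9, 0.1],
              [5.0, 5.0]])
print(mutual_neighbor_distance(X, 0, 1))   # smaller for mutually close patterns
print(mutual_neighbor_distance(X, 0, 3))   # larger for the isolated pattern
```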

Watanabe’s theorem of the ugly duckling [282] states:

“Insofar as we use a finite set of predicates that are capable of distinguishing any two objects considered, the number of predicates shared by any two such objects is constant, independent of the choice of objects.”

This implies that it is possible to make any two arbitrary patterns equally similar by encoding them with a sufficiently large number of features. As a consequence, any two arbitrary patterns are equally similar, unless we use some additional domain information. For example, in the case of conceptual clustering [192], the similarity between $\mathbf{x}_i$ and $\mathbf{x}_j$ is defined as

\[
s(\mathbf{x}_i,\mathbf{x}_j) = f(\mathbf{x}_i,\mathbf{x}_j,C,E), \tag{1.8}
\]

where $C$ is a set of pre-defined concepts.

So far, only continuous variables have been dealt with. There are many similarity measures for binary variables (see, e.g., [110]), but only continuous variables are considered in this book, since these are the variables that occur in system identification.
