
Oliviero Carugo


Abstract

The present chapter provides basic information about the measures of proximity between two subjects or between groups of subjects. These concepts must be clearly understood before they can be applied to any pattern recognition analysis, whether supervised or unsupervised.

Key words: cluster analysis, distance, proximity, similarity.

1. Introduction

Cluster analysis is probably the most widely used technique of unsupervised pattern recognition. Its fundamental objective is to look for clusters in a given population of subjects, each characterized by a certain number of variables. In other words, a cluster analysis is performed in order to see whether the subjects can be classified into different groups. The applications of cluster analysis are numerous and span very different scientific fields. A typical example, in biology, is the study of the taxonomy of species or the delineation of evolutionary trees on the basis of protein sequence alignments. The present chapter provides some basic information about the measures of proximity (distance or similarity) between the subjects that must be classified with cluster analysis. Clustering techniques will be described in the next chapter.

Cluster analysis, like many other statistical tools, may give different results depending on how it is used. For example, if we take a frog, a cat, a salmon, and an eagle, we can classify them in different ways as a function of the classification criterion we adopt.

If we decide to group the subjects on the basis of the place where they live, we get three groups, or clusters: one containing the cat and the eagle, which live outside water, one containing the salmon, which lives in water, and the last containing the frog, which can live both inside and outside the water. In contrast, we get only two clusters if we focus the attention on the ability to fly, since only the eagle is able to fly, while the other subjects, the frog, the cat, and the salmon, are unable to fly. This trivial example clearly shows how fragile the results of unsupervised pattern recognition methods can be, since they strongly depend on the variables that are associated with the statistical units and on the criteria with which the statistical units are grouped into discrete clusters. Nevertheless, these statistical techniques are very precious in the real life of data miners. When little is known about the structure of the data, cluster analysis can provide a starting point for further investigations.

Defining a cluster is, per se, a rather ambiguous exercise.

Certainly, the classification of entities is a very ancient human ability, deeply rooted in human nature. Anybody possesses the ability to recognize a dog and to put into the dog group a newly observed animal that presents the basic features a dog must possess. Nevertheless, the theoretical definition of a cluster, the cluster of the dogs, is very ambiguous. Several definitions have been proposed, and the closest to human perception is that of natural clusters. They are defined as continuous regions of the space containing a high density of subjects, separated from other clusters by regions containing a low density of subjects. The exact separation between two clusters is therefore defined in a rather arbitrary way. Natural clusters are also termed hard, or crisp, since each single subject may belong to one and only one cluster. Alternatively, it is possible to use the concept of fuzzy cluster and allow the subjects to belong to more than a single cluster, proportionally to their degree of similarity with each cluster. Although this second approach may be extremely useful in various disciplines, from neurobiology to economics, we prefer to concentrate here on the concept of natural clusters. From an operational point of view, this means that similar statistical units must be grouped together while dissimilar units must be put in different clusters.

Given its intrinsic ambiguities, it is necessary to examine accurately all the steps of cluster analysis.

First of all, the statistical variables must be carefully selected.

On the one hand, the inclusion of too many variables may have two major drawbacks: the overall analysis lacks elegance and the computations can become very expensive. On the other hand, some redundancy may be tolerated if this ensures better results.

The selection of the right set of variables depends on the data mining objective and, consequently, it must be performed or, at least, checked by experts in the field in which the data miner operates.

Second, it is necessary to define the proximity measure between subject pairs. The proximity may be evaluated by distance measures or by similarity measures. Although these are conceptually alternatives, there is little difference in practice. Many different possibilities have been explored and proposed.

Third, the clustering criterion must be defined. In other words, it is necessary to decide under which conditions two statistical units must be grouped together and also if two clusters, each containing more than one unit, must be fused into a single group. Different clustering criteria may produce different results because of the structure of the data. For example, it is obvious that compact clusters (Fig. 10.1a) should be compared in a different way than elongated clusters (Fig. 10.1b). It is nevertheless nearly impossible to select a priori the optimal clustering criterion in pattern recognition, especially when each statistical unit is associated with a high number of variables. In this case, in fact, each unit corresponds to a point in a high-dimensional space and the data structure can hardly be perceived by the common methods of human perception.

Fourth, a very large variety of clustering algorithms is available.

Also at this point, the results of a cluster analysis markedly depend on the choice of one algorithm over another. We will nevertheless concentrate the attention on a particular type of algorithm, the hierarchical one, which is the most used in molecular biology. It must, however, be remembered that the results may often change considerably by changing the strategy with which the clustering is carried out.

Finally, it is necessary to validate and interpret the results of the cluster analysis. The latter point, like the selection of the variables, strictly depends on the reason why the cluster analysis is performed and consequently relies on the scientific experience of the data miner. On the contrary, the result validation is an objective procedure intended to verify the correctness of the cluster analysis output, and it is usually performed through appropriate tests.

Fig. 10.1. Example of data that show a different clustering tendency. In (a) the points tend to cluster in a compact manner, while in (b) the clusters are rather elongated.

In conclusion, it appears that cluster analysis starts and ends with two steps that need the advice of people experienced in the field under investigation (the selection of the variables and the interpretation of the results). In between the initial and the final steps, it is necessary to define and apply four computational steps that may be common to analyses performed in different fields, like sociology, economics, or biology (definition of proximity measures, clustering criteria, clustering algorithms, and result validation).

2. Proximity Measures

The proximity between statistical units can be measured by a distance or by a similarity. This choice is a rather trivial problem, though the difference between distance and similarity must be kept in mind, especially during computations. On the contrary, it is not trivial to consider the fundamental properties of the proximity measures. They can be divided into two classes: those that are metric and those that are not. For both types of measure, given the statistical units X and Y, it must be true that

$$d(X, X) = d_{\min} \qquad (1)$$

where $d_{\min}$ is the minimal possible distance, which is encountered when the statistical unit X is compared to itself. It must, moreover, always be true that

$$-\infty < d_{\min} \le d(X, Y) < +\infty \qquad (2)$$

which means that the statistical units X and Y may be identical, if their distance is equal to $d_{\min}$, or different, if their distance is higher than $d_{\min}$. Any type of distance is moreover always commutative, since

$$d(X, Y) = d(Y, X) \qquad (3)$$

Exactly the same properties hold if the proximity is evaluated by means of similarity measures. In this case it is always true that

$$s(X, X) = s_{\max} \qquad (4)$$

$$-\infty < s(X, Y) \le s_{\max} < +\infty \qquad (5)$$

$$s(X, Y) = s(Y, X) \qquad (6)$$

Distances and similarities are metric, in the mathematical sense, only if the triangular inequality holds. This implies that, given the statistical units X, Y, and Z,

$$d(X, Z) \le d(X, Y) + d(Y, Z) \qquad (7)$$

$$s(X, Z) \ge \frac{s(X, Y)\, s(Y, Z)}{s(X, Y) + s(Y, Z)} \qquad (8)$$

This inequality is of fundamental importance in data mining, when the data must be examined by means of unsupervised pattern recognition methods. Nevertheless, we must be aware of the fact that many measures of distance or similarity used in molecular biology are not metric. For example, the proximity between protein three-dimensional structures is very often estimated by means of the root-mean-square distance between equivalent and optimally superposed pairs of atoms. This very popular proximity measure is metric only for very similar subjects (e.g., apo and holo metallo-proteins or different single point mutants), but it is not a metric when the data include proteins with very different shapes and sizes.
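As an illustration of these properties, the following minimal sketch (in Python, with illustrative function names not taken from the chapter) brute-force checks the commutativity of Eq. (3) and the triangular inequality of Eq. (7) over a small set of points; a distance that fails the check on some triple is not a metric.

```python
from itertools import permutations

def is_metric(points, dist, tol=1e-9):
    """Brute-force check of the metric properties of `dist` over `points`:
    commutativity, Eq. (3), and the triangular inequality, Eq. (7)."""
    for x in points:
        for y in points:
            if abs(dist(x, y) - dist(y, x)) > tol:
                return False
    for x, y, z in permutations(points, 3):
        if dist(x, z) > dist(x, y) + dist(y, z) + tol:
            return False
    return True

# The Euclidean distance passes the check, as expected.
euclid = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
print(is_metric([(0, 0), (1, 2), (3, 1), (2, 2)], euclid))  # True
```

Such an exhaustive check only verifies the properties on the sample at hand; it can refute metricity, as in the root-mean-square distance example above, but cannot prove it in general.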

Besides these theoretical considerations, there are three types of proximities that one might handle: the proximity between individual units, that between a single unit and a group of units, and that between two clusters, each containing more than one statistical unit.

2.1. Proximity Between Two Statistical Units

The most commonly used distance measure is the Minkowski metric.

Given two units, $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$, it is defined as

$$d_p = \left( \sum_{i=1}^{n} w_i \, |x_i - y_i|^p \right)^{1/p} \qquad (9)$$

where the weights $w_i$ may all be equal, in the case of un-weighted distances, or not, in the case of weighted distances. The parameter p can assume any positive integer value. If p = 1, the distance is also known as the Manhattan norm:

$$d_{MN} = \sum_{i=1}^{n} w_i \, |x_i - y_i| \qquad (10)$$

and if p = 2, the distance is also known as the Euclidean distance:

$$d_E = \left( \sum_{i=1}^{n} w_i \, (x_i - y_i)^2 \right)^{1/2} \qquad (11)$$

Several other distance measures have been used in various applications. For example, the $d_{\max}$ norm is defined as

$$d_{\max} = \max_{1 \le i \le n} \left( w_i \, |x_i - y_i| \right) \qquad (12)$$
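A minimal sketch of this family of norms, assuming unit weights by default (the function and variable names are illustrative, not from the chapter):

```python
def minkowski(x, y, p=2, w=None):
    """Weighted Minkowski distance, Eq. (9); p=1 gives the Manhattan
    norm, Eq. (10), and p=2 the Euclidean distance, Eq. (11)."""
    w = w if w is not None else [1.0] * len(x)
    return sum(wi * abs(xi - yi) ** p
               for wi, xi, yi in zip(w, x, y)) ** (1.0 / p)

def d_max(x, y, w=None):
    """The d_max norm, Eq. (12)."""
    w = w if w is not None else [1.0] * len(x)
    return max(wi * abs(xi - yi) for wi, xi, yi in zip(w, x, y))

X, Y = [1.0, 2.0, 3.0], [2.0, 0.0, 3.0]
print(minkowski(X, Y, p=1))  # 3.0 (Manhattan)
print(minkowski(X, Y, p=2))  # 2.236... (Euclidean)
print(d_max(X, Y))           # 2.0
```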

The $d_G$ distance includes some information about all the statistical units that are examined, since it is defined as

$$d_G = -\log_{10}\left( 1 - \frac{1}{n} \sum_{i=1}^{n} \frac{|x_i - y_i|}{M_i - m_i} \right) \qquad (13)$$

where $M_i$ and $m_i$ are the maximal and minimal values of the ith statistical variable within the ensemble of all the statistical units that are examined. Consequently, the distance $d_G$ between the units X and Y may vary if it is computed when X and Y are part of a certain ensemble of units or part of another ensemble of units.

An alternative is the $d_Q$ distance, defined as

$$d_Q = \left( \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - y_i}{x_i + y_i} \right)^2 \right)^{1/2} \qquad (14)$$

All the above distances can be applied to real-valued variables.

In the case of qualitative variables, nominal or ordinal, the distance between two statistical units must be defined in different ways. The most popular of them is certainly the Hamming distance, defined as the number of places where two vectors differ.

From a formal point of view, this can be expressed by means of the contingency table. If the variables of the units $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$ can assume m states, the contingency table is a square $m \times m$ matrix A, the elements $a_{ij}$ of which are the number of times the ith possible value present in X has been substituted by the jth possible value in Y. For example, if m = 3 and the possible values, or states, of the variables are 1, 2, and 3, the contingency table that compares the unit X = {2, 1, 2, 3, 1, 2} with Y = {2, 2, 3, 1, 2, 3} is

$$A = \begin{pmatrix} 0 & 2 & 0 \\ 0 & 1 & 2 \\ 1 & 0 & 0 \end{pmatrix} \qquad (15)$$

As an example, the element $a_{1,2}$ is equal to 2 since it happens twice that a variable x = 1 is associated with a variable y = 2 ($x_2 = 1$ and $y_2 = 2$; $x_5 = 1$ and $y_5 = 2$).

The Hamming distance $d_H$ can therefore be defined as

$$d_H = \sum_{i=1}^{m} \sum_{j=1,\, j \ne i}^{m} a_{ij} \qquad (16)$$

given that the elements $a_{ij}$, with $i \ne j$, of the contingency table indicate the number of times $x_i \ne y_i$. In the case of X = {2, 1, 2, 3, 1, 2} and Y = {2, 2, 3, 1, 2, 3}, therefore, $d_H = 5$, because only the first variables of X and Y have the same status, $x_1 = y_1 = 2$, while for all the other 5 values of i, $x_i \ne y_i$. The computation of the Hamming distance by means of the contingency table is certainly not necessary in simple cases, like that presented above, in which both the number of possible statuses of the variables (m = 3) and the number of variables associated with each statistical unit (n = 6) are small. In other instances, where both m and n can be very large, the use of the contingency table makes the computation of the Hamming distance much easier.
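The sketch below builds the contingency table of Eq. (15) and derives the Hamming distance from its off-diagonal elements, reproducing the example above (the function names are illustrative):

```python
def contingency_table(x, y, states):
    """Square m x m matrix A, Eq. (15): a[i][j] counts how often
    state i in x is substituted by state j in y."""
    idx = {s: k for k, s in enumerate(states)}
    a = [[0] * len(states) for _ in states]
    for xi, yi in zip(x, y):
        a[idx[xi]][idx[yi]] += 1
    return a

def hamming(x, y, states):
    """Hamming distance, Eq. (16): sum of the off-diagonal elements of A."""
    a = contingency_table(x, y, states)
    return sum(a[i][j] for i in range(len(states))
                       for j in range(len(states)) if i != j)

X = [2, 1, 2, 3, 1, 2]
Y = [2, 2, 3, 1, 2, 3]
print(hamming(X, Y, states=[1, 2, 3]))  # 5
```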

A different distance definition for discrete variables is

$$d_D = \sum_{i=1}^{n} |x_i - y_i| \qquad (17)$$

which is obviously equivalent to the Hamming distance when the data are binary, that is, when each variable may assume only two values, for example, 1 and 2.

If the statistical units are characterized by different types of variables, some of which may assume real number values and some others may assume only discrete values, the distance between two units cannot be measured with the methods described above and other definitions of distance must be used.

Various solutions to this problem have been proposed, the most widely used of which is based on the discretization of the real variables. If, for example, the ith variable assumes real values in the closed interval (a, b), which means that a and b are the minimal and maximal values that the ith variable assumes within the statistical units under examination or in an absolute scale, where the ith variable cannot be smaller than a or larger than b, the $x_i$ values can be described by a histogram. The interval (a, b) is divided into m subintervals and, if the variable $x_i$ falls into the jth subinterval, it is transformed into j − 1.
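A minimal sketch of this discretization, assuming m equal-width subintervals over (a, b); the function name and the clamping of the endpoints are illustrative choices, while the 0-based value j − 1 follows the text:

```python
def discretize(value, a, b, m):
    """Map a real value in the interval (a, b) to j - 1, where j is the
    index of the subinterval it falls into, out of m equal-width bins."""
    width = (b - a) / m
    j = int((value - a) / width) + 1  # 1-based subinterval index j
    j = min(max(j, 1), m)             # keep the endpoints inside the bins
    return j - 1

# Usage: 10 bins over (0, 1).
print(discretize(0.07, 0.0, 1.0, 10))  # 0 (first subinterval)
print(discretize(0.95, 0.0, 1.0, 10))  # 9 (last subinterval)
```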

As mentioned in the introductory paragraph of this section, the degree of proximity between two statistical units can be measured not only with distance measures, like those described above, but also with similarity measures.

Two similarity measures are used very often. One is the correlation coefficient. Given two vectors $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$, it is defined as

$$s_c = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (18)$$

and it ranges between −1 and +1, being 0 if the two statistical units X and Y are totally uncorrelated. The maximal value of +1 is encountered if X and Y are identical, i.e., perfectly correlated, and the minimal value of −1 indicates that X and Y are perfectly anticorrelated, i.e., X = −Y. The second measure of similarity that is used very often is the inner product. Given the statistical units described above, it is defined as

$$s_{in} = X^{T} Y = \sum_{i=1}^{n} x_i y_i \qquad (19)$$

Generally, the inner product is computed after normalization of the vectors X and Y, so that both have unit length, by means of

$$X \rightarrow \frac{X}{\sqrt{\sum_{i=1}^{n} x_i^2}}, \qquad Y \rightarrow \frac{Y}{\sqrt{\sum_{i=1}^{n} y_i^2}} \qquad (20)$$

In this way, the inner product lies in the interval (−1, +1) and depends on the angle between X and Y. Identical statistical units are associated with an inner product equal to +1, while a value equal to −1 indicates that the units are opposite, i.e., X = −Y.
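The following sketch computes both measures; for unit-length vectors the inner product of Eq. (19) coincides with the cosine of the angle between X and Y (the function names are illustrative):

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient, Eq. (18)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = (math.sqrt(sum((xi - mx) ** 2 for xi in x))
           * math.sqrt(sum((yi - my) ** 2 for yi in y)))
    return num / den

def normalized_inner_product(x, y):
    """Inner product, Eq. (19), computed after the unit-length
    normalization of Eq. (20); equals the cosine of the angle."""
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    return sum((xi / nx) * (yi / ny) for xi, yi in zip(x, y))

X, Y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(correlation(X, Y))               # 1.0 (perfectly correlated)
print(normalized_inner_product(X, Y))  # 1.0 (parallel vectors)
```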

Other measures of similarity can also be used to compare statistical units characterized by variables that can assume real values. A widely used similarity measure is, for example, the Tanimoto similarity $s_T$, defined as

$$s_T = \frac{X^{T} Y}{\|X\|^2 + \|Y\|^2 - X^{T} Y} \qquad (21)$$

and, if the vectors X and Y have been normalized to unit length, it can be rewritten as

$$s_T = \frac{X^{T} Y}{2 - X^{T} Y} \qquad (22)$$

and may range between −0.33 and +1 for opposite or identical vectors, respectively.
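A short sketch of Eq. (21); note how opposite unit vectors give −1/3, i.e., the −0.33 lower bound mentioned above:

```python
def tanimoto(x, y):
    """Tanimoto similarity for real-valued vectors, Eq. (21)."""
    xy = sum(xi * yi for xi, yi in zip(x, y))
    xx = sum(xi * xi for xi in x)
    yy = sum(yi * yi for yi in y)
    return xy / (xx + yy - xy)

print(tanimoto([1.0, 0.0], [1.0, 0.0]))   # 1.0 (identical unit vectors)
print(tanimoto([1.0, 0.0], [-1.0, 0.0]))  # -0.333... (opposite unit vectors)
```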

If the variables do not assume real values but can be associated with discrete values, or statuses, the Tanimoto measure of similarity between the vectors X and Y is defined as the ratio between the number of elements they have in common and the number of elements they do not have in common. By using the contingency table described above, the Tanimoto measure can be defined as

$$s_T = \frac{\sum_{i=1}^{m} a_{ii}}{n - \sum_{i=1}^{m} a_{ii}} \qquad (23)$$

Alternatively, it is possible to define a similarity that is based on the ratio between the number of elements that X and Y have in common and the number of variables that characterize each unit. Such a similarity measure can be computed as

$$s_A = \frac{\sum_{i=1}^{m} a_{ii}}{n} \qquad (24)$$

As for the distance measures, when the variables are of different types, the definitions of similarity described above cannot be used. In these cases, it is necessary to measure the degree of similarity by other means. As for the distance measures, a possible solution to this problem is based on the discretization of the real variables by means of histograms. The similarity between the statistical units $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$ can be measured as the sum of the similarities between each pair of variables $x_i$ and $y_i$:

$$s_Q = \sum_{i=1}^{n} s_i \qquad (25)$$

Here $s_i$ is the similarity between the ith pair of variables, and it can be computed differently depending on the type of variable. If the latter is a real number, $s_i$ may be defined as

$$s_i = 1 - \frac{|x_i - y_i|}{r_i} \qquad (26)$$

where $r_i$ is the interval of values that is possible, or is observed, within the ith variable. Thus, if $x_i = y_i$, $s_i$ reaches its maximal value, equal to 1, while if the absolute difference between $x_i$ and $y_i$ is equal to $r_i$, $s_i$ assumes its minimal value, equal to 0. On the contrary, if the ith variable is not a real variable, $s_i$ is equal to 1 if $x_i = y_i$ and to 0 if $x_i \ne y_i$. Independently of the type of variable, each individual similarity $s_i$ may have values ranging from 0 to 1, and the $s_Q$ measure of proximity has a minimal value of 0 and a maximal value equal to n, the number of variables associated with each statistical unit.
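A sketch of Eqs. (25) and (26) for units that mix real and discrete variables; the type flags and range values passed to the function are illustrative assumptions, not part of the chapter's formalism:

```python
def s_q(x, y, is_real, ranges):
    """Mixed-type similarity s_Q, Eq. (25): the per-variable similarity
    s_i follows Eq. (26) for real variables and exact matching otherwise."""
    total = 0.0
    for xi, yi, real, ri in zip(x, y, is_real, ranges):
        if real:
            total += 1.0 - abs(xi - yi) / ri   # Eq. (26)
        else:
            total += 1.0 if xi == yi else 0.0  # discrete: match / no match
    return total

# Usage: two real variables (with ranges 10 and 4) and one discrete one.
X = [3.0, 1.0, "red"]
Y = [5.0, 1.0, "blue"]
print(s_q(X, Y, is_real=[True, True, False],
          ranges=[10.0, 4.0, None]))  # 1.8 = 0.8 + 1.0 + 0.0
```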

2.2. Proximity Between a Single Unit and a Group of Units

The proximity between a single statistical unit and a group of several (two or more) units must be measured with techniques that are different from those used to measure the proximity between two statistical units. It is necessary to compute the proximity between a unit and a group of units in several circumstances, in both supervised and unsupervised pattern recognition methods.

There are two types of proximity measures between a single statistical unit and a group of various units: the single subject can be compared to all the members of the group, or it can be compared to a profile of the group, which summarizes the most important features of the group. In the first case, the problem can be handled with the definitions of proximity between pairs of units, though it is necessary to solve the problem of how to handle the n proximities between the single subject and the n units belonging to the group. In the second case, the problem is the definition of the profile that summarizes all the elements of the group, and the proximity between a single unit and a group of units is measured by the proximity between the single unit and the summarizing profile.

If the proximity between a single unit and a group of units is measured as a function of the individual proximities between the single unit and each of the elements of the group, three extreme possibilities exist. The distance can be defined as the maximal distance between the single subject and the group members, as the minimal distance between the individual unit and the elements of the group, or as the average distance between the single unit and the members of the group. Analogously, the similarity between a single unit and a group of units can be defined as the maximal, the minimal, or the average similarity between the single unit and the members of the group.

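A sketch of these three options, here with the Euclidean distance as the pairwise measure (the function names are illustrative):

```python
def group_distances(unit, group, dist):
    """Max, min, and average distance between one unit and a group,
    the three extreme options described above."""
    d = [dist(unit, member) for member in group]
    return max(d), min(d), sum(d) / len(d)

euclid = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

unit = (0.0, 0.0)
group = [(1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
print(group_distances(unit, group, euclid))  # (5.0, 1.0, 2.666...)
```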
