

4.2 Cluster analysis

4.2.1 Hierarchical methods

Hierarchical methods of clustering allow us to get a family of partitions, each associated with the subsequent levels of grouping among the observations, calculated on the basis of the available data. The different families of partitions can be represented graphically through a tree-like structure called a tree of hierarchical clustering or a dendrogram. This structure associates to every step of the hierarchical procedure, corresponding to a fixed number of groups g, one and only one clustering of the observations in the g groups.

A hierarchical clustering tree can be represented as in Figure 4.1, where for simplicity we suppose there are only five observations available, numbered from 1 to 5. The branches of the tree describe subsequent clusterings of the observations.

At the root of the tree, all the observations are contained in only one class. The branches of the tree indicate divisions of the observations into clusters. The five terminal nodes indicate the situation where each observation belongs to a separate group.

Agglomerative clustering is where the groups are formed from the branches to the root (left to right in Figure 4.1). Divisive clustering is where the groups are formed from the root to the branches.

Figure 4.1 Structure of the dendrogram (terminal branches labelled 1 to 5, joined progressively towards the root).

Table 4.2 Partitions corresponding to the dendrogram in Figure 4.1.

Number of clusters    Clusters
5                     (1) (2) (3) (4) (5)
4                     (1,2) (3) (4) (5)
3                     (1,2) (3,4) (5)
2                     (1,2) (3,4,5)
1                     (1,2,3,4,5)

Statistical software packages usually report the whole dendrogram, from the root to a number of terminal branches equal to the number of observations. It then remains to choose the optimal number of groups. This will identify the result of the cluster analysis, since in a dendrogram the choice of the number of groups g identifies a unique partition of the observations.
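As a hedged illustration (not taken from the text), the following Python sketch uses scipy's hierarchical clustering routines to build a dendrogram for five invented two-dimensional observations and then extracts the nested partition for each choice of the number of groups g; the data values and the choice of single linkage with Euclidean distance are assumptions made only for demonstration.

```python
# Minimal sketch: building and cutting a dendrogram with scipy.
# The data matrix X is invented for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1.0, 1.1],   # observation 1
              [1.2, 0.9],   # observation 2
              [4.0, 4.2],   # observation 3
              [4.1, 3.9],   # observation 4
              [5.5, 4.0]])  # observation 5

# Agglomerative clustering with Euclidean distance and single linkage.
Z = linkage(X, method="single", metric="euclidean")

# The whole tree, from n terminal branches up to the root
# (no_plot=True computes the layout; drawing it requires matplotlib).
tree = dendrogram(Z, labels=[1, 2, 3, 4, 5], no_plot=True)

# Cutting the tree at a chosen number of groups g gives one partition.
for g in range(5, 0, -1):
    labels = fcluster(Z, t=g, criterion="maxclust")
    print(g, labels)
```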

For example, the partitions of the five observations described by the dendrogram in Figure 4.1 can be represented as in Table 4.2.

Table 4.2 shows that the partitions described by a dendrogram are nested. This means that, in the hierarchical methods, the elements that are united (or divided) at a certain step will remain united (separated) until the end of the clustering process. Supposing we consider an agglomerative method that proceeds from 5 groups to 1 group, then units 1 and 2 are united at the second step and remain in the same group until the end of the procedure. Nesting reduces the number of partitions to compare, making the procedure computationally more efficient, but the disadvantage is not being able ‘to correct’ errors of clustering committed in the preceding steps. Here is an outline for an agglomerative clustering algorithm:

1. Initialization: given n statistical observations to classify, every element represents a group (put another way, the procedure starts with n clusters). The clusters will be identified with a number that goes from 1 to n.

2. Selection: the two ‘nearest’ clusters are selected, in terms of the distance initially fixed, for example in terms of the Euclidean distance.

3. Updating: the number of clusters is updated (to n−1) through the union, in a unique cluster, of the two groups selected in step 2. The matrix of the distances is updated, taking the two rows (and two columns) of distances between the two clusters and replacing them with only one row (and one column) of distances, ‘representative’ of the new group. Different clustering methods define this representation in different ways.

4. Repetition: steps 2 and 3 are performed n−1 times.

5. End: the procedure stops when all the elements are incorporated in a unique cluster.
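The outline above can be turned directly into code. The following is a minimal, illustrative sketch (not the text's own implementation) of the agglomerative procedure with single linkage; the distance matrix, the merge bookkeeping and the function name agglomerate are all invented for demonstration.

```python
# Illustrative sketch of the agglomerative outline, using single linkage.
# Assumes a symmetric, precomputed distance matrix D (e.g. Euclidean).
import numpy as np

def agglomerate(D):
    n = D.shape[0]
    # 1. Initialization: every observation is its own cluster, numbered 1..n.
    clusters = {i + 1: [i] for i in range(n)}
    merges = []
    # 4. Repetition: steps 2 and 3 are performed n-1 times.
    for step in range(n - 1):
        # 2. Selection: find the two 'nearest' clusters.
        best = None
        for a in clusters:
            for b in clusters:
                if a < b:
                    # Single linkage: minimum pairwise distance between groups.
                    d = min(D[r, s] for r in clusters[a] for s in clusters[b])
                    if best is None or d < best[0]:
                        best = (d, a, b)
        d, a, b = best
        # 3. Updating: merge the two selected clusters into one.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        merges.append((a, b, d))
    # 5. End: all observations now belong to a unique cluster.
    return merges

# Example with an invented distance matrix for five observations.
D = np.array([[0.0, 0.3, 3.0, 3.2, 4.0],
              [0.3, 0.0, 2.8, 3.0, 3.9],
              [3.0, 2.8, 0.0, 0.4, 1.5],
              [3.2, 3.0, 0.4, 0.0, 1.4],
              [4.0, 3.9, 1.5, 1.4, 0.0]])
print(agglomerate(D))
```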

We will now look at some of the different clustering methods mentioned in step 3. They will be introduced with reference to two groups, C1 and C2. Some methods require only the distance matrix and some require the distance matrix plus the original data matrix. These examples require only the distance matrix:

Single linkage: the distance between two groups is defined as the minimum of the $n_1 n_2$ distances between each observation of group $C_1$ and each observation of group $C_2$:

$$d(C_1, C_2) = \min(d_{rs}) \quad \text{with } r \in C_1,\ s \in C_2$$

Complete linkage: the distance between two groups is defined as the maximum of the $n_1 n_2$ distances between each observation of a group and each observation of the other group:

$$d(C_1, C_2) = \max(d_{rs}) \quad \text{with } r \in C_1,\ s \in C_2$$

Average linkage: the distance between two groups is defined as the arithmetic average of the $n_1 n_2$ distances between each of the observations of a group and each of the observations of the other group:

$$d(C_1, C_2) = \frac{1}{n_1 n_2} \sum_{r=1}^{n_1} \sum_{s=1}^{n_2} d_{rs} \quad \text{with } r \in C_1,\ s \in C_2$$
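As a small hedged illustration (not from the text), all three linkage rules can be computed from the block of the distance matrix relating the two groups; the distance matrix, the group memberships and the helper name linkage_distances below are invented.

```python
# Illustrative sketch: single, complete and average linkage distances between
# two groups C1 and C2, computed from a precomputed distance matrix D.
import numpy as np

def linkage_distances(D, C1, C2):
    # Block of pairwise distances d_rs with r in C1 and s in C2.
    block = D[np.ix_(C1, C2)]
    return {
        "single": block.min(),     # minimum of the n1*n2 distances
        "complete": block.max(),   # maximum of the n1*n2 distances
        "average": block.mean(),   # arithmetic mean of the n1*n2 distances
    }

# Invented example: groups C1 = {1, 2} and C2 = {3, 4, 5} (0-based indices).
D = np.array([[0.0, 0.3, 3.0, 3.2, 4.0],
              [0.3, 0.0, 2.8, 3.0, 3.9],
              [3.0, 2.8, 0.0, 0.4, 1.5],
              [3.2, 3.0, 0.4, 0.0, 1.4],
              [4.0, 3.9, 1.5, 1.4, 0.0]])
print(linkage_distances(D, [0, 1], [2, 3, 4]))
```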

Two methods that require the data matrix as well as the distance matrix are the method of the centroid and Ward’s method.

Method of the centroid

The distance between two groups $C_1$ and $C_2$, having $n_1$ and $n_2$ observations respectively, is defined as the distance between the respective centroids (usually the means), $\bar{x}_1$ and $\bar{x}_2$:

$$d(C_1, C_2) = d(\bar{x}_1, \bar{x}_2)$$

To calculate the centroid of a group of observations we need the original data, and we can obtain that from the data matrix. It will be necessary to replace the distances with respect to the centroids of the two previous clusters by the distances with respect to the centroid of the new cluster. The centroid of the new cluster can be obtained from

$$\frac{\bar{x}_1 n_1 + \bar{x}_2 n_2}{n_1 + n_2}$$

Note the similarity between this method and the average linkage method: the average linkage method considers the average of the distances between the obser-vations of each of the two groups, whereas the centroid method calculates the centroid of each group then measures the distance between the centroids.
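A brief sketch of this idea, with invented data and hypothetical helper names, might look as follows: the centroid of each group is its mean vector, the group distance is the (here Euclidean) distance between the two centroids, and the centroid of a merged group is the weighted combination given above.

```python
# Illustrative sketch of the centroid method: distance between the centroids
# of two groups, and the update rule for the centroid of the merged group.
import numpy as np

def centroid_distance(X1, X2):
    # X1, X2: data matrices of the observations in the two groups.
    x1_bar, x2_bar = X1.mean(axis=0), X2.mean(axis=0)
    return np.linalg.norm(x1_bar - x2_bar)

def merged_centroid(X1, X2):
    n1, n2 = len(X1), len(X2)
    x1_bar, x2_bar = X1.mean(axis=0), X2.mean(axis=0)
    # Weighted combination (x1_bar*n1 + x2_bar*n2) / (n1 + n2).
    return (x1_bar * n1 + x2_bar * n2) / (n1 + n2)

# Invented data for two small groups.
X1 = np.array([[1.0, 1.1], [1.2, 0.9]])
X2 = np.array([[4.0, 4.2], [4.1, 3.9], [5.5, 4.0]])
print(centroid_distance(X1, X2), merged_centroid(X1, X2))
```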

Ward’s method

In choosing the groups to be joined, Ward’s method minimises an objective function using the principle that clustering aims to create groups which have maximum internal cohesion and maximum external separation.

The total deviance (T) of the p variables, corresponding to n times the trace of the variance–covariance matrix, can be divided into two parts: the deviance within the groups (W) and the deviance between the groups (B), so that T = W + B. This is analogous to dividing the variance into two parts for linear regression (Section 4.3); in that case B is the variance explained by the regression and W is the residual variance, the variance not explained by the regression. In formal terms, given a partition into g groups, the total deviance (T) of the p variables corresponds to the sum of the deviances of the single variables, with respect to the overall means $\bar{x}_s$, defined by

$$T = \sum_{s=1}^{p} \sum_{i=1}^{n} (x_{is} - \bar{x}_s)^2$$

The deviance within the groups (W) is given by the sum of the deviances of each group:

$$W = \sum_{k=1}^{g} W_k$$

where $W_k$ represents the deviance of the p variables in the kth group (of size $n_k$ and centroid $\bar{x}_k = [\bar{x}_{1k}, \ldots, \bar{x}_{pk}]$), described by the following expression:

$$W_k = \sum_{s=1}^{p} \sum_{i=1}^{n_k} (x_{is} - \bar{x}_{sk})^2$$

The deviance between the groups (B) is given by the sum (calculated over all the variables) of the weighted deviances of the group means with respect to the corresponding overall means:

$$B = \sum_{s=1}^{p} \sum_{k=1}^{g} n_k (\bar{x}_{sk} - \bar{x}_s)^2$$

Using Ward’s method, groups are joined at each step so that the increase in W is as small as possible (equivalently, so that B remains as large as possible). This achieves the greatest possible internal cohesion and external separation. Notice that the method does not require preliminary calculation of the distance matrix. Ward’s method can be interpreted as a variant of the centroid method, which does require the distance matrix.
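As a hedged illustration of the decomposition T = W + B that Ward’s method relies on, the following sketch (not from the text; the data, the labels and the function name deviance_decomposition are invented) computes T, W and B for a given partition and checks that they add up.

```python
# Illustrative sketch: total (T), within-group (W) and between-group (B)
# deviance for a given partition of the data.
import numpy as np

def deviance_decomposition(X, labels):
    overall_mean = X.mean(axis=0)
    # Total deviance: squared deviations from the overall means, all variables.
    T = ((X - overall_mean) ** 2).sum()
    W = 0.0
    B = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        centroid = Xk.mean(axis=0)
        # Within-group deviance of group k.
        W += ((Xk - centroid) ** 2).sum()
        # Between-group deviance: group size times the squared deviation of
        # the group centroid from the overall mean, summed over variables.
        B += len(Xk) * ((centroid - overall_mean) ** 2).sum()
    return T, W, B

# Invented data and partition into g = 2 groups.
X = np.array([[1.0, 1.1], [1.2, 0.9], [4.0, 4.2], [4.1, 3.9], [5.5, 4.0]])
labels = np.array([1, 1, 2, 2, 2])
T, W, B = deviance_decomposition(X, labels)
print(T, W, B, np.isclose(T, W + B))  # T = W + B
```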

How do we choose which method to apply?

In practice, no single method gives the best result for every type of data. It is advisable to experiment with the different alternatives and compare them in terms of the chosen criteria. We shall see some criteria in Section 4.2.2 and, more generally, in Chapter 6.

Divisive clustering algorithms

The algorithms used for divisive clustering are very similar to those used for tree models (Section 4.5). In general, they are less used in routine applications, as they tend to be more computationally intensive. However, although a naive implementation of divisive methods requires $n^2$ distance calculations on the first iteration, subsequent divisions operate on much smaller cluster sizes. Also, efficient implementations do not compute all pairwise distances but only those that are reasonable candidates for being the closest together.