
4.2 Cluster analysis

4.2.1 Hierarchical methods

Hierarchical clustering methods allow us to obtain a family of partitions, each associated with the subsequent levels of grouping among the observations, calculated on the basis of the available data. The different families of partitions can be graphically represented through a tree-like structure called a hierarchical clustering tree or dendrogram. This structure associates with every step of the hierarchical procedure, corresponding to a fixed number of groups g, one and only one clustering of the observations in the g groups.

A hierarchical clustering tree can be represented as in Figure 4.1, where for simplicity we suppose there are only five observations available, numbered from 1 to 5. The branches of the tree describe subsequent clusterings of the observations.

At the root of the tree, all the observations are contained in only one class. The branches of the tree indicate divisions of the observations into clusters. The five terminal nodes indicate the situation where each observation belongs to a separate group.

Agglomerative clustering is when the groups are formed from the branches towards the root (from left to right in Figure 4.1). Divisive clustering is when the groups are formed from the root towards the branches. Statistical software packages usually report the whole dendrogram, from the root to a number of terminal branches equal to the number of observations. It then remains to choose the optimal number of groups. This will identify the result of the cluster analysis, since in a dendrogram the choice of the number of groups g identifies a unique partition of the observations.

Figure 4.1 Structure of the dendrogram.

Table 4.2 Partitions corresponding to the dendrogram in Figure 4.1.

Number of clusters    Clusters
5                     (1) (2) (3) (4) (5)
4                     (1,2) (3) (4) (5)
3                     (1,2) (3,4) (5)
2                     (1,2) (3,4,5)
1                     (1,2,3,4,5)

For example, the partitions of the five observations described by the dendrogram in Figure 4.1 can be represented as in Table 4.2.

Table 4.2 shows that the partitions described by a dendrogram are nested. This means that, in the hierarchical methods, the elements that are united (or divided) at a certain step remain united (or separated) until the end of the clustering process. Suppose we consider an agglomerative method that proceeds from five groups to one group; then units 1 and 2 are united at the second step and remain in the same group until the end of the procedure. Nesting reduces the number of partitions to compare, making the procedure computationally more efficient, but the disadvantage is that clustering errors committed in the preceding steps cannot be 'corrected' later.
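As an illustration, the following minimal sketch (assuming the SciPy library is available; the five one-dimensional observations are chosen purely for illustration and are not the data behind Figure 4.1) cuts a single-linkage dendrogram at successive levels and recovers nested partitions like those in Table 4.2:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five illustrative one-dimensional observations
X = np.array([[0.0], [1.0], [10.0], [11.5], [15.0]])

# Agglomerative clustering with single linkage (one possible choice)
Z = linkage(X, method="single")

# Cutting the same tree into 5, 4, 3, 2 and 1 clusters gives nested partitions
for k in range(5, 0, -1):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, labels)
```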

Here is an outline for an agglomerative clustering algorithm:

1. Initialisation. Given $n$ statistical observations to classify, every element represents a group (in other words, the procedure starts with $n$ clusters). The clusters are identified with a number that goes from 1 to $n$.

2. Selection. The two 'nearest' clusters are selected, in terms of the distance fixed initially, for example the Euclidean distance.

3. Updating. The number of clusters is updated (to $n-1$) through the union, in a unique cluster, of the two groups selected in step 2. The matrix of the distances is updated by taking the two rows (and two columns) of distances between the two clusters and replacing them with a single row (and a single column) of distances 'representative' of the new group. Different clustering methods define this representation in different ways.

4. Repetition. Steps 2 and 3 are performed $n-1$ times.

5. End. The procedure stops when all the elements are incorporated in a unique cluster.
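The following is a minimal from-scratch sketch of this agglomerative procedure, assuming a precomputed symmetric NumPy distance matrix `D` and using single linkage (one possible choice) as the updating rule in step 3:

```python
import numpy as np

def agglomerate(D):
    """Naive agglomerative clustering on an n x n distance matrix D (single linkage)."""
    n = D.shape[0]
    clusters = [[i] for i in range(n)]            # step 1: start with n singleton clusters
    merges = []
    while len(clusters) > 1:                      # step 4: repeat steps 2 and 3, n - 1 times
        # Step 2: select the two 'nearest' clusters
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[r, s] for r in clusters[a] for s in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Step 3: update, replacing the two selected groups with their union
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
        merges.append((d, sorted(merged)))
    return merges                                 # step 5: all elements in a unique cluster
```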

We will now look at some of the different clustering methods mentioned in step 3. They will be introduced with reference to two groups, $C_1$ and $C_2$. Some methods require only the distance matrix and some require the distance matrix plus the original data matrix. These examples require only the distance matrix:

Single linkage. The distance between two groups is defined as the minimum of the $n_1 n_2$ distances between each observation of group $C_1$ and each observation of group $C_2$:

$$d(C_1, C_2) = \min(d_{rs}), \quad \text{with } r \in C_1,\; s \in C_2.$$

Complete linkage. The distance between two groups is defined as the maximum of the $n_1 n_2$ distances between each observation of a group and each observation of the other group:

$$d(C_1, C_2) = \max(d_{rs}), \quad \text{with } r \in C_1,\; s \in C_2.$$

Average linkage. The distance between two groups is defined as the arithmetic average of the $n_1 n_2$ distances between each of the observations of a group and each of the observations of the other group:

$$d(C_1, C_2) = \frac{1}{n_1 n_2} \sum_{r=1}^{n_1} \sum_{s=1}^{n_2} d_{rs}, \quad \text{with } r \in C_1,\; s \in C_2.$$
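As a small sketch, the three linkage rules above can be computed directly from the distances; here `D` is assumed to be the full NumPy distance matrix and `C1`, `C2` are lists of row/column indices (the names are illustrative):

```python
import numpy as np

def linkage_distances(D, C1, C2):
    """Single, complete and average linkage distances between groups C1 and C2."""
    block = np.array([[D[r, s] for s in C2] for r in C1])   # the n1 x n2 block of distances
    return {
        "single": block.min(),       # minimum of the n1*n2 distances
        "complete": block.max(),     # maximum of the n1*n2 distances
        "average": block.mean(),     # arithmetic average of the n1*n2 distances
    }
```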

Two methods that require the data matrix as well as the distance matrix are the centroid method and Ward’s method.

Centroid method

The distance between two groups $C_1$ and $C_2$, having $n_1$ and $n_2$ observations respectively, is defined as the distance between the respective centroids (usually the means), $\bar{x}_1$ and $\bar{x}_2$:

$$d(C_1, C_2) = d(\bar{x}_1, \bar{x}_2).$$

To calculate the centroid of a group of observations we need the original data, and we can obtain that from the data matrix. It will be necessary to replace the distances with respect to the centroids of the two previous clusters by the distances with respect to the centroid of the new cluster. The centroid of the new cluster can be obtained from

$$\frac{\bar{x}_1 n_1 + \bar{x}_2 n_2}{n_1 + n_2}.$$

Note the similarity between this method and the average linkage method: the average linkage method considers the average of the distances among the observations of each of the two groups, while the centroid method calculates the centroid of each group and then measures the distance between the centroids.
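A brief sketch of this update rule (variable names are illustrative): the centroid of the merged cluster is the size-weighted mean of the two previous centroids, following the formula above.

```python
import numpy as np

def merged_centroid(x1, n1, x2, n2):
    """Centroid of the union of two clusters with centroids x1, x2 and sizes n1, n2."""
    return (n1 * np.asarray(x1) + n2 * np.asarray(x2)) / (n1 + n2)

# Example: merging clusters of sizes 3 and 2 in two dimensions
print(merged_centroid([1.0, 2.0], 3, [4.0, 6.0], 2))   # -> [2.2  3.6]
```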

Ward’s method

In choosing the groups to be joined, Ward’s method minimises an objective function using the principle that clustering aims to create groups which have maximum internal cohesion and maximum external separation.

The total deviance ($T$) of the $p$ variables, corresponding to $n$ times the trace of the variance–covariance matrix, can be divided into two parts: the deviance within the groups ($W$) and the deviance between the groups ($B$), so that $T = W + B$. This is analogous to dividing the variance into two parts for linear regression (Section 4.3). In that case $B$ is the variance explained by the regression and $W$ is the residual variance, the variance not explained by the regression. In formal terms, given a partition into $g$ groups, the total deviance of the $p$ variables corresponds to the sum of the deviances of the single variables, with respect to the overall mean $\bar{x}_s$, defined by

$$T = \sum_{s=1}^{p} \sum_{i=1}^{n} (x_{is} - \bar{x}_s)^2.$$

The deviance within groups is given by the sum of the deviances of each group,

$$W = \sum_{k=1}^{g} W_k,$$

where $W_k$ represents the deviance of the $p$ variables in the $k$th group (with $n_k$ observations and centroid $\bar{x}_k = [\bar{x}_{1k}, \dots, \bar{x}_{pk}]$), given by

$$W_k = \sum_{s=1}^{p} \sum_{i=1}^{n_k} (x_{is} - \bar{x}_{sk})^2.$$

The deviance between groups is given by the sum (calculated on all the variables) of the weighted deviances of the group means with respect to the corresponding general averages:

$$B = \sum_{s=1}^{p} \sum_{k=1}^{g} n_k (\bar{x}_{sk} - \bar{x}_s)^2.$$
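As a quick numerical check of the decomposition $T = W + B$ (the data and the partition into $g = 2$ groups are purely illustrative):

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0], [10.0, 10.0]])
groups = [np.array([0, 1]), np.array([2, 3, 4])]      # a partition into g = 2 groups

xbar = X.mean(axis=0)                                  # overall means of the p variables
T = ((X - xbar) ** 2).sum()                            # total deviance
W = sum(((X[g] - X[g].mean(axis=0)) ** 2).sum() for g in groups)              # within groups
B = sum(len(g) * ((X[g].mean(axis=0) - xbar) ** 2).sum() for g in groups)     # between groups

print(T, W + B)   # the two values coincide
```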

Using Ward's method, at each step the groups to be joined are chosen so that the increase in $W$ is as small as possible and, correspondingly, $B$ remains as large as possible. This achieves the greatest possible internal cohesion and external separation. Notice that it does not require preliminary calculation of the distance matrix. Ward's method can be interpreted as a variant of the centroid method, which does require the distance matrix.
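Ward's method is widely available in statistical software; as a minimal sketch, SciPy's implementation can be invoked with method="ward" (the simulated data below are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two simulated groups of observations in two dimensions
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)),
               rng.normal(5.0, 1.0, size=(20, 2))])

# Ward's method: at each step join the groups giving the smallest increase in W
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```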

How do we choose which method to apply?

In practice, there is no method that gives the best result with every type of data. It is advisable to experiment with the different alternatives and compare them in terms of the chosen criteria. A number of criteria are discussed in the following subsection and more generally in Chapter 5.

Divisive clustering algorithms

The algorithms used for divisive clustering are very similar to those used for tree models (Section 4.5). In general, they are less used in routine applications, as they tend to be more computationally intensive. However, although a naïve implementation of divisive methods requires $n^2$ distance calculations on the first iteration, subsequent divisions are on much smaller cluster sizes. Also, efficient implementations do not compute all pairwise distances but only those that are reasonable candidates for being the closest together.