
The reader may verify that the average-linkage criterion leads to the same hierarchical structure for this example as the complete-linkage criterion. In general, average linkage produces clusters that resemble those of complete linkage more closely than single linkage does.

k-MEANS CLUSTERING

The k-means clustering algorithm [1] is a straightforward and effective algorithm for finding clusters in data. The algorithm proceeds as follows.

• Step 1: Ask the user how many clusters k the data set should be partitioned into.

• Step 2: Randomly assign k records to be the initial cluster center locations.

• Step 3: For each record, find the nearest cluster center. Thus, in a sense, each cluster center “owns” a subset of the records, thereby representing a partition of the data set. We therefore have k clusters, C1, C2, . . . , Ck.

• Step 4: For each of the k clusters, find the cluster centroid, and update the location of each cluster center to the new value of the centroid.

• Step 5: Repeat steps 3 and 4 until convergence or termination.

The “nearest” criterion in step 3 is usually Euclidean distance, although other criteria may be applied as well. The cluster centroid in step 4 is found as follows. Suppose that we have n data points (a1, b1, c1), (a2, b2, c2), . . . , (an, bn, cn); the centroid of these points is their center of gravity and is located at the point

(Σ ai/n, Σ bi/n, Σ ci/n).

For example, the points (1,1,1), (1,2,1), (1,3,1), and (2,1,1) would have centroid

((1 + 1 + 1 + 2)/4, (1 + 2 + 3 + 1)/4, (1 + 1 + 1 + 1)/4) = (1.25, 1.75, 1.00).

The algorithm terminates when the centroids no longer change. In other words, the algorithm terminates when, for all clusters C1, C2, . . . , Ck, all the records “owned”

by each cluster center remain in that cluster. Alternatively, the algorithm may terminate when some convergence criterion is met, such as no significant shrinkage in the sum of squared errors:

SSE = Σ_{i=1}^{k} Σ_{p ∈ Ci} d(p, mi)²

where p ∈ Ci represents each data point in cluster i and mi represents the centroid of cluster i.
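The procedure above is short enough to sketch directly in code. The following is a minimal Python illustration, not taken from the text: the function name kmeans, the optional init argument, and the use of NumPy are choices made here for clarity. It follows steps 1 to 5 literally and reports the SSE just defined.

```python
import numpy as np

def kmeans(X, k, init=None, max_passes=100, seed=0):
    """Partition the rows of X (an n-by-d array) into k clusters (steps 1-5)."""
    rng = np.random.default_rng(seed)
    # Step 2: take k of the records (or the supplied points) as initial cluster centers.
    if init is None:
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    else:
        centers = np.asarray(init, dtype=float)
    labels = np.full(len(X), -1)
    for _ in range(max_passes):
        # Step 3: each record is "owned" by its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: terminate when no record changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: move each center to the centroid (per-coordinate mean) of its records.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    # Sum of squared errors: squared distances from records to their cluster centers.
    sse = float(((X - centers[labels]) ** 2).sum())
    return centers, labels, sse
```

Because the initial centers are chosen at random in step 2, different runs can converge to different partitions, a point the text returns to at the end of this section.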

EXAMPLE OF k-MEANS CLUSTERING AT WORK

Let’s examine an example of how the k-means algorithm works. Suppose that we have the eight data points in two-dimensional space shown in Table 8.1 and plotted in Figure 8.4, and that we are interested in uncovering k = 2 clusters.

TABLE 8.1 Data Points for k-Means Example

a       b       c       d       e       f       g       h
(1,3)   (3,3)   (4,3)   (5,3)   (1,2)   (4,2)   (1,1)   (2,1)
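For readers following along in code, the eight points can be fed to the illustrative kmeans sketch from the previous section (again, a sketch rather than the text's own software), using the same initial centers chosen in step 2 below; it converges to the same partition that the hand calculation reaches.

```python
import numpy as np

# The eight points of Table 8.1, listed in the order a through h.
X = np.array([(1, 3), (3, 3), (4, 3), (5, 3),
              (1, 2), (4, 2), (1, 1), (2, 1)], dtype=float)

centers, labels, sse = kmeans(X, k=2, init=[(1, 1), (2, 1)])
print(centers)  # cluster centers (1.25, 1.75) and (4.0, 2.75)
print(labels)   # [0 1 1 1 0 1 0 0]: cluster 1 = {a, e, g, h}, cluster 2 = {b, c, d, f}
print(sse)      # 6.25
```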

Let’s apply the k-means algorithm step by step.

• Step 1: Ask the user how many clusters k the data set should be partitioned into.

We have already indicated that we are interested in k = 2 clusters.

• Step 2: Randomly assign k records to be the initial cluster center locations. For this example, we assign the cluster centers to be m1 = (1,1) and m2 = (2,1).

• Step 3 (first pass): For each record, find the nearest cluster center. Table 8.2 contains the (rounded) Euclidean distances between each point and each cluster center m1 = (1,1) and m2 = (2,1), along with an indication of which cluster center the point is nearest to. Therefore, cluster 1 contains points {a, e, g}, and cluster 2 contains points {b, c, d, f, h}. Once cluster membership is assigned, the sum of squared errors may be found:

SSE = Σ_{i=1}^{k} Σ_{p ∈ Ci} d(p, mi)² = 4 + 5 + 8 + 13 + 1 + 5 + 0 + 0 = 36,

where each term is the squared distance from one of the points a through h to the center of the cluster that owns it. As remarked earlier, we would like our clustering methodology to maximize the between-cluster variation with respect to the within-cluster variation. Using d(m1, m2) as a surrogate for BCV and SSE as a surrogate for WCV, we have:

BCV / WCV = d(m1, m2) / SSE = 1 / 36 = 0.0278

We expect this ratio to increase with successive passes.
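The first-pass arithmetic can be checked with a few lines of Python (the point coordinates come from Table 8.1; the variable names are ours). The loop reproduces the rounded distances of Table 8.2 as well as the SSE and BCV/WCV values just computed.

```python
from math import dist  # Euclidean distance (Python 3.8+)

points = {"a": (1, 3), "b": (3, 3), "c": (4, 3), "d": (5, 3),
          "e": (1, 2), "f": (4, 2), "g": (1, 1), "h": (2, 1)}
m1, m2 = (1, 1), (2, 1)  # initial cluster centers from step 2

sse = 0.0
for name, p in points.items():
    d1, d2 = dist(p, m1), dist(p, m2)
    sse += min(d1, d2) ** 2  # squared distance to the owning center
    print(f"{name}  {d1:.2f}  {d2:.2f}  {'C1' if d1 < d2 else 'C2'}")

print("SSE =", sse)                     # approximately 36
print("BCV/WCV =", dist(m1, m2) / sse)  # approximately 0.0278
```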

rStep 4 (first pass): For each of the k clusters find the cluster centroid and update the location of each cluster center to the new value of the centroid. The


Figure 8.4 How will k-means partition this data into k = 2 clusters?


TABLE 8.2 Finding the Nearest Cluster Center for Each Record (First Pass)

Point   Distance from m1   Distance from m2   Cluster Membership
a       2.00               2.24               C1
b       2.83               2.24               C2
c       3.61               2.83               C2
d       4.47               3.61               C2
e       1.00               1.41               C1
f       3.16               2.24               C2
g       0.00               1.00               C1
h       1.00               0.00               C2

The clusters and centroids (triangles) at the end of the first pass are shown in Figure 8.5. Note that m1 has moved up to the center of the three points in cluster 1, while m2 has moved up and to the right a considerable distance, to the center of the five points in cluster 2.

• Step 5: Repeat steps 3 and 4 until convergence or termination. The centroids have moved, so we go back to step 3 for our second pass through the algorithm.

r Step 3 (second pass):For each record, find the nearest cluster center. Table 8.3 shows the distances between each point and each updated cluster centerm1 = (1,2) andm2 =(3.6,2.4), together with the resulting cluster membership. There has been a shift of a single record (h) from cluster 2 to cluster 1. The relatively large change inm2has left recordhnow closer tom1than tom2, so that record h now belongs to cluster 1. All other records remain in the same clusters as previously. Therefore, cluster 1 is{a,e,g,h}, and cluster 2 is{b,c,d,f}. The new sum of squared errors is

Figure 8.5 Clusters and centroids after first pass through the k-means algorithm.

TABLE 8.3 Finding the Nearest Cluster Center for Each Record (Second Pass)

Point   Distance from m1   Distance from m2   Cluster Membership
a       1.00               2.67               C1
b       2.24               0.85               C2
c       3.16               0.72               C2
d       4.12               1.52               C2
e       0.00               2.63               C1
f       3.00               0.57               C2
g       1.00               2.95               C1
h       1.41               2.13               C1

The new sum of squared errors is

SSE = Σ_{i=1}^{k} Σ_{p ∈ Ci} d(p, mi)² = 7.88

which is much reduced from the previous SSE of 36, indicating a better clustering solution. We also have:

BCV / WCV = d(m1, m2) / SSE = 2.63 / 7.88 = 0.3338

which is larger than the previous 0.0278, indicating that we are increasing the between-cluster variation with respect to the within-cluster variation.

• Step 4 (second pass): For each of the k clusters, find the cluster centroid and update the location of each cluster center to the new value of the centroid. The new centroid for cluster 1 is [(1 + 1 + 1 + 2)/4, (3 + 2 + 1 + 1)/4] = (1.25, 1.75). The new centroid for cluster 2 is [(3 + 4 + 5 + 4)/4, (3 + 3 + 3 + 2)/4] = (4, 2.75). The clusters and centroids at the end of the second pass are shown in Figure 8.6. Centroids m1 and m2 have both moved slightly.
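The centroid updates are just per-coordinate means, so they are easy to verify in a few lines of Python (the helper name centroid is ours), using the second-pass cluster memberships derived above.

```python
def centroid(pts):
    # Per-coordinate mean: the "center of gravity" of the points.
    return tuple(sum(coord) / len(pts) for coord in zip(*pts))

cluster1 = [(1, 3), (1, 2), (1, 1), (2, 1)]    # points a, e, g, h
cluster2 = [(3, 3), (4, 3), (5, 3), (4, 2)]    # points b, c, d, f
print(centroid(cluster1), centroid(cluster2))  # (1.25, 1.75) (4.0, 2.75)
```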

• Step 5: Repeat steps 3 and 4 until convergence or termination. Since the centroids have moved, we once again return to step 3 for our third (and, as it turns out, final) pass through the algorithm.

• Step 3 (third pass): For each record, find the nearest cluster center. Table 8.4 shows the distances between each point and each newly updated cluster center m1 = (1.25, 1.75) and m2 = (4, 2.75), together with the resulting cluster membership. Note that no records have shifted cluster membership from the preceding pass.


Figure 8.6 Clusters and centroids after second pass through the k-means algorithm.


TABLE 8.4 Finding the Nearest Cluster Center for Each Record (Third Pass)

Point   Distance from m1   Distance from m2   Cluster Membership
a       1.27               3.01               C1
b       2.15               1.03               C2
c       3.02               0.25               C2
d       3.95               1.03               C2
e       0.35               3.09               C1
f       2.76               0.75               C2
g       0.79               3.47               C1
h       1.06               2.66               C1

The new sum of squared errors is

SSE = Σ_{i=1}^{k} Σ_{p ∈ Ci} d(p, mi)² = 6.25

which is slightly smaller than the previous SSE of 7.88 and indicates that we have our best clustering solution yet. We also have:

BCV / WCV = d(m1, m2) / SSE = 2.93 / 6.25 = 0.4688

which is larger than the previous 0.3338, indicating that we have again increased the between-cluster variation with respect to the within-cluster variation. To do so is the goal of every clustering algorithm, in order to produce well-defined clusters such that the similarity within the cluster is high while the similarity to records in other clusters is low.

• Step 4 (third pass): For each of the k clusters, find the cluster centroid and update the location of each cluster center to the new value of the centroid. Since no records have shifted cluster membership, the cluster centroids therefore also remain unchanged.

• Step 5: Repeat steps 3 and 4 until convergence or termination. Since the centroids remain unchanged, the algorithm terminates.

Note that the k-means algorithm cannot guarantee finding the global minimum SSE, instead often settling at a local minimum. To improve the probability of achieving a global minimum, the analyst should rerun the algorithm using a variety of initial cluster centers. Moore [2] suggests (1) placing the first cluster center on a random data point, and (2) placing the subsequent cluster centers on points as far away from previous centers as possible.
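One way to act on that advice is simply to rerun the algorithm from several starting configurations and keep the run with the smallest SSE. The sketch below does exactly that, reusing the illustrative kmeans function defined earlier; it only varies the random seed, which is a cruder strategy than Moore's farthest-point placement but conveys the idea.

```python
def best_of_restarts(X, k, n_restarts=10):
    # Rerun k-means from different random initial centers (step 2) and
    # keep the solution with the smallest sum of squared errors.
    best = None
    for seed in range(n_restarts):
        centers, labels, sse = kmeans(X, k, seed=seed)
        if best is None or sse < best[2]:
            best = (centers, labels, sse)
    return best
```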

One potential problem in applying the k-means algorithm is: Who decides how many clusters to search for? That is, who decides k? Unless the analyst has a priori knowledge of the number of underlying clusters, an “outer loop”

should be added to the algorithm, which cycles through various promising values of k. The clustering solutions for each value of k can then be compared, with the value of k resulting in the smallest SSE being selected.
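A sketch of that outer loop, again built on the illustrative functions above: it runs k-means for each candidate value of k (with restarts, so a bad initialization does not skew the comparison) and reports the SSE for each, so the values of k can be compared as the text suggests.

```python
def sse_by_k(X, k_values, n_restarts=5):
    # Outer loop over candidate numbers of clusters.
    return {k: best_of_restarts(X, k, n_restarts)[2] for k in k_values}

# Example: sse_by_k(X, range(2, 7)) returns the best SSE found for k = 2, ..., 6.
```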

What if some attributes are more relevant than others to the problem formulation? Since cluster membership is determined by distance, we may apply the same axis-stretching methods for quantifying attribute relevance that we discussed in Chapter 5. In Chapter 9 we examine another common clustering method, Kohonen networks, which are related to artificial neural networks in structure.

APPLICATION OF k-MEANS CLUSTERING USING
