level of clusters. This reveals a rather uniform distribution of our document vectors in the TFIDF vector space.
Along with several other measures, intracluster similarity is widely used for evaluating clustering quality. To evaluate the overall quality of a clustering hierarchy, we take the average of intracluster similarity over all clusters. These measures are shown in the last row of Tables 3.1 and 3.2. They show that the average similarity measure produces the best clustering and that cutting off the hierarchy improves the quality of clustering. In Chapter 4 we discuss other measures for evaluating clustering quality (also called criterion functions for clustering).
k-MEANS CLUSTERING
Assume that we know in advance the number of clusters that the algorithm should produce. Then a divisive partitioning strategy would be more appropriate, because the only decision that needs to be made is how to split clusters. This also makes the method more efficient than agglomerative clustering, where all possible candidates for merging must be evaluated. The best-known approach based on this idea is k-means clustering, a simple and efficient algorithm used by statisticians for decades. The idea is to represent a cluster by the centroid of the documents that belong to it (the centroid of cluster S is defined as c = (1/|S|) Σ_{d∈S} d).
The cluster membership is determined by finding the most similar cluster centroid for each document.
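As a minimal sketch (plain Python, with function names of our own choosing), the centroid computation and the most-similar-centroid lookup might look like this:

```python
from math import sqrt

def centroid(docs):
    """c = (1/|S|) * sum of the document vectors in S."""
    n = len(docs)
    return [sum(d[i] for d in docs) / n for i in range(len(docs[0]))]

def cosine_sim(a, b):
    """Cosine similarity; for unit-length vectors this is just the dot product."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def nearest_centroid(doc, centroids):
    """Index of the most similar cluster centroid for a document."""
    return max(range(len(centroids)), key=lambda i: cosine_sim(doc, centroids[i]))
```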
The algorithm takes a set of documents S and a parameter k representing the desired number of clusters and performs the following steps:
1. Select k documents from S to be used as cluster centroids. This is usually done at random.
2. Assign documents to clusters according to their similarity to the cluster centroids (i.e., for each document, find the most similar centroid and assign that document to the corresponding cluster).
3. For each cluster, recompute the cluster centroid using the newly computed cluster members.
4. Go to step 2 until the process converges (i.e., the same documents are assigned to each cluster in two consecutive iterations or the cluster centroids remain the same).
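The four steps above can be sketched as follows (a self-contained illustration in plain Python, not a production implementation; the optional `init` parameter is our own addition so that initial centroids can be fixed for experiments like those later in this section):

```python
import random
from math import sqrt

def cosine_sim(a, b):
    """Cosine similarity; for unit-length vectors this is just the dot product."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(docs):
    """c = (1/|S|) * sum of the vectors in S."""
    n = len(docs)
    return [sum(d[i] for d in docs) / n for i in range(len(docs[0]))]

def k_means(docs, k, init=None, seed=None):
    """Similarity-based k-means following steps 1-4 above."""
    rng = random.Random(seed)
    centroids = [list(c) for c in init] if init else rng.sample(docs, k)  # step 1
    assignment = None
    for _ in range(100):  # safety cap; normally step 4 ends the loop earlier
        # step 2: assign each document to its most similar centroid
        new_assignment = [
            max(range(k), key=lambda i: cosine_sim(centroids[i], d)) for d in docs
        ]
        if new_assignment == assignment:  # step 4: membership unchanged -> converged
            break
        assignment = new_assignment
        # step 3: recompute each centroid from its newly assigned members
        for i in range(k):
            members = [d for d, a in zip(docs, assignment) if a == i]
            if members:  # keep the old centroid if a cluster empties out
                centroids[i] = centroid(members)
    return assignment, centroids
```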
The key point in the algorithm is step 2. In this step documents are moved between clusters to maximize the intracluster similarity. The criterion function for clustering is based on cluster centroids and is analogous to the sum of squared errors in distance-based clustering, which uses the mean; here we use centroids and similarity instead. Thus, the function is
J = Σ_{i=1}^{k} Σ_{d_l ∈ D_i} sim(c_i, d_l)
where c_i is the centroid of cluster D_i and sim(c_i, d_l) is the cosine similarity between c_i and d_l. Clustering that maximizes this function is called minimum variance clustering (to avoid confusion: variance is defined in terms of distance, and maximizing similarity is equivalent to minimizing distance).
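The criterion function J can be computed directly from this definition; a short sketch (function names are ours), where `clusters` holds the document vectors of each cluster and `centroids` the corresponding centroid vectors:

```python
from math import sqrt

def cosine_sim(a, b):
    """Cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def criterion_j(clusters, centroids):
    """J = sum over clusters D_i of sum over documents d_l in D_i of sim(c_i, d_l)."""
    return sum(
        cosine_sim(c, d) for c, cluster in zip(centroids, clusters) for d in cluster
    )
```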
The k-means algorithm produces minimum variance clustering but does not guarantee that it always finds the global maximum of the criterion function. After each iteration the value of J increases, but it may converge to a local maximum. Thus, the result depends greatly on the initial choice of cluster centroids. Because this choice is usually made at random, the clusterings obtained may vary from run to run.
A simple approach to dealing with this problem is to run the algorithm several times with different random number seeds and then select the clustering that maximizes the criterion function.
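This restart strategy can be expressed as a small generic wrapper (a sketch with names of our own choosing; `cluster_fn` stands for one randomized clustering run and `score_fn` for the criterion function J):

```python
import random

def best_of_restarts(cluster_fn, score_fn, runs=10, seed=0):
    """Run a randomized clustering `runs` times with different random states
    and keep the result that maximizes the criterion function J.

    cluster_fn(rng) -> one clustering; score_fn(clustering) -> its J value.
    """
    rng = random.Random(seed)
    best, best_j = None, float("-inf")
    for _ in range(runs):
        clustering = cluster_fn(rng)
        j = score_fn(clustering)
        if j > best_j:
            best, best_j = clustering, j
    return best, best_j
```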
Let us illustrate the k-means clustering algorithm using our department collection. Running the algorithm on the documents represented with all 671 features gives interesting results. In almost all runs the algorithm converges after only two iterations (for all k). This means that the initial choice of centroids in most cases fully determines the clustering; that is, after the first assignment of documents to clusters (step 2), the next assignment based on the newly computed centroids does not change the cluster membership, and thus the algorithm terminates. This behavior of k-means is typical for data without well-formed clusters. No wonder this happens with our document collection: The experiments with agglomerative clustering showed that with respect to their similarity, the documents are quite uniformly distributed in the 671-dimensional space.
To make things more interesting, we select the six terms that best represent our documents: history, science, research, offers, students, and hall. This selection is made by using an entropy-based technique that we discuss in Chapter 5 along with other feature selection methods. For now we need only a representation that separates well the data points in vector space and thus may reveal clusters in the document collection. Table 3.3 shows TFIDF vectors described in this way. Note the difference from the sparsely populated Table 1.5 that we used in Chapter 1.
Let us set k=2; that is, we want to find two clusters in this collection of documents. The algorithm first selects two documents at random as cluster centroids and then iterates assigning and reassigning documents to clusters. Let us, however, select the initial centroids manually so that we see two interesting situations. The first is when we use the two most similar documents for this purpose: Computer Science and Chemistry. Their similarity (simply the dot product, because the vectors are normalized to unit length) is 0.995461. Note that there are many very similar documents, so there is a good chance for this also to happen at random. Table 3.4 shows how clusters and the criterion function change through the iterations.
Initially, the documents selected appear in different clusters (as originally specified), but very soon Chemistry and similar documents are moved to cluster A. Meanwhile, the quality of clustering (the value of the criterion function) increases. The final clustering is unbalanced, however, and only cluster A seems to be compact with respect to document similarity. Cluster B is large and quite sparse; it includes vectors that are orthogonal, such as Criminal Justice and Economics (and many others, too), as well as very close documents, such as English and Modern Languages (similarity 0.98353). This is obviously a bad choice of initial cluster centroids, which in turn illustrates well how sensitive the k-means algorithm is to any irregularities in the data (something that is true for all search algorithms that use local criterion functions).

TABLE 3.3 TFIDF Representation of the Department Document Collection with Six Attributes

history science research offers students hall
Anthropology 0 0.537 0.477 0 0.673 0.177
Art 0 0 0 0.961 0.195 0.196
Biology 0 0.347 0.924 0 0.111 0.112
Chemistry 0 0.975 0 0 0.155 0.158
Communication 0 0 0 0.780 0.626 0
Computer Science 0 0.989 0 0 0.130 0.067
Criminal Justice 0 0 0 0 1 0
Economics 0 0 1 0 0 0
English 0 0 0 0.980 0 0.199
Geography 0 0.849 0 0 0.528 0
History 0.991 0 0 0.135 0 0
Mathematics 0 0.616 0.549 0.490 0.198 0.201
Modern Languages 0 0 0 0.928 0 0.373
Music 0.970 0 0 0 0.170 0.172
Philosophy 0.741 0 0 0.658 0 0.136
Physics 0 0 0.894 0 0.315 0.318
Political Science 0 0.933 0.348 0 0.062 0.063
Psychology 0 0 0.852 0.387 0.313 0.162
Sociology 0 0 0.639 0.570 0.459 0.237
Theatre 0 0 0 0 0.967 0.254

TABLE 3.4 k-Means Clustering with a Bad Choice of Initial Cluster Centroids

Iteration Cluster A Cluster B Criterion Function J
1 {Computer Science, Political Science} {Anthropology, Art, Biology, Chemistry, Communication,

TABLE 3.5 k-Means Clustering with a Good Choice of Initial Cluster Centroids

Iteration Cluster A Cluster B Criterion Function J
1 {Anthropology, Biology,
For the second run we choose the two least similar documents: Economics and Art.
Their similarity is 0, because they are orthogonal (in fact, there are more orthogonal vectors in our data; see Table 3.3). Table 3.5 shows how the clusters are built around these two documents in three iterations. Obviously, this choice of cluster centroids is better, because the clusters are compact and well balanced by size and content. Cluster A collects all natural science–like documents, whereas cluster B collects the artlike documents. The choice of cluster centroids is quite good; the initial value of the J function is high and does not change much through the iterations. The better quality of this clustering is also indicated by the larger final value of the criterion function compared with the previous run. So it seems that this split of our department collection makes sense, and we shall be using it in the next chapters for document labeling and supervised learning.
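The two choices of initial centroids can be checked directly against the vectors from Table 3.3 (using the values as printed there, so the dot product differs slightly from the text's 0.995461, which was computed from unrounded vectors):

```python
# Six-attribute TFIDF vectors from Table 3.3
# (history, science, research, offers, students, hall)
computer_science = [0, 0.989, 0, 0, 0.130, 0.067]
chemistry        = [0, 0.975, 0, 0, 0.155, 0.158]
economics        = [0, 0, 1, 0, 0, 0]
art              = [0, 0, 0, 0.961, 0.195, 0.196]

def dot(a, b):
    """For unit-length vectors the cosine similarity is just the dot product."""
    return sum(x * y for x, y in zip(a, b))

print(dot(computer_science, chemistry))  # bad choice: nearly identical (~0.995)
print(dot(economics, art))               # good choice: orthogonal (0)
```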
Recalling our primary goal in this chapter, organizing the web content, the clustering experiment shown in Table 3.5 can be seen as an example of creating a web directory. Assume that we know the topics of a small set of documents in a large collection. Then we can use documents from this set as cluster centroids and run k-means on the entire collection to obtain subsets according to the given topics. A similar approach based on similarity but with labeled documents (nearest-neighbor