level of clusters. This reveals a rather uniform distribution of our document vectors in the TFIDF vector space.
Along with several other measures, intracluster similarity is widely used for evaluating clustering quality. To evaluate the overall quality of a clustering hierarchy, we take the average of intracluster similarity over all clusters. These measures are shown in the last row of Tables 3.1 and 3.2. They show that the average similarity measure produces the best clustering and that cutting off the hierarchy improves the quality of clustering. In Chapter 4 we discuss other measures for evaluating clustering quality (also called criterion functions for clustering).
k-MEANS CLUSTERING
Assume that we know in advance the number of clusters that the algorithm should produce. Then a divisive partitioning strategy would be more appropriate, because the only decision that needs to be made is how to split clusters. This also makes the method more efficient than agglomerative clustering, where all possible candidates for merging must be evaluated. The best-known approach based on this idea is k-means clustering, a simple and efficient algorithm used by statisticians for decades. The idea is to represent a cluster by the centroid of the documents that belong to it (the centroid of cluster S is defined as c = (1/|S|) Σ_{d∈S} d).
The cluster membership is determined by finding the most similar cluster centroid for each document.
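As a minimal sketch (plain Python, with function names of our own choosing), the centroid computation and the most-similar-centroid lookup might look like this:

```python
from math import sqrt

def centroid(docs):
    """c = (1/|S|) * sum of the document vectors in S."""
    n = len(docs)
    return [sum(d[i] for d in docs) / n for i in range(len(docs[0]))]

def cosine_sim(a, b):
    """Cosine similarity; for unit-length vectors this is just the dot product."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def nearest_centroid(doc, centroids):
    """Index of the most similar cluster centroid for a document."""
    return max(range(len(centroids)), key=lambda i: cosine_sim(doc, centroids[i]))
```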
The algorithm takes a set of documents S and a parameter k representing the desired number of clusters and performs the following steps:
1. Select k documents from S to be used as cluster centroids. This is usually done at random.
2. Assign documents to clusters according to their similarity to the cluster centroids (i.e., for each document, find the most similar centroid and assign that document to the corresponding cluster).
3. For each cluster, recompute the cluster centroid using the newly computed cluster members.
4. Go to step 2 until the process converges (i.e., the same documents are assigned to each cluster in two consecutive iterations or the cluster centroids remain the same).
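The four steps above can be sketched as follows (a self-contained illustration in plain Python, not a production implementation; the optional `init` parameter is our own addition so that initial centroids can be fixed for experiments like those later in this section):

```python
import random
from math import sqrt

def cosine_sim(a, b):
    """Cosine similarity; for unit-length vectors this is just the dot product."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(docs):
    """c = (1/|S|) * sum of the vectors in S."""
    n = len(docs)
    return [sum(d[i] for d in docs) / n for i in range(len(docs[0]))]

def k_means(docs, k, init=None, seed=None):
    """Similarity-based k-means following steps 1-4 above."""
    rng = random.Random(seed)
    centroids = [list(c) for c in init] if init else rng.sample(docs, k)  # step 1
    assignment = None
    for _ in range(100):  # safety cap; normally step 4 ends the loop earlier
        # step 2: assign each document to its most similar centroid
        new_assignment = [
            max(range(k), key=lambda i: cosine_sim(centroids[i], d)) for d in docs
        ]
        if new_assignment == assignment:  # step 4: membership unchanged -> converged
            break
        assignment = new_assignment
        # step 3: recompute each centroid from its newly assigned members
        for i in range(k):
            members = [d for d, a in zip(docs, assignment) if a == i]
            if members:  # keep the old centroid if a cluster empties out
                centroids[i] = centroid(members)
    return assignment, centroids
```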
The key point in the algorithm is step 2. In this step documents are moved between clusters to maximize the intracluster similarity. The criterion function for clustering is based on cluster centroids and is analogous to the sum of squared errors in distance-based clustering, which uses the mean; here we use centroids and similarity instead. Thus, the function is
J = Σ_{i=1}^{k} Σ_{d_l ∈ D_i} sim(c_i, d_l)
where c_i is the centroid of cluster D_i and sim(c_i, d_l) is the cosine similarity between c_i and d_l. Clustering that maximizes this function is called minimum variance clustering (to avoid confusion: variance is defined in terms of distance, and maximizing similarity is equivalent to minimizing distance).
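The criterion function J can be computed directly from this definition; a short sketch (function names are ours), where `clusters` holds the document vectors of each cluster and `centroids` the corresponding centroid vectors:

```python
from math import sqrt

def cosine_sim(a, b):
    """Cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def criterion_j(clusters, centroids):
    """J = sum over clusters D_i of sum over documents d_l in D_i of sim(c_i, d_l)."""
    return sum(
        cosine_sim(c, d) for c, cluster in zip(centroids, clusters) for d in cluster
    )
```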
The k-means algorithm produces minimum variance clustering but does not guarantee that it always finds the global maximum of the criterion function. After each iteration the value of J increases, but it may converge to a local maximum. Thus, the result depends greatly on the initial choice of cluster centroids. Because this choice is usually made at random, the clusterings obtained may vary from run to run.
A simple approach to dealing with this problem is to run the algorithm several times with different random number seeds and then select the clustering that maximizes the criterion function.
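This restart strategy can be expressed as a small generic wrapper (a sketch with names of our own choosing; `cluster_fn` stands for one randomized clustering run and `score_fn` for the criterion function J):

```python
import random

def best_of_restarts(cluster_fn, score_fn, runs=10, seed=0):
    """Run a randomized clustering `runs` times with different random states
    and keep the result that maximizes the criterion function J.

    cluster_fn(rng) -> one clustering; score_fn(clustering) -> its J value.
    """
    rng = random.Random(seed)
    best, best_j = None, float("-inf")
    for _ in range(runs):
        clustering = cluster_fn(rng)
        j = score_fn(clustering)
        if j > best_j:
            best, best_j = clustering, j
    return best, best_j
```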
Let us illustrate the k-means clustering algorithm using our department collection. Running the algorithm on the documents represented with all 671 features gives interesting results. In almost all runs the algorithm converges after only two iterations (for all k). This means that the initial choice of centroids in most cases fully determines the clustering; that is, after the first assignment of documents to clusters (step 2), the next assignment based on the newly computed centroids does not change the cluster membership, and thus the algorithm terminates. This behavior of k-means is typical for data without well-formed clusters. No wonder this happens with our document collection: The experiments with agglomerative clustering showed that with respect to their similarity, the documents are quite uniformly distributed in the 671-dimensional space.
To make things more interesting, we select the six terms that best represent our documents: history, science, research, offers, students, and hall. This selection is made by using an entropy-based technique that we discuss in Chapter 5 along with other feature selection methods. For now we need only a representation that separates well the data points in vector space and thus may reveal clusters in the document collection. Table 3.3 shows TFIDF vectors described in this way. Note the difference from the sparsely populated Table 1.5 that we used in Chapter 1.
Let us set k=2; that is, we want to find two clusters in this collection of documents. The algorithm first selects two documents at random as cluster centroids and then iterates assigning and reassigning documents to clusters. Let us, however, select the initial centroids manually so that we see two interesting situations. The first is when we use the two most similar documents for this purpose: Computer Science and Chemistry. Their similarity (simply the dot product, because the vectors are normalized to unit length) is 0.995461. Note that there are many very similar documents, so there is a good chance for this also to happen at random. Table 3.4 shows how clusters and the criterion function change through the iterations.
Initially, the documents selected appear in different clusters (as originally specified), but very soon Chemistry and similar documents are moved to cluster A. Meanwhile, the quality of clustering (the value of the criterion function) increases. The final clustering is unbalanced, however, and only cluster A seems to be compact with respect to document similarity. Cluster B is large and quite sparse; it includes vectors that are orthogonal, such as Criminal Justice and Economics (and many others, too), as well as very close documents, such as English and Modern Languages (similarity 0.98353). This is obviously a bad choice of initial cluster centroids, which in turn illustrates well how sensitive the k-means algorithm is to any irregularities in the data (something that is true for all search algorithms that use local criterion functions).

TABLE 3.3 TFIDF Representation of the Department Document Collection with Six Attributes

history science research offers students hall
Anthropology 0 0.537 0.477 0 0.673 0.177
Art 0 0 0 0.961 0.195 0.196
Biology 0 0.347 0.924 0 0.111 0.112
Chemistry 0 0.975 0 0 0.155 0.158
Communication 0 0 0 0.780 0.626 0
Computer Science 0 0.989 0 0 0.130 0.067
Criminal Justice 0 0 0 0 1 0
Economics 0 0 1 0 0 0
English 0 0 0 0.980 0 0.199
Geography 0 0.849 0 0 0.528 0
History 0.991 0 0 0.135 0 0
Mathematics 0 0.616 0.549 0.490 0.198 0.201
Modern Languages 0 0 0 0.928 0 0.373
Music 0.970 0 0 0 0.170 0.172
Philosophy 0.741 0 0 0.658 0 0.136
Physics 0 0 0.894 0 0.315 0.318
Political Science 0 0.933 0.348 0 0.062 0.063
Psychology 0 0 0.852 0.387 0.313 0.162
Sociology 0 0 0.639 0.570 0.459 0.237
Theatre 0 0 0 0 0.967 0.254

TABLE 3.4 k-Means Clustering with a Bad Choice of Initial Cluster Centroids

Iteration Cluster A Cluster B Criterion Function J
1 {Computer Science, Political Science} {Anthropology, Art, Biology, Chemistry, Communication,

TABLE 3.5 k-Means Clustering with a Good Choice of Initial Cluster Centroids

Iteration Cluster A Cluster B Criterion Function J
1 {Anthropology, Biology,
For the second run we choose the two least similar documents: Economics and Art.
Their similarity is 0, because they are orthogonal (in fact, there are more orthogonal vectors in our data; see Table 3.3). Table 3.5 shows how the clusters are built around these two documents in three iterations. Obviously, this choice of cluster centroids is better, because the clusters are compact and well balanced by size and content. Cluster A collects all natural science–like documents, whereas cluster B collects the artlike documents. The choice of cluster centroids is quite good; the initial value of the J function is high and does not change much through the iterations. The better quality of this clustering is also indicated by the larger final value of the criterion function compared with the previous run. So it seems that this split of our department collection makes sense, and we shall be using it in the next chapters for document labeling and supervised learning.
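The two choices of initial centroids can be checked directly against the vectors from Table 3.3 (using the values as printed there, so the dot product differs slightly from the text's 0.995461, which was computed from unrounded vectors):

```python
# Six-attribute TFIDF vectors from Table 3.3
# (history, science, research, offers, students, hall)
computer_science = [0, 0.989, 0, 0, 0.130, 0.067]
chemistry        = [0, 0.975, 0, 0, 0.155, 0.158]
economics        = [0, 0, 1, 0, 0, 0]
art              = [0, 0, 0, 0.961, 0.195, 0.196]

def dot(a, b):
    """For unit-length vectors the cosine similarity is just the dot product."""
    return sum(x * y for x, y in zip(a, b))

print(dot(computer_science, chemistry))  # bad choice: nearly identical (~0.995)
print(dot(economics, art))               # good choice: orthogonal (0)
```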
Recalling our primary goal in this chapter, organizing the web content, the clustering experiment shown in Table 3.5 can be seen as an example of creating a web directory. Assume that we know the topics of a small set of documents in a large collection. Then we can use documents from this set as cluster centroids and run k-means on the entire collection to obtain subsets according to the given topics. A similar approach based on similarity but with labeled documents (nearest-neighbor