SIMILARITY-BASED CRITERION FUNCTIONS

One of the most popular criterion functions for clustering is the sum of squared errors. Originally, it uses Euclidean distance and evaluates a clustering in which each cluster is represented by its center (centroid, or mean in the case of numerical data).

Such clustering can be produced, for example, by the k-means algorithm. The idea of this evaluation function is that the mean m_i best represents cluster D_i if it minimizes the sum of the squared lengths of the "error" vectors x - m_i for all x in D_i. Thus, the overall evaluation of a clustering is the sum of these "intracluster errors" over all k clusters:

J_e = \sum_{i=1}^{k} \sum_{x \in D_i} \| x - m_i \|^2

By simple algebraic manipulation the mean can be eliminated from the expression for J_e, thus obtaining an equivalent form of the evaluation function based on pairwise distance between cluster members:

J_e = \sum_{i=1}^{k} \frac{1}{2\,|D_i|} \sum_{x \in D_i} \sum_{y \in D_i} \| x - y \|^2
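To make the two forms concrete, here is a minimal sketch (in Python with NumPy; the points and the two-cluster assignment are invented for illustration) that computes J_e once from the cluster means and once from pairwise distances; the two values coincide.

```python
import numpy as np

# Toy data: five 2-D points grouped into two hypothetical clusters.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [5.0, 5.0], [6.0, 5.0]])
clusters = [np.array([0, 1, 2]), np.array([3, 4])]   # member indices of D_1 and D_2

# Centroid form: J_e = sum_i sum_{x in D_i} ||x - m_i||^2
Je_centroid = sum(np.sum((X[idx] - X[idx].mean(axis=0)) ** 2) for idx in clusters)

# Pairwise form: J_e = sum_i 1/(2|D_i|) sum_{x,y in D_i} ||x - y||^2
Je_pairwise = 0.0
for idx in clusters:
    D = X[idx]
    diffs = D[:, None, :] - D[None, :, :]    # all pairwise difference vectors
    Je_pairwise += np.sum(diffs ** 2) / (2 * len(D))

print(Je_centroid, Je_pairwise)              # both print the same value (about 1.8333)
```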

For document clustering the cosine similarity is used instead of distance, and the best clustering should maximize the sum of centroid similarity function:

J_s = \sum_{i=1}^{k} \sum_{d_j \in D_i} \mathrm{sim}(c_i, d_j)

where sim(c_i, d_j) is the cosine similarity between the centroid c_i of cluster D_i and the vector d_j, defined by the cosine of the angle between the two vectors in the TFIDF vector space. That is,

\mathrm{sim}(c_i, d_j) = \frac{c_i \cdot d_j}{\| c_i \|\,\| d_j \|}

The cluster centroid c_i is the average vector in cluster D_i:

c_i = \frac{1}{|D_i|} \sum_{d_j \in D_i} d_j
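As a sketch of how the sum of centroid similarity can be computed for a given partition, the following Python fragment implements sim(c_i, d_j) and J_s directly from the definitions above; the tiny TFIDF matrix and the two-cluster partition are made up for illustration, and real TFIDF vectors would come from the vector space model described earlier.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def sum_centroid_similarity(docs, clusters):
    """J_s: for each cluster, sum the cosine similarity between its
    centroid (average vector) and each of its member documents."""
    Js = 0.0
    for idx in clusters:
        centroid = docs[idx].mean(axis=0)
        Js += sum(cosine(centroid, d) for d in docs[idx])
    return Js

# Hypothetical TFIDF matrix: 4 documents x 5 terms, rows normalized to unit length.
tfidf = np.array([[0.9, 0.1, 0.0, 0.0, 0.0],
                  [0.8, 0.2, 0.1, 0.0, 0.0],
                  [0.0, 0.0, 0.1, 0.9, 0.3],
                  [0.0, 0.1, 0.0, 0.8, 0.4]])
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)

partition = [np.array([0, 1]), np.array([2, 3])]    # an assumed 2-cluster partition
print(sum_centroid_similarity(tfidf, partition))    # close to 4 for these tight clusters
```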

The equivalent form of this function based on pairwise similarity is then

J_s = \sum_{i=1}^{k} \frac{1}{|D_i|} \sum_{d_j \in D_i} \sum_{d_l \in D_i} \mathrm{sim}(d_j, d_l)

Another simple transformation shows that this function actually uses intracluster similarity, one of the evaluation functions that control the merging of clusters in hierarchical agglomerative clustering:

J_s = \sum_{i=1}^{k} |D_i|\,\mathrm{sim}(D_i)

where sim(D_i) is the average pairwise similarity between the members of cluster D_i. In summary, the similarity-based criterion function has two equivalent forms:

centroid and pairwise similarity, depending on the clustering approach in which it is used. As we mentioned earlier, clustering that maximizes this function is called minimum variance clustering. Therefore, the functions from this family are called minimum variance criterion functions.
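The intracluster (average pairwise) similarity and the pairwise form of J_s built from it can be sketched as follows. This is only an illustrative implementation, assuming unit-length document vectors and the convention that self-pairs are included in the average.

```python
import numpy as np

def intracluster_similarity(docs):
    """Average pairwise cosine similarity sim(D_i) over all ordered pairs
    of documents in a cluster (self-pairs included)."""
    unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    sims = unit @ unit.T                     # |D_i| x |D_i| cosine similarity matrix
    return sims.sum() / len(docs) ** 2

def pairwise_Js(docs, clusters):
    """Pairwise/intracluster form of the criterion: sum_i |D_i| * sim(D_i)."""
    return sum(len(idx) * intracluster_similarity(docs[idx]) for idx in clusters)
```

A quantity of this kind can also serve as the merging criterion in agglomerative clustering: at each step the pair of clusters whose union has the highest average similarity is merged, which is essentially the average similarity criterion used for the agglomerative tree in Table 4.1.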

Another issue related to evaluating the quality of clustering is comparing clusterings with a different number of partitions or different hierarchical structures. In partitioning, the criterion function grows with the number of clusters, reaching its maximum in the extreme case of single-element clusters, whereas in hierarchical clustering it decreases while climbing the hierarchy, reaching its smallest value at the root (a single cluster for the entire sample). We illustrate the behavior of the minimum variance criterion with four examples of clustering in our department domain.

Table 4.1 shows four horizontal trees, corresponding to different clusterings of our 20 documents obtained with the agglomerative approach (average similarity criterion for merging clusters) and k-means with k = 2, 3, and 4.

The value of the sum of centroid similarity function is shown in brackets after the node number. Note that this value corresponds to the cluster that includes all leaves of the subtree rooted at the node. For example, node 3 from the agglomerative tree

TABLE 4.1  Sum of Centroid Similarity Evaluation of Four Clusterings

Agglomerative:
  1 [12.0253]
    2 [9.43932]
      3 [5.64819]
        4 [4.6522]
          5 [3.8742]
            6 [2.95322]
              7 [1.99773]
                Chemistry
                Computer Science
              Political Science
            Geography
          Anthropology
        8 [1.98347]
          Criminal Justice
          Theatre
      9 [5.44416]
        10 [2.81679]
          11 [1.97333]
            Psychology
            Sociology
          Mathematics
        12 [2.90383]
          13 [1.96187]
            Biology
            Economics
          Physics
    14 [5.40061]
      15 [2.83806]
        16 [1.98066]
          History
          Music
        Philosophy
      17 [3.81771]
        18 [2.97634]
          19 [1.99175]
            English
            Modern Languages
          Art
        Communication

k-means (k = 2):
  1 [12.0253]
    2 [8.53381]: Anthropology, Biology, Chemistry, Computer Science, Economics,
      Geography, Mathematics, Physics, Political Science, Psychology, Sociology
    3 [6.12743]: Art, Communication, Criminal Justice, English, History,
      Modern Languages, Music, Philosophy, Theatre

k-means (k = 3):
  1 [12.0253]
    2 [2.83806]: History, Music, Philosophy
    3 [6.09107]: Anthropology, Biology, Chemistry, Computer Science, Geography,
      Mathematics, Political Science
    4 [7.12119]: Art, Communication, Criminal Justice, Economics, English,
      Modern Languages, Physics, Psychology, Sociology, Theatre

k-means (k = 4):
  1 [12.0253]
    2 [3.81771]: Art, Communication, English, Modern Languages
    3 [5.44416]: Biology, Economics, Mathematics, Physics, Psychology, Sociology
    4 [2.83806]: History, Music, Philosophy
    5 [5.64819]: Anthropology, Chemistry, Computer Science, Criminal Justice,
      Geography, Political Science, Theatre

Sums for the top-level partitionings: 14.83993 (agglomerative, clusters 2 + 14),
14.6612 (k = 2), 16.05032 (k = 3), 17.74812 (k = 4)

TABLE 4.2

Partitioning                Sum of Centroid Similarity
{2, 14}                     14.8399
{3, 9, 14}                  16.493
{2, 15, 17}                 16.0951
{4, 8, 9, 14}               17.4804
{3, 9, 15, 17}              17.7481
{4, 8, 9, 15, 17}           18.7356
{3, 10, 12, 15, 17}         18.0246
{4, 8, 10, 12, 14}          17.7569
{4, 8, 10, 12, 15, 17}      19.0121

represents the cluster {Chemistry, Computer Science, Political Science, Geography, Anthropology, Criminal Justice, Theatre}, with a sum of centroid similarity of 5.64819. Therefore, the value at a node is not equal to the sum of the values at its constituent clusters. This is also the reason that the value at the root (12.0253) is the same for all clusterings. The idea of this representation is to show an evaluation of each cluster individually. Then, if we want to see the quality of a particular partitioning, we simply sum up the evaluations of its constituent clusters. These sums for the top-level partitionings (the immediate successors of the root) are shown at the bottom of the table.
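In code, evaluating a partitioning is just a lookup-and-sum over the chosen nodes; a small sketch using the per-node values of the agglomerative tree from Table 4.1:

```python
# Sum-of-centroid-similarity values of the agglomerative tree nodes (Table 4.1).
node_value = {2: 9.43932, 3: 5.64819, 4: 4.6522, 8: 1.98347, 9: 5.44416,
              10: 2.81679, 12: 2.90383, 14: 5.40061, 15: 2.83806, 17: 3.81771}

def partition_score(nodes):
    """Quality of a partitioning = sum of the evaluations of its clusters."""
    return sum(node_value[n] for n in nodes)

print(round(partition_score({2, 14}), 4))      # 14.8399, as in Table 4.2
print(round(partition_score({3, 9, 14}), 4))   # 16.493, as in Table 4.2
```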

When analyzing the agglomerative clustering, we first identify six clusters at the lowest level that jointly cover the entire sample: 4, 8, 10, 12, 15, and 17. There are also even smaller clusters; however, we do not use them as the lowest level, because then individual documents would appear at the same level, such as cluster 7 and the document Political Science. Then we look at various combinations of those basic clusters and see how the quality of the resulting partitioning changes.

All such combinations are shown in Table 4.2, along with the values of the centroid similarity criterion function (the sum of its values over the constituent clusters).

The table shows clearly that the criterion function increases with the number of partitions. There are different combinations with the same number of partitions (three, four, and five); from those we can choose the ones with the highest value of the criterion.

Intuitively, we want to create bigger clusters. However, when merging clusters, the quality of the clustering decreases. Therefore, we need a good balance between quality and size. One way to ensure this is to look at the topics (if they are known). According to its topic, cluster 8 belongs to cluster 14; however, the tree structure does not allow merging it with the latter. So we can keep cluster 8 for top-level partitioning and merge the other branches of the tree, thus obtaining the clustering {4, 8, 9, 14}. If we don't know the topic structure, a better choice would be clusters at the same level of the hierarchy (starting from the six lowest-level nontrivial clusters, because that is the way the hierarchy was created). Such clusters are usually well balanced in size and quality, too. A good choice according to this criterion is {3, 9, 14}.

For k-means clustering we use a different strategy to find the number of clusters.

We first run the algorithm with k = 5, 6, and 7 to get three additional data points and
