
WEB CONTENT MINING


The information retrieval approaches discussed in Chapters 1 and 2 provide content-based access to the Web, while machine learning and data mining approaches organize the Web by content and thus respond directly to the major challenge of turning web data into web knowledge. Combined, information retrieval, machine learning, and data mining provide a general framework for mining the web structure and content. In this part of the book we look into the machine learning and data mining components of the framework by focusing on two approaches to organizing the Web: clustering and classification (categorization). In clustering, web documents are grouped or organized hierarchically by similarity; that is, clustering is concerned with automatic structuring of the web content. Web document classification is based primarily on prediction methods, where documents are labeled by topic, preference, or usage, given sets of documents already labeled. Concept learning methods can also be used to generate explicit descriptions of sets of web documents, which can then be applied to categorization of new documents or to better understand the document area or topic.

Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, by Zdravko Markov and Daniel T. Larose. Copyright © 2007 John Wiley & Sons, Inc.


CHAPTER 3

CLUSTERING

INTRODUCTION

HIERARCHICAL AGGLOMERATIVE CLUSTERING

k-MEANS CLUSTERING

PROBABILITY-BASED CLUSTERING

COLLABORATIVE FILTERING (RECOMMENDER SYSTEMS)

INTRODUCTION

The most popular approach to learning is by example. Given a set of objects, each labeled with a class (category), the learning system builds a mapping between objects and classes which can then be used for classifying new (unlabeled) objects. As the labeling (categorization) of the initial (training) set of objects is done by an agent external to the system (a teacher), this setting is called supervised learning.

Clustering is another setting for learning, which does not use labeled objects and therefore is unsupervised. The objective of clustering is finding common patterns, grouping similar objects, or organizing them in hierarchies. In the context of the Web, the objects manipulated by the learning system are web documents, and the class labels are usually topics or user preferences. Thus, a supervised web learning system would build a mapping between documents and topics, while a clustering system would group web documents or organize them in hierarchies according to their topics.

A clustering system can be useful in web search for grouping search results into closely related sets of documents. Clustering can improve similarity search by focusing on sets of relevant documents. Hierarchical clustering methods can be used to create topic directories automatically, directories that can then be labeled manually. On the other hand, manually created topic directories can be matched against clustering hierarchies, and thus the accuracy of the topic directories can be evaluated. Clustering is also a valuable technique for analyzing the Web. Matching the content-based clustering and the hyperlink structure can reveal patterns, duplications, and other interesting structures on the Web.

There exist various types of clustering, depending on how clusters are represented, what properties the clusters have, and what types of algorithms are used.

Thus, there are four dimensions along which clustering techniques can be categorized:

1. Model-based (conceptual) versus partitioning. Conceptual clustering creates models (explicit representations) of clusters, whereas partitioning simply enumerates the members of each cluster.

2. Deterministic versus probabilistic. Cluster membership may be defined as a Boolean value (in deterministic clustering) or as a probability (in probabilistic clustering).

3. Hierarchical versus flat. Flat clustering splits the set of objects into subsets, whereas hierarchical clustering creates tree structures of clusters.

4. Incremental versus batch. Batch algorithms use the entire set of objects to create the clustering, whereas incremental algorithms take one object at a time and update the current clustering to accommodate it.

Clustering can be applied to any set of objects as long as a suitable representation of these objects exists. The most common representation, which also works for other machine learning and data mining methods (such as classification), is the attribute–value (or feature–value) representation. In this representation a number of attributes (features) are identified for the entire population, and each object is represented by a set of attribute–value pairs. Alternatively, if the order of the features is fixed, a vector of values (data points) can be used instead. The document vector space model is exactly the same type of representation, where the features are terms.
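As a small illustration (the toy documents, vocabulary, and variable names below are invented for this sketch, not taken from the book), the same collection can be written either as attribute–value pairs or, with a fixed term order, as term-count vectors:

```python
# Two toy documents represented (a) as attribute-value pairs and
# (b) as vectors over a fixed, ordered vocabulary (term counts).
docs = [
    "web mining finds patterns in web content",
    "clustering groups similar documents by content",
]

# Fixed feature order: the vocabulary of the whole collection.
vocabulary = sorted({term for doc in docs for term in doc.split()})

# (a) attribute-value representation: {feature: value} per document
attr_value = [{t: doc.split().count(t) for t in set(doc.split())} for doc in docs]

# (b) vector representation: one value per vocabulary position
vectors = [[doc.split().count(t) for t in vocabulary] for doc in docs]

print(vocabulary)
print(attr_value[0])   # e.g. {'web': 2, 'mining': 1, ...}
print(vectors[0])
```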

Another approach to clustering considers documents as outcomes of random processes and tries to identify the parameters of these processes. In our classification this is a probabilistic model-based approach, where each cluster is described by the probability distribution most likely to have generated the documents within it. In this generative document modeling framework, documents are random events represented by the terms they contain. However, in contrast to the vector space model, the terms here are not features (dimensions in the vector space); rather, they are considered as elementary (atomic) random events. Another difference with the vector space approach is that generative models do not use similarity measures or distances between documents or clusters.

By using the standard vector space representation, all clustering approaches known from machine learning and data mining can be applied to document clustering. However, for various reasons (such as quality of results and efficiency) there are several algorithms that are particularly important in this area. These are hierarchical agglomerative clustering and k-means clustering. In the remainder of this chapter we describe these algorithms briefly and focus on their use for document clustering (for more information on the algorithms, see Larose [1]). We also describe the generative document modeling framework and a particular probabilistic clustering algorithm called expectation maximization (EM). We conclude the chapter with a short discussion of collaborative filtering (recommender systems), an area that uses clustering and involves not only documents but Web users, too. Approaches to evaluating clustering quality and clustering models are discussed in Chapter 4.

Hierarchical Agglomerative Clustering

Hierarchical agglomerative clustering is a hierarchical partitioning approach to clustering. It produces a nested sequence of partitions of the set of data points, which can be displayed as a tree with a single cluster including all points at the root and singleton clusters (individual points) at the leaves. The visualization of a hierarchical partitioning tree is called a dendrogram. The dendrogram shown in Figure 3.1 displays the hierarchical partitioning of the set of numbers {1, 2, 4, 5, 8, 10}. The height of the lines connecting clusters indicates the similarity between clusters at that level; the actual similarity value can be read on the vertical axis on the left. The similarity measure used in this example is computed as (10 − d)/10, where d is the distance between data points or cluster centers.
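The example can be reproduced with a short sketch that follows the description above: clusters are merged by the distance d between their centers, reported as the similarity (10 − d)/10. This is only an illustration of the example, not the book's implementation:

```python
# Agglomerative merging of {1, 2, 4, 5, 8, 10} using distance between
# cluster centers, with similarity (10 - d) / 10 as in the text.
def centroid(cluster):
    return sum(cluster) / len(cluster)

def similarity(a, b):
    return (10 - abs(centroid(a) - centroid(b))) / 10

clusters = [[1], [2], [4], [5], [8], [10]]
while len(clusters) > 1:
    # find the most similar pair of clusters
    i, j = max(
        ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
        key=lambda ab: similarity(clusters[ab[0]], clusters[ab[1]]),
    )
    print(f"merge {clusters[i]} and {clusters[j]} at similarity "
          f"{similarity(clusters[i], clusters[j]):.2f}")
    merged = clusters[i] + clusters[j]
    clusters = [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]
```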

There are two approaches to generating a hierarchical partitioning, the first called agglomerative. The algorithm starts with the data points and at each step merges the two closest (most similar) points (or clusters at later steps) until a single cluster remains. The second approach to hierarchical partitioning, called divisible, starts with the original set of points and at each step splits a cluster until only individual points remain. To implement this approach, we need to decide which cluster to split and how to perform the split.

The agglomerative hierarchical clustering approach is more popular, as it requires only the definition of a distance or similarity function on clusters or points. For data points in Euclidean space, the Euclidean distance is the best choice. For documents represented as TFIDF vectors, the preferred measure is the cosine similarity, defined as

$$\mathrm{sim}(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\|\,\|d_2\|}$$

where d1 and d2 are the document vectors, · denotes the dot product, and ||d1|| and ||d2|| are the lengths of the vectors (their L2 norms). If the vectors are normalized to unit length, the similarity is just the dot product.
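A minimal sketch of the cosine measure for plain term-count vectors (the example vectors are invented; TFIDF weighting is omitted for brevity):

```python
import math

def cosine_similarity(d1, d2):
    """Cosine similarity between two document vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# For unit-length vectors the measure reduces to the dot product.
d1, d2 = [1, 2, 0, 1], [0, 1, 1, 1]
print(cosine_similarity(d1, d2))   # ~0.707
```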

[Figure 3.1 Hierarchical partitioning of the set {1, 2, 4, 5, 8, 10}, corresponding to the nested clustering {{{1, 2}, {4, 5}}, {8, 10}}. The vertical axis shows similarity (from 0.4 to 0.9); a cutoff of q = 0.75 is marked.]

What remains to be done is generalizing the similarity measure so that it also applies to clusters. There are several options for this. Assume that S1 and S2 are two sets (clusters) of documents. Then the similarity sim(S1, S2) can be defined in one of the following ways:

• Similarity between cluster centroids: $\mathrm{sim}(S_1, S_2) = \mathrm{sim}(c_1, c_2)$, where the centroid of cluster $S$ is $c = \frac{1}{|S|}\sum_{d \in S} d$.

• The maximum similarity between documents from each cluster: $\mathrm{sim}(S_1, S_2) = \max_{d_1 \in S_1,\, d_2 \in S_2} \mathrm{sim}(d_1, d_2)$. The algorithm that uses this measure is called nearest-neighbor clustering.

• The minimum similarity between documents from each cluster: $\mathrm{sim}(S_1, S_2) = \min_{d_1 \in S_1,\, d_2 \in S_2} \mathrm{sim}(d_1, d_2)$. The algorithm that uses this measure is called farthest-neighbor clustering.

• The average similarity between documents from each cluster: $\mathrm{sim}(S_1, S_2) = \frac{1}{|S_1|\,|S_2|}\sum_{d_1 \in S_1,\, d_2 \in S_2} \mathrm{sim}(d_1, d_2)$.

Note that as defined above, the similarity between clusters also works for individual documents, because each document d can be represented as a singleton cluster {d}.
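The four options can be sketched as follows, reusing a document-level similarity such as the cosine function above; the helper names are ours, not the book's (nearest-neighbor and farthest-neighbor correspond to what is often called single link and complete link):

```python
def centroid(cluster):
    """Mean vector of a cluster of equal-length document vectors."""
    n = len(cluster)
    return [sum(vals) / n for vals in zip(*cluster)]

def sim_centroid(s1, s2, sim):
    """Similarity between the two cluster centroids."""
    return sim(centroid(s1), centroid(s2))

def sim_max(s1, s2, sim):
    """Maximum pairwise similarity: nearest-neighbor (single link)."""
    return max(sim(d1, d2) for d1 in s1 for d2 in s2)

def sim_min(s1, s2, sim):
    """Minimum pairwise similarity: farthest-neighbor (complete link)."""
    return min(sim(d1, d2) for d1 in s1 for d2 in s2)

def sim_avg(s1, s2, sim):
    """Average pairwise similarity between the two clusters."""
    return sum(sim(d1, d2) for d1 in s1 for d2 in s2) / (len(s1) * len(s2))
```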

One of the objectives of clustering is to maximize the similarity between documents within clusters (intracluster similarity). The similarity between the clusters that are merged decreases as we climb up the hierarchy and reaches its lowest value at the top (see the dendrogram in Figure 3.1). Therefore, to obtain better overall quality of clustering we may want to stop merging clusters at some level of the hierarchy before reaching the top. To implement this idea, control parameters can be introduced. There are two options for such parameters: stop merging when a desired number of clusters is reached, or when the similarity between the clusters to be merged falls below a specified threshold. The agglomerative clustering algorithm defined below uses two control parameters (k and q) that account for both options:

1. G ← {{d} | d ∈ S} (initialize G with singleton clusters, each containing a document from S).

2. If |G| ≤ k, then exit (stop if the desired number of clusters is reached).

3. Find Si, Sj ∈ G such that (i, j) = arg max(i,j) sim(Si, Sj) (find the two closest clusters).

4. If sim(Si, Sj) < q, exit (stop if the similarity of the closest clusters is less than q).

5. Remove Si and Sj from G.

6. G = G ∪ {Si ∪ Sj} (merge Si and Sj, and add the new cluster to the hierarchy).

7. Go to step 2.

S is the initial set of documents. The algorithm terminates in step 2 or 4 when the stopping conditions are satisfied and returns the clustering in G. If k = 1 and q = 0, G is the complete clustering tree, including its root. When k > 1 there are k top-level clusters. For example, with k = 3 or q = 0.75 the hierarchy in Figure 3.1 is cut off, so that the top-level clustering includes three clusters: {1, 2}, {4, 5}, and {8, 10}.

To implement agglomerative clustering efficiently, we need to compute the similarity between all pairs of document vectors once and store the results in an interdocument similarity table. Thus, for n documents the space complexity of this approach is O(n²). It can be shown that the time complexity of the algorithm is also O(n²).
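A compact sketch of the algorithm, with the control parameters k and q and the cluster-similarity function passed in (for example, one of the options defined earlier). It records the hierarchy as a list of merges and recomputes similarities on the fly rather than caching the interdocument similarity table, so it illustrates the logic rather than an efficient implementation:

```python
def agglomerative_clustering(docs, cluster_sim, k=1, q=0.0):
    """Hierarchical agglomerative clustering of document vectors.

    docs        -- list of document vectors
    cluster_sim -- similarity function on two clusters (lists of vectors)
    k           -- stop when at most this many clusters remain (step 2)
    q           -- stop when the best merge similarity drops below q (step 4)
    Returns the final top-level clusters and the list of merges (the hierarchy).
    """
    clusters = [[d] for d in docs]            # step 1: singleton clusters
    merges = []
    while len(clusters) > k:                  # step 2
        # step 3: find the two most similar clusters
        i, j = max(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: cluster_sim(clusters[ab[0]], clusters[ab[1]]),
        )
        best = cluster_sim(clusters[i], clusters[j])
        if best < q:                          # step 4
            break
        merged = clusters[i] + clusters[j]    # steps 5-6: merge and record
        merges.append((clusters[i], clusters[j], best))
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]
    return clusters, merges
```

For instance, agglomerative_clustering(vectors, lambda s1, s2: sim_max(s1, s2, cosine_similarity), q=0.04) would mimic the nearest-neighbor experiment with q = 0.04 described below, assuming the helpers from the earlier sketches.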

Let us now see how hierarchical agglomerative clustering can help organize a set of web documents. Consider our collection of web pages describing the departments in the CCSU School of Arts and Sciences, which we used in Chapter 1 to illustrate web search approaches. We already have the TFIDF vector representation of the documents in this collection. For the agglomerative clustering experiments we shall use all 671 terms as features. Let us first see how the parameter q affects clustering.

Table 3.1 shows two horizontal trees representing clusterings obtained with q = 0 (left) and q = 0.04 (right), both produced by the nearest-neighbor algorithm (k is set to 0).

The documents appear as leaves in the trees. Nodes are numbered sequentially (the root is marked as 1), and next to each node number the similarity between the clusters merged into that node is shown in brackets. The level of the hierarchy is indicated by the indentation: the smaller the indentation, the higher the tree level (the root has no indent), and nodes aligned at the same indentation are siblings.

With q = 0 the algorithm always merges the closest clusters until the root is reached, no matter how low the similarity between them is. By setting q = 0.04, we avoid merging clusters with similarity below 0.04. These are clusters 4 and 6 (which merge in 3) and 3 and 9 (which merge in 2) on the left tree. Thus, we get a hierarchy with four clusters at the highest level (shown in the right tree): 2, 11, 13, and 16. The similarity value at the root is not shown because it applies to two clusters only.

The similarity values suggest that the documents grouped together in clusters at the bottom of the tree are those that are most similar. These clusters are also quite intuitive. For example, Biology and Mathematics (cluster 10, right), grouped with Psychology (cluster 9) and then with Physics (cluster 8), form a nice cluster of similar science disciplines. Another nice grouping is the humanities cluster (4): History, Philosophy, English, and Languages. The advantage of having a hierarchy is that it also works as a flat clustering and at the same time shows the internal structure of the clusters. Thus, the immediate successors of any single node (including the root) represent a flat clustering of the documents at the leaves of the tree rooted at that node.

The hierarchy represents a generality/specificity relation between documents based on the terms they contain. For example, the History–Philosophy pair (cluster 5) and the English–Languages pair (cluster 6) each have an overlap of terms representing a common topic, while their parent cluster (4) has an overlap with all documents included, thus representing a more general topic that includes the two subtopics. These observations suggest that hierarchical clustering can be used to create topic directories. For example, the cluster of humanities can be seen as a starting point in the process of creating a directory of documents related to the humanities. The next steps would be to add more documents to the subclusters, which can be done manually or by using the classification approaches discussed later.

It is interesting to see how the type of cluster similarity function affects the clustering results. Table 3.2 shows two clusterings obtained with different cluster similarity functions.

TABLE 3.1 Hierarchical Agglomerative Clusterings Produced by the Nearest-Neighbor Algorithm (q is the cluster-similarity cutoff parameter)

q = 0:

1 [0.0224143]
    2 [0.0308927]
        3 [0.0368782]
            4 [0.0556825]
                5 [0.129523]
                    Art
                    Theatre
                Geography
            6 [0.0858613]
                7 [0.148599]
                    Chemistry
                    Music
                8 [0.23571]
                    Computer
                    Political
        9 [0.0937594]
            10 [0.176625]
                Communication
                Economics
            Justice
    11 [0.0554991]
        12 [0.0662345]
            13 [0.0864619]
                14 [0.177997]
                    History
                    Philosophy
                15 [0.186299]
                    English
                    Languages
            16 [0.122659]
                Anthropology
                Sociology
        17 [0.0952722]
            18 [0.163493]
                19 [0.245171]
                    Biology
                    Mathematics
                Psychology
            Physics

Average intracluster similarity = 0.4257

q = 0.04:

1 []
    2 [0.0554991]
        3 [0.0662345]
            4 [0.0864619]
                5 [0.177997]
                    History
                    Philosophy
                6 [0.186299]
                    English
                    Languages
            7 [0.122659]
                Anthropology
                Sociology
        8 [0.0952722]
            9 [0.163493]
                10 [0.245171]
                    Biology
                    Mathematics
                Psychology
            Physics
    11 [0.0556825]
        12 [0.129523]
            Art
            Theatre
        Geography
    13 [0.0858613]
        14 [0.148599]
            Chemistry
            Music
        15 [0.23571]
            Computer
            Political
    16 [0.0937594]
        17 [0.176625]
            Communication
            Economics
        Justice

Average intracluster similarity = 0.4516


TABLE 3.2 Hierarchical Agglomerative Clusterings Obtained Using Different Cluster Similarity Functions

[Two clustering trees; average intracluster similarity = 0.304475 (left tree) and 0.434181 (right tree).]

The farthest-neighbor clustering shows an interesting pattern. Although History and Philosophy, and English and Languages, are paired again (confirming their strong similarity), the rest of the tree is built by repeatedly grouping a cluster with a single document. This is quite typical for this approach when applied to a set of documents with a relatively uniform similarity distribution (no large variations in document-to-document similarity over the entire collection). The explanation lies in the way that similarity between clusters is computed. Assume, for example, that the current clustering of the numbers 1 through 5 is {{1, 2}, {3, 4}, 5}. The farthest-neighbor approach will put {3, 4} and 5 together, because the distance between them (5 − 3 = 2) is smaller (the similarity is bigger) than the distance between {1, 2} and {3, 4} (4 − 1 = 3), whereas for the nearest-neighbor approach, {1, 2} and {3, 4} are at the same distance (3 − 2 = 1) as {3, 4} and 5 (5 − 4 = 1).
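This toy example can be checked directly; the sketch below compares the two rules on the clusters {1, 2}, {3, 4}, and {5}, using plain distances between numbers rather than document similarities:

```python
# Nearest-neighbor (single link) vs. farthest-neighbor (complete link)
# distances between the clusters {1,2}, {3,4} and {5}.
def single_link(a, b):
    return min(abs(x - y) for x in a for y in b)

def complete_link(a, b):
    return max(abs(x - y) for x in a for y in b)

c1, c2, c3 = [1, 2], [3, 4], [5]
print("single link:  ", single_link(c1, c2), single_link(c2, c3))      # 1 1 -> a tie
print("complete link:", complete_link(c1, c2), complete_link(c2, c3))  # 3 2 -> merge {3,4} with {5}
```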

Generally, the farthest-neighbor algorithm works well when clusters are compact and roughly equal in size, whereas the nearest-neighbor algorithm can deal with irregular cluster sizes but is sensitive to outliers. For example, a small cluster or point between two large and well-separated clusters may serve as a bridge between the latter and thus produce strange results (this is known as the chain effect). In fact, the nearest-neighbor and farthest-neighbor algorithms use two extreme approaches to the computation of cluster similarity: the maximum and minimum measures. As in many other cases involving minima and maxima, they tend to be overly sensitive to some irregularities, such as outliers. A solution to these problems is the use of averaging. Two of the similarity functions that we listed earlier use averaging: similarity between centroids and average similarity. Interestingly, both functions provide insights with respect to the quality of clustering.

In step 3 of the agglomerative clustering algorithm we find the two most similar clusters, and in step 6 we merge them. When using average similarity, we take pairs of documents from each cluster and compute the average similarity over all such pairs. If, instead, we first merge the clusters and then compute the average pairwise similarity between all documents in the merger, we get a comparable measure of how close the two original clusters were. Most important, this measure also accounts for the compactness of the cluster, or the intracluster similarity, defined as follows:

$$\mathrm{sim}(S) = \frac{1}{|S|^2} \sum_{d_i, d_j \in S} \mathrm{sim}(d_i, d_j)$$
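A small numeric check of this definition (the two unit-length vectors are invented for the example; with unit vectors the cosine similarity reduces to the dot product, which also makes the centroid shortcut derived just below easy to verify):

```python
def dot(d1, d2):
    return sum(a * b for a, b in zip(d1, d2))

def intracluster_similarity(cluster, sim):
    """Average similarity over all (ordered) pairs of documents in the cluster."""
    n = len(cluster)
    return sum(sim(di, dj) for di in cluster for dj in cluster) / (n * n)

# Two unit-length vectors; for unit vectors the cosine similarity is the dot product.
cluster = [[1.0, 0.0], [0.6, 0.8]]
print(intracluster_similarity(cluster, dot))   # ~0.8

# Centroid shortcut: the squared length of the centroid gives the same value.
c = [sum(vals) / len(cluster) for vals in zip(*cluster)]
print(sum(x * x for x in c))                   # ~0.8
```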

The intracluster similarity may easily be computed once we have the cluster centroid. Assuming that the document vectors are normalized to unit length and substituting for sim(di, dj) in the formula above, we can see that the intracluster similarity is the squared length of the cluster centroid:

$$\mathrm{sim}(S) = \frac{1}{|S|^2} \sum_{d_i, d_j \in S} d_i \cdot d_j = \left( \frac{1}{|S|} \sum_{d_i \in S} d_i \right) \cdot \left( \frac{1}{|S|} \sum_{d_j \in S} d_j \right) = c \cdot c = \|c\|^2$$

The right clustering tree shown in Table 3.2 is produced by the intracluster similarity function. The structure of the hierarchy is quite regular (no merges between clusters and single documents) and shows a close range of intracluster similarities for each
