
classification) is discussed in the context of supervised learning in Chapter 5. The k-means clustering algorithm is very popular because it is simple, efficient, and works well in practice, especially when the clusters are big, balanced, and well separated.

By applying the algorithm recursively to the clusters obtained, k-means can easily be extended to produce hierarchical clustering.

Although we have described agglomerative and k-means clustering as based on cosine similarity between TFIDF vectors, these algorithms also work with other similarity functions. Two additional similarity measures based on document content have been discussed in Chapter 1: Jaccard similarity and document resemblance. The latter is especially useful for clustering large portions of the Web, because it scales up well and allows document resemblance to be computed efficiently using document sketches.

Similarity can also be defined by using the link structure of web pages. In Chapter 2 we outlined an approach to finding similar pages based on the HITS algorithm for computing hubs and authorities. Link-based similarity can be defined in other ways, too. If d1 and d2 are nodes (documents) in the web graph, the similarity between them can be defined to be:

• The length of the shortest path between d1 and d2
• The number of common ancestors of d1 and d2 (pages with links to both)
• The number of common successors of d1 and d2 (pages pointed to by links in both)

One common approach to computing the overall similarity is to take the maximum of the content similarity (cosine, Jaccard, or resemblance) and a weighted sum of the link-based similarities mentioned above.
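In code, this combination is straightforward. The following is a minimal Python sketch (not from the book); the function name and the weights are illustrative assumptions, and all similarity scores are assumed to be normalized to [0, 1]:

    # Illustrative sketch: combine content similarity with a weighted sum of
    # the three link-based similarities listed above, then take the maximum.
    def overall_similarity(content_sim, shortest_path_sim, common_ancestors_sim,
                           common_successors_sim, weights=(0.4, 0.3, 0.3)):
        w1, w2, w3 = weights                     # example weights, not prescribed
        link_sim = (w1 * shortest_path_sim
                    + w2 * common_ancestors_sim
                    + w3 * common_successors_sim)
        return max(content_sim, link_sim)

    # Example: strong link evidence dominates weak content overlap.
    print(overall_similarity(0.12, 0.8, 0.5, 0.6))   # 0.65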

PROBABILITY-BASED CLUSTERING

In probability-based clustering, a document is considered as a random event that occurs according to different probability distributions depending on the cluster (or class¹) to which the document belongs. The parameters involved in this setting are:

• The document class labels (which may be known or unknown)
• The parameters of the probability distribution for each cluster
• The way that terms are used in the document representation

The last two parameters are related because the terms are considered as random variables, and the type of the latter determines the type of the probability distributions used to model documents. Generally, there are three ways that a term can be used in this model:

¹Here we use the terms class and cluster interchangeably depending on the context. Both terms mean "topic" (or "category"), but if the labels are determined by using the group to which the document belongs, we use cluster; otherwise, we use class.

1. As a binary variable, taking value 0 or 1 depending on whether or not the term occurs in the document. This is the multivariate binary model, where documents are binary vectors following a multivariate binary distribution.

2. As a natural number, indicating the number of occurrences (frequency) of the term in the document. In this representation a document is a vector of natural numbers and its probability is computed according to the multinomial distribution.

3. As a normally distributed continuous variable taking TFIDF values. The documents in this representation are TFIDF vectors (described in the section "Vector Space Model" in Chapter 1) following a multivariate normal distribution.

The binary and multinomial distributions use the underlying document models directly: set-of-words and bag-of-words, respectively. Both ignore the ordering of terms but require little preprocessing, which could otherwise cause further loss of information or misrepresentation. Therefore, they are considered more natural than the TFIDF representation. It is also assumed that the probabilistic models take into account the importance of terms (which the IDF measure accounts for) and can even capture the notion of stopwords.
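As a small illustration of the three representations, here is a Python sketch (not taken from the book; the toy document and the document frequencies are made-up values, and the vocabulary reuses the attribute names from the department example):

    from collections import Counter
    import math

    vocabulary = ["history", "science", "research", "offers", "students", "hall"]
    document = "the department offers courses for students students research".split()
    counts = Counter(document)

    # 1. Set-of-words: binary occurrence indicators (multivariate binary model).
    binary_vector = [1 if term in counts else 0 for term in vocabulary]

    # 2. Bag-of-words: raw term frequencies (multinomial model).
    count_vector = [counts[term] for term in vocabulary]

    # 3. TFIDF: term frequency weighted by inverse document frequency
    #    (the document frequencies below are invented for the example).
    doc_freq = {"history": 3, "science": 4, "research": 6,
                "offers": 9, "students": 12, "hall": 2}
    n_docs = 20
    tfidf_vector = [counts[t] * math.log(n_docs / doc_freq[t]) for t in vocabulary]

    print(binary_vector)   # [0, 0, 1, 1, 1, 0]
    print(count_vector)    # [0, 0, 1, 1, 2, 0]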

All three models are commonly used for both probabilistic clustering and classification. Hereafter we illustrate probabilistic clustering with the TFIDF model.

The binary and multinomial models are discussed in Chapter 5 in the context of classification. The reason for this choice is the popularity of the normal distribution and the fact that it works not only for documents but in many other domains, too. It also has a nice visual representation that makes the approach easy to understand.

To start with, let us pick a term and represent each of our documents with a single value: the TFIDF component in the document vector that corresponds to that term. Because this value may describe documents of different classes (clusters), it may have different distributions. Thus, in a collection of documents from different classes we have a mixture of different distributions. In statistics this is called a finite mixture model (finite, because we assume a finite number of distributions). An example of a single-attribute two-class mixture is shown in Figure 3.2.

The data include the list of values of the term offers taken from our department collection (see Table 3.3). The label (A or B) corresponds to the cluster to which the document with that value of offers belongs according to the k-means clustering shown in Table 3.5. Thus, we have a mixture of two normal (Gaussian) distributions.

The graphs shown next to the data in Figure 3.2 are plots of the probability density functions (bell-shaped curves) of these distributions. Each distribution is defined by its mean and standard deviation. The mixture model also includes the probability of sampling for each class (the probability that a random value belongs to a particular class). All these parameters are also shown in Figure 3.2. Using the mixture model, we can define three problems: a finite mixture problem, a classification problem, and a clustering problem.

Finite Mixture Problem

Given a labeled data set (i.e., we know the class for each attribute value), the problem is to find the mean, standard deviation, and probability of sampling for each cluster. The solution to this problem is a straightforward application of the standard formulas.

Figure 3.2 Two-class mixture model for the term offers (the offers values with their class labels, the two bell-shaped density curves, and, for each class, its mean, standard deviation, and probability of sampling).

The mean μ_C for each class C is the average value of the attribute (variable x) over the documents belonging to that class:

\mu_C = \frac{1}{|C|} \sum_{x_i \in C} x_i

For our department data set we take all values with label A (Figure 3.2) and compute

\mu_A = \frac{1}{11}\,(0 + 0 + 0 + 0 + 0 + 0 + 0.49 + 0 + 0 + 0.387 + 0.57) = 0.132

Similarly, for the documents labeled B, we have

\mu_B = \frac{1}{9}\,(0.961 + 0.780 + 0 + 0.980 + 0.135 + 0.928 + 0 + 0.658 + 0) = 0.494

The standard deviation σ_C is computed analogously, as the square root of the averaged squared deviation from the mean. In practice, however, the square root of the bias-corrected variance is used (the only difference being that the denominator is |C| − 1). Thus, for our data set we have σ_A = 0.229 and σ_B = 0.449.

The probability of sampling P(C) is computed as the proportion of instances in class C relative to the size of the entire data set. That is,

P(A) = \frac{11}{20} = 0.55 \qquad P(B) = \frac{9}{20} = 0.45

The set of parameters computed above actually describes our data set. In terms of clustering, each tuple ⟨μ_C, σ_C, P(C)⟩ is a generative document model of cluster C.
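These parameters are easy to reproduce in code. The following Python sketch (not the book's code) uses the offers values and A/B labels read off Figure 3.2, as listed above:

    import math

    values_A = [0, 0, 0, 0, 0, 0, 0.49, 0, 0, 0.387, 0.57]          # 11 documents
    values_B = [0.961, 0.780, 0, 0.980, 0.135, 0.928, 0, 0.658, 0]  # 9 documents

    def mixture_parameters(values, total_size):
        n = len(values)
        mean = sum(values) / n
        # Bias-corrected variance: the denominator is |C| - 1.
        variance = sum((x - mean) ** 2 for x in values) / (n - 1)
        return mean, math.sqrt(variance), n / total_size

    total = len(values_A) + len(values_B)
    for label, vals in [("A", values_A), ("B", values_B)]:
        mu, sigma, p = mixture_parameters(vals, total)
        print(f"{label}: mean={mu:.3f}  std={sigma:.3f}  P({label})={p:.2f}")
    # A: mean=0.132  std=0.229  P(A)=0.55
    # B: mean=0.494  std=0.449  P(B)=0.45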

Classification Problem

Assume that we have already computed the mixture model for the attribute offers; that is, the three parameters μ_C, σ_C, and P(C) are known for each class C. The problem now is to find the classification of a document that has a given value x for the attribute offers. To solve the problem we need to decide from which distribution the value x comes, cluster A or B, which in turn will allow us to classify the document accordingly.

The idea is to compute the conditional probabilities P(A|x) and P(B|x). The distribution with the larger probability is then the one that x comes from.

If x comes from a discrete distribution (such as the binary distribution we discussed earlier), we may easily compute P(C|x) by applying Bayes' rule:

P(C|x) = \frac{P(x|C)\,P(C)}{P(x)}

where P(C) is known from the mixture model and P(x|C) is the probability of x according to the document distribution in cluster C. The probability P(x|C) would then simply be the number of occurrences of x in cluster C divided by the total number of documents in C. We use this approach later for classification. In the current situation, however, x is the value of a continuous random variable, and strictly speaking, the probability of a continuous random variable being exactly equal to any particular real value is zero. A practical solution to this problem is to use the value of the probability density function instead. That is,

f_C(x) = \frac{1}{\sqrt{2\pi}\,\sigma_C}\, e^{-(x-\mu_C)^2 / 2\sigma_C^2}

Of course, this is not exactly the probability we need, but it turns out to work for our purposes. The reason is that in practice we never know the exact value of a variable.

Moreover, in computer arithmetic we always have some degree of approximation.

Then, instead of P(x|C), we may use P(x − ε ≤ x ≤ x + ε | C), where ε is the accuracy of computation. To compute the latter, we may now use the density function for class C: it is the area under the bell-shaped curve between the points x − ε and x + ε, which for small ε is approximately 2εf_C(x). However, as we don't know the accuracy ε (and also because it is the same for all classes), we may simply ignore it and use the value of the density function f_C(x) as an estimate of the likelihood that x comes from distribution C. Thus, we arrive at a formula that works for the continuous case (note that we use the symbol ≈ instead of =):

P(C|x) \approx \frac{f_C(x)\,P(C)}{P(x)}

One might think that computing P(x) causes similar problems; however, we do not need this probability, because it appears in the expressions for all classes. We may calculate only the numerators and thus compare likelihoods instead of probabilities. Further, because the probabilities sum to 1 [P(A|x) + P(B|x) = 1], we may normalize the likelihoods by their sum and obtain the correct probabilities.

Let us illustrate the classification problem with a simple example. Assume that we don't know the class label of Communication. The value of the offers attribute for this document is 0.78 (the fifth row in the data table in Figure 3.2). Thus, the problem is to compute P(A|0.78) and P(B|0.78), and whichever is bigger will determine the class label of Communication. Plugging 0.78 and the corresponding values for μ_A, σ_A and μ_B, σ_B (from Figure 3.2) into the formula for f_C(x), we get

f_A(0.78) = 0.032 and f_B(0.78) = 0.725. Then

P(A|0.78) \approx f_A(0.78)\,P(A) = (0.032)(0.55) = 0.018
P(B|0.78) \approx f_B(0.78)\,P(B) = (0.725)(0.45) = 0.326

These likelihoods clearly indicate that the value 0.78 comes from the distribution of cluster B (i.e., the Communication document belongs to class B). Further, we can easily get the actual probabilities by normalization:

P(A|0.78) = \frac{0.018}{0.018 + 0.326} = 0.05
P(B|0.78) = \frac{0.326}{0.018 + 0.326} = 0.95

In fact, this classification is quite clear if we look at the plot of the densities in Figure 3.2; the value 0.78 is well under the "bell" of distribution B.
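The same computation can be written as a short Python sketch (again, not the book's code), plugging in the parameters computed earlier for the attribute offers:

    import math

    params = {  # class -> (mean, standard deviation, probability of sampling)
        "A": (0.132, 0.229, 0.55),
        "B": (0.494, 0.449, 0.45),
    }

    def density(x, mu, sigma):
        # The normal probability density function f_C(x).
        return (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
                / (math.sqrt(2 * math.pi) * sigma))

    def classify(x):
        # Likelihoods f_C(x) * P(C), then normalized to probabilities P(C|x).
        likelihoods = {c: density(x, mu, sigma) * p
                       for c, (mu, sigma, p) in params.items()}
        total = sum(likelihoods.values())
        return {c: lik / total for c, lik in likelihoods.items()}

    print(classify(0.78))   # {'A': ~0.05, 'B': ~0.95} -> Communication goes to B

The same function applied to the value 0 gives roughly 0.79 for A and 0.21 for B, which is the less conclusive case discussed next.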

Let us, however, try another value, which may not be as conclusive for the classification problem. Looking at the data table in Figure 3.2, we see that a value of 0 clearly cannot distinguish between the two classes: there are eight 0's labeled A and three 0's labeled B. Doing the actual calculations results in P(A|0) = 0.788 and P(B|0) = 0.212. Thus, all documents with value 0 will be classified as A, which means three wrongly classified documents: Criminal Justice, Music, and Theatre (originally labeled B). Obviously, one attribute is not enough to make a correct classification.

The mixture model can easily be extended to more than one attribute, provided that the independence assumption is made. It states that the joint probability of all attributes in a vector is calculated as the product of the probabilities of the individual attributes; that is,

P(x_1, x_2, \ldots, x_n | C) = \prod_{i=1}^{n} P(x_i | C)

In terms of probability theory, this means that the events of different attributes having particular values are independent. For example, knowing the value of science in the Music document should not tell us anything about the values of the other attributes in that document. This may not be true, however. In this particular example, if science is 0, we may expect that research is also 0 (because the two terms often go together). In fact, the independence assumption rarely holds (or is difficult to prove, as this would require a large amount of data), but despite that, the formula above works well in practice. This is the reason that the independence assumption is also called the naive Bayes assumption. It plays a key role in the naive Bayes classification algorithm, which we have just described. We shall revisit this algorithm in Chapter 5 and illustrate its use with different distributions, as mentioned earlier.

Using the same idea as in the one-dimensional case, we may estimate the likelihood of the vector using the density functions for its components:

P(C | (x_1, x_2, \ldots, x_n)) \approx \frac{\left[\prod_{i=1}^{n} f_{Ci}(x_i)\right] P(C)}{P((x_1, x_2, \ldots, x_n))}

Let us now illustrate how to use the naive Bayes assumption to classify the Theatre document by using more attributes. The six-dimensional vector for Theatre is (0, 0, 0, 0, 0.976, 0.254). The problem then is to compute the probabilities P(A|(0,0,0,0,0.976,0.254)) and P(B|(0,0,0,0,0.976,0.254)). To estimate these probabilities with likelihoods, we use the density functions for the six attributes history, science, research, offers, students, and hall. Thus, we have

P(A | (0,0,0,0,0.976,0.254)) \approx \frac{f_{A1}(0)\, f_{A2}(0)\, f_{A3}(0)\, f_{A4}(0)\, f_{A5}(0.976)\, f_{A6}(0.254)\, P(A)}{P((0,0,0,0,0.976,0.254))}

where f_A1 through f_A6 are the density functions of the attributes in the order in which they are listed above. To compute these functions we need the means and standard deviations of each of these attributes within cluster A. When computing f_A1(0), however, we run into a little problem: the density function is undefined for history, because its standard deviation is 0 (all values in cluster A are the same). There are different approaches to solving this problem. The simplest is to assume a prespecified minimum value for the standard deviation, say 0.05, which results in f_A1(0) = 7.979. After completing the calculations, we have

P(A | (0,0,0,0,0.976,0.254)) \approx (7.979)(0.5)(0.423)(1.478)(0.007)(1.978)(0.55) = 0.019

Similarly, by using the parameters for cluster B (with the same fix for the zero deviations of science and research), we compute the likelihood of class B:

P(B | (0,0,0,0,0.976,0.254)) \approx (0.705)(7.979)(7.979)(0.486)(0.698)(1.604)(0.45) = 10.99

After normalization we have the following probabilities:

P(A | (0,0,0,0,0.976,0.254)) = \frac{0.019}{0.019 + 10.99} = 0.002
P(B | (0,0,0,0,0.976,0.254)) = \frac{10.99}{0.019 + 10.99} = 0.998

Clearly, the Theatre document belongs to class B, which is now the correct classification, because this is its original cluster.
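The arithmetic above is easy to check with a short Python sketch (not the book's code). The density values are the ones quoted above for the Theatre vector; in a full implementation each f_Ci(x_i) would be computed from the attribute's mean and standard deviation within cluster C, with the 0.05 floor substituted for any zero standard deviation:

    from math import prod   # Python 3.8+

    densities = {
        "A": [7.979, 0.5, 0.423, 1.478, 0.007, 1.978],    # f_A1 .. f_A6
        "B": [0.705, 7.979, 7.979, 0.486, 0.698, 1.604],  # f_B1 .. f_B6
    }
    priors = {"A": 0.55, "B": 0.45}

    # Naive Bayes: multiply the per-attribute densities and the prior,
    # then normalize the likelihoods to obtain probabilities.
    likelihoods = {c: prod(densities[c]) * priors[c] for c in densities}
    total = sum(likelihoods.values())
    posteriors = {c: lik / total for c, lik in likelihoods.items()}

    print(likelihoods)   # {'A': ~0.019, 'B': ~10.99}
    print(posteriors)    # {'A': ~0.002, 'B': ~0.998} -> Theatre goes to B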

Clustering Problem

So far we have discussed two tasks associated with our probabilistic setting: learning (creating models given labeled data) and classification (predicting labels using models). Recall, however, that the cluster labels were created automatically by k-means clustering. So a natural question is whether we can also get these labels automatically within a probabilistic setting.

Expectation maximization (EM) is a popular algorithm used for clustering in the context of mixture models. EM was originally proposed by Dempster et al. [2] for the purposes of estimating missing parameters of probabilistic models. Generally, this is an optimization approach which, given some initial approximation of the cluster parameters, iteratively performs two steps: first, the expectation step computes the values expected for the cluster probabilities; second, the maximization step computes the distribution parameters and their likelihood given the data. It iterates until the parameters being optimized reach a fixpoint or until the log-likelihood function, which measures the quality of the clustering, reaches its (local) maximum.

To simplify the discussion, we first describe the one-dimensional case (i.e., the data collection is a set of values x_1, x_2, ..., x_n of a normally distributed random variable). The algorithm takes a parameter k (the predefined number of clusters, as in k-means) and starts by selecting (usually at random) a set of initial cluster parameters μ_C, σ_C, and P(C) for each cluster C. It can also start by assigning labels to the data points (again, at random), which determine the cluster parameters (as was done earlier for the mixture problem). Then the algorithm iterates through the following steps:

1. For each x_i and for each cluster C, the probability w_i = P(C|x_i) that x_i belongs to cluster C is computed. For this purpose the likelihood of this event is obtained using the approach described in the preceding section; that is, w_i ≈ f_C(x_i)P(C), where f_C(x_i) is the density function. As the members of each cluster are not known explicitly, P(C) is computed as the sum of the weights w_i for cluster C (from the preceding iteration), normalized across all clusters. The likelihoods w_i are also normalized across clusters in order to get the correct probabilities.

2. The standard formulas for the mean and standard deviation are adjusted to use the cluster membership probabilities w_i as weights. Thus, the following weighted mean and standard deviation are computed:

\mu_C = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \qquad \sigma_C = \sqrt{\frac{\sum_{i=1}^{n} w_i (x_i - \mu_C)^2}{\sum_{i=1}^{n} w_i}}

Note that the sums run over all values, not only those belonging to the corresponding cluster. Thus, given a sample of size n, we have an n-component weight vector for each cluster.

The iterative process is similar to that of k-means: the data points are redistributed among clusters repeatedly until the process reaches a fixpoint. The k-means algorithm stops when the cluster membership does not change from one iteration to the next. k-Means uses "hard"² cluster assignment, however, whereas EM uses

²In fact, there exist versions of k-means with soft assignment, which are special cases of EM.

"soft" assignment, that is, probabilities of membership. Consequently, EM may not be able to reach an actual fixpoint; although it converges toward one, the probabilities may keep changing indefinitely. Another indication that the algorithm is close to the fixpoint is that the criterion function that measures the quality of the clustering reaches a maximum. For our probabilistic setting, this is the overall likelihood that the data come from distributions defined by the given parameters. The overall likelihood is computed as the product of the probabilities of all individual data points x_i. As this product may include thousands of factors, we usually take its logarithm to smooth its value and avoid possible underflow. Thus, we get the log-likelihood criterion function:

L = \sum_{i=1}^{n} \log \sum_{C} f_C(x_i)\, P(C)

Practically, each log is taken of the sum of the w_i values across clusters for the corresponding data point, before normalization (otherwise, the sum would be 1 and the log would be 0).

After each iteration the value of L increases, and when the difference between two successive values becomes negligible, the algorithm stops. The algorithm is guaranteed to converge to a maximum of the log-likelihood function; this may be a local maximum, however. To find the global maximum we may use the same technique suggested for k-means: run the algorithm several times with different initial parameters and choose the clustering with the largest log-likelihood among the runs.
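The following Python sketch implements this procedure for the one-dimensional, two-cluster case (it is not the book's implementation). It uses the students values that appear in Table 3.6 below, a random hard-label initialization, a 0.05 floor on the standard deviation, and the change in log-likelihood as the stopping criterion; because the random initial labels differ from those in the table, the intermediate values will differ, but the behavior is the same:

    import math
    import random

    data = [0.67, 0.19, 0.11, 0.15, 0.63, 0.13, 1, 0, 0, 0.53,
            0, 0.2, 0, 0.17, 0, 0.31, 0.06, 0.31, 0.46, 0.97]

    def density(x, mu, sigma):
        sigma = max(sigma, 0.05)   # floor to avoid a zero standard deviation
        return (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
                / (math.sqrt(2 * math.pi) * sigma))

    def em(data, k=2, max_iter=50, tol=1e-6, seed=0):
        rng = random.Random(seed)
        # Initialization: random hard labels determine the initial weight vectors.
        weights = [[0.0] * len(data) for _ in range(k)]
        for i in range(len(data)):
            weights[rng.randrange(k)][i] = 1.0
        prev_ll = -math.inf
        for _ in range(max_iter):
            # Maximization: weighted mean, weighted deviation, cluster probability.
            params = []
            for c in range(k):
                s = sum(weights[c])
                mu = sum(w * x for w, x in zip(weights[c], data)) / s
                var = sum(w * (x - mu) ** 2 for w, x in zip(weights[c], data)) / s
                params.append((mu, math.sqrt(var), s / len(data)))
            # Expectation: unnormalized likelihoods, log-likelihood, normalization.
            ll = 0.0
            for i, x in enumerate(data):
                liks = [density(x, mu, sg) * p for mu, sg, p in params]
                total = sum(liks)
                ll += math.log(total)
                for c in range(k):
                    weights[c][i] = liks[c] / total
            if abs(ll - prev_ll) < tol:
                break
            prev_ll = ll
        return params, ll

    print(em(data))   # ([(mu, sigma, P(C)) per cluster], final log-likelihood)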

To illustrate the approach discussed so far, let us run the EM algorithm on a sample from our department data with one attribute: students. Thus, we have a vector (x_1, x_2, ..., x_20) with values taken from the row students in Table 3.3. Let us also choose k = 2; that is, we want to find two clusters A and B defined probabilistically by their sets of parameters {μ_A, σ_A, P(A)} and {μ_B, σ_B, P(B)}. First, we need to choose the initial cluster parameters. Let us use the labeling approach for this purpose. For each value x_i we toss a coin; if it is heads, we assign label A; otherwise, we assign label B. In other words, we initialize the weight vectors for clusters A and B with 1's and 0's, thus determining the probabilities of cluster membership for each x_i. Using the initial weight vectors, the algorithm computes the cluster parameters, the weighted versions of μ_A, σ_A, P(A) and μ_B, σ_B, P(B).

In the next step these parameters are used to recompute the weight vectors, which in turn determine the new cluster parameters, and so on. At each step the log-likelihood function is computed, so that the iterations stop when its maximum is reached. The entire process, with the values of all parameters involved, is shown in Table 3.6.

The first column shows the data vector (x_1, x_2, ..., x_20). The next column (iteration 0) shows the initial setting, with the random choice of values for the weight vectors of clusters A and B and the corresponding cluster parameters (in the lower four rows): sum of weights, cluster probabilities (computed by normalizing the sums of weights), weighted mean, and weighted standard deviation. These parameters are used to compute the weight vectors for iteration 1. The latter determine the cluster parameters for iteration 1, which in turn are used to compute the weights for iteration 2, and so on.

In these terms, the expectation step of the algorithm is the computation of the weight vectors (including the initial vector, which is determined as a random expectation), while the maximization step is the computation of the cluster parameters, which is based on the expected probabilities (the weights) and is aimed at maximizing them.

TABLE 3.6  EM Iterations on One-Attribute Data (Attribute students)

                    Iter. 0      Iter. 1      Iter. 2      Iter. 3      Iter. 4      Iter. 5      Iter. 6
  i   x_i           w_A   w_B    w_A   w_B    w_A   w_B    w_A   w_B    w_A   w_B    w_A   w_B    w_A   w_B
  1   0.67          1     0      0.99  0.01   1     0      1     0      1     0      1     0      1     0
  2   0.19          1     0      0.40  0.60   0.35  0.65   0.29  0.71   0.26  0.74   0.23  0.77   0.21  0.79
  3   0.11          0     1      0.41  0.59   0.29  0.71   0.19  0.81   0.13  0.87   0.10  0.90   0.09  0.91
  4   0.15          0     1      0.39  0.61   0.31  0.69   0.23  0.77   0.18  0.82   0.15  0.85   0.13  0.87
  5   0.63          1     0      0.99  0.01   1     0      1     0      1     0      1     0      1     0
  6   0.13          0     1      0.40  0.60   0.30  0.70   0.20  0.80   0.15  0.85   0.12  0.88   0.10  0.90
  7   1             1     0      1     0      1     0      1     0      1     0      1     0      1     0
  8   0             0     1      0.53  0.47   0.35  0.65   0.22  0.78   0.14  0.86   0.10  0.90   0.08  0.92
  9   0             1     0      0.53  0.47   0.35  0.65   0.22  0.78   0.14  0.86   0.10  0.90   0.08  0.92
 10   0.53          1     0      0.92  0.08   0.99  0.01   1     0      1     0      1     0      1     0
 11   0             1     0      0.53  0.47   0.35  0.65   0.22  0.78   0.14  0.86   0.10  0.90   0.08  0.92
 12   0.2           1     0      0.40  0.60   0.35  0.65   0.30  0.70   0.26  0.74   0.24  0.76   0.22  0.78
 13   0             1     0      0.53  0.47   0.35  0.65   0.22  0.78   0.14  0.86   0.10  0.90   0.08  0.92
 14   0.17          0     1      0.39  0.61   0.32  0.68   0.25  0.75   0.20  0.80   0.17  0.83   0.15  0.85
 15   0             1     0      0.53  0.47   0.35  0.65   0.22  0.78   0.14  0.86   0.10  0.90   0.08  0.92
 16   0.31          1     0      0.52  0.48   0.61  0.39   0.72  0.28   0.80  0.20   0.83  0.17   0.84  0.16
 17   0.06          1     0      0.45  0.55   0.30  0.70   0.18  0.82   0.12  0.88   0.09  0.91   0.07  0.93
 18   0.31          0     1      0.51  0.49   0.61  0.39   0.71  0.29   0.79  0.21   0.82  0.18   0.84  0.16
 19   0.46          0     1      0.82  0.18   0.95  0.05   0.99  0.01   1     0      1     0      1     0
 20   0.97          1     0      1     0      1     0      1     0      1     0      1     0      1     0

  Sum of weights    13    7      12.2  7.8    11.1  8.9    10.2  9.8    9.6   10.4   9.2   10.8   9.1   10.9
  Cluster prob.     0.65  0.35   0.61  0.39   0.56  0.44   0.51  0.49   0.48  0.52   0.46  0.54   0.45  0.55
  Weighted mean     0.35  0.19   0.40  0.14   0.44  0.11   0.49  0.10   0.52  0.09   0.54  0.09   0.55  0.09
  Weighted st.dev.  0.35  0.14   0.34  0.12   0.33  0.10   0.32  0.09   0.31  0.09   0.30  0.09   0.29  0.09
  Log-likelihood    -2.92201     -1.29017     -0.099039    0.47888      0.697056     0.769124     0.792324


Figure 3.3 Log-likelihood graph for the example in Table 3.6 (log-likelihood versus iteration).


The last row in Table 3.6 shows the values of the log-likelihood function, which increases after each iteration.
