
NEAREST-NEIGHBOR ALGORITHM


The nearest-neighbor algorithm is a straightforward application of similarity (or distance) for the purposes of classification. It predicts the class of a new document using the class label of the closest document from the training set. Because it uses just one instance from the training set, this basic version of the algorithm is called one-nearest neighbor (1-NN). Closeness is measured by minimal distance or maximal similarity. The most common approach is to use the TFIDF framework to represent both the test and training documents and to compute the cosine similarity between the document vectors.
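To make the procedure concrete, below is a minimal Python sketch of 1-NN with cosine similarity. The function names and the list-of-pairs representation of the training set are our own illustrative choices (not code from the book), and the vectors are assumed to be TFIDF-weighted already.

import math

def cosine_similarity(x, y):
    # Cosine of the angle between two term vectors given as equal-length lists.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

def one_nn(test_vector, training_set):
    # training_set is a list of (vector, class_label) pairs.
    # Return the label of the single most similar training document.
    best_label, best_sim = None, -1.0
    for vector, label in training_set:
        sim = cosine_similarity(test_vector, vector)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label

With the TFIDF rows of Table 5.1 as the training set and the Theatre vector as the test document, one_nn would return B, the label of Criminal Justice.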

Let us again consider our department document collection, represented as TFIDF vectors with six attributes along with the class labels for each document, as shown in Table 5.1. Assume that the class of the Theatre document is unknown.

To determine the class of this document, we compute the cosine similarity between the Theatre vector and all other vectors. The results of these calculations are shown in Table 5.2, which lists the documents sorted by their similarity to the document being classified. The 1-NN approach simply picks the most similar document (the first one in the list), Criminal Justice, and uses its label B to predict the class of Theatre. Obviously, this is a correct prediction because it coincides with the actual class of the document. However, if we look at the nearest neighbor of Theatre (Criminal Justice), we see only one nonzero attribute, which in fact produced the prediction. This is just a tiny portion of our training data. Because of the sparsity

TABLE 5.1  TFIDF Representation of Department Documents with Class Labels

Document            history  science  research  offers  students  hall   Class
Anthropology        0        0.537    0.477     0       0.673     0.177  A
Art                 0        0        0         0.961   0.195     0.196  B
Biology             0        0.347    0.924     0       0.111     0.112  A
Chemistry           0        0.975    0         0       0.155     0.158  A
Communication       0        0        0         0.780   0.626     0      B
Computer Science    0        0.989    0         0       0.130     0.067  A
Criminal Justice    0        0        0         0       1         0      B
Economics           0        0        1         0       0         0      A
English             0        0        0         0.980   0         0.199  B
Geography           0        0.849    0         0       0.528     0      A
History             0.991    0        0         0.135   0         0      B
Mathematics         0        0.616    0.549     0.490   0.198     0.201  A
Modern Languages    0        0        0         0.928   0         0.373  B
Music               0.970    0        0         0       0.170     0.172  B
Philosophy          0.741    0        0         0.658   0         0.136  B
Physics             0        0        0.894     0       0.315     0.318  A
Political Science   0        0.933    0.348     0       0.062     0.063  A
Psychology          0        0        0.852     0.387   0.313     0.162  A
Sociology           0        0        0.639     0.570   0.459     0.237  A
Theatre             0        0        0         0       0.967     0.254  ? (B)


TABLE 5.2  Documents Sorted by Similarity to Theatre

Document            Class  Similarity to Theatre
Criminal Justice    B      0.967075
Anthropology        A      0.695979
Communication       B      0.605667
Geography           A      0.510589
Sociology           A      0.504672
Physics             A      0.385508
Psychology          A      0.343685
Mathematics         A      0.242155
Art                 B      0.238108
Music               B      0.207746
Chemistry           A      0.189681
Computer Science    A      0.142313
Biology             A      0.136097
Modern Languages    B      0.0950206
Political Science   A      0.0762211
English             B      0.0507843
Philosophy          B      0.0345299
History             B      0
Economics           A      0

of the vector space, similar situations often occur in real web document collections with thousands of attributes and documents. In fact, no matter how large the training set is, the 1-NN algorithm makes the prediction using a single instance, often relying on few attributes. All this makes the algorithm extremely sensitive to noise (wrong values of some attributes) and irrelevant attributes. Therefore, when using 1-NN, two assumptions have to be made: there is no noise, and all attributes are equally important for the classification. The problem is that these assumptions are not realistic, especially for large text document collections. Of course, this is where feature selection may help (we discuss this in the next section). However, there also exist extensions of 1-NN that address this problem: k-NN and distance-weighted k-NN.

k-NN is a generalization of 1-NN, where the prediction is made by taking the majority vote over the k nearest neighbors. The parameter k is selected to be a small odd number (usually 3 or 5). For example, 3-NN would again classify Theatre as class B, because this is the majority label among the top three documents (B, A, B). However, 5-NN will predict class A, because the set of labels of the top five documents is {B, A, B, A, A} (i.e., the majority label is A). Which prediction is more plausible? One may suggest continuing with k = 7, 9, . . . . However, this is the wrong direction to go, because we are taking votes from documents that are less and less similar to the document being classified. So the idea is to take votes only from close neighbors.

But because all k nearest neighbors are considered equally important with respect to the classification, the choice of k is crucial. This is the reason that 1-NN is the most popular version of the nearest-neighbor algorithms.
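The majority vote itself is simple to express in code. The sketch below is illustrative (not code from the book); it uses the top five similarities from Table 5.2 and reproduces the 3-NN and 5-NN predictions discussed above.

from collections import Counter

# (class label, similarity to Theatre) for the top five neighbors in Table 5.2
top_neighbors = [
    ("B", 0.967075),  # Criminal Justice
    ("A", 0.695979),  # Anthropology
    ("B", 0.605667),  # Communication
    ("A", 0.510589),  # Geography
    ("A", 0.504672),  # Sociology
]

def knn_vote(neighbors, k):
    # Majority vote over the k most similar training documents.
    votes = Counter(label for label, _ in neighbors[:k])
    return votes.most_common(1)[0][0]

print(knn_vote(top_neighbors, 3))  # B  (votes: B, A, B)
print(knn_vote(top_neighbors, 5))  # A  (votes: B, A, B, A, A)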

The other extension, distance-weighted k-NN, actually builds on the problem with the choice of the parameter k that we just mentioned. The basic idea is to weight the votes by the similarity (or distance) of the training documents to the document being classified. Instead of adding 1 for each label, we may use a term that is proportional to the similarity (or inversely proportional to the distance). The simplest option is just the similarity function sim(X, Y). Other options are 1/[1 − sim(X, Y)] or 1/[1 − sim(X, Y)]². For example, distance-weighted 3-NN with the simplest weighting scheme [sim(X, Y)] will predict class B for the Theatre document, because the weight for label B (documents Criminal Justice and Communication) is B = 0.967075 + 0.605667 = 1.572742, while the weight for label A (document Anthropology) is A = 0.695979, and thus B > A. With this weighting scheme it makes sense to use a larger k, even a k equal to the number of documents in the training set. For example, for k = 19 (all documents excluding Theatre) the prediction for Theatre is A, because the weight for A is 3.2269 and the weight for B is 2.198931.
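The same vote can be weighted by similarity with a small change, as in the sketch below (again illustrative, not the book's code); passing more (class, similarity) pairs and a larger k gives the k = 19 computation.

def weighted_vote(neighbors, k):
    # Sum the similarities per class label over the k nearest neighbors
    # and return the label with the largest total weight.
    weights = {}
    for label, sim in neighbors[:k]:
        weights[label] = weights.get(label, 0.0) + sim
    return max(weights, key=weights.get), weights

# Top three neighbors of Theatre from Table 5.2
top3 = [("B", 0.967075), ("A", 0.695979), ("B", 0.605667)]
label, weights = weighted_vote(top3, 3)
print(label, weights)  # B, with weights approximately {'B': 1.572742, 'A': 0.695979}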

The distance-weighting approach actually allows the algorithm to use not just a single instance, but several or even all instances. In this respect it is interesting to evaluate and compare the performance of various versions of NN. We use LOO-CV for this purpose because it is the best choice for small data sets: computational complexity is not a concern, and stratification is not needed because the algorithm does not build a model. Table 5.3 summarizes the LOO-CV accuracy evaluation of 1-NN, 3-NN, 5-NN, and 19-NN with and without distance weighting on the document collection shown in Table 5.1. To investigate the effect of more, and possibly irrelevant, attributes, we run the same experiments with the complete set of 671 attributes. The results show clearly that a small set of relevant attributes works much better than all attributes. Another important observation is that distance weighting does not help much. With a set of relevant attributes, the algorithm works perfectly with k = 1 and k = 3, and thus there is not much left to be improved. Only for k = 5 do we see a little improvement. On the other hand, distance weighting makes no difference in the experiments with all attributes. Because of the sparseness of the vector space, the range of similarity values is small (minimum = 0.022 and maximum = 0.245, whereas for the six-attribute case, minimum = 0 and maximum = 0.995), and thus weighted votes are not much different from nonweighted votes. Overall, 1-NN is a clear winner in all cases. In general, we may expect improvement with k > 1 and distance weighting only in situations with noise (obviously not present in our data) and not too many irrelevant attributes.
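For completeness, here is an outline of the LOO-CV loop behind Table 5.3. The helper knn_classify is a placeholder for any of the variants sketched above; it is assumed to take a test vector, a training list of (vector, label) pairs, and k, and to return a predicted label.

def loo_cv_accuracy(documents, k, knn_classify):
    # documents is a list of (vector, label) pairs; each document is classified
    # in turn using all remaining documents as the training set.
    correct = 0
    for i, (vector, label) in enumerate(documents):
        training = documents[:i] + documents[i + 1:]  # leave the i-th document out
        if knn_classify(vector, training, k) == label:
            correct += 1
    return correct / len(documents)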

The nearest-neighbor algorithm has been used successfully by statisticians for more than 50 years. As a machine learning approach it falls in the category of instance-based (lazy) learning.

TABLE 5.3  LOO-CV of the Nearest-Neighbor Algorithm

                      No Distance Weighting            Distance Weighting
Attributes     1-NN   3-NN   5-NN   19-NN       1-NN   3-NN   5-NN   19-NN
6              1.00   1.00   0.90   0.55        1.00   1.00   1.00   0.85
671            0.85   0.85   0.75   0.55        0.85   0.85   0.75   0.55
