

In the document DATA MINING IN AGRICULTURE (Pages 115-126)

k-Nearest Neighbor Classification

4.5 Experiments in MATLAB®

This section presents some experiments in the MATLAB environment. The k-NN algorithm will be implemented in the simple case in which the samples are points in a two-dimensional space.

In Figure 4.10 the MATLAB function knn is shown. It has 8 input parameters and only one output parameter, which is the vector class containing the classification of the samples obtained by the k-NN algorithm. As inputs, the function needs: the number n of unknown samples to classify; the x and y coordinates of such samples, stored in x and y, respectively; the number k of nearest neighbors that will be used for classifying the unknown samples; and the number ntrain of known samples used as a training set for classifying the unknown ones. The x and y coordinates of such known samples are stored in xtrain and ytrain; finally, the classes to which the known samples belong are stored in ctrain. Note that this MATLAB function has more parameters than the function in Figure 3.20 performing the k-means algorithm discussed in Chapter 3. Indeed, k-NN is a classification method whereas k-means is a clustering method, and hence k-NN needs information about a training set for classifying the unknown samples.

The main loop in the algorithm is a for loop on i, which runs over all the unknown samples. For each of them, three main operations must be performed: all the distances between the unknown sample and all the samples in the training set need to be computed;

then the smallest k computed distances need to be checked and the corresponding known samples need to be located; finally, the unknown sample is classified according to the known classification of these k known samples. The Euclidean distances between the current unknown sample (x(i),y(i)) and the samples in the training set (xtrain(j),ytrain(j)) are collected in a vector dist. The vector ind collects the indices of the known samples used to compute the distances. It is needed for the identification of the k smallest distances. This task is performed by partially sorting the vector dist in ascending order, using one of the well-known methodologies for sorting data.

The methodology used for sorting the vector dist works as follows. It starts by considering the last element of dist; it compares the current element to its neighbor in the vector and exchanges their positions if they are not in ascending order. Step by step, the current element moves one position toward the first element of the vector. When it reaches the first element, after the last exchange has possibly been performed, the first element of dist holds the minimum distance contained in the whole vector. At this point, the procedure can restart from the last element of the vector and repeat the same instructions. This time it does not need to reach the first vector element, because that element already contains the global minimum; it can stop at the second element. If this procedure is repeated a number of times equal to the vector size minus 1, the vector will be completely sorted. In this case, instead, only the smallest k distances are needed, and therefore the procedure can be iterated only k times. Not only the distance values are important, but also the indices of the points having these distances from the unknown sample. Therefore,

%
% this function performs a k-NN algorithm
% on a two-dimensional set of data
%
% input:
% n      - number of samples to classify
% x      - x coordinates of the samples to classify
% y      - y coordinates of the samples to classify
% k      - kNN parameter
% ntrain - number of training samples
% xtrain - x coordinates of the training samples
% ytrain - y coordinates of the training samples
% ctrain - classes to which each training sample belongs
%
% output:
% class  - classes to which each unknown sample belongs
%
% [class] = knn(n,x,y,k,ntrain,xtrain,ytrain,ctrain)

function [class] = knn(n,x,y,k,ntrain,xtrain,ytrain,ctrain)

for i = 1:n,

   % computing the distance between (x(i),y(i)) and all the
   % training samples
   for j = 1:ntrain,
      dist(j) = sqrt((x(i) - xtrain(j))^2 + (y(i) - ytrain(j))^2);
      ind(j) = j;
   end

   % checking the k smallest distances obtained
   for kk = 1:k,
      for jj = ntrain-1:-1:kk,
         if dist(jj) > dist(jj+1),
            aus = dist(jj); dist(jj) = dist(jj+1); dist(jj+1) = aus;
            aus = ind(jj);  ind(jj) = ind(jj+1);   ind(jj+1) = aus;
         end
      end
   end

   % classifying the unknown sample on the base of the k-nearest
   % training samples
   score = zeros(1,max(ctrain));
   for kk = 1:k,
      score(ctrain(ind(kk))) = score(ctrain(ind(kk))) + 1;
   end
   maxscore = 1; val = score(1);
   for j = 2:length(score),
      if score(j) > val,
         maxscore = j; val = score(j);
      end
   end
   class(i) = maxscore;

end

Fig. 4.10 The MATLAB function knn.

during the sorting process, the elements of the vector ind are exchanged according to the changes applied to dist.
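The partial sorting just described can also be sketched outside MATLAB. The following Python fragment is an illustration, not taken from the book: it performs only k bubble-sort passes over the distance list, keeping an index list synchronized with it so that the k nearest samples can be located afterwards.

```python
def k_smallest(dist, k):
    # Partial bubble sort: after pass kk, position kk holds the
    # (kk+1)-th smallest distance; ind tracks original positions.
    dist = list(dist)                 # work on a copy
    ind = list(range(len(dist)))      # original indices of the samples
    for kk in range(k):
        for jj in range(len(dist) - 2, kk - 1, -1):
            if dist[jj] > dist[jj + 1]:
                dist[jj], dist[jj + 1] = dist[jj + 1], dist[jj]
                ind[jj], ind[jj + 1] = ind[jj + 1], ind[jj]
    return dist[:k], ind[:k]
```

Only k passes are needed because each pass bubbles the next-smallest distance into place; a full sort would require a number of passes equal to the vector size minus 1.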

For classifying the current unknown sample (x(i),y(i)), a "score" is assigned to each class, and the class having the maximum score is taken as the class to which the unknown sample belongs. At the start all the scores are set to 0; then each of them is updated according to the classes to which the k closest known samples belong.

The expression score(ctrain(ind(kk))) refers to the score of the class identified by ctrain for the known sample having index ind(kk). After all the scores are updated, the maximum score and the related class are identified, and the unknown sample is assigned to the class coded by maxscore.
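The voting scheme can be sketched as follows (again a Python illustration rather than the book's MATLAB; ctrain is assumed to hold integer class codes, and ind is assumed to be already partially sorted so that its first k entries index the k nearest training samples).

```python
def classify_by_score(ctrain, ind, k):
    # Each of the k nearest training samples votes for its own class;
    # the class with the highest score wins.
    score = {}
    for kk in range(k):
        c = ctrain[ind[kk]]           # class of the kk-th nearest sample
        score[c] = score.get(c, 0) + 1
    return max(score, key=score.get)  # class with the maximum score
```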

A training set of 50 points has been generated using the MATLAB function generate (Figure 3.16) and setting eps to 0.1. The value of eps allows one to separate the points into 2 groups with a certain margin. The points so obtained are not yet assigned to one class or another. The function kmeans (Figure 3.20) has been used for clustering these data and therefore for assigning them a classification. The generated set of points can then be considered as a training set for the k-NN algorithm (see Figure 4.11). Another set of 200 points is generated by the function generate with eps = 0, so that these points contain no inherent patterns.

Figure 4.12 shows the points marked in accordance with the classification obtained by the function knn using the training set in Figure 4.11. The boundary between the two classes is not precisely located, but most of the points can be considered well classified.

As previously shown, the k-NN algorithm can be computationally expensive if the training set used contains more information than needed. This can happen when the number of samples it contains is too large. In these cases, subsets of the original training set can be identified for obtaining the same classification in a shorter

Fig. 4.11 The training set used with the function knn.

Fig. 4.12 The classification of unknown samples performed by the function knn.

time, but with good accuracy. Figures 4.13 and 4.14 show a MATLAB function implementing the algorithm for finding a consistent subset of a starting training set T_NN. The function has as input parameters: the number ntnn of points contained in the original training set T_NN; the x coordinates xtnn of these points and the corresponding y coordinates ytnn; the numerical codes ctnn indicating the class each point belongs to; and finally the k value related to the k-NN algorithm, i.e., the number of nearest neighbors. The output parameters of the function are: the number ntcnn of points contained in the condensed subset T_CNN; the x and y coordinates of these points, xtcnn and ytcnn, respectively; and the codes ctcnn of the classes these points belong to. This MATLAB function uses the function knn as a sub-procedure, because classifications are performed in the algorithm for finding T_CNN.

This function is the translation of the algorithm in Figure 4.3 into the MATLAB language. The roles played by the sets X and Y are played by wellclass and badclass in the function. These two sets contain the points that are "well" classified and "badly" classified during the algorithm. Each of them is represented in MATLAB by an integer number counting its size and by three vectors. The set of bad classifications, for instance, is handled through the variables: nbadclass, counting the number of points in the set; xbadclass, containing the x coordinates of its points; ybadclass, containing the y coordinates of such points; and cbadclass, containing the corresponding class codes. At the start, badclass is initialized by copying the first sample of T_NN into it. Then all the other samples are classified by the knn function using badclass as training set (first for loop). Even though badclass contains only one point at the start, it grows every time a sample is not classified correctly by knn. More precisely, every time knn runs for classifying one sample, that sample is moved to wellclass if it is classified correctly and to badclass otherwise. After this starting phase, a while loop starts. This loop

%
% this function computes a condensed subset T_CNN of
% a given training set T_NN
%
% input:
% ntnn  - number of points in T_NN
% xtnn  - x coordinates of points in T_NN
% ytnn  - y coordinates of points in T_NN
% ctnn  - classes each point in T_NN belongs to
% k     - kNN parameter
%
% output:
% ntcnn - number of points in T_CNN
% xtcnn - x coordinates of points in T_CNN
% ytcnn - y coordinates of points in T_CNN
% ctcnn - classes each point in T_CNN belongs to
%
% [ntcnn,xtcnn,ytcnn,ctcnn] = condense(ntnn,xtnn,ytnn,ctnn,k)

function [ntcnn,xtcnn,ytcnn,ctcnn] = condense(ntnn,xtnn,ytnn,ctnn,k)

% the first point is added to class "badclass"
nbadclass = 1;
xbadclass(1) = xtnn(1);
ybadclass(1) = ytnn(1);
cbadclass(1) = ctnn(1);
nwellclass = 0;

% classifying all the other points
for i = 2:ntnn,

   % by using (nbadclass,xbadclass,ybadclass,cbadclass) as training set
   class = knn(1,xtnn(i),ytnn(i),k,nbadclass,xbadclass,ybadclass,cbadclass);

   if class == ctnn(i),
      % well classified: the point is added to "wellclass"
      nwellclass = nwellclass + 1;
      xwellclass(nwellclass) = xtnn(i);
      ywellclass(nwellclass) = ytnn(i);
      cwellclass(nwellclass) = ctnn(i);
   else
      % misclassified: the point is added to "badclass"
      nbadclass = nbadclass + 1;
      xbadclass(nbadclass) = xtnn(i);
      ybadclass(nbadclass) = ytnn(i);
      cbadclass(nbadclass) = ctnn(i);
   end

end

Fig. 4.13 The MATLAB function condense: first part.

stops when wellclass does not contain any points anymore, or when the variable nmoves is zero at the end of an iteration of the while loop. At each iteration of this while loop, each point in wellclass is classified by using badclass as training set. If the point is classified incorrectly, then it is moved from wellclass to badclass. At the end of the procedure, the points in badclass are able to classify correctly those in wellclass. The final condensed set is therefore given by the current points in badclass.
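The whole condensation procedure can be outlined as follows (a Python sketch, not the book's MATLAB; samples are assumed to be ((x, y), class) pairs and classification uses the 1-NN rule, i.e., k = 1).

```python
import math

def nn_class(p, train):
    # 1-NN rule: class of the training sample closest to point p
    return min(train, key=lambda t: math.dist(p, t[0]))[1]

def condense(tnn):
    # tnn: list of ((x, y), class) pairs, the original training set T_NN
    bad = [tnn[0]]                    # badclass starts with the first sample
    well = []
    for sample in tnn[1:]:            # initial pass over T_NN
        dest = well if nn_class(sample[0], bad) == sample[1] else bad
        dest.append(sample)
    moved = True
    while well and moved:             # stop when nothing moves in a full pass
        moved = False
        for sample in list(well):
            if nn_class(sample[0], bad) != sample[1]:
                well.remove(sample)   # misclassified: move to badclass
                bad.append(sample)
                moved = True
    return bad                        # badclass is the condensed subset T_CNN
```

For two well-separated clusters, a single representative per cluster typically survives in the condensed set, which is exactly what makes the subsequent classifications cheaper.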

As before, a training set can be generated by using the function generate and by applying the kmeans algorithm for assigning a class to each point of the set. In

% checking the points in "wellclass"
% by using (nbadclass,xbadclass,ybadclass,cbadclass) as training set
class = knn(1,xwellclass(i),ywellclass(i),k,nbadclass,xbadclass, ...
            ybadclass,cbadclass);

% if the point is not well-classified
% it is moved from wellclass to badclass
if class ~= cwellclass(i),
   % (the point is copied into badclass and removed from wellclass)
end

% the class "badclass" is the condensed subset
ntcnn = nbadclass;
xtcnn = xbadclass;
ytcnn = ybadclass;
ctcnn = cbadclass;

Fig. 4.14 The MATLAB function condense: second part.

this case, 200 points have been generated by setting eps = 0.1. Then, the function kmeans is used with k = 4. The generated training set is presented in Figure 4.15(a).

Figure 4.15(b) shows the corresponding condensed set T_CNN. Figure 4.16 shows the performance of the knn function using the obtained reduced set. After the reduction of the training set, the quality of the classification remains the same.

Just a few points close to the borders among the different classes are misclassified. This

Fig. 4.15 (a) The original training set; (b) the corresponding condensed subset T_CNN obtained by the function condense.

can be avoided if the margin among the classes is larger. The set of points to classify has been generated by the function generate with n = 50 and eps = 0.

Figure 4.17 shows the MATLAB function implementing the algorithm in Figure 4.5 for finding a reduced subset T_RNN of a training set. The input and output parameters of this function are similar to the ones of the function condense. The integer number ntnn and the three vectors xtnn, ytnn and ctnn represent the original training set T_NN. The parameter k always refers in this context to the number of nearest neighbors used during the classification algorithm. As outputs, the integer

Fig. 4.16 The classification of a random set of points performed by knn. The training set actually used is the one in Figure 4.15(b).

ntrnn and the vectors xtrnn, ytrnn and ctrnn represent the reduced set this function provides.

At the beginning, the entire original set is copied into the variables that will contain the reduced set at the end. At each iteration of the for loop, one sample has a chance to be removed, allowing the reduced set to get smaller. The integer count counts the points of the original training set. Each of them is moved in turn from the reduced set to an auxiliary set. If the points currently in the auxiliary set cannot all be correctly classified using the current reduced set as training set, then the point is moved back to the reduced set. Otherwise, the point remains in the auxiliary set, and the reduced set is actually reduced. Figure 4.18(a) shows the subset obtained by the function reduce from the set in Figure 4.15(a). Figure 4.18(b) shows the classification provided by knn, using the reduced training set, on the same set of 500 random points. As before, the classification accuracy remains the same after the reduction of the training set, while the computational cost of the classification decreases.
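The reduction procedure can be outlined in the same style (a Python sketch, not the book's MATLAB; samples are assumed to be ((x, y), class) pairs, classification uses the 1-NN rule, and all points are assumed distinct).

```python
import math

def nn_class(p, train):
    # 1-NN rule: class of the training sample closest to point p
    return min(train, key=lambda t: math.dist(p, t[0]))[1]

def reduce_set(tnn):
    # tnn: list of ((x, y), class) pairs, the original training set T_NN
    reduced = list(tnn)               # start from a copy of T_NN
    aux = []                          # auxiliary set of removed points
    for sample in tnn:                # each point gets one chance to leave
        reduced.remove(sample)
        aux.append(sample)
        # if any point in the auxiliary set is now misclassified,
        # the candidate point is moved back into the reduced set
        if any(nn_class(p, reduced) != c for (p, c) in aux):
            aux.remove(sample)
            reduced.append(sample)
    return reduced                    # the reduced subset T_RNN
```

Note the design difference with respect to condensation: here the starting point is the full training set and points are removed one at a time, whereas condense starts from a single point and grows the subset.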

4.6 Exercises

Exercises related to the k-NN algorithm follow.

1. Let us suppose it is necessary to distinguish between points on a Cartesian system having positive x value and points having negative x value. Let us call these two classes C+ and C−. By using the training set

%
% this function computes a reduced subset T_RNN of
% a given training set T_NN
%
% input:
% ntnn  - number of points in T_NN
% xtnn  - x coordinates of points in T_NN
% ytnn  - y coordinates of points in T_NN
% ctnn  - classes each point in T_NN belongs to
% k     - kNN parameter
%
% output:
% ntrnn - number of points in T_RNN
% xtrnn - x coordinates of points in T_RNN
% ytrnn - y coordinates of points in T_RNN
% ctrnn - classes each point in T_RNN belongs to
%
% [ntrnn,xtrnn,ytrnn,ctrnn] = reduce(ntnn,xtnn,ytnn,ctnn,k)

function [ntrnn,xtrnn,ytrnn,ctrnn] = reduce(ntnn,xtnn,ytnn,ctnn,k)

% copying the original training set
ntrnn = ntnn;
xtrnn = xtnn;
ytrnn = ytnn;
ctrnn = ctnn;

% an auxiliary set (n,xtrain,ytrain,ctrain) is needed
n = 0;

% performing the reduction algorithm
for count = 1:ntnn,

   % moving one point from T_RNN to the auxiliary set
   n = n + 1;
   xtrain(n) = xtrnn(1);
   ytrain(n) = ytrnn(1);
   ctrain(n) = ctrnn(1);
   xtrnn(1) = [];
   ytrnn(1) = [];
   ctrnn(1) = [];
   ntrnn = ntrnn - 1;

   % classifying points in the auxiliary set by T_RNN
   aux_class = knn(n,xtrain,ytrain,k,ntrnn,xtrnn,ytrnn,ctrnn);

   % counting the number of misclassifications
   nbadclass = 0;
   for i = 1:n,
      if aux_class(i) ~= ctrain(i),
         nbadclass = nbadclass + 1;
      end
   end

   % if there is one misclassification at least
   % the point is moved back
   if nbadclass > 0,
      ntrnn = ntrnn + 1;
      xtrnn(ntrnn) = xtrain(n);
      ytrnn(ntrnn) = ytrain(n);
      ctrnn(ntrnn) = ctrain(n);
      n = n - 1;
   end

end

Fig. 4.17 The MATLAB function reduce.

Fig. 4.18 (a) The reduced subset T_RNN obtained by the function reduce; (b) the classification of points performed by knn using the reduced subset T_RNN.

{{(−1,−1), C−}, {(−1,1), C−}, {(1,−1), C+}, {(1,1), C+}}, classify the points (2,1), (−3,1) and (1,4) with the k-NN algorithm and k = 1.

2. Given the training set:

{{(0,1), CA}, {(−1,−1), CA}, {(1,1), CA}, {(−2,−2), CB}, {(2,2), CB}}, classify the points

(7,8), (0,0), (0,2), (4,−2)

using the 1-NN rule.

3. By using the training set in Exercise 2, classify the points (5,1) and (−1,4) applying the 3-NN decision rule.

4. Provide an example of a training set such that the same unknown sample can be classified in different ways if k is set to 1 or 3.

5. Plot the training set and the unknown sample that satisfy the requirements of Exercise 4 by using the MATLAB function plotp.

6. Solve the classification problem proposed in Exercise 1 in the MATLAB environment with the function knn.

7. In MATLAB, create a training set and find the corresponding condensed and reduced sets with the functions condense and reduce.

8. In MATLAB, build figures showing the original training set and the condensed and reduced subsets obtained in the previous exercise.

9. In MATLAB, use the function knn for classifying a set of unknown points. Use the training set originally generated in Exercise 7. Create a figure showing the obtained classification.

10. In MATLAB, use the function knn for classifying a set of unknown points. Use the condensed and reduced subsets obtained in Exercise 7. Create a figure showing the obtained classifications.

