Test set method - Support Vector Machines

Support Vector Machines

8.2 Test set method

The training set contains the information needed for performing the classification of unknown samples. It consists in a set of pairs grouping samples and their correspond-ing classifications. All the other samples which are not contained in the traincorrespond-ing set have an unknown classification, and hence they cannot be used for validation pur-poses.

Thetest setmethod is based on the following idea. Since only the samples in the training set have a known classification, the idea is to split the training set in two parts: a part which is actually used as a training set, and another part used for the validation. In general, 70% of the data can be used as a training set, and the remaining (30%) can be used for the validation process.

Let us suppose that thek-nearest neighbor rule is used for classifying a set of unknown samples. To validate the effectiveness of this rule, 30% of the training set is classified using the remaining 70% of the training set. Since samples in both cases are taken from the training set, their classification is known, and therefore the classification obtained by thek-nearest neighbor rule can be validated. Similarly, a neural network or a support vector machine can be trained using 70% of the training set, and then the accuracy of the classification provided by the trained network or support vector machine can be evaluated on the remaining 30% of the training set.

8.2.1 An example in MATLAB

In this example, a linear regression model is validated by using the test set method.

It is supposed that a set of points in a two-dimensional space is available, and that

it is needed to model these points by linear regression. These points can represent measurements of a certain process that it is known to be linear. The available set of points is used as a training set: the general rule governing the process needs to be discovered from this set. Once the regression model has been found, it should be able to approximate with an acceptable accuracy the points of the training set, and it should also be able to generalize to other unknown points.

In order to validate the quality of the regression model, the test set method can be applied. Following this method, the original training set has to be divided in two parts. Let us suppose the training set contains the following 10 points:

(1,4), (2,2), (3,3), (4,1.7), (5,1) (6,1.2), (7,1.5), (8,1.9), (9,2.3), (10,2.7).

Three of these points (30%) can be used for validating the model, while the other seven points (70%) are used as a training set for finding the model. One issue can be how to decide the points to place in the validation set and the ones to place in the actual training set. This separation can be done in a totally random way, but there might be cases in which this can lead to problems. Let us consider, for instance, that the three points having the smallestxvalue are used as a validation set, whereas the others are used as a training set. The following MATLAB code has been used for performing the validation and generating Figure 8.1.

x = [1 2 3 4 5 6 7 8 9 10];

Thexvector is initialized with thexcoordinates of the whole set of points, and the yvector is initialized with theirycoordinates. In the vectorsx1andy1are then placed the points that are actually used for computing the regression model. Inx2andy2are instead placed the remaining points, the ones that are used for the validation. In this example, the compact symbologies1:3and4:10are used for considering vectors whose first component is 1 (or 4), whose last component is 3 (or 10) and having distance between any consecutive components equal to 1 (for details see Appendix A). These points separated in this way are then printed by using the functionplot. Note that many options are used for controlling the symbols used when the points are drawn. In particular, the points in the validation set are marked by circles, and the points in the training set are marked by squares.

The functionpolyfitis able to find the coefficient of the linear function that better approximates the points inx1andy1(see Section 2.4). The specified degree for the

polynomial is 1, because a linear model is searched. The output of the function polyfit is placed in the vector c, which is soon used as input in the function polyvalthat evaluates the polynomial in thexcoordinates stored inxx. The vector xxis created so that it contains all thexcoordinates inx. The vectoryygenerated bypolyvalcontains the correspondingy coordinates. The vectors xxandyyare finally given as input to the functionplot.

Figure 8.1 shows that the found linear regression does not give a good approxima-tion of the points placed in the validaapproxima-tion set. For instance, the errorerrcomputed on the point (x(1),y(1)) has value 3.59. This is not a small error, if it is compared to the coordinates of the points. This error is larger than 3 times the difference between two consecutivexcomponents. If the points (x1,y1) and (x2,y2) are instead chosen in the following way

x1 = x([1 2 4 5 7 8 10]);

y1 = y([1 2 4 5 7 8 10]);

x2 = x([3 6 9]);

y2 = y([3 6 9]);

then the accuracy grows. Figure 8.2 shows the new-found linear regression. In this case, the whole set of points is represented better by the chosen 70% of the original training set. This brings to a reduction of the overall error on the points in the validation set. The largest error is here due to the points (x6,y6) and it corresponds to 0.85. In general, more than one random division of the training set could be considered and the test set method applied for each of these divisions.

0 2 4 6 8 10 12

0 0.5 1 1.5 2 2.5 3 3.5 4

Fig. 8.1 The test set method for validating a linear regression model.

0 2 4 6 8 10 12 1

1.5 2 2.5 3 3.5 4

Fig. 8.2 The test set method for validating a linear regression model. In this case, a validation set different from the one in Figure 8.1 is used.

Dans le document DATA MINING IN AGRICULTURE (Page 182-185)