

From the document DATA MINING IN AGRICULTURE (pages 155-162)

Support Vector Machines

6.6 MATLAB and LIBSVM

There is a MATLAB toolbox designed specifically for SVMs. We do not discuss its capabilities in this chapter; the interested reader can find additional information on this toolbox on the MATLAB Web site. This section is instead devoted to the free software LIBSVM (a LIBrary for Support Vector Machines). MATLAB is used only to generate instances that are then solved by LIBSVM, which also gives an example of how to interface two different software packages. The MATLAB code we propose is simple and easy to modify for personal purposes.

LIBSVM is an integrated software package for SVM classification, regression, and distribution estimation [43]. It is distributed with its source code, so that it can be compiled and used on any platform. Executable files are also available for DOS and Windows users. It is composed of four procedures:

svmtrain can be used for training an SVM on a given training set with different parameters.

svmpredict can be used for predicting classifications with an SVM defined by the previous procedure.

svmscale can be used for scaling the data. This procedure is highly recommended by the authors of LIBSVM for avoiding what they call “numerical difficulties” during the calculations. In fact, variables with a larger range of variability can dominate those with smaller ranges, and this may spoil the classification accuracy.

svmtoy is a LIBSVM procedure that can be used for playing with SVMs. It has a graphical interface, where two-dimensional points can be drawn on a virtual plane and assigned to different classes. The procedure provides graphical representations of the SVMs modeling the drawn points. This can be a valuable exercise for checking the SVM classification skills in different situations, such as linearly and nonlinearly separable data.
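The effect of scaling described for svmscale can be illustrated with a minimal sketch. It is written in Python rather than MATLAB so that it can be run on its own, and it is not the svmscale implementation: the function name `scale_columns`, the target interval [-1, 1], and the sample values are assumptions chosen for illustration.

```python
def scale_columns(rows, lower=-1.0, upper=1.0):
    """Linearly rescale each column of `rows` to the interval [lower, upper]."""
    columns = list(zip(*rows))
    scaled_columns = []
    for col in columns:
        lo, hi = min(col), max(col)
        if hi == lo:
            # A constant attribute carries no information; map it to the midpoint.
            scaled_columns.append([(lower + upper) / 2.0] * len(col))
        else:
            scaled_columns.append(
                [lower + (upper - lower) * (v - lo) / (hi - lo) for v in col]
            )
    # Transpose back so that each inner list is again one sample.
    return [list(row) for row in zip(*scaled_columns)]


# Hypothetical samples: the second attribute has a much larger range than the first.
samples = [[0.2, 1500.0], [0.5, 3000.0], [0.9, 4500.0]]
scaled = scale_columns(samples)
for row in scaled:
    print(row)
```

After scaling, both attributes span the same interval, so neither dominates distance or inner-product computations merely because of its units.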

In the following, it is shown how a training set can be generated and used for training an SVM. For generating the data, the MATLAB function generate is used. In this case, however, the data do not have to be used in the MATLAB environment. Hence, the data need to be stored in a text file formatted so that it can be read by the LIBSVM software.

The LIBSVM procedures read text files formatted as follows. At least two text files need to be generated: one containing the samples of the training set and another one containing the samples of a testing set. The samples are listed row by row in the text files, so that each sample is represented on one single row. Each row starts with the identifier of the class the sample belongs to. If the samples are divided into two classes, the identifiers can be −1 and +1. After the identifier, all the components of the vector representing the sample are inserted. For each component, the component counter in {1, 2, ..., n} and its value are inserted, separated by the symbol ‘:’. If known, the class to which each sample belongs can also be inserted in the text file related to the testing set. In this way, svmpredict is able to verify how many unknown samples are classified correctly by the SVM. In Figure 6.8 a modified version of the MATLAB function generate (Figure 3.16) is given. It saves the generated data in the text file trainset.txt by using the functions fopen and fprintf. The function generate4libsvm assigns each sample of the type (−x, y) to class −1 and each sample of the type (+x, y) to class +1.

%
% this function generates a random set of data
% in the two-dimensional space and prints it in
% the text file "trainset.txt" formatted in the
% LIBSVM format
%
% input:
% n - number of random samples to be generated
% eps - predefined margin between samples separated by the line x = 0
%
% output:
% x - x coordinates of the samples
% y - y coordinates of the samples
%

Fig. 6.8 The MATLAB function generate4libsvm.
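The file format and the class assignment described above can be sketched as follows. This is a Python illustration, not the book's MATLAB function: the names `format_libsvm_row`, `generate_samples`, and `write_libsvm_file` are hypothetical, and the sampling scheme (uniform coordinates in [-1, 1], with x pushed away from the line x = 0 by eps) is an assumption about how the margin might be imposed.

```python
import random


def format_libsvm_row(label, components):
    """Return one sample as '<label> 1:<v1> 2:<v2> ...' (counters start at 1)."""
    features = " ".join(f"{i}:{v:.6f}" for i, v in enumerate(components, start=1))
    return f"{label:+d} {features}"


def generate_samples(n, eps, seed=None):
    """Generate n labelled two-dimensional points separated by the line x = 0."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        # Assumed interpretation of the margin: shift x away from x = 0 by eps.
        x = x + eps if x >= 0 else x - eps
        # Points with negative x go to class -1, the others to class +1.
        label = 1 if x >= 0 else -1
        samples.append((label, (x, y)))
    return samples


def write_libsvm_file(path, samples):
    """Write one sample per row to a text file in the LIBSVM format."""
    with open(path, "w") as handle:
        for label, components in samples:
            handle.write(format_libsvm_row(label, components) + "\n")


# Formatting the sample shown in Figure 6.9 reproduces its row.
print(format_libsvm_row(-1, (-0.600916, -0.341989)))
# prints: -1 1:-0.600916 2:-0.341989
```

Under these assumptions, a call such as `write_libsvm_file("trainset.txt", generate_samples(100, 0.1))` would produce a file with the same layout as the one described in the text, although not the same random points.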

A set of 100 samples has been generated by the function generate4libsvm with eps = 0.1. The first samples contained in the text file are shown in Figure 6.9. Another set of 1000 samples has then been generated by the same function with eps = 0.0. This second set is used as a testing set, and hence the file has been renamed from trainset.txt to testset.txt after the generation. The two-dimensional points in the sets of data are generated so that their components range approximately in the set [−1, 1] × [−1, 1], depending on the eps value. For this reason, the procedure svmscale is not used in this example.

-1 1:-0.600916 2:-0.341989

Fig. 6.9 The first rows of file trainset.txt generated by generate4libsvm.

Figure 6.10 provides the commands used for training and testing an SVM. The procedure svmtrain is used for training the SVM. The procedure has many parameters; if they are not specified, their default values are used. In this example, the option ‘-t’ is used for specifying one of the possible kernels that can be employed. The procedure svmpredict is then used for classifying unknown samples with the trained SVM. This procedure needs two text files as input and one text file as output. The first input file is testset.txt, where the samples to be classified are stored. The second one is trainset.txt.model, which is a text file generated by svmtrain where the parameters related to the SVM are saved. Finally, the output file testresult.txt will contain the classification of the unknown samples. The overall accuracy is about 98%.
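The accuracy check performed when the testing file also carries class identifiers can be sketched as follows. This is a Python illustration under stated assumptions, not LIBSVM code: the helper names `read_labels` and `accuracy` are hypothetical, the result file is assumed to hold one predicted identifier per row, and the four samples are made up for the example.

```python
def read_labels(lines):
    """Extract the leading class identifier from each non-empty row."""
    return [int(float(line.split()[0])) for line in lines if line.strip()]


def accuracy(true_labels, predicted_labels):
    """Return (number correct, total, percentage correct)."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("the two files must contain the same number of samples")
    if not true_labels:
        raise ValueError("no samples to compare")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct, len(true_labels), 100.0 * correct / len(true_labels)


# Hypothetical contents of a small testing file and of the matching result file.
test_lines = [
    "+1 1:0.5 2:0.1",
    "-1 1:-0.4 2:0.3",
    "+1 1:0.2 2:-0.7",
    "-1 1:-0.9 2:0.8",
]
result_lines = ["1", "-1", "-1", "-1"]

correct, total, percent = accuracy(read_labels(test_lines), read_labels(result_lines))
print(f"Accuracy = {percent}% ({correct}/{total})")
# prints: Accuracy = 75.0% (3/4)
```

In practice the two lists of rows would be read from testset.txt and testresult.txt instead of being written inline.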

LIBSVM> svmtrain -t 3 trainset.txt
*
optimization finished, #iter = 16
nu = 0.213405
obj = -14.075954, rho = -0.091571
nSV = 23, nBSV = 20
Total nSV = 23
LIBSVM> svmpredict testset.txt trainset.txt.model testresult.txt
Accuracy = 98.1% (981/1000) (classification)

Fig. 6.10 The DOS commands for training and testing an SVM by LIBSVM.

6.7 Exercises

This section presents some exercises related to SVMs. All the solutions are reported in Chapter 10.

1. Let us suppose that a set of points in a three-dimensional space is defined as follows. The generic point of this set is the triplet

(A, B, C)

such that each component can have value 0 or 1. Let us suppose that all the points grouped in the class C0 satisfy the rule:

A AND B AND C = 0,

whereas all points grouped in the class C1 satisfy the rule:

A AND B AND C = 1.

State whether the two classes C0 and C1 are linearly separable.

2. As in the previous exercise, check whether the two classes C0 and C1 are linearly separable when the classes are defined as:

C0 = {(A, B, C): NOT A AND B = 0}

C1 = {(A, B, C): NOT A AND B = 1}

and when the classes are defined as:

C0 = {(A, B, C): (A OR B) AND (A AND C) = 0}

C1 = {(A, B, C): (A OR B) AND (A AND C) = 1}.

3. Suppose that a set of points and their classifications in two classes C+ and C− are specified as follows:

State why the classes C+ and C− are not linearly separable.

4. Consider the set of points and their classification as described in Exercise 3.

Transform the set of points by using the function

(x1, x2)=

Check also if the set of points is linearly separable after the transformation.

5. Consider the set of points and their classification as described in Exercise 3.

Formulate the primal optimization problem for finding the maximum margin classifier in the higher-dimensional space defined by the function (x1, x2) in Exercise 4.

6. Reproduce the experiment discussed in Section 6.6 by using different kernel functions.

7. Considering the context of Section 6.1, prove that

$M = \dfrac{2}{\sqrt{w^T w}}$.

Chapter 7

Biclustering
