6.2 Nonlinear classifiers

The problem of classifying samples into two classes that can be separated by a hyperplane was discussed in the previous section. However, the hypothesis of linear separability is not always satisfied, as Figure 6.3 shows. In this case, the apples cannot be separated by a line in the Cartesian system, and a more complex classifier must be used. In fact, in most real-world applications there is no reason to expect that the classes can be separated by hyperplanes. Therefore, SVMs need to be extended to address more general cases: not only hyperplanes, but also nonlinear surfaces must be considered. Working with nonlinear surfaces, however, can be much more complex than working with hyperplanes. For this reason, a very clever method has been developed for handling classes that are not linearly separable.

Fig. 6.3 An example in which samples cannot be classified by a linear classifier.

Let us consider the one-dimensional points in Figure 6.4. The problem is to find a classifier for the samples on the one-dimensional space defined by the x axis of the Cartesian system. As one can easily note, the samples are not linearly separable, because one class (class 2) has samples ranging from 0 to 1, whereas the other class (class 1) has samples before 0 and after 1. No hyperplane (actually a point, in this one-dimensional example) can separate the two classes. Let us now project these data into a two-dimensional space, as Figure 6.4 shows. The points belonging to class 1 and class 2 are now linearly separable, and hence a linear classifier can be used. It is obvious that this data transformation may substantially increase the dimension of the problem: in this example, the dimension of the new problem is twice the original one.
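
To make this projection concrete, below is a minimal Python sketch, assuming a quadratic mapping Φ(x) = (x, (x − 0.5)²) and hypothetical sample values; the exact mapping behind Figure 6.4 is not specified in the text, so both are illustrative only.

```python
import numpy as np

# Hypothetical one-dimensional samples: class 2 lies in [0, 1],
# class 1 lies before 0 and after 1 (not linearly separable on the x axis).
class1 = np.array([-0.8, -0.3, 1.2, 1.7])
class2 = np.array([0.1, 0.4, 0.6, 0.9])

# Assumed quadratic mapping Phi(x) = (x, (x - 0.5)^2): the second coordinate
# measures the squared distance from the centre of the class-2 interval.
def phi(x):
    return np.column_stack([x, (x - 0.5) ** 2])

# In the transformed two-dimensional space the horizontal line y = 0.25
# separates the two classes, so a linear classifier can now be used.
print(phi(class1))   # second coordinate > 0.25 for every class-1 sample
print(phi(class2))   # second coordinate < 0.25 for every class-2 sample
```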

The function which transforms the data is called a mapping function and is usually denoted by Φ. A sample x_i in the original space is represented by Φ(x_i) in the newly transformed space. It follows that the general equation of the hyperplane in this case is

w^T \Phi(x) + b = 0

and that the dual formulation of the optimization problem (6.4) is

\max_{\alpha} \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} c_i c_j \, \Phi(x_i)^T \Phi(x_j) \, \alpha_i \alpha_j

Fig. 6.4 Example of a set of data which is not linearly classifiable in its original space. It becomes linearly separable in a two-dimensional space.

subject to the constraints (6.5). In order to use this formulation of the optimization problem, a function able to transform the data set into one that can be separated by hyperplanes needs to be defined. However, this is actually not required, because the function Φ appears only through the inner product Φ(x_i)^T Φ(x_j) in the objective function of the optimization problem. Hence there is no need to compute Φ explicitly: this inner product can be replaced by a suitable function

K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j),

which is called the SVM kernel function. In this way, an SVM can be trained by solving the optimization problem (6.4) in which the inner product Φ(x_i)^T Φ(x_j) is replaced by the kernel function K(x_i, x_j), so that cases where the data are not linearly separable can be handled. The corresponding optimization problem is then

\max_{\alpha} \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} c_i c_j \, K(x_i, x_j) \, \alpha_i \alpha_j \qquad (6.6)

subject to the constraints (6.5). Kernel functions K(x_i, x_j) can be obtained from the specific mapping used to transform the data. This approach, however, requires the definition of a suitable mapping. A more common approach is therefore to avoid defining a mapping Φ explicitly and to look directly for a function that can work as a kernel. In practice, pre-defined kernel functions are usually used, and different SVMs can be trained with different kernels in order to select the one that performs best on the given problem. The choice of a kernel in an SVM can be considered as analogous to the problem of choosing a suitable architecture in a neural network.
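
As a rough illustration of this, the sketch below (assuming scikit-learn is available) trains an SVM from a precomputed matrix of kernel values K(x_i, x_j), so that the mapping Φ is never evaluated explicitly; the toy data are hypothetical and the kernel used is the degree-2 polynomial kernel introduced next.

```python
import numpy as np
from sklearn.svm import SVC

# Toy version of the data in Figure 6.4, with labels c_i in {1, 2}.
X = np.array([[-0.8], [-0.3], [1.2], [1.7], [0.1], [0.4], [0.6], [0.9]])
y = np.array([1, 1, 1, 1, 2, 2, 2, 2])

# Polynomial kernel of degree 2, computed directly from inner products in the
# original space; the corresponding mapping Phi is never constructed.
def poly_kernel(A, B, d=2):
    return (A @ B.T + 1) ** d

gram = poly_kernel(X, X)            # matrix of kernel values K(x_i, x_j)
svm = SVC(kernel="precomputed")     # the dual (6.6) is solved from the Gram matrix alone
svm.fit(gram, y)

# Predicting a new point requires only its kernel values against the training set.
x_new = np.array([[0.5]])
print(svm.predict(poly_kernel(x_new, X)))   # 0.5 lies inside the class-2 interval
```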

The most used kernels include the polynomial kernel

K(x_i, x_j) = \left( x_i^T x_j + 1 \right)^d,

which is able to lift the feature space by including all monomials of the original features up to degree d.
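
The following sketch checks this numerically for two features and d = 2, assuming the explicit monomial map Φ(x) = (x₁², x₂², √2·x₁x₂, √2·x₁, √2·x₂, 1): the kernel value and the inner product of the mapped vectors coincide.

```python
import numpy as np

# Explicit monomial map for two features and degree d = 2: all monomials of the
# original features up to degree 2 (the sqrt(2) factors make the products match).
def phi_poly2(x):
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     1.0])

def poly_kernel(x, y, d=2):
    return (np.dot(x, y) + 1) ** d

x = np.array([1.5, -0.7])
y = np.array([0.3, 2.0])

# The two quantities are identical: the kernel evaluates the inner product in
# the lifted space without ever constructing phi_poly2.
print(poly_kernel(x, y))                    # (x^T y + 1)^2
print(np.dot(phi_poly2(x), phi_poly2(y)))   # Phi(x)^T Phi(y)
```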

Examples of kernels also include the Gaussian kernel

K(x_i, x_j) = \exp\left( -\frac{\| x_i - x_j \|^2}{2 \sigma^2} \right).

The Gaussian kernel represents the most reasonable choice because of its simplicity and its ability to model data of arbitrary complexity. It is provided with a tuning parameter σ that adjusts the kernel's width.
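
A possible way to carry out the kernel selection mentioned above is sketched below, assuming scikit-learn: several SVMs are trained with polynomial and Gaussian kernels on a hypothetical two-class data set, and the kernel with the best cross-validated accuracy is retained. Note that scikit-learn parameterizes the Gaussian (RBF) kernel through gamma, which here corresponds to 1/(2σ²).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Hypothetical two-class data set: class 1 outside the unit circle, class 2
# inside it, so the classes are not linearly separable in the original space.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, 2)

candidates = {
    # gamma=1, coef0=1 reproduce the polynomial kernel (x_i^T x_j + 1)^d.
    "polynomial, d=2": SVC(kernel="poly", degree=2, gamma=1, coef0=1),
    "polynomial, d=3": SVC(kernel="poly", degree=3, gamma=1, coef0=1),
    # gamma = 1 / (2 * sigma^2) reproduces the Gaussian kernel written above.
    "Gaussian, sigma=1.0": SVC(kernel="rbf", gamma=1 / (2 * 1.0 ** 2)),
    "Gaussian, sigma=0.5": SVC(kernel="rbf", gamma=1 / (2 * 0.5 ** 2)),
}

# Train one SVM per kernel and keep the one with the best cross-validated accuracy.
for name, svm in candidates.items():
    accuracy = cross_val_score(svm, X, y, cv=5).mean()
    print(f"{name}: accuracy = {accuracy:.2f}")
```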

SVMs coupled with these kernel functions are able to classify sets of data with good accuracy, as in the applications discussed in Section 6.5. As pointed out in [24], however, there is probably no theoretical explanation of why SVMs perform so well in practice. Following the quoted paper, we can say that, even though it is commonly accepted that maximizing the margin M is good for generalization, there is no way to prove it. Moreover, SVMs are based on ideas guided by geometric intuition, as in the example of the apples on a Cartesian system. This intuition, however, cannot be applied in the higher-dimensional spaces obtained through the kernel functions. In the case of the Gaussian kernel, the obtained space has infinite dimension. This, though, suggests why the Gaussian kernel works well: any two disjoint sets of data in the original space can indeed be separated by a hyperplane in the infinite-dimensional space.
