Open Archive TOULOUSE Archive Ouverte (OATAO)

OATAO is an open access repository that collects the work of Toulouse researchers and makes it freely available over the web where possible.

This is an author-deposited version published in: http://oatao.univ-toulouse.fr/
Eprints ID: 17710
To link to this article: DOI: 10.1051/eas/1677008
URL: http://dx.doi.org/10.1051/eas/1677008/
To cite this version: Fauvel, Mathieu. Introduction to the Kernel Methods. (2016) EAS-journal, vol. 77, pp. 171-193. ISSN 1633-4760

Any correspondence concerning this service should be sent to the repository administrator.

CLASSIFICATION OF MULTIVARIATE DATA

Mathieu Fauvel


Abstract. In this chapter, kernel methods are presented for the classification of multivariate data. An introductory example is given to highlight the main idea of kernel methods. Then emphasis is placed on the Support Vector Machine. Structural risk minimization is presented, and linear and non-linear SVM are described. Finally, a full example of SVM classification is given on simulated hyperspectral data.

Kernel methods are now a standard tool in the statistical learning area. These methods are very generic and can be applied to many data sets where relationships between numbers, strings or graphs need to be analyzed. In this chapter, we provide an introduction to the theory of kernel methods for the classification of multivariate data. In the first section, an introductory example is treated to illustrate the kernel trick concept, i.e., how to transform a linear algorithm into a non-linear one. Then, in Section 2, kernels are defined and some of their properties are reviewed. The introductory example is continued in Section 3, where the kernel trick is applied to the conventional K-NN. Section 4 provides a deeper introduction to Support Vector Machines, the most widely used kernel method. A case study is given in Section 5, and the last section provides the R code used to illustrate this chapter.

1 Introductory example

This section is based on Schölkopf (2000).

1.1 Linear case

Suppose we want to classify the following data

S = \{(x_1, y_1), \ldots, (x_n, y_n)\}, \quad (x_i, y_i) \in \mathbb{R}^2 \times \{\pm 1\}

Figure 1. Toy example - (a) Linear case and (b) Separating hyperplane.

that are distributed as in Figure 1.(a). x_i is a sample and y_i is its corresponding class.

Let us start with a simple classifier: it assigns a new sample x to the class whose mean is closer to x. Denoting by \mu_1 and \mu_{-1} the mean vectors of classes 1 and -1, respectively, the decision rule f(x) is defined as follows:

f(x) = \mathrm{sgn}\big( \|\mu_{-1} - x\|^2 - \|\mu_1 - x\|^2 \big). \quad (1.1)

Equation (1.1) can be written in the following way:

f(x) = \mathrm{sgn}\big( \langle w, x\rangle + b \big),

where w and b can be found as follows (m_1 and m_{-1} are the number of samples of classes 1 and -1, respectively):

\|\mu_1 - x\|^2 = \Big\langle \frac{1}{m_1}\sum_{i=1}^{m_1} x_i - x,\ \frac{1}{m_1}\sum_{i=1}^{m_1} x_i - x \Big\rangle
= \frac{1}{m_1^2}\sum_{i,k=1}^{m_1} \langle x_i, x_k\rangle + \langle x, x\rangle - \frac{2}{m_1}\sum_{i=1}^{m_1} \langle x_i, x\rangle, \quad (1.2)

\|\mu_{-1} - x\|^2 = \frac{1}{m_{-1}^2}\sum_{j,l=1}^{m_{-1}} \langle x_j, x_l\rangle + \langle x, x\rangle - \frac{2}{m_{-1}}\sum_{j=1}^{m_{-1}} \langle x_j, x\rangle. \quad (1.3)

Figure 2. Toy example - Non linear case.

Plugging (1.2) and (1.3) into (1.1) leads to

\langle w, x\rangle + b = \Big\langle 2\Big( \frac{1}{m_1}\sum_{i=1}^{m_1} x_i - \frac{1}{m_{-1}}\sum_{j=1}^{m_{-1}} x_j \Big), x \Big\rangle - \frac{1}{m_1^2}\sum_{i,k=1}^{m_1} \langle x_i, x_k\rangle + \frac{1}{m_{-1}^2}\sum_{j,l=1}^{m_{-1}} \langle x_j, x_l\rangle.

Then we have

w = 2\sum_{i=1}^{m_1+m_{-1}} \alpha_i y_i x_i, \quad (1.4)

with \alpha_i = 1/m_1 if y_i = 1 and \alpha_i = 1/m_{-1} if y_i = -1, and

b = -\frac{1}{m_1^2}\sum_{i,k=1}^{m_1} \langle x_i, x_k\rangle + \frac{1}{m_{-1}^2}\sum_{j,l=1}^{m_{-1}} \langle x_j, x_l\rangle. \quad (1.5)

So for this naive classifier, given the training samples, it is possible to compute its parameters explicitly. The separating hyperplane is plotted in Figure 1.(b) using the code given in Figure 9.
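The closed-form parameters above can be sketched directly. The following is a minimal illustration in Python (the chapter's own code, in Figure 9, is in R); the toy data are hypothetical stand-ins for the two Gaussian clouds of Figure 1.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def nearest_mean_classifier(pos, neg):
    """Build f(x) = sgn(<w, x> + b) with w = 2(mu_1 - mu_-1), Eq. (1.4)."""
    mu_p, mu_n = mean(pos), mean(neg)
    w = [2 * (a - b) for a, b in zip(mu_p, mu_n)]
    # Eq. (1.5) in closed form: b = ||mu_-1||^2 - ||mu_1||^2
    b = dot(mu_n, mu_n) - dot(mu_p, mu_p)
    return lambda x: 1 if dot(w, x) + b > 0 else -1

pos = [[1.0, 1.0], [2.0, 1.5]]      # toy class +1
neg = [[-1.0, -1.0], [-2.0, -0.5]]  # toy class -1
f = nearest_mean_classifier(pos, neg)
```

A new sample is then assigned to the class whose mean is closer, exactly as in Eq. (1.1).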

1.2 Non linear case

Now the data are distributed a bit differently, as shown in Figure 2. The previous classifier is not adapted to this situation. However, if we can find a feature space where the data are linearly separable, it can still be applied. Two feature maps can be considered:

1. Projection in the polar domain:

\varphi : \mathbb{R}^2 \to \mathbb{R}^2, \quad x \mapsto \varphi(x), \quad (x_1, x_2) \mapsto (\rho, \theta),

with \rho = \sqrt{x_1^2 + x_2^2} and \theta = \arctan(x_2 / x_1).

2. Projection in the space of monomials of order 2:

\phi : \mathbb{R}^2 \to \mathbb{R}^3, \quad x \mapsto \phi(x), \quad (x_1, x_2) \mapsto (x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2).

For the feature space associated with monomials of order 2, the projected samples are now distributed as shown in Figure 3.(a) (only the variables x_1^2 and x_2^2 are displayed). In the following, this transformation is considered to derive the decision rule in the feature space.

In \mathbb{R}^3, the inner product can be expressed as

\langle \phi(x), \phi(x')\rangle_{\mathbb{R}^3} = \sum_{i=1}^{3} \phi(x)_i \phi(x')_i
= \phi(x)_1\phi(x')_1 + \phi(x)_2\phi(x')_2 + \phi(x)_3\phi(x')_3
= x_1^2 x_1'^2 + x_2^2 x_2'^2 + 2 x_1 x_2 x_1' x_2'
= (x_1 x_1' + x_2 x_2')^2
= \langle x, x'\rangle_{\mathbb{R}^2}^2
= k(x, x'),

where k(\cdot, \cdot) is the square of the dot product in \mathbb{R}^2. Therefore, we can write the parameters of the separating hyperplane in the feature space:

w = 2\sum_{i=1}^{m_1+m_{-1}} \alpha_i y_i \phi(x_i), \quad
b = -\frac{1}{m_1^2}\sum_{i,p=1}^{m_1} k(x_i, x_p) + \frac{1}{m_{-1}^2}\sum_{j,l=1}^{m_{-1}} k(x_j, x_l).

The decision rule can be written in the input space thanks to the function k:

f(x) = \mathrm{sgn}\Big( 2\sum_{i=1}^{m_1+m_{-1}} \alpha_i y_i k(x_i, x) - \frac{1}{m_1^2}\sum_{i,p=1}^{m_1} k(x_i, x_p) + \frac{1}{m_{-1}^2}\sum_{j,l=1}^{m_{-1}} k(x_j, x_l) \Big).
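This kernelized decision rule can be sketched as follows, in Python for illustration (the chapter's own code is in R). Every dot product of the linear classifier is replaced by k(x, x') = <x, x'>^2, so the rule operates implicitly in the monomial feature space; the ring-vs-center toy data are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def k(x, xp):
    return dot(x, xp) ** 2  # square of the dot product in R^2

def kernel_mean_classifier(pos, neg):
    m1, m_1 = len(pos), len(neg)
    # b from Eq. (1.5), with dot products replaced by kernel evaluations
    b = (-sum(k(xi, xp) for xi in pos for xp in pos) / m1 ** 2
         + sum(k(xj, xl) for xj in neg for xl in neg) / m_1 ** 2)
    def f(x):
        # 2 * sum_i alpha_i y_i k(x_i, x), with alpha_i = 1/m_{y_i}
        s = 2 * (sum(k(xi, x) for xi in pos) / m1
                 - sum(k(xj, x) for xj in neg) / m_1)
        return 1 if s + b > 0 else -1
    return f

pos = [[1.5, 0.0], [0.0, 1.5], [-1.5, 0.0], [0.0, -1.5]]  # ring, class +1
neg = [[0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1]]  # center, class -1
f = kernel_mean_classifier(pos, neg)
```

Note that the feature map never appears in the code: only kernel evaluations are needed.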

Figure 3. Toy example - Non linear case (a) Feature space and (b) Separating function.

1.3 Conclusion

With the two previous examples, we have seen that a linear algorithm can be turned into a non-linear one simply by replacing the dot product with an appropriate function. This function has to be equivalent to a dot product in a feature space. It is called a kernel function, or just a kernel. In the next section, some properties of kernels are reviewed.

2 Kernel Function

2.1 Positive semi-definite kernels

This section is based on Camps-Valls and Bruzzone (2009).

Definition 1 (Positive semi-definite kernel). k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R} is positive semi-definite if

• \forall (x, x') \in \mathbb{R}^d \times \mathbb{R}^d, k(x, x') = k(x', x);

• \forall n \in \mathbb{N}, \forall \xi_1, \ldots, \xi_n \in \mathbb{R}, \forall x_1, \ldots, x_n \in \mathbb{R}^d, \sum_{i,j}^{n} \xi_i \xi_j k(x_i, x_j) \geq 0.

Definition 2 (Gram matrix). Given a kernel k and samples x_1, \ldots, x_n, the Gram matrix K is the n \times n symmetric matrix with entries K_{ij} = k(x_i, x_j).
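The two conditions of Definition 1 can be checked numerically on a Gram matrix. A minimal sketch in Python (the chapter's own code is in R), using the linear kernel and randomly drawn coefficients \xi_i:

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def k(x, xp):
    return dot(x, xp)  # linear kernel, positive semi-definite

samples = [[1.0, 0.0], [0.5, 0.5], [-1.0, 2.0]]
n = len(samples)
# Gram matrix of Definition 2
K = [[k(samples[i], samples[j]) for j in range(n)] for i in range(n)]

# Symmetry: K_ij = K_ji
symmetric = all(K[i][j] == K[j][i] for i in range(n) for j in range(n))

def quad_form(xi):
    # sum_ij xi_i xi_j K_ij, which equals ||sum_i xi_i x_i||^2 >= 0 here
    return sum(xi[i] * xi[j] * K[i][j] for i in range(n) for j in range(n))

rng = random.Random(0)
psd_ok = all(
    quad_form([rng.uniform(-1, 1) for _ in range(n)]) >= -1e-12
    for _ in range(200)
)
```

Sampling coefficients only gives numerical evidence, not a proof; for the linear kernel the quadratic form is a squared norm, so it is non-negative by construction.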

Theorem 1 (Moore-Aronszajn (1950)). To every positive semi-definite kernel k, there exists a Hilbert space H and a feature map \phi : \mathbb{R}^d \to H such that for all x, x' \in \mathbb{R}^d we have k(x, x') = \langle \phi(x), \phi(x')\rangle_H.

Hence, to turn a linear algorithm into a non-linear one, the kernel has to be positive semi-definite, since there then always exists a feature space H where the dot product is equivalent to the kernel evaluation. Note that in practice, it is not necessary to know the feature map and the feature space associated with k: one just has to verify that k is positive semi-definite.

A kernel is usually seen as a measure of similarity between two samples. Depending on the kernel used, this similarity can be related to geometrical or statistical properties. In practice, it is possible to define kernels using some a priori knowledge about our data.

The construction of a kernel adapted to a given task might be difficult. Fortunately, it is easy to combine kernels. Let k_1 and k_2 be positive semi-definite functions; then:

• \lambda_1 k_1 + \lambda_2 k_2 is positive semi-definite (\lambda_1, \lambda_2 \geq 0).

• k_1 k_2 is positive semi-definite.

• \exp(k_1) is positive semi-definite.

• g(x)g(x') is positive semi-definite, with g : \mathbb{R}^d \to \mathbb{R}.
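These closure rules can be illustrated numerically, in Python for illustration: the Gaussian kernel of the next subsection can be assembled from simpler positive semi-definite pieces, since exp(-||x - x'||^2 / (2 s^2)) = g(x) g(x') exp(<x, x'> / s^2) with g(x) = exp(-||x||^2 / (2 s^2)), i.e., a product of a g(x)g(x') kernel and the exponential of a scaled linear kernel. The value sigma = 2 is arbitrary.

```python
import math

sigma = 2.0

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gaussian_kernel(x, xp):
    # direct definition
    d2 = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-d2 / (2 * sigma ** 2))

def g(x):
    return math.exp(-dot(x, x) / (2 * sigma ** 2))

def assembled_kernel(x, xp):
    # g(x)g(x') kernel times exp of a scaled linear kernel
    return g(x) * g(xp) * math.exp(dot(x, xp) / sigma ** 2)

x, xp = [1.0, -0.5], [0.3, 2.0]
```

The identity follows from expanding ||x - x'||^2 = ||x||^2 + ||x'||^2 - 2<x, x'>.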

2.2 Some kernels

Definition 3 (Polynomial kernel). The polynomial kernel of order p and bias q is defined as

k(x_i, x_j) = (\langle x_i, x_j\rangle + q)^p = \sum_{l=0}^{p} \binom{p}{l} q^{p-l} \langle x_i, x_j\rangle^l.

It corresponds to the feature space of monomials up to degree p. Depending on whether q \gtrless 1, the relative weight of the higher-order monomials is increased or decreased.

Definition 4 (Gaussian kernel). The Gaussian kernel with parameter \sigma is defined as

k(x_i, x_j) = \exp\Big( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \Big).

More generally, any distance can be used in the exponential. For instance, the spectral angle is a valid distance:

\Theta(x_i, x_j) = \arccos\Big( \frac{\langle x_i, x_j\rangle}{\|x_i\| \|x_j\|} \Big).

Figure 4 shows the kernel evaluation on [-2, 2] \times [-2, 2] for the polynomial kernel and the Gaussian kernel. The Gaussian kernel is said to be local: the value of the kernel is related to the geometrical distance between two samples in the input space. In Figure 4, samples far from the black point have low kernel evaluations, while samples close to the black point have high values. In addition, the Gaussian kernel is isotropic, i.e., it does not depend on the orientation of the samples in the feature space. The polynomial kernel is said to be global: two samples with a high Euclidean distance in the input space can still have a high kernel value.
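The local-versus-global behaviour can be seen numerically. A small Python sketch with the reference point x = [1, 1] of Figure 4: a far-away but aligned sample gets a low Gaussian kernel value and a high polynomial one.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def k_poly(x, xp, p=2, q=0):
    # polynomial kernel of order p and bias q
    return (dot(x, xp) + q) ** p

def k_gauss(x, xp, sigma=2.0):
    # Gaussian kernel with parameter sigma
    d2 = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-d2 / (2 * sigma ** 2))

x = [1.0, 1.0]
near = [1.2, 1.0]
far = [3.0, 3.0]  # far from x, but in the same direction
```

The Gaussian value decays with geometric distance, while the polynomial value grows with the dot product, regardless of distance.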

Figure 4. Values of kernel functions: (a) Polynomial kernel values for p = 2, q = 0 and x = [1, 1] and (b) Gaussian kernel values for \sigma = 2 and x = [1, 1].

2.3 Kernels on images

Using the summation property, it is possible to build kernels that include information from the spatial domain Fauvel et al. (2012). If one can extract, for instance, the local correlation, or any other local pixel-based descriptor, it is possible to build a kernel that combines these two types of information. It can also be the spatial position of the pixel. A general form for such a kernel is:

k(x_i, x_j) = \lambda\, k_{\text{spatial}}(x_i, x_j) + (1 - \lambda)\, k_{\text{spectral}}(x_i, x_j).
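A sketch of this composite kernel, in Python for illustration. The pixel descriptors are hypothetical: each pixel carries a spectrum (fed to a linear spectral kernel) and an image position (fed to a Gaussian spatial kernel), and lam weights the two terms.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def k_spectral(si, sj):
    return dot(si, sj)  # linear kernel on the spectra

def k_spatial(pi, pj, sigma=1.0):
    # Gaussian kernel on the pixel positions
    d2 = sum((a - b) ** 2 for a, b in zip(pi, pj))
    return math.exp(-d2 / (2 * sigma ** 2))

def k_composite(pix_i, pix_j, lam=0.5):
    # convex combination of the spatial and spectral terms
    (si, pi), (sj, pj) = pix_i, pix_j
    return lam * k_spatial(pi, pj) + (1 - lam) * k_spectral(si, sj)

# two hypothetical pixels: (spectrum, position)
pix_a = ([0.2, 0.4, 0.1], [3.0, 7.0])
pix_b = ([0.3, 0.3, 0.2], [3.0, 8.0])
```

Since both terms are positive semi-definite and lam is in [0, 1], the combination is a valid kernel by the summation rule of Section 2.1.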

3 Introductory example continued: Kernel K-NN

The K-NN decision rule is based on the distance between two samples (Hastie et al., 2009, Chapter 13.3). In the feature space, the distance can be computed as \|\phi(x_i) - \phi(x_j)\|_H^2. As usual, it can be written in terms of kernel evaluations:

\|\phi(x_i) - \phi(x_j)\|_H^2 = \langle \phi(x_i) - \phi(x_j), \phi(x_i) - \phi(x_j)\rangle_H
= \langle \phi(x_i), \phi(x_i)\rangle_H + \langle \phi(x_j), \phi(x_j)\rangle_H - 2\langle \phi(x_i), \phi(x_j)\rangle_H
= k(x_i, x_i) + k(x_j, x_j) - 2 k(x_i, x_j).

With the Gaussian kernel, the norm of each sample in the feature space is one. Hence, we have the following distance function:

\|\phi(x_i) - \phi(x_j)\|_H^2 = 2\big(1 - k(x_i, x_j)\big).

Figure 5. (a) KNN classification and (b) Kernel KNN classification with a polynomial kernel of order 2.

However, note that since exp is a monotone function, kernel K-NN with a Gaussian kernel is exactly the same as conventional K-NN. With other kernels, however, the results will differ. Results on the toy data set are given in Figure 5, using the code in Figure 12. A polynomial kernel of order 2 is used.
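The kernel K-NN rule can be sketched as follows, in Python for illustration, using the distance identity above with a polynomial kernel of order 2 (as in Figure 5). The ring-shaped toy data are hypothetical stand-ins for the data of Figure 2.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def k(x, xp):
    return dot(x, xp) ** 2  # polynomial kernel, order 2, bias 0

def kernel_dist2(a, b):
    # ||phi(a) - phi(b)||^2 expressed with kernel evaluations only
    return k(a, a) + k(b, b) - 2 * k(a, b)

def kernel_knn(x, data, labels, K=3):
    # rank training samples by feature-space distance, majority vote on top K
    ranked = sorted(range(len(data)), key=lambda i: kernel_dist2(x, data[i]))
    votes = sum(labels[i] for i in ranked[:K])
    return 1 if votes > 0 else -1

data = ([[1.5, 0.0], [0.0, 1.5], [-1.5, 0.0], [0.0, -1.5]]    # ring, class +1
        + [[0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1]])  # center, class -1
labels = [1, 1, 1, 1, -1, -1, -1, -1]
```

Only the distance computation changes with respect to conventional K-NN; the voting step is untouched.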

4 Support Vector Machines

4.1 Learn from data

The SVM belongs to the family of classification algorithms that solve a supervised learning problem: given a set of samples with their corresponding classes, find a function that assigns each sample to its corresponding class. The aim of statistical learning theory is to find a satisfactory function that will correctly classify training samples and unseen samples, i.e., that has a low generalization error. The basic setting of such a classification problem is as follows. Given a training set S:

S = \{(x_1, y_1), \ldots, (x_n, y_n)\} \in \mathbb{R}^d \times \{-1, 1\}

generated i.i.d. from an unknown probability law P(x, y), and a loss function L, we want to find a function f from a set of functions F that minimizes its expected loss, or risk, R(f):

R(f) = \int L\big(f(x), y\big)\, dP(x, y). \quad (4.1)


Unfortunately, since P(x, y) is unknown, the above equation cannot be computed. However, given S, we can still compute the empirical risk, Remp(f ):

R_{emp}(f) = \frac{1}{n} \sum_{i=1}^{n} L\big(f(x_i), y_i\big) \quad (4.2)

and try to minimize that. This principle is called Empirical Risk Minimization (ERM) and is employed in conventional learning algorithms, e.g., neural networks. The law of large numbers ensures that if f_1 minimizes R_{emp}, we have R_{emp}(f_1) \to R(f_1) as n tends to infinity. But f_1 is not necessarily a minimizer of R. So the minimization of (4.2) could yield an unsatisfactory solution to the classification problem. An example, arising from the No Free Lunch theorem, is that given one training set it is always possible to find a function that fits the data with no error but which is unable to classify a single sample from the testing set correctly.

To solve this problem, the classic Bayesian approach consists of selecting an a priori distribution for P(x, y) and then minimizing (4.1). In statistical learning, no assumption is made about the distribution, but only about the complexity of the class of functions F. The main idea is to favor simple functions, to avoid over-fitting problems, and to achieve a good generalization ability Müller et al. (2001). One way of modeling the complexity is given by the VC (Vapnik-Chervonenkis) theory: the complexity of F is measured by the VC dimension h, and the structural risk minimization (SRM) principle allows one to select the function f \in F that minimizes an upper bound on the error Vapnik (1998, 1999). Hence, the upper bound is defined as a function depending on R_{emp} and h. For example, given a set of functions F with VC dimension h and a classification problem with a loss function L(x, y) = \frac{1}{2}|y - f(x)|, then for all 1 > \eta > 0 and f \in F we have Vapnik (1998, 1999):

R(f) \leq R_{emp}(f) + \sqrt{ \frac{ h\big(\ln\frac{2n}{h} + 1\big) - \ln\frac{\eta}{4} }{ n } } \quad (4.3)

with probability at least 1 - \eta and for n > h. Following VC theory, the training step of this classifier should minimize the right-hand side of inequality (4.3). Other bounds can be found for different loss functions and measures of complexity Vapnik (1998).
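The behaviour of the bound (4.3) can be evaluated numerically. A short Python sketch, under assumed values h = 10 and eta = 0.05: the capacity term shrinks as the number of training samples n grows, so the bound tightens toward the empirical risk.

```python
import math

def vc_bound(remp, n, h, eta):
    """Right-hand side of inequality (4.3)."""
    capacity = math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)
    return remp + capacity

# same empirical risk, increasing sample sizes (h and eta are assumed values)
b_small = vc_bound(0.1, 1000, 10, 0.05)
b_large = vc_bound(0.1, 100000, 10, 0.05)
```

Since the capacity term grows only logarithmically in n but is divided by n, the bound decreases monotonically for large n.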

In the following, we are going to present one particular class of functions that leads to a linear decision function and the definition of the SVM classifier.

4.2 Linear SVM

Definition 5 (Separating hyperplane). Given a supervised classification problem, a separating hyperplane H(w, b) is a linear decision function that separates the space into two half-spaces, each half-space corresponding to a given class, i.e., \mathrm{sgn}(\langle w, x_i\rangle + b) = y_i for all samples from S.

Figure 6. (a) Separating hyperplanes and (b) SVM separating hyperplane. Squares are the support vectors. Figures were adapted from http://blog.pengyifan.com/tikz-example-svm-trained-with-samples-from-two-classes/.

For samples in S, the condition of correct classification is y_i(\langle w, x_i\rangle + b) > 0. If we assume that the closest samples satisfy |\langle w, x_i\rangle + b| = 1 (this is always possible since H is defined up to a multiplicative constant), we have

y_i(\langle w, x_i\rangle + b) \geq 1. \quad (4.4)

From Figure 6.(a), several separating hyperplanes can be found given S. According to the Vapnik-Chervonenkis theory Smola et al. (2000), the optimal one (with the best generalization ability) is the one that maximizes the margin, subject to Eq. (4.4). The margin is inversely proportional to \|w\|_2. The optimal parameters can be found by solving the following convex optimization problem:

\text{minimize} \quad \frac{\langle w, w\rangle}{2}
\quad \text{subject to} \quad y_i(\langle w, x_i\rangle + b) \geq 1, \quad \forall i \in 1, \ldots, n.

It is usually solved using Lagrange multipliers Boyd and Vandenberghe (2006). The Lagrangian

L(w, b, \alpha) = \frac{\langle w, w\rangle}{2} + \sum_{i=1}^{n} \alpha_i\big(1 - y_i(\langle w, x_i\rangle + b)\big) \quad (4.5)

is minimized with respect to w and b, and maximized with respect


to \alpha_i. At the optimal point, the gradient vanishes:

\frac{\partial L}{\partial w} = w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0, \quad (4.6)

\frac{\partial L}{\partial b} = \sum_{i=1}^{n} \alpha_i y_i = 0. \quad (4.7)

From (4.6), we can see that w lives in the subspace spanned by the training samples: w = \sum_{i=1}^{n} \alpha_i y_i x_i. By substituting (4.6) and (4.7) into (4.5), we get the dual quadratic problem:

\max_{\alpha} \quad g(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle
\quad \text{subject to} \quad 0 \leq \alpha_i, \quad \sum_{i=1}^{n} \alpha_i y_i = 0. \quad (4.8)

When this dual problem is solved, we obtain the \alpha_i and hence w. This leads to the decision rule:

g(x) = \mathrm{sgn}\Big( \sum_{i=1}^{n} \alpha_i y_i \langle x, x_i\rangle + b \Big). \quad (4.9)
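A worked illustration of the dual decision rule (4.9), in Python, on a two-sample toy problem x1 = (1, 0), y1 = +1 and x2 = (-1, 0), y2 = -1. For this problem the maximizer of the dual is alpha1 = alpha2 = 1/2 with b = 0, so that w = sum_i alpha_i y_i x_i = (1, 0) and both margin constraints are active.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

X = [[1.0, 0.0], [-1.0, 0.0]]
y = [1, -1]
alpha = [0.5, 0.5]  # dual solution of this toy problem
b = 0.0

def g(x):
    # decision rule (4.9), evaluated from the dual variables only
    s = sum(alpha[i] * y[i] * dot(x, X[i]) for i in range(len(X))) + b
    return 1 if s > 0 else -1

# recover w from (4.6): w = sum_i alpha_i y_i x_i
w = [sum(alpha[i] * y[i] * X[i][j] for i in range(len(X))) for j in range(2)]
```

Note that the rule only depends on dot products between the new sample and the training samples, which is what makes the kernel substitution of Section 4.3 possible.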

The constraints assume that the data are linearly separable. For real applications, this might be too restrictive, and this problem is traditionally solved by considering soft margin constraints y_i(\langle w, x_i\rangle + b) \geq 1 - \xi_i, which allow some training errors during the training process, together with an upper bound of the empirical risk \sum_{i=1}^{n} \xi_i. The optimization problem changes slightly to:

\text{minimize} \quad \frac{\langle w, w\rangle}{2} + C\sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad y_i(\langle w, x_i\rangle + b) \geq 1 - \xi_i, \quad \forall i \in 1, \ldots, n,
\quad \xi_i \geq 0, \quad \forall i \in 1, \ldots, n.

C is a constant controlling the number of training errors. This optimization problem is solved using the Lagrangian:

L(w, b, \xi, \alpha, \beta) = \frac{\langle w, w\rangle}{2} + \sum_{i=1}^{n} \alpha_i\big(1 - \xi_i - y_i(\langle w, x_i\rangle + b)\big) - \sum_{i=1}^{n} \beta_i \xi_i + C\sum_{i=1}^{n} \xi_i.

Minimizing with respect to the primal variables and maximizing with respect to the dual variables leads to the so-called dual problem:


\max_{\alpha} \quad g(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle
\quad \text{subject to} \quad 0 \leq \alpha_i \leq C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0.

Ultimately, the only change from (4.8) is the upper bound C on the \alpha_i.

As for the introductory example, w can be written in terms of the x_i. The difference between the SVM and the "smallest distance to the mean" classifier is only in how \alpha and b are estimated; the decision function is the same. Considering the Karush-Kuhn-Tucker conditions at optimality Boyd and Vandenberghe (2006):

1 - \xi_i - y_i(\langle w, x_i\rangle + b) \leq 0
\alpha_i \geq 0
\alpha_i\big(1 - \xi_i - y_i(\langle w, x_i\rangle + b)\big) = 0
\xi_i \geq 0
\beta_i \geq 0
\beta_i \xi_i = 0
\partial L / \partial w = w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0
\partial L / \partial b = \sum_{i=1}^{n} \alpha_i y_i = 0
\partial L / \partial \xi_i = C - \alpha_i - \beta_i = 0 \quad (4.10)

it can be seen that the third condition requires that \alpha_i = 0 or 1 - \xi_i - y_i(\langle w, x_i\rangle + b) = 0. This means the solution \alpha is sparse and only some of the \alpha_i are non-zero. Thus w is supported by some of the training samples – those with non-zero optimal \alpha_i. These are called the support vectors. Figure 6.(b) shows an SVM decision function with its support vectors.

4.3 Non linear SVM

It is possible to extend the linear SVM to a non-linear SVM by switching the dot product to a kernel function:

\max_{\alpha} \quad g(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j k(x_i, x_j)
\quad \text{subject to} \quad 0 \leq \alpha_i \leq C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0.

Now, the SVM is a non-linear classifier in the input space \mathbb{R}^d, but is still linear in the feature space H.

Figure 7. Two SVM decision functions with different values of C (C = 1 and C = 100).

The decision function is simply:

f(x) = \mathrm{sgn}\Big( \sum_{i=1}^{n} \alpha_i y_i k(x, x_i) + b \Big).

An example of SVM classification on the toy data set with a Gaussian kernel is given in Figure 7.

4.4 Fitting the hyperparameters

Finding the optimum parameters (kernel hyperparameters and penalty term C) for the SVM is not a straightforward task. The values of the kernel parameters can have a considerable influence on the learning capacity. Fortunately, even though their influence is significant, their optimum values are not critical, i.e., there is a range of values for which the SVM performs equally well.

Cross-validation is usually used to select the hyperparameters. Cross-validation estimates the expected error when the method is applied to an independent set of samples. Typically, for a given set of hyperparameters, the training data are split into V subsets; the method is trained with V - 1 subsets and the prediction error is computed on the vth subset. The process is iterated for v = 1, \ldots, V, and the estimated expected error is the mean of all the prediction errors. In practice, V is generally set to 5 or 10. Cross-validation has shown good behavior in various supervised learning problems. However, its main drawback is its computational load.
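The V-fold procedure just described can be sketched as follows, in Python for illustration: shuffle, split into V folds, train on V - 1 folds, and average the prediction error on the held-out fold. The classifier here is a deliberately simple 1-nearest-neighbour rule, just to make the sketch runnable; the two toy clusters are hypothetical.

```python
import random

def kfold_indices(n, V, seed=0):
    # shuffle the indices, then deal them into V disjoint folds
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[v::V] for v in range(V)]

def one_nn_predict(train_X, train_y, x):
    d2 = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    i = min(range(len(train_X)), key=lambda j: d2(train_X[j], x))
    return train_y[i]

def cross_val_error(X, y, V=5):
    folds = kfold_indices(len(X), V)
    errors = []
    for v in range(V):
        test = set(folds[v])
        tr = [i for i in range(len(X)) if i not in test]
        err = sum(
            one_nn_predict([X[i] for i in tr], [y[i] for i in tr], X[j]) != y[j]
            for j in test
        ) / len(test)
        errors.append(err)
    return sum(errors) / V

# two well-separated toy clusters
X = [[i * 0.1, 0.0] for i in range(10)] + [[5 + i * 0.1, 0.0] for i in range(10)]
y = [-1] * 10 + [1] * 10
```

In the SVM setting, this loop would be run once per candidate pair of hyperparameters, keeping the pair with the lowest estimated error.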

4.5 Multiclass SVM

SVMs are designed to solve binary problems where the class labels can only take two values, e.g., \pm 1. For an astrophysics application, several classes are usually of interest. Various approaches have been proposed to address this problem. They usually combine a set of binary classifiers. Two main approaches were originally proposed for an m-class problem.

• One Versus the Rest: m binary classifiers are applied, one for each class against all the others. Each sample is assigned to the class with the maximum output.

• Pairwise Classification: m(m-1)/2 binary classifiers are applied, one for each pair of classes. Each sample is assigned to the class receiving the highest number of votes. A vote for a given class is defined as a classifier assigning the sample to that class.

Pairwise classification has proven to be more suitable for large problems. Even though the number of classifiers used is larger than for the one-versus-the-rest approach, the whole classification problem is decomposed into much simpler ones. Other strategies have been proposed within the remote-sensing community, such as hierarchical trees or global training. However, the classification accuracies were similar, or worse, and the complexity of the training process is increased.
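The pairwise (one-versus-one) voting scheme can be sketched as follows, in Python for illustration, with m = 3 hypothetical classes. Each binary classifier is a simple nearest-prototype rule between its two classes (standing in for a trained binary SVM); the final label is the class collecting the most votes.

```python
from itertools import combinations

# hypothetical class prototypes standing in for trained binary SVMs
prototypes = {1: [0.0, 0.0], 2: [5.0, 0.0], 3: [0.0, 5.0]}
classes = sorted(prototypes)

def d2(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def binary_clf(a, b):
    # returns the winning class among the pair (a, b) for a sample x
    return lambda x: a if d2(x, prototypes[a]) < d2(x, prototypes[b]) else b

pairs = list(combinations(classes, 2))   # m(m-1)/2 classifiers
clfs = {p: binary_clf(*p) for p in pairs}

def pairwise_predict(x):
    votes = {c: 0 for c in classes}
    for p in pairs:
        votes[clfs[p](x)] += 1           # one vote per binary classifier
    return max(classes, key=lambda c: votes[c])
```

For m = 3 this trains 3 binary classifiers; each one only sees a two-class problem, which is what makes the decomposition attractive for large data sets.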

5 SVM in practice

This section details, step by step, how to tune an SVM on a given data set.

5.1 Simulated data

To illustrate this section, we use a simulated data set generated by Sylvain Douté. It consists of hyperspectra x of surface reflectance over the planet Mars Bernard-Michel et al. (2009). A hyperspectrum is represented by a vector in which each variable contains the reflectance of the surface at a given wavelength. Each hyperspectrum corresponds to a different position on the Mars surface.

The model used for the simulation of the hyperspectra has 5 parameters: the grain size of water and CO2 ice, and the proportions of water, CO2 ice and dust. The model has been used to generate n samples x (x \in \mathbb{R}^{184} and n = 31500). Five classes were considered according to the grain size of water used for the simulation of the samples. The mean vectors of each class are given in Figure 8.

In this example, we are going to use the R package e1071, which relies on the C++ library libsvm, the state-of-the-art QP solver. The data set can be downloaded here: http://fauvel.mathieu.free.fr/data-used-for-stat4astro-labwork.html.

5.2 Load the data and extract a few training samples for learning the model

First we need to load the data set. It contains 6300 samples per class, and we are going to use only 100 samples for training; the remaining samples will be used

Figure 8. Mean spectra.

for validation. xt and xT denote the training samples and the validation samples, respectively. yt and yT denote their corresponding classes.

## Load some libraries
library("e1071")
library(foreach)
library(doParallel)
# Set the number of cores
registerDoParallel(4)
load("astrostat.RData")
n = nrow(x)
d = ncol(x)
C = max(y)
# Select "numberTrain" samples per class for training
numberTrain = 100
# The remaining samples are for validation
numberTest = 6300 - numberTrain
## Initialization of the training/validation sets
xt = matrix(0, numberTrain*C, d)
yt = matrix(0, numberTrain*C, 1)
xT = matrix(0, numberTest*C, d)
yT = matrix(0, numberTest*C, 1)
for (i in 1:C) {
    t = which(y == i)
    set.seed(i)
    ts = sample(t)  # Randomly permute the samples of class i
    xt[(1 + numberTrain*(i-1)):(numberTrain*i), ] = x[ts[1:numberTrain], ]
    yt[(1 + numberTrain*(i-1)):(numberTrain*i), ] = y[ts[1:numberTrain], ]
    xT[(1 + numberTest*(i-1)):(numberTest*i), ] = x[ts[(numberTrain+1):6300], ]
    yT[(1 + numberTest*(i-1)):(numberTest*i), ] = y[ts[(numberTrain+1):6300], ]
}

5.3 Estimate the optimal hyperparameters of the model

Now, we can select the best pair of hyperparameters for the model. We are going to use a Gaussian kernel, which has one parameter \gamma; there is also the penalty term C. Cross-validation is used to select the best values. For that, we simply test each pair of hyperparameters over a given range of values. The correct classification rate associated with a set of values is assessed by cross-validation. The optimal values are those corresponding to the highest estimated correct classification rate.

## Set the CV range of search (dependent on your data)
gamma = 2^(-8:0)
cost = 10^(-2:6)
CVT = matrix(0, nrow=length(cost), ncol=length(gamma))
## Serial version
TIME = Sys.time()
for (i in 1:length(cost)) {
    for (j in 1:length(gamma)) {
        set.seed(0)
        model = svm(xt, yt, cost=cost[i], gamma=gamma[j], type="C", cross=5)
        CVT[i, j] = model$tot.accuracy
    }
}
print(Sys.time() - TIME)
print(CVT)
## Get the position of the maximum
indices = which(CVT == max(CVT), arr.ind = TRUE)

5.4 Learn the model with the optimal parameters and predict the validation samples

Once the optimal values are selected, we can learn the model and predict the classes of the validation set. Then we compute the confusion matrix and the overall accuracy to assess the performance of the SVM.

## Learn the model
model = svm(xt, yt, cost=cost[indices[1, 1]], gamma=gamma[indices[1, 2]], type="C")
## Predict the validation samples
yp = predict(model, xT)
## Confusion matrix
confu = table(yT, yp)
OA = sum(diag(confu)) / sum(confu)

5.5 Speeding up the process?

CV can be very time consuming. Several options are available to reduce the processing time. One very easy way is to perform the CV in parallel using the R packages foreach and doParallel.

TIME = Sys.time()
CV = foreach(i = 1:length(cost), .combine = rbind) %dopar% {
    temp = matrix(0, nrow=length(gamma), ncol=1)
    for (j in 1:length(gamma)) {
        set.seed(0)
        model = svm(xt, yt, cost=cost[i], gamma=gamma[j], type="C", cross=5)
        temp[j] = model$tot.accuracy
    }
    return(c(i, temp))
}
print(Sys.time() - TIME)
print(CV)


5.6 Results

For a given run, the CVT table looks as follows (for the ranges of values of \gamma and C, see the R code):

           γ = 2^-8  2^-7  2^-6  2^-5  2^-4  2^-3  2^-2  2^-1  2^0
C = 10^-2      17.8  18.0  18.2  18.4  18.4  18.8  18.8  19.0  18.8
C = 10^-1      21.4  23.2  30.2  35.8  38.4  41.8  44.6  46.2  45.2
C = 10^0       36.6  40.6  46.2  55.4  61.4  69.2  71.0  72.0  70.8
C = 10^1       56.6  65.2  75.0  79.2  82.6  81.8  83.2  81.2  78.0
C = 10^2       81.8  85.2  86.4  86.6  86.0  84.6  83.6  80.8  78.2
C = 10^3       86.4  86.4  85.8  85.8  85.6  84.6  83.6  80.8  78.2
C = 10^4       85.8  86.2  85.4  85.8  85.6  84.6  83.6  80.8  78.2
C = 10^5       85.8  86.2  85.4  85.8  85.6  84.6  83.6  80.8  78.2
C = 10^6       85.8  86.2  85.4  85.8  85.6  84.6  83.6  80.8  78.2

The set of optimal hyperparameters is C = 100 and \gamma = 0.0009765625, for a correct classification rate estimated at 86.6. The SVM is then learned with the optimal hyperparameters. The resulting model has 217 support vectors (obtained with the command model$tot.nSV). This means that approximately half of the training samples have been used to build the decision function. The correct classification rate computed on the validation set is 0.881096. In comparison, for the same setting, a Gaussian Mixture Model with a diagonal covariance matrix assumption reaches a correct classification rate of 0.5300645.


n = 100                              # Number of points
X = matrix(0, nrow=n, ncol=2)        # Data generation
set.seed(0)
X[1:(n/2), 1] = rnorm(n/2, -1, 0.5)  # Two Gaussian clouds
X[1:(n/2), 2] = rnorm(n/2, -1, 0.5)
X[(n/2+1):n, 2] = rnorm(n/2, 1, 0.5)
X[(n/2+1):n, 1] = rnorm(n/2, 1, 0.5)
Y = c(rep(-1, n/2), rep(1, n/2))     # Two classes, +1 and -1
plot(X[Y==-1, 1], X[Y==-1, 2], pch=1, xlim=c(-2, 2), ylim=c(-2, 2))
points(X[Y==1, 1], X[Y==1, 2], pch=2)
legend("topleft", c("-1", "1"), pch=c(1, 2))
dev.print(pdf, 'data_toy.eps')
# Compute the Gram matrix of all dot products for class 1
Kp = X[Y==1, ] %*% t(X[Y==1, ])
# Compute the Gram matrix of all dot products for class -1
Kn = X[Y==-1, ] %*% t(X[Y==-1, ])
# Compute the parameters of the classifier
b = -sum(Kp)/(ncol(Kp)*ncol(Kp)) + sum(Kn)/(ncol(Kn)*ncol(Kn))
alpha = c(Y[Y==1]/ncol(Kp), -Y[Y==-1]/ncol(Kn))
w = c(0, 0)
for (i in 1:n) {
    w = w + alpha[i]*Y[i]*X[i, ]
}
# Compute two points of the decision function for plotting
a = -w[1]/w[2]
c = b/w[2]/2
p = matrix(0, nrow=2, ncol=2)
p[1, 1] = -2
p[1, 2] = p[1, 1]*a + c
p[2, 1] = 2
p[2, 2] = p[2, 1]*a + c
segments(p[1, 1], p[1, 2], p[2, 1], p[2, 2])
dev.print(pdf, 'data_toy_sep.eps')
dev.off(dev.list())

Figure 9. Code for the linear example.


n = 100
set.seed(0)
rho = 1.5 + rnorm(n/2, 0, 0.2)
theta = 2*pi*runif(n/2)
X = matrix(0, nrow=n, ncol=2)
X[1:(n/2), 1] = rnorm(n/2, 0, 0.25)
X[1:(n/2), 2] = rnorm(n/2, 0, 0.25)
X[(n/2+1):n, 1] = rho*cos(theta)
X[(n/2+1):n, 2] = rho*sin(theta)
Y = c(rep(-1, n/2), rep(1, n/2))
pdf("data_toy_nl.eps")
plot(X[Y==-1, 1], X[Y==-1, 2], pch=1, xlim=c(-2, 2), ylim=c(-2, 2))
points(X[Y==1, 1], X[Y==1, 2], pch=2)
legend("topleft", c("-1", "1"), pch=c(1, 2))
dev.off(dev.list())
Z = matrix(0, nrow=n, ncol=3)
Z[, 1] = X[, 1]^2
Z[, 2] = X[, 2]^2
Z[, 3] = X[, 1]*X[, 2]
pdf("data_toy_nl_3D.eps")
plot(Z[Y==-1, 1], Z[Y==-1, 2], pch=1, xlim=c(0, 2), ylim=c(0, 2))
points(Z[Y==1, 1], Z[Y==1, 2], pch=2)
legend("topleft", c("-1", "1"), pch=c(1, 2))
dev.off(dev.list())
# Apply the decision function using the kernel
# Compute the kernel matrix of all dot products for class 1
Kp = (X[Y==1, ] %*% t(X[Y==1, ]))^2
# Compute the kernel matrix of all dot products for class -1
Kn = (X[Y==-1, ] %*% t(X[Y==-1, ]))^2
# Compute the parameters of the classifier
b = -sum(Kp)/(ncol(Kp)*ncol(Kp)) + sum(Kn)/(ncol(Kn)*ncol(Kn))
alpha = c(-Y[Y==-1]/ncol(Kp), Y[Y==1]/ncol(Kn))
alphay = alpha*Y
xx = seq(-2, 2, length=100)
yy = seq(-2, 2, length=100)
XT = as.matrix(expand.grid(xx, yy))
K = (XT %*% t(X))^2
F = K %*% alphay + b/2
mylevels = seq(-0, 0, 5)
Fr = matrix(F, ncol=100, nrow=100)
pdf("data_toy_nl_2D.eps")
contour(xx, yy, Fr, levels=mylevels, xaxs='i', yaxs='i')
points(X[Y==-1, 1], X[Y==-1, 2], pch=1, xlim=c(-2, 2), ylim=c(-2, 2))
points(X[Y==1, 1], X[Y==1, 2], pch=2)
dev.off(dev.list())


library(e1071)

# Generate random data
n = 100
set.seed(0)
rho = 1.5 + rnorm(n/2, 0, 0.2)
theta = 2*pi*runif(n/2)
mylevels = seq(-0, 0, 5)
X = matrix(0, nrow=n, ncol=2)
X[1:(n/2), 1] = rnorm(n/2, 0, 0.25)
X[1:(n/2), 2] = rnorm(n/2, 0, 0.25)
X[(n/2+1):n, 1] = rho*cos(theta)
X[(n/2+1):n, 2] = rho*sin(theta)
Y = c(rep(-1, n/2), rep(1, n/2))

# Grid
xx = seq(-2, 2, length=100)
yy = seq(-2, 2, length=100)
XT = as.matrix(expand.grid(xx, yy))

# SVM with C = 1
model = svm(X, Y, cost=1, gamma=0.5, type="C-classification")
pred = predict(model, XT, decision.values=TRUE)
df = matrix(attr(pred, "decision.values"), ncol=100, nrow=100)
pdf("svm_c_1.eps")
contour(xx, yy, df, levels=mylevels, xaxs='i', yaxs='i')
points(X[Y==-1, 1], X[Y==-1, 2], pch=1, xlim=c(-2, 2), ylim=c(-2, 2))
points(X[Y==1, 1], X[Y==1, 2], pch=2)
dev.off(dev.list())

# SVM with C = 100
model = svm(X, Y, cost=100, gamma=0.5, type="C-classification")
pred = predict(model, XT, decision.values=TRUE)
df = matrix(attr(pred, "decision.values"), ncol=100, nrow=100)
pdf("svm_c_100.eps")
contour(xx, yy, df, levels=mylevels, xaxs='i', yaxs='i')
points(X[Y==-1, 1], X[Y==-1, 2], pch=1, xlim=c(-2, 2), ylim=c(-2, 2))
points(X[Y==1, 1], X[Y==1, 2], pch=2)
dev.off(dev.list())
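The two calls to svm() differ only in cost, the parameter C of the soft-margin SVM of Section 4. As a reminder, in e1071's "C-classification" with the default radial kernel, the problem solved is (standard formulation, written here in generic notation):

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;
  \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_{i}
\quad \text{s.t.} \quad
  y_{i}\bigl( \langle \mathbf{w}, \phi(\mathbf{x}_{i}) \rangle + b \bigr) \ge 1 - \xi_{i},
  \qquad \xi_{i} \ge 0,
\qquad
k(\mathbf{x}, \mathbf{x}') = \exp\bigl( -\gamma \lVert \mathbf{x} - \mathbf{x}' \rVert^{2} \bigr),
\ \gamma = 0.5 .
```

A small C favours a wide margin at the price of some training errors, while a large C (here 100) fits the training samples more tightly; the resulting contour plots svm_c_1.eps and svm_c_100.eps illustrate this trade-off (Figure 7).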


# Generate random data
n = 100
set.seed(0)
rho = 1.5 + rnorm(n/2, 0, 0.2)
theta = 2*pi*runif(n/2)
mylevels = seq(-0, 0, 5)
# Class -1 is a 2D Gaussian and class 1 is uniformly distributed on a ring
X = matrix(0, nrow=n, ncol=2)
X[1:(n/2), 1] = rnorm(n/2, 0, 0.25)
X[1:(n/2), 2] = rnorm(n/2, 0, 0.25)
X[(n/2+1):n, 1] = rho*cos(theta)
X[(n/2+1):n, 2] = rho*sin(theta)
Y = c(rep(-1, n/2), rep(1, n/2))

# Generate data in the grid [-2,2]
xx = seq(-2, 2, length=100)
yy = seq(-2, 2, length=100)
XT = as.matrix(expand.grid(xx, yy))

# Compute the distance matrix
D = matrix(0, nrow(XT), nrow(X))
normXT = rowSums(XT*XT)
normX = rowSums(X*X)
# for (i in 1:n)
# {
#   D[, i] = normXT + normX[i] - 2*(XT %*% X[i, ])
# }
# Same as the previous lines, but faster
D = outer(normXT, normX, '+') - 2*(XT %*% t(X))

# K-NN
indices = apply(D, 1, which.min)
decision = matrix(0, nrow(XT), 1)
for (i in 1:nrow(XT))
{
  decision[i] = Y[indices[i]]
}

# Plot data
yp = matrix(decision, ncol=100, nrow=100)
pdf("knn.eps")
contour(xx, yy, yp, levels=mylevels, xaxs='i', yaxs='i')
points(X[Y==-1, 1], X[Y==-1, 2], pch=1, xlim=c(-2, 2), ylim=c(-2, 2))
points(X[Y==1, 1], X[Y==1, 2], pch=2)
dev.off(dev.list())

# Kernel K-NN
p = 2
# K = matrix(0, nrow(XT), nrow(X))
# for (i in 1:nrow(XT))
# {
#   for (j in 1:n)
#   {
#     K[i, j] = (XT[i, 1]^2 + XT[i, 2]^2)^p + (X[j, 1]^2 + X[j, 2]^2)^p
#               - 2*(XT[i, 1]*X[j, 1] + XT[i, 2]*X[j, 2])^p
#   }
# }
# Same as the lines above, but faster
K = outer(normXT^2, normX^2, "+") - 2*(XT %*% t(X))^2
indices = apply(K, 1, which.min)

decision = matrix(0, nrow(XT), 1)
for (i in 1:nrow(XT))
{
  decision[i] = Y[indices[i]]
}
yp = matrix(decision, ncol=100, nrow=100)
pdf("kknn.eps")
contour(xx, yy, yp, levels=mylevels, xaxs='i', yaxs='i')
points(X[Y==-1, 1], X[Y==-1, 2], pch=1, xlim=c(-2, 2), ylim=c(-2, 2))
points(X[Y==1, 1], X[Y==1, 2], pch=2)
dev.off(dev.list())
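The one-line computations of D and K work because the squared distance used by the nearest-neighbour rule can be expressed with kernel evaluations only, so kernelizing K-NN amounts to replacing dot products by k(·,·), as in the kernel trick of the introduction. A sketch of the identity, for k(x, x') = ⟨x, x'⟩^p with p = 2:

```latex
\lVert \phi(\mathbf{x}) - \phi(\mathbf{x}_{i}) \rVert^{2}
  = k(\mathbf{x}, \mathbf{x}) - 2\, k(\mathbf{x}, \mathbf{x}_{i}) + k(\mathbf{x}_{i}, \mathbf{x}_{i})
  = \lVert \mathbf{x} \rVert^{4} - 2\, \langle \mathbf{x}, \mathbf{x}_{i} \rangle^{2}
    + \lVert \mathbf{x}_{i} \rVert^{4},
```

which is exactly outer(normXT^2, normX^2, "+") - 2*(XT %*% t(X))^2, since normXT and normX hold the squared Euclidean norms. For p = 1 the same identity reduces to the ordinary K-NN distance matrix D.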


References

Bernard-Michel, C., Douté, S., Fauvel, M., Gardes, L., and Girard, S.: 2009, Journal of Geophysical Research: Planets (1991–2012) 114(E6)

Boyd, S. and Vandenberghe, L.: 2006, Convex Optimization, Cambridge University Press

Camps-Valls, G. and Bruzzone, L. (eds.): 2009, Kernel Methods for Remote Sensing Data Analysis, John Wiley & Sons, Ltd

Fauvel, M., Chanussot, J., and Benediktsson, J. A.: 2012, Pattern Recognition 45(1), 381

Hastie, T. J., Tibshirani, R. J., and Friedman, J. H.: 2009, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics, Springer, New York

Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., and Schölkopf, B.: 2001, IEEE Transactions on Neural Networks 12(2), 181

Schölkopf, B.: 2000, Statistical learning and kernel methods

Smola, A., Bartlett, P. L., Schölkopf, B., and Schuurmans, D.: 2000, Advances in Large Margin Classifiers, MIT Press

Vapnik, V.: 1998, Statistical Learning Theory, Wiley, New York

Vapnik, V.: 1999, The Nature of Statistical Learning Theory, Second Edition, Springer, New York

Figures

Figure 1. Toy example - (a) Linear case and (b) Separating hyperplane.
Figure 2. Toy example - Non-linear case.
Figure 3. Toy example - Non-linear case: (a) Feature space and (b) Separating function.
Figure 4. Values of kernel functions: (a) Polynomial kernel values for p = 2, q = 0 and x = [1, 1]; (b) Gaussian kernel values for σ = 2 and x = [1, 1].
Figure 5. (a) KNN classification and (b) Kernel KNN classification with a polynomial kernel of order 2.
Figure 6. (a) Separating hyperplanes and (b) SVM separating hyperplane. Squares are the support vectors.
Figure 7. Two SVM decision functions with different values of C.
Figure 8. Mean spectra.
Figure 9. Code for the linear example.
Figure 10. Code for the non-linear example.
Figure 11. Code for the toy SVM.
Figure 12. Kernel KNN codes.