## Open Archive TOULOUSE Archive Ouverte (OATAO)

OATAO is an open access repository that collects the work of Toulouse researchers and makes it freely available over the web where possible.

This is an author-deposited version published in: http://oatao.univ-toulouse.fr/
Eprints ID: 17710
To link to this article: DOI: 10.1051/eas/1677008
URL: http://dx.doi.org/10.1051/eas/1677008
To cite this version: Fauvel, Mathieu. Introduction to the Kernel Methods. (2016) EAS-journal, vol. 77, pp. 171-193. ISSN 1633-4760

Any correspondence concerning this service should be sent to the repository administrator.

**CLASSIFICATION OF MULTIVARIATE DATA**

### Mathieu Fauvel

**Abstract.** In this chapter, kernel methods are presented for the classification of multivariate data. An introductory example is given to illustrate the main idea of kernel methods. Then, emphasis is placed on the Support Vector Machine: structural risk minimization is presented, and linear and non-linear SVMs are described. Finally, a full example of SVM classification is given on simulated hyperspectral data.

Kernel methods are now a standard tool in the statistical learning area. These methods are very generic and can be applied to many data sets where relationships between numbers, strings or graphs need to be analyzed. In this chapter, we provide an introduction to the theory of kernel methods for the classification of multivariate data. In the first section, an introductory example is treated to illustrate the *kernel trick* concept, i.e., how to transform a linear algorithm into a non-linear one. Then, in Section 2, kernels are defined and some of their properties are reviewed. The introductory example is continued in Section 3, where the kernel trick is applied to the conventional K-NN. Section 4 provides a detailed introduction to Support Vector Machines, the most widely used kernel method. A case study is given in Section 5, and the last section provides the R code used to illustrate this chapter.

**1 Introductory example**

This section is based on Schölkopf (2000).

*1.1 Linear case*

Suppose we want to classify the following data,
\[
S = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}, \quad (\mathbf{x}_i, y_i) \in \mathbb{R}^2 \times \{\pm 1\},
\]


**Figure 1.** Toy example - (a) Linear case and (b) Separating hyperplane.

that are distributed as in Figure 1.(a), where $\mathbf{x}_i$ is a sample and $y_i$ is its corresponding class.

Let us start with a simple classifier: it assigns a new sample $\mathbf{x}$ to the class whose mean is closer to $\mathbf{x}$. Denoting by $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_{-1}$ the mean vectors of classes 1 and $-1$, respectively, the decision rule $f(\mathbf{x})$ is defined as
\[
f(\mathbf{x}) = \operatorname{sgn}\big( \|\boldsymbol{\mu}_{-1} - \mathbf{x}\|^2 - \|\boldsymbol{\mu}_{1} - \mathbf{x}\|^2 \big). \tag{1.1}
\]

Equation (1.1) can be written in the following way:
\[
f(\mathbf{x}) = \operatorname{sgn}\big( \langle \mathbf{w}, \mathbf{x} \rangle + b \big),
\]
where $\mathbf{w}$ and $b$ can be found as follows ($m_1$ and $m_{-1}$ are the numbers of samples of classes 1 and $-1$, respectively):
\[
\|\boldsymbol{\mu}_1 - \mathbf{x}\|^2 = \Big\langle \frac{1}{m_1}\sum_{i=1}^{m_1}\mathbf{x}_i - \mathbf{x},\; \frac{1}{m_1}\sum_{i=1}^{m_1}\mathbf{x}_i - \mathbf{x} \Big\rangle = \frac{1}{m_1^2}\sum_{i,k=1}^{m_1} \langle \mathbf{x}_i, \mathbf{x}_k \rangle + \langle \mathbf{x}, \mathbf{x} \rangle - \frac{2}{m_1}\sum_{i=1}^{m_1} \langle \mathbf{x}_i, \mathbf{x} \rangle, \tag{1.2}
\]
\[
\|\boldsymbol{\mu}_{-1} - \mathbf{x}\|^2 = \frac{1}{m_{-1}^2}\sum_{j,l=1}^{m_{-1}} \langle \mathbf{x}_j, \mathbf{x}_l \rangle + \langle \mathbf{x}, \mathbf{x} \rangle - \frac{2}{m_{-1}}\sum_{j=1}^{m_{-1}} \langle \mathbf{x}_j, \mathbf{x} \rangle. \tag{1.3}
\]

**Figure 2.** Toy example - Non linear case.

Plugging (1.2) and (1.3) into (1.1) leads to
\[
\langle \mathbf{w}, \mathbf{x} \rangle + b = \Big\langle 2\Big(\frac{1}{m_1}\sum_{i=1}^{m_1}\mathbf{x}_i - \frac{1}{m_{-1}}\sum_{j=1}^{m_{-1}}\mathbf{x}_j\Big), \mathbf{x} \Big\rangle - \frac{1}{m_1^2}\sum_{i,k=1}^{m_1}\langle\mathbf{x}_i,\mathbf{x}_k\rangle + \frac{1}{m_{-1}^2}\sum_{j,l=1}^{m_{-1}}\langle\mathbf{x}_j,\mathbf{x}_l\rangle.
\]
Then we have

\[
\mathbf{w} = 2\sum_{i=1}^{m_1+m_{-1}} \alpha_i y_i \mathbf{x}_i, \quad \text{with } \alpha_i = \frac{1}{m_1} \text{ if } y_i = 1 \text{ and } \alpha_i = \frac{1}{m_{-1}} \text{ if } y_i = -1, \tag{1.4}
\]
\[
b = -\frac{1}{m_1^2}\sum_{i,k=1}^{m_1}\langle\mathbf{x}_i,\mathbf{x}_k\rangle + \frac{1}{m_{-1}^2}\sum_{j,l=1}^{m_{-1}}\langle\mathbf{x}_j,\mathbf{x}_l\rangle. \tag{1.5}
\]

So, for this naive classifier, the parameters can be computed explicitly from the training samples. The separating hyperplane is plotted in Figure 1.(b) using the code given in Figure 9.

*1.2 Non linear case*

Now the data are distributed a bit differently, as shown in Figure2. The previous classifier is not adapted to this situation. However, if we can find a feature space where the data are linearly separable, it can still be applied.

1. Projection in the polar domain:
\[
\varphi : \mathbb{R}^2 \to \mathbb{R}^2, \quad \mathbf{x} \mapsto \varphi(\mathbf{x}), \quad (x_1, x_2) \mapsto (\rho, \theta),
\]
with $\rho = x_1^2 + x_2^2$ and $\theta = \arctan(x_2/x_1)$.

2. Projection in the space of monomials of order 2:
\[
\phi : \mathbb{R}^2 \to \mathbb{R}^3, \quad \mathbf{x} \mapsto \phi(\mathbf{x}), \quad (x_1, x_2) \mapsto (x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2).
\]
For the feature space associated with monomials of order 2, the projected samples are now distributed as shown in Figure 3.(a) (only variables $x_1^2$ and $x_2^2$ are displayed). In the following, this transformation is considered to derive the decision rule in the feature space.

In $\mathbb{R}^3$, the inner product can be expressed as
\[
\begin{aligned}
\langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle_{\mathbb{R}^3} &= \sum_{i=1}^{3} \phi(\mathbf{x})_i\, \phi(\mathbf{x}')_i \\
&= \phi(\mathbf{x})_1\phi(\mathbf{x}')_1 + \phi(\mathbf{x})_2\phi(\mathbf{x}')_2 + \phi(\mathbf{x})_3\phi(\mathbf{x}')_3 \\
&= x_1^2 x_1'^2 + x_2^2 x_2'^2 + 2\, x_1 x_2 x_1' x_2' \\
&= (x_1 x_1' + x_2 x_2')^2 \\
&= \langle \mathbf{x}, \mathbf{x}' \rangle_{\mathbb{R}^2}^2 = k(\mathbf{x}, \mathbf{x}'),
\end{aligned}
\]

where $k(\cdot, \cdot)$ is the square of the dot product in $\mathbb{R}^2$. Therefore, we can write the parameters of the separating hyperplane in the feature space:
\[
\phi(\mathbf{w}) = 2\sum_{i=1}^{m_1+m_{-1}} \alpha_i y_i \phi(\mathbf{x}_i), \qquad
b = -\frac{1}{m_1^2}\sum_{i,p=1}^{m_1} k(\mathbf{x}_i, \mathbf{x}_p) + \frac{1}{m_{-1}^2}\sum_{j,l=1}^{m_{-1}} k(\mathbf{x}_j, \mathbf{x}_l).
\]

The decision rule can be written in the input space thanks to the function $k$:
\[
f(\mathbf{x}) = \operatorname{sgn}\Big( 2\sum_{i=1}^{m_1+m_{-1}} \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) - \frac{1}{m_1^2}\sum_{i,p=1}^{m_1} k(\mathbf{x}_i, \mathbf{x}_p) + \frac{1}{m_{-1}^2}\sum_{j,l=1}^{m_{-1}} k(\mathbf{x}_j, \mathbf{x}_l) \Big).
\]


**Figure 3.** Toy example - Non linear case: (a) Feature space and (b) Separating function.

*1.3 Conclusion*

With the two previous examples, we have seen that a linear algorithm can be turned into a non-linear one, simply by exchanging the dot product for an appropriate function. This function has to be equivalent to a dot product in a feature space. It is called a *kernel function*, or just *kernel*. In the next section, some properties of kernels are reviewed.
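This equivalence can be checked numerically; a minimal sketch in R, using the monomial map of Section 1.2:

```r
# Check that the squared dot product in the input space R^2 equals the
# dot product in the feature space of monomials of order 2.
phi <- function(x) c(x[1]^2, x[2]^2, sqrt(2) * x[1] * x[2])

set.seed(1)
x  <- rnorm(2)
xp <- rnorm(2)

k_input   <- sum(x * xp)^2          # k(x, x') = <x, x'>^2
k_feature <- sum(phi(x) * phi(xp))  # <phi(x), phi(x')>

all.equal(k_input, k_feature)       # TRUE, up to rounding
```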

**2 Kernel Function**

*2.1 Positive semi-definite kernels*

This section is based on Camps-Valls and Bruzzone (2009).

**Definition 1 (Positive semi-definite kernel).** *A function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is positive semi-definite if*
• $\forall (\mathbf{x}, \mathbf{x}') \in \mathbb{R}^d \times \mathbb{R}^d$, $k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}', \mathbf{x})$;
• $\forall n \in \mathbb{N}$, $\forall \xi_1, \ldots, \xi_n \in \mathbb{R}$, $\forall \mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d$, $\sum_{i,j=1}^{n} \xi_i \xi_j k(\mathbf{x}_i, \mathbf{x}_j) \ge 0$.

**Definition 2 (Gram matrix).** *Given a kernel $k$ and samples $\mathbf{x}_1, \ldots, \mathbf{x}_n$, the Gram matrix $\mathbf{K}$ is the $n \times n$ symmetric matrix with entries $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$.*

**Theorem 1 (Moore-Aronszajn (1950)).** *For every positive semi-definite kernel $k$, there exists a Hilbert space $\mathcal{H}$ and a feature map $\phi : \mathbb{R}^d \to \mathcal{H}$ such that for all $(\mathbf{x}, \mathbf{x}') \in \mathbb{R}^d \times \mathbb{R}^d$ we have $k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle_{\mathcal{H}}$.*

Hence, to turn a linear algorithm into a non-linear one, the kernel has to be positive semi-definite, since there then always exists a feature space $\mathcal{H}$ in which the dot product is equivalent to the kernel evaluation. Note that in practice it is not necessary to know the feature map and the feature space associated with $k$: one just has to verify that $k$ is positive semi-definite.

A kernel is usually seen as a measure of similarity between two samples: it reflects, in some sense, how similar two samples are. Depending on the kernel used, this similarity can be related to geometrical or statistical properties. In practice, it is possible to define kernels using some a priori knowledge about the data.

The construction of a kernel adapted to a given task might be difficult. Fortunately, it is possible to combine kernels easily. Let $k_1$ and $k_2$ be positive semi-definite functions; then:
• $\lambda_1 k_1 + \lambda_2 k_2$ is positive semi-definite, for $\lambda_1, \lambda_2 \ge 0$;
• $k_1 k_2$ is positive semi-definite;
• $\exp(k_1)$ is positive semi-definite;
• $g(\mathbf{x})\, g(\mathbf{x}')$ is positive semi-definite, with $g : \mathbb{R}^d \to \mathbb{R}$.
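These closure properties can be checked empirically on a Gram matrix, whose eigenvalues must all be non-negative for a positive semi-definite kernel. A minimal sketch in R (the kernel choices here are only illustrations):

```r
set.seed(0)
X <- matrix(rnorm(20), ncol = 2)   # 10 samples in R^2

K1 <- X %*% t(X)                   # Gram matrix of the linear kernel
K2 <- (X %*% t(X) + 1)^2           # polynomial kernel, p = 2, q = 1

# Smallest eigenvalue of a symmetric matrix
min_eig <- function(K) min(eigen(K, symmetric = TRUE, only.values = TRUE)$values)

# Sum, element-wise product and exponential of kernels remain PSD
min_eig(2 * K1 + 3 * K2) > -1e-6   # TRUE
min_eig(K1 * K2)         > -1e-6   # TRUE (Schur product theorem)
min_eig(exp(K1))         > -1e-6   # TRUE
```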

*2.2 Some kernels*

**Definition 3 (Polynomial kernel).** *The polynomial kernel of order $p$ and bias $q$ is defined as*
\[
k(\mathbf{x}_i, \mathbf{x}_j) = (\langle \mathbf{x}_i, \mathbf{x}_j \rangle + q)^p = \sum_{l=0}^{p} \binom{p}{l} q^{p-l} \langle \mathbf{x}_i, \mathbf{x}_j \rangle^{l}.
\]
It corresponds to the feature space of monomials up to degree $p$. Depending on whether $q \gtrless 1$, the relative weight of the higher order monomials is increased or decreased.
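The binomial expansion in Definition 3 can be verified numerically; a minimal sketch in R, with arbitrary values $p = 3$ and $q = 2$:

```r
p <- 3; q <- 2
set.seed(2)
xi <- rnorm(4); xj <- rnorm(4)
s  <- sum(xi * xj)                                  # <x_i, x_j>

lhs <- (s + q)^p                                    # kernel evaluation
rhs <- sum(choose(p, 0:p) * q^(p - 0:p) * s^(0:p))  # binomial expansion

all.equal(lhs, rhs)                                 # TRUE
```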

**Definition 4 (Gaussian kernel).** *The Gaussian kernel with parameter $\sigma$ is defined as*
\[
k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left( -\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2} \right).
\]

More generally, any distance can be used in the exponential. For instance, the spectral angle is a valid distance:
\[
\Theta(\mathbf{x}_i, \mathbf{x}_j) = \arccos\left( \frac{\langle \mathbf{x}_i, \mathbf{x}_j \rangle}{\|\mathbf{x}_i\|\, \|\mathbf{x}_j\|} \right).
\]
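Both kernels can be written in a few lines of R; a minimal sketch (the function names are ours, not from any package):

```r
# Gaussian kernel with the Euclidean distance
gauss_kernel <- function(xi, xj, sigma = 2) {
  exp(-sum((xi - xj)^2) / (2 * sigma^2))
}

# Gaussian-type kernel built on the spectral angle
sa_kernel <- function(xi, xj, sigma = 2) {
  cosang <- sum(xi * xj) / (sqrt(sum(xi^2)) * sqrt(sum(xj^2)))
  theta  <- acos(pmin(1, pmax(-1, cosang)))  # clamp against rounding
  exp(-theta^2 / (2 * sigma^2))
}

xi <- c(1, 1); xj <- c(2, 2)
gauss_kernel(xi, xj)  # < 1: the samples are geometrically apart
sa_kernel(xi, xj)     # = 1: colinear spectra have a zero spectral angle
```

Note that the spectral angle is invariant to a rescaling of the spectra, which is why it can behave differently from the Euclidean distance on reflectance data.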

Figure 4 shows the kernel evaluations over $[-2, 2] \times [-2, 2]$ for the polynomial kernel and the Gaussian kernel. The Gaussian kernel is said to be *local*: the value of the kernel is related to the geometrical distance between two samples in the input space. In Figure 4, samples far from the black point have low kernel values, while samples close to the black point have high values. In addition, the Gaussian kernel is *isotropic*, i.e., it does not depend on the orientation of the samples in the feature space. The polynomial kernel is said to be *global*: two samples with a high Euclidean distance in the input space can have a high kernel value.


**Figure 4.** *Values of kernel functions: (a) Polynomial kernel values for $p = 2$, $q = 0$ and $\mathbf{x} = [1, 1]$, and (b) Gaussian kernel values for $\sigma = 2$ and $\mathbf{x} = [1, 1]$.*

*2.3 Kernels on images*

Using the summation properties, it is possible to build kernels that include information from the spatial domain (Fauvel et al. 2012). If one can extract, for instance, the local correlation, or any other local pixel-based descriptor, it is possible to build a kernel that combines these two types of information. The spatial position of the pixel can also be used. A general form for such a kernel is:
\[
k(\mathbf{x}_i, \mathbf{x}_j) = \lambda\, k_{\mathrm{spatial}}(\mathbf{x}_i, \mathbf{x}_j) + (1 - \lambda)\, k_{\mathrm{spectral}}(\mathbf{x}_i, \mathbf{x}_j).
\]
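A minimal sketch of such a composite kernel in R, where the spatial and spectral features are hypothetical stand-ins (any local pixel-based descriptor could be used):

```r
# Gaussian Gram matrix from a feature matrix (one sample per row)
gauss_gram <- function(A, sigma = 2) {
  D2 <- as.matrix(dist(A))^2
  exp(-D2 / (2 * sigma^2))
}

# Convex combination of a spatial and a spectral Gaussian kernel
composite_kernel <- function(X_spectral, X_spatial, lambda = 0.5, sigma = 2) {
  lambda * gauss_gram(X_spatial, sigma) +
    (1 - lambda) * gauss_gram(X_spectral, sigma)
}

set.seed(0)
Xspec <- matrix(rnorm(40), ncol = 4)  # 10 pixels, 4 spectral bands
Xspat <- matrix(rnorm(20), ncol = 2)  # 10 pixels, 2 spatial descriptors
K <- composite_kernel(Xspec, Xspat, lambda = 0.3)
dim(K)  # 10 x 10 Gram matrix, still positive semi-definite
```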

**3 Introductory example continued: Kernel K-NN**

The K-NN decision rule is based on the distance between two samples (Hastie et al. 2009, Chapter 13.3). In the feature space, the distance can be computed as $\|\phi(\mathbf{x}_i) - \phi(\mathbf{x}_j)\|^2_{\mathcal{H}}$. As usual, it can be written in terms of kernel evaluations:
\[
\begin{aligned}
\|\phi(\mathbf{x}_i) - \phi(\mathbf{x}_j)\|^2_{\mathcal{H}} &= \langle \phi(\mathbf{x}_i) - \phi(\mathbf{x}_j),\; \phi(\mathbf{x}_i) - \phi(\mathbf{x}_j) \rangle_{\mathcal{H}} \\
&= \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_i) \rangle_{\mathcal{H}} + \langle \phi(\mathbf{x}_j), \phi(\mathbf{x}_j) \rangle_{\mathcal{H}} - 2\, \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle_{\mathcal{H}} \\
&= k(\mathbf{x}_i, \mathbf{x}_i) + k(\mathbf{x}_j, \mathbf{x}_j) - 2\, k(\mathbf{x}_i, \mathbf{x}_j).
\end{aligned}
\]

With the Gaussian kernel, the norm of each sample in the feature space is one, i.e., $k(\mathbf{x}_i, \mathbf{x}_i) = 1$. Hence, we have the following distance function:
\[
\|\phi(\mathbf{x}_i) - \phi(\mathbf{x}_j)\|^2_{\mathcal{H}} = 2\big(1 - k(\mathbf{x}_i, \mathbf{x}_j)\big).
\]


**Figure 5.** (a) KNN classification and (b) Kernel KNN classification with a polynomial kernel of order 2.

Note that, since $\exp$ is a monotone function, kernel K-NN with a Gaussian kernel is exactly the same as conventional K-NN. With other kernels, however, the results will differ. Results on the toy data set, obtained with a polynomial kernel of order 2, are given in Figure 5 using the code in Figure 12.
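A minimal sketch of kernel K-NN in R, computing the feature-space distances through kernel evaluations only (the data generation mimics the toy set of Figure 2; the helper names are ours):

```r
kernel_knn <- function(Xtrain, ytrain, Xtest, k_fun, K = 3) {
  apply(Xtest, 1, function(x) {
    # Squared distance in the feature space, via the kernel identity above
    d2 <- apply(Xtrain, 1, function(xi) {
      k_fun(x, x) + k_fun(xi, xi) - 2 * k_fun(x, xi)
    })
    nn <- order(d2)[1:K]                              # K nearest neighbours
    as.numeric(names(which.max(table(ytrain[nn]))))   # majority vote
  })
}

poly2 <- function(x, y) sum(x * y)^2   # polynomial kernel of order 2

# Ring/center toy data: not linearly separable in the input space
set.seed(0)
Xtr <- rbind(matrix(rnorm(40, 0, 0.25), ncol = 2),
             1.5 * cbind(cos(1:20), sin(1:20)))
ytr <- rep(c(-1, 1), each = 20)
pred <- kernel_knn(Xtr, ytr, Xtr, poly2, K = 3)
mean(pred == ytr)   # close to 1 on this toy set
```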

**4 Support Vector Machines**

*4.1 Learn from data*

The SVM belongs to the family of classification algorithms that solve a supervised learning problem: given a set of samples with their corresponding classes, find a function that assigns each sample to its corresponding class. The aim of statistical learning theory is to find a satisfactory function that will correctly classify training samples and unseen samples, i.e., one that has a low generalization error. The basic setting of such a classification problem is as follows. Given a training set $S$:

\[
S = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\} \in \big(\mathbb{R}^d \times \{-1, 1\}\big)^n,
\]

generated i.i.d. from an unknown probability law $P(\mathbf{x}, y)$, and a loss function $L$, we want to find a function $f$, from a set of functions $\mathcal{F}$, that minimizes its expected loss, or risk, $R(f)$:
\[
R(f) = \int_{S} L\big(f(\mathbf{x}), y\big)\, dP(\mathbf{x}, y). \tag{4.1}
\]
Unfortunately, since $P(\mathbf{x}, y)$ is unknown, the above equation cannot be computed. However, given $S$, we can still compute the empirical risk, $R_{\mathrm{emp}}(f)$:

\[
R_{\mathrm{emp}}(f) = \frac{1}{n} \sum_{i=1}^{n} L\big(f(\mathbf{x}_i), y_i\big), \tag{4.2}
\]

and try to minimize that. This principle is called Empirical Risk Minimization (ERM) and is employed in conventional learning algorithms, e.g., neural networks. The law of large numbers ensures that if $f_1$ minimizes $R_{\mathrm{emp}}$, then $R_{\mathrm{emp}}(f_1) \to R(f_1)$ as $n$ tends to infinity. But $f_1$ is not necessarily a minimizer of $R$, so the minimization of (4.2) could yield an unsatisfactory solution to the classification problem. An example, arising from the *no free lunch* theorem, is that given one training set it is always possible to find a function that fits the data with no error but which is unable to classify a single sample from the testing set correctly.

To solve this problem, the classic Bayesian approach consists of selecting an a priori distribution for $P(\mathbf{x}, y)$ and then minimizing (4.1). In statistical learning, no assumption is made about the distribution, only about the complexity of the class of functions $\mathcal{F}$. The main idea is to favor simple functions, in order to avoid over-fitting problems and to achieve a good generalization ability (Müller et al. 2001). One way of modeling the complexity is given by the VC (Vapnik-Chervonenkis) theory: the complexity of $\mathcal{F}$ is measured by the VC dimension $h$, and the structural risk minimization (SRM) principle selects the function $f \in \mathcal{F}$ that minimizes an upper bound on the error (Vapnik 1998, 1999). This upper bound is defined as a function of $R_{\mathrm{emp}}$ and $h$.

For example, given a set of functions $\mathcal{F}$ with VC dimension $h$ and a classification problem with the loss function $L(\mathbf{x}, y) = \frac{1}{2}|y - f(\mathbf{x})|$, then for all $1 > \eta > 0$ and $f \in \mathcal{F}$ we have (Vapnik 1998, 1999):
\[
R(f) \le R_{\mathrm{emp}}(f) + \sqrt{\frac{h\big(\ln(\frac{2n}{h}) + 1\big) - \ln(\frac{\eta}{4})}{n}}, \tag{4.3}
\]

with probability at least $1 - \eta$ and for $n > h$. Following VC theory, the training step of the classifier should minimize the right-hand side of inequality (4.3). Other bounds can be found for different loss functions and measures of complexity (Vapnik 1998).

In the following, we present one particular class of functions that leads to a linear decision function, together with the definition of the SVM classifier.

*4.2 Linear SVM*

**Definition 5 (Separating hyperplane).** *Given a supervised classification problem, a separating hyperplane $H(\mathbf{w}, b)$ is a linear decision function that separates the space into two half-spaces, each half-space corresponding to a given class, i.e., $\operatorname{sgn}(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) = y_i$ for all samples from $S$.*


**Figure 6.** (a) Separating hyperplanes and (b) SVM separating hyperplane. Squares are the support vectors. Figures were adapted from http://blog.pengyifan.com/tikz-example-svm-trained-with-samples-from-two-classes/.

For samples in $S$, the condition of correct classification is $y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) > 0$. If we assume that the closest samples satisfy $|\langle \mathbf{w}, \mathbf{x}_i \rangle + b| = 1$ (which is always possible, since $H$ is defined up to a multiplicative constant), we have
\[
y_i\big(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\big) \ge 1. \tag{4.4}
\]

From Figure 6.(a), several separating hyperplanes can be found for a given $S$. According to the Vapnik-Chervonenkis theory (Smola et al. 2000), the optimal one (with the best generalization ability) is the one that maximizes the margin, subject to eq. (4.4). The margin is inversely proportional to $\|\mathbf{w}\|^2$. The optimal parameters can be found by solving the following convex optimization problem:
\[
\begin{aligned}
\text{minimize} \quad & \frac{\langle \mathbf{w}, \mathbf{w} \rangle}{2} \\
\text{subject to} \quad & y_i\big(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\big) \ge 1, \quad \forall i \in 1, \ldots, n.
\end{aligned}
\]

It is usually solved using Lagrange multipliers (Boyd and Vandenberghe 2006). The Lagrangian,
\[
L(\mathbf{w}, b, \alpha) = \frac{\langle \mathbf{w}, \mathbf{w} \rangle}{2} + \sum_{i=1}^{n} \alpha_i \big(1 - y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b)\big), \tag{4.5}
\]
is minimized with respect to $\mathbf{w}$ and $b$, and maximized with respect to the multipliers $\alpha_i \ge 0$. At the optimal point, the gradient vanishes:
\[
\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i = 0, \tag{4.6}
\]
\[
\frac{\partial L}{\partial b} = \sum_{i=1}^{n} \alpha_i y_i = 0. \tag{4.7}
\]

From (4.6), we can see that $\mathbf{w}$ lives in the subspace spanned by the training samples: $\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i$. By substituting (4.6) and (4.7) into (4.5), we get the dual quadratic problem:
\[
\begin{aligned}
\max_{\alpha} \quad & g(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle \\
\text{subject to} \quad & 0 \le \alpha_i, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0. \tag{4.8}
\end{aligned}
\]

When this dual problem is solved, we obtain the $\alpha_i$ and hence $\mathbf{w}$. This leads to the decision rule:
\[
g(\mathbf{x}) = \operatorname{sgn}\left( \sum_{i=1}^{n} \alpha_i y_i \langle \mathbf{x}, \mathbf{x}_i \rangle + b \right). \tag{4.9}
\]

The constraints assume that the data are linearly separable. For real applications, this might be too restrictive, and the problem is traditionally solved by considering soft margin constraints, $y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) \ge 1 - \xi_i$, which allow some errors during the training process, together with an upper bound on the empirical risk, $\sum_{i=1}^{n} \xi_i$. The optimization problem changes slightly to:
\[
\begin{aligned}
\text{minimize} \quad & \frac{\langle \mathbf{w}, \mathbf{w} \rangle}{2} + C \sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad & y_i\big(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\big) \ge 1 - \xi_i, \quad \forall i \in 1, \ldots, n, \\
& \xi_i \ge 0, \quad \forall i \in 1, \ldots, n.
\end{aligned}
\]

$C$ is a constant controlling the number of training errors. This optimization problem is solved using the Lagrangian:
\[
L(\mathbf{w}, b, \xi, \alpha, \beta) = \frac{\langle \mathbf{w}, \mathbf{w} \rangle}{2} + C \sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} \alpha_i \big(1 - \xi_i - y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b)\big) - \sum_{i=1}^{n} \beta_i \xi_i.
\]

Minimizing with respect to the primal variables and maximizing with respect to the dual variables leads to the so-called dual problem:
\[
\begin{aligned}
\max_{\alpha} \quad & g(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle \\
\text{subject to} \quad & 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0.
\end{aligned}
\]

Ultimately, the only change from (4.8) is the upper bound on the values of $\alpha_i$.

As in the introductory example, $\mathbf{w}$ can be written in terms of the $\mathbf{x}_i$. The difference between the SVM and the "smallest distance to the mean" classifier lies only in how $\alpha$ and $b$ are estimated; the decision function is the same. Considering the Karush-Kuhn-Tucker conditions at optimality (Boyd and Vandenberghe 2006),
\[
\begin{aligned}
& 1 - \xi_i - y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) \le 0, \qquad \alpha_i \ge 0, \qquad \alpha_i\big(1 - \xi_i - y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b)\big) = 0, \\
& \xi_i \ge 0, \qquad \beta_i \ge 0, \qquad \beta_i \xi_i = 0, \\
& \frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i = 0, \qquad \frac{\partial L}{\partial b} = \sum_{i=1}^{n} \alpha_i y_i = 0, \qquad \frac{\partial L}{\partial \xi_i} = C - \alpha_i - \beta_i = 0, \tag{4.10}
\end{aligned}
\]
it can be seen that the third condition requires that $\alpha_i = 0$ or $1 - \xi_i - y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) = 0$. This means that the solution $\alpha$ is sparse: only some of the $\alpha_i$ are non-zero. Thus $\mathbf{w}$ is supported by some of the training samples, those with non-zero optimal $\alpha_i$; these are called the *support vectors*. Figure 6.(b) shows an SVM decision function with its support vectors.

*4.3 Non linear SVM*

It is possible to extend the linear SVM to a non-linear SVM by switching the dot product to a kernel function:
\[
\begin{aligned}
\max_{\alpha} \quad & g(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j) \\
\text{subject to} \quad & 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0.
\end{aligned}
\]

Now, the SVM is a non-linear classifier in the input space $\mathbb{R}^d$, but it is still linear in the feature space. The decision function is simply:
\[
f(\mathbf{x}) = \operatorname{sgn}\left( \sum_{i=1}^{n} \alpha_i y_i k(\mathbf{x}, \mathbf{x}_i) + b \right).
\]

**Figure 7.** *Two SVM decision functions with different values of $C$ ($C = 1$ and $C = 100$).*

An example of SVM classification on the toy data set with a Gaussian kernel is given in Figure 7.

*4.4 Fitting the hyperparameters*

Finding the optimum parameters (kernel hyperparameters and penalty term $C$) for the SVM is not a straightforward task. The values of the kernel parameters can have a considerable influence on the learning capacity. Fortunately, even though their influence is significant, their optimum values are not critical, i.e., there is a range of values for which the SVM performs equally well.

Cross-validation is usually used to select the hyperparameters. Cross-validation estimates the expected error when the method is applied to an independent set of samples. Typically, for a given set of hyperparameters, the training data are split into $V$ subsets, the method is trained with $V - 1$ subsets, and the prediction error is computed on the $v$-th subset. The process is iterated for $v = 1, \ldots, V$, and the estimated expected error is the mean of all the prediction errors. In practice, $V$ is generally set to 5 or 10. Cross-validation has shown good behavior in various supervised learning problems; however, its main drawback is its computational load.
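The procedure described above can be sketched generically in R (fit_predict is a hypothetical stand-in for any supervised method; Section 5 uses the built-in cross argument of svm instead):

```r
cross_validate <- function(X, y, fit_predict, V = 5) {
  set.seed(0)
  folds <- sample(rep(1:V, length.out = nrow(X)))  # random fold assignment
  errs <- sapply(1:V, function(v) {
    pred <- fit_predict(X[folds != v, , drop = FALSE], y[folds != v],
                        X[folds == v, , drop = FALSE])
    mean(pred != y[folds == v])      # prediction error on the v-th subset
  })
  mean(errs)                         # estimated expected error
}

# Usage with a trivial 1-NN classifier standing in for the method:
one_nn <- function(Xtr, ytr, Xte) {
  ytr[apply(Xte, 1, function(x) which.min(colSums((t(Xtr) - x)^2)))]
}
set.seed(1)
X <- rbind(matrix(rnorm(40, -2), ncol = 2), matrix(rnorm(40, 2), ncol = 2))
y <- rep(c(-1, 1), each = 20)
cross_validate(X, y, one_nn, V = 5)  # small error on well-separated classes
```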

*4.5 Multiclass SVM*

SVMs are designed to solve binary problems, where the class labels can only take two values, e.g., $\pm 1$. For an astrophysics application, several classes are usually of interest. Various approaches have been proposed to address this problem; they usually combine a set of binary classifiers. Two main approaches were originally proposed for an $m$-class problem.

• **One versus the rest:** $m$ binary classifiers are applied, each one separating one class from all the others. Each sample is assigned to the class with the maximum output.
• **Pairwise classification:** $\frac{m(m-1)}{2}$ binary classifiers are applied, one for each pair of classes. Each sample is assigned to the class getting the highest number of votes; a vote for a given class is defined as a classifier assigning the sample to that class.
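The pairwise voting scheme can be sketched in a few lines of R (the predictions of the binary classifiers are hypothetical here):

```r
# `pred` holds, for one sample, the class predicted by each of the
# m(m-1)/2 pairwise classifiers; the sample goes to the most voted class.
pairwise_vote <- function(pred, m) {
  votes <- tabulate(pred, nbins = m)  # count the votes per class
  which.max(votes)
}

# With m = 3 classes, the classifiers (1 vs 2, 1 vs 3, 2 vs 3) might return:
pred <- c(2, 3, 2)
pairwise_vote(pred, m = 3)  # class 2 wins with two votes
```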

Pairwise classification has proven to be more suitable for large problems. Even though the number of classifiers used is larger than for the one versus the rest approach, the whole classification problem is decomposed into much simpler ones. Other strategies have been proposed within the remote-sensing community, such as hierarchical trees or global training; however, classification accuracies were similar, or worse, while the complexity of the training process was increased.

**5 SVM in practice**

This section details, step by step, how to tune an SVM on a given data set.

*5.1 Simulated data*

To illustrate this section, we use a simulated data set generated by Sylvain Douté. It consists of hyperspectra $\mathbf{x}$ of surface reflectance over the planet Mars (Bernard-Michel et al. 2009). A hyperspectrum is represented by a vector in which each variable contains the reflectance of the surface at a given wavelength. Each hyperspectrum corresponds to a different position on the Mars surface.

The model used for the simulation of the hyperspectra has 5 parameters: the grain sizes of water and CO2 ice, and the proportions of water, CO2 ice and dust. The model has been used to generate $n$ samples $\mathbf{x}$ ($\mathbf{x} \in \mathbb{R}^{184}$ and $n = 31500$). Five classes were considered, according to the grain size of water used for the simulation of the samples. The mean vectors of each class are given in Figure 8.

In this example, we use the R package e1071, which relies on the C++ library libsvm, the state-of-the-art QP solver. The data set can be downloaded here: http://fauvel.mathieu.free.fr/data-used-for-stat4astro-labwork.html.

*5.2 Load the data and extract a few training samples for learning the model*

First we need to load the data set. It contains 6300 samples per class, and we are going to use only 100 samples per class for training; the remaining samples will be used

**Figure 8.** Mean spectra.

for validation. xt and xT denote the training samples and the validation samples, respectively; yt and yT denote their corresponding classes.

## Load some libraries
library("e1071")
library(foreach)
library(doParallel)
# Set the number of cores
registerDoParallel(4)
load("astrostat.RData")
n = nrow(x)
d = ncol(x)
C = max(y)
# Select "numberTrain" samples per class for training
numberTrain = 100
# The remaining is for validation
numberTest = 6300 - numberTrain
## Initialization of the training/validation sets
xt = matrix(0, numberTrain*C, d)
yt = matrix(0, numberTrain*C, 1)
xT = matrix(0, numberTest*C, d)
yT = matrix(0, numberTest*C, 1)

for (i in 1:C)
{
  t = which(y == i)
  set.seed(i)
  ts = sample(t)  # Permute randomly the samples of class i
  xt[(1 + numberTrain*(i-1)):(numberTrain*i), ] = x[ts[1:numberTrain], ]
  yt[(1 + numberTrain*(i-1)):(numberTrain*i), ] = y[ts[1:numberTrain], ]
  xT[(1 + numberTest*(i-1)):(numberTest*i), ]   = x[ts[(numberTrain+1):6300], ]
  yT[(1 + numberTest*(i-1)):(numberTest*i), ]   = y[ts[(numberTrain+1):6300], ]
}

*5.3 Estimate the optimal hyperparameters of the model*

Now, we can fit the best couple of hyperparameters for the model. We are going to use a Gaussian kernel, which has one parameter, $\gamma$; there is also the penalty term $C$. Cross-validation is used to select the best values: we simply test each couple of hyperparameters over a given range of values. The correct classification rate associated with a couple of values is assessed by cross-validation, and the optimal values are those corresponding to the highest estimated correct classification rate.

## Set the CV range of search (dependent on your data)
gamma = 2^(-8:0)
cost = 10^(-2:6)
CVT = matrix(0, nrow=length(cost), ncol=length(gamma))
## Serial version
TIME = Sys.time()
for (i in 1:length(cost))
{
  for (j in 1:length(gamma))
  {
    set.seed(0)
    model = svm(xt, yt, cost=cost[i], gamma=gamma[j], type="C", cross=5)
    CVT[i, j] = model$tot.accuracy
  }
}
print(Sys.time() - TIME)
print(CVT)
## Get the position of the maximum
indices = which(CVT == max(CVT), arr.ind = TRUE)

*5.4 Learn the model with the optimal parameters and predict the whole validation set*

Once the optimal values are selected, we can learn the model and predict the classes of the validation set. Then we compute the confusion matrix and the overall accuracy to assess the performance of the SVM.

## Learn the model
model = svm(xt, yt, cost=cost[indices[1,1]], gamma=gamma[indices[1,2]], type="C")
## Predict the validation samples
yp = predict(model, xT)
## Confusion matrix
confu = table(yT, yp)
OA = sum(diag(confu)) / sum(confu)

*5.5 Speeding up the process?*

CV can be very time consuming. Several options are possible to reduce the processing time. One very easy way is to perform the CV in parallel, using the R packages foreach and doParallel.

TIME = Sys.time()
CV = foreach(i = 1:length(cost), .combine = rbind) %dopar% {
  temp = matrix(0, nrow=length(gamma), 1)
  for (j in 1:length(gamma)) {
    set.seed(0)
    model = svm(xt, yt, cost=cost[i], gamma=gamma[j], type="C", cross=5)
    temp[j] = model$tot.accuracy
  }
  return(c(i, temp))
}
print(Sys.time() - TIME)
print(CV)

*5.6 Results*

For a given run, the CVT table looks like the following (rows correspond to increasing values of $C$ and columns to increasing values of $\gamma$; for the exact ranges, see the R code). The best value is shown in bold.

17.8 18.0 18.2 18.4 18.4 18.8 18.8 19.0 18.8
21.4 23.2 30.2 35.8 38.4 41.8 44.6 46.2 45.2
36.6 40.6 46.2 55.4 61.4 69.2 71.0 72.0 70.8
56.6 65.2 75.0 79.2 82.6 81.8 83.2 81.2 78.0
81.8 85.2 86.4 **86.6** 86.0 84.6 83.6 80.8 78.2
86.4 86.4 85.8 85.8 85.6 84.6 83.6 80.8 78.2
85.8 86.2 85.4 85.8 85.6 84.6 83.6 80.8 78.2
85.8 86.2 85.4 85.8 85.6 84.6 83.6 80.8 78.2
85.8 86.2 85.4 85.8 85.6 84.6 83.6 80.8 78.2

The set of optimal hyperparameters is $C = 100$ and $\gamma = 0.0009765625$, for a correct classification rate estimated at 86.6. The SVM is then learned with the optimal hyperparameters. The resulting model has 217 support vectors (obtained with the command model$tot.nSV), which means that approximately half of the training samples have been used to build the decision function. The correct classification rate computed on the validation set is 0.881096. In comparison, in the same setting, a Gaussian Mixture Model with a diagonal covariance matrix assumption reaches a correct classification rate of 0.5300645.

n = 100                                   # Number of points
X = matrix(0, nrow=n, ncol=2)             # Data generation
set.seed(0)
X[1:(n/2), 1] = rnorm(n/2, -1, 0.5)       # Two Gaussians
X[1:(n/2), 2] = rnorm(n/2, -1, 0.5)
X[(n/2+1):n, 2] = rnorm(n/2, 1, 0.5)
X[(n/2+1):n, 1] = rnorm(n/2, 1, 0.5)

Y = c(rep(-1, n/2), rep(1, n/2))          # Two classes +1 and -1

plot(X[Y==-1, 1], X[Y==-1, 2], pch=1, xlim=c(-2, 2), ylim=c(-2, 2))
points(X[Y==1, 1], X[Y==1, 2], pch=2)
legend("topleft", c("-1", "1"), pch=c(1, 2))
dev.print(pdf, 'data_toy.eps')
# Compute the Gram matrix of all dot products for class 1
Kp = X[Y==1, ] %*% t(X[Y==1, ])
# Compute the Gram matrix of all dot products for class -1
Kn = X[Y==-1, ] %*% t(X[Y==-1, ])
# Compute the parameters of the classifier
b = -sum(Kp)/(ncol(Kp)*ncol(Kp)) + sum(Kn)/(ncol(Kn)*ncol(Kn))
alpha = c(Y[Y==1]/ncol(Kp), -Y[Y==-1]/ncol(Kn))

w = c(0, 0)
for (i in 1:n) {
  w = w + alpha[i]*Y[i]*X[i, ]
}
# Compute two points of the decision function for plotting
a = -w[1]/w[2]
c = b/w[2]/2
p = matrix(0, nrow=2, ncol=2)
p[1, 1] = -2
p[1, 2] = p[1, 1]*a + c
p[2, 1] = 2
p[2, 2] = p[2, 1]*a + c
segments(p[1, 1], p[1, 2], p[2, 1], p[2, 2])
dev.print(pdf, 'data_toy_sep.eps')
dev.off(dev.list())

**Figure 9.** Code for the linear example.

n = 100
set.seed(0)
rho = 1.5 + rnorm(n/2, 0, 0.2)
theta = 2*pi*runif(n/2)
X = matrix(0, nrow=n, ncol=2)
X[1:(n/2),1] = rnorm(n/2, 0, 0.25)
X[1:(n/2),2] = rnorm(n/2, 0, 0.25)
X[(n/2+1):n,1] = rho*cos(theta)
X[(n/2+1):n,2] = rho*sin(theta)
Y = c(rep(-1, n/2), rep(1, n/2))
pdf("data_toy_nl.eps")
plot(X[Y==-1,1], X[Y==-1,2], pch=1, xlim=c(-2,2), ylim=c(-2,2))
points(X[Y==1,1], X[Y==1,2], pch=2)
legend("topleft", c("-1","1"), pch=c(1,2))
dev.off(dev.list())
Z = matrix(0, nrow=n, ncol=3)
Z[,1] = X[,1]^2
Z[,2] = X[,2]^2
Z[,3] = X[,1]*X[,2]
pdf("data_toy_nl_3D.eps")
plot(Z[Y==-1,1], Z[Y==-1,2], pch=1, xlim=c(0,2), ylim=c(0,2))
points(Z[Y==1,1], Z[Y==1,2], pch=2)
legend("topleft", c("-1","1"), pch=c(1,2))
dev.off(dev.list())
# Apply the decision function using the kernel
# Compute the kernel matrix of all dot products for class 1
Kp = (X[Y==1,] %*% t(X[Y==1,]))^2
# Compute the kernel matrix of all dot products for class -1
Kn = (X[Y==-1,] %*% t(X[Y==-1,]))^2
# Compute the parameters of the classifier
b = -sum(Kp)/(ncol(Kp)*ncol(Kp)) + sum(Kn)/(ncol(Kn)*ncol(Kn))
alpha = c(-Y[Y==-1]/ncol(Kn), Y[Y==1]/ncol(Kp))

alphay = alpha*Y
xx = seq(-2, 2, length=100)
yy = seq(-2, 2, length=100)
XT = as.matrix(expand.grid(xx, yy))
K = (XT %*% t(X))^2
F = K %*% alphay + b/2
mylevels = c(0)  # draw only the decision boundary F = 0
Fr = matrix(F, ncol=100, nrow=100)
pdf("data_toy_nl_2D.eps")
contour(xx, yy, Fr, levels=mylevels, xaxs='i', yaxs='i')
points(X[Y==-1,1], X[Y==-1,2], pch=1)
points(X[Y==1,1], X[Y==1,2], pch=2)
dev.off(dev.list())
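A remark on the map Z above: the cross term is stored as x₁x₂ because Z is only used for plotting. The exact feature map of the degree-2 homogeneous polynomial kernel k(x, y) = ⟨x, y⟩² carries a factor √2 on the cross term, which a quick NumPy check confirms (a sketch, our notation):

```python
import numpy as np

def phi(v):
    # Exact feature map of k(x, y) = <x, y>^2 on R^2
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

rng = np.random.default_rng(1)
x, y = rng.normal(size=2), rng.normal(size=2)

k_val = (x @ y) ** 2        # kernel evaluated in input space
dot_val = phi(x) @ phi(y)   # dot product evaluated in feature space
```

Expanding φ(x)·φ(y) = x₁²y₁² + x₂²y₂² + 2x₁x₂y₁y₂ = (x₁y₁ + x₂y₂)² shows the two quantities agree exactly.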
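The kernelized decision in the listing above is the mean-difference classifier of Figure 9 with every dot product replaced by k(x, y) = ⟨x, y⟩². A NumPy sketch on the same kind of Gaussian-plus-ring data (our naming, seed chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
# Class -1: 2-D Gaussian around the origin; class 1: uniform on a noisy ring
Xn = rng.normal(0.0, 0.25, size=(n // 2, 2))
rho = 1.5 + rng.normal(0.0, 0.2, size=n // 2)
theta = 2 * np.pi * rng.uniform(size=n // 2)
Xp = np.c_[rho * np.cos(theta), rho * np.sin(theta)]
X = np.vstack([Xn, Xp])
Y = np.r_[-np.ones(n // 2), np.ones(n // 2)]

def k(A, B):
    # Homogeneous polynomial kernel of degree 2, evaluated blockwise
    return (A @ B.T) ** 2

# Kernel mean-difference classifier:
# f(x) = <phi(x), mu_p> - <phi(x), mu_n> + (||mu_n||^2 - ||mu_p||^2) / 2
b = -k(Xp, Xp).mean() + k(Xn, Xn).mean()
f = k(X, Xp).mean(axis=1) - k(X, Xn).mean(axis=1) + b / 2
acc = (np.sign(f) == Y).mean()
```

Although the rule is linear in feature space, its zero level set in the input plane is a circle separating the Gaussian cloud from the ring, so the training accuracy is high.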

library(e1071)
n = 100
set.seed(0)
rho = 1.5 + rnorm(n/2, 0, 0.2)
theta = 2*pi*runif(n/2)
mylevels = c(0)  # draw only the decision boundary
X = matrix(0, nrow=n, ncol=2)
X[1:(n/2),1] = rnorm(n/2, 0, 0.25)
X[1:(n/2),2] = rnorm(n/2, 0, 0.25)
X[(n/2+1):n,1] = rho*cos(theta)
X[(n/2+1):n,2] = rho*sin(theta)
Y = c(rep(-1, n/2), rep(1, n/2))
# Grid
xx = seq(-2, 2, length=100)
yy = seq(-2, 2, length=100)
XT = as.matrix(expand.grid(xx, yy))

model = svm(X, Y, cost=1, gamma=0.5, type="C-classification")
pred = predict(model, XT, decision.values=TRUE)
df = matrix(attr(pred, "decision.values"), ncol=100, nrow=100)
pdf("svm_c_1.eps")
contour(xx, yy, df, levels=mylevels, xaxs='i', yaxs='i')
points(X[Y==-1,1], X[Y==-1,2], pch=1)
points(X[Y==1,1], X[Y==1,2], pch=2)
dev.off(dev.list())

model = svm(X, Y, cost=100, gamma=0.5, type="C-classification")
pred = predict(model, XT, decision.values=TRUE)
df = matrix(attr(pred, "decision.values"), ncol=100, nrow=100)
pdf("svm_c_100.eps")
contour(xx, yy, df, levels=mylevels, xaxs='i', yaxs='i')
points(X[Y==-1,1], X[Y==-1,2], pch=1)
points(X[Y==1,1], X[Y==1,2], pch=2)
dev.off(dev.list())
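For readers working in Python rather than R, the e1071 calls above have a close analogue in scikit-learn's SVC, whose default kernel is also the Gaussian (RBF) kernel; C plays the role of e1071's cost parameter. A sketch on the same kind of toy data (assuming scikit-learn is installed; names are ours):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 100
# Class -1: Gaussian blob around the origin; class 1: points on a noisy ring
Xn = rng.normal(0.0, 0.25, size=(n // 2, 2))
rho = 1.5 + rng.normal(0.0, 0.2, size=n // 2)
theta = 2 * np.pi * rng.uniform(size=n // 2)
Xp = np.c_[rho * np.cos(theta), rho * np.sin(theta)]
X = np.vstack([Xn, Xp])
Y = np.r_[-np.ones(n // 2), np.ones(n // 2)]

# Counterpart of svm(X, Y, cost=1, gamma=0.5, type="C-classification")
model = SVC(C=1.0, gamma=0.5, kernel="rbf").fit(X, Y)
acc = model.score(X, Y)           # training accuracy
dec = model.decision_function(X)  # signed decision values, as in attr(pred, "decision.values")
```

Contouring `dec` over a grid at level 0, as the R code does, would draw the same kind of closed decision boundary around the Gaussian blob.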

# Generate random data
n = 100
set.seed(0)
rho = 1.5 + rnorm(n/2, 0, 0.2)
theta = 2*pi*runif(n/2)
mylevels = c(0)  # draw only the decision boundary
# Class -1 is a 2D Gaussian; class 1 is uniformly distributed on a ring
X = matrix(0, nrow=n, ncol=2)
X[1:(n/2),1] = rnorm(n/2, 0, 0.25)
X[1:(n/2),2] = rnorm(n/2, 0, 0.25)
X[(n/2+1):n,1] = rho*cos(theta)
X[(n/2+1):n,2] = rho*sin(theta)
Y = c(rep(-1, n/2), rep(1, n/2))
# Generate data on the grid [-2,2]
xx = seq(-2, 2, length=100)
yy = seq(-2, 2, length=100)
XT = as.matrix(expand.grid(xx, yy))
# Compute the distance matrix
D = matrix(0, nrow(XT), nrow(X))
normXT = rowSums(XT*XT)
normX = rowSums(X*X)
# for (i in 1:n)
# {
#   D[,i] = normXT + normX[i] - 2*(XT %*% X[i,])
# }
# Same as the previous lines, but faster
D = outer(normXT, normX, '+') - 2*(XT %*% t(X))
# K-NN (here K = 1)
indices = apply(D, 1, which.min)
decision = matrix(0, nrow(XT), 1)
for (i in 1:nrow(XT))
{
  decision[i] = Y[indices[i]]
}
# Plot data
yp = matrix(decision, ncol=100, nrow=100)
pdf("knn.eps")
contour(xx, yy, yp, levels=mylevels, xaxs='i', yaxs='i')
points(X[Y==-1,1], X[Y==-1,2], pch=1)
points(X[Y==1,1], X[Y==1,2], pch=2)
dev.off(dev.list())
# Kernelized K-NN
p = 2
# K = matrix(0, nrow(XT), nrow(X))
# for (i in 1:nrow(XT))
# {
#   for (j in 1:n)
#   {
#     K[i,j] = (XT[i,1]^2 + XT[i,2]^2)^p + (X[j,1]^2 + X[j,2]^2)^p
#              - 2*(XT[i,1]*X[j,1] + XT[i,2]*X[j,2])^p
#   }
# }
# Same as the lines above, but faster
K = outer(normXT^2, normX^2, "+") - 2*(XT %*% t(X))^2
indices = apply(K, 1, which.min)

# Kernelized K-NN decision
decision = matrix(0, nrow(XT), 1)
for (i in 1:nrow(XT))
{
  decision[i] = Y[indices[i]]
}
yp = matrix(decision, ncol=100, nrow=100)
pdf("kknn.eps")
contour(xx, yy, yp, levels=mylevels, xaxs='i', yaxs='i')
points(X[Y==-1,1], X[Y==-1,2], pch=1)
points(X[Y==1,1], X[Y==1,2], pch=2)
dev.off(dev.list())
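The one-line computation of K above relies on the feature-space distance identity ‖φ(x) − φ(y)‖² = k(x, x) + k(y, y) − 2k(x, y), vectorized with outer(). The same vectorization can be checked against a naive double loop in NumPy (a sketch with our names):

```python
import numpy as np

rng = np.random.default_rng(2)
XT = rng.normal(size=(5, 2))  # "grid" points
X = rng.normal(size=(4, 2))   # training points
p = 2                         # degree of the polynomial kernel k(x, y) = <x, y>^p

# Vectorized squared feature-space distances, mirroring the R outer() one-liner;
# note that k(x, x) = <x, x>^p = ||x||^(2p)
normXT = np.sum(XT * XT, axis=1)
normX = np.sum(X * X, axis=1)
D = np.add.outer(normXT ** p, normX ** p) - 2 * (XT @ X.T) ** p

# Naive reference: k(x, x) + k(y, y) - 2 k(x, y), element by element
ref = np.empty((len(XT), len(X)))
for i, a in enumerate(XT):
    for j, c in enumerate(X):
        ref[i, j] = (a @ a) ** p + (c @ c) ** p - 2 * (a @ c) ** p
```

Since D contains squared distances in feature space, all its entries are non-negative up to floating-point error, which also serves as a sanity check on the vectorized formula.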

**References**

Bernard-Michel, C., Douté, S., Fauvel, M., Gardes, L., and Girard, S.: 2009, Journal of Geophysical Research: Planets (1991–2012) 114(E6)

Boyd, S. and Vandenberghe, L.: 2006, Convex Optimization, Cambridge University Press

Camps-Valls, G. and Bruzzone, L. (eds.): 2009, Kernel Methods for Remote Sensing Data Analysis, John Wiley & Sons, Ltd

Fauvel, M., Chanussot, J., and Benediktsson, J. A.: 2012, Pattern Recognition 45(1), 381

Hastie, T. J., Tibshirani, R. J., and Friedman, J. H.: 2009, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics, Springer, New York

Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., and Schölkopf, B.: 2001, IEEE Transactions on Neural Networks 12(2), 181

Schölkopf, B.: 2000, Statistical Learning and Kernel Methods

Smola, A., Bartlett, P. L., Schölkopf, B., and Schuurmans, D.: 2000, Advances in Large Margin Classifiers, MIT Press

Vapnik, V.: 1998, Statistical Learning Theory, Wiley, New York

Vapnik, V.: 1999, The Nature of Statistical Learning Theory, Second Edition, Springer, New York