Support Vector Machines

(1)

Linear and nonlinear methods for classification

Support Vector Machines

1 Introduction

In this chapter, we consider supervised classification problems. We have a training data set withnobservation points (or objects)Xi and their class (or label)Yi. For example, the MNIST data set is a database of handwritten digits, where the objectsXi are images andYi ∈ {0,1, . . . ,9}. Many other examples can be considered, such as the recognition of an object in an image, the detection of spams for emails, the presence of some illness for patients (the observation points may be gene expression data) ... We will first introduce the notion of best classifier, which is also called the Bayes classifier, we will then propose linear methods for classification. The core of the chapter will be devoted to the Support Vector machine, which are very powerful nonlinear methods for classification.

The main references for this course are the following books :

• An introduction to Support Vector Machines by N. Cristianini and J.

Shawe-Taylor [1]

• Introduction to High-Dimensional Statistics by C. Giraud [3]

• The elements of Statistical Learning by T. Hastie et al [4].

• Learning with kernels by A. Smola and B. Scholkopf [5].

• Statistical Learning Theory by V. Vapnik [6].

• M2 courses, M. Fromont-Renoir, Université de Rennes 2 : https://perso.univ-rennes2.fr/magalie.fromont

2 Optimal rules for classification

Suppose that dⁿ corresponds to the observation of a n-sample Dⁿ = {(X1, Y1), . . . ,(Xn, Yn)}with joint unknown distributionP onX × Y, and thatxis one observation of the variableX, where(X, Y)has joint distribu- tionPand isindependentofDⁿ. Since we consider a classification problem, Yis a finite set. The sampleDⁿis called thelearning sample.

Aclassification ruleis a measurable functionf :X → Ythat associates the outputf(x)to the inputx∈ X.

In order to quantify the quality of the prevision, we introduce a loss function.

DEFINITION1. — A measurable functionl :Y × Y →R+is aloss function ifl(y, y) = 0andl(y, y⁰)>0fory6=y⁰.

Iffis a classification rule,xan input,ythe output that is really associated to the inputx, thenl(y, f(x))quantifies the loss when we associate to the inputxthe predicted outputf(x).

In a regression framework whenY=R: we consider theL^ploss (p≥1) l(y, y⁰) =|y−y⁰|^p.

Ifp= 2, this is the quadratic loss.

For binary classification:Y ={−1,1}

l(y, y⁰) =1_y6=y⁰ = |y−y⁰|

2 = (y−y⁰)²

4 .

We consider the expectation of this loss, this leads to the definition of the risk:

DEFINITION2. — Given a loss functionl, therisk- orgeneralisation error - of a prediction rulefis defined by

R_P(f) =E(X,Y)∼P[l(Y, f(X))].

(2)

It is important to note that, in the above definition,(X, Y)is independent of the training sampleDⁿthat was used to build the prediction rulef.

LetF denote the set of all possible prediction rules. We say thatf^∗ is an optimal rule if

R_P(f^∗) = inf

f∈FR_P(f).

A natural question arises : is it possible to build optimal rules ? This is indeed the case. We focus here on the classification framework. We define the Bayes rule, and show that it is an optimal rule for classification.

DEFINITION3. — We call Bayes rule any measurable functionf^∗inFsuch that for allx∈ X,

P(Y =f^∗(x)|X =x) = max

y∈YP(Y =y|X =x).

THEOREM4. — Iff^∗is a Bayes rule, thenRP(f^∗) = inf_f∈FRP(f).

The definition of a Bayes rule depends on the knowledge of the distribution P of (X, Y). In practice, we have a training sample Dⁿ = {(X1, Y₁), . . . ,(X_n, Y_n)} with joint unknown distribution P, and we construct a classification rule. The aim is to find a "good" classification rule, in the sense that its risk is close to the optimal risk of a Bayes rule.

Exercise. — Prove Theorem4.

3 Linear discriminant analysis

Let(X, Y)with unknown distributionPonX × Y, where we assume that X =R^pandY={1,2, . . . , K}. We define

fk(x) =P(Y =k/X =x).

A Bayes rule is defined by

f^∗(x) = argmax

k∈{1,2,...,K}

fk(x).

We assume that the distribution ofXhas a densityfXand the distribution of XgivenY =khas a densitygkwith respect to the Lebesgue measure onR^p, and we setπk=P(Y =k).

Exercise. — Prove that

fX(x) =

K

X

l=1

gl(x)πl

and that

fk(x) = gk(x)πk

PK

l=1gl(x)πl

.

If we assume that the distribution ofXgivenY =kis a multivariate normal distribution, with meanµkand covariance matrixΣk, we have

gk(x) = 1

(2π)^p/2|Σk|^1/2exp

−1

2(x−µk)⁰Σ⁻¹_k (x−µk)

.

For the linear discriminant analysis, we furthermore assume thatΣk = Σfor allk. In this case we have

log

P(Y =k/X=x) P(Y =l/X=x)

= log πk

π_l

+x⁰Σ⁻¹(µk−µl)

− 1

2(µk+µl)⁰Σ⁻¹(µk−µl).

Hence the decision boundary between the classkand the classl,{x,P(Y = k/X =x) =P(Y =l/X =x)}is linear. We can write

logP(Y =k/X =x) =C(x)δk(x) whereC(x)does not depend on the classk, and

δ_k(x) =x⁰Σ⁻¹µ_k−1

2µ⁰_kΣ⁻¹µ_k+ log(π_k).

The Bayes rule will assignxto the classf^∗(x)which maximisesδk(x).

We want now to built a decision rule from a training sample Dⁿ =

(3)

{(X1, Y1), . . . ,(Xn, Yn)}which is close to the Bayes rule. For this purpose, we have to estimate for allk πk, µkand the matrixΣ. We consider the following estimators.

ˆ

πk = Nk

n ˆ

µ_k = Pn

i=1Xi1Y_i=k

Nk

whereNk=Pn

i=11Y_i=k. We estimateΣby Σ =ˆ

K

X

k=1 n

X

i=1

(Xi−µˆk)(Xi−µˆk)⁰1Y_i=k

n−K .

To conclude, the Linear Discriminant Analysis assigns the inputxto the class f(x)ˆ which maximisesδˆk(x), where we have replaced in the expression of δk(x)the unknown quantities by their estimators.

Remark : If we no more assume that the matrix Σdoes not depend on the classk, we obtain quadratic discriminant functions

δk(x) =−1

2log|Σk| − 1

2(x−µk)⁰Σ⁻¹_k (x−µk) + log(πk).

This leads to the quadratic discriminant analysis.

4 Logistic regression

Noting that the Bayes classifier only depends on the conditional distribution ofY givenX, we can avoid to model the distribution ofX as previously. We assume thatX =R^d. One of the most popular model for binary classification whenY ={−1,1}is thelogistic regressionmodel, for which it is assumed that

P(Y = 1/X =x) = exp(α+hβ,xi)

1 + exp(α+hβ,xi)for allx∈ X, withα∈Randβ∈R^d.

Exercise. — Compute the Bayes classifierf^∗for this model and determine the border betweenf^∗= 1andf^∗=−1.

We can estimate the parameters(α,β)by maximizing the conditional like- lihood ofY givenX.

L(α,β) = Y

i,Yi=1

exp(α+hβ,Xii) 1 + exp(α+hβ,Xii)

Y

i,Yi=−1

1

1 + exp(α+hβ,Xii).

We then compute the logistic regression classifier :

∀x∈ X,fˆ(x) =sign( ˆα+hβ,ˆ xi).

In general, the logistic regression provides different classifiers as LDA. If the distribution ofX if far from a Gaussian distribution, the logistic regression outperforms the LDA, as soon as the logistic model is still valid.

5 Linear Support Vector Machine

5.1 Linearly separable training set

We assume thatX =R^d, endowed with the usual scalar producth., .i, and thatY={−1,1}.

DEFINITION5. — The training set dⁿ₁ = (x1, y1), . . . ,(xn, yn)is called linearly separableif there exists(w, b)such that for alli,

yi= 1ifhw, xii+b >0, yi=−1ifhw, xii+b <0, which means that

∀i yi(hw, xii+b)>0.

The equationhw, xi+b= 0defines a separating hyperplane with orthogonal vectorw.

The functionfw,b(x) =1_hw,xi+b≥0−1hw,xi+b<0defines a possible linear classification rule.

The problem is that there exists an infinity of separating hyperplanes, and therefore an infinity of classification rules.

Which one should we choose ? The response is given by Vapnik [6]. The classification rule with the best generalization properties cooresponds to the

(4)

separating hyperplane maximizing the marginγbetween the two classes on the training set.

If we consider two entries of the training set, that are on the broder defining the margin, and that we callx₁ andx₋₁ with respective outputs1 and−1, the separating hyperplane is located at the half-distance betweenx1andx₋₁. The margin is therefore equal to the half of the distance betweenx1andx₋₁ projected onto the normal vector of the separating hyperplane :

γ= 1 2

hw, x1−x₋₁i kwk .

Let us notice that for allκ6= 0, the couples(κw, κb)and(w, b)define the same hyperplane.

DEFINITION6. — The hyperplanehw, xi+b= 0iscanonical with respect to the set of vectorsx1, . . . , xk if

mini=1...k|hw, x_ii+b|= 1.

The separating hyperplane has the canonical form relatively to the vectors {x₁, x₋₁}if it is defined by(w, b)wherehw, x₁i+b= 1andhw, x₋₁i+b=

−1.In this case, we havehw, x₁−x₋₁i= 2, hence γ= 1

kwk.

5.2 A convex optimisation problem

Finding the separating hyperplane with maximal margin consists in finding (w, b)such that

kwk²or ¹₂kwk²is minimal under the constraint yi(hw, xii+b)≥1for alli.

This leads to a convex optimization problem with linear constraints, hence there exists a unique global minimizer.

Primal optimization problemLet us recall some definitions.

We want to minimize foru∈R²h(u)under the constraintsgi(u)≥0for i= 1. . . n,his a quadratic function,giare affine functions.

The Lagrangian of the optimization problem is the function defined on R²×Rⁿby

(5)

L(u, α) =h(u)−

n

X

i=1

αigi(u).

The variablesαiare called thedual variables.

For allα∈Rⁿ,uαis the value ofuminimizingL(u, α).

Thedual functionis defined by

θ(α) =L(uα, α) =min_u∈_R2L(u, α).

Dual optimization problem

Maximizing θ(α) = L(uα, α) = min_u∈_R2L(u, α) under the constraints αi≥0fori= 1. . . n.

The solution of the dual problemα^∗ gives the solution of the primal problem:

u^∗=uα^∗.

Karush-Kuhn-Tucker conditions for the dual problem

• α^∗_i ≥0for alli= 1. . . n.

• gi(uα^∗)≥0for alli= 1. . . n.

• Back to the dual problem :

We have to minimizeL(u, α) = h(u)−Pn

i=1α_ig_i(u)with respect tou and to maximizeL(u_α, α)with respect to the dual variablesα_i. Note that if g_i(u_α^∗)>0, then we getα^∗_i = 0.

The complementary Karush-Kuhn-Tucker condition writesα^∗_ig_i(u_α^∗) = 0.

Let us come back to the linear support vector machine optimization problem.

The primal problemto solve is :

Minimizing¹₂kwk²s. t.yi(hw, xii+b)≥1∀i.

LagrangianL(w, b, α) = ¹₂kwk²−Pn

i=1αi(yi(hw, xii+b)−1). Dual Function

∂L

∂w(w, b, α) = w−

n

X

i=1

αiyixi = 0 ⇔ w=

n

X

i=1

αiyixi

∂L

∂b(w, b, α) = −

n

X

i=1

α_iy_i= 0 ⇔

n

X

i=1

α_iy_i= 0

θ(α) = 1 2

n

X

i,j=1

αiαjyiyjhxi, xji+

n

X

i=1

αi−

n

X

i,j=1

αiαjyiyjhxi, xji

=

n

X

i=1

αi−1 2

n

X

i,j=1

αiαjyiyjhxi, xji.

The correspondingdual problem is : Maximizing

θ(α) =

n

X

i=1

α_i−1 2

n

X

i,j=1

α_iα_jy_iy_jhxi, x_ji

under the constraintPn

i=1αiyi= 0andαi≥0∀i.

The solutionα^∗ of the dual problem can be obtained with classical optimization softwares.

Remark : The solution does not depend on the dimensiond, but depends on the sample sizen, hence it is interesting to notice that when X is high dimensional, linear SVM do not suffer from the curse of dimensionality.

5.3 Supports Vectors

For our optimization problem, theKarush-Kuhn-Tucker conditions are

• α^∗_i ≥0∀i= 1. . . n.

(6)

• yi(hw^∗, xii+b^∗)≥1∀i= 1. . . n.

• α^∗_i(yi(hw^∗, xii+b^∗)−1) = 0∀i= 1. . . n.

(complementary condition)

Only theα^∗_i >0are involved in the resolution of the optimization problem.

If the number of valuesα^∗_i >0is small, the solution of the dual problem is called"sparse".

DEFINITION7. — The xi such that α^∗_i > 0 are called the support vectors. They are located on the border defining the maximal margin namely yi(hw^∗, xii+b^∗) = 1(c.f. complementary KKT condition).

We finally obtain the following classification rule : fˆ(x) =1_hw^∗_,xi+b^∗_≥0−1_hw^∗_,xi+b^∗_<0,

with

• w^∗=Pn

i=1α^∗_ix_iy_i,

• b^∗=−¹₂{miny_i=1hw^∗, xii+miny_i=−1hw^∗, xii}.

The maximal margin equalsγ^∗=_kw¹∗k = Pn

i=1(α^∗_i)²−1/2

.

Theα_i^∗that do not correspond to support vectors (sv) are equal to 0, and therefore

fˆ(x) =1^P

xi svy_iα^∗_ihx_i,xi+b^∗≥0−1^P

xi svy_iα^∗_ihx_i,xi+b^∗<0.

6 Linear SVM in the non separable case

The previous method cannot be applied when the training set is not linearly separable. Moreover, the method is very sensitive to outliers.

6.1 Flexible margin

In the general case, we allow some points to be in the margin and even on the wrong side of the margin. We introduce the slack variableξ= (ξ1, . . . , ξn) and the constraintyi(hw, xii+b)≥1becomesyi(hw, xii+b)≥1−ξi, with ξi≥0.

• Ifξi ∈ [0,1]the point is well classified but in the region defined by the margin.

• Ifξ_i>1the point is misclassified.

The margin is calledflexible margin.

6.2 Optimization problem with relaxed constraints

In order to avoid too large margins, we penalize large values for the slack variableξ_i.

Theprimal optimization problem is formalized as follows : Minimize with respect to(w, b, ξ) ¹₂kwk²+CPn

i=1ξisuch that yi(hw, xii+b)≥1−ξi∀i

ξi≥0

(7)

Remarks :

• C > 0 is a tuning parameter of the SVM algorithm. It will determine the tolerance to misclassifications. IfCincreases, the number of misclassified points decreases, and if Cdecreases, the number of misclassified points increases.Cis generally calibrated by cross-validation.

• One can also minimize ¹₂kwk²+CPn

i=1ξ^k_i,k= 2,3, . . ., we still have aconvexoptimization problem.

The choice Pn

i=11_ξ_i_>1 (number of errors) instead of Pn

i=1ξ^k_i would lead to a non convex optimization problem.

TheLagrangianof this problem is:

L(w, b, ξ, α, β) = 1 2kwk²+

n

X

i=1

ξi(C−αi−βi)

+

n

X

i=1

αi−

n

X

i=1

αiyi(hw, xii+b),

withαi ≥0andβi≥0.

The cancellation of the partial derivatives _∂w^∂L(w, b, ξ, α, β),

∂L

∂b(w, b, ξ, α, β) and _∂ξ^∂L

i(w, b, ξ, α, β) leads to the following dual problem.

Dual problem:

Maximizingθ(α) =Pn

i=1α_i−¹₂Pn

i,j=1α_iα_jy_iy_jhxi, x_ji s. t.Pn

i=1αiyi = 0and0≤αi≤C∀i.

Karush-Kuhn-Tucker conditions:

• 0≤α^∗_i ≤C∀i= 1. . . n.

• y_i(hw^∗, x_ii+b^∗)≥1−ξ_i^∗∀i= 1. . . n.

• α^∗_i(yi(hw^∗, xii+b^∗) +ξ^∗_i −1) = 0∀i= 1. . . n.

• ξ_i^∗(α^∗_i −C) = 0.

6.3 Supports vectors

We have the complementary Karush-Kuhn-Tucker conditions:

α^∗_i (yi(hw^∗, xii+b^∗) +ξ_i^∗−1) = 0∀i= 1. . . n, ξ_i^∗(α^∗_i −C) = 0

DEFINITION8. — The pointsx_isuch thatα^∗_i >0are thesupport vectors.

We have two types of support vectors :

• The support vectors for which the slack variables are equal to0. They are located on the border of the region defining the margin.

• The support vectors for which the slack variables are not equal to0:ξ^∗_i >

0and in this caseα^∗_i =C.

For the vectors that are not support vectors, we haveα^∗_i = 0andξ^∗_i = 0.

The classification rule is defined by

f(x)ˆ = 1_hw^∗_,xi+b^∗_≥0−1_hw^∗_,xi+b^∗<0,

= sign(hw^∗, xi+b^∗)

(8)

with

• w^∗=Pn

i=1α^∗_ix_iy_i,

• b^∗such thaty_i(hw^∗, x_ii+b^∗) = 1∀x_i, 0< α^∗_i < C.

The maximal margin equalsγ^∗=_kw¹∗k = Pn

i=1(α^∗_i)²−1/2

. Theα^∗_i that do not correspond to support vectors are equal to0, hence

fˆ(x) =1^P

xisvy_iα^∗_ihxi,xi+b^∗≥0−1^P

xiscy_iα^∗_ihxi,xi+b^∗<0.

7 Non linear SVM and kernels

A training set is rarely linearly separable.

In this case, a linear SVM leads to bad performances and a high number of support vectors. We can make the classification procedure more flexible by enlarging the feature space and sending the entries{xi, i= 1. . . n}in an Hilbert spaceH, with high or possibly infinite dimension, via a functionφ, and we apply a linear SVM procedure on the new training set{(φ(x_i), y_i), i= 1. . . n}.

The spaceHis called thefeature space. This idea is due to Boser, Guyon, Vapnik (1992).

In the previous example, settingφ(x) = (x²₁, x²₂, x1, x2), the training set be- comes linearly separable inR⁴, and a linear SVM is appropriate.

7.1 The kernel trick

A natural question arises : how can we chooseHandφ? In fact, we do not chooseHandφbut akernel.

The classification rule is fˆ(x) =1^P_y_i_α^∗

ihφ(xi),φ(x)i+b^∗≥0−1^P_y_i_α^∗

ihφ(xi),φ(x)i+b^∗<0, where theα_i^∗’s are the solutions of the dual problem in the feature spaceH: Maximizingθ(α) =Pn

i=1αi−¹₂Pn

i,j=1αiαjyiyjhφ(xi), φ(xj)i s. t.Pn

i=1αiyi= 0and0≤αi≤C∀i.

It is important to notice that the final classification rule in the feature space depends onφonly through scalar products of the formhφ(xi), φ(x)ior hφ(xi), φ(xj)i.

The only knowledge of the functionkdefined by k(x, x⁰) = hφ(x), φ(x⁰)i allows to define the SVM in the feature spaceHand to derive a classification rule in the spaceX. The explicite computation ofφis not required.

DEFINITION9. — A function k : X × X → R such that k(x, x⁰) = hφ(x), φ(x⁰)ifor a given functionφ:X → His called akernel.

A kernel is generally more easy to compute than the functionφthat returns values in a high dimensional space. For example, forx = (x1, x2) ∈ R², φ(x) = (x²₁,√

2x1x2, x²₂), andk(x, x⁰) =hx, x⁰i².

Let us now give a property to ensure that a functionk:X × X →Rdefines a kernel.

PROPOSITION10. —Mercer conditionIf the functionk : X × X → Ris continuous, symmetric, and if for all finite subset{x1, . . . , xk}inX, the matrix (k(xi, xj))_1≤i,j≤kis positive definite :

∀c1, . . . , cn∈R,

k

X

i,j=1

cicjk(xi, xj)≥0,

then, there exists an Hilbert spaceHand a functionφ : X → Hsuch that k(x, x⁰) = hφ(x), φ(x⁰)i_H. The spaceHis called theReproducing kernel

(9)

Hilbert Space (RKHS) associated tok.

We have :

1. For allx∈ X,k(x, .)∈ Hwherek(x, .) :y7→k(x, y).

2. Reproducing property:

h(x) =hh, k(x, .)i_Hfor allx∈ Xandh∈ H.

Let us give some examples. The Mercer condition is often hard to verify but we know some classical examples of kernels that can be used. We assume that X =R^d.

• pdegree polynomial kernel :k(x, x⁰) = (1 +hx, x⁰i)^p

• Gaussian kernel (RBF):k(x, x⁰) =e⁻^{kx−x0 k}

2 2σ2

φreturns values in a infinite dimensional space.

• Laplacian kernel :k(x, x⁰) =e⁻^{kx−x0 k}^σ .

• Sigmoid kernel : k(x, x⁰) = tanh(κhx, x⁰i+θ) (this kernel is not positive definite).

By way of example, let us precise the RKHS associated with the Gaussian kernel.

PROPOSITION11. — For any functionh∈L¹(R^d)∩L²(R^d)andω ∈R^d, we define the Fourier transform

F[f](ω) = 1 (2π)^d/2

Z

R^d

f(t)e^−ihω,tidt.

For anyσ >0, the functional space Hσ={f ∈C0(R^d)∩L¹(R^d)such that

Z

R^d

|F[f](ω)|²e^σ|ω|²^/2dω <+∞}

endowed with the scalar product hf, gi_H_σ = (2πσ²)^−d/2

Z

R^d

F[f](ω)F[g](ω)e^σ|ω|²^/2dω,

is the RKHS associated with the Gaussian kernelk(x, x⁰) =e⁻

kx−x0 k2 2σ2 . Indeed, for allx∈R^d, the functionk(x, .)belongs toHσand we have

hh, k(x, .)iHσ =F⁻¹[F[h]](x) =h(x).

The RKHSH_σcontains very regular functions, and the normkhk_H_σ controls the smoothness of the function h. Whenσ increases, the functions of the RKHS become smoother. See A. Smola and B. Scholkopf [5] for more details on RKHS.

We have seen some examples of kernels. One can construct new kernels by aggregating several kernels. For example letk1andk2be two kernels andf a functionR^d → R,φ : R^d → R^d

0,B a positive definite matrix,P a polynomial with positive coefficients andλ >0.

The functions defined by k(x, x⁰) = k1(x, x⁰) +k2(x, x⁰), λk1(x, x⁰), k1(x, x⁰)k2(x, x⁰), f(x)f(x⁰), k1(φ(x), φ(x⁰)), x^TBx⁰, P(k1(x, x⁰)), or e^k¹^(x,x⁰⁾are still kernels.

We have presented examples of kernels for the case whereX = R^d but a very interesting property is that kernels can be defined for very general input spaces, such as sets, trees, graphs, texts, DNA sequences ...

7.2 Minimization of the convexified empirical risk

The ideal classification rule is the one which minimizes the riskL(f) = P(Y 6=f(X)), we have seen that the solution is the Bayes rulef^∗. A classical way in nonparametric estimation or classification problems is to replace the risk by the empirical risk and to minimize the empirical risk :

L_n(f) = 1 n

n

X

i=1

1_Y_i_6=f(X_i₎.

(10)

In order to avoid overfitting, the minimization is restricted to a setF: fˆ=argmin_f∈FL_n(f).

The risk offˆcan be decomposed in two terms : 0≤L( ˆf)−L(f^∗) = min

f∈FL(f)−L(f^∗) +L( ˆf)−min

f∈FL(f).

The first termmin_f∈FL(f)−L(f^∗)is the approximation error, or bias term, the second term L( ˆf)−min_f∈FL(f) is the stochastic error or variance term. Enlarging the classF reduces the approximation error but increases the stochastic error.

The empirical risk minimization classifier cannot be used in practice because of its computational cost, indeedL_nis not convex. This is the reason why we generally replace the empirical misclassification probabilityLn by some convex surrogate, and we consider convex classesF. We consider a loss function l, and we require the conditionl(z)≥1z<0, which will allow to give an upper bound for the misclassification probability; indeed

E(l(Y f(X)))≥E(1_{Y f}_(X)<0) =P(Y 6=f(X)).

Classical convex losseslare the hinge lossl(z) = (1−z)₊, the exponential lossl(z) = exp(−z), the logit lossl(z) = log₂(1 + exp(−z)).

Let us show that SVM are solutions of the minimization of the convexified and penalized empirical risk. For the sake of simplicity, we consider the linear case.

We first notice that the following optimization problem : Minimizing¹₂kwk²+CPn

i=1ξ_is. t.

yi(hw, xii+b)≥1−ξi∀i ξi ≥0

is equivalent to minimize 1

2kwk²+C

n

X

i=1

(1−yi(hw, xii+b))₊,

or equivalently

1 n

n

X

i

(1−y_i(hw, x_ii+b))₊+ 1 2Cnkwk².

γ(w, b, x_i, y_i) = (1−y_i(hw, x_ii+b))₊ is a convex upper bound of the empirical risk1_y_i_(hw,x_i_i+b)<0

Hence, SVM are solutions of the minimization of the convexified empirical risk with the hinge losslplus a penalty term. Indeed, SVM are solutions of

argmin_f∈F1 n

n

X

i=1

l(yif(xi)) +pen(f), where

F={hw, xi+b, w∈R^d, b∈R} and

∀f ∈ F,pen(f) = 1 2Cnkwk².

(11)

8 Agregation of classifiers with Adaboost

Adaboost (or adaptive boosting) is a boosting method introduced by Freund and Schapire [2] to combine several classifiersf1, . . . , fk, for example SVM obtained with different kernels or different penalty terms. The principle of the algorithm is to minimize the empirical risk for the exponential loss function over the linear spaceFgenerated by the classifiersf1, . . . , fk. The aim is to compute

fˆ= argmin

f∈span(f1,...,f_k)

{1 n

n

X

i=1

exp(−Y_if(X_i))}.

To approximate the solution, Adaboost computes a sequence of functionsfˆm

form= 0, . . . M with

fˆ0 = 0

fˆ_m = fˆ_m−1+β_mh_j_m

where(βm, jm)minimizes the empirical convexified risk argmin

β∈R,j=1,...,p

{1 n

n

X

i=1

exp(−Yi( ˆf_m−1(Xi) +βhj))}.

The final classification rule is given by fˆ=sign( ˆfM).

Exercise. — We denote

w^(m)_i = 1

nexp(−Yifˆ_m−1(Xi)) and we assume that for allj= 1, . . . , p,

errm(j) = Pn

i=1w^(m)_i 1_h_j_(X_i_)6=Y_i Pn

i=1w^(m)_i

∈]0,1[.

Prove that

j_m=argmin

j=1,...,p

err_m(j),

and

β_m=1 2log

1−err_m(j) errm(j)

.

This leads to the AdaBoost algorithm :

• w_i⁽¹⁾= 1/nfori= 1, . . . , n.

• Form= 1, . . . , M jm = argmin

j=1,...,p

errm(j)

βm = 1

2log

1−errm(j) err_m(j)

w^(m+1)_i = w_i^(m)exp(2βm1_f_jm_(X_i_)6=Y_i−βm)fori= 1, . . . , n

• fˆM(x) =PM

m=1βmfj_m(x).

• fˆ=sign( ˆf_M).

In the above computation, note that, sincefj_m(Xi)∈ {−1,1},Yifj_m(Xi) = 21f_jm(X_i)6=Yi−1.

9 Regression and kernels

Although the framework of the chapter is classification, let us mention that kernel methods can also be used for regression function estimation. We present here the Kernel Regression Least Square procedure. It is based on a penalized least square criterion. Let(Xi, Yi)_1≤i≤n the observations, withXi ∈ R^p, Yi∈R. We consider a positive definite kernelkdefined onR^p:

k(x,y) =k(y,x);

n

X

i,j=1

c_ic_jk(X_i,X_j)≥0.

We are looking for a predictor of the form f(x) =

n

X

i=1

c_jk(X_j,x), c∈Rⁿ.

(12)

Let us denote byKthe matrix defined by Ki,j = k(Xi,Xj). The KRLS method consists in minimizing forf on the form defined above the penalized least square criterion

n

X

i=1

(Y_i−f(X_i))²+λkfk²_K,

where

kfk²_K =

n

X

i,j=1

c_ic_jk(X_i,X_j).

Equivalently, we minimize forc∈Rⁿthe criterion kY −Kck²+λc⁰Kc.

There exists an explicit solution

ˆc= (K+λIn)⁻¹Y,

which leads to the predictor fˆ(x) =

n

X

j=1

ˆ

cjk(Xj,x).

Yˆ =Kˆc.

With a kernel corresponding to the scalar product, we recover a linear predictor K=XX⁰,ˆc= (XX⁰+λI_n)⁻¹Y,

fˆ(x) =

n

X

j=1

ˆ

cjhXj,xi.

For polynomial or Gaussian kernels for example, we obtain non linear predictors. As for SVM, an important interest of this method is the possibility to be generalized to complex predictors such as text, graphs, DNA sequences .. as soon as one can define a kernel function on such objects.

10 Conclusion

• Using kernels allows to delinearize classification algorithms by mapping X in the RKHS H with the map x 7→ k(x, .). It provides nonlinear algorithms with almost the same computational properties as linear ones.

• SVM have nice theoretical properties, cf. Vapnik’s theory for empirical risk minimization [6].

• The use of RKHS allows to apply to any setX(such as set of graphs, texts, DNA sequences ..) algorithms that are defined for vectors as soon as we can define a kernelk(x, y)corresponding to some measure of similarity between two objects ofX.

• Important issues concern the choice of the kernel, and of the tuning parameters to define the SVM procedure.

• Note that SVM can also be used for multi-class classification problems for example, one can built a SVM classifier for each pair of classes and predict the class for a new point by a majority vote.

• Kernels are also used for regression as mentioned above or for non supervised classification (kernel PCA).

References

[1] N. Cristianini and J. Shawe-Taylor. An introduction to Support Vector Machines. Cambridge University Press, 2000.

[2] Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm.

InMachine Learning: proceedings of the Thirteenth International Confer- ence, pages 148–156. Morgan Kaufman, 1996. San Francisco.

[3] C. Giraud. Introduction to high-dimensional statistics, volume 139 of Monographs on Statistics and Applied Probability. CRC Press, Boca Ra- ton, FL, 2015.

[4] T. Hastie, R. Tibshirani, and J Friedman.The elements of statistical learning : data mining, inference, and prediction. Springer, 2009. Second edition.

(13)

[5] B. Scholkopf and A. Smola. Learning with Kernels Support Vector Ma- chines, Regularization, Optimization and Beyond. MIT Press, 2002.

[6] V.N. Vapnik.Statistical learning theory. Wiley Inter science, 1999.