Linear and nonlinear methods for classification

Support Vector Machines

1 Introduction

In this chapter, we consider supervised classification problems. We have a training data set with $n$ observation points (or objects) $X_i$ and their class (or label) $Y_i$. For example, the MNIST data set is a database of handwritten digits, where the objects $X_i$ are images and $Y_i \in \{0,1,\dots,9\}$. Many other examples can be considered, such as the recognition of an object in an image, the detection of spam emails, or the presence of some illness for patients (the observation points may be gene expression data). We will first introduce the notion of best classifier, also called the Bayes classifier, and we will then present linear methods for classification. The core of the chapter is devoted to Support Vector Machines, which are very powerful nonlinear methods for classification.

The main references for this course are the following books:

• An Introduction to Support Vector Machines by N. Cristianini and J. Shawe-Taylor [1]

• Introduction to High-Dimensional Statistics by C. Giraud [3]

• The Elements of Statistical Learning by T. Hastie et al. [4]

• Learning with Kernels by A. Smola and B. Scholkopf [5]

• Statistical Learning Theory by V. Vapnik [6]

• M2 courses by M. Fromont-Renoir, Université de Rennes 2: https://perso.univ-rennes2.fr/magalie.fromont

2 Optimal rules for classification

Suppose that $d_n$ corresponds to the observation of an $n$-sample $D_n = \{(X_1,Y_1),\dots,(X_n,Y_n)\}$ with joint unknown distribution $P$ on $\mathcal{X}\times\mathcal{Y}$, and that $x$ is one observation of the variable $X$, where $(X,Y)$ has joint distribution $P$ and is independent of $D_n$. Since we consider a classification problem, $\mathcal{Y}$ is a finite set. The sample $D_n$ is called the learning sample.

A classification rule is a measurable function $f:\mathcal{X}\to\mathcal{Y}$ that associates the output $f(x)$ to the input $x\in\mathcal{X}$.

In order to quantify the quality of the prediction, we introduce a loss function.

DEFINITION 1. — A measurable function $l:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}_+$ is a loss function if $l(y,y)=0$ and $l(y,y')>0$ for $y\neq y'$.

If $f$ is a classification rule, $x$ an input, and $y$ the output that is really associated to the input $x$, then $l(y,f(x))$ quantifies the loss when we associate to the input $x$ the predicted output $f(x)$.

In a regression framework where $\mathcal{Y}=\mathbb{R}$, we consider the $L^p$ loss ($p\ge 1$): $l(y,y')=|y-y'|^p$. If $p=2$, this is the quadratic loss.

For binary classification, $\mathcal{Y}=\{-1,1\}$ and
$$l(y,y') = \mathbf{1}_{y\neq y'} = \frac{|y-y'|}{2} = \frac{(y-y')^2}{4}.$$

We consider the expectation of this loss, which leads to the definition of the risk.

DEFINITION 2. — Given a loss function $l$, the risk (or generalisation error) of a prediction rule $f$ is defined by
$$R_P(f) = \mathbb{E}_{(X,Y)\sim P}[l(Y, f(X))].$$


It is important to note that, in the above definition, $(X,Y)$ is independent of the training sample $D_n$ that was used to build the prediction rule $f$.

Let $\mathcal{F}$ denote the set of all possible prediction rules. We say that $f^*$ is an optimal rule if
$$R_P(f^*) = \inf_{f\in\mathcal{F}} R_P(f).$$

A natural question arises: is it possible to build optimal rules? This is indeed the case. We focus here on the classification framework. We define the Bayes rule and show that it is an optimal rule for classification.

DEFINITION 3. — We call Bayes rule any measurable function $f^*\in\mathcal{F}$ such that for all $x\in\mathcal{X}$,
$$P(Y = f^*(x)\mid X=x) = \max_{y\in\mathcal{Y}} P(Y=y\mid X=x).$$

THEOREM 4. — If $f^*$ is a Bayes rule, then $R_P(f^*) = \inf_{f\in\mathcal{F}} R_P(f)$.

The definition of a Bayes rule depends on the knowledge of the distribution $P$ of $(X,Y)$. In practice, we have a training sample $D_n=\{(X_1,Y_1),\dots,(X_n,Y_n)\}$ with joint unknown distribution $P$, and we construct a classification rule. The aim is to find a "good" classification rule, in the sense that its risk is close to the optimal risk of a Bayes rule.

Exercise. — Prove Theorem 4.
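As a small numerical illustration, consider a binary problem where $P$ is assumed known: priors $\pi_{\pm 1}$ and one-dimensional Gaussian class-conditional densities (the numerical values below are arbitrary choices). The Bayes rule picks the class maximizing $\pi_y g_y(x)$, and its risk can be estimated by Monte Carlo; this is only a sketch, not part of the course material.

```python
import numpy as np
from scipy.stats import norm

# Illustrative model (assumed for this sketch): P(Y=1)=0.3, X|Y=1 ~ N(2,1), X|Y=-1 ~ N(0,1).
pi = {1: 0.3, -1: 0.7}
g = {1: norm(loc=2.0, scale=1.0), -1: norm(loc=0.0, scale=1.0)}

def bayes_rule(x):
    """Bayes rule: pick the class y maximizing P(Y=y|X=x), proportional to pi_y * g_y(x)."""
    return np.where(pi[1] * g[1].pdf(x) >= pi[-1] * g[-1].pdf(x), 1, -1)

# Monte Carlo estimate of the risk R_P(f*) = P(Y != f*(X)) for the 0-1 loss.
rng = np.random.default_rng(0)
n = 50_000
Y = rng.choice([1, -1], size=n, p=[pi[1], pi[-1]])
X = np.where(Y == 1, g[1].rvs(size=n, random_state=rng), g[-1].rvs(size=n, random_state=rng))
print("estimated Bayes risk:", np.mean(bayes_rule(X) != Y))
```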

3 Linear discriminant analysis

Let $(X,Y)$ have unknown distribution $P$ on $\mathcal{X}\times\mathcal{Y}$, where we assume that $\mathcal{X}=\mathbb{R}^p$ and $\mathcal{Y}=\{1,2,\dots,K\}$. We define
$$f_k(x) = P(Y=k\mid X=x).$$
A Bayes rule is defined by
$$f^*(x) = \operatorname{argmax}_{k\in\{1,2,\dots,K\}} f_k(x).$$

We assume that the distribution of $X$ has a density $f_X$ and the distribution of $X$ given $Y=k$ has a density $g_k$ with respect to the Lebesgue measure on $\mathbb{R}^p$, and we set $\pi_k = P(Y=k)$.

Exercise. — Prove that
$$f_X(x) = \sum_{l=1}^K g_l(x)\pi_l$$
and that
$$f_k(x) = \frac{g_k(x)\pi_k}{\sum_{l=1}^K g_l(x)\pi_l}.$$

If we assume that the distribution of $X$ given $Y=k$ is a multivariate normal distribution with mean $\mu_k$ and covariance matrix $\Sigma_k$, we have
$$g_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_k)'\Sigma_k^{-1}(x-\mu_k)\right).$$

For linear discriminant analysis, we furthermore assume that $\Sigma_k = \Sigma$ for all $k$. In this case we have
$$\log\frac{P(Y=k\mid X=x)}{P(Y=l\mid X=x)} = \log\frac{\pi_k}{\pi_l} + x'\Sigma^{-1}(\mu_k-\mu_l) - \frac{1}{2}(\mu_k+\mu_l)'\Sigma^{-1}(\mu_k-\mu_l).$$

Hence the decision boundary between the class $k$ and the class $l$, $\{x : P(Y=k\mid X=x) = P(Y=l\mid X=x)\}$, is linear. We can write
$$\log P(Y=k\mid X=x) = C(x) + \delta_k(x),$$
where $C(x)$ does not depend on the class $k$, and
$$\delta_k(x) = x'\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k'\Sigma^{-1}\mu_k + \log(\pi_k).$$
The Bayes rule assigns $x$ to the class $f^*(x)$ which maximises $\delta_k(x)$.

We now want to build a decision rule from a training sample $D_n = \{(X_1,Y_1),\dots,(X_n,Y_n)\}$ which is close to the Bayes rule. For this purpose, we have to estimate, for all $k$, $\pi_k$, $\mu_k$ and the matrix $\Sigma$. We consider the following estimators:
$$\hat\pi_k = \frac{N_k}{n}, \qquad \hat\mu_k = \frac{\sum_{i=1}^n X_i\,\mathbf{1}_{Y_i=k}}{N_k},$$
where $N_k = \sum_{i=1}^n \mathbf{1}_{Y_i=k}$. We estimate $\Sigma$ by
$$\hat\Sigma = \sum_{k=1}^K \sum_{i=1}^n \frac{(X_i-\hat\mu_k)(X_i-\hat\mu_k)'\,\mathbf{1}_{Y_i=k}}{n-K}.$$

To conclude, Linear Discriminant Analysis assigns the input $x$ to the class $\hat f(x)$ which maximises $\hat\delta_k(x)$, where the unknown quantities in the expression of $\delta_k(x)$ have been replaced by their estimators.

Remark: if we no longer assume that the matrix $\Sigma_k$ is the same for all classes $k$, we obtain quadratic discriminant functions
$$\delta_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x-\mu_k)'\Sigma_k^{-1}(x-\mu_k) + \log(\pi_k).$$
This leads to quadratic discriminant analysis.
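The plug-in construction above can be written directly in a few lines. The following sketch (NumPy, on synthetic two-class data chosen purely for illustration) computes $\hat\pi_k$, $\hat\mu_k$, the pooled $\hat\Sigma$ and the scores $\hat\delta_k$; it is an illustrative implementation, not the course's own code.

```python
import numpy as np

def lda_fit(X, y):
    """Estimate pi_k, mu_k and the pooled covariance Sigma as in Section 3."""
    classes = np.unique(y)
    n, p = X.shape
    pi = {k: np.mean(y == k) for k in classes}                 # pi_k = N_k / n
    mu = {k: X[y == k].mean(axis=0) for k in classes}          # mu_k
    Sigma = sum((X[y == k] - mu[k]).T @ (X[y == k] - mu[k]) for k in classes)
    Sigma = Sigma / (n - len(classes))                         # pooled estimate
    return pi, mu, np.linalg.inv(Sigma)

def lda_predict(X, pi, mu, Sigma_inv):
    """Assign x to the class maximizing delta_k(x) = x'S^-1 mu_k - mu_k'S^-1 mu_k / 2 + log pi_k."""
    classes = list(pi)
    scores = np.column_stack([
        X @ Sigma_inv @ mu[k] - 0.5 * mu[k] @ Sigma_inv @ mu[k] + np.log(pi[k])
        for k in classes])
    return np.array(classes)[np.argmax(scores, axis=1)]

# Illustrative data: two Gaussian classes sharing the same covariance.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)), rng.normal([2, 2], 1.0, (100, 2))])
y = np.array([1] * 100 + [2] * 100)
pi, mu, Sinv = lda_fit(X, y)
print("training accuracy:", np.mean(lda_predict(X, pi, mu, Sinv) == y))
```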

4 Logistic regression

Noting that the Bayes classifier only depends on the conditional distribution of $Y$ given $X$, we can avoid modelling the distribution of $X$ as previously. We assume that $\mathcal{X}=\mathbb{R}^d$. One of the most popular models for binary classification when $\mathcal{Y}=\{-1,1\}$ is the logistic regression model, for which it is assumed that
$$P(Y=1\mid X=x) = \frac{\exp(\alpha+\langle\beta, x\rangle)}{1+\exp(\alpha+\langle\beta, x\rangle)} \quad\text{for all } x\in\mathcal{X},$$
with $\alpha\in\mathbb{R}$ and $\beta\in\mathbb{R}^d$.

Exercise. — Compute the Bayes classifier $f^*$ for this model and determine the boundary between $f^*=1$ and $f^*=-1$.

We can estimate the parameters $(\alpha,\beta)$ by maximizing the conditional likelihood of $Y$ given $X$:
$$L(\alpha,\beta) = \prod_{i:\,Y_i=1} \frac{\exp(\alpha+\langle\beta, X_i\rangle)}{1+\exp(\alpha+\langle\beta, X_i\rangle)} \;\prod_{i:\,Y_i=-1} \frac{1}{1+\exp(\alpha+\langle\beta, X_i\rangle)}.$$

We then compute the logistic regression classifier:
$$\forall x\in\mathcal{X},\quad \hat f(x) = \operatorname{sign}(\hat\alpha+\langle\hat\beta, x\rangle).$$

In general, logistic regression provides a different classifier from LDA. If the distribution of $X$ is far from a Gaussian distribution, logistic regression outperforms LDA, as long as the logistic model remains valid.
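In practice the conditional likelihood is maximized numerically. The sketch below uses scikit-learn's LogisticRegression, which maximizes a penalized version of $L(\alpha,\beta)$ (a large $C$ makes the penalty negligible), on illustrative synthetic data; names and values are choices made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative data with labels in {-1, 1}.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.0, 1.0, (100, 3)), rng.normal(1.0, 1.0, (100, 3))])
y = np.array([-1] * 100 + [1] * 100)

# scikit-learn maximizes a penalized conditional likelihood; with a large C the
# penalty is negligible, so (alpha_hat, beta_hat) are close to the MLE of Section 4.
clf = LogisticRegression(C=1e6).fit(X, y)
alpha_hat, beta_hat = clf.intercept_[0], clf.coef_[0]

# f_hat(x) = sign(alpha_hat + <beta_hat, x>)
f_hat = np.sign(alpha_hat + X @ beta_hat)
print("agreement with clf.predict:", np.mean(f_hat == clf.predict(X)))
```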

5 Linear Support Vector Machine

5.1 Linearly separable training set

We assume that $\mathcal{X}=\mathbb{R}^d$, endowed with the usual scalar product $\langle\cdot,\cdot\rangle$, and that $\mathcal{Y}=\{-1,1\}$.

DEFINITION 5. — The training set $d_n = (x_1,y_1),\dots,(x_n,y_n)$ is called linearly separable if there exists $(w,b)$ such that for all $i$, $y_i = 1$ if $\langle w, x_i\rangle + b > 0$ and $y_i = -1$ if $\langle w, x_i\rangle + b < 0$, which means that
$$\forall i,\quad y_i(\langle w, x_i\rangle + b) > 0.$$
The equation $\langle w, x\rangle + b = 0$ defines a separating hyperplane orthogonal to the vector $w$.

The function $f_{w,b}(x) = \mathbf{1}_{\langle w, x\rangle + b\ge 0} - \mathbf{1}_{\langle w, x\rangle + b<0}$ defines a possible linear classification rule.

The problem is that there exist infinitely many separating hyperplanes, and therefore infinitely many classification rules.

Which one should we choose? The answer is given by Vapnik [6]: the classification rule with the best generalization properties corresponds to the separating hyperplane maximizing the margin $\gamma$ between the two classes on the training set.

If we consider two entries of the training set that are on the border defining the margin, and that we call $x_1$ and $x_{-1}$, with respective outputs $1$ and $-1$, the separating hyperplane is located at half-distance between $x_1$ and $x_{-1}$. The margin is therefore equal to half the distance between $x_1$ and $x_{-1}$ projected onto the normal vector of the separating hyperplane:
$$\gamma = \frac{1}{2}\,\frac{\langle w, x_1 - x_{-1}\rangle}{\|w\|}.$$

Let us notice that for all $\kappa\neq 0$, the pairs $(\kappa w, \kappa b)$ and $(w,b)$ define the same hyperplane.

DEFINITION 6. — The hyperplane $\langle w, x\rangle + b = 0$ is canonical with respect to the set of vectors $x_1,\dots,x_k$ if
$$\min_{i=1,\dots,k} |\langle w, x_i\rangle + b| = 1.$$
The separating hyperplane has the canonical form relative to the vectors $\{x_1, x_{-1}\}$ if it is defined by $(w,b)$ where $\langle w, x_1\rangle + b = 1$ and $\langle w, x_{-1}\rangle + b = -1$. In this case, we have $\langle w, x_1 - x_{-1}\rangle = 2$, hence
$$\gamma = \frac{1}{\|w\|}.$$

5.2 A convex optimisation problem

Finding the separating hyperplane with maximal margin consists in finding $(w,b)$ such that $\|w\|^2$ (or $\frac{1}{2}\|w\|^2$) is minimal under the constraints
$$y_i(\langle w, x_i\rangle + b) \ge 1 \quad\text{for all } i.$$
This leads to a convex optimization problem with linear constraints, hence there exists a unique global minimizer.

Primal optimization problem. Let us recall some definitions. We want to minimize $h(u)$ for $u\in\mathbb{R}^2$ under the constraints $g_i(u)\ge 0$ for $i=1,\dots,n$, where $h$ is a quadratic function and the $g_i$ are affine functions.

The Lagrangian of the optimization problem is the function defined on $\mathbb{R}^2\times\mathbb{R}^n$ by
$$L(u,\alpha) = h(u) - \sum_{i=1}^n \alpha_i g_i(u).$$
The variables $\alpha_i$ are called the dual variables.

For all $\alpha\in\mathbb{R}^n$, $u_\alpha$ is the value of $u$ minimizing $L(u,\alpha)$. The dual function is defined by
$$\theta(\alpha) = L(u_\alpha,\alpha) = \min_{u\in\mathbb{R}^2} L(u,\alpha).$$

Dual optimization problem. Maximize $\theta(\alpha) = L(u_\alpha,\alpha) = \min_{u\in\mathbb{R}^2} L(u,\alpha)$ under the constraints $\alpha_i\ge 0$ for $i=1,\dots,n$.

The solution $\alpha^*$ of the dual problem gives the solution of the primal problem:
$$u^* = u_{\alpha^*}.$$

Karush-Kuhn-Tucker conditions for the dual problem:

• $\alpha_i \ge 0$ for all $i=1,\dots,n$;

• $g_i(u_\alpha) \ge 0$ for all $i=1,\dots,n$;

• $\alpha_i\, g_i(u_\alpha) = 0$ for all $i=1,\dots,n$ (complementary condition).

Back to the dual problem: we have to minimize $L(u,\alpha) = h(u) - \sum_{i=1}^n \alpha_i g_i(u)$ with respect to $u$ and to maximize $L(u_\alpha,\alpha)$ with respect to the dual variables $\alpha_i$. Note that if $g_i(u_\alpha) > 0$, then $\alpha_i = 0$: this is the complementary Karush-Kuhn-Tucker condition $\alpha_i\, g_i(u_\alpha) = 0$.

Let us come back to the linear Support Vector Machine optimization problem. The primal problem to solve is: minimize $\frac{1}{2}\|w\|^2$ subject to $y_i(\langle w, x_i\rangle + b)\ge 1$ for all $i$.

Lagrangian:
$$L(w,b,\alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i\big(y_i(\langle w, x_i\rangle + b) - 1\big).$$

Dual function: setting the partial derivatives to zero gives
$$\frac{\partial L}{\partial w}(w,b,\alpha) = w - \sum_{i=1}^n \alpha_i y_i x_i = 0 \;\Leftrightarrow\; w = \sum_{i=1}^n \alpha_i y_i x_i,$$
$$\frac{\partial L}{\partial b}(w,b,\alpha) = -\sum_{i=1}^n \alpha_i y_i = 0 \;\Leftrightarrow\; \sum_{i=1}^n \alpha_i y_i = 0,$$
and therefore
$$\theta(\alpha) = \frac{1}{2}\sum_{i,j=1}^n \alpha_i\alpha_j y_i y_j\langle x_i, x_j\rangle + \sum_{i=1}^n \alpha_i - \sum_{i,j=1}^n \alpha_i\alpha_j y_i y_j\langle x_i, x_j\rangle = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i\alpha_j y_i y_j\langle x_i, x_j\rangle.$$

The corresponding dual problem is: maximize
$$\theta(\alpha) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i\alpha_j y_i y_j\langle x_i, x_j\rangle$$
under the constraints $\sum_{i=1}^n \alpha_i y_i = 0$ and $\alpha_i\ge 0$ for all $i$.

The solution $\alpha^*$ of the dual problem can be obtained with classical optimization software.

Remark: the size of the dual problem does not depend on the dimension $d$ but on the sample size $n$; hence, when $\mathcal{X}$ is high dimensional, linear SVMs do not suffer from the curse of dimensionality.
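As an illustration, the dual can be handed to a generic solver. The sketch below uses scipy.optimize.minimize (SLSQP) on a small separable toy sample and then recovers $w$ and $b$ as in the next subsection; the data, threshold and solver choice are arbitrary, and a dedicated quadratic programming solver would be preferable in practice.

```python
import numpy as np
from scipy.optimize import minimize

# Small linearly separable toy sample (illustrative).
rng = np.random.default_rng(3)
X = np.vstack([rng.normal([2, 2], 0.5, (20, 2)), rng.normal([-2, -2], 0.5, (20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
n = len(y)

G = (y[:, None] * X) @ (y[:, None] * X).T          # G_ij = y_i y_j <x_i, x_j>

def neg_theta(a):                                   # we minimize -theta(alpha)
    return 0.5 * a @ G @ a - a.sum()

cons = [{"type": "eq", "fun": lambda a: a @ y}]     # sum_i alpha_i y_i = 0
res = minimize(neg_theta, np.zeros(n), jac=lambda a: G @ a - 1.0,
               bounds=[(0.0, None)] * n, constraints=cons, method="SLSQP")
alpha = res.x

# Recover the primal solution: w = sum_i alpha_i y_i x_i and the offset b.
w = (alpha * y) @ X
b = -0.5 * ((X[y == 1] @ w).min() + (X[y == -1] @ w).max())
print("margin 1/||w|| =", 1.0 / np.linalg.norm(w))
print("support vectors:", np.flatnonzero(alpha > 1e-6))
```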

5.3 Support vectors

For our optimization problem, the Karush-Kuhn-Tucker conditions are:

• $\alpha_i \ge 0$ for all $i=1,\dots,n$;

• $y_i(\langle w, x_i\rangle + b) \ge 1$ for all $i=1,\dots,n$;

• $\alpha_i\big(y_i(\langle w, x_i\rangle + b) - 1\big) = 0$ for all $i=1,\dots,n$ (complementary condition).

Only the $\alpha_i > 0$ are involved in the solution of the optimization problem. If the number of values $\alpha_i > 0$ is small, the solution of the dual problem is called "sparse".

DEFINITION 7. — The $x_i$ such that $\alpha_i > 0$ are called the support vectors. They are located on the border defining the maximal margin, namely $y_i(\langle w, x_i\rangle + b) = 1$ (cf. the complementary KKT condition).

We finally obtain the following classification rule:
$$\hat f(x) = \mathbf{1}_{\langle w, x\rangle + b \ge 0} - \mathbf{1}_{\langle w, x\rangle + b < 0},$$
with

• $w = \sum_{i=1}^n \alpha_i y_i x_i$,

• $b = -\frac{1}{2}\big(\min_{y_i=1}\langle w, x_i\rangle + \max_{y_i=-1}\langle w, x_i\rangle\big)$.

The maximal margin equals $\gamma = \frac{1}{\|w\|} = \big(\sum_{i=1}^n \alpha_i\big)^{-1/2}$.

The $\alpha_i$ that do not correspond to support vectors (sv) are equal to $0$, and therefore
$$\hat f(x) = \mathbf{1}_{\sum_{x_i\,\mathrm{sv}} y_i\alpha_i\langle x_i, x\rangle + b \ge 0} - \mathbf{1}_{\sum_{x_i\,\mathrm{sv}} y_i\alpha_i\langle x_i, x\rangle + b < 0}.$$

6 Linear SVM in the non separable case

The previous method cannot be applied when the training set is not linearly separable. Moreover, the method is very sensitive to outliers.

6.1 Flexible margin

In the general case, we allow some points to be inside the margin and even on the wrong side of it. We introduce the slack variables $\xi = (\xi_1,\dots,\xi_n)$, and the constraint $y_i(\langle w, x_i\rangle + b)\ge 1$ becomes $y_i(\langle w, x_i\rangle + b)\ge 1-\xi_i$, with $\xi_i\ge 0$.

• If $\xi_i\in[0,1]$, the point is well classified but lies in the region defined by the margin.

• If $\xi_i > 1$, the point is misclassified.

The margin is then called a flexible margin.

6.2 Optimization problem with relaxed constraints

In order to avoid margins that are too large, we penalize large values of the slack variables $\xi_i$.

The primal optimization problem is formalized as follows: minimize with respect to $(w,b,\xi)$
$$\frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i$$
subject to $y_i(\langle w, x_i\rangle + b)\ge 1-\xi_i$ and $\xi_i\ge 0$ for all $i$.

Remarks:

• $C>0$ is a tuning parameter of the SVM algorithm; it determines the tolerance to misclassifications. If $C$ increases, the number of misclassified points decreases, and if $C$ decreases, the number of misclassified points increases. $C$ is generally calibrated by cross-validation (a sketch is given after these remarks).

• One can also minimize $\frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i^k$ for $k=2,3,\dots$; we still have a convex optimization problem. The choice $\sum_{i=1}^n \mathbf{1}_{\xi_i>1}$ (the number of errors) instead of $\sum_{i=1}^n \xi_i^k$ would lead to a non-convex optimization problem.
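Here is the cross-validation sketch announced in the first remark, using scikit-learn; the grid of $C$ values and the synthetic data are arbitrary choices for the example.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Illustrative, non-separable toy data.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(1.5, 1.0, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)

# 5-fold cross-validation over a logarithmic grid of C values.
grid = GridSearchCV(SVC(kernel="linear"), {"C": np.logspace(-3, 3, 7)}, cv=5)
grid.fit(X, y)
print("selected C:", grid.best_params_["C"])
print("cross-validated accuracy:", grid.best_score_)
```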

The Lagrangian of this problem is
$$L(w,b,\xi,\alpha,\beta) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^n \xi_i(C-\alpha_i-\beta_i) + \sum_{i=1}^n \alpha_i - \sum_{i=1}^n \alpha_i y_i(\langle w, x_i\rangle + b),$$
with $\alpha_i\ge 0$ and $\beta_i\ge 0$.

Setting to zero the partial derivatives $\frac{\partial L}{\partial w}(w,b,\xi,\alpha,\beta)$, $\frac{\partial L}{\partial b}(w,b,\xi,\alpha,\beta)$ and $\frac{\partial L}{\partial \xi_i}(w,b,\xi,\alpha,\beta)$ leads to the following dual problem.

Dual problem: maximize
$$\theta(\alpha) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i\alpha_j y_i y_j\langle x_i, x_j\rangle$$
subject to $\sum_{i=1}^n \alpha_i y_i = 0$ and $0\le\alpha_i\le C$ for all $i$.

Karush-Kuhn-Tucker conditions:

• $0\le\alpha_i\le C$ for all $i=1,\dots,n$;

• $y_i(\langle w, x_i\rangle + b)\ge 1-\xi_i$ for all $i=1,\dots,n$;

• $\alpha_i\big(y_i(\langle w, x_i\rangle + b)+\xi_i-1\big) = 0$ for all $i=1,\dots,n$;

• $\xi_i(\alpha_i - C) = 0$ for all $i=1,\dots,n$.

6.3 Support vectors

We have the complementary Karush-Kuhn-Tucker conditions:
$$\alpha_i\big(y_i(\langle w, x_i\rangle + b)+\xi_i-1\big) = 0 \quad\text{and}\quad \xi_i(\alpha_i - C) = 0 \quad\text{for all } i=1,\dots,n.$$

DEFINITION 8. — The points $x_i$ such that $\alpha_i > 0$ are the support vectors.

We have two types of support vectors:

• the support vectors for which the slack variables are equal to $0$: they are located on the border of the region defining the margin;

• the support vectors for which the slack variables are not equal to $0$: $\xi_i > 0$, and in this case $\alpha_i = C$.

For the vectors that are not support vectors, we have $\alpha_i = 0$ and $\xi_i = 0$.

The classification rule is defined by
$$\hat f(x) = \mathbf{1}_{\langle w, x\rangle + b \ge 0} - \mathbf{1}_{\langle w, x\rangle + b < 0} = \operatorname{sign}(\langle w, x\rangle + b),$$
with

• $w = \sum_{i=1}^n \alpha_i y_i x_i$,

• $b$ such that $y_i(\langle w, x_i\rangle + b) = 1$ for all $x_i$ with $0 < \alpha_i < C$.

The margin equals $\gamma = \frac{1}{\|w\|}$. The $\alpha_i$ that do not correspond to support vectors are equal to $0$, hence
$$\hat f(x) = \mathbf{1}_{\sum_{x_i\,\mathrm{sv}} y_i\alpha_i\langle x_i, x\rangle + b \ge 0} - \mathbf{1}_{\sum_{x_i\,\mathrm{sv}} y_i\alpha_i\langle x_i, x\rangle + b < 0}.$$
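To connect these formulas with software output, the sketch below fits scikit-learn's SVC with a linear kernel: its attribute dual_coef_ stores the products $y_i\alpha_i$ for the support vectors, support_vectors_ the corresponding $x_i$, and intercept_ the offset $b$. The toy data are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(2.0, 1.0, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_[0] holds y_i * alpha_i for the support vectors only.
w = clf.dual_coef_[0] @ clf.support_vectors_          # w = sum_sv alpha_i y_i x_i
b = clf.intercept_[0]
print("number of support vectors:", len(clf.support_))
print("margin 1/||w|| =", 1.0 / np.linalg.norm(w))

# f_hat(x) = sign(sum_sv y_i alpha_i <x_i, x> + b)
x_new = np.array([1.0, 1.0])
print("prediction:", np.sign(clf.dual_coef_[0] @ (clf.support_vectors_ @ x_new) + b))
```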

7 Non linear SVM and kernels

A training set is rarely linearly separable.

In this case, a linear SVM leads to poor performance and a high number of support vectors. We can make the classification procedure more flexible by enlarging the feature space and sending the entries $\{x_i, i=1,\dots,n\}$ into a Hilbert space $\mathcal{H}$, of high or possibly infinite dimension, via a function $\phi$; we then apply a linear SVM procedure to the new training set $\{(\phi(x_i), y_i), i=1,\dots,n\}$.

The space $\mathcal{H}$ is called the feature space. This idea is due to Boser, Guyon and Vapnik (1992).

For example, if the two classes in $\mathbb{R}^2$ are separated by a circle, setting $\phi(x) = (x_1^2, x_2^2, x_1, x_2)$ makes the training set linearly separable in $\mathbb{R}^4$, and a linear SVM is appropriate.

7.1 The kernel trick

A natural question arises: how can we choose $\mathcal{H}$ and $\phi$? In fact, we do not choose $\mathcal{H}$ and $\phi$ but a kernel.

The classification rule is
$$\hat f(x) = \mathbf{1}_{\sum_i y_i\alpha_i\langle\phi(x_i),\phi(x)\rangle + b \ge 0} - \mathbf{1}_{\sum_i y_i\alpha_i\langle\phi(x_i),\phi(x)\rangle + b < 0},$$
where the $\alpha_i$'s are the solutions of the dual problem in the feature space $\mathcal{H}$: maximize
$$\theta(\alpha) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i\alpha_j y_i y_j\langle\phi(x_i),\phi(x_j)\rangle$$
under the constraints $\sum_{i=1}^n \alpha_i y_i = 0$ and $0\le\alpha_i\le C$ for all $i$.

It is important to notice that the final classification rule in the feature space depends on $\phi$ only through scalar products of the form $\langle\phi(x_i),\phi(x)\rangle$ or $\langle\phi(x_i),\phi(x_j)\rangle$.

Knowing only the function $k$ defined by $k(x,x') = \langle\phi(x),\phi(x')\rangle$ is enough to define the SVM in the feature space $\mathcal{H}$ and to derive a classification rule in the space $\mathcal{X}$. The explicit computation of $\phi$ is not required.

DEFINITION 9. — A function $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ such that $k(x,x') = \langle\phi(x),\phi(x')\rangle$ for a given function $\phi:\mathcal{X}\to\mathcal{H}$ is called a kernel.

A kernel is generally easier to compute than the function $\phi$, which takes values in a high-dimensional space. For example, for $x=(x_1,x_2)\in\mathbb{R}^2$, $\phi(x) = (x_1^2, \sqrt{2}\,x_1x_2, x_2^2)$ and $k(x,x') = \langle x, x'\rangle^2$.
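A quick numerical check of this identity, with the feature map written out explicitly (a small illustrative sketch):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the kernel k(x, x') = <x, x'>^2 on R^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(xp))        # <phi(x), phi(x')>
print((x @ xp) ** 2)           # k(x, x') = <x, x'>^2 -- same value: 1.0
```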

Let us now give a property ensuring that a function $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ defines a kernel.

PROPOSITION 10 (Mercer condition). — If the function $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is continuous, symmetric, and if for every finite subset $\{x_1,\dots,x_k\}$ of $\mathcal{X}$ the matrix $(k(x_i,x_j))_{1\le i,j\le k}$ is positive definite:
$$\forall c_1,\dots,c_k\in\mathbb{R},\quad \sum_{i,j=1}^k c_i c_j k(x_i,x_j) \ge 0,$$
then there exists a Hilbert space $\mathcal{H}$ and a function $\phi:\mathcal{X}\to\mathcal{H}$ such that $k(x,x') = \langle\phi(x),\phi(x')\rangle_{\mathcal{H}}$. The space $\mathcal{H}$ is called the Reproducing Kernel Hilbert Space (RKHS) associated to $k$.

We have:

1. For all $x\in\mathcal{X}$, $k(x,\cdot)\in\mathcal{H}$, where $k(x,\cdot): y\mapsto k(x,y)$.

2. Reproducing property: $h(x) = \langle h, k(x,\cdot)\rangle_{\mathcal{H}}$ for all $x\in\mathcal{X}$ and $h\in\mathcal{H}$.

Let us give some examples. The Mercer condition is often hard to verify, but we know some classical examples of kernels that can be used. We assume that $\mathcal{X}=\mathbb{R}^d$.

• $p$-degree polynomial kernel: $k(x,x') = (1+\langle x, x'\rangle)^p$.

• Gaussian kernel (RBF): $k(x,x') = e^{-\|x-x'\|^2/(2\sigma^2)}$; here $\phi$ takes values in an infinite-dimensional space (a usage sketch is given after this list).

• Laplacian kernel: $k(x,x') = e^{-\|x-x'\|/\sigma}$.

• Sigmoid kernel: $k(x,x') = \tanh(\kappa\langle x, x'\rangle + \theta)$ (this kernel is not positive definite).
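Here is the usage sketch announced above, with the Gaussian kernel in scikit-learn; note that SVC parametrizes it as $\exp(-\gamma\|x-x'\|^2)$, so $\gamma$ plays the role of $1/(2\sigma^2)$. The synthetic data (a ring around a blob) and the parameter values are arbitrary choices.

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative data that are not linearly separable: a ring around a blob.
rng = np.random.default_rng(6)
r = np.concatenate([rng.uniform(0.0, 1.0, 100), rng.uniform(2.0, 3.0, 100)])
t = rng.uniform(0.0, 2 * np.pi, 200)
X = np.column_stack([r * np.cos(t), r * np.sin(t)])
y = np.array([-1] * 100 + [1] * 100)

# Gaussian kernel k(x, x') = exp(-gamma * ||x - x'||^2), with gamma = 1 / (2 sigma^2).
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
print("support vectors per class:", clf.n_support_)
```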

By way of example, let us make explicit the RKHS associated with the Gaussian kernel.

PROPOSITION 11. — For any function $f\in L^1(\mathbb{R}^d)\cap L^2(\mathbb{R}^d)$ and $\omega\in\mathbb{R}^d$, we define the Fourier transform
$$F[f](\omega) = \frac{1}{(2\pi)^{d/2}}\int_{\mathbb{R}^d} f(t)\,e^{-i\langle\omega,t\rangle}\,dt.$$
For any $\sigma>0$, the functional space
$$\mathcal{H}_\sigma = \Big\{f\in C^0(\mathbb{R}^d)\cap L^1(\mathbb{R}^d)\ \text{such that}\ \int_{\mathbb{R}^d} |F[f](\omega)|^2\, e^{\sigma^2\|\omega\|^2/2}\,d\omega < +\infty\Big\},$$
endowed with the scalar product
$$\langle f, g\rangle_{\mathcal{H}_\sigma} = (2\pi\sigma^2)^{-d/2}\int_{\mathbb{R}^d} F[f](\omega)\,\overline{F[g](\omega)}\, e^{\sigma^2\|\omega\|^2/2}\,d\omega,$$
is the RKHS associated with the Gaussian kernel $k(x,x') = e^{-\|x-x'\|^2/(2\sigma^2)}$. Indeed, for all $x\in\mathbb{R}^d$, the function $k(x,\cdot)$ belongs to $\mathcal{H}_\sigma$ and we have
$$\langle h, k(x,\cdot)\rangle_{\mathcal{H}_\sigma} = F^{-1}[F[h]](x) = h(x).$$

The RKHS $\mathcal{H}_\sigma$ contains very regular functions, and the norm $\|h\|_{\mathcal{H}_\sigma}$ controls the smoothness of the function $h$. When $\sigma$ increases, the functions of the RKHS become smoother. See A. Smola and B. Scholkopf [5] for more details on RKHS.

We have seen some examples of kernels. One can construct new kernels by aggregating several kernels. For example, let $k_1$ and $k_2$ be two kernels, $f$ a function $\mathbb{R}^d\to\mathbb{R}$, $\phi:\mathbb{R}^d\to\mathbb{R}^{d'}$, $B$ a positive definite matrix, $P$ a polynomial with positive coefficients and $\lambda>0$.

The functions defined by $k(x,x') = k_1(x,x')+k_2(x,x')$, $\lambda k_1(x,x')$, $k_1(x,x')k_2(x,x')$, $f(x)f(x')$, $k_1(\phi(x),\phi(x'))$, $x^TBx'$, $P(k_1(x,x'))$, or $e^{k_1(x,x')}$ are still kernels.
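As an illustration of these combination rules, a custom Gram-matrix function can be passed to scikit-learn's SVC; the particular combination and parameter values below are arbitrary choices made for the sketch.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel

def combined_kernel(A, B):
    """Sum of a polynomial kernel and a scaled Gaussian kernel (still a kernel)."""
    return polynomial_kernel(A, B, degree=2, coef0=1.0) + 0.5 * rbf_kernel(A, B, gamma=1.0)

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# SVC accepts a callable returning the Gram matrix K[i, j] = k(A[i], B[j]).
clf = SVC(kernel=combined_kernel, C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```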

We have presented examples of kernels for the case where $\mathcal{X}=\mathbb{R}^d$, but a very interesting property is that kernels can be defined for very general input spaces, such as sets, trees, graphs, texts, or DNA sequences.

7.2 Minimization of the convexified empirical risk

The ideal classification rule is the one which minimizes the risk $L(f) = P(Y\neq f(X))$; we have seen that the solution is the Bayes rule $f^*$. A classical approach in nonparametric estimation or classification problems is to replace the risk by the empirical risk and to minimize the latter:
$$L_n(f) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{Y_i\neq f(X_i)}.$$


In order to avoid overfitting, the minimization is restricted to a set $\mathcal{F}$:
$$\hat f = \operatorname{argmin}_{f\in\mathcal{F}} L_n(f).$$
The risk of $\hat f$ can be decomposed into two terms:
$$0 \le L(\hat f) - L(f^*) = \Big(\min_{f\in\mathcal{F}} L(f) - L(f^*)\Big) + \Big(L(\hat f) - \min_{f\in\mathcal{F}} L(f)\Big).$$
The first term $\min_{f\in\mathcal{F}} L(f) - L(f^*)$ is the approximation error, or bias term; the second term $L(\hat f) - \min_{f\in\mathcal{F}} L(f)$ is the stochastic error, or variance term. Enlarging the class $\mathcal{F}$ reduces the approximation error but increases the stochastic error.

The empirical risk minimization classifier cannot be used in practice because of its computational cost: indeed, $L_n$ is not convex. This is the reason why we generally replace the empirical misclassification probability $L_n$ by some convex surrogate, and we consider convex classes $\mathcal{F}$. We consider a loss function $l$ and require the condition $l(z) \ge \mathbf{1}_{z<0}$, which allows us to upper bound the misclassification probability; indeed,
$$\mathbb{E}(l(Yf(X))) \ge \mathbb{E}(\mathbf{1}_{Yf(X)<0}) = P(Y\neq f(X)).$$
Classical convex losses $l$ are the hinge loss $l(z)=(1-z)_+$, the exponential loss $l(z)=\exp(-z)$, and the logit loss $l(z)=\log_2(1+\exp(-z))$.

Let us show that SVMs are solutions of the minimization of the convexified and penalized empirical risk. For the sake of simplicity, we consider the linear case.

We first notice that the following optimization problem: minimize $\frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i$ subject to $y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all $i$, is equivalent to minimizing
$$\frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \big(1 - y_i(\langle w, x_i\rangle + b)\big)_+,$$
or, equivalently,
$$\frac{1}{n}\sum_{i=1}^n \big(1 - y_i(\langle w, x_i\rangle + b)\big)_+ + \frac{1}{2Cn}\|w\|^2.$$
Here $\gamma(w,b,x_i,y_i) = (1 - y_i(\langle w, x_i\rangle + b))_+$ is a convex upper bound of the misclassification indicator $\mathbf{1}_{y_i(\langle w, x_i\rangle + b)<0}$.

Hence, SVMs are solutions of the minimization of the convexified empirical risk with the hinge loss $l$ plus a penalty term. Indeed, SVMs are solutions of
$$\operatorname{argmin}_{f\in\mathcal{F}}\; \frac{1}{n}\sum_{i=1}^n l(y_i f(x_i)) + \operatorname{pen}(f),$$
where
$$\mathcal{F} = \{\langle w, x\rangle + b,\; w\in\mathbb{R}^d,\, b\in\mathbb{R}\}$$
and
$$\forall f\in\mathcal{F},\quad \operatorname{pen}(f) = \frac{1}{2Cn}\|w\|^2.$$
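This penalized formulation is the one optimized by stochastic-gradient solvers. In the sketch below, scikit-learn's SGDClassifier with the hinge loss and an $\ell_2$ penalty matches the criterion above when its parameter alpha is set to $1/(Cn)$, and its predictions are compared with those of SVC; the data and values are illustrative.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(8)
n = 200
X = np.vstack([rng.normal(0.0, 1.0, (n // 2, 2)), rng.normal(2.0, 1.0, (n // 2, 2))])
y = np.array([-1] * (n // 2) + [1] * (n // 2))

C = 1.0
# SGDClassifier minimizes (1/n) sum hinge(y_i f(x_i)) + alpha * ||w||^2 / 2,
# which matches the penalized criterion above with alpha = 1 / (C * n).
sgd = SGDClassifier(loss="hinge", penalty="l2", alpha=1.0 / (C * n),
                    max_iter=5000, tol=1e-6, random_state=0).fit(X, y)

# SVC solves the constrained form, equivalent to 0.5 ||w||^2 + C sum hinge.
svc = SVC(kernel="linear", C=C).fit(X, y)
print("agreement between the two solvers:",
      np.mean(sgd.predict(X) == svc.predict(X)))
```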


8 Aggregation of classifiers with Adaboost

Adaboost (or adaptive boosting) is a boosting method introduced by Freund and Schapire [2] to combine several classifiers $f_1,\dots,f_p$, for example SVMs obtained with different kernels or different penalty terms. The principle of the algorithm is to minimize the empirical risk for the exponential loss function over the linear space spanned by the classifiers $f_1,\dots,f_p$. The aim is to compute
$$\hat f = \operatorname{argmin}_{f\in\operatorname{span}(f_1,\dots,f_p)} \Big\{\frac{1}{n}\sum_{i=1}^n \exp(-Y_i f(X_i))\Big\}.$$

To approximate the solution, Adaboost computes a sequence of functions $\hat f_m$ for $m=0,\dots,M$ with
$$\hat f_0 = 0, \qquad \hat f_m = \hat f_{m-1} + \beta_m f_{j_m},$$
where $(\beta_m, j_m)$ minimizes the empirical convexified risk:
$$(\beta_m, j_m) = \operatorname{argmin}_{\beta\in\mathbb{R},\, j=1,\dots,p} \Big\{\frac{1}{n}\sum_{i=1}^n \exp\big(-Y_i(\hat f_{m-1}(X_i) + \beta f_j(X_i))\big)\Big\}.$$

The final classification rule is given by $\hat f = \operatorname{sign}(\hat f_M)$.

Exercise. — We denote
$$w_i^{(m)} = \frac{1}{n}\exp(-Y_i\hat f_{m-1}(X_i)),$$
and we assume that for all $j=1,\dots,p$,
$$\operatorname{err}_m(j) = \frac{\sum_{i=1}^n w_i^{(m)}\mathbf{1}_{f_j(X_i)\neq Y_i}}{\sum_{i=1}^n w_i^{(m)}} \in\ ]0,1[.$$
Prove that
$$j_m = \operatorname{argmin}_{j=1,\dots,p}\operatorname{err}_m(j)$$
and
$$\beta_m = \frac{1}{2}\log\frac{1-\operatorname{err}_m(j_m)}{\operatorname{err}_m(j_m)}.$$

This leads to the AdaBoost algorithm:

• $w_i^{(1)} = 1/n$ for $i=1,\dots,n$.

• For $m=1,\dots,M$:
$$j_m = \operatorname{argmin}_{j=1,\dots,p}\operatorname{err}_m(j), \qquad \beta_m = \frac{1}{2}\log\frac{1-\operatorname{err}_m(j_m)}{\operatorname{err}_m(j_m)},$$
$$w_i^{(m+1)} = w_i^{(m)}\exp\big(2\beta_m\mathbf{1}_{f_{j_m}(X_i)\neq Y_i} - \beta_m\big) \quad\text{for } i=1,\dots,n.$$

• $\hat f_M(x) = \sum_{m=1}^M \beta_m f_{j_m}(x)$.

• $\hat f = \operatorname{sign}(\hat f_M)$.

In the above computation, note that, since $f_{j_m}(X_i)\in\{-1,1\}$, we have $-Y_i f_{j_m}(X_i) = 2\cdot\mathbf{1}_{f_{j_m}(X_i)\neq Y_i} - 1$.
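A compact NumPy sketch of this loop, with the dictionary of classifiers $f_1,\dots,f_p$ taken to be decision stumps; the stumps, the data and the number of iterations $M$ are arbitrary choices, and the sketch assumes $\operatorname{err}_m(j)\in\,]0,1[$ as in the exercise.

```python
import numpy as np

def adaboost(X, y, thresholds, M=20):
    """AdaBoost over decision stumps f_{j,t,s}(x) = s * sign(x_j - t), as in Section 8."""
    n, d = X.shape
    # Predictions of every stump on the sample, shape (p, n).
    preds = np.array([s * np.sign(X[:, j] - t)
                      for j in range(d) for t in thresholds for s in (1, -1)])
    preds[preds == 0] = 1                      # avoid 0 predictions on ties
    w = np.full(n, 1.0 / n)                    # w_i^(1) = 1/n
    chosen, betas = [], []
    for _ in range(M):
        err = (w * (preds != y)).sum(axis=1) / w.sum()      # err_m(j), assumed in (0, 1)
        j = int(np.argmin(err))                              # j_m
        beta = 0.5 * np.log((1 - err[j]) / err[j])           # beta_m
        w = w * np.exp(2 * beta * (preds[j] != y) - beta)    # weight update
        chosen.append(j)
        betas.append(beta)

    def f_hat(Xnew):
        new_preds = np.array([s * np.sign(Xnew[:, j] - t)
                              for j in range(Xnew.shape[1]) for t in thresholds for s in (1, -1)])
        new_preds[new_preds == 0] = 1
        return np.sign(sum(b * new_preds[j] for b, j in zip(betas, chosen)))
    return f_hat

# Illustrative data.
rng = np.random.default_rng(9)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(1.5, 1.0, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)
f_hat = adaboost(X, y, thresholds=np.linspace(-2, 3, 11))
print("training accuracy:", np.mean(f_hat(X) == y))
```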

9 Regression and kernels

Although the framework of the chapter is classification, let us mention that kernel methods can also be used for regression function estimation. We present here the Kernel Regression Least Squares (KRLS) procedure, which is based on a penalized least squares criterion. Let $(X_i,Y_i)_{1\le i\le n}$ be the observations, with $X_i\in\mathbb{R}^p$ and $Y_i\in\mathbb{R}$. We consider a positive definite kernel $k$ defined on $\mathbb{R}^p$:
$$k(x,y) = k(y,x); \qquad \sum_{i,j=1}^n c_i c_j k(X_i, X_j) \ge 0 \quad\text{for all } c\in\mathbb{R}^n.$$

We are looking for a predictor of the form
$$f(x) = \sum_{j=1}^n c_j k(X_j, x), \qquad c\in\mathbb{R}^n.$$


Let us denote by $K$ the matrix defined by $K_{i,j} = k(X_i, X_j)$. The KRLS method consists in minimizing, over $f$ of the form defined above, the penalized least squares criterion
$$\sum_{i=1}^n (Y_i - f(X_i))^2 + \lambda\|f\|_K^2,$$
where
$$\|f\|_K^2 = \sum_{i,j=1}^n c_i c_j k(X_i, X_j).$$

Equivalently, we minimize over $c\in\mathbb{R}^n$ the criterion $\|Y - Kc\|^2 + \lambda c' K c$.

There exists an explicit solution
$$\hat c = (K+\lambda I_n)^{-1}Y,$$
which leads to the predictor
$$\hat f(x) = \sum_{j=1}^n \hat c_j k(X_j, x), \qquad \hat Y = K\hat c.$$

With the kernel corresponding to the scalar product, we recover a linear predictor:
$$K = XX', \qquad \hat c = (XX'+\lambda I_n)^{-1}Y, \qquad \hat f(x) = \sum_{j=1}^n \hat c_j\langle X_j, x\rangle.$$

With polynomial or Gaussian kernels, for example, we obtain nonlinear predictors. As for SVMs, an important interest of this method is that it can be generalized to complex objects such as texts, graphs or DNA sequences, as soon as one can define a kernel function on such objects.
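A short NumPy sketch of the KRLS estimator $\hat c = (K+\lambda I_n)^{-1}Y$ with a Gaussian kernel; the data, $\sigma$ and $\lambda$ are arbitrary choices made for the illustration.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

def krls_fit(X, Y, lam=0.1, sigma=1.0):
    """Return c_hat = (K + lam I_n)^{-1} Y."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(Y)), Y)

def krls_predict(Xnew, X, c_hat, sigma=1.0):
    """f_hat(x) = sum_j c_hat_j k(X_j, x)."""
    return gaussian_kernel(Xnew, X, sigma) @ c_hat

# Illustrative 1-D regression problem: noisy sine curve.
rng = np.random.default_rng(10)
X = rng.uniform(0, 2 * np.pi, 80).reshape(-1, 1)
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)
c_hat = krls_fit(X, Y, lam=0.1, sigma=0.7)
Xgrid = np.linspace(0, 2 * np.pi, 5).reshape(-1, 1)
print(np.round(krls_predict(Xgrid, X, c_hat, sigma=0.7), 2))
```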

10 Conclusion

• Using kernels makes it possible to "delinearize" classification algorithms by mapping $\mathcal{X}$ into the RKHS $\mathcal{H}$ with the map $x\mapsto k(x,\cdot)$. It provides nonlinear algorithms with almost the same computational properties as linear ones.

• SVMs have nice theoretical properties, cf. Vapnik's theory for empirical risk minimization [6].

• The use of RKHS allows algorithms that are defined for vectors to be applied to any set $\mathcal{X}$ (such as sets of graphs, texts, or DNA sequences), as soon as we can define a kernel $k(x,y)$ corresponding to some measure of similarity between two objects of $\mathcal{X}$.

• Important issues concern the choice of the kernel and of the tuning parameters that define the SVM procedure.

• Note that SVMs can also be used for multi-class classification problems: for example, one can build an SVM classifier for each pair of classes and predict the class of a new point by a majority vote.

• Kernels are also used for regression, as mentioned above, or for unsupervised classification (kernel PCA).

References

[1] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

[2] Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pages 148–156. Morgan Kaufmann, San Francisco, 1996.

[3] C. Giraud. Introduction to High-Dimensional Statistics, volume 139 of Monographs on Statistics and Applied Probability. CRC Press, Boca Raton, FL, 2015.

[4] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, second edition, 2009.

[5] B. Scholkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, 2002.

[6] V.N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1999.
