
3.6 Support Vector Machines

3.6.1 Basics of Support Vector Machines

Let $X = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$, be a set of $d$-dimensional feature vectors corresponding to data points (e.g. recommended items).
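To make this notation concrete, the following minimal NumPy sketch builds such a labelled training set; the particular numbers and the array names X and y are invented purely for illustration and are not part of the text.

```python
import numpy as np

# n = 6 data points in d = 2 dimensions, e.g. feature vectors of recommended items
X = np.array([[0.5, 1.2],
              [1.0, 0.8],
              [1.3, 1.1],     # three "positive" items
              [-0.7, -1.0],
              [-1.2, -0.4],
              [-0.9, -1.3]])  # three "negative" items
y = np.array([1, 1, 1, -1, -1, -1])  # labels y_i in {-1, +1}
```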

3.6.1.1 Linearly Separable Data

In the simple case of separable training data, there exists a hyperplane decision function which separates the positive from the negative examples

$$f(x) = (w \cdot x) + b, \qquad (3.11)$$

with appropriate parameters $w$, $b$ [12, 37].

Let $d_{\mathrm{positive}}$ be the shortest distance from the separating hyperplane to the closest positive example and $d_{\mathrm{negative}}$ the corresponding shortest distance from the separating hyperplane to the closest negative example. The margin of the hyperplane is defined as $D = d_{\mathrm{positive}} + d_{\mathrm{negative}}$. Consequently, the optimal hyperplane is defined as the hyperplane with maximal margin. The optimal hyperplane separates the data without error. Then, if all the training data satisfy the above requirements, we have:

$$(w \cdot x_i) + b \geq 1 \quad \text{if } y_i = 1 \quad (\text{hyperplane } H_1)$$
$$(w \cdot x_i) + b \leq -1 \quad \text{if } y_i = -1 \quad (\text{hyperplane } H_2) \qquad (3.12)$$

The above inequalities can be combined in the following form:

$$y_i[(w \cdot x_i) + b] \geq 1, \quad i = 1, \ldots, n. \qquad (3.13)$$

Thereby, the optimal hyperplane can be found by solving the primal optimization problem:

$$\text{minimize } \Phi(w) = \frac{1}{2}\|w\|^2 = \frac{1}{2}(w^T \cdot w)$$
$$\text{subject to } y_i[(w \cdot x_i) + b] \geq 1, \quad i = 1, \ldots, n. \qquad (3.14)$$

In order to solve this optimization problem, we form the following Lagrange functional:

$$L(w, b, a) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} a_i \{ y_i[(w \cdot x_i) + b] - 1 \}, \qquad (3.15)$$

where the $a_i$ are Lagrange multipliers with $a_i \geq 0$. The Lagrangian $L(w, b, a)$ must be minimized with respect to $w$ and $b$, so that the saddle point $w_0$, $b_0$ and $a_i^0$ satisfy

$$\frac{\partial L(w_0, b_0, a^0)}{\partial b} = 0, \qquad (3.16)$$
$$\frac{\partial L(w_0, b_0, a^0)}{\partial w} = 0. \qquad (3.17)$$

From the above equations, we obtain

$$\sum_{i=1}^{n} a_i^0 y_i = 0, \qquad w_0 = \sum_{i=1}^{n} a_i^0 y_i x_i. \qquad (3.18)$$

Also, the solution must satisfy the Karush-Kuhn-Tucker conditions:

$$a_i^0 \{ y_i[(w_0 \cdot x_i) + b_0] - 1 \} = 0, \quad i = 1, \ldots, n. \qquad (3.19)$$

The Karush-Kuhn-Tucker conditions are satisfied at the solution of any constrained optimization problem, whether it is convex or not and for any kind of constraints, provided that the intersection of the set of feasible directions with the set of descent directions coincides with the intersection of the set of feasible directions for linearized constraints with the set of descent directions. The SVM problem is convex, and for convex problems the Karush-Kuhn-Tucker conditions are necessary and sufficient for $w_0$, $b_0$ and $a_i^0$ to be a solution. Thus, solving the SVM problem is equivalent to finding a solution to the Karush-Kuhn-Tucker conditions. Note that there is a Lagrange multiplier $a_i$ for every training point.

The data points for which the above conditions are satisfied and which can have nonzero coefficients $a_i^0$ in the expansion of Eq. 3.18 are called support vectors. As the support vectors are the closest data points to the decision surface, they determine the location of the optimal separating hyperplane (see Fig. 3.4).

Thereby, we can substitute Eq. 3.18 back into Eq. 3.15 and, taking into account the Karush-Kuhn-Tucker conditions, we obtain the functional

$$W(a) = \sum_{i=1}^{n} a_i - \frac{1}{2}\sum_{i,j=1}^{n} a_i a_j y_i y_j (x_i \cdot x_j). \qquad (3.20)$$

Consequently, we arrive at the so-called dual optimization problem:

$$\text{maximize } W(a) = \sum_{i=1}^{n} a_i - \frac{1}{2}\sum_{i,j=1}^{n} a_i a_j y_i y_j (x_i \cdot x_j)$$
$$\text{subject to } \sum_{i=1}^{n} a_i y_i = 0, \quad a_i \geq 0, \quad i = 1, \ldots, n. \qquad (3.21)$$

We can obtain the optimal hyperplane as a linear combination of support vectors (SV):

$$f(x) = \sum_{x_i \in SV} a_i^0 y_i (x_i \cdot x) + b_0. \qquad (3.22)$$

The optimal separating hyperplane can be found following the SRM principle. Since the data points are linearly separable, one finds a separating hyperplane that has minimum empirical risk (i.e., zero empirical error) from the subset of functions with the smallest VC-dimension.
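Before moving on, here is a minimal numerical sketch of the derivation above: it solves the dual problem of Eq. 3.21 with SciPy's SLSQP solver and then recovers $w_0$ and $b_0$ from the resulting support vectors via Eqs. 3.18 and 3.22. The toy data, the choice of solver and the 1e-6 threshold for identifying support vectors are our own assumptions for illustration, not part of the text.

```python
import numpy as np
from scipy.optimize import minimize

# Linearly separable toy data (hypothetical values chosen for illustration)
X = np.array([[2.0, 2.0], [2.5, 1.0], [3.0, 2.5],
              [-1.0, -1.0], [-2.0, 0.0], [-1.5, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = len(y)

# Matrix of the quadratic term: H_ij = y_i y_j (x_i . x_j)
H = (y[:, None] * X) @ (y[:, None] * X).T

def neg_W(a):                       # we minimize -W(a) to maximize W(a)
    return 0.5 * a @ H @ a - a.sum()

def neg_W_grad(a):
    return H @ a - np.ones(n)

constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i a_i y_i = 0
bounds = [(0.0, None)] * n                               # a_i >= 0 (hard margin)

res = minimize(neg_W, np.zeros(n), jac=neg_W_grad,
               bounds=bounds, constraints=constraints, method="SLSQP")
a = res.x

# Support vectors have a_i > 0; they determine w_0 and b_0 (Eqs. 3.18, 3.22)
sv = a > 1e-6
w0 = ((a * y)[:, None] * X).sum(axis=0)
b0 = np.mean(y[sv] - X[sv] @ w0)   # from y_i [(w . x_i) + b] = 1 on the margin
print("support vectors:", np.where(sv)[0], " w0 =", w0, " b0 =", b0)
```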

3.6.1.2 Linearly Nonseparable Data

Usually, data collected under real conditions are affected by outliers, sometimes caused by noisy measurements, and these outliers should be taken into consideration. This means that the data are not linearly separable and some of the training data points fall inside the margin. Thus, the SVM has to achieve two contradicting goals: on the one hand, it has to maximize the margin and, on the other hand, it has to minimize the number of nonseparable data points [4, 12, 37].

We can extend the previous ideas from separable to nonseparable data by introducing a positive slack variable $\xi_i \geq 0$ for each training vector. Thereby, we can modify Eq. 3.12 in the following way:

$$(w \cdot x_i) + b \geq 1 - \xi_i \quad \text{if } y_i = 1 \quad (\text{hyperplane } H_1)$$
$$(w \cdot x_i) + b \leq -1 + \xi_i \quad \text{if } y_i = -1 \quad (\text{hyperplane } H_2)$$
$$\xi_i \geq 0, \quad i = 1, \ldots, n. \qquad (3.24)$$

The above inequalities can be combined in the following form:

$$y_i[(w \cdot x_i) + b] \geq 1 - \xi_i, \quad i = 1, \ldots, n. \qquad (3.25)$$

Obviously, if we take $\xi_i$ large enough, the constraints in Eq. 3.25 will be met for all $i$. For an error to occur, the corresponding $\xi_i$ must exceed unity; therefore an upper bound on the number of training errors is given by $\sum_{i=1}^{n} \xi_i$. To avoid the trivial solution of large $\xi_i$, we introduce a penalization cost $C$ in the objective function (Eq. 3.26), which controls the degree of penalization of the slack variables $\xi_i$, so that, when $C$ increases, fewer training errors are permitted. In other words, the parameter $C$ controls the tradeoff between complexity (VC-dimension) and the proportion of nonseparable samples (empirical risk) and must be selected by the user. A given value for $C$ implicitly specifies the size of the margin. Hence, the SVM solution can be found by (a) keeping the upper bound on the VC-dimension small and (b) minimizing an upper bound on the empirical risk, so the primal optimization formulation (1-Norm Soft Margin or Box Constraint) becomes:

$$\text{minimize } \Phi(w, \xi) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n} \xi_i \qquad (3.26)$$
$$\text{subject to } y_i[(w \cdot x_i) + b] \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, n.$$

Following the same formalism of Lagrange multipliers leads to the same dual problem as in Eq. 3.21, but with the positivity constraints on $a_i$ replaced by the constraints $0 \leq a_i \leq C$. Correspondingly, we end up with the Lagrange functional:

$$L(w, b, \xi, a, r) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} a_i \{ y_i[(w \cdot x_i) + b] - 1 + \xi_i \} - \sum_{i=1}^{n} r_i \xi_i, \qquad (3.27)$$

with $a_i \geq 0$ and $r_i \geq 0$. The optimal solution has to fulfil the Karush-Kuhn-Tucker conditions:

$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{n} a_i y_i x_i = 0, \qquad \frac{\partial L}{\partial \xi_i} = C - a_i - r_i = 0, \qquad \frac{\partial L}{\partial b} = \sum_{i=1}^{n} a_i y_i = 0. \qquad (3.28)$$

Substituting the Karush-Kuhn-Tucker conditions back into the initial Lagrange functional, we obtain the following objective function:

$$W(a) = \sum_{i=1}^{n} a_i - \frac{1}{2}\sum_{i,j=1}^{n} a_i a_j y_i y_j (x_i \cdot x_j). \qquad (3.29)$$

This objective function is identical to that of the separable case, with the additional constraints $C - a_i - r_i = 0$ and $r_i \geq 0$ to enforce $a_i \leq C$, while $\xi_i \neq 0$ only if $r_i = 0$ and therefore $a_i = C$. Thereby, the complementary Karush-Kuhn-Tucker conditions become

$$a_i[y_i((x_i \cdot w) + b) - 1 + \xi_i] = 0, \quad i = 1, \ldots, n,$$
$$\xi_i(a_i - C) = 0, \quad i = 1, \ldots, n. \qquad (3.30)$$

The Karush-Kuhn-Tucker conditions imply that non-zero slack variables can only occur when $a_i = C$. The points with non-zero slack variables are margin errors, as their geometric margin is less than $\frac{1}{\|w\|}$. Points for which $0 < a_i < C$ lie at the target distance of $\frac{1}{\|w\|}$ from the hyperplane. Hence, the dual optimization problem is formulated as follows:

$$\text{maximize } W(a) = \sum_{i=1}^{n} a_i - \frac{1}{2}\sum_{i,j=1}^{n} a_i a_j y_i y_j (x_i \cdot x_j)$$
$$\text{subject to } \sum_{i=1}^{n} a_i y_i = 0, \quad 0 \leq a_i \leq C, \quad i = 1, \ldots, n. \qquad (3.31)$$

The optimal solution has to fulfil the following Karush-Kuhn-Tucker optimality conditions:

$$a_i = 0 \;\Rightarrow\; y_i[(w \cdot x_i) + b] \geq 1 \text{ and } \xi_i = 0 \quad (\text{ignored vectors})$$
$$0 < a_i < C \;\Rightarrow\; y_i[(w \cdot x_i) + b] = 1 \text{ and } \xi_i = 0 \quad (\text{margin support vectors})$$
$$a_i = C \;\Rightarrow\; y_i[(w \cdot x_i) + b] \leq 1 \text{ and } \xi_i \geq 0 \quad (\text{error support vectors}) \qquad (3.32)$$

The above equations indicate one of the most significant characteristics of SVM:

since most patterns lie outside the margin area, their optimal $a_i$ are zero. Only those training patterns $x_i$ which lie on the margin surface (margin support vectors), or inside the margin area with $a_i$ equal to the a priori chosen penalty parameter $C$ (error support vectors), have non-zero $a_i$ and are named support vectors (SV) [4, 12, 37].
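To see these three categories on actual numbers, the sketch below fits scikit-learn's SVC with a linear kernel on overlapping synthetic blobs and counts ignored vectors, margin support vectors ($0 < a_i < C$) and error support vectors ($a_i = C$) as $C$ varies. The library, dataset and tolerance are our own choices for illustration; the text does not prescribe them.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping classes so that some slack variables are non-zero
X, y = make_blobs(n_samples=60, centers=2, cluster_std=2.0, random_state=0)
y = 2 * y - 1                                 # relabel {0,1} -> {-1,+1}

for C in (0.1, 1.0, 10.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    alphas = np.abs(clf.dual_coef_).ravel()   # a_i for the support vectors only
    margin_sv = np.sum(alphas < C - 1e-8)     # 0 < a_i < C : on the margin
    error_sv = np.sum(alphas >= C - 1e-8)     # a_i = C     : inside the margin
    ignored = len(y) - len(clf.support_)      # a_i = 0     : outside the margin
    print(f"C={C:5.1f}  ignored={ignored:3d}  margin SV={margin_sv:3d}  "
          f"error SV={error_sv:3d}")
```

As $C$ grows, fewer margin violations are tolerated, so the count of error support vectors typically decreases while the margin narrows.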

The optimal solution gives rise to a decision function for the classification problem, which consists of assigning any input vector $x$ to one of the two classes according to the following rule:

$$f(x) = \mathrm{sign}\left( \sum_{x_i \in SV} a_i^0 y_i (x_i \cdot x) + b_0 \right). \qquad (3.33)$$
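Purely as a sanity check of this rule, the following sketch evaluates the sum over support vectors by hand and confirms that it matches the library's own decision function. It assumes scikit-learn's SVC, which stores $a_i^0 y_i$ in dual_coef_ and $b_0$ in intercept_; the dataset is made up.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, cluster_std=1.5, random_state=1)
y = 2 * y - 1
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Decision rule evaluated by hand: sum over support vectors of
# (a_i^0 y_i)(x_i . x) + b_0, then take the sign.
scores = X @ clf.support_vectors_.T @ clf.dual_coef_.ravel() + clf.intercept_
manual_pred = np.sign(scores)

assert np.array_equal(manual_pred, clf.predict(X))        # same labels
assert np.allclose(scores, clf.decision_function(X))      # same scores
```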

According to Eq. 3.23, in an (infinite-dimensional) Hilbert space, the VC-dimension $h$ of the set of separating hyperplanes with margin $\frac{1}{\|w_0\|}$ depends only on $\|w_0\|^2$. An effective way to construct the optimal separating hyperplane in a Hilbert space, without explicitly mapping the input data points into the Hilbert space, is provided by Mercer's Theorem. The inner product $(\Phi(x_i) \cdot \Phi(x_j))$ in some Hilbert space (feature space) $H$ of input vectors $x_i$ and $x_j$ can be defined by a symmetric positive definite function $K(x_i, x_j)$ (called kernel) in the input space $X$:

$$(\Phi(x_i) \cdot \Phi(x_j)) = K(x_i, x_j). \qquad (3.35)$$

In other words, for a non-linearly separable classification problem, we map the data $x$ from the input space $X \subseteq \mathbb{R}^d$ to some other (possibly infinite-dimensional) Euclidean space $H$ where the data are linearly separable, using a mapping function which we will call $\Phi$:

$$\Phi : X \subseteq \mathbb{R}^d \rightarrow H. \qquad (3.36)$$
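As a tiny illustration of Eqs. 3.35 and 3.36, the sketch below uses the degree-2 homogeneous polynomial kernel, chosen here only because its feature map can be written down explicitly, and checks that the inner product of the mapped vectors equals the kernel evaluated directly in input space.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel in R^2:
    Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), so that Phi(x).Phi(z) = (x.z)^2."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def K(x, z):
    """The same inner product computed directly in input space (a Mercer kernel)."""
    return np.dot(x, z) ** 2

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(phi(x) @ phi(z), K(x, z))   # identical up to floating point error
```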

Thus, we replace all inner products $(x_i \cdot x_j)$ by a proper kernel $K(x_i, x_j)$, and the dual optimization problem for the Lagrange multipliers is formulated as follows:

$$\text{maximize } W(a) = \sum_{i=1}^{n} a_i - \frac{1}{2}\sum_{i,j=1}^{n} a_i a_j y_i y_j K(x_i, x_j)$$
$$\text{subject to } \sum_{i=1}^{n} a_i y_i = 0, \quad 0 \leq a_i \leq C, \quad i = 1, \ldots, n. \qquad (3.37)$$

Different kernel functions $K(x_i, x_j)$ result in different description boundaries in the original input space. Here, we utilize the Gaussian or the polynomial kernel functions. These are of the respective forms:

$$K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right), \qquad (3.38)$$
$$K(x_i, x_j) = (x_i \cdot x_j + 1)^k, \qquad (3.39)$$

where $\sigma$ is the width of the Gaussian kernel function and $k$ is the order of the polynomial kernel function. Thus, the corresponding decision function is

$$f(x) = \mathrm{sign}\left( \sum_{x_i \in SV} a_i^0 y_i K(x_i, x) + b_0 \right). \qquad (3.40)$$

We have to note that the linear decision functions (Eq. 3.40) which we obtain in the feature space are equivalent to nonlinear functions in the original input space. The computations do not involve the $\Phi$ function explicitly, but depend only on the inner product defined in the feature space $H$, which in turn is obtained efficiently from a suitable kernel function.
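For reference, here is a short sketch that builds both kernel matrices of Eqs. 3.38 and 3.39 directly in NumPy and cross-checks them against scikit-learn's pairwise kernels; the sample data, $\sigma = 1.5$ and $k = 3$ are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
sigma, k = 1.5, 3                         # kernel width and polynomial order

# Gaussian kernel, Eq. 3.38: exp(-||x_i - x_j||^2 / (2 sigma^2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_gauss = np.exp(-sq_dists / (2 * sigma ** 2))

# Polynomial kernel, Eq. 3.39: (x_i . x_j + 1)^k
K_poly = (X @ X.T + 1.0) ** k

# Cross-check against scikit-learn's pairwise kernels (gamma = 1 / (2 sigma^2))
assert np.allclose(K_gauss, rbf_kernel(X, gamma=1.0 / (2 * sigma ** 2)))
assert np.allclose(K_poly, polynomial_kernel(X, degree=k, gamma=1.0, coef0=1.0))
```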

The learning machines which construct functions of the form of Eq. 3.40 are called Support Vector Machines (SVM).
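To close the section, here is a minimal end-to-end sketch, again our own construction using scikit-learn's SVC and the synthetic "two moons" dataset, that compares a linear decision function with the polynomial and Gaussian kernels of Eqs. 3.38 and 3.39 on data that are not linearly separable in input space.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A classic non-linearly separable problem: two interleaving half-moons
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
y = 2 * y - 1                                        # labels in {-1, +1}
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for kernel, params in [("linear", {}),
                       ("poly", {"degree": 3, "coef0": 1.0}),
                       ("rbf", {"gamma": 2.0})]:
    clf = SVC(kernel=kernel, C=1.0, **params).fit(X_tr, y_tr)
    print(f"{kernel:6s} kernel: {len(clf.support_):3d} support vectors, "
          f"test accuracy = {clf.score(X_te, y_te):.3f}")
```

On such data the kernelized machines typically reach noticeably higher test accuracy than the linear decision function, which is exactly the motivation for the kernel construction above.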
