
Support Vector Machine and Kernel Machine Methodology

Table 6. Some correctly classified Category-5 items

Item No  Item description

55028  Head, ultrasound scan of, performed by, or on behalf of, a medical practitioner where: (a) the patient is referred by a medical practitioner for ultrasonic examination not being a service associated with a service to which an item in Subgroups 2 or 3 of this Group applies; and (b) the referring medical practitioner is not a member of a group of practitioners of which the first mentioned practitioner is a member (R)

55029  Head, ultrasound scan of, where the patient is not referred by a medical practitioner, not being a service associated with a service to which an item in Subgroups 2 or 3 of this Group applies (NR)

55030  Orbital contents, ultrasound scan of, performed by, or on behalf of, a medical practitioner where: (a) the patient is referred by a medical practitioner for ultrasonic examination not being a service associated with a service to which an item in Subgroups 2 or 3 of this Group applies; and (b) the referring medical practitioner is not a member of a group of practitioners of which the first mentioned practitioner is a member (R)

55031  Orbital contents, ultrasound scan of, where the patient is not referred by a medical practitioner, not being a service associated with a service to which an item in Subgroups 2 or 3 of this Group applies (NR)

55033  Neck, 1 or more structures of, ultrasound scan of, where the patient is not referred by a medical practitioner, not being a service associated with a service to which an item in Subgroups 2 or 3 of this Group applies (NR)

In recent years, there has been increasing interest in a method called support vector machines (Cristianni & Shawe-Taylor, 2000; Guermeur, 2002; Joachims, 1999; Vapnik, 1995). In brief, the idea can be explained as follows. Assume a set of n-dimensional vectors x1, x2, ..., xn drawn from two classes, labelled 1 and -1. If these classes are linearly separable, then there exists a straight line dividing them, as shown on the left of Figure 2, where the two classes are well separated. If the two classes cannot be separated by a straight line, the situation becomes more interesting; traditionally, a non-linear classifier is then used to separate the classes, as shown on the right of Figure 2. In general terms, two collections of n-dimensional vectors are said to be linearly separable if there exists an (n-1)-dimensional hyper-plane that separates the two collections.
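As a concrete check of this definition, the following minimal sketch (with assumed toy data; it is not part of the original experiments) builds two well-separated clouds of 2-dimensional vectors and a candidate line w^T x + b = 0 that places every vector of class A on one side and every vector of class B on the other.

```python
# A tiny illustration (assumed toy data) of linear separability: two clouds of
# 2-dimensional vectors and a 1-dimensional hyper-plane (a line) that separates them.
import numpy as np

rng = np.random.default_rng(0)
class_a = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))
class_b = rng.normal(loc=[+2.0, +2.0], scale=0.5, size=(50, 2))

w, b = np.array([1.0, 1.0]), 0.0              # candidate separating line w^T x + b = 0
print(np.all(class_a @ w + b < 0))            # True: every class A vector lies on one side
print(np.all(class_b @ w + b > 0))            # True: every class B vector lies on the other
```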

Figure 2. Illustration of the linear separability of classes (the two classes at top are separable by a single line, as indicated; for the lower two classes there is no line that can separate them)


The key intuition is inspired by the following example: in the exclusive-OR case, it is not possible to separate the two classes using a straight line when the problem is represented in two dimensions. However, if we increase the dimension of the exclusive-OR example by one, then in three dimensions a hyper-plane can be found which separates the two classes. This can be observed in Tables 7 and 8, respectively.
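A small numerical sketch of this idea follows (the particular lifting x → (x1, x2, x1·x2) is an assumption and not necessarily the one used in Tables 7 and 8): once the product x1·x2 is added as a third coordinate, the four exclusive-OR points become linearly separable, and the plane with w = (1, 1, -2), b = -0.5 separates the two classes.

```python
# The XOR points are not linearly separable in two dimensions, but after adding
# x1*x2 as a third coordinate a single plane separates them (assumed lifting).
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, 1, 1, -1])                   # class labels for exclusive-OR

phi = np.column_stack([X, X[:, 0] * X[:, 1]])  # phi(x) = (x1, x2, x1*x2)

w, b = np.array([1.0, 1.0, -2.0]), -0.5        # a separating hyper-plane in 3 dimensions
f = phi @ w + b
print(np.sign(f))                              # [-1.  1.  1. -1.], matching y
```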

Here it is observed that the two classes are easily separated when we simply add one extra dimension. The support vector machine exploits this insight: when it is not possible to separate the two classes by a hyper-plane in the original space, augmenting the dimension of the problem sufficiently makes it possible to separate them by a hyper-plane of the form f(x) = w^T φ(x) + b, where w is a set of weights and b a constant in this high-dimensional space. Embedding the vectors x in the high-dimensional space amounts to transforming them to φ(x), where φ(⋅) is a coordinate transformation. The question then becomes: how do we find such a transformation φ(⋅)?

Let us define a kernel function as follows:

K(x, z) ≡ ⟨φ(x), φ(z)⟩ ≡ φ(x)^T φ(z)   (1)

where φ is a mapping from X to an inner product feature space F. It is noted that the kernel thus defined is symmetric, in other words K(x, z) = K(z, x). Now let us define the matrix X = [x1 x2 ... xn]. It is possible to define the symmetric matrix:

X^T X = [x1 x2 ... xn]^T [x1 x2 ... xn]   (2)

In a similar manner, it is possible to define the kernel matrix:

K = [φ(x1) φ(x2) ... φ(xn)]^T [φ(x1) φ(x2) ... φ(xn)]   (3)

Note that the kernel matrix K is symmetric. Hence, it is possible to find an orthogonal matrix V such that K = VΛV^T, where Λ is a diagonal matrix containing the eigenvalues of K. It is convenient to sort the diagonal values of Λ such that λ1 ≥ λ2 ≥ ... ≥ λn. It turns out that one necessary requirement for the matrix K to correspond to a valid kernel function is that the eigenvalue matrix Λ contains only non-negative entries, in other words, λi ≥ 0. This implies that, in general, for φ(⋅) to be a valid transformation, it must satisfy conditions ensuring that the kernel function formed is symmetric and positive semi-definite. These are known as the Mercer conditions (Cristianni & Shawe-Taylor, 2000).
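A small numerical illustration of this requirement (assumed toy data, not drawn from the chapter): the kernel matrix of a Gaussian kernel is symmetric and its eigenvalues λi are non-negative, up to numerical round-off.

```python
# Build the kernel matrix K for a Gaussian kernel and verify that its eigenvalues
# are non-negative, as the Mercer conditions require of a valid kernel.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                  # 20 sample vectors x_1, ..., x_n

sq = np.sum(X**2, axis=1)
d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
K = np.exp(-d2 / 2.0)                         # K[i, j] = K(x_i, x_j)

eigvals = np.linalg.eigvalsh(K)               # eigenvalues of K = V Lambda V^T
print("symmetric:", np.allclose(K, K.T))
print("smallest eigenvalue:", eigvals.min())  # >= 0 up to numerical round-off
```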

There are many possible such transformations; some common ways of constructing a new kernel K̂ from an existing kernel K (Cristianni & Shawe-Taylor, 2000) being:

Power kernel: K̂(x, z) = (K(x, z) + c)^p, where p = 2, 4, ...

Gaussian kernel: K̂(x, z) = exp(-(K(x, x) - 2K(x, z) + K(z, z)) / (2σ²)).
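The following minimal sketch implements the two constructions above, assuming the ordinary inner product as the base kernel K (the constants c, p, and σ are illustrative choices).

```python
# Given a base kernel K, both (K(x, z) + c)^p and the Gaussian form built from K
# are again valid kernels. Here the base kernel is the ordinary inner product.
import numpy as np

def base_kernel(x, z):
    return float(np.dot(x, z))                       # K(x, z) = x^T z

def power_kernel(x, z, c=1.0, p=2):
    return (base_kernel(x, z) + c) ** p              # (K(x, z) + c)^p

def gaussian_kernel(x, z, sigma=1.0):
    # exp(-(K(x, x) - 2 K(x, z) + K(z, z)) / (2 sigma^2))
    d2 = base_kernel(x, x) - 2.0 * base_kernel(x, z) + base_kernel(z, z)
    return np.exp(-d2 / (2.0 * sigma**2))

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(power_kernel(x, z), gaussian_kernel(x, z))
```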

There exist quite efficient algorithms, based on optimisation theory, which obtain the set of support vectors and the corresponding weights of the hyper-plane for a particular problem (Cristianni & Shawe-Taylor, 2000; Joachims, 1999). These algorithms re-formulate the problem as a quadratic programming problem with linear constraints; once it is thus re-formulated, the solution can be obtained very efficiently.
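As a brief sketch of this in practice (assuming the scikit-learn library, whose SVC estimator solves the underlying quadratic programme internally; the toy data are invented for illustration):

```python
# Fitting a soft-margin SVM with a Gaussian (RBF) kernel using scikit-learn.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)     # a non-linearly-separable toy problem

clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # gamma plays the role of 1/(2*sigma^2)
clf.fit(X, y)

print("number of support vectors:", clf.support_vectors_.shape[0])
print("training accuracy:", clf.score(X, y))
```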

It was also discovered that the idea of a kernel is quite general (Scholkopf, Burges, & Smola, 1999). Indeed, instead of working with the original vectors x, it is possible to work with the transformed vectors φ(x) in the feature space, and most classic algorithms, for example, principal component analysis, canonical correlation analysis, and Fisher's discriminant analysis, all have equivalent algorithms in the kernel space. The advantage of working in this way is that all computations involve only the n × n kernel matrix, whose size is determined by the number of examples and is normally much lower than the dimension of the original space.
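For example, a minimal sketch of kernel principal component analysis (an assumed illustration, not the chapter's implementation): the analysis is carried out entirely through the n × n kernel matrix, without ever forming φ(x) explicitly.

```python
# Kernel PCA: centre the kernel matrix in feature space, eigendecompose it, and
# project the training points onto the leading kernel principal components.
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    # K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

def kernel_pca(K, n_components=2):
    n = K.shape[0]
    one = np.ones((n, n)) / n
    Kc = K - one @ K - K @ one + one @ K @ one   # centre the data in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)        # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:n_components]
    lam, V = eigvals[idx], eigvecs[:, idx]
    # projections of the training points onto the leading components
    return Kc @ (V / np.sqrt(np.maximum(lam, 1e-12)))

X = np.random.default_rng(1).normal(size=(100, 5))
Z = kernel_pca(rbf_kernel_matrix(X, sigma=2.0), n_components=2)
print(Z.shape)   # (100, 2)
```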