
3.2 Maximum margin classifier

In the following, we consider the binary classification setting for mathematical convenience and refer to it simply as classification. Most of the characteristics described in this section also hold for multi–class classification problems. In the probabilistic view, the classification task can be seen as the estimation of the conditional probability P(y|x) for all x ∈ X and y ∈ Y. In real situations, the probability distribution P is unknown and one has to model the probability of observing a specific y ∈ Y given an input x ∈ X. For supervised learning, there are two main families of algorithms:

the generative and the discriminative algorithms, if one considers the estimation of the posterior probability of y [22].

P(y|x) = \frac{P(x|y)\,P(y)}{\int_{y} P(x|y)\,P(y)\,dy} \qquad (3.3)

The generative algorithms first learn the structure of each class by means of P(x|y) from the training points. The Bayes rule (Equation 3.3) is then used to derive the posterior probability of the class, P(y|x). The discriminative algorithms learn directly the mapping from X to Y by means of P(y|x). In other words, the discriminative algorithms do not focus on the way the training examples were generated. The maximum margin classifier (MMC) falls in the class of discriminative algorithms. Let us first consider the case of discriminative linear classification before introducing the MMC.
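To make the distinction concrete, the following sketch fits one representative of each family on the same synthetic data. The choice of models (Gaussian naive Bayes as the generative example, logistic regression as the discriminative one), the dataset and all parameter values are illustrative assumptions, not taken from the original text.

```python
# A minimal sketch contrasting a generative and a discriminative classifier
# on the same synthetic data (dataset and parameters are illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB            # generative: models P(x|y) and P(y)
from sklearn.linear_model import LogisticRegression   # discriminative: models P(y|x) directly

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

generative = GaussianNB().fit(X_tr, y_tr)
discriminative = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Both end up providing an estimate of P(y|x); only the route differs
# (Bayes rule applied to a model of P(x|y) versus direct modelling of P(y|x)).
print("GaussianNB          P(y|x):", generative.predict_proba(X_te[:1]))
print("LogisticRegression  P(y|x):", discriminative.predict_proba(X_te[:1]))
```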

3.2.1 Linear classification

The objective of linear classification is to provide a discriminant function which is linear in the input space X. Ideally, this linear function should maximize the contrast between the two classes.

The key to formally defining such a contrast is obviously a distance measure among the patterns.

The leading example of a linear classifier is Fisher’s linear discriminant (FLD) [49]. The FLD looks for a hyperplane directed by w ∈ X which (i) maximizes the distance between the means of the classes when projected on the line directed by w and (ii) minimizes the variance around these means. An illustration of this property is given in Figure 3.1. Strategies to find the vector w were proposed in [99]. An unseen new example is classified into the nearest class mean

projected onto the hyperplane directed by w. If one denotes by m_{-1} and m_{+1} the empirical means of the input examples of each class, the decision function is given by Equation 3.4.

Notice the use of the dot product to express the projection onto the optimal hyperplane.

f(x) = \begin{cases} -1 & \text{if } |\langle w, x\rangle - \langle w, m_{-1}\rangle| < |\langle w, x\rangle - \langle w, m_{+1}\rangle| \\ +1 & \text{if } |\langle w, x\rangle - \langle w, m_{-1}\rangle| > |\langle w, x\rangle - \langle w, m_{+1}\rangle| \end{cases} \qquad (3.4)
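The decision rule of Equation 3.4 can be written in a few lines of numpy. The sketch below is only an illustration of the projected nearest-class-mean rule: the direction w is taken as the difference of the class means for brevity, whereas the actual FLD direction of [49, 99] would additionally account for the within-class variance; the synthetic data are an assumption of this example.

```python
# A small numpy sketch of the decision rule in Equation 3.4: project onto a
# direction w and assign the class whose projected mean is closest.
import numpy as np

rng = np.random.default_rng(0)
X_neg = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(50, 2))   # class -1
X_pos = rng.normal(loc=[+2.0, 0.0], scale=1.0, size=(50, 2))   # class +1

m_neg, m_pos = X_neg.mean(axis=0), X_pos.mean(axis=0)
w = m_pos - m_neg            # simplistic choice of direction (not the full FLD solution)

def f(x):
    # Equation 3.4: compare the distances of the projection <w, x> to the projected means.
    d_neg = abs(np.dot(w, x) - np.dot(w, m_neg))
    d_pos = abs(np.dot(w, x) - np.dot(w, m_pos))
    return -1 if d_neg < d_pos else +1

print(f(np.array([-1.5, 0.3])), f(np.array([2.5, -0.7])))      # expected: -1, +1
```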

The FLD described above is only an example to illustrate the mechanism of a linear classifier, which seeks an optimal hyperplane separating the training points. Other linear classification techniques were developed and their main differences lie in the way of finding the optimal hyperplane.

One can cite among others the following linear classification techniques: quadratic discriminant analysis, regularized discriminant analysis, logistic regression and the perceptron algorithm [64].

If we consider a vector space V, a hyperplane is defined by {x ∈ V such that ⟨w, x⟩ + b = 0}, where ⟨·,·⟩ is the dot product operator, the vector w ∈ R^d is perpendicular to the hyperplane and b is a scalar value. Generally, a linear classifier is a hyperplane separating the patterns according to the following rule:

y(x_i) = \begin{cases} -1 & \text{if } \langle w, x_i\rangle + b < 0 \\ +1 & \text{if } \langle w, x_i\rangle + b > 0 \end{cases} \qquad (3.5)

The training set {x_i, y_i}_{i=1}^{N} is said to be linearly separable if there exists one or more linear classifiers classifying the whole training set correctly according to (3.5). For a linearly separable training set, one can find an infinity of linear classifiers. Many heuristics have been developed to select the best one (see for example the FLD above). A maximum margin classifier, or optimal margin classifier, is a linear classification technique such that the distance between the separating hyperplane and the closest training points of each class is maximized. The margin is the distance between the two hyperplanes on both sides of the separating hyperplane. An SVM is a particular example of a maximum margin classifier where only the points on the margin hyperplanes, or support vectors, are necessary to formally define the model. The MMC is illustrated in Figure 3.2.

In a formal way, a classification task with an SVM is a convex optimization problem, as suggested by the following expression:

\min_{w} \;\langle w, w\rangle \quad \text{s.t.} \quad y_i(\langle w, x_i\rangle + b) \geq 1 \;\;\forall i \qquad (3.6)

The first part of Equation 3.6, i.e. the objective function, corresponds to the maximization of the margin of the separating hyperplane and is drawn to guarantee the equivalence of the patterns. The similarity of the outputs, according to the equivalence of the patterns, is formulated in the constraint of Equation 3.6. This formulation is applicable only if the dataset is linearly separable.

If certain conditions hold (e.g. convexity of the objective function), the Lagrangian formulation shows that the previous problem is equivalent to its dual in Equation 3.7, which is a quadratic optimization problem and can be solved using several techniques (see for example [18]).

\max_{\alpha \geq 0} \;\theta(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle \quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0 \qquad (3.7)

The vector w = ∑_i α_i y_i x_i and the value of b are obtained once the Lagrangian multipliers α_i are known.
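The relation w = ∑_i α_i y_i x_i can be checked numerically. The sketch below, on synthetic near-separable data and with a very large C to approximate the hard-margin problem (3.6)/(3.7), uses scikit-learn's SVC, whose dual_coef_ attribute stores the products α_i y_i for the support vectors; all names and parameter values are assumptions of this illustration.

```python
# Rebuild the separating direction from the dual solution, w = sum_i alpha_i y_i x_i,
# using only the support vectors, and compare it to the primal coefficients.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=0)
clf = SVC(kernel="linear", C=1e6).fit(X, y)          # very large C ~ hard margin

w_from_dual = clf.dual_coef_ @ clf.support_vectors_   # dual_coef_ holds alpha_i * y_i
print("number of support vectors:", len(clf.support_))
print("w from the dual  :", w_from_dual.ravel())
print("w from the primal:", clf.coef_.ravel())        # identical up to numerical precision
```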

3.2.2 Classification of a linearly separable dataset

Even if the training set is linearly separable, it is often desirable to “enlarge” the margin and let some training points lie inside the margin to prevent overfitting. For this purpose, slack variables ξ_i are introduced to soften the constraints of (3.6).

Figure 3.1: Illustration of Fisher’s linear discriminant. The algorithm searches for the direction providing the best separation of the classes when projected upon. In this figure, the third image (bottom left) provides the best separation of the datasets.

The SVM formulation then becomes:

\min_{w} \;\langle w, w\rangle + C\sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i(\langle w, x_i\rangle + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0 \;\;\forall i \qquad (3.8)

The regularization parameter C represents the compromise between the margin width and the generalization performance. In other words, it controls the amount of overlap: for non-separable datasets and a large value of C, the solution of the SVM approaches the minimum overlap solution with the largest margin [63]. Equation 3.8 gives a penalty of cost Cξ_i for any data point falling within the margin on the correct side of the separating hyperplane (0 < ξ_i ≤ 1) or on the wrong side of the separating hyperplane (ξ_i > 1). An SVM, in its simplest form, is a maximum margin classifier which uses only the points on both sides of the margin, or support vectors (the points x_i for which α_i > 0), to build a model.
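The role of C in (3.8) can be illustrated by fitting the same linear SVM with different values of C and inspecting the margin width 2/‖w‖ and the number of support vectors. The dataset, the label noise and the grid of C values below are arbitrary choices for the sketch, not part of the original text.

```python
# Smaller C tolerates more margin violations (more support vectors, wider margin);
# larger C penalizes violations heavily (narrower margin).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=1.0, flip_y=0.05,
                           random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)          # geometric margin width 2 / ||w||
    print(f"C={C:>7}: support vectors={clf.n_support_.sum():3d}, margin width={margin:.3f}")
```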

In Equation (3.8), ξ_i is called the loss function. The natural loss function (also known as the zero–one loss function) is given by

\xi_i(y_i, f(x_i)) = \begin{cases} 0 & \text{if } y_i = f(x_i) \\ 1 & \text{if } y_i \neq f(x_i) \end{cases} \qquad (3.9)

where y_i is the true label of the example i and f(x_i) is the predicted label. However, this loss function is not convex and makes the SVM problem NP–hard. To obtain a convex objective function, upper bounds of the natural loss are used. The most common loss functions are the hinge loss ξ_i(t_i) = max(0, 1 − t_i), the quadratic loss ξ_i(t_i) = (max(0, 1 − t_i))^2 and the least square loss ξ_i(t_i) = (1 − t_i)^2, leading respectively to the L1–SVM, L2–SVM and LS–SVM. In practice, the parameter C is determined via a model selection process on the training examples (see the model selection section).
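The losses above can be written as functions of the signed margin t_i = y_i f(x_i). The following numpy sketch evaluates them on a few margin values to make the comparison with the zero–one loss concrete; the sample values of t are arbitrary.

```python
# The convex surrogates (hinge, squared hinge, least squares) versus the zero-one loss,
# all expressed as functions of the margin t_i = y_i * f(x_i).
import numpy as np

def zero_one(t):       # Equation 3.9 rewritten on the margin; non-convex
    return (t <= 0).astype(float)

def hinge(t):          # L1-SVM
    return np.maximum(0.0, 1.0 - t)

def squared_hinge(t):  # L2-SVM
    return np.maximum(0.0, 1.0 - t) ** 2

def least_squares(t):  # LS-SVM
    return (1.0 - t) ** 2

t = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
for name, loss in [("0-1", zero_one), ("hinge", hinge),
                   ("sq. hinge", squared_hinge), ("least sq.", least_squares)]:
    print(f"{name:>9}: {loss(t)}")
```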

As described above, the probability distribution of the training points is unknown in a classification task. This implies that the classification of new unseen input examples is error-prone.


Figure 3.2: Illustration of the maximum margin classifier. The dotted lines define the margin on which lie the support vectors (circled examples).

The SVM can be seen as an algorithm minimizing the generalization error, expressed with a loss function, while seeking the optimal margin. Formally, the SVM can be expressed as the sum of a loss and a penalty function (3.10). With this expression, the SVM algorithm finds a linear classifier minimizing the convex loss function according to the constraint expressed by the convex penalty function (λ is the regularization parameter and is equivalent to 1/C).

\min_{w,b} \;\sum_{i=1}^{n} \left[1 - y_i(\langle w, x_i\rangle + b)\right]_{+} + \frac{\lambda}{2}\,\langle w, w\rangle \qquad (3.10)
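As a small check of the loss-plus-penalty reading of (3.10), the sketch below evaluates the hinge term and the penalty term for a fitted linear SVM with λ = 1/C. It only evaluates the expression for given w and b; scaling conventions differ between solvers, so this is an illustration of the formula under assumed data and parameters, not of any particular implementation.

```python
# Evaluate the objective of (3.10): sum of hinge losses plus (lambda/2) <w, w>.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y01 = make_classification(n_samples=200, n_features=5, random_state=1)
y = 2 * y01 - 1                                   # relabel to {-1, +1} as in the text

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)
w, b = clf.coef_.ravel(), clf.intercept_[0]

lam = 1.0 / C
hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))    # [1 - y_i(<w, x_i> + b)]_+
objective = hinge.sum() + 0.5 * lam * np.dot(w, w)
print("value of the objective in (3.10):", objective)
```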

3.2.3 Classification of a non-linearly separable dataset

In real practice, it is rare to have linearly separable input elements. When the dataset is not linearly separable, a mapping function φ : X → H is introduced to transform the dataset into the so-called feature space or hypothesis space. The objective of this transformation is simply to enable the application of the well–understood linear classifier in the feature space. In other words, one assumes the existence of a suitable linear classifier in the feature space. This is possible if H is a reproducing kernel Hilbert space (RKHS, see Definition 3.1.1) [4, 140]. The relationship between an RKHS and its reproducing kernel is provided by the Moore–Aronszajn theorem [4]. Boser et al. were the first to introduce the concept of kernel for maximum margin classifiers [18].

To be able to apply a maximum margin classifier in the feature space, the similarity measure expressed with the dot product operation should be preserved. This is achieved by means of kernel functions. The dot product used in the maximum margin classifier (3.7) is expressed with the kernel matrix K = [k(x_i, x_j)] = [⟨φ(x_i), φ(x_j)⟩]_{i,j=1}^{N}. The linear decision function in the feature space is expressed with the kernel: f(x) = ∑_{i=1}^{n} α_i k(x, x_i) + b, where the α_i are the Lagrangian multipliers. The dual formulation of a kernel-based algorithm is expressed as [34, 36]:

\max_{\alpha} \; 2\langle \alpha, e\rangle - \alpha^{T}\left(G(K) + C^{-1} I_N\right)\alpha \quad \text{s.t.} \quad C \geq \alpha \geq 0, \;\; \langle \alpha, y\rangle = 0 \qquad (3.11)

where e is the n–vector of ones, α ∈ R^n, the Gram matrix G(K) is defined by G_{ij}(K) = [K]_{ij} y_i y_j = k(x_i, x_j) y_i y_j, I_N is the identity matrix of size N, and α ≥ 0 means α_i ≥ 0, i = 1, ..., N. The transformation function φ is integrated into the definition of the Gram matrix. The kernelized maximum margin classifier is a kernel-based algorithm using only the points on both sides of the margin in the feature space to build the model. For the rest of the document, we refer to a kernelized maximum margin classifier simply as an SVM.
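The sketch below illustrates the kernelized classifier on data that is not linearly separable in the input space. The Gaussian (RBF) kernel, its width and the dataset are assumptions of this example; the Gram matrix K = [k(x_i, x_j)] is passed explicitly as a precomputed kernel to make its role in (3.11) visible.

```python
# Linear classifier in the input space versus a kernelized classifier built
# from the precomputed Gram matrix of a Gaussian kernel.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X, y)
print("linear kernel accuracy:", linear.score(X, y))       # poor: not separable in X

K = rbf_kernel(X, X, gamma=1.0)                             # Gram matrix K_ij = k(x_i, x_j)
kernelized = SVC(kernel="precomputed", C=1.0).fit(K, y)
print("RBF kernel accuracy   :", kernelized.score(K, y))    # separable in the feature space

# Predicting new points x requires the rectangular kernel matrix k(x, x_i).
X_new = np.array([[0.0, 0.0], [1.0, 0.0]])
print("predictions:", kernelized.predict(rbf_kernel(X_new, X)))
```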

3.2.4 Algorithm complexity

As we have seen above, three elements are at the heart of the SVM: (a) a kernel measuring the equivalence of patterns and conditioning the margin width, because it defines the structure of the feature space; (b) a hypothesis space which is a space of linear functions; and (c) the convexity of the objective function, guaranteeing the unicity of the solution. To obtain the optimal classifier, some parameters have to be optimized, such as the regularization variable C and the kernel's parameters (σ for the Gaussian kernel). The next section treats this issue.

3.3 Model selection

Machine learning, like other scientific disciplines, starts from observations in order to extract “knowledge” allowing to generate a general model. In real practice, it is often difficult to obtain a large number of examples and there is a high probability that a built model provides an error on the prediction of new examples. As we have seen in the previous section, there is always an infinity of hypotheses or models for a given input example. The kernel–based algorithms reduce considerably the domain of possible hypotheses to linear functions in the feature space. However, to provide optimal results, once the choice of the kernel is performed, one has to optimize the regularization parameter C and the kernel parameters. In the rest of this document, these parameters are referred to as hyperparameters. This section treats the way to select the best model from an empirical and analytical viewpoint and how to measure the performance of a selected model.

3.3.1 Empirical model selection

The easiest way to tune the parameters of a classification algorithm is to randomly split the input examples into two disjoint training and testing sets. The training set is used to build the model and the test set to evaluate it. To build the model, one has to scan a list of all possible values of the algorithm's parameters (grid–search) and select the one giving the lowest error. The main drawback of this method is the possible random effect of the split realization, providing a good model but with a low generalization performance. To reduce this random effect, one can produce a high number of randomized train/test splits and select the model providing the lowest average error on the testing sets. The issue with this method lies in the model creation, where the performance of the model is not tested on the training sets.
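The procedure just described can be written directly, as in the sketch below: repeated random train/test splits, a grid of candidate C values, and selection of the value with the lowest average test error. The dataset, the grid and the number of repetitions are arbitrary illustrative choices.

```python
# Repeated random train/test splits with a simple grid search over C.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
grid, n_repeats = [0.01, 0.1, 1.0, 10.0, 100.0], 20

mean_error = {}
for C in grid:
    errors = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
        clf = SVC(kernel="rbf", C=C).fit(X_tr, y_tr)
        errors.append(1.0 - clf.score(X_te, y_te))
    mean_error[C] = np.mean(errors)

print("mean test error per C:", mean_error)
print("selected C:", min(mean_error, key=mean_error.get))
```

The drawback noted above remains: the splits used to pick C also served to estimate its error, which motivates the validation splits described next.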

To overcome this, one can randomly split the training sets into training and validation sets.

The state–of–the–art way to perform this cross–validation is to split the training sets into k–folds [63, 85]. To assess the performance of the parameters on the training examples, one iteratively holds out one fold for validation while the k−1 remaining folds are used to build the model. The best way to carry out such a procedure is to repeat it a finite number of times to avoid selecting models having high error variability. A possible issue with this method lies in the alteration of the data distribution during the realization of the randomized splits. This is relatively severe for datasets having skewed distributions:

it is possible that the examples of one class are not observed in a random split. To overcome this, a stratified k–fold cross–validation may be carried out. In this way, the distribution of each class is kept in each fold. This is necessary because we assume that our dataset is independent and identically–distributed according to an unknown distribution p. There are cases where the number of input examples is very low. To obtain a reliable model, one has to set the number of folds equal to the number of examples. Such a model selection procedure is called leave–one–out cross–validation.
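The stratified k–fold procedure combined with a grid search over the hyperparameters can be sketched with scikit-learn as follows. The grid over C and the Gaussian kernel width, the number of folds and the mildly skewed synthetic dataset are illustrative assumptions.

```python
# Stratified 5-fold cross-validation with a grid search over C and gamma.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, weights=[0.8, 0.2],
                           random_state=0)                      # mildly skewed classes

param_grid = {"C": [0.1, 1.0, 10.0, 100.0],
              "gamma": [0.001, 0.01, 0.1, 1.0]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # keeps class ratios per fold

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv)
search.fit(X, y)
print("best hyperparameters    :", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
```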

In our classification setting above, we did not provide any conditional structure on the input examples (x_i, y_i)_{i=1}^{N} and assumed that each couple (x_i, y_i) is an independent observation. There are situations where more than one couple (x_i, y_i) forms an entity of observation. This situation is common in pattern recognition problems. Suppose for instance that one wants to detect an abnormal region from a lung CT scan. One is provided with a set of normal and abnormal regions of the lung of a finite number of patients, and each region is characterized by a set of variables. A leave–one–out procedure may provide optimistic performance on the training sets because some observations of the same patient are used in the training sample and one observation in the validation set. To overcome this issue, one has to hold out all the observations of a patient for validation and build the model with the remaining observations. This procedure is called leave–one–patient–out and can be seen as a generalization of the leave–one–out procedure.
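Leave–one–patient–out can be expressed with scikit-learn's LeaveOneGroupOut splitter, where all observations sharing a patient identifier are held out together. The synthetic patient identifiers and the model parameters below are assumptions of this sketch.

```python
# Leave-one-patient-out cross-validation: one fold per patient.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

n_patients, regions_per_patient = 20, 10
X, y = make_classification(n_samples=n_patients * regions_per_patient,
                           n_features=8, random_state=0)
patient_id = np.repeat(np.arange(n_patients), regions_per_patient)  # group label per region

logo = LeaveOneGroupOut()
scores = cross_val_score(SVC(kernel="rbf", C=1.0), X, y,
                         groups=patient_id, cv=logo)
print("per-patient accuracies:", np.round(scores, 2))
print("mean accuracy:", scores.mean())
```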

Another limitation of the cross–validation mentioned above is the size of the training set: it may provide unsatisfactory results when the training size is low. To overcome this, the bootstrap method consists of generating N training observations through resampling with replacement of the training observations and using the latter for validation [63]. To avoid training and validating with the same observation, one has to carry out the bootstrap procedure in a leave–one–patient–out cross–validation way. To this end, one observation is removed iteratively and the bagging is carried out on the remaining N−1 training examples.
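The sketch below shows a common out-of-bag variant of bootstrap validation rather than the exact per-observation procedure described above: fit on a bootstrap sample drawn with replacement and evaluate on the observations that were never drawn, which automatically avoids training and validating on the same point. Dataset, model and number of rounds are arbitrary.

```python
# Bootstrap validation with out-of-bag evaluation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=8, random_state=0)
rng = np.random.default_rng(0)
N, n_rounds, errors = len(X), 50, []

for _ in range(n_rounds):
    boot = rng.integers(0, N, size=N)              # indices drawn with replacement
    oob = np.setdiff1d(np.arange(N), boot)         # observations never drawn
    if len(oob) == 0 or len(np.unique(y[boot])) < 2:
        continue                                   # skip degenerate resamples
    clf = SVC(kernel="rbf", C=1.0).fit(X[boot], y[boot])
    errors.append(1.0 - clf.score(X[oob], y[oob]))

print("bootstrap (out-of-bag) error estimate:", np.mean(errors))
```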

3.3.2 Analytical model selection

The empirical model selection is computationally expensive: NCV × ACPX × GSZ × NTS, where NCV is the number of folds in the cross–validation, ACPX is the algorithm complexity, GSZ is the size of the parameter grid and NTS is the number of train/test splits. The objective of the analytical model selection is to reduce the computational cost by avoiding as much computation as possible without losing generalization performance. The reduction of the computational cost also allows to use more complex kernels with many parameters, or to use more than one kernel to obtain a better data representation in the feature space.

The analytical model selection is generally carried out by theoretically estimating or approximating the error of the classifier (f(x) ≠ y) on the training set and choosing the hyperparameters minimizing this error. For the SVM algorithm, the analytical model selection is performed with a min–max procedure, i.e. a dual optimization minimizing the training error and maximizing the margin in the feature space. Developing the most accurate approximation of the generalization error of the SVM algorithm attracted some researchers in the second half of the 90s. The majority of the proposed approximators upper-bound the leave–one–out error.

In 1995, Vapnik introduced a bound based on the number of support vectors and the number of training data [136]; Jaakkola and Haussler introduced an upper bound based on the coefficients of the support vectors and the diagonal elements of the Gram matrix [77]. In 1998, Vapnik defined the so–called radius–margin bound for SVMs without a threshold (b = 0 in 3.5) in the case of linearly separable data [137]. This upper bound of the leave–one–out error is based on the radius of the smallest sphere containing the training data and the width of the SVM margin. Opper and Winther proposed an upper bound based on the dot products between support vectors and their coefficients for the same type of SVM [107]. Vapnik and Chapelle improved the radius–margin bound by introducing the span of the support vectors to derive the upper bound of the leave–one–out error [135]. Bartlett and Shawe-Taylor derived a bound independent of the data to be analyzed and compatible with other classifiers [11, 122].

Another way to perform the analytical model selection is based on the analysis of the input mapping into the feature space, i.e. checking whether a mapping function provides a high data separability in some sense. One can cite, for example, the analysis of the separability in the feature space by measuring the ratio of the between–class scatter matrix and the within–class scatter matrix, as indicated in relation 3.12 [146]. Cristianini et al. analyzed the alignment of the kernel and the target (KTA) using formula 3.13 [27]. In a similar way, Baram introduced the kernel polarization (KPOL) using 3.14 [10]. The operator ⟨·,·⟩ denotes the inner product for matrices.

Maximizing these three measures is equivalent to maximizing the separation of the dataset in the feature space and thus allows to select the best kernel (with its parameters) for the classification problem. The advantage of such approaches is their independence from the classification algorithm and their low computational cost.

SEP = \frac{\operatorname{trace}(S_B^{\phi})}{\operatorname{trace}(S_W^{\phi})} \qquad (3.12)

KTA(K, y) = \frac{\langle K, yy^{T}\rangle}{\sqrt{\langle K, K\rangle \,\langle yy^{T}, yy^{T}\rangle}} \qquad (3.13)

KPOL(K, y) = \langle K, yy^{T}\rangle = \sum_{i,j} y_i y_j\, k(x_i, x_j) \qquad (3.14)
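The alignment (3.13) and polarization (3.14) are cheap to evaluate, which is their main appeal. The numpy sketch below computes both for a Gaussian kernel over a small grid of widths, using the standard Frobenius inner product ⟨A, B⟩ = ∑_{ij} A_{ij} B_{ij}; the dataset and the grid of widths are assumptions of this illustration.

```python
# Kernel-target alignment (3.13) and kernel polarization (3.14) for an RBF kernel.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel

X, y01 = make_classification(n_samples=200, n_features=6, random_state=0)
y = 2 * y01 - 1                                    # labels in {-1, +1}
Y = np.outer(y, y)                                 # target matrix y y^T

def frobenius(A, B):
    return np.sum(A * B)                           # <A, B> for matrices

for gamma in (0.001, 0.01, 0.1, 1.0):
    K = rbf_kernel(X, X, gamma=gamma)
    kta = frobenius(K, Y) / np.sqrt(frobenius(K, K) * frobenius(Y, Y))   # (3.13)
    kpol = frobenius(K, Y)                                               # (3.14)
    print(f"gamma={gamma:6}: KTA={kta:.3f}  KPOL={kpol:.1f}")
```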

Another analytical approach is the analysis of the regularization path [63]. This method evaluates the behavior of the support vectors according to the values of the regularization parameter C and can be used to select the most regular solution to avoid overfitting or underfitting. In the formulation of the SVM, we have seen that the values of the Lagrangian multipliers α_i vary according to the position of the examples with respect to the decision boundary. The size of the margin is also controlled by the value of the cost parameter C. Varying the value of this regularization parameter from 0 to +∞ will change the position of some training examples from inside to outside of the margin (or vice versa). This is illustrated in Figure 3.3 with artificial examples.

The decision boundary becomes less regularized for high values of C, i.e. the solution has a smaller margin. The successive values of the Lagrangian multipliers α_i (and thus the decision function) are piecewise linear in C, and their entire path can be interpolated if one can obtain the break points. The algorithm proposed by Hastie and colleagues computes the entire SVM regularization path at essentially the same computational cost as fitting a single SVM model with a particular value of C [63].
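The sketch below is not the exact path-following algorithm of [63]; it simply refits an SVM for increasing values of C and reports the number of support vectors, which typically shrinks as the solution becomes less regularized. The dataset, kernel width and grid of C values are arbitrary choices for the illustration.

```python
# Behavior of the support set when sweeping the regularization parameter C.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=0.8, flip_y=0.1, random_state=0)

for C in (0.01, 0.1, 1.0, 10.0, 100.0, 1000.0):
    clf = SVC(kernel="rbf", gamma=1.0, C=C).fit(X, y)
    print(f"C={C:>8}: support vectors = {clf.n_support_.sum()}")
```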

3.3.3 SVM and Dataset imbalance

In the previous chapter, we highlighted with the NI dataset the problems that may arise in the classification of imbalanced datasets. One of the possible explanations of the low performance in the prediction of the minority examples is the inherent objective of classification algorithms to learn regularities in the available dataset. Many data mining algorithms use some form of similarity measure, stating implicitly that similar objects form a group and share common statistical characteristics. The minority class is not represented often enough and is ignored as it is considered as noise [2].
