
Multiple kernel learning framework

Instead of mapping the input space with a single mapping function $\varphi$ as described in the previous chapters, $M$ functions $\varphi_1, \dots, \varphi_M$ are considered and positive numbers $\mu_1, \dots, \mu_M$ are associated with each of them. The weights $\mu_m$ express the importance of each kernel in the final model. In the following, the vector $\mu = (\mu_1, \dots, \mu_M)$ collects the weights of the kernels. In this setting, we have $M$ separating hyperplanes in the feature space, $w = (w_1, \dots, w_M)$, and the kernel matrix becomes a linear combination of base kernels, $K(x_i, x_j) = \sum_{m=1}^{M} \mu_m K_m(x_i, x_j)$.
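As a concrete illustration of this combination, the sketch below builds the combined Gram matrix from a few base kernels; the Gaussian base kernels, their widths and the toy data are arbitrary choices for illustration, not part of any particular MKL method.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gram matrix of a Gaussian (RBF) kernel evaluated on the rows of X."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def combined_kernel(kernels, mu):
    """K = sum_m mu_m K_m, the linear combination of base Gram matrices."""
    return sum(m_k * K_k for m_k, K_k in zip(np.asarray(mu, float), kernels))

# Toy usage: three RBF base kernels of different widths, uniform weights.
X = np.random.default_rng(0).normal(size=(50, 4))
base = [rbf_kernel(X, g) for g in (0.1, 1.0, 10.0)]
K = combined_kernel(base, mu=[1/3, 1/3, 1/3])
```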

Many generations of MKL formulations and solvers have been proposed in the literature, and most of them were applied to SVMs. In the following, MKL designates an SVM with multiple kernels unless explicitly specified otherwise. The computational performance and the structure of the final decision function (and thus of the final kernel) were the main axes of improvement from one generation to the next.

Bennett and colleagues use MKL to select the optimal combination of kernels (and their parameters) and solve the problem using a gradient descent method [12]. The algorithm is called MARK, for Multiple Additive Regression Kernels, and was applied to regression problems even though it can easily be adapted to a classification task. The decision function is of the form $f_{\alpha,b}(x) = \sum_{j=1}^{N} \sum_{m=1}^{M} \alpha_{m,j} K_m(x, x_j) + b$ where $\alpha \in \mathbb{R}^{M \times N}$. The decision function is thus a combination of $M$ decision functions obtained from $M$ feature-space representations of the data. In this study, the MKL problem is called a heterogeneous kernel problem or a mixture of kernels. The MKL problem is tackled from the regularization viewpoint, where the objective is to minimize a convex loss function $L(y, f_{\alpha,b})$ regularized with a convex function $p(\alpha, b)$, as stated by equation (6.1):

$$\min_{\alpha, b} \; \sum_{i=1}^{N} L\big(y_i, f_{\alpha,b}(x_i)\big) + \lambda\, p(\alpha, b) \qquad (6.1)$$

Assuming that the loss function and the regularization function are continuously differentiable, the problem can be solved using gradient descent methods in the space of $\alpha$ and $b$. This method is inspired by boosting algorithms and by kernel ridge regression. In a boosting algorithm, “weak learners” are linearly combined to provide a “strong learner” [52]. In kernel ridge regression, the least-squares loss function for regression with an $l_2$-norm regularization is used. The main characteristic of the proposed approach lies in the kernels used: any type of kernel can be used, even kernels whose matrices are not square positive definite. Indeed, the kernel columns (and thus the weights $\alpha_{m,j}$) maximizing the gradient of the loss function are iteratively selected until a stopping criterion is reached. The algorithm was evaluated on three benchmark datasets and provides good results. The authors demonstrated that the built model generalizes better than an SVM with a single kernel and does not require storing large amounts of data for future predictions. However, one of the benchmark datasets has more variables than examples and the authors did not apply feature selection before applying the SVM. MKL provides good performance on this dataset because it performs a feature selection during the resolution of the MKL problem. The open issues raised by the authors concern the interpretation and visualization of MKL results.
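As an illustration of solving (6.1) by gradient descent in the space of $\alpha$ and $b$, the sketch below uses a squared loss and a squared $l_2$-norm penalty; these choices, the step size and the iteration count are illustrative assumptions and not the exact MARK procedure.

```python
import numpy as np

def mkl_gradient_descent(kernels, y, lam=1e-2, lr=1e-3, n_iter=500):
    """Minimize 0.5 * sum_i (f(x_i) - y_i)^2 + 0.5 * lam * ||alpha||^2 by plain
    gradient descent, where f(x_i) = sum_j sum_m alpha[m, j] K_m(x_i, x_j) + b.
    `kernels` is a list of M symmetric Gram matrices of shape (N, N)."""
    M, N = len(kernels), y.shape[0]
    alpha = np.zeros((M, N))
    b = 0.0
    for _ in range(n_iter):
        f = sum(K_m @ alpha[m] for m, K_m in enumerate(kernels)) + b
        r = f - y                                            # residual of the squared loss
        grad_alpha = np.stack([K_m @ r for K_m in kernels]) + lam * alpha
        grad_b = r.sum()
        alpha -= lr * grad_alpha
        b -= lr * grad_b
    return alpha, b
```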

Lanckriet and colleagues applied MKL to a transduction (semi-supervised) task, where the objective is to select the most suitable kernel for the problem at hand instead of selecting a kernel through a cross-validation procedure. De Bie and Cristianini defined the transduction task as the estimation of a decision function allowing labels to be assigned to unlabeled data (the working set) from a set of labeled training data [16]. The work is inspired by the algebraic properties of kernels and by convex optimization over the set of positive semidefinite matrices. The optimal $\mu$ and $\alpha$ are found using semidefinite programming (SDP) with additional constraints on the kernels [89]:

• the kernel $K$ should be positive semidefinite;

• each $K_m$ is constrained to have unit trace to avoid the dominance of some kernels during the optimization (a normalization sketch is given after this list).
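A minimal way to enforce the unit-trace constraint in practice is to rescale each base Gram matrix beforehand, as sketched below; the function name is illustrative.

```python
import numpy as np

def normalize_to_unit_trace(kernels):
    """Rescale each base Gram matrix K_m so that trace(K_m) = 1, preventing
    any single kernel from dominating the combination by sheer scale."""
    return [K / np.trace(K) for K in kernels]
```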

Lanckriet and colleagues state the transductive MKL problem as a maximum-margin classifier. The decision function is built on a combination of multiple data representations in the feature space (expression 6.2), i.e. a single kernel resulting from the linear combination of the base kernels is used, in contrast with the previous method:

$$f(x) = \sum_{j=1}^{N} \alpha_j \sum_{m=1}^{M} \mu_m K_m(x, x_j) + b \qquad (6.2)$$

where $\mu \in \mathbb{R}^{M}$, $\alpha \in \mathbb{R}^{N}$ and $b$ are estimated from a training set, preferably in a single optimization step. To solve the problem, the primal formulation is cast into an SDP for the hard-margin, $l_1$-norm soft-margin and $l_2$-norm soft-margin criteria. The algorithm was applied on benchmark datasets. Two studies were carried out on these datasets:

1. the MKL approach was compared with the soft-margin SVM with an RBF kernel with respect to model selection;

2. the benefits of MKL for the analysis of data coming from heterogeneous sources (data fusion) were assessed against the classical SVM.

The obtained results were comparable to those obtained with a cross-validation procedure for the estimation of the SVM hyper-parameters. The time needed to solve the MKL problem is much lower because no cross-validation is required during its resolution. An improvement is also obtained by solving the MKL problem in its dual form rather than in the primal form, because the computation is reduced. Indeed, in the dual form, which is a special case of SDP called a quadratically constrained quadratic program (QCQP), the weights $\mu_m$ are constrained to be positive. It is also worth noting that with the $l_2$-norm soft-margin criterion, it is possible to optimize the cost parameter $C$ during the QCQP resolution. The main advantage of this method is its good generalization performance. However, semidefinite programming is computationally very expensive and the algorithm becomes intractable for problems with many kernels and many examples.
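For illustration only, the sketch below writes an $l_1$ soft-margin MKL dual of this kind as a QCQP with the cvxpy modeling package; the trace budget $c$, the exact scaling conventions and the recovery of the weights from the dual variables follow one common presentation and may differ in detail from the program solved in [89].

```python
import numpy as np
import cvxpy as cp

def mkl_qcqp(kernels, y, C=1.0, c=None):
    """Sketch of an l1 soft-margin MKL dual as a QCQP: maximize
    2*sum(alpha) - c*t with one quadratic constraint per base kernel.
    The kernel weights mu_m are read off (up to scaling) from the dual
    variables of those quadratic constraints."""
    N = y.shape[0]
    if c is None:
        c = sum(np.trace(K) for K in kernels)      # trace budget
    a = cp.Variable(N)
    t = cp.Variable()
    constraints = [a >= 0, a <= C, cp.sum(cp.multiply(y, a)) == 0]
    quad_cons = []
    for K in kernels:
        G = np.outer(y, y) * K                     # G_m = diag(y) K_m diag(y)
        G = 0.5 * (G + G.T) + 1e-8 * np.eye(N)     # symmetrize, tiny ridge for the PSD check
        quad_cons.append(cp.quad_form(a, G) / np.trace(K) <= t)
    prob = cp.Problem(cp.Maximize(2 * cp.sum(a) - c * t), constraints + quad_cons)
    prob.solve()
    mu = np.array([con.dual_value for con in quad_cons])
    return a.value, mu
```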

Bi and colleagues formulate the problem as a mixture of models, as done by Bennett and colleagues. The main goals of the study were to achieve higher sparsity of the solutions and to scale the algorithm to large problems. The main difference lies in the decision function that is built, $f(x) = \sum_{j} \sum_{m} \alpha_{m,j} K_m(x, x_j)$, where the $\alpha_{m,j}$ are the Lagrange multipliers associated with the $m$-th kernel. The sparsity of $w$ is achieved through column generation techniques (and thus a better generalization). The work is inspired by the LPBoost algorithm (a linear program), which was extended to a quadratic program. The computational cost is reduced compared to composite kernels.
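The column-generation idea can be illustrated with a greedy sketch on a squared loss: the candidate columns are the columns $K_m(\cdot, x_j)$ of the base Gram matrices, and only the selected columns receive nonzero coefficients. This is a simplification for illustration, not the LPBoost-derived quadratic program of Bi and colleagues.

```python
import numpy as np

def greedy_column_generation(kernels, y, n_rounds=20, lam=1e-2):
    """At each round, add the candidate column most correlated with the
    current residual to the active set, then refit a ridge-regularized
    least-squares model on the active columns only; the result is a sparse
    set of (kernel, example) columns with nonzero weights."""
    cols = np.hstack(kernels)                # shape (N, M*N): all candidate columns
    residual = y.astype(float).copy()
    active, coef = [], np.zeros(0)
    for _ in range(n_rounds):
        scores = np.abs(cols.T @ residual)   # |gradient| of the squared loss per column
        scores[active] = -np.inf             # never reselect an active column
        active.append(int(np.argmax(scores)))
        A = cols[:, active]
        coef = np.linalg.solve(A.T @ A + lam * np.eye(len(active)), A.T @ y)
        residual = y - A @ coef
    return active, coef                      # selected column indices and their weights
```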

Bach and colleagues formulate the MKL problem as the minimization of a weighted block $l_1$-norm (6.3), where $\lVert \cdot \rVert$ denotes an $l_2$-norm [7]. The dual of the proposed formulation is convex but not differentiable, and the authors added a Moreau–Yosida regularization to achieve this differentiability. This additional transformation allows the problem to be solved using optimization algorithms such as sequential minimal optimization (SMO), an efficient algorithm for solving the SVM problem. The weighted block $l_1$-norm $\big(\sum_{m=1}^{M} \mu_m \lVert w_m \rVert\big)$ induces sparsity of the vector $w$ and thus allows a selection of the most appropriate kernels for the classification problem. SMO ensures convergence to the global optimum of the problem, in particular for large datasets. The main issue related to this method is its scalability.
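For reference, the block-$l_1$-norm primal referred to here is commonly written as follows (a sketch in the spirit of [7]; the weights $d_m$ and the squaring convention vary between presentations):

$$\min_{\{w_m\},\, b,\, \xi} \; \frac{1}{2}\Big(\sum_{m=1}^{M} d_m \lVert w_m \rVert_2\Big)^{2} + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad y_i\Big(\sum_{m=1}^{M} \langle w_m, \varphi_m(x_i)\rangle + b\Big) \ge 1 - \xi_i, \;\; \xi_i \ge 0.$$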

Sonnenburg and colleagues address the scalability issue of MKL for classification and regression tasks [126]. They reformulated the weighted block $l_1$-norm problem of Bach and colleagues as a semi-infinite linear program (SILP) and solved it in two alternating steps: a linear program to update the kernel weights and a classical SVM resolution. However, the algorithm needs many iterations to converge. Zien and colleagues derived from this approach a formulation of MKL for multi-class classification problems [148].

Inspired by the work of Bach and Sonnenburg, Rakotomamonjy and colleagues introduced a new MKL formulation which can be solved using a gradient descent algorithm [115]. The MKL problem is solved in two steps, as a succession of a classical SVM resolution (6.5) and an optimization of the kernel coefficients using a gradient descent algorithm (6.4):

$$\min_{\mu} \; J(\mu) \quad \text{such that} \quad \sum_{m=1}^{M} \mu_m = 1, \;\; \mu_m \ge 0 \qquad (6.4)$$

where $J(\mu)$ is the optimal objective value of the classical SVM problem (6.5) solved with the combined kernel $K = \sum_{m=1}^{M} \mu_m K_m$.

A MATLAB toolbox called SimpleMKL is available to solve the latter formulation [116], and the algorithm used to solve the problem is provided hereafter. It is important to notice that when $\mu_m$ approaches 0, then $\lVert w_m \rVert$ should also approach zero to preserve the convexity (concavity) of $J(\mu)$. The scalability of this method is equivalent to that of the classical SVM algorithm, because (6.5) is a classical SVM formulation with the kernel $K(x_i, x_j) = \sum_{m} \mu_m K_m(x_i, x_j)$.

The method of Rakotomamonjy and colleagues, detailed in Table 6.1, assumes that $J$ is differentiable and that the gradient of $J$ can be computed as if the solution of (6.5) did not depend on $\mu_m$. The latter is achieved thanks to the strict concavity of $J(\mu)$, which implies the uniqueness of the solution. Rakotomamonjy and colleagues proved the equivalence of their formulation with the one proposed by Bach and colleagues in [7], so their formulation also allows the selection of the most appropriate kernels for a learning problem.
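Concretely, if $\alpha^{\star}$ denotes the SVM solution obtained with the current $\mu$, the gradient used at step 4 of Table 6.1 takes the following form (following the derivation given in [115]):

$$\frac{\partial J}{\partial \mu_m} = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i^{\star} \alpha_j^{\star} y_i y_j K_m(x_i, x_j), \qquad m = 1, \dots, M.$$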

Table 6.1: SimpleMKL algorithm

1. Initialize $\mu_k^1 = \frac{1}{M}$ for $k = 1, \dots, M$
2. For $t = 1, 2, \dots$ do
3. solve the classical SVM problem with $K = \sum_{k} \mu_k^t K_k$
4. compute $\frac{\partial J}{\partial \mu_k}$ for $k = 1, \dots, M$
5. compute the descent direction $\Delta_{t,k}$ and the optimal step $\gamma_t$
6. $\mu_k^{t+1} \leftarrow \mu_k^t + \gamma_t \Delta_{t,k}$
7. if the stopping criterion is reached then
8. break
9. end if
10. end for
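A minimal sketch of this loop is given below; it uses scikit-learn's SVC with a precomputed kernel for step 3 and replaces the reduced-gradient direction and line search of steps 5 and 6 with a fixed-step projected-gradient update onto the simplex, so it is a simplified stand-in for SimpleMKL rather than a reimplementation of [116].

```python
import numpy as np
from sklearn.svm import SVC

def project_simplex(v):
    """Euclidean projection of v onto the simplex {mu >= 0, sum(mu) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def simple_mkl_sketch(kernels, y, C=1.0, step=0.05, n_iter=50, tol=1e-4):
    """Alternate between (i) a standard SVM solved with the combined kernel
    K = sum_k mu_k K_k and (ii) a projected-gradient update of mu."""
    M = len(kernels)
    mu = np.full(M, 1.0 / M)                                 # step 1: uniform weights
    for _ in range(n_iter):
        K = sum(m * Km for m, Km in zip(mu, kernels))
        svm = SVC(C=C, kernel="precomputed").fit(K, y)       # step 3
        sv, d = svm.support_, svm.dual_coef_[0]              # d_i = y_i * alpha_i
        grad = np.array([-0.5 * d @ Km[np.ix_(sv, sv)] @ d   # step 4: dJ/dmu_k
                         for Km in kernels])
        new_mu = project_simplex(mu - step * grad)           # steps 5-6 (simplified)
        if np.linalg.norm(new_mu - mu) < tol:                # step 7: stopping criterion
            mu = new_mu
            break
        mu = new_mu
    return mu, svm
```

The fixed step and simplex projection keep the sketch short; the actual toolbox computes a reduced-gradient descent direction and performs a line search on $\gamma_t$, which converges in fewer outer iterations.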