
A first analysis of clinical datasets built from the data available in the EHR was carried out in this chapter. The behavior of classification algorithms on imbalanced datasets was investigated. The necessity of obtaining a high sensitivity on positive examples was highlighted for a practical implementation of the data mining results. Selecting the most important variables significantly improves the results of classification algorithms, and attention should be paid to negative interactions among them. These operations require evaluation metrics other than accuracy to assess the performance of classification algorithms.

In our experiments, a better discretization of the workload parameter was not treated and should be investigated in future work. For the rest of the document, we shall use the ranking obtained with SVM–RFE, since kernel–based algorithms are the focus of this thesis. The variable ranking obtained with the IG measure could be improved by taking the data imbalance into account. This aspect should be investigated in future work by using measures sensitive to data imbalance, as proposed in [119, 149] to cite only a few.

Figure 2.4: Projection of the variables on the first two factor axes. (The figure legend lists the indicator variables, each clinical factor appearing as a NO/YES pair: McCabe score (non fatal, fatal < 6 months), transfer from another hospital at admission, congestive cardiomyopathy, diabetes with organ affected, obstetrical ward, intensive care unit ward, MRSA colonization (none, actual), antibiotic therapy, surgery, stay at the ICU during hospitalization, mechanical ventilation, workload value (< 45.5, between 45.5 and 91.5, > 91.5), central vein catheter, urinary tract.)

Chapter 3

Theoretical background

3.1 Introduction

Kernel-based algorithms for classification tasks, the selection of the classification model and the selection of relevant variables are described in this chapter. The objective of this chapter is to provide an intuitive explanation and to introduce the terminology of kernel–based algorithms built on the maximum margin classifier, and especially the support vector machine (SVM). A large body of contributions exists in the field of kernel–based algorithms; the reader is referred to [18, 34, 36] for more details on kernel–based algorithms, and a complete tutorial on SVM can be found in [20].

The main difference between kernel–based algorithms lies in the way the decision function is built. The formulation of the SVM for linearly separable and for non-linearly separable datasets, as well as the algebraic properties of the kernel, are provided in this chapter. Mathematical definitions related to kernel–based algorithms can be found in the appendix. In the rest of this document, all mathematical properties are defined on the real space R unless otherwise stated.

3.1.1 Classification setting

Roughly speaking, a classification task aims to predict the unknown class of an observation, which is a collection of d measurements related to a specific problem. Let us consider the set of observations {x_i, y_i}_{i=1}^{N}, where x_i is a vector in the input space X = R^d and is called an input example or a pattern, y_i is a real number taking its values in the set Y = {−1, +1} and is called the label, the class or the characteristic of the i-th example, N ∈ N is the number of examples and d ∈ N is the number of variables. A pattern is a vector because it is characterized by a set of descriptors or variables, and we are interested in the prediction of the characteristic of this pattern. It is assumed that the training points are independently and identically distributed according to a common distribution on X × Y. The values of the class are set to −1 and +1 for mathematical convenience and indicate that y takes binary values, hence the term (binary) classification. However, the class set may have more than two elements, in which case we have a multi–class classification.
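As a minimal illustration of this notation (a sketch in Python with synthetic data; the two-Gaussian toy example and all variable names below are our own and are unrelated to the clinical datasets of Chapter 2), the training set {x_i, y_i} can be materialized as an N × d matrix of patterns together with a vector of labels in {−1, +1}:

import numpy as np

rng = np.random.default_rng(0)

# N examples described by d variables: two Gaussian clouds, one per class
N, d = 100, 2
X_pos = rng.normal(loc=+1.0, scale=1.0, size=(N // 2, d))   # patterns of class +1
X_neg = rng.normal(loc=-1.0, scale=1.0, size=(N // 2, d))   # patterns of class -1

X = np.vstack([X_pos, X_neg])                        # input examples, shape (N, d)
y = np.hstack([np.ones(N // 2), -np.ones(N // 2)])   # labels in {-1, +1}

print(X.shape, y.shape)   # (100, 2) (100,)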

The objective of a classification task is to build a “model” (also known as a classifier, estimator, hypothesis or decision function) f allowing to predict the characteristic y of a new unseen pattern x ∈ R^d. Formally, a model is a function f : R^d → Y = {−1, +1} representing the output y for a given input observation x, and an error occurs when f(x) ≠ y. As a function, f has parameters allowing to achieve the best possible prediction, i.e. providing the smallest possible error. This model and its parameters are estimated, or learned, from the N observations called training data. Learning the best function and its parameters (also known as induction) amounts to learning the intrinsic regularity within the training data. For this purpose, the class of each of the N observations is provided. A classification task is thus a succession of an induction phase (or model selection phase) and a deduction phase allowing to predict the class of new examples.


The model selection (resp. the deduction phase) is based on a similarity relation between the training data (resp. between the training data and a new unseen pattern) and can be summarized as: similar inputs should lead to equivalent or similar outputs. The definition of similarity measures is therefore at the heart of classification.
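To make the induction/deduction phases and the role of similarity concrete, a small sketch follows (our own illustration; it uses a nearest-neighbour rule as the simplest possible similarity-based decision function, not the SVM introduced later in this chapter):

import numpy as np

def predict_nearest(x_new, X_train, y_train):
    """Deduction phase: assign to x_new the label of its most similar training pattern,
    where similarity is measured by the (negative) Euclidean distance."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    return y_train[np.argmin(distances)]

def leave_one_out_error(X_train, y_train):
    """Error rate of the rule when each training pattern is predicted
    from the remaining ones (the point itself is excluded)."""
    errors = 0
    for i in range(len(y_train)):
        mask = np.arange(len(y_train)) != i
        errors += predict_nearest(X_train[i], X_train[mask], y_train[mask]) != y_train[i]
    return errors / len(y_train)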

The knowledge of the class values in the induction phase makes classification fall into the category of supervised learning. The latter contains another type of learning: regression, for which Y = R. Obtaining labeled training data may be harder in some situations. If one is only provided with input examples (without the labels), finding interesting structure within such a dataset is called unsupervised learning. If one is provided with partially labeled examples and has to provide the labels of the remaining input examples, one is in the context of semi–supervised learning. Indeed, three categories of machine learning strategies are reported in the literature: supervised, unsupervised, and semi-supervised learning [106, 147]. The rest of the document focuses on classification problems (binary and multi–class).

Before providing the elements of the similarity definition with kernel–based algorithms for classification, let us recall some mathematical definitions and properties enabling to define similarity measures. The next section justifies the existence and the construction of the so–called feature space, which will be necessary in subsection 3.2.3. For this reason, the reader can switch directly to section 3.2 and return to this point once the kernel is introduced. Other mathematical concepts and properties related to vector spaces are presented in appendix B.

3.1.2 Kernels

Definition 3.1.1. A Reproducing Kernel Hilbert Space (RKHS) is a Hilbert space in which all the point evaluations are bounded linear functionals. Let us consider a class of functions F, forming a Hilbert space, defined on a set X. For all f ∈ F and all s ∈ X one can find M ∈ R such that |f(s)| ≤ M‖f‖. The Riesz representation theorem for bounded linear functionals states the existence of a unique element called the reproducing kernel map φ : t → k(·, t) ∈ F such that:

\[
f(t) = \langle f, \phi(t) \rangle = \langle f, k(\cdot, t) \rangle \qquad \forall t \in X,\ \forall f \in \mathcal{F}
\]

The function k(s, t) = ⟨φ(s), φ(t)⟩ is the reproducing kernel of X and it is positive definite:

\[
\sum_{i,j} \alpha_i \alpha_j \, k(s_i, s_j) \geq 0 \qquad \forall\, s_i, s_j \in X,\ \alpha_i, \alpha_j \in \mathbb{R}
\]
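A small worked example (our own, with the linear kernel; it is not taken from [4]) makes the reproducing property tangible. On X = R^d, take k(s, t) = ⟨s, t⟩ and let F = {f_w : f_w(x) = ⟨w, x⟩, w ∈ R^d} with the inner product ⟨f_w, f_v⟩_F := ⟨w, v⟩. The kernel map is then φ(t) = k(·, t) = f_t, and both the reproducing property and the positive definiteness can be verified directly:

\[
\langle f_w, \phi(t) \rangle_{\mathcal{F}} = \langle w, t \rangle = f_w(t),
\qquad
\sum_{i,j} \alpha_i \alpha_j \, k(s_i, s_j) = \Big\| \sum_i \alpha_i s_i \Big\|^2 \geq 0 .
\]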

In other words, the RKHS is a functional space in which all elements are constructed as linear combinations of the functions k(·, t). The RKHS, which is a special case of Hilbert space, allows to define from an input space X a new space having a similarity measure based on the dot product of continuous functions. The main question is how to build such a space. The following theorem of Moore and Aronszajn states the existence of a RKHS for every positive definite function and vice versa [4].

Mercer’s theorem allows to build the RKHS associated to a positive definite function k [100, 140].

Theorem 3.1.1. Moore–Aronszajn theorem

1. If a reproducing kernel k exists, it is unique;

2. For every RKHS, the reproducing kernel is positive definite;

3. For every positive definite function k(s, t) on X × X, there exists a unique RKHS.

Theorem 3.1.2. Mercer’s theorem

Let us consider the linear operator

\[
L_k : L^2(X) \to L^2(X), \qquad f(\cdot) \mapsto \int k(\cdot, t)\, f(t)\, dt .
\]

There exists in L^2(X) an orthonormal basis {ψ_i}_{i=1}^∞ of eigenfunctions of L_k such that, for all x and y in X,

\[
k(x, y) = \sum_{i=1}^{\infty} \lambda_i \psi_i(x) \psi_i(y) = \langle \phi(x), \phi(y) \rangle_{\ell_2} \qquad (3.1)
\]

where λ_i ∈ R^+ are the eigenvalues of L_k.

Mercer’s theorem provides a way of constructing a RKHS without explicitly defining the reproducing kernel map function φ, and also a graphical representation of the feature space by means of the eigenfunctions.
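The finite-sample analogue of equation (3.1) is easy to check numerically: the eigendecomposition of the Gram matrix plays the role of the eigenfunction expansion, and the scaled eigenvectors give an explicit empirical feature map. The sketch below (our own illustration, with a Gaussian kernel and numpy only) verifies that the Gram matrix is recovered from this expansion and that its eigenvalues are nonnegative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                       # 50 input examples in R^3

def gaussian_kernel(a, b, sigma=1.0):
    return np.exp(-np.linalg.norm(a - b) ** 2 / (2 * sigma ** 2))

K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])   # Gram matrix

# Empirical Mercer expansion: K = sum_i lambda_i v_i v_i^T
eigvals, eigvecs = np.linalg.eigh(K)               # symmetric eigendecomposition
K_reconstructed = (eigvecs * eigvals) @ eigvecs.T

print(np.allclose(K, K_reconstructed))             # True: K is recovered from the expansion
print(eigvals.min() >= -1e-10)                     # True: eigenvalues are (numerically) nonnegative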

The principle of kernel–based algorithms is to map the input examples into a feature space F according to a function φ and to compare them using the dot product, which captures their similarity in the feature space [121]. Let us consider F the set of potential feature spaces. This set has a large dimension, and the practical way to avoid scanning for the optimal feature space F is to use a kernel function in the input space which computes directly the dot product in F, without needing to define explicitly the mapping function φ. This process is called the “kernel trick”, and Table 3.1 provides the list of the most popular kernel functions.
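A minimal sketch of the kernel trick (our own example with a homogeneous polynomial kernel of degree 2 on R^2, one of the rare cases where φ can still be written down explicitly): the kernel evaluated in the input space coincides with the dot product of the explicitly mapped examples.

import numpy as np

def phi(x):
    """Explicit feature map of the degree-2 homogeneous polynomial kernel on R^2."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def k_poly2(x, z):
    """Same quantity computed with the kernel trick: no explicit mapping needed."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(z)))   # 1.0  (dot product in the feature space)
print(k_poly2(x, z))            # 1.0  (kernel evaluated in the input space)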

When the space X has a finite number of elements, on which we can define a symmetric positive definite function, the matrix K := [k(x_i, x_j)]_{i,j} built with the N elements x_1, x_2, . . . , x_N of X is called the Gram matrix of k. The positive definiteness of the kernel k can be derived from the positive definiteness of the Gram matrix as in Definition 3.1.2. As a function, the algebraic operations for a kernel are given in Proposition 3.1.1, which plays a fundamental role in this document [66].

Definition 3.1.2. A real symmetric N × N matrix K is positive semi–definite if

\[
\sum_{i,j=1}^{N} \alpha_i \alpha_j \, k(x_i, x_j) \geq 0 \qquad (3.2)
\]

for all α_i, α_j ∈ R and all x_i, x_j ∈ X. In matrix notation, (3.2) is equivalent to α⊤Kα ≥ 0 for all α = (α_1, . . . , α_N) ∈ R^N. When the inequality in (3.2) is strict, the matrix is called positive definite. When the equality in (3.2) is only achieved for α_1 = α_2 = . . . = 0, the Gram matrix is called a strictly positive definite kernel. When relation (3.2) holds under the additional condition ∑_{i=1}^{N} α_i = 0, the Gram matrix is called conditionally positive definite. The latter captures the dissimilarities among the input examples rather than the similarities.

In this document, we use the term kernel for positive semi–definite kernels unless stated otherwise.
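The quadratic form in (3.2) can be evaluated directly. The sketch below (our own numerical illustration) checks the positive semi–definiteness of a linear-kernel Gram matrix and, relying on the standard result that the negative Euclidean distance is conditionally positive definite, illustrates the role of the zero-sum constraint on the coefficients.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))                      # 20 examples in R^4

K_lin = X @ X.T                                   # Gram matrix of the linear kernel
D = -np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # negative pairwise distances

alpha = rng.normal(size=20)
print(alpha @ K_lin @ alpha >= 0)                 # True: (3.2) holds for the linear kernel

alpha0 = alpha - alpha.mean()                     # coefficients summing to zero
print(alpha0 @ D @ alpha0 >= -1e-10)              # True: -||x_i - x_j|| is conditionally positive definite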

Table 3.1: Most popular kernel functions

Kernel function    k(x_i, x_j)
Linear             ⟨x_i, x_j⟩ + a
Polynomial         (a ⟨x_i, x_j⟩ + b)^d
Exponential        exp(−‖x_i − x_j‖ / (2σ²))
Gaussian           exp(−‖x_i − x_j‖² / (2σ²))
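A sketch of how the kernels of Table 3.1 translate into code (our own implementation; the parameter names a, b, d and sigma mirror the table, and the exponential/Gaussian forms follow the reconstruction above):

import numpy as np

def linear_kernel(x, z, a=0.0):
    return np.dot(x, z) + a

def polynomial_kernel(x, z, a=1.0, b=1.0, d=3):
    return (a * np.dot(x, z) + b) ** d

def exponential_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) / (2 * sigma ** 2))

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

x, z = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(gaussian_kernel(x, z))   # exp(-1) ≈ 0.3679 for sigma = 1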

Proposition 3.1.1. Let us consider positive definite kernels k_1, k_2, . . . on X × X.

1. The set of positive definite kernels is a closed convex cone, that is, (a) if α_1, α_2 ≥ 0 then α_1 k_1 + α_2 k_2 is positive definite; and (b) if k(x_i, x_j) := lim_{n→∞} k_n(x_i, x_j) exists for all x_i, x_j, then k is positive definite;

2. The pointwise product k_1 k_2 is positive definite;

3. Assume that, for i = 1, 2, k_i is a positive definite kernel on X_i × X_i, where X_i is a nonempty set. Then the tensor product k_1 ⊗ k_2 and the direct sum k_1 ⊕ k_2 are positive definite kernels on (X_1 × X_2) × (X_1 × X_2).
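The closure properties of Proposition 3.1.1 can be checked numerically on Gram matrices. The sketch below (our own illustration) verifies that a nonnegative combination and the pointwise product of two positive semi-definite Gram matrices remain positive semi-definite.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
K1 = X @ X.T                              # linear-kernel Gram matrix
K2 = np.exp(-sq_dists / 2.0)              # Gaussian-kernel Gram matrix

def is_psd(K, tol=1e-10):
    return np.linalg.eigvalsh(K).min() >= -tol

print(is_psd(0.5 * K1 + 2.0 * K2))        # True: convex cone, alpha1*k1 + alpha2*k2 stays PSD
print(is_psd(K1 * K2))                    # True: pointwise (Hadamard) product stays PSD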

The construction of a feature space has received considerable attention from many researchers in the machine learning and pattern recognition communities. This was achieved by the definition of specific kernels according to the nature of the data to be analyzed. One can cite for example the Laplacian kernel for graph data, the ANOVA kernel and the hyperbolic tangent (sigmoid) kernel. The characteristics of available kernels can be found in [66, 123]. All existing kernels satisfy Mercer’s theorem. Genton studied the classes of kernels via the comparison of the Gram matrix with the covariance matrix [54]. This allows one to classify and build new kernels according to the spatial characteristics of the feature space. However, in this document, we focus our studies on the so–called universal kernels and especially the Gaussian kernels, also known as the radial basis function (RBF) kernels [66]. Such kernels provide an easy way to define the density estimator P(y|x) by means of the exponential kernel function.
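One simple reading of this last remark (a Parzen-window style sketch of our own, not the estimator developed later in this thesis): the Gaussian kernel can be used to smooth the training labels into an estimate of the class posterior P(y | x).

import numpy as np

def posterior_estimate(x_new, X_train, y_train, sigma=1.0):
    """Kernel-smoothed estimate of P(y = +1 | x) using a Gaussian (RBF) kernel."""
    weights = np.exp(-np.sum((X_train - x_new) ** 2, axis=1) / (2 * sigma ** 2))
    positive = weights[y_train == +1].sum()
    return positive / weights.sum()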