Clinical data mining with Kernel-based algorithms

IAVINDRASANA, Jimison


IAVINDRASANA, Jimison. Clinical data mining with Kernel-based algorithms. Thèse de doctorat : Univ. Genève, 2010, no. Sc. 4263

DOI : 10.13097/archive-ouverte/unige:84626

Available at:

http://archive-ouverte.unige.ch/unige:84626


UNIVERSITÉ DE GENÈVE FACULTÉ DES SCIENCES

Département d’Informatique Professeur C. Pellegrini

UNIVERSITÉ DE GENÈVE FACULTÉ DE MÉDECINE

Département d’Imagerie et des Sciences de l’Information Médicale Professeur A. Geissbuhler

HÔPITAUX UNIVERSITAIRES DE GENÈVE

Direction des Analyses Médico–Économiques Docteur G. Cohen

Clinical Data Mining with Kernel–based Algorithms

THÈSE

présentée à la Faculté des sciences de l’Université de Genève pour obtenir le grade de Docteur ès sciences, mention informatique

par

Jimison Iavindrasana

de Madagascar

Thèse No 4263

GENÈVE 2010


Acknowledgements

First, I would like to thank Dr Gilles Cohen and Pr Henning Müller for their continuous support, advice and enthusiasm during the preparation of this thesis. The former supervised this thesis and the latter supervised my work within the @neurIST project. I have learned a lot from them and the human aspect of our relationships eased many things.

I would also like to thank my thesis director Pr Christian Pellegrini for trusting me and for allowing me the time required to carry out this thesis. I am also grateful to Pr Antoine Geissbühler for welcoming me into the Medical Informatics Service of the University Hospitals of Geneva. His innovative and futuristic vision of medical informatics stimulated me.

I thank Pr Djamel Abdelkader Zighed, Pr Thomas Perneger and Dr Ariel Beresniak for doing me the honor of taking part in the jury.

I would also like to thank all the colleagues at the Medical Informatics Department. Thanks to Adrien Depeursinge and Xin Zhou, with whom I had many useful discussions and much feedback.

Thanks to Claudine Bréant, Philippe Rossier, Eric Burgel, Pascal Philippe, Florian Mauvais and Martine Burford for making my stay at the office enjoyable.

I am very grateful to my parents, sister and brother for all their support.

Last but not least, I would like to thank my wife Nina for her unconditional support and patience.



Contents

Abstract vii

Résumé ix

1 Introduction 1

1.1 Motivations and challenges . . . 1

1.2 Thesis contributions and outline . . . 4

2 Clinical data mining 7

2.1 Introduction . . . 7

2.2 Challenges in medical data analysis . . . 7

2.2.1 Data availability and collection . . . 7

2.2.2 Run–time speed and performance of algorithms . . . 8

2.2.3 Class imbalance problems . . . 8

2.2.4 Curse of dimensionality . . . 8

2.2.5 Interpretability . . . 9

2.2.6 Use of the discovered knowledge . . . 9

2.3 Nosocomial infection dataset . . . 9

2.4 TALISMAN dataset . . . 10

2.5 Learning framework . . . 12

2.5.1 Model selection . . . 12

2.5.2 Model evaluation . . . 12

2.6 Baseline setup . . . 13

2.6.1 Comparative performance of nosocomial infection classification . . . 13

2.6.2 Discussion and partial conclusion . . . 15

2.6.3 Minimal set of attributes required for nosocomial infection classification . . 16

2.6.4 Discussion and partial conclusion . . . 19

2.7 Summary . . . 20

3 Theoretical background 23

3.1 Introduction . . . 23

3.1.1 Classification setting . . . 23

3.1.2 Kernels . . . 24

3.2 Maximum margin classifier . . . 26

3.2.1 Linear classification . . . 26

3.2.2 Classification of a linearly separable dataset . . . 27

3.2.3 Classification of non-linearly separable dataset . . . 29

3.2.4 Algorithm complexity . . . 30

3.3 Model selection . . . 30

3.3.1 Empirical model selection . . . 30

3.3.2 Analytical model selection . . . 31

3.3.3 SVM and Dataset imbalance . . . 32

3.3.4 Performance measures . . . 33


3.4 Variable selection . . . 34

3.5 Summary . . . 36

4 SVM generalization ability for imbalanced datasets 39

4.1 Introduction . . . 39

4.2 Handling imbalanced datasets with SVM . . . 40

4.2.1 Resampling approach . . . 40

4.2.2 Cost–sensitive learning with SVM . . . 41

4.3 SVM Model selection . . . 42

4.3.1 VC bound . . . 43

4.3.2 Span bound . . . 43

4.3.3 Radius margin bound . . . 44

4.3.4 Radius margin bound for imbalanced datasets . . . 44

4.3.5 Radius–margin bound and model selection . . . 45

4.4 Experiments . . . 46

4.4.1 Motivations . . . 46

4.4.2 Historical validation of nosocomial infection model . . . 46

4.4.3 Lung tissue classification . . . 49

4.4.4 Model selection using the gradient descent . . . 50

4.4.5 Classification performance . . . 50

4.5 Summary . . . 54

5 Variable relevance validation 55

5.1 Introduction . . . 55

5.2 Variable selection problem and methods . . . 55

5.2.1 Variable ranking and filtering . . . 56

5.2.2 Wrapper method . . . 56

5.2.3 Embedded method . . . 57

5.3 Variable selection with SVM . . . 57

5.3.1 Sparse SVM . . . 58

5.3.2 SVM with recursive feature elimination . . . 58

5.3.3 Feature selection using scaling factors . . . 58

5.4 Experiments . . . 59

5.4.1 Motivation and objective . . . 59

5.4.2 Material and methods . . . 59

5.4.3 Results . . . 60

5.4.4 Discussion and conclusion . . . 61

5.5 Summary . . . 61

6 SVM with multiple kernels 67

6.1 Introduction . . . 67

6.2 Multiple kernels learning framework . . . 67

6.3 Multiple kernel regularization path . . . 70

6.3.1 Regularization paths . . . 70

6.3.2 MK–SVM regularization path . . . 71

6.4 Kernels, model and variables selection with MKL . . . 71

6.4.1 Motivations . . . 71

6.4.2 Experimental setup with the nosocomial infection dataset . . . 72

6.4.3 Kernel(s) selection . . . 73

6.4.4 Model interpretability . . . 80

6.5 Discussion . . . 85

6.6 Summary . . . 86


7 Conclusion 87

7.1 Summary . . . 87

7.1.1 SVM generalization ability . . . 87

7.1.2 Variable relevance analysis . . . 88

7.1.3 Model and variables selection with MKL . . . 88

7.2 Future work . . . 88

A Nosocomial infection variables: definition and ranking 91

B Elements of topology 101

C Quasi–regularization path for three benchmark datasets 103

Notation 113

Glossary 115

List of Figures 117

List of Tables 119

Bibliography 120

Index 131


Abstract

Healthcare is a domain where information plays an important role in all decision making, and electronic health record (EHR) management systems are tools that help healthcare professionals in this respect. Since the 1950s, many researchers have investigated the use of digitized information to assist healthcare professionals in their decision making, especially for diagnostic purposes. These clinical decision support systems (CDSS) match individual patient information to a computerized knowledge base, and algorithms generate recommendations for this specific patient. Power distinguished four components of a decision support system: 1) the user interface, 2) the database, 3) the models and analytical tools, and 4) the decision support system architecture and network.

This thesis deals with the development of knowledge–driven CDSS: it addresses the creation of data mining models, which constitutes a pillar of future CDSS. More specifically, the selection of an efficient model, the selection of optimal variables for prediction, model interpretability and historical validation are investigated. These data mining issues are tackled using the support–vector machine with single and multiple kernels. For these purposes, we have chosen two real clinical applications: nosocomial infection prediction and the classification of interstitial lung diseases.

The University Hospitals of Geneva have been performing yearly comprehensive nosocomial infection (NI) prevalence surveys since 1994. To evaluate the prevalence of NI, the EHRs of all patients admitted for more than 48 hours on the day of the survey are analyzed by infection control practitioners (ICPs). If necessary, additional information is obtained by interviews with nurses or physicians in charge of the patient. This manual data collection is compatible with the methodology proposed by the American Centers for Disease Control and Prevention, but it is labor intensive. Our hypothesis is that if one could find a small subset of the 83 prevalence database attributes having high predictive power and which could be derived easily from the data warehouse attributes, one could suggest to the ICPs and/or healthcare professionals a list of high–risk patients for investigation by analyzing the hospital data warehouse on a regular basis. This clinical problem constitutes the main clinical application of our approaches.

Interstitial lung diseases (ILD) form a heterogeneous group of diseases containing more than 150 disorders of the lung tissue. Many of these diseases are rare and present unspecific symptoms. Besides the patient’s clinical data, imaging of the chest allows resolving an ambiguity in a large number of cases by enabling the visual assessment of the lung tissue. The gold standard imaging technique used in case of doubt is high–resolution computed tomography (HRCT), which provides three–dimensional images of the lung tissue with high spatial resolution. We apply our model selection method to the classification of the 6 most common lung tissue patterns (healthy tissue, emphysema, ground glass, fibrosis, micronodules and consolidation) characterized by their texture properties in HRCT images. The rarity of some texture patterns in an HRCT image adds another level of complexity to the analysis of these data.

The work we carry out in this thesis faces several challenges. First, the high imbalance of positive cases in our datasets (NI and ILD) may lead to high accuracies without being sensitive. Second, the validation of a model on a newly available dataset may cause problems because the population analyzed may not have the same probability distribution, due to many factors such as the evolution of medical practice. Third, the trade–off between the interpretability and the complexity and/or stability of a model should be addressed. Fourth, the possible implementation of the results is investigated.

We investigated two ways of dealing with imbalanced datasets. The first method uses an undersampling approach during model selection. The second method transforms the algorithm to handle the dataset imbalance. The latter is applied to (1) the classical SVM, thanks to an upper bound on the leave–one–out error, and (2) the SVM with multiple kernels. The historical validation of models is of great importance if one wants to use the output of a data mining process. On the one hand, the historical evaluation results give insight into the effect of the preventive measures against nosocomial infection. On the other hand, the outcomes of the experiments show the limits of distribution–free algorithms such as the SVM. The interpretability of the models is investigated according to the importance of the variables participating in the classification. This is done with the classical SVM and its version with multiple kernels. New knowledge emerged from the latter, but its direct use is challenging as the variables were most of the time filled in a subjective manner. Many possible improvements are suggested, such as the optimization of the multiple kernel algorithms by the use of bounds on the generalization error and the development of infrastructures dedicated to data extraction and analysis.

(12)

Résumé

L’information joue un rôle crucial dans les processus de décision et le dossier patient informatisé (DPI) est un outil développé pour aider les professionnels de santé dans ce but. Depuis les années 50, plusieurs chercheurs ont étudié la meilleure manière d’exploiter les informations des patients pour aider les professionnels de santé, notamment dans le processus de diagnostic. Ces outils d’aide à la décision cherchent la correspondance des informations individuelles du patient à une base de connaissances numériques en vue de générer des recommandations. Power a distingué quatre composantes d’un outil d’aide à la décision : 1) l’interface utilisateur, 2) la base de données, 3) les modèles et les outils d’analyse, et 4) l’architecture informatique du système.

Cette thèse traite le développement d’un système d’aide à la décision et se focalise sur la création de « modèle ». Plus précisément, la sélection de modèle optimal, la sélection de variables les plus significatives, l’interprétabilité du modèle et sa validation historique sont analysées. Les deux premiers sont des problèmes courants en fouille de données tandis que les deux derniers ne sont pas souvent abordés, notamment depuis l’avènement des algorithmes dits « boîtes noires ». Ces questions sont traitées avec l’algorithme MVS ou machine à vecteurs supports à unique et multiples noyaux. Deux applications cliniques ont ainsi été choisies : la prédiction de cas d’infection nosocomiale et la classification de tissus pulmonaires caractérisés par des maladies interstitielles.

Des études de prévalence des infections nosocomiales sont effectuées annuellement au sein des Hôpitaux Universitaires de Genève depuis 1994. L’évaluation du taux de prévalence se fait par l’analyse des dossiers de tous les patients hospitalisés depuis plus de quarante-huit heures par rapport au jour de l’enquête. Le praticien alimente ainsi une base de données synthétisant l’état du patient. Selon les besoins, d’autres informations peuvent être rajoutées à partir d’interviews des infirmières ou médecins en charge du patient. Cette collecte de données pourrait être allégée si on arrive à extraire un petit nombre de variables les plus significatives et dont les valeurs pourraient être déduites facilement à partir des données stockées dans l’entrepôt de données de l’hôpital. Après une application d’algorithme de fouille de données à un extrait de l’entrepôt, on pourrait ainsi soumettre une liste de patients à risque pour lesquels les praticiens devraient concentrer leurs efforts. Cette hypothèse constitue la principale application des approches développées dans ce manuscrit.

Les maladies interstitielles pulmonaires (MIP) rassemblent un groupe hétérogène de plus de 150 maladies affectant le tissu pulmonaire. La plupart de ces maladies sont rares et présentent des symptômes non spécifiques. La radiographie du poumon permet une évaluation visuelle du tissu pulmonaire conduisant à dissiper les ambiguïtés dans un grand nombre de cas. La technique d’imagerie appelée tomodensitométrie à haute résolution (TDMHR) est la technique de référence car elle fournit des images tridimensionnelles du poumon avec une haute résolution spatiale. Nous appliquons notre méthode de sélection de modèle pour la prédiction des 6 maladies du tissu pulmonaire les plus courantes caractérisées par leur texture extraite des images TDMHR. Outre la rareté de certaines maladies au niveau de la population en général, une autre difficulté vient de la rareté de certaines textures dans une série d’images TDMHR.

Cette thèse apporte ses contributions à quatre principales problématiques. La première concerne la méthode d’analyse de données déséquilibrées sur lesquelles les méthodes de fouille de données peuvent avoir un faible taux d’erreur sans être sensibles aux cas mal représentés. La deuxième contribution concerne la portabilité de modèles prédictifs en les évaluant dans le temps. La troisième contribution est l’analyse du compromis entre l’interprétabilité du modèle et sa complexité et/ou stabilité. La quatrième contribution concerne l’analyse de l’exploitation des résultats obtenus.

L’analyse des données déséquilibrées a été abordée de deux manières : en sous-échantillonnant la classe majoritaire lors de la sélection de modèle et en adaptant l’algorithme de classification au déséquilibre des classes. Cette seconde approche a été implémentée sur une MVS à noyau unique et à noyaux multiples. La portabilité des modèles est d’une importance capitale si l’on souhaite implémenter les résultats d’une fouille de données. Les résultats que l’on a obtenus avec les données de prévalence des infections nosocomiales ont permis de : 1) mesurer l’effet des actes de prévention contre les infections nosocomiales, et 2) montrer la limite des algorithmes n’impliquant pas la probabilité de distribution des données dans leur formulation. L’interprétabilité des modèles est analysée selon l’importance des variables participant à la classification. Ceci a été effectué avec une MVS classique et sa version avec plusieurs noyaux. Une nouvelle connaissance particulièrement intéressante a émergé de cette dernière. Son implémentation directe ouvre une nouvelle piste de recherche due à la difficulté d’obtenir automatiquement, depuis l’entrepôt de données, les valeurs des deux variables qui participent à la création du modèle. De nombreuses pistes d’améliorations sont proposées telles que l’optimisation des MVS à noyaux multiples par l’utilisation de bornes sur l’erreur de généralisation et aussi le développement d’infrastructures dédiées à l’extraction et à l’analyse de données.


Chapter 1

Introduction

1.1 Motivations and challenges

The main vocation of a healthcare organization is to provide individualized patient care. Healthcare is a domain where information plays an important role in all decision making, and electronic health record (EHR) management systems are tools that help healthcare professionals in this respect. Information technology (IT) is now at the heart of modern health organizations and is valuable for the management (collection, storage and display), analysis and transmission of patient health information. Since the 1950s, many researchers have investigated the use of digitized information to assist healthcare professionals in their decision making, especially for diagnostic purposes. Until the 1980s, many clinical decision support systems (CDSS) were proposed to assist the healthcare professionals (HPs). These systems mimicked the “expert” HPs with respect to diagnostic problem solving. Formally, the CDSS matches the individual patient information with a computerized knowledge base and algorithms generate recommendations for this specific patient [53]. Power distinguished four components of a decision support system: 1) the user interface, 2) the database, 3) the models and analytical tools, and 4) the decision support system architecture and network [109].

Logic and probabilistic reasoning were the basis of the majority of these CDSS at the model and analytical tools level. These systems operated as “oracle” systems and the HPs played passive roles as providers of patient-specific information. The development of IT from the 1980s reshaped the characteristics of the CDSS. Personal computers connected via local area networks allowed the distribution and accessibility of the EHR from computers throughout the healthcare organizations. During the same period, many types of CDSS emerged due to the development of formal models such as the artificial neural network from the artificial intelligence domain. The “oracle” mode of the CDSSs was abandoned because of a lack of adherence from the HPs’ side, and CDSSs were developed to only provide additional knowledge, in the form of alerts or reminders to cite only a few, in order to improve the performance of their users. However, the hospital information systems were organized department by department and the data warehouse was the only tool providing longitudinal retrospective views of the patient data [95].

Over the last 10 years, EHR management systems have achieved a spectacular development: all clinical information of a patient is accessible everywhere from the hospital network. The communication of information between departments was improved thanks to the implementation of better communication protocols and better information coding. This situation allows the CDSS to be integrated into the EHR management systems, to be validated at multiple points in time and to be evaluated in a clinical context. To preserve the performance of the EHR management system, the data stored in the data warehouse are mainly used for (possibly computationally expensive) analyses [69, 78, 82, 92, 103].


The hospital data warehouse is not updated in a continuous manner; it is updated in batch mode on a regular basis and the stored data are not modified anymore [95]. In order to analyze the data, one may have to transform, filter and aggregate them. A data mart is a subset of a data warehouse developed for a specific analysis purpose [3, 76, 98]. It is important to notice that the data from the production database may have undergone these operations before their incorporation into the data warehouse. Preparing data marts for any research topic within a healthcare organization is time consuming and needs extra man power. Chaudhry and colleagues discussed in their systematic review the low rate of deployment of CDSS in real practice [25]. The development of a CDSS is a long incremental process, and the development of high–quality and efficient CDSS may be hindered by the difficulty of accessing high–quality datasets, which is still an open issue [73]. Indeed, medical data are not perfect and contain, among others, redundancies, inconsistencies and missing parameter values due to the patients’ variability. Analyzing such data presents a great challenge.

Hospital data warehouses were developed primarily to support billing systems and legal reporting for the government, but they were enriched with other data along with the development of the EHR management systems within healthcare organizations. They contain primarily administrative information about the patients, the laboratory results, the procedures and the final diagnoses. Ordered and administered medications, vital signs and monitoring data may also be found in the hospital data warehouses, depending on the development degree of the EHR management systems. Some information, such as the discharge letter, may also be available in textual format, i.e. unstructured. The imaging data are stored in the picture archival and communication system (PACS), which is a specific type of data warehouse for image data. Roughly speaking, the healthcare organizations’ archives contain data spanning the cellular and tissue level, the organ level and the physiological level.

This thesis deals with the development of knowledge–driven CDSS according to Power’s decision support systems classification [109]. It addresses the creation of data mining models for supervised learning, which constitutes a pillar of future CDSS. Clinical data mining is the application of data mining to clinical data [72]. Data mining is a methodological extraction of knowledge, patterns, useful information, or trends from retrospective, massive, and multidimensional data.

According to Fayyad, data mining is the application of specific algorithms for extracting patterns from massive data collections and is only one step in the knowledge discovery in databases (KDD) process [48]. However, in practical settings, it is broadly equated with the whole KDD process [69].

Data mining, as described in [48], is an iterative and interactive process having nine steps to extract latent information from data: 1) business understanding, 2) data set selection, 3) data cleaning and preprocessing, 4) data reduction and projection, 5) matching the objective defined in step 1 to a data mining method (classification, clustering, regression, etc.), 6) choice of the algorithm and search for data patterns, 7) pattern extraction, 8) data interpretation and 9) use of the discovered knowledge. The knowledge discovery process as Fayyad et al. have defined it in [48] is depicted in Figure 1.1. This thesis addresses steps 5 to 8, but steps 1 to 4 and step 9 are discussed when appropriate. To be more specific, it treats the issues related to model and variable selection for classification tasks (see Chapter 2) using support–vector machines with single and multiple kernels [102]. The validation of a built model on new datasets available later on and the interpretability of the models are also addressed. For these purposes, we have chosen two clinical applications: nosocomial infection prediction and the classification of interstitial lung diseases.

Figure 1.1: The knowledge discovery process according to Fayyad et al.

The University Hospitals of Geneva (HUG) have been performing yearly nosocomial infection (NI) prevalence surveys since 1994. Prevalence of NI is presented as the prevalence of infected patients, defined as the number of infected patients divided by the total number of patients hospitalized at the time of the study, and the prevalence of infections, defined as the number of NIs divided by the total number of patients hospitalized at the time of the study [120]. To evaluate the prevalence of NI, the EHRs of all patients admitted for more than 48 hours on the day of the survey have to be analyzed by infection control practitioners (ICPs). If necessary, additional information is obtained by interviews with nurses or physicians in charge of the patient. This manual data collection is labor intensive. The data are mainly collected for statistical analysis.

Our hypothesis is that if one could find a small subset of the 83 prevalence variables having high predictive power and which could be derived easily from the data warehouse attributes, one could suggest to the ICPs and/or HPs a list of high–risk patients for investigation by analyzing the hospital data warehouse on a regular basis. Having the smallest number of highly predictive variables is of primary importance because the values of some NI database attributes are filled in a subjective manner and others represent an aggregation of multiple pieces of information from the EHR.

Beyond the number of variables, the imbalance between the number of positive and negative examples in the dataset makes the construction of the model challenging. The HPs carry out many preventive measures to eradicate, or at least reduce, the prevalence of NI. Setting up a general model able to take into account the changes due to these preventive measures is also of importance. Last but not least, the final users of the model are the HPs, and the model should be transparent enough so that they can adhere to its implementation in the clinical setting. This clinical problem constitutes the main clinical application of our approaches.

Interstitial lung diseases (ILD) form a heterogeneous group of diseases containing more than 150 disorders of the lung tissue. Many of the diseases are rare and present unspecific symptoms. Besides the patient’s clinical data, imaging of the chest allows resolving an ambiguity in a large number of cases by enabling the visual assessment of the lung tissue [50]. The gold standard imaging technique used in case of doubt is high–resolution computed tomography (HRCT), which provides three–dimensional images of the lung tissue with high spatial resolution. The main objective with this clinical problem is to characterize each pixel of the HRCT image (and not to propose a diagnosis). The main challenge here is the number of patterns one may find in an HRCT image and thus the number of examples one can create from an HRCT image. Having a robust classifier capable of handling a large number of examples with a relatively high number of variables is of central importance. The number of tissue patterns varies according to the age of the patient, and some tissue patterns may be more common than others, again inducing an imbalance of the classes. We focus in this study on the 6 most common lung tissue patterns (healthy tissue, emphysema, ground glass, fibrosis, micronodules and consolidation) characterized by their texture properties in HRCT images.

In this study, we deliberately focus on kernel–based algorithms and especially the support–vector machine algorithm (SVM) to build our CDSS model, due to their practical performance in many classification and regression problems [102]. However, comparisons with other algorithms are carried out when appropriate. The application of SVMs in this thesis is restricted to classification problems (binary and multi–class). SVMs have proven to be highly accurate for nosocomial infection prediction [30–32] and for interstitial lung tissue classification [41].

In this thesis, we want to go further in this direction by using other implementations of this algorithm: the speed of the model creation and the interpretability of the models guided our choice. The speed (but also the inherent accuracy) of the algorithm is important to carry out multiple experiments and to analyze voluminous datasets such as those derived from chest radiography to detect abnormal interstitial lung tissue. Justice and colleagues argued that the generalizability of a model should be assessed, among others, from a historical viewpoint, and call such a property the transportability of a model [83]. The speed of a learning algorithm is also valuable to carry out such a validation. Interpretability is necessary for the HPs in the evaluation and adoption of the developed models. The adoption by HPs is considered one of the keys to the success of an IT implementation [65]. The interpretability of the NI models is analyzed through feature selection with single and multiple kernels.

The work we carry out in this thesis faces several challenges. First, the high imbalance of the positive cases in our datasets (NI and ILD) may lead to high accuracies without being sensitive. Indeed, a binary classification algorithm with 90% accuracy on a dataset having 10% of positive examples may be useless if it classifies all the examples in the negative class. Second, the validation of a model on a newly available dataset may cause problems because the population analyzed may not have the same probability distribution due to the evolution of medical practice. Third, the trade–off between the interpretability and the complexity and/or stability of a model should be addressed. Fourth, the NI prevalence datasets we are using were collected manually for statistical analysis and may contain subjective values. The future deployment of the model may be hindered by missing information in the EHR.

1.2 Thesis contributions and outline

The main contributions of this thesis are found in Chapters 4, 5 and 6. Chapters 2 and 3 are introductions to the clinical problems and the SVM algorithm. The rest of the thesis is organized as follows.

Chapter 2 gives the description of the datasets to be analyzed and the challenges in their analysis. The datasets comprise those produced at the HUG (the NI and ILD datasets) and three benchmark datasets. This chapter also describes the experimental setup used throughout the thesis and the first analyses carried out with the NI and ILD datasets.

Chapter 3 provides primarily the theoretical background on kernel–based algorithms. The principles of model selection, evaluation and variable selection are also provided. The originality of this chapter lies in the presentation of these algorithms. Indeed, the algorithms are introduced in a more intuitive way and their links with the clinical data mining challenges are presented.

Chapter 4 describes the construction of the models using a kernel–based algorithm exploiting a bound on the generalization error. This chapter introduces two main contributions of this thesis. The first contribution concerns the methods to handle imbalanced datasets. Two approaches are proposed to handle imbalanced datasets using the AUC criterion for model selection: 1) a data–driven approach where we under–sample the majority class, and 2) a combination of a data–driven method based on the synthetic minority over–sampling technique and an algorithm–driven method based on the asymmetrical cost of the SVM. The second contribution concerns the historical validation of the NI model. The model is evaluated on two datasets collected at years +1 and +2 after the acquisition of the data used to build the model.


Chapter 5 details the selection of important variables for the nosocomial infection prediction and also their historical evaluation. In this chapter, we introduce a method for model selection where we combine feature ranking and filtering to select the most important features.

Chapter 6 addresses the problem of model and variable selection with an insight into the interpretability of the built models. We use the SVM version based on multiple kernel learning for model and variable selection. The approximate regularization path is used to choose a particular model of interest.

Chapter 7 summarizes the outcomes with their limitations and suggests future work.

Additional parts were created to ease the reading of the manuscript:

• the table of contents can be found at Page v,

• the abstract of the contents of this thesis is given at Page vii,

• the various mathematical notations are summarized starting at Page 113,

• a glossary of the abbreviations used in this thesis is given at Page 115,

• the list of figures is given at Page 117,

• the list of tables can be found at Page 119,

• the index lists selected keywords and their respective locations in the text, starting at Page 131.


Chapter 2

Clinical data mining

2.1 Introduction

Clinical data mining is the application of data mining to clinical data in order to extract new knowledge, patterns, useful information, or trends. Data mining is a multidisciplinary field at the intersection of database technology, statistics, machine learning, and pattern recognition. Database technology is related to steps 1 to 4 and 9 of the data mining process described in the previous chapter. Statistics covers all the steps for methodological, evaluation and validation purposes. The algorithms used in data mining are developed in the machine learning community and their implementation (steps 5 to 8) is related to pattern recognition. This multidisciplinarity makes data mining face multiple issues and challenges that the other disciplines do not have, or only to a lesser degree.

This chapter presents the challenges in the analysis of real–world medical data, the description of the datasets used for all experiments and a first classification of these data used as a comparative baseline for future studies. The datasets used in this thesis are for binary and multi–class classification. Two datasets were collected at the HUG (6 hospitals and 2’200 beds). Three benchmark datasets are also used; they come from the machine learning community1 and their characteristics are provided in Appendix C. We need them to validate some concepts against established algorithms and they are used when appropriate. The HUG datasets are used to evaluate the machine learning concepts we are investigating in this thesis.

1 http://www.fml.tuebingen.mpg.de/Members/raetsch/benchmark (last visit on 03.09.2010)

2.2 Challenges in medical data analysis

This section provides a non–exhaustive list of challenges encountered when analyzing real–world clinical datasets. Indeed, most of the issues in a data mining application are related to the structure of the data to be analyzed. We address the main issues and challenges for the analysis of clinical data according to the study we carried out in [72].

2.2.1 Data availability and collection

Healthcare data are collected to support specific patient treatment. In the ideal situation, the EHR should be “harmonized” to ease the analysis at the population level. The challenging secondary use of EHR has attracted many researchers during the last five years [110]. However, in the era of data explosion, it is still difficult to obtain a large number of cases with sufficient items to be submitted for a specific analysis. To overcome this issue, several solutions can be envisaged.

One can, for example, analyze all the available clinical datasets of a population of interest, but one then has to face data incompleteness issues due to the patients’ variability. However, many solutions were proposed in the literature to overcome the missing value issue if it is not supported by the data mining algorithm: impute the missing values, add an explicit value indicating a missing value (for categorical variables only) or add a new column indicating the presence/absence of the value of the corresponding variable [93].
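As a minimal sketch of these three generic strategies (not code from the thesis), the snippet below uses pandas and scikit-learn on a tiny hypothetical clinical table; the column names "workload" and "ward_type" are illustrative assumptions only.

# Hedged sketch: three common ways to handle missing values before mining.
# The DataFrame and its column names are hypothetical examples.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "workload": [3.0, np.nan, 5.5, 4.2],
    "ward_type": ["surgery", None, "medicine", "medicine"],
})

# 1) Impute the missing numerical values (here with the median).
imputer = SimpleImputer(strategy="median")
df["workload_imputed"] = imputer.fit_transform(df[["workload"]]).ravel()

# 2) Add an explicit "missing" category for categorical variables.
df["ward_type_filled"] = df["ward_type"].fillna("missing")

# 3) Add an indicator column flagging the presence/absence of the value.
df["workload_missing"] = df["workload"].isna().astype(int)

print(df)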

Another solution is to design an infrastructure linking patient care with data collection and able to support secondary use of the collected data [139]. This cumbersome solution needs a long time to be implemented at the hospital level.

Another option is to build one’s own research dataset from the data available in the EHR and some inclusion/exclusion criteria for a specific target analysis, as described in [72]. The objective of this process is to collect as much information as possible to carry out the clinical data analysis.

2.2.2 Run–time speed and performance of algorithms

Having a large number of cases with many variables is ideal for any comprehensive analytics problem, but it may introduce new questions with respect to the analysis of the data. Indeed, hospitals do not necessarily have high–performance computing infrastructures to analyze such data. Building infrastructures for the secondary use of EHR is proposed in some studies [73, 110], but it is hindered by questions such as the privacy and confidentiality of the patient data, to cite only two examples. Optimizing the algorithms with respect to run–time speed while providing good performance is another avenue of investigation for analyzing large datasets. Most of the algorithm optimizations are based on an approximation of mathematical concepts and reduce computational costs. This algorithm optimization approach is used in Chapter 4 and Chapter 5 of the present document.

2.2.3 Class imbalance problems

The class imbalance problem is an important problem in data mining, especially for classification tasks. In this situation, the class of interest is represented by a small number of examples. The classification of such datasets is challenging when the interesting examples are poorly represented. The imbalance of examples is a common problem when one wants to analyze real–world datasets, such as in fraud detection or the prediction of equipment failure. The class imbalance problem is related to the rarity of the elements of a particular class compared to the number of elements of another class [142]. This absolute lack of data makes it difficult to find the regularity in the training examples.

Accuracy is the most common measure used to evaluate the performance of a classifier. However, the value of this performance measure may be misleading when the dataset presents a high imbalance. If the positive class has a ratio of 10%, a classification accuracy of 90% may be meaningless if the classification is not sensitive at all. For this reason the classification of imbalanced datasets should be handled with much precaution: the classification methodology should take this phenomenon into account. Several solutions were proposed in the literature to handle this question; these solutions operate as a pre–processing step on the dataset, at the algorithmic level, or as a combination of both.
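To make this pitfall concrete, here is a minimal illustration (not from the thesis) on a synthetic dataset with about 10% positive examples: a trivial classifier that always predicts the negative class reaches roughly 90% accuracy while its sensitivity is zero.

# Hedged illustration of the accuracy paradox on a 10%-positive synthetic dataset.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.10).astype(int)   # about 10% positives
y_pred = np.zeros_like(y_true)                   # trivial "always negative" classifier

print("accuracy:", accuracy_score(y_true, y_pred))     # close to 0.90
print("sensitivity:", recall_score(y_true, y_pred))    # 0.0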

In the rest of this document, we always choose the minority class as the positive class in all binary classifications.

2.2.4 Curse of dimensionality

The variables in a created target dataset are based on the domain knowledge. As we have shown earlier, computational efficiency is one of the particular concerns of data mining. This can be achieved using Occam’s Razor principle, which can be interpreted as “simpler is better”. If m variables can provide equal or better results than d variables, where m << d, it is straightforward to use the reduced dataset with m variables to reduce the computational cost.

Data dimension reduction makes it possible to remove ambiguous variables from the dataset [55]. For medical professionals, reducing the dataset to the most important variables makes it possible to identify the most important factors for a clinical problem [69]. It is important to notice that medical professionals perform ad–hoc variable selection in their daily practice due to time pressure [13, 69].

In the present document, the curse of dimensionality is addressed using two methods (Chapters 4 and 5).
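As a small, generic illustration of the m << d idea (a sketch only, not the selection methods developed in Chapters 4 and 5), a univariate filter can keep only the m most informative variables before training:

# Hedged sketch: keep only m << d variables with a univariate filter (F-score).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=500, n_features=100, n_informative=10, random_state=0)
m = 10
selector = SelectKBest(score_func=f_classif, k=m)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)   # (500, 100) -> (500, 10)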

2.2.5 Interpretability

The knowledge extracted from the application of data mining algorithms should be evaluated and interpreted before any wider use and/or implementation.

Most of the time, objective evaluations (quantitative measures) are used to evaluate and interpret the results of data mining algorithms. One can cite, for example, the accuracy, the sensitivity, the specificity and the precision of the algorithm as quantitative measures of the performance of data mining algorithms. However, these measures should be handled with precaution based on the type of the data and the final objective of the data mining problem.

The final users of any tool developed within clinical research are the healthcare professionals. Subjective evaluation of the results of data mining is carried out by domain experts. The comprehensibility of the output of any data mining algorithm is one of the most important criteria for the final adoption of any developed system. The quality of the outputs and/or their comparison with baseline results are also used to assess the interpretability of the results of a data mining process.

2.2.6 Use of the discovered knowledge

The use of newly discovered knowledge in clinical routine is rare, and most clinical data mining studies reported in the literature can be seen as exploratory ones. This low adoption may be due to the fact that clinical data mining brings no new knowledge but confirms established knowledge. The discovered knowledge may not be used directly in clinical routine because it is a new research hypothesis that needs to be validated through clinical research, such as in [80, 91, 103]. This question is not treated in this document but is always discussed in light of the obtained results.

2.3 Nosocomial infection dataset

Hospital–acquired infections or nosocomial infections (NI) are those infections acquired in a hospital, independently of the reason for the patient’s admission. NI normally appear more than 48 hours after the patient’s admission. These infections may be related to medical procedures, such as the insertion of infected urinary catheters, or simply occur during the hospitalization, where the micro-organisms are transmitted from other patients or medical staff or are a consequence of the contamination of the hospital environment. These complications affecting hospitalized patients prolong their hospital stay and increase the patient–care cost.

In Switzerland, 70 000 hospitalized patients per year are infected and 2 000 deaths per year are caused by NI. A hospital concerned with the quality of patient care should have an infection prevention, control and surveillance program. Surveillance is the process of detecting these infections.

Prevalence surveys are recognized as valid and realistic approaches among NI surveillance strategies [51]. Prevalence of NI is presented as the prevalence of infected patients, defined as the number of infected patients divided by the total number of patients hospitalized at the time of the study, and the prevalence of infections, defined as the number of NIs divided by the total number of patients hospitalized at the time of the study [120]. The prevalence survey is resource and labor consuming: the electronic health record (EHR) of all patients admitted for more than 48 hours on the day of the survey should be analyzed by infection control practitioners (ICPs). If necessary, additional information is obtained by interviews with nurses or physicians in charge of the patient.

The HUG has been performing yearly comprehensive prevalence surveys since 1994. The prevalence database contains 83 attributes which can be classified into five categories: 1) demographic information; 2) admission diagnosis (classified according to the McCabe [96] and Charlson index classifications [24]); 3) patient information at the study date (ward type and name, status of Methicillin-Resistant Staphylococcus Aureus (MRSA) carriage, etc.); 4) information at the study date and the six days before (clinical data, central venous catheter carriage, workload, infection status, etc.); and 5) attributes related to the infections, i.e. for infected patients (infection type, clinical data, etc.). In this thesis, we are interested in the first four categories of data, as they are related to patient infection, which comprise 45 attributes. Only the year of birth and the workload values are numerical. Table A.1 provides the list of all variables of the nosocomial infection database.

To homogenize the data values, we transformed all numerical data into nominal ones. The year of birth was converted into age and discretized into 3 categories (0–60; 60–75; >75) as in [120], and a new variable “hospitalization duration” was created. A Mann–Whitney–Wilcoxon statistical test on the workload value shows a significant difference between infected and non-infected patients.

As the workload is the only attribute with missing values (91 cases, including 2 positive cases), all cases having no workload value were removed. The workload and the hospitalization duration were discretized using the minimum description length principle [87]. Patients admitted for less than 48 hours at the time of the study and not transferred from another hospital were also removed. All the NI variables therefore have binary responses.
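A minimal sketch of two of these preprocessing steps (the DataFrame and its column names "age", "workload" and "infected" are hypothetical, and this is not the actual HUG extraction code): discretizing age into the three categories used above and testing the workload difference between infected and non-infected patients with a Mann–Whitney–Wilcoxon test.

# Hedged sketch of the preprocessing: age binning and a Mann-Whitney-Wilcoxon test.
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

df = pd.DataFrame({
    "age": [45, 67, 81, 59, 72, 90],
    "workload": [2.5, 4.0, 6.5, 3.0, 5.5, 7.0],
    "infected": [0, 0, 1, 0, 1, 1],
})

# Discretize age into the three categories 0-60, 60-75, >75.
df["age_cat"] = pd.cut(df["age"], bins=[0, 60, 75, np.inf], labels=["0-60", "60-75", ">75"])

# Compare the workload of infected vs. non-infected patients.
stat, p_value = mannwhitneyu(df.loc[df.infected == 1, "workload"],
                             df.loc[df.infected == 0, "workload"],
                             alternative="two-sided")
print(df["age_cat"].tolist(), "U =", stat, "p =", p_value)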

2.4 TALISMAN dataset

Interstitial lung diseases (ILD) form a heterogeneous group of diseases containing more than 150 disorders of the lung tissue [40]. Many of the diseases are rare and present unspecific symptoms. During the diagnosis process, all available information, including the patient’s personal data, medication, past medical history, host risk factors and laboratory tests (e.g. pulmonary function tests, hematocrit, ...), is meticulously analyzed to find any indicator of the presence of an ILD. Besides the patient’s clinical data, imaging of the chest allows resolving an ambiguity in a large number of cases by enabling the visual assessment of the lung tissue [50].

The most common imaging modality used is the chest X–ray because of its low cost and radiation dose [41]. It is sometimes of limited usefulness for the characterization of lung tissue, as the tissue is overlaid with other anatomical structures, which can make the reading difficult. The gold standard imaging technique used in case of doubt is high–resolution computed tomography (HRCT), which provides three–dimensional images of the lung tissue with high spatial resolution. Most of the histological diagnoses of ILDs are associated with a given combination of image findings (i.e. abnormal lung tissue) [141]. The most common lung tissue patterns are emphysema, ground glass, fibrosis, micronodules and consolidation. These are characterized by distinct texture properties in HRCT imaging.

The detection and characterization of the lung tissue patterns in HRCT are time–consuming and require experience. In order to reduce the risk of omission of important tissue lesions and to ensure the reproducibility of image interpretation, computer–aided diagnosis (CAD) has been proposed several times for HRCT of the lung [21, 39, 42, 124, 125, 131, 133, 150]. The typical approaches use supervised machine learning to draw decision boundaries in feature spaces spanned by texture attributes. The reported performance of these approaches suggests that these systems have the potential to be valuable tools in clinical routine by providing second opinions to the clinicians. However, the CAD system must include a sufficient number of classes of lung tissue to cover the heterogeneous visual findings associated with ILDs. A CAD system that aims at detecting one single lung tissue pattern is of limited use, as the radiologist still needs to look for other pathological lung tissue patterns in the image series.

The TALISMAN project of the HUG built a database of ILD cases. The diagnosis of each ILD case was confirmed by a biopsy or an equivalent test (e.g. bronchoalveolar lavage, tuberculin skin test, Kveim test, ...). For each collected patient, 99 clinical parameters associated with 13 of the most frequent diagnoses of ILDs were collected from the electronic health record (EHR), describing the patient’s clinical state at the time of the stay when the HRCT image series were acquired. The lung tissue patterns related to the ILD diagnosis were manually delineated in HRCT image series (1 mm slice thickness, no contrast agent) by two experienced radiologists at the HUG. The distributions of the 6 most represented tissue types are detailed in Table 2.1 in terms of number of regions of interest (ROIs), volumes and number of block instances, obtained as shown in Figure 2.1. The size of the blocks is 32×32×1 pixels. The features used to characterize the texture properties of the 6 lung tissue patterns are derived from grey–level histograms and tailored wavelet transforms (WT). The resulting feature space has a dimension of 46.

Table 2.1: Distribution of the classes in terms of ROIs, volumes and blocks. The number of instances corresponds to the number of blocks.

label ROIs volume (liters) blocks patients

healthy 100 5.12 l 3043 7

emphysema 66 1.15 l 422 5

ground glass 427 4.91 l 2313 37

fibrosis 473 8.45 l 3113 38

micronodules 297 16.06 l 6133 16

consolidation 196 0.69 l 90 14

Total 1559 36.38 l 15114 87

Grey–level histograms

Thanks to Hounsfield Units (HU), the pixel values in HRCT images correspond unambiguously to the density of the observed tissue and thus contain essential information for the characterization of the lung tissue. To encode this information, 22 histogram bins of grey–levels in the interval [−1050; 600[ are used as texture features. An additional feature related to the number of air pixels is computed as the number of pixel values below −1000 HU.
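A minimal sketch of these grey–level features (assuming the 22 bins are equally spaced over [−1050, 600); the exact binning used in the thesis may differ):

# Hedged sketch: 22-bin grey-level histogram in [-1050, 600) plus an air-pixel count.
import numpy as np

def grey_level_features(block_hu):
    """block_hu: 2D array of Hounsfield Units for one 32x32 block."""
    bins = np.linspace(-1050, 600, 23)          # 23 edges -> 22 bins (assumed equal width)
    hist, _ = np.histogram(block_hu, bins=bins)
    air_pixels = np.sum(block_hu < -1000)       # pixels below -1000 HU counted as air
    return np.concatenate([hist, [air_pixels]])

block = np.random.default_rng(0).integers(-1050, 600, size=(32, 32))
print(grey_level_features(block).shape)         # (23,)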

Wavelet–based features

Near affine–invariant texture features are derived from a tailored WT. A frame transform is used to ensure translation–invariant descriptions of the lung tissue patterns [42, 132]. Based on the assumption that no predominant orientations are contained in the lung tissue patterns, a rotation–invariant non–separable WT is implemented using isotropic polyharmonic B–spline scaling functions and wavelets [43, 134]. Finally, an augmented scale progression is obtained by using the quincunx lattice to upsample the filters by a factor of √2 at each iteration of the WT. Within each subband i, the wavelet coefficients are characterized by a mixture of two Gaussians with fixed means µ^i_{1,2} and distinct variances σ^i_{1,2}. 24 wavelet–based features are thus generated by 8 iterations of the WT.
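The sketch below shows one way such per-subband mixture parameters could be estimated. It assumes the two Gaussians share a fixed zero mean and that the two variances plus the mixing weight give three features per subband (3 × 8 iterations = 24); the actual wavelet transform and the exact parameterization used in the thesis are not reproduced here.

# Hedged sketch: characterize the wavelet coefficients of one subband by a mixture of
# two zero-mean Gaussians with distinct variances (EM updates the variances and weight).
# The fixed-zero-mean assumption is illustrative; the thesis' parameterization may differ.
import numpy as np

def subband_mixture_features(coeffs, n_iter=50):
    c = np.asarray(coeffs, dtype=float).ravel()
    sigma2 = np.array([0.5 * c.var(), 2.0 * c.var()])  # initial variances
    w = 0.5                                            # initial mixing weight
    for _ in range(n_iter):
        # E-step: responsibility of the first component for each coefficient
        p1 = w * np.exp(-c**2 / (2 * sigma2[0])) / np.sqrt(2 * np.pi * sigma2[0])
        p2 = (1 - w) * np.exp(-c**2 / (2 * sigma2[1])) / np.sqrt(2 * np.pi * sigma2[1])
        r = p1 / (p1 + p2)
        # M-step: update the weight and the two variances with fixed (zero) means
        w = r.mean()
        sigma2[0] = np.sum(r * c**2) / np.sum(r)
        sigma2[1] = np.sum((1 - r) * c**2) / np.sum(1 - r)
    return np.array([sigma2[0], sigma2[1], w])          # 3 features per subband

coeffs = np.random.default_rng(0).standard_normal(1024)
print(subband_mixture_features(coeffs))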


Figure 2.1: Construction of the block instances from manually delineated ROIs.
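A simplified sketch of this block construction (assuming non-overlapping 32×32 blocks are kept only when they lie entirely inside the delineated ROI mask; the actual inclusion criterion in the thesis may differ):

# Hedged sketch: extract 32x32 block instances from a slice and its binary ROI mask.
# Blocks are kept only when entirely inside the delineated region (an assumption).
import numpy as np

def extract_blocks(slice_hu, roi_mask, size=32):
    blocks = []
    rows, cols = slice_hu.shape
    for r in range(0, rows - size + 1, size):
        for c in range(0, cols - size + 1, size):
            if roi_mask[r:r + size, c:c + size].all():
                blocks.append(slice_hu[r:r + size, c:c + size])
    return np.array(blocks)

rng = np.random.default_rng(0)
slice_hu = rng.integers(-1050, 600, size=(512, 512))
roi_mask = np.zeros((512, 512), dtype=bool)
roi_mask[64:256, 64:256] = True                   # a hypothetical delineated ROI
print(extract_blocks(slice_hu, roi_mask).shape)   # (36, 32, 32)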

2.5 Learning framework

Throughout the document, we apply the same experimental setup, similar to the one described in [117]. One hundred (100) partitions into training and testing sets are generated from the data source, with ratios of 60% and 40% respectively. The whole experimental setup is depicted in Figure 2.2.
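A minimal sketch of this splitting scheme with scikit-learn (illustrative only; the synthetic dataset and the random seed are assumptions, not the thesis tooling):

# Hedged sketch: 100 random partitions into 60% training / 40% testing.
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
splitter = ShuffleSplit(n_splits=100, train_size=0.6, test_size=0.4, random_state=0)
partitions = [(train_idx, test_idx) for train_idx, test_idx in splitter.split(X, y)]
print(len(partitions), len(partitions[0][0]), len(partitions[0][1]))   # 100 600 400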

2.5.1 Model selection

The parameters of the learning algorithms are selected using 3×5–fold (stratified or not) cross–validation on 5 random training sets. In the stratified cross–validation, the ratio of positive to negative examples is kept as in the original dataset. The algorithm manages the class imbalance when the stratified cross–validation is used. Otherwise, a preprocessing step aiming to balance the ratio of positive and negative examples is applied before the application of the data mining algorithm. The method used to handle the data imbalance is highlighted in the method section of each experiment.

The area under the receiver operating characteristic curve (AUC) is chosen as the performance metric to select the best parameters if the classification algorithm has a probability output. The sensitivity and precision are chosen if the classification algorithm does not allow the computation of the AUC. To be more specific, among the three 5–fold cross–validations, the parameters providing the highest AUC (or sensitivity and precision) on each dataset are retained, and the median of the best parameters from the 5 random datasets is kept as the best parameter set of the problem. When we refer to an imbalanced dataset, the original data distribution is kept in both partitions; otherwise, balanced datasets, i.e. approximately 50% of positive cases and 50% of negative cases, are used during the model selection process.
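A condensed sketch of this selection protocol for an RBF–kernel SVM (the parameter grid, dataset and seeds are hypothetical; the repeated stratified cross-validation and the median rule follow the description above, but this is not the thesis code):

# Hedged sketch: 3x5-fold cross-validated grid search on 5 random training sets,
# scored by AUC, keeping the median of the best parameters. Grid values are hypothetical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=800, weights=[0.9, 0.1], random_state=0)
grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]}

best_C, best_gamma = [], []
for seed in range(5):                      # 5 random training sets
    X_tr, _, y_tr, _ = train_test_split(X, y, train_size=0.6, stratify=y, random_state=seed)
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=seed)
    search = GridSearchCV(SVC(kernel="rbf"), grid, scoring="roc_auc", cv=cv)
    search.fit(X_tr, y_tr)
    best_C.append(search.best_params_["C"])
    best_gamma.append(search.best_params_["gamma"])

print("selected C:", np.median(best_C), "selected gamma:", np.median(best_gamma))

Taking the median of the per-dataset winners, rather than refitting a single global search, mirrors the robustness argument of the protocol above: a single unlucky split cannot dictate the final parameters.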

2.5.2 Model evaluation

The model is built on the 100 training datasets, i.e. the best parameters are applied to each training set and evaluated on the corresponding testing set. The means over the 100 testing sets of the recall or sensitivity, the precision, the specificity, the accuracy and the AUC are used as performance metrics and discussed. The AUC is of particular interest for imbalanced datasets because it indicates, among other merits, how well separated the signal or positive class and the noise or negative class are. If the AUC cannot be computed, we focus the result analysis on the sensitivity and precision of the algorithm because we need the highest ratio of positive cases among the positive predictions in the final implementation of our results.
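A hedged sketch of this evaluation loop (computing the mean sensitivity, precision, specificity, accuracy and AUC over the test sets; the dataset, parameter values and reduced number of splits are illustrative assumptions):

# Hedged sketch: evaluate a fixed parameter setting on repeated train/test splits and
# average the recall (sensitivity), precision, specificity, accuracy and AUC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC

X, y = make_classification(n_samples=800, weights=[0.9, 0.1], random_state=0)
splitter = StratifiedShuffleSplit(n_splits=10, train_size=0.6, random_state=0)  # 100 in the thesis
scores = {"sensitivity": [], "precision": [], "specificity": [], "accuracy": [], "auc": []}

for train_idx, test_idx in splitter.split(X, y):
    clf = SVC(kernel="rbf", C=10, gamma=0.01).fit(X[train_idx], y[train_idx])   # hypothetical parameters
    y_pred = clf.predict(X[test_idx])
    tn, fp, fn, tp = confusion_matrix(y[test_idx], y_pred).ravel()
    scores["sensitivity"].append(recall_score(y[test_idx], y_pred, zero_division=0))
    scores["precision"].append(precision_score(y[test_idx], y_pred, zero_division=0))
    scores["specificity"].append(tn / (tn + fp))
    scores["accuracy"].append(accuracy_score(y[test_idx], y_pred))
    scores["auc"].append(roc_auc_score(y[test_idx], clf.decision_function(X[test_idx])))

print({name: round(float(np.mean(vals)), 3) for name, vals in scores.items()})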

In a study, Rakotomamonjy highlighted the problems that may arise when optimizing the SVM algorithm according to the AUC or precision criterion [114]. Indeed, the author showed that a model built according to these criteria may lead to overfitting and does not perform well on the test set for imbalanced datasets. This is due to the fact that the SVM was initially designed to optimize both the empirical error and the capacity of the hypothesis space in which the decision function to be learned lies [137]. Thus, the usual way of selecting the model of the SVM is based on the minimization of the generalization error or any related bound or estimate [135]. In this thesis, we propose two ways to overcome this issue of using the AUC criterion for model selection (see Chapter 4).

2.6 Baseline setup

In this section, we provide a first analysis of the NI dataset in order to build a baseline against which all future results can be compared. A comparison of state–of–the–art classification algorithms for the categorization of lung tissue patterns can be seen in [40]. In this study, the support vector machine algorithm provides good results compared to other classification algorithms such as naive Bayes, k–nearest neighbor, J48 decision trees and the multi–layer perceptron.

2.6.1 Comparative performance of nosocomial infection classification

Motivation and objective

In this section, we compare the SVM algorithm with other classification algorithms. The main motivation is to compare the behavior of these algorithms in situations of class imbalance. In other words, we do not perform any intensive optimization to obtain good results: the NI dataset and the classification algorithms are taken and used without any transformation. However, the parameters of each algorithm are optimized with a grid search procedure before comparing their performance on the NI dataset. The results are discussed in terms of the accuracy, precision, sensitivity, specificity, f–measure and AUC of the classification.

Materials and methods

We have chosen the Weka experimenter to evaluate the classification performance of 7 algorithms on the NI dataset. The NI dataset is obtained from the 2006 prevalence survey as described earlier. The following algorithms are compared: naïve Bayes [81], k–nearest neighbors classifier [1], logistic regression [90, 127], radial basis function network [108], multi–layer perceptron [17], SVM [35] and the AdaBoost algorithm [52].

We use the learning framework defined above to optimize the parameters of these training algorithms. The SVM algorithm is taken as the basis for comparison. The 100 random training/testing splits of the dataset in our learning framework induce a source of variation in our experiment, as do the different algorithms used. The corrected paired t-test proposed by Nadeau and Bengio [104] is used to compare the results at a 5% statistical significance level.

The t-test is the only statistical test for comparing data mining algorithms implemented in the Weka experimenter software. Its null hypothesis is that there is no difference between each pair of algorithms with respect to the mean performance metrics obtained with the 100 testing sets.

There are two types of risks in such a test. The type I risk is the probability that the test rejects the null hypothesis incorrectly. The type II risk is the probability that the null hypothesis is not rejected when differences exist.
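The following sketch illustrates the corrected resampled t-test; the function and variable names are illustrative, and SciPy is assumed for the t-distribution:

```python
# Sketch: corrected resampled t-test (Nadeau and Bengio) for two classifiers
# evaluated on the same n random training/testing splits. The term
# n_test/n_train corrects the variance for the overlap between training sets.
import numpy as np
from scipy import stats

def corrected_resampled_ttest(scores_a, scores_b, n_train, n_test):
    d = np.asarray(scores_a) - np.asarray(scores_b)     # paired differences
    n = len(d)
    var_d = np.var(d, ddof=1)
    t = d.mean() / np.sqrt((1.0 / n + n_test / n_train) * var_d)
    p = 2.0 * stats.t.sf(abs(t), df=n - 1)              # two-sided p-value
    return t, p

# Hypothetical usage with the AUC values of two classifiers over 100 splits:
# t, p = corrected_resampled_ttest(auc_svm, auc_adaboost,
#                                  n_train=0.6 * n_patients, n_test=0.4 * n_patients)
# significant = p < 0.05
```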


Figure 2.2: General experimental setup



Table 2.2: Comparison of SVM with other algorithms for the NI cases classification

Classifiers               Accuracy   Specificity   Sensitivity   Precision   F-measure   AUC
SVM                       90.85%     93.11%        66.93%        47.84%      55.79%      80.02%
Adaboost                  89.99%     93.01%        60.95%        47.48%      53.37%      76.98%
1-nearest neighbors       87.64%     90.48%        47.77%        26.34%      33.95%      69.13%
Logistic regression       90.45%     92.46%        66.34%        42.31%      51.67%      79.40%
Multi-layer perceptron    90.06%     93.30%        60.68%        49.96%      54.80%      76.99%
Naive Bayes               87.64%     93.35%        48.86%        51.92%      50.34%      71.10%
RBF Network               88.72%     91.96%        54.53%        39.09%      45.54%      73.24%


Results

Table 2.2 shows the mean performance values obtained with each algorithm over the 100 test sets. In this table, the performance values are written in italics when there is a statistically significant difference from the results obtained with the SVM. One can see from these results that:

• the 1–nearest neighbor, naïve Bayes and RBF network algorithms perform worse than the SVM on the dataset

• the Adaboost, logistic regression and multi–layer perceptron algorithms perform comparably to the SVM

• the accuracies and specificities of the algorithms are high

• the sensitivities and precisions are relatively low

2.6.2 Discussion and partial conclusion

The objective of this experiment is to highlight the behavior of classification algorithms in a situation of class imbalance. The parameters of the algorithms were optimized using a grid search approach. While the accuracies of the selected algorithms (>87%) and their specificities (>90%) are high, this is not the case for the sensitivities and precisions. In other words, these algorithms classified correctly most of the negative examples and few positive examples, inducing a low f–measure. In the positive predictions of the SVM algorithm, for example, 47.84% are true positive cases, which represent approximately 66.93% of all positive examples. The AUCs of the SVM, Adaboost, logistic regression and multi–layer perceptron do not present any significant difference. The high AUC values can be explained by the high specificities. In the ideal situation, the values of the specificity and sensitivity should be balanced.
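For reference, the measures discussed here can be computed from the contingency matrix as in the sketch below (scikit-learn is assumed; the helper function is illustrative and not part of the original experiments):

```python
# Sketch: performance measures derived from the contingency (confusion) matrix.
# scikit-learn returns the binary confusion matrix as [[TN, FP], [FN, TP]].
from sklearn.metrics import confusion_matrix, roc_auc_score

def contingency_metrics(y_true, y_pred, y_score=None):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)      # recall on the positive (NI) class
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    auc = roc_auc_score(y_true, y_score) if y_score is not None else None
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "accuracy": accuracy,
            "f_measure": f_measure, "auc": auc}
```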

The low level of sensitivity and precision is not solely due to the class imbalance. It can also be attributed to the number of variables in the dataset. The SVM algorithm is suited for large datasets, but it is possible that some variables in the dataset blur the classification process. This aspect is investigated in the next experiment.

In this experiment, the corrected paired t-test proposed by Nadeau and Bengio was used to evaluate the statistical significance of the differences between the algorithms' results. The main advantage of this test is that it takes into account the variability of the training set and not only the variability of the test set. In other words, the 100 training/testing pairs overlap and this dependence is taken into account by the test. The choice of this test was guided only by its availability in the Weka experimenter platform. In the rest of this document, we use McNemar's test to compare multiple algorithms because it does not make any independence assumption on the datasets.


The planned final implementation of the classification carried out in this study is the extraction of highly probable infected patients through the analysis of their EHR. For this purpose, a high sensitivity is essential. Optimizing the sensitivity of the results is one of the contributions of this thesis and guides most of the following experiments.

2.6.3 Minimal set of attributes required for nosocomial infection classification

This section gives a first investigation of the minimal set of variables allowing to predict NI cases and to report potential cases to be reviewed by ICP. Various data mining techniques have been applied at the University Hospitals of Geneva to support the NI prevalence survey since 2002, among others different forms of SVM optimization including an asymmetrical margin approach [29], one-class SVM [32], and a comparison of SVM with other classification algorithms [31]. The differences with respect to this previous work lie in the objective of the experiment, the methodology and the dataset used. Indeed, the main contribution of the approach developed in this section is the use of a variable ranking followed by variable filtering using statistical tests.

In this section, we use Fisher's Linear Discriminant (FLD) algorithm to evaluate the performance of the retained variables. The basic idea behind linear discriminant algorithms is to find a linear function providing the best separation of instances from 2 classes. The FLD looks for a hyperplane directed by w which (i) maximizes the distance between the means of the classes when projected on the line directed by w and (ii) minimizes the variance around these means [49].

Formally, FLD aims at maximizing the function:

J(w) = \frac{w^T S_B w}{w^T S_W w}    (2.1)

where S_B is the between-class scatter matrix and S_W the within-class scatter matrix. This equation formulates FLD as an algorithm aiming at minimizing the variance within the classes while maximizing the variance between the classes. An unknown case is classified into the nearest class centroid when projected onto the line directed by w.

In a classification task, an object is a member of exactly one class and an error occurs if the object is classified into the wrong one. The objective is then to minimize the misclassification rate. With the FLD algorithm, the within-class scatter matrix S_W is evaluated on the training datasets. To minimize the misclassification rate on unseen test sets (generalization error), a regularization factor r (0 < r < 1) is introduced into the computation of S_W [64]. The regularization factor r has to be optimized to improve numerical stability and generalization ability. The quality of the retained variables is evaluated using a classical linear discriminant function in order to compare the results with those obtained with the kernel–based algorithm used in the rest of the document.
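The following sketch shows one possible two-class regularized FLD; the shrinkage of S_W toward the identity is a common regularization scheme and the exact form used in [64] may differ, and the function names and midpoint decision rule are illustrative:

```python
# Sketch: regularized Fisher Linear Discriminant for two classes.
import numpy as np

def fld_fit(X, y, r=0.5):
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = (np.cov(X0, rowvar=False) * (len(X0) - 1)
          + np.cov(X1, rowvar=False) * (len(X1) - 1))   # within-class scatter
    Sw_reg = (1 - r) * Sw + r * np.eye(X.shape[1])      # regularized S_W
    w = np.linalg.solve(Sw_reg, m1 - m0)                # direction maximizing J(w)
    threshold = w @ (m0 + m1) / 2.0                     # midpoint of projected means
    return w, threshold

def fld_predict(X, w, threshold):
    # assign each case to the class whose projected centroid is closest
    return (X @ w > threshold).astype(int)
```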

Material and methods

Two feature selection algorithms were used independently: the first is based on the information gain (IG) of each attribute [111] and the second on the combination of attributes using SVM Recursive Feature Elimination (SVM–RFE) [61]. The SVM–RFE algorithm is detailed in Section 5.3.2. These 2 algorithms return all the features ranked by order of importance. A χ² statistical test was then performed to filter the discriminative features to be retained for evaluation with a classification algorithm. These feature selection algorithms were applied to 100 training sets built from the original dataset, which we call DS AF (dataset with all features).
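A minimal sketch of this ranking-plus-filtering step is given below, assuming scikit-learn; mutual information is used here as a stand-in for the information gain criterion and a linear SVM drives the RFE ranking, so both are illustrative approximations of the algorithms cited above:

```python
# Sketch: two independent feature rankings followed by a chi-squared filter.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE, mutual_info_classif, chi2

def rank_and_filter(X, y, alpha=0.05):
    # SVM-RFE ranking (1 = most important feature)
    rfe = RFE(LinearSVC(C=1.0, max_iter=10000), n_features_to_select=1).fit(X, y)
    svm_rfe_ranking = rfe.ranking_

    # information-gain-like ranking (indices sorted by decreasing relevance)
    ig_ranking = np.argsort(-mutual_info_classif(X, y, discrete_features=True))

    # chi-squared filtering of the discriminative features (X must be non-negative)
    _, p_values = chi2(X, y)
    retained = np.where(p_values < alpha)[0]
    return svm_rfe_ranking, ig_ranking, retained
```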

The significant attributes returned by both feature selection algorithms over the 100 training datasets were retained to build a second dataset DS RF (dataset with reduced features). Afterwards, we removed the important features in DS RF which are not straightforward to retrieve in the EHR, to obtain a third dataset DS RF NSR. We then evaluated the performance of the FLD algorithm on these two datasets with a reduced number of variables. For classification purposes, we used the open-source toolbox MATLABArsenal². This MATLAB package contains many classification algorithms and in particular the regularized FLD algorithm described above. FLD was chosen as it has only one parameter and is therefore easier to optimize.

The FLD does not have a probability output allowing a ROC analysis. The classification results are therefore discussed according to the performance measures derived from the contingency matrix (accuracy, sensitivity, specificity, precision, f–measure and positive prediction value). The FLD model is selected using a cross–validated grid search procedure during which the regularization factor takes the values 2^k (k = −10, . . . , 0). The sensitivity is chosen to select the best model. To handle the data imbalance, a preprocessing step is performed which reduces the imbalance by randomly selecting a subset of the negative samples to obtain approximately 50% of examples in each class. The means of the performance metrics obtained with the 100 random training/testing splits are compared using the Mann–Whitney–Wilcoxon statistical test.
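The undersampling and the statistical comparison could be sketched as follows (NumPy and SciPy are assumed; the function and variable names are illustrative, and the exact sampling procedure used in the thesis may differ):

```python
# Sketch: random undersampling of the negative class and Mann-Whitney-Wilcoxon
# comparison of two series of performance values over the 100 splits.
import numpy as np
from scipy.stats import mannwhitneyu

def undersample_negatives(X, y, seed=0):
    rng = np.random.default_rng(seed)
    pos_idx = np.where(y == 1)[0]
    neg_idx = np.where(y == 0)[0]
    kept_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    idx = np.concatenate([pos_idx, kept_neg])
    rng.shuffle(idx)
    return X[idx], y[idx]

# Hypothetical comparison of the sensitivities obtained on the two datasets:
# stat, p_value = mannwhitneyu(sens_ds_rf, sens_ds_rf_nsr, alternative="two-sided")
```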

To ease the interpretation of the results, we carry out a multiple correspondence analysis (MCA) in order to see the relationships (closeness and contrast) between the variables. The result of this MCA is used throughout the document to interpret the results related to any variable selection. It is interpreted through the graphical representation of the variables on the first 2 factorial axes.
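As a rough illustration of the computation behind such a representation, the sketch below performs a bare-bones MCA as a correspondence analysis of the one-hot indicator matrix and returns the category coordinates on the first two axes; dedicated statistical packages would normally be used instead:

```python
# Sketch: simplified MCA via correspondence analysis of an indicator matrix.
import numpy as np

def mca_column_coordinates(Z, n_axes=2):
    """Z: binary indicator matrix (observations x categories), e.g. a one-hot encoding."""
    P = Z / Z.sum()
    r = P.sum(axis=1, keepdims=True)               # row masses
    c = P.sum(axis=0, keepdims=True)               # column masses
    S = (P - r @ c) / (np.sqrt(r) * np.sqrt(c))    # standardized residuals
    _, sv, Vt = np.linalg.svd(S, full_matrices=False)
    # principal coordinates of the categories on the first n_axes factorial axes
    return Vt.T[:, :n_axes] * sv[:n_axes] / np.sqrt(c.T)
```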

Results

The SVM–RFE computation took many hours (≈24h) while the IG of each variable was obtained in less than 10 minutes. Twenty (20) attributes are retained by the two feature selection algorithms; IG and SVM–RFE returned the same features after the χ² filtering, but with non-identical variable rankings. The "length of stay up to 7.5 days" was retained as a discriminative attribute. Two admission diagnoses are discriminative: those classified as "non fatal" and "fatal in less than 6 months" according to the McCabe classification; "transfer as admission"; and "congestive cardiomyopathy" and "diabet with organ affected" as comorbidities. In the third data category, the "intensive care unit" and "obstetrical ward", together with the absence of or current MRSA colonization, are the most discriminative attributes. In the fourth data category, an "antibiotic treatment", "fever", surgery, a stay at the intensive care unit during the hospitalization, the presence of artificial ventilation, urinary tract and central venous catheters, and the 3 categories of workload value were significantly discriminative. Table 2.3 summarizes the features returned by both IG and SVM–RFE after the χ² filtering, which constitute the dataset DS RF. The antibiotic treatment is ranked as the most important attribute by both methods.

The feature selection described above provided two clinical features which are not straightforward to retrieve from the patients' EHR: the fever and the workload values. These attributes were removed to create the dataset DS RF NSR. The grid search applied to the two datasets DS RF and DS RF NSR returned r = 0.5 and r = 1, respectively, as the best parameter. Figure 2.3 summarizes the performance metrics (accuracy, sensitivity, specificity, precision, f-measure, and the ratio of positive predictions) obtained with the two datasets in terms of their means, standard deviations (SD) and the performance comparisons.

Datasets DS RF and DS RF NSR permit to obtain respectively a mean sensitivity (SD) of 65.37% (6.76) and 82.56% (4.22), a specificity (SD) of 87.5% (1.44) and 85.4% (1.8), a precision (SD) of 41.50% (3.9) and 43.54% (4.59), and an f-measure (SD) of 50.58 (3.83) and 56.87 (4.29) over the 100 data split realizations. The mean accuracy (SD) for DS RF and DS RF NSR are respectively 84.83% (1.04) and 85.04% (1.65), and the positive prediction ratios are respectively 18.82% (1.72) and 22.73%

² http://www.informedia.cs.cmu.edu/yanrong/MATLABArsenal/MATLABArsenal.htm (last visited on 03.09.2010)
