
Motivations and challenges

The main vocation of a healthcare organization is to provide individualized patient care. Healthcare is a domain where information plays an important role in all decision making, and electronic health record (EHR) management systems are tools that help healthcare professionals in this respect. Information technology (IT) is now at the heart of modern healthcare organizations and is valuable for the management (collection, storage and display), analysis and transmission of patient health information. Since the 1950s, many researchers have investigated the use of digitized information to assist healthcare professionals in their decision making, especially for diagnostic purposes. Up to the 1980s, many clinical decision support systems (CDSS) were proposed to assist healthcare professionals (HPs). These systems mimicked the “expert” HPs with respect to diagnostic problem solving. Formally, a CDSS matches individual patient information against a computerized knowledge base, and algorithms generate recommendations for this specific patient [53]. Power distinguished four components of a decision support system: 1) the user interface, 2) the database, 3) the models and analytical tools, and 4) the decision support system architecture and network [109].

Logic and probabilistic reasoning formed the basis of the majority of these CDSS at the level of the models and analytical tools. These systems operated as “oracle” systems, and the HPs played the passive role of patient-specific information providers. The development of IT from the 1980s onward reshaped the characteristics of CDSS. Personal computers connected via local area networks allowed the EHR to be distributed and accessed from computers throughout the healthcare organization.

During the same period, many types of CDSS emerged thanks to the development of formal models, such as artificial neural networks from the artificial intelligence domain. The “oracle” mode of CDSS was abandoned because of a lack of adherence on the HPs’ side, and CDSS were instead developed to provide additional knowledge, in the form of alerts or reminders to cite only a few, in order to improve the performance of their users. However, hospital information systems were organized department by department, and the data warehouse was the only tool providing longitudinal retrospective views of the patient data [95].

Over the last ten years, EHR management systems have achieved spectacular development: all clinical information about a patient is accessible from anywhere on the hospital network. The communication of information between departments has improved thanks to the implementation of better communication protocols and better information coding. This situation allows CDSS to be integrated into EHR management systems, to be validated at multiple points in time and to be evaluated in a clinical context. To preserve the performance of the EHR management system, the data stored in the data warehouse are mainly used for (possibly computationally expensive) analyses [69, 78, 82, 92, 103].


The hospital data warehouse is not updated continuously; it is updated in batch mode on a regular basis, and the stored data are not modified afterwards [95]. In order to analyze the data, one may have to transform, filter and aggregate them. A data mart is a subset of a data warehouse developed for a specific analysis purpose [3, 76, 98]. It is important to note that the data from the production database may already have undergone such operations before their incorporation into the data warehouse. Preparing data marts for any research topic within a healthcare organization is time consuming and requires extra manpower. Chaudhry and colleagues discussed in their systematic review the low rate of deployment of CDSS in real practice [25]. The development of a CDSS is a long, incremental process, and the development of high-quality and efficient CDSS may be hindered by the difficulty of accessing high-quality datasets, which is still an open issue [73]. Indeed, medical data are not perfect and contain, among other flaws, redundancies, inconsistencies and missing parameter values due to patient variability. Analyzing such data presents a great challenge.
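As a toy illustration of these filter/transform/aggregate steps, here is a minimal sketch in Python with pandas; the table and column names are hypothetical and do not correspond to any actual hospital schema:

```python
import pandas as pd

# Hypothetical extract from a hospital data warehouse; the columns are
# illustrative only, not a real production schema.
warehouse = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 3],
    "ward":       ["ICU", "ICU", "MED", "MED", "ICU"],
    "lab_test":   ["CRP", "WBC", "CRP", "CRP", "WBC"],
    "value":      [120.0, 11.2, 35.0, 40.0, 9.8],
})

# Filter: keep only the test relevant to the analysis at hand,
# then aggregate per patient to obtain a small, purpose-built data mart.
crp = warehouse[warehouse["lab_test"] == "CRP"]
data_mart = crp.groupby("patient_id")["value"].agg(["mean", "max"])
print(data_mart)
```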

Hospital data warehouses were developed primarily to support billing systems and legal reporting to the government, but they have been enriched with other data along with the development of EHR management systems within healthcare organizations. They contain primarily administrative information about the patients, laboratory results, procedures and final diagnoses.

Ordered and administered medications, vital signs and monitoring data may also be found in hospital data warehouses, depending on the degree of development of the EHR management system.

Some information, such as the discharge letter, may also be available in textual, i.e. unstructured, format. Imaging data are stored in the picture archival and communication system (PACS), which is a specific type of data warehouse for image data. Roughly speaking, healthcare organizations’ archives contain data spanning the cellular and tissue level, the organ level and the physiological level.

This thesis deals with the development of knowledge-driven CDSS according to Power’s decision support systems classification [109]. It addresses the creation of data mining models for supervised learning, which constitutes a pillar of future CDSS. Clinical data mining is the application of data mining to clinical data [72]. Data mining is a methodological extraction of knowledge, patterns, useful information, or trends from retrospective, massive, and multidimensional data.

According to Fayyad, data mining is the application of specific algorithms for extracting patterns from massive data collections and is only one step in the knowledge discovery in databases (KDD) process [48]. In practical settings, however, the term is broadly used for the whole KDD process [69].

Data mining, as described in [48], is an iterative and interactive process with nine steps to extract latent information from data: 1) business understanding, 2) data set selection, 3) data cleaning and preprocessing, 4) data reduction and projection, 5) matching the objective defined in step 1 to a data mining method (classification, clustering, regression, etc.), 6) choice of the algorithm and search for data patterns, 7) pattern extraction, 8) data interpretation and 9) use of the discovered knowledge. The knowledge discovery process as Fayyad et al. have defined it in [48] is depicted in Figure 1.1. This thesis addresses steps 5 to 8, while steps 1 to 4 and step 9 are discussed when appropriate. More specifically, it treats the issues related to model and variable selection for classification tasks (see Chapter 2) using support vector machines with single and multiple kernels [102]. The validation of a built model on new datasets that become available later on and the interpretability of the models are also addressed. For these purposes, we have chosen two clinical applications: nosocomial infection prediction and the classification of interstitial lung diseases.

Figure 1.1: The knowledge discovery process according to Fayyad et al. [48]

The University Hospitals of Geneva (HUG) have been performing yearly nosocomial infection (NI) prevalence surveys since 1994. The prevalence of NI is reported as the prevalence of infected patients, defined as the number of infected patients divided by the total number of patients hospitalized at the time of the study, and the prevalence of infections, defined as the number of NIs divided by the total number of patients hospitalized at the time of the study [120]. To evaluate the prevalence of NI, the EHRs of all patients admitted for more than 48 hours on the day of the survey have to be analyzed by infection control practitioners (ICPs). If necessary, additional information is obtained through interviews with the nurses or physicians in charge of the patient. This manual data collection is labor intensive, and the data are mainly collected for statistical analysis.
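Written out, the two prevalence measures defined above are:

```latex
\text{prevalence of infected patients}
  = \frac{\text{number of infected patients}}
         {\text{total number of patients hospitalized at the time of the study}},
\qquad
\text{prevalence of infections}
  = \frac{\text{number of NIs}}
         {\text{total number of patients hospitalized at the time of the study}}
```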

Our hypothesis is that if one could find a small subset of the 83 prevalence variables that has high predictive power and can be derived easily from the data warehouse attributes, one could suggest to the ICPs and/or HPs a list of high-risk patients to investigate, by analyzing the hospital data warehouse on a regular basis. Having the smallest number of highly predictive variables is of primary importance because the values of some NI database attributes are filled in subjectively, while others represent an aggregation of multiple pieces of information from the EHR.
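A minimal sketch of this kind of variable selection, using recursive feature elimination with a linear SVM on synthetic data standing in for the 83 prevalence variables; this illustrates the principle only, not the selection methods developed in this thesis:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Synthetic stand-in for the NI prevalence data: 83 variables,
# few of them informative, with a 90/10 class imbalance.
X, y = make_classification(n_samples=500, n_features=83, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)

# Recursive feature elimination with a linear SVM ranks the variables
# by their weights and retains a small, highly predictive subset.
selector = RFE(SVC(kernel="linear", class_weight="balanced"),
               n_features_to_select=10)
selector.fit(X, y)
print("selected variable indices:", np.flatnonzero(selector.support_))
```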

Beyond the number of variables, the imbalance between the numbers of positive and negative examples in the dataset makes the construction of the model challenging. The HPs carry out many preventive measures to eradicate, or at least reduce, the prevalence of NI. Setting up a general model able to take into account the changes due to these preventive measures is also important. Last but not least, the final users of the model are the HPs, and the model should be transparent enough for them to adhere to its implementation in the clinical setting. This clinical problem constitutes the main clinical application of our approaches.

Interstitial lung diseases (ILD) form a heterogeneous group of more than 150 disorders of the lung tissue. Many of these diseases are rare and present unspecific symptoms. Besides the patient’s clinical data, imaging of the chest allows ambiguity to be resolved in a large number of cases by enabling visual assessment of the lung tissue [50]. The gold standard imaging technique used in case of doubt is high-resolution computed tomography (HRCT), which provides three-dimensional images of the lung tissue with high spatial resolution. The main objective in this clinical problem is to characterize each pixel of the HRCT image (and not to propose a diagnosis). The main challenge here is the number of patterns one may find in an HRCT image, and thus the number of examples one can create from it. Having a robust classifier capable of handling a large number of examples with a relatively high number of variables is of central importance. The number of tissue patterns varies according to the age of the patient, and some tissue patterns may be more common than others, again inducing an imbalance of the classes. In this study we focus on the six most common lung tissue patterns (healthy, emphysema, ground glass, fibrosis, micronodules and consolidation), characterized by their texture properties in HRCT images.
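To give a feeling for how many examples a single image yields, here is a toy sketch that cuts a synthetic 512x512 slice into patches and computes simple per-patch summary statistics; the texture descriptors actually used for ILD classification are considerably richer than these:

```python
import numpy as np

rng = np.random.default_rng(0)
hrct_slice = rng.normal(size=(512, 512))  # synthetic stand-in for one HRCT slice

def patch_features(img, size=32, stride=32):
    """Cut the slice into patches and compute simple texture statistics."""
    feats = []
    for i in range(0, img.shape[0] - size + 1, stride):
        for j in range(0, img.shape[1] - size + 1, stride):
            p = img[i:i + size, j:j + size]
            feats.append([p.mean(), p.std(), p.min(), p.max()])
    return np.asarray(feats)

X = patch_features(hrct_slice)
print(X.shape)  # 256 examples from a single 512x512 slice
```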

In this study, we deliberately focus on kernel-based algorithms, and especially the support vector machine (SVM), to build our CDSS models, because of their practical performance in many classification and regression problems [102]. Comparisons with other algorithms are nevertheless carried out when appropriate. The application of SVMs in this thesis is restricted to classification problems (binary and multi-class). SVMs have proven to be highly accurate for nosocomial infection prediction [30–32] and for interstitial lung tissue classification [41].
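For concreteness, a minimal SVM classification example on synthetic data; scikit-learn’s SVC handles the multi-class case internally via one-vs-one decomposition:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy three-class data standing in for either clinical problem.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An RBF-kernel SVM; the same estimator covers binary and multi-class tasks.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```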

In this thesis, we want to go further in this direction by using other implementations of this algorithm; the speed of model creation and the interpretability of the models guided our choice. The speed (but also the inherent accuracy) of the algorithm matters when carrying out multiple experiments and analyzing voluminous datasets, such as those derived from chest radiography to detect abnormal interstitial lung tissue. Justice and colleagues argued that the generalizability of a model should be assessed, among other aspects, from a historical viewpoint, and call this property the transportability of a model [83]. The speed of a learning algorithm is also valuable for carrying out such validation. Interpretability is necessary for the HPs in the evaluation and adoption of the developed models, and adoption by HPs is considered one of the keys to the success of an IT implementation [65]. The interpretability of the NI models is analyzed through feature selection with single and multiple kernels.
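A hedged sketch of such a transportability check: train on the earliest survey and evaluate sensitivity on the later ones. The years and data below are entirely illustrative:

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic yearly surveys: one (variables, labels) pair per survey year.
surveys = {year: (rng.normal(size=(200, 20)), rng.integers(0, 2, 200))
           for year in (2004, 2005, 2006)}

# Transportability check: fit on the earliest survey, test on later ones.
X_train, y_train = surveys[2004]
model = SVC(kernel="linear", class_weight="balanced").fit(X_train, y_train)
for year in (2005, 2006):
    X, y = surveys[year]
    print(year, "sensitivity:", recall_score(y, model.predict(X)))
```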

The work carried out in this thesis faces several challenges. First, the strong imbalance of positive cases in our datasets (NI and ILD) may lead to high accuracies without the models being sensitive. Indeed, a binary classification algorithm with 90% accuracy on a dataset containing 10% of positive examples may be useless if it classifies all examples into the negative class. Second, the validation of a model on a newly available dataset may cause problems because the population analyzed may not follow the same probability distribution, owing to the evolution of medical practice. Third, the trade-off between interpretability and the complexity and/or stability of a model should be addressed. Fourth, the NI prevalence datasets we use were collected manually for statistical analysis and may contain subjective values. Future deployment of the model may also be hampered by missing information in the EHR.
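The first challenge is easy to reproduce numerically: a degenerate classifier that labels everything negative reaches exactly the accuracy quoted above while detecting no positive case:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 10% positive examples, as in the example above.
y_true = np.array([1] * 10 + [0] * 90)
y_pred = np.zeros_like(y_true)  # degenerate classifier: everything negative

print("accuracy:   ", accuracy_score(y_true, y_pred))  # 0.90
print("sensitivity:", recall_score(y_true, y_pred))    # 0.0
```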