
4.4 Experiments

4.4.2 Historical validation of nosocomial infection model

Objective

The objective of this experiment is to evaluate a model on a historical basis. An NI model is first built with the 2006 prevalence database and then evaluated on two datasets from the two following years (2007 and 2008). This allows us not only to assess the robustness of the model but also to measure the effect of the measures carried out to prevent NIs. On the one hand, tuning a classification algorithm is time consuming and may need human intervention; on the other hand, a model that remains usable over a long period makes it possible to follow the evolution of the situation during its deployment.

Materials and methods

As introduced previously, we use a 2006 prevalence survey to build the NI model and the 2007 and 2008 prevalence surveys for evaluation. We use three versions of these datasets as in Chapter 2.

The first dataset, called DS AF, contains all features from the prevalence database: demographic information; admission diagnosis according to the McCabe score and the Charlson index classification; patient information at the study date (ward type and name, Methicillin-Resistant Staphylococcus aureus carriage status, etc.); and information at the study date and during the six preceding days (clinical data, central venous catheter carriage, workload, infection status, etc.).

After a first data cleaning and binarization, this dataset contains 60 features and 1384 cases, including 166 positive ones (11.99%). The second dataset, called DS RF, contains 20 features obtained after applying two feature selection methods (information gain [111] and SVM-RFE [61]). The third dataset, DS RF NSR, is obtained from DS RF by removing the fever and workload features.

We highlighted in a previous study the redundancy or the negative interaction of these two features with the others, which makes learning on the datasets DS AF and DS RF challenging [71].
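The two feature selection steps mentioned above (an information-gain ranking followed by SVM-RFE) can be reproduced with standard tooling. A minimal sketch, assuming the cleaned, binarized 2006 survey is available as a feature matrix X and a label vector y; the helper name and the two-stage layout are illustrative, not the thesis code:

```python
# Sketch of the two feature selection steps: an information-gain
# (mutual information) ranking used as a first filter, then SVM-RFE.
# X (n_cases x 60 binary features) and y (infection status) are
# assumed to come from the cleaned 2006 prevalence survey.
import numpy as np
from sklearn.feature_selection import mutual_info_classif, RFE
from sklearn.svm import SVC

def select_features(X, y, n_keep=20):
    # Information-gain ranking: keep the most informative features
    # as a cheap first pass.
    ig = mutual_info_classif(X, y, discrete_features=True)
    first_pass = np.argsort(ig)[::-1][:2 * n_keep]

    # SVM-RFE: recursively drop the features with the smallest weight
    # in a linear SVM until n_keep features remain.
    rfe = RFE(SVC(kernel="linear"), n_features_to_select=n_keep)
    rfe.fit(X[:, first_pass], y)
    return first_pass[rfe.support_]
```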


Table 4.1: Best parameters C and σ according to the initialization points on the datasets DS AF, DS RF and DS RF NSR

                               C                              σ
Dataset       Init     Mean    Median  Std dev       Mean    Median  Std dev     |Cov(C, σ)|
DS AF         init1    3.395   0.555   21.529        0.069   0.066   0.041       0.1906
              init2    0.678   0.556   0.445         0.067   0.065   0.038       0.005
              init3    1.152   0.561   4.173         0.068   0.071   0.040       0.035
              init4    0.577   0.549   0.128         0.065   0.064   0.029       0.002
DS RF         init1    0.406   0.389   0.085         0.199   0.189   0.086       0.0001
              init2    0.408   0.388   0.077         0.199   0.194   0.085       0.002
              init3    0.392   0.391   0.051         0.198   0.193   0.076       0
              init4    0.400   0.397   0.060         0.197   0.193   0.082       0
DS RF NSR     init1    1.273   0.356   7.914         0.232   0.205   0.128       0.2123
              init2    0.363   0.350   0.065         0.230   0.201   0.124       0.001
              init3    0.362   0.349   0.081         0.230   0.204   0.115       0.002
              init4    0.358   0.348   0.061         0.227   0.199   0.108       0.002

The 2007 (resp. 2008) prevalence survey contains 1528 (resp. 1467) unique cases, including 153 (resp. 156) positive cases; the ratio of positive cases lies between 10% and 12% for the three years. We use libsvm with an L2-SVM implementation, together with a radius-margin bound minimization implemented with a gradient descent algorithm [28]. The software is executed on a Linux machine with a 2.33 GHz quad-core processor and 3 GB of memory.
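For context, the quantity minimized by this gradient descent is, up to constants, the classical radius-margin bound on the leave-one-out error; it is stated here in its textbook form, and the exact variant implemented in [28] may differ:

```latex
% Radius-margin bound on the leave-one-out (LOO) error of an SVM with
% an RBF kernel (textbook form; the variant used in [28] may differ).
% R: radius of the smallest sphere enclosing the training points in
% feature space; \lVert w \rVert: inverse of the margin; \ell: number of cases.
\[
  \mathrm{LOO\ error} \;\le\; \frac{R^2 \,\lVert w \rVert^2}{\ell}
\]
% Both R and w depend smoothly on (C, \sigma) through the kernel
% k(x, x') = \exp\!\left(-\lVert x - x'\rVert^2 / (2\sigma^2)\right),
% so the bound can be minimized over (C, \sigma) by gradient descent.
```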

The model selection and evaluation approximately follow the learning framework defined in Chapter 2. However, instead of using a grid search, we chose 4 initialization points for the algorithm parameters (C, σ); for each initialization point, 3×5 cross-validations are performed, providing 60 pairs of SVM parameters in total (15 per initialization point). As the datasets are imbalanced between positive and negative cases, we arbitrarily correct the imbalance by taking approximately equal numbers of positive and negative cases before performing the cross-validations. The radius-margin optimization may converge to a final SVM solution from each initialization point, but we retain the results with the smallest absolute covariance between the two parameters.
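The selection loop can be summarized as follows. The function optimize_radius_margin stands in for the gradient descent of [28] and is hypothetical here, as is the undersampling helper; only the overall control flow follows the text:

```python
# Sketch of the model selection loop: for each initialization point,
# 3 repetitions of 5-fold cross-validation are run on a class-balanced
# subsample, each fold yielding one (C, sigma) pair (15 per point, 60
# in total over 4 points); the initialization with the smallest
# |Cov(C, sigma)| is kept. optimize_radius_margin stands in for the
# gradient descent of [28] (hypothetical name).
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

def undersample(X, y, rng):
    # Roughly balance the classes: keep all positives and an
    # equal-sized random subset of negatives.
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    keep = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
    return X[keep], y[keep]

def select_parameters(X, y, init_points, seed=0):
    rng = np.random.default_rng(seed)
    Xb, yb = undersample(X, y, rng)
    results = {}
    for c0, sigma0 in init_points:
        cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=seed)
        pairs = np.array([optimize_radius_margin(Xb[tr], yb[tr], c0, sigma0)
                          for tr, _ in cv.split(Xb, yb)])  # shape (15, 2)
        results[(c0, sigma0)] = (pairs, abs(np.cov(pairs.T)[0, 1]))
    # Keep the initialization with the smallest |Cov(C, sigma)|.
    best = min(results, key=lambda k: results[k][1])
    return results[best][0].mean(axis=0)
```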

The evaluation of the model on the 2006 data is done on 100 training/testing splits, i.e. 100 models are built with the best parameters and evaluated on their corresponding test sets. The mean values of the AUC, F-measure, precision, recall, specificity and accuracy over the 100 test sets are used as performance metrics. The 2007 and 2008 prevalence data are also evaluated with these 100 models: the prediction for each case is obtained by a majority vote over the predictions of the 100 models.
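A sketch of this evaluation protocol, using scikit-learn's SVC as a stand-in for the L2-SVM of libsvm and a simplified split scheme:

```python
# Sketch of the evaluation protocol: 100 models trained on 100 splits
# of the 2006 data with the selected (C, sigma), each evaluated on its
# own test split, then applied jointly to the 2007/2008 data with a
# majority vote. SVC is a stand-in for the thesis's L2-SVM.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit

def train_models(X06, y06, C, sigma, n_models=100):
    models, aucs = [], []
    splits = StratifiedShuffleSplit(n_splits=n_models, test_size=0.3,
                                    random_state=0)
    for tr, te in splits.split(X06, y06):
        # Map sigma to sklearn's gamma: k = exp(-||x-x'||^2 / (2 sigma^2)).
        clf = SVC(C=C, gamma=1.0 / (2 * sigma**2))
        clf.fit(X06[tr], y06[tr])
        aucs.append(roc_auc_score(y06[te], clf.decision_function(X06[te])))
        models.append(clf)
    return models, np.mean(aucs)  # mean AUC over the 100 test sets

def majority_vote(models, X_new):
    # A 2007/2008 case is labeled positive if more than half of the
    # 100 models predict it positive.
    votes = np.mean([m.predict(X_new) for m in models], axis=0)
    return (votes > 0.5).astype(int)
```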

Results

Four initialization points are considered for model selection: init1 = (e, 1), init2 = (e², e²), init3 = (e, e210) and init4 = (e⁵, e⁵), where e is the base of the natural logarithm. The 60 SVM parameter pairs of the dataset DS AF (respectively DS RF and DS RF NSR) are obtained in 81 (respectively 56 and 53) seconds. Table 4.1 summarizes the obtained parameters through their mean, median and standard deviation. All initialization points converge approximately to the same value of (C, σ), but the initialization point init4 (respectively init3 and init4) yields the smallest covariance between the SVM parameters for the dataset DS AF (respectively DS RF and DS RF NSR).
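The statistics reported in Table 4.1 are straightforward to recompute; a minimal sketch, assuming `pairs` holds the 15 (C, σ) values produced by one initialization point:

```python
# Summary statistics of Table 4.1 for one initialization point: mean,
# median and standard deviation of C and sigma, plus the absolute
# covariance between them. `pairs` is the (15, 2) array of (C, sigma)
# values from the 3x5 cross-validations.
import numpy as np

def summarize(pairs):
    stats = {}
    for j, name in enumerate(("C", "sigma")):
        col = pairs[:, j]
        stats[name] = (col.mean(), np.median(col), col.std(ddof=1))
    stats["abs_cov"] = abs(np.cov(pairs.T)[0, 1])
    return stats
```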

The evaluation of these models on the 2006, 2007 and 2008 data, according to the features used, is summarized in Table 4.2. The means of the AUC, F-measure, precision, sensitivity, specificity and accuracy of the models are reported.

Table 4.2: Results obtained when applying the best parameters to the prevalence datasets of the years 2006, 2007 and 2008, according to the number of features in the datasets DS AF, DS RF and DS RF NSR

The time needed to obtain the best parameters is relatively small, which makes exploiting a bound on the generalization error an attractive approach for tuning the SVM. The size of the training file and the number of features also contributed to this, since we under-sampled the majority class. The convergence of the initialization points towards approximately the same parameter values suggests that the algorithm is not trapped in a local minimum.

All the performance metrics obtained with the three datasets (DS AF, DS RF and DS RF NSR) of the year 2006 improve gradually as the redundant variables are removed, except the sensitivity from DS RF to DS RF NSR. As we have seen in Chapter 2, the temperature and workload are not straightforward to retrieve from the hospital data warehouse. Removing these two variables from DS RF to obtain DS RF NSR improved all the performance metrics except the sensitivity, whose decrease is not significant according to the Mann-Whitney-Wilcoxon test.
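The significance check can be done directly with SciPy; a sketch comparing, for instance, the 100 per-split sensitivity values of DS RF and DS RF NSR (the variable names are illustrative):

```python
# Sketch of the significance test: compare the 100 per-split
# sensitivity values of DS RF and DS RF NSR with a two-sided
# Mann-Whitney-Wilcoxon test; a p-value above alpha means the
# sensitivity drop is not significant.
from scipy.stats import mannwhitneyu

def sensitivity_differs(sens_rf, sens_rf_nsr, alpha=0.05):
    stat, p_value = mannwhitneyu(sens_rf, sens_rf_nsr,
                                 alternative="two-sided")
    return p_value < alpha, p_value
```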

The speed and efficiency of this version of the SVM is of particular interest if one wants to carry out many experiments. In this chapter, for example, we only used three versions (DS AF, DS RF and DS RF NSR) of the NI dataset. Building on the efficiency of this SVM implementation, we investigate in the next chapter the problem of feature selection, where we gradually remove features according to their rank to find the best predictive NI variables.

With the dataset DS AF, the model built on the 2006 prevalence data reaches a precision of 39.29%, i.e. 39.29% of the cases predicted positive are true positives, and these detected cases represent 81.29% of all positive cases (sensitivity). When the model is applied to the 2007 and 2008 prevalence data, the values of the performance metrics change considerably: the model provides a high precision with a low sensitivity compared to the 2006 results. The situation is the same for the three datasets DS AF, DS RF and DS RF NSR.
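To see what these two figures imply together, a back-of-the-envelope computation, assuming a test prevalence of about 12% as in the 2006 data (the counts below are illustrative, not taken from Table 4.2):

```latex
% Illustrative arithmetic only: 1000 cases at ~12% prevalence gives
% 120 positives. Sensitivity 81.29% and precision 39.29% then imply:
\[
  TP \approx 0.8129 \times 120 \approx 97.5, \qquad
  \text{predicted positives} \approx \frac{97.5}{0.3929} \approx 248, \qquad
  FP \approx 151 .
\]
```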

The results in Table 4.2 show the importance of using the right evaluation measure: one could wrongly conclude that the models perform better on the 2007 and 2008 datasets because their accuracies and F-measures increase. The increase in accuracy, for example, is due to the high proportion of negative cases that are correctly classified (high specificity). The AUC shows the opposite tendency and reveals a degradation of the results, because it takes into account simultaneously the sensitivity and the specificity of the model.
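This effect can be checked on a toy confusion matrix with roughly the 2007/2008 class ratio; the counts are illustrative, not taken from Table 4.2:

```python
# Illustration of why accuracy can rise while AUC falls on imbalanced
# data: with ~10% positives, a model that sacrifices sensitivity for
# specificity still scores a higher accuracy. Counts are illustrative.
def metrics(tp, fn, fp, tn):
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + fp + tn)
    return sens, spec, acc

# 2006-like model: high sensitivity, moderate specificity.
print(metrics(tp=81, fn=19, fp=125, tn=775))  # sens 0.81, spec 0.86, acc 0.86
# 2007-like model: low sensitivity, high specificity -> higher accuracy.
print(metrics(tp=30, fn=70, fp=20, tn=880))   # sens 0.30, spec 0.98, acc 0.91
```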


The degradation of the AUC value can be interpreted as a measure of the effectiveness of the preventive actions carried out by the infection control practitioners. Indeed, the prevalence rate remained relatively stable from 2006 to 2008, but we cannot predict the new cases at the same level as with the 2006 dataset. This means that the "profile" of the patients, i.e. the way they were contaminated by the NI, has changed. A measure of information gain followed by a chi-square filtering of the 2007 and 2008 data, as we did in Chapter 2 to obtain the dataset DS RF, highlights a change in the order of the variables and the appearance of new important ones. This "concept drift" [129] makes the model built on year Y inconsistent with data from years Y + 1 and Y + 2. This result can also be interpreted as a limit of a distribution-free algorithm such as the SVM: when the internal structure of the data changes, one has to retrain the classifier.
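The ranking comparison mentioned above can be sketched as follows, reusing the per-year information-gain scores; the yearly feature matrices are assumed available, and the use of a rank correlation as the drift indicator is our illustration, not the thesis procedure:

```python
# Sketch of the concept-drift check: rank the features of each yearly
# survey by information gain and compare the orderings; a low rank
# correlation between years signals that the important variables have
# changed. Feature matrices and labels per year are assumed available.
from scipy.stats import spearmanr
from sklearn.feature_selection import mutual_info_classif

def ranking_drift(X_y1, y_y1, X_y2, y_y2):
    ig1 = mutual_info_classif(X_y1, y_y1, discrete_features=True)
    ig2 = mutual_info_classif(X_y2, y_y2, discrete_features=True)
    rho, _ = spearmanr(ig1, ig2)  # correlation of the two score vectors
    return rho  # close to 1: stable ranking; low: drifted importance

# e.g. ranking_drift(X_2006, y_2006, X_2007, y_2007)
```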