
3.7 Experimentation with Data Mining Algorithms

At this stage, the actual search for patterns of interest takes place with the help of the chosen data mining algorithms. For predictive classification in particular, the results are summarized in confusion matrices, which provide the true and false positives and the true and false negatives. In order to test and compare the performance of the aforementioned classification algorithms for our project, several experiments were carried out using a 10-fold validation procedure, splitting the data into training and testing sets as discussed in Section 3.6. As performance criteria, three simple measures were used initially (the two accuracy rates and the precision rate) and subsequently three composite ones: the AUC, the geometric mean, and the F-measure. In this chapter, we provide a summary of our experimental results (Table 2), while more details can be found in (Daskalaki et al., 2006).

Table 2. Simple and composite performance measures from a 10-fold validation classification procedure. Rows correspond to the class distribution of the training data, from the natural distribution 1:142 (N) to the balanced distribution 1:1 (B). DT: Decision Tree; LLR: Linear Logistic Regression; NN: Neural Network; BN: Bayesian Network; MLR: Multiple Logistic Regression; SVM: Support Vector Machine. A dash (–) marks a value that was not reported.

TP rate (accuracy on the minority class)
            DT     LLR    NN     BN     MLR    SVM
1:142 (N)   –      –      0.061  0.673  0.020  –
1:100       0.020  –      0.102  0.673  0.020  –
1:50        0.204  0.061  0.306  0.673  0.082  0.041
1:25        0.367  0.224  0.551  0.694  0.245  0.082
1:15        0.469  0.347  0.469  0.755  0.347  0.245
1:10        0.469  0.408  0.592  0.735  0.469  0.347
1:5         0.653  0.694  0.755  0.837  0.714  0.674
1:1 (B)     0.816  0.898  0.878  0.878  0.898  0.857

TN rate (accuracy on the majority class)
            DT     LLR    NN     BN     MLR    SVM
1:142 (N)   –      –      0.998  0.925  0.999  –
1:100       0.999  –      0.997  0.923  0.999  0.999
1:50        0.993  0.998  0.991  0.920  0.998  0.997
1:25        0.984  0.993  0.983  0.920  0.993  0.995
1:15        0.975  0.987  0.984  0.917  0.985  0.988
1:10        0.970  0.983  0.968  0.916  0.979  0.980
1:5         0.957  0.950  0.938  0.910  0.950  0.951
1:1 (B)     0.761  0.812  0.813  0.894  0.804  0.809

PR (precision for the minority class)
            DT     LLR    NN     BN     MLR    SVM
1:142 (N)   –      –      0.176  0.059  0.500  –
1:100       0.250  –      0.217  0.058  0.167  –
1:50        0.167  0.176  0.195  0.056  0.200  0.100
1:25        0.140  0.190  0.182  0.057  0.194  0.095
1:15        0.099  0.156  0.173  0.060  0.142  0.121
1:10        0.116  0.143  0.114  0.058  0.138  0.108
1:5         0.095  0.088  0.078  0.061  0.091  0.087
1:1 (B)     0.023  0.032  0.032  0.055  0.031  0.030

AUC
            DT     LLR    NN     BN     MLR    SVM
1:142 (N)   0.500  0.500  0.530  0.799  0.510  0.000
1:100       0.510  0.500  0.550  0.798  0.510  0.500
1:50        0.598  0.530  0.649  0.797  0.540  0.519
1:25        0.676  0.609  0.767  0.807  0.619  0.538
1:15        0.720  0.667  0.727  0.836  0.666  0.616
1:10        0.722  0.696  0.780  0.825  0.724  0.663
1:5         0.805  0.822  0.847  0.873  0.832  0.812
1:1 (B)     0.789  0.855  0.845  0.886  0.851  0.833

GM (geometric mean of TP rate and PR)
            DT     LLR    NN     BN     MLR    SVM
1:142 (N)   –      –      0.104  0.200  0.101  –
1:100       0.071  –      0.149  0.197  0.058  –
1:50        0.184  0.104  0.244  0.194  0.128  0.064
1:25        0.226  0.206  0.317  0.199  0.218  0.088
1:15        0.216  0.233  0.285  0.213  0.222  0.172
1:10        0.233  0.241  0.259  0.206  0.254  0.194
1:5         0.249  0.248  0.243  0.225  0.256  0.243
1:1 (B)     0.138  0.170  0.167  0.219  0.167  0.161

F-measure (of TP rate and PR)
            DT     LLR    NN     BN     MLR    SVM
1:142 (N)   –      –      0.091  0.109  0.002  –
1:100       0.038  –      0.139  0.107  0.003  –
1:50        0.183  0.091  0.238  0.103  0.009  0.058
1:25        0.202  0.206  0.274  0.108  0.016  0.088
1:15        0.185  0.215  0.253  0.111  0.017  0.162
1:10        0.164  0.212  0.191  0.107  0.032  0.165
1:5         0.166  0.157  0.142  0.113  0.062  0.155
1:1 (B)     0.045  0.062  0.061  0.103  0.187  0.059
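To make the computation of the simple measures in Table 2 concrete, the following is a minimal sketch of how the TP rate, TN rate and PR can be obtained from the pooled predictions of a 10-fold cross-validation run. It is only an illustration: the stratification, the scikit-learn classifier and the load_data() helper are assumptions made for the sketch, not the actual setup used in our project.

    # Minimal sketch: simple measures from a 10-fold cross-validation confusion matrix.
    # The classifier choice and the load_data() helper are illustrative assumptions.
    from sklearn.model_selection import StratifiedKFold, cross_val_predict
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import confusion_matrix

    def simple_measures(y_true, y_pred):
        """Return (TP rate, TN rate, PR) computed from the pooled confusion matrix."""
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
        tp_rate = tp / (tp + fn) if (tp + fn) else 0.0       # accuracy on the minority class
        tn_rate = tn / (tn + fp) if (tn + fp) else 0.0       # accuracy on the majority class
        pr = tp / (tp + fp) if (tp + fp) else float("nan")   # precision; undefined with no positive predictions
        return tp_rate, tn_rate, pr

    # X, y = load_data()  # hypothetical loader; y uses 1 for the rare (positive) class
    # cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    # y_pred = cross_val_predict(DecisionTreeClassifier(), X, y, cv=cv)
    # print(simple_measures(y, y_pred))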


Using the simple performance measures, it is concluded that the accuracy rate for the minority class (TP rate) has the tendency to increase as the proportion of positive examples in the dataset increases. At the same time, the accuracy rate for the majority class (TN rate) and, even more so, the precision rate (PR) for the minority class tend to decrease. These observations hold roughly for all classification algorithms except for the Bayesian Network. As discussed previously (Chan and Stolfo, 1998; Elkan, 2001), the Bayesian Network classification algorithm is not sensitive to changes in the class distribution.

Compared to the other algorithms, the Bayesian Network algorithm gives the highest values for the TP rate but the lowest ones for the PR and the TN rate for nearly all class distributions. In addition, the experimental results make clear that maximizing the TP rate conflicts with maximizing the PR or the TN rate. As the percentage of positive cases in the training dataset increases, the probability that an induced classifier predicts the positive class becomes higher as well. Thus, both the number of false positive cases and the number of true positive cases are expected to increase. Evidently, the relative increase in true positive cases is smaller than the corresponding increase in false positive cases, and the PR therefore decreases.
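As a small numeric illustration of this trade-off (the counts below are purely hypothetical and are not taken from our experiments), the snippet shows how the PR can drop even though the number of true positives grows:

    # Purely hypothetical confusion-matrix counts, for illustration only.
    # "before" mimics training on the natural distribution, "after" a more balanced one.
    before = {"tp": 10, "fp": 40}
    after = {"tp": 30, "fp": 300}

    for name, counts in (("before", before), ("after", after)):
        pr = counts["tp"] / (counts["tp"] + counts["fp"])  # precision = TP / (TP + FP)
        print(f"{name}: TP={counts['tp']}, FP={counts['fp']}, PR={pr:.3f}")
    # TP triples, but FP grows even faster, so PR falls from 0.200 to 0.091.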

The performance of the classification algorithms was additionally evaluated using three composite performance measures: the AUC, which uses both the TP and TN rates (FP rate = 1 − TN rate); the geometric mean of the TP rate and the PR, GM = \sqrt{TP \cdot PR}; and the F-measure of the TP rate and the PR:

F = \frac{(\beta^2 + 1) \cdot TP \cdot PR}{\beta^2 \cdot PR + TP}    (2)
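As a concrete check of these definitions, the short sketch below computes the GM and the F-measure of Equation (2). The TP rate and PR values are taken from the Neural Network classifier at the 1:25 distribution in Table 2, while the choices of β are arbitrary and serve only to illustrate its limiting behavior.

    import math

    def geometric_mean(tp_rate, pr):
        """GM = sqrt(TP rate * PR)."""
        return math.sqrt(tp_rate * pr)

    def f_measure(tp_rate, pr, beta=1.0):
        """F = (beta^2 + 1) * TP * PR / (beta^2 * PR + TP), as in Equation (2)."""
        return (beta ** 2 + 1) * tp_rate * pr / (beta ** 2 * pr + tp_rate)

    # TP rate and PR of the NN classifier at the 1:25 distribution (Table 2).
    tp_rate, pr = 0.551, 0.182
    print(geometric_mean(tp_rate, pr))         # ~0.317
    print(f_measure(tp_rate, pr, beta=1.0))    # ~0.274
    print(f_measure(tp_rate, pr, beta=0.0))    # equals the PR (0.182)
    print(f_measure(tp_rate, pr, beta=100.0))  # approaches the TP rate (~0.551)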

According to our experimental results (Table 2), the AUC measure behaves very similarly to the TP rate and has the tendency to increase as the proportion of minority cases in the training set increases. Again, the classifiers induced by the Bayesian Network algorithm give the highest AUC values for all class distributions. Conversely, the geometric mean and the F-measure exhibit approximately concave behavior for most algorithms and attain their “maximum” values when the class distribution is in the range of 1:25 to 1:5. For the GM, this is explained by the fact that, as the number of minority cases in the dataset increases, the TP rate increases while the PR decreases. Thus, an increase of the geometric mean indicates that the achieved improvement in the TP rate is beneficial, since it is not accompanied by a simultaneous “large” decrease of the PR. The GM attains a “maximum” at the class distribution where the benefit from the increase in the TP rate outweighs the corresponding decrease in the PR. Using the GM as performance measure, the classifiers induced by the Neural Network algorithm exhibit superior behavior, achieving their best performance on the 1:25 dataset. The rest of the classification algorithms behave in a comparable fashion and attain their best performance either on the 1:10 or the 1:5 dataset, while the SVM classifiers exhibit the worst performance. The only exception, again, is the Bayesian Network algorithm, which, as already discussed, is insensitive to changes in the class distribution; therefore, its GM values are approximately the same for all class distributions.

The F-measure (Lewis and Gale, 1994) also combines the TP rate and the PR. Its value depends on a factor denoted by β (Equation (2)), which takes values from 0 to infinity and controls the relative impact of the TP rate and the PR. It is easy to show that for β = 0 the F-measure reduces to the PR, while as β → ∞ it approaches the TP rate (for β = 0 the expression becomes TP·PR/TP = PR, and for large β the β² terms dominate, giving TP·PR/PR = TP). Based on the F-measure, the classifiers induced by the Neural Network prevail, giving the highest values for several datasets, followed by the classifiers induced by the Decision Tree and the Linear Logistic Regression algorithms. For all class distributions the Bayesian Network’s classifiers achieve approximately the same F value (close to 0.1), while the classifiers induced by the Multiple Logistic Regression give the lowest values, with the only exception of the 1:1 dataset. Lastly, the datasets in the range 1:25 to 1:5 appear to train classifiers in a way that achieves the highest F values for all algorithms.