

In the document Springer Theses (Pages 152-156)


6.5.4 COG-OS for Network Intrusion Detection

Here, we demonstrate an application of COG-OS to network intrusion detection. For this experiment, we used a real-world network intrusion data set, which was provided as part of the KDD-CUP-99 classifier learning contest and is now a benchmark data set in the UCI KDD Archive.8

The KDD Cup Data Set. This data set was collected by monitoring a real-life military computer network that was intentionally peppered with various attacks that hackers would use to break in. The original training set has about 5 million records belonging to 22 subclasses and 4 attack classes, i.e., DoS, Probe, R2L, and U2R, plus one normal class. In this experiment, we used the 10 % sample of the original set, which is also supplied as part of the KDD CUP contest. We present results for two rare classes: Probe and R2L, whose populations in the 10 % sample training set are 0.83 % and 0.23 %, respectively. The provided test set has some new subclasses that are not present in the training data, so we deleted the instances of these new subclasses; the resulting percentages of Probe and R2L in the test set are 0.81 % and 2.05 %, respectively.

8 http://kdd.ics.uci.edu/

Table 6.6 shows detailed information on these data sets. Note that we obtained the probe_binary data set by treating the Probe class as the rare class and merging the remaining four classes into one large class. The other data set, r2l_binary, was prepared in a similar way.

The Benchmark Classifiers. In this experiment, we applied four classifiers: COG-OS (SVMs), pure SVMs, RIPPER [3], and PNrule [14]. For COG-OS, the cluster number for the large class is four, and the over-sampling ratios for the rare classes of probe_binary and r2l_binary are 30 and 120, respectively. For SVMs, we set the parameters as -t 0 (a linear kernel). RIPPER and PNrule are two rule-induction classifiers. RIPPER builds rules first for the smallest class and does not build rules for the largest class.

Hence, one might expect RIPPER to perform well on the rare class. As for PNrule, it consists of positive rules (P-rules), which predict the presence of the class, and negative rules (N-rules), which predict its absence. It is the existence of N-rules that eases the two problems induced by the rare class: splintered false positives and error-prone small disjuncts. These two classifiers have shown appealing performance in classifying the modified binary data sets in Table 6.6, and PNrule even shows superior performance [14]. To the best of our knowledge, we used the same source data as Joshi et al., and our pre-processing procedure for the modified data sets is also very similar to theirs. Therefore, we simply adopted the results for PNrule from [14] in our study.
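Although no code is given here, the COG-OS data preparation described above (k-means decomposition of the large class plus over-sampling of the rare class) can be sketched as follows. This is an illustrative sketch only: the function names are ours, and it assumes random over-sampling with replacement; any linear SVM trainer can then be run on each sub-class/rare-class pair.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    # Plain Lloyd's algorithm; COG-OS uses k-means only to split the
    # large class into sub-classes, so a basic implementation suffices.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared distances of every point to every center: (n, k)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):                # keep old center if cluster is empty
                centers[j] = pts.mean(axis=0)
    return labels

def cog_os_prepare(X_large, X_rare, k=4, ratio=30, seed=0):
    # Split the large class into k sub-classes (k = 4 in the text) and
    # replicate the rare class `ratio` times by sampling with replacement
    # (ratio = 30 for probe_binary, 120 for r2l_binary).
    sub_labels = kmeans(X_large, k, seed=seed)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_rare), size=ratio * len(X_rare), replace=True)
    return sub_labels, X_rare[idx]
```

A linear classifier is then learned between each of the k sub-classes and the over-sampled rare class, which is where the four hyper-planes discussed below come from.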

The Results. Table 6.7 shows the classification results of the various methods on the probe_binary data set. As can be seen, COG-OS performs much better than pure SVMs and RIPPER in predicting both the rare class and the normal class, while PNrule shows a slightly higher F-measure on the rare class. For the r2l_binary data set, however, COG-OS shows the best performance among all classifiers. As indicated in Table 6.7, the F-measure of the rare class achieved by COG-OS is 0.496, far higher than the values produced by the other classifiers. Meanwhile, the predictive accuracy of COG-OS on the large class is also higher than that of pure SVMs and RIPPER. This real-world application nicely illustrates the effectiveness of COG-OS, the combination of local clustering and over-sampling schemes. We believe that COG-OS is a promising solution to the difficult classification problem induced by complex concepts and imbalanced class distributions.

We also examined the training efficiency of COG-OS; the experimental platform is Windows XP with an Intel Core2 1.86 GHz CPU and 4 GB memory. As can be seen in Fig. 6.7, for the probe_binary data set, the training time of COG-OS is 173 s, far less than the 889 s required by pure SVMs. Note that the scale of the training data set for COG-OS is 613124 after over-sampling, a much larger number than the scale of the training data set for pure SVMs:

6.5 Experimental Results 143

Table 6.6 Information of data sets (entries given as training/test values)

Dataset          Source    #Objects        #Features  #Classes  MinClassSize  MaxClassSize    CV
kddcup99data     UCI KDD   494021/292300   41         5         52/39         391458/223298   1.708/1.634
probe_binary^a   UCI KDD   494021/292300   41         2         4107/2377     489914/289923   1.391/1.391
r2l_binary^a     UCI KDD   494021/292300   41         2         1126/5993     492895/286307   1.408/1.356

Note: we deleted 18729 instances from the test set, since the labels of these instances are not present in the training set. ^a Modified binary data sets from kddcup99data
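The CV column in Table 6.6 is the coefficient of variation of the class sizes, i.e., the standard deviation of the class sizes divided by their mean (here assuming the sample standard deviation). For the binary data sets it can be checked directly from the Min/MaxClassSize columns:

```python
import statistics

def cv(class_sizes):
    # Coefficient of variation: sample standard deviation / mean.
    return statistics.stdev(class_sizes) / statistics.mean(class_sizes)

print(round(cv([4107, 489914]), 3))  # 1.391 (probe_binary, training set)
print(round(cv([1126, 492895]), 3))  # 1.408 (r2l_binary, training set)
```

Both values match the CV column of Table 6.6.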

144 6 K-means Based Local Decomposition for Rare Class Analysis

Table 6.7 Results on modified data sets

probe_binary SVMs RIPPER PNrule COG-OS

rare class 0.806 0.798 0.884 0.881

huge class 0.998 0.998 N/A 0.999

total 0.996 0.996 N/A 0.998

r2l_binary SVMs RIPPER PNrule COG-OS

rare class 0.262 0.360 0.230 0.496

huge class 0.991 0.992 N/A 0.993

total 0.983 0.984 N/A 0.986

Note: "N/A" means the results were not provided in the source paper [14]

Fig. 6.7 The computational performance on the network intrusion data. Reprinted from Ref. [31], with kind permission from Springer Science+Business Media

494021. This implies that local clustering indeed helps to reduce the training time of SVMs. In fact, if we learn SVMs on the over-sampled data set without employing local clustering, as indicated by the "OS" column in Fig. 6.7, the training time increases dramatically to over 2.5 h! To further illustrate the efficiency of COG-OS, we also recorded the time consumed in learning each of the four hyper-planes in COG-OS, as shown by the pie plot in Fig. 6.7. As can be seen, of the four sub-classes produced by local clustering on the normal class, only one is much harder to distinguish from the rare class, since finding a hyper-plane between this sub-class and the rare class took a much longer time. For the remaining three sub-classes, finding the corresponding hyper-planes took much less time. This indicates that local clustering can divide a complex structure/concept into several simple structures/concepts that are easy to separate linearly and require less training time to find hyper-planes.
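As a sanity check, the over-sampled training-set size of 613124 quoted above follows directly from the probe_binary class sizes in Table 6.6 and the over-sampling ratio of 30:

```python
# Training-set sizes for probe_binary from Table 6.6
large_class = 489914   # merged "normal" side of probe_binary
rare_class = 4107      # Probe instances
ratio = 30             # over-sampling ratio used by COG-OS

oversampled_total = large_class + ratio * rare_class
print(oversampled_total)  # 613124, the COG-OS training-set size above
```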


Table 6.8 Some characteristics of the credit card data set

Dataset        #instances (Normal class / Rare class)   #features   CV

Training set   67763 / 13374                            324         0.948

Test set       67763 / 13374                            324         0.948

Table 6.9 Results on the credit card data set

SVMs RIPPER COG

R P F R P F R P F

Normal class 0.969 0.856 0.909   0.955 0.874 0.913   0.949 0.881 0.914

Rare class   0.176 0.530 0.264   0.304 0.573 0.397   0.353 0.577 0.438

total 0.838 0.838 0.838 0.848 0.848 0.848 0.851 0.851 0.851

Note: R recall, P precision, F F-measure
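The F column in Tables 6.7 and 6.9 is the standard F-measure, i.e., the harmonic mean of precision and recall, which reproduces the tabled values:

```python
def f_measure(recall, precision):
    # Harmonic mean of precision and recall (the F column in Table 6.9).
    return 2 * precision * recall / (precision + recall)

# Normal class under SVMs: R = 0.969, P = 0.856
print(round(f_measure(0.969, 0.856), 3))  # 0.909, matching Table 6.9
# Rare class under SVMs: R = 0.176, P = 0.530
print(round(f_measure(0.176, 0.530), 3))  # 0.264, matching Table 6.9
```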
