

7.5 Classification Techniques

Classification is a supervised data mining technique in which a data mining model is first trained with some “ground truth,” that is, training data. Each instance (or data point) in the training data is represented as a vector of features, and each training instance is associated with a “class label.” The data mining model trained from the training data is called a “classification model,” which can be represented as a function f(x): feature vector → class label. This function approximates the feature vector-to-class label mapping found in the training data.

When a test instance with an unknown class label is passed to the classification model, it predicts (i.e., outputs) a class label for the test instance. The accuracy of a classifier is determined by how many unknown instances (instances that were not in the training data) it can classify correctly.
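As an illustration of this train/predict workflow, the following is a minimal sketch assuming scikit-learn; the feature values, labels, and the decision tree stand-in classifier are hypothetical and are not the setup used in our experiments.

```python
# Minimal sketch of the supervised classification workflow described above.
# Assumptions: scikit-learn is available; the feature matrix and labels are
# hypothetical stand-ins for per-email feature vectors and class labels.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Each row is one email as a feature vector; 1 = infected, 0 = clean.
X_train = np.array([[0.1, 3, 12], [0.9, 40, 2], [0.2, 5, 10], [0.8, 35, 1]])
y_train = np.array([0, 1, 0, 1])

model = DecisionTreeClassifier()     # stand-in for NB, SVM, or C4.5
model.fit(X_train, y_train)          # learn f(x): feature vector -> class label

X_test = np.array([[0.85, 30, 3]])   # unseen (test) instance
print(model.predict(X_test))         # predicted class label, e.g., [1]
```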

We apply the NB [John and Langley, 1995], SVM [Boser et al., 1992], and C4.5 decision tree [Quinlan, 1993] classifiers in our experiments. We also apply our implementation of the Series classifier [Martin et al., 2005-b] to compare its performance with other classifiers. We briefly describe the Series approach here for the purpose of self-containment.

NB assumes that features are independent of each other. With this assumption, the probability that an instance x = (x_1, x_2, …, x_n) is in class c (c ∈ {1, …, C}) is

P(c \mid x) = \frac{P(c) \prod_{j=1}^{n} P(X_j = x_j \mid c)}{P(x)}

where x_j is the value of the j-th feature of the instance x, P(c) is the prior probability of class c, and P(X_j = x_j | c) is the conditional probability that the j-th attribute has the value x_j given class c.

So the NB classifier outputs the following class:

\hat{c} = \arg\max_{c \in \{1, \ldots, C\}} P(c) \prod_{j=1}^{n} P(X_j = x_j \mid c)
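As a worked illustration of this decision rule, the sketch below (our own, with made-up priors and conditional probability tables for two discrete attributes) returns the class that maximizes P(c) · ∏_j P(X_j = x_j | c).

```python
# Sketch of the NB decision rule with illustrative (made-up) probabilities.
priors = {"clean": 0.7, "infected": 0.3}
cond = {
    "clean":    [{"yes": 0.1, "no": 0.9},    # P(X1 = x1 | clean)
                 {"high": 0.2, "low": 0.8}],  # P(X2 = x2 | clean)
    "infected": [{"yes": 0.8, "no": 0.2},
                 {"high": 0.7, "low": 0.3}],
}

def nb_classify(x):
    """Return argmax_c P(c) * prod_j P(X_j = x_j | c)."""
    scores = {}
    for c in priors:
        score = priors[c]
        for j, xj in enumerate(x):
            score *= cond[c][j][xj]
        scores[c] = score
    return max(scores, key=scores.get)

print(nb_classify(["yes", "high"]))  # -> "infected" (0.168 vs. 0.014)
```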

NB treats discrete and continuous attributes differently. For each discrete attribute, P(X = x|c) is modeled by a single real number between 0 and 1 that represents the probability that the attribute X will take on the particular value x when the class is c. In contrast, each numeric (or real) attribute is modeled by some continuous probability distribution over the range of that attribute’s values. A common assumption, not intrinsic to the NB approach but often made nevertheless, is that within each class the values of numeric attributes are normally distributed. One can represent such a distribution in terms of its mean and standard deviation, and one can efficiently compute the probability of an observed value from such estimates. For continuous attributes we can write

P(X = x \mid c) = \frac{1}{\sigma_c \sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu_c)^2}{2\sigma_c^2}\right)

where \mu_c and \sigma_c are the mean and standard deviation of the attribute values observed in the training instances of class c.
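Under this normality assumption, one estimates the per-class mean and standard deviation from the training data and evaluates the density at the observed value, as in this small sketch (ours, with illustrative numbers only).

```python
# Sketch: Gaussian likelihood of a continuous attribute value given a class.
import math

def gaussian_likelihood(x, mu_c, sigma_c):
    """P(X = x | c) under a normal distribution with class-specific mean/std."""
    coeff = 1.0 / (sigma_c * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu_c) ** 2) / (2.0 * sigma_c ** 2))

# Example: values of this attribute for class "infected" have mean 40, std 5.
print(gaussian_likelihood(35.0, mu_c=40.0, sigma_c=5.0))
```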

Smoothing (m-estimate) is used in NB. We have used the values m = 100 and p = 0.5 while calculating the probability

P(X = x \mid c) = \frac{n_c + m \cdot p}{n + m}

where n_c = total number of instances for which X = x given class c, and n = total number of instances for which X = x.
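The following sketch (our illustration, using the counts and the m = 100, p = 0.5 settings above) shows why the m-estimate is useful: even when a value never co-occurs with a class in the training data, the estimated probability stays non-zero, so one unseen attribute value cannot zero out the whole product.

```python
# Sketch of m-estimate smoothing as used in NB.
def m_estimate(n_c, n, m=100, p=0.5):
    """m-estimate of probability: (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

print(m_estimate(n_c=0, n=20))    # 50/120 ~= 0.417, not 0
print(m_estimate(n_c=15, n=20))   # 65/120 ~= 0.542
```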

SVM can perform either linear or non-linear classification. The linear classifier proposed by [Boser et al., 1992] creates a hyperplane that separates the data into two classes with the maximum margin. Given positive and negative training examples, a maximum-margin hyperplane is identified that splits the training examples such that the distance between the hyperplane and the closest examples is maximized. The non-linear SVM is implemented by applying the kernel trick to maximum-margin hyperplanes. The feature space is transformed into a higher dimensional space, where the maximum-margin hyperplane is found. This hyperplane may be non-linear in the original feature space. A linear SVM is illustrated in Figure 7.3. The circles are negative instances and the squares are positive instances. A hyperplane (the bold line) separates the positive instances from the negative ones. All of the instances are at least at a minimal distance (the margin) from the hyperplane. The points that lie exactly at this margin distance from the hyperplane are called the support vectors. As mentioned earlier, the SVM finds the hyperplane that has the maximum margin among all hyperplanes that can separate the instances.

Figure 7.3 Illustration of support vectors and margin of a linear SVM.
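The sketch below illustrates the linear versus kernelized (non-linear) SVM described above, using scikit-learn’s SVC, which wraps LIBSVM; the synthetic data and parameter values are ours for illustration only and do not reflect our experimental setup.

```python
# Sketch: linear vs. non-linear (RBF-kernel) maximum-margin classifiers.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)  # not linearly separable

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)  # kernel trick

print("linear accuracy:", linear_svm.score(X, y))
print("rbf accuracy:", rbf_svm.score(X, y))
print("support vectors per class (rbf):", rbf_svm.n_support_)
```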

In our experiments, we have used the SVM implementation provided at [Chang and Lin, 2006]. We also implement the Series or “two-layer” approach proposed by [Martin et al., 2005-b] as a baseline technique. The Series approach works as follows. In the first layer, SVM is applied as a novelty detector. The parameters of SVM are chosen such that it produces almost zero false positives. This means that if SVM classifies an email as infected, then with (almost) 100% probability it is an infected email. Otherwise, if SVM classifies an email as clean, the email is sent to the second layer for further verification. This is because, with the previously mentioned parameter settings, while SVM reduces the false positive rate, it also increases the false negative rate; so any email classified as clean must be further verified. In the second layer, the NB classifier is applied to confirm whether the suspected emails are really infected. If NB classifies an email as infected, then it is marked as infected; otherwise, it is marked as clean. Figure 7.4 illustrates the Series approach.
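A minimal sketch of this two-layer cascade is given below. It is our own approximation: scikit-learn’s SVC with a class-weight bias is used as one way to push the first layer toward a near-zero false positive rate, and GaussianNB serves as the second layer; it is not necessarily the parameterization used by Martin et al.

```python
# Sketch of the Series (two-layer) approach: SVM first, NB re-checks the rest.
import numpy as np
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

def train_series(X, y):
    # Heavily penalize misclassifying clean (0) emails as infected (1),
    # pushing the SVM toward a very low false positive rate (an assumption).
    svm = SVC(kernel="linear", class_weight={0: 10.0, 1: 1.0}).fit(X, y)
    nb = GaussianNB().fit(X, y)
    return svm, nb

def predict_series(svm, nb, X):
    first = svm.predict(X)
    final = first.copy()
    # Emails the SVM calls clean are verified by NB in the second layer.
    clean_idx = np.where(first == 0)[0]
    if clean_idx.size > 0:
        final[clean_idx] = nb.predict(X[clean_idx])
    return final
```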

7.6 Summary

In this chapter, we have described the design and implementation of the data mining tools for email worm detection. As we have stated, feature-based methods are superior to the signature-based methods for worm detection.

Our approach is based on feature extraction. We reduce the dimension of the features by using PCA and then use classification techniques based on SVM and NB for detecting worms. In Chapter 8, we discuss the experiments we carried out and analyze the results obtained.
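As a hedged sketch of this overall pipeline (PCA for dimensionality reduction followed by a classifier), the following uses scikit-learn components as stand-ins; the number of components and the synthetic data are illustrative only.

```python
# Sketch: reduce feature dimensionality with PCA, then classify with SVM.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 50))                  # 50 raw per-email features
y = (X[:, :5].sum(axis=1) > 0).astype(int)      # synthetic labels

pipeline = make_pipeline(PCA(n_components=10), SVC(kernel="linear"))
pipeline.fit(X, y)
print("training accuracy:", pipeline.score(X, y))
```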

Figure 7.4 Series combination of SVM and NB classifiers for email worm detection. (This figure appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.)

As stated in Chapter 6, as future work we are planning to detect worms by combining the feature-based approach with the content-based approach to make it more robust and efficient. We will also focus on the statistical properties of the contents of the messages for possible contamination by worms. In addition, we will apply other classification techniques and compare the performance and accuracy of the results.

References

[Boser et al., 1992] Boser, B. E., I. M. Guyon, V. N. Vapnik, A Training Algorithm for Optimal Margin Classifiers, in D. Haussler, editor, 5th Annual ACM Workshop on COLT, Pittsburgh, PA, ACM Press, 1992, pp. 144–152.

[Chang and Lin, 2006] Chang, C.-C., and C.-J. Lin, LIBSVM: A Library for Support Vector Machines, http://www.csie.ntu.edu.tw/∼cjlin/libsvm

[John and Langley, 1995] John, G. H., and P. Langley, Estimating Continuous Distributions in Bayesian Classifiers, in Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers, San Mateo, CA, 1995, pp. 338–345.

[Martin et al., 2005-a] Martin, S., A. Sewani, B. Nelson, K. Chen, A. D. Joseph, Analyzing Behavioral Features for Email Classification, in Proceedings of the IEEE Second Conference on Email and Anti-Spam (CEAS 2005), July 21 & 22, Stanford University, CA.

[Martin et al., 2005-b] Martin, S., A. Sewani, B. Nelson, K. Chen, A. D. Joseph, A Two-Layer Approach for Novel Email Worm Detection, submitted to USENIX Steps to Reducing Unwanted Traffic on the Internet (SRUTI).

[Mitchell, 1997] Mitchell, T. Machine Learning, McGraw-Hill, 1997.

[Quinlan, 1993] Quinlan, J. R., C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
