

In the document European Journal of Scientific Research (pages 76-80)

Evaluation of Binary Risk Classification of Heart Diseases for Diabetes Mellitus Patients

3. Literature Survey Using Different Data Mining Techniques

The models are classification models based on Decision Tree induction and Support Vector Machine.

Figure 1: Overview of the Proposed System

A classification model is a mapping of instances to certain classes or groups: it predicts categorical class labels. The model is built from the training set using the values of the classifying attributes and is then applied to classify new data. [18] The goal of classification is to accurately predict the target class for each case in the data.

Typically, cross-validation is used to generate a set of training and validation folds, and the expected error on the validation folds is compared after training on the training folds. Cross-validation thus estimates the generalization error of the trained model. In this case, we divide the dataset into 10 parts and train and test on each part in turn.
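The 10-fold scheme described above can be sketched in plain Python (this is an illustration, not the paper's RapidMiner setup; the helper names are invented):

```python
# Sketch of k-fold cross-validation index generation for a dataset of
# n_samples rows, split into k roughly equal, disjoint folds.

def k_fold_indices(n_samples, k=10):
    """Split the sample indices 0..n_samples-1 into k roughly equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate_splits(n_samples, k=10):
    """Yield (train_indices, test_indices) pairs, one per fold.

    Each fold serves once as the test set while the remaining
    k-1 folds form the training set.
    """
    folds = k_fold_indices(n_samples, k)
    for i, test_idx in enumerate(folds):
        train_idx = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
        yield train_idx, test_idx
```

For each of the 10 splits, a model would be trained on `train_idx` and evaluated on `test_idx`, and the 10 error estimates averaged.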

Let us consider a two-class prediction problem (binary classification), in which the outcomes are labelled either as positive (p) or negative (n). There are four possible outcomes from a binary classifier: True Positive, False Positive, True Negative, and False Negative. Consider an experiment with P positive instances and N negative instances. The four outcomes can be arranged in a 2×2 contingency table, or confusion matrix.
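Counting the four outcomes can be sketched as follows (the label values "High"/"Low" mirror the risk classes used later in the paper; the function name is illustrative):

```python
# Count the four outcomes of a binary classifier by comparing
# actual labels against predicted labels.
def confusion_counts(actual, predicted, positive="High"):
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    tn = sum(1 for a, p in zip(actual, predicted) if a != positive and p != positive)
    return tp, fp, fn, tn
```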

3.1. Decision Tree

A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. The topmost node in a tree is the root node. Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals. [10]

By creating a decision tree, the data can be mined based on past history to determine the likelihood that a person is at risk of heart disease. Possible nodes would be determined from attributes such as age, VLDL, cholesterol, etc. The attributes of a given person can then be run against the decision tree to determine the likelihood of that person having the risk of heart disease.

A decision tree has a natural visual representation and is easy to interpret and explain. Decision trees easily handle feature interactions, and they are non-parametric, so you don't have to worry about outliers or whether the data is linearly separable (e.g., decision trees easily take care of cases where you have class A at the low end of some feature x, class B in the mid-range of feature x, and A again at the high end).

Figure 2: Decision Tree Process Diagram

As shown in Figure 2, the process of classification is simple and easy to visualize: the predicates return discrete values, so the tree can be explained by a series of nested if-then-else statements.
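As a toy illustration of this nested if-then-else view (the thresholds below are invented for the sake of the example and are not taken from the paper's trained tree):

```python
# A tiny decision tree written out as nested if-then-else statements.
# Each `if` is an internal-node test on an attribute; each return is a leaf.
def classify_risk(age, cholesterol, vldl):
    if cholesterol > 240:          # root-node test on cholesterol
        if age > 50:               # internal-node test on age
            return "High"
        return "High" if vldl > 30 else "Low"  # leaf tests on VLDL
    return "Low"
```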

We have tried various split methods for the decision tree, each of which gives a different level of accuracy.

Information Gain is computed as the information before the split minus the information after the split. It works fine for most cases, unless a few variables have a large number of values (or classes); these variables then tend to end up as root nodes. This is not a problem except in extreme cases. For example, each Patient ID is unique, so that variable has too many classes (each ID is a class); a tree split along these lines has no predictive value. [11]

Gain Ratio is usually a good option: it overcomes this problem with information gain by taking into account the number of branches that would result before making the split.

The other important parameter is the "minimal gain" value. Theoretically it can take any value from 0 upwards; in practice, a minimal gain of 0.2-0.3 is considered usable, and the default is 0.1. The information gain ratio biases the decision tree against attributes with a large number of distinct values. We have chosen Information Gain as the split method for the evaluation and analysis.
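The two split criteria can be sketched directly from their definitions (a minimal illustration; the paper's RapidMiner operators compute these internally):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, partitions):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(labels)
    after = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(labels) - after

def gain_ratio(labels, partitions):
    """Information gain normalized by the split's intrinsic information,
    penalizing attributes that fan out into many branches (e.g. an ID)."""
    n = len(labels)
    split_info = -sum(len(p) / n * math.log2(len(p) / n) for p in partitions if p)
    return information_gain(labels, partitions) / split_info if split_info else 0.0
```

Note how a split into many singleton branches (one per unique ID) keeps the same information gain but a larger intrinsic information, so its gain ratio drops.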

Table 3: Accuracy by split method

Split Method        Accuracy    Classification Error
Gain Ratio          88.99%      11.10%
Information Gain    89.20%      10.80%

For classification problems, it is natural to measure a classifier's performance in terms of the error rate. The classifier predicts the class of each instance: if the prediction is correct, it is counted as a success; if not, it is an error. The error rate is the proportion of errors made over a whole set of instances, and it measures the overall performance of the classifier.

After the data set is loaded into the RapidMiner repository, two operations are necessary: Select Attributes, to choose the attributes used for classification, and Set Role, to identify the label variable.

Table 4: Confusion matrix of the decision tree

               True High    True Low    Class precision
Pred. High     238          101         70.21%
Pred. Low      7            654         98.94%
Class recall   97.14%       86.62%
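The per-class precision and recall in Table 4 follow directly from the four raw counts, as this check shows:

```python
# Recompute Table 4's class precision and recall from its raw counts.
tp, fp = 238, 101   # predicted High: actually High / actually Low
fn, tn = 7, 654     # predicted Low:  actually High / actually Low

precision_high = tp / (tp + fp)   # share of "High" predictions that are correct
precision_low  = tn / (tn + fn)   # share of "Low" predictions that are correct
recall_high    = tp / (tp + fn)   # share of actual "High" cases found
recall_low     = tn / (tn + fp)   # share of actual "Low" cases found
```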

A decision tree can be represented as follows: each internal node is a test, each branch corresponds to a test result, and each leaf node assigns a class.

Figure 3: Decision tree diagram

Lift is a measure of the effectiveness of a predictive model at predicting or classifying cases as having an enhanced response, measured against a random-choice targeting model. Lift is simply the ratio of target response divided by average response. [12]
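The ratio in this definition can be sketched as a one-line computation (illustrative numbers, not the paper's data):

```python
# Lift: response rate within a targeted segment divided by the
# baseline response rate over the whole population.
def lift(target_responses, target_size, total_responses, total_size):
    target_rate = target_responses / target_size
    baseline_rate = total_responses / total_size
    return target_rate / baseline_rate
```

For example, if a targeted segment of 100 patients contains 30 responders while the overall rate is 100 responders in 1000 patients, the lift is 3: the model's targeting is three times better than random choice.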

This operator creates a lift chart based on a Pareto plot of the discretized confidence values for the given example set and model.

Figure 4: Lift chart for Decision tree

The lift chart of the decision tree predicts the response for "High"-risk patients based on the given dataset. There were 218 patients who turned out "High" with a good confidence level of 1.

3.2. Support Vector Machine

The Support Vector Machine (SVM) was first introduced in 1992 by Boser, Guyon, and Vapnik at COLT-92. SVMs are a set of related supervised learning methods used for classification and regression.

[24] They belong to a family of generalized linear classifiers. In other words, a Support Vector Machine (SVM) is a classification and regression prediction tool that uses machine learning theory to maximize predictive accuracy while automatically avoiding over-fitting to the data. SVMs can be defined as systems that use a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory.

The major strengths of SVM are that training is relatively easy and that, unlike in neural networks, there is no local optimum. It scales relatively well to high-dimensional data, and the trade-off between classifier complexity and error can be controlled explicitly. Its weaknesses include the need for a good kernel function. [4], [21]

The support vector machine was initially popular within the NIPS community and is now an active part of machine learning research around the world. SVM became famous when, using pixel maps as input, it gave accuracy comparable to sophisticated neural networks with elaborate features on a handwriting recognition task. It is also used for many applications, such as handwriting analysis and face analysis, especially for pattern classification and regression.

LIBSVM is a library for support vector machines (SVMs) that has gained wide popularity in machine learning and many other areas. Its operator is used here to set up the evaluation of the classification; the training vectors are mapped into a higher-dimensional space by the kernel function.

Figure 5: SVM process diagram

We have set the value of C to 5 and γ to 1 for the SVM to operate with the RBF (Radial Basis Function) kernel type, using the type C-SVC, which is the standard regularized classification algorithm.
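The RBF kernel at the heart of this setup can be sketched directly from its formula, K(x, z) = exp(-γ·‖x - z‖²), with γ = 1 as above (a plain-Python illustration of the kernel only, not the paper's RapidMiner/LIBSVM pipeline; an equivalent library call, as an assumption, would be scikit-learn's `SVC(kernel="rbf", C=5, gamma=1)`):

```python
import math

# RBF (Radial Basis Function) kernel: K(x, z) = exp(-gamma * ||x - z||^2).
# This is the similarity measure C-SVC uses to map training vectors
# into a higher-dimensional feature space.
def rbf_kernel(x, z, gamma=1.0):
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)
```

The kernel equals 1 for identical points and decays toward 0 as the points move apart; γ controls how quickly that decay happens, while C (set to 5 here) controls the penalty on misclassified training points.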

Table 5: Confusion matrix of the support vector machine

               True High    True Low    Class precision
Pred. High     36           0           100.00%
Pred. Low      209          755         78.32%
Class recall   14.69%       100.00%

Figure 6: Lift chart for Support vector machine

The lift chart of the SVM predicts the response for "High"-risk patients based on the given dataset. There were 35 patients who turned out "High" with a good confidence level of 0.74.
