
Model Evaluation (Step 5)

7.3 What Does Accuracy Mean?

Every problem domain has a preferred method of measuring the correctness of results.

Because the effects of different kinds of error vary from domain to domain, a variety of accuracy measures are used in practice. In some domains a false positive from a classifier is the most costly mistake, while in other domains a false negative is the most costly mistake. The cost of modeling error is problem dependent, making the best choice of accuracy measure problem dependent.

7.3.1 Confusion Matrix Example

A very general measure of correctness that provides sufficient information to compute many others is the confusion matrix. It is used to evaluate detectors and classifiers: data mining applications that detect/classify vectors into one of a number of ground truth classes.

Let the number of vectors classified be 100, and suppose there are four ground truth classes: 1, 2, 3, and 4, each consisting of 25 vectors. A confusion matrix for the classifier shows performance for each class. In the confusion matrix below (Table 7.1), rows represent ground truth counts, and columns represent the classifier’s decisions.

Table 7.1 Confusion Matrix Example 1

Confusion Matrix    Assigned 1    Assigned 2    Assigned 3    Assigned 4
Class 1                 25             0             0             0
Class 2                  0             1            19             5
Class 3                  0             0             0            25
Class 4                  0             0             1            24

The sum of the entries in row J will be the number of vectors in class J, and the sum of the entries in column K will be the number of vectors the classifier determined were in class K. For example, the last column of the second row indicates that the classifier has (incorrectly) assigned five of the class 2 vectors to class 4. A perfect classifier will produce a confusion matrix that is diagonal: only entries on the diagonal will be non-zero.

Many of the most commonly used performance measures are ratios drawn from the confusion matrix. We begin with four examples that show some of the very different things that “accuracy” can mean.

Classification Accuracy

Classification Accuracy answers the question: What proportion of the entire data set is being correctly classified? This is a single numeric value. It is the number of vectors correctly classified (without respect to class), divided by the total number of vectors. It is computed from the confusion matrix as:

Classification Accuracy = sum of diagonal values / sum of all values

Using the matrix above (Table 7.1), we have the following result:

Classification Accuracy = (25 + 1 + 0 + 24) / 100 = 50%
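A minimal Python sketch of this computation, with the Table 7.1 matrix stored as a list of rows (ground truth) by columns (assigned class):

```python
# Classification accuracy from the Table 7.1 confusion matrix.
# Rows are ground truth classes, columns are the classifier's decisions.
cm = [
    [25,  0,  0,  0],   # class 1
    [ 0,  1, 19,  5],   # class 2
    [ 0,  0,  0, 25],   # class 3
    [ 0,  0,  1, 24],   # class 4
]

diagonal = sum(cm[i][i] for i in range(len(cm)))   # correctly classified vectors
total = sum(sum(row) for row in cm)                # all vectors classified

classification_accuracy = diagonal / total
print(classification_accuracy)                     # 0.5, i.e., 50%
```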

Recall

Recall answers the question: What proportion of the vectors in class J does the classifier decide are in class J? Note that sometimes the term class accuracy is used as a synonym for recall. There will be a recall value for each ground truth class. It is computed from the confusion matrix as:

Recall for Class J = value in row J and column J / sum of entries in row J

For the confusion matrix above (Table 7.1), we have the following class recalls:

Recall for class 1 = 25 / (25 + 0 + 0 + 0) = 100%

Recall for class 2 = 1 / ( 0 + 1 + 19 + 5) = 4%

Recall for class 3 = 0 / ( 0 + 0 + 0 + 25) = 0%

Recall for class 4 = 24 / ( 0 + 0 + 1 + 24) = 96%
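A minimal Python sketch of the per-class recall computation, using the same Table 7.1 matrix (each diagonal entry divided by its row sum):

```python
# Recall for each class: diagonal entry divided by the row sum (Table 7.1).
cm = [
    [25,  0,  0,  0],
    [ 0,  1, 19,  5],
    [ 0,  0,  0, 25],
    [ 0,  0,  1, 24],
]

for j, row in enumerate(cm, start=1):
    recall = row[j - 1] / sum(row)
    print(f"Recall for class {j}: {recall:.0%}")   # 100%, 4%, 0%, 96%
```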

Precision

Precision answers the question: When the machine says a vector is in class K, how likely is it to be correct? As with recall, there will be a precision value for each class. Precision measures the proportion of the vectors the machine classified as class K that actually are class K. It is computed from the confusion matrix as:

Precision for Class K = value in row K, column K / sum of entries in column K

For the matrix above (Table 7.1), we have the following precisions:

Precision for class 1 = 25 / (25 + 0 + 0 + 0) = 100%

Precision for class 2 = 1 / ( 0 + 1 + 0 + 0) = 100%

Precision for class 3 = 0 / ( 0 + 19 + 0 + 1) = 0%

Precision for class 4 = 24 / ( 0 + 5 + 25 + 24) = 44%
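A minimal Python sketch of the per-class precision computation (each diagonal entry divided by its column sum); the zero-column guard is an assumption for classes the classifier never predicts:

```python
# Precision for each class: diagonal entry divided by the column sum (Table 7.1).
cm = [
    [25,  0,  0,  0],
    [ 0,  1, 19,  5],
    [ 0,  0,  0, 25],
    [ 0,  0,  1, 24],
]

for k in range(len(cm)):
    column_sum = sum(cm[i][k] for i in range(len(cm)))
    precision = cm[k][k] / column_sum if column_sum else 0.0
    print(f"Precision for class {k + 1}: {precision:.0%}")   # 100%, 100%, 0%, 44%
```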

Geometric Accuracy

Geometric accuracy (not a standard industry term) answers the question: How accurate is the classifier when class imbalance is taken into account? This is a single numeric value represented by the geometric mean of the class precisions. Suppose that there are N classes. Then the geometric accuracy is computed as:

Geometric Accuracy = Nth root of the product of the class precisions

For the matrix above (Table 7.1), we have the following result:

Geometric Accuracy = fourth root of (1.0 * 1.0 * 0.0 * 0.44) = 0%

The value of geometric accuracy is non-zero only if the classifier has some level of success in every class. Using the geometric accuracy in the objective function for a classifier will prevent it from increasing its accuracy by ignoring small classes (which are often the classes of greatest interest). Using geometric accuracy as part of the objective function makes it possible to train on sets that are not balanced by class.
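A minimal Python sketch of the geometric accuracy computation, using the class precisions computed above:

```python
# Geometric accuracy: the Nth root of the product of the N class precisions.
import math

precisions = [1.00, 1.00, 0.00, 0.44]   # class precisions from Table 7.1

geometric_accuracy = math.prod(precisions) ** (1 / len(precisions))
print(geometric_accuracy)   # 0.0: any class with zero precision forces a zero
```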

How are Precision, Recall, and Overall Accuracies Related?

If either classification accuracy or geometric accuracy is 100%, the other will be 100% as well. The same is true for class recall and precision accuracies. However, in general the following is true:

• For a particular class, recall can be high and precision low (see class 4 above).

• For a particular class, recall can be low and precision high (see class 2 above).

• Having a class recall of 100%, in itself, doesn’t mean much.

• Having a class precision of 100%, in itself, doesn’t mean much.

• Classification accuracy can be greater than, equal to, or less than the geometric accuracy.

Other Accuracy Metrics

There are accuracy measures used in clinical research that correspond to precision and recall, but different terms are used to denote them. These terms map to their corresponding counterparts in data mining in a natural way.

Assume that a binary condition is either present or absent. Presence of the condition will result in a positive value, and absence of the condition will result in a negative value. Consider the confusion matrix in Table 7.2:

Table 7.2 Confusion Matrix Example 2

Confusion Matrix     Positive Test                      Negative Test
Condition Present    TP = number of true positives      FN = number of false negatives
Condition Absent     FP = number of false positives     TN = number of true negatives

Warning: some authors (notably in clinical work) use the transpose of the matrix depicted here, which interchanges the rows and columns. They would have the confusion matrix in Table 7.3:

Table 7.3 Confusion Matrix Example 3

Confusion Matrix     Condition Present                  Condition Absent
Positive Test        TP = number of true positives      FP = number of false positives
Negative Test        FN = number of false negatives     TN = number of true negatives

When you are interpreting a confusion matrix in the literature, make certain you understand how the authors have chosen to arrange rows and columns.

Legend:

TP = number of condition present instances for which application says condition present
FN = number of condition present instances for which application says condition absent
FP = number of condition absent instances for which application says condition present
TN = number of condition absent instances for which application says condition absent

Important: False positives, given by FP, are referred to as Type I errors. False negatives, given by FN, are referred to as Type II errors. In classification and detection problems, these are linked in the sense that reducing one of them generally increases the other.

Which type is most harmful is domain dependent. Table 7.4 provides definitions of some metric terms:

Table 7.4 Definitions of Metric Terms

Term                        Definition                                                           Formula
Sensitivity                 Proportion of persons with condition who test positive               TP/(TP+FN)
Specificity                 Proportion of persons without condition who test negative            TN/(FP+TN)
Positive Predictive Power   Proportion of persons with positive test who have condition          TP/(TP+FP)
Negative Predictive Power   Proportion of persons with negative test who do not have condition   TN/(FN+TN)
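A minimal Python sketch of these four ratios; the TP/FN/FP/TN counts below are illustrative values, not taken from the text:

```python
# Illustrative 2-by-2 counts laid out as in Table 7.2 (made-up values).
TP, FN = 40, 10   # condition present: true positives, false negatives
FP, TN = 5, 45    # condition absent: false positives, true negatives

sensitivity = TP / (TP + FN)                 # recall / probability of detection
specificity = TN / (FP + TN)
positive_predictive_power = TP / (TP + FP)   # precision
negative_predictive_power = TN / (FN + TN)

print(sensitivity, specificity, positive_predictive_power, negative_predictive_power)
# approximately 0.80 0.90 0.89 0.82
```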

7.3.2 Other Metrics Derived from the Confusion Matrix

Precision. The information retrieval term for Positive Predictive Power. Precision is the term usually used in data mining.

Recall. The information retrieval term for sensitivity. Recall is the term usually used in data mining. Also referred to as PD, or probability of detection.

False Positive Rate. The probability that the application indicates condition present when it is actually absent. It is equal to 1 – specificity. Also denoted by α (the significance level) and by Probability of False Alarm (PFA).

False Negative Rate. The probability that the application indicates condition absent when it is actually present. It is the probability of a missed detection, is equal to 1 – sensitivity, and is also denoted by β.

Power of the Test. 1 – β = 1 – False Negative Rate.

Likelihood Ratio (positive). Sensitivity / (1 – Specificity)

Likelihood Ratio (negative). (1 – Sensitivity) / Specificity

F-Measure. The harmonic mean of precision and recall. It is a single number that captures information about both of these complementary measures. It is given by:

F = 2(Precision)(Recall)/(Precision + Recall)
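A minimal Python sketch of the F-measure; the precision and recall values plugged in are the class 4 figures from Table 7.1:

```python
# F-measure as the harmonic mean of precision and recall.
precision, recall = 0.44, 0.96   # class 4 values from Table 7.1

f_measure = 2 * precision * recall / (precision + recall)
print(round(f_measure, 2))   # about 0.60
```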

Equal Error Rate (EER). The error rate when decision thresholds are set so that the false positive and false negative rates are equal.

Receiver Operating Characteristic (ROC) curve. The ROC is not derived from the confusion matrix. ROC curve refers to a plot of the number of false positives vs. the number of true positives as the decision threshold is varied from low to high. It was developed during the Second World War to calibrate early RADAR warning systems so they could be adjusted to favor one type of error over another. For example, a RADAR may be set to have a high probability of detection; but this also causes it to have a high PFA and low Specificity (lots of false alarms). The ROC curve shows how these two error rates are related.
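A minimal Python sketch of how an ROC curve is traced: a decision threshold is swept over detector scores, and the true and false positive counts are tallied at each setting. The scores and labels are illustrative values, not from the text:

```python
# Sweep a decision threshold over illustrative detector scores and count
# true positives and false positives at each setting (1 = condition present).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   1,    0,   0,   1,   0,   0]

for threshold in sorted(set(scores), reverse=True):
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    print(f"threshold {threshold:.2f}: TP={tp} FP={fp}")
```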

If the decision being made has more than two possible outcomes, each of these metrics can be computed from the confusion matrix for each outcome versus all others. In this case, the confusion matrix will have a number of rows and columns equal to the number of ground truth classes, and the same ratio definitions apply as in the 2-by-2 case.
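A minimal Python sketch of this one-versus-all reduction, producing a TP/FN/FP/TN breakdown for each class of the Table 7.1 matrix:

```python
# One-versus-all reduction of the multiclass confusion matrix in Table 7.1:
# each class in turn is treated as "condition present" and all others as absent.
cm = [
    [25,  0,  0,  0],
    [ 0,  1, 19,  5],
    [ 0,  0,  0, 25],
    [ 0,  0,  1, 24],
]
total = sum(sum(row) for row in cm)

for c in range(len(cm)):
    TP = cm[c][c]
    FN = sum(cm[c]) - TP                              # rest of row c
    FP = sum(cm[i][c] for i in range(len(cm))) - TP   # rest of column c
    TN = total - TP - FN - FP
    print(f"class {c + 1}: TP={TP} FN={FN} FP={FP} TN={TN}")
```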

7.3.3 Model Evaluation Case Study: Addressing Queuing Problems by Simulation

Can data mining techniques be used when there is no data? Analysis of systems for which no data are available can sometimes be carried out by using simulators. For this case study, we describe the analysis of a meter reading system for residential customers that was conducted before it was built.

A large municipality had a difficult operational problem they wanted to solve using automation:

1. The manual collection of readings from 50,000+ residential water meters every month required a standing army of meter readers and a fleet of vehicles.

2. Manual collection of readings was expensive, slow, and prone to error.

The customer’s question: Could existing communication infrastructure be used to automate the collection of meter data? Supervisory Control and Data Acquisition (SCADA) technology existed to allow the meters themselves to autonomously call in their readings to a collection center and be assigned next month’s call-in time. This was done over customers’ existing phone lines, without interfering with their telephone service. But this theoretical solution was untested. Critical questions about scheduling and loading had to be answered to ensure that this was also a practical solution. A simulation was constructed to address the fundamental questions:

1. Could over 50,000 autonomous meters be remotely scheduled for call-in so readings could be collected on schedule for a monthly billing cycle?

2. Could a call-in schedule be constructed to minimize collisions? A collision occurs when call-in attempts exceed the available phone lines at the collection center. When this happens, some of the meters get busy signals and have to call back later.

This could be a system-killing problem, because the call-in hardware on the meters was battery powered. If meters had to call in twice on average each month instead of once, 50,000 meter batteries would have to be replaced twice as often as planned.

3. How many telephone lines would be required to support optimized call-in, and how should they be managed?

The problem doesn’t sound too difficult yet, until it is pointed out that the internal clock on a 50-cent SCADA device is not going to be very accurate: after a 30-day wait for its next call-in, these clocks could be off by several minutes. Further, these variations were not consistent for a given clock, since they were affected by temperature and other factors. The scheduling problem actually came down to trying to remotely synchronize the behaviors of 50,000 randomly drifting clocks that cannot communicate with each other.

A simulator was constructed allowing the data miner to specify the number of meters, number of telephone lines into the collection center, time requirements, and scheduling methodologies. The simulator would then create and execute a dynamic queuing model which ran through the entire call-in schedule, maintaining relevant statistics for each scheduling model (service counts, collisions, number of redials, maximum queue depths, phone-line loading efficiency, etc.). The performance data for each simulation run were collected and analyzed (Figure 7.1).

Using this simulation, the data miner was able to game a collection of metrics across a wide spectrum of possible system configurations and scheduling strategies. The metrics collected included the number of meters being processed, the number of incoming phone lines in the collection center, the time required to service and reschedule an installed unit, the time required to service connection to a new unit, the maximum number of redials for any single unit, and the percentage utilization of the input phone lines. A Monte Carlo approach was utilized, in which system parameters were selected pseudo-randomly.
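A highly simplified Python sketch of this kind of Monte Carlo call-in simulation; the parameter values and the uniform clock-drift model are assumptions for illustration, and the real simulator tracked far more statistics (queue depths, redial counts, line utilization, and so on):

```python
import random

def simulate_callins(num_meters=50_000, num_lines=60, drift_seconds=300,
                     slot_seconds=30, trials=10):
    """Schedule num_lines meters per call-in slot, add random clock drift, and
    count how many call-in attempts exceed the available lines (collisions)."""
    collision_counts = []
    for _ in range(trials):
        attempts = {}   # slot index -> number of call-in attempts in that slot
        for meter in range(num_meters):
            scheduled_time = (meter // num_lines) * slot_seconds
            drift = random.uniform(-drift_seconds, drift_seconds)
            slot = int((scheduled_time + drift) // slot_seconds)
            attempts[slot] = attempts.get(slot, 0) + 1
        collisions = sum(max(0, n - num_lines) for n in attempts.values())
        collision_counts.append(collisions)
    return sum(collision_counts) / trials

print(simulate_callins())   # average collisions per monthly cycle (illustrative)
```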

From these analyses, the data miner was able to specify the smallest number of telephone lines in the collection center to handle meter call-in, and a scheduling model that minimized meter redial.

7.3.4 Model Evaluation Checklist
