
Calculating Evaluation Scores

Much like IAA scoring, there are a few different methods for evaluating your algorithm’s performance. In this section we’ll go through how to calculate the most common evaluation metrics, specifically percentage accuracy, precision and recall, and F-measures, and how to apply these to different types of annotation tasks.

Percentage accuracy

The first evaluation metric that most people think of for an algorithm is to grade its output the way you would grade an exam: simply look at the generated annotations, mark which ones are right and which are wrong, and figure out what percentage of the total annotations are correct.

If you have a confusion matrix, the percentage accuracy is the sum of the diagonal (top left to bottom right) of the table divided by the sum of all the cells of the table. So in the movie review table from the previous section, we have a total accuracy of:

(96 + 87 + 100) / 300 = 283 / 300 = .943, or 94.3%
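If your confusion matrix is stored as a nested list, this calculation takes only a few lines of Python. The following is a minimal sketch (the variable names and matrix layout are ours, not part of any particular library):

    # Confusion matrix for the movie review example: rows are the gold
    # standard labels, columns are the labels assigned by the classifier.
    labels = ["positive", "neutral", "negative"]
    confusion = [
        [96, 4, 0],    # gold: positive
        [13, 87, 0],   # gold: neutral
        [0, 0, 100],   # gold: negative
    ]

    # Percentage accuracy: sum of the diagonal divided by the sum of all cells.
    correct = sum(confusion[i][i] for i in range(len(labels)))
    total = sum(sum(row) for row in confusion)
    print(correct / total)  # 0.943...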

Unfortunately, while accuracy is easy to calculate, it can only give us a general idea of how well an algorithm performed at a task: it can’t show specifically where the task went wrong, or what aspects of the features need to be fixed. There are other ways to obtain and assign scores to those attributes of a machine learning (ML) task, however, which we will examine next.

Precision and recall

The biggest problem with calculating overall accuracy is that the resultant score doesn’t address the two different kinds of mistakes that need to be examined for evaluating an annotation task: places where the algorithm put the wrong tag on an item, and places where it failed to apply the right tag. These values are calculated by examining the precision and recall for each tag/label in a task, and they help form a clearer picture of what aspects of an algorithm need to be adjusted.

Precision and recall are traditionally associated with Information Retrieval (IR) tasks such as those related to search algorithms. Let’s say there was a database of scientific papers, and you wanted the computer to return all of the articles related to string theory.

If there are 50 relevant documents in the database and the search returns 75, then it’s clear that the search has falsely identified some documents as being related to string theory when they actually weren’t. When an item is given a label that it should not have, that is called a false positive. On the other hand, if the same database were queried for “string theory” and only 25 documents were returned, there would be multiple cases of false negatives, where items were rejected for a label that they should actually have been given. Documents that are returned accurately are called true positives, and documents that are correctly ignored when they are irrelevant are true negatives.

By applying the idea of false negative and false positive to the results of an ML algorithm, we can compute useful evaluation metrics that will help us to more accurately identify sources of error in the system. However, unlike overall accuracy calculations where all the tags are evaluated together, false positives and false negatives have to be evaluated one tag at a time. If we think about it, this makes sense, particularly in annotation tasks where only one tag (or no tag) is being applied to an item, as a false positive for one tag is a false negative for another.

So how do we calculate the number of false positives and false negatives? With a confusion matrix, it’s quite easy. If we look at the matrix for movie review labels again, we see that to evaluate the accuracy of the “positive” label we need to look at the first column, which shows the number of documents the classifier labeled as “positive,” and the first row, which shows how many documents in the gold standard were given the “positive” label:

                          test positive   test neutral   test negative
gold standard positive         96               4               0
gold standard neutral          13              87               0
gold standard negative          0               0             100

In all confusion matrices, the true positives are located at the intersection of the row and column we are examining. In this case, the number of true positives is 96. False positives are calculated by summing up the column “test→positive” (minus the number of true positives): here that number is 13 (13 + 0). Similarly, false negatives are calculated by summing across the row for “gold standard→positive” (again, without counting the true positives), which is 4 (4 + 0). (The number of true negatives is the sum of the rest of the table [187], but we won’t be needing that here.)
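If you keep the confusion matrix in code, this bookkeeping is just a matter of summing a row and a column. Here is a small sketch that builds on the hypothetical labels and confusion lists from the accuracy example above:

    def tag_counts(confusion, labels, tag):
        """Return (true positives, false positives, false negatives) for one tag."""
        i = labels.index(tag)
        tp = confusion[i][i]                                         # diagonal cell
        fp = sum(confusion[j][i] for j in range(len(labels))) - tp   # rest of the column
        fn = sum(confusion[i]) - tp                                  # rest of the row
        return tp, fp, fn

    print(tag_counts(confusion, labels, "positive"))  # (96, 13, 4)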

With the confusion matrix, calculating the true and false positives and negatives is simple enough, but what do we do with them now? The answer is that we calculate precision and recall values, which provide more nuanced information about how our algorithm performed on the test.

Precision is the measure of how many items were accurately identified, and is defined as:

p = true positive / (true positive + false positive)

For the “positive” tag:

p = 96 / (96 + 13) = 96 / 109 = .881

Recall is the measure of how many relevant items were identified (in other words, how many of the documents that should have been labeled “positive” were actually given that label):

r = true positive / (true positive + false negative)

For the “positive” tag:

r = 96 / (96 + 4) = 96 / 100 = .96
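In code, the two formulas differ only in their denominators. This sketch reuses the hypothetical tag_counts() helper shown earlier:

    def precision_score(tp, fp):
        # Of everything labeled with this tag, how much was labeled correctly?
        return tp / (tp + fp)

    def recall_score(tp, fn):
        # Of everything that should have this tag, how much was found?
        return tp / (tp + fn)

    tp, fp, fn = tag_counts(confusion, labels, "positive")
    print(round(precision_score(tp, fp), 3))  # 0.881
    print(round(recall_score(tp, fn), 3))     # 0.96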

We can see from these numbers that for the “positive” label our algorithm has high recall, which means it found most of the documents it should have, but lower precision, which means it’s giving too many documents the “positive” label when they should be labeled as something else.

If we perform these calculations with the rest of the matrix, we can make a table that looks like this:

tag         precision   recall
positive      .881        .96
neutral       .95         .87
negative      1           1

In this case, the precision and recall numbers for positive and neutral very nearly mirror each other, because the “negative” tag isn’t influencing the other two: every document the classifier wrongly labels “positive” is a false negative for “neutral,” and vice versa. In tables with more variation, such reciprocity is not the norm.

While it’s fairly standard to report the precision and recall numbers for each tag (and tag attribute), papers about ML algorithms often mention another number, called the F-measure, which we will discuss in the next section.

You can also have NLTK generate accuracy, precision, recall, F-measure, and a number of other analysis metrics for you using the nltk.metrics package. Like the confusion matrix, the accuracy metric uses lists to calculate the numbers, while precision, recall, and F-measure use sets of data instead. A good overview of how to create these sets is available at http://bit.ly/QViUTB.
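As a rough sketch of what that looks like (the toy gold and predicted lists are invented for illustration, and NLTK is assumed to be installed):

    import collections
    from nltk.metrics import accuracy, precision, recall, f_measure

    # Parallel lists: gold[i] is the gold standard tag for item i,
    # predicted[i] is the tag the classifier assigned to the same item.
    gold = ["positive", "positive", "neutral", "negative", "neutral"]
    predicted = ["positive", "neutral", "neutral", "negative", "positive"]

    print(accuracy(gold, predicted))  # accuracy works directly on the lists

    # precision, recall, and f_measure expect sets of item IDs per label.
    gold_sets = collections.defaultdict(set)
    test_sets = collections.defaultdict(set)
    for i, (g, p) in enumerate(zip(gold, predicted)):
        gold_sets[g].add(i)
        test_sets[p].add(i)

    for tag in ("positive", "neutral", "negative"):
        print(tag,
              precision(gold_sets[tag], test_sets[tag]),
              recall(gold_sets[tag], test_sets[tag]),
              f_measure(gold_sets[tag], test_sets[tag]))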

F-measure

The F-measure (also called the F-score or the F1 score) is an accuracy measure calculated by finding the harmonic mean of the precision and recall of an algorithm’s performance over a tag. The formula for F-measure is as follows (where p is precision and r is recall):

F = 2 × ((p × r) / (p + r))

We see that the F-measure for the “positive” tag is 2 × (.881 × .96) / (.881 + .96) = 1.692 / 1.841 = .919.
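The same harmonic mean is a one-line function in code; here is a small sketch that continues from the hypothetical precision and recall helpers above:

    def f_score(p, r):
        # Harmonic mean of precision and recall; defined as 0.0 when both are 0.
        return 2 * (p * r) / (p + r) if (p + r) else 0.0

    print(round(f_score(0.881, 0.96), 3))  # 0.919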

While the F-measure won’t tell you specifically where your algorithm is producing incorrect results, it does provide a handy way for someone to tell at a glance how your algorithm is performing in general. This is why the F-measure is commonly reported in computational linguistics papers: not because it’s inherently useful to people trying to find ways to improve their algorithm, but because it’s easier on the reader than processing an entire table of precision and recall scores for each tag used.

You may be wondering why we don’t use kappa coefficients for evaluating algorithms. The equations for kappa coefficients are designed to take into account that annotators always bring in some level of random chance, and also that when IAA scores are being calculated, the right answer isn’t yet known. While gold standard corpora can certainly have errors in them, they are considered correct for the purpose of training and testing algorithms, and therefore are a mark to be evaluated against, like the answer key on an exam.

Notice that this F-measure of .919 is not the same as the overall accuracy from “Percentage accuracy” (page 172), because this is the accuracy measure for only one tag. Again, we can create a table of all the F-scores as shown here:

tag         precision   recall   F-measure
positive      .881        .96       .919
neutral       .95         .87       .908
negative      1           1         1
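Pulling the pieces together, a short loop over the tags rebuilds this whole table from the confusion matrix (again using the hypothetical helpers sketched earlier; the last digit can differ slightly from the table above because the table rounds precision and recall before computing F):

    print("tag         precision   recall   F-measure")
    for tag in labels:
        tp, fp, fn = tag_counts(confusion, labels, tag)
        p = precision_score(tp, fp)
        r = recall_score(tp, fn)
        print(tag, round(p, 3), round(r, 3), round(f_score(p, r), 3))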

F-measures, precision and recall, and accuracy are all commonly used metrics for evaluating how an algorithm is performing during development, and are also often discussed in articles and at conferences to report on how well an algorithm does on an annotation task. We will discuss how to interpret these scores in the next section.

Other evaluation metrics

There are many other ways to analyze the various features, distributions, and problems with your dataset, algorithms, and feature selection. Here’s an overview of some other methods that you can use to check out how your system is performing:

T-test

Determines whether the means of two distributions are different in a statistically significant way. Useful for comparing the results of two different algorithms over your training and test sets; see the sketch after this list.

Analysis of variance (ANOVA) test

Like a t-test, but allows multiple distributions to be compared at the same time.

Χ-squared (chi-squared) test

Can be used to test if two variables are independent. Useful for determining if an attribute of a dataset (or a feature in the model) is contributing to part of the data being mislabeled.

Receiver Operating Characteristic (ROC) curves

Used to compare true positive and false positive rates between classifiers or classifier thresholds.
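For example, comparing two classifiers’ accuracy across the same cross-validation folds with a paired t-test might look like the following sketch (the per-fold scores are invented, and SciPy is assumed to be available):

    from scipy import stats

    # Hypothetical per-fold accuracy for two classifiers evaluated on the
    # same ten cross-validation folds.
    algorithm_a = [0.94, 0.91, 0.95, 0.93, 0.92, 0.96, 0.90, 0.94, 0.93, 0.95]
    algorithm_b = [0.92, 0.90, 0.93, 0.91, 0.92, 0.93, 0.89, 0.92, 0.90, 0.93]

    # Paired t-test, since both algorithms are scored on the same folds.
    t_stat, p_value = stats.ttest_rel(algorithm_a, algorithm_b)
    print(t_stat, p_value)
    # A small p-value (commonly < 0.05) suggests the difference in mean
    # accuracy is unlikely to be due to chance alone.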

Also, don’t forget about the methods we showed you in Chapter 3. Some of those tests can also be applied here.

For more information on how to apply these tests to your corpus and algorithm, please check a statistics textbook. If you’re interested specifically in how statistics can be done in Python, we suggest looking at Think Stats: Probability and Statistics for Programmers, Allen B. Downey (O’Reilly, 2011).