
Creating the Gold Standard (Adjudication)

Once you’ve created a set of annotation guidelines that is getting you IAA scores that you are satisfied with and you’ve had your annotators apply those guidelines to your entire corpus, it’s time to actually adjudicate their annotations and create your gold standard dataset, which is what you will use to train and test your ML algorithm. Generally it’s best to have adjudicators who were involved in creating the annotation guidelines, as they will have the best understanding of the purpose of the annotation. Hiring new adjudicators means that you’ll face the same training problem you did with new annotators.

Since you’re already familiar with the annotation task, you should find the adjudication process to be fairly straightforward. You’ll need software to perform the adjudication in (see Appendix B again for a list of what’s available), and after that, it’s just a matter of taking the time to do a careful job. There are a few things to think about, however:

• Doing a careful job of adjudicating your corpus will take you at least as long as it took one of your annotators to annotate it, and possibly longer. Don’t forget to allocate sufficient time to get this done.

• Consider breaking up the adjudication task into layers: first adjudicate each extent tag or label individually, then each link tag. This will make it much easier to pay attention to how each tag is being used, and will make link tags much more accurate (because they’ll be connecting accurate extents).

• Just because two (or more) of your annotators agree about the presence or attributes of a tag at a location doesn’t mean they’re right! Remember that annotators can be in agreement purely by chance (which is why we spent so much time calculating kappa scores). So don’t take for granted that annotators agreeing means they’re right, at least until you have a good sense of your annotators’ abilities.

• If you do use more than one adjudicator, consider giving them all some of the same documents so that you can also calculate IAA scores for them; that way, you’ll be able to make sure they’re all on the same page (a minimal kappa sketch follows this list).
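As a concrete illustration of both points above (chance agreement and checking IAA between adjudicators), here is a minimal Python sketch of Cohen’s kappa computed from scratch. The label set and the two adjudicators’ label sequences are invented for the example; in practice you would read them out of whatever adjudication tool you are using.

    from collections import Counter

    def cohen_kappa(labels_a, labels_b):
        # kappa = (p_o - p_e) / (1 - p_e): observed agreement corrected for
        # the agreement expected by chance, given each person's label distribution.
        assert len(labels_a) == len(labels_b)
        n = len(labels_a)

        # Observed agreement: fraction of items given the same label by both.
        p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

        # Chance agreement: for each label, the product of the two marginal
        # label probabilities, summed over all labels.
        counts_a, counts_b = Counter(labels_a), Counter(labels_b)
        p_e = sum((counts_a[label] / n) * (counts_b[label] / n)
                  for label in set(labels_a) | set(labels_b))

        return (p_o - p_e) / (1 - p_e)

    # Hypothetical document-level labels from two adjudicators on shared files.
    adj_1 = ["POS", "POS", "NEG", "POS", "POS", "NEU", "POS", "POS"]
    adj_2 = ["POS", "POS", "NEG", "NEG", "POS", "POS", "POS", "POS"]

    raw = sum(a == b for a, b in zip(adj_1, adj_2)) / len(adj_1)
    print("raw agreement: %.2f" % raw)                        # 0.75
    print("Cohen's kappa: %.2f" % cohen_kappa(adj_1, adj_2))  # roughly 0.38

Even though the two adjudicators agree on 75 percent of these made-up documents, kappa comes out around 0.38, because a single frequent label accounts for most of the raw agreement.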

Once you have all your files adjudicated, you’ll be ready to move on to the ML parts of the MATTER cycle.

Summary

In this chapter we discussed how to apply your model and spec to your corpus through the process of annotation, and how to create a gold standard corpus that can be used to train and test your ML algorithms. Some of the important points are summarized here:

• The “A” in the MATTER cycle is composed of a lot of different parts, including creating annotation guidelines, finding annotators, choosing an annotation tool, training annotators, checking for IAA scores, revising guidelines, and finally, adjudicating. Don’t be put off by the number of steps outlined in this chapter; just take your time and take good notes.

• Guidelines and specifications are related, but they are not the same thing. The guidelines will determine how the model is applied to the text—even if you are using the same model, if the guidelines are different, they can result in a very different set of annotations.

• Creating a good set of annotation guidelines and an accurate and useful annotation won’t happen on the first try. You will need to revise your guidelines and retrain your annotators, probably more than once. That’s fine; just remember to allow yourself time when planning your annotation task. Within the MATTER cycle is the MAMA cycle, and no task is perfect straight off the bat.

• One of the things you will have to consider about the annotation process is what information you present to your annotators. Giving them preprocessed data could ease the annotation process, but it could also bias them, so weigh that trade-off when deciding what to show them.

• Because you’ll need to go through the MAMA cycle multiple times, it’s a good idea to set aside a portion of your corpus on which to test your annotation guidelines while you work out the kinks. This set can be used in your gold standard later, but shouldn’t be given to your annotators right away once the guidelines are finalized.

• When you’re writing your annotation guidelines, there are a few questions that you’ll find it necessary to answer for your annotators in the document. But the most important thing is to keep your guidelines clear and to the point, and provide plenty of examples for your annotators to follow.

• When finding annotators for a task, you need to consider what type of knowledge they will need to complete your annotation task accurately (and, if possible, quickly), what language they should speak natively, and how much time you have to annotate. The last consideration may play a role in how many annotators you need to hire to complete your task on schedule.

• The annotation software that you give to your annotators to create the annotations will have an effect on how easily and accurately the annotations are created, so keep that in mind when choosing what tool you will use. Using more than one piece of software for the same task could cause confusion and irregularities in the annotation, so it’s better to pick one and stick with it.

• Once your annotators have annotated your sample set of texts, it’s time to evaluate the IAA scores. While there are many different ways to determine agreement, the two most common in computational linguistics are Cohen’s Kappa and Fleiss’s Kappa. Cohen’s Kappa is used if you have only two annotators annotating a document, while Fleiss’s Kappa is used for situations where more than two annotators are used on a dataset (a minimal Fleiss’s Kappa sketch appears at the end of this summary).

• Based on how good your agreement scores are, you can decide whether or not your task is ready to go past the test set and on to the full corpus. You will probably need to revise your task at least once, so don’t be discouraged by low IAA scores.

• Interpreting IAA scores isn’t an exact science—a number of factors can influence whether a score indicates that an annotation task is well defined, including the number of tags, the subjectivity of the annotation task, and the number of annotators.

• Additionally, the items being annotated can have an effect on how you calculate IAA. While it’s easy to calculate agreement when applying a single label to an entire document, there is some debate about how IAA scores should be calculated when applying tags to text extents. Regardless of what method you decide to apply for calculating IAA scores, keep track of the decisions you make so that other people can understand how you came up with your numbers.

• Having high IAA scores means your task is likely to be reproducible, which is helpful when creating a sufficiently large corpus. However, just because a task is reproducible doesn’t necessarily mean it will be suitable for feeding to ML algorithms. Similarly, just because a task doesn’t have great agreement scores doesn’t mean it will not be good for ML tasks. Still, a reproducible task will be easier to create a large corpus for, and the bigger your corpus, the more likely you are to get good ML results, so putting some effort into creating your annotation guidelines will pay off in the end.

• Once you’ve reached acceptable IAA scores on your annotation test set, you can set your annotators loose on the full corpus. When you have the full corpus, it’s time to adjudicate the annotations and create the gold standard corpus that you will use to train your ML algorithms.

• Adjudication is best performed by people who helped create the annotation guidelines. Bringing in new people to perform the adjudication can cause more confusion and noise in your dataset.

• Calculating IAA agreement scores between adjudicators can be a good way to ensure that your adjudicated corpus is consistent. The more consistent your corpus is, the more accurate your ML results will be.
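To make the Cohen’s/Fleiss’s distinction above concrete, the sketch below shows one way to compute Fleiss’s Kappa in Python when more than two annotators label the same items. The category counts are invented for the example; in a real project you would build the matrix from your annotators’ output, and the same matrix-building step works whether the “items” are whole documents or individual tokens, the latter being one common (though not the only) way of handling extent annotation.

    def fleiss_kappa(ratings):
        # ratings is an N x k matrix of counts: ratings[i][j] is the number of
        # annotators who assigned item i to category j. Every row must sum to
        # the same total number of annotators.
        n_items = len(ratings)
        n_raters = sum(ratings[0])

        # p_j: overall proportion of all assignments that went to category j.
        p = [sum(row[j] for row in ratings) / (n_items * n_raters)
             for j in range(len(ratings[0]))]

        # P_i: how much the raters agree with one another on item i.
        P = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
             for row in ratings]

        P_bar = sum(P) / n_items            # mean observed agreement
        P_e = sum(pj * pj for pj in p)      # agreement expected by chance

        return (P_bar - P_e) / (1 - P_e)

    # Hypothetical counts: 5 items, 4 annotators, 3 categories (say POS/NEG/NEU).
    ratings = [
        [4, 0, 0],
        [3, 1, 0],
        [2, 2, 0],
        [0, 4, 0],
        [1, 1, 2],
    ]
    print("Fleiss's kappa: %.2f" % fleiss_kappa(ratings))  # roughly 0.31

As with the Cohen’s Kappa sketch earlier in the chapter, treat this as an illustration of the formula rather than a drop-in evaluation script.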
