• Aucun résultat trouvé

Model the Phenomenon

The first step in the MATTER development cycle is “Model the Phenomenon.” The steps involved in modeling, however, vary greatly, depending on the nature of the task you have defined for yourself. In this section, we will look at what modeling entails and how you know when you have an adequate first approximation of a model for your task.

The parameters associated with creating a model are quite diverse, and it is difficult to get different communities to agree on just what a model is. In this section we will be pragmatic and discuss a number of approaches to modeling and show how they provide the basis from which to created annotated datasets. Briefly, a model is a characterization of a certain phenomenon in terms that are more abstract than the elements in the domain being modeled. For the following discussion, we will define a model as consisting of a vocabulary of terms, T, the relations between these terms, R, and their interpretation, I. So, a model, M, can be seen as a triple, M = <T,R,I>. To better understand this notion of a model, let us consider the scenarios introduced earlier. For spam detection, we can treat it as a binary text classification task, requiring the simplest model with the cate­

gories (terms) spam and not-spam associated with the entire email document. Hence, our model is simply:

• T = {Document_type, Spam, Not-Spam}

• R = {Document_type ::= Spam | Not-Spam}

• I = {Spam = “something we don’t want!”, Not-Spam = “something we do want!"}

The document itself is labeled as being a member of one of these categories. This is called document annotation and is the simplest (and most coarse-grained) annotation possible. Now, when we say that the model contains only the label names for the cate­

gories (e.g., sports, finance, news, editorials, fashion, etc.), this means there is no other annotation involved. This does not mean the content of the files is not subject to further scrutiny, however. A document that is labeled as a category, A, for example, is actually analyzed as a large-feature vector containing at least the words in the document. A more fine-grained annotation for the same task would be to identify specific words or phrases in the document and label them as also being associated with the category directly. We’ll return to this strategy in Chapter 4. Essentially, the goal of designing a good model of the phenomenon (task) is that this is where you start for designing the features that go into your learning algorithm. The better the features, the better the performance of the ML algorithm!

Preparing a corpus with annotations of NEs, as mentioned earlier, involves a richer model than the spam-filter application just discussed. We introduced a four-category ontology for NEs in the previous section, and this will be the basis for our model to identify NEs in text. The model is illustrated as follows:

• T = {Named_Entity, Organization, Person, Place, Time}

• R = {Named_Entity ::= Organization | Person | Place | Time}

• I = {Organization = “list of organizations in a database”, Person = “list of people in a database”, Place = “list of countries, geographic locations, etc.”, Time = “all possible dates on the calendar”}

This model is necessarily more detailed, because we are actually annotating spans of natural language text, rather than simply labeling documents (e.g., emails) as spam or not-spam. That is, within the document, we are recognizing mentions of companies, actors, countries, and dates.

Finally, what about an even more involved task, that of recognizing all temporal infor­

mation in a document? That is, questions such as the following:

When did that meeting take place?

How long was John on vacation?

• Did Jill get promoted before or after she went on maternity leave?

We won’t go into the full model for this domain, but let’s see what is minimally necessary in order to create annotation features to understand such questions. First we need to distinguish between Time expressions (“yesterday,” “January 27,” “Monday”), Events (“promoted,” “meeting,” “vacation”), and Temporal relations (“before,” “after,” “during”).

Because our model is so much more detailed, let’s divide the descriptive content by domain:

• Time_Expression ::= TIME | DATE | DURATION | SET

— TIME: 10:15 a.m., 3 o’clock, etc.

— DATE: Monday, April 2011

— DURATION: 30 minutes, two years, four days

— SET: every hour, every other month

• Event: Meeting, vacation, promotion, maternity leave, etc.

• Temporal_Relations ::= BEFORE | AFTER | DURING | EQUAL | OVERLAP | ...

We will come back to this problem in a later chapter, when we discuss the impact of the initial model on the subsequent performance of the algorithms you are trying to train over your labeled data.

In later chapters, we will see that there are actually several models that might be appropriate for describing a phenomenon, each providing a different view of the data. We will call this multimodel annotation of the phenomenon. A common scenario for multimodel annotation involves annotators who have domain expertise in an area (such as biomedical knowledge). They are told to identify specific entities, events, attributes, or facts from documents, given their knowledge and interpretation of a specific area. From this annotation, nonexperts can be used to mark up the structural (syntactic) aspects of these same phenomena, thereby making it possible to gain domain expert understanding without forc­

ing the domain experts to learn linguistic theory as well.

Once you have an initial model for the phenomena associated with the problem task you are trying to solve, you effectively have the first tag specification, or spec, for the annotation. This is the document from which you will create the blueprint for how to annotate the corpus with the features in the model. This is called the annotation guide­

line, and we talk about this in the next section.