Where Do Parameter Estimates Come From?

The estimates for the parameters used in the model need to come from somewhere. But how do we get them? In Chapter 3 we talked about calculating the prior probability of a given category by using the relative frequencies in the data. Using relative frequencies in this way is a kind of Maximum Likelihood Estimation (MLE). We can do the same for Naïve Bayes: to estimate P(X|C), we count the number of times the feature value X occurs in samples of category C and divide by the total number of samples in C (just as the prior P(C) is the number of samples in C divided by the total number of samples in the corpus). This is called the Maximum Likelihood Estimate because this choice of parameter values gives the highest probability to the training corpus.
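To make this concrete, here is a toy sketch of the relative-frequency (MLE) estimate of P(word|genre); the miniature training set and the variable names are invented purely for illustration:

    from collections import Counter

    # Toy labeled data: (tokens, genre) pairs -- purely illustrative.
    training = [
        (["alien", "ship", "laser"], "Sci-fi"),
        (["alien", "invasion"], "Sci-fi"),
        (["wedding", "laughs"], "Comedy"),
    ]

    counts = {}         # per-genre word counts
    totals = Counter()  # total word tokens per genre
    for tokens, genre in training:
        counts.setdefault(genre, Counter()).update(tokens)
        totals[genre] += len(tokens)

    def mle(word, genre):
        """Relative-frequency (MLE) estimate of P(word | genre)."""
        return counts[genre][word] / totals[genre]

    print(mle("alien", "Sci-fi"))  # 2/5 = 0.4
    print(mle("robot", "Sci-fi"))  # 0.0 -- an unseen event gets zero probability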

The problem with this approach for many problems in Natural Language Processing (NLP), however, is that there simply isn’t enough data to calculate such values! This is called the data sparseness problem. Consider what happens for data that the model hasn’t encountered in the training corpus: the MLE assigns a zero probability to any unseen event, which is a very unhelpful value for predicting behavior on a new corpus. To solve this problem, statisticians have developed a number of methods to “discount” the probabilities of known events in order to assign small (but nonzero) values to the events not seen in the corpus. These techniques are collectively known as smoothing. Additive smoothing, for example, takes the existing MLE for a known event and discounts it by a factor that depends on the size of the corpus and the vocabulary (the set of categories being used to bin the data). For a good review of smoothing techniques as used in NLP, see Jurafsky and Martin (2008) and Manning and Schütze (1999).
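Continuing the toy example above (it reuses the training, counts, and totals variables from that sketch), here is a minimal sketch of additive (add-one, or Laplace) smoothing; the pseudo-count of 1 is only one possible choice:

    # Vocabulary: every word type seen anywhere in the training data.
    vocabulary = {w for tokens, _ in training for w in tokens}

    def smoothed(word, genre, alpha=1.0):
        """Additive (Laplace) estimate of P(word | genre): each event
        gets a pseudo-count of alpha, so unseen words receive a small
        but nonzero probability."""
        return ((counts[genre][word] + alpha) /
                (totals[genre] + alpha * len(vocabulary)))

    print(smoothed("alien", "Sci-fi"))  # (2 + 1) / (5 + 6) ~= 0.27
    print(smoothed("robot", "Sci-fi"))  # (0 + 1) / (5 + 6) ~= 0.09, no longer zero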

Movie genre identification

Let’s take the classifier we just built out for a test drive. Recall the IMDb movie corpus we discussed in Chapter 3. This corpus consists of 500 movie descriptions, evenly balanced over five genres: Action, Comedy, Drama, Sci-fi, and Family. Let’s assume our training corpus to be 400 labeled movie summaries, consisting of 80 summaries from each of the five genres. The learner’s task is then to decide, for a given summary, which of the five genres it belongs to. The first question we need to confront is, what are the features we can use as input for this classifier? Recall that, generally, the task of making document-level categorizations is best handled with n-gram features, while annotation of NEs, events, or other text within the summary is not particularly helpful. This is easy to see by simply inspecting a couple of movie summaries. Here are two of the movies from the IMDb corpus: the drama Blowup, from 1966, and the family movie Chitty Chitty Bang Bang, from 1968.

Blowup (1966): Thomas is a London-based photographer who leads the life of excess typical of late 1960s mod London. He is primarily a highly sought-after studio fashion photographer, although he is somewhat tiring of the vacuousness associated with it. He is also working on a book, a photographic collection of primarily darker images of human life, which is why he spent a night in a flophouse where he secretly took some photos.

While he is out one day, Thomas spies a couple being affectionate with each other in a park. From a distance, he clandestinely starts to photograph them, hoping to use the photographs as the final ones for his book. The female eventually sees what he is doing and rushes over wanting him to stop and to give her the roll of film. She states that the photographs will make her already complicated life more complicated. Following him back to his studio, she does whatever she needs to to get the film. He eventually complies, however in reality he has provided her with a different roll. After he develops the photographs, he notices something further in the background of the shots. Blowing them up, he believes he either photographed an attempted murder or an actual murder. The photos begin a quest for Thomas to match his perception to reality.

Chitty Chitty Bang Bang (1968): In the early 20th century England, eccentric Caractacus Potts works as an inventor, a job which barely supports himself, his equally eccentric father, and his two adolescent children, Jeremy and Jemima. But they’re all happy. When the children beg their father to buy for them their favorite plaything - a broken down jalopy of a car sitting at a local junk yard - Caractacus does whatever he can to make some money to buy it. One scheme to raise money involves the unexpected assistance of a pretty and wealthy young woman they have just met named Truly Scrumptious, the daughter of a candy factory owner. But Caractacus eventually comes into another one time only windfall of money, enough to buy the car. Using his inventing skills, Caractacus transforms the piece of junk into a beautiful working machine, which they name Chitty Chitty Bang Bang because of the noise the engine makes. At a seaside picnic with his children and Truly, Caractacus spins a fanciful tale of an eccentric inventor, his pretty girlfriend (who is the daughter of a candy factory owner), his two children, and a magical car named Chitty all in the faraway land of Vulgaria. The ruthless Baron Bomburst, the ruler of Vulgaria, will do whatever he can to get his hands on the magical car. But because of Baroness Bomburst’s disdain for them, what are outlawed in Vulgaria are children, including the unsuspecting children of a foreign inventor of a magical car.

As you can see, while it might be possible to annotate characters, events, and specific linguistic phrases, their contribution will be covered by the appropriate unigram or bigram feature from the text. Selecting just the right set of features, of course, is still very difficult for any particular learning task. Now let’s return to the five-step procedure for creating our learning algorithm, and fill in the specifics for this task.

1. Choose the training experience, E. In this case, we start with the movie corpus and take, as the experience for each summary, the list of n-gram features over its words.

2. Identify the target function (what is the system going to learn?). We are making an n-ary choice of whether the summary is Drama, Action, Family, Sci-fi, or Comedy.

3. Choose how to represent the target function. We will assume that the target function is represented as the MAP of the Bayesian classifier over the features.

4. Choose a learning algorithm to infer the target function from the experience you provide it with. This is tied to the way we chose to represent the function (a code sketch follows this list), namely:

Classify(f1, ..., fn) = argmax_{c ∈ C} P(C = c) ∏_i P(Fi = fi | C = c)

5. Evaluate the results according to the performance metric you have chosen. We will use accuracy over the resultant classifications as a performance metric.
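Under simplifying assumptions, here is a minimal sketch of steps 4 and 5 in code: unigram (bag-of-words) features, add-one smoothing, and log probabilities to avoid underflow. The data format and function names are invented for illustration; in practice the model would be trained on the 400 labeled summaries and evaluated on the held-out 100.

    import math
    from collections import Counter, defaultdict

    def train(labeled_summaries):
        """labeled_summaries: list of (tokens, genre) pairs."""
        priors = Counter()                  # document counts per genre, for P(C = c)
        word_counts = defaultdict(Counter)  # word counts per genre
        totals = Counter()                  # total word tokens per genre
        vocab = set()
        for tokens, genre in labeled_summaries:
            priors[genre] += 1
            word_counts[genre].update(tokens)
            totals[genre] += len(tokens)
            vocab.update(tokens)
        return priors, word_counts, totals, vocab, sum(priors.values())

    def classify(tokens, model):
        """argmax_c P(C = c) * prod_i P(Fi = fi | C = c), computed in log space."""
        priors, word_counts, totals, vocab, n_docs = model
        best_genre, best_score = None, float("-inf")
        for genre in priors:
            score = math.log(priors[genre] / n_docs)
            for w in tokens:
                # Add-one smoothed estimate of P(w | genre).
                p = (word_counts[genre][w] + 1) / (totals[genre] + len(vocab))
                score += math.log(p)
            if score > best_score:
                best_genre, best_score = genre, score
        return best_genre

    def accuracy(test_set, model):
        """Step 5: fraction of held-out summaries classified correctly."""
        correct = sum(1 for tokens, genre in test_set
                      if classify(tokens, model) == genre)
        return correct / len(test_set)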

Sentiment classification

Now let’s look at some classification tasks where different feature sets resulting from richer annotation have proved to be helpful for improving results. We begin with sentiment or opinion classification of texts. This is really two classification tasks: first, distinguishing fact from opinion in language; and second, if a text is an opinion, determining the sentiment conveyed by the opinion holder, and what object it is directed toward. A simple example of this is the movie-review corpus included in the NLTK corpus package, where movies are judged positively or negatively in a textual review.
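For reference, the corpus can be loaded directly from NLTK (assuming NLTK is installed and the movie_reviews data has been downloaded); the baseline sketched below uses only unigram presence features:

    import random
    import nltk
    from nltk.corpus import movie_reviews

    # nltk.download('movie_reviews')  # uncomment on first use

    def unigram_features(words):
        """Bag-of-words presence features, the simplest n-gram feature set."""
        return {word: True for word in words}

    # Each document is a (feature_dict, label) pair, with label 'pos' or 'neg'.
    documents = [(unigram_features(movie_reviews.words(fid)), category)
                 for category in movie_reviews.categories()
                 for fid in movie_reviews.fileids(category)]

    random.shuffle(documents)
    train_set, test_set = documents[200:], documents[:200]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, test_set))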

Here are some examples:

Positive:

jaws is a rare film that grabs your attention before it shows you a single image on screen . the movie opens with blackness , and only distant , alien-like underwater sounds . then it comes , the first ominous bars of composer john williams’ now infamous score . dah-dum . there , director steven spielberg wastes no time , taking us into the water on a midnight swim with a beautiful girl that turns deadly .

Positive:

... and while the film , like all romantic comedies , takes a hiatus from laughs towards the end because the plot has to finish up , there are more than enough truly hilarious moments in the first hour that make up for any slumps in progress during the second half . my formal complaints for the wedding singer aren’t very important . the film is predicable , but who cares ? the characters are extremely likable , the movie is ridiculously funny , and the experience is simply enjoyable .

Negative:

synopsis : a mentally unstable man undergoing psychotherapy saves a boy from a potentially fatal accident and then falls in love with the boy’s mother , a fledgling restauranteur . unsuccessfully attempting to gain the woman’s favor , he takes pictures of her and kills a number of people in his way . comments : stalked is yet another in a seemingly endless string of spurned-psychos-getting-their-revenge type movies which are a stable category in the 1990s film industry , both theatrical and direct-to-video .

Negative:

sean connery stars as a harvard law professor who heads back into the courtroom , by way of the everglades , to defend a young , educated black man ( blair underwood ) . the guy is on death row for the murder of a white girl , and says that his confession was coerced from the region’s tough , black cop ( lawrence fishburne ) . watching connery and fishburne bump heads for two hours is amusing enough , but the plot’s a joke . there’s no logic at work here . tone is also an issue--there is none . director arne glimcher never establishes exactly what his film is trying to say . is it a statement on human rights ? is it a knock-off of silence of the lambs ?

More complicated cases emerge when we look at product reviews, or more nuanced reviews, where the text is conveying a number of different opinions, not all of them negative or positive. Consider the following review, for example:

I received my Kindle Fire this morning and it is pretty amazing. The size, screen quality, and form factor are excellent. Books and magazines look amazing on the tablet and it checks email and surfs the web quickly. (Kindle Fire review on Amazon.com)

This has been a growing area since around 2002 (Pang et al. 2002), and has also been an area where corpora have been developed, including the MPQA Opinion Corpus (Wiebe et al. 2005), described in Appendix A. There are some early classifiers based entirely on n-gram models (mostly unigram) that perform quite well, so we will not explore those here. Instead, we will look at whether annotation based on model criteria can improve the results seen from n-gram-based models. If we take a model-based approach, as developed in this book, then we are hoping to characterize the text and the learning task with an annotation that reflects a deeper appreciation of the linguistic phenomena being studied (and learned). To handle more nuanced review texts, researchers have proposed model-based schemas that reflect the dependencies between the opinion holder and the product, as well as the type of sentiment. There are a couple of annotation schemas that we can consider for sentiment annotation. For example, following Liu (2012), we can define an opinion as a tuple consisting of the following elements:

Opinion = <h, e, a, so, t>

where h is an opinion holder; e is the target entity of the opinion; a is a feature of the target; so is the sentiment orientation; and t is the time of the opinion event. Using such a description of opinions gives us an annotation language that picks out a much finer-grained set of entities and properties regarding sentiment toward different kinds of objects.
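As a rough sketch of how such a tuple might be carried around in code, here is one possible representation; the class name, field names, and sample values are illustrative choices, not part of a published schema:

    from dataclasses import dataclass

    @dataclass
    class Opinion:
        """Liu-style opinion tuple <h, e, a, so, t>."""
        holder: str        # h: the opinion holder
        entity: str        # e: the target entity of the opinion
        aspect: str        # a: the feature of the target being evaluated
        orientation: str   # so: sentiment orientation, e.g., 'positive'
        time: str          # t: the time of the opinion event

    # One opinion from the Kindle Fire review quoted above
    # (values filled in by hand, purely for illustration).
    o = Opinion(holder="the reviewer", entity="Kindle Fire",
                aspect="screen quality", orientation="positive",
                time="the morning of the review")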

For example, the “orientation of the sentiment” will include values such as negative, positive, neutral, or sarcastic. Furthermore, we may identify the intensity of the opinion as low, medium, or high. Now consider what such an annotation gives us. Rather than creating classifiers based only on n-gram features, we can refer to features that have several advantages. First, they generalize over n-gram values and capture that generalization as an abstraction, namely the value of an attribute. Second, this attribute can be manipulated as a feature independently of whatever n-grams might be associated with its values. Finally, the elements in the annotation can be associated by relations that are explicitly annotated in the text, or they can be more readily discovered as nonindependent by algorithms such as MaxEnt, which we will discuss next.
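As a small illustration of that last point (reusing the hypothetical Opinion class sketched above), annotation-derived attributes can sit alongside n-gram features in the same feature dictionary:

    def combined_features(words, opinions):
        """Mix unigram features with annotation-level attributes.

        opinions: a list of Opinion objects attached to this text by a
        human annotator or an upstream tagger (assumed, not shown here).
        """
        features = {"contains({})".format(w): True for w in words}
        for op in opinions:
            # The attribute generalizes over whatever n-grams expressed it,
            # and can be weighted by a learner independently of them.
            features["orientation({})".format(op.aspect)] = op.orientation
        return features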