
Counting Occurrences

When putting together a corpus of linguistic texts, you most likely will not know the probability distribution of a particular phenomenon before you examine the corpus. We could not have known, for example, what the probability of encountering an action film would be in the IMDb corpus, without counting the members associated with each value of the Genre random variable. In reality, no corpus will be so conveniently balanced.

Such information constitutes the statistics over the corpus, and it is gathered by counting occurrences of the relevant objects in the dataset; in this case, movies that are labeled as action films, comedy films, and so on. Similarly, when examining the linguistic content of a corpus, we cannot know beforehand what the frequency distribution of the different words in the corpus will be.

One of the most important things to know about your corpus before you apply any sort of ML algorithm to it is the basic statistical profile of the words contained in it. The corpus is essentially like a textbook that your learning algorithm is going to use as a supply of examples (positive instances) for training. If you don’t know the distribution of words (or whatever constitutes the objects of interest), then you don’t know what the textbook is supplying for learning. A language corpus typically has an uneven distribution of word types, as illustrated in Figure 3-1. Instructions for how to create this graph for your own corpus can be found in Madnani 2012.
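If you just want a quick look at this curve for your own data, NLTK can draw one directly from a frequency distribution. Here is a minimal sketch over the built-in Gutenberg corpus (it assumes matplotlib is installed, which FreqDist.plot requires):

>>> import nltk
>>> from nltk.corpus import gutenberg
>>> fd = nltk.FreqDist(gutenberg.words())
>>> fd.plot(50)    # frequency curve for the 50 most frequent word types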

First, a note of clarification. When we say we are counting the “words” in a corpus, we need to be clear about what that means. Word frequencies refer to the number of word tokens that are instances of a word type (or lemma). So we are correct in saying that the following sentence has 8 words in it (counting types), or that it has 11 words in it (counting tokens). It depends on what we mean by “word”!

“The cat chased the mouse down the street in the dark.”
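A quick way to see both counts is to tally tokens and types for this sentence directly. This is just a minimal sketch in plain Python, lowercasing and stripping the final period so that The and the collapse into a single type:

>>> sentence = "The cat chased the mouse down the street in the dark."
>>> tokens = sentence.rstrip('.').lower().split()
>>> len(tokens)         # word tokens
11
>>> len(set(tokens))    # word types
8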

You can perform word counts over corpora directly with the NLTK. Figure 3-2 shows some examples over the IMDb corpus. Determining what words or phrases are most common in your dataset can help you get a grasp of what kind of text you’re looking at.

Here, we show some sample code that could be used to find the 50 most frequently used words in a corpus of plain text.

Figure 3-1. Frequency distribution in the NLTK Gutenberg corpus

Figure 3-2. Frequency distribution in the IMDb corpus

>>> import nltk
>>> from nltk.corpus import PlaintextCorpusReader
>>> imdbcorpus = PlaintextCorpusReader('./training', '.*')
>>> from nltk import FreqDist
>>> fd1 = FreqDist(imdbcorpus.words())
>>> fd1.N()    # total number of sample outcomes; same as len(imdbcorpus.words())
160035
>>> fd1.B()    # total number of sample values (types) with counts greater than zero
16715
>>> len(fd1.hapaxes())    # total number of samples that occur only once
7933
>>> frequentwords = sorted(fd1, key=fd1.get, reverse=True)    # types sorted by descending frequency (fd1.keys() is no longer frequency-sorted in NLTK 3)
>>> frequentwords[:50]
[',', 'the', '.', 'and', 'to', 'a', 'of', 'is', 'in', "'", 'his', 's', 'he', 'that', 'with', '-', 'her', '(', 'for', 'by', 'him', 'who', 'on', 'as', 'The', 'has', ')', '"', 'from', 'are', 'they', 'but', 'an', 'she', 'their', 'at', 'it', 'be', 'out', 'up', 'will', 'He', 'when', 'was', 'one', 'this', 'not', 'into', 'them', 'have']

Instructions for using the NLTK’s collocation functions are available at http://nltk.googlecode.com/svn/trunk/doc/howto/collocations.html.
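As a quick illustration, the fragment below sketches one way to pull the top bigram collocations out of the same IMDb corpus. The finder classes and the PMI measure are part of NLTK’s standard collocations module; the frequency filter value of 3 is simply an arbitrary choice for this example.

>>> from nltk.collocations import BigramCollocationFinder
>>> from nltk.metrics import BigramAssocMeasures
>>> finder = BigramCollocationFinder.from_words(imdbcorpus.words())
>>> finder.apply_freq_filter(3)                  # drop bigrams seen fewer than 3 times
>>> finder.nbest(BigramAssocMeasures.pmi, 10)    # 10 bigrams with the highest pointwise mutual information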

Here are two of the basic concepts you need for performing lexical statistics over a corpus:

Corpus size (N)
  The number of tokens in the corpus
Vocabulary size (V)
  The number of types in the corpus
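In terms of the frequency distribution built earlier, these two quantities are exactly what fd1.N() and fd1.B() reported (this assumes the fd1 object from the IMDb session above):

>>> N = fd1.N()    # corpus size: 160035 tokens
>>> V = fd1.B()    # vocabulary size: 16715 types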

For any tokenized corpus, you can map each token to a type; for example, how many times the appears (the number of tokens of the type the), and so on. Once we have the word frequency distributions over a corpus, we can calculate two metrics: the rank/frequency profile and the frequency spectrum of the word frequencies.

To get the rank/frequency profile, you take the type from the frequency list and replace it with its rank, where the most frequent type is given rank 1, and so forth. To build a frequency spectrum, you simply calculate the number of types that have a specific frequency. Both are sketched below.
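Here is a minimal sketch of both computations, again assuming the fd1 frequency distribution from the IMDb session above:

>>> from collections import Counter
>>> freqs = sorted(fd1.values(), reverse=True)
>>> rank_freq = list(enumerate(freqs, start=1))    # rank/frequency profile: (rank, frequency) pairs
>>> spectrum = Counter(freqs)                      # frequency spectrum: frequency -> number of types
>>> spectrum[1]                                    # number of types that occur exactly once
7933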

The first thing one notices with these metrics is that the top few frequency ranks are taken up by function words (i.e., words such as the, a, and and; prepositions; etc.). In the Brown Corpus, the 10 top-ranked words make up 23% of the total corpus size (Baroni 2009). Another observation is that the bottom-ranked words display lots of ties in frequency. For example, in the frequency table for the IMDb corpus, the number of hapax legomena (words appearing only once in the corpus) is nearly 8,000 (7,933 in the session shown earlier). In the Brown Corpus, about half of the vocabulary size is made up of words that occur only once.

The mean (or average) frequency hides huge deviations: in Brown, the average frequency of a type is 19 tokens, but this mean is inflated by a handful of very frequent types. We also notice that most of the words in the corpus have a frequency well below the mean. The mean will therefore be higher than the median, while the mode is usually 1. So, the mean is not a very meaningful indicator of “central tendency,” and this is typical of most large corpora.

Recall the distinctions between the following notions in statistics:

• Mean (or average): The sum of the values divided by the number of values

  $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

• Mode: The most frequent value in the population (or dataset)

• Median: The numerical value that separates the higher half of a population (or sample) from the lower half
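To see how these behave on word frequencies, here is a minimal sketch that computes all three over the type frequencies of the IMDb session above (the median and mode outputs are corpus-specific and are not shown here):

>>> from statistics import mean, median, mode
>>> freqs = list(fd1.values())
>>> round(mean(freqs), 2)    # corpus size / vocabulary size = 160035 / 16715
9.57
>>> median(freqs)            # lower than the mean; most types fall well below the average
>>> mode(freqs)              # usually 1, since hapax legomena dominate the vocabulary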