
Gender Identification

Now let’s get our hands dirty with decision trees. We’ll start with the “first name gender” classification problem described in Natural Language Processing with Python. This is the problem of automatically recognizing whether a (first) name is male or female, as illustrated in Figure 7-3.

Figure 7-3. Gender target function

The approach the authors of the aforementioned book describe is to look at properties of the name as a token to determine how these properties can contribute to discriminating between male and female. This is a great example of what we have called structure-dependent (SD) features for learning. Essentially, the name is analyzed in terms of its structure as a token, that is, a string of characters. That’s why the features that will be used for training a gender classifier are structure-dependent, and not n-gram or annotation-dependent features.

What would n-gram features for the first name gender problem look like? It wouldn’t be pretty, since we would have to train on individual full-token first name instances, where the corpus would be token-based. No generalizations would be possible, since there is only one feature, the token itself, associated with the target function values (female, male). So the algorithm can only correlate a tag with a known token, which means it can only use this pairing in the future on tokens that it has already identified.

What would an annotation-based feature be for this problem? Well, you might have an annotation of the context around the name; for example, words occurring within a window of the token. This might help in identifying names from the syntactic or semantic context. This would be particularly useful in a language that carried morphological marking for gender on the verb or on modifiers that might accompany the name:

La belle Mary (Fr.), Le beau Peter (Fr.).
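As a rough illustration (not something spelled out in the book), a context-window extractor along these lines could collect the neighboring words as attribute-value features; the helper name context_features and the window size are assumptions made here for the example:

def context_features(tokens, index, window=2):
    """Collect the words within a fixed window around the token at `index`.

    Returns an attribute-value dictionary; in a language with gender agreement,
    a neighboring modifier (e.g., 'belle' versus 'beau') carries the useful signal.
    """
    features = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        position = index + offset
        if 0 <= position < len(tokens):
            features["word_at_%+d" % offset] = tokens[position].lower()
    return features

# Example: the French phrase "la belle Mary"
print(context_features(["la", "belle", "Mary"], index=2))
# {'word_at_-2': 'la', 'word_at_-1': 'belle'}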

Upon examination of the corpus of first names, finding the most relevant structure-dependent features for this problem turns out to be quite straightforward. Gender seems to be reflected in the values of two basic properties of the token (a sketch of one such feature extractor follows the list):

• The value of characters in specific positions; for example, last, first, next to last, and so on

• Other character properties, such as whether the character is a vowel or a consonant, and so on
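To make these two properties concrete, here is a minimal sketch of a structure-dependent feature extractor in the attribute-value style used throughout this chapter; the particular attribute names (first_letter, last_2_letters, and so on) are illustrative choices, not a fixed inventory:

VOWELS = set("aeiou")

def gender_features(name):
    """Structure-dependent features of a first name viewed as a token (a string of characters)."""
    name = name.lower()
    return {
        "first_letter": name[0],
        "last_letter": name[-1],
        "last_2_letters": name[-2:],
        "first_is_vowel": name[0] in VOWELS,
        "last_is_vowel": name[-1] in VOWELS,
    }

print(gender_features("Nancy"))
# {'first_letter': 'n', 'last_letter': 'y', 'last_2_letters': 'cy',
#  'first_is_vowel': False, 'last_is_vowel': False}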


Once we have an inventory of features (in this case, they’re all structure-dependent features), then we can get started. Recall the steps in the creation of a learning algorithm:

1. Choose the training experience, E. In this case, the experience is a list of SD features for each name viewed as a token. More specifically, the instances are attribute-value pairs, where the attributes are chosen from a fixed set along with their values.

2. Identify the target function (what is the system going to learn?). We are making a binary choice of whether the token is male or female, so the target function is a discrete Boolean classification (yes or no) over each example (e.g., Is “Nancy” female? → yes; Is “Peter” female? → no). Formally, we can say that our function, F, maps from word tokens to the binary classification associated with the gender pair, {female, male}. We will define this as a Boolean-valued function, and simply refer to it as F, which returns true when the token is a female name and false when the token is a male name.

3. Choose how to represent the target function: this will be some functional combination of the SD features that we identified earlier. Again, using some of the features that are identified in Natural Language Processing with Python, we have the following (see the note that follows the list of features):

F1: last_letter = “a”

F2: last_letter = “k”

F3: last_letter = “f”

F4: last_letter = “r”

F5: last_letter = “y”

F6: last_2_letters = “yn”

(Note: we could also define an approximation to the target function, F′, as follows:

F′ = w_1F_1 + w_2F_2 + w_3F_3 + w_4F_4 + w_5F_5 + w_6F_6

Namely, F′ is a linear combination of the SD features F1 through F6 just listed, where the w_i are the numerical coefficients chosen by the learning algorithm to optimize the value returned by F′, which approximates F.)

4. Choose a learning algorithm to infer the target function from the experience you provide it with. We will start with the decision tree method.

5. Evaluate the results according to the performance metric you have chosen. We will use accuracy over the resultant classifications as a performance metric.
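Putting the five steps together, the following is a minimal sketch using NLTK’s names corpus and decision tree classifier. It assumes NLTK is installed and the names corpus has been downloaded (nltk.download('names')); the 500-name test split and the exact attribute names are arbitrary illustrative choices:

import random

import nltk
from nltk.corpus import names

def f_features(name):
    """Step 3: encode the chosen representation, F1-F6, as Boolean attributes."""
    name = name.lower()
    return {
        "last_letter_a": name[-1] == "a",        # F1
        "last_letter_k": name[-1] == "k",        # F2
        "last_letter_f": name[-1] == "f",        # F3
        "last_letter_r": name[-1] == "r",        # F4
        "last_letter_y": name[-1] == "y",        # F5
        "last_2_letters_yn": name[-2:] == "yn",  # F6
    }

# Step 1: the training experience -- each name becomes an attribute-value instance
# paired with its class value (the female/male labels stand in for F's true/false).
labeled_names = ([(n, "female") for n in names.words("female.txt")] +
                 [(n, "male") for n in names.words("male.txt")])
random.shuffle(labeled_names)
featuresets = [(f_features(n), gender) for (n, gender) in labeled_names]

# Steps 2 and 4: learn the binary target function with a decision tree.
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.DecisionTreeClassifier.train(train_set)

# Step 5: evaluate with accuracy over the held-out classifications.
print(nltk.classify.accuracy(classifier, test_set))
print(classifier.classify(f_features("Nancy")))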

But, now, where do we start? That is, which feature do we use to start building our tree?

When using a decision tree to partition your data, this is one of the most difficult questions to answer. Fortunately, there is a very nice way to assess the impact of choosing one feature over another. It is called information gain and is based on the notion of entropy from information theory. Here’s how it works.

In information theory, entropy measures the average uncertainty involved in choosing an outcome from the set of possible outcomes. In our case, entropy is a measure over a collection of data (the examples to classify) that characterizes how much order exists among the items relative to a particular classification, C. Given our corpus, S, of training data, the entropy is the sum, over each class value c_i with probability p_i, of p_i times the log of the inverse probability, \log_2 \frac{1}{p_i}. So, we can state the entropy of the corpus (as the random variable S) as follows:

H(S) = \sum_i p_i \log_2 \frac{1}{p_i}

With the concept of entropy given here, we now define the information gain associated with choosing a particular feature to create a partition over the dataset. Assume that through examining the data or through the MATTER cycle, we’ve come up with a set of features (or attributes) that we want to use for classifying our data. These can be n-gram, structure-dependent, or annotation-dependent features. Let the features that we’ve come up with for our task be the set {A_1, A_2, …, A_n}. To judge the usefulness of a feature as a “separation” between the data points, let’s define the information gain, G, associated with an attribute, A_i, as the expected reduction in entropy that results from using this attribute to partition the examples. Here’s the formal statement for information gain as just described, where G(S, A) stands for the “information gain” using attribute A relative to the set S:

G(S, A) =_{df} H(S) - \sum_{v \in Val(A)} \frac{|S_v|}{|S|} H(S_v)

Notice what this says. The measure of how effective an attribute A is in reducing the entropy of a set is computed by taking the difference between the entropy before the partitioning, H(S), and the sum of the entropies of each subset S_v, weighted by the fraction of the examples belonging to S_v.
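Both definitions translate directly into a short plain-Python sketch; the function names entropy and information_gain and the toy six-name corpus are ours, introduced only to illustrate the formulas:

import math
from collections import Counter

def entropy(labels):
    """H(S): the sum over class values of p_i * log2(1 / p_i)."""
    total = len(labels)
    counts = Counter(labels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def information_gain(examples, attribute):
    """G(S, A): H(S) minus the entropies of the partitions S_v, weighted by |S_v| / |S|."""
    labels = [label for _, label in examples]
    partitions = {}
    for features, label in examples:
        partitions.setdefault(features[attribute], []).append(label)
    remainder = sum((len(subset) / len(examples)) * entropy(subset)
                    for subset in partitions.values())
    return entropy(labels) - remainder

# A toy corpus: each instance pairs attribute-value features with a class value.
examples = [({"last_letter_a": n[-1] == "a"}, g)
            for n, g in [("Nancy", "female"), ("Anna", "female"), ("Maria", "female"),
                         ("Peter", "male"), ("John", "male"), ("Mark", "male")]]
print(information_gain(examples, "last_letter_a"))   # roughly 0.46 bits

Ranking the candidate attributes by this gain answers the question of which feature to split on first: the highest-gain attribute becomes the root of the tree.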

Using information gain, let’s put the “20 questions” in the most effective order for partitioning the dataset.

One of the problems we have when integrating a lot of features into our learning algorithm, as we saw earlier with decision tree learning, is that features are checked (the questions are asked) in a fixed order in the tree. This ordering is not able to reflect the fact that many features are independent of one another.

Another problem, as pointed out in Natural Language Processing with Python, is that decision trees are bad at exploiting “weak predictors” of the correct category value, since these features usually show up so far down in the decision tree. The Naïve Bayes method can get around many of these problems, as we will see in the next section.