
Writing the Part-of-Speech Tagger


We need to be able to do three things with the part-of-speech tagger: take data from the CorpusParser, store it internally so we can calculate probabilities of word/tag combos, and do the same for tag transitions. We want this class to be able to tell us how probable a word and tag sequence is, as well as determine from a plaintext sentence what the optimal tag sequence is.

To be able to do that, we need to tackle calculating probabilities first, followed by calculating the probability of a tag sequence with a word sequence. Finally, we'll implement the Viterbi algorithm.

Let’s talk about the probability of a tag given its previous tag. Using something called a maximum likelihood estimate, we can assert that the probability should equal the count of the two tags together divided by the count of the previous tag. A test for that would look like this:

# test/lib/pos_tagger_spec.rb
describe POSTagger do
  let(:stream) { "A/B C/D C/D A/D A/B ./." }
  let(:pos_tagger) {
    pos_tagger = POSTagger.new([StringIO.new(stream)])
    pos_tagger.train!
    pos_tagger
  }

  it 'calculates tag transition probabilities' do
    pos_tagger.tag_probability("Z", "Z").must_equal 0
    # count(previous_tag, current_tag) / count(previous_tag)
    # count of D and D happens 2 times, D happens 3 times, so 2/3
    pos_tagger.tag_probability("D", "D").must_equal Rational(2,3)
    pos_tagger.tag_probability("START", "B").must_equal 1
    pos_tagger.tag_probability("B", "D").must_equal Rational(1,2)
    pos_tagger.tag_probability(".", "D").must_equal 0
  end
end
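To see where those fractions come from, you can tally the tag bigrams of the toy stream by hand. A throwaway sketch (not part of the tagger itself) that verifies the 2/3:

    tags = ["START"] + "A/B C/D C/D A/D A/B ./.".split.map { |token| token.split("/").last }
    # tags => ["START", "B", "D", "D", "D", "B", "."]
    d_to_d  = tags.each_cons(2).count { |prev, curr| prev == "D" && curr == "D" } # => 2
    d_count = tags.count("D")                                                     # => 3
    Rational(d_to_d, d_count)                                                     # => (2/3)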

Remember that the sequence starts with an implied tag called START. So here you see the probability of D transitioning to D is in fact two divided by three: D transitions to D two times, while D shows up three times in that sequence (the tag sequence is START B D D D B .). To make this work, we would have to write the following in our POSTagger class:

# lib/pos_tagger.rb
class POSTagger
  def initialize(data_io = [])
    @corpus_parser = CorpusParser.new
    @data_io = data_io
    @trained = false
  end

  # For each ngram the CorpusParser yields during training, count the tag,
  # the word/tag combo, and the tag transition (the training method around
  # these lines is not reproduced in this excerpt; a sketch follows below):
    @tag_frequencies[ngram.last.tag] += 1
    @word_tag_combos[[ngram.last.word, ngram.last.tag].join("/")] += 1
    @tag_combos[[ngram.first.tag, ngram.last.tag].join("/")] += 1
  end

  # Maximum likelihood estimate
  # count(previous_tag, current_tag) / count(previous_tag)
  def tag_probability(previous_tag, current_tag)
    denom = @tag_frequencies[previous_tag]
    if denom.zero?
      0
    else
      @tag_combos["#{previous_tag}/#{current_tag}"] / denom.to_f
    end
  end
end
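The training plumbing that populates those counters is not reproduced in this excerpt. Here is a minimal sketch of how it might be wired up, assuming CorpusParser#parse takes a line of tagged text and yields two-element ngrams whose elements respond to word and tag; the write helper name and the START special case are our reconstruction, chosen so the START-related expectations in the tests above hold:

    # lib/pos_tagger.rb (sketch, not the book's exact listing)
    require 'set'

    class POSTagger
      def train!
        return if @trained

        @tags = Set.new(["START"])
        @tag_frequencies = Hash.new(0)
        @word_tag_combos = Hash.new(0)
        @tag_combos = Hash.new(0)

        @data_io.each do |io|
          io.each_line do |line|
            @corpus_parser.parse(line.chomp) do |ngram|
              @tags << ngram.last.tag
              write(ngram)
            end
          end
        end

        @trained = true
      end

      # Applies the per-ngram bookkeeping shown earlier
      def write(ngram)
        if ngram.first.tag == "START"
          # Count the implied START token once per sentence so that, e.g.,
          # tag_probability("START", "B") and
          # word_tag_probability("START", "START") come out to 1
          @tag_frequencies["START"] += 1
          @word_tag_combos["START/START"] += 1
        end

        @tag_frequencies[ngram.last.tag] += 1
        @word_tag_combos[[ngram.last.word, ngram.last.tag].join("/")] += 1
        @tag_combos[[ngram.first.tag, ngram.last.tag].join("/")] += 1
      end
    end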

You'll notice that we're doing a bit of error handling for the case when the denominator is zero, because otherwise we would hit a divide-by-zero error. Next, we need to address the probability of word/tag combinations, which we can do by introducing the following to our existing test:

# test/lib/pos_tagger_spec.rb
describe POSTagger do
  let(:stream) { "A/B C/D C/D A/D A/B ./." }

  # Maximum likelihood estimate
  # count(word and tag) / count(tag)
  it 'calculates the probability of a word given a tag' do
    pos_tagger.word_tag_probability("Z", "Z").must_equal 0
    # A and B happens 2 times, count of B happens 2 times, therefore 1
    pos_tagger.word_tag_probability("A", "B").must_equal 1
    # A and D happens 1 time, count of D happens 3 times, so 1/3
    pos_tagger.word_tag_probability("A", "D").must_equal Rational(1,3)
    # START and START happens 1 time, count of START happens 1 time, so 1
    pos_tagger.word_tag_probability("START", "START").must_equal 1
    pos_tagger.word_tag_probability(".", ".").must_equal 1
  end
end

To make this work in the POSTagger, we need to write the following:

# lib/pos_tagger.rb

  # Maximum likelihood estimate
  # count(word and tag) / count(tag)
  def word_tag_probability(word, tag)
    denom = @tag_frequencies[tag]
    if denom.zero?
      0
    else
      @word_tag_combos["#{word}/#{tag}"] / denom.to_f
    end
  end
end

Now that we have those two things, word_tag_probability and tag_probability, we can answer the question: given a word and tag sequence, how probable is it? That probability is the product, over the sequence, of the probability of each tag given the previous tag multiplied by the probability of each word given its tag. In a test, it looks like this:

# test/lib/pos_tagger_spec.rb
describe POSTagger do
  it 'calculates probability of sequence of words and tags' do
    words = %w[START A C A A .]
    tags = %w[START B D D B .]

    tagger = pos_tagger

    tag_probabilities = [
      tagger.tag_probability("B", "D"),
      tagger.tag_probability("D", "D"),
      tagger.tag_probability("D", "B"),
      tagger.tag_probability("B", ".")
    ].reduce(&:*)

    word_probabilities = [
      tagger.word_tag_probability("A", "B"), # 1
      tagger.word_tag_probability("C", "D"),
      tagger.word_tag_probability("A", "D"),
      tagger.word_tag_probability("A", "B")  # 1
    ].reduce(&:*)

    expected = word_probabilities * tag_probabilities

    pos_tagger.probability_of_word_tag(words, tags).must_equal expected
  end
end
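With the toy stream from earlier, those factors work out to 1/2 × 2/3 × 1/3 × 1/2 for the tag transitions and 1 × 2/3 × 1/3 × 1 for the word emissions, so the expected probability is 1/18 × 2/9 = 1/81. The START-to-B transition and the ./. emission are left out of the test because they are both 1.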

So basically we are calculating word tag probabilities multiplied by the probability of tag transitions. We can easily implement this in the POSTagger using the following:

# lib/pos_tagger.rb
  def probability_of_word_tag(word_sequence, tag_sequence)
    if word_sequence.length != tag_sequence.length
      raise 'The word and tags must be the same length!'
    end

    # word_sequence %w[START I want to race .]
    # tag_sequence  %w[START PRO V TO V .]
    length = word_sequence.length

    probability = Rational(1, 1)

    (1...length).each do |i|
      probability *= (
        tag_probability(tag_sequence[i - 1], tag_sequence[i]) *
        word_tag_probability(word_sequence[i], tag_sequence[i])
      )
    end

    probability
  end
end

Now we can figure out how probable a given word and tag sequence is. But it would be better if we were able to determine, given a sentence and training data, what the optimal sequence of tags is. For that, we need to write this simple test:

# test/lib/pos_tagger_spec.rb
  # (the `training` corpus and `sentence` fixtures are defined in the full
  # spec but are not reproduced in this excerpt)
  let(:pos_tagger) {
    pos_tagger = POSTagger.new([StringIO.new(training)])
    pos_tagger.train!
    pos_tagger
  }

  it 'will calculate the best viterbi sequence for I want to race' do
    pos_tagger.viterbi(sentence).must_equal %w[START PRO V TO V .]
  end
end

This test takes a bit more to implement because the Viterbi algorithm is somewhat involved. So let’s go through this step by step. The first problem is that our method accepts a string, not a sequence of tokens. We need to split by whitespace and treat stop characters as their own word. So to do that, we write the following to set up the Viterbi algorithm:
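The book's exact listing for this step is not reproduced in this excerpt. A minimal sketch of that tokenization, splitting on whitespace and peeling a trailing stop character off into its own token (the regular expression and local names are illustrative; parts matches the variable used in the fragments below):

    # lib/pos_tagger.rb (sketch)
    def viterbi(sentence)
      # "I want to race." => ["I", "want", "to", "race", "."]
      parts = sentence.strip.split(/\s+/).flat_map do |token|
        if token =~ /\A(.+?)([.?!])\z/
          [$1, $2]   # split a trailing stop character into its own token
        else
          [token]
        end
      end

      # ... the rest of the algorithm works through `parts` ...
    end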

The Viterbi algorithm is an iterative algorithm, meaning at each step it figures out where it should go next based on the previous answer. So we will need to memoize the previous probabilities as well as keep the best tag. We can initialize and figure out what the best tag is as follows:

# lib/pos_tagger.rb

backpointers = ["START"]

@tags.each do |tag|
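The body of that loop is truncated in this excerpt. A minimal sketch of the initialization step, consistent with the explanation that follows: seed last_viterbi with the probability of each tag following START and emitting the first token, keep only non-zero entries, and record the best first tag in backpointers (a reconstruction, not the book's exact code):

    # lib/pos_tagger.rb (sketch of the initialization step)
    last_viterbi = {}
    backpointers = ["START"]

    @tags.each do |tag|
      next if tag == "START"

      # P(tag | START) * P(first word | tag); prune zero-probability tags
      probability = tag_probability("START", tag) *
                    word_tag_probability(parts.first, tag)
      last_viterbi[tag] = probability if probability > 0
    end

    # Remember the most likely tag for the first word
    backpointers << (last_viterbi.max_by { |_tag, prob| prob } ||
                     @tag_frequencies.max_by { |_tag, count| count }).first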

At this point, last_viterbi has only one option, which is {"PRO" => 1.0}. That is because the probability of transitioning from START to anything else is zero. Likewise, backpointers will have START and PRO in it. So, now that we've set up our initial step, all we need to do is iterate through the rest:

# lib/pos_tagger.rb

# parts

best_previous = last_viterbi.max_by do |prev_tag, probability|

(
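The rest of that listing is also truncated here. A sketch of the iteration step in the same spirit: for each remaining token we keep, per tag, only the best probability reachable from the previous column, prune zeros, and fall back to @tag_frequencies when everything has been pruned away (again a reconstruction, not the book's exact code):

    # lib/pos_tagger.rb (sketch of the iteration step)
    parts[1..-1].each do |part|
      viterbi = {}

      @tags.each do |tag|
        next if tag == "START"
        break if last_viterbi.empty?

        # Best previous tag to have come from, weighted by the transition
        # into `tag` and the emission of the current word
        best_previous = last_viterbi.max_by do |prev_tag, probability|
          probability *
            tag_probability(prev_tag, tag) *
            word_tag_probability(part, tag)
        end

        prev_tag, prev_probability = best_previous
        probability = prev_probability *
                      tag_probability(prev_tag, tag) *
                      word_tag_probability(part, tag)

        viterbi[tag] = probability if probability > 0
      end

      last_viterbi = viterbi

      # If we pruned everything away, fall back to the most frequent tag
      backpointers << (last_viterbi.max_by { |_tag, prob| prob } ||
                       @tag_frequencies.max_by { |_tag, count| count }).first
    end

    backpointers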

What we are doing is storing only relevant information, and if there’s a case where last_viterbi is empty, we’ll use @tag_frequencies instead. That case really only happens when we have pruned too far. But this approach is much faster than storing all of the information in memory.
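Putting the pieces together, a quick usage sketch; the one-line training corpus here is made up for illustration and simply mirrors the tags the test expects:

    require 'stringio'

    # Hypothetical training line; the book's actual fixture is not shown in
    # this excerpt
    training = "I/PRO want/V to/TO race/V ./."

    tagger = POSTagger.new([StringIO.new(training)])
    tagger.train!

    tagger.viterbi("I want to race.")
    # => ["START", "PRO", "V", "TO", "V", "."]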

At this point, things should work! But how well?
