
The SentimentClassifier Class

From Thoughtful Machine Learning (pages 134-139)

Now that we have the portion of the app that takes training data for both positive and negative text, we can build the SVM portion of the app (the SentimentClassifier class). The point of this class is to take information from a CorpusSet and convert it into an SVM model. After it has done that, it serves as a way of taking new information and mapping it to either :positive or :negative. In addition to building the SentimentClassifier class, we need to address the following:

• We need to refactor the interaction with CorpusSet, because its API is too complicated to use directly.

• We need a library to handle the SVM algorithm.

• We need something to train on and cross-validate.

Refactoring the interaction with CorpusSet

The SentimentClassifier takes one argument, which is a CorpusSet. This is simply the corpus set of all training data. Unfortunately, with our argument being a CorpusSet, we run into the following syntax:

# lib/sentiment_classifier.rb
class SentimentClassifier
  def initialize(corpus_set)
    # Initialization
  end
end

This is not good API design: it requires a lot of prior setup to build a Corpus and then a CorpusSet. In reality, we want something more like a factory method that builds a SentimentClassifier. This method, .build, would take multiple arguments pointing at training data. Instead of passing in a hash, we'll just assume that positive text has the file extension .pos and negative text has the extension .neg.

Making a factory method called .build will really help, and doesn't require us to explicitly build everything; it relies on the file extension, so we can simply fill in the blanks:

# lib/sentiment_classifier.rb
class SentimentClassifier
  def self.build(files)
    mapping = {
      '.pos' => :positive,
      '.neg' => :negative
    }
    corpora = files.map { |file| Corpus.new(file, mapping.fetch(File.extname(file))) }
    corpus_set = CorpusSet.new(corpora)
    new(corpus_set)
  end
end

Now that we're at this junction, we still have a couple of decisions to make: which library to use to build our SVM model, and where to find training data.

Library to handle Support Vector Machines: LibSVM

When it comes to libraries that handle Support Vector Machines, most people tend to grab LibSVM. It has the longest track record, is written in C, and has many bindings, from Python to Java to Ruby. One caveat, though, is that there are a few Ruby gems for LibSVM, and not all of them are superb. The rb-libsvm gem, which is what we will use, supports sparse vectors and is therefore best suited to our problem. Others use SWIG adapters and unfortunately don't support sparse matrices as well.
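To see why sparse support matters: a bag-of-words vector over a real vocabulary is almost entirely zeros, so storing only the nonzero entries (index to count) is dramatically cheaper. A minimal pure-Ruby sketch of the idea, with no rb-libsvm calls and a made-up vocabulary:

```ruby
# A sparse vector keeps only nonzero word counts, keyed by each
# word's index in the vocabulary.
vocabulary = %w[awful boring great superb wooden]
text = 'great great superb'

sparse = Hash.new(0)
text.split.each do |word|
  index = vocabulary.index(word)
  sparse[index] += 1 if index
end
puts sparse.inspect  # only 2 of the 5 dimensions are stored
```

A real vocabulary has tens of thousands of words per document set, so the savings compound quickly.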

Training data

Up to this point, we haven't talked about training data for our tool. We need some text that is mapped as either negative or positive, organized into lines and stored in files. There are many possible sources of data, but what we'll use comes from GitHub: a data set from Pang and Lee about movie review sentiment.

This is a highly specific data set and will work only for movie reviews from IMDb (the Internet Movie Database), but it is sufficient for our purposes. If you were to use this with any other program, most likely you would use a data set specific to what you were trying to solve. So, for instance, Twitter sentiment would come from actual tweets that were mapped to negative and positive. Keep in mind that it's not too difficult to build your own data set by creating a survey form and parceling the work out to Amazon's Mechanical Turk.

Cross-validating with the movie review data

Cross-validation is the best way to ensure that our data is trained well and that our model works properly. The basic idea is to take a big data set, split it into two or more pieces, and then use one of those pieces of data as training while using the other to validate against it.
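The split itself can be sketched in a few lines. This toy version alternates lines between the two pieces, the same counter % 2 scheme used by the split_file helper later in this section:

```ruby
# Alternate lines between the two pieces: even-numbered lines go
# to validation, odd-numbered lines go to training.
lines = (1..10).map { |i| "review #{i}" }
validation = []
training = []
lines.each_with_index do |line, counter|
  (counter % 2 == 0 ? validation : training) << line
end
puts validation.length  # 5
puts training.length    # 5
```

Alternating lines (rather than cutting the file in half) keeps both pieces representative if the file happens to be ordered in some way.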

In test form, it would look like this:

# test/cross_validation_spec.rb
describe 'cross validation' do
  include TestMacros

  def self.test_order
    :alpha
  end

  (-15..15).each do |exponent|
    it "runs cross validation for C=#{2**exponent}" do
      neg = split_file("./config/rt-polaritydata/rt-polarity.neg")
      pos = split_file("./config/rt-polaritydata/rt-polarity.pos")

      classifier = SentimentClassifier.build([
        neg.fetch(:training),
        pos.fetch(:training)
      ])
      classifier.c = 2 ** exponent

      n_er = validate(classifier, neg.fetch(:validation), :negative)
      p_er = validate(classifier, pos.fetch(:validation), :positive)
      total = Rational(n_er.numerator + p_er.numerator,
                       n_er.denominator + p_er.denominator)

      skip("Total error rate for C=#{2 ** exponent} is: #{total.to_f}")
    end
  end
end
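The total calculation pools the two per-class error rates by summing misses over summed line counts, rather than averaging the two ratios. With made-up counts:

```ruby
# Pool two error rates: say 27 misses out of 100 negative lines
# and 33 misses out of 100 positive lines.
n_er = Rational(27, 100)
p_er = Rational(33, 100)
total = Rational(n_er.numerator + p_er.numerator,
                 n_er.denominator + p_er.denominator)
puts total.to_f  # 0.3
```

Note that Ruby's Rational reduces to lowest terms on construction, so numerator and denominator only equal the raw miss and line counts when the per-class fraction happens not to reduce.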

We're using skip and self.test_order here. The skip method is used to report information without testing anything per se; because we are trying to find an optimal C, we are just experimenting using tests. Also notice that we override test_order and set it to :alpha. That is because Minitest runs tests in random order by default, meaning that as we go through the series from -15 to 15 we would get results out of order. It is much easier to interpret results when you're looking at them in order.

Also notice that we have introduced two new methods, split_file and validate. These are in our test macros module as:

# test/test_macros.rb
module TestMacros
  def validate(classifier, file, sentiment)
    total = 0
    misses = 0
    File.open(file, 'rb').each_line do |line|
      misses += 1 if classifier.classify(line) != sentiment
      total += 1
    end
    Rational(misses, total)
  end

  def split_file(filepath)
    ext = File.extname(filepath)
    validation = File.open("./test/fixtures/validation#{ext}", "wb")
    training = File.open("./test/fixtures/training#{ext}", "wb")
    counter = 0
    File.open(filepath, 'rb').each_line do |l|
      if counter % 2 == 0
        validation.write(l)
      else
        training.write(l)
      end
      counter += 1
    end
    validation.close
    training.close
    {
      validation: "./test/fixtures/validation#{ext}",
      training: "./test/fixtures/training#{ext}"
    }
  end
end

In this test, we iterate the exponent from -15 all the way up to 15, covering C values from 2**-15 to 2**15. This will cover most of the territory we want. After the cross-validation is done, we can pick the best C and use that for our model. Technically speaking, this is called a grid search, and it attempts to find a good enough solution over a set of trial runs.
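The grid search itself is tiny: evaluate each candidate and keep the argmin. The error function below is a made-up stand-in for a full cross-validation run; its minimum is placed at 2**7 = 128 purely for illustration:

```ruby
# Toy grid search over C = 2**-15 .. 2**15. In the real test,
# each call to error_for would be an entire cross-validation run.
error_for = ->(exponent) { (0.30 + 0.01 * (exponent - 7).abs).round(4) }

best_exponent = (-15..15).min_by { |exp| error_for.call(exp) }
puts 2**best_exponent               # the chosen C
puts error_for.call(best_exponent)  # its error rate
```

A grid search doesn't guarantee the optimum, only the best point on the grid; doubling C at each step is a common way to cover a wide range cheaply.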

Now we need to work on the backend of the SentimentClassifier. This is where we use LibSVM by building our model and making a tiny state machine:

# lib/sentiment_classifier.rb
  def classify(string)
    if @model
      prediction = @model.predict(@corpus_set.sparse_vector(string))
      present_answer(prediction)
    else
      puts 'starting to get sparse vectors'
      y_vec, x_mat = @corpus_set.to_sparse_vectors
      prob = Libsvm::Problem.new
      parameter = Libsvm::SvmParameter.new
      parameter.cache_size = 1000
      parameter.gamma = Rational(1, y_vec.length).to_f
      parameter.eps = 0.001
      parameter.c = @c
      parameter.kernel_type = Libsvm::KernelType::LINEAR
      prob.set_examples(y_vec, x_mat)
      @model = Libsvm::Model.train(prob, parameter)
      classify(string)
    end
  end

Here's where things get more interesting and we actually build the Support Vector Machine to work with the rest of the problem. As noted, we are using the LibSVM library, which is a standard. We first build our sparse vectors, then load up a new LibSVM problem, and finally give it mostly default parameters.
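The only derived value among those parameters is gamma, set to one over the number of training examples. (With a LINEAR kernel, LibSVM never actually consults gamma; it matters only for the RBF, polynomial, and sigmoid kernels, so the line is harmless but inert.) The training-set size below is a placeholder:

```ruby
# gamma = 1 / (number of training examples), as set above.
y_vec_length = 2000  # placeholder training-set size
gamma = Rational(1, y_vec_length).to_f
puts gamma  # 0.0005
```

Going through Rational before .to_f avoids integer division (1 / 2000 would be 0 in Ruby).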

After running the cross-validation, we see that the best C is 128, which happens to have a ~30% error rate.

