
Error Minimization Through Cross-Validation


At this point, we need to measure how well our model works. To do so, we take the data that we downloaded earlier and run a cross-validation test on it. From there, we focus on measuring false positives, and based on that we determine whether the model needs more fine-tuning.

Minimizing false positives

Up until this point, our goal with making models has been to minimize error. This error could be easily denoted as the count of misclassifications divided by the total classifications. In most cases, this is exactly what we want, but in a spam filter this isn’t what we’re optimizing for. Instead, we want to minimize false positives. False positives, also known as Type I errors, are when the model incorrectly predicts a positive when it should have been negative.

In our case, if our model predicts spam when in fact the email isn’t, then the user will lose her emails. We want our spam filter to have as few false positives as possible. On the other hand, if our model incorrectly predicts something as ham when it isn’t, we don’t care as much.

Instead of minimizing total misclassifications divided by total classifications, we want to minimize the number of emails misclassified as spam divided by the total number of classifications. We will also measure false negatives, but they are less important because our goal is to reduce the spam that enters someone’s mailbox, not to eliminate it entirely.
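Concretely, the shift is from one ratio to another. Here is a minimal sketch of the distinction (the error_rates helper is illustrative, not part of the book’s codebase), given a list of [guess, actual] pairs labeled 'spam' or 'ham':

# Computes the overall error rate alongside the two rates we actually care about.
def error_rates(results)
  total           = results.size.to_f
  false_positives = results.count { |guess, actual| guess == 'spam' && actual == 'ham' }
  false_negatives = results.count { |guess, actual| guess == 'ham' && actual == 'spam' }

  {
    error_rate:          (false_positives + false_negatives) / total, # what we usually minimize
    false_positive_rate: false_positives / total,                     # what a spam filter minimizes
    false_negative_rate: false_negatives / total                      # tolerable, within reason
  }
end

error_rates([%w[spam spam], %w[ham ham], %w[spam ham], %w[ham spam]])
# => {:error_rate=>0.5, :false_positive_rate=>0.25, :false_negative_rate=>0.25}

Two models could share the same overall error rate while one of them puts all of its mistakes in the false positive column; for a spam filter we would always prefer the other one.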

To accomplish this, we first need to take some information from our data set, which we’ll cover next.

Building the two folds

Inside the spam email training data is a file called keyfile.label, which records whether each file is spam or ham. Inside our cross-validation test, we can parse that file, build training data from a fold, and validate classifications using the following code:

# test/cross_validation_spec.rb
describe 'Cross Validation' do
  # Reads a label file (one "label path" pair per line) into Email objects.
  def self.parse_emails(keyfile)
    puts "Parsing emails for #{keyfile}"
    emails = []

    File.open(keyfile, 'rb').each_line do |line|
      label, file = line.split(/\s+/)
      emails << Email.new(file, label)
    end

    puts "Done parsing emails for #{keyfile}"
    emails
  end

  # Builds a SpamTrainer from one fold's label file.
  def self.label_to_training_data(fold_file)
    training_data = []

    File.open(fold_file, 'rb').each_line do |line|
      label, file = line.split(/\s+/)
      training_data << [label, file]
    end

    SpamTrainer.new(training_data)
  end

  # Classifies every email in the holdout fold and tallies the results.
  def self.validate(trainer, set_of_emails)
    correct = 0
    false_positives = 0.0
    false_negatives = 0.0
    confidence = 0.0

    set_of_emails.each do |email|
      classification = trainer.classify(email)
      confidence += classification.score

      if classification.guess == 'spam' && email.category == 'ham'
        false_positives += 1
      elsif classification.guess == 'ham' && email.category == 'spam'
        false_negatives += 1
      else
        correct += 1
      end
    end

    total = false_positives + false_negatives + correct

    message = <<-EOL
      False Positive Rate (Bad): #{false_positives / total}
      False Negative Rate (not so bad): #{false_negatives / total}
      Error Rate: #{(false_positives + false_negatives) / total}
    EOL
    message
  end
end
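The label files themselves are plain text: one email per line, a label followed by the path to the message file. A hypothetical excerpt (these paths are purely illustrative) would look like:

spam ./data/inmail.1
ham ./data/inmail.2

If you need to build fold1.label and fold2.label yourself, a minimal sketch of a shuffled 50/50 split follows; the book’s fixtures may have been produced differently, so treat this as one way to do it rather than the canonical one.

# Sketch: split keyfile.label into two folds for two-fold cross-validation.
# Assumes the "label path" line format shown above; seeded for repeatability.
lines = File.readlines('./test/fixtures/keyfile.label').shuffle(random: Random.new(42))
half  = lines.size / 2

File.write('./test/fixtures/fold1.label', lines[0...half].join)
File.write('./test/fixtures/fold2.label', lines[half..-1].join)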

Cross-validation and error measuring

From here, we can actually build our cross-validation test, which will read fold1 and fold2 and then cross-validate to determine the actual error rate. The test looks something like this:

# test/cross_validation_spec.rb
describe 'Cross Validation' do
  describe "Fold1 unigram model" do
    let(:trainer) {
      self.class.label_to_training_data('./test/fixtures/fold1.label')
    }
    let(:emails) {
      self.class.parse_emails('./test/fixtures/fold2.label')
    }

    it "validates fold1 against fold2 with a unigram model" do
      skip(self.class.validate(trainer, emails))
    end
  end

  describe "Fold2 unigram model" do
    let(:trainer) {
      self.class.label_to_training_data('./test/fixtures/fold2.label')
    }
    let(:emails) {
      self.class.parse_emails('./test/fixtures/fold1.label')
    }

    it "validates fold2 against fold1 with a unigram model" do
      skip(self.class.validate(trainer, emails))
    end
  end
end
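Note that each example passes the validation report to skip instead of asserting on it. Skipping with a message is a small trick to get the report printed in the runner’s output; we want to eyeball the rates, not enforce a pass/fail threshold on them.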

When we run the command ruby test/cross_validation_spec.rb, we get the following results:

WARNING: Could not parse (and so ignoring) 'From spamassassin-devel-admin@lists.sourceforge.net Fri Oct 4 11:07:38 2002'
Parsing emails for ./test/fixtures/fold2.label
WARNING: Could not parse (and so ignoring) 'From quinlan@pathname.com Thu Oct 10 12:29:12 2002'
Done parsing emails for ./test/fixtures/fold2.label
Cross Validation::Fold1 unigram model
  validates fold1 against fold2 with a unigram model
    False Positive Rate (Bad): 0.0036985668053629217
    False Negative Rate (not so bad): 0.16458622283865001
    Error Rate: 0.16828478964401294

WARNING: Could not parse (and so ignoring) 'From quinlan@pathname.com Thu Oct 10 12:29:12 2002'
Parsing emails for ./test/fixtures/fold1.label
WARNING: Could not parse (and so ignoring) 'From spamassassin-devel-admin@lists.sourceforge.net Fri Oct 4 11:07:38 2002'
Done parsing emails for ./test/fixtures/fold1.label
Cross Validation::Fold2 unigram model
  validates fold2 against fold1 with a unigram model
    False Positive Rate (Bad): 0.005545286506469501
    False Negative Rate (not so bad): 0.17375231053604437
    Error Rate: 0.17929759704251386

You’ll notice that the false negative rate (classifying an email as ham when it’s actually spam) is much higher than the false positive rate (classifying an email as spam when it’s ham). This comes straight from the priors in Bayes’s theorem! Let’s look at the actual probabilities for ham versus spam in Table 4-3.

Table 4-3. Spam versus ham

Category   Email count   Word count   Probability of email   Probability of word
Spam       1,378         231,472      31.8%                  36.3%
Ham        2,949         406,984      68.2%                  63.7%
Total      4,327         638,456      100%                   100%
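To see where this bias comes from, recompute the email priors from the counts in Table 4-3. This is just a back-of-the-envelope check, not part of the filter itself:

# Class priors recomputed from Table 4-3's email counts.
spam_count = 1_378
ham_count  = 2_949
total      = spam_count + ham_count  # => 4327

p_spam = spam_count.to_f / total     # => 0.3184... (~31.8%)
p_ham  = ham_count.to_f / total      # => 0.6815... (~68.2%)

Because P(ham) is more than double P(spam), a borderline email needs strong word-level evidence before the classifier will call it spam; otherwise the prior wins and the email is filed as ham.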

As you can see, ham is more probable, so the classifier defaults toward it, and more often than not it will classify a borderline message as ham even when it isn’t. The good thing here, though, is that we have reduced spam by more than 80% (a false negative rate of roughly 17% means about 83% of spam is caught) without sacrificing incoming messages.

Conclusion

In this chapter, we have delved into building and understanding a Naive Bayesian Classifier. As you have learned, this algorithm is well suited to data whose features can be assumed to be independent of one another. Being a probabilistic model, it also works well for classifying data into multiple categories based on the underlying scores. This supervised learning method is useful for fraud detection, spam filtering, and any other problem that has these types of features.
