Naive Bayes Classifier

M1 LI

Guillaume Wisniewski [email protected]

September 2021

1 Goal

We aim to train a classifier implementing the Naive Bayes approach to predict whether an email is spam or ham. To train and evaluate the classifier, we will use the ling-spam dataset, which contains spam messages and messages from the Linguist list.¹

2 Dataset

1. Write a function that takes as a parameter the name of the directory containing the ling-spam dataset and returns a list of examples. Each example is made of an observation (the list of words extracted from the mail) and a label (either ham or spam).

2. How many examples are there in the corpus? Represent the label distribution.

3. Write a function that splits the corpus into a train set and a test set. The function will take as parameters the list of examples and the proportion of examples that must be included in the train set.
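One possible shape for the splitting function of question 3 is sketched below. The name `train_test_split`, the `seed` parameter, and the toy corpus are assumptions made for illustration; the sheet does not prescribe them. Shuffling before cutting is one choice among several, made here so that the split does not depend on the order in which the files were read.

```python
import random


def train_test_split(examples, train_proportion, seed=0):
    """Shuffle a copy of `examples` and cut it into (train, test).

    `seed` is a hypothetical extra parameter that makes the split reproducible.
    """
    if not 0.0 <= train_proportion <= 1.0:
        raise ValueError("train_proportion must be between 0 and 1")
    shuffled = list(examples)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_proportion)
    return shuffled[:cut], shuffled[cut:]


# Toy corpus (not the real dataset): each example is (list of words, label).
toy = [(["hello", "world"], "ham")] * 6 + [(["buy", "now"], "spam")] * 4
train, test = train_test_split(toy, 0.8)
print(len(train), len(test))
```

Changing `seed` gives a different split of the same size, which is convenient when several splits have to be compared.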

3 Naive Bayes classifier

In the Naive Bayes classifier, the conditional distribution over the class variable y is estimated by:

$$p(y \mid x) \propto p(y) \cdot \prod_{i=1}^{k} p(x_i \mid y) \tag{1}$$

¹ See this article for further details.

where ∝ denotes proportionality and $x = (x_i)_{i=1}^{k}$ is an observation represented by a vector of k components.²

Computing p(y|x) involves multiplying a large number of tiny numbers: there are as many factors in the product as words in an observation, and each factor is a probability and is therefore smaller than 1. This can cause numerical stability issues, because floating-point numbers cannot represent arbitrarily small values and the product eventually underflows to zero. That is why your program must manipulate log-probabilities rather than probabilities.
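The underflow problem can be observed directly with a small experiment. The sketch below uses 100 made-up probabilities of 0.0001 each (an arbitrary value chosen for illustration, not taken from the dataset) and compares the plain product with the sum of logarithms.

```python
import math

# 100 hypothetical word probabilities, all equal to 1e-4 (illustrative values).
probs = [1e-4] * 100

# Plain product: the true value is 1e-400, far below what a 64-bit float can hold.
product = 1.0
for p in probs:
    product *= p

# Sum of logs: the same quantity on a log scale stays comfortably representable.
log_sum = sum(math.log(p) for p in probs)

print(product)  # the product has underflowed to 0.0
print(log_sum)  # roughly -921.03
```

Once the product has collapsed to zero for every class, the classes can no longer be compared, whereas the log-scores remain distinct.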

4. Use Equation 1 to express log p(y|x) as a function of p(x_i|y) and p(y). Why can we use log p(y|x) instead of p(y|x) in the MAP decision rule?

5. Write a function that takes as input a list of examples and returns a dictionary of dictionaries that maps every pair (word, label) to the number of occurrences of word in mails labeled label.

6. Write a class³ with:

• a fit method that estimates the parameters of the Naive Bayes classifier;

• a predict method that takes as input an observation (a list of words) and returns the predicted label (either spam or ham);

• a score method that takes as input a list of annotated examples and returns the proportion of correctly predicted labels. This corresponds to the classifier accuracy.

7. What is the accuracy of a Naive Bayes classifier when the train set is made of 80% of all the examples? How is the accuracy related to the 0/1 loss?
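The counting step of question 5 could look like the following sketch. The function name `count_words`, the nesting order (word first, then label), and the toy examples are assumptions; the sheet only requires a dictionary of dictionaries indexed by word and label, so the opposite nesting would be equally valid.

```python
from collections import defaultdict


def count_words(examples):
    """Return counts[word][label] = occurrences of `word` in mails labeled `label`."""
    counts = defaultdict(lambda: defaultdict(int))
    for words, label in examples:
        for word in words:
            counts[word][label] += 1
    # Convert to plain dicts so that looking up an unseen word raises KeyError
    # instead of silently creating an empty entry.
    return {word: dict(by_label) for word, by_label in counts.items()}


# Toy examples (not taken from the real corpus).
examples = [
    (["cheap", "pills", "cheap"], "spam"),
    (["syntax", "workshop"], "ham"),
    (["cheap", "workshop"], "ham"),
]
counts = count_words(examples)
print(counts["cheap"])
print(counts["workshop"])
```

Note that a word never seen with a given label has no entry for that label, so a fit method built on these counts has to decide how to handle zero counts before taking logarithms.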

4 Evaluation

8. Compute the confusion matrix of the classifier trained in the previous section. A confusion matrix is an n × n matrix, where n is the number of possible labels. An entry m_{i,j} of the confusion matrix indicates the number of times an example with gold label i has been classified as j. What is a confusion matrix useful for? Interpret.

9. Compute the accuracy of the classifier for different sizes of the training set (e.g. starting with an 80:20 train-test split, use 10%, 20%, 30%, ... of the training data to estimate the classifier parameters, and estimate the classifier performance on the test set). Why do we have to use a single test set (rather than, for instance, using all the remaining data to evaluate the classifier performance)? Plot the classifier performance with respect to the size of the train set. Interpret.

² In the next classes we will see that these components are usually called ‘features’.

³ Defining a class is a simple way to ‘link’ the different methods together. If you are not at ease with OOP, you can write separate functions instead.

10. Generate 100 train-test splits (all with an 80:20 proportion), then train and evaluate a classifier on each of these splits. What are the smallest and largest accuracies achieved? Plot the resulting accuracy distribution. Interpret.

11. Find, for each label y, the ten words with the largest probability p(w|y). Interpret.
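The confusion matrix defined in question 8 can be sketched on hand-written labels as follows. The function name `confusion_matrix`, the row and column order given by `labels`, and the toy gold and predicted labels are all assumptions made for illustration; no real classifier output is used here.

```python
from collections import Counter


def confusion_matrix(gold, predicted, labels):
    """Return m where m[i][j] counts examples with gold label labels[i]
    that were classified as labels[j]."""
    pair_counts = Counter(zip(gold, predicted))
    return [[pair_counts[(g, p)] for p in labels] for g in labels]


labels = ["ham", "spam"]
gold = ["ham", "ham", "ham", "spam", "spam"]       # toy gold labels
predicted = ["ham", "ham", "spam", "spam", "ham"]  # toy predictions

matrix = confusion_matrix(gold, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)

# The diagonal holds the correct predictions, so accuracy is trace / total.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(gold)
print(accuracy)
```

Reading the off-diagonal entries separately shows which kind of error dominates (ham classified as spam, or spam classified as ham), information that the accuracy alone does not give.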
