
Université Paris 13 – Institut Galilée Département d'Informatique

Pôle de Recherche et d'Enseignement Supérieur Sorbonne Paris Cité

Master Informatique, spécialités EID2, PLS

Master of Science in Informatics

Traitement Numérique des Données

Digital Data Processing

Exercise sheet No. 3

Classifier evaluation and error estimation

www-galilee.univ-paris13.fr


Exercises with PRTools

R.P.W. Duin, C. Lai, E. Pękalska, D.M.J. Tax

Information and Communication Theory Group, Delft University of Technology

PR Sys Design


3 Classifier evaluation and error estimation

Example 13. Evaluation

The following routines are available for the evaluation of classifiers:

testc Test a dataset on a trained classifier

crossval Train and test classifiers by cross validation

cleval Classifier evaluation by computing a learning curve

reject Computation of an error-reject curve

roc Computation of a receiver-operator curve

gendat Split a given dataset at random into a training set and a test set.

sdroc ROC estimation using the PRSD toolbox

sddrawroc Interactive ROC plot and selection of an operating point

sdcrossval Cross-validation with rotation or randomization

A simple example of the generation and use of a test set is the following:

13.1 Load the mfeat_kar dataset, consisting of 64 Karhunen-Loève coefficients measured for 10×200 written digits (’0’ to ’9’). A training set of 50 objects per class (i.e. a fraction of 0.25 of 200) can be generated by:

>> a = mfeat_kar

MFEAT KL Features, 2000 by 64 dataset with 10 classes: [200 ... 200]

>> [trainset,testset] = gendat(a,0.25)

MFEAT KL Features, 500 by 64 dataset with 10 classes: [50 ... 50]

MFEAT KL Features, 1500 by 64 dataset with 10 classes: [150 ... 150]

50×10 objects are stored in trainset, while the remaining 1500 objects are stored in testset.

Train a linear normal-density based classifier and test it:

>> w = ldc(trainset);

>> testset*w*testc

Compare the result with training and testing by all data:

>> a*ldc(a)*testc

which will probably give a lower error for two reasons. Firstly, it uses more objects for training, so a better classifier is obtained. Secondly, it uses the same objects for testing as for training, which positively biases the test result. Because of this bias, the use of separate sets for training and testing is to be preferred.
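
As a rough illustration (PRTools and the mfeat_kar data assumed available; ten repetitions chosen arbitrarily), the following sketch contrasts the optimistically biased resubstitution error with the hold-out error:

% Sketch: resubstitution error vs hold-out error
a = mfeat_kar;                               % 2000 x 64 dataset, 10 classes
e_app = a*ldc(a)*testc;                      % train and test on all data (biased)
e_hold = zeros(1,10);
for i = 1:10
    [trainset,testset] = gendat(a,0.25);     % random 25%/75% split
    e_hold(i) = testset*ldc(trainset)*testc; % error on independent objects
end
fprintf('apparent error %.3f, hold-out error %.3f +/- %.3f\n', ...
    e_app, mean(e_hold), std(e_hold));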

Example 14. Classifier performance

In this exercise we will investigate the difference in behaviour of the error on the training and the test set. Generate a large test set and study the variations in the classification error based on repeatedly generated training sets:

>> t = gendath([500 500]);

>> a = gendath([20 20]); t*ldc(a)*testc


Repeat the last line e.g. 30 times. What causes variation in the error?

Now do the same for different test sets:

>> a = gendath([20 20]);

>> w = ldc(a);

>> t = gendath([500 500]); t*w*testc

Repeat the last line e.g. 30 times and try to understand what causes the variance observed in the results.
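
A minimal sketch of both repetitions with the suggested 30 runs:

% Sketch: variation due to the training set vs variation due to the test set
e_tr = zeros(1,30); e_ts = zeros(1,30);
t = gendath([500 500]);                       % one fixed, large test set
for i = 1:30
    e_tr(i) = t*ldc(gendath([20 20]))*testc;  % a new small training set each run
end
a = gendath([20 20]); w = ldc(a);             % one fixed classifier
for i = 1:30
    e_ts(i) = gendath([500 500])*w*testc;     % a new test set each run
end
fprintf('std over training sets %.3f, std over test sets %.3f\n', std(e_tr), std(e_ts))

The first spread reflects the instability of a classifier trained on only 20 objects per class; the second only reflects the finite size of the test set.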

Example 15. Use of cell arrays for classifiers and datasets

The Matlab cell arrays can be very useful for finding the best classifiers over a set of datasets.

A cell array is a collector of arbitrary items. For instance a set of untrained classifiers can be stored as follows:

>> classifiers = {nmc,parzenc([],1),knnc([],3)}

and a set of datasets is similarly stored as:

>> data = {iris,gendath(50),gendatd(30,30,10),gendatb(100)}

Training and test sets can be generated for all datasets simultaneously by

>> [trainset,testset] = gendat(data,0.5)

In a similar way classifiers and error estimation can be done:

>> w = map(trainset,classifiers)

>> testc(testset,w)

Note that the construction w = trainset*classifiers doesn’t work for cell arrays. Cross-validation can be applied by:

>> crossval(data,classifiers,5)

The parameter ’5’ indicates a 5-fold cross-validation, i.e. a rotation over training sets that contain 80% (4/5) of the data and test sets that contain 20% (1/5) of the data. The leave-one-out error is computed if this parameter is omitted. For the nearest neighbour rule this can also be done by testk. Take a small dataset a and verify that testk(a) and crossval(a,knnc([],1)) yield the same result. Note how much more efficient the specialised routine testk is.
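
A minimal sketch of that check (the dataset size here is an arbitrary choice):

% Sketch: leave-one-out 1-NN error by the generic and the specialised routine
a = gendath([25 25]);                    % a small dataset
tic; e1 = testk(a); toc                  % dedicated leave-one-out estimator
tic; e2 = crossval(a,knnc([],1)); toc    % generic leave-one-out cross-validation
disp([e1 e2])                            % both estimates should be identical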

Example 16. Learning curves introduction

cleval is an easy to use routine for studying the behaviour of a classifier on a given dataset:

>> a = gendatb([30 30])

>> e = cleval(a,ldc,[2 3 5 10 20],3)

This randomly generates training sets of sizes [2 3 5 10 20] per class out of the dataset a and trains the classifier ldc. The remaining objects are used for testing (so in this example the set a has to contain more than 20 objects per class). This is repeated 3 times and the resulting errors are averaged and returned in the structure e. The result is made ready for plotting the so-called learning curve by:

>> plote(e)

which also automatically annotates the plot.

Exercise 3.1. Learning curve experiment

Plot the learning curves of qdc, udc, fisherc and nmc for gendath by using training set sizes ranging from 3 to 100 per class. Do the same for a 20-dimensional problem generated by gendatd. Study the results and try to understand them.
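
A possible starting point (dataset sizes, training sizes and the number of repetitions are illustrative choices; it is assumed that plote accepts a cell array of learning curves, otherwise plot them one by one):

% Sketch: learning curves of four classifiers on the Highleyman data
a = gendath([200 200]);            % enough objects to test up to 100 per class
sizes = [3 5 10 20 50 100];
e1 = cleval(a,qdc,sizes,5);
e2 = cleval(a,udc,sizes,5);
e3 = cleval(a,fisherc,sizes,5);
e4 = cleval(a,nmc,sizes,5);
plote({e1,e2,e3,e4})               % assumed: plote accepts a cell array of curves
% Repeat with a 20-dimensional problem, e.g. b = gendatd([200 200],20)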

Example 17. Confusion matrices

The confusion matrix shows how the objects of the dataset are classified. It is a matrix with rows representing the true classes and columns the classifier decisions. Typically, for multi-class discriminants the confusion matrix is square. The elements on the diagonal represent the correctly classified objects, while the off-diagonal entries represent the errors. Confusion matrices are especially useful in multi-class problems to inspect how the errors are distributed between the classes. Let’s generate the three-class dataset a=gendatf, and split the data into training and testing sets [tr,ts]=gendat(a,0.5). We can train the classifier on tr, execute it on ts and compare the true ts labels with the classifier decisions.

>> lab=getlab(ts);

>> w=fisherc(tr)

>> dec=ts*w*sddecide(w);

>> sdconfmat(lab,dec)

The sddecide command creates a decision mapping with a default operating point based on the PRTools mapping w.¹ sdconfmat may normalize the errors per class using the option ’norm’: sdconfmat(lab,dec,’norm’).

Is the confusion matrix symmetric? Why?

Sometimes, it is useful to identify data samples suffering from specific types of errors. If we provide sdconfmat with a test dataset and the classifier decisions, it will add a new set of labels, based on the confusion matrix entries, to the dataset property ’confmat’. We can visualize this labeling using sdscatter.

>> ts=sdconfmat(ts,dec)

>> sdscatter(ts)

In the sdscatter figure, change the class labeling by selecting the option ’confmat’ in the Scatter/Use class grouping menu. Also, switch on the legend via the ’Show legend’ menu item or simply by pressing the ’l’ key. The ’confmat’ labels are given in the ’true’-’estimated’ format.

In order to quickly extract only specifically misclassified samples, we can use the Class visibility/Show only class menu. The Scatter menu now offers an option to store this subset in the workspace using the menu item Create dataset with visible samples.

Which classifier would provide better performance than fisherc? Try some and compare the confusion matrices.
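
For instance, reusing tr, ts and lab from above, a non-linear classifier such as qdc could be compared as follows (a sketch, not a prescribed answer):

% Sketch: confusion matrix of an alternative classifier on the same split
w2 = qdc(tr);
dec2 = ts*w2*sddecide(w2);
sdconfmat(lab,dec2)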

¹ sddecide also allows one to perform decisions at different operating points, as we will see in later examples on ROC analysis.


Exercise 3.2. Confusion matrix experiment

Compute the confusion matrix for fisherc applied to the two digit feature sets mfeat_kar and mfeat_zer. One of these feature sets is rotation invariant. Which one?
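
Following the commands of Example 17, a minimal sketch for one of the two feature sets (repeat with mfeat_zer and compare):

% Sketch: confusion matrix of fisherc on the Karhunen-Loève digit features
a = mfeat_kar;                     % or: a = mfeat_zer (Zernike moments)
[tr,ts] = gendat(a,0.5);
w = fisherc(tr);
dec = ts*w*sddecide(w);
sdconfmat(getlab(ts),dec)          % rows: true digits, columns: decisions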

Exercise 3.3. Cross-validation

Compare the error estimates of the 2-fold cross-validation, the 10-fold cross-validation, the leave-one-out error estimate (all obtained by crossval) and the true error (based on a very large test set) for a simple problem, e.g. gendath with 10 objects per class, classified by fisherc. In order to obtain significant results the entire experiment should be repeated a large number of times, e.g. 50. Verify whether this is sufficient by computing the variances of the obtained error estimates.
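
One way to organize this experiment (all sizes and the number of repetitions are illustrative choices):

% Sketch: cross-validation estimates vs the 'true' error of fisherc on gendath
nrep = 50;
e = zeros(nrep,4);                        % columns: 2-fold, 10-fold, leave-one-out, true
big = gendath([5000 5000]);               % very large test set approximating the true error
for i = 1:nrep
    a = gendath([10 10]);
    e(i,1) = crossval(a,fisherc,2);
    e(i,2) = crossval(a,fisherc,10);
    e(i,3) = crossval(a,fisherc);         % leave-one-out
    e(i,4) = big*fisherc(a)*testc;
end
disp(mean(e)); disp(var(e));              % compare means and variances of the estimates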

Example 18. Interactive rejection

Rejection refers to the choice we make not to assign the data sample to any of the learned classes. There are two situations where we may want to reject samples:

• Distance-based rejection. We reject samples that are far from the class distributions, i.e. lie in regions of the feature space where we do not have strong evidence (this is especially useful for outlier removal).

• Rejection close to the decision boundary, where there is high probability of making errors.

18.1 Distance-based rejection. Given the two-class Highleyman dataset, we train the quadratic classifier on the training set tr and apply the classifier to the test set ts.

>> a = sdrelab(gendath) % the dataset a now has string labels

>> [tr,ts] = gendat(a,0.5)

>> w = qdc(tr)

>> out = ts*w

The classifier soft outputs out are the class conditional densities; there is one column for each of the two classes. The higher the value, the higher the classifier’s confidence that the sample belongs to that class. We may choose to reject the samples for which the confidence is lower than a certain threshold. To implement this rejection capability we use the sdroc routine with the ’reject’ option. We can visualize the reject curve and observe what happens in the feature space.

>> r=sdroc(out,’reject’)

>> sdscatter(ts,[sdconvert(w) r],’roc’,r)

The reject curve shows the correctly classified samples of the first class as a function of the rejected samples. By hovering the mouse over the points of the reject curve the corresponding scatterplot illustrates the classifier behaviour in the feature space. We can reject a sample if it belongs to a region far from the class distribution learned from the training set.

Set the rejection fraction to be 10% of the data. You can select a specific point by a left mouse click on the reject curve and store it back to the r object by pressing the ’s’ key. What are the ’true positive’ fractions for the two classes? The measures visualized on the axes of the reject curve may be changed via the cursor keys (up-down for the vertical axis, and left-right for the horizontal one). Alternatively, you can set the operating point in the sdroc object manually (r=setcurop(r,30)). We can now estimate the confusion matrix on the test set:

>> sdconfmat(getlab(ts),ts*[sdconvert(w) r])

Can you explain the difference between the rows and columns of the confusion matrix?

18.2 Rejection close to the decision boundary. To perform this rejection we make a single change to the procedure above: we normalize the model soft outputs to sum to one. In this way we obtain the posterior probabilities of the classes. The higher the value of the posterior, the farther we are from the decision boundary. The same ’reject’ option thus allows us to reject samples that are close to the decision boundary.

>> w2=qdc(tr)*classc % classc assures output normalization

>> out2=ts*w2

>> r2=sdroc(out2,’reject’)

>> sdscatter(ts,[sdconvert(w2) r2],’roc’,r2)

Set the rejection fraction to be 10%. Compute the confusion matrices with and without the reject fraction, and compare the two.
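
Mirroring the calls used earlier in this example (variable names as defined above), the two confusion matrices can be obtained by:

% Sketch: confusion matrices with and without the reject option
sdconfmat(getlab(ts),ts*[sdconvert(w2) r2])   % with the 10% rejection
sdconfmat(getlab(ts),ts*w2*sddecide(w2))      % without rejection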

Example 19. Reject curves

Given a classification result d = a*w, the classification error is found by e = testc(d). The number of columns in d equals the number of classes. testc determines for each object the largest value over the columns of d. For rejecting objects, a threshold is used to determine when this largest value is not sufficiently large. The routine e = reject(d) determines the classification error and the reject rate for a set of such threshold values. The errors and reject frequencies are stored in e. We will illustrate this by a simple example.

19.1 Load a dataset by gendath and train the Fisher classifier:

>> a = gendath([100 100]); w = fisherc(a);

Take a small test set:

>> b = gendath([20 20])

Classify it and compute its classification error:

>> d = b*w; testc(d)

Compute the reject/error trade off:

>> e = reject(d)

Errors are stored in e.error and rejects are stored in e.xvalues. Inspect them by

>> [e.error; e.xvalues]’

The left column shows the error for the reject frequencies presented in the right column. It starts with the classification error found above by testc(d) for no reject (0), i.e. all objects are accepted, and runs to an error of 0 and a reject of 1 at the end. e.xvalues is the reject rate, starting at no reject. Plot the reject curve by:

>> plote(e)

19.2 Repeat this for a test set b of 500 objects per class. How many objects have to be rejected to have an error of less than 0.06?
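
A sketch of this repetition, reusing the classifier w trained in 19.1:

% Sketch: reject curve for a larger test set
b = gendath([500 500]);
e = reject(b*w);
plote(e)
[e.error; e.xvalues]'     % look up the smallest reject rate with an error below 0.06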

Exercise 3.4. Reject experiment

Study the behavior of the reject curves for nmc, qdc and parzenc for the Sonar dataset (a = sonar). Take training sets and test sets of equal sizes ([b,c] = gendat(a,0.5)). Study help reject to see how a set of reject curves can be computed simultaneously. Plot the result by plote. Try to understand the reject curve for qdc.
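
A possible starting point, computing the three reject curves one by one (it is assumed here that plote accepts a cell array of results; help reject describes how to compute them simultaneously):

% Sketch: reject curves of three classifiers on the Sonar data
a = sonar;
[b,c] = gendat(a,0.5);             % equal-sized training and test sets
e1 = reject(c*nmc(b));
e2 = reject(c*qdc(b));
e3 = reject(c*parzenc(b));
plote({e1,e2,e3})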

Example 20. ROC curves

The roc command computes separately the classification errors for each of the classes for various thresholds. The results can again be plotted for a two-class problem by the plote command, e.g.

>> [a,b] = gendat(sonar,0.5)

>> w1 = ldc(a);

>> w2 = nmc(a);

>> w3 = parzenc(a);

>> w4 = svc(a);

>> e = roc(b,{w1 w2 w3 w4});

>> plote(e)

This plot shows how the error shifts from one class to the other class for a changing threshold.

Try to understand what these plots indicate for the selection of a classifier.

Example 21. ROC curves: Making decisions at a specific operating point

In this example, based on PRSD Toolbox, we estimate ROC for a two-class discriminant using output weighting. We will learn how to interactively choose an operating point and construct a mapping delivering the corresponding decisions.

21.1 We define natural names of classes as ’apple’ and ’banana’ and divide the data set into a training set and a test set.

>> a = gendatb(200);

>> a = sdrelab(a,{1 ’apple’; 2 ’banana’})

>> [tr,ts]=gendat(a,0.5);

Now we train the classifier on the training set and estimate its soft outputs on the test set.

Finally, we estimate and visualize the ROC curve.

>> w = tr*nmc

>> out = ts*w

>> r = sdroc(out)

>> sddrawroc(r)

The plot title shows the currently focused operating point and the two respective errors.

Moving the cursor over the points, the information about the operating point index and the errors is updated. You can select the desired operating point by clicking. Press the ’s’ key (save) on the keyboard to store the selected operating point as the default. A dialog will appear asking you to specify the variable name. We can, for example, enter r and store the operating point back in the ROC object. Alternatively, the current operating point may be chosen as follows: r = setcurop(r,1025) (where 1025 is the index of the desired operating point).

21.2 In order to visualize confusion matrices interactively with the ROC plot, we need to store them using the ’confmat’ option.

>> r = sdroc(out,’confmat’); fig=sddrawroc(r)

By pressing the ’c’ key a new figure will appear showing two confusion matrices. The first one corresponds to the highlighted operating point (black marker), the second one corresponds to the operating point following the mouse cursor (gray marker). We can study how the error per class changes between operating points.

21.3 We can also create an interactive scatter plot visualizing the decisions at different operating points. Use the following command to create a scatter plot connected to the ROC plot in figure fig.

>> sdscatter(ts,w*sddecide(r),’roc’,fig)

See how the decision boundary changes when moving the cursor over the operating points.

sdscatter also accepts an ROC object instead of a figure handle.

Investigate the difference in the ROC curves and decision boundaries for different classifiers, e.g. qdc or parzenc.

Example 22. Multi-class ROC

We will illustrate the construction of ROC in a multi-class problem. First, we generate a multi-class dataset and divide it into training and testing sets. Second, we train the qdc classifier and execute it on the testing set, storing its soft outputs.

>> a = gendatm(1000)

>> [tr,ts] = gendat(a,0.5)

>> w = qdc(tr)

>> out = ts*w

Multi-class ROC is estimated using the sdroc command analogously to the two-class case.

By default, sdroc estimates a sub-optimal ROC using a greedy optimizer:

>> r = sdroc(out,’confmat’)

>> sdscatter(ts,w*sddecide(r),’roc’,r)

Use cursor keys to flip through the available per-class error measures in the ROC plot. The weights for a given operating point may be shown in the plot title by pressing the ’w’ key.

Inspect the full confusion matrix at each operating point by pressing the ’c’ key.

Example 23. ROC with variances

The ROC can also be estimated within the cross-validation framework. By specifying the complete set of operating points, we may effectively estimate the average ROC accompanied by variances at each operating point. Let us create a grid of operating points varying the weights from 0 to 1 in steps of 0.1, and compute the average ROC curve.

>> a=gendatb; a=sdrelab(a,{1 ’apple’; 2 ’banana’});

>> W=0:0.1:1; W=[W’ 1-W’];

>> ops=sdops(’w’,W,getlablist(a));

>> r=sdcrossval(nmc,a,’ops’,ops) % cross-validation

>> sddrawroc(r) % plot the average ROC

The ROC plot now also renders the standard deviation for each of the measures. The sdroc object r is now a matrix of size: number of operating points × number of measures × number of folds. By default a 10-fold cross-validation is performed. By hovering the mouse over the points, the variance of the two error measures is visualized in the figure.

Note that the current operating point is set to the uninteresting extreme situation [0 1]. We may change the current operating point in the ROC object (for example to point #7) using the setcurop function:

>> r=setcurop(r,7)

When we press the ’f’ key (folds) in the ROC figure, the plot will show all the per-fold realizations for a given operating point instead of the error bars. This view may give us important insight into the worst-case scenarios.

Repeat the example using a larger dataset, e.g. a=gendatb(1000), and compare the variances of the ROC measures.
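
The repetition amounts to re-running the same commands on more data; the error bars in the ROC plot should shrink accordingly:

% Sketch: cross-validated ROC with variances on a larger dataset
a = gendatb(1000); a = sdrelab(a,{1 'apple'; 2 'banana'});
W = 0:0.1:1; W = [W' 1-W'];
ops = sdops('w',W,getlablist(a));
r = sdcrossval(nmc,a,'ops',ops);   % 10-fold cross-validation by default
sddrawroc(r)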
