
Exercises with PRTools and Meastools

PR Course TN3534, January - March 2003

R.P.W. Duin, P. Juszczak, P. Paclik, E. Pekalska, D. de Ridder, M. Skurichina, D.M.J. Tax, Pattern Recognition Group, Delft University of Technology

http://www.ph.tn.tudelft.nl/Courses/tn3534.html ftp://ftp.ph.tn.tudelft.nl/pub/bob/PR_Course/


Introduction

It is the aim of this set of exercises to assist the reader in getting acquainted with PRTools, a Matlab toolbox for pattern recognition. It is a prerequisite to have a global knowledge of pattern recognition, to have read the introductory part of the PRTools manual and to have access to this manual while studying the exercises. Moreover, the reader needs to have some experience with Matlab and should regularly study the help texts provided with the PRTools commands (e.g. help gendatc).

The exercises should give insight into using the toolbox. They are not meant to explain in detail how the tools are constructed and thereby don't reach the level that enables the student to add new tools to PRTools, using its specific classes dataset and mapping.

It is left to the responsibility of the reader to study the exercises using various datasets. They can either be generated by one of the routines in the toolbox, or loaded from a special dataset directory. On pages 40 and following this is explained further with examples of both artificial and real-world data. First the Matlab commands are given, next scatterplots of some of the sets are shown. Note that not all arguments in the commands are compulsory. It is necessary to refer to these pages regularly in order to find suitable problems for the exercises.

In order to build pattern recognition systems for real-world (raw) datasets, e.g. images as they are grabbed by a camera, preprocessing and the measurement of features are necessary. The growing measurement toolbox Meastools is designed for that. Here it is unavoidable that students write their own low-level routines, as at this moment the collection of feature measuring tools is insufficient. As no Meastools manual is available, students should read the on-line documentation and the additional material that may be supplied during a course.


Exercises I. Introduction

Example 1. Dataset

PRTools entirely deals with sets of objects represented by vectors in a feature space. The central data structure is a so-called dataset. It consists of a matrix of size m x k: m row vectors representing the objects, given by k features each. Attached to this matrix is a set of m labels (strings or numbers), one for each object, and a set of k feature names (also strings or numbers), one for each feature. Moreover, a set of prior probabilities, one for each class, is stored. Objects with the same label belong to the same class. In most help files in PRTools, a dataset is denoted by A. Almost all routines can handle multiclass object sets. Some useful routines to handle datasets are:

dataset   Define dataset from data matrix and labels
getdata   Retrieve data from dataset
getlab    Retrieve object labels
getfeat   Retrieve feature labels
seldat    Select a subset of a dataset
genlab    Generate dataset labels
setdat    Define a new dataset from an old one by replacing its data
renumlab  Convert labels to numbers

Sets of objects may be given externally or may be generated by one of the data generation routines in PRTools (see page 40 and following). Their labels may be given externally or may be the results of a classification or a cluster analysis.

A dataset containing 10 objects with 5 random measurements can be generated by:

>> data = rand(10,5);

>> a = dataset(data)

10 by 5 dataset with 1 classes: [10]

In this example no labels are supplied, therefore one class is detected. Labels can be added to the dataset by:

>> labs = [1 1 1 1 1 2 2 2 2 2]'; % labs should be a column vector

>> a = dataset(a,labs)

10 by 5 dataset with 2 classes: [5 5]

Note that the labels have to be supplied as a column vector. A simple way to assign labels to a dataset is offered by the routine genlab in combination with the Matlab char command:

>> labs = genlab([4 2 4],char('apple','pear','banana'))

>> a = dataset(a,labs)

10 by 5 dataset with 3 classes: [4 4 2]

1.1 Use the routines getlab and getfeat to retrieve the object labels and the feature labels of a. The fields of a dataset can be made visible by converting it to a structure, e.g.:

>> struct(a)

data: [10x5 double]


lablist: [3x6 char]

nlab: [10x1 double]

labtype: 'crisp'

targets: []

featlab: [5x1 double]

featdom: {[] [] [] [] []}

prior: []

objsize: 10
featsize: 5

ident: {10x1 cell}

version: {[1x1 struct] '31-Jan-2003 10:02:55'}

name: []

user: []

In the on-line information on datasets (help datasets, also printed in the PRTools manual) the meaning of these fields is explained. Each field may be changed by a set-command, e.g.:

>> b = setdata(a,rand(10,5));

Field values can be retrieved by a similar get-command, e.g.:

>> classnames = getlablist(a)

In nlab an index is stored for each object into the list of class names lablist. Note that this list is alphabetically ordered.

The size of a dataset can be found by both size and getsize:

>> [m,k] = size(a);

>> [m,k,c] = getsize(a);

The number of objects is returned in m, the number of features in k and the number of classes in c.

The class prior probabilities are stored in prior. They are by default set to the class frequencies if this field is empty. The data itself can also be retrieved by double(a), or more simply by +a.

1.2 Have a look at the help information of seldat. Use the routine to extract the banana class from a and check this by inspecting the result of +a.

Datasets can be manipulated in many ways, comparable to Matlab matrices. So [a1; a2] combines two datasets, provided that they have the same number of features. The feature set may be extended by [a1 a2] if a1 and a2 have the same number of objects.

1.3 Generate 3 new objects of the classes 'apple' and 'pear' and add them to the dataset a.

1.4 Generate a new, 6th feature for the whole dataset a. A sketch for both exercises is given below.
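A minimal sketch for 1.3 and 1.4 (assuming a still carries the apple/pear/banana labels from above; the random data is only for illustration):

>> newobjs = dataset(rand(6,5),genlab([3 3],char('apple','pear')));
>> a = [a; newobjs]                  % 1.3: extend the object set to 16 objects
>> newfeat = dataset(rand(16,1));    % one new feature value per object
>> a = [a newfeat]                   % 1.4: extend the feature set to 6 features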

Another way to inspect a dataset is to make a scatterplot of the objects in the dataset. For this the function scatterd is supplied. This plots each object in a dataset in a 2D graph, using a colored marker when class labels are supplied. When more than two features are present in the dataset, the first two are used. For obtaining a scatterplot of two other features they have to be extracted explicitly first, e.g. a1 = a(:,[2 5]). It is also possible to create 3D scatterplots.

1.5 Use scatterd to make a scatterplot of the features 2 and 5 of dataset a. Try also scatterdui. Use its buttons to select features.

1.6 Make a 3-dimensional scatterplot by scatterd(a,3) and try to rotate it with the mouse after pressing the right toolbar button.

1.7 Use one of the procedures described on page 40 and following to create an artificial dataset of 100 objects. Make a scatterplot. Repeat this a few times.

Exercise 1. Scatterplot

Load the 4-dimensional Iris dataset by 'load iris' (stored in a) or by a = iris and make scatterplots of all feature combinations using the gridded option of scatterd. Try also all feature combinations using scatterdui.

Plot in a separate figure the one-dimensional feature densities by plotf. Identify visually the best combination of two features. Create a new dataset b that contains just these two features.

Create a new figure by the figure command and plot in it a scatterplot of b.

Exercise 2. Mahalanobis distance

Use the distmaha command to compute the Mahalanobis distances between all pairs of classes in a. Repeat this for the best two features just selected. Can you find a way to test whether this is really the best feature pair according to the Mahalanobis distance? A sketch is given below.
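A possible check over all feature pairs, as a sketch (it assumes distmaha(a) returns a matrix of pairwise class distances, which is summed here into a single criterion value per pair):

>> pairs = nchoosek(1:4,2);            % all feature pairs of the 4D iris set
>> for i = 1:size(pairs,1)
     d = +distmaha(a(:,pairs(i,:)));   % class distances for this feature pair
     disp([pairs(i,:) sum(d(:))])      % pair and its total class separation
   end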

Exercise 3. Generate your own dataset

Generate a dataset that consists of two 2D uniformly distributed classes of objects using the rand command. Transform the sets such that for the [xmin xmax; ymin ymax] intervals the following holds: [0 2; -1 1] for class 1 and [1 3; 1.5 3.5] for class 2. Generate 50 objects for each class. An easy way is to do this for the x and y coordinates separately and combine them afterwards. Label the features by 'area' and 'perimeter'. A sketch follows below.

Check the result by scatterd and by retrieving the object labels and feature labels.
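A minimal sketch (it assumes setfeatlab is available for setting the featlab field; otherwise the feature labels may be passed to the dataset command directly):

>> x1 = [2*rand(50,1), 2*rand(50,1)-1];       % class 1 in [0 2] x [-1 1]
>> x2 = [2*rand(50,1)+1, 2*rand(50,1)+1.5];   % class 2 in [1 3] x [1.5 3.5]
>> a = dataset([x1; x2], genlab([50 50]));
>> a = setfeatlab(a, char('area','perimeter'));
>> scatterd(a)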

Exercise 4. Enlarge an existing dataset

Generate a dataset using gendatb containing 10 objects per class. Enlarge this dataset to 100 objects per class by generating more data using the gendatk and gendatp commands.

Compare the scatterplots with a scatterplot of 100 objects per class directly generated by gendatb.

Example 2. Density estimation

The following routines are available for density estimation:

normalm  Normal distribution
parzenm  Parzen density estimation
knnm     K-nearest neighbor density estimation

They are programmed as a mapping. Details of mappings are discussed later. The following two steps are always essential for a mapping: first the estimator is built, or trained, using a training set, e.g. by:

>> a = gauss(100)

Gaussian Data, 100 by 1 dataset with 1 classes: [100]

This is a 1-dimensional normally distributed dataset of 100 points with mean 0.

>> w = normalm(a)

Normal Density Estimation, 1 to 1 trained mapping --> normal_map

The trained mapping w now contains all information needed for computing densities of given points, e.g.:

>> b = [-2:0.1:2]';

Now we measure, on the points defined by b, what the density is according to w (which is a density estimator based on the dataset a):

>> d = map(b,w)

41 by 1 dataset with 1 classes: [41]

The result may be listed on the screen by [+b +d] (coordinates and densities) or plotted by:

>> plot(+b,+d)

2.1 Plot the densities estimated by parzenm and knnm in separate figures. These routines need sensible parameters. Try a few values for the smoothing parameter and the number of nearest neighbors. A sketch is given below.
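A sketch for 2.1 (the parameter values are only examples; check the exact calling conventions with help parzenm and help knnm):

>> w = parzenm(a,0.2);        % smoothing parameter h = 0.2, try other values
>> figure; plot(+b, +map(b,w))
>> w = knnm(a,5);             % k = 5 nearest neighbors, try other values
>> figure; plot(+b, +map(b,w))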

Exercise 5. Density plots

Generate a 2-dimensional 2-class dataset by gendatb of 50 points per class. Estimate the densities by each of the above three methods. Make in three figures a 2D scatterplot by scatterd. Different from the above 1-dimensional example, a ready-made density plotting routine, plotm, can be used for drawing iso-density lines in the scatterplot. Plot each of the three density estimators in the three scatterplots by plotm(w). Try also 3D plots by plotm(w,3). Note that plotm always needs a scatterplot first to find the domain where the density has to be computed.

Exercise 6. Nearest Neighbor Classification

Write your own function for nearest neighbor error estimation: e = nne(d), in which d is a labeled distance matrix (e.g. obtained by d = distm(b,a) if a and b are labeled datasets). Such a matrix is again a dataset. The objects of d are the objects of the dataset b, whose labels may be retrieved by object_lab = getlab(d). The features are the distances of b to all objects in a. Their labels are found by feat_lab = getfeat(d). The number of differences between two label sets can be counted by n = nlabcmp(object_lab,feat_lab). For the nearest neighbor rule the label of each object in the test set (here b) has to be compared with the label of its nearest neighbor in the training set (here a). The nne routine thereby has the following steps:

1. Construct a vector L with as many elements as d has rows. If j is the index of the nearest neighbor (smallest d(i,j)) of row object i, then L(i) = j. This can be found by [dd,j] = min(d(i,:));
2. Use nlabcmp to count the differences between the true labels of the objects corresponding to the rows, given by object_lab, and the labels of the nearest neighbors, feat_lab(L,:).
3. Normalize and return the error.


If the training set a and the test set b are identical (e.g. d = distm(a,a)), nne should return 0, as each object is its own nearest neighbor. Modify your routine in such a way that it returns the 'leave-one-out' error if it is called by e = nne(d,'loo'). The leave-one-out error is the error made in a set of objects if for each object under consideration the object itself is excluded from the set at the moment it is evaluated. In this case not the smallest d(i,j) on row i has to be found (which should be on the diagonal), but the next one. A sketch of such a routine is given below.
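A minimal sketch following the steps above (it assumes d is square when the 'loo' option is used):

function e = nne(d,par)
% NNE Nearest neighbor error estimate from a labeled distance matrix d
object_lab = getlab(d);               % true labels of the row objects
feat_lab   = getfeat(d);              % labels of the column (training) objects
dd = +d;                              % plain distance matrix
if nargin > 1 & strcmp(par,'loo')
   dd(1:size(dd,1)+1:end) = inf;      % exclude self-distances on the diagonal
end
[dmin,L] = min(dd,[],2);              % index of the nearest neighbor per row
e = nlabcmp(object_lab,feat_lab(L,:)) / size(dd,1);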

Inspect some 2D datasets by scatterd and estimate the nearest neighbor error by nne.

Exercise 7. Confusion matrices

A confusion matrix C has in element C(i,j) the confusion between the classes i and j. Confusion matrices are especially useful in multi-class problems for analyzing the similarities between classes. Using the error routine nne written above, such a matrix may be found by estimating the error between all class pairs. The following steps are possible in a routine conf(A), in which A is a multi-class dataset (a sketch follows the steps below):

1. find the number of classes by getsize
2. construct temporary datasets by seldat for each pair of classes (i,j)
3. use nne to find its error and store it in C(i,j)
4. print C
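A sketch of such a routine (it assumes seldat(a,[i j]) selects the classes i and j, and it uses the nne routine of exercise 6):

function C = conf(a)
[m,k,c] = getsize(a);                 % c classes
C = zeros(c,c);
for i = 1:c-1
   for j = i+1:c
      b = seldat(a,[i j]);            % temporary two-class dataset
      C(i,j) = nne(distm(b,b),'loo'); % pairwise nearest neighbor error
      C(j,i) = C(i,j);
   end
end
disp(C)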

Test your routine on a simple multi-class dataset, e.g. generated by gendatm, or the iris dataset.

Find the six confusion matrices for the six mfeat datasets (mfeat_fac, mfeat_fou, mfeat_kar, mfeat_mor, mfeat_pix, mfeat_zer), which are based on the same digits but different features, and compare them. Are some classes better separable by one feature set than by another?

Example 3. Raw datasets

In the Matlab directory measdata some 'raw' datasets are stored (read help measdata).

No features are present yet. They are stored as images or polygons (contours). The toolbox meastools is constructed to handle such data and to facilitate the measurement of features.

Study the concept of Meastools from the documentation or by reading help meastools, help measurement and help prfilter.

Load a part of a dataset, e.g. by a = load_faces([1,2],[1:4]), which reads the first four pictures of the subjects 1 and 2. They may be displayed by show(a).

A prfilter for converting images to histograms is meas_hist. Have a look at its source.

Determine the histograms of the images by h = meas_hist(a); inspect them by show(h). These histograms may now be converted to a PRTools dataset by x = dataset(h,genlab([4 4])). Determine a distance matrix between the histograms and find the leave-one-out nearest neighbor error as in exercise 6.

Example 4. Digit classification

Load a part of the NIST dataset, e.g. a = load_nist([1,2],[1:100]), and display it.


Determine some features using meas_stat and im_moments and compute, as in exercises 6 and 7, the leave-one-out nearest neighbor error.

Exercise 8. Write a prfilter

Write your own prfilter prof = im_profile(a) to measure features like horizontal and vertical profiles of a digit. A profile is the (normalised) sum of the pixels in the horizontal or vertical direction. Construct profiles of 3 bins: left, central, right for vertical profiles and top, central, bottom for horizontal profiles. In this way 6 features are obtained and stored in prof. Compare them with the above ones by comparing their leave-one-out nearest neighbor errors. A sketch of the core computation is given below.
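The core computation for a single image stored in a matrix c might look as follows (a sketch only; wrapping it into a proper prfilter can be done along the lines of the meas_hist source):

v = sum(c,1); v = v/sum(v);           % vertical profile (column sums)
h = sum(c,2)'; h = h/sum(h);          % horizontal profile (row sums)
ev = round(linspace(0,length(v),4));  % bin edges: left, central, right
eh = round(linspace(0,length(h),4));  % bin edges: top, central, bottom
prof = zeros(1,6);
for i = 1:3
   prof(i)   = sum(v(ev(i)+1:ev(i+1)));
   prof(3+i) = sum(h(eh(i)+1:eh(i+1)));
end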


II. Blob Recognition

Example 5. The NIST dataset

Load a part of the NIST dataset by

>> a = load_nist([1 6],1:25)

50 by 1 measurement set using 571440 bytes

Look at its content to get an idea of the blobs in the data set by

>> show(a)

This measurement set is labeled. The labels can be retrieved by

>> getlab(a)

The contents of the measurement set are listed by

>> disp(a)

In this case it lists the sizes of all structures in which the raw images are stored. Compare them with the images. Note that some images are large due to isolated noise pixels.

A single character can be inspected by a{17} and can be shown as an image by

>> imagesc(a{17})

It may be necessary to load the correct colormap by

>> colormap gray

By meas_stat some properties can be measured, see help meas_stat. Take two, e.g. 'sum' and 'size':

>> x = meas_stat(a,{'sum','size'})
50 by 1 measurement set using 1200 bytes

Have a look at the result by +x. Convert the measurement set into a dataset that can be used by PRTools by y = dataset(x). Make a scatterplot of y. Compute and plot a classifier by

>> w = fisherc(y)

>> plotc(w)

Compute the classification error of y by testc. Load a second dataset:

>> b= load_nist([1 6],26:50);

Have a look at it by show, compute the same features as above and test it on the classifier w.

Exercise 9. Meastool commands

Use help meastools to see the possibilities of Meastools. Try to normalise the measurement set a by resizing (im_resize) all characters to a fixed size, e.g. 32 x 32. Inspect them by show. Try also a few other commands like im_center and im_rotate and look at the results.

Example 6. The Kimia dataset

Load the Kimia training set:

>> train = load_kimia([5 7 11 18],1:10);

train will now be a 40x1 measurement set. Look at its content using show(train) to get an idea of the blobs in the data set.


Exercise 10. Grab images

You have been given 8 pages containing test objects. Using the camera, create a set of images of these objects. To open a preview window for the camera, type:

>> im_grab

Consider placement, zoom, resolution etc. Look in the "Configure" menu to change the camera's parameters. Once you're happy with your settings, you can create a test set. To create an empty measurement set, use:

>> test = measurement([]);

Now repeat for all prints of test objects:

>> test = [test; im_grab];

Make sure the labels of your test set are correct:

>> getlab(test)

and compare the labels to those in the training set:

>> getlab(train)

If the labels are not correct, set them like this (depending on the order of your images):

>> test.lab = num2cell([5 5 7 7 11 11 18 18]');

You may save your set by:

>> meas_save('kimia_test', test);

Example 7. Blob recognition

Open the file blob.m in the Matlab editor:

>> edit blob.m

This file contains all the steps necessary to classify the test object images you captured in exercise 10, based on the Kimia train set. However, not all intermediate steps have been programmed. The goal of the following exercises is to do the necessary pre-processing on your test images and extract some features. Have a look at blob.m.

Exercise 11. Shading removal

In blob.m, find the part labelled "NORMALISE ILLUMINATION". Using the function im_minf, remove the shading in your test images.

Exercise 12. Advanced thresholding

Find the heading "THRESHOLD IMAGES". Replace the simple thresholding now used by a smarter one (see im_threshold and threshold for options).

Exercise 13. Normalisation of blobs

Under the heading "NORMALISE POSITION, SIZE, ANGLE OF BLOBS", apply the functions im_center, im_scale, im_rotate and im_box to normalise the blobs.

Exercise 14. Normalisation of image size

Normalise the image sizes under the heading "NORMALISE SIZE OF IMAGES", using im_resize.


Exercise 15. Measure features

For now, a simple set of features containing only the object sizes is extracted. Have a look at im_measure and calculate more features. How does this influence the classification error?

Exercise 16. Moments

Calculate some more features using im_moments, e.g. Hu's invariant moments. Add them to the current set of features like this:

>> train_features = [train_features; im_moments(train,'hu')];

Don't forget to do this for both train and test set!

Exercise 17. Feature scaling

Below, under the heading "SCALE FEATURES", all features are centered to have mean zero and standard deviation 1. If you remove these lines (e.g. by putting a %-sign in front of them), what happens to the features? And the classification error? Inspect the mean and std (standard deviation) of the unscaled features.

Exercise 18. Save feature set

Once you are happy with your feature set, save it. Later, you may use them in more experiments.

>> save kimia_sets.mat train_set test_set

Exercise 19. More features

If you have time left, try to find some more features (and save them as you did in exercise 18).


III. Polygons (optional)

For the following exercises it is necessary to use polygons, a special toolbox for processing (sets of) polygons or contours. Study the documentation of this toolbox. An easy way to find a suitable starting image is by loading the nist16 dataset:

>> load nist16

>> a

2000 by 256 dataset with 10 classes

>> b = data2im(a);

>> size(b)
ans =
    16 16 2000

>> c = b(:,:,650);

The image c contains a 16 x 16 digit '3' with grey values between 0 and 255.

Exercise 20. Polygons

Get acquainted with the polygon toolbox by the following exercises:

a. Convert the image to a polygon (im2poly). Compare the displays of the image (imagesc) with the two ways to plot polygons: plotpoly and plotblob.

b. Store a series of digits (e.g. for all digits 0-9 one example) into a single image by concatenating images. Repeat a.

c. Study the properties of interpolating polygons (polyint) and simplifying polygons (polyopt) by plotting their results on the screen. Take one of the polygons of the fish dataset (get_fish) and display the interpolated polygon using n = 5:5:500 points.

d. Take the image of a single digit and plot its polygon on top of it. Study the relation between the grey values and the polygon. The polygon is found by thresholding the image. The threshold can be set in the routine im2poly. Now convert the image to a binary image by thresholding it by hand (e.g. c = (c > 100);). Display this binary image and also plot its polygon on top of it. Finally, convert the polygon back to an image (poly2im) and compare the result with the original. Note the border.

e. The length and area of a polygon can be computed by polylength and polyarea. Generate simple regular polygons by polygen. Display them by plotpoly. Compute and verify their length and area. Do this also for an almost circular polygon obtained by p = polygen(100).

Exercise 21. Polygon classification

Use the same datasets a and b as in example 5. Convert them to polygons by im2poly and inspect the results. The command d = polydist(b,a,'hausdorff',n) may be used to compute a distance matrix between a and b. Note how n influences computing time and accuracy. The nearest neighbor error of classifying the objects b using a as a training set can be directly found by (1-d)*testc. Compare this error with the one found in exercise 5 for various distance measures and values of n.


Exercise 22. The scaling problem

Compute for all polygons in the datasets a and b their length and area. Have a look at the scatterplot. Compute the distance matrix d between b and a and find the nearest neighbor error (compare exercise 6). What happens if we replace area by its square root?

Exercise 23. The edit-distance

Polygons can be converted to Freeman code strings by f = poly2fc(p). Note that the result depends on how the polygon is situated on the grid. Thereby a different result is obtained if all elements of the polygon (coordinates) are multiplied by some constant, e.g. f = poly2fc(p*2).

Compute the Freeman code strings for some simple polygons and verify the results. Start by easy ones, e.g. rectangles like p = [ 1 1; 1 4; 4 4; 4 1] and modify them slightly.

The routine d = edist(S,T) computes the edit-distance matrix between the code strings S and T. Try it by using two simple Freeman code strings as just computed. The final edit-distance is stored in the element d(end,end). Can you see the path in d between d(1,1) and d(end,end)?

Determine the edit-distance between two larger polygons, e.g. two of the fishes. Note that this is very time consuming. One possibility to simplify the procedure is by reducing the scale of the polygon. Now write the following routine, D = polydiste(P1,P2,n), that determines the distance matrix of edit-distances between two sets of polygons P1 and P2, each given in a cell array or a measurement set, and in which each polygon is scaled to a maximum size n before the Freeman code is determined. The routine has the following steps (a sketch follows below):

1. extract two polygons p1 and p2 from the sets P1 and P2.
2. scale them by using polynorm and multiplying the outcome by n.
3. determine the Freeman code strings f1 and f2.
4. run d = edist(f1,f2) and store the resulting d(end,end) in D.

5. If labels are given, convert D into a dataset by D = dataset(D,lab1,lab2).

Use this routine and your leave-one-out error estimation routine nne to find the nearest neighbor error for classifying the Kimia classes 5 and 9.
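A sketch of this routine for cell arrays of polygons (the label handling of step 5 is omitted):

function D = polydiste(P1,P2,n)
D = zeros(length(P1),length(P2));
for i = 1:length(P1)
   f1 = poly2fc(polynorm(P1{i})*n);   % scale to maximum size n, then code
   for j = 1:length(P2)
      f2 = poly2fc(polynorm(P2{j})*n);
      d = edist(f1,f2);
      D(i,j) = d(end,end);            % final edit-distance
   end
end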


IV. Classifiers

Example 8. Mappings and Classifiers

In PRTools datasets are transformed by mappings. These are procedures that map a set of objects from one space into another. Examples are feature selection, feature rescaling, rotations of the space and classification, e.g.:

>> w = cmapm(10,[2 4 7])

FeatureSelection, 10 to 3 fixed mapping --> cmapm

w is herewith defined as a mapping from a 10-dimensional space to a 3-dimensional space by selecting the features 2, 4 and 7. Its name is 'FeatureSelection' and its executing routine, when it is applied to data, is 'cmapm'. It may be applied as follows:

>> a = gauss(100,zeros(1,10))

Gaussian Data, 100 by 10 dataset with 1 classes: [100]

>> b = map(a,w)

Gaussian Data, 100 by 3 dataset with 1 classes: [100]

In a mapping (we use almost everywhere the variable w for mappings) various information is stored, like the dimensionalities of the input and output space, the parameters that define the transformation and the routine that is used for executing the transformation. Give struct(w) to see all fields.

Often a mapping has to be trained, i.e. it has to be adapted to a training set by some estimation or training procedure to minimize some error for the training set. An example is principal component analysis, which performs an orthogonal rotation according to the directions of largest variance in a given dataset:

>> w = pca(a,2)

Principal Component Analysis, 10 to 2 trained mapping --> affine

This just defines the mapping ('trains' it by a) for finding the first 2 principal components. The fields of a mapping can be shown by struct(w). In the PRTools manual, or by 'help mappings', more information on mappings can be found. The mapping w may be applied to a or to any other 10-dimensional dataset by:

>> b = map(a,w)

Gaussian Data, 100 by 2 dataset with 1 classes: [100]

Instead of the routine map, the '*' operator may also be used for applying mappings to datasets:

>> b = a*w

Gaussian Data, 100 by 2 dataset with 1 classes: [100]

Note that the sizes of the variables a (100 x 10) and w (10 x 2) are such that the inner dimensionalities cancel in the computation of b, like in all Matlab matrix operations.

The '*' operator may also be used for training: a*pca is equivalent with pca(a) and a*pca([],2) is equivalent with pca(a,2). As a result, an 'untrained' mapping can be stored in a variable: w = pca([],2). It may, thereby, also be passed as an argument in a function call. The advantages of this possibility will be shown later.

A special case of a mapping is a classifier. It maps a dataset on distances to a discriminant function or on class posterior probability estimates. Classifiers can be used in an untrained as well as in a trained mode. When applied to a dataset, in the first mode the dataset is used for training and a classifier is generated, while in the second mode the dataset is classified. Unlike mappings, fixed classifiers don't exist. Some important classifiers are:

fisherc   Fisher classifier
qdc       Quadratic classifier assuming normal densities
udc       Quadratic classifier assuming normal uncorrelated densities
ldc       Linear classifier assuming normal densities with equal covariance matrices
nmc       Nearest mean classifier
parzenc   Parzen density based classifier
knnc      k-nearest neighbor classifier
treec     Decision tree
svc       Support vector classifier
lmnc      Neural network classifier trained by the Levenberg-Marquardt rule

8.1 Generate a dataset a by gendath and compute the Fisher classifier by w = fisherc(a). Make a scatterplot of a and plot the classifier by plotc(w). Classify the training set by d = map(a,w) or d = a*w. Show the result on the screen by +d.

8.2 What is displayed is the value of the sigmoid function of the distances to the classifier. This function maps the distances to the classifier from the (-inf,+inf) interval onto the (0,1) interval. The latter can be interpreted as posterior probabilities. The original distances can be retrieved by +invsigm(d). This may be visualised by plot(+invsigm(d(:,1)),+d(:,1),'*'), which shows the shape of the sigmoid function (distances along the horizontal axis, sigmoid values along the vertical axis).

8.3 During training, distance based classifiers are appropriately scaled such that the posterior probabilities are optimal for the training set in the maximum likelihood sense. In multiclass problems a normalization is needed to take care that the posterior probabilities sum to one. This is enabled by classc. So classc(map(a,w)), or a*w*classc, maps the dataset a on the trained classifier w and normalizes the resulting posterior probabilities. If we include training as well, this can be written as a one-liner: p = a*(a*fisherc)*classc. This may be visualised by computing classifier distances, sigmoids and normalised posterior probability estimates for a multi-class problem as follows. Load the x80 dataset by a = x80. Compute the Fisher classifier by w = a*fisherc, classify the training set by d = a*w, and compute p = d*classc. Display the various output values by +[d p]. Note that the object confidences over the first 3 columns don't sum to one and that they are normalised in the last 3 columns to proper posterior probability estimates.

8.4 Density based classifiers like qdc find after training (w = qdc(a), or w = a*qdc) density estimators for all classes in the training set. Estimates for objects in some dataset b can be found by d = b*w. Again, posterior probability estimates are found after normalisation by classc: p = d*classc. Have a look at +[d p] to see the estimates for the class densities and the related posterior probabilities.

Example 9. Classifiers and discriminant plots.

This example illustrates how to plot decision boundaries in 2D scatterplots by plotc.

9.1 Generate a dataset, make a scatterplot, train and plot some classifiers by

>> a = gendath([20 20]);

>> scatterd(a)

>> w1 = ldc(a);

>> w2 = nmc(a);

>> w3 = qdc(a);

>> plotc({w1,w2,w3})

Plot in a new scatterplot of a a series of classifiers computed by the k-NN rule (knnc) for various values of k between 1 and 10. Look at the influence of the neighborhood size on the classification boundary.

9.2 A special option of plotc colors the regions assigned to different classes:

>> a = gendatm

>> w = a*qdc

>> scatterd(a) % defines the plotting domain of interest

>> plotc(w,'col') % colors the class regions

>> hold on % necessary to preserve the plot

>> scatterd(a) % plots the data again in the plot

Plots like these are influenced by the gridsize used for computing the classifier outputs in the scatterplot. By default it is 30 x 30 (gridsize = 30). The gridsize value can be retrieved and set by gridsize. Study its influence by setting the gridsize to 100 (or even larger) and repeating the above commands. Use a new figure each time, so results can be compared. Note the influence on the computation time.

Exercise 24. Normal densities based classifiers.

Take the features 2 and 3 of the Iris dataset. Make a scatterplot and plot in it the normal densities, see also example 2 and/or exercise 5. Compute the quadratic classifier based on normal densities (qdc) and plot it on top of this. Repeat this for the uncorrelated (udc) and the linear (ldc) classifiers based on normal distributions, but plot them on top of the corresponding density estimation plots.

Exercise 25. Linear classifiers

Use the same dataset for comparing some linear classifiers: the linear normal distribution based classifier (ldc), nearest mean (nmc), Fisher (fisherc) and the support vector classifier (svc). Plot them on top of each other, in different colors, in the same scatterplot. Don't plot density estimates now.

Exercise 26. Non-linear classifiers

Generate a dataset by gendath and compare in the scatterplots the quadratic normal densities based classifier (qdc) with the Parzen classifier (parzenc) and the 1-nearest neighbor rule (knnc([],1)). Try also a decision tree (treec).


Example 10. Classifier evaluation

In PRTools a dataset a can be split into a training set b and a test set c by the gendat command, e.g. [b,c] = gendat(a,0.5). In this case, for each class 50% of the objects are randomly chosen for dataset b and the remaining objects are stored in dataset c. After computing a classifier on the training set, e.g. w = b*fisherc, the test set c can be classified by d = c*w. For each object, the label of the class with the highest confidence, or posterior probability, can be found by d*classd. E.g.:

>> a = gendath;

>> [b,c] = gendat(a,0.9)

Higleyman Dataset, 90 by 2 dataset with 2 classes: [45 45]

Higleyman Dataset, 10 by 2 dataset with 2 classes: [5 5]

>> w = fisherc(b); % the class names (labels) of b are stored in w

>> getlabels(w) % this shows them (classes are named 1 and 2)

>> d = c*w; % classify test set

>> lab = d*classd; % get the labels of the test objects

>> disp([+d lab]) % show the posterior probabilities and labels

Note that in the last displayed column (lab) the labels of the classes with the highest classifier outputs are stored. The average error in a test set can be directly computed by testc:

>> d*testc

which may also be written as testc(d) or testc(c,w).

Example 11. Training and test sets

The performance of a classifier w can be tested on an independent test set, say b. If such a set is available, the routine testc may be used to count the number of errors. It can be called as testc(b,w), but also as b*w*testc. Note that the routine classc (discussed in example 8.3) just converts classifier outcomes to posterior probabilities, but does not change the class assignments. So b*w*classc*testc produces the same result.

11.1 Generate a training set a of 20 objects per class by gendath and a test set b of 1000 objects per class. Compute the performance of the Fisher classifier by b*(a*fisherc)*testc. Repeat this for some other classifiers.

Exercise 27. Error limits of K-NN rule and Parzen classifier

Take a simple dataset like the Highleyman classes (gendath) and generate a small training set (e.g. 25 objects per class) and a large test set (e.g. 200 objects per class). Recall what the theory predicts for the limits of the classification error of the k-NN rule and the Parzen classifier as a function of the number of neighbors k and the smoothing parameter h. Estimate and plot the corresponding error curves (a sketch is given below) and verify the theory. How can you estimate the Bayes error of this problem if you know that the classes are normally distributed? Try to explain the differences between the theory and your results.
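A sketch for the k-NN part (the Parzen curve can be computed analogously with parzenc(a,h) for a range of h values):

>> a = gendath([25 25]); t = gendath([200 200]);
>> krange = 1:2:25; e = zeros(size(krange));
>> for i = 1:length(krange)
     e(i) = t*knnc(a,krange(i))*testc;   % test error for each k
   end
>> figure; plot(krange,e)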

Exercise 28. Simple classification experiment

Now perform the following experiment (a sketch follows the list below).

- Load the IMOX data by a = imox. This is a feature based character recognition dataset.

- What are the class labels?


- Split the dataset in two parts, 80% for training and 20% for testing.

- Store the true labels of the test set using getlabels into lab_true.
- Compute the Fisher classifier.

- Classify the test set

- Store the labels found by the classifier for the test set into lab_est.
- Display the true and estimated labels by disp([lab_true lab_est]).
- Predict the classification error of the test set by observing the output.

- Verify this number using testc.
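A sketch of the whole experiment:

>> a = imox;
>> [b,c] = gendat(a,0.8);            % 80% for training, 20% for testing
>> lab_true = getlabels(c);
>> w = fisherc(b);
>> d = c*w;
>> lab_est = d*classd;
>> disp([lab_true lab_est])
>> d*testc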

Exercise 29. Classification of large datasets

Try to find out what the best classifier is for the six mfeat datasets (mfeat_fac, mfeat_fou, mfeat_kar, mfeat_mor, mfeat_pix, mfeat_zer). These are different feature sets for the same objects. Take a fixed training set of 30 objects per class and use the others for testing. Make sure that all six training sets refer to the same objects. This can be done by resetting the random seed by rand('seed',1) or by using the indices returned by gendat.

Try the following classifiers:

nmc, ldc([],1e-2,1e-2), qdc([],1e-2,1e-2), fisherc, knnc, parzenc.

Write a macro script that produces a 6 x 6 table of errors (a sketch is given below). Which classifiers perform globally well? Which dataset(s) are presumably normally distributed? Which are not?
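A sketch of such a script (it assumes each dataset can be loaded by calling its name, as elsewhere in these exercises; 30 of the 200 objects per class corresponds to a fraction of 0.15):

>> names = {'mfeat_fac','mfeat_fou','mfeat_kar','mfeat_mor','mfeat_pix','mfeat_zer'};
>> classifiers = {nmc, ldc([],1e-2,1e-2), qdc([],1e-2,1e-2), fisherc, knnc, parzenc};
>> E = zeros(6,6);
>> for i = 1:6
     a = eval(names{i});
     rand('seed',1);                 % same training objects for every feature set
     [trn,tst] = gendat(a,0.15);     % 30 objects per class for training
     for j = 1:6
        E(i,j) = tst*map(trn,classifiers{j})*testc;
     end
   end
>> disp(E)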

Exercise 30. Classification of the raw datasets

Compute features for the Kimia measurement set: area, perimeter, center of gravity. Build a training set of 10 objects per class and a test set of 2 objects per class. Train a linear, the quadratic and the k-NN classifier. Classify the test set. Try to display the erroneously classified objects together with an example of the class to which they are assigned.


V. Feature Spaces and Feature Reduction

Exercise 31. Feature scaling (optional)

For some classifiers the result depends on the scaling of the individual features. This may be studied by an experiment in which the data is badly scaled. Generate a training set of 400 points for two normally distributed classes with a common covariance matrix and means [0 -0.04] for one class and [0.08 -1.7] for the other:

>> a = gauss(400,[0 -0.04; 0.08 -1.7],[0.004 0.17; 0.17 10])

Study the scatterplot (scatterd(a)) and note the difference when it is scaled properly (axis equal).

In relation to badly scaled data, three types of classifiers can be distinguished:

a. classifiers that are scaling independent

b. classifiers that are scaling dependent but that can compensate badly scaled data by large training sets.

c. classifiers that are scaling dependent but that cannot compensate badly scaled data by large training sets.

Which of the following classifiers belongs to which group: nearest mean (nmc), 1-nearest neighbor (knnc([],1)), Parzen (parzenc), Fisher (fisherc) and the Bayes classifier assuming normal distributions (qdc)?

Verify your answer by the following experiment:

Generate an independent test set and compute the learning curves (i.e. an error curve as a function of the size of the training set) for each of the classifiers. Use training sizes of 5, 10, 20, 50, 100 and 200 objects per class. Plot the error curves.

Use scalem for scaling the features on their variance. For a fair result, this should be computed on the training set a and applied to a as well as to the test set b:

>> w = scalem(a,'variance'); a = a*w; b = b*w;

Compute and plot the learning curves for the scaled data as well. Which classifier(s) are independent of scaling? Which classifier(s) can compensate bad scaling by a large training set?

Exercise 32. Feature Evaluation

The routine feateval can be used to evaluate feature sets according to a criterion. For a given dataset, it returns either a distance between the classes in the dataset or a classification accuracy. In both cases large values mean good separation.

Load the dataset biomed. How many features does this dataset have? How many possible subsets of two features can be made from this dataset? Make a script which loops through all possible subsets of two features and creates for each combination a new dataset b (a sketch is given below). Use feateval to evaluate b using the Euclidean distance, the Mahalanobis distance and the leave-one-out error for the one-nearest neighbour rule.
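A sketch of the loop over all feature pairs, shown for the 'maha-s' criterion used in the next exercise (see help feateval for the names of the Euclidean and nearest neighbour criteria):

>> a = biomed;
>> pairs = nchoosek(1:size(a,2),2);   % all possible feature pairs
>> J = zeros(size(pairs,1),1);
>> for i = 1:size(pairs,1)
     J(i) = feateval(a(:,pairs(i,:)),'maha-s');
   end
>> [Jmax,ibest] = max(J);
>> pairs(ibest,:)                     % best pair according to this criterion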

Find for each of the three criteria the two features that are selected by individual ranking (use featseli), by forward selection (use featself) and by the above procedure that finds the best combination of two features. Compute for each set of two features the leave-one-out error for the one-nearest neighbour rule.


Exercise 33. Feature Selection

Load the dataset glass. Rank the features by the sum of the Mahalanobis distances using individual selection (featseli), forward selection (featself) and backward selection (featselb). The selected features can be retrieved from the mapping by:

>> w = featseli(a,'maha-s'); w.data{2}

Compute for each feature ranking an error curve for the Fisher classifier by clevalf:

>> rand('seed',1); e = clevalf(fisherc,a*w,[],[],5)

The random seed is reset to make the results for different feature sequences w comparable.

The command a*w reorders the features in a according to w. In clevalf the classifier is trained on a bootstrapped version of the given dataset. The remaining objects are used for testing. This is repeated 5 times. All results are stored in a structure e that can be visualised by plotr(e). Plot the results for the three feature sequences obtained by the three selection methods in a single figure by plotr. Compare this error plot with a plot of the 'maha-s' criterion value as a function of the feature size (use feateval).

Use the forward, the backward and the branch&bound strategy to determine the optimal set of three features according to the sum of the Mahalanobis distances. Compute for each of the three sets the error on the training set for the linear normal densities based classifier (ldc).

Example 12. Mapping

There are several ways to perform feature extraction. Some common approaches are:

1. PCA on the complete dataset. This is unsupervised, so it does not use class information. It only tries to describe the variance in the data. In PRTools, this mapping can be trained by using pca on a (labeled or unlabeled) dataset: e.g. w = pca(a,2) finds a mapping to 2 dimensions; scatterd(a*w) plots this data.

2. PCA on the classes. This is supervised as it makes use of class labels. The PCA is computed on the average of the class covariance matrices. In PRTools, this mapping can be trained by using klm (Karhunen-Loeve mapping) on a labeled dataset a: w = klm(a,2)

3. Fisher mapping. This tries to maximize the between scatter over the within scatter of the different classes. It is, therefore, supervised: w = fisherm(a,2)

12.1 Apply the three methods on mfeat_pix and investigate if, and how, the mapped results differ.

12.2 Perform plot(pca(a,0)) to see a plot of the relative cumulative ordered eigenvalues (normalized sum of variances).

12.3 After mapping the data, use some simple classifiers to investigate how the choice of the mappings influences the classification performance in the 2-dimensional feature spaces.

Exercise 34. Eigenfaces and Fisherfaces

The linear mappings used in example 12 may also be applied to image datasets in which each pixel is a feature, e.g. the Face-database containing images of 92*112 pixels. An image is now a point in a 10304 dimensional feature space.

34.1 Load a subset of 10 classes by a = faces([1:10],[1:10]). The images can be displayed by show(a).


34.2 Plot the explained variance for the PCA as a function of the number of components. When and why does this curve reach the value 1?

34.3 Make for each of the three mappings a 2D scatterplot of all data mapped on the first two vectors. Try to understand what you see.

34.4 The PCA eigenvector mapping w points to positions in the original feature space called eigenfaces. These can be displayed by show(w). Display the first 20 eigenfaces computed by pca as well as by klm, and the first 20 Fisherfaces of the dataset.


VI. Error estimation and evaluation

Example 13. Evaluation

The following routines are available for the evaluation of classifiers:

testc     test a dataset on a trained classifier
crossval  train and test classifiers by cross validation
cleval    classifier evaluation by computing a learning curve
reject    computation of an error-reject curve
roc       computation of a receiver-operator curve
gendat    split a given dataset at random into a training set and a test set

A simple example of the generation and use of a test set is the following:

13.1 Load the mfeat_kar dataset, consisting of 64 Karhunen-Loeve coefficients measured for 10*200 written digits ('0' to '9'). A training set of 50 objects per class (i.e. a fraction of 0.25 of 200) can be generated by:

>> a = mfeat_kar

MFEAT KL Features, 2000 by 64 dataset with 10 classes: [200 ... 200]

>> [trainset,testset] = gendat(a,0.25)

MFEAT KL Features, 500 by 64 dataset with 10 classes: [50 ... 50]

MFEAT KL Features, 1500 by 64 dataset with 10 classes: [150 ... 150]

50 x 10 objects are stored in trainset, the remaining 1500 objects are stored in testset. Train the linear normal densities based classifier and test it:

>> w = ldc(trainset);

>> testset*w*testc

Compare the result with training and testing by all data:

>> a*ldc(a)*testc

which is probably better for two reasons. First, it uses more objects for training, so a better classifier is obtained. Second, it uses the same objects for testing as well as for training, by which the test result is positively biased. For this reason the use of separate sets for training and testing is to be preferred.

Example 14. Classifier performance

In this exercise we will investigate the difference in behavior of the error on the training and the test set. Generate a large test set and study the variations in the classification error based on repeatedly generated training sets:

>> t = gendath([500 500]);

>> a = gendath([20 20]); t*ldc(a)*testc

Repeat this last line a number of times. What causes the variations in error?

Now do the same for different test sets:

>> a = gendath([20 20]);

>> w = ldc(a);

>> t = gendath([500 500]); t*w*testc


Example 15. Use of cell arrays for classifiers and datasets

In finding the best classifiers over a set of datasets the Matlab cell arrays can be very useful.

A cell array is a collector of arbitrary items. For instance a set of untrained classifiers can be stored as follows:

>> classifiers = {nmc, parzenc([],1), knnc([],3)}

and a set of datasets is similarly stored as:

>> data = {iris, gendath(50), gendatd(30,30,10), gendatb(100)}

Training and test sets can be generated for all datasets simultaneously by

>> [trainset,testset] = gendat(data,0.5)

In a similar way training of classifiers and error estimation can be done:

>> w = map(trainset,classifiers)

>> testc(testset,w)

Note that the construction w = trainset*classifiers doesn't work for cell arrays.

Cross validation can be done by:

>> crossval(data,classifiers,5)

The parameter '5' indicates 5-fold cross validation, i.e. a rotation over training sets of 80% and test sets of 20% of the data. If this parameter is omitted the leave-one-out error is computed.

For the nearest neighbor rule this is also done by testk. Take a small data set a and verify that testk(a) and crossval(a,knnc([],1)) yield the same result. Note how much more efficient the specialised routine testk is.

Example 16. Learning curves introduction

An easy-to-use routine for studying the behavior of a classifier on a given dataset is cleval:

>> a = gendatb([30 30])

>> e = cleval(a,ldc,[2 3 5 10 20],3)

This generates at random training sets of sizes [2 3 5 10 20] per class out of the dataset a and trains the classifier ldc. The remaining objects are used for testing (so in this example the set a has to contain more than 20 objects per class). This is repeated 3 times and the resulting errors are averaged and returned in the structure e. This is ready-made for plotting the so-called learning curve by:

>> plotr(e)

which automatically annotates the plot.

Exercise 35. Learning curve experiment

Plot the learning curves of qdc, udc, fisherc and nmc for gendath using training set sizes ranging from 3 to 100 (a sketch is given below). Do the same for a 20-dimensional problem generated by gendatd. Study and try to understand the results.
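A sketch for the first part (it assumes repeated plotr calls in one figure overlay the curves; hold on is used to be safe):

>> a = gendath([200 200]);           % enough objects to leave a test part
>> sizes = [3 5 10 20 50 100];
>> figure; hold on
>> plotr(cleval(a,qdc,sizes,5))
>> plotr(cleval(a,udc,sizes,5))
>> plotr(cleval(a,fisherc,sizes,5))
>> plotr(cleval(a,nmc,sizes,5))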

Example 17. Confusion matrices

A confusion matrix C has in element C(i,j) the confusion between the classes i and j. Confusion matrices are especially useful in multi-class problems for analyzing the similarities between classes. For instance, let us take the IMOX dataset, a = imox, and split it for training and testing by [train_set,test_set] = gendat(a,0.5). We can now compare the true labels of the test set with the estimated ones found by a classifier:

>> true_lab = getlab(test_set);

>> w = fisherc(train_set);

>> est_lab = test_set*w*classd;

>> confmat(true_lab,est_lab)

Exercise 36. Confusion matrix experiment

Compute the confusion matrix for fisherc applied to the two digit feature sets mfeat_kar and mfeat_zer. One of these feature sets is rotation invariant. Which one?

Exercise 37. Bootstrap error estimates (optional)

Note that gendat can be used for bootstrapping datasets. Write two error estimation routines based on bootstrap based bias corrections for the apparent error:

e1 = ea - (eba - ebc)
e2 = 0.368 ea + 0.632 ebo

in which ea is the apparent error of the classifier to be tested, eba is the bootstrap apparent error, ebc is the apparent error (based on the whole training set) of the bootstrap based classifier and ebo is the out-of-bootstrap error estimate of the bootstrap based classifier. These estimates have to be based on a series of bootstraps, e.g. 25.

Compare these error estimates (a sketch is given below) with 2-fold cross validation, 10-fold cross validation, the leave-one-out error estimate (all obtained by crossval) and the true error (based on a very large test set) for a simple problem, e.g. gendath with 10 objects per class, classified by fisherc. In order to obtain significant results the entire experiment should be repeated a large number of times, e.g. 50. Verify whether this is sufficient by computing the variances in the obtained error estimates.
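A sketch of the two estimators for a single dataset (bootstrap sampling is done here by plain random indices; gendat may be used instead):

>> a = gendath([10 10]); m = size(a,1);
>> ea = a*fisherc(a)*testc;                % apparent error
>> nboot = 25; eba = 0; ebc = 0; ebo = 0;
>> for i = 1:nboot
     J = ceil(rand(m,1)*m);                % bootstrap sample, with replacement
     K = setdiff(1:m,J);                   % out-of-bootstrap objects
     w = fisherc(a(J,:));
     eba = eba + a(J,:)*w*testc/nboot;     % bootstrap apparent error
     ebc = ebc + a*w*testc/nboot;          % error on the whole training set
     ebo = ebo + a(K,:)*w*testc/nboot;     % out-of-bootstrap error
   end
>> e1 = ea - (eba - ebc)
>> e2 = 0.368*ea + 0.632*ebo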

Example 18. Reject curves.

The classification error for a classification result d = a*w is found by e = testc(d) after determining the largest value in each row of d. By rejection of objects a threshold is used to determine when this largest value is not sufficiently large. The routine e = reject(d) determines the classification error and the reject rate for a set of such threshold values. The errors and reject frequencies are stored in e. We will illustrate this by a simple example.

18.1 Load a dataset by gendath for training Fisher's classifier:

>> a = gendath([100 100]); w = fisherc(a);

Take a small test set:

>> b = gendath([20 20])

classify it and compute its classification error:

>> d = b*w; testc(d)

Compute the reject/error trade-off:

>> e = reject(d)

Errors are stored in e.error and rejects are stored in e.xvalues. Inspect them, e.g. by listing [e.error' e.xvalues']. The left column shows the error for the reject frequencies shown in the right column. It starts with the classification error found above by testc(d) for no reject (0) and runs to an error of 0 and a reject of 1 at the end. Plot the reject curve by:

>> plotr(e)

18.2 Repeat this for a test set b of 500 objects per class. How many objects have to be rejected to have an error of less than 0.06?

Exercise 38. Reject experiment

Study the behavior of the reject curves for nmc, qdc and parzenc for the sonar dataset (a = sonar). Take training sets and test sets of equal size ([b,c] = gendat(a,0.5)).

Study help reject to see how a set of reject curves can be computed simultaneously. Plot the result by plotr. Try to understand the reject curve for qdc.

Example 19. ROC curves

The roc command computes separately the classification errors for each of the classes for various thresholds. Results for a two-class problem can again be plotted by the plotr command, e.g.:

>> [a,b] = gendat(sonar,0.5)

>> w1 = ldc(a);

>> w2 = nmc(a);

>> w3 = parzenc(a);

>> w4 = svc(a);

>> e = roc(b,{w1 w2 w3 w4});

>> plotr(e)

This plot shows how the error shifts from one class to the other class for a changing threshold.

Try to understand what these plots indicate for the selection of a classifier.


VII. Neural network classifiers

In PRTools three neural network classifiers are implemented based on an old version of Matlab's Neural Network Toolbox:

bpxnc, a feed-forward network (multi-layer perceptron), trained by a modified back-propagation algorithm with a variable learning parameter.

lmnc, a feed-forward network, trained by the Levenberg-Marquardt rule.

rbnc, a radial basis network. This network always has one hidden layer, which is extended with more neurons as long as necessary.

These classifiers have built-in choices for target values, step sizes, momentum terms, etcetera. No weight decay facilities are available. Training stops when there is no improvement on the training set error, no improvement on a validation set error (if supplied), or at a preset maximum number of epochs.

In addition the following neural network classifiers are available:

rnnc, a feed-forward network (multi-layer perceptron) with a random input layer and a trained output layer. This has a similar architecture to bpxnc and rbnc, but is much faster.

perlc, a single layer perceptron with linear output and adjustable step sizes and target values.

Example 20. The neural network as a classifier

The following lines demonstrate the use of the neural network as a classifier:

>> a = gendats; scatterd(a)

>> w = lmnc(a,3,1); h = plotc(w);
>> for i = 1:50
     w = lmnc(a,3,1,w); delete(h); h = plotc(w); disp(a*w*testc); drawnow;
   end

Repeat these lines if you expect a further improvement. Repeat the experiment for 5 and 10 hidden units. Try also the use of the back-propagation rule (bpxnc).

Exercise 39. A neural network classification experiment

Compare the performance of networks trained by the Levenberg-Marquardt rule (lmnc) with different numbers of hidden units: 3, 5 and 10, for a three-class digit problem (2, 3 and 5). Use the nist16 dataset (a = nist16). Reduce the dimensionality of the feature space by pca to a space that contains 90% of the original variance (a sketch of the data preparation is given below). Use training sets of 5, 10, 20, 50 and 100 objects per class and a large test set. Plot the errors on the training set and the test set as a function of the training size. Which networks are overtrained? What can be changed in this network to avoid overtraining?
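A sketch of the data preparation (it assumes the classes in nist16 are ordered '0' to '9', so the digits 2, 3 and 5 are the classes 3, 4 and 6, and that pca with a fraction argument retains that part of the variance; check help gendat for selecting a fixed number of objects per class):

>> a = nist16;
>> b = seldat(a,[3 4 6]);            % digits 2, 3 and 5
>> w = pca(b,0.9);                   % keep 90% of the variance
>> b = b*w;
>> [trn,tst] = gendat(b,5*ones(1,3)) % 5 objects per class; repeat for 10, 20, ...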

Exercise 40. Perceptron classifier (skip this exercise)

Study the influence of the step size (range from 0.001 to 1) and the target values (range 0.0001 to 0.40) in perlc and persc by computing learning curves for simple problems like gendats. Use a fixed and small number of steps, e.g. 100.


Exercise 41. Overtraining (optional)

Study the errors on the training and test set as a function of training time (number of epochs) for a network with one hidden layer of 10 neurons. Use as classification problem gendatc with 25 training objects per class. Do this for lmnc as well as for bpxnc.

Exercise 42. Number of hidden units (optional)

Study the influence of the number of hidden units on the test error for the same problem and the same classifiers as in the overtraining exercise 41.

Exercise 43. Network outputs and posterior probabilities (optional)

Network output values are normalized, like for all classifiers, by a*w*classc. Compare these outcomes for test sets with the posterior probabilities found for the normal density based classifier qdc and with the 'true' posterior probabilities found for a qdc classifier based on a very large training set. This comparison might be based on scatterplots. Use data based on normal distributions. Train the network with various numbers of steps and try a small and a large number of hidden units.
