
5.5 Comparing data mining methods

We often need to compare two different learning methods on the same problem to see which is the better one to use. It seems simple: estimate the error using cross-validation (or any other suitable estimation procedure), perhaps repeated several times, and choose the scheme whose estimate is smaller. This is quite sufficient in many practical applications: if one method has a lower estimated error than another on a particular dataset, the best we can do is to use the former method's model. However, it may be that the difference is simply caused by estimation error, and in some circumstances it is important to determine whether one scheme is really better than another on a particular problem. This is a standard challenge for machine learning researchers. If a new learning algorithm is proposed, its proponents must show that it improves on the state of the art for the problem at hand and demonstrate that the observed improvement is not just a chance effect in the estimation process.

This is a job for a statistical test that gives confidence bounds, the kind we met previously when trying to predict true performance from a given test-set error rate. If there were unlimited data, we could use a large amount for training and evaluate performance on a large independent test set, obtaining confidence bounds just as before. However, if the difference turns out to be significant we must ensure that this is not just because of the particular dataset we happened to base the experiment on. What we want to determine is whether one scheme is better or worse than another on average, across all possible training and test datasets that can be drawn from the domain. Because the amount of training data naturally affects performance, all datasets should be the same size: indeed, the experiment might be repeated with different sizes to obtain a learning curve.

For the moment, assume that the supply of data is unlimited. For definiteness, suppose that cross-validation is being used to obtain the error estimates (other estimators, such as repeated cross-validation, are equally viable). For each learning method we can draw several datasets of the same size, obtain an accuracy estimate for each dataset using cross-validation, and compute the mean of the estimates. Each cross-validation experiment yields a different, independent error estimate. What we are interested in is the mean accuracy across all possible datasets of the same size, and whether this mean is greater for one scheme or the other.

From this point of view, we are trying to determine whether the mean of a set of samples—cross-validation estimates for the various datasets that we sampled from the domain—is significantly greater than, or significantly less than, the mean of another. This is a job for a statistical device known as the t-test, or Student’s t-test. Because the same cross-validation experiment can be used for both learning methods to obtain a matched pair of results for each dataset, a more sensitive version of the t-test known as a paired t-test can be used.

We need some notation. There is a set of samples $x_1, x_2, \ldots, x_k$ obtained by successive 10-fold cross-validations using one learning scheme, and a second set of samples $y_1, y_2, \ldots, y_k$ obtained by successive 10-fold cross-validations using the other. Each cross-validation estimate is generated using a different dataset (but all datasets are of the same size and from the same domain). We will get the best results if exactly the same cross-validation partitions are used for both schemes so that $x_1$ and $y_1$ are obtained using the same cross-validation split, as are $x_2$ and $y_2$, and so on. Denote the mean of the first set of samples by $\bar{x}$ and the mean of the second set by $\bar{y}$. We are trying to determine whether $\bar{x}$ is significantly different from $\bar{y}$.
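To make the pairing concrete, here is a minimal sketch, assuming scikit-learn and NumPy are available; the dataset and the two classifiers are illustrative choices, not ones prescribed by the text. Each run reuses a single split object for both schemes, so $x_i$ and $y_i$ are based on exactly the same partitions. For lack of fresh data, every run reshuffles the same dataset, a shortcut whose consequences are discussed at the end of this section.

```python
# Sketch: collect paired accuracy estimates x_1..x_k and y_1..y_k from
# successive 10-fold cross-validations (illustrative dataset and classifiers).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
k = 10
x_samples, y_samples = [], []
for run in range(k):
    # The same split object is used for both schemes, so each pair (x_i, y_i)
    # is obtained from exactly the same cross-validation partitions.
    folds = KFold(n_splits=10, shuffle=True, random_state=run)
    x_samples.append(cross_val_score(GaussianNB(), X, y, cv=folds).mean())
    y_samples.append(cross_val_score(DecisionTreeClassifier(random_state=0),
                                     X, y, cv=folds).mean())

x_bar, y_bar = np.mean(x_samples), np.mean(y_samples)
print("mean accuracy, scheme A:", x_bar)
print("mean accuracy, scheme B:", y_bar)
```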

If there are enough samples, the mean $\bar{x}$ of a set of independent samples $x_1, x_2, \ldots, x_k$ has a normal (i.e., Gaussian) distribution, regardless of the distribution underlying the samples themselves. We will call the true value of the mean $\mu$. If we knew the variance of that normal distribution, so that it could be reduced to have zero mean and unit variance, we could obtain confidence limits on $\mu$ given the mean of the samples $\bar{x}$. However, the variance is unknown, and the only way we can obtain it is to estimate it from the set of samples.

That is not hard to do. The variance of $\bar{x}$ can be estimated by dividing the variance calculated from the samples $x_1, x_2, \ldots, x_k$ (call it $s_x^2$) by $k$. But the fact that we have to estimate the variance changes things somewhat. We can reduce the distribution of $\bar{x}$ to have zero mean and unit variance by using

$$\frac{\bar{x} - \mu}{\sqrt{s_x^2 / k}}.$$

Because the variance is only an estimate, this does not have a normal distribution (although it does become normal for large values of $k$). Instead, it has what is called a Student's distribution with $k - 1$ degrees of freedom. What this means in practice is that we have to use a table of confidence intervals for Student's distribution rather than the confidence table for the normal distribution given earlier. For 9 degrees of freedom (which is the correct number if we are using the average of 10 cross-validations), the appropriate confidence limits are shown in Table 5.2. If you compare them with Table 5.1 you will see that the Student's figures are slightly more conservative (for a given degree of confidence, the interval is slightly wider), and this reflects the additional uncertainty caused by having to estimate the variance. Different tables are needed for different numbers of degrees of freedom, and if there are more than 100 degrees of freedom the confidence limits are very close to those for the normal distribution. Like Table 5.1, the figures in Table 5.2 are for a "one-sided" confidence interval.
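If SciPy is available, the one-sided limits in Table 5.2 can be looked up directly rather than read from a printed table; a small sketch (the probabilities chosen match the rows of Table 5.2):

```python
# Sketch: reproduce the one-sided confidence limits of Table 5.2 with SciPy.
from scipy.stats import t

for p in (0.001, 0.005, 0.01, 0.05, 0.10, 0.20):
    # isf(p, df) returns the z with Pr[X >= z] = p for Student's distribution
    # with the given degrees of freedom (9 for the average of 10 estimates).
    print(f"Pr[X >= z] = {p:>5.1%}  ->  z = {t.isf(p, df=9):.2f}")
```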

To decide whether the means $\bar{x}$ and $\bar{y}$, each an average of the same number $k$ of samples, are the same or not, we consider the differences $d_i$ between corresponding observations, $d_i = x_i - y_i$. This is legitimate because the observations are paired. The mean of this difference is just the difference between the two means, $\bar{d} = \bar{x} - \bar{y}$, and, like the means themselves, it has a Student's distribution with $k - 1$ degrees of freedom. If the means are the same, the difference is zero (this is called the null hypothesis); if they're significantly different, the difference will be significantly different from zero. So for a given confidence level, we will check whether the actual difference exceeds the confidence limit.


Table 5.2 Confidence limits for Student’s distribution with 9 degrees of freedom.

Pr[X ≥ z]     z
0.1%          4.30
0.5%          3.25
1%            2.82
5%            1.83
10%           1.38
20%           0.88

First, reduce the difference to a zero-mean, unit-variance variable called the t-statistic:

$$t = \frac{\bar{d}}{\sqrt{s_d^2 / k}}$$

where $s_d^2$ is the variance of the difference samples. Then, decide on a confidence level; generally, 5% or 1% is used in practice. From this the confidence limit $z$ is determined using Table 5.2 if $k$ is 10; if it is not, a confidence table of the Student's distribution for the $k$ value in question is used. A two-tailed test is appropriate because we do not know in advance whether the mean of the $x$'s is likely to be greater than that of the $y$'s or vice versa: thus for a 1% test we use the value corresponding to 0.5% in Table 5.2. If the value of $t$ according to the preceding formula is greater than $z$, or less than $-z$, we reject the null hypothesis that the means are the same and conclude that there really is a significant difference between the two learning methods on that domain for that dataset size.
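Putting the pieces together, here is a sketch of the paired test, assuming NumPy and SciPy and reusing the hypothetical x_samples and y_samples names from the earlier sketch:

```python
# Sketch of the paired t-test decision procedure described above.
import numpy as np
from scipy.stats import t, ttest_rel

d = np.array(x_samples) - np.array(y_samples)   # paired differences d_i
k = len(d)
t_stat = d.mean() / np.sqrt(d.var(ddof=1) / k)  # t = d_bar / sqrt(s_d^2 / k)

alpha = 0.01                                    # 1% two-tailed test
z = t.isf(alpha / 2, df=k - 1)                  # i.e., the 0.5% one-sided limit

if abs(t_stat) > z:
    print("reject the null hypothesis: the schemes differ significantly")
else:
    print("no significant difference at the chosen confidence level")

# SciPy's built-in paired test gives the same statistic plus a two-sided p-value.
print(ttest_rel(x_samples, y_samples))
```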

Two observations are worth making on this procedure. The first is technical: what if the observations were not paired? That is, what if we were unable, for some reason, to assess the error of each learning scheme on the same datasets? What if the number of datasets for each scheme was not even the same? These conditions could arise if someone else had evaluated one of the methods and published several different estimates for a particular domain and dataset size (or perhaps just their mean and variance), and we wished to compare this with a different learning method. Then it is necessary to use a regular, nonpaired t-test. If the means are normally distributed, as we are assuming, the difference between the means is also normally distributed. Instead of taking the mean of the difference, $\bar{d}$, we use the difference of the means, $\bar{x} - \bar{y}$. Of course, that's the same thing: the mean of the difference is the difference of the means. But the variance of the difference is not the same. If the variance of the samples $x_1, x_2, \ldots, x_k$ is $s_x^2$ and the variance of the samples $y_1, y_2, \ldots, y_\ell$ is $s_y^2$, the best estimate of the variance of the difference of the means is

$$\frac{s_x^2}{k} + \frac{s_y^2}{\ell}.$$

It is this variance (or rather, its square root) that should be used as the denominator of the t-statistic given previously. The degrees of freedom, necessary for consulting Student's confidence tables, should be taken conservatively to be the minimum of the degrees of freedom of the two samples. Essentially, knowing that the observations are paired allows the use of a better estimate for the variance, which will produce tighter confidence bounds.
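A corresponding sketch of the unpaired variant, again assuming NumPy and SciPy; the variable names are illustrative, and the two sets of estimates need not be matched or of equal length:

```python
# Sketch of the unpaired (regular) t-test with the conservative choice of
# degrees of freedom described above.
import numpy as np
from scipy.stats import t

x = np.asarray(x_samples)          # k estimates for one scheme
y = np.asarray(y_samples)          # l estimates for the other (need not pair up)
k, l = len(x), len(y)

# Variance of the difference of the means: s_x^2 / k + s_y^2 / l
var_diff = x.var(ddof=1) / k + y.var(ddof=1) / l
t_stat = (x.mean() - y.mean()) / np.sqrt(var_diff)

# Conservative degrees of freedom: the smaller of the two samples' values.
df = min(k, l) - 1
z = t.isf(0.005, df=df)            # 1% two-tailed test
print("significant difference:", abs(t_stat) > z)
```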

The second observation concerns the assumption that there is essentially unlimited data so that several independent datasets of the right size can be used.


In practice there is usually only a single dataset of limited size. What can be done? We could split the data into (perhaps 10) subsets and perform a cross-validation on each. However, the overall result will only tell us whether a learning scheme is preferable for that particular size, perhaps one-tenth of the original dataset. Alternatively, the original dataset could be reused, for example with different randomizations of the dataset for each cross-validation. However, the resulting cross-validation estimates will not be independent because they are not based on independent datasets. In practice, this means that a difference may be judged to be significant when in fact it is not. In fact, just increasing the number of samples $k$, that is, the number of cross-validation runs, will eventually yield an apparently significant difference because the value of the t-statistic increases without bound.

Various modifications of the standard t-test have been proposed to circumvent this problem, all of them heuristic and lacking sound theoretical justification. One that appears to work well in practice is the corrected resampled t-test.

Assume for the moment that the repeated holdout method is used instead of cross-validation, repeated $k$ times on different random splits of the same dataset to obtain accuracy estimates for two learning methods. Each time, $n_1$ instances are used for training and $n_2$ for testing, and differences $d_i$ are computed from performance on the test data. The corrected resampled t-test uses the modified statistic

$$t = \frac{\bar{d}}{\sqrt{\left(\frac{1}{k} + \frac{n_2}{n_1}\right) s_d^2}}$$

in exactly the same way as the standard t-statistic. A closer look at the formula shows that its value cannot be increased simply by increasing $k$. The same modified statistic can be used with repeated cross-validation, which is just a special case of repeated holdout in which the individual test sets for one cross-validation do not overlap. For 10-fold cross-validation repeated 10 times, $k = 100$, $n_2/n_1 = 0.1/0.9$, and $s_d^2$ is based on 100 differences.
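As a sketch, the corrected statistic is easy to compute from the $k$ differences; the helper function and the placeholder data below are illustrative, not part of the original text:

```python
# Sketch of the corrected resampled t-test for k paired differences d_i from
# repeated holdout (or repeated cross-validation) with n1 training and n2 test
# instances per run.
import numpy as np
from scipy.stats import t

def corrected_resampled_t(d, n1, n2):
    """Corrected t-statistic: d_bar / sqrt((1/k + n2/n1) * s_d^2)."""
    d = np.asarray(d, dtype=float)
    k = len(d)
    return d.mean() / np.sqrt((1.0 / k + n2 / n1) * d.var(ddof=1))

# Example: 10-fold cross-validation repeated 10 times -> k = 100, n2/n1 = 1/9.
# In a real comparison, d would hold the 100 per-fold differences between the
# two schemes; random placeholder values are used here purely for illustration.
d = np.random.default_rng(0).normal(loc=0.01, scale=0.02, size=100)
t_stat = corrected_resampled_t(d, n1=9, n2=1)
print(t_stat, t.isf(0.005, df=len(d) - 1))   # compare with the 1% two-tailed limit
```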
