

5.4 Computational criteria

The widespread use of computational methods has led to the development of computationally intensive model selection criteria. These criteria are usually based on data sets that are different from the one being analysed (external validation) and are applicable to all the models considered, even when they belong to different classes (e.g. when comparing logistic regression, decision trees and neural networks, the latter two being non-probabilistic). A possible drawback of these criteria is that they take a long time to design and implement, although general-purpose software such as R has made this task easier. We now consider the main computational criteria.

The cross-validation criterion

The idea of the cross-validation method is to divide the sample into two subsamples, a training sample having n - m observations and a validation sample having m observations. The first sample is used to fit a model and the second is used to estimate the expected discrepancy or to assess a distance. We have already seen how to apply this criterion with reference to neural networks and decision trees. Using this criterion, the choice between two or more models is made by evaluating an appropriate discrepancy function on the validation sample.

The logic of this criterion differs from that of the previous ones, which are all based on a function of internal discrepancy computed on a single data set that plays the roles of both the training data set and the validation data set. With cross-validation we directly compare predicted and observed values on an external validation sample. Notice that the cross-validation idea can be applied to the calculation of any distance function. For example, in the case of neural networks with quantitative output, we usually employ a Euclidean discrepancy,

\[
\frac{1}{m}\sum_{i}\sum_{j}\left(t_{ij}-o_{ij}\right)^{2},
\]

where t_ij is the fitted output and o_ij the observed output, for each observation i in the validation set and each output neuron j.

One problem with the cross-validation criterion is deciding how to select m, the number of observations in the validation data set. For example, if we select m = n/2 then only n/2 observations are available to fit a model. We could reduce m, but this would mean having few observations in the validation data set and would therefore reduce the accuracy with which the choice between models is made. In practice, proportions of 75% and 25% are usually used for the training and validation data sets, respectively.
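As an illustration, the following Python sketch implements the plain (hold-out) cross-validation criterion with the usual 75%/25% split and the Euclidean discrepancy above. The interface is hypothetical: X and y are assumed to be NumPy arrays and fit(X, y) is assumed to return a prediction callable; none of these names come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def holdout_discrepancy(X, y, fit, train_frac=0.75):
    """Hold-out cross-validation sketch: fit on 75% of the data and
    evaluate the Euclidean discrepancy on the remaining 25%.

    `fit(X, y)` is assumed to return a callable producing fitted outputs
    with the same shape as y; this interface is illustrative only."""
    n = len(y)
    idx = rng.permutation(n)
    n_train = int(round(train_frac * n))
    train, valid = idx[:n_train], idx[n_train:]   # m = n - n_train validation rows
    model = fit(X[train], y[train])
    t = model(X[valid])                           # fitted outputs t_ij
    o = y[valid]                                  # observed outputs o_ij
    sq = (t - o) ** 2
    # Euclidean discrepancy: (1/m) * sum_i sum_j (t_ij - o_ij)^2
    return sq.reshape(len(valid), -1).sum(axis=1).mean()
```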

The cross-validation criterion can be improved in different ways. One limitation is that the validation data set is in fact also used to construct the model. The idea, therefore, is to generalise the previous scheme by dividing the sample into more than two data sets. The most frequently used method is to divide the data set into three blocks: training, validation and test. The test data are not used in the modelling phase. The model is fitted on the training data, with the validation data used to choose a model. Finally, the model chosen and estimated on the first two data sets is applied to the test set, and the error found there provides a correct estimate of the prediction error. The disadvantage of this generalisation is that it reduces the amount of data available for training and validation.
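A minimal sketch of this three-way partition is given below; the particular proportions are illustrative assumptions, not values prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def three_way_split(n, fractions=(0.5, 0.25, 0.25)):
    """Partition n row indices into training, validation and test blocks."""
    idx = rng.permutation(n)
    n_train = int(fractions[0] * n)
    n_valid = int(fractions[1] * n)
    train = idx[:n_train]                   # used to fit the candidate models
    valid = idx[n_train:n_train + n_valid]  # used to choose among them
    test = idx[n_train + n_valid:]          # used only for the final error estimate
    return train, valid, test
```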

A further improvement could be to use all the data available for training.

The data are divided into k subsets of equal size; the model is fitted k times, each time leaving out one of the subsets, which is then used to calculate a prediction error. The final error is the arithmetic mean of the k errors obtained.

This method is known as k-fold cross-validation. Another common alternative is the leave-one-out method, in which a single observation is left out at each fit (so that k equals the number of observations) and is then used to assess the prediction for that observation. The disadvantage of these methods is the need to retrain the model several times, which can be computationally intensive.
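The sketch below shows k-fold cross-validation under the same hypothetical fit/error interface as above; setting k equal to the number of observations gives the leave-one-out variant.

```python
import numpy as np

rng = np.random.default_rng(0)

def k_fold_error(X, y, fit, error, k=10):
    """k-fold cross-validation sketch.  `fit(X, y)` is assumed to return a
    prediction callable and `error(y_true, y_pred)` a scalar; both are
    placeholders rather than a fixed API."""
    folds = np.array_split(rng.permutation(len(y)), k)    # k roughly equal blocks
    errors = []
    for j in range(k):
        valid = folds[j]                                   # held-out block
        train = np.concatenate([folds[i] for i in range(k) if i != j])
        model = fit(X[train], y[train])                    # fit on the other k-1 blocks
        errors.append(error(y[valid], model(X[valid])))
    return np.mean(errors)                                 # arithmetic mean of the k errors
```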

The bootstrap criterion

The bootstrap method was introduced by Efron (1979) and is based on the idea of reproducing the ‘real’ distribution of the population by resampling from the observed sample. Application of the method rests on the assumption that the observed sample is in fact a population, a population for which the underlying model f(x) can be calculated: it is the sample density. To compare alternative models, a sample can be drawn (resampled) from this fictitious population (the available sample) and our earlier results on model comparison can then be applied.

For instance, we can calculate the Kullback–Leibler discrepancy directly, without resorting to estimators. The problem is that the results depend on the resampling variability. To get around this, we resample many times, and we assess the discrepancy by taking the mean of the obtained results. It can be shown that the expected discrepancy calculated in this way is a consistent estimator of the expected discrepancy of the real population.

Application of the bootstrap method requires the assumption of a probability model, either parametric or non-parametric, and tends to be computationally intensive.
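A sketch of the non-parametric version of this idea, again under a hypothetical fit/discrepancy interface: the observed sample is treated as the population, B samples are drawn from it with replacement, and the expected discrepancy is estimated by the mean over the B resamples.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_discrepancy(X, y, fit, discrepancy, B=200):
    """Non-parametric bootstrap sketch of an expected-discrepancy estimate.
    `fit` and `discrepancy` are illustrative placeholders, not a fixed API."""
    n = len(y)
    values = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)          # resample the 'population' (the sample)
        model = fit(X[idx], y[idx])               # fit on the bootstrap sample
        values.append(discrepancy(y, model(X)))   # compare against the observed sample
    return np.mean(values)                        # average out the resampling variability
```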

Bagging and boosting

Bootstrap methods can be used not only to assess a model's discrepancy, and therefore its accuracy, but also to improve that accuracy. Bagging and boosting are recent developments that can be used to combine the results of more than one data mining analysis. In this respect they are similar to Bayesian model-averaging methods, as they also lead to model-averaged estimators, which often improve on estimators derived from a single model.

Bagging (bootstrap aggregation) methods can be described as follows. At every iteration, we draw a sample with replacement from the available training data set.

Typically, the sample size corresponds to the size of the training data set itself. This does not mean that the sample drawn will coincide with the training sample: because observations are drawn with replacement, some are drawn more than once and others not at all. Consider B loops of the procedure; the value of B depends on the computational resources and time available. A data mining method can be applied to each bootstrapped sample, leading to a set of estimates for each model; these can then be combined to obtain a bagged estimate. For instance, the optimal classification tree can be searched for in each sample, and each observation allocated to the class with the highest probability. The procedure is repeated for each sample i = 1, . . . , B, leading to B classifications. The bagged classification of an observation corresponds to the majority vote, namely, the class to which it is most often assigned by the B fitted trees. Similarly, a regression tree can be fitted to each of the B samples, producing, in each of them, a fitted value ŷ_i for each observation. The bagged estimate is then the mean of these fitted values,

\[
\frac{1}{B}\sum_{i=1}^{B}\hat{y}_{i}.
\]
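A minimal bagging sketch under the same hypothetical interface: B bootstrap samples of the same size as the training set, one fitted model per sample, then a majority vote for classification or the mean of the fitted values for regression. Integer-coded class labels (0, 1, ...) are assumed for the vote.

```python
import numpy as np

rng = np.random.default_rng(0)

def bagged_predict(X, y, fit, X_new, B=100, classification=True):
    """Bagging sketch: combine B models fitted on bootstrap samples.
    `fit(X, y)` is assumed to return a prediction callable (placeholder API)."""
    n = len(y)
    preds = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # bootstrap sample, same size as the training set
        model = fit(X[idx], y[idx])
        preds.append(model(X_new))
    preds = np.asarray(preds)              # shape (B, number of new observations)
    if classification:
        # majority vote: the class most often assigned by the B fitted models
        # (classes assumed coded as non-negative integers)
        return np.array([np.bincount(col.astype(int)).argmax() for col in preds.T])
    return preds.mean(axis=0)              # bagged estimate: mean of the B fitted values
```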

With reference to the bias–variance trade-off, as a bagged estimate is a sample mean, it will not alter the bias of a model; however, it may reduce the variance. This can occur for highly unstable models, such as decision trees, complex neural networks and nearest-neighbour models. On the other hand, if the applied model is simple, the variance may not decrease, because the bootstrap variability dominates.

So far we have assumed that the same model is applied to the bootstrap samples; this need not be the case. Different models can be combined, provided the estimates are compatible and expressed on the same scale. While bagging relies on bootstrap samples, boosting does not. Although there are now many variants, the early versions of boosting fitted models on several weighted versions of the data set, with the observations with the poorest fit receiving the greatest weight.

For instance, in a classification problem, the well-classified observations receive lower weights as the iterations proceed, allowing the model to concentrate on estimating the most difficult cases. More details can be found in Han and Kamber (2001) and Hastie et al. (2001).
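By way of illustration, the sketch below follows this reweighting idea in an AdaBoost-like form, one of the early boosting schemes; the weak-learner interface and the -1/+1 label coding are assumptions for the sketch, not details taken from the text.

```python
import numpy as np

def boost(X, y, fit_weak, B=50):
    """Reweighting-style boosting sketch (AdaBoost-like).
    `fit_weak(X, y, w)` is assumed to return a callable h with h(X) in {-1, +1};
    labels y are assumed coded as -1/+1.  All names are illustrative."""
    n = len(y)
    w = np.full(n, 1.0 / n)                       # start from uniform weights
    learners, alphas = [], []
    for _ in range(B):
        h = fit_weak(X, y, w)
        pred = h(X)
        err = np.clip(np.sum(w * (pred != y)) / np.sum(w), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)     # weight of this learner in the vote
        w *= np.exp(-alpha * y * pred)            # poorly fitted observations gain weight
        w /= w.sum()
        learners.append(h)
        alphas.append(alpha)
    # final classifier: sign of the weighted vote of the B weak learners
    return lambda X_new: np.sign(sum(a * h(X_new) for a, h in zip(alphas, learners)))
```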
