5.5 Criteria based on loss functions

One aspect of data mining is the need to communicate the final results in accordance with the aims of the analysis. With business data we need to evaluate models not only by comparing them among themselves but also by comparing the business advantages to be gained by using one model rather than another. Since the main problem dealt with by data analysis is to reduce the uncertainty in risk or loss factors, we often talk about developing criteria that minimise the loss connected with a problem. In other words, the best model is the one that leads to the smallest loss. The best way to introduce these rather specific criteria is through examples. Since they are mostly used in predictive classification problems, we will mainly refer to that context here.

The confusion matrix is used as an indication of the properties of a classification (discriminant) rule (see the example in Table 5.1). It contains the number of elements that have been correctly or incorrectly classified for each class. The main diagonal shows the number of observations that have been correctly classified for each class; the off-diagonal elements indicate the number of observations that have been incorrectly classified. If it is assumed, explicitly or implicitly, that each incorrect classification has the same cost, the proportion of incorrect classifications over the total number of classifications is called the error rate, or misclassification error; this is the quantity we must minimise. The assumption of equal costs can be relaxed by weighting errors with their relative costs, as sketched below.
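Under the equal-cost assumption, the error rate is just the off-diagonal mass of the confusion matrix divided by the total. A minimal Python sketch of both the error rate and its cost-weighted variant, using the counts of Table 5.1 (the cost matrix is purely hypothetical, for illustration only):

```python
# Confusion matrix from Table 5.1: rows = observed class, columns = predicted class.
confusion = [
    [45,  2,  3],   # observed Class A
    [10, 38,  2],   # observed Class B
    [ 4,  6, 40],   # observed Class C
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
error_rate = (total - correct) / total   # equal-cost misclassification error

# Relaxing the equal-cost assumption: weight each error by a relative cost.
# costs[i][j] = hypothetical cost of predicting class j when class i is observed.
costs = [
    [0, 1, 2],
    [1, 0, 1],
    [4, 2, 0],
]
average_cost = sum(confusion[i][j] * costs[i][j]
                   for i in range(3) for j in range(3)) / total

print(f"error rate: {error_rate:.3f}, average cost per case: {average_cost:.3f}")
```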

We now consider the lift chart and the ROC curve, two graphs that can be used to assess model costs. Both are presented with reference to a binary response variable, the area where evaluation methods have developed most quickly. For a comprehensive review, see Hand (1997).

Lift chart

The lift chart puts the observations in the validation data set into increasing or decreasing order on the basis of their score, which is the probability of the response event (success), as estimated on the basis of the training set. It groups these scores into deciles, then calculates and graphs the observed probability of success for each of the decile classes in the validation data set. A model is valid if the observed success probabilities follow the same order (increasing or decreasing) as the estimated probabilities.

Table 5.1 Example of a confusion matrix.

                    Predicted classes
Observed classes    Class A   Class B   Class C
Class A                  45         2         3
Class B                  10        38         2
Class C                   4         6        40

Table 5.2 Theoretical confusion matrix.

                    Predicted
Observed            Event (1)   Non-event (0)   Total
Event (1)               a             b          a + b
Non-event (0)           c             d          c + d
Total                 a + c         b + d      a + b + c + d

To improve interpretation, a model’s lift chart is usually compared with a baseline curve, for which the probability estimates are drawn in the absence of a model, that is, by taking the mean of the observed success probabilities.
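As an illustration of this construction, a minimal Python sketch; the scores and outcomes below are hypothetical validation-set values, not data from the text:

```python
# Hypothetical validation data: model scores and observed binary outcomes.
scores = [0.91, 0.85, 0.78, 0.74, 0.66, 0.61, 0.55, 0.48, 0.43, 0.40,
          0.35, 0.33, 0.28, 0.25, 0.22, 0.18, 0.15, 0.11, 0.08, 0.05]
outcomes = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0,
            1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

# Order observations by decreasing score, then split them into deciles.
order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
size = len(order) // 10
baseline = sum(outcomes) / len(outcomes)   # mean success rate: the no-model line

for d in range(10):
    decile = order[d * size:(d + 1) * size]
    observed = sum(outcomes[i] for i in decile) / len(decile)
    print(f"decile {d + 1}: observed success rate {observed:.2f}, "
          f"lift {observed / baseline:.2f}")
```

A valid model shows observed success rates that decrease across the deciles and stay above the baseline in the early ones.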

ROC curve

The receiver operating characteristic (ROC) curve is a graph that also measures the predictive accuracy of a model. It is based on the confusion matrix in Table 5.2.

In the table, the term ‘event’ stands for the value Y = 1 (success) of the binary response. The confusion matrix classifies the observations of a validation data set into four possible categories:

• observations correctly predicted as events (with absolute frequency equal to a);

• observations incorrectly predicted as events (with frequency equal to c);

• observations incorrectly predicted as non-events (with frequency equal to b);

• observations correctly predicted as non-events (with frequency equal to d).

Given an observed table and a cut-off point, the ROC curve is calculated on the basis of the resulting joint frequencies of predicted and observed events (successes) and non-events (failures). More precisely, it is based on the following conditional probabilities:

• sensitivity, a/(a + b), the proportion of events predicted as such;

• specificity, d/(c + d), the proportion of non-events predicted as such;

• false positives, c/(c + d) = 1 − specificity, the proportion of non-events predicted as events (type I error);

• false negatives, b/(a + b) = 1 − sensitivity, the proportion of events predicted as non-events (type II error).
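As a numerical check on these definitions, a small sketch with hypothetical cell frequencies a, b, c and d:

```python
# Hypothetical frequencies for the four cells of Table 5.2.
a, b, c, d = 80, 20, 10, 90   # a, d are correct predictions; b, c are errors

sensitivity = a / (a + b)           # events predicted as events
specificity = d / (c + d)           # non-events predicted as non-events
false_positive_rate = c / (c + d)   # = 1 - specificity
false_negative_rate = b / (a + b)   # = 1 - sensitivity

assert abs(false_positive_rate - (1 - specificity)) < 1e-12
assert abs(false_negative_rate - (1 - sensitivity)) < 1e-12
```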

The ROC curve is obtained by graphing, for any fixed cut-off value, the false positives on the horizontal axis and the sensitivity on the vertical axis (see Figure 5.1 for an example). Each point on the curve corresponds to a particular cut-off. The ROC curve can also be used to select a cut-off point, trading off sensitivity and specificity. In terms of model comparison, the ideal curve coincides with the vertical axis, so the best curve is the leftmost curve.

Figure 5.1 Example of an ROC curve.
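A sketch of how the curve’s points could be computed by sweeping the cut-off over the observed scores; the function name roc_points is ours, and scores and outcomes are assumed to be arrays like those in the lift chart sketch above:

```python
def roc_points(scores, outcomes):
    """Return (1 - specificity, sensitivity) pairs, one per cut-off value."""
    pos = sum(outcomes)            # observed events; assumes both classes present
    neg = len(outcomes) - pos      # observed non-events
    points = [(0.0, 0.0)]          # cut-off above every score: nothing predicted
    for cut in sorted(set(scores), reverse=True):
        predicted = [s >= cut for s in scores]   # classify as event at or above cut-off
        a = sum(1 for p, y in zip(predicted, outcomes) if p and y == 1)  # true positives
        c = sum(1 for p, y in zip(predicted, outcomes) if p and y == 0)  # false positives
        points.append((c / neg, a / pos))        # (1 - specificity, sensitivity)
    return points
```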

The ROC curve is the basis for an important summary statistic called the Gini index of performance. Recall the concentration curve in Figure 3.2. For any given value of F_i, the cumulative frequency, there is a corresponding value of Q_i, the cumulative intensity. Both F_i and Q_i take values in [0, 1], and Q_i ≤ F_i. The concentration curve joins a number of points in the Cartesian plane determined by taking x_i = F_i and y_i = Q_i, for i = 1, . . . , n. The area between the curve and the 45° line gives a summary measure of the degree of concentration. The ROC curve can be treated in a similar way. In place of F_i and Q_i we need to consider two cumulative distributions constructed as follows.

The data contain both events (Y_i = 1) and non-events (Y_i = 0). They can therefore be divided into two samples, one containing all the events (labelled E) and one containing all the non-events (labelled N). As we have seen, any statistical model for predictive classification takes each observation and attaches to it a score, the fitted probability of success π_i. In each of the two samples, E and N, the observations can be ordered (in increasing order) according to this score. Now, for any fixed value of i (a percentile corresponding to the cut-off threshold), a classification model would consider all observations below it as non-events and all observations above it as events.

Correspondingly, the predicted proportion of events can be estimated in both E and N. For a reasonable model, this proportion has to be higher in population E than in population N. Let F_i^E and F_i^N be these proportions at cut-off i, and calculate the coordinate pairs (F_i^E, F_i^N) as i varies. For i = 1, . . . , n, both F_i^E and F_i^N take values in [0, 1]; indeed, they both represent cumulative frequencies. Furthermore, F_i^N ≤ F_i^E. The ROC curve is obtained by joining the points with coordinates y_i = F_i^E and x_i = F_i^N. This is because F_i^E equals the sensitivity and F_i^N equals 1 − specificity.

Notice that the curve will always lie above the 45° line. However, the area between the curve and the line can also be calculated; it coincides with the Gini index of performance. The larger the area, the better the model.
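Continuing the previous sketches (reusing roc_points and the hypothetical data above), the area between the ROC curve and the 45° line can be approximated with the trapezoidal rule. Note that this area equals AUC − 1/2; the Gini index is often equivalently reported in rescaled form as 2 × AUC − 1:

```python
def gini_from_roc(points):
    """Area between the ROC curve and the 45-degree line (trapezoidal rule)."""
    pts = sorted(points)                 # order by x = 1 - specificity
    auc = sum((x2 - x1) * (y1 + y2) / 2  # trapezoid under each segment
              for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
    return auc - 0.5                     # often rescaled as 2 * auc - 1

print(f"Gini index of performance: {gini_from_roc(roc_points(scores, outcomes)):.3f}")
```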

5.6 Further reading

In this chapter we have systematically compared the main criteria for model selection and comparison. These methods can be grouped into: criteria based on statistical tests, criteria based on scoring functions, Bayesian criteria, computational criteria, and business criteria. Criteria based on statistical tests start from the theory of statistical hypothesis testing, so there is a great deal of detailed literature on this topic; see, for example, Mood et al. (1991). The main limitation of these methods is that the choice among the different models is made by pairwise comparisons, leading only to a partial ordering.

Criteria based on scoring functions offer an interesting alternative, since they can be applied in many settings and provide a complete ordering of the models. In addition, they can be easily computed. However, they do not provide threshold levels for assessing whether the difference in scores between two models is significant. Therefore they tend to be used in the exploration or preliminary phase of the analysis. For more details on these criteria and how they compare with the hypothesis testing criteria, see Zucchini (2000) or Hand et al. (2001). Bayesian criteria are a possible compromise between the previous two. However, Bayesian criteria are not widely used, since they are not implemented in the most popular statistical software. For data mining case-studies that use Bayesian criteria, see Giudici (2001) and Giudici and Castelo (2001).

Computational criteria have the advantage that they can be applied to statistical methods that are not necessarily ‘model based’. From this point of view they are the main principle of ‘universal’ comparison among the different types of models. On the other hand, since most of them are non-probabilistic, they may be too dependent on the sample observed. A way to overcome this problem is to consider model combination methods, such as bagging and boosting. For a thorough description of these recent methodologies, see Hastie et al. (2001).

Criteria based on loss functions are relatively recent, even though the underlying ideas have been used in Bayesian decision theory for quite some time; see Bernardo and Smith (1994). They are of great interest and have great application potential, even though at present they are used only in the context of classification. For a more detailed examination of these criteria, see Hand (1997), Hand et al. (2001), or the manuals for the R statistical software.
