
Model evaluation

5.1 Criteria based on statistical tests

The choice of the statistical model used to describe a database is one of the main aspects of statistical analysis. A model is either a simplification or an approximation of reality and therefore does not entirely reflect reality. As we have seen in Chapter 4, a statistical model can be specified by a discrete probability function or by a probability density function f(x); this is what is considered to be 'underlying the data' or, in other words, it is the generating mechanism of the data. A statistical model is usually specified up to a set of unknown quantities that have to be estimated from the data at hand.

Often a density function is parametric or, rather, it is defined by a vector of parameters $\theta = (\theta_1, \ldots, \theta_I)$, such that each value of θ corresponds to a particular density function, $f_\theta(x)$. A model that has been correctly parameterised for a given unknown density function f(x) is a model that gives f(x) for particular values of the parameters. We can select the best model in a non-parametric context by choosing the distribution function that best approximates the unknown distribution function. But first of all we consider the notion of a distance between a model f, which is the 'true' generating mechanism of the data, and a model g, which is an approximating model.

5.1.1 Distance between statistical models

We can use a distance function to compare two models, say g and f. As explained in Section 4.1, there are different types of distance function; here are the most important ones.

In the categorical case, a distance is usually defined by comparing the estimated discrete probability distributions, denoted by f and g. In the continuous case, we often refer to two variables, $X_f$ and $X_g$, representing the fitted observation values obtained with the two models.

Entropy distance

The entropy distance is used for categorical variables and is related to the concept of heterogeneity reduction (Section 3.4). It describes the proportional reduction of the heterogeneity between categorical variables, as measured by an appropriate index. Because of its additive property, the entropy is the most popular heterogeneity measure for this purpose. The entropy distance of a distribution g from a target distribution f is

$$ E_d = \sum_i f_i \log \frac{f_i}{g_i}, $$

which is the form of the uncertainty coefficient (Section 3.4), but also the form taken by the $G^2$ statistic. The $G^2$ statistic can be employed for most probabilistic data mining models. It can therefore be applied to prediction problems, such as logistic regression and directed graphical models, but also to descriptive problems, such as log-linear models and probabilistic cluster analysis. It also finds application with non-probabilistic models, such as classification trees. The Gini index can also be used as a measure of heterogeneity.
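To make the definition concrete, here is a minimal sketch in Python (NumPy assumed; the function name and the toy distributions are invented purely for illustration):

```python
import numpy as np

def entropy_distance(f, g):
    """Entropy distance E_d = sum_i f_i * log(f_i / g_i) of a model g
    from a target distribution f, over discrete categories.
    Requires g_i > 0 wherever f_i > 0; terms with f_i = 0 contribute 0."""
    f, g = np.asarray(f, dtype=float), np.asarray(g, dtype=float)
    mask = f > 0  # continuity convention: 0 * log(0 / g) = 0
    return np.sum(f[mask] * np.log(f[mask] / g[mask]))

f = [0.5, 0.3, 0.2]   # target distribution (e.g. observed proportions)
g = [0.4, 0.4, 0.2]   # approximating model
print(entropy_distance(f, g))   # positive when the models differ
print(entropy_distance(f, f))   # 0.0: the distance of f from itself
# With cell proportions computed from n observations, the G^2 statistic
# equals 2 * n * E_d.
```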

Chi-squared distance

The chi-squared distance between a distribution g and a target f is

$$ \chi^2_d = \sum_i \frac{(f_i - g_i)^2}{g_i}, $$

which corresponds to a generalisation of Pearson's statistic seen in Section 3.4. This distance is used for both descriptive and predictive problems in the presence of categorical data, as an alternative to the entropy distance. It does not require an underlying probability model; we have seen its application within the CHAID decision tree algorithm.
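A corresponding sketch (same toy distributions as above; the function name is ours):

```python
import numpy as np

def chi2_distance(f, g):
    """Chi-squared distance between a target f and a model g."""
    f, g = np.asarray(f, dtype=float), np.asarray(g, dtype=float)
    return np.sum((f - g) ** 2 / g)

f = [0.5, 0.3, 0.2]
g = [0.4, 0.4, 0.2]
print(chi2_distance(f, g))   # 0.05 for these values
# With cell proportions from n observations, n * chi2_d recovers
# Pearson's X^2 statistic.
```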

0-1 distance

The 0-1 distance applies to categorical variables, and it is typically used for supervised classification problems. It is defined as

$$ 0\text{-}1_d = \sum_{r=1}^{n} 1\left(X_{fr} \neq X_{gr}\right), $$

where 1(w ≠ z) = 1 if w ≠ z and 0 otherwise. It measures the distance in terms of a 0-1 function that counts the number of mismatches between the classifications carried out using the two models. Dividing by the number of observations gives the misclassification rate, probably the most important evaluation tool in predictive classification models, such as logistic regression, classification trees and nearest-neighbour models.
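A minimal sketch of the count and the resulting misclassification rate (labels and names invented for illustration):

```python
import numpy as np

def zero_one_distance(xf, xg):
    """0-1 distance: the number of disagreements between two
    classifications of the same n observations."""
    xf, xg = np.asarray(xf), np.asarray(xg)
    return int(np.sum(xf != xg))

labels_f = ["a", "b", "b", "a", "c"]   # classifications under model f
labels_g = ["a", "b", "c", "a", "a"]   # classifications under model g
d = zero_one_distance(labels_f, labels_g)
print(d)                      # 2 mismatches
print(d / len(labels_f))      # misclassification rate: 0.4
```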

Euclidean distance

Applied to quantitative variables, the Euclidean distance between a distribution g and a target f is

$$ {}_2d(X_f, X_g) = \left[ \sum_{r=1}^{n} \left( X_{fr} - X_{gr} \right)^2 \right]^{1/2}. $$

It represents the distance between two vectors in the Cartesian plane. The Euclidean distance leads to the $R^2$ index and to the F test statistics. Furthermore, by squaring it and dividing by the number of observations we obtain the mean squared error. The Euclidean distance is widely used, especially for continuous predictive models, such as linear models, regression trees and continuous probabilistic expert systems. But it is also used in descriptive models for the observations, such as cluster analysis and Kohonen maps. Notice that it does not necessarily require an underlying probability model. When there is an underlying probability model, it is usually the normal distribution.

Uniform distance

The uniform distance applies to comparisons between distribution functions. For two distribution functions F and G with values in [0, 1], the uniform distance is

$$ \sup_{0 \le t \le 1} \left| F(t) - G(t) \right|. $$

The uniform distance is used in non-parametric statistics, as in the Kolmogorov–Smirnov statistic (Section 4.10), which is typically employed to assess whether a non-parametric estimator is valid. But it is also used to verify whether a specific parametric model, such as the Gaussian model, is a good approximation to a non-parametric model.
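The sketch below computes this sup-distance between an empirical distribution function and a candidate parametric CDF; since the empirical distribution is a step function, the supremum is attained at its jump points. It is only the distance computation, not a formal test (in particular, the parameters are estimated from the same sample). Names, seed and data are invented; NumPy and SciPy assumed.

```python
import numpy as np
from scipy.stats import norm

def uniform_distance(sample, cdf):
    """Sup-norm distance between the empirical distribution function of
    `sample` and a candidate distribution function `cdf`: the
    Kolmogorov-Smirnov statistic."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    F = cdf(x)                           # candidate CDF at the order statistics
    ecdf_hi = np.arange(1, n + 1) / n    # EDF just after each jump
    ecdf_lo = np.arange(0, n) / n        # EDF just before each jump
    return max(np.max(ecdf_hi - F), np.max(F - ecdf_lo))

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=200)
# Distance of the empirical distribution from a fitted Gaussian:
d = uniform_distance(sample, norm(sample.mean(), sample.std()).cdf)
print(d)   # small, since the Gaussian approximation is appropriate here
```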

5.1.2 Discrepancy of a statistical model

The distances in Section 5.1.1 can be used to define the notion of discrepancy for a model. Suppose that f represents an unknown density, and let $g = p_\theta$ be a family of density functions, indexed by a vector of parameters θ. The discrepancy of a statistical model g, with respect to a target model f, can be defined using the Euclidean distance as

$$ \Delta(f, p_\theta) = \sum_{i=1}^{n} \left( f(x_i) - p_\theta(x_i) \right)^2. $$

For each observation, i = 1, . . . , n, this discrepancy (which is a function of the parameters θ) considers the error made by replacing f with g.
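In practice f is unknown, so this quantity cannot normally be computed; the toy sketch below assumes a known f (here a Laplace density) purely to make the definition concrete. The function name and the choice of distributions are invented; SciPy assumed.

```python
import numpy as np
from scipy.stats import norm, laplace

def discrepancy(f_pdf, g_pdf, xs):
    """Euclidean discrepancy: the sum over the observations of the
    squared error made by replacing the true density f with model g."""
    xs = np.asarray(xs, dtype=float)
    return np.sum((f_pdf(xs) - g_pdf(xs)) ** 2)

# Toy setting where f is known: the data come from a Laplace density ...
xs = laplace.rvs(size=100, random_state=1)
# ... and we approximate it with a Gaussian fitted by maximum likelihood.
g = norm(xs.mean(), xs.std())
print(discrepancy(laplace.pdf, g.pdf, xs))
```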

If we knew f, the real model, we would be able to determine which of the approximating statistical models, that is, which of the different choices for g, would be the best, in the sense of minimising the discrepancy. Therefore, the discrepancy of g (due to the parametric approximation) can be obtained as the discrepancy between the unknown probability model and the best parametric statistical model, $p^{(I)}_{\theta_0}$:

$$ \Delta(f, p^{(I)}_{\theta_0}) = \sum_{i=1}^{n} \left( f(x_i) - p^{(I)}_{\theta_0}(x_i) \right)^2. $$

However, since f is unknown we cannot identify the best parametric statistical model. Therefore we substitute f with a sample estimate, denoted by $p^{(I)}_{\hat\theta}(x)$, for which the I parameters are estimated on the basis of the data. The discrepancy between this sample estimate of f(x) and the best statistical model is called the discrepancy of g (due to the estimation process):

$$ \Delta(p^{(I)}_{\hat\theta}, p^{(I)}_{\theta_0}) = \sum_{i=1}^{n} \left( p^{(I)}_{\hat\theta}(x_i) - p^{(I)}_{\theta_0}(x_i) \right)^2. $$

Now we have a discrepancy that is a function of the observed sample, and the complexity of g comes into play. To get closer to the unknown model, it is better to choose a family whose models have a large number of parameters; in other words, the discrepancy due to the parametric approximation is smaller for more complex models. However, the sample estimates obtained with a more complex model tend to overfit the data, resulting in a greater discrepancy due to estimation.

The aim in model selection is to find a compromise between these opposite effects of parametric approximation and estimation. The total discrepancy, defined as the discrepancy between the function f and the sample estimate $p^{(I)}_{\hat\theta}$, takes both these factors into account. It is given by the equation

$$ \Delta(f, p^{(I)}_{\hat\theta}) = \Delta(f, p^{(I)}_{\theta_0}) + \Delta(p^{(I)}_{\theta_0}, p^{(I)}_{\hat\theta}), $$

which represents the algebraic sum of two discrepancies, one from the parametric approximation and one from the estimation process. Generally, minimisation of the first discrepancy favours complex models, which are more adaptable to the data, whereas minimisation of the second discrepancy favours simple models, which are more stable.

The best statistical model to approximate f will be the model $p^{(I)}_{\hat\theta}$ that minimises the total discrepancy. The total discrepancy can rarely be calculated in practice, as the density function f(x) is unknown. Therefore, instead of minimising the total discrepancy, the model selection problem is solved by minimising the total expected discrepancy, $E[\Delta(f, p^{(I)}_{\hat\theta})]$, where the expectation is taken with respect to the sample probability distribution. Such an estimator defines an evaluation criterion for a model with I parameters. Model choice is then based on comparing the corresponding estimators, known as minimum discrepancy estimators.

5.1.3 Kullback–Leibler discrepancy

We now consider how to derive a model evaluation criterion. To define a general estimator we consider, rather than the Euclidean discrepancy we have already met, a more general discrepancy known as the Kullback–Leibler discrepancy (or divergence). The Kullback–Leibler discrepancy can be applied to observations of any type; it derives from the entropy distance and is given by

$$ \Delta_{KL}(f, p^{(I)}_{\hat\theta}) = \sum_{i} f(x_i) \log \frac{f(x_i)}{p^{(I)}_{\hat\theta}(x_i)}. $$

This can easily be mapped to the expression for the $G^2$ deviance; the target density function then corresponds to the saturated model. The best model can be interpreted as the one with a minimal loss of information from the true unknown distribution. Like the entropy distance, the Kullback–Leibler discrepancy is not symmetric.

We can now show that the statistical tests used for model comparison are based on estimators of the total Kullback–Leibler discrepancy. Let $p_\theta$ denote a probability density function parameterised by the vector $\theta = (\theta_1, \ldots, \theta_I)$. The sample values $x_1, \ldots, x_n$ are a series of independent and identically distributed observations, therefore the sample density function is expressed by the equation

$$ L(\theta; x_1, \ldots, x_n) = \prod_{i=1}^{n} p_\theta(x_i). $$

Let $\hat\theta_n$ denote the maximum likelihood estimator of the parameters, and let the likelihood function L be calculated at this point. Taking the logarithm of the resulting expression and multiplying it by −1/n, we get

$$ \hat\Delta_{KL}(f, p^{(I)}_{\hat\theta}) = -\frac{1}{n} \sum_{i=1}^{n} \log p^{(I)}_{\hat\theta_n}(x_i), $$

known as the sample Kullback–Leibler discrepancy function. This expression can be shown to be the maximum likelihood estimator of the total expected Kullback–Leibler discrepancy of a model $p_\theta$. Notice that the Kullback–Leibler discrepancy gives a score to each model, corresponding to the mean (negative) log-likelihood of the observations; multiplying by 2n gives

$$ 2n\,\hat\Delta_{KL}(f, p^{(I)}_{\hat\theta}) = -2 \log L(\hat\theta_n; x_1, \ldots, x_n). $$
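A short sketch of this score for a Gaussian model fitted by maximum likelihood (names, seed and data invented; NumPy and SciPy assumed):

```python
import numpy as np
from scipy.stats import norm

def sample_kl_discrepancy(sample, fitted_logpdf):
    """Sample Kullback-Leibler discrepancy: the mean negative
    log-likelihood of the observations under the fitted model."""
    return -np.mean(fitted_logpdf(sample))

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=500)
# Maximum likelihood estimates for a Gaussian model (std with ddof=0)
mu_hat, sigma_hat = x.mean(), x.std()
kl_hat = sample_kl_discrepancy(x, norm(mu_hat, sigma_hat).logpdf)
print(kl_hat)                 # mean negative log-likelihood
print(2 * len(x) * kl_hat)    # 2n * KL-hat = -2 log L at the MLE
```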

The Kullback–Leibler discrepancy is fundamental to the selection criteria developed in the field of statistical hypothesis testing. These criteria are based on successive comparisons between pairs of alternative models. Suppose that the expected discrepancy for two statistical models is calculated as above, with the $p_\theta$ model substituted by each of the two models considered. Let $\hat\Delta_Z(f, z_{\hat\theta})$ be the sample discrepancy function estimated for the model with density $z_\theta$, and let $\hat\Delta_G(f, g_{\hat\theta})$ be the sample discrepancy estimated for the model with density $g_\theta$. Suppose that model g has the lower discrepancy, namely that $\hat\Delta_Z(f, z_{\hat\theta}) = \hat\Delta_G(f, g_{\hat\theta}) + \varepsilon$, where ε is a small positive number. Based on the comparison of the discrepancy functions, we would therefore choose the model with the density function $g_\theta$.

This result may depend on the specific sample used to estimate the discrepancy function. We therefore need to carry out a statistical test to verify whether the discrepancy difference is significant; that is, whether the results obtained from one sample can be extended to all possible samples. If we find that the difference ε is not significant, then the two models would be considered equal and it would be natural to choose the simpler one. The deviance difference criterion defined by $G^2$ (Section 4.12.2) is equal to twice the difference between sample Kullback–Leibler discrepancies. For nested models, the $G^2$ difference is asymptotically equivalent to the chi-squared comparison (and test). When a Gaussian distribution is assumed, the Kullback–Leibler discrepancy coincides with the Euclidean discrepancy, so F statistics can also be used in this context.
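As an illustration of the deviance difference criterion for nested models, the following sketch tests whether the extra parameters of a larger model are worthwhile; the deviances and parameter counts are entirely made up for this example.

```python
from scipy.stats import chi2

# Hypothetical G^2 deviances and parameter counts for two nested models
g2_simple,  k_simple  = 210.4, 3   # smaller model
g2_complex, k_complex = 203.1, 5   # larger model containing the smaller

# The deviance difference (twice the difference of sample KL
# discrepancies) is asymptotically chi-squared, with degrees of freedom
# equal to the difference in the number of parameters.
diff = g2_simple - g2_complex      # 7.3
df = k_complex - k_simple          # 2
p_value = chi2.sf(diff, df)
print(p_value)   # about 0.026: the extra parameters appear significant
```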

To conclude, a statistical test makes it possible to use the estimated discrepancy to make an accurate choice among the models. The disadvantage of this procedure is that it requires comparisons between pairs of models, so when we have a large number of alternative models we need to make heuristic choices regarding the comparison strategy (such as choosing among the forward, backward and stepwise criteria, whose results may diverge). Furthermore, we must assume a specific probability model, and this may not always be a reasonable assumption.
