
4.10 Non-parametric modelling

A parametric model is usually specified by making a hypothesis about the form of the underlying distribution and by assuming that this hypothesis is true. In practice, such a hypothesis can be difficult to justify or may simply be uncertain. One possible way to overcome this is to use non-parametric procedures, which eliminate the need to specify the form of the distribution in advance.

A non-parametric model only assumes that the observations come from a certain distribution function F, not specified by any parameters. But compared with parametric models, non-parametric models are more difficult to interpret and estimate. Semiparametric models are a compromise between parametric models and non-parametric models.

A non-parametric model can be characterised by its distribution function or by its density function, which must therefore be estimated from the data. First consider the estimate of the distribution function. A valid estimator is the empirical distribution function, usually denoted by S(x). Intuitively, it is the sample analogue of the distribution function F(x) of the random variable X. Formally, the empirical distribution function is calculated, at any point x, by taking the proportion of sample observations less than or equal to x,

$$S(x) = \frac{1}{n}\,\#\{x_i \le x\}.$$

It can be shown that the expected value of S(x) is F(x) and that
$$\mathrm{Var}\bigl(S(x)\bigr) = \frac{1}{n}\,F(x)\bigl(1 - F(x)\bigr).$$

Therefore the empirical distribution function is an unbiased estimator of F(x), and it is consistent since, for n → ∞, Var(S(x)) → 0, so that MSE(S(x)) → 0.
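As an illustration, here is a minimal sketch in Python (assuming only NumPy) of how the empirical distribution function can be evaluated at a point; the function name edf and the simulated data are purely illustrative.

```python
import numpy as np

def edf(sample, x):
    """Empirical distribution function: proportion of observations <= x."""
    sample = np.asarray(sample)
    return np.mean(sample <= x)

# Illustrative data: 200 draws from a standard normal distribution.
rng = np.random.default_rng(0)
sample = rng.normal(size=200)

# S(0) should be close to the true F(0) = Phi(0) = 0.5, in line with unbiasedness.
print(edf(sample, 0.0))
```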

The sample distribution function can be used to assess a parametric model's goodness of fit in an exploratory way. To evaluate the goodness of fit of a distribution function, we usually use the Kolmogorov–Smirnov distance, which leads to the well-known statistical test of the same name. In this test the null hypothesis refers to a particular distribution, which we shall call F*(x) (this distribution could be Gaussian, for example). Therefore we have

$$H_0: F(x) = F^*(x), \qquad H_1: F(x) \neq F^*(x).$$

To test H0 against H1 we consider the available random sample X1, . . . , Xn. The idea is to compare the observed distribution function, S(x), with the theoretical distribution function F*(x), evaluated at the observed values. The idea of Kolmogorov and Smirnov is simple and clever. Since S(x) estimates F(x), it is logical to base the test on a 'distance' between S(x) and F*(x): if S(x) and F*(x) are close enough (i.e. similar enough), the null hypothesis is accepted; otherwise it is rejected. But what kind of test statistic can we use to measure the discrepancy between S(x) and F*(x)? One of the simplest measures is the supremum of the vertical distance between the two functions. This is the statistic suggested by Kolmogorov:

$$T_1 = \sup_{-\infty < x < +\infty} \bigl| S(x) - F^*(x) \bigr|.$$

It relies on the use of the uniform distance, explained in Section 5.1. For 'high' values of T1 the null hypothesis is rejected, while for 'low' values it is accepted.

The logic of the T1 statistic is intuitive, but the calculation of its probability distribution is more complicated. Nevertheless, it can be shown that, under the null hypothesis, the distribution of the test statistic T1 does not depend on the functional form of F*(x). This distribution is tabulated and included in the main statistical packages. It is therefore possible to determine critical values for T1 and obtain a rejection region for the null hypothesis. Alternatively, it is possible to obtain p-values for the test. The Kolmogorov–Smirnov test is important in exploratory analysis. For example, when the QQ plot (Section 3.1) does not clearly indicate whether an empirical distribution is normal, we can check whether the distance between the normal distribution function and the empirical distribution function is large enough for normality to be rejected.
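A minimal sketch of this procedure in Python is given below; it assumes SciPy is available and uses simulated data, so the sample and the hypothesised standard normal are purely illustrative. The statistic is computed both directly, as the supremum distance between S(x) and F*(x), and with scipy.stats.kstest.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=100)   # illustrative sample

# Direct computation of T1 = sup_x |S(x) - F*(x)| for a standard normal F*.
xs = np.sort(x)
n = len(xs)
cdf = stats.norm.cdf(xs)                        # F*(x) at the ordered observations
s_upper = np.arange(1, n + 1) / n               # S(x) just after each jump
s_lower = np.arange(0, n) / n                   # S(x) just before each jump
t1 = max(np.max(s_upper - cdf), np.max(cdf - s_lower))

# The same statistic and its p-value from SciPy.
ks_stat, p_value = stats.kstest(x, 'norm')
print(t1, ks_stat, p_value)                     # t1 and ks_stat should coincide
```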

Figure 4.10 illustrates how the Kolmogorov–Smirnov statistic works.

Figure 4.10  The Kolmogorov–Smirnov statistic: T1 is the largest vertical distance between the empirical distribution function S(x) and the hypothesised distribution function F*(x).

The simplest type of density estimator is the histogram. A histogram assigns a constant density to each interval class. This density is easily calculated by taking the relative frequency of observations in the class and dividing it by the class width. For continuous densities, the histogram can be interpolated by joining all midpoints of the top segment of each bar. However, histograms can depend heavily on the choice of the classes, as well as on the sample, especially when the sample is small. Kernel estimators represent a more refined class of density estimators. They are descriptive models that work locally, in close analogy with nearest-neighbour models (Section 4.7). Consider a continuous random variable X, with observed values x1, . . . , xn, and a kernel density function K with bandwidth h. The estimated density function at any point x is

$$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{h} \right).$$

In practice the kernel function is usually chosen as a unimodal function with its mode at zero. A common choice is to take a normal distribution for the random variable x − xi, with zero mean and variance equal to h2, the square of the bandwidth. The quality of a kernel estimate then depends on a good choice of the smoothing parameter h. The choice of h reflects the trade-off between parsimony and goodness of fit that we have already encountered: a low value of h fits the estimated density very locally, possibly on the basis of a single data point, while a high value leads to a global estimate that smooths the data too much. It is quite difficult to establish what a good value of h should be. One possibility is to use computationally intensive methods, such as cross-validation techniques: the training sample is used to fit the density, and the validation sample to calculate the likelihood of the estimated density. A value of h can then be chosen that leads to a high likelihood.
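The following sketch illustrates this idea in Python using scikit-learn; the Gaussian kernel, the grid of candidate bandwidths and the simulated data are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(2)
x = rng.normal(size=300).reshape(-1, 1)   # illustrative univariate sample

# Choose the bandwidth h by cross-validation: each candidate h is scored by
# the log-likelihood of the held-out folds under the fitted density.
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': np.linspace(0.1, 1.0, 10)},
                    cv=5)
grid.fit(x)
h = grid.best_params_['bandwidth']

# Evaluate the estimated density on a grid of points.
xs = np.linspace(-3, 3, 50).reshape(-1, 1)
density = np.exp(grid.best_estimator_.score_samples(xs))
print(h, density[:5])
```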

Estimating high-dimensional density functions is more difficult, but kernel methods can still be applied. Replacing the univariate normal kernel with a multivariate normal kernel yields a viable multivariate density estimator. Another approach is to assume that the joint density is the product of univariate kernels. However, the problem is that, as the number of variables increases, observations tend to lie farther apart, so that few data points fall within any given bandwidth. This parallels what happens with nearest-neighbour models. Indeed, both are memory-based, and the main difference is in their goals: kernel models are descriptive whereas nearest-neighbour models are predictive.
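As a sketch of the product-of-univariate-kernels approach just mentioned, the following Python fragment estimates a joint density by averaging, over the observations, the product of one-dimensional Gaussian kernels applied to each coordinate; the bivariate simulated data and the single common bandwidth are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def product_kernel_density(data, x, h):
    """Product-of-univariate-Gaussian-kernels estimate of the joint density at x."""
    data = np.asarray(data)                       # shape (n, d)
    x = np.asarray(x)                             # shape (d,)
    # For each observation, multiply the univariate kernels over the d coordinates,
    # then average over the n observations.
    kernels = stats.norm.pdf((x - data) / h) / h  # shape (n, d)
    return np.mean(np.prod(kernels, axis=1))

rng = np.random.default_rng(3)
data = rng.normal(size=(500, 2))                  # illustrative bivariate sample
# True standard bivariate normal density at the origin is 1/(2*pi) ~ 0.159.
print(product_kernel_density(data, np.array([0.0, 0.0]), h=0.4))
```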

Kernel methods can be seen as a useful model for summarising a low-dimensional data set in a non-parametric way. This can be a helpful step towards the construction of a parametric model, for instance.

The most important semiparametric models are mixture models. These models are suited to situations where the data set can be clustered into groups of observations, each with a different parametric form. The model is semiparametric because the number of groups, and hence the number of distributions to consider, is unknown. The general form of a finite mixture distribution for a random variable X is

$$f(x) = \sum_{i=1}^{g} w_i\, f_i(x; \theta_i),$$

where wi is the probability that an observation comes from the ith population, with density fi and parameter vector θi. Usually the density functions are all of the same form (often normal), and this simplifies the analysis. A similar technique can be applied to a random vector X. The model can be used for (model-based) probabilistic cluster analysis. Its advantage is that cluster analysis is carried out within a coherent probabilistic framework, allowing us to draw conclusions based on inferential results rather than on heuristics. Its disadvantage is that the procedure is structurally complex and possibly time-consuming. The model can choose the number of components (clusters) and estimate the parameters of each population, as well as the weight probabilities, all at the same time. The most challenging aspect is usually estimating the number of components, as mixture models are non-nested, so a log-likelihood ratio test cannot be applied. Other methods are used instead, such as AIC, BIC, cross-validation and Bayesian methods (Chapter 5). Once the number of components is found, the unknown parameters are estimated by maximum likelihood or Bayesian methods.
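A common way to put this into practice is sketched below in Python with scikit-learn's GaussianMixture; the use of BIC, the candidate range of 1 to 6 components and the simulated two-group data are illustrative choices, not the only options mentioned in the text.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative data: two well-separated normal clusters.
rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-2.0, 1.0, size=(150, 1)),
                    rng.normal(3.0, 1.0, size=(150, 1))])

# Fit mixtures with g = 1, ..., 6 components and keep the one with the lowest BIC.
models = [GaussianMixture(n_components=g, random_state=0).fit(x) for g in range(1, 7)]
best = min(models, key=lambda m: m.bic(x))

print(best.n_components)          # estimated number of clusters (expected: 2)
print(best.weights_)              # estimated weight probabilities w_i
print(best.means_.ravel())        # estimated component means
```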
