



4.12 Generalised linear models

4.12.2 Definition of generalised linear models

A generalised linear model takes a function of the mean value of the response variable and relates it to the explanatory variables through an equation of linear form. It is specified by three components: a random component, which identifies the response variable Y and assumes a probability distribution for it; a systematic component, which specifies the explanatory variables used as predictors in the model; and a link function, which describes the functional relation between the systematic component and the mean value of the random component.

Random component

For a sample of size n, the random component of a generalised linear model is described by the sample random variables Y_1, ..., Y_n; these are independent, each has a distribution in exponential family form that depends on a single parameter θ_i, and each is described by the density function

f(y_i; \theta_i) = \exp\{ y_i \, b(\theta_i) + c(\theta_i) + d(y_i) \}.

All the distributions for the Y_i have to be of the same form (e.g. all normal or all binomial), but their θ_i parameters do not have to be the same.
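
As a concrete check of this definition (a worked example we add here, not in the original text), take Y_i to be Poisson with mean θ_i. Writing its density in the exponential family form above,

f(y_i; \theta_i) = \frac{\theta_i^{y_i} e^{-\theta_i}}{y_i!} = \exp\{ y_i \log\theta_i - \theta_i - \log y_i! \},

so that b(θ_i) = log θ_i, c(θ_i) = −θ_i and d(y_i) = −log y_i!. The natural parameter b(θ_i) = log θ_i is precisely the canonical log link reported for the Poisson in Table 4.5 below.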

Systematic component

The systematic component specifies the explanatory variables and their roles in the model. It is given by a linear combination

\eta = \beta_1 x_1 + \cdots + \beta_p x_p = \sum_{j=1}^{p} \beta_j x_j .

The linear combination η is called the linear predictor. The X_j represent the covariates, whose values are known (e.g. they can be derived from the data matrix).

The β_j are the parameters that describe the effect of each explanatory variable on the response variable. The values of the parameters are generally unknown and have to be estimated from the data. The systematic part can be written in the following form:

\eta_i = \sum_{j=1}^{p} \beta_j x_{ij}, \qquad i = 1, \ldots, n,

where x_{ij} is the value of the jth explanatory variable for the ith observation. In matrix form, we have

\eta = X\beta,

where η is a vector of order n × 1, X is a matrix of order n × p, called the model matrix, and β is a vector of order p × 1, called the parameter vector.
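
As a small numerical illustration (our own sketch; all values below are arbitrary), the linear predictor is just a matrix-vector product:

```python
import numpy as np

# Illustrative model matrix X (n = 4, p = 3: a column of ones for the
# intercept plus two covariates) and parameter vector beta.
X = np.array([
    [1.0, 0.5, 2.0],
    [1.0, 1.5, 0.0],
    [1.0, 2.0, 1.0],
    [1.0, 0.0, 3.0],
])
beta = np.array([0.2, -0.4, 0.7])

eta = X @ beta        # linear predictor eta = X beta, one value per observation
print(eta)
```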

Link function

The third component of a generalised linear model specifies the link between the random component and the systematic component. Let the mean value of Y_i be denoted by

\mu_i = E(Y_i), \qquad i = 1, \ldots, n.

The link function specifies which function of μ_i depends linearly on the explanatory variables through the systematic component η_i. Let g(μ_i) be a (monotone and differentiable) function of μ_i. The link function is defined by

g(\mu_i) = \eta_i = \sum_{j=1}^{p} \beta_j x_{ij}, \qquad i = 1, \ldots, n.

Table 4.5 Main canonical links.

  Distribution   Canonical link
  Normal         g(μ_i) = μ_i
  Binomial       g(π_i) = log[π_i / (1 − π_i)]
  Poisson        g(μ_i) = log μ_i

In other words, the link function describes how the explanatory variables affect the mean value of the response variable, that is, through the (not necessarily linear) function g. How do we choose g? In practice, the most commonly used link functions are canonical: they define the natural parameter of the particular distribution as a function of the mean response value. Table 4.5 shows the canonical links for the three important distributions in Section 4.12.1. The same table can be used to derive the most important examples of generalised linear models.

The simplest link function is the normal one. It directly models the mean value through the identity link η_i = μ_i, thereby specifying a linear relationship between the mean response value and the explanatory variables:

\mu_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}.

The normal distribution and the identity link give rise to the normal linear model for continuous response variables (Section 4.11).

The binomial link function models the logarithm of the odds as a linear function of the explanatory variables:

\log\left( \frac{\mu_i}{1 - \mu_i} \right) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}, \qquad i = 1, \ldots, n.

This type of link, called the logit link, is appropriate for binary response variables, as in the binomial model. A generalised linear model that uses the binomial distribution and the logit link is a logistic regression model (Section 4.4). For a binary response variable, econometricians often use the probit link, which is not canonical. This assumes that

\Phi^{-1}(\mu_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}, \qquad i = 1, \ldots, n,

where Φ^{-1} is the inverse of the cumulative normal distribution function.
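
As an illustration (our own sketch, with simulated data; statsmodels is one library choice and none of the names below come from the book), both links can be fitted as GLMs:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = sm.add_constant(rng.normal(size=(n, 2)))   # intercept plus two covariates
beta = np.array([-0.5, 1.0, 0.8])
p = 1.0 / (1.0 + np.exp(-X @ beta))            # logistic mean function
y = rng.binomial(1, p)                         # simulated binary response

logit_fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()  # canonical logit link
probit_fit = sm.GLM(
    y, X, family=sm.families.Binomial(link=sm.families.links.Probit())
).fit()
print(logit_fit.params)
print(probit_fit.params)
```

The two fits usually give similar fitted probabilities; the coefficients differ in scale because the logistic and normal distribution functions have different variances.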

The Poisson canonical link function specifies a linear relationship between the logarithm of the mean response value and the explanatory variables:

\log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}, \qquad i = 1, \ldots, n.

A generalised linear model that uses the Poisson distribution and a logarithmic link is a log-linear model; it constitutes the main data mining tool for describing associations between the available variables (see Section 4.13).
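
A corresponding sketch of a Poisson fit with the canonical log link, under the same assumptions as the previous example (simulated data, illustrative names):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(300, 1)))   # intercept plus one covariate
mu = np.exp(X @ np.array([0.3, 0.5]))            # log link: log(mu) is linear in x
y = rng.poisson(mu)                              # simulated count response

loglin_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()  # canonical log link
print(loglin_fit.params)
```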

Inferential results

We now consider inferential results that hold for the whole class of generalised linear models; we will apply them to logistic regression and log-linear models.

Parameter estimates are usually obtained using the method of maximum likelihood. The method computes the derivative of the log-likelihood with respect to each coefficient in the parameter vector β and sets it equal to zero, similarly to the linear regression context in Section 4.3. But unlike what happens with the normal linear model, the resultant system of equations is non-linear in the parameters and does not generally have a closed-form solution. So to obtain maximum likelihood estimators of β we need to use iterative numerical methods, such as the Newton–Raphson method or Fisher's scoring method; for more details, see Agresti (1990) or Hand et al. (2001).
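
To make the iteration concrete, here is a minimal Newton–Raphson sketch for the logistic regression case (our own illustration of the standard update, not code from the book; for the canonical logit link it coincides with Fisher scoring):

```python
import numpy as np

def logistic_newton_raphson(X, y, tol=1e-8, max_iter=50):
    """Maximum likelihood for logistic regression by Newton-Raphson.

    X: model matrix (n x p, first column of ones for the intercept).
    y: binary response vector of length n.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities
        score = X.T @ (y - mu)                 # gradient of the log-likelihood
        W = mu * (1.0 - mu)                    # weights: -X'WX is the Hessian
        step = np.linalg.solve(X.T @ (W[:, None] * X), score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:         # stop when the update is negligible
            break
    return beta
```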

Once the parameter vector β is estimated, its significance is usually assessed by hypothesis testing. We will now see how to verify the significance of each parameter in the model; later we will compute the overall significance of a model in the context of model comparison. Consider testing the null hypothesis H0 : β_i = 0 against the alternative H1 : β_i ≠ 0. A rejection region for H0 can be defined using the asymptotic procedure known as Wald's test. If the sample size is sufficiently large, the statistic

Z = \frac{\hat\beta_i}{\sigma(\hat\beta_i)}

is approximately distributed as a standardised normal, where σ(β̂_i) denotes the standard error of the estimator in the numerator. Therefore, to decide whether to accept or reject the null hypothesis, we can construct the rejection region

R = \{ |Z| \ge z_{1-\alpha/2} \},

where z_{1−α/2} is the 100(1 − α/2) percentile of the standard normal distribution.

Alternatively, we can find the p-value and see whether it is less than a predefined significance level (e.g. α = 0.05); if p < α then H0 is rejected. For large samples, the square of the Wald statistic Z has a chi-squared distribution with 1 degree of freedom, so the test can equivalently be carried out by constructing a rejection region or assessing a p-value on the χ²_1 scale.
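
To make the computation explicit, a minimal sketch (the estimate and standard error below are made-up numbers; scipy is our choice of tool, not the book's):

```python
from scipy.stats import norm

# Wald test of H0: beta_i = 0 from an estimate and its standard error
# (both values are purely illustrative).
beta_hat, se = 0.84, 0.31
z = beta_hat / se                            # Wald statistic
p_value = 2 * norm.sf(abs(z))                # two-sided p-value
reject = abs(z) >= norm.ppf(1 - 0.05 / 2)    # rejection region at alpha = 0.05
print(z, p_value, reject)
```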

Rao's score statistic, often used as an alternative to the Wald statistic, is based on the derivative of the observed log-likelihood function evaluated at the parameter value set by the null hypothesis, β_i = 0. Since this derivative is zero at the point of maximum likelihood, the absolute value of the score statistic tends to increase as the maximum likelihood estimate β̂_i gets more distant from zero. The score statistic is equal to the square of the ratio between the derivative and its standard error, and it is also asymptotically distributed as chi-squared with 1 degree of freedom. For more details on hypothesis testing in generalised linear models, see Dobson (2002), McCullagh and Nelder (1989) or Azzalini (1992).

Model comparison

Fitting a model to data can be interpreted as a way of replacing a set of observed data values with a set of estimated values obtained during the fitting process.

The number of parameters in the model is generally much lower than the number of observations in the data. We can use these estimated values to predict future values of the response variable from future values of the explanatory variables. In general the fitted values, say μ̂_i, will not be exactly equal to the observed values, y_i. The problem is to establish the distance between the μ̂_i and the y_i. In Chapter 5 we will start from this simple concept of distance between observed values and fitted values, then show how it is possible to construct statistical measures to compare statistical models and, more generally, data mining methods. In this section we will consider the deviance and the Pearson statistic, two measures for comparing the goodness of fit of different generalised linear models.

The first step in evaluating a model's goodness of fit is to compare it with the models that produce the best fit and the worst fit. The best-fit model is called the saturated model; it has as many parameters as observations (n), attributes the whole response variability to the systematic component, and therefore leads to a perfect fit. The worst-fit model is called the null model; it has only one intercept parameter and leaves all response variability unexplained. In practice the null model is too simple and the saturated model is not informative, because it merely reproduces the observations. However, the saturated model is a useful benchmark when measuring the goodness of fit of a model with p parameters. The resulting quantity is called the deviance, and it is defined as follows for a model M (with p parameters) in the class of generalised linear models:

G^2(M) = -2 \log \left\{ \frac{L(\hat\beta(M))}{L(\hat\beta(M^*))} \right\},

where the quantity in the numerator is the likelihood function, calculated using the maximum likelihood parameter estimates under model M, denoted by β̂(M); and the quantity in the denominator is the likelihood function of the observations, calculated using the maximum likelihood parameter estimates under the saturated model M*. The expression in curly brackets is called the likelihood ratio, and it is an intuitive way to compare two models in terms of the likelihood they receive from the observed data. Multiplying the natural logarithm of the likelihood ratio by −2, we obtain the maximum likelihood ratio test statistic.

The asymptotic distribution of the G^2 statistic (under H0) is known: for a large sample size, G^2(M) is approximately distributed as chi-squared with n − k degrees of freedom, where n is the number of observations and k is the number of estimated parameters under model M, corresponding to the number of explanatory variables plus one (the intercept). The logic behind the use of G^2 is as follows. If the model M being considered is good, the value of its maximised likelihood will be close to that of the maximised likelihood under the saturated model M*. Therefore 'small' values of G^2 indicate a good fit.

The asymptotic distribution of G^2 can provide a threshold beneath which the simplified model M is declared valid. Alternatively, the significance can be evaluated through the p-value associated with G^2; in practice the p-value represents the area to the right of the observed value of G^2 in the χ²_{n−k} distribution. A model M is considered valid when the observed p-value is large (e.g. greater than 5%). The value of G^2 alone is not sufficient to judge a model, because G^2 decreases as more parameters are introduced, similarly to the way R^2 increases in the regression model. However, since the threshold value also generally decreases with the number of parameters, by comparing G^2 with the threshold we can reach a compromise between goodness of fit and model parsimony.

The overall significance of a model can also be evaluated by comparing it against the null model. This can be done by taking the difference in deviance between the model considered and the null model to obtain the statistic

D = -2 \log \left\{ \frac{L(\hat\beta(M_0))}{L(\hat\beta(M))} \right\}.

Under the null hypothesis that the null model is true, D is asymptotically distributed as χ²_p, where p is the number of explanatory variables in model M. This can be obtained by noting that D = G^2(M_0) − G^2(M) and recalling that the two deviances are independent and asymptotically distributed as chi-squared random variables. From the additive property of the chi-squared distribution, it follows that the degrees of freedom of D are (n − 1) − (n − p − 1) = p.

The model considered is accepted (i.e. the null model specified by the null hypothesis is rejected) if the p-value is small, which is equivalent to the difference D between the log-likelihoods being large. Rejection of the null hypothesis implies that at least one parameter in the systematic component is significantly different from zero. Statistical software often gives the maximised log-likelihood for each fitted model, in the form −2 log(L(β̂(M))), which can be seen as a score of the model under consideration. To obtain the deviance, this should be normalised by subtracting −2 log(L(β̂(M*))). On the other hand, D is often given with the corresponding p-value. This is analogous to the statistic in Result 4 in Section 4.11.1 (page 114).
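
With statsmodels, for instance, these quantities can be read off a fitted model directly (a sketch continuing from the logistic regression fit above; the attribute names are those of statsmodels' GLM results):

```python
from scipy.stats import chi2

# logit_fit is the fitted GLM from the earlier sketch.
G2 = logit_fit.deviance                           # deviance G^2(M) of the fitted model
D = logit_fit.null_deviance - logit_fit.deviance  # D = G^2(M0) - G^2(M)
p_value = chi2.sf(D, logit_fit.df_model)          # df = number of explanatory variables
print(G2, D, p_value)
```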

More generally, following the same logic used in the derivation of D, any two models can be compared in terms of their deviances. If the two models are nested (i.e. the systematic component of one is obtained by eliminating some terms from the other), the difference between their deviances is asymptotically distributed as chi-squared with p − q degrees of freedom, where p − q is the number of variables present in the larger model (which has p parameters) but excluded from the simpler one (which has q parameters). If the difference between the two deviances is large with respect to the critical value, the simpler model is rejected in favour of the more complex model, and similarly when the p-value is small.

For the whole class of generalised linear models, it is possible to employ a formal procedure to search for the best model. As with linear models, this procedure usually takes the form of forward selection, backward elimination or a stepwise search.

When the data analysed are categorical, or have been discretised to be such, an alternative to G^2 is Pearson's X^2:

X^2 = \sum_i \frac{(o_i - e_i)^2}{e_i},

where, for each category i, o_i represents the observed frequencies and e_i the frequencies expected according to the model under examination. As with the deviance G^2, we are comparing the fitted model (which corresponds to the e_i) and the saturated model (which corresponds to the o_i). However, the distance function is not based on the likelihood, but on a direct comparison between observed and fitted values in each category. Notice that this statistic generalises the Pearson X^2 distance measure in Section 3.4, where the fitted model specialised to the model under which the two categorical variables were independent.

The Pearson statistic is asymptotically equivalent to G^2; therefore, under H0, X^2 ∼ χ²_{n−k}. The choice between G^2 and X^2 depends on the goodness of the chi-squared approximation. In general, X^2 is less affected by small frequencies, particularly when they occur in large data sets – data sets having many variables. The advantage of G^2 lies in its additivity, so it generalises easily to any pairwise model comparison, allowing us to adopt a model selection strategy.
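
For categorical data the statistic is straightforward to compute by hand; a minimal sketch with made-up frequencies (the degrees of freedom below are likewise illustrative):

```python
import numpy as np
from scipy.stats import chi2

# Observed and model-fitted (expected) frequencies per category; illustrative values.
o = np.array([30.0, 45.0, 15.0, 10.0])
e = np.array([28.2, 47.5, 13.8, 10.5])

X2 = np.sum((o - e) ** 2 / e)      # Pearson's X^2 statistic
p_value = chi2.sf(X2, df=2)        # df = n - k as in the text; 2 is illustrative
print(X2, p_value)
```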

The statistics G^2 and X^2 indicate a model's overall goodness of fit; further diagnostic analysis is needed to look for local lack of fit. Before fitting a model, it may be extremely useful to try some graphical representations. For example, we could plot the observed frequencies of the various categories, or functions of them, against the explanatory variables. It is possible to draw dispersion diagrams and fit straight lines for the response variable transformation described by the canonical link (e.g. the logit function for logistic regression). This can be useful in verifying whether the hypotheses behind the generalised linear model are satisfied. If not, the graph itself may suggest further transformations of the response variable or the explanatory variables. Once a model is chosen as the best and fitted to the data, our main diagnostic tool is to analyse the residuals.

Unlike what happens in the normal linear model, for generalised linear models there are different definitions of the residuals. Here we consider the deviance residuals that are often used in applications. For each observation, the residual from the deviance is defined by the quantity

d_{r_i} = \mathrm{sign}(y_i - \hat\mu_i) \sqrt{d_i},

where d_i denotes the contribution of the ith observation to the overall deviance. This quantity grows in absolute value with the difference between the observed and fitted values of the response variable, y_i − μ̂_i, and it is defined so that

\sum_i d_{r_i}^2 = G^2.

In a good model, the deviance residuals should be randomly distributed around zero, so a useful check is to plot the deviance residuals against the fitted values: for a good model, the points in the plane should show no evident trend.
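
Such a plot is immediate from a fitted statsmodels GLM (a sketch reusing the logistic fit above; resid_deviance and fittedvalues are statsmodels attribute names):

```python
import matplotlib.pyplot as plt

# logit_fit is the fitted GLM from the earlier sketch.
plt.scatter(logit_fit.fittedvalues, logit_fit.resid_deviance, s=10)
plt.axhline(0.0, linestyle="--")      # residuals should scatter around zero
plt.xlabel("Fitted values")
plt.ylabel("Deviance residuals")
plt.show()
```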
