5.4.2 Definition of generalised linear models

A generalised linear model relates a function of the mean value of the response variable to the explanatory variables through an equation of linear form. It is specified by three components: a random component, which identifies the response variable $Y$ and assumes a probability distribution for it; a systematic component, which specifies the explanatory variables used as predictors in the model; and a link function, which describes the functional relation between the systematic component and the mean value of the random component.

Random component

For a sample of size $n$, the random component of a generalised linear model is described by the sample random variables $Y_1, \ldots, Y_n$; these are independent, each has a distribution in exponential family form that depends on a single parameter $\theta_i$, and each is described by the density function

$f(y_i; \theta_i) = \exp[\,y_i b(\theta_i) + c(\theta_i) + d(y_i)\,]$

All the distributions for the $Y_i$ have to have the same form (e.g. all normal or all binomial) but their $\theta_i$ parameters do not have to be the same.
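As a concrete illustration (a standard example added here, not from the original text), the Poisson distribution with mean $\theta$ can be written in exactly this form:

$f(y; \theta) = \dfrac{\theta^{y} e^{-\theta}}{y!} = \exp[\,y \log\theta - \theta - \log y!\,]$

so that $b(\theta) = \log\theta$, $c(\theta) = -\theta$ and $d(y) = -\log y!$; the natural parameter is $\log\theta$.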

Systematic component

The systematic component specifies the explanatory variables and their roles in the model. It is specified by a linear combination:

$\eta = \beta_1 x_1 + \cdots + \beta_p x_p = \sum_{j=1}^{p} \beta_j x_j$

The linear combination $\eta$ is called the linear predictor. The $x_j$ represent the covariates, whose values are known (e.g. they can be derived from the data matrix). The $\beta_j$ are the parameters that describe the effect of each explanatory variable on the response variable. The values of the parameters are generally unknown and have to be estimated from the data. The systematic part can be written in the following form:

$\eta_i = \sum_{j=1}^{p} \beta_j x_{ij} \qquad (i = 1, \ldots, n)$

where $x_{ij}$ is the value of the $j$th explanatory variable for the $i$th observation. In matrix form it is

$\eta = X\beta$

where $\eta$ is a vector of order $n \times 1$, $X$ is a matrix of order $n \times p$, called the model matrix, and $\beta$ is a vector of order $p \times 1$, called the parameter vector.
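A minimal numerical sketch of the matrix form (the model matrix and parameter vector below are hypothetical, and NumPy is assumed):

```python
import numpy as np

# Hypothetical model matrix X of order n x p (n = 4 observations,
# an intercept column plus p - 1 = 2 covariates).
X = np.array([[1.0, 2.0, 0.5],
              [1.0, 1.0, 1.5],
              [1.0, 3.0, 2.0],
              [1.0, 0.5, 0.0]])

# Hypothetical parameter vector beta of order p x 1.
beta = np.array([0.2, 1.0, -0.5])

# Linear predictor eta = X beta, a vector of order n x 1.
eta = X @ beta
print(eta)  # [1.95, 0.45, 2.2, 0.7]
```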

Link function

The third component of a generalised linear model specifies the link between the random component and the systematic component. Let the mean value of $Y_i$ be denoted by

$\mu_i = E(Y_i) \qquad (i = 1, \ldots, n)$

The link function specifies which function of $\mu_i$ depends linearly on the explanatory variables through the systematic component $\eta_i$. Let $g(\mu_i)$ be a monotone and differentiable function of $\mu_i$. The link function is defined by

$g(\mu_i) = \eta_i = \sum_{j=1}^{p} \beta_j x_{ij} \qquad (i = 1, \ldots, n)$

In other words, the link function describes how the explanatory variables affect the mean value of the response variable, namely through the function $g$. How do we choose $g$? In practice the more commonly used link functions are canonical: they define the natural parameter of the particular distribution as a function of the mean response value. Table 5.5 shows the canonical link functions for the three important distributions in Section 5.4.1. The same table can be used to derive the most important examples of generalised linear models. The simplest link function is the normal one. It models the mean value directly through the identity link $\eta_i = \mu_i$, thereby specifying a linear relationship between the mean response value and the explanatory variables:

$\mu_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$

The normal distribution and the identity link give rise to the normal linear model for continuous response variables (Section 5.3).

Table 5.5 Main canonical links.

Distribution   Canonical link
Normal         $g(\mu_i) = \mu_i$
Binomial       $g(\mu_i) = \log[\pi_i/(1-\pi_i)]$
Poisson        $g(\mu_i) = \log \mu_i$

The binomial link function models the logarithm of the odds as a linear function of the explanatory variables:

$\log\left(\dfrac{\mu_i}{1-\mu_i}\right) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} \qquad (i = 1, \ldots, n)$

This type of link is called a logit link and is appropriate for binary response variables, as in the binomial model. A generalised linear model that uses the binomial distribution and the logit link is a logistic regression model (Section 4.4). For a binary response variable, econometricians often use the probit link, which is not canonical. They assume that

$\Phi^{-1}(\mu_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} \qquad (i = 1, \ldots, n)$

where $\Phi^{-1}$ is the inverse of the cumulative normal distribution function.
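As a sketch of how the two links can be fitted in practice (assuming the statsmodels library; the synthetic data and coefficients are hypothetical, not from the text):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic binary data: intercept plus two covariates.
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))
true_beta = np.array([0.5, 1.0, -1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))

# Canonical logit link: a logistic regression model.
logit_fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()

# Non-canonical probit link, as often used by econometricians.
probit_link = sm.families.links.Probit()
probit_fit = sm.GLM(y, X, family=sm.families.Binomial(link=probit_link)).fit()

print(logit_fit.params)
print(probit_fit.params)
```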

The Poisson canonical link function specifies a linear relationship between the logarithm of the mean response value and the explanatory variables:

$\log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} \qquad (i = 1, \ldots, n)$

A generalised linear model that uses the Poisson distribution and a logarithmic link is a log-linear model; it constitutes the main data mining tool for describing associations between the available variables. It will be examined in Section 5.5.
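A corresponding sketch for the log-linear case, again with hypothetical synthetic data and statsmodels assumed:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic count data with mean mu = exp(X beta), i.e. a log link.
rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(200, 1)))
y = rng.poisson(np.exp(X @ np.array([0.3, 0.7])))

# The log link is the canonical (and default) link for the Poisson family.
poisson_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(poisson_fit.params)
```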

Inferential results

We now consider inferential results that hold for the whole class of generalised linear models; we will apply them to logistic regression and log-linear models.

Parameter estimates are usually obtained using the method of maximum likelihood. The method computes the derivative of the log-likelihood with respect to each coefficient in the parameter vector $\beta$ and sets it equal to zero, as in the linear regression context of Section 4.3. But unlike what happens with the normal linear model, the resulting system of equations is non-linear in the parameters and does not generally have a closed-form solution. So to obtain maximum likelihood estimators of $\beta$, we need to use iterative numerical methods, such as the Newton–Raphson method or Fisher's scoring method; for more details see Agresti (1990) or Hand, Mannila and Smyth (2001).
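To make the iterative idea concrete, here is a minimal sketch of Fisher scoring (equivalently, iteratively reweighted least squares) for the logistic regression case; it illustrates the general scheme under assumed notation, and is not the text's own algorithm:

```python
import numpy as np

def fit_logistic_mle(X, y, max_iter=25, tol=1e-8):
    """Fisher scoring for a binomial GLM with logit link.

    X is the n x p model matrix (including the intercept column),
    y the 0/1 response vector.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))   # inverse logit: fitted probabilities
        w = mu * (1.0 - mu)               # binomial variance function
        score = X.T @ (y - mu)            # derivative of the log-likelihood
        info = X.T @ (w[:, None] * X)     # Fisher information matrix
        step = np.linalg.solve(info, score)
        beta = beta + step                # Newton/Fisher update
        if np.max(np.abs(step)) < tol:    # stop when the update is negligible
            break
    return beta
```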

Once the parameter vector $\beta$ is estimated, its significance is usually assessed by hypothesis testing. We will now see how to verify the significance of each parameter in the model; later we will assess the overall significance of a model in the context of model comparison. Consider testing the null hypothesis $H_0: \beta_i = 0$ against the alternative $H_1: \beta_i \neq 0$. A rejection region for $H_0$ can be defined using the asymptotic procedure known as Wald's test. If the sample size is sufficiently large, the statistic

$Z = \dfrac{\hat{\beta}_i}{\sigma(\hat{\beta}_i)}$

is approximately distributed as a standardised normal; here $\sigma(\hat{\beta}_i)$ indicates the standard error of the estimator in the numerator. Therefore, to decide whether to accept or reject the null hypothesis, we can build the rejection region

$R = \{\,|Z| \geq z_{1-\alpha/2}\,\}$

where $z_{1-\alpha/2}$ is the $100(1-\alpha/2)$ percentile of a standard normal distribution. Alternatively, we can find the p-value and see whether it is less than a predefined significance level (e.g. $\alpha = 0.05$); if $p < \alpha$ then $H_0$ is rejected. For large samples, the square of the Wald statistic $Z$ has a chi-squared distribution with 1 degree of freedom (Section 5.1), so we can equivalently build a rejection region or assess a p-value on the chi-squared scale.
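A small numerical sketch of Wald's test, with a hypothetical estimate and standard error (SciPy assumed):

```python
from scipy import stats

beta_hat, se = 0.84, 0.31              # hypothetical estimate and standard error

z = beta_hat / se                      # Wald statistic
p_normal = 2 * stats.norm.sf(abs(z))   # two-sided p-value from the standard normal
p_chisq = stats.chi2.sf(z**2, df=1)    # identical p-value via Z^2 ~ chi-squared(1)

print(z, p_normal, p_chisq)            # reject H0 at level alpha if p < alpha
```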

Rao's score statistic, often used as an alternative to the Wald statistic, is based on the derivative of the observed log-likelihood function evaluated at the parameter value set by the null hypothesis, $\beta_i = 0$. Since the derivative is zero at the point of maximum likelihood, the absolute value of the score statistic tends to increase as the maximum likelihood estimate $\hat{\beta}_i$ gets more distant from zero. The score statistic is equal to the square of the ratio between the derivative and its standard error, and it is also asymptotically distributed as chi-squared with 1 degree of freedom. For more details on hypothesis testing in generalised linear models, see McCullagh and Nelder (1989), Dobson (1990) or Azzalini (1992).

Model comparison

Fitting a model to data can be interpreted as a way of replacing a set of observed data values with a set of estimated values obtained during the fitting process. The number of parameters in the model is generally much lower than the number of observations in the data. We can use these estimated values to predict future values of the response variable from future values of the explanatory variables.

In general the fitted values, say $\hat{\mu}_i$, will not be exactly equal to the observed values $y_i$. The problem is to establish the distance between the $\hat{\mu}_i$ and the $y_i$. In Chapter 6 we will start from this simple concept of distance between observed and fitted values, then show how it is possible to build statistical measures to compare statistical models and, more generally, data mining methods. In this section we consider the deviance and the Pearson statistic, two measures for comparing the goodness of fit of different generalised linear models.

The first step in evaluating a model's goodness of fit is to compare it with the models that produce the best fit and the worst fit. The best-fit model is called the saturated model; it has as many parameters as observations, attributes the whole response variability to the systematic component, and therefore leads to a perfect fit. The worst-fit model is called the null model; it has only one intercept parameter and leaves all response variability unexplained. In practice the null model is too simple, and the saturated model is not informative because it merely reproduces the observations. However, the saturated model is a useful benchmark when measuring the goodness of fit of a model with $p$ parameters. The resulting quantity is called the deviance, and for a model $M$ (with $p$ parameters) in the class of generalised linear models it is defined as follows:

$G^2(M) = -2 \log \dfrac{L(\hat{\beta}(M))}{L(\hat{\beta}(M^*))}$

where the quantity in the numerator is the likelihood function calculated using the maximum likelihood parameter estimates under model $M$, indicated by $\hat{\beta}(M)$, and the quantity in the denominator is the likelihood function of the observations, calculated using the maximum likelihood parameter estimates under the saturated model, here denoted $M^*$. The ratio inside the logarithm is the likelihood ratio, an intuitive way to compare two models in terms of the likelihood they receive from the observed data. Multiplying the natural logarithm of the likelihood ratio by $-2$, we obtain the maximum likelihood ratio test statistic.

The asymptotic distribution of the $G^2$ statistic (under $H_0$) is known: for a large sample size, $G^2(M)$ is approximately distributed as chi-squared with $n-k$ degrees of freedom, where $n$ is the number of observations and $k$ is the number of parameters estimated under model $M$, corresponding to the number of explanatory variables plus one (the intercept). The logic behind the use of $G^2$ is as follows: if the model $M$ being considered is good, the value of its maximised likelihood will be close to that of the maximised likelihood under the saturated model $M^*$. Therefore 'small' values of $G^2$ indicate a good fit.

The asymptotic distribution of $G^2$ can provide a threshold beneath which to declare the simplified model $M$ valid. Alternatively, the significance can be evaluated through the p-value associated with $G^2$; in practice the p-value is the area to the right of the observed value of $G^2$ in the $\chi^2_{n-k}$ distribution. A model $M$ is considered valid when the observed p-value is large (e.g. greater than 5%). The value of $G^2$ alone is not sufficient to judge a model, because $G^2$ automatically decreases as more parameters are introduced, just as $R^2$ in the regression model automatically increases. However, since the threshold value also generally decreases with the number of parameters, by comparing $G^2$ with the threshold we can reach a compromise between goodness of fit and model parsimony.

The overall significance of a model can also be evaluated by comparing it against the null model $M_0$. This is done by taking the difference between the deviance of the null model and the deviance of the considered model, to obtain

$D = -2 \log \dfrac{L(\hat{\beta}(M_0))}{L(\hat{\beta}(M))}$

Under $H_0$: 'the null model is true', $D$ is asymptotically distributed as $\chi^2_p$, where $p$ is the number of explanatory variables in model $M$. This can be obtained by noting that $D = G^2(M_0) - G^2(M)$ and recalling that the two deviances are independent and asymptotically distributed as chi-squared random variables; from the additive property of the chi-squared distribution, the degrees of freedom of $D$ are $(n-1) - (n-p-1) = p$. The considered model is accepted (i.e. the null model in $H_0$ is rejected) if the p-value is small, which is equivalent to the difference $D$ between the log-likelihoods being large.
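With a fitted statsmodels GLM result (such as those in the earlier sketches), this overall test takes one line each for $D$, its degrees of freedom and its p-value; the helper below is a hypothetical illustration, not a library function:

```python
from scipy import stats

def overall_significance(fit):
    """Compare a fitted GLM against the null (intercept-only) model.

    D = G2(M0) - G2(M) is asymptotically chi-squared with p degrees of
    freedom, where p is the number of explanatory variables.
    """
    D = fit.null_deviance - fit.deviance   # difference in deviance
    p = fit.df_model                       # number of explanatory variables
    p_value = stats.chi2.sf(D, df=p)
    return D, p_value                      # small p-value: reject the null model
```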

Rejection of the null hypothesis implies that at least one parameter in the systematic component is significantly different from zero. Statistical software often reports the maximised log-likelihood for each fitted model, corresponding to $-2 \log L(\hat{\beta}(M))$, which can be seen as a score of the model under consideration; to obtain the deviance, this should be normalised by subtracting $-2 \log L(\hat{\beta}(M^*))$. On the other hand, $D$ is often given with the corresponding p-value; this is analogous to the statistic in Result 4 on page 148. SAS calls it chi-square for covariate.

More generally, following the same logic used in the derivation of $D$, any two models can be compared in terms of their deviances. If the two models are nested (i.e. the systematic component of one is obtained by eliminating some terms from the other), the difference between their deviances is asymptotically distributed as chi-squared with $p-q$ degrees of freedom, that is, the number of terms present in the more complex model (which has $p$ parameters) but excluded from the simpler one (which has $q$ parameters). If the difference is large with respect to the critical value, or equivalently the p-value is small, the simpler model is rejected in favour of the more complex model.

For the whole class of generalised linear models, it is possible to employ a formal procedure to search for the best model. As with linear models, this procedure is usually a forward, backward or stepwise selection.

When the analysed data is categorical, or has been discretised to be such, an alternative to $G^2$ is Pearson's $X^2$:

$X^2 = \sum_i \dfrac{(o_i - e_i)^2}{e_i}$

where, for each category $i$, $o_i$ represents the observed frequency and $e_i$ represents the frequency expected according to the model under examination. As with the deviance $G^2$, we are comparing the fitted model (which corresponds to the $e_i$) and the saturated model (which corresponds to the $o_i$). However, the distance function is not based on the likelihood, but on a direct comparison between observed and fitted values for each category. Notice that this statistic generalises the Pearson $X^2$ distance measure of Section 3.4; there the fitted model was the model under which the two categorical variables were independent.

The Pearson statistic is asymptotically equivalent to $G^2$; therefore, under $H_0$, $X^2 \sim \chi^2_{n-k}$. The choice between $G^2$ and $X^2$ depends on the goodness of the chi-squared approximation. In general, $X^2$ is less affected by small frequencies, particularly when they occur in large data sets (data sets having many variables). The advantage of $G^2$ lies in its additivity, so it easily generalises to any pairwise model comparison, allowing us to adopt a model selection strategy.
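A short sketch of the Pearson statistic for hypothetical observed and fitted frequencies (SciPy assumed; the number of estimated parameters $k$ is also hypothetical):

```python
import numpy as np
from scipy import stats

o = np.array([30.0, 45.0, 15.0, 10.0])  # observed frequencies per category
e = np.array([28.0, 47.5, 14.2, 10.3])  # frequencies fitted by the model

X2 = np.sum((o - e) ** 2 / e)           # Pearson statistic
k = 2                                   # hypothetical number of estimated parameters
p_value = stats.chi2.sf(X2, df=len(o) - k)
print(X2, p_value)
```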

The statistics $G^2$ and $X^2$ indicate a model's overall goodness of fit; further diagnostic analysis is needed to look for local lack of fit. Before fitting a model, it may be extremely useful to try some graphical representations. For example,

we could plot the observed frequencies of the various categories, or functions of them, against the explanatory variables. It is possible to draw dispersion diagrams and fit straight lines for the response variable transformation described by the canonical link (e.g. the logit function for logistic regression). This can be useful to verify whether the hypotheses behind the generalised linear model are satisfied; if not, the graph itself may suggest further transformations of the response variable or of the explanatory variables. Once a model has been chosen as the best and fitted to the data, our main diagnostic tool is the analysis of residuals. Unlike the normal linear model, generalised linear models admit different definitions of residuals. Here we consider the deviance residuals, which are often used in applications. For each observation, the deviance residual is defined by the quantity

$r_{D_i} = \mathrm{sign}(y_i - \hat{\mu}_i)\,\sqrt{d_i}$

where $d_i$ is the contribution of the $i$th observation to the deviance $G^2$.

This quantity increases (or decreases) according to the difference between the observed and fitted values of the response variable, $y_i - \hat{\mu}_i$, and is such that $\sum_i r_{D_i}^2 = G^2$. In a good model, the deviance residuals should be randomly distributed around zero, so it is useful to plot the deviance residuals against the fitted values: for a good model, the points should show no evident trend.
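A sketch of this diagnostic plot, assuming matplotlib and a fitted statsmodels GLM result as in the earlier sketches:

```python
import matplotlib.pyplot as plt

def plot_deviance_residuals(fit):
    """Deviance residuals against fitted values; a good model shows no trend."""
    plt.scatter(fit.fittedvalues, fit.resid_deviance, s=12)
    plt.axhline(0.0, linestyle="--", color="grey")
    plt.xlabel("fitted values")
    plt.ylabel("deviance residuals")
    plt.show()
```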