Statistical data mining

5.5 Log-linear models

5.5.1 Construction of a log-linear model

We now show how a log-linear model can be built, starting from three different distributional assumptions about the absolute frequencies of a contingency table, corresponding to different sampling schemes for the data in the table. For simplicity but without loss of generality, we consider a two-way contingency table of dimensions $I \times J$ ($I$ rows and $J$ columns).

Scheme 1

The cell counts are independent random variables that are distributed according to a Poisson distribution. All the marginal counts, including the total number of observations $n$, are also random and distributed according to a Poisson distribution. As the natural parameter of a Poisson distribution with parameter $m_{ij}$ is $\log(m_{ij})$, the relationship that links the expected value of each cell frequency $m_{ij}$ to the systematic component is

$$\log(m_{ij}) = \eta_{ij} \quad \text{for } i = 1, \ldots, I \text{ and } j = 1, \ldots, J$$

In the linear and logistic regression models, the total amount of information (which determines the degrees of freedom) is described by the number of observations of the response variable (indicated by $n$), but in the log-linear model this corresponds to the number of cells of the contingency table. In the estimation procedure, the expected frequencies will be replaced by the observed frequencies, and this will lead us to estimate the parameters of the systematic component. For an $I \times J$ table there are two variables in the systematic component. Let the levels of the two variables be indicated by $x_i$ and $x_j$, for $i = 1, \ldots, I$ and $j = 1, \ldots, J$. The systematic component can therefore be written as

$$\eta_{ij} = u + \sum_{i} u_i x_i + \sum_{j} u_j x_j + \sum_{ij} u_{ij} x_i x_j$$

This expression is called the log-linear expansion of the expected frequencies.

The terms $u_i$ and $u_j$ describe the single effects of each variable, corresponding to the mean expected frequencies for each of their levels. The term $u_{ij}$ describes the joint effect of the two variables on the expected frequencies. The term $u$ is a constant that corresponds to the mean expected frequency over all table cells.

Scheme 2

The total number of observations $n$ is not random, but a fixed constant. This implies that the cell counts follow a multinomial distribution. Such a distribution generalises the binomial to the case where there are more than two alternative events for the considered variable. The expected values of the absolute frequencies for each cell are given by $m_{ij} = n\pi_{ij}$. With $n$ fixed, specifying a statistical model for the probabilities $\pi_{ij}$ is equivalent to modelling the expected frequencies $m_{ij}$, as in Scheme 1.
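The link between Schemes 1 and 2 can be checked numerically. It is a standard result, not derived in this section, that independent Poisson counts conditioned on their total $n$ follow a multinomial distribution with probabilities $\pi_{ij} = m_{ij} / \sum m_{ij}$. The Python sketch below verifies this for one set of values. The means, the counts and the helper names `poisson_logpmf` and `multinomial_logpmf` are illustrative assumptions, not part of the original text.

```python
import math

def poisson_logpmf(k, mean):
    """Log-probability of observing k under a Poisson distribution."""
    return k * math.log(mean) - mean - math.lgamma(k + 1)

def multinomial_logpmf(counts, probs):
    """Log-probability of the counts under a multinomial distribution."""
    n = sum(counts)
    out = math.lgamma(n + 1)
    for k, p in zip(counts, probs):
        out += k * math.log(p) - math.lgamma(k + 1)
    return out

# Hypothetical expected frequencies m_ij of a 2x2 table, flattened row by row.
means = [12.0, 8.0, 5.0, 15.0]
# Hypothetical observed cell counts for the same cells.
counts = [10, 9, 4, 17]

n = sum(counts)
total_mean = sum(means)
probs = [m / total_mean for m in means]  # pi_ij = m_ij / sum of all m_ij

# Scheme 1: joint log-probability of independent Poisson cell counts.
joint = sum(poisson_logpmf(k, m) for k, m in zip(counts, means))
# Conditioning on the total n, which is itself Poisson with mean sum(m_ij).
conditional = joint - poisson_logpmf(n, total_mean)

# Scheme 2: multinomial log-probability with n fixed.
multinomial = multinomial_logpmf(counts, probs)

print(conditional, multinomial)
```

The two printed values agree up to floating-point error, which is why a model for the $\pi_{ij}$ with $n$ fixed carries the same information as a model for the $m_{ij}$.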

Scheme 3

The marginal row (or column) frequencies are known. In this case it can be shown that the cell counts are distributed as a product of multinomial distributions. It is also possible to show that we can define a log-linear model in the same way as before.

Properties of the log-linear model

Besides being parsimonious, the log-linear model allows us easily to incorporate in the probability distribution constraints that specify independence relationships between variables. For example, using results introduced in Section 3.4, when two categorical variables are independent, the joint probability of each cell factorises as $\pi_{ij} = \pi_{i+}\pi_{+j}$, for $i = 1, \ldots, I$ and $j = 1, \ldots, J$. Since $m_{ij} = n\pi_{ij}$, the additive property of logarithms implies that

$$\log m_{ij} = \log n + \log \pi_{i+} + \log \pi_{+j}$$

This describes a log-linear model of independence that is more parsimonious than the previous one, called the saturated model as it contains as many parameters as there are table cells. In general, to achieve a unique estimate on the basis of the observations, the number of terms in the log-linear expansion cannot be greater than the number of cells in the contingency table. This implies some constraints on the parameters of a log-linear model. Known as identifiability constraints, they can be defined in different ways, but we will use a system of constraints that equates to zero all the u-terms that contain at least one index equal to the first level of a variable. This implies that, for a 2×2 table, relative to the binary variables A and B, with levels 0 and 1, the most complex possible log-linear model (saturated) is defined by

$$\log(m_{ij}) = u + u_i^{A} + u_j^{B} + u_{ij}^{AB}$$

with constraints such that $u_i^{A}$ is nonzero only for $i = 1$ (i.e. if $A = 1$); $u_j^{B}$ is nonzero only for $j = 1$ (i.e. if $B = 1$); and $u_{ij}^{AB}$ is nonzero only for $i = 1$ and $j = 1$ (i.e. if $A = 1$ and $B = 1$). The notation reveals that, in order to model the four cell frequencies in the table, there is a constant term, $u$; two main effects terms that each depend exclusively on one variable, $u_i^{A}$ and $u_j^{B}$; and an interaction term that describes the association between the two variables, $u_{ij}^{AB}$. Therefore, following the stated constraints, the model establishes that the logarithms of the four expected cell frequencies are equal to the following expressions:

$$\log(m_{00}) = u$$
$$\log(m_{10}) = u + u_1^{A}$$
$$\log(m_{01}) = u + u_1^{B}$$
$$\log(m_{11}) = u + u_1^{A} + u_1^{B} + u_{11}^{AB}$$

5.5.2 Interpretation of a log-linear model

Logistic regression models with categorical explanatory variables (also called logit models) can be considered as a particular case of log-linear models. To clarify this point, consider a contingency table with three dimensions for variables $A$, $B$, $C$, with $I$, $J$ and 2 levels respectively. Assume that $C$ is the response variable of the logit model. A logit model is expressed by

$$\log\left(\frac{m_{ij1}}{m_{ij0}}\right) = \alpha + \beta_i^{A} + \beta_j^{B} + \beta_{ij}^{AB}$$

All the explanatory variables of a logit model are categorical, so the effect of each variable (e.g. variable $A$) is indicated only by the coefficient (e.g. $\beta_i^{A}$) rather than by the product (e.g. $\beta A$). Besides that, the logit model has been expressed in terms of the expected frequencies, rather than probabilities, as in the last section. This is only a notational change, obtained through multiplying the numerator and the denominator by $n$. This expression is useful to show that the logit model which has $C$ as response variable is obtained as the difference between the log-linear expansions of $\log(m_{ij1})$ and $\log(m_{ij0})$. Indeed the log-linear expansion for a contingency table with three dimensions $I \times J \times 2$ has the more general form

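$$\log(m_{ijk}) = u + u_i^{A} + u_j^{B} + u_k^{C} + u_{ij}^{AB} + u_{ik}^{AC} + u_{jk}^{BC} + u_{ijk}^{ABC}$$

Substituting and taking the difference between the logarithms of the expected frequencies for $C = 1$ and $C = 0$:

$$\log(m_{ij1}) - \log(m_{ij0}) = u_1^{C} + u_{i1}^{AC} + u_{j1}^{BC} + u_{ij1}^{ABC}$$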

In other words, the u-terms that do not depend on the variable $C$ cancel out, and all the remaining terms depend on $C$. By eliminating the symbol $C$ from the superscript, the value 1 from the subscript and relabelling the u-terms using $\alpha$ and $\beta$, we arrive at the desired expression for the logit model. Therefore a logit model can be obtained from a log-linear model. The difference is that a log-linear model contains not only the terms that describe the association between the explanatory variables and the response (here the pairs $AC$ and $BC$) but also the terms that describe the association between the explanatory variables (here the pair $AB$). Logit models do not model the association between the explanatory variables.
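The cancellation can be checked numerically for a 2×2×2 table. The following Python sketch assumes binary $A$, $B$ and $C$ with levels 0 and 1 and identifiability constraints that set to zero every u-term with an index at level 0. The table of expected frequencies and the names `corner`, `design` and `u_terms` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical expected frequencies m[i, j, k] for binary A, B, C.
m = np.array([[[20.0, 10.0], [15.0, 30.0]],
              [[12.0, 24.0], [8.0, 40.0]]])
log_m = np.log(m)

# One binary variable under the assumed constraints:
# level 0 keeps only the baseline term, level 1 adds the effect at level 1.
corner = np.array([[1.0, 0.0],
                   [1.0, 1.0]])

# Saturated design matrix: one row per cell (i, j, k), one column per u-term.
design = np.kron(np.kron(corner, corner), corner)

# Eight cells and eight parameters, so the system has an exact solution.
# u_terms[a, b, c] is the u-term that involves A if a == 1, B if b == 1 and
# C if c == 1; for example u_terms[0, 0, 0] is u and u_terms[1, 0, 1] is u^AC.
u_terms = np.linalg.solve(design, log_m.ravel()).reshape(2, 2, 2)

def logit_from_u(i, j):
    """Sum of the u-terms that involve C and are active in cell (i, j)."""
    return u_terms[: i + 1, : j + 1, 1].sum()

for i in range(2):
    for j in range(2):
        direct = log_m[i, j, 1] - log_m[i, j, 0]
        print(i, j, direct, logit_from_u(i, j))
```

For every cell the logit computed directly from the frequencies matches the sum of the u-terms involving $C$, while the terms $u$, $u^{A}$, $u^{B}$ and $u^{AB}$ play no part in it.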

We now consider the relationship between log-linear models and odds ratios. The logarithm of the odds ratio between two variables is equal to the sum of the interaction u-terms that contain both variables. It follows that if in the considered log-linear expansion there are no u-terms containing both the variables $A$ and $B$, say, then we obtain $\theta^{AB} = 1$; that is, the two variables are independent.
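As a small numerical sketch of the independence case, the Python code below builds expected frequencies under independence, $m_{ij} = n\pi_{i+}\pi_{+j}$, from the margins of a hypothetical 2×2 table. It then checks that the log frequencies are additive and that the interaction term and the odds ratio take the values expected under independence. The counts and variable names are illustrative assumptions.

```python
import numpy as np

# Hypothetical observed 2x2 table: rows are A = 0, 1; columns are B = 0, 1.
counts = np.array([[30.0, 10.0],
                   [20.0, 40.0]])

n = counts.sum()
pi_row = counts.sum(axis=1) / n  # marginal probabilities pi_{i+}
pi_col = counts.sum(axis=0) / n  # marginal probabilities pi_{+j}

# Expected frequencies under independence: m_ij = n * pi_{i+} * pi_{+j}.
m_indep = n * np.outer(pi_row, pi_col)
log_m = np.log(m_indep)

# Additive form: log m_ij = log n + log pi_{i+} + log pi_{+j}.
additive = np.log(n) + np.log(pi_row)[:, None] + np.log(pi_col)[None, :]

# Interaction term, assuming u-terms with an index at level 0 are set to zero.
u_AB = log_m[1, 1] - log_m[1, 0] - log_m[0, 1] + log_m[0, 0]
theta = np.exp(u_AB)

print(np.allclose(log_m, additive), u_AB, theta)
```

Here the interaction term vanishes and the odds ratio equals 1 up to floating-point error, as the text states for a model with no u-terms containing both variables.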

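To illustrate this concept, consider a 2×2 table and the odds ratio between the binary variables $A$ and $B$:

$$\theta = \frac{\theta_1}{\theta_2} = \frac{\pi_{1|1}/\pi_{0|1}}{\pi_{1|0}/\pi_{0|0}} = \frac{\pi_{11}/\pi_{01}}{\pi_{10}/\pi_{00}} = \frac{\pi_{11}\pi_{00}}{\pi_{01}\pi_{10}}$$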

Multiplying numerator and denominator by $n^2$ and taking logarithms:

$$\log(\theta) = \log(m_{11}) + \log(m_{00}) - \log(m_{10}) - \log(m_{01})$$

Substituting for each expected frequency the corresponding log-linear expansion, we obtain $\log(\theta) = u_{11}^{AB}$. Therefore the odds ratio between the variables $A$ and $B$ is $\theta = \exp(u_{11}^{AB})$. These relations, which are very useful for data interpretation, depend on the identifiability constraints we have adopted. For example, if we had used the default constraints of the SAS software, we would have obtained the relationship $\theta = \exp(1/4\, u_{11}^{AB})$.
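The relation $\theta = \exp(u_{11}^{AB})$ can be verified with a few lines of Python. The sketch below assumes the constraints adopted in this section (every u-term with an index at level 0 is zero) and solves the four saturated-model equations for a hypothetical 2×2 table. The counts and variable names are illustrative only.

```python
import numpy as np

# Hypothetical 2x2 table: rows are A = 0, 1; columns are B = 0, 1.
counts = np.array([[30.0, 10.0],
                   [20.0, 40.0]])
log_m = np.log(counts)

# Solve the four saturated-model equations one term at a time.
u = log_m[0, 0]                       # log(m00) = u
u_A = log_m[1, 0] - u                 # log(m10) = u + u^A
u_B = log_m[0, 1] - u                 # log(m01) = u + u^B
u_AB = log_m[1, 1] - u - u_A - u_B    # log(m11) = u + u^A + u^B + u^AB

# Odds ratio computed directly from the cell frequencies.
odds_ratio = (counts[1, 1] * counts[0, 0]) / (counts[1, 0] * counts[0, 1])

print(odds_ratio, np.exp(u_AB))
```

The saturated model reproduces the observed frequencies exactly, so the exponential of the interaction term coincides with the sample odds ratio up to floating-point error.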

We have shown the relationship between the odds ratio and the parameters of a log-linear model for a 2×2 contingency table. This result is valid for contingency tables of higher dimension, provided the variables are binary and, as usually happens in a descriptive context, the log-linear expansion does not contain interaction terms between more than two variables.