
Graphical log-linear models

From the document Applied Data Mining for Business and Industry (pages 136-139)


Multiplying numerator and denominator by $n^2$ and taking logarithms:

$$\log(\theta) = \log(m_{11}) + \log(m_{00}) - \log(m_{10}) - \log(m_{01}).$$

Substituting for each expected frequency the corresponding log-linear expansion, we obtain $\log(\theta) = u^{AB}_{11}$. Therefore the odds ratio between the variables A and B is $\theta = \exp(u^{AB}_{11})$. These relations, which are very useful for data interpretation, depend on the identifiability constraints we have adopted.

We have shown the relationship between the odds ratio and the parameters of a log-linear model for a 2×2 contingency table. This result is valid for contingency tables of higher dimension, provided the variables are binary and, as usually happens in a descriptive context, the log-linear expansion does not contain interaction terms between more than two variables.
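The relation between the odds ratio and the interaction parameter can be checked numerically. The sketch below assumes a baseline (corner-point) identifiability constraint, in which every u-term with an index equal to 0 is set to zero; the excerpt does not restate which constraint the book adopts, so this choice, the function names and the counts are illustrative assumptions.

```python
import math

def corner_point_params(m):
    """u-terms of the saturated 2x2 log-linear model.

    m[a][b] is the expected count for A=a, B=b (a, b in {0, 1}).
    Assumed constraint: every u-term with an index equal to 0 is zero.
    """
    u = math.log(m[0][0])
    u_a1 = math.log(m[1][0]) - u
    u_b1 = math.log(m[0][1]) - u
    u_ab11 = math.log(m[1][1]) - u - u_a1 - u_b1
    return u, u_a1, u_b1, u_ab11

def odds_ratio(m):
    """Cross-product ratio of a 2x2 table."""
    return (m[1][1] * m[0][0]) / (m[1][0] * m[0][1])

# Hypothetical expected counts, chosen only for illustration.
counts = [[40.0, 10.0], [20.0, 30.0]]
u, u_a1, u_b1, u_ab11 = corner_point_params(counts)
theta = odds_ratio(counts)

# Under the assumed constraint, exp(u_ab11) and theta agree.
print(theta, math.exp(u_ab11))
```

With a different constraint (for example, sum-to-zero), the interaction parameter takes a different value, which is why the text stresses the dependence on the identifiability constraints.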

4.13.3 Graphical log-linear models

A key instrument in understanding log-linear models, and graphical models in general, is the concept of conditional independence for a set of random variables; this extends the notion of statistical independence between two variables, seen in Section 3.4. Consider three random variables X, Y and Z. X and Y are conditionally independent given Z if the joint probability distribution of X and Y, conditional on Z, can be decomposed into the product of two factors: the conditional density of X given Z and the conditional density of Y given Z. In formal terms, X and Y are conditionally independent given Z if $f(x, y \mid Z=z) = f(x \mid Z=z)\, f(y \mid Z=z)$ for every z, and we write $X \perp Y \mid Z$. An alternative way of expressing this concept is that the conditional distribution of Y given both X and Z does not depend on X. So, for example, if X is a binary variable and Z is a discrete variable, then for every z and y we have

$$f(y \mid X=1, Z=z) = f(y \mid X=0, Z=z) = f(y \mid Z=z).$$
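Both formulations of conditional independence can be verified on a small discrete example. The following is a minimal sketch: the probabilities are hypothetical, and the helper names `is_cond_independent` and `f_y_given_xz` are introduced here for illustration only.

```python
from itertools import product

# Hypothetical distributions: Z has 3 levels, X and Y are binary.
p_z = [0.2, 0.5, 0.3]
p_x_given_z = [[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]]  # indexed [z][x]
p_y_given_z = [[0.3, 0.7], [0.5, 0.5], [0.8, 0.2]]  # indexed [z][y]

# Joint f(x, y, z) = f(z) f(x|z) f(y|z), keyed by (x, y, z).
joint = {
    (x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
    for x, y, z in product(range(2), range(2), range(3))
}

def is_cond_independent(joint, tol=1e-12):
    """Check f(x, y | z) = f(x | z) f(y | z) for every cell of the table."""
    xs = sorted({k[0] for k in joint})
    ys = sorted({k[1] for k in joint})
    zs = sorted({k[2] for k in joint})
    for z in zs:
        f_z = sum(joint[x, y, z] for x in xs for y in ys)
        for x, y in product(xs, ys):
            f_xy = joint[x, y, z] / f_z
            f_x = sum(joint[x, yy, z] for yy in ys) / f_z
            f_y = sum(joint[xx, y, z] for xx in xs) / f_z
            if abs(f_xy - f_x * f_y) > tol:
                return False
    return True

def f_y_given_xz(joint, y, x, z):
    """Conditional probability f(y | X=x, Z=z)."""
    ys = sorted({k[1] for k in joint})
    return joint[x, y, z] / sum(joint[x, yy, z] for yy in ys)

print(is_cond_independent(joint))
print(f_y_given_xz(joint, 1, 0, 2), f_y_given_xz(joint, 1, 1, 2))
```

Because the joint table is built as a product of f(z), f(x|z) and f(y|z), the factorisation holds by construction, and f(y | X=x, Z=z) is the same for both values of x.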

The notion of (marginal) independence between two random variables (Section 3.4) can be obtained as a special case of conditional independence. As seen for marginal independence, conditional independence can simplify the expression for and the interpretation of log-linear models. In particular, it can be extremely useful in visualising the associative structure among all variables at hand, using the so-called independence graphs. Indeed, a subset of log-linear models, called graphical log-linear models, can be completely characterised in terms of conditional independence relationships and therefore graphs. For these models, each graph corresponds to a set of conditional independence constraints and each of these constraints can correspond to a particular log-linear expansion.

The study of the relationship between conditional independence statements, represented in graphs, and log-linear models has its origins in the work of Darroch et al. (1980). We explain this relationship through an example. For a systematic treatment, see Edwards (2000), Whittaker (1990) or Lauritzen (1996). We believe that the introduction of graphical log-linear models helps to explain the problem of model choice for log-linear models.

Consider a contingency table of three dimensions, each corresponding to a binary variable, so that the total number of cells in the contingency table is $2^3 = 8$. The simplest log-linear graphical model for a three-way contingency table assumes that the logarithm of the expected frequency of every cell is

$$\log m_{jkl} = u + u^A_j + u^B_k + u^C_l.$$

This model contains no interaction terms between variables, so the three variables are mutually independent. In fact, the model can be expressed in terms of cell probabilities as $p_{jkl} = p_{j++}\, p_{+k+}\, p_{++l}$, where the symbol + indicates that the joint probabilities have been summed over all the values of the corresponding index. Note that, for this model, the three odds ratios between the variables, (A,B), (A,C) and (B,C), are all equal to 1. To identify the model uniquely, it is possible to use a list of the terms, called generators, that correspond to the maximal terms of interaction in the model. These terms are maximal in the sense that their presence implies the presence of the interaction terms between all subsets of their variables, while their own presence in the model is not implied by any other term. For the previous model of mutual independence, the generators are (A, B, C); they are the main effect terms, as there are no other terms in the model.

To represent conditional independence statements graphically, we can use conditional independence graphs. These are constructed by associating a node with each variable and by placing a link (technically, an edge) between a pair of variables whenever the corresponding random variables are dependent, that is, not conditionally independent given the remaining variables.

For the case of mutual independence just described, there are no edges and therefore we obtain the representation in Figure 4.11.
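The claim that all three pairwise odds ratios equal 1 under mutual independence can be checked directly from the factorisation of the cell probabilities into one-way margins. This is a minimal sketch with hypothetical margins; `marginal_odds_ratio` is a helper introduced here for illustration.

```python
from itertools import product

# Hypothetical one-way margins for three binary variables A, B, C.
p_a = [0.6, 0.4]  # p_{j++}
p_b = [0.7, 0.3]  # p_{+k+}
p_c = [0.2, 0.8]  # p_{++l}

# Cell probabilities under mutual independence: p_jkl = p_j++ * p_+k+ * p_++l.
p = {(j, k, l): p_a[j] * p_b[k] * p_c[l]
     for j, k, l in product(range(2), repeat=3)}

def marginal_odds_ratio(p, first, second):
    """Odds ratio of the 2x2 marginal table for the variables at positions
    `first` and `second` of the cell index (0 = A, 1 = B, 2 = C)."""
    m = [[0.0, 0.0], [0.0, 0.0]]
    for cell, prob in p.items():
        m[cell[first]][cell[second]] += prob
    return (m[1][1] * m[0][0]) / (m[1][0] * m[0][1])

# Odds ratios for the pairs (A,B), (A,C), (B,C).
ratios = {pair: marginal_odds_ratio(p, *pair)
          for pair in [(0, 1), (0, 2), (1, 2)]}
print(ratios)
```

Up to floating-point rounding, each ratio is 1, which matches the empty graph of the mutual independence model.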

Figure 4.11 Conditional independence graph for the mutual independence case (nodes A, B and C, with no edges).

Consider now a more complex log-linear model for the three variables, described by the log-linear expansion

$$\log m_{jkl} = u + u^A_j + u^B_k + u^C_l + u^{AB}_{jk} + u^{AC}_{jl}.$$

In this case, since the maximal terms of interaction are $u^{AB}_{jk}$ and $u^{AC}_{jl}$, the generators of the model will be (AB, AC). Notice that the model can be reformulated in terms of cell probabilities as

$$\pi_{jkl} = \frac{\pi_{jk+}\,\pi_{j+l}}{\pi_{j++}}$$

or, equivalently, as

$$\frac{\pi_{jkl}}{\pi_{j++}} = \frac{\pi_{jk+}}{\pi_{j++}} \cdot \frac{\pi_{j+l}}{\pi_{j++}}$$

which, in terms of conditional independence, states that

$$P(B=k, C=l \mid A=j) = P(B=k \mid A=j)\, P(C=l \mid A=j).$$

This indicates that, in the conditional distribution given A, B and C are independent; in symbols, $B \perp C \mid A$. Therefore, the conditional independence graph of the model is as in Figure 4.12. It can be demonstrated that, in this case, the marginal odds ratios between all variable pairs are in general different from 1, while the two odds ratios for the two-way table between B and C, conditional on A, are both equal to 1.
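The contrast between conditional and marginal odds ratios in the model (AB, AC) can be illustrated numerically. The sketch below builds cell probabilities as P(A) P(B|A) P(C|A), which is the same factorisation as the one given in the text; all numerical values and helper names are hypothetical.

```python
from itertools import product

# Hypothetical ingredients: P(A=j), P(B=k | A=j), P(C=l | A=j).
p_a = [0.5, 0.5]
p_b_given_a = [[0.8, 0.2], [0.3, 0.7]]  # indexed [j][k]
p_c_given_a = [[0.6, 0.4], [0.1, 0.9]]  # indexed [j][l]

# pi_jkl = pi_jk+ * pi_j+l / pi_j++ is the same as P(A) P(B|A) P(C|A).
pi = {(j, k, l): p_a[j] * p_b_given_a[j][k] * p_c_given_a[j][l]
      for j, k, l in product(range(2), repeat=3)}

def odds_ratio(t):
    """Cross-product ratio of a 2x2 table t[row][col]."""
    return (t[1][1] * t[0][0]) / (t[1][0] * t[0][1])

def bc_table_given_a(pi, j):
    """2x2 table of B by C within the slice A=j."""
    return [[pi[j, k, l] for l in range(2)] for k in range(2)]

def bc_marginal_table(pi):
    """2x2 table of B by C after summing over A."""
    return [[pi[0, k, l] + pi[1, k, l] for l in range(2)] for k in range(2)]

conditional = [odds_ratio(bc_table_given_a(pi, j)) for j in range(2)]
marginal = odds_ratio(bc_marginal_table(pi))
print(conditional, marginal)
```

In this example both conditional odds ratios equal 1 up to rounding, while the marginal odds ratio between B and C is well away from 1: B and C are associated only through their common dependence on A.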

We finally consider the most complex (saturated) log-linear model for the three variables,

$$\log m_{jkl} = u + u^A_j + u^B_k + u^C_l + u^{AB}_{jk} + u^{AC}_{jl} + u^{BC}_{kl} + u^{ABC}_{jkl},$$

which has (ABC) as generator. This model does not establish any conditional independence constraints on cell probabilities. Correspondingly, all odds ratios, marginal and conditional, will be different from 1. The corresponding conditional independence graph will be completely connected. The previous model (AB, AC) can be considered as a particular case of the saturated model, obtained by setting $u^{BC}_{kl} = 0$ for all k and l, and $u^{ABC}_{jkl} = 0$ for all j, k, l. Equivalently, it is obtained by removing the edge between B and C in the completely connected graph, which corresponds to imposing the constraint that B and C are independent conditionally on A.

Figure 4.12 Conditional independence graph corresponding to $B \perp C \mid A$ (nodes A, B and C, with edges A-B and A-C).

Notice that the mutual independence model is a particular case of the saturated model, obtained by setting $u^{BC}_{kl} = u^{AC}_{jl} = u^{AB}_{jk} = u^{ABC}_{jkl} = 0$ for all j, k, l, or by removing all three edges in the complete graph. Consequently, the differences between log-linear models can be expressed in terms of differences between the parameters or as differences between graphical structures. We think it is easier to interpret differences between graphical structures.
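The description of nested models as parameter constraints can be made concrete by computing the saturated-model parameters of a table that follows the model (AB, AC). The sketch below assumes a baseline constraint (u-terms with any index equal to 0 are zero), which is an assumption made here for illustration; `loglinear_params` and the expected counts are hypothetical.

```python
import math
from itertools import product

def loglinear_params(m):
    """u-terms of the saturated model for a 2x2x2 table.

    m maps (j, k, l) to a positive expected count. Assumed constraint:
    every u-term with an index equal to 0 is zero, so only the terms with
    all indices equal to 1 are returned, keyed by variable subset
    (for example "AB" for u^{AB}_{11}, and "u" for the constant).
    """
    log_m = {cell: math.log(v) for cell, v in m.items()}
    params = {}
    for subset in product([0, 1], repeat=3):
        total = 0.0
        # Alternating sum over the cells whose 1-indices lie within `subset`.
        for cell in product([0, 1], repeat=3):
            if all(c <= s for c, s in zip(cell, subset)):
                sign = (-1) ** (sum(subset) - sum(cell))
                total += sign * log_m[cell]
        label = "".join(n for n, s in zip("ABC", subset) if s) or "u"
        params[label] = total
    return params

# Hypothetical expected counts following model (AB, AC): n P(A) P(B|A) P(C|A).
n = 1000
p_a = [0.5, 0.5]
p_b_given_a = [[0.8, 0.2], [0.3, 0.7]]
p_c_given_a = [[0.6, 0.4], [0.1, 0.9]]
m_ab_ac = {(j, k, l): n * p_a[j] * p_b_given_a[j][k] * p_c_given_a[j][l]
           for j, k, l in product([0, 1], repeat=3)}

params = loglinear_params(m_ab_ac)
print({name: round(value, 6) for name, value in params.items()})
```

For this table the BC and ABC terms vanish up to rounding while the AB and AC terms do not, which is the parameter-level counterpart of removing the edge between B and C.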

All the models in this example are graphical log-linear models. In general, graphical log-linear models can be defined as log-linear models whose generators are the cliques of the conditional independence graph. A clique is a maximal subset of completely connected nodes in a graph, that is, a completely connected subset that is not contained in any larger completely connected subset. For example, in Figure 4.12 the subsets AB and AC are cliques, and they are the generators of the model. On the other hand, the subsets formed by the single nodes A, B and C are not cliques, because each is contained in a larger completely connected subset.

To better understand the concept of a graphical log-linear model, consider a non-graphical model for the trivariate case. Take the model described by the generator (AB, AC, BC):

$$\log m_{jkl} = u + u^A_j + u^B_k + u^C_l + u^{AB}_{jk} + u^{AC}_{jl} + u^{BC}_{kl}.$$

Although this model differs from the saturated model by the absence of the three-way interaction term $u^{ABC}_{jkl}$, its conditional independence graph is the same, with one single clique, ABC. Therefore, since the set of model generators is different from the set of cliques, the model is not graphical.

To conclude, in this section we have obtained a remarkable equivalence between conditional independence statements, graphical representations and probability models, with the probability models represented in terms of cell probabilities, log-linear models or sets of odds ratios.
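The definition of a graphical model, generators equal to the cliques of the graph, lends itself to a mechanical check. The following is a minimal sketch assuming generators are given as strings of single-letter variable names; the function names are introduced here for illustration, and the brute-force clique search is only adequate for a handful of variables.

```python
from itertools import combinations

def edges_from_generators(generators):
    """Edges of the conditional independence graph: two variables are joined
    whenever they appear together in at least one generator."""
    edges = set()
    for g in generators:
        edges.update(frozenset(pair) for pair in combinations(sorted(g), 2))
    return edges

def maximal_cliques(nodes, edges):
    """Brute-force search for the maximal completely connected subsets."""
    complete = [
        frozenset(s)
        for r in range(1, len(nodes) + 1)
        for s in combinations(sorted(nodes), r)
        if all(frozenset(pair) in edges for pair in combinations(s, 2))
    ]
    return {c for c in complete if not any(c < other for other in complete)}

def is_graphical(generators):
    """A model is graphical when its generators are exactly the cliques."""
    gens = {frozenset(g) for g in generators}
    nodes = set().union(*gens)
    return gens == maximal_cliques(nodes, edges_from_generators(gens))

for model in (["A", "B", "C"], ["AB", "AC"], ["ABC"], ["AB", "AC", "BC"]):
    print(model, is_graphical(model))
```

The first three models from the example are reported as graphical, while (AB, AC, BC) is not: its graph is complete, so its only clique is ABC, which is not among its generators.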
