

5.5.3 Graphical log-linear models

A key instrument in understanding log-linear models, and graphical models in general, is the concept of conditional independence for a set of random variables; this extends the notion of statistical independence between two variables, seen in Section 3.4. Consider three random variables X, Y and Z. X and Y are conditionally independent given Z if the joint probability distribution of X and Y, conditional on Z, can be decomposed into the product of two factors: the conditional density of X given Z and the conditional density of Y given Z. In formal terms, X and Y are conditionally independent given Z if $f(x, y \mid Z=z) = f(x \mid Z=z)\, f(y \mid Z=z)$, and we write $X \perp Y \mid Z$. An alternative way of expressing this concept is that the conditional distribution of Y given both X and Z does not depend on X. So, for example, if X is a binary variable and Z is a discrete variable, then for every z and y we have

$$f(y \mid X=1, Z=z) = f(y \mid X=0, Z=z) = f(y \mid Z=z)$$

The notion of (marginal) independence between two random variables (Section 3.4) can be obtained as a special case of conditional independence.
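The factorisation in the definition can be checked numerically on a small discrete example. The following is a minimal sketch, assuming the joint distribution is stored as a nested list `p[x][y][z]`; the helper name `is_cond_independent`, the table layout and the two example tables are illustrative assumptions, not part of the original text.

```python
from itertools import product

def is_cond_independent(p, tol=1e-12):
    """Return True if X and Y are conditionally independent given Z.

    p[x][y][z] is a joint probability table (layout assumed for this sketch).
    Checks f(x, y | z) == f(x | z) * f(y | z) for every z with f(z) > 0.
    """
    nx, ny, nz = len(p), len(p[0]), len(p[0][0])
    for z in range(nz):
        pz = sum(p[x][y][z] for x in range(nx) for y in range(ny))
        if pz == 0:
            continue  # conditioning on an impossible value is undefined
        for x, y in product(range(nx), range(ny)):
            pxz = sum(p[x][b][z] for b in range(ny))  # f(x, z)
            pyz = sum(p[a][y][z] for a in range(nx))  # f(y, z)
            if abs(p[x][y][z] / pz - (pxz / pz) * (pyz / pz)) > tol:
                return False
    return True

# A table that satisfies the definition by construction:
# f(x, y, z) = f(z) * f(x | z) * f(y | z)
f_z = [0.4, 0.6]
f_x_given_z = [[0.2, 0.8], [0.7, 0.3]]  # indexed as [z][x]
f_y_given_z = [[0.5, 0.5], [0.1, 0.9]]  # indexed as [z][y]
p_ci = [[[f_z[z] * f_x_given_z[z][x] * f_y_given_z[z][y] for z in range(2)]
         for y in range(2)] for x in range(2)]

# A table in which X and Y are associated within each level of Z
p_dep = [[[0.20, 0.05], [0.05, 0.20]],
         [[0.05, 0.20], [0.20, 0.05]]]

print(is_cond_independent(p_ci))   # True
print(is_cond_independent(p_dep))  # False
```

The tolerance `tol` only absorbs floating-point rounding; with observed frequencies rather than exact probabilities, a statistical test would be needed instead of an exact comparison.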

As seen for marginal independence, conditional independence can simplify the expression and interpretation of log-linear models. In particular, it can be extremely useful in visualising the associative structure among all variables at hand, using the so-called independence graphs. Indeed a subset of log-linear models, called graphical log-linear models, can be completely characterised in terms of conditional independence relationships and therefore graphs. For these models, each graph corresponds to a set of conditional independence constraints and each of these constraints can correspond to a particular log-linear expansion.

The study of the relationship between conditional independence statements, represented in graphs, and log-linear models has its origins in the work of Darroch, Lauritzen and Speed (1980). We explain this relationship through an example. For a systematic treatment see Whittaker (1990), Edwards (1995), or Lauritzen (1996). I believe that the introduction of graphical log-linear models helps to explain the problem of model choice for log-linear models.

Consider a contingency table of three dimensions, each one corresponding to a binary variable, so the total number of cells in the contingency table is $2^3 = 8$. The simplest log-linear graphical model for a three-way contingency table assumes that the logarithm of the expected frequency of every cell is

$$\log(m_{jkl}) = u + u^{A}_{j} + u^{B}_{k} + u^{C}_{l}$$

This model does not contain interaction terms between variables, therefore the three variables are mutually independent. In fact, the model can be expressed in terms of cell probabilities as $p_{jkl} = p_{j++}\, p_{+k+}\, p_{++l}$, where the symbol + indicates that the joint probabilities have been summed with respect to all the values of the corresponding index. Note that, for this model, the three odds ratios between the variables, (A, B), (A, C) and (B, C), are all equal to 1. To identify the model in a unique way it is possible to use a list of the terms, called generators, that correspond to the maximal terms of interaction in the model. These terms are called maximal in the sense that their presence implies the presence of interaction terms between subsets of their variables. At the same time, their existence in the model is not implied by any other term. For the previous model of mutual independence, the generators are (A, B, C); they are the main effect terms as there are no other terms in the model. To graphically represent conditional independence statements, we can use conditional independence graphs. These are built by associating a node to each variable and by placing a link (technically, an edge) to connect a pair of variables whenever the corresponding random variables are dependent. For the case of mutual independence we have described, there are no edges and therefore we obtain the representation in Figure 5.6.
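As a rough numerical illustration of the mutual independence model, the sketch below computes expected counts as the grand total times the product of the three one-way marginal proportions, and then checks that the three marginal odds ratios of the fitted table equal 1. The counts in `n` are invented for illustration, and the function names are hypothetical helpers rather than anything defined in the text.

```python
def margins(n):
    """One-way marginal totals of a 2x2x2 count table n[j][k][l]."""
    a = [sum(n[j][k][l] for k in range(2) for l in range(2)) for j in range(2)]
    b = [sum(n[j][k][l] for j in range(2) for l in range(2)) for k in range(2)]
    c = [sum(n[j][k][l] for j in range(2) for k in range(2)) for l in range(2)]
    return a, b, c

def fit_mutual_independence(n):
    """Expected counts m_jkl = N * p_j++ * p_+k+ * p_++l."""
    a, b, c = margins(n)
    total = sum(a)
    return [[[a[j] * b[k] * c[l] / total ** 2 for l in range(2)]
             for k in range(2)] for j in range(2)]

def odds_ratio(t):
    """Odds ratio of a 2x2 table t."""
    return (t[0][0] * t[1][1]) / (t[0][1] * t[1][0])

# Illustrative counts, invented for this sketch
n = [[[10, 20], [30, 40]], [[15, 25], [35, 45]]]
m = fit_mutual_independence(n)

# Two-way marginal tables of the fitted values
ab = [[sum(m[j][k]) for k in range(2)] for j in range(2)]
ac = [[m[j][0][l] + m[j][1][l] for l in range(2)] for j in range(2)]
bc = [[m[0][k][l] + m[1][k][l] for l in range(2)] for k in range(2)]

# Each ratio equals 1 up to floating-point rounding
print([round(odds_ratio(t), 10) for t in (ab, ac, bc)])
```

The fitted table reproduces the observed one-way margins but none of the observed two-way associations, which is what the absence of interaction terms amounts to.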

Consider now a more complex log-linear model among the three variables, described by the following log-linear expansion:

$$\log(m_{jkl}) = u + u^{A}_{j} + u^{B}_{k} + u^{C}_{l} + u^{AB}_{jk} + u^{AC}_{jl}$$

In this case, since the maximal terms of interaction are $u^{AB}_{jk}$ and $u^{AC}_{jl}$, the generators of the model will be (AB, AC). Notice that the model can be reformulated in terms of cell probabilities as

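$$\pi_{jkl} = \frac{\pi_{jk+}\, \pi_{j+l}}{\pi_{j++}}$$

or equivalently as

$$\frac{\pi_{jkl}}{\pi_{j++}} = \frac{\pi_{jk+}}{\pi_{j++}} \cdot \frac{\pi_{j+l}}{\pi_{j++}}$$

which, in terms of conditional independence, states that

$$P(B=k, C=l \mid A=j) = P(B=k \mid A=j)\, P(C=l \mid A=j)$$

Figure 5.6 Conditional independence graph for mutual independence (nodes A, B and C, no edges).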

This indicates that, in the conditional distribution (given A), B and C are independent. In other words, $B \perp C \mid A$. Therefore the conditional independence graph of the model is as shown in Figure 5.7. It can be demonstrated that, in this case, the odds ratios between all variable pairs are different from 1, whereas the two odds ratios for the two-way table between B and C, conditional on A, are both equal to 1.
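The statement about odds ratios can be illustrated with a small sketch. Starting from an arbitrary table of cell probabilities (the values in `pi` are invented for illustration), the code builds the probabilities implied by the model (AB, AC) as the product of the AB and AC margins divided by the A margin, and then compares the odds ratio between B and C within each level of A with the marginal odds ratio between B and C. The function names are hypothetical helpers.

```python
def fit_ab_ac(pi):
    """Cell probabilities pi_jk+ * pi_j+l / pi_j++ implied by the model (AB, AC)."""
    fitted = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
    for j in range(2):
        pj = sum(pi[j][k][l] for k in range(2) for l in range(2))  # pi_j++
        for k in range(2):
            pjk = sum(pi[j][k])                                     # pi_jk+
            for l in range(2):
                pjl = pi[j][0][l] + pi[j][1][l]                     # pi_j+l
                fitted[j][k][l] = pjk * pjl / pj
    return fitted

def odds_ratio(t):
    """Odds ratio of a 2x2 table t."""
    return (t[0][0] * t[1][1]) / (t[0][1] * t[1][0])

# Illustrative cell probabilities pi[j][k][l], invented for this sketch
pi = [[[0.20, 0.05], [0.05, 0.10]],
      [[0.05, 0.10], [0.10, 0.35]]]
fitted = fit_ab_ac(pi)

# Odds ratio between B and C within each level of A: 1 up to rounding
conditional = [odds_ratio(fitted[j]) for j in range(2)]

# Odds ratio between B and C after summing over A: not 1 for this table
bc = [[fitted[0][k][l] + fitted[1][k][l] for l in range(2)] for k in range(2)]
marginal = odds_ratio(bc)

print([round(c, 10) for c in conditional], marginal != 1.0)
```

In this example the marginal association between B and C is induced entirely by their common dependence on A, which is why it disappears once A is held fixed.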

We finally consider the most complex (saturated) log-linear model for the three variables:

$$\log(m_{jkl}) = u + u^{A}_{j} + u^{B}_{k} + u^{C}_{l} + u^{AB}_{jk} + u^{AC}_{jl} + u^{BC}_{kl} + u^{ABC}_{jkl}$$

which has (ABC) as generator. This model does not establish any conditional independence constraints on cell probabilities. Correspondingly, all odds ratios, marginal and conditional, will, in general, be different from 1. The corresponding conditional independence graph will be completely connected. The previous model (AB, AC) can be considered as a particular case of the saturated model, obtained by setting $u^{BC}_{kl} = 0$ for all k and l and $u^{ABC}_{jkl} = 0$ for all j, k, l. Equivalently, it is obtained by removing the edge between B and C in the completely connected graph, which corresponds to imposing the constraint that B and C are conditionally independent given A. Notice that the mutual independence model is a particular case of the saturated model obtained by setting $u^{BC}_{kl} = u^{AC}_{jl} = u^{AB}_{jk} = u^{ABC}_{jkl} = 0$ for all j, k, l, or by removing all three edges in the complete graph. Consequently, the differences between log-linear models can be expressed in terms of differences between the parameters or as differences between graphical structures. I think it is easier to interpret differences between graphical structures.

All the models in this example are graphical log-linear models. In general, graphical log-linear models are definable as log-linear models that have as generators the cliques of the conditional independence graph. A clique is a maximal subset of completely connected nodes in a graph. For example, in Figure 5.7 the subsets AB and AC are cliques, and they are the generators of the model. On the other hand, the subsets formed by the single nodes A, B and C are not cliques, because each of them is contained in a larger completely connected subset.

Figure 5.7 Conditional independence graph for $B \perp C \mid A$ (edges A–B and A–C, no edge between B and C).

To better understand the concept of a graphical log-linear model, consider a non-graphical model for the trivariate case. Take the model described by the generator (AB, AC, BC):

$$\log(m_{jkl}) = u + u^{A}_{j} + u^{B}_{k} + u^{C}_{l} + u^{AB}_{jk} + u^{AC}_{jl} + u^{BC}_{kl}$$

Although this model differs from the saturated model by the absence of the three-way interaction term $u^{ABC}_{jkl}$, its conditional independence graph is the same, with one single clique ABC. Therefore, since the model generator is different from the set of cliques, the model is not graphical. To conclude, in this section we have obtained a remarkable equivalence relation between conditional independence statements, graphical representations and probabilistic models, with the probabilistic models represented in terms of cell probabilities, log-linear models or sets of odds ratios.
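The comparison between generators and cliques used in this section can be sketched in a few lines. The code below is an illustrative sketch, not a reference implementation: it assumes that each generator is written as a string of single-letter variable names (for example "AB"), links every pair of variables that appear together in a generator, finds the cliques by brute force (adequate only for a handful of variables), and declares the model graphical when the generators coincide with the cliques. All function names are hypothetical.

```python
from itertools import combinations

def independence_graph(generators):
    """Edges implied by the generators: every pair inside a generator is linked."""
    edges = set()
    for g in generators:
        edges.update(frozenset(pair) for pair in combinations(sorted(g), 2))
    return edges

def cliques(nodes, edges):
    """Maximal completely connected subsets, found by brute force."""
    complete = [set(s) for r in range(1, len(nodes) + 1)
                for s in combinations(sorted(nodes), r)
                if all(frozenset(p) in edges for p in combinations(s, 2))]
    # Keep only the subsets that are not strictly contained in another one
    return {frozenset(s) for s in complete if not any(s < t for t in complete)}

def is_graphical(generators):
    """A model is graphical when its generators are exactly the cliques."""
    gens = {frozenset(g) for g in generators}
    nodes = set().union(*gens)
    return gens == cliques(nodes, independence_graph(gens))

print(is_graphical(["A", "B", "C"]))     # True: mutual independence
print(is_graphical(["AB", "AC"]))        # True: B and C independent given A
print(is_graphical(["ABC"]))             # True: saturated model
print(is_graphical(["AB", "AC", "BC"]))  # False: no three-way interaction term
```

The last call reproduces the non-graphical example: the generators (AB, AC, BC) induce the complete graph, whose only clique is ABC.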