
3.4 Multivariate exploratory analysis of qualitative data

We now discuss the exploratory analysis of multivariate data of qualitative type.

Hitherto we have used the concept of covariance and correlation as the main measures of statistical relationships among quantitative variables. In the case of ordinal qualitative variables, it is possible to extend the notion of covariance and correlation via the concept of ranks. The correlation between the ranks of two variables is known as the Spearman correlation coefficient. More generally, transforming the levels of the ordinal qualitative variables into the corresponding ranks allows most of the analysis applicable to quantitative data to be extended to the ordinal qualitative case.
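As an illustration of this rank-based extension, the short Python sketch below computes the Spearman coefficient both directly and as the linear correlation of the ranks; the two ordinal variables and their numeric coding are invented for the example.

```python
import numpy as np
from scipy import stats

# Hypothetical ordinal variables: satisfaction coded 1-5, perceived quality coded 1-4.
satisfaction = np.array([1, 3, 2, 5, 4, 4, 2, 5, 3, 1])
quality      = np.array([1, 2, 2, 4, 3, 4, 1, 4, 3, 2])

# Spearman's coefficient is Pearson's correlation computed on the ranks
# (ties receive average ranks).
rho_manual = np.corrcoef(stats.rankdata(satisfaction),
                         stats.rankdata(quality))[0, 1]

# scipy computes the same quantity directly.
rho, p_value = stats.spearmanr(satisfaction, quality)

print(rho_manual, rho)  # the two values coincide
```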

However, if the data matrix contains qualitative data at the nominal level, the notion of covariance and correlation cannot be used. In this section we consider summary measures for the intensity of the relationships between qualitative variables of any kind. Such measures are known as association indexes. Although suited for qualitative variables, these indexes can be applied to discrete quantitative variables as well (although this entails a loss of explanatory power).

In the examination of qualitative variables a fundamental part is played by the frequencies with which the levels of the variables occur. The usual starting point for the analysis of qualitative variables is the creation or computation of contingency tables (see Table 3.7). We note that qualitative data are often available in the form of a contingency table and not in the data matrix format. To emphasise this difference, we now introduce a slightly different notation.

Given a qualitative variable $X$ which assumes the levels $X_1, \ldots, X_I$, collected in a population (or sample) of $n$ units, the absolute frequency $n_i$ of the level $X_i$ ($i = 1, \ldots, I$) is the number of times that the level $X_i$ is observed in the sample or population. Denote by $n_{ij}$ the frequency associated with the pair of levels $(X_i, Y_j)$, for $i = 1, 2, \ldots, I$ and $j = 1, 2, \ldots, J$, of the variables $X$ and $Y$. The $n_{ij}$ are also called cell frequencies. Then

$$n_{i+} = \sum_{j=1}^{J} n_{ij}$$

is the marginal frequency of the $i$th row of the table and represents the total number of observations that assume the $i$th level of $X$ ($i = 1, 2, \ldots, I$); and

$$n_{+j} = \sum_{i=1}^{I} n_{ij}$$

is the marginal frequency of the $j$th column of the table and represents the total number of observations that assume the $j$th level of $Y$ ($j = 1, 2, \ldots, J$).

Table 3.7 A two-way contingency table.

X\Y      y1            y2            ...   yj            ...   yk            Total
x1       nxy(x1, y1)   nxy(x1, y2)   ...   nxy(x1, yj)   ...   nxy(x1, yk)   nx(x1)
x2       nxy(x2, y1)   nxy(x2, y2)   ...   nxy(x2, yj)   ...   nxy(x2, yk)   nx(x2)
...      ...           ...           ...   ...           ...   ...           ...
xi       nxy(xi, y1)   nxy(xi, y2)   ...   nxy(xi, yj)   ...   nxy(xi, yk)   nx(xi)
...      ...           ...           ...   ...           ...   ...           ...
xh       nxy(xh, y1)   nxy(xh, y2)   ...   nxy(xh, yj)   ...   nxy(xh, yk)   nx(xh)
Total    ny(y1)        ny(y2)        ...   ny(yj)        ...   ny(yk)        N

Note that for any contingency table the following relationship (called marginalization) holds:

$$\sum_{i=1}^{I} n_{i+} = \sum_{j=1}^{J} n_{+j} = \sum_{i=1}^{I}\sum_{j=1}^{J} n_{ij} = n.$$

We note that given an n × p data matrix (i.e. a data matrix containing p distinct variables), it is possible to construct p(p − 1)/2 two-way contingency tables, corresponding to all possible pairs among the p qualitative variables.

However, it is usually best to generate only the contingency tables for those pairs of variables that might exhibit an interesting relationship.
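As a sketch of this step, assuming the data matrix is held in a pandas DataFrame (the variables and values below are invented for illustration), the p(p − 1)/2 two-way contingency tables can be generated as follows.

```python
from itertools import combinations

import pandas as pd

# Hypothetical data matrix with three qualitative variables.
df = pd.DataFrame({
    "education": ["low", "high", "medium", "high", "low", "medium", "high", "low"],
    "region":    ["north", "south", "north", "north", "south", "south", "north", "south"],
    "buyer":     ["yes", "no", "yes", "yes", "no", "no", "yes", "no"],
})

# One two-way contingency table per pair of variables: p(p - 1)/2 tables in total.
tables = {
    (a, b): pd.crosstab(df[a], df[b], margins=True, margins_name="Total")
    for a, b in combinations(df.columns, 2)
}

print(tables[("education", "buyer")])
```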

3.4.1 Independence and association

In order to develop indexes to describe the relationship between qualitative variables it is necessary to first introduce the concept of statistical independence. Two variables X and Y are said to be independent, for a sample of n observations, if

$$\frac{n_{i1}}{n_{+1}} = \frac{n_{i2}}{n_{+2}} = \cdots = \frac{n_{iJ}}{n_{+J}} = \frac{n_{i+}}{n}, \quad \forall\, i = 1, 2, \ldots, I,$$

or, equivalently, if

$$\frac{n_{1j}}{n_{1+}} = \frac{n_{2j}}{n_{2+}} = \cdots = \frac{n_{Ij}}{n_{I+}} = \frac{n_{+j}}{n}, \quad \forall\, j = 1, 2, \ldots, J.$$

If this occurs it means that, with reference to the first equation, the (bivariate) joint analysis of the two variables X and Y does not give any additional knowledge about X beyond what can be gained from the univariate analysis of the variable X; the same is true for the variable Y in the second equation. When this happens, Y and X are said to be statistically independent. Note that the concept of statistical independence is symmetric: if X is independent of Y then Y is independent of X.

The previous conditions can be equivalently, and more conveniently, expressed as a function of the marginal frequencies $n_{i+}$ and $n_{+j}$. In this case X and Y are independent if

$$n_{ij} = \frac{n_{i+}\, n_{+j}}{n}, \quad \forall\, i = 1, 2, \ldots, I; \ \forall\, j = 1, 2, \ldots, J.$$

In terms of relative frequencies this is equivalent to

$$p_{XY}(x_i, y_j) = p_X(x_i)\, p_Y(y_j), \quad \text{for every } i \text{ and for every } j.$$

When working with real data the statistical independence condition is almost never satisfied exactly; in other words, real data often show some degree of dependence among the variables.
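The following sketch, based on a made-up 3 × 2 table of counts, computes the frequencies expected under independence, $n_{i+} n_{+j}/n$, and compares them with the observed ones; as noted above, with real data the two tables essentially never coincide exactly.

```python
import numpy as np

# Observed frequencies n_ij for a hypothetical 3 x 2 contingency table.
n_ij = np.array([[30, 10],
                 [20, 20],
                 [10, 30]])

n = n_ij.sum()
n_i = n_ij.sum(axis=1, keepdims=True)   # row marginals n_{i+}
n_j = n_ij.sum(axis=0, keepdims=True)   # column marginals n_{+j}

# Frequencies expected under independence: n_{i+} n_{+j} / n.
expected = n_i * n_j / n

print(expected)
print(np.allclose(n_ij, expected))   # False: the observed table departs from independence
```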

We note that the notion of statistical independence applies to both qualitative and quantitative variables. On the other hand, measures of dependence are defined differently depending on whether the variables are quantitative or qualitative. In the first case it is possible to calculate summary measures (called correlation measures) that work both on the levels and on the frequencies. In the second case the summary measures (called association measures) must depend on the frequencies, since the levels are not metric.

For the case of quantitative variables an important relationship holds between statistical independence and the absence of correlation. If two variables X and Y are statistically independent, then Cov(X, Y) = 0 and r(X, Y) = 0. The converse is not necessarily true: two variables may be such that r(X, Y) = 0 even though they are not independent. In other words, the absence of correlation does not imply statistical independence.

The study of association is more complicated than the study of correlation because there are a multitude of association measures. Here we examine three different classes of these: distance measures, dependency measures, and model-based measures.

3.4.2 Distance measures

As already remarked, independence between two variables, X and Y, holds when

$$n_{ij} = \frac{n_{i+}\, n_{+j}}{n}$$

for all joint frequencies of the contingency table. One way to provide a summary measure of the association between two variables is based on the calculation of a ‘global’ measure of disagreement between the frequencies actually observed ($n_{ij}$) and those expected under the assumption of independence between the two variables ($n_{i+}\, n_{+j}/n$). The original statistic proposed by Karl Pearson is the most widely used measure for assessing the hypothesis of independence between X and Y. In the general case, such a measure is defined by

$$X^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{\left(n_{ij} - n_{i+}\, n_{+j}/n\right)^{2}}{n_{i+}\, n_{+j}/n}.$$

Note that X² = 0 if the variables X and Y are independent; in fact, in such a case the terms in the numerator are all zero.

We note that the X² statistic can be written in the equivalent form

$$X^2 = n\left[\sum_{i=1}^{I}\sum_{j=1}^{J} \frac{n_{ij}^{2}}{n_{i+}\, n_{+j}} - 1\right],$$

which emphasizes the dependence of the statistic on the number of observations, n; this is a potential problem, since the value of X² increases with the sample size n. To overcome this problem, alternative measures have been proposed that are functions of the previous statistic.
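A minimal sketch of the computation, reusing the made-up table from the previous example: both forms of the statistic are evaluated and checked against scipy's chi2_contingency (the scipy call is an independent cross-check, not part of the original exposition).

```python
import numpy as np
from scipy.stats import chi2_contingency

n_ij = np.array([[30, 10],
                 [20, 20],
                 [10, 30]])
n = n_ij.sum()
n_i = n_ij.sum(axis=1, keepdims=True)
n_j = n_ij.sum(axis=0, keepdims=True)
expected = n_i * n_j / n

# Pearson's statistic in its usual form ...
X2 = ((n_ij - expected) ** 2 / expected).sum()

# ... and in the equivalent form X^2 = n * (sum of n_ij^2 / (n_{i+} n_{+j}) - 1).
X2_alt = n * ((n_ij ** 2 / (n_i * n_j)).sum() - 1)

# Same value from scipy (correction=False disables the Yates correction).
chi2, p, dof, exp = chi2_contingency(n_ij, correction=False)

print(X2, X2_alt, chi2)   # identical up to floating-point error
```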

A first measure is

$$\phi^2 = \frac{X^2}{n} = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{n_{ij}^{2}}{n_{i+}\, n_{+j}} - 1,$$

usually called the mean contingency. The square root of φ² is instead called the phi coefficient. Note that, in the case of a 2 × 2 contingency table representing binary variables, the φ² coefficient is normalised, as it takes values between 0 and 1, and, furthermore, it can be shown that

$$\phi^2 = \frac{\operatorname{Cov}^2(X, Y)}{\operatorname{Var}(X)\operatorname{Var}(Y)}.$$

Therefore, the φ2 coefficient, in the case of 2×2 tables, is equivalent to the squared linear correlation coefficient.
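The equivalence can be verified numerically on an invented 2 × 2 table: φ² computed from the frequencies coincides with the squared linear correlation of the underlying 0/1 codings.

```python
import numpy as np

# Hypothetical 2 x 2 table: rows are X = 0, 1 and columns are Y = 0, 1.
n_ij = np.array([[40, 10],
                 [15, 35]])
n = n_ij.sum()
n_i = n_ij.sum(axis=1, keepdims=True)
n_j = n_ij.sum(axis=0, keepdims=True)

# phi^2 = X^2 / n, using the equivalent form of X^2.
phi2 = (n_ij ** 2 / (n_i * n_j)).sum() - 1

# Reconstruct the 0/1 observations cell by cell and compute the squared
# linear correlation coefficient.
x = np.repeat([0, 0, 1, 1], n_ij.ravel())
y = np.repeat([0, 1, 0, 1], n_ij.ravel())
r2 = np.corrcoef(x, y)[0, 1] ** 2

print(phi2, r2)   # the two values coincide
```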

However, in the case of contingency tables larger than 2 × 2, the φ² index is not normalised. The Cramer index normalises the X² measure so that it can be used for making comparisons. The Cramer index is obtained by dividing X² by the maximum value it can assume for a given contingency table; this is a common approach used in descriptive statistics for normalising measures. Since such a maximum can be shown to be n times the smaller of I − 1 and J − 1, where I and J are the number of rows and columns of the contingency table respectively, the Cramer index is equal to

$$V^2 = \frac{X^2}{n\,\min[I - 1,\ J - 1]}.$$

It can be shown that 0 ≤ V² ≤ 1 for any I × J contingency table and, in particular, V² = 0 if and only if X and Y are independent. On the other hand, V² = 1 in case of maximum dependency between the two variables. V² takes value 1 in three instances:

(a) There is maximum dependency of Y on X when in every row of the table there is only one non-zero frequency. This happens when every level of X corresponds to one and only one level of Y. If this holds, then V² = 1 and I ≥ J.

(b) There is maximum dependency of X on Y when in every column of the table there is only one non-zero frequency. This means that every level of Y corresponds to one and only one level of X. This condition occurs when V² = 1 and J ≥ I.

(c) If both of the two previous conditions are simultaneously satisfied, that is, if I = J, then V² = 1 means that the two variables are maximally dependent.
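A small sketch of the Cramer index (the function name and the two invented tables are ours), illustrating the extreme cases V² = 0 (independence) and V² = 1 (case (a), where every row has a single non-zero frequency and I ≥ J).

```python
import numpy as np

def cramer_index(n_ij):
    """Normalised association index V^2 = X^2 / (n * min(I - 1, J - 1))."""
    n_ij = np.asarray(n_ij, dtype=float)
    n = n_ij.sum()
    expected = n_ij.sum(axis=1, keepdims=True) * n_ij.sum(axis=0, keepdims=True) / n
    X2 = ((n_ij - expected) ** 2 / expected).sum()
    I, J = n_ij.shape
    return X2 / (n * min(I - 1, J - 1))

# Independence (proportional rows) gives V^2 = 0 ...
print(cramer_index([[20, 30, 50], [10, 15, 25]]))

# ... while a table in which each level of X determines one level of Y (case (a))
# gives V^2 = 1.
print(cramer_index([[25, 0, 0], [0, 30, 0], [0, 0, 45], [20, 0, 0]]))
```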

In our exposition we have referred to the case of two-way contingency tables, involving two variables, with an arbitrary number of levels. However, the measures presented in this subsection can be easily applied to multi-way tables, extending the number of summands in the definition of X² to account for all table cells.

In conclusion, the association indexes based on the X² Pearson statistic measure the distance between the relationship of X and Y and the case of independence. They represent a generic notion of association, in the sense that they measure exclusively the distance from the independence situation, without informing us about the nature of the relationship between X and Y. On the other hand, these indexes are rather general, as they can be applied in the same fashion to all kinds of contingency tables. Furthermore, the X² statistic has an asymptotic probabilistic (theoretical) distribution and, therefore, can also be used to assess an inferential threshold to evaluate inductively whether the examined variables are significantly dependent.

3.4.3 Dependency measures

The measures of association seen so far are all functions of the X² statistic and thus have the disadvantage of being hard to interpret in the majority of real applications. This important point was underlined by Goodman and Kruskal (1979), who proposed an alternative approach for measuring the association in a contingency table. The set-up followed by Goodman and Kruskal is based on the definition of indexes suited for the specific investigation context in which they are applied. In other words, such indexes are characterised by an operational meaning that defines the nature of the dependency between the available variables.

We now examine two such measures. Suppose that, in a two-way contingency table, Y is the ‘dependent’ variable and X the ‘explanatory’ variable. It is of interest to evaluate whether, for a generic observation, knowing the category of X can reduce the uncertainty as to what the corresponding category of Y might be. The ‘degree of uncertainty’ as to the category of a qualitative variable is usually expressed via a heterogeneity index.

Let δ(Y) indicate a heterogeneity measure for the marginal distribution of Y, expressed by the vector of marginal relative frequencies, $\{f_{+1}, f_{+2}, \ldots, f_{+J}\}$. Similarly, let δ(Y|i) be the same measure calculated on the distribution of Y conditional on the ith row of the variable X of the contingency table, $\{f_{1|i}, f_{2|i}, \ldots, f_{J|i}\}$.

An association index based on the ‘proportional reduction in the heterogeneity’ (error proportional reduction index, EPR) is then given (see, for instance, Agresti, 1990) by

$$\mathrm{EPR} = \frac{\delta(Y) - M[\delta(Y|X)]}{\delta(Y)},$$

where M[δ(Y|X)] is the mean heterogeneity calculated with respect to the distribution of X, namely

$$M[\delta(Y|X)] = \sum_{i} f_{i+}\,\delta(Y|i),$$

with

$$f_{i+} = \frac{n_{i+}}{n} \quad (i = 1, 2, \ldots, I).$$

The above index measures the proportion of heterogeneity of Y (calculated through δ) that can be ‘explained’ by the relationship with X.

Depending on the choice of the heterogeneity index δ, different association measures can be obtained. Usually, the choice is between the Gini index and the entropy index. In the first case it can be shown that the EPR index gives rise to the so-called concentration coefficient, $\tau_{Y|X}$:

$$\tau_{Y|X} = \frac{\displaystyle\sum_{i}\sum_{j} f_{ij}^{2}/f_{i+} \;-\; \sum_{j} f_{+j}^{2}}{1 - \displaystyle\sum_{j} f_{+j}^{2}}.$$

In the second case, using the entropy index in the EPR expression, we obtain the so-called uncertainty coefficient, $U_{Y|X}$:

$$U_{Y|X} = -\,\frac{\displaystyle\sum_{i}\sum_{j} f_{ij}\,\log\!\left(\frac{f_{ij}}{f_{i+}\, f_{+j}}\right)}{\displaystyle\sum_{j} f_{+j}\,\log f_{+j}},$$

where, in the case of null frequencies, the convention 0 log 0 = 0 is adopted. It can be shown that both $\tau_{Y|X}$ and $U_{Y|X}$ take values in the [0, 1] interval. Note, in particular, that:

• $\tau_{Y|X} = U_{Y|X} = 0$ if and only if the variables are independent;

• $\tau_{Y|X} = U_{Y|X} = 1$ if and only if Y has maximum dependence on X.
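The sketch below (the function name and the tables are invented for illustration) computes both EPR-based indexes from a contingency table with X on the rows and Y on the columns; the two small tables reproduce the two extreme cases just listed.

```python
import numpy as np

def tau_and_u(n_ij):
    """Concentration coefficient tau_{Y|X} and uncertainty coefficient U_{Y|X},
    computed from a contingency table with X on the rows and Y on the columns."""
    f = np.asarray(n_ij, dtype=float)
    f = f / f.sum()                      # relative frequencies f_ij
    f_i = f.sum(axis=1, keepdims=True)   # f_{i+}
    f_j = f.sum(axis=0, keepdims=True)   # f_{+j}

    # Gini-based EPR index (concentration coefficient).
    tau = ((f ** 2 / f_i).sum() - (f_j ** 2).sum()) / (1 - (f_j ** 2).sum())

    # Entropy-based EPR index (uncertainty coefficient); 0 log 0 is treated as 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(f > 0, f * np.log(f / (f_i * f_j)), 0.0)
    u = -terms.sum() / (f_j * np.log(f_j)).sum()
    return tau, u

# Y is a function of X: both indexes equal 1.
print(tau_and_u([[30, 0], [0, 20], [0, 10]]))

# Independent table (proportional rows): both indexes equal 0.
print(tau_and_u([[20, 30], [10, 15]]))
```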

The indexes described have a simple operational interpretation regarding specific aspects of the dependence link between the variables. In particular, both $\tau_{Y|X}$ and $U_{Y|X}$ represent alternative quantifications of the reduction of the heterogeneity of Y that can be explained through the dependence of Y on X. From this viewpoint they are, in comparison to the distance measures of association, rather specific.

On the other hand, they are less general than the distance measures. Their application requires the identification of a causal link from one variable (explanatory) to another (dependent), while the X²-based indexes are symmetric.

Furthermore, the previous indexes cannot easily be extended to contingency tables with more than two variables, and cannot be used to derive an inferential threshold.

3.4.4 Model-based measures

The last set of association measures that we present is different from the previous two sets in that it does not depend on the marginal distributions of the variables. For ease of notation, we will assume a probability model in which cell relative frequencies are replaced by cell probabilities. The cell probabilities can be interpreted as relative frequencies as the sample size tends to infinity; therefore they have the same properties as relative frequencies.

Consider a 2 × 2 contingency table summarising the joint distribution of the variables X and Y; the rows report the values of X (X = 0, 1) and the columns the values of Y (Y = 0, 1). Let $\pi_{11}$, $\pi_{00}$, $\pi_{10}$ and $\pi_{01}$ denote the probability that an observation is classified in one of the four cells of the table. The odds ratio is a measure of association that constitutes a fundamental parameter in the statistical models for the analysis of qualitative data. Let $\pi_{1|1}$ and $\pi_{0|1}$ denote the conditional probabilities of having a 1 (a success) and a 0 (a failure) in row 1, and $\pi_{1|0}$ and $\pi_{0|0}$ the same probabilities for row 0. The odds of success for row 1 are defined by

$$\mathrm{odds}_1 = \frac{\pi_{1|1}}{\pi_{0|1}} = \frac{P(Y = 1 \mid X = 1)}{P(Y = 0 \mid X = 1)},$$

and for row 0 by

$$\mathrm{odds}_0 = \frac{\pi_{1|0}}{\pi_{0|0}} = \frac{P(Y = 1 \mid X = 0)}{P(Y = 0 \mid X = 0)}.$$

The odds are always non-negative, with a value greater than 1 when a success (level 1) is more probable than a failure (level 0), that is, when $P(Y = 1 \mid X = 1) > P(Y = 0 \mid X = 1)$. For example, if the odds equal 4 this means that a success is four times more probable than a failure. In other words, one expects to observe four successes for every failure (i.e. four successes in five events). Conversely, if the odds are 1/4 = 0.25 then a failure is four times more probable than a success, and one expects to observe one success for every four failures (i.e. one success in five events).

The ratio between the above two odds values is called the odds ratio:

$$\theta = \frac{\mathrm{odds}_1}{\mathrm{odds}_0} = \frac{\pi_{1|1}/\pi_{0|1}}{\pi_{1|0}/\pi_{0|0}}.$$

From the definition of the odds, and using the definition of joint probability, it can easily be shown that:

$$\theta = \frac{\pi_{11}\cdot\pi_{00}}{\pi_{10}\cdot\pi_{01}}.$$

This expression shows that the odds ratio is a cross product ratio, the product of probabilities on the main diagonal divided by the product of the probabilities off the main diagonal of a contingency table.

In the actual computation of the odds ratio, the probabilities will be replaced with the observed frequencies, leading to the expression

$$\theta = \frac{n_{11}\, n_{00}}{n_{10}\, n_{01}}.$$

We now list some properties of the odds ratio, without proof.

1. The odds ratio can be equal to any non-negative number, that is, it can take values in the interval [0,+∞).

2. When X and Y are independent, $\pi_{1|1} = \pi_{1|0}$, so that $\mathrm{odds}_1 = \mathrm{odds}_0$ and θ = 1.

On the other hand, depending on whether the odds ratio is greater or smaller than 1 it is possible to evaluate the sign of the association:

• for θ >1 there is a positive association, since the odds of success are greater in row 1 than in row 0;

• for 0 < θ < 1 there is a negative association, since the odds of success are greater in row 0 than in row 1.

3. When the order of the rows or the order of the columns is reversed, the new value of θ is the reciprocal of the original value. On the other hand, the odds ratio does not change value when the orientation of the table is reversed so that the rows become columns and the columns become rows.

This means that the odds ratio deals with the variables symmetrically and, therefore, it is not necessary to identify one variable as dependent and the other as explanatory.

The odds ratio can be used as an exploratory tool aimed at building a probabilistic model, similarly to the linear correlation coefficient.

Concerning the construction of a decision rule that determines whether a certain observed value of the odds ratio indicates a significant association between the corresponding variables, it is possible to derive a confidence interval, as was done for the correlation coefficient. The interval leads to a rule for detecting a significant association when

$$\left|\log \theta\right| > z_{\alpha/2}\,\sqrt{\frac{1}{n_{11}} + \frac{1}{n_{10}} + \frac{1}{n_{01}} + \frac{1}{n_{00}}},$$

where $z_{\alpha/2}$ is the 100(1 − α/2)% percentile of a standard normal distribution.

For instance, when α = 0.05, $z_{\alpha/2}$ = 1.96. We remark that the confidence interval used in this case is only approximate, but the approximation improves with the sample size.
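A sketch of the computation on invented counts: the observed odds ratio, the approximate (Wald-type) confidence interval for log θ corresponding to the rule above, and the resulting significance decision.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 2 x 2 table: rows X = 1, 0 and columns Y = 1, 0.
n11, n10 = 60, 40   # X = 1: successes and failures
n01, n00 = 30, 70   # X = 0: successes and failures

theta = (n11 * n00) / (n10 * n01)          # cross-product ratio
log_theta = np.log(theta)

# Approximate standard error of log(theta) and the resulting decision rule.
se = np.sqrt(1 / n11 + 1 / n10 + 1 / n01 + 1 / n00)
alpha = 0.05
z = norm.ppf(1 - alpha / 2)                # 1.96 for alpha = 0.05

ci = (np.exp(log_theta - z * se), np.exp(log_theta + z * se))
significant = abs(log_theta) > z * se      # equivalently: 1 lies outside the CI for theta

print(theta, ci, significant)
```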

So far we have defined the odds ratio for 2 × 2 contingency tables. However, odds ratios can be calculated in a similar fashion for larger contingency tables. The odds ratio for I × J tables can be defined with reference to each of the $\binom{I}{2} = I(I-1)/2$ pairs of rows in combination with each of the $\binom{J}{2} = J(J-1)/2$ pairs of columns. There are $\binom{I}{2}\binom{J}{2}$ odds ratios of this type. As the number of odds ratios to be calculated can become enormous, it is wise to choose parsimonious representations.
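As a sketch, the following code enumerates all such odds ratios for an invented 3 × 3 table, one for each pair of rows combined with each pair of columns.

```python
from itertools import combinations

import numpy as np

# Hypothetical 3 x 3 contingency table.
n = np.array([[25, 10,  5],
              [10, 30, 10],
              [ 5, 10, 25]])

I, J = n.shape
odds_ratios = {}
# One odds ratio for every pair of rows combined with every pair of columns:
# binom(I, 2) * binom(J, 2) values in total.
for (i1, i2) in combinations(range(I), 2):
    for (j1, j2) in combinations(range(J), 2):
        odds_ratios[(i1, i2, j1, j2)] = (n[i1, j1] * n[i2, j2]) / (n[i1, j2] * n[i2, j1])

print(len(odds_ratios))        # 9 = binom(3, 2) * binom(3, 2)
for key, value in odds_ratios.items():
    print(key, round(value, 2))
```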