Organisation of the data

2.4 Frequency distributions

Often it seems natural to summarise statistical variables by the co-occurrence of their levels. A summary of this type is called a frequency distribution. As with all procedures of this kind, the summary makes it easier to analyse and present the results, but it also leads to a loss of information. In the case of qualitative variables, the summary is justified by the need to carry out quantitative analysis on the data. In other situations, such as with quantitative variables, the summary serves mainly to simplify the analysis and presentation of results.

2.4.1 Univariate distributions

First we will concentrate on univariate analysis, the analysis of a single variable. This simplifies both the presentation of results and the analytical method. It is easier to extract information from a database by beginning with univariate analysis and then moving on to multivariate analysis. Determining the univariate frequency distribution from the data matrix is often the first step in a univariate exploratory analysis. To create a frequency distribution for a variable it is necessary to know the number of times each level appears in the data. This number is called the absolute frequency. The levels and their frequencies together give the frequency distribution.

The observations related to the variable being examined can be indicated as $x_1, x_2, \ldots, x_N$, omitting the index related to the variable itself. The distinct values among the $N$ observations (the levels) are indicated as $x_1^*, x_2^*, \ldots, x_k^*$ ($k \le N$). The frequency distribution is shown in Table 2.4, where $n_i$ indicates the number of times level $x_i^*$ appears (its absolute frequency). Note that $\sum_{i=1}^{k} n_i = N$, where $N$ is the number of classified units. Table 2.5 shows an example of a frequency distribution for a binary qualitative variable that will be analysed in Chapter 10.

It can be seen from Table 2.5 that the data at hand is fairly balanced between the two levels.
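The counting step described above can be sketched in a few lines of Python. This is only an illustration: the list `observations` is a hypothetical stand-in rebuilt from the counts in Table 2.5, not the original data matrix, and the variable names are our own.

```python
from collections import Counter

# Hypothetical raw observations x_1, ..., x_N, rebuilt from the counts
# in Table 2.5 (the original data matrix is not reproduced here).
observations = [0] * 1445 + [1] * 1006

# Absolute frequencies n_i: number of times each distinct level x_i* occurs.
absolute = Counter(observations)

N = len(observations)
levels = sorted(absolute)  # the k distinct levels, with k <= N

for level in levels:
    print(level, absolute[level])

# The absolute frequencies add up to the number of classified units.
assert sum(absolute.values()) == N
```

A `Counter` is used here because it maps each level directly to its absolute frequency; any dictionary-based tally would serve equally well.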

To make reading and interpretation easier, a frequency distribution is usually presented with relative frequencies. The relative frequency of the level $x_i^*$, indicated by $p_i$, is defined as the ratio between the absolute frequency $n_i$ and the total number of observations: $p_i = n_i / N$. Note that $\sum_{i=1}^{k} p_i = 1$.

Table 2.4 Univariate frequency distribution.

| Levels | Absolute frequencies |
|---|---|
| $x_1^*$ | $n_1$ |
| $x_2^*$ | $n_2$ |
| ... | ... |
| $x_k^*$ | $n_k$ |

Table 2.5 Example of a frequency distribution.

| Levels | Absolute frequencies |
|---|---|
| 0 | 1445 |
| 1 | 1006 |

Table 2.6 Univariate relative frequency distribution.

| Levels | Relative frequencies |
|---|---|
| $x_1^*$ | $p_1$ |
| $x_2^*$ | $p_2$ |
| ... | ... |
| $x_k^*$ | $p_k$ |

Table 2.7 Example of a univariate relative frequency distribution.

| Levels | Relative frequencies |
|---|---|
| 0 | 0.59 |
| 1 | 0.41 |

The resulting relative frequency distribution has the form shown in Table 2.6. For the frequency distribution in Table 2.5 we obtain the relative frequencies in Table 2.7.
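As a quick check of the step from Table 2.5 to Table 2.7, the following minimal sketch divides each absolute frequency by the total. The dictionary layout and variable names are assumptions made for illustration only.

```python
# Absolute frequencies n_i taken from Table 2.5.
absolute = {0: 1445, 1: 1006}

# Total number of classified units N.
N = sum(absolute.values())

# Relative frequencies p_i = n_i / N.
relative = {level: n / N for level, n in absolute.items()}

for level, p in relative.items():
    print(f"{level}  {p:.2f}")

# The relative frequencies sum to 1 (up to floating-point rounding).
assert abs(sum(relative.values()) - 1.0) < 1e-12
```

Rounded to two decimals, the values agree with those reported in Table 2.7.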

2.4.2 Multivariate distributions

Now we shall see how it is possible to create multivariate frequency distributions for the joint examination of more than one variable. We will look particularly at qualitative or discrete quantitative variables. For continuous quantitative multivariate variables, it is better to work directly with the data matrix. Multivariate frequency distributions are represented by a contingency table. For clarity, we will mainly consider the case where two variables are examined at a time. This creates a bivariate distribution, represented by a contingency table with two dimensions.

Let $X$ and $Y$ be the two variables collected for $N$ statistical units, which take on $h$ levels for $X$, $x_1^*, \ldots, x_h^*$, and $k$ levels for $Y$, $y_1^*, \ldots, y_k^*$. The result of the joint classification of the variables into a contingency table can be summarised by the pairs $\{(x_i^*, y_j^*), n_{xy}(x_i^*, y_j^*)\}$, where $n_{xy}(x_i^*, y_j^*)$ indicates the number of statistical units, among the $N$ considered, for which the level pair $(x_i^*, y_j^*)$ is observed. The value $n_{xy}(x_i^*, y_j^*)$ is called the absolute joint frequency of the pair $(x_i^*, y_j^*)$. For simplicity we will often refer to $n_{xy}(x_i^*, y_j^*)$ with the symbol $n_{ij}$.

Note that since $N = \sum_{i=1}^{h} \sum_{j=1}^{k} n_{xy}(x_i^*, y_j^*)$ is equal to the total number of classified units, we can get relative joint frequencies from the equation

$$p_{xy}(x_i^*, y_j^*) = \frac{n_{xy}(x_i^*, y_j^*)}{N}$$

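Table 2.8 A two-way contingency table.

| X\Y | $y_1^*$ | $y_2^*$ | ... | $y_j^*$ | ... | $y_k^*$ | |
|---|---|---|---|---|---|---|---|
| $x_1^*$ | $n_{xy}(x_1^*, y_1^*)$ | $n_{xy}(x_1^*, y_2^*)$ | ... | $n_{xy}(x_1^*, y_j^*)$ | ... | $n_{xy}(x_1^*, y_k^*)$ | $n_X(x_1^*)$ |
| $x_2^*$ | $n_{xy}(x_2^*, y_1^*)$ | $n_{xy}(x_2^*, y_2^*)$ | ... | $n_{xy}(x_2^*, y_j^*)$ | ... | $n_{xy}(x_2^*, y_k^*)$ | $n_X(x_2^*)$ |
| ... | ... | ... | ... | ... | ... | ... | ... |
| $x_i^*$ | $n_{xy}(x_i^*, y_1^*)$ | $n_{xy}(x_i^*, y_2^*)$ | ... | $n_{xy}(x_i^*, y_j^*)$ | ... | $n_{xy}(x_i^*, y_k^*)$ | $n_X(x_i^*)$ |
| ... | ... | ... | ... | ... | ... | ... | ... |
| $x_h^*$ | $n_{xy}(x_h^*, y_1^*)$ | $n_{xy}(x_h^*, y_2^*)$ | ... | $n_{xy}(x_h^*, y_j^*)$ | ... | $n_{xy}(x_h^*, y_k^*)$ | $n_X(x_h^*)$ |
| | $n_Y(y_1^*)$ | $n_Y(y_2^*)$ | ... | $n_Y(y_j^*)$ | ... | $n_Y(y_k^*)$ | $N$ |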

To classify the observations into a contingency table, we could mark the levels of the variable $X$ in the rows and the levels of the variable $Y$ in the columns. In the table we will therefore include the joint frequencies, as shown in Table 2.8. Note that from the joint frequencies it is easy to get the marginal univariate frequencies of $X$ and $Y$ using the following equations:

$$n_X(x_i^*) = \sum_{j=1}^{k} n_{xy}(x_i^*, y_j^*)$$

$$n_Y(y_j^*) = \sum_{i=1}^{h} n_{xy}(x_i^*, y_j^*)$$

Table 2.8 reports absolute frequencies. It can also be expressed in terms of relative frequencies. This will lead to two analogous equations that determine marginal relative univariate frequencies.
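To make the marginal sums concrete, here is a minimal sketch that derives absolute and relative marginal frequencies from a table of joint frequencies. The counts are those reported in Table 2.9 (X = NPURCHASES, Y = SOUTH); storing the table as a dictionary keyed by level pairs, and the variable names, are illustrative assumptions.

```python
# Joint absolute frequencies n_xy(x_i*, y_j*), keyed by (x level, y level).
# Counts as reported in Table 2.9 (X = NPURCHASES, Y = SOUTH).
joint = {(0, 0): 1102, (0, 1): 343, (1, 0): 818, (1, 1): 188}

x_levels = sorted({x for x, _ in joint})
y_levels = sorted({y for _, y in joint})

# Marginal absolute frequencies: sum over j for X, sum over i for Y.
n_x = {x: sum(joint[(x, y)] for y in y_levels) for x in x_levels}
n_y = {y: sum(joint[(x, y)] for x in x_levels) for y in y_levels}

# Total number of classified units.
N = sum(joint.values())

# Relative versions: divide every absolute frequency by N.
p_xy = {pair: n / N for pair, n in joint.items()}
p_x = {x: n / N for x, n in n_x.items()}
p_y = {y: n / N for y, n in n_y.items()}

print(n_x, n_y, N)
```

The marginal frequencies of X obtained this way coincide with the univariate distribution in Table 2.5, as expected.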

From a joint frequency distribution it is also possible to determine $h$ frequency distributions of the variable $Y$, conditioned on the $h$ levels of $X$. Each of these, indicated by $(Y \mid X = x_i^*)$, shows the frequency distribution of $Y$ only for the observations where $X = x_i^*$. For example, the frequency with which we observe $Y = y_1^*$ conditional on $X = x_i^*$ can be obtained from the ratio

$$p_{Y|X}(y_1^* \mid x_i^*) = \frac{p_{xy}(x_i^*, y_1^*)}{p_X(x_i^*)}$$

where $p_{xy}$ indicates the joint frequency distribution of $X$ and $Y$ and $p_X$ the marginal (unidimensional) frequency distribution of $X$. Similarly, we can get $k$ frequency distributions of $X$ conditioned on the $k$ levels of $Y$.
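The ratio above can be illustrated with a short sketch that builds the conditional distributions of Y given each level of X. The joint counts come from Table 2.9; the helper `conditional_y_given_x` and the dictionary layout are hypothetical names chosen for this example.

```python
# Joint absolute frequencies keyed by (x level, y level), from Table 2.9.
joint = {(0, 0): 1102, (0, 1): 343, (1, 0): 818, (1, 1): 188}
N = sum(joint.values())

# Relative joint frequencies p_xy and marginal relative frequencies p_X.
p_xy = {pair: n / N for pair, n in joint.items()}
p_x = {}
for (x, _), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p


def conditional_y_given_x(x):
    """Distribution of Y restricted to the units with X = x: p_xy / p_X."""
    return {y: p / p_x[x] for (xi, y), p in p_xy.items() if xi == x}


# One conditional distribution of Y for each of the h levels of X.
conditional = {x: conditional_y_given_x(x) for x in p_x}

for x, dist in conditional.items():
    print(x, {y: round(p, 4) for y, p in dist.items()})
```

Each conditional distribution sums to 1, since it is a frequency distribution in its own right over the units with X fixed at one level.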

Statistical software makes it easy to create and analyse contingency tables.

Consider a 2×2 table where X is the binary variable Npurchases (number of purchases) and Y = South (referring to the geographic area where the customer comes from); we will look at this in more detail in Chapter 10. The output in Table 2.9 shows the following four pieces of information for each of the four possible pairs of levels of X and Y: (a) absolute frequency of the pair; (b) relative frequency of the pair; (c) conditional frequency of Y = y, conditional on the X row (row percentage); (d) conditional frequency of X = x, conditional on the Y column (column percentage). Quantities (b) to (d) are expressed as percentages.

Table 2.9 Example of a two-way contingency table: NPURCHASES (rows) by SOUTH (columns).

| NPURCHASES | Cell content | SOUTH = 0 | SOUTH = 1 | Total |
|---|---|---|---|---|
| 0 | (a) frequency | 1102 | 343 | 1445 |
| | (b) percent | 44.96 | 13.99 | 58.96 |
| | (c) row percent | 76.26 | 23.74 | |
| | (d) column percent | 57.40 | 64.60 | |
| 1 | (a) frequency | 818 | 188 | 1006 |
| | (b) percent | 33.37 | 7.67 | 41.04 |
| | (c) row percent | 81.31 | 18.69 | |
| | (d) column percent | 42.60 | 35.40 | |
| Total | frequency | 1920 | 531 | 2451 |
| | percent | 78.34 | 21.66 | 100.00 |
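The four figures reported in each cell of Table 2.9 can be recomputed from the absolute frequencies alone. The sketch below is one way to do so, under the assumption that the third and fourth figures are the row and column percentages respectively; the function `cell_summary` is a hypothetical helper, not part of any statistical package.

```python
from collections import defaultdict

# Absolute joint frequencies from Table 2.9, keyed by (NPURCHASES, SOUTH).
joint = {(0, 0): 1102, (0, 1): 343, (1, 0): 818, (1, 1): 188}

# Row totals (marginal of NPURCHASES), column totals (marginal of SOUTH), N.
row_total = defaultdict(int)
col_total = defaultdict(int)
for (x, y), n in joint.items():
    row_total[x] += n
    col_total[y] += n
N = sum(joint.values())


def cell_summary(x, y):
    """Return (frequency, percent, row percent, column percent) for a cell."""
    n = joint[(x, y)]
    return (
        n,
        round(100 * n / N, 2),             # relative frequency of the pair
        round(100 * n / row_total[x], 2),  # frequency of SOUTH = y within row x
        round(100 * n / col_total[y], 2),  # frequency of NPURCHASES = x within column y
    )


for pair in sorted(joint):
    print(pair, cell_summary(*pair))
```

Under this reading, the recomputed values agree with the figures printed in Table 2.9 for all four cells.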

From the document Applied Data Mining: Statistical Methods for Business and Industry (pages 39-43)