
3.2 Bivariate exploratory analysis

The relationship between two variables can be graphically represented using a scatterplot. Figure 3.7 shows the relationship between the observed values of two performance indicators, return on investment (ROI) and return on equity (ROE), for a set of business enterprises in the computer sector. There is a noticeable increasing trend in the relationship between the two variables. Both variables in Figure 3.7 are quantitative and continuous, but a scatterplot can be drawn for all kinds of variables.

A real data set usually contains more than two variables, but it is still possible to extract interesting information by analysing the bivariate scatterplot for every pair of variables. The scatterplots can be arranged in a scatterplot matrix, in which each element is the scatterplot of the two variables indicated by its row and column. Figure 3.8 is an example of a scatterplot matrix for real data on the weekly returns of an investment fund made up of international shares and a series of worldwide financial indexes. The period of observation for all the variables starts on 4 October 1994 and ends on 4 October 1999, for a total of 262 weekly observations. Notice that the variable REND shows an increasing relationship with all the financial indexes and, in particular, with the EURO, WORLD and NORDAM indexes. The squares containing the variable names also report the minimum and maximum values observed for that variable.

Figure 3.7 Example of a scatterplot diagram (ROE plotted against ROI).

Figure 3.8 Example of a scatterplot matrix.
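A scatterplot matrix like Figure 3.8 can be produced with standard data-analysis tools; the sketch below uses the scatter_matrix function from pandas. The column names mirror the figure, but the data are randomly generated stand-ins, not the original fund data.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

# Simulated stand-in for the weekly returns data (not the original fund data).
rng = np.random.default_rng(0)
n = 262
world = rng.normal(0, 1, n)
df = pd.DataFrame({
    "REND":   0.8 * world + rng.normal(0, 0.4, n),  # fund return, correlated with the indexes
    "EURO":   0.6 * world + rng.normal(0, 0.6, n),
    "WORLD":  world,
    "NORDAM": 0.9 * world + rng.normal(0, 0.3, n),
})

# One scatterplot for every pair of variables, as in Figure 3.8.
scatter_matrix(df, figsize=(8, 8), diagonal="hist")
plt.show()
```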

It is useful to develop bivariate statistical indexes that further summarise the frequency distribution, improving the interpretation of the data, even though we may lose some information about the distribution. In the bivariate case, and more generally in the multivariate case, these indexes permit us not only to summarise the distribution of each variable, but also to learn about the relationship between the variables (corresponding to the columns of the data matrix). The rest of this section focuses on quantitative variables, for which summary indexes are more easily formulated, typically by working directly with the data matrix. Section 3.4 explains how to develop summary indexes that describe the relationship between qualitative variables.

Concordance is the tendency to observe high (low) values of one variable together with high (low) values of the other. Discordance is the tendency to observe low (high) values of one variable together with high (low) values of the other. The most common summary measure of concordance is the covariance, defined as

$$\mathrm{Cov}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} \left[ x_i - \mu(X) \right] \left[ y_i - \mu(Y) \right]$$

where µ(X) is the mean of variable X and µ(Y) is the mean of variable Y. The covariance takes positive values if the variables are concordant and negative values if they are discordant. With reference to the scatterplot representation, setting the point (µ(X), µ(Y)) as the origin, Cov(X, Y) tends to be positive when most of the observations fall in the upper right-hand and lower left-hand quadrants, and it tends to be negative when most of the observations fall in the lower right-hand and upper left-hand quadrants.
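As a direct transcription of this definition, the covariance can be computed by averaging the products of the deviations from the two means; a minimal sketch in Python (the data are illustrative):

```python
import numpy as np

def covariance(x, y):
    """Cov(X, Y) = (1/N) * sum of [x_i - mu(X)] * [y_i - mu(Y)]."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.mean((x - x.mean()) * (y - y.mean()))

# Concordant example: high values of x tend to occur with high values of y.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1]
print(covariance(x, y))  # positive, since the variables are concordant
```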

Notice that the covariance is directly calculable from the data matrix. In fact, since there is a covariance for each pair of variables, this calculation gives rise to a new matrix, called the variance–covariance matrix. In this matrix the rows and columns correspond to the available variables, the main diagonal contains the variances, and the cells outside the main diagonal contain the covariances between each pair of variables. Since Cov(Xj, Xi) = Cov(Xi, Xj), the matrix is symmetric (Table 3.4).

Table 3.4 The variance–covariance matrix.

        X1            ...   Xj            ...   Xh
X1      Var(X1)       ...   Cov(X1, Xj)   ...   Cov(X1, Xh)
...     ...           ...   ...           ...   ...
Xj      Cov(Xj, X1)   ...   Var(Xj)       ...   Cov(Xj, Xh)
...     ...           ...   ...           ...   ...
Xh      Cov(Xh, X1)   ...   Cov(Xh, Xj)   ...   Var(Xh)
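The whole variance–covariance matrix can be obtained in one step from the data matrix, for example with numpy's cov function; note that bias=True matches the 1/N definition used above (numpy's default divides by N − 1). The data matrix here is simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))   # data matrix: 100 observations, 3 variables

# rowvar=False: each column is a variable; bias=True: divide by N, as in the text.
S = np.cov(X, rowvar=False, bias=True)

print(S.shape)               # (3, 3): one row and one column per variable
print(np.diag(S))            # the main diagonal holds the variances
print(np.allclose(S, S.T))   # True: Cov(Xj, Xi) = Cov(Xi, Xj), so S is symmetric
```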

The covariance is an absolute index: it can identify the presence of a relationship between two quantities, but it says little about the degree of that relationship. In other words, to use the covariance as an exploratory index, it needs to be normalised, making it a relative index. The maximum value that Cov(X, Y) can assume is σ(X)σ(Y), the product of the two standard deviations of the variables; the minimum value it can assume is −σ(X)σ(Y). Furthermore, Cov(X, Y) assumes its maximum value when the observed data points lie on a line with positive slope, and its minimum value when they lie on a line with negative slope. In light of this, we define the (linear) correlation coefficient between two variables X and Y as

$$r(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma(X)\,\sigma(Y)}$$

The correlation coefficient r(X, Y) has the following properties:

• r(X, Y) takes the value 1 when all the points corresponding to the joint observations are positioned on a line with positive slope, and it takes the value −1 when all the points are positioned on a line with negative slope. That is why r is known as the linear correlation coefficient.

• When r(X, Y) = 0 the two variables are not linked by any type of linear relationship; that is, X and Y are uncorrelated.

• In general, −1 ≤ r(X, Y) ≤ 1.

As for the covariance, it is possible to calculate all pairwise correlations directly from the data matrix, thus obtaining a correlation matrix; its structure is shown in Table 3.5. For the variables plotted in Figure 3.8, the correlation matrix is as in Table 3.6. Table 3.6 takes the 'visual' conclusions of Figure 3.8 and makes them stronger and more precise. In fact, the variable REND is strongly positively correlated with EURO, WORLD and NORDAM and, more generally, many of the variables exhibit strong correlations.

Table 3.5 The correlation matrix.

        X1            ...   Xj            ...   Xh
X1      1             ...   Cor(X1, Xj)   ...   Cor(X1, Xh)
...     ...           ...   ...           ...   ...
Xj      Cor(Xj, X1)   ...   1             ...   Cor(Xj, Xh)
...     ...           ...   ...           ...   ...
Xh      Cor(Xh, X1)   ...   Cor(Xh, Xj)   ...   1

Table 3.6 Example of a correlation matrix.

Interpreting the magnitude of the linear correlation coefficient is not particularly easy. It is not clear how to distinguish the 'high' values from the 'low' values of the coefficient, in absolute terms, so that we can separate the important correlations from the irrelevant ones. Section 5.3 considers a model-based solution to this problem when examining statistical hypothesis testing in the context of the normal linear model; to do that, however, we need to assume that the pair of variables follows a bivariate Gaussian distribution.
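In practice the correlation matrix is computed directly from the data matrix, for example with numpy's corrcoef function (the data matrix below is simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))      # data matrix: 100 observations, 3 variables

R = np.corrcoef(X, rowvar=False)   # correlation matrix, structured as in Table 3.5

print(np.diag(R))                  # all 1: each variable is perfectly correlated with itself
print(R.min(), R.max())            # every entry lies in [-1, 1]
```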

From an exploratory viewpoint, it would be convenient to have a threshold rule to inform us when there is substantial information in the data to reject the hypothesis that the correlation coefficient is zero. Assuming the observed sample comes from a bivariate normal distribution (Section 5.3), we can use a rule of the following type: Reject the hypothesis that the correlation coefficient is null when

$$\frac{r(X, Y)}{\sqrt{1 - r^2(X, Y)}} \sqrt{n - 2} > t_{\alpha/2}$$

where tα/2 is the (1 − α/2) percentile of a Student's t distribution with n − 2 degrees of freedom, corresponding to the number of observations minus 2 (Section 5.1). For example, for a large sample and a significance level of α = 5% (which sets the probability of rejecting the hypothesis of zero correlation when it is in fact true), the threshold is t0.025 = 1.96. The inequality asserts that we should judge the correlation between two variables 'significantly' different from zero when the left-hand side is greater than tα/2. Applying this rule to Table 3.6, with tα/2 = 1.96, it turns out that all the observed correlations are significantly different from zero.
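This rejection rule is straightforward to apply in code; a minimal sketch using scipy for the t percentile (the values of r and n in the example are hypothetical):

```python
import numpy as np
from scipy import stats

def correlation_is_significant(r, n, alpha=0.05):
    """Test whether a correlation r is significantly different from zero.

    Rejects the hypothesis of zero correlation when
    |r| / sqrt(1 - r^2) * sqrt(n - 2) exceeds the t percentile with
    n - 2 degrees of freedom, assuming the observed sample comes from
    a bivariate normal distribution.
    """
    t_stat = abs(r) / np.sqrt(1.0 - r**2) * np.sqrt(n - 2)
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, df=n - 2)
    return t_stat > t_crit

# Hypothetical example: r = 0.15 observed on n = 262 weekly returns.
print(correlation_is_significant(0.15, 262))  # True: significant at the 5% level
```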