3.5 Reduction of dimensionality

3.5.2 Application of the principal components

We now apply the method of principal components to the data in Figure 3.8.

More precisely, the objective of the analysis is to determine a compound index, a function of the five available financial indexes EURO, NORDAM, JAPAN, PACIFIC and COMIT, that could possibly replace the aggregate WORLD index as a predictor of the considered investment fund return. The starting data matrix therefore contains 262 rows and 5 columns.

In Section 3.2 we looked at the scatterplot matrix and the correlation matrix for this data; both are worth examining before applying the method of principal components. In fact, the methodology is typically effective only in the presence of a certain degree of correlation (collinearity) between the variables; otherwise the principal components will simply reproduce the original variables. Indeed, in the limiting case where the original variables are mutually uncorrelated, the principal components coincide with them. In this example there is high collinearity between the variables, and this justifies using the method.

Having determined that the method is appropriate, we need to choose the number of components. Table 3.11 shows part of the output from SAS Proc Princomp applied to the available data matrix.

Table 3.11 Absolute importance of the principal components.

Table 3.12 Relative importance of the principal components.

The contribution of the first principal component to the overall variance is around 59%, the contribution of the first two components together is around 75%, and so on. There is a marked drop in the marginal contribution when passing from the first component to the second. It therefore seems reasonable to choose only one principal component, even though this leads to a loss of around 40% of the overall variability.

This decision is further reinforced by the objective of the analysis: to obtain one composite index of the financial markets, to be used as a financial benchmark.
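As a rough check, the variance decomposition behind Table 3.11 can be reproduced with standard tools. The following Python sketch uses scikit-learn; since the original returns are not reproduced here, the data matrix is simulated as a stand-in, so the printed proportions will not match the 59% and 75% quoted above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the 262 x 5 matrix of returns on the five indexes
# (EURO, NORDAM, JAPAN, PACIFIC, COMIT); simulated for illustration only.
rng = np.random.default_rng(0)
X = rng.standard_normal((262, 5))

pca = PCA().fit(X)

# Proportion of total variance explained by each component, and the
# cumulative proportion used to decide how many components to retain.
print(pca.explained_variance_ratio_)
print(np.cumsum(pca.explained_variance_ratio_))
```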

To interpret the chosen component, we look at the relative importance of the components (Table 3.12).

The table reports the weight coefficients (loadings) of each of the five principal components that can be extracted from the data matrix, corresponding to the eigenvectors of the variance–covariance matrix. It also presents the correlation coefficient of each component with the original variables, which represents the degree of relative importance of each component. It turns out that the first principal component is linked to all the indexes, and particularly to the EURO index.
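These two quantities can be computed directly from the eigendecomposition of the variance–covariance matrix. A minimal sketch, again on a simulated stand-in for the data matrix: the loadings are the eigenvectors, and the correlation between component j and variable i is the loading multiplied by the square root of the j-th eigenvalue and divided by the standard deviation of variable i.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((262, 5))   # stand-in for the five index returns

S = np.cov(X, rowvar=False)          # variance-covariance matrix
eigval, eigvec = np.linalg.eigh(S)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

# Columns of eigvec are the loadings (weight coefficients) of Table 3.12.
loadings = eigvec

# corr[i, j]: correlation between component j and original variable i,
# equal to loading_ij * sqrt(lambda_j) / sd_i.
sd = np.sqrt(np.diag(S))
corr = loadings * np.sqrt(eigval) / sd[:, None]
print(corr)
```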

3.6 Further reading

Exploratory data analysis has developed as an autonomous field of statistics, in parallel with the development of computing resources. It is possible to date the initial developments in the field to the publication of the texts by Benzecri (1973) and Tukey (1977).

Univariate exploratory analysis is often fundamental to understanding what might be discovered during a data mining analysis. It often reveals problems with data quality, such as missing items and anomalous values. But most real problems are multivariate. Given the difficulty of visualising multidimensional graphical representations, many analyses concentrate on bivariate exploratory analysis, and on how the relationships found in a bivariate analysis can change when the analysis is conditioned on the other variables. We looked at how to calculate the partial correlation for quantitative variables. Similar calculations can be performed on qualitative variables, for example, comparing the marginal odds ratios with those calculated conditionally on the levels of the remaining variables.
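For the quantitative case, the first-order partial correlation can be computed from the three pairwise correlations. A minimal sketch (the variable names are illustrative):

```python
import numpy as np

def partial_corr(x, y, z):
    """First-order partial correlation r(x, y | z)."""
    rxy = np.corrcoef(x, y)[0, 1]
    rxz = np.corrcoef(x, z)[0, 1]
    ryz = np.corrcoef(y, z)[0, 1]
    return (rxy - rxz * ryz) / np.sqrt((1 - rxz**2) * (1 - ryz**2))

# Example: x and y are driven by a common factor z, so their strong
# marginal correlation largely disappears once z is conditioned on.
rng = np.random.default_rng(0)
z = rng.standard_normal(500)
x = z + 0.5 * rng.standard_normal(500)
y = z + 0.5 * rng.standard_normal(500)
print(np.corrcoef(x, y)[0, 1])   # high marginal correlation
print(partial_corr(x, y, z))     # near zero after conditioning
```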

This leads to a phenomenon known as Simpson's paradox (e.g. Agresti, 1990), whereby an observed marginal association can completely change direction when the odds ratio is conditioned on the levels of additional variables.
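A small numeric illustration, using the kidney-stone treatment counts often quoted in this context; the specific numbers serve only to exhibit the reversal.

```python
import numpy as np

# Rows are treatments A and B; columns are (success, failure) counts,
# stratified by stone size (the conditioning variable).
small_stones = np.array([[81, 6], [234, 36]])
large_stones = np.array([[192, 71], [55, 25]])

def odds_ratio(t):
    return (t[0, 0] * t[1, 1]) / (t[0, 1] * t[1, 0])

# Conditional odds ratios both favour treatment A (> 1) ...
print(odds_ratio(small_stones))                 # about 2.08
print(odds_ratio(large_stones))                 # about 1.23
# ... yet the marginal odds ratio favours treatment B (< 1).
print(odds_ratio(small_stones + large_stones))  # about 0.75
```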

We focused on some important matrix representations that allow simpler notation and easier analysis when implemented using a computer program. Searle (1982) covers matrix calculations in statistics. Multidimensional exploratory data analysis is a developing field of statistics, incorporating developments in computer science. Substantial advances may well come from this research in the near future. For a review of some of these developments, particularly multidimensional graphics, consult Hand, Mannila and Smyth (2001).

We introduced multidimensional analysis of qualitative data, trying to systematise the argument from an applied viewpoint. This too is a developing field, and the existence of so many indexes suggests that the arguments have yet to be consolidated. We put the available indexes into three principal classes: distance measures, dependence measures and model-based indexes. Distance measures are applicable to contingency tables of any dimension and number of levels, but the results they produce are only moderately informative. Dependence measures give precise information on the type of dependence among the variables being examined, but they are hardly applicable to contingency tables of dimension greater than 2. Model-based indexes are a possible compromise: they are sufficiently broad and they offer a good amount of information. An extra advantage is that they relate to the most important statistical models for analysing qualitative data: logistic and log-linear regression models (Chapter 5). For an introduction to descriptive analysis of qualitative data, consult Agresti (1990).

An alternative approach to multidimensional visualisation of the data is reduction to spaces of lower dimension. The loss of information and the difficulty of interpreting the reduced information may be compensated by greater usability of the results. The classical technique is principal component analysis. We looked at how it works, but for greater detail on the formal aspects consult Mardia, Kent and Bibby (1979). The method of principal components is used for more than exploratory data analysis; it underpins an important modelling technique known as (confirmatory) factor analysis, widely adopted in the social sciences. Assuming a probabilistic model, usually Gaussian, it decomposes the variance–covariance matrix into two parts: one part is common to all the variables and corresponds to the presence of underlying latent variables (variables that are unobserved or not measurable), and the other part is specific to each variable. In this framework, the chosen principal components identify the latent variables and are interpreted accordingly. Rotation of the components (latent factors) is a way of modifying the weight coefficients to improve their interpretability. For further details on factor analysis consult Bollen (1989).
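A sketch of this decomposition, using scikit-learn's FactorAnalysis on a simulated stand-in for the data: the fitted model expresses the covariance matrix as a common part built from the loadings plus a diagonal part of variable-specific variances.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
X = rng.standard_normal((262, 5))    # stand-in data matrix

fa = FactorAnalysis(n_components=1).fit(X)

# Estimated decomposition of the covariance matrix: a part common to
# all variables (loadings L, giving L L') plus a diagonal part Psi
# specific to each variable.
L = fa.components_.T                  # 5 x 1 loading matrix
common = L @ L.T
specific = np.diag(fa.noise_variance_)
print(common + specific)              # model-implied covariance
print(fa.get_covariance())            # same matrix, via scikit-learn
```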

Principal component analysis is probably the simplest way to accomplish data reduction, as it is based on linear transformations. Essentially, the obtained scores transform the original data into linear projections on the reduced space, minimising the Euclidean distance between the coordinates in the original space and the transformed data. Other types of transformation include wavelet methods, based on Fourier transforms, as well as the methods of projection pursuit, which look for the best directions of projection on a reduced space. Both techniques are covered in Hand, Mannila and Smyth (2001) and Hastie, Tibshirani and Friedman (2001).
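The projection view can be made concrete. Keeping one component, the scores are the coordinates in the reduced space, and mapping them back gives the linear reconstruction whose average squared Euclidean distance from the original points PCA minimises. A sketch, again on simulated data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((262, 5))     # stand-in data matrix

pca = PCA(n_components=1).fit(X)
scores = pca.transform(X)              # coordinates in the reduced space
X_hat = pca.inverse_transform(scores)  # linear reconstruction in the original space

# Average squared Euclidean distance between original and reconstructed
# points: the quantity PCA minimises among all one-dimensional
# linear projections.
print(np.mean(np.sum((X - X_hat) ** 2, axis=1)))
```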

There are also methodologies for reducing the dimensionality of qualitative data. For every row of a contingency table with two dimensions, correspondence analysis produces a graphical row profile, corresponding to the conditional frequency distribution of the row. It produces a similar profile for every column. Dimensionality reduction is then performed by projecting these profiles into a space of lower dimension that reproduces as much as possible of the original dispersion, which is related to the X² statistic. Correspondence analysis can also be applied to contingency tables of arbitrary dimension (represented using the Burt matrix).
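A minimal sketch of the ingredients, on a toy table with hypothetical counts: the row and column profiles, and the total inertia (the dispersion the projection tries to preserve), which equals the X² statistic divided by the sample size.

```python
import numpy as np

# Toy two-way contingency table (hypothetical counts).
N = np.array([[20.0, 30.0, 10.0],
              [10.0, 40.0, 50.0]])
n = N.sum()

# Row and column profiles: conditional frequency distributions.
row_profiles = N / N.sum(axis=1, keepdims=True)
col_profiles = N / N.sum(axis=0, keepdims=True)

# Total inertia: the X^2 statistic divided by the total count.
expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / n
chi2 = ((N - expected) ** 2 / expected).sum()
print(chi2 / n)
```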

Greenacre (1983) provides an introduction to correspondence analysis.
