
4.3 Principal Component Analysis

Because it is difficult to visualize multidimensional space, principal component analysis (PCA), a popular multivariate technique, is primarily used to reduce the dimensionality of p multiple attributes to two or three dimensions. PCA summarizes the variation in a set of correlated attributes into a set of uncorrelated components, each of which is a particular linear combination of the original variables. The extracted uncorrelated components are called principal components (PCs) and are estimated from the eigenvectors of the covariance or correlation matrix of the original variables. The objective of PCA is therefore to achieve parsimony and reduce dimensionality by extracting the smallest number of components that account for most of the variation in the original multivariate data and to summarize the data with little loss of information.
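
As a rough illustration of this extraction, the following NumPy sketch computes PCs from the eigenvectors of a correlation matrix. The simulated data, dimensions, and variable names are hypothetical and serve only to show the mechanics.

```python
# A minimal sketch of PCA via eigendecomposition of the correlation matrix.
# The data, dimensions, and variable names here are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))         # 100 observations, p = 5 attributes
X[:, 1] = X[:, 0] + 0.1 * X[:, 1]     # make two attributes strongly correlated

R = np.corrcoef(X, rowvar=False)      # p x p correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)  # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]     # reorder PCs by decreasing importance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals)                        # variation explained by each PC
```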

In PCA, uncorrelated PCs are extracted by linear transformations of the original variables so that the first few PCs contain most of the variation in the original dataset. These PCs are extracted in decreasing order of importance so that the first PC accounts for as much of the variation as possible and each successive component accounts for a little less. Following PCA, the analyst tries to interpret the first few principal components in terms of the original variables and thereby gain a greater understanding of the data. To reproduce the total system variability of the original p variables, we need all p PCs. However, if the first few PCs account for a large proportion of the variability (80–90%), we have achieved our objective of dimension reduction. Because the first principal component accounts for the covariation shared by all attributes, it may be a better estimate than simple or weighted averages of the original variables. Thus, PCA can be useful when a very high degree of correlation is present in the multiple attributes.
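
Continuing the hypothetical sketch above, the proportion of variability explained by the leading PCs can be checked against a target such as 80%; the threshold used below is illustrative only.

```python
# Continuing the sketch above: proportion and cumulative proportion of the
# variability explained by each PC; the 80% target is only an example.
prop = eigvals / eigvals.sum()
cum = np.cumsum(prop)
k = int(np.searchsorted(cum, 0.80)) + 1   # smallest number of PCs explaining >= 80%
print(prop, cum, k)
```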

In PCA, the extraction of PCs can be made from either the original multivariate dataset or from the covariance or correlation matrix, if the original dataset is not available. To derive PCs, the correlation matrix is commonly used when the variables in the dataset are measured in different units (e.g., annual income, educational level, number of cars owned per family) or when the variables have different variances. Using the correlation matrix is equivalent to standardizing the variables to zero mean and unit standard deviation. The statistical theory, methods, and computational aspects of PCA are presented in detail elsewhere.3
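
A quick numerical check of this equivalence, continuing the hypothetical data from the sketch above:

```python
# Continuing the sketch above: the covariance matrix of the standardized
# variables equals the correlation matrix of the raw variables.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)        # zero mean, unit std deviation
S_z = np.cov(Z, rowvar=False)                           # covariance of standardized data
print(np.allclose(S_z, np.corrcoef(X, rowvar=False)))   # True (up to rounding)
```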

4.3.1 PCA Terminology

Eigenvalues. Eigenvalues measure the amount of variation explained by each PC; they are largest for the first PC and progressively smaller for subsequent PCs. For standardized data, an eigenvalue greater than 1 indicates that a PC accounts for more variance than any single original variable, and this is commonly used as a cutoff point for deciding which PCs are retained.

Eigenvectors. Eigenvectors provide the weights used to compute the uncorrelated PCs, which are linear combinations of the centered and standardized (or centered but unstandardized) original variables.

PC scores. PC scores are the derived composite scores computed for each observation based on the eigenvectors for each PC. The means of the PC scores are equal to zero, as these are linear combinations of the centered variables. These uncorrelated PC scores can be used in subsequent analyses to check for multivariate normality,4 to detect multivariate outliers,4 or as a remedial measure in regression analysis with severe multicollinearity.5
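
A sketch of computing and checking PC scores, continuing the hypothetical Z and eigvecs from the sketches above:

```python
# Continuing the sketch above: PC scores as linear combinations of the
# standardized variables, with the eigenvectors as weights.
scores = Z @ eigvecs                       # n x p matrix of PC scores
print(scores.mean(axis=0))                 # approximately zero: scores are centered
print(np.corrcoef(scores, rowvar=False))   # approximately identity: scores are uncorrelated
```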

Estimating the number of PCs. Several criteria are available for determining the number of PCs to be extracted, but these are just empirical guidelines rather than definite solutions. In practice, we seldom use a single criterion to decide on the number of PCs to extract. Some of the most commonly used guidelines are the Kaiser–Guttman rule, the scree and parallel analysis plots, and interpretability.6

Kaiser–Guttman rule. The Kaiser–Guttman rule states that the number of PCs to be extracted should be equal to the number of PCs with an eigenvalue greater than 1.0.
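
In code, the rule reduces to a simple count (eigvals refers to the eigenvalues from the earlier hypothetical sketch):

```python
# Kaiser-Guttman rule on the eigenvalues from the sketch above:
# retain the PCs whose eigenvalue exceeds 1.0.
n_retain = int(np.sum(eigvals > 1.0))
print(n_retain)
```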

Scree test. Plotting the eigenvalues against the corresponding PCs produces a scree plot that illustrates the rate of change in the magnitude of the eigenvalues for the PCs. The rate of decline tends to be fast at first and then levels off. The "elbow", or the point at which the curve bends, is considered to indicate the maximum number of PCs to extract. One less PC than the number at the elbow might be appropriate if an overly defined solution is a concern. However, scree plots do not always give a clear indication of the number of PCs.
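
A scree plot can be produced directly from the eigenvalues of the earlier hypothetical sketch; matplotlib is assumed to be available.

```python
# Scree plot of the eigenvalues from the sketch above; matplotlib is assumed.
import matplotlib.pyplot as plt

pcs = np.arange(1, len(eigvals) + 1)
plt.plot(pcs, eigvals, "o-")
plt.xlabel("Principal component")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()
```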

Parallel analysis. To aid in determining the number of PCs to be extracted from standardized data, another graphical method known as parallel analysis is suggested to enhance the interpretation of the scree plot (see Sharma7 for a description of the computational details of performing parallel analysis). In parallel analysis, eigenvalues are extracted by repeated sampling from a totally independent multivariate dataset with exactly the same dimensions as the data of interest. Because the variables are not correlated in this simulated dataset, all the extracted eigenvalues should have a value equal to 1. However, due to sampling error, the first half of the PCs will have eigenvalues greater than 1, and the second half of the PCs will have eigenvalues less than 1. The average eigenvalue for each PC computed from the repeated samples is overlaid on the scree plot of the actual data. The optimum number of PCs is selected at the cutoff point where the scree plot and the parallel analysis curve intersect. An example of a scree/parallel analysis plot is presented in Figure 4.3.
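
A rough sketch of parallel analysis under these assumptions, continuing the hypothetical data above; the repetition count is illustrative, and the averaging of simulated eigenvalues follows the description in the text (some implementations use a high percentile instead).

```python
# Rough sketch of parallel analysis for the data in the sketch above:
# eigenvalues are extracted from repeatedly simulated uncorrelated data of the
# same dimensions, and their averages are compared with the observed eigenvalues.
n_obs, n_var = X.shape
n_reps = 100                                        # number of repetitions (illustrative)
sim_eigs = np.zeros((n_reps, n_var))
for i in range(n_reps):
    Xsim = rng.normal(size=(n_obs, n_var))          # uncorrelated simulated variables
    R_sim = np.corrcoef(Xsim, rowvar=False)
    sim_eigs[i] = np.sort(np.linalg.eigvalsh(R_sim))[::-1]
mean_sim = sim_eigs.mean(axis=0)                    # average eigenvalue per PC

# Retain PCs whose observed eigenvalue exceeds the simulated average
# (in practice, up to the first point where the two curves cross).
n_retain = int(np.sum(eigvals > mean_sim))
print(mean_sim, n_retain)
```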

Interpretability. Another very important criterion for determining the number of PCs is the interpretability of the PCs extracted. The number of PCs extracted should be evaluated not only according to empirical criteria, but also according to the criterion of meaningful interpretation.

PC loadings. PC loadings are correlation coefficients between the PC scores and the original variables. They measure the importance of each variable in accounting for the variability in the PCs. It is often possible to interpret the first few PCs in terms of an overall effect or a contrast between groups of variables based on the structure of the PC loadings. A high correlation between the first principal component (PC1) and a variable indicates that the variable is associated with the direction of the maximum amount of variation in the dataset. More than one variable might have a high correlation with PC1. A strong correlation between a variable and the second principal component (PC2) indicates that the variable is responsible for the next largest variation in the data, perpendicular to PC1, and so on. Conversely, if a variable does not correlate with any PC axis, or correlates only with the last PC or the one before the last PC, this usually suggests that the variable has little or no contribution to the variation in the dataset. Therefore, PCA may often indicate which variables in a dataset are important and which ones may be of little consequence. Some of these low-performance variables might therefore be removed from consideration in order to simplify the overall analyses.
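
As a closing sketch, loadings can be computed as correlations between each original variable and each PC score; for correlation-based PCA this is equivalent to scaling the eigenvectors by the square roots of their eigenvalues. All names continue the hypothetical example above.

```python
# Continuing the sketch above: loadings as correlations between each original
# variable and each PC score; for correlation-based PCA this equals the
# eigenvectors scaled by the square roots of their eigenvalues.
p = X.shape[1]
loadings = np.array(
    [[np.corrcoef(X[:, j], scores[:, k])[0, 1] for k in range(p)] for j in range(p)]
)
print(np.round(loadings, 2))
print(np.allclose(loadings, eigvecs * np.sqrt(eigvals)))   # True (up to rounding)
```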
