
Feature Extraction and Enhancement (Step 3)

5.3 Principal Component Analysis

The dimension of the data set being processed dramatically affects the computational complexity of pattern processing algorithms (e.g., classifiers and detectors). For many algorithms, there is a practical upper bound on the number of dimensions that can be handled in a reasonable amount of time and space. For this reason, techniques that reduce the dimension of a data set without damaging its information content can be very useful.

The phrase "without damaging its information content" is the key. If all we want to do is represent a data set compactly, then the information content is preserved as long as the variation in the data is captured. This is exactly what PCA does, and in this sense PCA is an ideal method for representing data compactly.

However, when the goal is to build a classifier, the information content usually has little to do with how variation in the data is encoded. Rather, we seek a representational scheme that preserves the class identities given by the ground truth. This is not a question of variation, but discrimination. For difficult problems, the variations that define the differences between two target classes are often quite small. These can be exactly the variations that are discarded during PCA dimension reduction as described above.

Class Agnosticism (an important limitation of PCA as a feature extraction method)

PCA is a class agnostic winnowing method: it does not consider the ground truth assigned to feature vectors in selecting its representation. We now describe a critical limitation that arises from this fact when using PCA during feature extraction.

PCA is a method for efficiently accounting for the variation in a data set. The underlying mathematics provides a mechanism for detecting, characterizing, and quantifying variation in the given data. This information is then used to formulate a new representation of the data that concentrates as much of the variation as possible in as few terms as possible. The vectors that define this new representation are called principal components. They are completely determined by the data.

It is customary to order the principal components of a data set in decreasing order of the amount of variation they explain. For many data sets, the first few principal components account for a large percentage of the total variation in the data set. When this occurs, dimension reduction can be achieved by simply discarding the components from some point on, since these do not account for a significant amount of the variation. The number of terms to retain can sometimes be determined by reference to factors (computed during the PCA process) that quantify each component's explanatory power. More frequently, some experimentation is required.
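
As a concrete illustration of this selection process (our own sketch, not from the text), the snippet below uses the scikit-learn library; the synthetic data, the 95% retention threshold, and all variable names are assumptions chosen for the example.

```python
# A minimal sketch of retaining components by cumulative explained variance.
# The data and the 95% threshold are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 3))                          # 3 hidden factors
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(1000, 10))

pca = PCA().fit(X)                                           # all 10 components
cumulative = np.cumsum(pca.explained_variance_ratio_)        # explanatory power

# Keep the smallest number of components that reaches the chosen threshold.
k = int(np.searchsorted(cumulative, 0.95) + 1)
X_reduced = PCA(n_components=k).fit_transform(X)
print(k, X_reduced.shape)                                    # k is 3 here
```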

In summary, PCA is designed to pack as much of the variation of a data set into as few dimensions as possible. It does this by systematically isolating and discarding insignificant forms of variation. This is done agnostically: ground truth assignments are not consulted, so the dimension reduction is done with complete disregard for how class distinctions might be affected. In the following intuitive example, we show what PCA can do when the distinctions between classes are small relative to the dynamic range of the data.

Consider a set of 1,000 three-dimensional data points, constituting two ground truth classes (Figure 5.1).

Figure 5.1 is a scatter plot showing two thoroughly mixed and overlapping classes.

We can’t tell which class each point is in, and it looks like a single cluster. The viewer is seeing the data plotted in the first two coordinates x (horizontal axis) and y (vertical axis). The third coordinate, z, is coming out of the page toward the viewer. From this perspective, nothing can be known about the third dimension. This view clearly shows that the two visible coordinates x and y by themselves do not provide a good basis for distinguishing between the two classes: class 1, and class 2.

Considering only x and y, the principal components of this data set are depicted by the lines in Figure 5.2. The first principal component points to "2 o'clock" and runs along the line that accounts for most of the variation in the data. The second principal component runs along the line that accounts for most of the variation that remains after the contribution of the first principal component has been removed; in this example, it is the line that points to "10 o'clock" (Figure 5.2).
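
To make the geometry concrete, the following sketch (our own illustration, not the book's) computes the principal directions of a two-dimensional cloud as the eigenvectors of its sample covariance matrix; the covariance values are assumptions chosen to produce an obliquely elongated cloud.

```python
# Principal directions as eigenvectors of the sample covariance matrix,
# ordered by the amount of variation they explain. Data are assumed.
import numpy as np

rng = np.random.default_rng(1)
# Cloud elongated along an oblique direction (roughly the "2 o'clock" line).
X = rng.multivariate_normal([0, 0], [[3.0, 1.7], [1.7, 2.0]], size=1000)

Xc = X - X.mean(axis=0)                    # center the data
cov = np.cov(Xc, rowvar=False)             # 2x2 sample covariance
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh handles symmetric matrices

order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("variation explained:", eigvals / eigvals.sum())
print("first principal component:", eigvecs[:, 0])
print("second principal component:", eigvecs[:, 1])
```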

Figure 5.1 Three-dimensional scatter plot of two classes.

For data having many dimensions, this process is repeated until all of the variation in the data has been accounted for. The procedure is deterministic, and guaranteed to terminate after at most N steps, where N is the dimension of the feature space. The principal components will be mutually orthogonal (perpendicular to each other).

Once the directions of maximum variation (the principal components) have been determined, the data are rescaled along each component (Figure 5.3).

Figure 5.2 Direction of principal data set components.

Figure 5.3 Rescaled data aligned with new axes.

It is customary to think of the data being expressed in terms of the principal components, so that the variation in the data aligns with the new axes.
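
A hedged sketch of this change of basis (again our own illustration in NumPy, with assumed data): the centered data are projected onto the principal components and then rescaled along each one, so that the variation aligns with, and is equalized across, the new axes.

```python
# Expressing data in the principal component basis, then rescaling each axis.
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0], [[3.0, 1.7], [1.7, 2.0]], size=1000)

Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                     # coordinates in the new basis
whitened = scores / np.sqrt(eigvals)      # rescaled along each component

# The covariance is now (approximately) diagonal / the identity, i.e., the
# variation in the data aligns with the new axes, as in Figure 5.3.
print(np.round(np.cov(scores, rowvar=False), 3))
print(np.round(np.cov(whitened, rowvar=False), 3))
```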

Return now to considering the original data before PCA is applied. If we move around to look at the data from a perspective perpendicular to the z-axis, the problem of distinguishing between the mixed classes does not appear quite so hopeless. Close examination shows there is still quite a bit of overlap between the two groups in this view, but some of the data can be distinguished (Figure 5.4). This tells us that the third feature, z, carries some information that was not present in x and y by themselves.

Here we see the data plotted as y vs. z; y is on the vertical axis, and z is on the horizontal axis. From this perspective, it is clear that the problem of separating the two classes can be completely solved by carefully drawing a wavy boundary between the two sets. We have found a feature set (y and z) that will solve the problem. Because of the wave in the data, a nonlinear boundary will be required.

Having seen the data from all three perspectives, we now realize that it consists of two sheets that are elliptical in x and y, and that are slightly separated, interlocked sinusoids in y and z (sort of like two ruffled potato chips locked together; Figure 5.4).

Variation in the z coordinate is small relative to the ranges of x and y. As a result, PCA will treat the elimination of z's contribution as an opportunity to make the representation of the data more efficient.

Applying PCA, we obtain a new representation of the data. In keeping with our goal of using PCA for dimension reduction, we keep the first two principal components and discard the third. The resulting data set is plotted below. The first principal component is horizontal, the second, vertical (Figure 5.5).

Figure 5.4 Nonlinear representation of two classes.

After PCA, the data still appear as a single aggregation. Nothing has been gained, but a price has been paid: much of the discriminating power in z has been removed, because it accounted for only a small portion of the overall variation in the data. As we saw above, it is this variation in z that is needed to fully solve the problem. After PCA, we are left with a data set that still has the two classes confounded, and there is no longer a third dimension available to resolve the conflict. Because the scale of variation depends on the units of measurement used, it would be possible for PCA to ruin a data set just because one of the measurements was expressed in minutes rather than seconds!
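
The following sketch is our own reconstruction of this scenario (the data-generating formulas are assumptions, not the book's): two classes that differ only through a small, wavy offset in z are nearly perfectly separable in the (y, z) view, but become indistinguishable after reduction to the first two principal components.

```python
# Two classes separated only by a small, wavy offset in z (assumed data).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n = 500
x = rng.uniform(-10, 10, size=2 * n)               # wide range
y = rng.uniform(-5, 5, size=2 * n)                 # moderate range
labels = np.repeat([0, 1], n)
z = 0.1 * np.sin(y) + np.where(labels == 0, 0.05, -0.05)   # tiny range
X = np.column_stack([x, y, z])

# Using y and z directly, a simple classifier separates the classes well...
print(cross_val_score(KNeighborsClassifier(5), X[:, 1:], labels, cv=5).mean())

# ...but after keeping only the first two principal components, the small
# variation in z is gone and accuracy drops to roughly chance.
X2 = PCA(n_components=2).fit_transform(X)
print(cross_val_score(KNeighborsClassifier(5), X2, labels, cv=5).mean())
```

Rescaling z, for example by expressing it in different units, changes how much of the total variation it carries and therefore whether PCA keeps or discards it, which is exactly the unit sensitivity noted above.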

Nonlinear PCA for Feature Extraction

Conventional Principal Component Analysis (i.e., Karhunen-Loève PCA) is a linear technique: it develops an efficient basis for representing data using linear regression, and then applies an affine transformation to encode the data. Linear and affine mappings are unable to model many complex relationships encountered in real-world data.

PCA leaves this complexity intact, passing all the work of discrimination on to subsequent processes. As seen above, it may destroy information, guaranteeing the failure of subsequent techniques.

The use of nonlinear regression for representation of feature data can (in theory) convert complex problems into simple ones. In practice, dimension reduction can sometimes be realized by the application of nonlinear transforms to features, even when linear transforms fail. This amounts to the creation of a nonlinear basis for the feature space. Because these transforms can be created with ground truth assignments in mind, the new representation may yield a simpler classification problem.

Figure 5.5 Application of PCA to data set.

Independent Component Analysis (ICA) is a methodology for transforming data so that class distinctions (rather than data variation, as in PCA) are the driving consideration. In a more general sense, the whole field of feature extraction and enhancement addresses the problem of representing a problem space in such a way that classes are coherent and distinct.

As an example, we refer again to our three-dimensional data set. The two classes in this problem are distinct, but not coherent. Any linear transform applied to this data set will leave us with an abutting curved interface between the classes, that is, it will fail to reduce the complexity of the classification problem.

However, the application of a (contrived) nonlinear transform can yield the result shown in Figure 5.6. The transformed problem is trivial.

The nonlinear transform threaded a wavy boundary between the two potato chips, dragged them apart, and shaped them into nice, round discs.
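
Continuing the synthetic example above (again our own illustration, with a transform contrived from knowledge of how the data were generated), subtracting the sinusoidal sheet from z straightens the wavy boundary and leaves the classes separable by a single threshold:

```python
# A contrived nonlinear transform: measure each point's signed offset from
# the wavy interface between the two classes (assumed data as before).
import numpy as np

rng = np.random.default_rng(2)
n = 500
y = rng.uniform(-5, 5, size=2 * n)
labels = np.repeat([0, 1], n)
z = 0.1 * np.sin(y) + np.where(labels == 0, 0.05, -0.05)

u = z - 0.1 * np.sin(y)           # nonlinear feature: offset from the interface

# In the transformed space a single threshold separates the classes exactly.
pred = (u < 0).astype(int)        # class 1 lies below the interface
print("accuracy:", (pred == labels).mean())
```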

In practice, the characterization of a transform for determining nonlinear components is difficult (in fact, determining optimal components is equivalent to the classification problem itself). Sometimes a process of trial and error is successful, but sometimes the use of some kind of automated optimization scheme is required. An excellent treatment of nonlinear PCA using neural networks is "Image Compression by Back Propagation: An Example of Extensional Programming," ICS Report 8702, UCSD, February 1987.
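
As a sketch of the autoencoder idea behind that report (not the report's own code; the network sizes, the use of PyTorch, and the training settings are all assumptions), a small neural network trained to reproduce its input through a narrow bottleneck yields nonlinear components in the bottleneck layer:

```python
# Nonlinear PCA via an autoencoder (assumed architecture and settings):
# the bottleneck activations serve as two nonlinear components of
# ten-dimensional input data.
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(1000, 2) @ torch.randn(2, 10)     # data with 2 hidden factors

encoder = nn.Sequential(nn.Linear(10, 6), nn.Tanh(), nn.Linear(6, 2))
decoder = nn.Sequential(nn.Linear(2, 6), nn.Tanh(), nn.Linear(6, 10))
model = nn.Sequential(encoder, decoder)

opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for _ in range(300):                              # train to reconstruct the input
    opt.zero_grad()
    loss = loss_fn(model(X), X)
    loss.backward()
    opt.step()

codes = encoder(X).detach()                       # the 2-D nonlinear components
print(codes.shape, float(loss))
```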

It must be kept in mind that PCA of whatever sort is data driven, not problem driven. That is, if transforms are being sought to compress data, no ground truth is needed. If transforms are being sought to prepare the data for analysis or modeling, distinctions between ground truth classes are at risk.

Class discrimination aside, there is often value in using nonlinear PCA agnostically to reduce dimensionality and mitigate complexity. The result is a low-dimensional representation of the problem having a simplified phenomenology.

Figure 5.6 Application of nonlinear transform.

Conventional PCA is a useful first-cut technique for dimension reduction. It is designed for efficiency of linear representation, and does not consider a priori class assignments. As a linear, agnostic technique, it can damage the power of a data set to support class discrimination.

There are nonlinear techniques for feature selection and transformation that are designed to reduce data dimensionality and enhance its ability to discriminate between a priori classes. The selection of such a technique in a particular problem domain generally requires special tools and expertise.

5.3.1 Feature Winnowing and Dimension Reduction Checklist

Question 1: How much bulk data are we talking about (KB, MB, GB, or TB)?

Why this question is important: Data scale is a cost and complexity driver.

What the question is seeking: Is your existing infrastructure and tool set adequate for this project?

Likely responses and their meanings: This question can usually be answered pretty precisely.

Follow-up questions: If specific issues are raised in any of these areas, you should follow up with additional questions.

Special considerations: The customer probably already has some data access and conditioning tools. Ask about these.
