
3.2 Use of Principal Component Analysis

The method of principal component analysis (PCA) was originally developed in the early 1900s [84, 85], and has now re-emerged as an important technique in data analysis.

The central idea is to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. Multiple regression and discriminant analysis use variable selection procedures to reduce the dimension, but can result in the loss of one or more important dimensions. The PCA approach uses all of the original variables to obtain a smaller set of new variables (principal components, or PCs) that can be used to approximate the original variables. The greater the degree of correlation between the original variables, the smaller the number of new variables required. PCs are uncorrelated and are ordered so that the first few retain most of the variation present in the original set.

3.2.1 Basic Concepts: Mean, Variance, Covariance

It is convenient at this point to give a brief summary of the basic concepts. In the univariate case the mean and variance are used to summarise a data set. The mean of a discrete distribution is denoted by $\mu$ and is defined by

$$\mu = \sum_i x_i f(x_i) \qquad (3.1)$$


where $f(x_i)$ is the probability function of the random variable $X$ considered. The mean is also known as the mathematical expectation of $X$ and is sometimes denoted by $E(X)$.

The variance of a distribution is denoted by $\sigma^2$ and is defined by

$$\sigma^2 = \sum_i (x_i - \mu)^2 f(x_i) = E\big[(X - \mu)^2\big] \qquad (3.2)$$

It is an index reflecting the deviation of the values $x_i$ from the mean $\mu$: the larger the variance, the more widely the $x_i$ are spread about the mean; the smaller the variance, the more closely they concentrate around it.

To summarise multivariate data sets, it is necessary to find the mean and variance of each of the p variables, together with a measure of the way each pair of variables is related. For the latter, the covariance or correlation of each pair of variables is used. The sample covariance of variables $x_i$ and $x_j$ over n observations is

$$s_{ij} = \frac{1}{n-1}\sum_{k=1}^{n}(x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)$$

where $s_i^2 = s_{ii}$ is the sample variance of variable $x_i$.

The covariance of two variables $X_i$ and $X_j$ is defined by

$$\sigma_{ij} = E\big[(X_i - \mu_i)(X_j - \mu_j)\big]$$

The covariance is often difficult to interpret because it depends on the units in which the two variables are measured; consequently it is conveniently standardised by dividing by the product of the standard deviations of the two variables to give a quantity called the correlation coefficient, $\rho_{ij}$, defined by

$$\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\,\sigma_{jj}}} \qquad (3.7)$$

The correlation coefficient lies between $-1$ and $1$ and gives a measure of the linear relationship between variables $X_i$ and $X_j$.
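These quantities are straightforward to compute in practice. The following Python sketch (ours, using made-up data) evaluates the sample mean, variance, covariance and correlation coefficient defined above:

```python
import numpy as np

# Hypothetical measurements of two variables in different units.
x = np.array([2.1, 2.9, 3.2, 4.0, 4.8])
y = np.array([30.0, 35.0, 41.0, 44.0, 52.0])

mean_x = x.mean()                    # sample estimate of the mean, cf. (3.1)
var_x = x.var(ddof=1)                # sample variance, cf. (3.2)
cov_xy = np.cov(x, y, ddof=1)[0, 1]  # sample covariance of x and y
rho_xy = cov_xy / np.sqrt(var_x * y.var(ddof=1))  # correlation, cf. (3.7)

print(mean_x, var_x, cov_xy, rho_xy)  # rho_xy always lies between -1 and 1
```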

3.2.2 Principal Component Analysis

Given a data matrix X representing n observations of each of p variables, $x_1, x_2, \ldots, x_p$, the purpose of principal component analysis is to determine a new variable, $y_1$, that can be used to account for the variation in the p variables. The first principal component is given by a linear combination of the p variables as

$$y_1 = w_{11}x_1 + w_{12}x_2 + \cdots + w_{1p}x_p \qquad (3.8)$$

where the coefficients (also called weights) $w_{11}, w_{12}, \ldots, w_{1p}$, conveniently written as a vector $\mathbf{w}_1$, are chosen so that the sample variance of $y_1$ is as great as possible.

The $w_{11}, w_{12}, \ldots, w_{1p}$ have to satisfy the constraint that the sum of squares of the coefficients, i.e. $\mathbf{w}_1'\mathbf{w}_1$, should be unity.

The second principal component, $y_2$, is given by the linear combination of the p variables in the form

$$y_2 = w_{21}x_1 + w_{22}x_2 + \cdots + w_{2p}x_p \qquad (3.9)$$

or

$$y_2 = \mathbf{w}_2'\mathbf{x} \qquad (3.10)$$

which has the greatest variance subject to the two conditions $\mathbf{w}_2'\mathbf{w}_2 = 1$ and $\mathbf{w}_2'\mathbf{w}_1 = 0$ (so that $y_1$ and $y_2$ are uncorrelated). Similarly, the jth principal component is a linear combination

$$y_j = \mathbf{w}_j'\mathbf{x} \qquad (3.11)$$

which has greatest variance subject to

$$\mathbf{w}_j'\mathbf{w}_j = 1, \qquad \mathbf{w}_j'\mathbf{w}_i = 0 \quad (i < j).$$


To find the coefficients defining the first principal component, the elements of $\mathbf{w}_1$ should be chosen so as to maximise the variance of $y_1$ subject to the constraint $\mathbf{w}_1'\mathbf{w}_1 = 1$. The variance of $y_1$ is then given by

$$\mathrm{Var}(y_1) = \mathrm{Var}(\mathbf{w}_1'\mathbf{x}) = \mathbf{w}_1' S \mathbf{w}_1 \qquad (3.12)$$

where S is the variance-covariance matrix of the original variables (see Section 3.2.1). The solution $\mathbf{w}_1 = (w_{11}, w_{12}, \ldots, w_{1p})'$ that maximises the variance of $y_1$ is the eigenvector of S corresponding to the largest eigenvalue. The eigenvalues of S are the roots of the equation

$$|S - \lambda I| = 0 \qquad (3.13)$$

If the eigenvalues are $\lambda_1, \lambda_2, \ldots, \lambda_p$, they can be arranged from the largest to the smallest. The principal components associated with the first few (largest) eigenvalues capture most of the variance of the original data, while the remaining PCs mainly represent noise in the data.

PCA is scale dependent, so the data must be scaled in some meaningful way before the analysis. The most usual approach is to scale each variable to unit variance.
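The whole construction of this section can be sketched in a few lines of Python. This is a minimal illustration under our own naming conventions, not an implementation from the original text:

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigendecomposition of the covariance matrix S, cf. (3.12)-(3.13).

    X is an (n, p) data matrix with one observation per row. Returns the
    scores (projections y = w'x) and all eigenvalues, largest first.
    """
    # Scale each variable to zero mean and unit variance (PCA is scale dependent).
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    S = np.cov(Xs, rowvar=False)          # p x p variance-covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)  # eigh because S is symmetric
    order = np.argsort(eigvals)[::-1]     # arrange from largest to smallest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :n_components]         # each column w_j satisfies w_j' w_j = 1
    return Xs @ W, eigvals

# Example with synthetic, deliberately correlated data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
scores, eigvals = pca(X, n_components=2)
print(eigvals)  # the first few eigenvalues dominate when the variables correlate
```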

3.2.3 Data Pre-processing Using PCA

3.2.3.1 Pre-processing Dynamic Transients for Compression and Noise Removal

In computer control systems such as DCS, nearly all important process variables are recorded as dynamic trends. Dynamic trends can be more important than the actual real-time values in evaluating the current operational status of the process and in anticipating possible future developments. Appendix C describes a data set of one hundred cases corresponding to various operational modes, such as faults, disturbances and normal operation, of a refinery reactive distillation process for the manufacture of methyl tertiary butyl ether (MTBE), a lead-free gasoline additive.

This can be used to illustrate the dimension compression capability of PCA. For each data case, twenty-one variables are recorded as dynamic responses after a disturbance or fault occurs. Each trend consists of 256 sample points. Figure 3.1 shows the trends of a variable for two different cases. The eigenvalues of the first 20 principal components are summarised in Figure 3.2. It is apparent that the scores on the first few principal components can be used as a concise representation of the original dynamic trend, and so can replace the original responses for use in pattern recognition, as sketched below.
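A rough sketch of this compression step in Python, assuming one row per data case and 256 sample points per trend; the synthetic trends below merely stand in for the MTBE data, which is not reproduced here:

```python
import numpy as np

# Synthetic stand-in for the trend data: 100 data cases, 256 sample points each.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 256)
trends = np.array([np.sin(2.0 * np.pi * (1 + i % 3) * t)
                   + 0.1 * rng.normal(size=256) for i in range(100)])

# Centre the trends and project onto the leading eigenvectors of the covariance.
centred = trends - trends.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centred, rowvar=False))
order = np.argsort(eigvals)[::-1]

k = 5                                     # keep only the first few components
scores = centred @ eigvecs[:, order[:k]]  # a 100 x 5 summary of a 100 x 256 matrix
print(scores.shape)                       # concise representation of the trends
```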


Figure 3.1 The dynamic trends of a variable for two data cases.


Figure 3.2 The first 20 eigenvalues of a variable.


3.2.3.2 Pre-processing of Dynamic Transient Signals for Concept Formation

Since the first two principal components can capture the main features of a dynamic trend, a trend can be displayed graphically by plotting its scores on the first two principal components in a two-dimensional plane. Figure 3.3 shows such a plot for a variable $F_0$. A point in the two-dimensional plane represents the feature of the variable's response trend for one data case. Data points in region B have response trends which are similar to one another and unlike those in region D.

The fact that a two-dimensional plot is able to capture these features can be seen from Figures 3.4 and 3.5. Figure 3.4 shows the dynamic responses of the variable T_MTBE for seven data cases. After processing with PCA (in fact the seven data cases are processed together with another 93 data cases, but only the seven are shown here for illustration), the results are shown on the two-dimensional PCA plane in Figure 3.5. It is clear that the dynamic trends of data cases 1 and 2 are more alike each other than the others in Figure 3.4, and they are grouped more closely in Figure 3.5. Similar observations can be made for data cases 40 and 80, as well as 14 and 15.

Figure 3.3 The PCA two-dimensional plane of the variable $F_0$.


Figure 3.4 The dynamic trends of the temperature T_MTBE for the case study described in Appendix C.


Figure 3.5 The projection of the dynamic trends of Figure 3.4 on the two-dimensional PCA plane.
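A hypothetical Python fragment illustrating how such a two-dimensional PCA plane can be produced; the synthetic trends are the same stand-ins used in the compression sketch above, not the recorded responses:

```python
import numpy as np
import matplotlib.pyplot as plt

# Same synthetic trends as in the compression sketch above.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 256)
trends = np.array([np.sin(2.0 * np.pi * (1 + i % 3) * t)
                   + 0.1 * rng.normal(size=256) for i in range(100)])
centred = trends - trends.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centred, rowvar=False))
order = np.argsort(eigvals)[::-1]
scores = centred @ eigvecs[:, order[:2]]  # one (PC1, PC2) point per data case

# Nearby points correspond to data cases with similar dynamic responses.
plt.scatter(scores[:, 0], scores[:, 1])
plt.xlabel("First principal component score")
plt.ylabel("Second principal component score")
plt.show()
```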


The plot of the dynamic trends of a variable on a two-dimensional plane, as depicted in Figure 3.3, is referred to as concept formation. Concept formation transforms a complicated trend to a concept, e.g., "the variable $F_0$ is in region D".

The transformed concept of the trend of a variable can be used to develop knowledge based systems. A simple example is the following production rule for a case of a continuous stirred tank reactor where the data refer to historical operating data:

IF   $F_0$ is in region D of Figure 3.3
AND  TR is in region C of Figure 3.6
THEN the operation will be in region ABN-1 of Figure 3.7
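Such a production rule could be encoded, for example, as a simple region lookup on the two PCA planes. The rectangular region boundaries in the following sketch are purely hypothetical; in practice they would be derived from the historical operating data:

```python
def region_of(point, regions):
    """Return the label of the first rectangular region containing a 2-D point."""
    for label, (x0, x1, y0, y1) in regions.items():
        if x0 <= point[0] <= x1 and y0 <= point[1] <= y1:
            return label
    return None

# Purely hypothetical region boundaries on the planes of Figures 3.3 and 3.6.
F0_REGIONS = {"D": (10.0, 30.0, -5.0, 15.0)}
TR_REGIONS = {"C": (-20.0, 0.0, 0.0, 20.0)}

def classify(f0_point, tr_point):
    # Direct encoding of the production rule given in the text.
    if (region_of(f0_point, F0_REGIONS) == "D"
            and region_of(tr_point, TR_REGIONS) == "C"):
        return "ABN-1"  # operation falls in region ABN-1 of Figure 3.7
    return "unclassified"

print(classify((18.0, 4.0), (-8.0, 11.0)))  # -> ABN-1
```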

A detailed discussion of using the concept formation method to develop conceptual clustering systems is given in Chapter 7.


Figure 3.6 The PCA two-dimensional plane of the variable TR.


Figure 3.7 The PCA plane of operational states of a CSTR reactor.

3.2.3.3 Dependency Removal and Elimination of Redundant Variables

Studies have found that the presence of redundant and irrelevant variables may degrade pattern recognition or hide the real patterns in the data [68], and so some data mining and KDD tools require their inputs to be independent. It is not always possible to identify the dependencies between variables directly. PCA can be used to pre-process the data, and the first few principal components are then used as inputs by other data mining and KDD tools.
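To see why PCA serves this purpose, the following sketch (ours, with synthetic data) builds deliberately correlated inputs and verifies that the resulting principal component scores are mutually uncorrelated:

```python
import numpy as np

# Build deliberately correlated inputs: six variables driven by three sources.
rng = np.random.default_rng(2)
sources = rng.normal(size=(200, 3))
X = sources @ rng.normal(size=(3, 6)) + 0.05 * rng.normal(size=(200, 6))

# Standard PCA: scale, eigendecompose the covariance, project.
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xs, rowvar=False))
order = np.argsort(eigvals)[::-1]
scores = Xs @ eigvecs[:, order[:3]]  # decorrelated inputs for downstream tools

# Off-diagonal correlations of the scores are numerically zero.
print(np.round(np.corrcoef(scores, rowvar=False), 6))
```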
