



3.2 BASIC PROPERTIES OF PCA

3.2.1 Eigenvalue Decomposition

The purpose of principal component analysis (PCA) is to derive a relatively small number of decorrelated linear combinations (principal components) of a set of random zero-mean variables while retaining as much of the information from the original variables as possible.

Among the objectives of Principal Component Analysis are the following.

1. dimensionality reduction;

2. determination of linear combinations of variables;

3. feature selection: the choosing of the most useful variables;

4. visualization of multidimensional data;

5. identification of underlying variables;

6. identification of groups of objects or of outliers.

PCA has been widely studied and used in pattern recognition and signal processing. In fact it is important in many engineering and scientific disciplines, e.g., in data compression, feature extraction, noise filtering, signal restoration and classification [357]. PCA is used widely in data mining as a data reduction technique. In image processing and computer vision PCA representations have been used for solving problems such as face and object recognition, tracking, detection, background modelling, parameterizing shape, appearance and motion [1207,712].

Often the principal components (PCs) (i.e., the directions along which the input data have the largest variances) are regarded as important, while the components with the smallest variances, called minor components (MCs), are regarded as unimportant or as associated with noise. However, in some applications the MCs are of the same importance as the PCs, for example, in curve and surface fitting or total least squares (TLS) problems [1309,282].

Generally speaking, PCA is related to, and motivated by, the following two problems:

1. Given random vectors $x(k) \in \mathbb{R}^m$ with finite second-order moments and zero mean, find the reduced $n$-dimensional ($n < m$) linear subspace that minimizes the expected distance of $x$ from the subspace. This problem arises in the area of data compression, where the task is to represent all the data with a reduced number of parameters while keeping the distortion introduced by the projection as low as possible.

2. Given random vectors $x(k) \in \mathbb{R}^m$, find the $n$-dimensional linear subspace that captures most of the variance of the data $x$. This problem is related to feature extraction, where the objective is to reduce the dimension of the data while retaining most of its information content.

It turns out that both problems have the same optimal solution (in the least-squares sense), which is based on second-order statistics, in particular on the eigenstructure of the data covariance matrix². PCA can be converted into the eigenvalue problem of the covariance matrix of $x$, and it is essentially equivalent to the Karhunen-Loève transform used in image and signal processing. In other words, PCA is a technique for computing the eigenvectors and eigenvalues of the estimated covariance matrix³

\hat{R}_{xx} = E\{x(k)\, x^T(k)\} = V \Lambda V^T \in \mathbb{R}^{m \times m}, \qquad (3.1)

where $\Lambda = \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_m\}$ is a diagonal matrix containing the $m$ eigenvalues and $V = [v_1, v_2, \ldots, v_m] \in \mathbb{R}^{m \times m}$ is the corresponding orthogonal or unitary matrix consisting of the unit-length eigenvectors, referred to as the principal eigenvectors.

The Karhunen-Loève transform determines a linear transformation of an input vector $x$ as

y_P = V_S^T x, \qquad (3.2)

where $x = [x_1(k), x_2(k), \ldots, x_m(k)]^T$ is a zero-mean input vector, $y_P = [y_1(k), y_2(k), \ldots, y_n(k)]^T$ is the output vector, called the vector of principal components (PCs), and $V_S = [v_1, v_2, \ldots, v_n] \in \mathbb{R}^{m \times n}$ is the matrix of signal-subspace eigenvectors, with orthonormal columns $v_i = [v_{i1}, v_{i2}, \ldots, v_{im}]^T$ (i.e., $v_i^T v_j = \delta_{ij}$ for $j \le i$, where $\delta_{ij}$ is the Kronecker delta). The vectors $v_i$ $(i = 1, 2, \ldots, n)$ are eigenvectors of the covariance matrix, while the variances of the PCs $y_i$ are the corresponding principal eigenvalues. On the other hand, the $(m - n)$ minor components are given by

y_M = V_N^T x, \qquad (3.3)

where $V_N = [v_m, v_{m-1}, \ldots, v_{n+1}]$ consists of the $(m - n)$ eigenvectors associated with the smallest eigenvalues.

Therefore, the basic problem we try to solve is the standard eigenvalue problem, which can be formulated by the equations

R_{xx}\, v_i = \lambda_i v_i, \quad (i = 1, 2, \ldots, n) \qquad (3.4)

where $v_i$ are the eigenvectors, $\lambda_i$ are the corresponding eigenvalues, $R_{xx} = E\{x\, x^T\}$ is the covariance matrix of the zero-mean signal $x(k)$, and $E$ is the expectation operator. Note that Eq. (3.4) can be written in matrix form as $V^T R_{xx} V = \Lambda$, where $\Lambda$ is the diagonal matrix of the eigenvalues of the covariance matrix $R_{xx}$.

In the standard numerical approach for extracting the principal components, first the covariance matrix $R_{xx} = E\{x\, x^T\}$ is computed and then its eigenvectors and associated eigenvalues are determined by one of the known numerical algorithms. However, if the input data vectors have a large dimension (say, 1000 elements), then the covariance matrix $R_{xx}$ is very large ($10^6$ entries) and it may be difficult to compute the required eigenvectors.

² If the signals are zero mean, the covariance and correlation matrices are the same.

³ The covariance matrix is the correlation matrix of the vector with the mean removed. Since we consider zero-mean signals, the two matrices are equivalent.

The neural network approach and adaptive learning algorithms enable us to find the eigenvectors and the associated eigenvalues directly from the input vectors $x(k)$, without the need to compute or estimate the very large covariance matrix $R_{xx}$. Such an approach is especially useful for nonstationary input data, i.e., for tracking slow changes of correlations in the input data (signals) or for updating the eigenvectors with new samples.

Computing the sample covariance matrix itself is very costly. Furthermore, the direct diagonalization of the matrix, or its eigenvalue decomposition, can be extremely costly since this operation is of complexity $O(m^3)$. Most of the adaptive algorithms presented in this chapter do not require computing the sample covariance matrix and they have low complexity.
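To make the standard batch procedure of Eqs. (3.1)-(3.4) concrete, here is a minimal NumPy sketch (the function name `batch_pca` and the example data are illustrative choices, not part of the original text): it estimates the sample covariance matrix, performs its eigendecomposition, and projects the data onto the $n$ principal eigenvectors.

```python
import numpy as np

def batch_pca(X, n):
    """Standard batch PCA via eigendecomposition (cf. Eqs. (3.1)-(3.4)).

    X : array of shape (N, m), rows are zero-mean data vectors x(k).
    n : number of principal components to retain.
    Returns (V_S, lambdas, Y) with V_S the m x n matrix of principal
    eigenvectors, lambdas the n largest eigenvalues, and Y = X V_S,
    i.e. the principal components y_P = V_S^T x for every sample.
    """
    N, m = X.shape
    R_hat = X.T @ X / N                      # sample covariance matrix (estimate of R_xx)
    lambdas, V = np.linalg.eigh(R_hat)       # eigendecomposition, Eq. (3.1)
    order = np.argsort(lambdas)[::-1]        # sort eigenvalues in descending order
    lambdas, V = lambdas[order], V[:, order]
    V_S = V[:, :n]                           # signal-subspace eigenvectors
    Y = X @ V_S                              # principal components, Eq. (3.2)
    return V_S, lambdas[:n], Y

# Example: 1000 zero-mean samples of dimension m = 5, keep n = 2 components.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5)) @ np.diag([3.0, 2.0, 1.0, 0.3, 0.1])
X -= X.mean(axis=0)
V_S, lams, Y = batch_pca(X, n=2)
```

As the following paragraphs point out, this direct route becomes expensive for large $m$, which motivates the adaptive algorithms of this chapter.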

3.2.2 Estimation of Sample Covariance Matrices

In practice, the ideal covariance matrix $R_{xx}$ is not available. We only have an estimate $\hat{R}_{xx}$ of $R_{xx}$, called the sample covariance matrix, based on a block of a finite number of samples:

\hat{R}_{xx} = \frac{1}{N} \sum_{k=1}^{N} x(k)\, x^T(k). \qquad (3.5)

We assume that the covariance matrix does not change (or changes very slowly) over the length of the block. Alternatively, we can use the moving average (MA) approach to estimate the sample covariance matrix on-line as follows:

\hat{R}_{xx}^{(k)} = (1 - \eta_0)\, \hat{R}_{xx}^{(k-1)} + \eta_0\, x(k)\, x^T(k), \qquad (3.6)

where $\eta_0 > 0$ is a learning rate (and $(1 - \eta_0)$ is a forgetting factor) to be chosen according to the stationarity of the signal (typically $0.01 \le \eta_0 \le 0.1$).
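A minimal sketch of the update (3.6), assuming zero-mean samples arriving as a stream; the helper name and the default value of $\eta_0$ are illustrative choices:

```python
import numpy as np

def ma_covariance_update(R_prev, x, eta0=0.05):
    """One step of the moving-average estimator of Eq. (3.6).

    R_prev : current estimate of the m x m covariance matrix.
    x      : new zero-mean sample x(k), shape (m,).
    eta0   : learning rate; (1 - eta0) acts as the forgetting factor.
    """
    return (1.0 - eta0) * R_prev + eta0 * np.outer(x, x)

# Example: track the covariance of a stream of samples.
rng = np.random.default_rng(1)
m = 4
R_hat = np.zeros((m, m))
for _ in range(500):
    x = rng.standard_normal(m)
    R_hat = ma_covariance_update(R_hat, x, eta0=0.05)
```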

Alternatively, in real-time applications, the sample covariance matrix can be recursively updated as

\hat{R}_N = \frac{1}{N} \sum_{l=k-N+1}^{k} x(l)\, x^T(l)
          = \frac{1}{N} \left[ \sum_{l=k-N+1}^{k-1} x(l)\, x^T(l) + x(k)\, x^T(k) \right]
          = \frac{N-1}{N}\, \hat{R}_{N-1} + \frac{1}{N}\, x(k)\, x^T(k), \qquad (3.7)

where $\hat{R}_N$ denotes the covariance matrix estimated at the $k$-th data instant, so that $\hat{R}_{N-1} = \frac{1}{N-1} \sum_{l=k-N+1}^{k-1} x(l)\, x^T(l)$. The recursive update can be formulated in a more general form as

\hat{R}_N = \alpha\, \hat{R}_{N-1} + \Delta\hat{R}, \qquad (3.8)

where $\alpha$ is a parameter in the range $(0, 1]$ and $\Delta\hat{R}$ is a symmetric matrix of rank much lower than that of $\hat{R}_{N-1}$. When working with stationary signals, we usually use the rank-1 update with $\alpha = (N-1)/N$ and $\Delta\hat{R} = (1/N)\, x(k)\, x^T(k)$, where $x(k)$ is the data vector at the $k$-th instant. On the other hand, in the nonstationary case, rank-1 updating is carried out by choosing $0 < \alpha \ll 1$ and $\Delta\hat{R} = x(k)\, x^T(k)$. Alternatively, in the nonstationary case, we can use the rank-2 update with $\alpha = 1$ and $\Delta\hat{R} = x(k)\, x^T(k) - x(k-N+1)\, x^T(k-N+1)$, where $N$ is the length of the sliding window over which the covariance matrix is computed. The term $\hat{R}_{N-1}$ may be thought of as a prediction of $R$ based on $N - 1$ observations, and $x(k)\, x^T(k)$ may be thought of as an instantaneous estimate of $R$.
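The two special cases of Eq. (3.8) described above can be sketched as follows; this is illustrative code assuming zero-mean data, and in the sliding-window case the rank-2 update is applied to the unnormalized window sum, which is then divided by $N$ to give the covariance estimate.

```python
import numpy as np

def rank1_update(R_prev, x, N):
    """Stationary rank-1 form of Eq. (3.8): alpha = (N-1)/N,
    Delta_R = (1/N) x x^T, i.e. the recursion of Eq. (3.7)."""
    return ((N - 1) / N) * R_prev + np.outer(x, x) / N

def rank2_window_update(S_prev, x_new, x_old):
    """Nonstationary rank-2 form of Eq. (3.8): alpha = 1 and
    Delta_R = x(k) x(k)^T - x(k-N+1) x(k-N+1)^T, applied here to the
    unnormalized sliding-window sum S over the last N samples."""
    return S_prev + np.outer(x_new, x_new) - np.outer(x_old, x_old)

# Example: maintain a length-N sliding-window covariance estimate.
rng = np.random.default_rng(2)
m, N = 3, 50
window = [rng.standard_normal(m) for _ in range(N)]
S = sum(np.outer(x, x) for x in window)            # initial window sum
for _ in range(200):
    x_new = rng.standard_normal(m)
    S = rank2_window_update(S, x_new, window.pop(0))
    window.append(x_new)
    R_hat = S / N                                  # current covariance estimate
```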

3.2.3 Signal and Noise Subspaces - AIC and MDL Criteria for their Estimation

A very important problem arising in many application areas is the determination of the dimension of the signal and noise subspaces. To solve this problem, we usually exploit a fundamental property of PCA: it projects the input data $x(k)$ from their original $m$-dimensional space onto an $n$-dimensional output subspace $y(k)$ (typically with $n \ll m$), thus performing a dimensionality reduction that retains most of the intrinsic information in the input data vectors. In other words, the principal components $y_i(k) = v_i^T x(k)$ are estimated in such a way that, for $n \ll m$, although the dimensionality of the data is strongly reduced, the most relevant information is retained, in the sense that the original input data $x$ can be reconstructed from the output data (signals) $y$ by using the transformation $\hat{x} = V_S y$ that minimizes a suitable cost function. A commonly used criterion is the minimization of the mean squared error $\|x - V_S V_S^T x\|^2$.

PCA enables us to divide the observed (measured) sensor signals $x(k) = x_s(k) + \nu(k)$ into two subspaces: the signal subspace, corresponding to the principal components associated with the largest eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ $(m > n)$, called the principal eigenvalues, and the associated eigenvectors $V_S = [v_1, v_2, \ldots, v_n]$, called the principal eigenvectors; and the noise subspace, corresponding to the minor components associated with the eigenvalues $\lambda_{n+1}, \ldots, \lambda_m$. The subspace spanned by the first $n$ eigenvectors $v_i$ can be considered an approximation of the noiseless signal subspace. One important advantage of this approach is that it not only enables a reduction in the noise level, but also allows us to estimate the number of sources on the basis of the distribution of the eigenvalues. However, a problem arising with this approach is how to correctly set or estimate the threshold which divides the eigenvalues into the two subspaces, especially when the noise is large (i.e., the SNR is low).
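As an illustration of this subspace split (a hypothetical sketch, with $V_S$ and $V_N$ obtained from the sample covariance of synthetic data), an observation can be decomposed into its signal-subspace and noise-subspace parts, and a denoised reconstruction obtained by projecting onto the signal subspace:

```python
import numpy as np

def subspace_split(X, n):
    """Split each row of X into signal- and noise-subspace components.

    Returns (X_signal, X_noise), where X_signal = X V_S V_S^T is the
    projection onto the span of the n principal eigenvectors and
    X_noise = X V_N V_N^T is the residual (minor-component) part.
    """
    R_hat = X.T @ X / X.shape[0]              # sample covariance
    lams, V = np.linalg.eigh(R_hat)
    V = V[:, np.argsort(lams)[::-1]]          # eigenvectors, descending eigenvalues
    V_S, V_N = V[:, :n], V[:, n:]
    return X @ V_S @ V_S.T, X @ V_N @ V_N.T

# Example: denoise by keeping only the signal-subspace projection.
rng = np.random.default_rng(5)
X = rng.standard_normal((2000, 2)) @ rng.standard_normal((2, 6))  # rank-2 signal in R^6
X += 0.05 * rng.standard_normal(X.shape)                          # small additive noise
X_signal, X_noise = subspace_split(X - X.mean(axis=0), n=2)
```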

Let us assume that we model the vector $x(k) \in \mathbb{R}^m$ as

x(k) = H\, s(k) + \nu(k), \qquad (3.9)

where $H \in \mathbb{R}^{m \times n}$ is a full column rank mixing matrix with $m > n$, $s(k) \in \mathbb{R}^n$ is a vector of zero-mean Gaussian sources with the nonsingular covariance matrix $R_{ss} = E\{s(k)\, s^T(k)\}$, and $\nu(k) \in \mathbb{R}^m$ is a vector of Gaussian zero-mean i.i.d. noise modelled by the covariance matrix $R_{\nu\nu} = \sigma_\nu^2 I_m$; furthermore, the random vectors $\{s(k)\}$ and $\{\nu(k)\}$ are uncorrelated [773].

Remark 3.1 The model given by Eq. (3.9) is often referred to as probabilistic PCA and has been introduced in the machine learning context [1017, 1148]. Moreover, such a model can also be considered a special form of Factor Analysis (FA) with isotropic noise [1148].

For the model (3.9) and under the above assumptions, the covariance matrix of $x(k)$ can be written as

R_{xx} = E\{x(k)\, x^T(k)\} = H R_{ss} H^T + \sigma_\nu^2 I_m
       = [V_S, V_N] \begin{bmatrix} \Lambda_S & 0 \\ 0 & \Lambda_N \end{bmatrix} [V_S, V_N]^T
       = V_S \Lambda_S V_S^T + V_N \Lambda_N V_N^T, \qquad (3.10)

where $H R_{ss} H^T = V_S \hat{\Lambda}_S V_S^T$ is a rank-$n$ matrix, and $V_S \in \mathbb{R}^{m \times n}$ contains the eigenvectors associated with the $n$ principal (signal-plus-noise subspace) eigenvalues $\Lambda_S = \mathrm{diag}\{\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n\}$ in descending order. Similarly, the matrix $V_N \in \mathbb{R}^{m \times (m-n)}$ contains the $(m - n)$ (noise) eigenvectors that correspond to the noise eigenvalues $\Lambda_N = \mathrm{diag}\{\lambda_{n+1}, \ldots, \lambda_m\} = \sigma_\nu^2 I_{m-n}$. This means that, theoretically, the $(m - n)$ smallest eigenvalues of $R_{xx}$ are equal to $\sigma_\nu^2$, so in theory we can determine the dimension of the signal subspace from the multiplicity of the smallest eigenvalue, under the assumptions that the noise variance is relatively low and that the covariance matrix is estimated perfectly. However, in practice, we estimate the sample covariance matrix from a limited number of samples, so the smallest eigenvalues are usually all different, and determining the dimension of the signal subspace is usually not an easy task.
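To make the eigenvalue structure in Eq. (3.10) concrete, the following sketch (dimensions and noise level are arbitrary illustrative choices) simulates the model of Eq. (3.9) and inspects the sample eigenvalues: the $m - n$ smallest ones cluster around $\sigma_\nu^2$, while the $n$ largest stand clearly above the noise floor.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, N, sigma_nu = 8, 3, 5000, 0.1           # illustrative dimensions and noise level

H = rng.standard_normal((m, n))               # full column rank mixing matrix
S = rng.standard_normal((N, n))               # zero-mean Gaussian sources s(k)
noise = sigma_nu * rng.standard_normal((N, m))
X = S @ H.T + noise                           # x(k) = H s(k) + nu(k), Eq. (3.9)

R_hat = X.T @ X / N                           # sample covariance, Eq. (3.5)
eigvals = np.linalg.eigvalsh(R_hat)[::-1]     # eigenvalues in descending order

print(eigvals)
# The n = 3 largest eigenvalues correspond to the signal(+noise) subspace;
# the remaining m - n = 5 eigenvalues are all close to sigma_nu**2 = 0.01.
```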

Instead of setting the threshold between the signal and noise eigenvalues by using some heuristic procedure or a rule of thumb, we can use one of two well-known information theoretic criteria, namely, Akaike's information criterion (AIC) or the minimum description length (MDL) criterion [671, 1266].

Akaike’s information theoretic criterion (AIC) selects the model that minimizes the cost function [773]

AIC = -2 \log\big(p(x(1), x(2), \ldots, x(N) \mid \hat{\Theta})\big) + 2n, \qquad (3.11)

where $p(x(1), x(2), \ldots, x(N) \mid \Theta)$ is a parameterized family of probability densities, $\hat{\Theta}$ is the maximum likelihood estimate of the parameter vector $\Theta$, and $n$ is the number of freely adjusted parameters.

The minimum description length (MDL) criterion selects the model that instead minimizes

MDL = -\log\big(p(x(1), x(2), \ldots, x(N) \mid \hat{\Theta})\big) + \frac{1}{2}\, n \log N. \qquad (3.12)

Assuming that the observed vectors $\{x(k)\}_{k=1}^{N}$ are zero-mean i.i.d. Gaussian random vectors, it can be shown [1266] that the dimension of the signal subspace can be estimated by taking the value of $n \in \{1, 2, \ldots, m\}$ for which

AIC(n) = -2N(m - n) \log \varrho(n) + 2n(2m - n), \qquad (3.13)

MDL(n) = -N(m - n) \log \varrho(n) + 0.5\, n(2m - n) \log N \qquad (3.14)

is minimized. Here, $N$ is the number of data vectors $x(k)$ used in estimating the data covariance matrix $R_{xx}$, and

\varrho(n) = \frac{(\lambda_{n+1}\, \lambda_{n+2} \cdots \lambda_m)^{\frac{1}{m-n}}}{\frac{1}{m-n}\,(\lambda_{n+1} + \lambda_{n+2} + \cdots + \lambda_m)} \qquad (3.15)

is the ratio of the geometric mean of the $(m - n)$ smallest PCA eigenvalues to their arithmetic mean. The estimate $\hat{n}$ of the number of terms (sources) is chosen so that it minimizes either the AIC or the MDL criterion.
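A minimal sketch of how Eqs. (3.13)-(3.15) can be evaluated from the sorted sample eigenvalues; the function name and the restriction to $n \le m - 1$ (so that at least one noise eigenvalue remains) are illustrative choices, not part of the original text.

```python
import numpy as np

def estimate_n_sources(eigvals, N):
    """Estimate the signal-subspace dimension by minimizing the AIC and MDL
    criteria of Eqs. (3.13)-(3.15). eigvals must be sorted in descending
    order; N is the number of data vectors. Returns (n_aic, n_mdl)."""
    m = len(eigvals)
    aic, mdl = [], []
    for n in range(1, m):                      # n = m would leave no noise eigenvalues
        tail = eigvals[n:]                     # the (m - n) smallest eigenvalues
        geo = np.exp(np.mean(np.log(tail)))    # geometric mean
        arith = np.mean(tail)                  # arithmetic mean
        rho = geo / arith                      # Eq. (3.15)
        aic.append(-2 * N * (m - n) * np.log(rho) + 2 * n * (2 * m - n))            # Eq. (3.13)
        mdl.append(-N * (m - n) * np.log(rho) + 0.5 * n * (2 * m - n) * np.log(N))  # Eq. (3.14)
    return 1 + int(np.argmin(aic)), 1 + int(np.argmin(mdl))

# Example, reusing the simulated eigenvalues from the previous sketch:
# n_aic, n_mdl = estimate_n_sources(eigvals, N)
```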

Both criteria provide rough estimates (of the number of sources) that are rather sensitive to variations in the SNR and in the number of available data samples [773]. Another problem with the AIC and MDL criteria given above is that they have been derived by assuming that the data vectors $x(k)$ have a Gaussian distribution [1266]. This is done for mathematical tractability, since it makes it possible to derive closed-form expressions. The Gaussianity assumption does not usually hold exactly in BSS and other signal processing applications. Therefore, while the MDL and AIC criteria yield only suboptimal estimates, they still provide useful formulas that can be used for model order estimation.

Instead of setting the threshold between the signal and noise eigenvalues, one might even suppose that the MDL and AIC criteria cannot be used in the BSS problem, because there we assume that the source signals $s_i(k)$ are non-Gaussian. However, it should be noted that the components of the data vectors $x(k) = H s(k) + \nu(k)$ are mixtures of the sources, and therefore often have distributions that are not so far from the Gaussian one.

In practical experiments, the MDL and AIC criteria have quite often performed very well in estimating the number $n$ of sources in the BSS problem [671]. We have found two practical requirements for their successful use. Firstly, the number of mixtures must be larger than the number of sources. If the number of sources is equal to the number of sensors, both criteria inevitably underestimate $n$ by one. The second requirement is that there must be at least a small amount of noise. This also guarantees that the eigenvalues $\lambda_{n+1}, \lambda_{n+2}, \ldots, \lambda_m$ corresponding to the noise are nonzero. It is obvious that zero eigenvalues cause numerical difficulties in formulas (3.13) and (3.14).

3.2.4 Basic Properties of PCA

It is easy to obtain the following properties of the principal components (PCs) $y_i = v_i^T x$:

1. The factor $y_1(k) = v_1^T x(k)$ is the first principal component of $x(k)$ if the variance of $y_1(k)$ is maximally large under the constraint that the norm of the vector $v_1$ is constant [910]. Then the weight vector $v_1$ maximizes the criterion

J_1(v_1) = E\{y_1^2\} = v_1^T R_{xx}\, v_1, \qquad (3.16)

subject to the constraint $\|v_1\|^2 = 1$. The criterion can be extended to $n$ principal components (with $n$ any number between 1 and $m$) as

J_n(v_1, v_2, \ldots, v_n) = E\Big\{\sum_{i=1}^{n} y_i^2\Big\} = E\Big\{\sum_{i=1}^{n} (v_i^T x)^2\Big\} = \sum_{i=1}^{n} v_i^T R_{xx}\, v_i, \qquad (3.17)

subject to the constraints $v_i^T v_j = \delta_{ij}$.

2. The PCs have zero mean values

E\{y_i\} = 0, \quad \forall i. \qquad (3.18)

3. Different PCs are mutually uncorrelated

E\{y_i y_j\} = \delta_{ij} \lambda_j, \quad (i, j = 1, 2, \ldots, n). \qquad (3.19)

4. The variance of the $i$-th PC is equal to the $i$-th eigenvalue of the covariance matrix $R_{xx}$

\mathrm{var}\{y_i\} = \sigma_{y_i}^2 = E\{y_i^2\} = E\{(v_i^T x)^2\} = v_i^T R_{xx}\, v_i = \lambda_i. \qquad (3.20)

5. The PCs are hierarchically organized with respect to decreasing values of their variances

\sigma_{y_1}^2 > \sigma_{y_2}^2 > \cdots > \sigma_{y_n}^2, \qquad (3.21)

i.e., $\lambda_1 > \lambda_2 > \cdots > \lambda_n$.

6. Best approximation property: for the mean-square error of the approximation

\hat{x} = \sum_{i=1}^{n} y_i v_i = \sum_{i=1}^{n} v_i v_i^T x, \quad n < m, \qquad (3.22)

we have

E\{\|x - \hat{x}\|^2\} = E\Big\{\Big\|\sum_{i=n+1}^{m} y_i v_i\Big\|^2\Big\} = \sum_{i=n+1}^{m} E\{|y_i|^2\} = \sum_{i=n+1}^{m} \lambda_i. \qquad (3.23)

Taking into account that $\lambda_1 > \lambda_2 > \cdots > \lambda_n$, it is obvious that an approximation with the eigenvectors $v_1, v_2, \ldots, v_n$ corresponding to the largest eigenvalues leads to the minimal mean-square error.
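As a quick numerical check of properties 3-6 (an illustrative sketch on synthetic data, not part of the original text), the following verifies that the sample PCs are uncorrelated with variances equal to the eigenvalues, and that the reconstruction error of Eq. (3.23) matches the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, N = 6, 3, 20000
A = rng.standard_normal((m, m))
X = rng.standard_normal((N, m)) @ A.T         # correlated zero-mean data
X -= X.mean(axis=0)

R_hat = X.T @ X / N
lams, V = np.linalg.eigh(R_hat)
lams, V = lams[::-1], V[:, ::-1]              # descending eigenvalues

Y = X @ V                                     # all PCs, y_i = v_i^T x
C = Y.T @ Y / N
print(np.allclose(C, np.diag(lams), atol=1e-8))   # properties 3 and 4: E{y_i y_j} = delta_ij lambda_j

X_hat = Y[:, :n] @ V[:, :n].T                 # Eq. (3.22): reconstruction from n PCs
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(mse, lams[n:].sum())                    # Eq. (3.23): MSE equals the sum of discarded eigenvalues
```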

3.3 EXTRACTION OF PRINCIPAL COMPONENTS USING OPTIMAL
