
3.7 UNIFIED PARALLEL ALGORITHMS FOR ESTIMATING PRINCIPAL COMPONENTS, MINOR COMPONENTS AND THEIR SUBSPACES

In the previous sections, we presented simple fast local algorithms which enable us to extract principal and minor components sequentially, one by one. In this section, we present a more general and unified approach which allows us to estimate principal and minor components in parallel. Moreover, the discussed algorithms can also be used for principal subspace analysis (PSA) and minor subspace analysis (MSA). When we are interested only in the subspace spanned by the eigenvectors associated with the n largest or smallest eigenvalues, we do not need to identify the individual eigenvectors v_i; any set of vectors ṽ_i spanning the same subspace as the v_i is sufficient. Indeed, when there is multiplicity in the eigenvalues, say λ_1 = λ_2, we cannot obtain v_1 and v_2 uniquely, but only the subspace spanned by v_1 and v_2. Such a problem is called principal subspace or minor subspace extraction.

All these problems (PCA, PSA, MCA and MSA) are very similar, and include sequential extraction (n = 1) as a special case. It is therefore desirable to obtain a general unified principle applicable to all of these problems and to derive algorithms in a unified way. The principle should elucidate the structure of the PCA and MCA problems, and should not only explain most algorithms proposed so far, but also yield unified algorithms.

3.7.1 Cost Function for Parallel Processing

We first explain the idea intuitively. Let {λ_1, . . . , λ_m} and {d_1, . . . , d_m} be two sets of m positive numbers, where d_1 > · · · > d_m > 0. Consider the problem of maximizing (or minimizing) the sum

S = \sum_{i=1}^{m} λ_{i'} d_i    (3.106)

by rearranging the order of {λ_1, . . . , λ_m} as {λ_{1'}, . . . , λ_{m'}}, where {1', 2', . . . , m'} is a permutation of {1, 2, . . . , m}. It is easy to see that S is maximized when the {λ_{i'}} are arranged in decreasing order (that is, λ_{1'} > · · · > λ_{m'}) and minimized when λ_{1'} < · · · < λ_{m'}.
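As a quick sanity check of this rearrangement argument, the following small Python sketch (illustrative only; the random values are arbitrary) enumerates all pairings for a tiny example and confirms that the decreasing arrangement maximizes S and the increasing arrangement minimizes it.

import numpy as np
from itertools import permutations

# Enumerate S = sum_i lambda_{i'} d_i over all orderings of a small example and
# check that the decreasing arrangement of the lambda's attains the maximum and
# the increasing arrangement attains the minimum.
rng = np.random.default_rng(0)
lam = rng.uniform(0.1, 5.0, size=6)                 # arbitrary positive numbers
d = np.sort(rng.uniform(0.1, 5.0, size=6))[::-1]    # d_1 > ... > d_m > 0

sums = [float(np.dot(lam[list(p)], d)) for p in permutations(range(6))]
S_max = float(np.dot(np.sort(lam)[::-1], d))        # lambda's in decreasing order
S_min = float(np.dot(np.sort(lam), d))              # lambda's in increasing order
assert np.isclose(max(sums), S_max) and np.isclose(min(sums), S_min)
print(S_max, S_min)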

Brockett generalized this idea to a matrix formulation [113, 114]. Let V = [v_1, . . . , v_m] be an orthogonal matrix whose columns satisfy

v_i^T v_j = δ_{ij}    (V^T V = I_m).    (3.107)

Let us put

J(V) = tr(D V^T R_xx V) = tr(V D V^T R_xx),    (3.108)

where D = diag(d_1, . . . , d_m). When V consists of the m eigenvectors of R_xx, V = [v_{1'}, . . . , v_{m'}],

V^T R_xx V = diag(λ_{1'}, . . . , λ_{m'})    (3.109)

and Eq. (3.108) reduces to J(V) = \sum_i d_i λ_{i'}. When V is a general orthogonal matrix, V^T R_xx V is not a diagonal matrix. However, the following proposition holds.

Proposition 3.1 The cost function J(V) is maximized when

V = [v_1, . . . , v_m]    (3.110)

and minimized when

V = [v_m, . . . , v_1],    (3.111)

provided the eigenvalues satisfy λ_1 > · · · > λ_m. J(V) has no local minima or maxima other than the global ones.

Now consider the case where d_1 > · · · > d_n > d_{n+1} = · · · = d_m = 0. In this case only the first n columns of V contribute to J(V), so that the last (m − n) columns of V automatically vanish from the cost. Let us use n arbitrary mutually orthogonal unit vectors w_1, . . . , w_n, put W = [w_1, . . . , w_n], and define

J(W) = tr(D W^T R_xx W),

where now D = diag(d_1, . . . , d_n), which plays the same role as J(V). It is immediately seen that J(W) is maximized (minimized) when W consists of the eigenvectors of the n largest (smallest) eigenvalues, and there are no local maxima or minima except for the true solutions. Hence this cost function is applicable to both PCA and MCA, as well as to PSA and MSA.
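A small numerical illustration of this claim, with an arbitrary toy covariance matrix and weights d_1 > d_2 chosen purely for demonstration, checks that no random point on the constraint set W^T W = I_n beats the principal eigenvectors and that none falls below the minor eigenvectors:

import numpy as np

# Toy check: J(W) = tr(D W^T Rxx W) over W^T W = I_n is largest at the n principal
# eigenvectors of Rxx and smallest at the n minor eigenvectors.
rng = np.random.default_rng(1)
m, n = 6, 2
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
Rxx = Q @ np.diag([5.0, 3.0, 2.0, 1.0, 0.5, 0.2]) @ Q.T   # symmetric, distinct eigenvalues
D = np.diag([2.0, 1.0])                                    # d_1 > d_2 > 0

def J(W):
    return float(np.trace(D @ W.T @ Rxx @ W))

vals, vecs = np.linalg.eigh(Rxx)        # eigenvalues in ascending order
W_pca = vecs[:, ::-1][:, :n]            # eigenvectors of the 2 largest eigenvalues
W_mca = vecs[:, :n]                     # eigenvectors of the 2 smallest eigenvalues

for _ in range(500):
    W, _ = np.linalg.qr(rng.standard_normal((m, n)))       # random point with W^T W = I_n
    assert J(W_mca) - 1e-9 <= J(W) <= J(W_pca) + 1e-9
print(J(W_pca), J(W_mca))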

When n = 1, J(w) reduces to

J(w) = w^T R_xx w    (3.115)

under the conditions d_1 = 1 and w^T w = 1. Hence, it is equivalent to the Rayleigh quotient, or rather its constrained version.

When d_1 = d_2 = · · · = d_n = d, J(W) is maximized (minimized) when W is composed of eigenvectors of the n largest (smallest) eigenvalues. There are again no other local maxima or minima, but only the subspace, not the individual eigenvectors, can be extracted by maximizing (minimizing) this cost function.

3.7.2 Gradient of J(W)

The matrix W ∈ IR^{m×n} satisfies

W^T W = I_n.    (3.116)

The set of such matrices is called the Stiefel manifold O_{m,n}. We calculate the gradient of J(W) on O_{m,n}. Let dW be a small change of W, and let dJ be the corresponding change in J(W). From

dJ = (∂J/∂W) · dW,    (3.119)

where A · B = \sum_{i,j} a_{ij} b_{ij} = tr(A B^T), the gradient is the matrix given by

∂J/∂W = −W D W^T R_xx W + R_xx W D W^T W
      = −W D W^T R_xx W + R_xx W D.    (3.120)

The gradient method for obtaining the principal components is written as

ΔW_P = η (R_xx W_P D − W_P D W_P^T R_xx W_P)    (3.121)

and, for the minor components,

ΔW_M = −η (R_xx W_M D − W_M D W_M^T R_xx W_M),    (3.122)

where W_P = V_S = [w_1, . . . , w_n] and W_M = V_N = [w_m, . . . , w_{m−n+1}].
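The following Python sketch illustrates a straightforward batch discretization of the PCA update (3.121); the covariance matrix, the weights in D and the step size are arbitrary choices made only for this illustration.

import numpy as np

# Batch discretization of (3.121): Delta W = eta (Rxx W D - W D W^T Rxx W).
# With a small step size and W started on the Stiefel manifold, W converges
# to the n principal eigenvectors of Rxx (up to sign).
rng = np.random.default_rng(2)
m, n, eta = 8, 3, 0.005
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
Rxx = Q @ np.diag([6.0, 4.0, 3.0, 2.0, 1.5, 1.0, 0.5, 0.2]) @ Q.T
D = np.diag([3.0, 2.0, 1.0])

W, _ = np.linalg.qr(rng.standard_normal((m, n)))    # W^T W = I_n initially
for _ in range(20000):
    W = W + eta * (Rxx @ W @ D - W @ D @ W.T @ Rxx @ W)

vals, vecs = np.linalg.eigh(Rxx)
V = vecs[:, ::-1][:, :n]                            # true principal eigenvectors
print(np.round(np.abs(V.T @ W), 3))                 # approx. the identity after convergence

Because the entries of D are distinct, the columns of W converge to the individual principal eigenvectors, not merely to a basis of the principal subspace.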

Let us consider the special case with n = 1. Putting d_1 = 1, Eq. (3.121) reduces to

Δw = η (R_xx w − w w^T R_xx w),    (3.123)

which is the well-known learning algorithm. Its on-line version is

Δw = η (y x − y^2 w),    (3.124)

where the covariance matrix R_xx is replaced by its instantaneous estimate x x^T, and y = w^T x. This is the classic algorithm found by Amari (1978) and by Oja (1982), in which the constraint term imposing w^T w = 1 is treated separately [20, 910, 914, 1].
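A minimal sketch of the on-line rule (3.124) on a synthetic data stream (the covariance used to generate the samples is an arbitrary choice for illustration):

import numpy as np

# On-line rule (3.124): w <- w + eta (y x - y^2 w), with y = w^T x.
# The weight vector converges to (plus or minus) the principal eigenvector.
rng = np.random.default_rng(3)
m, eta = 5, 0.005
C = np.diag([4.0, 2.0, 1.0, 0.5, 0.25])     # true covariance of the data stream
L = np.linalg.cholesky(C)

w = rng.standard_normal(m)
w /= np.linalg.norm(w)
for _ in range(50000):
    x = L @ rng.standard_normal(m)          # sample with covariance C
    y = w @ x
    w = w + eta * (y * x - (y ** 2) * w)

print(np.round(w, 3))                       # approx. +/- [1, 0, 0, 0, 0]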

When n = m, this is the algorithm given by Brockett, while Xu derived it for n < m [114, 1309, 1310]. If we put d_1 = · · · = d_n = 1, we obtain the subspace algorithm.

How does the algorithm (3.122) for extracting minor components work? It is obtained from the same cost function, but by minimization instead of maximization; hence the MCA algorithm simply changes the sign of the gradient. However, computer simulations show that it does not work. Since we have shown that J(W) has only one maximum and only one minimum, this looks strange. It was a puzzle for many years, and it is interesting to know the reason, which is related to the stability of the algorithms. We need a more detailed stability analysis to elucidate the structure. Table 3.4 summarizes several parallel algorithms for PCA and PSA, while Table 3.5 summarizes algorithms for MSA/MCA.

3.7.3 Stability Analysis

For analyzing the stability, it is easier to replace the finite-difference equations by their continuous-time versions. The continuous-time versions of (3.121) and (3.122) are

dW(t)/dt = µ (R_xx W D − W D W^T R_xx W),    (3.125)

dW(t)/dt = −µ (R_xx W D − W D W^T R_xx W),    (3.126)

Table 3.4 Parallel adaptive algorithms for PSA/PCA.

No.  Learning algorithm                                    Notes, References
1.   ΔW = η [x y^H − W y y^H]                              Oja-Karhunen (1985) [914]
2.   ΔW = η [R_xx W D − W D W^H R_xx W]                    Brockett (1991) [114]
        = η [x y^H D − W D y y^H]
3.   ΔW = η (R_xx W W^H − W W^H R_xx) W                    Chen, Amari (1998) [189]
        = η [x y^H W^H W − W y y^H]
4.   ΔW = η (R_xx W D W^H − W D W^H R_xx) W                Chen, Amari (2001) [188]

For a diagonal matrix D with positive, strictly decreasing entries, the Brockett and Chen-Amari algorithms perform parallel adaptive PCA [114, 188, 189].

where we have omitted the suffixes P and M.
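For instance, the Oja-Karhunen subspace rule in entry 1 of Table 3.4 can be run on a data stream as in the sketch below (real-valued data and arbitrary synthetic parameters are assumed). Since it corresponds to D = I, it recovers the principal subspace rather than the individual eigenvectors.

import numpy as np

# Oja-Karhunen subspace rule: W <- W + eta (x y^T - W y y^T), with y = W^T x.
# It converges to an orthonormal basis of the principal subspace of the data.
rng = np.random.default_rng(4)
m, n, eta = 6, 2, 0.002
C = np.diag([5.0, 4.0, 1.0, 0.5, 0.3, 0.1])     # true covariance of the stream
L = np.linalg.cholesky(C)

W, _ = np.linalg.qr(rng.standard_normal((m, n)))
for _ in range(100000):
    x = L @ rng.standard_normal(m)
    y = W.T @ x
    W = W + eta * (np.outer(x, y) - W @ np.outer(y, y))

# Cosines of the principal angles between span(W) and the true principal subspace.
P = np.zeros((m, n)); P[0, 0] = 1.0; P[1, 1] = 1.0
s = np.linalg.svd(P.T @ W, compute_uv=False)
print(np.round(s, 3))                            # close to [1, 1] if the subspaces agree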

We assume here that W is a general m × n matrix, not necessarily belonging to O_{m,n}, and we put

U(t) = W^T(t) W(t).    (3.127)

We can prove that U(t) is invariant under the gradient dynamics (3.125) and (3.126), that is,

dU(t)/dt = 0    (3.128)

when W(t) changes under the dynamics of (3.125) or (3.126). This is easily proved by direct calculation,

dU(t)/dt = (dW^T/dt) W + W^T (dW/dt)
         = µ (I − W^T W)(D W^T R_xx W + W^T R_xx W D) = 0    (3.129)

with the initial condition U(0) = W^T(0) W(0) = I. Therefore, when W(0) at time t = 0 belongs to the Stiefel manifold O_{m,n}, we have U(0) = I, and hence U(t) = I for all t, implying that W(t) always belongs to O_{m,n}. Since any global extremum gives the true solution of n principal or minor components, the dynamical equation (3.121) or (3.122) should converge to the true solution, provided W(t) always belongs to O_{m,n}.
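The invariance (3.128) can be checked numerically with a small Euler step, as in the sketch below (a crude discretization with arbitrary synthetic data; exact invariance holds only in continuous time, so a small residual drift remains):

import numpy as np

# Euler-discretized flows (3.125) (sign +1) and (3.126) (sign -1) with a very
# small step; prints the residual ||W^T W - I|| after the run, which would be
# exactly zero for the continuous-time dynamics started on the Stiefel manifold.
rng = np.random.default_rng(5)
m, n, mu = 6, 2, 1e-4
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
Rxx = Q @ np.diag([2.0, 1.5, 1.0, 0.7, 0.4, 0.2]) @ Q.T
D = np.diag([1.5, 1.0])
W0, _ = np.linalg.qr(rng.standard_normal((m, n)))

for sign in (+1.0, -1.0):                      # +1: PCA flow, -1: MCA flow
    W = W0.copy()
    for _ in range(2000):
        W = W + sign * mu * (Rxx @ W @ D - W @ D @ W.T @ Rxx @ W)
    drift = np.linalg.norm(W.T @ W - np.eye(n))
    print(f"sign {sign:+.0f}: ||W^T W - I|| = {drift:.2e}")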

Since computer simulations show that (3.122) does not work for minor component extraction, the dynamics (3.122), which is stable within O_{m,n}, must be unstable when extended to the entire space IR^{m×n}. In other words, if W(t) deviates to W(t) + dW outside O_{m,n} due to noise or numerical round-off, the deviation grows and W(t) escapes from O_{m,n}.

To show the above scenario, let δW be a small deviation of W ∈ O_{m,n} in an arbitrary direction. It is known that such a deviation can in general be decomposed into three terms,

δW = W δS + W δG + N δB,    (3.130)

where δS is an n × n skew-symmetric matrix, δG is an n × n symmetric matrix, δB is an (m − n) × n matrix, and N is an m × (m − n) matrix whose (m − n) columns are orthogonal to the columns of W. The current W consists of n orthonormal vectors {w_1, . . . , w_n} which define an n-dimensional subspace. The first term W δS represents a change of the w_i to w_i + δw_i that keeps the subspace spanned by the w_i invariant, although its orthonormal basis vectors change. The third term N δB moves the w_i out of the current subspace, so that it alters the direction of the subspace spanned by W while still staying in O_{m,n}. The second term W δG destroys the orthonormality of W, so it represents changes of W in the directions orthogonal to O_{m,n}. Hence this term is zero when the dynamics is restricted to O_{m,n}.

When W(t) deviates to W(t) + δW(t), how does such a change δW(t) develop under the dynamics? The evolution of δW(t) is given by the variational equation

d δW(t)/dt = ±µ δ(R_xx W D − W D W^T R_xx W)
           = ±µ (R_xx δW D − δW D W^T R_xx W − W D δW^T R_xx W).    (3.131)

We analyze the variational equation in the neighborhood of the true solution, where the variation δW is allowed not only inside O_{m,n} but also in the orthogonal directions.

The results are summarized as follows (see Chen and Amari, 2001; Chen et al. 1998) [188,189].

1. The variational equations are stable at the true solution with respect to the changes δS and δB inside O_{m,n}, when all the d_i's are different and all the λ_i's are different.

2. When some d_i's are equal, or some λ_i's are equal, the variational equations are stable with respect to δB but only neutrally stable with respect to δS.

3. The variational equation is stable at the true solution with respect to changes δG for principal component extraction (3.125), but is unstable for minor component extraction (3.126).

Result 1 shows that the gradient dynamics (3.125) and (3.126) are successful for obtaining the principal and minor components, respectively, in parallel, provided W(t) is exactly controlled to belong to O_{m,n}.

Result 2 shows that the algorithms are successful for extracting the principal and minor subspaces, under the same condition, when some d_i's or λ_i's are equal.

Result 3 shows that algorithm (3.125) is successful for extracting principal components, but algorithm (3.126) fails for extracting minor components because of the instability in the directions orthogonal to O_{m,n}.

Table 3.5 Adaptive parallel MSA/MCA algorithms for complex-valued data.

No.  Learning rule                                         Notes, References
1.   ΔW = −η (R_xx W W^H − W W^H R_xx) W                   Chen, Amari, Lin (1998) [189]
        = −η [x y^H W^H W − W y y^H]
2.   ΔW = −η (R_xx W D W^H − W D W^H R_xx) W               Chen, Amari (2001) [188]
            + W (D − W^H W D)
3.   ΔW = −η [R_xx W W^H W W^H W − W W^H R_xx W]           Douglas, Kung, Amari (1998) [413]
        = −η [x y^H W^H W W^H W − W y y^H]
4.   y(k) = W^H(k) x(k)                                    Orthogonal algorithm [1],
     x̂(k) = W(k) y(k)                                      Abed-Meraim et al. (2000)
     e(k) = x(k) − x̂(k)
     α(k) = (1 + η^2 ||e(k)||^2 ||y(k)||^2)^{−1/2}
     β(k) = (α(k) − 1)/||y(k)||^2
     ē(k) = −β(k) x̂(k)/η + α(k) e(k),
     ΔW(k) = −η ē(k) y^H(k),  or
     u(k) = ē(k)/||ē(k)||,  z(k) = W^H(k) u(k),
     ΔW(k) = −2 u(k) z^H(k)

For D = I the Chen-Amari algorithm performs MSA [188, 189]; however, for a matrix D with positive, strictly decreasing entries the algorithm performs stable MCA.

Such deviations are caused by noise or numerical round-off, so W(t) should be adjusted in each step to satisfy (3.116).
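One simple way to perform such an adjustment is to re-orthonormalize W, for example by a QR decomposition, after every gradient step of (3.122). The sketch below illustrates this on arbitrary synthetic data; the QR-based adjustment is only an illustrative choice, not the specific procedure of the cited references.

import numpy as np

# Minor component extraction with the gradient step (3.122) followed by an
# explicit re-orthonormalization of W in every iteration, which keeps W on
# the Stiefel manifold and suppresses the unstable orthogonal direction.
rng = np.random.default_rng(6)
m, n, eta = 6, 2, 0.01
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
Rxx = Q @ np.diag([5.0, 3.0, 2.0, 1.0, 0.5, 0.2]) @ Q.T
D = np.diag([2.0, 1.0])

W, _ = np.linalg.qr(rng.standard_normal((m, n)))
for _ in range(20000):
    W = W - eta * (Rxx @ W @ D - W @ D @ W.T @ Rxx @ W)   # minor-component step (3.122)
    Q_, R_ = np.linalg.qr(W)                               # re-impose W^T W = I_n
    W = Q_ * np.sign(np.diag(R_))                          # fix the sign convention of QR

vals, vecs = np.linalg.eigh(Rxx)
V_minor = vecs[:, :n]                      # eigenvectors of the 2 smallest eigenvalues
print(np.round(np.abs(V_minor.T @ W), 3))  # approx. the identity (up to signs)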

3.7.4 Unified Stable Algorithms

In order to overcome the instability of minor component extraction, Chen et al. (1998) added a term which forces W(t) to return to O_{m,n} [189]. The algorithms for principal and minor component extraction differ only in sign:

dW/dt = µ (R_xx W D W^T W − W D W^T R_xx W)    (3.132)

for PCA, and

dW/dt = −µ (R_xx W D W^T W − W D W^T R_xx W)    (3.133)

for MCA. Here,

W^T(t) W(t) = I_n    (3.134)

is not necessarily guaranteed in the course of the dynamics, although we choose the initial value to satisfy W^T(0) W(0) = I_n. When (3.116) holds, (3.132) is the same as Xu's algorithm [1309, 1310], but that algorithm cannot be applied to MCA by changing the sign, while (3.133) works well for MCA.

However, the dynamics of (3.132) and (3.133) are only neutrally stable with respect to changes W δG. Douglas et al. performed detailed numerical simulations and showed that the discretized version of the algorithm does not work when R_xx is replaced by x(t) x^T(t) [412, 413, 414]. They proposed to strengthen the term that returns W to O_{m,n}.

Taking this into account, Chen and Amari proposed the following algorithms [188]:

dW/dt = µ [(R_xx W D W^T W − W D W^T R_xx W) + W (D − W^T W D)]    (3.135)

for PCA, and

dW/dt = −µ [(R_xx W D W^T W − W D W^T R_xx W) + W (D − W^T W D)]    (3.136)

for MCA.

It is interesting to note that almost all algorithms proposed so far can be derived by modifying the penalty term; see the discussion in Chen and Amari for details [188].

Remark 3.5 Let λ_0 be an upper bound of all the λ_i, that is, a constant larger than λ_1. When such a bound λ_0 is known, we can define

R̃_xx = λ_0 I − R_xx.    (3.137)

Then, the eigenvalues of R̃_xx are λ_0 − λ_1, λ_0 − λ_2, . . . , λ_0 − λ_m. Hence, by performing PCA on R̃_xx, we can easily obtain the minor components of R_xx and their eigenvectors. This was pointed out by Chen, Amari and Murata [190].

It is well known that the principal components of R_xx^{−1} correspond to the minor components of R_xx. Hence any PCA algorithm can be used for MCA if we can compute R_xx^{−1}, but this requires the extra cost of a matrix inversion.
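The spectral-shift trick of Remark 3.5 is easy to verify numerically; the sketch below uses an arbitrary test matrix and simply takes λ_0 as the largest eigenvalue plus a margin, then checks that the principal eigenvector of λ_0 I − R_xx coincides with the minor eigenvector of R_xx.

import numpy as np

# PCA on the shifted matrix R~ = lambda0*I - Rxx yields the minor components of Rxx,
# since the eigenvalue ordering is reversed while the eigenvectors are unchanged.
rng = np.random.default_rng(7)
m = 5
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
Rxx = Q @ np.diag([4.0, 3.0, 2.0, 1.0, 0.5]) @ Q.T

lam0 = np.linalg.eigvalsh(Rxx).max() + 1.0          # any upper bound on the eigenvalues
R_shift = lam0 * np.eye(m) - Rxx

vals_s, vecs_s = np.linalg.eigh(R_shift)
v_from_shift = vecs_s[:, -1]                        # principal eigenvector of the shifted matrix

vals, vecs = np.linalg.eigh(Rxx)
v_minor = vecs[:, 0]                                # minor eigenvector of Rxx
print(np.abs(v_from_shift @ v_minor))               # approx. 1.0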

