

A. The Gustafson-Kessel (GK) Model

Gustafson and Kessel (1979) proposed that the matrix A in equation (2.5) be a third variable. They put $\mathbf{A} = (A_1,\dots,A_c)$, each $A_i$ being a positive-definite $p \times p$ matrix, and modified (2.5) to

$$\min_{(U,V,\mathbf{A})}\left\{J_{GK}(U,V,\mathbf{A})=\sum_{i=1}^{c}\sum_{k=1}^{n}u_{ik}^{m}\,\|x_{k}-v_{i}\|_{A_{i}}^{2}\right\},\quad\text{subject to }\det(A_{i})=\rho_{i}.\qquad(2.25)$$

The variables estimated by the GK model are the triplet (U, V, $\mathbf{A}$), where V is still a vector of point prototypes. This model predates possibilistic partitions by some 15 years, so the weights $\{w_k\}$ in (2.5) are all zero. The important idea here is that the i-th cluster in U might be best matched by a hyperellipsoidal shape generated by the eigenstructure of the variable matrix $A_i$, much like $S_i$ does for GMD-AO at (2.21c). The additional constraint that $\det(A_i)=\rho_i>0$ guarantees that $A_i$ is positive-definite; $\rho_i$ is a user defined constant for each cluster. Gustafson and Kessel showed that minimization of $J_{GK}$ with respect to $A_i$ leads to the necessary condition

$$A_{i}=\left[\rho_{i}\det(C_{i})\right]^{1/p}C_{i}^{-1},\quad 1\le i\le c.\qquad(2.26)$$

In (2.26), $C_i$ is the fuzzy covariance matrix of cluster i,

$$C_{i}=\sum_{k=1}^{n}u_{ik}^{m}(x_{k}-v_{i})(x_{k}-v_{i})^{T}\Big/\sum_{k=1}^{n}u_{ik}^{m},\quad 1\le i\le c;\ m>1,\qquad(2.27)$$

where $v_i$ is the i-th point prototype or cluster center. In the sequel we may represent the set of fuzzy covariance matrices calculated with (2.27) as the vector $\mathbf{C}=(C_1,\dots,C_c)$. Gustafson and Kessel used $\rho_i=1$ for all i. With this choice, the fixed norm $D_{ik}^2=\|x_k-v_i\|_{A}^2$ used for the c distances from $x_k$ to the current $\{v_i\}$ during calculation of (2.7a) is replaced in the GK-AO algorithm with the c distances

$$D_{ik,GK}^{2}=\det(C_{i})^{1/p}\,\|x_{k}-v_{i}\|_{C_{i}^{-1}}^{2},\quad 1\le i\le c.\qquad(2.28)$$

When crisp covariance matrices are used, the distance measure in

(2.28) is the one suggested by Sebestyen (1962). This distance measure was also used by Diday (1971), Diday et al. (1974) and Diday and Simon (1976) in their adaptive distance dynamic clusters (ADDC) algorithm. Thus, the GK-AO algorithm can be viewed as the fuzzification of ADDC, and may be regarded as the first (locally) adaptive fuzzy clustering algorithm.

For AO optimization of $J_{GK}(U,V,\mathbf{A})$, the partition U and centers $\{v_i\}$ are updated using (2.7a, b) as in FCM-AO, and the covariance matrices are updated with (2.26). The GK-AO algorithm is more sensitive to initialization than FCM-AO because its search space is much larger. Typically, FCM-AO is used to provide a reasonably good initialization for this algorithm. Experimental evidence indicates that $1 < m < 2$ gives good results, with m = 1.5 often being the recommended value.
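To make the update cycle concrete, here is a minimal sketch of one GK-AO pass in Python/NumPy, written from equations (2.7a, b) and (2.26)–(2.28) with the usual choice $\rho_i=1$. It is an illustration, not the authors' implementation; the function name, the regularization guard, and the vectorized layout are ours.

```python
import numpy as np

def gk_ao_step(X, U, m=1.5, rho=None, reg=1e-10):
    """One alternating-optimization pass of Gustafson-Kessel clustering.

    X : (n, p) data; U : (c, n) fuzzy partition whose columns sum to 1.
    Implements v_i from (2.7b), C_i from (2.27), the GK distance (2.28),
    and the membership update (2.7a)."""
    c, n = U.shape
    p = X.shape[1]
    rho = np.ones(c) if rho is None else rho
    Um = U ** m

    # (2.7b): point prototypes are fuzzy-weighted means of the data
    V = (Um @ X) / Um.sum(axis=1, keepdims=True)

    D2 = np.empty((c, n))
    for i in range(c):
        R = X - V[i]                                  # residuals x_k - v_i
        # (2.27): fuzzy covariance matrix of cluster i
        C = (Um[i, :, None] * R).T @ R / Um[i].sum()
        C += reg * np.eye(p)                          # guard against singularity
        # (2.26)/(2.28): D^2 = [rho_i det(C_i)]^(1/p) (x-v)^T C_i^{-1} (x-v)
        scale = (rho[i] * np.linalg.det(C)) ** (1.0 / p)
        D2[i] = scale * np.einsum('nj,jk,nk->n', R, np.linalg.inv(C), R)

    # (2.7a): u_ik = 1 / sum_j (D_ik / D_jk)^(2/(m-1))
    D2 = np.fmax(D2, 1e-30)
    ratio = (D2[:, None, :] / D2[None, :, :]) ** (1.0 / (m - 1))
    return 1.0 / ratio.sum(axis=1), V
```

In practice one alternates gk_ao_step until $\|V_t - V_{t-1}\| < \varepsilon$, initializing U with a few FCM-AO passes as suggested above.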

Example 2.5 Figure 2.5 shows two data sets that were processed with the FCM-AO, GK-AO, and GMD-AO algorithms. The parameter m was set at 2.0 for FCM-AO, and 1.5 for the GK-AO algorithm. All runs were initialized with the first two points in the left views ($v_{1,0} = (30, 35)^T$, $v_{2,0} = (42, 45)^T$) or the first three points in the right views ($v_{1,0} = (21, 104)^T$, $v_{2,0} = (22, 101)^T$ and $v_{3,0} = (22, 104)^T$). The termination criterion was $E_t = \|V_t - V_{t-1}\| < \varepsilon = 0.001$. The Euclidean norm was used for $J_m$. The left side of Figure 2.5 contains points drawn from a mixture of c = 2 fairly circular Gaussians. The clusters on the right are drawn from a mixture of c = 3 bivariate normals, one of which (in the upper right portion of each view) has a covariance structure that tends toward linear correlation between x and y. The three clusters on the right exhibit visually different shapes, so we expect GK-AO and GMD-AO to use their localized adaptivity to find these clouds more accurately than FCM-AO, which has a fixed norm-inducing weight matrix.

Terminal partitions hardened with (2.10) are shown in Figure 2.5 by assigning different symbols to the crisp clusters. The shaded areas in views a, b, d, e and f correspond to the points that, when compared with the labels of the samples drawn, were labeled incorrectly. For the data on the left, FCM-AO tends to divide the data into circular (because the norm is Euclidean) clusters of roughly equal size (the problem illustrated in Figure 2.3(a)). The GK-AO result in view 2.5(b) shows an even stronger encroachment of the right cluster into the

left one. View 2.5(c) shows that GMD-AO labels every point in the two-cluster data correctly; it reproduces the a priori labels flawlessly.

[Figure 2.5 Hardened partitions for 2 sets of Gaussian clusters. Panels: (a) FCM, (b) GK, (c) GMD on the left (two-cluster) data; (d) FCM, (e) GK, (f) GMD on the right (three-cluster) data.]

The GMD-AO model also produces visually better results with the three clusters on the right side of Figure 2.5. Here FCM-AO draws four points up towards the linear cluster from the centrally located large group, and loses three points from the linear cluster to the lower cluster on the right (7 mistakes). GK-AO draws three points to the left from the lower right hand cluster and loses one point to the linear cluster (4 mistakes). GMD-AO reproduces the a priori labels almost flawlessly, losing just one point from the bottom right cluster to the linear cluster (1 mistake).

Visually, GMD gives much better results than FCM or GK for both data sets in Figure 2.5. Is GMD generally superior? Well, if the data really are from a mixture of normals, the GMD model matches the data better than the other two models. But if the geometry of the data does not fit the pattern expected for draws from a mixture of normals very well, GMD does not produce better results than other models. Moreover, (2.28) reduces to the Euclidean distance when $C_i = \sigma_i^2 I$. If this is true for all $i = 1,\dots,c$, the behaviors of GK and FCM are very similar.

Bezdek and Dunn (1975) studied the efficacy of replacing the GMD-AO parameters (P, M) with terminal (U, V)'s from FCM, and then calculating the remaining MLEs of the components (the priors and covariance matrices) of normal mixtures non-iteratively. Hathaway and Bezdek (1986b) proved that this strategy could not produce correct MLEs for (P, M) in even the univariate case.

Gath and Geva (1989a) discuss an algorithm they called fuzzy maximum likelihood estimation (FMLE). Specifically, they used the fuzzy covariance matrix $C_i$ at (2.27) with m = 1 (this does not mean or require that the partition matrix U is crisp) to define an exponential distance

$$D_{ik,GG}^{2}=\frac{\sqrt{\det(C_{i})}}{p_{i}}\,e^{(x_{k}-v_{i})^{T}C_{i}^{-1}(x_{k}-v_{i})/2},$$

where $p_i$ is the estimate of the prior probability of class i shown in (2.21a). This distance was then used in FCM formula (2.7a) with m = 2, resulting in the memberships $u_{ik}=\left(\sum_{j=1}^{c}\left(D_{ik,GG}/D_{jk,GG}\right)^{2}\right)^{-1}$, which were taken as estimates for the posterior probabilities in equation (2.22). It is not hard to verify that this updating scheme for $u_{ik}$ is identical to $p_{ik}$ in (2.22), and that the update equations for FMLE are identical to those for GMD-AO. Thus, FMLE is essentially equivalent to GMD-AO with the posteriors $\{p_{ik}\}$ interpreted as fuzzy memberships. We will illustrate FMLE in Section 2.4 in conjunction with several measures of cluster validity defined in Gath and Geva (1989a).
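A minimal sketch of this exponential distance, assuming the cluster's fuzzy covariance matrix, prototype, and prior estimate are already in hand (the function name is ours):

```python
import numpy as np

def gg_distance_sq(X, v, C, prior):
    """Gath-Geva exponential distance for one cluster:
    sqrt(det(C_i)) / p_i * exp((x_k - v_i)^T C_i^{-1} (x_k - v_i) / 2)."""
    R = X - v                                        # (n, p) residuals
    quad = np.einsum('nj,jk,nk->n', R, np.linalg.inv(C), R)
    return np.sqrt(np.linalg.det(C)) / prior * np.exp(quad / 2.0)
```

Feeding these distances to (2.7a) with m = 2 then yields $u_{ik}=\left(\sum_j D_{ik,GG}^2/D_{jk,GG}^2\right)^{-1}$, the posterior estimates discussed above.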

Although the GK algorithm was developed to detect ellipsoidal clusters, and both it and GMD-AO are used for that purpose, lines and planes can be viewed as extremely elongated or flat ellipsoids, so these two models can also be used to detect lines and planes. Other algorithms that generate prototypes of this kind are described in the next subsection.

Chapter 5 contains a more detailed discussion of how the clustering algorithms described in this subsection can be used for boundary description.

B. Linear manifolds as prototypes

The earliest reference to the explicit use of non-point prototypes in connection with generalizations of FCM was Bezdek et al. (1978).

These authors discussed a primitive method for fitting fuzzy clusters with lines in the plane. The fuzzy c-varieties (FCV) models (Bezdek et al. 1981a,b) grew out of this effort, and were the first generalizations of FCM that explicitly used many kinds of non-point prototypes.

FCV uses r-dimensional linear varieties, $0\le r\le p-1$, as prototypes in (2.24a). This model predates possibilistic partitions, so the weights $\{w_k\}$ in (2.24) are zero for the FCV objective function. The linear variety (or manifold) of dimension r through the point $v_i\in\Re^p$, spanned by the linearly independent vectors $\{b_{i1},b_{i2},\dots,b_{ir}\}\subset\Re^p$, is

$$L_{ri}=\left\{y\in\Re^{p}\;\Big|\;y=v_{i}+\sum_{j=1}^{r}t_{j}b_{ij};\ t_{j}\in\Re\right\},\qquad(2.29)$$

so $\beta_i=\{v_i,b_{i1},\dots,b_{ir}\}$ are the parameters of $L_{ri}$. These prototypes can be thought of as "flat" sets in $\Re^p$; dimension r is the number of directions in which the flatness extends. FCV uses the perpendicular distance from $x_k$ to $L_{ri}$ as the distance measure in (2.24a). When the $\{b_{ij}\}$ are an orthonormal basis for their span, the orthogonal projection theorem yields

$$D_{rik}^{2}=\|x_{k}-v_{i}\|_{A}^{2}-\sum_{j=1}^{r}\langle x_{k}-v_{i},b_{ij}\rangle_{A}^{2}.\qquad(2.30)$$

$D_{rik}^2$ is just the A-norm length of $(x_k-v_i)$ minus the A-norm length of its unique best approximation by a vector in the span of the $\{b_{ij}: j=1,\dots,r\}$. When r = 0, equation (2.30) reduces to $D_{ik}^2=\|x_k-v_i\|_A^2$ as

used in (2.5), so for r = 0, FCV reduces to FCM. For r = 1, FCV becomes fuzzy c-lines (FCL); for r = 2, fuzzy c-planes (FCP); and so on up to r = p-1, fuzzy c-hyperplanes (FCHP). In the FCV-AO algorithms derived to optimize the FCV model, the fuzzy c-partition matrix U and the centers $\{v_i\}$ are updated with FCM formulae (2.7a, b), except that the squared distance in (2.30) is used in place of $D_{ik}^2=\|x_k-v_i\|_A^2$. First order necessary conditions for minimizing the FCV functional now include the spanning vectors $\{b_{ij}\}$, which are updated at each iteration by finding

$$b_{ij}=\text{the j-th unit eigenvector of }C_i,\quad j=1,2,\dots,r,\qquad(2.31)$$

where the $\{b_{ij}\}$ are arranged in the same order as the descending eigenvalues of $C_i$, and $C_i$ is the fuzzy covariance matrix of cluster i given by (2.27). For r > 0 it is of course necessary to compute the eigenvalues and eigenvectors of the fuzzy covariance matrices at each pass through FCV-AO. This is usually done with singular value decomposition, and makes FCV and its relatives more computationally intense than the simpler point prototype models.
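The computation in (2.30)–(2.31) is easy to sketch. The following illustration (our names, and assuming the Euclidean norm A = I) extracts the top-r unit eigenvectors of a fuzzy covariance matrix and subtracts the squared projections from the squared point distance:

```python
import numpy as np

def fcv_distance_sq(X, v, C, r):
    """Perpendicular distance (2.30) from each row of X to the
    r-dimensional linear variety through v spanned by the top-r unit
    eigenvectors of C, per (2.31). Assumes the Euclidean norm (A = I)."""
    _, evecs = np.linalg.eigh(C)            # eigenvalues in ascending order
    B = evecs[:, ::-1][:, :r]               # (p, r): eigenvectors, descending
    R = X - v                               # (n, p) residuals
    return (R ** 2).sum(axis=1) - ((R @ B) ** 2).sum(axis=1)
```

With r = 0 the projection term vanishes and the function returns the plain squared Euclidean distance, recovering FCM as noted above.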

Since FCV-AO uses perpendicular distance to the linear varieties, it does not take into account the extent (i.e., length, area, etc.) of the flat clusters being sought. For example, if r = 1, FCV seeks (infinitely long) lines, and thus can lump approximately collinear clusters together, even when they are very far apart (see Figure 24.1 in Bezdek (1981)). One solution to this problem is to choose a distance measure given by

$$D^{2}=\alpha\,D_{\text{lines}}^{2}+(1-\alpha)\,D_{\text{points}}^{2};\qquad 0\le\alpha\le 1,\qquad(2.32)$$

which is a convex combination of the perpendicular distance from $x_k$ to $L_{1i}$ and the point distance from $x_k$ to $v_i$. See Figure 4.50 for a geometric interpretation of the distance in equation (2.32). Parameter $\alpha$ can vary from 0 for spherical clusters (having point prototypes) to 1 for linear clusters (having line prototypes).

The algorithm resulting from first order necessary conditions for $J(U, B)$ with distance (2.32) is called the fuzzy c-elliptotypes (FCE-AO) algorithm (Bezdek et al., 1981b). More generally, AO algorithms to optimize any convex combination of FCV terms with dimensions $\{r_i\}$ and convex weights $\{\alpha_i\}$ were derived by Bezdek et al. (1981a, b).
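As a sketch, the elliptotype distance (2.32) is a one-line blend of the r = 1 line distance with the point distance, reusing the fcv_distance_sq illustration above (again, our names and the Euclidean norm):

```python
def fce_distance_sq(X, v, C, alpha):
    """Elliptotype distance (2.32): convex combination of the r = 1
    perpendicular (line) distance and the point distance to v."""
    d2_points = ((X - v) ** 2).sum(axis=1)
    d2_lines = fcv_distance_sq(X, v, C, r=1)       # from the sketch above
    return alpha * d2_lines + (1.0 - alpha) * d2_points
```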

The purpose of this is to allow the clusters to have shapes built from convex combinations of flat shapes. However, the actual prototypes

from the convex combinations model are not easily recognizable geometric entities such as ellipses; rather, they are mathematical entities described in terms of level sets of certain functions.

While the parameters V and A in the GK model can be jointly viewed as "generalized" prototypes, FCV was the first generalization of FCM that explicitly used non-point prototypes for B. The FCV algorithms and the particular convex combination FCE have found various applications over the years (Jacobsen and Gunderson, 1983; Gunderson and Thrane, 1985; Yoshinari et al., 1993). However, a rough idea of the shape of the clusters in the data set must be known a priori (which is impossible for p > 3) to select proper values for the dimensions $\{r_i\}$ and convex weights $\{\alpha_i\}$. An important exception is rule extraction for function approximation in fuzzy input-output systems. FCE seems well suited to this problem because the input-output space is often $\Re^3$, and linear Takagi-Sugeno (1985) input-output functions can be fitted quite well with FCE (Runkler and Palm, 1996; Runkler and Bezdek, 1998c; and Example 4.17).

Adaptive fuzzy c-elliptotypes (AFCE). Perhaps the biggest drawback of FCV and convex combinations like FCE is that these models find c clusters with prototypical "shapes" that are all the same. The reason for this is that FCV uses the same real dimension (r) and its convex combinations all use the same "mixture of dimensions" for all c clusters, so cluster substructure having these characteristics is imposed on the data whether they possess it or not. This problem resulted in the first locally adaptive fuzzy clustering method (the GK model), and the next generation of locally adaptive clustering methods followed rapidly on the heels of the FCV models.

There are a number of ways to make FCV adaptive. The earliest scheme for local adaptation in the FCV models was due to Anderson et al. (1982). They suggested that the value of $\alpha$ used in convex combinations of the FCV functionals should be different for each cluster, reflecting a customized distance measure that best represents the shape of each cluster. When convex combinations are used, there is no dimensionality of prototypes. (We remind you that it is the distances in the FCV objective function that become convex combinations in Bezdek et al. (1981a, b), and not the fitting prototypes. The fitting prototypes in AFCE, as in FCE, are no longer recognizable geometric entities.) The basic idea in FCE is to mediate between geometric needs for point prototypes (central tendencies) and varietal structure (shape or dispersions). But convex combinations of FCV such as FCE fix the amount by which each factor contributes to the overall representation of all c clusters.

Anderson et al. (1982) regulated each cluster through the shape information possessed by the eigenstructure of its fuzzy covariance

matrix. Adaptation is with respect to the convex weights in (2.32) used for each cluster. For $X\subset\Re^2$, the modification of FCE to AFCE is

$$\alpha_{i}=1-\frac{\lambda_{i,\min}}{\lambda_{i,\max}},\quad i=1,2,\dots,c,\qquad(2.33)$$

where $\lambda_{i,\max}$ is the larger and $\lambda_{i,\min}$ the smaller eigenvalue of the $2 \times 2$ fuzzy covariance matrix $C_i$ of cluster i. Equation (2.33) covers only the 2D case. Extensions to higher dimensions may be found in Phansalkar and Dave (1997) and Kim (1997). The AFCE-AO algorithms are exactly the same as the FCE-AO methods just described, except that the convex weights in (2.32) are updated at each iteration with (2.33).
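A sketch of the 2D adaptation rule (2.33), with our function name and NumPy as above:

```python
import numpy as np

def afce_alpha(C):
    """Adaptive convex weight (2.33) from a 2 x 2 fuzzy covariance
    matrix: alpha -> 1 for elongated (linear) clusters, since
    lambda_min / lambda_max -> 0, and alpha -> 0 for circular ones."""
    evals = np.linalg.eigvalsh(C)           # ascending: [lambda_min, lambda_max]
    return 1.0 - evals[0] / evals[1]
```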

Example 2.6 Figure 2.6 shows the results of clustering two data sets with FCL-AO, AFCE-AO and GK-AO. Each of these models has a different kind of prototype (lines, elliptotypes and points, respectively); all three are configured for possible success with data of this kind. FCL, however, is more rigid than the other two because it does not have a feature that enables localized adaptation. The left panel depicts three intersecting noisy linear clusters of different sizes, and the right side shows three noisy linear clusters, two of which are collinear.

Run time protocols for this example were as follows. The covariance matrices for all three methods were initialized with the (U, V) output of the fifth iteration of FCM-AO with m = 2, c = 3, using the Euclidean norm. FCM-AO was itself initialized with the first 3 points in each data set ($v_{1,0} = (80, 81)^T$, $v_{2,0} = (84, 84)^T$ and $v_{3,0} = (87, 89)^T$ in the left views; $v_{1,0} = (10, 11)^T$, $v_{2,0} = (11, 189)^T$ and $v_{3,0} = (14, 14)^T$ in the right views). Termination of all three methods by either $\|V_{t+1}-V_t\| \le 0.001$ or $\|U_{t+1}-U_t\| \le 0.01$ yielded the same results. FCV and AFCE both used m = 2, and GK used m = 1.5 (our experience is that GK does much better with a value near 1.5 than with a value near 2).

The results shown are terminal partitions hardened with (2.10), each cluster identified by a different symbol. In the collinear situation of the right hand views, FCL finds two almost coincident clusters, and the points belonging to these two clusters are arbitrarily assigned to the two prototypes in view 2.6(d). The AFCE result in 2.6(f) is much better, having just two squares that are erroneously grouped with the dots. GK makes perfect assignments, as shown in Figure 2.6(e). Terminal values of $\alpha$ for AFCE were very nearly 1 for both data sets.

[Figure 2.6 Detection of linear clusters with FCL, GK and AFCE. Panels: (a) FCL, (b) GK, (c) AFCE on the intersecting lines; (d) FCL, (e) GK, (f) AFCE on the collinear data.]

For the three well separated lines (the right views in Figure 2.6), all three values were 0.999; for the intersecting lines in the left views of Figure 2.6, the three terminal values of $\alpha$ were 0.999, 0.997 and

0.999. These are the expected results, since the clusters in both data sets are essentially linear, so the ratio of eigenvalues in (2.33) is essentially zero.

The tendency of FCV to disregard compactness is seen in the cluster denoted by the six "+" signs in panel 2.6(a). Here the pluses and dots are clearly interspersed incorrectly. For this data, both GK and AFCE produce flawless results. One possible explanation for this is that the FCV functional is more susceptible to being trapped by local minima, since it cannot adapt locally like GK and AFCE.

AFCE is called AFC (adaptive fuzzy clustering) in many of the later papers on this topic, especially those of Dave (1989a, 1990a). Because several other adaptive schemes discussed in this chapter are not based on FCE, we prefer to call this method AFCE. Dave and Patel (1990) considered the problem of discovering the unknown number of clusters. They proposed progressive removal of clusters that are good fits to subsets of the data. This idea was further developed for lines and planes in Krishnapuram and Freg (1992).

Adaptive fuzzy c-varieties (AFCV). Gunderson (1983) introduced a heuristic way to make the integer dimension $r_i$ of the fitting prototype for class i independent of the dimensions used by the other clusters sought in the data. His adaptive fuzzy c-varieties (AFCV) scheme is based on the eigenstructure of the fuzzy covariance matrices $\{C_i\}$ at (2.27) that are part of the necessary conditions for extrema of the FCV functional.

Gunderson observed that the distance calculations made in the necessary conditions for $U_i$, the i-th row of the partition matrix U shown at (2.7a), are independent of how the distances themselves are computed - that is, (2.7a) does not care what value of r is used in equation (2.30). He reasoned that making a heuristic adjustment to the optimality conditions by allowing a different $D_{rik}^2$ to be used in (2.7a) for each i might enable FCV to seek manifolds of different dimensions for the various clusters. A second modification of the necessary conditions was to introduce a non-convex weight d into distance equation (2.30) as follows:

$$D_{r_{i}ik}^{2}=\|x_{k}-v_{i}\|_{A}^{2}-d\left[\sum_{j=1}^{r_{i}}\langle x_{k}-v_{i},b_{ij}\rangle_{A}^{2}\right].\qquad(2.34)$$

The user defined parameter d in (2.34) essentially controls the importance of the non-point or r > 0 part of the distance calculation for each cluster, and not much guidance is given about its selection.

Gunderson's modification of FCV also calls for the selection of (p-1)

shaping coefficients $\{a_r : 1\le r\le p-1\}$, which are compared to ratios of eigenvalues from the fuzzy scatter matrices $\{C_i\}$ at (2.27). In particular, if $\lambda_{ip}\le\lambda_{i,p-1}\le\dots\le\lambda_{i1}$ are the ordered eigenvalues of $C_i$, Gunderson adapts the dimension of each FCV prototype during iteration as follows: if there exists a least integer k, k = 1, 2, ..., p-1, so that $(\lambda_{i,k+1}/\lambda_{i,k}) < a_k$, $1\le i\le c$, set $r_i = k$; otherwise, set $r_i = 0$. The parameter $a_k$ is also user defined and, like many other algorithmic parameters, is fine tuned during iteration to secure the solution most acceptable to the user. Then $U_i$ is updated with $r = r_i$ in (2.30). These two changes are analogous to the modifications of FCM that Bensaid et al. (1996a) used to create ssFCM: the resultant algorithm no longer attempts to solve a well-posed optimization problem.
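As a sketch of this dimension-selection heuristic (our names; a[k-1] holds the shaping coefficient $a_k$), the eigenvalue-ratio test for one cluster reads:

```python
import numpy as np

def afcv_dimension(C, a):
    """Gunderson's heuristic dimension rule: return the least k with
    lambda_{i,k+1} / lambda_{i,k} < a_k, else 0, where the lambdas are
    the descending eigenvalues of the fuzzy covariance matrix C."""
    lam = np.sort(np.linalg.eigvalsh(C))[::-1]   # descending eigenvalues
    for k in range(1, len(lam)):                 # k = 1, ..., p-1
        if lam[k] / lam[k - 1] < a[k - 1]:       # ratio test against a_k
            return k                             # set r_i = k
    return 0                                     # otherwise r_i = 0
```

The weighted distance (2.34) would then be evaluated with $r = r_i$ for each cluster before the membership update (2.7a).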

Example 2.7 Figure 2.7 is adapted from Figure 5 in Gunderson (1983).

Figure 2.7 shows the output obtained by applying Gunderson's adaptive FCV to a data set in $\Re^2$ that contains four visually apparent clusters. The two upper clusters are roughly circular, cloud-type structures, while the two lower ones are elongated, linear structures.

Using the Euclidean norm, c = 4 and m = 1.75 in (2.24), and d = 0.95 in (2.34), Gunderson's algorithm iteratively chooses $r_1 = r_2 = 0$, so that the cloud shaped clusters are represented, respectively, by the point prototypes $v_1$ and $v_2$ shown in Figure 2.7; and it settles on $r_3 = r_4 = 1$, so that the linear clusters have prototypes shown as the lines $L_{13}$ and $L_{14}$ in Figure 2.7. The value of $a_k$ is not specified.

Summarizing, Gunderson's method makes FCV adaptive with respect to the dimensions $\{r_i\}$ of the linear varieties $\{L_{ri}\}$. Different clusters are allowed to have representation as linear manifolds of possibly different dimensions. In contrast, the adaptive GK model