
2 Cluster Analysis for Object Data

A. The c-means models

The c-means (or k-means) families are the best known and most well developed families of batch clustering models. Why? Probably because they are least squares models. The history of this methodology is long and important in applied mathematics because least-squares models have many favorable mathematical properties. (Bell (1966, p. 259) credits Gauss with the invention of the method of least squares for parameter estimation in 1802, but states that Legendre apparently published the first formal exposition of it in 1806.) The optimization problem that defines the hard (H), fuzzy (F) and possibilistic (P) c-means (HCM, FCM and PCM, respectively) models is:

$$
\min_{(U,V)} \; J_m(U,V;w) \;=\; \sum_{i=1}^{c}\sum_{k=1}^{n} u_{ik}^{m}\,D_{ik}^{2} \;+\; \sum_{i=1}^{c} w_i \sum_{k=1}^{n} \left(1-u_{ik}\right)^{m}, \quad \text{where} \tag{2.5}
$$

$U \in M_{hcn}$, $M_{fcn}$ or $M_{pcn}$ for HCM, FCM or PCM respectively;
$V = (v_1, v_2, \ldots, v_c) \in \Re^{cp}$; $v_i \in \Re^{p}$ is the i-th point prototype;
$w = (w_1, w_2, \ldots, w_c)^T$; $w_i \in \Re^{+}$ is the i-th penalty term (PCM);
$m > 1$ is the degree of fuzzification;
$D_{ik}^{2} = \|x_k - v_i\|_{A}^{2}$.

Note especially that w in (2.5) is a fixed, user-specified vector of positive weights; it is not part of the variable set in minimization problem (2.5).
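For readers who prefer code to notation, the following sketch evaluates the objective in (2.5) for a given partition and set of prototypes; the Euclidean choice of $D_{ik}$ and the array names (X, U, V, w) are assumptions made for illustration, not part of the original formulation.

```python
import numpy as np

def jm_objective(X, U, V, w, m=2.0):
    """Evaluate the c-means objective (2.5) for data X (n x p), partition U (c x n),
    prototypes V (c x p), penalty weights w (c,), and fuzzifier m.
    Setting w to all zeros leaves only the HCM/FCM part of the objective."""
    # Squared distances D_ik^2 = ||x_k - v_i||^2 (Euclidean norm assumed here)
    D2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)   # shape (c, n)
    term1 = (U ** m * D2).sum()                               # sum_i sum_k u_ik^m D_ik^2
    term2 = (w[:, None] * (1.0 - U) ** m).sum()               # sum_i w_i sum_k (1 - u_ik)^m
    return term1 + term2
```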

Caveat: model optima versus human expectations. The presumption in (2.5) is that "good" solutions for a pattern recognition problem - here clustering - correspond to "good" solutions of a mathematical optimization problem chosen to represent the physical process. Readers are warned not to expect too much from their models. In (2.5) the implicit assumption is that pairs (U, V) that are at least local minima for $J_m$ will provide (i) good clusters U, and (ii) good prototypes V to represent those clusters.

What's wrong with this? Well, it's easy to construct a simple data set upon which the global minimum of J leads to algorithmically suggested substructure that humans will disagree with (example 2.3).

The problem? Mathematical models have a very rigid, well-defined idea of what "best" means, and it is often quite different from that held by human evaluators. There may not be any relationship between clusters that humans regard as "good" and the various types of extrema of any objective function. Keep this in mind as you try to understand your disappointment about the terrible clusters your favorite algorithm just found.

The HCM, FCM and PCM clustering models are summarized in Table 2.1. Problem (2.5) is well defined for any distance function on $\Re^p$. The method chosen to approximate solutions of (2.5) depends primarily on $D_{ik}$. $J_m$ is differentiable in U unless it is crisp, so first order necessary conditions for U are readily obtainable. If $D_{ik}$ is differentiable in V (e.g., whenever $D_{ik}$ is an inner product norm), the most popular technique for solving (2.5) is grouped coordinate descent (or alternating optimization (AO)).

Table 2.1 Optimizing $J_m(U,V;w)$ when $D_{ik}^2 = \|x_k - v_i\|_A^2$: first order necessary conditions for (U, V) when $D_{ik} = \|x_k - v_i\|_A > 0$ ∀ i, k (inner product norm case only)

HCM (minimize $J_1$):
  $u_{ik} = 1$ if $D_{ik} \le D_{jk}$ ∀ j, $u_{ik} = 0$ otherwise  (2.6a)
  $v_i = \sum_{k=1}^{n} u_{ik}\, x_k \Big/ \sum_{k=1}^{n} u_{ik} = \bar{v}_i$  (2.6b)

FCM (minimize $J_m$, m > 1):
  $u_{ik} = \left( \sum_{j=1}^{c} \left( D_{ik} / D_{jk} \right)^{2/(m-1)} \right)^{-1}$  (2.7a)
  $v_i = \sum_{k=1}^{n} u_{ik}^m\, x_k \Big/ \sum_{k=1}^{n} u_{ik}^m$  (2.7b)

PCM (minimize $J_m$, m > 1, $w_i > 0$):
  $u_{ik} = \left( 1 + \left( D_{ik}^2 / w_i \right)^{1/(m-1)} \right)^{-1}$  (2.8a)
  $v_i = \sum_{k=1}^{n} u_{ik}^m\, x_k \Big/ \sum_{k=1}^{n} u_{ik}^m$  (2.8b)

Table 2.1 lists the first order necessary conditions that each model requires at extreme points of its functional when the distance measure in (2.5) is an inner product norm metric. Derivations of the FCM and PCM conditions are made by zeroing the gradient of $J_m$ with respect to V, and the gradient of (one term of) the Lagrangian of $J_m$ with respect to U. The details for HCM and FCM can be found in Bezdek (1981), and for PCM, see Krishnapuram and Keller (1993).
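To make the conditions in Table 2.1 concrete, here is a minimal NumPy sketch of the HCM updates (2.6) and the PCM membership formula (2.8a), assuming Euclidean distance (A = I); the function names and array conventions (U as a c × n array, X as n × p) are illustrative, not taken from the text.

```python
import numpy as np

def hcm_update(X, V):
    """HCM necessary conditions (2.6): crisp nearest-prototype memberships,
    then prototypes as cluster means. Euclidean distance assumed; ties in
    (2.6a) are broken by argmin's first-index convention, and every cluster
    is assumed to be non-empty."""
    D2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)   # (c, n)
    U = np.zeros_like(D2)
    U[D2.argmin(axis=0), np.arange(X.shape[0])] = 1.0         # (2.6a)
    V_new = (U @ X) / U.sum(axis=1, keepdims=True)            # (2.6b): cluster means
    return U, V_new

def pcm_memberships(D2, w, m=2.0):
    """PCM memberships (2.8a): u_ik = 1 / (1 + (D_ik^2 / w_i)^(1/(m-1))).
    D2 is a (c, n) array of squared distances; each row is computed
    independently of the others."""
    return 1.0 / (1.0 + (D2 / w[:, None]) ** (1.0 / (m - 1.0)))
```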

The second form, $\bar{v}_i$ for $v_i$ in (2.6b), emphasizes that optimal HCM-AO prototypes are simply the mean vectors or centroids of the points in crisp cluster i, $n_i = |U_{(i)}|$, where $U_{(i)}$ is the i-th row of U. Conditions (2.7) converge to (2.6) and $J_m \to J_1$ as $m \to 1$ from above. At the other extreme for (2.7), $\lim_{m \to \infty} \{u_{ik}\} = 1/c$ ∀ i, k as m increases without bound, and $\lim_{m \to \infty} \{v_i\} = \bar{v} = \sum_{k=1}^{n} x_k / n$ ∀ i (Bezdek, 1981).

Computational singularity for U in HCM-AO is manifested as a tie in (2.6a), and may be resolved arbitrarily by assigning the membership in question to any one of the clusters that achieves the minimum distance. Singularity for U in FCM-AO occurs when one or more of the distances $D_{ik}^2 = 0$ at any iterate. In this case (rare in practice), (2.7a) cannot be calculated. When this happens, assign 0's to each non-singular class, and distribute positive memberships among the singular classes arbitrarily, subject to the constraint $\sum_{i=1}^{c} u_{ik} = 1$. As long as the $w_i$'s are positive (which they are by user specification), PCM-AO cannot experience this difficulty. Constraints on the $\{u_{ik}\}$ are enforced by the necessary conditions in Table 2.1, so the denominators for computing each $v_i$ are always positive.
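A minimal sketch of the FCM membership computation (2.7a) with this singular case handled explicitly, assuming squared Euclidean distances stored in a c × n NumPy array; splitting membership equally among the singular clusters is one of the arbitrary resolutions the constraint permits.

```python
import numpy as np

def fcm_memberships_safe(D2, m=2.0):
    """FCM membership update (2.7a) with the singular case handled:
    if some D_ik^2 = 0 for point k, that point gets membership 0 in every
    non-singular cluster and its membership is split over the singular
    clusters so the column still sums to 1. D2 is (c, n) squared distances."""
    c, n = D2.shape
    U = np.zeros((c, n))
    expo = 1.0 / (m - 1.0)          # (D_ik/D_jk)^(2/(m-1)) = (D2_ik/D2_jk)^(1/(m-1))
    for k in range(n):
        zero = D2[:, k] == 0.0
        if zero.any():
            U[zero, k] = 1.0 / zero.sum()   # arbitrary split among singular clusters
        else:
            ratio = (D2[:, k][:, None] / D2[:, k][None, :]) ** expo
            U[:, k] = 1.0 / ratio.sum(axis=1)
    return U
```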

Table 2.2 specifies the c-means AO algorithms based on the necessary conditions in Table 2.1 for the inner product norm case.

The case shown in Table 2.2 corresponds to (2.4b), initialization and termination on cluster centers V. The rule of thumb $c \le \sqrt{n}$ in the second line of Table 2.2 can produce a large upper bound for c. For example, this rule, when applied to clustering pixel vectors in a 256 × 256 image where n = 65,536, suggests that we might look for c = 256 clusters. This is done, for example, in image compression, but for segmentation, the largest value of c that might make sense is more like 20 or 30. In most cases, a reasonable choice for c can be made based on auxiliary information about the problem. For example, segmentation of magnetic resonance images of the brain requires at most c = 8 to 10 clusters, as the brain contains no more than 8-10 tissue classes.

All three algorithms can get stuck at undesirable terminal estimates by initializing with cluster centers (or, equivalently, rows of U) that have the same values, because $U_{(i)}$ and $v_i$ are functions of just each other. Consequently, identical rows (and their prototypes) will remain identical unless computational roundoff forces them to become different. However, this is easily avoided, and should never present a problem.

Table 2.2 The HCM/FCM/PCM-AO algorithms

Store    Unlabeled object data $X \subset \Re^p$

Pick
  • number of clusters: $1 < c < n$
  • maximum number of iterations: T
  • weighting exponent: $1 < m < \infty$ (m = 1 for HCM-AO)
  • inner product norm for $J_m$: $\|x\|_A^2 = x^T A x$
  • termination measure: $E_t = \|V_t - V_{t-1}\|$ = big value
  • termination threshold: $0 < \varepsilon$ = small value
  • weights $w_i > 0$ ∀ i ($w_i = 0$ for FCM-AO/HCM-AO)

Guess    Initial prototypes: $V_0 = (v_{1,0}, \ldots, v_{c,0}) \in \Re^{cp}$ (2.4b)

Iterate    For t = 1, ..., T: compute $U_t$ from $V_{t-1}$ with (2.6a), (2.7a) or (2.8a); compute $V_t$ from $U_t$ with (2.6b), (2.7b) or (2.8b); stop when $E_t = \|V_t - V_{t-1}\| \le \varepsilon$ or t = T.
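The following sketch strings the Table 2.2 ingredients together for FCM-AO, assuming Euclidean distance (A = I) and initial prototypes drawn from the data; the function name, defaults, and the small-epsilon guard on zero distances are illustrative choices rather than prescriptions from the text.

```python
import numpy as np

def fcm_ao(X, c, m=2.0, T=100, eps=1e-2, seed=0):
    """FCM-AO per Table 2.2: pick c, m, T and a termination threshold eps,
    guess initial prototypes V0, then alternate the membership update (2.7a)
    and prototype update (2.7b) until E_t = ||V_t - V_{t-1}|| <= eps or
    T iterations are reached. Euclidean distance (A = I) is assumed."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    V = X[rng.choice(n, size=c, replace=False)]        # initial prototypes from the data
    for t in range(T):
        D2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)   # (c, n)
        D2 = np.fmax(D2, np.finfo(float).eps)          # guard the singular case D_ik = 0
        U = 1.0 / ((D2[:, None, :] / D2[None, :, :]) ** (1.0 / (m - 1.0))).sum(axis=1)
        Um = U ** m
        V_new = (Um @ X) / Um.sum(axis=1, keepdims=True)
        E_t = np.linalg.norm(V_new - V)                # termination measure
        V = V_new
        if E_t <= eps:
            break
    return U, V
```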

The rows of U are completely decoupled in PCM because there is no constraint that $\sum_{i=1}^{c} u_{ik} = 1$. This can be an advantage in noisy data sets, since noise points and outliers can be assigned low memberships in all clusters. On the other hand, removal of the constraint that $\sum_{i=1}^{c} u_{ik} = 1$ also means that PCM-AO has a higher tendency to produce identical rows in U unless the initial prototypes are sufficiently distinct and the specified weights $\{w_i\}$ are estimated reasonably correctly (Barni et al., 1996, Krishnapuram and Keller, 1996). However, this behavior of PCM can sometimes be used to advantage for cluster validation - that is, to determine c, the number of clusters that are most plausible (Krishnapuram and Keller, 1996).

Krishnapuram and Keller recommend two ways to choose the weights $w_i$ for PCM-AO:

$$ w_i = K \left( \sum_{k=1}^{n} u_{ik}^m D_{ik}^2 \Big/ \sum_{k=1}^{n} u_{ik}^m \right),\; K > 0\,; \quad \text{or} \tag{2.9a} $$

$$ w_i = \sum_{x_k \in U_{(i)\alpha}} D_{ik}^2 \Big/ \left| U_{(i)\alpha} \right|\,, \tag{2.9b} $$

where $U_{(i)\alpha}$ is an α-cut of $U_{(i)}$, the i-th row of the initializing partition for PCM-AO. An initial c-partition of X is required to use (2.9), and this is often taken as the terminal partition from a run of FCM-AO prior to the use of PCM-AO. However, (2.9a) and (2.9b) are not good choices when the data set is noisy. It can be shown (Dave and Krishnapuram, 1997) that the membership function corresponds to the idea of a "weight function" in robust statistics, and the weights $\{w_i\}$ correspond to the idea of "scale". Therefore, robust statistical methods for estimating scale can be used to estimate the $\{w_i\}$ in noisy situations. Robust clustering methods will be discussed in Section 2.3.
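As a sketch of the common practice described above (PCM-AO weights computed from a terminal FCM-AO run via (2.9a)), assuming Euclidean distance and NumPy arrays; the function name and the default K = 1 are illustrative.

```python
import numpy as np

def pcm_weights_from_fcm(X, U_fcm, V_fcm, m=2.0, K=1.0):
    """Weights w_i for PCM-AO via (2.9a), computed from a terminal FCM-AO
    partition U_fcm (c x n) and its prototypes V_fcm (c x p):
    w_i = K * sum_k u_ik^m D_ik^2 / sum_k u_ik^m, with K > 0."""
    D2 = ((X[None, :, :] - V_fcm[:, None, :]) ** 2).sum(axis=2)   # (c, n)
    Um = U_fcm ** m
    return K * (Um * D2).sum(axis=1) / Um.sum(axis=1)
```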

Clustering algorithms produce partitions, which are sets of n label vectors. For non-crisp partitions, the usual method of defuzzification is the application of (1.15) to each column $U_k$ of matrix U. The crisp partition corresponding to the maximum membership partition of any $U \in M_{fcn}$ is

$$ U_k^H = H(U_k) = e_i \;\Leftrightarrow\; u_{ik} \ge u_{jk}\,,\; j \ne i\,;\; \forall\, k\,. \tag{2.10} $$

The action of H on U will be denoted by $U^H = [H(U_1) \cdots H(U_n)]$. The conversion of a probabilistic partition $P \in M_{fcn}$ by the Bayes rule (decide $x_k \in$ class i if and only if $p(i\,|\,x_k) \ge p(j\,|\,x_k)$ for $j \ne i$) results in the crisp partition $P^H$. We call this the hardening of U with H.
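A small sketch of hardening by maximum membership as in (2.10), assuming U is stored as a c × n NumPy array; ties are resolved by taking the first maximal index, one of the arbitrary choices allowed.

```python
import numpy as np

def harden(U):
    """Harden a fuzzy/possibilistic partition U (c x n) by maximum membership,
    as in (2.10): each column becomes the crisp unit vector e_i of its
    largest entry (ties resolved by the first maximal index)."""
    c, n = U.shape
    H = np.zeros((c, n))
    H[U.argmax(axis=0), np.arange(n)] = 1.0
    return H
```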

Example 2.2 HCM-AO, FCM-AO and PCM-AO were applied to the unlabeled data set $X_{30}$ illustrated in Figure 2.2 (the labels in Figure 2.2 correspond to HCM assignments at termination of HCM-AO).

The coordinates of these data are listed in the first two columns of Table 2.3. There are c = 3 visually compact, well-separated clusters in $X_{30}$. The AO algorithms in Table 2.2 were run on X using Euclidean distance for $J_m$ and $E_t$, until $E_t = \|V_t - V_{t-1}\| < \varepsilon = 0.01$. All three algorithms quickly terminated (less than 10 iterations each) using this criterion. HCM-AO and FCM-AO were initialized with the first three vectors in the data. PCM-AO was initialized with the final prototypes given by FCM-AO, and the weight $w_i$ for each PCM-AO cluster was set equal to the value obtained with (2.9a) using the terminal FCM-AO values. The PCM-AO weights were $w_1 = 0.20$, $w_2 = 0.21$ and $w_3 = 1.41$. FCM-AO and PCM-AO both used m = 2 for the membership exponent, and all three algorithms fixed the number of clusters at c = 3. Rows $U_{(i)}$ of U are shown as columns in Table 2.3.

Figure 2.2 Unlabeled data set $X_{30}$ (terminal HCM labels: • = 1, o = 2, + = 3)

The terminal HCM-AO partition in Table 2.3 (shaded to visually enhance its crisp memberships) corresponds to visual assessment of the data, and its terminal labels appear in Figure 2.2. The cells in Table 2.3 that correspond to maximum memberships in the terminal FCM-AO and PCM-AO partitions are also shaded to help you visually compare these three results.

The clusters in this data are well separated, so FCM-AO produces memberships that are nearly crisp. PCM-AO memberships also indicate well separated clusters, but notice that this is evident not by many memberships being near 1, but rather, by many memberships being near zero.

Table 2.3 Terminal partitions and prototypes for $X_{30}$

DATA | HCM-AO | FCM-AO | PCM-AO

For example, the third cluster has many relatively low maximum memberships, but the other memberships for each of points 21 to 30 in cluster 3 are all zeroes. The greatest difference between fuzzy and possibilistic partitions generated by these two models is that FCM-AO memberships (are forced to) sum to 1 on each data point, whereas PCM-AO is free to assign labels that don't exhibit dependency on points that are not clearly part of a particular cluster. Whether this is an advantage for one algorithm or the other depends on the data in hand. Experiments reported by various authors in the literature support trying both algorithms on the data, and then selecting the output that seems to be most useful.

Because the columns of U in $M_{pcn}$ are independent, PCM actually seeks c independent possibilistic clusters, and therefore it can locate all c clusters at the same spot even when an algorithm such as FCM is used for initialization. In some sense this is the price PCM pays - losing the ability to distinguish between different clusters - for the advantage of getting possibilistic memberships that can isolate noise and outliers.

The bottom three rows of Table 2.3 enable you to compare the point prototypes produced by the three algorithms to the sample means $\{\bar{v}_1, \bar{v}_2, \bar{v}_3\}$ of the three clusters. HCM-AO and FCM-AO both produce exact (to two decimal places) replicates; the PCM-AO cluster centers for the second and third clusters are close to $\bar{v}_2$ and $\bar{v}_3$, while the PCM-AO prototype for cluster 1 differs by about 15% in each coordinate. This is because PCM memberships vary considerably within a cluster, depending on how close the points are to the prototype.

Applying (2.10) to the FCM-AO and PCM-AO partitions in Table 2.3 results in the same terminal partition as found by HCM-AO (i.e., the shaded cells in Table 2.3 show that $U^{H}_{FCM} = U^{H}_{PCM} = U_{HCM}$). This happens because the data set is small and well-structured. In large, less well structured data sets, the three algorithms may produce partitions that, when hardened, can be significantly different from each other. Needless to say, the utility of a particular output is dependent on the data and problem at hand, and this determination is, unfortunately, largely up to you.