
2.6 Comments and bibliography

Clustering algorithms

The crisp or hard c-means clustering model has its roots in the work of Gauss, who wrote to Olbers in 1802 about the method of least squared errors for parameter estimation (Bell, 1966). Duda and Hart (1973) credit Thorndike (1953) as the first explicit user of the HCM functional for clustering. There are now thousands of papers that report theoretical results or applications of the HCM model. There is also a very large body of non-fuzzy work in the signal and image processing literature that is a very close relative of (indeed, perhaps it is) HCM. The basic method for this community is the Lloyd-Buzo-Gray (LBG) algorithm. See Gersho and Gray (1992) for an excellent summary of this material. A new approach to signal processing based on clustering has been recently discussed by Geva and Pratt (1994).

The FCM model and FCM-AO were introduced by Dunn (1974a) for the special case m = 2, and both were generalized by Bezdek (1973, 1974a) for any m > 1. Krishnapuram and Keller's (1993) PCM model and PCM-AO were published in 1993. The newest entrant to the c-means families is a mixed fuzzy-possibilistic c-means (FPCM) model and an AO algorithm for optimizing it that simultaneously generates both a fuzzy partition of and a typicality matrix for unlabeled data set X (Pal et al., 1997a). See Yang (1993) for a nice survey of many other generalizations of the basic FCM model, including some to the case of continuous data processes, in which the double sum for J_m is replaced by integrals.
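For readers who want to experiment, the following is a minimal sketch (ours, in Python/NumPy, not from any of the papers cited above) of batch FCM-AO with the Euclidean norm: memberships come from inverse distance ratios and prototypes are fuzzy-weighted means.

    import numpy as np

    def fcm(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
        """Batch FCM-AO sketch. X is (n, p); returns U (c, n) and V (c, p)."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        U = rng.random((c, n))
        U /= U.sum(axis=0, keepdims=True)            # columns of U sum to 1
        for _ in range(max_iter):
            Um = U ** m
            V = (Um @ X) / Um.sum(axis=1, keepdims=True)   # fuzzy-weighted means
            D2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)  # (c, n) squared distances
            D2 = np.fmax(D2, 1e-12)                  # guard against zero distances
            inv = D2 ** (-1.0 / (m - 1.0))
            U_new = inv / inv.sum(axis=0, keepdims=True)   # membership update
            if np.abs(U_new - U).max() < tol:
                return U_new, V
            U = U_new
        return U, V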

There are several points to be careful about when reading papers on c-means clustering. First, many writers use k instead of c for the integer that denotes the number of clusters. Our notation follows Duda and Hart (1973). Second, many papers and books refer to the sequential version of c-means (or k-means) simply as "k-means".

The well-known sequential version is not an AO method and has many adherents. Its basic structure is that of a competitive learning model, which will be discussed in Chapter 4. Be very careful, when reading about c-means or k-means, to ascertain whether the author means the sequential (Section 4.3.C) or batch version (Section 2.2.A); their properties and performance can be wildly different.
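To make the distinction concrete, here is a MacQueen-style sequential sketch (again ours, an illustration of the competitive-learning flavor rather than a transcription of any particular published algorithm): points are presented one at a time and only the winning prototype moves, in contrast to the batch sketch above, which updates everything from the full data set at each pass.

    import numpy as np

    def sequential_kmeans(X, c, n_passes=5, seed=0):
        """Sequential (MacQueen-style) k-means: one point at a time, winner moves."""
        rng = np.random.default_rng(seed)
        V = X[rng.choice(len(X), size=c, replace=False)].astype(float)  # init from data
        counts = np.ones(c)                          # per-prototype update counts
        for _ in range(n_passes):
            for k in rng.permutation(len(X)):
                x = X[k]
                i = int(np.argmin(((V - x) ** 2).sum(axis=1)))  # nearest prototype wins
                counts[i] += 1
                V[i] += (x - V[i]) / counts[i]       # running-mean step toward x
        return V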

The term ISODATA was used (incorrectly) for both HCM-AO and FCM-AO in early papers that followed the lead of Dunn (1974a) and Bezdek (1973). Conditions (2.6) were used by Ball and Hall (1967) in their crisp ISODATA (iterative self-organizing data analysis) algorithm, which is our HCM-AO combined with a number of heuristic procedures for (dynamic) determination of the correct number of clusters to seek. Early papers by Dunn and Bezdek called the FCM-AO algorithm "fuzzy ISODATA", even though there were no heuristics attached to it analogous to those proposed by Ball and Hall. Later papers replaced the term fuzzy ISODATA by fuzzy c-means, but the incorrect use of fuzzy ISODATA still occurs now and then. To our knowledge, a generalization of crisp ISODATA that could correctly bear the name fuzzy ISODATA has - surprisingly - yet to be studied. There is a crisp iterative self-organizing entropy (ISOETRP) clustering algorithm due to Wang and Suen (1984) that uses some of the same heuristics as ISODATA. ISOETRP is an interactive clustering model that builds classifier decision trees, and it attempts to optimize an entropy functional instead of J; we will discuss this method in Section 4.6.

Suppose you have T sets of unlabeled data, X = {X_1, ..., X_T}, where X_j = {x_{j1}, ..., x_{jn}} ⊂ ℜ^p contains the observations for the j-th situation in the data. Data like these are common. For example, in estimates of brain tumor volume such as those made by Clark et al. (1998), X_j corresponds to the j-th slice in a set of T slices of 3D magnetic resonance images. In this example, the data are not collocated either spatially or temporally. For a second example, X_j might be the j-th image in a temporal sequence of frames collected by an imaging sensor such as a Ladar on a flying seeker platform that sweeps the scene below it. In this second case the data are not temporally collocated, but after suitable adjustments to register the images, they are spatially collocated.

In the first step of tumor volume estimation in Clark et al. (1998), each of the T magnetic resonance slices is independently segmented by unsupervised FCM, leading to a set of T computationally uncorrelated terminal pairs, say {(U_1, V_1), ..., (U_T, V_T)}, for the input data sets X = {X_1, ..., X_T}. In such a scheme the number of clusters could be - and in this application should be - a variable, changing from slice to slice as the number of tissue classes changes. In the seeker example, however, when images are collected at very high frame rates, only the locations of the targets (the V_j's) in the images should change. The number of clusters for each frame in a (short) time window of this temporal data should be fixed. You can cluster each image in such data independently with c fixed, of course, and the sequence {(U_1, V_1), ..., (U_T, V_T)} might be a useful representation of unfolding events in the illuminated scene.

However, Sato et al. (1997) discuss a different approach to this problem that involves a very minor change to the FCM functional and seems like a useful alternative.

Sato et al. extend the basic FCM function J_m(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^m D_{ik}^2 by adding together T terms (one for each X_j), each weighted with a user-specified weight \omega_j, j = 1, ..., T. Their temporal fuzzy c-means (TFCM) function is defined as

J_m^{TFCM}(U, \{V_j\}) = \sum_{j=1}^{T} \omega_j J_m(U, V_j), \qquad \omega_j > 0 \text{ for all } j.

(TFCM is our name for their model and algorithm; they don't really give it a name.) J_m^{TFCM} is a positive linear combination of T copies of the FCM functional, so an easy corollary that follows from the proofs for necessary conditions (Bezdek, 1981) yields necessary conditions for AO minimization of J_m^{TFCM} that are simple extensions of (2.6a) and (2.6b). The fuzzy partition common to all T terms of J_m^{TFCM} is calculated as

u_{ik} = \left[ \sum_{s=1}^{c} \left( \frac{\sum_{j=1}^{T} \omega_j \, \delta^2(x_{jk}, v_{ji})}{\sum_{t=1}^{T} \omega_t \, \delta^2(x_{tk}, v_{ts})} \right)^{1/(m-1)} \right]^{-1}, \qquad 1 \le i \le c, \; 1 \le k \le n. \tag{2.140a}

The c prototypes V_j, one for each data set X_j, are updated with the common fuzzy c-partition in (2.140a) using the standard equation in (2.7b),

£"S-,.,

/ n ^ k=i y l < j < T , l < i < c (2.140b) In (2.140) the values {u } define a common fuzzy c-partition for all T data sets X = {X^ X^}, and for each data set X, there is a set of c point prototjrpes, V = {v ,..., v } c 3i'^^. Sato et al. only discuss the

j i ' j c

case where 6 is the Euclidean norm, but equations (2.140) are easily generalized to any inner product norm, and, we suspect, are also easily generalizable to many instances of non-point prototype models as well. AO between (2.140a) and (2.140b) produces, at termination, the set {(U, V^),...,(U,V.j,)}. Thus, U is a common fuzzy c-partition of all T situations, and the {V } provide an estimate of the trajectory of the c point prototypes through time (that is, through the 3-way data). Because tumors come and go in a set of magnetic resonance image slices of a h u m a n with a brain tumor.

CLUSTER ANALYSIS 133 TFCM seems inappropriate for the application discussed by Clark et al. (1998), but we can imagine the sequence {V } being very useful in s i t u a t i o n s s u c h a s the seeker example in automatic target recognition, or hurricane tracking via real time satellite imagery.
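A minimal sketch of TFCM-AO under our reading of (2.140) follows, assuming the Euclidean norm, the same number n of objects in every X_j (so that a common U makes sense), and NumPy; the function and variable names are ours. It reuses the membership normalization pattern of the FCM sketch above, with the ω-weighted sum of squared distances replacing D_{ik}^2.

    import numpy as np

    def tfcm(Xs, w, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
        """TFCM-AO sketch per (2.140). Xs: list of T arrays (n, p); w: T positive weights."""
        T, n = len(Xs), Xs[0].shape[0]
        rng = np.random.default_rng(seed)
        U = rng.random((c, n))
        U /= U.sum(axis=0, keepdims=True)        # common fuzzy c-partition
        for _ in range(max_iter):
            Um = U ** m
            # (2.140b): one prototype set per data set, all using the common U
            V = [(Um @ X) / Um.sum(axis=1, keepdims=True) for X in Xs]
            # (2.140a): memberships from the omega-weighted sum of squared distances
            D2 = sum(w[j] * ((Xs[j][None, :, :] - V[j][:, None, :]) ** 2).sum(axis=2)
                     for j in range(T))
            D2 = np.fmax(D2, 1e-12)
            inv = D2 ** (-1.0 / (m - 1.0))
            U_new = inv / inv.sum(axis=0, keepdims=True)
            if np.abs(U_new - U).max() < tol:
                return U_new, V
            U = U_new
        return U, V

Setting T = 1 and w = [1.0] recovers ordinary FCM-AO, which is one quick sanity check for an implementation like this.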

However, it is clear that the effectiveness of TFCM is very dependent upon "good" choices for the T fixed, user-defined weights {ω_j}.

Sato et al. give several examples of TFCM, including a data set of 60 dental patients who have had underbite treatment. Each patient has p = 8 numerical features measured at T = 3 different post-treatment times. TFCM with c = 4, m = 2 and the Euclidean norm is applied to these data, and the resultant prototypes of the four clusters seem to have physically interpretable meanings over time. The only complaints we have about the examples in Sato et al.'s book are that none of them are compared to other methods (such as applying FCM to each data set in the sequence independently); and no guidance is given about the choice of the T weights {ω_j}. Nonetheless, we think this is a very promising extension of FCM for some problems with a temporal aspect.

Much of the general theory for AO (also called grouped coordinate descent) is contained in Bezdek et al. (1986a, 1987a), Redner et al. (1987) and Hathaway and Bezdek (1991). AO schemes are essentially split gradient descent methods, and as such, suffer from the usual numerical analytic problems. They need good initializations, can get trapped at undesirable local extrema (e.g., saddle points), and can even exhibit limit cycle behavior for a given data set. Karayiannis (1996) gives fuzzy and possibilistic clustering algorithms based on a generalization of the reformulation theorem discussed in Section 2.2.E.

There are many hybrid clustering models that combine crisp, fuzzy, probabilistic and possibilistic notions. Simpson (1993) uses fuzzy sets to find crisp clusters (directly, without hardening). Moreover, this method adjusts the number of clusters dynamically, so it does not rely on posterior validation indices. The method of Yager and Filev (1994a) called the "mountain clustering method" is often described as a fuzzy clustering method. However, this method, and a relative called the subtractive clustering method (Chiu, 1994), are not fuzzy clustering methods, nor are they even clustering methods. They both seek point prototypes in unlabeled data without reference to good partitions of the data, and then use the discovered prototypes non-iteratively to construct crisp clusters with the nearest prototype rule (equation 2.6a). These models will be discussed in Chapter 4.

Runkler and Bezdek (1998a) have recently introduced a new class of clustering schemes that are not driven by objective function models. Instead, they propose alternating cluster estimation (ACE), a scheme whereby the user selects update equations for prototypes and memberships from toolbars of choices for each of these sets of variables. All of the AO models of this chapter can be imbedded in the ACE framework (including probabilistic models), and additionally, ACE enables users to build new clustering algorithms by a "mix and match" paradigm, that is, by mixing formulae from various sources. This type of algorithm trades mathematical interpretability (the objective function and necessary conditions for it) for user-defined properties of presumably desirable prototypes and membership functions (e.g., convexity of membership functions, a property not enjoyed by continuous functions satisfying the FCM necessary condition (2.7a)).
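The skeleton below illustrates the alternation idea behind ACE as we read it: the user supplies the two update equations and the loop simply alternates them. This is only our illustration of the "mix and match" paradigm, not the toolbar implementation described by Runkler and Bezdek.

    import numpy as np

    def ace(X, c, membership_update, prototype_update, max_iter=100, tol=1e-5, seed=0):
        """ACE-style skeleton: alternate two user-chosen update equations."""
        rng = np.random.default_rng(seed)
        V = X[rng.choice(len(X), size=c, replace=False)].astype(float)  # init from data
        U = None
        for _ in range(max_iter):
            U = membership_update(X, V)     # e.g. FCM memberships, or any user-designed rule
            V_new = prototype_update(X, U)  # e.g. fuzzy-weighted means, medians, ...
            done = np.abs(V_new - V).max() < tol
            V = V_new
            if done:
                break
        return U, V

Plugging in the FCM membership and prototype formulas recovers FCM-AO; swapping in, say, a triangular membership function and a median prototype yields a new scheme with no underlying objective function, which is exactly the trade described above.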

Cluster Validity

A third way (besides direct and indirect validity measures) to assess cluster validity is to assign each U ∈ P some task, and then compare its performance on the task to other candidates in P (Backer and Jain, 1981). For example, the labels in U can be used to design a classifier, and empirical error rates on labeled data can then be used to compare the candidates. This is performance-based validity. It is hard to give more than a general description of this idea because the performance criteria which dictate how to select the best solution are entirely context dependent. Nonetheless, for users with a real application in mind, this is an important methodology to remember. A best strategy when the end goal is known may be to first eliminate badly incorrect clustering outputs with whatever validity measures seem to work, and then use the performance goal to make a final selection from the pruned set of candidates.

Our discussion of cluster validity was made in the context that the choice of c is the most important problem in validation studies. Duda and Hart (1973) call this the "fundamental problem of cluster validity". A more complete treatment of cluster validity would also include validation of clustering methods as well as validation of individual clusters, neither of which was addressed in Section 2.4.

Applying direct, indirect or performance-based validity criteria to each partition in P is called static cluster validity. When assessment criteria are integrated into the clustering scheme to alter the number of clusters during computation (that is, other than in the obvious way of clustering once at each c in some prespecified range), as in Ball and Hall's (1967) ISODATA or Tou's (1979) DYNOC, the resulting approach is called dynamic cluster validation. In this approach P is not generated at all - rather, an algorithm generates U, assesses it, and then adjusts (or simply tries other) parameters (and possibly algorithms) in an attempt to find a "most valid" U or P for X. Surprisingly enough, a fuzzy version of ISODATA per se has never been developed. However, many authors have added merge-split (or progressive) clustering schemes based on values of various validity functionals to FCM/PCM in an attempt to make them dynamic (see Dave and Bhaswan, 1991b, Krishnapuram and Freg, 1992, Bensaid et al., 1996b, Frigui and Krishnapuram, 1997).

Given the problems of indirect indices (functions of U alone, which are usually mediocre at best), it is somewhat surprising to see so much new work on functionals of this type. For example, Runkler (1995) discusses the use of a family of indirect indices (the mean, median, maximum, minimum and second maximum) of the c row maximums {M_i = max_{1≤k≤n}{u_ik}, i = 1, ..., c} of U for validation of clusters found by the FCE algorithm. Continued interest in measures of this type can probably be attributed to three things: their simplicity; the general allure of computing "how fuzzy" a non-crisp entity is; and most importantly, how important cluster validity really is for users of clustering algorithms. Trauwaert (1988) contains a nice discussion of some of the issues raised here about the use of the partition coefficient (historical note: Trauwaert mistakenly attributed the partition coefficient to Dunn in the title of and throughout his paper; Bezdek (1973) introduced the partition coefficient to the literature). See Cheng et al. (1998) for a recent application of the partition entropy at (2.99) to the problem of (automatically) selecting thresholds in images that separate objects from their backgrounds.
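As a concrete reference, here are sketches of the two classical indirect indices just mentioned (the partition coefficient and the partition entropy, in their standard forms) together with the row-maximum family discussed by Runkler; all are computed from U alone, and the code and names are ours.

    import numpy as np

    def partition_coefficient(U):
        """V_PC(U) = (1/n) sum_i sum_k u_ik^2; in [1/c, 1], larger = crisper."""
        return float((U ** 2).sum() / U.shape[1])

    def partition_entropy(U, eps=1e-12):
        """V_PE(U) = -(1/n) sum_i sum_k u_ik ln u_ik; in [0, ln c], smaller = crisper."""
        return float(-(U * np.log(U + eps)).sum() / U.shape[1])

    def row_max_family(U):
        """Statistics of the c row maximums M_i = max_k u_ik (Runkler-style)."""
        M = np.sort(U.max(axis=1))
        return {"mean": float(M.mean()), "median": float(np.median(M)),
                "max": float(M[-1]), "min": float(M[0]),
                "second max": float(M[-2])}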

There are several principles that can be used as guides when building an index of validity. First, computational examples on many data sets with various indices suggest that the more reliable indices explicitly use all of the data in the computation of the index. And second, most of the better indices also use the cluster means V(U) if U is crisp, or whatever prototypes B in (2.24a) are available, in their definition. Even when X is not used, using V(U) or B implicitly involves all of X, and insulates indices from being brittle to a few noisy points in the data.

If it is possible to know, or to ascertain, the rough structure of the data, then of course an index that is designed to recognize that type of structure is most appealing. For example, mixtures of normal distributions with roughly equal covariance structure are expected to generate hyperellipsoidal clusters that are most dense near their means, and in this case any index that optimizes for this type of geometry should be more useful than those that do not. Bezdek et al. (1997b) discuss this idea at length, and show that both crisp and fuzzy validity indices are as reliable as many of the most popular probabilistic criteria for validation in the context of normal mixtures.

When an indirect index is used (partition coefficient, partition entropy, etc.), the quality of B either as a compact representation of the clusters or as an estimate of parameters is never directly measured, so this class of indices cannot be expected to perform well unless the data have very distinct clusters. Thus, indirect indices that involve only U are probably not very useful for volumetric or shell cluster validation - in either case they simply measure the extent to which U is a non-crisp partition of the data. When parameters such as B are added to an indirect index (Gath-Geva or Xie-Beni, for example), the issue of cluster type becomes more important. When the clusters are volumetric (because they are, or because the algorithm that produced them seeks volumetric structure), B should be a set of point prototypes. When the clusters are represented by B, a parameter vector of a set of non-point prototypes, the cluster structure is shell-like. In either case, the validity index should incorporate B into its definition. We feel that the best indices are direct or indirect parametric data indices. This is why we chose the classification of indices in Table 2.7 as the fundamentally important way to distinguish between types of measures.

The literature of fuzzy models for feature analysis when the data are unlabeled as in this chapter is extremely sparse and widely scattered. The few papers we know of that use fuzzy models for feature analysis with labeled data will be discussed in Section 4.11.

Finally, we add some comments about clustering for practitioners.

Clustering is a very useful tool that has many well documented and important applications: to name a few, data mining, image segmentation and extraction of rules for fuzzy systems. The problem of validation for truly unlabeled data is an important consideration in all of these applications, each of which has developed its own set of partially successful validation schemes. Our experience is that no one index is likely to provide consistent results across different clustering algorithms and data structures. One popular approach to overcoming this dilemma is to use many validation indices, and conduct some sort of vote among them about the best value for c.

Many votes for the same value tend to increase your confidence, but even this does not prevent mistakes (Pal and Bezdek, 1995). We feel that the best strategy is to use several very different clustering models, vary the parameters of each, and collect many votes from various indices, as in the sketch below. If the results across various trials are consistent, the user may assume that meaningful structure in the data is being found. But if the results are inconsistent, more simulations are needed before much confidence can be placed in algorithmically suggested substructure.
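A minimal sketch of the voting idea, reusing the fcm, partition_coefficient and partition_entropy sketches above; the tally scheme (one vote per index at its best c) is just one plausible reading of the strategy described here, and the names are ours.

    import numpy as np

    def vote_for_c(X, c_values, indices):
        """Run FCM at each candidate c and tally one vote per validity index.

        indices: dict name -> (score_fn(U, V, X), best), with best in {'max', 'min'}.
        """
        scores = {name: [] for name in indices}
        for c in c_values:
            U, V = fcm(X, c)                         # FCM sketch from earlier
            for name, (score_fn, _) in indices.items():
                scores[name].append(score_fn(U, V, X))
        tally = {c: 0 for c in c_values}
        for name, (_, best) in indices.items():
            s = np.asarray(scores[name])
            pick = c_values[int(np.argmax(s) if best == 'max' else np.argmin(s))]
            tally[pick] += 1                         # this index votes for its best c
        return tally

    # Hypothetical usage: two indirect indices voting over c = 2..6
    # indices = {'PC': (lambda U, V, X: partition_coefficient(U), 'max'),
    #            'PE': (lambda U, V, X: partition_entropy(U), 'min')}
    # print(vote_for_c(X, [2, 3, 4, 5, 6], indices))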
