
2 Cluster Analysis for Object Data

H. Clustering for robust parametric estimation


Figure 2.15 The robust competitive agglomeration technique: (a) input data; (b) after 5 RCA iterates; (c) after 6 RCA iterates; (d) final RCA result.

2.4 Cluster Validity

Now that we have some ways to get clusters, we turn to the problem of how to validate them. Figure 2.3(a) shows that the criterion driving a clustering algorithm towards an optimal partition sometimes produces a result that is disagreeable at best, and wrong at worst. This illustrates the need for approaches to the problem of cluster validity.

Clustering algorithms $\{C_j\}$ will produce as many partitions as you have time to generate. Let $P = \{C_j(X) = U_j \in M_{pcn} : 1 \le j \le N\}$, where the index $j$ indicates:

(i) clustering X with one C at various values of c;

(ii) clustering X over algorithmic parameters of a particular $C_j$; or

(iii) applying different $C_j$'s to X.

Cluster validity (problem (3), Figure 2.1) is an assessment of the relative attractiveness of different U's in P. The usual approach is computational, and is based on one or more validity functionals $V: D_V \to \Re$, $D_V$ denoting the domain of V, to rank each $U_j \in P$.
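The ranking step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' procedure: `rank_partitions` and the stand-in functional `mean_max_membership` are hypothetical names introduced here, and whether a larger or a smaller value of V is better depends on the index actually used.

```python
import numpy as np

def mean_max_membership(U):
    """Hypothetical stand-in for a validity functional V: D_V -> R.

    U is a c x n partition matrix (rows = clusters, columns = data points).
    Returns the average of the largest membership in each column.
    """
    return float(np.mean(np.max(U, axis=0)))

def rank_partitions(candidates, V, larger_is_better=True):
    """Score every candidate partition with V and sort best-first.

    candidates: list of (label, U) pairs, i.e. the family P.
    Returns a list of (label, score) pairs.
    """
    scored = [(label, V(U)) for label, U in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=larger_is_better)

# Two toy partitions of n = 4 points into c = 2 clusters.
U_crisp = np.array([[1.0, 1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 1.0]])
U_fuzzy = np.array([[0.6, 0.5, 0.4, 0.3],
                    [0.4, 0.5, 0.6, 0.7]])

ranking = rank_partitions(
    [("c=2 crisp", U_crisp), ("c=2 fuzzy", U_fuzzy)],
    mean_max_membership,
)
```

In practice the candidates would come from running one or more clustering algorithms over a range of c or of algorithmic parameters, as in cases (i) to (iii).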

You may wonder: if the global minimum of, say, J cannot produce the clusters you want, then why not directly optimize a validity functional V? First, no model can capture all the properties that "good" clusters might possess, and this of course includes any particular V we might propose. For example, we seek, from data set to data set, clusters with compactness, isolation, maximal crispness, density gradients, particular distributions, etc. Second, and more importantly, many of the validity indices that will be discussed do not fit naturally into a well-behaved framework for mathematical optimization. So, we use validity measures as an "after the fact" way to gain further confidence in a particular clustering solution.

There are two ways to view clustering algorithms. First, it is possible to regard C as a parametric estimation method: U and any additional parameters, such as B in the c-means and c-shells models, are being estimated by C using X. In this case V is regarded as a measure of goodness of fit of the estimated parameters (to a true but unknown set!). This interpretation is usually (but not exclusively) made for validity measures in the context of probabilistic clustering.

The second interpretation of C is in the sense of exploratory data analysis. When V assesses U alone (even if the measure involves other parameters such as B), V is interpreted as a measure of the quality of U in the sense of partitioning for substructure. This is the rationale underlying most of the methods discussed in this section.

When $D_V = M_{hcn}$, we call V a direct measure, because it assesses properties of crisp (real) clusters or subsets in X; otherwise, it is indirect. When $D_V = M_{hcn} \times$ other parameters (e.g., prototypes B), the test V performs is still direct, but the addition of the other parameters is an important change, because these parameters often contain valuable information about cluster geometry (for example, measures that assess how well the prototypes B fit the cluster shapes). We call indices that fall into this category direct parametric indices.

When U is not crisp, validity measures are applied to an algorithmic derivative of X, so they are called indirect measures of cluster validity. There are both indirect and indirect parametric measures of partition quality.


Finally, many validity measures also use X. This is a third important aspect of validity functionals: do they use the vectors in X during the calculation of V? We indicate explicit dependence of V on X by adding the word data when this is the case. Let $\Omega$ represent the parameter space for B. Table 2.7 shows a classification of validity functionals into six types based on their arguments (domains).

Table 2.7 One classification of validity measures

Type of Index               Variables    Domain $D_V$ of V
Direct                      U            $M_{hcn}$
Direct Parametric           (U, B)       $M_{hcn} \times \Omega$
Direct Parametric Data      (U, B, X)    $M_{hcn} \times \Omega \times \Re^p$
Indirect                    U            $(M_{pcn} - M_{hcn})$
Indirect Parametric         (U, B)       $(M_{pcn} - M_{hcn}) \times \Omega$
Indirect Parametric Data    (U, B, X)    $(M_{pcn} - M_{hcn}) \times \Omega \times \Re^p$
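The classification in Table 2.7 can be expressed as a small Python sketch. The helper names `is_crisp` and `index_type` are hypothetical. The membership test assumes the usual definition of the crisp partition set $M_{hcn}$ (entries in {0, 1}, each column summing to 1, no empty cluster), which is not restated in this section.

```python
import numpy as np

def is_crisp(U, tol=1e-12):
    """Check whether a c x n partition matrix U lies in M_hcn.

    Assumed definition: every entry is 0 or 1, every column sums to 1,
    and every row (cluster) contains at least one point.
    """
    U = np.asarray(U, dtype=float)
    binary = np.all((np.abs(U) <= tol) | (np.abs(U - 1.0) <= tol))
    columns_ok = np.allclose(U.sum(axis=0), 1.0)
    rows_ok = np.all(U.sum(axis=1) > 0)
    return bool(binary and columns_ok and rows_ok)

def index_type(U, uses_B=False, uses_X=False):
    """Name the type of a validity index from the arguments it takes.

    "Direct" if U is crisp, "Indirect" otherwise; "Parametric" is added
    when the index also uses prototypes B, and "Data" when it uses X.
    The (U, X) combination without B is not a row of Table 2.7; the
    name returned for it simply follows the same naming rule.
    """
    parts = ["Direct" if is_crisp(U) else "Indirect"]
    if uses_B:
        parts.append("Parametric")
    if uses_X:
        parts.append("Data")
    return " ".join(parts)

U_crisp = np.array([[1, 1, 0], [0, 0, 1]])
U_fuzzy = np.array([[0.9, 0.5, 0.2], [0.1, 0.5, 0.8]])

kind_a = index_type(U_crisp)                             # U only
kind_b = index_type(U_crisp, uses_B=True, uses_X=True)   # (U, B, X)
kind_c = index_type(U_fuzzy, uses_B=True)                # (U, B)
```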

Choosing c = 1 or c = n constitutes rejection of the hypothesis that X contains cluster substructure. Most validity functionals are not equipped to deal with these two special cases. Instead, they concentrate on $2 \le c < n$, implicitly ignoring the important question of whether X has clusters in it at all.

Notation It is hard to choose a notation for validity indices that is both comprehensive and comprehensible. Ordinarily, validation means "find the best c", so the logical choice is to show V as V(c). But in many cases, c doesn't even appear on the right side of an equation that defines V. X in Table 2.7 is fixed, but U and B are functions of c through the algorithm that produces them, so any index that uses either of these variables is implicitly a function of c as well. A notation that indicates functional dependency in a precise way would be truly formidable. For example, the Xie and Beni (1991) index (which can be used to validate the number of clusters found) depends on (U, B, X); U and B depend on C, the clustering algorithm that produces them; and C either determines or uses c, the number of clusters represented in U. How would you write the independent variables for this function? Well, we don't know a best way, so we will vacillate between two or three forms that make sense to us and that, we hope, will not confuse you. Dunn's index (Dunn, 1974a), for example, will be written as $V_D(U; X)$ when we feel it important to show the variables it depends upon, but when the emphasis is on its use in its application context, recognizing the fact that U is a function of c, we will write $V_D(c)$. The partition entropy defined below depends on both U (and hence c) as well as a, the base of the logarithmic function chosen; thus, we may use $V_{PE}(U, c, a)$, $V_{PE}(U)$ or $V_{PE}(c)$.
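To make the dependence on U and on the logarithm base a concrete, here is a minimal Python sketch of the partition entropy. It assumes the commonly cited form $V_{PE}(U, c, a) = -\frac{1}{n}\sum_{k=1}^{n}\sum_{i=1}^{c} u_{ik}\log_a(u_{ik})$ with the convention $0 \log 0 = 0$, which may differ in detail from the definition given later in the text. The function name `partition_entropy` is hypothetical. Note that c never appears as an argument: it enters only through the number of rows of U.

```python
import numpy as np

def partition_entropy(U, a=np.e):
    """Partition entropy of a c x n partition matrix U, log base a.

    Assumed form: -(1/n) * sum_k sum_i u_ik * log_a(u_ik),
    with 0 * log(0) treated as 0.
    """
    U = np.asarray(U, dtype=float)
    n = U.shape[1]
    # Replace zeros by 1 inside the log so those terms contribute exactly 0.
    safe = np.where(U > 0, U, 1.0)
    terms = U * np.log(safe) / np.log(a)
    return float(-terms.sum() / n)

# A crisp partition carries no fuzziness, so its entropy is zero.
U_crisp = np.array([[1.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])
pe_crisp = partition_entropy(U_crisp)

# The maximally fuzzy partition (all memberships 1/c) attains log_a(c).
U_uniform = np.full((2, 3), 0.5)
pe_uniform_base2 = partition_entropy(U_uniform, a=2)
pe_uniform_nat = partition_entropy(U_uniform)
```

Changing a only rescales the value by a constant factor, so it does not change how partitions are ordered for a fixed c.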