
$$\alpha_{i,t} = \left( \frac{\sum_{x_j \in X_i} \|x_j - v_i\|^t}{|X_i|} \right)^{1/t}$$

C. Dunn's index

Dunn (1974a) proposed an index based on geometric considerations that has the same basic rationale as $V_{DB,qt}$, in that both are designed to identify volumetric clusters that are compact and well separated.

Let $S$ and $T$ be non-empty subsets of $\Re^p$, and let $\delta : \Re^p \times \Re^p \rightarrow \Re^+$ be any metric. The standard definitions of the diameter $\Delta$ of $S$ and the set distance $\delta$ between $S$ and $T$ are

$$\Delta_1(S) = \max_{x,\,y \in S} \{\delta(x, y)\} \quad ; \text{ and} \quad (2.91)$$

$$\delta_1(S, T) = \min_{x \in S,\; y \in T} \{\delta(x, y)\} \quad . \quad (2.92)$$

Dunn defined the separation index for the crisp c-partition $U \leftrightarrow \{X_1, \ldots, X_c\}$ of $X$ as

$$V_D(U; X) = \min_{1 \le i \le c} \left\{ \min_{\substack{1 \le j \le c \\ j \ne i}} \left\{ \frac{\delta_1(X_i, X_j)}{\max_{1 \le k \le c} \{\Delta_1(X_k)\}} \right\} \right\} \quad . \quad (2.93)$$
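Definitions (2.91)-(2.93) translate directly into code. The following is a minimal sketch (the function names are ours, not from the text), taking $\delta$ to be the Euclidean metric and representing each cluster as a list of points:

```python
import math

def dist(x, y):
    # The point metric delta(x, y); Euclidean here.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def diameter(S):
    # Delta_1(S) = max over x, y in S of delta(x, y)   -- eq. (2.91)
    return max(dist(x, y) for x in S for y in S)

def set_distance(S, T):
    # delta_1(S, T) = min over x in S, y in T of delta(x, y)   -- eq. (2.92)
    return min(dist(x, y) for x in S for y in T)

def dunn_index(clusters):
    # V_D = min_i min_{j != i} [ delta_1(X_i, X_j) / max_k Delta_1(X_k) ]
    # -- eq. (2.93), for a crisp c-partition given as a list of point lists.
    max_diam = max(diameter(X) for X in clusters)
    c = len(clusters)
    return min(
        set_distance(clusters[i], clusters[j]) / max_diam
        for i in range(c) for j in range(c) if i != j
    )
```

For two tight, well-separated clusters this value exceeds 1, consistent with Dunn's compact-and-separated criterion discussed below.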

The quantity $\delta_1(X_i, X_j)$ in the numerator of $V_D$ is analogous to $\|v_i - v_j\|_q$ in the denominator of $V_{DB,qt}$: $\delta_1(X_i, X_j)$ measures the distance between clusters directly on the points in the clusters, whereas $V_{DB,qt}$ uses the distance $\|v_i - v_j\|_q$ between their cluster centers for the same purpose. The use of $\Delta_1(X_k)$ in the denominator of $V_D$ is analogous to $\alpha_{i,t}$ in the numerator of $V_{DB,qt}$; both are measures of scatter volume for cluster $X_k$. Thus, extrema of $V_D$ and $V_{DB,qt}$ share roughly the same geometric objective: maximizing intercluster distances while simultaneously minimizing intracluster distances.

Since the measures of separation and compactness in $V_D$ occur inversely to their appearance in $V_{DB,qt}$, large values of $V_D$ correspond to good clusters. Hence, the number of clusters that maximizes $V_D$ is taken as the best solution. $V_D$ is not defined on $\mathbf{1}_n$ when $c = 1$ or on $I_n$ when $c = n$.

Dunn called $U$ compact and separated (CS) relative to the (point) metric $\delta$ if and only if: for all $s$, $q$ and $r$ with $q \ne r$, any pair of points $x, y \in X_s$ are closer together (with respect to $\delta$) than any pair $u, v$ with $u \in X_q$ and $v \in X_r$. Dunn proved that $X$ can be clustered into a compact and separated c-partition with respect to $\delta$ if and only if $\max_{U \in M_{hcn}} \{V_D(c)\} > 1$. Dunn's index is a direct data index.

Example 2.13 Table 2.8 shows values of $V_{DB,22}$ and $V_D$ for terminal partitions of $X_{30}$ produced by HCM-AO using the same protocols as in Example 2.2. Table 2.8 reports values of each index for c = 2 to 10.

Each column of Table 2.8 is computed by applying the two indices to the same crisp c-partition of $X_{30}$. The highlighted (bold and shaded) entries correspond to optimal values of the indices.

Table 2.8 Direct cluster validity for HCM-AO partitions of X30

c           2      3      4      5      6      7      8      9      10
V_DB,22     0.35   0.18   0.48   0.63   0.79   0.87   0.82   0.88   0.82
V_D         0.96   1.53   0.52   0.12   0.04   0.04   0.04   0.04   0.04

Figure 2.16 $V_D$ and $V_{DB,22}$ from Table 2.8 for HCM-AO on X30

$V_{DB,22}$ indicates c = 3 by its strong minimum value of 0.18. The table shows only two significant digits, so ties may appear to occur in it, but there are no ties if four-digit accuracy is retained. For this very well separated data set, $V_D$, which is to be maximized, also gives a very strong indication that c = 3 is the best choice.
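Model selection with these indices reduces to an argmin (for $V_{DB,22}$) or argmax (for $V_D$) over the candidate values of c. A small sketch using the Table 2.8 values as printed to two digits (variable names are ours):

```python
# Values from Table 2.8 for c = 2..10 (two significant digits, as printed).
cs = list(range(2, 11))
v_db = [0.35, 0.18, 0.48, 0.63, 0.79, 0.87, 0.82, 0.88, 0.82]  # V_DB,22: minimize
v_d  = [0.96, 1.53, 0.52, 0.12, 0.04, 0.04, 0.04, 0.04, 0.04]  # V_D: maximize

# Davies-Bouldin: best c minimizes the index.
best_c_db = cs[min(range(len(v_db)), key=v_db.__getitem__)]
# Dunn: best c maximizes the index.
best_c_d = cs[max(range(len(v_d)), key=v_d.__getitem__)]
```

On X30 both selections agree: each index picks c = 3.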

Figure 2.16 is a graph of the values in Table 2.8 that shows how strongly c = 3 is preferred by both of these direct indices. Don't expect these (or any other) indices to show such sharp, well-defined behavior on data that do not have such clear cluster structure.

Another point: don't forget that the graphs in Figure 2.16 are explicit functions of HCM-AO, the clustering algorithm that produced the partitions being evaluated. You might get different graphs (and infer a different best value of c) simply by changing the initialization, or the norm, or the termination criterion $\varepsilon$, etc. of HCM.

Our next example illustrates the use of $V_D$ and $V_{DB,22}$ on clusters found by HCM-AO in the ubiquitous Iris data (Anderson, 1935).

Interestingly, Anderson was a botanist who collected the data, but did not publish their values. Fisher (1936) was apparently the first author to publish the data values, which he used to illustrate the method of linear discriminant analysis. Several scatterplots of Iris are shown in Section 4.3. And finally, please see our comments in the preface about the real Iris data.

Example 2.14 Iris has n = 150 points in p = 4 dimensions that represent 3 physical clusters with 50 points each. Iris contains observations for 50 plants from each of three different subspecies of Iris flowers, but in the numerical representation in $\Re^4$ of these objects, two of the three classes have substantial overlap, while the third is well separated from the others. Because of this, many authors argue that there are only c = 2 geometric clusters in Iris, and so good clustering algorithms and validity functionals should indicate that c = 2 is the best choice. Table 2.9 lists the values of $V_D$ and $V_{DB,22}$ on terminal HCM-AO partitions of Iris. All parameters of the runs were as in Example 2.2 except that the initializing vectors were from Iris. Figure 2.17 shows graphs of the values of $V_D$ and $V_{DB,22}$ in Table 2.9.

The Davies-Bouldin index clearly points to c = 2 (our first choice for the correct value), while Dunn's index seems to equally prefer c = 3 and c = 7. To four-place accuracy (not shown here), the value at c = 3 is slightly higher, so Dunn's index here would (weakly) indicate the partition corresponding to c = 3. The lesson here is not that one of these answers is right. What is important is that these two indices point to different "right answers" on the same partitions of the data.

Table 2.9 Direct cluster validity for HCM-AO partitions of Iris

c           2      3      4      5      6      7      8      9      10
V_DB,22     0.47   0.73   0.84   0.99   1.00   0.96   1.09   1.25   1.23
V_D         0.08   0.10   0.08   0.06   0.09   0.10   0.08   0.06   0.06

Figure 2.17 $V_D$ and $V_{DB,22}$ from Table 2.9 for HCM-AO on Iris

The numerator and denominator of $V_D$ are both overly sensitive to changes in cluster structure. $\delta_1$ can be dramatically altered by the addition or deletion of a single point in either $S$ or $T$. The denominator suffers from the same problem - for example, adding one point to $S$ can easily scale $\Delta_1(S)$ by an order of magnitude.

Consequently, $V_D$ can be greatly influenced by a few noisy points (that is, outliers or inliers to the main cluster structure) in $X$, and is far too sensitive to what can be a very small minority in the data.
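This sensitivity is easy to demonstrate numerically: a single far-away point added to a tight cluster inflates $\Delta_1(S)$ by an order of magnitude or more. A toy sketch (the data and names are ours, chosen only for illustration; $\delta$ is Euclidean):

```python
import math

def diameter(S):
    # Delta_1(S): the largest pairwise distance within S.
    return max(math.dist(x, y) for x in S for y in S)

# A tight cluster of four points in a 0.1 x 0.1 square.
tight = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
# The same cluster with one outlier appended.
with_outlier = tight + [(5.0, 5.0)]

d0 = diameter(tight)          # ~0.14
d1 = diameter(with_outlier)   # ~7.07: one added point scales Delta_1 by ~50x
```

Since $\Delta_1$ appears in the denominator of (2.93), this single outlier alone can collapse $V_D$ toward zero regardless of how well the rest of the data cluster.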

To ameliorate this, Bezdek and Pal (1998) generalized $V_D$ by using two other definitions for the diameter of a set and five other definitions for the distance between sets. Let $\Delta$ be any positive semi-definite (diameter) function on $P(\Re^p)$, the power set of $\Re^p$, and let $\delta$ denote any positive semi-definite, symmetric (set distance) function on $P(\Re^p) \times P(\Re^p)$. The general form of $V_D$ using $\delta$ and $\Delta$ is

$$V_{GD}(U; X) = V_{GD}(c) = \min_{1 \le i \le c} \left\{ \min_{\substack{1 \le j \le c \\ j \ne i}} \left\{ \frac{\delta(X_i, X_j)}{\max_{1 \le k \le c} \{\Delta(X_k)\}} \right\} \right\} \quad . \quad (2.94)$$
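As an illustration of the pluggable structure of (2.94), the sketch below swaps the nearest-pair distance $\delta_1$ for an average-linkage set distance (the mean of all pairwise distances between two clusters). This is in the spirit of the Bezdek-Pal generalization, though we do not reproduce their exact alternative definitions here; all names are ours:

```python
import math

def delta_avg(S, T):
    # Average-linkage set distance: mean of delta(x, y) over x in S, y in T.
    # Less sensitive to a single boundary point than the min in delta_1.
    return sum(math.dist(x, y) for x in S for y in T) / (len(S) * len(T))

def diameter(S):
    # Delta_1(S), kept as the diameter function in this sketch.
    return max(math.dist(x, y) for x in S for y in S)

def generalized_dunn(clusters, set_dist=delta_avg, diam=diameter):
    # V_GD(c) per (2.94), with pluggable set distance delta and diameter Delta.
    max_diam = max(diam(X) for X in clusters)
    c = len(clusters)
    return min(
        set_dist(clusters[i], clusters[j]) / max_diam
        for i in range(c) for j in range(c) if i != j
    )
```

Because the numerator averages over all cross-cluster pairs, adding or deleting one point shifts it only slightly, which is precisely the robustness motivation behind the generalized family.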

Generally speaking, indices from family (2.94) other than $V_D$ show better performance than $V_D$. The classification of $V_{GD}$ as in Table 2.7 depends on the choices of $\delta$ and $\Delta$. All of these indices are direct data indices (they all use $U$ and $X$), and several also use the sample means $V$.