
[Figure 3.3 content: dissimilarity tables for three items, with d(1,2) = d(1,3) = d(2,3) = 1 before merging and d({1,2}, 3) = 0.732 after items 1 and 2 are merged, so the second merge occurs at a lower level than the first.]

FIGURE 3.3: Example of an inversion in the dendrogram of a hierarchy.

It can be shown that the dendrograms of hierarchies built with the linkage criteria presented earlier in this section never exhibit inversions.

3.2.3 Determining the number of clusters

In the general context of data clustering, a major challenge is how to select the right number of clusters, also known as the model selection issue. Indeed, while in some situations the choice of the number of classes may be motivated by prior knowledge of the user, it is in general unknown and must then be estimated from the data.

Milligan & Cooper (1985) provide a comprehensive survey of 30 methods for estimating the number of clusters, while Gordon (1999) compares the performances of the five best rules identified in Milligan & Cooper (1985). Gordon (1999) also divides these strategies for estimating the number of clusters into global methods and local methods. Local methods are intended to test the hypothesis that a pair of clusters should be merged, whereas global approaches evaluate some measure over the entire dataset and optimize it as a function of the number of clusters. More recently, Charrad et al. (2014) provided an implementation of an exhaustive list of indices to estimate the number of clusters in a dataset in the R package NbClust.
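For illustration, a typical call to this package could look as follows. This is only a usage sketch: it assumes that NbClust is installed and that X is a numeric matrix whose rows are the items to cluster; the argument values are illustrative and should be adapted to the data at hand.

```r
## Sketch: let NbClust evaluate a large set of indices and report the
## number of clusters suggested by the majority of them.
library(NbClust)

res <- NbClust(data = X, distance = "euclidean",
               min.nc = 2, max.nc = 15,       # range of candidate numbers of clusters
               method = "ward.D2",            # Ward's agglomerative criterion
               index = "all")                 # compute all available indices
res$Best.nc   # number of clusters retained by each index
```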

Several model selection approaches use the within-group dispersion measure, generally noted W, in order to estimate the number of clusters. Suppose the dataset E has already been clustered into G groups C_1, ..., C_G of sizes p_1, ..., p_G; the corresponding within-group dispersion measure W_G is then defined as:

W_G = \sum_{g=1}^{G} \sum_{i=1}^{p_g} \delta(X_{.i}, g_{C_g})^2,   (3.3)

where X_{.i} denotes the i-th item of cluster C_g and g_{C_g} the center of gravity of C_g. So if the dissimilarity \delta is the Euclidean distance, then W_G corresponds to the pooled within-cluster sum of squares around the within-cluster means.
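As an illustration, the computation of W_G can be written in a few lines of R. This is a minimal sketch assuming that the items are stored as the rows of a numeric matrix X, that the partition is encoded by a vector of cluster labels, and that δ is the Euclidean distance; the function name is ours.

```r
## Within-group dispersion W_G of Equation (3.3): sum, over the clusters,
## of the squared Euclidean distances of each item to the center of
## gravity of its cluster.
within_dispersion <- function(X, labels) {
  sum(sapply(unique(labels), function(g) {
    Xg     <- X[labels == g, , drop = FALSE]   # items of cluster C_g
    center <- colMeans(Xg)                     # center of gravity g_{C_g}
    sum(sweep(Xg, 2, center)^2)                # squared distances to the center
  }))
}
```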


FIGURE 3.4: Observed within-group dispersion measures W_G versus the number of clusters G, for a set E of p = 200 items in R^100 clustered in 10 groups of size 20. The dissimilarity used is the Euclidean distance and the clustering method is Ward's criterion.

Figure 3.4 shows a typical plot of the within-group dispersion measures W_G, computed with the Euclidean distance at each step of a Ward's clustering algorithm, as a function of the number of clusters G within the dataset. We can notice that W_G decreases monotonically as the number of clusters G increases. Indeed, splitting a class results in two clusters that are more homogeneous and thus have lower dispersion. But this decrease slows down from some G onward, and the location of such an elbow corresponds to the “true” number of clusters Ĝ to be estimated (Sugar 1998, Sugar et al. 1999).
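The kind of curve displayed in Figure 3.4 can be reproduced with a few lines of R, reusing the within_dispersion() helper sketched above. The simulated data below are purely illustrative (ten well-separated Gaussian groups) and are not the dataset used in the figure.

```r
## W_G as a function of G for a Ward's hierarchy, to look for an elbow.
set.seed(1)
X <- do.call(rbind, lapply(1:10, function(g)        # 10 groups of 20 items in R^100
  matrix(rnorm(20 * 100, mean = 3 * g), nrow = 20)))

hc <- hclust(dist(X), method = "ward.D2")            # Ward's criterion
WG <- sapply(1:20, function(G) within_dispersion(X, cutree(hc, k = G)))

plot(1:20, WG, type = "b",
     xlab = "number of clusters G", ylab = expression(W[G]))
```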

Among the global methods using the W_G measures and performing the best according to Milligan & Cooper (1985) was the Calinski and Harabasz index (Caliński & Harabasz 1974):

CH(G) = \frac{(p - G)}{(G - 1)} \, \frac{B_G}{W_G}.

Here B_G = \sum_{g=1}^{G} \delta(g_{C_g}, g_E)^2 is the between-cluster dispersion measure, with g_E the center of gravity of the dataset E. CH(G) is only defined for G greater than 1, and the optimal number of clusters is the Ĝ that maximizes CH(G).
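A direct transcription of this index, reusing X, hc and within_dispersion() from the previous sketches, could be the following. Note that the classical Caliński–Harabasz formulation weights each term of B_G by the cluster size p_g; the sketch below follows the definition given in the text.

```r
## Calinski-Harabasz index CH(G) for a given partition (defined for G > 1).
ch_index <- function(X, labels) {
  p  <- nrow(X)
  G  <- length(unique(labels))
  gE <- colMeans(X)                                 # center of gravity of the dataset E
  BG <- sum(sapply(unique(labels), function(g)      # between-cluster dispersion B_G
    sum((colMeans(X[labels == g, , drop = FALSE]) - gE)^2)))
  WG <- within_dispersion(X, labels)
  (p - G) / (G - 1) * BG / WG
}

CH    <- sapply(2:20, function(G) ch_index(X, cutree(hc, k = G)))
G_hat <- (2:20)[which.max(CH)]                      # G maximizing CH(G)
```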

Hartigan (1975) proposed the following statistic:

H(G) = \left(\frac{W_G}{W_{G+1}} - 1\right)(p - G - 1).

The idea is to start with G = 1 and to add a cluster if H(G) is significantly large. The decision rule suggested by Hartigan is to add a cluster if H(G) > 10. Hence, the cluster number is best estimated as the smallest G such that H(G) ≤ 10. This estimate is defined for G = 1 and can then be used for testing the presence of a cluster structure within the dataset.
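Hartigan's rule can be sketched directly from the vector of W_G values computed earlier (WG and X are those of the previous sketches):

```r
## Hartigan's statistic H(G) and the "smallest G with H(G) <= 10" rule.
hartigan <- function(WG, p) {
  G <- seq_along(WG)[-length(WG)]                   # G = 1, ..., Gmax - 1
  (WG[G] / WG[G + 1] - 1) * (p - G - 1)
}

H     <- hartigan(WG, p = nrow(X))
G_hat <- which(H <= 10)[1]                          # smallest G such that H(G) <= 10
```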

Another approach for estimating the number of clusters using the W_G measures is the proposal of Krzanowski & Lai (1988), who defined:

DIFF(G) = (G - 1)^{2/p} W_{G-1} - G^{2/p} W_G,

and chose Ĝ to maximize the quantity:

KL(G) = \left| \frac{DIFF(G)}{DIFF(G+1)} \right|.
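This criterion can be sketched in the same setting; p is taken here as in the formula above, and WG is the vector of within-group dispersions of the earlier sketches.

```r
## Krzanowski-Lai criterion: DIFF(G) and KL(G) = |DIFF(G) / DIFF(G + 1)|,
## computed for G = 2, ..., Gmax - 1.
kl_index <- function(WG, p) {
  DIFF <- function(G) (G - 1)^(2 / p) * WG[G - 1] - G^(2 / p) * WG[G]
  G <- 2:(length(WG) - 1)
  abs(sapply(G, DIFF) / sapply(G + 1, DIFF))
}

KL    <- kl_index(WG, p = nrow(X))
G_hat <- (2:(length(WG) - 1))[which.max(KL)]        # G maximizing KL(G)
```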

More recently, Tibshirani et al. (2001) proposed an approach to estimate the number of clusters in a dataset via the Gap statistic. This method is designed to be applicable to any clustering strategy and dissimilarity measure. The idea is to compare the within-cluster dispersion measure of the observed dataset to its expectation under an appropriate reference null distribution. To do so, they define:

Gap(G) = E^*_p[\log(W_G)] - \log(W_G),

where E^*_p denotes the expectation under a sample of size p from the reference null distribution. The optimal number of clusters Ĝ then corresponds to the value maximizing Gap(G).

The choice of an appropriate reference null distribution is important for applying the Gap statistic method. Tibshirani et al. have chosen the uniformity hypothesis to create the reference null distribution and considered two approaches to construct the support of such a distribution. In the first approach, each reference variable X_{.j}, 1 ≤ j ≤ p, is generated uniformly over the range of the observed values for that variable. In the second approach, the variables are sampled from a uniform distribution over a box aligned with the principal components of the centered design matrix X. The uniformly generated design matrix is then back-transformed to obtain the reference dataset. In both strategies, the items of the reference dataset are generated independently. Whereas the first approach has the advantage of simplicity, the second strategy may be more effective in recovering the underlying cluster structure since it takes into account the shape of the data distribution.
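The two ways of constructing the reference datasets can be sketched as follows; this is a simplified illustration of the ideas described above, with function names of our own choosing.

```r
## First approach: each variable is drawn uniformly over its observed range.
ref_uniform <- function(X) {
  apply(X, 2, function(x) runif(length(x), min(x), max(x)))
}

## Second approach: uniform draws in a box aligned with the principal
## components of the centered design matrix, then back-transformed.
ref_pca <- function(X) {
  Xc <- scale(X, center = TRUE, scale = FALSE)      # centered design matrix
  V  <- svd(Xc)$v                                   # principal axes
  Z  <- Xc %*% V                                    # coordinates on the principal axes
  Z0 <- apply(Z, 2, function(z) runif(length(z), min(z), max(z)))
  Z0 %*% t(V)                                       # back-transform to the original space
}
```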

More particularly, the computational steps of the Gap method are the following (a short sketch of the full procedure is given after the list):

1. For each number of clusters G, 1 ≤ G ≤ G_max, compute the within-cluster dispersion measure W_G.

2. Generate B reference datasets in the way described above. Cluster each of the B reference datasets and calculate W^b_G for b ∈ {1, ..., B} and G ∈ {1, ..., G_max}.

3. Compute the Gap statistic:

Gap(G) = \frac{1}{B} \sum_{b=1}^{B} \log(W_G^b) - \log(W_G).

4. The optimum number of clusters is given by the smallest G such that:

Gap(G) \ge Gap(G+1) - sd_{G+1},

where sd_G is the normalized standard deviation of \log(W_G^b):

sd_G = \sqrt{\left(1 + \frac{1}{B}\right) \frac{1}{B} \sum_{b=1}^{B} \left(\log(W_G^b) - \overline{\log W_G}\right)^2}, \quad \text{with} \quad \overline{\log W_G} = \frac{1}{B} \sum_{b=1}^{B} \log(W_G^b).
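The four steps above can be assembled into a compact R sketch, reusing within_dispersion() and ref_uniform() from the previous sketches together with Ward's clustering; all names and default values (Gmax, B) are illustrative assumptions.

```r
## Gap statistic with B reference datasets drawn uniformly over the range
## of each variable (first approach), following steps 1 to 4.
gap_statistic <- function(X, Gmax = 10, B = 50) {
  wg_curve <- function(Y) {                         # W_G for G = 1, ..., Gmax
    hc <- hclust(dist(Y), method = "ward.D2")
    sapply(1:Gmax, function(G) within_dispersion(Y, cutree(hc, k = G)))
  }
  logW  <- log(wg_curve(X))                                   # step 1
  logWb <- replicate(B, log(wg_curve(ref_uniform(X))))        # step 2: Gmax x B matrix
  gap   <- rowMeans(logWb) - logW                             # step 3
  sdG   <- sqrt(1 + 1 / B) * apply(logWb, 1, sd) * sqrt((B - 1) / B)  # normalized sd
  G_hat <- which(gap[-Gmax] >= gap[-1] - sdG[-1])[1]          # step 4: smallest such G
  list(gap = gap, sd = sdG, G_hat = G_hat)
}
```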
