Evaluating clustering results

From the document The DART-Europe E-theses Portal (pages 32-36)

6. Figure 2.4 shows an example of such an application on the Old Faithful data set.

Algorithm 6: EM Algorithm for the GMM

Initialize Θ randomly

while Learning do

    E-Step: evaluate S = argmax_S p(S|X, Θ)

    forall the x_n ∈ X do

Evaluating the quality of clustering results is a difficult task that has been an active research area for years, with new methods being proposed on a regular basis. The main difficulty lies in the inherently unsupervised nature of clustering itself and the lack of consensus about what a “good clustering” should be. In this context, evaluating a clustering result is always more or less subjective, with each evaluation criterion favoring one concept of a good clustering (shape, compactness, separation, etc.) over the others. Therefore, the notions of good clustering and best clustering depend on both the evaluation criterion and the clustering algorithm, with some evaluation criteria favoring some algorithms over others.

Still, despite this acknowledged subjectivity, a wide range of evaluation criteria is commonly used in machine learning to assess and compare clustering results. Several taxonomies of these evaluation criteria are available in the literature [59, 71, 132], most of them defining three distinct groups:

• Unsupervised indexes, also called internal indexes: they only use internal information from the data as well as the clusters’ characteristics.

• Supervised indexes, also called external indexes: they assess the degree of similarity between a clustering solution and a known partition of the data set (sometimes called a ground truth).

• Relative indexes: they are a separate class of criteria that make it possible to compare several clustering results of the same algorithm. Relative indexes simply use both external and internal criteria to choose the best solution among several proposed partitions.

2.6.1 Unsupervised indexes

Unsupervised evaluation criteria [98] are based on internal information from both the data and the clusters. For instance, several of them are based on the distance between the data and the cluster centroids. These indexes have been introduced based on the simplest principles of what defines a cluster: (1) Objects from a given cluster are supposed to be as close as possible to each other. (2) Objects belonging to different clusters should be well separated and as far apart as possible. To assess these intuitive criteria, most indexes adopt a strategy that consists in measuring the distance between each data element and some object representing the clusters (centroids, representative data elements, etc.). By doing so, it is quite straightforward to evaluate the compactness and separability of the clusters.

However, with no clear definition of what a “good cluster” is, each unsupervised index has its own way of computing the compactness and separability of the clusters and of combining these two values into a final quality criterion.

Some of these criteria can be used as objective functions. The goal of a clustering algorithm would then be to find a solution that optimizes the said objective function (minimizing or maximizing it, depending on the criterion).

Some criteria, however, are too costly to be used in an objective function and are usually only computed once the clustering process is done.

Mean Squared Error (MSE) :

The mean squared error is one of the easiest ways to evaluate the quality of a result for clustering algorithms that use centroids. Given a clustering solution S with K clusters, it can be computed as shown in Equation (2.12), where d(·) is a distance function, |ci| is the number of elements linked to the cluster ci, and µk is the centroid of a cluster ck.

For a clustering result to be considered good, the Mean Squared Error must be as low as possible.
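Equation (2.12) is not reproduced in this excerpt, so the following is only a minimal sketch under stated assumptions: it takes the MSE to be the average, over the K clusters, of the mean squared Euclidean distance between each member and its centroid. The function name, the per-cluster normalization by |ck| and the choice of Euclidean distance for d(·) are all assumptions, not the source's definition.

```python
import numpy as np

def mean_squared_error(X, labels, centroids):
    """Assumed form: average over clusters of the mean squared
    Euclidean distance between each member and its centroid."""
    per_cluster = []
    for k, mu in enumerate(centroids):
        members = X[labels == k]
        if len(members) == 0:
            continue  # empty clusters are skipped (a choice of this sketch)
        sq_dist = np.sum((members - mu) ** 2, axis=1)
        per_cluster.append(sq_dist.mean())
    return float(np.mean(per_cluster))

# Toy data: two clusters of two points each (hypothetical values).
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 2.0]])
mse = mean_squared_error(X, labels, centroids)
```

On this toy data the first cluster contributes a mean squared distance of 1 and the second a mean squared distance of 4, so the lower-is-better reading of the criterion is easy to check by hand.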

Dunn Index (DI) :

The Dunn index [40] is another internal criterion, defined as in Equation (2.13), where D(ci, cj) is a distance metric between two clusters ci and cj, and ∆i is a measure of scatter for a cluster ci. Any quality index using this kind of ratio is called a “Dunn-like index”.

A higher Dunn index indicates a better clustering.

$$DU = \frac{\min_{i \neq j} D(c_i, c_j)}{\max_{i \in [1..K]} \Delta_i} \tag{2.13}$$

One particularity of the Dunn Index is that the distances D(ci, cj) and ∆i can be defined in many different ways:

• D(ci, cj) can be the smallest distance between two objects belonging to ci and cj respectively, see Equation (2.1). In this case ∆i must be the largest distance between two objects belonging to a cluster ci:

$$\Delta_i = \max_{x, y \in c_i} d(x, y) \tag{2.14}$$

• D(ci, cj) can be the largest distance between two objects belonging to ci and cj respectively, see Equation (2.2). This is quite uncommon and gives rather poor results. In this case ∆i must be the smallest distance between two objects belonging to a cluster ci:

$$\Delta_i = \min_{x, y \in c_i} d(x, y) \tag{2.15}$$

• D(ci, cj) can also be the distance between the centroids of ci and cj, see Equation (2.4). Then ∆i usually is the largest distance between an object belonging to ci and its centroid µi:

$$\Delta_i = \max_{x \in c_i} d(x, \mu_i) \tag{2.16}$$

• Finally, D(ci, cj) can be the average distance between the data belonging to ci and cj respectively, see Equation (2.3). Then ∆i usually is the mean distance between all pairs of a cluster ci:

$$\Delta_i = \frac{1}{|c_i|(|c_i| - 1)} \sum_{\substack{x, y \in c_i \\ x \neq y}} d(x, y) \tag{2.17}$$
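The first variant in the list above (smallest distance between objects of two different clusters for D, largest distance between two objects of the same cluster for ∆) can be sketched as follows. This is an illustrative sketch, not the source's implementation: the function names are hypothetical, Euclidean distance is assumed for d(·), and the partition is assumed to contain at least two clusters, one of which has more than one object.

```python
import numpy as np

def pairwise(A, B):
    """Euclidean distances between every row of A and every row of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def dunn_index(X, labels):
    """Dunn index with D = closest pair across two clusters and
    Delta = cluster diameter (largest intra-cluster distance)."""
    clusters = [X[labels == k] for k in np.unique(labels)]
    max_scatter = max(pairwise(c, c).max() for c in clusters)
    min_separation = min(
        pairwise(clusters[i], clusters[j]).min()
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    )
    return min_separation / max_scatter

# Toy data: two clusters of two points each (hypothetical values).
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])
di = dunn_index(X, labels)
```

Here the closest pair across clusters is at distance 10 and the largest cluster diameter is 4. Swapping in another variant only requires changing how `max_scatter` and `min_separation` are computed.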

Davies-Bouldin Index (DB) :

The Davies-Bouldin Index [36] is a possible alternative to the Dunn Index. It assesses whether the clusters are compact and well separated. It is based on the same possible measures of separation D(ci, cj) and of scatter ∆i as the Dunn Index. Given a clustering solution that contains K clusters, the Davies-Bouldin Index is defined as shown in Equation (2.18).

The Davies-Bouldin index is not normalized and a lower value indicates a better quality.

The most commonly used measures of separation and scatter for the Davies-Bouldin Index, as introduced in the original article [36], are based on the cluster mean values using Equations (2.4) and (2.16).
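Equation (2.18) is not reproduced in this excerpt. The sketch below assumes the widely used form of the index, DB = (1/K) Σi max over j≠i of (∆i + ∆j) / D(ci, cj), with D taken as the Euclidean distance between centroids. The scatter is computed from the distances between the objects of a cluster and its centroid; the `reduce` parameter (a hypothetical name) selects how those distances are aggregated, with the largest distance as default to match the description of ∆i given above for centroid-based separation.

```python
import numpy as np

def davies_bouldin(X, labels, reduce=np.max):
    """Assumed form: mean over clusters of the worst
    (scatter_i + scatter_j) / centroid_distance(i, j) ratio."""
    ids = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in ids])
    scatter = np.array([
        reduce(np.linalg.norm(X[labels == k] - mu, axis=1))
        for k, mu in zip(ids, centroids)
    ])
    K = len(ids)
    worst = []
    for i in range(K):
        ratios = [
            (scatter[i] + scatter[j])
            / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(K) if j != i
        ]
        worst.append(max(ratios))
    return float(np.mean(worst))

# Toy data: two clusters of two points each (hypothetical values).
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])
db = davies_bouldin(X, labels)
```

Passing `reduce=np.mean` gives a mean-distance scatter instead; on this toy data both choices coincide because each object is equidistant from its centroid.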

Silhouette Index (SC) :

The Silhouette index [112] is yet another internal criterion that assesses the compactness of the clusters and whether or not they are well separated. The main difference between the Silhouette index and the Dunn index or the Davies-Bouldin index is the following: the Silhouette Index can be computed for a given object x, a given cluster ci, or for the whole clustering C.

For a data element, the Silhouette index is defined as shown in Equation (2.19), where $a_x$ is the mean distance between the observed object x and all other objects that belong to the same cluster as x, and $b_x$ is the mean distance between x and all other objects that are not in the same cluster as x.

$$SC(x) = \frac{b_x - a_x}{\max(a_x, b_x)} \tag{2.19}$$

The Silhouette index takes values between −1 and 1. A positive value ($a_x < b_x$) means that x is closer to the objects belonging to its cluster than to the objects belonging to other clusters. Therefore a positive value close to 1 means that x is probably in the right cluster, while a negative one means that x should be in another cluster.

The Silhouette index for a given cluster ci is the mean value of the Silhouette index computed on all the objects of this cluster:

$$SC(c_i) = \frac{1}{|c_i|} \sum_{x \in c_i} SC(x) \tag{2.20}$$

A Silhouette index that is positive and close to 1 means that the observed cluster is both compact and well separated from the other clusters.

Finally, the Silhouette index can be computed on the whole partition as shown in Equation (2.21). Once again, 1 is the best value meaning that the clusters are very compact and well separated and −1 is the worst value. Usually, the Silhouette Index must be positive for a clustering result to be considered acceptable.

$$SC(C) = \frac{1}{K} \sum_{i=1}^{K} SC(c_i) \tag{2.21}$$

Before using the Silhouette index, one must consider its high computational cost on large data sets: since it is not based on the mean vectors, the Silhouette index requires computing the pairwise distances between all the data elements several times.
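The three levels of the Silhouette index described above (object, cluster, whole partition) can be sketched as follows. The sketch follows this section's definition of the two mean distances, where the second one is taken over all objects outside the cluster of x; many library implementations instead use the mean distance to the nearest other cluster. The partition-level value is taken here as the mean of the per-cluster values, which is an assumption, as are the function name, the Euclidean distance, and the convention of assigning 0 to objects in singleton clusters.

```python
import numpy as np

def silhouette(X, labels):
    """Return (per-object, per-cluster, whole-partition) Silhouette values."""
    # Full pairwise distance matrix: this is the costly O(n^2) step.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    n = len(X)
    sc_x = np.zeros(n)
    for idx in range(n):
        same = labels == labels[idx]
        other = ~same
        same[idx] = False  # exclude x itself from its own cluster
        if not same.any() or not other.any():
            continue  # singleton cluster or single-cluster partition: keep 0
        a = dist[idx, same].mean()
        b = dist[idx, other].mean()
        sc_x[idx] = (b - a) / max(a, b)
    per_cluster = {
        int(k): float(sc_x[labels == k].mean()) for k in np.unique(labels)
    }
    overall = float(np.mean(list(per_cluster.values())))
    return sc_x, per_cluster, overall

# Toy data: two clusters of two points each (hypothetical values).
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])
sc_x, per_cluster, overall = silhouette(X, labels)
```

The explicit n × n distance matrix makes the cost discussed above visible: memory and time both grow quadratically with the number of objects.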

Wemmert-Gancarski Index (WG) :

The Wemmert-Gancarski index [147] is another elegant way to assess the quality of a clustering result based on the compactness and separability of the clusters. For a cluster ci, it is computed as follows, with $j = \arg\min_{k \neq i} d(x, \mu_k)$:

The Wemmert-Gancarski index takes its values between 0 and 1, 1 meaning that the clusters are very compact and well separated. When applied to a complete clustering result, it is defined as follows:

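$$WG(C) = \frac{1}{N} \sum_{i=1}^{K} |c_i| \, WG(c_i)$$

where N is the total number of objects in the data set.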
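The per-cluster equation is not reproduced in this excerpt. The sketch below assumes the commonly cited form of the index: for each object x of a cluster ci, the ratio d(x, µi) / d(x, µj) is computed, where µj is the nearest centroid other than its own; the cluster value is max(0, 1 − mean ratio), and the partition value is the size-weighted mean of the cluster values. The function name and the use of Euclidean distance are assumptions, and the sketch does not guard against an object lying exactly on another cluster's centroid.

```python
import numpy as np

def wemmert_gancarski(X, labels):
    """Return (per-cluster values, whole-partition value), assumed form."""
    ids = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in ids])
    # dist[n, k] = distance between object n and centroid k
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    own = np.searchsorted(ids, labels)  # column of each object's own centroid
    n = len(X)
    d_own = dist[np.arange(n), own]
    masked = dist.copy()
    masked[np.arange(n), own] = np.inf  # hide own centroid before the argmin
    d_nearest_other = masked.min(axis=1)
    ratio = d_own / d_nearest_other
    per_cluster = np.array([
        max(0.0, 1.0 - ratio[own == i].mean()) for i in range(len(ids))
    ])
    sizes = np.array([(own == i).sum() for i in range(len(ids))])
    overall = float((sizes * per_cluster).sum() / n)
    return per_cluster, overall

# Toy data: two clusters of two points each (hypothetical values).
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])
wg_clusters, wg = wemmert_gancarski(X, labels)
```

Unlike the Silhouette index, this computation only needs the distances between objects and centroids, so its cost grows linearly with the number of objects for a fixed K.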

2.6.2 Supervised indexes

Whenever the real classes of the objects are known, it is possible to compare the result of a clustering with the real partition. While these external criteria are not proper clustering indexes, they are the most convenient way to evaluate a new clustering algorithm, by applying it to a data set for which the classes or clusters are known. Indexes that make it possible to rate a clustering based on the comparison between a clustering partition and the real classes are called supervised indexes or external criteria, because they rely on information that comes neither from the data nor from the clustering but from an external ground truth used for comparison purposes.

A non-exhaustive list of external criteria is available in Appendix B.
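As an illustration of how such a comparison with a ground truth works, the sketch below computes the Rand index, one widely used external criterion: the fraction of object pairs on which the clustering and the reference partition agree (both put the pair together, or both separate it). This excerpt does not say which criteria appear in Appendix B, so the choice of the Rand index and the function name are assumptions made for this example only.

```python
from itertools import combinations

def rand_index(labels_pred, labels_true):
    """Fraction of object pairs treated the same way by both partitions."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_pred = labels_pred[i] == labels_pred[j]
        same_true = labels_true[i] == labels_true[j]
        agree += same_pred == same_true
        total += 1
    return agree / total

# Hypothetical clustering result and ground truth for five objects.
pred = [0, 0, 1, 1, 1]
true = [0, 0, 0, 1, 1]
ri = rand_index(pred, true)
```

Because only pair co-membership is compared, the value does not depend on how the clusters are numbered: relabeling the clusters of `pred` leaves the result unchanged.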
