
3.2 Variants of the k-means algorithm

Over the years, many variants of the standard k-means algorithm have been proposed, because the standard algorithm has a few well-known issues. It may be slow to converge, it may reach a local optimal solution which is not the global one, and it may produce empty clusters. The convergence speed depends on the number of iterations needed to stabilize the clusters, and the computational cost is mainly due to the evaluation of distances and to the computation of the centers of the clusters. Moreover, in the k-means algorithm the k clusters are not constrained to contain a predefined number of samples, and hence the algorithm may produce empty clusters. An empty cluster has no practical meaning, and therefore constraints need to be considered so that the algorithm avoids creating empty clusters. The following focuses on ideas and strategies developed with the aim of overcoming these problems. The k-means algorithm is often found in the literature under other names representing various modifications of the main algorithm. In [237], an inventory of the 10 best-known algorithms for data mining is presented, and only the k-means algorithm itself, not any of its variants, is included. This suggests that the basic algorithm is used much more often than its variants. However, some of the following ideas for overcoming the limitations of k-means can be useful in particular practical cases.

In order to improve the performance of the algorithm, a simple variation of Lloyd's algorithm is proposed in [110]. In the literature, this variation is sometimes referred to as the h-means algorithm, and sometimes as the k-means algorithm itself, because the two are very similar. Figure 3.9 shows the h-means algorithm. The only difference between the k-means and h-means algorithms lies in the computation of the centers of the clusters. In the algorithm in Figure 3.2 the centers are recomputed inside the for loop, whereas in the h-means algorithm they are recomputed only after the for loop. Therefore, even when a sample migrates from one cluster to another, the new centers are not recomputed immediately; they are recomputed only once the for loop has finished. In terms of Voronoi diagrams, the algorithm changes as shown in Figure 3.10. Even though only a very small change is applied to the standard algorithm, the h-means algorithm can provide different solutions. The optimal partition obtained depends on the random initial partition in clusters.

randomly assign each sample to one of the k clusters S(j), 1 ≤ j ≤ k
compute c(j) for each cluster S(j)
while (clusters are not stable)
    for each sample Sample(i)
        compute the distances between Sample(i) and all the centers c(j)
        find j* such that c(j*) is the closest to Sample(i)
        assign Sample(i) to the cluster S(j*)
    end for
    recompute all the centers
end while

Fig. 3.9 The h-means algorithm.
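
To make the procedure of Figure 3.9 concrete, the following is a minimal Python sketch of the h-means iteration; the function name h_means, the use of NumPy, the Euclidean distance and the handling of empty clusters by a random placeholder center are assumptions made here for illustration, not part of the original algorithm.

import numpy as np

def h_means(X, k, max_iter=100, seed=None, init_labels=None):
    # Sketch of the h-means iteration of Figure 3.9: all centers are
    # recomputed only after the full pass over the samples.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # random initial partition, unless a starting partition is supplied
    labels = rng.integers(0, k, size=n) if init_labels is None else init_labels.copy()
    for _ in range(max_iter):
        # compute c(j) for each cluster S(j); an empty cluster temporarily
        # gets a random sample as center (the k-means+ strategy discussed
        # later handles empty clusters more carefully)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else X[rng.integers(0, n)] for j in range(k)])
        # assign every sample to the cluster whose center is closest
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # clusters are stable
            break
        labels = new_labels
    return labels, centers

Here labels[i] is the index of the cluster to which sample i belongs, and centers[j] plays the role of the center c(j).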

randomly assign each sample to one of the k clusters S(j), 1 ≤ j ≤ k
compute c(j) for each cluster S(j)
while (clusters are not stable)
    build the Voronoi diagram of the set of centers c(j)
    for all the samples Sample(i)
        locate the cell Sample(i) is contained in
        assign Sample(i) to the cluster whose center generates such cell
    end for
    recompute all the centers
end while

Fig. 3.10 The h-means algorithm presented in terms of Voronoi diagram.

Just like k-means, the h-means algorithm can be seen as a method for local optimization. The h-means algorithm improves the current partition iteration after iteration by reducing the value of the error function (3.1). After one iteration, the obtained partition is either better than the previous one or exactly the same. The values of the error function thus form a non-increasing sequence, and therefore the algorithm converges toward a local minimum of the function. For this reason, both the k-means and h-means algorithms are usually carried out many times using different starting partitions.

Different partitions in clusters can be randomly generated and the algorithm can be carried out once for each of them. In general, the algorithm can provide a different solution for each run. The greater the number of executions of the algorithm, the greater the chances of finding the globally optimal partition. This procedure can be used with both the k-means and h-means algorithms independently.
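
A multistart wrapper along these lines can be sketched as follows; it reuses the hypothetical h_means function above together with a partition_error helper that sums the distances between the samples and their centers, in the spirit of the error function (3.1).

def partition_error(X, labels, centers):
    # sum of the distances between each sample and the center of its cluster
    return sum(np.linalg.norm(X[i] - centers[labels[i]]) for i in range(len(X)))

def multistart_h_means(X, k, n_starts=20):
    best = None
    for s in range(n_starts):
        labels, centers = h_means(X, k, seed=s)      # one run from a random partition
        err = partition_error(X, labels, centers)
        if best is None or err < best[0]:
            best = (err, labels, centers)
    return best                                      # smallest error found and its partition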

Moreover, the two algorithms can be used together. The h-means algorithm is faster than the k-means algorithm, but the latter has better chances of reaching optimal solutions. Therefore, the h-means algorithm can be used to obtain a partition close to the optimal one, and then the k-means algorithm can be used to locate an optimal solution. This two-phase algorithm is often referred to as the hk-means algorithm.
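
A sketch of this two-phase scheme follows; k_means_online is a hypothetical rendering of the algorithm of Figure 3.2, in which the centers of the two clusters involved are recomputed as soon as a sample migrates, and it assumes that the starting partition contains no empty clusters.

def k_means_online(X, labels, k, max_iter=100):
    labels = labels.copy()
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    for _ in range(max_iter):
        changed = False
        for i in range(len(X)):
            # find j* such that c(j*) is the closest to Sample(i)
            j_star = int(np.linalg.norm(X[i] - centers, axis=1).argmin())
            if j_star != labels[i]:
                old, labels[i] = labels[i], j_star
                # recompute only the centers of the two changed clusters
                for j in (old, j_star):
                    if np.any(labels == j):
                        centers[j] = X[labels == j].mean(axis=0)
                changed = True
        if not changed:          # clusters are stable
            break
    return labels, centers

def hk_means(X, k):
    labels, _ = h_means(X, k)              # fast phase: get close to an optimum
    return k_means_online(X, labels, k)    # refinement phase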

Both the k-means and h-means algorithms require that a value k be fixed before either algorithm is executed. The value k is the number of clusters into which the data are partitioned. In some applications this number is unknown; in that case different k values can be tried and the one providing the partition with the minimum error is retained. The choice of k plays an important role in the success of the algorithm. In some cases, indeed, the k-means and h-means algorithms may provide a final partition with one or more empty clusters. This situation is to be avoided, since the value k represents the number of clusters expected in the partition, and empty clusters have no practical meaning. The k-means+ and h-means+ algorithms use a particular strategy (described in [219]) to prevent the final optimal partition from containing empty clusters. The strategy works as follows.

Either the k-means or the h-means algorithm is first carried out until its halting criterion is satisfied. Then, the obtained partition is checked for the presence of empty clusters. If t clusters are empty, all samples are considered, the t samples with the greatest distance from their respective centers are selected, and each of them is moved into one of the empty clusters. In this way, the new partition has t clusters containing only one sample. At this point, the k-means or h-means algorithm can restart from this new partition and halt when the stopping criterion is satisfied. This procedure is iterated until a partition having only non-empty clusters is obtained. Figure 3.11(a) shows an optimal solution obtained for k = 4 which contains an empty cluster. The figure shows three cells of the Voronoi diagram, each cell coinciding with a cluster of the optimal partition; the clusters of the optimal partition are C, C× and C+. The encircled point in cluster C× (which has the greatest distance from the center of cluster C×) is moved to the empty cluster and a new cell is therefore created. The newly created cluster contains only one sample. The new partition, shown in Figure 3.11(b), is then used by the k-means algorithm as the initial partition, and a new optimal solution without empty clusters is obtained. Figure 3.12 shows the k-means+ algorithm, while Figure 3.13 shows the h-means+ algorithm. In both algorithms, a repeat…until loop is iterated until an optimal partition including only non-empty clusters is obtained.
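
The repair step can be sketched in Python as follows, reusing the hypothetical h_means function above; the Euclidean distance and the helper names are assumptions, and a faithful implementation would follow Figures 3.12 and 3.13 exactly.

def fix_empty_clusters(X, labels, centers, k):
    # move the samples farthest from their own centers into the empty clusters,
    # one sample per empty cluster
    empty = [j for j in range(k) if not np.any(labels == j)]
    if empty:
        d = np.array([np.linalg.norm(X[i] - centers[labels[i]]) for i in range(len(X))])
        farthest = np.argsort(d)[::-1][:len(empty)]
        for i, j in zip(farthest, empty):
            labels[i] = j                  # the new cluster contains only this sample
            centers[j] = X[i]
    return labels, centers

def h_means_plus(X, k, max_rounds=10):
    labels, centers = h_means(X, k)
    for _ in range(max_rounds):            # repeat ... until all clusters are non-empty
        if all(np.any(labels == j) for j in range(k)):
            break
        labels, centers = fix_empty_clusters(X, labels, centers, k)
        labels, centers = h_means(X, k, init_labels=labels)   # restart from the repaired partition
    return labels, centers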

In [101] another variant of the k-means algorithm is presented, referred to as the J-means algorithm. When k is large, some of the centers of the clusters may coincide with, or be very close to, some of the samples. When a cluster contains only one sample, its center corresponds to that sample. In general a cluster has more than one sample, and its center can be very close to one or more of them.

All the samples in the same cluster are similar to their common center. Moreover, if a threshold distance, or positive tolerance tol, is set, then samples whose distance from a center is smaller than tol can be considered as very similar to it. Only a few samples around each center satisfy this rule, and, in the J-means algorithm, these samples are referred to as occupied samples. The basic idea behind this algorithm is to jump from one partition to another by selecting an unoccupied sample as the center of a new cluster. At each iteration of the algorithm, a new cluster whose center is an unoccupied sample is added to the partition. When a new cluster is added, another cluster is deleted in order to keep the value of k constant. The unoccupied sample defining the new cluster and the old cluster to delete are chosen so that the error function (3.1) decreases as much as possible. The J-means algorithm is thus able to reduce the error function value at each iteration, and when it halts an optimal partition is reached. Hybrid algorithms can be developed using the k-means(+), h-means(+) and J-means algorithms. For instance, the partition obtained at each step of the J-means algorithm can be improved by applying one iteration of the k-means or h-means algorithm.
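
A deliberately brute-force sketch of one such jump is given below; it reuses the hypothetical partition_error helper above, treats "adding a cluster and deleting another" as replacing one center by an unoccupied sample, and recomputes the error from scratch, whereas the actual J-means algorithm of [101] evaluates the error change much more efficiently.

def j_means_jump(X, labels, centers, k, tol=1e-3):
    # occupied samples: those within distance tol of some center
    dist_to_centers = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    unoccupied = np.where(dist_to_centers.min(axis=1) > tol)[0]

    best_err = partition_error(X, labels, centers)
    best = None
    for i in unoccupied:                 # candidate new center (an unoccupied sample)
        for j in range(k):               # candidate cluster to delete
            trial_centers = centers.copy()
            trial_centers[j] = X[i]
            trial_labels = np.linalg.norm(
                X[:, None, :] - trial_centers[None, :, :], axis=2).argmin(axis=1)
            err = partition_error(X, trial_labels, trial_centers)
            if err < best_err:
                best_err, best = err, (trial_labels, trial_centers)
    return best                          # None means no improving jump exists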

As mentioned before, the parameter k needs to have a value before the k-means(+), h-means(+) or J-means algorithm can be carried out. Sometimes the value of k can be easily obtained from the real-life application at hand; at other times more than one value may be suitable. In these cases, the algorithms can be carried out more than once and the value providing the best partition can be selected for k.

Another variant of k-means is the Y-means algorithm, designed for cases in which no information on k is available. The value of k is determined during the execution of the algorithm and can range from 1 to the total number of samples.

Fig. 3.11 (a) A partition in 4 clusters in which one cluster is empty (and therefore there is no cell representing it); (b) a new cluster is generated as described by the algorithm in Figure 3.12.

randomly assign each sample to one of the k clusters S(j), 1 ≤ j ≤ k
repeat
    if (some of the clusters S(j) is empty)
        compute the number t of empty clusters
        find the t samples farthest from their centers
        for each of these t samples
            move the sample to an empty cluster
        end for
    end if
    compute c(j) for each cluster S(j)
    while (clusters are not stable)
        for each sample Sample(i)
            compute the distances between Sample(i) and all the centers c(j)
            find j* such that c(j*) is the closest to Sample(i)
            assign Sample(i) to the cluster S(j*)
            recompute the centers of the changed clusters
        end for
    end while
until (all the clusters are non-empty)

Fig. 3.12 The k-means+ algorithm.

During the execution of the algorithm, clusters are deleted and other clusters are added to the current partition, until an optimal partition is obtained. The algorithm searches, for instance, for empty clusters: if there are any, they are deleted. The algorithm also searches for outliers, i.e., samples which are very different from the majority of the samples in the same cluster. If outliers are detected, they are removed from their clusters and used to generate new clusters. This operation splits one cluster into two parts, and therefore, in this case, the value of k increases.

randomly assign each sample to one of the k clusters S(j), 1 ≤ j ≤ k
repeat
    if (some of the clusters S(j) is empty)
        compute the number t of empty clusters
        find the t samples farthest from their centers
        for each of these t samples
            move the sample to an empty cluster
        end for
    end if
    compute c(j) for each cluster S(j)
    while (clusters are not stable)
        for each sample Sample(i)
            compute the distances between Sample(i) and all the centers c(j)
            find j* such that c(j*) is the closest to Sample(i)
            assign Sample(i) to the cluster S(j*)
        end for
        recompute all the centers
    end while
until (all the clusters are non-empty)

Fig. 3.13 The h-means+ algorithm.

The algorithm also looks for adjacent clusters that may overlap with each other. If such clusters are found, they are merged into a single cluster, and the value of k decreases. When the optimal partition is obtained, k has reached its optimal value at the same time.
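
A rough sketch of one such adaptation pass is given below; the outlier rule (a sample more than two standard deviations farther from its center than average) and the merging rule (centers closer than the combined average spread of the two clusters) are assumptions introduced here for illustration and are not the rules published for Y-means.

def y_means_step(X, labels):
    # remove empty clusters by renumbering the labels consecutively
    labels = np.unique(labels, return_inverse=True)[1]
    k = labels.max() + 1
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

    # split: an outlier far from its own center starts a new one-sample cluster (k increases)
    for j in range(k):
        members = np.where(labels == j)[0]
        if len(members) > 2:
            d = np.linalg.norm(X[members] - centers[j], axis=1)
            for i in members[d > d.mean() + 2.0 * d.std()]:
                labels[i] = labels.max() + 1

    # merge: clusters whose centers are closer than their combined spread become one (k decreases)
    labels = np.unique(labels, return_inverse=True)[1]
    k = labels.max() + 1
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    spread = np.array([np.linalg.norm(X[labels == j] - centers[j], axis=1).mean() for j in range(k)])
    for a in range(k):
        for b in range(a + 1, k):
            if np.linalg.norm(centers[a] - centers[b]) < 0.5 * (spread[a] + spread[b]):
                labels[labels == b] = a
    return np.unique(labels, return_inverse=True)[1]

Iterating such a pass, interleaved with h-means iterations, adapts k while clustering; the thresholds would of course need tuning for real data.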

Krishna and Murty [134] combined the basic k-means algorithm with genetic algorithms (GAs) [88]. As explained in Section 1.4, GAs are meta-heuristic methods for global optimization that simulate the evolutionary process of living organisms according to Darwinian theory. The genetic k-means algorithm is a GA in which the crossover operator is replaced by one iteration of the k-means algorithm. At the start, an initial population of chromosomes is randomly generated, each chromosome representing a partition of the data in clusters. As in standard GAs, the selection and mutation operators are used; here, the mutation operator is defined so that the probability of performing a change on a sample is higher if the sample is closer to one of the centers. At each iteration of the algorithm, a partition in clusters is selected from the current population, a mutation is performed on it, and one step of the k-means algorithm is performed. The genetic k-means algorithm performs better than the basic k-means algorithm, because it couples the basic idea of k-means with heuristic evolutionary search. Variations of this algorithm have been proposed in [157, 158].

Many other variants of the standard k-means algorithm can be found in the literature. One of these is the so-called global k-means algorithm [154], a global optimization method which uses the k-means algorithm as a local search procedure. In [68], the k-means algorithm has been modified to avoid unnecessary distance calculations and hence to run faster; this algorithm exploits the well-known triangle inequality. In [26], the performance of the k-means algorithm has been improved by refining the initial, randomly generated partition in clusters.

Another variant of the basic algorithm is the symmetry-based k-means algorithm [50, 51, 223].
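
Several of these refinements are available in off-the-shelf implementations. As a practical illustration (parameter values chosen arbitrarily), scikit-learn's KMeans combines multiple restarts with Elkan's triangle-inequality acceleration:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)                # toy data matrix: 100 samples, 2 features
# n_init=10 runs the algorithm from 10 different initializations and keeps the
# best result; algorithm="elkan" uses the triangle inequality to skip distance
# computations that cannot change an assignment
km = KMeans(n_clusters=4, n_init=10, algorithm="elkan", random_state=0).fit(X)
print(km.labels_)                         # cluster index of each sample
print(km.inertia_)                        # within-cluster sum of squared distances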

Finally, we mention a technique for efficiently clustering a feature-extended set of samples [207]. Precisely, it is supposed that a partition of a set of samples is known, and that a new partition is sought after some features have been added to the representation of the samples. The technique in [207] has been applied to hierarchical clustering. However, as the authors point out, it borrows the concept of the center of a cluster from the k-means approach, and therefore it can be applied to partitioning clustering as well. The idea is to avoid partitioning the data set again from scratch after features are added to the samples, and instead to exploit the previous partition in clusters. The simplest strategy would just keep the samples in the clusters they occupied in the previous partition. However, the introduction of new features for representing the samples may change the clustering, and samples may migrate from one cluster to another. In [207], a rule based on the centers of the clusters is proposed for removing samples from the clusters to which they should no longer belong. The removed samples can then be used to generate new clusters.

In the agglomerative hierarchical approach, the new clusters are successively merged and the samples end up assigned to the correct cluster. In the k-means approach, the removed samples could instead be assigned to the least populated cluster, or distributed among all the clusters at random. The partition obtained in this way would in any case be better than a completely random partition, and the k-means algorithm would reach another optimal partition faster.

In this section, many variants of the standard k-means algorithm have been presented. As discussed above, they are able to overcome some of the issues arising when the k-means approach is used. However, other problems remain. First of all, the basic idea of the method is to use the centers for representing the clusters, and the centers are computed as the mean of all the samples in the same cluster. Unfortunately, a mean is not a good representative of a set of samples if there are outliers. Indeed, the presence of a single outlier can shift the center of a cluster so that it becomes closer to a certain subgroup of samples; if this happens and the k-means algorithm is executed, this subgroup of samples is then moved into the cluster having that center. The partition in clusters can therefore change drastically if outliers are contained in the considered set of data. To avoid this problem, outliers have to be removed prior to the application of the algorithm.

In some cases, for instance when the parameter k is not known, the quality of a partition is evaluated through the value of the error function (3.1). With a fixed k value, the better partitions correspond to the smaller values of the error function. It is much more difficult, instead, to compare error function values across partitions in which the k value changes, because the error values tend to decrease when k is larger. When only one cluster is considered, the error is the sum of all the distances between the samples and the single center. Intuitively, if two clusters are considered, then the distances to the centers are smaller in general, and many non-optimal partitions in two clusters can have an error function value smaller than the one corresponding to the partition in one cluster. The extreme case is the one in which the number of clusters equals the number of samples: in such a case there is only one possible partition, and the value of the error function is zero. This tendency of the error value to decrease when k is increased makes it difficult to establish whether a partition in a larger number of clusters is better than one with fewer clusters. Indeed, a reduction in the error function value might be due only to the increase of the k value, and not to a higher quality of the partition.
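
For reference, assuming the error function (3.1) has the usual within-cluster form (its exact notation may differ in the text), the behaviour described above can be written as

\[
\text{error}(S_1,\dots,S_k) \;=\; \sum_{j=1}^{k} \; \sum_{x_i \in S_j} d(x_i, c_j),
\]

so that for k = 1 the error is the sum of the distances of all samples from the single center, while for k = n every sample is its own center and the error is zero.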
