
2.8 Improving the performance of queries

2.8.4 Clustering

Clustering is a method to group similar points [47]; it is an unsupervised classification process. Membership is not a labeling process; it is a measure of closeness to a cluster representative, the centroid. Clusters are disjoint: no point can belong to more than one cluster. Clustering maximizes intra-cluster similarity and minimizes inter-cluster similarity. In this section we describe the two main types of clustering methods:

Hierarchical clustering is a deterministic group-forming process. It can be agglomerative (each point starts as a cluster, then clusters are combined) or divisive (the whole data set starts as one cluster, then it is split), i.e., respectively bottom-up or top-down. Figure 2.5 shows a synopsis of hierarchical clustering.

Bottom-up starts with n clusters, each containing a single point. A table of inter-cluster distances is constructed, at a high cost. Iteratively, the two closest clusters are merged into a single super-cluster until the tree is constructed. The overall complexity of this process is known to be O(n³) in the worst case, though a direct algorithm fitted to our data set is only in O(n²) thanks to the use of the triangle inequality in metric spaces.
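To make the bottom-up procedure concrete, here is a minimal Python sketch (our own illustration, not the implementation used in this work), assuming the L2 distance and centroid linkage: the two clusters whose centroids are closest are merged until the desired number of clusters remains. It recomputes all inter-cluster distances at each merge, so it corresponds to the high-cost naive variant rather than the optimized metric-space one.

import numpy as np

def agglomerative(points, target_clusters=1):
    # Start with n singleton clusters, each holding one point index.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > target_clusters:
        # Recompute the centroids and find the two closest clusters.
        centroids = [points[c].mean(axis=0) for c in clusters]
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(centroids[a] - centroids[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the two closest clusters into a single super-cluster.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Example: group 100 random 2-D points into 5 clusters.
data = np.random.rand(100, 2)
print(agglomerative(data, target_clusters=5))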

Top-down starts with one big cluster, i.e., the entire database. It then applies a non-hierarchical algorithm (e.g., k-means) to divide the set into two clusters, and repeats this process until each cluster contains one point (or a sufficiently small number of points).
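A corresponding top-down sketch, again purely illustrative: the current set is recursively bisected with a k-means routine (scikit-learn's KMeans is assumed to be available) until every cluster is small enough.

import numpy as np
from sklearn.cluster import KMeans

def divisive(points, max_size=10):
    # Stop splitting when the cluster is small enough.
    if len(points) <= max_size:
        return [points]
    # Split the current cluster into two with k-means (k = 2).
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(points)
    left, right = points[labels == 0], points[labels == 1]
    # Guard against a degenerate split where one side is empty.
    if len(left) == 0 or len(right) == 0:
        return [points]
    return divisive(left, max_size) + divisive(right, max_size)

clusters = divisive(np.random.rand(200, 2), max_size=20)
print([len(c) for c in clusters])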

Figure 2.5 Synopsis of hierarchical clustering

Partitional clustering, also called single-level or flat clustering, is, as its name suggests, a clustering method that splits a data set into a predefined number of groups or clusters. The most popular partitional method is k-means [48], where k stands for the number of clusters, which are iteratively refined. The clustering consists in: 1) selecting k initial centroids, 2) assigning each point to the nearest centroid, 3) recomputing the centroid of each resulting group, and repeating steps 2 and 3 until convergence. The partitioning criterion of k-means is to minimize the average squared distance of each point to its centroid.
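The three steps can be sketched as follows (an illustrative Python version assuming the L2 distance, not the exact variant used later in this work):

import numpy as np

def kmeans(points, k, iterations=20, seed=0):
    rng = np.random.default_rng(seed)
    # 1) Select k initial centroids (here: k distinct random points of the data set).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # 2) Assign each point to its nearest centroid (a nearest neighbor search).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) Recompute each centroid as the mean of the points assigned to it.
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: the process has converged
        centroids = new_centroids
    return centroids, labels

centroids, labels = kmeans(np.random.rand(500, 8), k=10)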

The complexity of k-means can only be derived approximately, as the number of iterations i required to converge to the optimization criterion is not deterministic: it depends on the selection of the initial centroids and on the number of clusters. Its cost can be estimated from the dimensionality d of the data points, the number k of clusters and the number i of iterations to converge, for a database of size n: O(n·k·d·i). Observe that the step determining the best centroid for each point is itself a nearest neighbor search and is in O(n·k·d) for the whole data set.
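As a purely illustrative order of magnitude (the figures below are assumptions, not measurements from this work): for n = 1,000,000 points of dimension d = 128, k = 1,000 clusters and i = 20 iterations, O(n·k·d·i) corresponds to roughly 10⁶ × 10³ × 128 × 20 ≈ 2.6 × 10¹² elementary operations, of which each assignment pass, i.e., the nearest-centroid search, accounts for n·k·d ≈ 1.3 × 10¹¹.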

As can be observed, top-down is faster than bottom-up.


One characteristic of partitioning methods is overlap. Overlap does not mean that a point belongs to more than one cluster, but that the space defined by the enclosing ball of one cluster of points overlaps portions of the space delimited by another enclosing ball.

Remember that the distance used shapes the space. Thus, assuming the L2 distance, hyper-spherical shapes wrap the data points, i.e., it is a ball-partitioning approach.

Searching using clustering is a two-step approach: pruning and refinement. First, a search at a coarse level selects the interesting clusters by computing the distance from the query to each cluster centroid, taking into account the cluster radius and making use of the triangle inequality. Then, a refinement step searches within the selected clusters for the desired kNN of the query.
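The following sketch illustrates the two steps, assuming the L2 distance and clusters stored as (centroid, radius, points) triples; the data layout and function name are our own assumptions, not the actual implementation used in this work.

import heapq
import numpy as np

def knn_search(query, clusters, k=5):
    best = []  # max-heap of (-distance, point) holding the current k nearest neighbors
    # Pruning step: visit clusters by increasing distance from the query to the centroid.
    order = sorted(clusters, key=lambda c: np.linalg.norm(query - c[0]))
    for centroid, radius, points in order:
        d_centroid = np.linalg.norm(query - centroid)
        # Triangle inequality: no point of the cluster can be closer to the query
        # than d(q, centroid) - radius, so the whole cluster can be skipped.
        if len(best) == k and d_centroid - radius > -best[0][0]:
            continue
        # Refinement step: exact distances inside the selected cluster.
        for p in points:
            d = np.linalg.norm(query - p)
            if len(best) < k:
                heapq.heappush(best, (-d, tuple(p)))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, tuple(p)))
    return sorted((-nd, p) for nd, p in best)

Visiting clusters by increasing centroid distance makes the current kth-best distance shrink quickly, so the triangle-inequality test discards more of the remaining clusters.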

For each cluster a representative, called the centroid, is computed to perform the search. The centroid summarizes the cluster properties. The points of a cluster are more similar to each other than to points of other clusters.

Most of the clustering schemes [30][31] construct a hierarchical structure of similar clusters or make use of an index structure to access rapidly some kind of cluster representative, such as the centroid, which typically is the mean vector of features for each cluster. In the searching process, this hierarchy or index is traversed by comparing the query object with the cluster representatives in order to find the clusters of interest.

In both index and cluster-based strategies, the cost of computing the distances to determine similarity can be very high for large multimedia databases.

We are more interested in the partitioning properties of k-means than in grouping, i.e., we use it as a method to partition the database into a determined number of clusters. Indeed, selecting the appropriate number of clusters has been a subject of interesting research.

Some empirical results showed that around √n is an ideal number of clusters [82][37] within the context of specific searching algorithms. More recently, Berrani [13] stated that it is not possible to establish a relation between the number of clusters and the query time. His observations are based on his searching techniques and experiments.

However, we argue that it is possible to estimate the cost of the searching steps according to their complexity and to analytically propose an optimal range of values for the number of clusters, using general searching algorithms.
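To illustrate the kind of analytical estimate we have in mind (a deliberately simplified cost model, stated under our own assumptions rather than taken from [82][37] or [13]): if the pruning step compares the query with the k centroids at a cost of k·d distance components, and the refinement step scans on average one cluster of n/k points at a cost of (n/k)·d, the total search cost is approximately cost(k) ≈ k·d + (n/k)·d. This expression is minimized when its derivative with respect to k vanishes, i.e., when d − n·d/k² = 0, which gives k = √n. This simple model is consistent with the empirical results cited above; since in practice the refinement step visits more than one cluster, the optimum is better seen as a range of values around √n than as a single number.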

Just as a curve is better approximated by more and smaller line segments, the space is better covered by smaller and thus more compact clusters. It is therefore desirable to have many small clusters rather than a few large ones, to better group the data points and hence improve the filtering techniques.
