
Overview of Clustering Methods


For exploratory data mining exercises, clustering methods typically fall into two main categories: agglomerative and partition-based.

Agglomerative clustering methods

Agglomerative clustering methods begin with each item in its own cluster, and then, in a bottom-up fashion, repeatedly merge the two closest groups to form a new cluster. To support this merge process, nearest-neighbor searches are conducted.

Agglomerative clustering methods are often referred to as hierarchical methods for this reason.

A classical example of agglomerative clustering is the iterative determination of the closest pair of points belonging to different clusters, followed by the merging of their corresponding clusters. This process results in the minimum spanning tree (MST) structure. Computing an MST can be performed very quickly. However, because the decision to merge two clusters is based only on information provided by a single pair of points, the MST generally provides clusters of poor quality.
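As an illustration of the merge process just described, the following is a minimal sketch of single-linkage agglomeration on 2-D points, assuming Euclidean distance; the function names, the brute-force pair search, and the example data are illustrative choices of ours, not the chapter's.

```python
# A minimal sketch of single-linkage agglomeration: repeatedly merge the two
# clusters whose closest pair of points is nearest, exactly the single-pair
# merge criterion attributed to MST-based clustering in the text.
import math

def single_linkage(points, k):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[p] for p in points]          # start: one cluster per point
    while len(clusters) > k:
        best = (float("inf"), None, None)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(p, q)
                        for p in clusters[i] for q in clusters[j])
                if d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])       # merge the two closest clusters
        del clusters[j]
    return clusters

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
    print(single_linkage(pts, 2))             # two well-separated pairs
```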

The first agglomerative algorithm to require sub-quadratic expected time, albeit in low-dimensional settings, is DBSCAN (Ester, Kriegel, Sander, & Xu, 1996). The algorithm is regulated by two parameters, which specify the density of the clusters to be retrieved. It achieves its claimed performance in an amortized sense, by placing the points in an R*-tree and using the tree to perform u-nearest-neighbor queries, where u is typically 4. Additional effort is made to help users determine the density parameters, by presenting the user with a profile of the distances between data points and their 4-nearest neighbors. It is the responsibility of the user to find a valley in the distribution of these distances; the position of this valley determines the boundaries of the clusters. Overall, the method requires Θ(n log n) time, given n data points of fixed dimension.
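The distance profile used for parameter selection can be sketched as follows, assuming 2-D points, Euclidean distance, and a brute-force scan standing in for the R*-tree; the function name and example data are illustrative only.

```python
# A minimal sketch of the u-nearest-neighbor distance profile: for each point,
# take the distance to its u-th nearest neighbor (u = 4, as in the text) and
# sort the resulting values. The user would look for a valley in this sorted
# profile to choose the density threshold.
import math

def u_nn_distance_profile(points, u=4):
    profile = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points if q is not p)
        profile.append(dists[u - 1])          # distance to the u-th neighbor
    return sorted(profile)

if __name__ == "__main__":
    pts = [(i, 0) for i in range(10)] + [(100, 100)]
    print(u_nn_distance_profile(pts, u=4))    # the outlier shows up as a jump
```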

Another subfamily of clustering methods imposes a grid structure on the data (Chiu, Wong & Cheung, 1991; Schikuta, 1996; Wang et al., 1997; Zhang, Ramakrishnan, & Livny, 1996). The idea is a natural one: grid boxes containing a large number of points would indicate good candidates for clusters. The difficulty is in determining an appropriate granularity. Maximum entropy discretization (Chiu et al., 1991) allows for the automatic determination of the grid granularity, but the size of the grid generally grows quadratically in the number of data points. Later, the BIRCH method introduced a hierarchical structure, the Clustering Feature Tree (CF-Tree), for the economical storage of grid information (Zhang et al., 1996).
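The grid idea can be sketched in a few lines, assuming 2-D points and a hand-picked cell width and density threshold; choosing these values automatically is exactly the granularity problem noted above, and all names here are illustrative.

```python
# A minimal sketch of grid-based candidate detection: overlay a uniform grid
# on 2-D points and report densely populated cells as cluster candidates.
from collections import Counter

def dense_cells(points, cell_width=1.0, min_count=3):
    counts = Counter((int(x // cell_width), int(y // cell_width))
                     for x, y in points)
    return {cell: c for cell, c in counts.items() if c >= min_count}

if __name__ == "__main__":
    pts = [(0.1, 0.2), (0.3, 0.4), (0.5, 0.1), (7.2, 7.9)]
    print(dense_cells(pts))        # only the cell near the origin is dense
```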

The recent STING method (Wang et al., 1997) combines aspects of these two approaches, again in low-dimensional spatial settings. STING constructs a hierarchical data structure whose root covers the region of analysis. The structure is a variant of a quadtree (Samet, 1989). However, in STING, all leaves are at equal depth in the structure, and represent areas of equal size in the data domain. The structure is built by finding information at the leaves and propagating it to the parents according to arithmetic formulae. STING’s data structure is similar to that of a multidimensional database, and thus can be queried by OLAP users using an SQL-like language. When used for clustering, the query proceeds from the root down, using information about the distribution to eliminate branches from consideration.

As only those leaves that are reached are relevant, the data points under these leaves can be agglomerated. It is claimed that once the search structure is in place, the time taken by STING to produce a clustering will be sub-linear. However, determining the depth of the structure is problematic.
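The bottom-up propagation that STING-like structures rely on can be sketched as follows, assuming an equal-depth grid over the unit square and simple (count, sum) summaries per cell; the depth, the summaries kept, and the function names are illustrative assumptions rather than STING's actual formulae.

```python
# A minimal sketch of bottom-up statistics propagation: leaf cells of an
# equal-depth grid hold (count, sum) summaries, and each parent combines the
# summaries of its four children by simple arithmetic.
def leaf_grid(points, depth, lo=0.0, hi=1.0):
    side = 2 ** depth
    cells = [[(0, 0.0) for _ in range(side)] for _ in range(side)]
    w = (hi - lo) / side
    for x, y in points:
        i = min(int((x - lo) / w), side - 1)
        j = min(int((y - lo) / w), side - 1)
        n, s = cells[i][j]
        cells[i][j] = (n + 1, s + x + y)      # count and a toy attribute sum
    return cells

def coarsen(cells):
    """Combine each 2x2 block of child summaries into one parent summary."""
    side = len(cells) // 2
    return [[tuple(map(sum, zip(cells[2*i][2*j], cells[2*i][2*j+1],
                                cells[2*i+1][2*j], cells[2*i+1][2*j+1])))
             for j in range(side)] for i in range(side)]

if __name__ == "__main__":
    pts = [(0.1, 0.1), (0.2, 0.15), (0.8, 0.9)]
    level = leaf_grid(pts, depth=2)
    while len(level) > 1:
        level = coarsen(level)
    print(level[0][0])             # root summary covers the whole region
```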

STING is a statistical parametric method, and as such can only be used in limited applications. It assumes the data follows a mixture model and works best with knowledge of the distributions involved. However, under these conditions, non-agglomerative methods such as EM (Dempster, Laird & Rubin, 1977), AutoClass (Cheeseman et al., 1988), MML (Wallace & Freeman, 1987) and Gibbs sampling are perhaps more effective.

For clustering two-dimensional points, O(n log n) time is possible (Krznaric & Levcopoulos, 1998), based on a data structure called a dendrogram or proximity tree, which can be regarded as capturing the history of a merge process based on nearest-neighbor information. Unfortunately, such hierarchical approaches had generally been disregarded for knowledge discovery in spatial databases, since it is often unclear how to use the proximity tree to obtain associations (Ester et al., 1996).

While variants emerge from the different ways in which the distance between items is extended to a distance between groups, the agglomerative approach as a whole has three fundamental drawbacks. First, agglomeration does not provide clusters naturally; some other criterion must be introduced in order to halt the merge process and to interpret the results. Second, for large data sets, the shapes of clusters formed via agglomeration may be very irregular, so much so that they defy any attempts to derive characterizations of their member data points. Third, and perhaps the most serious for data mining applications, hierarchical methods usually require quadratic time when applied in general dimensions. This is essentially because agglomerative algorithms must repeatedly extract the smallest distance from a dynamic set that originally has a quadratic number of values.
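The third drawback can be made concrete with a small sketch, assuming 2-D points and Euclidean distance: before any merging takes place, a general agglomerative method already has a quadratic number of candidate distances to manage.

```python
# A minimal sketch of the bottleneck described above: a general agglomerative
# method begins by placing all n(n-1)/2 pairwise distances in a priority queue
# and repeatedly extracts the smallest. The quadratic queue, rather than any
# single extraction, is what dominates the cost for large n.
import heapq, math

def initial_distance_heap(points):
    heap = [(math.dist(p, q), i, j)
            for i, p in enumerate(points)
            for j, q in enumerate(points) if i < j]
    heapq.heapify(heap)                 # O(n^2) items before any merge happens
    return heap

if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (4, 3)]
    print(heapq.heappop(initial_distance_heap(pts)))   # closest pair first
```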

Partition-based clustering methods

The other main family of clustering methods searches for a partition of the data that best satisfies an evaluation function based on a given set of optimization criteria.

Using the evaluation function as a guide, a search mechanism is used to generate good candidate clusters. The search mechanisms of most partition-based clustering methods are variants of a general strategy called hill-climbing. The essential differences among partition-based clustering methods lie in their choice of optimization criteria.

The optimization criteria of all partition-based methods make assumptions, either implicitly or explicitly, regarding the distribution of the data. Nevertheless, some methods are more generally applicable than others in the assumptions they make, and others may be guided by optimization criteria that allow for more efficient evaluation.

One particularly general optimization strategy is that of expectation maximization (EM) (Dempster et al., 1977), a form of maximum-likelihood inference.

At each step, EM methods search for a representative point for each cluster of a candidate clustering. The distances from the representatives to the data elements in their clusters are used as estimates of the error in associating the data elements with this representative. In the next section, we shall focus on two variants of EM, the first being the well-known and widely used k-MEANS heuristic (MacQueen, 1967). This algorithm exhibits linear behavior and is simple to implement; however, it typically produces poor results, requiring complex procedures for initialization (Aldenderfer & Blashfield, 1984; Bradley, Fayyad, & Reina, 1998; Fayyad et al., 1998). The second variant is k-MEDOIDS, which produces clusters of much higher quality, but requires quadratic time.
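A minimal sketch of the k-MEANS heuristic follows, assuming 2-D points, Euclidean distance, and a deliberately naive initialization (the first k points); the helper names are illustrative, and a practical implementation would use the more careful initialization procedures cited above.

```python
# A minimal sketch of the k-MEANS heuristic: alternately assign each point to
# its nearest center (an EM-style expectation step) and move each center to
# the mean of its cluster (a maximization-style step).
import math

def k_means(points, k, iterations=20):
    centers = list(points[:k])                          # naive initialization
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Update step: each center moves to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(x for x, _ in cl) / len(cl),
                              sum(y for _, y in cl) / len(cl))
    return centers, clusters

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(k_means(pts, 2)[0])              # two centers, one per point group
```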

Another partition-based clustering method makes more assumptions regarding the underlying distribution of the data. AutoClass (Cheeseman et al., 1988) partitions the data set into classes using a Bayesian statistical technique. It requires an explicit declaration of how members of a class should be distributed in order to form a probabilistic class model. AutoClass uses a variant of EM, and thus is a randomized hill-climber similar to k-MEANS, with additional techniques for escaping local maxima. It also has the capability of identifying some data points as noise.

Similarly, minimum message length (MML) methods (Wallace & Freeman, 1987) require the declaration of a model. The declaration allows the parameters of a statistical mixture model to be encoded as the first part of a two-part message; the second part of the message is an encoding of the data given these statistical parameters. There is a trade-off between the complexity of the MML model and the quality of fit to the data. There are also difficult optimization problems that must be solved heuristically when encoding parameters in the fewest bits.
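In schematic terms (our paraphrase rather than the chapter's notation), MML selects the model H minimizing the total two-part length MsgLen(H) + MsgLen(D | H) for data D: the first term grows as the mixture model becomes more complex, while the second shrinks as the fit to the data improves, which is precisely the trade-off just described.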

One of the advantages of partition-based clustering is that the optimization criteria lend themselves well to interpretation of the results. However, the family of partition-based clustering strategies includes members that require linear time as well as other members that require more than quadratic time. The main reason for this variation lies in the complexity of the optimization criteria. The more complex criteria tend to be more robust to noise and outliers, but also more expensive to compute. Simpler criteria, on the other hand, may have more local optima where the hill-climber can become trapped.

Nearest-neighbor searching

As we can see, many if not most clustering methods have at their core the computation of nearest neighbors with respect to some distance metric d. To conclude this section, we will formalize the notions of distance and nearest neighbors, and give a brief overview of existing methods for computing nearest neighbors.

Let us assume that we have been given a set S={s1,…,sn} of n objects to be clustered into k groups, drawn from some universal set of objects X. Let us also assume that we have been given a function d:X×X→ℜ for measuring the pairwise similarity between objects of X. If the objects of X are records having D attributes (numeric or otherwise), the time taken to compute d would be independent of n, but dependent on D. The function d is said to be a metric if it satisfies the following conditions:

1. Non-negativity: for all x,y∈X, d(x,y)>0 whenever x≠y, and d(x,y)=0 whenever x=y.

2. Symmetry: for all x,y∈X, d(x,y)=d(y,x).

3. Triangle inequality: for all x,y,z∈X, d(x,z)≤d(x,y)+d(y,z).

Metrics are sometimes called distance functions or simply distances. Well-known metrics include the usual Euclidean and Manhattan distances in spatial settings (both special cases of the Lagrange metric), and the Hamming distance in categorical settings.
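Minimal sketches of the metrics just named follow, assuming numeric vectors for the Euclidean and Manhattan cases and equal-length symbol sequences for the Hamming case; the function names are illustrative.

```python
# Minimal sketches of three common metrics. The Euclidean and Manhattan
# distances are the p = 2 and p = 1 members of the same parametric family
# referred to in the text; the Hamming distance counts mismatched positions.
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    return sum(1 for a, b in zip(x, y) if a != b)

if __name__ == "__main__":
    print(euclidean((0, 0), (3, 4)))   # 5.0
    print(manhattan((0, 0), (3, 4)))   # 7
    print(hamming("ABCD", "ABDD"))     # 1
```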

Formally, a nearest neighbor of s∈S is an element a∈S, a≠s, such that d(s,a)≤d(s,b) for all b∈S, b≠s. The notion can be extended to that of a u-nearest-neighbor set NNu(s)={a1,a2,…,au}, where d(s,ai)≤d(s,b) for all b∈S\NNu(s). The computation of nearest and u-nearest neighbors is a well-studied problem, with applications in such areas as pattern recognition, content-based retrieval of text and images, and video compression, as well as data mining. In two-dimensional spatial settings, very efficient solutions based on the Delaunay triangulation (Aurenhammer, 1991) have been devised, typically requiring O(log n) time to process nearest-neighbor queries after O(n log n) preprocessing time. However, the size of Delaunay structures can be quadratic in dimensions higher than two.
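A brute-force sketch of the u-nearest-neighbor set NNu(s) defined above follows, parameterized by an arbitrary metric d; the function name is illustrative, and Euclidean distance is assumed only in the final example.

```python
# A minimal brute-force sketch of the u-nearest-neighbor set: scan S, sort by
# distance to s, and keep the u closest elements other than s itself. Any
# metric d with signature d(x, y) -> float can be supplied.
import math

def u_nearest_neighbors(s, S, u, d):
    candidates = [a for a in S if a is not s]
    return sorted(candidates, key=lambda a: d(s, a))[:u]

if __name__ == "__main__":
    S = [(0, 0), (1, 0), (0, 2), (5, 5), (6, 6)]
    print(u_nearest_neighbors(S[0], S, u=2, d=math.dist))   # [(1, 0), (0, 2)]
```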

For higher-dimensional vector spaces, many structures have again been proposed for nearest-neighbor and range queries, the most prominent ones being kd-trees (Bentley, 1975, 1979), quadtrees (Samet, 1989), R-trees (Guttman, 1984), R*-trees (Beckmann, Kriegel, Schneider & Seeger, 1990), and X-trees (Berchtold, Keim, & Kriegel, 1996). All use the coordinate information to partition the space into a hierarchy of regions. In processing a query, if there is any possibility of a solution element lying in a particular region, then that region must be searched.

Consequently, the number of points accessed may greatly exceed the number of elements sought. This effect worsens as the number of dimensions increases, so much so that the methods become totally impractical for high-dimensional data mining applications. In their excellent survey on searching within metric spaces, Chávez, Navarro, Baeza-Yates and Marroquín (1999) introduce the notion of intrinsic dimension, which is the smallest number of dimensions in which the points may be embedded so as to preserve distances among them. They claim that none of these techniques can cope with an intrinsic dimension of more than 20.

Another drawback of these search structures is that the Lagrange similarity metrics they employ cannot take into account any correlation or ‘cross-talk’ among the attribute values. The M-tree search structure (Ciaccia, Patella & Zezula, 1997) addresses this by organizing the data strictly according to the values of the metric d. This generic structure is also designed to reduce the number of distance computations and page I/O operations, making it more scalable than structures that rely on coordinate information. However, the M-tree still suffers from the ‘curse of dimensionality’ that prevents all these methods from being effective for higher-dimensional data mining.
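The kind of triangle-inequality pruning that metric structures such as the M-tree exploit can be sketched as follows, reduced here to a single pivot object rather than a tree; the names, the range-query form, and the precomputed pivot distances are illustrative assumptions, not the M-tree's actual algorithm.

```python
# A minimal sketch of metric-space pruning with one pivot: distances from the
# pivot to the data are precomputed, and the triangle inequality gives a lower
# bound |d(q,pivot) - d(pivot,a)| <= d(q,a) that rejects many candidates
# without ever computing d(q, a).
import math

def range_search(q, S, r, d, pivot, pivot_dists):
    """pivot_dists[i] holds the precomputed value d(pivot, S[i])."""
    d_q_pivot = d(q, pivot)
    results, evaluated = [], 0
    for a, d_pa in zip(S, pivot_dists):
        if abs(d_q_pivot - d_pa) > r:
            continue                       # a cannot lie within r of q
        evaluated += 1                     # only now do we pay for d(q, a)
        if d(q, a) <= r:
            results.append(a)
    return results, evaluated

if __name__ == "__main__":
    S = [(i, 0) for i in range(20)]
    pivot = (0, 0)
    pivot_dists = [math.dist(pivot, a) for a in S]
    hits, checked = range_search((2, 0), S, r=1.5, d=math.dist,
                                 pivot=pivot, pivot_dists=pivot_dists)
    print(hits, checked)                   # 3 hits, far fewer than 20 checks
```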

If one were to insist (as one should) on using only generic clustering methods that were both scalable and robust, a reasonable starting point would be to look at the optimization criteria of robust methods, and attempt to approximate the choices and behaviors of these methods while still respecting limits on the amount of computational resources used. This is the approach we take in the upcoming sections.
