
Algorithm NNSampler


1. Select a subset R={y1,…,yr} uniformly at random from among all subsets of S of size r. This requires O(r) time.

2. For each si∈S, find its m nearest elements in R. Let Ci = {y∈R | y is one of the m nearest elements of R to si}.

3. For each yj∈R, construct a list or 'bucket' Bj of the elements si∈S for which yj∈Ci.

4. For each si∈S, compute the union Ui of the m buckets to which it belongs (that is, Ui = ∪_{yj∈Ci} Bj).

5. For each si∈S, find the u closest points of Ui to si, and use them to form the adjacency list of si in PD(S).
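The following sketch shows one way the five steps above might be organized. It is illustrative only: it assumes the point set S is given as a list of vectors and that a distance callable dist is available (both hypothetical names, not from the text), and it omits the bucket-replacement and bucket-expansion refinements discussed below.

```python
import numpy as np

def nn_sampler(S, r, m, u, dist):
    """Approximate u-nearest-neighbor lists for S, following steps 1-5 above (a sketch)."""
    n = len(S)
    # Step 1: choose the random sample R of r points (here, r indices into S).
    sample_idx = np.random.choice(n, size=r, replace=False)
    # Step 2: for each point, find its m nearest sample points (r*n distance computations).
    d_to_sample = np.array([[dist(S[i], S[j]) for j in sample_idx] for i in range(n)])
    nearest_m = np.argsort(d_to_sample, axis=1)[:, :m]   # C_i, as positions within R
    # Step 3: bucket B_j gathers every point that chose sample j among its m nearest.
    buckets = [[] for _ in range(r)]
    for i in range(n):
        for j in nearest_m[i]:
            buckets[j].append(i)
    # Steps 4-5: union the m buckets of each point and keep its u closest members.
    adjacency = []
    for i in range(n):
        U_i = set().union(*(buckets[j] for j in nearest_m[i])) - {i}
        adjacency.append(sorted(U_i, key=lambda i2: dist(S[i], S[i2]))[:u])
    return adjacency
```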

Note that the distance between any pair of data points need be computed no more than twice. Consequently, the total number of distance computations required by the basic method is in O(rn + Σ_{i=1,…,n} |Ui|). If the points of S are distributed evenly among the r buckets, the number of distance calculations simplifies to O(rn + mn²/r). This is minimized when r is chosen to be (mn)^{1/2}, yielding O(n(mn)^{1/2}) distance calculations.
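As a quick check of the stated choice of r, balancing the two cost terms gives:

```latex
\frac{d}{dr}\!\left(rn + \frac{mn^2}{r}\right) = n - \frac{mn^2}{r^2} = 0
\;\Longrightarrow\; r = (mn)^{1/2},
\qquad rn + \frac{mn^2}{r}\,\Big|_{r=(mn)^{1/2}} = 2\,n\,(mn)^{1/2} \in O\!\left(n(mn)^{1/2}\right).
```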

However, in practice, some buckets could receive more elements than others; if any one bucket were to receive a linear number of elements, the number of distance computations would become quadratic. On the other hand, any bucket that receives a disproportionately large number of elements immediately indicates a cluster in the data, as it would have been chosen as one of the m near neighbors of many data points. If the user is unwilling or unable to declare the existence of a cluster based on this sample point, the overfull bucket can simply be discarded, and a new random point selected to replace it. By managing the process carefully, it is not hard to see that a replacement bucket can be generated using n distance computations.

Another complication that can arise in practice is when Ui contains fewer than u points. In this case, it is a simple matter to expand the number of buckets contributing to Ui until it contains at least u points. If this is done carefully, no additional distance computations are required.

Algorithm NNSampler was implemented and tested on the Reuters data set, which has previously been used in the analysis of several data mining applications (Bradley & Fayyad, 1998; Fayyad, 1998). The Reuters set consists of records of categorical data coded as integers, each record having 302 attributes. Two sets of runs were performed, one set with n=1000 records, and the other with n=10,000. The sample sizes were chosen to be roughly n^{1/2}: r=32 for the first set, and r=100 for the second. For each set, the number of near neighbors computed was u=10 and u=20.

To test the accuracy of the near-neighbor generation, the full set of distances was calculated, and the true u-nearest-neighbor lists were compared with the approximate lists. The accuracy of the approximate lists is shown in Table 1, along with the time needed to compute the lists, in CPU seconds (the confidence intervals shown are at 95%). In the case where u=20, the closest 10 elements on the approximate list are compared with the 10 elements on the exact list.

The lower accuracy rate in the case of n=10,000 and u=10 is due to the high number of neighbors having identical distances: when many near neighbors lie at the same distance from a given data point, the selection of u near neighbors among them is performed arbitrarily.

Random partitioning for the TWGD-problem

We now illustrate a general non-representative randomized clustering strategy, based on a two-phase enhanced version of the interchange heuristic for the TWGD-problem. The strategy is divide-and-conquer: in the first phase, we partition the set of points randomly, and compute a clustering of each partition set. For the merge step, we perform an aggregation of the elements based on the clusters generated in the first phase. Before giving the details of the method, we require some terminology and notation.

The assignment of a data element to a cluster can be viewed as a labeling of that data element with the index associated with that cluster. Each modification performed by an interchange heuristic would thus result in a re-labeling of one of the data elements.

Table 1: Testing algorithm NNSampler versus brute force calculation

                        Execution Time              Precision
                        n=1000    n=10,000          n=1000    n=10,000
Brute Force             38.5 s    3765.9 s          100%      100%
NNSampler (u=10)        12.6 s    360 ± 20 s        91%       73 ± 3%
NNSampler (u=20)        18.6 s    720 ± 35 s        98%       90 ± 4%

The cluster to which si belongs in Pt will be denoted by Ct[si]. Conversely, the elements of the j-th cluster at time t will be denoted by Ct,j. We also evaluate si for its quality as a discrete representative of the j-th cluster in Pt, using the L1 loss function L1(si,t,j) = Σ_{si'∈Ct,j} wi' · d(si,si'). In the preprocessing step of the first phase, data structures are constructed that maintain information about the partition in a feasible solution Pt, and the sum of distances of each point to the items of each cluster. A linear array of indices is used to maintain Ct[si], the assignment of data elements to clusters for the current solution Pt. A table M[i,j] of k columns and n rows is used to store the loss function values L1(si,t,j). Since initializing the row of M for a single si requires O(n) distance calculations, initializing the entire table requires Θ(n²) distance calculations, but only O(kn) space.
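A minimal sketch of this preprocessing step, under the same assumptions as before (a hypothetical dist callable, per-point weights w, and an initial labeling labels of the n points into k clusters):

```python
import numpy as np

def init_loss_table(S, w, labels, k, dist):
    """Build the n-by-k table M of L1 loss values: Theta(n^2) distance computations, O(kn) space."""
    n = len(S)
    M = np.zeros((n, k))
    for i in range(n):
        for i2 in range(n):
            # add w_{i'} * d(s_i, s_{i'}) to the column of the cluster currently containing s_{i'}
            M[i, labels[i2]] += w[i2] * dist(S[i], S[i2])
    return M
```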

The matrix M facilitates the implementation of the heuristics for the TWGD(P)-problem. For the interchange at time t for item si, we find the index jmin of the smallest value in the row for si in M; that is, L1(si,t,jmin) = min_{j=1,…,k} L1(si,t,j). This clearly can be done in O(k) time. If jmin = Ct[si], the point si does not change cluster membership, and Pt+1 = Pt. However, if jmin ≠ Ct[si], we have found an improvement over the current partition Pt, with si assigned to cluster jmin. We let jold ← Ct[si] and Ct+1[si] ← jmin. We also update the information in the matrix M: for every si', we update its row M_{i',*} by setting

M_{i',jold} ← M_{i',jold} − wi · d(si,si'),    M_{i',jmin} ← M_{i',jmin} + wi · d(si,si').

In either case, the total number of distance calculations in one interchange is in O(n).
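One interchange of the matrix-based variant could then look as follows (again a sketch with the same hypothetical names): pick the best column for si, relabel it if this improves on the current cluster, and patch the two affected columns of M in O(n).

```python
def interchange_step(i, S, w, labels, M, dist):
    """Attempt to move point i to its best cluster; return True if the labeling changed."""
    j_min = int(M[i].argmin())            # cluster minimizing L1(s_i, t, j), found in O(k) time
    j_old = labels[i]
    if j_min == j_old:                    # no improvement: P_{t+1} = P_t
        return False
    labels[i] = j_min
    for i2 in range(len(S)):              # update rows M_{i',*} for the two affected columns
        delta = w[i] * dist(S[i], S[i2])
        M[i2, j_old] -= delta
        M[i2, j_min] += delta
    return True
```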

Clearly, the clustering computed is the same as for the standard TWGD interchange heuristic. This matrix-based variant (referred to as TWGD-median) is admittedly more complex than the standard interchange heuristic, but it will allow us to develop a faster approximation algorithm for the TWGD-problem. The algorithm starts by randomly partitioning S into smaller subsets Y1,…,Yr. We let r∈{1,…,n} be an integer parameter. The random partition can be obtained by generating a permutation S' of S uniformly at random (in O(n) time) and dividing the sequence S' into r consecutive blocks Y1,…,Yr, each containing roughly n/r elements.
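The random partition itself is straightforward; a sketch using only the standard library:

```python
import random

def random_blocks(S, r):
    """Split S into r consecutive blocks of a uniformly random permutation, each of size about n/r."""
    S_perm = list(S)
    random.shuffle(S_perm)                  # uniform random permutation in O(n) time
    size = (len(S_perm) + r - 1) // r       # ceil(n / r)
    return [S_perm[b * size:(b + 1) * size] for b in range(r)]
```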

We will run the interchange heuristic separately on each of the blocks Yb, b=1,…,r. The result will be a collection of r clusterings, each consisting of k clusters.

In the second phase, these collections of clusters C1b,…,Ckb of the blocks Yb, b=1,…,r, are used in turn to influence the construction of a clustering for the entire set S. The execution proceeds as if we were using the TWGD interchange on the whole set, except that distance calculations to arbitrary points are replaced by calculations to their representative discrete medians, defined below.

The first step of the second phase is the extraction of the discrete median of each cluster Cjb of each block (j=1,…,k and b=1,…,r). Formally, the discrete median cjb is a point si∈Yb⊆S belonging to cluster Cjb such that L1(cjb,t,j) ≤ L1(s,t,j) for all data points s in Cjb. Computing the discrete median can be done simply by finding the smallest value in the j-th column of the matrix Mb of the block Yb, and identifying the row where that value occurs. Since cjb will be used to represent all points in Cjb, we assign to it the aggregation of weights in Cjb; that is, w(cjb) = Σ_{si'∈Cjb} wi'.
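A sketch of the median-extraction step for a single block Yb, using that block's loss table Mb and its labels (hypothetical names, as above); each median is returned together with its aggregated weight:

```python
def block_medians(Y_b, w_b, labels_b, M_b, k):
    """Return the k discrete medians of one block as (point, aggregated weight) pairs."""
    medians = []
    for j in range(k):
        members = [i for i in range(len(Y_b)) if labels_b[i] == j]
        best = min(members, key=lambda i: M_b[i, j])   # argmin of column j over cluster C_j^b
        medians.append((Y_b[best], sum(w_b[i] for i in members)))
    return medians
```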

Next, on the collection of rk discrete medians obtained, a TWGD-style k-clustering is performed. The k groups of medians indicate which block clusters Cjb could be merged to produce the best clustering of S into k groups.

The aggregation interchange heuristic uses this information as follows. When a point si in group j is being assessed for migration to group j', we consider whether the contribution Σ_{si'≠si, si'∈cluster j} w(si)·w(si')·d(si,si') is larger than Σ_{si'≠si, si'∈cluster j'} w(si)·w(si')·d(si,si'). If the former is larger than the latter, si is migrated from cluster j to cluster j'. In this criterion, the sum of all pairwise distances between the points represented by si and the points represented by si' is approximated using the aggregated information from the matrices Mb.
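A sketch of this migration test over the weighted medians (hypothetical representation: medians is a list of (point, aggregated weight, current group) triples, and dist is the same assumed distance callable):

```python
def should_migrate(c, w_c, j, j_prime, medians, dist):
    """Return True if moving the median c (and the points it represents) from group j to j' pays off."""
    def cost(group):
        # weighted cost of c against all other medians currently in the given group
        return sum(w_c * w2 * dist(c, c2)
                   for (c2, w2, g2) in medians
                   if g2 == group and c2 is not c)
    return cost(j) > cost(j_prime)
```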

Since the blocks have size Θ(n/r), the application of the aggregation version of TWGD-median to all blocks Yb requires O(r·(n/r)²) = O(n²/r) distance computations.

The aggregation version of the hill-climber integrating these results will work with rk items per data element, and thus will require O(rkn) distance computations per complete scan through S. The overall number of distance computations is therefore in O((rk + n/r)·n). This is minimized when r is chosen in O((n/k)^{1/2}), yielding O(n^{3/2}·k^{1/2}) distance computations.
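The choice of r again follows from balancing the two terms:

```latex
\frac{d}{dr}\!\left(rkn + \frac{n^2}{r}\right) = kn - \frac{n^2}{r^2} = 0
\;\Longrightarrow\; r = \left(\frac{n}{k}\right)^{1/2},
\qquad rkn + \frac{n^2}{r}\,\Big|_{r=(n/k)^{1/2}} = 2\,n^{3/2}k^{1/2} \in O\!\left(n^{3/2}k^{1/2}\right).
```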

To illustrate the scalability of our methods, we implemented the three algorithms discussed here: the original interchange method for TWGD (which we will call TWGD-quadratic), our enhanced version TWGD-median, and our randomized approximation algorithm (which we will call TWGD-random). We used synthetic data, generated as a mixture of 10 probability distributions in 2D. We generated data sets of different sizes, from n=4000 data items to n=1,000,000. The results are displayed graphically in Figure 1, on a logarithmic scale.

Figure 1: CPU-time comparison of TWGD-quadratic, TWGD-median and TWGD-random (CPU time versus number of 2D data items)

Algorithm TWGD-median is 4 to 5 times faster than TWGD-quadratic; for n=5000 it requires only 48s, while TWGD-quadratic requires 207s. However, both TWGD-quadratic and TWGD-median exhibit quadratic time complexity. Our divide-and-conquer TWGD-random is radically faster, being able to cluster 1,000,000 points in the same CPU time that TWGD-median takes for only slightly more than 20,000 points, and that the original TWGD-quadratic takes for just over 11,000. An example illustrating clustering of mixture models for 3D data appears in Estivill-Castro & Houle (2001).
