
Algorithm NNSampler


1. Select a subset R={y1,…,yr} uniformly at random from among all subsets of S of size r. This requires O(r) time.

2. For each si∈S, find its m nearest elements in R. Let Ci = {y∈R | y is one of the m nearest elements of R to si}.

3. For each yj∈R, construct a list or 'bucket' Bj of the elements si∈S for which yj∈Ci.

4. For each si∈S, compute the union Ui of the m buckets to which it belongs (that is, Ui = ∪_{yj∈Ci} Bj).

5. For each si∈S, find the u closest points of Ui to si, and use them to form the adjacency list of si in PD(S).
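The following sketch shows one way the five steps above might be organized. It is illustrative only: it assumes the point set S is given as a list of vectors and that a distance callable dist is available (both hypothetical names, not from the text), and it omits the bucket-replacement and bucket-expansion refinements discussed below.

```python
import numpy as np

def nn_sampler(S, r, m, u, dist):
    """Approximate u-nearest-neighbor lists for S, following steps 1-5 above (a sketch)."""
    n = len(S)
    # Step 1: choose the random sample R of r points (here, r indices into S).
    sample_idx = np.random.choice(n, size=r, replace=False)
    # Step 2: for each point, find its m nearest sample points (r*n distance computations).
    d_to_sample = np.array([[dist(S[i], S[j]) for j in sample_idx] for i in range(n)])
    nearest_m = np.argsort(d_to_sample, axis=1)[:, :m]   # C_i, as positions within R
    # Step 3: bucket B_j gathers every point that chose sample j among its m nearest.
    buckets = [[] for _ in range(r)]
    for i in range(n):
        for j in nearest_m[i]:
            buckets[j].append(i)
    # Steps 4-5: union the m buckets of each point and keep its u closest members.
    adjacency = []
    for i in range(n):
        U_i = set().union(*(buckets[j] for j in nearest_m[i])) - {i}
        adjacency.append(sorted(U_i, key=lambda i2: dist(S[i], S[i2]))[:u])
    return adjacency
```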

Note that the distance between any pair of data points need be computed no more than twice. Consequently, the total number of distance computations required by the basic method is in O(rn + Σ_{i=1,…,n} |Ui|). If the points of S are distributed evenly among the r buckets, the number of distance calculations simplifies to O(rn + mn²/r). This is minimized when r is chosen to be (mn)^{1/2}, yielding O(n(mn)^{1/2}) distance calculations.
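As a quick check of the stated choice of r, balancing the two cost terms gives:

```latex
\frac{d}{dr}\!\left(rn + \frac{mn^2}{r}\right) = n - \frac{mn^2}{r^2} = 0
\;\Longrightarrow\; r = (mn)^{1/2},
\qquad rn + \frac{mn^2}{r}\,\Big|_{r=(mn)^{1/2}} = 2\,n\,(mn)^{1/2} \in O\!\left(n(mn)^{1/2}\right).
```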

However, in practice, some buckets could receive more elements than others; if any one bucket were to receive a linear number of elements, the number of distance computations would become quadratic. On the other hand, any bucket that receives a disproportionately large number of elements immediately indicates a cluster in the data, as it would have been chosen as one of the m near neighbors of many data points. If the user is unwilling or unable to declare the existence of a cluster based on this sample point, the overfull bucket can simply be discarded, and a new random point selected to replace it. By managing the process carefully, it is not hard to see that a replacement bucket can be generated using n distance computations.

Another complication that can arise in practice is when Ui contains fewer than u points. In this case, it is a simple matter to expand the number of buckets contributing to Ui until it contains at least u points. If this is done carefully, no additional distance computations are required.

Algorithm NNSampler was implemented and tested on the Reuters data set, which has previously been used in the analysis of several data mining applications (Bradley & Fayyad, 1998; Fayyad, 1998). The Reuters set consists of records of categorical data coded as integers, each record having 302 attributes. Two sets of runs were performed, one set with n=1000 records, and the other with n=10,000. The sample sizes were chosen to be roughly n^{1/2}: r=32 for the first set, and r=100 for the second. For each set, the number of near neighbors computed was u=10 and u=20.

To test the accuracy of the near-neighbor generation, the full set of distances was calculated, and the true u-nearest-neighbor lists were compared with the approximate lists. The accuracy of the approximate lists is shown in Table 1, along with the time needed to compute the lists, in CPU seconds (the confidence intervals shown are at 95%). In the case where u=20, the closest 10 elements on the approximate list are compared with the 10 elements on the exact list.

The lower accuracy rate in the case of n=10,000 and u=10 is due to the high number of neighbors having identical distances: when many near neighbors lie at the same distance from a given data point, the selection of u near neighbors among them is performed arbitrarily.

Random partitioning for the TWGD-problem

We now illustrate a general non-representative randomized clustering strategy, based on a two-phase enhanced version of the interchange heuristic for the TWGD-problem. The strategy is divide-and-conquer: in the first phase, we partition the set of points randomly, and compute a clustering of each partition set. For the merge step, we perform an aggregation of the elements based on the clusters generated in the first phase. Before giving the details of the method, we require some terminology and notation.

The assignment of a data element to a cluster can be viewed as a labeling of that data element with the index associated with that cluster. Each modification performed by an interchange heuristic would thus result in a re-labeling of one of the data elements.

Table 1: Testing algorithm NNSampler versus brute force calculation

                        Execution Time              Precision
                        n=1000    n=10,000          n=1000    n=10,000
Brute Force             38.5 s    3765.9 s          100%      100%
NNSampler (u=10)        12.6 s    360 ± 20 s        91%       73 ± 3%
NNSampler (u=20)        18.6 s    720 ± 35 s        98%       90 ± 4%

The cluster to which si belongs in Pt will be denoted by Ct[si]. Conversely, the elements of the j-th cluster at time t will be denoted by Ct,j. We also evaluate si for its quality as a discrete representative of the j-th cluster in Pt, using the L1 loss function L1(si,t,j) = Σ_{si'∈Ct,j} wi' · d(si,si'). In the preprocessing step of the first phase, data structures are constructed that maintain information about the partition in a feasible solution Pt, and the sum of distances of each point to the items of each cluster. A linear array of indices is used to maintain Ct[si], the assignment of data elements to clusters for the current solution Pt. A table M[i,j] of k columns and n rows is used to store the loss function values L1(si,t,j). Since initializing the row of M for a single si requires O(n) distance calculations, initializing the entire table requires Θ(n²) distance calculations, but only O(kn) space.
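A minimal sketch of this preprocessing step, under the same assumptions as before (a hypothetical dist callable, per-point weights w, and an initial labeling labels of the n points into k clusters):

```python
import numpy as np

def init_loss_table(S, w, labels, k, dist):
    """Build the n-by-k table M of L1 loss values: Theta(n^2) distance computations, O(kn) space."""
    n = len(S)
    M = np.zeros((n, k))
    for i in range(n):
        for i2 in range(n):
            # add w_{i'} * d(s_i, s_{i'}) to the column of the cluster currently containing s_{i'}
            M[i, labels[i2]] += w[i2] * dist(S[i], S[i2])
    return M
```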

The matrix M facilitates the implementation of the heuristics for the TWGD(P)-problem. For the interchange at time t for item si, we find the index jmin of the smallest value in the row for si in M; that is, L1(si,t,jmin) = min_{j=1,…,k} L1(si,t,j). This clearly can be done in O(k) time. If jmin = Ct[si], the point si does not change cluster membership, and Pt+1 = Pt. However, if jmin ≠ Ct[si], we have found an improvement over the current partition Pt, with si assigned to cluster jmin. We let jold ← Ct[si] and Ct+1[si] ← jmin. We also update the information in the matrix M: for every si', we update its row M_{i',*} by setting

M_{i',jold} ← M_{i',jold} − wi · d(si,si'),    M_{i',jmin} ← M_{i',jmin} + wi · d(si,si').

In either case, the total number of distance calculations in one interchange is in O(n).
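One interchange of the matrix-based variant could then look as follows (again a sketch with the same hypothetical names): pick the best column for si, relabel it if this improves on the current cluster, and patch the two affected columns of M in O(n).

```python
def interchange_step(i, S, w, labels, M, dist):
    """Attempt to move point i to its best cluster; return True if the labeling changed."""
    j_min = int(M[i].argmin())            # cluster minimizing L1(s_i, t, j), found in O(k) time
    j_old = labels[i]
    if j_min == j_old:                    # no improvement: P_{t+1} = P_t
        return False
    labels[i] = j_min
    for i2 in range(len(S)):              # update rows M_{i',*} for the two affected columns
        delta = w[i] * dist(S[i], S[i2])
        M[i2, j_old] -= delta
        M[i2, j_min] += delta
    return True
```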

Clearly, the clustering computed is the same as for the standard TWGD interchange heuristic. This matrix-based variant (referred to as TWGD-median) is admittedly more complex than the standard interchange heuristic, but it will allow us to develop a faster approximation algorithm for the TWGD-problem. The algorithm starts by randomly partitioning S into smaller subsets Y1,…,Yr. We let r∈{1,…,n} be an integer parameter. The random partition can be obtained by generating a permutation S' of S uniformly at random (in O(n) time) and dividing the sequence S' into r consecutive blocks Y1,…,Yr, each containing roughly n/r elements.
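The random partition itself is straightforward; a sketch using only the standard library:

```python
import random

def random_blocks(S, r):
    """Split S into r consecutive blocks of a uniformly random permutation, each of size about n/r."""
    S_perm = list(S)
    random.shuffle(S_perm)                  # uniform random permutation in O(n) time
    size = (len(S_perm) + r - 1) // r       # ceil(n / r)
    return [S_perm[b * size:(b + 1) * size] for b in range(r)]
```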

We will run the interchange heuristic separately on each of the blocks Yb, b=1,…,r. The result will be a collection of r clusterings, each consisting of k clusters.

In the second phase, these collections of clusters C1b,…,Ckb of the blocks Yb, b=1,…,r, are used in turn to influence the construction of a clustering for the entire set S. The execution proceeds as if we were using the TWGD interchange on the whole set, except that distance calculations to arbitrary points are replaced by calculations to their representative discrete medians, defined below.

The first step of the second phase is the extraction of the discrete median of each cluster Cjb of each block (j=1,…,k and b=1,…,r). Formally, the discrete median cjb is a point si∈Yb⊆S belonging to cluster Cjb such that L1(cjb,t,j) ≤ L1(s,t,j) for all data points s in Cjb. Computing the discrete median can be done simply by finding the smallest value in the j-th column of the matrix Mb of the block Yb, and identifying the row where that value occurs. Since cjb will be used to represent all points in Cjb, we assign to it the aggregation of weights in Cjb; that is, w(cjb) = Σ_{si'∈Cjb} wi'.
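A sketch of the median-extraction step for a single block Yb, using that block's loss table Mb and its labels (hypothetical names, as above); each median is returned together with its aggregated weight:

```python
def block_medians(Y_b, w_b, labels_b, M_b, k):
    """Return the k discrete medians of one block as (point, aggregated weight) pairs."""
    medians = []
    for j in range(k):
        members = [i for i in range(len(Y_b)) if labels_b[i] == j]
        best = min(members, key=lambda i: M_b[i, j])   # argmin of column j over cluster C_j^b
        medians.append((Y_b[best], sum(w_b[i] for i in members)))
    return medians
```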

Next, on the collection of rk discrete medians obtained, a TWGD-style k-clustering is performed. The k groups of medians indicate which block clusters Cjb could be merged to produce the best clustering of S into k groups.

The aggregation interchange heuristic uses this information as follows. When a point si in group j is being assessed for migration to group j', we consider whether the contribution Σ_{si'≠si, si'∈cluster j} w(si)·w(si')·d(si,si') is larger than Σ_{si'≠si, si'∈cluster j'} w(si)·w(si')·d(si,si'). If the former is larger than the latter, si is migrated from cluster j to cluster j'. In this criterion, the sum of all pairwise distances between the points represented by si and the points represented by si' is approximated using the aggregated information from the matrices Mb.
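A sketch of this migration test over the weighted medians (hypothetical representation: medians is a list of (point, aggregated weight, current group) triples, and dist is the same assumed distance callable):

```python
def should_migrate(c, w_c, j, j_prime, medians, dist):
    """Return True if moving the median c (and the points it represents) from group j to j' pays off."""
    def cost(group):
        # weighted cost of c against all other medians currently in the given group
        return sum(w_c * w2 * dist(c, c2)
                   for (c2, w2, g2) in medians
                   if g2 == group and c2 is not c)
    return cost(j) > cost(j_prime)
```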

Since the blocks have size Θ(n/r), the application of the aggregation version of TWGD-median to all blocks Yb requires O(r·(n/r)²) = O(n²/r) distance computations.

The aggregation version of the hill-climber integrating these results will work with rk items per data element, and thus will require O(rkn) distance computations per complete scan through S. The overall number of distance computations is therefore in O((rk + n/r)·n). This is minimized when r is chosen in O((n/k)^{1/2}), yielding O(n^{3/2}·k^{1/2}) distance computations.
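The choice of r again follows from balancing the two terms:

```latex
\frac{d}{dr}\!\left(rkn + \frac{n^2}{r}\right) = kn - \frac{n^2}{r^2} = 0
\;\Longrightarrow\; r = \left(\frac{n}{k}\right)^{1/2},
\qquad rkn + \frac{n^2}{r}\,\Big|_{r=(n/k)^{1/2}} = 2\,n^{3/2}k^{1/2} \in O\!\left(n^{3/2}k^{1/2}\right).
```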

To illustrate the scalability of our methods, we implemented the three algorithms discussed here: the original interchange method for TWGD (which we will call TWGD-quadratic), our enhanced version TWGD-median, and our randomized approximation algorithm (which we will call TWGD-random). We used synthetic data, generated as a mixture of 10 probability distributions in 2D. We generated data sets of different sizes, from n=4000 data items to n=1,000,000. The results are displayed graphically in Figure 1, on a logarithmic scale.

Figure 1: CPU-time comparison of TWGD-quadratic, TWGD-median and TWGD-random (CPU time versus number of 2D data items)

Algorithm TWGD-median is 4 to 5 times faster than TWGD-quadratic; for n=5000 it requires only 48s, while TWGD-quadratic requires 207s. However, both TWGD-quadratic and TWGD-median exhibit quadratic time complexity. Our divide-and-conquer TWGD-random is radically faster, being able to cluster 1,000,000 points in the same CPU time that TWGD-median takes for only slightly more than 20,000 points, and that the original TWGD-quadratic takes for just over 11,000. An example illustrating clustering of mixture models for 3D data appears in Estivill-Castro & Houle (2001).
