
(1)

Determining the k in k-means with MapReduce

Thibault Debatty, Pietro Michiardi, Wim Mees & Olivier Thonnard

Algorithms for MapReduce and Beyond 2014

(2)

Clustering & k-means

Clustering

K-means

[Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:129–137, 1982.]

– 1982 (a great year!)

– But still widely used

– Drawbacks (among others):

Converges only to a local minimum

k is a parameter!

(3)

Clustering & k-means

Determine k:

– VERY difficult

[Anil K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 2009]

– Using cluster evaluation metrics:

Dunn's index, Elbow, Silhouette, "jump method" (based on information theory), "Gap statistic", ...

– But these require one k-means run per candidate value of k: O(k²) computations overall
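To see where the O(k²) comes from (a back-of-the-envelope count, assuming a bounded number of Lloyd iterations per run): sweeping the candidates j = 1, ..., k means one k-means run per candidate, and a run with j centers costs O(n·j) distance computations per pass, so

\sum_{j=1}^{k} O(n \cdot j) \;=\; O\!\left( n \cdot \frac{k(k+1)}{2} \right) \;=\; O(n k^{2})

which matches the O(nk²) figure quoted in the comparison slides below.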

(4)

G-means

[Greg Hamerly and Charles Elkan. Learning the k in k-means. In Neural Information Processing Systems. MIT Press, 2003]

k-means assumes that the points in each cluster are spherically distributed around the center.

[Figure. Source: scikit-learn]

(5)

G-means

[Greg Hamerly and Charles Elkan. Learning the k in k-means. In Neural Information Processing Systems. MIT Press, 2003]

k-means assumes that the points in each cluster are spherically distributed around the center.

G-means checks this assumption with a normality test, applied recursively.

(6)

G-means

Dataset

(7)

G-means

1. Pick 2 centers

(8)

G-means

2. k-means

(9)

G-means

3. Project

(10)

G-means

3. Project
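As a minimal sketch of this step (our own Java illustration, not the authors' code): each point is reduced to a single scalar, its position along the split vector v = c1 − c2 joining the two tentative child centers, following Hamerly and Elkan's construction x·v / ||v||².

public final class Projection {

    // Scalar projection of point x onto the split vector v = c1 - c2,
    // i.e. x.v / ||v||^2, as in G-means.
    static double project(double[] x, double[] c1, double[] c2) {
        double dot = 0.0, norm2 = 0.0;
        for (int d = 0; d < x.length; d++) {
            double v = c1[d] - c2[d];   // split direction between the two child centers
            dot += x[d] * v;
            norm2 += v * v;
        }
        return dot / norm2;
    }

    public static void main(String[] args) {
        double[] c1 = {1.0, 0.0}, c2 = {-1.0, 0.0};
        double[] p  = {0.5, 2.0};
        System.out.println(project(p, c1, c2));  // 0.25: position along the split axis
    }
}

After this step each cluster is a one-dimensional sample, which is what the normality test on the next slides consumes.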

(11)

G-means

4. Normality test

Normal? No => iterate
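A self-contained sketch of such a test (our illustration of the ADtest called in the later pseudocode): standardize the projected sample, compute the Anderson-Darling A² statistic with the usual small-sample correction, and compare it to the critical value for the chosen significance level. The erf-style normal CDF approximation and the example data are our assumptions.

import java.util.Arrays;

public final class ADTest {

    // Corrected Anderson-Darling statistic A*^2; mean and variance
    // are estimated from the sample itself (the G-means setting).
    static double statistic(double[] x) {
        int n = x.length;
        double mean = Arrays.stream(x).average().orElse(0.0);
        double var = 0.0;
        for (double v : x) var += (v - mean) * (v - mean);
        double sd = Math.sqrt(var / (n - 1));
        double[] z = new double[n];
        for (int i = 0; i < n; i++) z[i] = (x[i] - mean) / sd;
        Arrays.sort(z);
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += (2.0 * i + 1.0)
                 * (Math.log(phi(z[i])) + Math.log(1.0 - phi(z[n - 1 - i])));
        }
        double a2 = -n - sum / n;
        return a2 * (1.0 + 4.0 / n - 25.0 / ((double) n * n));  // small-sample correction
    }

    // Standard normal CDF via the Abramowitz-Stegun 26.2.17 approximation.
    static double phi(double z) {
        double t = 1.0 / (1.0 + 0.2316419 * Math.abs(z));
        double d = 0.3989423 * Math.exp(-z * z / 2.0);
        double p = d * t * (0.319381530 + t * (-0.356563782
                 + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
        return z >= 0 ? 1.0 - p : p;
    }

    public static void main(String[] args) {
        // Uniform data should be rejected: A*^2 lands far above the usual critical values.
        double[] sample = new java.util.Random(42).doubles(500, -1.0, 1.0).toArray();
        System.out.println("A*^2 = " + statistic(sample));
    }
}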

(12)

G-means

5. Recursion

(13)

MapReduce G-means

Challenges:

1. Reduce I/O operations

2. Reduce number of jobs

3. Maximize parallelism

4. Limit memory usage

(14)

MapReduce G-means

Challenges (same list as above)

(15)

MapReduce G-means

2. Reduce number of jobs

PickInitialCenters

while not ClusteringCompleted do
    KMeans                      (repeated until the centers converge)
    KMeansAndFindNewCenters     (last k-means pass, fused with picking new candidate centers)
    TestClusters
end while
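The loop structure might look as follows (a logic-only Java sketch: every method is a hypothetical placeholder standing for one MapReduce job, not the authors' implementation):

// Driver sketch: one G-means round = converged k-means passes, a fused pass
// that also proposes child centers, then a testing job. All stubs are hypothetical.
public class GMeansDriverSketch {

    public static void main(String[] args) {
        pickInitialCenters();                  // seed the clustering
        while (!clusteringCompleted()) {       // until every cluster passes the test
            while (!kMeansConverged()) {
                runKMeansJob();                // plain k-means passes
            }
            runKMeansAndFindNewCentersJob();   // last pass fused with picking child centers
            runTestClustersJob();              // normality test per cluster
        }
    }

    // Stubs so the sketch compiles; a real driver would configure and launch
    // Hadoop jobs here (Job.getInstance(...), job.waitForCompletion(true)).
    static void pickInitialCenters() {}
    static boolean clusteringCompleted() { return true; }
    static boolean kMeansConverged()     { return true; }
    static void runKMeansJob() {}
    static void runKMeansAndFindNewCentersJob() {}
    static void runTestClustersJob() {}
}

Fusing the last k-means pass with the search for new candidate centers is presumably what saves one job per round.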

(16)

MapReduce G-means

TestClusters

Map(key, point)
    Find cluster
    Find vector
    Project point on vector
    Emit(cluster, projection)
end procedure

Reduce(cluster, projections)
    Build a vector
    ADtest(vector)
    if normal then
        Mark cluster
    end if
end procedure

3. Maximize parallelism
4. Limit memory usage
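A Hadoop-style rendering of this pseudocode could look like the sketch below. It reuses the Projection and ADTest sketches from earlier slides; the comma-separated input format, the center arrays (which a real job would load from the distributed cache), and the 1.8692 critical value (often quoted for a very small significance level) are our assumptions, not the authors' code.

import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TestClusters {

    // In a real job these would be loaded from the distributed cache.
    static double[][] centers;                 // current cluster centers
    static double[][] child1, child2;          // tentative child centers per cluster
    static final double CRITICAL = 1.8692;     // assumed Anderson-Darling critical value

    public static class TCMapper
            extends Mapper<LongWritable, Text, IntWritable, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text line, Context ctx)
                throws IOException, InterruptedException {
            double[] p = java.util.Arrays.stream(line.toString().split(","))
                                         .mapToDouble(Double::parseDouble).toArray();
            int c = nearestCluster(p);                                   // find cluster
            double proj = Projection.project(p, child1[c], child2[c]);   // find vector, project
            ctx.write(new IntWritable(c), new DoubleWritable(proj));     // emit(cluster, projection)
        }
    }

    public static class TCReducer
            extends Reducer<IntWritable, DoubleWritable, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable cluster, Iterable<DoubleWritable> projs, Context ctx)
                throws IOException, InterruptedException {
            ArrayList<Double> all = new ArrayList<>();       // one reducer call holds a
            for (DoubleWritable d : projs) all.add(d.get()); // whole cluster in memory
            double[] v = all.stream().mapToDouble(Double::doubleValue).toArray();
            if (ADTest.statistic(v) < CRITICAL) {
                ctx.write(cluster, new Text("normal"));      // mark cluster as accepted
            }
        }
    }

    static int nearestCluster(double[] p) {
        int best = 0;
        double bestD = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double d = 0.0;
            for (int i = 0; i < p.length; i++) {
                double diff = p[i] - centers[c][i];
                d += diff * diff;
            }
            if (d < bestD) { bestD = d; best = c; }
        }
        return best;
    }
}

Note how the reducer materializes every projection of a cluster before testing: that is exactly the bottleneck the next slide points at.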

(17)

MapReduce G-means

TestClusters (pseudocode as above)

Bottleneck: the Reduce phase gathers every projection of a cluster at a single reducer

3. Maximize parallelism
4. Limit memory usage (risk of crash)

(18)

MapReduce G-means

TestClusters (pseudocode as above)

TestFewClusters

Map(key, point)
    Find cluster
    Find vector
    Project point on vector
    Add projection to list
end procedure

Close()
    for each list do
        Build a vector
        A2 = ADtest(vector)
        Emit(cluster, A2)
    end for each
end procedure

In-memory combiner: the projections are buffered and tested inside the mapper
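A sketch of the mapper-side variant (reusing the Projection, ADTest and TestClusters sketches above; the buffering details are our assumptions): projections are accumulated per cluster in the mapper's memory, and the test runs in cleanup(), Hadoop's counterpart of the slide's Close(), so no reducer ever has to hold a whole cluster.

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TestFewClustersMapper
        extends Mapper<LongWritable, Text, IntWritable, DoubleWritable> {

    // One list of projections per cluster, kept in mapper memory
    // (the "in-memory combiner" of the slide).
    private final Map<Integer, ArrayList<Double>> lists = new HashMap<>();

    @Override
    protected void map(LongWritable key, Text line, Context ctx) {
        double[] p = java.util.Arrays.stream(line.toString().split(","))
                                     .mapToDouble(Double::parseDouble).toArray();
        int c = TestClusters.nearestCluster(p);                      // find cluster
        double proj = Projection.project(p, TestClusters.child1[c],
                                            TestClusters.child2[c]); // find vector, project
        lists.computeIfAbsent(c, k -> new ArrayList<>()).add(proj);  // add projection to list
    }

    // Hadoop calls cleanup() once per mapper: the slide's Close().
    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        for (Map.Entry<Integer, ArrayList<Double>> e : lists.entrySet()) {
            double[] v = e.getValue().stream().mapToDouble(Double::doubleValue).toArray();
            double a2 = ADTest.statistic(v);                                 // A2 = ADtest(vector)
            ctx.write(new IntWritable(e.getKey()), new DoubleWritable(a2));  // emit(cluster, A2)
        }
    }
}

Each mapper emits one A² value per cluster it has seen, so the reduce side only has to deal with a few statistics per cluster instead of millions of projections.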

(19)

MapReduce G-means

TestClusters and TestFewClusters (pseudocode as above)

Use TestFewClusters when:
    #clusters > #reducers
    and estimated required memory < Java heap

(20)

MapReduce G-means

TestClusters and TestFewClusters (pseudocode as above)

Use TestFewClusters when:
    #clusters > #reducers
    and estimated required memory < Java heap

Experimentally: ~64 bytes / point
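Written out as a decision rule (a sketch: the cluster, reducer and per-mapper point counts are hypothetical inputs; 64 bytes/point is the experimental figure from this slide):

public class TestStrategy {

    public static void main(String[] args) {
        long numClusters = 500;                        // hypothetical
        long numReducers = 64;                         // hypothetical
        long pointsPerMapper = 2_000_000;              // hypothetical input-split size

        long estimated = 64L * pointsPerMapper;        // ~64 bytes per buffered point
        long heap = Runtime.getRuntime().maxMemory();  // available Java heap

        boolean useTestFewClusters = numClusters > numReducers && estimated < heap;
        System.out.println(useTestFewClusters ? "TestFewClusters" : "TestClusters");
    }
}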

(21)

Comparison

Speed and quality: MR multi-k-means vs. MR G-means.

MR multi-k-means tests all possible values of k in a single job.

(22)

Comparison

              MR multi-k-means             MR G-means

Speed         O(nk²) computations          O(nk) computations
              (all values of k in          But: more iterations and more dataset
              a single job)                reads (~log2(k) rounds, as the number
                                           of centers roughly doubles each round)

Quality                                    New centers added if and where needed
                                           But: tends to overestimate k!

(23)

Experimental results: Speed

Hadoop

Synthetic dataset

10M points in R^10

Euclidean distance

8 machines

(24)

Experimental results: Quality

Hadoop

Synthetic dataset

10M points in R^10

Euclidean distance

8 machines

k (generated)        100     200     400
k found              150     279     639     (~1.5x overestimate)

Within-cluster sum of squares (less is better):

MR G-means           3.34    3.33    3.23
multi-k-means        3.71    3.6     3.39    (with the same k)

(25)

Conclusions & future work...

MapReduce algorithm to determine k

Running time proportional to k

Future:

– Reduce the overestimation of k

– Test on real data

– Test scalability

– Reduce I/O (using Spark)

– Consider skewed data

– Consider impact of machine failure

(26)

Thank you!
