HAL Id: inria-00321515
https://hal.inria.fr/inria-00321515
Submitted on 16 Feb 2011
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
The global k-means clustering algorithm
Aristidis Likas, Nikos Vlassis, Jakob Verbeek
To cite this version:
Aristidis Likas, Nikos Vlassis, Jakob Verbeek. The global k-means clustering algorithm. [Technical Report] IAS-UVA-01-02, 2001, pp. 12. ⟨inria-00321515⟩
Intelligent
Autonomous
Systems
The global k-means clustering algorithm

Aristidis Likas
Department of Computer Science
University of Ioannina
Greece

Nikos Vlassis
Computer Science Institute
Faculty of Science
University of Amsterdam
The Netherlands

Jakob J. Verbeek
Computer Science Institute
Faculty of Science
University of Amsterdam
The Netherlands
We present the global k-means algorithm, which is an incremental approach to clustering that dynamically adds one cluster center at a time through a deterministic global search procedure consisting of N (with N being the size of the data set) executions of the k-means algorithm from suitable initial positions. We also propose modifications of the method to reduce the computational load without significantly affecting solution quality. The proposed clustering methods are tested on well-known data sets and they compare favorably to the k-means algorithm with random restarts.

Keywords: Clustering; k-means algorithm; Global optimization; k-d trees; Data mining.
Contents

1 Introduction
2 The global k-means algorithm
3 Speeding-up execution
  3.1 The fast global k-means algorithm
  3.2 Initialization with k-d trees
4 Experimental results
5 Discussion and conclusions
Intelligent Autonomous Systems, Computer Science Institute
Faculty of Science, University of Amsterdam
Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
tel: +31 20 525 7461, fax: +31 20 525 7490
http://www.science.uva.nl/research/ias/
Corresponding author: Aristidis Likas (arly@cs.uoi.gr)
1 Introduction

A fundamental problem that frequently arises in a great variety of fields such as pattern recognition, image processing, machine learning and statistics is the clustering problem [1]. In its basic form the clustering problem is defined as the problem of finding groups of data points in a given data set. Each of these groups is called a cluster and can be defined as a region in which the density of objects is locally higher than in other regions.

The simplest form of clustering is partitional clustering, which aims at partitioning a given data set into disjoint subsets (clusters) so that specific clustering criteria are optimized. The most widely used criterion is the clustering error criterion, which for each point computes its squared distance from the corresponding cluster center and then takes the sum of these distances for all points in the data set. A popular clustering method that minimizes the clustering error is the k-means algorithm. However, the k-means algorithm is a local search procedure and it is well known that it suffers from the serious drawback that its performance heavily depends on the initial starting conditions [2]. To treat this problem several other techniques have been developed that are based on stochastic global optimization methods (e.g. simulated annealing, genetic algorithms). However, it must be noted that these techniques have not gained wide acceptance, and in many practical applications the clustering method that is used is the k-means algorithm with multiple restarts [1].
In this work we propose the global k-means clustering algorithm, which constitutes a deterministic, effective global clustering algorithm for the minimization of the clustering error that employs the k-means algorithm as a local search procedure. The algorithm proceeds in an incremental way: to solve a clustering problem with M clusters, all intermediate problems with 1, 2, ..., M − 1 clusters are sequentially solved. The basic idea underlying the proposed method is that an optimal solution for a clustering problem with M clusters can be obtained using a series of local searches (using the k-means algorithm). At each local search the M − 1 cluster centers are always initially placed at their optimal positions corresponding to the clustering problem with M − 1 clusters. The remaining M-th cluster center is initially placed at several positions within the data space. Since for M = 1 the optimal solution is known, we can iteratively apply the above procedure to find optimal solutions for all k-clustering problems k = 1, ..., M. In addition to effectiveness, the method is deterministic and does not depend on any initial conditions or empirically adjustable parameters. These are significant advantages over all clustering approaches mentioned above.
The following section starts with a formal definition of the clustering error and a brief description of the k-means algorithm, and then describes the proposed global k-means algorithm. Section 3 describes modifications of the basic method that require less computation at the expense of being slightly less effective. Section 4 provides experimental results and comparisons with the k-means algorithm with multiple restarts. Finally, Section 5 provides conclusions and describes directions for future research.
2 The global k-means algorithm
SupposewearegivenadatasetX=fx
1 ;:::;x N g,x n 2R d
. TheM- lusteringproblemaimsat
partitioningthisdata set into M disjoint subsets ( lusters) C
1
;:::;C
M
,su h thata lustering
riterion is optimized. The most widely used lustering riterion is the sum of the squared
Eu lidean distan es between ea h data point x
i
and the entroid m
k
( luster enter) of the
enters m 1 ;:::;m M : E(m 1 ;:::;m M )= N X i=1 M X k=1 I(x i 2C k )jx i m k j 2 (1)
whereI(X)=1 ifX is trueand 0 otherwise.
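As an illustration, the clustering error of Eq. (1) can be computed as follows (a minimal NumPy sketch; the function name is ours, and each point is assigned to its closest center, which is how the criterion is evaluated for a given set of centers):

```python
import numpy as np

def clustering_error(X, centers):
    """Sum of squared Euclidean distances from each point in X to its
    nearest cluster center (Eq. 1, with points assigned to the closest
    center)."""
    # pairwise squared distances, shape (N, M)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # each point contributes its distance to the closest center
    return d2.min(axis=1).sum()
```

For example, with three points and two centers the error is the sum of three closest-center squared distances.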
The k-means algorithm finds locally optimal solutions with respect to the clustering error. It is a fast iterative algorithm that has been used in many clustering applications. It is a point-based clustering method that starts with the cluster centers initially placed at arbitrary positions and proceeds by moving at each step the cluster centers in order to minimize the clustering error. The main disadvantage of the method lies in its sensitivity to the initial positions of the cluster centers. Therefore, in order to obtain near-optimal solutions using the k-means algorithm, several runs must be scheduled, differing in the initial positions of the cluster centers.
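A minimal sketch of this local search (Lloyd's iteration; the function name, convergence test, and empty-cluster handling are our choices, not the report's code):

```python
import numpy as np

def kmeans(X, centers, max_iter=100):
    """Lloyd's k-means: alternate assignment and centroid-update steps
    until the centers stop moving. Returns (centers, clustering error)."""
    centers = centers.astype(float).copy()
    for _ in range(max_iter):
        # assignment step: each point goes to its closest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: each center moves to the centroid of its cluster
        # (a center that lost all its points is left where it is)
        new = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                        else centers[k] for k in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.min(axis=1).sum()
```

Each iteration can only decrease the clustering error, which is why the algorithm converges to a local optimum determined by the starting centers.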
In this paper, the global k-means clustering algorithm is proposed, which constitutes a deterministic global optimization method that does not depend on any initial parameter values and employs the k-means algorithm as a local search procedure. Instead of randomly selecting initial values for all cluster centers, as is the case with most global clustering algorithms, the proposed technique proceeds in an incremental way, attempting to optimally add one new cluster center at each stage.
More specifically, to solve a clustering problem with M clusters the method proceeds as follows. We start with one cluster (k = 1) and find its optimal position, which corresponds to the centroid of the data set X. In order to solve the problem with two clusters (k = 2) we perform N executions of the k-means algorithm from the following initial positions of the cluster centers: the first cluster center is always placed at the optimal position for the problem with k = 1, while the second center at execution n is placed at the position of the data point x_n (n = 1, ..., N). The best solution obtained after the N executions of the k-means algorithm is considered as the solution for the clustering problem with k = 2. In general, let (m_1(k), ..., m_k(k)) denote the final solution for the k-clustering problem. Once we have found the solution for the (k − 1)-clustering problem, we try to find the solution of the k-clustering problem as follows: we perform N runs of the k-means algorithm with k clusters, where each run n starts from the initial state (m_1(k − 1), ..., m_{k−1}(k − 1), x_n). The best solution obtained from the N runs is considered as the solution (m_1(k), ..., m_k(k)) of the k-clustering problem. By proceeding in the above fashion we finally obtain a solution with M clusters, having also found solutions for all k-clustering problems with k < M.
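The incremental procedure above can be sketched as follows (an illustrative implementation, not the authors' original Matlab code; it bundles a compact Lloyd-style k-means as the local search, and returns one solved center set per value of k):

```python
import numpy as np

def _kmeans(X, centers, max_iter=100):
    """Plain Lloyd iteration used as the local search; returns
    (final centers, clustering error)."""
    centers = centers.astype(float).copy()
    for _ in range(max_iter):
        d2 = ((X[:, None] - centers[None]) ** 2).sum(-1)
        labels = d2.argmin(1)
        new = np.array([X[labels == k].mean(0) if (labels == k).any()
                        else centers[k] for k in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    d2 = ((X[:, None] - centers[None]) ** 2).sum(-1)
    return centers, d2.min(1).sum()

def global_kmeans(X, M):
    """Solve every k-clustering problem for k = 1, ..., M. For each k,
    run k-means N times, each run starting from the solved (k-1)
    centers plus one data point as the new k-th center; keep the best
    run. Returns a list: solutions[k-1] solves the k-clustering problem."""
    solutions = [X.mean(0)[None, :]]            # k = 1: the data centroid
    for k in range(2, M + 1):
        best_c, best_e = None, np.inf
        for x in X:                             # try every point as new center
            c, e = _kmeans(X, np.vstack([solutions[-1], x]))
            if e < best_e:
                best_c, best_e = c, e
        solutions.append(best_c)
    return solutions
```

Note that the N runs for a given k are mutually independent, which is what makes the method deterministic: the result depends only on the data set.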
The latter characteristic can be advantageous in many applications where the aim is also to discover the 'correct' number of clusters. To achieve this, one has to solve the k-clustering problem for various numbers of clusters and then employ appropriate criteria for selecting the most suitable value of k [3]. In this case the proposed method directly provides clustering solutions for all intermediate values of k, thus requiring no additional computational effort.

In terms of computational complexity, the method requires N executions of the k-means algorithm for each value of k (k = 1, ..., M). Depending on the available resources and the values of N and M, the algorithm may be an attractive approach, since, as experimental results indicate, the performance of the method is excellent. Moreover, as we will show later, there are several modifications that can be applied in order to reduce the computational load.
The rationale behind the proposed method is based on the following assumption: an optimal clustering solution with k clusters can be obtained through local search (using k-means) starting from an initial state with the k − 1 centers placed at their optimal positions for the (k − 1)-clustering problem and the remaining k-th center placed at an appropriate position within the data set. This assumption seems very natural: we expect the solution of the k-clustering problem to be reachable (through local search) from the solution of the (k − 1)-clustering problem, once the additional center is placed at an appropriate position within the data set. It is also reasonable to restrict the set of possible initial positions of the k-th center to the set X of available data points. It must be noted that this is a rather computationally heavy assumption and several other options (examining fewer initial positions) may also be considered. The above assumptions are also verified experimentally, since in all experiments (and for all values of k) the solution obtained by the proposed method was at least as good as that obtained using numerous random restarts of the k-means algorithm. In this spirit, we can cautiously state that the proposed method is experimentally optimal (although this is difficult to prove theoretically).
3 Speeding-up execution

Based on the general idea of the global k-means algorithm, several heuristics can be devised to reduce the computational load without significantly affecting the quality of the solution. In the following subsections two modifications are proposed, each one referring to a different aspect of the method.
3.1 The fast global k-means algorithm
Thefastglobalk-meansalgorithm onstitutesastraightforwardmethodtoa eleratetheglobal
k-means algorithm. The dieren e lies in the way a solution for the k- lustering problem is
obtained, given the solutionof the(k 1)- lustering problem. Forea h of theN initial states
(m 1 (k 1);:::;m (k 1) (k 1);x n
) we do not exe ute the k-means algorithm until onvergen e
to obtain the nal lustering error E
n
. Instead we ompute an upper bound E
n
E b
n
on the resulting error E
n
for all possible allo ation positions x
n
, where E is the error in the
(k 1)- lusteringproblem. Wetheninitializethepositionofthenew luster enter atthepoint
x i
thatminimizesE
n
,orequivalentlythatmaximizesb
n
,andexe ute thek-meansalgorithmto
obtainthe solutionwithk lusters. Formally we have
b_n = \sum_{j=1}^{N} \max\left( d^{j}_{k-1} - \|x_n - x_j\|^2, \, 0 \right),    (2)

i = \arg\max_n b_n,    (3)

where d^j_{k−1} is the squared distance between x_j and the closest center among the k − 1 cluster centers obtained so far (i.e., the center of the cluster to which x_j belongs). The quantity b_n measures the guaranteed reduction in the error measure obtained by inserting a new cluster center at position x_n.
Suppose the solution of the (k − 1)-clustering problem is (m_1(k − 1), ..., m_{k−1}(k − 1)) and a new cluster center is added at location x_n. Then the new center will allocate all points x_j whose squared distance from x_n is smaller than the distance d^j_{k−1} from their previously closest center. Therefore, for each such data point x_j the clustering error will decrease by d^j_{k−1} − ‖x_n − x_j‖². The summation over all such data points x_j provides the quantity b_n for a specific insertion location x_n. Since the k-means algorithm is guaranteed to decrease the clustering error at each step, E − b_n upper bounds the error measure that will be obtained if we run the algorithm until convergence after inserting the new center at x_n (this is the error measure used in the global k-means algorithm).
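Eqs. (2)–(3) can be evaluated for all candidate positions at once, as in this illustrative NumPy sketch (the function name is ours; it precomputes the matrix of all pairwise squared distances, which is the implementation device used by the method):

```python
import numpy as np

def insertion_gains(X, centers):
    """For every candidate insertion point x_n, compute the guaranteed
    error reduction b_n of Eq. (2):
        b_n = sum_j max(d^j_{k-1} - ||x_n - x_j||^2, 0),
    where d^j_{k-1} is the squared distance from x_j to its closest
    existing center. The fast global k-means seeds the new center at
    argmax_n b_n (Eq. 3)."""
    # all pairwise squared distances between data points, shape (N, N)
    pair2 = ((X[:, None] - X[None]) ** 2).sum(-1)
    # squared distance of each point to its closest existing center
    d2 = ((X[:, None] - centers[None]) ** 2).sum(-1).min(1)
    # b_n for every candidate n (rows index n, columns index j)
    return np.maximum(d2[None, :] - pair2, 0).sum(1)
```

Since `pair2` never changes, it can be built once when the algorithm starts and reused for every k, which is the 'trick' that makes the bound cheap to compute.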
[Figure 1: clustering error as a function of the number of k-d tree buckets, for the experiment described in Section 3.2.]
Moreover, the cluster insertion procedure can be efficiently implemented by storing in a matrix all pairwise squared distances between points when the algorithm starts, and using this matrix for directly computing the upper bounds above. A similar 'trick' has been used in the related problems of greedy mixture density estimation using the EM algorithm [4] and principal curve fitting [5].

Finally, we may still apply this method, as well as the global k-means algorithm, when we do not consider every data point x_n (n = 1, ..., N) as a possible insertion position for the new center, but use only a smaller set of appropriately selected insertion positions. A fast and sensible choice for selecting such a set of positions based on k-d trees is discussed next.
3.2 Initialization with k-d trees

A k-d tree [6, 7] is a multi-dimensional generalization of the standard one-dimensional binary search tree that facilitates storage and search over k-dimensional data sets. A k-d tree defines a recursive partitioning of the data space into disjoint subspaces. Each node of the tree defines a subspace of the original data space and, consequently, a subset containing the data points residing in this subspace. Each nonterminal node has two successors, each of them associated with one of the two subspaces obtained from the partitioning of the parent space using a cutting hyperplane. The k-d tree structure was originally used for speeding up distance-based search operations like nearest neighbor queries, range queries, etc.

In our case we use a variation of the original k-d tree proposed in [7]. There, the cutting hyperplane is defined as the plane that is perpendicular to the direction of the principal component of the data points corresponding to each node; therefore the algorithm can be regarded as a method for nested (recursive) principal component analysis of the data set. The recursion usually terminates if a terminal node (called a bucket) is created containing fewer than a prespecified number of points b (called the bucket size), or if a prespecified number of buckets have been created. It turns out that, even if the algorithm is not used for nearest neighbor queries, merely the construction of the tree provides a very good preliminary clustering of the data set. The idea is to use the bucket centers (which are fewer than the data points) as possible insertion locations for the algorithms presented previously.
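The bucket construction just described can be sketched as follows. This is a simplified, illustrative version under our own assumptions: it greedily splits the largest remaining bucket with a hyperplane through the bucket mean, perpendicular to the bucket's principal component (obtained via SVD), and stops on either termination condition; the original tree-based traversal order may differ.

```python
import numpy as np

def bucket_centers(X, max_buckets, bucket_size):
    """Recursively split the data with hyperplanes perpendicular to
    each bucket's principal component (a Sproull-style k-d tree) and
    return the bucket centroids as candidate insertion locations."""
    buckets = [X]
    while len(buckets) < max_buckets:
        # split the largest remaining bucket
        i = max(range(len(buckets)), key=lambda j: len(buckets[j]))
        B = buckets.pop(i)
        if len(B) <= bucket_size:          # largest bucket small enough: stop
            buckets.append(B)
            break
        centered = B - B.mean(axis=0)
        # first right-singular vector = principal direction of the bucket
        v = np.linalg.svd(centered, full_matrices=False)[2][0]
        side = centered @ v > 0
        if side.all() or not side.any():   # degenerate split (identical points)
            buckets.append(B)
            break
        buckets.extend([B[side], B[~side]])
    return np.array([b.mean(axis=0) for b in buckets])
```

The returned centroids can then replace the full data set X as the candidate insertion positions in the global or fast global k-means algorithm.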
To evaluate this initialization scheme, we conducted an experiment on 10 artificial data sets, each drawn from a Gaussian mixture. The components of the Gaussian mixture are well separated and exhibit limited eccentricity.
We compare the results of three methods on the clustering problem with k = 15 centers: (i) the dashed line depicts the results when using the fast global k-means algorithm with all data points constituting potential insertion locations; the average clustering error over the 10 data sets is 15.7 with standard deviation 1.2. (ii) The solid line depicts results when the standard k-means algorithm is used: one run for each data set was conducted. At each run the 15 cluster centers were initially positioned at the centroids of the buckets obtained from the application of the k-d tree algorithm until 15 buckets were created. The average clustering error over the 10 data sets is 24.4 with standard deviation 9.8. (iii) The solid line (with error bars) shows the results when using the fast global k-means algorithm with the potential insertion locations constrained to the centroids of the buckets of a k-d tree. On the horizontal axis we vary the number of buckets for the k-d tree of the last method.

We also computed the 'theoretical' clustering error for each data set, i.e., the error computed using the true cluster centers. The average error value over the 10 data sets was 14.9 with standard deviation 1.3. These results were too close to the results of the standard fast global k-means to include them in the figure.
We can conclude from this experiment that (a) the fast global k-means approach gives rise to performance significantly better than when starting with all centers at the same time initialized using the k-d tree method, and (b) restricting the insertion locations for the fast global k-means to those given by the k-d tree (instead of using all data points) does not significantly degrade performance if we consider a sufficiently large number of buckets in the k-d tree (in general larger than the number of clusters).

Obviously, it is also possible to employ the above presented k-d tree approach with the global k-means algorithm.
4 Experimental results

We have tested the proposed clustering algorithms on several well-known data sets, namely the iris data set [8], the synthetic data set [9] and the image segmentation data set [8]. In all data sets we conducted experiments for the clustering problems obtained by considering only feature vectors and ignoring class labels. The iris data set contains 150 four-dimensional data points, the synthetic data set 250 two-dimensional data points, and for the image segmentation data set we consider 210 six-dimensional data points obtained through PCA on the original 18-dimensional data points. The quality of the obtained solutions was evaluated in terms of the values of the final clustering error.

For each data set we conducted the following experiments:

- one run of the global k-means algorithm for M = 15;
- one run of the fast global k-means algorithm for M = 15;
- the k-means algorithm for k = 1, ..., 15. For each value of k, the k-means algorithm was executed N times (where N is the number of data points) starting from random initial positions for the k centers, and we computed the minimum and average clustering error as well as its standard deviation.
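The random-restart baseline in the last item can be sketched as follows (illustrative only: the data here are randomly generated stand-ins, not the data sets above, and the local search is a plain Lloyd iteration rather than the report's Matlab code):

```python
import numpy as np

def kmeans_error(X, centers, max_iter=100):
    """Run Lloyd's k-means from the given centers; return the final error."""
    for _ in range(max_iter):
        d2 = ((X[:, None] - centers[None]) ** 2).sum(-1)
        labels = d2.argmin(1)
        new = np.array([X[labels == k].mean(0) if (labels == k).any()
                        else centers[k] for k in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    d2 = ((X[:, None] - centers[None]) ** 2).sum(-1)
    return d2.min(1).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))              # stand-in data set
k, N = 3, len(X)
# N restarts, each from k randomly chosen data points
errors = [kmeans_error(X, X[rng.choice(N, size=k, replace=False)])
          for _ in range(N)]
best, avg, std = min(errors), float(np.mean(errors)), float(np.std(errors))
```

The minimum over the N restarts is the figure the global k-means results are compared against below.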
For each of the three data sets the experimental results are displayed in Figures 2, 3 and 4, respectively. Each figure plots the clustering error value as a function of the number of clusters. It is clear that the global k-means algorithm is very effective, providing in all cases
[Figures 2-4: clustering error as a function of the number of clusters k for the three data sets, comparing global k-means, fast global k-means, and the average (and, where shown, minimum) error of k-means with random restarts.]
faster, it provides solutions of excellent quality, comparable to those obtained by the original method. Therefore, it constitutes a very efficient algorithm, both in terms of solution quality and computational complexity, and can run even faster if k-d trees are employed as explained in the previous section.

Matlab implementations of the fast global k-means and the k-d tree building algorithms can be downloaded from http://www.science.uva.nl/research/ias.
5 Discussion and conclusions

We have presented the global k-means clustering algorithm, which constitutes a deterministic clustering method providing excellent results in terms of the clustering error criterion. The method is independent of any starting conditions and compares favorably to the k-means algorithm with multiple random restarts. The deterministic nature of the method is particularly important in cases where the clustering method is used either to specify initial parameter values for other methods (for example RBF training) or constitutes a module in a more complex system. In such a case we can be almost certain that the employment of the global k-means (or any of the fast variants) will always provide sensible clustering solutions. Therefore, one can evaluate the complex system and adjust critical system parameters without having to worry about the dependence of system performance on the clustering method employed.

Another advantage of the proposed technique is that in order to solve the M-clustering problem, all intermediate k-clustering problems are also solved for k = 1, ..., M. This may prove useful in many applications where we seek the actual number of clusters and the k-clustering problem is solved for several values of k. We have also developed the fast global k-means algorithm, which significantly reduces the required computational effort, while at the same time providing solutions of almost the same quality.

We have also proposed two modifications of the method that reduce the computational load without significantly affecting solution quality. These methods can be employed to find solutions to clustering problems with thousands of high-dimensional points, and one of our primary aims is to test the techniques on large-scale data mining problems.
A natural direction for future research is parallelization, since the N runs of the k-means algorithm for each value of k are independent and can be performed in parallel. Another research direction concerns the application of the proposed method to other types of clustering (for example fuzzy clustering), as well as to topographic methods like SOM. Moreover, an important issue that deserves further study is the possible development of theoretical foundations for the assumptions behind the method. Finally, it is also possible to employ the global k-means algorithm as a method for providing effective initial parameter values for RBF networks and data modeling problems using Gaussian mixture models, and to compare the effectiveness of the obtained solutions with other training techniques for Gaussian mixture models [10, 11].
Acknowledgements

N. Vlassis and J. J. Verbeek are supported by the Dutch Technology Foundation STW project AIF4997.
References

[1] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.

[2] J. M. Peña, J. A. Lozano, and P. Larrañaga, "An empirical comparison of four initialization methods for the k-means algorithm," Pattern Recognition Letters, vol. 20, pp. 1027-1040, 1999.

[3] G. W. Milligan and M. C. Cooper, "An examination of procedures for determining the number of clusters in a data set," Psychometrika, vol. 50, pp. 159-179, 1985.

[4] N. Vlassis and A. Likas, "A greedy EM algorithm for Gaussian mixture learning," Tech. Rep. IAS-UVA-00-08, Computer Science Institute, University of Amsterdam, The Netherlands, Sept. 2000.

[5] J. J. Verbeek, N. Vlassis, and B. Kröse, "A k-segments algorithm to find principal curves," Tech. Rep. IAS-UVA-00-11, Computer Science Institute, University of Amsterdam, The Netherlands, Nov. 2000.

[6] J. L. Bentley, "Multidimensional binary search trees used for associative searching," Commun. ACM, vol. 18, no. 9, pp. 509-517, 1975.

[7] R. F. Sproull, "Refinements to nearest-neighbor searching in k-dimensional trees," Algorithmica, vol. 6, pp. 579-589, 1991.

[8] C. L. Blake and C. J. Merz, "UCI repository of machine learning databases," University of California, Irvine, Dept. of Information and Computer Sciences, 1998.

[9] B. D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, U.K., 1996.

[10] G. J. McLachlan and D. Peel, Finite Mixture Models, Wiley, New York, 2000.

[11] N. Vlassis and A. Likas, "A kurtosis-based dynamic approach to Gaussian mixture modeling," IEEE Trans. Systems, Man, and Cybernetics, Part A, vol. 29, no. 4, pp. 393-399, 1999.
This report is in the series of IAS technical reports. The series editor is Stephan ten Hagen (stephanh@science.uva.nl). For the titles that appeared in this series, see http://www.science.uva.nl/research/ias/tr/. You may order copies of the IAS technical reports from the corresponding author or the series editor. Most of the reports can also be found on the web pages of the IAS group (see the inside front page).