HAL Id: inria-00321515
https://hal.inria.fr/inria-00321515
Submitted on 16 Feb 2011
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
The global k-means clustering algorithm
Aristidis Likas, Nikos Vlassis, Jakob Verbeek
To cite this version:
Aristidis Likas, Nikos Vlassis, Jakob Verbeek. The global k-means clustering algorithm. [Technical Report] IAS-UVA-01-02, 2001, pp. 12. ⟨inria-00321515⟩
Intelligent
Autonomous
Systems
The global k-means clustering algorithm

Aristidis Likas
Department of Computer Science
University of Ioannina
Greece

Nikos Vlassis
Computer Science Institute
Faculty of Science
University of Amsterdam
The Netherlands

Jakob J. Verbeek
Computer Science Institute
Faculty of Science
University of Amsterdam
The Netherlands
We present the global k-means algorithm, which is an incremental approach to clustering that dynamically adds one cluster center at a time through a deterministic global search procedure consisting of N (with N being the size of the data set) executions of the k-means algorithm from suitable initial positions. We also propose modifications of the method to reduce the computational load without significantly affecting solution quality. The proposed clustering methods are tested on well-known data sets and they compare favorably to the k-means algorithm with random restarts.

Keywords: Clustering; k-means algorithm; Global optimization; k-d trees; Data mining.
Contents

1 Introduction
2 The global k-means algorithm
3 Speeding-up execution
  3.1 The fast global k-means algorithm
  3.2 Initialization with k-d trees
4 Experimental results
5 Discussion and conclusions
Intelligent Autonomous Systems, Computer Science Institute
Faculty of Science, University of Amsterdam
Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
tel: +31 20 525 7461, fax: +31 20 525 7490
http://www.science.uva.nl/research/ias/
Corresponding author: Aristidis Likas (arly@cs.uoi.gr)
1 Introduction

A fundamental problem that frequently arises in a great variety of fields such as pattern recognition, image processing, machine learning and statistics is the clustering problem [1]. In its basic form the clustering problem is defined as the problem of finding groups of data points in a given data set. Each of these groups is called a cluster and can be defined as a region in which the density of objects is locally higher than in other regions.

The simplest form of clustering is partitional clustering, which aims at partitioning a given data set into disjoint subsets (clusters) so that specific clustering criteria are optimized. The most widely used criterion is the clustering error criterion, which for each point computes its squared distance from the corresponding cluster center and then takes the sum of these distances for all points in the data set. A popular clustering method that minimizes the clustering error is the k-means algorithm. However, the k-means algorithm is a local search procedure and it is well known that it suffers from the serious drawback that its performance heavily depends on the initial starting conditions [2]. To treat this problem several other techniques have been developed that are based on stochastic global optimization methods (e.g. simulated annealing, genetic algorithms). However, it must be noted that these techniques have not gained wide acceptance, and in many practical applications the clustering method that is used is the k-means algorithm with multiple restarts [1].
In this work we propose the global k-means clustering algorithm, which constitutes a deterministic, effective global clustering algorithm for the minimization of the clustering error that employs the k-means algorithm as a local search procedure. The algorithm proceeds in an incremental way: to solve a clustering problem with M clusters, all intermediate problems with 1, 2, ..., M − 1 clusters are sequentially solved. The basic idea underlying the proposed method is that an optimal solution for a clustering problem with M clusters can be obtained using a series of local searches (using the k-means algorithm). At each local search the M − 1 cluster centers are always initially placed at their optimal positions corresponding to the clustering problem with M − 1 clusters. The remaining M-th cluster center is initially placed at several positions within the data space. Since for M = 1 the optimal solution is known, we can iteratively apply the above procedure to find optimal solutions for all k-clustering problems k = 1, ..., M. In addition to effectiveness, the method is deterministic and does not depend on any initial conditions or empirically adjustable parameters. These are significant advantages over all clustering approaches mentioned above.
The following section starts with a formal definition of the clustering error and a brief description of the k-means algorithm, and then describes the proposed global k-means algorithm. Section 3 describes modifications of the basic method that require less computation at the expense of being slightly less effective. Section 4 provides experimental results and comparisons with the k-means algorithm with multiple restarts. Finally, Section 5 provides conclusions and describes directions for future research.
2 The global k-means algorithm
SupposewearegivenadatasetX=fx
1 ;:::;x N g,x n 2R d
. TheM- lusteringproblemaimsat
partitioningthisdata set into M disjoint subsets ( lusters) C
1
;:::;C
M
,su h thata lustering
riterion is optimized. The most widely used lustering riterion is the sum of the squared
Eu lidean distan es between ea h data point x
i
and the entroid m
k
( luster enter) of the
enters m 1 ;:::;m M : E(m 1 ;:::;m M )= N X i=1 M X k=1 I(x i 2C k )jx i m k j 2 (1)
whereI(X)=1 ifX is trueand 0 otherwise.
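As an illustration, the clustering error of Eq. (1) can be computed as follows (a minimal NumPy sketch; the function name is ours, and each point is assigned to its closest center, which is how the criterion is evaluated for a given set of centers):

```python
import numpy as np

def clustering_error(X, centers):
    """Sum of squared Euclidean distances from each point in X to its
    nearest cluster center (Eq. 1, with points assigned to the closest
    center)."""
    # pairwise squared distances, shape (N, M)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # each point contributes its distance to the closest center
    return d2.min(axis=1).sum()
```

For example, with three points and two centers the error is the sum of three closest-center squared distances.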
The k-means algorithm finds locally optimal solutions with respect to the clustering error. It is a fast iterative algorithm that has been used in many clustering applications. It is a point-based clustering method that starts with the cluster centers initially placed at arbitrary positions and proceeds by moving at each step the cluster centers in order to minimize the clustering error. The main disadvantage of the method lies in its sensitivity to the initial positions of the cluster centers. Therefore, in order to obtain near-optimal solutions using the k-means algorithm, several runs must be scheduled, differing in the initial positions of the cluster centers.
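A minimal sketch of this local search (Lloyd's iteration; the function name, convergence test, and empty-cluster handling are our choices, not the report's code):

```python
import numpy as np

def kmeans(X, centers, max_iter=100):
    """Lloyd's k-means: alternate assignment and centroid-update steps
    until the centers stop moving. Returns (centers, clustering error)."""
    centers = centers.astype(float).copy()
    for _ in range(max_iter):
        # assignment step: each point goes to its closest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: each center moves to the centroid of its cluster
        # (a center that lost all its points is left where it is)
        new = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                        else centers[k] for k in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.min(axis=1).sum()
```

Each iteration can only decrease the clustering error, which is why the algorithm converges to a local optimum determined by the starting centers.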
In this paper, the global k-means clustering algorithm is proposed, which constitutes a deterministic global optimization method that does not depend on any initial parameter values and employs the k-means algorithm as a local search procedure. Instead of randomly selecting initial values for all cluster centers, as is the case with most global clustering algorithms, the proposed technique proceeds in an incremental way, attempting to optimally add one new cluster center at each stage.
More specifically, to solve a clustering problem with M clusters the method proceeds as follows. We start with one cluster (k = 1) and find its optimal position, which corresponds to the centroid of the data set X. In order to solve the problem with two clusters (k = 2) we perform N executions of the k-means algorithm from the following initial positions of the cluster centers: the first cluster center is always placed at the optimal position for the problem with k = 1, while the second center at execution n is placed at the position of the data point x_n (n = 1, ..., N). The best solution obtained after the N executions of the k-means algorithm is considered as the solution for the clustering problem with k = 2. In general, let (m_1(k), ..., m_k(k)) denote the final solution for the k-clustering problem. Once we have found the solution for the (k − 1)-clustering problem, we try to find the solution of the k-clustering problem as follows: we perform N runs of the k-means algorithm with k clusters, where each run n starts from the initial state (m_1(k − 1), ..., m_{k−1}(k − 1), x_n). The best solution obtained from the N runs is considered as the solution (m_1(k), ..., m_k(k)) of the k-clustering problem. By proceeding in the above fashion we finally obtain a solution with M clusters, having also found solutions for all k-clustering problems with k < M.
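The incremental procedure above can be sketched as follows (an illustrative implementation, not the authors' original Matlab code; it bundles a compact Lloyd-style k-means as the local search, and returns one solved center set per value of k):

```python
import numpy as np

def _kmeans(X, centers, max_iter=100):
    """Plain Lloyd iteration used as the local search; returns
    (final centers, clustering error)."""
    centers = centers.astype(float).copy()
    for _ in range(max_iter):
        d2 = ((X[:, None] - centers[None]) ** 2).sum(-1)
        labels = d2.argmin(1)
        new = np.array([X[labels == k].mean(0) if (labels == k).any()
                        else centers[k] for k in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    d2 = ((X[:, None] - centers[None]) ** 2).sum(-1)
    return centers, d2.min(1).sum()

def global_kmeans(X, M):
    """Solve every k-clustering problem for k = 1, ..., M. For each k,
    run k-means N times, each run starting from the solved (k-1)
    centers plus one data point as the new k-th center; keep the best
    run. Returns a list: solutions[k-1] solves the k-clustering problem."""
    solutions = [X.mean(0)[None, :]]            # k = 1: the data centroid
    for k in range(2, M + 1):
        best_c, best_e = None, np.inf
        for x in X:                             # try every point as new center
            c, e = _kmeans(X, np.vstack([solutions[-1], x]))
            if e < best_e:
                best_c, best_e = c, e
        solutions.append(best_c)
    return solutions
```

Note that the N runs for a given k are mutually independent, which is what makes the method deterministic: the result depends only on the data set.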
The latter characteristic can be advantageous in many applications where the aim is also to discover the 'correct' number of clusters. To achieve this, one has to solve the k-clustering problem for various numbers of clusters and then employ appropriate criteria for selecting the most suitable value of k [3]. In this case the proposed method directly provides clustering solutions for all intermediate values of k, thus requiring no additional computational effort.

In terms of computational complexity, the method requires N executions of the k-means algorithm for each value of k (k = 1, ..., M). Depending on the available resources and the values of N and M, the algorithm may be an attractive approach, since, as experimental results indicate, the performance of the method is excellent. Moreover, as we will show later, there are several modifications that can be applied in order to reduce the computational load.
The rationale behind the proposed method is based on the following assumption: an optimal clustering solution with k clusters can be obtained through local search (using k-means) starting from an initial state with the k − 1 centers placed at their optimal positions for the (k − 1)-clustering problem and the remaining k-th center placed at an appropriate position within the data set. This assumption seems very natural: we expect the solution of the k-clustering problem to be reachable (through local search) from the solution of the (k − 1)-clustering problem, once the additional center is placed at an appropriate position within the data set. It is also reasonable to restrict the set of possible initial positions of the k-th center to the set X of available data points. It must be noted that this is a rather computationally heavy assumption and several other options (examining fewer initial positions) may also be considered. The above assumptions are also verified experimentally, since in all experiments (and for all values of k) the solution obtained by the proposed method was at least as good as that obtained using numerous random restarts of the k-means algorithm. In this spirit, we can cautiously state that the proposed method is experimentally optimal (although this is difficult to prove theoretically).
3 Speeding-up execution

Based on the general idea of the global k-means algorithm, several heuristics can be devised to reduce the computational load without significantly affecting the quality of the solution. In the following subsections two modifications are proposed, each one referring to a different aspect of the method.
3.1 The fast global k-means algorithm
Thefastglobalk-meansalgorithm onstitutesastraightforwardmethodtoa eleratetheglobal
k-means algorithm. The dieren e lies in the way a solution for the k- lustering problem is
obtained, given the solutionof the(k 1)- lustering problem. Forea h of theN initial states
(m 1 (k 1);:::;m (k 1) (k 1);x n
) we do not exe ute the k-means algorithm until onvergen e
to obtain the nal lustering error E
n
. Instead we ompute an upper bound E
n
E b
n
on the resulting error E
n
for all possible allo ation positions x
n
, where E is the error in the
(k 1)- lusteringproblem. Wetheninitializethepositionofthenew luster enter atthepoint
x i
thatminimizesE
n
,orequivalentlythatmaximizesb
n
,andexe ute thek-meansalgorithmto
obtainthe solutionwithk lusters. Formally we have
b_n = \sum_{j=1}^{N} \max\left( d^{j}_{k-1} - \|x_n - x_j\|^2, \, 0 \right),    (2)

i = \arg\max_n b_n,    (3)

where d^j_{k−1} is the squared distance between x_j and the closest center among the k − 1 cluster centers obtained so far (i.e., the center of the cluster to which x_j belongs). The quantity b_n measures the guaranteed reduction in the error measure obtained by inserting a new cluster center at position x_n.
Suppose the solution of the (k − 1)-clustering problem is (m_1(k − 1), ..., m_{k−1}(k − 1)) and a new cluster center is added at location x_n. Then the new center will allocate all points x_j whose squared distance from x_n is smaller than the distance d^j_{k−1} from their previously closest center. Therefore, for each such data point x_j the clustering error will decrease by d^j_{k−1} − ‖x_n − x_j‖². The summation over all such data points x_j provides the quantity b_n for a specific insertion location x_n. Since the k-means algorithm is guaranteed to decrease the clustering error at each step, E − b_n upper bounds the error measure that will be obtained if we run the algorithm until convergence after inserting the new center at x_n (this is the error measure used in the global k-means algorithm).
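Eqs. (2)–(3) can be evaluated for all candidate positions at once, as in this illustrative NumPy sketch (the function name is ours; it precomputes the matrix of all pairwise squared distances, which is the implementation device used by the method):

```python
import numpy as np

def insertion_gains(X, centers):
    """For every candidate insertion point x_n, compute the guaranteed
    error reduction b_n of Eq. (2):
        b_n = sum_j max(d^j_{k-1} - ||x_n - x_j||^2, 0),
    where d^j_{k-1} is the squared distance from x_j to its closest
    existing center. The fast global k-means seeds the new center at
    argmax_n b_n (Eq. 3)."""
    # all pairwise squared distances between data points, shape (N, N)
    pair2 = ((X[:, None] - X[None]) ** 2).sum(-1)
    # squared distance of each point to its closest existing center
    d2 = ((X[:, None] - centers[None]) ** 2).sum(-1).min(1)
    # b_n for every candidate n (rows index n, columns index j)
    return np.maximum(d2[None, :] - pair2, 0).sum(1)
```

Since `pair2` never changes, it can be built once when the algorithm starts and reused for every k, which is the 'trick' that makes the bound cheap to compute.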
[Figure 1: clustering error as a function of the number of k-d tree buckets, for the experiment described in Section 3.2.]
Moreover, the cluster insertion procedure can be efficiently implemented by storing in a matrix all pairwise squared distances between points when the algorithm starts, and using this matrix for directly computing the upper bounds above. A similar 'trick' has been used in the related problems of greedy mixture density estimation using the EM algorithm [4] and principal curve fitting [5].

Finally, we may still apply this method, as well as the global k-means algorithm, when we do not consider every data point x_n (n = 1, ..., N) as a possible insertion position for the new center, but use only a smaller set of appropriately selected insertion positions. A fast and sensible choice for selecting such a set of positions based on k-d trees is discussed next.
3.2 Initialization with k-d trees

A k-d tree [6, 7] is a multi-dimensional generalization of the standard one-dimensional binary search tree that facilitates storage and search over k-dimensional data sets. A k-d tree defines a recursive partitioning of the data space into disjoint subspaces. Each node of the tree defines a subspace of the original data space and, consequently, a subset containing the data points residing in this subspace. Each nonterminal node has two successors, each of them associated with one of the two subspaces obtained from the partitioning of the parent space using a cutting hyperplane. The k-d tree structure was originally used for speeding up distance-based search operations like nearest neighbor queries, range queries, etc.

In our case we use a variation of the original k-d tree proposed in [7]. There, the cutting hyperplane is defined as the plane that is perpendicular to the direction of the principal component of the data points corresponding to each node; therefore the algorithm can be regarded as a method for nested (recursive) principal component analysis of the data set. The recursion usually terminates if a terminal node (called a bucket) is created containing fewer than a prespecified number of points b (called the bucket size), or if a prespecified number of buckets have been created. It turns out that, even if the algorithm is not used for nearest neighbor queries, merely the construction of the tree provides a very good preliminary clustering of the data set. The idea is to use the bucket centers (which are fewer than the data points) as possible insertion locations for the algorithms presented previously.
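The bucket construction just described can be sketched as follows. This is a simplified, illustrative version under our own assumptions: it greedily splits the largest remaining bucket with a hyperplane through the bucket mean, perpendicular to the bucket's principal component (obtained via SVD), and stops on either termination condition; the original tree-based traversal order may differ.

```python
import numpy as np

def bucket_centers(X, max_buckets, bucket_size):
    """Recursively split the data with hyperplanes perpendicular to
    each bucket's principal component (a Sproull-style k-d tree) and
    return the bucket centroids as candidate insertion locations."""
    buckets = [X]
    while len(buckets) < max_buckets:
        # split the largest remaining bucket
        i = max(range(len(buckets)), key=lambda j: len(buckets[j]))
        B = buckets.pop(i)
        if len(B) <= bucket_size:          # largest bucket small enough: stop
            buckets.append(B)
            break
        centered = B - B.mean(axis=0)
        # first right-singular vector = principal direction of the bucket
        v = np.linalg.svd(centered, full_matrices=False)[2][0]
        side = centered @ v > 0
        if side.all() or not side.any():   # degenerate split (identical points)
            buckets.append(B)
            break
        buckets.extend([B[side], B[~side]])
    return np.array([b.mean(axis=0) for b in buckets])
```

The returned centroids can then replace the full data set X as the candidate insertion positions in the global or fast global k-means algorithm.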
To evaluate this initialization scheme, we conducted an experiment on 10 artificial data sets, each drawn from a Gaussian mixture. The components of the Gaussian mixture are well separated and exhibit limited eccentricity.
We compare the results of three methods on the clustering problem with k = 15 centers: (i) the dashed line depicts the results when using the fast global k-means algorithm with all data points constituting potential insertion locations; the average clustering error over the 10 data sets is 15.7 with standard deviation 1.2. (ii) The solid line depicts results when the standard k-means algorithm is used: one run for each data set was conducted. At each run the 15 cluster centers were initially positioned at the centroids of the buckets obtained from the application of the k-d tree algorithm until 15 buckets were created. The average clustering error over the 10 data sets is 24.4 with standard deviation 9.8. (iii) The solid line (with error bars) shows the results when using the fast global k-means algorithm with the potential insertion locations constrained to the centroids of the buckets of a k-d tree. On the horizontal axis we vary the number of buckets for the k-d tree of the last method.

We also computed the 'theoretical' clustering error for each data set, i.e., the error computed using the true cluster centers. The average error value over the 10 data sets was 14.9 with standard deviation 1.3. These results were too close to the results of the standard fast global k-means to include them in the figure.
We can conclude from this experiment that (a) the fast global k-means approach gives rise to performance significantly better than when starting with all centers at the same time initialized using the k-d tree method, and (b) restricting the insertion locations for the fast global k-means to those given by the k-d tree (instead of using all data points) does not significantly degrade performance if we consider a sufficiently large number of buckets in the k-d tree (in general larger than the number of clusters).

Obviously, it is also possible to employ the above presented k-d tree approach with the global k-means algorithm.
4 Experimental results

We have tested the proposed clustering algorithms on several well-known data sets, namely the iris data set [8], the synthetic data set [9] and the image segmentation data set [8]. In all data sets we conducted experiments for the clustering problems obtained by considering only feature vectors and ignoring class labels. The iris data set contains 150 four-dimensional data points, the synthetic data set 250 two-dimensional data points, and for the image segmentation data set we consider 210 six-dimensional data points obtained through PCA on the original 18-dimensional data points. The quality of the obtained solutions was evaluated in terms of the values of the final clustering error.

For each data set we conducted the following experiments:

- one run of the global k-means algorithm for M = 15;
- one run of the fast global k-means algorithm for M = 15;
- the k-means algorithm for k = 1, ..., 15. For each value of k, the k-means algorithm was executed N times (where N is the number of data points) starting from random initial positions for the k centers, and we computed the minimum and average clustering error as well as its standard deviation.
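The random-restart baseline in the last item can be sketched as follows (illustrative only: the data here are randomly generated stand-ins, not the data sets above, and the local search is a plain Lloyd iteration rather than the report's Matlab code):

```python
import numpy as np

def kmeans_error(X, centers, max_iter=100):
    """Run Lloyd's k-means from the given centers; return the final error."""
    for _ in range(max_iter):
        d2 = ((X[:, None] - centers[None]) ** 2).sum(-1)
        labels = d2.argmin(1)
        new = np.array([X[labels == k].mean(0) if (labels == k).any()
                        else centers[k] for k in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    d2 = ((X[:, None] - centers[None]) ** 2).sum(-1)
    return d2.min(1).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))              # stand-in data set
k, N = 3, len(X)
# N restarts, each from k randomly chosen data points
errors = [kmeans_error(X, X[rng.choice(N, size=k, replace=False)])
          for _ in range(N)]
best, avg, std = min(errors), float(np.mean(errors)), float(np.std(errors))
```

The minimum over the N restarts is the figure the global k-means results are compared against below.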
For each of the three data sets the experimental results are displayed in Figures 2, 3 and 4, respectively. Each figure plots the clustering error value as a function of the number of clusters. It is clear that the global k-means algorithm is very effective, providing in all cases
[Figures 2-4: clustering error as a function of the number of clusters k for the three data sets, comparing global k-means, fast global k-means, and the average (and, where shown, minimum) error of k-means with random restarts.]
faster, it provides solutions of excellent quality, comparable to those obtained by the original method. Therefore, it constitutes a very efficient algorithm, both in terms of solution quality and computational complexity, and can run even faster if k-d trees are employed as explained in the previous section.

Matlab implementations of the fast global k-means and the k-d tree building algorithms can be downloaded from http://www.science.uva.nl/research/ias.
5 Discussion and conclusions

We have presented the global k-means clustering algorithm, which constitutes a deterministic clustering method providing excellent results in terms of the clustering error criterion. The method is independent of any starting conditions and compares favorably to the k-means algorithm with multiple random restarts. The deterministic nature of the method is particularly important in cases where the clustering method is used either to specify initial parameter values for other methods (for example RBF training) or constitutes a module in a more complex system. In such a case we can be almost certain that the employment of the global k-means (or any of the fast variants) will always provide sensible clustering solutions. Therefore, one can evaluate the complex system and adjust critical system parameters without having to worry about the dependence of system performance on the clustering method employed.

Another advantage of the proposed technique is that in order to solve the M-clustering problem, all intermediate k-clustering problems are also solved for k = 1, ..., M. This may prove useful in many applications where we seek the actual number of clusters and the k-clustering problem is solved for several values of k. We have also developed the fast global k-means algorithm, which significantly reduces the required computational effort, while at the same time providing solutions of almost the same quality.

We have also proposed two modifications of the method that reduce the computational load without significantly affecting solution quality. These methods can be employed to find solutions to clustering problems with thousands of high-dimensional points, and one of our primary aims is to test the techniques on large-scale data mining problems.
A natural direction for future research is parallelization, since the N runs of the k-means algorithm for each value of k are independent and can be performed in parallel. Another research direction concerns the application of the proposed method to other types of clustering (for example fuzzy clustering), as well as to topographic methods like SOM. Moreover, an important issue that deserves further study is the possible development of theoretical foundations for the assumptions behind the method. Finally, it is also possible to employ the global k-means algorithm as a method for providing effective initial parameter values for RBF networks and data modeling problems using Gaussian mixture models, and to compare the effectiveness of the obtained solutions with other training techniques for Gaussian mixture models [10, 11].
Acknowledgements

N. Vlassis and J. J. Verbeek are supported by the Dutch Technology Foundation STW project AIF4997.
References

[1] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.

[2] J. M. Peña, J. A. Lozano, and P. Larrañaga, "An empirical comparison of four initialization methods for the k-means algorithm," Pattern Recognition Letters, vol. 20, pp. 1027-1040, 1999.

[3] G. W. Milligan and M. C. Cooper, "An examination of procedures for determining the number of clusters in a data set," Psychometrika, vol. 50, pp. 159-179, 1985.

[4] N. Vlassis and A. Likas, "A greedy EM algorithm for Gaussian mixture learning," Tech. Rep. IAS-UVA-00-08, Computer Science Institute, University of Amsterdam, The Netherlands, Sept. 2000.

[5] J. J. Verbeek, N. Vlassis, and B. Kröse, "A k-segments algorithm to find principal curves," Tech. Rep. IAS-UVA-00-11, Computer Science Institute, University of Amsterdam, The Netherlands, Nov. 2000.

[6] J. L. Bentley, "Multidimensional binary search trees used for associative searching," Commun. ACM, vol. 18, no. 9, pp. 509-517, 1975.

[7] R. F. Sproull, "Refinements to nearest-neighbor searching in k-dimensional trees," Algorithmica, vol. 6, pp. 579-589, 1991.

[8] C. L. Blake and C. J. Merz, "UCI repository of machine learning databases," University of California, Irvine, Dept. of Information and Computer Sciences, 1998.

[9] B. D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, U.K., 1996.

[10] G. J. McLachlan and D. Peel, Finite Mixture Models, Wiley, New York, 2000.

[11] N. Vlassis and A. Likas, "A kurtosis-based dynamic approach to Gaussian mixture modeling," IEEE Trans. Systems, Man, and Cybernetics, Part A, vol. 29, no. 4, pp. 393-399, 1999.
This report is in the series of IAS technical reports. The series editor is Stephan ten Hagen (stephanh@science.uva.nl). For the titles that appeared in this series, see http://www.science.uva.nl/research/ias/tr/. You may order copies of the IAS technical reports from the corresponding author or the series editor. Most of the reports can also be found on the web pages of the IAS group (see the inside front page).