HAL Id: inria-00489816
https://hal.inria.fr/inria-00489816
Submitted on 7 Dec 2010
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
To cite this version:
Gylfi Thór Gudmundsson, Björn Thór Jónsson, Laurent Amsaleg. A Large-Scale Performance Study of Cluster-Based High-Dimensional Indexing. [Research Report] RR-7307, INRIA. 2010. ⟨inria-00489816⟩
Research report. ISSN 0249-6399, ISRN INRIA/RR--7307--FR+ENG
Vision, Perception and Multimedia Understanding
A Large-Scale Performance Study of Cluster-Based High-Dimensional Indexing
Gylfi Þór Gudmundsson — Björn Þór Jónsson — Laurent Amsaleg
N° 7307
June 2010
Centre de recherche INRIA Rennes – Bretagne Atlantique
Gylfi Þór Gudmundsson∗, Björn Þór Jónsson†, Laurent Amsaleg
Theme: Vision, Perception and Multimedia Understanding
Project-Team Texmex
Research report n° 7307, June 2010, 20 pages
Abstract: High-dimensional clustering is a method used by some content-based image retrieval systems to partition the data into groups; the groups (clusters) are then indexed to accelerate the processing of queries. Recently, the Cluster Pruning approach was proposed as a very simple way to efficiently and effectively produce such clusters. While the original evaluation of the algorithm was performed within a text-indexing context at a rather small scale, its simplicity and performance motivated us to study its behavior in an image indexing context at a much larger scale. We experiment with two collections of 72-dimensional state-of-the-art local descriptors, the larger collection containing 189 million descriptors. This paper summarizes the results of this study and shows that while the basic algorithm works fairly well, three extensions can dramatically improve its performance and scalability, accelerating both query processing and the construction of clusters, making Cluster Pruning a promising basis for building large-scale systems that require a clustering algorithm.
Key-words: Content-Based Image Retrieval Systems, clustering, multidimensional indexing, large scale
∗ School of Computer Science, Reykjavík University, Menntavegi 1, IS 101 Reykjavík, Iceland. gylfi03@ru.is
† School of Computer Science, Reykjavík University, Menntavegi 1, IS 101 Reykjavík, Iceland. bjorn@ru.is
1 Introduction
Recently, there has been a significant burst of research activity on data structures and algorithms for approximate nearest neighbor search in high-dimensional descriptor collections (e.g., see [4, 5, 9, 19]). Generally speaking, all these methods are based on some sort of segmentation of the high-dimensional collection into groups of descriptors, which are stored together on disk. At query time, an index is then typically used to select the single nearest such group for searching. The goal of the approximate search is to find a good trade-off between result quality and retrieval time.
1.1 Cluster-Based Retrieval
Several of the methods that have been proposed are based on using clustering algorithms to group the data. This line of work was pioneered by Li et al. [11], who proposed the Clindex framework, where a dynamic search algorithm could halt processing after reading a given number of clusters. They showed that good approximate results could be obtained by reading a small number of clusters, albeit for a very small collection. Their particular clustering algorithm did not scale well in practice, however.
Traditionally, clustering algorithms, such as k-means, find the natural clusters of the data, and produce large clusters (containing many descriptors) in dense areas of the high-dimensional space and small clusters (containing few descriptors) in sparse areas. Sigurðardóttir et al. [18] showed, however, for their particular collection, that large clusters are very detrimental to performance, and that excellent approximate results could be returned by simply bulk-loading the descriptors into an SR-tree and using the resulting leaves to create clusters of an even size. Indeed, when result quality was considered as a function of time, early results were much better with this simple clustering scheme than with a traditional clustering algorithm.
Chierichetti et al. [3] then proposed a very simple algorithm, called Cluster Pruning, which uses the initial steps of the k-means algorithm to select a number of random cluster leaders and assign each descriptor to a single leader. As in [11], at search time, the nearest b clusters are read and used to produce the approximate results. To improve result quality, they proposed some parameters affecting the size of clusters and the depth of the cluster index.
1.2 Scalability
While the algorithm of Chierichetti et al. is efficient and effective, as predicted by the previous results, and their analysis is impressive, the performance of the algorithm was only studied using a small-scale text collection. Its simplicity and performance were a strong motivation to study its behavior in an image indexing context at a larger scale, where secondary storage is needed.
State-of-the-art image applications typically use the SIFT descriptors [12] or variants thereof [7, 9]. These descriptors have two important properties that make them suitable for large-scale retrieval. First, they have been shown to scale very well with respect to result quality [10]. Second, each image is described by hundreds of descriptors, making approximate queries (and thus potentially Cluster Pruning) appropriate for these applications. Because each image is described by hundreds of these high-dimensional descriptors, large-scale indexing and retrieval is absolutely necessary.
A major assumption made in the original design of Cluster Pruning is that CPU cost is dominant during the search. As a result of the decision to ignore disk cost, the optimal segmentation is to index a collection of n descriptors into √n clusters containing, on average, √n descriptors each; this division minimizes the total CPU cost of the retrieval. While the calculation of Euclidean distances is indeed CPU intensive, disk operations are also a significant source of cost, as shown in [18]. It is therefore necessary to study, for realistic workloads and data sets that need to be stored on disks, the optimal settings for the number of clusters and the resulting distribution of cluster sizes.
1.3 Contributions
In this paper, we study the performance of the Cluster Pruning algorithm in the context of a large-scale image copyright protection application. The copyright protection application has been studied significantly in the literature (e.g., see [1, 8, 9]) and good results have been obtained using a number of local descriptor variants. Furthermore, as queries are formed by modifying images in the image collection, there is no need for subjective judgment on similarity of images, greatly facilitating interpretation of results.
We study the effect of the various parameters of the Cluster Pruning algorithm, including index depth and cluster size, in this disk-based setting. Our results contradict some of the conclusions reached by Chierichetti et al. [3], due to the large scale of our experimental setup. While the basic algorithm still works fairly well, we propose three key changes which significantly improve its performance. First, a new parameter is needed to control cluster size on disk, to better balance IO and CPU costs. Second, a modification, which enables the use of the cluster index during the clustering phase, allows clustering the collection in a reasonable time. Third, by creating additional clusters and then re-clustering the contents of the smallest clusters, the cluster size distribution is improved which, in turn, improves search efficiency.
Note that, as mentioned above, there has been much recent research activity in the area of high-dimensional indexing. As a result, there are other competing approaches, which have similar theoretical properties, but may be appropriate for different applications (e.g., see [4, 10, 13, 15, 19]). In this paper, we do not attempt a comparison of all these approaches, as such a comparison would be extremely time-consuming, but focus instead on understanding the performance of one specific approach, the Cluster Pruning algorithm, for a particular workload setting. There is significant overlap between the ideas behind Cluster Pruning and the other approaches; Cluster Pruning can therefore be seen as a good representative for a whole family of algorithms where clustering is central. We thus believe that our analysis represents a very valuable contribution to the general understanding of disk-oriented cluster-based indexing.
1.4 Outline of the Paper
The remainder of the paper is organized as follows. In Section 2 we review the copyright protection application we use in our work. We then review the Cluster Pruning algorithm in Section 3. In Section 4 we propose extensions to this algorithm for disk-based processing of large collections. In Section 5 we then run a detailed study of the impact of various parameters on performance. We discuss related work in Section 6, before concluding in Section 7.
2 Image Copyright Protection
The application we use as a case study is the well-known image copyright protection application (see [9, 8]). It is very different from the one studied by Chierichetti et al., who used about 95,000 document descriptors with more than 400,000 dimensions. In order to set the context for the work, and for our examples, we now describe this application and our experimental environment.
2.1 Image Collections and Queries
We use two collections of images. The first collection contains 30K high-quality news photos, which are very varied in content. The second collection, which includes the first collection, contains about 300K such photos.
Queries are intended to simulate image theft. The standard method for this purpose is to generate modified variants of images in the collection using the StirMark software [14] and use those variants as queries. The goal is then to return the original image as a match, but no other images. For the purposes of our evaluation, 120 images were chosen at random from the collection, and modified with 26 different StirMark variants (the variants include resizing, cropping, compression, and some severe brightness modifications; see [9] for details), resulting in 3,120 query images.
2.2 Descriptors and Query Model
Each image is described with many local descriptors, each describing a small portion of the image. We use the E² descriptors, which are a variant of SIFT [12], but perform significantly better for this application [9]. An E² descriptor has 72 dimensions, each stored in a byte. Additionally, each descriptor stores the identifier of the image it was extracted from, for a total of 76 bytes. The small collection has a total of 20,445,871 descriptors, while the large collection has 189,605,419 descriptors. The collections thus require 1.5GB and 13.4GB of disk storage, respectively.
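The storage figures can be checked with a quick back-of-the-envelope computation (a minimal sketch; the only inputs are the 76-byte record layout and the descriptor counts given above):

```python
# Each descriptor record: 72 one-byte dimensions + a 4-byte image identifier.
RECORD_BYTES = 72 + 4

def collection_size_bytes(num_descriptors: int) -> int:
    """Raw storage needed for a flat file of fixed-size descriptor records."""
    return num_descriptors * RECORD_BYTES

small = collection_size_bytes(20_445_871)    # small collection
large = collection_size_bytes(189_605_419)   # large collection

# About 1.55e9 and 1.44e10 bytes, i.e. roughly 1.4 GiB and 13.4 GiB,
# consistent with the figures quoted above (allowing for GB/GiB rounding).
print(small / 2**30, "GiB")
print(large / 2**30, "GiB")
```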
Beyer et al. [2] and Shaft and Ramakrishnan [17] have shown that the only way to obtain meaningful performance results for large-scale high-dimensional indexing is to use real application data which has been shown to scale well in terms of retrieval quality. They have, for example, shown that the data distribution of most generated collections is such that those collections can neither yield meaningful results [2], nor be efficiently indexed [17]. Previous work has shown that SIFT descriptors do indeed scale well to large collections [10], and we believe that our collections are large enough for our conclusions to be quite general.
The descriptors from the images in the photo collections are stored in a large descriptor file, which is the input to the clustering process. When a query file is received, each of its q query descriptors is used in a k-nearest neighbor search: the closest cluster representative is first found, the contents of the cluster fetched into memory, and distances finally computed to get the k neighbors. In this paper, we use k = 20, but the results are not very sensitive to that setting for large collections. Each neighbor votes for the image it was extracted from. These votes are aggregated over the image identifiers, and the images with the most votes are returned as an answer to the query.
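The voting scheme above can be sketched as follows (a minimal illustration, not the authors' code; the descriptor search itself is abstracted behind a hypothetical `knn` function):

```python
from collections import Counter

def vote(query_descriptors, knn, k=20):
    """Aggregate k-NN votes over image identifiers, as described above.

    `knn(q, k)` is a stand-in for the cluster-based search: it returns the
    image identifiers of the k approximate nearest neighbors of descriptor q.
    """
    votes = Counter()
    for q in query_descriptors:
        for image_id in knn(q, k):
            votes[image_id] += 1          # each neighbor votes for its image
    return votes.most_common()            # images ranked by decreasing votes

# Toy usage with a fake knn that always returns the same neighbor list:
ranked = vote([None, None], knn=lambda q, k: [7, 7, 3], k=3)
# image 7 collects 4 votes, image 3 collects 2
```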
2.3 Metrics
The cost of clustering and search is measured through CPU time and IO time, but typically reported together as wall-clock time. The search time reported corresponds to the average time spent to perform each of the 3,120 queries. Quality, on the other hand, is measured as follows. For each of the 3,120 query images, it is clear which image should be returned as a match. We consider an image a correct match when the correct image has at least twice as many votes as the image with the second-most votes. The percentage of such correct matches is our baseline quality metric.
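The correct-match criterion can be stated precisely in a few lines (a sketch under the definition above; the function name and input shape are illustrative):

```python
def is_correct_match(ranked_votes, expected_id):
    """Quality criterion: the expected image must rank first with at least
    twice as many votes as the runner-up.

    `ranked_votes` is a list of (image_id, votes) sorted by decreasing votes.
    """
    if not ranked_votes or ranked_votes[0][0] != expected_id:
        return False
    if len(ranked_votes) == 1:
        return True                      # no runner-up to compete with
    return ranked_votes[0][1] >= 2 * ranked_votes[1][1]

# The quality metric is then the percentage of queries satisfying it, e.g.:
# quality = 100 * sum(is_correct_match(r, e) for r, e in results) / len(results)
```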
Note that the quality results in this study are lower than reported in many other studies, for three reasons. First, some of the StirMark variants are very difficult to find and even an exact sequential scan does not find all the correct matches. Second, a few of the selected images have near-duplicates in the collection, and therefore are never found as a correct match using our simple measure. Third, our criterion of having twice as many votes is very strict; it is possible to find a match with a relatively small number of votes by applying post-processing to the top images (e.g., see [8, 12]), but for simplicity we avoid such post-processing. The point of this study, however, is not to show that the descriptors are effective at image copyright protection; this is already known [8, 9, 12]. The main point is to investigate the performance of the Cluster Pruning algorithm, and this simple definition of a correct match suffices for that purpose.
3 The Cluster Pruning Approach
In this section, we briefly describe the Cluster Pruning approach. We first describe the basic algorithm, and then three parameters affecting its behavior. We end by discussing the costs of the Cluster Pruning approach before summarizing the results reported in [3].
3.1 Basic Algorithm
Assume a collection C = {p1, . . . , pn} of n points in high-dimensional space. The clusters are then formed as follows. First, a set of l = √n cluster leaders is chosen randomly from C. Then, each point pi is compared to all l cluster leaders and assigned to its closest leader. Finally, once the clusters have been formed, a cluster representative is chosen per cluster (the obvious choices are the cluster leader itself, the centroid of the cluster, or the medoid of the cluster).
At query time, the query point q is first compared to the set of l cluster representatives to find the nearest representative. Then, the query point is compared to all the points in that representative's cluster, to determine the k nearest neighbors found in the cluster. Those neighbors are returned as the approximate answer to the query.
The choice of l = √n clusters is made because the total number of Euclidean distance calculations, which is l + n/l, is minimized when l = √n. On average, each cluster contains √n points, resulting in a total of 2√n distance calculations. Assuming that the set of cluster representatives fits in memory, but not the descriptor collection, one disk read is required at search time.
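The basic algorithm can be sketched in a few dozen lines (a minimal in-memory illustration, not the authors' implementation; it uses the leaders themselves as representatives, one of the choices mentioned above):

```python
import math
import random

def euclidean(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest(point, candidates):
    """Index of the candidate closest to `point`."""
    return min(range(len(candidates)),
               key=lambda i: euclidean(point, candidates[i]))

def build(points):
    """Cluster Pruning, L = 1: pick l = sqrt(n) random leaders and assign
    each point to its closest leader; leaders double as representatives."""
    l = max(1, round(math.sqrt(len(points))))
    leaders = random.sample(points, l)
    clusters = [[] for _ in range(l)]
    for p in points:
        clusters[nearest(p, leaders)].append(p)
    return leaders, clusters

def query(q, leaders, clusters, k):
    """Scan only the cluster of the nearest representative and return the
    k nearest points found there (the approximate answer)."""
    cluster = clusters[nearest(q, leaders)]
    return sorted(cluster, key=lambda p: euclidean(q, p))[:k]
```

With l = √n leaders and about √n points per cluster, each of the two `nearest`/scan phases costs about √n distance computations, giving the 2√n total discussed above.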
3.2 Extended Searches: The b Parameter
Sometimes, reading a single cluster may not yield results of satisfactory quality. In such cases, it is possible to read b clusters to answer each query; the basic algorithm corresponds to b = 1. The cost of retrieval then consists of b IOs and (1 + b)√n distance calculations. Using b, it is possible to dynamically change the query execution strategy, for example to read more clusters to improve results.
As b grows, however, returns are expected to diminish as the nearest neighbors are most likely to be contained within the nearest clusters [18]. Unfortunately, a suitable choice of b is difficult to determine dynamically, as the result quality is not known at run-time; instead the number of clusters required for acceptable result quality must be determined explicitly through experimentation.
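A search with b > 1 can be sketched as follows (a minimal self-contained illustration; it assumes clusters are held as lists of points with a parallel list of representatives):

```python
import heapq
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def query_b(q, representatives, clusters, k, b=1):
    """Cluster Pruning search with the b parameter: scan the b clusters
    whose representatives are nearest to the query point q.

    Cost: b cluster reads plus roughly (1 + b) * sqrt(n) distance
    calculations, as discussed above.
    """
    # Rank representatives by distance to q and keep the b nearest.
    order = heapq.nsmallest(
        b, range(len(representatives)),
        key=lambda i: euclidean(q, representatives[i]))
    # Scan the union of the selected clusters for the k nearest points.
    candidates = [p for i in order for p in clusters[i]]
    return sorted(candidates, key=lambda p: euclidean(q, p))[:k]
```

Because b only affects the search, it can be changed per query, which is exactly the dynamic strategy described above.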
3.3 Redundant Clustering: The a Parameter
Alternatively, it is possible to increase the quality of the results by assigning each data point to a > 1 clusters, and reading only b = 1 cluster at query time. Each cluster will then contain, on average, a√n points, resulting in (a + 1)√n Euclidean distance calculations, but only one IO.
The clustering phase is always more costly with higher a (the average cluster size is proportional to a). Furthermore, it is not possible to change the a parameter once the clusters are formed, while the b parameter can be dynamically modified at query time.¹ The effect of the a parameter on query processing cost is more complex, and is studied in Section 5. In short, as a is increased, the size of the clusters on disk increases, as well as the time required to process them.
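The redundant assignment can be sketched as follows (a minimal illustration of the a parameter, not the authors' code; each point goes to its a nearest leaders):

```python
import heapq
import math

def euclidean(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def assign_redundant(points, leaders, a=2):
    """Assign each point to its a nearest leaders instead of only the
    closest one; the average cluster size grows to roughly a * sqrt(n)."""
    clusters = [[] for _ in leaders]
    for p in points:
        nearest_a = heapq.nsmallest(
            a, range(len(leaders)), key=lambda i: euclidean(p, leaders[i]))
        for i in nearest_a:
            clusters[i].append(p)          # point is stored a times
    return clusters
```

At query time only the single nearest cluster is read (b = 1), so the redundancy buys result quality at the price of a-fold larger clusters on disk, as discussed above.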
3.4 Recursive Clustering: The L Parameter
For large collections, √n is a large number, resulting in excessive CPU cost and potentially even significant IO cost. The solution suggested by Chierichetti et al. is to recursively cluster the set of cluster representatives, using the exact same method. They introduce a parameter, L, to control the number of levels in the recursion; the default algorithm described above corresponds to L = 1.
The L parameter is used as follows during the clustering, which is performed in a bottom-up manner. First, l = n^(L/(L+1)) cluster leaders are now chosen initially, resulting in l clusters containing on average n^(1/(L+1)) descriptors. Cluster assignment then proceeds as before, as does the choice of cluster representatives. Once the cluster representatives are formed, however, they are considered as a collection of high-dimensional points, and clustered using n^((L−1)/(L+1)) representatives. This process is repeated recursively, and the outcome is an L-tier
¹ Note that while it is possible to have both a > 1 and b > 1, such settings will most likely result in several data points being read a times and are therefore not considered.
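The sizing formulas of the recursive scheme are easy to tabulate (a small sketch derived only from the n^(L/(L+1)) formula above; the function name is illustrative):

```python
def level_sizes(n: int, L: int):
    """Number of leaders chosen at each tier of an L-level Cluster Pruning
    index, from the bottom tier up: n**(L/(L+1)), n**((L-1)/(L+1)), ...,
    down to n**(1/(L+1)) at the top."""
    return [round(n ** (i / (L + 1))) for i in range(L, 0, -1)]

# For the large collection with a two-level index (L = 2), the bottom tier
# has about n**(2/3) leaders (a few hundred thousand) and the average
# bottom cluster holds about n**(1/3) descriptors (a few hundred).
sizes = level_sizes(189_605_419, L=2)
```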