A Large-Scale Performance Study of Cluster-Based High-Dimensional Indexing



HAL Id: inria-00489816


Submitted on 7 Dec 2010



Gylfi Thór Gudmundsson, Björn Thór Jónsson, Laurent Amsaleg

To cite this version:

Gylfi Thór Gudmundsson, Björn Thór Jónsson, Laurent Amsaleg. A Large-Scale Performance Study of Cluster-Based High-Dimensional Indexing. [Research Report] RR-7307, INRIA. 2010. ⟨inria-00489816⟩




Vision, Perception and Multimedia Understanding


No. 7307

June 2010


Centre de recherche INRIA Rennes – Bretagne Atlantique



Research Report No. 7307, June 2010, 20 pages

Abstract: High-dimensional clustering is a method that is used by some content-based image retrieval systems to partition the data into groups; the groups (clusters) are then indexed to accelerate the processing of queries. Recently, the Cluster Pruning approach was proposed as a very simple way to efficiently and effectively produce such clusters. While the original evaluation was carried out in a text indexing context and at a small scale, its simplicity and performance motivated us to study its behavior in an image indexing context at a much larger scale. We experiment with two collections of state-of-the-art 72-dimensional local image descriptors, the larger one containing 189 million descriptors. This paper summarizes the results of this study and shows that while the basic algorithm works fairly well, three extensions dramatically improve its performance and scalability, accelerating both query processing and the construction of clusters, making Cluster Pruning a promising basis for building large-scale systems that require a clustering algorithm.

Key-words: Content-Based Image Retrieval Systems, clustering, multidimensional [...]


School of Computer Science, Reykjavík University, Menntavegi 1, IS 101 Reykjavík, Iceland. gylfi03@ru.is

School of Computer Science, Reykjavík University, Menntavegi 1, IS 101 Reykjavík, Iceland. bjorn@ru.is


Résumé: High-dimensional clustering is a method used by some content-based image retrieval systems to partition the data into groups. The groups are then indexed to accelerate query processing. Recently, an approach called Cluster Pruning was proposed as a simple, fast, and effective way to obtain such groups. While its original evaluation was carried out in a text indexing context and at a small scale, its simplicity and performance strongly motivated us to study its behavior at a much larger scale and in an image context. We run experiments using state-of-the-art local image descriptors of dimension 72. We process several descriptor collections, the largest of which contains 189 million descriptors. This article presents a synthesis of the results of this study and shows that the original algorithm works relatively well. However, three simple extensions very significantly improve its performance and its ability to scale, accelerating both query processing [...]

Mots-clés: content-based image retrieval systems, partitioning, [...]



1 Introduction

Recently, there has been a significant burst of research activity on data structures [...] descriptor collections (e.g., see [4, 5, 9, 19]). Generally speaking, all these methods are based on some sort of segmentation of the high-dimensional collection into groups of descriptors, which are stored together on disk. At query time, an [...] The goal of the approximate search is to find a good trade-off between result [...]


1.1 Cluster-Based Retrieval

Several of the methods that have been proposed are based on using clustering algorithms to group the data. This line of work was pioneered by Li et al. [11], [...] halt processing after reading a given number of clusters. They showed that good approximate results could be obtained by reading a small number of clusters, albeit for a very small collection. Their particular clustering algorithm did not scale well in practice, however.

[...] clusters of the data, and produce large clusters (containing many descriptors) in dense areas of the high-dimensional space and small clusters (containing few descriptors) in sparse areas. Sigurðardóttir et al. [18] showed, however, for their particular collection, that large clusters are very detrimental to performance, [...] the descriptors into an SR-tree and using the resulting leaves to create clusters of an even size. Indeed, when result quality was considered as a function of time, early results were much better with this simple clustering scheme than [...]

Chierichetti et al. [3] then proposed a very simple algorithm, called Cluster Pruning [...] of random cluster leaders and assign each descriptor to a single leader. As in [11], at search time, the nearest b clusters are read and used to produce the approximate results. To improve result quality, they proposed some parameters affecting the size of clusters and the depth of the cluster index.

1.2 Scalability

[...] algorithm was only studied using a small-scale text collection. Its simplicity and performance motivated us to study its behavior in an image indexing context at a larger scale, where secondary storage is needed.

State-of-the-art image applications typically use the SIFT descriptors [12] or variants thereof [7, 9]. These descriptors have two important properties that make them suitable for large-scale retrieval. First, they have been shown to scale very well with respect to result quality [10]. Second, each image is described by hundreds of descriptors, making approximate queries (and thus potentially Cluster Pruning) appropriate for these applications. Because each [...]

A major assumption made in the original design of Cluster Pruning is that CPU cost is dominant during the search. As a result of the decision to ignore [...] the total CPU cost of the retrieval. While the calculation of Euclidean distances is indeed CPU intensive, disk operations are also a significant source of cost, as shown in [18]. It is therefore necessary to study, for realistic workloads and data sets that need to be stored on disks, the optimal settings for the number [...]


1.3 Contributions


[...] context of a large-scale image copyright protection application. The copyright [...] ([..., 8, 9]) and good results have been obtained using a number of local descriptor variants. Furthermore, as queries are formed by modifying images in the image collection, there is no need for subjective judgment on similarity of images, greatly facilitating interpretation of results.

We study the effect of the various parameters of the Cluster Pruning algorithm, including index depth and cluster size, in this disk-based setting. Our [...] to the large scale of our experimental setup. While the basic algorithm still [...] performance. First, a new parameter is needed to control cluster size on disk, to better balance IO and CPU costs. Second, a modification, which enables the use of the cluster index during the clustering phase, allows clustering the collection in a reasonable time. Third, by creating additional clusters and then reclustering the contents of the smallest clusters, cluster size distribution is improved which, in turn, improves search efficiency.

[...] in the area of high-dimensional indexing. As a result, there are other competing approaches, which have similar theoretical properties, but may be appropriate for different applications (e.g., see [4, 10, 13, 15, 19]). In this paper, we do not attempt a comparison of all these approaches, as such a comparison would be extremely time-consuming, but focus instead on understanding the performance [...] workload setting. There is significant overlap between the ideas behind Cluster Pruning and the other approaches; Cluster Pruning can therefore be seen as a [...]




1.4 Outline of the Paper

The remainder of the paper is organized as follows. In Section 2 we review the copyright protection application we use in our work. We then review the Cluster Pruning algorithm in Section 3. In Section 4 we propose extensions to this algorithm for disk-based processing of large collections. In Section 5 we then run a detailed study of the impact of various parameters on performance. We discuss related work in Section 6, before concluding in Section 7.

2 Image Copyright Protection

The application we use as a case study is the well-known image copyright protection application (see [9, 8]). It is very different from the one studied by Chierichetti et al., who used about 95,000 document descriptors with more than 400,000 dimensions. In order to set the context for the work, and for our examples, we now describe this application and our experimental environment.


2.1 Image Collections and Queries

We use two collections of images. The first collection contains 30K high-quality news photos, which are very varied in content. The second collection, which includes the first collection, contains about 300K such photos.

Queries are intended to simulate image theft. The standard method for this purpose is to generate modified variants of images in the collection using the StirMark software [14] and use those variants as queries. The goal is then to return the original image as a match, but no other images. For the purposes of our evaluation, 120 images were chosen at random from the collection, and [...]




2.2 Descriptors and Query Model

[...]tion of the image. We use the Eff² descriptors [...] but perform significantly better for this application [9]. An Eff² descriptor has 72 dimensions, each stored in a byte. Additionally, each descriptor stores the identifier of the image it was extracted from, for a total of 76 bytes. The small collection has a total of 20,445,871 descriptors, while the large collection has 189,605,419 descriptors. The collections thus require 1.5GB and 13.4GB of disk space, respectively.
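As a quick sanity check of these figures, the following sketch multiplies the descriptor counts by the 76-byte record size. The helper name `collection_bytes` is ours, the 4-byte identifier is inferred from 76 - 72, and the report does not say whether "GB" means 10^9 or 2^30 bytes, so both are printed; file-system and index overhead are ignored.

```python
# Back-of-the-envelope storage estimate for the two descriptor collections.
# Assumption: a record is 72 one-byte dimensions plus a 4-byte image identifier.
DESCRIPTOR_BYTES = 72 + 4


def collection_bytes(num_descriptors: int) -> int:
    """Raw size of a descriptor collection in bytes (no overhead)."""
    return num_descriptors * DESCRIPTOR_BYTES


for name, count in [("small", 20_445_871), ("large", 189_605_419)]:
    size = collection_bytes(count)
    print(f"{name}: {size:,} bytes = {size / 10**9:.2f} GB = {size / 2**30:.2f} GiB")
```

Under either unit convention the estimates land close to the sizes quoted above.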


Beyer et al. [2] and Shaft and Ramakrishnan [17] have shown that the only way to obtain meaningful performance results for large-scale high-dimensional indexing is to use real application data which has been shown to scale well in terms of retrieval quality. They have, for example, shown that the data distribution of most generated collections is such that those collections can neither yield meaningful results [2], nor be efficiently indexed [17]. Previous work has shown that SIFT descriptors do indeed scale well to large collections [10], and we believe that our collections are large enough for our conclusions to be quite [...]

[...] descriptor file, which is the input to the clustering process. When a query file is received, each of its q query descriptors is used in a k-nearest neighbor search: [...] in memory and distances finally computed to get the k neighbors. In this paper, we use k = 20, but the results are not very sensitive to that setting for large collections. Each neighbor votes for the image it was extracted from. These votes are aggregated over the image identifiers, and the images with the most votes [...]


2.3 Metrics

The cost of clustering and search is measured through CPU time and IO time, but typically reported together as wall-clock time. The search time reported corresponds to the average time spent to perform each of the 3,120 queries. Quality, on the other hand, is measured as follows. For each of the 3,120 query images, it is clear which image should be returned as a match. We consider an image a correct match when the correct image has at least twice as many votes as the image with the second most votes. The percentage of such correct matches is the quality measure.
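The voting scheme and the match criterion can be illustrated with a short sketch. This is only an illustration under our own assumptions: the function names and the toy image identifiers are hypothetical, and each query is assumed to arrive as the flat list of image identifiers carried by its retrieved neighbors.

```python
from collections import Counter


def aggregate_votes(neighbor_image_ids):
    """Each retrieved neighbor votes for the image it was extracted from."""
    return Counter(neighbor_image_ids)


def is_correct_match(votes, expected_image):
    """Strict criterion: the expected image needs at least twice as many
    votes as the best-scoring other image."""
    expected_votes = votes.get(expected_image, 0)
    best_other = max(
        (v for img, v in votes.items() if img != expected_image), default=0
    )
    return expected_votes > 0 and expected_votes >= 2 * best_other


def quality(outcomes):
    """Percentage of queries judged to be correct matches."""
    return 100.0 * sum(outcomes) / len(outcomes)


# Toy example (hypothetical identifiers): 12 vs. 5 votes passes, 6 vs. 4 fails.
clear = aggregate_votes(["img7"] * 12 + ["img3"] * 5 + ["img9"] * 3)
weak = aggregate_votes(["img7"] * 6 + ["img3"] * 4)
outcomes = [is_correct_match(clear, "img7"), is_correct_match(weak, "img7")]
print(outcomes, quality(outcomes))  # [True, False] 50.0
```

The second toy query shows why the criterion is strict: the correct image is ranked first, yet it does not count as a match.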


Note that the quality results in this study are lower than reported in many other studies, for three reasons. First, some of the StirMark variants are very difficult to find and even an exact sequential scan does not find all the correct matches. Second, a few of the selected images have near-duplicates in the collection, and therefore are never found as a correct match using our simple measure. Third, our criterion of having twice as many votes is very strict; it is possible to find a match with a relatively small number of votes by applying post-processing to the top images (e.g., see [8, 12]), but for simplicity we avoid such post-processing. The point of this study, however, is not to show [...] known [8, 9, 12]. The main point is to investigate the performance of the Cluster Pruning [...]



3 The Cluster Pruning Approach

In this section, we briefly describe the Cluster Pruning approach. We first describe the basic algorithm, and then three parameters affecting its behavior. We [...]



3.1 Basic Algorithm

Assume a collection $C = \{p_1, \ldots, p_n\}$ of $n$ points in high-dimensional space. The clusters are then formed as follows. First, a set of $l = \sqrt{n}$ cluster leaders is chosen randomly from $C$. Then, each point $p_i$ is compared to all $l$ cluster leaders and assigned to its closest leader. Finally, once the clusters have been formed, a cluster representative is chosen per cluster (the obvious choices are [...]).

At query time, the query point $q$ is first compared to the set of $l$ cluster representatives to find the nearest representative. Then, the query point is compared to all the points in that representative's cluster, to determine the $k$ nearest neighbors found in the cluster. Those neighbors are returned as the approximate answer to the query.

[...] $\sqrt{n}$. On average, [...] Assuming that the set of cluster representatives fits in memory, but not the descriptor collection, one disk read is required at search time.
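The basic algorithm can be sketched in a few lines of Python. This is an illustrative in-memory sketch, not the authors' disk-based implementation: the names `build_clusters` and `search` are ours, the cluster leader is assumed to double as the cluster representative, and squared Euclidean distance is used for ranking. With $l = \sqrt{n}$ leaders, a cluster holds about $n/l = \sqrt{n}$ points on average, so a query performs roughly $\sqrt{n}$ comparisons against the representatives plus $\sqrt{n}$ inside the selected cluster.

```python
import math
import random


def dist2(p, q):
    """Squared Euclidean distance (sufficient for ranking neighbors)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def build_clusters(points, rng):
    """Pick l = round(sqrt(n)) random leaders, then assign every point
    to the cluster of its closest leader."""
    l = max(1, round(math.sqrt(len(points))))
    leaders = rng.sample(points, l)
    clusters = [[] for _ in leaders]
    for p in points:
        nearest = min(range(l), key=lambda i: dist2(p, leaders[i]))
        clusters[nearest].append(p)
    return leaders, clusters


def search(query, leaders, clusters, k):
    """Scan only the cluster whose representative (assumed here to be
    the leader) is nearest to the query; return its k closest points."""
    nearest = min(range(len(leaders)), key=lambda i: dist2(query, leaders[i]))
    return sorted(clusters[nearest], key=lambda p: dist2(query, p))[:k]


# Toy data: 400 random 8-dimensional points, hence 20 clusters.
rng = random.Random(0)
points = [tuple(rng.random() for _ in range(8)) for _ in range(400)]
leaders, clusters = build_clusters(points, rng)
result = search(points[0], leaders, clusters, k=5)
```

In a disk-based setting each entry of `clusters` would be a contiguous region of a file, so that the scan in `search` corresponds to the single disk read mentioned above.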

3.2 Extended Searches: The b Parameter

[...] In such cases, it is possible to read $b$ clusters to answer each query; the basic algorithm corresponds to $b = 1$. The cost of retrieval then consists of $b$ IOs and $(1+b)\sqrt{n}$ distance calculations. Using $b$, it is possible to dynamically change the query execution strategy, for example to read more clusters to improve results.

[...] neighbors are most likely to be contained within the nearest clusters [18]. Unfortunately, a suitable choice of $b$ is difficult to determine dynamically, as the result quality is not known at run-time; instead the number of clusters required for [...]
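A minimal sketch of an extended search follows, assuming that the representatives and clusters are already available in memory. The function name `search_b` and the toy data are hypothetical; on disk, each selected cluster would cost one IO.

```python
import heapq


def dist2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def search_b(query, representatives, clusters, k, b=1):
    """Read the b clusters whose representatives are nearest to the query,
    then return the k nearest points among their combined contents."""
    nearest = heapq.nsmallest(
        b,
        range(len(representatives)),
        key=lambda i: dist2(query, representatives[i]),
    )
    candidates = [p for i in nearest for p in clusters[i]]
    return heapq.nsmallest(k, candidates, key=lambda p: dist2(query, p))


# Toy data (hypothetical): three representatives and their clusters.
representatives = [(0, 0), (10, 0), (0, 10)]
clusters = [[(0, 1), (1, 0)], [(9, 0), (6, 0)], [(0, 9)]]
query = (4, 0)
print(search_b(query, representatives, clusters, k=1, b=1))  # [(1, 0)]
print(search_b(query, representatives, clusters, k=1, b=2))  # [(6, 0)]
```

In the toy example the true nearest neighbor (6, 0) lies in the second-nearest cluster, so it is only found with b = 2, at the price of one more cluster read.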


3.3 Redundant Clustering: The a Parameter

Alternatively, it is possible to increase the quality of the results by assigning each data point to $a > 1$ clusters, and reading only $b = 1$ cluster at query time. Each cluster will then contain, on average, $a\sqrt{n}$ points, resulting in $(a+1)\sqrt{n}$ [...] ([...] size is proportional to $a$). Furthermore, it is not possible to change the $a$ parameter [...] is more complex, and is studied in Section 5. In short, as $a$ is increased, the size [...]
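Redundant assignment can be sketched as follows. The function name `assign_redundant` and the toy data are hypothetical, and the sketch only shows the assignment step: every point is stored in the clusters of its a nearest leaders, so the total amount of stored data grows by a factor of a.

```python
import heapq


def dist2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_redundant(points, leaders, a=1):
    """Store each point in the clusters of its a nearest leaders."""
    clusters = [[] for _ in leaders]
    for p in points:
        nearest = heapq.nsmallest(
            a, range(len(leaders)), key=lambda i: dist2(p, leaders[i])
        )
        for i in nearest:
            clusters[i].append(p)
    return clusters


# Toy data (hypothetical): three leaders, four points.
leaders = [(0, 0), (10, 0), (0, 10)]
points = [(1, 0), (6, 0), (0, 9), (4, 4)]
single = assign_redundant(points, leaders, a=1)
double = assign_redundant(points, leaders, a=2)
print(sum(map(len, single)), sum(map(len, double)))  # 4 8
```

With a = 2, the point (6, 0) is stored both with its nearest leader (10, 0) and with (0, 0), so a query landing in either cluster can still retrieve it while reading a single cluster.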


3.4 Recursive Clustering: The L Parameter

For large collections, $\sqrt{n}$ is a large number, resulting in excessive CPU cost and potentially even significant IO cost. The solution suggested by Chierichetti et al. is to recursively cluster the set of cluster representatives, using the exact same method. They introduce a parameter, $L$, to control the number of levels in the recursion; the default algorithm described above corresponds to $L = 1$.

[...] in a bottom-up manner. First, $l = n^{L/(L+1)}$ cluster leaders are now chosen initially, resulting in $l$ clusters containing on average $n^{1/(L+1)}$ descriptors. Cluster assignment then proceeds as before, as does the choice of cluster representatives. Once the cluster representatives are formed, however, they are considered as a collection of high-dimensional points, and clustered using $n^{(L-1)/(L+1)}$ representatives. This process is repeated recursively, and the outcome is an $L$-tier [...]
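To get a feel for these exponents, the sketch below computes the approximate number of clusters at each level of the index. The function names are ours, and the assumption that the recursion continues with exponents L, L-1, ..., 1 over L+1 is our reading of "repeated recursively"; rounding to whole clusters is also our choice.

```python
def level_sizes(n, L):
    """Approximate number of clusters per index level, bottom level first:
    n^(L/(L+1)), n^((L-1)/(L+1)), ..., n^(1/(L+1))."""
    return [round(n ** (j / (L + 1))) for j in range(L, 0, -1)]


def avg_cluster_size(n, L):
    """Average number of descriptors per bottom-level cluster."""
    return n ** (1 / (L + 1))


# Size of the large collection used in this report.
n = 189_605_419
for L in (1, 2, 3):
    print(L, level_sizes(n, L), round(avg_cluster_size(n, L)))
```

For L = 1 the list has a single entry, round(sqrt(n)), which matches the basic algorithm; each additional level shrinks both the top-level set that must be scanned and the average cluster size.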


Note that while it is possible to have both $a > 1$ and $b > 1$, such settings will most likely [...]



