HAL Id: inria-00489816
https://hal.inria.fr/inria-00489816
Submitted on 7 Dec 2010
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
To cite this version:
Gylfi Thór Gudmundsson, Björn Thór Jónsson, Laurent Amsaleg. A Large-Scale Performance Study of Cluster-Based High-Dimensional Indexing. [Research Report] RR-7307, INRIA. 2010. ⟨inria-00489816⟩
Research report. ISSN 0249-6399, ISRN INRIA/RR--7307--FR+ENG
Vision, Perception and Multimedia Understanding
A Large-Scale Performance Study of Cluster-Based High-Dimensional Indexing
Gylfi Þór Gudmundsson — Björn Þór Jónsson — Laurent Amsaleg
N° 7307
June 2010
Centre de recherche INRIA Rennes – Bretagne Atlantique
Gylfi Þór Gudmundsson∗, Björn Þór Jónsson†, Laurent Amsaleg
Theme: Vision, Perception and Multimedia Understanding
Project-Team Texmex
Research report n° 7307, June 2010, 20 pages
Abstract: High-dimensional clustering is a method used by some content-based image retrieval systems to partition the data into groups; the groups (clusters) are then indexed to accelerate the processing of queries. Recently, the Cluster Pruning approach was proposed as a very simple way to efficiently and effectively produce such clusters. While the original evaluation of the algorithm was performed within a text-indexing context at a rather small scale, its simplicity and performance motivated us to study its behavior in an image indexing context at a much larger scale. We experiment with two collections of 72-dimensional state-of-the-art local descriptors, the larger collection containing 189 million descriptors. This paper summarizes the results of this study and shows that while the basic algorithm works fairly well, three extensions can dramatically improve its performance and scalability, accelerating both query processing and the construction of clusters, making Cluster Pruning a promising basis for building large-scale systems that require a clustering algorithm.
Key-words: Content-Based Image Retrieval Systems, clustering, multidimensional indexing, large scale
∗ School of Computer Science, Reykjavík University, Menntavegi 1, IS 101 Reykjavík, Iceland. gylfi03@ru.is
† School of Computer Science, Reykjavík University, Menntavegi 1, IS 101 Reykjavík, Iceland. bjorn@ru.is
1 Introduction
Recently, there has been a significant burst of research activity on data structures and algorithms for approximate nearest neighbor search in high-dimensional descriptor collections (e.g., see [4, 5, 9, 19]). Generally speaking, all these methods are based on some sort of segmentation of the high-dimensional collection into groups of descriptors, which are stored together on disk. At query time, an index is then typically used to select the single nearest such group for searching. The goal of the approximate search is to find a good trade-off between result quality and retrieval time.
1.1 Cluster-Based Retrieval
Several of the methods that have been proposed are based on using clustering algorithms to group the data. This line of work was pioneered by Li et al. [11], who proposed the Clindex framework, where a dynamic search algorithm could halt processing after reading a given number of clusters. They showed that good approximate results could be obtained by reading a small number of clusters, albeit for a very small collection. Their particular clustering algorithm did not scale well in practice, however.
Traditionally, clustering algorithms, such as k-means, find the natural clusters of the data, and produce large clusters (containing many descriptors) in dense areas of the high-dimensional space and small clusters (containing few descriptors) in sparse areas. Sigurðardóttir et al. [18] showed, however, for their particular collection, that large clusters are very detrimental to performance, and that excellent approximate results could be returned by simply bulk-loading the descriptors into an SR-tree and using the resulting leaves to create clusters of an even size. Indeed, when result quality was considered as a function of time, early results were much better with this simple clustering scheme than with a traditional clustering algorithm.
Chierichetti et al. [3] then proposed a very simple algorithm, called Cluster Pruning, which uses the initial steps of the k-means algorithm to select a number of random cluster leaders and assign each descriptor to a single leader. As in [11], at search time, the nearest b clusters are read and used to produce the approximate results. To improve result quality, they proposed some parameters affecting the size of clusters and the depth of the cluster index.
1.2 Scalability
While the algorithm of Chierichetti et al. is efficient and effective, as predicted by the previous results, and their analysis is impressive, the performance of the algorithm was only studied using a small-scale text collection. Its simplicity and performance were a strong motivation to study its behavior in an image indexing context at a larger scale, where secondary storage is needed.
State-of-the-art image applications typically use the SIFT descriptors [12] or variants thereof [7, 9]. These descriptors have two important properties that make them suitable for large-scale retrieval. First, they have been shown to scale very well with respect to result quality [10]. Second, each image is described by hundreds of descriptors, making approximate queries (and thus potentially Cluster Pruning) appropriate for these applications. Because each image is described by hundreds of these high-dimensional descriptors, large-scale indexing and retrieval is absolutely necessary.
A major assumption made in the original design of Cluster Pruning is that CPU cost is dominant during the search. As a result of the decision to ignore disk cost, the optimal segmentation is to index a collection of n descriptors into √n clusters containing, on average, √n descriptors each; this division minimizes the total CPU cost of the retrieval. While the calculation of Euclidean distances is indeed CPU intensive, disk operations are also a significant source of cost, as shown in [18]. It is therefore necessary to study, for realistic workloads and data sets that need to be stored on disks, the optimal settings for the number of clusters and the resulting distribution of cluster sizes.
1.3 Contributions
In this paper, we study the performance of the Cluster Pruning algorithm in the context of a large-scale image copyright protection application. The copyright protection application has been studied significantly in the literature (e.g., see [1, 8, 9]) and good results have been obtained using a number of local descriptor variants. Furthermore, as queries are formed by modifying images in the image collection, there is no need for subjective judgment on similarity of images, greatly facilitating interpretation of results.
We study the effect of the various parameters of the Cluster Pruning algorithm, including index depth and cluster size, in this disk-based setting. Our results contradict some of the conclusions reached by Chierichetti et al. [3], due to the large scale of our experimental setup. While the basic algorithm still works fairly well, we propose three key changes which significantly improve its performance. First, a new parameter is needed to control cluster size on disk, to better balance IO and CPU costs. Second, a modification, which enables the use of the cluster index during the clustering phase, allows clustering the collection in a reasonable time. Third, by creating additional clusters and then re-clustering the contents of the smallest clusters, the cluster size distribution is improved which, in turn, improves search efficiency.
Note that, as mentioned above, there has been much recent research activity in the area of high-dimensional indexing. As a result, there are other competing approaches, which have similar theoretical properties, but may be appropriate for different applications (e.g., see [4, 10, 13, 15, 19]). In this paper, we do not attempt a comparison of all these approaches, as such a comparison would be extremely time-consuming, but focus instead on understanding the performance of one specific approach, the Cluster Pruning algorithm, for a particular workload setting. There is significant overlap between the ideas behind Cluster Pruning and the other approaches; Cluster Pruning can therefore be seen as a good representative for a whole family of algorithms where clustering is central. We thus believe that our analysis represents a very valuable contribution to the general understanding of disk-oriented cluster-based indexing.
1.4 Outline of the Paper
The remainder of the paper is organized as follows. In Section 2 we review the copyright protection application we use in our work. We then review the Cluster Pruning algorithm in Section 3. In Section 4 we propose extensions to this algorithm for disk-based processing of large collections. In Section 5 we then run a detailed study of the impact of various parameters on performance. We discuss related work in Section 6, before concluding in Section 7.
2 Image Copyright Protection
The application we use as a case study is the well-known image copyright protection application (see [9, 8]). It is very different from the one studied by Chierichetti et al., who used about 95,000 document descriptors with more than 400,000 dimensions. In order to set the context for the work, and for our examples, we now describe this application and our experimental environment.
2.1 Image Collections and Queries
We use two collections of images. The first collection contains 30K high-quality news photos, which are very varied in content. The second collection, which includes the first collection, contains about 300K such photos.
Queries are intended to simulate image theft. The standard method for this purpose is to generate modified variants of images in the collection using the StirMark software [14] and use those variants as queries. The goal is then to return the original image as a match, but no other images. For the purposes of our evaluation, 120 images were chosen at random from the collection, and modified with 26 different StirMark variants (the variants include resizing, cropping, compression, and some severe brightness modifications; see [9] for details), resulting in 3,120 query images.
2.2 Descriptors and Query Model
Each image is described with many local descriptors, each describing a small portion of the image. We use the E² descriptors, which are a variant of SIFT [12], but perform significantly better for this application [9]. An E² descriptor has 72 dimensions, each stored in a byte. Additionally, each descriptor stores the identifier of the image it was extracted from, for a total of 76 bytes. The small collection has a total of 20,445,871 descriptors, while the large collection has 189,605,419 descriptors. The collections thus require 1.5GB and 13.4GB of disk storage, respectively.
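The storage figures can be checked with a quick back-of-the-envelope computation (a minimal sketch; the only inputs are the 76-byte record layout and the descriptor counts given above):

```python
# Each descriptor record: 72 one-byte dimensions + a 4-byte image identifier.
RECORD_BYTES = 72 + 4

def collection_size_bytes(num_descriptors: int) -> int:
    """Raw storage needed for a flat file of fixed-size descriptor records."""
    return num_descriptors * RECORD_BYTES

small = collection_size_bytes(20_445_871)    # small collection
large = collection_size_bytes(189_605_419)   # large collection

# About 1.55e9 and 1.44e10 bytes, i.e. roughly 1.4 GiB and 13.4 GiB,
# consistent with the figures quoted above (allowing for GB/GiB rounding).
print(small / 2**30, "GiB")
print(large / 2**30, "GiB")
```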
Beyer et al. [2] and Shaft and Ramakrishnan [17] have shown that the only way to obtain meaningful performance results for large-scale high-dimensional indexing is to use real application data which has been shown to scale well in terms of retrieval quality. They have, for example, shown that the data distribution of most generated collections is such that those collections can neither yield meaningful results [2], nor be efficiently indexed [17]. Previous work has shown that SIFT descriptors do indeed scale well to large collections [10], and we believe that our collections are large enough for our conclusions to be quite general.
The descriptors from the images in the photo collections are stored in a large descriptor file, which is the input to the clustering process. When a query file is received, each of its q query descriptors is used in a k-nearest neighbor search: the closest cluster representative is first found, the contents of the cluster fetched into memory, and distances finally computed to get the k neighbors. In this paper, we use k = 20, but the results are not very sensitive to that setting for large collections. Each neighbor votes for the image it was extracted from. These votes are aggregated over the image identifiers, and the images with the most votes are returned as an answer to the query.
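The voting scheme above can be sketched as follows (a minimal illustration, not the authors' code; the descriptor search itself is abstracted behind a hypothetical `knn` function):

```python
from collections import Counter

def vote(query_descriptors, knn, k=20):
    """Aggregate k-NN votes over image identifiers, as described above.

    `knn(q, k)` is a stand-in for the cluster-based search: it returns the
    image identifiers of the k approximate nearest neighbors of descriptor q.
    """
    votes = Counter()
    for q in query_descriptors:
        for image_id in knn(q, k):
            votes[image_id] += 1          # each neighbor votes for its image
    return votes.most_common()            # images ranked by decreasing votes

# Toy usage with a fake knn that always returns the same neighbor list:
ranked = vote([None, None], knn=lambda q, k: [7, 7, 3], k=3)
# image 7 collects 4 votes, image 3 collects 2
```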
2.3 Metrics
The cost of clustering and search is measured through CPU time and IO time, but typically reported together as wall-clock time. The search time reported corresponds to the average time spent to perform each of the 3,120 queries. Quality, on the other hand, is measured as follows. For each of the 3,120 query images, it is clear which image should be returned as a match. We consider an image a correct match when the correct image has at least twice as many votes as the image with the second-most votes. The percentage of such correct matches is our baseline quality metric.
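The correct-match criterion can be stated precisely in a few lines (a sketch under the definition above; the function name and input shape are illustrative):

```python
def is_correct_match(ranked_votes, expected_id):
    """Quality criterion: the expected image must rank first with at least
    twice as many votes as the runner-up.

    `ranked_votes` is a list of (image_id, votes) sorted by decreasing votes.
    """
    if not ranked_votes or ranked_votes[0][0] != expected_id:
        return False
    if len(ranked_votes) == 1:
        return True                      # no runner-up to compete with
    return ranked_votes[0][1] >= 2 * ranked_votes[1][1]

# The quality metric is then the percentage of queries satisfying it, e.g.:
# quality = 100 * sum(is_correct_match(r, e) for r, e in results) / len(results)
```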
Note that the quality results in this study are lower than reported in many other studies, for three reasons. First, some of the StirMark variants are very difficult to find and even an exact sequential scan does not find all the correct matches. Second, a few of the selected images have near-duplicates in the collection, and therefore are never found as a correct match using our simple measure. Third, our criterion of having twice as many votes is very strict; it is possible to find a match with a relatively small number of votes by applying post-processing to the top images (e.g., see [8, 12]), but for simplicity we avoid such post-processing. The point of this study, however, is not to show that the descriptors are effective at image copyright protection; this is already known [8, 9, 12]. The main point is to investigate the performance of the Cluster Pruning algorithm, and this simple definition of a correct match suffices for that purpose.
3 The Cluster Pruning Approach
In this section, we briefly describe the Cluster Pruning approach. We first describe the basic algorithm, and then three parameters affecting its behavior. We end by discussing the costs of the Cluster Pruning approach before summarizing the results reported in [3].
3.1 Basic Algorithm
Assume a collection C = {p1, . . . , pn} of n points in high-dimensional space. The clusters are then formed as follows. First, a set of l = √n cluster leaders is chosen randomly from C. Then, each point pi is compared to all l cluster leaders and assigned to its closest leader. Finally, once the clusters have been formed, a cluster representative is chosen per cluster (the obvious choices are the cluster leader itself, the centroid of the cluster, or the medoid of the cluster).
At query time, the query point q is first compared to the set of l cluster representatives to find the nearest representative. Then, the query point is compared to all the points in that representative's cluster, to determine the k nearest neighbors found in the cluster. Those neighbors are returned as the approximate answer to the query.
The choice of l = √n clusters is made because the total number of Euclidean distance calculations, which is l + n/l, is minimized when l = √n. On average, each cluster contains √n points, resulting in a total of 2√n distance calculations. Assuming that the set of cluster representatives fits in memory, but not the descriptor collection, one disk read is required at search time.
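The basic algorithm can be sketched in a few dozen lines (a minimal in-memory illustration, not the authors' implementation; it uses the leaders themselves as representatives, one of the choices mentioned above):

```python
import math
import random

def euclidean(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest(point, candidates):
    """Index of the candidate closest to `point`."""
    return min(range(len(candidates)),
               key=lambda i: euclidean(point, candidates[i]))

def build(points):
    """Cluster Pruning, L = 1: pick l = sqrt(n) random leaders and assign
    each point to its closest leader; leaders double as representatives."""
    l = max(1, round(math.sqrt(len(points))))
    leaders = random.sample(points, l)
    clusters = [[] for _ in range(l)]
    for p in points:
        clusters[nearest(p, leaders)].append(p)
    return leaders, clusters

def query(q, leaders, clusters, k):
    """Scan only the cluster of the nearest representative and return the
    k nearest points found there (the approximate answer)."""
    cluster = clusters[nearest(q, leaders)]
    return sorted(cluster, key=lambda p: euclidean(q, p))[:k]
```

With l = √n leaders and about √n points per cluster, each of the two `nearest`/scan phases costs about √n distance computations, giving the 2√n total discussed above.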
3.2 Extended Searches: The b Parameter
Sometimes, reading a single cluster may not yield results of satisfactory quality. In such cases, it is possible to read b clusters to answer each query; the basic algorithm corresponds to b = 1. The cost of retrieval then consists of b IOs and (1 + b)√n distance calculations. Using b, it is possible to dynamically change the query execution strategy, for example to read more clusters to improve results.
As b grows, however, returns are expected to diminish as the nearest neighbors are most likely to be contained within the nearest clusters [18]. Unfortunately, a suitable choice of b is difficult to determine dynamically, as the result quality is not known at run-time; instead the number of clusters required for acceptable result quality must be determined explicitly through experimentation.
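A search with b > 1 can be sketched as follows (a minimal self-contained illustration; it assumes clusters are held as lists of points with a parallel list of representatives):

```python
import heapq
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def query_b(q, representatives, clusters, k, b=1):
    """Cluster Pruning search with the b parameter: scan the b clusters
    whose representatives are nearest to the query point q.

    Cost: b cluster reads plus roughly (1 + b) * sqrt(n) distance
    calculations, as discussed above.
    """
    # Rank representatives by distance to q and keep the b nearest.
    order = heapq.nsmallest(
        b, range(len(representatives)),
        key=lambda i: euclidean(q, representatives[i]))
    # Scan the union of the selected clusters for the k nearest points.
    candidates = [p for i in order for p in clusters[i]]
    return sorted(candidates, key=lambda p: euclidean(q, p))[:k]
```

Because b only affects the search, it can be changed per query, which is exactly the dynamic strategy described above.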
3.3 Redundant Clustering: The a Parameter
Alternatively, it is possible to increase the quality of the results by assigning each data point to a > 1 clusters, and reading only b = 1 cluster at query time. Each cluster will then contain, on average, a√n points, resulting in (a + 1)√n Euclidean distance calculations, but only one IO.
The clustering phase is always more costly with higher a (the average cluster size is proportional to a). Furthermore, it is not possible to change the a parameter once the clusters are formed, while the b parameter can be dynamically modified at query time.¹ The effect of the a parameter on query processing cost is more complex, and is studied in Section 5. In short, as a is increased, the size of the clusters on disk increases, as well as the time required to process them.
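The redundant assignment can be sketched as follows (a minimal illustration of the a parameter, not the authors' code; each point goes to its a nearest leaders):

```python
import heapq
import math

def euclidean(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def assign_redundant(points, leaders, a=2):
    """Assign each point to its a nearest leaders instead of only the
    closest one; the average cluster size grows to roughly a * sqrt(n)."""
    clusters = [[] for _ in leaders]
    for p in points:
        nearest_a = heapq.nsmallest(
            a, range(len(leaders)), key=lambda i: euclidean(p, leaders[i]))
        for i in nearest_a:
            clusters[i].append(p)          # point is stored a times
    return clusters
```

At query time only the single nearest cluster is read (b = 1), so the redundancy buys result quality at the price of a-fold larger clusters on disk, as discussed above.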
3.4 Recursive Clustering: The L Parameter
For large collections, √n is a large number, resulting in excessive CPU cost and potentially even significant IO cost. The solution suggested by Chierichetti et al. is to recursively cluster the set of cluster representatives, using the exact same method. They introduce a parameter, L, to control the number of levels in the recursion; the default algorithm described above corresponds to L = 1.
The L parameter is used as follows during the clustering, which is performed in a bottom-up manner. First, l = n^(L/(L+1)) cluster leaders are now chosen initially, resulting in l clusters containing on average n^(1/(L+1)) descriptors. Cluster assignment then proceeds as before, as does the choice of cluster representatives. Once the cluster representatives are formed, however, they are considered as a collection of high-dimensional points, and clustered using n^((L−1)/(L+1)) representatives. This process is repeated recursively, and the outcome is an L-tier
¹ Note that while it is possible to have both a > 1 and b > 1, such settings will most likely result in several data points being read a times and are therefore not considered.
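The sizing formulas of the recursive scheme are easy to tabulate (a small sketch derived only from the n^(L/(L+1)) formula above; the function name is illustrative):

```python
def level_sizes(n: int, L: int):
    """Number of leaders chosen at each tier of an L-level Cluster Pruning
    index, from the bottom tier up: n**(L/(L+1)), n**((L-1)/(L+1)), ...,
    down to n**(1/(L+1)) at the top."""
    return [round(n ** (i / (L + 1))) for i in range(L, 0, -1)]

# For the large collection with a two-level index (L = 2), the bottom tier
# has about n**(2/3) leaders (a few hundred thousand) and the average
# bottom cluster holds about n**(1/3) descriptors (a few hundred).
sizes = level_sizes(189_605_419, L=2)
```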