HAL Id: hal-00961322
https://hal-ujm.archives-ouvertes.fr/hal-00961322v2
Submitted on 10 Sep 2010
Combining Elimination Rules in Tree-Based Nearest
Neighbor Search Algorithms
Eva Gómez-Ballester¹, Luisa Micó¹, Franck Thollard², Jose Oncina¹, and Francisco Moreno-Seco¹

¹ Dept. Lenguajes y Sistemas Informáticos, Universidad de Alicante, E-03071 Alicante, Spain
{eva,mico,oncina,paco}@dlsi.ua.es
² Grenoble University, LIG, BP 53, 38041 Grenoble Cedex 9
thollard@univ-st-etienne.fr
Abstract. A common activity in many pattern recognition tasks, image processing or clustering techniques involves searching a labeled data set for the point nearest to a given unlabelled sample. To reduce the computational overhead of the naive exhaustive search, several fast nearest neighbor search (NNS) algorithms have appeared in recent years. Depending on the structure used to store the training set (usually a tree), different strategies to speed up the search have been defined. In this paper, a new algorithm based on the combination of different pruning rules is proposed. An experimental evaluation and comparison of its behavior with respect to other techniques has been performed, using both real and artificial data.
1 Introduction
Nearest Neighbor Search (NNS) is an important technique in a variety of applications including pattern recognition [6], vision [13], or data mining [1, 5]. These techniques aim at finding the object of a set nearest to a given test object, using a distance function [6]. The use of a simple brute-force method is sometimes a bottleneck due to the large number of distances that should be computed and/or their computational cost. In this work we consider the computational problem of finding nearest neighbors in general metric spaces. Spaces that may not be conveniently embedded or approximated in a Euclidean space are of particular interest. Many techniques have been proposed, using different types of structures (vp-tree [16], GNAT [3], sa-tree [10], AESA [14], M-tree [4]); the tree-based techniques are nevertheless the most popular. The Fukunaga and Narendra algorithm (FNA [7]) is one of the first known tree-based examples of this type of technique. It prunes the traversal of the tree by taking advantage, as the aforementioned methods do, of the triangle inequality of the distance. In this paper we consider several pruning rules: the table rule [12], a rule that is based on information stored in the sibling node (the sibling rule [9]), the original rule from the FNA (the Fukunaga and Narendra rule, FNR), and a generalization of both the sibling rule and the FNR [9]. We end up with a new algorithm for combining the rules that significantly reduces the number of distance computations.
The algorithm is evaluated on both artificial and real world data and compared with state-of-the-art methods.
The paper is organized as follows: we first recall the FNA algorithm and define the general framework of the new algorithm (in particular, how the tree is built). We then review the different rules we aim at combining (section 3). Next, we propose our new algorithm (section 4). Section 5 presents the experimental comparison.
2 The basic algorithm
The FNA is a fast tree-based search method that can work in general metric spaces. In the original FNA the c-means algorithm was used to define the partition of the data. In the work by Gómez-Ballester et al. [8] many strategies were explored; the best one, namely the Most Distant from the Father tree (MDF), in which one of the children keeps the representative of its father, is the strategy used in the experiments presented in this work. Thus, each time an expansion of a node is necessary, only one new distance needs to be computed (instead of two), hence reducing the number of distances computed. This strategy was also successfully used by Noltemeier et al. [11] in the context of bisector trees.
In the MDF tree each leaf stores a point of the search space. The information stored in each node t is S_t, the set of points stored in the leaves of the subtree rooted at t; M_t, the representative of S_t; and the radius of S_t, R_t = max_{x ∈ S_t} d(M_t, x). Figure 1 shows a partition of the data in a 2-dimensional unit hypercube. The root node is associated with all the points of the set. The left node represents all the points that lie below the segment [(0, 0.95); (0.65, 0)]; the right node is associated with the other points. According to the MDF strategy, the representative of the right node (M_r) is the same as that of the father, and the representative of the left node (M_ℓ) is the point most distant from M_r. The space is then recursively partitioned.

Fig. 1: Partition of the data using the MDF strategy. Representatives of each node at the different levels are drawn as rings.
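As an illustration of the structure just described, the following Python sketch (ours, not the authors' code) builds a binary MDF tree in which every node keeps its point set S_t, its representative M_t and its radius R_t. The split criterion that assigns each point to the nearest of the two representatives is an assumption on our part; the paper does not spell out the partition rule at this point.

import math

def dist(a, b):
    # Euclidean distance; any metric obeying the triangle inequality would do.
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

class Node:
    def __init__(self, points, rep):
        self.points = points                               # S_t: points stored below t
        self.rep = rep                                     # M_t: representative of S_t
        self.radius = max(dist(rep, p) for p in points)    # R_t = max_{x in S_t} d(M_t, x)
        self.left = self.right = None

def build_mdf(points, rep=None):
    # Build an MDF tree: the right child keeps its father's representative,
    # the left child is represented by the point of S_t farthest from it.
    rep = rep if rep is not None else points[0]
    node = Node(points, rep)
    far = max(points, key=lambda p: dist(rep, p))          # candidate left representative
    if len(points) <= 1 or dist(rep, far) == 0:            # leaf, or all points identical
        return node
    left = [p for p in points if dist(p, far) <= dist(p, rep)]   # assumed nearest-representative split
    right = [p for p in points if p not in left]
    node.left = build_mdf(left, far)
    node.right = build_mdf(right, rep)
    return node

tree = build_mdf([(0.1, 0.2), (0.9, 0.8), (0.4, 0.7), (0.85, 0.1)])   # tiny usage example

Because the right child reuses its father's representative, expanding a node introduces only one new representative, which is what saves one distance computation per expansion at search time.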
3 A review of pruning rules
Fukunaga and Narendra Rule (FNR)
The pruning rule defined by Fukunaga and Narendra for internal nodes makes it necessary to compute the distance from the test sample to the representative of the candidate node to be eliminated. Figure 2a presents a graphical view of the Fukunaga and Narendra rule.
Rule: No y ∈ S_t can be the nearest neighbor to x if d(x, nn) + R_t < d(x, M_t).
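Expressed as code, the rule is a constant-time comparison once d(x, M_t) is available; a minimal sketch with argument names of our own choosing:

def fnr_prune(d_x_Mt, d_x_nn, radius_t):
    # FNR: the subtree rooted at t cannot contain a neighbor better than nn
    # when d(x, nn) + R_t < d(x, M_t).
    return d_x_nn + radius_t < d_x_Mt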
The Sibling Based Rule (SBR)
Given two sibling nodes r and ℓ, this rule requires that each node r stores the distance d(M_r, e_ℓ), that is, the distance between the representative of the node, M_r, and the nearest point, e_ℓ, in the sibling node ℓ (S_ℓ). Figure 2b presents a graphical view of the sibling based rule.
Rule: No y ∈ S_ℓ can be the nearest neighbor to x if d(M_r, x) + d(x, nn) < d(M_r, e_ℓ).
Unlike the FNR, the SBR can be applied to eliminate node ℓ without computing d(M_ℓ, x), avoiding some extra distance computations at search time.
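In code the test is again a constant-time comparison, provided that d(M_r, e_ℓ) was stored in the node when the tree was built; the argument names below are ours:

def sbr_prune(d_x_Mr, d_x_nn, d_Mr_el):
    # SBR: no point of the sibling node l can improve on nn
    # when d(M_r, x) + d(x, nn) < d(M_r, e_l).
    return d_x_Mr + d_x_nn < d_Mr_el

Under the MDF strategy d(x, M_r) is already known from the examination of the father, so this pruning indeed costs no new distance computation.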
Generalized rule (GR)
This rule is an iterated combination of the FNR and the SBR (due to space constraints we refer the reader to [9] for details on the generalized rule).
(a) Geometrical view of the FNR rule. (b) Geometrical view of the SBR rule.
The table rule (TR)
This recent rule [12] prunes the tree by taking the current nearest neighbor as a reference. In order to do so, a new distance should be defined:
Definition. Given a prototype or sample point p, the distance between p and a set of prototypes S is defined as d(p, S) = min_{y ∈ S} d(p, y).
At preprocess time, the distances from each prototype to the prototype set S_t of each node t in the tree are computed and stored in a table, allowing constant-time pruning. Note that the size of this table is quadratic in the number of prototypes since, as the tree is binary, the number of nodes is two times the number of prototypes.
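The preprocessing step admits a direct implementation; the sketch below (ours, reusing the Node layout and dist function assumed in the earlier sketches) fills such a table keyed by (prototype, node):

def build_table(root, prototypes, dist):
    # Precompute d(p, S_t) = min_{y in S_t} d(p, y) for every prototype p and
    # every node t; the table has O(n^2) entries, since a binary tree built
    # over n prototypes has roughly 2n nodes.
    table = {}
    def visit(node):
        if node is None:
            return
        for p in prototypes:
            table[(p, id(node))] = min(dist(p, y) for y in node.points)
        visit(node.left)
        visit(node.right)
    visit(root)
    return table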
Fig. 2: Table rule and node S_t: situation where it can be pruned (up) and where it cannot (down).
4 The new algorithm
In Algorithm 1 an efficient combination of pruning rules is proposed. Note that, as the GR generalizes both the FNR and the SBR, these two rules are not applied while the generalized one is activated (lines 11-19). When the MDF method is used to build the tree, it is important to note that each time a node is expanded, only one of the representatives is new (the left node), while the other (right) is the same as that of the father node (in this case, only the radius of the node can change). For this reason, in this case the distance d_r = d(x, M_r) in line 9 is never actually computed (as it is already known). Then, when a node is examined during the search, every pruning that can be applied without computing a new distance is applied (lines 3 to 8). If none of these rules is able to prune, the distance to the current node is computed (line 9). The pruning rules that use the new distance are then applied (lines 11 to 28).
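As a companion to the pseudocode below, the following Python sketch (ours, not the authors' implementation) transcribes the control flow of Algorithm 1. The rule tests are injected as predicates so that any subset of rules can be activated; the MDF bookkeeping that avoids recomputing d(x, M_r) is omitted for brevity, the Node layout of the earlier sketches is assumed, and the search state starts with d_min set to infinity.

def cpr(t, x, state, dist, rules, gr_active=False):
    # Combined pruning-rule search following Algorithm 1.
    # state = {'nn': best prototype so far, 'd_min': distance to it}.
    # rules maps 'FNR', 'SBR', 'TR' and 'GR' to predicates (node, x, state) -> True
    # when the node can be pruned; use lambda *_: False for deactivated rules.
    if t.left is None and t.right is None:            # leaf: a single stored prototype
        d = dist(x, t.rep)
        if d < state['d_min']:
            state['nn'], state['d_min'] = t.rep, d
        return
    r, l = t.right, t.left
    # lines 3-8: try to prune without computing any new distance
    if rules['SBR'](l, x, state) or rules['TR'](l, x, state):
        if not (rules['FNR'](r, x, state) or rules['TR'](r, x, state)):
            cpr(r, x, state, dist, rules, gr_active)
        return                                         # l is pruned in every case
    # lines 9-10: compute the distances to both representatives and update nn
    d_r, d_l = dist(x, r.rep), dist(x, l.rep)
    for child, d in ((r, d_r), (l, d_l)):
        if d < state['d_min']:
            state['nn'], state['d_min'] = child.rep, d
    # lines 11-28: descend into the children, closest representative first
    for child in ((l, r) if d_l < d_r else (r, l)):
        if gr_active:
            if not rules['GR'](child, x, state):
                cpr(child, x, state, dist, rules, gr_active)
        elif not (rules['FNR'](child, x, state) or rules['SBR'](child, x, state)):
            cpr(child, x, state, dist, rules, gr_active)

A combination such as "ft" then corresponds to real predicates for FNR and TR and trivially false predicates for SBR and GR.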
5 Experiments
We have performed some experiments in order to compare our algorithm with some state of the art methods. The first method, the multi-vantage-point tree (mvp), is a balanced tree requiring linear space, where the arity can be extended and multiple pivots per node can be applied [2]. The second method is the Spatial Approximation Tree (sat), whose structure uses a graph based on the Delaunay triangulation and does not depend on any parameter [10]. The code of these algorithms comes from the SISAP library (www.sisap.org). We applied the mvp with only one pivot per node, a bucket size of 1 and an arity of 2, as this setting leads to better performance according to preliminary experiments on these data sets. All the experiments were performed on a Linux box with 16 GB of memory.
From now on, and only for the graphs, the FNR rule (and respectively the SBR, GR and TR rules) will be abbreviated by "f" (respectively "s", "g" and "t"); consequently, combining the FNR and the SBR will be referred to as "fs". The combinations of rule "g" with "s" or "f" are not presented, as "g" generalizes these rules: every branch pruned by one of them is also pruned by "g".
In order to evaluate the performance of the different combined rules, we present in this section the experiments on both artificial and real world data using different settings of our algorithm.
5.1 Artificial data with uniform distributions
We consider here points drawn in a space of dimension n ranging from 5 to 30. The algorithms are compared with a growing number of prototypes. The size of the prototype sets ranged from 2,000 prototypes to 30,000 in steps of 4,000. Each experiment measures the average distance computations of 10,000 searches.
Algorithm 1:
Data: t: a tree node; x: a sample point
Result: nn: the nearest neighbor prototype; d_min: the distance to nn
 1  if t is not a leaf then
 2      r = right_child(t); ℓ = left_child(t);
 3      if (SBR(ℓ) || TR(ℓ)) then
 4          if (no FNR(r)) && (no TR(r)) then
 5              CPR(r, x)  /* left (sibling) node has been pruned */;
 6          end
 7          Return  /* i.e. prune both */;
 8      end
 9      d_r = d(x, M_r); d_ℓ = d(x, M_ℓ);
10      update d_min and nn;
11      if Activated(GR) then
12          if d_ℓ < d_r then
13              if (no GR(ℓ)) then CPR(ℓ, x);
14              if (no GR(r)) then CPR(r, x);
15          else
16              if (no GR(r)) then CPR(r, x);
17              if (no GR(ℓ)) then CPR(ℓ, x);
18          end
19      else
20          if d_ℓ < d_r then
21              if (no FNR(ℓ)) && (no SBR(ℓ)) then CPR(ℓ, x);
22              if (no FNR(r)) && (no SBR(r)) then CPR(r, x);
23          else
24              if (no FNR(r)) && (no SBR(r)) then CPR(r, x);
25              if (no FNR(ℓ)) && (no SBR(ℓ)) then CPR(ℓ, x);
26          end
27      end
28  end
Figure 3a shows the average number of distance computations in a 10-dimensional space following a uniform distribution. The standard deviation of the measures is not included as it is almost negligible. As can be seen, both sat and mvp are outperformed by the other pruning rules. Although the table rule also outperforms the FNR and GR ones, it is worth mentioning that these methods have a space consumption smaller than the table rule. In the case of small space capabilities, these methods should be preferred. Considering the classic FNA algorithm as a reference, we observe that the GR and TR rules outperform the original rule, namely FNR. Moreover, it appears that combining the table rule with either the sibling or the generalized rule does not perform better than combining the FNR and the table rule.
Fig. 3: Comparison of different pruning rule combinations with the sat and mvp algorithms. (a) Distance computations w.r.t. training set size in a 10-dimensional space. (b) Distance computations w.r.t. dimensionality (11,000 training samples).
As the generalized rule generalizes the sibling rule, the combination "fst" does not perform better than "fg", as expected.
Another classic problem to address is the curse of dimensionality. It expresses the fact that the volume of the unit hypercube increases exponentially with the dimension of the space; in other words, the points tend to be at the same distance from each other in high dimensions. In our setting, this will obviously prevent a large number of prunings: the algorithm will tend to behave like the brute force algorithm as the dimension increases. This algorithmic limitation is not a real problem, since looking for a nearest neighbor does not make sense in a space where the distances between each pair of points are similar.
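This concentration of distances is easy to observe; the short self-contained snippet below (ours, not an experiment from the paper) estimates the ratio between the smallest and the mean pairwise distance of uniformly drawn points, a ratio that moves towards 1 as the dimension grows and thus leaves the pruning rules little to work with:

import math, random

def contrast(dim, n=200, seed=0):
    # Ratio min/mean of the pairwise distances of n uniform points in [0,1]^dim.
    rnd = random.Random(seed)
    pts = [[rnd.random() for _ in range(dim)] for _ in range(n)]
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    return min(dists) / (sum(dists) / len(dists))

for dim in (2, 10, 30):
    print(dim, round(contrast(dim), 3))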
Figure 3b provides a comparative analysis of the behavior of the methods as the dimension increases. The number of prototypes is set to 11,000 points and the dimensionality ranges from 2 to 30. It can be observed here that the TR rule is less sensitive to the dimensionality than the other methods. Moreover, as before, combining the TR rule with the FNR one still performs better than the other combinations: at dimension 25, the "ft" combination is able to save 20% of the distance computations while the other methods compute all the distances, as the exhaustive search does.
Two more experiments were performed. First, in order to show the differences when a best-first strategy is used instead of a depth-first strategy: in Figure 4a one can see that similar results are obtained; for this reason, only the depth-first strategy is used in this work. Second, besides the distance computations, the percentage of the database examined is analyzed for all the methods. Results can be seen in Figure 4b. As in the case of distance computations, the CPR method also reduces the overhead of the search, visiting on average fewer nodes (or points in the data set).
Fig. 4: Average number of visited nodes during the search for the best pruning rule combination, different search strategies, and the sat and mvp algorithms. (a) Best-first (bf) versus depth-first (df) strategies in 10- and 20-dimensional spaces. (b) Visited nodes w.r.t. training set size (dimension 10).
5.2 Real world data
To show the performance of the algorithms with real data, some tests were conducted on a spelling task. For these experiments, a database of 69,069 words of an English dictionary was used. The input test of the speller was simulated by distorting the words by means of random insertion, deletion and substitution operations over the words in the original dictionary. The Levenshtein distance [15] was used to compare the words. Dictionaries of increasing size (from 2,000 to 30,000) were obtained by randomly extracting words from the whole dictionary. Test points were obtained by distorting the words in the training set. For each experiment, 1,000 distorted words were generated and used as the test set. To obtain reliable results, the experiments were repeated 10 times. The averages are shown in the plots.
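For reference, a compact Python version of the Levenshtein (edit) distance used in this task, standard dynamic programming rather than the authors' implementation:

def levenshtein(a, b):
    # Minimum number of insertions, deletions and substitutions turning a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

assert levenshtein("spelling", "speling") == 1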
The experiment performed in Figure 3a for artificial data (average number of distance computations using increasing-size prototype sets) was repeated in the spelling task. Results are shown in Figure 5. The experiments show a reduction in the number of distance computations of around 20% when the SBR rule is combined with the FNR, and of around 40% for the generalized rule, with respect to the reference FNR rule. Moreover, when combining both the f and t rules (with or without the g rule), the resulting combination clearly outperforms the other combinations, as happens with the other kinds of data, saving 60% of the average number of distance computations.

Fig. 5: Pruning rules combined in a spelling task in relation to the other methods.
6 Conclusions and further work
A new algorithm has been defined to optimize the combination of several pruning rules using the FNA tree-based search algorithm. When the rules are applied
alone, reductions between 20% and 60% are obtained for low dimensions, and this reduction decreases with the dimensionality (a normal behavior, since the problem gets harder with increasing dimensionality) when comparing with the baseline FNR rule. When the rules are combined, further reductions in the average number of distance computations and in the overhead of the methods (measured as the average number of visited nodes or points) can be observed (e.g. roughly an 80% reduction in a 10-dimensional space). Similar results are also obtained on a real world task (namely a spelling task).
We are currently studying new pruning rules and combinations, and also how to use them in dynamic tree structures. We also think that this algorithm can be adapted with minor changes to other tree-based search methods not explored in this work.
7 Acknowledgments
The authors thank the Spanish CICyT for partial support of this work through projects TIN2009-14205-C04-C1 and TIN2009-14247-C02-02, the IST Programme of the European Community under the PASCAL Network of Excellence (IST-2006-216886), and the program Consolider Ingenio 2010 (CSD2007-00018).
References
1. Christian Böhm and Florian Krebs. High performance data mining using the nearest neighbor join. In ICDM'02: Proceedings of the 2002 IEEE International Conference on Data Mining. IEEE Computer Society, 2002.
2. T. Bozkaya and M. Ozsoyoglu. Indexing large metric spaces for similarity search queries. ACM Transactions on Database Systems, 1999.
3. S. Brin. Near neighbor search in large metric spaces. In VLDB Conference, pages 574–584, 1995.
4. P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efficient access method for similarity search in metric spaces. In VLDB Conference, pages 426–435. Morgan Kaufmann Publishers, Inc., 1997.
5. B. V. Dasarathy. Data mining tasks and methods: Classification: nearest-neighbor approaches. pages 288–298, 2002.
6. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, New York, 2000. 2nd Edition.
7. K. Fukunaga and P. M. Narendra. A branch and bound algorithm for computing k-nearest neighbors. IEEE Transactions on Computers, 24:750–753, 1975.
8. E. Gómez-Ballester, L. Micó, and J. Oncina. Some improvements in tree based nearest neighbour search algorithms. Lecture Notes in Computer Science, (2905):456–463, 2003.
9. E. Gómez-Ballester, L. Micó, and J. Oncina. Some approaches to improve tree-based nearest neighbour search algorithms. Pattern Recognition, 39(2):171–179, 2006.
10. G. Navarro. Searching in metric spaces by spatial approximation. In SPIRE'99: Proceedings of the String Processing and Information Retrieval Symposium, page 141. IEEE Computer Society, 1999.
11. H. Noltemeier, K. Verbarg, and C. Zirkelbach. Monotonous bisector* trees - a tool for efficient partitioning of complex scenes of geometric objects. In Data Structures and Efficient Algorithms, Final Report on the DFG Special Joint Initiative, pages 186–203, London, UK, 1992. Springer-Verlag.
12. J. Oncina, F. Thollard, E. Gómez-Ballester, L. Micó, and F. Moreno-Seco. A tabular pruning rule in tree-based fast nearest neighbour search algorithms. LNCS, (4478):306–313, 2007.
13. Gregory Shakhnarovich, Trevor Darrell, and Piotr Indyk, editors. Nearest-Neighbor Methods in Learning and Vision. The MIT Press, 2006.
14. E. Vidal. New formulation and improvements of the nearest-neighbour approximating and eliminating search algorithm (AESA). Pattern Recognition Letters, 15:1–7, 1994.
15. R. A. Wagner and M. J. Fischer. The string-to-string correction problem. Journal of the Association for Computing Machinery, 21(1):168–173, 1974.
16. P. N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 1993.