
HAL Id: hal-00961322
https://hal-ujm.archives-ouvertes.fr/hal-00961322v2
Submitted on 10 Sep 2010

Combining Elimination Rules in Tree-Based Nearest Neighbor Search Algorithms

Eva Gómez-Ballester¹, Luisa Micó¹, Franck Thollard², Jose Oncina¹, and Francisco Moreno-Seco¹

¹ Dept. Lenguajes y Sistemas Informáticos, Universidad de Alicante, E-03071 Alicante, Spain
{eva,mico,oncina,paco}@dlsi.ua.es

² Grenoble University, LIG, BP 53, 38041 Grenoble Cedex 9
thollard@univ-st-etienne.fr

Abstract. A common activity in many pattern recognition tasks, image processing or clustering techniques involves searching a labeled data set for the point nearest to a given unlabelled sample. To reduce the computational overhead of the naive exhaustive search, several fast nearest neighbor search (NNS) algorithms have been proposed in recent years. Depending on the structure used to store the training set (usually a tree), different strategies to speed up the search have been defined. In this paper, a new algorithm based on the combination of different pruning rules is proposed. An experimental evaluation and comparison of its behavior with respect to other techniques has been performed, using both real and artificial data.

1 Introduction

Nearest Neighbor Search (NNS) is an important technique in a variety of applications, including pattern recognition [6], vision [13], and data mining [1,5]. These techniques aim at finding the object of a set nearest to a given test object according to a distance function [6]. The naive brute-force method is sometimes a bottleneck due to the large number of distances that must be computed and/or the cost of each distance computation. In this work we consider the computational problem of finding nearest neighbors in general metric spaces; spaces that cannot be conveniently embedded in or approximated by a Euclidean space are of particular interest. Many techniques based on different data structures have been proposed (vp-tree [16], GNAT [3], sa-tree [10], AESA [14], M-tree [4]); tree-based techniques are nevertheless the most popular. The Fukunaga and Narendra algorithm (FNA [7]) is one of the first known tree-based techniques of this type. Like the aforementioned methods, it prunes the traversal of the tree by exploiting the triangle inequality.

This paper considers four pruning rules: the table rule [12], a rule based on information stored in the sibling node (the sibling rule [9]), the original rule from the FNA (the Fukunaga and Narendra rule, FNR), and a generalization of both the sibling rule and the FNR [9]. We end up with a new algorithm for combining the rules that significantly reduces the number of distance computations.

The algorithm is evaluated on both artificial and real-world data and compared with state-of-the-art methods.

The paper is organized as follows: we first recall the FNA algorithm and define the general framework of the new algorithm (in particular how the tree is built). We then review the different rules we aim at combining (section 3) and propose our new algorithm (section 4). Section 5 presents the experimental comparison.

2 The basic algorithm

The FNA is a fast tree-based search method that can work in general metric spaces. In the original FNA the c-means algorithm was used to define the partition of the data. In the work by Gómez-Ballester et al. [8] many strategies were explored; the best one, namely the Most Distant from the Father tree (MDF), in which the representative of the right node is the same as the representative of its father, is the strategy used in the experiments presented in this work. Thus, each time an expansion of a node is necessary, only one new distance needs to be computed (instead of two), hence reducing the number of distances computed. This strategy was also successfully used by Noltemeier et al. [11] in the context of bisector trees.

In the MDF tree each leaf stores a point of the search space. The information stored in each node t is S_t, the set of points stored in the leaves of the subtree rooted at t; M_t, the representative of S_t; and the radius of S_t, R_t = max_{x ∈ S_t} d(M_t, x). Figure 1 shows a partition of the data in a 2-dimensional unit hypercube. The root node is associated with all the points of the set. The left node represents all the points that lie in the half-plane under the segment [(0, 0.95); (0.65, 0)]; the right node is associated with the other points. According to the MDF strategy, the representative of the right node (M_r) is the same as that of its father, and the representative of the left node (M_ℓ) is the point most distant from M_r. The space is then recursively partitioned.

Fig. 1: Partition of the data using the MDF strategy. Representatives of each node at different levels are drawn as rings.
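To make the construction concrete, the following Python sketch builds an MDF-style tree. It is our illustration, not code from the paper: the names (Node, build_mdf, d) are ours, the metric is Euclidean for concreteness, and the splitting criterion (each point goes to the child whose representative is closer) is an assumption, since the paper does not spell it out here.

    import math

    def d(a, b):
        """The metric; Euclidean here, but any metric works in this setting."""
        return math.dist(a, b)

    class Node:
        """Tree node t storing S_t, its representative M_t and its radius R_t."""
        def __init__(self, points, rep):
            self.S = points                          # S_t
            self.M = rep                             # M_t
            self.R = max(d(rep, p) for p in points)  # R_t = max_{x in S_t} d(M_t, x)
            self.left = self.right = None

    def build_mdf(points, rep):
        node = Node(points, rep)
        if len(points) <= 1:
            return node                              # leaf: a single point
        far = max(points, key=lambda p: d(rep, p))   # most distant from the father
        if d(rep, far) == 0:                         # degenerate: all points coincide
            return node
        # MDF: the right child inherits the father's representative,
        # the left child is represented by the most distant point (assumed split).
        node.left = build_mdf([p for p in points if d(far, p) < d(rep, p)], far)
        node.right = build_mdf([p for p in points if d(far, p) >= d(rep, p)], rep)
        return node

A call such as build_mdf(data, data[0]) picks an arbitrary root representative; how the root representative is chosen is not stated here, so that choice is also an assumption.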

3 A review of pruning rules

Fukunaga and Narendra Rule (FNR)

The pruning rule defined by Fukunaga and Narendra for internal nodes requires computing the distance from the test sample to the representative of the candidate node to be eliminated. Figure 2a presents a graphical view of the Fukunaga and Narendra rule.

Rule: no y ∈ S_t can be the nearest neighbor to x if d(x, nn) + R_t < d(x, M_t).
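Stated as code, the rule is a one-line predicate over quantities already available when the node is examined (a sketch with a hypothetical name, not the paper's implementation):

    def fnr_prunes(d_x_Mt, R_t, d_min):
        """FNR: no y in S_t can beat the current nearest neighbor nn when
        d(x, nn) + R_t < d(x, M_t), i.e. the ball around M_t is too far away."""
        return d_min + R_t < d_x_Mt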

The Sibling Based Rule (SBR)

Given two sibling nodes r and ℓ, this rule requires that each node r stores the distance d(M_r, e_ℓ), that is, the distance between the representative of the node, M_r, and the nearest point, e_ℓ, in the sibling node ℓ (e_ℓ ∈ S_ℓ). Figure 2b presents a graphical view of the sibling based rule.

Rule: no y ∈ S_ℓ can be the nearest neighbor to x if d(M_r, x) + d(x, nn) < d(M_r, e_ℓ).

Unlike the FNR, the SBR can be applied to eliminate node ℓ without computing d(M_ℓ, x), avoiding some extra distance computations at search time.
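A corresponding sketch of the test, again with hypothetical names; every argument is either already known (d(x, M_r), d(x, nn)) or precomputed at build time (d(M_r, e_ℓ)):

    def sbr_prunes(d_x_Mr, d_min, d_Mr_el):
        """SBR: prune the sibling node l of r when d(M_r, x) + d(x, nn) < d(M_r, e_l).
        No new distance computation is needed at search time."""
        return d_x_Mr + d_min < d_Mr_el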

Generalized rule (GR)

This rule is an iterated combination of the FNR and the SBR; due to space constraints we refer the reader to [9] for the details of the generalized rule.

Fig. 2: (a) Geometrical view of the FNR rule; (b) geometrical view of the SBR rule.

The table rule (TR)

This recent rule [12] prunes the tree by taking the current nearest neighbor as a reference. In order to do so, a new distance must be defined.

Definition. Given a prototype or sample point p, the distance between p and a set of prototypes S is defined as d(p, S) = min_{y ∈ S} d(p, y).

At preprocess time, the distances from each prototype to the prototype set S_t of each node t in the tree are computed and stored in a table, allowing constant-time pruning. Note that the size of this table is quadratic in the number of prototypes since, as the tree is binary, the number of nodes is about twice the number of prototypes.
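The preprocessing step can be sketched as follows, assuming prototypes are hashable (e.g. tuples) and reusing the Node class and metric d from the earlier sketch. The exact pruning test of the table rule is given in [12]; the test below is only an assumed triangle-inequality form: if 2 d(x, nn) < d(nn, S_t), then every y in S_t satisfies d(x, y) >= d(nn, y) - d(nn, x) >= d(nn, S_t) - d(x, nn) > d(x, nn), so t can safely be pruned.

    def precompute_table(prototypes, nodes):
        """Table of d(p, S_t) = min_{y in S_t} d(p, y) for every prototype p
        and every node t (nodes: all tree nodes, e.g. gathered by a traversal).
        With about 2n nodes for n prototypes, the table size is quadratic."""
        return {(p, id(t)): min(d(p, y) for y in t.S)
                for p in prototypes for t in nodes}

    def tr_prunes(table, nn, t, d_min):
        """Assumed form of the table rule test; see [12] for the exact rule."""
        return 2 * d_min < table[(nn, id(t))]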

Figure: The table rule and node S_t: a situation where S_t can be pruned (top) and one where it cannot (bottom).

4 Combining the pruning rules (CPR)

In Algorithm 1 an efficient combination of the pruning rules is proposed.

Algorithm 1: the combined search.

Data: t: a tree node; x: a sample point
Result: nn: the nearest neighbor prototype; d_min: the distance to nn

 1  if t is not a leaf then
 2      r = right_child(t); ℓ = left_child(t)
 3      if (SBR(ℓ) || TR(ℓ)) then
 4          if (no FNR(r)) && (no TR(r)) then
 5              CPR(r, x)          /* the left (sibling) node has been pruned */
 6          end
 7          return                 /* i.e. prune both */
 8      end
 9      d_r = d(x, M_r); d_ℓ = d(x, M_ℓ)
10      update d_min and nn
11      if Activated(GR) then
12          if d_ℓ < d_r then
13              if (no GR(ℓ)) then CPR(ℓ, x)
14              if (no GR(r)) then CPR(r, x)
15          else
16              if (no GR(r)) then CPR(r, x)
17              if (no GR(ℓ)) then CPR(ℓ, x)
18          end
19      else
20          if d_ℓ < d_r then
21              if (no FNR(ℓ)) && (no SBR(ℓ)) then CPR(ℓ, x)
22              if (no FNR(r)) && (no SBR(r)) then CPR(r, x)
23          else
24              if (no FNR(r)) && (no SBR(r)) then CPR(r, x)
25              if (no FNR(ℓ)) && (no SBR(ℓ)) then CPR(ℓ, x)
26          end
27      end
28  end

Note that, as the GR generalizes both the FNR and the SBR, these two rules are not applied when the generalized one is activated (lines 11-19). When the MDF method is used to build the tree, each time a node is expanded only one of the representatives is new (the left node), while the other (the right node) is the same as in the father (in this case, only the radius of the node can change). For this reason, the distance d_r = d(x, M_r) in line 9 is never actually computed, as it is already known. When a node is examined during the search, every pruning that can be applied without computing a new distance is applied first (lines 3 to 8). If none of these rules is able to prune, the distance to the current node is computed (line 9), and the pruning rules that use the new distance are then applied (lines 11 to 28).
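For illustration, here is a simplified Python rendering of the search built on the earlier sketches. It is not the authors' exact Algorithm 1: the GR branch is omitted (it needs the machinery of [9]), each left node ℓ is assumed to store the SBR distance d(M_r, e_ℓ) in a hypothetical attribute d_sib, and d(x, M_r) is passed down from the father as the MDF tree allows.

    def cpr_search(t, x, nn, d_min, table, d_x_Mr):
        """Depth-first search combining the cost-free prunings (SBR, TR) with
        the distance-based ones (FNR, TR); returns the updated (nn, d_min).
        d_x_Mr = d(x, M_t), inherited from the father under the MDF strategy."""
        if t.left is None:
            return nn, d_min
        l, r = t.left, t.right
        # prunings that need no new distance (lines 3-8 of Algorithm 1)
        if sbr_prunes(d_x_Mr, d_min, l.d_sib) or tr_prunes(table, nn, l, d_min):
            if not (fnr_prunes(d_x_Mr, r.R, d_min) or tr_prunes(table, nn, r, d_min)):
                nn, d_min = cpr_search(r, x, nn, d_min, table, d_x_Mr)
            return nn, d_min
        # one new distance: the left representative only (line 9)
        d_l = d(x, l.M)
        if d_l < d_min:
            nn, d_min = l.M, d_l
        # visit the children, closest representative first (lines 20-26)
        for d_rep, c in sorted([(d_l, l), (d_x_Mr, r)], key=lambda pair: pair[0]):
            if not (fnr_prunes(d_rep, c.R, d_min) or tr_prunes(table, nn, c, d_min)):
                nn, d_min = cpr_search(c, x, nn, d_min, table, d_rep)
        return nn, d_min

A search would be seeded with the root representative, e.g. with d0 = d(x, root.M): cpr_search(root, x, root.M, d0, table, d0).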

5 Experiments

We have performed some experiments in order to compare our algorithm with some state-of-the-art methods. The first method, the multi-vantage-point tree (mvp), is a balanced tree requiring linear space, in which the arity can be extended and multiple pivots per node can be used [2]. The second method is the Spatial Approximation Tree (sat), whose structure uses a graph based on the Delaunay triangulation and does not depend on any parameter [10]. The code of these algorithms comes from the SISAP library (www.sisap.org). We applied the mvp with only one pivot per node, a bucket size of 1 and an arity of 2, as this setting leads to better performance according to preliminary experiments on these data sets. All the experiments were performed on a Linux box with 16 GB of memory.

From now on, and only for the graphs, the FNR rule (and respectively the SBR, GR and TR rules) will be abbreviated by "f" (respectively "s", "g" and "t"); consequently, combining the FNR and SBR will be referred to as "fs". The combinations of rule "g" with "s" or "f" are not presented, as "g" generalizes these rules: every branch pruned by one of them is also pruned by "g".

In order to evaluate the performance of the different combined rules, we present in this section experiments on both artificial and real-world data using different settings of our algorithm.

5.1 Artificial data with uniform distributions

We consider here points drawn in a space of dimension n ranging from 5 to 30. The algorithms are compared with a growing number of prototypes. The size of the prototype sets ranged from 2,000 to 30,000 prototypes in steps of 4,000. Each experiment measures the average number of distance computations over 10,000 searches.
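Since the cost measure throughout is the number of distance computations (not wall-clock time), a small counting wrapper is enough to reproduce this kind of measurement. The harness below is hypothetical (our code, not the paper's), with search standing for any of the compared methods and d redefining the metric of the earlier sketches with instrumentation:

    import math

    n_calls = 0

    def d(a, b):
        """The metric, instrumented to count distance computations."""
        global n_calls
        n_calls += 1
        return math.dist(a, b)

    def average_cost(search, tree, queries):
        """Average number of distance computations per query."""
        global n_calls
        n_calls = 0
        for x in queries:
            search(tree, x)
        return n_calls / len(queries)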


Figure 3a shows the average number of distance computations in a 10-dimensional space following a uniform distribution. The standard deviation of the measures is not included as it is almost negligible. As can be seen, both sat and mvp are outperformed by the other pruning rules. Although the table rule also outperforms the FNR and GR rules, it is worth mentioning that these methods have a smaller space consumption than the table rule; in the case of small space capabilities, these methods should be preferred. Considering the classic FNA algorithm as a reference, we observe that the GR and TR rules outperform the original rule, namely the FNR. Moreover, it appears that combining the table rule with either the sibling or the generalized rule does not perform better than combining the FNR and the table rule. As the generalized rule generalizes the sibling rule, the combination "fst" does not perform better than "fg", as expected.

Fig. 3: Comparison of different pruning rule combinations with the sat and mvp algorithms. (a) Distance computations w.r.t. training set size in a 10-dimensional space. (b) Distance computations w.r.t. dimensionality (11,000 training samples).

Another classic problem to address is the curse of dimensionality. It refers to the fact that, as the dimension of the space grows, the volume of the unit hypercube grows exponentially relative to that of any inscribed ball; in other words, points tend to lie at nearly the same distance from each other in high dimension. In our setting, this obviously prevents a large number of prunings: the algorithm will tend to behave like the brute-force algorithm as the dimension increases. This algorithmic limitation is not a real problem, since looking for a nearest neighbor does not make sense in a space where the distances between each pair of points are similar.

Figure 3b presents a comparative analysis of the behavior of the methods as the dimension increases. The number of prototypes is set to 11,000 points and the dimensionality ranges from 2 to 30. It can be observed that the TR rule is less sensitive to the dimensionality than the other methods. Moreover, as before, combining the TR rule with the FNR one still performs better than the other combinations: at dimension 25, the "ft" combination is able to save 20% of the distance computations while the other methods compute all the distances, as the exhaustive search does.
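The concentration effect is easy to observe empirically. The following small check is our own illustration, not one of the paper's experiments: it estimates the relative spread of pairwise distances between uniform points, which shrinks as the dimension grows.

    import math, random, statistics

    def relative_spread(dim, n=200):
        """Std/mean of the pairwise distances of n uniform points in [0,1]^dim;
        small values mean all points lie at nearly the same distance."""
        pts = [tuple(random.random() for _ in range(dim)) for _ in range(n)]
        dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
        return statistics.stdev(dists) / statistics.mean(dists)

    for dim in (2, 10, 30):
        print(dim, round(relative_spread(dim), 3))   # the ratio decreases with dim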

Two more experiments were performed. The first shows the differences when a best-first strategy is used instead of a depth-first strategy: in Figure 4a one can see that similar results are obtained, and for this reason only the depth-first strategy is used in this work. Second, besides the distance computations, the percentage of the database examined was analyzed for all the methods; results can be seen in Figure 4b. As in the case of distance computations, the CPR method also reduces the overhead of the search, visiting on average fewer nodes (or points in the data set).

Fig. 4: Average number of visited nodes during the search, for the best pruning rule combination, different search strategies, and the sat and mvp algorithms. (a) Best-first (bf) versus depth-first (df) strategies in 10- and 20-dimensional spaces. (b) Visited nodes w.r.t. training set size (dimension 10).

5.2 Real world data

To show the performance of the algorithms on real data, some tests were conducted on a spelling task. For these experiments, a database of 69,069 words from an English dictionary was used. The input text of the speller was simulated by distorting words through random insertion, deletion and substitution operations over the words in the original dictionary. The Levenshtein distance [15] was used to compare the words. Dictionaries of increasing size (from 2,000 to 30,000 words) were obtained by randomly extracting words from the whole dictionary. Test points were obtained by distorting the words in the training set. For each experiment, 1,000 distorted words were generated and used as the test set. To obtain reliable results, the experiments were repeated 10 times; the averages are shown on the plots.
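For reference, the Levenshtein distance used here is the classic dynamic-programming edit distance of Wagner and Fischer [15]; the following is a standard textbook implementation, not code from the paper:

    def levenshtein(a, b):
        """Minimum number of insertions, deletions and substitutions
        turning string a into string b."""
        prev = list(range(len(b) + 1))           # row for the empty prefix of a
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

For instance, levenshtein("kitten", "sitting") returns 3: two substitutions and one insertion.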

The experiment performed in Figure 3a for artificial data (average number of distance computations with increasing prototype set sizes) was repeated on the spelling task. Results are shown in Figure 5. The experiments show a reduction in the number of distance computations of around 20% when the SBR rule is combined with the FNR, and around 40% for the generalized rule, with respect to the reference FNR rule. Moreover, when combining both the "f" and "t" rules (with or without the "g" rule), the resulting combination clearly outperforms the other combinations, as happens with the other kinds of data, saving 60% of the average number of distance computations.

Fig. 5: Pruning rules combined in a spelling task, compared with the other methods.

6 Conclusions and further works

A new algorithm has been defined to optimize the combination of several pruning rules in the FNA tree-based search algorithm. When the rules are applied alone, reductions between 20% and 60% with respect to the baseline FNR rule are obtained for low dimensions, and this reduction decreases with the dimensionality (a normal behavior, since the problem gets harder as the dimensionality increases). When the rules are combined, larger reductions can be observed, both in the average number of distance computations and in the overhead of the methods (measured as the average number of visited nodes or points), e.g. roughly an 80% reduction in a 10-dimensional space. Similar results are also obtained on a real-world task (namely a spelling task).

We are currently studying new pruning rules and combinations, and also how to use them in dynamic tree structures. We also think that this algorithm can be adapted with minor changes to other tree-based search methods not explored in this work.

7 Acknowledgments

The authors thank the Spanish CICyT for partial support of this work through projects TIN2009-14205-C04-C1 and TIN2009-14247-C02-02, the IST Programme of the European Community under the PASCAL Network of Excellence (IST-2006-216886), and the program Consolider Ingenio 2010 (CSD2007-00018).

References

1. C. Böhm and F. Krebs. High performance data mining using the nearest neighbor join. In ICDM'02: Proceedings of the 2002 IEEE International Conference on Data Mining. IEEE Computer Society, 2002.

2. T. Bozkaya and M. Ozsoyoglu. Indexing large metric spaces for similarity search queries. ACM Transactions on Database Systems, 24(3):361-404, 1999.

3. S. Brin. Near neighbor search in large metric spaces. In VLDB Conference, pages 574-584, 1995.
4. P. Ciaccia, M. Patella, and P. Zezula. M-tree: an efficient access method for similarity search in metric spaces. In VLDB Conference, pages 426-435. Morgan Kaufmann Publishers, Inc., 1997.
5. B. V. Dasarathy. Data mining tasks and methods: classification: nearest-neighbor approaches. Pages 288-298, 2002.
6. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, New York, 2nd edition, 2000.
7. K. Fukunaga and P. M. Narendra. A branch and bound algorithm for computing k-nearest neighbors. IEEE Transactions on Computers, C-24:750-753, 1975.
8. E. Gómez-Ballester, L. Micó, and J. Oncina. Some improvements in tree based nearest neighbour search algorithms. Lecture Notes in Computer Science, 2905:456-463, 2003.
9. E. Gómez-Ballester, L. Micó, and J. Oncina. Some approaches to improve tree-based nearest neighbour search algorithms. Pattern Recognition, 39(2):171-179, 2006.
10. G. Navarro. Searching in metric spaces by spatial approximation. In SPIRE'99: Proceedings of the String Processing and Information Retrieval Symposium, page 141. IEEE Computer Society, 1999.
11. H. Noltemeier, K. Verbarg, and C. Zirkelbach. Monotonous bisector* trees: a tool for efficient partitioning of complex scenes of geometric objects. In Data Structures and Efficient Algorithms, Final Report on the DFG Special Joint Initiative, pages 186-203, London, UK, 1992. Springer-Verlag.
12. J. Oncina, F. Thollard, E. Gómez-Ballester, L. Micó, and F. Moreno-Seco. A tabular pruning rule in tree-based fast nearest neighbour search algorithms. Lecture Notes in Computer Science, 4478:306-313, 2007.
13. G. Shakhnarovich, T. Darrell, and P. Indyk, editors. Nearest-Neighbor Methods in Learning and Vision. The MIT Press, 2006.
14. E. Vidal. New formulation and improvements of the nearest-neighbour approximating and eliminating search algorithm (AESA). Pattern Recognition Letters, 15:1-7, 1994.
15. R. A. Wagner and M. J. Fischer. The string-to-string correction problem. Journal of the Association for Computing Machinery, 21(1):168-173, 1974.
16. P. N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pages 311-321, 1993.
