HAL Id: ujm-00165425
https://hal-ujm.archives-ouvertes.fr/ujm-00165425
Submitted on 9 Mar 2009
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
A Tabular Pruning Rule in Tree-Based Fast Nearest Neighbor Search Algorithms

Jose Oncina(1), Franck Thollard(2), Eva Gómez-Ballester(1), Luisa Micó(1), and Francisco Moreno-Seco(1)

(1) Dept. Lenguajes y Sistemas Informáticos, Universidad de Alicante, E-03071 Alicante, Spain
{oncina,eva,mico,paco}@dlsi.ua.es
(2) Laboratoire Hubert Curien (ex EURISE) - UMR CNRS 5516, 18 rue du Prof. Lauras, 42000 Saint-Étienne Cedex 2, France
thollard@univ-st-etienne.fr
Abstract. Some fast nearest neighbor search (NNS) algorithms using metric properties have appeared in the last years for reducing computational cost. Depending on the structure used to store the training set, different strategies to speed up the search have been defined. For instance, pruning rules avoid the search of some branches of a tree in a tree-based search algorithm. In this paper, we propose a new and simple pruning rule that can be used in most of the tree-based search algorithms. All the information needed by the rule can be stored in a table (at preprocessing time). Moreover, the rule can be computed in constant time. This approach is evaluated through real and artificial data experiments. In order to test its performance, the rule is compared to and combined with other previously defined rules.
1 Introduction
Nearest Neighbor Search (NNS) techniques aim at finding the nearest point of a set to a given test point using a distance function [4]. The naïve approach is sometimes a bottleneck due to the large number of distances to be computed. Many methods have been developed in order to avoid an exhaustive search (see [3] and [2] for a survey). Tree-based structures are very popular in most of the proposed algorithms [6,5,10,1,9], as this structure provides a simple way to avoid the exploration of some subsets of points. Among these methods, only some are suitable for general metric spaces, i.e., spaces where the objects (prototypes) need not be represented as points, and only a properly defined distance function is required. The most popular and referenced algorithm of such a type was proposed by Fukunaga and Narendra (FNA) [6]. This algorithm is very suitable for studying new tree building strategies and new pruning rules [7,8] as a previous step for extending the new ideas to other tree-based algorithms.
[…] metric). In a classical way, the FNA algorithm will serve as a baseline for the comparison with other techniques.
The paper is organized as follows: we first introduce the basic algorithm (Section 2). We then introduce the different pruning rules that were used in the experiments (Sections 3 and 4). We provide a comparative experiment on both artificial and real world data (Section 5). We finally conclude, suggesting some future works (Section 6).
2 The basic algorithm
The FNA is a fast search method that uses a binary tree structure. Each leaf stores a point of the search space. To each node t is associated S_t, the set of the points stored in the leaves of the subtree rooted at t. Each node stores M_t (the representative of S_t) and the radius of S_t,

R_t = max_{x ∈ S_t} d(M_t, x).

The tree is generally built using recursive calls to a clustering algorithm. In the original FNA the c-means algorithm was used. In [7] some other strategies were explored: in the best method, namely the Most Distant from the Father tree (MDF), the representative of the left node is the same as the representative of its father. Thus, each time an expansion of the node is necessary, only one new distance must be computed (instead of two), reducing the number of distances computed. As the pruning rules apply to any tree, in the following the tree will be built using the MDF method.

In Algorithm 1, a simplified version of FNA is presented; only the Prune_FNR function call must be changed when considering another pruning rule. In order to make the pseudo-code simpler, d_min and nn are considered global variables. Also, only binary trees with one point at the leaves are considered. The use of the Fukunaga and Narendra Rule (FNR) for pruning internal nodes is detailed in [6].
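The search scheme just described can be sketched in Python as below. This is a reconstruction for illustration only, not the authors' implementation; the node layout (fields M, R, left, right), the state dictionary holding the "global" d_min and nn, and all function names are our assumptions.

```python
import math

class Node:
    """Tree node: M is the representative of S_t, R its radius
    R_t = max_{y in S_t} d(M_t, y); a leaf holds a single point in M."""
    def __init__(self, M, R=0.0, left=None, right=None):
        self.M, self.R = M, R
        self.left, self.right = left, right

    def is_leaf(self):
        return self.left is None and self.right is None

def euclidean(a, b):
    return math.dist(a, b)

def prune_fnr(t, d_t, state):
    """Fukunaga-Narendra rule: S_t cannot contain a prototype nearer
    than nn when d(x, M_t) - R_t >= d(x, nn)."""
    return d_t - t.R >= state['d_min']

def search(t, x, d, state, prune):
    """Depth-first FNA-style search; `state` carries d_min and nn;
    prune(node, d(x, M_node), state) is the pluggable pruning rule."""
    if t.is_leaf():
        return
    left, right = t.left, t.right
    d_l, d_r = d(x, left.M), d(x, right.M)
    # update d_min and nn with the representatives just measured
    for child, dist in ((left, d_l), (right, d_r)):
        if dist < state['d_min']:
            state['d_min'], state['nn'] = dist, child.M
    # explore the closer child first
    order = [(left, d_l), (right, d_r)] if d_l < d_r else [(right, d_r), (left, d_l)]
    for child, dist in order:
        if not prune(child, dist, state):
            search(child, x, d, state, prune)
```

Note that with an MDF-built tree the left child shares its father's representative, so d(x, M_ℓ) is already known from the previous level; the sketch recomputes it only for clarity.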
When a new sample point x is given, its nearest neighbor nn is searched in the tree using a depth-first strategy. At a given level, the node t with the smaller distance d(x, M_t) is explored first. In order to avoid the exploration of some branches of the tree, the FNA uses the FNR rule.

3 A review of pruning rules

Fukunaga and Narendra Rule (FNR)
The pruning rule defined by Fukunaga and Narendra for internal nodes only makes use of the information in the node t to be pruned (with representative M_t and radius R_t) and of the hyperspherical volume centered in the sample point x with radius d(x, nn), where nn is the nearest prototype considered up to the moment: no y ∈ S_t can be the nearest neighbor to x if d(M_t, x) − R_t ≥ d(x, nn).

Algorithm 1: search(t, x)
Data: t: a tree node; x: a sample point;
Result: nn: the nearest neighbor prototype; d_min: the distance to nn;
if t is not a leaf then
    r = right_child(t); ℓ = left_child(t);
    d_r = d(x, M_r); d_ℓ = d(x, M_ℓ);
    update d_min and nn;
    if d_ℓ < d_r then
        if not Prune_FNR(ℓ) then search(ℓ, x);
        if not Prune_FNR(r) then search(r, x);
    else
        if not Prune_FNR(r) then search(r, x);
        if not Prune_FNR(ℓ) then search(ℓ, x);

The Sibling Based Rule (SBR)
Given two sibling nodes r and ℓ, this rule requires that each node r stores the distance between the representative of the node, M_r, and the nearest point, e_ℓ, in the sibling node ℓ (S_ℓ).

Rule: no y ∈ S_ℓ can be the nearest neighbor to x if

d(M_r, e_ℓ) > d(M_r, x) + d(x, nn).

Unlike the FNR, SBR can be applied to eliminate node ℓ without computing d(M_ℓ, x), avoiding some extra distance computations at search time.

Generalized rule (GR)
This rule is an iterated combination of the FNR and the SBR (see [8] for more details). Given a node ℓ, a set of prototypes {e_i} is defined in the following way:

G_1 = S_ℓ
e_i = argmax_{p ∈ G_i} d(p, M_ℓ)
G_{i+1} = {p ∈ G_i : d(p, M_r) < d(e_i, M_r)}

where M_r is the representative of the sibling node r, and the G_i are auxiliary sets of prototypes.

Rule: no y ∈ S_ℓ can be the nearest neighbor if there is an integer i such that:

d(M_r, e_i) ≥ d(M_r, x) + d(x, nn)    (1)
d(M_ℓ, e_{i+1}) ≤ d(M_ℓ, x) − d(x, nn)    (2)

The cases i = 0 and i = s are also included, not considering equation (1) or (2) respectively. Note that condition (1) is equivalent to the SBR rule when i = s, and condition (2) is equivalent to the FNR rule when i = 0.
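The chain e_1, …, e_s and the auxiliary sets G_i can be computed per node at preprocessing time. A minimal sketch follows; the function name gr_chain is ours, not from the paper, and points are assumed comparable under an arbitrary distance d.

```python
def gr_chain(S_l, M_l, M_r, d):
    """Compute the GR prototype chain e_1, e_2, ... for a node with
    point set S_l and representative M_l, whose sibling has
    representative M_r:
        G_1 = S_l
        e_i = argmax_{p in G_i} d(p, M_l)
        G_{i+1} = {p in G_i : d(p, M_r) < d(e_i, M_r)}
    Each step removes at least e_i from G, so the loop terminates."""
    chain = []
    G = list(S_l)
    while G:
        e = max(G, key=lambda p: d(p, M_l))
        chain.append(e)
        G = [p for p in G if d(p, M_r) < d(e, M_r)]
    return chain
```

Note that e_1 is simply the farthest point from M_ℓ, i.e. d(M_ℓ, e_1) = R_ℓ, which is why condition (2) with i = 0 reduces to the FNR.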
4 The table rule (TR)
This rule prunes by taking the current nearest neighbor as a reference. In order to do so, the distance from a prototype p to a set of prototypes S is defined as

d(p, S) = min_{y ∈ S} d(p, y).

At preprocess time, the distances from each prototype to each node set S_t in the tree are computed and stored in a table, allowing a constant-time pruning. Note that the size of this table grows with the square of the number of prototypes since, as the tree is binary, the number of nodes is two times the number of prototypes.

[Fig. 1. Application of the table rule: node t, sample point x, current nearest neighbor nn, and the precomputed distance d(nn, S_t).]

Figure 1 presents a graphical view of the table rule.
Proposition 1 (Table Rule). If 2 d(x, nn) < d(nn, S_t), then no prototype e_i in node t can be nearer to the test sample x than nn, i.e.,

∀ e_i ∈ S_t, d(x, e_i) ≥ d(x, nn).
Proof: Let e_i ∈ S_t. By the definition of the distance between a point and a node,

d(nn, S_t) ≤ d(e_i, nn).

Moreover, by the triangle inequality, we have:

d(e_i, nn) ≤ d(e_i, x) + d(x, nn).

Combining these inequalities, we have:

d(nn, S_t) ≤ d(e_i, nn) ≤ d(e_i, x) + d(x, nn) ⇒ d(e_i, x) ≥ d(nn, S_t) − d(x, nn).

Using the table rule condition, we finally have:

d(e_i, x) ≥ 2 d(x, nn) − d(x, nn) = d(x, nn),

which completes the proof.
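The preprocessing table and the constant-time test might be sketched as follows. The function names are ours, and the representation of node sets as a dictionary mapping node identifiers to point lists is an assumption; prototypes are assumed hashable.

```python
def build_table(prototypes, node_sets, d):
    """Precompute d(p, S_t) = min_{y in S_t} d(p, y) for every
    prototype p and every node t.  For a binary tree with one point
    per leaf the number of nodes is about twice the number of
    prototypes, so the table is quadratic in the training set size."""
    return {(p, t): min(d(p, y) for y in S_t)
            for p in prototypes
            for t, S_t in node_sets.items()}

def prune_table(table, t, nn, d_x_nn):
    """Table rule: node t cannot contain a prototype nearer to x than
    nn when 2*d(x, nn) < d(nn, S_t); a single O(1) table lookup."""
    return 2 * d_x_nn < table[(nn, t)]
```

The key property is that the search-time cost is one lookup and one comparison, regardless of |S_t|; all distance computations are paid once, at preprocessing time.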
5 Experiments

As seen in the proof of the correctness of the table rule, it is only required that d is a true distance. In particular, contrary to other techniques such as the well known kd-tree algorithm, a vector space is not needed in order to apply the table rule.

In order to evaluate the power of the table rule, the performance of the algorithm has been measured in real and artificial data experiments using the most significant combinations of the pruning rules.

In the artificial data set-up, the prototypes were obtained from 5- and 10-dimensional uniform distributions in the unit hypercube.

A first experiment was performed using prototype sets of increasing size, from 1,000 to 8,000 prototypes in steps of 1,000, for 5- and 10-dimensional data. Each experiment measures the average distance computations of 16,000 searches (1,000 searches over 16 different prototype sets). The samples were obtained from the same distribution. Figures 2 and 3 show the results for some combinations of the pruning rules, where f, s, g and t stand for the Fukunaga, sibling, generalized and table pruning rules respectively. The standard deviation of the measures is also included (though its value is almost negligible).
As can be observed, the table pruning rule, when applied alone, can achieve a ∼50% reduction in distance computations, although an additional reduction (up to ∼70%) can be achieved when it is combined with the f, fs or g pruning rules. In these three cases the differences are not noticeable. Obviously, as the time complexity of the generalized pruning rule is not constant, the combinations with f or fs are more appealing.

[Fig. 2. Pruning rule combinations in a uniform distribution, 5-dimensional space: average number of distance computations vs. number of prototypes (1,000 to 8,000), for the f, fs, g, t, ft, fst and gt combinations.]
The input test of the speller was simulated by distorting words by means of random insertion, deletion and substitution operations over the words in the original dictionary. The edit distance was used to compare the words. In these experiments, the costs of the weighted edit distance operations (insertion, deletion and substitution) were fixed to 1. This makes the edit distance a mathematical distance, which makes the table rule applicable. Please note that some fast NN search techniques (e.g. the kd-tree) could not be applied here, as the data could hardly be represented in a vector space.
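For reference, the unit-cost edit distance used above can be written as a standard dynamic-programming routine (this is the usual textbook implementation, not the authors' code):

```python
def edit_distance(a, b):
    """Levenshtein distance with unit insertion, deletion and
    substitution costs; with these costs it is symmetric and satisfies
    the triangle inequality, so metric pruning rules apply."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion of ca
                           cur[j - 1] + 1,            # insertion of cb
                           prev[j - 1] + (ca != cb))) # substitution / match
        prev = cur
    return prev[-1]
```

Note that with non-uniform operation costs the result need not be symmetric, which is why the unit-cost choice matters for the table rule.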
Dictionaries of increasing size (from 1,000 to 8,000 words) were obtained by randomly extracting words from the whole dictionary. The test points were 1,000 distorted words obtained from randomly selected dictionary words. To obtain reliable results, the experiments were repeated 16 times. The averages and the standard deviations are shown on the plots.

The experiments performed in Figures 2 and 3 for artificial data (average number of distance computations using increasing size prototype sets) were repeated in the spelling task. Results are shown in Figure 4.

The experiments show a reduction in the number of distance computations (around 40%) for the table rule when combined with the "f", "fs" or "g" pruning rules.
[Fig. 3. Pruning rule combinations in a uniform distribution, 10-dimensional space: average number of distance computations vs. number of prototypes (1,000 to 8,000), for the f, fs, g, t, ft, fst and gt combinations.]
6 Conclusions and further works

To summarize, a new pruning rule has been defined that can be applied in tree-based search algorithms. To apply the rule, a distance table should be computed and stored at preprocess time. This table rule stores the distances between each prototype in the training set and every node of the tree; its space complexity is therefore quadratic in the size of the training set.

As the experiments suggest, this rule saves the computation of 70% of the distances in the case of 10-dimensional data, and 40% in the case of strings, with training sets of around 8,000 points, when compared with the generalized rule.

In future works, a more exhaustive study of the rule will be performed. In particular, the idea is to study on the one hand which is the best combination of rules (with the least cost), and on the other hand, under what condition and in what order each rule should be applied.

Another problem that should be explored is how to reduce the space complexity of the table rule.
7 Acknowledgments

[Fig. 4. Pruning rules combined in a spelling task: average number of distance computations vs. dictionary size (1,000 to 8,000), for the f, fs, g, t, gt, fst and ft combinations.]
References

1. S. Brin. Near neighbor search in large metric spaces. In Proceedings of the 21st VLDB Conference, pages 574-584, 1995.
2. E. Chávez, G. Navarro, R. Baeza-Yates, and J. L. Marroquín. Searching in metric spaces. ACM Computing Surveys, 33(3):273-321, September 2001.
3. B. V. Dasarathy. Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, Los Alamitos, CA, 1991.
4. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, New York, 2000. 2nd edition.
5. J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3:209-226, 1977.
6. K. Fukunaga and P. M. Narendra. A branch and bound algorithm for computing k-nearest neighbors. IEEE Transactions on Computers, 24:750-753, 1975.
7. E. Gómez-Ballester, L. Micó, and J. Oncina. Some improvements in tree based nearest neighbour search algorithms. Lecture Notes in Computer Science, (2905):456-463, 2003.
8. E. Gómez-Ballester, L. Micó, and J. Oncina. Some approaches to improve tree-based nearest neighbour search algorithms. Pattern Recognition, 39(2):171-179, February 2006.
9. J. McNames. A fast nearest neighbor algorithm based on a principal axis tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(9):964-976, September 2001.