HAL Id: ujm-00165425
https://hal-ujm.archives-ouvertes.fr/ujm-00165425
Submitted on 9 Mar 2009
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
A Tabular Pruning Rule in Tree-Based Fast Nearest Neighbor Search Algorithms

Jose Oncina(1), Franck Thollard(2), Eva Gómez-Ballester(1), Luisa Micó(1), and Francisco Moreno-Seco(1)

(1) Dept. Lenguajes y Sistemas Informáticos, Universidad de Alicante, E-03071 Alicante, Spain
{oncina,eva,mico,paco}@dlsi.ua.es
(2) Laboratoire Hubert Curien (ex EURISE) - UMR CNRS 5516, 18 rue du Prof. Lauras, 42000 Saint-Étienne Cedex 2, France
thollard@univ-st-etienne.fr
Abstract. Some fast nearest neighbor search (NNS) algorithms using metric properties have appeared in the last years for reducing computational cost. Depending on the structure used to store the training set, different strategies to speed up the search have been defined. For instance, pruning rules avoid the search of some branches of a tree in a tree-based search algorithm. In this paper, we propose a new and simple pruning rule that can be used in most of the tree-based search algorithms. All the information needed by the rule can be stored in a table (at preprocessing time). Moreover, the rule can be computed in constant time. This approach is evaluated through real and artificial data experiments. In order to test its performance, the rule is compared to and combined with other previously defined rules.
1 Introduction
Nearest Neighbor Search (NNS) techniques aim at finding the nearest point of a set to a given test point using a distance function [4]. The naïve approach is sometimes a bottleneck due to the large number of distances to be computed. Many methods have been developed in order to avoid an exhaustive search (see [3] and [2] for a survey). Tree-based structures are very popular in most of the proposed algorithms [6,5,10,1,9], as this structure provides a simple way to avoid the exploration of some subsets of points. Among these methods, only some are suitable for general metric spaces, i.e., spaces where the objects (prototypes) need not be represented as points, and only a properly defined distance function is required. The most popular and referenced algorithm of such a type was proposed by Fukunaga and Narendra (FNA) [6]. This algorithm is very suitable for studying new tree building strategies and new pruning rules [7,8] as a previous step for extending the new ideas to other tree-based algorithms.
[…] metric). In a classical way, the FNA algorithm will serve as a baseline for the comparison with other techniques.
The paper is organized as follows: we first introduce the basic algorithm (Section 2). We then introduce the different pruning rules that were used in the experiments (Sections 3 and 4). We provide a comparative experiment on both artificial and real world data (Section 5). We finally conclude, suggesting some future works (Section 6).
2 The basic algorithm
The FNA is a fast search method that uses a binary tree structure. Each leaf stores a point of the search space. To each node t is associated S_t, the set of the points stored in the leaves of the subtree rooted at t. Each node stores M_t (the representative of S_t) and the radius of S_t,

R_t = max_{x ∈ S_t} d(M_t, x).

The tree is generally built using recursive calls to a clustering algorithm. In the original FNA the c-means algorithm was used. In [7] some other strategies were explored: in the best method, namely the Most Distant from the Father tree (MDF), the representative of the left node is the same as the representative of its father. Thus, each time an expansion of the node is necessary, only one new distance must be computed (instead of two), reducing the number of distances computed. As the pruning rules apply to any tree, in the following the tree will be built using the MDF method.

In Algorithm 1, a simplified version of FNA is presented; only the Prune_FNR function call must be changed when considering another pruning rule. In order to make the pseudo-code simpler, d_min and nn are considered global variables. Also, only binary trees with one point at the leaves are considered. The use of the Fukunaga and Narendra Rule (FNR) for pruning internal nodes is detailed in [6].
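The search scheme just described can be sketched in Python as below. This is a reconstruction for illustration only, not the authors' implementation; the node layout (fields M, R, left, right), the state dictionary holding the "global" d_min and nn, and all function names are our assumptions.

```python
import math

class Node:
    """Tree node: M is the representative of S_t, R its radius
    R_t = max_{y in S_t} d(M_t, y); a leaf holds a single point in M."""
    def __init__(self, M, R=0.0, left=None, right=None):
        self.M, self.R = M, R
        self.left, self.right = left, right

    def is_leaf(self):
        return self.left is None and self.right is None

def euclidean(a, b):
    return math.dist(a, b)

def prune_fnr(t, d_t, state):
    """Fukunaga-Narendra rule: S_t cannot contain a prototype nearer
    than nn when d(x, M_t) - R_t >= d(x, nn)."""
    return d_t - t.R >= state['d_min']

def search(t, x, d, state, prune):
    """Depth-first FNA-style search; `state` carries d_min and nn;
    prune(node, d(x, M_node), state) is the pluggable pruning rule."""
    if t.is_leaf():
        return
    left, right = t.left, t.right
    d_l, d_r = d(x, left.M), d(x, right.M)
    # update d_min and nn with the representatives just measured
    for child, dist in ((left, d_l), (right, d_r)):
        if dist < state['d_min']:
            state['d_min'], state['nn'] = dist, child.M
    # explore the closer child first
    order = [(left, d_l), (right, d_r)] if d_l < d_r else [(right, d_r), (left, d_l)]
    for child, dist in order:
        if not prune(child, dist, state):
            search(child, x, d, state, prune)
```

Note that with an MDF-built tree the left child shares its father's representative, so d(x, M_ℓ) is already known from the previous level; the sketch recomputes it only for clarity.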
When a new sample point x is given, its nearest neighbor nn is searched in the tree using a depth-first strategy. At a given level, the node t with the smaller distance d(x, M_t) is explored first. In order to avoid the exploration of some branches of the tree, the FNA uses the FNR rule.

3 A review of pruning rules

Fukunaga and Narendra Rule (FNR)
The pruning rule defined by Fukunaga and Narendra for internal nodes only makes use of the information in the node t to be pruned (with representative M_t and radius R_t) and of the hyperspherical volume centered in the sample point x with radius d(x, nn), where nn is the nearest prototype considered up to the moment: no y ∈ S_t can be the nearest neighbor to x if d(M_t, x) − R_t ≥ d(x, nn).

Algorithm 1: search(t, x)
Data: t: a tree node; x: a sample point;
Result: nn: the nearest neighbor prototype; d_min: the distance to nn;
if t is not a leaf then
    r = right_child(t); ℓ = left_child(t);
    d_r = d(x, M_r); d_ℓ = d(x, M_ℓ);
    update d_min and nn;
    if d_ℓ < d_r then
        if not Prune_FNR(ℓ) then search(ℓ, x);
        if not Prune_FNR(r) then search(r, x);
    else
        if not Prune_FNR(r) then search(r, x);
        if not Prune_FNR(ℓ) then search(ℓ, x);

The Sibling Based Rule (SBR)
Given two sibling nodes r and ℓ, this rule requires that each node r stores the distance between the representative of the node, M_r, and the nearest point, e_ℓ, in the sibling node ℓ (S_ℓ).

Rule: no y ∈ S_ℓ can be the nearest neighbor to x if

d(M_r, e_ℓ) > d(M_r, x) + d(x, nn).

Unlike the FNR, SBR can be applied to eliminate node ℓ without computing d(M_ℓ, x), avoiding some extra distance computations at search time.

Generalized rule (GR)
This rule is an iterated combination of the FNR and the SBR (see [8] for more details). Given a node ℓ, a set of prototypes {e_i} is defined in the following way:

G_1 = S_ℓ
e_i = argmax_{p ∈ G_i} d(p, M_ℓ)
G_{i+1} = {p ∈ G_i : d(p, M_r) < d(e_i, M_r)}

where M_r is the representative of the sibling node r, and the G_i are auxiliary sets of prototypes.

Rule: no y ∈ S_ℓ can be the nearest neighbor if there is an integer i such that:

d(M_r, e_i) ≥ d(M_r, x) + d(x, nn)    (1)
d(M_ℓ, e_{i+1}) ≤ d(M_ℓ, x) − d(x, nn)    (2)

The cases i = 0 and i = s are also included, not considering equation (1) or (2) respectively. Note that condition (1) is equivalent to the SBR rule when i = s, and condition (2) is equivalent to the FNR rule when i = 0.
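The chain e_1, …, e_s and the auxiliary sets G_i can be computed per node at preprocessing time. A minimal sketch follows; the function name gr_chain is ours, not from the paper, and points are assumed comparable under an arbitrary distance d.

```python
def gr_chain(S_l, M_l, M_r, d):
    """Compute the GR prototype chain e_1, e_2, ... for a node with
    point set S_l and representative M_l, whose sibling has
    representative M_r:
        G_1 = S_l
        e_i = argmax_{p in G_i} d(p, M_l)
        G_{i+1} = {p in G_i : d(p, M_r) < d(e_i, M_r)}
    Each step removes at least e_i from G, so the loop terminates."""
    chain = []
    G = list(S_l)
    while G:
        e = max(G, key=lambda p: d(p, M_l))
        chain.append(e)
        G = [p for p in G if d(p, M_r) < d(e, M_r)]
    return chain
```

Note that e_1 is simply the farthest point from M_ℓ, i.e. d(M_ℓ, e_1) = R_ℓ, which is why condition (2) with i = 0 reduces to the FNR.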
4 The table rule (TR)
This rule prunes by taking the current nearest neighbor as a reference. In order to do so, the distance from a prototype p to a set of prototypes S is defined as

d(p, S) = min_{y ∈ S} d(p, y).

At preprocess time, the distances from each prototype to each node set S_t in the tree are computed and stored in a table, allowing a constant-time pruning. Note that the size of this table grows with the square of the number of prototypes since, as the tree is binary, the number of nodes is two times the number of prototypes.

[Fig. 1. Application of the table rule: node t, sample point x, current nearest neighbor nn, and the precomputed distance d(nn, S_t).]

Figure 1 presents a graphical view of the table rule.
Proposition 1 (Table Rule). If 2 d(x, nn) < d(nn, S_t), then no prototype e_i in node t can be nearer to the test sample x than nn, i.e.,

∀ e_i ∈ S_t, d(x, e_i) ≥ d(x, nn).
Proof: Let e_i ∈ S_t. By the definition of the distance between a point and a node,

d(nn, S_t) ≤ d(e_i, nn).

Moreover, by the triangle inequality, we have:

d(e_i, nn) ≤ d(e_i, x) + d(x, nn).

Combining these inequalities, we have:

d(nn, S_t) ≤ d(e_i, nn) ≤ d(e_i, x) + d(x, nn) ⇒ d(e_i, x) ≥ d(nn, S_t) − d(x, nn).

Using the table rule condition, we finally have:

d(e_i, x) ≥ 2 d(x, nn) − d(x, nn) = d(x, nn),

which completes the proof.
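The preprocessing table and the constant-time test might be sketched as follows. The function names are ours, and the representation of node sets as a dictionary mapping node identifiers to point lists is an assumption; prototypes are assumed hashable.

```python
def build_table(prototypes, node_sets, d):
    """Precompute d(p, S_t) = min_{y in S_t} d(p, y) for every
    prototype p and every node t.  For a binary tree with one point
    per leaf the number of nodes is about twice the number of
    prototypes, so the table is quadratic in the training set size."""
    return {(p, t): min(d(p, y) for y in S_t)
            for p in prototypes
            for t, S_t in node_sets.items()}

def prune_table(table, t, nn, d_x_nn):
    """Table rule: node t cannot contain a prototype nearer to x than
    nn when 2*d(x, nn) < d(nn, S_t); a single O(1) table lookup."""
    return 2 * d_x_nn < table[(nn, t)]
```

The key property is that the search-time cost is one lookup and one comparison, regardless of |S_t|; all distance computations are paid once, at preprocessing time.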
5 Experiments

As seen in the proof of the correctness of the table rule, it is only required that d is a true distance. In particular, contrary to other techniques such as the well known kd-tree algorithm, a vector space is not needed in order to apply the table rule.

In order to evaluate the power of the table rule, the performance of the algorithm has been measured in real and artificial data experiments using the most significant combinations of the pruning rules.

In the artificial data set-up, the prototypes were obtained from 5- and 10-dimensional uniform distributions in the unit hypercube.

A first experiment was performed using prototype sets of increasing size, from 1,000 to 8,000 prototypes in steps of 1,000, for 5- and 10-dimensional data. Each experiment measures the average distance computations of 16,000 searches (1,000 searches over 16 different prototype sets). The samples were obtained from the same distribution. Figures 2 and 3 show the results for some combinations of the pruning rules, where f, s, g and t stand for the Fukunaga, sibling, generalized and table pruning rules respectively. The standard deviation of the measures is also included (though its value is almost negligible).
As can be observed, the table pruning rule, when applied alone, can achieve a ∼50% reduction in distance computations, although an additional reduction (up to ∼70%) can be achieved when it is combined with the f, fs or g pruning rules. In these three cases the differences are not noticeable. Obviously, as the time complexity of the generalized pruning rule is not constant, the combinations with f or fs are more appealing.

[Fig. 2. Pruning rule combinations in a uniform distribution, 5-dimensional space: average number of distance computations vs. number of prototypes (1,000 to 8,000), for the f, fs, g, t, ft, fst and gt combinations.]
The input test of the speller was simulated by distorting words by means of random insertion, deletion and substitution operations over the words in the original dictionary. The edit distance was used to compare the words. In these experiments, the costs of the weighted edit distance operations (insertion, deletion and substitution) were fixed to 1. This makes the edit distance a mathematical distance, which makes the table rule applicable. Please note that some fast NN search techniques (e.g. the kd-tree) could not be applied here, as the data could hardly be represented in a vector space.
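For reference, the unit-cost edit distance used above can be written as a standard dynamic-programming routine (this is the usual textbook implementation, not the authors' code):

```python
def edit_distance(a, b):
    """Levenshtein distance with unit insertion, deletion and
    substitution costs; with these costs it is symmetric and satisfies
    the triangle inequality, so metric pruning rules apply."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion of ca
                           cur[j - 1] + 1,            # insertion of cb
                           prev[j - 1] + (ca != cb))) # substitution / match
        prev = cur
    return prev[-1]
```

Note that with non-uniform operation costs the result need not be symmetric, which is why the unit-cost choice matters for the table rule.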
Dictionaries of increasing size (from 1,000 to 8,000 words) were obtained by randomly extracting words from the whole dictionary. The test points were 1,000 distorted words obtained from randomly selected dictionary words. To obtain reliable results, the experiments were repeated 16 times. The averages and the standard deviations are shown on the plots.

The experiments performed in Figures 2 and 3 for artificial data (average number of distance computations using increasing size prototype sets) were repeated in the spelling task. Results are shown in Figure 4.

The experiments show a reduction in the number of distance computations (around 40%) for the table rule when combined with the "f", "fs" or "g" pruning rules.
[Fig. 3. Pruning rule combinations in a uniform distribution, 10-dimensional space: average number of distance computations vs. number of prototypes (1,000 to 8,000), for the f, fs, g, t, ft, fst and gt combinations.]
6 Conclusions and further works

To summarize, a new pruning rule has been defined that can be applied in tree-based search algorithms. To apply the rule, a distance table should be computed and stored at preprocess time. This table rule stores the distances between each prototype in the training set and every node of the tree; its space complexity is therefore quadratic in the size of the training set.

As the experiments suggest, this rule saves the computation of 70% of the distances in the case of 10-dimensional data, and 40% in the case of strings, with training sets of around 8,000 points, when compared with the generalized rule.

In future works, a more exhaustive study of the rule will be performed. In particular, the idea is to study on the one hand which is the best combination of rules (with the least cost), and on the other hand, under what condition and in what order each rule should be applied.

Another problem that should be explored is how to reduce the space complexity of the table rule.
7 Acknowledgments

[Fig. 4. Pruning rules combined in a spelling task: average number of distance computations vs. dictionary size (1,000 to 8,000), for the f, fs, g, t, gt, fst and ft combinations.]
References

1. S. Brin. Near neighbor search in large metric spaces. In Proceedings of the 21st VLDB Conference, pages 574-584, 1995.
2. E. Chávez, G. Navarro, R. Baeza-Yates, and J. L. Marroquín. Searching in metric spaces. ACM Computing Surveys, 33(3):273-321, September 2001.
3. B. V. Dasarathy. Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, Los Alamitos, CA, 1991.
4. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, New York, 2000. 2nd edition.
5. J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3:209-226, 1977.
6. K. Fukunaga and P. M. Narendra. A branch and bound algorithm for computing k-nearest neighbors. IEEE Transactions on Computers, 24:750-753, 1975.
7. E. Gómez-Ballester, L. Micó, and J. Oncina. Some improvements in tree based nearest neighbour search algorithms. Lecture Notes in Computer Science, (2905):456-463, 2003.
8. E. Gómez-Ballester, L. Micó, and J. Oncina. Some approaches to improve tree-based nearest neighbour search algorithms. Pattern Recognition, 39(2):171-179, February 2006.
9. J. McNames. A fast nearest neighbor algorithm based on a principal axis tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(9):964-976, September 2001.