HAL Id: tel-00555064
https://tel.archives-ouvertes.fr/tel-00555064
Submitted on 12 Jan 2011
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Selection of Discriminative Regions and Local Descriptors for Generic Object Class Recognition
Gyuri Dorkó
To cite this version:
Gyuri Dorkó. Selection of Discriminative Regions and Local Descriptors for Generic Object Class Recognition. Human-Computer Interaction [cs.HC]. Institut National Polytechnique de Grenoble - INPG, 2006. English. ⟨tel-00555064⟩
N° assigned by the library
THESIS
submitted for the degree of
DOCTOR OF THE INPG
Speciality: Mathematics and Computer Science
prepared at the GRAVIR-IMAG laboratory, LEAR project,
within the Doctoral School "Mathématiques, Sciences et Technologie de l'Information"
presented and publicly defended
by
Gyuri Dorkó
on 9 June 2006
Selection of Discriminative Regions and Local Descriptors for Generic Object Class Recognition
Thesis advisor: Dr. Cordelia Schmid
JURY
Prof. Roger Mohr, President
Prof. Bernt Schiele, Reviewer
Prof. Andrew Zisserman, Reviewer
Dr. Cordelia Schmid, Thesis advisor
Dr. Tinne Tuytelaars, Examiner
Szüleimnek (To my parents)
SELECTION OF DISCRIMINATIVE REGIONS AND LOCAL DESCRIPTORS
FOR GENERIC OBJECT CLASS RECOGNITION
Gyuri Dorkó, Ph.D. dissertation
Institut National Polytechnique de Grenoble, 9 June 2006

Object category recognition is one of the most difficult problems in computer vision. It involves recognizing objects despite intra-class variations, viewpoint changes and background clutter. The goal of this thesis is to investigate robust invariant local image description and the selection of discriminative features. We show that class-discriminative scale-invariant features achieve excellent results for image-level categorization and object localization. We present solutions for two key problems: (i) we improve the quality of the image description based on a novel scale-invariant keypoint detection method and (ii) we integrate feature filtering techniques into our object models.

Our novel scale-invariant detector is based on the idea of a "maximally stable description", i.e., the descriptor should be stable even in the presence of minor variations of the detector. The technique performs scale selection based on a region descriptor, here SIFT, and chooses regions for which this descriptor is maximally stable, i.e., the difference between descriptors extracted for consecutive scales reaches a minimum. This scale selection technique is applied to multi-scale Harris and Laplacian points. Experimental results evaluate the performance of our detector and show that it outperforms existing ones in the context of image matching, category and texture classification, as well as object localization.

To construct object models based on discriminative features, we first cluster the scale-invariant descriptors and obtain a set of "visual words". We then estimate the discriminative information of these clusters based on different feature selection techniques, several of which are traditionally used in text retrieval. We discuss their properties (feature frequency, discriminative power, and redundancy) and analyze their performance in the context of image classification and object localization. We show that each task has different requirements, and indicate which selection techniques are the most appropriate. Experimental results for recognition on challenging large datasets demonstrate the performance of the approach.
SELECTION OF DISCRIMINATIVE REGIONS AND LOCAL DESCRIPTORS
FOR GENERIC OBJECT CLASS RECOGNITION
Gyuri Dorkó
Institut National Polytechnique de Grenoble, 9 June 2006

Object categorization is one of the most difficult problems in computer vision. The goal is to recognize visual objects despite intra-class variations, viewpoint changes, and strong background clutter. The objective of this thesis is to investigate a local image descriptor and a method for selecting discriminative features. We show that scale-invariant discriminative descriptors give excellent results for categorization and object localization. Solutions are provided for the following two fundamental problems: (i) we improve the quality of the image description with a new scale-invariant interest point detector, and (ii) we integrate descriptor filtering techniques into our object models.

Our new scale-invariant detector is based on the idea of a "maximally stable region", i.e., the position of the interest point is stable even in the presence of minor variations of the detector. The method selects a scale based on a local descriptor (SIFT in our case) and chooses the regions for which the stability of the descriptor is maximal, i.e., for which the difference between the descriptors at two consecutive scales reaches a minimum. This scale selection technique is applied to the multi-scale Harris detector and to Laplacian points. Experimental results evaluate the performance of our detector and show that it improves the results for image matching, object and texture classification, and object localization.

In order to build object models based on discriminative features, the scale-invariant descriptors are grouped into clusters, yielding a set of "visual words". We then estimate the discriminative information contained in these clusters using different discriminative selection techniques, several of which are traditionally used in text information retrieval. We discuss their properties (frequency, discriminative power, and redundancy) and analyze their performance in the context of object classification and localization. We show that each task has its own particularities and indicate which selection technique is the most appropriate. Experimental results for object recognition on difficult datasets demonstrate the good performance of the proposed methodology.
I would like to thank all people that have contributed to the completion of this thesis. My sincerest thanks go to my advisor Cordelia Schmid for her guidance, many suggestions, original ideas, feedback, and helpful criticism throughout this thesis. I am grateful to Prof. Bernt Schiele and Prof. Andrew Zisserman for their interest in my work, for being the reporters of this dissertation, and also to Prof. Roger Mohr and Tinne Tuytelaars for being the co-examiners at my defense.

I would also like to thank Bill Triggs, Frédéric Jurie, and my friend Guillaume Bouchard for their many useful comments and discussions that helped me understand the sometimes difficult corners of computer vision and machine learning.

I am grateful to my fellow researchers from the LEAR group, Eric Nowak, Navneet Dalal, Ankur Agarwal, Jianguo Zhang, Diane Larlus-Larrondo, Peter Carbonetto, Caroline Pantofaru, Marcin Marszałek, and Joost Van de Weijer, for their support, and for making INRIA a fun and motivating place to work.

I would like to acknowledge the support of the European project LAVA (IST-2001-34405), including all the partners, and the European PASCAL network of excellence.

I am thankful to all researchers who have contributed by making their code available to help my research, especially Prof. David Lowe, Krystian Mikolajczyk, Michael Sdika, and Matthijs Douze. I am also grateful to Barbara Caputo, Prof. Dietrich Paulus, Prof. Laszló Csink, Laszló Kutor, and Mária Dudás, without whom I would not have started my PhD.

My special thanks go to my friends Stan, Marlen, Carla, and Bram, for the many joyful moments in Grenoble, as well as to my Hungarian friends Kriszta, Andi, and Gábor, who have not forgotten about me even though I am so far from home.

Last, but not least, I would like to thank my family for their love, emotional support, and encouragement. Without them, I would not have made it.
1 Introduction
1.1 Context
1.2 Our Approach
1.3 Contributions
1.4 Applications
1.5 Overview
2 Local Image Representation
2.1 Background
2.1.1 Interest Point Detectors
2.1.2 Local Description: Scale-Invariant Feature Transform
2.2 Scale Selection by Maximally Stable Local Description
2.3 Evaluation for image matching
2.3.1 Viewpoint Changes
2.3.2 Changes in Illumination
2.3.3 Overall Performance
2.4 Evaluation for image categorization
2.5 Implementation Details
2.6 Conclusions
3 Discriminative Feature Selection for Object Class Appearance
3.1 Probabilistic Interpretation
3.2 Feature Scoring Techniques
3.3 Selection for Local Features
3.3.1 Visual Words
3.3.2 Retrieving Object Features
3.4 Discussion
4.1.1 Classifier for Objects Presence
4.1.2 Experimental Set-Up
4.1.3 Experiments: Image classification
4.2 Object Localization with Discriminative Features
4.2.1 The Localization Approach
4.2.2 Evaluation of Different Parameters
4.2.3 Additional Results: PASCAL Challenge, Butterflies
4.3 Implementation Details
4.4 Discussion
5 Conclusion and Future Work
Appendix: Influence of the number of interest points
1 Introduction
Object recognition is a challenge that computer vision researchers, psychologists and researchers from other fields have been trying to understand for more than 40 years. After many years of research, artificial vision is still far behind human vision. People are able to see, to recognize, and to categorize objects in the world. However, for computers this is not an easy task. The ability, for example, to see a chair from all different viewpoints and to understand and know that it is the same chair is extremely complicated. The 2-D appearance of the same object can be very different when the viewpoint changes. Furthermore, due to our generalization capability, people are capable of finding a chair, even if they have not seen that particular instance before. Creating categories, finding shared properties, and generalizing appearance are challenging tasks for computers, mainly due to a potentially high intra-class variance across object instances.
1.1 Context
While object recognition is a large field, in this thesis we focus on visual object class categorization and localization. Figure 1.1 illustrates some of the difficulties of recognizing object categories. Intra-class variation among instances of a class is only one of the challenges: object parts can have different geometrical structure or color, or can be completely missing. In Figure 1.1 bicycles (a) and (e) are different in color, while bicycle (b) has different geometrical proportions. Many applications require objects to be found in a predefined pose and orientation, such as recognizing profiles of faces, or side-views of cars. Others, like the bicycle example, are less restricted and therefore more difficult: bicycles (d) and (e) are viewed from different viewpoints, and (a) and (b) are imaged at different scales (magnification). Robustness to occlusions and missing parts is usually an additional requirement for state-of-the-art applications; e.g., bicycle (a) has a missing (covered) seat. Occlusions may be caused by the environment, or even by the object itself: the spokes of the first tire are occluded on (d). Everyday objects, such as bicycles, often appear together with other objects or on cluttered background. This additional data, so-called context, can distract our system and in general needs to be discarded. Note that it can also help to recognize the object class. An example is a traffic control system detecting cars. In such a system the recognition of roads is probably useless because they occur in all images. However, the shadow of the car (on the road) is probably a useful discovery.

Figure 1.1: Five different bicycles illustrate the challenge for object class recognition. Different viewpoints, occlusion, noise, and cluttered background make it hard to recognize the objects. Intra-class variation (shape and color) across the different bicycles [...]
Figure 1.2: Examples of wildcats.
Figure 1.3: Examples of butterflies.
1.2 Our Approach
Instances of an object category often share some visual appearance, and our main goal is to find these common features. The examples in Figure 1.2 and Figure 1.3 show two different object categories. The selection of common discriminative object parts is relatively easy, because almost any set of features (of adequate size) separates wildcats from butterflies. However, if Figure 1.2 itself is defined to contain two [...] harder to find. Furthermore, if we assume that examples in Figure 1.3 are from two categories, then butterfly experts would immediately notice that (a) and (b) are black swallowtails, while (c)-(f) are monarchs. Those who have less experience with insects would probably say that (a)-(d) are open while (e)-(f) are closed butterflies. So we see that common features are not always discriminative, and that the useful features differ according to the task. To discover discriminative object parts we use
• local or semi-local representations of images to describe object parts,
• a way to measure their usefulness, and select discriminative features.
Sparse local representations are typically computed on a set of interest point locations. Their aim is to describe the regions by keeping distinctive information, and at the same time providing robustness to small translations and noise. Local representations of images offer a solution to deal with occlusion and cluttered background: individual descriptors only store information of the local content, and therefore they are not distracted by other parts of the image. The influential work of Schmid and Mohr (1997) is the first that uses interest points for content based object recognition. Interest points are automatically detected image locations, such as corners or centers of blobs. They allow a sparse local representation of images to be created by selecting regions which keep distinctive information and at the same time provide robustness to small translations and noise. In the last few years these points became invariant to various image transformations, like changes in viewpoint and scale. At the time of writing at least a dozen of these detectors exist, all selecting regions by different criteria. The combination of interest point detectors and local descriptors allows a sparse and robust representation of objects, scenes, or textures. Rotated objects, scenes from different viewpoints or with illumination changes are challenges that can be solved already at representation level, i.e., there is no need to learn those by examples.
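As a rough illustration of how corner-type interest points can be detected, the sketch below computes the Harris corner measure (determinant minus a weighted squared trace of the smoothed gradient scatter matrix) on a grayscale image. It is a minimal sketch under stated assumptions and not the implementation used in this thesis: the function names, the two smoothing scales `sigma_d` and `sigma_i`, and the constant `alpha = 0.04` are illustrative choices, and Gaussian smoothing is written with plain NumPy to keep the example self-contained.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at about 3 sigma, normalized to sum to 1."""
    radius = max(1, int(3.0 * sigma + 0.5))
    x = np.arange(-radius, radius + 1, dtype=float)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def smooth(image, sigma):
    """Separable Gaussian smoothing (zero padding at the image borders)."""
    kernel = gaussian_kernel(sigma)
    rows = np.apply_along_axis(np.convolve, 1, image, kernel, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="same")

def harris_response(image, sigma_d=1.0, sigma_i=2.0, alpha=0.04):
    """Harris corner measure; sigma_d, sigma_i and alpha are assumed values."""
    smoothed = smooth(image.astype(float), sigma_d)
    iy, ix = np.gradient(smoothed)            # derivatives along rows, columns
    # Entries of the gradient scatter matrix, averaged over a neighborhood.
    sxx = sigma_d ** 2 * smooth(ix * ix, sigma_i)
    syy = sigma_d ** 2 * smooth(iy * iy, sigma_i)
    sxy = sigma_d ** 2 * smooth(ix * iy, sigma_i)
    det = sxx * syy - sxy ** 2
    trace = sxx + syy
    # Positive at corners, negative along straight edges, near zero when flat.
    return det - alpha * trace ** 2

# Usage: a bright square on a dark background has four corners.
image = np.zeros((40, 40))
image[10:30, 10:30] = 1.0
response = harris_response(image)
```

Interest points would then be taken as local maxima of `response` above a threshold; that step is omitted here for brevity.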
State of the art methods provide relatively good solutions for recognizing specific objects, such as a given bicycle or car, by matching local appearance. However, detection of object categories requires additional generalization capabilities to deal with intra-class variability. Discriminative feature selection methods can guide object recognition to find category-discriminative object parts and to discard unnecessary background features. These methods are recent tools in computer vision adopted from the text literature. Local representation of images and standard learning techniques, such as vector quantization, have built a bridge between computer vision and text recognition. Our images become visual documents and the quantized local descriptors become visual words. Owing to a huge availability of documents, the text community realized early the need for discriminative feature selection. For example, to index news directories or web pages, relevant information has to be selected to train classifiers to recognize different categories. In the last few years, the growing number of examples (Internet) directed researchers to improve classification efficiency and accuracy [...] these techniques to computer vision. In object category recognition, local representation and feature selection together help to develop high performance automatic tools for object and texture recognition, categorization and detection, for scene analysis, and for image indexing.
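The link between quantized descriptors and discriminative "visual words" can be sketched as follows. This is a minimal illustration under stated assumptions, not the method evaluated in this thesis: the helper names are hypothetical, a plain k-means stands in for the vector quantization step, and a smoothed likelihood ratio is used as just one possible way to score how strongly a word indicates the object class.

```python
import numpy as np

def kmeans(descriptors, k, iterations=20, seed=0):
    """Plain k-means; returns k cluster centers (the visual vocabulary)."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(descriptors), size=k, replace=False)
    centers = descriptors[start].astype(float)
    for _ in range(iterations):
        labels = assign_words(descriptors, centers)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members) > 0:          # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers

def assign_words(descriptors, centers):
    """Quantize each descriptor to the index of its nearest center."""
    diff = descriptors[:, None, :] - centers[None, :, :]
    return (diff ** 2).sum(axis=2).argmin(axis=1)

def likelihood_ratio_scores(words, is_object, k, eps=1.0):
    """Score each word by P(word | object) / P(word | background).

    eps is an assumed additive smoothing constant that avoids division by zero.
    """
    obj = np.bincount(words[is_object], minlength=k) + eps
    bg = np.bincount(words[~is_object], minlength=k) + eps
    return (obj / obj.sum()) / (bg / bg.sum())

# Usage with toy 2-D "descriptors": object features near (5, 5),
# background features near (0, 0).
rng = np.random.default_rng(1)
descriptors = np.vstack([rng.normal(5.0, 0.1, size=(50, 2)),
                         rng.normal(0.0, 0.1, size=(50, 2))])
is_object = np.arange(100) < 50
centers = kmeans(descriptors, k=2)
words = assign_words(descriptors, centers)
scores = likelihood_ratio_scores(words, is_object, k=2)
```

Words with a score well above 1 occur mostly on the object, and ranking words by such a score is one simple way to keep only the discriminative part of the vocabulary.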
1.3 Contributions
In this thesis we discuss and offer solutions for recent problems of image representation and object detection. The key contributions are the following:

Interest Point Detection by Maximally Stable Local Image Representation
Many interest point detectors and local descriptors have been developed during the last few years. Their quality depends on the task. For example, some perform well for image matching while others are better for object recognition. Their behavior can be explained by the different ways they select image regions and incorporate various feature properties. As an example, image classification or image retrieval may only match the local regions purely by appearance, i.e., ignoring their scales, locations, and spatial organization. For other applications, such as image matching or camera calibration, these properties are very important, and many times their estimation is unstable or noisy. Consequently, the quality of interest point detectors is not straightforward to measure, since different methods should be used depending on the context. Our experience has shown that one of the weakest properties of scale-invariant detectors is the scale estimation. This thesis proposes a novel method to determine (select) the characteristic scales for interest point detectors. Our idea is to use an appropriately chosen descriptor to select regions for which this descriptor is maximally stable. Experimental results show that our new criterion improves performance for image matching in challenging environments, such as variation in illumination conditions. Due to a more stable appearance-based representation, texture categorization on popular sets shows 3 to 10% improvement with the new detectors.

Feature Selection for Local Descriptors
In this thesis we adapt and compare several techniques from the text literature, most of which are new in vision. We analyze several feature properties including feature frequency, i.e., how often a feature appears, discriminative power to separate object from background, and redundancy. Different trade-offs between properties are pointed out, and selection methods are distinguished (grouped) accordingly. By the correct combination of these properties, i.e., by choosing the selection method wisely for a given task, we show how to achieve good recognition performance with many or just a sparse set of features. Our experiments evaluate class-discriminative feature selection [...]

Improved Object Class Recognition via Feature Ranking and Selection
We have chosen object category classification and localization to demonstrate the performance of discriminative feature selection. A simple classification framework demonstrates that discovering discriminative features can directly be used for object recognition. Selection methods on different types of features are compared and discussed for three different tasks: Object feature retrieval tries to recall features providing the best object coverage, while keeping the background featureless or very sparse. Appearance-based object classification uses discriminative features to decide about the presence of an object class in images. Object class localization aims to determine the exact position of unseen object instances in test images. For localization we extend an existing state-of-the-art method by incorporating feature ranks. This leads to a faster system with improved performance. We additionally extend the framework for rotation invariant training and detection.
1.4 Applications
Advances such as discriminative feature selection and scale-invariant local representations, discussed in this thesis, help to analyze and improve state-of-the-art image representation and object recognition techniques. In the following we list a few examples among a wide range of possible applications.

Surveillance and Security
One of the most useful applications of object recognition is surveillance. Recent security systems based on photography or CCTV (Closed Circuit Television) use computer vision to match digital images taken from cameras with images stored in a database. Discriminative feature selection may help to determine an important subset of features in advance, and therefore increase the system quality and performance.

Manufacturing Processes and Quality Control
Improved feature extraction and local description of images can help industrial applications to support manufacturing processes. Many quality control methods employ computer vision. They are based on statistical analysis of detected features, and aim to reduce the amount of faulty products, in order to meet customer requirements.

Autonomous Vehicles
Even though autonomous driving cars are not yet available for the market, manufacturers have already demonstrated preliminary prototypes and driving systems. Learning and rapid discovery of useful features, such as parts of other cars or obstacles, can [...] were used for surveillance, and nowadays, almost all major militaries have them. They are also used to monitor traffic and to detect certain events, such as forest fires. Robust local image representation and a focus of attention mechanism (feature selection) help those vehicles with better motion planning, navigation, scene analysis (to detect where the vehicle is), or improved SLAM techniques¹.

Web Search and Content Based Image Retrieval
Did you know that the verb google² has been added to the New Oxford American Dictionary? Internet search engines have become a part of our everyday life. Researchers from the text domain have implemented discriminative feature selection so successfully that search engines generate around 85% of the total web traffic. Now it is our turn to index images. Many recent search engines, such as Google, MSN, Lycos, Yahoo, Altavista, and A9 support search for images. However, their algorithm is based on purely textual information, such as filenames, image meta-data, and surrounding HTML content. While many times this is sufficient, indexing by image content would improve current performance, as well as open new possibilities:
• visual similarity between images helps to reject incorrect matches, and increase the recall by discovering new correct ones,
• queries can be based on images instead of text; e.g., we can look for a certain car by its picture, or find our copyright protected images and identify fraud,
• given an image or images of someone or something, e.g., a famous building or an actress, we can recover its identity, such as its place and name,
• mixed text and image queries can provide a richer way of looking for information.
In order to efficiently index and rank images, the correct features have to be generated and selected. Discriminative feature selection may help to develop domain specific search engines, as well as to find the most informative features in general.
Video Indexing
Digital videos are now available not only for professionals but also for everyday people. DVD players and recorders, recent digital cameras, and high speed Internet connections made indexing for videos as important as for images. Videos can be seen as a sequence of images, and therefore many techniques from images can be applied without major modification. However, adding temporal information to the feature space opens new perspectives, such as searching for certain actions. Presently only preliminary versions of video web search are available on major sites (Google, Yahoo, Altavista, A9) and, similarly to images, their indices are built on textual information only. Discriminative feature selection could help to build domain specific search, e.g., looking for the appearance of an actor in a movie, or to determine the difference between actions. Scene analysis can guide professionals when editing movies, or can identify viewers' preferences (e.g., improve TiVo suggestions).

¹ In Simultaneous Localization And Mapping (SLAM), the quality of the iteratively built map can be refined and therefore improved by matching discriminative local features over time.
² goo·gle |ˈgoōgəl| (also Goo·gle) verb, informal. [intrans.] use an Internet search engine, particularly Google.com: she spent the afternoon googling aimlessly. [trans.] search for the name of (someone) on the Internet to find out information about them: you meet someone, swap numbers, fix a date, then Google them through 1,346,966,000 Web
1.5 Overview
The manuscript is organized as follows. Chapter 2 introduces a sparse local image representation with interest point detectors and local descriptors. In Section 2.2 we describe our new scale selection method. Evaluation and comparison with existing techniques are carried out for image matching (Section 2.3), object and texture classification (Section 2.4 and Section 4.1.3), and object localization (Section 4.2.2).

Chapter 3 introduces different selection and ranking techniques. In Section 3.3 we build the link between image representation and features by creating visual words, and experimentally compare the introduced selection techniques for object feature retrieval. Chapter 4 integrates feature selection into a framework for object recognition. First we show an application to recognize the presence or absence of objects in images (image classification), and compare the results of different features and selection methods. In Section 4.2 we show how to improve object localization by class-discriminative feature ranking.
2 Local Image Representation
Scale Selection via Maximally Stable Local Description

Local photometric descriptors computed at keypoints have demonstrated excellent results in many vision applications, including object recognition (Fergus et al., 2003; Opelt et al., 2004), image matching (Schaffalitzky and Zisserman, 2002), and sparse texture representation (Lazebnik et al., 2003). Recent work has concentrated on making these descriptors invariant to image transformations. This requires constructing invariant image regions which are then used as support regions to compute invariant descriptors. In most cases a detected region is described by an independently chosen descriptor. It would, however, be advantageous to use a description adapted to the region. For example, for blob-like detectors which extract regions surrounded by edges, a natural choice would be a descriptor based on those edges. However, those adapted representations may not provide enough discriminative information for the region, and consequently, a general purpose descriptor (e.g. wavelets, shape-context, SIFT, etc.) might be a better choice. Many times this leads to better performance, yet less stable representations: small changes in scale or location can alter the descriptors significantly. Our experiments have shown that the most sensitive component of keypoint-based scale-invariant detectors is the scale selection. This motivated us to develop a novel detector which uses the descriptor chosen for the given task to select the characteristic scales. Our feature detection approach consists of two steps. We first apply an interest point detector on multiple scales to determine informative and repeatable locations. For each position we then apply a scale selection algorithm to identify maximally stable representations, i.e., a scale for which a local descriptor is the most stable. The local description can be any measure that can be computed on a pixel neighborhood, such as color histograms, steerable filters and wavelets. For our experiments we chose the Scale-Invariant Feature Transform (SIFT) (Lowe, 2004), which has shown excellent performance for object representation and image matching (Mikolajczyk and Schmid, 2004a).
Our new method for scale-invariant keypoint detection and image representation has the following properties:
• Our scale selection method guarantees more stable descriptors than state-of-the-art techniques by explicitly using descriptors during keypoint detection. The stability criterion is developed to minimize the variation of the descriptor for a small change in scale.
• Repeatable locations are provided by interest point detectors (e.g. Harris), and therefore they have rich and salient neighborhoods. This consequently helps to choose repeatable and characteristic scales. We verify this experimentally, and show that our selection competes favorably with the best available detectors.
• The detector takes advantage of the properties of the local descriptor. This can include invariance to illumination or rotation as well as robustness to noise. Our experiments show that the local invariant image representation extracted by our algorithm leads to significant improvement for object and texture recognition.
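The scale selection step described above (compute a descriptor at several scales for a fixed location, then keep the scale at which the descriptor changes least between consecutive scales) can be sketched as follows. This is a minimal sketch under stated assumptions and not the thesis implementation: the descriptor is a simple gradient orientation histogram used as a stand-in for SIFT, the support window of three times the scale is an assumed choice, the Euclidean distance is an assumed dissimilarity measure, and the global minimum is taken where a full detector might keep several local minima.

```python
import numpy as np

def orientation_histogram(image, x, y, scale, bins=8, support=3.0):
    """Stand-in descriptor (NOT SIFT): normalized histogram of gradient
    orientations in a square window whose half-width grows with the scale."""
    r = max(2, int(round(support * scale)))
    patch = image[max(0, y - r): y + r + 1, max(0, x - r): x + r + 1]
    gy, gx = np.gradient(patch.astype(float))
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx)                       # values in [-pi, pi]
    hist, _ = np.histogram(angle, bins=bins, range=(-np.pi, np.pi),
                           weights=magnitude)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

def select_stable_scale(image, x, y, scales, describe=orientation_histogram):
    """Return the scale with the smallest descriptor change to the next scale,
    together with the list of changes between consecutive scales."""
    descriptors = [describe(image, x, y, s) for s in scales]
    changes = [float(np.linalg.norm(descriptors[i + 1] - descriptors[i]))
               for i in range(len(scales) - 1)]
    best = int(np.argmin(changes))                   # most stable transition
    return scales[best], changes

# Usage on a synthetic image at one (hypothetical) interest point location.
rng = np.random.default_rng(0)
image = rng.random((64, 64))
scales = [1.0, 1.4, 2.0, 2.8, 4.0]
chosen_scale, changes = select_stable_scale(image, x=32, y=32, scales=scales)
```

Passing a different `describe` function swaps the descriptor, which mirrors the idea that the detector can use whichever descriptor is chosen for the task.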
Related Work
For selecting local invariant regions, many different scale- and affine-invariant detectors exist in the literature. Harris-Laplace (Mikolajczyk and Schmid, 2004b) detects multi-scale keypoint locations with the Harris detector (Harris and Stephens, 1988) and the characteristic scales are then determined by the Laplacian operator. Locations based on Harris points are very accurate. However, scale estimation is often unstable on corner-like structures, because it depends on the exact corner location, i.e., shifts by one pixel may modify the selected scale significantly. The scale-invariant Laplacian detector (Lindeberg and Garding, 1994) (LoG) selects the extremal values in location-scale space. The Difference of Gaussian (DoG) detector developed by Lowe (2004) approximates the Laplacian, and therefore it similarly selects scale-space maxima to find blob-like structures. Blobs are well localized structures, but due to their homogeneity, the information content is often poor in the center of the region. Triggs' detector (Triggs, 2004) extends the Förstner-Harris approach to general motion models and robust template matching by finding regions which can be accurately self-matched under various similarity or affine transformations. This detector extracts fewer but very stable keypoints. For instance, the rotation invariant detection rejects point-like structures, since they cannot be well-localized (self-matched) under image rotation, i.e., they have no characteristic orientation. The method of Kadir et al. (2004) extracts circular or elliptical regions in the image as maxima of the entropy scale-space of region histograms. This is also a blob detector, but has been shown to provide a more robust appearance based representation for some object categories (Kadir et al., 2004). Mikolajczyk et al. (2005b) showed that it performs poorly for image matching, which might be due to the sparsity of its scale quantization. Presumably performance issues prohibit a more extensive search in scale-space. The Intensity-Based Region detector (Tuytelaars and Van Gool, 2004) selects multi-scale locations at extremal in[...] nearby intensity changes. The edge-based region detector (Tuytelaars and Van Gool, 2004) finds quadrangular segments with a corner detected by the multi-scale Harris operator and sides determined by nearby edges. The object-part detector of Jurie et al. (Jurie and Schmid, 2004) selects circular regions with the most salient convex arrangement of local edges extracted by the Canny-Deriche operator. Since the detected regions are surrounded by edges, they proposed a local image representation based on this structure. These descriptors are, however, not as discriminative as other available representations, since they only encode information of the surrounding edges. Due to the homogeneity of the selected regions the detector suffers from the same problems as other blob-like methods. The Maximally Stable Extremal Regions (MSER) detector (Matas et al., 2002) defines extremal regions as image segments where each inner-pixel intensity value is less/greater than a certain threshold $t$, and all intensities around the boundary are greater/less than the same $t$. An extremal region is maximally stable when the area (or the boundary length) of the segment changes the least with respect to $t$. This detector works particularly well on images with well defined edges, but is less robust to noise and not adapted to texture-like structures. It usually selects relatively few regions.
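The MSER stability criterion mentioned above is often written in the following form. The notation here is an assumption made for illustration and is not taken from this chapter: $Q_t$ denotes one extremal region at threshold $t$ within a nested sequence $Q_{t-\Delta} \subseteq Q_t \subseteq Q_{t+\Delta}$, $|\cdot|$ is the area in pixels, and $\Delta$ is a free parameter of the detector.

```latex
% Relative area change of the nested extremal region around threshold t
q(t) = \frac{\left| Q_{t+\Delta} \setminus Q_{t-\Delta} \right|}{\left| Q_t \right|},
\qquad
Q_{t^{*}} \text{ is maximally stable if } q \text{ has a local minimum at } t^{*}.
```

A small value of $q(t)$ means that the region barely grows when the threshold is varied by $\pm\Delta$, which is the sense in which its area "changes the least with respect to $t$".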
Viewpoint invariance is sometimes required to achieve reliable image matching, object or texture recognition. Affine-invariant detectors (Kadir et al., 2004; Matas et al., 2002; Mikolajczyk and Schmid, 2004b; Tuytelaars and Van Gool, 2004) explicitly estimate the affine shape of the regions to allow pre-normalization of the patch prior to the descriptor computation. The affine extension of Harris-Laplace (Mikolajczyk and Schmid, 2004b) is similar to the one first used by Lindeberg and Garding (1997) for shape-from-texture. It applies the affine kernel only to fixed points to reduce the complexity of the entire affine-space. This is one of the most widely used approaches; Lazebnik et al. (2003) use a similar technique for the LoG detector to perform texture classification under affine transformations. Note, however, that their adaptation procedure is a post-processing step of the scale-invariant detection based on the scatter matrix of image gradients at keypoint locations.

Mikolajczyk et al. (2005b) evaluated several affine-invariant detectors. MSER (Matas et al., 2002) performed best, closely followed by Hessian- and Harris-Laplace. Moreels and Perona (2005) also find that Harris- and Hessian-Laplace perform best for object recognition. Their study shows poor performance of the MSER detector for 3D environments. Mikolajczyk et al. (2005a) experimentally compared the performance of recently proposed detectors and descriptors for category recognition, and found Hessian-Laplace (Mikolajczyk and Schmid, 2004b) and the entropy detector (Kadir et al., 2004) to be the most suitable.
Overview
This chapter is organized as follows. In Section 2.1 we present the interest point
[...] new scale selection technique Maximally Stable Local SIFT Description and introduces
two new detectors, Harris-MSLSD and Laplacian-MSLSD. We then compare their
performance to Harris-Laplace and the Laplacian detectors. In Section 2.3 we evaluate
the performance for image matching using a publicly available framework. Section 2.4
reports results for object-category and texture classification. Finally, in Section 2.6 we
conclude.

Figure 2.1: Harris corner detection. (a) the original image, (b) the Harris image, (c)
the local maxima of the Harris image marked on the original image.
2.1 Background
This section provides a detailed description of the interest point detectors
of (Mikolajczyk and Schmid, 2004b; Lowe, 2004; Triggs, 2004; Lindeberg, 1998;
Matas et al., 2002), and the Scale-Invariant Feature Transform descriptor (Lowe,
2004). Our aim is not to cover the full theory of scale-invariant detectors and local
representation, but to provide sufficient background information for the techniques
that are used later in this chapter. Our experiments will compare our scale selection
to several existing techniques in the literature.
2.1.1 Interest Point Detectors
Harris Points: a corner detector
The scatter matrix (or second moment matrix) of local image gradients,
$\int \nabla I^T \nabla I \, dx$, is often used for feature detection, and it is given as

$$\mu(x, \sigma_I, \sigma_D) = \sigma_D^2 \, g(\sigma_I) * \begin{bmatrix} I_x^2(x, \sigma_D) & I_x I_y(x, \sigma_D) \\ I_x I_y(x, \sigma_D) & I_y^2(x, \sigma_D) \end{bmatrix}. \tag{2.1}$$

Image derivatives $I_x$ and $I_y$ are computed by convolution of Gaussian filters with scale
$\sigma_D$ (derivation scale), and locally averaged by Gaussian smoothing with scale $\sigma_I$
(integration scale). The eigenvalues of this matrix represent the two principal
curvatures of a point $x$. Corner-like structures can be extracted at points where both
of these curvatures are significant in orthogonal directions. The Harris detector
(Harris and Stephens, 1988) is based on this principle. The Harris cornerness
combines the determinant and trace of this matrix and is defined by

$$\det(\mu(x, \sigma_I, \sigma_D)) - \alpha \, \mathrm{trace}^2(\mu(x, \sigma_I, \sigma_D)). \tag{2.2}$$

The keypoints are determined as local maxima of this value. Figure 2.1 shows a
Harris image, i.e., the cornerness for each point, and the keypoints on an example
image. Schmid et al. (2000) show that the Harris detector is superior to other methods
(Cottier, 1994; Heitger et al., 1992; Horaud et al., 1990).

Figure 2.2: Extraction of multi-scale Harris points. (a) shows the multi-scale image
pyramid, (b) the computed Harris images at each scale, and (c) the image pyramid
with the multi-scale Harris points. (d) shows the detections projected back to the
original image. The radii of the circles correspond to the scale ($2\sigma$).
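As a minimal sketch of the Harris cornerness described above, the following Python code builds the entries of the second moment matrix with Gaussian derivative filters and evaluates the determinant-minus-trace measure. The function names, the default scales and the value of the constant alpha are assumptions made for illustration; they are not taken from the thesis.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter


def harris_cornerness(image, sigma_d=1.0, sigma_i=2.0, alpha=0.06):
    """Cornerness det(mu) - alpha * trace(mu)^2 at every pixel.

    sigma_d, sigma_i and alpha are illustrative defaults (assumptions).
    """
    image = np.asarray(image, dtype=float)
    # Gaussian derivatives at the derivation scale sigma_d.
    # order=(0, 1) differentiates along axis 1 (x), order=(1, 0) along axis 0 (y).
    ix = gaussian_filter(image, sigma_d, order=(0, 1))
    iy = gaussian_filter(image, sigma_d, order=(1, 0))
    # Entries of the second moment matrix: products of derivatives, averaged
    # with a Gaussian at the integration scale and normalized by sigma_d^2.
    norm = sigma_d ** 2
    mxx = norm * gaussian_filter(ix * ix, sigma_i)
    mxy = norm * gaussian_filter(ix * iy, sigma_i)
    myy = norm * gaussian_filter(iy * iy, sigma_i)
    det = mxx * myy - mxy ** 2
    trace = mxx + myy
    return det - alpha * trace ** 2


def local_maxima(response, threshold):
    """(row, col) positions that are 3x3 local maxima above a threshold."""
    peaks = (response == maximum_filter(response, size=3)) & (response > threshold)
    return np.argwhere(peaks)
```

On a synthetic image containing a bright square, this response is positive near the four corners and negative along the straight edges, where only one eigenvalue of the matrix is significant.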
Multi-Scale Interest Points
A multi-scale representation of images is crucial for many applications. A typical example
[...] are based on the Gaussian kernel. A multi-scale representation consists of a set of
images at different discrete levels of scale (Witkin, 1983). Koenderink (1984) showed that
scale-space satisfies the diffusion equation, for which the solution is a convolution with
a unique Gaussian kernel (Babaud et al., 1986; Lindeberg, 1990; Florack et al., 1992).
Images on coarse scales are obtained by smoothing images on finer scales with an
appropriate Gaussian kernel. An implementation can sample the coarser scale image by
the corresponding scale factor to accelerate the computation, and this representation
is often referred to as the scale-space image pyramid.
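The construction of coarser levels from finer ones can be sketched as follows. The code relies on the fact that smoothing with a Gaussian of scale s1 followed by one of scale s2 is equivalent to a single Gaussian of scale sqrt(s1^2 + s2^2). The function name, the initial scale and the number of levels are assumptions for illustration, and the sketch keeps every level at full resolution instead of subsampling.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def gaussian_scale_space(image, sigma0=1.6, multiplier=1.2, num_levels=8):
    """Return (sigmas, levels) of a Gaussian scale-space without subsampling.

    sigma0 and num_levels are illustrative assumptions.
    """
    image = np.asarray(image, dtype=float)
    sigmas = [sigma0 * multiplier ** k for k in range(num_levels)]
    current = gaussian_filter(image, sigmas[0])
    levels = [current]
    for prev, nxt in zip(sigmas, sigmas[1:]):
        # Incremental smoothing: only the missing amount of blur is added,
        # since Gaussian variances add under successive convolution.
        increment = np.sqrt(nxt ** 2 - prev ** 2)
        current = gaussian_filter(current, increment)
        levels.append(current)
    return sigmas, levels
```

A pyramid implementation would additionally resample each level by its scale factor, which reduces the cost of the coarse levels.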
When an interest point operator is applied on multiple scales we call the detections
multi-scale interest points. Even though they are called points, they can be interpreted
as regions (points and their neighborhood), as they are parameterized by a location
$x$ and a scale $\sigma$.¹ As for the Harris operator, Dufournaud et al. (2000) proposed
a scale-adaptive extension, where the points are detected at the local maxima of
the Harris images computed at different scales. Figure 2.2 illustrates the multi-scale
Harris interest points. Figure 2.2(a) shows the original image pyramid, and (b) the
corresponding Harris images. Figure 2.2(c) marks the detections, i.e., the maxima
of (b) on the original images (a), and finally in (d) we show all the detections with
circles corresponding to the detection scale. Note that, for illustration purposes, we
omit some scale levels from the pyramids (a), (b), and (c).
Scale-Invariant Interest Points
Instead of extracting interest points for every scale level, automatic scale-selection
techniques determine one or a few characteristic scales at each location. These
detections are called scale-invariant interest points because they mark the same points
($x$, $\sigma$) on images taken at different resolutions. There are two main advantages of
selecting scales. First, the number of interest points is reduced by intelligent rejection
of unnecessary scales, and second, the scale becomes a new characteristic property of
the detection. Many applications, such as the one in Section 4.2, rely on this property
to perform scale-invariant learning and recognition.
One of the first scale-invariant interest point detectors is the Laplacian-of-Gaussian
(LoG) developed by Lindeberg (1998). It is based on the Gaussian scale-space
(successive smoothing with Gaussian kernels), and it selects 3D local extrema of the
Laplacian filtered images. Detections are obtained on blob-like image structures.
Figure 2.3(b) shows an example detection of LoG. To demonstrate the multi-scale
behavior, i.e., LoG without scale selection, Figure 2.3(a) shows the local extrema of the
Laplacian on each scale. As before, the radii of the circles indicate the scale. We can
observe that while the LoG detector (Figure 2.3(b)) selects only blob-like features, the
2D LoG maxima (Figure 2.3(a)) also include detections near corners and edges.

¹ In several multi-scale detectors that are based on second moment matrix computations,
we distinguish between two scale parameters, the derivation scale ($\sigma_D$) and the
integration scale ($\sigma_I$) (cf. Section 2.1.1). Usually, a constant factor is used between
$\sigma_D$ and $\sigma_I$ to balance the size of the area used to calculate the statistics of local gradient

Figure 2.3: The LoG detector. (a) Multi-scale: all extrema of the 2D LoG function on
multiple scales. (b) Scale-invariant: LoG 3D maxima in location-scale space. Note that
for illustration purposes we omit some scales from (a).
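The selection of 3D extrema can be sketched as below. This is an illustration under stated assumptions, not the implementation used in the thesis: the Laplacian is multiplied by sigma^2 to make responses comparable across scales, extrema are taken on the absolute response, and the first and last scale levels are skipped because they lack a neighbor in scale. The function name is hypothetical.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace, maximum_filter


def log_scale_space_extrema(image, sigmas, threshold):
    """Return (row, col, sigma) for 3D local extrema of the normalized LoG."""
    image = np.asarray(image, dtype=float)
    # One scale-normalized Laplacian image per scale, stacked along axis 0.
    stack = np.stack([s ** 2 * gaussian_laplace(image, s) for s in sigmas])
    magnitude = np.abs(stack)
    # A detection must dominate its 3x3x3 neighborhood in (scale, row, col).
    peaks = (magnitude == maximum_filter(magnitude, size=3)) & (magnitude > threshold)
    # Assumption: border levels cannot be compared to both scale neighbors.
    peaks[0] = False
    peaks[-1] = False
    return [(int(r), int(c), sigmas[k]) for k, r, c in np.argwhere(peaks)]
```

For an isotropic Gaussian blob, the selected scale lies close to the standard deviation of the blob, which is the behavior expected from a blob detector with automatic scale selection.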
Mikolajczyk and Schmid (2001) evaluate different scale selection criteria for
scale-invariant image matching environments. Apart from the Laplacian they study the
squared image gradients, the Difference-of-Gaussians (Lowe, 2004) (the difference of
the Gaussian filter responses between two consecutive scales), and the Harris function
(2.2). Their evaluation shows that the Laplacian function selects the highest
percentage of correct characteristic scales, and as a result they introduce the scale-invariant
Harris-Laplace (H-Lap) detector, which combines the stable Harris detector with the
Laplacian scale selection. Unfortunately, their evaluation of scale selection functions
is carried out in general, i.e., for each pixel in the image. While it is a reasonable
assumption to transfer the results to Harris points, they did not verify the quality of
scale selection specifically on keypoint locations. Even though they did not search
for the Harris maxima in scale space, we find it interesting to investigate the Harris
scale selection on Harris points, and include the Harris-Harris (H-Har) detector in
our experiments.
Triggs (2004) generalizes the Förstner-Harris approach to general motion models
and offers a new characteristic scale selection technique. Including scale as a
(non-translational) motion parameter forces the detections to be accurately self-matched not
only in location but also in scale-space. Since this is a more generalized Harris detector,
we call it Harris-Gen (H-Gen) in our experiments. Notice the difference between
Harris-Harris and Harris-Gen. The former computes the 2D Harris images for stable locations
and chooses the maxima of cornerness in scale-space, while Harris-Gen optimizes the
space. In our experiments Harris-Gen is used with rotation stability enabled, so the
motion model actually includes 4 parameters² (location + scale + rotation). Example
detections for the various Harris-based detectors can be found in Figure 2.4. Figure 2.4(d)
also shows results of our scale selection approach introduced in Section 2.2.

Figure 2.4: Scale-invariant Harris points for (a) Harris-Laplace, (b) Harris-Harris,
(c) Harris-Gen and (d) Harris-MSLSD. The example shows the points with their
characteristic scales for each scale selection method. For illustration we omitted
detections with $\sigma < 2$.
Maximally Stable Extremal Regions (MSER) (Matas et al., 2002) directly optimizes
the region shape for stability. The algorithm determines a small subset of
all regions, the so-called extremal regions, where each inner-pixel intensity value is
less/greater than a certain threshold $t$, and all intensities around the boundary are
greater/less than $t$. Among these extremal regions they select the ones that are
the most stable in shape. Stability is measured by the change in region area (or
boundary length) with respect to $t$. The MSER detector has been shown to perform
well (Mikolajczyk and Schmid, 2004b) for matching scenes with significant viewpoint
changes.
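The stability criterion can be illustrated with a deliberately naive sketch. It follows a single dark region around a given seed pixel, measures its area for a list of thresholds, and picks the threshold at which the relative area change is smallest. This is only a toy illustration of the idea: the function names, the seed-based formulation and the parameter delta are assumptions, and the detector of Matas et al. (2002) enumerates all extremal regions far more efficiently.

```python
import numpy as np
from scipy.ndimage import label


def component_area(image, seed, t):
    """Area of the connected component of {image <= t} that contains seed."""
    labels, _ = label(image <= t)
    seed_label = labels[seed]
    if seed_label == 0:          # seed is brighter than t: no region yet
        return 0
    return int((labels == seed_label).sum())


def most_stable_threshold(image, seed, thresholds, delta=1):
    """Return (relative area change, threshold, area) of the stablest region.

    The change at thresholds[i] is measured between its neighbors at
    distance delta in the threshold list (delta is an assumed parameter).
    """
    areas = [component_area(image, seed, t) for t in thresholds]
    best = None
    for i in range(delta, len(thresholds) - delta):
        if areas[i] == 0:
            continue
        change = (areas[i + delta] - areas[i - delta]) / areas[i]
        # Strict comparison keeps the first (lowest) threshold among ties.
        if best is None or change < best[0]:
            best = (change, thresholds[i], areas[i])
    return best
```

For a dark square on a bright background the area stays constant over a wide range of thresholds, so the square is reported as maximally stable.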
² In our experiments we do not include other stability properties, e.g., affine
transformations, illumination, etc., into H-Gen; the detector is consistently used with the
same criteria. Note that we have tried to add other parameters, but the results were
always inferior to using location + scale + rotation.
Figure 2.5: The SIFT descriptor computed on a 4x4 grid of cells with 8-bin orientation
histograms.
2.1.2 Local Description: Scale-Invariant Feature Transform
Local image representations are typically a set of vectors computed on image patches
at various locations. Possible choices of image descriptors are raw image intensities,
color histograms (Swain and Ballard, 1991), wavelets (Grossmann and Morlet, 1984),
steerable filters (Freeman and Adelson, 1991), moment invariants (Van Gool et al.,
1996), differential invariants (Koenderink and van Doorn, 1987), complex
filters (Schaffalitzky and Zisserman, 2002), shape context (Belongie et al., 2002), spin
images (Lazebnik et al., 2003), scale-invariant feature transform (SIFT) (Lowe,
2004), and its variants (Ke and Sukthankar, 2004; Lazebnik et al., 2005;
Mikolajczyk and Schmid, 2004a). Mikolajczyk and Schmid (2004a) compared
some of these descriptors and showed that SIFT (Lowe, 2004) features perform
better than others. The evaluation of Moreels and Perona (2005) also found SIFT and
shape context to perform best for object recognition. Based on their results we always
use SIFT as a local image representation.
Figure 2.5 illustrates the computation of SIFT on an image patch centered on
keypoint locations ($x$) and using a window size related to its scale ($\sigma$). The patch is
divided by an $IS \times IS$ grid, where $IS$ is the index size, and is set to 4. For each cell
an $OS$-bin histogram of local orientations (weighted by the gradient magnitudes) is
computed ($OS = 8$), leading to a concatenated, $4 \cdot 4 \cdot 8 = 128$ dimensional real vector.
These parameters were suggested by Lowe (2004), and are fixed for our experiments.
For robust description, histograms are computed with a Gaussian weighting function
($\sigma$ = half window size) and a trilinear interpolation is used to distribute the value of
each gradient sample into adjacent histogram bins (each orientation falls into $2^3 = 8$
bins). The SIFT descriptor is normalized to unit length, providing invariance to scalar
changes in image contrast. Since the descriptor is based on gradients, it is also invariant
to additive constant changes in brightness. SIFT was originally proposed to be rotation
invariant, which is achieved by an efficient dominant gradient computation, which can
In practice, scale-invariant interest point detections are often followed by a
normalization to obtain a regular region before the computation of the descriptors.
This may include an elliptical or an irregular shape normalization to a unit square or a
rotation of patches to a pre-computed characteristic orientation. In our experiments we
also follow this principle; however, rotation invariance is only applied when indicated,
i.e., in general the SIFT descriptors are computed in a non-rotation-invariant way.
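To make the layout of the 128-dimensional vector concrete, the sketch below computes a simplified SIFT-like descriptor on an already normalized patch. It is an approximation under stated assumptions and not Lowe's implementation: each gradient sample votes into a single cell and a single orientation bin (no trilinear interpolation), and the function name is hypothetical.

```python
import numpy as np


def sift_like_descriptor(patch, grid=4, bins=8):
    """Simplified SIFT-like descriptor: grid x grid cells, `bins` orientations."""
    patch = np.asarray(patch, dtype=float)
    h, w = patch.shape
    gy, gx = np.gradient(patch)                   # derivatives along rows, columns
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx) % (2 * np.pi)
    # Gaussian weighting centered on the patch, sigma = half the window size.
    yy, xx = np.mgrid[0:h, 0:w]
    sigma = 0.5 * max(h, w)
    weight = np.exp(-((yy - (h - 1) / 2) ** 2 + (xx - (w - 1) / 2) ** 2)
                    / (2 * sigma ** 2))
    # Hard assignment of each sample to one cell and one orientation bin
    # (assumption: the trilinear interpolation of the original is omitted).
    cell_r = np.minimum(yy * grid // h, grid - 1)
    cell_c = np.minimum(xx * grid // w, grid - 1)
    bin_idx = np.minimum((orientation / (2 * np.pi) * bins).astype(int), bins - 1)
    histograms = np.zeros((grid, grid, bins))
    np.add.at(histograms, (cell_r, cell_c, bin_idx), magnitude * weight)
    descriptor = histograms.ravel()
    norm = np.linalg.norm(descriptor)
    # Unit-length normalization removes multiplicative contrast changes.
    return descriptor / norm if norm > 0 else descriptor
```

Because only gradients enter the histograms and the result is normalized, the output is unchanged when the patch intensities are multiplied by a positive constant or shifted by a constant offset.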
2.2 Scale Selection by Maximally Stable Local Description
In this section we propose a new method for selecting characteristic scales for keypoint
detectors and discuss the advantages and properties of the new approach. We address
two key features of interest point detectors: repeatability and description stability.
Repeatability determines how well the detector selects the same region under various
image transformations, and is important for image matching. In practice, due to
noise and object variations, the corresponding regions are never exactly the same
but their underlying descriptions are expected to be similar. This is what we call
the description stability, and it is important for image representation and appearance
based recognition.
The two properties, repeatability and descriptor stability, are in theory contradictory.
A homogeneous region provides the most stable description, whereas its shape
is in general not stable. On the other hand, if the region shape is stable, for example
using edges as region boundaries, small errors in localization will often cause significant
changes of the descriptor. Our solution is to apply the Maximally Stable Local
Description algorithm to interest point locations only. These points have repeatable
locations and informative neighborhoods. Our algorithm adjusts their scale parameters
to stabilize the descriptions and rejects locations where the required stability
cannot be achieved. The combination of repeatable location selection and descriptor
stabilized scale selection provides a balanced solution. In Section 2.3 we show that our
new methods provide comparable performance to Harris-Laplace and LoG for image
matching. Moreover, due to additional robustness (which is discussed later in this
section) they outperform their counterparts.
Scale-invariant MSLSD detectors
To select characteristic locations with high repeatability we first apply an interest
point detector at multiple scales. We chose two widely used complementary methods,
the Harris (Harris and Stephens, 1988) and the Laplacian (Blostein and Ahuja, 1989;
Lindeberg, 1998) detectors. The second step of our approach selects the characteristic
scales for each keypoint location. We use description stability as the criterion for scale
selection: the scale for each location is chosen such that the corresponding representation
(in our case SIFT (Lowe, 2004)) changes the least with respect to scale. Figure 2.6
[...] descriptors change as we increase the scale (the radius of the region) for the two
keypoints. To measure the difference between SIFT descriptions we use the Euclidean
distance as in (Lowe, 2004). The minima of the functions determine the scales where
the descriptions are the most stable; their corresponding regions are depicted by circles
in the image. Our algorithm selects the absolute minimum (shown as bright thick
circles) for each point, yet in cases of extreme scale changes we recommend choosing
all minima and discovering multiple sparse selections of scales per keypoint location.
Multi-scale points which correspond to the same image structure often have the same
absolute minimum, i.e., result in the same region. In this case only one of them is kept
in our implementation. To limit the number of selected regions an additional threshold
can be used to reject unstable keypoints, i.e., if the minimum change of description
is above a certain value the keypoint location is rejected. For each point we use a
percentage of the maximum change over scales at the point location, set to 50% in our
experiments.

Figure 2.6: Two examples of scale selection. The left and right graphs show the change
of the local description as a function of scale (from 1.6 to 45.3) for the left and right
points respectively. The scales for which the functions have local minima are shown in
the image. The bright thick circles correspond to the global minima.
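The selection rule for a single keypoint can be sketched as follows, assuming the descriptors have already been computed on a sequence of increasing scales. The function name is hypothetical, and attributing each descriptor change to the smaller of the two consecutive scales is an assumption made for illustration; the thesis does not specify this detail here.

```python
import numpy as np


def select_stable_scale(scales, descriptors, reject_ratio=0.5):
    """Pick the scale where consecutive descriptors change the least.

    scales:       increasing scale values, one per descriptor
    descriptors:  array of shape (len(scales), dim), e.g. SIFT vectors
    reject_ratio: the keypoint is rejected when the minimum change exceeds
                  this fraction of the maximum change over scales
    Returns the selected scale, or None if the keypoint is rejected.
    """
    d = np.asarray(descriptors, dtype=float)
    # Euclidean distance between descriptors of consecutive scales.
    change = np.linalg.norm(d[1:] - d[:-1], axis=1)
    k = int(np.argmin(change))                    # absolute minimum
    if change[k] > reject_ratio * change.max():
        return None                               # description never stabilizes
    # Assumption: change[k] is attributed to scales[k], the finer of the pair.
    return scales[k]
```

A variant that returns all local minima of `change` instead of the absolute one would correspond to the multiple-scale selection recommended above for extreme scale changes.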
Our algorithm is referred to in the following as Maximally Stable Local SIFT Description
[...] and L for Laplacian, i.e., H-MSLSD and L-MSLSD.
Illumination and Rotation Invariance
Our new detectors are robust to illumination changes, as our scale selection is based on
the SIFT descriptor. Recall that the SIFT descriptor is invariant to affine illumination
changes.
Many applications require representations that are invariant to similarity
transformations including rotation. This is either achieved by a rotation invariant
descriptor (Lazebnik et al., 2003), or, as we discussed when we introduced SIFT, by the
extraction of a dominant orientation. In the case of SIFT, if detected keypoints have
poorly defined orientations, the resulting descriptions may become unstable and noisy.
(This is not the case if the detected regions have a centered circular texture or they are
completely homogeneous.) In our algorithm, we orient the patch in the dominant
direction prior to the descriptor computation for each scale. Maximal description stability
is then found for locations with well-defined local gradients. In our experiments a -R
suffix indicates rotation invariance. Experimental results in Section 2.4 show that our
integrated estimation of the dominant orientation can significantly improve results, in
contrast to other detectors lacking this type of stability.
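A dominant orientation can be estimated from a magnitude-weighted histogram of gradient orientations, as in the sketch below. The function name and the number of histogram bins are assumptions for illustration, and the angle is measured in array coordinates, with the row index increasing downwards.

```python
import numpy as np


def dominant_orientation(patch, bins=36):
    """Return the center (in radians) of the strongest orientation bin."""
    gy, gx = np.gradient(np.asarray(patch, dtype=float))
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx) % (2 * np.pi)
    # Each pixel votes for its gradient orientation, weighted by magnitude.
    hist, edges = np.histogram(orientation, bins=bins,
                               range=(0.0, 2 * np.pi), weights=magnitude)
    k = int(np.argmax(hist))
    return 0.5 * (edges[k] + edges[k + 1])
```

When the histogram has no clear peak, small changes of the region can move the maximum to a different bin. This is the instability of poorly defined orientations discussed above.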
Affine invariance
The affine extension of our detector is based on the affine adaptation
in (Lindeberg and Garding, 1994; Baumberg, 2000), where the shape of the elliptical
region is determined by the second moment matrix of the intensity gradient. However,
unlike other detectors (Lazebnik et al., 2003; Mikolajczyk and Schmid, 2004b), we do
not use this estimation as a post-processing step after scale selection, but estimate the
elliptical region prior to the descriptor computation for each scale. When the affine
adaptation is unstable, i.e., sensitive to small changes of the initial scale, the descriptor
changes significantly and the region is rejected. This improves the robustness of
our affine-invariant representation. In our experiments an -A suffix indicates affine
invariance. Full affine invariance requires rotation invariance, as the shape of each
elliptical region is transformed into a circle, reducing the affine ambiguity to a rotational
one. Rotation normalization of the patch is, therefore, always included when affine
invariance is used in our experiments.
Illustration of Scale Selection
Table 2.1 shows the number of extracted interest points for the motorbike image from
Figure 2.6 (640x480). On the left, Harris and Laplacian interest points are extracted
on each scale. Note that the number of multi-scale detections depends on the multiplier
between neighboring scales of the image pyramid (1.2 in our case). On the right,
[...] line shows the Harris-Laplace detector (Mikolajczyk and Schmid, 2001) followed by
the other Harris-based detectors in the next three rows. The last two rows show scale
selections on Laplacian points. In practice, to further limit the number of selected
[...] LoG and Harris-Harris detectors, two separate thresholds can be set, one for the
location and one for the scale function. Please also note that rotation invariance, which is
enabled in these examples, further reduced the numbers of points found by Harris-Gen,
H-MSLSD, and L-MSLSD.

Detector                 # of points
Multi-Scale Harris       2228
Multi-Scale Laplacian    4893

Scale-invariant detector # of points
Harris-Laplace           1011
Harris-Harris            283
Harris-Gen               66
Our H-MSLSD              1225
LoG                      2862
Our L-MSLSD              1261

Table 2.1: The number of interest points extracted for the image in Figure 2.6. On
the left we show multi-scale points with a 1.2 multiplier between scales. On the right
we show the results after scale selection with Harris-Laplace, Harris-Harris, Harris-Gen,
our new H-MSLSD, and for LoG and our new L-MSLSD. See text for details.

Figure 2.7: Number of selected points with a gradually increased number of multi-scale
points. First row: Harris scale-selected points and selection ratio as a function of the
number of Harris multi-scale points, for H-MSLSD, H-Lap, H-Har and H-Gen. Second
row: the same plots for Laplacian points, for L-MSLSD and LoG. Selection Ratio is
defined in (2.3). See text for discussion.
Using a fixed image pyramid we define the scale selection ratio as

$$\text{Selection Ratio} = \frac{\text{Scale Invariant Points}}{\text{Multi Scale Points}}. \tag{2.3}$$

Table 2.1 shows that H-Lap, H-MSLSD, LoG and L-MSLSD provide a sufficient number
of detections, yet at the same time their scale selection ratio is relatively high, i.e., they
keep many of the multi-scale points.
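As a worked example, the selection ratio of (2.3) can be evaluated directly on the counts reported in Table 2.1. The dictionary layout and variable names below are only an illustration.

```python
# Counts copied from Table 2.1 (motorbike image, 1.2 scale multiplier).
multi_scale = {"Harris": 2228, "Laplacian": 4893}
scale_invariant = {
    "H-Lap": ("Harris", 1011),
    "H-Har": ("Harris", 283),
    "H-Gen": ("Harris", 66),
    "H-MSLSD": ("Harris", 1225),
    "LoG": ("Laplacian", 2862),
    "L-MSLSD": ("Laplacian", 1261),
}

# Selection ratio = scale-invariant points / multi-scale points of the same operator.
ratios = {name: count / multi_scale[base]
          for name, (base, count) in scale_invariant.items()}

for name, ratio in ratios.items():
    print(f"{name:8s} {ratio:.3f}")
```

With these counts, H-Lap and H-MSLSD keep roughly half of the multi-scale Harris points, whereas H-Har and H-Gen keep about 13% and 3% respectively.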
Figure 2.7 analyzes how much the detected number of points depends on the
scale-space pyramid. We gradually change the scale multiplier between 1.5 and 1.03 and
plot the number of scale-invariant points as a function of multi-scale points. Since
the absolute number of points for each detector may easily be altered by a threshold,
the interesting part of the curves is their shape. One would expect that after a
certain level adding intermediate new layers in the pyramid should not increase the
number of detections. Surprisingly, the H-Lap detector (almost straight line) always
selects a certain ratio of multi-scale points. This could be caused by noise or imprecise
Laplacian scale selection on Harris points. The selection ratio of the H-Har detector begins
as expected, but after 3000 multi-scale points it actually starts to increase. H-Gen and
H-MSLSD both demonstrate the expected descending shape. In the case of the
Laplacian-based detectors (Figure 2.7, second line), we draw similar conclusions: MSLSD stops
increasing the number of detections after a certain limit. The expected behavior of
our MSLSD implementation is probably due to the smoothing factor introduced in our
implementation during the computation of descriptor differences. It explicitly removes
high-frequency noise from the scale selection function. Also note that our scale selection
always uses a finer scale step than the multi-scale initialization.
2.3 Evaluation for image matching
This section evaluates the performance of the new detectors for image matching based
on the evaluation framework in (Mikolajczyk et al., 2005b).³ We compare our results
to H-Lap, H-Har, H-Gen and LoG respectively. The two main evaluation criteria of
the framework, which we also applied, are repeatability and matching rates.
The repeatability rate measures how well the detector selects the same scene region
under various image transformations. Each sequence has one reference image and five
images with known homographies to the reference image. Regions are detected for
the images and their accuracy is measured by the amount of overlap between the
detected region and the corresponding region projected from the reference image with
the known homography. Two regions are matched if their overlap error is sufficiently
small:

$$1 - \frac{R_{\mu_a} \cap R_{(H^T \mu_b H)}}{R_{\mu_a} \cup R_{(H^T \mu_b H)}} < \epsilon_O$$

where $R_\mu$ is the elliptic or circular region extracted by the detector and $H$ is the
homography between the two images. The union ($R_{\mu_a} \cup R_{(H^T \mu_b H)}$) and the
intersection ($R_{\mu_a} \cap R_{(H^T \mu_b H)}$) of the detected and projected regions are
computed numerically. As in (Mikolajczyk et al., 2005b) the maximum possible overlap error
$\epsilon_O$ is set to 40% in our experiments. The repeatability score is the ratio between the
correct matches and the smaller number of detected regions in the pair of images.

³ The evaluation script may be downloaded from
http://www.robots.ox.ac.uk/~vgg/research/affine/evaluation.html.

Figure 2.8: Image sequences used in the matching experiments. (a) and (b) are
sequences with viewpoint change, while (c) contains illumination change. The
first column shows the reference image, the other images are examples whose
homography to the reference is known. These sequences may be downloaded from
http://www.robots.ox.ac.uk/~vgg/research/affine/index.html.
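The numerical computation of the overlap error can be sketched by rasterizing both ellipses on a common grid. The sketch makes several simplifying assumptions that are not taken from the evaluation framework: both regions are centered at the origin, a region is the set of points with x^T mu x <= 1, and H is the 2x2 linear approximation of the homography around the region. All function names are hypothetical.

```python
import numpy as np


def overlap_error(mu_a, mu_b, H, samples=801):
    """1 - |A intersect B| / |A union B| for two origin-centered ellipses."""
    mu_a = np.asarray(mu_a, dtype=float)
    H = np.asarray(H, dtype=float)
    mu_b_proj = H.T @ np.asarray(mu_b, dtype=float) @ H
    # Grid half-width: the longest semi-axis among the two ellipses.
    radius = max(1.0 / np.sqrt(np.linalg.eigvalsh(m).min())
                 for m in (mu_a, mu_b_proj))
    axis = np.linspace(-radius, radius, samples)
    x, y = np.meshgrid(axis, axis)
    pts = np.stack([x, y], axis=-1)

    def inside(m):
        # Quadratic form x^T m x evaluated at every grid point.
        return np.einsum("...i,ij,...j->...", pts, m, pts) <= 1.0

    a, b = inside(mu_a), inside(mu_b_proj)
    return 1.0 - (a & b).sum() / (a | b).sum()


def repeatability(num_correct, num_regions_a, num_regions_b):
    """Correct matches over the smaller number of detected regions."""
    return num_correct / min(num_regions_a, num_regions_b)
```

For two concentric circles of radius 1 and 2 the intersection covers a quarter of the union, so the overlap error is close to 0.75 and the pair would be rejected under the 40% limit.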
The second criterion, the matching score, measures the discriminative power of the
detected regions. Each descriptor is matched to its nearest neighbor in the second
image. This match is marked as correct if it corresponds to a region match with maximum
overlap error 40%. The matching score is the ratio between the correct matches and
the smaller number of detected regions in the pair of images. See (Mikolajczyk et al.,
2005b) for a more detailed discussion of the procedure.
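The matching score can be sketched as a nearest-neighbor search followed by a check against the set of geometrically valid region pairs. The function name and the representation of region matches as a set of index pairs are assumptions for illustration; in the framework these pairs would come from the overlap test.

```python
import numpy as np


def matching_score(desc_a, desc_b, region_matches):
    """Fraction of nearest-neighbor descriptor matches that are region matches.

    desc_a, desc_b:  arrays of shape (n_a, dim) and (n_b, dim)
    region_matches:  set of (i, j) pairs whose overlap error is below the limit
    """
    desc_a = np.asarray(desc_a, dtype=float)
    desc_b = np.asarray(desc_b, dtype=float)
    # Pairwise Euclidean distances between all descriptors of the two images.
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    correct = sum((i, int(j)) in region_matches for i, j in enumerate(nearest))
    return correct / min(len(desc_a), len(desc_b))
```

Here only the descriptors of the first image search for a neighbor in the second one; a symmetric variant is equally conceivable.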
2.3.1 Viewpoint Changes
The performance of our detectors for viewpoint changes is evaluated on two different
image sequences with viewpoint changes from 20 to 60 degrees. Figure 2.8(a) shows
sample images of the graffiti sequence. This sequence has well-defined edges, whereas
the wall sequence (Figure 2.8(b)) is more texture-like.
Figure 2.9 shows the repeatability rate and the matching scores as well as the number
of matches for different affine-invariant detectors. The ordering of the detectors is
very similar for the criteria repeatability rate and matching score, as expected. In the
following we focus on the comparison of H-MSLSD-A to the other Harris-based
detectors, and L-MSLSD-A to LoG-A respectively. On the graffiti sequence (Figure 2.9,
first row) the original Harris-Laplace (H-Lap-A) detector performs better than the
other Harris detectors. On this sequence the new H-MSLSD-A is outperformed
by H-Lap-A and H-Har-A. On the wall sequence, a more natural scene, results
for H-MSLSD-A are slightly better than for H-Lap-A. This shows that the Laplacian
scale selection provides good repeatability mainly in the presence of well-defined
edges. In the case of the Laplacian our detector (L-MSLSD-A) outperforms the original
one (LoG) for both sequences. This can be explained by the fact that LoG-A detects
a large number of unstable (poorly repeatable) regions for nearly parallel edges, see
Figure 2.10. A small shift or scale change of the initial regions can lead to completely
different affine parameters of LoG-A. These regions are rejected by L-MSLSD-A,
as the varying affine parameters cause large changes in the local description over
consecutive scale parameters. Note that in case of affine divergence all detectors reject
the points. This example clearly shows that description stability may lead to more
repeatable regions. In the case of natural scenes, as for example the wall sequence, this
advantage is even more apparent, i.e., the difference between L-MSLSD-A and LoG-A
is higher than for the graffiti sequence.
We can observe that we obtain a significantly higher number of correct matches
with our L-MSLSD. This is due to a larger number of detected regions. This could
increase the probability of accidental matches. To ensure that this did not bias our
results, and to evaluate the effect of the detected region density, we compared the
performance for different Laplacian thresholds for the L-MSLSD detector. Note that
the Laplacian threshold determines the number of detections in location space, whereas
the scale threshold rejects unstable locations and remains fixed throughout the thesis.
Figure 2.11 shows that as the number of correct matches gradually decreases, the quality
of the descriptors (matching score) stays the same. Consequently, we can conclude that
the quality of the detections does not depend on the density of the extracted regions.
Figure 2.12 shows that in case of small viewpoint changes the scale-invariant ver-