Selection of Discriminative Regions and Local Descriptors for Generic Object Class Recognition


HAL Id: tel-00555064

https://tel.archives-ouvertes.fr/tel-00555064

Submitted on 12 Jan 2011

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.



Gyuri Dorkó

To cite this version:

Gyuri Dorkó. Selection of Discriminative Regions and Local Descriptors for Generic Object Class Recognition. Human-Computer Interaction [cs.HC]. Institut National Polytechnique de Grenoble - INPG, 2006. English. ⟨tel-00555064⟩


Number assigned by the library

THESIS

submitted for the degree of

DOCTOR OF INPG

Specialty: Mathematics and Computer Science

prepared at the GRAVIR-IMAG laboratory, LEAR project,

within the Doctoral School of Mathematics, Information Science and Technology (École Doctorale Mathématiques, Sciences et Technologie de l'Information)

presented and publicly defended

by

Gyuri Dorkó

on 9 June 2006

Selection of Discriminative Regions and Local Descriptors for Generic Object Class Recognition

Thesis advisor: Dr. Cordelia Schmid

JURY

Prof. Roger Mohr, President

Prof. Bernt Schiele, Reviewer

Prof. Andrew Zisserman, Reviewer

Dr. Cordelia Schmid, Thesis advisor

Dr. Tinne Tuytelaars, Examiner


To my parents


SELECTION OF DISCRIMINATIVE REGIONS AND LOCAL DESCRIPTORS FOR GENERIC OBJECT CLASS RECOGNITION

Gyuri Dorkó, Ph.D. dissertation

Institut National Polytechnique de Grenoble, 9 June 2006

Object category recognition is one of the most difficult problems in computer vision. It involves recognizing objects despite intra-class variations, viewpoint changes and background clutter. The goal of this thesis is to investigate robust invariant local image description and the selection of discriminative features. We show that class-discriminative scale-invariant features achieve excellent results for image-level categorization and object localization. We present solutions for two key problems: (i) we improve the quality of the image description based on a novel scale-invariant keypoint detection method and (ii) we integrate feature filtering techniques into our object models.

Our novel scale-invariant detector is based on the idea of a maximally stable description, i.e., the descriptor should be stable even in the presence of minor variations of the detector. The technique performs scale selection based on a region descriptor, here SIFT, and chooses regions for which this descriptor is maximally stable, i.e., the difference between descriptors extracted for consecutive scales reaches a minimum. This scale selection technique is applied to multi-scale Harris and Laplacian points. Experimental results evaluate the performance of our detector and show that it outperforms existing ones in the context of image matching, category and texture classification, as well as object localization.

To construct object models based on discriminative features, we first cluster the scale-invariant descriptors and obtain a set of "visual words". We then estimate the discriminative information of these clusters based on different feature selection techniques, several of which are traditionally used in text retrieval. We discuss their properties (feature frequency, discriminative power, and redundancy) and analyze their performance in the context of image classification and object localization. We show that each task has different requirements, and indicate which selection techniques are the most appropriate. Experimental results for recognition on challenging large datasets demonstrate the performance of the approach.


SELECTION OF DISCRIMINATIVE REGIONS AND LOCAL DESCRIPTORS FOR GENERIC OBJECT CLASS RECOGNITION (French abstract, translated)

Gyuri Dorkó

Institut National Polytechnique de Grenoble, 9 June 2006

Object categorization is one of the most difficult problems in computer vision. The goal is to recognize visual objects despite intra-class variations, viewpoint changes and strong background clutter. The objective of this thesis is to investigate a local image descriptor and a method for selecting discriminative features. We show that discriminative scale-invariant descriptors give excellent results for categorization and object localization. Solutions are provided for the following two fundamental problems: (i) we improve the quality of the image description with a new scale-invariant interest point detector and (ii) we integrate descriptor filtering techniques into our object models.

Our new scale-invariant detector is based on the idea of a maximally stable region, that is, the position of the interest point is stable even in the presence of minor variations of the detector. The method selects a scale from a local descriptor (SIFT in our case) and chooses the regions for which the stability of the descriptor is maximal, that is, the difference between the descriptors at two consecutive scales reaches a minimum. This scale selection technique is applied to the multi-scale Harris detector and to Laplacian points. Experimental results evaluate the performance of our detector and show that it improves the results for image matching, object and texture classification, and object localization.

To construct object models based on discriminative factors, the scale-invariant descriptors are grouped into clusters, which yields a set of "visual words". We then estimate the discriminative information contained in these clusters using different discriminative selection techniques, several of which are traditionally used in textual information retrieval. We discuss their properties (frequency, discriminative power and redundancy) and analyze their performance in the context of object classification and localization. We show that each task has its own particularities and indicate which selection technique is the most appropriate. Experimental results for object recognition on difficult datasets show the good performance of the proposed methodology.


I would like to thank all people that have contributed to the completion of this thesis. My sincerest thanks go to my advisor Cordelia Schmid for her guidance, many suggestions, original ideas, feedback, and helpful criticism throughout this thesis. I am grateful to Prof. Bernt Schiele and Prof. Andrew Zisserman for their interest in my work and for being the reporters of this dissertation, and also to Prof. Roger Mohr and Tinne Tuytelaars for being the co-examiners at my defense.

I would also like to thank Bill Triggs, Frédéric Jurie, and my friend Guillaume Bouchard for their many useful comments and discussions that helped me understand the sometimes difficult corners of computer vision and machine learning.

I am grateful to my fellow researchers from the LEAR group, Eric Nowak, Navneet Dalal, Ankur Agarwal, Jianguo Zhang, Diane Larlus-Larrondo, Peter Carbonetto, Caroline Pantofaru, Marcin Marszałek, and Joost Van de Weijer, for their support, and for making INRIA a fun and motivating place to work.

I would like to acknowledge the support of the European project LAVA (IST-2001-34405), including all the partners, and the European PASCAL network of excellence.

I am thankful to all researchers who have contributed by making their code available to help my research, especially Prof. David Lowe, Krystian Mikolajczyk, Michael Sdika, and Matthijs Douze. I am also grateful to Barbara Caputo, Prof. Dietrich Paulus, Prof. Laszló Csink, Laszló Kutor, and Mária Dudás, without whom I would not have started my PhD.

My special thanks go to my friends Stan, Marlen, Carla, and Bram, for the many joyful moments in Grenoble, as well as to my Hungarian friends Kriszta, Andi, and Gábor, who have not forgotten about me even though I am so far from home.

Last, but not least, I would like to thank my family for their love, emotional support, and encouragement. Without them, I would not have made it.


1 Introduction
1.1 Context
1.2 Our Approach
1.3 Contributions
1.4 Applications
1.5 Overview

2 Local Image Representation
2.1 Background
2.1.1 Interest Point Detectors
2.1.2 Local Description: Scale-Invariant Feature Transform
2.2 Scale Selection by Maximally Stable Local Description
2.3 Evaluation for image matching
2.3.1 Viewpoint Changes
2.3.2 Changes in Illumination
2.3.3 Overall Performance
2.4 Evaluation for image categorization
2.5 Implementation Details
2.6 Conclusions

3 Discriminative Feature Selection for Object Class Appearance
3.1 Probabilistic Interpretation
3.2 Feature Scoring Techniques
3.3 Selection for Local Features
3.3.1 Visual Words
3.3.2 Retrieving Object Features
3.4 Discussion

4.1.1 Classifier for Objects Presence
4.1.2 Experimental Set-Up
4.1.3 Experiments: Image classification
4.2 Object Localization with Discriminative Features
4.2.1 The Localization Approach
4.2.2 Evaluation of Different Parameters
4.2.3 Additional Results: PASCAL Challenge, Butterflies
4.3 Implementation Details
4.4 Discussion

5 Conclusion and Future Work

Appendix: Influence of the number of interest points


Introduction

Object recognition is a challenge that computer vision researchers, psychologists and researchers from other fields have been trying to understand for more than 40 years. After many years of research, artificial vision is still far behind human vision. People are able to see, to recognize, and to categorize objects in the world. However, for computers this is not an easy task. The ability, for example, to see a chair from all different viewpoints and to understand and know that it is the same chair is an extremely complicated task. The 2-D appearance of the same object can be very different when the viewpoint changes. Furthermore, due to our generalization capability, people are capable of finding a chair, even if they have not seen that particular instance before. Creating categories, finding shared properties, and generalizing appearance are challenging tasks for computers, mainly due to a potentially high intra-class variance across object instances.

1.1 Context

While object recognition is a large field, in this thesis we focus on visual object class categorization and localization. Figure 1.1 illustrates some of the difficulties of recognizing object categories. Intra-class variation among instances of a class is only one of the challenges: object parts can have different geometrical structure or color, or can be completely missing. In Figure 1.1, bicycles (a) and (e) are different in color, while bicycle (b) has different geometrical proportions. Many applications require objects to be found in a predefined pose and orientation, such as recognizing profiles of faces, or side views of cars. Others, like the bicycle example, are less restricted and therefore more difficult: bicycles (d) and (e) are viewed from different viewpoints, and (a) and (b) are imaged at different scales (magnification). Robustness to occlusions and missing parts is usually an additional requirement for state-of-the-art applications; e.g., bicycle (a) has a missing (covered) seat. Occlusions may be caused by the environment, or even by the object itself: the spokes of the first tire are occluded in (d). Everyday objects, such as bicycles, often appear together with other objects or on cluttered background. This additional data, so-called context, can distract our system and in general needs to be discarded. Note that it can also help to recognize the object class. An example is a traffic control system detecting cars. In such a system the recognition of roads is probably useless because they occur in all images. However, the shadow of the car (on the road) is probably a useful discovery.

(a) (b) (c) (d) (e)
Figure 1.1: Five different bicycles illustrate the challenge for object class recognition. Different viewpoints, occlusion, noise, and cluttered background make it hard to recognize the objects. Intra-class variation (shape and color) across the different bicycles […]

(a) (b) (c) (d)
Figure 1.2: Examples of wildcats.

(a) (b) (c) (d) (e) (f)
Figure 1.3: Examples of butterflies.

1.2 Our Approach

Instances of an object category often share some visual appearance, and our main goal is to find these common features. The examples in Figure 1.2 and Figure 1.3 show two different object categories. The selection of common discriminative object parts is relatively easy, because almost any set of features (of adequate size) separates wildcats from butterflies. However, if Figure 1.2 itself is defined to contain two […] harder to find. Furthermore, if we assume that examples in Figure 1.3 are from two categories, then butterfly experts would immediately notice that (a) and (b) are black swallowtails, while (c)-(f) are monarchs. Those who have less experience with insects would probably say that (a)-(d) are open while (e)-(f) are closed butterflies. So we see that common features are not always discriminative, and that the useful features differ according to the task. To discover discriminative object parts we use

• local or semi-local representations of images to describe object parts,
• a way to measure their usefulness, and select discriminative features.

Sparse local representations are typically computed on a set of interest point locations. Their aim is to describe the regions by keeping distinctive information, and at the same time providing robustness to small translations and noise. Local representations of images offer a solution to deal with occlusion and cluttered background: individual descriptors only store information about the local content, and therefore they are not distracted by other parts of the image. The influential work of Schmid and Mohr (1997) is the first that uses interest points for content-based object recognition. Interest points are automatically detected image locations, such as corners or centers of blobs. They allow the creation of a sparse local representation of images by selecting regions which keep distinctive information, and at the same time provide robustness to small translations and noise. In the last few years these points became invariant to various image transformations, like changes in viewpoint and scale. At the time of writing at least a dozen of these detectors exist, all selecting regions by different criteria. The combination of interest point detectors and local descriptors allows sparse and robust representation of objects, scenes, or textures. Rotated objects, scenes from different viewpoints or with illumination changes are challenges that can be solved already at the representation level, i.e., there is no need to learn those by examples.

State-of-the-art methods provide relatively good solutions for recognizing specific objects, such as a given bicycle or car, by matching local appearance. However, detection of object categories requires additional generalization capabilities to deal with intra-class variability. Discriminative feature selection methods can guide object recognition to find category-discriminative object parts and to discard unnecessary background features. These methods are recent tools in computer vision adopted from the text literature. Local representation of images and standard learning techniques, such as vector quantization, have built a bridge between computer vision and text recognition. Our images become "visual documents" and the quantized local descriptors become "visual words". Owing to the huge availability of documents, the text community realized early the need for discriminative feature selection. For example, to index news directories or web pages, relevant information has to be selected to train classifiers to recognize different categories. In the last few years, the growing number of examples (Internet) directed researchers to improve classification efficiency and au- […] these techniques to computer vision. In object category recognition, local representation and feature selection together help to develop high performance automatic tools for object and texture recognition, categorization and detection, for scene analysis, and for image indexing.

1.3 Contributions

In this thesis we discuss and offer solutions for recent problems of image representation and object detection. The key contributions are the following:

Interest Point Detection by Maximally Stable Local Image Representation

Many interest point detectors and local descriptors have been developed during the last few years. Their quality depends on the task. For example, some perform well for image matching while others are better for object recognition. Their behavior can be explained by the different ways they select image regions and incorporate various feature properties. As an example, image classification or image retrieval may match the local regions purely by appearance, i.e., ignoring their scales, locations, and spatial organization. For other applications, such as image matching or camera calibration, these properties are very important, and many times their estimation is unstable or noisy. Consequently, the quality of interest point detectors is not straightforward to measure, since different methods should be used depending on the context. Our experience has shown that one of the weakest properties of scale-invariant detectors is the scale estimation. This thesis proposes a novel method to determine (select) the characteristic scales for interest point detectors. Our idea is to use an appropriately chosen descriptor to select regions for which this descriptor is maximally stable. Experimental results show that our new criterion improves performance for image matching in challenging environments, such as variation in illumination conditions. Due to a more stable appearance-based representation, texture categorization on popular sets shows an improvement of 3-10% with the new detectors.

Feature Selection for Local Descriptors

In this thesis we adapt and compare several techniques from the text literature, most of which are new in vision. We analyze several feature properties including feature frequency, i.e., how often a feature appears, discriminative power to separate object from background, and redundancy. Different trade-offs between properties are pointed out, and selection methods are distinguished (grouped) accordingly. By the correct combination of these properties, i.e., by choosing the selection method wisely for a given task, we show how to achieve good recognition performance with many or just a sparse set of features. Our experiments evaluate class-discriminative feature selection […]


Improved Object Class Recognition via Feature Ranking and Selection

We have chosen object category classification and localization to demonstrate the performance of discriminative feature selection. A simple classification framework demonstrates that discovering discriminative features can directly be used for object recognition. Selection methods on different types of features are compared and discussed for three different tasks: Object feature retrieval tries to recall features providing the best object coverage, while keeping the background featureless or very sparse. Appearance-based object classification uses discriminative features to decide about the presence of an object class in images. Object class localization aims to determine the exact position of unseen object instances in test images. For localization we extend an existing state-of-the-art method by incorporating feature ranks. This leads to a faster system with improved performance. We additionally extend the framework for rotation invariant training and detection.
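As a rough illustration of what ranking features by discriminative power can look like, the sketch below scores each visual word by a smoothed likelihood ratio between object and background occurrences. The choice of score, the smoothing constant and the toy counts are assumptions made for this example only; the scoring techniques actually compared in this thesis are described in Chapter 3.

```python
import numpy as np


def likelihood_ratio_scores(object_counts, background_counts, smoothing=1.0):
    """Score each visual word by P(word | object) / P(word | background).

    The additive smoothing (an assumption of this sketch) avoids division by zero
    for words that never occur on the background.
    """
    obj = np.asarray(object_counts, dtype=float) + smoothing
    bg = np.asarray(background_counts, dtype=float) + smoothing
    return (obj / obj.sum()) / (bg / bg.sum())


def rank_words(scores):
    """Indices of the visual words, most object-specific first."""
    return np.argsort(-np.asarray(scores), kind="stable")


# Toy occurrence counts of four visual words on object and background regions.
object_counts = [40, 5, 20, 0]
background_counts = [10, 45, 20, 30]

scores = likelihood_ratio_scores(object_counts, background_counts)
ranking = rank_words(scores)
print(ranking)  # word 0 ranks first, word 3 last
```

A score of this kind captures discriminative power but ignores frequency and redundancy, which is one reason why different selection methods suit different tasks.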

1.4 Applications

Advances such as discriminative feature selection and scale-invariant local representations, discussed in this thesis, help to analyze and improve state-of-the-art image representation and object recognition techniques. In the following we list a few examples among a wide range of possible applications.

Surveillance and Security

One of the most useful applications of object recognition is surveillance systems. Recent security systems based on photography or CCTV (Closed Circuit Television) use computer vision to match digital images taken from cameras with images stored in a database. Discriminative feature selection may help to determine an important subset of features in advance, and therefore increase the system quality and performance.

Manufacturing Processes and Quality Control

Improved feature extraction and local description of images can help industrial applications to support manufacturing processes. Many quality control methods employ computer vision. They are based on statistical analysis of detected features, and aim to reduce the amount of faulty products, in order to meet customer requirements.

Autonomous Vehicles

Even though autonomous driving cars are not yet available for the market, manufacturers have already demonstrated preliminary prototypes and driving systems. Learning and rapid discovery of useful features, such as parts of other cars or obstacles, can […] were used for surveillance, and nowadays, almost all major militaries have them. They are also used to monitor traffic and to detect certain events, such as forest fires. Robust local image representation and a focus of attention mechanism (feature selection) help those vehicles with better motion planning, navigation, scene analysis (to detect where it is), or improved SLAM techniques¹.

Web Search and Content Based Image Retrieval

Did you know that the verb "google"² has been added to the New Oxford American Dictionary? Internet search engines have become a part of our everyday life. Researchers from the text domain have implemented discriminative feature selection so successfully that search engines generate around 85% of the total web traffic. Now it is our turn to index images. Many recent search engines, such as Google, MSN, Lycos, Yahoo, Altavista, and A9, support search for images. However, their algorithms are based on purely textual information, such as filenames, image meta-data, and surrounding HTML content. While many times this is sufficient, indexing by image content would improve current performance, as well as open new possibilities:

• visual similarity between images helps to reject incorrect matches, and increase the recall by discovering new correct ones,
• queries can be based on images instead of text; e.g., we can look for a certain car by its picture, or find our copyright protected images and identify fraud,
• given an image or images of someone or something, e.g., a famous building or an actress, we can recover its identity, such as its place and name,
• mixed text and image queries can provide a richer way of looking for information.

In order to efficiently index and rank images, the correct features have to be generated and selected. Discriminative feature selection may help to develop domain specific search engines, as well as to find the most informative features in general.

Video Indexing

Digital videos are now available not only for professionals but also for everyday people. DVD players and recorders, recent digital cameras, and high speed Internet connections made indexing for videos as important as for images. Videos can be seen as a sequence of images, and therefore many techniques from images can be applied without major modification. However, adding temporal information to the feature space opens new perspectives, such as searching for certain actions. Presently only preliminary versions of video web search are available on major sites (Google, Yahoo, Altavista, A9) and, similarly to images, their indices are built on textual information only. Discriminative feature selection could help to build domain specific search, e.g., looking for the appearance of an actor in a movie, or to determine the difference between actions. Scene analysis can guide professionals when editing movies, or can identify viewers' preferences (e.g., improve TiVo suggestions).

¹ In Simultaneous Localization And Mapping (SLAM), the quality of the iteratively built map can be refined and therefore improved by matching discriminative local features over time.

² goo·gle |ˈgoōgəl| (also Goo·gle) verb informal [intrans.] use an Internet search engine, particularly Google.com: she spent the afternoon googling aimlessly. • [trans.] search for the name of (someone) on the Internet to find out information about them: you meet someone, swap numbers, fix a date, then Google them through 1,346,966,000 Web […]

1.5 Overview

The manuscript is organized as follows. Chapter 2 introduces a sparse local image representation with interest point detectors and local descriptors. In Section 2.2 we describe our new scale selection method. Evaluation and comparison with existing techniques are carried out for image matching (Section 2.3), object and texture classification (Section 2.4 and Section 4.1.3), and object localization (Section 4.2.2).

Chapter 3 introduces different selection and ranking techniques. In Section 3.3 we build the link between image representation and features by creating visual words, and experimentally compare the introduced selection techniques for object feature retrieval. Chapter 4 integrates feature selection into a framework for object recognition. First we show an application to recognize the presence or absence of objects in images (image classification), and compare the results of different features and selection methods. In Section 4.2 we show how to improve object localization by class-discriminative feature ranking.


Local Image Representation

Scale Selection via Maximally Stable Local Description

Local photometric descriptors computed at keypoints have demonstrated excellent results in many vision applications, including object recognition (Fergus et al., 2003; Opelt et al., 2004), image matching (Schaffalitzky and Zisserman, 2002), and sparse texture representation (Lazebnik et al., 2003). Recent work has concentrated on making these descriptors invariant to image transformations. This requires constructing invariant image regions which are then used as support regions to compute invariant descriptors. In most cases a detected region is described by an independently chosen descriptor. It would, however, be advantageous to use a description adapted to the region. For example, for blob-like detectors which extract regions surrounded by edges, a natural choice would be a descriptor based on those edges. However, those adapted representations may not provide enough discriminative information for the region, and consequently, a general purpose descriptor (e.g. wavelets, shape-context, SIFT, etc.) might be a better choice. Many times this leads to better performance, yet less stable representations: small changes in scale or location can alter the descriptors significantly. Our experiments have shown that the most sensitive component of keypoint-based scale-invariant detectors is the scale selection. This motivated us to develop a novel detector which uses the descriptor chosen for the given task to select the characteristic scales. Our feature detection approach consists of two steps. We first apply an interest point detector on multiple scales to determine informative and repeatable locations. For each position we then apply a scale selection algorithm to identify maximally stable representations, i.e., a scale for which a local descriptor is the most stable. The local description can be any measure that can be computed on a pixel neighborhood, such as color histograms, steerable filters and wavelets. For our experiments we chose the Scale-Invariant Feature Transform (SIFT) (Lowe, 2004), which has shown excellent performance for object representation and image matching (Mikolajczyk and Schmid, 2004a).
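The second step, choosing the scale at which the descriptor changes least between consecutive scales, can be sketched as follows. This is a simplified illustration under explicit assumptions: a gradient orientation histogram over a square window stands in for SIFT, window radii stand in for scales, and the global minimum of the descriptor change is taken. The function names are hypothetical and the code does not reproduce the detector developed in this chapter.

```python
import numpy as np


def orientation_histogram(image, x, y, radius, bins=8):
    """Toy descriptor (a stand-in for SIFT): L2-normalised histogram of gradient
    orientations, weighted by gradient magnitude, in a square window around (x, y)."""
    y0, y1 = max(y - radius, 0), min(y + radius + 1, image.shape[0])
    x0, x1 = max(x - radius, 0), min(x + radius + 1, image.shape[1])
    patch = image[y0:y1, x0:x1].astype(float)
    gy, gx = np.gradient(patch)
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx)  # values in [-pi, pi]
    hist, _ = np.histogram(angle, bins=bins, range=(-np.pi, np.pi),
                           weights=magnitude)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist


def select_stable_scale(image, x, y, radii):
    """Return the radius whose descriptor differs least from the descriptor at the
    next larger radius, together with all consecutive descriptor changes."""
    descriptors = [orientation_histogram(image, x, y, r) for r in radii]
    changes = [float(np.linalg.norm(descriptors[i + 1] - descriptors[i]))
               for i in range(len(radii) - 1)]
    best = int(np.argmin(changes))  # change i is attributed to radii[i]
    return radii[best], changes


# Example on a random image (placeholder for real image data).
rng = np.random.default_rng(0)
image = rng.random((64, 64))
radii = [4, 6, 8, 12, 16]
scale, changes = select_stable_scale(image, 32, 32, radii)
print(scale, changes)
```

Because the criterion is defined on the descriptor itself, any descriptor could be substituted for `orientation_histogram` without changing `select_stable_scale`.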

Our new method for scale-invariant keypoint detection and image representation […]

• Our scale selection method guarantees more stable descriptors than state-of-the-art techniques by explicitly using descriptors during keypoint detection. The stability criterion is developed to minimize the variation of the descriptor for a small change in scale.
• Repeatable locations are provided by interest point detectors (e.g. Harris), and therefore they have rich and salient neighborhoods. This consequently helps to choose repeatable and characteristic scales. We verify this experimentally, and show that our selection competes favorably with the best available detectors.
• The detector takes advantage of the properties of the local descriptor. This can include invariance to illumination or rotation as well as robustness to noise. Our experiments show that the local invariant image representation extracted by our algorithm leads to significant improvement for object and texture recognition.

Related Work

For selecting local invariant regions, many different scale- and affine-invariant detectors exist in the literature. Harris-Laplace (Mikolajczyk and Schmid, 2004b) detects multi-scale keypoint locations with the Harris detector (Harris and Stephens, 1988) and the characteristic scales are then determined by the Laplacian operator. Locations based on Harris points are very accurate. However, scale estimation is often unstable on corner-like structures, because it depends on the exact corner location, i.e., shifts by one pixel may modify the selected scale significantly. The scale-invariant Laplacian detector (Lindeberg and Garding, 1994) (LoG) selects the extremal values in location-scale space. The Difference of Gaussian (DoG) detector developed by Lowe (2004) approximates the Laplacian, and therefore it similarly selects scale-space maxima to find blob-like structures. Blobs are well localized structures, but due to their homogeneity, the information content is often poor in the center of the region. Triggs' detector (Triggs, 2004) extends the Förstner-Harris approach to general motion models and robust template matching by finding regions which can be accurately self-matched under various similarity or affine transformations. This detector extracts fewer but very stable keypoints. For instance, the rotation invariant detection rejects point-like structures, since they cannot be well-localized (self-matched) under image rotation, i.e., they have no characteristic orientation. The method of Kadir et al. (2004) extracts circular or elliptical regions in the image as maxima of the entropy scale-space of region histograms. This is also a blob detector, but has been shown to provide a more robust appearance based representation for some object categories (Kadir et al., 2004). Mikolajczyk et al. (2005b) showed that it performs poorly for image matching, which might be due to the sparsity of their scale quantization. Presumably performance issues prevent a more extensive search in scale-space. The Intensity-Based Region detector (Tuytelaars and Van Gool, 2004) selects multi-scale locations at extremal in- […] nearby intensity changes. The edge-based region detector (Tuytelaars and Van Gool, 2004) finds quadrangular segments with a corner detected by the multi-scale Harris operator and sides determined by nearby edges. The object-part detector of Jurie et al. (Jurie and Schmid, 2004) selects circular regions with the most salient convex arrangement of local edges extracted by the Canny-Deriche operator. Since the detected regions are surrounded by edges, they proposed a local image representation based on this structure. These descriptors are however not as discriminative as other available representations, since they only encode information about the surrounding edges. Due to the homogeneity of the selected regions, the detector suffers from the same problems as other blob-like methods. The Maximally Stable Extremal Regions (MSER) detector (Matas et al., 2002) defines extremal regions as image segments where each inner-pixel intensity value is less/greater than a certain threshold t, and all intensities around the boundary are greater/less than the same t. An extremal region is maximally stable when the area (or the boundary length) of the segment changes the least with respect to t. This detector works particularly well on images with well defined edges, but is less robust to noise and not adapted to texture-like structures. It usually selects relatively few regions.
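For comparison with the descriptor-based criterion, scale selection with the Laplacian operator can be sketched as picking, at a fixed location, the scale that maximizes the magnitude of the scale-normalized Laplacian of the Gaussian-smoothed image. The sketch below is an assumption-laden illustration rather than the implementation of any cited detector: the kernel truncation, the five-point Laplacian, the candidate scales and the function names are all choices made for this example.

```python
import numpy as np


def gaussian_smooth(image, sigma):
    """Separable Gaussian smoothing with zero padding (kernel cut at about 3 sigma).
    The image must be larger than the kernel in both dimensions."""
    radius = int(3 * sigma + 0.5)
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-t ** 2 / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    rows = np.apply_along_axis(np.convolve, 1, image, kernel, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="same")


def laplacian(image):
    """Five-point discrete Laplacian; border pixels are left at zero."""
    out = np.zeros_like(image, dtype=float)
    out[1:-1, 1:-1] = (image[2:, 1:-1] + image[:-2, 1:-1]
                       + image[1:-1, 2:] + image[1:-1, :-2]
                       - 4.0 * image[1:-1, 1:-1])
    return out


def characteristic_scale(image, x, y, sigmas):
    """Scale maximizing |sigma^2 * Laplacian| at pixel (x, y), plus all responses."""
    responses = [sigma ** 2 * abs(laplacian(gaussian_smooth(image, sigma))[y, x])
                 for sigma in sigmas]
    return sigmas[int(np.argmax(responses))], responses


# Synthetic Gaussian blob with standard deviation 4, centred in a 65 x 65 image.
yy, xx = np.mgrid[0:65, 0:65]
blob = np.exp(-((xx - 32) ** 2 + (yy - 32) ** 2) / (2.0 * 4.0 ** 2))
scale, responses = characteristic_scale(blob, 32, 32, [1.0, 2.0, 4.0, 8.0])
print(scale)  # the candidate scale matching the blob size is selected
```

For an ideal Gaussian blob the normalized response at the center peaks when the smoothing scale equals the blob's standard deviation, which is why the candidate 4.0 is selected here. Shifting the evaluated position `(x, y)` away from the blob center changes the response profile, which relates to the instability on corner-like structures noted above.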

Viewpoint invariance is sometimes required to achieve reliable image matching, object or texture recognition. Affine-invariant detectors (Kadir et al., 2004; Matas et al., 2002; Mikolajczyk and Schmid, 2004b; Tuytelaars and Van Gool, 2004) explicitly estimate the affine shape of the regions to allow pre-normalization of the patch prior to the descriptor computation. The affine extension of Harris-Laplace (Mikolajczyk and Schmid, 2004b) is similar to the one first used by Lindeberg and Garding (1997) for shape-from-texture. It applies the affine kernel only to fixed points to reduce the complexity of the entire affine-space. This is one of the most widely used approaches; Lazebnik et al. (2003) use a similar technique for the LoG detector to perform texture classification under affine transformations. Note, however, that their adaptation procedure is a post-processing step of the scale-invariant detection based on the scatter matrix of image gradients at keypoint locations.

Mikolajczyk et al. (2005b) evaluated several affine-invariant detectors. MSER (Matas et al., 2002) performed best, closely followed by Hessian- and Harris-Laplace. Moreels and Perona (2005) also find that Harris- and Hessian-Laplace perform best for object recognition. Their study shows poor performance of the MSER detector for 3D environments. Mikolajczyk et al. (2005a) experimentally compared the performance of recently proposed detectors and descriptors for category recognition, and found Hessian-Laplace (Mikolajczyk and Schmid, 2004b) and the entropy detector (Kadir et al., 2004) to be the most suitable.

Overview

This chapter is organized as follows. In Section 2.1 we present the interest point detectors and the local descriptor used in this chapter. Section 2.2 presents our new scale selection technique, Maximally Stable Local SIFT Description, and introduces two new detectors, Harris-MSLSD and Laplacian-MSLSD. We then compare their performance to the Harris-Laplace and the Laplacian detectors. In Section 2.3 we evaluate the performance for image matching using a publicly available framework. Section 2.4 reports results for object-category and texture classification. Finally, in Section 2.6 we conclude.

Figure 2.1: Harris corner detection. (a) the original image, (b) the Harris image, (c) the local maxima of the Harris image marked on the original image.

2.1 Background

This section provides a detailed description of the interest point detectors of (Mikolajczyk and Schmid, 2004b; Lowe, 2004; Triggs, 2004; Lindeberg, 1998; Matas et al., 2002), and the Scale-Invariant Feature Transform descriptor (Lowe, 2004). Our aim is not to cover the full theory of scale-invariant detectors and local representation, but to provide sufficient background information for the techniques that are used later in this chapter. Our experiments will compare our scale selection to several existing techniques in the literature.

2.1.1 Interest Point Detectors

Harris Points: a corner detector

The scatter matrix (or second moment matrix) of local image gradients, $\int \nabla I^{T} \nabla I \, dx$, is often used for feature detection, and it is given as

\[
\mu(x, \sigma_I, \sigma_D) = \sigma_D^2 \, g(\sigma_I) *
\begin{bmatrix}
I_x^2(x, \sigma_D) & I_x I_y(x, \sigma_D) \\
I_x I_y(x, \sigma_D) & I_y^2(x, \sigma_D)
\end{bmatrix}.
\tag{2.1}
\]

Image derivatives $I_x$ and $I_y$ are computed by convolution of Gaussian filters with scale $\sigma_D$ (derivation scale), and locally averaged by Gaussian smoothing with scale $\sigma_I$ (integration scale). The eigenvalues of this matrix represent the two principal curvatures of a point $x$. Corner-like structures can be extracted at points where both of these curvatures are significant in orthogonal directions. The Harris detector (Harris and Stephens, 1988) is based on this principle. The Harris cornerness combines the determinant and trace of this matrix and is defined by

\[
\det(\mu(x, \sigma_I, \sigma_D)) - \alpha \, \mathrm{trace}^2(\mu(x, \sigma_I, \sigma_D)).
\tag{2.2}
\]

The keypoints are determined as local maxima of this value. Figure 2.1 shows a Harris image, i.e., the cornerness for each point, and the keypoints on an example image. Schmid et al. (2000) show that the Harris detector is superior to other methods (Cottier, 1994; Heitger et al., 1992; Horaud et al., 1990).

Figure 2.2: Extraction of multi-scale Harris points. (a) shows the multi-scale image pyramid, (b) the computed Harris images at each scale, and (c) the image pyramid with the multi-scale Harris points. (d) shows the detections projected back to the original image. The radii of the circles correspond to the scale ($2\sigma$).
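The cornerness measure (2.2) can be made concrete with a minimal sketch that evaluates it for a given 2x2 second moment matrix. The function name `harris_cornerness` and the value `alpha = 0.06` are illustrative assumptions; the text does not specify $\alpha$, and the computation of $\mu$ itself (Gaussian derivatives and smoothing) is omitted.

```python
def harris_cornerness(mu, alpha=0.06):
    """Harris cornerness det(mu) - alpha * trace(mu)^2 for a 2x2 matrix.

    mu is the second moment matrix [[a, b], [b, c]] at one point.
    alpha = 0.06 is an assumed, commonly used value, not taken from the text.
    """
    (a, b), (b2, c) = mu
    det = a * c - b * b2
    trace = a + c
    return det - alpha * trace ** 2


# Two strong principal curvatures: corner-like, positive response.
corner = harris_cornerness([[2.0, 0.0], [0.0, 2.0]])
# Only one strong curvature: edge-like, negative response.
edge = harris_cornerness([[2.0, 0.0], [0.0, 0.0]])
```

In this sketch a point with two significant curvatures receives a positive value, while an edge-like point receives a negative one, which is why local maxima of the measure favor corners.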

Multi-Scale Interest Points

A multi-scale representation of images is crucial for many applications. Typical examples are based on the Gaussian kernel. A multi-scale representation consists of a set of images at different discrete levels of scale (Witkin, 1983). Koenderink (1984) showed that scale-space satisfies the diffusion equation, for which the solution is a convolution with a unique Gaussian kernel (Babaud et al., 1986; Lindeberg, 1990; Florack et al., 1992). Images on coarse scales are obtained by smoothing images on finer scales with an appropriate Gaussian kernel. An implementation can sample the coarser scale image by the corresponding scale factor to accelerate the computation, and this representation is often referred to as the scale-space image pyramid.
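As a small illustration of the discrete scale levels of such a pyramid, the sketch below lists geometrically spaced scales and the correspondingly subsampled image sizes. The helper names, the starting scale `sigma0 = 1.0` and the rounding convention are assumptions made for the example only; the multiplier 1.2 is the value used later in this chapter.

```python
def scale_levels(sigma0, multiplier, num_levels):
    """Return the discrete scales sigma0 * multiplier**k of a scale-space pyramid."""
    return [sigma0 * multiplier ** k for k in range(num_levels)]


def pyramid_sizes(width, height, scales):
    """Image size at each level when the image is subsampled by the scale factor.

    Sizes are rounded down and kept at least 1 pixel (assumed convention).
    """
    base = scales[0]
    return [(max(1, int(width * base / s)), max(1, int(height * base / s)))
            for s in scales]


scales = scale_levels(sigma0=1.0, multiplier=1.2, num_levels=4)
sizes = pyramid_sizes(640, 480, scales)
```

A smaller multiplier gives more levels between two given scales, and therefore more multi-scale detections.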

When an interest point operator is applied on multiple scales we call the detections multi-scale interest points. Even though they are called points, they can be interpreted as regions (points and their neighborhood), as they are parameterized by a location $x$ and a scale $\sigma$.¹ As for the Harris operator, Dufournaud et al. (2000) proposed a scale-adaptive extension, where the points are detected at the local maxima of the Harris images computed at different scales. Figure 2.2 illustrates the multi-scale Harris interest points. Figure 2.2(a) shows the original image pyramid, and (b) the corresponding Harris images. Figure 2.2(c) marks the detections, i.e., the maxima of (b) on the original images (a), and finally in (d) we show all the detections with circles corresponding to the detection scale. Note that, for illustration purposes, we omit some scale levels from the pyramids (a), (b), and (c).

Scale-Invariant Interest Points

Instead of extracting interest points for every scale level, automatic scale-selection techniques determine one or a few characteristic scales at each location. These detections are called scale-invariant interest points because they mark the same points ($x$, $\sigma$) on images taken at different resolutions. There are two main advantages of selecting scales. First, the number of interest points is reduced by intelligent rejection of unnecessary scales, and second, the scale becomes a new characteristic property of the detection. Many applications, such as the one in Section 4.2, rely on this property to perform scale-invariant learning and recognition.

One of the first scale-invariant interest point detectors is the Laplacian-of-Gaussian (LoG) developed by Lindeberg (1998). It is based on the Gaussian scale-space (successive smoothing with Gaussian kernels), and it selects 3D local extrema of the Laplacian filtered images. Detections are obtained on blob-like image structures. Figure 2.3(b) shows an example detection of LoG. To demonstrate the multi-scale behavior, i.e., LoG without scale selection, Figure 2.3(a) shows the local extrema of the Laplacian on each scale. As before, the radii of the circles indicate the scale. We can observe that while the LoG detector (Figure 2.3(b)) selects only blob-like features, the 2D LoG maxima (Figure 2.3(a)) also include detections near corners and edges.

Figure 2.3: The LoG detector. (a) multi-scale: all extrema of the 2D LoG function on multiple scales. (b) scale-invariant: LoG 3D maxima in location-scale space. Note that for illustration purposes we omit some scales from (a).

¹ In several multi-scale detectors that are based on second moment matrix computations, we distinguish between two scale parameters, the derivation scale ($\sigma_D$) and the integration scale ($\sigma_I$) (cf. Section 2.1.1). Usually, a constant factor is used between $\sigma_D$ and $\sigma_I$ to balance the size of the area used to calculate the statistics of local gradient [...]

Mikolajczyk and Schmid (2001) evaluate different scale selection criteria for scale-invariant image matching environments. Apart from the Laplacian they study the squared image gradients, the Difference-of-Gaussians (Lowe, 2004) (the difference of the Gaussian filter responses between two consecutive scales), and the Harris function (2.2). Their evaluation shows that the Laplacian function selects the highest percentage of correct characteristic scales, and as a result they introduce the scale-invariant Harris-Laplace (H-Lap) detector, which combines the stable Harris detector with the Laplacian scale selection. Unfortunately, their evaluation of scale selection functions is carried out in general, i.e., for each pixel in the image. While it is a reasonable assumption to transfer the results to Harris points, they did not verify the quality of scale selection specifically on keypoint locations. Even though they did not search for the Harris maxima in scale space, we find it interesting to investigate the Harris scale selection on Harris points, and include the Harris-Harris (H-Har) detector in our experiments.

Triggs (2004) generalizes the Förstner-Harris approach to general motion models and offers a new characteristic scale selection technique. Including scale as a (non-translational) motion parameter forces the detections to be accurately self-matched not only in location but also in scale-space. Since this is a more generalized Harris detector, we call it Harris-Gen (H-Gen) in our experiments. Notice the difference between Harris-Harris and Harris-Gen. The former computes the 2D Harris images for stable locations and chooses the maxima of cornerness in scale-space, while Harris-Gen optimizes the [...] space. In our experiments Harris-Gen is used with rotation stability enabled, so the motion model actually includes 4 parameters² (location+scale+rotation). Example detections for the various Harris-based detectors can be found in Figure 2.4. Figure 2.4(d) also shows results of our scale selection approach introduced in Section 2.2.

Figure 2.4: Scale-invariant Harris points. Panels: (a) Harris-Laplace, (b) Harris-Harris, (c) Harris-Gen, (d) Harris-MSLSD. The example shows the points with their characteristic scales for each scale selection method. For illustration we omitted detections with $\sigma < 2$.

Maximally Stable Extremal Regions (MSER) (Matas et al., 2002) directly optimizes the region shape for stability. The algorithm determines a small subset of all regions, the so-called extremal regions, where each inner-pixel intensity value is less/greater than a certain threshold $t$, and all intensities around the boundary are greater/less than $t$. Among these extremal regions they select the ones that are the most stable in shape. Stability is measured by the change in region area (or boundary length) with respect to $t$. The MSER detector has been shown to perform well (Mikolajczyk and Schmid, 2004b) for matching scenes with significant viewpoint changes.

² In our experiments we do not include other stability properties, e.g., affine transformations, illumination, etc., into H-Gen; the detector is consistently used with the same criteria. Note that we have tried to add other parameters, but the results were always inferior to using location+scale+rotation.

Figure 2.5: The SIFT descriptor computed on a 4x4 grid with 8-bin orientation histograms.

2.1.2 Local Description: Scale-Invariant Feature Transform

Local image representations are typically a set of vectors computed on image patches at various locations. Possible choices of image descriptors are raw image intensities, color histograms (Swain and Ballard, 1991), wavelets (Grossmann and Morlet, 1984), steerable filters (Freeman and Adelson, 1991), moment invariants (Van Gool et al., 1996), differential invariants (Koenderink and van Doorn, 1987), complex filters (Schaffalitzky and Zisserman, 2002), shape context (Belongie et al., 2002), spin images (Lazebnik et al., 2003), scale-invariant feature transform (SIFT) (Lowe, 2004), and its variants (Ke and Sukthankar, 2004; Lazebnik et al., 2005; Mikolajczyk and Schmid, 2004a). Mikolajczyk and Schmid (2004a) compared some of these descriptors and showed that SIFT (Lowe, 2004) features perform better than the others. The evaluation of Moreels and Perona (2005) also found SIFT and shape context to perform best for object recognition. Based on their results we always use SIFT as the local image representation.

Figure 2.5 illustrates the computation of SIFT on an image patch centered on keypoint locations ($x$) and using a window size related to its scale ($\sigma$). The patch is divided by an $IS \times IS$ grid, where $IS$ is the index size, and is set to 4. For each cell an $OS$-bin histogram of local orientations (weighted by the gradient magnitudes) is computed ($OS = 8$), leading to a concatenated, $4 * 4 * 8 = 128$ dimensional real vector. These parameters were suggested by Lowe (2004), and are fixed for our experiments. For robust description, histograms are computed with a Gaussian weighting function ($\sigma$ = half window size) and a trilinear interpolation is used to distribute the value of each gradient sample into adjacent histogram bins (each orientation falls to $2^3 = 8$ bins). The SIFT descriptor is normalized to unit length, providing invariance to scalar changes in image contrast. Since the descriptor is based on gradients, it is also invariant to additive constant changes in brightness. SIFT was originally proposed to be rotation invariant, which is achieved by an efficient dominant gradient computation, which can [...]

Practically, scale-invariant interest point detections are often followed by a normalization to obtain a regular region before the computation of the descriptors. This may include an elliptical or an irregular shape normalization to unit square or a rotation of patches to a pre-computed characteristic orientation. In our experiments we also follow this principle; however, rotation invariance is only applied when indicated, i.e., in general the SIFT descriptors are computed in a non-rotation-invariant way.
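The descriptor dimensionality and the effect of unit-length normalization described for SIFT can be checked with a short sketch. It is not a SIFT implementation: the helper names are hypothetical, the toy 4-dimensional vector stands in for a real 128-dimensional histogram, and the contrast factor 2.5 is an arbitrary value chosen for the example.

```python
import math

INDEX_SIZE = 4        # IS: the patch is split into an IS x IS grid
ORIENTATION_BINS = 8  # OS: orientation bins per cell


def descriptor_length(index_size=INDEX_SIZE, bins=ORIENTATION_BINS):
    """Number of dimensions of the concatenated histogram vector."""
    return index_size * index_size * bins


def normalize_to_unit_length(descriptor):
    """Scale a descriptor to Euclidean norm 1 (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in descriptor))
    if norm == 0.0:
        return list(descriptor)
    return [v / norm for v in descriptor]


raw = [3.0, 4.0, 0.0, 12.0]
brighter = [2.5 * v for v in raw]  # same patch with contrast scaled by 2.5
a = normalize_to_unit_length(raw)
b = normalize_to_unit_length(brighter)
```

Because a scalar contrast change multiplies every gradient magnitude by the same factor, both vectors normalize to the same descriptor.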

2.2 Scale Selection by Maximally Stable Local Description

In this section we propose a new method for selecting characteristic scales for keypoint detectors and discuss the advantages and properties of the new approach. We address two key features of interest point detectors: repeatability and description stability. Repeatability determines how well the detector selects the same region under various image transformations, and is important for image matching. In practice, due to noise and object variations, the corresponding regions are never exactly the same but their underlying descriptions are expected to be similar. This is what we call the description stability, and it is important for image representation and appearance-based recognition.

The two properties, repeatability and descriptor stability, are in theory contradictory. A homogeneous region provides the most stable description, whereas its shape is in general not stable. On the other hand, if the region shape is stable, for example using edges as region boundaries, small errors in localization will often cause significant changes of the descriptor. Our solution is to apply the Maximally Stable Local Description algorithm to interest point locations only. These points have repeatable locations and informative neighborhoods. Our algorithm adjusts their scale parameters to stabilize the descriptions and rejects locations where the required stability cannot be achieved. The combination of repeatable location selection and descriptor-stabilized scale selection provides a balanced solution. In Section 2.3 we show that our new methods provide comparable performance to Harris-Laplace and LoG for image matching. Moreover, due to additional robustness (which is discussed later in this section) they outperform their counterparts.

Scale-invariant MSLSD detectors

To select characteristic locations with high repeatability we first apply an interest point detector at multiple scales. We chose two widely used complementary methods, the Harris (Harris and Stephens, 1988) and the Laplacian (Blostein and Ahuja, 1989; Lindeberg, 1998) detectors. The second step of our approach selects the characteristic scales for each keypoint location. We use description stability as the criterion for scale selection: the scale for each location is chosen such that the corresponding representation (in our case SIFT (Lowe, 2004)) changes the least with respect to scale. Figure 2.6 illustrates how the descriptors change as we increase the scale (the radius of the region) for the two keypoints. To measure the difference between SIFT descriptions we use the Euclidean distance as in (Lowe, 2004). The minima of the functions determine the scales where the descriptions are the most stable; their corresponding regions are depicted by circles in the image. Our algorithm selects the absolute minimum (shown as bright thick circles) for each point, yet in cases of extreme scale changes we recommend choosing all minima and discovering multiple sparse selections of scales per keypoint location. Multi-scale points which correspond to the same image structure often have the same absolute minimum, i.e., result in the same region. In this case only one of them is kept in our implementation. To limit the number of selected regions an additional threshold can be used to reject unstable keypoints, i.e., if the minimum change of description is above a certain value the keypoint location is rejected. For each point we use a percentage of the maximum change over scales at the point location, set to 50% in our experiments.

Figure 2.6: Two examples of scale selection. The left and right graphs show the change of the local description as a function of scale (horizontal axis, from 1.6 to 45.3) for the left and right points respectively. The scales for which the functions have local minima are shown in the image. The bright thick circles correspond to the global minima.

Our algorithm is in the following referred to as Maximally Stable Local SIFT Description (MSLSD), prefixed with H for Harris and L for Laplacian, i.e., H-MSLSD and L-MSLSD.
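The selection rule described above can be sketched as follows, assuming the descriptors of one keypoint location have already been computed at each scale. This is a simplified illustration rather than the thesis implementation: the function name is hypothetical, the change between two consecutive scales is attributed to the smaller one by assumption, and the smoothing of the change function and the finer scale step mentioned later in the text are left out.

```python
import math


def select_stable_scale(scales, descriptors, reject_ratio=0.5):
    """Pick the scale whose descriptor changes least w.r.t. the next scale.

    scales:      increasing list of scales for one keypoint location
    descriptors: one descriptor (list of floats) per scale
    Returns the selected scale, or None if the location is rejected.
    """
    # Change of description between consecutive scales (Euclidean distance).
    changes = [math.dist(descriptors[i], descriptors[i + 1])
               for i in range(len(descriptors) - 1)]
    if not changes:
        return None
    best = min(range(len(changes)), key=changes.__getitem__)
    # Reject if even the most stable scale changes too much, relative to
    # the maximum change observed at this location (50% in the text).
    if changes[best] > reject_ratio * max(changes):
        return None
    # Assumption: the change between scale i and i+1 is attributed to scale i.
    return scales[best]


stable = select_stable_scale(
    [1.0, 1.2, 1.44, 1.728],
    [[0.0, 0.0], [3.0, 4.0], [3.0, 4.5], [10.0, 10.0]])
rejected = select_stable_scale([1.0, 2.0, 3.0], [[0.0], [1.0], [2.0]])
```

In the first call the descriptor barely changes between the second and third scale, so the second scale is selected; in the second call the change is equally large everywhere and the location is rejected.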

Illumination and Rotation Invariance

Our new detectors are robust to illumination changes, as our scale selection is based on the SIFT descriptor. Recall that the SIFT descriptor is invariant to affine illumination changes.

Many applications require representations that are invariant to similarity transformations including rotation. This is achieved either by a rotation-invariant descriptor (Lazebnik et al., 2003), or, as we discussed when we introduced SIFT, by the extraction of a dominant orientation. In the case of SIFT, if detected keypoints have poorly defined orientations, the resulting descriptions may become unstable and noisy. (This is not the case if the detected regions have a centered circular texture or they are completely homogeneous.) In our algorithm, we orient the patch in the dominant direction prior to the descriptor computation for each scale. Maximal description stability is then found for locations with well-defined local gradients. In our experiments a -R suffix indicates rotation invariance. Experimental results in Section 2.4 show that our integrated estimation of the dominant orientation can significantly improve results, in contrast to other detectors lacking this type of stability.

Affine invariance

The affine extension of our detector is based on the affine adaptation in (Lindeberg and Garding, 1994; Baumberg, 2000), where the shape of the elliptical region is determined by the second moment matrix of the intensity gradient. However, unlike other detectors (Lazebnik et al., 2003; Mikolajczyk and Schmid, 2004b), we do not use this estimation as a post-processing step after scale selection, but estimate the elliptical region prior to the descriptor computation for each scale. When the affine adaptation is unstable, i.e., sensitive to small changes of the initial scale, the descriptor changes significantly and the region is rejected. This improves the robustness of our affine-invariant representation. In our experiments an -A suffix indicates affine invariance. Full affine invariance requires rotation invariance, as the shape of each elliptical region is transformed into a circle, reducing the affine ambiguity to a rotational one. Rotation normalization of the patch is, therefore, always included when affine invariance is used in our experiments.

Illustration of Scale Selection

Table 2.1 shows the number of extracted interest points for the motorbike image from Figure 2.6 (640x480). On the left, Harris and Laplacian interest points are extracted on each scale. Note that the number of multi-scale detections depends on the multiplier between neighboring scales of the image pyramid (1.2 in our case). On the right, the first line shows the Harris-Laplace detector (Mikolajczyk and Schmid, 2001) followed by the other Harris-based detectors in the next three rows. The last two rows show scale selections on Laplacian points. In practice, to further limit the number of selected [...] LoG and Harris-Harris detectors, two separate thresholds can be set, one for the location and one for the scale function. Please also note that rotation invariance, which is enabled in these examples, further reduced the number of points found by Harris-Gen, H-MSLSD, and L-MSLSD.

Detector                 # of points     Scale-invariant detector   # of points
Multi-Scale Harris       2228            Harris-Laplace             1011
Multi-Scale Laplacian    4893            Harris-Harris              283
                                         Harris-Gen                 66
                                         Our H-MSLSD                1225
                                         LoG                        2862
                                         Our L-MSLSD                1261

Table 2.1: The number of interest points extracted for the image in Figure 2.6. On the left we show multi-scale points with a 1.2 multiplier between scales. On the right we show the results after scale selection with Harris-Laplace, Harris-Harris, Harris-Gen, and our new H-MSLSD, and for LoG and our new L-MSLSD. See text for details.

Figure 2.7: Number of selected points with gradually increased multi-scale points. First line: Harris-based detectors (H-MSLSD, H-Lap, H-Har, H-Gen); second line: Laplacian-based detectors (L-MSLSD, LoG). For each family, one plot shows the number of scale-selected points and one the Selection Ratio, both as a function of the number of multi-scale points. Selection Ratio is defined in (2.3). See text for discussion.

Using a fixed image pyramid we define the scale selection ratio as

\[
\text{Selection Ratio} = \frac{\text{Scale Invariant Points}}{\text{Multi Scale Points}}
\tag{2.3}
\]

Table 2.1 shows that H-Lap, H-MSLSD, LoG and L-MSLSD provide a sufficient amount of detections, yet at the same time, their scale selection ratio is relatively high, i.e., they keep many of the multi-scale points.
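As a worked example of definition (2.3), the following sketch computes the selection ratio for each detector from the point counts listed in Table 2.1. The dictionary layout and names are illustrative choices; only the counts come from the table.

```python
# Point counts copied from Table 2.1.
MULTI_SCALE = {"Harris": 2228, "Laplacian": 4893}
SCALE_INVARIANT = {
    "H-Lap": ("Harris", 1011),
    "H-Har": ("Harris", 283),
    "H-Gen": ("Harris", 66),
    "H-MSLSD": ("Harris", 1225),
    "LoG": ("Laplacian", 2862),
    "L-MSLSD": ("Laplacian", 1261),
}


def selection_ratio(scale_invariant_points, multi_scale_points):
    """Selection Ratio (2.3): scale-invariant points / multi-scale points."""
    return scale_invariant_points / multi_scale_points


ratios = {name: selection_ratio(count, MULTI_SCALE[base])
          for name, (base, count) in SCALE_INVARIANT.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

With these counts, H-Lap keeps roughly 45% and H-MSLSD roughly 55% of the multi-scale Harris points, whereas H-Gen keeps about 3%.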

Figure 2.7 analyzes how much the detected number of points depends on the scale-space pyramid. We gradually change the scale multiplier between 1.5 and 1.03 and plot the number of scale-invariant points as a function of multi-scale points. Since the absolute number of points for each detector may easily be altered by a threshold, the interesting parts of the curves are their shapes. One would expect that after a certain level adding intermediate new layers in the pyramid should not increase the number of detections. Surprisingly, the H-Lap detector (almost straight line) always selects a certain ratio of multi-scale points. This could be caused by noise or imprecise Laplacian scale selection on Harris points. The selection ratio of the H-Har detector begins as expected, but after 3000 multi-scale points it actually starts to increase. H-Gen and H-MSLSD both demonstrate the expected descending shape. In the case of the Laplacian-based detectors (Figure 2.7, second line), we draw similar conclusions: MSLSD stops increasing the number of detections after a certain limit. The expected behavior of our MSLSD implementation is probably due to the smoothing factor introduced in our implementation during the computation of descriptor differences. It explicitly removes high-frequency noise from the scale selection function. Also note that our scale selection always uses a finer scale step than the multi-scale initialization.

2.3 Evaluation for image matching

This section evaluates the performance of the new detectors for image matching based on the evaluation framework in (Mikolajczyk et al., 2005b).³ We compare our results to H-Lap, H-Har, H-Gen and LoG respectively. The two main evaluation criteria of the framework, which we also applied, are repeatability and matching rates.

The repeatability rate measures how well the detector selects the same scene region under various image transformations. Each sequence has one reference image and five images with known homographies to the reference image. Regions are detected for the images and their accuracy is measured by the amount of overlap between the detected region and the corresponding region projected from the reference image with the known homography. Two regions are matched if their overlap error is sufficiently small:

\[
1 - \frac{R_{\mu_a} \cap R_{(H^T \mu_b H)}}{R_{\mu_a} \cup R_{(H^T \mu_b H)}} < \epsilon_O
\]

where $R_\mu$ is the elliptic or circular region extracted by the detector and $H$ is the homography between the two images. The union ($R_{\mu_a} \cup R_{(H^T \mu_b H)}$) and the intersection ($R_{\mu_a} \cap R_{(H^T \mu_b H)}$) of the detected and projected regions are computed numerically. As in (Mikolajczyk et al., 2005b) the maximum possible overlap error $\epsilon_O$ is set to 40% in our experiments. The repeatability score is the ratio between the correct matches and the smaller number of detected regions in the pair of images.

Figure 2.8: Image sequences used in the matching experiments. (a) and (b) are sequences with viewpoint change, while (c) contains illumination change. The first column shows the reference image, the other images are examples from the sequence for which the homography to the reference is known. These sequences may be downloaded from http://www.robots.ox.ac.uk/~vgg/research/affine/index.html.

³ The evaluation script may be downloaded from http://www.robots.ox.ac.uk/~vgg/research/affine/evaluation.html.
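The overlap criterion and the repeatability score can be illustrated with a minimal numerical sketch in which regions are represented as sets of pixel coordinates. This is a simplification made for the example: axis-aligned squares stand in for the elliptic regions, the projection by the homography is assumed to have been applied already, and all function names are hypothetical rather than part of the evaluation framework.

```python
def overlap_error(region_a, region_b):
    """1 - |A ∩ B| / |A ∪ B| for regions given as sets of pixel coordinates."""
    union = len(region_a | region_b)
    if union == 0:
        return 1.0
    return 1.0 - len(region_a & region_b) / union


def repeatability(num_correct, num_regions_1, num_regions_2):
    """Correct region matches divided by the smaller number of detections."""
    return num_correct / min(num_regions_1, num_regions_2)


def square(x0, y0, size):
    """Pixel set of an axis-aligned square (stand-in for an elliptic region)."""
    return {(x, y) for x in range(x0, x0 + size) for y in range(y0, y0 + size)}


MAX_OVERLAP_ERROR = 0.4  # 40%, as in the text
detected = square(0, 0, 10)
projected = square(2, 0, 10)  # hypothetical region already mapped by H
error = overlap_error(detected, projected)
is_match = error < MAX_OVERLAP_ERROR
```

Here the two squares share 80 of 120 pixels, so the overlap error is one third and the pair counts as a region match.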

The second criterion, the matching score, measures the discriminative power of the detected regions. Each descriptor is matched to its nearest neighbor in the second image. This match is marked as correct if it corresponds to a region match with maximum overlap error 40%. The matching score is the ratio between the correct matches and the smaller number of detected regions in the pair of images. See (Mikolajczyk et al., 2005b) for a more detailed discussion of the procedure.
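The matching score can be sketched in a few lines, assuming the set of geometrically valid region matches (overlap error below 40%) is already known. The function names and the toy two-dimensional descriptors are assumptions for illustration; the actual evaluation is performed by the framework of Mikolajczyk et al. (2005b).

```python
import math


def nearest_neighbor(query, candidates):
    """Index of the candidate descriptor closest to query (Euclidean distance)."""
    return min(range(len(candidates)),
               key=lambda j: math.dist(query, candidates[j]))


def matching_score(desc_1, desc_2, region_matches):
    """Fraction of correct nearest-neighbor matches.

    region_matches: set of (i, j) index pairs whose overlap error is below 40%.
    The count is normalized by the smaller number of detected regions.
    """
    if not desc_1 or not desc_2:
        return 0.0
    correct = sum(1 for i, d in enumerate(desc_1)
                  if (i, nearest_neighbor(d, desc_2)) in region_matches)
    return correct / min(len(desc_1), len(desc_2))


desc_1 = [[0.0, 0.0], [10.0, 10.0]]
desc_2 = [[0.0, 1.0], [9.0, 9.0], [50.0, 50.0]]
score = matching_score(desc_1, desc_2, region_matches={(0, 0), (1, 2)})
```

In this toy case the first descriptor finds its geometrically correct partner, while the second one is closest to a region it does not overlap with, so one of two possible matches is correct.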

2.3.1 Viewpoint Changes

The performance of our detectors for viewpoint changes is evaluated on two different image sequences with viewpoint changes from 20 to 60 degrees. Figure 2.8(a) shows sample images of the graffiti sequence. This sequence has well-defined edges, whereas the wall sequence (Figure 2.8(b)) is more texture-like.

Figure 2.9 shows the repeatability rate and the matching scores as well as the number of matches for different affine-invariant detectors. The ordering of the detectors is very similar for the criteria repeatability rate and matching score, as expected. In the following we focus on the comparison of H-MSLSD-A to the other Harris-based detectors, and L-MSLSD-A to LoG-A respectively. On the graffiti sequence (Figure 2.9, first row) the original Harris-Laplace (H-Lap-A) detector performs better than the other Harris detectors. On this sequence the new H-MSLSD-A is outperformed by H-Lap-A and H-Har-A. On the wall sequence, a more natural scene, results for H-MSLSD-A are slightly better than for H-Lap-A. This shows that the Laplacian scale selection provides good repeatability mainly in the presence of well-defined edges. In the case of the Laplacian, our detector (L-MSLSD-A) outperforms the original one (LoG) for both sequences. This can be explained by the fact that LoG-A detects a large number of unstable (poorly repeatable) regions for nearly parallel edges, see Figure 2.10. A small shift or scale change of the initial regions can lead to completely different affine parameters of LoG-A. These regions are rejected by L-MSLSD-A, as the varying affine parameters cause large changes in the local description over consecutive scale parameters. Note that in case of affine divergence all detectors reject the points. This example clearly shows that description stability may lead to more repeatable regions. In the case of natural scenes, as for example the wall sequence, this advantage is even more apparent, i.e., the difference between L-MSLSD-A and LoG-A is higher than for the graffiti sequence.

We can observe that we obtain a significantly higher number of correct matches with our L-MSLSD. This is due to a larger number of detected regions. This could increase the probability of accidental matches. To ensure that this did not bias our results, and to evaluate the effect of the detected region density, we compared the performance for different Laplacian thresholds for the L-MSLSD detector. Note that the Laplacian threshold determines the number of detections in location space, whereas the scale threshold rejects unstable locations and remains fixed throughout the thesis. Figure 2.11 shows that as the number of correct matches gradually decreases, the quality of the descriptors (matching score) stays the same. Consequently, we can conclude that the quality of the detections does not depend on the density of the extracted regions.

Figure 2.12 shows that in case of small viewpoint changes the scale-invariant ver-
