Minimization of calibrated loss functions for image classification


HAL Id: tel-00934062

https://tel.archives-ouvertes.fr/tel-00934062

Submitted on 21 Jan 2014

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Wafa Bel Haj Ali

To cite this version:

Wafa Bel Haj Ali. Minimization of calibrated loss functions for image classification. Other [cs.OH]. Université Nice Sophia Antipolis, 2013. English. NNT: 2013NICE4079. tel-00934062.

DOCTORAL SCHOOL STIC
SCIENCES ET TECHNOLOGIES DE L'INFORMATION ET DE LA COMMUNICATION

PHD THESIS

to obtain the title of

PhD of Science

of the University of Nice - Sophia Antipolis

Speciality: Computer Science

Defended by

Wafa Bel Haj Ali

Minimisation de Fonctions de Perte Calibrée pour la Classification des Images

Thesis Advisor: Michel Barlaud

prepared at I3S Sophia Antipolis, MediaCoding Team

defended on October 11, 2013

Jury:

Reviewers: Florent Perronnin, Research Manager, XRCE, Grenoble
           Patrick Perez, DR INRIA-Technicolor, Rennes

Advisor: Michel Barlaud, PR Emerite UNS, Sophia Antipolis

Examinators: Sherif Makram-Ebeid, Research Fellow, Philips Research, Paris
             Frédéric Precioso, PR UNS, Sophia Antipolis
             Cordelia Schmidt, DR INRIA, Grenoble

mother, to the memory of my father, because of their faith and pride, because of all the wonderful things they did for me...

A special dedication to my lovely husband, who always supports me, and to my cute son, who gives me joy. A special feeling to my brothers and "sisters", who always encourage me, and to my mother-in-law and my father-in-law for their love. Finally, special thanks to all my friends, with whom I spent those years and shared wonderful things. I do not want to list them, because the list could be long and I am afraid to forget

First of all, I would like to thank Professor Michel Barlaud for his support, his advice, his presence and above all his faith. Working with him during three years of thesis was a great honor.

I would like to address great thanks to Professor Richard Nock and Frank Nielsen for their collaboration and their precious contributions. I also wish to thank and acknowledge the contribution of the team of biologists Tiro, and especially Professor Thierry Pourcher and Philippe Pognonec. I do not forget Eric Debreuve, who helped me for a long time and gave me precious advice.

Finally, I thank everyone who supported me and helped me during those years:

Contents

1 General introduction
  1.1 Introduction
  1.2 Setting the problem

I Learning weighted k-NN Classifiers with Calibrated Losses

2 Universal Nearest Neighbors algorithm: UNN
  2.1 Introduction
  2.2 Basic notions and annotations
    2.2.1 The k-NN classifier
    2.2.2 Calibrated losses
  2.3 UNN, Leveraging the k-NN classifier
    2.3.1 Learning leveraged k-NNs in a boosting framework
    2.3.2 Step by step algorithm
  2.4 Implementation details and optimizations
    2.4.1 Implementation
    2.4.2 Metric setting
    2.4.3 Parameters and optimization
  2.5 Experiments
  2.6 Conclusion

3 Newton Nearest Neighbor algorithm: N^3
  3.1 Introduction
  3.2 Basic definitions
  3.3 Classification-calibrated losses
  3.4 N^3: Adaptive Newton Nearest Neighbors
    3.4.1 Algorithm
    3.4.2 A key to the properties of N^3
  3.5 Algorithmic properties of N^3
  3.6 Experimental Evaluation
    3.6.1 Settings: contenders, databases and features
    3.6.2 A divide and conquer algorithm to cope with the curse of dimensionality with low memory requirement

II Learning Linear Classifiers with Calibrated Losses

4 Stochastic Low-Rank Newton Descent algorithm: SLND
  4.1 Introduction
  4.2 Reminder
    4.2.1 Framework
    4.2.2 Calibrated risks
  4.3 SLND: Stochastic Low-Rank Newton Descent
    4.3.1 Computing gradient update
    4.3.2 Core optimization
    4.3.3 Remarks
  4.4 Experimental evaluation
    4.4.1 Settings
    4.4.2 Tuning parameters of SLND
    4.4.3 Convergence rate analysis
  4.5 SLND Theoretical convergence analysis
    4.5.1 Best rank k approximation
    4.5.2 A Weak Separability Assumption
    4.5.3 Convergence theorem
  4.6 Conclusion

III Bio-Inspired features for biological cells classification

5 Bio-Medical cells classification
  5.1 Introduction
  5.2 Region based bio-inspired descriptor
  5.3 Application to the localization of NIS protein in the cells of the thyroid gland
    5.3.1 Experiments settings
    5.3.2 Cells detection and segmentation
    5.3.3 Features and classification
  5.4 Application to Immuno-Fluorescence cells
  5.5 Conclusion

6 General conclusion

IV Appendices

A UNN optimization with metric learning
  A.1 Introduction
  A.2 Proposed approach
  A.3 Experiments
    A.3.1 Dataset
    A.3.2 Settings
    A.3.3 Robustness to dimension reduction
    A.3.4 Boosting k-NN results and comparison to the k-NN classification method
    A.3.5 Evaluation of the metric learning process

B Convergence proof of N^3 and statistical properties
  B.1 Proof of Theorem 3
  B.2 Statistical properties of N^3
  B.3 Proof of Theorem 6

C Convergence proof of SLND
  C.1 Proof sketch of Theorem 5

Bibliography

1 General introduction

1.1 Introduction

The classification task consists in predicting the category membership of unlabeled data based on its content. Classifying images is a challenging task in computer vision, since it involves different fields and applications. In fact, two main fields are studied to perform image classification and pattern recognition. The first, which belongs to the image processing field, deals with extracting features from data: a way to encode images with less complex structures that best describe the information contained in the image. The second is a machine learning task that defines the classification rule.

In computer vision tasks, image features are usually considered either as local or as global descriptors. Both have been shown to be efficient. The Gist global feature [Oliva & Torralba 2001, Oliva & Torralba 2006], for example, represents a whole scene in a unique descriptor, while the scale invariant feature transform (SIFT) [Lowe 2004] or the histogram of oriented gradients (HOG) [Dalal & Triggs 2005] represent local information in the image, allowing the description of significant objects in the scene independently. Local features are relevant for image description. In computer vision, they are well adapted for object detection and image retrieval: they give a sparse representation and cover a wide range of visual features in the image. However, for the classification task, we almost always need a global feature description, since we compare categories and not only pairs of images. Hence, we usually encode local features into global ones using statistical models. This global representation describes the occurrence of relevant visual features in the image. State-of-the-art Bag of features/words (BoF/BoW) [Sivic & Zisserman 2006] are the most common approaches in this context. Recently, an efficient feature called Fisher vectors (FV) [Perronnin et al. 2010] was extensively used for large scale image classification.

Getting efficient descriptors is not sufficient to perform categorization. Robust classification algorithms should be designed to accomplish such a challenging task. For most state-of-the-art methods, the task of image classification is addressed as a learning problem. Within this context, we distinguish two major approaches, depending on whether or not we have knowledge about the categories and about the labels of a set of data. On the one hand, unsupervised approaches, like clustering, tend to group data according to their visual content similarities. On the other hand, supervised learning uses an already labeled training set to learn classifiers

The kernel based algorithms, and more precisely the Support Vector Machine (SVM) [Cristianini & Shawe-Taylor 2000], are robust classification methods. The boosting based algorithms such as AdaBoost [Freund & Schapire 1999] are scalable, have low computational complexity and are still reliable. Nearest Neighbors approaches are fast, simple and scalable, but still poorly efficient in accuracy. Recently, a stochastic gradient descent (SGD) algorithm was introduced by [Bottou 2010], a robust and non-complex method for large scale data.

To state a supervised classification problem, we need to define our classifier first. In fact, the classification rule is a function mapping between the data features and their predicted labels. Among state-of-the-art classifiers, we can cite k-Nearest Neighbors, linear or kernel based classifiers. However, whatever the nature of a classification rule, it is often defined by a set of parameters. Therefore, we set up a learning process to reach the optimal rule. Indeed, given a set of already annotated data, we tend to estimate the optimal parameters by minimizing the classification error rate.

This thesis deals with supervised learning approaches for image classification. In particular, we are interested in the minimization of a criterion based on some specific loss functions (calibrated losses) for different kinds of classification rules. In a first part, we are interested in k-NN classifiers. A first approach revisits and expands a leveraged k-NN rule by minimizing the risk criterion in a boosting framework. In the same context, a second approach deals with a fast-convergence Newton based leveraged Nearest Neighbors rule. In a second part, we design a fast low-rank Newton descent algorithm of criterion minimization for learning scalable linear classifiers. The latter is a robust algorithm, especially for big datasets, and shows high computational performance and precision compared with state-of-the-art approaches. In a final part, this thesis presents an application of image categorization to an interesting field: bio-medical imaging. In a first step, we design a specific descriptor for such an application: a multiscale contrast based feature, well adapted for cell images. Then, we report examples of experiments on two different applications of biological cells classification.

1.2 Setting the problem

We first provide some generalities that define our supervised learning scheme. Our setting is that of multiclass, multilabel classification. In supervised learning, we have access to an annotated input set of $m$ observations, $S \doteq \{o_i = (x_i, y_i),\ i = 1, 2, ..., m\}$. Vector $x_i \in X$ is a feature data where $X$ denotes the feature space. We adopt the mainstream one-vs-all classification scheme. Then, vector $y_i \in \{-1, +1\}^C$ encodes class memberships, assuming $y_{ic} = +1$ means that sample $x_i$ belongs to class $c$ and $y_{ic} = -1$ otherwise.

The goal is to learn a classifier $H$ which is a function mapping observations in $X$ to vectors in $\mathbb{R}^C$. Given some sample $x$, the sign of coordinate $c$ in $H(x)$ ($H_c(x)$)

To define the classifier $H$, we will minimize the Empirical (or Hamming) Risk $\varepsilon^{0/1}(H, S)$, which computes over classes and observations the misclassification rate of $H$:

$$\varepsilon^{0/1}(H, S) \doteq \frac{1}{C} \sum_{c=1}^{C} \frac{1}{m} \sum_{i=1}^{m} \left[ y_{ic} H_c(x_i) < 0 \right], \tag{1.1}$$

where $[.]$ is the indicator function, equal to $1$ if the condition is true and $0$ otherwise, and which represents here the $0/1$ or empirical loss. We denote this loss $F^{0/1}$. Unfortunately, the minimization of such a problem is not tractable since the $0/1$ loss function is not convex.

A common alternative to minimizing (1.1) is to rather minimize an upper bound of this empirical risk, known as the Surrogate Risk. Let us denote the latter $\varepsilon^{F}$. This surrogate sums over observations and classes a strictly convex loss function $F : \mathbb{R} \to \mathbb{R}$ that satisfies $\forall x \in \mathbb{R},\ F^{0/1}(x) \le F(x)$:

$$\varepsilon^{F}(H, S) \doteq \frac{1}{C} \sum_{c=1}^{C} \frac{1}{m} \sum_{i=1}^{m} F(y_{ic} H_c(x_i)). \tag{1.2}$$

The loss function $F$ is based on the functional margin $y_{ic} H_c(x_i)$, or what we call the edge of classification and denote by $\rho(H_c, o_{i,c})$. The minimization of (1.2) then yields a tractable approximation to the initial problem (1.1).

The consistency of classification rules is a crucial property without which the minimization of the loss brings no strong statistical guarantee: the risk of classification should get close to the lowest possible risk with a large probability (Bayes rule). To satisfy this property, a set of loss functions relevant for learning is often

I Learning weighted k-NN Classifiers with Calibrated Losses

2 Universal Nearest Neighbors algorithm: UNN

2.1 Introduction

The nearest neighbors (NNs) rule belongs to the oldest, simplest and still most widely studied classification algorithms [Devroye et al. 1996]. It relies on a non-negative real-valued distance function. This function measures how much two observations differ from each other, and may not necessarily satisfy the requirements of metrics.

k-NN classification has proven successful, thanks to its easy implementation and its good generalization properties [Shakhnarovich et al. 2006]. A major advantage of the k-NN rule is that it does not require explicit construction of the feature space and is naturally adapted to multi-class problems. Moreover, from the theoretical point of view, straightforward bounds are known for the true risk (error) of k-NN classification with respect to Bayes optimum, even for finite samples [Nock & Sebban 2001]. In fact, it is still a challenge to reduce the true risk of the k-NN rule, usually tackled by data reduction techniques [Hart 1968].

We propose in this chapter an optimization of a generalized solution to the problem of boosting k-NN classifiers in the general multi-class setting, and for general classes of losses, not restricted to AdaBoost's exponential loss, built upon the works of [Piro et al. 2012, Nock & Nielsen 2009, Nock & Nielsen 2008]. Namely, we propose a leveraged nearest neighbor rule that generalizes the uniform k-NN rule, and whose convergence rate is guaranteed for many classification-calibrated losses, encompassing popular choices such as the logistic loss or the Matsushita loss. The voting rule is redefined as a strong classifier that linearly combines weak classifiers of the k-NN rule.

The remainder of the chapter is organized as follows: Section 2.2 briefly introduces the basic notions about k-NN classifiers and about the calibrated loss functions used later in the learning framework. Section 2.3 presents the Universal Nearest Neighbors algorithm for leveraging the k-NN classifier and Section 2.4 gives details about the optimizations brought to this algorithm and the implementation of

crit   calibrated loss F          annotation
A      $\exp(-x)$                 $F^{\exp}$
B      $\ln(1 + \exp(-x))$        $F^{\log}$
C      $-x + \sqrt{1 + x^2}$      $F^{\mathrm{mat}}$

Table 2.1: The strictly convex losses that are used in UNN. From top to bottom, losses are the exponential, logistic and Matsushita losses.

2.2 Basic notions and annotations

2.2.1 The k-NN classifier

We let $j \to_k x$ denote the assertion that example $(x_j, y_j)$, or simply example $j$, belongs to the $k$ NNs of observation $x$. We shall abbreviate $j \to_k x_i$ by $j \to_k i$; in this case, we say that example $i$ belongs to the inverse neighborhood of example $j$. To classify an observation $x$, the $k$-NN rule $H(x)$ computes the sum of class vectors of its nearest neighbors. The coordinate $c$ in $H(x)$ is:

$$H_c(x) \doteq \sum_{j \to_k x} y_{jc}. \tag{2.1}$$
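As an illustration of rule (2.1), the following is a minimal Python sketch of the uniform k-NN vote with one-vs-all class vectors in {-1, +1}^C. The function names (`l1_distance`, `knn_indices`, `knn_rule`) are hypothetical and chosen only for this example, and the L1 distance is an assumption made here for concreteness; any dissimilarity measure could be substituted.

```python
def l1_distance(a, b):
    """L1 (Manhattan) distance between two feature vectors."""
    return sum(abs(u - v) for u, v in zip(a, b))


def knn_indices(x, train_x, k):
    """Indices j of the k training samples closest to x (the j ->_k x relation)."""
    order = sorted(range(len(train_x)), key=lambda j: l1_distance(x, train_x[j]))
    return order[:k]


def knn_rule(x, train_x, train_y, k):
    """Uniform k-NN rule: H_c(x) is the sum of y_jc over the k nearest neighbors j."""
    num_classes = len(train_y[0])
    scores = [0] * num_classes
    for j in knn_indices(x, train_x, k):
        for c in range(num_classes):
            scores[c] += train_y[j][c]
    return scores


# Toy data: two samples of class 0 near the origin, one sample of class 1 far away.
train_x = [[0.0], [1.0], [10.0]]
train_y = [[+1, -1], [+1, -1], [-1, +1]]
scores = knn_rule([0.5], train_x, train_y, k=2)
```

The sign of each coordinate of `scores` gives the one-vs-all membership prediction for the corresponding class.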

2.2.2 Calibrated losses

Classification-calibrated losses are surrogates suitable for classification. To be classification-calibrated, a loss $F : \mathbb{R} \to \mathbb{R}$ is required to be convex, differentiable and such that $F'(0) < 0$ [Bartlett et al. 2006] (Theorem 4), [Vernet et al. 2011].

In this chapter, we are interested in a subset of the calibrated losses called Strictly Convex Losses (SCL). This set includes, in addition to the exponential loss, the logistic, the Matsushita and the squared loss. The strictly convex losses $F$ we are interested in are given in Table 2.1.
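To make the calibration condition concrete, here is a small Python sketch that evaluates the three losses of Table 2.1 and checks numerically that their derivative at 0 is negative and that they satisfy midpoint convexity on a grid. The helper names and the finite-difference check are assumptions introduced for illustration; this is a numerical sanity check, not a proof.

```python
import math


def exp_loss(x):
    """Exponential loss, row A of Table 2.1."""
    return math.exp(-x)


def log_loss(x):
    """Logistic loss, row B of Table 2.1."""
    return math.log(1.0 + math.exp(-x))


def mat_loss(x):
    """Matsushita loss, row C of Table 2.1."""
    return -x + math.sqrt(1.0 + x * x)


def derivative(f, x, h=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def looks_calibrated(f, grid=None, tol=1e-9):
    """Numerically check F'(0) < 0 and midpoint convexity on a grid of points."""
    if grid is None:
        grid = [i * 0.5 for i in range(-10, 11)]
    if derivative(f, 0.0) >= 0.0:
        return False
    for a in grid:
        for b in grid:
            if f(0.5 * (a + b)) > 0.5 * (f(a) + f(b)) + tol:
                return False
    return True


LOSSES = {"exp": exp_loss, "log": log_loss, "mat": mat_loss}
report = {name: looks_calibrated(f) for name, f in LOSSES.items()}
```

For these losses the derivatives at 0 are -1 (exponential), -1/2 (logistic) and -1 (Matsushita), which is what the finite-difference estimate recovers up to rounding.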

2.3 UNN, Leveraging the k-NN classifier

As previously introduced, a leveraged k-NN rule is a non-uniform voting among the k-Nearest Neighbors, defined as follows:

$$H_c(x_i) \doteq \sum_{j \to_k i} \alpha_{jc}\, y_{jc}. \tag{2.2}$$

The classifier $H_c$ is defined as a sum over a set of $T$ weak classifiers. We call the latter prototypes. So, given a set $S \doteq \{o_i = (x_i, y_i),\ i = 1, 2, ..., m\}$, one prototype, denoted by the index $j$, is a training sample of $S$ defined by its feature vector $x_j$, its label $y_{jc}$ and later by its leveraging weight $\alpha_{jc}$. Those weights are

2.3.1 Learning leveraged k-NNs in a boosting framework

Voting weights $\alpha_{jc}$ in (2.2) are solutions of the minimization of the following average surrogate risk:

$$\varepsilon^{F}(H, S) \doteq \frac{1}{C} \sum_{c=1}^{C} \underbrace{\frac{1}{m} \sum_{i=1}^{m} F(y_{ic} H_c(x_i))}_{\varepsilon^{F}(H_c, S)}. \tag{2.3}$$

Since we are in the one-vs-all learning scheme, we can minimize the per-class risk $\varepsilon^{F}(H_c, S)$ corresponding to $H_c$. To do so, one alternative is to use a boosting-like approach and then minimize each surrogate $\varepsilon^{F}(H_c, S)$ iteratively. In fact, at each iteration we pick one prototype $j \in S$ for which the classification rule is defined as the following weak classifier:

$$h_{jc}(x_i) \doteq \alpha_{jc}\, y_{jc}; \quad j \to_k i \tag{2.4}$$

such that:

$$H_c(x_i) \doteq \sum_{j \to_k i} h_{jc}(x_i). \tag{2.5}$$

Thus the local risk (of the weak classifier) is the sum of losses due to $h_{jc}$ over the training set $S$:

$$\varepsilon^{F}(h_{jc}, S) \doteq \frac{1}{m} \sum_{i=1}^{m} F(y_{ic} h_{jc}(x_i)). \tag{2.6}$$

Note that the classifier $h_{jc}$ follows the leveraged k-NN rule, so only the subset of $S$ for which sample $j$ is a k-NN is concerned by the vote of $j$. We denote this subset by $R_j \subseteq S$; it is exactly the set of inverse nearest neighbors of $j$, and its cardinality is equal to $n_j$. Hence we reduce once again the risk function that should be minimized to the following:

$$\varepsilon^{F}(h_{jc}, R_j) \doteq \frac{1}{n_j} \sum_{i=1}^{n_j} F(y_{ic} h_{jc}(x_i)). \tag{2.7}$$

We need to find the optimal voting weight that minimizes the risk function in (2.7). To do so, we iteratively update the leveraging weight of the current weak classifier / prototype $j$ in a boosting-like procedure. Hence, we give the samples classification weights denoted by $w_{ic}$ and progressively update them according to the misclassification of $h_{jc}$. That is, the weights of badly classified samples should be increased and those of well classified ones decreased. We consider the following updating rules for the prototype weights $\alpha_{jc}$, the classification rule $h_{jc}$ and the training sample weights $w_{ic}$:

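$$\alpha^{t}_{jc} = \alpha^{t-1}_{jc} + \delta_{jc}. \tag{2.8}$$

$$h^{t}_{jc}(x_i) = h^{t-1}_{jc}(x_i) + \delta_{jc}\, y_{jc}. \tag{2.9}$$

$$w^{t}_{ic} = -F'(y_{ic} h^{t}_{jc}(x_i)) \tag{2.10}$$

Actually, at each iteration $t$ we should minimize $\varepsilon^{F}(h^{t}_{jc}, R_j)$ with respect to $\delta_{jc}$. Let us first replace $h_{jc}$ in (2.7) by its expression in (2.9). Then, the risk function becomes

$$\varepsilon^{F}(h^{t}_{jc}, R_j) \doteq \frac{1}{n_j} \sum_{i=1}^{n_j} F(y_{ic} h^{t-1}_{jc}(x_i) + y_{ic}\, \delta_{jc}\, y_{jc}), \tag{2.11}$$

and its first derivative with respect to $\delta_{jc}$ is expressed as follows:

$$\frac{\partial \varepsilon^{F}(h^{t}_{jc}, R_j)}{\partial \delta_{jc}} = \frac{1}{n_j} \sum_{i=1}^{n_j} y_{ic}\, y_{jc}\, F'(y_{ic} h^{t-1}_{jc}(x_i) + y_{ic}\, \delta_{jc}\, y_{jc}) \tag{2.12}$$

$$= \frac{1}{n_j} \sum_{i=1}^{n_j} y_{ic}\, y_{jc}\, F'\!\left(F'^{-1}(-w^{t-1}_{ic}) + y_{ic}\, \delta_{jc}\, y_{jc}\right) \tag{2.13}$$

Finally, finding $\delta_{jc} = \arg\min_{\delta} \varepsilon^{F}(h^{t}_{jc}, R_j)$ amounts to solving the following general equation based on the surrogate loss $F$:

$$\sum_{i=1}^{n_j} y_{ic}\, y_{jc}\, F'\!\left(F'^{-1}(-w^{t-1}_{ic}) + y_{ic}\, \delta_{jc}\, y_{jc}\right) = 0. \tag{2.14}$$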

2.3.2 Step by step algorithm

The different steps of UNN are detailed in the general Algorithm 1. Step [I.0] of the algorithm consists in choosing the prototype $j \in \{1, 2, ..., m\}$ (weak classifier). In fact, at each iteration, the index to leverage, $j$, is obtained by a call to a weak index chooser oracle Wic$(., ., .)$. The selection of the index $j$ of the next weak classifier could be done randomly, or using some criterion. In the second case, we pick $T \ge m$, and let $j$ be chosen by Wic$(\{1, 2, ..., m\}, t, c)$ such that $\delta_j$ is large enough. Each $j$ can be chosen more than once, or one can restrict this index to be chosen only once.

The derivation of $\delta_j$, solution of (2.15), and of $w_i$ in (2.16) will be detailed later. Those expressions are given in Table 2.2 for each of the losses considered in Table 2.1. $W^{+}_{jc}$ and $W^{-}_{jc}$, used in Table 2.2, are respectively the sum of weights of the positive (good) inverse NNs and that of the negative (bad) ones:

$$W^{+}_{jc} = \sum_{i=1}^{n_j} [y_{ic}\, y_{jc} > 0]\, w_{ic}; \tag{2.17}$$

$$W^{-}_{jc} = \sum_{i=1}^{n_j} [y_{ic}\, y_{jc} < 0]\, w_{ic}; \tag{2.18}$$

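Algorithm 1: Universal Nearest Neighbors UNN($S$, $F$)

Input: $S = \{(x_i, y_i),\ i = 1, 2, ..., m\}$, loss $F$;
for $c = 1, 2, ..., C$ do
    Let $\alpha_{jc} \leftarrow 0,\ \forall j$;
    Let $w_i \leftarrow -F'(0) \in \mathbb{R}^{m}_{+*},\ \forall i$;
    for $t = 1, 2, ..., T$ do
        [I.0] Let $j \leftarrow$ Wic$(\{1, 2, ..., m\}, t)$;
        [I.1] Let $\delta_j \in \mathbb{R}$ solution of:
              $$\sum_{i=1}^{n_j} y_{ic}\, y_{jc}\, F'\!\left(F'^{-1}(-w_{ic}) + y_{ic}\, \delta_{jc}\, y_{jc}\right) = 0; \tag{2.15}$$
        [I.2] $\forall i : j \sim_k i$, let
              $$w_i \leftarrow -F'\!\left(y_{ic}\, \delta_{jc}\, y_{jc} + F'^{-1}(-w_i)\right); \tag{2.16}$$
        [I.3] Let $\alpha_{jc} \leftarrow \alpha_{jc} + \delta_j$;
Output: $h_c(x) = \sum_{i \sim_k x} \alpha_{ic}\, y_{ic},\ \forall c$;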

For now, we give some details about the derivation leading to the expressions in Table 2.2. We first consider the exponential loss function A in Table 2.1, which is a special case since it leads to a closed-form solution for $\delta_{jc}$. Then we explain how to solve the problem in the general case. Consider equation (2.14) for the exponential risk function:

$$\sum_{i=1}^{n_j} y_{ic}\, y_{jc} \left(-\exp\!\left(-\left(-\ln(w^{t-1}_{ic}) + y_{ic}\, \delta_{jc}\, y_{jc}\right)\right)\right) = 0 \tag{2.19}$$

$$\sum_{i=1}^{n_j} y_{ic}\, y_{jc} \exp(\ln(w^{t-1}_{ic})) \exp(-y_{ic}\, \delta_{jc}\, y_{jc}) = 0 \tag{2.20}$$

$$\sum_{i=1}^{n_j} y_{ic}\, y_{jc}\, w^{t-1}_{ic} \exp(-y_{ic}\, \delta_{jc}\, y_{jc}) = 0 \tag{2.21}$$

$$\sum_{i=1}^{n_j} [y_{ic}\, y_{jc} > 0]\, w^{t-1}_{ic} \exp(-\delta_{jc}) - \sum_{i=1}^{n_j} [y_{ic}\, y_{jc} < 0]\, w^{t-1}_{ic} \exp(\delta_{jc}) = 0; \tag{2.22}$$

In expression (2.22) we split the sum over the inverse NNs so as to separate the set $R_j$ into $R^{+}_{j}$ and $R^{-}_{j}$, where $R^{+}_{j}$ denotes the good inverse NNs (i-NNs with the same label as $j$) and $R^{-}_{j}$ denotes the bad ones (i-NNs which do not have the same label as $j$). Then, using definitions (2.17) and (2.18), we get:

$$W^{+}_{jc} \exp(-\delta_{jc}) - W^{-}_{jc} \exp(\delta_{jc}) = 0, \tag{2.23}$$

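$F$: $F^{\exp}$
    $\delta_{jc}$, see (2.17) and (2.18): $\frac{1}{2} \ln\!\left(\frac{W^{+}_{jc}}{W^{-}_{jc}}\right)$
    $g : w_i \leftarrow g(w_i)$: $w_i \exp(-y_{ic}\, y_{jc}\, \delta_{jc})$

$F$: $F^{\log}$
    $\delta_{jc}$, see (2.17) and (2.18): $\ln\!\left(\frac{W^{+}_{jc}}{W^{-}_{jc}}\right)$
    $g : w_i \leftarrow g(w_i)$: $\frac{w_i \exp(-y_{ic}\, y_{jc}\, \delta_{jc})}{1 - w_i \left(1 - \exp(-y_{ic}\, y_{jc}\, \delta_{jc})\right)}$

$F$: $F^{\mathrm{mat}}$
    $\delta_{jc}$, see (2.17) and (2.18): $\frac{2 W^{+}_{jc} - 1}{2 \sqrt{W^{+}_{jc} (1 - W^{+}_{jc})}}$
    $g : w_i \leftarrow g(w_i)$: $1 - \frac{1 - w_i + \sqrt{w_i (2 - w_i)}\, \delta_{jc}\, y_{ic}\, y_{jc}}{\sqrt{1 + \delta^{2}_{jc}\, w_i (2 - w_i) + 2 (1 - w_i) \sqrt{w_i (2 - w_i)}\, \delta_{jc}\, y_{ic}\, y_{jc}}}$

Table 2.2: Computation of $\delta_{jc}$ and the weight update rule of our implementation of UNN, for the strictly convex losses in Table 2.1. UNN leverages example $j$ for class $c$, and the weight update is that of example $i$ (see text for details and notations).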

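which leads to the following final expression of $\delta_{jc}$:

$$\delta_{jc} = \frac{1}{2} \ln\!\left(\frac{W^{+}_{jc}}{W^{-}_{jc}}\right). \tag{2.24}$$

Therefore, the iterative update of the boosting weights $w^{t}_{ic}$ in (2.10) as a function of $\delta_{jc}$ is expressed as below:

$$w^{t}_{ic} = \exp\!\left(-y_{ic} h^{t}_{jc}(x_i)\right) \tag{2.25}$$

$$= \exp\!\left(-y_{ic} h^{t-1}_{jc}(x_i) - y_{ic}\, y_{jc}\, \delta_{jc}\right) \tag{2.26}$$

$$= w^{t-1}_{ic} \exp(-y_{ic}\, y_{jc}\, \delta_{jc}) \tag{2.27}$$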
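The exponential-loss case can be sketched end to end in a few lines of Python. This is only an illustrative sketch for a single class $c$, under assumptions that are not taken from the thesis: the index chooser is a plain round-robin over the samples, a sample is excluded from its own neighborhood during training, the distance is L1, and a small `eps` guards the logarithm when one of the two weight sums is zero. All function names are hypothetical.

```python
import math


def l1(a, b):
    """L1 distance between two feature vectors."""
    return sum(abs(u - v) for u, v in zip(a, b))


def inverse_neighborhoods(train_x, k):
    """R_j for every j: indices i whose k nearest neighbors contain j.

    Assumption: a sample is not counted among its own neighbors.
    """
    m = len(train_x)
    inverse = [[] for _ in range(m)]
    for i in range(m):
        others = [j for j in range(m) if j != i]
        others.sort(key=lambda j: l1(train_x[i], train_x[j]))
        for j in others[:k]:
            inverse[j].append(i)
    return inverse


def train_unn_exp(train_x, labels, k, num_iters, eps=1e-9):
    """Leveraging coefficients alpha_j for one class, exponential loss."""
    m = len(train_x)
    inverse = inverse_neighborhoods(train_x, k)
    alpha = [0.0] * m
    weights = [1.0] * m  # -F'(0) = 1 for F(x) = exp(-x)
    for t in range(num_iters):
        j = t % m  # hypothetical round-robin index chooser (assumption)
        # Weight sums (2.17) and (2.18) over the inverse neighborhood of j.
        w_plus = sum(weights[i] for i in inverse[j] if labels[i] * labels[j] > 0)
        w_minus = sum(weights[i] for i in inverse[j] if labels[i] * labels[j] < 0)
        # Closed form (2.24), smoothed by eps to stay finite.
        delta = 0.5 * math.log((w_plus + eps) / (w_minus + eps))
        # Weight update (2.27) on the inverse neighbors of j.
        for i in inverse[j]:
            weights[i] *= math.exp(-labels[i] * labels[j] * delta)
        alpha[j] += delta
    return alpha


def predict_unn(x, train_x, labels, alpha, k):
    """Leveraged k-NN score: sum of alpha_j * y_j over the k nearest prototypes."""
    order = sorted(range(len(train_x)), key=lambda j: l1(x, train_x[j]))
    return sum(alpha[j] * labels[j] for j in order[:k])


# Two well separated clusters on the real line.
train_x = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
labels = [+1, +1, +1, -1, -1, -1]
alpha = train_unn_exp(train_x, labels, k=2, num_iters=6)
```

On this toy set every prototype only has inverse neighbors of its own class, so all leveraging coefficients come out positive and `predict_unn` returns a positive score near the first cluster and a negative score near the second.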

For the remaining loss functions, it is not possible to directly solve (2.15). We then assume that $F'\!\left(F'^{-1}(-w_{ic}) + y_{ic}\, \delta_{jc}\, y_{jc}\right) \simeq -w_{ic}\, F'(y_{ic}\, \delta_{jc}\, y_{jc})$. Therefore, up to a constant factor that does not affect the root, equation (2.14) becomes:

$$\sum_{i=1}^{n_j} y_{ic}\, y_{jc}\, w^{t-1}_{ic}\, F'(y_{ic}\, \delta_{jc}\, y_{jc}) = 0 \tag{2.28}$$

$$\sum_{i=1}^{n_j} [y_{ic}\, y_{jc} > 0]\, w^{t-1}_{ic}\, F'(\delta_{jc}) - \sum_{i=1}^{n_j} [y_{ic}\, y_{jc} < 0]\, w^{t-1}_{ic}\, F'(-\delta_{jc}) = 0 \tag{2.29}$$

$$W^{+}_{jc}\, F'(\delta_{jc}) - W^{-}_{jc}\, F'(-\delta_{jc}) = 0. \tag{2.30}$$

Replacing $F'$ in (2.30) and (2.10) by its expression for each of the considered losses directly leads to Table 2.2. The convergence proof and the theoretical properties of UNN are detailed in [Nock et al. 2012].
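As a numerical cross-check of this step, the sketch below solves equation (2.30) by bisection for the three losses of Table 2.1 and compares the root with the closed forms listed in Table 2.2. The bisection bracket, the iteration count and the sample values of the two weight sums are arbitrary choices made for this illustration; the weight sums are taken to add up to 1, which the Matsushita closed form relies on.

```python
import math


def d_exp(x):
    """Derivative of the exponential loss exp(-x)."""
    return -math.exp(-x)


def d_log(x):
    """Derivative of the logistic loss ln(1 + exp(-x))."""
    return -1.0 / (1.0 + math.exp(x))


def d_mat(x):
    """Derivative of the Matsushita loss -x + sqrt(1 + x^2)."""
    return -1.0 + x / math.sqrt(1.0 + x * x)


def solve_delta(d_loss, w_plus, w_minus, lo=-50.0, hi=50.0, iters=200):
    """Root of W+ F'(delta) - W- F'(-delta) by bisection.

    The left-hand side is non-decreasing in delta because F is convex.
    """
    def g(delta):
        return w_plus * d_loss(delta) - w_minus * d_loss(-delta)

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Illustrative weight sums, normalized so that W+ + W- = 1.
w_plus, w_minus = 0.7, 0.3

closed_forms = {
    "exp": 0.5 * math.log(w_plus / w_minus),
    "log": math.log(w_plus / w_minus),
    "mat": (2.0 * w_plus - 1.0) / (2.0 * math.sqrt(w_plus * (1.0 - w_plus))),
}
numeric = {
    "exp": solve_delta(d_exp, w_plus, w_minus),
    "log": solve_delta(d_log, w_plus, w_minus),
    "mat": solve_delta(d_mat, w_plus, w_minus),
}
```

With these values the numerical roots agree with the three closed-form expressions to within floating-point precision.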

2.4 Implementation details and optimizations

2.4.1 Implementation

The first one is that we can face an unbalancing problem, especially because we are considering a one-vs-all framework. To cope with such a problem we use adaptive weights $w_{ic}$. That is: initially, the $w^{0}_{ic}$ are weighted, according to whether or not they belong to class "$c$", by the proportion of positive (respectively negative) samples in this class, such that the sum of weights is equal to $1$. Then, at each iteration, we normalize the weights $w_{ic},\ i = 1..m$, to unity after the update in (2.16).

Note that when $W^{+}_{j}$ and/or $W^{-}_{j}$ is zero, $\delta_{jc}$ in Table 2.2 is not finite. We suggest a simple alternative to cope with this issue: we use $(W^{+}_{j} + \varepsilon)$ instead of $W^{+}_{j}$ and $(W^{-}_{j} + \varepsilon)$ instead of $W^{-}_{j}$.

Then, for the choice of the prototype $j$ in step [I.0] of Algorithm 1, we adopt the following scheme: we pick $T \le m$, consider the $m$ samples, choose $j$ such that $\alpha_{jc}$ is large enough and allow each example to be chosen only once.
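The two safeguards described above (class-balanced initial weights and the smoothing of the weight sums) might be sketched as follows. The exact balancing formula is not spelled out in the text, so the scheme used here, where the positive samples share half of the total mass and the negative samples share the other half, is an assumption for illustration only, as are the function names and the default value of `eps`.

```python
import math


def balanced_initial_weights(labels):
    """Initial weights summing to 1, with equal total mass on each side.

    Assumption: positives share mass 1/2 and negatives share mass 1/2;
    if one side is empty, fall back to uniform weights.
    """
    m = len(labels)
    m_pos = sum(1 for y in labels if y > 0)
    m_neg = m - m_pos
    if m_pos == 0 or m_neg == 0:
        return [1.0 / m] * m
    return [0.5 / m_pos if y > 0 else 0.5 / m_neg for y in labels]


def normalize(weights):
    """Rescale weights so that they sum to 1 (done after each update)."""
    total = sum(weights)
    return [w / total for w in weights]


def smoothed_delta_exp(w_plus, w_minus, eps=1e-6):
    """Exponential-loss step with (W + eps) in place of W to keep it finite."""
    return 0.5 * math.log((w_plus + eps) / (w_minus + eps))


# One positive sample against three negatives (one-vs-all imbalance).
labels = [+1, -1, -1, -1]
w0 = balanced_initial_weights(labels)
delta_when_no_negatives = smoothed_delta_exp(0.3, 0.0)
```

Here the single positive sample receives weight 1/2 and each negative sample receives 1/6, and the smoothed step stays finite even though the negative weight sum is zero.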

2.4.2 Metric setting

Two major issues arise when implementing our UNN algorithm in practice. The first one concerns the distance (or, more generally, the dissimilarity measure) used for the k-NN search. The second one consists in setting the value of $k$ for both training and testing our prototype-based classifiers (see Section 2.4.3).

In fact, defining the most appropriate dissimilarity measure for k-NN search is particularly challenging when dealing with very high-dimensional feature vectors like the ones commonly used for categorization. Indeed, the standard metric distances may be inadequate when such vectors are generated by sophisticated pre-processing stages (e.g., vector quantization or unsupervised dictionary learning), thus lying on complex high-dimensional manifolds. In general, this should require an additional distance learning stage in order to define the optimal dissimilarity measure for the particular type of data at hand. In this respect, our UNN method has the advantage of being fully complementary with any metric learning algorithm [Bel Haj Ali et al. 2010], acting on top of the k-NN search (see Appendix A). Furthermore, since we use here BoF based on normalized histograms, we prefer to use the standard $L_1$ distance and thus avoid expensive computational tasks.

2.4.3 Parameters and optimization

Selecting a good value for $k$ amounts to learning parameter-dependent weak classifiers, where the parameter $k$ specifies the size of the voting neighborhood in classification rule (2.2). From the theoretical standpoint, a brute-force approach is possible with boosting: one can define multiple candidate weak classifiers per example, one for each value of $k$, i.e., for each neighborhood size, and then learn prototypes by optimizing the surrogate risk function over $k$ as well. This strategy has the advantage of enabling direct learning of $k$ at training time. However, training several weak classifiers per example without computation tricks would potentially severely

# of categories    10      20      30      40      50      60      100
k-NN BoF           76.38   57.28   45.00   40.27   36.09   32.30   24.67
SVM BoF            83.85   67.65   58.21   53.45   47.81   44.09   35.31
AdaBoost BoF       75.37   58.21   45.57   37.75   32.41   29.01   26.72
UNN_s BoF          84.28   70.44   58.49   51.07   46.34   41.80   31.61

Table 2.3: Classification performance of the different methods we tested, in terms of average accuracy or mAP, as a function of the number of categories.

tion which, to classify new observations, convolutes weighting with a simple density estimation suggested by boosting. Typically, we consider a logistic estimator for a Bernoulli prior which vanishes with the rank of the example in the neighbors, thus decreasing the importance of the farthest neighbors:

$$\hat{p}(j) = \beta_j = \frac{1}{1 + \exp(\lambda (j - 1))}, \tag{2.31}$$

with $\lambda > 0$. The shape prior is chosen this way because it was shown that boosting, as carried out in a number of algorithms not restricted to the induction of linear separators [Nock & Nielsen 2009], locally fits logistic estimators for Bernoulli priors. The soft version of UNN we obtain, called UNN$_s$ (for Soft UNN), replaces (2.2) by:

$$h_c(x) = \sum_{j \sim_k x} \beta_j\, \alpha_{jc}\, y_{jc}. \tag{2.32}$$

Notice that it is useless to enforce the normalization of the coefficients $\beta_j$ in (2.31), because it would not change the classification of UNN$_s$. Notice also that the $\beta_j$ in (2.32) are used only to classify new observations: the training steps of UNN$_s$ are the same as those of UNN, and so UNN$_s$ meets the same theoretical properties as UNN described in [Nock et al. 2012].
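The soft voting rule can be illustrated with a short Python sketch of (2.31) and (2.32). It assumes that the index $j$ in $\beta_j$ is the rank of the neighbor (1 for the nearest), that the leveraging coefficients `alpha[j][c]` have already been learned, and that neighbors are ranked with the L1 distance; the function names are hypothetical.

```python
import math


def l1(a, b):
    """L1 distance between two feature vectors."""
    return sum(abs(u - v) for u, v in zip(a, b))


def rank_weight(rank, lam):
    """beta for a neighbor of the given rank (1 = nearest), as in (2.31)."""
    return 1.0 / (1.0 + math.exp(lam * (rank - 1)))


def predict_unn_soft(x, train_x, train_y, alpha, k, lam):
    """Soft UNN scores: rank-weighted leveraged vote over the k nearest prototypes."""
    order = sorted(range(len(train_x)), key=lambda j: l1(x, train_x[j]))[:k]
    num_classes = len(train_y[0])
    scores = [0.0] * num_classes
    for rank, j in enumerate(order, start=1):
        beta = rank_weight(rank, lam)
        for c in range(num_classes):
            scores[c] += beta * alpha[j][c] * train_y[j][c]
    return scores


# Toy prototypes with unit leveraging coefficients (illustrative values).
train_x = [[0.0], [1.0], [5.0]]
train_y = [[+1, -1], [-1, +1], [-1, +1]]
alpha = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
scores = predict_unn_soft([0.2], train_x, train_y, alpha, k=2, lam=1.0)
```

Because the nearest prototype receives weight 1/2 and the second one a smaller weight, the nearest prototype dominates the vote for the query at 0.2 even though the two neighbors carry opposite labels.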

2.5 Experiments

In this section, we present experimental results of UNN for image categorization. Our experiments aim at carefully quantifying and explaining the gains brought by boosting on k-NN voting on real image databases. In particular, we propose in this section a precision and accuracy comparison between UNN and k-NN, SVM and AdaBoost, using Bag-of-Features (BoF) as descriptors. Here, we extracted $2500$ SIFT [Lowe 2004] per image to form a codebook of $500$ visual words. BoF, of dimension $500$, are then computed by vector quantizing the local SIFT features using this codebook.

We selected 100 categories from the SUN database [Xiao et al. 2010]. We kept all the images of each category and the inherent unbalancing of the original database.

Figure 2.1: Classification performance of the tested methods as a function of the number of image categories.

averaging classification rates over categories (diagonal of the confusion matrix) and then averaging those values after repeating each experiment 10 times on different folds. To speed up processing time, we used the Yael toolbox (footnote 1) for a fast implementation of k-NN. Furthermore, we also developed an optimized version of our program, which exploits multi-thread functionalities. We denote this version as UNN$_s$(MT). All the experiments were run on an Intel Xeon X5690 12-core processor at 3.46 GHz.

We compared UNN$_s$, SVM with Gaussian RBF kernel, and AdaBoost with decision stumps (i.e., decision trees with a single internal node), using BoF descriptors. In particular, we followed the guidelines of [Hsu et al. 2003] for carrying out the SVM experiments, thus carrying out cross-validation for selecting the best parameter values for SVM.

In Table 2.3 we report the a ura y for ea h lassi ation method. Results in

thesetablesareprovided asafun tionofthenumberofimage ategories. Themost

relevantresults obtainedarealso displayedinFigure2.1(mAPasafun tion ofthe

number of ategories) and Figures 2.2 and 2.3, for the training and lassi ation

times, respe tively.

Accuracy results display that UNN_s dramatically outperforms AdaBoost (and k-NN as well); this result, which somehow experimentally confirms that UNN successfully exploits the boosting theory, was quite predictable, as UNN builds a piecewise linear decision function in the initial domain X, while AdaBoost builds a linear separator in this domain. SVM, on the other hand, have access to non-linear fitting of data, by lifting the data to a domain whose dimension far exceeds that of X. Yet, SVM testing results are somehow not as good as one might expect from this clear-cut theoretical advantage over UNN, and also from the fact that we carried out SVM with significant parameter optimization [Hsu et al. 2003]. Indeed, UNN_s even beats SVMs over 10 to 30 categories, being slightly outperformed by them on more categories.

¹ Code available at https://gforge.inria.fr/frs/?group_id=2151
² [...]

Figure 2.2: Training time as a function of the number of image categories.

# categories        10     20     30     40     50     60     100
# training images   951    2,162  3,099  4,381  5,540  6,568  11,186
k-NN                0
SVM                 2.4    27     83     226    472    806    4526
AdaBoost            96     218    341    442    559    662    1128
UNN_s               1.7    16     58     150    295    498    2146
UNN_s (MT)          0.3    2.5    7.8    19     36     53     257

Table 2.4: Computation time [s] for the training phase.

# categories        10     20     30     40     50     60     100
# test images       951    2,162  3,099  4,381  5,540  6,568  11,186
k-NN                0.20   1.0    2.0    4.0    6.0    9.0    22.0
SVM                 0.25   5.7    13     31     56     80     260
AdaBoost            0.02   0.1    0.25   0.43   0.67   0.95   2.74
UNN_s               0.21   0.72   1.6    2.7    4.2    5.9    17
UNN_s (MT)          0.08   0.2    0.37   0.58   0.84   1.11   3.25

Table 2.5: Computation time [s] for the testing phase.

In Tables 2.4 and 2.5 we report the corresponding computation time (in seconds) for the training and classification phase, respectively. Obviously, the computation times over training and testing are also a key for exploiting the experimental results. Table 2.4 displays that, while the training time of AdaBoost is linear, UNN_s is a logical clear-cut winner over SVM for training: it achieves speedups ranging in between two and more than seventeen over SVM. Thus, UNN provides the best precision/time trade-off among the tested methods, which suggests that UNN might well be more than a legal contender to classification methods dealing with huge domains, or domains where the testing set is huge compared to the training set, which is the case, for instance, for cell classification in biological images. Finally, we have only scratched experimental optimizations for UNN, and have not optimized [...]
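As an illustration of how the speedups quoted above can be read off Table 2.4, the following minimal sketch divides the SVM training times by those of UNN_s and UNN_s(MT). The variable and function names are ours; the interpretation that "two" and "more than seventeen" correspond to the 100-category column for UNN_s and UNN_s(MT) respectively is our reading of the table, not a statement from the text.

```python
# Training times in seconds, copied from Table 2.4 (10 to 100 categories).
categories = [10, 20, 30, 40, 50, 60, 100]
svm = [2.4, 27, 83, 226, 472, 806, 4526]
unn = [1.7, 16, 58, 150, 295, 498, 2146]
unn_mt = [0.3, 2.5, 7.8, 19, 36, 53, 257]


def speedups(baseline, method):
    """Ratio baseline_time / method_time for each number of categories."""
    return [b / t for b, t in zip(baseline, method)]


speedup_unn = speedups(svm, unn)
speedup_unn_mt = speedups(svm, unn_mt)

for c, s, s_mt in zip(categories, speedup_unn, speedup_unn_mt):
    print(f"{c:>3} categories: UNN_s x{s:.1f}, UNN_s(MT) x{s_mt:.1f}")
```

The single-thread version stays between roughly 1.4 and 2.1 times faster than SVM, while the multi-thread version accounts for the larger ratios.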

2.6 Conclusion

In this chapter, we contribute to fill an important void of NN methods, showing how boosting can be transferred to k-NN classification, with convergence rate guarantees for a large number of surrogates. UNN, which builds upon the works of [Piro et al. 2012], generalizes classic k-NN to weighted voting where weights, the so-called leveraging coefficients, are iteratively learned by UNN. We prove that this algorithm converges to the global optimum of many surrogate risks in competitive times under very mild assumptions. Compared to [Piro et al. 2012], we enlarge the set of formal boosting flavors of UNN, from a singleton associated to the exponential loss to a set encompassing popular losses like the logistic and Matsushita losses.

Our approach is also the first extensive assessment of UNN on computer vision related tasks. Comparisons with k-NN, support vector machines and AdaBoost, using Bag-of-Feature descriptors, on real domains, display the ability of UNN to be competitive with its contenders, achieving high accuracy in comparatively reduced training and testing times.

An optimization approach using metric learning is not reported in this chapter, since it does not concern our learning framework; it is reported in Appendix A ([Bel Haj Ali et al. 2010]). It includes blending UNN with an approach that learns [...]

Newton Nearest Neighbor algorithm: N³

3.1 Introduction

Large scale image classification implies satisfying tight time, memory and numerical processing requirements. Coping with them involves in general two kinds of approaches. For the first one, scalability goes hand in hand with simplification: algorithms are built around sophisticated, state-of-the-art approaches that are simplified to fit into these requirements, such as Support Vector Machines (SVM) with linear kernels [Shalev-Shwartz et al. 2007], or (Ada)Boosting with weight clipping and simple stumps as weak classifiers [Ali et al. 2011].

The second kind of approaches use as core very simple algorithms that already fit into these requirements, and then, from this basis, elaborate more complex approaches with improved performances: this is the case for the k-nearest neighbor (NN) classifier, or the nearest class mean classifier embedded with metric learning [Mensink et al. 2012, Weinberger & Saul 2009]. From the experimental standpoint, these latter approaches obtain surprisingly competitive results with respect to the former ones. In fact, they may have another advantage: while theoretical guarantees barely survive extreme simplification, elaborating on a core makes it perhaps easier to preserve its theoretical properties, such as its statistical consistency (e.g. for k-NN [Devroye et al. 1996]).

Our algorithm belongs to the second category of approaches, as we elaborate on the ordinary k-NN classifier. Our approach is different from, but complementary to, metric learning approaches, as we choose to adapt k-NN to the boosting framework. It is in the same line of works as the UNN algorithm introduced in Chapter 2, but the present one is of Newton-Raphson type, and thus more adapted for large scale classification.

Our high-level contribution is threefold: a novel Adaptive Newton-Raphson scheme to leverage k-NN, called N³, an extensive theoretical analysis of the approach, and fine-grained experimental validations on large and challenging domains: SUN and Caltech. To be more specific, the novelty of our method includes:

(i) a proof of the boosting ability of N³, the first boosting-compliant convergence rates for a Newton-type approach to convex loss minimization to the best of our knowledge;

(iii) a proof that the output of N³ directly yields efficient estimators of posteriors;

(iv) [...] curse of dimensionality with low memory requirement;

(v) experimentally optimized core-processing stages for N³ with linear cost per boosting iteration.

Experimental results display that N³ manages to challenge the accuracy of sophisticated approaches while being faster and requiring low memory.

The remainder of the chapter is organized as follows: Section 3.2 states basic definitions. Section 3.3 presents classification-calibrated losses. Section 3.4 presents the N³ algorithm. Section 3.5 discusses its theoretical properties. Section 3.6 presents experiments, and Section 3.7 concludes the chapter.

3.2 Basic definitions

We first provide some basic definitions. Our setting is multiclass, multilabel classification. We have access to an input set of $m$ examples (or prototypes), $S \doteq \{(x_i, y_i), i = 1, 2, ..., m\}$. Vector $y_i \in \{-1, +1\}^C$ encodes class memberships, assuming $y_{ic} = +1$ means that observation $x_i$ belongs to class $c$. A classifier $H$ is a function mapping observations to vectors in $\mathbb{R}^C$. Given some observation $x$, the sign of coordinate $c$ in $H(x)$ gives whether $H$ predicts that $x$ belongs to class $c$, while its absolute value may be viewed as a confidence in classification.

The nearest neighbors (NNs) rule belongs to the oldest, simplest and still most widely studied classification algorithms [Devroye et al. 1996]. It relies on a non-negative real-valued distance function. This function measures how much two observations differ from each other, and may not necessarily satisfy the requirements of metrics. We let $j \to_k x$ denote the assertion that example $(x_j, y_j)$, or simply example $j$, belongs to the $k$ NNs of observation $x$. We shall abbreviate $j \to_k x_i$ by $j \to_k i$; in this case, we say that example $i$ belongs to the inverse neighborhood of example $j$. To classify an observation $x$, the $k$-NN rule $H(x)$ computes the sum of class vectors of its nearest neighbors, that is, $H_c(x) \doteq \sum_{j \to_k x} y_{jc}$ is the coordinate $c$ in $H(x)$. A leveraged k-NN rule [Nock et al. 2012] generalizes this to:

\[ H_c(x) \doteq \sum_{j \to_k x} \alpha_{jc} y_{jc}, \tag{3.1} \]

where $\alpha_j \in \mathbb{R}^C$ leverages the classes of example $j$. Leveraging nearest neighbors raises the question as to whether there exist efficient inductive learning schemes for these leveraging coefficients.
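The leveraged rule (3.1) can be sketched in a few lines. This is a minimal illustration under our own assumptions: the distance is Euclidean (the text only requires a non-negative distance function), and the function name `leveraged_knn` and the toy data are hypothetical.

```python
import numpy as np


def leveraged_knn(X_train, Y_train, alpha, x, k):
    """Evaluate H(x) as in Eq. (3.1): sum of alpha_j * y_j over the k nearest prototypes."""
    dist = np.linalg.norm(X_train - x, axis=1)        # Euclidean distance (an assumption)
    neighbors = np.argsort(dist, kind="stable")[:k]   # indices j such that j ->_k x
    return (alpha[neighbors] * Y_train[neighbors]).sum(axis=0)


# Toy data: five prototypes on a line, two classes, one-vs-rest label vectors.
X_train = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
Y_train = np.array([[+1, -1], [+1, -1], [+1, -1], [-1, +1], [-1, +1]])

# With all leveraging coefficients equal to 1, the rule reduces to the plain k-NN vote.
alpha = np.ones_like(Y_train, dtype=float)
H = leveraged_knn(X_train, Y_train, alpha, np.array([0.4]), k=3)
```

The sign of each coordinate of `H` gives the predicted membership for the corresponding class, and its magnitude plays the role of the confidence mentioned above.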

To learn them, we adopt the framework of [Bartlett et al. 2006, Vernet et al. 2011], and focus on the minimization of a total calibrated risk which sums per-class losses:

\[ \varepsilon_F(H, S) \doteq \frac{1}{C} \sum_{c=1}^{C} \underbrace{\frac{1}{m} \sum_{i=1}^{m} F(y_{ic} H_c(x_i))}_{\doteq\, \varepsilon_F(H_c, S)}. \tag{3.2} \]

crit   transfer function $f$                                        calibrated loss $F$
A      $\frac{1}{1+\exp(-x)}$                                       $\ln(1 + \exp(-x))$
B      $\frac{1}{1+2^{-x}}$                                         $\ln(1 + 2^{-x})$
C      $\frac{1}{2}\left(1 + \frac{x}{\sqrt{1+x^2}}\right)$         $\exp \sinh^{-1}(-x)$
D      $\frac{1+\max\{0,x\}}{2+|x|}$                                $\max\{0, -x\} - \ln(2 + |x|)$

Table 3.1: Calibrated losses that match (3.3) for several transfer functions. From top to bottom, losses are the logistic loss, binary logistic loss, Matsushita's loss, calibrated linear Hinge loss.

To be classification-calibrated, loss $F : \mathbb{R} \to \mathbb{R}$ is required to be convex, differentiable and such that $F'(0) < 0$ [Bartlett et al. 2006] (Theorem 4), [Vernet et al. 2011]. The recent advances in the understanding and formalization of (multiclass) loss functions suitable for classification have essentially concluded that classification calibration is mandatory for the loss to be Fisher consistent or proper [Bartlett et al. 2006, Vernet et al. 2011]. These are crucial properties without which the minimization of the loss brings no strong statistical guarantee with respect to Bayes rule (such as universal consistency).

3.3 Classification-calibrated losses

In this chapter, we are interested in a subset of classification-calibrated functions, namely those for which:

\[ F(x) \doteq -x + \int f, \tag{3.3} \]

for some continuous transfer function $f : \mathbb{R} \to [0, 1]$, increasing and symmetric with respect to $(0, 1/2 = f(0))$. Intuitively, a transfer function brings an estimate of posteriors: it is a bijective mapping between a real-valued prediction $H_c(x)$ and a corresponding posterior estimation for the class, $\hat{p}[y_c = +1 | x]$, a mapping which states that both values are positively correlated, and establishes a tie for $H_c = 0$, to which corresponds $\hat{p}[y_c = +1 | x] = 1/2$. Transfer functions have a longstanding history in optimization [Kivinen & Warmuth 2001], and the set of $F$ that match (3.3) strictly contains balanced convex losses, functions with appealing statistical properties [Nock et al. 2012] (and references therein). Table 3.1 provides four examples of such losses on which we focus. Another example of losses that meet (3.3) is the squared loss, for transfer $f = \min\{1, \max\{0, x + 1/2\}\}$.
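Differentiating (3.3) gives $F'(x) = f(x) - 1$, which can be checked numerically against the entries of Table 3.1. The sketch below is ours, not part of the original experiments: it compares a finite-difference derivative of each printed loss with $c\,(f(x) - 1)$. The scale factors $c$ (1 for A and D, $\ln 2$ for B, 2 for C) are our own observation from the printed formulas, meaning that B and C match (3.3) up to a positive constant; the names `CRITERIA`, `derivative` and `max_gap` are hypothetical.

```python
import math

LN2 = math.log(2.0)

# criterion -> (transfer function f, calibrated loss F, scale c such that F' = c * (f - 1))
CRITERIA = {
    "A": (lambda x: 1 / (1 + math.exp(-x)),
          lambda x: math.log(1 + math.exp(-x)), 1.0),
    "B": (lambda x: 1 / (1 + 2 ** (-x)),
          lambda x: math.log(1 + 2 ** (-x)), LN2),
    "C": (lambda x: 0.5 * (1 + x / math.sqrt(1 + x * x)),
          lambda x: math.exp(math.asinh(-x)), 2.0),
    "D": (lambda x: (1 + max(0.0, x)) / (2 + abs(x)),
          lambda x: max(0.0, -x) - math.log(2 + abs(x)), 1.0),
}


def derivative(F, x, h=1e-5):
    """Central finite-difference approximation of F'(x)."""
    return (F(x + h) - F(x - h)) / (2 * h)


def max_gap(name, points=(-3.0, -1.0, -0.2, 0.0, 0.5, 2.0)):
    """Largest deviation between F'(x) and c * (f(x) - 1) over a few test points."""
    f, F, c = CRITERIA[name]
    return max(abs(derivative(F, x) - c * (f(x) - 1)) for x in points)


for name in "ABCD":
    print(name, max_gap(name))
```

Each transfer function also satisfies $f(0) = 1/2$, the tie condition stated above.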

To carry out the minimization of (3.2), we adopt a mainstream 1-vs-rest boosting [...] $\varepsilon_F(H_c, S)$ in $\varepsilon_F(H, S)$. To do so, it fits the $c$th coordinate in leveraging coefficients by considering the two-class problem of class $c$ versus all others.

Algorithm 2: Algorithm Newton Nearest Neighbors N³($S$, crit, $k$)
Input: Sample $S$, criterion crit $\in \{A, B, C, D\}$, $k \in \mathbb{N}$;
Let $\alpha_j \leftarrow 0, \forall j = 1, 2, ..., m$;
for $c = 1, 2, ..., C$ do   // Minimize $\varepsilon_F(H_c, S)$
    Let $w_i \leftarrow 1 / \|\mathbf{1} + y_{ic} y_i\|_1, \forall i$;
    for $t = 1, 2, ..., T$ do
        [I.0] // Choice of the example to leverage
        Let $j \leftarrow$ Wic$(S, w)$;
        [I.1] // Leveraging update, $\delta_j$
        Let $\eta(c, j) \leftarrow \sum_{i : j \to_k i} w_{ti} y_{ic} y_{jc}$;
        Let $n_j \leftarrow |\{i : j \to_k i\}|$;
        Compute $\delta_j$ following Table 3.2, using crit;
        [I.2] // Weights update
        $\forall i : j \to_k i$, update $w_i$ as in Table 3.2, using crit;
        [I.3] // Leveraging coefficient update
        Let $\alpha_{jc} \leftarrow \alpha_{jc} + \delta_j$;
Output: $H(x) \doteq \sum_{j \to_k x} \alpha_j \circ y_j$
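The steps of Algorithm 2 can be sketched as follows for criterion A (logistic loss), using the leveraging and weight updates of row A in Table 3.2. This is a minimal sketch under stated assumptions, not the optimized implementation evaluated in this chapter: the distance is Euclidean, a training point is excluded from its own neighborhood, and the oracle Wic is replaced by a hypothetical rule that picks the example with the largest $|\eta(c, j)|$. The function names `knn`, `n3_fit` and `n3_predict` are ours.

```python
import numpy as np

LN2 = np.log(2.0)


def knn(X_ref, X_query, k, exclude_self=False):
    """Row q holds the indices of the k rows of X_ref closest to X_query[q] (Euclidean)."""
    d = np.linalg.norm(X_query[:, None, :] - X_ref[None, :, :], axis=2)
    if exclude_self:                       # assumption: a training point is not its own neighbor
        np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1, kind="stable")[:, :k]


def n3_fit(X, Y, k, T):
    """Sketch of Algorithm 2 for criterion A; returns the leveraging coefficients alpha."""
    m, C = Y.shape
    nn = knn(X, X, k, exclude_self=True)
    inv = np.zeros((m, m))                 # inv[j, i] = 1 iff j ->_k i
    inv[nn.ravel(), np.repeat(np.arange(m), k)] = 1.0
    n = inv.sum(axis=1)                    # n_j: size of the inverse neighborhood of j
    alpha = np.zeros((m, C))
    for c in range(C):
        y = Y[:, c].astype(float)
        w = 1.0 / np.abs(1.0 + y[:, None] * Y).sum(axis=1)   # w_i <- 1 / ||1 + y_ic y_i||_1
        for _ in range(T):
            eta = y * (inv @ (w * y))      # eta(c, j) for every j
            j = int(np.argmax(np.abs(eta)))                  # [I.0] hypothetical stand-in for Wic
            if eta[j] == 0.0:              # nothing left to leverage
                break
            delta = 4.0 * LN2 * eta[j] / n[j]                # [I.1] Table 3.2, row A
            i = inv[j] > 0                 # inverse neighborhood of j
            a = delta * y[i] * y[j]
            w[i] = w[i] / (w[i] * LN2 + (1.0 - w[i] * LN2) * np.exp(a))   # [I.2]
            alpha[j, c] += delta           # [I.3]
    return alpha


def n3_predict(X, Y, alpha, X_query, k):
    """Leveraged k-NN output H(x) for each query row."""
    nn = knn(X, X_query, k)
    return (alpha[nn] * Y[nn]).sum(axis=1)
```

For instance, on two well-separated groups of points, `alpha = n3_fit(X, Y, k=2, T=2)` followed by `n3_predict(X, Y, alpha, queries, k=2)` yields positive scores for the class of the nearby group and negative scores for the other class.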

3.4 N³: Adaptive Newton Nearest Neighbors

3.4.1 Algorithm

We now present algorithm N³, which stands for Newton Nearest Neighbors. N³ iteratively updates the leveraging coefficients of an example in $S$, an example picked by an oracle, Wic for Weak Index Chooser oracle. We detail below the properties and implementation of Wic. The technical details of N³ are given in Table 3.2. N³ follows the boosting scheme, with iterative updates of leveraging coefficients followed by an iterative re-weighting of examples. Before embarking into formal algorithmic and statistical properties for N³, we first show that N³ is of Newton-Raphson type.

Theorem 1 N³ performs adaptive Newton-Raphson steps to minimize $\varepsilon_F(H_c, S)$, $\forall c$.

Proof sketch: The key to the proof, which we explore further in Subsection 3.4.2, is the existence of a particular function $g_F$, strictly concave and symmetric with respect to $1/2$, which allows to rewrite the loss as:

[...]

where $\star$ denotes the (Legendre) convex conjugate. Convex conjugates have the property that their derivatives are inverses of each other. This property, along with (3.4), allows to simplify the computation of the derivatives of the loss, for any example $i$ in the inverse neighborhood of $j$:

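\begin{align}
\frac{\partial F(y_{ic} H_c(x_i))}{\partial \delta_j}
  &= y_{ic} y_{jc} F'(y_{ic} H_c(x_i)) \tag{3.5} \\
  &= -y_{ic} y_{jc} \left((-g_F)^\star\right)'(-y_{ic} H_c(x_i)) \nonumber \\
  &= -y_{ic} y_{jc} \left((-g_F)'\right)^{-1}(-y_{ic} H_c(x_i)) \nonumber \\
  &= -y_{ic} y_{jc} \left(1 - (g_F')^{-1}(-y_{ic} H_c(x_i))\right) \nonumber \\
  &= -y_{ic} y_{jc} (g_F')^{-1}(y_{ic} H_c(x_i)) \nonumber \\
  &= -K_F w_i y_{ic} y_{jc}. \tag{3.6}
\end{align}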

Eq. (3.6) holds because we can also rewrite the weights update (Table 3.2) as:

\[ w_i \leftarrow \frac{1}{K_F} (g_F')^{-1}\left(\delta_j y_{ic} y_{jc} + g_F'(K_F w_i)\right), \tag{3.7} \]

where $(g_F')^{-1}$ is the inverse function of the first derivative of $g_F$, and $K_F$ is a normalizing constant: it is respectively $\ln(2), 1, 1/2, 1$ for A, B, C and D in Table 3.3. From (3.5), it also comes $\partial^2 F(y_{ic} H_c(x_i)) / \partial \delta_j^2 = F''(y_{ic} H_c(x_i))$, where $F''$ denotes the second derivative. Considering the whole inverse neighborhood of $j$, the Newton-Raphson update for $\delta_j$ is (with $\eta(c, j) \doteq \sum_{i : j \to_k i} w_{ti} y_{ic} y_{jc}$ in N³):

\[ \delta_j \leftarrow \lambda_F \times \frac{K_F\, \eta(c, j)}{\sum_{i : j \to_k i} F''(y_{ic} H_c(x_i))}, \tag{3.8} \]

for learning rate $0 < \lambda_F \leq 1$. Matching this expression with the updates in Table 3.2 brings learning rate:

\[ 0 < \lambda_F = \frac{L_F \sum_{i : j \to_k i} F''(y_{ic} H_c(x_i))}{K_F\, n_j} \leq \frac{L_F F''(0)}{K_F} = 1, \]

for each of the criteria A, B, C and D, where $L_F$ is respectively $4 \ln(2), 4/\ln^2(2), 1/2, 4$, and $n_j \doteq |\{i : j \to_k i\}|$ in N³. The inequalities come from the fact that $F'' > 0$ and takes its maximum in 0 for all criteria. We then check that $F''(0) = K_F / L_F$ for A, B, C and D.
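The final check of the proof sketch, $F''(0) = K_F / L_F$, can be reproduced numerically from the losses of Table 3.1 and the constants quoted above. The sketch below is ours: it uses a finite-difference second derivative, and the helper names (`LOSSES`, `second_derivative`, `leveraging_update`) are hypothetical.

```python
import math

LN2 = math.log(2.0)

# Calibrated losses as printed in Table 3.1.
LOSSES = {
    "A": lambda x: math.log(1 + math.exp(-x)),
    "B": lambda x: math.log(1 + 2 ** (-x)),
    "C": lambda x: math.exp(math.asinh(-x)),
    "D": lambda x: max(0.0, -x) - math.log(2 + abs(x)),
}
# Constants quoted in the proof sketch of Theorem 1.
K_F = {"A": LN2, "B": 1.0, "C": 0.5, "D": 1.0}
L_F = {"A": 4 * LN2, "B": 4 / LN2 ** 2, "C": 0.5, "D": 4.0}


def second_derivative(F, x, h=1e-3):
    """Central finite-difference approximation of F''(x)."""
    return (F(x + h) - 2 * F(x) + F(x - h)) / (h * h)


def leveraging_update(name, eta, n_j):
    """Leveraging update delta_j of Table 3.2: L_F * eta(c, j) / n_j."""
    return L_F[name] * eta / n_j


for name in "ABCD":
    print(name, second_derivative(LOSSES[name], 0.0), K_F[name] / L_F[name])
```

Since $F''$ is maximal at 0, the step $L_F\,\eta(c,j)/n_j$ is a Newton-Raphson step (3.8) whose learning rate $\lambda_F$ never exceeds 1, and equals 1 when all margins in the inverse neighborhood are null.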

3.4.2 A key to the properties of N³

The duality between real-valued classification and posterior estimation which stems from $f$ (see Section 3.3) is fundamental for the algorithmic and statistical properties¹ of N³. To simplify the statement of results and proofs, it is convenient to make the parallel between our calibrated losses $F$ and functions elsewhere called permissible²,

¹ See Appendix B for details on statistical properties of N³.
² The usual definitions are more restricted: for example the generator of the calibrated [...]

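crit   leveraging update $\delta_j$               weight update $g : w_i \leftarrow g(w_i, \delta_j, y_{ic}, y_{jc})$
A      $\frac{4 \ln(2)\, \eta(c,j)}{n_j}$         $\frac{w_i}{w_i \ln 2 + (1 - w_i \ln 2) \times \exp(\delta_j y_{ic} y_{jc})}$
B      $\frac{4\, \eta(c,j)}{\ln^2(2)\, n_j}$     $\frac{w_i}{w_i + (1 - w_i) \times 2^{\delta_j y_{ic} y_{jc}}}$
C      $\frac{\eta(c,j)}{2 n_j}$                  $1 - \frac{1 - w_i + \sqrt{w_i (2 - w_i)}\, \delta_j y_{ic} y_{jc}}{\sqrt{1 + \delta_j^2 w_i (2 - w_i) + 2 (1 - w_i) \sqrt{w_i (2 - w_i)}\, \delta_j y_{ic} y_{jc}}}$
D      $\frac{4\, \eta(c,j)}{n_j}$                $\frac{1 + \max\left\{0, -\left(\delta_j y_{ic} y_{jc} + \frac{1 - 2 w_i}{\mathrm{err}(w_i)}\right)\right\}}{2 + \left|\delta_j y_{ic} y_{jc} + \frac{1 - 2 w_i}{\mathrm{err}(w_i)}\right|}$

Table 3.2: Leveraging and weight updates in N³ corresponding to each choice of calibrated loss in Table 3.1.

crit   generator $g_F$
A      $-x \ln x - (1 - x) \ln(1 - x)$
B      $-x \log_2 x - (1 - x) \log_2(1 - x)$
C      $\sqrt{x (1 - x)}$
D      $\ln(2\, \mathrm{err}(x)) + 1 - 2\, \mathrm{err}(x)$

Table 3.3: Generators $g_F$ corresponding to criteria A, B, C and D.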

that is, functions defined on $(0, 1)$, strictly concave, differentiable and symmetric with respect to $x = 1/2$. It can be shown that for any of our choices of $F$, there exists a permissible $g_F$, that we call a generator, for which the relationships (3.7) and (3.4) used in the proof sketch of Theorem 1 indeed hold. Furthermore, the generator is also useful to write the transfer function itself, as we have:

\[ f(x) = (-g_F)'^{-1}(x). \tag{3.9} \]

Table 3.3 provides the four generators corresponding to choices A, B, C and D. The permissible generator of the calibrated linear Hinge loss makes use of the error function:

\[ \mathrm{err}(x) \doteq \min\{x, 1 - x\}. \tag{3.10} \]

Permissible functions (as well as (3.10)) are used in losses that rely on posterior estimation rather than real-valued classification. Such losses are the cornerstone of decision-tree induction and other methods that directly fit posteriors [Devroye et al. 1996]. Hence, (3.4) establishes a duality between the two kinds of losses, a duality which appears as a watermark in various works [Bartlett et al. 2006, Friedman et al. 2000]. The writing of the weight update using $g_F$ in (3.7) is also extremely useful to simplify the proofs of the following Theorems. Finally, there is a synthetic writing for the weights, which sheds light on their interpretation: unraveling the weight update (3.7) and using (3.9), we obtain that $w_i$ satisfies:

\[ w_i \propto 1 - f(y_{ic} H_c(x_i)). \tag{3.11} \]

Hence, weights and estimated posteriors are in opposite linear relationship. According to (3.11), examples easier to classify (receiving large estimated posteriors) receive small weight. This is a fundamental property of boosting algorithms, which progressively concentrate on the hardest examples.
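The relationship between (3.7), Table 3.2 and (3.11) can be illustrated for criterion A. The sketch below is ours: it applies the generic update (3.7), written with the generator of Table 3.3, next to the closed form of Table 3.2, and tracks the accumulated margin. The initial weight $1/(2K_F)$ is our assumption (it is the weight matching a null margin, not the initialization of Algorithm 2), and all function names are hypothetical.

```python
import math

K_A = math.log(2.0)          # K_F for criterion A


def g_prime(p):
    """Derivative of the generator A, g(p) = -p ln p - (1 - p) ln(1 - p)."""
    return math.log((1 - p) / p)


def g_prime_inv(z):
    """Inverse of g_prime."""
    return 1 / (1 + math.exp(z))


def update_generic(w, a):
    """Weight update (3.7), with a = delta_j * y_ic * y_jc."""
    return g_prime_inv(a + g_prime(K_A * w)) / K_A


def update_table(w, a):
    """Closed-form weight update of Table 3.2, row A."""
    return w / (w * K_A + (1 - w * K_A) * math.exp(a))


def transfer(x):
    """Transfer function f of criterion A (Table 3.1)."""
    return 1 / (1 + math.exp(-x))


w_generic = w_table = 1 / (2 * K_A)   # assumption: weight matching a null margin
margin = 0.0                          # accumulated y_ic * H_c(x_i)
for a in [0.7, -0.3, 1.2, 0.4]:
    w_generic = update_generic(w_generic, a)
    w_table = update_table(w_table, a)
    margin += a

print(K_A * w_table, 1 - transfer(margin))
```

Under this initialization, both updates agree and $K_F w_i$ equals $1 - f(y_{ic} H_c(x_i))$ after every step, which is the proportionality stated in (3.11).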

3.5 Algorithmic properties of N³

The first result is a direct follow-up from Table 3.2.

Lemma 2 With choice D (calibrated linear Hinge loss), N³ may be implemented using only rational arithmetic.

Comments on Lemma 2: In the light of the boosting properties of N³, this result is important in itself. Most existing boosting algorithms, including UNN, AdaBoost, Gentle AdaBoost and spawns [Nock et al. 2012, Friedman et al. 2000] make it necessary to tweak or clip the key numerical steps, including weights update or leveraging coefficients [Ali et al. 2011], at the possible expense of failing to meet boosting's convergence or accuracy. Rational arithmetic still requires significant computational resources with respect to floating point computation, but Lemma [...]

Let us now shift to the boosting result on N³, which is stated under the following weak learning assumption:

There exist constants $\gamma_u > 0, \gamma_n > 0$ such that at any iterations $c, t$ of N³, index $j$ returned by Wic is such that $n_j > 0$ and the following holds: (i) $\frac{\sum_{i : j \to_k i} w_i}{n_j} \geq \frac{\gamma_u}{K_F}$, and (ii) $\left| p_w[y_{jc} \neq y_{ic} \,|\, j \to_k i] - 1/2 \right| \geq \gamma_n$.

Requirement (ii) corresponds to the usual weak learning assumption of boosting: it postulates that the current normalized weights in the inverse neighborhood of example $j$ authorize a classification different from random by at least $\gamma_n$. Requirement (i) states that unnormalized weights must not be too small. This is a necessary condition, as unnormalized weights of minute order do not necessarily prevent (ii) from being met, but would obviously impair the convergence of N³ given the linear dependence of $\delta_j$ in the unnormalized weights. The following Theorem states that N³ is a boosting algorithm.

Theorem 3 Suppose N³ is run for $T$ steps for each $c$, and that the weak learning assumption holds at each iteration of N³. Denote $I$ the whole multi-set of indexes returned by Wic. Then for any criterion A, B, C, D, the total calibrated risk does not exceed some $\varepsilon \leq F(0)$ provided:

\[ \sum_{j \in I} n_j = \Omega\left(\frac{(C + |\varepsilon|)\, m}{\gamma_n^2 \gamma_u^2}\right). \tag{3.12} \]

Remark: requirement $\varepsilon \leq F(0)$ comes from the fact that a leveraged NN with null leveraging vectors would make a total calibrated risk equal to $F(0)$.

Comments on Theorem 3: to the best of our knowledge, no formal convergence rate has been established to date for Newton approaches to boosting, including the popular Gentle AdaBoost [Friedman et al. 2000]. Theorem 3 gives several rules of thumb to run N³ and implement Wic. The first is that Wic should choose examples whose inverse neighborhood is not too small. For example, assume that boosted examples have inverse neighborhood's size not smaller than the average, implying $(1/T) \sum_{j \in I} n_j \geq k$. Then, omitting constants in the big omega of (3.12), we obtain that (3.12) is satisfied as soon as the number of iterations ($T$) meets:

\[ T \geq \frac{(C + |\varepsilon|)\, m}{k \gamma_n^2 \gamma_u^2}. \]

This inequality suggests to choose $k$ (i) proportional to $C$ and (ii) moderately increasing in $m$. These two choices imply, under the weak learning assumption, that N³ [...]

Table 2.2: Computation of $\delta_{jc}$ and the weight update rule of our implementation of UNN, for the strictly convex losses in Table 2.1
Table 2.3: Classification performances of the different methods we tested in terms
Figure 2.1: Classification performances of the tested methods as a function of the
Figure 2.3: Classification time for UNN_s vs SVM as a function of the number of