HAL Id: tel-00934062
https://tel.archives-ouvertes.fr/tel-00934062
Submitted on 21 Jan 2014
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of
scientific research documents, whether they are
published or not. The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
Minimization of calibrated loss functions for image
classification
Wafa Bel Haj Ali
To cite this version:
Wafa Bel Haj Ali. Minimization of calibrated loss functions for image classification. Other [cs.OH].
Université Nice Sophia Antipolis, 2013. English. ⟨NNT : 2013NICE4079⟩. ⟨tel-00934062⟩
DOCTORAL SCHOOL STIC
SCIENCES ET TECHNOLOGIES DE L'INFORMATION
ET DE LA COMMUNICATION
PHD THESIS
to obtain the title of
PhD of Science
of the University of Nice - Sophia Antipolis
Speciality: Computer Science
Defended by
Wafa Bel Haj Ali
Minimisation de Fonctions de
Perte Calibrée pour la
Classification des Images
Thesis Advisor: Michel Barlaud
prepared at I3S Sophia Antipolis, MediaCoding Team
defended on October 11, 2013
Jury:
Reviewers: Florent Perronnin, Research Manager, XRCE, Grenoble
Patrick Perez, DR, INRIA-Technicolor, Rennes
Advisor: Michel Barlaud, PR Emérite, UNS, Sophia Antipolis
Examiners: Sherif Makram-Ebeid, Research Fellow, Philips Research, Paris
Frédéric Precioso, PR, UNS, Sophia Antipolis
Cordelia Schmidt, DR, INRIA, Grenoble
mother, to the memory of my father,
because of their faith and pride,
because of all the wonderful things they
did for me...
A special dedication to my lovely husband, who always supports me, and
to my cute son, who gives me joy. A special thought for my brothers
and "sisters", who always encourage me, and for my mother-in-law and
my father-in-law for their love. Finally, special thanks to all my friends,
with whom I spent those years and shared wonderful things. I do not
want to list them, because the list could be long and I am afraid to forget
First of all, I would like to thank Professor Michel Barlaud for his support, his
advice, his presence and above all his faith. Working with him during three years
of thesis was a great honor.
I would like to address great thanks to Professor Richard Nock and Frank Nielsen for
their collaboration and their precious contributions. I also wish to thank and
acknowledge the contribution of the team of biologists TIRO, and especially Professor
Thierry Pourcher and Philippe Pognonec. I do not forget Eric Debreuve, who helped
me for a long time and gave me precious advice.
Finally, I thank everyone who supported me and helped me during those years:
1 General introduction
1.1 Introduction
1.2 Setting the problem
I Learning weighted k-NN Classifiers with Calibrated Losses
2 Universal Nearest Neighbors algorithm: UNN
2.1 Introduction
2.2 Basic notions and annotations
2.2.1 The k-NN classifier
2.2.2 Calibrated losses
2.3 UNN, Leveraging the k-NN classifier
2.3.1 Learning leveraged k-NNs in a boosting framework
2.3.2 Step by step algorithm
2.4 Implementation details and optimizations
2.4.1 Implementation
2.4.2 Metric setting
2.4.3 Parameters and optimization
2.5 Experiments
2.6 Conclusion
3 Newton Nearest Neighbor algorithm: N³
3.1 Introduction
3.2 Basic definitions
3.3 Classification-calibrated losses
3.4 N³: Adaptive Newton Nearest Neighbors
3.4.1 Algorithm
3.4.2 A key to the properties of N³
3.5 Algorithmic properties of N³
3.6 Experimental Evaluation
3.6.1 Settings: contenders, databases and features
3.6.2 A divide and conquer algorithm to cope with the curse of dimensionality with low memory requirement
II Learning Linear Classifiers with Calibrated Losses
4 Stochastic Low-Rank Newton Descent algorithm: SLND
4.1 Introduction
4.2 Reminder
4.2.1 Framework
4.2.2 Calibrated risks
4.3 SLND: Stochastic Low-Rank Newton Descent
4.3.1 Computing gradient update
4.3.2 Core optimization
4.3.3 Remarks
4.4 Experimental evaluation
4.4.1 Settings
4.4.2 Tuning parameters of SLND
4.4.3 Convergence rate analysis
4.5 SLND Theoretical convergence analysis
4.5.1 Best rank k approximation
4.5.2 A Weak Separability Assumption
4.5.3 Convergence theorem
4.6 Conclusion
III Bio-Inspired features for biological cells classification
5 Bio-Medical cells classification
5.1 Introduction
5.2 Region based bio-inspired descriptor
5.3 Application to the localization of NIS protein in the cells of the thyroid gland
5.3.1 Experiments settings
5.3.2 Cells detection and segmentation
5.3.3 Features and classification
5.4 Application to Immuno-Fluorescence cells
5.5 Conclusion
6 General conclusion
IV Appendices
A UNN optimization with metric learning
A.1 Introduction
A.2 Proposed approach
A.3 Experiments
A.3.1 Dataset
A.3.2 Settings
A.3.3 Robustness to dimension reduction
A.3.4 Boosting k-NN results and comparison to the k-NN classification method
A.3.5 Evaluation of the metric learning process
B Convergence proof of N³ and statistical properties
B.1 Proof of Theorem 3
B.2 Statistical properties of N³
B.3 Proof of Theorem 6
C Convergence proof of SLND
C.1 Proof sketch of Theorem 5
Bibliography
General introduction
1.1 Introduction
The classification task consists in predicting the category membership of unlabeled
data based on its content. Classifying images is a challenging task in computer
vision, since it involves different fields and applications. In fact, two main fields
are studied to perform image classification and pattern recognition. The first,
which belongs to the image processing field, deals with extracting features from
the data: a way to encode images with less complex structures that best describe the
information contained in the image. The second is a machine learning
task that defines the classification rule.
In computer vision tasks, image features are usually considered either as local or
as global descriptors. Both have been shown to be efficient. The Gist global
feature [Oliva & Torralba 2001, Oliva & Torralba 2006], for example, represents a whole
scene in a unique descriptor, while the scale invariant feature transform (SIFT)
[Lowe 2004] or the histogram of oriented gradients (HOG) [Dalal & Triggs 2005]
represent local information in the image, allowing the description of significant
objects in the scene independently. Local features are relevant for image description.
In computer vision, they are well adapted to object detection and image retrieval:
they give a sparse representation and cover a wide range of visual features in the
image. However, for the classification task, we mostly need a global feature description,
since we compare categories and not only pairs of images. Hence, we usually encode
local features into global ones using statistical models. This global representation
describes the occurrence of relevant visual features in the image. State-of-the-art
Bag of features/words (BoF/BoW) [Sivic & Zisserman 2006] are the most common
approaches in this context. Recently, an efficient feature called Fisher vectors (FV)
[Perronnin et al. 2010] was extensively used for large scale image classification.
Getting efficient descriptors is not sufficient to perform categorization. Robust
classification algorithms should be designed to accomplish such a challenging task.
For most state-of-the-art methods, the task of image classification is addressed as
a learning problem. Within this context, we distinguish two major approaches,
depending on whether or not we have knowledge about the categories and about
the labels of a set of data. On the one hand, unsupervised approaches, like
clustering, tend to group data according to their visual content similarities. On the other
hand, supervised learning uses an already labeled training set to learn classifiers
The kernel based algorithms, and more precisely the Support Vector Machine (SVM)
[Cristianini & Shawe-Taylor 2000], are robust classification methods. The boosting
based algorithms such as AdaBoost [Freund & Schapire 1999] are scalable, have low
computational complexity and are still reliable. Nearest Neighbors approaches are fast,
simple and scalable, but still lack accuracy. Recently, a stochastic
gradient descent (SGD) algorithm was introduced by [Bottou 2010], a robust and
non complex method for large scale data.
To state a supervised classification problem, we first need to define our classifier.
The classification rule is a function mapping the data features to
their predicted labels. Among state-of-the-art classifiers, we can cite k-Nearest
Neighbors and linear or kernel based classifiers. Whatever the nature of a
classification rule, it is often defined by a set of parameters. Therefore, we set up a learning
process to reach the optimal rule: given a set of already annotated data, we
estimate the optimal parameters by minimizing the classification error rate.
This thesis deals with supervised learning approaches for image classification.
In particular, we are interested in the minimization of a criterion based on some
specific loss functions (calibrated losses) for different kinds of classification rules. In
a first part, we are interested in k-NN classifiers. A first approach revisits and
expands a leveraged k-NN rule by minimizing the risk criterion in a boosting
framework. In the same context, a second approach deals with a fast-converging,
Newton-based leveraged Nearest Neighbors rule. In a second part, we design a fast low-rank
Newton descent algorithm for criterion minimization, used to learn scalable linear
classifiers. The latter is a robust algorithm, especially for big datasets, and shows high
computational performance and precision compared with state-of-the-art approaches. In
a final part, this thesis presents an application of image categorization to an
interesting field: bio-medical imaging. In a first step, we design a specific descriptor for
such applications: a multiscale contrast based feature, well adapted to cell images.
Then, we report examples of experiments on two different applications of biological
cell classification.
1.2 Setting the problem
We first provide some generalities that define our supervised learning scheme. Our
setting is that of multiclass, multilabel classification. In supervised learning, we
have access to an annotated input set of $m$ observations,
$S \doteq \{o_i = (x_i, y_i),\ i = 1, 2, ..., m\}$. Vector $x_i \in X$ is a feature vector,
where $X$ denotes the feature space. We adopt the mainstream one-vs-all classification
scheme. Then, vector $y_i \in \{-1, +1\}^C$ encodes class memberships, assuming
$y_{ic} = +1$ means that sample $x_i$ belongs to class $c$ and $y_{ic} = -1$ otherwise.
The goal is to learn a classifier $H$, which is a function mapping observations in $X$
to vectors in $\mathbb{R}^C$. Given some sample $x$, the sign of coordinate $c$ in $H(x)$
($H_c(x)$) indicates whether $x$ is predicted to belong to class $c$.

To define the classifier $H$, we minimize the Empirical (or Hamming) Risk
$\varepsilon^{0/1}(H, S)$, which computes over classes and observations the
misclassification rate of $H$:

$$\varepsilon^{0/1}(H, S) \doteq \frac{1}{C} \sum_{c=1}^{C} \frac{1}{m} \sum_{i=1}^{m} \left[ y_{ic} H_c(x_i) < 0 \right] , \tag{1.1}$$

where $[.]$ is the indicator function, equal to $1$ if the condition is true and $0$
otherwise, which represents here the $0/1$ or empirical loss. We denote this loss
$F^{0/1}$. Unfortunately, minimizing this risk is not tractable, since the $0/1$
loss function is not convex.
A common alternative to minimizing (1.1) directly is to minimize an upper bound
of this empirical risk, known as the Surrogate Risk. Let us denote the latter
$\varepsilon^{F}$. This surrogate sums over observations and classes a strictly convex
loss function $F : \mathbb{R} \to \mathbb{R}$ that satisfies
$\forall x \in \mathbb{R},\ F^{0/1}(x) \le F(x)$:

$$\varepsilon^{F}(H, S) \doteq \frac{1}{C} \sum_{c=1}^{C} \frac{1}{m} \sum_{i=1}^{m} F\left(y_{ic} H_c(x_i)\right) . \tag{1.2}$$

The loss function $F$ is based on the functional margin $y_{ic} H_c(x_i)$, which we
call the edge of classification and denote by $\rho(H_c, o_{i,c})$. Since (1.2) is an
upper bound of (1.1), minimizing (1.2) leads to a solution close to that of the
initial problem (1.1).
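To make the relation between (1.1) and (1.2) concrete, here is a minimal Python sketch that evaluates both risks on a toy problem. The data, the function names and the choice of the exponential loss as surrogate are illustrative assumptions, not taken from the thesis.

```python
import math

def hamming_risk(scores, labels):
    """Empirical 0/1 risk (1.1): share of (sample, class) pairs with a negative edge."""
    m, C = len(scores), len(scores[0])
    errors = sum(1 for i in range(m) for c in range(C)
                 if labels[i][c] * scores[i][c] < 0)
    return errors / (m * C)

def surrogate_risk(scores, labels, loss):
    """Surrogate risk (1.2): average loss of the edges y_ic * H_c(x_i)."""
    m, C = len(scores), len(scores[0])
    total = sum(loss(labels[i][c] * scores[i][c])
                for i in range(m) for c in range(C))
    return total / (m * C)

def exp_loss(x):
    """Exponential loss, which upper-bounds the 0/1 loss everywhere."""
    return math.exp(-x)

# Hypothetical toy data: 3 samples, 2 classes, labels in {-1, +1}.
labels = [[+1, -1], [-1, +1], [+1, -1]]
scores = [[0.8, -0.3], [0.4, 1.2], [-0.1, -2.0]]  # H_c(x_i)

risk_01 = hamming_risk(scores, labels)              # two negative edges out of six
risk_exp = surrogate_risk(scores, labels, exp_loss)
```

On this toy set the surrogate risk is larger than the Hamming risk, as expected from the pointwise bound between the two losses.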
The consistency of classification rules is a crucial property, without which the
minimization of the loss brings no strong statistical guarantee: the risk of
classification should get close to the lowest possible risk (Bayes
rule) with a large probability. To satisfy this property, a set of loss functions relevant for learning is often
Learning weighted
k-NN Classifiers with Calibrated Losses
Universal Nearest Neighbors
algorithm: UNN
2.1 Introduction
The nearest neighbors (NNs) rule belongs to the oldest, simplest and still most
widely studied classification algorithms [Devroye et al. 1996]. It relies on a
non-negative real-valued distance function. This function measures how much two
observations differ from each other, and need not satisfy the requirements
of a metric.

k-NN classification has proven successful, thanks to its easy implementation and
its good generalization properties [Shakhnarovich et al. 2006]. A major advantage
of the k-NN rule is that it does not require explicit construction of the feature space and is
naturally adapted to multi-class problems. Moreover, from the theoretical point of
view, straightforward bounds are known for the true risk (error) of k-NN
classification with respect to Bayes optimum, even for finite samples [Nock & Sebban 2001].
Yet reducing the true risk of the k-NN rule remains a challenge, usually tackled
by data reduction techniques [Hart 1968].
We propose in this chapter an optimization of a generalized solution to the
problem of boosting k-NN classifiers in the general multi-class setting, and for general
classes of losses, not restricted to AdaBoost's exponential loss, built upon the works
of [Piro et al. 2012, Nock & Nielsen 2009, Nock & Nielsen 2008]. Namely, we
propose a leveraged nearest neighbor rule that generalizes the uniform k-NN rule, and
whose convergence rate is guaranteed for many classification-calibrated losses,
encompassing popular choices such as the logistic loss or the Matsushita loss. The
voting rule is redefined as a strong classifier that linearly combines weak classifiers
of the k-NN rule.

The remainder of the chapter is organized as follows: Section 2.2 briefly
introduces the basic notions about k-NN classifiers and about the calibrated loss
functions used later in the learning framework. Section 2.3 presents the Universal
Nearest Neighbors algorithm for leveraging the k-NN classifier and Section 2.4 gives
details about the optimizations brought to this algorithm and the implementation of
     Calibrated loss F(x)         Notation
A    $\exp(-x)$                   exp
B    $\ln(1 + \exp(-x))$          log
C    $-x + \sqrt{1 + x^2}$        mat
Table 2.1: The strictly convex losses that are used in UNN. From top to bottom,
the losses are the exponential, logistic and Matsushita's loss.
2.2 Basic notions and annotations
2.2.1 The k-NN classifier
We let $j \to_k x$ denote the assertion that example $(x_j, y_j)$, or simply example $j$,
belongs to the $k$ NNs of observation $x$. We shall abbreviate $j \to_k x_i$ by
$j \to_k i$; in this case, we say that example $i$ belongs to the inverse neighborhood
of example $j$. To classify an observation $x$, the $k$-NN rule $H(x)$ computes the sum
of class vectors of its nearest neighbors. The coordinate $c$ in $H(x)$ is:

$$H_c(x) \doteq \sum_{j \to_k x} y_{jc} . \tag{2.1}$$

2.2.2 Calibrated losses
Classification-calibrated losses are surrogates suitable for classification. To be
classification-calibrated, a loss $F : \mathbb{R} \to \mathbb{R}$ is required to be convex,
differentiable and such that $F'(0) < 0$ [Bartlett et al. 2006] (Theorem 4),
[Vernet et al. 2011]. In this chapter, we are interested in a subset of the calibrated
losses called Strictly Convex Losses (SCL). This set includes, in addition to the exponential loss,
the logistic, the Matsushita and the squared loss. The strictly convex losses F we
are interested in are given in Table 2.1.
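The three losses of Table 2.1 and the calibration condition $F'(0) < 0$ can be checked numerically. The following Python sketch is only an illustration: the function names, the evaluation grid and the midpoint test for strict convexity are assumptions made here, not part of the thesis.

```python
import math

# The losses of Table 2.1 and their first derivatives.
def f_exp(x): return math.exp(-x)
def f_log(x): return math.log(1.0 + math.exp(-x))
def f_mat(x): return -x + math.sqrt(1.0 + x * x)

def d_exp(x): return -math.exp(-x)
def d_log(x): return -1.0 / (1.0 + math.exp(x))
def d_mat(x): return -1.0 + x / math.sqrt(1.0 + x * x)

LOSSES = {"exp": (f_exp, d_exp), "log": (f_log, d_log), "mat": (f_mat, d_mat)}

def is_midpoint_convex(f, grid):
    """Strict midpoint convexity on a finite grid (a numerical sanity check only)."""
    return all(f(0.5 * (a + b)) < 0.5 * (f(a) + f(b))
               for a in grid for b in grid if a != b)

grid = [k / 2.0 for k in range(-8, 9)]  # points in [-4, 4]

# Calibration requires a negative slope at the origin.
slopes_at_zero = {name: d(0.0) for name, (f, d) in LOSSES.items()}
convex_on_grid = {name: is_midpoint_convex(f, grid) for name, (f, d) in LOSSES.items()}
```

The slopes at zero are $-1$, $-1/2$ and $-1$ for the exponential, logistic and Matsushita losses respectively, so all three satisfy the condition.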
2.3 UNN, Leveraging the k-NN classifier
As previously introduced, a leveraged k-NN rule is a non-uniform voting among the
k-Nearest Neighbors, defined as follows:

$$H_c(x_i) \doteq \sum_{j \to_k i} \alpha_{jc} y_{jc} . \tag{2.2}$$

The classifier $H_c$ is defined as a sum over a set of $T$ weak classifiers. We
call the latter prototypes. So, given a set
$S \doteq \{o_i = (x_i, y_i),\ i = 1, 2, ..., m\}$, one prototype, denoted by the index
$j$, is a training sample in $S$ defined by its feature vector $x_j$, its label $y_{jc}$
and, later, its leveraging weight $\alpha_{jc}$. Those weights are
2.3.1 Learning leveraged k-NNs in a boosting framework
Voting weights $\alpha_{jc}$ in (2.2) are solutions of the minimization of the following
average surrogate risk:

$$\varepsilon^{F}(H, S) \doteq \frac{1}{C} \sum_{c=1}^{C} \underbrace{\frac{1}{m} \sum_{i=1}^{m} F\left(y_{ic} H_c(x_i)\right)}_{\varepsilon^{F}(H_c, S)} . \tag{2.3}$$

Since we are in the one-vs-all learning scheme, we can minimize the per-class risk
$\varepsilon^{F}(H_c, S)$ corresponding to $H_c$. To do so, one alternative is to use a
boosting-like approach and minimize each surrogate $\varepsilon^{F}(H_c, S)$ iteratively.
At each iteration we pick one prototype $j \in S$, for which the classification rule is
defined as the following weak classifier:

$$h_{jc}(x_i) \doteq \alpha_{jc} y_{jc} ; \quad j \to_k i \tag{2.4}$$

such that:

$$H_c(x_i) \doteq \sum_{j \to_k i} h_{jc}(x_i) . \tag{2.5}$$

Thus the local risk (of the weak classifier) is the average loss due to $h_{jc}$ over
the training set $S$:

$$\varepsilon^{F}(h_{jc}, S) \doteq \frac{1}{m} \sum_{i=1}^{m} F\left(y_{ic} h_{jc}(x_i)\right) . \tag{2.6}$$

Note that the classifier $h_{jc}$ follows the leveraged k-NN rule, so only the subset
of $S$ for which sample $j$ is a k-NN is concerned by the vote of $j$. We denote
this subset by $R_j \subseteq S$; it is exactly the set of inverse nearest neighbors of
$j$, and its cardinality is $n_j$. Hence we again reduce the risk function to be
minimized to the following:

$$\varepsilon^{F}(h_{jc}, R_j) \doteq \frac{1}{n_j} \sum_{i=1}^{n_j} F\left(y_{ic} h_{jc}(x_i)\right) . \tag{2.7}$$

We need to find the optimal voting weight that minimizes the risk function in (2.7).
To do so, we iteratively update the leveraging weight of the current weak
classifier / prototype $j$ in a boosting-like procedure. Hence, we give samples
classification weights denoted by $w_{ic}$ and progressively update them according to the
misclassification of $h_{jc}$. That is, the weights of badly classified samples are
increased and those of well classified ones are decreased. We consider the following
updating rules for prototype weights $\alpha_{jc}$, classification rule $h_{jc}$ and
training sample weights $w_{ic}$:

$$h^{t}_{jc}(x_i) = h^{t-1}_{jc}(x_i) + \delta_{jc} y_{jc} . \tag{2.9}$$

$$w^{t}_{ic} = -F'\left(y_{ic} h^{t}_{jc}(x_i)\right) \tag{2.10}$$
Actually, at each iteration $t$ we should minimize $\varepsilon^{F}(h^{t}_{jc}, R_j)$ with
respect to $\delta_{jc}$. Let us first replace $h_{jc}$ in (2.7) by its expression in (2.9).
Then, the risk function becomes

$$\varepsilon^{F}(h^{t}_{jc}, R_j) \doteq \frac{1}{n_j} \sum_{i=1}^{n_j} F\left(y_{ic} h^{t-1}_{jc}(x_i) + y_{ic} \delta_{jc} y_{jc}\right) . \tag{2.11}$$

and its first derivative with respect to $\delta_{jc}$ is expressed as follows:

$$\frac{\partial \varepsilon^{F}(h^{t}_{jc}, R_j)}{\partial \delta_{jc}} = \frac{1}{n_j} \sum_{i=1}^{n_j} y_{ic} y_{jc} F'\left(y_{ic} h^{t-1}_{jc}(x_i) + y_{ic} \delta_{jc} y_{jc}\right) \tag{2.12}$$

$$= \frac{1}{n_j} \sum_{i=1}^{n_j} y_{ic} y_{jc} F'\left(F'^{-1}(-w^{t-1}_{ic}) + y_{ic} \delta_{jc} y_{jc}\right) \tag{2.13}$$

Finally, finding $\delta_{jc} = \arg\min \varepsilon^{F}(h^{t}_{jc}, R_j)$ amounts to solving
the following general equation based on the surrogate loss F:

$$\sum_{i=1}^{n_j} y_{ic} y_{jc} F'\left(F'^{-1}(-w^{t-1}_{ic}) + y_{ic} \delta_{jc} y_{jc}\right) = 0 . \tag{2.14}$$

2.3.2 Step by step algorithm
The different steps of UNN are detailed in Algorithm 1. Step
[I.0] of the algorithm consists in choosing the prototype $j \in \{1, 2, ..., m\}$ (weak
classifier). At each iteration, the index to leverage, $j$, is obtained by a call
to a weak index chooser oracle Wi$(., ., .)$. The index $j$ of the next weak classifier
can be selected randomly or using some criterion. In the second case,
we pick $T \ge m$, and let $j$ be chosen by Wi$(\{1, 2, ..., m\}, t, c)$ such that
$\delta_j$ is large enough. Each $j$ can be chosen more than once, or one can restrict
each index to be chosen only once.
The derivation of $\delta_j$, solution of (2.15), and of $w_i$ in (2.16) is detailed
later. The resulting expressions are given in Table 2.2 for each of the losses
considered in Table 2.1. $W^{+}_{jc}$ and $W^{-}_{jc}$, used in Table 2.2, are respectively
the sum of the weights of the positive (good) inverse-NNs and that of the negative (bad) ones:

$$W^{+}_{jc} = \sum_{i=1}^{n_j} \left[ y_{ic} y_{jc} > 0 \right] w_{ic} ; \tag{2.17}$$

$$W^{-}_{jc} = \sum_{i=1}^{n_j} \left[ y_{ic} y_{jc} < 0 \right] w_{ic} ; \tag{2.18}$$

Algorithm 1: Universal Nearest Neighbors UNN($S$, $F$)
Input: $S = \{(x_i, y_i),\ i = 1, 2, ..., m\}$, loss F;
for $c = 1, 2, ..., C$ do
    Let $\alpha_{jc} \leftarrow 0$, $\forall j$;
    Let $w_i \leftarrow -F'(0) \in \mathbb{R}^{m}_{+*}$, $\forall i$;
    for $t = 1, 2, ..., T$ do
        [I.0] Let $j \leftarrow$ Wi$(\{1, 2, ..., m\}, t)$;
        [I.1] Let $\delta_j \in \mathbb{R}$ be the solution of:
              $\sum_{i=1}^{n_j} y_{ic} y_{jc} F'\left(F'^{-1}(-w_{ic}) + y_{ic} \delta_{jc} y_{jc}\right) = 0$ ;   (2.15)
        [I.2] $\forall i : j \sim_k i$, let
              $w_i \leftarrow -F'\left(y_{ic} \delta_{jc} y_{jc} + F'^{-1}(-w_i)\right)$ ;   (2.16)
        [I.3] Let $\alpha_{jc} \leftarrow \alpha_{jc} + \delta_j$;
Output: $h_c(x) = \sum_{i \sim_k x} \alpha_{ic} y_{ic}$, $\forall c$ ;
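Algorithm 1 can be sketched in a few lines of Python for the exponential loss, using the closed-form $\delta_{jc} = \frac{1}{2}\ln(W^{+}_{jc}/W^{-}_{jc})$ of Table 2.2. This is a minimal illustration under several assumptions that are not in the thesis: the function names are hypothetical, the index chooser simply cycles through the samples, a sample is excluded from its own neighborhood during training, the distance is L1, and a small `eps` keeps $\delta$ finite when one of the two sums is zero.

```python
import math

def l1_knn(X, x, k, exclude=None):
    """Indices of the k training samples closest to x under the L1 distance."""
    ranked = sorted(
        (sum(abs(a - b) for a, b in zip(xj, x)), j)
        for j, xj in enumerate(X) if j != exclude
    )
    return [j for _, j in ranked[:k]]

def unn_exp_train(X, Y, k, T, eps=1e-6):
    """Leveraging coefficients alpha[j][c] for the exponential loss (sketch)."""
    m, C = len(X), len(Y[0])
    # neighbors[i]: k nearest prototypes of sample i (assumption: i itself excluded).
    neighbors = [l1_knn(X, X[i], k, exclude=i) for i in range(m)]
    # inverse[j]: the set R_j of samples that have j among their k-NN.
    inverse = [[i for i in range(m) if j in neighbors[i]] for j in range(m)]
    alpha = [[0.0] * C for _ in range(m)]
    for c in range(C):
        w = [1.0] * m  # -F'(0) = 1 for the exponential loss
        for t in range(T):
            j = t % m  # [I.0] assumption: cyclic weak index chooser
            w_plus = sum(w[i] for i in inverse[j] if Y[i][c] * Y[j][c] > 0)
            w_minus = sum(w[i] for i in inverse[j] if Y[i][c] * Y[j][c] < 0)
            delta = 0.5 * math.log((w_plus + eps) / (w_minus + eps))  # [I.1]
            for i in inverse[j]:                                      # [I.2]
                w[i] *= math.exp(-Y[i][c] * Y[j][c] * delta)
            alpha[j][c] += delta                                      # [I.3]
    return alpha

def unn_predict(X, Y, alpha, x, k):
    """Leveraged k-NN scores h_c(x) for every class c."""
    nn = l1_knn(X, x, k)
    return [sum(alpha[j][c] * Y[j][c] for j in nn) for c in range(len(Y[0]))]

# Hypothetical toy data: two well separated clusters, one class each.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
Y = [[+1, -1]] * 3 + [[-1, +1]] * 3
alpha = unn_exp_train(X, Y, k=2, T=len(X))
scores_a = unn_predict(X, Y, alpha, [0.05, 0.1], k=2)  # query near the first cluster
scores_b = unn_predict(X, Y, alpha, [1.0, 1.05], k=2)  # query near the second cluster
```

The sign of each score gives the predicted membership for the corresponding class, as in the one-vs-all setting of Section 1.2.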
For now, we give some details about the derivation leading to the
expressions in Table 2.2. We first consider the exponential loss function A in Table
2.1, which is a special case since it leads to a closed-form solution for $\delta_{jc}$.
Then we explain how to solve the problem in the general case. Let us consider equation
(2.14) for the exponential risk function:

$$\sum_{i=1}^{n_j} y_{ic} y_{jc} \left( -\exp\left( -\left( -\ln(w^{t-1}_{ic}) + y_{ic} \delta_{jc} y_{jc} \right) \right) \right) = 0 \tag{2.19}$$

$$\sum_{i=1}^{n_j} y_{ic} y_{jc} \exp\left(\ln(w^{t-1}_{ic})\right) \exp\left(-y_{ic} \delta_{jc} y_{jc}\right) = 0 \tag{2.20}$$

$$\sum_{i=1}^{n_j} y_{ic} y_{jc} w^{t-1}_{ic} \exp\left(-y_{ic} \delta_{jc} y_{jc}\right) = 0 \tag{2.21}$$

$$\sum_{i=1}^{n_j} \left[ y_{ic} y_{jc} > 0 \right] w^{t-1}_{ic} \exp(-\delta_{jc}) - \sum_{i=1}^{n_j} \left[ y_{ic} y_{jc} < 0 \right] w^{t-1}_{ic} \exp(\delta_{jc}) = 0 ; \tag{2.22}$$

In expression (2.22) we split the sum over the inverse-NNs, separating the
set $R_j$ into $R^{+}_j$ and $R^{-}_j$, where $R^{+}_j$ denotes the good inverse NNs
(i-NNs with the same label as $j$) and $R^{-}_j$ denotes the bad ones (i-NNs which do
not have the same label as $j$). Then, using definitions (2.17) and (2.18), we get:

$$W^{+}_{jc} \exp(-\delta_{jc}) - W^{-}_{jc} \exp(\delta_{jc}) = 0 , \tag{2.23}$$

which leads to the following final expression of $\delta_{jc}$:

$$\delta_{jc} = \frac{1}{2} \ln\left( \frac{W^{+}_{jc}}{W^{-}_{jc}} \right) . \tag{2.24}$$

Therefore, the iterative update of boosting weights $w^{t}_{ic}$ in (2.10) as a function
of $\delta_{jc}$ is expressed as follows:

$$w^{t}_{ic} = \exp\left(-y_{ic} h^{t}_{jc}(x_i)\right) \tag{2.25}$$

$$= \exp\left(-y_{ic} h^{t-1}_{jc}(x_i) - y_{ic} y_{jc} \delta_{jc}\right) \tag{2.26}$$

$$= w^{t-1}_{ic} \exp\left(-y_{ic} y_{jc} \delta_{jc}\right) \tag{2.27}$$

F      $\delta_{jc}$, see (2.17) and (2.18)      $g : w_i \leftarrow g(w_i)$
exp    $\frac{1}{2} \ln \frac{W^{+}_{jc}}{W^{-}_{jc}}$      $w_i \exp(-y_{ic} y_{jc} \delta_{jc})$
log    $\ln \frac{W^{+}_{jc}}{W^{-}_{jc}}$      $\frac{w_i \exp(-y_{ic} y_{jc} \delta_{jc})}{1 - w_i (1 - \exp(-y_{ic} y_{jc} \delta_{jc}))}$
mat    $\frac{2 W_{jc} - 1}{2 \sqrt{W_{jc} (1 - W_{jc})}}$      $1 - \frac{1 - w_i + \sqrt{w_i (2 - w_i)}\, \delta_{jc} y_{ic} y_{jc}}{\sqrt{1 + \delta_{jc}^{2} w_i (2 - w_i) + 2 (1 - w_i) \sqrt{w_i (2 - w_i)}\, \delta_{jc} y_{ic} y_{jc}}}$
Table 2.2: Computation of $\delta_{jc}$ and the weight update rule of our implementation
of UNN, for the strictly convex losses in Table 2.1. UNN leverages example $j$ for
class $c$, and the weight update is that of example $i$ (see text for details and notations).
For the remaining loss functions, it is not possible to solve (2.15) directly. We
then assume that
$F'\left(F'^{-1}(-w_{ic}) + y_{ic} \delta_{jc} y_{jc}\right) \simeq -w_{ic} F'\left(y_{ic} \delta_{jc} y_{jc}\right)$.
Therefore, equation (2.14) becomes:

$$\sum_{i=1}^{n_j} y_{ic} y_{jc} w^{t-1}_{ic} F'\left(y_{ic} \delta_{jc} y_{jc}\right) = 0 \tag{2.28}$$

$$\sum_{i=1}^{n_j} \left[ y_{ic} y_{jc} > 0 \right] w^{t-1}_{ic} F'(\delta_{jc}) - \sum_{i=1}^{n_j} \left[ y_{ic} y_{jc} < 0 \right] w^{t-1}_{ic} F'(-\delta_{jc}) = 0 \tag{2.29}$$

$$W^{+}_{jc} F'(\delta_{jc}) - W^{-}_{jc} F'(-\delta_{jc}) = 0 . \tag{2.30}$$

Replacing $F'$ in (2.30) and (2.10) by its expression for each of the
considered losses leads directly to Table 2.2. The convergence proof and the
theoretical properties of UNN are detailed in [Nock et al. 2012].
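The weight updates $g$ of Table 2.2 can be checked against the definition (2.10): if $w = -F'(e)$ for an edge $e$, then $g(w)$ should equal $-F'(e + y_{ic} y_{jc} \delta_{jc})$. The Python sketch below performs this check numerically for the logistic and Matsushita rows, and verifies that $\delta = \ln(W^{+}/W^{-})$ solves (2.30) for the logistic loss. The function names, test grids and sample values are assumptions chosen for illustration.

```python
import math

def d_log(x): return -1.0 / (1.0 + math.exp(x))           # F' for the logistic loss
def d_mat(x): return -1.0 + x / math.sqrt(1.0 + x * x)    # F' for Matsushita's loss

def g_log(w, s, delta):
    """Logistic row of Table 2.2; s stands for y_ic * y_jc in {-1, +1}."""
    e = math.exp(-s * delta)
    return w * e / (1.0 - w * (1.0 - e))

def g_mat(w, s, delta):
    """Matsushita row of Table 2.2; s stands for y_ic * y_jc in {-1, +1}."""
    r = math.sqrt(w * (2.0 - w))
    num = 1.0 - w + r * delta * s
    den = math.sqrt(1.0 + delta ** 2 * w * (2.0 - w)
                    + 2.0 * (1.0 - w) * r * delta * s)
    return 1.0 - num / den

def max_gap(d_loss, g, edges, deltas):
    """Largest difference between g(w) and -F'(edge + s * delta) over a grid."""
    gap = 0.0
    for edge in edges:
        w = -d_loss(edge)  # weight before the update, as in (2.10)
        for delta in deltas:
            for s in (-1.0, 1.0):
                exact = -d_loss(edge + s * delta)
                gap = max(gap, abs(exact - g(w, s, delta)))
    return gap

edges = [-3.0, -1.0, -0.2, 0.0, 0.4, 1.5, 3.0]
deltas = [0.1, 0.5, 2.0]
gap_log = max_gap(d_log, g_log, edges, deltas)
gap_mat = max_gap(d_mat, g_mat, edges, deltas)

# delta = ln(W+ / W-) should cancel the left-hand side of (2.30) for the logistic loss.
w_plus, w_minus = 3.0, 1.0
delta_log = math.log(w_plus / w_minus)
residual_log = w_plus * d_log(delta_log) - w_minus * d_log(-delta_log)
```

Both gaps stay at the level of floating-point rounding, which suggests that the update rules of Table 2.2 are exact reformulations of (2.10) rather than approximations.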
2.4 Implementation details and optimizations
2.4.1 Implementation
First, we can face a class imbalance problem, especially because we
are considering a one-vs-all framework. To cope with this problem we use adaptive
weights $w_{ic}$. That is: initially, the $w^{0}_{ic}$ are weighted, according to whether or
not they belong to class $c$, by the proportion of positive (respectively negative)
samples in this class, such that the sum of weights is equal to $1$. Then, at each
iteration, we normalize the weights $w_{ic}$, $i = 1..m$, to unity after the update in (2.16).

Note that when $W^{+}_{j}$ and/or $W^{-}_{j}$ is zero, $\delta_{jc}$ in Table 2.2 is not
finite. We suggest a simple alternative to cope with this issue: we use
$(W^{+}_{j} + \varepsilon)$ instead of $W^{+}_{j}$ and $(W^{-}_{j} + \varepsilon)$ instead
of $W^{-}_{j}$.

Then, for the choice of the prototype $j$ in step [I.0] of Algorithm 1, we adopt
the following scheme: we pick $T \le m$, consider the $m$ samples, choose $j$ such that
$\alpha_{jc}$ is large enough and allow each example to be chosen only once.
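The safeguards described in this section can be illustrated with a short Python sketch. The $\varepsilon$ shift and the normalization follow the text directly; the balanced initialization is only one possible reading of the description above (each side of the one-vs-all split receives half of the total mass) and should be taken as a hypothetical choice, as are all function names and the value of `eps`.

```python
import math

def smoothed_delta_exp(w_plus, w_minus, eps=1e-6):
    """delta_jc of Table 2.2 (exponential row) with W+ and W- both shifted by eps."""
    return 0.5 * math.log((w_plus + eps) / (w_minus + eps))

def normalize(weights):
    """Rescale the weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

def balanced_initial_weights(labels_c):
    """Hypothetical initialization: positives and negatives each share half the mass.

    Assumes that both positive and negative samples are present for the class.
    """
    n_pos = sum(1 for y in labels_c if y > 0)
    n_neg = len(labels_c) - n_pos
    return [0.5 / n_pos if y > 0 else 0.5 / n_neg for y in labels_c]

labels_c = [+1, -1, -1, -1]            # one-vs-all labels for a single class c
w0 = balanced_initial_weights(labels_c)
delta_all_good = smoothed_delta_exp(0.3, 0.0)   # W- = 0: finite thanks to eps
delta_empty = smoothed_delta_exp(0.0, 0.0)      # empty inverse neighborhood
```

With this shift, a prototype whose inverse neighborhood is empty simply receives a zero update, and a prototype with only good inverse neighbors receives a large but finite one.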
2.4.2 Metric setting
Two major issues arise when implementing our UNN algorithm in practice. The
first one concerns the distance (or, more generally, the dissimilarity measure) used
for the k-NN search. The second one consists in setting the value of $k$ for both
training and testing our prototype-based classifiers (see Section 2.4.3).

Defining the most appropriate dissimilarity measure for k-NN search is
particularly challenging when dealing with very high-dimensional feature vectors like
the ones commonly used for categorization. Indeed, the standard metric distances
may be inadequate when such vectors are generated by sophisticated pre-processing
stages (e.g., vector quantization or unsupervised dictionary learning), thus lying
on complex high-dimensional manifolds. In general, this requires an
additional distance learning stage in order to define the optimal dissimilarity measure
for the particular type of data at hand. In this respect, our UNN method has
the advantage of being fully complementary with any metric learning algorithm
[Bel Haj Ali et al. 2010], acting on top of the k-NN search (see Appendix A).
Furthermore, since we use here BoF based on normalized histograms, we prefer to use
the standard $L_1$ distance and thus avoid expensive computational tasks.
2.4.3 Parameters and optimization
Selecting a good value for $k$ amounts to learning parameter-dependent weak
classifiers, where the parameter $k$ specifies the size of the voting neighborhood in
classification rule (2.2). From the theoretical standpoint, a brute-force approach is possible
with boosting: one can define multiple candidate weak classifiers per example, one
for each value of $k$, i.e., for each neighborhood size, and then learn prototypes by
optimizing the surrogate risk function over $k$ as well. This strategy has the
advantage of enabling direct learning of $k$ at training time. However, training several
weak classifiers per example without computation tricks would potentially severely
# of categories      10      20      30      40      50      60     100
k-NN BoF          76.38   57.28   45.00   40.27   36.09   32.30   24.67
SVM BoF           83.85   67.65   58.21   53.45   47.81   44.09   35.31
AdaBoost BoF      75.37   58.21   45.57   37.75   32.41   29.01   26.72
UNN_s BoF         84.28   70.44   58.49   51.07   46.34   41.80   31.61
Table 2.3: Classification performances of the different methods we tested, in terms
of the average accuracy or mAP, as a function of the number of categories.
tion which, to classify new observations, convolutes weighting with a simple density
estimation suggested by boosting. Typically, we consider a logistic estimator for a
Bernoulli prior which vanishes with the rank of the example in the neighbors, thus
decreasing the importance of the farthest neighbors:

$$\hat{p}(j) = \beta_j = \frac{1}{1 + \exp(\lambda (j - 1))} , \tag{2.31}$$

with $\lambda > 0$. The shape prior is chosen this way because it was shown that
boosting, as carried out in a number of algorithms not restricted to the induction of linear
separators [Nock & Nielsen 2009], locally fits logistic estimators for Bernoulli
priors. The soft version of UNN we obtain, called UNN$_s$ (for Soft UNN), replaces
(2.2) by:

$$h^{\ell}_{c}(x) = \sum_{j \sim_k x} \beta_j \alpha_{jc} y_{jc} . \tag{2.32}$$

Notice that it is useless to enforce the normalization of the coefficients $\beta_j$ in
(2.31), because it would not change the classification of UNN$_s$. Notice also that the
$\beta_j$ in (2.32) are used only to classify new observations: the training steps of
UNN$_s$ are the same as those of UNN, and so UNN$_s$ meets the same theoretical
properties as UNN described in [Nock et al. 2012].
2.5 Experiments
In this section, we present experimental results of UNN for image categorization.
Our experiments aim at carefully quantifying and explaining the gains brought by
boosting on k-NN voting on real image databases. In particular, we propose in
this section a precision and accuracy comparison between UNN and k-NN, SVM and
AdaBoost, using Bag-of-Features (BoF) as descriptors. Here, we extracted 2500
SIFT [Lowe 2004] per image to form a codebook of 500 visual words. BoF of
dimension 500 are then computed by vector quantizing the local SIFT features
using this codebook.

We selected 100 categories from the SUN database [Xiao et al. 2010]. We kept all
the images of each category and the inherent unbalancing of the original database.
Figure 2.1: Classification performances of the tested methods as a function of the
number of image categories.
averaging classification rates over categories (diagonal of the confusion matrix) and
then averaging those values after repeating each experiment 10 times on different
folds. To speed up processing time, we used the Yael toolbox¹ for a fast implementation
of k-NN. Furthermore, we also developed an optimized version of our program, which
exploits multi-thread functionalities. We denote this version as UNN$_s$ (MT). All
the experiments were run on an Intel Xeon X5690 12-core processor at 3.46 GHz.
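The accuracy measure described above, an average of the per-category classification rates read from the diagonal of the confusion matrix, can be sketched as follows in Python. The function names and the confusion matrix are hypothetical; rows are assumed to index the true category and columns the predicted one.

```python
def mean_per_class_accuracy(confusion):
    """Average of the per-category rates on the diagonal of the confusion matrix.

    confusion[c][d] counts samples of true category c predicted as category d.
    Categories without any sample are skipped.
    """
    rates = [row[c] / sum(row) for c, row in enumerate(confusion) if sum(row) > 0]
    return sum(rates) / len(rates)

def overall_accuracy(confusion):
    """Share of correctly classified samples, all categories pooled."""
    correct = sum(row[c] for c, row in enumerate(confusion))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Hypothetical unbalanced problem: 100 samples of one category, 10 of the other.
confusion = [[90, 10],
             [5, 5]]
per_class = mean_per_class_accuracy(confusion)  # (0.9 + 0.5) / 2
pooled = overall_accuracy(confusion)            # 95 / 110
```

On unbalanced data the per-category average is lower than the pooled accuracy whenever the small categories are the hard ones, which is why it is the more informative figure when the inherent unbalancing of the database is kept.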
We compared UNN$_s$, SVM with Gaussian RBF kernel, and AdaBoost with
decision stumps (i.e., decision trees with a single internal node), using BoF
descriptors. In particular, we followed the guidelines of [Hsu et al. 2003] for carrying
out the SVM experiments, thus carrying out cross-validation to select the best
parameter values for SVM.
In Table 2.3 we report the accuracy for each classification method. Results in
this table are provided as a function of the number of image categories. The most
relevant results obtained are also displayed in Figure 2.1 (mAP as a function of the
number of categories) and in Figures 2.2 and 2.3, for the training and classification
times, respectively.

Accuracy results show that UNN$_s$ dramatically outperforms AdaBoost (and
k-NN as well); this result, which somehow experimentally confirms that UNN
successfully exploits the boosting theory, was quite predictable, as UNN builds a
piece-wise linear decision function in the initial domain $X$, while AdaBoost builds a linear
separator in this domain. SVM, on the other hand, have access to non-linear
fitting of data, by lifting the data to a domain whose dimension far exceeds that of
$X$. Yet, SVM testing results are somehow not as good as one might expect from
this clear-cut theoretical advantage over UNN, and also from the fact that we carried
out SVM with significant parameter optimization [Hsu et al. 2003]. Indeed, UNN$_s$
even beats SVMs over 10 to 30 categories, being slightly outperformed by them on
more categories.

¹ Code available at https://gforge.inria.fr/frs/?group_id=2151

Figure 2.2: Training time as a function of the number of image categories.

# categories          10      20      30      40      50      60     100
# training images    951   2,162   3,099   4,381   5,540   6,568  11,186
k-NN                   0
SVM                  2.4      27      83     226     472     806    4526
AdaBoost              96     218     341     442     559     662    1128
UNN_s                1.7      16      58     150     295     498    2146
UNN_s (MT)           0.3     2.5     7.8      19      36      53     257
Table 2.4: Computation time [s] for the training phase.

# categories          10      20      30      40      50      60     100
# test images        951   2,162   3,099   4,381   5,540   6,568  11,186
k-NN                0.20     1.0     2.0     4.0     6.0     9.0    22.0
SVM                 0.25     5.7      13      31      56      80     260
AdaBoost            0.02     0.1    0.25    0.43    0.67    0.95    2.74
UNN_s               0.21    0.72     1.6     2.7     4.2     5.9      17
UNN_s (MT)          0.08     0.2    0.37    0.58    0.84    1.11    3.25
Table 2.5: Computation time [s] for the testing phase.
In Tables 2.4 and 2.5 we report the corresponding computation times (in seconds)
for the training and classification phases, respectively. Obviously, the computation
times for training and testing are also key to interpreting the experimental results.
Table 2.4 shows that, while the training time of AdaBoost is linear, UNN$_s$ is
a clear-cut winner over SVM for training: it achieves speedups ranging
between two and more than seventeen over SVM. Thus, UNN provides the best
precision/time trade-off among the tested methods, which suggests that UNN might
well be more than a legitimate contender to classification methods dealing with huge
domains, or domains where the testing set is huge compared to the training set,
which is the case, for instance, for cell classification in biological images. Finally, we
have only scratched the surface of experimental optimizations for UNN, and have not optimized
In this hapter, we ontribute to ll an important void of NN methods,
show-ing how boosting an betransferred to k-NN lassi ation, with onvergen e rates
guarantees for alarge numberofsurrogates. UNN,whi hbuilds upontheworksof
([Piro etal.2012℄), generalizes lassi k-NN to weighted voting where weights, the
so- alled leveraging oe ients, areiteratively learnedbyUNN.Weprovethatthis
algorithm onverges to the global optimum of many surrogate risks in ompetitive
times undervery mildassumptions. Compared to [Piro etal.2012℄,weenlarge the
setofformalboostingavorsof UNN,fromasingletonasso iatedtotheexponential
lossto a seten ompassingpopular losseslike thelogisti and matsushitaloss.
Our approach is also the first extensive assessment of UNN on computer vision related tasks. Comparisons with k-NN, support vector machines and AdaBoost, using Bag-of-Feature descriptors, on real domains, display the ability of UNN to be competitive with its contenders, achieving high accuracy in comparatively reduced training and testing times.
An optimization approach using metric learning is not reported in this chapter, since it does not concern our learning framework; it is reported in Appendix A ([Bel Haj Ali et al. 2010]). It includes blending UNN with an approach that learns
Chapter 3. Newton Nearest Neighbor algorithm: N³
3.1 Introduction
Large scale image classification implies satisfying tight time, memory and numerical processing requirements. Coping with them involves in general two kinds of approaches. For the first one, scalability goes hand in hand with simplification: algorithms are built around sophisticated, state-of-the-art approaches that are simplified to fit into these requirements, such as Support Vector Machines (SVM) with linear kernels [Shalev-Shwartz et al. 2007], or (Ada)Boosting with weight clipping and simple stumps as weak classifiers [Ali et al. 2011].
The second kind of approaches use as core very simple algorithms that already fit into these requirements, and then, from this basis, elaborate more complex approaches with improved performances: this is the case for the k-nearest neighbor (NN) classifier, or the nearest class mean classifier embedded with metric learning [Mensink et al. 2012, Weinberger & Saul 2009]. From the experimental standpoint, these latter approaches obtain surprisingly competitive results with respect to the former ones. In fact, they may have another advantage: while theoretical guarantees barely survive extreme simplification, elaborating on a core makes it perhaps easier to preserve its theoretical properties, such as its statistical consistency (e.g. for k-NN [Devroye et al. 1996]).
Our algorithm belongs to the second category of approaches, as we elaborate on the ordinary k-NN classifier. Our approach is different from, but complementary to, metric learning approaches, as we choose to adapt k-NN to the boosting framework. It is in the same line of work as the UNN algorithm introduced in Chapter 2, but the present one is of Newton-Raphson type, and thus better suited to large scale classification.
Our high-level contribution is threefold: a novel Adaptive Newton-Raphson scheme to leverage k-NN, called N³, an extensive theoretical analysis of the approach, and fine-grained experimental validations on three large and challenging domains: SUN and Caltech. To be more specific, the novelty of our method includes:
(i) a proof of the boosting ability of N³, the first boosting-compliant convergence rates for a Newton-type approach to convex loss minimization to the best of our knowledge;
(iii) a proof that the output of N³ directly yields efficient estimators of posteriors;
curse of dimensionality with low memory requirement;
(v) experimentally optimized core-processing stages for N³ with linear cost per boosting iteration.
Experimental results display that N³ manages to challenge the accuracy of sophisticated approaches while being faster and requiring low memory.
The remainder of the chapter is organized as follows: Section 3.2 states basic definitions. Section 3.3 presents classification-calibrated losses. Section 3.4 presents the N³ algorithm. Section 3.5 discusses its theoretical properties. Section 3.6 presents experiments, and Section 3.7 concludes the chapter.
3.2 Basic definitions
We first provide some basic definitions. Our setting is multiclass, multilabel classification. We have access to an input set of $m$ examples (or prototypes), $S \doteq \{(x_i, y_i), i = 1, 2, ..., m\}$. Vector $y_i \in \{-1, +1\}^C$ encodes class memberships, assuming $y_{ic} = +1$ means that observation $x_i$ belongs to class $c$. A classifier $H$ is a function mapping observations to vectors in $\mathbb{R}^C$. Given some observation $x$, the sign of coordinate $c$ in $H(x)$ gives whether $H$ predicts that $x$ belongs to class $c$, while its absolute value may be viewed as a confidence in classification.
The nearest neighbors (NNs) rule belongs to the oldest, simplest and still most widely studied classification algorithms [Devroye et al. 1996]. It relies on a non-negative real-valued distance function. This function measures how much two observations differ from each other, and may not necessarily satisfy the requirements of metrics. We let $j \to_k x$ denote the assertion that example $(x_j, y_j)$, or simply example $j$, belongs to the $k$ NNs of observation $x$. We shall abbreviate $j \to_k x_i$ by $j \to_k i$; in this case, we say that example $i$ belongs to the inverse neighborhood of example $j$. To classify an observation $x$, the $k$-NN rule $H(x)$ computes the sum of class vectors of its nearest neighbors, that is, $H_c(x) \doteq \sum_{j \to_k x} y_{jc}$ is the coordinate $c$ in $H(x)$. A leveraged k-NN rule [Nock et al. 2012] generalizes this to:
$$H_c(x) \doteq \sum_{j \to_k x} \alpha_{jc} y_{jc} , \tag{3.1}$$
where $\alpha_j \in \mathbb{R}^C$ leverages the classes of example $j$. Leveraging nearest neighbors raises the question as to whether there exist efficient inductive learning schemes for these leveraging coefficients.
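As an illustration of (3.1), the following minimal sketch evaluates a leveraged k-NN rule on a toy sample. It assumes the Euclidean distance as distance function; the helper names `knn_indices` and `leveraged_knn` and the toy data are ours and purely illustrative.

```python
import numpy as np


def knn_indices(X_train, x, k):
    """Indices j such that j ->_k x (Euclidean distance is an assumption)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    return np.argsort(dists, kind="stable")[:k]


def leveraged_knn(X_train, Y_train, alpha, x, k):
    """H(x) of Eq. (3.1): sum over the k NNs j of alpha_j * y_j, coordinate-wise."""
    nn = knn_indices(X_train, x, k)
    return (alpha[nn] * Y_train[nn]).sum(axis=0)


# Toy sample: m = 4 examples in the plane, C = 2 classes, labels in {-1, +1}.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
Y_train = np.array([[+1, -1], [+1, -1], [-1, +1], [-1, +1]])
x_query = np.array([0.0, 0.4])

# Unit leveraging coefficients give back the plain k-NN vote.
alpha_unit = np.ones(Y_train.shape)
H_plain = leveraged_knn(X_train, Y_train, alpha_unit, x_query, k=3)

# Non-uniform coefficients re-weight the vote of each neighbor.
alpha = np.array([[2.0, 2.0], [0.5, 0.5], [1.0, 1.0], [1.0, 1.0]])
H_lev = leveraged_knn(X_train, Y_train, alpha, x_query, k=3)
```

The sign of each coordinate of `H_lev` gives the predicted membership, and its magnitude plays the role of the confidence discussed in the basic definitions.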
To learn them, we adopt the framework of [Bartlett et al. 2006, Vernet et al. 2011], and focus on the minimization of a total calibrated risk which sums per-class losses:
$$\varepsilon_F(H, S) \doteq \frac{1}{C} \sum_{c=1}^{C} \underbrace{\frac{1}{m} \sum_{i=1}^{m} F(y_{ic} H_c(x_i))}_{\varepsilon_F(H_c, S)} . \tag{3.2}$$

crit | transfer function $f$ | calibrated loss $F$
A | $\frac{1}{1 + \exp(-x)}$ | $\ln(1 + \exp(-x))$
B | $\frac{1}{1 + 2^{-x}}$ | $\ln(1 + 2^{-x})$
C | $\frac{1}{2}\left(1 + \frac{x}{\sqrt{1 + x^2}}\right)$ | $\exp(\sinh^{-1}(-x))$
D | $\frac{1 + \max\{0, x\}}{2 + |x|}$ | $\max\{0, -x\} - \ln(2 + |x|)$

Table 3.1: Calibrated losses that match (3.3) for several transfer functions. From top to bottom, losses are the logistic loss, binary logistic loss, Matsushita's loss, calibrated linear Hinge loss.
To be classification-calibrated, loss $F : \mathbb{R} \to \mathbb{R}$ is required to be convex, differentiable and such that $F'(0) < 0$ [Bartlett et al. 2006] (Theorem 4), [Vernet et al. 2011]. The recent advances in the understanding and formalization of (multiclass) loss functions suitable for classification have essentially concluded that classification calibration is mandatory for the loss to be Fisher consistent or proper [Bartlett et al. 2006, Vernet et al. 2011]. These are crucial properties without which the minimization of the loss brings no strong statistical guarantee with respect to Bayes rule (such as universal consistency).
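The requirements above can be checked numerically on the four losses of Table 3.1. The sketch below is only a sanity check under our own assumptions: derivatives are approximated by central finite differences, convexity is tested by midpoint inequalities on a finite grid, and the names `TRANSFER`, `LOSS` and `is_calibrated_on_grid` are ours.

```python
import math

# Transfer functions f and calibrated losses F of Table 3.1 (criteria A to D).
TRANSFER = {
    "A": lambda x: 1.0 / (1.0 + math.exp(-x)),
    "B": lambda x: 1.0 / (1.0 + 2.0 ** (-x)),
    "C": lambda x: 0.5 * (1.0 + x / math.sqrt(1.0 + x * x)),
    "D": lambda x: (1.0 + max(0.0, x)) / (2.0 + abs(x)),
}
LOSS = {
    "A": lambda x: math.log(1.0 + math.exp(-x)),
    "B": lambda x: math.log(1.0 + 2.0 ** (-x)),
    "C": lambda x: math.exp(math.asinh(-x)),
    "D": lambda x: max(0.0, -x) - math.log(2.0 + abs(x)),
}


def derivative(F, x, h=1e-6):
    """Central finite-difference approximation of F'(x)."""
    return (F(x + h) - F(x - h)) / (2.0 * h)


def is_calibrated_on_grid(F, grid, tol=1e-9):
    """Checks F'(0) < 0 and midpoint convexity on consecutive, equally spaced grid points."""
    slope_ok = derivative(F, 0.0) < 0.0
    convex_ok = all(
        F(b) <= 0.5 * (F(a) + F(c)) + tol
        for a, b, c in zip(grid, grid[1:], grid[2:])
    )
    return slope_ok and convex_ok


GRID = [-3.0 + 0.25 * i for i in range(25)]  # equally spaced points in [-3, 3]
```

The same grid can be used to check that each transfer function satisfies $f(0) = 1/2$ and the symmetry $f(-x) = 1 - f(x)$.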
3.3 Classification-calibrated losses
In this chapter, we are interested in a subset of classification-calibrated functions, namely those for which:
$$F(x) \doteq -x + \int f , \tag{3.3}$$
for some continuous transfer function $f : \mathbb{R} \to [0, 1]$, increasing and symmetric with respect to $(0, 1/2 = f(0))$. Intuitively, a transfer function brings an estimate of posteriors: it is a bijective mapping between a real-valued prediction $H_c(x)$ and a corresponding posterior estimation for the class, $\hat{p}[y_c = +1 | x]$, mapping which states that both values are positively correlated, and establishes a tie for $H_c = 0$ to which corresponds $\hat{p}[y_c = +1 | x] = 1/2$. Transfer functions have a longstanding history in optimization [Kivinen & Warmuth 2001], and the set of $F$ that match (3.3) strictly contains balanced convex losses, functions with appealing statistical properties [Nock et al. 2012] (and references therein). Table 3.1 provides four examples of such losses on which we focus. Another example of losses that meet (3.3) is the squared loss, for transfer $f = \min\{1, \max\{0, x + 1/2\}\}$.
To carry out the minimization of (3.2), we adopt a mainstream 1-vs-rest boosting
Algorithm 2: Algorithm Newton Nearest Neighbors N³($S$, crit, $k$)
Input: Sample $S$, criterion crit $\in \{A, B, C, D\}$, $k \in \mathbb{N}^*$;
Let $\alpha_j \leftarrow 0, \forall j = 1, 2, ..., m$;
for $c = 1, 2, ..., C$ do  // Minimize $\varepsilon_F(H_c, S)$
    Let $w_i \leftarrow \frac{1}{\|1 + y_{ic} y_i\|_1}, \forall i$;
    for $t = 1, 2, ..., T$ do
        [I.0] // Choice of the example to leverage
        Let $j \leftarrow$ Wic($S$, $w$);
        [I.1] // Leveraging update, $\delta_j$
        Let $\eta(c, j) \leftarrow \sum_{i : j \to_k i} w_{ti} y_{ic} y_{jc}$;
        Let $n_j \leftarrow |\{i : j \to_k i\}|$;
        Compute $\delta_j$ following Table 3.2, using crit;
        [I.2] // Weights update
        $\forall i : j \to_k i$, update $w_i$ as in Table 3.2, using crit;
        [I.3] // Leveraging coefficient update
        Let $\alpha_{jc} \leftarrow \alpha_{jc} + \delta_j$;
Output: $H(x) \doteq \sum_{j \to_k x} \alpha_j \circ y_j$
$\varepsilon_F(H_c, S)$ in $\varepsilon_F(H, S)$. To do so, it fits the $c$-th coordinate in leveraging coefficients by considering the two-class problem of class $c$ versus all others.
3.4 N³: Adaptive Newton Nearest Neighbors
3.4.1 Algorithm
We now present algorithm N³, which stands for Newton Nearest Neighbors. N³ updates iteratively the leveraging coefficients of an example in $S$, example picked by an oracle, Wic for Weak Index Chooser oracle. We detail below the properties and implementation of Wic. The technical details of N³ are given in Table 3.2. N³ follows the boosting scheme, with iterative updates of leveraging coefficients followed by an iterative re-weighting of examples. Before embarking into formal algorithmic and statistical properties for N³, we first show that N³ is of Newton-Raphson type.
Theorem 1 N³ performs adaptive Newton-Raphson steps to minimize $\varepsilon_F(H_c, S)$, $\forall c$.
Proof sketch: The key to the proof, which we explore further in subsection 3.4.2, is the existence of a particular function $g_F$, strictly concave and symmetric with respect to $1/2$, which allows to rewrite the loss as:
where $\star$ denotes the (Legendre) convex conjugate. Convex conjugates have the property that their derivatives are inverses of each other. This property, along with (3.4), allows to simplify the computation of the derivatives of the loss, for any example $i$ in the inverse neighborhood of $j$:
$$\frac{\partial F(y_{ic} H_c(x_i))}{\partial \delta_j} = y_{ic} y_{jc} F'(y_{ic} H_c(x_i)) \tag{3.5}$$
$$= -y_{ic} y_{jc} ((-g_F)^\star)'(-y_{ic} H_c(x_i))$$
$$= -y_{ic} y_{jc} ((-g_F)')^{-1}(-y_{ic} H_c(x_i))$$
$$= -y_{ic} y_{jc} \left(1 - (g_F')^{-1}(-y_{ic} H_c(x_i))\right)$$
$$= -y_{ic} y_{jc} (g_F')^{-1}(y_{ic} H_c(x_i))$$
$$= -K_F w_i y_{ic} y_{jc} . \tag{3.6}$$
Eq. (3.6) holds because we can also rewrite the weights update (Table 3.2) as:
$$w_i \leftarrow \frac{1}{K_F} (g_F')^{-1}\left(\delta_j y_{ic} y_{jc} + g_F'(K_F w_i)\right) , \tag{3.7}$$
where $(g_F')^{-1}$ is the inverse function of the first derivative of $g_F$, and $K_F$ is a normalizing constant: it is respectively $\ln(2)$, $1$, $1/2$, $1$ for A, B, C and D in Table 3.3. From (3.5), it also comes $\partial^2 F(y_{ic} H_c(x_i)) / \partial \delta_j^2 = F''(y_{ic} H_c(x_i))$, where $F''$ denotes the second derivative. Considering the whole inverse neighborhood of $j$, the Newton-Raphson update for $\delta_j$ is (with $\eta(c, j) \doteq \sum_{i : j \to_k i} w_{ti} y_{ic} y_{jc}$ in N³):
$$\delta_j \leftarrow \lambda_F \times \frac{K_F \eta(c, j)}{\sum_{i : j \to_k i} F''(y_{ic} H_c(x_i))} , \tag{3.8}$$
for learning rate $0 < \lambda_F \leq 1$. Matching this expression with the updates in Table 3.2 brings learning rate:
$$0 < \lambda_F = \frac{L_F \sum_{i : j \to_k i} F''(y_{ic} H_c(x_i))}{K_F n_j} \leq \frac{L_F F''(0)}{K_F} = 1 ,$$
for each criterion A, B, C and D, where $L_F$ is respectively $4 \ln(2)$, $4 / \ln^2(2)$, $1/2$, $4$, and $n_j \doteq |\{i : j \to_k i\}|$ in N³. The inequalities come from the fact that $F'' > 0$ and takes its maximum in 0 for all criteria. We then check that $F''(0) = K_F / L_F$ for A, B, C and D.
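The last check of the proof sketch, $F''(0) = K_F / L_F$, can be reproduced numerically. The sketch below is ours: it approximates $F''$ by a second-order finite difference (an assumption; closed forms could be used instead), and the function `learning_rate` evaluates the expression of $\lambda_F$ above for a hypothetical list of edges $y_{ic} H_c(x_i)$ over one inverse neighborhood.

```python
import math

# Calibrated losses of Table 3.1 and the constants K_F, L_F of the proof sketch.
LOSS = {
    "A": lambda x: math.log(1.0 + math.exp(-x)),
    "B": lambda x: math.log(1.0 + 2.0 ** (-x)),
    "C": lambda x: math.exp(math.asinh(-x)),
    "D": lambda x: max(0.0, -x) - math.log(2.0 + abs(x)),
}
K_F = {"A": math.log(2.0), "B": 1.0, "C": 0.5, "D": 1.0}
L_F = {"A": 4.0 * math.log(2.0), "B": 4.0 / math.log(2.0) ** 2, "C": 0.5, "D": 4.0}


def second_derivative(F, x, h=1e-4):
    """Finite-difference approximation of F''(x)."""
    return (F(x + h) - 2.0 * F(x) + F(x - h)) / (h * h)


def learning_rate(crit, edges):
    """lambda_F = L_F * sum_i F''(edge_i) / (K_F * n_j), with n_j = len(edges)."""
    curvature = sum(second_derivative(LOSS[crit], z) for z in edges)
    return L_F[crit] * curvature / (K_F[crit] * len(edges))


# Ratio F''(0) * L_F / K_F, expected to be close to 1 for every criterion.
ratios = {crit: second_derivative(LOSS[crit], 0.0) * L_F[crit] / K_F[crit] for crit in "ABCD"}
```

When all edges are zero the learning rate reaches its maximum value of 1; any non-zero edge lowers the curvature and hence $\lambda_F$, which is the adaptive behavior stated in Theorem 1.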
3.4.2 A key to the properties of N³
The duality between real-valued classification and posterior estimation which stems from $f$ (see Section 3.3) is fundamental for the algorithmic and statistical properties¹ of N³. To simplify the statement of results and proofs, it is convenient to make the parallel between our calibrated losses $F$ and functions elsewhere called permissible²,
¹ See Appendix B for details on statistical properties of N³.
² The usual definitions are more restricted: for example the generator of the calibrated
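crit | leveraging update $\delta_j$ | weight update $g : w_i \leftarrow g(w_i, \delta_j, y_{ic}, y_{jc})$
A | $\frac{4 \ln(2) \eta(c, j)}{n_j}$ | $\frac{w_i}{w_i \ln 2 + (1 - w_i \ln 2) \times \exp(\delta_j y_{ic} y_{jc})}$
B | $\frac{4 \eta(c, j)}{\ln^2(2) n_j}$ | $\frac{w_i}{w_i + (1 - w_i) \times 2^{\delta_j y_{ic} y_{jc}}}$
C | $\frac{\eta(c, j)}{2 n_j}$ | $1 - \frac{1 - w_i + \sqrt{w_i (2 - w_i)} \delta_j y_{ic} y_{jc}}{\sqrt{1 + \delta_j^2 w_i (2 - w_i) + 2 (1 - w_i) \sqrt{w_i (2 - w_i)} \delta_j y_{ic} y_{jc}}}$
D | $\frac{4 \eta(c, j)}{n_j}$ | $\frac{1 + \max\left\{0, -\left(\delta_j y_{ic} y_{jc} + \frac{1 - 2 w_i}{\mathrm{err}(w_i)}\right)\right\}}{2 + \left|\delta_j y_{ic} y_{jc} + \frac{1 - 2 w_i}{\mathrm{err}(w_i)}\right|}$

Table 3.2: Leveraging and weight updates in N³ corresponding to each choice of calibrated loss in Table 3.1.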
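crit | generator $g_F$
A | $-x \ln x - (1 - x) \ln(1 - x)$
B | $-x \log_2 x - (1 - x) \log_2(1 - x)$
C | $\sqrt{x (1 - x)}$
D | $\ln(2 \mathrm{err}(x)) + 1 - 2 \mathrm{err}(x)$

Table 3.3: Generators corresponding to choices A, B, C and D.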
that is, functions defined on $(0, 1)$, strictly concave, differentiable and symmetric with respect to $x = 1/2$. It can be shown that for any of our choices of $F$, there exists a permissible $g_F$, that we call a generator, for which the relationships (3.7) and (3.4) used in the proof sketch of Theorem 1 indeed hold. Furthermore, the generator is also useful to write the transfer function itself, as we have:
$$f(x) = ((-g_F)')^{-1}(x) . \tag{3.9}$$
Table 3.3 provides the four generators corresponding to choices A, B, C and D. The permissible generator of the calibrated linear Hinge loss makes use of the error function:
$$\mathrm{err}(x) \doteq \min\{x, 1 - x\} . \tag{3.10}$$
Permissible functions (as well as (3.10)) are used in losses that rely on posterior estimation rather than real-valued classification. Such losses are the cornerstone of decision-tree induction and other methods that directly fit posteriors [Devroye et al. 1996]. Hence, (3.4) establishes a duality between the two kinds of losses, duality which appears as a watermark in various works [Bartlett et al. 2006, Friedman et al. 2000]. The writing of the weight update using $g_F$ in (3.7) is also extremely useful to simplify the proofs of the following Theorems. Finally, there is a synthetic writing for the weights, which sheds light on their interpretation: unraveling the weight update (3.7) and using (3.9), we obtain that $w_i$ satisfies:
$$w_i \propto 1 - f(y_{ic} H_c(x_i)) . \tag{3.11}$$
Hence, weights and estimated posteriors are in opposite linear relationship. According to (3.11), examples easier to classify (receiving large estimated posteriors) receive small weight. This is a fundamental property of boosting algorithms, that progressively concentrate on the hardest examples.
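To make the interplay between Algorithm 2, Table 3.2 and the weight update (3.7) concrete, we give a minimal sketch of N³ for criterion A (logistic loss). Several choices are our own assumptions and not part of the algorithm's specification: neighborhoods use the Euclidean distance and exclude the example itself (so $k \leq m - 1$), and the oracle Wic is replaced by a hypothetical rule picking the example with the largest $|\eta(c, j)|$. The names `knn_lists`, `initial_weights` and `n3_logistic` are illustrative.

```python
import numpy as np

LN2 = np.log(2.0)  # K_F for criterion A


def knn_lists(X, k):
    """Row i holds the indices j of the k NNs of x_i, excluding i itself (an assumption)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1, kind="stable")[:, :k]


def initial_weights(Y, c):
    """w_i <- 1 / ||1 + y_ic * y_i||_1, as in Algorithm 2."""
    return 1.0 / np.abs(1.0 + Y[:, [c]] * Y).sum(axis=1)


def n3_logistic(X, Y, k, T):
    """Sketch of Algorithm 2 with crit = A; returns (alpha, final weights, neighbor lists)."""
    m, C = Y.shape
    nn = knn_lists(X, k)
    # inverse[j] = {i : j ->_k i}, the inverse neighborhood of example j
    inverse = [np.flatnonzero((nn == j).any(axis=1)) for j in range(m)]
    alpha = np.zeros((m, C))
    weights = np.zeros((m, C))
    for c in range(C):
        w = initial_weights(Y, c)
        for _ in range(T):
            # [I.0] hypothetical Wic: largest |eta(c, j)| over all examples
            eta = np.array([Y[j, c] * np.dot(w[inv], Y[inv, c])
                            for j, inv in enumerate(inverse)])
            j = int(np.argmax(np.abs(eta)))
            inv = inverse[j]
            if inv.size == 0:  # only happens when every eta is zero
                break
            # [I.1] leveraging update (Table 3.2, row A)
            delta = 4.0 * LN2 * eta[j] / inv.size
            # [I.2] weight update (Table 3.2, row A)
            edge = delta * Y[inv, c] * Y[j, c]
            w[inv] = w[inv] / (w[inv] * LN2 + (1.0 - w[inv] * LN2) * np.exp(edge))
            # [I.3] leveraging coefficient update
            alpha[j, c] += delta
        weights[:, c] = w
    return alpha, weights, nn
```

Unraveling (3.7) in this sketch gives $g_F'(K_F w_i) = g_F'(K_F w_i^{0}) + y_{ic} H_c(x_i)$, where $w_i^{0}$ is the initial weight and $g_F'(p) = \ln((1 - p) / p)$ for criterion A; this is the identity behind (3.11), and it can be used as a consistency check on the returned weights and leveraging coefficients.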
3.5 Algorithmic properties of N³
The first result is a direct follow-up from Table 3.2.
Lemma 2 With choice D (calibrated linear Hinge loss), N³ may be implemented using only rational arithmetic.
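Lemma 2 can be illustrated with exact fractions. The sketch below writes the weight update of criterion D through (3.7), with $K_F = 1$ and the generator of Table 3.3, whose derivative is $g_F'(w) = (1 - 2w) / \mathrm{err}(w)$; only additions, divisions, `max` and `abs` are involved. The function names and the toy weights are ours, and Python's `fractions.Fraction` is just one possible rational type.

```python
from fractions import Fraction


def err(x):
    """Error function of Eq. (3.10)."""
    return min(x, 1 - x)


def g_prime_D(w):
    """First derivative of the generator of criterion D, for w in (0, 1)."""
    return (1 - 2 * w) / err(w)


def g_prime_D_inverse(z):
    """Inverse of g_prime_D; rational whenever z is rational."""
    return (1 + max(0, -z)) / (2 + abs(z))


def weight_update_D(w, delta, y_ic, y_jc):
    """Weight update (3.7) with K_F = 1 (criterion D)."""
    return g_prime_D_inverse(delta * y_ic * y_jc + g_prime_D(w))


def leveraging_update_D(weights, edges):
    """delta_j = 4 * eta(c, j) / n_j, where edges[i] stands for y_ic * y_jc."""
    eta = sum(w * e for w, e in zip(weights, edges))
    return 4 * eta / len(weights)


# Toy inverse neighborhood with three examples and exact rational weights.
weights = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
edges = [1, 1, -1]
delta = leveraging_update_D(weights, edges)
new_weights = [weight_update_D(w, delta, e, 1) for w, e in zip(weights, edges)]
```

Since every quantity stays a `Fraction`, applying an update with $\delta_j$ followed by one with $-\delta_j$ returns exactly the initial weight, with no rounding drift.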
Comments on Lemma 2: In the light of the boosting properties of N³, this result is important in itself. Most existing boosting algorithms, including UNN, AdaBoost, Gentle AdaBoost and spawns [Nock et al. 2012, Friedman et al. 2000] make it necessary to tweak or clip the key numerical steps, including weights update or leveraging coefficients [Ali et al. 2011], at the possible expense of failing to meet boosting's convergence or accuracy. Rational arithmetic still requires significant computational resources with respect to floating point computation, but Lemma
Let us now shift to the boosting result on N³, which is stated under the following weak learning assumption:
    There exist constants $\gamma_u > 0$, $\gamma_n > 0$ such that at any iterations $c, t$ of N³, index $j$ returned by Wic is such that $n_j > 0$ and the following holds: (i) $\frac{\sum_{i : j \to_k i} w_i}{n_j} \geq \frac{\gamma_u}{K_F}$, and (ii) $|\hat{p}_w[y_{jc} \neq y_{ic} | j \to_k i] - 1/2| \geq \gamma_n$.
Requirement (ii) corresponds to the usual weak learning assumption of boosting: it postulates that the current normalized weights in the inverse neighborhood of example $j$ authorize a classification different from random by at least $\gamma_n$. Requirement (i) states that unnormalized weights must not be too small. This is a necessary condition as unnormalized weights of minute order do not necessarily prevent (ii) from being met, but would obviously impair the convergence of N³ given the linear dependence of $\delta_j$ in the unnormalized weights. The following Theorem states that N³ is a boosting algorithm.
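The two quantities bounded by the weak learning assumption are easy to monitor for a candidate index $j$. The sketch below is ours and assumes, following the comment on requirement (ii), that $\hat{p}_w$ is computed from the weights normalized over the inverse neighborhood of $j$; the function name and the toy values are hypothetical.

```python
def weak_learning_margins(w, y_c, inverse_j, y_jc, K_F):
    """Returns the largest gamma_u and gamma_n that example j satisfies.

    w         : current unnormalized weights w_i
    y_c       : labels y_ic in {-1, +1} for the current class c
    inverse_j : indices i such that j ->_k i
    y_jc      : label of example j for class c
    """
    n_j = len(inverse_j)
    if n_j == 0:
        raise ValueError("the assumption requires n_j > 0")
    total = sum(w[i] for i in inverse_j)
    disagree = sum(w[i] for i in inverse_j if y_c[i] != y_jc)
    gamma_u = K_F * total / n_j               # requirement (i)
    gamma_n = abs(disagree / total - 0.5)     # requirement (ii)
    return gamma_u, gamma_n


# Toy check with criterion D (K_F = 1); example 3 is outside the inverse neighborhood.
w = [0.5, 0.25, 0.25, 0.1]
y_c = [1, 1, -1, 1]
inverse_j = [0, 1, 2]
gamma_u, gamma_n = weak_learning_margins(w, y_c, inverse_j, y_jc=1, K_F=1.0)
eta = sum(w[i] * y_c[i] * 1 for i in inverse_j)
```

Note that $\eta(c, j) = \sum_{i : j \to_k i} w_i (1 - 2 \cdot 1[y_{ic} \neq y_{jc}])$, so that $|\hat{p}_w - 1/2| = |\eta(c, j)| / (2 \sum_{i : j \to_k i} w_i)$: an oracle favoring large $|\eta(c, j)|$ relative to the total weight favors requirement (ii).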
Theorem 3 Suppose N³ is run for $T$ steps for each $c$, and that the weak learning assumption holds at each iteration of N³. Denote $I$ the whole multi-set of indexes returned by Wic. Then for any criterion A, B, C, D, the total calibrated risk does not exceed some $\varepsilon \leq F(0)$ provided:
$$\sum_{j \in I} n_j = \Omega\left(\frac{(C + |\varepsilon|) m}{\gamma_n^2 \gamma_u^2}\right) . \tag{3.12}$$
Remark: requirement $\varepsilon \leq F(0)$ comes from the fact that a leveraged NN with null leveraging vectors would make a total calibrated risk equal to $F(0)$.
Comments on Theorem 3: to the best of our knowledge, no formal convergence rate has been established to date for Newton approaches to boosting, including the popular Gentle AdaBoost [Friedman et al. 2000]. Theorem 3 gives several rules of thumb to run N³ and implement Wic. The first is that Wic should choose examples whose inverse neighborhood is not too small. For example, assume that boosted examples have inverse neighborhood's size not smaller than the average, implying $(1/T) \sum_{j \in I} n_j \geq k$. Then, omitting constants in the big omega of (3.12), we obtain that (3.12) is satisfied as soon as the number of iterations ($T$) meets:
$$T \geq \frac{(C + |\varepsilon|) m}{k \gamma_n^2 \gamma_u^2} .$$
This inequality suggests to choose $k$ (i) proportional to $C$ and (ii) moderately increasing in $m$. These two choices imply, under the weak learning assumption, that N