HAL Id: hal-00111670
https://hal.archives-ouvertes.fr/hal-00111670v2
Preprint submitted on 13 Feb 2007
Ranking the best instances
Stéphan Clémençon, Nicolas Vayatis
To cite this version:
Stéphan Clémençon, Nicolas Vayatis. Ranking the best instances. 2007. hal-00111670v2
Stéphan Clémençon
MODAL'X - Université Paris X Nanterre
&
Laboratoire de Probabilités et Modèles Aléatoires
UMR CNRS 7599 - Universités Paris VI et Paris VII
Nicolas Vayatis
Laboratoire de Probabilités et Modèles Aléatoires
UMR CNRS 7599 - Universités Paris VI et Paris VII
February 14, 2007
Abstract
We formulate the local ranking problem in the framework of bipartite ranking where the goal is to focus on the best instances. We propose a methodology based on the construction of real-valued scoring functions. We study empirical risk minimization of dedicated statistics which involve empirical quantiles of the scores. We first state the problem of finding the best instances, which can be cast as a classification problem with a mass constraint. Next, we develop special performance measures for the local ranking problem which extend the Area Under an ROC Curve (AUC/AROC) criterion, and describe the optimal elements of these new criteria. We also highlight the fact that the goal of ranking the best instances cannot be achieved in a stagewise manner where, first, the best instances would be tentatively identified and then a standard AUC criterion would be applied. Finally, we state preliminary statistical results for the local ranking problem.
Keywords: Ranking, ROC curve and AUC, empirical risk minimization, fast rates.
Running title: Ranking the best instances
Address of corresponding author: Nicolas Vayatis, Laboratoire de Probabilités et Modèles Aléatoires - Université Paris 6 - 175, rue du Chevaleret - 75013 Paris, France - Email: vayatis@ccr.jussieu.fr
1 Introduction
The first takes all the glory, the second takes nothing. In applications where ranking is at stake, people often focus on the best instances. When scanning the results from a query on a search engine, we rarely go beyond the first one or two pages on the screen. In the different context of credit risk screening, credit establishments elaborate scoring rules as reliability indicators, and their main concern is to identify risky prospects, especially among the top scores. In medical diagnosis, test scores indicate the odds for a patient to be healthy given a series of measurements (age, blood pressure, ...). There again, particular attention is paid to the "best" instances so as not to miss a possibly diseased patient among the highest scores. These various situations can be formulated in the setup of bipartite ranking, where one observes i.i.d. copies of a random pair $(X, Y)$, with $X$ being an observation vector describing the instance (web page, debtor, patient) and $Y$ a binary label assigning it to one population or the other (relevant vs. non-relevant, good vs. bad, healthy vs. diseased). In this problem, the goal is to rank the instances instead of simply classifying them. There is a growing literature on the ranking problem in the field of Machine Learning, but most of it considers the Area under the ROC Curve (also known as the AUC or AROC) criterion as a measure of performance of the ranking rule [6, 13, 26, 1].
In a previous work, we have mentioned that the bipartite ranking problem under the AUC criterion could be interpreted as a classification problem with pairs of observations [4]. But the limit of this approach is that it weights uniformly the pairs of items which are badly ranked. Therefore, it does not make it possible to distinguish between ranking rules making the same number of mistakes but in very different parts of the ROC curve. The AUC is indeed a global criterion which does not allow one to concentrate on the "best" instances. Special performance measures, such as the Discounted Cumulative Gain (DCG) criterion, have been introduced by practitioners in order to weight instances according to their rank [16] (see also [25, 7]), but providing theory for such criteria and developing empirical risk minimization strategies is still a very open issue. In the present paper, we extend the results of our previous work in [4] and set theoretical grounds for the problem of local ranking. The methodology we propose is based on the selection of a real-valued scoring function for which we formulate appropriate performance measures generalizing the AUC criterion. We point out that ranking the best instances is an involved task as it is a two-fold problem: (i) find the best instances and (ii) provide a good ranking on these instances. The fact that these two goals cannot be considered independently will be highlighted in the paper. Despite this observation, we will first formulate the issue of finding the best instances, which is to be understood as a toy problem for our purpose. This problem corresponds to a binary classification problem with a mass constraint (where the proportion $u_0$ of $+1$ labels predicted by the classifiers is fixed) and it might present an interest per se. The main complication here has to do with the necessity of performing empirical risk minimization with statistics involving empirical quantiles of the scores. Our technique was inspired by the former work of Koul [18] in the context of $R$-estimation, where similar statistics arise.

The rest of the paper is organized as follows. We first state the problem of finding the best instances and study the performance of empirical risk minimization in this setup (Section 2). We also explore the conditions on the distribution under which fast rates of convergence can be recovered. In Section 3, we formulate performance measures for local ranking and provide extensions of the AUC criterion. Finally (Section 4), we state some preliminary statistical results on empirical risk minimization of these new criteria.
2 Finding the best instances
In the present section, we have a limited goal, which is only to determine the best instances without bothering about their order in the list. By considering this subproblem, we will identify the main technical issues involved in the sequel. It also permits us to introduce the main notation of the paper.
Just as in standard binary classification, we consider the pair of random variables $(X, Y)$ where $X$ is an observation vector in a measurable space $\mathcal{X}$ and $Y$ is a binary label in $\{-1, +1\}$. The distribution of $(X, Y)$ can be described by the pair $(\mu, \eta)$ where $\mu$ is the marginal distribution of $X$ and $\eta$ is the a posteriori distribution defined by $\eta(x) = \mathbb{P}\{Y = 1 \mid X = x\}$, $\forall x \in \mathcal{X}$. We define the rate of best instances as the proportion of best instances to be considered and denote it by $u_0 \in (0, 1)$. We denote by $Q(\eta, 1 - u_0)$ the $(1 - u_0)$-quantile of the random variable $\eta(X)$. Then the set of best instances at rate $u_0$ is given by:

$$C_{u_0} = \{x \in \mathcal{X} \mid \eta(x) \geq Q(\eta, 1 - u_0)\}.$$

We mention two trivial properties of the set $C_{u_0}$ which will be important in the sequel:

- Mass constraint: we have $\mu(C_{u_0}) = \mathbb{P}\{X \in C_{u_0}\} = u_0$,
- Invariance property: as a functional of $\eta$, the set $C_{u_0}$ is invariant under strictly increasing transforms of $\eta$.

The problem of finding a proportion $u_0$ of the best instances boils down to the estimation of the unknown set $C_{u_0}$ on the basis of empirical data. Before turning to the statistical analysis of the problem, we first relate it to binary classification.
2.1 A classification problem with a mass constraint
A classifier is a measurable function $g : \mathcal{X} \to \{-1, +1\}$ and its performance is measured by the classification error $L(g) = \mathbb{P}\{Y \neq g(X)\}$. Let $u_0 \in (0, 1)$ be fixed. Denote by $g_{u_0} = 2\,\mathbb{I}_{C_{u_0}} - 1$ the classifier predicting $+1$ on the set of best instances $C_{u_0}$ and $-1$ on its complement. The next proposition shows that $g_{u_0}$ is an optimal element for the problem of minimization of $L(g)$ over the family of classifiers $g$ satisfying the mass constraint $\mathbb{P}\{g(X) = 1\} = u_0$.

Proposition 1
For any classifier $g : \mathcal{X} \to \{-1, +1\}$ such that $g(x) = 2\,\mathbb{I}_C(x) - 1$ for some subset $C$ of $\mathcal{X}$ with $\mu(C) = \mathbb{P}\{g(X) = 1\} = u_0$, we have

$$L_{u_0} := L(g_{u_0}) \leq L(g).$$

Furthermore, we have

$$L_{u_0} = 1 - Q(\eta, 1 - u_0) + (1 - u_0)(2Q(\eta, 1 - u_0) - 1) - \mathbb{E}\left(|\eta(X) - Q(\eta, 1 - u_0)|\right),$$

and

$$L(g) - L(g_{u_0}) = 2\,\mathbb{E}\left(|\eta(X) - Q(\eta, 1 - u_0)|\,\mathbb{I}_{C_{u_0} \Delta C}(X)\right),$$

where $\Delta$ denotes the symmetric difference operation between two subsets of $\mathcal{X}$.

Proof. For simplicity, we temporarily change the notation and set $q = Q(\eta, 1 - u_0)$. Then, for any classifier $g$ satisfying the constraint $\mathbb{P}\{g(X) = 1\} = u_0$, we have

$$L(g) = \mathbb{E}\left((\eta(X) - q)\,\mathbb{I}_{[g(X) = -1]} + (q - \eta(X))\,\mathbb{I}_{[g(X) = +1]}\right) + (1 - u_0)q + (1 - q)u_0.$$

The statements of the proposition immediately follow.
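The two identities of Proposition 1 can be checked numerically. The sketch below (our own illustration, not part of the paper) evaluates them exactly on a toy model where $X$ is uniform over ten points and the threshold $q$ is chosen so that the mass constraint holds exactly:

```python
import numpy as np

# Toy discrete model: X uniform on 10 points, eta(x) = P{Y=1 | X=x}.
eta = np.array([0.05, 0.1, 0.2, 0.3, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9])
u0 = 0.4
q = np.sort(eta)[-4]            # threshold with P{eta(X) >= q} = u0 exactly

def class_error(in_C):
    """Exact error of g = +1 on C, -1 elsewhere: E[eta 1{g=-1}] + E[(1-eta) 1{g=+1}]."""
    return np.mean(np.where(in_C, 1.0 - eta, eta))

C_star = eta >= q               # set of best instances at rate u0
L_star = class_error(C_star)

# closed-form expression for L_{u0} from Proposition 1
L_formula = 1 - q + (1 - u0) * (2 * q - 1) - np.mean(np.abs(eta - q))
assert abs(L_star - L_formula) < 1e-12

# excess-risk identity for another set C of the same mass u0
C = C_star.copy()
C[[5, 9]] = [True, False]       # swap one instance in, one out
excess = class_error(C) - L_star
assert abs(excess - 2 * np.mean(np.abs(eta - q) * (C ^ C_star))) < 1e-12
```

Both assertions pass: the excess risk of a mass-constrained competitor is exactly twice the expected margin $|\eta(X) - q|$ over the symmetric difference.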
There have been several advances in the field of classification theory where the aim is to introduce constraints into the classification procedure or to adapt it to other problems. We relate our formulation to other approaches in the following remarks.
Remark 1 (Connection to hypothesis testing). The implicit asymmetry in the problem due to the emphasis on the best instances is reminiscent of the statistical theory of hypothesis testing. We can formulate a test of simple hypotheses by taking the null assumption to be $H_0 : Y = +1$ and the alternative assumption to be $H_1 : Y = -1$. We want to decide which hypothesis is true given the observation $X$. Each classifier $g$ provides a test statistic $g(X)$. The performance of the test is then described by its type I error $\alpha(g) = \mathbb{P}\{g(X) = 1 \mid Y = -1\}$ and its power $\beta(g) = \mathbb{P}\{g(X) = 1 \mid Y = +1\}$. We point out that if the classifier $g$ satisfies a mass constraint, then we can relate the classification error to the type I error of the test defined by $g$ through the relation $L(g) = 2(1 - p)\alpha(g) + p - u_0$, where $p = \mathbb{P}\{Y = 1\}$; similarly, we have $L(g) = 2p(1 - \beta(g)) - p + u_0$. Therefore, the optimal classifier minimizes the type I error (equivalently, maximizes the power) among all classifiers satisfying the mass constraint. A related formulation would impose a constraint on the probability of a false alarm (type I error) and maximize the power. This question is explored in a recent paper by Scott [27] (see also [29]).
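These relations can be verified on a small example. The sketch below (our illustration; note that the second relation is taken with the sign $+u_0$, which is the one consistent with the first) computes $L(g)$, $\alpha(g)$ and $\beta(g)$ exactly for a toy discrete distribution:

```python
import numpy as np

# Toy model: X uniform on 10 points, eta(x) = P{Y=1 | X=x}.
eta = np.array([0.05, 0.1, 0.2, 0.3, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9])
u0 = 0.4
C = eta >= np.sort(eta)[-4]     # mass-constrained classifier: P{g(X)=1} = u0

p = eta.mean()                              # p = P{Y = 1}
L = np.mean(np.where(C, 1.0 - eta, eta))    # classification error L(g)
alpha = np.mean((1.0 - eta) * C) / (1 - p)  # type I error P{g=1 | Y=-1}
beta = np.mean(eta * C) / p                 # power       P{g=1 | Y=+1}

assert abs(L - (2 * (1 - p) * alpha + p - u0)) < 1e-12
assert abs(L - (2 * p * (1 - beta) - p + u0)) < 1e-12
```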
Remark 2 (Connection with regression level set estimation). We mention that the estimation of the level sets of the regression function has been studied in the statistics literature [3] (see also [32], [38]) as well as in the learning literature, for instance in the context of anomaly detection ([31, 28, 37]). In our framework of classification with a mass constraint, the threshold defining the level set involves a quantile of the random variable $\eta(X)$.

Remark 3 (Connection with the minimum volume set approach). Although the point of view adopted in this paper is very different, the problem described above may be formulated in the framework of minimum volume set learning as considered in [30]. As a matter of fact, the set $C_{u_0}$ may be viewed as the solution of the constrained optimization problem:

$$\min_C \ \mathbb{P}\{X \in C \mid Y = -1\} \quad \text{over the class of measurable sets } C, \text{ subject to } \mathbb{P}\{X \in C\} \geq u_0.$$

The main difference in our case comes from the fact that the constraint on the volume set has to be estimated using the data, while in [30] it is computed from a known reference measure. We believe that learning methods for minimum volume set estimation may hopefully be extended to our setting. A natural way to do so would consist in replacing the conditional distribution of $X$ given $Y = -1$ by its empirical counterpart. This is beyond the scope of the present paper but will be the subject of future investigation.

2.2 Empirical risk minimization
We now investigate the estimation of the set $C_{u_0}$ of best instances at rate $u_0$ based on training data. Suppose that we are given $n$ i.i.d. copies $(X_1, Y_1), \ldots, (X_n, Y_n)$ of the pair $(X, Y)$. Since we have the ranking problem in mind, our methodology will consist in building the candidate sets from a class $\mathcal{S}$ of real-valued scoring functions $s : \mathcal{X} \to \mathbb{R}$. Indeed, we consider sets of the form

$$C_s := C_{s, u_0} = \{x \in \mathcal{X} \mid s(x) \geq Q(s, 1 - u_0)\},$$

where $s$ is an element of $\mathcal{S}$ and $Q(s, 1 - u_0)$ is the $(1 - u_0)$-quantile of the random variable $s(X)$. Note that such sets satisfy the same properties as $C_{u_0}$ with respect to the mass constraint and invariance under strictly increasing transforms of $s$.
We then set:

$$L(s) := L(s, u_0) := L(C_s) = \mathbb{P}\{Y \cdot (s(X) - Q(s, 1 - u_0)) < 0\}.$$

A scoring function minimizing the quantity

$$L_n(s) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{Y_i \, (s(X_i) - Q(s, 1 - u_0)) < 0\}$$

is expected to approximately minimize the true error $L(s)$, but the quantile depends on the unknown distribution of $X$. In practice, one has to replace $Q(s, 1 - u_0)$ by its empirical counterpart $\hat{Q}(s, 1 - u_0)$, which corresponds to the empirical quantile. We will thus consider, instead of $L_n(s)$, the truly empirical error:

$$\hat{L}_n(s) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{Y_i \, (s(X_i) - \hat{Q}(s, 1 - u_0)) < 0\}.$$

Note that $\hat{L}_n(s)$ is a complicated statistic since the empirical quantile involves all the instances $X_1, \ldots, X_n$. We also mention that $\hat{L}_n(s)$ is a biased estimate of the classification error $L(s)$ of the classifier $g_s(x) = 2\,\mathbb{I}\{s(x) \geq Q(s, 1 - u_0)\} - 1$.
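For illustration (not from the paper), here is a short NumPy sketch of the statistic $\hat{L}_n(s)$, with the empirical quantile computed as $\inf\{t : \hat{F}_s(t) \geq v\}$; the scores and labels are simulated stand-ins:

```python
import numpy as np

def empirical_quantile(values, v):
    """Generalized inverse of the empirical cdf: inf{t : F_hat(t) >= v}."""
    x = np.sort(values)
    n = len(x)
    k = int(np.ceil(v * n))          # smallest k with k/n >= v
    return x[max(k, 1) - 1]

def L_hat(s_values, y, u0):
    """Empirical error L_hat_n(s) from scores s(X_i) and labels Y_i in {-1,+1}."""
    q_hat = empirical_quantile(s_values, 1.0 - u0)
    return np.mean(y * (s_values - q_hat) < 0)

rng = np.random.default_rng(0)
s = rng.normal(size=200)                          # scores s(X_i)
y = np.where(rng.uniform(size=200) < 0.3, 1, -1)  # labels Y_i
err = L_hat(s, y, u0=0.2)
```

Note that, as in the definition above, an instance sitting exactly at the empirical quantile is not counted as an error (the indicator uses a strict inequality).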
We introduce some more notation. Set, for all $t \in \mathbb{R}$:

$$F_s(t) = \mathbb{P}\{s(X) \leq t\}, \quad G_s(t) = \mathbb{P}\{s(X) \leq t \mid Y = +1\}, \quad H_s(t) = \mathbb{P}\{s(X) \leq t \mid Y = -1\},$$

the cumulative distribution functions (cdf) of $s(X)$ (respectively, given $Y = +1$ and given $Y = -1$). We recall that the definition of the quantiles of (the distribution of) a random variable involves the notion of generalized inverse $F^{-1}$ of a function $F$:

$$F^{-1}(z) = \inf\{t \in \mathbb{R} \mid F(t) \geq z\}.$$

Thus, we have, for all $v \in (0, 1)$: $Q(s, v) = F_s^{-1}(v)$ and $\hat{Q}(s, v) = \hat{F}_s^{-1}(v)$, where $\hat{F}_s$ is the empirical cdf of $s(X)$: $\hat{F}_s(t) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}\{s(X_i) \leq t\}$, $\forall t \in \mathbb{R}$. Without loss of generality, we will assume that all scoring functions in $\mathcal{S}$ take their values in $(0, \lambda)$ for some $\lambda > 0$.
We now turn to the study of the performance of minimizers of $\hat{L}_n(s)$ over a class $\mathcal{S}$ of scoring functions, defined by

$$\hat{s}_n = \operatorname{argmin}_{s \in \mathcal{S}} \hat{L}_n(s).$$

Our first main result is an excess risk bound for the empirical risk minimizer $\hat{s}_n$ over a class $\mathcal{S}$ of uniformly bounded scoring functions. In the following theorem, we assume that the level sets of scoring functions from the class $\mathcal{S}$ form a Vapnik-Chervonenkis (VC) class of sets.
Theorem 2
Assume that:
(i) the class $\mathcal{S}$ is symmetric (i.e. if $s \in \mathcal{S}$ then $\lambda - s \in \mathcal{S}$) and is a VC major class of functions with VC dimension $V$;
(ii) the family $\mathcal{K} = \{G_s, H_s : s \in \mathcal{S}\}$ of cdfs satisfies the following property: any $K \in \mathcal{K}$ has left and right derivatives, denoted by $K'_+$ and $K'_-$, and there exist strictly positive constants $b$, $B$ such that $\forall (K, t) \in \mathcal{K} \times (0, \lambda)$,

$$b \leq K'_+(t) \leq B \quad \text{and} \quad b \leq K'_-(t) \leq B.$$

Then, for any $\delta > 0$, we have, with probability larger than $1 - \delta$,

$$L(\hat{s}_n) - \inf_{s \in \mathcal{S}} L(s) \leq c_1 \sqrt{\frac{V}{n}} + c_2 \sqrt{\frac{\ln(1/\delta)}{n}},$$

for some positive constants $c_1$, $c_2$.

We now provide some insights on conditions (i) and (ii) of the theorem.
Remark 4 (On the complexity assumption). For the terminology of major sets and major classes, we refer to Dudley [10]. In the proof, we need to control empirical processes indexed by sets of the form $\{x : s(x) \geq t\}$ or $\{x : s(x) < t\}$. Condition (i) guarantees that these sets form a VC class of sets.
Remark 5 (On the choice of the class $\mathcal{S}$ of scoring functions). In order to grasp the meaning of condition (ii) of the theorem, we consider the one-dimensional case with real-valued scoring functions. Assume that the distribution of the random variable $X_i$ has a bounded density $f$ with respect to Lebesgue measure. Assume also that the scoring functions $s$ are differentiable except, possibly, at a finite number of points, with derivatives denoted by $s'$. Denote by $f_s$ the density of $s(X)$. Let $t \in (0, \lambda)$ and denote by $x_1, \ldots, x_p$ the real roots of the equation $s(x) = t$. We can express the density of $s(X)$ thanks to the change-of-variable formula (see e.g. [24]):

$$f_s(t) = \frac{f(x_1)}{|s'(x_1)|} + \ldots + \frac{f(x_p)}{|s'(x_p)|}.$$
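A quick Monte Carlo sanity check of this formula (our own illustration): for $X$ uniform on $(0, 1)$ and the piecewise linear score $s(x) = |x - 0.5|$, the equation $s(x) = t$ has two roots for $t \in (0, 0.5)$, and the formula predicts $f_s(t) = f(0.5 - t)/1 + f(0.5 + t)/1 = 2$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=200_000)   # X ~ Uniform(0,1), density f = 1 on (0,1)
s = np.abs(x - 0.5)             # piecewise linear score with |s'| = 1 everywhere

# Estimate the density of s(X) near t by a symmetric histogram bin.
t, h = 0.2, 0.05
density_est = np.mean((s > t - h) & (s <= t + h)) / (2 * h)
assert abs(density_est - 2.0) < 0.1   # close to the predicted value 2
```

This score is exactly of the kind the remark recommends: piecewise linear, with left and right derivatives bounded away from 0 and infinity.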
This shows that the scoring functions should present neither flat nor steep parts. We can take, for instance, the class $\mathcal{S}$ to be the class of piecewise linear functions with a finite number of local extrema and with uniformly bounded left and right derivatives: $\forall s \in \mathcal{S}$, $\forall x$, $m \leq |s'_+(x)| \leq M$ and $m \leq |s'_-(x)| \leq M$, for some strictly positive constants $m$ and $M$ (see Figure 1). Note that any subinterval of $[0, \lambda]$ has to be in the range of the scoring functions $s$ (if not, some elements of $\mathcal{K}$ will present a plateau). In fact, the proof requires such a behavior only in the vicinity of the points corresponding to the quantiles $Q(s, 1 - u_0)$ for all $s \in \mathcal{S}$.

Figure 1: Typical example of a scoring function.

Proof. Set $v_0 = 1 - u_0$.
By a standard argument (see e.g. [8]), we have:

$$L(\hat{s}_n) - \inf_{s \in \mathcal{S}} L(s) \leq 2 \sup_{s \in \mathcal{S}} \left|\hat{L}_n(s) - L(s)\right| \leq 2 \sup_{s \in \mathcal{S}} \left|\hat{L}_n(s) - L_n(s)\right| + 2 \sup_{s \in \mathcal{S}} \left|L_n(s) - L(s)\right|.$$
Note that the second term in the bound is an empirical process whose behavior is well known. In our case, assumption (i) implies that the class of sets $\{x : s(x) \geq Q(s, v_0)\}$ indexed by scoring functions $s$ has VC dimension smaller than $V$. Hence, by a concentration argument combined with a VC bound for the expectation of the supremum (see, e.g. [20]), for any $\delta > 0$, with probability larger than $1 - \delta$,

$$\sup_{s \in \mathcal{S}} |L_n(s) - L(s)| \leq c \sqrt{\frac{V}{n}} + c' \sqrt{\frac{\ln(1/\delta)}{n}},$$

for universal constants $c$, $c'$.

We now show how to handle the first term. Following the work of Koul [18], we introduce the following notation: