Ranking the best instances


HAL Id: hal-00111670

https://hal.archives-ouvertes.fr/hal-00111670v2

Preprint submitted on 13 Feb 2007


Stéphan Clémençon, Nicolas Vayatis

To cite this version:

Stéphan Clémençon, Nicolas Vayatis. Ranking the best instances. 2007. ⟨hal-00111670v2⟩


Stéphan Clémençon

MODAL'X - Université Paris X Nanterre

&

Laboratoire de Probabilités et Modèles Aléatoires

UMR CNRS 7599 - Universités Paris VI et Paris VII

Nicolas Vayatis

Laboratoire de Probabilités et Modèles Aléatoires

UMR CNRS 7599 - Universités Paris VI et Paris VII

February 14, 2007

Abstract

We formulate the local ranking problem in the framework of bipartite ranking, where the goal is to focus on the best instances. We propose a methodology based on the construction of real-valued scoring functions. We study empirical risk minimization of dedicated statistics which involve empirical quantiles of the scores. We first state the problem of finding the best instances, which can be cast as a classification problem with mass constraint. Next, we develop special performance measures for the local ranking problem which extend the Area Under an ROC Curve (AUC/AROC) criterion, and we describe the optimal elements of these new criteria. We also highlight the fact that the goal of ranking the best instances cannot be achieved in a stage-wise manner where, first, the best instances would be tentatively identified and then a standard AUC criterion would be applied. Finally, we state preliminary statistical results for the local ranking problem.

Keywords:

Ranking, ROC curve and AUC, empirical risk minimization, fast rates.

Running title:

Ranking the best instances

Address of corresponding author: Nicolas Vayatis, Laboratoire de Probabilités et Modèles Aléatoires - Université Paris 6 - 175, rue du Chevaleret - 75013 Paris, France - Email:

[email protected]


1 Introduction

The first takes all the glory, the second takes nothing. In applications where ranking is at stake, people often focus on the best instances. When scanning the results from a query on a search engine, we rarely go beyond the first one or two pages on the screen. In the different context of credit risk screening, credit establishments elaborate scoring rules as reliability indicators, and their main concern is to identify risky prospects, especially among the top scores. In medical diagnosis, test scores indicate the odds for a patient to be healthy given a series of measurements (age, blood pressure, ...). There again, particular attention is given to the "best" instances so as not to miss a possible diseased patient among the highest scores. These various situations can be formulated in the setup of bipartite ranking, where one observes i.i.d. copies of a random pair $(X, Y)$, with $X$ being an observation vector describing the instance (web page, debtor, patient) and $Y$ a binary label assigning it to one population or the other (relevant vs. non relevant, good vs. bad, healthy vs. diseased). In this problem, the goal is to rank the instances instead of simply classifying them. There is a growing literature on the ranking problem in the field of Machine Learning, but most of it considers the Area under the ROC Curve (also known as the AUC or AROC) criterion as a measure of performance of the ranking rule [6, 13, 26, 1]. In a previous work, we have mentioned that the bipartite ranking problem under the AUC criterion could be interpreted as a classification problem with pairs of observations [4]. But the limit of this approach is that it weights uniformly the pairs of items which are badly ranked. Therefore it does not make it possible to distinguish between ranking rules making the same number of mistakes but in very different parts of the ROC curve. The AUC is indeed a global criterion which does not allow one to concentrate on the "best" instances. Special performance measures, such as the Discounted Cumulative Gain (DCG) criterion, have been introduced by practitioners in order to weight instances according to their rank [16] (see also [25, 7]), but providing theory for such criteria and developing empirical risk minimization strategies is still a very open issue. In the present paper, we extend the results of our previous work in [4] and set theoretical grounds for the problem of local ranking. The methodology we propose is based on the selection of a real-valued scoring function for which we formulate appropriate performance measures generalizing the AUC criterion. We point out that ranking the best instances is an involved task as it is a two-fold problem: (i) find the best instances and (ii) provide a good ranking on these instances. The fact that these two goals cannot be considered independently will be highlighted in the paper. Despite this observation, we will first formulate the issue of finding the best instances, which is to be understood as a toy problem for our purpose. This problem corresponds to a binary classification problem with a mass constraint (where the proportion $u_0$ of $+1$ labels predicted by the classifiers is fixed) and it might present an interest per se. The main complication here has to do with the necessity of performing

technique was inspired by the former work of Koul [18] in the context of $R$-estimation where similar statistics arise.

The rest of the paper is organized as follows. We first state the problem of finding the best instances and study the performance of empirical risk minimization in this setup (Section 2). We also explore the conditions on the distribution needed to recover fast rates of convergence. In Section 3 we formulate performance measures for local ranking and provide extensions of the AUC criterion. Finally (Section 4), we state some preliminary statistical results on empirical risk minimization of these new criteria.

2 Finding the best instances

In the present section, we have a limited goal, which is only to determine the best instances without regard to their order in the list. By considering this subproblem, we will identify the main technical issues involved in the sequel. It also permits us to introduce the main notations of the paper.

Just as in standard binary classification, we consider the pair of random variables $(X, Y)$ where $X$ is an observation vector in a measurable space $\mathcal{X}$ and $Y$ is a binary label in $\{-1, +1\}$. The distribution of $(X, Y)$ can be described by the pair $(\mu, \eta)$ where $\mu$ is the marginal distribution of $X$ and $\eta$ is the a posteriori distribution defined by $\eta(x) = \mathbb{P}\{Y = 1 \mid X = x\}$, $\forall x \in \mathcal{X}$. We define the rate of best instances as the proportion of best instances to be considered and denote it by $u_0 \in (0, 1)$. We denote by $Q(\eta, 1 - u_0)$ the $(1 - u_0)$-quantile of the random variable $\eta(X)$. Then the set of best instances at rate $u_0$ is given by:

$$C_{u_0} = \{x \in \mathcal{X} \mid \eta(x) \ge Q(\eta, 1 - u_0)\}.$$

We mention two trivial properties of the set $C_{u_0}$ which will be important in the sequel:

Mass constraint: we have $\mu(C_{u_0}) = \mathbb{P}\{X \in C_{u_0}\} = u_0$,

Invariance property: as a functional of $\eta$, the set $C_{u_0}$ is invariant by strictly increasing transforms of $\eta$.
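The two properties above can be illustrated numerically. The following is only a minimal sketch on simulated data: the posterior $\eta(x) = x^2$, the transform $z \mapsto e^{3z}$ and the helper names `quantile` and `best_instances` are assumptions made for the illustration, not objects from the paper. The quantile is computed as an order statistic, in the spirit of the generalized inverse of a distribution function.

```python
import math
import random


def quantile(values, v):
    """Smallest observed t whose empirical cdf is >= v (an order statistic)."""
    ordered = sorted(values)
    # The small tolerance guards against floating-point round-off in v * n.
    k = max(1, math.ceil(v * len(ordered) - 1e-9))
    return ordered[k - 1]


def best_instances(scores, u0):
    """Indices whose score is >= the (1 - u0)-quantile of the scores."""
    q = quantile(scores, 1.0 - u0)
    return {i for i, s in enumerate(scores) if s >= q}


random.seed(0)
n, u0 = 1000, 0.2
xs = [random.random() for _ in range(n)]
eta = [x ** 2 for x in xs]                         # hypothetical posterior eta(x) = x^2
eta_transformed = [math.exp(3 * e) for e in eta]   # strictly increasing transform of eta

top = best_instances(eta, u0)
top_transformed = best_instances(eta_transformed, u0)
mass = len(top) / n   # empirical mass of the selected set, close to u0
```

On a finite sample the empirical mass matches $u_0$ only up to a term of order $1/n$, while the selected set is exactly the same before and after the transform, since only the ordering of the scores matters.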

The problem of finding a proportion $u_0$ of the best instances boils down to the estimation of the unknown set $C_{u_0}$ on the basis of empirical data. Before turning to the statistical analysis of the problem, we first relate it to binary classification.

2.1 A classification problem with a mass constraint

A classifier is a measurable function $g : \mathcal{X} \to \{-1, +1\}$ and its performance is measured by the classification error $L(g) = \mathbb{P}\{Y \neq g(X)\}$. Let $u_0 \in (0, 1)$ be fixed. Denote by $g_{u_0} = 2\,\mathbb{I}_{C_{u_0}} - 1$ the classifier predicting $+1$ on the set of best instances $C_{u_0}$ and $-1$ on its complement. The next proposition shows that $g_{u_0}$ is an optimal element for the problem of minimization of $L(g)$ over the family of classifiers $g$ satisfying the mass constraint $\mathbb{P}\{g(X) = 1\} = u_0$.

Proposition 1 For any classifier $g : \mathcal{X} \to \{-1, +1\}$ such that $g(x) = 2\,\mathbb{I}_C(x) - 1$ for some subset $C$ of $\mathcal{X}$ and $\mu(C) = \mathbb{P}\{g(X) = 1\} = u_0$, we have

$$L_{u_0} \doteq L(g_{u_0}) \le L(g).$$

Furthermore, we have

$$L_{u_0} = 1 - Q(\eta, 1 - u_0) + (1 - u_0)\,(2Q(\eta, 1 - u_0) - 1) - \mathbb{E}\left(|\eta(X) - Q(\eta, 1 - u_0)|\right),$$

and

$$L(g) - L(g_{u_0}) = 2\,\mathbb{E}\left(|\eta(X) - Q(\eta, 1 - u_0)|\;\mathbb{I}_{C_{u_0} \Delta C}(X)\right),$$

where $\Delta$ denotes the symmetric difference operation between two subsets of $\mathcal{X}$.

proof. For simplicity, we temporarily change the notation and set $q = Q(\eta, 1 - u_0)$. Then, for any classifier $g$ satisfying the constraint $\mathbb{P}\{g(X) = 1\} = u_0$, we have

$$L(g) = \mathbb{E}\left[(\eta(X) - q)\,\mathbb{I}_{[g(X) = -1]} + (q - \eta(X))\,\mathbb{I}_{[g(X) = +1]}\right] + (1 - u_0)\,q + (1 - q)\,u_0.$$

The statements of the proposition immediately follow.
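For readers who want the omitted step, the decomposition of $L(g)$ in the proof can be unfolded as follows. This is a sketch that uses only that decomposition, with $q = Q(\eta, 1 - u_0)$ and $C$ the set on which $g$ predicts $+1$; it is not taken from the paper.

```latex
% For g = g_{u_0}: g = +1 iff \eta(X) \ge q, so the bracket equals -|\eta(X) - q|.
\begin{align*}
L(g_{u_0}) &= -\mathbb{E}\,|\eta(X) - q| + (1-u_0)\,q + (1-q)\,u_0 \\
           &= 1 - q + (1-u_0)(2q - 1) - \mathbb{E}\,|\eta(X) - q| .
\end{align*}
% For a general g = 2 I_C - 1 with \mu(C) = u_0, the constant terms cancel.
\begin{align*}
L(g) - L(g_{u_0})
  &= \mathbb{E}\big[(\eta(X)-q)\,(\mathbb{I}_{C_{u_0}}(X) - \mathbb{I}_{C}(X))\big]
   + \mathbb{E}\big[(q-\eta(X))\,(\mathbb{I}_{C}(X) - \mathbb{I}_{C_{u_0}}(X))\big] \\
  &= 2\,\mathbb{E}\big[(\eta(X)-q)\,(\mathbb{I}_{C_{u_0}}(X) - \mathbb{I}_{C}(X))\big] \\
  &= 2\,\mathbb{E}\big[\,|\eta(X)-q|\;\mathbb{I}_{C_{u_0}\Delta C}(X)\big] \;\ge\; 0 .
\end{align*}
```

The last equality holds because $\eta(X) \ge q$ on $C_{u_0} \setminus C$ and $\eta(X) < q$ on $C \setminus C_{u_0}$, so the integrand equals $|\eta(X) - q|$ on the symmetric difference and vanishes elsewhere.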

There have been several advances in the field of classification theory where the aim is to introduce constraints in the classification procedure or to adapt it to other problems. We relate our formulation to other approaches in the following remarks.

Remark 1 (Connection to hypothesis testing). The implicit asymmetry in the problem due to the emphasis on the best instances is reminiscent of the statistical theory of hypothesis testing. We can formulate a test of simple hypothesis by taking the null assumption to be $H_0 : Y = -1$ and the alternative assumption to be $H_1 : Y = +1$. We want to decide which hypothesis is true given the observation $X$. Each classifier $g$ provides a test statistic $g(X)$. The performance of the test is then described by its type I error $\alpha(g) = \mathbb{P}\{g(X) = 1 \mid Y = -1\}$ and its power $\beta(g) = \mathbb{P}\{g(X) = 1 \mid Y = +1\}$. We point out that if the classifier $g$ satisfies a mass constraint, then we can relate the classification error to the type I error of the test defined by $g$ through the relation:

$$L(g) = 2(1 - p)\,\alpha(g) + p - u_0$$

where $p = \mathbb{P}\{Y = 1\}$, and similarly, we have: $L(g) = 2p\,(1 - \beta(g)) - p + u_0$. Therefore, the optimal classifier minimizes the type I error (maximizes the power) among all classifiers


on the probability of a false alarm (type I error) and maximize the power. This question is explored in a recent paper by Scott [27] (see also [29]).
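The link between classification error, type I error and power under a mass constraint is a purely algebraic identity, so it can be checked on empirical frequencies. The sketch below is only an illustration on simulated data: the posterior $\eta(x) = x$, the threshold classifier and the helper `error_and_rates` are hypothetical choices, and the mass $u$ is the empirical proportion of $+1$ predictions.

```python
import random


def error_and_rates(y, g):
    """Empirical (L, alpha, beta, p, u) for labels y and predictions g in {-1, +1}."""
    n = len(y)
    n_pos = sum(1 for yi in y if yi == 1)
    n_neg = n - n_pos
    false_pos = sum(1 for yi, gi in zip(y, g) if yi == -1 and gi == 1)
    true_pos = sum(1 for yi, gi in zip(y, g) if yi == 1 and gi == 1)
    err = sum(1 for yi, gi in zip(y, g) if yi != gi) / n
    alpha = false_pos / n_neg          # type I error: P{g = 1 | Y = -1}
    beta = true_pos / n_pos            # power: P{g = 1 | Y = +1}
    p = n_pos / n                      # P{Y = 1}
    u = (false_pos + true_pos) / n     # mass: P{g = 1}
    return err, alpha, beta, p, u


random.seed(1)
n = 2000
x = [random.random() for _ in range(n)]
y = [1 if random.random() < xi else -1 for xi in x]   # hypothetical eta(x) = x
g = [1 if xi >= 0.7 else -1 for xi in x]              # hypothetical classifier

err, alpha, beta, p, u = error_and_rates(y, g)
via_alpha = 2 * (1 - p) * alpha + p - u    # error expressed through the type I error
via_beta = 2 * p * (1 - beta) - p + u      # error expressed through the power
```

Both expressions agree with the empirical error up to floating-point round-off. With the mass $u$ held fixed, lowering $\alpha$ and raising $\beta$ are therefore the same objective.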

Remark 2 (Connection with regression level set estimation) We mention that the estimation of the level sets of the regression function has been studied in the statistics literature [3] (see also [32], [38]) as well as in the learning literature, for instance in the context of anomaly detection ([31, 28, 37]). In our framework of classification with mass constraint, the threshold defining the level set involves the quantile of the random variable $\eta(X)$.

Remark 3 (Connection with the minimum volume set approach) Although the point of view adopted in this paper is very different, the problem described above may be formulated in the framework of minimum volume sets learning as considered in [30]. As a matter of fact, the set $C_{u_0}$ may be viewed as the solution of the constrained optimization problem:

$$\min_{C}\ \mathbb{P}\{X \in C \mid Y = -1\}$$

over the class of measurable sets $C$, subject to

$$\mathbb{P}\{X \in C\} \ge u_0.$$

The main difference in our case comes from the fact that the constraint on the volume set has to be estimated using the data, while in [30] it is computed from a known reference measure. We believe that learning methods for minimum volume set estimation can be extended to our setting. A natural way to do it would consist in replacing the conditional distribution of $X$ given $Y = -1$ by its empirical counterpart. This is beyond the scope of the present paper but will be the subject of future investigation.

2.2 Empirical risk minimization

We now investigate the estimation of the set $C_{u_0}$ of best instances at rate $u_0$ based on training data. Suppose that we are given $n$ i.i.d. copies $(X_1, Y_1), \ldots, (X_n, Y_n)$ of the pair $(X, Y)$. Since we have the ranking problem in mind, our methodology will consist in building the candidate sets from a class $\mathcal{S}$ of real-valued scoring functions $s : \mathcal{X} \to \mathbb{R}$. Indeed, we consider sets of the form

$$C_s \doteq C_{s, u_0} = \{x \in \mathcal{X} \mid s(x) \ge Q(s, 1 - u_0)\},$$

where $s$ is an element of $\mathcal{S}$ and $Q(s, 1 - u_0)$ is the $(1 - u_0)$-quantile of the random variable $s(X)$. Note that such sets satisfy the same properties as $C_{u_0}$ with respect to mass constraint and invariance to strictly increasing transforms of $s$.


$$L(s) \doteq L(s, u_0) \doteq L(C_s) = \mathbb{P}\{Y \cdot (s(X) - Q(s, 1 - u_0)) < 0\}.$$

A scoring function minimizing the quantity

$$L_n(s) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{Y_i \cdot (s(X_i) - Q(s, 1 - u_0)) < 0\}$$

is expected to approximately minimize the true error $L(s)$, but the quantile depends on the unknown distribution of $X$. In practice, one has to replace $Q(s, 1 - u_0)$ by its empirical counterpart $\hat{Q}(s, 1 - u_0)$, which corresponds to the empirical quantile. We will thus consider, instead of $L_n(s)$, the truly empirical error:

$$\hat{L}_n(s) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{Y_i \cdot (s(X_i) - \hat{Q}(s, 1 - u_0)) < 0\}.$$

Note that $\hat{L}_n(s)$ is a complicated statistic since the empirical quantile involves all the instances $X_1, \ldots, X_n$. We also mention that $\hat{L}_n(s)$ is a biased estimate of the classification error $L(s)$ of the classifier $g_s(x) = 2\,\mathbb{I}\{s(x) \ge Q(s, 1 - u_0)\} - 1$.

We introduce some more notations. Set, for all $t \in \mathbb{R}$:

$$F_s(t) = \mathbb{P}\{s(X) \le t\}$$

$$G_s(t) = \mathbb{P}\{s(X) \le t \mid Y = +1\}$$

$$H_s(t) = \mathbb{P}\{s(X) \le t \mid Y = -1\}$$

to be the cumulative distribution functions (cdf) of $s(X)$ (respectively, given $Y = 1$, given $Y = -1$). We recall that the definition of the quantiles of (the distribution of) a random variable involves the notion of generalized inverse $F^{-1}$ of a function $F$:

$$F^{-1}(z) = \inf\{t \in \mathbb{R} \mid F(t) \ge z\}.$$

Thus, we have, for all $v \in (0, 1)$:

$$Q(s, v) = F_s^{-1}(v) \quad \text{and} \quad \hat{Q}(s, v) = \hat{F}_s^{-1}(v)$$

where $\hat{F}_s$ is the empirical cdf of $s(X)$: $\hat{F}_s(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{s(X_i) \le t\}$, $\forall t \in \mathbb{R}$.
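The empirical quantile and the truly empirical error can be computed directly from these definitions. The following is a minimal sketch: the function names and the ten toy scores and labels are assumptions made for the illustration. The generalized inverse of the empirical cdf is an order statistic of the scores.

```python
import math


def empirical_cdf(scores, t):
    """F_hat_s(t): fraction of the scores s(X_i) that are <= t."""
    return sum(1 for v in scores if v <= t) / len(scores)


def empirical_quantile(scores, v):
    """Generalized inverse of the empirical cdf: inf{t : F_hat_s(t) >= v}."""
    ordered = sorted(scores)
    # The small tolerance guards against floating-point round-off in v * n.
    k = max(1, math.ceil(v * len(ordered) - 1e-9))
    return ordered[k - 1]


def empirical_risk(scores, labels, u0):
    """L_hat_n(s): fraction of i with Y_i * (s(X_i) - Q_hat(s, 1 - u0)) < 0."""
    q_hat = empirical_quantile(scores, 1.0 - u0)
    errors = sum(1 for s, y in zip(scores, labels) if y * (s - q_hat) < 0)
    return errors / len(scores)


# Toy data (hypothetical): scores s(X_i) and labels Y_i in {-1, +1}.
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6, 0.5, 0.3]
labels = [-1, -1, 1, 1, -1, -1, 1, 1, -1, -1]

q_hat = empirical_quantile(scores, 0.7)    # 7th smallest score, here 0.6
risk = empirical_risk(scores, labels, 0.3)  # two misranked points out of ten
```

The threshold depends on every score in the sample, so the summands of $\hat{L}_n(s)$ are not independent. This is the reason the statistic is harder to analyze than a standard empirical risk.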

Without loss of generality, we will assume that all scoring functions in $\mathcal{S}$ take their values in $(0, \lambda)$ for some $\lambda > 0$. We now turn to study the performance of minimizers of $\hat{L}_n(s)$ over a class $\mathcal{S}$ of scoring functions, defined by

$$\hat{s}_n = \arg\min_{s \in \mathcal{S}} \hat{L}_n(s).$$


Our first main result is an excess risk bound for the empirical risk minimizer $\hat{s}_n$ over a class $\mathcal{S}$ of uniformly bounded scoring functions. In the following theorem, we consider that the level sets of scoring functions from the class $\mathcal{S}$ form a Vapnik-Chervonenkis (VC) class of sets.

Theorem 2 We assume that

(i) the class $\mathcal{S}$ is symmetric (i.e. if $s \in \mathcal{S}$ then $\lambda - s \in \mathcal{S}$) and is a VC major class of functions with VC dimension $V$.

(ii) the family $\mathcal{K} = \{G_s, H_s : s \in \mathcal{S}\}$ of cdfs satisfies the following property: any $K \in \mathcal{K}$ has left and right derivatives, denoted by $K'_+$ and $K'_-$, and there exist strictly positive constants $b$, $B$ such that $\forall (K, t) \in \mathcal{K} \times (0, \lambda)$,

$$b \le K'_+(t) \le B \quad \text{and} \quad b \le K'_-(t) \le B.$$

For any $\delta > 0$, we have, with probability larger than $1 - \delta$,

$$L(\hat{s}_n) - \inf_{s \in \mathcal{S}} L(s) \le c_1 \sqrt{\frac{V}{n}} + c_2 \sqrt{\frac{\ln(1/\delta)}{n}},$$

for some positive constants $c_1, c_2$.

We now provide some insights on conditions (i) and (ii) of the theorem.

Remark 4 (on the complexity assumption) On the terminology of major sets and major classes, we refer to Dudley [10]. In the proof, we need to control empirical processes indexed by sets of the form $\{x : s(x) \ge t\}$ or $\{x : s(x) \le t\}$. Condition (i) guarantees that these sets form a VC class of sets.

Remark 5 (on the choice of the class $\mathcal{S}$ of scoring functions) In order to grasp the meaning of condition (ii) of the theorem, we consider the one-dimensional case with real-valued scoring functions. Assume that the distribution of the random variable $X$ has a bounded density $f$ with respect to Lebesgue measure. Assume also that scoring functions $s$ are differentiable except, possibly, at a finite number of points, and derivatives are denoted by $s'$. Denote by $f_s$ the density of $s(X)$. Let $t \in (0, \lambda)$ and denote by $x_1, \ldots, x_p$ the real roots of the equation $s(x) = t$. We can express the density of $s(X)$ thanks to the change-of-variable formula (see e.g. [24]):

$$f_s(t) = \frac{f(x_1)}{|s'(x_1)|} + \ldots + \frac{f(x_p)}{|s'(x_p)|}.$$

This shows that the scoring functions should present neither flat nor steep parts. We can take, for instance, the class $\mathcal{S}$ to be the class of piecewise linear functions with a finite number of local extrema and with uniformly bounded left and right derivatives: $\forall s \in \mathcal{S}$, $\forall x$, $m \le |s'_+(x)| \le M$ and $m \le |s'_-(x)| \le M$ for some strictly positive constants $m$ and $M$ (see Figure 1). Note that any subinterval of $[0, \lambda]$ has to be in the range of scoring functions $s$ (if not, some elements of $\mathcal{K}$ will present a plateau). In fact, the proof requires such a behavior only in the vicinity of the points corresponding to the quantiles $Q(s, 1 - u_0)$ for all $s \in \mathcal{S}$.

Figure 1: Typical example of a scoring function (horizontal axis: $x$; vertical axis: $s(x)$).
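As a worked instance of the change-of-variable formula in Remark 5, consider a hypothetical setting that is not taken from the paper: $X$ uniform on $(0, 1)$, so that $f \equiv 1$, and a "tent" scoring function with slopes $\pm 2\lambda$.

```latex
% Hypothetical tent-shaped scoring function on (0, 1) with values in (0, \lambda]:
%   s(x) = 2\lambda x        for 0 < x \le 1/2,
%   s(x) = 2\lambda (1 - x)  for 1/2 < x < 1.
% For t in (0, \lambda), the equation s(x) = t has the two roots
%   x_1 = t / (2\lambda),   x_2 = 1 - t / (2\lambda),
% and |s'(x_1)| = |s'(x_2)| = 2\lambda.
\begin{align*}
f_s(t) &= \frac{f(x_1)}{|s'(x_1)|} + \frac{f(x_2)}{|s'(x_2)|}
        = \frac{1}{2\lambda} + \frac{1}{2\lambda}
        = \frac{1}{\lambda}, \qquad t \in (0, \lambda).
\end{align*}
```

Here $s(X)$ is uniform on $(0, \lambda)$, so its density is bounded away from zero and infinity. If one branch had a slope close to zero, the corresponding term would blow up; a very steep branch would drive its term toward zero. This is the sense in which flat and steep parts are excluded.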

proof. Set $v_0 = 1 - u_0$. By a standard argument (see e.g. [8]), we have:

$$L(\hat{s}_n) - \inf_{s \in \mathcal{S}} L(s) \le 2 \sup_{s \in \mathcal{S}} \left|\hat{L}_n(s) - L(s)\right| \le 2 \sup_{s \in \mathcal{S}} \left|\hat{L}_n(s) - L_n(s)\right| + 2 \sup_{s \in \mathcal{S}} \left|L_n(s) - L(s)\right|.$$

Note that the second term in the bound is an empirical process whose behavior is well-known. In our case, assumption (i) implies that the class of sets $\{x : s(x) \ge Q(s, v_0)\}$ indexed by scoring functions $s$ has a VC dimension smaller than $V$. Hence, we have by a concentration argument combined with a VC bound for the expectation of the supremum (see, e.g. [20]), for any $\delta > 0$, with probability larger than $1 - \delta$,

$$\sup_{s \in \mathcal{S}} |L_n(s) - L(s)| \le c \sqrt{\frac{V}{n}} + c' \sqrt{\frac{\ln(1/\delta)}{n}}$$

for universal constants $c, c'$.

We now show how to handle the first term. Following the work of Koul [18], we set
