
Ranking the best instances


HAL Id: hal-00111670

https://hal.archives-ouvertes.fr/hal-00111670v2

Preprint submitted on 13 Feb 2007


Stéphan Clémençon, Nicolas Vayatis

To cite this version:

Stéphan Clémençon, Nicolas Vayatis. Ranking the best instances. 2007. ⟨hal-00111670v2⟩


Stéphan Clémençon
MODAL'X - Université Paris X Nanterre
&
Laboratoire de Probabilités et Modèles Aléatoires
UMR CNRS 7599 - Universités Paris VI et Paris VII

Nicolas Vayatis
Laboratoire de Probabilités et Modèles Aléatoires
UMR CNRS 7599 - Universités Paris VI et Paris VII

February 14, 2007

Abstract

We formulate the local ranking problem in the framework of bipartite ranking, where the goal is to focus on the best instances. We propose a methodology based on the construction of real-valued scoring functions. We study empirical risk minimization of dedicated statistics which involve empirical quantiles of the scores. We first state the problem of finding the best instances, which can be cast as a classification problem with a mass constraint. Next, we develop special performance measures for the local ranking problem which extend the Area Under an ROC Curve (AUC/AROC) criterion, and we describe the optimal elements of these new criteria. We also highlight the fact that the goal of ranking the best instances cannot be achieved in a stage-wise manner where, first, the best instances would be tentatively identified and then a standard AUC criterion would be applied. Finally, we state preliminary statistical results for the local ranking problem.

Keywords: Ranking, ROC curve and AUC, empirical risk minimization, fast rates.

Running title: Ranking the best instances

Address of corresponding author: Nicolas Vayatis, Laboratoire de Probabilités et Modèles Aléatoires - Université Paris 6 - 175, rue du Chevaleret - 75013 Paris, France - Email: vayatis@ccr.jussieu.fr


1 Introduction

The first takes all the glory, the second takes nothing. In applications where ranking is at stake, people often focus on the best instances. When scanning the results from a query on a search engine, we rarely go beyond the first one or two pages on the screen. In the different context of credit risk screening, credit establishments elaborate scoring rules as reliability indicators and their main concern is to identify risky prospects, especially among the top scores. In medical diagnosis, test scores indicate the odds for a patient to be healthy given a series of measurements (age, blood pressure, ...). There again, particular attention is given to the "best" instances so as not to miss a possible diseased patient among the highest scores. These various situations can be formulated in the setup of bipartite ranking, where one observes i.i.d. copies of a random pair $(X, Y)$ with $X$ being an observation vector describing the instance (web page, debtor, patient) and $Y$ a binary label assigning it to one population or the other (relevant vs. non relevant, good vs. bad, healthy vs. diseased). In this problem, the goal is to rank the instances instead of simply classifying them. There is a growing literature on the ranking problem in the field of Machine Learning, but most of it considers the Area under the ROC Curve (also known as the AUC or AROC) criterion as a measure of performance of the ranking rule [6, 13, 26, 1].

In a previous work, we mentioned that the bipartite ranking problem under the AUC criterion could be interpreted as a classification problem with pairs of observations [4]. But the limit of this approach is that it weights uniformly the pairs of items which are badly ranked. Therefore it does not make it possible to distinguish between ranking rules making the same number of mistakes but in very different parts of the ROC curve. The AUC is indeed a global criterion which does not allow one to concentrate on the "best" instances. Special performance measures, such as the Discounted Cumulative Gain (DCG) criterion, have been introduced by practitioners in order to weight instances according to their rank [16] (see also [25, 7]), but providing theory for such criteria and developing empirical risk minimization strategies is still a very open issue. In the present paper, we extend the results of our previous work in [4] and set theoretical grounds for the problem of local ranking. The methodology we propose is based on the selection of a real-valued scoring function for which we formulate appropriate performance measures generalizing the AUC criterion. We point out that ranking the best instances is an involved task as it is a two-fold problem: (i) find the best instances and (ii) provide a good ranking on these instances. The fact that these two goals cannot be considered independently will be highlighted in the paper. Despite this observation, we will first formulate the issue of finding the best instances, which is to be understood as a toy problem for our purpose. This problem corresponds to a binary classification problem with a mass constraint (where the proportion $u_0$ of +1 labels predicted by the classifiers is fixed) and it might present an interest per se. The main complication here has to do with the necessity of performing


technique was inspired by the former work of Koul [18] in the context of $R$-estimation where similar statistics arise.

The rest of the paper is organized as follows. We first state the problem of finding the best instances and study the performance of empirical risk minimization in this setup (Section 2). We also explore the conditions on the distribution under which fast rates of convergence can be recovered. In Section 3 we formulate performance measures for local ranking and provide extensions of the AUC criterion. Finally (Section 4), we state some preliminary statistical results on empirical risk minimization of these new criteria.

2 Finding the best instances

In the present section, we have a limited goal, which is only to determine the best instances without bothering about their order in the list. By considering this subproblem, we will identify the main technical issues involved in the sequel. It also allows us to introduce the main notations of the paper.

Just as in standard binary classification, we consider the pair of random variables $(X, Y)$ where $X$ is an observation vector in a measurable space $\mathcal{X}$ and $Y$ is a binary label in $\{-1, +1\}$. The distribution of $(X, Y)$ can be described by the pair $(\mu, \eta)$ where $\mu$ is the marginal distribution of $X$ and $\eta$ is the a posteriori distribution defined by $\eta(x) = \mathbb{P}\{Y = 1 \mid X = x\}$, $\forall x \in \mathcal{X}$. We define the rate of best instances as the proportion of best instances to be considered and denote it by $u_0 \in (0, 1)$. We denote by $Q(\eta, 1 - u_0)$ the $(1 - u_0)$-quantile of the random variable $\eta(X)$. Then the set of best instances at rate $u_0$ is given by:

\[ C_{u_0} = \{x \in \mathcal{X} \mid \eta(x) \ge Q(\eta, 1 - u_0)\} . \]

We mention two trivial properties of the set $C_{u_0}$ which will be important in the sequel:

• Mass constraint: we have $\mu(C_{u_0}) = \mathbb{P}\{X \in C_{u_0}\} = u_0$,

• Invariance property: as a functional of $\eta$, the set $C_{u_0}$ is invariant by strictly increasing transforms of $\eta$.

The problem of finding a proportion $u_0$ of the best instances boils down to the estimation of the unknown set $C_{u_0}$ on the basis of empirical data. Before turning to the statistical analysis of the problem, we first relate it to binary classification.
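As an illustration of the mass constraint and of the invariance property, the following minimal Python sketch builds the set of best instances on a simulated sample. The Gaussian design, the logistic posterior `eta` and the helper `best_instance_mask` are hypothetical choices made for this example only. The population quantile is replaced by the quantile of the sample, so the mass constraint is only expected to hold up to a term of order 1/n.

```python
import numpy as np

def best_instance_mask(eta_values, u0):
    """Indicator of {eta >= Q(eta, 1 - u0)} on a sample.

    Q is taken as the generalized inverse of the sample cdf:
    the smallest value t such that the fraction of values <= t is >= 1 - u0.
    """
    n = len(eta_values)
    sorted_vals = np.sort(eta_values)
    k = int(np.ceil((1.0 - u0) * n - 1e-9))  # smallest k with k/n >= 1 - u0
    k = min(max(k, 1), n)
    q = sorted_vals[k - 1]
    return eta_values >= q

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)              # hypothetical one-dimensional instances
eta = 1.0 / (1.0 + np.exp(-2.0 * x))     # hypothetical posterior P{Y = 1 | X = x}
u0 = 0.1

mask = best_instance_mask(eta, u0)
# A strictly increasing transform of eta should select the same instances.
mask_transformed = best_instance_mask(np.log(eta), u0)

mass = mask.mean()
same_set = np.array_equal(mask, mask_transformed)
print(f"mass of the selected set: {mass:.4f}, invariant under log: {same_set}")
```

The sketch only illustrates the definitions; it says nothing about how to estimate the set when $\eta$ is unknown, which is the subject of the following sections.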

2.1 A classification problem with a mass constraint

A classifier is a measurable function $g : \mathcal{X} \to \{-1, +1\}$ and its performance is measured by the classification error $L(g) = \mathbb{P}\{Y \neq g(X)\}$. Let $u_0 \in (0, 1)$ be fixed. Denote by $g_{u_0} = 2\,\mathbb{I}_{C_{u_0}} - 1$ the classifier predicting +1 on the set of best instances $C_{u_0}$ and -1 on its complement. The next proposition shows that $g_{u_0}$ is an optimal element for the problem of minimization of $L(g)$ over the family of classifiers $g$ satisfying the mass constraint $\mathbb{P}\{g(X) = 1\} = u_0$.

Proposition 1 For any classifier $g : \mathcal{X} \to \{-1, +1\}$ such that $g(x) = 2\,\mathbb{I}_C(x) - 1$ for some subset $C$ of $\mathcal{X}$ and $\mu(C) = \mathbb{P}\{g(X) = 1\} = u_0$, we have

\[ L_{u_0} \doteq L(g_{u_0}) \le L(g) . \]

Furthermore, we have

\[ L_{u_0} = 1 - Q(\eta, 1 - u_0) + (1 - u_0)\,(2Q(\eta, 1 - u_0) - 1) - \mathbb{E}\left( |\eta(X) - Q(\eta, 1 - u_0)| \right) , \]

and

\[ L(g) - L(g_{u_0}) = 2\,\mathbb{E}\left( |\eta(X) - Q(\eta, 1 - u_0)| \; \mathbb{I}_{C_{u_0} \Delta C}(X) \right) , \]

where $\Delta$ denotes the symmetric difference operation between two subsets of $\mathcal{X}$.

proof. For simplicity, we temporarily change the notation and set $q = Q(\eta, 1 - u_0)$. Then, for any classifier $g$ satisfying the constraint $\mathbb{P}\{g(X) = 1\} = u_0$, we have

\[ L(g) = \mathbb{E}\left( (\eta(X) - q)\,\mathbb{I}_{[g(X) = -1]} + (q - \eta(X))\,\mathbb{I}_{[g(X) = +1]} \right) + (1 - u_0)\,q + (1 - q)\,u_0 . \]

The statements of the proposition immediately follow.
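The formulas of Proposition 1 can be checked numerically on a toy model. The sketch below assumes, purely for illustration, that $X$ is uniform on $(0, 1)$ and that $\eta(x) = x$, so that $Q(\eta, 1 - u_0) = 1 - u_0$. Expectations are approximated by a midpoint rule on a regular grid, and the competing set $[0.5, 0.7)$ is an arbitrary set of mass $u_0 = 0.2$. None of these choices comes from the paper.

```python
import numpy as np

n_cells = 200_000
x = (np.arange(n_cells) + 0.5) / n_cells  # cell midpoints, X ~ Uniform(0, 1)
eta = x                                   # hypothetical posterior eta(x) = x
u0 = 0.2
q = 1.0 - u0                              # (1 - u0)-quantile of eta(X) in this model

def classification_error(predict_plus):
    """L(g) = E[eta 1{g = -1} + (1 - eta) 1{g = +1}], approximated on the grid."""
    return float(np.mean(np.where(predict_plus, 1.0 - eta, eta)))

best = eta >= q                   # set of best instances C_{u0}
other = (x >= 0.5) & (x < 0.7)    # another set C with mass u0

L_best = classification_error(best)
L_other = classification_error(other)

# Closed form of the optimal error stated in the proposition.
closed_form = 1 - q + (1 - u0) * (2 * q - 1) - float(np.mean(np.abs(eta - q)))

# Excess risk formula: twice the mean of |eta - q| on the symmetric difference.
excess_formula = 2 * float(np.mean(np.abs(eta - q) * (best ^ other)))

print(L_best, closed_form)
print(L_other - L_best, excess_formula)
```

In this model both sides of each identity can also be computed by hand, which gives a quick sanity check of the signs in the proposition.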

There have been several developments in the field of classification theory where the aim is to introduce constraints in the classification procedure or to adapt it to other problems. We relate our formulation to other approaches in the following remarks.

Remark 1 (Connection to hypothesis testing). The implicit asymmetry in the problem due to the emphasis on the best instances is reminiscent of the statistical theory of hypothesis testing. We can formulate a test of simple hypothesis by taking the null assumption to be $H_0 : Y = -1$ and the alternative assumption being $H_1 : Y = +1$, so that rejecting $H_0$ amounts to predicting $+1$. We want to decide which hypothesis is true given the observation $X$. Each classifier $g$ provides a test statistic $g(X)$. The performance of the test is then described by its type I error $\alpha(g) = \mathbb{P}\{g(X) = 1 \mid Y = -1\}$ and its power $\beta(g) = \mathbb{P}\{g(X) = 1 \mid Y = +1\}$. We point out that if the classifier $g$ satisfies a mass constraint, then we can relate the classification error with the type I error of the test defined by $g$ through the relation:

\[ L(g) = 2(1 - p)\,\alpha(g) + p - u_0 \]

where $p = \mathbb{P}\{Y = 1\}$, and similarly, we have: $L(g) = 2p\,(1 - \beta(g)) - p + u_0$. Both relations follow from the decomposition $L(g) = (1 - p)\,\alpha(g) + p\,(1 - \beta(g))$ combined with the mass constraint $u_0 = (1 - p)\,\alpha(g) + p\,\beta(g)$. Therefore, the optimal classifier minimizes the type I error (maximizes the power) among all classifiers


on the probability of a false alarm (type I error) and maximize the power. This question is explored in a recent paper by Scott [27] (see also [29]).

Remark 2 (Connection with regression level set estimation) We mention that the estimation of the level sets of the regression function has been studied in the statistics literature [3] (see also [32], [38]) as well as in the learning literature, for instance in the context of anomaly detection ([31, 28, 37]). In our framework of classification with mass constraint, the threshold defining the level set involves the quantile of the random variable $\eta(X)$.

Remark 3 (Connection with the minimum volume set approach) Although the point of view adopted in this paper is very different, the problem described above may be formulated in the framework of minimum volume sets learning as considered in [30]. As a matter of fact, the set $C_{u_0}$ may be viewed as the solution of the constrained optimization problem:

\[ \min_{C} \; \mathbb{P}\{X \in C \mid Y = -1\} \]

over the class of measurable sets $C$, subject to

\[ \mathbb{P}\{X \in C\} \ge u_0 . \]

The main difference in our case comes from the fact that the constraint on the volume set has to be estimated using the data, while in [30] it is computed from a known reference measure. We believe that learning methods for minimum volume set estimation may be extended to our setting. A natural way to do it would consist in replacing the conditional distribution of $X$ given $Y = -1$ by its empirical counterpart. This is beyond the scope of the present paper but will be the subject of future investigation.

2.2 Empirical risk minimization

We now investigate the estimation of the set $C_{u_0}$ of best instances at rate $u_0$ based on training data. Suppose that we are given $n$ i.i.d. copies $(X_1, Y_1), \ldots, (X_n, Y_n)$ of the pair $(X, Y)$. Since we have the ranking problem in mind, our methodology will consist in building the candidate sets from a class $\mathcal{S}$ of real-valued scoring functions $s : \mathcal{X} \to \mathbb{R}$. Indeed, we consider sets of the form

\[ C_s \doteq C_{s, u_0} = \{x \in \mathcal{X} \mid s(x) \ge Q(s, 1 - u_0)\} , \]

where $s$ is an element of $\mathcal{S}$ and $Q(s, 1 - u_0)$ is the $(1 - u_0)$-quantile of the random variable $s(X)$. Note that such sets satisfy the same properties as $C_{u_0}$ with respect to the mass constraint and invariance to strictly increasing transforms of $s$.


\[ L(s) \doteq L(s, u_0) \doteq L(C_s) = \mathbb{P}\{Y \cdot (s(X) - Q(s, 1 - u_0)) < 0\} . \]

A scoring function minimizing the quantity

\[ L_n(s) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{Y_i \cdot (s(X_i) - Q(s, 1 - u_0)) < 0\} \]

is expected to approximately minimize the true error $L(s)$, but the quantile depends on the unknown distribution of $X$. In practice, one has to replace $Q(s, 1 - u_0)$ by its empirical counterpart $\hat{Q}(s, 1 - u_0)$, which corresponds to the empirical quantile. We will thus consider, instead of $L_n(s)$, the truly empirical error:

\[ \hat{L}_n(s) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{Y_i \cdot (s(X_i) - \hat{Q}(s, 1 - u_0)) < 0\} . \]

Note that $\hat{L}_n(s)$ is a complicated statistic since the empirical quantile involves all the instances $X_1, \ldots, X_n$. We also mention that $\hat{L}_n(s)$ is a biased estimate of the classification error $L(s)$ of the classifier $g_s(x) = 2\,\mathbb{I}\{s(x) \ge Q(s, 1 - u_0)\} - 1$.

We introduce some more notations. Set, for all $t \in \mathbb{R}$:

• $F_s(t) = \mathbb{P}\{s(X) \le t\}$

• $G_s(t) = \mathbb{P}\{s(X) \le t \mid Y = +1\}$

• $H_s(t) = \mathbb{P}\{s(X) \le t \mid Y = -1\}$

to be the cumulative distribution functions (cdf) of $s(X)$ (respectively, given $Y = 1$, given $Y = -1$). We recall that the definition of the quantiles of (the distribution of) a random variable involves the notion of generalized inverse $F^{-1}$ of a function $F$:

\[ F^{-1}(z) = \inf\{t \in \mathbb{R} \mid F(t) \ge z\} . \]

Thus, we have, for all $v \in (0, 1)$:

\[ Q(s, v) = F_s^{-1}(v) \quad \text{and} \quad \hat{Q}(s, v) = \hat{F}_s^{-1}(v) \]

where $\hat{F}_s$ is the empirical cdf of $s(X)$: $\hat{F}_s(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{s(X_i) \le t\}$, $\forall t \in \mathbb{R}$.

Without loss of generality, we will assume that all scoring functions in $\mathcal{S}$ take their values in $(0, \lambda)$ for some $\lambda > 0$. We now turn to study the performance of minimizers of $\hat{L}_n(s)$ over a class $\mathcal{S}$ of scoring functions defined by

\[ \hat{s}_n = \arg\min_{s \in \mathcal{S}} \hat{L}_n(s) . \]
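The empirical quantities above can be sketched in a few lines of Python. The code below computes the empirical quantile as the generalized inverse of the empirical cdf, evaluates the truly empirical error, and minimizes it over a small finite family of scoring functions. The simulated model ($X$ uniform on $(0, 1)$ with $\eta(x) = x$), the three candidate functions and all function names are assumptions made for this illustration; the paper considers general classes $\mathcal{S}$ and does not prescribe an optimization procedure.

```python
import numpy as np

def empirical_quantile(scores, v):
    """Generalized inverse of the empirical cdf: inf{t : F_hat(t) >= v}."""
    sorted_scores = np.sort(scores)
    n = len(sorted_scores)
    k = int(np.ceil(v * n - 1e-9))        # smallest k with k/n >= v
    return sorted_scores[min(max(k, 1), n) - 1]

def empirical_error(scores, labels, u0):
    """Fraction of indices i with Y_i * (s(X_i) - Q_hat(s, 1 - u0)) < 0."""
    q_hat = empirical_quantile(scores, 1.0 - u0)
    return float(np.mean(labels * (scores - q_hat) < 0))

def erm(candidates, X, y, u0):
    """Return the name of the candidate with the smallest empirical error."""
    errors = {name: empirical_error(s(X), y, u0) for name, s in candidates.items()}
    return min(errors, key=errors.get), errors

rng = np.random.default_rng(1)
n = 5000
X = rng.uniform(0.0, 1.0, size=n)
y = np.where(rng.uniform(size=n) < X, 1, -1)   # P{Y = 1 | X = x} = x (toy model)

# Hypothetical finite family of scoring functions with values in [0, 1].
candidates = {
    "increasing": lambda x: x,
    "decreasing": lambda x: 1.0 - x,
    "tent": lambda x: 1.0 - np.abs(2.0 * x - 1.0),
}

best_name, errors = erm(candidates, X, y, u0=0.2)
print(best_name, errors)
```

Because the empirical quantile is recomputed for each candidate from all the scores, the indicator terms in the sum are not independent, which is the difficulty the analysis below has to handle.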


Our first main result is an excess risk bound for the empirical risk minimizer $\hat{s}_n$ over a class $\mathcal{S}$ of uniformly bounded scoring functions. In the following theorem, we consider that the level sets of scoring functions from the class $\mathcal{S}$ form a Vapnik-Chervonenkis (VC) class of sets.

Theorem 2 We assume that

(i) the class $\mathcal{S}$ is symmetric (i.e. if $s \in \mathcal{S}$ then $\lambda - s \in \mathcal{S}$) and is a VC major class of functions with VC dimension $V$.

(ii) the family $\mathcal{K} = \{G_s, H_s : s \in \mathcal{S}\}$ of cdfs satisfies the following property: any $K \in \mathcal{K}$ has left and right derivatives, denoted by $K'_+$ and $K'_-$, and there exist strictly positive constants $b$, $B$ such that $\forall (K, t) \in \mathcal{K} \times (0, \lambda)$,

\[ b \le K'_+(t) \le B \quad \text{and} \quad b \le K'_-(t) \le B . \]

For any $\delta > 0$, we have, with probability larger than $1 - \delta$,

\[ L(\hat{s}_n) - \inf_{s \in \mathcal{S}} L(s) \le c_1 \sqrt{\frac{V}{n}} + c_2 \sqrt{\frac{\ln(1/\delta)}{n}} , \]

for some positive constants $c_1, c_2$.

We now provide some insights on conditions (i) and (ii) of the theorem.

Remark 4 (on the complexity assumption) On the terminology of major sets and major classes, we refer to Dudley [10]. In the proof, we need to control empirical processes indexed by sets of the form $\{x : s(x) \ge t\}$ or $\{x : s(x) \le t\}$. Condition (i) guarantees that these sets form a VC class of sets.

Remark 5 (on the choice of the class $\mathcal{S}$ of scoring functions) In order to grasp the meaning of condition (ii) of the theorem, we consider the one-dimensional case with real-valued scoring functions. Assume that the distribution of the random variable $X_i$ has a bounded density $f$ with respect to Lebesgue measure. Assume also that scoring functions $s$ are differentiable except, possibly, at a finite number of points, and derivatives are denoted by $s'$. Denote by $f_s$ the density of $s(X)$. Let $t \in (0, \lambda)$ and denote by $x_1, \ldots, x_p$ the real roots of the equation $s(x) = t$. We can express the density of $s(X)$ thanks to the change-of-variable formula (see e.g. [24]):

\[ f_s(t) = \frac{f(x_1)}{|s'(x_1)|} + \ldots + \frac{f(x_p)}{|s'(x_p)|} . \]

This shows that the scoring functions should present neither flat nor steep parts. We can take, for instance, the class $\mathcal{S}$ to be the class of piecewise linear functions with a finite number of local extrema and with uniformly bounded left and right derivatives: $\forall s \in \mathcal{S}$, $\forall x$, $m \le |s'_+(x)| \le M$ and $m \le |s'_-(x)| \le M$ for some strictly positive constants $m$ and $M$ (see Figure 1). Note that any subinterval of $[0, \lambda]$ has to be in the range of scoring functions $s$ (if not, some elements of $\mathcal{K}$ will present a plateau). In fact, the proof requires such a behavior only in the vicinity of the points corresponding to the quantiles $Q(s, 1 - u_0)$ for all $s \in \mathcal{S}$.

Figure 1: Typical example of a scoring function (horizontal axis $x$, vertical axis $s(x)$).
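The change-of-variable formula of Remark 5 can be illustrated on a simple piecewise linear scoring function. The sketch below assumes, as a toy example not taken from the paper, that $X$ is uniform on $(0, 1)$ (so $f = 1$) and that $s$ is the tent function $s(x) = 1 - |2x - 1|$, whose slopes are $\pm 2$. It uses the standard form of the formula with absolute values of the derivative, and compares it with a finite-difference estimate of the density of $s(X)$ obtained from the cdf on a fine grid.

```python
import numpy as np

def tent(x):
    """Piecewise linear scoring function with slopes +2 and -2 on (0, 1)."""
    return 1.0 - np.abs(2.0 * x - 1.0)

def density_by_formula(roots, f, s_prime):
    """Sum of f(x_i) / |s'(x_i)| over the roots x_i of s(x) = t."""
    return sum(f(x) / abs(s_prime(x)) for x in roots)

def density_numerical(t, h=1e-3, n_cells=1_000_000):
    """Central finite difference of the cdf of s(X), with X ~ Uniform(0, 1)."""
    x = (np.arange(n_cells) + 0.5) / n_cells
    s = tent(x)
    return float((np.mean(s <= t + h) - np.mean(s <= t - h)) / (2 * h))

t = 0.3
roots = [t / 2.0, 1.0 - t / 2.0]                  # solutions of tent(x) = t
f = lambda x: 1.0                                 # uniform density on (0, 1)
s_prime = lambda x: 2.0 if x < 0.5 else -2.0      # derivative away from the kink

by_formula = density_by_formula(roots, f, s_prime)
numerical = density_numerical(t)
print(by_formula, numerical)
```

Flattening one branch of the tent (a slope close to 0) would make the corresponding term of the formula blow up, which is the situation that the lower bound $m$ on the derivatives rules out.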

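proof. Set $v_0 = 1 - u_0$. By a standard argument (see e.g. [8]), we have:

\[ L(\hat{s}_n) - \inf_{s \in \mathcal{S}} L(s) \le 2 \sup_{s \in \mathcal{S}} \left| \hat{L}_n(s) - L(s) \right| \le 2 \sup_{s \in \mathcal{S}} \left| \hat{L}_n(s) - L_n(s) \right| + 2 \sup_{s \in \mathcal{S}} \left| L_n(s) - L(s) \right| . \]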

Note that the second term in the bound is an empirical process whose behavior is well-known. In our case, assumption (i) implies that the class of sets $\{x : s(x) \ge Q(s, v_0)\}$ indexed by scoring functions $s$ has a VC dimension smaller than $V$. Hence, we have by a concentration argument combined with a VC bound for the expectation of the supremum (see, e.g. [20]), for any $\delta > 0$, with probability larger than $1 - \delta$,

\[ \sup_{s \in \mathcal{S}} \left| L_n(s) - L(s) \right| \le c \sqrt{\frac{V}{n}} + c' \sqrt{\frac{\ln(1/\delta)}{n}} \]

for universal constants $c, c'$.

We now show how to handle the first term. Following the work of Koul [18], we set the following notations:

\[ M(s, v) = \mathbb{P}\{Y \cdot (s(X) - Q(s, v)) < 0\} \]
