
HAL Id: hal-00020087

https://hal.archives-ouvertes.fr/hal-00020087

Preprint submitted on 5 Mar 2006


Ranking and empirical minimization of U-statistics

Stéphan Clémençon, Gábor Lugosi, Nicolas Vayatis

To cite this version:

Stéphan Clémençon, Gábor Lugosi, Nicolas Vayatis. Ranking and empirical minimization of U-statistics. 2006. ⟨hal-00020087⟩


Ranking and empirical minimization of U-statistics

Stéphan Clémençon
MODAL'X, Université Paris X Nanterre
& Laboratoire de Probabilités et Modèles Aléatoires, UMR CNRS 7599, Universités Paris VI et Paris VII

Gábor Lugosi
Departament d'Economia i Empresa, Universitat Pompeu Fabra

Nicolas Vayatis
Laboratoire de Probabilités et Modèles Aléatoires, UMR CNRS 7599, Universités Paris VI et Paris VII

March 5, 2006

Abstract

The problem of ranking/ordering instances, instead of simply classifying them, has recently gained much attention in machine learning. In this paper we formulate the ranking problem in a rigorous statistical framework. The goal is to learn a ranking rule for deciding, among two instances, which one is "better", with minimum ranking risk. Since the natural estimates of the risk are of the form of a U-statistic, results of the theory of U-processes are required for investigating the consistency of empirical risk minimizers. We establish, in particular, a tail inequality for degenerate U-processes, and apply it to show that fast rates of convergence may be achieved under specific noise assumptions, just like in classification. Convex risk minimization methods are also studied.

The second author acknowledges support by the Spanish Ministry of Science and Technology and FEDER, grant BMF2003-03324, and by the PASCAL Network of Excellence under EC grant no. 506778.


1 Introduction

Motivated by various applications including problems related to document retrieval or credit-risk screening, the ranking problem has received increasing attention both in the statistical and machine learning literature. In the ranking problem one has to compare two different observations and decide which one is "better". For example, in document retrieval applications, one may be concerned with comparing documents by degree of relevance for a particular request, rather than simply classifying them as relevant or not. Similarly, credit establishments collect and manage large databases containing the socio-demographic and credit-history characteristics of their clients to build a ranking rule which aims at indicating reliability.

In this paper we define a statistical framework for studying such ranking problems. The ranking problem defined here is closely related to Stute's conditional U-statistics [36, 37]. Indeed, Stute's results imply that certain nonparametric estimates based on local U-statistics give universally consistent ranking rules. Our approach here is different. Instead of local averages, we consider empirical minimizers of U-statistics, more in the spirit of empirical risk minimization popular in statistical learning theory, see, e.g., Vapnik and Chervonenkis [40], Bartlett and Mendelson [6], Boucheron, Bousquet, Lugosi [8], Koltchinskii [24], Massart [29] for surveys and recent developments. The important feature of the ranking problem is that natural estimates of the ranking risk involve U-statistics. Therefore, the methodology is based on the theory of U-processes, and the key tools involve maximal and concentration inequalities, symmetrization tricks, and a "contraction principle" for U-processes. For an excellent account of the theory of U-statistics and U-processes we refer to the monograph of de la Peña and Giné [12].

Furthermore, we provide a theoretical analysis of certain nonparametric ranking methods that are based on an empirical minimization of convex cost functionals over convex sets of scoring functions. The methods are inspired by boosting- and support vector machine-type algorithms for classification. The main results of the paper prove universal consistency of properly regularized versions of these methods, establish a novel tail inequality for degenerate U-processes and, based on the latter result, show that fast rates of convergence may be achieved for empirical risk minimizers under suitable noise conditions.

We point out that under certain conditions, finding a good ranking rule amounts to constructing a scoring function s. An important special case is the bipartite ranking problem in which the available instances in the data are labelled by binary labels (good and bad). In this case the ranking criterion is closely related to the so-called AUC (area under the "ROC" curve) criterion (see the Appendix for more details).

The rest of the paper is organized as follows. In Section 2, the basic models are introduced. Section 3 provides some basic uniform convergence and consistency results for empirical risk minimizers. Section 4 contains the main statistical results of the paper, establishing performance bounds for empirical risk minimization for ranking problems. In Section 5, we describe the noise assumptions which guarantee fast rates of convergence in particular cases. In Section 6 a new exponential concentration inequality is established for U-processes which serves as a main tool in our analysis. In Section 7 we discuss convex risk minimization for ranking problems, laying down a theoretical framework for studying boosting and support vector machine-type ranking methods. In the Appendix we summarize some basic properties of U-statistics and highlight some connections of the ranking problem defined here to properties of the so-called ROC curve, appearing in related problems.

2 The ranking problem

Let (X, Y) be a pair of random variables taking values in X × R where X is a measurable space. The random object X models some observation and Y its real-valued label. Let (X', Y') denote a pair of random variables identically distributed with (X, Y), and independent of it. Denote

Z = (Y − Y')/2 .

In the ranking problem one observes X and X' but not their labels Y and Y'. We think about X being "better" than X' if Y > Y', that is, if Z > 0. (The factor 1/2 in the definition of Z is not significant, it is merely here as a convenient normalization.) The goal is to rank X and X' such that the probability that the better ranked of them has a smaller label is as small as possible. Formally, a ranking rule is a function r : X × X → {−1, 1}. If r(x, x') = 1 then the rule ranks x higher than x'. The performance of a ranking rule is measured by the ranking risk

L(r) = P{Z r(X, X') < 0} ,

that is, the probability that r ranks two randomly drawn instances incorrectly.

Observe that in this formalization, the ranking problem is equivalent to a binary classification problem in which the sign of the random variable Z is to be guessed based upon the pair of observations (X, X'). Now it is easy to determine the ranking rule with minimal risk. Introduce the notation

ρ+(X, X') = P{Z > 0 | X, X'} ,   ρ−(X, X') = P{Z < 0 | X, X'} .

Then we have the following simple fact:

Proposition 1 Define the ranking rule

r*(x, x') = 2 I[ρ+(x, x') ≥ ρ−(x, x')] − 1

and denote L* = L(r*) = E{min(ρ+(X, X'), ρ−(X, X'))}. Then for any ranking rule r,

L* ≤ L(r) .

Proof. Let r be any ranking rule. Observe that, by conditioning first on (X, X'), one may write

L(r) = E{ I[r(X,X')=1] ρ−(X, X') + I[r(X,X')=−1] ρ+(X, X') } .

It is now easy to check that L(r) is minimal for r = r*.

Thus, r* minimizes the ranking risk over all possible ranking rules. In the definition of r* ties are broken in favor of ρ+, but obviously if ρ+(x, x') = ρ−(x, x'), an arbitrary value can be chosen for r* without altering its risk.
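As a concrete illustration of Proposition 1 (a minimal sketch, not part of the paper: the toy distribution below, where Y equals X plus uniform noise on {−1, 0, 1}, is made up), one can enumerate ρ+(x, x'), ρ−(x, x'), the rule r* and the optimal risk L*:

import itertools

# toy model: X uniform on {0, 1, 2}, Y = X + E with E uniform on {-1, 0, 1}
x_vals = [0, 1, 2]
e_vals = [-1, 0, 1]
p_x = {x: 1.0 / 3 for x in x_vals}
p_e = {e: 1.0 / 3 for e in e_vals}

def rho_plus(x, xp):
    # P{Z > 0 | X = x, X' = xp} = P{x + E > xp + E'}
    return sum(p_e[e] * p_e[ep] for e in e_vals for ep in e_vals if x + e > xp + ep)

def rho_minus(x, xp):
    # P{Z < 0 | X = x, X' = xp}
    return sum(p_e[e] * p_e[ep] for e in e_vals for ep in e_vals if x + e < xp + ep)

def r_star(x, xp):
    # Bayes ranking rule of Proposition 1 (ties broken in favor of rho_plus)
    return 1 if rho_plus(x, xp) >= rho_minus(x, xp) else -1

# optimal ranking risk L* = E min(rho_plus(X, X'), rho_minus(X, X'))
L_star = sum(p_x[x] * p_x[xp] * min(rho_plus(x, xp), rho_minus(x, xp))
             for x, xp in itertools.product(x_vals, x_vals))
print(L_star)  # here r_star(x, xp) = 1 exactly when x >= xp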

The purpose of this paper is to investigate the construction of ranking rules of low risk based on training data. We assume that n independent, identically distributed copies of (X, Y) are available: Dn = (X1, Y1), . . . , (Xn, Yn). Given a ranking rule r, one may use the training data to estimate its risk L(r) = P{Z r(X, X') < 0}. The perhaps most natural estimate is the U-statistic

Ln(r) = (1/(n(n − 1))) Σ_{i≠j} I[Z_{i,j} r(Xi, Xj) < 0] ,

where Z_{i,j} = (Yi − Yj)/2.
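For a ranking rule induced by a scoring function, r(x, x') = 2 I[s(x) ≥ s(x')] − 1, the U-statistic Ln(r) can be computed directly from its definition. The sketch below (the data-generating model and the choice s(x) = x are made up for illustration) does exactly that:

import numpy as np

def empirical_ranking_risk(scores, y):
    # U-statistic L_n(r) for r(x, x') = 2*I[s(x) >= s(x')] - 1:
    # average of I[Z_{i,j} r(X_i, X_j) < 0] over all ordered pairs i != j,
    # with Z_{i,j} = (Y_i - Y_j) / 2.
    n = len(y)
    errors = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            z = (y[i] - y[j]) / 2.0
            r = 1.0 if scores[i] >= scores[j] else -1.0
            if z * r < 0:
                errors += 1
    return errors / (n * (n - 1))

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + rng.normal(scale=0.5, size=200)  # labels positively related to x
print(empirical_ranking_risk(x, y))      # empirical ranking risk of s(x) = x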

In this paper we consider minimizers of the empirical estimate Ln(r) over a class R of ranking rules and study the performance of such empirically selected ranking rules. Before discussing empirical risk minimization for ranking, a few remarks are in order.

Remark 1 Note that the actual values of the Yi's are never used in the ranking rules discussed in this paper. It is sufficient to know the values of the Z_{i,j}, or, equivalently, the ordering of the Yi's.

Remark 2 (a more general framework.) One may consider a generalization of the setup described above. Instead of ranking just two observations X, X', one may be interested in ranking m independent observations X^(1), . . . , X^(m). In this case the value of a ranking function r(X^(1), . . . , X^(m)) is a permutation π of {1, . . . , m} and the goal is that π should coincide with (or at least resemble) the permutation π* for which Y^(π*(1)) ≥ · · · ≥ Y^(π*(m)). Given a loss function ℓ that assigns a number in [0, 1] to a pair of permutations, the ranking risk is defined as

L(r) = E ℓ(r(X^(1), . . . , X^(m)), π*) .

In this general case, natural estimates of L(r) involve m-th order U-statistics. Many of the results of this paper may be extended, in a more or less straightforward manner, to this general setup. In order to lighten the notation and simplify the arguments, we restrict the discussion to the case described above, that is, to the case when m = 2 and the loss function is ℓ(π, π') = I[π ≠ π'].
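The general framework of Remark 2 can be illustrated numerically. The following sketch (all modelling choices are made up: m = 3, Gaussian observations, labels Y^(k) = X^(k) + noise, and a ranking function that simply sorts the observations by their own values) estimates the m-ary ranking risk for the zero-one permutation loss by Monte Carlo:

import numpy as np

def ranking_permutation(values):
    # permutation pi with values[pi[0]] >= values[pi[1]] >= ... (0-based indices)
    return tuple(np.argsort(-np.asarray(values), kind="stable"))

def zero_one_loss(pi, pi_star):
    # the loss l(pi, pi') = I[pi != pi'] used in the m = 2 case of the text
    return 0.0 if pi == pi_star else 1.0

rng = np.random.default_rng(0)
m, n_trials = 3, 20000
losses = []
for _ in range(n_trials):
    x = rng.normal(size=m)
    y = x + rng.normal(scale=0.5, size=m)
    pi = ranking_permutation(x)       # predicted ordering (rank by the observations)
    pi_star = ranking_permutation(y)  # target ordering (rank by the labels)
    losses.append(zero_one_loss(pi, pi_star))
print(np.mean(losses))                # Monte Carlo estimate of L(r)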

Remark 3 (ranking and scoring.) In many interesting cases the ranking problem may be reduced to finding an appropriate scoring function. These are the cases when the joint distribution of X and Y is such that there exists a function s : X → R such that

r*(x, x') = 1 if and only if s(x) ≥ s(x') .

A function s satisfying the assumption is called an optimal scoring function. Obviously, any strictly increasing transformation of an optimal scoring function is also an optimal scoring function. Below we describe some important special cases when the ranking problem may be reduced to scoring.

Example 1 (the bipartite ranking problem.) In the bipartite ranking problem the label Y is binary, it takes values in {−1, 1}. Writing η(x) = P{Y = 1 | X = x}, it is easy to see that the Bayes ranking risk equals

L* = E min{η(X)(1 − η(X')), η(X')(1 − η(X))}
   = E min{η(X), η(X')} − (E η(X))²

and also

L* = Var((Y + 1)/2) − (1/2) E|η(X) − η(X')| .

In particular,

L* ≤ Var((Y + 1)/2) ≤ 1/4 ,

where the equality L* = Var((Y + 1)/2) holds when X and Y are independent and the maximum is attained when η ≡ 1/2. Observe that the difficulty of the bipartite ranking problem depends on the concentration properties of the distribution of η(X) = P(Y = 1 | X) through the quantity E|η(X) − η(X')|, which is a classical measure of concentration, known as Gini's mean difference. For given p = E(η(X)), Gini's mean difference ranges from a minimum value of zero, when η(X) ≡ p, to a maximum value of 2p(1 − p), in the case when η(X) = (Y + 1)/2.

It is clear from the form of the Bayes ranking rule that the optimal ranking rule is given by a scoring function s* where s* is any strictly increasing transformation of η. Then one may restrict the search to ranking rules defined by scoring functions s, that is, ranking rules of the form r(x, x') = 2 I[s(x) ≥ s(x')] − 1. Writing L(s) := L(r), one has

L(s) − L* = E{ |η(X') − η(X)| I[(s(X)−s(X'))(η(X)−η(X'))<0] } .

The excess ranking risk is closely related to the AUC criterion, which is a standard performance measure in the bipartite setting (see [14] and Appendix 2). More precisely, we have

AUC(s) = P{s(X) ≥ s(X') | Y = 1, Y' = −1} = 1 − (1/(2p(1 − p))) L(s) ,

where p = P(Y = 1), so that maximizing the AUC criterion boils down to minimizing the ranking error.
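The identity above is easy to check numerically. The sketch below (a made-up logistic model with η(x) = 1/(1 + e^{−2x}), scored by s(x) = x) computes the empirical AUC over positive/negative pairs and compares it with 1 − Ln(s)/(2 p_n (1 − p_n)), where p_n is the empirical fraction of positive labels; the two quantities agree up to a factor n/(n − 1) coming from the different pair normalizations:

import numpy as np

def empirical_ranking_error(scores, y):
    # U-statistic L_n(s) for the scoring rule r(x, x') = 2*I[s(x) >= s(x')] - 1
    n = len(y)
    err = 0
    for i in range(n):
        for j in range(n):
            if i != j and (y[i] - y[j]) * (1.0 if scores[i] >= scores[j] else -1.0) < 0:
                err += 1
    return err / (n * (n - 1))

def empirical_auc(scores, y):
    # P_n{s(X) >= s(X') | Y = +1, Y' = -1}, averaged over positive/negative pairs
    pos, neg = scores[y == 1], scores[y == -1]
    return np.mean([1.0 if sp >= sn else 0.0 for sp in pos for sn in neg])

rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = np.where(rng.uniform(size=300) < 1.0 / (1.0 + np.exp(-2 * x)), 1, -1)
p = np.mean(y == 1)
L = empirical_ranking_error(x, y)
print(empirical_auc(x, y), 1 - L / (2 * p * (1 - p)))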

Example 2 (a regression model.) Assume now that Y is real-valued and the joint distribution of X and Y is such that Y = m(X) + ε, where m(x) = E(Y | X = x) is the regression function and ε is independent of X and has a symmetric distribution around zero. Then clearly the optimal ranking rule r* may be obtained by a scoring function s*, where s* may be taken as any strictly increasing transformation of m.

3 Empirical risk minimization

Based on the empirical estimate Ln(r) of the risk L(r) of a ranking rule defined above, one may consider choosing a ranking rule by minimizing the empirical risk over a class R of ranking rules r : X × X → {−1, 1}. Define the empirical risk minimizer, over R, by

rn = argmin_{r∈R} Ln(r) .

(Ties are broken in an arbitrary way.) In a "first-order" approach, we may study the performance L(rn) = P{Z rn(X, X') < 0 | Dn} of the empirical risk minimizer by the standard bound (see, e.g., [13])

L(rn) − inf_{r∈R} L(r) ≤ 2 sup_{r∈R} |Ln(r) − L(r)| .    (1)

This inequality points out that bounding the performance of an empirical minimizer of the ranking risk boils down to investigating the properties of U-processes, that is, suprema of U-statistics indexed by a class of ranking rules. For a detailed and modern account of U-process theory we refer to the book of de la Peña and Giné [12]. In a first-order approach we basically reduce the problem to the study of ordinary empirical processes.

By using the simple Lemma 14 given in the Appendix, we obtain the following:

Proposition 2 Define the Rademacher average

Rn = sup_{r∈R} (1/⌊n/2⌋) Σ_{i=1}^{⌊n/2⌋} εi I[Z_{i,⌊n/2⌋+i} r(Xi, X_{⌊n/2⌋+i}) < 0]

where ε1, . . . , ε_{⌊n/2⌋} are i.i.d. Rademacher random variables (i.e., random symmetric sign variables). Then for any convex nondecreasing function ψ,

E ψ( L(rn) − inf_{r∈R} L(r) ) ≤ E ψ(4 Rn) .

Proof. The inequality follows immediately from (1), Lemma 14 (see the Appendix), and a standard symmetrization inequality, see, e.g., Giné and Zinn [17].
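The Rademacher average Rn of Proposition 2 is straightforward to estimate by Monte Carlo for a small finite class of ranking rules. The sketch below uses a made-up class (rules induced by linear scoring functions s_w(x) = w·x over a finite grid of directions w) and made-up data; it is only meant to show how the quantity is computed:

import numpy as np

def rademacher_average(x, y, weights, rng, n_mc=200):
    # Monte Carlo estimate of E R_n for the class of rules r_w(x, x') = 2*I[w.x >= w.x'] - 1.
    # As in the definition above, observation i is paired with observation m + i,
    # m = floor(n/2), and the pair losses are multiplied by i.i.d. Rademacher signs.
    n = len(y)
    m = n // 2
    z = (y[:m] - y[m:2 * m]) / 2.0
    errs = np.stack([((z * (2.0 * ((x[:m] @ w) >= (x[m:2 * m] @ w)) - 1.0)) < 0)
                     for w in weights]).astype(float)   # shape (|W|, m), entries in {0, 1}
    values = []
    for _ in range(n_mc):
        eps = rng.choice([-1.0, 1.0], size=m)           # Rademacher signs
        values.append(np.max(errs @ eps) / m)           # supremum over the class
    return np.mean(values)

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=(n, 2))
y = x[:, 0] + 0.5 * rng.normal(size=n)                  # made-up labels
angles = np.linspace(0.0, 2 * np.pi, 16, endpoint=False)
weights = [np.array([np.cos(a), np.sin(a)]) for a in angles]
print(rademacher_average(x, y, weights, rng))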

One may easily use this result to derive probabilistic performance bounds for the empirical risk minimizer. For example, by taking ψ(x) = e^{λx} for some λ > 0, and using the bounded differences inequality (see McDiarmid [31]), we have

E exp( λ( L(rn) − inf_{r∈R} L(r) ) ) ≤ E exp(4λ Rn) ≤ exp( 4λ E Rn + 2λ²/(n − 1) ) .

By using Markov's inequality and choosing λ to minimize the bound, we readily obtain:

Corollary 3 Let δ > 0. With probability at least 1 − δ,

L(rn) − inf_{r∈R} L(r) ≤ 4 E Rn + 4 √( ln(1/δ) / (n − 1) ) .

The expected value of the Rademacher average Rn may now be bounded by standard methods, see, e.g., Lugosi [27], Boucheron, Bousquet, and Lugosi [8]. For example, if the class R of indicator functions has finite VC dimension V, then

E Rn ≤ c √(V/n)

for a universal constant c.
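Plugging this bound into Corollary 3 (a direct combination of the two displays, written out here only for convenience) gives that, with probability at least 1 − δ,

L(rn) − inf_{r∈R} L(r) ≤ 4c √(V/n) + 4 √( ln(1/δ) / (n − 1) ) ,

that is, a rate of order √(V/n) for classes of finite VC dimension.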

This result is similar to the one proved in the bipartite ranking case by Agarwal, Graepel, Herbrich, Har-Peled, and Roth [2], with the restriction that their bound holds conditionally on a label sequence. The analysis of [2] relies on a particular complexity measure called the rank-shatter coefficient, but the core of the argument is the same.

The proposition above is convenient, simple, and, in a certain sense, not improvable. However, it is well known from the theory of statistical learning and empirical risk minimization for classification that the bound (1) is often quite loose. This looseness is due to the fact that the variance of the estimators of the risk is ignored and bounded uniformly by a constant. Therefore, the main interest in considering U-statistics precisely consists in the fact that they have minimal variance among all unbiased estimators. However, the reduced-variance property of U-statistics plays no role in the above analysis of the ranking problem. Observe that all upper bounds obtained in this section remain true for an empirical risk minimizer that, instead of using estimates based on U-statistics, estimates the risk of a ranking rule by splitting the data set into two halves and estimates L(r) by

(1/⌊n/2⌋) Σ_{i=1}^{⌊n/2⌋} I[Z_{i,⌊n/2⌋+i} r(Xi, X_{⌊n/2⌋+i}) < 0] .

Hence, in the previous study one loses the advantage of using U-statistics. In Section 4 it is shown that under certain, not uncommon, circumstances significantly smaller risk bounds are achievable. There it will be essential to use sharp exponential bounds for U-processes, involving their reduced variance.
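The reduced variance of the U-statistic estimate, compared with the split-sample estimate above, is easy to observe in simulation. The sketch below (the data-generating model and the scoring rule s(x) = x are made up) compares the spread of the two estimators over repeated samples:

import numpy as np

def u_stat_estimate(scores, y):
    # full U-statistic L_n(r) over all ordered pairs i != j
    n = len(y)
    err = 0
    for i in range(n):
        for j in range(n):
            if i != j and (y[i] - y[j]) * (1.0 if scores[i] >= scores[j] else -1.0) < 0:
                err += 1
    return err / (n * (n - 1))

def split_estimate(scores, y):
    # independent-pairs estimate: pair observation i with observation m + i, m = floor(n/2)
    m = len(y) // 2
    z = y[:m] - y[m:2 * m]
    r = np.where(scores[:m] >= scores[m:2 * m], 1.0, -1.0)
    return np.mean(z * r < 0)

rng = np.random.default_rng(3)
u_vals, s_vals = [], []
for _ in range(300):
    x = rng.normal(size=100)
    y = x + rng.normal(scale=1.0, size=100)
    u_vals.append(u_stat_estimate(x, y))
    s_vals.append(split_estimate(x, y))
print(np.std(u_vals), np.std(s_vals))  # the U-statistic typically shows the smaller spread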

4 Fast rates

The main results of this paper show that the bounds obtained in the previous section may be significantly improved under certain conditions. It is well known (see, e.g., Section 5.2 in the survey [8] and the references therein) that tighter bounds for the excess risk in the context of binary classification may be obtained if one can control the variance of the excess risk by its expected value. In classification this can be guaranteed under certain "low-noise" conditions (see Tsybakov [39], Massart and Nédélec [30], Koltchinskii [24]).

Next we examine possibilities of obtaining such improved performance bounds for empirical ranking risk minimization. The main message is that in the ranking problem one also may obtain significantly improved bounds under some conditions that are analogous to the low-noise conditions in the classification problem, though quite different in nature.

Here we will greatly benefit from using U-statistics (as opposed to splitting the sample), as the small variance of the U-statistics used to estimate the ranking risk gives rise to sharper bounds. The starting point of our analysis is the Hoeffding decomposition of U-statistics (see Appendix 1).

Set first

qr((x, y), (x', y')) = I[(y−y')r(x,x')<0] − I[(y−y')r*(x,x')<0]

and consider the following estimate of the excess risk Λ(r) = L(r) − L*:

Λn(r) = (1/(n(n − 1))) Σ_{i≠j} qr((Xi, Yi), (Xj, Yj)) .
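For orientation, the first-order Hoeffding decomposition referred to above takes the following form for Λn(r) when the kernel qr is symmetric (which holds, e.g., when r and r* are antisymmetric, r(x, x') = −r(x', x)); the function hr introduced below is only notation for this sketch and need not match the one used in the paper:

Λn(r) = Λ(r) + (2/n) Σ_{i=1}^n hr(Xi, Yi) + Wn(r) ,

where hr(x, y) = E[qr((x, y), (X', Y'))] − Λ(r), and Wn(r) is a degenerate U-statistic: its kernel, qr((x, y), (x', y')) − Λ(r) − hr(x, y) − hr(x', y'), has zero conditional expectation given either one of its arguments.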
