HAL Id: hal-00020087
https://hal.archives-ouvertes.fr/hal-00020087
Preprint submitted on 5 Mar 2006
Ranking and empirical minimization of U-statistics
Stéphan Clémençon, Gábor Lugosi, Nicolas Vayatis
To cite this version:
Stéphan Clémençon, Gábor Lugosi, Nicolas Vayatis. Ranking and empirical minimization of U-statistics. 2006. ⟨hal-00020087⟩
ccsd-00020087, version 1 - 5 Mar 2006
Ranking and empirical minimization of U-statistics
Stéphan Clémençon
MODAL'X - Université Paris X Nanterre
&
Laboratoire de Probabilités et Modèles Aléatoires
UMR CNRS 7599 - Universités Paris VI et Paris VII
Gábor Lugosi
Departament d'Economia i Empresa
Universitat Pompeu Fabra
Nicolas Vayatis
Laboratoire de Probabilités et Modèles Aléatoires
UMR CNRS 7599 - Universités Paris VI et Paris VII
March 5, 2006
Abstract
The problem of ranking/ordering instances, instead of simply classifying them, has recently gained much attention in machine learning. In this paper we formulate the ranking problem in a rigorous statistical framework. The goal is to learn a ranking rule for deciding, among two instances, which one is "better", with minimum ranking risk. Since the natural estimates of the risk are of the form of a U-statistic, results of the theory of U-processes are required for investigating the consistency of empirical risk minimizers. We establish in particular a tail inequality for degenerate U-processes, and apply it for showing that fast rates of convergence may be achieved under specific noise assumptions, just like in classification. Convex risk minimization methods are also studied.
The second author acknowledges support by the Spanish Ministry of Science and Technology and FEDER, grant BMF2003-03324 and by the PASCAL Network of Excellence under EC grant no. 506778.
1 Introduction
Motivated by various applications including problems related to document retrieval or credit-risk screening, the ranking problem has received increasing attention both in the statistical and machine learning literature. In the ranking problem one has to compare two different observations and decide which one is "better". For example, in document retrieval applications, one may be concerned with comparing documents by degree of relevance for a particular request, rather than simply classifying them as relevant or not. Similarly, credit establishments collect and manage large databases containing the socio-demographic and credit-history characteristics of their clients to build a ranking rule which aims at indicating reliability.
In this paper we define a statistical framework for studying such ranking problems. The ranking problem defined here is closely related to Stute's conditional U-statistics [36, 37]. Indeed, Stute's results imply that certain nonparametric estimates based on local U-statistics give universally consistent ranking rules. Our approach here is different. Instead of local averages, we consider empirical minimizers of U-statistics, more in the spirit of empirical risk minimization popular in statistical learning theory, see, e.g., Vapnik and Chervonenkis [40], Bartlett and Mendelson [6], Bousquet, Boucheron, Lugosi [8], Koltchinskii [24], Massart [29] for surveys and recent developments. The important feature of the ranking problem is that natural estimates of the ranking risk involve U-statistics. Therefore, the methodology is based on the theory of U-processes, and the key tools involve maximal and concentration inequalities, symmetrization tricks, and a "contraction principle" for U-processes. For an excellent account of the theory of U-statistics and U-processes we refer to the monograph of de la Peña and Giné [12].
Furthermore we provide a theoretical analysis of certain nonparametric ranking methods that are based on an empirical minimization of convex cost functionals over convex sets of scoring functions. The methods are inspired by boosting- and support vector machine-type algorithms for classification. The main results of the paper prove universal consistency of properly regularized versions of these methods, establish a novel tail inequality for degenerate U-processes and, based on the latter result, show that fast rates of convergence may be achieved for empirical risk minimizers under suitable noise conditions.
We point out that under certain conditions, finding a good ranking rule amounts to constructing a scoring function $s$. An important special case is the bipartite ranking problem in which the available instances in the data are labelled by binary labels (good and bad). In this case the ranking criterion is closely related to the so-called AUC (area under the "ROC" curve) criterion (see the Appendix for more details).
The rest of the paper is organized as follows. In Section 2, the basic models are introduced. Section 3 provides some basic uniform convergence and consistency results for empirical risk minimizers. Section 4 contains the main statistical results of the paper, establishing performance bounds for empirical risk minimization for ranking problems. In Section 5, we describe the noise assumptions which guarantee fast rates of convergence in particular cases. In Section 6 a new exponential concentration inequality is established for U-processes which serves as a main tool in our analysis. In Section 7 we discuss convex risk minimization for ranking problems, laying down a theoretical framework for studying boosting and support vector machine-type ranking methods. In the Appendix we summarize some basic properties of U-statistics and highlight some connections of the ranking problem defined here to properties of the so-called ROC curve, appearing in related problems.
2 The ranking problem
Let $(X, Y)$ be a pair of random variables taking values in $\mathcal{X} \times \mathbb{R}$ where $\mathcal{X}$ is a measurable space. The random object $X$ models some observation and $Y$ its real-valued label. Let $(X', Y')$ denote a pair of random variables identically distributed with $(X, Y)$, and independent of it. Denote
$$Z = \frac{Y - Y'}{2}.$$
In the ranking problem one observes $X$ and $X'$ but not their labels $Y$ and $Y'$. We think about $X$ being "better" than $X'$ if $Y > Y'$, that is, if $Z > 0$. (The factor $1/2$ in the definition of $Z$ is not significant; it is merely here as a convenient normalization.) The goal is to rank $X$ and $X'$ such that the probability that the better ranked of them has a smaller label is as small as possible. Formally, a ranking rule is a function $r : \mathcal{X} \times \mathcal{X} \to \{-1, 1\}$. If $r(x, x') = 1$ then the rule ranks $x$ higher than $x'$. The performance of a ranking rule is measured by the ranking risk
$$L(r) = P\{Z \cdot r(X, X') < 0\},$$
that is, the probability that $r$ ranks two randomly drawn instances incorrectly.
Observe that in this formalization, the ranking problem is equivalent to a binary classification problem in which the sign of the random variable $Z$ is to be guessed based upon the pair of observations $(X, X')$. Now it is easy to determine the ranking rule with minimal risk. Introduce the notation
$$\rho_+(X, X') = P\{Z > 0 \mid X, X'\}, \qquad \rho_-(X, X') = P\{Z < 0 \mid X, X'\}.$$
Then we have the following simple fact:
Proposition 1 Define
$$r^*(x, x') = 2 I_{[\rho_+(x, x') \ge \rho_-(x, x')]} - 1$$
and denote $L^* = L(r^*) = E\{\min(\rho_+(X, X'), \rho_-(X, X'))\}$. Then for any ranking rule $r$,
$$L^* \le L(r).$$
proof. Let $r$ be any ranking rule. Observe that, by conditioning first on $(X, X')$, one may write
$$L(r) = E\Big( I_{[r(X, X') = 1]}\, \rho_-(X, X') + I_{[r(X, X') = -1]}\, \rho_+(X, X') \Big).$$
It is now easy to check that $L(r)$ is minimal for $r = r^*$.
Thus, $r^*$ minimizes the ranking risk over all possible ranking rules. In the definition of $r^*$ ties are broken in favor of $\rho_+$ but obviously if $\rho_+(x, x') = \rho_-(x, x')$, an arbitrary value can be chosen for $r^*$ without altering its risk.
The purpose of this paper is to investigate the construction of ranking rules of low risk based on training data. We assume that $n$ independent, identically distributed copies of $(X, Y)$ are available: $D_n = (X_1, Y_1), \ldots, (X_n, Y_n)$. Given a ranking rule $r$, one may use the training data to estimate its risk $L(r) = P\{Z \cdot r(X, X') < 0\}$. The perhaps most natural estimate is the U-statistic
$$L_n(r) = \frac{1}{n(n-1)} \sum_{i \ne j} I_{[Z_{i,j} \cdot r(X_i, X_j) < 0]},$$
where, in analogy with the definition of $Z$, $Z_{i,j} = (Y_i - Y_j)/2$.
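The estimate $L_n(r)$ can be computed directly from its definition. The following is a minimal sketch, not taken from the paper: the function names and the toy data are hypothetical, the rule is assumed to be induced by a scoring function, and $Z_{i,j}$ is taken to be $(Y_i - Y_j)/2$ by analogy with $Z$.

```python
from itertools import permutations

def ranking_rule_from_score(s):
    """Build r(x, x') = 2*I[s(x) >= s(x')] - 1 from a scoring function s (illustrative helper)."""
    return lambda x, x_prime: 1 if s(x) >= s(x_prime) else -1

def empirical_ranking_risk(r, xs, ys):
    """U-statistic L_n(r): fraction of ordered pairs (i, j), i != j, with Z_ij * r(X_i, X_j) < 0."""
    n = len(xs)
    errors = 0
    for i, j in permutations(range(n), 2):
        z_ij = (ys[i] - ys[j]) / 2.0  # assumed definition, mirroring Z = (Y - Y') / 2
        if z_ij * r(xs[i], xs[j]) < 0:
            errors += 1
    return errors / (n * (n - 1))

# Toy data (hypothetical): only the pair (x=0.4, y=3.0), (x=0.8, y=2.5) is misordered by s(x) = x.
xs = [0.1, 0.4, 0.35, 0.8]
ys = [1.0, 3.0, 2.0, 2.5]
r_identity = ranking_rule_from_score(lambda x: x)
risk = empirical_ranking_risk(r_identity, xs, ys)
print(risk)
```

The single misordered pair is counted once in each order, so the sum runs over 12 ordered pairs and contains 2 errors.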
In this paper we consider minimizers of the empirical estimate $L_n(r)$ over a class $\mathcal{R}$ of ranking rules and study the performance of such empirically selected ranking rules. Before discussing empirical risk minimization for ranking, a few remarks are in order.
Remark 1 Note that the actual values of the $Y_i$'s are never used in the ranking rules discussed in this paper. It is sufficient to know the values of the $Z_{i,j}$, or, equivalently, the ordering of the $Y_i$'s.
Remark 2 (a more general framework.) One may consider a generalization of the setup described above. Instead of ranking just two observations $X, X'$, one may be interested in ranking $m$ independent observations $X^{(1)}, \ldots, X^{(m)}$. In this case the value of a ranking function $r(X^{(1)}, \ldots, X^{(m)})$ is a permutation $\pi$ of $\{1, \ldots, m\}$ and the goal is that $\pi$ should coincide with (or at least resemble) the permutation $\bar{\pi}$ for which $Y^{(\bar{\pi}(1))} \ge \cdots \ge Y^{(\bar{\pi}(m))}$. Given a loss function $\ell$ that assigns a number in $[0, 1]$ to a pair of permutations, the ranking risk is defined as
$$L(r) = E\,\ell\big(r(X^{(1)}, \ldots, X^{(m)}), \bar{\pi}\big).$$
In this general case, natural estimates of $L(r)$ involve $m$-th order U-statistics. Many of the results of this paper may be extended, in a more or less straightforward manner, to this general setup. In order to lighten the notation and simplify the arguments, we restrict the discussion to the case described above, that is, to the case when $m = 2$ and the loss function is $\ell(\pi, \bar{\pi}) = I_{[\pi \ne \bar{\pi}]}$.
Remark 3 (ranking and scoring.) In many interesting cases the ranking problem may be reduced to finding an appropriate scoring function. These are the cases when the joint distribution of $X$ and $Y$ is such that there exists a function $s^* : \mathcal{X} \to \mathbb{R}$ such that
$$r^*(x, x') = 1 \quad \text{if and only if} \quad s^*(x) \ge s^*(x').$$
A function $s^*$ satisfying the assumption is called an optimal scoring function. Obviously, any strictly increasing transformation of an optimal scoring function is also an optimal scoring function. Below we describe some important special cases when the ranking problem may be reduced to scoring.
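The invariance under strictly increasing transformations is easy to see numerically. The sketch below is illustrative only: the scoring function, the transformation (the exponential) and the sample points are hypothetical choices, not objects from the paper.

```python
import math
from itertools import permutations

def rule_from_score(s):
    """Ranking rule r(x, x') = 2*I[s(x) >= s(x')] - 1 induced by a scoring function s."""
    return lambda x, x_prime: 1 if s(x) >= s(x_prime) else -1

# Hypothetical scoring function and a strictly increasing transformation of it.
s = lambda x: 2.0 * x - 1.0
s_transformed = lambda x: math.exp(s(x))

points = [-1.5, -0.2, 0.0, 0.3, 0.3, 2.0]  # includes a tie on purpose
r = rule_from_score(s)
r_transformed = rule_from_score(s_transformed)

# The two rules agree on every ordered pair, ties included.
same_ranking = all(r(a, b) == r_transformed(a, b) for a, b in permutations(points, 2))
print(same_ranking)
```

Since only the comparisons $s(x) \ge s(x')$ enter the rule, any transformation that preserves these comparisons leaves the ranking rule, and therefore its risk, unchanged.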
Example 1 (the bipartite ranking problem.) In the bipartite ranking problem the label $Y$ is binary; it takes values in $\{-1, 1\}$. Writing $\eta(x) = P\{Y = 1 \mid X = x\}$, it is easy to see that the Bayes ranking risk equals
$$L^* = E\min\{\eta(X)(1 - \eta(X')),\ \eta(X')(1 - \eta(X))\} = E\min\{\eta(X), \eta(X')\} - (E\eta(X))^2$$
and also,
$$L^* = \mathrm{Var}\Big(\frac{Y+1}{2}\Big) - \frac{1}{2} E|\eta(X) - \eta(X')|.$$
In particular,
$$L^* \le \mathrm{Var}\Big(\frac{Y+1}{2}\Big) \le 1/4$$
where the equality $L^* = \mathrm{Var}\big(\frac{Y+1}{2}\big)$ holds when $X$ and $Y$ are independent and the maximum is attained when $\eta \equiv 1/2$. Observe that the difficulty of the bipartite ranking problem depends on the concentration properties of the distribution of $\eta(X) = P(Y = 1 \mid X)$ through the quantity $E(|\eta(X) - \eta(X')|)$ which is a classical measure of concentration, known as Gini's mean difference. For given $p = E(\eta(X))$, Gini's mean difference ranges from a minimum value of zero, when $\eta(X) \equiv p$, to a maximum value of $2p(1-p)$ in the case when $\eta(X) = (Y+1)/2$ (in which case $L^* = 0$).
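The two expressions for the Bayes ranking risk, and the two extreme cases of Gini's mean difference, can be checked on a small discrete example. This is only a sketch under an assumption made here for illustration: $\eta(X)$ takes finitely many values with known probabilities, and all function names are hypothetical.

```python
def bayes_risk_direct(etas, probs):
    """E min{eta(X)(1 - eta(X')), eta(X')(1 - eta(X))} for independent X, X'."""
    return sum(pa * pb * min(a * (1 - b), b * (1 - a))
               for a, pa in zip(etas, probs) for b, pb in zip(etas, probs))

def gini_mean_difference(etas, probs):
    """E|eta(X) - eta(X')| for independent X, X'."""
    return sum(pa * pb * abs(a - b)
               for a, pa in zip(etas, probs) for b, pb in zip(etas, probs))

def bayes_risk_via_gini(etas, probs):
    """Var((Y+1)/2) - (1/2) E|eta(X) - eta(X')|, using Var((Y+1)/2) = p(1 - p)."""
    p = sum(a * pa for a, pa in zip(etas, probs))
    return p * (1 - p) - 0.5 * gini_mean_difference(etas, probs)

# Hypothetical three-point distribution of eta(X); here p = 0.5.
etas, probs = [0.1, 0.5, 0.9], [0.3, 0.4, 0.3]
direct = bayes_risk_direct(etas, probs)
via_gini = bayes_risk_via_gini(etas, probs)

# Extreme cases for p = 0.3: eta constant (X carries no information), eta in {0, 1} (no noise).
gini_constant = gini_mean_difference([0.3], [1.0])
gini_noiseless = gini_mean_difference([0.0, 1.0], [0.7, 0.3])
risk_noiseless = bayes_risk_direct([0.0, 1.0], [0.7, 0.3])
print(direct, via_gini, gini_constant, gini_noiseless, risk_noiseless)
```

In the noiseless case the mean difference equals $2p(1-p) = 0.42$ and the Bayes ranking risk vanishes, while a constant $\eta$ gives mean difference zero and risk $p(1-p)$.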
It is clear from the form of the Bayes ranking rule that the optimal ranking rule is given by a scoring function $s^*$ where $s^*$ is any strictly increasing transformation of $\eta$. Then one may restrict the search to ranking rules defined by scoring functions $s$, that is, ranking rules of the form $r(x, x') = 2 I_{[s(x) \ge s(x')]} - 1$. Writing $L(s) \stackrel{\mathrm{def}}{=} L(r)$, one has
$$L(s) - L^* = E\Big( |\eta(X') - \eta(X)|\, I_{[(s(X) - s(X'))(\eta(X) - \eta(X')) < 0]} \Big).$$
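The ranking risk in this case is closely related to the AUC criterion which is a standard performance measure in the bipartite setting (see [14] and Appendix 2). More precisely, we have:
$$\mathrm{AUC}(s) = P\{s(X) \ge s(X') \mid Y = 1, Y' = -1\} = 1 - \frac{1}{2p(1-p)} L(s),$$
where $p = P(Y = 1)$, so that maximizing the AUC criterion boils down to minimizing the ranking error.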
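A finite-sample analogue of this identity can be verified directly. In the sketch below, which is an illustration and not part of the paper, the population quantity $2p(1-p)$ is replaced by its empirical counterpart $2 n_+ n_- / (n(n-1))$, where $n_+$ and $n_-$ are the numbers of positive and negative labels; this substitution and the assumption that all scores are distinct are made here only for the example.

```python
def pairwise_ranking_error(scores, labels):
    """Empirical ranking risk over ordered pairs for r(x, x') = 2*I[s(x) >= s(x')] - 1."""
    n = len(scores)
    errors = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            z = (labels[i] - labels[j]) / 2.0
            r = 1 if scores[i] >= scores[j] else -1
            if z * r < 0:
                errors += 1
    return errors / (n * (n - 1))

def empirical_auc(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive instance scores at least as high."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum(1 for a in pos for b in neg if a >= b)
    return wins / (len(pos) * len(neg))

# Hypothetical scores (all distinct) and binary labels.
scores = [0.9, 0.2, 0.6, 0.4, 0.7]
labels = [1, -1, -1, 1, 1]
n = len(scores)
n_pos = labels.count(1)
n_neg = labels.count(-1)

auc = empirical_auc(scores, labels)
risk = pairwise_ranking_error(scores, labels)
auc_from_risk = 1.0 - risk * n * (n - 1) / (2.0 * n_pos * n_neg)
print(auc, auc_from_risk)
```

Each misordered positive-negative pair contributes two errors to the ranking risk (once in each order), which is where the factor 2 in the denominator comes from.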
Example 2 (a regression model). Assume now that $Y$ is real-valued and the joint distribution of $X$ and $Y$ is such that $Y = m(X) + \epsilon$ where $m(x) = E(Y \mid X = x)$ is the regression function, and $\epsilon$ is independent of $X$ and has a symmetric distribution around zero. Then clearly the optimal ranking rule $r^*$ may be obtained by a scoring function $s^*$ where $s^*$ may be taken as any strictly increasing transformation of $m$.
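The reasoning behind "clearly" can be sketched in two lines. The shorthand $W = \epsilon - \epsilon'$ and $d = m(x) - m(x')$ is introduced here only for this sketch, with $\epsilon'$ denoting the noise variable attached to $(X', Y')$; $W$ is symmetric around zero and independent of $(X, X')$.

```latex
\begin{align*}
\rho_+(x, x') &= P\{ d + W > 0 \} = P\{ W > -d \}, \\
\rho_-(x, x') &= P\{ d + W < 0 \} = P\{ W < -d \} = P\{ W > d \}.
\end{align*}
```

Hence $m(x) \ge m(x')$, that is, $d \ge 0$, implies $\rho_+(x, x') \ge \rho_-(x, x')$, and symmetrically $m(x) \le m(x')$ implies $\rho_-(x, x') \ge \rho_+(x, x')$, so that ranking by $m$ attains the minimal risk.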
3 Empirical risk minimization
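Based on the empirical estimate $L_n(r)$ of the risk $L(r)$ of a ranking rule defined above, one may consider choosing a ranking rule by minimizing the empirical risk over a class $\mathcal{R}$ of ranking rules $r : \mathcal{X} \times \mathcal{X} \to \{-1, 1\}$. Define the empirical risk minimizer, over $\mathcal{R}$, by
$$r_n = \arg\min_{r \in \mathcal{R}} L_n(r).$$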
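(Ties are broken in an arbitrary way.) In a "first-order" approach, we may study the performance $L(r_n) = P\{Z \cdot r_n(X, X') < 0 \mid D_n\}$ of the empirical risk minimizer by the standard bound (see, e.g., [13])
$$L(r_n) - \inf_{r \in \mathcal{R}} L(r) \le 2 \sup_{r \in \mathcal{R}} |L_n(r) - L(r)|. \qquad (1)$$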
This inequality points out that bounding the performance of an empirical minimizer of the ranking risk boils down to investigating the properties of U-processes, that is, suprema of U-statistics indexed by a class of ranking rules. For a detailed and modern account of U-process theory we refer to the book of de la Peña and Giné [12]. In a first-order approach we basically reduce the problem to the study of ordinary empirical processes.
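For a finite class of rules, empirical risk minimization amounts to evaluating $L_n(r)$ for each candidate and keeping a minimizer. The sketch below is illustrative: the three candidate scoring functions and the data are hypothetical, rules are assumed to be of the form $r(x, x') = 2 I_{[s(x) \ge s(x')]} - 1$, and ties between candidates are broken by taking the first minimizer.

```python
from itertools import permutations

def empirical_ranking_risk(s, xs, ys):
    """U-statistic L_n(r) for the rule induced by the scoring function s."""
    n = len(xs)
    errors = 0
    for i, j in permutations(range(n), 2):
        z_ij = (ys[i] - ys[j]) / 2.0
        r_ij = 1 if s(xs[i]) >= s(xs[j]) else -1
        if z_ij * r_ij < 0:
            errors += 1
    return errors / (n * (n - 1))

# Hypothetical finite class of scoring functions on the real line.
candidates = {
    "identity": lambda x: x,
    "negation": lambda x: -x,
    "peaked": lambda x: -(x - 0.5) ** 2,
}

# Toy data in which the labels increase with x.
xs = [0.1, 0.4, 0.35, 0.8]
ys = [1.0, 2.0, 1.5, 3.0]

risks = {name: empirical_ranking_risk(s, xs, ys) for name, s in candidates.items()}
best_name = min(risks, key=risks.get)  # empirical risk minimizer over the class
print(best_name, risks)
```

The exhaustive search is only feasible for small finite classes; it is meant to make the definition of $r_n$ concrete rather than to suggest a practical algorithm.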
By using the simple Lemma 14 given in the Appendix, we obtain the following:
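Proposition 2 Define the Rademacher average
$$R_n = \sup_{r \in \mathcal{R}} \frac{1}{\lfloor n/2 \rfloor} \sum_{i=1}^{\lfloor n/2 \rfloor} \epsilon_i\, I_{[Z_{i, \lfloor n/2 \rfloor + i} \cdot r(X_i, X_{\lfloor n/2 \rfloor + i}) < 0]}$$
where $\epsilon_1, \ldots, \epsilon_n$ are i.i.d. Rademacher random variables (i.e., random symmetric sign variables). Then for any convex nondecreasing function $\psi$,
$$E\psi\Big( L(r_n) - \inf_{r \in \mathcal{R}} L(r) \Big) \le E\psi(4 R_n).$$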
proof. The inequality follows immediately from (1), Lemma 14 (see the Appendix), and a standard symmetrization inequality, see, e.g., Giné and Zinn [17].
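For a finite class, the expected Rademacher average can be approximated by simulation over the random signs, with the data held fixed. The following sketch follows the formula as printed above (a supremum of the signed sum over the pairs $(i, \lfloor n/2 \rfloor + i)$); the class, the data, the number of draws and the function name are hypothetical choices made for illustration.

```python
import random

def rademacher_average_mc(candidates, xs, ys, n_draws=20000, seed=0):
    """Monte Carlo estimate of the Rademacher average for a finite class of scoring
    functions, conditionally on the data (xs, ys)."""
    rng = random.Random(seed)
    half = len(xs) // 2
    # Error indicators of each rule on the pairs (i, half + i).
    indicators = []
    for s in candidates:
        row = []
        for i in range(half):
            z = (ys[i] - ys[half + i]) / 2.0
            r = 1 if s(xs[i]) >= s(xs[half + i]) else -1
            row.append(1 if z * r < 0 else 0)
        indicators.append(row)
    total = 0.0
    for _ in range(n_draws):
        eps = [rng.choice((-1, 1)) for _ in range(half)]
        # Supremum over the class of the signed, normalized sum.
        total += max(sum(e * b for e, b in zip(eps, row)) for row in indicators) / half
    return total / n_draws

# Toy data: labels equal to x, so s(x) = x makes no error and s(x) = -x errs on every pair.
xs = [i / 10 for i in range(10)]
ys = list(xs)
estimate = rademacher_average_mc([lambda x: x, lambda x: -x], xs, ys)
print(estimate)
```

In this toy case the supremum reduces to $\max(0, \frac{1}{5}\sum_{i \le 5} \epsilon_i)$, whose expectation is $0.1875$, so the estimate should be close to that value.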
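One may easily use this result to derive probabilistic performance bounds for the empirical risk minimizer. For example, by taking $\psi(x) = e^{\lambda x}$ for some $\lambda > 0$, and using the bounded differences inequality (see McDiarmid [31]), we have
$$E\exp\Big( \lambda \big( L(r_n) - \inf_{r \in \mathcal{R}} L(r) \big) \Big) \le E\exp(4\lambda R_n) \le \exp\Big( 4\lambda E R_n + \frac{4\lambda^2}{n-1} \Big).$$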
By using Markov's inequality and choosing $\lambda$ to minimize the bound, we readily obtain:
Corollary 3 Let $\delta > 0$. With probability at least $1 - \delta$,
$$L(r_n) - \inf_{r \in \mathcal{R}} L(r) \le 4 E R_n + 4\sqrt{\frac{\ln(1/\delta)}{n-1}}.$$
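The optimization over $\lambda$ can be spelled out as follows. This is a sketch of the standard argument; the shorthand $\Delta = L(r_n) - \inf_{r \in \mathcal{R}} L(r)$ and the deviation level $t > 0$ are introduced here only for this derivation.

```latex
\begin{align*}
P\{\Delta > 4 E R_n + t\}
  &\le e^{-\lambda (4 E R_n + t)}\, E e^{\lambda \Delta}
   \le \exp\Big( -\lambda t + \frac{4\lambda^2}{n-1} \Big)
   && \text{(Markov's inequality)} \\
  &= \exp\Big( -\frac{t^2 (n-1)}{16} \Big)
   && \text{for } \lambda = \frac{t(n-1)}{8}.
\end{align*}
```

Setting the right-hand side equal to $\delta$ and solving for $t$ gives $t = 4\sqrt{\ln(1/\delta)/(n-1)}$, which is the deviation term in the corollary.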
The expected value of the Rademacher average $R_n$ may now be bounded by standard methods, see, e.g., Lugosi [27], Boucheron, Bousquet, and Lugosi [8]. For example, if the class $\mathcal{R}$ of indicator functions has finite VC dimension $V$, then
$$E R_n \le c\sqrt{\frac{V}{n}}$$
for a universal constant $c$.
This result is similar to the one proved in the bipartite ranking case by Agarwal, Graepel, Herbrich, Har-Peled, and Roth [2] with the restriction that their bound holds conditionally on a label sequence. The analysis of [2] relies on a particular complexity measure called the rank-shatter coefficient but the core of the argument is the same.
The proposition above is convenient, simple, and, in a certain sense, not improvable. However, it is well known from the theory of statistical learning and empirical risk minimization for classification that the bound (1) is often quite loose. This looseness is due to the fact that the variance of the estimators of the risk is ignored and bounded uniformly by a constant. Therefore, the main interest in considering U-statistics precisely consists in the fact that they have minimal variance among all unbiased estimators. However, the reduced-variance property of U-statistics plays no role in the above analysis of the ranking problem. Observe that all upper bounds obtained in this section remain true for an empirical risk minimizer that, instead of using estimates based on U-statistics, estimates the risk of a ranking rule by splitting the data set into two halves and estimates $L(r)$ by
$$\frac{1}{\lfloor n/2 \rfloor} \sum_{i=1}^{\lfloor n/2 \rfloor} I_{[Z_{i, \lfloor n/2 \rfloor + i} \cdot r(X_i, X_{\lfloor n/2 \rfloor + i}) < 0]}.$$
Hence, in the previous study one loses the advantage of using U-statistics. In Section 4 it is shown that under certain, not uncommon, circumstances significantly smaller risk bounds are achievable. There it will be essential to use sharp exponential bounds for U-processes, involving their reduced variance.
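The variance reduction offered by the U-statistic over the split-sample estimate can be observed in a small simulation. The sketch below is an illustration under assumptions chosen here (uniform $X$, Gaussian noise, the rule that ranks by $x$ itself, hypothetical function names); it is not an experiment from the paper.

```python
import random
import statistics

def error_indicator(x1, y1, x2, y2):
    """I[Z * r(x1, x2) < 0] for the rule r(x, x') = 2*I[x >= x'] - 1."""
    z = (y1 - y2) / 2.0
    r = 1 if x1 >= x2 else -1
    return 1 if z * r < 0 else 0

def u_statistic_risk(xs, ys):
    """Average of the error indicator over all ordered pairs i != j."""
    n = len(xs)
    total = sum(error_indicator(xs[i], ys[i], xs[j], ys[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

def split_sample_risk(xs, ys):
    """Average of the error indicator over the floor(n/2) disjoint pairs (i, floor(n/2) + i)."""
    half = len(xs) // 2
    total = sum(error_indicator(xs[i], ys[i], xs[half + i], ys[half + i])
                for i in range(half))
    return total / half

def simulate(n=20, n_reps=1000, noise_sd=0.5, seed=0):
    """Return the Monte Carlo variances of the two estimates (assumed model: Y = X + noise)."""
    rng = random.Random(seed)
    u_vals, split_vals = [], []
    for _ in range(n_reps):
        xs = [rng.random() for _ in range(n)]
        ys = [x + rng.gauss(0.0, noise_sd) for x in xs]
        u_vals.append(u_statistic_risk(xs, ys))
        split_vals.append(split_sample_risk(xs, ys))
    return statistics.pvariance(u_vals), statistics.pvariance(split_vals)

var_u, var_split = simulate()
print(var_u, var_split)
```

Both estimates are unbiased for the same risk, so the difference between them shows up only in the variance, which is the quantity the first-order analysis above ignores.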
4 Fast rates
The main results of this paper show that the bounds obtained in the previous section may be significantly improved under certain conditions. It is well known (see, e.g., §5.2 in the survey [8] and the references therein) that tighter bounds for the excess risk in the context of binary classification may be obtained if one can control the variance of the excess risk by its expected value. In classification this can be guaranteed under certain "low-noise" conditions (see Tsybakov [39], Massart and Nédélec [30], Koltchinskii [24]).
Next we examine possibilities of obtaining such improved performance bounds for empirical ranking risk minimization. The main message is that in the ranking problem one also may obtain significantly improved bounds under some conditions that are analogous to the low-noise conditions in the classification problem, though quite different in nature.
Here we will greatly benefit from using U-statistics (as opposed to splitting the sample) as the small variance of the U-statistics used to estimate the ranking risk gives rise to sharper bounds. The starting point of our analysis is the Hoeffding decomposition of U-statistics (see Appendix 1).
Set first
$$q_r((x, y), (x', y')) = I_{[(y - y') \cdot r(x, x') < 0]} - I_{[(y - y') \cdot r^*(x, x') < 0]}$$
and consider the following estimate of the excess risk $\Lambda(r) = L(r) - L^* =$