
HAL Id: hal-00020087

https://hal.archives-ouvertes.fr/hal-00020087

Preprint submitted on 5 Mar 2006


Ranking and empirical minimization of U-statistics

Stéphan Clémençon, Gábor Lugosi, Nicolas Vayatis

To cite this version:

Stéphan Clémençon, Gábor Lugosi, Nicolas Vayatis. Ranking and empirical minimization of U-statistics. 2006. ⟨hal-00020087⟩


Ranking and empirical minimization of U-statistics

Stéphan Clémençon
MODAL'X, Université Paris X Nanterre
& Laboratoire de Probabilités et Modèles Aléatoires, UMR CNRS 7599, Universités Paris VI et Paris VII

Gábor Lugosi
Departament d'Economia i Empresa, Universitat Pompeu Fabra

Nicolas Vayatis
Laboratoire de Probabilités et Modèles Aléatoires, UMR CNRS 7599, Universités Paris VI et Paris VII

March 5, 2006

Abstract

The problem of ranking/ordering instances, instead of simply classifying them, has recently gained much attention in machine learning. In this paper we formulate the ranking problem in a rigorous statistical framework. The goal is to learn a ranking rule for deciding, among two instances, which one is "better", with minimum ranking risk. Since the natural estimates of the risk are of the form of a U-statistic, results of the theory of U-processes are required for investigating the consistency of empirical risk minimizers. We establish, in particular, a tail inequality for degenerate U-processes, and apply it to show that fast rates of convergence may be achieved under specific noise assumptions, just like in classification. Convex risk minimization methods are also studied.

The second author acknowledges support by the Spanish Ministry of Science and Technology and FEDER, grant BMF2003-03324, and by the PASCAL Network of Excellence under EC grant no. 506778.


1 Introduction

Motivated by various applications including problems related to document retrieval or credit-risk screening, the ranking problem has received increasing attention both in the statistical and machine learning literature. In the ranking problem one has to compare two different observations and decide which one is "better". For example, in document retrieval applications, one may be concerned with comparing documents by degree of relevance for a particular request, rather than simply classifying them as relevant or not. Similarly, credit establishments collect and manage large databases containing the socio-demographic and credit-history characteristics of their clients to build a ranking rule which aims at indicating reliability.

In this paper we define a statistical framework for studying such ranking problems. The ranking problem defined here is closely related to Stute's conditional U-statistics [36, 37]. Indeed, Stute's results imply that certain nonparametric estimates based on local U-statistics give universally consistent ranking rules. Our approach here is different. Instead of local averages, we consider empirical minimizers of U-statistics, more in the spirit of empirical risk minimization popular in statistical learning theory, see, e.g., Vapnik and Chervonenkis [40], Bartlett and Mendelson [6], Boucheron, Bousquet, Lugosi [8], Koltchinskii [24], Massart [29] for surveys and recent developments. The important feature of the ranking problem is that natural estimates of the ranking risk involve U-statistics. Therefore, the methodology is based on the theory of U-processes, and the key tools involve maximal and concentration inequalities, symmetrization tricks, and a "contraction principle" for U-processes. For an excellent account of the theory of U-statistics and U-processes we refer to the monograph of de la Peña and Giné [12].

Furthermore, we provide a theoretical analysis of certain nonparametric ranking methods that are based on an empirical minimization of convex cost functionals over convex sets of scoring functions. The methods are inspired by boosting- and support vector machine-type algorithms for classification. The main results of the paper prove universal consistency of properly regularized versions of these methods, establish a novel tail inequality for degenerate U-processes and, based on the latter result, show that fast rates of convergence may be achieved for empirical risk minimizers under suitable noise conditions.

We point out that under certain conditions, finding a good ranking rule amounts to constructing a scoring function s. An important special case is the bipartite ranking problem in which the available instances in the data are labelled by binary labels (good and bad). In this case the ranking criterion is closely related to the so-called AUC (area under the "ROC" curve) criterion (see the Appendix for more details).

The rest of the paper is organized as follows. In Section 2, the basic models are introduced. Section 3 provides some basic uniform convergence and consistency results for empirical risk minimizers. Section 4 contains the main statistical results of the paper, establishing performance bounds for empirical risk minimization for ranking problems. In Section 5, we describe the noise assumptions which guarantee fast rates of convergence in particular cases. In Section 6 a new exponential concentration inequality is established for U-processes which serves as a main tool in our analysis. In Section 7 we discuss convex risk minimization for ranking problems, laying down a theoretical framework for studying boosting and support vector machine-type ranking methods. In the Appendix we summarize some basic properties of U-statistics and highlight some connections of the ranking problem defined here to properties of the so-called ROC curve, appearing in related problems.

2 The ranking problem

Let (X, Y) be a pair of random variables taking values in X × R where X is a measurable space. The random object X models some observation and Y its real-valued label. Let (X', Y') denote a pair of random variables identically distributed with (X, Y), and independent of it. Denote

Z = (Y − Y')/2 .

In the ranking problem one observes X and X' but not their labels Y and Y'. We think about X being "better" than X' if Y > Y', that is, if Z > 0. (The factor 1/2 in the definition of Z is not significant, it is merely here as a convenient normalization.) The goal is to rank X and X' such that the probability that the better ranked of them has a smaller label is as small as possible. Formally, a ranking rule is a function r : X × X → {−1, 1}. If r(x, x') = 1 then the rule ranks x higher than x'. The performance of a ranking rule is measured by the ranking risk

L(r) = P{Z r(X, X') < 0} ,

that is, the probability that r ranks two randomly drawn instances incorrectly.

Observe that in this formalization, the ranking problem is equivalent to a binary classification problem in which the sign of the random variable Z is to be guessed based upon the pair of observations (X, X'). Now it is easy to determine the ranking rule with minimal risk. Introduce the notation

ρ+(X, X') = P{Z > 0 | X, X'} ,   ρ−(X, X') = P{Z < 0 | X, X'} .

Then we have the following simple fact:

Proposition 1 Define the ranking rule

r*(x, x') = 2 I[ρ+(x, x') ≥ ρ−(x, x')] − 1

and denote L* = L(r*) = E{min(ρ+(X, X'), ρ−(X, X'))}. Then for any ranking rule r,

L* ≤ L(r) .

Proof. Let r be any ranking rule. Observe that, by conditioning first on (X, X'), one may write

L(r) = E{ I[r(X,X')=1] ρ−(X, X') + I[r(X,X')=−1] ρ+(X, X') } .

It is now easy to check that L(r) is minimal for r = r*.

Thus, r* minimizes the ranking risk over all possible ranking rules. In the definition of r* ties are broken in favor of ρ+, but obviously if ρ+(x, x') = ρ−(x, x'), an arbitrary value can be chosen for r* without altering its risk.
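As a concrete illustration of Proposition 1 (a minimal sketch, not part of the paper: the toy distribution below, where Y equals X plus uniform noise on {−1, 0, 1}, is made up), one can enumerate ρ+(x, x'), ρ−(x, x'), the rule r* and the optimal risk L*:

import itertools

# toy model: X uniform on {0, 1, 2}, Y = X + E with E uniform on {-1, 0, 1}
x_vals = [0, 1, 2]
e_vals = [-1, 0, 1]
p_x = {x: 1.0 / 3 for x in x_vals}
p_e = {e: 1.0 / 3 for e in e_vals}

def rho_plus(x, xp):
    # P{Z > 0 | X = x, X' = xp} = P{x + E > xp + E'}
    return sum(p_e[e] * p_e[ep] for e in e_vals for ep in e_vals if x + e > xp + ep)

def rho_minus(x, xp):
    # P{Z < 0 | X = x, X' = xp}
    return sum(p_e[e] * p_e[ep] for e in e_vals for ep in e_vals if x + e < xp + ep)

def r_star(x, xp):
    # Bayes ranking rule of Proposition 1 (ties broken in favor of rho_plus)
    return 1 if rho_plus(x, xp) >= rho_minus(x, xp) else -1

# optimal ranking risk L* = E min(rho_plus(X, X'), rho_minus(X, X'))
L_star = sum(p_x[x] * p_x[xp] * min(rho_plus(x, xp), rho_minus(x, xp))
             for x, xp in itertools.product(x_vals, x_vals))
print(L_star)  # here r_star(x, xp) = 1 exactly when x >= xp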

The purpose of this paper is to investigate the construction of ranking rules of low risk based on training data. We assume that n independent, identically distributed copies of (X, Y) are available: Dn = (X1, Y1), . . . , (Xn, Yn). Given a ranking rule r, one may use the training data to estimate its risk L(r) = P{Z r(X, X') < 0}. The perhaps most natural estimate is the U-statistic

Ln(r) = (1/(n(n − 1))) Σ_{i≠j} I[Z_{i,j} r(Xi, Xj) < 0] ,

where Z_{i,j} = (Yi − Yj)/2.
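For a ranking rule induced by a scoring function, r(x, x') = 2 I[s(x) ≥ s(x')] − 1, the U-statistic Ln(r) can be computed directly from its definition. The sketch below (the data-generating model and the choice s(x) = x are made up for illustration) does exactly that:

import numpy as np

def empirical_ranking_risk(scores, y):
    # U-statistic L_n(r) for r(x, x') = 2*I[s(x) >= s(x')] - 1:
    # average of I[Z_{i,j} r(X_i, X_j) < 0] over all ordered pairs i != j,
    # with Z_{i,j} = (Y_i - Y_j) / 2.
    n = len(y)
    errors = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            z = (y[i] - y[j]) / 2.0
            r = 1.0 if scores[i] >= scores[j] else -1.0
            if z * r < 0:
                errors += 1
    return errors / (n * (n - 1))

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + rng.normal(scale=0.5, size=200)  # labels positively related to x
print(empirical_ranking_risk(x, y))      # empirical ranking risk of s(x) = x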

In this paper we consider minimizers of the empirical estimate Ln(r) over a class R of ranking rules and study the performance of such empirically selected ranking rules. Before discussing empirical risk minimization for ranking, a few remarks are in order.

Remark 1 Note that the actual values of the Yi's are never used in the ranking rules discussed in this paper. It is sufficient to know the values of the Z_{i,j}, or, equivalently, the ordering of the Yi's.

Remark 2 (a more general framework.) One may consider a generalization of the setup described above. Instead of ranking just two observations X, X', one may be interested in ranking m independent observations X^(1), . . . , X^(m). In this case the value of a ranking function r(X^(1), . . . , X^(m)) is a permutation π of {1, . . . , m} and the goal is that π should coincide with (or at least resemble) the permutation π* for which Y^(π*(1)) ≥ · · · ≥ Y^(π*(m)). Given a loss function ℓ that assigns a number in [0, 1] to a pair of permutations, the ranking risk is defined as

L(r) = E ℓ(r(X^(1), . . . , X^(m)), π*) .

In this general case, natural estimates of L(r) involve m-th order U-statistics. Many of the results of this paper may be extended, in a more or less straightforward manner, to this general setup. In order to lighten the notation and simplify the arguments, we restrict the discussion to the case described above, that is, to the case when m = 2 and the loss function is ℓ(π, π') = I[π ≠ π'].
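The general framework of Remark 2 can be illustrated numerically. The following sketch (all modelling choices are made up: m = 3, Gaussian observations, labels Y^(k) = X^(k) + noise, and a ranking function that simply sorts the observations by their own values) estimates the m-ary ranking risk for the zero-one permutation loss by Monte Carlo:

import numpy as np

def ranking_permutation(values):
    # permutation pi with values[pi[0]] >= values[pi[1]] >= ... (0-based indices)
    return tuple(np.argsort(-np.asarray(values), kind="stable"))

def zero_one_loss(pi, pi_star):
    # the loss l(pi, pi') = I[pi != pi'] used in the m = 2 case of the text
    return 0.0 if pi == pi_star else 1.0

rng = np.random.default_rng(0)
m, n_trials = 3, 20000
losses = []
for _ in range(n_trials):
    x = rng.normal(size=m)
    y = x + rng.normal(scale=0.5, size=m)
    pi = ranking_permutation(x)       # predicted ordering (rank by the observations)
    pi_star = ranking_permutation(y)  # target ordering (rank by the labels)
    losses.append(zero_one_loss(pi, pi_star))
print(np.mean(losses))                # Monte Carlo estimate of L(r)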

Remark 3 (ranking and scoring.) In many interesting cases the ranking problem may be reduced to finding an appropriate scoring function. These are the cases when the joint distribution of X and Y is such that there exists a function s : X → R such that

r*(x, x') = 1 if and only if s(x) ≥ s(x') .

A function s satisfying the assumption is called an optimal scoring function. Obviously, any strictly increasing transformation of an optimal scoring function is also an optimal scoring function. Below we describe some important special cases when the ranking problem may be reduced to scoring.

Example 1 (the bipartite ranking problem.) In the bipartite ranking problem the label Y is binary, it takes values in {−1, 1}. Writing η(x) = P{Y = 1 | X = x}, it is easy to see that the Bayes ranking risk equals

L* = E min{η(X)(1 − η(X')), η(X')(1 − η(X))}
   = E min{η(X), η(X')} − (E η(X))²

and also

L* = Var((Y + 1)/2) − (1/2) E|η(X) − η(X')| .

In particular,

L* ≤ Var((Y + 1)/2) ≤ 1/4 ,

where the equality L* = Var((Y + 1)/2) holds when X and Y are independent and the maximum is attained when η ≡ 1/2. Observe that the difficulty of the bipartite ranking problem depends on the concentration properties of the distribution of η(X) = P(Y = 1 | X) through the quantity E|η(X) − η(X')|, which is a classical measure of concentration, known as Gini's mean difference. For given p = E(η(X)), Gini's mean difference ranges from a minimum value of zero, when η(X) ≡ p, to a maximum value of 2p(1 − p), in the case when η(X) = (Y + 1)/2.

It is clear from the form of the Bayes ranking rule that the optimal ranking rule is given by a scoring function s* where s* is any strictly increasing transformation of η. Then one may restrict the search to ranking rules defined by scoring functions s, that is, ranking rules of the form r(x, x') = 2 I[s(x) ≥ s(x')] − 1. Writing L(s) := L(r), one has

L(s) − L* = E{ |η(X') − η(X)| I[(s(X)−s(X'))(η(X)−η(X'))<0] } .

The excess ranking risk is closely related to the AUC criterion, which is a standard performance measure in the bipartite setting (see [14] and Appendix 2). More precisely, we have

AUC(s) = P{s(X) ≥ s(X') | Y = 1, Y' = −1} = 1 − (1/(2p(1 − p))) L(s) ,

where p = P(Y = 1), so that maximizing the AUC criterion boils down to minimizing the ranking error.
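The identity above is easy to check numerically. The sketch below (a made-up logistic model with η(x) = 1/(1 + e^{−2x}), scored by s(x) = x) computes the empirical AUC over positive/negative pairs and compares it with 1 − Ln(s)/(2 p_n (1 − p_n)), where p_n is the empirical fraction of positive labels; the two quantities agree up to a factor n/(n − 1) coming from the different pair normalizations:

import numpy as np

def empirical_ranking_error(scores, y):
    # U-statistic L_n(s) for the scoring rule r(x, x') = 2*I[s(x) >= s(x')] - 1
    n = len(y)
    err = 0
    for i in range(n):
        for j in range(n):
            if i != j and (y[i] - y[j]) * (1.0 if scores[i] >= scores[j] else -1.0) < 0:
                err += 1
    return err / (n * (n - 1))

def empirical_auc(scores, y):
    # P_n{s(X) >= s(X') | Y = +1, Y' = -1}, averaged over positive/negative pairs
    pos, neg = scores[y == 1], scores[y == -1]
    return np.mean([1.0 if sp >= sn else 0.0 for sp in pos for sn in neg])

rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = np.where(rng.uniform(size=300) < 1.0 / (1.0 + np.exp(-2 * x)), 1, -1)
p = np.mean(y == 1)
L = empirical_ranking_error(x, y)
print(empirical_auc(x, y), 1 - L / (2 * p * (1 - p)))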

Example 2 (a regression model.) Assume now that Y is real-valued and the joint distribution of X and Y is such that Y = m(X) + ε, where m(x) = E(Y | X = x) is the regression function and ε is independent of X and has a symmetric distribution around zero. Then clearly the optimal ranking rule r* may be obtained by a scoring function s*, where s* may be taken as any strictly increasing transformation of m.

3 Empirical risk minimization

Based on the empirical estimate Ln(r) of the risk L(r) of a ranking rule defined above, one may consider choosing a ranking rule by minimizing the empirical risk over a class R of ranking rules r : X × X → {−1, 1}. Define the empirical risk minimizer, over R, by

rn = argmin_{r∈R} Ln(r) .

(Ties are broken in an arbitrary way.) In a "first-order" approach, we may study the performance L(rn) = P{Z rn(X, X') < 0 | Dn} of the empirical risk minimizer by the standard bound (see, e.g., [13])

L(rn) − inf_{r∈R} L(r) ≤ 2 sup_{r∈R} |Ln(r) − L(r)| .    (1)

This inequality points out that bounding the performance of an empirical minimizer of the ranking risk boils down to investigating the properties of U-processes, that is, suprema of U-statistics indexed by a class of ranking rules. For a detailed and modern account of U-process theory we refer to the book of de la Peña and Giné [12]. In a first-order approach we basically reduce the problem to the study of ordinary empirical processes.

By using the simple Lemma 14 given in the Appendix, we obtain the following:

Proposition 2 Define the Rademacher average

Rn = sup_{r∈R} (1/⌊n/2⌋) Σ_{i=1}^{⌊n/2⌋} εi I[Z_{i,⌊n/2⌋+i} r(Xi, X_{⌊n/2⌋+i}) < 0]

where ε1, . . . , ε_{⌊n/2⌋} are i.i.d. Rademacher random variables (i.e., random symmetric sign variables). Then for any convex nondecreasing function ψ,

E ψ( L(rn) − inf_{r∈R} L(r) ) ≤ E ψ(4 Rn) .

Proof. The inequality follows immediately from (1), Lemma 14 (see the Appendix), and a standard symmetrization inequality, see, e.g., Giné and Zinn [17].
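The Rademacher average Rn of Proposition 2 is straightforward to estimate by Monte Carlo for a small finite class of ranking rules. The sketch below uses a made-up class (rules induced by linear scoring functions s_w(x) = w·x over a finite grid of directions w) and made-up data; it is only meant to show how the quantity is computed:

import numpy as np

def rademacher_average(x, y, weights, rng, n_mc=200):
    # Monte Carlo estimate of E R_n for the class of rules r_w(x, x') = 2*I[w.x >= w.x'] - 1.
    # As in the definition above, observation i is paired with observation m + i,
    # m = floor(n/2), and the pair losses are multiplied by i.i.d. Rademacher signs.
    n = len(y)
    m = n // 2
    z = (y[:m] - y[m:2 * m]) / 2.0
    errs = np.stack([((z * (2.0 * ((x[:m] @ w) >= (x[m:2 * m] @ w)) - 1.0)) < 0)
                     for w in weights]).astype(float)   # shape (|W|, m), entries in {0, 1}
    values = []
    for _ in range(n_mc):
        eps = rng.choice([-1.0, 1.0], size=m)           # Rademacher signs
        values.append(np.max(errs @ eps) / m)           # supremum over the class
    return np.mean(values)

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=(n, 2))
y = x[:, 0] + 0.5 * rng.normal(size=n)                  # made-up labels
angles = np.linspace(0.0, 2 * np.pi, 16, endpoint=False)
weights = [np.array([np.cos(a), np.sin(a)]) for a in angles]
print(rademacher_average(x, y, weights, rng))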

One may easily use this result to derive probabilistic performance bounds for the empirical risk minimizer. For example, by taking ψ(x) = e^{λx} for some λ > 0, and using the bounded differences inequality (see McDiarmid [31]), we have

E exp( λ( L(rn) − inf_{r∈R} L(r) ) ) ≤ E exp(4λ Rn) ≤ exp( 4λ E Rn + 2λ²/(n − 1) ) .

By using Markov's inequality and choosing λ to minimize the bound, we readily obtain:

Corollary 3 Let δ > 0. With probability at least 1 − δ,

L(rn) − inf_{r∈R} L(r) ≤ 4 E Rn + 4 √( ln(1/δ) / (n − 1) ) .

The expected value of the Rademacher average Rn may now be bounded by standard methods, see, e.g., Lugosi [27], Boucheron, Bousquet, and Lugosi [8]. For example, if the class R of indicator functions has finite VC dimension V, then

E Rn ≤ c √(V/n)

for a universal constant c.
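Plugging this bound into Corollary 3 (a direct combination of the two displays, written out here only for convenience) gives that, with probability at least 1 − δ,

L(rn) − inf_{r∈R} L(r) ≤ 4c √(V/n) + 4 √( ln(1/δ) / (n − 1) ) ,

that is, a rate of order √(V/n) for classes of finite VC dimension.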

This result is similar to the one proved in the bipartite ranking case by Agarwal, Graepel, Herbrich, Har-Peled, and Roth [2], with the restriction that their bound holds conditionally on a label sequence. The analysis of [2] relies on a particular complexity measure called the rank-shatter coefficient, but the core of the argument is the same.

The proposition above is convenient, simple, and, in a certain sense, not improvable. However, it is well known from the theory of statistical learning and empirical risk minimization for classification that the bound (1) is often quite loose. This looseness is due to the fact that the variance of the estimators of the risk is ignored and bounded uniformly by a constant. Therefore, the main interest in considering U-statistics precisely consists in the fact that they have minimal variance among all unbiased estimators. However, the reduced-variance property of U-statistics plays no role in the above analysis of the ranking problem. Observe that all upper bounds obtained in this section remain true for an empirical risk minimizer that, instead of using estimates based on U-statistics, estimates the risk of a ranking rule by splitting the data set into two halves and estimates L(r) by

(1/⌊n/2⌋) Σ_{i=1}^{⌊n/2⌋} I[Z_{i,⌊n/2⌋+i} r(Xi, X_{⌊n/2⌋+i}) < 0] .

Hence, in the previous study one loses the advantage of using U-statistics. In Section 4 it is shown that under certain, not uncommon, circumstances significantly smaller risk bounds are achievable. There it will be essential to use sharp exponential bounds for U-processes, involving their reduced variance.
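The reduced variance of the U-statistic estimate, compared with the split-sample estimate above, is easy to observe in simulation. The sketch below (the data-generating model and the scoring rule s(x) = x are made up) compares the spread of the two estimators over repeated samples:

import numpy as np

def u_stat_estimate(scores, y):
    # full U-statistic L_n(r) over all ordered pairs i != j
    n = len(y)
    err = 0
    for i in range(n):
        for j in range(n):
            if i != j and (y[i] - y[j]) * (1.0 if scores[i] >= scores[j] else -1.0) < 0:
                err += 1
    return err / (n * (n - 1))

def split_estimate(scores, y):
    # independent-pairs estimate: pair observation i with observation m + i, m = floor(n/2)
    m = len(y) // 2
    z = y[:m] - y[m:2 * m]
    r = np.where(scores[:m] >= scores[m:2 * m], 1.0, -1.0)
    return np.mean(z * r < 0)

rng = np.random.default_rng(3)
u_vals, s_vals = [], []
for _ in range(300):
    x = rng.normal(size=100)
    y = x + rng.normal(scale=1.0, size=100)
    u_vals.append(u_stat_estimate(x, y))
    s_vals.append(split_estimate(x, y))
print(np.std(u_vals), np.std(s_vals))  # the U-statistic typically shows the smaller spread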

4 Fast rates

The main results of this paper show that the bounds obtained in the previous section may be significantly improved under certain conditions. It is well known (see, e.g., Section 5.2 in the survey [8] and the references therein) that tighter bounds for the excess risk in the context of binary classification may be obtained if one can control the variance of the excess risk by its expected value. In classification this can be guaranteed under certain "low-noise" conditions (see Tsybakov [39], Massart and Nédélec [30], Koltchinskii [24]).

Next we examine possibilities of obtaining such improved performance bounds for empirical ranking risk minimization. The main message is that in the ranking problem one also may obtain significantly improved bounds under some conditions that are analogous to the low-noise conditions in the classification problem, though quite different in nature.

Here we will greatly benefit from using U-statistics (as opposed to splitting the sample), as the small variance of the U-statistics used to estimate the ranking risk gives rise to sharper bounds. The starting point of our analysis is the Hoeffding decomposition of U-statistics (see Appendix 1).

Set first

qr((x, y), (x', y')) = I[(y−y')r(x,x')<0] − I[(y−y')r*(x,x')<0]

and consider the following estimate of the excess risk Λ(r) = L(r) − L*:

Λn(r) = (1/(n(n − 1))) Σ_{i≠j} qr((Xi, Yi), (Xj, Yj)) .
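For orientation, the first-order Hoeffding decomposition referred to above takes the following form for Λn(r) when the kernel qr is symmetric (which holds, e.g., when r and r* are antisymmetric, r(x, x') = −r(x', x)); the function hr introduced below is only notation for this sketch and need not match the one used in the paper:

Λn(r) = Λ(r) + (2/n) Σ_{i=1}^n hr(Xi, Yi) + Wn(r) ,

where hr(x, y) = E[qr((x, y), (X', Y'))] − Λ(r), and Wn(r) is a degenerate U-statistic: its kernel, qr((x, y), (x', y')) − Λ(r) − hr(x, y) − hr(x', y'), has zero conditional expectation given either one of its arguments.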
