HAL Id: inria-00536692
https://hal.inria.fr/inria-00536692
Submitted on 16 Nov 2010
Learning from Positive and Unlabeled Examples
François Denis, Rémi Gilleron, Fabien Letouzey
To cite this version:
François Denis, Rémi Gilleron, Fabien Letouzey. Learning from Positive and Unlabeled Examples.
Theoretical Computer Science, Elsevier, 2005, 348 (1), pp. 70-83. doi:10.1016/j.tcs.2005.09.007. inria-00536692
Learning from Positive and Unlabeled Examples ⋆
François DENIS (a), Rémi GILLERON (b), Fabien LETOUZEY (b)

(a) Équipe BDAA, LIF, Centre de Mathématiques et d'Informatique (CMI), Université de Provence, Marseille, FRANCE. E-mail: fdenis@cmi.univ-mrs.fr

(b) Équipe Grappa, LIFL, UPRESA 8022 CNRS, Université de Lille 1 and Université Charles de Gaulle, Lille 3, FRANCE. E-mail: {gilleron,letouzey}@lifl.fr
Abstract

In many machine learning settings, labeled examples are difficult to collect while unlabeled data are abundant. Also, for some binary classification problems, positive examples which are elements of the target concept are available. Can these additional data be used to improve accuracy of supervised learning algorithms? We investigate in this paper the design of learning algorithms from positive and unlabeled data only. Many machine learning and data mining algorithms, such as decision tree induction algorithms and naive Bayes algorithms, use examples only to evaluate statistical queries (SQ-like algorithms). Kearns designed the Statistical Query learning model in order to describe these algorithms. Here, we design an algorithm scheme which transforms any SQ-like algorithm into an algorithm based on positive statistical queries (estimates for probabilities over the set of positive instances) and instance statistical queries (estimates for probabilities over the instance space). We prove that any class learnable in the Statistical Query learning model is learnable from positive statistical queries and instance statistical queries as soon as a lower bound on the weight of any target concept f can be estimated in polynomial time. Then, we design a decision tree induction algorithm POSC4.5, based on C4.5, that uses only positive and unlabeled examples, and we give experimental results for this algorithm. The case of imbalanced classes, in the sense that one of the two classes (say the positive class) is heavily underrepresented compared to the other class, remains an open learning problem. This problem is challenging because it is encountered in many real-world applications.

Key words: PAC learning, Statistical Query model, Semi-supervised Learning, Data Mining
⋆ This research was partially supported by: "CPER 2000-2006, Contrat de Plan état-région Nord/Pas-de-Calais: axe TACT, projet TIC"; fonds européens FEDER "TIC - Fouille Intelligente de données - Traitement Intelligent des Connaissances" OBJ 2 - phasing out - 2001/3 - 4.1 - n° 3.
1 Introduction

The field of Data Mining (sometimes referred to as knowledge discovery in databases) addresses the question of how best to use various sets of data to discover regularities and to improve decisions. The learning step is central in the data mining process. A first generation of supervised machine learning algorithms (e.g. decision tree induction algorithms, neural network learning methods, Bayesian learning methods, logistic regression, ...) has been demonstrated to be of significant value in a Data Mining perspective, and they are now widely used and available in commercial products. But these machine learning methods come from non-parametric statistics and suppose that the input sample is a quite large set of independently and identically distributed (i.i.d.) labeled data described by numeric or symbolic features. In a Data Mining or a Text Mining perspective, however, one has to use historical data that have been collected from various origins and, moreover, i.i.d. labeled data may be expensive to collect or even unavailable. On the other hand, unlabeled data providing information about the underlying distribution, or examples of one particular class (which we shall call the positive class), may be easily available. Can this additional information help to learn? Here, we address the issue of designing classification algorithms that are able to utilize data from diverse data sources: labeled data (if available), unlabeled data, and positive data.
Along this line of research, there has recently been significant interest in semi-supervised learning, that is, the design of learning algorithms from both labeled and unlabeled data. In the semi-supervised setting, one of the questions is: can unlabeled data be used to improve accuracy of supervised learning algorithms? Intuitively, the answer is positive because unlabeled data must provide some information about the hidden distribution. Nevertheless, it seems that the question is challenging from a theoretical perspective as well as a practical one. A promising line of research is the co-training setting first defined in [3]. Supposing that the features are naturally divided into two disjoint sets, the co-training algorithm builds two classifiers, and each one of these two is used to label unlabeled data for the other. In [3], theoretical results are proved; learning situations for which the assumption is true are described in [14]; experimental results may be found in [3] and [15]. See also [8] for another approach to the co-training setting. Other approaches include using the EM algorithm [16], and using transductive inference [11]. A NIPS'99 workshop and a NIPS'00 competition were also organized on using unlabeled data for supervised learning.
In this paper, we consider binary classification problems. One of the two classes is referred to as the positive class: the target concept is the set of instances of the positive class. We address the two following questions:

- How can unlabeled data and positive data be used to improve the accuracy of supervised learning algorithms?
- How can learning algorithms from unlabeled data and positive data be designed from previously known supervised learning algorithms?
First, let us justify that the problem is relevant for applications. We argue that, in many practical situations, elements of the target concept may be abundant and cheap to collect. For instance, consider the diagnosis of diseases: in order to obtain an i.i.d. sample of labeled examples, it is necessary to systematically detect the disease on a representative sample of patients, and this task may be quite expensive (or impossible). On the other hand, it may be easy to collect the medical files of patients who have the disease. Also, unlabeled data are any pool of patients possibly having the disease.
Second, let us note that many machine learning algorithms, such as decision tree learning algorithms and Bayesian learning algorithms, only use examples to estimate statistics. In other words, many machine learning algorithms may be considered as Statistical Query (SQ) learning algorithms. Thus we are interested in general schemes which transform supervised SQ-like learning algorithms into learning algorithms from both unlabeled data and positive data.

In a preliminary paper [6], we have given evidence, with both theoretical and empirical arguments, that positive data and unlabeled data can boost accuracy of SQ-like learning algorithms. It was noted that learning with positive and unlabeled data is possible as soon as the weight of the target concept (i.e. the ratio of positive examples) is known by the learner. An estimate of the weight can be obtained either from an extra oracle (say, for a similar problem) or from a small set of labeled examples. In the present paper, we consider the more general problem where only positive data and unlabeled data are available. We present a general scheme which transforms any SQ-like supervised learning algorithm L into an algorithm PL using only positive data and unlabeled data. We prove that PL is a learning algorithm as soon as the learner is given access to a lower bound on the weight of the target concept. It remains open whether it is possible to design an algorithm from positive data and unlabeled data from any SQ learning algorithm in the general case.
The theoretical framework is presented in Section 2. Our learning algorithm is defined and proved in Section 3, where some consequences about the equivalence of models are also given. It is applied to tree induction, and experimental results are given in Section 4.
2 Learning Models

2.1 Learning Models from Labeled Data
For each $n \geq 1$, $X_n$ denotes an instance space on $n$ attributes. A concept $f$ is a subset of some instance space $X_n$, or equivalently a $\{0,1\}$-valued function defined on $X_n$. For each $n \geq 1$, let $C_n \subseteq 2^{X_n}$ be a set of concepts. Then $C = \bigcup_{n \geq 1} C_n$ denotes a concept class over $X = \bigcup_{n \geq 1} X_n$. The size of a concept $f$ is the size of a smallest representation of $f$ for a given representation scheme. An example of a concept $f$ is a pair $\langle x, f(x) \rangle$, which is positive if $f(x) = 1$ and negative otherwise. Let $D$ be a distribution over the instance space $X_n$; for a subset $A$ of $X_n$, we denote by $D(A)$ the probability of the event $[x \in A]$. For a subset $A$ of $X_n$ such that $D(A) \neq 0$, we denote by $D_A$ the induced distribution over $A$. For instance, for a concept $f$ over $X_n$ such that $D(f) \neq 0$ and for any $x \in X_n$, $D_f(x) = D(x)/D(f)$ when $f(x) = 1$ and $D_f(x) = 0$ otherwise. Let $f$ and $g$ be concepts over the instance space $X_n$; we denote by $\bar{f}$ the complement of the set $f$ in $X_n$ and by $f \Delta g$ the set $f \Delta g = \{x \in X_n \mid f(x) \neq g(x)\}$.
Let $f$ be a target concept over $X$ in some concept class $C$. Let $D$ be the hidden distribution defined over $X$. In the PAC model [18], the learner is given access to an example oracle $EX(f,D)$ which returns an example $\langle x, f(x) \rangle$ drawn randomly according to $D$ at each call. A concept class $C$ is PAC learnable if there exist a learning algorithm $L$ and a polynomial $p(\cdot,\cdot,\cdot,\cdot)$ with the following property: for any $n$ and any $f \in C_n$, for any distribution $D$ on $X_n$, and for any $0 < \varepsilon < 1$ and $0 < \delta < 1$, if $L$ is given access to $EX(f,D)$ and to inputs $\varepsilon$ and $\delta$, then with probability at least $1 - \delta$, $L$ outputs a hypothesis concept $h$ satisfying $error(h) = D(f \Delta h) \leq \varepsilon$ in time bounded by $p(1/\varepsilon, 1/\delta, n, size(f))$. In this paper, we always suppose that the value of $size(f)$ is known by the learner. Recall that if $size(f)$ is not given, then the halting criterion of the algorithm is probabilistic [9]. Also, for many concept classes, the natural definition of $size(f)$ is already bounded by a polynomial in $n$.
One criticism of the PAC model is that it is a noise-free model. Therefore, extensions in which the label provided with each random example may be corrupted with random noise were studied. The classification noise model (CN model for short) was first defined by Angluin and Laird [1]. A variant of the CN model, namely the constant-partition classification noise model (CPCN model for short), has been defined by Decatur [5]. In this model, the labeled example space is partitioned into a constant number of regions, each of which may have a different noise rate. An interesting example is the case where the rate of false positive examples differs from the rate of false negative examples. We only define this restricted variant of the CPCN model. The noisy oracle $EX_{\eta_+,\eta_-}(f,D)$ is a procedure which, at each call, draws an element $x$ of $X_n$ according to $D$ and returns (i) $(x,1)$ with probability $1 - \eta_+$ and $(x,0)$ with probability $\eta_+$ if $x \in f$, (ii) $(x,0)$ with probability $1 - \eta_-$ and $(x,1)$ with probability $\eta_-$ if $x \in \bar{f}$. Let $C$ be a concept class over $X$. We say that $C$ is CPCN learnable if there exist a learning algorithm $L$ and a polynomial $p(\cdot,\cdot,\cdot,\cdot,\cdot)$ with the following property: for any $n$ and any $f \in C_n$, for any distribution $D$ on $X_n$, and for any $0 \leq \eta_+, \eta_- < 1/2$ and $0 < \varepsilon, \delta < 1$, if $L$ is given access to $EX_{\eta_+,\eta_-}(f,D)$ and to inputs $\varepsilon$ and $\delta$, then with probability at least $1 - \delta$, $L$ outputs a hypothesis concept $h \in C$ satisfying $D(f \Delta h) \leq \varepsilon$ in time bounded by $p(1/\varepsilon, 1/\delta, 1/\gamma, size(f), n)$ where $\gamma = \min\{1/2 - \eta_+, 1/2 - \eta_-\}$.
Many machine learning algorithms only use examples in order to estimate probabilities. This is the case for tree induction algorithms such as C4.5 [17] and CART [4]. This is also the case for highly practical Bayesian learning methods such as the naive Bayes classifier. Kearns defined the statistical query model (SQ model for short) in [12]. The SQ model is a specialization of the PAC model in which the learner forms its hypothesis solely on the basis of estimates of probabilities. A statistical query over $X_n$ is a mapping $\chi : X_n \times \{0,1\} \to \{0,1\}$ associated with a tolerance parameter $0 < \tau \leq 1$. In the SQ model, the learner is given access to a statistical oracle $STAT(f,D)$ which, at each query $(\chi,\tau)$, returns an estimate of $D(\{x \mid \chi(\langle x, f(x) \rangle) = 1\})$ within accuracy $\tau$. Let $C$ be a concept class over $X$. We say that $C$ is SQ learnable if there exist a learning algorithm $L$ and polynomials $p(\cdot,\cdot,\cdot)$, $q(\cdot,\cdot,\cdot)$ and $r(\cdot,\cdot,\cdot)$ with the following property: for any $f \in C$, for any distribution $D$ over $X$, and for any $0 < \varepsilon < 1$, if $L$ is given access to $STAT(f,D)$ and to input $\varepsilon$, then, for every query $(\chi,\tau)$ made by $L$, the predicate $\chi$ can be evaluated in time $q(1/\varepsilon, n, size(f))$ and $1/\tau$ is bounded by $r(1/\varepsilon, n, size(f))$; moreover, $L$ halts in time bounded by $p(1/\varepsilon, n, size(f))$ and $L$ outputs a hypothesis $h \in C$ satisfying $D(f \Delta h) \leq \varepsilon$.
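For concreteness, here is a minimal sketch of such a query predicate (our illustration, not from the paper; the dict-of-booleans instance encoding and the attribute name "a3" are assumptions). A query of this form asks for the kind of node statistic a decision tree induction algorithm needs:

```python
# A hypothetical statistical query predicate chi : X_n x {0,1} -> {0,1}.
# Instances are assumed to be dicts of boolean attributes (our encoding,
# not the paper's). This query asks for the probability that an instance
# satisfies attribute "a3" AND is labeled positive.

def chi(x: dict, label: int) -> int:
    """chi(<x, f(x)>) = 1 iff x["a3"] holds and the label is positive."""
    return int(x["a3"] and label == 1)
```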
We slightly modify the statistical oracle $STAT(f,D)$. Let $f$ be the target concept and let us consider a statistical query $\chi$ made by a statistical query learning algorithm $L$. The statistical oracle $STAT(f,D)$ returns an estimate $\hat{D}_\chi$ of $D_\chi = D(\{x \mid \chi(\langle x, f(x) \rangle) = 1\})$ within some given accuracy. We may write:

$$
\begin{aligned}
D_\chi &= D(\{x \mid \chi(\langle x,1\rangle)=1 \wedge f(x)=1\}) + D(\{x \mid \chi(\langle x,0\rangle)=1 \wedge f(x)=0\})\\
&= D(\{x \mid \chi(\langle x,1\rangle)=1\} \cap f) + D(\{x \mid \chi(\langle x,0\rangle)=1\} \cap \bar{f})\\
&= D(B \cap f) + D(C \cap \bar{f})
\end{aligned}
$$

where the sets $B$ and $C$ are defined by:

$$B = \{x \mid \chi(\langle x,1\rangle)=1\} \quad \text{and} \quad C = \{x \mid \chi(\langle x,0\rangle)=1\}.$$

Therefore, we henceforth suppose that the statistical oracle $STAT(f,D)$ provides estimates for probabilities $D(f \cap A)$ and $D(\bar{f} \cap A)$ within accuracy $\tau$, where $f$ is the target concept, $\bar{f}$ its complement, and $A$ any subset of the instance space for which membership is decidable in polynomial time. It should be clear to the reader that this technical modification does not change the SQ learnable classes.
It is clear that, given access to the example oracle $EX(f,D)$, it is easy to simulate the statistical oracle $STAT(f,D)$ by drawing a sufficiently large set of labeled examples. Moreover, there is a general scheme which transforms any SQ learning algorithm into a PAC learning algorithm. It is also proved in [12] that the class of parity functions is learnable in the PAC model but cannot be learned from statistical queries.
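A minimal sketch of this simulation (ours, not the paper's): Hoeffding's inequality shows that $m \geq \ln(2/\delta)/(2\tau^2)$ labeled examples suffice to estimate a query within tolerance $\tau$ with probability at least $1 - \delta$.

```python
import math

def simulate_stat(ex_oracle, chi, tau, delta):
    """Estimate D({x : chi(<x, f(x)>) = 1}) within tolerance tau with
    probability at least 1 - delta, from labeled examples.

    ex_oracle() is assumed to return a pair (x, f(x)) drawn i.i.d. from D,
    and chi is the query predicate. The sample size comes from Hoeffding's
    bound: P(|estimate - truth| >= tau) <= 2 exp(-2 m tau^2) <= delta."""
    m = math.ceil(math.log(2.0 / delta) / (2.0 * tau ** 2))
    hits = sum(chi(*ex_oracle()) for _ in range(m))
    return hits / m
```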
It has been shown by Kearns that any class learnable from statistical queries is also learnable in the presence of classification noise [12]. Following the results by Kearns, it has been proved by Decatur [5] that any class learnable from statistical queries is also learnable in the presence of constant-partition classification noise. The proof uses the hypothesis testing property: a hypothesis with small error can be selected from a set of hypotheses by selecting the one with the fewest errors on a set of CPCN corrupted examples. If we confuse, in the notations, the name of the model and the set of learnable classes, we can write the following inclusions:

$$SQ \subseteq CPCN \subseteq CN \subseteq PAC \qquad (1)$$
$$SQ \subsetneq PAC \qquad (2)$$

To our knowledge, the equivalences between the models CN and SQ or between the models CN and PAC remain open despite recent insights [2] and [10].
2.2 Learning Models from Positive and Unlabeled Data
The learning model from positive examples (POSEX for short) was first defined in [7]. The model differs from the PAC model in the following way: the learner gets information about the target function and the hidden distribution from two oracles, namely a positive example oracle $POS(f,D)$ and an instance oracle $INST(D)$, instead of an example oracle $EX(f,D)$. At each request by the learner, the instance oracle $INST(D)$ returns an element of the instance space $X$, i.e. an unlabeled example, according to the hidden distribution $D$. At each request by the learner, the positive example oracle $POS(f,D)$ returns a positive example according to the induced distribution $D_f$. We have the following result:
Proposition 1 [7] Any class learnable in the CPCN model is learnable in the POSEX model.
PROOF. The proof is simple and, as it may help to understand the proof of the main algorithm of the present paper, we sketch it below.

Let $C$ be a CPCN learnable concept class, let $L$ be a learning algorithm for $C$ in the CPCN model, let $f$ be the target concept, let $D$ be a distribution over the instance space, and let us suppose that $D(f) \neq 0$. We must show how $L$ can be used to learn from the oracles $POS(f,D)$ and $INST(D)$.

Run $L$. At each call of the noisy oracle:

- with probability 2/3, call $POS(f,D)$ and keep the positive label;
- with probability 1/3, call $INST(D)$ and label the example as negative.

It can easily be shown that this is strictly equivalent to calling the noisy oracle $EX_{\eta_+,\eta_-}(f,D')$ where:

$$
D'(x) =
\begin{cases}
D(x)/3 & \text{if } f(x) = 0\\
(D(x) + 2 D_f(x))/3 & \text{if } f(x) = 1
\end{cases}
\qquad
\eta_+ = \frac{D(f)}{2 + D(f)}, \qquad \eta_- = 0
$$

Note that $\eta_+ \leq 1/3 < 1/2$. And as, for any subset $A$ of the instance space, we have $D(A) \leq 3 D'(A)$, it is sufficient to run the algorithm $L$ with input accuracy $\varepsilon/3$ and input confidence $\delta$ to output, with confidence greater than $1 - \delta$, a hypothesis whose error rate is less than $\varepsilon$.
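The oracle transformation in this proof is purely mechanical; the following minimal sketch (our rendering; pos_oracle and inst_oracle stand in for $POS(f,D)$ and $INST(D)$) makes it explicit:

```python
import random

def make_noisy_ex(pos_oracle, inst_oracle):
    """Simulate the CPCN oracle EX_{eta+,eta-}(f, D') of the proof from
    POS(f, D) and INST(D).

    pos_oracle()  -> an instance drawn from D_f (always truly positive)
    inst_oracle() -> an instance drawn from D (unlabeled)"""
    def noisy_ex():
        if random.random() < 2.0 / 3.0:
            # keep the positive label; this branch never mislabels
            return pos_oracle(), 1
        # label the unlabeled example as negative; this mislabels exactly
        # the positive instances drawn here, giving eta+ = D(f)/(2 + D(f))
        return inst_oracle(), 0
    return noisy_ex
```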
The learning model from positive queries (POSQ for short) was also defined in [7]. In the POSQ model, there are a positive statistical oracle $PSTAT(f,D)$, which provides estimates for probabilities $D_f(A)$ for any subset $A$ of the instance space within a given tolerance, and an instance statistical oracle $ISTAT(D)$, which provides estimates for probabilities $D(A)$ for any subset $A$ of the instance space within a given tolerance. The definition of a POSQ learnable class is similar to the definition of an SQ learnable class: the oracle $STAT(f,D)$ is replaced by the two oracles $PSTAT(f,D)$ and $ISTAT(D)$. The POSQ model is weaker than the SQ model, as there is no direct way to obtain an estimate of the weight $D(f)$ of the target concept. However, if we can get such an estimate, both models become equivalent. Indeed, statistical queries can be estimated from positive statistical queries and instance statistical queries as soon as the weight of the target concept is known, because of the following equations:

$$
\hat{D}(f \cap A) = \hat{D}_f(A) \cdot \hat{D}(f), \qquad
\hat{D}(\bar{f} \cap A) = \hat{D}(A) - \hat{D}(f \cap A) \qquad (3)
$$

So, any class learnable in the SQ model is learnable in the POSQ model as soon as the learner is given access to the weight of the target concept or can compute it from the positive statistical oracle and the instance statistical oracle. This is formalized in the following result:

Proposition 2 [7] Let C be a concept class such that the weight of any target concept can be estimated in polynomial time within any given tolerance. If C is SQ learnable then C is POSQ learnable.
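As a concrete rendering of equations (3), here is a minimal sketch (ours, not the paper's algorithm PL) of answering the modified statistical queries $D(f \cap A)$ and $D(\bar{f} \cap A)$ from the two POSQ oracles, given an estimate weight_f of $D(f)$. In a full analysis, the tolerances passed to the oracles must be tightened as a function of the weight, which is where a lower bound on $D(f)$ becomes essential.

```python
def d_f_and(pstat, weight_f, a, tau):
    """Estimate D(f ∩ A) = D_f(A) * D(f)  -- first equation of (3).

    pstat(a, tau) is assumed to estimate D_f(A) within tolerance tau,
    weight_f is an estimate of D(f), and a is a membership predicate
    (in the model, any polynomially decidable subset A)."""
    return pstat(a, tau) * weight_f

def d_fbar_and(pstat, istat, weight_f, a, tau):
    """Estimate D(f̄ ∩ A) = D(A) - D(f ∩ A)  -- second equation of (3).

    istat(a, tau) is assumed to estimate D(A) within tolerance tau."""
    return istat(a, tau) - d_f_and(pstat, weight_f, a, tau)

def answer_stat_query(pstat, istat, weight_f, b, c, tau):
    """Answer the modified statistical query D_chi = D(B ∩ f) + D(C ∩ f̄),
    with b and c the membership predicates of the sets B and C."""
    return d_f_and(pstat, weight_f, b, tau) + d_fbar_and(pstat, istat, weight_f, c, tau)
```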
We can summarize all the results with the following inclusions:

$$POSQ \subseteq SQ \subseteq CPCN \subseteq POSEX \subseteq PAC \qquad (4)$$
$$CPCN \subseteq CN \subseteq PAC \qquad (5)$$
$$SQ \subsetneq POSEX \qquad (6)$$

The strict inclusion between SQ and POSEX holds because the class of parity functions is POSEX learnable but not SQ learnable. Equivalences between POSQ and SQ and between POSEX and PAC remain open.
3 Learning Algorithms from Positive and Unlabeled Examples
We have already noticed that, in practical Data Mining and Text Mining situations, statistical query-like algorithms, such as C4.5 or naive Bayes, are widely used. It is straightforward to see how a statistical query can be evaluated from labeled data. In a similar way, positive and instance statistical queries can easily be evaluated from positive and unlabeled data. So, in order to adapt classical learning algorithms to positive and unlabeled examples, we can show how SQ learning algorithms can be modified into POSQ learning algorithms.

In [6], we have studied the case where the weight of the target concept is either given by an oracle or evaluated from a small set of labeled examples. In this