HAL Id: inria-00536692
https://hal.inria.fr/inria-00536692
Submitted on 16 Nov 2010
Learning from Positive and Unlabeled Examples
François Denis, Rémi Gilleron, Fabien Letouzey
To cite this version:
François Denis, Rémi Gilleron, Fabien Letouzey. Learning from Positive and Unlabeled Examples.
Theoretical Computer Science, Elsevier, 2005, 348 (1), pp. 70-83. doi:10.1016/j.tcs.2005.09.007. inria-00536692
Learning from Positive and Unlabeled Examples ⋆
François DENIS (a), Rémi GILLERON (b), Fabien LETOUZEY (b)

(a) Équipe BDAA, LIF, Centre de Mathématiques et d'Informatique (CMI), Université de Provence, Marseille, FRANCE. E-mail: fdenis@cmi.univ-mrs.fr

(b) Équipe Grappa, LIFL, UPRESA 8022 CNRS, Université de Lille 1 and Université Charles de Gaulle, Lille 3, FRANCE. E-mail: {gilleron,letouzey}@lifl.fr
Abstract

In many machine learning settings, labeled examples are difficult to collect while unlabeled data are abundant. Also, for some binary classification problems, positive examples which are elements of the target concept are available. Can these additional data be used to improve accuracy of supervised learning algorithms? We investigate in this paper the design of learning algorithms from positive and unlabeled data only. Many machine learning and data mining algorithms, such as decision tree induction algorithms and naive Bayes algorithms, use examples only to evaluate statistical queries (SQ-like algorithms). Kearns designed the Statistical Query learning model in order to describe these algorithms. Here, we design an algorithm scheme which transforms any SQ-like algorithm into an algorithm based on positive statistical queries (estimates for probabilities over the set of positive instances) and instance statistical queries (estimates for probabilities over the instance space). We prove that any class learnable in the Statistical Query learning model is learnable from positive statistical queries and instance statistical queries as soon as a lower bound on the weight of any target concept f can be estimated in polynomial time. Then, we design a decision tree induction algorithm POSC4.5, based on C4.5, that uses only positive and unlabeled examples, and we give experimental results for this algorithm. The case of imbalanced classes, in the sense that one of the two classes (say the positive class) is heavily underrepresented compared to the other class, remains an open learning problem. This problem is challenging because it is encountered in many real-world applications.

Key words: PAC learning, Statistical Query model, Semi-supervised Learning, Data Mining
⋆ This research was partially supported by: "CPER 2000-2006, Contrat de Plan état-région Nord/Pas-de-Calais: axe TACT, projet TIC"; fonds européens FEDER "TIC - Fouille Intelligente de données - Traitement Intelligent des Connaissances" OBJ 2 - phasing out - 2001/3 - 4.1 - n° 3.
1 Introduction

The field of Data Mining (sometimes referred to as knowledge discovery in databases) addresses the question of how best to use various sets of data to discover regularities and to improve decisions. The learning step is central in the data mining process. A first generation of supervised machine learning algorithms (e.g. decision tree induction algorithms, neural network learning methods, Bayesian learning methods, logistic regression, ...) has been demonstrated to be of significant value in a Data Mining perspective, and they are now widely used and available in commercial products. But these machine learning methods come from non-parametric statistics and suppose that the input sample is a quite large set of independently and identically distributed (i.i.d.) labeled data described by numeric or symbolic features. In a Data Mining or a Text Mining perspective, however, one has to use historical data that have been collected from various origins and, moreover, i.i.d. labeled data may be expensive to collect or even unavailable. On the other hand, unlabeled data providing information about the underlying distribution, or examples of one particular class (which we shall call the positive class), may be easily available. Can this additional information help to learn? Here, we address the issue of designing classification algorithms that are able to utilize data from diverse data sources: labeled data (if available), unlabeled data, and positive data.
Along this line of research, there has recently been significant interest in semi-supervised learning, that is, the design of learning algorithms from both labeled and unlabeled data. In the semi-supervised setting, one of the questions is: can unlabeled data be used to improve accuracy of supervised learning algorithms? Intuitively, the answer is positive because unlabeled data must provide some information about the hidden distribution. Nevertheless, it seems that the question is challenging from a theoretical perspective as well as a practical one. A promising line of research is the co-training setting first defined in [3]. Supposing that the features are naturally divided into two disjoint sets, the co-training algorithm builds two classifiers, and each one of these two is used to label unlabeled data for the other. In [3], theoretical results are proved; learning situations for which the assumption is true are described in [14]; experimental results may be found in [3] and [15]. See also [8] for another approach to the co-training setting. Other approaches include using the EM algorithm [16], and using transductive inference [11]. A NIPS'99 workshop and a NIPS'00 competition were also organized on using unlabeled data for supervised learning.
In this paper, we consider binary classification problems. One of the two classes is referred to as the positive class: the target concept is the set of instances of the positive class. We address the two following questions:

- How can unlabeled data and positive data be used to improve the accuracy of supervised learning algorithms?
- How can learning algorithms from unlabeled data and positive data be designed from previously known supervised learning algorithms?
First, let us justify that the problem is relevant for applications. We argue that, in many practical situations, elements of the target concept may be abundant and cheap to collect. For instance, consider the diagnosis of diseases: in order to obtain an i.i.d. sample of labeled examples, it is necessary to systematically detect the disease on a representative sample of patients, and this task may be quite expensive (or impossible). On the other hand, it may be easy to collect the medical files of patients who have the disease. Also, unlabeled data are any pool of patients possibly having the disease.
Second, let us note that many machine learning algorithms, such as decision tree learning algorithms and Bayesian learning algorithms, only use examples to estimate statistics. In other words, many machine learning algorithms may be considered as Statistical Query (SQ) learning algorithms. Thus we are interested in general schemes which transform supervised SQ-like learning algorithms into learning algorithms from both unlabeled data and positive data.

In a preliminary paper [6], we have given evidence, with both theoretical and empirical arguments, that positive data and unlabeled data can boost accuracy of SQ-like learning algorithms. It was noted that learning with positive and unlabeled data is possible as soon as the weight of the target concept (i.e. the ratio of positive examples) is known by the learner. An estimate of the weight can be obtained either from an extra oracle (say, for a similar problem) or from a small set of labeled examples. In the present paper, we consider the more general problem where only positive data and unlabeled data are available. We present a general scheme which transforms any SQ-like supervised learning algorithm L into an algorithm PL using only positive data and unlabeled data. We prove that PL is a learning algorithm as soon as the learner is given access to a lower bound on the weight of the target concept. It remains open whether it is possible to design an algorithm from positive data and unlabeled data from any SQ learning algorithm in the general case.
The theoretical framework is presented in Section 2. Our learning algorithm is defined and proved in Section 3, where some consequences about the equivalence of models are also given. It is applied to tree induction, and experimental results are given in Section 4.
2 Learning Models

2.1 Learning Models from Labeled Data
For each $n \geq 1$, $X_n$ denotes an instance space on $n$ attributes. A concept $f$ is a subset of some instance space $X_n$, or equivalently a $\{0,1\}$-valued function defined on $X_n$. For each $n \geq 1$, let $C_n \subseteq 2^{X_n}$ be a set of concepts. Then $C = \bigcup_{n \geq 1} C_n$ denotes a concept class over $X = \bigcup_{n \geq 1} X_n$. The size of a concept $f$ is the size of a smallest representation of $f$ for a given representation scheme. An example of a concept $f$ is a pair $\langle x, f(x) \rangle$, which is positive if $f(x) = 1$ and negative otherwise. Let $D$ be a distribution over the instance space $X_n$; for a subset $A$ of $X_n$, we denote by $D(A)$ the probability of the event $[x \in A]$. For a subset $A$ of $X_n$ such that $D(A) \neq 0$, we denote by $D_A$ the induced distribution over $A$. For instance, for a concept $f$ over $X_n$ such that $D(f) \neq 0$ and for any $x \in X_n$, $D_f(x) = D(x)/D(f)$ when $f(x) = 1$ and $D_f(x) = 0$ otherwise. Let $f$ and $g$ be concepts over the instance space $X_n$; we denote by $\bar{f}$ the complement of the set $f$ in $X_n$ and by $f \Delta g$ the set $f \Delta g = \{x \in X_n \mid f(x) \neq g(x)\}$.
Let $f$ be a target concept over $X$ in some concept class $C$. Let $D$ be the hidden distribution defined over $X$. In the PAC model [18], the learner is given access to an example oracle $EX(f,D)$ which returns an example $\langle x, f(x) \rangle$ drawn randomly according to $D$ at each call. A concept class $C$ is PAC learnable if there exist a learning algorithm $L$ and a polynomial $p(\cdot,\cdot,\cdot,\cdot)$ with the following property: for any $n$ and any $f \in C_n$, for any distribution $D$ on $X_n$, and for any $0 < \varepsilon < 1$ and $0 < \delta < 1$, if $L$ is given access to $EX(f,D)$ and to inputs $\varepsilon$ and $\delta$, then with probability at least $1 - \delta$, $L$ outputs a hypothesis concept $h$ satisfying $error(h) = D(f \Delta h) \leq \varepsilon$ in time bounded by $p(1/\varepsilon, 1/\delta, n, size(f))$. In this paper, we always suppose that the value of $size(f)$ is known by the learner. Recall that if $size(f)$ is not given, then the halting criterion of the algorithm is probabilistic [9]. Also, for many concept classes, the natural definition of $size(f)$ is already bounded by a polynomial in $n$.
One criticism of the PAC model is that it is a noise-free model. Therefore, extensions in which the label provided with each random example may be corrupted with random noise were studied. The classification noise model (CN model for short) was first defined by Angluin and Laird [1]. A variant of the CN model, namely the constant-partition classification noise model (CPCN model for short), has been defined by Decatur [5]. In this model, the labeled example space is partitioned into a constant number of regions, each of which may have a different noise rate. An interesting example is the case where the rate of false positive examples differs from the rate of false negative examples. We only define this restricted variant of the CPCN model. The noisy oracle $EX_{\eta_+,\eta_-}(f,D)$ is a procedure which, at each call, draws an element $x$ of $X_n$ according to $D$ and returns (i) $(x,1)$ with probability $1 - \eta_+$ and $(x,0)$ with probability $\eta_+$ if $x \in f$, (ii) $(x,0)$ with probability $1 - \eta_-$ and $(x,1)$ with probability $\eta_-$ if $x \in \bar{f}$. Let $C$ be a concept class over $X$. We say that $C$ is CPCN learnable if there exist a learning algorithm $L$ and a polynomial $p(\cdot,\cdot,\cdot,\cdot,\cdot)$ with the following property: for any $n$ and any $f \in C_n$, for any distribution $D$ on $X_n$, and for any $0 \leq \eta_+, \eta_- < 1/2$ and $0 < \varepsilon, \delta < 1$, if $L$ is given access to $EX_{\eta_+,\eta_-}(f,D)$ and to inputs $\varepsilon$ and $\delta$, then with probability at least $1 - \delta$, $L$ outputs a hypothesis concept $h \in C$ satisfying $D(f \Delta h) \leq \varepsilon$ in time bounded by $p(1/\varepsilon, 1/\delta, 1/\gamma, size(f), n)$ where $\gamma = \min\{1/2 - \eta_+, 1/2 - \eta_-\}$.
Many machine learning algorithms only use examples in order to estimate probabilities. This is the case for tree induction algorithms such as C4.5 [17] and CART [4]. This is also the case for highly practical Bayesian learning methods such as the naive Bayes classifier. Kearns defined the statistical query model (SQ model for short) in [12]. The SQ model is a specialization of the PAC model in which the learner forms its hypothesis solely on the basis of estimates of probabilities. A statistical query over $X_n$ is a mapping $\chi : X_n \times \{0,1\} \to \{0,1\}$ associated with a tolerance parameter $0 < \tau \leq 1$. In the SQ model, the learner is given access to a statistical oracle $STAT(f,D)$ which, at each query $(\chi,\tau)$, returns an estimate of $D(\{x \mid \chi(\langle x, f(x) \rangle) = 1\})$ within accuracy $\tau$. Let $C$ be a concept class over $X$. We say that $C$ is SQ learnable if there exist a learning algorithm $L$ and polynomials $p(\cdot,\cdot,\cdot)$, $q(\cdot,\cdot,\cdot)$ and $r(\cdot,\cdot,\cdot)$ with the following property: for any $f \in C$, for any distribution $D$ over $X$, and for any $0 < \varepsilon < 1$, if $L$ is given access to $STAT(f,D)$ and to input $\varepsilon$, then, for every query $(\chi,\tau)$ made by $L$, the predicate $\chi$ can be evaluated in time $q(1/\varepsilon, n, size(f))$ and $1/\tau$ is bounded by $r(1/\varepsilon, n, size(f))$; moreover, $L$ halts in time bounded by $p(1/\varepsilon, n, size(f))$ and $L$ outputs a hypothesis $h \in C$ satisfying $D(f \Delta h) \leq \varepsilon$.
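For concreteness, here is a minimal sketch of such a query predicate (our illustration, not from the paper; the dict-of-booleans instance encoding and the attribute name "a3" are assumptions). A query of this form asks for the kind of node statistic a decision tree induction algorithm needs:

```python
# A hypothetical statistical query predicate chi : X_n x {0,1} -> {0,1}.
# Instances are assumed to be dicts of boolean attributes (our encoding,
# not the paper's). This query asks for the probability that an instance
# satisfies attribute "a3" AND is labeled positive.

def chi(x: dict, label: int) -> int:
    """chi(<x, f(x)>) = 1 iff x["a3"] holds and the label is positive."""
    return int(x["a3"] and label == 1)
```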
We slightly modify the statistical oracle $STAT(f,D)$. Let $f$ be the target concept and let us consider a statistical query $\chi$ made by a statistical query learning algorithm $L$. The statistical oracle $STAT(f,D)$ returns an estimate $\hat{D}_\chi$ of $D_\chi = D(\{x \mid \chi(\langle x, f(x) \rangle) = 1\})$ within some given accuracy. We may write:

$$
\begin{aligned}
D_\chi &= D(\{x \mid \chi(\langle x,1\rangle)=1 \wedge f(x)=1\}) + D(\{x \mid \chi(\langle x,0\rangle)=1 \wedge f(x)=0\})\\
&= D(\{x \mid \chi(\langle x,1\rangle)=1\} \cap f) + D(\{x \mid \chi(\langle x,0\rangle)=1\} \cap \bar{f})\\
&= D(B \cap f) + D(C \cap \bar{f})
\end{aligned}
$$

where the sets $B$ and $C$ are defined by:

$$B = \{x \mid \chi(\langle x,1\rangle)=1\} \quad \text{and} \quad C = \{x \mid \chi(\langle x,0\rangle)=1\}.$$

Therefore, we henceforth suppose that the statistical oracle $STAT(f,D)$ provides estimates for probabilities $D(f \cap A)$ and $D(\bar{f} \cap A)$ within accuracy $\tau$, where $f$ is the target concept, $\bar{f}$ its complement, and $A$ any subset of the instance space for which membership is decidable in polynomial time. It should be clear to the reader that this technical modification does not change the SQ learnable classes.
It is clear that, given access to the example oracle $EX(f,D)$, it is easy to simulate the statistical oracle $STAT(f,D)$ by drawing a sufficiently large set of labeled examples. Moreover, there is a general scheme which transforms any SQ learning algorithm into a PAC learning algorithm. It is also proved in [12] that the class of parity functions is learnable in the PAC model but cannot be learned from statistical queries.
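A minimal sketch of this simulation (ours, not the paper's): Hoeffding's inequality shows that $m \geq \ln(2/\delta)/(2\tau^2)$ labeled examples suffice to estimate a query within tolerance $\tau$ with probability at least $1 - \delta$.

```python
import math

def simulate_stat(ex_oracle, chi, tau, delta):
    """Estimate D({x : chi(<x, f(x)>) = 1}) within tolerance tau with
    probability at least 1 - delta, from labeled examples.

    ex_oracle() is assumed to return a pair (x, f(x)) drawn i.i.d. from D,
    and chi is the query predicate. The sample size comes from Hoeffding's
    bound: P(|estimate - truth| >= tau) <= 2 exp(-2 m tau^2) <= delta."""
    m = math.ceil(math.log(2.0 / delta) / (2.0 * tau ** 2))
    hits = sum(chi(*ex_oracle()) for _ in range(m))
    return hits / m
```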
It has been shown by Kearns that any class learnable from statistical queries is also learnable in the presence of classification noise [12]. Following the results by Kearns, it has been proved by Decatur [5] that any class learnable from statistical queries is also learnable in the presence of constant-partition classification noise. The proof uses the hypothesis testing property: a hypothesis with small error can be selected from a set of hypotheses by selecting the one with the fewest errors on a set of CPCN corrupted examples. If we confuse, in the notations, the name of the model and the set of learnable classes, we can write the following inclusions:

$$SQ \subseteq CPCN \subseteq CN \subseteq PAC \qquad (1)$$
$$SQ \subsetneq PAC \qquad (2)$$

To our knowledge, the equivalences between the models CN and SQ or between the models CN and PAC remain open despite recent insights [2] and [10].
2.2 Learning Models from Positive and Unlabeled Data
The learning model from positive examples (POSEX for short) was first defined in [7]. The model differs from the PAC model in the following way: the learner gets information about the target function and the hidden distribution from two oracles, namely a positive example oracle $POS(f,D)$ and an instance oracle $INST(D)$, instead of an example oracle $EX(f,D)$. At each request by the learner, the instance oracle $INST(D)$ returns an element of the instance space $X$, i.e. an unlabeled example, according to the hidden distribution $D$. At each request by the learner, the positive example oracle $POS(f,D)$ returns a positive example according to the induced distribution $D_f$. We have the following result:
Proposition 1 [7] Any class learnable in the CPCN model is learnable in the POSEX model.
PROOF. The proof is simple and, as it may help to understand the proof of the main algorithm of the present paper, we sketch it below.

Let $C$ be a CPCN learnable concept class, let $L$ be a learning algorithm for $C$ in the CPCN model, let $f$ be the target concept, let $D$ be a distribution over the instance space, and let us suppose that $D(f) \neq 0$. We must show how $L$ can be used to learn from the oracles $POS(f,D)$ and $INST(D)$.

Run $L$. At each call of the noisy oracle:

- with probability 2/3, call $POS(f,D)$ and keep the positive label;
- with probability 1/3, call $INST(D)$ and label the example as negative.

It can easily be shown that this is strictly equivalent to calling the noisy oracle $EX_{\eta_+,\eta_-}(f,D')$ where:

$$
D'(x) =
\begin{cases}
D(x)/3 & \text{if } f(x) = 0\\
(D(x) + 2 D_f(x))/3 & \text{if } f(x) = 1
\end{cases}
\qquad
\eta_+ = \frac{D(f)}{2 + D(f)}, \qquad \eta_- = 0
$$

Note that $\eta_+ \leq 1/3 < 1/2$. And as, for any subset $A$ of the instance space, we have $D(A) \leq 3 D'(A)$, it is sufficient to run the algorithm $L$ with input accuracy $\varepsilon/3$ and input confidence $\delta$ to output, with confidence greater than $1 - \delta$, a hypothesis whose error rate is less than $\varepsilon$.
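The oracle transformation in this proof is purely mechanical; the following minimal sketch (our rendering; pos_oracle and inst_oracle stand in for $POS(f,D)$ and $INST(D)$) makes it explicit:

```python
import random

def make_noisy_ex(pos_oracle, inst_oracle):
    """Simulate the CPCN oracle EX_{eta+,eta-}(f, D') of the proof from
    POS(f, D) and INST(D).

    pos_oracle()  -> an instance drawn from D_f (always truly positive)
    inst_oracle() -> an instance drawn from D (unlabeled)"""
    def noisy_ex():
        if random.random() < 2.0 / 3.0:
            # keep the positive label; this branch never mislabels
            return pos_oracle(), 1
        # label the unlabeled example as negative; this mislabels exactly
        # the positive instances drawn here, giving eta+ = D(f)/(2 + D(f))
        return inst_oracle(), 0
    return noisy_ex
```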
The learning model from positive queries (POSQ for short) was also defined in [7]. In the POSQ model, there are a positive statistical oracle $PSTAT(f,D)$, which provides estimates for probabilities $D_f(A)$ for any subset $A$ of the instance space within a given tolerance, and an instance statistical oracle $ISTAT(D)$, which provides estimates for probabilities $D(A)$ for any subset $A$ of the instance space within a given tolerance. The definition of a POSQ learnable class is similar to the definition of an SQ learnable class: the oracle $STAT(f,D)$ is replaced by the two oracles $PSTAT(f,D)$ and $ISTAT(D)$. The POSQ model is weaker than the SQ model, as there is no direct way to obtain an estimate of the weight $D(f)$ of the target concept. However, if we can get such an estimate, both models become equivalent. Indeed, statistical queries can be estimated from positive statistical queries and instance statistical queries as soon as the weight of the target concept is known, because of the following equations:

$$
\hat{D}(f \cap A) = \hat{D}_f(A) \cdot \hat{D}(f), \qquad
\hat{D}(\bar{f} \cap A) = \hat{D}(A) - \hat{D}(f \cap A) \qquad (3)
$$

So, any class learnable in the SQ model is learnable in the POSQ model as soon as the learner is given access to the weight of the target concept or can compute it from the positive statistical oracle and the instance statistical oracle. This is formalized in the following result:

Proposition 2 [7] Let C be a concept class such that the weight of any target concept can be estimated in polynomial time within any given tolerance. If C is SQ learnable then C is POSQ learnable.
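As a concrete rendering of equations (3), here is a minimal sketch (ours, not the paper's algorithm PL) of answering the modified statistical queries $D(f \cap A)$ and $D(\bar{f} \cap A)$ from the two POSQ oracles, given an estimate weight_f of $D(f)$. In a full analysis, the tolerances passed to the oracles must be tightened as a function of the weight, which is where a lower bound on $D(f)$ becomes essential.

```python
def d_f_and(pstat, weight_f, a, tau):
    """Estimate D(f ∩ A) = D_f(A) * D(f)  -- first equation of (3).

    pstat(a, tau) is assumed to estimate D_f(A) within tolerance tau,
    weight_f is an estimate of D(f), and a is a membership predicate
    (in the model, any polynomially decidable subset A)."""
    return pstat(a, tau) * weight_f

def d_fbar_and(pstat, istat, weight_f, a, tau):
    """Estimate D(f̄ ∩ A) = D(A) - D(f ∩ A)  -- second equation of (3).

    istat(a, tau) is assumed to estimate D(A) within tolerance tau."""
    return istat(a, tau) - d_f_and(pstat, weight_f, a, tau)

def answer_stat_query(pstat, istat, weight_f, b, c, tau):
    """Answer the modified statistical query D_chi = D(B ∩ f) + D(C ∩ f̄),
    with b and c the membership predicates of the sets B and C."""
    return d_f_and(pstat, weight_f, b, tau) + d_fbar_and(pstat, istat, weight_f, c, tau)
```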
We can summarize all the results with the following inclusions:

$$POSQ \subseteq SQ \subseteq CPCN \subseteq POSEX \subseteq PAC \qquad (4)$$
$$CPCN \subseteq CN \subseteq PAC \qquad (5)$$
$$SQ \subsetneq POSEX \qquad (6)$$

The strict inclusion between SQ and POSEX holds because the class of parity functions is POSEX learnable but not SQ learnable. Equivalences between POSQ and SQ and between POSEX and PAC remain open.
3 Learning Algorithms from Positive and Unlabeled Examples
We have already noticed that, in practical Data Mining and Text Mining situations, statistical query-like algorithms, such as C4.5 or naive Bayes, are widely used. It is straightforward to see how a statistical query can be evaluated from labeled data. In a similar way, positive and instance statistical queries can easily be evaluated from positive and unlabeled data. So, in order to adapt classical learning algorithms to positive and unlabeled examples, we can show how SQ learning algorithms can be modified into POSQ learning algorithms.

In [6], we have studied the case where the weight of the target concept is either given by an oracle or evaluated from a small set of labeled examples. In this