Impact on the planning of two active learning strategies

V. Lemaire, A. Bondu, and F. Clérot

France Télécom R&D Lannion, TECH/EASY/TSI
http://perso.rd.francetelecom.fr/lemaire
vincent.lemaire@orange-ftgroup.com
Abstract. Active machine learning algorithms are used when large numbers of unlabelled examples are available and getting labels for them is costly (e.g. requires a human expert). Active learning methods select examples to build a training set for a predictive model and aim at the most informative examples. The number of examples to be labelled at each iteration of the active strategy is, most often, randomly chosen or fixed to one. However, in practical situations this number is a parameter which influences the performance of the active strategy. This paper studies the influence of this parameter on two active learning strategies.
1 Introduction
Machine learning consists of methods and algorithms which teach behaviour to a predictive model using training examples. Passive learning strategies use examples which are randomly chosen. Active learning strategies allow the predictive model to construct its training set in interaction with a human expert. The learning starts with few labelled examples. Then the model selects the examples with no label which it considers the most informative, and asks the human expert for their desired associated outputs. The model learns faster using active learning strategies, reaching the best performance using less data. Active learning is especially attractive for applications in which data is expensive to obtain or to label.

Active learning strategies are also useful on new problems, for instance classification problems where the informative examples or informative data are unknown. The question is how to obtain the information required to solve this new problem. An operational planning of an active algorithm applied to a new classification problem can be defined as the addition of individual costs, one per step, which allow the algorithm to gather the information needed to solve this new problem:
• (I) an initialisation: which, how and how many labels have to be bought at the beginning (before the first learning);
• (PP) a pre-partition [1];
• (PS) a pre-selection [2];
• (D) a diversification [3];
• (B) the purchase of N example(s) (customarily N = 1);
• (E) the iteration evaluation [4];
• (M) the model used.

Planning the purchase of new examples (by packages) is a compromise (C) between these different steps, which include the dilemma between exploration [5] and exploitation [6], such that:

C = EW (α₁ I + α₂ PP + α₃ PS + α₄ D + α₅ B + α₆ E + α₇ M)
where EW is the evaluation of the overall procedure. The quality of an active strategy is usually represented by a curve assessing the performance of the model versus the number of training examples labelled.
This approach can be used for the conception of an automatic shunting system (for phone servers) which takes into account emotions in speech [7] (our new problem). In this case, the data is composed of speech turns which are exchanged between users and the machine. Each piece of data has to be listened to by a human expert to be labelled as containing (or not) negative emotions. The purpose of the active strategies considered in this article is to select the most informative unlabelled examples. These approaches minimize the labelling cost induced by the training of a predictive model. For this application our corpus contains more than 100 000 speech turns; the operational planning is therefore very important.
Two main active learning strategies are used in the literature (see section 2). We suspect that such active learners are good for exploitation (labelling examples near the boundary to refine it), but that they do not conduct exploration (searching for large areas in the instance space that they would incorrectly classify); they may even be worse than random sampling when labels are bought by packet. One way to examine the "exploration" behaviour of these two main strategies is to buy more than one label at every iteration (the weight α₅ above); this is the purpose of this paper.
2 Active Learning

2.1 Notations
M ∈ 𝕄 is the predictive model which is trained using an algorithm L. X ⊆ ℝⁿ represents all the possible input examples of the model and x ∈ X is a particular example. Y is the set of the possible outputs of the model; y ∈ Y is a class label related to x ∈ X. During its training, the model observes only one part Φ ⊆ X of the universe. The set of examples is limited and the associated labels are not necessarily known. The set of labelled examples (available to the training algorithm) is called L_x and the set of examples for which the labels are unknown is called U_x, with Φ = U_x ∪ L_x and U_x ∩ L_x ≡ ∅.

The concept which is learned can be seen as a function, f : X → Y, where f(x₁) is the desired answer of the model for the example x₁, and f̂ : X → Y is the answer obtained from the model; an estimation of the concept. The elements of L_x and the associated labels constitute a training set T. The training examples are pairs of input vectors and desired labels (x, f(x)), such that: ∀x ∈ L_x, ∃(x, f(x)) ∈ T.

2.2 Active Learning Methods
Introduction  The point of view of selective sampling [8] is adopted in this article. The model observes only one restricted part of the universe Φ ⊆ X, which is materialized by training examples without labels. The image of a bag containing instances, for which the model can ask the associated labels, is usually used to describe this approach.
Considering:
• M a predictive model provided with a training algorithm L;
• U_x and L_x the sets of examples respectively not labelled and labelled;
• n the desired number of training examples;
• T the training set, with ||T|| < n;
• U : X × M → ℝ the function which estimates the utility of an example for the training of the model.

Repeat
  (A) Train the model M using L and T (and possibly U_x).
  (B) Find the example such that q = argmax_{u ∈ U_x} U(u, M).
  (C) Withdraw q from U_x and ask the label f(q) from the expert.
  (D) Add q to L_x and add (q, f(q)) to T.
until ||T|| = n

Algorithm 1: Selective sampling, Muslea 2002
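As a concrete illustration, the loop of Algorithm 1 can be sketched in a few lines of Python. The `train`, `utility` and `oracle` callbacks, the threshold concept and all numeric values below are hypothetical toy choices, not part of the original experiments; the sketch only shows how steps (A)-(D) chain together.

```python
def selective_sampling(train, utility, oracle, pool, seed_labelled, n):
    # Generic selective sampling loop (Algorithm 1):
    #   train(T)      -> model M fitted on the labelled pairs T
    #   utility(u, M) -> estimated usefulness of unlabelled example u for M
    #   oracle(x)     -> the expert's label f(x)
    U_x = list(pool)
    T = list(seed_labelled)
    while len(T) < n and U_x:
        M = train(T)                               # (A) train on T
        q = max(U_x, key=lambda u: utility(u, M))  # (B) most useful example
        U_x.remove(q)                              # (C) withdraw q, query expert
        T.append((q, oracle(q)))                   # (D) add (q, f(q)) to T
    return train(T), T

# Toy usage (hypothetical setup): learn a threshold concept on [0, 1); the
# "model" is the midpoint between the largest known negative example and the
# smallest known positive one.
def train(T):
    pos = [x for x, y in T if y == 1]
    neg = [x for x, y in T if y == 0]
    return (min(pos) + max(neg)) / 2.0

oracle = lambda x: int(x >= 0.37)        # hidden concept f
pool = [i / 100.0 for i in range(100)]
seed = [(0.0, 0), (0.99, 1)]
uncertainty = lambda u, M: -abs(u - M)   # most useful = closest to the boundary
model, T = selective_sampling(train, uncertainty, oracle, pool, seed, n=10)
```

With this uncertainty-style utility the loop behaves like a bisection of the decision boundary: after a handful of queries the learned threshold is close to the hidden one, whereas a random draw of ten examples would typically leave it much coarser.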
The problem of selective sampling was posed formally by Muslea [9] (see Algorithm 1). It uses a utility function, Utility(u, M), which estimates the utility of an example u for the training of the model M. Using this function, the model selects the examples for which it hopes for the greatest improvement of its performance, and shows these examples to the expert.

Algorithm 1 is generic insofar as only the function Utility(u, M) must be modified to express a particular active learning strategy.

Sampling by uncertainty  This strategy is based on the confidence that the model has in its predictions. The model must be able to produce an output and to estimate the relevance of its answers. The model estimates the probability of observing each class, given an instance x ∈ X. This estimate is done by selecting the class which maximizes P̂(y_j | x) (with y_j ∈ Y) among all possible classes. The weaker the probability to observe the predicted class, the more the prediction is considered uncertain. This strategy of active learning selects the unlabelled examples which maximize the uncertainty of the model. The uncertainty can be expressed as follows:

Uncertain(x) = 1 − max_{y_j ∈ Y} P̂(y_j | x),   x ∈ X
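A minimal sketch of this selection rule, assuming a hypothetical `predict_proba` callback that returns the class probabilities P̂(y_j | x) for an instance:

```python
def most_uncertain(pool, predict_proba):
    # Uncertain(x) = 1 - max_{y_j} P(y_j | x): pick the unlabelled example
    # the model is least confident about
    return max(pool, key=lambda x: 1.0 - max(predict_proba(x)))

# Toy two-class scorer (hypothetical): P(y_1 | x) = x, P(y_2 | x) = 1 - x,
# so uncertainty peaks where the two probabilities meet, at P = 0.5
pool = [0.1, 0.45, 0.8]
proba = lambda x: [x, 1.0 - x]
chosen = most_uncertain(pool, proba)
```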
Sampling by risk reduction  The purpose of this approach is to reduce the generalization error, E(M), of the model [11]. It chooses the examples to be labelled so as to minimize this error. In practice this error cannot be calculated, because the distribution of instances in X is unknown. Nicholas Roy [11] shows how to bring this strategy into play even though all the elements of X are not known. He uses a uniform prior for P(x), which gives:

E(M̂_t) = (1 / |L|) Σ_{i=1}^{|L|} Loss(M_t, x_i)

In this article, one estimates the generalization error E(M) using the empirical risk [12], given by:

Ê(M) = R̂(M) = Σ_{i=1}^{|L|} Σ_{y_j ∈ Y} 1_{f(x_i) ≠ y_j} P(y_j | x_i) P(x_i)

where f is the model which estimates the probability that an example belongs to a class, P(y_j | x_i) is the real probability to observe the class y_j for the example x_i ∈ L, and 1 is the indicator function, equal to 1 if f(x_i) ≠ y_j and equal to 0 otherwise. Therefore R̂(M) is the sum of the probabilities that the model makes a bad decision on the training set (L). Using a uniform prior to estimate P(x_i), one can write:

R̂(M) = (1 / |L|) Σ_{i=1}^{|L|} Σ_{y_j ∈ Y} 1_{f(x_i) ≠ y_j} P̂(y_j | x_i)
In order to select examples, the model is re-trained several times, considering one more fictive example each time. Each instance x ∈ U and each label y_j ∈ Y can be associated to constitute this supplementary example. The expected cost for any single example x ∈ U which is added to the training set is then:

R̂(M_{+x}) = Σ_{y_j ∈ Y} P̂(y_j | x) R̂(M_{+(x, y_j)}),   with x ∈ U

The reader can find a third main strategy, based on Query by Committee [13], and a fourth one where the authors focus on a model approach to active learning in a version space of concepts [14, 15].
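The expected cost above can be sketched as follows. The `retrain` and `risk` callbacks and the toy numbers are hypothetical stand-ins for the actual re-training and empirical-risk computation; only the weighting of R̂(M₊₍ₓ,yⱼ₎) by P̂(y_j | x) is the point.

```python
def expected_risk(x, proba, retrain, risk, labels):
    # R̂(M_{+x}) = sum_j P̂(y_j|x) * R̂(M_{+(x,y_j)}): retrain once per candidate
    # label y_j and weight the resulting risk by the current belief P̂(y_j|x)
    return sum(p * risk(retrain(x, y)) for p, y in zip(proba(x), labels))

def select_by_risk_reduction(pool, proba, retrain, risk, labels):
    # Choose the unlabelled example whose addition minimizes the expected risk
    return min(pool, key=lambda x: expected_risk(x, proba, retrain, risk, labels))

# Hypothetical toy: two classes with P̂ = [1-x, x]; pretend the empirical risk
# of the retrained model shrinks as x approaches the decision boundary 0.5.
labels = [0, 1]
proba = lambda x: [1.0 - x, x]
retrain = lambda x, y: x            # the "model" stands in for its training set
risk = lambda m: abs(m - 0.5)       # stand-in for the empirical risk R̂
pool = [0.1, 0.45, 0.8]
best = select_by_risk_reduction(pool, proba, retrain, risk, labels)
```

Note the cost of this strategy in practice: one retraining per candidate example and per candidate label, which is precisely why the per-iteration batch size n matters for computation time.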
3 Number of labelled examples at every iteration
In practice, the number of examples labelled at every iteration (noted n) is chosen in an arbitrary way. Nevertheless, this parameter influences the implementation of an active learning strategy. To understand the stakes of this problem, let us consider the two extreme situations. On the one hand, labelling a single example at each iteration, the computation time necessary for the selection of examples "explodes". In this case, the application of active learning strategies to large databases becomes problematic: the waiting time before presenting an example to the human expert becomes unreasonably long. On the other hand, labelling a large number of examples at every iteration, the contribution of an active learning strategy decreases. The regulation of the parameter n can be seen, in an intuitive way, as the search for a compromise between the computation time and the efficiency of an active learning strategy.
Since the purpose here is to measure the influence of the value of n, the experiments were carried out on several classification problems, using the same model and the two strategies defined in the previous section.
3.1 Evaluation criteria

The criterion which is used to estimate model performances is the area under the ROC curve [16] (AUC). ROC curves are usually built considering a single class. Consequently, one handles as many ROC curves as there are classes. To build ROC curves in an m-class problem, one considers a meta-class Y₁ = y_i (which is the target); the other classes constitute the second meta-class Y₂ = ∪_{j=1, j≠i}^{m} y_j. The AUC is calculated for each ROC curve, and the global performance of the model is estimated by the mathematical expected value of the AUC over all classes:

AUC_global = Σ_{i=1}^{|Y|} P(y_i) · AUC(y_i)        (1)

The AUC can be seen as a proportion of the space in which ROC curves are defined. This area is equal to 1 if the model is perfect and equal to 1/2 for random models. The AUC has interesting statistical properties: it corresponds to the probability that the model attributes a more important score to an instance belonging to the good class than to an instance of another class [16].
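Equation (1) can be sketched as follows, using the rank-statistic view of the AUC quoted above; the per-class score function and the data below are hypothetical.

```python
def auc(pos_scores, neg_scores):
    # AUC as a rank statistic: the probability that a positive instance
    # receives a higher score than a negative one (ties count 1/2)
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos_scores for q in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def auc_global(data, score, classes):
    # Eq. (1): prior-weighted one-vs-rest AUC over all classes
    total = 0.0
    for c in classes:
        pos = [score(x, c) for x, y in data if y == c]
        neg = [score(x, c) for x, y in data if y != c]
        total += (len(pos) / len(data)) * auc(pos, neg)
    return total

# Hypothetical scorer that perfectly separates the two classes
data = [(0.1, 'a'), (0.2, 'a'), (0.8, 'b'), (0.9, 'b')]
score = lambda x, c: x if c == 'b' else -x
g = auc_global(data, score, ['a', 'b'])
```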
3.2 Protocol
Beforehand, the data is normalized using mean and variance. At the beginning of each experiment, the initial training examples are randomly chosen among the available data. At each iteration, n examples are drawn from the dataset to be labelled and added to the training set. The first series of experiments adds 1 example at each iteration using an active strategy. Then the other series of experiments are repeated, increasing each time the number of added examples, i.e. the quantity of information brought to the model (n = 1, 4, 8, 16).
The classifier is a Parzen window which uses a Gaussian kernel (σ, the parameter of the kernel, is adjusted using a cross-validation as in [17]). Each experiment has been done ten times in order to obtain an average and a variance for every point of the result curves.
3.3 Used model
The large range of models which are able to solve classification problems, and sometimes the great number of parameters needed to use them, may make it difficult to measure the contribution of a learning strategy.

A Parzen window with a Gaussian kernel [17] is used in the experiments below, since this predictive model uses a single parameter and is able to work with few examples. The output of this model is an estimate of the probability to observe the label y_j conditionally to the instance u:

P̂(y_j | u) = [ Σ_{n=1}^{N} 1_{f(l_n) = y_j} K(u, l_n) ] / [ Σ_{n=1}^{N} K(u, l_n) ],   with l_n ∈ L_x and u ∈ U_x ∪ L_x        (2)

where K(u, l_n) = exp(−‖u − l_n‖² / (2σ²)).
First, a Parzen window was built using all the training examples, to check whether this model is able to solve the problem. For the three databases the answer was positive (a good AUC value was obtained). Consequently, the Parzen window is considered satisfactory and valid for the following active learning procedures with regard to the influence of n.

The optimal value of the kernel parameter was found using a cross-validation on the average quadratic error, using all the available training data [17]. Thereafter, this value is used to fix the Parzen window parameter. Since the single parameter of the Parzen window is fixed, the training stage is reduced to counting instances (within the support of the Gaussian kernel). The strategies of example selection are thus comparable, without being influenced by the training of the model.
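Equation (2) translates directly into code. The toy labelled set L_x and the value of σ below are hypothetical; the sketch shows that, with σ fixed, "training" is indeed nothing more than storing and weighting the labelled instances.

```python
import math

def parzen_proba(u, labelled, classes, sigma):
    # Eq. (2): P(y_j|u) = sum_n 1{f(l_n)=y_j} K(u,l_n) / sum_n K(u,l_n)
    # with the Gaussian kernel K(u,l) = exp(-||u - l||^2 / (2 sigma^2))
    K = lambda a, b: math.exp(-sum((ai - bi) ** 2 for ai, bi in zip(a, b))
                              / (2.0 * sigma ** 2))
    w = [(K(u, x), y) for x, y in labelled]
    z = sum(k for k, _ in w)
    return {c: sum(k for k, y in w if y == c) / z for c in classes}

# Toy labelled set L_x (hypothetical data): two classes in the plane
L_x = [((0.0, 0.0), 'neg'), ((0.1, 0.0), 'neg'), ((1.0, 1.0), 'pos')]
p = parzen_proba((0.05, 0.0), L_x, ['neg', 'pos'], sigma=0.5)
```

A query point surrounded by 'neg' examples receives most of the kernel mass from that class, so p['neg'] is close to 1 while the two probabilities still sum to 1.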
4 Experiments

4.1 Databases

Three public data sets which come from the "UCI repository" (http://www. [...]) are used:

Glass Identification Database: the classification of types of glass in terms of their oxide content (i.e. Na, Fe, K, etc.). All attributes are numeric-valued. This data set includes 214 instances (Train: 146, Test: 68) characterized by 9 attributes which are continuously valued. The 6 classes are the types of glass. The Parzen window classifies an example into one of these 6 classes.

Iris Plant Database: the data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. This data set includes 150 flowers (Train: 90, Test: 60) described by 4 attributes which are continuously valued. The Parzen window classifies an example into one of these 3 classes.

Image Segmentation Database: the instances were drawn randomly from a database of 7 outdoor images. The images were hand-segmented to create a classification for every pixel. This data set includes 2310 instances (Train: 310, Test: 2000) characterized by 9 pixel-based attributes. The Parzen window classifies an example into one of these 7 classes.

For the three data sets, which contain more than two classes, the performances are evaluated using equation 1.
4.2 Results
Figures 1, 2 and 3 show the results obtained on the three data sets. On every figure: (i) from top to bottom and left to right: 1, 4, 8 or 16 examples are added at each iteration of the active algorithm; (ii) on each sub-figure the horizontal and vertical axes represent respectively the number of labelled examples used and the AUC (see section 3.1). On each curve, test results using sampling based on uncertainty, sampling based on risk reduction and random sampling are plotted versus the number of labelled examples in the training set. The notches represent the variance of the results (±2σ).

The results on the AUC show that, on these three data sets, it is difficult to point to a single best strategy. If we consider adding:

• one example at every iteration: the uncertainty strategy wins on Glass but the risk strategy wins on the other data sets;
• four examples at every iteration: the uncertainty strategy wins on Glass but the risk strategy and the random strategy share the success on the other data sets;
• eight or sixteen examples at every iteration: the random strategy wins on the three data sets.
By increasing the number of examples labelled at each iteration, the active strategies become less and less competitive compared to the random strategy. We notice each time that: (i) the results do not look so different for different batch sizes (but the active strategies allow the optimal AUC to be obtained with a smaller number of examples); (ii) the random strategy becomes more powerful than the two active strategies when n becomes large, particularly for n ≥ 8.

Fig. 1. Results on the data set Glass [four sub-figures, one per batch size n = 1, 4, 8, 16: AUC (0.5-0.9) versus number of labelled examples (0-140); curves: Random, Uncertainty, Risk]

Fig. 2. Results on the data set Iris [four sub-figures, one per batch size n = 1, 4, 8, 16: AUC (0.5-1.0) versus number of labelled examples (0-80); curves: Random, Uncertainty, Risk]

Fig. 3. Results on the data set Segment [four sub-figures, one per batch size n = 1, 4, 8, 16: AUC (0.86-0.98) versus number of labelled examples (0-300); curves: Random, Uncertainty, Risk]
5 Conclusion and future works

The obtained results show that the number of examples labelled at each iteration of an active learning procedure influences the quality of the resulting model. The experiments which were carried out confirm the intuition that the contribution of an active strategy (relatively to the random strategy) decreases when one increases the number of labelled examples per iteration. Methods which buy the same number (n > 1) of labels at each iteration exist [18, 3] and try to incorporate a part of exploration. But to our knowledge none optimizes the value of n at each iteration (choosing a variable number of examples and/or selecting "packages" of examples in an optimal way). We are currently interested in this subject: for example through the concept of the "trajectory" of a model in the space of the decisions it has to take during its training.
The elaboration of an evaluation criterion (EW), which measures the contribution of a strategy compared to the random strategy on the whole data set, would be interesting: the performance criterion used can take several different forms according to the problem. The usual type of curve only allows comparisons between strategies in a punctual way, i.e. for one point on the curve (a given number of training examples). If two curves cross each other, it is very difficult to determine whether one strategy is better than another (on the total set of training examples). This point will be discussed in a future paper.
Finally, we note that the maximal number of examples to be labelled, or a [...] of a good criterion, should be independent of the model and of a test set; this is another direction for future work.
References

1. Nguyen, H.T., Smeulders, A.: Active learning using pre-clustering. In: International Conference on Machine Learning (ICML). (2003)
2. Gosselin, P.H., Cord, M.: Active learning techniques for user interactive systems: application to image retrieval. In: International Workshop on Machine Learning for MultiMedia (in conjunction with ICML). (2005)
3. Brinker, K.: Incorporating diversity in active learning with support vector machines. In: International Conference on Machine Learning (ICML). (2003) 59-66
4. Culver, M., Kun, D., Scott, S.: Active learning to maximize area under the ROC curve. In: International Conference on Data Mining (ICDM). (2006)
5. Thrun, S.: Exploration in active learning. In: to appear in: Handbook of Brain Science and Neural Networks. Michael Arbib (2007)
6. Osugi, T., Kun, D., Scott, S.: Balancing exploration and exploitation: A new algorithm for active machine learning. In: International Conference on Data Mining (ICDM). (2005)
7. Bondu, A., Lemaire, V., Poulain, B.: Active learning strategies: a case study for detection of emotions in speech. In: Industrial Conference on Data Mining (ICDM), Leipzig (2007)
8. Castro, R., Willett, R., Nowak, R.: Faster rates in regression via active learning. In: Neural Information Processing Systems (NIPS), Vancouver (2005)
9. Muslea, I.: Active Learning With Multiple Views. PhD thesis, University of Southern California (2002)
10. Thrun, S.B., Möller, K.: Active exploration in dynamic environments. In Moody, J.E., Hanson, S.J., Lippmann, R.P., eds.: Advances in Neural Information Processing Systems. Volume 4., Morgan Kaufmann Publishers, Inc. (1992) 531-538
11. Roy, N., McCallum, A.: Toward optimal active learning through sampling estimation of error reduction. In: International Conference on Machine Learning (ICML), Morgan Kaufmann, San Francisco, CA (2001) 441-448
12. Zhu, X., Lafferty, J., Ghahramani, Z.: Combining active learning and semi-supervised learning using gaussian fields and harmonic functions. In: International Conference on Machine Learning (ICML), Washington (2003)
13. Freund, Y., Seung, H.S., Shamir, E., Tishby, N.: Selective sampling using the query by committee algorithm. Machine Learning 28(2-3) (1997) 133-168
14. Dasgupta, S.: Analysis of a greedy active learning strategy. In: Neural Information Processing Systems (NIPS), San Diego (2005)
15. Cohn, D.A., Atlas, L., Ladner, R.E.: Improving generalization with active learning. Machine Learning 15(2) (1994) 201-221
16. Fawcett, T.: ROC graphs: Notes and practical considerations for data mining researchers. Technical Report HPL-2003-4, HP Labs (2003)
17. Chapelle, O.: Active learning for Parzen window classifier. In: AI & Statistics, Barbados (2005) 49-56
18. Lindenbaum, M., Markovitch, S., Rusakov, D.: Selective sampling for nearest neighbor classifiers. Machine Learning (2004)