RomariBesançon
1
,StéphaneChaudiron
2
,DjamelMostefa
3
,OlivierHamon
3,4
,
IsmaïlTimimi
2
,KhalidChoukri
3
1
CEA LIST
2
UniversitédeLille3-GERiiCO
18,routedupanorama Domaineuniv. duPontdeBois
BP692265 BP60149-59653
FontenayauxRosesFrane Villeneuved'AsqedexFrane
3
ELDA
4
LIPN-UniversitéParis13&CNRS
55-57rueBrillatSavarin 99avenueJ.-B.Clément
75013ParisFrane 93430VilletaneuseFrane
romari.besanonea.fr, stephane.haudironuniv-lille3.fr, mostefaelda.org,
hamonelda.org, ismail.timimiuniv-lille3.fr, houkrielda.org
Abstrat
TheINFILEampaignhasbeenrunforthersttimeasapilottrakinCLEF2008. Its
purposeistheevaluationofross-languageadaptivelteringsystems. Itusesaorpus
of 300,000 newswires from Agene Frane Presse (AFP) in three languages: Arabi,
EnglishandFrenh,andasetof50topisingeneralandspeidomain(sientiand
tehnologialinformation). Duetodelaysintheorganizationofthetask,theampaign
onlyhad3submissions(fromonepartiipant)whiharepresentedinthis artile.
Categories and Subjet Desriptors
H.3[InformationStorage and Retrieval℄: H.3.1ContentAnalysisand Indexing;H.3.3Infor-
mationSearhandRetrieval;H.3.4SystemsandSoftware
General Terms
Measurement,Performane,Experimentation,Algorithms
Keywords
InformationFiltering, CompetitiveIntelligene
1 Introdution
ThepurposeoftheINFILE(INformationFILteringEvaluation)evaluationampaign 1
istoevalu-
ateross-languageadaptivelteringsystems,i.e. theabilityofautomatedsystemstosuessfully
separaterelevantandnon-relevantdoumentsin aninomingstreamoftextual informationwith
respettoagivenprole. Thedoumentandprolearepossiblywrittenin dierentlanguages.
TheINFILE ampaignisapilottrakin CLEF2008ampaigns andisfunded bytheFrenh
NationalResearhAgeny(ANR)ando-organizedbytheCEALIST,ELDAandtheUniversity
ofLille3-GERiiCO.
1
ming). In the INFILE ampaign, we onsider the ontext of ompetitive intelligene: in this
ontext,theevaluationprotooloftheampaignhasbeendesignedwithapartiularattentionto
theontext ofuseof lteringsystemsby realprofessionalusers. Eveniftheampaign ismainly
atehnologialorientedevaluation proess,weadapted theprotoolandthemetris,asloseas
possible,tohowanormaluserwouldproeed,inludingthroughsomeinterationandadaptation
ofhissystem.
TheINFILEampaignanmainlybeseenasaross-lingualpursuitoftheTREC2002Adaptive
Filteringtask[RobertsonandSoboro,2002℄(adaptivelteringtrakhasbeenrunfrom2000to
2002), withapartiular interestin theorrespondene of theprotoolwith theground truthof
ompetitive intelligene (CI) professionals. In this goal, we asked CI professionals to write the
topisaordingtotheirexperieneinthedomain.
OtherrelatedampaignsaretheTopiDetetionandTraking(TDT)ampaignsfrom1998to
2004[FisusandWheatley,2004℄. However,in theTDT ampaigns,fous wasmainly ontopis
dened as "events", with a ne granularity level, and often temporally restrited, whereas in
INFILE(similartoTREC2002)topisareoflong-terminterestandsupposedtobestable,whih
an indue dierent tehniques, even if some studies show that some models an be eiently
trainedto havegoodperformaneonbothtasks[Yang etal.,2005℄.
2 Desription of the task
ThemainfeaturesoftheINFILEevaluationampaignaresummarizedhere:
•
Crosslingual: English,FrenhandArabiareonernedbytheproessbutpartiipantsmay beevaluated onmonoorbilingualruns.•
AnewswireorpusprovidedbytheAgeneFranePresse(AFP)andoveringreentyears.•
Thetopisetisomposedoftwodierentkindsofproles,oneonerninggeneralnewsandevents,andaseond oneonsientiandtehnologialsubjets.
•
The evaluation is performed using an automati interative proess for the partiipating systemstogetdoumentsandlterthem,withasimulateduserfeedbak.•
Systemsareallowedto usethefeedbakatanytimetoinreaseperformane.•
Systemsprovideabooleandeisionforeahdoumentaordingtoeahprole.•
Relevanejudgmentsareperformedbyhumanassessors.•
Partiipants are asked to ll a form to speify the languagesused, the elds used in the prolesandasummary ofthetehnologyused.Weusedanautomatiproessforthesubmissionprotool. Indeed,theprotooloftheINFILE
ampaignisdesignedtobearealisttaskforalteringsystem. Inpartiular,theideaistoavoid
makingthewholeorpusavailabletothepartiipantsbeforetheampaign,buttomakeitavailable
onedoumentatatime,simulatingthebehaviorofthenewswireservie. Theprotoolthenfores
partiipatingsystemsto beevaluatedinaone-passtest.
Theprotoolisinterativeandevaluationworksasfollows:
•
adoumentserverisstartedatthebeginningoftheampaign,initializedwiththedoument olletion: doumentsareretrievedfromthisserverandlteringresultsaresentbakbythepartiipantstotheserver;
•
the partiipant systems ommuniate with this server using a web servie protool (webservieshavebeenhosentobeabletobypasspossibleorporaterewallsofthepartiipants):
partiipantwantstosubmitseveralruns,thesystemmustonnetseveraltimestoget
dierentrunidentiers;
2. thesystemretrievesonedoument;
3. the system lters the doument, i.e. it assoiates the doument with one orseveral
proles,ordisardit;
4. foradaptivesystems,arelevanefeedbakanbeprovidedforltereddouments;
5. the system an retrievea new doument (bakto step2) that an only be retrieved
whenthepreviousdoumenthasbeenltered;
A simulated relevane feedbak is provided for adaptive systems: the idea is again to have a
simulation of a realist behavior of the CI professional. In a real proess, the CI professional
reeivesthe douments found relevantto aprole in aorresponding mailbox or diretoryand
he an read the doument and deide to remove it if it was a ltering error. In the INFILE
automatedproess,it isalsotheonlyfeedbakauthorized: relevanefeedbakanonlybeasked
onadoumentassoiatedwithaprolebythesystem,thereisnorelevanefeedbakondisarded
douments.
Furthermore,weassumethataCIprofessionalwouldnothaveaninnitepatiene: thenumber
offeedbaksisthenlimitedto50,fromtheadvietakenfromCIprofessionals. Thistendstogive
moreinteresttosystemswithquikadaptivity,thantosystemsthatneedsalargeamountofdata
tobetrained,but itseemedrightfortheorganizerstoput systemsin atheontextofarealisti
task.
Adry runhasbeenorganizedfrom June 26th to July3rd to hekthe tehnialviabilityof
theprotool. TheoialampaignhasbeenrunfromJuly7thtoJuly26th.
3 Test olletions
3.1 The topis
Asetof50proleshasbeenpreparedoveringtwodierentategories. Therstgroup(30topis)
dealswithgeneralnewsandeventsonerningnationalandinternationalaairs,sports,politis,
et. The seond one (20 topis) deals with sienti and tehnologial subjets. The sienti
topisweredevelopedbyompetitiveintelligeneprofessionalsfromINIST 2
,ARISTNordPasde
Calais 3
,Digiport 4
andOTOResearh 5
. Thetopis weredeveloppedinbothEnglishandFrenh.
TheArabiversionhasbeentranslatedfromFrenhbynativespeakers.
Topisaredened withthefollowingstruture:
•
auniqueidentier;•
atitle(6 wordsmax.) desribingthetopi inafewwords;•
adesription(20wordsmax.) orrespondingto asentene-longdesription;•
anarrative(60wordsmax.) orrespondingto thedesriptionofwhatshould beonsidered arelevantdoumentandpossiblywhatshouldnot;•
upto5keywordsallowingtoharaterizetheprole;•
an example of relevant text (120 wordsmax.) taken from a doument that is not in theolletion(typiallyfromtheweb).
Eahreordof thestruturein thedierentlanguagesorrespondtotranslations,exept forthe
sampleswhihneedto beextratedfromrealdouments.
2
theFrenhInstituteforSientiandTehnialInformationCenter,http://international.inist.fr
3
AgeneRégionaled'InformationStratégiqueetTehnologique,http://www.aristnpd.org
4
http://www.digiport.org
5
TheINFILEorpusisprovidedbytheAgeneFranePresse(AFP)forresearhpurpose. AFPis
theoldestnewsagenyintheworldandoneofthethreelargestwithAssoiatedPressandReuters.
Although AFP is the largestFrenh news ageny, it transmits news in other languagessuh as
English,Arabi,Spanish,GermanandPortuguese. Newswiresareavailableindierentlanguages
but are not neessarily translations from a language to another, sine the same information is
generallyompletelyrewrittenfromonelanguagetoanothertomaththeinterestoftheaudiene
intheorrespondingountry.
ForINFILE,weseleted3languages(Arabi,EnglishandFrenh)anda3yearsperiod(2004-
2006)whih representsaolletionof aboutoneand half millions newswiresfor around10 GB,
from whih 100,000 doumentsof eah language havebeen seleted to be used for the ltering
test. NewsartilesareenodedinXMLformatandfollowtheNewsMarkupLanguage(NewsML)
speiations 6
.
Sine we provide a real-time simulated feedbak to the partiipants, we need to have the
identiationofrelevantdoumentspriortotheampaign,asin [SoboroandRobertson,2002℄.
Foreahlanguage,the100,000doumentshavebeenseletedinthefollowingway:
•
The whole olletion has been indexed with 4 dierent searh engines: Luene7, Indri8,Zettair 9
andourownsearhenginedevelopedatCEALIST.Zettairisoriginallyonlyworking
inEnglish,buthasbeenmodiedtoalsodealwithFrenh. Thethreeotherenginesworkin
thethreelanguages(English, Frenh, Arabi).
•
Eah searh engineis queriedindependently using the 5dierent elds of thetopis, plus onequerytaking alleldsand onequerytaking allelds butthesample (onsideringthatthesamplemayintroduemorenoisethanother elds). This givesapoolof28runs.
•
The relevane of retrieved douments is judged by human assessors10, two riteria beingused: relevantornotrelevant. TheassessmentproesshasbeenperformedusingaMixture
of Experts model: the rst10 doumentsof eahrun aretakenas rstpooland assessed.
Then,asoreisomputedforeahrunandeahtopiaordingtotheurrentassessments
andanextpoolisreatedbymergingtherunsusingaweightedsumofsores(whereweights
areproportionaltothesore) 11
.
•
Thedoumentolletionisbuiltbytaking:alldoumentsthatarerelevanttoatleastonetopi;
alldoumentsthathavebeenassessedandjudgednotrelevant: thesedoumentsform
asetofdiultdouments(not relevant,butwhihsharesomethingin ommonwith
at leastonetopi,beausetheyhavebeenretrievedbyasearhengine);
aset of douments taken randomlyin therest of theolletion(i.e. from douments
that havenotbeenretrieved byany searh enginesfor anytopi, whih should limit
thenumberofrelevantdoumentsintheorpusthat havenotbeenassessed).
6
NewsMLisanXMLstandarddesignedtoprovideamedia-independent,struturalframeworkformulti-media
news. NewsMLwasdevelopedbytheInternationalPressTeleommuniationsCounil.seehttp://www.newsml.org
7
http://luene.apahe.org
8
http://www.lemurprojet.org/indri
9
http://www.seg.rmit.edu.au/zettair
10
Assessmentshavebeenperformedonasubsetofthetopisby5assessors,showinganinter-annotatoragreement
of81%(kappa=0.7). Giventhisgoodagreement, the restofthe doumentswerejudgedby2 assessors,and the
doumentsforwhihtheassessorsdidnotagreeweresubmittedtoa3rdone.
11
duetoalakoftimeandresoures,thisiterativeproesshasnotbeenusedforallassessments:forsomeofthe
queries,weusedonlytherstpool.
The resultsreturned by the partiipants are binary deisions on the assoiation of a doument
withaprole. Theresults,foragivenprole,anthenbesummarizedin aontingenytableof
theform:
Relevant NotRelevant
Retrieved a b
Not Retrieved d
Onthesedata,asetofstandardevaluationmeasuresisomputed:
•
Preision,denedasP = a+b a
•
Reall,dened asR = a+c a
•
F-measure, whih isa standardombinationof preisionand reall[VanRijsbergen,1979℄dependingonaparameter
α
, anddenedasF = 1
α P 1 + (1 − α ) R 1
Weusedthestandardvalue
α = 0.5
,whihgivesthesameimportanetopreisionandreall(F-measureisthentheharmonimeanofthetwovalues).
Following the TREC Filtering traks [HullandRoberston, 1999, RobertsonandSoboro,2002℄
andtheTDT2004Adaptivetrakingtask[FisusandWheatley,2004℄,wealsoonsiderthelinear
utility,denedas
u = w 1 × a − w 2 × b
where
w 1 is theimportane given to a relevantdoument retrievedand w 2 is the ostof a not
relevantdoumentretrieved.
Linear utility is bounded positively (to 1 for a perfet ltering), but unbounded negatively
(negativevaluesdepend onthenumberof relevantdoumentsforaprole). Hene, theaverage
valueonallproleswouldgivetoomuhimportanetothefewprolesonwhihasystemswould
performpoorly. Tobeabletoaveragethevalue,themeasureissaledasfollows:
u n = max( u max u , u min ) − u min 1 − u min
where
u max is the maximum value of the utility and u min a parameter onsidered to be the
minimum utilityvalue under whih a userwouldnot even onsider thefollowingdouments for
theprole.
IntheINFILE ampaign,weused thevalues
w 1 = 1
,w 2 = 0.5
,u min = − 0.5
,u max = a + c
(sameasinTREC2002).
>From the Topi Detetion and Traking ampaigns [NIST,1998℄, other measures are also
onsidered:
•
Theestimatedprobabilityofmissingarelevantdoument,denedasP miss = a+c c
•
The estimated probability of raising a falsealarm on a non-relevantdoumentdened asP f alse = b+d b
•
Thedetetionost,denedasc det = c miss × P miss × P topic + c f alse × P f alse × (1 − P topic )
where
c miss iftheostofamisseddoument
run2G IMAG eng-eng all
run5G IMAG eng-eng all
runname IMAG eng-eng all
Table1: Submittedrunsin theINFILEampaign
results pre reall F_0.5 util_1_0.5_-0.5 det_10_0.1
run2G.eval 0.298 0.056 0.082 0.300 0.009
run5G.eval 0.298 0.324 0.231 0.362 0.006
runname.eval 0.362 0.052 0.071 0.307 0.009
Table2: ResultsoftheINFILE ampaign
c f alseistheostofafalsealarm
P topic istheapriori probabilitythatadoumentisrelevanttoagivenprole.
IntheINFILEampaign,weusedthevalues
c miss = 10
,c f alse = 0.1
andP topic = 0.001
(aordingtoanestimationoftheaverageratioofrelevantdoumentsin theorpus).
Toomputeaveragesores,the valuesarerstomputed foreahproleand thenaveraged.
Anotherwayofaveragingwouldbetosumupthevaluesforallprolesineahelloftheontin-
genytableandomputethesoresontheresultingtable. Therstmethodis preferredbeause
itallowsequalizingtheontributionoftheproles,whosedierenesaresupposedtobethemain
soureofvarianein measures.
Inordertomeasuretheadaptivityofthesystems,themeasuresarealsoomputedatdierent
timesintheproess,eah10,000douments,andanevolutionurveofthedierentvaluesaross
timeispresented.
Additionally,weproposedtwofollowingexperimentalmeasures. Therstoneisanoriginality
measure,dened as aomparativemeasure orrespondingto the number ofrelevant douments
thesystemuniquelyretrieves(amongpartiipants). Itgivesmoreimportanetosystemsthatuse
innovativeand promisingtehnologiesthat retrieve"diult"douments. Sineweonlyhadtoo
fewruns,this measureisnotreally relevant.
Theseondoneisanantiipationmeasure,designedtogivemoreinteresttosystemsthatan
nd therstdoumentin agiven prole. This measure ismotivated in ompetitiveintelligene
bytheinterestofbeingattheuttingedgeof adomain,and notmissingtherstinformationto
bereative. Itismeasuredbytheinverserankoftherstrelevantdoumentdeteted(inthelist
of thedouments), averagedonall proles. The measureis similar to themean reiproal rank
(MRR)usedforinstaneinQuestionAnsweringEvaluation[Voorhees,1999℄,butisnotomputed
ontherankedlistofretrieveddoumentsbutonthehronologiallistoftherelevantdouments.
5 Overview of the results
Duringthedevelopmentoftheampaign,around10teamsindiatedtheirintenttopartiipateto
theINFILEtrak. Unfortunately,onlyonepartiipantatuallysubmittedruns,theIMAGteam,
whihsubmitted3runs,in monolingualEnglishltering. Table1presentsthe runsand Table2
presentstheresultson theruns,usingthe metrisdesribedin previoussetion,averagedonall
queries. Morepreiseresultsareavailableinindividualresults.
6 Conlusion
TheINFILEampaignhasbeenorganizedforthersttimethisyearasapilottrakofCLEF,to
evaluateross-languageadaptivelteringsystems. TheampaignfollowedtheTREC2002Adap-
feedbak. Due to delaysin the implementationof this setup, theampaign hasbeenpostponed
inJuly. Onlyoneteampartiipatedintheampaign,whihatleastvalidatedtheviabilityofthe
interativeapproahhosen. Forthefuture ofthistrak,ithastobeveriediftheomplexityof
theprotool istheelementthathasdisouragedpartiipants,orifitwasthelakof information
orommuniationaroundthisevaluation,orthelakofinterestinthesubjet.
Referenes
[FisusandWheatley,2004℄ Fisus,J.andWheatley,B.(2004). Overviewofthetdt2004evalu-
ationandresults. InTDT'02. NIST.
[HullandRoberston,1999℄ Hull, D. and Roberston, S. (1999). The tre-8 ltering trak nal
report. InProeedings of the EighthText REtrievalConferene(TREC-8).NIST.
[NIST,1998℄ NIST (1998). The topi detetion and traking phase 2 (tdt2) evaluation plan.
http://www.nist.gov/speeh/tests/tdt/1998/do /tdt2.eval.plan.98.v3.7.pdf.
[RobertsonandSoboro,2002℄ Robertson,S.andSoboro,I.(2002).Thetre2002lteringtrak
report. InProeedings of TheEleventh TextRetrieval Conferene(TREC 2002).NIST.
[SoboroandRobertson,2002℄ Soboro, I. and Robertson, S. (2002). Building a ltering test
olletion for tre 2002. In Proeedings of The Eleventh Text Retrieval Conferene (TREC
2002). NIST.
[VanRijsbergen,1979℄ VanRijsbergen,C.(1979).InformationRetrieval. Butterworths,London.
[Voorhees,1999℄ Voorhees,E.(1999). Thetre-8questionansweringtrakreport. InProeedings
ofthe EighthTextREtrievalConferene (TREC-8).NIST.
[Yanget al.,2005℄ Yang,Y.,Yoo,S.,Zhang,J.,andKisiel,B.(2005).Robustnessofadaptivel-
teringmethodsinaross-benhmarkevaluation.InProeedingsofthe28thannualinternational
ACM SIGIR onferene on Researh and development in information retrieval, pages 98105,
Salvador,Brazil.