Overview of CLEF 2008 INFILE Pilot Track

(1)

RomariBesançon

1

,StéphaneChaudiron

2

,DjamelMostefa

3

,OlivierHamon

3,4

,

IsmaïlTimimi

2

,KhalidChoukri

3

1

CEA LIST

2

UniversitédeLille3-GERiiCO

18,routedupanorama Domaineuniv. duPontdeBois

BP692265 BP60149-59653

FontenayauxRosesFrane Villeneuved'AsqedexFrane

3

ELDA

4

LIPN-UniversitéParis13&CNRS

55-57rueBrillatSavarin 99avenueJ.-B.Clément

75013ParisFrane 93430VilletaneuseFrane

romari.besanonea.fr, stephane.haudironuniv-lille3.fr, mostefaelda.org,

hamonelda.org, ismail.timimiuniv-lille3.fr, houkrielda.org

Abstrat

TheINFILEampaignhasbeenrunforthersttimeasapilottrakinCLEF2008. Its

purposeistheevaluationofross-languageadaptivelteringsystems. Itusesaorpus

of 300,000 newswires from Agene Frane Presse (AFP) in three languages: Arabi,

EnglishandFrenh,andasetof50topisingeneralandspeidomain(sientiand

tehnologialinformation). Duetodelaysintheorganizationofthetask,theampaign

onlyhad3submissions(fromonepartiipant)whiharepresentedinthis artile.

Categories and Subjet Desriptors

H.3[InformationStorage and Retrieval℄: H.3.1ContentAnalysisand Indexing;H.3.3Infor-

mationSearhandRetrieval;H.3.4SystemsandSoftware

General Terms

Measurement,Performane,Experimentation,Algorithms

Keywords

InformationFiltering, CompetitiveIntelligene

1 Introdution

ThepurposeoftheINFILE(INformationFILteringEvaluation)evaluationampaign 1

istoevalu-

ateross-languageadaptivelteringsystems,i.e. theabilityofautomatedsystemstosuessfully

separaterelevantandnon-relevantdoumentsin aninomingstreamoftextual informationwith

respettoagivenprole. Thedoumentandprolearepossiblywrittenin dierentlanguages.

TheINFILE ampaignisapilottrakin CLEF2008ampaigns andisfunded bytheFrenh

NationalResearhAgeny(ANR)ando-organizedbytheCEALIST,ELDAandtheUniversity

ofLille3-GERiiCO.

1

(2)

ming). In the INFILE ampaign, we onsider the ontext of ompetitive intelligene: in this

ontext,theevaluationprotooloftheampaignhasbeendesignedwithapartiularattentionto

theontext ofuseof lteringsystemsby realprofessionalusers. Eveniftheampaign ismainly

atehnologialorientedevaluation proess,weadapted theprotoolandthemetris,asloseas

possible,tohowanormaluserwouldproeed,inludingthroughsomeinterationandadaptation

ofhissystem.

TheINFILEampaignanmainlybeseenasaross-lingualpursuitoftheTREC2002Adaptive

Filteringtask[RobertsonandSoboro,2002℄(adaptivelteringtrakhasbeenrunfrom2000to

2002), withapartiular interestin theorrespondene of theprotoolwith theground truthof

ompetitive intelligene (CI) professionals. In this goal, we asked CI professionals to write the

topisaordingtotheirexperieneinthedomain.

OtherrelatedampaignsaretheTopiDetetionandTraking(TDT)ampaignsfrom1998to

2004[FisusandWheatley,2004℄. However,in theTDT ampaigns,fous wasmainly ontopis

dened as "events", with a ne granularity level, and often temporally restrited, whereas in

INFILE(similartoTREC2002)topisareoflong-terminterestandsupposedtobestable,whih

an indue dierent tehniques, even if some studies show that some models an be eiently

trainedto havegoodperformaneonbothtasks[Yang etal.,2005℄.

2 Desription of the task

ThemainfeaturesoftheINFILEevaluationampaignaresummarizedhere:

•

Crosslingual: English,FrenhandArabiareonernedbytheproessbutpartiipantsmay beevaluated onmonoorbilingualruns.

•

Â^newswireôrpus^provided^by^theÂgene^F^rane^Presse^(AFP)ândôvering^reent^years.

•

^The^topi^setîsômposedôf^two^dierent^kindsôf^proles,ôneônerning^general^newsând

events,andaseond oneonsientiandtehnologialsubjets.

•

^The êvaluation îs ^performed ûsing ân âutomati înterative ^proess ^for ^the partiipating systemstogetdoumentsandlterthem,withasimulateduserfeedbak.

•

^Systemsâreâllowed^to ûse^the^feedbakâtâny^time^toînreaseperformane.

•

^Systems^provideâ^boolean^deision^forêah^doumentâording^toêah^prole.

•

^Relevane^judgments^are^performed^by^human^assessors.

•

Partiipants are asked to ll a form to speify the languagesused, the elds used in the prolesandasummary ofthetehnologyused.

Weusedanautomatiproessforthesubmissionprotool. Indeed,theprotooloftheINFILE

ampaignisdesignedtobearealisttaskforalteringsystem. Inpartiular,theideaistoavoid

makingthewholeorpusavailabletothepartiipantsbeforetheampaign,buttomakeitavailable

onedoumentatatime,simulatingthebehaviorofthenewswireservie. Theprotoolthenfores

partiipatingsystemsto beevaluatedinaone-passtest.

Theprotoolisinterativeandevaluationworksasfollows:

•

â^doument^serverîs^startedât^the^beginningôf^theâmpaign,initializedwiththedoument olletion: doumentsareretrievedfromthisserverandlteringresultsaresentbakbythe

partiipantstotheserver;

•

^the ^partiipant ^systems ômmuniate ^with ^this ^server ûsing â ^web ^servie ^protool ^(web

servieshavebeenhosentobeabletobypasspossibleorporaterewallsofthepartiipants):

(3)

partiipantwantstosubmitseveralruns,thesystemmustonnetseveraltimestoget

dierentrunidentiers;

2. thesystemretrievesonedoument;

3. the system lters the doument, i.e. it assoiates the doument with one orseveral

proles,ordisardit;

4. foradaptivesystems,arelevanefeedbakanbeprovidedforltereddouments;

5. the system an retrievea new doument (bakto step2) that an only be retrieved

whenthepreviousdoumenthasbeenltered;

A simulated relevane feedbak is provided for adaptive systems: the idea is again to have a

simulation of a realist behavior of the CI professional. In a real proess, the CI professional

reeivesthe douments found relevantto aprole in aorresponding mailbox or diretoryand

he an read the doument and deide to remove it if it was a ltering error. In the INFILE

automatedproess,it isalsotheonlyfeedbakauthorized: relevanefeedbakanonlybeasked

onadoumentassoiatedwithaprolebythesystem,thereisnorelevanefeedbakondisarded

douments.

Furthermore,weassumethataCIprofessionalwouldnothaveaninnitepatiene: thenumber

offeedbaksisthenlimitedto50,fromtheadvietakenfromCIprofessionals. Thistendstogive

moreinteresttosystemswithquikadaptivity,thantosystemsthatneedsalargeamountofdata

tobetrained,but itseemedrightfortheorganizerstoput systemsin atheontextofarealisti

task.

Adry runhasbeenorganizedfrom June 26th to July3rd to hekthe tehnialviabilityof

theprotool. TheoialampaignhasbeenrunfromJuly7thtoJuly26th.

3 Test olletions

3.1 The topis

Asetof50proleshasbeenpreparedoveringtwodierentategories. Therstgroup(30topis)

dealswithgeneralnewsandeventsonerningnationalandinternationalaairs,sports,politis,

et. The seond one (20 topis) deals with sienti and tehnologial subjets. The sienti

topisweredevelopedbyompetitiveintelligeneprofessionalsfromINIST 2

,ARISTNordPasde

Calais 3

,Digiport 4

andOTOResearh 5

. Thetopis weredeveloppedinbothEnglishandFrenh.

TheArabiversionhasbeentranslatedfromFrenhbynativespeakers.

Topisaredened withthefollowingstruture:

•

âûniqueîdentier;

•

^a^title⁽⁶ ^words^max.) ^desribing^the^topi ⁱⁿ^a^few^words;

•

^a^desription⁽²⁰^words^max.) orrespondingto asentene-longdesription;

•

^a^narrative⁽⁶⁰^words^max.) orrespondingto thedesriptionofwhatshould beonsidered arelevantdoumentandpossiblywhatshouldnot;

•

^up^to⁵^keywords^allowing^to^haraterize^the^prole;

•

ân êxample ôf ^relevant ^text ⁽¹²⁰ ^words^max.) ^taken ^from â ^doument ^that îs ^not ⁱⁿ ^the

olletion(typiallyfromtheweb).

Eahreordof thestruturein thedierentlanguagesorrespondtotranslations,exept forthe

sampleswhihneedto beextratedfromrealdouments.

2

theFrenhInstituteforSientiandTehnialInformationCenter,http://international.inist.fr

3

AgeneRégionaled'InformationStratégiqueetTehnologique,http://www.aristnpd.org

4

http://www.digiport.org

5

(4)

TheINFILEorpusisprovidedbytheAgeneFranePresse(AFP)forresearhpurpose. AFPis

theoldestnewsagenyintheworldandoneofthethreelargestwithAssoiatedPressandReuters.

Although AFP is the largestFrenh news ageny, it transmits news in other languagessuh as

English,Arabi,Spanish,GermanandPortuguese. Newswiresareavailableindierentlanguages

but are not neessarily translations from a language to another, sine the same information is

generallyompletelyrewrittenfromonelanguagetoanothertomaththeinterestoftheaudiene

intheorrespondingountry.

ForINFILE,weseleted3languages(Arabi,EnglishandFrenh)anda3yearsperiod(2004-

2006)whih representsaolletionof aboutoneand half millions newswiresfor around10 GB,

from whih 100,000 doumentsof eah language havebeen seleted to be used for the ltering

test. NewsartilesareenodedinXMLformatandfollowtheNewsMarkupLanguage(NewsML)

speiations 6

.

Sine we provide a real-time simulated feedbak to the partiipants, we need to have the

identiationofrelevantdoumentspriortotheampaign,asin [SoboroandRobertson,2002℄.

Foreahlanguage,the100,000doumentshavebeenseletedinthefollowingway:

•

^The ^whole ôlletion ^has ^been îndexed ^with ⁴ ^dierent ^searh êngines: ^Luene⁷^, Îndri⁸^,

Zettair 9

andourownsearhenginedevelopedatCEALIST.Zettairisoriginallyonlyworking

inEnglish,buthasbeenmodiedtoalsodealwithFrenh. Thethreeotherenginesworkin

thethreelanguages(English, Frenh, Arabi).

•

Êah ^searh êngineîs ^queriedindependently using the 5dierent elds of thetopis, plus onequerytaking alleldsand onequerytaking allelds butthesample (onsideringthat

thesamplemayintroduemorenoisethanother elds). This givesapoolof28runs.

•

^The ^relevane ôf ^retrieved ^douments îs ^judged ^by ^human âssessors¹⁰^, ^two ^riteria ^being

used: relevantornotrelevant. TheassessmentproesshasbeenperformedusingaMixture

of Experts model: the rst10 doumentsof eahrun aretakenas rstpooland assessed.

Then,asoreisomputedforeahrunandeahtopiaordingtotheurrentassessments

andanextpoolisreatedbymergingtherunsusingaweightedsumofsores(whereweights

areproportionaltothesore) 11

.

•

^The^doument^olletion^is^built^by^taking:

alldoumentsthatarerelevanttoatleastonetopi;

alldoumentsthathavebeenassessedandjudgednotrelevant: thesedoumentsform

asetofdiultdouments(not relevant,butwhihsharesomethingin ommonwith

at leastonetopi,beausetheyhavebeenretrievedbyasearhengine);

aset of douments taken randomlyin therest of theolletion(i.e. from douments

that havenotbeenretrieved byany searh enginesfor anytopi, whih should limit

thenumberofrelevantdoumentsintheorpusthat havenotbeenassessed).

6

NewsMLisanXMLstandarddesignedtoprovideamedia-independent,struturalframeworkformulti-media

news. NewsMLwasdevelopedbytheInternationalPressTeleommuniationsCounil.seehttp://www.newsml.org

7

http://luene.apahe.org

8

http://www.lemurprojet.org/indri

9

http://www.seg.rmit.edu.au/zettair

10

Assessmentshavebeenperformedonasubsetofthetopisby5assessors,showinganinter-annotatoragreement

of81%(kappa=0.7). Giventhisgoodagreement, the restofthe doumentswerejudgedby2 assessors,and the

doumentsforwhihtheassessorsdidnotagreeweresubmittedtoa3rdone.

11

duetoalakoftimeandresoures,thisiterativeproesshasnotbeenusedforallassessments:forsomeofthe

queries,weusedonlytherstpool.

(5)

The resultsreturned by the partiipants are binary deisions on the assoiation of a doument

withaprole. Theresults,foragivenprole,anthenbesummarizedin aontingenytableof

theform:

Relevant NotRelevant

Retrieved a b

Not Retrieved d

Onthesedata,asetofstandardevaluationmeasuresisomputed:

•

^Preision,^dened^as

P = _a+b ^a

•

^Reall^,^dened ^as

R = _a+c ^a

•

^F-measure, ^whih îsâ ^standardômbinationôf ^preisionând ^reall^[VanRijsbergen,1979℄

dependingonaparameter

α

^, ^and^dened^as

F = 1

α _P ¹ + (1 − α ) _R ¹

Weusedthestandardvalue

α = 0.5

^,^whih^gives^the^same^importane^to^preision^and^reall

(F-measureisthentheharmonimeanofthetwovalues).

Following the TREC Filtering traks [HullandRoberston, 1999, RobertsonandSoboro,2002℄

andtheTDT2004Adaptivetrakingtask[FisusandWheatley,2004℄,wealsoonsiderthelinear

utility,denedas

u = w 1 × a − w 2 × b

where

w 1

îs ^theîmportane ^given ^to â ^relevant^doument ^retrievedând

w 2

îs ^the ôstôf â ^not

relevantdoumentretrieved.

Linear utility is bounded positively (to 1 for a perfet ltering), but unbounded negatively

(negativevaluesdepend onthenumberof relevantdoumentsforaprole). Hene, theaverage

valueonallproleswouldgivetoomuhimportanetothefewprolesonwhihasystemswould

performpoorly. Tobeabletoaveragethevalue,themeasureissaledasfollows:

u n = max( _u _max ^u , u _min ) − u _min 1 − u min

where

u max

îs ^the ^maximum ^value ôf ^the ûtility ând

u min

^a ^parameter ^onsidered ^to ^be ^the

minimum utilityvalue under whih a userwouldnot even onsider thefollowingdouments for

theprole.

IntheINFILE ampaign,weused thevalues

w 1 = 1

^,

w 2 = 0.5

^,

u min = − 0.5

^,

u max = a + c

(sameasinTREC2002).

>From the Topi Detetion and Traking ampaigns [NIST,1998℄, other measures are also

onsidered:

•

^The^estimatedprobabilityofmissingarelevantdoument,denedas

P _miss = _a+c ^c

•

^The ^estimated probability of raising a falsealarm on a non-relevantdoumentdened as

P f alse = _b+d ^b

•

^The^detetion^ost,^dened^as

c det = c miss × P miss × P topic + c f alse × P f alse × (1 − P topic )

where

c miss

îf^theôstôfâ^missed^doument

(6)

run2G IMAG eng-eng all

run5G IMAG eng-eng all

runname IMAG eng-eng all

Table1: Submittedrunsin theINFILEampaign

results pre reall F_0.5 util_1_0.5_-0.5 det_10_0.1

run2G.eval 0.298 0.056 0.082 0.300 0.009

run5G.eval 0.298 0.324 0.231 0.362 0.006

runname.eval 0.362 0.052 0.071 0.307 0.009

Table2: ResultsoftheINFILE ampaign

c _{f alse}

îs^theôstôfâ^falseâlarm

P _topic

^is^the^a^priori probabilitythatadoumentisrelevanttoagivenprole.

IntheINFILEampaign,weusedthevalues

c miss = 10

^,

c f alse = 0.1

^and

P topic = 0.001

^(aording

toanestimationoftheaverageratioofrelevantdoumentsin theorpus).

Toomputeaveragesores,the valuesarerstomputed foreahproleand thenaveraged.

Anotherwayofaveragingwouldbetosumupthevaluesforallprolesineahelloftheontin-

genytableandomputethesoresontheresultingtable. Therstmethodis preferredbeause

itallowsequalizingtheontributionoftheproles,whosedierenesaresupposedtobethemain

soureofvarianein measures.

Inordertomeasuretheadaptivityofthesystems,themeasuresarealsoomputedatdierent

timesintheproess,eah10,000douments,andanevolutionurveofthedierentvaluesaross

timeispresented.

Additionally,weproposedtwofollowingexperimentalmeasures. Therstoneisanoriginality

measure,dened as aomparativemeasure orrespondingto the number ofrelevant douments

thesystemuniquelyretrieves(amongpartiipants). Itgivesmoreimportanetosystemsthatuse

innovativeand promisingtehnologiesthat retrieve"diult"douments. Sineweonlyhadtoo

fewruns,this measureisnotreally relevant.

Theseondoneisanantiipationmeasure,designedtogivemoreinteresttosystemsthatan

nd therstdoumentin agiven prole. This measure ismotivated in ompetitiveintelligene

bytheinterestofbeingattheuttingedgeof adomain,and notmissingtherstinformationto

bereative. Itismeasuredbytheinverserankoftherstrelevantdoumentdeteted(inthelist

of thedouments), averagedonall proles. The measureis similar to themean reiproal rank

(MRR)usedforinstaneinQuestionAnsweringEvaluation[Voorhees,1999℄,butisnotomputed

ontherankedlistofretrieveddoumentsbutonthehronologiallistoftherelevantdouments.

5 Overview of the results

Duringthedevelopmentoftheampaign,around10teamsindiatedtheirintenttopartiipateto

theINFILEtrak. Unfortunately,onlyonepartiipantatuallysubmittedruns,theIMAGteam,

whihsubmitted3runs,in monolingualEnglishltering. Table1presentsthe runsand Table2

presentstheresultson theruns,usingthe metrisdesribedin previoussetion,averagedonall

queries. Morepreiseresultsareavailableinindividualresults.

6 Conlusion

TheINFILEampaignhasbeenorganizedforthersttimethisyearasapilottrakofCLEF,to

evaluateross-languageadaptivelteringsystems. TheampaignfollowedtheTREC2002Adap-

(7)

feedbak. Due to delaysin the implementationof this setup, theampaign hasbeenpostponed

inJuly. Onlyoneteampartiipatedintheampaign,whihatleastvalidatedtheviabilityofthe

interativeapproahhosen. Forthefuture ofthistrak,ithastobeveriediftheomplexityof

theprotool istheelementthathasdisouragedpartiipants,orifitwasthelakof information

orommuniationaroundthisevaluation,orthelakofinterestinthesubjet.

Referenes

[FisusandWheatley,2004℄ Fisus,J.andWheatley,B.(2004). Overviewofthetdt2004evalu-

ationandresults. InTDT'02. NIST.

[HullandRoberston,1999℄ Hull, D. and Roberston, S. (1999). The tre-8 ltering trak nal

report. InProeedings of the EighthText REtrievalConferene(TREC-8).NIST.

[NIST,1998℄ NIST (1998). The topi detetion and traking phase 2 (tdt2) evaluation plan.

http://www.nist.gov/speeh/tests/tdt/1998/do /tdt2.eval.plan.98.v3.7.pdf.

[RobertsonandSoboro,2002℄ Robertson,S.andSoboro,I.(2002).Thetre2002lteringtrak

report. InProeedings of TheEleventh TextRetrieval Conferene(TREC 2002).NIST.

[SoboroandRobertson,2002℄ Soboro, I. and Robertson, S. (2002). Building a ltering test

olletion for tre 2002. In Proeedings of The Eleventh Text Retrieval Conferene (TREC

2002). NIST.

[VanRijsbergen,1979℄ VanRijsbergen,C.(1979).InformationRetrieval. Butterworths,London.

[Voorhees,1999℄ Voorhees,E.(1999). Thetre-8questionansweringtrakreport. InProeedings

ofthe EighthTextREtrievalConferene (TREC-8).NIST.

[Yanget al.,2005℄ Yang,Y.,Yoo,S.,Zhang,J.,andKisiel,B.(2005).Robustnessofadaptivel-

teringmethodsinaross-benhmarkevaluation.InProeedingsofthe28thannualinternational

ACM SIGIR onferene on Researh and development in information retrieval, pages 98105,

Salvador,Brazil.

Overview of CLEF 2008 INFILE Pilot Track

1

2

3

3,4

2

3

1

2

3

4

•

•

•

•

•

•

•

•

•

•

•

•

•

•

•

•

•

•

•

•

•

P = a+b a

•

R = a+c a

•

α

F = 1

α P 1 + (1 − α ) R 1

α = 0.5

u = w 1 × a − w 2 × b

w 1

w 2

u n = max( u max u , u min ) − u min 1 − u min

u max

u min

w 1 = 1

w 2 = 0.5

u min = − 0.5

u max = a + c

•

P miss = a+c c

•

P f alse = b+d b

•

c det = c miss × P miss × P topic + c f alse × P f alse × (1 − P topic )

c miss

c f alse

P topic

c miss = 10

c f alse = 0.1

P topic = 0.001

P = _a+b ^a

R = _a+c ^a

α _P ¹ + (1 − α ) _R ¹

u n = max( _u _max ^u , u _min ) − u _min 1 − u min

P _miss = _a+c ^c

P f alse = _b+d ^b

c _{f alse}

P _topic