
Overview of CLEF 2008 INFILE Pilot Track


Romaric Besançon¹, Stéphane Chaudiron², Djamel Mostefa³, Olivier Hamon³,⁴, Ismaïl Timimi², Khalid Choukri³

¹ CEA LIST, 18 route du Panorama, BP 6, 92265 Fontenay aux Roses, France
² Université de Lille 3 - GERiiCO, Domaine univ. du Pont de Bois, BP 60149, 59653 Villeneuve d'Ascq cedex, France
³ ELDA, 55-57 rue Brillat Savarin, 75013 Paris, France
⁴ LIPN - Université Paris 13 & CNRS, 99 avenue J.-B. Clément, 93430 Villetaneuse, France

romaric.besancon@cea.fr, stephane.chaudiron@univ-lille3.fr, mostefa@elda.org, hamon@elda.org, ismail.timimi@univ-lille3.fr, choukri@elda.org

Abstract

The INFILE campaign has been run for the first time as a pilot track in CLEF 2008. Its purpose is the evaluation of cross-language adaptive filtering systems. It uses a corpus of 300,000 newswires from Agence France Presse (AFP) in three languages: Arabic, English and French, and a set of 50 topics in general and specific domains (scientific and technological information). Due to delays in the organization of the task, the campaign only had 3 submissions (from one participant), which are presented in this article.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software

General Terms

Measurement, Performance, Experimentation, Algorithms

Keywords

Information Filtering, Competitive Intelligence

1 Introduction

The purpose of the INFILE (INformation FILtering Evaluation) evaluation campaign is to evaluate cross-language adaptive filtering systems, i.e. the ability of automated systems to successfully separate relevant and non-relevant documents in an incoming stream of textual information with respect to a given profile. The document and the profile may be written in different languages.

The INFILE campaign is a pilot track in the CLEF 2008 campaigns. It is funded by the French National Research Agency (ANR) and co-organized by the CEA LIST, ELDA and the University of Lille 3 - GERiiCO.


[...]ming). In the INFILE campaign, we consider the context of competitive intelligence: in this context, the evaluation protocol of the campaign has been designed with particular attention to the context of use of filtering systems by real professional users. Even if the campaign is mainly a technology-oriented evaluation process, we adapted the protocol and the metrics, as closely as possible, to how a normal user would proceed, including through some interaction and adaptation of his system.

The INFILE campaign can mainly be seen as a cross-lingual pursuit of the TREC 2002 Adaptive Filtering task [Robertson and Soboroff, 2002] (the adaptive filtering track was run from 2000 to 2002), with a particular interest in the correspondence of the protocol with the ground truth of competitive intelligence (CI) professionals. To this end, we asked CI professionals to write the topics according to their experience in the domain.

Other related campaigns are the Topic Detection and Tracking (TDT) campaigns from 1998 to 2004 [Fiscus and Wheatley, 2004]. However, in the TDT campaigns, the focus was mainly on topics defined as "events", with a fine granularity level, and often temporally restricted, whereas in INFILE (as in TREC 2002) topics are of long-term interest and supposed to be stable, which can induce different techniques, even if some studies show that some models can be efficiently trained to have good performance on both tasks [Yang et al., 2005].

2 Description of the task

The main features of the INFILE evaluation campaign are summarized here:

- Crosslingual: English, French and Arabic are concerned by the process, but participants may be evaluated on mono- or bilingual runs.
- A newswire corpus provided by the Agence France Presse (AFP) and covering recent years.
- The topic set is composed of two different kinds of profiles, one concerning general news and events, and a second one on scientific and technological subjects.
- The evaluation is performed using an automatic interactive process for the participating systems to get documents and filter them, with a simulated user feedback.
- Systems are allowed to use the feedback at any time to increase performance.
- Systems provide a boolean decision for each document according to each profile.
- Relevance judgments are performed by human assessors.
- Participants are asked to fill in a form to specify the languages used, the fields used in the profiles and a summary of the technology used.

We used an automatic process for the submission protocol. Indeed, the protocol of the INFILE campaign is designed to be a realistic task for a filtering system. In particular, the idea is to avoid making the whole corpus available to the participants before the campaign, but to make it available one document at a time, simulating the behavior of the newswire service. The protocol then forces participating systems to be evaluated in a one-pass test.

The protocol is interactive and evaluation works as follows:

- a document server is started at the beginning of the campaign, initialized with the document collection: documents are retrieved from this server and filtering results are sent back by the participants to the server;
- the participant systems communicate with this server using a web service protocol (web services have been chosen to be able to bypass possible corporate firewalls of the participants):


  1. [...] participant wants to submit several runs, the system must connect several times to get different run identifiers;
  2. the system retrieves one document;
  3. the system filters the document, i.e. it associates the document with one or several profiles, or discards it;
  4. for adaptive systems, a relevance feedback can be provided for filtered documents;
  5. the system can retrieve a new document (back to step 2), which can only be retrieved when the previous document has been filtered;
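The interaction loop described in these steps can be sketched as follows. This is only an illustration under stated assumptions, not the campaign's client: the real server was reached through a web service whose interface is not described in this article, so the server is replaced here by an in-memory object, and `MockDocumentServer`, `register_run`, `next_document`, `submit` and `run_filter` are hypothetical names.

```python
class MockDocumentServer:
    """In-memory stand-in for the document server (hypothetical API)."""

    def __init__(self, documents):
        self.documents = list(documents)
        self.cursor = 0
        self.pending = False   # True while a retrieved document awaits a decision
        self.results = {}      # document index -> set of profile ids (empty = discarded)

    def register_run(self):
        # Step 1: obtain a run identifier (fixed here for simplicity).
        return "run-1"

    def next_document(self):
        # Steps 2 and 5: a new document is only served once the previous
        # one has been filtered, which enforces the one-pass constraint.
        if self.pending:
            raise RuntimeError("the previous document has not been filtered yet")
        if self.cursor >= len(self.documents):
            return None
        self.pending = True
        return self.cursor, self.documents[self.cursor]

    def submit(self, doc_id, profiles):
        # Step 3: record the boolean decisions for this document.
        self.results[doc_id] = set(profiles)
        self.pending = False
        self.cursor += 1


def run_filter(server, decide):
    """Drive one run; `decide` maps a document text to a set of profile ids."""
    run_id = server.register_run()
    while True:
        item = server.next_document()
        if item is None:
            return run_id
        doc_id, text = item
        server.submit(doc_id, decide(text))
```

A toy decision function such as `lambda text: {"energy"} if "oil" in text else set()` is enough to exercise the loop; an adaptive system would additionally update its profile model between two calls to `next_document`.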

A simulated relevance feedback is provided for adaptive systems: the idea is again to simulate the realistic behavior of a CI professional. In a real process, the CI professional receives the documents found relevant to a profile in a corresponding mailbox or directory, and he can read a document and decide to remove it if it was a filtering error. In the INFILE automated process, this is also the only feedback authorized: relevance feedback can only be asked on a document associated with a profile by the system; there is no relevance feedback on discarded documents.

Furthermore, we assume that a CI professional would not have infinite patience: the number of feedbacks is therefore limited to 50, following the advice of CI professionals. This tends to favor systems with quick adaptivity over systems that need a large amount of data to be trained, but it seemed right to the organizers to put systems in the context of a realistic task.
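The two feedback rules (feedback only on documents the system itself associated with a profile, and a bounded number of feedbacks) can be illustrated with a small oracle. This is a sketch under assumptions: the class and method names are hypothetical, and the text does not say whether the limit of 50 is counted per run or per profile, so the sketch simply counts requests per oracle instance.

```python
class SimulatedFeedback:
    """Hypothetical relevance-feedback oracle with a limited budget."""

    def __init__(self, relevant, budget=50):
        # relevant: dict mapping a profile id to the set of its relevant document ids
        self.relevant = relevant
        self.budget = budget
        self.used = 0

    def ask(self, doc_id, profile, filtered_profiles):
        """Return True or False (relevance), or None when feedback is refused."""
        if profile not in filtered_profiles:
            return None   # no feedback on documents discarded for this profile
        if self.used >= self.budget:
            return None   # feedback budget exhausted
        self.used += 1
        return doc_id in self.relevant.get(profile, set())
```

With `budget=50`, a system that asks for feedback on every filtered document exhausts its budget quickly, which is why the protocol favors systems that adapt from few examples.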

A dry run was organized from June 26th to July 3rd to check the technical viability of the protocol. The official campaign was run from July 7th to July 26th.

3 Test collections

3.1 The topics

A set of 50 profiles has been prepared, covering two different categories. The first group (30 topics) deals with general news and events concerning national and international affairs, sports, politics, etc. The second one (20 topics) deals with scientific and technological subjects. The scientific topics were developed by competitive intelligence professionals from INIST², ARIST Nord Pas de Calais³, Digiport⁴ and OTO Research⁵. The topics were developed in both English and French. The Arabic version has been translated from French by native speakers.

Topics are defined with the following structure:

- a unique identifier;
- a title (6 words max.) describing the topic in a few words;
- a description (20 words max.) corresponding to a sentence-long description;
- a narrative (60 words max.) corresponding to the description of what should be considered a relevant document and possibly what should not;
- up to 5 keywords characterizing the profile;
- an example of relevant text (120 words max.) taken from a document that is not in the collection (typically from the web).

Each record of the structure in the different languages corresponds to a translation, except for the samples, which need to be extracted from real documents.
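The topic structure and its length limits can be captured in a small record type. This is an illustrative sketch only: the field names, the `Topic` class and the whitespace-based word count are assumptions, since the actual file format of the topics is not specified in this article.

```python
from dataclasses import dataclass
from typing import List

# Word limits taken from the topic structure described in the text.
LIMITS = {"title": 6, "description": 20, "narrative": 60, "sample": 120}
MAX_KEYWORDS = 5


@dataclass
class Topic:
    identifier: str
    title: str
    description: str
    narrative: str
    keywords: List[str]
    sample: str

    def violations(self) -> List[str]:
        """List the fields that exceed their limit (words split on whitespace)."""
        problems = []
        for name, limit in LIMITS.items():
            n_words = len(getattr(self, name).split())
            if n_words > limit:
                problems.append(f"{name}: {n_words} words (max {limit})")
        if len(self.keywords) > MAX_KEYWORDS:
            problems.append(f"keywords: {len(self.keywords)} (max {MAX_KEYWORDS})")
        return problems
```

A topic whose title has seven words would, for instance, be reported as `title: 7 words (max 6)`.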

² The French Institute for Scientific and Technical Information Center, http://international.inist.fr
³ Agence Régionale d'Information Stratégique et Technologique, http://www.aristnpd.org
⁴ http://www.digiport.org
⁵ [...]

The INFILE corpus is provided by the Agence France Presse (AFP) for research purposes. AFP is the oldest news agency in the world and one of the three largest, with Associated Press and Reuters. Although AFP is the largest French news agency, it transmits news in other languages such as English, Arabic, Spanish, German and Portuguese. Newswires are available in different languages but are not necessarily translations from one language to another, since the same information is generally completely rewritten from one language to another to match the interest of the audience in the corresponding country.

For INFILE, we selected 3 languages (Arabic, English and French) and a 3-year period (2004-2006), which represents a collection of about one and a half million newswires for around 10 GB, from which 100,000 documents of each language have been selected to be used for the filtering test. News articles are encoded in XML format and follow the News Markup Language (NewsML) specifications⁶.

Since we provide a real-time simulated feedback to the participants, we need to have the identification of relevant documents prior to the campaign, as in [Soboroff and Robertson, 2002].

For each language, the 100,000 documents have been selected in the following way:

- The whole collection has been indexed with 4 different search engines: Lucene⁷, Indri⁸, Zettair⁹ and our own search engine developed at CEA LIST. Zettair originally works only in English, but has been modified to also deal with French. The three other engines work in the three languages (English, French, Arabic).
- Each search engine is queried independently using the 5 different fields of the topics, plus one query taking all fields and one query taking all fields but the sample (considering that the sample may introduce more noise than other fields). This gives a pool of 28 runs.
- The relevance of retrieved documents is judged by human assessors¹⁰, two criteria being used: relevant or not relevant. The assessment process has been performed using a Mixture of Experts model: the first 10 documents of each run are taken as first pool and assessed. Then, a score is computed for each run and each topic according to the current assessments, and a next pool is created by merging the runs using a weighted sum of scores (where weights are proportional to the score)¹¹.
- The document collection is built by taking:
  - all documents that are relevant to at least one topic;
  - all documents that have been assessed and judged not relevant: these documents form a set of difficult documents (not relevant, but which share something in common with at least one topic, because they have been retrieved by a search engine);
  - a set of documents taken randomly in the rest of the collection (i.e. from documents that have not been retrieved by any search engine for any topic, which should limit the number of relevant documents in the corpus that have not been assessed).
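The pool size and the final assembly of the collection can be sketched as follows. The engine and field labels, the function names, the `target_size` parameter and the fixed random seed are illustrative assumptions; only the counts (4 engines, 5 fields plus 2 combined queries) and the three-part composition of the collection come from the description above.

```python
import random
from itertools import product

ENGINES = ["lucene", "indri", "zettair", "cea_list"]
FIELDS = ["title", "description", "narrative", "keywords", "sample"]
QUERY_TYPES = FIELDS + ["all_fields", "all_but_sample"]


def pool_runs():
    """One run per (engine, query type) pair: 4 x 7 = 28 runs."""
    return list(product(ENGINES, QUERY_TYPES))


def build_collection(all_ids, relevant, judged_not_relevant, retrieved,
                     target_size, seed=0):
    """Relevant + judged non-relevant documents, completed by random documents
    that no search engine retrieved for any topic."""
    kept = set(relevant) | set(judged_not_relevant)
    candidates = sorted(set(all_ids) - set(retrieved) - kept)
    n_fill = max(0, target_size - len(kept))
    rng = random.Random(seed)
    kept.update(rng.sample(candidates, min(n_fill, len(candidates))))
    return kept
```

Excluding every retrieved document from the random part mirrors the stated goal of limiting the number of unassessed relevant documents in the test collection.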

⁶ NewsML is an XML standard designed to provide a media-independent, structural framework for multi-media news. NewsML was developed by the International Press Telecommunications Council. See http://www.newsml.org
⁷ http://lucene.apache.org
⁸ http://www.lemurproject.org/indri
⁹ http://www.seg.rmit.edu.au/zettair
¹⁰ Assessments have been performed on a subset of the topics by 5 assessors, showing an inter-annotator agreement of 81% (kappa = 0.7). Given this good agreement, the rest of the documents were judged by 2 assessors, and the documents for which the assessors did not agree were submitted to a 3rd one.
¹¹ Due to a lack of time and resources, this iterative process has not been used for all assessments: for some of the queries, we used only the first pool.


The results returned by the participants are binary decisions on the association of a document with a profile. The results, for a given profile, can then be summarized in a contingency table of the form:

                 Relevant   Not Relevant
Retrieved           a            b
Not Retrieved       c            d

On these data, a set of standard evaluation measures is computed:

- Precision, defined as $P = \frac{a}{a+b}$
- Recall, defined as $R = \frac{a}{a+c}$
- F-measure, which is a standard combination of precision and recall [Van Rijsbergen, 1979] depending on a parameter $\alpha$, and defined as $F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$

We used the standard value $\alpha = 0.5$, which gives the same importance to precision and recall (the F-measure is then the harmonic mean of the two values).
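As an illustration, the three measures can be computed directly from the cells of the contingency table. The function names are ours, and returning 0 when a denominator is empty is an assumed convention that the article does not specify.

```python
def precision(a, b):
    """P = a / (a + b); 0.0 by convention when nothing is retrieved."""
    return a / (a + b) if (a + b) else 0.0


def recall(a, c):
    """R = a / (a + c); 0.0 by convention when there is no relevant document."""
    return a / (a + c) if (a + c) else 0.0


def f_measure(p, r, alpha=0.5):
    """F = 1 / (alpha / P + (1 - alpha) / R); 0.0 by convention if P or R is 0."""
    if p == 0 or r == 0:
        return 0.0
    return 1.0 / (alpha / p + (1 - alpha) / r)
```

For a = 30, b = 10 and c = 20, this gives P = 0.75, R = 0.6 and, with $\alpha = 0.5$, F = 2/3, which is indeed the harmonic mean of 0.75 and 0.6.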

Following the TREC Filtering tracks [Hull and Robertson, 1999, Robertson and Soboroff, 2002] and the TDT 2004 Adaptive tracking task [Fiscus and Wheatley, 2004], we also consider the linear utility, defined as

$u = w_1 \times a - w_2 \times b$

where $w_1$ is the importance given to a relevant document retrieved and $w_2$ is the cost of a not relevant document retrieved.

Linear utility is bounded positively (to 1 for a perfect filtering, once divided by $u_{max}$), but unbounded negatively (negative values depend on the number of relevant documents for a profile). Hence, the average value on all profiles would give too much importance to the few profiles on which a system would perform poorly. To be able to average the value, the measure is scaled as follows:

$u_n = \frac{\max\left(\frac{u}{u_{max}}, u_{min}\right) - u_{min}}{1 - u_{min}}$

where $u_{max}$ is the maximum value of the utility and $u_{min}$ a parameter considered to be the minimum utility value under which a user would not even consider the following documents for the profile.

In the INFILE campaign, we used the values $w_1 = 1$, $w_2 = 0.5$, $u_{min} = -0.5$, $u_{max} = a + c$ (same as in TREC 2002).

From the Topic Detection and Tracking campaigns [NIST, 1998], other measures are also considered:

- The estimated probability of missing a relevant document, defined as $P_{miss} = \frac{c}{a+c}$
- The estimated probability of raising a false alarm on a non-relevant document, defined as $P_{false} = \frac{b}{b+d}$
- The detection cost, defined as

  $c_{det} = c_{miss} \times P_{miss} \times P_{topic} + c_{false} \times P_{false} \times (1 - P_{topic})$

  where
  - $c_{miss}$ is the cost of a missed document,
  - $c_{false}$ is the cost of a false alarm,
  - $P_{topic}$ is the a priori probability that a document is relevant to a given profile.

In the INFILE campaign, we used the values $c_{miss} = 10$, $c_{false} = 0.1$ and $P_{topic} = 0.001$ (according to an estimation of the average ratio of relevant documents in the corpus).

run2G     IMAG   eng-eng   all
run5G     IMAG   eng-eng   all
runname   IMAG   eng-eng   all

Table 1: Submitted runs in the INFILE campaign

results        prec    recall   F_0.5   util_1_0.5_-0.5   det_10_0.1
run2G.eval     0.298   0.056    0.082   0.300             0.009
run5G.eval     0.298   0.324    0.231   0.362             0.006
runname.eval   0.362   0.052    0.071   0.307             0.009

Table 2: Results of the INFILE campaign
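The scaled utility and the detection cost defined in this section can be sketched as two short functions using the campaign's parameter values as defaults. The function names are ours. Two points are assumptions rather than statements from the article: $u_{max}$ is written as $w_1 (a + c)$, which reduces to $a + c$ for $w_1 = 1$, and the probabilities are set to 0 when their denominator is empty.

```python
def scaled_utility(a, b, c, w1=1.0, w2=0.5, u_min=-0.5):
    """Scaled linear utility u_n for one profile."""
    u = w1 * a - w2 * b       # linear utility
    u_max = w1 * (a + c)      # equals a + c for w1 = 1, as in the text
    if u_max == 0:
        raise ValueError("u_n is undefined for a profile with no relevant document")
    return (max(u / u_max, u_min) - u_min) / (1 - u_min)


def detection_cost(a, b, c, d, c_miss=10.0, c_false=0.1, p_topic=0.001):
    """Detection cost c_det for one profile."""
    p_miss = c / (a + c) if (a + c) else 0.0    # missed relevant documents
    p_false = b / (b + d) if (b + d) else 0.0   # false alarms
    return c_miss * p_miss * p_topic + c_false * p_false * (1 - p_topic)
```

With these defaults, a perfect run obtains $u_n = 1$ and $c_{det} = 0$, while a system that retrieves nothing for a profile obtains $u_n = 1/3$ and $c_{det} = 0.01$; these two reference points may help when reading the corresponding columns of a results table.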

To compute average scores, the values are first computed for each profile and then averaged. Another way of averaging would be to sum up the values for all profiles in each cell of the contingency table and compute the scores on the resulting table. The first method is preferred because it equalizes the contribution of the profiles, whose differences are supposed to be the main source of variance in measures.
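The difference between the two averaging methods can be made concrete with a small sketch. The function names are ours, each profile is represented by its contingency cells `(a, b, c, d)`, and precision is used as the example score.

```python
def precision(a, b, c, d):
    """Precision from the four contingency cells (0.0 if nothing is retrieved)."""
    return a / (a + b) if (a + b) else 0.0


def macro_average(tables, score):
    """First method: score each profile, then average the scores."""
    return sum(score(*t) for t in tables) / len(tables)


def micro_average(tables, score):
    """Second method: sum each cell over all profiles, then score once."""
    a, b, c, d = (sum(cells) for cells in zip(*tables))
    return score(a, b, c, d)
```

For the two profiles `(1, 0, 0, 10)` and `(10, 90, 0, 0)`, the per-profile precisions are 1.0 and 0.1, so the first method gives 0.55, while the second gives 11/101 (about 0.11) because the profile with many retrieved documents dominates the summed table.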

In order to measure the adaptivity of the systems, the measures are also computed at different times in the process, every 10,000 documents, and an evolution curve of the different values across time is presented.

Additionally, we proposed the following two experimental measures. The first one is an originality measure, defined as a comparative measure corresponding to the number of relevant documents the system uniquely retrieves (among participants). It gives more importance to systems that use innovative and promising technologies that retrieve "difficult" documents. Since we had too few runs, this measure is not really relevant.
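One possible reading of the originality measure is sketched below. It is an assumption-laden illustration: the article compares participants, whereas the sketch compares arbitrary named runs, and the function name and input layout (a dictionary from run name to the set of relevant documents that run retrieved) are ours.

```python
def originality(relevant_retrieved):
    """Number of relevant documents that each run is the only one to retrieve."""
    scores = {}
    for run, docs in relevant_retrieved.items():
        # Union of the relevant documents retrieved by all the other runs.
        others = set().union(
            *(d for r, d in relevant_retrieved.items() if r != run)
        )
        scores[run] = len(set(docs) - others)
    return scores
```

For runs A = {1, 2, 3}, B = {2, 3, 4} and C = {3}, the scores are 1, 1 and 0: document 1 is found only by A, document 4 only by B, and C finds nothing that the others miss.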

The second one is an anticipation measure, designed to give more interest to systems that can find the first document in a given profile. This measure is motivated in competitive intelligence by the interest of being at the cutting edge of a domain, and not missing the first information to be reactive. It is measured by the inverse rank of the first relevant document detected (in the list of the documents), averaged on all profiles. The measure is similar to the mean reciprocal rank (MRR) used for instance in Question Answering Evaluation [Voorhees, 1999], but is not computed on the ranked list of retrieved documents but on the chronological list of the relevant documents.
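The anticipation measure can be sketched as follows, under two assumptions that the article does not state: ranks start at 1, and a profile for which the system detects no relevant document contributes 0 to the average. The function name and input layout are ours.

```python
def anticipation(chronological_relevant, retrieved):
    """Mean inverse rank of the first detected relevant document.

    chronological_relevant: profile id -> list of relevant document ids,
        in order of arrival in the stream.
    retrieved: profile id -> set of document ids the system associated
        with that profile.
    """
    total = 0.0
    for profile, docs in chronological_relevant.items():
        hits = retrieved.get(profile, set())
        rank = next((i for i, d in enumerate(docs, start=1) if d in hits), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(chronological_relevant)
```

If a system first detects the second relevant document of one profile, the first relevant document of another, and nothing for a third, the measure is (0.5 + 1 + 0) / 3 = 0.5.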

5 Overview of the results

During the development of the campaign, around 10 teams indicated their intent to participate in the INFILE track. Unfortunately, only one participant actually submitted runs, the IMAG team, which submitted 3 runs, in monolingual English filtering. Table 1 presents the runs and Table 2 presents the results on the runs, using the metrics described in the previous section, averaged on all queries. More precise results are available in individual results.

6 Conclusion

The INFILE campaign has been organized for the first time this year as a pilot track of CLEF, to evaluate cross-language adaptive filtering systems. The campaign followed the TREC 2002 Adaptive [...] feedback. Due to delays in the implementation of this setup, the campaign was postponed to July. Only one team participated in the campaign, which at least validated the viability of the interactive approach chosen. For the future of this track, it has to be verified whether the complexity of the protocol is the element that discouraged participants, or whether it was the lack of information or communication around this evaluation, or the lack of interest in the subject.

References

[Fiscus and Wheatley, 2004] Fiscus, J. and Wheatley, B. (2004). Overview of the TDT 2004 evaluation and results. In TDT'02. NIST.

[Hull and Robertson, 1999] Hull, D. and Robertson, S. (1999). The TREC-8 filtering track final report. In Proceedings of the Eighth Text REtrieval Conference (TREC-8). NIST.

[NIST, 1998] NIST (1998). The topic detection and tracking phase 2 (TDT2) evaluation plan. http://www.nist.gov/speech/tests/tdt/1998/doc/tdt2.eval.plan.98.v3.7.pdf.

[Robertson and Soboroff, 2002] Robertson, S. and Soboroff, I. (2002). The TREC 2002 filtering track report. In Proceedings of The Eleventh Text Retrieval Conference (TREC 2002). NIST.

[Soboroff and Robertson, 2002] Soboroff, I. and Robertson, S. (2002). Building a filtering test collection for TREC 2002. In Proceedings of The Eleventh Text Retrieval Conference (TREC 2002). NIST.

[Van Rijsbergen, 1979] Van Rijsbergen, C. (1979). Information Retrieval. Butterworths, London.

[Voorhees, 1999] Voorhees, E. (1999). The TREC-8 question answering track report. In Proceedings of the Eighth Text REtrieval Conference (TREC-8). NIST.

[Yang et al., 2005] Yang, Y., Yoo, S., Zhang, J., and Kisiel, B. (2005). Robustness of adaptive filtering methods in a cross-benchmark evaluation. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 98-105, Salvador, Brazil.
