Italian Monolingual Information Retrieval with PROSIT

(1)

PROSIT

GianniAmati

FondazioneUgoBordoni

gba@fub.it

ClaudioCarpineto

carpinet@fub.it

GiovanniRomano

romano@fub.it

Abstract

PROSIT(PRObabilisticSiftofInformationTerms)isanovelprobabilisticinforma-

tion retrievalsystemthat combinesaterm-weightingmo delbased ondeviation from

randomness withinformation-theoreticquery expansion. PROSIT hasdemonstrated

to b e highlyeective at retrievingwebdo cumentsattherecent TREC-10andaver-

sion of the system is currently available as the search engine of the web site of the

Italian Ministry of Communications (http://www.comunicazioni.it). In this pap er,

we rep ort on the application of PROSIT to the Italian monolingualtask at CLEF.

We exp erimented with b oth standard PROSIT and with someenhanced versions of

it. Inparticular, we investigatedtheuse ofbigrams andco ordination levelmatching

within the PROSIT framework. The main nding of ourresearch are that (i) stan-

dard PROSITwas quiteeective,with anaverage precision of0.5116onCLEF2001

queriesand0.5019onCLEF2002queries,(ii)bigramswereusefulprovidedthatthey

wereincorp oratedintothemainalgorithm,and(iii)theb enetsofco ordinationlevel

matchingwereunclear.

1 Introduction

Recent research has shown the eectiveness of deviation fromrandomness [2] and information-

theoretic query expansion [6]. We combined these techniques into a comprehensive do cument

rankingsystemnamedPROSIT,whichstandsforPRObabilisticSiftofInformationTerms.

Theb estfeaturesofPROSITarethat itisfast,itcan b eeasilyundersto o dand replicated,it

do es notvirtuallyrequiretrainingorparametertuning,it do es notemployanyadho clinguistic

manipulations,and,p erhapsevenmoreimp ortant,ithasshowntob e surprisinglyeective.

AnearlierversionofPROSITwastestedatthewebtrackofthelastTREC10[1],whereitwas

rankedastherstsystemforthetopicrelevancetask. Subsequently,we develop edawebversion

ofPROSITtoactasthesearch engineofthewebsiteoftheItalianMinistryofCommunications

(http://www.comunicazioni.it). Thesearch engine hasb een runningon thesite since theendof

July2002.

Thestudyrep ortedinthispap erisafollow-uponthispreviousresearch,aimingatextending

the scop e of applications of PROSIT and improvingits p erformance. As the site search engine

basedonPROSITreceivesqueriestypicallyexpressedinItalianandretrievesrelevantdo cuments

mostlyfromItalian web pages, we were particularly interested inevaluatingtheeectiveness of

PROSITfortheItalianlanguage. ThiswastherstgoalofourparticipationinCLEF.

In addition, as PROSIT p erforms single-word indexing and do es not take advantage of the

structureofqueriesanddo cuments,weintendedtoinvestigatetheuseofmulti-wordindexingand

textstructure toimproveitsp erformance. Thiswas oursecondmaingoal.

Intherestofthepap er,we rstdescrib e theindexing stageandthetwomaincomp onentsof

PROSIT.ThenwediscusshowtoenhancePROSITwithbigramsandco ordinationlevelmatching.

Finally,we present thep erformanceresults anddrawsomeconclusions.

(2)

Our system rst identied the individual words o ccurring in the do cuments, considering only

the admissible sections and ignoring punctuation and case. The system then p erformed word

stemmingandwordstopping.

Similarto earlier results rep orted at CLEF 2001,we found that stemmingimproves p erfor-

mance. We usedasimpleformofstemmingforconatingsingularandpluralwords tothesame

ro ot [8]. Itis likelythat theuse of amoresophisticated stemmersuch as theItalian versionof

Porter'swouldpro duce b etterresults, butwe didnothaveachancetotryit. Toremovecommon

words,weusedthestoplistprovidedbySavoy[8].

We didnot useany adho c linguistic manipulationsuch as removingcertain words fromthe

querytext,e.g.,\trova"(nd),\do cumenti"(do cuments),etc.,orexpandingacronyms,e.g.,do es

AIstandforAmnestyinternationalor Articialintelligence?,or usinglists ofprop er nouns, e.g,

Alb ertoTomba.

The useof such manipulations makesit diÆcultto evaluate theoverall results and makes it

even morediÆcultto replicate theexp eriments. We thinkthat itwouldb e b etter todiscourage

theiruse unlessit is supp ortedbysome linguisticresources which are publicor whichare made

availablebytheauthors.

3 Description of PROSIT

PROSITconsistsoftwomaincomp onents: theretrieval{matchingfunctionmo duleandtheauto-

maticquery expansion mo dule. Thesystemhasb een implementedinESL, aLisp-likelanguage

thatisautomaticallytranslatedintoANSICandthencompiledbygcccompiler. PROSITisable

toindextwogigabytesofdo cumentsp erhourandallowssub-secondssearchesona550MHzPen-

tiumI I Iwith256megabytesofRAMrunningLinux.PROSIThasb eenreleasedasanapplication

for searching do cumentson WEB collection. In thefollowingtwosections we explain indetails

thetwomaincomp onentsofthesystem.

3.1 Term weighting

PROSIT can implement dierent matching functions which have similar and excellent p erfor-

mance. These basic retrieval functions are generatedfrom auniqueprobabilisticframework [2].

The app ealing feature of the basic term weighting mo dels is the absence of parameters which

should b elearned andtuned bymeansof arelevancetrainingset. Inaddition,theframework is

easytoimplementsincethemo delsuseupatotalnumb erof6randomvariableswhich aregiven

bythecollectionstatistics,namely:

tf thewithindo cumenttermfrequency

N thesize ofthecollection

n

t

thesize oftheelitesetE

t

oftheterm(see b elow)

F

t

thetotalnumb eroftermo ccurrences initseliteset

l thelengthofthedo cument

av g l theaveragelengthofdo cuments

However,forb othTREC-10andCLEFcollectionswehaveintro duced aparameterfordo cument

lengthnormalizationwhichenhancestheretrievaloutcome.

Theframeworkisbasedoncomputingtheinformationgainforeachtermquery. Theinforma-

tiongainisobtainedbyacombinationof threedistinct probabilisticpro cesses: theprobabilistic

pro cess computingtheamountofthe informationcontent oftheterm withresp ect tothe entire

collection,theprobabilisticpro cess computingaconditionalprobabilityofo ccurrence oftheterm

with resp ect to an \Elite" set of do cuments, which is the set of do cuments containg the query

term,andtheprobabilisticpro cess derivingthetermfrequencywithinthedo cumentnormalized

(3)

normalization factor comp onent relative to the elite set of the observed term, and the \term

frequency normalizationfunction"comp onentrelativetothedo cumentlength.

Ourformulationofinformativegainis:

w= (1 Pr ob

1

)( l og

2 Pr ob

2

) (1)

Inourexp eriments we have instantiated ourframeworkcho osingLaplace's lawof succession for

Pr ob

1

as thegainnormalizationfactorandtheBose-Einsteinsatistics forPr ob

2 :

Pr ob

1

=

tfn

tfn+1

(2)

Pr ob

2

= log

2

1

1+

tfnlog

2

1+

(3)

where:

isthetermfrequency inthecollection F

t

N

tfn isthenormalizedtermfrequency tflog

2

1+ av g l

l

.

The correcting factor c = 3 may b e inserted to the term frequency normalization and obtain

tfn = tflog

2

1+

cav g l

l

. The system displayedin Formulas1, 2 and 3 was used in our

exp erimentsandiscalledB

E L2(B

E

stands forBose-EinsteinandLforLaplace).

3.2 Retrieval feedback

Formulas1,2and3pro ducearstrankingofp ossiblerelevantdo cuments. Thetopmostonesare

candidates to b e assessed relevant and therefore we mightconsider them toconstitute asecond

dierent \Eliteset T of do cuments", namelydo cumentswhich b est describ e the content of the

query. We have considered inour exp erimentsonly 3 do cumentsas pseudo-relevant do cuments

and extracted fromthem the rst 10 most informative termswhich were added to theoriginal

query. The most informative terms are selected by using the information-theoretic Kullback-

Leiblerdivergencefunction:

KL= flog

2 f

p

(4)

where:

f isthetermfrequencyinthesetT,

p istheprior,e.g. thetermfrequency inthecollection.

Once the computationof the Kullback{Leibler values KL are obtained and the new terms

selected, wecombinetheinitialtermfrequency ofthetermwithinthequery(p ossiblyequalto0)

withthescoreKLasfollows:

tfq

exp

=

tfq

Max

tfq +

KL

Max

KL

(5)

where:

Max

tfq

isthemaximumnumb erofo ccurrences of aterminthequery

Max

KL

isthehighestKL valueintheset T

(4)

w= tfq

exp

(1 Pr ob

1

)( l og

2 Pr ob

2

) (6)

wherePr ob

1

andPr ob

2

wheredenedas therstretrieval,that isaccording Formulas2and3.

4 Augmenting PROSIT with bigrams

PROSIT, like most information retrieval systems, is based on index units consisting of single

keywords,orunigrams. We attemptedtoimproveitsp erformancebyusingtwo-wordindexunits

(bigrams).

Bigramsarego o dfordisambiguatingtermsandforhandlingtopic drift,i.e.,whentheresults

of queries on sp ecic asp ects of wide topics containdo cumentsthat are relevant to the general

topicbutnottotherequestedasp ectofit. Thisphenomenoncanalsob eseenassomequeryterms

matching out of context of their relationships to other terms [4]. For instance, using unigrams

mostreturneddo cumentsfortheCLEFquery\Kaurismakilms"wereab outotherfamouslms,

whereas theuseofbigramsconsiderablyimprovedtheprecisionofsearch.

Ontheotherhand,somebigramsthataregeneratedautomaticallymay,inturn,over-emphasize

concepts that are commonto b oth relevant and nonrelevant do cuments[9]. So far, the results

ab outtheeectiveness ofbigramsversusunigrams havenotb een conclusive.

WeusedasimpletechniqueknownaslexicalaÆnities. LexicalaÆnitiesareidentiedbynding

pairsofwords thato ccur closeto each otherinawindowof somepredenedsmallsize. For the

CLEF exp eriments, we used the query title and chose a distance of 5words. Allthe bigrams

generated this way are seen as new index units and are used to increase the relevance of those

do cuments that have the samepair of words o ccurring within thesp ecied window. The score

assignedtobigramsiscomputedusingthesameweightingfunctionusedforunigrams.

Fromanimplementationp ointofview,inorderto eÆcientlycomputethebigramscoresit is

necessary to enco de the informationab out thep osition of each termin each do cument into the

invertedle. Duringquery evaluation,foreachbigramextracted fromthequery,thep ostinglists

asso ciatedwiththebigramwordsintheinvertedlearemergedandanewpseudop ostinglistis

created that containsall do cumentsthat contain thebigramalong with therelevanto ccurrence

information.

The lexical aÆnitytechnique was rep orted to pro duce very go o d results on the web TREC

collection, even b etter than those obtained using unigrams [5]. However, we were not able to

obtainsuchgo o dresults ontheCLEFcollection. Infact,we foundthatthebigramp erformance

was considerably worse than the unigram p erformance; even when combining the scores, the

p erformanceremainedlowerthanthat obtainablebyusingjustunigramscores.

Based onthese observations,we decidedto usebigramsinaddition to,notinplaceof,single

words. Second,instead ofrunningtwoseparaterankingsystems,one forunigramsand theother

for bigrams, and then combining their scores, we tried to incorp orate the bigram comp onent

directlyintoPROSIT's mainalgorithm.

Thebigramscoreswerethuscombinedwiththeunigramscoretopro ducetherst-passranking

of PROSIT. In this way one can hop e to increase the quality of the do cuments on which the

subsequentqueryexpansionstepisbased. Thismayhapp enb ecausemoretoprelevantdo cuments

areretrievedorb ecausethenonrelevantdo cumentswhichcontributetoqueryexpansionaremore

similartothequery.

After the rst rankingwas computed using unigramand bigram scores, the top do cuments

wereusedtogeneratetheexpandedqueryandPROSITcomputedthesecondrankingasifitwere

just usingunigrams. We chose tonotexpand theoriginalquery with two-wordunits due to the

dimensionalityproblem,and we didnot usethe bigrammetho dduring the second-pass ranking

ofPROSITb ecausetheorderofwords intheexpandedqueryis notrelevant.

WesubmittedoneruntoCLEF2002,lab eLLedas\fub02l",whichwaspro ducedusingPROSIT

(5)

CLEF2001 0.5116 0.5208 0.5127 0.5223

CLEF2002 0.5019 0.5088 0.4872 0.4947

Table1: Retrievalp erformanceofPROSITanditsvariants.

It should alsob e noted thatwe exp erimentedwith othertyp es ofmulti-wordindex units,by

usingwindowsofdierentsizeandbyselectingalargernumb erofwords. However,wefoundthat

usingjust twowords withawindowof size5wastheoptimalchoice.

5 Reranking PROSITresults usingcoordination level match-

ing

Consistentwithearliereectivenessresults,mostinformationretrievalsystemsarebasedonb est-

matchingalgorithmsb etween queryanddo cuments.

However, the use of very short queries on the part of most of the users and the prevailing

interest in precision rather than recall have fostered new research on exact matching retrieval,

seenas analternativeor asacomplementarytechniquetotraditionalb est-matching retrieval. In

particular,ithasb eenshownthattakingintoaccountthenumb erofquerywordsmatchedbythe

do cumentstorerankretrievalresultsmayimprovep erformanceincertainsituations(e.g.,[7],[3]).

Forthe CLEFexp eriments,we fo cusedon thequerytitle. Thegoalwasto prefer do cuments

thatmatchedallofthequerykeywordsab ovedo cumentsthatmatchedallbutoneofthekeywords,

andso on.

Toimplementthis strategy,wemo diedthestandardb est-matchingsimilarityscore b etween

query and do cuments,computed as explained inSection 2, byadding amuch larger addendum

toitwhichwasprop ortionaltothenumb eroftermssharedbythedo cumentandthequery title.

Inthisway,thedo cumentswerepartiallyordered accordingtotheirco ordinationlevelmatching

withthequerytitle,withtiesb eingbrokenusingtheirb est-matchingsimilarityscoretothequery.

However,theresultsweresomewhatdisapp ointing. Weobtainedamuchb etterretrievaleec-

tivenessbysimplypreferringthedo cumentsthatcontainedallthewordsofthequerytitle,without

payingattentionto lowerlevelsofco ordinationmatching.This wasourchoice(run fub02b).

Finally,wesubmittedafourthrunbyusingthefullyenhancedversionofPROSIT,i.e.,bigrams

+co ordinationlevelmatching(runfub02lb)

6 Results

WetestedPROSITanditsthreevariants(i.e.,PROSITwithbigrams,PROSITwithco ordination

level matching,and PROSIT withb oth bigramsand co ordinationlevel matching)ontheCLEF

2001 andCLEF 2002 Italianmonolingualtasks. Table1showsthe retrieval p erformanceof the

foursystems onthetwotestcollections usingtheaverageprecision asevaluationmeasure.

Theresults ofTable1showthat thep erformanceofstandardPROSITwasexcellent onb oth

testcollections,withthevalueobtainedforCLEF2001(0.5116)b eingmuchhigherthantheresult

oftheb estsystematCLEF2001 (0.4865). Thisresult isaconrmationofthehigheectiveness

oftheprobabilisticrankingmo delimplementedinPROSIT,whichisexclusivelybasedonsimple

do cumentandquery statistics.

Table 1also shows that, in general, the variationsin p erformance when passingfrom basic

PROSIT to enhanced PROSIT were small. More in particular, the use of bigrams improved

p erformance across b oth test collections, whereas the use of co ordination level matching was

slightlyb enecialforCLEF2001anddetrimentalforCLEF2002.

Combiningb oth enhancementsimprovedtheretrievalp erformanceoverusing CLMalone on

b othtestcollectionsbutitwasstillworsethanbaselinep erformanceontheCLEF2002collection

(6)

Overall, the results ab out theenhanced versionsof PROSIT are inconclusive. More work is

needed to collect further evidence ab outtheir eectiveness, e.g., byusing amore representative

sampleofp erformancemeasuresorbyconsideringotherqueryscenarios.

Besides more robust evaluation of retrieval p erformance, it would b e useful ab etter under-

standingofwhytheuseofbigramsintoPROSIT'smainalgorithmyielded p ositive resultsinthe

exp erimentsrep ortedinthispap er. Thismightb edone,forinstance,byanalysingthevariations

on qualityof the top ranked do cumentsused for query expansion or by p erforming aquery by

queryanalysis ofconcept driftinthenal retrieveddo cuments.

7 Conclusions

Wehaveexp erimentedwiththePROSITsystemontheItalianmonolingualtaskandhaveexplored

theuseofbigramsand co ordinationlevelmatchingwithinPROSIT'smainalgorithm. Fromour

exp erimentalevaluation,thefollowingmainconclusionscanb edrawn.

Thenovelprobabilisticmo delimplementedinPROSITachievedhighretrievaleectiveness

onb oth theCLEF 2001andCLEF2002 Italianmonolingualtasks. Theseresults are even

moreremarkableconsideringthat thesystememploysvery simpleindexingtechniques and

do es notrelyonanysp ecialised oradho cnaturallanguagepro cessingtechniques.

Usingbigramsintheplaceofunigramshurtp erformance;thecombinationofbigramscores

andunigramscoresp erformedb etterbutitwasstillinferiortotheresultsobtainedbyusing

unigramsalone. However,usingthebigramscoresintherst-passranking,just torankthe

do cumentsusedforquery expansion,resultedinap erformanceimprovement.Theseresults

held acrossb othtestcollections.

Usingco ordinationlevelmatchingtoreranktheretrievalresultsdidnot,ingeneral,improve

p erformance Favouringthe do cuments that contained all the keywords in the query title

workedb etterononetestcollectionandworseontheothercollection,whereasorderingthe

do cumentsaccordingto theirlevelofco ordinationmatchinghurtp erformanceonb othtest

collections.

We regret that due to tight schedule we were not able to test PROSIT on the other CLEF

monolingualtasks. However,astheapplicationofPROSITtotheItaliantaskdidnotrequireany

sp ecialwork,wearecondentthatwithasmalleortwecouldobtainsimilarresultsfortheother

languages. Thisisleftforfuture work.

References

[1] Gianni Amati, Claudio Carpineto, and Giovanni Romano. FUB at TREC 10 web track:

a probabilistic framework for topic relevance term weighting. In E.M. Vo orhees and D.K.

Harman, editors, In Proceedings of the 10th Text Retrieval Conference TREC 2001, pages

182{191,Gaithersburg,MD,2002.NISTSp ecial Pubblication500-250.

[2] GianniAmatiandCornelisJo ostvanRijsb ergen. Probabilisticmo delsofinformationretrieval

basedonmeasuringdivergencefromrandomness.ACMTransactionsonInformationSystems,

(toapp ear), 2002.

[3] E.Berenci,C.Carpineto,V.Giannini,S.Mizzaro. Eectivenessofkeyword-baseddisplayand

selectionofretrievalresultsforinteractivesearches.InternationalJournalOnDigitalLibraries,

3(3):249-260,2000.

[4] D. Bo do, A. Kambil. Partial co ordination. I. The b est of pre-co ordination and p ost-

(7)

Exp erimentswithindexpruning. InE.M.Vo orheesandD.K.Harman,editors,InProceedings

of the 10th Text Retrieval Conference TREC 2001, pages228-236,Gaithersburg, MD,2002.

NISTSp ecialPubblication500-250.

[6] C. Carpineto, R. De Mori, G. Romano,and B. Bigi. An informationtheoretic approach to

automaticqueryexpansion. ACM Transactionson InformationSystems,19(1):1{27,2001.

[7] C.Clarke,G.Cormack,E.Tudhop e,(1997). Relevancerankingforonetothreetermqueries.

Proceedings of RIAO'97,388-400,1997.

[8] JSavoy. Rep orts onclef-2001exp eriments. InWorking notes ofCLEF,Darmstadt,2001.

[9] C.M Tan, Y.F.Wang,C.DLee. The useof bigramsto enhance text categorization. IP&M,

38(4):529-546,2002.