PROSIT
Gianni Amati
Fondazione Ugo Bordoni
gba@fub.it
Claudio Carpineto
Fondazione Ugo Bordoni
carpinet@fub.it
Giovanni Romano
Fondazione Ugo Bordoni
romano@fub.it
Abstract
PROSIT (PRObabilistic Sift of Information Terms) is a novel probabilistic information
retrieval system that combines a term-weighting model based on deviation from
randomness with information-theoretic query expansion. PROSIT proved highly
effective at retrieving web documents at the recent TREC-10, and a version of the
system is currently available as the search engine of the web site of the Italian
Ministry of Communications (http://www.comunicazioni.it). In this paper, we report
on the application of PROSIT to the Italian monolingual task at CLEF. We experimented
both with standard PROSIT and with some enhanced versions of it. In particular, we
investigated the use of bigrams and coordination level matching within the PROSIT
framework. The main findings of our research are that (i) standard PROSIT was quite
effective, with an average precision of 0.5116 on CLEF 2001 queries and 0.5019 on
CLEF 2002 queries, (ii) bigrams were useful provided that they were incorporated into
the main algorithm, and (iii) the benefits of coordination level matching were unclear.
1 Introduction
Recent research has shown the effectiveness of deviation from randomness [2] and
information-theoretic query expansion [6]. We combined these techniques into a
comprehensive document ranking system named PROSIT, which stands for PRObabilistic
Sift of Information Terms.
The best features of PROSIT are that it is fast, it can be easily understood and
replicated, it requires virtually no training or parameter tuning, it does not employ any
ad hoc linguistic manipulations, and, perhaps even more important, it has proved to be
surprisingly effective.
An earlier version of PROSIT was tested at the web track of TREC-10 [1], where it was
ranked as the first system for the topic relevance task. Subsequently, we developed a web
version of PROSIT to act as the search engine of the web site of the Italian Ministry of
Communications (http://www.comunicazioni.it). The search engine has been running on the
site since the end of July 2002.
The study reported in this paper is a follow-up on this previous research, aiming at
extending the scope of applications of PROSIT and improving its performance. As the site
search engine based on PROSIT receives queries typically expressed in Italian and
retrieves relevant documents mostly from Italian web pages, we were particularly
interested in evaluating the effectiveness of PROSIT for the Italian language. This was
the first goal of our participation in CLEF.
In addition, as PROSIT performs single-word indexing and does not take advantage of the
structure of queries and documents, we intended to investigate the use of multi-word
indexing and text structure to improve its performance. This was our second main goal.
In the rest of the paper, we first describe the indexing stage and the two main components
of PROSIT. Then we discuss how to enhance PROSIT with bigrams and coordination level
matching. Finally, we present the performance results and draw some conclusions.
2 Indexing
Our system first identified the individual words occurring in the documents, considering
only the admissible sections and ignoring punctuation and case. The system then performed
word stemming and word stopping.
Similar to earlier results reported at CLEF 2001, we found that stemming improves
performance. We used a simple form of stemming for conflating singular and plural words to
the same root [8]. It is likely that the use of a more sophisticated stemmer, such as the
Italian version of Porter's, would produce better results, but we did not have a chance to
try it. To remove common words, we used the stop list provided by Savoy [8].
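The indexing steps described above (case folding, punctuation removal, stopping and
stemming) can be illustrated with a minimal Python sketch. It is not the PROSIT code: the
stop list is a tiny hypothetical sample rather than Savoy's list [8], and `toy_stem` is a
made-up rule that only mimics the idea of conflating singular and plural forms. The sketch
also filters stop words before stemming so that the list is matched against surface forms,
which is an assumption about the ordering.

```python
import re

# Hypothetical mini stop list for illustration only (the paper uses Savoy's list [8]).
STOP_WORDS = {"il", "la", "di", "e", "i", "le", "in"}


def toy_stem(word):
    """Toy rule, NOT the stemmer of [8]: drop one final vowel from longer words,
    so that e.g. 'documento' and 'documenti' share the root 'document'."""
    if len(word) > 3 and word[-1] in "aeio":
        return word[:-1]
    return word


def index_terms(text, stop_words=STOP_WORDS, stem=toy_stem):
    """Lower-case the text, keep alphabetic tokens only, drop stop words, stem the rest."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters, punctuation ignored
    return [stem(w) for w in words if w not in stop_words]


print(index_terms("I documenti, il Documento!"))
```

Both the stop list and the stemmer are passed as parameters, so a real stop list or a
Porter-style Italian stemmer could be substituted without changing the pipeline.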
We did not use any ad hoc linguistic manipulation such as removing certain words from the
query text, e.g., "trova" (find), "documenti" (documents), etc., or expanding acronyms
(e.g., does AI stand for Amnesty International or Artificial Intelligence?), or using lists
of proper nouns, e.g., Alberto Tomba.
The use of such manipulations makes it difficult to evaluate the overall results and makes
it even more difficult to replicate the experiments. We think that it would be better to
discourage their use unless it is supported by linguistic resources which are public or
which are made available by the authors.
3 Description of PROSIT
PROSIT consists of two main components: the retrieval-matching function module and the
automatic query expansion module. The system has been implemented in ESL, a Lisp-like
language that is automatically translated into ANSI C and then compiled by the gcc compiler.
PROSIT is able to index two gigabytes of documents per hour and allows sub-second searches
on a 550 MHz Pentium III with 256 megabytes of RAM running Linux. PROSIT has been released
as an application for searching documents on web collections. In the following two sections
we explain in detail the two main components of the system.
3.1 Term weighting
PROSIT can implement different matching functions which have similar and excellent
performance. These basic retrieval functions are generated from a single probabilistic
framework [2]. The appealing feature of the basic term weighting models is the absence of
parameters which should be learned and tuned by means of a relevance training set. In
addition, the framework is easy to implement since the models use a total of six random
variables which are given by the collection statistics, namely:
tf      the within-document term frequency
N       the size of the collection
n_t     the size of the elite set E_t of the term (see below)
F_t     the total number of term occurrences in its elite set
l       the length of the document
avg_l   the average length of documents
However, for both the TREC-10 and CLEF collections we have introduced a parameter for
document length normalization which enhances the retrieval outcome.
The framework is based on computing the information gain for each query term. The
information gain is obtained by a combination of three distinct probabilistic processes:
the probabilistic process computing the amount of information content of the term with
respect to the entire collection, the probabilistic process computing a conditional
probability of occurrence of the term with respect to an "Elite" set of documents, which is
the set of documents containing the query term, and the probabilistic process deriving the
term frequency within the document normalized [...] normalization factor component relative
to the elite set of the observed term, and the "term frequency normalization function"
component relative to the document length.
Our formulation of the information gain is:
$$w = (1 - Prob_1) \cdot (-\log_2 Prob_2) \qquad (1)$$
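In our experiments we have instantiated our framework by choosing Laplace's law of
succession for $Prob_1$ as the gain normalization factor and the Bose-Einstein statistics
for $Prob_2$:
$$Prob_1 = \frac{tfn}{tfn + 1} \qquad (2)$$
$$-\log_2 Prob_2 = -\log_2\left(\frac{1}{1+\lambda}\right) - tfn \cdot \log_2\left(\frac{\lambda}{1+\lambda}\right) \qquad (3)$$
where:
$\lambda$ is the term frequency in the collection, $F_t / N$
$tfn$ is the normalized term frequency, $tf \cdot \log_2\left(1 + \frac{avg\_l}{l}\right)$.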
The correcting factor c = 3 may be inserted into the term frequency normalization to
obtain $tfn = tf \cdot \log_2\left(1 + \frac{c \cdot avg\_l}{l}\right)$. The system
displayed in Formulas 1, 2 and 3 was used in our experiments and is called $B_E L2$
($B_E$ stands for Bose-Einstein and L for Laplace).
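To make Formulas 1, 2 and 3 concrete, the following Python sketch computes the weight of
one term in one document. It is only an illustration under stated assumptions: the
function and argument names are ours, the symbol written `lam` is the collection term
frequency F_t / N, and Formula 3 is read as the Bose-Einstein (geometric) approximation
with the signs shown in the code.

```python
import math


def normalized_tf(tf, doc_len, avg_len, c=3.0):
    """tfn = tf * log2(1 + c * avg_l / l), with the correcting factor c."""
    return tf * math.log2(1.0 + c * avg_len / doc_len)


def be_l2_weight(tf, doc_len, avg_len, F_t, N, c=3.0):
    """Term weight w = (1 - Prob_1) * (-log2 Prob_2), Formulas 1 to 3."""
    lam = F_t / N                                  # term frequency in the collection
    tfn = normalized_tf(tf, doc_len, avg_len, c)
    prob1 = tfn / (tfn + 1.0)                      # Laplace's law of succession (2)
    neg_log2_prob2 = (-math.log2(1.0 / (1.0 + lam))
                      - tfn * math.log2(lam / (1.0 + lam)))  # Bose-Einstein (3)
    return (1.0 - prob1) * neg_log2_prob2


# A term occurring twice in an average-length document, 100 times in 10,000 documents.
print(be_l2_weight(tf=2, doc_len=100, avg_len=100, F_t=100, N=10000))
```

Under these assumptions the weight grows with the within-document frequency and, for a
fixed frequency, is higher for terms that are rarer in the collection.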
3.2 Retrieval feedback
Formulas 1, 2 and 3 produce a first ranking of possibly relevant documents. The topmost
ones are candidates to be assessed relevant, and therefore we might consider them to
constitute a second, different "Elite set T of documents", namely documents which best
describe the content of the query. In our experiments we considered only the top 3
documents as pseudo-relevant documents and extracted from them the 10 most informative
terms, which were added to the original query. The most informative terms are selected by
using the information-theoretic Kullback-Leibler divergence function:
$$KL = f \cdot \log_2 \frac{f}{p} \qquad (4)$$
where:
f is the term frequency in the set T,
p is the prior, e.g. the term frequency in the collection.
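The selection step based on Formula 4 can be sketched as follows. This is a hedged
illustration, not the PROSIT code: we assume that f and p are relative frequencies (counts
divided by the total number of tokens in T and in the collection respectively), and the
names `kl_scores`, `select_expansion_terms`, `collection_freq` and `collection_size` are
hypothetical.

```python
import math
from collections import Counter


def kl_scores(top_docs, collection_freq, collection_size):
    """Score every term of the pseudo-relevant set T with KL = f * log2(f / p)."""
    counts = Counter(term for doc in top_docs for term in doc)
    total = sum(counts.values())
    scores = {}
    for term, n in counts.items():
        f = n / total                                # relative frequency in T
        p = collection_freq[term] / collection_size  # prior: frequency in the collection
        scores[term] = f * math.log2(f / p)
    return scores


def select_expansion_terms(scores, k=10):
    """Return the k terms with the highest KL score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


top_docs = [["film", "finlandia", "film"], ["film", "regista"]]
collection_freq = {"film": 50, "finlandia": 5, "regista": 10}
scores = kl_scores(top_docs, collection_freq, collection_size=1000)
print(select_expansion_terms(scores, k=2))
```

With the settings reported above, `top_docs` would hold the 3 top-ranked documents and k
would be 10.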
Once the Kullback-Leibler values KL have been computed and the new terms selected, we
combine the initial frequency of the term within the query (possibly equal to 0) with the
score KL as follows:
$$tfq_{exp} = \frac{tfq}{Max_{tfq}} + \frac{KL}{Max_{KL}} \qquad (5)$$
where:
$Max_{tfq}$ is the maximum number of occurrences of a term in the query
$Max_{KL}$ is the highest KL value in the set T
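The term weight for the second retrieval then becomes:
$$w = tfq_{exp} \cdot (1 - Prob_1) \cdot (-\log_2 Prob_2) \qquad (6)$$
where $Prob_1$ and $Prob_2$ are defined as in the first retrieval, that is, according to
Formulas 2 and 3.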
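Formula 5 can be illustrated with a short Python sketch. The function name and the
dictionaries are hypothetical; we also assume that an original query term without a
selected KL score contributes 0 to the second addend, and that the highest KL value is
positive.

```python
def expanded_query_weights(query_tf, kl):
    """tfq_exp = tfq / Max_tfq + KL / Max_KL for original and expansion terms (5)."""
    max_tfq = max(query_tf.values())
    max_kl = max(kl.values())          # assumed to be > 0
    terms = set(query_tf) | set(kl)
    return {
        t: query_tf.get(t, 0) / max_tfq + kl.get(t, 0.0) / max_kl
        for t in terms
    }


query_tf = {"kaurismaki": 1, "film": 1}   # term frequencies in the original query
kl = {"film": 2.0, "finlandia": 1.0}      # KL scores of the selected terms
weights = expanded_query_weights(query_tf, kl)
print(sorted(weights.items()))
```

Each resulting value plays the role of tfq_exp in Formula 6, i.e. it multiplies the term
weight computed for the second retrieval.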
4 Augmenting PROSIT with bigrams
PROSIT, like most information retrieval systems, is based on index units consisting of
single keywords, or unigrams. We attempted to improve its performance by using two-word
index units (bigrams).
Bigrams are good for disambiguating terms and for handling topic drift, i.e., when the
results of queries on specific aspects of wide topics contain documents that are relevant
to the general topic but not to the requested aspect of it. This phenomenon can also be
seen as some query terms matching out of context of their relationships to other terms [4].
For instance, using unigrams, most returned documents for the CLEF query "Kaurismaki films"
were about other famous films, whereas the use of bigrams considerably improved the
precision of search.
On the other hand, some bigrams that are generated automatically may, in turn,
over-emphasize concepts that are common to both relevant and nonrelevant documents [9]. So
far, the results about the effectiveness of bigrams versus unigrams have not been
conclusive.
We used a simple technique known as lexical affinities. Lexical affinities are identified
by finding pairs of words that occur close to each other in a window of some predefined
small size. For the CLEF experiments, we used the query title and chose a distance of 5
words. All the bigrams generated this way are seen as new index units and are used to
increase the relevance of those documents that have the same pair of words occurring within
the specified window. The score assigned to bigrams is computed using the same weighting
function used for unigrams.
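The extraction of lexical affinities from a token sequence can be sketched as below. This
is an illustration under assumptions of ours: pairs are treated as unordered, a word is
not paired with itself, and "a distance of 5 words" is read as at most 5 positions apart.

```python
def lexical_affinities(tokens, window=5):
    """Return the unordered word pairs that co-occur within `window` positions."""
    pairs = set()
    for i, left in enumerate(tokens):
        # Look ahead at the next `window` tokens only.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                pairs.add(tuple(sorted((left, right))))
    return pairs


print(lexical_affinities(["kaurismaki", "film"]))
```

For a two-word query title the result is a single pair; longer titles yield every pair of
distinct words that falls inside the window.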
From an implementation point of view, in order to efficiently compute the bigram scores it
is necessary to encode the information about the position of each term in each document
into the inverted file. During query evaluation, for each bigram extracted from the query,
the posting lists associated with the bigram words in the inverted file are merged and a
new pseudo posting list is created that contains all documents that contain the bigram,
along with the relevant occurrence information.
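The merge of two positional posting lists into a pseudo posting list can be sketched as
follows. The data layout (a dictionary from document identifier to a list of word
positions) and the choice of storing a co-occurrence count per document are assumptions
made for the example; the actual inverted file format of PROSIT is not described here.

```python
def bigram_postings(postings_a, postings_b, window=5):
    """Merge two positional posting lists into a pseudo posting list for the bigram.

    Each input maps doc_id -> list of word positions. The result maps doc_id -> number
    of position pairs of the two words that lie within `window` words of each other.
    """
    merged = {}
    for doc_id in postings_a.keys() & postings_b.keys():  # documents with both words
        count = sum(
            1
            for pa in postings_a[doc_id]
            for pb in postings_b[doc_id]
            if 0 < abs(pa - pb) <= window
        )
        if count:
            merged[doc_id] = count
    return merged


postings_a = {1: [0, 10], 2: [3]}
postings_b = {1: [2, 30], 3: [1]}
print(bigram_postings(postings_a, postings_b))
```

The resulting counts can then be used as the within-document frequency of the bigram when
the unigram weighting function is applied to it. The nested loop is quadratic in the
number of positions; a linear merge over sorted positions would be preferable for long
documents.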
The lexical affinity technique was reported to produce very good results on the web TREC
collection, even better than those obtained using unigrams [5]. However, we were not able
to obtain such good results on the CLEF collection. In fact, we found that the bigram
performance was considerably worse than the unigram performance; even when combining the
scores, the performance remained lower than that obtainable by using just unigram scores.
Based on these observations, we first decided to use bigrams in addition to, not in place
of, single words. Second, instead of running two separate ranking systems, one for unigrams
and the other for bigrams, and then combining their scores, we tried to incorporate the
bigram component directly into PROSIT's main algorithm.
The bigram scores were thus combined with the unigram score to produce the first-pass
ranking of PROSIT. In this way one can hope to increase the quality of the documents on
which the subsequent query expansion step is based. This may happen because more relevant
documents are retrieved at the top or because the nonrelevant documents which contribute to
query expansion are more similar to the query.
After the first ranking was computed using unigram and bigram scores, the top documents
were used to generate the expanded query and PROSIT computed the second ranking as if it
were just using unigrams. We chose not to expand the original query with two-word units due
to the dimensionality problem, and we did not use the bigram method during the second-pass
ranking of PROSIT because the order of words in the expanded query is not relevant.
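The two-pass procedure described above can be summarised by the control-flow sketch below.
The three callables are placeholders supplied by the caller, and summing the unigram and
bigram scores is only one possible way of combining them; how the scores are actually
combined is not specified in the text.

```python
def two_pass_ranking(query, unigram_scores, bigram_scores, expand, top_k=3):
    """First pass with unigram + bigram scores, expansion, second pass with unigrams only.

    unigram_scores(query) and bigram_scores(query) return dicts doc_id -> score;
    expand(query, top_docs) returns the expanded query.
    """
    # First pass: combine the two scores (here by simple addition, an assumption).
    first = dict(unigram_scores(query))
    for doc, score in bigram_scores(query).items():
        first[doc] = first.get(doc, 0.0) + score
    top_docs = sorted(first, key=first.get, reverse=True)[:top_k]

    # Query expansion from the top documents, then a unigram-only second pass.
    expanded = expand(query, top_docs)
    second = unigram_scores(expanded)
    return sorted(second, key=second.get, reverse=True)
```

In this arrangement the bigram scores only influence which documents feed the query
expansion step, while the final ranking is produced from unigram scores of the expanded
query.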
We submitted one run to CLEF 2002, labelled as "fub02l", which was produced using PROSIT
            PROSIT   with bigrams   with coordination   with both
                                    level matching
CLEF 2001   0.5116   0.5208         0.5127              0.5223
CLEF 2002   0.5019   0.5088         0.4872              0.4947
Table 1: Retrieval performance of PROSIT and its variants.
It should also be noted that we experimented with other types of multi-word index units, by
using windows of different size and by selecting a larger number of words. However, we
found that using just two words with a window of size 5 was the optimal choice.
5 Reranking PROSIT results using coordination level matching
Consistent with earlier effectiveness results, most information retrieval systems are based
on best-matching algorithms between query and documents.
However, the use of very short queries by most users and the prevailing interest in
precision rather than recall have fostered new research on exact-matching retrieval, seen
as an alternative or as a complementary technique to traditional best-matching retrieval.
In particular, it has been shown that taking into account the number of query words matched
by the documents to rerank retrieval results may improve performance in certain situations
(e.g., [7], [3]).
For the CLEF experiments, we focused on the query title. The goal was to prefer documents
that matched all of the query keywords above documents that matched all but one of the
keywords, and so on.
To implement this strategy, we modified the standard best-matching similarity score between
query and documents, computed as explained in Section 3, by adding to it a much larger
addendum which was proportional to the number of terms shared by the document and the query
title. In this way, the documents were partially ordered according to their coordination
level matching with the query title, with ties being broken using their best-matching
similarity score to the query.
However, the results were somewhat disappointing. We obtained a much better retrieval
effectiveness by simply preferring the documents that contained all the words of the query
title, without paying attention to lower levels of coordination matching. This was our
choice (run fub02b).
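Both reranking strategies discussed in this section can be sketched in a few lines of
Python. The sketch assumes non-negative best-matching scores and uses an addendum slightly
larger than the highest score, so that the coordination level always dominates and the
original score only breaks ties; the function name and the `all_or_nothing` flag are ours.

```python
def rerank_clm(scores, doc_terms, title_terms, all_or_nothing=False):
    """Rerank documents by coordination level with the query title.

    scores: doc_id -> best-matching score (assumed >= 0)
    doc_terms: doc_id -> set of index terms of the document
    all_or_nothing: if True, only prefer documents containing every title term.
    """
    title = set(title_terms)
    bonus = max(scores.values(), default=0.0) + 1.0  # larger than any base score

    def level(doc):
        matched = len(title & doc_terms[doc])
        if all_or_nothing:
            return 1 if matched == len(title) else 0
        return matched

    adjusted = {doc: s + bonus * level(doc) for doc, s in scores.items()}
    return sorted(adjusted, key=adjusted.get, reverse=True)


scores = {"d1": 5.0, "d2": 3.0, "d3": 7.0}
doc_terms = {"d1": {"film"}, "d2": {"film", "kaurismaki"}, "d3": set()}
title = ["kaurismaki", "film"]
print(rerank_clm(scores, doc_terms, title))
print(rerank_clm(scores, doc_terms, title, all_or_nothing=True))
```

With full coordination level matching, d1 (one title word) is placed above d3 (none)
despite its lower score; with the all-or-nothing variant only d2 is promoted and the
remaining documents keep their best-matching order.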
Finally, we submitted a fourth run by using the fully enhanced version of PROSIT, i.e.,
bigrams + coordination level matching (run fub02lb).
6 Results
We tested PROSIT and its three variants (i.e., PROSIT with bigrams, PROSIT with
coordination level matching, and PROSIT with both bigrams and coordination level matching)
on the CLEF 2001 and CLEF 2002 Italian monolingual tasks. Table 1 shows the retrieval
performance of the four systems on the two test collections, using average precision as the
evaluation measure.
The results in Table 1 show that the performance of standard PROSIT was excellent on both
test collections, with the value obtained for CLEF 2001 (0.5116) being much higher than the
result of the best system at CLEF 2001 (0.4865). This result confirms the high
effectiveness of the probabilistic ranking model implemented in PROSIT, which is
exclusively based on simple document and query statistics.
Table 1 also shows that, in general, the variations in performance when passing from basic
PROSIT to enhanced PROSIT were small. More specifically, the use of bigrams improved
performance across both test collections, whereas the use of coordination level matching
(CLM) was slightly beneficial for CLEF 2001 and detrimental for CLEF 2002.
Combining both enhancements improved the retrieval performance over using CLM alone on both
test collections, but it was still worse than baseline performance on the CLEF 2002
collection.
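The size of these variations can be checked with a few lines of arithmetic on the values
of Table 1. In the sketch below, the assignment of the four values in each row to the four
systems follows the description given in the text (baseline, bigrams, CLM, both), and the
dictionary keys are our own labels.

```python
AVERAGE_PRECISION = {
    "CLEF 2001": {"PROSIT": 0.5116, "bigrams": 0.5208, "CLM": 0.5127, "bigrams+CLM": 0.5223},
    "CLEF 2002": {"PROSIT": 0.5019, "bigrams": 0.5088, "CLM": 0.4872, "bigrams+CLM": 0.4947},
}


def relative_changes(row, baseline="PROSIT"):
    """Percentage change of each variant with respect to the baseline system."""
    base = row[baseline]
    return {
        name: 100.0 * (value - base) / base
        for name, value in row.items()
        if name != baseline
    }


for collection, row in AVERAGE_PRECISION.items():
    for variant, change in relative_changes(row).items():
        print(f"{collection}  {variant:12s} {change:+.2f}%")
```

All relative changes computed this way are below 3% in absolute value, which is consistent
with the observation that the variations were small.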
Overall, the results about the enhanced versions of PROSIT are inconclusive. More work is
needed to collect further evidence about their effectiveness, e.g., by using a more
representative sample of performance measures or by considering other query scenarios.
Besides a more robust evaluation of retrieval performance, it would be useful to gain a
better understanding of why the use of bigrams within PROSIT's main algorithm yielded
positive results in the experiments reported in this paper. This might be done, for
instance, by analysing the variations in quality of the top-ranked documents used for query
expansion or by performing a query-by-query analysis of concept drift in the final
retrieved documents.
7 Conclusions
We have experimented with the PROSIT system on the Italian monolingual task and have
explored the use of bigrams and coordination level matching within PROSIT's main algorithm.
From our experimental evaluation, the following main conclusions can be drawn.
- The novel probabilistic model implemented in PROSIT achieved high retrieval effectiveness
  on both the CLEF 2001 and CLEF 2002 Italian monolingual tasks. These results are even
  more remarkable considering that the system employs very simple indexing techniques and
  does not rely on any specialised or ad hoc natural language processing techniques.
- Using bigrams in place of unigrams hurt performance; the combination of bigram scores and
  unigram scores performed better but it was still inferior to the results obtained by
  using unigrams alone. However, using the bigram scores in the first-pass ranking, just to
  rank the documents used for query expansion, resulted in a performance improvement. These
  results held across both test collections.
- Using coordination level matching to rerank the retrieval results did not, in general,
  improve performance. Favouring the documents that contained all the keywords in the query
  title worked better on one test collection and worse on the other collection, whereas
  ordering the documents according to their level of coordination matching hurt performance
  on both test collections.
We regret that, due to a tight schedule, we were not able to test PROSIT on the other CLEF
monolingual tasks. However, as the application of PROSIT to the Italian task did not
require any special work, we are confident that with a small effort we could obtain similar
results for the other languages. This is left for future work.
References
[1] Gianni Amati, Claudio Carpineto, and Giovanni Romano. FUB at TREC 10 web track:
a probabilistic framework for topic relevance term weighting. In E.M. Voorhees and D.K.
Harman, editors, Proceedings of the 10th Text Retrieval Conference TREC 2001, pages
182-191, Gaithersburg, MD, 2002. NIST Special Publication 500-250.
[2] Gianni Amati and Cornelis Joost van Rijsbergen. Probabilistic models of information
retrieval based on measuring divergence from randomness. ACM Transactions on Information
Systems, (to appear), 2002.
[3] E. Berenci, C. Carpineto, V. Giannini, S. Mizzaro. Effectiveness of keyword-based
display and selection of retrieval results for interactive searches. International Journal
on Digital Libraries, 3(3):249-260, 2000.
[4] D. Bodoff, A. Kambil. Partial coordination. I. The best of pre-coordination and post-
[...]
[5] [...] Experiments with index pruning. In E.M. Voorhees and D.K. Harman, editors,
Proceedings of the 10th Text Retrieval Conference TREC 2001, pages 228-236, Gaithersburg,
MD, 2002. NIST Special Publication 500-250.
[6] C. Carpineto, R. De Mori, G. Romano, and B. Bigi. An information theoretic approach to
automatic query expansion. ACM Transactions on Information Systems, 19(1):1-27, 2001.
[7] C. Clarke, G. Cormack, E. Tudhope. Relevance ranking for one to three term queries.
Proceedings of RIAO'97, 388-400, 1997.
[8] J. Savoy. Reports on CLEF-2001 experiments. In Working Notes of CLEF, Darmstadt, 2001.
[9] C.M. Tan, Y.F. Wang, C.D. Lee. The use of bigrams to enhance text categorization.
IP&M, 38(4):529-546, 2002.