Comparing Weighting Models for Monolingual Information Retrieval
Gianni Amati, Claudio Carpineto, and Giovanni Romano
Fondazione Ugo Bordoni, via B. Castiglione 59, 00142 Rome, Italy
{gba,carpinet,romano}@fub.it
1 Introduction
Although the choice of the weighting model may crucially affect the performance of any information retrieval system, there has been little work on evaluating the relative merits and drawbacks of different weighting models in the CLEF environment. The main goal of our participation in CLEF 2003 was to help fill this gap.
We consider three weighting models with a different theoretical background that have proved their effectiveness on a number of tasks and collections. The three models are Okapi [8], statistical language modeling [10], and deviation from randomness [3].
We study the retrieval performance of the rankings produced by each weighting model, without and with retrieval feedback, on three monolingual test collections, i.e., French, Italian, and Spanish. The collections are indexed with standard techniques and the retrieval feedback stage is performed using the method described in [4].
In the following we first describe the three weighting models, the method used for retrieval feedback, and the experimental setting. Then we report the results and the main conclusions of our experiments.
2 The three weighting models
For ease of clarity and comparison, the document ranking produced by each weighting model is represented using the same general expression, namely as the product of a document-based term weight by a query-based term weight:

sim(q, d) = \sum_{t \in q \wedge d} w_{t,d} \, w_{t,q}
This formalism also allows a uniform application of the subsequent retrieval feedback stage to the first-pass ranking produced by each weighting model, as explained in Section 3. Before giving the expressions of w_{t,d} and w_{t,q} for each model, we report the complete list of variables that will be used:
f_t         the number of occurrences of term t in the collection
f_{t,d}     the number of occurrences of term t in document d
f_{t,q}     the number of occurrences of term t in query q
n_t         the number of documents in which term t occurs
D           the number of documents in the collection
T           the number of terms in the collection
\lambda_t   the ratio between f_t and T
W_d         the length of document d
W_q         the length of query q
avg W_d     the average length of documents in the collection
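To make these definitions concrete, here is a minimal sketch (our own illustration, not code from the paper) of how the statistics can be computed from an already indexed corpus; reading T as the total number of term occurrences, so that \lambda_t = f_t / T is the relative frequency of t in the collection, is our assumption.

```python
from collections import Counter

def collection_stats(docs):
    """docs: dict mapping a document id to its list of indexed terms."""
    f_t = Counter()   # occurrences of each term in the collection
    n_t = Counter()   # number of documents in which each term occurs
    W_d = {}          # document lengths
    for doc_id, terms in docs.items():
        W_d[doc_id] = len(terms)
        tf = Counter(terms)
        for t, f in tf.items():
            f_t[t] += f
            n_t[t] += 1
    D = len(docs)                              # number of documents
    T = sum(f_t.values())                      # total number of term occurrences
    avg_W_d = T / D if D else 0.0              # average document length
    lam = {t: f / T for t, f in f_t.items()}   # lambda_t = f_t / T
    return f_t, n_t, W_d, D, T, avg_W_d, lam
```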
2.1 Okapi
To describe Okapi, we use the expression given in [8]. This formula has been used by most participants in TREC and CLEF over the last few years.
w_{t,d} = \frac{(k_1 + 1) \, f_{t,d}}{k_1 \left( (1 - b) + b \, \frac{W_d}{avg W_d} \right) + f_{t,d}}

w_{t,q} = \frac{(k_3 + 1) \, f_{t,q}}{k_3 + f_{t,q}} \; \log_2 \frac{D - n_t + 0.5}{n_t + 0.5}
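As an illustration, the two Okapi weights translate directly into code. The sketch below is our own rendering, with the parameter values listed in Section 4.3 used as defaults.

```python
import math

def okapi_w_td(f_td, W_d, avg_W_d, k1=1.2, b=0.75):
    """Document-based Okapi weight: saturated, length-normalized term frequency."""
    K = k1 * ((1 - b) + b * W_d / avg_W_d)
    return (k1 + 1) * f_td / (K + f_td)

def okapi_w_tq(f_tq, n_t, D, k3=1000.0):
    """Query-based Okapi weight: saturated query term frequency times the idf factor."""
    idf = math.log2((D - n_t + 0.5) / (n_t + 0.5))
    return (k3 + 1) * f_tq / (k3 + f_tq) * idf
```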
2.2 Statistical language modeling (SLM)
The statistical language modeling approach has been presented in several papers (e.g., [5], [6]). Here we use the expression given in [10], with Dirichlet smoothing.
w_{t,d} = \log_2 \frac{f_{t,d} + \mu \lambda_t}{W_d + \mu} - \log_2 \frac{\mu \lambda_t}{W_d + \mu}, \qquad w_{t,q} = f_{t,q}

where \mu is the Dirichlet smoothing parameter; in addition, the document-dependent term W_q \log_2 \frac{\mu}{W_d + \mu} is added once to sim(q, d).
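A minimal sketch of this scoring, under our reading of the formula above (the per-term weight reduces algebraically to log2(1 + f_{t,d} / (\mu \lambda_t)), and the document-length term is added once per document):

```python
import math

def slm_w_td(f_td, lam_t, W_d, mu=1000.0):
    """Dirichlet-smoothed term weight; equals log2(1 + f_td / (mu * lambda_t))."""
    return (math.log2((f_td + mu * lam_t) / (W_d + mu))
            - math.log2(mu * lam_t / (W_d + mu)))

def slm_score(query_tf, doc_tf, lam, W_d, W_q, mu=1000.0):
    """SLM similarity: per-term weights times query frequencies, plus the
    document-length term W_q * log2(mu / (W_d + mu)) added once per document."""
    s = sum(slm_w_td(f_td, lam[t], W_d, mu) * query_tf[t]
            for t, f_td in doc_tf.items() if t in query_tf)
    return s + W_q * math.log2(mu / (W_d + mu))
```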
2.3 Deviation from randomness (DFR)
Deviation from randomness has been successfully used at the Web Track of TREC-10 [1] and in CLEF 2002, for the Italian monolingual task [2]. It is best described in [3].
w_{t,d} = \left( \log_2 (1 + \lambda_t) + f_{t,d} \, \log_2 \frac{1 + \lambda_t}{\lambda_t} \right) \cdot \frac{f_t + 1}{n_t \, (f_{t,d} + 1)}

w_{t,q} = f_{t,q}

where f_{t,d} is first normalized with respect to document length:

f_{t,d} = f_{t,d} \, \log_2 \left( 1 + \frac{c \cdot avg W_d}{W_d} \right)
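A sketch of the DFR weight under the same assumptions (our own rendering, with c = 2 as in Section 4.3); the normalized term frequency is computed first and then substituted into both factors:

```python
import math

def dfr_w_td(f_td, W_d, avg_W_d, f_t, n_t, lam_t, c=2.0):
    """Document-based DFR weight with document-length normalization of f_td."""
    tfn = f_td * math.log2(1 + c * avg_W_d / W_d)          # normalized term frequency
    info = math.log2(1 + lam_t) + tfn * math.log2((1 + lam_t) / lam_t)
    return info * (f_t + 1) / (n_t * (tfn + 1))

def dfr_w_tq(f_tq):
    """Query-based DFR weight."""
    return f_tq
```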
3 Retrieval feedback
As retrieval feedback has been incorporated in most recent systems participating in CLEF, it is also interesting to evaluate the performance of the different weighting models when they are enriched with retrieval feedback.
To perform the experiments, we used a technique called information-theoretic query expansion [4]. At the end of the first-pass ranking, each term in the top retrieved documents was assigned a score using the Kullback-Leibler distance between the distribution of the term in such documents and the distribution of the same term in the whole collection:
KLD_{t,d} = f_{t,d} \, \log_2 \frac{f_{t,d}}{f_t}
and the terms with the highest scores were selected for expansion.
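A sketch of the selection step follows; pooling the top retrieved documents into a single bag of words and comparing relative frequencies rather than raw counts is our assumption about how the compact formula above is applied.

```python
import math
from collections import Counter

def select_expansion_terms(top_doc_terms, f_t, T, n_terms=40):
    """Score candidate terms from the top retrieved documents by their
    Kullback-Leibler contribution against the collection distribution,
    and keep the n_terms highest-scoring ones."""
    pool = Counter()
    for terms in top_doc_terms:          # e.g. the 10 top-ranked documents
        pool.update(terms)
    W_R = sum(pool.values())             # length of the pooled pseudo-document
    scores = {t: (f / W_R) * math.log2((f / W_R) / (f_t[t] / T))
              for t, f in pool.items() if f_t.get(t, 0) > 0}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n_terms])
```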
At this point, the KLD scores were also used to reweight the terms in the expanded query. As the weights of the unexpanded query (i.e., those assigned by SLM, Okapi, or DFR) and the KLD scores had different scales, we normalized both the weights of the original query and the scores of the expansion terms by the corresponding maximum value; the normalized values were then linearly combined.
The new expression for computing the similarity between an expanded query q_{exp} and a document d becomes:

sim(q_{exp}, d) = \sum_{t \in q_{exp} \wedge d} w_{t,d} \left( \alpha \, \frac{w_{t,q}}{\max_q w_{t,q}} + \beta \, \frac{KLD_{t,d}}{\max_d KLD_{t,d}} \right)

where \alpha and \beta are the linear combination weights reported in Section 4.3.
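A sketch of the second-pass scoring; the dictionary-based interface, the default values \alpha = 1 and \beta = 0.5 (from Section 4.3), and the handling of expansion terms that also occur in the original query are our own simplifications.

```python
def expanded_similarity(doc_w_td, orig_w_tq, kld, alpha=1.0, beta=0.5):
    """Second-pass score of one document: for every expanded-query term that
    occurs in the document, combine the max-normalized original query weight
    and the max-normalized KLD score, then multiply by the document weight.
    doc_w_td: dict term -> w_{t,d} for the document being scored."""
    max_wq = max(orig_w_tq.values()) if orig_w_tq else 1.0
    max_kld = max(kld.values()) if kld else 1.0
    score = 0.0
    for t, w_td in doc_w_td.items():
        if t not in orig_w_tq and t not in kld:
            continue          # term not in the expanded query
        combined = (alpha * orig_w_tq.get(t, 0.0) / max_wq
                    + beta * kld.get(t, 0.0) / max_kld)
        score += w_td * combined
    return score
```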
4 Experimental setting
4.1 Test collections
The experiments were performed using three CLEF 2003 monolingual test collections, namely the French, Spanish, and Italian collections. For all collections, the title+description topic statement was considered.
4.2 Document and query indexing
We identified the individual words occurring in the documents, considering only the admissible sections and ignoring punctuation and case. The system then performed word stemming and word stopping.
For word stemming, we used the French, Italian, and Spanish adaptations of Porter's algorithm [7]; for word stopping, we used the stop lists provided by Savoy [9].
Thus, we performed a strict single-word indexing; furthermore, we did not use any ad hoc linguistic manipulation such as expanding or removing certain words from the query text or using lists of proper nouns.
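As an illustration, the indexing pipeline just described can be sketched as follows (our own code; the stopword list and stemming function are stand-ins for the language-specific resources cited above):

```python
import re

def index_text(text, stopwords, stem):
    """Single-word indexing: lowercase, keep alphabetic tokens only,
    drop stopwords, stem. 'stopwords' and 'stem' stand in for the
    language-specific resources used in the experiments."""
    tokens = re.findall(r"[^\W\d_]+", text.lower(), flags=re.UNICODE)
    return [stem(t) for t in tokens if t not in stopwords]
```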
4.3 Choice of experimental parameters
The final document ranking is affected by a number of parameters. To perform the experiments, we set the parameters to values that have been reported in the literature. Here is the complete list of parameter values:
Okapi: k_1 = 1.2, k_3 = 1000, b = 0.75
SLM: \mu = 1000
DFR: c = 2
Retrieval feedback: 10 documents, 40 terms, \alpha = 1, \beta = 0.5
5 Results
For each collection and for each query, we computed six runs: two runs for each of the three weighting models, one without and one with retrieval feedback (RF). Table 1, Table 2, and Table 3 show the retrieval performance of each method on the French, Italian, and Spanish collections, respectively. Performance was measured using average precision (AV-PREC), precision at 5 retrieved documents (PREC-AT-5), and precision at 10 retrieved documents (PREC-AT-10). For each collection we show in bold the best result without retrieval feedback and the best result with retrieval feedback.
Note that for the French and Italian collections the average precision was greater than the early precision values; this is because, for these collections, the number of relevant documents per query is small on average, and there are many queries with very few relevant documents.
The first main finding of our experiments is that the best absolute result for each collection and for each evaluation measure was always obtained by DFR with retrieval feedback, with notable improvements on several data points. The excellent performance of the DFR model is confirmed also when comparing the weighting models without query expansion, although in the latter case DFR did not always achieve the best results (i.e., for PREC-AT-5 and PREC-AT-10 on Italian, and for PREC-AT-5 on Spanish).
           AV-PREC   PREC-AT-5   PREC-AT-10
Okapi       0.5030    0.4385      0.3654
Okapi+RF    0.5054    0.4769      0.3942
SLM         0.4753    0.4538      0.3635
SLM+RF      0.4372    0.4192      0.3462
DFR         0.5116    0.4577      0.3654
DFR+RF      0.5238    0.4885      0.3981

Table 1. Retrieval performance on the French collection
           AV-PREC   PREC-AT-5   PREC-AT-10
Okapi       0.4762    0.4588      0.3510
Okapi+RF    0.5238    0.4824      0.3902
SLM         0.5027    0.4941      0.3824
SLM+RF      0.5095    0.4824      0.3863
DFR         0.5046    0.4824      0.3725
DFR+RF      0.5364    0.5255      0.4137

Table 2. Retrieval performance on the Italian collection
           AV-PREC   PREC-AT-5   PREC-AT-10
Okapi       0.4606    0.5684      0.5175
Okapi+RF    0.5093    0.6105      0.5491
SLM         0.4720    0.6140      0.5157
SLM+RF      0.5112    0.5825      0.5316
DFR         0.4907    0.6035      0.5386
DFR+RF      0.5510    0.6140      0.5825

Table 3. Retrieval performance on the Spanish collection
The second main finding concerns the relative performance of Okapi and SLM: neither model was clearly superior to the other. They achieved comparable results on Spanish, while Okapi was slightly better than SLM on French and slightly worse on Italian. However, when considering the first retrieved documents, the performance of SLM was usually very good and sometimes even better than that of DFR.
The results in Table 1, Table 2, and Table 3 also show that retrieval feedback improved the Okapi and DFR runs and mostly hurt the SLM runs. In particular, the use of retrieval feedback improved the retrieval performance of Okapi and DFR for all evaluation measures and across all collections, whereas it usually decreased the early precision of SLM and on one occasion (i.e., on French) it even hurt the average precision of SLM. The unsatisfactory performance of SLM+RF may be explained by considering that the experiments were performed using long queries.
We would also like to emphasize that the DFR runs shown here correspond to runs actually submitted to CLEF, although they were not our best runs. In fact, our best submitted runs used language-specific optimal parameters; we then submitted, for each language, a run with the same experimental parameters, obtained by averaging the best parameters.
We also performed a query-by-query analysis. For each query, we computed the difference between the best and the worst retrieval result, using average precision as the performance measure. Figure 1, Figure 2, and Figure 3 show the results for French, Italian, and Spanish, respectively.
Fig. 1. Performance variation on individual queries for French

Fig. 2. Performance variation on individual queries for Italian

Fig. 3. Performance variation on individual queries for Spanish

Thus, the length of each bar depicts the range of performance variation across the weighting models on an individual query, but it does not tell us which method performed best.
To get a more complete picture, we counted, for each collection, the number of queries for which each method achieved the best, median, or worst performance. The results, shown in Table 4, confirm the better retrieval effectiveness of DFR over the other two models. The superiority of DFR over Okapi and SLM was clear for Spanish, while DFR and Okapi obtained more comparable results on the other two test collections. For French and Italian, the number of best results obtained by DFR and Okapi was similar, but, on the whole, DFR was ranked ahead of Okapi on a much larger number of queries.
        French              Italian             Spanish
        SLM  Okapi  DFR     SLM  Okapi  DFR     SLM  Okapi  DFR
1st      11    20    21      10    21    20      16    16    25
2nd      11    17    24       9    16    26      10    22    25
3rd      30    15     7      32    14     5      31    19     7

Table 4. Ranked performance
6 Conclusions
The main conclusion of our experiments is that the DFR model was more effective than both Okapi and SLM, which achieved comparable retrieval performance. In particular, DFR with query expansion obtained the best absolute results for every evaluation measure and across all test collections.
The second conclusion is that retrieval feedback always improved the performance of Okapi and DFR, whereas it was often detrimental to the retrieval effectiveness of SLM, although the latter finding may have been influenced by the length of the queries used in the experiments.
These results seem to suggest that the retrieval performance of a weighting model is only moderately affected by the choice of the language, but this hypothesis should be taken with caution, because our results were obtained under specific experimental conditions.
Although there are reasons to believe that similar results might also hold across different experimental situations, in that we chose simple and untuned parameter values and made typical indexing assumptions, the issue needs more investigation. The next step of this research is to experiment with a wider range of factors, such as the length of queries, the values of each weighting model's parameters, and the combination of parameter values for retrieval feedback. It would also be useful to experiment with other languages, to see whether the hypothesis that the retrieval performance of a weighting model is independent of the language holds more generally.

References
1. G. Amati, C. Carpineto, G. Romano. FUB at TREC-10 web track: a probabilistic framework for topic relevance term weighting. Proceedings of TREC-10, 182-191, 2001.
2. G. Amati, C. Carpineto, G. Romano. Italian monolingual information retrieval with PROSIT. Working notes of CLEF 2002, 145-152, 2002.
3. G. Amati, C.J. van Rijsbergen. Probabilistic models of information retrieval based on measuring divergence from randomness. ACM Transactions on Information Systems, 20(4):357-389, 2002.
4. C. Carpineto, R. De Mori, G. Romano, B. Bigi. An information-theoretic approach to automatic query expansion. ACM Transactions on Information Systems, 19(1):1-27, 2001.
5. D. Hiemstra, W. Kraaij. Twenty-One at TREC-7: Ad-hoc and cross-language track. Proceedings of TREC-7, 227-238, 1998.
6. J. Ponte, W.B. Croft. A language modeling approach to information retrieval. Proceedings of SIGIR-98, 275-281, 1998.
7. M.F. Porter. Implementing a probabilistic information retrieval system. Inf. Tech. Res. Dev., 1(2):131-156, 1982.
8. S.E. Robertson, S. Walker, M. Beaulieu. Okapi at TREC-7: Automatic ad hoc, filtering, VLC, and interactive track. Proceedings of TREC-7, 253-264, 1998.
9. J. Savoy. Report on CLEF-2001 experiments. Working notes of CLEF 2001, Darmstadt, 2001.
10. C. Zhai, J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. Proceedings of SIGIR-01, 334-342, 2001.