Applying Google search engine for robust
cross-lingual retrieval
Fernando Martínez-Santiago, Arturo Montejo-Ráez, Miguel A. García-Cumbreras
Department of Computer Science, University of Jaén, Jaén, Spain
{dofer,amontejo,magc}@ujaen.es
Abstract
We report on our experiments for the CLEF Ad-Hoc Robust track concerning web-based query generation for English and French collections. We have continued the approach of last year, although the model has been modified. Last year we used Google to expand the original query. This year we do not expand the query; instead, we build a new query to be executed. Thus, we have to deal with two lists of relevant documents, one from each query. To integrate both lists of documents we have applied a merging solution based on logistic regression. The results obtained are discouraging.
Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries
General Terms
Algorithms, Languages, Performance, Experimentation
Keywords
Information Retrieval, Multilingual Information Retrieval, Robust Information Retrieval
1 Introduction
Expanding user queries by using web search engines such as Google has been successfully used for improving the robustness of retrieval systems over collections in English [2]. Due to the multilinguality of the web, we have assumed that this could be extended to additional languages, though the smaller number of non-English web pages could be a major drawback. Therefore, we have used Google to expand the query in a similar way [3], but instead of replacing the original query with the expanded query, we have executed both queries (the original and the expanded one).
For each query we have obtained a list of relevant documents. Thus, we need to combine the retrieval results from these two independent lists of documents. This is similar to the so-called collection fusion problem [8], but we do not have several collections: there is only one collection but two lists of relevant documents. The question is how to calculate the score of each document in the final resulting list. Given a query, in order to integrate the information available about the relevance of every retrieved document, we have applied a model based on logistic regression.
2 Query generation using Google
This section describes the process for generating a new query using expansion by the Google search engine. To this end, we have selected a random sample document. The following fields correspond to the document with identification number 10.2452/252-ah from the English collection.
<title>pension schemes in europe </title>
<desc>find documents that give information about current pension
systems and retirement benefits in any european country.</desc>
<narr>relevant documents will contain information on current pension
schemes and benefits in single european states. information of
interest includes minimum and maximum ages for retirement and the way
in which the retirement income is calculated. plans for future
pension reform are not relevant. </narr>
These fields have been concatenated into one single text, and all the nouns, noun phrases and prepositional phrases it contains have been extracted by means of TreeTagger. TreeTagger is a tool for annotating text with part-of-speech and lemma information which has been developed at the Institute for Computational Linguistics of the University of Stuttgart¹.
Once nouns and phrases are identified, they are taken to compose the query, preserving phrases thanks to Google's query syntax.
documents "pension schemes" benefits retirement information
The above string is passed to Google, and the snippets (small fragments of text from the associated web page result) of the top 100 results are joined into one single text from which, again, phrases are extracted with their frequencies to generate a final expanded query. The 20 most frequent nouns, noun phrases and prepositional phrases from this generated text are replicated according to their frequencies in the snippet-based text, normalized to the minimal frequency among those 20 items (i.e. normalized according to the least frequent phrase among the top ones). The resulting query is shown below:
pension pension pension pension pension pension pension pension
pension pension pension pension pension pension pension pension
pension pension pension benefits benefits benefits benefits benefits
benefits benefits benefits benefits benefits retirement retirement
retirement retirement retirement retirement retirement retirement
retirement retirement retirement age age pensions occupational
occupational occupational occupational schemes schemes schemes
schemes schemes schemes schemes schemes schemes schemes schemes
schemes schemes schemes schemes schemes regulations information
information information information information scheme scheme
disclosure disclosure pension schemes pension schemes pension
schemes pension schemes pension schemes pension schemes pension
schemes pension schemes pension schemes pension schemes pension
schemes pension schemes retirement benefits schemes members members
occupational pension schemes occupational pension schemes
occupational pension schemes retirement benefits retirement benefits
disclosure of information
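The replication step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `build_expanded_query`, the use of rounding for the normalized repetition count, and the phrase counts in the usage lines are all assumptions made for the example.

```python
from collections import Counter


def build_expanded_query(phrase_counts, top_n=20):
    """Repeat each top phrase in proportion to its snippet frequency.

    Frequencies are normalized by the smallest frequency among the
    top_n phrases (rounding is an assumption of this sketch).
    """
    top = Counter(phrase_counts).most_common(top_n)
    if not top:
        return ""
    min_freq = min(freq for _, freq in top)
    parts = []
    for phrase, freq in top:
        repeats = max(1, round(freq / min_freq))
        parts.extend([phrase] * repeats)
    return " ".join(parts)


# Hypothetical snippet frequencies, for illustration only.
counts = {"pension": 38, "benefits": 20, "age": 4, "regulations": 2}
query = build_expanded_query(counts, top_n=4)
```

With these made-up counts the least frequent phrase appears once and every other phrase is repeated relative to it, which is the weighting effect visible in the expanded query above.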
French documents have been processed in a similar way, but using the OR operator to join the phrases found when generating the Google query. This has been done because of the smaller number of indexed web pages in French. Since we expect to recover 100 snippets, we have found that this is possible with the OR operator, even though lower-quality texts are then considered when producing the final expanded query.
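The composition of the Google query string for both languages can be sketched as below. The helper name `google_query` and the French terms are hypothetical; the only behaviour taken from the text is that multi-word phrases are kept together with double quotes and that French terms are joined with OR.

```python
def google_query(terms, use_or=False):
    """Build a query string, quoting multi-word phrases.

    use_or=True joins terms with the OR operator (French case in the
    text); otherwise terms are simply separated by spaces.
    """
    quoted = [f'"{term}"' if " " in term else term for term in terms]
    separator = " OR " if use_or else " "
    return separator.join(quoted)


english = google_query(
    ["documents", "pension schemes", "benefits", "retirement", "information"]
)
# Hypothetical French terms, for illustration only.
french = google_query(["régimes de retraite", "prestations"], use_or=True)
```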
The next step is to execute both the original and the Google queries on the Lemur information retrieval system. The collection dataset has been indexed using the Lemur IR system. It is a toolkit that supports indexing of large-scale text databases, the construction of simple language models for documents, queries, or subcollections, and the implementation of retrieval systems based on language models as well as a variety of other retrieval models. The toolkit is being developed by the University of Massachusetts and the School of Computer Science at Carnegie Mellon University. In these experiments we have used Okapi as the weighting function [4].

¹ Available at http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
Finally, we have to merge both lists of relevant documents. A merging approach based on logistic regression is proposed in [1, 6]. Logistic regression is a statistical methodology for predicting the probability of a binary outcome variable according to a set of independent explanatory variables. The probability of relevance of the corresponding document $D_i$ will be estimated according to four parameters: the score and the rank obtained by using the original query, and the score and the rank obtained by means of the Google-based query (see equation 1). Based on these estimated probabilities of relevance, the lists of documents will be interleaved to make up a single final list.
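$$
Prob[D_i \text{ is rel} \mid rank_{org_i}, rsv_{org_i}, rank_{google_i}, rsv_{google_i}] =
\frac{e^{\alpha + \beta_1 \cdot \ln(rank_{org_i}) + \beta_2 \cdot rsv_{org_i} + \beta_3 \cdot \ln(rank_{google_i}) + \beta_4 \cdot rsv_{google_i}}}
{1 + e^{\alpha + \beta_1 \cdot \ln(rank_{org_i}) + \beta_2 \cdot rsv_{org_i} + \beta_3 \cdot \ln(rank_{google_i}) + \beta_4 \cdot rsv_{google_i}}}
\tag{1}
$$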
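Equation 1 and the subsequent merging can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the coefficient values in the usage lines are made up, and the treatment of a document that appears in only one list (a default rank and a zero score) is an assumption, since the text does not describe that case.

```python
import math


def relevance_probability(rank_org, rsv_org, rank_google, rsv_google,
                          alpha, betas):
    """Equation 1: logistic function of log-ranks and scores (rsv)."""
    b1, b2, b3, b4 = betas
    z = (alpha + b1 * math.log(rank_org) + b2 * rsv_org
         + b3 * math.log(rank_google) + b4 * rsv_google)
    # e^z / (1 + e^z), written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def merge_lists(org, google, alpha, betas,
                missing_rank=1001, missing_rsv=0.0):
    """Merge two result lists given as {doc_id: (rank, rsv)}.

    Documents missing from one list get (missing_rank, missing_rsv);
    this default is an assumption of the sketch.
    """
    missing = (missing_rank, missing_rsv)
    scored = []
    for doc in set(org) | set(google):
        rank_o, rsv_o = org.get(doc, missing)
        rank_g, rsv_g = google.get(doc, missing)
        prob = relevance_probability(rank_o, rsv_o, rank_g, rsv_g,
                                     alpha, betas)
        scored.append((doc, prob))
    # Highest estimated probability first; ties broken by document id.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored


# Made-up coefficients and scores, for illustration only.
org = {"a": (1, 5.0), "b": (2, 4.0)}
google = {"b": (1, 3.0), "a": (3, 1.0)}
merged = merge_lists(org, google, alpha=0.0, betas=(-1.0, 0.0, -1.0, 0.0))
```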
The coefficients $\alpha$, $\beta_1$, $\beta_2$, $\beta_3$ and $\beta_4$ are unknown parameters of the model. When fitting the model, the usual methods to estimate these parameters are maximum likelihood or iteratively re-weighted least squares.
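As a rough illustration of maximum-likelihood fitting, the sketch below estimates the coefficients by plain gradient ascent on the log-likelihood. This is a simpler stand-in for the iteratively re-weighted least squares procedure mentioned above, not the procedure used in the experiments. The function names, learning rate, number of epochs and the tiny training set are all assumptions made for the example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, features):
    """weights[0] is alpha; weights[1:] are beta_1..beta_4."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return sigmoid(z)


def fit_logistic(X, y, lr=0.01, epochs=5000):
    """Maximize the average log-likelihood by batch gradient ascent."""
    weights = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for features, label in zip(X, y):
            error = label - predict(weights, features)
            grad[0] += error
            for j, x in enumerate(features):
                grad[j + 1] += error * x
        weights = [w + lr * g / n for w, g in zip(weights, grad)]
    return weights


def features(rank_org, rsv_org, rank_google, rsv_google):
    """Explanatory variables of equation 1."""
    return [math.log(rank_org), rsv_org, math.log(rank_google), rsv_google]


# Hypothetical training examples: (ranks, scores) and relevance labels.
X = [features(1, 9.0, 1, 8.0), features(2, 7.0, 3, 6.0),
     features(10, 2.0, 12, 1.5), features(20, 1.0, 18, 1.0)]
y = [1, 1, 0, 0]
weights = fit_logistic(X, y)
```

In practice one would fit the model on the relevance assessments of the training topics with a statistics package; the loop above only shows what "estimating the parameters by maximum likelihood" amounts to.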
Since the underlying model needs to be fitted, a training set (topics and their relevance assessments) must be available for each monolingual collection. Since relevance assessments are available for English and French, we have run the experiments for these languages only. For Portuguese we have reported only the base case (we have not used Google queries for that language).
3 Results
As Tables 1 and 2 show, the results are disappointing. For training data, Google queries improve both the m.a.p. and the geometric precision in both languages, English and French. But this good behavior disappears when we apply our approach to test data. Of course, we expect precision on test data to be worse than on training data, but we think that the difference in precision is excessive. This issue demands further analysis.
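For readers unfamiliar with the two measures, the sketch below shows how they are usually computed from per-topic average precision, assuming that "map" and "gm-ap" in the tables denote the standard mean average precision and its geometric-mean variant. The function names, the small floor `eps` applied before taking logarithms, and the sample values are assumptions of this example.

```python
import math


def average_precision(ranked_docs, relevant):
    """Average precision of one ranked list for one topic."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for position, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant)


def mean_average_precision(ap_values):
    """Arithmetic mean of per-topic average precision."""
    return sum(ap_values) / len(ap_values)


def geometric_map(ap_values, eps=1e-5):
    """Geometric mean of per-topic average precision.

    eps floors very small values so the logarithm is defined
    (the exact floor is an assumption of this sketch).
    """
    logs = [math.log(max(ap, eps)) for ap in ap_values]
    return math.exp(sum(logs) / len(logs))


# Hypothetical per-topic values: one easy topic, one hard topic.
aps = [0.9, 0.01]
map_value = mean_average_precision(aps)
gmap_value = geometric_map(aps)
```

The geometric mean is dominated by the poorly performing topic, which is why it is the measure of interest in a robustness task.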
Approach  Collection  map   gm-ap
Google    training    0.29  0.12
Base      training    0.26  0.10
Google    test        0.34  0.12
Base      test        0.38  0.14
Table 1: Results for English data. The Google approach is the result obtained by merging original queries and Google queries. Base results are those obtained by means of original queries only.
Approach  Collection  map   gm-ap
Google    training    0.28  0.10
Base      training    0.26  0.12
Google    test        0.30  0.11
Base      test        0.31  0.13
Table 2: Results for French data. The Google approach is the result obtained by merging original queries and Google queries. Base results are those obtained by means of original queries only.
4 Conclusions
We have reported on our experiments for the CLEF Ad-Hoc Robust Multilingual track involving web-based query generation for English and French collections. The generation of a final list of results by merging search results obtained from two different queries has been studied. These two queries are the original one and a new one generated from Google results. Both lists are joined by means of logistic regression, instead of using an expanded query as we did last year. The results are disappointing. While results for training data are very promising, there is no improvement for test data. This issue must be investigated, and we hope to understand why the performance is so poor on test data by analyzing, for instance, side effects of the regression approach.
5 Acknowledgments
This work has been partially supported by a grant from the Spanish Government, project TIMOM (TIN2006-15265-C06-03), and the RFC/PP2006/Id_514 granted by the University of Jaén.
References
[1] A. Calvé and J. Savoy. Database merging strategy based on logistic regression. Information Processing & Management, 36:341-359, 2000.
[2] K. L. Kwok, L. Grunfeld, and D. D. Lewis. TREC-3 ad-hoc, routing retrieval and thresholding experiments using PIRCS. In Proceedings of TREC'3, volume 500, pages 247-255, Gaithersburg, 1995. NIST.
[3] Fernando Martínez-Santiago, Arturo Montejo-Ráez, Miguel A. García-Cumbreras, and L. Alfonso Ureña-López. SINAI at CLEF 2006 Ad Hoc Robust Multilingual Track: Query Expansion using the Google Search Engine. Evaluation of Multilingual and Multi-modal Information Retrieval, 7th Workshop of the Cross-Language Evaluation Forum, CLEF 2006. LNCS-Springer, 4730, September 2007.
[4] S. E. Robertson and S. Walker. Okapi-Keenbow at TREC-8. In Proceedings of the 8th Text Retrieval Conference TREC-8, NIST Special Publication 500-246, pages 151-162, 1999.
[5] Fernando Martínez Santiago, Luis Alfonso Ureña López, and Maite Teresa Martín-Valdivia. A merging strategy proposal: The 2-step retrieval status value method. Inf. Retr., 9(1):71-93, 2006.
[6] J. Savoy. Cross-Language information retrieval: experiments based on CLEF 2000 corpora. Information Processing & Management, 39:75-115, 2003.
[7] J. Savoy. Combining multiple strategies for effective cross-language retrieval. Information Retrieval, 7(1-2):121-148, 2004.
[8] E. Voorhees, N. K. Gupta, and B. Johnson-Laird. The collection fusion problem. In D. K. Harman, editor, Proceedings of the 3rd Text Retrieval Conference TREC-3, volume 500-225, pages 95-104, Gaithersburg, 1995. National Institute of Standards and Technology, Special