SINAI at CLEF Ad-Hoc Robust Track 2007: Applying Google Search Engine for Robust Cross-Lingual Retrieval

(1)

applying Google searh engine for robust

ross-lingual retrieval

FernandoMartínez-Santiago,ArturoMontejo-Ráez,Miguel A.Garía-Cumbreras

DepartmentofComputerSiene. UniversityofJaén,Jaén,Spain

{dofer,amontejo,mag}ujaen.es

Abstrat

We have reported on our experimentation for the Ad-Ho Robust trak CLEF task

onerning web-based querygeneration for English andFrenh olletions. We have

ontinued theapproahof thelastyear,although themodelhasbeenmodied. Last

yearweusedGoogleinordertoexpandtheoriginalquery. Thisyearwedon'texpand

thequerybutwerathermakeanewquerytobeexeuted. Thus,wehavetodealwith

twolistsofrelevantdouments, onefromeahquery. In ordertointegratebothlists

of douments we have applied logistiregressionmerging solution. Obtained results

aredisouraging.

Categories and Subjet Desriptors

H.3[InformationStorage and Retrieval℄: H.3.1ContentAnalysisand Indexing;H.3.3Infor-

mationSearhandRetrieval;H.3.4SystemsandSoftware;H.3.7DigitalLibraries

General Terms

Algorithms,Languages,Performane,Experimentation

Keywords

InformationRetrieval,MultilingualInformationRetrieval,RobustInformationRetrieval

1 Introdution

ExpandinguserqueriesbyusingwebsearhenginessuhasGooglehasbeensuessfullyusedfor

improvingtherobustnessofretrievalsystemsoverolletionsin English[2℄. Due tothe multilin-

gualityoftheweb,wehaveassumedthatthisouldbeextended toadditionallanguages,though

thesmalleramountofwebnon-Englishpagesouldbeamajorounterpoint. Therefore,wehave

usedGoogleinordertoexpandthequeryinasimilarway[3℄,butinsteadofreplaingtheoriginal

querybythetheexpandedquery,wehaveexeutedbothqueries(theoriginalandexpandedone).

For eah query we have obtained alist of relevant douments. Thus, we need to ombine the

retrievalresultsfromthesetwoindependentlistofdouments. Thisisasimilarproblemtotheso

alledolletionfusionproblem[8℄, butwehavenotseveralolletions: thereis onlyaolletion

buttwolistofrelevantdouments. Thequestionishowshouldwemakethealulusofthesore

ofeahdoument inthe nalresultinglist. Given aquery,in order tointegratethe information

available about the relevane of every retrieved doument, we have applied a model based on

(2)

ThissetiondesribestheproessforgeneratinganewqueryusingexpansionbytheGooglesearh

engine. Tothisend,wehaveseletedarandomsampledoument. Thefollowingeldsorrespond

tothedoumentwithidentiationnumber10.2452/252-ahfrom theEnglish olletion.

<title>pensionshemesin europe </title>

<des>finddoumentsthat give informationabout urrentpension

systemsandretirementbenefitsin anyeuropeanountry.</des>

<narr>relevantdouments willontain informationon urrent pension

shemesandbenefitsin single europeanstates.information of

interestinludesminimum andmaximum agesfor retirementandtheway

inwhih theretirementinome isalulated. plansfor future

pensionreform arenotrelevant. </narr>

Theseeldshavebeenonatenatedintoonesingletextandallontainednouns,nounphrases

and prepositional phrases have been extrated by means of TreeTagger. TreeTagger is a tool

forannotatingtextwith part-of-speeh andlemmainformationwhihhasbeendevelopedat the

InstituteforComputationalLinguistisoftheUniversityofStuttgart 1

.

Onenounsandphrasesareidentiedtheyaretakentoomposethequery,preservingphrases

thankstoGoogle'squerysyntax.

douments``pension shemes''benefitsretirementinformation

The former string is passed to Google and the snippets (small fragment of text from the

assoiatedwebpageresult)ofthetop100resultsarejoinedintoonesingletextwherefrom,again,

phrases are extrated with their frequenies to generate a nal expanded query. The 20 most

frequent nouns, noun phrasesand prepositional phrasesfrom this generated text are repliated

aording to their frequenies in the snippets-based text and then normalized to the minimal

frequenyinthose20items(i.e. normalizedaordingtotheleastfrequentphraseamongthetop

ones). Theresultingqueryisshownbelow:

pension pensionpension pensionpensionpension pensionpension

pension pensionpension benefitsbenefitsbenefitsbenefits benefits

benefitsbenefitsbenefitsbenefitsbenefitsretirementretirement

retirementretirementretirementretirement retirementretirement

retirementretirementretirementage agepensionsoupational

oupationaloupational oupationalshemesshemes shemes

shemes shemesshemes shemesshemesshemes shemesshemes

shemes shemesshemes shemesshemesregulations information

information information informationinformation sheme sheme

dislosuredislosurepension shemespension shemespension

shemes pensionshemes pensionshemespension shemespension

shemes pensionshemes retirementbenefitsshemesmembers members

oupationalpension shemesoupationalpension shemes

oupationalpension shemesretirementbenefitsretirement benefits

dislosureof information

Frenh doumentshave been proessed in asimilar way, but using the ORoperator to join

foundphrasesforthegeneratedGooglequery. Thishasbeendonedue tothesmallernumberof

indexed web pages in Frenh language. Sine weexpet to reover100snippets, wehave found

that withthisoperatorthisis possible,despitelowquality textsbeenonsidered to produethe

nalexpandedquery.

Thenext step is to exeute both original and Google queries on the Lemur information re-

trievalsystem. Theolletiondatasethasbeen indexedusing Lemur IRsystem 2

. It isatoolkit

that supports indexingof large-saletext databases,theonstrutionof simplelanguagemodels

for douments, queries,or subolletions, and theimplementation of retrievalsystemsbased on

language models as well asa variety of other retrieval models. The toolkit is being developed

1

Availableathttp://www.ims.uni-stutt gart .de /pro jekt e/ orpl ex/T ree Tagg er/

2

(3)

Universityof Massahusettsand theShoolofComputer Sieneat CarnegieMellon University.

IntheseexperimentswehaveusedOkapiasweightingfuntion ([4℄).

Finally,wehavetomergebothlistsofrelevantdouments. [1,6℄proposeamergingapproah

based on logisti regression. Logisti regression is a statistial methodology for prediting the

probabilityofabinaryoutomevariableaordingtoaset ofindependentexplanatoryvariables.

The probability of relevane to the orresponding doument

D _i

^will ^be ^estimated ^aording^to

four parameters: the soreand the rankingobtained by using the original query, and the sore

andtherankingbymeansoftheGoogle-basedquery(seeequation 1). Basedontheseestimated

probabilitiesofrelevane,thelistofdoumentswillbeinterleavedmakingupanuniquenallist.

P rob[D i is rel|rank org _i , rsv org _i , rank google _i , rsv google _i ] = e ^α+β ¹ ^·ln(rank ôrgi ^)+β ² ^· ^rsv ôrgi ^+β ³ ^·ln(rank ^googlei ^)+β ⁴ ^· ^rsv ^googlei 1 + e ^α+β ¹ ^·ln(rank ôrgi ^)+β ² ^· ^rsv ôrgi ^+β ³ ^·ln(rank ^googlei ^)+β ⁴ ^· ^rsv ^googlei

(1)

The oeients

α

^,

β 1

^,

β 2

^,

β 3

^and

β 4

âre ûnknown ^parameters ôf ^the ^model. ^When ^tting

the model, usual methods to estime these parametersare maximumlikelihood oriterativelyre-

weightedleastsquaresmethods.

Asitisneededtottheunderlyingmodel,trainingset(topisandtheirrelevaneassessments)

mustbeavailableforeahmonolingualolletion.SinetherearerelevaneassessmentsforEnglish

and Frenh, we have made the experiments for these languages only. For Portuguese we have

reportedonlythebasease(wehavenotusedGooglequeriesforsuhlanguage).

3 Results

Astables 1and2show,theresultsaredisappointing. Fortrainingdata,Googlequeriesimprove

boththem.a.p. andthegeometripreisioninbothlanguages,EnglishandFrenh. Butthisgood

behaviordisappearswhenweapplyourapproahontestdata. Ofourse,wehopethatpreision

for testdata gets worse regardingtrainingdata, but wethink that the dierenein preisionis

exessive. Thisissuedemandsfurtheranalysisbyus.

Approah Colletion map gm-ap

Google training 0.29 0.12

Base training 0.26 0.10

Google test 0.34 0.12

Base test 0.38 0.14

Table1: Results for English data. Google approah is theresult obtained by merging original

queriesandGooglequeries. Baseresultsarethoseobtainedbymeansoforiginalqueriesonly.

Approah Colletion map gm-ap

Google training 0.28 0.10

Base training 0.26 0.12

Google test 0.30 0.11

Base test 0.31 0.13

Table 2: Results for Frenh data. Google approah is the result obtainedby merging original

(4)

Wehavereported onourexperimentationfortheAd-HoRobustMultilingualtrakCLEFtask

involving web-based query generation for English and Frenh olletions. The generation of a

nallistofresultsbymergingsearhresultsobtainedfromtwodierentquerieshasbeenstudied.

These twoqueriesaretheoriginal oneand anewonegenerated from Google results. Both lists

arejoinedbymeansoflogistiregression,insteadofusinganexpandedqueryaswedidlastyear.

The results are disappointing. While results for training data are very promising, there is not

improvementfor testdata. Thisquestionmust bendoutand wehopeto understand why the

performaneissopoorfortestdata,analyzing,forinstane,sideeetsoftheregressionapproah.

5 Aknowledgments

ThisworkhasbeenpartiallysupportedbyagrantfromtheSpanishGovernment,projetTIMOM

(TIN2006-15265-C06-03),andtheRFC/PP2006/Id_514grantedbytheUniversityofJaén..

Referenes

[1℄ A. Calvéand J.Savoy. Databasemergingstrategy basedonlogisti regression. Information

Proessing&Management,36:341359,2000.

[2℄ K. L. Kwok, L. Grunfeld, and D. D. Lewis. TREC-3 ad-ho, routing retrieval and thresh-

olding experiments using PIRCS. In Proeedings of TREC'3, volume 500, pages 247255,

Gaithersburg,1995.NIST.

[3℄ FernandoMartínez-Santiago,ArturoMontejo-Ráez,Miguel A. Garía-Cumbreras,andL.Al-

fonso Ureña-López . SINAI at CLEF 2006 Ad Ho RobustMultilingual Trak: QueryEx-

pansion using the Google Searh Engine. Evaluation of Multilingual and Multi-modal Infor-

mationRetrieval7thWorkshop oftheCross-LanguageEvaluation Forum,CLEF2006.LNCS-

Springer.,4730,September2007.

[4℄ S.E. Robertson and S.Walker. Okapi-Keenbow at TREC-8. In Proeedings of the 8th Text

Retrieval ConfereneTREC-8,NIST SpeialPubliation500-246, pages151162,1999.

[5℄ FernandoMartínezSantiago,LuisAlfonsoUreñaLópez,andMaiteTeresaMartín-Valdivia.A

mergingstrategy proposal: The2-step retrievalstatus value method. Inf. Retr.,9(1):7193,

2006.

[6℄ J. Savoy. Cross-Languageinformation retrieval: experimentsbased on CLEF 2000 orpora.

Information Proessing &Management, 39:75115,2003.

[7℄ J. Savoy. Combining multiple strategies for eetive ross-language retrieval. Information

Retrieval,7(1-2):121148,2004.

[8℄ E. Voorhees, N. K. Gupta, and B. Johnson-Laird. The olletionfusion problem. In D. K.

Harman, editor, Proeedings of the 3th Text Retrieval Conferene TREC-3, volume 500-225,

pages 95104, Gaithersburg, 1995. National Institute of Standards and Tehnology, Speial

SINAI at CLEF Ad-Hoc Robust Track 2007: Applying Google Search Engine for Robust Cross-Lingual Retrieval

D i

P rob[D i is rel|rank org i , rsv org i , rank google i , rsv google i ] = e α+β 1 ·ln(rank orgi )+β 2 · rsv orgi +β 3 ·ln(rank googlei )+β 4 · rsv googlei 1 + e α+β 1 ·ln(rank orgi )+β 2 · rsv orgi +β 3 ·ln(rank googlei )+β 4 · rsv googlei

α

β 1

β 2

β 3

β 4

D _i