Working with Russian Queries for the GIRT, Bilingual and Multilingual CLEF Tasks

Fredric C. Gey 1, Hailing Jiang 2 and Natalia Perelman 2

1 UC Data Archive & Technical Assistance,
2 School of Information Management and Systems
University of California
Berkeley, CA 94720 USA

Abstract. For our activities within the CLEF 2001 evaluation, Berkeley Group One participated in the bilingual, multilingual and GIRT tasks, focussing on the use of Russian queries. Performance on the Russian queries → English documents bilingual task was excellent, comparable to performance using German queries. For the multilingual task we utilized English as a pivot language between Russian and German and the English/French/German/Italian/Spanish document collections. Performance here was merely average. The GIRT task performed Russian → German cross-language IR by comparing web-available machine translation with lookup techniques on the GIRT thesaurus.

1 Introduction

Successful cross-language information retrieval (CLIR) combines linguistic techniques (phrase discovery, machine translation, bilingual dictionary lookup) with robust monolingual information retrieval. For monolingual retrieval the Berkeley group has used the technique of logistic regression from the beginning of the TREC series of conferences. In TREC-2 [1] we derived a statistical formula for predicting probability of relevance based upon statistical clues contained within documents, queries and collections as a whole. This formula was used for document retrieval in Chinese [3] and Spanish in TREC-4 through TREC-6. We utilized the identical formula for English and German queries against the English/French/German/Italian document collections in the CLEF 2000 evaluation [10]. During the past two years, the formula has proven well-suited for Japanese and Japanese-English cross-language information retrieval as well as English-Chinese CLIR [6], even when only trained on English document collections. Participation in the NTCIR Workshops in Tokyo (http://research.nii.ac.jp/~ntcadm/workshop/work-en.html) led to different techniques for cross-language retrieval, ones which utilised the power of human indexing of documents to improve retrieval. Alignments of parallel texts were used to create large-scale bilingual lexicons between English and Japanese and between English and Chinese. Such lexicons were well-suited to the


The document ranking formula used by Berkeley in all of our CLEF retrieval runs was the TREC-2 formula [1]. The ad hoc retrieval results on the TREC test collections have shown that the formula is robust for long queries and manually reformulated queries. Applying the same formula (trained on English TREC collections) to other languages has performed well, as on the TREC-4 Spanish collections, the TREC-5 Chinese collection and the TREC-6 and TREC-7 European languages (French, German, Italian) [4,5]. Thus the algorithm has demonstrated its robustness independent of language as long as appropriate word boundary detection (segmentation) can be achieved. The log odds of relevance of document D to query Q is given by

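\begin{align}
\log O(R \mid D, Q) &= \log \frac{P(R \mid D, Q)}{P(\overline{R} \mid D, Q)} \tag{1} \\
&= -3.51 + \frac{1}{\sqrt{N+1}}\,\Phi + 0.0929\,N \tag{2} \\
\Phi &= 37.4 \sum_{i=1}^{N} \frac{qtf_i}{ql + 35} + 0.330 \sum_{i=1}^{N} \log \frac{dtf_i}{dl + 80} - 0.1937 \sum_{i=1}^{N} \log \frac{ctf_i}{cl} \tag{3}
\end{align}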

where $P(R \mid D, Q)$ is the probability of relevance of document D with respect to query Q, and $P(\overline{R} \mid D, Q)$ is the probability of irrelevance of document D with respect to query Q. Details about the derivation of these formulae may be found in our TREC paper [1]. It is to be emphasized that training has taken place exclusively on English documents but the matching has proven robust over seven other languages in monolingual retrieval, including Japanese and Chinese, where word boundary detection forms an additional step in the discovery process.
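As an illustration, the ranking formula can be sketched in a few lines of Python. This is a minimal sketch, not the group's actual implementation. It uses the published TREC-2 coefficients (-3.51, 37.4, 0.330, -0.1937, 0.0929) and assumes the symbol meanings conventionally given in [1], which are not spelled out in this text: N is the number of terms shared by query and document, qtf, dtf and ctf are the frequencies of a shared term in the query, the document and the collection, and ql, dl and cl are the query, document and collection lengths. The names `TermStats`, `log_odds_relevance` and `probability_of_relevance` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TermStats:
    """Frequencies for one term that occurs in both query and document."""
    qtf: int  # within-query frequency (assumed meaning)
    dtf: int  # within-document frequency (assumed meaning)
    ctf: int  # frequency in the whole collection (assumed meaning)


def log_odds_relevance(matches: Sequence[TermStats], ql: int, dl: int, cl: int) -> float:
    """Log odds of relevance using the TREC-2 logistic regression coefficients."""
    n = len(matches)
    phi = (
        37.4 * sum(m.qtf / (ql + 35) for m in matches)
        + 0.330 * sum(math.log(m.dtf / (dl + 80)) for m in matches)
        - 0.1937 * sum(math.log(m.ctf / cl) for m in matches)
    )
    return -3.51 + phi / math.sqrt(n + 1) + 0.0929 * n


def probability_of_relevance(log_odds: float) -> float:
    """Convert log odds back to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    # Toy numbers, chosen only to exercise the function.
    score = log_odds_relevance([TermStats(qtf=1, dtf=2, ctf=100)], ql=5, dl=120, cl=1_000_000)
    print(round(score, 3), round(probability_of_relevance(score), 3))
```

Documents would then be ranked by decreasing log odds. Since the logistic function is monotonic, ranking by probability gives the same order.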

3 Submissions for the CLEF main tasks

CLEF has three main tasks: monolingual (non-English) retrieval, bilingual (where non-English queries are run against the CLEF sub-collection of English language documents), and multilingual, where queries in any language are run against a multilingual collection of documents comprised of the union of subcollections in English, French, German, Italian and Spanish. We chose this year to participate in the bilingual and multilingual main tasks. In addition, where our focus last year was on English and German source queries, this year we wished to explore the interesting question of whether a less-used query language (in this case Russian) could achieve performance comparable to the more mainstream Western European languages.

For CLEF main tasks we submitted 7 runs, 3 for the bilingual (German/Russian-English) task and 4 for the Multilingual task. Table 1 summarizes these runs.


Run Name    Language   Run type    Priority
BKBIGEM1    German     Manual      1
BKBIREM1    Russian    Manual      2
BKBIREA1    Russian    Automatic   3

For the Multilingual task we submitted:

BKMUGAM1    German     Manual      1
BKMUEAA1    English    Automatic   2
BKMUEAA2    English    Automatic   3
BKMUREA1    Russian    Automatic   4

Table 1. Summary of seven official CLEF runs.

3.1 Bilingual Retrieval of the CLEF collections

Bilingual retrieval is performed by running queries in another language against the English collection of CLEF. We chose to focus on Russian but to do German for a baseline comparison. The run BKBIGEM1 was obtained by translating the German queries to English using the L&H Power Translator and then manually adjusting the resulting queries by searching for the untranslated terms in our own special association dictionary created from a library catalog. An example of words not found comes from Topic 88 about 'mad cow disease'. In the German version, the words Spongiformer and Enzephalopathie were not translated by the commercial system, but our association dictionary obtained the words 'hepatic encephalopathy' associated with the query term Enzephalopathie. Further details about our methodology can be found in [8]. The run BKBIREA1 was obtained by using the PROMT web-based translator (http://www.translate.ru/). As with the German translation, certain words were not translated. Our methodology to deal with this was twofold: first we transliterated the Russian queries to their romanized alphabetic equivalent, and then we added untranslated terms to the English query in their transliterated form. For example, in Topic 50 on 'uprising of Indians in Chiapas', the Russian word 'Чиапас' was not translated. It can, however, be transliterated as 'chiapas'.
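The transliteration fallback can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure actually used for the runs. The character table is a diacritic-free Library of Congress style romanization inferred from the transliterated examples in this paper (for instance 'chiapas' and 'migratsiiu'). The function names `transliterate` and `patch_untranslated` are hypothetical, as is the assumption that the MT system leaves untranslated words in Cyrillic inside its English output.

```python
import re

# Assumed romanization table (Library of Congress style, diacritics dropped).
TRANSLIT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "e",
    "ж": "zh", "з": "z", "и": "i", "й": "i", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": '"', "ы": "y", "ь": "'", "э": "e", "ю": "iu", "я": "ia",
}

# Any run of characters from the basic Cyrillic Unicode block.
CYRILLIC_RUN = re.compile(r"[\u0400-\u04FF]+")


def transliterate(word: str) -> str:
    """Romanize a Russian word character by character (lowercased)."""
    return "".join(TRANSLIT.get(ch, ch) for ch in word.lower())


def patch_untranslated(mt_output: str) -> str:
    """Replace words the MT system left in Cyrillic with their romanized form."""
    return CYRILLIC_RUN.sub(lambda m: transliterate(m.group(0)), mt_output)


if __name__ == "__main__":
    print(patch_untranslated("uprising of Indians in Чиапас"))
    # prints: uprising of Indians in chiapas
```

Romanized proper names such as 'chiapas' often coincide with, or come close to, their English spelling, which is what makes this fallback useful for retrieval.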

3.2 Bilingual Performance

Our bilingual performance can be found in Table 2. The final line of the table, labeled "CLEF Prec.", is computed by averaging, over queries, the CLEF median precision among all submitted runs. The average is performed over the 47 queries for which the English collection had relevant documents. While an average of medians cannot be considered a statistic from which rigorous inference can be made, we have found it useful to average the medians of all queries as sent by CLEF organizers. Comparing our overall precision to this average of medians yields some fuzzy gauge of whether our performance is better, poorer, or about the same as the median performance.
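The average-of-medians gauge described above amounts to a simple mean. The following is a minimal sketch, assuming the per-query median precisions are available as a mapping from topic identifier to value. The function names, the topic identifiers in the usage lines and the `tolerance` band are hypothetical and only serve the illustration.

```python
from statistics import mean
from typing import Mapping


def average_of_medians(median_precision_by_topic: Mapping[str, float]) -> float:
    """Mean of the per-query median precision over all queries."""
    return mean(median_precision_by_topic.values())


def compare_to_median(run_precision: float, median_average: float,
                      tolerance: float = 0.02) -> str:
    """Rough gauge: 'better', 'poorer' or 'about the same' (tolerance is arbitrary)."""
    if run_precision > median_average + tolerance:
        return "better"
    if run_precision < median_average - tolerance:
        return "poorer"
    return "about the same"


if __name__ == "__main__":
    # Made-up medians for three topics, for illustration only.
    medians = {"topic-a": 0.20, "topic-b": 0.40, "topic-c": 0.30}
    print(compare_to_median(0.45, average_of_medians(medians)))
```

As the text notes, this is a fuzzy gauge rather than a statistic supporting rigorous inference.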

Using this measure, we find that all our bilingual runs performed significantly better than the median.


Retrieved     47000    47000    47000
Relevant        856      856      856
Rel. Ret        812      737      733
Precision
  at 0.00    0.7797   0.6545   0.6420
  at 0.10    0.7390   0.6451   0.6303
  at 0.20    0.6912   0.5877   0.5691
  at 0.30    0.6306   0.5354   0.5187
  at 0.40    0.5944   0.4987   0.4806
  at 0.50    0.5529   0.4397   0.4167
  at 0.60    0.4693   0.3695   0.3605
  at 0.70    0.3952   0.3331   0.3218
  at 0.80    0.3494   0.2881   0.2747
  at 0.90    0.2869   0.2398   0.2339
  at 1.00    0.2375   0.1762   0.1743
Brk. Prec.   0.5088   0.4204   0.4077
CLEF Prec.   0.2423   0.2423   0.2423

Table 2. Results of three official Berkeley CLEF bilingual runs.

3.3 Multilingual Retrieval of the CLEF collections

Our non-English multilingual retrieval runs were based upon our bilingual experiments, extended to French/Italian/Spanish using English as a pivot language and (again) the L&H Power Translator as the MT system to translate queries from one language to another. Run BKMURAA1 takes the English translated queries of the bilingual run BKBIREA1 and again translates them to French/German/Italian/Spanish. Run BKMUGAM1 takes the German queries of bilingual run BKBIGEM1 as well as the translation of their English equivalents into French/Italian/Spanish. For comparison we did direct translation from the English queries in runs BKMUEAA1 and BKMUEAA2. The difference between these two runs is that BKMUEAA1 used Title and Description fields only.
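The pivot-language scheme can be sketched as below. This is a minimal sketch, assuming a generic `translate(text, source, target)` callable that stands in for an MT system. It is a hypothetical interface and does not correspond to any API of the L&H Power Translator. The function name `pivot_translate` and the two-letter language codes are likewise illustrative.

```python
from typing import Callable, Dict, Iterable

# Hypothetical MT interface: (text, source language, target language) -> text.
Translator = Callable[[str, str, str], str]


def pivot_translate(query: str, source: str, pivot: str,
                    targets: Iterable[str], translate: Translator) -> Dict[str, str]:
    """Produce one query per target collection by going through a pivot language."""
    pivot_query = query if source == pivot else translate(query, source, pivot)
    queries: Dict[str, str] = {}
    for target in targets:
        if target == source:
            queries[target] = query          # original query, no translation needed
        elif target == pivot:
            queries[target] = pivot_query    # first-hop translation
        else:
            queries[target] = translate(pivot_query, pivot, target)  # second hop
    return queries


if __name__ == "__main__":
    # A stand-in translator that only records the hops it was asked to make.
    def fake_mt(text: str, src: str, tgt: str) -> str:
        return f"{text}[{src}>{tgt}]"

    for lang, q in pivot_translate("запрос", "ru", "en",
                                   ["en", "fr", "de", "it", "es"], fake_mt).items():
        print(lang, q)
```

With a Russian source every non-English target is reached in two MT hops, so translation errors can compound. A German source query is used unchanged against the German collection, as in run BKMUGAM1.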

The results show that with Russian queries we are about one third lower in average precision than with either English or German queries. We are currently studying why this is so. In addition our overall performance seems only slightly above the CLEF-2001 median performance. This seemed puzzling when compared to our excellent bilingual performance and our above average performance at CLEF-2000. For comparison we also inserted the average of median precisions for last year (Row 2000 Prec. at the bottom of Table 3). As can be seen, the median performance in terms of query precision for CLEF-2001 of 0.2749 is about 50 percent better than the median multilingual performance of 0.1843 of CLEF-2000. This argues that significant progress has been made by the


Retrieved     50000    50000    50000    50000
Relevant       8138     8138     8138     8138
Rel. Ret       5520     5190     5223     4202
Precision
  at 0.00    0.8890   0.8198   0.8522   0.7698
  at 0.10    0.6315   0.5708   0.6058   0.4525
  at 0.20    0.5141   0.4703   0.5143   0.3381
  at 0.30    0.4441   0.3892   0.4137   0.2643
  at 0.40    0.3653   0.3061   0.3324   0.1796
  at 0.50    0.2950   0.2476   0.2697   0.1443
  at 0.60    0.2244   0.1736   0.2033   0.0933
  at 0.70    0.1502   0.1110   0.1281   0.0556
  at 0.80    0.0894   0.0620   0.0806   0.0319
  at 0.90    0.0457   0.0440   0.0315   0.0058
  at 1.00    0.0022   0.0029   0.0026   0.0005
Brk. Prec.   0.3101   0.2674   0.2902   0.1838
CLEF Prec.   0.2749   0.2749   0.2749   0.2749
2000 Prec.   0.1843   0.1843   0.1843   0.1843

Table 3. Results of four official CLEF-2001 multilingual runs.

4 GIRT retrieval

The special emphasis of our current funding has focussed upon retrieval of specialized domain documents which have been assigned individual classification identifiers by human indexers. These classification identifiers can come from thesauri. Since many millions of dollars are expended on developing such classification schemes and using them to index documents, it is natural to attempt to exploit the resources to the fullest extent possible to improve retrieval. In some cases such thesauri are developed with identifiers translated (or provided) in multiple languages, and can thus be used to transfer words across the language barrier.

The GIRT collection consists of reports and papers (grey literature) in the social science domain. The collection is managed and indexed by the GESIS organization (http://www.social-science-gesis.de). GIRT is an excellent example of a collection indexed by a multilingual thesaurus, originally German-English, recently translated into Russian. The GIRT multilingual thesaurus (German-English), which is based on the Thesaurus for the Social Sciences [2], provides the vocabulary source for the indexing terms within the GIRT collection of CLEF. Further information about GIRT can be found in [7]. There are 76,128 German documents in the GIRT subtask collection. Almost all the documents contain manually assigned thesaurus terms. On average, there are about 10 thesaurus terms assigned to each document. Figure 1 is an example of a thesaurus entry. Since transliteration of the Cyrillic alphabet is a key part of our retrieval strategy, we


In our experiments, we indexed the TITLE and TEXT sections in each document (not the E-TITLE or E-TEXT). The CLEF rules specified that indexing any other field would need to be declared a manual run. For our CLEF runs this year we again used the Muscat stemmer, which is similar to the Porter stemmer but for the German language.

4.1 Query translation from Russian to German

In order to prepare for query translation, we first extracted all the single words and bigrams from the Russian topic fields. Since we do not have a Russian POS tagger, we took any two adjacent words (overlapping word bigram) to be considered a potential phrase. The single words and bigrams in each Russian query were then compared against the Russian-German thesaurus. If a word or bigram was found in the thesaurus, its German translation was added to the new German query being created. The resulting German query was then run against the German collection to retrieve relevant documents.

For comparison, we also used an online MT system (Promt-Reverso: http://translation2.paralink.com/) to translate the Russian queries to German.
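The thesaurus-based translation step can be sketched as follows. This is a minimal sketch, assuming whitespace tokenization of already transliterated text and a thesaurus held as a plain dictionary. The names `extract_terms` and `translate_terms` are hypothetical, and the toy thesaurus contains only a few Russian-German pairs quoted elsewhere in this paper.

```python
from typing import Dict, List


def extract_terms(topic_text: str) -> List[str]:
    """Return all single words followed by all overlapping word bigrams."""
    words = topic_text.lower().split()  # assumed tokenization: split on whitespace
    bigrams = [f"{first} {second}" for first, second in zip(words, words[1:])]
    return words + bigrams


def translate_terms(terms: List[str], thesaurus: Dict[str, str]) -> List[str]:
    """Keep the German translation of every term found exactly in the thesaurus."""
    return [thesaurus[term] for term in terms if term in thesaurus]


if __name__ == "__main__":
    # Toy Russian-German thesaurus (transliterated), for illustration only.
    toy_thesaurus = {
        "migratsiia": "wanderung",
        "kul'tura": "kultur",
        "rabochee mesto": "arbeitsplatz",
    }
    terms = extract_terms("kul'tura rabochee mesto")
    print(terms)
    print(translate_terms(terms, toy_thesaurus))
```

Because every adjacent word pair becomes a candidate phrase, most generated bigrams will not be thesaurus entries. That is harmless for exact lookup, since unmatched terms are simply dropped.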

Fuzzy Matching for the Thesaurus. The first approach to thesaurus-based translation was exact matching for thesaurus lookup. From the 25 GIRT topics we obtained about 1300 Russian query terms (words and bigrams). Only 50 of them were directly found in the thesaurus, and these were all single words. Two problems contribute to the low matching rate:

First, a Russian word may have several forms or variations. Usually only the base form or general form appears in the thesaurus. For example, "evropa" (Europe in English) is in the thesaurus, but "evrope" and "evropu" are not. In this case, a Russian morphological analyzer would be helpful. Since we do not have a Russian morphological analyzer, we used fuzzy matching to address this problem.

There are different kinds of algorithms for fuzzy matching, such as Levenshtein distance, common n-grams, longest common subsequence, etc. [9]. We found the simple common-bigram algorithm to be very efficient and effective for matching different word forms. The two strings are divided into their constituent bigrams and Dice's coefficient is used to compute the similarity between them.


Original Russian word   Word found in the thesaurus   German translation
migratsiiu              migratsiia                    wanderung
migratsii               migratsiia                    wanderung
bezrabotitsei           bezrabotitsa                  arbeitslosigkeit
televideniia            televidenie                   fernsehen
kul'turu                kul'tura                      kultur
kul'turoi               kul'tura                      kultur
tekhnologiei            tekhnologiia                  technologie
tekhnologii             tekhnologiia                  technologie

Above are some examples of Russian words that do not occur in the thesaurus but whose different forms were found in the thesaurus by fuzzy matching (the Russian characters in the examples are transliterated for easy reading).
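The common-bigram matching can be sketched as follows. This is a minimal sketch, not the code used for the official runs. Dice's coefficient is computed here as twice the number of shared character bigrams divided by the total number of bigrams in both strings. The function names and the acceptance threshold of 0.7 are assumptions, since the text does not state a cut-off value.

```python
from collections import Counter
from typing import Iterable, Optional


def char_bigrams(text: str) -> Counter:
    """Multiset of overlapping character bigrams in a string."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def dice(a: str, b: str) -> float:
    """Dice's coefficient over character bigrams: 2 * shared / (|A| + |B|)."""
    bigrams_a, bigrams_b = char_bigrams(a), char_bigrams(b)
    total = sum(bigrams_a.values()) + sum(bigrams_b.values())
    if total == 0:
        return 0.0
    shared = sum((bigrams_a & bigrams_b).values())  # multiset intersection
    return 2.0 * shared / total


def best_match(term: str, entries: Iterable[str], threshold: float = 0.7) -> Optional[str]:
    """Most similar thesaurus entry, or None if nothing reaches the (assumed) threshold."""
    best = max(entries, key=lambda entry: dice(term, entry), default=None)
    if best is not None and dice(term, best) >= threshold:
        return best
    return None


if __name__ == "__main__":
    print(dice("evrope", "evropa"))  # prints 0.8
    print(best_match("migratsiiu", ["migratsiia", "kul'tura", "televidenie"]))
    # prints: migratsiia
```

Because the comparison works on character bigrams rather than whole words, the same function can be applied unchanged to multi-word thesaurus entries.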

The second problem lies in finding query bigrams which do not match exactly to thesaurus entries. Fuzzy matching was also useful for finding different forms of bigrams, even in cases where word order is changed. Examples are:

Original Russian bigram        Bigram found in the thesaurus   German translation
tekhnologicheskogo razvitiia   tekhnologicheskoe razvitie      technologische entwicklung
razvitie i organizatsiia       organizatsionnoe razvitie       organisationsentwicklung
upravlenie organizatsiia       organizatsiia upravleniia       verwaltungsorganisation
rabochem meste                 rabochee mesto                  arbeitsplatz
rukovodiashchikh rabotnikov    rukovodiashchie rabotniki       führungskraft

The way bigram 'phrases' were created had two problems: first, many of the bigrams were simply not meaningful; second, even though most genuine phrases contain two words (bigrams), approximately 25 percent of Russian terms in the thesaurus contain 3 or more words. A Russian POS tagger would be very helpful for finding meaningful or long phrases.

4.2 GIRT results and analysis

Our GIRT results are summarized in Table 4. The runs can be described as follows: BKGRGGA is a monolingual run using the German version of the topics run against the German GIRT document collection. BKGRRGA1 is a Russian-German bilingual run using an MT system for query translation. BKGRRGA2 is a Russian-German bilingual run using thesaurus lookup and fuzzy matching. BKGRRGA3 is a Russian-German bilingual run which is identical in methodology to BKGRRGA2 except that only title and description sections of the topic were used for matching. As we can see, our Russian runs achieve only about 1/3 of the precision of the German-German monolingual run, with a significant edge


Retrieved     25000    25000    25000    25000
Relevant       1111     1111     1111     1111
Rel. Ret       1054      774      781      813
Precision
  at 0.00    0.9390   0.4408   0.4032   0.3793
  at 0.10    0.8225   0.3461   0.3018   0.2640
  at 0.20    0.7501   0.2993   0.2441   0.2381
  at 0.30    0.6282   0.2671   0.2257   0.1984
  at 0.40    0.5676   0.2381   0.1790   0.1782
  at 0.50    0.5166   0.1999   0.1341   0.1609
  at 0.60    0.4604   0.1400   0.0993   0.1323
  at 0.70    0.4002   0.1045   0.0712   0.1125
  at 0.80    0.3038   0.0693   0.0502   0.0765
  at 0.90    0.2097   0.0454   0.0232   0.0273
  at 1.00    0.0620   0.0051   0.0013   0.0000
Brk. Prec.   0.5002   0.1845   0.1448   0.1461

Table 4. Results of official GIRT Russian-German runs.

retrieved more relevant documents, this did not translate into higher overall

precision.

5 Summary and Acknowledgments

The participation of Berkeley's Group One in CLEF-2001 has enabled us to explore the difficulties in extending cross-language information retrieval to a non-Roman alphabet language, Russian, for which limited resources are available. Specifically, we compared bilingual and multilingual retrieval using Russian source queries against retrieval using German source queries and, for multilingual retrieval, English source queries. For the GIRT task we compared various forms of Russian → German retrieval. We have determined that there is significant work to be done before cross-language information retrieval from the Russian language will become competitive with other European languages.

This research was supported by DARPA (Department of Defense Advanced Research Projects Agency) under contract N66001-97-8541; AO# F477: Search Support for Unfamiliar Metadata Vocabularies within the DARPA Information Technology Office. We thank Aitao Chen for indexing the main CLEF collections.

References

1. W. Cooper, A. Chen, and F. Gey. Full text retrieval based on probabilistic equations with coefficients fitted by logistic regression. In D. K. Harman, editor, The Second


English. [Vol. 2:] English-German. [Edition] 1999. InformationsZentrum Sozialwissenschaften Bonn, 2000.

3. A. Chen, J. He, L. Xu, F. Gey, and J. Meggs. Chinese text retrieval without using a dictionary. In A. Desai Narasimhalu, Nicholas J. Belkin, and Peter Willett, editors, Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Philadelphia, pages 42-49, 1997.

4. F. C. Gey and A. Chen. Phrase discovery for English and cross-language retrieval at TREC-6. In D. K. Harman and Ellen Voorhees, editors, The Sixth Text REtrieval Conference (TREC-6), NIST Special Publication 500-240, pages 637-647, August 1998.

5. F. C. Gey and H. Jiang. English-German cross-language retrieval for the GIRT collection - exploiting a multilingual thesaurus. In Ellen Voorhees, editor, The Eighth Text REtrieval Conference (TREC-8), draft notebook proceedings, pages 219-234, November 1999.

6. A. Chen, H. Jiang, and F. Gey. Berkeley at NTCIR-2: Chinese, Japanese and English IR experiments. In N. Kando, editor, Proceedings of the Second NTCIR Workshop on Evaluation of Chinese and Japanese Text Retrieval and Summarization, Tokyo, Japan, pages 32-39, March 2001.

7. Michael Kluck and Fredric Gey. The domain-specific task of CLEF - specific evaluation strategies in cross-language information retrieval. In Cross Language Retrieval Evaluation, Proceedings of the CLEF 2000 Workshop. Springer, 2001.

8. F. Gey, M. Buckland, R. Larson, and A. Chen. Entry vocabulary - a technology to enhance digital search. In Proceedings of the First International Conference on Human Language Technology, March 2001.

9. Michael P. Oakes. Statistics for Corpus Linguistics. Edinburgh University Press, 1998.

10. F. Gey, H. Jiang, V. Petras, and A. Chen. Cross-language retrieval for the CLEF collections - comparing multiple methods of retrieval. In Carol Peters, editor, Cross Language Retrieval Evaluation, Proceedings of the CLEF 2000 Workshop. Springer, 2001.