in Spanish?
JoseL.Vicedo,MaximilianoSaiz,andRubenIzquierdo
DepartamentodeLenguajesySistemasInformaticos
UniversityofAlicante,Spain
fvicedo,max,[email protected]
Abstract. Thispaperdescribesthearchitecture,operationandresults
obtainedwiththeQuestionAnsweringprototypeforSpanishdeveloped
in the Department of Language Processing and Information Systems
at the University of Alicantefor CLEF-2004 Spanish monolingualQA
evaluation task. Our system is based on the prototype developed for
CLEF-2003Spanishmonolingualtask[3].Thissystemhasbeenenhanced
mainlyto usedocumentsindierent languages toobtain evidencesfor
supportingandcomplementingCLEFSpanishcorpora.Particularly,the
experimentsdescribed are intended to study how to use EnglishWeb
documentstosupportmonolingualSpanishQA.
1 Introduction
Cross-Language Evaluation Forum Campaigns 1
(CLEF) are characterized for
fosteringinvestigationin multilingualinformation accesssystemsfrom theper-
spectiveof Europeanlanguagesintegration.Particularlylastyear,CLEF orga-
nized the rst Multiple Language Question Answering task (QA@CLEF-2003)
guided to the evaluation of QA systems in several languages. This evaluation
wasveryimportantsinceitfosteredthedevelopmentofaseriesofresourcesfor
thedevelopmentandevaluationofQAsystemsfromamultilingualperspective.
Thiscampaign,theQA@CLEF-2004proposednewdiÆcultiesandtherefore,
thecharacteristicsofthisevaluationchangedsignicantly.Participantswerepro-
videdwithdocumentcollectionsandquestionsetsinsevenEuropeanlanguages:
Spanish,Portuguese,Italian,Dutch,German,FrenchandEnglish.
Participantshadtochoosealanguageforquestionsandanotherforthetar-
get document collection. This way the evaluation proposed from monolingual
tasks(whenquestionanddocumentlanguageswerethesame)to dierentcom-
binationsofbilingualQA(whenselectedlanguageswere dierent).
Foreachlanguage,theorganisationprovided200questionsrequiringfactual
ordenitionanswerswhoseanswerwasnotguaranteedtooccurinthedocument
collection.Systemsshouldreturnonlyoneresponseperquestion.
Ourparticipation wasrestrictedto theSpanishmonolingualtask. Thenov-
eltyintheexperimentsdevelopedwastheuseofdocumentsinlanguagesdierent
1
CLEF Spanishcorpora.Particularly,weperformed monolingualtask from two
dierentperspectives:(1)using WebSpanishdocumentsand (2)usingEnglish
WebdocumentstosupportmonolingualSpanishQA.Thiswaywecouldbeable
to investigate on using English (or by extent, other languages) documents to
supportmonolingualQA.
Thispaperisorganisedasfollows:Section2describesthemain characteris-
ticsofourQAsystem.Afterwards,wepresentandanalysetheresultsobtained
[email protected],weextractinitialconclu-
sionsanddiscussdirectionsforfuturework.
2 System Description
Our system is based on the QA system described in [3] where two main en-
hancementshavebeenadded: (1)theinclusionofadictionary-basedNEtagger
and (2) the possibility of usingWebdocumentsin other languagesto support
monolingualSpanishQA.
Asthis systemis describedin detailin [3]we onlypresentheretheirmain
characteristicsandenhancethenewmodulesincluded.Oursystemisorganized
intothefollowingmain modules:
1. Questionanalysis.
2. Passageretrieval.
3. Answerextraction.
Question analysis processesquestions formulatedto the systemin order to
detectand extracttheusefulinformation theycontain.Passage retrieval mod-
ule retrievesrelevantpassages from theSpanishEFE document collectionand
also from the Internet in the selected language (Spanish or English). Finally,
theanswer selection moduleprocessesrelevantpassages in orderto locateand
extractthenalanswer.Figure1showssystemarchitecture.
2.1 Question Analysis
Question analysis module carries outtwo processes: answer type classication
andkeywordselection.Theformerdetectsthetypeofinformationthattheques-
tionexpectsasanswer(adate,aquantity,etc)andthelatterselectsthoseques-
tionterms(keywords)that will allowlocatingthedocumentsthat arelikelyto
containthe answer.These processesare performedbyusing amanually devel-
oped set of lexical patterns. Answertypes haveincreased and now thesystem
currently copes with seven possible answer types: NUMBER, DATE, LOCA-
Question
Relevant passages Question Analysis
IR-n Passage Retrieval
Answer Extraction
Answers
Google Passage Retrieval
Relevant passages
Keyword Translation
Fig.1.Systemarchitecture
2.2 PassageRetrieval
Passage retrieval stage is accomplished in parallel using two dierent search
engines:IR-n[2]and Google 2
.IR-nsystemperformspassageretrievaloverthe
entire Spanish EFE document collection. In this case, keywords detected at
questionanalysis stageare processed usingMACOSpanish lemmatiser[1]and
theircorrespondinglemmasareusedforretrievingthe50mostrelevantpassages
fromtheEFEdocumentdatabase.
Inparallel,thesamekeywordlist(withoutbeinglemmatised)istranslatedto
thelanguagethesystemisrequiredtouseforWebsearch(inthiscaseSpanishor
English)andposedtoGoogleInternetsearchengine.Thesystemselectsthe50
best short summariesreturned in Google main retrievalpages. Keywordshave
beentranslatedbyusingSysTran 3
onlinetranslationservices.
2.3 AnswerExtraction
This module processesin parallel bothsets of passages selectedat passagere-
trievalstage(IR-nandGoogle)inordertodetectandextractthemostprobable
answertothequery.Thisprocessinvolves:(1)selectingandevaluatingcandidate
answer from CLEF Spanish document collectionand (2) adding web evidence
2
http://www.google.com/
3
in detailin[3]
3 Results
Wesubmittedtworuns.Firstrun (aliv041eses) wasobtainedapplyingthesys-
temdescribedaboveand usingSpanishWeb retrieveddocumentswhile second
run performed QA process by activating English Web retrieval (aliv042eses).
Table1showstheresultsobtainedforeachrun.
Table1.Spanishmonolingualtaskresults
Accuracy(%)
Run FactoidDenitionOverall
aliv041eses 30.56 40.00 31.50
aliv042eses 31.11 45.00 32.50
Resultanalysis showsthatevidence obtainedthroughEnglishInternetdoc-
uments (aliv042eses) performs better than using Spanish Web documents for
this purpose (aliv041eses). Nevertheless, performance dierences are near in-
signicant (32.5% { 31.5%). These results contradicted our initial hypotheses
since wethought that English web documents would probably help moresig-
nicantlySpanish monolingualQA. After ashallowerroranalysis wedetected
several translationproblemsthat seriouslyaectedtheprocess ofEnglishWeb
documentprocessing:
{ Keywordtranslation.Thelackofcontextwhentranslatingquestionkeywords
producesnon-adequatetranslations.This impliessometimes retrieving En-
glish documents that have no semantic relation with the original Spanish
question.
{ Propernountranslation.Propernountranslationisanunresolvedproblem.
Usually, propernouns referringto people orcompanieshavenotranslation
(eg.BillClinton).Ontheotherhand,namesofcountries(Espa~na vs.Spain)
orcities(Londres vs. London)dierdependingonthelanguage.
{ Abbreviationtranslation.Abbreviationsusuallyrefertolanguage-dependent
expressions.Fromthispointofview,ifwewanttocorrectlytranslateabbre-
viationsandacronymsweneedtoknowthewhole expressionortermsthey
refertoin theoriginallanguage.
{ Titles translation. Titles, such as names of books or lms should not be
translated.Thebasicproblem hereresidesin detectingthese expressionsto
beexcludedfromtranslationprocesses.
Allthesetranslationproblemsaectpassageretrievalandanswerextraction
second,itmakesimpossibletotakeadvantageofevidencesinotherlanguagesfor
supporting candidate answer selectionif propernouns, abbreviations antitles
arenotcorrectlytranslated.
4 Future Work
This work is a rst attempt to perform monolingual QA in Spanish by using
evidencesobtainedformcorporain dierentlanguages,in thiscase,English.
As we have previously seen, using corpora in other languages to support
monolingualQAispossibleandworthwhileifweareableto solvecorrectlythe
translationproblemsdescribedbefore.Consequently,mainlineoffutureworkto
beinvestigated will bedirected to adopt translation techniques that minimize
thecurrentlydetectederrors.
Moreoverwearguethatsurely,thisproblemisthemainbottlenecktowards
themainlong-termobjectiveofdevelopingawholesystemcapableofperforming
multilingual questionanswering.
5 Acknowledgements
This work has been partially supported by theSpanish Government(CICYT)
withgrantTIC2003-07158-C04-01.
References
1. J.Atserias,J.Carmona,I.Castellon,S.Cervell,M.Civit,L.Marquez,M.A.Mart,
L.Padro,R.Placer,H.Rodrguez,M.Taule,andJ.Turmo.MorphosyntacticAnal-
ysisandParsingofUnrestrictedSpanishText.InProceedingsofFirstInternational
Conference on Language Resources and Evaluation. LREC'98, pages 1267{1272,
Granada,Spain,1998.
2. F.Llopis,J.L.Vicedo,andA.Ferrandez. IR-nsystem,apassageretrievalsystema
at CLEF2001. InWorkshopof Cross-Language Evaluation Forum(CLEF 2001),
LecturenotesinComputerScience,Darmstadt,Germany,2001.Springer-Verlag.
3. J.L.Vicedo,R.Izquierdo,F.Llopis,andR.Mu~noz. QuestionAnsweringinSpanish.
InWorkshopofCross-Language EvaluationForum(CLEF2003), Lecturenotesin
ComputerScience,Thondheim,Norway,2003.Springer-Verlag.