• Aucun résultat trouvé

CLEF 2009 Question Answering Experiments at Tokyo Institute of Technology

N/A
N/A
Protected

Academic year: 2022

Partager "CLEF 2009 Question Answering Experiments at Tokyo Institute of Technology"

Copied!
6
0
0

Texte intégral

(1)

Tokyo Institute of Technology

MatthiasH. Heie,JosefR. Novak,EdwardW.D. WhittakerandSadaokiFurui

TokyoInstituteofTechnology

{heie,novakj,edw,furui}@furui.cs.titech.ac.jp

Abstract

Inthis paperwe describethe experimentscarried outat TokyoInstituteof Technology

for the CLEF 2009 Question Answering onSpeechTranscriptions (QAST)task, where we

participatedintheEnglishtrack. Weapplyanon-linguistic,data-drivenapproachtoQuestion

Answering (QA). Relevant sentences are rst retrieved from the supplied corpus, using a

languagemodelbasedsentenceretrievalmodule. Ourprobabilisticanswerextractionmodule

then pinpoints exact answers in these sentences. In this year's QAST task the question

setcontainsbothfactoidandnon-factoidquestions,wherethenon-factoidquestionsaskfor

denitions of given named entities. We do not make any adjustments of our factoid QA

system toaccountfor non-factoidquestions. Moreover,weare presentedwiththechallenge

of searching for the right answerin arelatively smallcorpus. Our system is built to take

advantageofredundantinformationinlargecorpora,however,inthis tasksuchredundancy

isnotavailable. TheresultsshowthatourQAframeworkdoesnotperformwellonthistask:

weendlastoffourparticipatingteamsinsevenoutofeightruns. However,ourperformance

does not regress as automatic transcriptions of speeches or questions are used instead of

manualtranscriptions. Thustheonlyruninwhichwearenotplacedlast,isthemostdicult

task,wherespokenquestionsandASRtranscriptionswithhighWERareused.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.3 Information Search and Retrieval; H.3.4

SystemsandSoftware

General Terms

Measurement,Performance,Experimentation

Keywords

Questionanswering,Questionsbeyondfactoids

1 Introduction

In this paper we describe the application of our data-driven and non-linguistic framework for

the CLEF 2009 QAST task. Two sets of questions were given: written questions and manual

transcriptions of the corresponding spoken questions. The corpusconsisted of transcriptionsof

EuropeanParliamentPlenarysessionsinEnglish. ThequestionsetcontainedDenitionquestions

andFactoid questions. 4versions ofthecorpuswereavailable: manualtranscriptionsand3ASR

transcriptions. Wesubmittedoneanswerset for each combinationof questionsetsand corpora,

intotal8submissions.

Our systemis a factoid QA system that wepreviously have participated with in other QA

(2)

tion words, as well as simple rules for pre-processing of the data, makesit possible to rapidly

developnewsystemsforawidevarietyofdierentlanguagesanddomains. Duetoitsdata-driven

nature, ourQA systemperforms best when there is alarge corpusavailable, containingseveral

co-occurrencesofquestionwordsand thecorrect answer. ThustheQASTtask presentedachal-

lengetousduetothesmallsizeofthecorpus. Moreover,wemadenoadjustmentstoourfactoid

QA system to account for Denition questions, i.e. wetreated Denition questions as Factoid

questions.

Thesystem comprisestwomain components, an Information Retrieval (IR)module used to

retrieverelevantsentencesfromacorpus,andanAnswerExtraction(AE)module whichisused

to identify andrankexactanswersin thesentences returnedbytheIR module. ForIR weused

languagemodelbasedsentenceretrieval. Inthisapproach,alanguagemodel(LM)isgeneratedfor

eachsentenceandthesemodelsarecombinedwithdocumentLMstotakeadvantageofcontextual

information. From the retrieved information, we extract rigid answers using our answer lter

model.

2 Sentence retrieval

Languagemodeling forIR has gained in popularityoverthe last decadesincethe approach was

proposed[5]. UnderthisapproachaLMisestimatedforeachdocument. Thedocumentsarethen

rankedaccordingtotheconditionalprobability

P (Q | D)

,theprobabilityofgeneratingthequery

Q

giventhedocument

D

.

Weranksentencescorrespondingly[6]. Due tolackofdatatotrainthesentencespecicLM,

itisassumedthatallwordsareindependent,henceunigramsareused:

P(Q | S ) = Y |Q|

i=1

P (q i | S),

(1)

where

q i

isthe

i

thquerytermin thequery

Q = (q 1 ...q |Q| )

composedof

|Q|

queryterms.

SmoothingmethodsarenormallyemployedwithLMstoavoidtheproblemofzeroprobabilities

when one of the query terms does not occur in the document. This is typically achieved by

redistributingprobabilitymassfromthedocumentmodeltoabackgroundcollectionmodel

P (Q | C)

. WeuseDirichletprior,wheretheprobabilityofaqueryterm

q

givenasentence

S

iscalculated

as:

P 1 (q | S) = c(q; S) + µ · p(q | C) P

w c(w; S) + µ ,

(2)

where

c(q; S)

isthecountofqueryterm

q

in sentence

S

,

µ

isasmoothingparameter,

p(q | C)

is theunigram probability of

q

accordingto thebackgroundcollectionmodeland

P

w c(w; S)

is

thecountofallwordsin

S

.

A problem with this model is that words relevant to the sentence might not occur in the

sentence itself, but in the surrounding text. Forexample, for the question Where was George

Bush born?, the sentence He was born in Connecticut in an article about George Bush should

ideally be assigned ahigh probability, despite the sentence missing important query terms. To

accountforthis,wetraindocumentLMs,

P 1 (q | D)

,inthesamemannerasfor

P 1 (q | S)

inEq.(2),

andperformalinearinterpolationbetween

P 1 (q | S)

and

P 1 (q | D)

:

P 2 (q | S) = (1 − α) · P 1 (q | S) + α · P 1 (q | D),

(3)

where

0 ≤ α ≤ 1

isaninterpolationparameter.

(3)

Foranswerextractionweusetheframeworkdescribedindetailin[7]. Wemodelthemoststraight-

forwardandobviousdependence oftheprobabilityofananswer

A

dependingonaquestion

Q

:

P (A | Q) = P (A | W, X),

(4)

where

A

and

Q

are considered to be strings of

l A

words

A = a 1 , . . . , a l A

and

l Q

words

Q = q 1 , . . . , q l Q

, respectively. Here

W = w 1 , . . . , w l W

represents a set of features describing the

question-type part of

Q

such aswhen, why, how, etc. while

X = x 1 , . . . , x l X

representsa set

of features that describe the information-bearing part of

Q

, i.e. what the questionis actually

aboutandwhatitrefersto. Forexample,inthequestions,WherewasTomCruise married? and

When was Tom Cruise married?, the information-bearing component is identical in both cases

whereasthequestion-typecomponentisdierent.

Findingthebestanswer

A ˆ

involvesasearchoverallavailable

A

for theonewhich maximizes

theprobabilityoftheabovemodel,i.e.,

A ˆ = arg max

A P (A | W, X).

(5)

Given the correct probability distribution, this is guaranteed to giveus the optimal answer

in amaximumlikelihoodsense. Wedon'tknowthisdistribution and itis stilldicultto model

but, using Bayes' rule and making various simplifying, modeling and conditional independence

assumptions(asdescribedin detailin[7])Eq.(5)canberearrangedtogive

arg max

A P (A | X )

| {z }

answer retrieval

model

· P (W | A)

| {z }

answer f ilter model

.

(6)

The

P (A | X)

model wecallthe answer retrieval model. In this year's evaluation wedidn't

usetheanswerretrievalmodel,i.e.

P (A | X )

isuniform.

The

P (W | A)

model matchesapotentialanswer

A

withfeaturesinthequestion-typeset

W

.

Forexample,itrelatesplacenameswithwhere-typequestions. Wecallthiscomponenttheanswer

lter model anditisstructuredasfollows.

The question-type feature set

W = w 1 , . . . , w l W

is constructed by extracting

n

-tuples (

n = 1, 2, . . .

) such as Who, Where and In what from the input question

Q

. A set of single-word featuresisextractedbasedonfrequencyofoccurrenceinourcollectionofexamplequestions.

Modeling the complex relationship between

W

and

A

directly is non-trivial. We therefore introduceanintermediatevariablerepresentingclassesofexamplequestions-and-answers(q-and-

a)

c e

for

e = 1 . . . |C E |

drawnfromtheset

C E

. Inorderto constructtheseclasses,givenaset

E

of exampleq-and-a, we then dene amappingfunction

f : E 7→ C E

which mapseach example

q-and-a

t j

for

j = 1 . . . |E|

intoaparticularclass

f (t j ) = e

. Thuseachclass

c e

maybedenedas

theunionofallcomponentq-and-afeaturesfromeach

t j

satisfying

f (t j ) = e

. Finally,tofacilitate

modelingwesaythat

W

isconditionallyindependentof

c e

given

A

sothat

P (W | A) =

|C E |

X

e =1

P (W | c e W ) · P(c e A | A),

(7)

where

c e W

and

c e A

referrespectivelytothesubsetsofquestion-typefeaturesand exampleanswers fortheclass

c e

.

Assumingconditional independence of theanswerwordsin class

c e

given

A

, and makingthe

modeling assumption that the

j

th answerword

a e j

in theexampleclass

c e

isdependentonly on

the

j

thanswerwordin

A

weobtain:

(4)

Transcriptions Written Spoken

ID Type WER questions questions

m Manual - a_m b_m

a ASR 10.6% a_a b_a

b ASR 14.0% a_b b_b

c ASR 24.1% a_c b_c

Table1: Detailsofthe4transcriptionsand8runs.

Type Subtype #questions

Person 17

Organisation 17

Factoid Location 14

Time 25

Measure 2

Person 12

Denition Organisation 3

Other 10

Table2: Numberofquestionsof eachquestiontype,100intotal. Ofthese,19havenoanswerin

thecorpusandshould beansweredNIL.

P(W | A) =

|C E |

X

e=1

P(W | c e ) ·

l Ae

Y

j =1

P (a e j | a j ).

(8)

Since our set of example q-and-a cannot be expected to cover all the possible answers to

questionsthatmaybeaskedweperformasimilaroperationtothatabovetogiveusthefollowing:

P (W | A) =

|C E |

X

e=1

P(W | c e )

l Ae

Y

j=1

|C A |

X

k =1

P (a e j | c k )P(c k | a j ),

(9)

where

c k

isaconcrete classin theset of

|C A |

answerclasses

C A

. Theindependence assumption leadsto underestimatingtheprobabilitiesof multi-wordanswerssowetakethegeometric mean

ofthelengthoftheanswer(notshownin Eq.(9))andnormalize

P (W | A)

accordingly.

4 Experimental work

ForQAST2009,twosetsofquestionsweregiven: 100writtenquestionsandmanualtranscriptions

ofthe correspondingspokenquestions. Theanswerswereto beextractedfrom transcriptionsof

European Parliament Plenary sessions in English (TC-STAR05 EPPS English corpus), which

consistsof 6spokendocuments, transcribedfrom3hoursofrecordings. 4versions ofthecorpus

were available: manualtranscriptionsand 3ASR transcriptions. There were onerunforeachof

thepossiblecombinationsofquestionsetsandtranscriptions,thustherewere2

×

4=8runs,as

shownin inTable1.

Twomaintypesofquestionswereconsidered:Factoid questionsandDenitionquestions. The

Factoid questions were further dividedinto thefollowingtypes: Person, Organisation,Location,

Time and Measure. TheDenition questions wereof thefollowingtypes: Person, Organisation

and Other. Questions where an answercannot befound in thecorpus, were to beansweredby

NIL.DetailsaregiveninTable2.

Wecleanedthedatabyautomaticallyremovingllersandpauses,andperformedsimpletext

(5)

0 0.1 0.2 0.3 0.4 0.5

a_m b_m a_a b_a a_b b_b a_c b_c

MRR

inaoe

0.36 0.34 0.30 0.30 0.22 0.22 0.28 0.28

limsi

0.36 0.33 0.31 0.30 0.25 0.25 0.24 0.24

tok

0.08 0.06 0.07 0.07 0.06 0.06 0.11 0.11

upc

0.31 0.12 0.27 0.08 0.26 0.08 0.24 0.07

Figure1: Resultsofeach runforallteams. Ourteam idistok

0 0.05 0.1 0.15 0.2

NIL DEFINITION TIME PERSON ORGANISATION LOCATION MEASURE

MRR

0.00

0.03

0.15

0.06

0.03

0.00 0.00

Figure2: Resultsofoura_mrun,bytype. InthisgureNILandDenitionarenotfurtherdivided

intosubtypes. TheothertypesaresubtypesofFactoid.

questionsetsand transcriptions. TheASR transcriptionslackedsentence boundaries,unlikethe

manualtranscriptions, where punctuationwasprovided. WesentencesegmentedtheASR tran-

scriptionsbyautomaticallyaligningthetextwiththemanualtranscriptionsusingtheGNU sdiff

tool. A set ofstopwordswasalsoused. Thetop100sentencesand theircontexts(the immedi-

atelyprecedingandsucceedingsentence),werepassedtotheanswerextractionmodule. Thetop

5answercandidates,asrankedbytheAEmodule,weresubmittedforevaluation. Althougheach

teamwasallowedtosubmittwoanswersetsforeachrun,wedecidedtosubmitonlyoneperrun.

5 Results and Discussion

The results of each team's best submission foreach run are plotted in Figure 1. These results

showthatweend uplast ofthefourteamsin allbut onerun. Ourperformancedoesnotregress

as automatictranscriptions of speeches or questions are used instead of manual transcriptions.

Thustheonlyruninwhichwearenotplacedlast,isthemostdiculttask: b_c.

Sinceoursystemdoesnottreatautomatictranscriptionsdierentlyfrommanualtranscriptions

(exceptin thepre-processingstage), werestrictourselvestofurther analyzingrun a_m,in which

written questions and manual transcriptions are used. Figure 2 shows the break-down of the

resultsbyanswertypeforthisrun.

Our system is not able to identify whether the answer to a question can be found in the

corpus, thus wechose neverto return aNIL response. Therefore thescore forNIL questions is

(6)

questions,whichhasanMRR which ismorethandouble that oftheMRRfor thePerson type,

thesecond best question type. This might beexplained bythe fairlyrestricted formatof Time

answersand the eorts wemade on date normalization. Organisation questions are dicultto

answersince there is little restrictionson the formatof organisation names. The zeroscore for

Location scoreis moredisappointing,sincethose answersmostlyconsist ofasinglegeographical

term. The score for Measure questions yields little information, since there were only 2 such

questions.

Normally our QA system utilizes a large corpus, such as the Web, and the more often an

answer candidate occurs in the context of queryterms, the more likely it is to be considered a

correctanswer. However,in thistaskthecorpuswassmall, thus wearenotableto benetfrom

suchredundancy,whichmightbeanexplanatoryfactorforourlowperformance.

6 Conclusion

InthispaperwehavegivenanoverviewofourmethodsandresultsfortheCLEF2009Question

AnsweringonSpeechTranscriptionsevaluation. Theresultsshowthat,usingourQAsystem,we

are not able to achieve good performance on this task. Obvious explanations are the presence

of non-factoid questions, which our system is not built to answer, in addition to our inability

to identify questions which haveno answerin thegivencorpus. Another possiblereason is the

small size of the corpus, which means our system cannot take advantage of redundant answer

information.

References

[1] Whittaker,E., Chatain, P., Furui, S. and Klakow, D., TREC2005Question AnsweringEx-

perimentsat TokyoInstituteofTechnology,Proc. TREC-14, 2005.

[2] Whittaker,E., Novak,J.,Chatain,P.andFurui,S.,TREC2006QuestionAnsweringExperi-

mentsatTokyoInstituteofTechnology,Proc. TREC-15,2006.

[3] Whittaker,Heie, M.,Novak,J.and Furui,S., TREC2007Question AnsweringExperiments

at TokyoInstituteofTechnology,Proc.TREC-16,2007.

[4] Heie,M.,Whittaker,E.,Novak,J.,Mrozinski,J.andFurui,S.,TAC2008QuestionAnswering

Experimentsat TokyoInstitute ofTechnology,Proc.TAC,2008.

[5] Ponte, J. and Croft, W., A Language Modeling Approach to Information Retrieval, Proc.

SIGIR,1998,pp.275-281.

[6] Heie,M.,Whittaker,E.,Novak,J.andFurui,S.,ALanguageModelingApproachtoQuestion

AnsweringonSpeechtranscriptions,Proc. ASRU,2007,pp.219-224.

[7] Whittaker, E., Furui, S. and Klakow, D., A Statistical Pattern Recognition Approach to

QuestionAnsweringusingWebData,Proc. Cyberworlds, 2005,pp.421-428.

Références

Documents relatifs

Keywords: question answering, QA, text preprocessing, tokenization, POS tagging, document indexing, document retrieval, Lucene, answer extraction, query formulation.. 1

In this paper we describe the experiments carried out at Tokyo Institute of Technology for the CLEF 2007 QAst (Question Answering in speech transcripts) pilot task, as well as

Inspired by previous TREC evaluation campaigns, QA tracks have been proposed at CLEF since 2003. During  these  years,  the  effort  of  the  organisers  has 

The basic architecture of our factoid system is standard in nature and comprises query type identification, query analysis and translation, retrieval query formulation, document

In Table 1 a breakdown is given of the results obtained on the two mono-lingual (French and Spanish) tasks for factoid/definition questions with answers in first place and for

When the question asks about a Named Entity such as MEASURE, PERSON, LOCATION, ORGANIZATION, DATE, the UAIC answer extractor looks for all the expressions tagged with the desired

It considers three main modules: passage retrieval, where the passages with the major probability to contain the answer are recovered from the document collection;

The basic architecture of our factoid system is standard in nature and comprises query type identification, query analysis and translation, retrieval query formulation, document