CLEF 2009 Question Answering Experiments at Tokyo Institute of Technology


Matthias H. Heie, Josef R. Novak, Edward W. D. Whittaker and Sadaoki Furui

Tokyo Institute of Technology

{heie,novakj,edw,furui}@furui.cs.titech.ac.jp

Abstract

In this paper we describe the experiments carried out at Tokyo Institute of Technology for the CLEF 2009 Question Answering on Speech Transcriptions (QAST) task, where we participated in the English track. We apply a non-linguistic, data-driven approach to Question Answering (QA). Relevant sentences are first retrieved from the supplied corpus using a language model based sentence retrieval module. Our probabilistic answer extraction module then pinpoints exact answers in these sentences. In this year's QAST task the question set contains both factoid and non-factoid questions, where the non-factoid questions ask for definitions of given named entities. We do not make any adjustments to our factoid QA system to account for non-factoid questions. Moreover, we are presented with the challenge of searching for the right answer in a relatively small corpus. Our system is built to take advantage of redundant information in large corpora; in this task, however, such redundancy is not available. The results show that our QA framework does not perform well on this task: we end last of four participating teams in seven out of eight runs. However, our performance does not regress when automatic transcriptions of speeches or questions are used instead of manual transcriptions. Thus the only run in which we are not placed last is the most difficult task, where spoken questions and ASR transcriptions with high WER are used.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.3 Information Search and Retrieval; H.3.4 Systems and Software

General Terms

Measurement, Performance, Experimentation

Keywords

Question answering, Questions beyond factoids

1 Introduction

In this paper we describe the application of our data-driven and non-linguistic framework to the CLEF 2009 QAST task. Two sets of questions were given: written questions and manual transcriptions of the corresponding spoken questions. The corpus consisted of transcriptions of European Parliament Plenary sessions in English. The question set contained Definition questions and Factoid questions. Four versions of the corpus were available: manual transcriptions and three ASR transcriptions. We submitted one answer set for each combination of question sets and corpora, 8 submissions in total.

Our system is a factoid QA system with which we have previously participated in other QA [...] tion words, as well as simple rules for pre-processing of the data, makes it possible to rapidly develop new systems for a wide variety of different languages and domains. Due to its data-driven nature, our QA system performs best when there is a large corpus available, containing several co-occurrences of question words and the correct answer. Thus the QAST task presented a challenge to us due to the small size of the corpus. Moreover, we made no adjustments to our factoid QA system to account for Definition questions, i.e. we treated Definition questions as Factoid questions.

The system comprises two main components: an Information Retrieval (IR) module used to retrieve relevant sentences from a corpus, and an Answer Extraction (AE) module which is used to identify and rank exact answers in the sentences returned by the IR module. For IR we used language model based sentence retrieval. In this approach, a language model (LM) is generated for each sentence and these models are combined with document LMs to take advantage of contextual information. From the retrieved information, we extract rigid answers using our answer filter model.

2 Sentence retrieval

Language modeling for IR has gained in popularity over the last decade since the approach was proposed [5]. Under this approach a LM is estimated for each document. The documents are then ranked according to the conditional probability $P(Q \mid D)$, the probability of generating the query $Q$ given the document $D$.

We rank sentences correspondingly [6]. Due to lack of data to train the sentence-specific LM, it is assumed that all words are independent, hence unigrams are used:

\[ P(Q \mid S) = \prod_{i=1}^{|Q|} P(q_i \mid S), \tag{1} \]

where $q_i$ is the $i$th query term in the query $Q = (q_1 \ldots q_{|Q|})$ composed of $|Q|$ query terms.

Smoothing methods are normally employed with LMs to avoid the problem of zero probabilities when one of the query terms does not occur in the document. This is typically achieved by redistributing probability mass from the document model to a background collection model $P(Q \mid C)$. We use a Dirichlet prior, where the probability of a query term $q$ given a sentence $S$ is calculated as:

\[ P_1(q \mid S) = \frac{c(q; S) + \mu \cdot p(q \mid C)}{\sum_w c(w; S) + \mu}, \tag{2} \]

where $c(q; S)$ is the count of query term $q$ in sentence $S$, $\mu$ is a smoothing parameter, $p(q \mid C)$ is the unigram probability of $q$ according to the background collection model and $\sum_w c(w; S)$ is the count of all words in $S$.
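As an illustration of Eqs. (1) and (2), the following is a minimal sketch rather than the system's actual implementation. The function names, the toy background model and the value of the smoothing parameter are assumptions made for this example; the paper does not report the value of $\mu$ that was used.

```python
from collections import Counter
from typing import Dict, Sequence


def dirichlet_prob(term: str, sentence: Sequence[str],
                   background: Dict[str, float], mu: float) -> float:
    """P_1(q | S) as in Eq. (2): (c(q;S) + mu * p(q|C)) / (sum_w c(w;S) + mu)."""
    counts = Counter(sentence)
    # len(sentence) equals the sum of all word counts in the sentence.
    return (counts[term] + mu * background.get(term, 0.0)) / (len(sentence) + mu)


def query_likelihood(query: Sequence[str], sentence: Sequence[str],
                     background: Dict[str, float], mu: float) -> float:
    """P(Q | S) as in Eq. (1): product of the smoothed unigram probabilities."""
    prob = 1.0
    for q in query:
        prob *= dirichlet_prob(q, sentence, background, mu)
    return prob


# Toy data (hypothetical): a uniform background model over four words.
background = {"bush": 0.25, "born": 0.25, "he": 0.25, "connecticut": 0.25}
sentence = ["he", "was", "born", "in", "connecticut"]
print(dirichlet_prob("born", sentence, background, mu=5.0))  # term seen in S
print(dirichlet_prob("bush", sentence, background, mu=5.0))  # unseen, still > 0
```

In this toy setting the unseen query term "bush" still receives a non-zero probability from the background model, which is the purpose of the smoothing.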

A problem with this model is that words relevant to the sentence might not occur in the sentence itself, but in the surrounding text. For example, for the question "Where was George Bush born?", the sentence "He was born in Connecticut" in an article about George Bush should ideally be assigned a high probability, despite the sentence missing important query terms. To account for this, we train document LMs, $P_1(q \mid D)$, in the same manner as for $P_1(q \mid S)$ in Eq. (2), and perform a linear interpolation between $P_1(q \mid S)$ and $P_1(q \mid D)$:

\[ P_2(q \mid S) = (1 - \alpha) \cdot P_1(q \mid S) + \alpha \cdot P_1(q \mid D), \tag{3} \]

where $0 \le \alpha \le 1$ is an interpolation parameter.
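The interpolation in Eq. (3) and the resulting sentence ranking can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names, the default values of `mu` and `alpha`, and the toy document are hypothetical and are not taken from the system described here.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def smoothed(term: str, text: Sequence[str],
             background: Dict[str, float], mu: float) -> float:
    """Dirichlet-smoothed unigram probability, used for both sentence and document."""
    counts = Counter(text)
    return (counts[term] + mu * background.get(term, 0.0)) / (len(text) + mu)


def interpolated_score(query: Sequence[str], sentence: Sequence[str],
                       document: Sequence[str], background: Dict[str, float],
                       mu: float, alpha: float) -> float:
    """Product over the query terms of P_2(q | S) as in Eq. (3)."""
    score = 1.0
    for q in query:
        p_s = smoothed(q, sentence, background, mu)
        p_d = smoothed(q, document, background, mu)
        score *= (1.0 - alpha) * p_s + alpha * p_d
    return score


def rank_sentences(query: Sequence[str], document: List[List[str]],
                   background: Dict[str, float],
                   mu: float = 5.0, alpha: float = 0.5) -> List[Tuple[float, int]]:
    """Return (score, sentence index) pairs, best first, for one tokenized document."""
    doc_tokens = [tok for sent in document for tok in sent]
    scored = [(interpolated_score(query, sent, doc_tokens, background, mu, alpha), i)
              for i, sent in enumerate(document)]
    return sorted(scored, reverse=True)


# Toy document (hypothetical) with a uniform background model over its vocabulary.
document = [["george", "bush", "spoke", "today"],
            ["he", "was", "born", "in", "connecticut"],
            ["the", "weather", "was", "fine"]]
vocab = {tok for sent in document for tok in sent}
background = {tok: 1.0 / len(vocab) for tok in vocab}
for score, idx in rank_sentences(["bush", "born"], document, background):
    print(idx, score)
```

With $\alpha > 0$, a sentence that lacks a query term still benefits when that term occurs elsewhere in its document, which is the effect the George Bush example describes.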

3 Answer extraction

For answer extraction we use the framework described in detail in [7]. We model the most straightforward and obvious dependence of the probability of an answer $A$ on a question $Q$:

\[ P(A \mid Q) = P(A \mid W, X), \tag{4} \]

where $A$ and $Q$ are considered to be strings of $l_A$ words $A = a_1, \ldots, a_{l_A}$ and $l_Q$ words $Q = q_1, \ldots, q_{l_Q}$, respectively. Here $W = w_1, \ldots, w_{l_W}$ represents a set of features describing the question-type part of $Q$ such as when, why, how, etc., while $X = x_1, \ldots, x_{l_X}$ represents a set of features that describe the information-bearing part of $Q$, i.e. what the question is actually about and what it refers to. For example, in the questions "Where was Tom Cruise married?" and "When was Tom Cruise married?", the information-bearing component is identical in both cases whereas the question-type component is different.

Finding the best answer $\hat{A}$ involves a search over all available $A$ for the one which maximizes the probability of the above model, i.e.,

\[ \hat{A} = \arg\max_A P(A \mid W, X). \tag{5} \]

Given the correct probability distribution, this is guaranteed to give us the optimal answer in a maximum likelihood sense. We don't know this distribution and it is still difficult to model but, using Bayes' rule and making various simplifying, modeling and conditional independence assumptions (as described in detail in [7]), Eq. (5) can be rearranged to give

\[ \arg\max_A \underbrace{P(A \mid X)}_{\text{answer retrieval model}} \cdot \underbrace{P(W \mid A)}_{\text{answer filter model}}. \tag{6} \]
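One way in which a factorization of the form of Eq. (6) can arise from Eq. (5) is sketched below. The single independence assumption shown is an illustrative simplification; the full set of assumptions actually used is the one described in [7].

```latex
\begin{align*}
P(A \mid W, X) &= \frac{P(W \mid A, X)\, P(A \mid X)}{P(W \mid X)}
  && \text{(Bayes' rule)} \\
\arg\max_A P(A \mid W, X) &= \arg\max_A P(W \mid A, X)\, P(A \mid X)
  && \text{($P(W \mid X)$ does not depend on $A$)} \\
&\approx \arg\max_A P(A \mid X)\, P(W \mid A)
  && \text{(assuming $W$ is independent of $X$ given $A$)}
\end{align*}
```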

We call the $P(A \mid X)$ model the answer retrieval model. In this year's evaluation we didn't use the answer retrieval model, i.e. $P(A \mid X)$ is uniform.

The $P(W \mid A)$ model matches a potential answer $A$ with features in the question-type set $W$. For example, it relates place names with where-type questions. We call this component the answer filter model and it is structured as follows.

The question-type feature set $W = w_1, \ldots, w_{l_W}$ is constructed by extracting $n$-tuples ($n = 1, 2, \ldots$) such as Who, Where and In what from the input question $Q$. A set of single-word features is extracted based on frequency of occurrence in our collection of example questions.
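The extraction of question-type $n$-tuples can be illustrated with a short sketch. The feature inventory below is hypothetical and hand-written for this example, whereas the system derives its features from a collection of example questions; the tokenization is also deliberately naive.

```python
from typing import List, Set, Tuple

# Hypothetical inventory of question-type n-tuples (assumption for this sketch).
QUESTION_TYPE_FEATURES: Set[Tuple[str, ...]] = {
    ("who",), ("where",), ("when",), ("in", "what"),
}


def extract_question_type_features(question: str,
                                   max_n: int = 2) -> List[Tuple[str, ...]]:
    """Return every n-tuple (n = 1..max_n) of the question found in the inventory."""
    tokens = question.lower().rstrip("?").split()
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngram = tuple(tokens[i:i + n])
            if ngram in QUESTION_TYPE_FEATURES:
                features.append(ngram)
    return features


print(extract_question_type_features("Where was Tom Cruise married?"))
print(extract_question_type_features("In what year was Tom Cruise married?"))
```

The remaining tokens of the question would form the information-bearing set $X$, which is not modeled in this sketch.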

Modeling the complex relationship between $W$ and $A$ directly is non-trivial. We therefore introduce an intermediate variable representing classes of example questions-and-answers (q-and-a) $c_e$ for $e = 1 \ldots |C_E|$ drawn from the set $C_E$. In order to construct these classes, given a set $E$ of example q-and-a, we then define a mapping function $f : E \mapsto C_E$ which maps each example q-and-a $t_j$ for $j = 1 \ldots |E|$ into a particular class $f(t_j) = e$. Thus each class $c_e$ may be defined as the union of all component q-and-a features from each $t_j$ satisfying $f(t_j) = e$. Finally, to facilitate modeling we say that $W$ is conditionally independent of $A$ given $c_e$ so that

\[ P(W \mid A) = \sum_{e=1}^{|C_E|} P(W \mid c^e_W) \cdot P(c^e_A \mid A), \tag{7} \]

where $c^e_W$ and $c^e_A$ refer respectively to the subsets of question-type features and example answers for the class $c_e$.

| Transcription ID | Type   | WER   | Written questions | Spoken questions |
|------------------|--------|-------|-------------------|------------------|
| m                | Manual | -     | a_m               | b_m              |
| a                | ASR    | 10.6% | a_a               | b_a              |
| b                | ASR    | 14.0% | a_b               | b_b              |
| c                | ASR    | 24.1% | a_c               | b_c              |

Table 1: Details of the 4 transcriptions and 8 runs.

| Type       | Subtype      | #questions |
|------------|--------------|------------|
| Factoid    | Person       | 17         |
| Factoid    | Organisation | 17         |
| Factoid    | Location     | 14         |
| Factoid    | Time         | 25         |
| Factoid    | Measure      | 2          |
| Definition | Person       | 12         |
| Definition | Organisation | 3          |
| Definition | Other        | 10         |

Table 2: Number of questions of each question type, 100 in total. Of these, 19 have no answer in the corpus and should be answered NIL.

Assuming conditional independence of the answer words in class $c_e$ given $A$, and making the modeling assumption that the $j$th answer word $a^e_j$ in the example class $c_e$ is dependent only on the $j$th answer word in $A$, we obtain:

\[ P(W \mid A) = \sum_{e=1}^{|C_E|} P(W \mid c_e) \cdot \prod_{j=1}^{l_{A^e}} P(a^e_j \mid a_j). \tag{8} \]

Since our set of example q-and-a cannot be expected to cover all the possible answers to questions that may be asked, we perform a similar operation to that above to give us the following:

\[ P(W \mid A) = \sum_{e=1}^{|C_E|} P(W \mid c_e) \prod_{j=1}^{l_{A^e}} \sum_{k=1}^{|C_A|} P(a^e_j \mid c_k) P(c_k \mid a_j), \tag{9} \]

where $c_k$ is a concrete class in the set of $|C_A|$ answer classes $C_A$. The independence assumption leads to underestimating the probabilities of multi-word answers, so we take the geometric mean over the length of the answer (not shown in Eq. (9)) and normalize $P(W \mid A)$ accordingly.
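To make the structure of Eq. (9) concrete, the sketch below evaluates it on toy probability tables. Everything here is an assumption made for illustration: the function and parameter names, the toy numbers, the choice to skip example answers whose length differs from that of the candidate, and the exact form of the geometric-mean normalization (the length-th root of the per-word product), which the text does not spell out.

```python
from typing import Dict, List, Sequence


def answer_filter_score(
    answer: Sequence[str],
    p_w_given_class: Sequence[float],          # P(W | c_e) for each example class e
    example_answers: Sequence[Sequence[str]],  # example answer words a^e_j of class e
    p_word_given_cls: Dict[str, List[float]],  # P(word | c_k) for each answer class k
    p_cls_given_word: Dict[str, List[float]],  # P(c_k | word) for each answer class k
) -> float:
    """Toy evaluation of Eq. (9) with an assumed geometric-mean normalization."""
    total = 0.0
    for p_w, example in zip(p_w_given_class, example_answers):
        if len(example) != len(answer):
            continue  # assumption: only equal-length answers are compared
        product = 1.0
        for a_e, a in zip(example, answer):
            # Inner sum over answer classes: sum_k P(a^e_j | c_k) * P(c_k | a_j)
            product *= sum(pe * pa for pe, pa in
                           zip(p_word_given_cls[a_e], p_cls_given_word[a]))
        # assumption: geometric mean = l-th root of the product over answer words
        total += p_w * product ** (1.0 / len(answer))
    return total


# Toy tables (hypothetical): answer classes are [place, person].
p_word_given_cls = {"tokyo": [0.5, 0.0], "paris": [0.5, 0.0], "bush": [0.0, 1.0]}
p_cls_given_word = {"tokyo": [1.0, 0.0], "paris": [1.0, 0.0], "bush": [0.0, 1.0]}
# Two example classes for a where-type question W: P(W | c_0) = 0.9, P(W | c_1) = 0.1.
p_w = [0.9, 0.1]
examples = [["paris"], ["bush"]]
print(answer_filter_score(["tokyo"], p_w, examples, p_word_given_cls, p_cls_given_word))
print(answer_filter_score(["bush"], p_w, examples, p_word_given_cls, p_cls_given_word))
```

In this toy setting the place name scores higher than the person name for a where-type question, even though "tokyo" never appears among the example answers, because both "tokyo" and "paris" map to the same answer class.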

4 Experimental work

For QAST 2009, two sets of questions were given: 100 written questions and manual transcriptions of the corresponding spoken questions. The answers were to be extracted from transcriptions of European Parliament Plenary sessions in English (TC-STAR05 EPPS English corpus), which consists of 6 spoken documents, transcribed from 3 hours of recordings. Four versions of the corpus were available: manual transcriptions and three ASR transcriptions. There was one run for each of the possible combinations of question sets and transcriptions, thus there were 2 × 4 = 8 runs, as shown in Table 1.
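The eight run identifiers follow from pairing the two question sets with the four transcriptions. A minimal sketch of that enumeration, assuming the naming scheme read off Table 1 (prefix a for written and b for spoken questions, suffix for the transcription ID); the dictionary and function names are hypothetical:

```python
from itertools import product

# Labels taken from Table 1; the data structures themselves are illustrative.
QUESTION_SETS = {"a": "written", "b": "spoken"}
TRANSCRIPTIONS = {
    "m": "manual",
    "a": "ASR, 10.6% WER",
    "b": "ASR, 14.0% WER",
    "c": "ASR, 24.1% WER",
}


def enumerate_runs():
    """Return the run IDs for every (question set, transcription) combination."""
    return [f"{q}_{t}" for t, q in product(TRANSCRIPTIONS, QUESTION_SETS)]


print(enumerate_runs())
```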

Two main types of questions were considered: Factoid questions and Definition questions. The Factoid questions were further divided into the following types: Person, Organisation, Location, Time and Measure. The Definition questions were of the following types: Person, Organisation and Other. Questions for which no answer can be found in the corpus were to be answered by NIL. Details are given in Table 2.

| Team  | a_m  | b_m  | a_a  | b_a  | a_b  | b_b  | a_c  | b_c  |
|-------|------|------|------|------|------|------|------|------|
| inaoe | 0.36 | 0.34 | 0.30 | 0.30 | 0.22 | 0.22 | 0.28 | 0.28 |
| limsi | 0.36 | 0.33 | 0.31 | 0.30 | 0.25 | 0.25 | 0.24 | 0.24 |
| tok   | 0.08 | 0.06 | 0.07 | 0.07 | 0.06 | 0.06 | 0.11 | 0.11 |
| upc   | 0.31 | 0.12 | 0.27 | 0.08 | 0.26 | 0.08 | 0.24 | 0.07 |

Figure 1: Results (MRR) of each run for all teams. Our team id is tok.

| Type | NIL  | Definition | Time | Person | Organisation | Location | Measure |
|------|------|------------|------|--------|--------------|----------|---------|
| MRR  | 0.00 | 0.03       | 0.15 | 0.06   | 0.03         | 0.00     | 0.00    |

Figure 2: Results (MRR) of our a_m run, by type. In this figure NIL and Definition are not further divided into subtypes. The other types are subtypes of Factoid.

We cleaned the data by automatically removing fillers and pauses, and performed simple text [...] question sets and transcriptions. The ASR transcriptions lacked sentence boundaries, unlike the manual transcriptions, where punctuation was provided. We sentence-segmented the ASR transcriptions by automatically aligning the text with the manual transcriptions using the GNU sdiff tool. A set of stop words was also used. The top 100 sentences and their contexts (the immediately preceding and succeeding sentence) were passed to the answer extraction module. The top 5 answer candidates, as ranked by the AE module, were submitted for evaluation. Although each team was allowed to submit two answer sets for each run, we decided to submit only one per run.
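The step of passing each retrieved sentence together with its immediate neighbours to the answer extraction module can be sketched as follows. The function name and the handling of document boundaries (the window is simply clipped at the start and end of a single document) are assumptions made for this illustration.

```python
from typing import List, Sequence


def with_context(sentences: Sequence[str], ranked_indices: Sequence[int],
                 top_n: int = 100) -> List[List[str]]:
    """For each of the top_n retrieved sentences, return it with its neighbours."""
    passages = []
    for idx in ranked_indices[:top_n]:
        start = max(0, idx - 1)               # preceding sentence, if any
        end = min(len(sentences), idx + 2)    # succeeding sentence, if any
        passages.append(list(sentences[start:end]))
    return passages


# Toy usage (hypothetical data): sentence 2 is ranked first, sentence 0 second.
sentences = ["s0", "s1", "s2", "s3"]
print(with_context(sentences, ranked_indices=[2, 0, 3], top_n=2))
```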

5 Results and Discussion

The results of each team's best submission for each run are plotted in Figure 1. These results show that we end up last of the four teams in all but one run. Our performance does not regress when automatic transcriptions of speeches or questions are used instead of manual transcriptions. Thus the only run in which we are not placed last is the most difficult task: b_c.
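The placement statement can be checked directly against the MRR values reported in Figure 1. The sketch below copies those values and lists the runs in which a team's score is strictly above the lowest score of that run; the variable and function names are chosen for this illustration only.

```python
RUNS = ["a_m", "b_m", "a_a", "b_a", "a_b", "b_b", "a_c", "b_c"]

# MRR values as reported in Figure 1, one list per team in the order of RUNS.
MRR = {
    "inaoe": [0.36, 0.34, 0.30, 0.30, 0.22, 0.22, 0.28, 0.28],
    "limsi": [0.36, 0.33, 0.31, 0.30, 0.25, 0.25, 0.24, 0.24],
    "tok":   [0.08, 0.06, 0.07, 0.07, 0.06, 0.06, 0.11, 0.11],
    "upc":   [0.31, 0.12, 0.27, 0.08, 0.26, 0.08, 0.24, 0.07],
}


def runs_not_last(team: str) -> list:
    """Return the runs in which the team scores above the lowest MRR of that run."""
    result = []
    for i, run in enumerate(RUNS):
        lowest = min(scores[i] for scores in MRR.values())
        if MRR[team][i] > lowest:
            result.append(run)
    return result


print(runs_not_last("tok"))
```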

Since our system does not treat automatic transcriptions differently from manual transcriptions (except in the pre-processing stage), we restrict ourselves to further analyzing run a_m, in which written questions and manual transcriptions are used. Figure 2 shows the breakdown of the results by answer type for this run.
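The figures report Mean Reciprocal Rank (MRR). As a reminder of how this measure is commonly computed, the sketch below uses the standard definition: each question contributes the reciprocal of the rank of its first correct answer, or zero if none of the returned answers is correct. The function name and the toy data are hypothetical, and the official scoring script may differ in details such as answer matching.

```python
from typing import Sequence, Set


def mean_reciprocal_rank(ranked_answers: Sequence[Sequence[str]],
                         gold: Sequence[Set[str]]) -> float:
    """Average over questions of 1 / rank of the first correct answer (0 if none)."""
    total = 0.0
    for answers, correct in zip(ranked_answers, gold):
        for rank, answer in enumerate(answers, start=1):
            if answer in correct:
                total += 1.0 / rank
                break
    return total / len(ranked_answers)


# Toy usage (hypothetical data): correct at rank 1, at rank 3, and not at all.
ranked = [["a", "b"], ["x", "y", "z"], ["q"]]
gold = [{"a"}, {"z"}, {"none"}]
print(mean_reciprocal_rank(ranked, gold))
```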

Our system is not able to identify whether the answer to a question can be found in the corpus, thus we chose never to return a NIL response. Therefore the score for NIL questions is [...] questions, which has an MRR more than double that of the Person type, the second-best question type. This might be explained by the fairly restricted format of Time answers and the efforts we made on date normalization. Organisation questions are difficult to answer since there are few restrictions on the format of organisation names. The zero score for Location questions is more disappointing, since those answers mostly consist of a single geographical term. The score for Measure questions yields little information, since there were only 2 such questions.

Normally our QA system utilizes a large corpus, such as the Web, and the more often an answer candidate occurs in the context of query terms, the more likely it is to be considered a correct answer. However, in this task the corpus was small, thus we are not able to benefit from such redundancy, which might be an explanatory factor for our low performance.

6 Conclusion

In this paper we have given an overview of our methods and results for the CLEF 2009 Question Answering on Speech Transcriptions evaluation. The results show that, using our QA system, we are not able to achieve good performance on this task. Obvious explanations are the presence of non-factoid questions, which our system is not built to answer, in addition to our inability to identify questions which have no answer in the given corpus. Another possible reason is the small size of the corpus, which means our system cannot take advantage of redundant answer information.

References

[1] Whittaker, E., Chatain, P., Furui, S. and Klakow, D., TREC 2005 Question Answering Experiments at Tokyo Institute of Technology, Proc. TREC-14, 2005.

[2] Whittaker, E., Novak, J., Chatain, P. and Furui, S., TREC 2006 Question Answering Experiments at Tokyo Institute of Technology, Proc. TREC-15, 2006.

[3] Whittaker, E., Heie, M., Novak, J. and Furui, S., TREC 2007 Question Answering Experiments at Tokyo Institute of Technology, Proc. TREC-16, 2007.

[4] Heie, M., Whittaker, E., Novak, J., Mrozinski, J. and Furui, S., TAC 2008 Question Answering Experiments at Tokyo Institute of Technology, Proc. TAC, 2008.

[5] Ponte, J. and Croft, W., A Language Modeling Approach to Information Retrieval, Proc. SIGIR, 1998, pp. 275-281.

[6] Heie, M., Whittaker, E., Novak, J. and Furui, S., A Language Modeling Approach to Question Answering on Speech Transcriptions, Proc. ASRU, 2007, pp. 219-224.

[7] Whittaker, E., Furui, S. and Klakow, D., A Statistical Pattern Recognition Approach to Question Answering using Web Data, Proc. Cyberworlds, 2005, pp. 421-428.
