Tokyo Institute of Technology
Matthias H. Heie, Josef R. Novak, Edward W.D. Whittaker and Sadaoki Furui
Tokyo Institute of Technology
{heie,novakj,edw,furui}@furui.cs.titech.ac.jp
Abstract
In this paper we describe the experiments carried out at Tokyo Institute of Technology
for the CLEF 2009 Question Answering on Speech Transcriptions (QAST) task, where we
participated in the English track. We apply a non-linguistic, data-driven approach to Question
Answering (QA). Relevant sentences are first retrieved from the supplied corpus, using a
language model based sentence retrieval module. Our probabilistic answer extraction module
then pinpoints exact answers in these sentences. In this year's QAST task the question
set contains both factoid and non-factoid questions, where the non-factoid questions ask for
definitions of given named entities. We do not make any adjustments to our factoid QA
system to account for non-factoid questions. Moreover, we are presented with the challenge
of searching for the right answer in a relatively small corpus. Our system is built to take
advantage of redundant information in large corpora; in this task, however, such redundancy
is not available. The results show that our QA framework does not perform well on this task:
we end last of four participating teams in seven out of eight runs. However, our performance
does not regress as automatic transcriptions of speeches or questions are used instead of
manual transcriptions. Thus the only run in which we are not placed last is the most difficult
task, where spoken questions and ASR transcriptions with high WER are used.
Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.3 Information Search and Retrieval; H.3.4
Systems and Software
General Terms
Measurement, Performance, Experimentation
Keywords
Question answering, Questions beyond factoids
1 Introduction
In this paper we describe the application of our data-driven and non-linguistic framework to
the CLEF 2009 QAST task. Two sets of questions were given: written questions and manual
transcriptions of the corresponding spoken questions. The corpus consisted of transcriptions of
European Parliament Plenary sessions in English. The question set contained Definition questions
and Factoid questions. Four versions of the corpus were available: manual transcriptions and three
ASR transcriptions. We submitted one answer set for each combination of question sets and corpora,
in total 8 submissions.
Our system is a factoid QA system with which we have previously participated in other QA
tion words, as well as simple rules for pre-processing of the data, makes it possible to rapidly
develop new systems for a wide variety of different languages and domains. Due to its data-driven
nature, our QA system performs best when there is a large corpus available, containing several
co-occurrences of question words and the correct answer. Thus the QAST task presented a
challenge to us due to the small size of the corpus. Moreover, we made no adjustments to our factoid
QA system to account for Definition questions, i.e. we treated Definition questions as Factoid
questions.
The system comprises two main components: an Information Retrieval (IR) module used to
retrieve relevant sentences from a corpus, and an Answer Extraction (AE) module which is used
to identify and rank exact answers in the sentences returned by the IR module. For IR we used
language model based sentence retrieval. In this approach, a language model (LM) is generated for
each sentence and these models are combined with document LMs to take advantage of contextual
information. From the retrieved information, we extract rigid answers using our answer filter
model.
2 Sentence retrieval
Language modeling for IR has gained in popularity over the last decade since the approach was
proposed [5]. Under this approach a LM is estimated for each document. The documents are then
ranked according to the conditional probability $P(Q \mid D)$, the probability of generating the
query $Q$ given the document $D$. We rank sentences correspondingly [6]. Due to lack of data to
train the sentence-specific LM, it is assumed that all words are independent, hence unigrams are
used:

$$P(Q \mid S) = \prod_{i=1}^{|Q|} P(q_i \mid S), \tag{1}$$

where $q_i$ is the $i$th query term in the query $Q = (q_1 \ldots q_{|Q|})$ composed of $|Q|$
query terms.
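The retrieval model of this section can be sketched in a few lines. The following is a minimal
illustration, not our actual implementation: it computes Eq. (1) in log space, using the
Dirichlet-smoothed estimate and the sentence/document interpolation that the rest of this section
defines in Eqs. (2) and (3). The function names, the toy sentences and the values of `mu` and
`alpha` are assumptions chosen for the example.

```python
import math
from collections import Counter


def unigram_model(sentences):
    """Maximum-likelihood unigram probabilities over a list of token lists."""
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def dirichlet_prob(term, counts, background, mu):
    """Eq. (2): Dirichlet-smoothed probability of `term` under one LM (mu > 0)."""
    total = sum(counts.values())
    return (counts[term] + mu * background.get(term, 0.0)) / (total + mu)


def log_query_likelihood(query, sent_counts, doc_counts, background,
                         mu=100.0, alpha=0.5):
    """Eqs. (1) and (3): log P(Q | S) with sentence/document interpolation."""
    score = 0.0
    for q in query:
        p_sent = dirichlet_prob(q, sent_counts, background, mu)
        p_doc = dirichlet_prob(q, doc_counts, background, mu)
        p = (1.0 - alpha) * p_sent + alpha * p_doc
        if p == 0.0:  # term unseen everywhere, even in the background model
            return float("-inf")
        score += math.log(p)
    return score


# Toy collection (assumed data): two documents, each a list of tokenized sentences.
doc_bush = ["george bush was the president".split(),
            "he was born in connecticut".split()]
doc_other = ["john smith was a farmer".split(),
             "he was born in connecticut".split()]
background = unigram_model(doc_bush + doc_other)

query = ["george", "bush", "born"]
sentence = Counter(doc_bush[1])  # the same sentence occurs in both documents
bush_context = Counter(tok for s in doc_bush for tok in s)
other_context = Counter(tok for s in doc_other for tok in s)

score_in_bush_doc = log_query_likelihood(query, sentence, bush_context, background)
score_in_other_doc = log_query_likelihood(query, sentence, other_context, background)
```

With `alpha = 0` the two scores coincide, since the sentence itself is identical. With `alpha > 0`
the copy inside the George Bush document scores higher, because the document LM supplies the
query terms that the sentence lacks.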
Smoothing methods are normally employed with LMs to avoid the problem of zero probabilities
when one of the query terms does not occur in the document. This is typically achieved by
redistributing probability mass from the document model to a background collection model
$P(Q \mid C)$. We use a Dirichlet prior, where the probability of a query term $q$ given a
sentence $S$ is calculated as:

$$P_1(q \mid S) = \frac{c(q; S) + \mu \cdot p(q \mid C)}{\sum_w c(w; S) + \mu}, \tag{2}$$

where $c(q; S)$ is the count of query term $q$ in sentence $S$, $\mu$ is a smoothing parameter,
$p(q \mid C)$ is the unigram probability of $q$ according to the background collection model and
$\sum_w c(w; S)$ is the count of all words in $S$.

A problem with this model is that words relevant to the sentence might not occur in the
sentence itself, but in the surrounding text. For example, for the question "Where was George
Bush born?", the sentence "He was born in Connecticut" in an article about George Bush should
ideally be assigned a high probability, despite the sentence missing important query terms. To
account for this, we train document LMs, $P_1(q \mid D)$, in the same manner as for
$P_1(q \mid S)$ in Eq. (2), and perform a linear interpolation between $P_1(q \mid S)$ and
$P_1(q \mid D)$:

$$P_2(q \mid S) = (1 - \alpha) \cdot P_1(q \mid S) + \alpha \cdot P_1(q \mid D), \tag{3}$$

where $0 \le \alpha \le 1$ is an interpolation parameter.

3 Answer extraction

For answer extraction we use the framework described in detail in [7]. We model the most
straightforward and obvious dependence of the probability of an answer $A$ on a question $Q$:

$$P(A \mid Q) = P(A \mid W, X), \tag{4}$$

where $A$ and $Q$ are considered to be strings of $l_A$ words $A = a_1, \ldots, a_{l_A}$ and
$l_Q$ words $Q = q_1, \ldots, q_{l_Q}$, respectively. Here $W = w_1, \ldots, w_{l_W}$ represents
a set of features describing the question-type part of $Q$ such as when, why, how, etc., while
$X = x_1, \ldots, x_{l_X}$ represents a set of features that describe the information-bearing
part of $Q$, i.e. what the question is actually about and what it refers to. For example, in the
questions "Where was Tom Cruise married?" and "When was Tom Cruise married?", the
information-bearing component is identical in both cases whereas the question-type component is
different.
Finding the best answer $\hat{A}$ involves a search over all available $A$ for the one which
maximizes the probability of the above model, i.e.,

$$\hat{A} = \arg\max_A P(A \mid W, X). \tag{5}$$
Given the correct probability distribution, this is guaranteed to give us the optimal answer
in a maximum likelihood sense. We don't know this distribution and it is still difficult to model
but, using Bayes' rule and making various simplifying, modeling and conditional independence
assumptions (as described in detail in [7]), Eq. (5) can be rearranged to give

$$\arg\max_A \underbrace{P(A \mid X)}_{\text{answer retrieval model}} \cdot \underbrace{P(W \mid A)}_{\text{answer filter model}}. \tag{6}$$

The $P(A \mid X)$ model we call the answer retrieval model. In this year's evaluation we didn't
use the answer retrieval model, i.e. $P(A \mid X)$ is uniform.

The $P(W \mid A)$ model matches a potential answer $A$ with features in the question-type set
$W$. For example, it relates place names with where-type questions. We call this component the
answer filter model and it is structured as follows.
The question-type feature set $W = w_1, \ldots, w_{l_W}$ is constructed by extracting
$n$-tuples ($n = 1, 2, \ldots$) such as Who, Where and In what from the input question $Q$. A set
of single-word features is extracted based on frequency of occurrence in our collection of
example questions.
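As an illustration of this feature extraction, the sketch below collects word $n$-tuples from a
question and keeps those found in a feature inventory. The inventory `QUESTION_TYPE_FEATURES`,
the function name, the tokenization and the cap `max_n` are hypothetical; the text above only
states that features are selected by frequency in example questions.

```python
import re

# Hypothetical inventory of question-type n-tuples (assumed for this example only).
QUESTION_TYPE_FEATURES = {
    ("who",), ("where",), ("when",), ("why",), ("how",), ("what",),
    ("in", "what"), ("how", "many"),
}


def question_type_features(question, max_n=2):
    """Return the n-tuples (n = 1..max_n) of the question found in the inventory."""
    tokens = re.findall(r"[a-z0-9']+", question.lower())
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngram = tuple(tokens[i:i + n])
            if ngram in QUESTION_TYPE_FEATURES:
                features.append(ngram)
    return features


where_features = question_type_features("Where was Tom Cruise married?")
in_what_features = question_type_features("In what year was Tom Cruise married?")
```

The remaining words of the question (here Tom, Cruise, married) would belong to the
information-bearing set $X$ rather than to $W$.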
Modeling the complex relationship between $W$ and $A$ directly is non-trivial. We therefore
introduce an intermediate variable representing classes of example questions-and-answers
(q-and-a) $c_e$ for $e = 1 \ldots |C_E|$ drawn from the set $C_E$. In order to construct these
classes, given a set $E$ of example q-and-a, we define a mapping function $f : E \mapsto C_E$
which maps each example q-and-a $t_j$ for $j = 1 \ldots |E|$ into a particular class
$f(t_j) = e$. Thus each class $c_e$ may be defined as the union of all component q-and-a features
from each $t_j$ satisfying $f(t_j) = e$. Finally, to facilitate modeling we say that $W$ is
conditionally independent of $A$ given $c_e$ so that

$$P(W \mid A) = \sum_{e=1}^{|C_E|} P(W \mid c_e^W) \cdot P(c_e^A \mid A), \tag{7}$$

where $c_e^W$ and $c_e^A$ refer respectively to the subsets of question-type features and example
answers for the class $c_e$.
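To make the mixture in Eq. (7) concrete, the following sketch sums over example classes using
toy probability tables. The class names, the numbers and the function `answer_filter_score` are
invented for illustration and are not values from our trained models.

```python
def answer_filter_score(p_w_given_class, p_class_given_answer):
    """Eq. (7): sum over classes e of P(W | c_e^W) * P(c_e^A | A)."""
    return sum(
        p_w * p_class_given_answer.get(cls, 0.0)
        for cls, p_w in p_w_given_class.items()
    )


# Hypothetical P(W | c_e^W) for the question-type set W = {"where"} (toy numbers).
p_where_given_class = {"location": 0.60, "person": 0.05, "date": 0.02}

# Hypothetical P(c_e^A | A) for two candidate answers A (toy numbers).
p_class_given_answer = {
    "Connecticut": {"location": 0.90, "person": 0.08, "date": 0.02},
    "1946": {"location": 0.05, "person": 0.05, "date": 0.90},
}

scores = {
    answer: answer_filter_score(p_where_given_class, dist)
    for answer, dist in p_class_given_answer.items()
}
# With a uniform answer retrieval model, Eq. (6) reduces to picking the best filter score.
best_answer = max(scores, key=scores.get)
```

In this toy setting the place name wins for a where-type question because most of its probability
mass lies in the class whose example questions are where-type.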
Assuming conditional independence of the answer words in class $c_e$ given $A$, and making the
modeling assumption that the $j$th answer word $a_j^e$ in the example class $c_e$ is dependent
only on the $j$th answer word in $A$, we obtain:

$$P(W \mid A) = \sum_{e=1}^{|C_E|} P(W \mid c_e^W) \cdot \prod_{j=1}^{l_{A^e}} P(a_j^e \mid a_j). \tag{8}$$

Since our set of example q-and-a cannot be expected to cover all the possible answers to
questions that may be asked, we perform a similar operation to that above to give us the
following:

$$P(W \mid A) = \sum_{e=1}^{|C_E|} P(W \mid c_e^W) \prod_{j=1}^{l_{A^e}} \sum_{k=1}^{|C_A|} P(a_j^e \mid c_k) P(c_k \mid a_j), \tag{9}$$

where $c_k$ is a concrete class in the set of $|C_A|$ answer classes $C_A$. The independence
assumption leads to underestimating the probabilities of multi-word answers, so we take the
geometric mean over the length of the answer (not shown in Eq. (9)) and normalize
$P(W \mid A)$ accordingly.

4 Experimental work

Transcriptions              Written      Spoken
ID   Type     WER           questions    questions
m    Manual   -             a_m          b_m
a    ASR      10.6%         a_a          b_a
b    ASR      14.0%         a_b          b_b
c    ASR      24.1%         a_c          b_c

Table 1: Details of the 4 transcriptions and 8 runs.

Type         Subtype        #questions
Factoid      Person         17
             Organisation   17
             Location       14
             Time           25
             Measure         2
Definition   Person         12
             Organisation    3
             Other          10

Table 2: Number of questions of each question type, 100 in total. Of these, 19 have no answer in
the corpus and should be answered NIL.

For QAST 2009, two sets of questions were given: 100 written questions and manual transcriptions
of the corresponding spoken questions. The answers were to be extracted from transcriptions of
European Parliament Plenary sessions in English (TC-STAR05 EPPS English corpus), which
consists of 6 spoken documents, transcribed from 3 hours of recordings. Four versions of the corpus
were available: manual transcriptions and three ASR transcriptions. There was one run for each of
the possible combinations of question sets and transcriptions, thus there were 2 × 4 = 8 runs, as
shown in Table 1.

Two main types of questions were considered: Factoid questions and Definition questions. The
Factoid questions were further divided into the following types: Person, Organisation, Location,
Time and Measure. The Definition questions were of the following types: Person, Organisation
and Other. Questions for which an answer cannot be found in the corpus were to be answered by
NIL. Details are given in Table 2.

We cleaned the data by automatically removing fillers and pauses, and performed simple text

Run      a_m    b_m    a_a    b_a    a_b    b_b    a_c    b_c
inaoe    0.36   0.34   0.30   0.30   0.22   0.22   0.28   0.28
limsi    0.36   0.33   0.31   0.30   0.25   0.25   0.24   0.24
tok      0.08   0.06   0.07   0.07   0.06   0.06   0.11   0.11
upc      0.31   0.12   0.27   0.08   0.26   0.08   0.24   0.07

Figure 1: Results (MRR) of each run for all teams. Our team id is tok.

Type     NIL    DEFINITION   TIME   PERSON   ORGANISATION   LOCATION   MEASURE
MRR      0.00   0.03         0.15   0.06     0.03           0.00       0.00

Figure 2: Results (MRR) of our a_m run, by type. In this figure NIL and Definition are not further
divided into subtypes. The other types are subtypes of Factoid.

question sets and transcriptions. The ASR transcriptions lacked sentence boundaries, unlike the
manual transcriptions, where punctuation was provided. We sentence-segmented the ASR
transcriptions by automatically aligning the text with the manual transcriptions using the GNU
sdiff tool. A set of stop words was also used. The top 100 sentences and their contexts (the
immediately preceding and succeeding sentence) were passed to the answer extraction module. The
top 5 answer candidates, as ranked by the AE module, were submitted for evaluation. Although each
team was allowed to submit two answer sets for each run, we decided to submit only one per run.
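The idea behind the alignment-based segmentation can be sketched as follows. This is not the
sdiff-based procedure itself: the sketch substitutes Python's `difflib.SequenceMatcher` for GNU
sdiff, and the function name, the toy data and the rule for unmatched ASR tokens are assumptions
made for illustration.

```python
from difflib import SequenceMatcher


def segment_asr(manual_sentences, asr_tokens):
    """Split ASR tokens into sentences by aligning them to the manual transcript."""
    manual_tokens = [tok for sent in manual_sentences for tok in sent]
    # Sentence index of every token in the flattened manual transcript.
    sent_of = [i for i, sent in enumerate(manual_sentences) for _ in sent]

    matcher = SequenceMatcher(a=manual_tokens, b=asr_tokens, autojunk=False)
    asr_sent = [None] * len(asr_tokens)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            asr_sent[block.b + k] = sent_of[block.a + k]

    # Assumption: an unmatched ASR token (e.g. a recognition error) stays in the
    # sentence of the last matched token before it, or in the first sentence.
    segments = [[] for _ in manual_sentences]
    current = 0
    for tok, idx in zip(asr_tokens, asr_sent):
        if idx is not None:
            current = idx
        segments[current].append(tok)
    return segments


manual = [["the", "house", "will", "vote", "tomorrow"],
          ["thank", "you", "president"]]
asr = "the house we'll vote tomorrow thank you president".split()
asr_segments = segment_asr(manual, asr)
```

Because matching blocks are returned in increasing order in both sequences, the projected sentence
indices never decrease, so each ASR sentence is a contiguous span of the ASR token stream.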
5 Results and Discussion
The results of each team's best submission for each run are plotted in Figure 1. These results
show that we end up last of the four teams in all but one run. Our performance does not regress
as automatic transcriptions of speeches or questions are used instead of manual transcriptions.
Thus the only run in which we are not placed last is the most difficult task: b_c.
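The scores in Figure 1 are mean reciprocal rank (MRR) values. As a reminder of how such a score
is formed from ranked answer lists, here is a minimal sketch. The exact-match judgment, the data
layout and the example answers are simplifying assumptions and do not reproduce the official
QAST evaluation procedure.

```python
def mean_reciprocal_rank(ranked_answers, gold_answers, max_rank=5):
    """Average over questions of 1/rank of the first correct answer (0 if none)."""
    total = 0.0
    for question_id, gold in gold_answers.items():
        candidates = ranked_answers.get(question_id, [])[:max_rank]
        for rank, answer in enumerate(candidates, start=1):
            if answer in gold:
                total += 1.0 / rank
                break
    return total / len(gold_answers)


# Toy data (assumed): q3 has no answer in the corpus, so its gold answer is NIL.
gold = {"q1": {"connecticut"}, "q2": {"1946"}, "q3": {"NIL"}}
ranked = {
    "q1": ["texas", "connecticut"],  # correct at rank 2 -> 1/2
    "q2": ["1946"],                  # correct at rank 1 -> 1
    "q3": ["london"],                # NIL never returned -> 0
}
mrr = mean_reciprocal_rank(ranked, gold)
```

A system that never returns NIL, as in the third toy question, necessarily scores zero on every
question whose correct response is NIL.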
Since our system does not treat automatic transcriptions differently from manual transcriptions
(except in the pre-processing stage), we restrict ourselves to further analyzing run a_m, in which
written questions and manual transcriptions are used. Figure 2 shows the breakdown of the
results by answer type for this run.
Our system is not able to identify whether the answer to a question can be found in the
corpus, thus we chose never to return a NIL response. Therefore the score for NIL questions is
questions, which has an MRR which is more than double that of the Person type,
the second best question type. This might be explained by the fairly restricted format of Time
answers and the efforts we made on date normalization. Organisation questions are difficult to
answer since there are few restrictions on the format of organisation names. The zero score for
Location questions is more disappointing, since those answers mostly consist of a single
geographical term. The score for Measure questions yields little information, since there were
only 2 such questions.
Normally our QA system utilizes a large corpus, such as the Web, and the more often an
answer candidate occurs in the context of query terms, the more likely it is to be considered a
correct answer. However, in this task the corpus was small, thus we are not able to benefit from
such redundancy, which might be an explanatory factor for our low performance.
6 Conclusion
In this paper we have given an overview of our methods and results for the CLEF 2009 Question
Answering on Speech Transcriptions evaluation. The results show that, using our QA system, we
are not able to achieve good performance on this task. Obvious explanations are the presence
of non-factoid questions, which our system is not built to answer, in addition to our inability
to identify questions which have no answer in the given corpus. Another possible reason is the
small size of the corpus, which means our system cannot take advantage of redundant answer
information.
References
[1] Whittaker, E., Chatain, P., Furui, S. and Klakow, D., TREC2005 Question Answering
Experiments at Tokyo Institute of Technology, Proc. TREC-14, 2005.
[2] Whittaker, E., Novak, J., Chatain, P. and Furui, S., TREC2006 Question Answering
Experiments at Tokyo Institute of Technology, Proc. TREC-15, 2006.
[3] Whittaker, E., Heie, M., Novak, J. and Furui, S., TREC2007 Question Answering Experiments
at Tokyo Institute of Technology, Proc. TREC-16, 2007.
[4] Heie, M., Whittaker, E., Novak, J., Mrozinski, J. and Furui, S., TAC2008 Question Answering
Experiments at Tokyo Institute of Technology, Proc. TAC, 2008.
[5] Ponte, J. and Croft, W., A Language Modeling Approach to Information Retrieval, Proc.
SIGIR, 1998, pp. 275-281.
[6] Heie, M., Whittaker, E., Novak, J. and Furui, S., A Language Modeling Approach to Question
Answering on Speech transcriptions, Proc. ASRU, 2007, pp. 219-224.
[7] Whittaker, E., Furui, S. and Klakow, D., A Statistical Pattern Recognition Approach to
Question Answering using Web Data, Proc. Cyberworlds, 2005, pp. 421-428.