From Human Gesture to Synthetic Action
Michael Kipp
University of the Saarland (DFKI)
Stuhlsatzenhausweg 4, 66123 Saarbrücken
kipp@dfki.de
ABSTRACT
Embodied agents are still looking for their own body language. Our efforts aim at analyzing human speakers to extract individual repertoires of gesture. We argue that TV personalities or actors are better suited for this purpose than ordinary subjects. For analysis, we propose an annotation scheme and present the tool Anvil for transcoding gesture and posture. For transfer to synthetic agents, we suggest thinking of gestures as categorizable into equivalence classes, a subset of which can make up an agent's nonverbal repertoire. Results of investigating different levels of gesture organisation and posture shifts are presented. The concept of G-Groups is introduced, which is hypothesized to correspond to rhetorical devices. We also give a brief sketch of how these results are planned to be integrated in a multimodal generation system for presentation teams.
Keywords
Nonverbal Behavior, Embodied Agents, Lifelike Characters
1. INTRODUCTION
Nonverbal behavior like gesture and posture plays a crucial role in making communication smoother, facilitating understanding and interacting in a social way. With the rise of synthetic actors in the human-computer interface, this area has experienced a surge of interest in an effort to make agents more "life-like". At DFKI, we pursue the idea of interfacing with the user via teams of agents that interact with each other in what can be considered a performance [2]. Information is conveyed indirectly through observation of dialogues. This offers possibilities of more clearly structured, more socially interesting, more subtle and more entertaining presentation [1]. First informal user studies of this approach have been promising but also revealed a need for more sophisticated generation of nonverbal behavior. As opposed to single-agent scenarios [8, 24], we must deal with the additional challenge of making the agents distinguishable individuals. A basic requirement for this is individual repertoires of nonverbal behavior.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Copyright 2001 ACM 0-89791-88-6/97/05 ... $5.00.
Our general research agenda has three phases. First, by looking at empirical data, we search for consistent gestures that could be taken as equivalence classes in terms of form and function, and assemble them into individual repertoires. Second, we aim to integrate these results in a running application by modeling instances of these equivalence classes as animation clips. Third, a system evaluation will assess its nonverbal effectiveness.
Looking at empirical data has been done before for other systems [8, 24]. Such studies have a long history in psychological, anthropological and semiotic research [25, 15, 9]. The majority of the literature has been concerned with general properties of "spontaneous gesture" [22]. Subjects were ordinary people chosen without regard to quality of gesture. Also, collecting individual repertoires has usually not been undertaken [23]. For effective use in synthetic characters, though, the time has definitely come to analyze the kind of speaker we aim to model: a rhetorically proficient speaker who uses his/her nonverbal repertoire to maximum effect, thus giving flesh to the motivation of using a body at all. For such speakers we need to identify individual repertoires, patterns of rhetorically effective delivery and all the different sources that inform the selection of nonverbal signals. We envision a state of the art where interchangeable gesture profiles can be used and reused to give a synthetic character its very special human touch.
But can there be any such thing as a repertoire? Is the space of personal gesticulation not infinite? As has been argued before [17], this strongly depends on the topic of the talk. Describing spatial relations (e.g. giving directions to a foreigner on the street) or actions (e.g. recounting cartoon animations) results in a myriad of different, often singular gestures, most of them probably made up on the spot for the specific purpose at hand. Most such gestures would be pointless to model. On the other hand, "the more abstract and metaphorical the content the gesture pertains to, the more likely we are to observe consistencies in the gestural forms employed" [17]. We understand these "consistencies" as equivalence classes of gesture that can be found, modeled and used for agents involved in "abstract" talk. Abstract topics can be as diverse as sales dialogues, weather forecasts, news reports or literary discussions.
This paper describes work in progress, covering phase one (empirical investigation) and aspects of phase two (integration) in our research agenda. We report on the annotation of video material, the coding scheme involved and coding reliability. The tool Anvil [18] (freely available for research purposes at http://www.dfki.de/~kipp/anvil), especially developed for this project, is presented. Results about repertoires of high-frequency gestures, and about gesture timing and functions, are laid out. Finally, we outline a sample application where the found results can be integrated.

Figure 1: Kinesic hierarchy (G-Phase; G-Phrase and G-Phrase Complex; G-Group and G-Group Complex).
2. ANNOTATION
Our data was taken from the German TV show "Literary Quartet", where four people present and discuss recent book releases. The show is ideal for our purposes for two reasons. First, there is a good mix of abstract talk and interaction. Second, the two speakers chosen for this research, called MRR and HK, have active, proficient gesticulation and a strongly perceived personality. "Proficient" is an intuitive concept here, not to be fully discussed. We only point out that criteria like clear gesture segmentation (in analogy to clear articulation in speech), clear gestural forms, and variation in form and tempo could all contribute to the perception of proficiency.
For analysis, we transcribed speech using PRAAT (by Boersma/Weenink: http://www.fon.hum.uva.nl/praat) and encoded text structure according to Rhetorical Structure Theory (RST) [20] with the RST Tool. Both data were imported into Anvil, where posture and gesture were encoded as described below.
2.1 Gesture Structure
Gesture is a fleeting concept. In order to be able to transcribe gesture and talk about timing and function, a structural framework is indispensable.

Gestures can be dissected into more elementary movement phases [15] or clustered together, then constituting a higher unit with a function of its own. This is the essence of the hierarchical view on gesture structure as proposed by [15, 22] and depicted in Fig. 1 (we replaced the G-Unit layer by the G-Group layer and added complexes, see below).

The G-Phase layer comprises the movement phases of a gesture. Since this is not the focus of this paper, all we need to know about this layer is that the stroke is the most energetic part of a gesture, usually the part that is synchronized with speech (cf. [19, 15, 22] for further reading). Gestures themselves are located on the G-Phrase layer and consist of one or more phases. Many classification systems for gesture have been proposed [23, 10, 22, 9]. We settled on a compromise between semiotic and functional views [16], using the following categories: emblems, adaptors, deictics, iconics, metaphorics and beats. Our priority was clearly that the categories are easy to identify (see Section 2.2 for subcategorization).
Figure 2: G-Group features with intensity arrows. Features: handedness (right hand, left hand, both hands); dynamic vs. static; circular vs. non-circular; curved vs. straight; direct vs. round; quick vs. slow; inward vs. outward; vertical, diagonal, horizontal; tense vs. relaxed; narrow vs. expansive.
On the next higher layer, we define G-Groups as sequences of contiguous gestures that share the cohesive properties shown in Fig. 2 (inspired by [21]). E.g., a stretch of two-handed gestures forms a G-Group in opposition to a following group of left-handed gestures. A G-Group has no type, its sole function being the grouping together of gesture phrases. We diverge from the literature [15, 22], where the next higher layer is usually the G-Unit, defined as subsequent gestures between two rest positions, because we did not find any obvious correlation to speech. Kendon called gesture sequences with shared handedness "groupings or parts" in [15] and has thus considered aspects of G-Groups without putting them into the hierarchy.

On top of the three-layered hierarchy, we call special patterns of G-Phrases and G-Groups complexes (see [13] for an analogous language concept). Repetition of gesture is such a pattern on the G-Phrase layer. Likewise, gestures that are frequently used together by a speaker form a complex.
2.2 Coding Scheme
Gestures were transcoded on three levels: G-Phase, G-Phrase and G-Group. We focus on G-Phrases, whose subcategories make up our notion of gesture equivalence classes.

Gestures are first classified into the six major categories: Emblems are gestures with conventionalized meaning that can usually be replaced by a verbalization [5]. Adaptors are self-touches or object-touches and serve to disperse excess energy (nervosity) and/or to focus on cognitive processes [12]. Deictics are pointing gestures. Iconics are gestures that correspond to a speech concept by some direct resemblance relation (e.g. writing into the air while spelling out a name), whereas metaphorics relate to speech in a more abstract way (e.g. holding your palm like a cup when asking a question, as if ready to "receive" a material answer, cf. [22]). Finally, we decided to treat beats as a rest class on the G-Phrase layer because we see a functional separation between the G-Phrase and G-Phase layers:

Rhythmical beats, which accompanied most of the gestures, are encoded on the G-Phase layer. Almost every gesture can have such rhythmical beats, which serve certain functions (e.g. highlighting new information [22]). On the G-Phrase layer, however, a different level of meaning is added through the specific form of a gesture. What, then, are the traditional beat or baton gestures found in the literature [11, 22]? In our opinion, they are gestures whose form does not carry any meaning, only their rhythm being important. Therefore, on the G-Phrase layer, such gestures can be considered a rest class. This is a convenient way of dealing with the alternative view that beats are superimposed on other gestures, which would complicate annotation.
For the emblems, iconics and metaphorics there is a range of 67 subcategories which were found during annotation and documented in a manual with video stills. Entries for emblems specify form, function, concomitant speech, verbalization and similar gesture categories (to avoid confusion). Entries for iconics and metaphorics specify form and function. Deictics and adaptors are coded with one parameter each: aim of deixis (self, addressee, space) and touch region (head, shoulder, table) for the adaptor.

These subcategories are what we hypothesize to be equivalence classes, i.e. instances of one subcategory are taken to be exchangeable with any other instance in this group.

Note that qualitative parameters [4] are not coded on the G-Phrase layer. Instead, they are handled on the G-Group layer. Parameters like handedness, speed or direction of movement are seen not as inherent parameters of single gestures but as cohesive glue grouping sequences of gestures together into G-Groups. Subsequent gestures belong to the same G-Group iff they share all the same properties of Fig. 2.

For posture, we distinguished between two positions (upright, relaxed) and four kinds of motion (legs-cross, legs-uncross, upper body, and whole body).
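The "same G-Group iff all cohesive properties agree" rule amounts to partitioning the gesture sequence into maximal runs of feature-identical gestures. A minimal sketch, with illustrative feature names standing in for the actual Fig. 2 properties:

```python
# Hypothetical sketch: partition a gesture sequence into G-Groups.
# Subsequent gestures join the same G-Group iff they agree on ALL
# cohesive features. Feature names here are illustrative stand-ins.

FEATURES = ("handedness", "speed", "extent")

def g_groups(gestures):
    """Split a list of gesture dicts into maximal runs that agree
    on every cohesive feature."""
    groups = []
    for g in gestures:
        props = tuple(g[f] for f in FEATURES)
        if groups and groups[-1][0] == props:
            groups[-1][1].append(g)      # same properties: extend run
        else:
            groups.append((props, [g]))  # property change: new G-Group
    return [members for _, members in groups]

seq = [
    {"name": "conduit", "handedness": "both",  "speed": "quick", "extent": "narrow"},
    {"name": "dismiss", "handedness": "both",  "speed": "quick", "extent": "narrow"},
    {"name": "point",   "handedness": "right", "speed": "quick", "extent": "narrow"},
]
print([len(g) for g in g_groups(seq)])  # [2, 1]
```

The handedness change after the second gesture closes the first G-Group, mirroring the two-handed vs. left-handed example above.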
2.3 Annotation Tool
For transcoding gesture and posture, the generic annotation tool Anvil [18] (Annotation of Video and Language) was developed (Fig. 3). It allows annotation of video files (e.g. QuickTime or AVI) on multiple layers, so-called tracks. In primary tracks the user can add elements by marking begin and end time. Secondary tracks contain elements that point to elements of another, reference track. To insert an element, the user specifies the first and last element in the reference track. All track elements contain attribute-value pairs. The attributes' names and types are specified by the user for each track. Attribute types are ValueSet, String, Boolean, Number and MultiLink. For ValueSets the user defines a set of possible values that will reappear in the GUI as an option menu. String-typed attributes offer a string input field, Booleans a checkbox and Number types a slider. MultiLink types allow the selection of arbitrary track elements in the annotation, thereby permitting within-level and cross-level linkage.

For the scheme outlined in Section 2.2, G-Phases were encoded in a primary track, G-Phrases in a secondary track whose elements point to G-Phase elements. Likewise, G-Groups were coded in a secondary track pointing to the G-Phrase track (see Fig. 3).
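The primary/secondary track idea can be illustrated with a minimal data-model sketch. This is not Anvil's actual implementation or file format, only an assumption-laden illustration: a secondary element derives its time span from the first and last reference element it points to.

```python
# Illustrative sketch (NOT Anvil's real data model): secondary-track
# elements point at a first and last element of a reference track
# and inherit their time span from them.

class Element:
    """Primary-track element with explicit begin/end times."""
    def __init__(self, begin, end, **attrs):
        self.begin, self.end, self.attrs = begin, end, attrs

class SecondaryElement:
    """Secondary-track element spanning its referenced elements."""
    def __init__(self, first, last, **attrs):
        self.first, self.last, self.attrs = first, last, attrs

    @property
    def begin(self):
        return self.first.begin

    @property
    def end(self):
        return self.last.end

# A G-Phrase pointing at two G-Phase elements (times in seconds):
prep   = Element(0.0, 0.4, phase="preparation")
stroke = Element(0.4, 0.7, phase="stroke")
phrase = SecondaryElement(prep, stroke, category="emblem.dismiss")
print(phrase.begin, phrase.end)  # 0.0 0.7
```

A G-Group element would point at G-Phrase elements in the same way, giving the three-layer chain described above.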
Anvil stands out in comparison to similar tools in that it is the only non-commercial, fully implemented, XML-based and platform-independent tool specifically designed for this kind of annotation work. Especially cross-level linkage is a feature rarely found in other tools.

Figure 4: Increase of subcategories with increasing material (left) and percentage of gesture occurrences covered by most frequent categories (right).
2.4 Reliability
Inter-coder reliability was analyzed in two different studies on the G-Phrase level to check consistency of the 67 gesture subcategories. In each study, two coders independently annotated 2:54 minutes (study 1) and 2:26 minutes (study 2) of material containing pre-annotated G-Phases of speaker MRR. Segmentation into G-Phrases (gestures) was near-perfect:

            segmentation   subcat. agreement   subcat. kappa
  study 1       93.0%            64.0%              0.60
  study 2      100.0%            79.4%              0.78

In subcategory agreement, the first study yielded critical results. The second study, building on insights from the first study and the grown experience of the coders, resulted in a satisfying 79% agreement and a kappa value of 0.78, which is very close to "good agreement" [7].
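The kappa statistic [7] corrects raw agreement for chance. A minimal sketch with toy labels (not the study data), assuming the standard Cohen formulation:

```python
# Sketch: Cohen's kappa for two coders' category labels (cf. [7]).
# kappa = (P_obs - P_exp) / (1 - P_exp), where P_exp is the
# agreement expected by chance from each coder's label distribution.
from collections import Counter

def kappa(a, b):
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_exp = sum(ca[k] * cb[k] for k in ca) / n**2  # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)

coder1 = ["dismiss", "wipe", "conduit", "dismiss", "attention"]
coder2 = ["dismiss", "wipe", "dismiss", "dismiss", "attention"]
print(round(kappa(coder1, coder2), 2))  # 0.71
```

Here raw agreement is 80%, but kappa is lower because both coders use "dismiss" often, making chance agreement on it likely.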
3. RESULTS
Our results yielded evidence that our equivalence class approach is feasible in terms of extraction (see above) and modeling (by taking high-frequency gestures, see below). We will also present insights on timing and function.

In all, we annotated 40:09 minutes of data of the two speakers MRR and HK. Tab. 1 shows the number of phases, phrases and groups found per speaker. We identified 67 different gesture subcategories for both. Naturally, the more material we annotated, the more subcategories we found. Fig. 4 shows the increase of gesture subcategories with increasing material for the speaker MRR (left diagram, upper line). When we consider only those subcategories whose instances make up more than 1% of all occurrences, we obtain a surprisingly constant number of 21-24 subcategories (left diagram, lower line). These high-frequency subcategories seem natural candidates for being modeled as animations. But how much of the original occurrences of gesture does this set of subcategories cover? Fig. 4 (right) shows the coverage of MRR's 21 subcategories over 1%. Although it is slightly declining with growing data, in the final corpus of 604 occurrences the "over 1%" subcategories still cover about 83%.

For speaker HK we found 19 "over 1%" subcategories. Comparing the two sets of HK and MRR, there was an overlap of 12 subcategories. Individuality is thus found already in the different repertoires, and furthermore in different frequency distribution and usage in function and timing, as treated below.
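The "over 1%" analysis reduces to counting subcategory frequencies, thresholding, and summing the retained counts. A sketch with toy counts (not the corpus data):

```python
# Sketch: find subcategories whose instances exceed 1% of all
# occurrences, and the share of occurrences they cover (toy data).
from collections import Counter

def high_freq_coverage(occurrences, threshold=0.01):
    counts = Counter(occurrences)
    total = len(occurrences)
    # keep subcategories strictly above the relative threshold
    frequent = {c for c, n in counts.items() if n / total > threshold}
    coverage = sum(counts[c] for c in frequent) / total
    return frequent, coverage

occ = ["conduit"] * 50 + ["dismiss"] * 30 + ["attention"] * 19 + ["rare_x"]
frequent, cov = high_freq_coverage(occ)
print(sorted(frequent), cov)  # ['attention', 'conduit', 'dismiss'] 0.99
```

In the real corpus the frequent set stays at roughly 21-24 subcategories while coverage settles around 83%, as reported above.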
Figure 3: Anvil tool for multi-layered annotation of audio-visual data.
             MRR     HK    total
  G-Phases   1,422   663   2,085
  G-Phrases    604   265     869
  G-Groups     286   103     389

Table 1: Number of encoded entities.
3.1 Gesture Analysis
Gestures are timed to co-occur with related speech concepts. Kendon introduced the notion of idea units as the common origin of speech and gesture production [15]. These would be manifest in intonation units (tone units) on one hand and G-Phrases on the other. We tried to find relationships to syntactic entities, as they would be more readily available in a speech generation architecture. The following timing patterns occurred:

1. Direct: Gestures synchronize their stroke directly with a word (noun, verb, adverb...) or a noun/prepositional phrase (NP, PP).

2. Indirect: The stroke does not cover the correlated noun, verb etc. itself, but closely linked information, e.g. the noun's modifier, the verb's transitive object or the verb's negation particle.

3. Init: Gestures that refer to concepts deeper in a clause or phrase but only cover the first 1-2 words; this is especially encountered with gestures that illustrate a process and correlate to a verb. They occur at the beginning of a verbal phrase (VP) where the verb is in end position, alleviating the fact that in German the verb is often located at the back of the sentence.

4. Span: Gesture stroke(s) (plus hold) cover(s) a whole clause or phrase.
What follows are some insights on function and timing we gathered for each gesture type, for G-Phrase complexes, G-Groups and G-Group complexes.

Emblems should be most easily identified since their meaning is conventionalized. Functions identified are: speech act, display emotion, display attitude (certainty) and semantic illustration. Some emblems always work with the same timing pattern; e.g., MRR's most frequent speech act emblem "attention" consistently occurs with span timing. Emblems like this also convey a strong high-status message (according to [14]) that may partly explain MRR's reputation as being highly dominant, a point to be investigated further in the future.
Adaptors are considered to possibly co-occur with "significant points of the speech flow" [25]. In our data, adaptors occurred while the speaker was searching for a word (speech failure) or with the beginning of an RST segment, usually covering a short pause. According to Johnstone, all adaptors convey low status [14].

Deictic gestures directed at the addressee function as (1) reference to the addressee by direct sync with pronouns ("you", "your"), (2) reference to the addressee's utterance, timed with the span or init pattern, and (3) turn-taking signals [26]: HK exclusively signaled hold-turn, MRR only yield-turn (MRR chairs the discussion). Pointing directly at somebody was also found capable of expressing aggression.

Iconics illustrate objects, processes and qualities [15]. Object gestures are usually in direct sync with a noun. Process gestures sync with a verb (direct), the verb's transitive object (indirect) or the beginning of the VP (init). Quality gestures occur in direct sync with an NP.

Metaphorics are similar to iconics in usage, only their meaning is more generic. The most common conduit gesture [22], e.g., can co-occur with almost any object or abstract notion in speech. Some process gestures were found to co-occur with connectives or clause borders, metaphorically indicating motion to the next part of the discussion.
  G-Phrase g1                G-Phrase g2              ~P(g2|g1)
  emblem.dismiss             emblem.wipe                0.21
  metaphoric.conduit-ing     metaphoric.conduit-ing     0.20
  emblem.dismiss             emblem.dismiss             0.20
  emblem.attention           emblem.attention           0.14
  emblem.dismiss             emblem.dismiss             0.14
  metaphoric.conduit         iconic.strength            0.13
  deictic.space              deictic.space              0.13
  metaphoric.conduit         metaphoric.conduit         0.10

Table 2: Most frequent bigrams of MRR's gestures (G-Phrases) with estimated conditional probability.
G-Phrase Complexes are formed, e.g., by repetition. Repetition adds cohesion to the discourse and can bind together parts of a sentence that are separated by embedded structures. But are all gestures suitable for repetition? Statistical analysis results in the patterns in Tab. 2 for speaker MRR (only the most frequent, together with estimated conditional probability). More importantly, if the same two gestures occur together again and again, they can be interpreted as an idiosyncratic compound gesture that extends the original repertoire, thus adding to personal style. We found MRR's frequent emblem sequence "dismiss-wipe" highly characteristic of the speaker.
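The conditional probabilities in Tab. 2 can be estimated from the annotated gesture sequence by simple bigram counting. A sketch with a toy sequence (not the corpus):

```python
# Sketch: estimating ~P(g2 | g1) from a sequence of gesture labels
# by maximum likelihood: count(g1, g2) / count(g1 as first element).
from collections import Counter

def bigram_probs(seq):
    pairs = Counter(zip(seq, seq[1:]))   # adjacent gesture pairs
    firsts = Counter(seq[:-1])           # how often g1 starts a pair
    return {(g1, g2): n / firsts[g1] for (g1, g2), n in pairs.items()}

seq = ["dismiss", "wipe", "dismiss", "dismiss", "conduit"]
probs = bigram_probs(seq)
print(round(probs[("dismiss", "wipe")], 2))  # 0.33
```

With more data, high-probability pairs such as "dismiss-wipe" surface as candidate compound gestures.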
G-Groups are formed by gestures that share the same properties from Fig. 2. Most G-Groups correspond to one of the two rhetorical devices stated by [3]: List of Three and Contrast. These devices are nonverbally marked in public speaking to coordinate audience reaction (laughter, applause). More generally, List of Three and Contrast can be taken as RST relations and G-Groups as markers for rhetorical segments.

G-Group Complexes. A two-handed gesture is more visible than a one-handed one, an expansive gesture more than a narrow one. To capture a grading of intensity, we introduce an intensity gradation for some dichotomies in Fig. 2; intensity rises in the direction of the arrow. Now, G-Group A is more intense than G-Group B if A has at least one more intense property. If there are both more and less intense properties, there is no relation. Speaker MRR organizes his G-Groups such that intensity rises (crescendo) and then suddenly drops to give weight to the last part of speech. A simpler identified pattern is contrast: two contrasting segments are covered by G-Groups differing in intensity. A third usage of G-Groups is the marking of embedded clauses to make the surrounding utterance more cohesive.
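This "more intense" relation is a partial order: A dominates B only if no feature of A is less intense than in B. A sketch, with illustrative feature orderings standing in for the actual Fig. 2 arrows:

```python
# Sketch of the intensity partial order over G-Group features.
# Feature orderings are illustrative stand-ins for the Fig. 2 arrows.
INTENSITY = {  # feature -> values ordered from less to more intense
    "handedness": ["one", "both"],
    "extent":     ["narrow", "expansive"],
    "tension":    ["relaxed", "tense"],
}

def more_intense(a, b):
    """True iff G-Group `a` is more intense than `b`: at least one
    feature is more intense and none is less intense (else: no relation)."""
    ra = [INTENSITY[f].index(a[f]) for f in INTENSITY]
    rb = [INTENSITY[f].index(b[f]) for f in INTENSITY]
    if any(x < y for x, y in zip(ra, rb)):
        return False  # some feature is LESS intense: mixed, no relation
    return any(x > y for x, y in zip(ra, rb))

A = {"handedness": "both", "extent": "narrow", "tension": "relaxed"}
B = {"handedness": "one",  "extent": "narrow", "tension": "relaxed"}
print(more_intense(A, B), more_intense(B, A))  # True False
```

A crescendo as observed for MRR would then be a chain of G-Groups each more intense than its predecessor, followed by a sudden drop.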
3.2 Posture Analysis
            total   segment begin   segment end   engage   disengage   other
  cross       24         15              4           -          2         4
  uncross     24          4             14           3          -         3

Table 3: Posture shifts of speaker MRR.
Posture was hypothesized by Scheflen [27] to correspond to a signal moving from one topic to another. We investigated for speaker MRR when and how he moved (HK hardly changed posture at all). Specifically, we found that his crossing and uncrossing of legs followed the pattern shown in Tab. 3. Most posture shifts considered occurred at the boundary of two RST segments A and B (77%), with a speech pause of 0.2-0.5 seconds between A and B. Moreover, depending on whether the shift occurred at the end of A or at the beginning of B, MRR preferably uncrossed or crossed his legs. A second correlation, between posture shift and interaction, is only slightly indicated by the data (there was too little interaction): MRR uncrossed his legs for engaging in (getting ready for) an interaction and crossed his legs after having disengaged from an interaction.

Figure 5: Discourse specification tree.
4. APPLICATION SKETCH
As this is work in progress, we only briefly outline how our results will be integrated in an application of presentation teams by pointing at critical decision points for gesture/posture selection.

Gesture repertoires (of subcategories) are modeled as animation clips using the 3D Studio MAX animation software. Each gesture must be provided with (1) various in/out positions, (2) different postures and (3) different handedness to allow variability and posture shifts. Posture shifts must be modeled separately. Gestures are then classified and stored according to gesture category, function and timing pattern.

As a sample application, an existing presentation generation system for selling cars [2] was adapted to a book sales scenario called "BookJam". The presentation team consists of a book-selling agent (S) who presents a number of books to two buying agents (B). The ensuing dialogue, containing monologic passages of the seller and critical questions from the buyers, indirectly presents different facts and aspects of the book to an observing user in an entertaining fashion.

To generate such a dialogue, a plan-based system computes a presentation plan at run-time (a sample is shown in Fig. 5), making use of parameters that model personality and status of the agents. This plan specifies the whole agent-agent interaction, each node representing a plan operator that fired during processing. We now distinguish four node types: Surface nodes are the leaves where text and gesture actually appear in their final form. Rhetorical nodes (underlined) are those nodes that only contain child nodes of a single agent. The minimal nodes that comprise more than one agent are interactive nodes (boxed). All others are called topical (boxed, round corners) as they structure larger portions of the dialogue.
To integrate gesture/posture generation into the plan operators, different decisions are made at each node type. Topical nodes plan adaptors. Interactive nodes plan adaptors and turn-taking signals (e.g., deictics), speech act emblems and posture shifts. In rhetorical nodes, G-Groups and G-Phrase complexes are planned (determining handedness and gesture qualities), and also discourse-structuring gestures (metaphorics, emblems). Finally, on the leaves, gestures of all types can be selected according to the semantics of concomitant speech. The challenge at surface nodes is the integration of all decisions made here and in higher nodes, and the coordination of gestures and posture shifts in accordance with given timing patterns.

At the time of writing, all this belongs to future work. Furthermore, we will think about ways of evaluating the system with regard to the impact of nonverbal behavior, thus tying together empirical results and application.
5. DISCUSSION
We have identified and documented two individual repertoires of gesture that can be used in a generation system for presentation teams. As opposed to related research, we (1) have proposed the analysis of rhetorically proficient speakers and (2) have embarked on the effort to collect and analyze individual repertoires instead of generalizing over subjects. The main research tool, Anvil, was presented, as well as an annotation scheme and some results on timing, function and frequencies. A new structural layer, the G-Group, was introduced and hypothesized to coincide with rhetorical devices, though more work needs to be done on a conceptual level. Posture shifts were found at borders of larger RST segments. Their timing and form (crossing/uncrossing legs) were identified for one speaker. A sample application was outlined with first ideas on how to integrate the results.

For the future, more work on fast, easy and reliable annotation will be done. A coding manual using decision trees for form and function is being considered. Emblem definitions will be made more formal using, e.g., Bitti and Poggi's scheme for classifying holophrastic emblems by specifying (1) arguments, (2) predicate, (3) tense, and (4) attitude [5]. On the generation side we are working on architecture, 3D gesture modeling and multimodal fusion on surface nodes. Attribution experiments are planned to investigate whether some emblems have strong connotations in terms of status (high, low), attitude (positive, negative) and emotion.
6. ACKNOWLEDGMENTS
This research is supported by a doctoral scholarship from the German Science Foundation (DFG). Many thanks to Ralf Engel and Martin Klesen for gesture annotation, and to Elisabeth André for discussion and advice.
7. REFERENCES
[1] E. André and T. Rist. Presenting Through Performing: On the Use of Multiple Lifelike Characters in Knowledge-Based Presentation Systems. In Proceedings of the Second International Conference on Intelligent User Interfaces (IUI 2000), pages 1-8, 2000.
[2] E. André, T. Rist, S. van Mulken, M. Klesen, and S. Baldes. The Automated Design of Believable Dialogues for Animated Presentation Teams. In J. Cassell, J. Sullivan, S. Prevost, and E. Churchill, editors, Embodied Conversational Agents, pages 220-255. MIT Press, Cambridge, MA, 2000.
[3] M. Atkinson. Our Masters' Voices: The Language and Body Language of Politics. Methuen, London, 1984.
[4] ... animation based on movement observation and cognitive behavior models. Computer Animation Conference, 1999.
[5] P. E. R. Bitti and I. Poggi. Symbolic nonverbal behavior: Talking through gestures. In R. S. Feldman and B. Rimé, editors, Fundamentals of Nonverbal Behavior, pages 433-457. Cambridge University Press, New York, 1991.
[6] P. E. Bull. Posture and Gesture. Pergamon Press, Oxford, 1987.
[7] J. Carletta. Assessing Agreement on Classification Tasks: The Kappa Statistic. Computational Linguistics, 22(2):249-254, 1996.
[8] J. Cassell, T. Bickmore, L. Campbell, and H. Vilhjálmsson. Human Conversation as a System Framework: Designing Embodied Conversational Agents. In J. Cassell, J. Sullivan, S. Prevost, and E. Churchill, editors, Embodied Conversational Agents, pages 29-63. MIT Press, Cambridge, MA, 2000.
[9] D. Efron. Gesture and Environment. King's Crown Press, New York, 1941.
[10] P. Ekman and W. V. Friesen. The Repertoire of Nonverbal Behavior: Categories, Origins, Usage, and Coding. Semiotica, 1:49-98, 1969.
[11] P. Ekman and W. V. Friesen. Hand movements. Journal of Communication, 22:353-374, 1972.
[12] N. Freedman and S. P. Hoffman. Kinetic behavior in altered clinical states: Approach to objective analysis of motor behavior during clinical interviews. Perceptual and Motor Skills, 24:527-539, 1967.
[13] M. A. K. Halliday. Language as Social Semiotic. Edward Arnold, London, 1978.
[14] K. Johnstone. Impro for Storytellers. Routledge/Theatre Arts Books, New York, 1999.
[15] A. Kendon. Gesticulation and speech: Two aspects of the process of utterance. In M. R. Key, editor, Nonverbal Communication and Language, pages 207-227. Mouton, The Hague, 1980.
[16] A. Kendon. Gesture and Speech: How They Interact. In J. M. Wiemann and R. P. Harrison, editors, Nonverbal Interaction, pages 13-45. Sage, London, 1983.
[17] A. Kendon. An Agenda for Gesture Studies. The Semiotic Review of Books, 7(3):8-12, 1996.
[18] M. Kipp. Anvil: a Generic Annotation Tool for Multimodal Dialogue. In Proceedings of Eurospeech 2001, 2001. (submitted).
[19] S. Kita, I. van Gijn, and H. van der Hulst. Movement phases in signs and co-speech gestures, and their transcription by human coders. In I. Wachsmuth and M. Fröhlich, editors, Gesture and Sign Language in Human-Computer Interaction, pages 23-35. Springer, 1998.
[20] W. C. Mann and S. A. Thompson. Rhetorical Structure Theory: Toward a functional theory of text organization. Text, 8(3):243-281, 1988.
[21] R. Martinec. Types of process in action. Semiotica, 130(3/4):243-268, 2000.
[22] D. McNeill. Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press, Chicago, 1992.
[23] C. Müller. Redebegleitende Gesten: Kulturgeschichte, Theorie, Sprachvergleich. Berlin Verlag, Berlin, 1998.
[24] I. Poggi, C. Pelachaud, and F. de Rosis. Eye Communication in a Conversational 3D Synthetic Agent. AI Communications, 2000.
[25] B. Rimé and L. Schiaratura. Gesture and speech. In R. S. Feldman and B. Rimé, editors, Fundamentals of Nonverbal Behavior, pages 239-281. Cambridge University Press, New York, 1991.
[26] H. Sacks, E. A. Schegloff, and G. Jefferson. A simplest systematics for the organization of turn-taking for conversation. Language, 50:696-735, 1974.
[27] A. E. Scheflen. The Significance of Posture in Communication Systems. Psychiatry, 26:316-331, 1964.