

Action Scheduling in Humanoid Conversational Agents by

Joey Chang

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degrees of

Bachelor of Science in Computer Science and Engineering

and Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology

May 22, 1998

© Copyright 1998 M.I.T. All rights reserved.

Author
Department of Electrical Engineering and Computer Science
May 22, 1998

Certified by
Justine Cassell
AT&T Career Development Professor of Media Arts and Sciences
Thesis Supervisor

Accepted by
Arthur C. Smith
Chairman, Department Committee on Graduate Theses

MASSACHUSETTS INSTITUTE OF TECHNOLOGY


Action Scheduling in Humanoid Agents

by

Joey Chang

Submitted to the Department of Electrical Engineering and Computer Science on May 22, 1998 in Partial Fulfillment of the

Requirements for the Degrees of

Bachelor of Science in Computer Science and Engineering

and Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology

Abstract

This paper presents an approach to action scheduling in lifelike humanoid agents. The action scheduler constitutes a module in the agent which links the mental processes to the physical actions of the agent. It receives requests from the agent's behavior generating modules and schedules them to be executed at the appropriate time, handling such tasks as speech-to-action synchronization, concurrent and overlapping behavior blending, and behavior conflict resolution. The optimal approach to action scheduling will vary from system to system, depending upon the characteristics of the system, since the ultimate action scheduler would accommodate a fully functional human representation-a goal which is out of the scope of research today. This paper presents an action scheduler for a real-time three-dimensional multi-modal humanoid agent.

Thesis Supervisor: Justine Cassell


Acknowledgements

Thank you to Justine Cassell for her guidance and insight.

Thank you to Hannes Vilhjálmsson for his invaluable technical expertise and inspiration. Thank you to the Rea Team for their hard work and devotion.

Thank you to my family for always being ready to help, whichever path I choose. Thank you to Kae-Chy for being there for me.

And while the acknowledgements are short in description, each of you know the great lengths to which they extend in my heart.


Contents

1. Introduction 7
1.1 Goals of This Work 7
1.2 Thesis Outline 7
1.3 Multi-Modal Communication 8
1.4 Humanoid Agents 8
2. Background 10
2.1 The Role of the Action Scheduler 10
2.2 Past Work 13
2.2.1 Ymir 13
2.2.1.1 The Processing Layers 13
2.2.1.2 Behavior Selection 14
2.2.1.3 Proper Abstraction 15
2.2.2 Animated Conversations 16
2.2.3 Multi-level Direction of Creatures 17
2.2.3.1 A Three-Tier Architecture 17
2.2.3.2 Flexibility through Modularity 18
2.2.3.3 Using DOFs 18
2.2.3.4 Similarities and Shortcomings 19
2.2.4 Improv 20
2.2.4.1 Handling Competing Actions 20
2.2.4.2 Action Buffering 21
2.2.4.3 Shortcomings 21
3. Statement of Problem 22
4. Architecture 24
4.1 Architectural Overview 25
4.2 Enabling Speech Planning 25
4.3 Enabling Coarticulation 26
4.4 Input Overview 26
4.5 Similarities and Differences to Ymir 27
5. Implementation 29
5.1 REA: a Multi-Modal Humanoid Conversational Agent 29
5.2 A Flexible Animation Engine 30
5.3 The Action Scheduler 31
5.4 Speech Generation 32
6. Evaluation 34
6.1 What Worked 34
6.2 What Didn't Work 35
6.3 A Comparison 36
7. Conclusions and Further Research 38
7.1 The Validity of the Action Scheduler 38


Chapter 1

Introduction

1.1 Goals of This Work

The primary goal of this work is to examine the role of the Action Scheduler in the development of a multi-modal humanoid agent. While there is no standard design to the development of a multi-modal humanoid agent, most attempts at developing some "smart" creature or agent have similarities in their decision-making process for evoking behaviors. This work will take one such architecture for a multi-modal humanoid agent and examine the issues that affect the Action Scheduler. It will also examine the action-evoking designs of other applications and attempt to determine the optimal method of implementing a general multi-modal humanoid agent.

1.2 Thesis Outline

This thesis will begin with a review of the role of the Action Scheduler in a multi-modal humanoid agent accompanied by a review of past work relevant to the Action Scheduler. Following this will be a description of the architecture which will use this particular Action Scheduler. Most importantly, this will include the modules which affect and are directly affected by the Action Scheduler. The thesis will conclude with a description of the implementation of the humanoid agent and a discussion of the observations made during and after its development.

1.3 Multi-modal Communication

Humans typically engage in face-to-face communication through the use of multiple input modalities. On the lowest level-the perceptual level-these include sight and hearing. Beyond the raw perceptual layer, however, humans use social customs and behaviors evolved through generations to regulate the collaborative effort of the face-to-face exchange of information-the conversation. Some of these behaviors-such as pointing (deictics) to a requested object-are consciously performed and produce an easily recognizable purpose. However, behaviors such as backchannel listener feedback, where a listener provides feedback in the form of head movements and short utterances as a way of regulating the conversation, are not so consciously noticed as elements of the conversation. Other communicative behaviors involve the use of facial expression, gaze direction, hand gesture, head movement, and posture to maintain a smooth conversational

flow.

1.4 Humanoid Agents

The continuing fast-paced advance of technological capabilities has driven a small number of businesses and research facilities to pursue the futuristic dream of a virtual servant. These computer-driven creatures are produced with the aim that they may fill the duties currently filled by humans, but enjoy the advantage of sparing human resources. Such entities, termed agents, have encountered a rocky reception at best due to the unfamiliarity of dealing with a software-driven agent that attempts to mimic a


real-life being. Help agents such as Microsoft Word's dictating paperclip have met with wide skepticism regarding their usefulness compared to a conventional means of help lookup.

While such applications as the Star Trek voice-interface computer are testimony to the wide potential of agents for the facilitation of information retrieval and task completion, the current state of research in the development of such agents is still far too primitive to produce highly useful results. One approach to developing an effective agent involves the development of a humanoid agent that embodies the multiple modalities of communication inherent in human face-to-face communication. This approach strives to create a virtual humanoid that demonstrates both the high level conscious and lower level subconscious forms of face-to-face communicative behavior in an attempt to ease the task of information exchange-the conversation-between the user and the agent. Such an endeavor demands an in-depth examination of discourse research to properly design and embed proper behaviors within a humanoid agent.


Chapter 2

Background

2.1 The Role of the Action Scheduler

The Action Scheduler's primary role involves planning what the agent will do on the highest level. [Thórisson 96] describes the Action Scheduler as a cerebellum that receives behavior requests from various drives within the mind of the agent. These drives represent different levels of behavior. The Action Scheduler examines the behavior requests submitted by these other modules and performs a resource look-up to verify the agent's ability to perform the particular action.

Such a role carries with it the responsibility of selecting the proper method to reflect a certain behavior in light of the possibility of conflicts. If the agent needs to scratch its head and desires to acknowledge the user by waving with the same hand, such

a conflict could be resolved by satisfying the acknowledgement with a nod of the head

instead.

The graceful execution of actions also falls under the responsibility of the Action Scheduler. Since the visual output of actions that are merely scheduled as is would appear awkward in several cases, the Action Scheduler must catch and edit these cases so they present a smooth output. Take the case where the agent points to the user and then scratches its head thoughtfully. If these two actions occur reasonably close, it would seem awkward for the agent to execute them as completely separate actions, complete with preparation and retraction phases-where the agent prepares for the main part of the gesture and retracts to a rest position. More graceful would be the omission of the retraction phase of the former gesture and the preparation phase of the latter gesture

resulting in a direct transition from the outstretched arm (pointing at the user) to a scratch

of the head (in thoughtfulness). Such a blending of closely occurring gestures is known

as coarticulation.
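To make the blending concrete, the following Python sketch merges two closely scheduled gestures by dropping the retraction phase of the first and the preparation phase of the second. The Gesture class, the three-phase decomposition, and the half-second threshold are illustrative assumptions, not details of any particular system described in this thesis.

from dataclasses import dataclass, field
from typing import List

# Hypothetical three-phase gesture: preparation, stroke, retraction.
@dataclass
class Gesture:
    name: str
    start: float                       # scheduled start of the stroke, in seconds
    phases: List[str] = field(default_factory=lambda: ["prepare", "stroke", "retract"])

def coarticulate(gestures: List[Gesture], gap_threshold: float = 0.5) -> List[Gesture]:
    """Drop the retraction of a gesture and the preparation of the next one
    when the two strokes are scheduled closely together."""
    gestures = sorted(gestures, key=lambda g: g.start)
    for prev, nxt in zip(gestures, gestures[1:]):
        if nxt.start - prev.start <= gap_threshold:
            if "retract" in prev.phases:
                prev.phases.remove("retract")   # stay extended instead of returning to rest
            if "prepare" in nxt.phases:
                nxt.phases.remove("prepare")    # transition directly into the next stroke
    return gestures

if __name__ == "__main__":
    point = Gesture("POINT_AT_USER", start=1.0)
    scratch = Gesture("SCRATCH_HEAD", start=1.3)
    for g in coarticulate([point, scratch]):
        print(g.name, g.phases)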

Similarly, the reverse case, where an intermediary position is desired, also occurs and is not immediately obvious to the execution phase unless the action scheduler can catch it. Since three-dimensional animation engines operate by the linear translation from one key frame to the next, a method of creating an intermediary position to alleviate physical impossibilities-known as action buffering-becomes a necessity. For

example, if an agent needs to transition from a state where its hands are in its pockets to a

folded arm position, a direct linear translation rendering of the animation would result in the hands seeming to pass right through the pockets and into the folded arm position.

Much more realistic would be the intermediary phase where the agent pulls its hands out

of its pockets and then moves to the folded arm position. Needless to say, such action buffering would be crucial for most cases transitioning from a hand-in-pockets state.
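A minimal sketch of action buffering, assuming a small table of pose transitions that require an intermediary: when the transition being planned appears in the table, the intermediary pose is inserted before interpolating to the target. The pose names and the table itself are hypothetical.

# Certain source poses require an intermediary pose before the target can be
# interpolated to directly. The pose names and table are illustrative assumptions.
BUFFER_POSES = {
    ("hands_in_pockets", "arms_folded"): "hands_at_sides",
    ("hands_behind_back", "arms_folded"): "hands_at_sides",
}

def plan_transition(current_pose: str, target_pose: str) -> list:
    """Return the sequence of poses the animation engine should pass through."""
    intermediary = BUFFER_POSES.get((current_pose, target_pose))
    if intermediary is not None:
        return [current_pose, intermediary, target_pose]
    return [current_pose, target_pose]

print(plan_transition("hands_in_pockets", "arms_folded"))
# ['hands_in_pockets', 'hands_at_sides', 'arms_folded']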

The responsibility of action buffering can lie in either the Action Scheduler, or the Animation Engine, depending upon how much freedom the Animation Engine is given to interpret action directives from the Action Scheduler. If the Animation Engine follows


strict methods of executing actions, then the Action Scheduler should resolve the proper set of actions and instruct the Animation Engine to execute them. If, however, the developer chooses to allow some interpretive freedom in the Animation Engine, action buffering could occur within the Animation Engine, with it simply reporting to the Action

Scheduler the information that the Scheduler needs, such as what joints are used and

when those joints are busy.

Another less noticeable, but drastically crucial task, is the synchronization of speech and action. Given that the humanoid agent has the ability to speak, the Action

Scheduler must be able to negotiate the speech system and coordinate actions so that they

fall at reasonably precise moments for communicative effectiveness. This functionality will depend largely upon the speech generation tools available. While some tools provide specific timing information, others do not, and would demand other methods of timing synchronization such as estimation. While the speech tools can be run by the Animation

Engine since it is a rote action, the various timing and synchronization tasks should be

performed by the Action Scheduler so that it can manage the integration of gesture and

facial expression with the speech.

Finally, one other basic, and probably the least explored endeavor, is the resolution of behavior conflicts. How does the conversational agent decide what to do when the driving forces within it want more than one behavior to occur? An agent may

typically greet a user with a smile and a spoken greeting, but how does the agent handle

the acknowledgement of a second user when it is in the process of speaking with the first user? Such resolutions require the knowledge of which modalities are and are not available as well as the alternatives available to the agent when a particular action request


is not executable. Sometimes, the agent won't be able to find acceptable alternatives to express particular conflicting behaviors. In these cases, the Action Scheduler must have the ability to decide which behaviors are of greater priority. It must then gracefully interrupt or carry out the appropriate behaviors. The Action Scheduler represents the final stage at which the agent can refine its desired behaviors, so it becomes crucial that the Action Scheduler contain the intelligence to do so properly and effectively.

2.2 Past Work

A number of past works provide insight regarding the approach of an effective Action Scheduler for a three-dimensional humanoid agent. While some works do not specifically target a humanoid conversational agent, all have contributions to the

development of a capable and interactive agent.

2.2.1 Ymir

The predominant humanoid conversational agent in research is Thórisson's Ymir

architecture, a design for a multi-modal humanoid agent. In it, the Action Scheduler

mediates between three layers-the reactive layer, the process control layer, and the content layer-which each submit behavior requests.

2.2.1.1 The Processing Layers

The reactive layer consists of those behaviors that occur with the utmost immediacy, such as turning the agent toward the user when the agent is addressed. This layer serves to produce the behaviors that require little or no state knowledge and therefore the least amount of computation. The quick turnover of a response enables the system to retain a level of reactivity that contributes to the believability of the agent's intelligence. While


the actual decisions issued by the Reactive Layer could contain large amounts of error, the fact that a response occurs takes precedence over whether the response is a correct one.

The Process Control Layer manages global control of the interaction. This includes elements of face-to-face interaction that are common through most conversations, such as knowing when to take the turn, when to listen because the user is speaking, what to do when the agent does not understand the user, and how to act-depending on the agent's understanding of the user-when the user is speaking to the agent. The Process Control Layer has a handle into the activity status, inactivity status, and thresholds of state variables that affect the Reactive Layer-an important functionality since the Reactive Layer does not have access to enough global information to operate appropriately on an independent basis.

The third and highest level layer-the Content Layer-handles the processes that understand the content of the user's input and generate appropriate responses accordingly. If the user requested to look at an object, the Content Layer would process the content of the request and possibly spawn a behavior such as "show object."

2.2.1.2 Behavior Selection

Behavior requests are serviced in an order based upon the layer from which the request originates. Received behavior requests are placed in a buffer that is periodically checked for contents. If the buffer contains requests, the Action Scheduler services all Reactive Layer requests first. Once there are no more Reactive Layer requests, the Action Scheduler services one Process Control Layer request. If there are no Reactive or Process Control Layer requests, it services a Content Layer request.
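The servicing order described above can be captured in a few lines; the sketch below assumes one request buffer per layer and drains all Reactive requests before taking a single Process Control or Content request per cycle. The buffer structure and request names are illustrative, not Ymir's actual code.

from collections import deque

# Request buffers keyed by originating layer. The policy follows the text:
# drain all Reactive requests, otherwise take one Process Control request,
# otherwise one Content request per scheduling cycle.
buffers = {"reactive": deque(), "process_control": deque(), "content": deque()}

def next_requests():
    """Pick the requests to service on this cycle."""
    if buffers["reactive"]:
        picked = list(buffers["reactive"])
        buffers["reactive"].clear()
        return picked
    if buffers["process_control"]:
        return [buffers["process_control"].popleft()]
    if buffers["content"]:
        return [buffers["content"].popleft()]
    return []

buffers["content"].append("DESCRIBE_OBJECT")
buffers["reactive"].append("TURN_TOWARD_USER")
print(next_requests())   # ['TURN_TOWARD_USER']
print(next_requests())   # ['DESCRIBE_OBJECT']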


Once a request is being fulfilled, it is ballistic in that behavior conflicts always result in the latter-the as-yet unexecuted behavior-requiring an alternative or being rejected. Speech execution is performed in ballistic units roughly the size of noun or verb phrases to accommodate a workable means of interrupting the agent mid-statement.

2.2.1.3 Proper Abstraction

The Action Scheduler in this design receives the request at the conversational phenomenon level and translates that phenomenon into body movements-or actions. This means that the Action Scheduler decides how to execute the particular behavior. A behavior such as PROVIDE_FEEDBACK, which would involve the agent giving backchannel feedback to the speaking user, would be resolved by the Action Scheduler into several possibilities such as SAY_OK or NOD_HEAD. The executable action would then be chosen from the generated possibilities.
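A sketch of this behavior-to-action resolution, assuming a hypothetical lookup table of candidate action sets and a simple availability check; Ymir's actual implementation is not reproduced here.

from typing import List, Optional, Set

# Illustrative mapping from a conversational phenomenon to candidate action sets.
BEHAVIOR_TO_ACTIONS = {
    "PROVIDE_FEEDBACK": [["SAY_OK"], ["NOD_HEAD"]],
    "ACKNOWLEDGE_USER": [["WAVE_RIGHT_HAND"], ["NOD_HEAD"]],
}

def resolve(behavior: str, available: Set[str]) -> Optional[List[str]]:
    """Return the first candidate action set whose actions are all available."""
    for candidate in BEHAVIOR_TO_ACTIONS.get(behavior, []):
        if all(action in available for action in candidate):
            return candidate
    return None

print(resolve("PROVIDE_FEEDBACK", available={"NOD_HEAD"}))   # ['NOD_HEAD']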

Giving such duties to the Action Scheduler makes it necessary to make it discourse-smart, a feature that should be abstracted away from that particular module. The complexities of humanoid conversational agent design involve a large enough set of tasks such that the Action Scheduler should be concerned solely with receiving explicit action requests and scheduling them, not resolving what a discourse-related behavior should entail. Furthermore, separating the conversational phenomena-deciding module from the action scheduling module does not incur any scope difficulties because the two tasks-behavior generation and action scheduling-are independently executable.

Ymir's Action Scheduler does not provide feedback to the rest of the architecture, following the belief that any necessary feedback needed by the other modules is embodied in the user's response. This approach seems problematic when considering the


value of the agent knowing whether or not a behavior, especially one containing information from the Content Layer, is successfully executed. If the agent plans to share information regarding a specific object, or a goal of the user, and such an attempt fails to execute, the knowledge of whether such a failure is due to internal-essentially invisible-or interactional shortcomings should inform the agent of whether the attempt should simply be made again or whether an alternate approach should be generated.

2.2.2 Animated Conversations

[Cassell et. al. 1994] introduces a system that automatically generates and animates conversations between multiple humanoid agents with synchronized speech, intonation, facial expressions, and hand gestures. The system generates the phenomenon of gesture coarticulation, in which two gestures within an utterance occur closely in time without intermediary relaxation of the hands. Namely, the retraction phase of the former gesture and the preparation phase of the latter gesture are replaced with a direct transfer from the former to the latter gesture.

For instance, if the agent plans to POINT_AT(X) and utter the phrase "I would buy this item" where X is the item it plans to point at precisely when it utters the word 'this', then the agent would execute a preparation phase during some time before the word 'this' is uttered and a retraction phase for some time after the word is uttered. If, however, the agent wishes to say "I would buy this item and sell that item," calling POINT_AT(X) and POINT_AT(Y) at the words 'this' and 'that', respectively, then it would appear awkward for the agent to execute a preparation phase and a retraction phase for both POINT_AT() calls. The execution normally performed by humans involves discarding the retraction phase of POINT_AT(X) and the preparation phase of


POINT_AT(Y), creating the effect of pointing to 'that' directly from having pointed at 'this'. Effectively this means that instead of the agent pointing to one object, withdrawing its hand, and pointing to the other object, the agent points to the first object, then shifts over to pointing to the other object directly. This becomes critical when implementing closely occurring gestures.

Animated Conversations accomplishes the effect of coarticulation by keeping a record of all pending gestures and tagging only the final gesture with a retraction phase. This can become problematic when applying coarticulation to a humanoid conversational agent, however, because it demands real-time generation of behaviors. Such an agent would require a way to quickly generate or alter gesture preparation and retraction phases

to say the least.

2.2.3 Multi-level Direction of Creatures

[Blumberg, Galyean 95] present a layered architecture approach to multi-level direction of autonomous creatures for real-time virtual environments.

2.2.3.1 A Three-Tier Architecture

Their architecture consists of a behavior (Behavior System), action (Motor Skill), and animation (Geometry) level, separated in between by a Controller (between Behavior and Motor) and a Degrees of Freedom (between Motor and Geometry) abstraction barrier. The Behavior System produces general behavior directives, while the Motor Skill details the actual actions possible and the Geometry specifies the lowest level of animation. The Controller maps the behaviors onto appropriate actions in the Motor Skill module, and the Degrees of Freedom constitute a resource manager, keeping track of what parts of the Geometry are busy animating.


For example, if the Behavior System produced a canine agent behavior of GREET, the Controller would resolve this behavior to tailor to the agent. In this case, it might resolve the behavior into actions such as WAG-TAIL and PANT, to be sent to the Motor Skill module. The Motor Skill module then sends the WAG-TAIL and PANT requests through the Degrees of Freedom module, which once again resolves the action request to tailor to the particular geometry. Such animation level directives as SWING-TAIL-CONE or OSCILLATE-TONGUE-OBJECT might be present in the Geometry to run the animation.
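The Controller's mapping role might be sketched as a per-creature lookup from behaviors to motor skills, as below; the tables only restate the examples from the text and are otherwise assumptions.

# Sketch of the Controller's role in the Blumberg/Galyean architecture:
# mapping a generic behavior onto motor skills suited to the particular
# creature. The mapping tables are illustrative assumptions.
CONTROLLERS = {
    "dog":    {"GREET": ["WAG_TAIL", "PANT"], "MOVE_TO": ["WALK_TO"]},
    "car":    {"MOVE_TO": ["DRIVE_TO"]},
    "wizard": {"MOVE_TO": ["TELEPORT_TO"]},
}

def to_motor_skills(creature: str, behavior: str) -> list:
    return CONTROLLERS.get(creature, {}).get(behavior, [])

print(to_motor_skills("dog", "GREET"))      # ['WAG_TAIL', 'PANT']
print(to_motor_skills("wizard", "MOVE_TO")) # ['TELEPORT_TO']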

2.2.3.2 Flexibility through Modularity

This architecture provides a good level of modularity in that the Controller and Motor Skill modules can be replaced to accommodate different sets of creatures. While a behavior request of MOVE-TO might evoke a WALK-TO from a dog, MOVE-TO might evoke a DRIVE-TO from a car or a TELEPORT-TO from a wizard creature. Similarly, the Degrees of Freedom and Geometry modules dictate the animation method, and conceivably could be replaced by any sort of animation method, given the degrees of freedom were mapped correctly.

2.2.3.3 Using DOFs

Motor Skills utilize DOFs to produce coordinated movement by declaring which DOFs are needed to perform the particular task. If the resources are available, it locks the needed DOFs and performs the particular task. Most Motor Skills are "spring-loaded" in

that they are given an "on" directive to activate and, once the "on" directives are no

longer requested, will begin to move back to some neutral position and turn off within a few time steps. This supports a sort of memory-less scheme in which only the length of


time to retain the "on" requests need be stored, as opposed to the details of an "off" command. For example, if a canine agent were to initiate a WAG-TAIL action, the DOF module would seize the proper resources to wag the tail (given they are available), and initiate the wagging of the tail. Once the

agent no longer wishes to wag its tail, the tail will gradually decrease its amplitude of

movement, eventually turning off completely within a few time steps.
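A sketch of the "spring-loaded" idea: a motor skill holds its amplitude while "on" requests keep arriving and decays toward neutral once they stop. The decay factor and cutoff threshold are illustrative assumptions.

# Amplitude is held while "on" requests arrive and decays toward a neutral
# value once they stop; below the threshold the skill is effectively "off".
class SpringLoadedSkill:
    def __init__(self, name: str, decay: float = 0.5, threshold: float = 0.05):
        self.name = name
        self.amplitude = 0.0
        self.decay = decay
        self.threshold = threshold

    def step(self, on_requested: bool) -> float:
        if on_requested:
            self.amplitude = 1.0          # full amplitude while requested
        else:
            self.amplitude *= self.decay  # ease back toward neutral
            if self.amplitude < self.threshold:
                self.amplitude = 0.0      # turned off after a few time steps
        return self.amplitude

wag = SpringLoadedSkill("WAG_TAIL")
print([round(wag.step(t < 3), 2) for t in range(8)])
# [1.0, 1.0, 1.0, 0.5, 0.25, 0.12, 0.06, 0.0]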

2.2.3.4 Similarities and Shortcomings

The DOF usage is similar to Thórisson's Ymir architecture in that resources are checked and locked to initiate actions. Furthermore, actions that conflict simply cannot interrupt ongoing actions in both systems, making them first-come-first-serve systems. Both seem successful at least at the primitive level of conflict resolution. Unfortunately, a humanoid conversational agent requires a more complex scheme to accommodate the insertion of proper accompanying actions into the speech stream, such as time-critical gestures and facial expressions. The system described by [Blumberg, Galyean, 95] is not designed to undertake lengthy speech and gesture combinations which would involve the transmission of key content. The majority of the conversational agent's actions will not be actions that decay and eventually turn off. They will require strict timing specifications, such as the synchronization of a beat gesture with an emphasized word.

The agent's behaviors and actions are not driven directly by basic competing needs so

they cannot be directly linked as such. While a canine agent in the [Blumberg, Galyean, 95] system may be able to perform decision-making through primitive desires such as


HUNGER, CURIOSITY, FATIGUE, or FEAR, the conversational agent must react to more complex driving factors which depend upon one another. If the conversational agent desires to sell a house and proceeds to relate information about the house to the user, competing behaviors that would cause the agent to switch to another task would make the behavior of the agent appear flighty or unrealistic, unless a more complex behavior resolution and execution architecture is implemented.

2.2.4 Improv

[Perlin, Goldberg 96] present a system called Improv which enables a developer to create real-time behavior-based animated actors. Improv builds upon the Blumberg/Galyean architecture.

2.2.4.1 Handling Competing Actions

With Improv, authors create lists of groups of designated competing actions. These groups contain sets of actions that are physically impossible to perform simultaneously,

such as waving one's arm to greet a friend while scratching one's head with that same

arm. When two action requests fall within the same group, conflict resolution occurs by way of a smoothly transitioning weight system. For example, if an agent is scratching its

head and then wishes to wave to a friend, it might stop scratching, wave for a bit, and then

resume scratching, all due to the smoothly transitioning weight system. At the time that

the agent waves to the friend, its WAVE weight exceeds its SCRATCH-HEAD weight.

However, as the waving continues, the value of the weight drops below the value of the suspended SCRATCH-HEAD action. This enables actions considered more "important"

by the author to take over conflicting ones, while allowing the conflicting ones to resume.
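The weight mechanism might be sketched as follows, with a triggered wave whose weight decays over time against a constant scratching weight; the curves and numbers are illustrative, not Improv's actual implementation.

# The action with the larger weight dominates; weights change smoothly over
# time so the suspended action can resume. Weight curves are assumptions.
def wave_weight(t: float) -> float:
    # Waving spikes when triggered at t=0 and fades over about two seconds.
    return max(0.0, 1.0 - t / 2.0)

SCRATCH_WEIGHT = 0.4   # constant desire to keep scratching the head

def active_action(t: float) -> str:
    return "WAVE" if wave_weight(t) > SCRATCH_WEIGHT else "SCRATCH_HEAD"

for t in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(t, active_action(t))
# WAVE dominates at first, then SCRATCH_HEAD resumes as the wave weight decays.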


2.2.4.2 Action Buffering

Improv also performs action buffering, which handles unrealistic transitions between actions. Since graphics typically employ a linear approach to rendering movement, some translations of position will be physically impossible for real humans due to obstructions. For example, moving from a "hands behind back" position to a "hands folded" position would entail an intermediary position of "hands at side relaxed" in a linear movement system. Improv allows authors to declare certain actions in a group to be a buffering action for another.

2.2.4.3 Shortcomings

While Improv handles the visual display of character actions well, it falls short of serving a humanoid conversational agent because of the simplicity of the behavior engine. The conversational agent requires a high level of depth to encompass the understanding of the conversation flow. Elements such as transmitted content and utterance planning cannot be handled by systems such as Improv, a scripting tool which doesn't have the processing power to handle heavy content-based decision-making.


Chapter 3

Statement of Problem

The task at hand involves scheduling actions that enable a three dimensional multi-modal humanoid conversational agent to converse with a user. The specific application involves the agent adopting the role of a real estate agent and showing the user through a virtual

house in an attempt to sell the house. However, the main focus of the agent's

functionality will be its ability to hold, as realistically as possible, a conversation with the user. Therefore, conversational behaviors and the actions that follow from them will be the primary focus. Such actions include movements of the head, eyes, eyebrows, and mouth for generating facial expressions and gaze behaviors; and movements of the arms

and hands for generating gesture.

The aforementioned and examined previous work contributes nicely to aspects that would be useful in the Action Scheduler for a humanoid conversational agent.

However, individually they each lack a portion of what needs to be integrated as a whole

for the agent, obviously because the studied works were not attempting the development of a humanoid conversational agent. The closest comparison, Ymir, focused upon multi-modality, a critical feature in generating a conversational agent because of the necessity


of a rich level of input resources. However, Ymir's Action Scheduler contains a number of elements that could be improved upon, such as the division of its tasks into two clearer modules, separating the generation of behaviors and the execution method of those behaviors. Another suspect assumption that Ymir makes is that the feedback loop need only lie in the data retrieved from sensing the user. While humans do indeed receive much feedback from others in the conversation, there are a number of realities in the implementation of a software agent that call for the need for sender notification when an action completes or executes. We are at least subconsciously aware of signals being sent to observers, such as signaling the desire to speak, and their meaning and significance or we would not engage in them. For this reason, the humanoid conversational agent should have the capability to be aware of the execution of actions it performs itself.

My goal is to design an architecture for the Action Scheduler of a humanoid conversational agent which will encompass all or as many of those features that are relevant in the reviewed works. It will improve upon the Ymir architecture's design of the Action Scheduler and attempt to incorporate features such as gesture coarticulation, action buffering, and interruptions of actions and/or speech.


Chapter 4

Architecture

The Action Scheduler must be considered a module closely knit to the Animation Engine-the module lying at its output-of the humanoid conversational agent. While the two modules should be abstracted away from one another, it is important for the Action Scheduler to understand the limitations of the Animation Engine and the functionality provided by it. For this reason, the architecture of the Action Scheduler includes both the Action Scheduler and the Animation Engine.

On the input end of the Action Scheduler lies a Generation Module which will resolve appropriate discourse phenomena into explicit actions and send the action requests to the Action Scheduler, thus relieving the Action Scheduler of the task of determining the execution of various discourse phenomena.

Dialogue will be planned as a behavior by the Generation Module, scheduled and time-synchronized by the Action Scheduler, and initiated by the Animation Engine.


4.1 Architectural Overview

The model for the architecture of the Action Scheduler comprises a slightly redefined and extended version of [Blumberg 95]. An abstraction module, Degrees-of-Freedom (DOF), separates the Action Scheduler from the Animation Engine, enabling notifications to the Action Scheduler regarding the status of the geometry, controlled explicitly by the Animation Engine. As requests enter the Action Scheduler, it attempts to schedule them on an immediate basis, consulting the DOF module for current availability of motors. Each request may carry with it a set of independent behavior requests that have been broken down, by the generation module, into single or concurrent sets of actions. Each particular broken-down behavior request can have a number of alternative actions or action sets that give the Action Scheduler some level of flexibility in the event that conflicts occur. Should no alternatives pass the check without a conflict, the request for that particular behavior will be discarded or queued until resources are freed.
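The scheduling policy just described can be summarized in a short sketch: each request carries alternative action sets, the scheduler executes the first alternative whose degrees of freedom are free, and otherwise queues or discards the request. The action-to-DOF table and request format below are assumptions for illustration.

from collections import deque

busy_dofs: set = set()
pending: deque = deque()

def dofs_for(action: str) -> set:
    # Hypothetical action-to-DOF table.
    return {"WAVE": {"right_arm"}, "NOD": {"head"}, "SMILE": {"mouth"}}.get(action, set())

def schedule(request: dict) -> str:
    for alternative in request["alternatives"]:
        needed = set().union(*(dofs_for(a) for a in alternative))
        if not needed & busy_dofs:
            busy_dofs.update(needed)          # lock the motors for this action set
            return f"executing {alternative}"
    if request.get("queue", False):
        pending.append(request)               # retry once resources are freed
        return "queued"
    return "discarded"

busy_dofs.add("right_arm")                     # e.g. the arm is already mid-gesture
print(schedule({"alternatives": [["WAVE"], ["NOD"]], "queue": False}))  # executing ['NOD']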

4.2 Enabling Speech Planning

To this point, the architecture appears to be very similar to that of [Blumberg 95]. However, the agent must also have the ability to plan long sets of utterances to converse with the user. To this end, requests contain a QUEUE status bit that determines whether the request is queued upon conflict, or whether it is discarded upon the depletion of alternatives. This enables sets of utterances to be sent one after the other and executed in line regardless of motor conflicts. Supplementing this feature is the ability of the request sender to ask for an acknowledgement of execution or of completion as a part of the action request. This would enable other modules to receive a notification of executing or completed behaviors, useful in the event that a message is lost or never completes. Such


information would be useful on a content level where the agent should have knowledge of whether or not it shared specific bits of data with the user.

4.3 Enabling Coarticulation

The Animation Engine produces the graphics by moving motors to a target position regardless of the current position of the motors. This enables performed actions to be flexible to a degree that they can start where they left off and not suffer from the restriction of always having to be a particular strict start-to-end specification. In the execution of utterances embedded with gestures, retraction phases can be tagged as such, and discarded to make way for the preparation phase of a following gesture if the two occur closely.

4.4 Input Overview

The input to the Action Scheduler takes the form of a frame that is used, as well, in all message passing for the other modules. A message frame contains several details such as time sent, sender, receiver, and input received from agent sensors. As modules receive incoming frames, they parse the relevant information and generate a frame to pass on to the next module, placing within it information pertinent to the receiving module, whether the information is passed from the first frame or generated within the sending module. For instance, the Generation Module, which sends the Action Scheduler's input, might receive information regarding the agent's surroundings. The module would then generate a feasible set of reactions for the agent, package that data with the relevant information it received in the first frame, and send it off to the Action Scheduler. The frame format would consist of sets of types, keys, and values in an embedded format such as the following:


(type :key1 value1 :key2 value2 ...)

where a value could hold another mini-frame of the same format. This holds the possibility for an endless set of embedded frames. Once a module receives a frame, it can parse the frame and query the names of the types as well as check the values associated with a particular key. An example of a frame sent to the Action Scheduler is the following:

(behavior :queued false
          :content (choice :options [ [ (eyes :state looktowards)
                                        (head :trans nod) ]
                                      [ (face :state smile) ] ]))

This frame would be enclosed within a frame that would contain other information such as the time the frame is sent, the sender, the intended recipient, and the duration for which the frame is valid.
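For illustration, a small reader for this nested notation might look like the following; it treats parentheses as typed frames, ":key value" pairs as slots, and square brackets as lists of alternatives. It is a hypothetical parser, not the system's actual message-passing code.

import re

TOKEN = re.compile(r"[()\[\]]|[^\s()\[\]]+")

def parse(text: str):
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            frame = {"type": tokens[pos]}   # first symbol after "(" names the frame type
            pos += 1
            while tokens[pos] != ")":
                key = tokens[pos].lstrip(":")
                pos += 1
                frame[key] = read()
            pos += 1                        # consume ")"
            return frame
        if tok == "[":
            items = []
            while tokens[pos] != "]":
                items.append(read())
            pos += 1                        # consume "]"
            return items
        return tok                          # bare atom (symbol or value)

    return read()

frame = parse("(behavior :queued false :content (choice :options "
              "[ [ (eyes :state looktowards) (head :trans nod) ] [ (face :state smile) ] ]))")
print(frame["content"]["options"][1])       # [{'type': 'face', 'state': 'smile'}]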

4.5 Similarities and Differences to Ymir

The proposed architecture for an Action Scheduler/Animation Engine is similar to Ymir in a number of ways. Both utilize the upkeep of joint status to notify of conflicts. Both also serve requests on a similar basis, using a first-come-first-serve policy. Ymir does not, however, use any conflict resolution policies other than denying the latter request. The proposed architecture enables actions to be queued for later execution, should the modality currently be busy.

Ymir places a great deal of responsibility upon the Action Scheduler in terms of resolving the behavior request from a level of intended effect to that of physical action. While Ymir gives the Action Scheduler the responsibility of deciding the proper actions to elicit a behavior such as ACKNOWLEDGEMENT, the proposed architecture places that discourse-related material outside of the Action Scheduler, reasoning that action scheduling should be the main focus, especially when the complexity is expected to grow


to large proportions. Another feature present in the proposed architecture and absent in Ymir is the functionality of combining the speech stream and the action requests into one, effectively enabling the timing of actions to be paired with the timing of words within the utterance. Overall, the proposed architecture attempts to follow the model that was created by Ymir, but improve it in ways that can accommodate higher demands from humanoid conversational agents in the future.


Chapter 5

Implementation

The architecture for the Action Scheduler is being implemented in a project on multi-modal conversation in a humanoid agent at the MIT Media Lab. The Action Scheduler and Animation Engine represent the end modules of the agent design, receiving the action requests and carrying the responsibility of formatting them into a realistically behaving

agent.

5.1 REA: a Multi-Modal Humanoid Conversational Agent

REA the Real Estate Agent has the goal of selling the user a house. To this end, she

performs, with the user, a virtual walkthrough of a 3D representation of the targeted house. Her interaction with the user consists of a mixed-initiative conversation where Rea will service the user's requests, answer questions, make suggestions, and engage in small talk. Rea's design involves five main modules apart from the Action Scheduler and Animation Engine. The Input Manager lies at the back end of the agent design. It collects data from various sources of input (vision cameras, speech input), and formats it so that it can be understood at the next step. The Understanding Module examines the


data collected by the Input Manager and attempts to glean a higher level understanding of the implications of the perceived data and the intentions of the user. The Reactive Module generates appropriate reactive actions for the humanoid agent and passes the request along to the Generation Module, which generates appropriate propositional behavior for Rea, such as utterance production. The Generation Module then passes all action requests to the Action Scheduler, which then interfaces with the Animation Engine to produce realistic behavior in the 3D humanoid agent.

Rea occupies four computers. Two SGI O2s are utilized wholly to run the STIVE vision system. A Hewlett-Packard is used to run speech recognition software. One SGI Octane runs the TrueTalk speech generation software and the Action Scheduler and Animation Engine which performs the graphical rendering.

5.2 A Flexible Animation Engine

The Animation Engine is developed with the TGS Open Inventor 3D Toolkit. Inventor's scenegraph representation of a graphical scene enables the Animation Engine to specify a rough skeleton of nodes which, if followed by a character developer using a higher level 3D package such as 3D Studio Max, allows VRML characters to be read in and assembled by the Animation Engine. This allows artistic specialists to avoid the deep coding complexities when they want to vary the physical characteristics of the characters.

Each node within the specified skeleton is then attached to a number of Inventor engines which drive the rotation and translation along appropriate independent axes by updating appropriate values in the graphical fields. A Ramp object is then attached to each engine and utilized to provide the change of values over time that will


initiate the update of the fields. This variance of output values over time creates the movement in the humanoid agent.

The Ramp object also performs a mapping of a range of input values from 0 to 1 to the appropriate possible range of movement of the agent's joints. This means that a developer wishing to modify the physical flexibility of the character's joints need only alter the mapping within the ramp class. Once this is done, functions to drive the ramp through the appropriate angles or distances can be inputted with a range from 0 to 1. This

method allows the complexities of angle joints in particular to be abstracted away from

an action function designer. Instead of concerning him or herself with remembering that the shoulder joint ranges from -3.14/2 to 3.14/4 and the elbow from 0 to 3.14/2, he or she need only note the 0 and 1 points of the angle extremities and interpolate the desired

values.
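A sketch of the 0-to-1 convention, reusing the shoulder and elbow ranges mentioned above; the clamping and linear interpolation are assumptions about how such a mapping might be implemented.

import math

# Per-joint mapping (the role the Ramp object plays here) from a normalized
# drive value to the joint's physical angle range. Ranges follow the text.
JOINT_RANGES = {
    "shoulder": (-math.pi / 2, math.pi / 4),
    "elbow":    (0.0, math.pi / 2),
}

def to_angle(joint: str, value: float) -> float:
    """Map a normalized value in [0, 1] onto the joint's angle range (radians)."""
    lo, hi = JOINT_RANGES[joint]
    value = min(max(value, 0.0), 1.0)     # clamp out-of-range requests
    return lo + value * (hi - lo)

print(round(to_angle("shoulder", 0.5), 3))  # midpoint of the shoulder's range
print(round(to_angle("elbow", 1.0), 3))     # fully bent elbow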

Currently, the Animation Engine is capable of executing actions which assist in floor management between the agent and the user. These actions include eyebrow movement (raising and lowering), head movement (turning with six degrees of freedom to achieve such actions as looking toward, looking away, nodding and shaking), eye movement (looking up, down, left, and right), eyelid movement (blinking), and mouth movement (speaking and not speaking).

5.3 The Action Scheduler

The Action Scheduler for Rea receives action requests by way of message frames sent over a Local Area Network and placed inside a buffer. Each message frame contains a behavior request or set of behavior requests, each of which can be served independently but are packaged together for simplicity. Each behavior request consists of a behavior or


a set of possible behaviors (a behavior and its alternatives), each of which has been decoded from the conversational phenomenon level to the action level for the Action Scheduler. Each behavior request can decode from one behavior to the execution of many actions. For instance, the Generation Module may wish to execute a GREET behavior. It would decode this to a few possibilities, each of which might consist of more than one action simultaneously. One example of a decoding might be DO(SAY("hi"), NOD, WAVE). If that conflicts, an alternative might be DO(SMILE, NOD). The Action Scheduler examines each behavior request and checks for joint conflicts (conflict within the geometry of the agent), checking for alternative actions upon the instance of a conflict. If the Action Scheduler concludes that there are no non-conflicting action sets then it discards the request. If the request's QUEUE flag has been set, then it saves the

request for later execution once the joint conflicts are freed.
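As an illustration of the request format this implies, the sketch below shows a decoded GREET request with a preferred action set and a fallback, and selects the first alternative whose joints are free; the structure and joint table are assumptions, not Rea's actual data format.

# One request, several alternative action sets, each already at the action level.
greet_request = {
    "queue": False,
    "alternatives": [
        [("SAY", "hi"), ("NOD", None), ("WAVE", "right_arm")],  # preferred decoding
        [("SMILE", None), ("NOD", None)],                       # fallback if the arm is busy
    ],
}

def joints_needed(action_set) -> set:
    JOINTS = {"SAY": {"mouth"}, "NOD": {"head"}, "WAVE": {"right_arm"}, "SMILE": {"mouth"}}
    return set().union(*(JOINTS[name] for name, _ in action_set))

busy = {"right_arm"}                       # e.g. the arm is mid-gesture
chosen = next((a for a in greet_request["alternatives"]
               if not joints_needed(a) & busy), None)
print(chosen)   # falls back to [('SMILE', None), ('NOD', None)]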

5.4 Speech Generation

The Generation Module, which sends all of the Action Scheduler's input, generates the

speech for the agent. This includes pitch specifications and embedded gestures as well as

the words in the speech. The Action Scheduler properly formats the speech request and passes it to the Animation Engine, which performs the final call to the speech tool, TrueTalk. If the Generation Module then requests an interruption to the speech, the

Action Scheduler can flag the Animation Engine in much the same way as it sends a

speech directive, causing the agent to end the utterance abruptly, but gracefully. TrueTalk handles the graceful interruption of the conversational agent.

TrueTalk allows the insertion of callbacks within the speech instruction. The agent utilizes these callbacks to notify other modules of the completion or initiation of


certain actions, for example propositional actions which concern what knowledge the agent has of what she has said or not said. This notification becomes useful to modules concerned with the content level of the agent's output. If the agent has failed to execute a content-related action, such as mentioning the cost of the house, the content generating module should be aware of the failure to pass that information to the user and avoid such fallacies as asking whether the cost is too high, for instance. The callbacks can also be used to initiate gesture executions at key points within the speech, such as when the agent wishes to point at an object precisely upon uttering a particular word.
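A generic sketch of word-synchronized callbacks: gestures are registered against word positions in the utterance, and a callback fired per spoken word triggers the matching gesture and records what has actually been said. This does not use TrueTalk's real API; it only simulates the mechanism.

# Gestures keyed by word index; a callback fired as each word is spoken
# triggers the matching gesture and logs the word for content-level modules.
utterance = "I would buy this item".split()
gesture_at_word = {3: "POINT_AT(item)"}    # fire the deictic on the word "this"

completed = []

def on_word_spoken(index: int, word: str) -> None:
    if index in gesture_at_word:
        print(f"firing {gesture_at_word[index]} on '{word}'")
    completed.append(word)                 # record of what was actually said

for i, w in enumerate(utterance):          # stand-in for the speech engine's callback loop
    on_word_spoken(i, w)

print("spoken so far:", " ".join(completed))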


Chapter 6

Evaluation

6.1 What Worked

Rea is a work in progress, so her design has not had the opportunity to be extensively tested. However, much of the back-end development of the Action Scheduler and the Animation Engine, and a modicum of testing on the integration of all of her modules, has been completed to a sufficient degree to make a number of observations.

The development of the Inventor engines to manipulate the joint values along with the creation of Ramp classes to move the joint values through a given range over a given time greatly facilitated the generation of action procedures. Using a convention of ranging the possible joint values from 0 to 1 allowed quick intuitive estimations of the proper angles needed to produce the desired effects. This allowed a careful developer to measure the proper conversions necessary to embed in each particular engine to create a physically realistic agent in terms of flexibility. The generation of action procedures to add to the knowledge base of the character quickly followed with ease.

The ramp/engine combination also enables the system to constantly have an awareness of the values of Rea's joint fields. This produces two benefits. First, the


Action Scheduler can query the joints and know to a reasonable degree where they are. Second, movements are performed from current positions as opposed to pre-calculated start positions. This scheme of designating only final positions instead of both final and initial positions allows a much greater level of flexibility in the movement specifications.

Rea responds with a great level of immediacy. When there is no user present in

the user space, the agent will look to the side or glance away, appearing to be in a

non-attentive state. Once a user enters the user space and faces Rea, the agent receives a

signal carried from the input modules through to the Action Scheduler and immediately gazes upon the user expectantly. While Rea's development has not reached a state where she can understand language, she is able to understand when the user is or is not speaking. This ability combined with the ability to know whether the user is attending to her or not enables Rea to engage in turn-taking behavior. When the user completes an utterance (which involves the stopping of gesture movement as well), Rea initiates an utterance, only continuing as long as the user appears to be attending. If the user indicates that he or she wishes to speak, Rea will gracefully cease speaking and politely wait for the user to complete the turn. Rea's ability to exhibit a sensitivity to the flow of the conversation and react with reasonable immediacy is due largely to the Action Scheduler's ability to interrupt executing actions at will.

6.2 What Didn't Work

We found that high-end 3D packages such as 3D Studio Max and Softimage do not have VRML porting tools that make a fully complete conversion from a native scene to VRML. While our intention was that developers could create highly artistic characters with these high-end packages and then convert them to VRML characters which would


be used by the Animation Engine, the conversion process was fraught with errors and annoyances. Problems such as mysteriously inverted normals, displaced objects, and untranslated groupings in the hierarchy made the process extremely cumbersome. A complete conversion tool would greatly facilitate the combination of high-level graphics

and real-time graphical animation in the future.

Initial attempts to generate callback functionality with the speech tool, TrueTalk, failed. Given that speech generation technology still doesn't provide much in the way of precise timing and fast generation, further problems are expected as the development of Rea continues. Speech recognition, as well, remains an area that takes a noticeable amount of processing, so the Action Scheduler's ability to serve the proper reactive behaviors given the lag between the user's action and understanding of that action is expected to suffer.

6.3 A Comparison

The Action Scheduler/Animation Engine draws from the contributions of past works to various elements in conversation. The gesture coarticulation of Animated Conversations, action buffering of Improv, and the three-tier architecture of the Blumberg system are useful features to make the agent more realistic. Thórisson's Ymir architecture is the closest predecessor to the proposed design for a humanoid conversational agent. The proposed system has some structural similarities to Ymir such as the use of joints to notify the existence of conflicts. However, several differences exist as well. The manner in which the two Action Schedulers choose incoming requests differs. While Ymir selects particular types of behaviors to service first, and then resolves what those behaviors mean on the action level, Rea's scheduler is not designed to need to understand


the distinctions between those types of behaviors, nor does it need to understand the mapping from behavior to action. This makes the Action Scheduler cleaner because it does not need any discourse knowledge. Another distinction is the handling of speech/action integration. Rea's Action Scheduler can combine the speech annotation with word oriented callbacks to fire actions at the occurrence of specific words. Ymir,

however, does not connect the two calls to speech and action. While Rea's and Ymir's

action scheduler architectures have similarities between them, Rea enjoys the ability to move to the level of a more complex agent, since the multitude of discourse behavior resolution tasks are abstracted away to a generation module.


Chapter 7

Conclusions and Future Research

7.1 The Validity of the Action Scheduler

While the use of an Action Scheduler module seems to be successful in progressing the performance of humanoid agents, it still remains in question whether the current paradigm is a model that will fit further development in the future. It must be noted that while the Action Scheduler does behave as a sort of cerebellum as observed by [Thórisson 96], it should not necessarily be selecting and excluding actions simply on an availability basis. Humans do not discard actions simply and solely because the modality

happens to be busy at the moment. Currently it seems sufficient to use a scheme in which

the Action Scheduler discards requests that exhaust all possible alternatives, while allowing the possibility to perform an absolute interruption. However, a more accurate model of human decision making would require more subtle distinctions between actions. Perhaps the use of prioritization levels would be more appropriate, but then the task would require a global method of prioritizing behaviors.


7.2 Expansions

The fact that our priorities shift depending upon several factors implies that a more complex scheme for behavior selection needs to be implemented in the Action Scheduler or that the selection process needs to be abstracted completely away from that module and placed in its input module. While the Action Scheduler may hold promise for developing more effective humanoid agents, key tasks such as conflict resolution need to be handled more cleverly. The current scheme of first-come-first-serve cannot accommodate extensively growing demands on conversational agents. Such a scheme greatly limits the potential intelligence that the agent can exhibit if development undertakes an effort to correctly represent human cognition. The alternative, the creation of the perception of intelligence, achieved in systems that aim to create actors (such as Improv), undoubtedly will experience a ceiling effect on the potential abilities of the agent. Conflict resolution should be driven by the intentions and needs of the agent.


Bibliography

[Blumberg, Galyean, 1995] B. Blumberg, T. Galyean. "Multi-level Direction of Autonomous Creatures for Real-Time Virtual Environments". Computer Graphics (SIGGRAPH '95 Proceedings), 30(3):47-54, 1995.

[Cassell et. al. 1994] J. Cassell, C. Pelachaud, N. I. Badler, M. Steedman, B. Achorn, T. Becket, B. Douville, S. Prevost, M. Stone. "Animated Conversation: Rule-based Generation of Facial Expression, Gesture and Spoken Intonation for Multiple Conversational Agents". SIGGRAPH '94, Orlando, USA, 1994.

[Perlin, Goldberg, 1996] K. Perlin and A. Goldberg. "Improv: A System for Scripting Interactive Actors in Virtual Worlds". Media Research Laboratory, New York University.

[Thórisson 1996] K. R. Thórisson. Communicative Humanoids: A Computational Model of Psychosocial Dialogue Skills. Ph.D. Thesis, Media Arts and Sciences, MIT Media Laboratory.
