
communication is to ensure that sentences make sense and that this sense can be extracted from utterances. The study of context-independent meaning starts at the semantic level: it analyses not only what words mean, but also how those meanings combine within sentences.

2.2.3.7. The Pragmatic Level

Pragmatics groups all the context-dependent information in communication.

Indeed, utterances sometimes implicitly refer to underlying information that the other participants in a conversation are supposed to know. This underlying information can be the context itself, but also something more general called ‘ground knowledge’, which includes everything that people from the same culture are supposed to know: their background and their common knowledge.

Sometimes, the pragmatic level is divided into three sub-levels [Allen, 1987]:

• Pure pragmatic level: the study of the different meanings that a single sentence can convey when uttered in different contexts.

• Discourse level: concerns how the immediately preceding sentences affect the interpretation of the next one. It can be helpful for resolving anaphora and other language ambiguities (see the sketch after this list).

• World knowledge level: sometimes referred to as ‘ground knowledge’ as said before; it includes everything people know about the world (milk is white, for example), but also what a conversation participant knows about the other participants’ beliefs and goals.
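To make the discourse level more concrete, the following minimal sketch keeps track of the entities mentioned in the preceding discourse and resolves a pronoun to the most recently mentioned one. The class and its resolution strategy are purely illustrative assumptions, not a component of any system described here.

```python
from typing import Optional

# Minimal sketch of a discourse-level context for pronoun resolution: the most
# recently mentioned entity is taken as the most salient referent. Purely
# illustrative; real systems use far richer salience and agreement models.
class DiscourseContext:
    def __init__(self) -> None:
        self.salient_entities = []  # entities mentioned so far, most recent last

    def mention(self, entity: str) -> None:
        """Record an entity mentioned in the preceding discourse."""
        self.salient_entities.append(entity)

    def resolve_pronoun(self, pronoun: str) -> Optional[str]:
        """Resolve a pronoun to the most salient (last mentioned) entity."""
        return self.salient_entities[-1] if self.salient_entities else None

context = DiscourseContext()
context.mention("the printer")          # "The printer is offline."
print(context.resolve_pronoun("it"))    # -> "the printer"   ("Restart it.")
```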

2.2.4. General Architecture of a Dialogue System

Human-human dialogues are generally multimodal in the sense that when a person engages in a conversation with another, he/she integrates information coming from all his/her senses, combined with his/her background knowledge, to understand his/her interlocutor. Multiple means are also used to communicate thoughts, such as gestures, drawings, etc. Sometimes the pieces of information coming from different sources are complementary, sometimes they are redundant.

Naturally, research in the field of multimodal systems is currently conducted with the purpose of building systems that interact with human users through several means of communication, such as gestures [Quek et al, 2002], lip reading [Dupont & Luettin, 2000], humming [Ghias et al, 1995], etc.

Multimodality is used either for ergonomics, allowing users to introduce complementary information by different means (gesture, humming), or for robustness, using redundant information to improve performance (lip reading).

The general architecture of a multimodal dialogue system is depicted in Fig. 2.

It shows the typical flow of information going from the user to the system through any kind of modality, such as gesture, speech, keyboard entry, mouse movements, etc. The information is processed and then passed to a subsystem that manages the interaction and has to take decisions about what to do next. According to this decision, new information is transmitted back to the user. Inputs and outputs are processed by a set of subsystems that are described more precisely in the following.

[Figure 2 appears here. Recoverable labels: User; Environmental noise; Input Acquisition & Processing; Fusion & User’s Goal Inferring; Interaction Management; World Knowledge; Conceptual Feedback Generation; Output Generation; information flows from the low-level signal up to the semantic and pragmatic levels.]

Fig. 2: Typical architecture of a multimodal dialogue system
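As a rough illustration of this flow, the sketch below chains placeholder stubs for the five subsystems of Fig. 2. Every function name and data structure is a hypothetical stand-in for a far more complex component, not an actual implementation.

```python
# Illustrative sketch of one dialogue turn flowing through the subsystems of
# Fig. 2. All function and field names are hypothetical placeholders.

def acquire_and_process(raw_inputs):
    # Signal level: capture, denoise and extract features for each modality.
    return {modality: f"features({signal})" for modality, signal in raw_inputs.items()}

def fuse_and_infer_goal(features):
    # Semantic level: fuse the modalities and infer the user's goal.
    return {"intent": "get_weather", "evidence": features}

def manage_interaction(user_goal, world_knowledge, state):
    # Pragmatic level: decide the next system action given goal and context.
    action = {"act": "inform", "topic": user_goal["intent"]}
    return action, state + [action]

def generate_conceptual_feedback(action):
    # Planning: expand the chosen action into concepts to convey.
    return [("inform", action["topic"]), ("ask", "anything_else")]

def generate_outputs(concepts):
    # Surface realisation: express the concepts through speech, screen, etc.
    return " ".join(f"<{act}:{arg}>" for act, arg in concepts)

def dialogue_turn(raw_inputs, world_knowledge, state):
    features = acquire_and_process(raw_inputs)
    user_goal = fuse_and_infer_goal(features)
    action, state = manage_interaction(user_goal, world_knowledge, state)
    concepts = generate_conceptual_feedback(action)
    return generate_outputs(concepts), state

reply, state = dialogue_turn({"speech": "what is the weather?", "gesture": "pointing"},
                             world_knowledge={}, state=[])
print(reply)  # "<inform:get_weather> <ask:anything_else>"
```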

Inputs and outputs of such a system are generally perturbed by environmental noise. Noise can be of several kinds, depending on the type of information it affects and on the way this information is conveyed from the user to the acquisition subsystem, or from the system back to the user. It can of course be additive background sound in the case of audio inputs, or convolutive noise in the case of signal transmission over a poor-quality channel. Other types of noise can be imagined: for example, in the case of gesture inputs captured as a video stream by a distant camera, any event happening around the user can be mixed up with the useful information and is also considered as noise.
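As an illustration, and assuming NumPy is available, the snippet below simulates the two classical perturbations mentioned above on a sampled audio signal: additive background noise and convolutive channel noise. The signal and channel values are invented for the example.

```python
import numpy as np

# Illustrative noise models for an audio input (assumes NumPy).
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440 * t)                 # one second of a 440 Hz tone

# Additive noise: background sounds simply add to the useful signal.
noisy_additive = clean + 0.1 * rng.standard_normal(clean.shape)

# Convolutive noise: a poor-quality channel acts as a filter applied
# (convolved) to the signal; here a crude 3-tap echo-like impulse response.
channel = np.array([1.0, 0.0, 0.4])
noisy_convolutive = np.convolve(clean, channel, mode="same")
```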

2.2.4.1. Input Acquisition & Processing

Just as human beings collect information from the outside world using their senses, input signals first have to be captured and transformed by sensors before they can be processed by artificial systems. Sensors can be of different kinds, such as sets of cameras, microphone arrays, touch pads, joysticks, keyboards, haptic data gloves, etc. These devices usually first convert analogue signals into digital signals that can be processed by computers or any DSP unit. The subsystem in charge of the acquisition of input signals is generally responsible for their pre-processing as well; denoising and feature extraction are typical steps of this signal pre-processing.

The feature extraction operation aims to reduce the amount of data to be processed by the subsequent subsystems. A full video stream, for example, is not only too large an amount of data for a dialogue system to process, but is in general totally inadequate by itself if it must be used for gesture recognition. In practice, only some characteristic points and their movements are useful. These relevant points and the information about their movements are in most cases extracted from the video stream and constitute the features that are passed to the rest of the system.
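The following toy example (assuming NumPy; the chosen features are purely illustrative) shows how drastically a frame can be reduced: a whole grayscale frame is summarised by the centroid and spread of its salient region, and the evolution of these values over successive frames describes the movement.

```python
import numpy as np

# Toy feature extraction: each grayscale video frame (an H x W array) is
# summarised by a few characteristic values instead of its raw pixels.
def frame_features(frame: np.ndarray) -> np.ndarray:
    ys, xs = np.nonzero(frame > frame.mean())        # roughly "salient" pixels
    if xs.size == 0:
        return np.zeros(4)
    # Centroid and spread of the salient region act as crude gesture features.
    return np.array([xs.mean(), ys.mean(), xs.std(), ys.std()])

frame = np.random.default_rng(1).integers(0, 256, size=(480, 640))
print(frame_features(frame))   # 307 200 pixel values reduced to 4 features
```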

The denoising process can occur before or after the feature extraction process, or even both. In the case of acoustic signals, it is sometimes interesting to apply denoising techniques to the original signal in order to improve the accuracy of the extracted features. In the case of gesture recognition, noise introduced by movements around the user can be suppressed by foreground-background segmentation, a process that can be applied both to the original signal and to the features.
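A very crude sketch of such a segmentation, by simple frame differencing against a slowly adapted background model, could look as follows; the threshold and update rate are arbitrary illustrative values.

```python
import numpy as np

# Crude foreground-background segmentation by frame differencing: pixels that
# barely differ from a background estimate are attributed to the static scene
# around the user and suppressed. Thresholds and names are illustrative.
def foreground_mask(frame: np.ndarray, background: np.ndarray,
                    threshold: float = 25.0) -> np.ndarray:
    return np.abs(frame.astype(float) - background.astype(float)) > threshold

def update_background(background: np.ndarray, frame: np.ndarray,
                      rate: float = 0.05) -> np.ndarray:
    # Slowly adapt the background model to lighting changes.
    return (1.0 - rate) * background + rate * frame
```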

2.2.4.2. Fusion and User’s Goal Inferring

Once proper features are extracted from the original input signal, they are passed to a second subsystem that aims at obtaining meanings (semantic level) and inferring the user’s actions and intentions.

As said before, the multi-sensory input data can be redundant or complementary. This means that there might exist dependencies among sets of data and those dependencies may be time-varying. Although some multimodal systems use very primitive data-fusion techniques, the most efficient systems use complex statistical models to infer the user’s goal.

Bayesian Networks (BN), also called Bayesian Belief Networks, Dynamic Bayesian Networks (DBN) [Pearl, 1988] and Influence Diagrams [Howard & Matheson, 1984] are general statistical frameworks often used in multimodal systems [Wachsmuth & Sagerer, 2002], [Garg et al, 2000] because of their ability to learn dependencies among sets of input data. Kalman filters and HMMs are particular cases of DBNs, and recent research introduced the concept of Asynchronous HMMs [Bengio, 2003] for audio-visual signal processing.
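The sketch below is not the full BN/DBN machinery cited above but a minimal naive Bayes combination; it only illustrates the underlying idea of fusing per-modality evidence probabilistically. All goals and probabilities are invented for the example.

```python
import numpy as np

# Minimal probabilistic fusion of two modalities: assuming the modalities are
# conditionally independent given the goal, the posterior over goals is
# proportional to the prior times the per-modality likelihoods.
goals = ["open_map", "play_music"]                      # hypothetical goals
prior = np.array([0.5, 0.5])
p_speech_given_goal = np.array([0.7, 0.2])              # P(speech features | goal)
p_gesture_given_goal = np.array([0.6, 0.1])             # P(gesture features | goal)

posterior = prior * p_speech_given_goal * p_gesture_given_goal
posterior /= posterior.sum()
print(dict(zip(goals, posterior)))   # the pointing gesture reinforces "open_map"
```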

2.2.4.3. Interaction Manager

The Interaction Manager controls the entire behaviour of the system. After having understood the user’s intention, the system has to take a decision about what to do next. This module coordinates the interaction with the user, but also with the rest of its environment. Indeed, the system can interact with databases, for example, to retrieve information needed either by the Interaction Manager itself in order to continue its exchange with the user, or by the user. One can also imagine that the Interaction Manager interacts with other devices in order to collect information concerning everything but the user (such as the external temperature, the time of day, etc.) that can be helpful for deciding on the next action to perform. All those possible external devices providing extra information (pragmatic level) are grouped in what was called the ‘World Knowledge’ in Fig. 2.

The reaction of the system to the user’s inputs depends not only on the inferred user’s intention but, of course, on the actual task. The Interaction Manager is thus the most task-dependent part of the system and is rarely reusable. The challenges of designing an Interaction Manager, particularly in the case of an SDS, will be discussed later.
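As a purely illustrative skeleton (the task, slots and confidence threshold are invented), a rule-based Interaction Manager could map the inferred goal and the available world knowledge to the next action as follows.

```python
# Skeleton of a (highly task-dependent) rule-based Interaction Manager: it
# maps the inferred user goal and the available world knowledge to the next
# system action. All names, slots and thresholds are hypothetical.
def next_action(user_goal: dict, world_knowledge: dict) -> dict:
    if user_goal.get("confidence", 0.0) < 0.5:
        return {"act": "confirm", "slot": user_goal.get("intent")}
    if user_goal.get("intent") == "get_temperature":
        value = world_knowledge.get("external_temperature")
        if value is None:
            # Missing knowledge: interact with an external device first.
            return {"act": "query_sensor", "sensor": "thermometer"}
        return {"act": "inform", "slot": "temperature", "value": value}
    return {"act": "ask", "slot": "intent"}

print(next_action({"intent": "get_temperature", "confidence": 0.9},
                  {"external_temperature": 18.5}))
```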

2.2.4.4. Conceptual Feedback Generation

When the Interaction Manager has taken a decision about the next action, it has to communicate some pieces of information to the user according to this decision. This process starts with the production of a set of concepts supposed to represent this information as well as possible. Indeed, when one is solicited by another human during a dialogue, one first thinks about what has been uttered and then produces a series of ideas or concepts expressing the results of one’s thoughts. No words or gestures are used yet, but a structure summarising one’s thoughts is built. In the same way, the ‘Conceptual Feedback Generation’ subsystem builds a set of concepts to be conveyed to the user by any means. This process is often referred to as planning.
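A minimal sketch of this planning step could represent the concepts as small, modality-independent structures; the Concept fields and the expansion rule below are illustrative assumptions, not a prescribed format.

```python
from dataclasses import dataclass
from typing import List, Optional

# Sketch of the planning step: the decided action is expanded into a small
# structure of concepts, still independent of any words or gestures.
@dataclass
class Concept:
    act: str                      # e.g. "inform", "confirm", "ask"
    attribute: str                # what the concept is about
    value: Optional[object] = None

def plan_feedback(action: dict) -> List[Concept]:
    if action["act"] == "inform":
        return [Concept("inform", action["slot"], action["value"]),
                Concept("ask", "further_request")]
    return [Concept(action["act"], action.get("slot", "unknown"))]

print(plan_feedback({"act": "inform", "slot": "temperature", "value": 18.5}))
```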

2.2.4.5. Output Generation

This last subsystem is in charge of using any means at its disposal (speech, screens, sounds, etc.) to express the previously generated concepts in a way that the user can understand them. In the literature, this process is called surface realisation.
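A toy template-based surface realiser, sketched below with invented templates and concept fields, turns such concepts into text that a speech synthesiser or a screen could then render; real systems use far more elaborate generation techniques.

```python
# Toy template-based surface realisation: each concept is mapped to a text
# fragment. Templates and concept fields are hypothetical.
TEMPLATES = {
    ("inform", "temperature"): "The outside temperature is {value} degrees.",
    ("ask", "further_request"): "Is there anything else I can do for you?",
}

def realise(concepts) -> str:
    parts = []
    for c in concepts:
        template = TEMPLATES.get((c["act"], c["attribute"]),
                                 "Sorry, I cannot express that.")
        parts.append(template.format(value=c.get("value")))
    return " ".join(parts)

print(realise([{"act": "inform", "attribute": "temperature", "value": 18.5},
               {"act": "ask", "attribute": "further_request"}]))
```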

2.3. General Architecture of Spoken Dialogue