Chapter 7: Dialogue in the MDP Framework

7.1. Generalities

From the beginning of this text, a spoken dialogue has been considered as a turn-taking process between two agents. It is therefore a sequential process during which two agents try to achieve a goal (generally the accomplishment of a task: task-oriented dialogue). In order to achieve their mutual goal, the agents exchange utterances embedding intentions until the task is completed or one of the agents gives up. Although it is not realistic to think of an optimality criterion for generic human-human dialogue, in the special case of task-oriented and goal-directed dialogues it should be possible to rank interactions and the underlying strategy followed by each agent. It should therefore be possible to learn an optimal strategy according to this optimality criterion. Yet, it is not always easy to identify the important indices of dialogue quality and to derive such a criterion. Moreover, obtaining training data for the specific purpose of task-oriented dialogue strategy learning is a very complex problem that should also be optimised.

7.1.1. Choice of the Learning Method

Nowadays, thanks to more than half a century of research in the field of Artificial Intelligence, many learning techniques have been developed. Learning algorithms address several classes of problems, such as pattern matching and decision-making, whose common points are that no analytic solution can easily be found and that human beings perform better than computers.

7.1.1.1. Supervised vs. Unsupervised Learning

Two main classes of learning methods can be distinguished: supervised and unsupervised learning. Supervised learning is mainly related to ‘learning-by-example’ techniques: during the training phase, examples of a problem are presented to a learning agent together with their solutions, and the agent then learns to provide good solutions to unseen similar problems. Neural networks are such learning agents, widely used for pattern matching (speech or handwriting recognition) [Bishop, 1995]. Unsupervised learning can be seen as ‘trial-and-error’ learning: the learner is never told what the optimal behaviour to adopt is, but is rewarded or punished for its actions in given situations. This last class of learning methods seems more natural since it is the way animals and young children learn. Reinforcement Learning (RL) agents are such learners, making use of unsupervised techniques to solve particular decision-making problems [Sutton & Barto, 1998].
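
To make the distinction concrete, here is a minimal illustrative sketch (in Python, added for this text and not taken from the cited references) of a trial-and-error learner: it is never shown the correct action, it only receives a reward or a punishment after acting, and it progressively comes to prefer the rewarded action. The action names and the reward scheme are arbitrary assumptions.

    import random

    # Hypothetical action set and a toy environment rewarding 'confirm' most of the time.
    actions = ["confirm", "ask_again", "close"]
    value = {a: 0.0 for a in actions}    # estimated value of each action
    counts = {a: 0 for a in actions}

    def environment(action):
        # Returns a reward (+1) or a punishment (-1); the learner never sees labelled examples.
        return 1.0 if (action == "confirm" and random.random() < 0.8) else -1.0

    for trial in range(1000):
        # Trial and error: mostly exploit the current estimates, sometimes explore.
        a = random.choice(actions) if random.random() < 0.1 else max(actions, key=value.get)
        r = environment(a)
        counts[a] += 1
        value[a] += (r - value[a]) / counts[a]   # incremental average of received rewards

    print(value)   # 'confirm' should end up with the highest estimated value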

In the case of dialogue strategy learning, there is usually no example of optimal strategies available, basically because the optimal strategy is generally unknown in all cases but the simplest ones. Unsupervised learning is then the more suitable technique for this purpose. As a matter of fact, RL has already been proposed to solve the problem of finding optimal dialogue strategies in [Levin et al, 1997] and [Singh et al, 1999] and seemed to be appropriate.

7.1.1.2. Online, Model-Based, Model-Free Learning

RL has actually been developed in the framework of online learning. As exposed in section 3.2, the RL agent interacts with a real environment and learns to behave optimally thanks to rewards obtained while interacting. Thanks to Q-learning algorithms, for instance, the RL agent can learn an optimal policy while following another one and thus exhibit a coherent behaviour (even if sub-optimal) during learning. Yet, online learning is not really adapted to dialogue strategy design unless a pretty good strategy is already in use. Indeed, RL algorithms need a large number of interactions (growing with the size of the state space and the action set) to converge, and online learning would require the system to interact with real human users. This is not thinkable because of the time it would take and the money it would cost.
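
This off-policy property can be illustrated by the standard tabular Q-learning update sketched below: the behaviour policy is epsilon-greedy, while the update target uses the greedy action in the next state, so the agent improves its estimate of the optimal policy while actually following another one. The representation of states and actions and the numerical constants are illustrative assumptions, not values used later in this chapter.

    import random
    from collections import defaultdict

    alpha, gamma, epsilon = 0.1, 0.95, 0.2
    Q = defaultdict(float)   # Q[(state, action)] -> estimated return

    def behaviour_policy(state, actions):
        # Epsilon-greedy behaviour: explores with probability epsilon, exploits otherwise.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def q_learning_update(state, action, reward, next_state, actions):
        # Off-policy target: the greedy (optimal) action in next_state is used,
        # even though the agent may have explored when it actually acted.
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])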

On the other hand, some RL algorithms use a model of the Markov Decision Process (MDP) underlying the studied environment to speed up convergence. Learning relies on a probabilistic model including the state-transition probabilities T and the reward distribution R of the observed environment (see section 3.2 for notational information). While interacting, the agent learns from real rewards and updates its model so as to learn from artificially generated situations between two real interactions. This algorithm family is called model-based learning, and the most popular model-based RL method is Sutton’s Dyna-Q algorithm [Sutton & Barto, 1998]. Starting from this point, some studies reported results obtained by learning the state-transition probabilities and reward distribution from a corpus of annotated dialogues and applying a Q-learning RL algorithm to the resulting model [Singh et al, 1999] [Walker, 2000]. There are several drawbacks to this technique: it necessitates the collection of a large amount of data and requires all the possible states to be predefined before collecting the data. Moreover, the data collection only provides information about what happened when using a particular strategy (the one used during the data collection). A lot of states might not be sufficiently visited and, above all, this makes the assumption that the strategy does not affect the user’s behaviour, i.e. that he/she will always act in the same way in a given state whatever the path followed before. This assumption is never met in practice. Finally, it cannot predict what will happen if actions are added to the action set.
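
The generic Dyna-Q idea referred to above can be sketched as follows: each real transition triggers a direct Q-learning update, an update of the learned model of T and R, and a number of planning updates replayed from that model between two real interactions. The interface and the constants are assumptions made for the illustration only, not the exact algorithm used in the experiments.

    import random
    from collections import defaultdict

    alpha, gamma, n_planning = 0.1, 0.95, 20
    Q = defaultdict(float)   # Q[(s, a)] -> value estimate
    model = {}               # model[(s, a)] -> (reward, next_state), learned online

    def dyna_q_step(s, a, r, s_next, actions):
        # 1) Direct RL update from the real transition.
        best = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
        # 2) Update of a (deterministic, for simplicity) model of T and R.
        model[(s, a)] = (r, s_next)
        # 3) Planning: replay transitions drawn from the model between real interactions.
        for _ in range(n_planning):
            (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
            best = max(Q[(ps_next, b)] for b in actions)
            Q[(ps, pa)] += alpha * (pr + gamma * best - Q[(ps, pa)])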

For those reasons, the model-free technique is proposed. It is actually commonly called a model-free technique because no prior knowledge about the distribution of states and rewards is needed and, above all, because the learner does not update an internal model as in the Dyna-Q algorithm. But it actually relies on a model of the environment as described in the preceding three chapters. The point is that the learner is not ‘aware’ that it is interacting with a model and not with a real environment; both can be exchanged without any modification of the learner’s behaviour. In order to keep things clear, the model-free technique discussed in this text will be referred to as a simulation-based technique. A first simulation-based RL experiment has been described in [Levin et al, 1997], followed by [Scheffler & Young, 2002], with the limitations described in section 4.2.1. The subject of this chapter is the application of the RL paradigm to the problem of dialogue strategy learning using the dialogue simulation environment described before.
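
The corresponding simulation-based training loop can be pictured as below: the learner only manipulates a generic reset/step interface, which is precisely why a simulated environment (such as the one built in the preceding chapters) and a real one can be exchanged without modifying the learner. SimulatedDialogue and QLearningAgent are hypothetical names used for this sketch only.

    def train(env, agent, n_dialogues=100000):
        # env hides whether dialogues are real or simulated; the agent cannot tell the difference.
        for _ in range(n_dialogues):
            state = env.reset()                       # start a new dialogue
            done = False
            while not done:
                action = agent.choose(state)          # e.g. epsilon-greedy over Q
                next_state, reward, done = env.step(action)
                agent.learn(state, action, reward, next_state)
                state = next_state

    # train(SimulatedDialogue(), QLearningAgent())    # a real SDS could replace the simulator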

7.1.1.3. MDP vs. POMDP

As said in section 3.2.1, a dialogue can be described either within the MDP framework or within the POMDP framework. The exact solution of a POMDP is intractable when the state space is larger than 15 states, but approximate solutions that outperform the general MDP solution might be derived more easily. Anyway, the state space must stay very small, which is a major drawback for more complex dialogues. The dialogue simulation environment described in the preceding chapters allows the production of data also suitable for POMDP training since it can produce error-prone observed intentions and the corresponding actually emitted intentions. An interesting experiment might be done using some recently developed techniques for solving POMDPs by means of the combination of RL and DBN [Sallans, 2002], using dialogue simulation. Yet, in the following, only the MDP approximation will be described, since the rise in performance does not justify the extra amount of computation induced by POMDP solving. Moreover, POMDP solving is definitely not suitable for the purpose of easy design of SDS prototypes. Some tricks will be used to overcome the limitations brought by the MDP approximation, particularly by introducing into the observations indices of what the dialogue system lacks.
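
For completeness, the textbook belief update such a POMDP experiment would rely on is sketched below: the error-prone observed intention plays the role of the observation o while the actually emitted intention is part of the hidden state s. The data structures T (transition probabilities) and O (observation probabilities) are assumed for the illustration.

    def belief_update(b, a, o, states, T, O):
        # b'(s') is proportional to O[a][s'][o] * sum_s T[s][a][s'] * b[s]
        new_b = {}
        for s_next in states:
            new_b[s_next] = O[a][s_next][o] * sum(T[s][a][s_next] * b[s] for s in states)
        norm = sum(new_b.values())
        # Normalise so that the new belief is a proper probability distribution.
        return {s: (v / norm if norm > 0 else 1.0 / len(states)) for s, v in new_b.items()}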