Toward A More Realistic Simulation Environment

CHAPTER 4: GENERALITIES ABOUT DIALOGUE SIMULATION

4.2. PROBABILISTIC SIMULATION ENVIRONMENT

4.2.1. Several Probabilistic Proposals for Simulating the Environment

4.2.1.3. Toward A More Realistic Simulation Environment

In order to build a more realistic simulation environment for SDS evaluation, several points should be taken into account. The following observations can be made about the problem:

1. The simulation environment should be minimally task-dependent.

2. The simulated user should be goal-directed and should behave consistently according to its goal.

3. A human user would have sufficient knowledge of the task to form a goal that is consistent with it.

4. The DM’s action set includes actions that are meant to become synthesized speech utterances spoken to the user but also actions that are meant to query or modify the WK.

5. Just as the input processing systems of the SDS can introduce errors, output generation can also be error-prone.

6. If the simulation environment is meant to allow SDS evaluation, it should provide information about the metrics that affect the user’s satisfaction.

7. Noise influences the input processing results but also the way synthesized speech is perceived.

8. Synthesized speech quality and the length of the system’s utterances influence the user’s satisfaction [Walker et al, 2001].

According to this list, the simulation environment in Fig. 25 is proposed as an extension of the environment described in [Pietquin & Renals, 2002] and [Pietquin & Dutoit, 2002]. The inclusion of the WK in the environment meets the requirements of statements 1, 2, 3 and 4 of the above list. Indeed, if the user model is to be as task-independent as possible while still being able to build a task-dependent goal at the beginning of each simulated dialogue session, it should have access to the task model included in the WK. Moreover, if the DM’s actions can affect the WK, the WK should definitely be considered part of the environment. In agreement with statement 5, a model of the output generation blocks has been included in the simulation environment.

This model also addresses the problems raised by statements 6 and 8. In conformity with statement 7, a noise source has been added: the generated noise is mixed with the SDS output before it reaches the user model, and with the user model’s output.
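The role of the WK in goal construction described above can be sketched in Python. This is a minimal illustration, not the method used in the work: the representation of the task model as a mapping from slots to possible values, the slot names, and the function name `sample_goal` are all assumptions. The uniform choice anticipates the equal-probability assumption on goals made later in this section.

```python
import random


def sample_goal(task_model, rng):
    """Draw one goal from a (hypothetical) slot -> values task model.

    Choosing each slot value independently and uniformly makes every
    combination of slot values equally likely.
    """
    return {slot: rng.choice(values) for slot, values in task_model.items()}


# Hypothetical task model, as it might be stored in the WK.
task_model = {
    "destination": ["Paris", "Rome"],
    "class": ["economy", "business"],
}

rng = random.Random(0)  # seeded so that a simulated session is reproducible
goal = sample_goal(task_model, rng)
```

The user model itself stays task-independent in this sketch: everything specific to the task is read from `task_model` when the session starts.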

Fig. 25: A more realistic simulation environment

Finally, the DM’s action set A has been split into two parts (A_u and A_WK) in order to address problems expressed by statement 4:

$$A = A_u \cup A_{WK}$$

In Fig. 25, variables are:

• s_t is the internal state at time t,

• a_t is the action performed by the DM and transmitted to the environment at time t,

• a_u,t is the action meant to become a synthesized speech utterance spoken to the user (null if a_t ∈ A_WK) at time t,

• a_WK,t is the action meant to access or modify the WK (null if a_t ∈ A_u) at time t,

• sys_t is the system’s utterance, derived from action a_u,t, by the output generation block at time t,

• u_t is the concept sequence produced by the user model at time t in response to the concepts the user could retrieve from sys_t (working at the word level would be both intractable and unnecessary here, as it would require modelling all the ways in which a user can express a set of concepts),

• g is the user’s goal,

• n_t is the contribution of noise at time t,

• o_t is the observation the DM can perceive at time t after its action a_t has been processed by the environment,

• s_t+1 is the internal state at time t+1.
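The data flow described by these variables can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation behind Fig. 25: the names `Action`, `Environment` and `step`, the four callables, and the reduction of the WK to a dictionary are hypothetical, and the stub models in the usage example are deterministic placeholders for what would be probabilistic components.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Action:
    kind: str        # "user" if a_t is in A_u, "wk" if a_t is in A_WK
    payload: tuple   # concepts to utter, or ("get", key) / ("set", key, value)


@dataclass
class Environment:
    output_generation: Callable[[Action], List[str]]              # a_u,t -> sys_t
    user_model: Callable[[List[str], Dict[str, str]], List[str]]  # (perceived sys_t, g) -> u_t
    input_processing: Callable[[List[str]], List[str]]            # noisy u_t -> o_t
    add_noise: Callable[[List[str]], List[str]]                   # mixes n_t into a signal
    wk: Dict[str, Any] = field(default_factory=dict)

    def step(self, a_t: Action, goal: Dict[str, str]) -> Any:
        """Process one DM action a_t and return the observation o_t."""
        if a_t.kind == "wk":
            # a_WK,t is set and a_u,t is null: deterministic WK access.
            op, key, *rest = a_t.payload
            if op == "set":
                self.wk[key] = rest[0]
            return self.wk.get(key)
        if a_t.kind != "user":
            raise ValueError(f"unknown action kind: {a_t.kind!r}")
        # a_u,t is set and a_WK,t is null: the spoken branch.
        sys_t = self.output_generation(a_t)
        u_t = self.user_model(self.add_noise(sys_t), goal)
        return self.input_processing(self.add_noise(u_t))


# Usage with noiseless, deterministic stubs (all hypothetical).
env = Environment(
    output_generation=lambda a: list(a.payload),
    user_model=lambda sys_t, g: [g[c] for c in sys_t if c in g],
    input_processing=lambda u: list(u),
    add_noise=lambda signal: list(signal),
    wk={"flights": 3},
)
goal = {"ask_destination": "Paris"}
spoken_obs = env.step(Action("user", ("ask_destination",)), goal)  # ['Paris']
wk_obs = env.step(Action("wk", ("get", "flights")), goal)          # 3
```

Noise is applied twice on the spoken branch, once to the system's utterance before it reaches the user model and once to the user's response, which mirrors the two mixing points mentioned in the text.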

According to these definitions, the joint probability is given by:

(4.13)

Under the same assumptions as before, and keeping in mind that actions on the WK give deterministic results, the second term of this expression can be expressed as:

Thus, as long as the WK is part of the simulation environment, nothing has to be modelled in this term. The first term of the same expression can be decomposed as:

In addition to the previous assumptions, the following can also be made:

• The output generation is independent of the noise and the current state. Indeed, the NLG subsystem generates a text according to the concepts embedded in the action only. This text is then synthesized without taking noise into account, although one could think about adapting the output signal level. This does not mean that noise has no effect on the user’s perception of the system’s utterance, only that the production process is not affected by it.

• The user model doesn’t change its goal during a dialogue session and all goals are equally likely to be chosen. Noise is kept as a conditioning variable as it can affect the understanding of the system’s utterance.

• The user’s utterance is independent of the actual DM action. The action is transmitted to the user through the system’s utterance sys_t.

• The results of the input processing blocks are independent of the user’s goal and of the system’s utterance. Indeed, the system cannot be aware of the user’s goal in order to tune its subsystems. Moreover, although the actual DM action can be responsible for some tuning, the spoken realisation of the concepts embedded in the action is not.

• The task model is independent of the user’s utterance and goal, and of the system’s utterance and the noise.

The previous expression can therefore be rewritten:

(4.16)

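The simulation problem can be summarised as the problem of evaluating the first four terms of this expression, that is, the terms associated with the output generation, the user’s goal, the user’s utterance and the input processing.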

As equal probabilities for all goals are assumed, evaluating P(g) is quite trivial but the evaluation of other terms will be the subject of the following chapters.
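As an illustration of how the five assumptions listed in this section act on a joint distribution, consider a speech action a_{u,t} and the chain-rule expansion of one turn. This is a sketch, not a reproduction of the numbered equations: the ordering of the variables is a choice made for the illustration, only the assumptions stated in this section are applied (the earlier assumptions the text refers to may simplify some factors further, for instance the dependence of the user's utterance on s_t), and the symbol G for the set of possible goals is introduced here.

```latex
% Chain rule, before any independence assumption:
\begin{aligned}
P(s_{t+1}, o_t, u_t, g, sys_t, a_{u,t} \mid s_t, n_t)
  ={}& P(a_{u,t} \mid s_t, n_t)\,
       P(sys_t \mid a_{u,t}, s_t, n_t)\,
       P(g \mid sys_t, a_{u,t}, s_t, n_t) \\
     & \cdot P(u_t \mid g, sys_t, a_{u,t}, s_t, n_t)\,
       P(o_t \mid u_t, g, sys_t, a_{u,t}, s_t, n_t) \\
     & \cdot P(s_{t+1} \mid o_t, u_t, g, sys_t, a_{u,t}, s_t, n_t)
\end{aligned}

% After applying the five assumptions and marginalising over the
% variables the DM cannot observe (sys_t, u_t and g):
P(s_{t+1}, o_t, a_{u,t} \mid s_t, n_t)
  = \sum_{sys_t,\, u_t,\, g}
    \underbrace{P(sys_t \mid a_{u,t})}_{\text{output generation}}\,
    \underbrace{P(g)}_{\text{goal}}\,
    \underbrace{P(u_t \mid g, sys_t, s_t, n_t)}_{\text{user}}\,
    \underbrace{P(o_t \mid u_t, a_{u,t}, s_t, n_t)}_{\text{input processing}}\,
    \underbrace{P(s_{t+1} \mid o_t, a_{u,t}, s_t)}_{\text{task model}}\,
    \underbrace{P(a_{u,t} \mid s_t, n_t)}_{\text{DM}}

% Equal probabilities for all goals:
P(g) = \frac{1}{|G|}
```

Each simplification corresponds to one bullet: the output generation factor loses s_t and n_t, the goal factor loses all conditioning, the user factor loses a_{u,t}, the input processing factor loses g and sys_t, and the task model factor loses u_t, g, sys_t and n_t.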

In addition, the simulation environment should provide evaluation indices for each dialogue session. All the components of the environment should therefore provide metrics indicating how well they performed their job. An additional block dedicated to the computation of an instantaneous evaluation c_t+1 of the turn is thus included in the simulation environment (Fig. 26). The inputs of this block are the metrics supplied at time t by each component; they are labelled as follows in the figure:

• {c^out_t} are the metrics provided by the output generation blocks,

• {c^in_t} are the metrics provided by the input processing blocks,

• {c^wk_t} are the metrics provided when accessing the WK,

• U^S is the overall user’s satisfaction which is typically provided once, at the end of each dialogue session.
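One possible shape for this evaluation block is sketched below in Python. The text only states that the block receives the three metric sets at each turn and the overall user's satisfaction at the end of the session; the weighted sum, the weight values, the metric names, and the function names `turn_cost` and `session_cost` are assumptions made purely for illustration.

```python
from typing import Iterable, Mapping, Optional


def turn_cost(
    c_out: Mapping[str, float],
    c_in: Mapping[str, float],
    c_wk: Mapping[str, float],
    weights: Mapping[str, float],
) -> float:
    """Instantaneous evaluation c_{t+1}: weighted sum of the metrics of turn t.

    Metrics without an entry in `weights` are ignored (weight 0).
    """
    total = 0.0
    for metrics in (c_out, c_in, c_wk):
        for name, value in metrics.items():
            total += weights.get(name, 0.0) * value
    return total


def session_cost(
    turn_costs: Iterable[float],
    user_satisfaction: Optional[float] = None,
    satisfaction_weight: float = 1.0,
) -> float:
    """Sum the turn costs; a higher final satisfaction U^S lowers the total."""
    total = sum(turn_costs)
    if user_satisfaction is not None:
        total -= satisfaction_weight * user_satisfaction
    return total


# Hypothetical metric names and weights.
weights = {"utterance_length": 0.1, "asr_confidence": -1.0, "wk_accesses": 0.5}
c_next = turn_cost(
    {"utterance_length": 10.0},   # from the output generation blocks
    {"asr_confidence": 0.5},      # from the input processing blocks
    {"wk_accesses": 2.0},         # from WK access
    weights,
)
total = session_cost([c_next, 0.5], user_satisfaction=1.0)
```

Keeping the per-turn evaluation separate from the end-of-session term reflects the fact that U^S is typically available only once per dialogue, whereas the other metrics are supplied at every turn.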

Fig. 26: Simulation environment with cost function block

From the document: A Framework for Unsupervised Learning of Dialogue Strategies (pages 119-123)