CHAPTER 2: HUMAN-MACHINE SPOKEN DIALOGUE: STATE OF THE ART
2.4. DIALOGUE MANAGER
confirmed should be placed near the end of the confirmation message: they should not be followed by a question such as 'Is that correct?', since users will often speak before or during such a question [McInnes et al, 1999].
2.4.5. Evaluation
Previous sections underlined the main problems of DM design, such as the choice of the degree of initiative the system should leave to the users or the confirmation strategy to deploy. It has also been shown that several studies lead to controversial conclusions for each of those problems. Consequently, it would be of great interest to have a framework for SDS performance evaluation, and this is the object of a large field of current research.

Despite the amount of research dedicated to the problem of SDS performance evaluation, there is no clear and objective method to solve it. Indeed, performance evaluation of such high-level communicative systems relies heavily on the opinion of end-users about their interaction and is therefore strongly subjective. Thus, studies on the subjective appreciation of SDSs through user satisfaction surveys have often (and early) been conducted [Polifroni et al, 1992]. But even those surveys proved to have non-trivial interpretations. For example, experiments reported in [Devillers & Bonneau-Maynard, 1998] demonstrate as a side-conclusion that the users' appreciation of different strategies depends on the order in which the SDSs implementing those strategies were presented for evaluation.

Nevertheless, there have been several attempts to determine the overall system's performance thanks to objective measures made on the SDS components, such as speech recogniser performance, with one of the first tries in [Hirschman et al, 1990]. Other objective measures taking the whole system's behaviour into account (like the average number of turns per transaction, the task success rate, etc.) have been exercised with the aim of evaluating different versions (strategies) of the same SDS [Danieli & Gerbino, 1995], with one of the first tries applied in the SUNDIAL project (and afterwards within the EAGLES framework) [Simpson & Fraser, 1993]. More complex paradigms have been developed afterwards.
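As a toy illustration, whole-system measures like the average number of turns per transaction and the task success rate can be computed directly from dialogue logs; the log format below is invented for illustration, not taken from the cited studies:

```python
# Toy computation of whole-system objective measures from dialogue logs.
# The log format (one dict per dialogue) is invented for illustration.
dialogues = [
    {"turns": 8,  "task_completed": True},
    {"turns": 14, "task_completed": False},
    {"turns": 6,  "task_completed": True},
    {"turns": 10, "task_completed": True},
]

avg_turns = sum(d["turns"] for d in dialogues) / len(dialogues)
task_success_rate = sum(d["task_completed"] for d in dialogues) / len(dialogues)

print(avg_turns)          # 9.5
print(task_success_rate)  # 0.75
```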
2.4.5.1. PARADISE
One of the most popular frameworks for SDS evaluation is the PARADISE (PARAdigm for DIalogue Systems Evaluation) paradigm [Walker et al, 1997a]. PARADISE is the first attempt to explain users' satisfaction as a linear combination of objective measures. For the purpose of evaluation, the task is described as an AVM and the user's satisfaction as the combination of a task completion measure (κ) and a dialogue cost expressed as a weighted sum of objective measures (ci). The overall system's performance is then approximated by:
U = α⋅N(κ) − ∑i wi⋅N(ci)    (2.11)

where N is a Z-score normalisation function that normalises the results to have mean 0 and standard deviation 1. This way, each weight (α and wi) will express the relative importance of each term of the sum in the performance of the system.

The task completion measure κ is the Kappa coefficient [Carletta, 1996] that is computed from a confusion matrix M summarising how well the transfer of information performed between the user and the system when using the SDS to be evaluated. M is a square matrix of dimension n (the number of values in the AVM) where each mij is the number of dialogues in which value i was interpreted while value j was meant. The kappa coefficient is then computed by:

κ = (P(A) − P(E)) / (1 − P(E))    (2.12)

where P(A) is the proportion of correct interpretations (sum of the diagonal elements of M: mii) and P(E) is the proportion of correct interpretations occurring by chance. One can see that κ = 1 when the system performs perfect interpretation (P(A) = 1) and κ = 0 when the only correct interpretations are those obtained by chance (P(A) = P(E)).
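The two measures above can be sketched as follows. The confusion matrix, weights and cost figures are invented for illustration, and the chance-agreement term P(E) is estimated from the row and column marginals, as is usual for the Kappa coefficient:

```python
# Sketch of the PARADISE performance computation (all figures invented).
def kappa(M):
    """Kappa coefficient (eq. 2.12) from an n x n confusion matrix M, where
    M[i][j] counts dialogues in which value i was interpreted, j was meant."""
    total = sum(sum(row) for row in M)
    p_a = sum(M[i][i] for i in range(len(M))) / total   # correct interpretations
    # Chance agreement: sum over values of row marginal * column marginal.
    p_e = sum(
        (sum(M[i]) / total) * (sum(row[i] for row in M) / total)
        for i in range(len(M))
    )
    return (p_a - p_e) / (1 - p_e)

def zscore(x, mean, std):
    """Z-score normalisation N of eq. 2.11: mean 0, standard deviation 1."""
    return (x - mean) / std

# Hypothetical confusion matrix over a 2-value attribute.
M = [[40, 10],
     [5, 45]]
k = kappa(M)

# Hypothetical weights, one cost (dialogue duration) and corpus statistics.
alpha, w1 = 0.5, 0.3
U = alpha * zscore(k, mean=0.6, std=0.2) - w1 * zscore(120.0, mean=100.0, std=30.0)
print(round(k, 3), round(U, 3))
```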
In order to compute the weights α and wi, a large number of users are asked to answer a satisfaction survey after having used the system, while the costs ci are measured during the interaction. The questionnaire comprises around 9 statements on a five-point Likert scale and the overall satisfaction is computed as the mean value of the collected ratings. A Multivariate Linear Regression (MLR) is then applied with the result of the survey as the dependent variable and the normalised measures N(κ) and N(ci) as independent variables, the weights α and wi being obtained as the regression coefficients.
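Under these assumptions, estimating α and the wi amounts to an ordinary least-squares fit of the satisfaction scores on the normalised measures. The data below are invented and generated to follow the model exactly, so the fit recovers the weights:

```python
# Least-squares estimation of the PARADISE weights (all data invented).
import numpy as np

# One row per dialogue: already-normalised [N(kappa), N(c1)].
X = np.array([[ 1.0, -0.5],
              [ 0.5,  0.0],
              [-0.5,  0.5],
              [-1.0,  1.0]])
# Mean Likert satisfaction per dialogue, built here to follow
# 3.5 + 0.8*N(kappa) - 0.4*N(c1) exactly.
satisfaction = np.array([4.5, 3.9, 2.9, 2.3])

# Regress satisfaction on [1, N(kappa), N(c1)].
A = np.column_stack([np.ones(len(X)), X])
coef, _, _, _ = np.linalg.lstsq(A, satisfaction, rcond=None)
intercept, alpha, neg_w1 = coef
w1 = -neg_w1   # cost weights enter equation 2.11 with a minus sign
print(round(alpha, 3), round(w1, 3))
```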
Several criticisms can be made about the assumptions and methods used in the PARADISE framework. First, the assumption of independence of the different costs ci made when building an additive evaluation function has never been proved to be true (it is actually false, as the number of turns and the time duration of a dialogue session are heavily correlated, for example [Larsen, 2003]). The Kappa coefficient as a measure of task success can also be discussed, as it is often very difficult to compute when too many values are possible for a given attribute. An example is given when trying to apply PARADISE to the PADIS system (Philips Automatic Directory Information System) [Bouwman & Hulstijn, 1998]. Recent studies have also criticised the satisfaction questionnaire. While [Sneele & Waals, 2003] proposes to add a single statement rating the overall performance of the system on a 10-point scale, [Larsen, 2003] recommends rebuilding the whole questionnaire, taking psychometric factors into account (which seems to be a good idea). Finally, the AVM representation of the task has proved to be very difficult to extend to multimodal systems and thus seems not to be optimal for system comparisons. Some attempts to modify PARADISE have been proposed [Beringer et al, 2002].
Besides the abovementioned critiques, PARADISE has been applied to a wide range of systems. It was adopted as the evaluation framework for the DARPA Communicator project and applied to the official 2000 and 2001 evaluation experiments [Walker et al, 2000]. Nevertheless, experiments on different SDSs reached different conclusions. PARADISE's developers themselves found contradicting results: they reported that time duration was weakly correlated with user satisfaction in [Walker et al, 1999], while [Walker et al, 2001] reports that dialogue duration, task success and ASR performance were good predictors of user satisfaction. On the other hand, [Rahim et al, 2001] reports a negative correlation between user satisfaction and dialogue duration because users hung up when unsatisfied. Finally, [Larsen, 1999] surprisingly reports that ASR performance is not such a good predictor of user satisfaction.
2.4.5.2. Analytical Evaluation
The principal drawback of the method described above is, of course, the need for data collection. Indeed, subjective evaluation means that the system should be released to be evaluated. Although the usual process of SDS design follows the classical prototyping cycle composed of successive pure design and user evaluation phases, there should be as few user evaluations as possible because they are time-consuming and often very expensive. This is why some attempts to analyse strategies by mathematical means have been developed. In [Louloudis et al, 2001], the authors propose a way to diagnose the future performance of the system during the design process.
Other mathematical models of dialogue have been proposed [Niimi & Nishimoto, 1999], together with closed forms of dialogue metrics (like the number of dialogue turns). Nevertheless, too few implementations were made to prove their reliability. Moreover, lots of simplifying assumptions have to be made for analytical evaluation, and it is thus difficult to extend those methods to complex dialogue configurations.
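As a toy example of the kind of closed-form metric such models yield (the dialogue model here is invented for illustration, not taken from the cited work): if the system must fill n attributes and each turn fills one attribute successfully with probability p, the number of turns per attribute is geometric, so the expected dialogue length is n/p. A quick simulation agrees with the closed form:

```python
# Toy analytical dialogue model (invented): n attributes, each turn fills
# one attribute with probability p. Expected number of turns = n / p.
import random

def expected_turns(n, p):
    """Closed-form expectation: sum of n geometric(p) variables."""
    return n / p

def simulate(n, p, rng):
    """Simulate one dialogue, counting turns until all n attributes are filled."""
    turns, filled = 0, 0
    while filled < n:
        turns += 1
        if rng.random() < p:
            filled += 1
    return turns

rng = random.Random(0)
n, p = 4, 0.8
runs = 20000
empirical = sum(simulate(n, p, rng) for _ in range(runs)) / runs
print(expected_turns(n, p), round(empirical, 2))  # closed form: 5.0
```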
2.4.5.3. Computer-Based Simulation
Because of the inherent difficulties of data collection, some efforts have been made in the field of dialogue simulation. The purpose of using dialogue simulation for SDS evaluation is mainly to enlarge the set of available data and to predict the behaviour of the SDS in unseen situations. Simulation techniques will be more extensively discussed in the second part of this text.

A complete SDS relies on lots of different techniques manipulating high-level and low-level data (see sections 2.3.5 and 2.4), from acoustic signal to understood concepts.
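A minimal sketch of the simulation idea (the user model, error rate and slot counts below are all invented, not a technique from the cited work): dialogues are generated against a simulated user whose answers are corrupted by a fixed recognition error rate, so that metrics such as the task success rate can be estimated before any real user trial:

```python
# Minimal dialogue-simulation sketch (all probabilities invented).
# A simulated user answers every system prompt; the recogniser corrupts the
# answer with a fixed error rate; task success is estimated over many runs.
import random

def simulate_dialogue(rng, n_slots=3, asr_error=0.15, max_turns=4):
    """Return True if all slots are correctly filled within max_turns."""
    filled = 0
    for _ in range(max_turns):
        # The simulated user always answers; ASR fails with prob asr_error.
        if rng.random() >= asr_error:
            filled += 1
        if filled == n_slots:
            return True
    return False

rng = random.Random(42)
runs = 10000
success_rate = sum(simulate_dialogue(rng) for _ in range(runs)) / runs
print(round(success_rate, 3))
```

With these invented parameters the estimate converges to the binomial probability of at least 3 recognition successes in 4 turns, about 0.89.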