CHAPTER 2: HUMAN-MACHINE SPOKEN DIALOGUE: STATE OF THE ART
2.4. DIALOGUE MANAGER
confirmed should be placed near the end of the confirmation message: they should not be followed by a question such as 'Is that correct?', since users will often speak before or during such a question [McInnes et al, 1999].
2.4.5. Evaluation
Previous sections underlined the main problems of DM design, such as the choice of the degree of initiative the system should leave to the users or the confirmation strategy to deploy. It has also been shown that several studies lead to controversial conclusions for each of those problems. Consequently, it would be of great interest to have a framework for SDS performance evaluation, and this is the object of a large field of current research.

Despite the amount of research dedicated to the problem of SDS performance evaluation, there is no clear and objective method to solve it. Indeed, performance evaluation of such high-level communicative systems relies heavily on the opinion of end-users about their interaction and is therefore strongly subjective. Thus, studies on the subjective appreciation of SDSs through user satisfaction surveys have often (and early) been conducted [Polifroni et al, 1992]. But even those surveys proved to have non-trivial interpretations. For example, experiments reported in [Devillers & Bonneau-Maynard, 1998] demonstrate as a side-conclusion that the users' appreciation of different strategies depends on the order in which the SDSs implementing those strategies were presented for evaluation.

Nevertheless, there have been several attempts to determine the overall system's performance thanks to objective measures made on the SDS components, such as speech recogniser performance, with one of the first tries in [Hirschman et al, 1990]. Other objective measures taking the whole system's behaviour into account (like the average number of turns per transaction, the task success rate, etc.) have been exercised with the aim of evaluating different versions (strategies) of the same SDS [Danieli & Gerbino, 1995], with one of the first tries applied in the SUNDIAL project (and afterwards within the EAGLES framework) [Simpson & Fraser, 1993]. More complex paradigms have been developed afterwards.
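As a toy illustration, whole-system measures like the average number of turns per transaction and the task success rate can be computed directly from dialogue logs; the log format below is invented for illustration, not taken from the cited studies:

```python
# Toy computation of whole-system objective measures from dialogue logs.
# The log format (one dict per dialogue) is invented for illustration.
dialogues = [
    {"turns": 8,  "task_completed": True},
    {"turns": 14, "task_completed": False},
    {"turns": 6,  "task_completed": True},
    {"turns": 10, "task_completed": True},
]

avg_turns = sum(d["turns"] for d in dialogues) / len(dialogues)
task_success_rate = sum(d["task_completed"] for d in dialogues) / len(dialogues)

print(avg_turns)          # 9.5
print(task_success_rate)  # 0.75
```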
2.4.5.1. PARADISE
One of the most popular frameworks for SDS evaluation is the PARADISE (PARAdigm for DIalogue Systems Evaluation) paradigm [Walker et al, 1997a]. PARADISE is the first attempt to explain users' satisfaction as a linear combination of objective measures. For the purpose of evaluation, the task is described as an AVM and the user's satisfaction as the combination of a task completion measure (κ) and a dialogue cost expressed as a weighted sum of objective measures (ci). The overall system's performance is then approximated by:
U = α⋅N(κ) − ∑i wi⋅N(ci)    (2.11)

where N is a Z-score normalisation function that normalises the results to have mean 0 and standard deviation 1. This way, each weight (α and wi) will express the relative importance of each term of the sum in the performance of the system.

The task completion measure κ is the Kappa coefficient [Carletta, 1996] that is computed from a confusion matrix M summarising how well the transfer of information performed between the user and the system when using the SDS to be evaluated. M is a square matrix of dimension n (the number of values in the AVM) where each mij is the number of dialogues in which value i was interpreted while value j was meant. The kappa coefficient is then computed by:

κ = (P(A) − P(E)) / (1 − P(E))    (2.12)

where P(A) is the proportion of correct interpretations (sum of the diagonal elements of M: mii) and P(E) is the proportion of correct interpretations occurring by chance. One can see that κ = 1 when the system performs perfect interpretation (P(A) = 1) and κ = 0 when the only correct interpretations are those obtained by chance (P(A) = P(E)).
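The two measures above can be sketched as follows. The confusion matrix, weights and cost figures are invented for illustration, and the chance-agreement term P(E) is estimated from the row and column marginals, as is usual for the Kappa coefficient:

```python
# Sketch of the PARADISE performance computation (all figures invented).
def kappa(M):
    """Kappa coefficient (eq. 2.12) from an n x n confusion matrix M, where
    M[i][j] counts dialogues in which value i was interpreted, j was meant."""
    total = sum(sum(row) for row in M)
    p_a = sum(M[i][i] for i in range(len(M))) / total   # correct interpretations
    # Chance agreement: sum over values of row marginal * column marginal.
    p_e = sum(
        (sum(M[i]) / total) * (sum(row[i] for row in M) / total)
        for i in range(len(M))
    )
    return (p_a - p_e) / (1 - p_e)

def zscore(x, mean, std):
    """Z-score normalisation N of eq. 2.11: mean 0, standard deviation 1."""
    return (x - mean) / std

# Hypothetical confusion matrix over a 2-value attribute.
M = [[40, 10],
     [5, 45]]
k = kappa(M)

# Hypothetical weights, one cost (dialogue duration) and corpus statistics.
alpha, w1 = 0.5, 0.3
U = alpha * zscore(k, mean=0.6, std=0.2) - w1 * zscore(120.0, mean=100.0, std=30.0)
print(round(k, 3), round(U, 3))
```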
In order to compute the weights α and wi, a large number of users are asked to answer a satisfaction survey after having used the system, while the costs ci are measured during the interaction. The questionnaire comprises around 9 statements on a five-point Likert scale and the overall satisfaction is computed as the mean value of the collected ratings. A Multivariate Linear Regression (MLR) is then applied with the result of the survey as the dependent variable and the normalised measures N(κ) and N(ci) as independent variables, the weights α and wi being obtained as the regression coefficients.
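Under these assumptions, estimating α and the wi amounts to an ordinary least-squares fit of the satisfaction scores on the normalised measures. The data below are invented and generated to follow the model exactly, so the fit recovers the weights:

```python
# Least-squares estimation of the PARADISE weights (all data invented).
import numpy as np

# One row per dialogue: already-normalised [N(kappa), N(c1)].
X = np.array([[ 1.0, -0.5],
              [ 0.5,  0.0],
              [-0.5,  0.5],
              [-1.0,  1.0]])
# Mean Likert satisfaction per dialogue, built here to follow
# 3.5 + 0.8*N(kappa) - 0.4*N(c1) exactly.
satisfaction = np.array([4.5, 3.9, 2.9, 2.3])

# Regress satisfaction on [1, N(kappa), N(c1)].
A = np.column_stack([np.ones(len(X)), X])
coef, _, _, _ = np.linalg.lstsq(A, satisfaction, rcond=None)
intercept, alpha, neg_w1 = coef
w1 = -neg_w1   # cost weights enter equation 2.11 with a minus sign
print(round(alpha, 3), round(w1, 3))
```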
Several criticisms can be made about the assumptions and methods used in the PARADISE framework. First, the assumption of independence of the different costs ci made when building an additive evaluation function has never been proved to be true (it is actually false, as the number of turns and the time duration of a dialogue session are heavily correlated, for example [Larsen, 2003]). The Kappa coefficient as a measure of task success can also be discussed, as it is often very difficult to compute when too many values are possible for a given attribute. An example is given when trying to apply PARADISE to the PADIS system (Philips Automatic Directory Information System) [Bouwman & Hulstijn, 1998]. Recent studies have also criticised the satisfaction questionnaire. While [Sneele & Waals, 2003] proposes to add a single statement rating the overall performance of the system on a 10-point scale, [Larsen, 2003] recommends rebuilding the whole questionnaire, taking psychometric factors into account (which seems to be a good idea). Finally, the AVM representation of the task has proved to be very difficult to extend to multimodal systems and thus seems not to be optimal for system comparisons. Some attempts to modify PARADISE have been proposed [Beringer et al, 2002].
Besides the abovementioned critiques, PARADISE has been applied to a wide range of systems. It was adopted as the evaluation framework for the DARPA Communicator project and applied to the official 2000 and 2001 evaluation experiments [Walker et al, 2000]. Nevertheless, experiments on different SDSs reached different conclusions. PARADISE's developers themselves found contradicting results: they reported that time duration was weakly correlated with user satisfaction in [Walker et al, 1999], while [Walker et al, 2001] reports that dialogue duration, task success and ASR performance were good predictors of user satisfaction. On the other hand, [Rahim et al, 2001] reports a negative correlation between user satisfaction and dialogue duration because users hung up when unsatisfied. Finally, [Larsen, 1999] surprisingly reports that ASR performance is not such a good predictor of user satisfaction.
2.4.5.2. Analytical Evaluation
The principal drawback of the method described above is, of course, the need for data collection. Indeed, subjective evaluation means that the system should be released to be evaluated. Although the usual process of SDS design follows the classical prototyping cycle composed of successive pure design and user evaluation phases, there should be as few user evaluations as possible because they are time-consuming and often very expensive. This is why some attempts to analyse strategies by mathematical means have been developed. In [Louloudis et al, 2001], the authors propose a way to diagnose the future performance of the system during the design process.
Other mathematical models of dialogue have been proposed [Niimi & Nishimoto, 1999], together with closed forms of dialogue metrics (like the number of dialogue turns). Nevertheless, too few implementations were made to prove their reliability. Moreover, lots of simplifying assumptions have to be made for analytical evaluation, and it is thus difficult to extend those methods to complex dialogue configurations.
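As a toy example of the kind of closed-form metric such models yield (the dialogue model here is invented for illustration, not taken from the cited work): if the system must fill n attributes and each turn fills one attribute successfully with probability p, the number of turns per attribute is geometric, so the expected dialogue length is n/p. A quick simulation agrees with the closed form:

```python
# Toy analytical dialogue model (invented): n attributes, each turn fills
# one attribute with probability p. Expected number of turns = n / p.
import random

def expected_turns(n, p):
    """Closed-form expectation: sum of n geometric(p) variables."""
    return n / p

def simulate(n, p, rng):
    """Simulate one dialogue, counting turns until all n attributes are filled."""
    turns, filled = 0, 0
    while filled < n:
        turns += 1
        if rng.random() < p:
            filled += 1
    return turns

rng = random.Random(0)
n, p = 4, 0.8
runs = 20000
empirical = sum(simulate(n, p, rng) for _ in range(runs)) / runs
print(expected_turns(n, p), round(empirical, 2))  # closed form: 5.0
```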
2.4.5.3. Computer-Based Simulation
Because of the inherent difficulties of data collection, some efforts have been made in the field of dialogue simulation. The purpose of using dialogue simulation for SDS evaluation is mainly to enlarge the set of available data and to predict the behaviour of the SDS in unseen situations. Simulation techniques will be more extensively discussed in the second part of this text.

A complete SDS relies on lots of different techniques manipulating high-level and low-level data (see sections 2.3.5 and 2.4), from acoustic signal to understood concepts.
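A minimal sketch of the simulation idea (the user model, error rate and slot counts below are all invented, not a technique from the cited work): dialogues are generated against a simulated user whose answers are corrupted by a fixed recognition error rate, so that metrics such as the task success rate can be estimated before any real user trial:

```python
# Minimal dialogue-simulation sketch (all probabilities invented).
# A simulated user answers every system prompt; the recogniser corrupts the
# answer with a fixed error rate; task success is estimated over many runs.
import random

def simulate_dialogue(rng, n_slots=3, asr_error=0.15, max_turns=4):
    """Return True if all slots are correctly filled within max_turns."""
    filled = 0
    for _ in range(max_turns):
        # The simulated user always answers; ASR fails with prob asr_error.
        if rng.random() >= asr_error:
            filled += 1
        if filled == n_slots:
            return True
    return False

rng = random.Random(42)
runs = 10000
success_rate = sum(simulate_dialogue(rng) for _ in range(runs)) / runs
print(round(success_rate, 3))
```

With these invented parameters the estimate converges to the binomial probability of at least 3 recognition successes in 4 turns, about 0.89.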