TTS System Model - Output Processing Simulation

CHAPTER 6: INPUT AND OUTPUT PROCESSING SIMULATION

6.2. Output Processing Simulation


Of course, this mismatch generally results from a misunderstanding by the user, but the misunderstanding is caused by an ambiguity in the generated sentence; this is why the parameter is included in the NLG model and not in the user model. The closer ξt is to 0, the more accurate the NLG system is, and ξt is considered null for recorded and human-authored prompts.

6.2.2. TTS System Model

In general, the TTS process in the framework of an SDS is deterministic in the sense that a given sequence of words w_t will always result in the same spoken utterance sys_t, which will always be interpreted in the same way by a given user as a given set of AV pairs. Thus, the only things that can be modelled in the case of a TTS system are objective metrics.

Few of the objective metrics a TTS system can provide are useful for SDS design, since the TTS system affects only the subjective satisfaction of the user [Walker et al, 2001] (and, one hopes, not his/her understanding of the sentence; otherwise the TTS system should not be used).

Indeed, satisfaction surveys showed that TTS performance was an important factor. It is therefore proposed to use information about the general perceived quality of the TTS system to compute a metric provided by the TTS model. For example, this can result in decreasing the number of prompts when learning a strategy.

On the other hand, it has also been shown that the duration of prompts influences the user's satisfaction. This duration should therefore also be part of the metrics provided by the TTS model.
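The two metrics discussed above (perceived quality and prompt duration) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration only: the class name `TTSModel`, the quality score expressed in [0, 1], and the estimate of duration from a constant speaking rate are not taken from the text, which does not specify how these metrics are computed.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TTSModel:
    """Hypothetical TTS model exposing two objective metrics."""

    perceived_quality: float       # assumed survey-based score in [0, 1]
    words_per_second: float = 2.5  # assumed constant speaking rate

    def prompt_duration(self, words: list[str]) -> float:
        """Estimate the spoken duration of a prompt in seconds."""
        return len(words) / self.words_per_second

    def metrics(self, words: list[str]) -> dict[str, float]:
        """Return the metrics a learning agent could use as cost terms."""
        return {
            "quality": self.perceived_quality,
            "duration_s": self.prompt_duration(words),
        }


# Illustrative prompt; the wording is invented for this example.
tts = TTSModel(perceived_quality=0.8, words_per_second=2.0)
prompt_metrics = tts.metrics("please give me your destination".split())
```

Since the TTS process is treated as deterministic, the sketch contains no random component: the same word sequence always yields the same metrics.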

6.3. Conclusion

In this chapter, a simulation model of the whole speech-processing channel has been proposed. It uses as few parameters as possible and relies primarily on the specific task.

Indeed, the ASR model takes as inputs the possible values of each attribute in the AVM representation of the task, in order to estimate the possible confusions between those values when they are spoken and to provide metrics about the confidence an ASR system can have in its recognition result. The inter-word distance computed in the ASR modelling section can also assist the design of speech grammars and vocabularies: it can point out very close words in a given vocabulary or provide information such as the mean inter-word distance or the mean size of a cluster of words in this vocabulary. The parameters of the ASR model are mainly contained in the edit cost matrix used for computing inter-word distances. These parameters can either be deduced from articulatory features or replaced by a measured confusion matrix if a sufficiently large corpus of annotated acoustic data is available, which is rarely the case for non-expert users.
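The use of an inter-word distance as a vocabulary design aid can be sketched as follows. This is a generic weighted edit distance over phoneme sequences, not the exact measure defined in the ASR modelling section. The toy lexicon, the phoneme symbols, the single substitution cost of 0.2 and the default costs of 1.0 are all assumptions made for illustration; in the text, the costs come from articulatory features or from a measured confusion matrix.

```python
from itertools import combinations


def edit_distance(a, b, sub_cost, indel_cost=1.0):
    """Weighted edit distance between two phoneme sequences.

    sub_cost maps a phoneme pair to its substitution cost; pairs that are
    missing (in either order) fall back to an assumed cost of 1.0.
    """
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel_cost
    for j in range(1, m + 1):
        d[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            x, y = a[i - 1], b[j - 1]
            if x == y:
                s = 0.0
            else:
                s = sub_cost.get((x, y), sub_cost.get((y, x), 1.0))
            d[i][j] = min(
                d[i - 1][j] + indel_cost,  # deletion
                d[i][j - 1] + indel_cost,  # insertion
                d[i - 1][j - 1] + s,       # substitution or match
            )
    return d[n][m]


def vocabulary_report(lexicon, sub_cost):
    """Return the mean inter-word distance and the closest word pair."""
    pairs = {
        (u, v): edit_distance(lexicon[u], lexicon[v], sub_cost)
        for u, v in combinations(sorted(lexicon), 2)
    }
    mean_distance = sum(pairs.values()) / len(pairs)
    closest = min(pairs, key=pairs.get)
    return mean_distance, closest, pairs[closest]


# Toy lexicon with invented phoneme transcriptions (hypothetical data).
lexicon = {"bat": ["b", "a", "t"], "pat": ["p", "a", "t"], "dog": ["d", "o", "g"]}
sub_cost = {("b", "p"): 0.2}  # assumed: /b/ and /p/ are easily confused

mean_distance, closest_pair, closest_distance = vocabulary_report(lexicon, sub_cost)
```

With these toy costs, the report flags "bat" and "pat" as the closest pair, which is the kind of hint a designer could use to reword a grammar or vocabulary.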

The NLU model relies on the task structure and on the Bayesian UM described earlier and therefore does not require new parameters. It supplies a way to simulate attribute classification errors taking the context into account and provides an estimate of the semantic and contextual confidence level.

The output processing subsystems (namely the NLG and the TTS systems) have also been the subject of a concise modelling study. Possible ambiguities in the generated sentence have been introduced, along with some metrics about the performance of the TTS system (especially the duration of the generated spoken utterances).

At the beginning of this chapter, noise was also introduced as a variable influencing the process. Several experiments were carried out to evaluate the effect of artificially added noise in recorded speech utterances on the confusability between phonemes. Yet, as indicated before, the results obtained on the BDSons database were very poor and nothing could be done to improve them in a reasonable amount of time (since the subject of this work was not pure ASR research). Adding noise to the spoken utterances resulted in a WER close to 100%, and no conclusion could reasonably be drawn from these experiments.

Third Part: Learning Strategies

Children learn to speak very early. Yet smartly interacting with others is a complex and almost lifelong quest mainly based on a trial-and-error process. The definition of a smart interaction is not clear anyway and should be revised for each particular case and each particular participant. No one can actually provide an example of what would objectively have been the perfect sequencing of exchanges after having participated in a dialogue. Human beings have a greater propensity to criticise what is wrong than to provide positive proposals.

It therefore seemed obvious that machines should mimic natural human behaviour and use unsupervised learning techniques to become able to interact with human beings in a useful and satisfying manner. Moreover, unlike other artificial intelligence problems where humans perform better than computers, human-authored dialogue strategies are generally sub-optimal because optimality is also a matter of technical performance. The problem is then not only to perform automatic design of strategies but also to outperform human authoring in this domain.
