
Chapter 4. Speech inversion evaluation

4.3. Evaluation results

4.3.1. HMM-based method

4.3.1.2 Articulatory synthesis

As described in Chapter 2, the synthesis is performed as follows: a linear sequence of HMM states is built by concatenating the corresponding segmental phoneme HMMs, and a sequence of observation parameters is generated using a specific ML-based parameter generation algorithm (Zen et al., 2004).
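The mechanics of this generation step can be sketched compactly. Below is a minimal NumPy sketch of maximum-likelihood parameter generation (MLPG) for a single articulatory channel, assuming one Gaussian per frame along the chosen state path and static-plus-delta observations with a centred delta window; the function name and the window weights are illustrative, not the thesis's implementation.

```python
import numpy as np

def mlpg_1d(mu, var, delta_win=(-0.5, 0.0, 0.5)):
    """Maximum-likelihood parameter generation for one articulatory
    channel (sketch of the MLPG scheme behind Zen et al., 2004).

    mu, var : (T, 2) per-frame Gaussian means and variances of the
              static and delta features along the chosen state path.
    Returns the smooth static trajectory c of length T that solves
    (W' P W) c = W' P mu, with P = diag(1 / var).
    """
    T = mu.shape[0]
    # W maps the static trajectory c onto stacked [static; delta] rows.
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                      # static row: o_t = c_t
        for k, w in enumerate(delta_win):      # delta row: windowed difference
            W[2 * t + 1, min(max(t + k - 1, 0), T - 1)] += w
    P = 1.0 / var.reshape(-1)                  # diagonal precision
    A = W.T @ (P[:, None] * W)
    b = W.T @ (P * mu.reshape(-1))
    return np.linalg.solve(A, b)
```

The per-frame means and variances are simply read off the Gaussians of the states visited by the path, which is where the state sequence discussed next comes in.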

Via this HMM-based approach, the articulatory trajectories can be inverted from speech alone or from speech and labels. In both cases, the state sequence is generated from the trained acoustic HMMs, either by decoding the unseen speech directly or by forced alignment of the original phone labels; the forced alignment method is equivalent to perfect recognition. In order to assess the contribution of the trajectory formation stage to the errors of the complete inversion procedure, we therefore synthesised the articulatory trajectories using a forced alignment of the states based on the original labels, emulating a perfect acoustic recognition stage.
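The two input conditions differ only in the decoding graph. As a rough sketch, assuming per-frame state log-likelihoods have already been computed, forced alignment amounts to a Viterbi pass over the single left-to-right chain of states prescribed by the known labels, whereas free decoding would search a much larger graph containing all competing allophone models plus the bigram language model:

```python
import numpy as np

def viterbi_path(loglik, logtrans):
    """Best state path through a left-to-right HMM state chain.

    loglik  : (T, S) per-frame state log-likelihoods.
    logtrans: (S, S) log transition probabilities.
    For forced alignment the S states are just the concatenated states
    of the *known* phone labels, so this pass only time-aligns them; in
    free decoding the graph would also contain all competing allophone
    models and the bigram language model.
    """
    T, S = loglik.shape
    delta = np.full((T, S), -np.inf)
    psi = np.zeros((T, S), dtype=int)
    delta[0, 0] = loglik[0, 0]                 # must start in state 0
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logtrans
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + loglik[t]
    path = [S - 1]                             # must end in the last state
    for t in range(T - 1, 0, -1):
        path.append(psi[t, path[-1]])
    return path[::-1]
```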

The inversion configuration is as follows: 8-component Gaussian mixtures per shared state for the acoustic HMMs; the bigram allophone model trained on “Le Monde” is used in the recognition stage; single Gaussians are used for the articulatory part of the HMMs; the multistream HMMs are trained by MLE.

Table 4.3-6. Inversion performances using the MLE training method on the EMA-PB-2007 corpus

Input             Criteria     no-ctx   L-ctx   ctx-R   L-ctx-R
Audio and labels  µRMSE (mm)     1.79    1.49    1.50      1.41
                  RMSE (mm)      1.86    1.54    1.55      1.45
                  PMCC           0.91    0.93    0.93      0.94
                  AccArt (%)    71.13   81.01   88.29     89.30
Audio alone       µRMSE (mm)     1.85    1.63    1.60      1.61
                  RMSE (mm)      1.92    1.69    1.66      1.66
                  PMCC           0.89    0.91    0.92      0.92
                  AccArt (%)    69.39   76.94   82.97     82.19

From Table 4.3-6, we can estimate that the trajectory formation stage contributes nearly 90 % of the overall RMSE: in the L-ctx-R condition, for instance, inversion with a perfect recognition stage (audio and labels) already yields an RMSE of 1.45 mm out of the 1.66 mm obtained from audio alone, i.e. 1.45 / 1.66 ≈ 87 %. This relatively high level of error can likely be explained by the fact that the trajectory formation model tends to over-smooth the predicted movements and does not properly capture coarticulation patterns. Note that we found that the inheritance mechanism for missing HMMs decreases the RMSE by about 0.1 mm.
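For reference, the criteria used throughout these tables can be computed as in the sketch below, where the measured and predicted EMA trajectories are (T, D) arrays in mm. Taking µRMSE as the mean of the per-coordinate RMSEs is our assumption about the definition given earlier in the thesis; AccArt (the articulatory recognition accuracy) is not sketched here.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Global RMSE (mm) over all frames and EMA coordinates."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def murmse(y_true, y_pred):
    """Taken here as the mean of per-coordinate RMSEs (assumption)."""
    return float(np.mean(np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))))

def pmcc(y_true, y_pred):
    """Pearson product-moment correlation, averaged over coordinates."""
    yt = y_true - y_true.mean(axis=0)
    yp = y_pred - y_pred.mean(axis=0)
    r = (yt * yp).sum(axis=0) / np.sqrt(
        (yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))
    return float(r.mean())
```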

Impact of state duration on articulatory speech synthesis

In the inversion stage, the quality of the HMM state durations can have an impact on the articulatory inversion performance, since the parameter generation algorithm needs the state durations to generate the articulatory trajectories. In the proposed inversion system, the state durations can be taken directly from the recognition stage. An alternative is to retain only the phone segmentation from the recognition stage and to estimate the state durations at the synthesis stage using a z-scoring model.
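One common z-scoring variant works as sketched below, assuming each state stores the mean and standard deviation of its duration (in frames) estimated during training, and that the total duration of the phone segment is available from the recognition stage; the exact model used in the thesis may differ in its details.

```python
import numpy as np

def zscore_state_durations(mu, sigma, total_frames):
    """Distribute a known segment duration over its HMM states by
    z-scoring: every state is shifted by the same number z of its own
    standard deviations, d_i = mu_i + z * sigma_i, with z chosen so
    that the state durations sum to the segment length.

    mu, sigma : per-state duration statistics (frames) from training.
    """
    z = (total_frames - mu.sum()) / sigma.sum()
    d = np.maximum(np.rint(mu + z * sigma).astype(int), 1)
    d[-1] += total_frames - d.sum()   # absorb rounding in the last state
    return d
```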

The effect of using different state duration predictions can be analysed by comparing the global errors reported in Table 4.3-7, which contrasts trajectories generated from the state durations decoded from the unseen speech with trajectories generated from durations estimated by z-scoring. We see that using the state durations produced by the recognition stage improves the RMSE by about 10 % and the PMCC by about 4 % compared to the z-scoring method.

Table 4.3-7. Influence of state duration prediction on the performance of the articulatory synthesis stage using MLE trained models on the EMA-PB-2007 corpus

State duration   Criteria   no-ctx   L-ctx   ctx-R   L-ctx-R
[table body missing from the source]

Impact of MGE on articulatory speech synthesis

To compare the MLE and MGE training criteria, it is essential to isolate the difference between the two training methods at the synthesis stage, i.e. using a perfect recognition stage. The RMSE values corresponding to the MLE and MGE trained models are displayed in Table 4.3-8; note that the trajectories were estimated using the speech signal and labels as input. We observe that all MGE trained articulatory HMMs consistently yield a lower RMSE than the MLE trained ones. The differences are all significant at the p < 0.005 level using a t-test. This confirms the hypothesis that training is most effective when the training objective matches the error measure of the task: for the inversion task, the MGE training criterion is better suited than the MLE one.
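To make the contrast concrete, the sketch below computes the exact gradient of the generation error with respect to the state means for one articulatory channel, exploiting the fact that the MLPG trajectory is linear in the means. It is a didactic sketch only: published MGE systems typically update the full model with GPD- or Rprop-style rules, and nothing here is claimed to match the thesis's implementation.

```python
import numpy as np

def mge_grad_means(mu_states, var_states, path, c_meas,
                   delta_win=(-0.5, 0.0, 0.5)):
    """Gradient of the generation error E = ||c_gen - c_meas||^2 with
    respect to the per-state [static, delta] means (single Gaussians).

    The MLPG trajectory is c_gen = A^{-1} W' P m, where m is the
    frame-level mean vector picked out by `path`, P the diagonal
    precision and A = W' P W; since c_gen is linear in m, the gradient
    is exact: dE/dm = 2 P W A^{-1} (c_gen - c_meas).
    """
    T = len(path)
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0
        for k, w in enumerate(delta_win):
            W[2 * t + 1, min(max(t + k - 1, 0), T - 1)] += w
    m = mu_states[path].reshape(-1)               # frame-level means
    P = 1.0 / var_states[path].reshape(-1)        # frame-level precisions
    A = W.T @ (P[:, None] * W)
    c_gen = np.linalg.solve(A, W.T @ (P * m))
    g_m = 2.0 * P * (W @ np.linalg.solve(A, c_gen - c_meas))
    grad = np.zeros_like(mu_states)
    np.add.at(grad, path, g_m.reshape(T, 2))      # pool frames per state
    return grad
```

A gradient step then moves each mean against this gradient (mu_states -= lr * grad), directly lowering the very RMSE reported in the tables, whereas MLE fits the means to the training frames regardless of how the generation algorithm uses them.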

Table 4.3-8. Performances of MLE and MGE trained models on the EMA-PB-2007 corpus using a perfect recognition stage.

Training method   Criteria   no-ctx   L-ctx   ctx-R   L-ctx-R
MLE               [values missing from the source]
MGE               [values missing from the source]

The best rates for the sound signal were found using 8 Gaussian components over the tied states. The right context (ctx-R) again gives the best result for both recognition and synthesis.

Table 4.3-9. Performances of the full inversion using MGE trained models on the EMA-PB-2007 corpus
[table body missing from the source]

Figure 4.3-1 compares the articulatory spaces synthesised using the MLE and MGE trained models in the right context (ctx-R) condition. We see that MGE leads to less centralisation than MLE, very likely in relation with less smoothing and better attainment of the vowel and consonant targets.
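One simple way to quantify such centralisation, offered here only as an illustrative measure and not as the one used in the thesis, is the per-coordinate spread of the synthesised articulatory space relative to the measured one:

```python
import numpy as np

def centralisation_ratio(measured, synthesised):
    """Ratio of synthesised to measured per-coordinate spread for
    (T, D) EMA coordinate arrays: 1.0 means the synthesised space is
    as spread out as the measured one; values below 1.0 indicate
    centralisation (trajectories shrunk toward the centre).
    """
    return float((synthesised.std(axis=0) / measured.std(axis=0)).mean())
```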

Figure 4.3-1. Articulatory spaces synthesised using MLE versus MGE trained models, superposed on the measured articulatory space (grey midsagittal articulator contours, measured on the same speaker for a consonant, are plotted to serve as a reference frame)

Figure 4.3-2, which displays an example of trajectories synthesised using MLE and MGE, confirms that the trajectories generated by MGE are closer to the measured ones than those generated by MLE.

[Figure: measured (original) versus MGE- and MLE-synthesised Y trajectories of the jaw, upper lip, lower lip, tongue tip, tongue middle and tongue back during the /a/ /k/ /a/ sequence]

Figure 4.3-2. Sample of synthesised Y-coordinate trajectories of an /aka/ sequence using MGE and MLE trained models, compared to the measured trajectories.