
Chapter 4. Speech inversion evaluation

4.2. Evaluation criteria


\[
\mathrm{Acc} = \frac{N - D - S - I}{N} \times 100\% \qquad (4.2\text{-}5)
\]

where N, S, D and I are the total number of phones, the number of substitution errors, the number of deletion errors, and the number of insertion errors, respectively. The acoustic recognition percentage correct (Correct), which ignores insertion errors, was defined as

\[
\mathrm{Correct} = \frac{N - D - S}{N} \times 100\% \qquad (4.2\text{-}6)
\]

4.2.4. Articulatory spaces

Another interesting way to analyse the performance of an inversion method is to compare visually, for the measured and reconstructed data, the articulatory spaces of the EMA coils, i.e. the regions of the midsagittal plane covered by the six coils over the whole corpus (cf. Figure 4.3-1 below). This visual comparison can be complemented by a quantitative measure of the degree of overlap between the measured and estimated articulatory spaces, as sketched below.
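Such an overlap measure is not specified further in the text; a minimal sketch, assuming each coil's space is approximated by the convex hull of its positions and that numpy and shapely are available, could look as follows:

```python
# Sketch (not from the thesis): overlap between measured and reconstructed
# articulatory spaces of one EMA coil, each approximated by the convex hull
# of its (x, y) midsagittal positions over a whole corpus.
import numpy as np
from shapely.geometry import MultiPoint

def articulatory_overlap(measured_xy: np.ndarray, estimated_xy: np.ndarray) -> float:
    """Intersection-over-union of the two convex hulls (1.0 = perfect overlap)."""
    hull_meas = MultiPoint(measured_xy.tolist()).convex_hull
    hull_est = MultiPoint(estimated_xy.tolist()).convex_hull
    union = hull_meas.union(hull_est).area
    return hull_meas.intersection(hull_est).area / union if union > 0 else 0.0
```

A value close to 1 indicates that the estimated coil positions cover the same region as the measured ones; a low value reveals a shrunken or displaced articulatory space.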

4.2.5. Articulatory recognition

4.2.5.1 Method

When original articulatory data are not available, in particular in the case of the inversion of a new speaker using an acoustic adaptation stage (see Chapter 5), the RMSE criterion cannot be used. In such a case, an interesting alternative way to evaluate estimated articulatory trajectories is to determine how well they can be recognised by an automatic “articulatory recognition” system trained on the original data. Engwall (2006) proposes an articulatory classifier to evaluate the results of speech inversion: in addition to the correlation coefficients and the RMS error, he presents classification scores summarised as the percentage of correctly classified phonemes and places of articulation, as performance for different phoneme groups, and as confusion matrices. Tepperman et al. (2008) present hidden articulator Markov models, trained on articulatory representations of phone-level transcriptions, to generate articulatory confidence measures and recognition-based features. For this purpose, we have trained an HMM-based phonetic decoder on the articulatory data of the reference speaker PB.

It is expected that phonemes differing only by voicing or velum position – characteristics not explicitly measured by our EMA setup (no velum coil was available in our recording setup) – cannot be well recognised. Therefore, contrary to the acoustic recognition stage, which determines phonemes, this articulatory recognition procedure was designed to recognise articulatory phoneme classes, such as /p b m/, /k g/, etc., whose members cannot be distinguished by their main articulatory characteristics. Accordingly, we defined 16 clusters of French phonemes (cf. Table 4.2-1) and used them as articulatory phoneme classes for the articulatory recognition. In addition, two extra phoneme classes were used: one for the schwa and the short pause, and the other for the long pause at sentence boundaries. Finally, these 18 articulatory phoneme classes were used to train the models and to recognise the articulatory trajectories for both the EMA-PB-2007 and EMA-PB-2009 corpora.

Table 4.2-1. Articulatory phoneme classes used to train the articulatory models and to recognise the articulatory trajectories

Phoneme class        Phonemes

Vowels
  Open               a ɛ ɛ̃
  Mid-front          ø œ œ̃
  Front              y
  Close              e i
  Mid-back           o ɔ ɑ̃ ɔ̃
  Back               u

Consonants
  Labial             p b m
  Alveolar           t d s z n
  Fricative          f v
  Post-alveolar      ʃ ʒ
  Velar              k g
  Uvular-fricative   ʁ
  Alveolar-lateral   l

Semi-vowels
  Palatal            j
  Labiopalatal       ɥ
  Labiovelar         w
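For illustration, Table 4.2-1 can be turned into a phone-to-class lookup for labelling training data. This is a sketch, not the thesis code; the two extra pause classes are omitted because the text does not name their symbols:

```python
# Phone-to-class mapping implied by Table 4.2-1 (the two extra pause classes
# are omitted here, as their symbols are not specified in the text).
PHONEME_CLASSES = {
    "Open":             ["a", "ɛ", "ɛ̃"],
    "Mid-front":        ["ø", "œ", "œ̃"],
    "Front":            ["y"],
    "Close":            ["e", "i"],
    "Mid-back":         ["o", "ɔ", "ɑ̃", "ɔ̃"],
    "Back":             ["u"],
    "Labial":           ["p", "b", "m"],
    "Alveolar":         ["t", "d", "s", "z", "n"],
    "Fricative":        ["f", "v"],
    "Post-alveolar":    ["ʃ", "ʒ"],
    "Velar":            ["k", "g"],
    "Uvular-fricative": ["ʁ"],
    "Alveolar-lateral": ["l"],
    "Palatal":          ["j"],
    "Labiopalatal":     ["ɥ"],
    "Labiovelar":       ["w"],
}

# Invert the table to map each phone label to its articulatory class.
PHONE_TO_CLASS = {p: c for c, phones in PHONEME_CLASSES.items() for p in phones}
```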

The HMM-based articulatory recognition system was built using a procedure similar to the one described in Chapter 2. The same articulatory feature vectors used for inversion are used here (the x and y coordinates of the six active coils with their first time derivatives). Various contextual schemes were tested: articulatory phoneme classes without context (no-ctx), with left context (L-ctx), with right context (ctx-R), and with both left and right contexts (L-ctx-R). Left-to-right, 3-state phoneme class HMMs with a mixture of 8 Gaussians per state and diagonal covariance matrices were used; a sketch of this topology is given below. The training was performed using the Expectation-Maximization (EM) algorithm based on the Maximum Likelihood (ML) criterion.
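As an illustration of this model topology (a sketch only: the actual system was built with an HTK-style procedure, whereas this uses the hmmlearn package):

```python
# Sketch of one articulatory phoneme-class model: left-to-right, 3 states,
# 8 diagonal-covariance Gaussians per state. hmmlearn stands in here for the
# toolkit actually used in the thesis.
import numpy as np
from hmmlearn.hmm import GMMHMM

def make_class_hmm() -> GMMHMM:
    model = GMMHMM(n_components=3, n_mix=8, covariance_type="diag",
                   init_params="mcw", params="stmcw", n_iter=20)
    # Left-to-right topology: start in state 0; EM keeps zero transitions
    # at zero, so the structure is preserved during training.
    model.startprob_ = np.array([1.0, 0.0, 0.0])
    model.transmat_ = np.array([[0.5, 0.5, 0.0],
                                [0.0, 0.5, 0.5],
                                [0.0, 0.0, 1.0]])
    return model

# Training one class: stack the 24-dimensional frames (12 coordinates plus 12
# first derivatives) of all its tokens into X and call model.fit(X, lengths),
# where lengths holds the frame count of each token.
```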

The performance of this system was evaluated on the articulatory data of the reference speaker PB, using the 5-fold cross-validation procedure described previously. The articulatory recognition accuracy (AccArt) was defined as

\[
\mathrm{Acc_{Art}} = \frac{N - D - S - I}{N} \times 100\% \qquad (4.2\text{-}7)
\]

where N, S, D and I are the total number of phones, the number of substitution errors, the number of deletion errors, and the number of insertion errors respectively.

The percentage correct (CorrectArt) was defined as

\[
\mathrm{Correct_{Art}} = \frac{N - D - S}{N} \times 100\% \qquad (4.2\text{-}8)
\]

Notice that this measure ignores insertion errors.
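Both scores can be computed from a minimum-edit-distance alignment between the reference and the recognised class sequences (as done, e.g., by HTK's HResults tool). A self-contained sketch, with hypothetical function names:

```python
# Sketch: count substitutions (S), deletions (D) and insertions (I) from a
# minimum-edit-distance alignment, then apply Eqs. (4.2-7) and (4.2-8).
from typing import List, Tuple

def align_counts(ref: List[str], hyp: List[str]) -> Tuple[int, int, int]:
    """Return (S, D, I) for aligning the reference ref with the hypothesis hyp."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = (edits, S, D, I) for ref[:i] versus hyp[:j]
    dp = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        e, s, d, ins = dp[i - 1][0]
        dp[i][0] = (e + 1, s, d + 1, ins)            # only deletions
    for j in range(1, m + 1):
        e, s, d, ins = dp[0][j - 1]
        dp[0][j] = (e + 1, s, d, ins + 1)            # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = 0 if ref[i - 1] == hyp[j - 1] else 1
            e, s, d, ins = dp[i - 1][j - 1]
            best = (e + c, s + c, d, ins)            # match or substitution
            e, s, d, ins = dp[i - 1][j]
            best = min(best, (e + 1, s, d + 1, ins)) # deletion
            e, s, d, ins = dp[i][j - 1]
            best = min(best, (e + 1, s, d, ins + 1)) # insertion
            dp[i][j] = best
    _, S, D, I = dp[n][m]
    return S, D, I

def articulatory_scores(ref: List[str], hyp: List[str]) -> Tuple[float, float]:
    """Return (AccArt, CorrectArt) in percent for one recognised sequence."""
    S, D, I = align_counts(ref, hyp)
    N = len(ref)
    return (N - D - S - I) / N * 100.0, (N - D - S) / N * 100.0
```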

4.2.5.2 Baseline

To evaluate the articulatory trajectories generated from the acoustic signal of the reference speaker PB or of any other speaker, independently of the inversion mapping approach used, we need baseline results to serve as a reference. We have used the articulatory recognition results obtained with the right-context (ctx-R) models for the different corpora.

EMA-PB-2007 corpus

Table 4.2-2 shows the articulatory recognition rates (percent correct and accuracy) of the measured articulatory trajectories of the EMA-PB-2007 corpus. The best performance was obtained using context-dependent models (with right context) and a bigram language model of phoneme classes trained on the “Le Monde” corpus. In this case, the recognition accuracy (AccArt) was 84.84%.

Table 4.2-2. Articulatory recognition rates of the measured trajectories of EMA-PB-2007 corpus (percent correct and accuracy)

             no-ctx    L-ctx    ctx-R    L-ctx-R
CorrectArt    80.39    90.53    89.27      89.96
AccArt        79.29    84.15    84.84      80.08

EMA-PB-2009 corpus

Table 4.2-3 displays the articulatory recognition rates of the measured articulatory trajectories of the EMA-PB-2009 corpus, using HMMs trained on the same corpus and with the same structure as above. The best performance was obtained using context-dependent models (with right context) and a bigram language model of phoneme classes trained on the “Le Monde” corpus. In this case, the recognition accuracy (AccArt) was 82.47%. These articulatory HMMs are used to evaluate all articulatory trajectories generated from models trained on the EMA-PB-2009 corpus, independently of the inversion mapping approach used.

Table 4.2-3. Articulatory recognition rates of the measured trajectories of EMA-PB-2009 corpus

             no-ctx    L-ctx    ctx-R    L-ctx-R
CorrectArt    68.22    87.42    87.47      89.54
AccArt        66.91    82.22    82.47      77.36

MOCHA-TIMIT corpus

Table 4.2-4 displays the articulatory recognition rates of the measured articulatory trajectories of speaker fsew0 of the MOCHA-TIMIT corpus. Figure 4.2-3 and Figure 4.2-4 show the hierarchical clustering, based on articulatory Mahalanobis distances, of the vowels and consonants respectively. Based on these dendrograms, we defined 8 vowel clusters and 12 consonant clusters (a sketch of this clustering step is given after the figures).

[Dendrogram over the vowels i: ə eɪ ɪə u j ə ɪ ʊ oʊ ɚ ɑ ʌ ɒ æ aɪ aʊ e ɛɘ ɔɪ ɔ]

Figure 4.2-3. Articulatory vowel clusters for speaker fsew0

[Dendrogram over the consonants d t n s z ð θ f v h l r g k ŋ w b p m ʤ ʧ ʃ ʒ]

Figure 4.2-4. Articulatory consonant clusters for speaker fsew0
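The clustering computation itself is not detailed in the text; a plausible sketch using scipy, assuming one mean articulatory feature vector per phone and average linkage (both assumptions), is:

```python
# Sketch (assumed, not the thesis code): hierarchical clustering of phones
# from Mahalanobis distances between their mean articulatory feature vectors.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_phones(phone_means: np.ndarray, n_clusters: int) -> np.ndarray:
    """phone_means: (n_phones, n_features). Returns one cluster id per phone."""
    # pdist estimates the inverse covariance from the input rows; it can also
    # be supplied explicitly via the VI argument if estimated on all frames.
    d = pdist(phone_means, metric="mahalanobis")
    z = linkage(d, method="average")   # the dendrogram behind Figs. 4.2-3/4
    return fcluster(z, t=n_clusters, criterion="maxclust")

# e.g. cluster_phones(vowel_means, 8) and cluster_phones(consonant_means, 12)
```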

The best performance was obtained using context-dependent models (with right context) and a bigram language model of phoneme classes trained on the training corpus. In this case, the recognition accuracy (AccArt) was 65.02%. Compared to the results found on the French corpora, the recognition accuracy on MOCHA-TIMIT is lower by more than 15 points. This difference may be due to the re-attachment of the velum and tongue middle coils during the recording (see Richmond, 2009).

Table 4.2-4. Articulatory recognition rates of the measured trajectories of MOCHA-TIMIT corpus

             no-ctx    L-ctx    ctx-R    L-ctx-R
CorrectArt    44.61    66.61    70.51      75.08
AccArt        43.57    61.97    65.02      61.85

The right-context models trained on the measured trajectories, which give the best results, are used as the baseline to evaluate the recognition of the reconstructed trajectories for that speaker.