
The performance metric adopted in this work is the FOM (see Section 6.3.1), which is the quantity we seek to optimize. The results are presented in terms of ROC curves showing the system performance at different operating points.

The baseline system is the one where the KWS search is performed on the index before applying the confusion model and expanding the search space. The initialized confusion model is then applied and discriminatively trained to maximize the FOM on the training data. Figure 6.2 presents the ROC curves on the training data before training, after 10 iterations, and after 50 iterations of the training algorithm. It can be seen that the area under the ROC curve is indeed increased.
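As a concrete illustration of the objective being optimized, the following sketch computes the FOM under the common word-spotting definition (the average detection rate at 1 through 10 false alarms per hour); the data layout (a flat list of scored putative hits) is an illustrative assumption, not the actual implementation used in this work.

```python
def fom(detections, n_true, hours, max_fa_rate=10):
    """Figure of Merit: average detection rate at 1..max_fa_rate
    false alarms per hour. detections is a list of (score, is_hit)
    pairs, one per putative hit; n_true is the number of reference
    occurrences of the query terms; hours is the audio duration."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    budgets = [k * hours for k in range(1, max_fa_rate + 1)]
    det_rates, hits, fas, b = [], 0, 0, 0
    for _score, is_hit in ranked:
        if is_hit:
            hits += 1
        else:
            fas += 1
            # Record the detection rate each time an FA budget is passed.
            while b < len(budgets) and fas > budgets[b]:
                det_rates.append(hits / n_true)
                b += 1
    while b < len(budgets):  # fewer false alarms than the budget allows
        det_rates.append(hits / n_true)
        b += 1
    return sum(det_rates) / len(det_rates)
```

Sweeping the decision threshold over the ranked detections traces out the ROC curve; the FOM summarizes the low-false-alarm portion of that curve in a single number, which is why maximizing it enlarges the area under the ROC curve as observed in Figure 6.2.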

We have found that it is beneficial to smooth the trained confusion model before applying it to the evaluation data. The smoothing chosen here is the linear interpolation of the

tel-00843589, version 1 - 11 Jul 2013

Figure 6.2: ROC curves on training data

Figure 6.3: ROC curves on evaluation data

trained confusion model weights with a null model, which assigns zero probability to substitutions, deletions and insertions and probability one to correct phoneme recognitions. The interpolation factor was heuristically set to 0.5. The curve for the smoothed trained model in Figure 6.3 corresponds to this interpolated model. Some degradation is seen in the low false alarm region (fewer than 1 false alarm per hour per query term), but beyond it an improvement is observed, which grows as the number of false alarms increases.
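The interpolation step described above can be sketched as follows; the dictionary representation of the confusion weights (with `'-'` marking the empty symbol) is illustrative, not the actual data structure used in the thesis.

```python
def smooth_confusion(trained, lam=0.5):
    """Interpolate trained confusion probabilities with the null model,
    which assigns probability 1 to correct recognitions and 0 to
    substitutions, deletions and insertions. Keys are (reference,
    hypothesis) phoneme pairs; '-' denotes the empty symbol, so
    (p, '-') is a deletion and ('-', p) an insertion."""
    smoothed = {}
    for (ref, hyp), p in trained.items():
        null_p = 1.0 if ref == hyp and ref != '-' else 0.0
        smoothed[(ref, hyp)] = lam * p + (1.0 - lam) * null_p
    return smoothed
```

With the factor set to 0.5, every correct-recognition weight is pulled halfway towards 1 and every error weight halfway towards 0, which damps the expansion of the search space on unseen data.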

In this region, if a horizontal line is drawn in Figure 6.3, a substantial decrease in the number of false alarms for the same hit rate can be seen when the smoothed trained model is applied. For example, for a hit rate of 0.45, the number of false alarms decreases from 5 (baseline curve) to 2.5 (interpolated trained curve) false alarms per hour per query term. It should be noted that the baseline performance is already fairly acceptable for a phonemic KWS system. Finally, a slight deterioration of the KWS performance is observed on the evaluation set in comparison to the training set. The KWS task seems to be more difficult on the evaluation data, which may be due to the different query term list used.


6.6 Conclusion

We have presented a phoneme confusion model for KWS that enables recovery from recognition errors and detection of OOVs. A discriminative approach for training its weights was applied, based on the direct optimization of the FOM. The approach was tested for English; however, it is language-independent and could be applied to other languages, potentially including languages with limited resources where the OOV problem is more extensive. In terms of FOM performance, a promising improvement was observed on the evaluation set.

The confusion model is applied to the index constructed from the output lattices of a phone-loop recognizer. In the future we plan to apply it also to hybrid systems that use both word and phoneme recognition. The confusion model used in this work does not take any phoneme context into account; we aim to use at least bigram phoneme confusion models and expect to achieve better KWS results. We also aim at a better initialization of the confusion model, and we have already started working in this direction. In addition, other, more complex methods to train the parameters of our model during the FOM optimization could be investigated. Last but not least, the confusion model currently just adds a bias to the posterior scores. Instead, more complex confusion models could be developed that operate directly on the acoustic scores from which the posteriors are computed; such a confusion model could represent a multiplicative or additive correction to the acoustic scores.



Conclusion and Perspectives

We close this thesis with a summary of the main findings and contributions of this work. Afterwards, some perspectives for the continuation of the current work are discussed.

7.1 Thesis summary

The first part of this thesis was devoted to the automatic generation of pronunciations for OOVs and of pronunciation variants for the baseforms of a recognition dictionary.

Some innovative SMT-inspired approaches were proposed, and state-of-the-art g2p results were achieved over a difficult baseline. The expanded lexicon was then tested in speech recognition experiments, and some improvements were noticed over a single-pronunciation dictionary baseline. However, adding many variants resulted in a degradation of the ASR performance. This highlighted the well-known problem of phonemic confusability that arises when phonemic variation is added to an ASR system without any constraints. Our interest then turned towards a better understanding of these confusability phenomena and towards finding a way to measure and counterbalance them.

Next, pronunciation entropy was defined: a measure of the confusability introduced by the recognition lexicon into the decoding process. Experiments were conducted in order to observe how this measure is influenced when automatically generated variants are added to the lexicon. We also measured the influence of using frequency counts as weights for the pronunciations of the lexicon, in contrast to no weights at all. However, we did not find a clear correlation between this measure and the error rate of the system.
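The thesis's precise definition of pronunciation entropy involves the decoding process; as a rough lexicon-side illustration only, one can compute the entropy of each word's normalized pronunciation weight distribution and average it over the lexicon. The data layout below is an assumption for the sketch.

```python
import math

def pronunciation_entropy(lexicon):
    """Average per-word entropy (in bits) of the pronunciation weight
    distribution. lexicon maps a word to a list of
    (pronunciation, weight) pairs; weights need not sum to one."""
    total = 0.0
    for word, prons in lexicon.items():
        z = sum(w for _, w in prons)
        probs = [w / z for _, w in prons if w > 0]
        total -= sum(p * math.log2(p) for p in probs)
    return total / len(lexicon)
```

A single-pronunciation lexicon has entropy zero; adding equally weighted variants raises it, while frequency-count weights concentrate the mass on the common variants and lower it.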

The use of frequency counts is a very simplistic way to assign weights to pronunciations, and it is restricted to words that occur in the training set. A more suitable way of choosing pronunciations and training their weights might improve the ASR performance. In this thesis, discriminative training was proposed to train a phoneme confusion model that expands the search space of pronunciations during ASR decoding. The proposed methods offer phonemic variation while keeping the confusability of the system low. Moreover, the additional variation is adapted to a particular data set and is not static as in the g2p conversion task. An FST-based training and decoding scheme was implemented, and an improvement over the baseline FST-based decoder was observed. It is not straightforward,



however, to integrate our confusion model into a non-FST-based decoder.
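Independently of the FST machinery, the expansion performed by a context-independent confusion model can be illustrated directly: each reference phoneme is rewritten into its confusable alternatives (or deleted) and the hypotheses are pruned to a beam. The data layout and function names below are illustrative, not the thesis's implementation.

```python
import heapq
import math

def expand_pronunciations(phones, confusion, beam=5):
    """Expand a phoneme sequence into its most likely confusable
    variants, mimicking composition with a memoryless (context-
    independent) confusion transducer. confusion maps a reference
    phone to a list of (hypothesis_phone, probability) pairs,
    with None as the hypothesis denoting a deletion."""
    hyps = [(0.0, [])]  # (negative log probability, output phones)
    for ref in phones:
        expanded = []
        for cost, out in hyps:
            for hyp, p in confusion.get(ref, [(ref, 1.0)]):
                new_out = out if hyp is None else out + [hyp]
                expanded.append((cost - math.log(p), new_out))
        hyps = heapq.nsmallest(beam, expanded)  # beam pruning
    return [(' '.join(out), math.exp(-cost)) for cost, out in hyps]
```

In the FST formulation the same effect is obtained by composing the pronunciation transducer with a confusion transducer, and the beam corresponds to pruning the composed lattice; embedding that composition in a decoder that is not built on transducers is the difficulty noted above.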

Last but not least, we extended the discriminative training to the KWS task, adopting a new objective function directly related to the KWS performance. There has been growing interest in the KWS task, as the amount of available data grows exponentially and an efficient way of searching it becomes indispensable in order to make the best (or any) use of it. In this work, gains over the baseline were observed when a discriminatively trained phoneme confusion model was used to expand the index of a phoneme-based system.