

7.1.2 Experimental results with SAT and unsupervised MAP and fMLLR

7.1.2.1 Baseline system

The experiments were conducted on the WSJ0 corpus (Section 5.1).

We used conventional 11×39 MFCC features, composed of 39-dimensional MFCCs (with CMN) spliced across 11 frames [−5..5], as baseline features and compared them to the proposed GMMD features. The two SI-DNN models corresponding to these two types of features, 11×39 MFCC and 11×118 GMMD, had identical topology (except for the dimension of the input layer) and were trained on the same training dataset. An auxiliary monophone GMM was also trained on the same data. For training the speaker-independent (SI) DNN on GMMD features, we applied the scheme shown in Figure 6.1.
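The frame-splicing step described above can be sketched as follows. This is a minimal NumPy illustration, not the thesis implementation; the function name and padding choice (repeating edge frames) are assumptions.

```python
import numpy as np

def splice_frames(feats, context=5):
    """Splice each frame with +/-context neighbours; edges are padded by
    repeating the first/last frame. Output dim is (2*context+1) * input dim."""
    T, dim = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * context + 1].reshape(-1)
                     for t in range(T)])

# 100 frames of 39-dim MFCC -> 100 x 429 spliced features (11 x 39 = 429)
mfcc = np.random.randn(100, 39)
spliced = splice_frames(mfcc, context=5)
assert spliced.shape == (100, 429)
```

With context = 5 this reproduces the 11×39 = 429-dimensional spliced input used for the baseline DNN.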

The SI CD-DNN-HMM systems used four 1000-neuron hidden layers and an output layer with approximately 2500 neurons. The neurons in the output layer corresponded to the context-dependent states determined by tree-based clustering in the CD-GMM-HMM. Rectified linear units (ReLUs) were used in the hidden layers. The DNN system was trained using the frame-level CE criterion and the senone alignment generated from the GMM system. The output layer was a softmax layer. The DNNs were trained without layer-by-layer pre-training, and a hidden dropout factor (HDF = 0.2) was applied during training as regularization.

Evaluation was carried out on the two standard WSJ0 evaluation sets: (1) si_et_05 and (2) si_et_20.

7.1.2.2 SAT-DNN AMs

We trained three SAT DNNs on GMMD features, as shown in Figure 7.1. For adapting the auxiliary GMM model we used two different adaptation algorithms: MAP and fMLLR. In addition, we trained a SAT DNN on GMMD features in an "fMLLR+MAP" configuration, where MAP adaptation of the auxiliary GMM model was performed after fMLLR adaptation.
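The MAP re-estimation of the auxiliary GMM's means follows the standard formulation (Gauvain–Lee style mean update). The sketch below is a minimal illustration, not the thesis code; the function name and the value of the prior weight tau are assumptions.

```python
import numpy as np

def map_adapt_means(means, gammas, feats, tau=10.0):
    """MAP re-estimation of GMM means:
        mu_k_new = (tau * mu_k + sum_t gamma_k(t) x_t) / (tau + sum_t gamma_k(t))
    With little adaptation data the prior mean dominates; with much data the
    estimate approaches the ML mean of the adaptation data.
    means: (K, D) prior means; gammas: (T, K) posteriors; feats: (T, D)."""
    occ = gammas.sum(axis=0)                  # (K,) occupation counts
    acc = gammas.T @ feats                    # (K, D) first-order statistics
    return (tau * means + acc) / (tau + occ)[:, None]
```

This interpolation between prior and data is why, as shown later, MAP keeps improving as the adaptation set grows while fMLLR saturates early.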

For comparison purposes we also trained a DNN on conventional 11×39 MFCC features with fMLLR. All SAT DNNs had similar topology and were trained as described in Section 7.1.2.1 for the SI models. Training of the GMM-HMMs and fMLLR adaptation were carried out using the Kaldi speech recognition toolkit [Povey et al., 2011b].
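Once Kaldi has estimated a per-speaker fMLLR transform, applying it is a simple affine map in feature space, as sketched below (an illustration with assumed names, not Kaldi's own API; in the "fMLLR+MAP" configuration, the MAP statistics are then accumulated on these transformed features).

```python
import numpy as np

def apply_fmllr(feats, A, b):
    """Apply a per-speaker fMLLR transform in feature space: x' = A x + b.
    A (D x D) and b (D,) are estimated (by Kaldi in this work) to maximise
    the likelihood of the speaker's data under the model."""
    return feats @ A.T + b
```

Because a single affine transform has relatively few parameters, it can be estimated reliably from very little adaptation data, which matches its early saturation observed later in this section.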

7.1.2.3 Adaptation performance for different DNN-HMMs

Unless explicitly stated otherwise, the adaptation experiments were conducted in an unsupervised mode on the test data, using transcripts obtained from the first decoding pass.

The performance results in terms of word error rate (WER) for the SI and SAT DNN-HMM models are presented in Table 7.1. We can see that the SI DNN-HMM trained on GMMD features performs slightly better than the DNN-HMM trained on 11×39 MFCC features. In all experiments we consider the SI DNN trained on 11×39 MFCC features as the baseline model and compare the performance results of the other models against it.

Table 7.1 WER results for unsupervised adaptation on the WSJ0 evaluation sets si_et_20 and si_et_05. ∆relWER is the relative WER reduction calculated with respect to the baseline SI model. WER for the baseline system is given with confidence intervals corresponding to a 5% level of significance.

Features      Adaptation (SAT)   si_et_20                  si_et_05
                                 WER, %      ∆relWER, %    WER, %      ∆relWER, %
11×39 MFCC    SI                 9.55±0.77   baseline      3.23±0.47   baseline
              fMLLR              8.03        16.0          2.76        14.5
GMMD          SI                 9.28        2.8           3.19        1.2
              MAP                7.60        20.4          2.69        16.8
              fMLLR              7.85        17.8          2.52        22.0
              fMLLR+MAP          7.83        18.0          2.78        13.9

The SAT DNN trained on GMMD features with MAP adaptation demonstrates the best result of all models on the si_et_20 dataset. It outperforms the baseline SI DNN model trained on 11×39 MFCC features, giving a 20.4% relative WER reduction (∆relWER, see Formula 3.34). On the si_et_05 dataset, the SAT DNN trained on GMMD features with fMLLR adaptation performs better than the other models and gives a ∆relWER of 22.0%.
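The relative WER reduction used throughout the table is straightforward to compute; the snippet below reproduces two of the entries (the function name is illustrative; the formula follows the definition of ∆relWER as the reduction relative to the baseline SI model).

```python
def rel_wer_reduction(wer_baseline, wer_adapted):
    """Relative WER reduction (%) with respect to the baseline SI model."""
    return 100.0 * (wer_baseline - wer_adapted) / wer_baseline

# Reproducing two entries of Table 7.1:
print(round(rel_wer_reduction(9.55, 7.60), 1))  # 20.4 (GMMD + MAP, si_et_20)
print(round(rel_wer_reduction(3.23, 2.52), 1))  # 22.0 (GMMD + fMLLR, si_et_05)
```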

Moreover, we can see that in all experiments the SAT models trained on GMMD features, with fMLLR as well as with MAP adaptation, perform better in terms of WER than the SAT model trained on 11×39 MFCC features with fMLLR adaptation.

From the last row of Table 7.1, which shows the results of combining the MAP and fMLLR algorithms, we can conclude that this combination does not lead to additional improvement in performance, and for the si_et_05 test it even degrades the result.

The results presented in Tables 7.2 and 7.3 demonstrate adaptation performance for different speakers from the si_et_20 and si_et_05 datasets, respectively. The bold figures in the tables indicate the best performance over the three acoustic models. Relative WER reduction (∆relWER) is given in comparison with the baseline model, the SI DNN built on 11×39 MFCC. We can observe that all three models behave differently depending on the speaker.

Table 7.2 Performance for unsupervised speaker adaptation on the WSJ0 corpus for speakers from the si_et_20 evaluation set

7.1.2.4 Adaptation performance depending on the size of the adaptation dataset

In this experiment we measured the performance of the proposed adaptation method using different amounts of adaptation data. Adaptation sets of different sizes, from 15 to 210 seconds of speech data per speaker, were used to adapt the SAT DNN-HMM models trained on 11×39 MFCC and GMMD features.

The results are shown in Figure 7.2. Relative WER reduction is given with respect to the baseline SI DNN trained on 11×39 MFCC. We can see that, for adaptation sets of different sizes, the SAT DNN trained on GMMD features with fMLLR performs better than the SAT DNN trained on 11×39 MFCC features with fMLLR.

Table 7.3 Performance for unsupervised speaker adaptation on the WSJ0 corpus for speakers from the si_et_05 evaluation set

In contrast, MAP adaptation reaches the performance of fMLLR adaptation (for MFCC) only when using all the adaptation data. However, the gain from MAP adaptation grows monotonically as the amount of adaptation data increases, while fMLLR adaptation saturates on a small adaptation dataset. The same behavior of the MAP and fMLLR adaptation algorithms is known for GMM-HMM AMs.

We conducted an additional experiment with MAP adaptation, marked in Figure 7.2 with the text "using extra data", in which we added the data from the WSJ0 si_et_ad dataset to the adaptation data. Hence, in this case we performed the adaptation in a semi-supervised mode: the transcriptions for the si_et_ad dataset were assumed to be known, while the transcriptions for si_et_05 were generated from the first decoding pass. The total duration of the adaptation data was approximately 380 seconds per speaker. This result (21.4% relative WER reduction) confirms that the performance of MAP adaptation had not reached its maximum in the previous experiments.