
9.3 Experiments

9.3.4 Comparison of adaptation behavior for BLSTM-CTC and TDNN AMs

In this series of experiments we aim to compare the adaptation behavior of SAT CTC models with that of a different type of neural network AM. For this purpose we chose a TDNN model topology, because such models have been shown to achieve the best performance in many state-of-the-art ASR systems [Peddinti et al., 2015]. These AMs were trained with the CE criterion.

For comparison we used the same set of SI and SAT AMs as before for CTC AMs (see Sections 9.3.1 and 9.3.2), except for #1. All SI and SAT TDNN models were trained in a similar way and have the same model topology; they differ only in the type of input features. These are the same AMs as were described in Sections 8.4.3 and 8.5.2. Note that not all the TDNN AMs explored in those sections are used in the current chapter.

Table 9.2 Summary of unsupervised adaptation results for TDNN AMs on the TED-LIUM corpus

 #   Model   Features                     WER,%
                                   Dev.    Test1   Test2
 2   SI      MFCC                  13.69   11.34   14.38
 3   SI      BN                    12.32   10.48   14.00
 4   SAT     MFCC⊕i-vectors        11.63    9.62   13.28
 5   SAT     BN⊕i-vectors          11.62    9.75   13.30
 6   SAT     BN-fMLLR              10.70    9.28   12.84
 7   SAT     MFCC⊕GMMD#2           11.30    9.75   13.74
 8   SAT     BN⊕GMMD#2             11.07    9.75   13.55
 9   SAT     BN-fMLLR⊕GMMD#2       10.92    9.54   13.27
10   SAT     BN⊕GMMD#6             10.29    9.20   13.04
11   SAT     BN-fMLLR⊕GMMD#6       10.15    9.06   12.84

The results for TDNN AMs are reported in Table 9.2. They include some results from Table 8.4, but we show them here again for clarity and ease of comparison with CTC AMs.

In addition, Figure 9.2 compares the different adaptation algorithms in terms of relative WER reduction for the speakers from the test and development datasets for BLSTM-CTC (Figure 9.2a) and TDNN (Figure 9.2b) AMs. The relative WER reduction is calculated with respect to the SI AMs trained on BN features.

For TDNN AMs we also added to Figure 9.2b the results obtained with SAT AMs in the first decoding pass, because they provide a consistent additional improvement in performance compared to SI AMs.

Table 9.3 shows the relative WER reduction for BLSTM-CTC and TDNN AMs in comparison with the best corresponding SI AMs (#2 for CTC and #3 for TDNN). We can see that the optimal choice of features depends on the AM architecture. For SI AMs, BN features perform better than MFCCs for TDNN AMs, but for CTC AMs the situation is reversed. Also, for SAT CTC and SAT TDNN AMs, the ranking of the systems by WER differs.
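The ∆relWER values follow directly from the WERs of the SI baseline and the adapted system. A minimal sketch of the computation (the function name is illustrative; the example numbers are the Dev. WERs of the SI TDNN on BN features and the SAT TDNN with BN-fMLLR⊕GMMD#6 from Table 9.2):

```python
def rel_wer_reduction(wer_si: float, wer_adapted: float) -> float:
    """Relative WER reduction (%) of an adapted AM over its SI baseline."""
    return (wer_si - wer_adapted) / wer_si * 100.0

# SI TDNN, BN features (Dev. WER 12.32) vs. SAT TDNN, BN-fMLLR⊕GMMD#6 (Dev. WER 10.15)
print(round(rel_wer_reduction(12.32, 10.15), 2))  # 17.61
```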

9.4 Conclusions

This chapter explored how end-to-end AMs can benefit from speaker adaptation and demonstrated that speaker adaptation remains an essential mechanism for improving

Table 9.3 Relative WER reduction (∆relWER) for adapted BLSTM-CTC and TDNN AMs in comparison with the best SI AMs for each AM type (#2 for CTC and #3 for TDNN). The ∆relWER values are calculated based on the results from Tables 9.1 and 9.2.

 #   Features                CTC: ∆relWER,%           TDNN: ∆relWER,%
                             Dev.    Test1   Test2    Dev.    Test1   Test2
 4   MFCC⊕i-vectors           2.20    6.36    0.42     5.60    8.21    5.14
 5   BN⊕i-vectors            -1.97   -1.88   -1.13     5.68    6.97    5.00
 6   BN-fMLLR                 5.75    1.79    2.54    13.15   11.45    8.29
 7   MFCC⊕GMMD#2              9.54    8.60    0.78     8.28    6.97    1.86
 8   BN⊕GMMD#2               11.73    9.14    1.91    10.15    6.97    3.21
 9   BN-fMLLR⊕GMMD#2         11.96   11.20    4.03    11.36    8.97    5.21
10   BN⊕GMMD#6               11.66    9.41    3.18    16.48   12.21    6.86
11   BN-fMLLR⊕GMMD#6         13.63   11.02    4.81    17.61   13.55    8.29

the performance of an ASR system in the new end-to-end speech recognition paradigm.

Experimental results on the TED-LIUM corpus showed that in an unsupervised adaptation mode, the adaptation and data augmentation techniques can provide approximately a 10–20% relative WER reduction on different adaptation sets, compared to the SI BLSTM-CTC system built on filter-bank features. On average, the best results for BLSTM-CTC AMs were obtained by using GMMD features and MAP adaptation, which can be further slightly improved by combination with the fMLLR adaptation technique.

We found that the type of neural network AM architecture influences the adaptation performance. The comparison with the TDNN-CE AMs showed that for these models, in contrast to BLSTM-CTC AMs, MAP adaptation using GMMD features outperforms fMLLR only when a SAT model is used in the first decoding pass to obtain transcriptions for adaptation.

The obtained results also allow us to compare TDNN-CE and BLSTM-CTC AMs under realistic conditions, when speaker adaptation is applied; this is important because end-to-end and hybrid AMs are usually compared as incomplete, unadapted systems. The best SI TDNN-CE AM outperforms the best SI BLSTM-CTC AM by 1–7% relative WER reduction on the different test sets. For the best SAT AMs, this gap between TDNN-CE and BLSTM-CTC AMs increases and reaches 5–13% relative WER reduction.


Figure 9.2 Per-speaker relative WER reduction (∆relWER) for speakers from the test and development datasets of the TED-LIUM corpus for different adaptation algorithms with respect to the SI AMs trained on BN features (#3). Adaptation is performed in an unsupervised mode. Results are ordered in ascending order of WER reduction for each AM.

Chapter 10

GMMD feature analysis

The objective of this chapter is to analyze the proposed GMMD features and the adaptation algorithm in order to better understand their nature and properties at different levels.

10.1 Phoneme posterior based features

Phoneme posterior based (PPB) features are obtained from time-dependent phoneme posterior scores [Gollan and Bacchiani, 2008; Uebel and Woodland, 2001] by computing arc posteriors from the output lattices of the decoder. These features contain more information about the decoding process than the posterior probabilities from neural networks. We use this type of features to analyze the adaptation performance of TDNN acoustic models.

Let $\{ph_1, \ldots, ph_N\}$ be a set of phonemes and the silence model. For each time frame $t$ we calculate $p_t^n$ — the confidence score of phoneme $ph_n$ ($1 \le n \le N$) at time $t$ in the decoding lattice — by calculating arc posterior probabilities. The forward-backward algorithm is used to calculate these arc posterior probabilities from the lattice as follows:¹

$$P(l|O) = \frac{\sum_{q \in Q_l} P_{ac}(O|q)^{\frac{1}{\lambda}} P_{lm}(w)}{P(O)}, \qquad (10.1)$$

where $\lambda$ is the scale factor (the optimal value of $\lambda$ is found empirically by minimizing the WER of the consensus hypothesis [Mangu et al., 2000]); $q$ is a path through the lattice corresponding to the word sequence $w$; $Q_l$ is the set of paths passing through arc $l$; $P_{ac}(O|q)$ is the acoustic likelihood; $P_{lm}(w)$ is the language model probability; and $P(O)$ is the overall likelihood of all paths through the lattice.

¹We already used the lattice scores obtained from arc posterior probabilities to improve MAP adaptation in Section 7 (Formula (7.2)), but here we provide the description again as a reminder.
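The forward-backward computation of arc posteriors in (10.1) can be sketched on a toy lattice. This is a minimal illustration, not the decoder's actual implementation: the lattice is a hypothetical hand-built DAG, and each arc weight stands for the already-scaled combined score $P_{ac}(O|q)^{1/\lambda} P_{lm}(w)$ of that arc.

```python
from collections import defaultdict

# Toy lattice: arcs (start_node, end_node, phoneme_label, combined_score),
# listed in topological order of their start nodes.
arcs = [
    (0, 1, "ph_a", 0.6), (0, 1, "ph_b", 0.4),
    (1, 2, "ph_a", 0.7), (1, 2, "ph_c", 0.3),
]

def arc_posteriors(arcs, start=0, end=2):
    """Forward-backward over the lattice: returns P(l|O) for every arc l."""
    fwd, bwd = defaultdict(float), defaultdict(float)
    fwd[start], bwd[end] = 1.0, 1.0
    for s, e, _, w in arcs:              # forward pass (topological order)
        fwd[e] += fwd[s] * w
    for s, e, _, w in reversed(arcs):    # backward pass (reverse order)
        bwd[s] += w * bwd[e]
    z = fwd[end]                         # P(O): total likelihood of all paths
    return [(lab, fwd[s] * w * bwd[e] / z) for s, e, lab, w in arcs]

for lab, p in arc_posteriors(arcs):
    print(lab, round(p, 3))
```

The posteriors of all arcs crossing any given time slice sum to one, which is what makes them usable as per-frame phoneme confidence scores in the next step.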

For a given frame $o_t$ at time $t$ we calculate the probability $P(o_t \in ph_n)$ that it belongs to phoneme $ph_n$, using lattices obtained from the first decoding pass:

$$p_t^n = P(o_t \in ph_n) = \sum_{l \in S_n(o_t)} P(l|O), \qquad (10.2)$$

where $S_n(o_t)$ is the set of all arcs corresponding to the phoneme $ph_n$ in the lattice at time $t$, and $P(l|O)$ is the posterior probability of arc $l$ in the lattice.


Figure 10.1 Scheme of phoneme posterior based (PPB) feature extraction

The obtained probability $P(o_t \in ph_n)$ of frame $o_t$ belonging to phoneme $ph_n$ is the coordinate value $p_t^n$ of the new feature vector $p_t$. Thus for a given acoustic feature vector $o_t$ at time $t$ we obtain a new feature vector $p_t$:

$$p_t = \left(p_t^1, \ldots, p_t^N\right), \qquad (10.3)$$

where $N$ is the number of phones in the phoneme set used in the ASR system.
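Equations (10.2)–(10.3) amount to summing, per frame, the posteriors of the arcs of each phoneme. A small sketch under assumed inputs (the phoneme inventory and the per-frame list of active arcs with their posteriors are hypothetical; in practice both come from the first-pass decoding lattice):

```python
# Hypothetical phoneme set, including the silence model; its size is N.
PHONEMES = ["ph_a", "ph_b", "ph_c"]

def ppb_vector(active_arcs):
    """Build the PPB feature vector p_t for one frame.

    active_arcs: list of (phoneme_label, arc_posterior) pairs for the arcs
    crossing frame t, i.e. the union of the sets S_n(o_t) in Eq. (10.2).
    """
    scores = {ph: 0.0 for ph in PHONEMES}
    for label, post in active_arcs:       # sum P(l|O) over arcs of each ph_n
        scores[label] += post
    return [scores[ph] for ph in PHONEMES]

# A frame where two ph_a arcs and one ph_c arc overlap in the lattice:
print(ppb_vector([("ph_a", 0.5), ("ph_a", 0.2), ("ph_c", 0.3)]))
```

Stacking these vectors over all frames yields the PPB feature stream of dimension $N$ used for the analysis.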

10.2 Visual analysis using t-SNE