

4.3.6 Auxiliary features

Using auxiliary features, such as i-vectors [Gupta et al., 2014; Karanasou et al., 2014; Saon et al., 2013; Senior and Lopez-Moreno, 2014], is another widely used approach, in which the acoustic feature vectors are augmented with additional speaker-specific or channel-specific features, computed for each speaker or utterance at both training and test stages. Another example of auxiliary features is the use of speaker-dependent BN features obtained from a speaker-aware DNN in a far-field speech recognition task [Liu et al., 2014]. An alternative method is adaptation with speaker codes [Abdel-Hamid and Jiang, 2013; Huang et al., 2016; Xue et al., 2014c]. In [Murali Karthick et al., 2015] speaker-specific subspace vectors, obtained with an SGMM model, are used as auxiliary features to adapt CNN AMs.

4.3.6.1 I-vectors

The use of speaker identity vectors (or i-vectors) has become a popular approach for speaker adaptation of DNN AMs [Gupta et al., 2014; Karanasou et al., 2014; Miao et al., 2014; Saon et al., 2013; Senior and Lopez-Moreno, 2014]. I-vectors were originally developed for speaker verification and speaker recognition tasks [Dehak et al., 2011], where they have become a very common technique. They capture the relevant information about a speaker's identity in a low-dimensional fixed-length representation [Dehak et al., 2011; Saon et al., 2013].

I-vector extraction

The acoustic feature vector $\mathbf{o}_t \in \mathbb{R}^D$ can be considered as a sample generated by a universal background model (UBM), represented as a GMM with $K$ diagonal-covariance Gaussians [Dehak et al., 2011; Saon et al., 2013]:

$$\mathbf{o}_t \sim \sum_{k=1}^{K} c_k \, \mathcal{N}(\cdot;\, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad (4.23)$$

where $c_k$ are the mixture weights, $\boldsymbol{\mu}_k$ are the means and $\boldsymbol{\Sigma}_k$ are the diagonal covariances. The acoustic feature vector $\mathbf{o}_t(s)$, belonging to a given speaker $s$, is described by the distribution:

$$\mathbf{o}_t(s) \sim \sum_{k=1}^{K} c_k \, \mathcal{N}(\cdot;\, \boldsymbol{\mu}_k(s), \boldsymbol{\Sigma}_k), \qquad (4.24)$$

where $\boldsymbol{\mu}_k(s)$ are the means of the GMM adapted to speaker $s$. It is assumed that there is a linear dependence between the SD means $\boldsymbol{\mu}_k(s)$ and the SI means $\boldsymbol{\mu}_k$, which can be expressed in the form:

$$\boldsymbol{\mu}_k(s) = \boldsymbol{\mu}_k + \mathbf{T}_k \mathbf{w}(s), \qquad k = 1, \ldots, K, \qquad (4.25)$$

where $\mathbf{T}_k \in \mathbb{R}^{D \times M}$ is a factor loading matrix corresponding to component $k$, and $\mathbf{w}(s)$ is the i-vector corresponding to speaker $s$². Each $\mathbf{T}_k$ contains $M$ bases that span the subspace of the important variability in the component mean vector space, corresponding to component $k$.

² More strictly, the i-vector is estimated as the mean of the distribution of the random variable $\mathbf{w}(s)$.

A detailed description of how to estimate the factor loading matrices, given the training data $\{\mathbf{o}_t\}$, and how to estimate the i-vectors $\mathbf{w}(s)$, given $\mathbf{T}_k$ and the speaker data $\{\mathbf{o}_t(s)\}$, can be found, for example, in [Dehak et al., 2011; Saon et al., 2013].
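For concreteness, the following is a minimal sketch of the usual point-estimate recipe, assuming a trained diagonal-covariance UBM and factor loading matrices are already available: it accumulates the zeroth-order and centered first-order statistics and returns the posterior mean of $\mathbf{w}(s)$. Variable names and the simplified posterior computation are illustrative, not taken from the cited papers.

```python
import numpy as np

def estimate_ivector(frames, weights, means, variances, T):
    """Toy point estimate of the i-vector w(s) for one speaker.

    frames    : (N, D) acoustic feature vectors o_t(s) of the speaker
    weights   : (K,)   UBM mixture weights c_k
    means     : (K, D) UBM means mu_k
    variances : (K, D) diagonal covariances Sigma_k
    T         : (K, D, M) factor loading matrices T_k
    """
    K, D, M = T.shape

    # Frame-level UBM posteriors gamma_t(k), computed in the log domain.
    log_gauss = -0.5 * (np.log(2.0 * np.pi * variances).sum(axis=1)[None, :]
                        + (((frames[:, None, :] - means[None, :, :]) ** 2)
                           / variances[None, :, :]).sum(axis=2))
    log_post = np.log(weights)[None, :] + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)
    gamma = np.exp(log_post)
    gamma /= gamma.sum(axis=1, keepdims=True)                 # (N, K)

    # Zeroth-order and centered first-order statistics.
    N_k = gamma.sum(axis=0)                                   # (K,)
    F_k = gamma.T @ frames - N_k[:, None] * means             # (K, D)

    # Posterior of w(s): precision L, mean w = L^{-1} sum_k T_k' Sigma_k^{-1} F_k.
    L = np.eye(M)
    b = np.zeros(M)
    for k in range(K):
        sigma_inv_T = T[k] / variances[k][:, None]            # Sigma_k^{-1} T_k
        L += N_k[k] * T[k].T @ sigma_inv_T
        b += sigma_inv_T.T @ F_k[k]
    return np.linalg.solve(L, b)                              # i-vector w(s)
```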

Integration of i-vectors into a DNN model

Various methods of i-vector integration into a DNN AM have been proposed in the literature.

The most common approach [Gupta et al., 2014; Saon et al., 2013; Senior and Lopez-Moreno, 2014] is to estimate an i-vector for each speaker (or utterance) and then to concatenate it with the acoustic feature vectors belonging to the corresponding speaker (or utterance). The concatenated vectors are then fed to the DNN for training, as shown in Figure 4.9.

At the test stage, i-vectors for the test speakers also have to be estimated and input to the DNN in the same manner.

Figure 4.9 Using i-vectors for DNN adaptation: concatenation of i-vectors with input features (stacked acoustic vectors plus i-vector).

Unlike acoustic feature vectors, which are specific to each frame, an i-vector is the same for the whole group of acoustic features to which it is appended. For example, an i-vector can be computed for each utterance, as in [Senior and Lopez-Moreno, 2014], or estimated from all the data of a given speaker, as in [Saon et al., 2013]. I-vectors encode those effects in the acoustic signal to which an ASR system should be invariant: speaker, channel and background noise. Providing the DNN input with information about these factors makes it possible for the DNN to normalize the acoustic signal with respect to them.

An alternative approach to i-vector integration into the DNN topology is presented in [Miao et al., 2014, 2015b], where an input acoustic feature vector is normalized through a linear combination of it with a speaker-specific normalization vector obtained from the i-vector, as shown in Figure 4.10. First, an original SI DNN is built. Second, a small adaptation network is built, keeping the SI DNN fixed. This adaptation network is trained to take the i-vectors $\mathbf{i}_s$ as input and to generate SD linear shifts $\mathbf{y}_s$ to the original DNN feature vectors.

Figure 4.10 Using i-vectors for DNN adaptation: an adaptation network maps the i-vector $\mathbf{i}_s$ to a shift $\mathbf{y}_s$ that is added to the acoustic vectors $\mathbf{o}_t$ before the SI network.

After the adaptation network is trained, the SD output vector $\mathbf{y}_s$ is added to every original feature vector $\mathbf{o}_t$ from speaker $s$:

$$\tilde{\mathbf{o}}_t = \mathbf{o}_t \oplus \mathbf{y}_s, \qquad (4.26)$$

where $\oplus$ denotes element-wise addition. Finally, the parameters of the original DNN model are updated in the new feature space, while keeping the adaptation network unchanged. This gives a SAT DNN model. Similar approaches have been studied in [Goo et al., 2016; Lee et al., 2016]. I-vector-dependent feature space transformations were also proposed in [Li and Wu, 2015a].
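A minimal PyTorch-style sketch of the adaptation-network scheme above; layer sizes, activation choices and names are assumptions for illustration, not the configuration of the cited papers:

```python
import torch
import torch.nn as nn

class IVectorShiftAdapter(nn.Module):
    """Small adaptation network (Figure 4.10): maps the speaker i-vector i_s
    to a shift y_s that is added element-wise to every acoustic feature
    vector o_t of that speaker (Eq. 4.26) before the SI network."""

    def __init__(self, ivector_dim=100, feat_dim=440, hidden_dim=512):
        super().__init__()
        self.adapt_net = nn.Sequential(
            nn.Linear(ivector_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, o_t, i_s):
        # o_t: (batch, feat_dim) frames, i_s: (batch, ivector_dim) i-vectors
        y_s = self.adapt_net(i_s)     # speaker-dependent shift y_s
        return o_t + y_s              # Eq. 4.26

# Training follows the three stages in the text: (1) train the SI DNN;
# (2) train only adapt_net with the SI DNN frozen; (3) retrain the SI DNN
# on the shifted features with adapt_net frozen (SAT).
```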

In [Dimitriadis et al., 2016] two different i-vector representations are presented: (1) noise i-vectors, estimated only from the noise component of the noisy speech signal; and (2) noisy i-vectors, estimated from the noisy speech signal itself. Another type of factorized i-vectors is described in [Karanasou et al., 2014], where speaker and environmental i-vectors are used together.

In [Garimella et al., 2015] causal i-vectors are computed using all previous utterances of a given speaker to accumulate the sufficient statistics. Exponentially decaying weights are applied to the previous speech frames to give more importance to the recent speech signal, so that the i-vectors can capture more recent speaker and channel characteristics.
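As an illustration only, the sketch below applies an exponential decay once per utterance when accumulating the UBM statistics; the cited paper weights individual frames, and the decay value and granularity here are assumptions.

```python
import numpy as np

class DecayingIvectorStats:
    """Running UBM statistics for 'causal' i-vectors: statistics from older
    speech are exponentially down-weighted before each new utterance is
    added, so the resulting i-vector tracks recent speaker and channel
    characteristics."""

    def __init__(self, num_components, feat_dim, decay=0.9):
        self.decay = decay                              # illustrative value
        self.N = np.zeros(num_components)               # zeroth-order stats
        self.F = np.zeros((num_components, feat_dim))   # centered first-order stats

    def add_utterance(self, gamma, frames, means):
        # gamma: (T, K) UBM posteriors, frames: (T, D) features, means: (K, D)
        n_utt = gamma.sum(axis=0)
        f_utt = gamma.T @ frames - n_utt[:, None] * means
        self.N = self.decay * self.N + n_utt
        self.F = self.decay * self.F + f_utt
        # (self.N, self.F) are then passed to the usual i-vector estimator.
```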

Attribute-based i-vectors are explored in [Zheng et al., 2016] for accented speech recognition.

4.3.6.2 D-vectors

Deep vectors or d-vectors were originally proposed for text-dependent speaker verification [Variani et al., 2014]. They also describe speaker information and are derived from a DNN trained to predict speakers at the frame level. To extract a d-vector, a DNN is first trained to classify speakers at the frame level; then the averaged output activations of the last hidden layer are used as the speaker representation. Paper [Li and Wu, 2015b] explores the idea of augmenting DNN inputs with d-vectors derived from LSTM projected (LSTMP) RNNs [Sak et al., 2014]. Both the d-vector extraction and the senone prediction are LSTMP-based, and the whole network is jointly optimized with multi-task learning.
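A minimal sketch of the basic frame-level variant of d-vector extraction (not the LSTMP version); layer sizes and names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DVectorExtractor(nn.Module):
    """Frame-level speaker classifier; after training, the softmax layer is
    discarded and the last hidden layer's activations, averaged over the
    utterance, serve as the d-vector."""

    def __init__(self, feat_dim=440, hidden_dim=256, num_speakers=1000):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.speaker_logits = nn.Linear(hidden_dim, num_speakers)  # training only

    def forward(self, frames):            # training: frame-level speaker posteriors
        return self.speaker_logits(self.hidden(frames))

    def d_vector(self, frames):           # frames: (T, feat_dim)
        with torch.no_grad():
            return self.hidden(frames).mean(dim=0)   # averaged last hidden layer
```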

4.3.6.3 Speaker codes

Adaptation with speaker codes, proposed in [Abdel-Hamid and Jiang, 2013; Xue et al., 2014c], relies on a joint training procedure for (1) an adaptation NN, trained on the whole training set, and (2) small speaker codes, estimated for each speaker using only data from that speaker, as shown in Figure 4.11. The speaker code is fed to the adaptation NN to form a nonlinear transformation in the feature space that normalizes speaker variations; this transformation is controlled by the speaker code. During adaptation, a new speaker code is learned for a new speaker by optimizing the performance on the adaptation data.
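A minimal PyTorch-style sketch of a simplified variant of this scheme, in which the code is fed once at the input of the adaptation network (the original work injects it into several adaptation layers); dimensions and names are assumptions:

```python
import torch
import torch.nn as nn

class SpeakerCodeAdaptation(nn.Module):
    """Adaptation network below the SI DNN: each acoustic frame is
    transformed conditioned on a small learned per-speaker code; the codes
    and the adaptation network are trained jointly (Figure 4.11)."""

    def __init__(self, num_speakers, feat_dim=440, code_dim=50, hidden_dim=512):
        super().__init__()
        self.codes = nn.Embedding(num_speakers, code_dim)     # one code per speaker
        self.adapt_net = nn.Sequential(
            nn.Linear(feat_dim + code_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, frames, speaker_ids):
        # frames: (batch, feat_dim), speaker_ids: (batch,) integer speaker labels
        code = self.codes(speaker_ids)                        # (batch, code_dim)
        return self.adapt_net(torch.cat([frames, code], dim=1))

# Adapting to a new speaker: freeze adapt_net and the SI DNN, create a fresh
# code vector, and optimize only that code on the adaptation data.
```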

In [Xue et al., 2014b], instead of stacking an adaptation NN below the initial speaker-independent NN and normalizing speaker features with speaker codes, the speaker codes are fed directly to the hidden layers and the output layer of the initial NN. In [Huang et al., 2016] speaker adaptation with speaker codes was applied to RNN-BLSTM AMs.

Figure 4.11 Speaker codes for DNN adaptation: the speaker code is fed, together with the stacked acoustic vectors, to the adaptation layers below the SI network.