
3.3 Acoustic modeling

3.3.2 State distribution modeling in HMM

In this section we consider some of the most common approaches for modeling state probability distributions in HMMs.

3.3.2.1 GMM-HMM

The Gaussian mixture model (GMM) approach has been one of the most common techniques to model state probability distributions (see Formula (3.9)) in ASR systems for many decades.

In a GMM-HMM model, each state $j$ is modeled as a weighted sum of $M_j$ Gaussians:

$$b_j(o_t) = P(o_t \mid j) = \sum_{m=1}^{M_j} \omega_{jm}\, \mathcal{N}\left(o_t; \mu_{jm}, \Sigma_{jm}\right), \quad (3.13)$$

where $\omega_{jm}$ is the weight of the $m$'th component in the mixture for state $j$, such that:

$$\omega_{jm} \geq 0 \quad \text{and} \quad \sum_{m=1}^{M_j} \omega_{jm} = 1. \quad (3.14)$$
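As a concrete illustration, the mixture likelihood of Formula (3.13), under the constraints of Formula (3.14), can be sketched as follows. This is a minimal example assuming diagonal covariance matrices; the function name and interface are illustrative, not taken from any toolkit.

```python
import numpy as np

def gmm_state_log_likelihood(o_t, weights, means, variances):
    """Compute log b_j(o_t) for one HMM state j (Formula (3.13)).

    Assumes diagonal covariances for simplicity (hypothetical sketch).
    o_t       : (d,) acoustic feature vector
    weights   : (M_j,) mixture weights, non-negative, summing to 1 (Formula (3.14))
    means     : (M_j, d) component mean vectors mu_jm
    variances : (M_j, d) diagonals of the covariance matrices Sigma_jm
    """
    d = o_t.shape[0]
    # Per-component Gaussian log-densities: log N(o_t; mu_jm, Sigma_jm)
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_quad = -0.5 * np.sum((o_t - means) ** 2 / variances, axis=1)
    log_gauss = log_norm + log_quad
    # Weighted mixture combined in the log domain (log-sum-exp for stability)
    a = np.log(weights) + log_gauss
    m = a.max()
    return m + np.log(np.sum(np.exp(a - m)))
```

In practice the computation is done in the log domain, as above, because the per-component densities of high-dimensional feature vectors easily underflow.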

In Formula (3.13), $\mathcal{N}(\cdot\,; \mu, \Sigma)$ is a multivariate Gaussian distribution with mean vector $\mu \in \mathbb{R}^d$ and covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$, where $d$ is the dimensionality of the acoustic feature vector: $o \in \mathbb{R}^d$.

3.3.2.2 Subspace Gaussian mixture models

Subspace Gaussian mixture models (SGMMs) have been proposed in [Burget et al., 2010; Povey et al., 2010, 2011a] as an alternative approach to standard GMM-HMMs for acoustic modeling. All context-dependent HMM states in SGMMs share a common representation, based on the universal background model (UBM)¹. The covariance matrices are shared between states. The mean vectors and mixture weights for each state are specified by a corresponding vector $v_j \in \mathbb{R}^S$, where $S$ is the subspace dimension (typically $S = 50$). In its simplest form, the SGMM can be expressed as follows [Povey et al., 2010]:

$$b_j(o_t) = \sum_{i=1}^{I} \omega_{ji}\, \mathcal{N}(o_t; \mu_{ji}, \Sigma_i), \quad (3.15)$$

$$\mu_{ji} = M_i v_j, \qquad \omega_{ji} = \frac{\exp(w_i^\top v_j)}{\sum_{i'=1}^{I} \exp(w_{i'}^\top v_j)}, \quad (3.16)$$

where the model contains $I$ Gaussians (usually $I \in [200, 2000]$) with covariance matrices $\Sigma_i$ which are shared between states, mixture weights $\omega_{ji}$ and means $\mu_{ji}$; $M_i$ is the mean-projection matrix and $w_i$ the weight-projection vector of the $i$'th Gaussian. The description of model extensions and developed training procedures can be found in [Povey et al., 2011a].
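The derivation of the state-specific means $\mu_{ji}$ and weights $\omega_{ji}$ from the subspace vector $v_j$ can be sketched as below, assuming the standard SGMM parameterization of [Povey et al., 2010] (means as projections of $v_j$, weights as a softmax of $w_i^\top v_j$); the function itself is an illustrative sketch, not a toolkit API.

```python
import numpy as np

def sgmm_state_params(v_j, M, w):
    """Derive per-state SGMM means and mixture weights from v_j.

    Hypothetical sketch of the SGMM parameterization [Povey et al., 2010].
    v_j : (S,) state vector in the subspace
    M   : (I, d, S) mean-projection matrices M_i, one per shared Gaussian
    w   : (I, S) weight-projection vectors w_i
    Returns means mu_ji of shape (I, d) and weights omega_ji of shape (I,).
    """
    means = M @ v_j                 # mu_ji = M_i v_j for every Gaussian i
    logits = w @ v_j                # w_i^T v_j
    logits -= logits.max()          # stabilize the softmax numerically
    weights = np.exp(logits) / np.exp(logits).sum()
    return means, weights
```

The compactness claim is visible here: each state is described by only $S$ numbers in $v_j$, while the large projection matrices $M_i$ and covariances $\Sigma_i$ are shared across all states.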

The strength of SGMMs lies in their compactness. These models require less training data than GMM models to achieve comparable accuracy, and they have been widely used for low-resource and multi-lingual ASR systems [Burget et al., 2010; Lu et al., 2011]. The weakness of SGMMs lies in their implementation complexity compared with conventional GMMs; we will not consider these models further in this thesis.

3.3.2.3 DNN-HMM

The use of neural networks is an alternative approach for AM training, proposed in [Bourlard and Wellekens, 1990; Morgan and Bourlard, 1990], where multilayer perceptrons (MLPs) (feed-forward artificial neural networks (ANNs)) were used to estimate emission probabilities

¹A universal background model (UBM) is a GMM trained on all speech classes pooled together [Povey et al., 2010]. The UBM is a popular approach to alternative-hypothesis modeling in the speaker recognition domain, where a GMM-UBM represents a large mixture of Gaussians covering all speech; it is trained using the EM algorithm on data from multiple speakers, and then adapted to each speaker using maximum a posteriori (MAP) estimation [Bimbot et al., 2004; Reynolds et al., 2000].

for HMMs. In the DNN-HMM approach, state probabilities in HMMs are estimated with a neural network model.

One of the commonly used approaches in state-of-the-art ASR systems is the hybrid DNN-HMM approach, where the dynamics of the speech signal is modeled with HMMs, and the state observation probabilities are estimated with DNN models. In the original approach [Bourlard and Wellekens, 1990] the MLP outputs were used directly for HMMs, but later the need for scaling the obtained probabilities with state priors was shown: the output posterior probabilities from a neural network are scaled using prior class probabilities estimated from the training data, to obtain scaled likelihoods [Bourlard et al., 1992].

In a DNN-HMM system, the outputs of a DNN are the state posteriors $P(s_i \mid o_t)$, which are transformed for decoding into pseudo (or scaled) log-likelihoods as follows:

$$\log P(o_t \mid s_i) = \log \frac{P(s_i \mid o_t)\, P(o_t)}{P(s_i)} \propto \log P(s_i \mid o_t) - \log P(s_i), \quad (3.17)$$

where the state prior $P(s_i)$ can be estimated from a state-level forced alignment of the training speech data, and the probability $P(o_t)$ of the acoustic observation vector $o_t$ is independent of the HMM state and can be omitted during decoding.
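The conversion in Formula (3.17) amounts to a simple subtraction in the log domain; a minimal sketch (with illustrative function names) is:

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors):
    """Turn DNN state posteriors into pseudo log-likelihoods (Formula (3.17)).

    log_posteriors : (T, N) log P(s_i | o_t), e.g. the network's log-softmax
    log_priors     : (N,) log P(s_i)
    The constant log P(o_t) is dropped, as it does not affect decoding.
    """
    return log_posteriors - log_priors

def log_priors_from_alignment(state_ids, num_states):
    """Estimate log P(s_i) as relative state frequencies counted from the
    per-frame state labels of a forced alignment."""
    counts = np.bincount(state_ids, minlength=num_states).astype(float)
    return np.log(counts / counts.sum())
```

Dividing by the prior penalizes frequent states (such as silence), which would otherwise dominate decoding because the network sees them more often during training.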

Typically, states of context-dependent (CD) phones (triphones) are used as the basic classes for DNN outputs. However, earlier hybrid ASR systems used states of context-independent (CI) phones (monophones). The first attempts to incorporate CD triphones into hybrid architectures [Bourlard et al., 1992] were based on modeling the context dependency using a factorization:

$$P(s_i, c_j \mid o_t) = P(s_i \mid o_t)\, P(c_j \mid s_i, o_t) = P(c_j \mid o_t)\, P(s_i \mid c_j, o_t), \quad (3.18)$$

where $c_j \in \{c_1, \ldots, c_J\}$ is one of the clustered context classes, and $s_i$ is either a CD phone or a state in a CD phone. ANNs were used to estimate both the $P(s_i \mid o_t)$ (or $P(c_j \mid o_t)$) and $P(c_j \mid s_i, o_t)$ (or $P(s_i \mid c_j, o_t)$) probabilities. These types of CD DNN-HMM models provided comparatively small improvements over the GMM-HMM models.
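The factorization of Formula (3.18) combines the outputs of two separate networks into a joint CD posterior; a minimal sketch (with hypothetical names, assuming both networks produce properly normalized posteriors) is:

```python
import numpy as np

def cd_joint_posterior(p_state, p_context_given_state):
    """Joint CD posterior P(s_i, c_j | o_t) via the factorization of (3.18).

    Illustrative sketch: early hybrid systems trained separate ANNs for
    the two factors [Bourlard et al., 1992].
    p_state               : (N_s,) posteriors P(s_i | o_t) from one ANN
    p_context_given_state : (N_s, N_c) posteriors P(c_j | s_i, o_t)
                            from a second, state-conditioned ANN
    Returns an (N_s, N_c) matrix of joint posteriors.
    """
    # Outer product along the state axis: row i is P(s_i|o_t) * P(c_j|s_i, o_t)
    return p_state[:, None] * p_context_given_state
```

Since both factors are normalized distributions, the resulting matrix sums to one over all (state, context) pairs.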

A more efficient technique for CD state modeling [Dahl et al., 2012] uses a DNN model to directly estimate the posterior distribution of tied triphone HMM states. These states are usually obtained from a conventional GMM-HMM system through a state clustering procedure; however, some works propose a GMM-free training procedure [Bacchiani et al., 2014; Senior et al., 2014; Zhu et al., 2015]. A DNN-HMM model is illustrated in Figure 3.5.

Recent progress in computational hardware has made it possible to overcome some limitations of the first hybrid ASR systems. Shallow neural networks have been replaced with deep neural architectures containing many hidden layers. This type of acoustic model is referred to as a CD-DNN-HMM model in the literature. Modern DNN-HMM AMs have been demonstrated to significantly outperform GMM-HMM systems on various ASR tasks [Dahl et al., 2011, 2012; Hinton et al., 2012a; Kingsbury et al., 2012; Seide et al., 2011b; Yu et al., 2010]. A review of recent developments in the DNN-HMM framework and in the deep learning approach to ASR is given in [Li et al., 2015a; Trentin and Gori, 2001; Yu and Deng, 2014].

[Figure: a window of acoustic feature vectors is fed to the DNN input layer; through the hidden layers, the output layer produces observation probabilities for the HMM states, while the HMM provides the transition probabilities.]

Figure 3.5 Example of the hybrid acoustic model using a DNN to estimate posterior probabilities for HMM states
