
2.1.2 Statistical modeling

If X = {x_0, x_1, ..., x_{T-1}} represents the sequence of feature vectors extracted from the speech signal, the statistical modeling technique formulates the speech recognition problem as a maximum a posteriori (MAP) problem as follows.

$$W = \arg\max_{W \in L} P(W|X) \qquad (2.4)$$

The most likely word sequence W from the set of all possible word sequences L, given X, is chosen as the recognized string of words. The MAP formulation of speech recognition is hard to deal with directly. It is usually reformulated into a problem based on likelihood estimation using Bayes' rule¹.

$$W = \arg\max_{W \in L} \frac{P(X, W)}{P(X)} \sim \arg\max_{W \in L} P(X|W)\,P(W) \qquad (2.5)$$

P(X) has been dropped from the above equation as it serves just as a scaling factor. In the above equation, P(X|W), the conditional probability of X given W, is usually referred to as the acoustic model, and P(W), the prior probability of the word sequence W, is referred to as the language model. In practice, both the acoustic and language models are assumed to fit some parametric form, say P_Θ(.) and P_Γ(.), with parameter sets Θ and Γ respectively. Then P_Θ(X|W,Θ) and P_Γ(W|Γ) are used in (2.5) as estimates for P(X|W) and P(W). The values of the model parameters Θ and Γ are estimated from a training database containing a large collection of utterances with known transcriptions. If ξ represents the set of all training utterances along with the corresponding transcriptions, ideally the parameters can be estimated according to:

$$(\Theta, \Gamma) = \arg\max_{\Theta, \Gamma} \Big[ \prod_{(X,W) \in \xi} P_\Theta(X|W,\Theta)\, P_\Gamma(W|\Gamma) \Big] \qquad (2.6)$$

But practical constraints do not allow the joint estimation of Θ and Γ. They are usually estimated independently of each other from different training sets, say ξ_a and ξ_l, respectively, yielding:

$$\Theta = \arg\max_{\Theta} \Big[ \prod_{X \in \xi_a} P_\Theta(X|W,\Theta) \Big] \qquad (2.7)$$

$$\Gamma = \arg\max_{\Gamma} \Big[ \prod_{W \in \xi_l} P_\Gamma(W|\Gamma) \Big] \qquad (2.8)$$

Equation (2.7) is referred to as maximum likelihood (ML) training. A popular ML training algorithm is the expectation-maximization (EM) algorithm (Dempster et al., 1977), in which a few hidden variables are postulated in addition to the existing parameter set in order to make the otherwise intractable training problem tractable. EM is an iterative procedure in which, at each iteration, new values of the parameter set, Θ_new, are found from the old values, Θ_old, so that the overall likelihood of the training data is increased:

$$\prod_{X \in \xi_a} P_\Theta(X|W,\Theta_{new}) \geq \prod_{X \in \xi_a} P_\Theta(X|W,\Theta_{old}) \qquad (2.9)$$

Every iteration of EM has two steps: the E step and the M step. In the E step, we find the expected value of the complete-data log-likelihood with respect to the probability distribution of the hidden variables given the observed variables X and the current estimates of the parameters. In the M step, we maximize this expected complete-data log-likelihood. These two steps are repeated as necessary. Each iteration is guaranteed to increase the likelihood of the observed variables X.
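To make the E and M steps concrete, the sketch below shows the structure of one EM iteration for a one-dimensional Gaussian mixture, where the hidden variable is the identity of the component that generated each sample. This is purely an illustration, not an implementation from the thesis; all variable names and values are assumed.

```python
import numpy as np

def em_step(x, weights, means, variances):
    """One EM iteration for a 1-D Gaussian mixture model (illustrative sketch)."""
    K = len(weights)
    # E step: posterior probability of each hidden component label for each
    # sample, given the observations and the current parameter estimates.
    resp = np.zeros((len(x), K))
    for k in range(K):
        resp[:, k] = (weights[k]
                      * np.exp(-0.5 * (x - means[k]) ** 2 / variances[k])
                      / np.sqrt(2.0 * np.pi * variances[k]))
    resp /= resp.sum(axis=1, keepdims=True)

    # M step: re-estimate the parameters by maximizing the expected
    # complete-data log-likelihood computed in the E step.
    Nk = resp.sum(axis=0)
    new_weights = Nk / len(x)
    new_means = (resp * x[:, None]).sum(axis=0) / Nk
    new_vars = (resp * (x[:, None] - new_means) ** 2).sum(axis=0) / Nk
    return new_weights, new_means, new_vars

# Each call increases (or leaves unchanged) the likelihood of the observed
# data, mirroring the guarantee in (2.9).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])
w, m, v = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(20):
    w, m, v = em_step(x, w, m, v)
```

In the notation above, the tuple (weights, means, variances) plays the role of Θ_old at the input and Θ_new at the output of each call.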

State-of-the-art ASR systems use hidden Markov models (HMMs) (Rabiner and Juang, 1993; Bourlard and Morgan, 1994) for acoustic modeling and bigram/trigram probabilities for language modeling. As language modeling does not fall within the scope of this thesis, it will not be discussed further. The HMM used for acoustic modeling is explained in more detail below.

¹ P(A|B) = P(A,B)/P(B)

Hidden Markov Models (HMM): The most successful approach developed so far for the acoustic modeling task of ASR is the hidden Markov model (HMM). An HMM is basically a stochastic finite state automaton, i.e., a finite state automaton with a stochastic output process associated with each state. The HMM models speech by assuming the feature vector sequence X = {x_0, x_1, ..., x_{T-1}} to be a piece-wise stationary stochastic process that has been generated by a sequence of HMM states, denoted by Q = {q_0, q_1, ..., q_{T-1}}, that transit from one to another over time. The stochastic output process associated with each state is assumed to govern the generation of feature vectors by the states. If C represents the set of all possible state sequences, the acoustic model in (2.5) can be rewritten as:

$$P(X|W) = \sum_{Q \in C} P(X, Q|W) \qquad (2.10)$$

In the above equation, Θ as in (2.7) is dropped for reasons of simplicity. To make the model simple and computationally tractable, a few simplifying assumptions are made while applying HMMs to the acoustic modeling problem. They are:

1. First order hidden Markov model assumption, i.e.,
$$P(q_t | q_0, q_1, \ldots, q_{t-1}) = P(q_t | q_{t-1}) \qquad (2.11)$$
where P(q_t|q_{t-1}) is referred to as the state-transition probability.

2. Feature independence (i.i.d.) assumption, i.e.,
$$P(x_t | x_0, x_1, \ldots, x_{t-1}, Q) = p(x_t | q_t) \qquad (2.12)$$
where p(x_t|q_t) is referred to as the emission probability, i.e., the probability of the state q_t emitting the feature vector x_t.

With these assumptions, (2.10) becomes,

$$P(X|W) = \sum_{Q \in C} P(q_0)\, p(x_0|q_0) \prod_{t=1}^{T-1} P(q_t|q_{t-1})\, p(x_t|q_t) \qquad (2.13)$$

The above equation gives an exact formula for the computation of the likelihood. However, it is sometimes also approximated by the likelihood of the best state sequence as follows:

$$P(X|W) = \max_{Q \in C} P(q_0)\, p(x_0|q_0) \prod_{t=1}^{T-1} P(q_t|q_{t-1})\, p(x_t|q_t) \qquad (2.14)$$

This is called the Viterbi approximation.
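As a concrete illustration of the difference between (2.13) and (2.14), the sketch below (a toy example with assumed values, not the thesis' implementation) computes both quantities for a small HMM; the emission likelihoods p(x_t|q_t) are assumed to be pre-computed and passed in as a matrix.

```python
import numpy as np

def full_likelihood(init, trans, emis):
    """Likelihood summed over all state sequences, as in (2.13).

    init:  P(q_0 = i), shape (M,)
    trans: P(q_t = j | q_{t-1} = i), shape (M, M)
    emis:  pre-computed emission likelihoods p(x_t | q_t = i), shape (T, M)
    """
    alpha = init * emis[0]                     # P(q_0) p(x_0|q_0)
    for t in range(1, emis.shape[0]):
        alpha = (alpha @ trans) * emis[t]      # sum over predecessor states
    return alpha.sum()

def viterbi_likelihood(init, trans, emis):
    """Likelihood of the single best state sequence, as in (2.14)."""
    delta = init * emis[0]
    for t in range(1, emis.shape[0]):
        delta = (delta[:, None] * trans).max(axis=0) * emis[t]  # max over predecessors
    return delta.max()

# Toy two-state example with made-up numbers.
init  = np.array([1.0, 0.0])
trans = np.array([[0.7, 0.3],
                  [0.0, 1.0]])
emis  = np.array([[0.9, 0.1],
                  [0.6, 0.4],
                  [0.2, 0.8]])
print(full_likelihood(init, trans, emis))     # always >= the Viterbi value below
print(viterbi_likelihood(init, trans, emis))
```

In practice both recursions are carried out in the log domain to avoid numerical underflow over long utterances.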

An illustration of the use of the HMM as an acoustic model with the above assumptions is given in Fig. 2.1. If the number of states in the HMM is M, the complete set of parameters that describes the HMM is the following:

1. Transition probabilities, P(q_t = state j | q_{t-1} = state i), denoted by a_{ij}, 0 ≤ i, j ≤ M−1, satisfying the constraints Σ_{j=0}^{M−1} a_{ij} = 1, and

2. State emission density functions, p(x_t | q_t = state i), denoted by p_i(.), 0 ≤ i ≤ M−1.

A detailed tutorial on the EM algorithm and its use in estimating the HMM parameters for ASR applications is provided in (Bilmes, 1998). The most commonly used techniques for modeling the emission density are Gaussian mixture models (GMMs) and multi-layer perceptrons (MLPs), and they are briefly explained below:

Figure 2.1. Illustration of Hidden Markov Models (HMM): a sequence of HMM states, each with an associated emission density model, generating the feature vectors.

1. A Gaussian mixture model (GMM) is a weighted mixture of several Gaussian densities. It is fully characterized by the weighting factors, mean vectors, and covariance matrices of all the constituent Gaussians (a numerical sketch of the density evaluation follows this list). The expression for the density function p(x) of a GMM is given by,

$$p(x) = \sum_{k=0}^{K-1} c_k\, G_k(x) \qquad (2.15)$$

where K denotes the number of Gaussians in the GMM, and c_k denotes the weighting factor for the k-th Gaussian, G_k(.). If µ_k and Σ_k denote respectively the mean vector and covariance matrix of the k-th Gaussian, and if D denotes the feature vector dimension, the expression for G_k(x) is given by,

$$G_k(x) = \frac{1}{(2\pi)^{D/2}\, |\Sigma_k|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)\right) \qquad (2.16)$$

2. Multi-layer perceptrons (MLPs) (Bourlard and Wellekens, 1990; Bourlard and Morgan, 1994), a special case of artificial neural networks (ANNs), consist of a series of layers of artificial neurons.

Neurons in each layer are fully connected with the neurons of the following layer. The first layer is called the input layer, the last layer is the output layer, and all the layers in between are called hidden layers. Every neuron of every layer except the input layer performs a non-linear operation: a weighted sum of the outputs of all the neurons from which it receives input, followed by a sigmoid or a softmax operation. The connection weighting factors between the neurons are called the weights. The vector applied at the input layer propagates through the hidden layers one by one until it finally reaches the output layer. MLPs can be used for classification or function mapping. Arbitrarily complex decision hyper-surfaces can be formed by MLPs when using them in the classification mode, and any continuous mapping between the input and output spaces can be represented by MLPs when using them in the function mapping mode. The connection weights of the MLP can be trained using the error back-propagation algorithm.
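The following sketch (illustrative only; all dimensions, layer sizes, and parameter values are assumptions) evaluates the GMM density of (2.15)-(2.16) and the forward pass of a single-hidden-layer MLP with sigmoid hidden units and a softmax output layer producing a probability distribution over classes.

```python
import numpy as np

def gmm_density(x, c, mu, sigma):
    """GMM density p(x) as in (2.15)-(2.16).

    x:     feature vector, shape (D,)
    c:     mixture weights, shape (K,)
    mu:    mean vectors, shape (K, D)
    sigma: covariance matrices, shape (K, D, D)
    """
    D = x.shape[0]
    p = 0.0
    for k in range(len(c)):
        diff = x - mu[k]
        norm = 1.0 / ((2.0 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(sigma[k])))
        p += c[k] * norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma[k]) @ diff)
    return p

def mlp_forward(x, W_hid, b_hid, W_out, b_out):
    """Forward pass of a one-hidden-layer MLP: sigmoid hidden units,
    softmax output (e.g., class posteriors)."""
    h = 1.0 / (1.0 + np.exp(-(W_hid @ x + b_hid)))   # weighted sum + sigmoid
    z = W_out @ h + b_out
    z -= z.max()                                     # for numerical stability
    e = np.exp(z)
    return e / e.sum()                               # softmax

# Made-up dimensions: D=13 features, K=2 Gaussians, 32 hidden units, 10 outputs.
rng = np.random.default_rng(0)
x = rng.normal(size=13)
print(gmm_density(x, np.array([0.6, 0.4]),
                  rng.normal(size=(2, 13)),
                  np.stack([np.eye(13)] * 2)))
print(mlp_forward(x, rng.normal(size=(32, 13)), np.zeros(32),
                  rng.normal(size=(10, 32)), np.zeros(10)).sum())  # sums to 1
```

In a GMM-based system, one such density would typically be kept per HMM state to model p_i(.).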
