
Chapter 2. Statistical mapping techniques for inversion

2.3. HMM-based speech recognition and synthesis

2.3.2. HMM Training

Generally, the training problem consists in adjusting the HMM parameters. The parameters of an HMM can be estimated from training data via Maximum Likelihood Estimation (MLE):

$\hat{\lambda} = \arg\max_{\lambda} \, p(O \mid \lambda)$  (2.3-9)

However, there is no known way to analytically solve for the model $\hat{\lambda}$ that maximizes the quantity $p(O \mid \lambda)$. We can, however, choose model parameters such that this quantity is locally maximized, using an iterative procedure such as the Baum-Welch algorithm (Baum and Petrie, 1966), which is a version of the expectation-maximisation (EM) algorithm.

Given an observation sequence $O = \{o_1, o_2, \ldots, o_T\}$ and an HMM model $\lambda$, we can compute the probability $p(O \mid \lambda)$ of the observed sequence thanks to the forward-backward procedure. In brief, the forward variable $\alpha_t(i)$ is defined as the probability of the partial observation sequence $o_1, o_2, \ldots, o_t$ (until time t) and of being in state $q_i$ at time t, given the model $\lambda$. The backward variable $\beta_t(i)$ is defined as the probability of the partial observation sequence $o_{t+1}, o_{t+2}, \ldots, o_T$ (from t+1 to the end T), given state $q_i$ at time t and the model $\lambda$. Both $\alpha_t(i)$ and $\beta_t(i)$ are computed with the forward-backward procedure. Once the $\alpha$ and $\beta$ variables have been collected, a set of new parameters $\hat{\lambda}$ can be re-estimated from $\lambda$; the process is then iterated until there is no further improvement. At each iteration, the probability $p(O \mid \lambda)$ of O being observed from the model increases, until a maximum is reached. This iterative procedure is guaranteed to converge to a local maximum.
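To make the procedure concrete, the following minimal sketch implements the forward and backward recursions for a discrete-observation HMM in Python/NumPy. It is only an illustration under assumed names (pi, A, B, obs are not notation used elsewhere in this chapter); a full Baum-Welch implementation would additionally re-estimate the parameters λ from the resulting α and β.

import numpy as np

def forward_backward(pi, A, B, obs):
    # pi:  initial state probabilities (N,)
    # A:   state transition matrix (N, N)
    # B:   discrete emission probabilities (N, K)
    # obs: observation index sequence of length T
    N, T = A.shape[0], len(obs)

    # Forward variable: alpha_t(i) = p(o_1..o_t, q_t = i | lambda)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

    # Backward variable: beta_t(i) = p(o_{t+1}..o_T | q_t = i, lambda)
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

    # p(O | lambda) is obtained by summing the final forward variables
    return alpha, beta, alpha[-1].sum()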

Our HMM-based speech inversion system needs acoustic and articulatory HMMs for acoustic recognition and articulatory synthesis, respectively. As stated in (Zhang, 2009), two training frameworks could be used to estimate these models.

The first uses separate training: the acoustic speech HMMs are trained on the acoustic data only and the articulatory HMMs are built from the articulatory data alone using the MLE training procedure. The idea behind the separate training is clearly that training the two types of HMMs individually is likely to bring out the best performance from each channel. This framework works even for acoustic and articulatory data acquired separately.

The second scheme, on the other hand, aims to jointly optimise a single model for both acoustic and articulatory information. The model therefore has acoustic and articulatory components, both modelled as multi-state phone-level HMMs: (1) acoustic HMMs that perform an acoustic recognition stage producing a string of phones and the associated state durations, and (2) articulatory HMMs that generate articulatory trajectories from this string of phones and their durations. Both the acoustic and articulatory models have the same topology, i.e. exactly the same set of HMM states and allophonic variations. This structure makes it possible to establish a stronger bridge between the acoustic and articulatory speech domains and enforces the same phone boundaries for both acoustic and articulatory streams.

Note that the first training framework has to explicitly cope with AV asynchrony, if any: there is no guarantee that the best alignment of phone-sized acoustic models directly corresponds to the optimal chaining of phone-sized articulatory models. One solution consists in learning a phasing model, as proposed by Govokhina et al. (2007) for audiovisual speech synthesis, or by Saino et al. (2006) for computing time-lags between the notes of the musical score and the sung phones in an HMM-based singing voice synthesis system. The second scheme, on the contrary, preserves inter-stream asynchrony because the internal states of each stream learn the static and dynamic characteristics of the corresponding parameters. Transient states are therefore not forced to be captured by the same states: asynchronies are here just a consequence of statistical learning, in a way similar to the triphone model for tongue kinematics proposed early on by Okadome et al. (1999).

As expected, Zhang (2009) found that jointly training the acoustic and articulatory features in a multi-stream HMM leads to more accurate inversion results than training them separately.

The phone-sized HMMs are modelled by joint probability densities of acoustic and articulatory parameters. These models can be enriched in many ways:

Use of dynamic (delta) features (Furui, 1986; Masuko et al., 1996). Dynamic features, i.e. the first time derivatives of the static features, can be exploited by trajectory HMMs to produce smooth trajectories.
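As an illustration, a common way to compute such delta features is a regression over a short window of static frames; the sketch below is a generic version under assumed names (frames, window) and is not necessarily the exact implementation used in practice.

import numpy as np

def delta_features(frames, window=2):
    # frames: (T, D) matrix of static features; returns (T, D) deltas
    # computed with the usual regression formula over +/- `window` frames.
    T, _ = frames.shape
    denom = 2.0 * sum(theta ** 2 for theta in range(1, window + 1))
    # Replicate edge frames so the regression window is defined at both ends.
    padded = np.concatenate([np.repeat(frames[:1], window, axis=0),
                             frames,
                             np.repeat(frames[-1:], window, axis=0)])
    deltas = np.zeros_like(frames, dtype=float)
    for theta in range(1, window + 1):
        deltas += theta * (padded[window + theta:window + theta + T]
                           - padded[window - theta:window - theta + T])
    return deltas / denom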

Context-dependent HMMs. Due to coarticulatory effects, it is unlikely that a single context-independent HMM could optimally represent a given allophone.

Therefore context-dependent HMMs are used as another way to enrich the model. The idea of context-dependent modelling is that, instead of defining phones, we define phones in their contexts. The left context is noted with a minus "-" sign and the right context with a plus "+" sign. For example, the phone "i" preceded by a "b" and followed by a "p" is modelled by "b-i+p". Because of the limited training data available for our system, we use context classes of phonemes as contexts, in order to have more occurrences for each class and thus better statistics (cf. Section 3.3.1.2). In this case, "b-i+p" becomes "Cbpm-i+Cbpm", where Cbpm clusters the bilabial phonemes /b/, /p/ and /m/.
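A minimal sketch of how such class-based context-dependent labels can be generated is given below; the phoneme-to-class table is only a fragment shown as an example, and the function name is hypothetical.

# Map each phoneme to its context class; only the bilabial class is shown here.
CONTEXT_CLASSES = {"b": "Cbpm", "p": "Cbpm", "m": "Cbpm"}

def context_label(left, phone, right):
    # Replace the neighbouring phones by their context classes and build
    # the "left-phone+right" label used for context-dependent HMMs.
    lc = CONTEXT_CLASSES.get(left, left)
    rc = CONTEXT_CLASSES.get(right, right)
    return f"{lc}-{phone}+{rc}"

# context_label("b", "i", "p") returns "Cbpm-i+Cbpm"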

Inheritance mechanism. For missing phone contexts, we use an inheritance mechanism that replaces the missing allophone by the closest allophone with less context information. For example, if "Cbpm-i+Cbpm" does not exist, we use the "i+Cbpm" model trained using phones in their right context, and if the "i+Cbpm" model does not exist either, we use the "i" model trained using the phone without context.
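The back-off order described above could be implemented along the following lines; `models` is assumed to be a dictionary mapping label strings to trained HMMs, and the function name is illustrative.

def select_model(models, left_class, phone, right_class):
    # Try the full-context model first, then the right-context model,
    # then the context-free model, following the inheritance mechanism above.
    candidates = [f"{left_class}-{phone}+{right_class}",  # e.g. "Cbpm-i+Cbpm"
                  f"{phone}+{right_class}",               # e.g. "i+Cbpm"
                  phone]                                  # e.g. "i"
    for label in candidates:
        if label in models:
            return models[label]
    raise KeyError(f"no model available for phone {phone!r}")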

Tied states. A drawback of building context-dependent models is that the number of HMM states related to all phone contexts becomes huge and there may be a lack of training data. This number of states can be reduced by sharing some states between several models. For each stream, the choice of model configuration (number of components, full or diagonal covariance matrices, parameter tying and number of Gaussian mixture components) is often determined by the amount of data available for estimating the Gaussian mixture parameters and by how the Gaussian model is used in a particular stream. To improve the robustness and accuracy of the acoustic models, we have used a decision tree-based state tying mechanism (Young et al., 2009), which allows similar acoustic states of different context-dependent HMMs to be tied together. This should ensure that all state distributions can be robustly estimated. The state tying decision tree in the acoustic domain is built from the single-Gaussian models; multiple-mixture-component Gaussian distributions are then iteratively trained. Note that the number of Gaussian mixture components in the articulatory stream remains unchanged.
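Conceptually, state tying amounts to letting several logical (context-dependent) states point to one shared physical state whose Gaussian parameters are estimated from their pooled data. The toy sketch below only illustrates this mapping; the actual decision-tree clustering (phonetic questions, likelihood-based splits) is handled by HTK, and the class name Cfv used here is purely hypothetical.

# Toy illustration: two logical states tied to one shared physical state.
class SharedState:
    def __init__(self):
        self.mean = None        # Gaussian parameters estimated from the
        self.variance = None    # pooled frames of all tied logical states

shared = SharedState()
tying_table = {
    ("Cbpm-i+Cbpm", 2): shared,   # centre states of two context-dependent
    ("Cbpm-i+Cfv", 2): shared,    # models judged acoustically similar
}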

2.3.2.1 Multi-stream HMMs

To build the multi-stream HMM-based system, a simple fusion approach is to concatenate the acoustic and articulatory features. This is a way to tie the two HMMs at the state level during training (i.e. synchronous states). The multiple data streams functionality provided by the HMM ToolKit (HTK, http://htk.eng.cam.ac.uk/) (Young et al., 2009) makes this type of training possible, by combining the two sets of HMMs into a single, two-stream HMM model sharing the same topology and the same phone boundaries. In our system, we have used a two-stream model: one stream for the acoustic information and another one for the articulatory information.

Given the multi-stream observation vector O, i.e. the acoustic and articulatory modalities, the emission probability of the multi-stream HMM is given by


$b_j(O_t) = \prod_{s=1}^{S} \left[ \sum_{m=1}^{M_s} c_{jsm} \, \mathcal{N}\!\left(O_{st}; \mu_{jsm}, \Sigma_{jsm}\right) \right]^{\gamma_{jst}}$

This equation differs from equation (2.3-4) by the use of S streams. For each stream s, $M_s$ is the number of mixture components, $c_{jsm}$ is the weight of the m-th component, and $\mathcal{N}(O_{st}; \mu_{jsm}, \Sigma_{jsm})$ denotes the multivariate Gaussian distribution with mean vector $\mu_{jsm}$ and diagonal covariance matrix $\Sigma_{jsm}$. We choose diagonal covariance matrices in order to decrease the system complexity and thus the number of parameters to estimate, as their reliability is related to the size of the training corpus. The contribution of each stream is controlled by the weight $\gamma_{jst}$. In our system, the stream weight is left at its default value of 1.0 for all streams, but it could be optimised.
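A direct transcription of this emission probability, under assumed data structures (per-stream lists of mixture components and weights; function and argument names are illustrative, not HTK code), could look like the following sketch.

import numpy as np
from scipy.stats import multivariate_normal

def multistream_emission(stream_obs, stream_weights, stream_mixtures):
    # stream_obs:      list of S observation vectors O_st, one per stream
    # stream_weights:  list of S stream weights gamma_jst
    # stream_mixtures: per stream, a list of (c_jsm, mu_jsm, diag_var_jsm)
    b = 1.0
    for o, gamma, components in zip(stream_obs, stream_weights, stream_mixtures):
        # Weighted sum of diagonal-covariance Gaussians for this stream
        stream_prob = sum(c * multivariate_normal.pdf(o, mean=mu, cov=np.diag(var))
                          for c, mu, var in components)
        # Streams are combined as a product, each raised to its stream weight
        b *= stream_prob ** gamma
    return b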