
It is interesting to note that the update equation for Σ_i in Eqn 5.14 is very similar to the one obtained when using ML estimation, except for the term (1−α)/α Σ_{t=1}^{T} P(q_t = i) in the denominator, which can be thought of as a regularization term. Because of this positive term, the covariance Σ_i is smaller than it would have been otherwise. This corresponds to lower conditional entropy, as desired.
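As a rough numerical illustration of this shrinkage effect, the following sketch compares the two denominators. It is only a sketch under stated assumptions: the numerator of Eqn 5.14 is taken to be the usual ML weighted scatter, the posteriors and marginals are random toy values, and all variable names are ours.

```python
import numpy as np

# Toy illustration: ML vs. MMIHMM-style covariance update for state i.
# Assumption (not from the text): the numerator is the usual weighted
# scatter; only the denominator gains the extra regularization term
# (1 - alpha)/alpha * sum_t P(q_t = i) described above.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))        # observations x_t
gamma = rng.uniform(size=200)        # posteriors P(q_t = i | X), hypothetical
p_qi = rng.uniform(size=200)         # marginals P(q_t = i), hypothetical
alpha = 0.8

mu = (gamma[:, None] * X).sum(0) / gamma.sum()
scatter = ((X - mu).T * gamma) @ (X - mu)   # sum_t gamma_t (x_t-mu)(x_t-mu)^T

sigma_ml = scatter / gamma.sum()
sigma_mmi = scatter / (gamma.sum() + (1 - alpha) / alpha * p_qi.sum())

# The extra positive term in the denominator shrinks the covariance,
# i.e. lowers the conditional entropy of the observation model.
print(np.trace(sigma_ml), ">", np.trace(sigma_mmi))
```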

3.3 Unsupervised Case

The above analysis can easily be extended to the unsupervised case, i.e. when only X_obs is given and Q_obs is not available. In this case, we use the cost function given in Eqn 5.3. The update equations for the parameters are very similar to the ones obtained in the supervised case. The only difference is that we now replace N_ij in Eqn 5.6 by its expected value, Σ_{t=1}^{T} P(q_t = i, q_{t+1} = j | X_obs), computed as in the Baum-Welch algorithm by means of the forward and backward variables.
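In the discrete case, these expected counts can be obtained with the standard scaled forward-backward recursions. Below is a minimal sketch; the function and variable names are ours, not from the book.

```python
import numpy as np

def expected_transition_counts(x, pi, A, B):
    """Expected counts E[N_ij] = sum_t P(q_t = i, q_{t+1} = j | x)
    via the scaled forward-backward (Baum-Welch) recursions."""
    T, N = len(x), len(pi)
    alpha_f = np.zeros((T, N))       # scaled forward variables
    beta_b = np.ones((T, N))         # scaled backward variables
    c = np.zeros(T)                  # scaling factors, prod(c) = P(x)

    alpha_f[0] = pi * B[:, x[0]]
    c[0] = alpha_f[0].sum(); alpha_f[0] /= c[0]
    for t in range(1, T):
        alpha_f[t] = (alpha_f[t - 1] @ A) * B[:, x[t]]
        c[t] = alpha_f[t].sum(); alpha_f[t] /= c[t]
    for t in range(T - 2, -1, -1):
        beta_b[t] = (A @ (B[:, x[t + 1]] * beta_b[t + 1])) / c[t + 1]

    xi = np.zeros((N, N))            # accumulated expected counts
    for t in range(T - 1):
        joint = alpha_f[t][:, None] * A * B[:, x[t + 1]] * beta_b[t + 1]
        xi += joint / c[t + 1]       # each slice sums to 1
    return xi
```

The resulting xi matrix would take the place of N_ij in the update equations.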

4. Discussion

4.1 Convexity

From the law of large numbers, it is known that, in the limit (i.e. as the number of samples approaches infinity), the log-likelihood of the data tends to the negative of the entropy, log P(X) ≈ −H(X).
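A quick numerical check of this limit on an i.i.d. toy source (illustrative only; the names are ours):

```python
import numpy as np

# AEP check: for an i.i.d. source, (1/T) log P(x_1..x_T) -> -H(X).
rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])
H = -(p * np.log(p)).sum()           # entropy in nats

for T in (100, 10_000, 1_000_000):
    x = rng.choice(len(p), size=T, p=p)
    avg_loglik = np.log(p[x]).mean() # (1/T) log P(x)
    print(T, avg_loglik, -H)         # the two columns converge
```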

Therefore, in the limit, the negative of our cost function for the supervised case can be expressed as

−F = (1−α)H(X|Q) + αH(X,Q)
   = H(X|Q) + αH(Q), (5.16)

where the second equality uses the chain rule H(X,Q) = H(Q) + H(X|Q).

Note that H(X|Q) is a strictly concave function of P(X|Q), and H(X|Q) is a linear function of P(Q). Consequently, in the limit, the cost function from Eqn 5.15 is strictly convex (its negative is concave) with respect to the distributions of interest.
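The convexity claim rests on the concavity of entropy; here is a numerical spot-check of H(t·p1 + (1−t)·p2) ≥ t·H(p1) + (1−t)·H(p2) on random points of the simplex (illustrative only):

```python
import numpy as np

def H(p):
    """Shannon entropy in nats (0 log 0 := 0)."""
    p = p[p > 0]
    return -(p * np.log(p)).sum()

rng = np.random.default_rng(2)
p1, p2 = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
for t in (0.25, 0.5, 0.75):
    mix = t * p1 + (1 - t) * p2
    assert H(mix) >= t * H(p1) + (1 - t) * H(p2)   # concavity holds
print("entropy is concave on the simplex (spot-checked)")
```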

In the unsupervised case and in the limit again, our cost function can be expressed as

F = −(1−α)H(X|Q) − αH(X)
  = −H(X) + (1−α)(H(X) − H(X|Q))
  = −H(X) + (1−α)I(X,Q) ≈ log P(X) + (1−α)I(X,Q).

The unsupervised case thus reduces to the original case with α replaced by 1−α. Maximizing F is, in the limit, the same as maximizing the likelihood of the data and the mutual information between the hidden and the observed states, as expected.

4.2 Convergence

We analyze next the convergence of the MMIHMM learning algorithm in the supervised and unsupervised cases. In the supervised case, standard HMMs are learned directly, without any iteration. In the case of MMIHMM, however, we do not have a closed-form solution for the parameters b_ij and a_ij. Moreover, these parameters are interdependent: in order to compute b_ij, we need to compute P(q_t = i), which requires knowledge of a_ij. Therefore an iterative solution is needed (a structural sketch of this loop follows below). Fortunately, the convergence of the iterative algorithm is extremely fast, as illustrated in Figure 5.2. This figure shows the cost function with respect to the iterations for a particular case of the speaker detection problem (a) (see Section 5.5.2), and for synthetically generated data in an unsupervised setting (b). From Figure 5.2 it can be seen that the algorithm typically converges after only 2-3 iterations.
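Structurally, the iteration alternates the two interdependent updates until the cost stops improving. The sketch below captures only this outer loop; the toy coordinate-ascent updates stand in for the actual MMIHMM equations (e.g. Eqns 5.6 and 5.14), and all names are ours.

```python
import numpy as np

def fit_iteratively(cost, update_a, update_b, a0, b0, tol=1e-4, max_iter=50):
    """Alternate the interdependent parameter updates until the cost
    stabilizes; update_b needs the current a (to obtain P(q_t = i))
    and update_a the current b, hence the iteration."""
    a, b = a0, b0
    costs = [cost(a, b)]
    for _ in range(max_iter):
        b = update_b(a, b)
        a = update_a(a, b)
        costs.append(cost(a, b))
        if abs(costs[-1] - costs[-2]) < tol:
            break      # MMIHMM reportedly needs only 2-3 iterations
    return a, b, costs

# Toy demo on a concave surrogate cost (NOT the real MMIHMM objective):
cost = lambda a, b: -(a - 2 * b) ** 2 - (b - 1) ** 2
a_fit, b_fit, costs = fit_iteratively(
    cost,
    update_a=lambda a, b: 2 * b,            # argmax_a cost(a, b)
    update_b=lambda a, b: (2 * a + 1) / 5,  # argmax_b cost(a, b)
    a0=0.0, b0=0.0)
print(costs)   # monotone convergence of the cost, as in Figure 5.2
```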

4.3 Maximum A-posteriori View of Maximum Mutual Information HMMs

The MMIHMM algorithm presented here can also be viewed as a maximum a-posteriori (MAP) algorithm with the conditional entropy acting as a prior on the space of models.

An HMM is a probability distribution over a set of RVs, (X, Q). Traditionally, the parameters of HMMs are estimated by maximizing the joint likelihood



Figure 5.2. Value of the cost function with respect to the iteration number in (a) the speaker detection experiment; (b) a continuous unsupervised case with synthetic data.

of the hidden states and the observations, P(X,Q). The work in this chapter can be thought of as an entropic estimation (similar to [Brand, 1998]) framework in the space of possible distributions modeled by an HMM. In contrast to Brand [Brand, 1998], where the prior is imposed over the parameters of the model, we impose the prior directly on the model, preferring models with low conditional entropy. In particular, given a Hidden Markov Model χ, characterized by its parameters {π, A, B}, where π are the initial state probabilities, A is the transition probability matrix, and B is the observation probability matrix (in the discrete case), the prior probability of the model is assumed to be P(χ) ∝ e^{λ I(X,Q)}. Under this prior, the posterior probability can be written as:

P_post ≡ P(X,Q|χ)P(χ) ∝ P(X,Q|χ) e^{λ I(X,Q)}
       = P(X,Q|χ) e^{λ(H(X) − H(X|Q))}. (5.17)

The prior P(χ) ∝ e^{λ I(X,Q)} is referred to as the entropic prior (modulo a normalization constant) over the space of distributions, preferring distributions with high mutual information over distributions with low mutual information.

The parameter λ controls the weight of the prior and acts as a smoothing factor: if λ is very small, all the models are almost equally likely, whereas if λ is large, models with high mutual information are favored over others. Our goal with the HMMs is to predict the hidden states based on the observations. Thus, the mutual information is used to model the dependence between the hidden and the observed states. This concept of entropic priors could be extended to other graphical models, by computing the mutual information between the observed and the query variables. Note how the prior is over the possible distributions and not over the parameters, as proposed in the past [Brand, 1998].
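To make the role of λ concrete, here is a small computation of the (unnormalized) entropic prior weight for two toy joint distributions, one with weakly and one with strongly coupled hidden and observed variables (illustrative only; the names and the toy tables are ours):

```python
import numpy as np

def mutual_information(joint):
    """I(X,Q) = H(X) - H(X|Q) for a joint table P(x, q), in nats."""
    px = joint.sum(1); pq = joint.sum(0)
    nz = joint > 0
    return (joint[nz] * np.log(joint[nz] / np.outer(px, pq)[nz])).sum()

# Hidden and observed states nearly independent vs. tightly coupled.
weak = np.array([[0.26, 0.24], [0.24, 0.26]])
strong = np.array([[0.45, 0.05], [0.05, 0.45]])

for lam in (0.1, 10.0):
    w_weak, w_strong = (np.exp(lam * mutual_information(j))
                        for j in (weak, strong))
    # Small lambda: prior nearly flat.  Large lambda: high-MI models win.
    print(lam, w_strong / w_weak)
```

For small λ the ratio of prior weights is close to 1 (an almost flat prior); for large λ the strongly coupled model dominates.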

Given that the dependence of H(X) on the model parameters is weak, we will approximate (to keep the problem tractable) the objective function with P_post(χ) ∝ P(X,Q|χ) e^{−λ H(X|Q)}. The prior distribution e^{−λ H(X|Q)} can now be seen as favoring the distributions with low conditional entropy.

This prior has two properties derived from the definition of entropy:

1 It is a bias for compact distributions having less ambiguity, i.e. lower conditional entropy;

2 It is invariant to re-parameterization of the model, because the entropy is defined in terms of the model's joint and/or factored distributions.

Taking the logarithm of the posterior probability in Eqn 5.17 and dropping from now on the explicit dependence on χ, we obtain:

F = log P_post = log P(X,Q) + λ I(X,Q)
  = log P(X,Q) + λ(H(X) − H(X|Q)). (5.18)

This leads to the following function to maximize:

F = λ I(Q,X) + log P(X_obs, Q_obs)
  ∝ (1−α) I(Q,X) + α log P(X_obs, Q_obs), (5.19)

where α provides a way of trading off between the ML (α = 1) and Maximum Mutual Information (MMI) (α = 0) criteria, and λ = (1−α)/α. Note that Eqn 5.3 is exactly the log-posterior probability expressed in Eqn 5.18.
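The equivalence between the λ- and α-parameterizations in Eqn 5.19 is a simple rescaling; the following worked step (added here for clarity, not in the original) makes it explicit:

```latex
% Multiplying the objective by \alpha > 0 leaves its maximizer unchanged:
\alpha F = \alpha\bigl(\lambda\, I(Q,X) + \log P(X_{\mathrm{obs}}, Q_{\mathrm{obs}})\bigr)
         = (1-\alpha)\, I(Q,X) + \alpha \log P(X_{\mathrm{obs}}, Q_{\mathrm{obs}}),
\qquad \lambda = \frac{1-\alpha}{\alpha}.
```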

