
3 Generative models for discrete data


The posterior predictive for the number of heads, x, in M future trials is

p(x|D) = (M choose x) · (1/B(a, b)) · ∫_0^1 θ^x (1−θ)^(M−x) θ^(a−1) (1−θ)^(b−1) dθ   (3.32)

We recognize the integral as the normalization constant for a Beta(a+x, M−x+b) distribution. Hence

∫_0^1 θ^x (1−θ)^(M−x) θ^(a−1) (1−θ)^(b−1) dθ = B(x+a, M−x+b)   (3.33)

Thus we find that the posterior predictive is given by the following, known as the (compound) beta-binomial distribution:

Bb(x|a, b, M) ≜ (M choose x) · B(x+a, M−x+b) / B(a, b)   (3.34)

This distribution has the following mean and variance:

E[x] = M a/(a+b),   var[x] = (M a b/(a+b)^2) · (a+b+M)/(a+b+1)   (3.35)

If M = 1, and hence x ∈ {0, 1}, we see that the mean becomes E[x|D] = p(x = 1|D) = a/(a+b), which is consistent with Equation 3.29.

This process is illustrated in Figure 3.7(a). We start with a Beta(2, 2) prior, and plot the posterior predictive density after seeing N1 = 3 heads and N0 = 17 tails. Figure 3.7(b) plots a plug-in approximation using a MAP estimate. We see that the Bayesian prediction has longer tails, spreading its probability mass more widely, and is therefore less prone to overfitting and black-swan type paradoxes.
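To make the comparison concrete, here is a minimal Python sketch (assuming SciPy ≥ 1.4, which provides scipy.stats.betabinom; this is not the script that generated the figure) that evaluates the beta-binomial posterior predictive of Equation 3.34 and the plug-in binomial approximation for the same data:

import numpy as np
from scipy.stats import betabinom, binom

# Beta(a, b) prior and observed data: N1 heads, N0 tails
a, b = 2, 2
N1, N0 = 3, 17
M = 10  # number of future trials

# Bayesian posterior predictive: Bb(x | a + N1, b + N0, M)
post_pred = betabinom(M, a + N1, b + N0)

# Plug-in approximation: Bin(x | theta_MAP, M), using the mode of the Beta posterior
theta_map = (a + N1 - 1) / (a + b + N1 + N0 - 2)
plugin = binom(M, theta_map)

x = np.arange(M + 1)
print("beta-binomial:", post_pred.pmf(x).round(3))
print("plug-in:      ", plugin.pmf(x).round(3))

Comparing the two rows of output shows the beta-binomial placing more mass on large values of x than the plug-in does, which is the longer-tail behaviour visible in Figure 3.7.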

3.4 The Dirichlet-multinomial model

In the previous section, we discussed how to infer the probability that a coin comes up heads.

In this section, we generalize these results to infer the probability that a die with K sides comes up as face k. This might seem like another toy exercise, but the methods we will study are widely used to analyze text data, biosequence data, etc., as we will see later.

Figure 3.7 (a) Posterior predictive distributions after seeing N1 = 3, N0 = 17. (b) Plug-in approximation. Figure generated by betaBinomPostPredDemo.

3.4.1 Likelihood

Suppose we observe N dice rolls, D = {x_1, . . . , x_N}, where x_i ∈ {1, . . . , K}. If we assume the data is iid, the likelihood has the form

p(D|θ) = ∏_{k=1}^K θ_k^(N_k)   (3.36)

where N_k = ∑_{i=1}^N I(x_i = k) is the number of times face k occurred (these are the sufficient statistics for this model). The likelihood for the multinomial model has the same form, up to an irrelevant constant factor.
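As a minimal sketch (assuming NumPy; the function names are illustrative, not from the book's toolbox), the sufficient statistics N_k and the log-likelihood can be computed as follows:

import numpy as np

def sufficient_stats(data, K):
    """Counts N_k = number of times face k (k = 1..K) appears in the data."""
    return np.bincount(np.asarray(data), minlength=K + 1)[1:]

def log_likelihood(theta, counts):
    """log p(D|theta) = sum_k N_k log theta_k."""
    return float(np.sum(counts * np.log(theta)))

# Example: six rolls of a 6-sided die
rolls = [1, 3, 3, 6, 2, 3]
N_k = sufficient_stats(rolls, K=6)             # -> [1 1 3 0 0 1]
print(log_likelihood(np.full(6, 1 / 6), N_k))  # 6 * log(1/6) for a fair die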

3.4.2 Prior

Since the parameter vector lives in the K-dimensional probability simplex, we need a prior that has support over this simplex. Ideally it would also be conjugate. Fortunately, the Dirichlet distribution (Section 2.5.4) satisfies both criteria. So we will use the following prior:

Dir(θ|α) = (1/B(α)) ∏_{k=1}^K θ_k^(α_k − 1)   (3.37)

3.4.3 Posterior

Multiplying the likelihood by the prior, we find that the posterior is also Dirichlet:

p(θ|D) ∝ p(D|θ) p(θ)   (3.38)

∝ ∏_{k=1}^K θ_k^(N_k) θ_k^(α_k − 1) = ∏_{k=1}^K θ_k^(α_k + N_k − 1)   (3.39)

= Dir(θ|α_1 + N_1, . . . , α_K + N_K)   (3.40)

We see that the posterior is obtained by adding the prior hyper-parameters (pseudo-counts) α_k to the empirical counts N_k.
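A minimal sketch of this conjugate update (assuming NumPy; the helper name is illustrative):

import numpy as np

def dirichlet_posterior(alpha, counts):
    """Conjugate update: Dir(alpha) prior plus counts N_k gives Dir(alpha_k + N_k)."""
    return np.asarray(alpha, dtype=float) + np.asarray(counts, dtype=float)

# Uniform Dir(1, ..., 1) prior over a 6-sided die, with counts N_k = [1, 1, 3, 0, 0, 1]
alpha_post = dirichlet_posterior(np.ones(6), [1, 1, 3, 0, 0, 1])
print(alpha_post)                     # -> [2. 2. 4. 1. 1. 2.]
print(alpha_post / alpha_post.sum())  # posterior mean E[theta_k | D]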

We can derive the mode of this posterior (i.e., the MAP estimate) by using calculus. However, we must enforce the constraint that ∑_k θ_k = 1.² We can do this by using a Lagrange multiplier. The constrained objective function, or Lagrangian, is given by the log likelihood plus log prior plus the constraint:

ℓ(θ, λ) = ∑_k N_k log θ_k + ∑_k (α_k − 1) log θ_k + λ (1 − ∑_k θ_k)   (3.41)

To simplify notation, we define N′_k ≜ N_k + α_k − 1. Taking derivatives with respect to λ yields the original constraint:

∂ℓ/∂λ = 1 − ∑_k θ_k = 0   (3.42)

Taking derivatives with respect to θ_k yields

∂ℓ/∂θ_k = N′_k/θ_k − λ = 0   (3.43)

N′_k = λ θ_k   (3.44)

We can solve for λ using the sum-to-one constraint:

∑_k N′_k = λ ∑_k θ_k   (3.45)

N + α_0 − K = λ   (3.46)

where α_0 ≜ ∑_{k=1}^K α_k is the equivalent sample size of the prior. Thus the MAP estimate is given by

θ̂_k = (N_k + α_k − 1) / (N + α_0 − K)   (3.47)

which is consistent with Equation 2.77. If we use a uniform prior, α_k = 1, we recover the MLE:

θ̂_k = N_k / N   (3.48)

This is just the empirical fraction of times face k shows up.

2. We do not need to explicitly enforce the constraint that θ_k ≥ 0, since the gradient of the objective has the form N′_k/θ_k − λ; so negative values would reduce the objective, rather than maximize it. (Of course, this does not preclude setting θ_k = 0, and indeed this is the optimal solution if N_k = 0 and α_k = 1.)
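The MAP estimate of Equation 3.47 and the MLE of Equation 3.48 are one line each; here is a minimal sketch (assuming NumPy; the function names are illustrative):

import numpy as np

def dirichlet_map(alpha, counts):
    """MAP estimate: theta_k = (N_k + alpha_k - 1) / (N + alpha_0 - K)."""
    alpha = np.asarray(alpha, dtype=float)
    counts = np.asarray(counts, dtype=float)
    K = len(alpha)
    return (counts + alpha - 1) / (counts.sum() + alpha.sum() - K)

def categorical_mle(counts):
    """MLE: theta_k = N_k / N, the empirical fraction of times face k shows up."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

# With a uniform prior (alpha_k = 1) the MAP estimate reduces to the MLE:
counts = np.array([1, 1, 3, 0, 0, 1])
assert np.allclose(dirichlet_map(np.ones(6), counts), categorical_mle(counts))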

3.4.4 Posterior predictive

The posterior predictive distribution for a single multinoulli trial is given by the following expression:

p(X = j|D) = ∫ p(X = j|θ) p(θ|D) dθ   (3.49)

= ∫ p(X = j|θ_j) [ ∫ p(θ_{−j}, θ_j|D) dθ_{−j} ] dθ_j   (3.50)

= ∫ θ_j p(θ_j|D) dθ_j = E[θ_j|D] = (α_j + N_j) / ∑_k (α_k + N_k) = (α_j + N_j) / (α_0 + N)   (3.51)

where θ_{−j} are all the components of θ except θ_j. See also Exercise 3.13.

The above expression avoids the zero-count problem, just as we saw in Section 3.3.4.1. In fact, this form of Bayesian smoothing is even more important in the multinomial case than the binary case, since the likelihood of data sparsity increases once we start partitioning the data into many categories.
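A minimal sketch of Equation 3.51, i.e. using the posterior mean as the predictive distribution (assuming NumPy; the function name is illustrative):

import numpy as np

def posterior_predictive(alpha, counts):
    """p(X = j | D) = (alpha_j + N_j) / (alpha_0 + N): Bayesian (pseudo-count) smoothing."""
    alpha = np.asarray(alpha, dtype=float)
    counts = np.asarray(counts, dtype=float)
    return (alpha + counts) / (alpha.sum() + counts.sum())

# Faces 4 and 5 were never observed, yet they still receive non-zero probability (1/12 each):
print(posterior_predictive(np.ones(6), np.array([1, 1, 3, 0, 0, 1])))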

3.4.4.1 Worked example: language models using bag of words

One application of Bayesian smoothing using the Dirichlet-multinomial model is to language modeling, which means predicting which words might occur next in a sequence. Here we will take a very simple-minded approach, and assume that the i'th word, X_i ∈ {1, . . . , K}, is sampled independently from all the other words using a Cat(θ) distribution. This is called the bag of words model. Given a past sequence of words, how can we predict which one is likely to come next?

For example, suppose we observe the following sequence (part of a children’s nursery rhyme):

Mary had a little lamb, little lamb, little lamb,
Mary had a little lamb, its fleece as white as snow

Furthermore, suppose our vocabulary consists of the following words:

mary  lamb  little  big  fleece  white  black  snow  rain  unk
1     2     3       4    5       6      7      8     9     10

Here unk stands for unknown, and represents all other words that do not appear elsewhere on the list. To encode each line of the nursery rhyme, we first strip off punctuation, and remove any stop words such as "a", "as", "the", etc. We can also perform stemming, which means reducing words to their base form, such as stripping off the final s in plural words, or the ing from verbs (e.g., running becomes run). In this example, no words need stemming. Finally, we replace each word by its index into the vocabulary to get:

1 10 3 2 3 2 3 2 1 10 3 2 10 5 10 6 8

We now ignore the word order, and count how often each word occurred, resulting in a histogram of word counts:

Token   1     2     3       4    5       6      7      8     9     10
Word    mary  lamb  little  big  fleece  white  black  snow  rain  unk
Count   2     4     4       0    1       1      0      1     0     4

Denote the above counts by N_j. If we use a Dir(α) prior for θ, the posterior predictive is just

p(X̃ = j|D) = E[θ_j|D] = (α_j + N_j) / ∑_{j′} (α_{j′} + N_{j′})   (3.52)

If we set α_j = 1, we get

p(X̃ = j|D) = (1 + N_j) / (10 + 17) = (3/27, 5/27, 5/27, 1/27, 2/27, 2/27, 1/27, 2/27, 1/27, 5/27)   (3.53)

The modes of the predictive distribution are X = 2 ("lamb"), X = 3 ("little") and X = 10 ("unk"). Note that the words "big", "black" and "rain" are predicted to occur with non-zero probability in the future, even though they have never been seen before. Later on we will see more sophisticated language models.
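The numbers in Equation 3.53 can be reproduced with a few lines of Python (a sketch assuming NumPy; the token indices are those from the table above):

import numpy as np

# Encoded nursery rhyme: indices into the 10-word vocabulary
tokens = [1, 10, 3, 2, 3, 2, 3, 2, 1, 10, 3, 2, 10, 5, 10, 6, 8]
K = 10

counts = np.bincount(tokens, minlength=K + 1)[1:]       # N_j = [2 4 4 0 1 1 0 1 0 4]
alpha = np.ones(K)                                      # uniform Dir(1, ..., 1) prior
pred = (alpha + counts) / (alpha.sum() + counts.sum())  # (1 + N_j) / (10 + 17)

print(counts)       # the word counts from the histogram above
print(pred * 27)    # numerators of Equation 3.53: 3, 5, 5, 1, 2, 2, 1, 2, 1, 5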
