
Statistics review: Solutions

Pre-term week of the MVA master's program

Multinomial random variables

1.

If $Z = (Z_1, \dots, Z_K) \sim \mathcal{M}(\pi_1, \dots, \pi_K; 1)$ we have

$$P(Z_k = 1) = P(Z = e^{(k)}), \quad \text{with } e^{(k)} = (0, \dots, 0, \underbrace{1}_{k}, 0, \dots, 0),$$

so that

$$P(Z_k = 1) = \binom{1}{e^{(k)}} \prod_{j=1}^{K} \pi_j^{e^{(k)}_j}\, \mathbf{1}\Big\{\sum_{j=1}^{K} e^{(k)}_j = 1\Big\} = \pi_k.$$

2.

For $(n_1, \dots, n_K) \in \mathbb{N}^K$ such that $\sum_k n_k = n$, let

$$\mathcal{Z}_{n_1,\dots,n_K} := \Big\{ (z^{(1)}, \dots, z^{(n)}) \in \{0;1\}^{Kn} \;\Big|\; \forall i \in \{1,\dots,n\},\ \sum_{k=1}^{K} z^{(i)}_k = 1 \ \text{and}\ \forall k \in \{1,\dots,K\},\ \sum_{i=1}^{n} z^{(i)}_k = n_k \Big\}$$

$$\begin{aligned}
P(N_1 = n_1, \dots, N_K = n_K) &= P\big((Z^{(1)}, \dots, Z^{(n)}) \in \mathcal{Z}_{n_1,\dots,n_K}\big) \\
&= \sum_{(z^{(1)},\dots,z^{(n)}) \in \mathcal{Z}_{n_1,\dots,n_K}} P\big((Z^{(1)}, \dots, Z^{(n)}) = (z^{(1)}, \dots, z^{(n)})\big) \\
&= \sum_{(z^{(1)},\dots,z^{(n)}) \in \mathcal{Z}_{n_1,\dots,n_K}} P(Z^{(1)} = z^{(1)}) \cdots P(Z^{(n)} = z^{(n)}) \\
&= \sum_{(z^{(1)},\dots,z^{(n)}) \in \mathcal{Z}_{n_1,\dots,n_K}} \pi_1^{n_1} \pi_2^{n_2} \cdots \pi_K^{n_K} \\
&= \binom{n}{n_1, \dots, n_K} \prod_{k=1}^{K} \pi_k^{n_k}\, \mathbf{1}\Big\{\sum_i n_i = n\Big\}
\end{aligned}$$

This shows that $N := (N_1, \dots, N_K)$ follows the distribution $\mathcal{M}(\pi_1, \dots, \pi_K; n)$, since the multinomial coefficient
$$\binom{n}{n_1, \dots, n_K} := \frac{n!}{n_1! \cdots n_K!}$$
is exactly equal to the number of ordered partitions of $\{1, \dots, n\}$ into sets of cardinalities $n_1, \dots, n_K$.
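As a quick numerical sanity check (a sketch that is not part of the original solution; the parameter values are arbitrary), one can draw $n$ one-hot vectors and verify that the coordinate-wise sums behave like multinomial counts, e.g. $E[N_k] = n\pi_k$:

import numpy as np

# Empirical check: summing n one-hot draws Z^(1),...,Z^(n) ~ M(pi; 1)
# gives counts N whose mean matches n * pi.
rng = np.random.default_rng(0)
pi = np.array([0.2, 0.5, 0.3])
n, n_trials = 10, 100_000

# Draw n categories per trial and count occurrences of each category.
categories = rng.choice(len(pi), size=(n_trials, n), p=pi)
counts = np.stack([(categories == k).sum(axis=1) for k in range(len(pi))], axis=1)

print(counts.mean(axis=0))   # ~ [2.0, 5.0, 3.0]
print(n * pi)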


Sufficient Statistic

We first show that the conditional independence statement implies the proposed factorization. Indeed, we have
$$p(x, t, \theta) = p(\theta \mid t)\, p(t \mid x)\, p(x),$$
but since $t = T(x)$ is assumed to be a function of $x$, $p(t \mid x) = \delta\big(t - T(x)\big)$ and
$$p(x, t, \theta) = p\big(\theta \mid T(x)\big)\, \delta\big(t - T(x)\big)\, p(x),$$
where we have introduced the Dirac function (more precisely the Dirac at 0), so that after marginalizing $t$ out we obtain:
$$p(x, \theta) = p\big(\theta \mid T(x)\big)\, p(x), \quad \text{and so} \quad p(x \mid \theta) = \frac{p\big(\theta \mid T(x)\big)}{p(\theta)}\, p(x),$$
which is of the desired form.

We now show conversely that the factorization of $p(x \mid \theta)$ implies the conditional independence statement. If
$$p(x \mid \theta) = f\big(x, T(x)\big)\, g\big(T(x), \theta\big),$$
then
$$p(t, x, \theta) = \delta\big(T(x) - t\big)\, f\big(x, T(x)\big)\, g\big(T(x), \theta\big)\, p(\theta) = \delta\big(T(x) - t\big)\, f(x, t)\, g(t, \theta)\, p(\theta),$$
where $p(\theta)$ is the density of the prior distribution over $\theta$ with respect to a reference measure on $\theta$.

(To be rigorous, we should not write that this is a joint density for $(t, x, \theta)$ but rather that it is a derivative, in the sense of generalized functions, of a joint probability measure over the triple $(t, x, \theta)$; that is, we should for example call $\mu(t, x, \theta)$ the joint measure and, instead of writing $p(x, t, \theta)$, write $d\mu(x, t, \theta)$. However, to avoid writing things that are unnecessarily abstract, we will stick to these non-rigorous notations. The reasoning itself is nonetheless rigorous.)

As a consequence we have
$$p(t, \theta) = \int_x p(t, x, \theta)\, dx = \int_x \delta\big(T(x) - t\big)\, f(x, t)\, g(t, \theta)\, p(\theta)\, dx = h(t)\, g(t, \theta)\, p(\theta).$$

(Note that here, to be very rigorous, $p(t, \theta)$ is again a density with respect to a reference measure on $\mathbb{R}^2$.) For $t$ such that $p(t, \theta) \neq 0$, we have
$$p(x \mid t, \theta) = \frac{p(x, t, \theta)}{p(t, \theta)} = \frac{\delta\big(T(x) - t\big)\, f(x, t)\, g(t, \theta)\, p(\theta)}{h(t)\, g(t, \theta)\, p(\theta)} = \frac{\delta\big(T(x) - t\big)\, f(x, t)}{h(t)},$$
which shows that $p(x \mid t, \theta) = p(x \mid t)$. If $p(t, \theta) = 0$, we can define $p(x \mid t, \theta)$ any way we want (because on a set of probability zero its value does not matter), and in particular we may set $p(x \mid t, \theta) = p(x \mid t)$.
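To illustrate the statement (a hedged sketch assuming the Bernoulli model with $T(x) = \sum_i x_i$; none of this is in the original solution), one can check empirically that, conditionally on $T(x) = t$, the distribution of the sample does not depend on $\theta$:

import numpy as np

# Conditionally on T(x) = t, every arrangement of the t ones among n
# positions is equally likely, whatever theta.
rng = np.random.default_rng(0)
n, t = 5, 2

for theta in (0.2, 0.7):
    samples = rng.binomial(1, theta, size=(200_000, n))
    cond = samples[samples.sum(axis=1) == t]          # condition on T(x) = t
    # frequency of one particular configuration, e.g. (1,1,0,0,0)
    freq = np.mean(np.all(cond == np.array([1, 1, 0, 0, 0]), axis=1))
    print(theta, freq)        # ~ 1 / C(5,2) = 0.1 for both values of theta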

Method of moments vs maximum likelihood estimation

1.

a)

$$p(x_1, \dots, x_n \mid \theta) = \frac{1}{\theta^n} \prod_{i=1}^{n} \mathbf{1}_{x_i \in [0; \theta]}$$


$$\begin{aligned}
\hat{\theta}_{MLE} &= \operatorname*{argmax}_{\theta}\; p(x_1, \dots, x_n \mid \theta) \\
&= \operatorname*{argmax}_{\theta}\; \frac{1}{\theta^n}\, \mathbf{1}\big\{(\max_i x_i) \in [0; \theta]\big\} \\
&= \operatorname*{argmin}_{\theta}\; \theta \quad \text{s.t. } \theta \geq \max_{i \in \{1,\dots,n\}} x_i \\
&= \max_{i \in \{1,\dots,n\}} x_i
\end{aligned}$$

b)

$$P(\hat{\theta}_{MLE} \leq x) = P(\forall i \in \{1,\dots,n\},\ x_i \leq x) = \Big(\frac{x}{\theta}\Big)^n$$

$$p_{\hat{\theta}_{MLE}}(z) = \frac{n z^{n-1}}{\theta^n}$$

Thus, $\hat{\theta}_{MLE}/\theta$ follows a Beta distribution with parameters $\alpha = n$ and $\beta = 1$.

c)

We can use the given formulas:

$$E_\theta[\hat{\theta}_{MLE}] = \theta\, \frac{n}{n+1}, \qquad \mathrm{Var}_\theta(\hat{\theta}_{MLE}) = \theta^2\, \frac{n}{(n+1)^2(n+2)}$$

d)

$$\hat{\theta}_{MO} = \frac{2}{n} \sum_{i=1}^{n} x_i$$

$$E_\theta[\hat{\theta}_{MO}] = \theta, \qquad \mathrm{Var}_\theta(\hat{\theta}_{MO}) = \frac{4}{n^2} \cdot n \cdot \frac{\theta^2}{12} = \frac{\theta^2}{3n}$$

e)

$$\begin{aligned}
MSE &= E[(\theta - \hat{\theta})^2] \\
&= E[(\theta - E[\hat{\theta}])^2] + E[(\hat{\theta} - E[\hat{\theta}])^2] \\
&= \begin{cases} \dfrac{\theta^2}{3n} & \text{for MO} \\[2mm] \dfrac{\theta^2}{(n+1)^2} + \dfrac{\theta^2 n}{(n+1)^2(n+2)} = \dfrac{2\theta^2}{(n+1)(n+2)} & \text{for MLE} \end{cases}
\end{aligned}$$
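A small Monte Carlo experiment (an illustrative sketch, not part of the original solution; the values of $\theta$ and $n$ are arbitrary) can be used to check the two MSE formulas of e):

import numpy as np

# MSE of the MLE max_i x_i vs the method-of-moments estimator 2*mean(x)
# for x_i ~ U[0, theta].
rng = np.random.default_rng(0)
theta, n, n_rep = 2.0, 20, 200_000

x = rng.uniform(0, theta, size=(n_rep, n))
mle = x.max(axis=1)
mom = 2 * x.mean(axis=1)

print(np.mean((mle - theta) ** 2), 2 * theta**2 / ((n + 1) * (n + 2)))  # MLE
print(np.mean((mom - theta) ** 2), theta**2 / (3 * n))                  # MoM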


Computation of maximum likelihood estimators

1.

a)

$$\begin{aligned}
\operatorname*{argmax}_{\theta}\; P(x_1, \dots, x_n \mid \theta) &= \operatorname*{argmax}_{\theta}\; \theta^{\sum_{i=1}^n x_i} (1-\theta)^{n - \sum_{i=1}^n x_i} \\
&= \operatorname*{argmax}_{\theta}\; \exp\Big( \sum_{i=1}^n x_i \log(\theta) + \Big(n - \sum_{i=1}^n x_i\Big) \log(1-\theta) \Big) \\
&= \operatorname*{argmax}_{\theta}\; N \log(\theta) + (n - N) \log(1-\theta), \quad \text{with } N := \sum_{i=1}^n x_i.
\end{aligned}$$

Each term is continuous and strictly concave, and their sum tends to $-\infty$ as $\theta \to 0$ or $\theta \to 1$, so the MLE is unique.

b)

Let $\ell(x_1, \dots, x_n \mid \theta) = N \log(\theta) + (n - N) \log(1-\theta)$,
$$\frac{\partial \ell(x_1, \dots, x_n \mid \theta)}{\partial \theta} = \frac{N}{\theta} - \frac{n - N}{1 - \theta} = \frac{N - n\theta}{\theta(1-\theta)}.$$
Thus,
$$\hat{\theta}_{MLE} = \frac{N}{n}.$$

2.

a)

$$P(Z_1, \dots, Z_n \mid \{\pi_k\}_k) = \pi_1^{\sum_{i=1}^n Z_{i,1}} \cdots \pi_K^{\sum_{i=1}^n Z_{i,K}} = \pi_1^{N_1} \cdots \pi_K^{N_K}$$

So $(N_1, \dots, N_K)$ is a sufficient statistic for the sample, because the likelihood depends on the data only through these quantities (see the exercise called Sufficient statistic for the definition).

b)

$$\operatorname*{argmax}_{\pi}\; \pi_1^{N_1} \cdots \pi_K^{N_K} = \operatorname*{argmax}_{\pi}\; N_1 \log(\pi_1) + \dots + N_K \log(\pi_K)$$

The MLE is the solution of the constrained convex optimization problem
$$\operatorname*{argmax}_{\pi \geq 0,\ \sum_{k=1}^K \pi_k = 1}\; \sum_{k=1}^{K} N_k \log(\pi_k)$$


c)

Let $L(\pi, \lambda) = \sum_{k,\, N_k > 0} N_k \log(\pi_k) - \lambda\big(\sum_k \pi_k - 1\big)$ be the associated Lagrangian.
$$\frac{\partial L}{\partial \pi_k} = \frac{N_k}{\hat{\pi}_k} - \lambda$$
So,
$$\hat{\pi}_k = \alpha N_k \ \text{ with } \alpha = \frac{1}{\lambda}, \qquad \text{and } \hat{\pi}_k = \frac{N_k}{n} \ \text{ since } \sum_{k=1}^{K} \hat{\pi}_k = 1.$$

Note that we only introduced a Lagrange multiplier for the equality constraint and not for the positivity constraints $\pi_k \geq 0$, because the log-likelihood diverges to $-\infty$ on the boundary of the domain, which ensures that these constraints will be satisfied. We can indeed check that the estimators $\hat{\pi}_k$ are all non-negative.
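As a numerical cross-check (a sketch relying on scipy's generic SLSQP solver; the counts are arbitrary and this is not part of the original solution), maximizing $\sum_k N_k \log \pi_k$ over the simplex indeed returns $\hat{\pi}_k = N_k / n$:

import numpy as np
from scipy.optimize import minimize

# Maximize sum_k N_k log(pi_k) subject to pi >= 0 and sum(pi) = 1.
N = np.array([3, 7, 10])
n = N.sum()

neg_log_lik = lambda p: -np.sum(N * np.log(p))
res = minimize(neg_log_lik,
               x0=np.ones(len(N)) / len(N),
               method="SLSQP",
               bounds=[(1e-9, 1.0)] * len(N),
               constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}])

print(res.x)      # ~ N / n = [0.15, 0.35, 0.5]
print(N / n)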

3.

$$f(\mu + h) = u^\top \mu + u^\top h, \qquad df_\mu(h) = u^\top h, \qquad \nabla f(\mu) = u$$

$$g(\mu + h) = \mu^\top A \mu + \mu^\top A h + \mu^\top A^\top h + h^\top A h, \qquad dg_\mu(h) = \mu^\top (A + A^\top) h, \qquad \nabla g(\mu) = (A + A^\top) \mu$$

a)

If $\Sigma$ is fixed and positive definite,
$$\begin{aligned}
\operatorname*{argmax}_{\mu}\; p(x_1, \dots, x_n \mid \mu) &= \operatorname*{argmax}_{\mu}\; -\frac{n}{2} \log\big((2\pi)^d |\Sigma|\big) - \frac{1}{2} \sum_i \big(\mu^\top \Sigma^{-1} \mu - \mu^\top \Sigma^{-1} x_i - x_i^\top \Sigma^{-1} \mu + x_i^\top \Sigma^{-1} x_i\big) \\
&= \operatorname*{argmin}_{\mu}\; \sum_i \big(\mu^\top \Sigma^{-1} \mu - \mu^\top \Sigma^{-1} x_i - x_i^\top \Sigma^{-1} \mu + x_i^\top \Sigma^{-1} x_i\big)
\end{aligned}$$

Let's compute the gradient of this objective,
$$\nabla \ell(\mu) = \sum_i \big(2\Sigma^{-1} \mu - 2\Sigma^{-1} x_i\big).$$
Setting it to zero gives us
$$\hat{\mu}_{MLE} = \frac{1}{n} \sum_i x_i$$


b)

If $\mu$ is fixed and $\Lambda = \Sigma^{-1}$, since
$$\sum_i (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu) = \operatorname{tr}\Big( \sum_i (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu) \Big) = \operatorname{tr}\Big( \sum_i \Sigma^{-1} (x_i - \mu)(x_i - \mu)^\top \Big) = n \operatorname{tr}(\hat{\Sigma} \Lambda),$$
with $\hat{\Sigma} := \frac{1}{n} \sum_i (x_i - \mu)(x_i - \mu)^\top$, we have:
$$p(x_1, \dots, x_n \mid \Sigma) = \frac{1}{\big((2\pi)^d |\Sigma|\big)^{n/2}} \exp\Big(-\frac{n}{2} \operatorname{tr}(\hat{\Sigma} \Lambda)\Big)$$

d)

$$\langle A, B + H \rangle_F - \langle A, B \rangle_F = \langle A, H \rangle_F$$

e)

$f : A \mapsto \log(|A|)$. Let $H = (h_{i,j})_{i,j} \in \mathbb{R}^{n \times n}$.

$$\begin{aligned}
|I + H| &= \sum_{\sigma \in S_n} \operatorname{sgn}(\sigma) \prod_{i=1}^{n} \big(\mathbf{1}_{\sigma(i)=i} + h_{i,\sigma(i)}\big) \\
&= \prod_{i=1}^{n} (1 + h_{i,i}) + \sum_{\sigma \in S_n \setminus \{\mathrm{Id}\}} \operatorname{sgn}(\sigma) \prod_{i=1}^{n} \big(\mathbf{1}_{\sigma(i)=i} + h_{i,\sigma(i)}\big) \\
&= 1 + \sum_{i=1}^{n} h_{i,i} + O(\|H\|^2)
\end{aligned}$$

$$d|\cdot|_I(H) = \operatorname{tr}(H)$$

For $A$ symmetric positive definite and $H$ symmetric such that $A + H$ is positive definite:
$$\begin{aligned}
\log\det(A + H) &= \log\big(\det A \cdot \det(I + A^{-1}H)\big) \\
&= \log\det A + \log\det(I + A^{-1}H) \\
&= \log\det A + \operatorname{tr}(A^{-1}H) + o(\|H\|)
\end{aligned}$$

$$d(\log\det)_A(H) = \operatorname{tr}(A^{-1}H), \qquad \nabla f(A) = (A^{-1})^\top$$
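A finite-difference check of the differential $d(\log\det)_A(H) = \operatorname{tr}(A^{-1}H)$ (an illustrative sketch with arbitrary matrices, not part of the original solution):

import numpy as np

# Compare (log det(A + eps*H) - log det(A)) / eps with tr(A^{-1} H).
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4 * np.eye(4)           # symmetric positive definite
H = rng.standard_normal((4, 4))
H = (H + H.T) / 2                     # symmetric perturbation
eps = 1e-6

lhs = (np.linalg.slogdet(A + eps * H)[1] - np.linalg.slogdet(A)[1]) / eps
rhs = np.trace(np.linalg.solve(A, H))
print(lhs, rhs)                       # agree up to O(eps)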

f)

$$\log p(x_1, \dots, x_n \mid \Lambda) = -\frac{nd}{2} \log(2\pi) + \frac{n}{2} \log|\Lambda| - \frac{n}{2} \operatorname{tr}(\hat{\Sigma} \Lambda)$$

$$\nabla_\Lambda \log p(x_1, \dots, x_n \mid \Lambda) = \frac{n}{2} \Lambda^{-1} - \frac{n}{2} \hat{\Sigma}$$

$$\Lambda_{MLE} = \hat{\Sigma}^{-1}, \qquad \Sigma_{MLE} = \hat{\Sigma}$$

Indeed, if $\hat{\theta} \in \operatorname*{argmax}_\theta f(\theta)$ and $\hat{\theta} = \varphi(\hat{\alpha})$, then $\hat{\alpha} \in \operatorname*{argmax}_\alpha f(\varphi(\alpha))$.


g)

$$p(x_1, \dots, x_n \mid \mu, \Sigma) = \frac{1}{\big((2\pi)^d |\Sigma|\big)^{n/2}} \exp\Big( -\frac{1}{2} \sum_i (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu) \Big)$$

If $\hat{\Sigma}$ is not invertible, let's write $\hat{\Sigma} = U \operatorname{diag}(\lambda_1, \dots, \lambda_K, 0, \dots, 0)\, U^\top$ with $U$ an orthogonal matrix, and set $\Lambda_\alpha = U \operatorname{diag}(\lambda_1, \dots, \lambda_K, \alpha, \dots, \alpha)^{-1} U^\top$.

$$\log p(x_1, \dots, x_n \mid \Lambda_\alpha) = -\frac{nd}{2} \log(2\pi) + \frac{n}{2} \log|\Lambda_\alpha| - \frac{n}{2} \operatorname{tr}(\hat{\Sigma} \Lambda_\alpha)$$

The second term goes to $+\infty$ as $\alpha \to 0$ while the other two remain constant, so the log-likelihood is unbounded. In practice, the maximum likelihood estimator is extended by continuity to these cases; the obtained estimator can also be thought of as the maximum likelihood estimator for Gaussian densities on the subspace spanned by $\{x_1, \dots, x_n\}$.

Bayesian estimation

1.

a)

$$\begin{aligned}
p(\pi \mid \alpha, n) &= p(n \mid \alpha, \pi) \cdot \frac{p(\pi \mid \alpha)}{p(n \mid \alpha)} \\
&\propto \prod_{k=1}^{K} \pi_k^{n_k} \cdot \frac{\Gamma(\alpha_1 + \dots + \alpha_K)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \pi_k^{\alpha_k - 1} \\
&\propto \prod_{k=1}^{K} \pi_k^{n_k + \alpha_k - 1}
\end{aligned}$$

b)

We denote by $\triangle$ the canonical simplex $\triangle := \big\{ u \in \mathbb{R}_+^K \mid \sum_{k=1}^K u_k = 1 \big\}$. We then have
$$E[\pi_j \mid Z] = \int_{\triangle} \pi_j\, p(\pi \mid Z)\, d\pi$$

Let us consider a fixed value of $j \in \{1, \dots, K\}$ and define $\beta_k = \alpha_k + n_k$ for all $k$.

$$\begin{aligned}
E[\pi_j \mid Z] &= \frac{\Gamma(\beta_1 + \dots + \beta_K)}{\prod_{k=1}^K \Gamma(\beta_k)} \int_{\triangle} \pi_j \prod_{k=1}^{K} \pi_k^{\beta_k - 1}\, d\pi \\
&= \frac{\Gamma(\beta_1 + \dots + \beta_K)}{\prod_{k=1}^K \Gamma(\beta_k)} \cdot \frac{\Gamma(\beta_j + 1) \prod_{k \neq j} \Gamma(\beta_k)}{\Gamma(\beta_1 + \dots + \beta_K + 1)} \\
&= \frac{\Gamma(\beta_j + 1)}{\Gamma(\beta_j)} \cdot \frac{\Gamma(\beta_1 + \dots + \beta_K)}{\Gamma(\beta_1 + \dots + \beta_K + 1)} \\
&= \frac{\beta_j}{\beta_1 + \dots + \beta_K} \\
&= \frac{\alpha_j + n_j}{\alpha_{\mathrm{tot}} + n}, \quad \text{with } \alpha_{\mathrm{tot}} = \alpha_1 + \dots + \alpha_K.
\end{aligned}$$
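The closed form $E[\pi_j \mid Z] = (\alpha_j + n_j)/(\alpha_{\mathrm{tot}} + n)$ can be checked against Monte Carlo samples from the Dirichlet posterior (an illustrative sketch with arbitrary $\alpha$ and counts, not part of the original solution):

import numpy as np

# Posterior is Dirichlet(beta) with beta_k = alpha_k + n_k.
rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 3.0])
n_counts = np.array([4, 0, 6])

beta = alpha + n_counts
samples = rng.dirichlet(beta, size=500_000)
print(samples.mean(axis=0))               # empirical posterior mean
print(beta / beta.sum())                  # (alpha_j + n_j) / (alpha_tot + n)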


c)

$$E[\pi_1 \mid n_1, n_2] = \frac{n_1 + 1}{n_1 + n_2 + 2}$$

Without smoothing, if $n_1 = 0$ or $n_2 = 0$, which is common if $\pi_1$ is close to $0$ or $1$, the maximum likelihood estimator estimates that $\pi_i = 0$ even though $\pi_i > 0$. This is a major problem, because the probability of some non-zero event is then assessed to be equal to $0$, which makes all probabilistic reasoning fail.

2.

$$\mathcal{P} = \{p(x \mid \theta),\ \theta \in \Theta\}, \qquad \Pi = \{p_\alpha(\theta),\ \alpha \in \mathcal{A}\}$$

$\Pi$ is a conjugate family of distributions for $\mathcal{P}$ if for all $p_\alpha \in \Pi$, there exists $p_{\alpha'} \in \Pi$ such that we can write $p_\alpha(\theta \mid x) = p_{\alpha'}(\theta)$.

$$p_{\mathrm{Bernoulli}}(x \mid \theta) = \theta^x (1 - \theta)^{1 - x}$$

The family of Beta distributions $p_{\alpha,\beta}(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$ is a conjugate family of distributions for the family of Bernoulli distributions.

$$p_{\mathrm{Poisson}}(x_1, \dots, x_n \mid \lambda) = \frac{\lambda^{\sum_{i=1}^n x_i} e^{-n\lambda}}{\prod_{i=1}^n (x_i!)}$$

The family of Gamma distributions $p_{\alpha,\beta}(\lambda) \propto \lambda^{\alpha-1} e^{-\beta\lambda}$ is a conjugate family of distributions for the family of Poisson distributions.

$$p(\mu \mid x_1, \dots, x_n) \propto \exp\Big( -\frac{1}{2} \sum_i (\mu - x_i)^\top \Sigma^{-1} (\mu - x_i) - \frac{1}{2} (\mu - \mu_0)^\top \Sigma_0^{-1} (\mu - \mu_0) \Big)$$

The family of Gaussian distributions with the given covariance and unknown mean is a conjugate family of distributions for the family of Gaussian random variables with fixed known covariance and unknown mean (cf. ex. 3).

3.

a)

As the product of two Gaussian densities, the posterior distribution is still Gaussian, $\mathcal{N}(\hat{\mu}_{PM}, \omega^2)$.

On the one hand,
$$\exp\Big( -\frac{(\mu - \hat{\mu}_{PM})^2}{2\omega^2} \Big) = \exp\Big( -\frac{\mu^2}{2\omega^2} + \frac{2\mu \hat{\mu}_{PM}}{2\omega^2} - \frac{\hat{\mu}_{PM}^2}{2\omega^2} \Big)$$

On the other hand,
$$\exp\Big( -\sum_i \frac{(x_i - \mu)^2}{2\sigma^2} - \frac{(\mu_0 - \mu)^2}{2\tau^2} \Big) = \exp\Big( -\mu^2 \Big(\frac{n}{2\sigma^2} + \frac{1}{2\tau^2}\Big) + 2\mu \Big(\frac{\sum x_i}{2\sigma^2} + \frac{\mu_0}{2\tau^2}\Big) - \Big(\frac{\sum x_i^2}{2\sigma^2} + \frac{\mu_0^2}{2\tau^2}\Big) \Big)$$

By identification,
$$\frac{1}{\omega^2} = \frac{n}{\sigma^2} + \frac{1}{\tau^2},$$
and we get the posterior mean:
$$\hat{\mu}_{PM} = \frac{\frac{\sum x_i}{\sigma^2} + \frac{\mu_0}{\tau^2}}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}}.$$
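A short numerical illustration (a sketch with arbitrary parameters, not part of the original solution) of this formula, showing that the posterior mean is a precision-weighted compromise between the sample mean and the prior mean:

import numpy as np

# Posterior mean for a Gaussian likelihood with known sigma and a N(mu0, tau^2) prior.
rng = np.random.default_rng(0)
mu_true, sigma, mu0, tau, n = 1.5, 1.0, 0.0, 0.5, 20
x = rng.normal(mu_true, sigma, size=n)

post_prec = n / sigma**2 + 1 / tau**2
mu_pm = (x.sum() / sigma**2 + mu0 / tau**2) / post_prec
print(mu_pm, x.mean())    # the posterior mean is shrunk from x.mean() towards mu0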


b)

$$\hat{\mu}_{PM} = \frac{\sum x_i}{n} \cdot \lambda_n + \mu_0 \cdot (1 - \lambda_n), \quad \text{with } \lambda_n = \frac{n/\sigma^2}{n/\sigma^2 + 1/\tau^2}$$

d)

In this case, and if $\mu_0 = 0$:
$$\operatorname*{argmin}_{\mu}\; -\log p(x_1, \dots, x_n \mid \mu) + \frac{\lambda}{2} \mu^2 = \operatorname*{argmin}_{\mu}\; -\log\big( p(x_1, \dots, x_n \mid \mu) \cdot p(\mu) \big) \quad \text{for } \lambda = \frac{1}{\tau^2}$$

Thus the MAP estimator can be viewed as a minimizer of the negative log-likelihood with some ridge regularization.

e)

$$\hat{\mu}_{MAP} = \hat{\mu}_{PM}$$

This property doesn't hold for a Bernoulli distribution with a Beta prior.

f)

$$E[\log p(X_0 \mid \nu, \sigma)] = E\Big[ -\frac{(X_0 - \nu)^2}{2\sigma^2} \Big] + \text{const}$$

g)

$$\begin{aligned}
R(\nu) &= E[(\nu - X_0)^2] \\
&= E[(\nu - E[X_0] + E[X_0] - X_0)^2] \\
&= E[(\nu - E[X_0])^2] + E[(E[X_0] - X_0)^2] + 2 E[(\nu - E[X_0])(E[X_0] - X_0)] \\
&= (\nu - \mu)^2 + \mathrm{Var}(X_0)
\end{aligned}$$

(the cross term vanishes since $E[X_0] - X_0$ has zero mean and $\nu - E[X_0]$ is deterministic).

h)

$$\begin{aligned}
E_{D_n}[\mathcal{E}(\hat{\mu})] &= E_{D_n}[R(\hat{\mu}) - R(\mu)] \\
&= E_{D_n}[(\hat{\mu} - \mu)^2] \\
&= E_{D_n}\big[(\hat{\mu} - E_{D_n}[\hat{\mu}])^2\big] + \big(E_{D_n}[\hat{\mu}] - \mu\big)^2
\end{aligned}$$

i)

$$E_{D_n}[\mathcal{E}(\mu_{MLE})] = E_{D_n}\big[(\mu_{MLE} - E_{D_n}[\mu_{MLE}])^2\big] + 0 = \frac{\sigma^2}{n}$$


$$\begin{aligned}
E_{D_n}[\mathcal{E}(\mu_{MAP})] &= E_{D_n}\big[(\mu_{MAP} - E_{D_n}[\mu_{MAP}])^2\big] + \big(E_{D_n}[\mu_{MAP}] - \mu\big)^2 \\
&= \frac{\frac{n}{\sigma^2}}{\big(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\big)^2} + \frac{\frac{(\mu - \mu_0)^2}{\tau^4}}{\big(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\big)^2}
\end{aligned}$$

j)

$$R_\pi(MLE) = \frac{\sigma^2}{n}, \qquad R_\pi(PM) = \frac{\sigma^2 / n}{1 + \frac{\sigma^2}{n \tau^2}}$$

l)

$$E_{\mu \sim \pi,\, D_n}[(\hat{\mu} - \mu)^2] = E_{D_n}\big[ E_{\mu \sim \pi}[(\hat{\mu} - \mu)^2 \mid D_n] \big]$$

The inner quantity is minimized for every possibleDn by using the posterior mean.
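The Bayes risks of j) can be checked by simulation, drawing $\mu$ from the prior and then the data (an illustrative sketch with arbitrary parameters, not part of the original solution):

import numpy as np

# Averaging over mu ~ N(mu0, tau^2) and over D_n, the posterior mean has
# smaller Bayes risk than the MLE (sample mean).
rng = np.random.default_rng(0)
sigma, mu0, tau, n, n_rep = 1.0, 0.0, 0.5, 10, 200_000

mu = rng.normal(mu0, tau, size=n_rep)                     # mu ~ prior
x = rng.normal(mu[:, None], sigma, size=(n_rep, n))       # D_n given mu
mle = x.mean(axis=1)
lam = (n / sigma**2) / (n / sigma**2 + 1 / tau**2)
pm = lam * mle + (1 - lam) * mu0

print(np.mean((mle - mu) ** 2), sigma**2 / n)                                   # R_pi(MLE)
print(np.mean((pm - mu) ** 2), (sigma**2 / n) / (1 + sigma**2 / (n * tau**2)))  # R_pi(PM)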

Bregman divergence

1.

$$D_F(p, q) = \langle p, p \rangle - \langle q, q \rangle - 2\langle q, p \rangle + 2\langle q, q \rangle = \langle p - q, p - q \rangle$$

2.

With $H(q) = \sum_i q_i \log q_i$ (the negative entropy), $(\nabla H(q))_i = \log q_i + 1$, so

$$\begin{aligned}
D_H(p, q) &= \sum_i p_i \log p_i - \sum_i q_i \log q_i - \sum_i (\log q_i + 1)(p_i - q_i) \\
&= \sum_i p_i (\log p_i - \log q_i) \\
&= KL(p, q)
\end{aligned}$$

(using $\sum_i p_i = \sum_i q_i = 1$).
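A quick numerical check (a sketch, not part of the original solution) that the Bregman divergence generated by the negative entropy coincides with the KL divergence:

import numpy as np

# Bregman divergence of F(q) = sum_i q_i log q_i vs KL(p, q) on the simplex.
p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])

F = lambda u: np.sum(u * np.log(u))
grad_F = lambda u: np.log(u) + 1
bregman = F(p) - F(q) - grad_F(q) @ (p - q)
kl = np.sum(p * np.log(p / q))
print(bregman, kl)    # equal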

3.

We assume the loss is differentiable.

F(µ) =EX[l(µ, X)]−EX[l(µ, X)]


Ridge regression and PCA

1.

a)

$$X = U S V^\top, \qquad X^+ = V S^+ U^\top, \qquad S^+ = \operatorname{diag}(s_i^+) \ \text{ with } s_i^+ = \tfrac{1}{s_i} \text{ if } s_i \neq 0 \text{ and } 0 \text{ otherwise}$$

$$X X^+ X = U S V^\top\, V S^+ U^\top\, U S V^\top = U \operatorname{diag}(\mathbf{1}_{s_i \neq 0})\, S V^\top = U S V^\top = X$$

b)

$$(X^\top X)^+ = (V S^2 V^\top)^+ = V (S^+)^2 V^\top$$

$$X^+ (X^+)^\top = V (S^+)^2 V^\top$$

$$(X^\top X)^+ X^\top = X^+ (X^+)^\top X^\top = V S^+ S^+ S\, U^\top = V S^+ U^\top = X^+$$

c)

$$X^+ X X^+ = V S^+ U^\top\, U S V^\top\, V S^+ U^\top = V S^+ U^\top = X^+$$

d)

First suppose $X = S$. Let $w$ be a solution of the normal equation $S^2 w = S^\top y$:
$$s_{i,i}^2\, w_i = s_i y_i \ \Rightarrow\ w_i = \frac{y_i}{s_i} \ \text{if } s_i \neq 0, \qquad \text{and } (S^\top y)_i = 0 \ \text{if } s_i = 0,$$
so the $i$-th equation imposes no constraint when $s_i = 0$; taking $w_i = 0$ in that case gives $w = S^+ y$, so the claim is true if $X = S$.

For any $X$, let $\tilde{w} = V^\top w$ and $\tilde{y} = U^\top y$. Then $w$ is a solution of the normal equation iff $\tilde{w}$ is a solution of $S^2 \tilde{w} = S \tilde{y}$.
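Numerically (an illustrative sketch using numpy's pinv, not part of the original solution), $w = X^+ y$ can be checked to solve the normal equations even for a rank-deficient $X$:

import numpy as np

# Build a rank-deficient design matrix and verify X^T X w = X^T y for w = pinv(X) y.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 2))
X = np.hstack([A, A[:, :1] + A[:, 1:]])   # third column is a linear combination -> rank 2
y = rng.standard_normal(10)

w = np.linalg.pinv(X) @ y
print(np.allclose(X.T @ X @ w, X.T @ y))  # True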

2.

$X = U S V^\top$, where $U$ and $V$ are square matrices. The columns of $U$ and the columns of $V$ are called the left-singular vectors and right-singular vectors of $X$, respectively.


a)

Since $U$ and $V$ are orthogonal matrices, we have for all $w \in \mathbb{R}^n$: $\|Xw\| = \|U S V^\top w\| = \|S V^\top w\|$, so

$$\operatorname*{argmax}_{w \in \mathbb{R}^n,\, \|w\|=1} \|Xw\| = V \operatorname*{argmax}_{w \in \mathbb{R}^n,\, \|w\|=1} \|Sw\|$$

(the change of variables $w \mapsto V^\top w$ preserves the unit sphere).

b)

$$\hat{\Sigma} = \frac{1}{n} V S^2 V^\top$$

c)

$$\sum_i (c_{j,i} - \bar{c}_j)^2 = \sum_i \Big( (X v_j)_i - \frac{1}{n} \sum_k (X v_j)_k \Big)^2 = \sum_i (X v_j)_i^2 = \|X v_j\|^2 = s_j^2$$

(the mean term vanishes since the columns of $X$ are assumed to be centered).

d)

$$X X^\top X v_j = U S V^\top\, V S U^\top\, U S V^\top v_j = U S^3 V^\top v_j = s_j^2\, X v_j$$

3.

a)

$$\begin{aligned}
\sum_i \big\| y^{(i)} - (c^{(i)}_{1:k})^\top w \big\|^2 &= \sum_i \big\| y^{(i)} - \big((x_i^\top v_j)_{j=1:k}\big)^\top w \big\|^2 \\
&= \|y - (X V_{1:k}) w\|^2 \\
&= \|y - U_k S_k w\|^2
\end{aligned}$$

with $U_k = (u_1, \dots, u_k)$ and $S_k = \operatorname{diag}(s_1, \dots, s_k)$. So,
$$\tilde{w} = \big( (U_k S_k)^\top U_k S_k \big)^{-1} (U_k S_k)^\top y = S_k^{-1} U_k^\top y.$$

b)

$$\tilde{w}^\top \big( \langle x - \bar{x}_0, v_1 \rangle, \dots, \langle x - \bar{x}_0, v_k \rangle \big) = \sum_j \tilde{w}_j \langle x - \bar{x}_0, v_j \rangle = \Big\langle x - \bar{x}_0,\ \sum_j \frac{1}{s_j} \langle u_j, y \rangle\, v_j \Big\rangle$$


c)

$$\begin{aligned}
w_R &= (X^\top X + \lambda I_p)^{-1} X^\top y \\
&= \big( V (S^2 + \lambda I_p) V^\top \big)^{-1} V S U^\top y \\
&= V \operatorname{diag}\Big( \frac{s_i}{s_i^2 + \lambda} \Big) U^\top y \\
&= \sum_{j=1}^{m} \frac{s_j}{s_j^2 + \lambda}\, \langle u_j, y \rangle\, v_j
\end{aligned}$$
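The SVD expression for the ridge solution can be checked against the direct solve (an illustrative sketch with random data, not part of the original solution):

import numpy as np

# Compare (X^T X + lambda I)^{-1} X^T y with sum_j s_j/(s_j^2 + lambda) <u_j, y> v_j.
rng = np.random.default_rng(0)
n, p, lam = 50, 5, 0.7
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

w_direct = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))
print(np.allclose(w_direct, w_svd))       # True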

d)

The coefficients for $j > k$ will vanish.

e)

$$\langle X^\top y, v_j \rangle = \langle V S U^\top y, v_j \rangle = (S U^\top y)_j = s_j \langle u_j, y \rangle$$

f)

Andrei Tikhonov and Karl Pearson

Area under the curve and Mann-Whitney U statistic

a)

Let $C_0$ be the set of elements that belong to class $0$ and $C_1$ the set of elements that belong to class $1$.

$$r_{TP}(b) = \frac{P(s(x) > b,\ x \in C_1)}{P(x \in C_1)} = \frac{P(s(x) > b) \cdot P(x \in C_1)}{P(x \in C_1)} = 1 - F(b),$$

since the score $s(x)$ is independent of the class of $x$. Similarly,

$$r_{FP}(b) = \frac{P(s(x) > b,\ x \in C_0)}{P(x \in C_0)} = \frac{P(s(x) > b) \cdot P(x \in C_0)}{P(x \in C_0)} = 1 - F(b).$$

Hence,
$$AUC = \frac{1}{2}.$$


b)

WLOG, suppose $s(x_1) < \dots < s(x_n)$ and $s(y_1) < \dots < s(y_m)$. On the one hand,

$$\begin{aligned}
U &= \sum_{i=1}^{n} \big| \{ s(z) \mid z \in D_N \cup D_P \text{ and } s(z) \leq s(x_i) \} \big| - \frac{n(n+1)}{2} \\
&= \sum_{i=1}^{n} \big| \{ s(z) \mid z \in D_N \text{ and } s(z) \leq s(x_i) \} \big| + \underbrace{\sum_{i=1}^{n} \big| \{ s(z) \mid z \in D_P \text{ and } s(z) \leq s(x_i) \} \big|}_{= \sum_{i=1}^{n} i \,=\, n(n+1)/2} - \frac{n(n+1)}{2} \\
&= \sum_{i=1}^{n} \big| \{ s(z) \mid z \in D_N \text{ and } s(z) \leq s(x_i) \} \big|
\end{aligned}$$

On the other hand,
$$r_{TP}(s(x_i)) = 1 - \frac{i}{n}$$
and
$$r_{FP}(s(x_i)) = \frac{\big| \{ s(z) \mid z \in D_N \text{ and } s(z) > s(x_i) \} \big|}{m} = 1 - \frac{\big| \{ s(z) \mid z \in D_N \text{ and } s(z) \leq s(x_i) \} \big|}{m},$$
so
$$\begin{aligned}
\frac{U}{mn} &= \frac{1}{n} \sum_{i=1}^{n} \big( 1 - r_{FP}(s(x_i)) \big) \\
&= \sum_{i=1}^{n} \big( 1 - r_{FP}(s(x_i)) \big) \cdot \big( r_{TP}(s(x_{i-1})) - r_{TP}(s(x_i)) \big), \quad \text{with } x_0 = -\infty, \\
&= AUC
\end{aligned}$$
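Finally, the identity $U/(nm) = AUC$ can be checked numerically, computing $U$ from ranks as above and the AUC by pairwise comparison (an illustrative sketch with Gaussian scores, not part of the original solution):

import numpy as np

# U from pooled ranks of the positive scores, AUC as P(score_pos > score_neg).
rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, size=30)    # scores of the n positive samples
neg = rng.normal(0.0, 1.0, size=40)    # scores of the m negative samples
n, m = len(pos), len(neg)

all_scores = np.concatenate([pos, neg])
ranks = all_scores.argsort().argsort() + 1          # ranks in the pooled sample (no ties here)
U = ranks[:n].sum() - n * (n + 1) / 2

auc = np.mean(pos[:, None] > neg[None, :])          # fraction of correctly ordered pairs
print(U / (n * m), auc)                             # equal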
