
Statistics review: Solutions

Pre-term week of the MVA master's program

Multinomial random variables

1.

If $Z = (Z_1, \dots, Z_K) \sim \mathcal{M}(\pi_1, \dots, \pi_K; 1)$ we have

$$P(Z_k = 1) = P(Z = e^{(k)}), \quad \text{with } e^{(k)} = (0, \dots, 0, \underbrace{1}_{k}, 0, \dots, 0),$$

so that

$$P(Z_k = 1) = \binom{1}{e^{(k)}} \prod_{j=1}^{K} \pi_j^{e^{(k)}_j}\, \mathbf{1}\Big\{\sum_{j=1}^{K} e^{(k)}_j = 1\Big\} = \pi_k.$$

2.

For $(n_1, \dots, n_K) \in \mathbb{N}^K$ such that $\sum_k n_k = n$, let

$$\mathcal{Z}_{n_1,\dots,n_K} := \Big\{ (z^{(1)}, \dots, z^{(n)}) \in \{0;1\}^{Kn} \;\Big|\; \forall i \in \{1,\dots,n\},\ \sum_{k=1}^{K} z^{(i)}_k = 1 \ \text{and}\ \forall k \in \{1,\dots,K\},\ \sum_{i=1}^{n} z^{(i)}_k = n_k \Big\}$$

$$\begin{aligned}
P(N_1 = n_1, \dots, N_K = n_K) &= P\big((Z^{(1)}, \dots, Z^{(n)}) \in \mathcal{Z}_{n_1,\dots,n_K}\big) \\
&= \sum_{(z^{(1)},\dots,z^{(n)}) \in \mathcal{Z}_{n_1,\dots,n_K}} P\big((Z^{(1)}, \dots, Z^{(n)}) = (z^{(1)}, \dots, z^{(n)})\big) \\
&= \sum_{(z^{(1)},\dots,z^{(n)}) \in \mathcal{Z}_{n_1,\dots,n_K}} P(Z^{(1)} = z^{(1)}) \cdots P(Z^{(n)} = z^{(n)}) \\
&= \sum_{(z^{(1)},\dots,z^{(n)}) \in \mathcal{Z}_{n_1,\dots,n_K}} \pi_1^{n_1} \pi_2^{n_2} \cdots \pi_K^{n_K} \\
&= \binom{n}{n_1, \dots, n_K} \prod_{k=1}^{K} \pi_k^{n_k}\, \mathbf{1}\Big\{\sum_i n_i = n\Big\}
\end{aligned}$$

This shows that $N := (N_1, \dots, N_K)$ follows the distribution $\mathcal{M}(\pi_1, \dots, \pi_K; n)$, since the multinomial coefficient
$$\binom{n}{n_1, \dots, n_K} := \frac{n!}{n_1! \cdots n_K!}$$
is exactly equal to the number of ordered partitions of $\{1, \dots, n\}$ into sets of cardinalities $n_1, \dots, n_K$.
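As a quick numerical sanity check (a sketch that is not part of the original solution; the parameter values are arbitrary), one can draw $n$ one-hot vectors and verify that the coordinate-wise sums behave like multinomial counts, e.g. $E[N_k] = n\pi_k$:

import numpy as np

# Empirical check: summing n one-hot draws Z^(1),...,Z^(n) ~ M(pi; 1)
# gives counts N whose mean matches n * pi.
rng = np.random.default_rng(0)
pi = np.array([0.2, 0.5, 0.3])
n, n_trials = 10, 100_000

# Draw n categories per trial and count occurrences of each category.
categories = rng.choice(len(pi), size=(n_trials, n), p=pi)
counts = np.stack([(categories == k).sum(axis=1) for k in range(len(pi))], axis=1)

print(counts.mean(axis=0))   # ~ [2.0, 5.0, 3.0]
print(n * pi)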


Sufficient Statistic

We first show that the conditional independence statement implies the proposed factorization. Indeed, we have
$$p(x, t, \theta) = p(\theta \mid t)\, p(t \mid x)\, p(x),$$
but since $t = T(x)$ is assumed to be a function of $x$, $p(t \mid x) = \delta\big(t - T(x)\big)$ and
$$p(x, t, \theta) = p\big(\theta \mid T(x)\big)\, \delta\big(t - T(x)\big)\, p(x),$$
where we have introduced the Dirac function (more precisely the Dirac at 0), so that after marginalizing $t$ out we obtain:
$$p(x, \theta) = p\big(\theta \mid T(x)\big)\, p(x), \quad \text{and so} \quad p(x \mid \theta) = \frac{p\big(\theta \mid T(x)\big)}{p(\theta)}\, p(x),$$
which is of the desired form.

We now show conversely that the factorization of $p(x \mid \theta)$ implies the conditional independence statement. If
$$p(x \mid \theta) = f\big(x, T(x)\big)\, g\big(T(x), \theta\big),$$
then
$$p(t, x, \theta) = \delta\big(T(x) - t\big)\, f\big(x, T(x)\big)\, g\big(T(x), \theta\big)\, p(\theta) = \delta\big(T(x) - t\big)\, f(x, t)\, g(t, \theta)\, p(\theta),$$
where $p(\theta)$ is the density of the prior distribution over $\theta$ with respect to a reference measure on $\theta$.

(To be rigorous, we should not write that this is a joint density for $(t, x, \theta)$ but rather that it is a derivative, in the sense of generalized functions, of a joint probability measure over the triple $(t, x, \theta)$; that is, we should for example call $\mu(t, x, \theta)$ the joint measure and, instead of writing $p(x, t, \theta)$, write $d\mu(x, t, \theta)$. However, to avoid writing things that are unnecessarily abstract, we will stick to these non-rigorous notations. The reasoning itself is nonetheless rigorous.)

As a consequence we have
$$p(t, \theta) = \int_x p(t, x, \theta)\, dx = \int_x \delta\big(T(x) - t\big)\, f(x, t)\, g(t, \theta)\, p(\theta)\, dx = h(t)\, g(t, \theta)\, p(\theta).$$

(Note that here, to be very rigorous, $p(t, \theta)$ is again a density with respect to a reference measure on $\mathbb{R}^2$.) For $t$ such that $p(t, \theta) \neq 0$, we have
$$p(x \mid t, \theta) = \frac{p(x, t, \theta)}{p(t, \theta)} = \frac{\delta\big(T(x) - t\big)\, f(x, t)\, g(t, \theta)\, p(\theta)}{h(t)\, g(t, \theta)\, p(\theta)} = \frac{\delta\big(T(x) - t\big)\, f(x, t)}{h(t)},$$
which shows that $p(x \mid t, \theta) = p(x \mid t)$. If $p(t, \theta) = 0$, we can define $p(x \mid t, \theta)$ any way we want (because on a set of probability zero its value does not matter), and in particular we may set $p(x \mid t, \theta) = p(x \mid t)$.
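To illustrate the statement (a hedged sketch assuming the Bernoulli model with $T(x) = \sum_i x_i$; none of this is in the original solution), one can check empirically that, conditionally on $T(x) = t$, the distribution of the sample does not depend on $\theta$:

import numpy as np

# Conditionally on T(x) = t, every arrangement of the t ones among n
# positions is equally likely, whatever theta.
rng = np.random.default_rng(0)
n, t = 5, 2

for theta in (0.2, 0.7):
    samples = rng.binomial(1, theta, size=(200_000, n))
    cond = samples[samples.sum(axis=1) == t]          # condition on T(x) = t
    # frequency of one particular configuration, e.g. (1,1,0,0,0)
    freq = np.mean(np.all(cond == np.array([1, 1, 0, 0, 0]), axis=1))
    print(theta, freq)        # ~ 1 / C(5,2) = 0.1 for both values of theta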

Method of moments vs maximum likelihood estimation

1.

a)

$$p(x_1, \dots, x_n \mid \theta) = \frac{1}{\theta^n} \prod_{i=1}^{n} \mathbf{1}_{x_i \in [0; \theta]}$$


$$\begin{aligned}
\hat{\theta}_{MLE} &= \operatorname*{argmax}_{\theta}\; p(x_1, \dots, x_n \mid \theta) \\
&= \operatorname*{argmax}_{\theta}\; \frac{1}{\theta^n}\, \mathbf{1}\big\{(\max_i x_i) \in [0; \theta]\big\} \\
&= \operatorname*{argmin}_{\theta}\; \theta \quad \text{s.t. } \theta \geq \max_{i \in \{1,\dots,n\}} x_i \\
&= \max_{i \in \{1,\dots,n\}} x_i
\end{aligned}$$

b)

$$P(\hat{\theta}_{MLE} \leq x) = P(\forall i \in \{1,\dots,n\},\ x_i \leq x) = \Big(\frac{x}{\theta}\Big)^n$$

$$p_{\hat{\theta}_{MLE}}(z) = \frac{n z^{n-1}}{\theta^n}$$

Thus, $\hat{\theta}_{MLE}/\theta$ follows a Beta distribution with parameters $\alpha = n$ and $\beta = 1$.

c)

We can use the given formulas:

$$E_\theta[\hat{\theta}_{MLE}] = \theta\, \frac{n}{n+1}, \qquad \mathrm{Var}_\theta(\hat{\theta}_{MLE}) = \theta^2\, \frac{n}{(n+1)^2(n+2)}$$

d)

$$\hat{\theta}_{MO} = \frac{2}{n} \sum_{i=1}^{n} x_i$$

$$E_\theta[\hat{\theta}_{MO}] = \theta, \qquad \mathrm{Var}_\theta(\hat{\theta}_{MO}) = \frac{4}{n^2} \cdot n \cdot \frac{\theta^2}{12} = \frac{\theta^2}{3n}$$

e)

$$\begin{aligned}
MSE &= E[(\theta - \hat{\theta})^2] \\
&= E[(\theta - E[\hat{\theta}])^2] + E[(\hat{\theta} - E[\hat{\theta}])^2] \\
&= \begin{cases} \dfrac{\theta^2}{3n} & \text{for MO} \\[2mm] \dfrac{\theta^2}{(n+1)^2} + \dfrac{\theta^2 n}{(n+1)^2(n+2)} = \dfrac{2\theta^2}{(n+1)(n+2)} & \text{for MLE} \end{cases}
\end{aligned}$$
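A small Monte Carlo experiment (an illustrative sketch, not part of the original solution; the values of $\theta$ and $n$ are arbitrary) can be used to check the two MSE formulas of e):

import numpy as np

# MSE of the MLE max_i x_i vs the method-of-moments estimator 2*mean(x)
# for x_i ~ U[0, theta].
rng = np.random.default_rng(0)
theta, n, n_rep = 2.0, 20, 200_000

x = rng.uniform(0, theta, size=(n_rep, n))
mle = x.max(axis=1)
mom = 2 * x.mean(axis=1)

print(np.mean((mle - theta) ** 2), 2 * theta**2 / ((n + 1) * (n + 2)))  # MLE
print(np.mean((mom - theta) ** 2), theta**2 / (3 * n))                  # MoM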


Computation of maximum likelihood estimators

1.

a)

$$\begin{aligned}
\operatorname*{argmax}_{\theta}\; P(x_1, \dots, x_n \mid \theta) &= \operatorname*{argmax}_{\theta}\; \theta^{\sum_{i=1}^n x_i} (1-\theta)^{n - \sum_{i=1}^n x_i} \\
&= \operatorname*{argmax}_{\theta}\; \exp\Big( \sum_{i=1}^n x_i \log(\theta) + \Big(n - \sum_{i=1}^n x_i\Big) \log(1-\theta) \Big) \\
&= \operatorname*{argmax}_{\theta}\; N \log(\theta) + (n - N) \log(1-\theta), \quad \text{with } N := \sum_{i=1}^n x_i.
\end{aligned}$$

Each term is continuous and strictly concave, and their sum tends to $-\infty$ as $\theta \to 0$ or $\theta \to 1$, so the MLE is unique.

b)

Let $\ell(x_1, \dots, x_n \mid \theta) = N \log(\theta) + (n - N) \log(1-\theta)$,
$$\frac{\partial \ell(x_1, \dots, x_n \mid \theta)}{\partial \theta} = \frac{N}{\theta} - \frac{n - N}{1 - \theta} = \frac{N - n\theta}{\theta(1-\theta)}.$$
Thus,
$$\hat{\theta}_{MLE} = \frac{N}{n}.$$

2.

a)

$$P(Z_1, \dots, Z_n \mid \{\pi_k\}_k) = \pi_1^{\sum_{i=1}^n Z_{i,1}} \cdots \pi_K^{\sum_{i=1}^n Z_{i,K}} = \pi_1^{N_1} \cdots \pi_K^{N_K}$$

So $(N_1, \dots, N_K)$ is a sufficient statistic for the sample, because the likelihood depends on the data only through these quantities (see the exercise called Sufficient statistic for the definition).

b)

$$\operatorname*{argmax}_{\pi}\; \pi_1^{N_1} \cdots \pi_K^{N_K} = \operatorname*{argmax}_{\pi}\; N_1 \log(\pi_1) + \dots + N_K \log(\pi_K)$$

The MLE is the solution of the constrained convex optimization problem
$$\operatorname*{argmax}_{\pi \geq 0,\ \sum_{k=1}^K \pi_k = 1}\; \sum_{k=1}^{K} N_k \log(\pi_k)$$


c)

Let $L(\pi, \lambda) = \sum_{k,\, N_k > 0} N_k \log(\pi_k) - \lambda\big(\sum_k \pi_k - 1\big)$ be the associated Lagrangian.
$$\frac{\partial L}{\partial \pi_k} = \frac{N_k}{\hat{\pi}_k} - \lambda$$
So,
$$\hat{\pi}_k = \alpha N_k \ \text{ with } \alpha = \frac{1}{\lambda}, \qquad \text{and } \hat{\pi}_k = \frac{N_k}{n} \ \text{ since } \sum_{k=1}^{K} \hat{\pi}_k = 1.$$

Note that we only introduced a Lagrange multiplier for the equality constraint and not for the positivity constraints $\pi_k \geq 0$, because the log-likelihood diverges to $-\infty$ on the boundary of the domain, which ensures that these constraints will be satisfied. We can indeed check that the estimators $\hat{\pi}_k$ are all non-negative.
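As a numerical cross-check (a sketch relying on scipy's generic SLSQP solver; the counts are arbitrary and this is not part of the original solution), maximizing $\sum_k N_k \log \pi_k$ over the simplex indeed returns $\hat{\pi}_k = N_k / n$:

import numpy as np
from scipy.optimize import minimize

# Maximize sum_k N_k log(pi_k) subject to pi >= 0 and sum(pi) = 1.
N = np.array([3, 7, 10])
n = N.sum()

neg_log_lik = lambda p: -np.sum(N * np.log(p))
res = minimize(neg_log_lik,
               x0=np.ones(len(N)) / len(N),
               method="SLSQP",
               bounds=[(1e-9, 1.0)] * len(N),
               constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}])

print(res.x)      # ~ N / n = [0.15, 0.35, 0.5]
print(N / n)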

3.

$$f(\mu + h) = u^\top \mu + u^\top h, \qquad df_\mu(h) = u^\top h, \qquad \nabla f(\mu) = u$$

$$g(\mu + h) = \mu^\top A \mu + \mu^\top A h + \mu^\top A^\top h + h^\top A h, \qquad dg_\mu(h) = \mu^\top (A + A^\top) h, \qquad \nabla g(\mu) = (A + A^\top) \mu$$

a)

If $\Sigma$ is fixed and positive definite,
$$\begin{aligned}
\operatorname*{argmax}_{\mu}\; p(x_1, \dots, x_n \mid \mu) &= \operatorname*{argmax}_{\mu}\; -\frac{n}{2} \log\big((2\pi)^d |\Sigma|\big) - \frac{1}{2} \sum_i \big(\mu^\top \Sigma^{-1} \mu - \mu^\top \Sigma^{-1} x_i - x_i^\top \Sigma^{-1} \mu + x_i^\top \Sigma^{-1} x_i\big) \\
&= \operatorname*{argmin}_{\mu}\; \sum_i \big(\mu^\top \Sigma^{-1} \mu - \mu^\top \Sigma^{-1} x_i - x_i^\top \Sigma^{-1} \mu + x_i^\top \Sigma^{-1} x_i\big)
\end{aligned}$$

Let's compute the gradient of this objective,
$$\nabla \ell(\mu) = \sum_i \big(2\Sigma^{-1} \mu - 2\Sigma^{-1} x_i\big).$$
Setting it to zero gives us
$$\hat{\mu}_{MLE} = \frac{1}{n} \sum_i x_i$$


b)

If $\mu$ is fixed and $\Lambda = \Sigma^{-1}$, since
$$\sum_i (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu) = \operatorname{tr}\Big( \sum_i (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu) \Big) = \operatorname{tr}\Big( \sum_i \Sigma^{-1} (x_i - \mu)(x_i - \mu)^\top \Big) = n \operatorname{tr}(\hat{\Sigma} \Lambda),$$
with $\hat{\Sigma} := \frac{1}{n} \sum_i (x_i - \mu)(x_i - \mu)^\top$, we have:
$$p(x_1, \dots, x_n \mid \Sigma) = \frac{1}{\big((2\pi)^d |\Sigma|\big)^{n/2}} \exp\Big(-\frac{n}{2} \operatorname{tr}(\hat{\Sigma} \Lambda)\Big)$$

d)

$$\langle A, B + H \rangle_F - \langle A, B \rangle_F = \langle A, H \rangle_F$$

e)

$f : A \mapsto \log(|A|)$. Let $H = (h_{i,j})_{i,j} \in \mathbb{R}^{n \times n}$.

$$\begin{aligned}
|I + H| &= \sum_{\sigma \in S_n} \operatorname{sgn}(\sigma) \prod_{i=1}^{n} \big(\mathbf{1}_{\sigma(i)=i} + h_{i,\sigma(i)}\big) \\
&= \prod_{i=1}^{n} (1 + h_{i,i}) + \sum_{\sigma \in S_n \setminus \{\mathrm{Id}\}} \operatorname{sgn}(\sigma) \prod_{i=1}^{n} \big(\mathbf{1}_{\sigma(i)=i} + h_{i,\sigma(i)}\big) \\
&= 1 + \sum_{i=1}^{n} h_{i,i} + O(\|H\|^2)
\end{aligned}$$

$$d|\cdot|_I(H) = \operatorname{tr}(H)$$

For $A$ symmetric positive definite and $H$ symmetric such that $A + H$ is positive definite:
$$\begin{aligned}
\log\det(A + H) &= \log\big(\det A \cdot \det(I + A^{-1}H)\big) \\
&= \log\det A + \log\det(I + A^{-1}H) \\
&= \log\det A + \operatorname{tr}(A^{-1}H) + o(\|H\|)
\end{aligned}$$

$$d(\log\det)_A(H) = \operatorname{tr}(A^{-1}H), \qquad \nabla f(A) = (A^{-1})^\top$$
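A finite-difference check of the differential $d(\log\det)_A(H) = \operatorname{tr}(A^{-1}H)$ (an illustrative sketch with arbitrary matrices, not part of the original solution):

import numpy as np

# Compare (log det(A + eps*H) - log det(A)) / eps with tr(A^{-1} H).
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4 * np.eye(4)           # symmetric positive definite
H = rng.standard_normal((4, 4))
H = (H + H.T) / 2                     # symmetric perturbation
eps = 1e-6

lhs = (np.linalg.slogdet(A + eps * H)[1] - np.linalg.slogdet(A)[1]) / eps
rhs = np.trace(np.linalg.solve(A, H))
print(lhs, rhs)                       # agree up to O(eps)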

f)

$$\log p(x_1, \dots, x_n \mid \Lambda) = -\frac{nd}{2} \log(2\pi) + \frac{n}{2} \log|\Lambda| - \frac{n}{2} \operatorname{tr}(\hat{\Sigma} \Lambda)$$

$$\nabla_\Lambda \log p(x_1, \dots, x_n \mid \Lambda) = \frac{n}{2} \Lambda^{-1} - \frac{n}{2} \hat{\Sigma}$$

$$\Lambda_{MLE} = \hat{\Sigma}^{-1}, \qquad \Sigma_{MLE} = \hat{\Sigma}$$

Indeed, if $\hat{\theta} \in \operatorname*{argmax}_\theta f(\theta)$ and $\hat{\theta} = \varphi(\hat{\alpha})$, then $\hat{\alpha} \in \operatorname*{argmax}_\alpha f(\varphi(\alpha))$.


g)

$$p(x_1, \dots, x_n \mid \mu, \Sigma) = \frac{1}{\big((2\pi)^d |\Sigma|\big)^{n/2}} \exp\Big( -\frac{1}{2} \sum_i (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu) \Big)$$

If $\hat{\Sigma}$ is not invertible, let's write $\hat{\Sigma} = U \operatorname{diag}(\lambda_1, \dots, \lambda_K, 0, \dots, 0)\, U^\top$ with $U$ an orthogonal matrix, and set $\Lambda_\alpha = U \operatorname{diag}(\lambda_1, \dots, \lambda_K, \alpha, \dots, \alpha)^{-1} U^\top$.

$$\log p(x_1, \dots, x_n \mid \Lambda_\alpha) = -\frac{nd}{2} \log(2\pi) + \frac{n}{2} \log|\Lambda_\alpha| - \frac{n}{2} \operatorname{tr}(\hat{\Sigma} \Lambda_\alpha)$$

The second term goes to $+\infty$ as $\alpha \to 0$ while the other two remain constant, so the log-likelihood is unbounded. In practice, the maximum likelihood estimator is extended by continuity to these cases; the obtained estimator can also be thought of as the maximum likelihood estimator for Gaussian densities on the subspace spanned by $\{x_1, \dots, x_n\}$.

Bayesian estimation

1.

a)

$$\begin{aligned}
p(\pi \mid \alpha, n) &= p(n \mid \alpha, \pi) \cdot \frac{p(\pi \mid \alpha)}{p(n \mid \alpha)} \\
&\propto \prod_{k=1}^{K} \pi_k^{n_k} \cdot \frac{\Gamma(\alpha_1 + \dots + \alpha_K)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \pi_k^{\alpha_k - 1} \\
&\propto \prod_{k=1}^{K} \pi_k^{n_k + \alpha_k - 1}
\end{aligned}$$

b)

We denote by $\triangle$ the canonical simplex $\triangle := \big\{ u \in \mathbb{R}_+^K \mid \sum_{k=1}^K u_k = 1 \big\}$. We then have
$$E[\pi_j \mid Z] = \int_{\triangle} \pi_j\, p(\pi \mid Z)\, d\pi$$

Let us consider a fixed value of $j \in \{1, \dots, K\}$ and define $\beta_k = \alpha_k + n_k$ for all $k$.

$$\begin{aligned}
E[\pi_j \mid Z] &= \frac{\Gamma(\beta_1 + \dots + \beta_K)}{\prod_{k=1}^K \Gamma(\beta_k)} \int_{\triangle} \pi_j \prod_{k=1}^{K} \pi_k^{\beta_k - 1}\, d\pi \\
&= \frac{\Gamma(\beta_1 + \dots + \beta_K)}{\prod_{k=1}^K \Gamma(\beta_k)} \cdot \frac{\Gamma(\beta_j + 1) \prod_{k \neq j} \Gamma(\beta_k)}{\Gamma(\beta_1 + \dots + \beta_K + 1)} \\
&= \frac{\Gamma(\beta_j + 1)}{\Gamma(\beta_j)} \cdot \frac{\Gamma(\beta_1 + \dots + \beta_K)}{\Gamma(\beta_1 + \dots + \beta_K + 1)} \\
&= \frac{\beta_j}{\beta_1 + \dots + \beta_K} \\
&= \frac{\alpha_j + n_j}{\alpha_{\mathrm{tot}} + n}, \quad \text{with } \alpha_{\mathrm{tot}} = \alpha_1 + \dots + \alpha_K.
\end{aligned}$$
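The closed form $E[\pi_j \mid Z] = (\alpha_j + n_j)/(\alpha_{\mathrm{tot}} + n)$ can be checked against Monte Carlo samples from the Dirichlet posterior (an illustrative sketch with arbitrary $\alpha$ and counts, not part of the original solution):

import numpy as np

# Posterior is Dirichlet(beta) with beta_k = alpha_k + n_k.
rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 3.0])
n_counts = np.array([4, 0, 6])

beta = alpha + n_counts
samples = rng.dirichlet(beta, size=500_000)
print(samples.mean(axis=0))               # empirical posterior mean
print(beta / beta.sum())                  # (alpha_j + n_j) / (alpha_tot + n)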


c)

$$E[\pi_1 \mid n_1, n_2] = \frac{n_1 + 1}{n_1 + n_2 + 2}$$

Without smoothing, if $n_1 = 0$ or $n_2 = 0$, which is common if $\pi_1$ is close to $0$ or $1$, the maximum likelihood estimator estimates that $\pi_i = 0$ even though $\pi_i > 0$. This is a major problem, because the probability of some non-zero event is then assessed to be equal to $0$, which makes all probabilistic reasoning fail.

2.

$$\mathcal{P} = \{p(x \mid \theta),\ \theta \in \Theta\}, \qquad \Pi = \{p_\alpha(\theta),\ \alpha \in \mathcal{A}\}$$

$\Pi$ is a conjugate family of distributions for $\mathcal{P}$ if for all $p_\alpha \in \Pi$, there exists $p_{\alpha'} \in \Pi$ such that we can write $p_\alpha(\theta \mid x) = p_{\alpha'}(\theta)$.

$$p_{\mathrm{Bernoulli}}(x \mid \theta) = \theta^x (1 - \theta)^{1 - x}$$

The family of Beta distributions $p_{\alpha,\beta}(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$ is a conjugate family of distributions for the family of Bernoulli distributions.

$$p_{\mathrm{Poisson}}(x_1, \dots, x_n \mid \lambda) = \frac{\lambda^{\sum_{i=1}^n x_i} e^{-n\lambda}}{\prod_{i=1}^n (x_i!)}$$

The family of Gamma distributions $p_{\alpha,\beta}(\lambda) \propto \lambda^{\alpha-1} e^{-\beta\lambda}$ is a conjugate family of distributions for the family of Poisson distributions.

$$p(\mu \mid x_1, \dots, x_n) \propto \exp\Big( -\frac{1}{2} \sum_i (\mu - x_i)^\top \Sigma^{-1} (\mu - x_i) - \frac{1}{2} (\mu - \mu_0)^\top \Sigma_0^{-1} (\mu - \mu_0) \Big)$$

The family of Gaussian distributions with the given covariance and unknown mean is a conjugate family of distributions for the family of Gaussian random variables with fixed known covariance and unknown mean (cf. ex. 3).

3.

a)

As the product of two Gaussian densities, the posterior distribution is still Gaussian, $\mathcal{N}(\hat{\mu}_{PM}, \omega^2)$.

On the one hand,
$$\exp\Big( -\frac{(\mu - \hat{\mu}_{PM})^2}{2\omega^2} \Big) = \exp\Big( -\frac{\mu^2}{2\omega^2} + \frac{2\mu \hat{\mu}_{PM}}{2\omega^2} - \frac{\hat{\mu}_{PM}^2}{2\omega^2} \Big)$$

On the other hand,
$$\exp\Big( -\sum_i \frac{(x_i - \mu)^2}{2\sigma^2} - \frac{(\mu_0 - \mu)^2}{2\tau^2} \Big) = \exp\Big( -\mu^2 \Big(\frac{n}{2\sigma^2} + \frac{1}{2\tau^2}\Big) + 2\mu \Big(\frac{\sum x_i}{2\sigma^2} + \frac{\mu_0}{2\tau^2}\Big) - \Big(\frac{\sum x_i^2}{2\sigma^2} + \frac{\mu_0^2}{2\tau^2}\Big) \Big)$$

By identification,
$$\frac{1}{\omega^2} = \frac{n}{\sigma^2} + \frac{1}{\tau^2},$$
and we get the posterior mean:
$$\hat{\mu}_{PM} = \frac{\frac{\sum x_i}{\sigma^2} + \frac{\mu_0}{\tau^2}}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}}.$$
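A short numerical illustration (a sketch with arbitrary parameters, not part of the original solution) of this formula, showing that the posterior mean is a precision-weighted compromise between the sample mean and the prior mean:

import numpy as np

# Posterior mean for a Gaussian likelihood with known sigma and a N(mu0, tau^2) prior.
rng = np.random.default_rng(0)
mu_true, sigma, mu0, tau, n = 1.5, 1.0, 0.0, 0.5, 20
x = rng.normal(mu_true, sigma, size=n)

post_prec = n / sigma**2 + 1 / tau**2
mu_pm = (x.sum() / sigma**2 + mu0 / tau**2) / post_prec
print(mu_pm, x.mean())    # the posterior mean is shrunk from x.mean() towards mu0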


b)

$$\hat{\mu}_{PM} = \frac{\sum x_i}{n} \cdot \lambda_n + \mu_0 \cdot (1 - \lambda_n), \quad \text{with } \lambda_n = \frac{n/\sigma^2}{n/\sigma^2 + 1/\tau^2}$$

d)

In this case, and if $\mu_0 = 0$:
$$\operatorname*{argmin}_{\mu}\; -\log p(x_1, \dots, x_n \mid \mu) + \frac{\lambda}{2} \mu^2 = \operatorname*{argmin}_{\mu}\; -\log\big( p(x_1, \dots, x_n \mid \mu) \cdot p(\mu) \big) \quad \text{for } \lambda = \frac{1}{\tau^2}$$

Thus the MAP estimator can be viewed as a minimizer of the negative log-likelihood with some ridge regularization.

e)

$$\hat{\mu}_{MAP} = \hat{\mu}_{PM}$$

This property doesn't hold for a Bernoulli distribution with a Beta prior.

f)

$$E[\log p(X_0 \mid \nu, \sigma)] = E\Big[ -\frac{(X_0 - \nu)^2}{2\sigma^2} \Big] + \text{const}$$

g)

$$\begin{aligned}
R(\nu) &= E[(\nu - X_0)^2] \\
&= E[(\nu - E[X_0] + E[X_0] - X_0)^2] \\
&= E[(\nu - E[X_0])^2] + E[(E[X_0] - X_0)^2] + 2 E[(\nu - E[X_0])(E[X_0] - X_0)] \\
&= (\nu - \mu)^2 + \mathrm{Var}(X_0)
\end{aligned}$$

(the cross term vanishes since $E[X_0] - X_0$ has zero mean and $\nu - E[X_0]$ is deterministic).

h)

$$\begin{aligned}
E_{D_n}[\mathcal{E}(\hat{\mu})] &= E_{D_n}[R(\hat{\mu}) - R(\mu)] \\
&= E_{D_n}[(\hat{\mu} - \mu)^2] \\
&= E_{D_n}\big[(\hat{\mu} - E_{D_n}[\hat{\mu}])^2\big] + \big(E_{D_n}[\hat{\mu}] - \mu\big)^2
\end{aligned}$$

i)

$$E_{D_n}[\mathcal{E}(\mu_{MLE})] = E_{D_n}\big[(\mu_{MLE} - E_{D_n}[\mu_{MLE}])^2\big] + 0 = \frac{\sigma^2}{n}$$


$$\begin{aligned}
E_{D_n}[\mathcal{E}(\mu_{MAP})] &= E_{D_n}\big[(\mu_{MAP} - E_{D_n}[\mu_{MAP}])^2\big] + \big(E_{D_n}[\mu_{MAP}] - \mu\big)^2 \\
&= \frac{\frac{n}{\sigma^2}}{\big(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\big)^2} + \frac{\frac{(\mu - \mu_0)^2}{\tau^4}}{\big(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\big)^2}
\end{aligned}$$

j)

$$R_\pi(MLE) = \frac{\sigma^2}{n}, \qquad R_\pi(PM) = \frac{\sigma^2 / n}{1 + \frac{\sigma^2}{n \tau^2}}$$

l)

$$E_{\mu \sim \pi,\, D_n}[(\hat{\mu} - \mu)^2] = E_{D_n}\big[ E_{\mu \sim \pi}[(\hat{\mu} - \mu)^2 \mid D_n] \big]$$

The inner quantity is minimized for every possibleDn by using the posterior mean.
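The Bayes risks of j) can be checked by simulation, drawing $\mu$ from the prior and then the data (an illustrative sketch with arbitrary parameters, not part of the original solution):

import numpy as np

# Averaging over mu ~ N(mu0, tau^2) and over D_n, the posterior mean has
# smaller Bayes risk than the MLE (sample mean).
rng = np.random.default_rng(0)
sigma, mu0, tau, n, n_rep = 1.0, 0.0, 0.5, 10, 200_000

mu = rng.normal(mu0, tau, size=n_rep)                     # mu ~ prior
x = rng.normal(mu[:, None], sigma, size=(n_rep, n))       # D_n given mu
mle = x.mean(axis=1)
lam = (n / sigma**2) / (n / sigma**2 + 1 / tau**2)
pm = lam * mle + (1 - lam) * mu0

print(np.mean((mle - mu) ** 2), sigma**2 / n)                                   # R_pi(MLE)
print(np.mean((pm - mu) ** 2), (sigma**2 / n) / (1 + sigma**2 / (n * tau**2)))  # R_pi(PM)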

Bregman divergence

1.

$$D_F(p, q) = \langle p, p \rangle - \langle q, q \rangle - 2\langle q, p \rangle + 2\langle q, q \rangle = \langle p - q, p - q \rangle$$

2.

With $H(q) = \sum_i q_i \log q_i$ (the negative entropy), $(\nabla H(q))_i = \log q_i + 1$, so

$$\begin{aligned}
D_H(p, q) &= \sum_i p_i \log p_i - \sum_i q_i \log q_i - \sum_i (\log q_i + 1)(p_i - q_i) \\
&= \sum_i p_i (\log p_i - \log q_i) \\
&= KL(p, q)
\end{aligned}$$

(using $\sum_i p_i = \sum_i q_i = 1$).
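A quick numerical check (a sketch, not part of the original solution) that the Bregman divergence generated by the negative entropy coincides with the KL divergence:

import numpy as np

# Bregman divergence of F(q) = sum_i q_i log q_i vs KL(p, q) on the simplex.
p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])

F = lambda u: np.sum(u * np.log(u))
grad_F = lambda u: np.log(u) + 1
bregman = F(p) - F(q) - grad_F(q) @ (p - q)
kl = np.sum(p * np.log(p / q))
print(bregman, kl)    # equal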

3.

We assume the loss is differentiable.

F(µ) =EX[l(µ, X)]−EX[l(µ, X)]


Ridge regression and PCA

1.

a)

$$X = U S V^\top, \qquad X^+ = V S^+ U^\top, \qquad S^+ = \operatorname{diag}(s_i^+) \ \text{ with } s_i^+ = \tfrac{1}{s_i} \text{ if } s_i \neq 0 \text{ and } 0 \text{ otherwise}$$

$$X X^+ X = U S V^\top\, V S^+ U^\top\, U S V^\top = U \operatorname{diag}(\mathbf{1}_{s_i \neq 0})\, S V^\top = U S V^\top = X$$

b)

$$(X^\top X)^+ = (V S^2 V^\top)^+ = V (S^+)^2 V^\top$$

$$X^+ (X^+)^\top = V (S^+)^2 V^\top$$

$$(X^\top X)^+ X^\top = X^+ (X^+)^\top X^\top = V S^+ S^+ S\, U^\top = V S^+ U^\top = X^+$$

c)

$$X^+ X X^+ = V S^+ U^\top\, U S V^\top\, V S^+ U^\top = V S^+ U^\top = X^+$$

d)

First suppose $X = S$. Let $w$ be a solution of the normal equation $S^2 w = S^\top y$:
$$s_{i,i}^2\, w_i = s_i y_i \ \Rightarrow\ w_i = \frac{y_i}{s_i} \ \text{if } s_i \neq 0, \qquad \text{and } (S^\top y)_i = 0 \ \text{if } s_i = 0,$$
so the $i$-th equation imposes no constraint when $s_i = 0$; taking $w_i = 0$ in that case gives $w = S^+ y$, so the claim is true if $X = S$.

For any $X$, let $\tilde{w} = V^\top w$ and $\tilde{y} = U^\top y$. Then $w$ is a solution of the normal equation iff $\tilde{w}$ is a solution of $S^2 \tilde{w} = S \tilde{y}$.
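Numerically (an illustrative sketch using numpy's pinv, not part of the original solution), $w = X^+ y$ can be checked to solve the normal equations even for a rank-deficient $X$:

import numpy as np

# Build a rank-deficient design matrix and verify X^T X w = X^T y for w = pinv(X) y.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 2))
X = np.hstack([A, A[:, :1] + A[:, 1:]])   # third column is a linear combination -> rank 2
y = rng.standard_normal(10)

w = np.linalg.pinv(X) @ y
print(np.allclose(X.T @ X @ w, X.T @ y))  # True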

2.

$X = U S V^\top$, where $U$ and $V$ are square matrices. The columns of $U$ and the columns of $V$ are called the left-singular vectors and right-singular vectors of $X$, respectively.


a)

Since $U$ and $V$ are orthogonal matrices, we have for all $w \in \mathbb{R}^n$: $\|Xw\| = \|U S V^\top w\| = \|S V^\top w\|$, so

$$\operatorname*{argmax}_{w \in \mathbb{R}^n,\, \|w\|=1} \|Xw\| = V \operatorname*{argmax}_{w \in \mathbb{R}^n,\, \|w\|=1} \|Sw\|$$

(the change of variables $w \mapsto V^\top w$ preserves the unit sphere).

b)

$$\hat{\Sigma} = \frac{1}{n} V S^2 V^\top$$

c)

$$\sum_i (c_{j,i} - \bar{c}_j)^2 = \sum_i \Big( (X v_j)_i - \frac{1}{n} \sum_k (X v_j)_k \Big)^2 = \sum_i (X v_j)_i^2 = \|X v_j\|^2 = s_j^2$$

(the mean term vanishes since the columns of $X$ are assumed to be centered).

d)

$$X X^\top X v_j = U S V^\top\, V S U^\top\, U S V^\top v_j = U S^3 V^\top v_j = s_j^2\, X v_j$$

3.

a)

$$\begin{aligned}
\sum_i \big\| y^{(i)} - (c^{(i)}_{1:k})^\top w \big\|^2 &= \sum_i \big\| y^{(i)} - \big((x_i^\top v_j)_{j=1:k}\big)^\top w \big\|^2 \\
&= \|y - (X V_{1:k}) w\|^2 \\
&= \|y - U_k S_k w\|^2
\end{aligned}$$

with $U_k = (u_1, \dots, u_k)$ and $S_k = \operatorname{diag}(s_1, \dots, s_k)$. So,
$$\tilde{w} = \big( (U_k S_k)^\top U_k S_k \big)^{-1} (U_k S_k)^\top y = S_k^{-1} U_k^\top y.$$

b)

$$\tilde{w}^\top \big( \langle x - \bar{x}_0, v_1 \rangle, \dots, \langle x - \bar{x}_0, v_k \rangle \big) = \sum_j \tilde{w}_j \langle x - \bar{x}_0, v_j \rangle = \Big\langle x - \bar{x}_0,\ \sum_j \frac{1}{s_j} \langle u_j, y \rangle\, v_j \Big\rangle$$


c)

$$\begin{aligned}
w_R &= (X^\top X + \lambda I_p)^{-1} X^\top y \\
&= \big( V (S^2 + \lambda I_p) V^\top \big)^{-1} V S U^\top y \\
&= V \operatorname{diag}\Big( \frac{s_i}{s_i^2 + \lambda} \Big) U^\top y \\
&= \sum_{j=1}^{m} \frac{s_j}{s_j^2 + \lambda}\, \langle u_j, y \rangle\, v_j
\end{aligned}$$
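The SVD expression for the ridge solution can be checked against the direct solve (an illustrative sketch with random data, not part of the original solution):

import numpy as np

# Compare (X^T X + lambda I)^{-1} X^T y with sum_j s_j/(s_j^2 + lambda) <u_j, y> v_j.
rng = np.random.default_rng(0)
n, p, lam = 50, 5, 0.7
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

w_direct = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))
print(np.allclose(w_direct, w_svd))       # True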

d)

The coefficients for $j > k$ will vanish.

e)

$$\langle X^\top y, v_j \rangle = \langle V S U^\top y, v_j \rangle = (S U^\top y)_j = s_j \langle u_j, y \rangle$$

f)

Andrei Tikhonov and Karl Pearson

Area under the curve and Mann-Whitney U statistic

a)

Let $C_0$ be the set of elements that belong to class $0$ and $C_1$ the set of elements that belong to class $1$.

$$r_{TP}(b) = \frac{P(s(x) > b,\ x \in C_1)}{P(x \in C_1)} = \frac{P(s(x) > b) \cdot P(x \in C_1)}{P(x \in C_1)} = 1 - F(b),$$

since the score $s(x)$ is independent of the class of $x$. Similarly,

$$r_{FP}(b) = \frac{P(s(x) > b,\ x \in C_0)}{P(x \in C_0)} = \frac{P(s(x) > b) \cdot P(x \in C_0)}{P(x \in C_0)} = 1 - F(b).$$

Hence,
$$AUC = \frac{1}{2}.$$


b)

WLOG, suppose $s(x_1) < \dots < s(x_n)$ and $s(y_1) < \dots < s(y_m)$. On the one hand,

$$\begin{aligned}
U &= \sum_{i=1}^{n} \big| \{ s(z) \mid z \in D_N \cup D_P \text{ and } s(z) \leq s(x_i) \} \big| - \frac{n(n+1)}{2} \\
&= \sum_{i=1}^{n} \big| \{ s(z) \mid z \in D_N \text{ and } s(z) \leq s(x_i) \} \big| + \underbrace{\sum_{i=1}^{n} \big| \{ s(z) \mid z \in D_P \text{ and } s(z) \leq s(x_i) \} \big|}_{= \sum_{i=1}^{n} i \,=\, n(n+1)/2} - \frac{n(n+1)}{2} \\
&= \sum_{i=1}^{n} \big| \{ s(z) \mid z \in D_N \text{ and } s(z) \leq s(x_i) \} \big|
\end{aligned}$$

On the other hand,
$$r_{TP}(s(x_i)) = 1 - \frac{i}{n}$$
and
$$r_{FP}(s(x_i)) = \frac{\big| \{ s(z) \mid z \in D_N \text{ and } s(z) > s(x_i) \} \big|}{m} = 1 - \frac{\big| \{ s(z) \mid z \in D_N \text{ and } s(z) \leq s(x_i) \} \big|}{m},$$
so
$$\begin{aligned}
\frac{U}{mn} &= \frac{1}{n} \sum_{i=1}^{n} \big( 1 - r_{FP}(s(x_i)) \big) \\
&= \sum_{i=1}^{n} \big( 1 - r_{FP}(s(x_i)) \big) \cdot \big( r_{TP}(s(x_{i-1})) - r_{TP}(s(x_i)) \big), \quad \text{with } x_0 = -\infty, \\
&= AUC
\end{aligned}$$
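Finally, the identity $U/(nm) = AUC$ can be checked numerically, computing $U$ from ranks as above and the AUC by pairwise comparison (an illustrative sketch with Gaussian scores, not part of the original solution):

import numpy as np

# U from pooled ranks of the positive scores, AUC as P(score_pos > score_neg).
rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, size=30)    # scores of the n positive samples
neg = rng.normal(0.0, 1.0, size=40)    # scores of the m negative samples
n, m = len(pos), len(neg)

all_scores = np.concatenate([pos, neg])
ranks = all_scores.argsort().argsort() + 1          # ranks in the pooled sample (no ties here)
U = ranks[:n].sum() - n * (n + 1) / 2

auc = np.mean(pos[:, None] > neg[None, :])          # fraction of correctly ordered pairs
print(U / (n * m), auc)                             # equal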
