Review of Statistics

(1)

Review of Statistics

Guillaume Obozinski

Ecole des Ponts - ParisTech

Master MVA

Review of Statistics 1/52

(2)

Outline

1 Statistical concepts

2 The maximum likelihood principle

3 Method of moments

4 Linear regression

5 Bayesian Inference

6 Principal Component Analysis

7 Mesures of performance for binary classifiers

(3)

Outline

3 Method of moments

4 Linear regression

(4)

Statistical concepts

(5)

Expectation and Variances

Expectation ofX : E[X] =P

xx·p(x)

Expectation off (X), pourf mesurable : E[f (X)] =X

x

f (x)·p(x) Variance :

Var (X) = E h

(X −E[X])² i

= E X²

−E[X]² Covariance :

Cov(X,Y) =E

(X−E[X]) (Y −E[Y]) Conditional expectation of X given Y :

E[X|Y] =X

x

x·p(x|y) Variance decomposition

(6)

Expectation and Variances

xx·p(x) Expectation off (X), pourf mesurable :

E[f (X)] =X

x

f (x)·p(x)

Variance :

Var (X) = E h

(X −E[X])² i

= E X²

Cov(X,Y) =E

(X−E[X]) (Y −E[Y]) Conditional expectation of X given Y :

E[X|Y] =X

x

(7)

Expectation and Variances

E[f (X)] =X

x

Var (X) = E h

(X −E[X])²i

= E X²

Cov(X,Y) =E

(X −E[X]) (Y −E[Y])

Conditional expectation of X given Y : E[X|Y] =X

x

(8)

Expectation and Variances

E[f (X)] =X

x

Var (X) = E h

(X −E[X])²i

= E X²

Cov(X,Y) =E

(X −E[X]) (Y −E[Y]) Conditional expectation of X given Y :

E[X|Y] =X

x

x·p(x|y)

Variance decomposition

(9)

Expectation and Variances

E[f (X)] =X

x

Var (X) = E h

(X −E[X])²i

= E X²

Cov(X,Y) =E

(X −E[X]) (Y −E[Y]) Conditional expectation of X given Y :

E[X|Y] =X

x

(10)

Independence concepts

Independence: X⊥⊥Y

We will say that X andY are independent and writeX⊥⊥Y iff:

∀x,y, P(X =x,Y =y) =P(X =x)P(Y =y)

Conditional Independence: X⊥⊥Y |Z

We will say thatX andY are independent conditionally on Z and writeX⊥⊥Y |Z ssi:

∀x,y,z,

P(X =x,Y =y |Z =z) =P(X =x|Z =z)P(Y =y|Z =z)

(11)

Independence concepts

Independence: X⊥⊥Y

We will say that X andY are independent and writeX⊥⊥Y iff:

∀x,y, P(X =x,Y =y) =P(X =x)P(Y =y) Conditional Independence: X⊥⊥Y |Z

We will say thatX andY are independent conditionally on Z and writeX⊥⊥Y |Z ssi:

∀x,y,z,

P(X =x,Y =y |Z =z) =P(X =x|Z =z)P(Y =y|Z =z)

(12)

Conditional Independence exemple

Example of

“X-linked recessive inheritance”:

Transmission of the gene responsible for hemophilia

Risk for sons from an unaffected father:

dependance between the situation of the two brothers.

conditionally independent given that the mother is a carrier of the gene or not.

(13)

Conditional Independence exemple

Example of

“X-linked recessive inheritance”:

Transmission of the gene responsible for hemophilia

Risk for sons from an unaffected father:

dependance between the situation of the two brothers.

conditionally independent given that the mother is a carrier of the gene or not.

(14)

Statistical Model

Parametric model – Definition:

Set of distributions parametrized by a vector θ∈Θ⊂R^p P_Θ=

pθ(x)|θ∈Θ

Bernoulli model: X ∼Ber(θ) Θ = [0,1] p_θ(x) =θ^x(1−θ)^(1−x) Binomial model: X ∼Bin(n, θ) Θ = [0,1]

pθ(x) = n

x

θ^x(1−θ)^(1−x)

Multinomial model: X ∼ M(n, π₁, π₂, . . . , π_K) Θ = [0,1]^K pθ(x) =

n x1, . . . ,x_k

π1x1 . . . πkxk

(15)

Statistical Model

pθ(x)|θ∈Θ Bernoulli model: X ∼Ber(θ) Θ = [0,1]

pθ(x) =θ^x(1−θ)^(1−x)

Binomial model: X ∼Bin(n, θ) Θ = [0,1] pθ(x) =

n x

θ^x(1−θ)^(1−x)

n x1, . . . ,x_k

π1x1 . . . πkxk

(16)

Statistical Model

pθ(x) =θ^x(1−θ)^(1−x) Binomial model: X ∼Bin(n, θ) Θ = [0,1]

pθ(x) = n

x

θ^x(1−θ)^(1−x)

n x1, . . . ,x_k

π1x1 . . . πkxk

(17)

Statistical Model

pθ(x) =θ^x(1−θ)^(1−x) Binomial model: X ∼Bin(n, θ) Θ = [0,1]

pθ(x) = n

x

θ^x(1−θ)^(1−x)

Multinomial model: X ∼ M(n, π₁, π₂, . . . , π_K) Θ = [0,1]^K p_θ(x) =

n x₁, . . . ,x_k

π1x1 . . . π_k^x^k

(18)

Indicator variable coding for multinomial variables

Let C a r.v. taking values in{1, . . . ,K}, with P(C =k) =π_k.

We will code C with a r.v. Y = (Y₁, . . . ,Y_K)^> with Yk = 1{C=k}

For example if K = 5 andc = 4 then y = (0,0,0,1,0)^>. So y ∈ {0,1}^K withPK

k=1y_k = 1.

P(C =k) =P(Yk = 1) and P(Y =y) =

K

Y

k=1

π_k^y^k.

(19)

Indicator variable coding for multinomial variables

k=1y_k = 1.

P(C =k) =P(Yk = 1) and P(Y =y) =

K

Y

k=1

π_k^y^k.

(20)

Indicator variable coding for multinomial variables

For example if K = 5 andc = 4 then y = (0,0,0,1,0)^>.

So y ∈ {0,1}^K withPK

k=1y_k = 1.

P(C =k) =P(Yk = 1) and P(Y =y) =

K

Y

k=1

π_k^y^k.

(21)

Indicator variable coding for multinomial variables

k=1y_k = 1.

P(C =k) =P(Yk = 1) and P(Y =y) =

K

Y

k=1

π_k^y^k.

(22)

Indicator variable coding for multinomial variables

k=1y_k = 1.

P(C =k) =P(Yk = 1) and P(Y =y) =

K

Y

k=1

π_k^y^k.

(23)

Bernoulli, Binomial, Multinomial

Y ∼Ber(π) (Y₁, . . . ,Y_K)∼ M(1, π₁, . . . , π_K) p(y) =π^y(1−π)^1−y p(y) =π^y₁¹. . . π_K^y^K

N1 ∼Bin(n, π) (N1, . . . ,N_K)∼ M(n, π₁, . . . , π_K) p(n1) = _nⁿ

1

πⁿ¹(1−π)ⁿ⁻ⁿ¹ p(n) =

n n1 . . . n_K

πⁿ₁¹. . . π_Kⁿ^K

with n

i

= n!

(n−i)!i! and

n n1 . . . nK

= n!

n1!. . .n_K!

(24)

Gaussian model

Scalar Gaussian model : X ∼ N(µ, σ²) X real valued r.v., andθ= µ, σ²

∈Θ =R×R^∗+. p_µ,σ2(x) = 1

√

2πσ²exp −1 2

(x−µ)² σ²

!

Multivariate Gaussian model: X ∼ N (µ,Σ)

X r.v. taking values in R^d. If K_d is the set of positive definite matrices of size d ×d , and θ= (µ,Σ)∈Θ =R^d× K_d.

p_µ,Σ(x) = 1 q

(2π)^ddetΣ exp

−1

2(x−µ)^TΣ⁻¹(x−µ)

(25)

Gaussian model

Scalar Gaussian model : X ∼ N(µ, σ²) X real valued r.v., andθ= µ, σ²

∈Θ =R×R^∗+. p_µ,σ2(x) = 1

√

2πσ²exp −1 2

(x−µ)² σ²

!

Multivariate Gaussian model: X ∼ N (µ,Σ)

X r.v. taking values in R^d. If K_d is the set of positive definite matrices of size d ×d , and θ= (µ,Σ)∈Θ =R^d× K_d.

p_µ,Σ(x) = 1 q

(2π)^ddetΣ exp

−1

2(x−µ)^TΣ⁻¹(x−µ)

(26)

Gaussian densities

(27)

Gaussian densities

(28)

Sample/Training set

The data used to learn or estimate a model typically consists of a collection of observation which can be thought of as instantiations of random variables.

X⁽¹⁾, . . . ,X⁽ⁿ⁾

A common assumption is that the variables are i.i.d. independent

identically distributed, i.e. have the same distributionP.

This collection of observations is called

thesampleor the observations in statistics thesamplesin engineering

thetraining set in machine learning

(29)

Sample/Training set

X⁽¹⁾, . . . ,X⁽ⁿ⁾

A common assumption is that the variables are i.i.d.

independent

(30)

Sample/Training set

X⁽¹⁾, . . . ,X⁽ⁿ⁾

A common assumption is that the variables are i.i.d.

independent

(31)

Outline

3 Method of moments

4 Linear regression

(32)

The maximum likelihood principle

(33)

Maximum likelihood principle

LetP_Θ=

p(x;θ)|θ∈Θ be a model Letx be an observation

Likelihood:

L: Θ → R+

θ 7→ p(x;θ) Maximum likelihood estimator:

θˆ_ML= argmax

θ∈Θ

p(x;θ) Sir Ronald Fisher

(1890-1962)

Case of i.i.d data

If (xi)1≤i≤n is an i.i.d. sample of sizen: θˆML= argmax

θ∈Θ n

Y

i=1

pθ(xi) = argmax

θ∈Θ n

X

i=1

log pθ(xi)

(34)

Maximum likelihood principle

LetP_Θ=

Likelihood:

L: Θ → R+

θ 7→ p(x;θ)

Maximum likelihood estimator: θˆ_ML= argmax

θ∈Θ

(1890-1962)

Case of i.i.d data

θ∈Θ n

Y

i=1

pθ(xi) = argmax

θ∈Θ n

X

i=1

log pθ(xi)

(35)

Maximum likelihood principle

LetP_Θ=

Likelihood:

L: Θ → R+

θˆ_ML= argmax

θ∈Θ

(1890-1962)

Case of i.i.d data

θ∈Θ n

Y

i=1

pθ(xi) = argmax

θ∈Θ n

X

i=1

log pθ(xi)

(36)

Maximum likelihood principle

LetP_Θ=

Likelihood:

L: Θ → R+

θˆ_ML= argmax

θ∈Θ

(1890-1962)

Case of i.i.d data

If (xi)1≤i≤n is an i.i.d. sample of sizen:

θˆML= argmax

θ∈Θ n

Y

i=1

p_θ(xi) = argmax

θ∈Θ n

X

i=1

log p_θ(xi)

(37)

The maximum likelihood estimator

The MLE

does not always exists

is not necessarily unique

(38)

The maximum likelihood estimator

The MLE

does not always exists is not necessarily unique

(39)

The maximum likelihood estimator

The MLE

does not always exists is not necessarily unique

(40)

MLE for the Bernoulli model

Let X₁,X₂, . . . ,X_n an i.i.d. sample ∼Ber(θ).

The log-likelihood is

`(θ) =

n

X

i=1

logp(xi;θ)

=

n

X

i=1

log

θ^xⁱ(1−θ)^1−xⁱ

=

n

X

i=1

x_ilogθ+ (1−x_i) log(1−θ)

=Nlog(θ) + (n−N) log(1−θ) with N :=Pn

i=1x_i.

MLE: Existence? Uniqueness?

θ7→`(θ) is strongly concave⇒ the MLE exists and is unique. since `differentiable + strongly concave its maximizer is the unique stationary point

∇`(θ) = ∂

∂θ`(θ) = N

θ −n−N

1−θ. Thus

θˆ_ML= N

n = x₁+x₂+· · ·+x_n

n .

(41)

MLE for the Bernoulli model

Let X₁,X₂, . . . ,X_n an i.i.d. sample ∼Ber(θ). The log-likelihood is

`(θ) =

n

X

i=1

logp(xi;θ)

=

n

X

i=1

log

=

n

X

i=1

xilogθ+ (1−xi) log(1−θ)

i=1x_i.

∇`(θ) = ∂

∂θ`(θ) = N

θ −n−N

1−θ. Thus

θˆ_ML= N

n = x₁+x₂+· · ·+x_n

n .

(42)

MLE for the Bernoulli model

`(θ) =

n

X

i=1

logp(xi;θ) =

n

X

i=1

log

=

n

X

i=1

i=1x_i.

∇`(θ) = ∂

∂θ`(θ) = N

θ −n−N

1−θ. Thus

θˆ_ML= N

n = x₁+x₂+· · ·+x_n

n .

(43)

MLE for the Bernoulli model

`(θ) =

n

X

i=1

logp(xi;θ) =

n

X

i=1

log

=

n

X

i=1

i=1x_i.

∇`(θ) = ∂

∂θ`(θ) = N

θ −n−N

1−θ. Thus

θˆ_ML= N

n = x₁+x₂+· · ·+x_n

n .

(44)

MLE for the Bernoulli model

`(θ) =

n

X

i=1

logp(xi;θ) =

n

X

i=1

log

=

n

X

i=1

i=1x_i.

∇`(θ) = ∂

∂θ`(θ) = N

θ −n−N

1−θ. Thus

θˆ_ML= N

n = x₁+x₂+· · ·+x_n

n .

(45)

MLE for the Bernoulli model

`(θ) =

n

X

i=1

logp(xi;θ) =

n

X

i=1

log

=

n

X

i=1

i=1x_i.

∇`(θ) = ∂

∂θ`(θ) = N

θ −n−N

1−θ. Thus

θˆ_ML= N

n = x₁+x₂+· · ·+x_n

n .

(46)

MLE for the Bernoulli model

`(θ) =

n

X

i=1

logp(xi;θ) =

n

X

i=1

log

=

n

X

i=1

i=1x_i.

θ7→`(θ) is strongly concave⇒ the MLE exists and is unique.

since `differentiable + strongly concave its maximizer is the unique stationary point

∇`(θ) = ∂

∂θ`(θ) = N

θ −n−N

1−θ. Thus

θˆ_ML= N

n = x₁+x₂+· · ·+x_n

n .

(47)

MLE for the Bernoulli model

`(θ) =

n

X

i=1

logp(xi;θ) =

n

X

i=1

log

=

n

X

i=1

i=1x_i.

∇`(θ) = ∂

∂θ`(θ) = N

θ −n−N

1−θ. Thus

θˆ_ML= N

n = x₁+x₂+· · ·+x_n

n .

(48)

MLE for the Bernoulli model

`(θ) =

n

X

i=1

logp(xi;θ) =

n

X

i=1

log

=

n

X

i=1

i=1x_i.

∇`(θ) = ∂

∂θ`(θ) = N

θ −n−N

1−θ.

Thus

θˆ_ML= N

n = x₁+x₂+· · ·+x_n

n .

(49)

MLE for the Bernoulli model

`(θ) =

n

X

i=1

logp(xi;θ) =

n

X

i=1

log

=

n

X

i=1

i=1x_i.

∇`(θ) = ∂

∂θ`(θ) = N

θ −n−N

1−θ. Thus

θˆ_ML= N

n = x₁+x₂+· · ·+x_n

n .

(50)

MLE for the multinomial

Done on the board.

(51)

Outline

3 Method of moments

4 Linear regression

(52)

Method of moments (Karl Pearson, 1894)

Consider a statistical model for a univariate r.v. parameterized by θ= (θ₁, . . . , θ_K)∈R^k.

Denote by µ^k thekth moment of a random variable:

µ1(θ) =Eθ[X], µ2(θ) =Eθ[X²], . . . , µ_K(θ) =Eθ[X^K].

We have

(µ₁, . . . , µ_K) =f(θ) =f(θ₁, . . . , θ_K). Principle of the method of moments

Given a sampleX₁, . . . ,X_n

Estimate the µks with the empirical moments: ˆµk = 1 n

n

X

i=1

X_i^k.

The moment estimator is ˆθ defined as the solution to the equation (ˆµ₁, . . . ,µˆ_K) =f(ˆθ₁, . . . ,θˆ_K).

(53)

Method of moments (Karl Pearson, 1894)

We have

(µ₁, . . . , µ_K) =f(θ) =f(θ₁, . . . , θ_K).

Principle of the method of moments Given a sampleX₁, . . . ,X_n

n

X

i=1

X_i^k.

(54)

Method of moments (Karl Pearson, 1894)

We have

(µ₁, . . . , µ_K) =f(θ) =f(θ₁, . . . , θ_K).

n

X

i=1

X_i^k.

(55)

Method of moments (Karl Pearson, 1894)

We have

(µ₁, . . . , µ_K) =f(θ) =f(θ₁, . . . , θ_K).

n

X

i=1

X_i^k.

(56)

Method of moments: illustration

In many usual cases the moment estimator and theMLE are equal.

Example where MME 6= MLE For the family of gamma distribution

p(x;λ,p) = x^p−1e^−λx λ^pΓ(p) 1_{x>0} the MLE is not closed-form (exercise). However

µ1 =E[X] =λp, µ2 =E[X²] =p(p+ 1)λ²,So that λ= µ²₁

µ2−µ²₁, p = µ2−µ²₁

µ₁ ,

which yields the moment estimators λˆ= µˆ²₁

ˆ

µ2−µˆ²₁, p = µˆ₂−µˆ²₁ ˆ

µ₁ .

(57)

Method of moments: illustration

p(x;λ,p) = x^p−1e^−λx λ^pΓ(p) 1_{x>0}

the MLE is not closed-form (exercise).

However

µ2−µ²₁, p = µ2−µ²₁

µ₁ ,

ˆ

µ2−µˆ²₁, p = µˆ₂−µˆ²₁ ˆ

µ₁ .

(58)

Method of moments: illustration

p(x;λ,p) = x^p−1e^−λx λ^pΓ(p) 1_{x>0}

the MLE is not closed-form (exercise). However µ1 =E[X] =λp, µ2 =E[X²] =p(p+ 1)λ²,

So that λ= µ²₁

µ2−µ²₁, p = µ2−µ²₁

µ₁ ,

ˆ

µ2−µˆ²₁, p = µˆ₂−µˆ²₁ ˆ

µ₁ .

(59)

Method of moments: illustration

p(x;λ,p) = x^p−1e^−λx λ^pΓ(p) 1_{x>0}

the MLE is not closed-form (exercise). However

µ2−µ²₁, p = µ₂−µ²₁

µ₁ ,

ˆ

µ2−µˆ²₁, p = µˆ₂−µˆ²₁ ˆ

µ₁ .

(60)

Outline

3 Method of moments

4 Linear regression

(61)

Linear regression

(62)

Design matrix

Consider a finite collection of vectors xi ∈R^d pour i = 1. . .n.

Design Matrix

X =







—– x₁^> —–

... ... ...

—– x_n^> —–





.

We assume that the vectors are centered, i.e. that Pn

i=1xi = 0. Ifxi are not centered the design matrix of centered data can be constructed with the rows xi−x¯^> with ¯x= ¹_nPn

i=1xi.

(63)

Design matrix

Design Matrix

X =







—– x₁^> —–

... ... ...

—– x_n^> —–





.

i=1x_i = 0.

Ifxi are not centered the design matrix of centered data can be constructed with the rows xi−x¯^> with ¯x= ¹_nPn

i=1xi.

(64)

Design matrix

Design Matrix

X =







—– x₁^> —–

... ... ...

—– x_n^> —–





.

i=1x_i = 0.

Ifxi are not centered the design matrix of centered data can be constructed with the rows xi−x¯^> with ¯x= ¹_nPn

i=1xi.

(65)

Linear regression

We consider the OLS regression for the linear hypothesis space.

We haveX =R^p,Y =Rand` the square loss.

Consider the hypothesis space:

S ={f_w |w ∈R^p} with fw :x7→w^>x. Given a training set {(x₁,y1), . . . ,(xn,yn)} we have

Rb_n(f_w) = 1 2n

n

X

i=1

(y_i −w^>x_i)² = 1

2nky −X wk²₂ with

the vector of outputs y^>= (y1, . . . ,yn)∈Rⁿ

the design matrixX ∈R^n×p whose ith row is equal tox^>_i .

(66)

Linear regression

S ={f_w |w ∈R^p} with fw :x7→w^>x.

Given a training set {(x₁,y1), . . . ,(xn,yn)} we have Rb_n(f_w) = 1

2n

n

X

i=1

(y_i −w^>x_i)² = 1

(67)

Linear regression

2n

n

X

i=1

(y_i −w^>x_i)²

= 1

(68)

Linear regression

2n

n

X

i=1

(y_i −w^>x_i)² = 1

(69)

Solving linear regression

To solve min

w∈R^p

Rb_n(fw), we consider that Rb_n(f_w) = 1

2n w^>X^>X w −2w^>X^>y +kyk²

is a differentiable convexfunction whose minima are thus characterized by the

Normal equations

X^>X w −X^>y =0

IfX^>X is invertible, then fbis given by:

fb:x⁰ 7→x^0>(X^>X)⁻¹X^>y.

Problem: X^>X is never invertible forp >n and thus the solution is not unique.

(70)

Solving linear regression

To solve min

w∈R^p

2n w^>X^>X w −2w^>X^>y +kyk²

Normal equations

X^>X w −X^>y =0

fb:x⁰ 7→x^0>(X^>X)⁻¹X^>y.

(71)

Solving linear regression

To solve min

w∈R^p

2n w^>X^>X w −2w^>X^>y +kyk²

Normal equations

X^>X w −X^>y =0

fb:x⁰ 7→x^0>(X^>X)⁻¹X^>y.

(72)

Solving linear regression

To solve min

w∈R^p

2n w^>X^>X w −2w^>X^>y +kyk²

Normal equations

X^>X w −X^>y =0

fb:x⁰ 7→x^0>(X^>X)⁻¹X^>y.

Problem: X^>X is never invertible for p >n and thus the solution is not unique.

(73)

Ridge regression

Is obtained by applying Tikhonov regularization to OLS regression.

wmin∈R^p

1

2nky −X wk²₂+λkwk²₂ Problem now strongly convex thus well-posed

Thus with unique solution: ˆ

w^(ridge)= (X^>X+λI)⁻¹X^>y

Shrinkage effect

Regularization improves the conditioning number of the Hessian

⇒ Problem now easier to solve computationally

(74)

Ridge regression

wmin∈R^p

1

2nky −X wk²₂+λkwk²₂ Problem now strongly convex thus well-posed Thus with unique solution:

ˆ

Shrinkage effect

(75)

Ridge regression

wmin∈R^p

1

ˆ

Shrinkage effect

(76)

Ridge regression

wmin∈R^p

1

ˆ

Shrinkage effect

(77)

Ridge regression

wmin∈R^p

1

ˆ

Shrinkage effect

(78)

Outline

3 Method of moments

4 Linear regression

(79)

Bayesian estimation

Bayesians treat the parameter θ as a random variable.

A priori

The Bayesian has to specify ana priori distribution p(θ) for the model parametersθ, which models his prior belief of the relative plausibility of different values of the parameter.

A posteriori

The observation contribute through the likelihood: p(x|θ). The a posterioridistribution on the parameters is then

p(θ|x) = p(x|θ)p(θ)

p(x) ∝p(x|θ)p(θ).

→ The Bayesian estimator is therefore a probability distribution on the parameters.

This estimation procedure is called Bayesian inference.

(80)

Bayesian estimation

Bayesians treat the parameter θ as a random variable.

A priori

The Bayesian has to specify ana priori distribution p(θ) for the model parametersθ, which models his prior belief of the relative plausibility of different values of the parameter.

A posteriori

The observation contribute through the likelihood: p(x|θ).

The a posterioridistribution on the parameters is then p(θ|x) = p(x|θ)p(θ)

p(x) ∝p(x|θ)p(θ).

→ The Bayesian estimator is therefore a probability distribution on the parameters.

This estimation procedure is calledBayesian inference.

(81)

Dirichlet distribution

We say thatθ = (θ1, . . . , θ_K) follows the Dirichlet distribution and note θ∼Dir(α)

for

θ in the simplex 4_K ={u∈R^K+|PK

k=1u_k = 1}and admitting the density

p(θ;α) = Γ(α0) Q

kΓ(α_k) θ^α₁¹⁻¹. . . θ_K^α^K⁻¹ with respect to the uniform measure on the simplex, where

α₀=X

k

α_k and Γ(x) := Z ∞

0

t^x−1e^−tdt

(82)

Dirichlet distribution

for θ in the simplex 4_K ={u∈R^K₊|PK

k=1u_k = 1}and

admitting the density

p(θ;α) = Γ(α0) Q

α₀=X

k

α_k and Γ(x) := Z ∞

0

t^x−1e^−tdt

(83)

Dirichlet distribution

for θ in the simplex 4_K ={u∈R^K₊|PK

k=1u_k = 1}and admitting the density

p(θ;α) = Γ(α0) Q

α₀=X

k

α_k and Γ(x) :=

Z ∞ 0

t^x−1e^−tdt

(84)

Dirichlet distribution II

E[θk] = αk

α0

, Var(θk) = αk(α0−αk)

α²₀(α₀+ 1) and Cov(θj, θk) = −α_jα_k α²₀(α₀+ 1) with α0=P

kαk.

(85)

Dirichlet distribution II

E[θk] = αk

α0

, Var(θk) = αk(α0−αk)

α²₀(α₀+ 1) and Cov(θj, θk) = −α_jα_k α²₀(α₀+ 1) with α0=P

kαk.

(86)

Bayesian estimation of a multinomial random variable

Consider the simple Bayesian Dirichlet-Multinomial model with

A Dirichlet prior on the parameter of the multinomial: θ ∼Dir(α) A multinomial random variable z ∼ M(1,θ)

p(θ)∝

K

Y

k=1

θ_k^α^k⁻¹ and p(z|θ) =

K

Y

k=1

θ_k^z^k

Let z⁽¹⁾, . . . ,z^(N) be an i.i.d. sample distributed likez. We have

p(θ|z⁽¹⁾, . . . ,z^(N)) = p(θ) Q

np(z⁽ⁿ⁾|θ)

p(z⁽¹⁾, . . . ,z^(N)) ∝ Y

k

θ^α^k⁺

P

nznk−1 k

So that (θ|(Z))∼Dir((α₁+N₁, . . . , α_K+N_K)) withN_k =P

nz_nk