• Aucun résultat trouvé

Review of Statistics

N/A
N/A
Protected

Academic year: 2022

Partager "Review of Statistics"

Copied!
147
0
0

Texte intégral

(1)

Review of Statistics

Guillaume Obozinski

Ecole des Ponts - ParisTech

Master MVA

Review of Statistics 1/52

(2)

Outline

1 Statistical concepts

2 The maximum likelihood principle

3 Method of moments

4 Linear regression

5 Bayesian Inference

6 Principal Component Analysis

7 Mesures of performance for binary classifiers

(3)

Outline

1 Statistical concepts

2 The maximum likelihood principle

3 Method of moments

4 Linear regression

5 Bayesian Inference

6 Principal Component Analysis

7 Mesures of performance for binary classifiers

Review of Statistics 3/52

(4)

Statistical concepts

(5)

Expectation and Variances

Expectation ofX : E[X] =P

xx·p(x)

Expectation off (X), pourf mesurable : E[f (X)] =X

x

f (x)·p(x) Variance :

Var (X) = E h

(X −E[X])2 i

= E X2

−E[X]2 Covariance :

Cov(X,Y) =E

(X−E[X]) (Y −E[Y]) Conditional expectation of X given Y :

E[X|Y] =X

x

x·p(x|y) Variance decomposition

Review of Statistics 5/52

(6)

Expectation and Variances

Expectation ofX : E[X] =P

xx·p(x) Expectation off (X), pourf mesurable :

E[f (X)] =X

x

f (x)·p(x)

Variance :

Var (X) = E h

(X −E[X])2 i

= E X2

−E[X]2 Covariance :

Cov(X,Y) =E

(X−E[X]) (Y −E[Y]) Conditional expectation of X given Y :

E[X|Y] =X

x

x·p(x|y) Variance decomposition

(7)

Expectation and Variances

Expectation ofX : E[X] =P

xx·p(x) Expectation off (X), pourf mesurable :

E[f (X)] =X

x

f (x)·p(x) Variance :

Var (X) = E h

(X −E[X])2i

= E X2

−E[X]2 Covariance :

Cov(X,Y) =E

(X −E[X]) (Y −E[Y])

Conditional expectation of X given Y : E[X|Y] =X

x

x·p(x|y) Variance decomposition

Review of Statistics 5/52

(8)

Expectation and Variances

Expectation ofX : E[X] =P

xx·p(x) Expectation off (X), pourf mesurable :

E[f (X)] =X

x

f (x)·p(x) Variance :

Var (X) = E h

(X −E[X])2i

= E X2

−E[X]2 Covariance :

Cov(X,Y) =E

(X −E[X]) (Y −E[Y]) Conditional expectation of X given Y :

E[X|Y] =X

x

x·p(x|y)

Variance decomposition

(9)

Expectation and Variances

Expectation ofX : E[X] =P

xx·p(x) Expectation off (X), pourf mesurable :

E[f (X)] =X

x

f (x)·p(x) Variance :

Var (X) = E h

(X −E[X])2i

= E X2

−E[X]2 Covariance :

Cov(X,Y) =E

(X −E[X]) (Y −E[Y]) Conditional expectation of X given Y :

E[X|Y] =X

x

x·p(x|y) Variance decomposition

Review of Statistics 5/52

(10)

Independence concepts

Independence: X⊥⊥Y

We will say that X andY are independent and writeX⊥⊥Y iff:

∀x,y, P(X =x,Y =y) =P(X =x)P(Y =y)

Conditional Independence: X⊥⊥Y |Z

We will say thatX andY are independent conditionally on Z and writeX⊥⊥Y |Z ssi:

∀x,y,z,

P(X =x,Y =y |Z =z) =P(X =x|Z =z)P(Y =y|Z =z)

(11)

Independence concepts

Independence: X⊥⊥Y

We will say that X andY are independent and writeX⊥⊥Y iff:

∀x,y, P(X =x,Y =y) =P(X =x)P(Y =y) Conditional Independence: X⊥⊥Y |Z

We will say thatX andY are independent conditionally on Z and writeX⊥⊥Y |Z ssi:

∀x,y,z,

P(X =x,Y =y |Z =z) =P(X =x|Z =z)P(Y =y|Z =z)

Review of Statistics 6/52

(12)

Conditional Independence exemple

Example of

“X-linked recessive inheritance”:

Transmission of the gene responsible for hemophilia

Risk for sons from an unaffected father:

dependance between the situation of the two brothers.

conditionally independent given that the mother is a carrier of the gene or not.

(13)

Conditional Independence exemple

Example of

“X-linked recessive inheritance”:

Transmission of the gene responsible for hemophilia

Risk for sons from an unaffected father:

dependance between the situation of the two brothers.

conditionally independent given that the mother is a carrier of the gene or not.

Review of Statistics 7/52

(14)

Statistical Model

Parametric model – Definition:

Set of distributions parametrized by a vector θ∈Θ⊂Rp PΘ=

pθ(x)|θ∈Θ

Bernoulli model: X ∼Ber(θ) Θ = [0,1] pθ(x) =θx(1−θ)(1−x) Binomial model: X ∼Bin(n, θ) Θ = [0,1]

pθ(x) = n

x

θx(1−θ)(1−x)

Multinomial model: X ∼ M(n, π1, π2, . . . , πK) Θ = [0,1]K pθ(x) =

n x1, . . . ,xk

π1x1 . . . πkxk

(15)

Statistical Model

Parametric model – Definition:

Set of distributions parametrized by a vector θ∈Θ⊂Rp PΘ=

pθ(x)|θ∈Θ Bernoulli model: X ∼Ber(θ) Θ = [0,1]

pθ(x) =θx(1−θ)(1−x)

Binomial model: X ∼Bin(n, θ) Θ = [0,1] pθ(x) =

n x

θx(1−θ)(1−x)

Multinomial model: X ∼ M(n, π1, π2, . . . , πK) Θ = [0,1]K pθ(x) =

n x1, . . . ,xk

π1x1 . . . πkxk

Review of Statistics 8/52

(16)

Statistical Model

Parametric model – Definition:

Set of distributions parametrized by a vector θ∈Θ⊂Rp PΘ=

pθ(x)|θ∈Θ Bernoulli model: X ∼Ber(θ) Θ = [0,1]

pθ(x) =θx(1−θ)(1−x) Binomial model: X ∼Bin(n, θ) Θ = [0,1]

pθ(x) = n

x

θx(1−θ)(1−x)

Multinomial model: X ∼ M(n, π1, π2, . . . , πK) Θ = [0,1]K pθ(x) =

n x1, . . . ,xk

π1x1 . . . πkxk

(17)

Statistical Model

Parametric model – Definition:

Set of distributions parametrized by a vector θ∈Θ⊂Rp PΘ=

pθ(x)|θ∈Θ Bernoulli model: X ∼Ber(θ) Θ = [0,1]

pθ(x) =θx(1−θ)(1−x) Binomial model: X ∼Bin(n, θ) Θ = [0,1]

pθ(x) = n

x

θx(1−θ)(1−x)

Multinomial model: X ∼ M(n, π1, π2, . . . , πK) Θ = [0,1]K pθ(x) =

n x1, . . . ,xk

π1x1 . . . πkxk

Review of Statistics 8/52

(18)

Indicator variable coding for multinomial variables

Let C a r.v. taking values in{1, . . . ,K}, with P(C =k) =πk.

We will code C with a r.v. Y = (Y1, . . . ,YK)> with Yk = 1{C=k}

For example if K = 5 andc = 4 then y = (0,0,0,1,0)>. So y ∈ {0,1}K withPK

k=1yk = 1.

P(C =k) =P(Yk = 1) and P(Y =y) =

K

Y

k=1

πkyk.

(19)

Indicator variable coding for multinomial variables

Let C a r.v. taking values in{1, . . . ,K}, with P(C =k) =πk.

We will code C with a r.v. Y = (Y1, . . . ,YK)> with Yk = 1{C=k}

For example if K = 5 andc = 4 then y = (0,0,0,1,0)>. So y ∈ {0,1}K withPK

k=1yk = 1.

P(C =k) =P(Yk = 1) and P(Y =y) =

K

Y

k=1

πkyk.

Review of Statistics 9/52

(20)

Indicator variable coding for multinomial variables

Let C a r.v. taking values in{1, . . . ,K}, with P(C =k) =πk.

We will code C with a r.v. Y = (Y1, . . . ,YK)> with Yk = 1{C=k}

For example if K = 5 andc = 4 then y = (0,0,0,1,0)>.

So y ∈ {0,1}K withPK

k=1yk = 1.

P(C =k) =P(Yk = 1) and P(Y =y) =

K

Y

k=1

πkyk.

(21)

Indicator variable coding for multinomial variables

Let C a r.v. taking values in{1, . . . ,K}, with P(C =k) =πk.

We will code C with a r.v. Y = (Y1, . . . ,YK)> with Yk = 1{C=k}

For example if K = 5 andc = 4 then y = (0,0,0,1,0)>. So y ∈ {0,1}K withPK

k=1yk = 1.

P(C =k) =P(Yk = 1) and P(Y =y) =

K

Y

k=1

πkyk.

Review of Statistics 9/52

(22)

Indicator variable coding for multinomial variables

Let C a r.v. taking values in{1, . . . ,K}, with P(C =k) =πk.

We will code C with a r.v. Y = (Y1, . . . ,YK)> with Yk = 1{C=k}

For example if K = 5 andc = 4 then y = (0,0,0,1,0)>. So y ∈ {0,1}K withPK

k=1yk = 1.

P(C =k) =P(Yk = 1) and P(Y =y) =

K

Y

k=1

πkyk.

(23)

Bernoulli, Binomial, Multinomial

Y ∼Ber(π) (Y1, . . . ,YK)∼ M(1, π1, . . . , πK) p(y) =πy(1−π)1−y p(y) =πy11. . . πKyK

N1 ∼Bin(n, π) (N1, . . . ,NK)∼ M(n, π1, . . . , πK) p(n1) = nn

1

πn1(1−π)n−n1 p(n) =

n n1 . . . nK

πn11. . . πKnK

with n

i

= n!

(n−i)!i! and

n n1 . . . nK

= n!

n1!. . .nK!

Review of Statistics 10/52

(24)

Gaussian model

Scalar Gaussian model : X ∼ N(µ, σ2) X real valued r.v., andθ= µ, σ2

∈Θ =R×R+. pµ,σ2(x) = 1

2πσ2exp −1 2

(x−µ)2 σ2

!

Multivariate Gaussian model: X ∼ N (µ,Σ)

X r.v. taking values in Rd. If Kd is the set of positive definite matrices of size d ×d , and θ= (µ,Σ)∈Θ =Rd× Kd.

pµ,Σ(x) = 1 q

(2π)ddetΣ exp

−1

2(x−µ)TΣ−1(x−µ)

(25)

Gaussian model

Scalar Gaussian model : X ∼ N(µ, σ2) X real valued r.v., andθ= µ, σ2

∈Θ =R×R+. pµ,σ2(x) = 1

2πσ2exp −1 2

(x−µ)2 σ2

!

Multivariate Gaussian model: X ∼ N (µ,Σ)

X r.v. taking values in Rd. If Kd is the set of positive definite matrices of size d ×d , and θ= (µ,Σ)∈Θ =Rd× Kd.

pµ,Σ(x) = 1 q

(2π)ddetΣ exp

−1

2(x−µ)TΣ−1(x−µ)

Review of Statistics 11/52

(26)

Gaussian densities

(27)

Gaussian densities

Review of Statistics 12/52

(28)

Sample/Training set

The data used to learn or estimate a model typically consists of a collection of observation which can be thought of as instantiations of random variables.

X(1), . . . ,X(n)

A common assumption is that the variables are i.i.d. independent

identically distributed, i.e. have the same distributionP.

This collection of observations is called

thesampleor the observations in statistics thesamplesin engineering

thetraining set in machine learning

(29)

Sample/Training set

The data used to learn or estimate a model typically consists of a collection of observation which can be thought of as instantiations of random variables.

X(1), . . . ,X(n)

A common assumption is that the variables are i.i.d.

independent

identically distributed, i.e. have the same distributionP.

This collection of observations is called

thesampleor the observations in statistics thesamplesin engineering

thetraining set in machine learning

Review of Statistics 13/52

(30)

Sample/Training set

The data used to learn or estimate a model typically consists of a collection of observation which can be thought of as instantiations of random variables.

X(1), . . . ,X(n)

A common assumption is that the variables are i.i.d.

independent

identically distributed, i.e. have the same distributionP.

This collection of observations is called

thesampleor the observations in statistics thesamplesin engineering

thetraining set in machine learning

(31)

Outline

1 Statistical concepts

2 The maximum likelihood principle

3 Method of moments

4 Linear regression

5 Bayesian Inference

6 Principal Component Analysis

7 Mesures of performance for binary classifiers

Review of Statistics 14/52

(32)

The maximum likelihood principle

(33)

Maximum likelihood principle

LetPΘ=

p(x;θ)|θ∈Θ be a model Letx be an observation

Likelihood:

L: Θ → R+

θ 7→ p(x;θ) Maximum likelihood estimator:

θˆML= argmax

θ∈Θ

p(x;θ) Sir Ronald Fisher

(1890-1962)

Case of i.i.d data

If (xi)1≤i≤n is an i.i.d. sample of sizen: θˆML= argmax

θ∈Θ n

Y

i=1

pθ(xi) = argmax

θ∈Θ n

X

i=1

log pθ(xi)

Review of Statistics 16/52

(34)

Maximum likelihood principle

LetPΘ=

p(x;θ)|θ∈Θ be a model Letx be an observation

Likelihood:

L: Θ → R+

θ 7→ p(x;θ)

Maximum likelihood estimator: θˆML= argmax

θ∈Θ

p(x;θ) Sir Ronald Fisher

(1890-1962)

Case of i.i.d data

If (xi)1≤i≤n is an i.i.d. sample of sizen: θˆML= argmax

θ∈Θ n

Y

i=1

pθ(xi) = argmax

θ∈Θ n

X

i=1

log pθ(xi)

(35)

Maximum likelihood principle

LetPΘ=

p(x;θ)|θ∈Θ be a model Letx be an observation

Likelihood:

L: Θ → R+

θ 7→ p(x;θ) Maximum likelihood estimator:

θˆML= argmax

θ∈Θ

p(x;θ) Sir Ronald Fisher

(1890-1962)

Case of i.i.d data

If (xi)1≤i≤n is an i.i.d. sample of sizen: θˆML= argmax

θ∈Θ n

Y

i=1

pθ(xi) = argmax

θ∈Θ n

X

i=1

log pθ(xi)

Review of Statistics 16/52

(36)

Maximum likelihood principle

LetPΘ=

p(x;θ)|θ∈Θ be a model Letx be an observation

Likelihood:

L: Θ → R+

θ 7→ p(x;θ) Maximum likelihood estimator:

θˆML= argmax

θ∈Θ

p(x;θ) Sir Ronald Fisher

(1890-1962)

Case of i.i.d data

If (xi)1≤i≤n is an i.i.d. sample of sizen:

θˆML= argmax

θ∈Θ n

Y

i=1

pθ(xi) = argmax

θ∈Θ n

X

i=1

log pθ(xi)

(37)

The maximum likelihood estimator

The MLE

does not always exists

is not necessarily unique

Review of Statistics 17/52

(38)

The maximum likelihood estimator

The MLE

does not always exists is not necessarily unique

(39)

The maximum likelihood estimator

The MLE

does not always exists is not necessarily unique

Review of Statistics 17/52

(40)

MLE for the Bernoulli model

Let X1,X2, . . . ,Xn an i.i.d. sample ∼Ber(θ).

The log-likelihood is

`(θ) =

n

X

i=1

logp(xi;θ)

=

n

X

i=1

log

θxi(1−θ)1−xi

=

n

X

i=1

xilogθ+ (1−xi) log(1−θ)

=Nlog(θ) + (n−N) log(1−θ) with N :=Pn

i=1xi.

MLE: Existence? Uniqueness?

θ7→`(θ) is strongly concave⇒ the MLE exists and is unique. since `differentiable + strongly concave its maximizer is the unique stationary point

∇`(θ) = ∂

∂θ`(θ) = N

θ −n−N

1−θ. Thus

θˆML= N

n = x1+x2+· · ·+xn

n .

(41)

MLE for the Bernoulli model

Let X1,X2, . . . ,Xn an i.i.d. sample ∼Ber(θ). The log-likelihood is

`(θ) =

n

X

i=1

logp(xi;θ)

=

n

X

i=1

log

θxi(1−θ)1−xi

=

n

X

i=1

xilogθ+ (1−xi) log(1−θ)

=Nlog(θ) + (n−N) log(1−θ) with N :=Pn

i=1xi.

MLE: Existence? Uniqueness?

θ7→`(θ) is strongly concave⇒ the MLE exists and is unique. since `differentiable + strongly concave its maximizer is the unique stationary point

∇`(θ) = ∂

∂θ`(θ) = N

θ −n−N

1−θ. Thus

θˆML= N

n = x1+x2+· · ·+xn

n .

Review of Statistics 18/52

(42)

MLE for the Bernoulli model

Let X1,X2, . . . ,Xn an i.i.d. sample ∼Ber(θ). The log-likelihood is

`(θ) =

n

X

i=1

logp(xi;θ) =

n

X

i=1

log

θxi(1−θ)1−xi

=

n

X

i=1

xilogθ+ (1−xi) log(1−θ)

=Nlog(θ) + (n−N) log(1−θ) with N :=Pn

i=1xi.

MLE: Existence? Uniqueness?

θ7→`(θ) is strongly concave⇒ the MLE exists and is unique. since `differentiable + strongly concave its maximizer is the unique stationary point

∇`(θ) = ∂

∂θ`(θ) = N

θ −n−N

1−θ. Thus

θˆML= N

n = x1+x2+· · ·+xn

n .

(43)

MLE for the Bernoulli model

Let X1,X2, . . . ,Xn an i.i.d. sample ∼Ber(θ). The log-likelihood is

`(θ) =

n

X

i=1

logp(xi;θ) =

n

X

i=1

log

θxi(1−θ)1−xi

=

n

X

i=1

xilogθ+ (1−xi) log(1−θ)

=Nlog(θ) + (n−N) log(1−θ) with N :=Pn

i=1xi.

MLE: Existence? Uniqueness?

θ7→`(θ) is strongly concave⇒ the MLE exists and is unique. since `differentiable + strongly concave its maximizer is the unique stationary point

∇`(θ) = ∂

∂θ`(θ) = N

θ −n−N

1−θ. Thus

θˆML= N

n = x1+x2+· · ·+xn

n .

Review of Statistics 18/52

(44)

MLE for the Bernoulli model

Let X1,X2, . . . ,Xn an i.i.d. sample ∼Ber(θ). The log-likelihood is

`(θ) =

n

X

i=1

logp(xi;θ) =

n

X

i=1

log

θxi(1−θ)1−xi

=

n

X

i=1

xilogθ+ (1−xi) log(1−θ)

=Nlog(θ) + (n−N) log(1−θ) with N :=Pn

i=1xi.

MLE: Existence? Uniqueness?

θ7→`(θ) is strongly concave⇒ the MLE exists and is unique. since `differentiable + strongly concave its maximizer is the unique stationary point

∇`(θ) = ∂

∂θ`(θ) = N

θ −n−N

1−θ. Thus

θˆML= N

n = x1+x2+· · ·+xn

n .

(45)

MLE for the Bernoulli model

Let X1,X2, . . . ,Xn an i.i.d. sample ∼Ber(θ). The log-likelihood is

`(θ) =

n

X

i=1

logp(xi;θ) =

n

X

i=1

log

θxi(1−θ)1−xi

=

n

X

i=1

xilogθ+ (1−xi) log(1−θ)

=Nlog(θ) + (n−N) log(1−θ) with N :=Pn

i=1xi.

MLE: Existence? Uniqueness?

θ7→`(θ) is strongly concave⇒ the MLE exists and is unique. since `differentiable + strongly concave its maximizer is the unique stationary point

∇`(θ) = ∂

∂θ`(θ) = N

θ −n−N

1−θ. Thus

θˆML= N

n = x1+x2+· · ·+xn

n .

Review of Statistics 18/52

(46)

MLE for the Bernoulli model

Let X1,X2, . . . ,Xn an i.i.d. sample ∼Ber(θ). The log-likelihood is

`(θ) =

n

X

i=1

logp(xi;θ) =

n

X

i=1

log

θxi(1−θ)1−xi

=

n

X

i=1

xilogθ+ (1−xi) log(1−θ)

=Nlog(θ) + (n−N) log(1−θ) with N :=Pn

i=1xi.

MLE: Existence? Uniqueness?

θ7→`(θ) is strongly concave⇒ the MLE exists and is unique.

since `differentiable + strongly concave its maximizer is the unique stationary point

∇`(θ) = ∂

∂θ`(θ) = N

θ −n−N

1−θ. Thus

θˆML= N

n = x1+x2+· · ·+xn

n .

(47)

MLE for the Bernoulli model

Let X1,X2, . . . ,Xn an i.i.d. sample ∼Ber(θ). The log-likelihood is

`(θ) =

n

X

i=1

logp(xi;θ) =

n

X

i=1

log

θxi(1−θ)1−xi

=

n

X

i=1

xilogθ+ (1−xi) log(1−θ)

=Nlog(θ) + (n−N) log(1−θ) with N :=Pn

i=1xi.

MLE: Existence? Uniqueness?

θ7→`(θ) is strongly concave⇒ the MLE exists and is unique.

since `differentiable + strongly concave its maximizer is the unique stationary point

∇`(θ) = ∂

∂θ`(θ) = N

θ −n−N

1−θ. Thus

θˆML= N

n = x1+x2+· · ·+xn

n .

Review of Statistics 18/52

(48)

MLE for the Bernoulli model

Let X1,X2, . . . ,Xn an i.i.d. sample ∼Ber(θ). The log-likelihood is

`(θ) =

n

X

i=1

logp(xi;θ) =

n

X

i=1

log

θxi(1−θ)1−xi

=

n

X

i=1

xilogθ+ (1−xi) log(1−θ)

=Nlog(θ) + (n−N) log(1−θ) with N :=Pn

i=1xi.

MLE: Existence? Uniqueness?

θ7→`(θ) is strongly concave⇒ the MLE exists and is unique.

since `differentiable + strongly concave its maximizer is the unique stationary point

∇`(θ) = ∂

∂θ`(θ) = N

θ −n−N

1−θ.

Thus

θˆML= N

n = x1+x2+· · ·+xn

n .

(49)

MLE for the Bernoulli model

Let X1,X2, . . . ,Xn an i.i.d. sample ∼Ber(θ). The log-likelihood is

`(θ) =

n

X

i=1

logp(xi;θ) =

n

X

i=1

log

θxi(1−θ)1−xi

=

n

X

i=1

xilogθ+ (1−xi) log(1−θ)

=Nlog(θ) + (n−N) log(1−θ) with N :=Pn

i=1xi.

MLE: Existence? Uniqueness?

θ7→`(θ) is strongly concave⇒ the MLE exists and is unique.

since `differentiable + strongly concave its maximizer is the unique stationary point

∇`(θ) = ∂

∂θ`(θ) = N

θ −n−N

1−θ. Thus

θˆML= N

n = x1+x2+· · ·+xn

n .

Review of Statistics 18/52

(50)

MLE for the multinomial

Done on the board.

(51)

Outline

1 Statistical concepts

2 The maximum likelihood principle

3 Method of moments

4 Linear regression

5 Bayesian Inference

6 Principal Component Analysis

7 Mesures of performance for binary classifiers

Review of Statistics 20/52

(52)

Method of moments (Karl Pearson, 1894)

Consider a statistical model for a univariate r.v. parameterized by θ= (θ1, . . . , θK)∈Rk.

Denote by µk thekth moment of a random variable:

µ1(θ) =Eθ[X], µ2(θ) =Eθ[X2], . . . , µK(θ) =Eθ[XK].

We have

1, . . . , µK) =f(θ) =f(θ1, . . . , θK). Principle of the method of moments

Given a sampleX1, . . . ,Xn

Estimate the µks with the empirical moments: ˆµk = 1 n

n

X

i=1

Xik.

The moment estimator is ˆθ defined as the solution to the equation (ˆµ1, . . . ,µˆK) =f(ˆθ1, . . . ,θˆK).

(53)

Method of moments (Karl Pearson, 1894)

Consider a statistical model for a univariate r.v. parameterized by θ= (θ1, . . . , θK)∈Rk.

Denote by µk thekth moment of a random variable:

µ1(θ) =Eθ[X], µ2(θ) =Eθ[X2], . . . , µK(θ) =Eθ[XK].

We have

1, . . . , µK) =f(θ) =f(θ1, . . . , θK).

Principle of the method of moments Given a sampleX1, . . . ,Xn

Estimate the µks with the empirical moments: ˆµk = 1 n

n

X

i=1

Xik.

The moment estimator is ˆθ defined as the solution to the equation (ˆµ1, . . . ,µˆK) =f(ˆθ1, . . . ,θˆK).

Review of Statistics 21/52

(54)

Method of moments (Karl Pearson, 1894)

Consider a statistical model for a univariate r.v. parameterized by θ= (θ1, . . . , θK)∈Rk.

Denote by µk thekth moment of a random variable:

µ1(θ) =Eθ[X], µ2(θ) =Eθ[X2], . . . , µK(θ) =Eθ[XK].

We have

1, . . . , µK) =f(θ) =f(θ1, . . . , θK).

Principle of the method of moments Given a sampleX1, . . . ,Xn

Estimate the µks with the empirical moments: ˆµk = 1 n

n

X

i=1

Xik.

The moment estimator is ˆθ defined as the solution to the equation (ˆµ1, . . . ,µˆK) =f(ˆθ1, . . . ,θˆK).

(55)

Method of moments (Karl Pearson, 1894)

Consider a statistical model for a univariate r.v. parameterized by θ= (θ1, . . . , θK)∈Rk.

Denote by µk thekth moment of a random variable:

µ1(θ) =Eθ[X], µ2(θ) =Eθ[X2], . . . , µK(θ) =Eθ[XK].

We have

1, . . . , µK) =f(θ) =f(θ1, . . . , θK).

Principle of the method of moments Given a sampleX1, . . . ,Xn

Estimate the µks with the empirical moments: ˆµk = 1 n

n

X

i=1

Xik.

The moment estimator is ˆθ defined as the solution to the equation (ˆµ1, . . . ,µˆK) =f(ˆθ1, . . . ,θˆK).

Review of Statistics 21/52

(56)

Method of moments: illustration

In many usual cases the moment estimator and theMLE are equal.

Example where MME 6= MLE For the family of gamma distribution

p(x;λ,p) = xp−1e−λx λpΓ(p) 1{x>0} the MLE is not closed-form (exercise). However

µ1 =E[X] =λp, µ2 =E[X2] =p(p+ 1)λ2,So that λ= µ21

µ2−µ21, p = µ2−µ21

µ1 ,

which yields the moment estimators λˆ= µˆ21

ˆ

µ2−µˆ21, p = µˆ2−µˆ21 ˆ

µ1 .

(57)

Method of moments: illustration

In many usual cases the moment estimator and theMLE are equal.

Example where MME 6= MLE For the family of gamma distribution

p(x;λ,p) = xp−1e−λx λpΓ(p) 1{x>0}

the MLE is not closed-form (exercise).

However

µ1 =E[X] =λp, µ2 =E[X2] =p(p+ 1)λ2,So that λ= µ21

µ2−µ21, p = µ2−µ21

µ1 ,

which yields the moment estimators λˆ= µˆ21

ˆ

µ2−µˆ21, p = µˆ2−µˆ21 ˆ

µ1 .

Review of Statistics 22/52

(58)

Method of moments: illustration

In many usual cases the moment estimator and theMLE are equal.

Example where MME 6= MLE For the family of gamma distribution

p(x;λ,p) = xp−1e−λx λpΓ(p) 1{x>0}

the MLE is not closed-form (exercise). However µ1 =E[X] =λp, µ2 =E[X2] =p(p+ 1)λ2,

So that λ= µ21

µ2−µ21, p = µ2−µ21

µ1 ,

which yields the moment estimators λˆ= µˆ21

ˆ

µ2−µˆ21, p = µˆ2−µˆ21 ˆ

µ1 .

(59)

Method of moments: illustration

In many usual cases the moment estimator and theMLE are equal.

Example where MME 6= MLE For the family of gamma distribution

p(x;λ,p) = xp−1e−λx λpΓ(p) 1{x>0}

the MLE is not closed-form (exercise). However

µ1 =E[X] =λp, µ2 =E[X2] =p(p+ 1)λ2,So that λ= µ21

µ2−µ21, p = µ2−µ21

µ1 ,

which yields the moment estimators λˆ= µˆ21

ˆ

µ2−µˆ21, p = µˆ2−µˆ21 ˆ

µ1 .

Review of Statistics 22/52

(60)

Outline

1 Statistical concepts

2 The maximum likelihood principle

3 Method of moments

4 Linear regression

5 Bayesian Inference

6 Principal Component Analysis

7 Mesures of performance for binary classifiers

(61)

Linear regression

Review of Statistics 24/52

(62)

Design matrix

Consider a finite collection of vectors xi ∈Rd pour i = 1. . .n.

Design Matrix

X =

—– x1> —–

... ... ...

—– xn> —–

.

We assume that the vectors are centered, i.e. that Pn

i=1xi = 0. Ifxi are not centered the design matrix of centered data can be constructed with the rows xi−x¯> with ¯x= 1nPn

i=1xi.

(63)

Design matrix

Consider a finite collection of vectors xi ∈Rd pour i = 1. . .n.

Design Matrix

X =

—– x1> —–

... ... ...

—– xn> —–

.

We assume that the vectors are centered, i.e. that Pn

i=1xi = 0.

Ifxi are not centered the design matrix of centered data can be constructed with the rows xi−x¯> with ¯x= 1nPn

i=1xi.

Review of Statistics 25/52

(64)

Design matrix

Consider a finite collection of vectors xi ∈Rd pour i = 1. . .n.

Design Matrix

X =

—– x1> —–

... ... ...

—– xn> —–

.

We assume that the vectors are centered, i.e. that Pn

i=1xi = 0.

Ifxi are not centered the design matrix of centered data can be constructed with the rows xi−x¯> with ¯x= 1nPn

i=1xi.

(65)

Linear regression

We consider the OLS regression for the linear hypothesis space.

We haveX =Rp,Y =Rand` the square loss.

Consider the hypothesis space:

S ={fw |w ∈Rp} with fw :x7→w>x. Given a training set {(x1,y1), . . . ,(xn,yn)} we have

Rbn(fw) = 1 2n

n

X

i=1

(yi −w>xi)2 = 1

2nky −X wk22 with

the vector of outputs y>= (y1, . . . ,yn)∈Rn

the design matrixX ∈Rn×p whose ith row is equal tox>i .

Review of Statistics 26/52

(66)

Linear regression

We consider the OLS regression for the linear hypothesis space.

We haveX =Rp,Y =Rand` the square loss.

Consider the hypothesis space:

S ={fw |w ∈Rp} with fw :x7→w>x.

Given a training set {(x1,y1), . . . ,(xn,yn)} we have Rbn(fw) = 1

2n

n

X

i=1

(yi −w>xi)2 = 1

2nky −X wk22 with

the vector of outputs y>= (y1, . . . ,yn)∈Rn

the design matrixX ∈Rn×p whose ith row is equal tox>i .

(67)

Linear regression

We consider the OLS regression for the linear hypothesis space.

We haveX =Rp,Y =Rand` the square loss.

Consider the hypothesis space:

S ={fw |w ∈Rp} with fw :x7→w>x.

Given a training set {(x1,y1), . . . ,(xn,yn)} we have Rbn(fw) = 1

2n

n

X

i=1

(yi −w>xi)2

= 1

2nky −X wk22 with

the vector of outputs y>= (y1, . . . ,yn)∈Rn

the design matrixX ∈Rn×p whose ith row is equal tox>i .

Review of Statistics 26/52

(68)

Linear regression

We consider the OLS regression for the linear hypothesis space.

We haveX =Rp,Y =Rand` the square loss.

Consider the hypothesis space:

S ={fw |w ∈Rp} with fw :x7→w>x.

Given a training set {(x1,y1), . . . ,(xn,yn)} we have Rbn(fw) = 1

2n

n

X

i=1

(yi −w>xi)2 = 1

2nky −X wk22 with

the vector of outputs y>= (y1, . . . ,yn)∈Rn

the design matrixX ∈Rn×p whose ith row is equal tox>i .

(69)

Solving linear regression

To solve min

wRp

Rbn(fw), we consider that Rbn(fw) = 1

2n w>X>X w −2w>X>y +kyk2

is a differentiable convexfunction whose minima are thus characterized by the

Normal equations

X>X w −X>y =0

IfX>X is invertible, then fbis given by:

fb:x0 7→x0>(X>X)−1X>y.

Problem: X>X is never invertible forp >n and thus the solution is not unique.

Review of Statistics 27/52

(70)

Solving linear regression

To solve min

wRp

Rbn(fw), we consider that Rbn(fw) = 1

2n w>X>X w −2w>X>y +kyk2

is a differentiable convexfunction whose minima are thus characterized by the

Normal equations

X>X w −X>y =0

IfX>X is invertible, then fbis given by:

fb:x0 7→x0>(X>X)−1X>y.

Problem: X>X is never invertible forp >n and thus the solution is not unique.

(71)

Solving linear regression

To solve min

wRp

Rbn(fw), we consider that Rbn(fw) = 1

2n w>X>X w −2w>X>y +kyk2

is a differentiable convexfunction whose minima are thus characterized by the

Normal equations

X>X w −X>y =0

IfX>X is invertible, then fbis given by:

fb:x0 7→x0>(X>X)−1X>y.

Problem: X>X is never invertible forp >n and thus the solution is not unique.

Review of Statistics 27/52

(72)

Solving linear regression

To solve min

wRp

Rbn(fw), we consider that Rbn(fw) = 1

2n w>X>X w −2w>X>y +kyk2

is a differentiable convexfunction whose minima are thus characterized by the

Normal equations

X>X w −X>y =0

IfX>X is invertible, then fbis given by:

fb:x0 7→x0>(X>X)−1X>y.

Problem: X>X is never invertible for p >n and thus the solution is not unique.

(73)

Ridge regression

Is obtained by applying Tikhonov regularization to OLS regression.

wminRp

1

2nky −X wk22+λkwk22 Problem now strongly convex thus well-posed

Thus with unique solution: ˆ

w(ridge)= (X>X+λI)−1X>y

Shrinkage effect

Regularization improves the conditioning number of the Hessian

⇒ Problem now easier to solve computationally

Review of Statistics 28/52

(74)

Ridge regression

Is obtained by applying Tikhonov regularization to OLS regression.

wminRp

1

2nky −X wk22+λkwk22 Problem now strongly convex thus well-posed Thus with unique solution:

ˆ

w(ridge)= (X>X+λI)−1X>y

Shrinkage effect

Regularization improves the conditioning number of the Hessian

⇒ Problem now easier to solve computationally

(75)

Ridge regression

Is obtained by applying Tikhonov regularization to OLS regression.

wminRp

1

2nky −X wk22+λkwk22 Problem now strongly convex thus well-posed Thus with unique solution:

ˆ

w(ridge)= (X>X+λI)−1X>y

Shrinkage effect

Regularization improves the conditioning number of the Hessian

⇒ Problem now easier to solve computationally

Review of Statistics 28/52

(76)

Ridge regression

Is obtained by applying Tikhonov regularization to OLS regression.

wminRp

1

2nky −X wk22+λkwk22 Problem now strongly convex thus well-posed Thus with unique solution:

ˆ

w(ridge)= (X>X+λI)−1X>y

Shrinkage effect

Regularization improves the conditioning number of the Hessian

⇒ Problem now easier to solve computationally

(77)

Ridge regression

Is obtained by applying Tikhonov regularization to OLS regression.

wminRp

1

2nky −X wk22+λkwk22 Problem now strongly convex thus well-posed Thus with unique solution:

ˆ

w(ridge)= (X>X+λI)−1X>y

Shrinkage effect

Regularization improves the conditioning number of the Hessian

⇒ Problem now easier to solve computationally

Review of Statistics 28/52

(78)

Outline

1 Statistical concepts

2 The maximum likelihood principle

3 Method of moments

4 Linear regression

5 Bayesian Inference

6 Principal Component Analysis

7 Mesures of performance for binary classifiers

(79)

Bayesian estimation

Bayesians treat the parameter θ as a random variable.

A priori

The Bayesian has to specify ana priori distribution p(θ) for the model parametersθ, which models his prior belief of the relative plausibility of different values of the parameter.

A posteriori

The observation contribute through the likelihood: p(x|θ). The a posterioridistribution on the parameters is then

p(θ|x) = p(x|θ)p(θ)

p(x) ∝p(x|θ)p(θ).

→ The Bayesian estimator is therefore a probability distribution on the parameters.

This estimation procedure is called Bayesian inference.

Review of Statistics 30/52

(80)

Bayesian estimation

Bayesians treat the parameter θ as a random variable.

A priori

The Bayesian has to specify ana priori distribution p(θ) for the model parametersθ, which models his prior belief of the relative plausibility of different values of the parameter.

A posteriori

The observation contribute through the likelihood: p(x|θ).

The a posterioridistribution on the parameters is then p(θ|x) = p(x|θ)p(θ)

p(x) ∝p(x|θ)p(θ).

→ The Bayesian estimator is therefore a probability distribution on the parameters.

This estimation procedure is calledBayesian inference.

(81)

Dirichlet distribution

We say thatθ = (θ1, . . . , θK) follows the Dirichlet distribution and note θ∼Dir(α)

for

θ in the simplex 4K ={u∈RK+|PK

k=1uk = 1}and admitting the density

p(θ;α) = Γ(α0) Q

kΓ(αk) θα11−1. . . θKαK−1 with respect to the uniform measure on the simplex, where

α0=X

k

αk and Γ(x) := Z

0

tx−1e−tdt

Review of Statistics 31/52

(82)

Dirichlet distribution

We say thatθ = (θ1, . . . , θK) follows the Dirichlet distribution and note θ∼Dir(α)

for θ in the simplex 4K ={u∈RK+|PK

k=1uk = 1}and

admitting the density

p(θ;α) = Γ(α0) Q

kΓ(αk) θα11−1. . . θKαK−1 with respect to the uniform measure on the simplex, where

α0=X

k

αk and Γ(x) := Z

0

tx−1e−tdt

(83)

Dirichlet distribution

We say thatθ = (θ1, . . . , θK) follows the Dirichlet distribution and note θ∼Dir(α)

for θ in the simplex 4K ={u∈RK+|PK

k=1uk = 1}and admitting the density

p(θ;α) = Γ(α0) Q

kΓ(αk) θα11−1. . . θKαK−1 with respect to the uniform measure on the simplex, where

α0=X

k

αk and Γ(x) :=

Z 0

tx−1e−tdt

Review of Statistics 31/52

(84)

Dirichlet distribution II

E[θk] = αk

α0

, Var(θk) = αk0−αk)

α200+ 1) and Cov(θj, θk) = −αjαk α200+ 1) with α0=P

kαk.

(85)

Dirichlet distribution II

E[θk] = αk

α0

, Var(θk) = αk0−αk)

α200+ 1) and Cov(θj, θk) = −αjαk α200+ 1) with α0=P

kαk.

Review of Statistics 32/52

(86)

Bayesian estimation of a multinomial random variable

Consider the simple Bayesian Dirichlet-Multinomial model with

A Dirichlet prior on the parameter of the multinomial: θ ∼Dir(α) A multinomial random variable z ∼ M(1,θ)

p(θ)∝

K

Y

k=1

θkαk−1 and p(z|θ) =

K

Y

k=1

θkzk

Let z(1), . . . ,z(N) be an i.i.d. sample distributed likez. We have

p(θ|z(1), . . . ,z(N)) = p(θ) Q

np(z(n)|θ)

p(z(1), . . . ,z(N)) ∝ Y

k

θαk+

P

nznk−1 k

So that (θ|(Z))∼Dir((α1+N1, . . . , αK+NK)) withNk =P

nznk

Références

Documents relatifs

Index Terms— Consistency in uniform metric, classification, functional data analysis, pattern recognition, uniform consistency, universal consistency.. The quality of a classifier

Similarly, if P3 is an exogenous process (it occurred previously because the rare stimulus was more intense), then it should now also be larger to the more intense

The next theorem gives a necessary and sufficient condition for the stochastic comparison of the lifetimes of two n+1-k-out-of-n Markov systems in kth order statistics language4.

But if statement Sh fails to make sense, it seems that statement Sf does not make sense either, according to Condition1: for the terms “brain” and “left arm” occupy the

Does it make a difference whether the parcel fronts upon the body of water or if the body of water is contained within or passes through the

o  We can estimate the number of theoretical plates in our column based on peak retention times and widths.. o  Both factors are important in determining if a separation

Phase 4 - Analysis and dissemina- tion: In this phase of the Translational Informatics Cycle, the integrated/aggre- gated data and knowledge created in the preceding phases is

- In case of very different sclare measures between the variables, it is better to use the normalized Euclidean distance to give all the variables the same weight.. Note that this