Miss probability (in %)

(1)

Data analysis and stochastic modeling

Lecture 8 – Hypothesis testing

Guillaume Gravier

[email protected]

Data analysis and stochastic modeling – Hypothesis testing – p.

Where are we?

1. A gentle introduction to probability 2. Data analysis

3. Cluster analysis

4. Estimation theory and practice 5. Mixture models

6. Random processes 7. Queing systems 8. Hypothesis testing

◦ what for?

◦ the mechanics of statistical tests

◦ some common tests

What for?

Hypothesis testing = make some decision on whether something is true or not based on experimental evidences, yet knowing the risk we are taking with that decision.

Typical hypotheses we want to test are:

◦ the mean time to failure of a system exceeds a threshold

θ

0(or not)

◦ the job arrival rate in a queing system is equal to

λ

0(or not)

◦ the distribution of some samples is Gaussian (or not)

◦ two estimated mean values correspond to the same mean (or not)

◦ an observed arrival process is Poisonian (or not)

◦ etc.

The rainmakers example

◦ rain levels (in mm) are assumed to be Gaussian

N (600, 100)

◦ people claim they can increase rain levels by 50 mm and, using their method over 9 years, we observed the following rain levels

year 1 2 3 4 5 6 7 8 9

mm 510 614 780 512 501 534 603 788 650

◦ Is this a scam or not?

Two hypotheses are confronted

H

0

m = 600

^mm

H

1

m = 650

^mm

Since the method to increase rain levels is expensive, we want to use it only if we are pretty sure that it works, i.e. if we have only a 5 % chance of wrongly accepting

H

1from the evidences (risk

α = 0.05

^).

(2)

The rainmakers example

^(cont’d)

We study the empirical mean

X

over the nine year period

◦ if

H

0is true,

X N (600, 100/ √

9)

[see lecture 4]

◦ decision rule

⊲ if

X > k

^{, accept}

H

1(or reject

H

0)

⊲ if

X < k

^{, reject}

H

1(or accept

H

0)

⊲

k

is determined so that the probability of wrongly accepting

H

1is 0.05

k = 600 + 100

√ 9 1.64 = 655

Conclusion: those rainmakers are ripping you off!

The rainmakers example

^(cont’d)

What if rainmakers were very unlucky over those nine years?

◦ if they were right,

X N (650, 100/ √ 9)

◦ an error is made each time

X < 655

◦ the probability of wrongly accepting

H

0(or wrongly rejecting

H

1) is given by the risk

β

β = P

"

655 − 650 100/ √

9 #

= 0.56

◦

H

1defines the shape of the critical decision region (

650 > 600

⁾

but the threshold

k

only depends on

H

0and

α

◦ additional knowledge on

H

1is used to compute the risk

β

Concepts and vocabulary

◦ a statistical hypothesis is an assertion which can be valid or not

◦ a statistical test is a procedure to make a decision

◦ the null hypothesis

H

0is a claim that we are interested in accepting or rejecting

◦ the alternative hypothesis

H

1is the contradiction of

H

0

◦ the critical or rejection region is the region where we reject

H

0(above

k

in the example) as opposed to the acceptance region

◦ wrongly rejecting

H

0is known as the type 1 error and the probability

α

that this happens is the level of significance of the test

◦ wrongly rejecting

H

1is a type 2 error and the quantity

1 − β

^{is the}

power of the test

Hypotheses and types of errors

◦

α

= probability of choosing

H

1when

H

0is true

◦

β

= probability of choosing

H

0when

H

1is true

truth\^decision H0 H1

H0 1−α α H1 β 1−β

In practice,

α

is given by the decision maker and

H

0corresponds to the following

◦ a well established hypothesis, not contradicted so far

◦ a safe decision

e.g. when testing a vaccine,H0is the less favorable hypothesis

◦ the only hypothesis that is easy to formulate

e.g.m=m0is easir to formulate thanm6=m0

(3)

Methodology

1. define H

0

and H

1

2. determine the variables on which to make the decision 3. determine the shape of the critical region based on H

1

4. compute the critical region given α

5. eventually compute the power of the test

6. compute the experimental value of the decision variable 7. accept or reject H

0

Various tests to compare two mean values

H

0

H

1 Reject if Test statistic

µ = µ

0

µ 6 = µ

0

| z

0

| > z

α/2

(

σ

^known)

µ < µ

0

z

0

< − z

α

z

0

=

^X_σ/^¯⁻^√^µ_n⁰

µ > µ

0

z

0

> z

α

µ = µ

0

µ 6 = µ

0

| t

0

| > t

α/2,n−1

(

σ

^unknown)

µ < µ

0

t

0

< − t

α,n−1

t

0

=

^X_s/^¯⁻^√^µ_n⁰

µ > µ

0

t

0

> t

α,n−1

Data analysis and stochastic modeling – Hypothesis testing – p. 10

Optimal decision for simple tests

Consider

X

with density

f(x; θ)

, and denote

L( x , θ)

the density of the sample

x

^.

H₀ θ =θ₀ H1 θ =θ1

Maximize the power of the test!

⇓

P[W|H₁] = 1−β=

Z

W

L(x;θ₁)dx

Jerzy Neyman, 1894 – 1981

Karl Pearson, 1857 – 1936

The optimal critical region is defined by the points such that

L( x ; θ

1

)

L( x ; θ

0

) > k

α

.

Testing composite hypotheses

H

0

θ = θ

0

H

1

θ = θ

1

vs

H

0

θ = θ

0

H

1

θ 6 = θ

1

⇒

^{the risk}

β

(and hence the power) depends on

θ

H

0

m = 600 H

1

m > 600

(4)

The likelihood ratio test

H

0

θ = θ

0

H

1

θ 6 = θ

0

with

θ ∈ R

^p

◦ Test statistics

λ = L( x ; θ

0

) sup

θ

L(x; θ)

⊲ Note that replacingL(x;θ0)bysup

θ

L(x;θ)is like using the ML estimate ofθ

◦ Critical region =

{ x | λ < K

α

}

◦ Asymptotic distribution of the test statistics under

H

0:

λ χ

²_p

Classical test for mean values of N (m, σ)

◦ standard deviation is known

⊲ test statistics =

X N (m,

^√^σ_n

)

⊲ critical region =

X > k

α

⊲ cf. introductory example

◦ standard deviation is known

⊲ test statistics = Student

T

n−1

= √

n − 1 X − m S

⊲ critical region =

| T

n−1

| > k

α

⊲ Example:

⋄

H

0

m = 30

^vs.

H

1

m 6 = 30

⋄ 15 samples with

x = 37.2

^and

s = 6.2

⋄ under

H

0,

t = √

14 37.2 − 30

6.2 = 4.35

⋄ critical value at

α = 0.05

^for

T

14= 1.761

⇒

^REJECT

H

0!

The χ ² fitting tests

X

is a random variable divided into

k

classes of resp. probabilities

p

1

, p

2

, . . . , p

kand we observe a sample of the variable with population size in each class

N

1

= n

1

, N

2

= n

2

, . . . N

k

= n

k

Note thatE[Ni] =np_i

We want to test if the underlying law of the observed process fits the theoretical distribution defined by the

p

i’s (

H

0) or not.

◦ test statistics is

D

²

=

k

X

i=1

(N

i

− np

i

)

²

np

i

◦ asymptotic distribution of the test statistics =

χ

²_k

−1

Notes:

◦ ifmparameters are estimated from the sample (e.g.λin Poisson)D² χ²k−1−m

◦ need for at least 5 (or 3) elements per class

◦ equivalent to the likelihood ratio test for complex hypotheses

Sample comparison & statistical significance

Are two independent samples of resp. sizes

n

1and

n

2coming from the same population/distribution?

Gaussian case:

X

1

N (m

1

, σ

1

)

^and

X

1

N (m

2

, σ

2

)

◦ test variance equality

⇒

Fisher-Snedecor

◦ test mean equality

⇒

^Student

T

n1+n2−2

= √

n

1

+ n

2

− 2 (X

1

− m

1

) − (X

2

− m

2

) s

(n

1

S

₁²

+ n

2

S

₂²

) 1

n

1

+ 1 n

2

Note. The test is robust to distributions others than the Gaussian one.

(5)

Sample comparison & statistical significance

^(cont’

Are two independent samples of resp. sizes

n

1and

n

2coming from the same population/distribution?

General case: compare

{ x

1

, . . . , x

n

}

^and

{ y

1

, . . . , y

m

}

◦ Smirnov to compare empirical distributions

◦ Wilcoxon

⊲ mix all values

x

i and

y

i and sort them (ascending order)

⊲ test statistics

U

= number of pairs

(x

i

, y

j

)

^{such that}

x

i

> y

j

⋄ U = 0⇒x1, . . . , x_n, y1, . . . y_m

⋄ U =nm⇒y1, . . . y_m, x1, . . . , x_n

⋄ underH₀the ranking should be “homogeneous”

⊲ asymptotic distribution is

N nm 2 ,

r nm(n + m + 1) 12

!

⊲ critical region

U −

^nm2

> k

α

Speaker or face identity verification

◦

H

0: the person is who he/she says he/she is

◦

H

1: the person is an impostor

p(y

1^T

; H

0

) p(y

1^T

; H

1

)

H

0

>

H <

1

β

In speaker verification

◦

p(y

₁^T

; H

0

)

= Gaussian mixture model trained with the speaker’s speech

◦

p(y

₁^T

; H

1

)

= GMM of speech in general

Speaker or face identity verification

^(cont’d)

Two errors : false acceptance(type 1) etfalse rejection(type 2).

0.1 0.5 2 5 10 20 30 40 50 60

0.005 0.025 0.1 0.5 2 5 10 20 30 40 50

Miss probability (in %)

False Alarms probability (in %) IRISA max. F-measure min. HTER LIA ENST/TSI

◦ text known or not

◦ size of the data set

◦ signal duration

◦ signal quality

◦ speaker

◦ ...

site ENST IRISA LIA

%fa 9.8 0.3 2.8

%fr 25.3 23.6 30.6

F-measure 46.9 84.3 66.0

Thanks for attending until the end!