• Aucun résultat trouvé

DISCRETE SCAN STATISTICS AND THE GENERALIZED LIKELIHOOD RATIO TEST

N/A
N/A
Protected

Academic year: 2022

Partager "DISCRETE SCAN STATISTICS AND THE GENERALIZED LIKELIHOOD RATIO TEST"

Copied!
10
0
0

Texte intégral

(1)

AND THE GENERALIZED LIKELIHOOD RATIO TEST

MICHAEL GENIN

Communicated by Marius Iosifescu

The scan statistic is used to detect statistically signicant clusters of events. In the continuous one-dimensional case, the test based on scan statistic is equivalent to the generalized likelihood ratio test. We show that this result remains true for one and two dimensional discrete scan statistics. The binomial and Poisson models are considered.

AMS 2010 Subject Classication: 62F03.

Key words: scan statistics, generalized likelihood ratio test.

1. INTRODUCTION

In many areas it is necessary to decide whether a certain accumulation of events is normal or not. In public health, epidemiology services seek the fac- tors that explain the clusters of cancer or birth defects. Biologists seek clusters palindromes in DNA sequences to nd clues to the origin of replication of some viruses. In quality control, one wonders about the clusters of defective prod- ucts. The decision is then made according to the probability of observing such a cluster under the null hypothesis of a normal situation. If this probability is small, it is reasonable to assume the presence of a deviation from the normal situation and then decisions must be taken.

Scan statistics are used to determine how signicant a cluster of events is.

More specically, scan statistics are random variables used as test statistics to check the null hypothesis that the observations are independent and identically distributed (i.i.d.) from a specied distribution, against an alternative hypoth- esis which supports the existence of some cluster of events. Many works are devoted to scan statistics. For an overview, include the monographs of Glaz and Balakrishnan [5], Balakrishnan and Koutras [1], Fu and Lou [4] and more recently the monograph of Glaz, Pozdnyakov and Wallenstein [6]. Scan statis- tics have been widely used in several elds of application such as cosmology [3], reliability theory [2], epidemiology and public health [8].

REV. ROUMAINE MATH. PURES APPL. 60 (2015), 1, 8392

(2)

In Naus 1966 [7], the authors show that, for continuous scan statistics and a scanning window of xed size, the generalized likelihood ratio test (GLRT) is used to reject H0 whenever the scan statistic is greater than its quantile of order 1−α where α is the type 1 error. This result is taken for granted in the case of discrete scan statistics but, as far as we know, there is no proof in the literature. In this article, we formalize this result in case of the one- dimensional and two-dimensional discrete scan statistics for both binomial and Poisson models. Section 2 is devoted to the one-dimensional case and Section 3 to the two-dimensional case. Section 4 provides a conclusion about the obtained results.

2. ONE-DIMENSIONAL DISCRETE SCAN STATISTIC AND THE GLRT

Let N be a positive integer and {Xi}, 1 ≤i≤ N be a sequence of i.i.d.

nonnegative integer random variables from a specied distribution (Bernoulli, binomial, Poisson, etc). Letmbe a positive integer such that 1≤m≤N. For 1≤t≤N−m+ 1, let

νtt(m) =

t+m−1

X

i=t

Xi

(1)

be the number of observed events in the scanning window of size m.

The one-dimensional discrete scan statistic is dened as the maximum number of events occurring in any window of sizem within {1, . . . , N},

Sm=S(m, N) = max

1≤t≤N−m+1νt. (2)

The statistic Sm is used to testH0 assuming that the Xi's are i.i.d. from a specied distribution against an alternative hypothesis H1 which supports the existence of some cluster of events.

For the binomial model, the null hypothesis assumes that the Xi's are i.i.d., Xi ∼ B(n, p0),n >0, withp0 the probability of success. The alternative hypothesis presumes the existence of a window of size m, [a, a+m−1], a∈ [0, N −m+ 1], where the Xi's are i.i.d. as binomial B(n, p1) with p1 > p0 if i∈[a, a+m−1]and p0 =p1 otherwise.

For the Poisson model, the null hypothesis assumes that theXi's are i.i.d., Xi∼ P(λ0). The alternative hypothesis presumes the existence of a window of size m,[a, a+m−1],a∈[0, N −m+ 1], where the Xi's are i.i.d. as Poisson P(λ1) withλ1 > λ0 if i∈[a, a+m−1]and λ01 otherwise.

(3)

In what follows, we assume that the value of mis known. We show that, for both binomial and Poisson models, the GLRT rejects H0 in favor of H1 whenSm is greater than its quantile of order 1−αwhereα is the type 1 error.

The binomial model. LetX1, X2, . . . , XN be a sequence of independent random variables binomially distributed asB(n, pk) wherekis such that

k=

(0 if 1≤i < t or t+m≤i≤N 1 if t≤i < t+m.

It is assumed that the parameters pk are known. One wants to check the null hypothesis H0 that the Xi's are independent and identically distributed (i.i.d.) as B(n, p0):

H0:p0=p1, (3)

against the alternative hypothesis H1 which supports a cluster of events of length m where theXi's are i.i.d. as B(n, p1):

H1 :p1 > p0

(4)

Proposition 1. The generalized likelihood ratio test rejects H0 in favor of the alternative hypothesisH1 when the unidimensional discrete scan statistic with scanning window of xed length m exceeds a threshold τ determined from P(S(m, N)> τ|H0) =α, whereα corresponds to the type 1 error.

Proof. Let LH0 be the likelihood function under H0. It is expressed as LH0(x1, . . . , xN) =LH0 =

N

Y

i=1

n xi

px0i(1−p0)n−xi,

where xi ∈ {0, . . . , n}and ∀i= 1, . . . , N.

One can remark thatH1 can be expressed as a function of t: H1 =

N−m+1

[

t=1

H1(t),

whereH1(t) corresponds to an alternative of a cluster of length m starting at the tth position, for allt∈ {1, . . . , N−m+ 1}.

Thus, the likelihood function under H1,LH1(t), can be expressed as LH1(t) =

t−1

Y

i=1

n xi

px0i(1−p0)n−xi

! t+m−1 Y

i=t

n xi

px1i(1−p1)n−xi

!

×

N

Y

i=t+m

n xi

px0i(1−p0)n−xi

! .

(4)

Hence, the likelihood ratio LR(t, m) is dened as LR(t, m) =

Qt−1

i=1(xin)pxi0 (1−p0)n−xi

Qt+m−1

i=t (xin)pxi1 (1−p1)n−xi QN

i=1(xin)pxi0 (1−p0)n−xi

×

QN

i=t+m(xin)pxi0 (1−p0)n−xi QN

i=1(xin)pxi0 (1−p0)n−xi , which can be simplied to

LR(t, m) =

Qt+m−1 i=t

n xi

px1i(1−p1)n−xi Qt+m−1

i=t n xi

px0i(1−p0)n−xi.

Hence, the logarithm of the likelihood ratio LLR(t, m) can be expressed as

LLR(t, m) = logQt+m−1 i=t

n xi

px1i(1−p1)n−xi−logQt+m−1 i=t

n xi

px0i(1−p0)n−xi,

=Pt+m−1 i=t xilog

p1

p0

+ (n−xi) log 1−p1

1−p0

. Denoting C1 = log

p1

p0

and C2 = log 1−p1

1−p0

. The LLR(t, m) can be expressed as

LLR(t, m) =C1 t+m−1

X

i=t

xi+C2 t+m−1

X

i=t

(n−xi).

For 1≤t≤N −m+ 1, let νt=

t+m−1

X

i=t

xi,

be the number of observed events in the window [t, t+m−1]. Hence, the LLR(t, m) can be written as follows

LLR(t, m) =C1νt+C2(mn−νt).

For xed m and since C1 > 0 and C2 < 0, LLR(t, m) is a monotically increasing function of νt. Consequently, the GLRT rejectsH0 for a value ofνt

as large as possible, i.e. the unidimensional discrete scan statistic with scanning window of xed length m,Sm dened in Eq.(2).

The Poisson model. LetX1, X2, . . . , XN be a sequence of independent random variables Poisson distributed asP(λk)wherek is such that

k=

(0 if 1≤i < t ou t+m≤i≤N 1 if t≤i < t+m.

It is assumed that the parameters λk are known. One wants to check the null hypothesisH0 that theXi's are i.i.d. asP(λ0) :

H001 (5)

(5)

against the alternative hypothesis H1 which supports a cluster of events of length m where theXi's are i.i.d. as P(λ1)

H11 > λ0

(6)

Proposition 2. The generalized likelihood ratio test rejects H0 in favor of the alternative hypothesisH1 when the unidimensional discrete scan statistic with scanning window of xed length m exceeds a threshold τ determined from P(S(m, N)> τ|H0) =α, whereα corresponds to the type 1 error.

Proof. Let LH0 be the likelihood function under H0. It is expressed as follows

LH0(x1, . . . , xN) =LH0 =

N

Y

i=1

e−λ0λx0i xi! , where xi ∈ {0, . . . , n}and ∀i= 1, . . . , N.

The expression of the likelihood function under H1,LH1(t), is the follow- ing

LH1(t) =

t−1

Y

i=1

e−λ0λx0i xi!

! t+m−1 Y

i=t

e−λ1λx1i xi!

! N Y

i=t+m

e−λ0λx0i xi!

! .

Hence, the likelihood ratio LR(t, m) is dened as LR(t, m) =

Qt+m−1 i=t

e−λ1λxi1 xi!

Qt+m−1 i=t

e−λ0λxi0 xi!

,

and its logarithm,LLR(t, m), LLR(t, m) =Pt+m−1

i=t [(−λ1+xilog(λ1)−log(xi!))−(λ0+xilog(λ0)log(xi!))],

=m(λ0−λ1)+Pt+m−1 i=t xilog

λ1

λ0

. Denoting C = log

λ1

λ0

. The expression of the LLR(t, m) is now the following

LLR(t, m) =m(λ0−λ1) +C

t+m−1

X

i=t

xi.

For 1≤t≤N −m+ 1, let νt=

t+m−1

X

i=t

xi,

be the number of observed events in the window [t, t+m−1]. Hence, the LLR(t, m) can be written as follows

LLR(t, m) =m(λ0−λ1) +Cνt.

(6)

For xed mand sinceC >0 andm(λ0−λ1)<0,LLR(t, m) is a monot- ically increasing function of νt. Accordingly, the GLRT rejectsH0 for a value of νt as large as possible, i.e. the unidimensional discrete scan statistic with scanning window of xed length m,Sm dened in Eq.(2).

3. TWO-DIMENSIONAL DISCRETE SCAN STATISTIC AND THE GLRT

Let N1, N2 be positive integers, R = [0, N1]×[0, N2] be a rectangular region and {Xij}, 1 6 i 6 N1, 1 6 j 6 N2, be a family of i.i.d. nonnega- tive integer random variables from a specied distribution (Bernoulli, binomial, Poisson, etc). In practice, the Xij's represent the number of events occurring in the elementary square sub-region[i−1, i]×[j−1, j].

Let m1,m2 be positive integers such that 1 6m1 6 N1, 16 m2 6N2. For16t6N1−m1+ 1and 16s6N2−m2+ 1, let

νtsts(m1, m2) =

t+m1−1

X

i=t

s+m2−1

X

j=s

Xij (7)

be the number of events observed in the scanning window located on the rect- angular sub-region [t, t+m1−1]×[s, s+m2−1]within R(see Fig. 1).

Fig. 1. The scanning process with m1×m2 scanning window inR.

The two-dimensional discrete scan statistic is dened as the maximum number of events that occured in anym1×m2 rectangular window within the rectangular region [0, N1]×[0, N2],

Sm1,m2 =S(m1, m2, N1, N2) = max

16t6N1m1+ 1 16s6N2m2+ 1

νts . (8)

(7)

The statistic Sm1,m2 is used to test the null hypothesis (H0) assuming that the Xij's are i.i.d. from a specied distribution, against an alternative hypothesis (H1) which supports the existence of some cluster of events.

For the binomial model, the null hypothesis assumes that the Xij's are i.i.d.,Xij ∼ B(n, p0),n >0, withp0 the probability of success. The alternative hypothesis presumes the existence of a rectangular subregion of size m1×m2, [a, a+m1−1]×[b, b+m2−1],a∈[0, N1−m1+ 1],b∈[0, N2−m2+ 1], such that the random variables Xij's are independent and identically distributed as a binomialB(n, p1) withp1 > p0 if (i, j)∈[a;a+m1−1]×[b, b+m2−1]and p0 =p1 otherwise.

For the Poisson model, assumes that the Xij's are i.i.d., Xij ∼ P(λ0). The alternative hypothesis presumes the existence of a rectangular subregion of size m1 ×m2, [a, a+m1 −1]×[b, b+m2−1], a ∈ [0, N1−m1+ 1], b ∈ [0, N2 −m2 + 1], such that the random variables Xij's are independent and identically distributed as a Poisson P(λ1) withλ1 > λ0 if (i, j)∈[a;a+m1− 1]×[b, b+m2−1]and λ10 otherwise.

In what follows, we assume that the values ofm1 andm2 are known. We show that, for both binomial and Poisson models, the GLRT rejectsH0in favor of H1 when Sm1,m2 is greater than its quantile of order 1−α when α is the type 1 error.

The Binomial model. Let [0, N1]×[0, N2] be a rectangular region, N1, N2∈N. Let{Xij}be a family of independent random variables binomially distributed asB(n, pk) wherekis such that

k=

(0 if {i, j} ∈[0, N1]×[0, N2]\[t, t+m1−1]×[s, s+m2−1]

1 if {i, j} ∈[t, t+m1−1]×[s, s+m2−1].

It is assumed that the parameters pk are known. One wants to verify the null hypothesisH0 that theXij's are i.i.d. as B(n, p0):

H0 :p0 =p1

(9)

against the alternative hypothesisH1 which supports the existence of a cluster of events of sizem1×m2 where theXij's are i.i.d. as B(n, p1):

H1 :p1 > p0 (10)

Proposition 3. The GLRT rejects H0 in favor of H1 when the two- dimensional discrete scan statistic with scanning window of xed size m1×m2

exceeds a threshold determined from P(S(m1, m2, N1, N2) > τ|H0) =α, where α is the type 1 error.

Proof. The following proof has a similar reasoning to the one-dimensional

(8)

case. LetLH0 be the likelihood function under H0. It is expressed as follows:

LH0(x11, . . . , xN1N2) =LH0 =

N1

Y

i=1 N2

Y

j=1

n xij

px0ij(1−p0)n−xij,

wherexij ∈ {0, . . . , n} and∀i∈ {1, . . . , N1} and∀j∈ {1, . . . , N2}. Under H1, the likelihood function is given by

LH1(t, s) =

t−1

Y

i=1 s−1

Y

j=1

n xij

px0ij(1−p0)n−xij

t+m1−1

Y

i=t

s+m2−1

Y

j=s

n xij

px1ij(1−p1)n−xij

×

N1

Y

i=t+m1

N2

Y

j=s+m2

n xij

px0ij(1−p0)n−xij.

Hence, the likelihood ratio LR(t, m) is dened as LR(t, s, m1, m2) =

Qt+m1−1 i=t

Qs+m2−1 j=s

n xij

px1ij(1−p1)n−xij Qt+m1−1

i=t

Qs+m2−1 j=s

n xij

px0ij(1−p0)n−xij, and its logarithm,LLR(t, s, m1, m2),

LLR(t, s, m1, m2) = logQt+m1−1 i=t

Qs+m2−1 j=s

n xij

px1ij(1−p1)n−xij

−logQt+m1−1 i=t

Qs+m2−1 j=s

n xij

px0ij(1−p0)n−xij,

= Pt+m1−1 i=t

Ps+m2−1 j=s

h xijlog

p1

p0

+ (n−xij) log 1−p1

1−p0

i . Denoting C1 = log

p1

p0

and C2= log1−p

1

1−p0

. Then

LLR(t, s, m1, m2) =C1

t+m1−1

X

i=t

s+m2−1

X

j=s

xij +C2

t+m1−1

X

i=t

s+m2−1

X

j=s

(n−xij).

For 1≤t≤N1−m1+ 1and1≤s≤N2−m2+ 1, let νts =

t+m1−1

X

i=t

s+m2−1

X

j=s

xij,

be the number of observed events in the window of size[t, t+m1−1]×[s, s+ m2−1]. Thus, the LLR(t, s, m1, m2) can be written as

LLR(t, s, m1, m2) =C1νts+C2(m1m2n−νts).

For xed m1 and m2 and since C1 > 0 and C2 < 0, LLR(t, s, m1, m2) is a monodically increasing function of νts. Consequently, the GLRT rejects H0 for a value of νts as large as possible, i.e. the two-dimensional discrete scan statistic with scanning window of xed size m1×m2, Sm1,m2 dened in Eq. (8).

(9)

The Poisson model. Let [0, N1]× [0, N2] be a rectangular region, N1, N2 ∈ N. Let {Xij} be a family of independent random variables Poisson distributed asP(λk)wherek is such that

k=

(0 if {i, j} ∈[0, N1]×[0, N2]\[t, t+m1−1]×[s, s+m2−1]

1 if {i, j} ∈[t, t+m1−1]×[s, s+m2−1].

It is assumed that the parameters pk are known. One wants to verify the null hypothesisH0 that theXij's are i.i.d. asP(λ0):

H001 (11)

against the alternative hypothesisH1 which supports the existence of a cluster of events of sizem1×m2 where theXij's are i.i.d. as P(λ1):

H11 > λ0

(12)

Proposition 4. The GLRT rejects H0 in favor of H1 when the two- dimensional discrete scan statistic with scanning window of xed size m1×m2

exceeds a threshold determined from P(S(m1, m2, N1, N2) > τ|H0) =α, where α is the type 1 error.

Proof. Let LH0 be the likelihood function under H0: LH0(x11, . . . , xN1N2) =LH0 =

N1

Y

i=1 N2

Y

j=1

e−λ0λx0ij xij! ,

wherexij ∈ {0, . . . , n} and∀i∈ {1, . . . , N1} and∀j∈ {1, . . . , N2}. The likelihood function under H1 is given by

LH1(t, s) =

t−1

Y

i=1 s−1

Y

j=1

e−λ0λx0ij xij!

t+m1−1

Y

i=t

s+m2−1

Y

j=s

e−λ1λx1ij xij!

N1

Y

i=t+m1

N2

Y

j=s+m2

e−λ0λx0ij xij! .

Hence, the likelihood ratio LR(t, s, m1, m2)is dened as

LR(t, s, m1, m2) =

Qt+m1−1 i=t

Qs+m2−1 j=s

e−λ1λxij1 xij!

Qt+m1−1 i=t

Qs+m2−1 j=s

e−λ0λxij0 xij!

,

and its logarithm,LLR(t, s, m1, m2), LLR(t, s, m1, m2) =Pt+m1−1

i=t

Ps+m2−1

j=s [(−λ1+xijlog(λ1))−(λ0+xijlog(λ0))],

=m1m20−λ1) +CPt+m1−1 i=t

Ps+m2−1 j=s xij, whereC = log

λ1

λ0

.

(10)

For 1≤t≤N1−m1+ 1and1≤s≤N2−m2+ 1, let νts =

t+m1−1

X

i=t

s+m2−1

X

j=s

xij,

be the number of observed events in the window of size[t, t+m1−1]×[s, s+ m2−1]. Hence, the LLR(t, s, m1, m2)can be written as

LLR(t, s, m1, m2) =m1m20−λ1) +Cνts.

For xedm1 andm2,LLR(t, s, m1, m2) is a monodically increasing func- tion of νts. Consequently, the GLRT rejects H0 for a value of νts as large as possible, ie the two-dimensional discrete scan statistic with scanning window of xed sizem1×m2,Sm1,m2 dened in Eq. (8).

4. CONCLUSION

In this paper, we showed that the generalized likelihood ratio test rejects the null hypothesis of randomness in favor of an alternative hypothesis which supports the existence of a cluster when the discrete scan statistic exceeds its quantile of order 1−α. We proved this result for one-dimensional and two- dimensional discrete scan statistics for both binomial and Poisson models.

REFERENCES

[1] N. Balakrishnan and M.V. Koutras, Runs and scans with applications. Wiley, 2001.

[2] A. Barbour, O. Chryssaphinou and M. Roos, Compound poisson approximation in sys- tems reliability. Naval Research Logistics 43 (1996), 2, 251264.

[3] R. Darling and M.S. Waterman, Extreme value distribution for the largest cube in a random lattice. SIAM Journal on Applied Mathematics 46 (1996), 1, 118132.

[4] J. Fu and W. Lou, Distribution theory of runs and patterns and its applications. World Scientic Publishing Co., 2003.

[5] J. Glaz and N. Balakrishnan, Scan Statistics and Applications. Statistics for Industry and Technology. Birkhauser Boston, 1999.

[6] J. Glaz, V. Pozdnyakov and S. Wallenstein, Scan Statistics : Methods and Applications.

Statistics for Industry and Technology. Birkhauser Boston, 2009.

[7] J. Naus. Power comparison of two tests of non-random clustering . Technometrics 8 (1966), 3, 493517.

[8] J.F. Viel, P. Arveux, J. Baverel and J.Y. Cahn, Soft-tissue sarcoma and non-hodgkinÕs lymphoma clusters around a municipal solid waste incinerator with high dioxin emission levels. American Journal of Epidemiology 152 (2000), 1, 13-9.

Received 9 December 2013 Universite Lille 2

EA2694, CERIM, Faculte de Medecine, 1, Place de Verdun, F-59045 Lille

Cedex, France michael.genin@univ-lille2.fr

Références

Documents relatifs

2014 The Dirichlet series of the multiplicative theory of numbers can be reinterpreted as partition functions of solvable quantum mechanical systems.. We

The main purpose of this paper is to provide an efficient test of this statistical model for every item level data set, to show that this statistical model actually applies

As both the null and alternative hypotheses are composite hypotheses (each comprising a different set of distributions), computing the value of the generalized likelihood ratio

approximate equidistribution is made precise by introducing a notion of asymptotic translation invariance. The distribution of Sn is shown to be asymptotically

The next theorem gives a necessary and sufficient condition for the stochastic comparison of the lifetimes of two n+1-k-out-of-n Markov systems in kth order statistics language4.

Motivated by the aforementioned biomédical and genomic applications and the limitations of existing multiple testing methods, we hâve developed and implemented (in R and

Cloud responses to climate change are highlighted by the Global Energy and Water cycle Exchanges (GEWEX) as one of the grand challenges and a major uncertainty for

First, the noise density was systematically assumed known and different estimators have been proposed: kernel estimators (e.g. Ste- fanski and Carroll [1990], Fan [1991]),