AND THE GENERALIZED LIKELIHOOD RATIO TEST

MICHAEL GENIN

Communicated by Marius Iosifescu

The scan statistic is used to detect statistically signicant clusters of events. In the continuous one-dimensional case, the test based on scan statistic is equivalent to the generalized likelihood ratio test. We show that this result remains true for one and two dimensional discrete scan statistics. The binomial and Poisson models are considered.

AMS 2010 Subject Classication: 62F03.

Key words: scan statistics, generalized likelihood ratio test.

1. INTRODUCTION

In many areas it is necessary to decide whether a certain accumulation of events is normal or not. In public health, epidemiology services seek the fac- tors that explain the clusters of cancer or birth defects. Biologists seek clusters palindromes in DNA sequences to nd clues to the origin of replication of some viruses. In quality control, one wonders about the clusters of defective prod- ucts. The decision is then made according to the probability of observing such a cluster under the null hypothesis of a normal situation. If this probability is small, it is reasonable to assume the presence of a deviation from the normal situation and then decisions must be taken.

Scan statistics are used to determine how signicant a cluster of events is.

More specically, scan statistics are random variables used as test statistics to check the null hypothesis that the observations are independent and identically distributed (i.i.d.) from a specied distribution, against an alternative hypoth- esis which supports the existence of some cluster of events. Many works are devoted to scan statistics. For an overview, include the monographs of Glaz and Balakrishnan [5], Balakrishnan and Koutras [1], Fu and Lou [4] and more recently the monograph of Glaz, Pozdnyakov and Wallenstein [6]. Scan statis- tics have been widely used in several elds of application such as cosmology [3], reliability theory [2], epidemiology and public health [8].

REV. ROUMAINE MATH. PURES APPL. 60 (2015), 1, 8392

In Naus 1966 [7], the authors show that, for continuous scan statistics and
a scanning window of xed size, the generalized likelihood ratio test (GLRT)
is used to reject H_{0} whenever the scan statistic is greater than its quantile of
order 1−α where α is the type 1 error. This result is taken for granted in
the case of discrete scan statistics but, as far as we know, there is no proof
in the literature. In this article, we formalize this result in case of the one-
dimensional and two-dimensional discrete scan statistics for both binomial and
Poisson models. Section 2 is devoted to the one-dimensional case and Section 3
to the two-dimensional case. Section 4 provides a conclusion about the obtained
results.

2. ONE-DIMENSIONAL DISCRETE SCAN STATISTIC AND THE GLRT

Let N be a positive integer and {X_{i}}, 1 ≤i≤ N be a sequence of i.i.d.

nonnegative integer random variables from a specied distribution (Bernoulli, binomial, Poisson, etc). Letmbe a positive integer such that 1≤m≤N. For 1≤t≤N−m+ 1, let

νt=νt(m) =

t+m−1

X

i=t

Xi

(1)

be the number of observed events in the scanning window of size m.

The one-dimensional discrete scan statistic is dened as the maximum number of events occurring in any window of sizem within {1, . . . , N},

Sm=S(m, N) = max

1≤t≤N−m+1νt. (2)

The statistic S_{m} is used to testH_{0} assuming that the X_{i}'s are i.i.d. from
a specied distribution against an alternative hypothesis H_{1} which supports
the existence of some cluster of events.

For the binomial model, the null hypothesis assumes that the X_{i}'s are
i.i.d., Xi ∼ B(n, p_{0}),n >0, withp0 the probability of success. The alternative
hypothesis presumes the existence of a window of size m, [a, a+m−1], a∈
[0, N −m+ 1], where the X_{i}'s are i.i.d. as binomial B(n, p_{1}) with p_{1} > p_{0} if
i∈[a, a+m−1]and p0 =p1 otherwise.

For the Poisson model, the null hypothesis assumes that theX_{i}'s are i.i.d.,
Xi∼ P(λ_{0}). The alternative hypothesis presumes the existence of a window of
size m,[a, a+m−1],a∈[0, N −m+ 1], where the X_{i}'s are i.i.d. as Poisson
P(λ_{1}) withλ_{1} > λ_{0} if i∈[a, a+m−1]and λ_{0}=λ_{1} otherwise.

In what follows, we assume that the value of mis known. We show that,
for both binomial and Poisson models, the GLRT rejects H_{0} in favor of H_{1}
whenSm is greater than its quantile of order 1−αwhereα is the type 1 error.

The binomial model. LetX_{1}, X_{2}, . . . , X_{N} be a sequence of independent
random variables binomially distributed asB(n, p_{k}) wherekis such that

k=

(0 if 1≤i < t or t+m≤i≤N 1 if t≤i < t+m.

It is assumed that the parameters p_{k} are known. One wants to check the
null hypothesis H_{0} that the X_{i}'s are independent and identically distributed
(i.i.d.) as B(n, p_{0}):

H_{0}:p_{0}=p_{1},
(3)

against the alternative hypothesis H_{1} which supports a cluster of events of
length m where theXi's are i.i.d. as B(n, p_{1}):

H_{1} :p1 > p0

(4)

Proposition 1. The generalized likelihood ratio test rejects H_{0} in favor
of the alternative hypothesisH_{1} when the unidimensional discrete scan statistic
with scanning window of xed length m exceeds a threshold τ determined from
P(S(m, N)> τ|H_{0}) =α, whereα corresponds to the type 1 error.

Proof. Let LH_{0} be the likelihood function under H_{0}. It is expressed as
LH0(x_{1}, . . . , x_{N}) =LH0 =

N

Y

i=1

n
x_{i}

p^{x}_{0}^{i}(1−p_{0})^{n−x}^{i},

where xi ∈ {0, . . . , n}and ∀i= 1, . . . , N.

One can remark thatH_{1} can be expressed as a function of t:
H_{1} =

N−m+1

[

t=1

H_{1}(t),

whereH_{1}(t) corresponds to an alternative of a cluster of length m starting at
the tth position, for allt∈ {1, . . . , N−m+ 1}.

Thus, the likelihood function under H_{1},LH_{1}(t), can be expressed as
LH1(t) =

t−1

Y

i=1

n xi

p^{x}_{0}^{i}(1−p0)^{n−x}^{i}

! _{t+m−1}
Y

i=t

n xi

p^{x}_{1}^{i}(1−p1)^{n−x}^{i}

!

×

N

Y

i=t+m

n
x_{i}

p^{x}_{0}^{i}(1−p_{0})^{n−x}^{i}

! .

Hence, the likelihood ratio LR(t, m) is dened as LR(t, m) =

Qt−1

i=1(_{xi}^{n})^{p}^{xi}0 (1−p0)^{n−}^{xi}

Qt+m−1

i=t (_{xi}^{n})^{p}^{xi}1 (1−p1)^{n−}^{xi}
QN

i=1(_{xi}^{n})^{p}^{xi}0 (1−p0)^{n−}^{xi}

×

QN

i=t+m(_{xi}^{n})^{p}^{xi}0 (1−p_{0})^{n−}^{xi}
QN

i=1(_{xi}^{n})^{p}^{xi}0 (1−p0)^{n−}^{xi} ,
which can be simplied to

LR(t, m) =

Qt+m−1 i=t

n xi

p^{x}_{1}^{i}(1−p1)^{n−x}^{i}
Qt+m−1

i=t n xi

p^{x}_{0}^{i}(1−p0)^{n−x}^{i}.

Hence, the logarithm of the likelihood ratio LLR(t, m) can be expressed as

LLR(t, m) = logQt+m−1 i=t

n xi

p^{x}_{1}^{i}(1−p1)^{n−x}^{i}−logQt+m−1
i=t

n xi

p^{x}_{0}^{i}(1−p0)^{n−x}^{i},

=Pt+m−1 i=t xilog

p1

p0

+ (n−xi) log 1−p1

1−p0

. Denoting C1 = log

p1

p0

and C2 = log 1−p1

1−p0

. The LLR(t, m) can be expressed as

LLR(t, m) =C1 t+m−1

X

i=t

xi+C2 t+m−1

X

i=t

(n−xi).

For 1≤t≤N −m+ 1, let νt=

t+m−1

X

i=t

xi,

be the number of observed events in the window [t, t+m−1]. Hence, the LLR(t, m) can be written as follows

LLR(t, m) =C1νt+C2(mn−νt).

For xed m and since C_{1} > 0 and C_{2} < 0, LLR(t, m) is a monotically
increasing function of νt. Consequently, the GLRT rejectsH_{0} for a value ofνt

as large as possible, i.e. the unidimensional discrete scan statistic with scanning
window of xed length m,S_{m} dened in Eq.(2).

The Poisson model. LetX_{1}, X_{2}, . . . , X_{N} be a sequence of independent
random variables Poisson distributed asP(λ_{k})wherek is such that

k=

(0 if 1≤i < t ou t+m≤i≤N 1 if t≤i < t+m.

It is assumed that the parameters λk are known. One wants to check the
null hypothesisH_{0} that theXi's are i.i.d. asP(λ0) :

H_{0}:λ_{0} =λ_{1}
(5)

against the alternative hypothesis H_{1} which supports a cluster of events of
length m where theX_{i}'s are i.i.d. as P(λ_{1})

H_{1}:λ1 > λ0

(6)

Proposition 2. The generalized likelihood ratio test rejects H_{0} in favor
of the alternative hypothesisH_{1} when the unidimensional discrete scan statistic
with scanning window of xed length m exceeds a threshold τ determined from
P(S(m, N)> τ|H_{0}) =α, whereα corresponds to the type 1 error.

Proof. Let LH0 be the likelihood function under H_{0}. It is expressed as
follows

LH_{0}(x_{1}, . . . , x_{N}) =LH_{0} =

N

Y

i=1

e^{−λ}^{0}λ^{x}_{0}^{i}
xi! ,
where xi ∈ {0, . . . , n}and ∀i= 1, . . . , N.

The expression of the likelihood function under H_{1},LH1(t), is the follow-
ing

LH_{1}(t) =

t−1

Y

i=1

e^{−λ}^{0}λ^{x}_{0}^{i}
xi!

! _{t+m−1}
Y

i=t

e^{−λ}^{1}λ^{x}_{1}^{i}
xi!

! _{N}
Y

i=t+m

e^{−λ}^{0}λ^{x}_{0}^{i}
xi!

! .

Hence, the likelihood ratio LR(t, m) is dened as LR(t, m) =

Qt+m−1 i=t

e^{−λ}^{1}λ^{xi}_{1}
xi!

Qt+m−1 i=t

e^{−λ}^{0}λ^{xi}_{0}
xi!

,

and its logarithm,LLR(t, m), LLR(t, m) =Pt+m−1

i=t [(−λ_{1}+xilog(λ1)−log(x_{i}!))−(λ_{0}+xilog(λ0)log(xi!))],

=m(λ0−λ1)+Pt+m−1 i=t xilog

λ1

λ0

. Denoting C = log

λ1

λ0

. The expression of the LLR(t, m) is now the following

LLR(t, m) =m(λ_{0}−λ_{1}) +C

t+m−1

X

i=t

x_{i}.

For 1≤t≤N −m+ 1, let
ν_{t}=

t+m−1

X

i=t

x_{i},

be the number of observed events in the window [t, t+m−1]. Hence, the LLR(t, m) can be written as follows

LLR(t, m) =m(λ_{0}−λ_{1}) +Cν_{t}.

For xed mand sinceC >0 andm(λ_{0}−λ_{1})<0,LLR(t, m) is a monot-
ically increasing function of ν_{t}. Accordingly, the GLRT rejectsH_{0} for a value
of νt as large as possible, i.e. the unidimensional discrete scan statistic with
scanning window of xed length m,S_{m} dened in Eq.(2).

3. TWO-DIMENSIONAL DISCRETE SCAN STATISTIC AND THE GLRT

Let N1, N2 be positive integers, R = [0, N1]×[0, N2] be a rectangular
region and {X_{ij}}, 1 6 i 6 N_{1}, 1 6 j 6 N_{2}, be a family of i.i.d. nonnega-
tive integer random variables from a specied distribution (Bernoulli, binomial,
Poisson, etc). In practice, the Xij's represent the number of events occurring
in the elementary square sub-region[i−1, i]×[j−1, j].

Let m1,m2 be positive integers such that 1 6m1 6 N1, 16 m2 6N2. For16t6N1−m1+ 1and 16s6N2−m2+ 1, let

ν_{ts} =ν_{ts}(m_{1}, m_{2}) =

t+m1−1

X

i=t

s+m2−1

X

j=s

X_{ij}
(7)

be the number of events observed in the scanning window located on the rect-
angular sub-region [t, t+m_{1}−1]×[s, s+m_{2}−1]within R(see Fig. 1).

Fig. 1. The scanning process with m1×m2 scanning window inR.

The two-dimensional discrete scan statistic is dened as the maximum
number of events that occured in anym_{1}×m_{2} rectangular window within the
rectangular region [0, N1]×[0, N2],

S_{m}_{1}_{,m}_{2} =S(m_{1}, m_{2}, N_{1}, N_{2}) = max

16t6N1−m1+ 1 16s6N2−m2+ 1

ν_{ts} .
(8)

The statistic S_{m}_{1}_{,m}_{2} is used to test the null hypothesis (H_{0}) assuming
that the X_{ij}'s are i.i.d. from a specied distribution, against an alternative
hypothesis (H_{1}) which supports the existence of some cluster of events.

For the binomial model, the null hypothesis assumes that the Xij's are
i.i.d.,X_{ij} ∼ B(n, p_{0}),n >0, withp_{0} the probability of success. The alternative
hypothesis presumes the existence of a rectangular subregion of size m1×m2,
[a, a+m1−1]×[b, b+m2−1],a∈[0, N1−m1+ 1],b∈[0, N2−m2+ 1], such
that the random variables X_{ij}'s are independent and identically distributed as
a binomialB(n, p_{1}) withp1 > p0 if (i, j)∈[a;a+m1−1]×[b, b+m2−1]and
p0 =p1 otherwise.

For the Poisson model, assumes that the X_{ij}'s are i.i.d., X_{ij} ∼ P(λ_{0}).
The alternative hypothesis presumes the existence of a rectangular subregion
of size m1 ×m2, [a, a+m1 −1]×[b, b+m2−1], a ∈ [0, N1−m1+ 1], b ∈
[0, N_{2} −m_{2} + 1], such that the random variables X_{ij}'s are independent and
identically distributed as a Poisson P(λ_{1}) withλ_{1} > λ_{0} if (i, j)∈[a;a+m_{1}−
1]×[b, b+m2−1]and λ1=λ0 otherwise.

In what follows, we assume that the values ofm_{1} andm_{2} are known. We
show that, for both binomial and Poisson models, the GLRT rejectsH_{0}in favor
of H_{1} when Sm1,m2 is greater than its quantile of order 1−α when α is the
type 1 error.

The Binomial model. Let [0, N_{1}]×[0, N_{2}] be a rectangular region,
N1, N2∈N. Let{X_{ij}}be a family of independent random variables binomially
distributed asB(n, p_{k}) wherekis such that

k=

(0 if {i, j} ∈[0, N1]×[0, N2]\[t, t+m1−1]×[s, s+m2−1]

1 if {i, j} ∈[t, t+m_{1}−1]×[s, s+m_{2}−1].

It is assumed that the parameters p_{k} are known. One wants to verify the null
hypothesisH_{0} that theX_{ij}'s are i.i.d. as B(n, p_{0}):

H_{0} :p0 =p1

(9)

against the alternative hypothesisH_{1} which supports the existence of a cluster
of events of sizem_{1}×m_{2} where theX_{ij}'s are i.i.d. as B(n, p_{1}):

H_{1} :p_{1} > p_{0}
(10)

Proposition 3. The GLRT rejects H_{0} in favor of H_{1} when the two-
dimensional discrete scan statistic with scanning window of xed size m1×m2

exceeds a threshold determined from P(S(m1, m_{2}, N_{1}, N_{2}) > τ|H_{0}) =α, where
α is the type 1 error.

Proof. The following proof has a similar reasoning to the one-dimensional

case. LetLH0 be the likelihood function under H_{0}. It is expressed as follows:

LH0(x11, . . . , xN1N2) =LH0 =

N1

Y

i=1 N2

Y

j=1

n xij

p^{x}_{0}^{ij}(1−p0)^{n−x}^{ij},

wherex_{ij} ∈ {0, . . . , n} and∀i∈ {1, . . . , N_{1}} and∀j∈ {1, . . . , N_{2}}.
Under H_{1}, the likelihood function is given by

LH1(t, s) =

t−1

Y

i=1 s−1

Y

j=1

n
x_{ij}

p^{x}_{0}^{ij}(1−p0)^{n−x}^{ij}

t+m1−1

Y

i=t

s+m2−1

Y

j=s

n
x_{ij}

p^{x}_{1}^{ij}(1−p1)^{n−x}^{ij}

×

N1

Y

i=t+m1

N2

Y

j=s+m2

n
x_{ij}

p^{x}_{0}^{ij}(1−p0)^{n−x}^{ij}.

Hence, the likelihood ratio LR(t, m) is dened as
LR(t, s, m_{1}, m_{2}) =

Qt+m1−1 i=t

Qs+m2−1 j=s

n xij

p^{x}_{1}^{ij}(1−p1)^{n−x}^{ij}
Qt+m1−1

i=t

Qs+m2−1 j=s

n xij

p^{x}_{0}^{ij}(1−p0)^{n−x}^{ij},
and its logarithm,LLR(t, s, m_{1}, m_{2}),

LLR(t, s, m_{1}, m_{2}) = logQt+m1−1
i=t

Qs+m2−1 j=s

n xij

p^{x}_{1}^{ij}(1−p_{1})^{n−x}^{ij}

−logQt+m1−1 i=t

Qs+m2−1 j=s

n xij

p^{x}_{0}^{ij}(1−p0)^{n−x}^{ij},

= Pt+m1−1 i=t

Ps+m2−1 j=s

h xijlog

p1

p0

+ (n−xij) log 1−p1

1−p0

i
.
Denoting C_{1} = log

p1

p0

and C_{2}= log_{1−p}

1

1−p_{0}

. Then

LLR(t, s, m1, m2) =C1

t+m1−1

X

i=t

s+m2−1

X

j=s

xij +C2

t+m1−1

X

i=t

s+m2−1

X

j=s

(n−xij).

For 1≤t≤N1−m1+ 1and1≤s≤N2−m2+ 1, let νts =

t+m1−1

X

i=t

s+m2−1

X

j=s

xij,

be the number of observed events in the window of size[t, t+m_{1}−1]×[s, s+
m2−1]. Thus, the LLR(t, s, m1, m2) can be written as

LLR(t, s, m_{1}, m_{2}) =C_{1}ν_{ts}+C_{2}(m_{1}m_{2}n−ν_{ts}).

For xed m_{1} and m_{2} and since C_{1} > 0 and C_{2} < 0, LLR(t, s, m_{1}, m_{2})
is a monodically increasing function of ν_{ts}. Consequently, the GLRT rejects
H_{0} for a value of νts as large as possible, i.e. the two-dimensional discrete
scan statistic with scanning window of xed size m_{1}×m_{2}, S_{m}_{1}_{,m}_{2} dened in
Eq. (8).

The Poisson model. Let [0, N_{1}]× [0, N_{2}] be a rectangular region,
N_{1}, N_{2} ∈ N. Let {X_{ij}} be a family of independent random variables Poisson
distributed asP(λk)wherek is such that

k=

(0 if {i, j} ∈[0, N_{1}]×[0, N_{2}]\[t, t+m_{1}−1]×[s, s+m_{2}−1]

1 if {i, j} ∈[t, t+m1−1]×[s, s+m2−1].

It is assumed that the parameters p_{k} are known. One wants to verify the
null hypothesisH_{0} that theXij's are i.i.d. asP(λ0):

H_{0}:λ_{0} =λ_{1}
(11)

against the alternative hypothesisH_{1} which supports the existence of a cluster
of events of sizem1×m2 where theXij's are i.i.d. as P(λ_{1}):

H_{1}:λ1 > λ0

(12)

Proposition 4. The GLRT rejects H_{0} in favor of H_{1} when the two-
dimensional discrete scan statistic with scanning window of xed size m1×m2

exceeds a threshold determined from P(S(m1, m_{2}, N_{1}, N_{2}) > τ|H_{0}) =α, where
α is the type 1 error.

Proof. Let LH0 be the likelihood function under H_{0}:
LH0(x11, . . . , x_{N}_{1}_{N}_{2}) =LH0 =

N1

Y

i=1 N2

Y

j=1

e^{−λ}^{0}λ^{x}_{0}^{ij}
x_{ij}! ,

wherexij ∈ {0, . . . , n} and∀i∈ {1, . . . , N_{1}} and∀j∈ {1, . . . , N_{2}}.
The likelihood function under H_{1} is given by

LH1(t, s) =

t−1

Y

i=1 s−1

Y

j=1

e^{−λ}^{0}λ^{x}_{0}^{ij}
x_{ij}!

t+m1−1

Y

i=t

s+m2−1

Y

j=s

e^{−λ}^{1}λ^{x}_{1}^{ij}
x_{ij}!

N1

Y

i=t+m1

N2

Y

j=s+m2

e^{−λ}^{0}λ^{x}_{0}^{ij}
x_{ij}! .

Hence, the likelihood ratio LR(t, s, m1, m2)is dened as

LR(t, s, m1, m2) =

Qt+m1−1 i=t

Qs+m2−1 j=s

e^{−λ}^{1}λ^{xij}_{1}
xij!

Qt+m1−1 i=t

Qs+m2−1 j=s

e^{−λ}^{0}λ^{xij}_{0}
xij!

,

and its logarithm,LLR(t, s, m_{1}, m_{2}),
LLR(t, s, m1, m2) =Pt+m1−1

i=t

Ps+m2−1

j=s [(−λ1+xijlog(λ1))−(λ_{0}+xijlog(λ0))],

=m1m2(λ0−λ1) +CPt+m1−1 i=t

Ps+m2−1 j=s xij, whereC = log

λ1

λ0

.

For 1≤t≤N_{1}−m_{1}+ 1and1≤s≤N_{2}−m_{2}+ 1, let
νts =

t+m1−1

X

i=t

s+m2−1

X

j=s

xij,

be the number of observed events in the window of size[t, t+m1−1]×[s, s+ m2−1]. Hence, the LLR(t, s, m1, m2)can be written as

LLR(t, s, m_{1}, m_{2}) =m_{1}m_{2}(λ_{0}−λ_{1}) +Cν_{ts}.

For xedm1 andm2,LLR(t, s, m1, m2) is a monodically increasing func-
tion of ν_{ts}. Consequently, the GLRT rejects H_{0} for a value of ν_{ts} as large as
possible, ie the two-dimensional discrete scan statistic with scanning window
of xed sizem1×m2,Sm1,m2 dened in Eq. (8).

4. CONCLUSION

In this paper, we showed that the generalized likelihood ratio test rejects the null hypothesis of randomness in favor of an alternative hypothesis which supports the existence of a cluster when the discrete scan statistic exceeds its quantile of order 1−α. We proved this result for one-dimensional and two- dimensional discrete scan statistics for both binomial and Poisson models.

REFERENCES

[1] N. Balakrishnan and M.V. Koutras, Runs and scans with applications. Wiley, 2001.

[2] A. Barbour, O. Chryssaphinou and M. Roos, Compound poisson approximation in sys- tems reliability. Naval Research Logistics 43 (1996), 2, 251264.

[3] R. Darling and M.S. Waterman, Extreme value distribution for the largest cube in a random lattice. SIAM Journal on Applied Mathematics 46 (1996), 1, 118132.

[4] J. Fu and W. Lou, Distribution theory of runs and patterns and its applications. World Scientic Publishing Co., 2003.

[5] J. Glaz and N. Balakrishnan, Scan Statistics and Applications. Statistics for Industry and Technology. Birkhauser Boston, 1999.

[6] J. Glaz, V. Pozdnyakov and S. Wallenstein, Scan Statistics : Methods and Applications.

Statistics for Industry and Technology. Birkhauser Boston, 2009.

[7] J. Naus. Power comparison of two tests of non-random clustering . Technometrics 8 (1966), 3, 493517.

[8] J.F. Viel, P. Arveux, J. Baverel and J.Y. Cahn, Soft-tissue sarcoma and non-hodgkinÕs lymphoma clusters around a municipal solid waste incinerator with high dioxin emission levels. American Journal of Epidemiology 152 (2000), 1, 13-9.

Received 9 December 2013 Universite Lille 2

EA2694, CERIM, Faculte de Medecine, 1, Place de Verdun, F-59045 Lille

Cedex, France michael.genin@univ-lille2.fr