• Aucun résultat trouvé

March 2011, Volume 39, Issue 11. http://www.jstatsoft.org/

N/A
N/A
Protected

Academic year: 2022

Partager "March 2011, Volume 39, Issue 11. http://www.jstatsoft.org/"

Copied!
18
0
0

Texte intégral

(1)

March 2011, Volume 39, Issue 11. http://www.jstatsoft.org/

Computing the Two-Sided Kolmogorov-Smirnov Distribution

Richard Simard Universit´e de Montr´eal

Pierre L’Ecuyer Universit´e de Montr´eal

Abstract

We propose an algorithm to compute the cumulative distribution function of the two- sided Kolmogorov-Smirnov test statistic D

n

and its complementary distribution in a fast and reliable way. Different approximations are used in different regions of (n, x). Java and C programs are available.

Keywords: KS test, Kolmogorov-Smirnov distribution, goodness of fit.

1. Introduction

The Kolmogorov-Smirnov (KS) two-sided test statistic D

n

is widely used to measure the goodness-of-fit between the empirical distribution of a set of n observations and a given continuous probability distribution. It is defined by

D

n

= sup

x

G(x) − G ˆ

n

(x) ,

where n is the number of (independent) observations, ˆ G

n

is their empirical cumulative dis- tribution function (cdf), and G is a completely specified continuous theoretical cdf. Let F

n

denote the cdf of D

n

under the null hypothesis H

0

that the n observations are independent and have cdf G, that is,

F

n

(x) = P[D

n

6 x | H

0

] for x ∈ [0, 1].

This F

n

is loosely called the KS distribution.

Computing F

n

(x) in a fast and reliable way, for arbitrary values of n and x, is not easy.

With the help of a symbolic algebra package, Drew, Glen, and Leemis (2000) computed the

exact KS distribution for 1 6 n 6 30 as a set of piecewise polynomials of degree n resulting

(2)

from a large number of integrals. They also gave a short literature review. Brown and Harvey (2007) consider seven different methods proposed in the past to compute the exact KS distribution. They programmed these seven methods in Mathematica, using only rational numbers to obtain exact results. Although their Mathematica programs are very useful for verification purpose, the calculations are extremely slow for large n. For example, computing F

n

(x) for n = 10000 and x = 2/ √

n, using the Durbin (1968) recursion formula, takes three hours on our (standard) desktop computer. The Durbin recursion formula gives the fastest exact method among all those programmed by Brown and Harvey (2007). Unfortunately, it contains expressions that suffer from subtractive cancellation between very large terms, and is thus unsuitable for floating-point computations with limited precision. Among the formulæ considered by Brown and Harvey (2007), only those of No´e, Pomeranz, and the Durbin matrix formula are not subject to catastrophic cancellation because they contain only non-negative terms.

The method of Pomeranz (1974) and the Durbin (1968) recursion method were programmed in single precision Fortran 77 by Pomeranz (1974). The Durbin recursion method is numerically unstable and gives sensible results only for very small n. The Pomeranz method makes use of floors and ceilings which must be computed with great care; as an illustration, the KS distribution computed by the program of Pomeranz (1974) sometimes gives a non-smooth distribution with very large errors, as we will see in Section 2.2. Kallman (1977) programmed a generalized version of Pomeranz method in Fortran 77; this program works well in single precision for small n but starts to break down for moderate n and when F

n

(x) is close to 1.

Miller (1999) programmed the complementary KS distribution ¯ F

n

(x) = 1 − F

n

(x) in Fortran 90, using different approximations; his program is very fast, but gives only 0 to 3 decimal digits of precision for n > 120 and sometimes returns NaN (not a number), for example for n = 100 and x = 0.54, 0.66, 0.77, or negative values, for example for n = 50 and 0.35 < x < 0.50.

Recently, Marsaglia, Tsang, and Wang (2003) wrote a C procedure implementing the matrix formula of Durbin (1973). Although it works very well for small n and gives 13 decimal digits of precision according to the authors, it is very slow for large n and sometimes returns NaN or infinite values; for example for n = 11000 and 0.000413 ≤ x ≤ 0.000414; for n = 21000 or 21001 and 0.000434 6 x 6 0.000526; for n = 42001 and 0.000206 ≤ x ≤ 0.000263; and for n = 62000 and 0.001 6 x 6 0.007. Furthermore, computing a p-value as p = P[D

n

>

x] = 1 − F

n

(x) gives rise to loss of precision when F

n

(x) is close to 1, due to subtractive cancellation.

A close look at popular statistical software reveals that even some of the best products use poor approximations for the KS distribution (or the complementary distribution) in some regions (values of (n, x)). For example, in the popular statistical software environment R (2010), the method of Marsaglia et al. (2003) was recently implemented to compute the p-value p = 1 − F

n

(x) in KS tests (with the option “exact”). However, because this method is very slow for large n, the default algorithm in R for n > 100 is an asymptotic approximation that gives only 0 to 2 decimal digits of precision. For example, for n = 120 and x = 0.0874483967333, R default method returns p = 0.3178 while the correct value is p = 0.30012. For n = 500 and x = 0.037527424, R default method returns p = 0.482 while the correct value is p = 0.47067.

R “exact” (but slow) method gives the correct values in both cases. However for p-values smaller than 10

15

, neither method gives any correct decimal.

MATLAB (2009) uses different methods to approximate the KS distribution depending on n

and x. For example, for n = 20 and x = 0.8008915818, MATLAB returns p = 2.2544 × 10

−12

(3)

while the correct value is p = 2.5754 × 10

−14

, and for n = 20 and x = 0.9004583223, it returns p = 1.5763 × 10

15

while the correct value is p = 1.8250 × 10

20

. These returned values have zero decimal digits of precision in both cases.

Available programs for the KS distribution in standard programming languages such as C, Fortran, and Java are either fast but unreliable, or precise but slow. Moreover, they either compute only the KS distribution or only its complementary. So there is a need for a fast and reliable program that computes both the KS distribution and its complementary distribution for arbitrary n and x.

In this paper, we describe such an algorithm and its implementation for the KS distribution with floating-point numbers of 53 bits of precision. In Section 2, we briefly describe the exact methods that we use to compute the KS distribution. We implemented the Pomeranz recursion formula and we compare its speed with the Durbin matrix formula. In Section 3, we introduce faster but less precise approximations that we use for large n. In Section 4, we partition the space (n, x) in different regions and we explain why each method is more appropriate in a given region to compute the KS distribution. In Section 5, we do the same for the complementary KS distribution. Finally, we measure the speed of our C program at several values of n and x.

2. Exact methods

2.1. Some values in the tails

There are a few special cases for which the exact KS distribution is known and has a very simple form. In the tails (for x close to 0 or 1), the exact values for arbitrary n are (Ruben and Gambino 1982):

F

n

(x) =

 

 

 

 

0 for x 6

2n1

n! 2x −

1n

n

for

2n1

< x 6

n1

1 − 2(1 − x)

n

for 1 −

1n

6 x < 1

1 for 1 6 x.

(1)

This also provides an exact formula for the complementary cdf ¯ F

n

(x)

def

= 1 − F

n

(x) over the same range of values of (n, x). We will use these simple formulas whenever they apply.

2.2. The Pomeranz recursion formula

The Pomeranz formula permits one to compute a (2n + 2) × (n + 1) matrix whose entries are

defined by a recurrence, and the last of those entries is F

n

(x). The method is described in

Pomeranz (1974) and Brown and Harvey (2007), and explained below. It makes use of floors

and ceilings which must be computed with care, because it is very sensitive to numerical

imprecision, as we will see in a moment. Let t = nx and define the A

1

, . . . , A

2n+2

(which

(4)

depend on t) as follows:

A

1

= 0,

A

2

= min { t − b t c , d t e − t } , A

3

= 1 − A

2

,

A

i

= A

i−2

+ 1, for i = 4, 5, . . . , 2n + 1, and A

2n+2

= n

We then define the entries V

i,j

of a (2n + 2) × (n + 1) matrix V (which depend on t) by V

1,1

= 1, V

1,j

= 0 for j = 2, . . . , n + 1, and

V

i,j

=

k2(i,j)

X

k=k1(i,j)

((A

i

− A

i−1

)/n)

jk

(j − k)! V

i−1,k

(2)

for i = 2, . . . , 2n + 2 and j = b A

i

− t c + 2, . . . , d A

i

+ t e , where

k

1

(i, j) = max (1, b A

i−1

− t c + 2) and k

2

(i, j) = min (j, d A

i−1

+ t e ) . (3) Then we have

F

n

(x) = n! V

2n+2,n+1

. (4)

As observed by Pomeranz, the differences A

i

− A

i−1

can take at most four different values. This can be exploited by precomputing the factors ((A

i

− A

i−1

)/n)

j

/j! in (2), and this increases the speed of computation significantly. It is also clear from (2) that we never need to store more than two rows of matrix V at a time, since row i can be computed from row i − 1 only.

Pomeranz’s algorithm uses the floor and ceiling of A

i

± t to define the lower and upper bounds of summation in (2), which are defined in (3). For some values of t (and x), unavoidable round- off errors in floating-point calculations will give the wrong value of k

1

(i, j) or k

2

(i, j), and this may give rise to large errors in the cdf. As an illustration, Figure 1 shows (in blue) the KS distribution F

n

(x) as computed by the program 487.f of Pomeranz (1974) for n = 20 and 0.17 6 x 6 0.19, compared with the correct distribution (in red). The zigzags in the blue line are due to floating-point errors in computing k

1

(i, j) and k

2

(i, j).

To avoid this type of problem, we precompute exactly all floors and ceilings of A

i

± t in (3) for the given value of t = nx, although nx itself is not computed exactly due to floating-point error in the multiplication. We decompose t = ` + f , where ` is an non-negative integer and 0 6 f < 1. There are three cases to consider depending on where the fractional part f lies.

We must make sure to identify correctly in which case we are, and use the corresponding formulas for the precomputations.

Case (i): f = 0. In this case, we have

A

2i

= i − 1 A

2i+1

= i, for i = 1, 2, . . .

b A

2i

− t c = i − 1 − ` d A

2i

+ t e = i − 1 + `, for i = 1, 2, . . .

b A

2i+1

− t c = i − ` d A

2i+1

+ t e = i + `, for i = 0, 1, 2, . . .

(5)

●●

●●

●●●●

●●●●●●●●●●●●

●●●●

●●●●

●●●●

●●●●

●●●●

●●●●

●●●●

●●●●

●●●

0.170 0.175 0.180 0.185 0.190

0.450.500.550.60

x Fn

(

x

)

487.f exact

Figure 1: The KS distribution for n = 20 computed by program 487.f (in blue) compared with the correct cdf (in red).

Case (ii): 0 < f 6 1/2. Here we have

A

2i

= i − 1 + f A

2i+1

= i − f, for i = 1, 2, . . . b A

1

− t c = − ` − 1 d A

1

+ t e = ` + 1

b A

2i

− t c = i − 1 − ` d A

2i

+ t e = i + `, for i = 1, 2, . . . b A

2i+1

− t c = i − 1 − ` d A

2i+1

+ t e = i + `, for i = 1, 2, . . .

Case (iii): 1/2 < f < 1. Here we have

A

2i

= i − f A

2i+1

= i − 1 + f, for i = 1, 2, . . .

b A

2i

− t c = i − 2 − ` d A

2i

+ t e = i + `, for i = 1, 2, . . .

b A

2i+1

− t c = i − 1 − ` d A

2i+1

+ t e = i + 1 + `, for i = 0, 1, 2, . . .

With floating-point numbers that follow the IEEE-754 64-bit standard for double precision,

the algorithm cannot be used in this naive form as soon as n > 140, because for some

regions of x, the terms V

i,j

will underflow to 0 while the n! factor will overflow, even though

the corresponding probability is finite. For this reason, we must periodically renormalize

all elements of row i of V (in our implementation, we multiply all elements of the row by

2

350

) when the smallest element of the row becomes too small (smaller than 10

280

in our

implementation) and we keep track of the logarithm of the renormalization factors to recover

the correct numbers.

(6)

2.3. The Durbin matrix formula

Durbin (1973) proposed a formula for F

n

(x) based on the calculation of a k × k matrix H raised to the power n, where k = d t e = d nx e ; see his equations (2.4.3) and (2.4.4); see also Brown and Harvey (2007). Then F

n

(x) is given by n! T /n

n

, where T is the element k × k of the matrix H

n

. Marsaglia et al. (2003) have recently implemented this algorithm in a C program which we shall call MTW (Marsaglia-Tsang-Wang). We reuse their program for this method, with small corrections.

2.4. Comparison between Pomeranz and Durbin algorithms for small n One of our objectives in this work was to provide a reliable implementation of Pomeranz algorithm and compare its speed and precision with the Durbin matrix algorithm implemented in MTW (its main competitor), for small n. We implemented the Pomeranz algorithm in C and Java, while MTW is implemented in C. For n not too large, these two implementations both return 13 to 15 decimal digits of precision (as measured by comparing with the exact values provided by the Mathematica programs of Brown and Harvey 2007). We also compared their speed for computing F

n

(x) at values of (n, x) selected as follows. The mean of D

n

(under H

0

) is close to

µ

0 def

= ln(2) p

π/(2n) ≈ 0.868731160636/ √ n

and we selected the pairs (n, x) around this mean. More specifically, we took x = aµ

0

for a = 1/4, 1/3, 1/2, 1, 2, and 3, for n ranging from 10 to 1000. The values of F

n

(x) for these pairs (n, x) are given in Table 1. For each pair (n, x) in the table, we measured the CPU time (in seconds) needed to compute F

n

(x) 10

6

times with the C programs. All the timings reported in this paper were made on a computer with an Advanced Micro Devices (AMD)

n \ x µ

0

/4 µ

0

/3 µ

0

/2 µ

0

0

0

10 1.9216 × 10

8

5.7293 × 10

5

0.021523 0.63157 0.99769 0.9999999 50 2.2810 × 10

9

1.9914 × 10

5

0.014262 0.59535 0.99618 0.9999987 100 1.0020 × 10

−9

1.3267 × 10

−5

0.012461 0.58616 0.99587 0.9999982 200 4.9331 × 10

10

9.5266 × 10

6

0.011212 0.57949 0.99566 0.9999980 500 2.3705 × 10

10

6.8500 × 10

6

0.010131 0.57343 0.99549 0.9999978 1000 1.5699 × 10

−10

5.7174 × 10

−6

0.009597 0.57032 0.99541 0.9999977

Table 1: Values of F

n

(x) for selected pairs (n, x).

n \ x µ

0

/4 µ

0

/3 µ

0

/2 µ

0

0

0

10 0.17 0.17 2.5 3.5 5.2 6.5

50 9.6 11 15 32 74 124

100 22 26 42 100 266 498

200 57 79 128 325 1257 3266

500 221 324 592 2314 1.9 × 10

4

3.1 × 10

4

1000 708 1112 2070 1.8 × 10

4

5.6 × 10

4

7.9 × 10

4

Table 2: CPU time (seconds) to compute F

n

(x) 10

6

times with the Pomeranz algorithm.

(7)

n \ x µ

0

/4 µ

0

/3 µ

0

/2 µ

0

0

0

10 0.57 0.57 1.6 3.4 21 60

50 2.5 5.3 9.6 48 264 782

100 5.9 5.7 21 106 754 2468

200 14 27 62 331 2200 7012

500 37 85 221 1601 1.2 × 10

4

7.4 × 10

4

1000 96 242 615 4608 3.8 × 10

4

1.3 × 10

5

Table 3: CPU time (seconds) to compute F

n

(x) 10

6

times with the Durbin matrix algorithm.

Athlon processor 4000+ at clock speed of 2400 MHz, running Red Hat Linux. The results are shown in Table 2 for the Pomeranz algorithm and in Table 3 for the Durbin matrix algorithm.

We see that the CPU times increase with n and with x (for fixed n). For small n, the Durbin matrix program is faster for x < µ

0

(roughly), while the Pomeranz program is faster for x > µ

0

. When n is large, especially for x > µ

0

, these exact algorithms are too slow and have to be replaced by approximations.

3. Asymptotic approximations

Kolmogorov (1933) has proved the following expansion which provides an approximation of F

n

(z/ √

n) for large n:

n

lim

→∞

F

n

(z/ √

n) = lim

n→∞

P[ √

nD

n

6 z | H

0

] = K

0

(z) = 1 + 2 X

k=1

( − 1)

k

e

2k2z2

. (5)

Pelz and Good (1976) generalized Kolmogorov’s approximation (5) to an asymptotic series in 1/ √

n of the form

n→∞

lim P[ √

nD

n

6 z | H

0

] = K

0

(z) + K

1

(z)

n

1/2

+ K

2

(z)

n + K

3

(z) n

3/2

+ O

1 n

2

, (6)

where

K

0

(z) =

√ 2π z

X

k=1

e

π2(2k1)2/(8z2)

,

K

1

(z) = 1 6z

4

r π 2

X

k=−∞

π

2

(k +

12

)

2

− z

2

e

π2(k+1/2)2/(2z2)

,

K

2

(z) = 1 72z

7

r π 2

X

k=−∞

(6z

6

+ 2z

4

) + π

2

(2z

4

− 5z

2

)(k +

12

)

2

+ π

4

(1 − 2z

2

)(k +

12

)

4

e

−π2(k+1/2)2/(2z2)

− 1 36z

3

r π 2

X

k=−∞

π

2

k

2

e

π2k2/(2z2)

, and

(8)

K

3

(z) = 1 6480z

10

r π 2

X

k=−∞

π

6

(k +

12

)

6

(5 − 30z

2

) + π

4

(k +

12

)

4

( − 60z

2

+ 212z

4

) + π

2

(k +

12

)

2

(135z

4

− 96z

6

) − (30z

6

+ 90z

8

) e

π2(k+1/2)2/(2z2)

+ 1

216z

6

r π

2 X

k=−∞

( − π

4

k

4

+ 3π

2

k

2

z

2

)e

−π2k2/(2z2)

.

These authors claim that their approximation is accurate to 5 decimal digits for n > 100 (the absolute error of their approximation is in fact smaller than 10

5

for n > 100, but in our work we focus on the relative error). Notice that K

0

(z) in the above equation is equivalent to (5) through some of Jacobi’s identities on theta functions.

Another approximation useful for x close to 1 is based on the complementary cdf of the one-sided KS statistic, defined by

D

+n

= sup

x

{ G ˆ

n

(x) − G(x) } ,

where n, ˆ G

n

, and G are defined as in Section 1. The upper tail probability of D

n+

, defined as p

+n

(x) = P[D

n+

> x | H

0

], is much faster and easier to compute than F

n

(x). Smirnov’s stable formula gives

p

+n

(x) = x

bn(1−x)c

X

j=0

n j

j n + x

j−1

1 − x − j n

n−j

(7) Miller (1956) has shown that for p

+n

(x) not too large, a very good approximation for the distribution of D

n

in the upper tail is

F ¯

n

(x) = P[D

n

> x | H

0

] ≈ 2p

+n

(x). (8) But as p

+n

(x) increases (and x decreases), the reliability of this approximation deteriorates quickly as we shall see in the next section.

4. The selected method as a function of (n, x)

In our implementation, we have selected the method that seems to make the best compromise

between speed and precision as a function of (n, x), in each area of the N × [0, 1] space. We

partitioned this space as illustrated in Figure 2, where the minimal number of decimal digits

of precision of our algorithm is given in parenthesis in each area of the partition (we use the

relative error everywhere). For practical reasons, we use approximations when n is large or

when x is too close to 1, because the exact methods are then far too slow. The precision of

these approximations is very poor for small n or x, and improves as n or x increases. We start

by partitioning in three regions according to the value of n: (i) n 6 140, (ii) 140 < n 6 10

5

,

and (iii) n > 10

5

. Note that the horizontal line below the Pelz-Good region in Figure 2 is at

n = 140. Next, we explain what we do in each of those three regions; more details can be

found in the appendix at the end of the paper.

(9)

0 0.2 0.4 0.6 0.8 1 1

10 140 10

3

10

4

10

5

Pel z-G oo d

(5)

Du rbin (13)

Ruben-Gambino (13)

F

n

(x ) = 1 (15) 1 − F ¯

n

( x ) (14) Pomer

anz (13) x n

Figure 2: Choice of method to compute (or approximate) F

n

(x) and minimal number of decimal digits of precision for F

n

(x), as a function of (n, x), in each region of N × [0, 1].

constant, are curves of almost constant probability (see Table 1 for some typical values of x).

Thus regions of constant nx

2

seem relevant for the different approximations of F

n

(x).

If ¯ F

n

(x) < 2.2 × 10

16

, the double precision representation of F

n

(x) cannot be distinguished from 1, so we can just return F

n

(x) = 1. The second term in Kolmogorov’s formula (5) is 2e

2z2

and is less than 2.2 × 10

16

when z

2

> 18.37. More precise calculations show that in fact ¯ F

n

(x) < 5 × 10

16

whenever nx

2

> 18. Thus we shall just return F

n

(x) = 1 whenever nx

2

> 18, and this gives 15 digits of precision in this region, labeled “F

n

(x) = 1” in Figure 2.

Comparing the results of the approximation of ¯ F

n

(x) by (8) with the exact values from the Brown and Harvey (2008) Mathematica programs, we find that it returns at least 10 decimal digits of precision as long as ¯ F

n

(x) < 10

3

when n 6 140. To determine the region where F ¯

n

(x) < 10

3

(approximately) in terms of n and x, we first use Kolmogorov’s formula (5) which gives K

0

(2) = 0.99933 for nx

2

= 4. More precise calculations with exact methods show that for n 6 140, we have ¯ F

n

(x) < 0.0006 (and thus F

n

(x) > 0.9994) whenever nx

2

> 4.

When we use the approximation (8) in the region nx

2

> 4, the absolute error on ¯ F

n

(x) is always smaller than 10

14

and thus computing F

n

(x) = 1 − F ¯

n

(x) gives 14 decimal digits of precision for F

n

(x).

In the region that remains, we use one of the exact methods of Durbin and Pomeranz. Based on the speed comparison in § 2.4, we choose Durbin when nx

2

< 0.754693 and Pomeranz otherwise. We avoid the Pelz-Good approximation because it returns less than 5 correct digits for n ≤ 140.

(ii) 140 < n 6 10

5

. Whenever nx

2

> 18, we simply return F

n

(x) = 1 and this gives 15 digits of precision as explained above. Otherwise, we use the Pelz-Good asymptotic series (6) Figure 2: Choice of method to compute (or approximate) F

n

(x) and minimal number of decimal digits of precision for F

n

(x), as a function of (n, x), in each region of N × [0, 1].

(i) n 6 140. For n 6 140, we use the Ruben-Gambino formula (1) whenever it applies, which is for x 6 1/n and for x > 1 − 1/n, the Durbin matrix algorithm for 1/n < nx

2

< 0.754693, and the Pomeranz algorithm for 0.754693 6 nx

2

< 4. When 4 6 nx

2

< 18, we compute the approximation of the complementary cdf ¯ F

n

(x) given by (8) and we return F

n

(x) = 1 − F ¯

n

(x).

This is in the region labeled “1 − F ¯

n

(x)” in the figure. When nx

2

> 18, we just return F

n

(x) = 1. The program returns at least 13 to 15 decimal digits of precision everywhere for n 6 140. The reasons for this partition are as follows.

A look at Kolmogorov’s original approximation (5) or the Pelz-Good formula (6) shows that the tail probability of F

n

(x) is approximately a function of z = √

nx only. A closer exami- nation reveals that in general, except when n is very small, the curves nx

2

= c, where c is a constant, are curves of almost constant probability (see Table 1 for some typical values of x).

Thus regions of constant nx

2

seem relevant for the different approximations of F

n

(x).

If ¯ F

n

(x) < 2.2 × 10

16

, the double precision representation of F

n

(x) cannot be distinguished from 1, so we can just return F

n

(x) = 1. The second term in Kolmogorov’s formula (5) is 2e

−2z2

and is less than 2.2 × 10

−16

when z

2

> 18.37. More precise calculations show that in fact ¯ F

n

(x) < 5 × 10

16

whenever nx

2

> 18. Thus we shall just return F

n

(x) = 1 whenever nx

2

> 18, and this gives 15 digits of precision in this region, labeled “F

n

(x) = 1” in Figure 2.

Comparing the results of the approximation of ¯ F

n

(x) by (8) with the exact values from the Brown and Harvey (2007) Mathematica programs, we find that it returns at least 10 decimal digits of precision as long as ¯ F

n

(x) < 10

3

when n 6 140. To determine the region where F ¯

n

(x) < 10

−3

(approximately) in terms of n and x, we first use Kolmogorov’s formula (5) which gives K

0

(2) = 0.99933 for nx

2

= 4. More precise calculations with exact methods show that for n 6 140, we have ¯ F

n

(x) < 0.0006 (and thus F

n

(x) > 0.9994) whenever nx

2

> 4.

When we use the approximation (8) in the region nx

2

> 4, the absolute error on ¯ F

n

(x) is

(10)

always smaller than 10

−14

and thus computing F

n

(x) = 1 − F ¯

n

(x) gives 14 decimal digits of precision for F

n

(x).

In the region that remains, we use one of the exact methods of Durbin and Pomeranz. Based on the speed comparison in Section 2.4, we choose Durbin when nx

2

< 0.754693 and Pomeranz otherwise. We avoid the Pelz-Good approximation because it returns less than 5 correct digits for n ≤ 140.

(ii) 140 < n 6 10

5

. Whenever nx

2

> 18, we simply return F

n

(x) = 1 and this gives 15 digits of precision as explained above. Otherwise, we use the Pelz-Good asymptotic series (6) everywhere in the remaining region, except for x very close to 0 where it is not very good and where we use the Durbin matrix algorithm. As n increases, the region where the Pelz-Good approximation is good gets larger and closer to x = 0. We could use again a curve nx

2

= constant as a separation between these two regions, but we use instead the empirical curve nx

3/2

= 1.4 as separation, which permits the use of the Pelz-Good approximation in a slightly larger region near x = 0; this is more efficient since the Durbin matrix algorithm is very slow for larger n. Thus we use the Durbin matrix algorithm for nx

3/2

< 1.4 and the Pelz-Good asymptotic series for nx

3/2

> 1.4. Our program returns at least 5 decimal digits of precision everywhere.

(iii) n > 10

5

. We use F

n

(x) = 1 for nx

2

> 18 and the Pelz-Good asymptotic series ev- erywhere for nx

2

< 18. Let u = F

n

(x). Then for n = 100001, our program returns at least 5 decimal digits of precision for all u > 10

−16

, at least 2 decimal digits of precision for u > 10

56

, and at least 1 decimal digit of precision for all u > 10

108

. The Pelz-Good approximation becomes better as n increases and, for a given n, it also becomes better as x increases. For example, for n = 10

6

, our program returns at least 5 decimal digits of precision for all u > 10

33

, at least 2 decimal digits of precision for u > 10

120

, and at least 1 decimal digit of precision for all u > 10

230

. For such small values of u, a high precision for F

n

(x) is less important since 1 or 2 decimal digits of precision suffices to reject a statistical hypothesis with confidence.

5. The complementary cdf

The complementary cdf is defined by ¯ F

n

(x) = 1 − F

n

(x). Direct use of this formula in the upper tail, where F

n

(x) ≈ 1, would cause loss of precision. For this reason, we use specialized methods for x close to 1.

When x is close enough to 0 or 1, we use the complementary of the Ruben-Gambino exact formula (1). We split the remaining region in two: n 6 140 and n > 140.

(i) n 6 140. For x close to 1, we base our computation of ¯ F

n

(x) on the distribution of the one-sided Kolmogorov-Smirnov statistic, as given by the Miller approximation (8). In the region where nx

2

> 4, we always have ¯ F

n

(x) < 0.0006. Comparing the approximation (8) with the exact values obtained from Brown and Harvey’s Mathematica program, we see that the approximation returns at least 11 decimal digits of precision for ¯ F

n

(x) in this region.

When nx

2

< 4, we simply take ¯ F

n

(x) = 1 − F

n

(x) with F

n

(x) computed by the method

specified in Section 4. Since F

n

(x) < 0.99993 for all x in this region as long as n > 6,

(11)

0 0.2 0.4 0.6 0.8 1 1

10 140 10

3

10

4

10

5

1 − F

n

( x ) (10)

Ruben-Gambino (13) F ¯

n

(x ) = 0

Miller

(5) (6)

(10)

x n

Figure 3: Choice of method to compute (or approximate) ¯ F

n

(x) and minimal number of decimal digits of precision for ¯ F

n

(x), as a function of (n, x), in each region of N × [0, 1].

to the dominant term in (5), we obtain the condition z

2

> 355. More precise calculations for different values of n with the Miller approximation (8) show that we can safely set ¯ F

n

(x) = 0 whenever nx

2

> 370, because ¯ F

n

(x) is then always smaller than 10

308

; this corresponds to region “ ¯ F

n

(x) = 0” in Figure 3.

Since for n > 140, the Pelz-Good approximation (6) returns 5 decimal digits of precision, we shall use the Miller approximation (8) with the Smirnov formula (7) in the larger region nx

2

> 2.2. In this region, we have ¯ F

n

(x) < 0.025 and the Miller approximation returns at least 6 decimal digits of precision.

For nx

2

< 2.2, we simply use ¯ F

n

(x) = 1 − F

n

(x). Since for n > 140, we have at least 5 decimal digits of precision on F

n

(x) in the region where F

n

(x) > 10

12

, the program returns at least 5 decimal digits of precision everywhere for ¯ F

n

(x) in the region nx

2

< 2.2.

Overall, for 140 < n ≤ 200000, our program for ¯ F

n

(x) returns at least 5 decimal digits of precision everywhere. And for n > 200000, it returns a few correct decimal digits.

6. The C and Java programs

The C program in file KolmogorovSmirnovDist.c has two public functions:

ˆ KScdf(int n, double x), which computes the cdf F

n

(x) for given n and x, and

ˆ KSfbar(int n, double x), which computes the complementary cdf ¯ F

n

(x) for given n and x.

The Java program in file KolmogorovSmirnovDist.java has two public static methods:

Figure 3: Choice of method to compute (or approximate) ¯ F

n

(x) and minimal number of decimal digits of precision for ¯ F

n

(x), as a function of (n, x), in each region of N × [0, 1].

subtractive cancellation may cause the loss of at most 4 decimal digits of precision. For n < 6, x is either larger than 1 or in the region where we use the exact formula in (1), so there is no loss of precision there.

Overall, n 6 140, our program returns at least 10 decimal digits of precision everywhere for F ¯

n

(x) (see Figure 3).

(ii) n > 140. The minimal value of a floating-point number represented in the IEEE-754 64-bit standard for double precision is ≈ 10

−308

. If we impose the constraint 2e

−2z2

< 10

−308

to the dominant term in (5), we obtain the condition z

2

> 355. More precise calculations for different values of n with the Miller approximation (8) show that we can safely set ¯ F

n

(x) = 0 whenever nx

2

> 370, because ¯ F

n

(x) is then always smaller than 10

−308

; this corresponds to region “ ¯ F

n

(x) = 0” in Figure 3.

Since for n > 140, the Pelz-Good approximation (6) returns 5 decimal digits of precision, we shall use the Miller approximation (8) with the Smirnov formula (7) in the larger region nx

2

> 2.2. In this region, we have ¯ F

n

(x) < 0.025 and the Miller approximation returns at least 6 decimal digits of precision.

For nx

2

< 2.2, we simply use ¯ F

n

(x) = 1 − F

n

(x). Since for n > 140, we have at least 5 decimal digits of precision on F

n

(x) in the region where F

n

(x) > 10

12

, the program returns at least 5 decimal digits of precision everywhere for ¯ F

n

(x) in the region nx

2

< 2.2.

Overall, for 140 < n ≤ 200000, our program for ¯ F

n

(x) returns at least 5 decimal digits of

precision everywhere. And for n > 200000, it returns a few correct decimal digits.

(12)

6. The C and Java programs

The C program in file KolmogorovSmirnovDist.c has two public functions:

ˆ KScdf(int n, double x), which computes the cdf F

n

(x) for given n and x, and

ˆ KSfbar(int n, double x), which computes the complementary cdf ¯ F

n

(x) for given n and x.

The Java program in file KolmogorovSmirnovDist.java has two public static methods:

ˆ cdf(int n, double x), which computes the cdf F

n

(x) for given n and x, and

ˆ fbar(int n, double x), which computes the complementary cdf ¯ F

n

(x) for given n and x.

The programs are available along with this manuscript and at http://www.iro.umontreal.

ca/~simardr/ksdir/.

7. Speed comparison

Tables 4 and 5 report the CPU times needed to compute F

n

(x) 10

6

times, for selected values of n and x as in Table 1, with MTW and our C program. Note that for x = 3µ

0

, MTW

n \ x µ

0

/4 µ

0

/3 µ

0

/2 µ

0

0

0

10 0.57 0.57 1.6 3.4 21 60

100 5.9 5.7 21 106 754 0.1

140 6.7 12 41 206 1315 0.1

141 7.0 12.4 44 225 1444 0.1

1000 96 242 615 4.6 × 10

3

3.8 × 10

4

0.1

10000 2.8 × 10

3

6.2 × 10

3

2.1 × 10

4

1.9 × 10

5

3.5 × 10

6

0.1 100000 1.1 × 10

5

2.7 × 10

5

1.1 × 10

6

4.3 × 10

7

4.4 × 10

8

0.1

Table 4: CPU time (seconds) to compute F

n

(x) 10

6

times with the Marsaglia-Tsang-Wang C program.

n \ x µ

0

/4 µ

0

/3 µ

0

/2 µ

0

0

0

10 0.17 0.17 1.6 3.4 5.1 0.66

100 6.5 6.3 23 98 260 22

140 7.3 13 41 174 500 32

141 7.6 13 44 1.4 2.3 3.3

1000 100 247 0.95 1.4 2.3 3.3

10000 2875 0.95 0.95 1.4 2.3 3.3

100000 0.95 0.95 0.95 1.4 2.3 3.3

Table 5: CPU time (seconds) to compute F

n

(x) 10

6

times with the Simard-L’Ecuyer C pro-

gram.

(13)

uses the quick approximation given by the authors, which returns only 7 decimal digits of precision. We do not use their approximation since for n 6 140, the cdf is much less precise than our program, and using this formula to compute ¯ F

n

(x) = 1 − F

n

(x) will give a very bad approximation. For n > 140, our program is considerably faster than MTW. On the other hand, the MTW program often gives more precision than our asymptotic expansions when it returns sensible values for n > 140. However, the MTW program loses all precision when it is used to compute ¯ F

n

(x) and the latter is smaller than 10

−15

, whereas our program returns sensible values for ¯ F

n

(x) even when the latter is as small as 10

307

.

References

Brown JR, Harvey ME (2007). “Rational Arithmetic Mathematica Functions to Evaluate the One-Sided One-Sample K-S Cumulative Sampling Distribution.” Journal of Statistical Software, 19(6), 1–40. URL http://www.jstatsoft.org/v19/i06/.

Drew JH, Glen AG, Leemis LM (2000). “Computing the Cumulative Distribution Function of the Kolmogorov-Smirnov Statistic.” Computational Statistics & Data Analysis, 34, 1–15.

Durbin J (1968). “The Probability that the Sample Distribution Function Lies between Two Parallel Straight Lines.” The Annals of Mathematical Statistics, 39, 398–411.

Durbin J (1973). “Distribution Theory for Tests Based on the Sample Distribution Func- tion.” In SIAM CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, PA.

Kallman R (1977). “Three Algorithms for Computing Kolmogorov-Smirnov Probabilities with Arbitrary Boundaries and a Certification of Algorithm 487.” ACM Transactions on Mathematical Software, 3(3), 285–294. URL http://calgo.acm.org/519.gz.

Kolmogorov A (1933). “Sulla Determinazione Empirica di una Legge di Distribuzione.” Gior- nale dell’ Istituto Italiano degli Attuari, 4, 83–91.

Marsaglia G, Tsang WW, Wang J (2003). “Evaluating Kolmogorov’s Distribution.” Journal of Statistical Software, 8(18), 1–4. URL http://www.jstatsoft.org/v08/i18/.

Miller AJ (1999). “Module Kolmogorov-Smirnov.” URL http://www.cmis.csiro.au/Alan_

Miller/KS2.f90.

Miller LH (1956). “Table of Percentage Points of Kolmogorov Statistics.” Journal of the American Statistical Association, 51, 111–121.

Pelz W, Good IJ (1976). “Approximating the Lower Tail-Areas of the Kolmogorov-Smirnov One-sample Statistic.” Journal of the Royal Statistical Society B, 38(2), 152–156.

Pomeranz J (1974). “Exact Cumulative Distribution of the Kolmogorov-Smirnov Statistic for

Small Samples (Algorithm 487).” Communications of the ACM, 17(12), 703–704. URL

http://calgo.acm.org/487.gz.

(14)

R Development Core Team (2010). R: A Language and Environment for Statistical Computing.

R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http:

//www.R-project.org/.

Ruben H, Gambino J (1982). “The Exact Distribution of Kolmogorov’s Statistic D

n

for n ≤ 10.” Annals of the Institute of Statistical Mathematics, 34, 167–173.

The MathWorks, Inc (2009). MATLAB – The Language of Technical Computing, Version 7.9.0 (R2009b). The MathWorks, Inc., Natick, Massachusetts. URL http://www.mathworks.

com/products/matlab/.

(15)

A. Selection of regions

Here we give a few details on the calculations used to select the regions in Section 4 and Section 5. First, we computed the distribution function F

n

(x) at many values of x for n 6 1000 both with the Durbin matrix formula and the Pomeranz formula (in our C program); we also compared these results with the exact Brown-Harvey Mathematica program at a few random values of x and n: all three always agreed to at least 13 decimal digits of precision. For n > 1000, the Mathematica program that computes F

n

(x) becomes very slow, so we used mostly the Durbin matrix program to check how good the other approximations are for large n.

Region nx

2

> 18. Consider the values of x = p

18/n for several values of n. We compute the KS complementary distribution function ¯ F

n

(x) both with the Brown-Harvey Mathematica program and also with the Miller approximation (8). The results are given in Table 6 with the relative error between the two, defined as | (u − v)/u | , where u is the Mathematica result and v the Miller approximation. (For n > 10000, the Mathematica program is so slow that it takes many days to compute ¯ F

n

(x); this explains the blank entries in Table 6.) Since ¯ F

n

(x) is a decreasing function of x, we conclude that ¯ F

n

(x) is smaller than 5 × 10

16

for nx

2

> 18 and n 6 10

9

, and almost certainly for any n (from Table 6). Thus, setting F

n

(x) = 1 gives 15 decimal digits of precision for F

n

(x) in this region.

Region nx

2

> 4 and n 6 140. Consider the values of x = p

4/n for several values of n. We again compare the KS complementary distribution function computed with the Brown-Harvey Mathematica program and also with the Miller approximation (8). The results are given in Table 7 with both the relative error ( ≈ 10

11

) and the absolute error (< 10

14

). Since the Miller approximation becomes better as x increases for a given n, we conclude that it gives at least 11 decimal digits of precision for ¯ F

n

(x), and at least 14 decimal digits of precision for F

n

(x) = 1 − F ¯

n

(x) in this region.

Region nx

2

> 2.2 and 140 < n 6 10

5

. Consider the values of x = p

2.2/n for several values of n. We compare the Miller approximation (8) for the KS complementary distribution function ¯ F

n

(x) both with the Brown-Harvey Mathematica program and with the Durbin ma- trix C program. The results are shown in Tables 8 and 9 with the relative error of the Miller approximation. Since the Miller approximation becomes better as x increases for a given n, we conclude that it gives at least 6 decimal digits of precision for ¯ F

n

(x) in this region.

Region nx

3/2

> 1.4 and 140 < n 6 10

5

for F

n

(x). Here we compare the Pelz-Good

approximation on the separating curve nx

3/2

= 1.4 with the Mathematica program, and also

with the Durbin matrix C program for a few values of (n, x). The results are shown in Tables 10

and 11. After testing many values of F

n

(x) at different n > 140 and x, we concluded that

the Pelz-Good approximation becomes better as x increases for a given n, and that it gives

at least 5 decimal digits of precision for F

n

(x) in the region nx

3/2

> 1.4. The Pelz-Good

approximation becomes worse as x gets close to 0, and so for nx

3/2

< 1.4, we use the Durbin

matrix program.

(16)

n x Miller Mathematica Rel. err.

50 0.6 9.634070456142e-18 9.63407045614234e-18 3.5e-14 100 0.424264068711929 7.606532198486e-17 7.60653219848661e-17 8.0e-14 500 0.189736659610103 3.09340954272e-16 3.09340954272345e-16 2.4e-12 1000 0.134164078649987 3.6959926424e-16 3.69599264245350e-16 5.3e-12 5000 0.06 4.3371233237e-16 4.33712332378453e-16 2.5e-11 10000 0.042426406871193 4.448626200e-16

50000 0.018973665961010 4.56828378e-16 10

5

0.013416407864999 4.59148616e-16 10

6

0.004242640687119 4.6253138e-16 10

7

0.001341640786500 4.634834e-16 10

8

0.000424264068712 4.637718e-16 10

9

0.000134164078650 4.6386e-16

Table 6: Values of ¯ F

n

(x) for x = p 18/n.

n x Miller Mathematica Rel. err. Abs. err.

20 0.44721359549996 0.000362739697817368 0.000362739697817367 3.0e-15 1.1e-18 40 0.31622776601684 0.000469148796139823 0.000469148796139491 7.1e-13 3.3e-16 60 0.25819888974716 0.000513418298233135 0.000513418298231541 3.1e-12 1.6e-15 80 0.22360679774998 0.000538602147624629 0.000538602147621453 5.9e-12 3.2e-15 100 0.2 0.000555192732807431 0.000555192732802810 8.3e-12 4.6e-15 120 0.18257418583506 0.000567103285090544 0.000567103285084519 1.1e-11 6.0e-15 140 0.16903085094570 0.000576152104012519 0.000576152104005186 1.2e-11 7.3e-15

Table 7: Values of ¯ F

n

(x) for x = p 4/n.

n x Miller Mathematica Rel. err.

141 0.124911316058364 0.0223963592 0.0223963330223726 1.2e-6 300 0.0856348838577675 0.0230987058 0.0230986730185827 1.4e-6 500 0.066332495807108 0.0234361007 0.0234360648085745 1.5e-6 1000 0.0469041575982343 0.0237703789 0.0237703399363784 1.6e-6 5000 0.020976176963403 0.0242079719 0.0242079291326927 1.8e-6

Table 8: Values of ¯ F

n

(x) for x = p 2.2/n.

n x Miller Durbin matrix Rel. err.

500 0.066332495807108 0.0234361007 0.0234360648085807 1.5e-6 1000 0.046904157598234 0.0237703789 0.0237703399363874 1.6e-6 5000 0.020976176963403 0.0242079719 0.0242079291327157 1.8e-6 10000 0.0148323969741913 0.0243102062 0.0243101626961063 1.8e-6 50000 0.0066332495807108 0.0244457597 0.0244457151043362 1.8e-6 100000 0.0046904157598234 0.0244777310 0.0244776861027715 1.8e-6

Table 9: Values of ¯ F

n

(x) for x = p

2.2/n.

(17)

Region n > 10

5

. Here we compare the Pelz-Good approximation for n = 100001 with the Durbin matrix C program for a few values of x. The results are shown in Table 12. The Pelz-Good approximation is not so good for small x but becomes better as x increases. It also becomes better as n increases.

n x Pelz-Good Mathematica Rel. err.

140 0.0464158883361278 0.09025921823 0.0902623294750042 3.4e-5 500 0.0198657677675854 0.01302426466 0.0130242540021059 8.2e-7 1000 0.0125146494913519 0.00289496816 0.00289493725169814 1.1e-5 5000 0.0042799499222603 1.42356151e-5 1.42355083146456e-5 7.5e-6

Table 10: Values of F

n

(x) for x = (1.4/n)

2/3

.

n x Pelz-Good Durbin Rel. err.

140 0.0464158883361278 0.09025921823 0.0902623294750041 3.4e-5 500 0.0198657677675854 0.01302426466 0.0130242540021058 8.2e-7 1000 0.0125146494913519 0.00289496816 0.00289493725169812 1.1e-5 5000 0.00427994992226032 1.42356151e-5 1.42355083146454e-5 7.5e-6 10000 0.00269619949977585 4.83345438e-7 4.83345410767114e-7 5.6e-8 50000 0.00092208725841169 3.71471703e-12 3.71479094405454e-12 2.0e-5 100000 0.00058087857335637 2.21229903e-15 2.21236052547566e-15 2.8e-5

Table 11: Values of F

n

(x) for x = (1.4/n)

2/3

.

x x Pelz-Good Durbin matrix Rel. err.

1/14 √

n 0.000225875846349904 6.09086278e-103 1.07874093328718e-102 0.44 1/12 √

n 0.000263521820741555 1.572209174e-75 1.87885894249649e-75 0.16 1/10 √

n 0.000316226184889866 2.269812367e-52 2.35008915128113e-52 0.034 1/8 √

n 0.000395282731112333 1.962478061e-33 1.96902657319316e-33 0.0033 1/6 √

n 0.00052704364148311 1.018350563e-18 1.01845452774208e-18 0.0001 1/4 √

n 0.000790565462224666 2.9070737934e-8 2.90707424915525e-8 1.6e-7 1/2 √

n 0.00158113092444933 0.0363919975970 0.0363919976016742 1.3e-10 1/ √

n 0.00316226184889866 0.7305646847185 0.730564684714965 4.8e-12 2/ √

n 0.00632452369779733 0.9993319333086 0.999331933307205 1.4e-12

Table 12: Values of F

n

(x) for n = 100001.

(18)

Affiliation:

Richard Simard, Pierre L’Ecuyer

D´epartement d’Informatique et de Recherche Op´erationnelle Universit´e de Montr´eal

C.P. 6128, Succ. Centre-Ville, Montr´eal, H3C 3J7, Canada

E-mail: [email protected], [email protected] URL: http://www.iro.umontreal.ca/~lecuyer/

Journal of Statistical Software http://www.jstatsoft.org/

published by the American Statistical Association http://www.amstat.org/

Volume 39, Issue 11 Submitted: 2010-03-17

March 2011 Accepted: 2010-12-13

Références

Documents relatifs

No calculators are to be used on this exam. Note: This examination consists of ten questions on one page. Give the first three terms only. a) What is an orthogonal matrix?

[r]

[r]

The present paper communicates a number of new properties of arbitrary real functions, of which the most important is the theorem on the measurable boundaries

The problem of inversion of the attenuated X-ray transformation is solved by an explicit formula.. Several subsequent results are

We will need the following local law, eigenvalues rigidity, and edge universality results for covariance matrices with large support and satisfying condition (3.25).... Theorem

This paper is organized as follows: Section 2 describes time and space discretiza- tions using gauge formulation, Section 3 provides error analysis and estimates for the

However, the proof of Theorem 2.7 we present here has the virtue that it applies to other situations where no CAT(0) geometry is available: for instance, the same argument applies,