March 2011, Volume 39, Issue 11. http://www.jstatsoft.org/
Computing the Two-Sided Kolmogorov-Smirnov Distribution
Richard Simard Universit´e de Montr´eal
Pierre L’Ecuyer Universit´e de Montr´eal
Abstract
We propose an algorithm to compute the cumulative distribution function of the two- sided Kolmogorov-Smirnov test statistic D
nand its complementary distribution in a fast and reliable way. Different approximations are used in different regions of (n, x). Java and C programs are available.
Keywords: KS test, Kolmogorov-Smirnov distribution, goodness of fit.
1. Introduction
The Kolmogorov-Smirnov (KS) two-sided test statistic D
nis widely used to measure the goodness-of-fit between the empirical distribution of a set of n observations and a given continuous probability distribution. It is defined by
D
n= sup
x
G(x) − G ˆ
n(x) ,
where n is the number of (independent) observations, ˆ G
nis their empirical cumulative dis- tribution function (cdf), and G is a completely specified continuous theoretical cdf. Let F
ndenote the cdf of D
nunder the null hypothesis H
0that the n observations are independent and have cdf G, that is,
F
n(x) = P[D
n6 x | H
0] for x ∈ [0, 1].
This F
nis loosely called the KS distribution.
Computing F
n(x) in a fast and reliable way, for arbitrary values of n and x, is not easy.
With the help of a symbolic algebra package, Drew, Glen, and Leemis (2000) computed the
exact KS distribution for 1 6 n 6 30 as a set of piecewise polynomials of degree n resulting
from a large number of integrals. They also gave a short literature review. Brown and Harvey (2007) consider seven different methods proposed in the past to compute the exact KS distribution. They programmed these seven methods in Mathematica, using only rational numbers to obtain exact results. Although their Mathematica programs are very useful for verification purpose, the calculations are extremely slow for large n. For example, computing F
n(x) for n = 10000 and x = 2/ √
n, using the Durbin (1968) recursion formula, takes three hours on our (standard) desktop computer. The Durbin recursion formula gives the fastest exact method among all those programmed by Brown and Harvey (2007). Unfortunately, it contains expressions that suffer from subtractive cancellation between very large terms, and is thus unsuitable for floating-point computations with limited precision. Among the formulæ considered by Brown and Harvey (2007), only those of No´e, Pomeranz, and the Durbin matrix formula are not subject to catastrophic cancellation because they contain only non-negative terms.
The method of Pomeranz (1974) and the Durbin (1968) recursion method were programmed in single precision Fortran 77 by Pomeranz (1974). The Durbin recursion method is numerically unstable and gives sensible results only for very small n. The Pomeranz method makes use of floors and ceilings which must be computed with great care; as an illustration, the KS distribution computed by the program of Pomeranz (1974) sometimes gives a non-smooth distribution with very large errors, as we will see in Section 2.2. Kallman (1977) programmed a generalized version of Pomeranz method in Fortran 77; this program works well in single precision for small n but starts to break down for moderate n and when F
n(x) is close to 1.
Miller (1999) programmed the complementary KS distribution ¯ F
n(x) = 1 − F
n(x) in Fortran 90, using different approximations; his program is very fast, but gives only 0 to 3 decimal digits of precision for n > 120 and sometimes returns NaN (not a number), for example for n = 100 and x = 0.54, 0.66, 0.77, or negative values, for example for n = 50 and 0.35 < x < 0.50.
Recently, Marsaglia, Tsang, and Wang (2003) wrote a C procedure implementing the matrix formula of Durbin (1973). Although it works very well for small n and gives 13 decimal digits of precision according to the authors, it is very slow for large n and sometimes returns NaN or infinite values; for example for n = 11000 and 0.000413 ≤ x ≤ 0.000414; for n = 21000 or 21001 and 0.000434 6 x 6 0.000526; for n = 42001 and 0.000206 ≤ x ≤ 0.000263; and for n = 62000 and 0.001 6 x 6 0.007. Furthermore, computing a p-value as p = P[D
n>
x] = 1 − F
n(x) gives rise to loss of precision when F
n(x) is close to 1, due to subtractive cancellation.
A close look at popular statistical software reveals that even some of the best products use poor approximations for the KS distribution (or the complementary distribution) in some regions (values of (n, x)). For example, in the popular statistical software environment R (2010), the method of Marsaglia et al. (2003) was recently implemented to compute the p-value p = 1 − F
n(x) in KS tests (with the option “exact”). However, because this method is very slow for large n, the default algorithm in R for n > 100 is an asymptotic approximation that gives only 0 to 2 decimal digits of precision. For example, for n = 120 and x = 0.0874483967333, R default method returns p = 0.3178 while the correct value is p = 0.30012. For n = 500 and x = 0.037527424, R default method returns p = 0.482 while the correct value is p = 0.47067.
R “exact” (but slow) method gives the correct values in both cases. However for p-values smaller than 10
−15, neither method gives any correct decimal.
MATLAB (2009) uses different methods to approximate the KS distribution depending on n
and x. For example, for n = 20 and x = 0.8008915818, MATLAB returns p = 2.2544 × 10
−12while the correct value is p = 2.5754 × 10
−14, and for n = 20 and x = 0.9004583223, it returns p = 1.5763 × 10
−15while the correct value is p = 1.8250 × 10
−20. These returned values have zero decimal digits of precision in both cases.
Available programs for the KS distribution in standard programming languages such as C, Fortran, and Java are either fast but unreliable, or precise but slow. Moreover, they either compute only the KS distribution or only its complementary. So there is a need for a fast and reliable program that computes both the KS distribution and its complementary distribution for arbitrary n and x.
In this paper, we describe such an algorithm and its implementation for the KS distribution with floating-point numbers of 53 bits of precision. In Section 2, we briefly describe the exact methods that we use to compute the KS distribution. We implemented the Pomeranz recursion formula and we compare its speed with the Durbin matrix formula. In Section 3, we introduce faster but less precise approximations that we use for large n. In Section 4, we partition the space (n, x) in different regions and we explain why each method is more appropriate in a given region to compute the KS distribution. In Section 5, we do the same for the complementary KS distribution. Finally, we measure the speed of our C program at several values of n and x.
2. Exact methods
2.1. Some values in the tails
There are a few special cases for which the exact KS distribution is known and has a very simple form. In the tails (for x close to 0 or 1), the exact values for arbitrary n are (Ruben and Gambino 1982):
F
n(x) =
0 for x 6
2n1n! 2x −
1nnfor
2n1< x 6
n11 − 2(1 − x)
nfor 1 −
1n6 x < 1
1 for 1 6 x.
(1)
This also provides an exact formula for the complementary cdf ¯ F
n(x)
def= 1 − F
n(x) over the same range of values of (n, x). We will use these simple formulas whenever they apply.
2.2. The Pomeranz recursion formula
The Pomeranz formula permits one to compute a (2n + 2) × (n + 1) matrix whose entries are
defined by a recurrence, and the last of those entries is F
n(x). The method is described in
Pomeranz (1974) and Brown and Harvey (2007), and explained below. It makes use of floors
and ceilings which must be computed with care, because it is very sensitive to numerical
imprecision, as we will see in a moment. Let t = nx and define the A
1, . . . , A
2n+2(which
depend on t) as follows:
A
1= 0,
A
2= min { t − b t c , d t e − t } , A
3= 1 − A
2,
A
i= A
i−2+ 1, for i = 4, 5, . . . , 2n + 1, and A
2n+2= n
We then define the entries V
i,jof a (2n + 2) × (n + 1) matrix V (which depend on t) by V
1,1= 1, V
1,j= 0 for j = 2, . . . , n + 1, and
V
i,j=
k2(i,j)
X
k=k1(i,j)
((A
i− A
i−1)/n)
j−k(j − k)! V
i−1,k(2)
for i = 2, . . . , 2n + 2 and j = b A
i− t c + 2, . . . , d A
i+ t e , where
k
1(i, j) = max (1, b A
i−1− t c + 2) and k
2(i, j) = min (j, d A
i−1+ t e ) . (3) Then we have
F
n(x) = n! V
2n+2,n+1. (4)
As observed by Pomeranz, the differences A
i− A
i−1can take at most four different values. This can be exploited by precomputing the factors ((A
i− A
i−1)/n)
j/j! in (2), and this increases the speed of computation significantly. It is also clear from (2) that we never need to store more than two rows of matrix V at a time, since row i can be computed from row i − 1 only.
Pomeranz’s algorithm uses the floor and ceiling of A
i± t to define the lower and upper bounds of summation in (2), which are defined in (3). For some values of t (and x), unavoidable round- off errors in floating-point calculations will give the wrong value of k
1(i, j) or k
2(i, j), and this may give rise to large errors in the cdf. As an illustration, Figure 1 shows (in blue) the KS distribution F
n(x) as computed by the program 487.f of Pomeranz (1974) for n = 20 and 0.17 6 x 6 0.19, compared with the correct distribution (in red). The zigzags in the blue line are due to floating-point errors in computing k
1(i, j) and k
2(i, j).
To avoid this type of problem, we precompute exactly all floors and ceilings of A
i± t in (3) for the given value of t = nx, although nx itself is not computed exactly due to floating-point error in the multiplication. We decompose t = ` + f , where ` is an non-negative integer and 0 6 f < 1. There are three cases to consider depending on where the fractional part f lies.
We must make sure to identify correctly in which case we are, and use the corresponding formulas for the precomputations.
Case (i): f = 0. In this case, we have
A
2i= i − 1 A
2i+1= i, for i = 1, 2, . . .
b A
2i− t c = i − 1 − ` d A
2i+ t e = i − 1 + `, for i = 1, 2, . . .
b A
2i+1− t c = i − ` d A
2i+1+ t e = i + `, for i = 0, 1, 2, . . .
●●
●●●●
●●●●
●●●●●●●●●●●●●●
●●●●
●
●
●
●
●●●●●
●
●
●
●
●●●●●
●
●
●
●
●●●●●
●
●
●
●
●●●●
●
●
●
●
●
●●●●
●
●
●
●
●●●●●
●
●
●
●
●●●●●
●
●
●
●
●●●●
●
●
0.170 0.175 0.180 0.185 0.190
0.450.500.550.60
x Fn
(
x)
● 487.f exact
Figure 1: The KS distribution for n = 20 computed by program 487.f (in blue) compared with the correct cdf (in red).
Case (ii): 0 < f 6 1/2. Here we have
A
2i= i − 1 + f A
2i+1= i − f, for i = 1, 2, . . . b A
1− t c = − ` − 1 d A
1+ t e = ` + 1
b A
2i− t c = i − 1 − ` d A
2i+ t e = i + `, for i = 1, 2, . . . b A
2i+1− t c = i − 1 − ` d A
2i+1+ t e = i + `, for i = 1, 2, . . .
Case (iii): 1/2 < f < 1. Here we have
A
2i= i − f A
2i+1= i − 1 + f, for i = 1, 2, . . .
b A
2i− t c = i − 2 − ` d A
2i+ t e = i + `, for i = 1, 2, . . .
b A
2i+1− t c = i − 1 − ` d A
2i+1+ t e = i + 1 + `, for i = 0, 1, 2, . . .
With floating-point numbers that follow the IEEE-754 64-bit standard for double precision,
the algorithm cannot be used in this naive form as soon as n > 140, because for some
regions of x, the terms V
i,jwill underflow to 0 while the n! factor will overflow, even though
the corresponding probability is finite. For this reason, we must periodically renormalize
all elements of row i of V (in our implementation, we multiply all elements of the row by
2
350) when the smallest element of the row becomes too small (smaller than 10
−280in our
implementation) and we keep track of the logarithm of the renormalization factors to recover
the correct numbers.
2.3. The Durbin matrix formula
Durbin (1973) proposed a formula for F
n(x) based on the calculation of a k × k matrix H raised to the power n, where k = d t e = d nx e ; see his equations (2.4.3) and (2.4.4); see also Brown and Harvey (2007). Then F
n(x) is given by n! T /n
n, where T is the element k × k of the matrix H
n. Marsaglia et al. (2003) have recently implemented this algorithm in a C program which we shall call MTW (Marsaglia-Tsang-Wang). We reuse their program for this method, with small corrections.
2.4. Comparison between Pomeranz and Durbin algorithms for small n One of our objectives in this work was to provide a reliable implementation of Pomeranz algorithm and compare its speed and precision with the Durbin matrix algorithm implemented in MTW (its main competitor), for small n. We implemented the Pomeranz algorithm in C and Java, while MTW is implemented in C. For n not too large, these two implementations both return 13 to 15 decimal digits of precision (as measured by comparing with the exact values provided by the Mathematica programs of Brown and Harvey 2007). We also compared their speed for computing F
n(x) at values of (n, x) selected as follows. The mean of D
n(under H
0) is close to
µ
0 def= ln(2) p
π/(2n) ≈ 0.868731160636/ √ n
and we selected the pairs (n, x) around this mean. More specifically, we took x = aµ
0for a = 1/4, 1/3, 1/2, 1, 2, and 3, for n ranging from 10 to 1000. The values of F
n(x) for these pairs (n, x) are given in Table 1. For each pair (n, x) in the table, we measured the CPU time (in seconds) needed to compute F
n(x) 10
6times with the C programs. All the timings reported in this paper were made on a computer with an Advanced Micro Devices (AMD)
n \ x µ
0/4 µ
0/3 µ
0/2 µ
02µ
03µ
010 1.9216 × 10
−85.7293 × 10
−50.021523 0.63157 0.99769 0.9999999 50 2.2810 × 10
−91.9914 × 10
−50.014262 0.59535 0.99618 0.9999987 100 1.0020 × 10
−91.3267 × 10
−50.012461 0.58616 0.99587 0.9999982 200 4.9331 × 10
−109.5266 × 10
−60.011212 0.57949 0.99566 0.9999980 500 2.3705 × 10
−106.8500 × 10
−60.010131 0.57343 0.99549 0.9999978 1000 1.5699 × 10
−105.7174 × 10
−60.009597 0.57032 0.99541 0.9999977
Table 1: Values of F
n(x) for selected pairs (n, x).
n \ x µ
0/4 µ
0/3 µ
0/2 µ
02µ
03µ
010 0.17 0.17 2.5 3.5 5.2 6.5
50 9.6 11 15 32 74 124
100 22 26 42 100 266 498
200 57 79 128 325 1257 3266
500 221 324 592 2314 1.9 × 10
43.1 × 10
41000 708 1112 2070 1.8 × 10
45.6 × 10
47.9 × 10
4Table 2: CPU time (seconds) to compute F
n(x) 10
6times with the Pomeranz algorithm.
n \ x µ
0/4 µ
0/3 µ
0/2 µ
02µ
03µ
010 0.57 0.57 1.6 3.4 21 60
50 2.5 5.3 9.6 48 264 782
100 5.9 5.7 21 106 754 2468
200 14 27 62 331 2200 7012
500 37 85 221 1601 1.2 × 10
47.4 × 10
41000 96 242 615 4608 3.8 × 10
41.3 × 10
5Table 3: CPU time (seconds) to compute F
n(x) 10
6times with the Durbin matrix algorithm.
Athlon processor 4000+ at clock speed of 2400 MHz, running Red Hat Linux. The results are shown in Table 2 for the Pomeranz algorithm and in Table 3 for the Durbin matrix algorithm.
We see that the CPU times increase with n and with x (for fixed n). For small n, the Durbin matrix program is faster for x < µ
0(roughly), while the Pomeranz program is faster for x > µ
0. When n is large, especially for x > µ
0, these exact algorithms are too slow and have to be replaced by approximations.
3. Asymptotic approximations
Kolmogorov (1933) has proved the following expansion which provides an approximation of F
n(z/ √
n) for large n:
n
lim
→∞F
n(z/ √
n) = lim
n→∞
P[ √
nD
n6 z | H
0] = K
0(z) = 1 + 2 X
∞k=1
( − 1)
ke
−2k2z2. (5)
Pelz and Good (1976) generalized Kolmogorov’s approximation (5) to an asymptotic series in 1/ √
n of the form
n→∞
lim P[ √
nD
n6 z | H
0] = K
0(z) + K
1(z)
n
1/2+ K
2(z)
n + K
3(z) n
3/2+ O
1 n
2, (6)
where
K
0(z) =
√ 2π z
∞
X
k=1
e
−π2(2k−1)2/(8z2),
K
1(z) = 1 6z
4r π 2
X
∞k=−∞
π
2(k +
12)
2− z
2e
−π2(k+1/2)2/(2z2),
K
2(z) = 1 72z
7r π 2
∞
X
k=−∞
(6z
6+ 2z
4) + π
2(2z
4− 5z
2)(k +
12)
2+ π
4(1 − 2z
2)(k +
12)
4e
−π2(k+1/2)2/(2z2)− 1 36z
3r π 2
∞
X
k=−∞
π
2k
2e
−π2k2/(2z2), and
K
3(z) = 1 6480z
10r π 2
X
∞k=−∞
π
6(k +
12)
6(5 − 30z
2) + π
4(k +
12)
4( − 60z
2+ 212z
4) + π
2(k +
12)
2(135z
4− 96z
6) − (30z
6+ 90z
8) e
−π2(k+1/2)2/(2z2)+ 1
216z
6r π
2 X
∞k=−∞
( − π
4k
4+ 3π
2k
2z
2)e
−π2k2/(2z2).
These authors claim that their approximation is accurate to 5 decimal digits for n > 100 (the absolute error of their approximation is in fact smaller than 10
−5for n > 100, but in our work we focus on the relative error). Notice that K
0(z) in the above equation is equivalent to (5) through some of Jacobi’s identities on theta functions.
Another approximation useful for x close to 1 is based on the complementary cdf of the one-sided KS statistic, defined by
D
+n= sup
x
{ G ˆ
n(x) − G(x) } ,
where n, ˆ G
n, and G are defined as in Section 1. The upper tail probability of D
n+, defined as p
+n(x) = P[D
n+> x | H
0], is much faster and easier to compute than F
n(x). Smirnov’s stable formula gives
p
+n(x) = x
bn(1−x)c
X
j=0