The significance probability of the Smirnov two-sample test
By J. L. HODGES, Jr.¹


ARKIV FÖR MATEMATIK Band 3 nr 43

Communicated 5 June 1957 by HARALD CRAMÉR and OTTO FROSTMAN


With 3 figures in the text

1. Introduction

In 1939 N. V. Smirnov proposed the following rank-order test for the two-sample problem. Let $x_1, \ldots, x_m$ and $y_1, \ldots, y_n$ be samples of independent observations from populations with continuous distribution functions $F$ and $G$, respectively. Form from the samples the empirical distribution functions $F_m$ and $G_n$; that is, $mF_m(u)$ is the number of the observations $x_1, \ldots, x_m$ which do not exceed $u$, with $nG_n(u)$ defined analogously. To test the hypothesis $F = G$ we use the statistic $D = \sup_u |F_m(u) - G_n(u)|$, large values of which are significant. We may without loss of generality assume $m \ge n$.

It is clear that the significance probability $\Pr\{D \ge d \mid F = G\}$, which we shall denote throughout by $P_2$, is independent of the common value of $F = G$; that is, the test (like all rank-order tests for the two-sample problem) is similar over the class of all continuous distributions. Further, the fact that $\sup_u |F_m(u) - F(u)|$ tends to 0 in probability as $m \to \infty$ implies that the test is consistent against all alternatives $F \ne G$. These properties of similarity and consistency, together with a certain mathematical elegance, give the test wide appeal to mathematical statisticians. A considerable literature has developed, the proposer of the test has been awarded a Stalin prize (Kolmogorov and Hinčin 1951), and the test has begun to appear in applied handbooks. The test is not very powerful against specific alternatives such as shift (van der Waerden 1953), but this could hardly be expected in view of its consistency.

Smirnov's test was suggested by analogy with the earlier test of Kolmogorov (1933) for the one-sample problem. In fact, Smirnov's test generalizes Kolmogorov's, for when $n \to \infty$ we may replace $G_n$ by $G$, and $D$ becomes Kolmogorov's statistic for the hypothesis that $F$ equals a completely specified $G$. Thus general results on the Smirnov test usually give (by the limit passage $m \to \infty$) results on the Kolmogorov test. We shall not however attempt to discuss the significance problem for Kolmogorov's test, nor shall we take up the many variants of Smirnov's test which have been suggested.

Smirnov's test also appears in a one-sided version. We may use $D^+ = \sup_u [F_m(u) - G_n(u)]$ to test the hypothesis that $F(u) \le G(u)$ for all $u$. This form of the test is in

¹ This paper was written while the author was a fellow of the John Simon Guggenheim Memorial Foundation, and a guest of Stockholms högskola. It is a pleasure to record appreciation for the courtesies extended to me by the högskola and its rector, Professor Harald Cramér.


fact often appropriate when we wish to know whether a new method of treatment (producing the population $G$) gives larger values than the standard treatment (population $F$) for some quantile. We shall denote by $P_1$ the quantity $\Pr\{D^+ \ge d \mid F = G\}$, which is the size of the one-sided test. One can also define a non-symmetrical two-sided test statistic, which includes $D$ and $D^+$ as special cases. While no essential difficulties appear in doing this, we shall not discuss the general statistic as its usefulness does not appear to justify the considerable notational complication required.

In spite of the fact that the Smirnov test has been in use for nearly twenty years, the computation of $P_1$ and $P_2$ has not been satisfactorily dealt with, nor does it even seem to be widely realized that a computational problem exists. We review in Section 2 the methods now available, and give in Section 3 the results of a brief numerical investigation showing the inadequacy of those methods. In Section 4 we present a technique which is useful when $m - n$ is small, and which serves to illuminate the complexity of the problem. Among the remarks in the concluding Section 5 is an interpolation formula which appears to give considerably better values than the currently standard technique.

2. Methods for obtaining values of $P_1$ and $P_2$

The user of Smirnov's test, faced with the problem of obtaining a value of $P_1$ or $P_2$ corresponding to the observed value of $D^+$ or $D$, has available a variety of methods, which we shall now review.

(a) Direct computation

As the distribution $F = G$ is continuous, we may assume that the $m + n$ observations are distinct. Since the samples are drawn from the same population, we may imagine that they were obtained by first drawing $m + n$ observations, and then selecting at random $m$ of these to form the first sample. Thus, under the null hypothesis $F = G$, each of the $\binom{m+n}{n}$ possible orderings of the samples with respect to each other has the same probability $1/\binom{m+n}{n}$. As with all rank-order tests, the problem of computing $P_1$ and $P_2$ reduces to a purely combinatorial one.

The solution of the combinatorial problem, and incidentally the calculation of the values of $D^+$ and $D$, is aided by a graphical device. We arrange the $m + n$ observations in increasing order on a common sequence, and associate with this arrangement a path in the $x, y$ plane. We begin at the origin, and (reading the observations from smallest to largest) take a unit step to the right for each $x$-observation, and a unit step up for each $y$-observation. The path terminates at the point $(m, n)$, and from it we can reproduce the ranks of the two samples in their common array. Figure 1 illustrates the process for samples in the order indicated by xyxyxxyyxx.

Such a graphical representation is of course a standard method in classical probability, for example in the problem of gambler's ruin. It is usually presented with steps to the right and left instead of to the right and up (Korolyuk and Yaroševs'kiĭ 1951, Gnedenko and Korolyuk 1951) but the present version, which is used by Drion (1952), is more convenient for use with ordinary graph paper.

Fig. 1. Path in the $x, y$ plane for the ordering xyxyxxyyxx, with the farthest points $Q^+$ and $Q$ marked.

Suppose that of the first $x + y$ observations in the common array, just $x$ come from the first sample. Then between the $(x + y)$th and $(x + y + 1)$st observations, we shall have $F_m - G_n = x/m - y/n = (nx - my)/mn$, which is proportional to the distance of $(x, y)$ from the diagonal of the rectangle with corners (0,0) and $(m, n)$.

Therefore to determine $D^+$ ($D$) we need only locate those points $Q^+$ ($Q$) on the path which are farthest below (farthest from) the diagonal. In Figure 1, $D^+ = 1/6$, corresponding to either of the points labelled $Q^+$, while $D = 1/3$, corresponding to the point labelled $Q$.
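The reading of $D^+$ and $D$ from a path can be sketched in a few lines of Python. This is only an illustration of the graphical device, not part of the original paper; the function name `smirnov_from_ordering` is hypothetical, and the ordering is assumed to be given as a string of the letters `x` and `y`.

```python
from fractions import Fraction

def smirnov_from_ordering(order):
    """Return (D_plus, D) for a pooled ordering such as 'xyxyxxyyxx'.

    Each 'x' is a unit step to the right, each 'y' a unit step up.
    After x + y steps, F_m - G_n equals x/m - y/n.
    """
    m, n = order.count("x"), order.count("y")
    x = y = 0
    d_plus = d_two = Fraction(0)
    for label in order:
        if label == "x":
            x += 1
        else:
            y += 1
        diff = Fraction(x, m) - Fraction(y, n)
        d_plus = max(d_plus, diff)      # farthest below the diagonal
        d_two = max(d_two, abs(diff))   # farthest from the diagonal
    return d_plus, d_two

print(smirnov_from_ordering("xyxyxxyyxx"))
```

For the ordering of Figure 1 ($m = 6$, $n = 4$) the maximum of $x/m - y/n$ is attained at the two points (1,0) and (4,2), and the maximum absolute difference at (4,4).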

To compute $P_2$ for an observed path, we may count those paths which never get as far from the diagonal as the farthest point on the observed path. Construct boundaries (Figure 2), parallel to the diagonal and at the same distance from it as $Q$.

The number $A(x,y)$ of ways to go from (0,0) to $(x,y)$ while staying strictly inside the boundaries satisfies the recursion formula

$$A(x,y) = A(x-1,y) + A(x,y-1) \tag{2.1}$$

with starting values $A(0,y) = A(x,0) = 1$ (for points inside the boundaries). The desired value $A(m,n)$ can be computed by simple additions, as illustrated on Figure 2, where $P_2 = 1 - A(m,n)/\binom{m+n}{n} = 97/105$. This technique, which will be referred to as the inside method, is effective for small sample sizes but becomes rapidly less so as $m$ and $n$ increase. For fixed values of $P_2$ the number of additions increases as $n^{3/2}$ and the size of the numbers increases exponentially.
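A minimal sketch of the inside method follows, assuming the boundary is specified by the integer $k = mn\,d$, so that a lattice point is strictly inside when $|nx - my| < k$. The function names and this parametrization are illustrative choices, not the author's notation.

```python
from fractions import Fraction
from math import comb

def inside_count(m, n, k):
    """A(m, n): paths from (0,0) to (m,n) with |n*x - m*y| < k throughout."""
    A = [[0] * (n + 1) for _ in range(m + 1)]
    for x in range(m + 1):
        for y in range(n + 1):
            if abs(n * x - m * y) >= k:
                continue                      # on or outside a boundary
            if x == 0 and y == 0:
                A[x][y] = 1
            else:                             # recursion (2.1)
                A[x][y] = (A[x - 1][y] if x else 0) + (A[x][y - 1] if y else 0)
    return A[m][n]

def p2_inside(m, n, k):
    """P2 = 1 - A(m,n) / C(m+n, n)."""
    return 1 - Fraction(inside_count(m, n, k), comb(m + n, n))

# Example of Figure 2: m = 6, n = 4, Q = (4,4), so k = |4*4 - 6*4| = 8.
print(inside_count(6, 4, 8), p2_inside(6, 4, 8))
```

With these inputs the count is 16 and the probability is 97/105, in agreement with the value quoted in the text.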

The work can be substantially reduced by two devices: (i) Until the boundary is reached, we are simply generating combinatorials by Pascal's triangle, so that an initial part of the work is unnecessary. (ii) By symmetry, we need carry on only until $x + y \ge \tfrac12(m + n)$, and use the fact that the number of paths through $(x,y)$ is the product of $A(x,y)$ and $A(m - x, n - y)$. Finally, when $m$ and $n$ have a large common factor, and $P_2$ is large, Pólya's (1948) method for exact sequential analysis may be useful.


This inside method can also be used for $P_1$, but since the number of additions now increases as $n^2$, the alternative outside method is usually preferable. We count the paths which reach the (lower) boundary, classifying them according to the point $(x,y)$ at which they first reach (or pass) it. If $B(x,y)$ is the number of ways to go from (0,0) to $(x,y)$ without previously reaching the boundary, then the total number of paths reaching the boundary is

$$\sum B(x,y)\binom{m+n-x-y}{n-y}. \tag{2.2}$$

We may compute $B$ successively for $y = 0, 1, \ldots$, by observing that $B(x,y)$ is $\binom{x+y}{y}$ diminished by the number of those ways of going from (0,0) to $(x,y)$ which previously reach the boundary. The latter number can easily be found with the aid of earlier $B$ values. This process, proposed by Korolyuk (1955a), is illustrated below for the problem of Figure 2.

Fig. 2. Boundaries parallel to the diagonal at the distance of $Q$, with the counts $A(x,y)$ of the inside method entered at the lattice points.

  y   x   C(x+y, y)   ways to later boundary points   B(x,y)   C(m+n-x-y, n-y)
  0   2       1                  3, 10                    1            70
  1   4       5                    2                      2            10
  2   5      21                                           7             3

Thus, $B(4,1) = 5 - 3$, where the 3 represents the number $\binom{3}{1}$ of ways to go from (2,0) to (4,1). The sum of the products $B(x,y)\binom{m+n-x-y}{n-y}$ is 111, the number of paths reaching the boundary, so that $P_1 = 111/210$.

We note that the outside method can also be used for $P_2$ when the boundaries are so far apart that a path cannot reach both, in which case $P_2 = 2P_1$. Even if a few paths can reach both boundaries, it may be best to allow for these separately and use the outside method.
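The outside method can be sketched as follows, again under the assumption that the lower boundary is described by an integer $k > 0$ through $nx - my \ge k$. On each row $y$ the first boundary point is the smallest $x$ satisfying this inequality; the names below are hypothetical.

```python
from fractions import Fraction
from math import comb

def p1_outside(m, n, k):
    """Pr{max (n*x - m*y) >= k}, classifying paths by first boundary point."""
    if k <= 0:
        return Fraction(1)
    first_points = []            # entries (x, y, B(x, y))
    reaching = 0
    for y in range(n + 1):
        x = -(-(k + m * y) // n)             # ceiling of (k + m*y) / n
        if x > m:
            break                            # boundary has left the rectangle
        b = comb(x + y, y)
        for x0, y0, b0 in first_points:      # remove paths that reached earlier
            b -= b0 * comb((x - x0) + (y - y0), y - y0)
        first_points.append((x, y, b))
        reaching += b * comb(m + n - x - y, n - y)   # one term of (2.2)
    return Fraction(reaching, comb(m + n, n))

# Problem of Figure 2: m = 6, n = 4, boundary through (2,0), (4,1), (5,2).
print(p1_outside(6, 4, 8))
```

For this boundary the successive values of $B$ are 1, 2, 7 and the total is 111 paths out of 210.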


(b) Tables

Massey has published two tables of $P_2$, both computed by the inside method. (i) In Massey (1951) we are given $1 - P_2$, to from 2 to 6 significant figures, for $m = n = 1(1)40$ and for $a = 2(1)13$ where $d = a/n$. At $m = n = 40$ the last decimal is not reliable. This table is a useful complement to formula (2.4) below.

(ii) Massey (1952) gives $1 - P_2$ to 5 decimals, for all $n < m \le 10$ and all $d$, and for selected values of $n < m$ and $d$ for $m > 10$. Unfortunately this table does not seem to be reliable. In checking about 125 values, using the method of Section 4, I found the following 16 instances in which I could not verify Massey's value of $P_2$. The first of the given values is taken from Massey (1952), while the second is mine.

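Table 1. Pr(D ≥ d): value from Massey (1952), followed by the recomputed value.

First half (the n and m entries are illegible in the source apart from a single 10):

   d    Massey    recomputed
  36    .00874    .00816
  21    .00533    .00466
  26    .34297    .31313
  27    .28205    .25221
  28    .22238    .19254
  35    .06371    .05594
  50    .01584    .01399
  53    .01100    .00946

Second half:

   n    m    d    Massey    recomputed
   7   10   56    .00607    .00452
   8   10   22    .09877    .09511
   8   10   23    .07683    .07043
   9   10   54    .03436    .03027
   9   10   63    .00719    .00704
  12   16   30    .00577    .00525
  12   16   31    .00376    .00331
  15   20   31    .01414    .01363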

It seems unlikely that Massey's second table will be extended in its present form to cover very large sample sizes, because the number of possible values of $D$ increases so rapidly. It can be shown that $D$ has $1 + [\tfrac12 rs] + (t - 1)rs$ possible values, where $m = rt$, $n = st$, and $r$ and $s$ are relatively prime. While 568 entries suffice for all $n < m \le 10$, 8707 would be required for $n < m \le 20$, and one can show (using known facts on the density of relative primes) that the number of entries required to cover $n < m \le M$ is asymptotically $M^4[2\zeta(3) - \zeta(4)]/16\zeta(2) = 0.0502\,M^4$. While $P_1$ and $P_2$ could be programmed for efficient electronic computation, the cost of publishing an adequate table would be excessive.
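The counting formula can be checked numerically. The sketch below assumes that a table needs one entry fewer than the number of possible values of $D$ for each pair (one value being trivial); that reading is an assumption made here, adopted because it reproduces the figure of 568 for $n < m \le 10$. The function names are hypothetical.

```python
from math import gcd, pi

def values_of_d(m, n):
    """Number of possible values of D: 1 + [rs/2] + (t - 1) r s."""
    t = gcd(m, n)
    r, s = m // t, n // t
    return 1 + (r * s) // 2 + (t - 1) * r * s

def table_entries(M):
    """Entries needed for all 1 <= n < m <= M (assumption: one trivial value per pair omitted)."""
    return sum(values_of_d(m, n) - 1
               for m in range(2, M + 1) for n in range(1, m))

# Constant of the asymptotic formula, using zeta(2) = pi^2/6, zeta(4) = pi^4/90
# and a numerical value for zeta(3).
ZETA3 = 1.2020569031595943
constant = (2 * ZETA3 - pi ** 4 / 90) / (16 * pi ** 2 / 6)

print(table_entries(10), round(constant, 4))
```

The same function can be evaluated at $M = 20$ to compare with the count quoted in the text.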

(c) Closed expressions

In a few special cases one can obtain expressions for $P_1$ and $P_2$ which are relatively easy to compute.

(i) When $m = n$ our rectangle becomes a square, the boundaries have slope 1, and we can use a reflectional method. Let the boundary be $y = x - a$, where $a = 1, 2, \ldots, m$. A path reaching the boundary will first do so at some point $Q$. We reflect that part of the path from (0,0) to $Q$ about the boundary, generating a path from $(a, -a)$ to $(n, n)$. As the number of these is $\binom{2n}{n-a}$, and as the correspondence is one-to-one, we have

$$P_1 = \binom{2n}{n-a} \Big/ \binom{2n}{n}. \tag{2.3}$$

To compute $P_2$, we must allow for paths which reach both boundaries, which can be done by repeated reflections. We find

$$P_2 = 2\left[\binom{2n}{n-a} - \binom{2n}{n-2a} + \binom{2n}{n-3a} - \cdots\right] \Big/ \binom{2n}{n}. \tag{2.4}$$

The argument is classical, and is given for example by Gnedenko and Korolyuk (1951) and Drion (1952).
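As a check on the reflection argument, the sketch below compares the classical closed forms for $m = n$ with a direct enumeration of all paths for small $n$. The closed forms are written out here as assumptions: $P_1 = \binom{2n}{n-a}/\binom{2n}{n}$ and $P_2 = 2\sum_{k \ge 1}(-1)^{k+1}\binom{2n}{n-ka}/\binom{2n}{n}$, with $d = a/n$. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def p1_equal(n, a):
    """Single reflection: Pr{max (x - y) >= a} when m = n, 1 <= a <= n."""
    return Fraction(comb(2 * n, n - a), comb(2 * n, n))

def p2_equal(n, a):
    """Repeated reflections: Pr{max |x - y| >= a} when m = n."""
    signed = sum((-1) ** (k + 1) * comb(2 * n, n - k * a)
                 for k in range(1, n // a + 1))
    return Fraction(2 * signed, comb(2 * n, n))

def brute_force(n, a):
    """Enumerate every path for small n and count boundary hits directly."""
    one_sided = two_sided = 0
    for x_steps in combinations(range(2 * n), n):
        x_steps = set(x_steps)
        level = top = far = 0
        for step in range(2 * n):
            level += 1 if step in x_steps else -1   # level is x - y
            top = max(top, level)
            far = max(far, abs(level))
        one_sided += top >= a
        two_sided += far >= a
    total = comb(2 * n, n)
    return Fraction(one_sided, total), Fraction(two_sided, total)

print(p1_equal(4, 2), p2_equal(4, 2), brute_force(4, 2))
```

The enumeration is exponential in $n$ and is meant only as a check for very small samples.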

(ii) It has recently been pointed out that similar results hold when $m = np$, where $p$ is a positive integer. Recall the quantity $B$ entering into the outside method. Korolyuk (1955a) noted that of those paths to the point $(x,y)$ on the boundary $py = x - a$ which have previously reached it, exactly $1/(p + 1)$ approach $(x,y)$ through $(x, y - 1)$. From this it follows that

$$B(x,y) = \binom{x+y}{y} - (p+1)\binom{x+y-1}{y-1}.$$

When this is substituted into (2.2) we see that the number of paths reaching the boundary is

$$N(a) = \sum_{y=0}^{[n - a/p]} \frac{a}{(p+1)y + a}\binom{(p+1)y + a}{y}\binom{(p+1)(n-y) - a}{n-y}.$$

We then of course have $P_1 = N(a)\big/\binom{(p+1)n}{n}$.
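The closed sum for $m = np$ can be compared with a direct path count. The sketch below assumes the reading $N(a) = \sum_y \frac{a}{(p+1)y+a}\binom{(p+1)y+a}{y}\binom{(p+1)(n-y)-a}{n-y}$, summed over the rows $y$ for which the boundary point $(py + a, y)$ lies in the rectangle; the function names are hypothetical.

```python
from math import comb

def n_reaching(n, p, a):
    """Closed-form count of paths reaching x - p*y = a, for m = n*p and a >= 1."""
    total = 0
    for y in range(n + 1):
        x = p * y + a
        if x > n * p:
            break
        first = a * comb(x + y, y) // (x + y)        # B(x, y), an exact integer
        total += first * comb((p + 1) * (n - y) - a, n - y)
    return total

def n_reaching_direct(n, p, a):
    """Same count by dynamic programming over paths that stay above the boundary."""
    m = n * p
    stay = [[0] * (n + 1) for _ in range(m + 1)]
    for x in range(m + 1):
        for y in range(n + 1):
            if x - p * y >= a:
                continue
            if x == 0 and y == 0:
                stay[x][y] = 1
            else:
                stay[x][y] = (stay[x - 1][y] if x else 0) + (stay[x][y - 1] if y else 0)
    return comb(m + n, n) - stay[m][n]

print(n_reaching(2, 2, 1), n_reaching_direct(2, 2, 1))
```

For $n = 2$, $p = 2$, $a = 1$ both functions give 12 of the 15 paths.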

Two attempts have been made to obtain a corresponding result for $P_2$. Korolyuk (1955a, p. 86) produced a formula, but it contains a partitional sum that would be difficult to compute. The formula given in Blackman (1956) is unfortunately incorrect, and the revised formula (Blackman 1957) is not suited for easy computation.

(iii) A closed formula (4.4) for P1 when m = n + 1 is developed in Section 4 below.

(d) Large-sample approximations

(i) In his original papers (1939a, b) Smirnov proved that, as $m$ and $n \to \infty$ so that $m/n \to q$, we have, for fixed $z > 0$,

$$\Pr\left\{\sqrt{\frac{mn}{m+n}}\,D^+ \ge z\right\} \to e^{-2z^2}, \tag{2.5}$$

$$\Pr\left\{\sqrt{\frac{mn}{m+n}}\,D \ge z\right\} \to 1 - K(z) = 2\left[e^{-2z^2} - e^{-2(2z)^2} + e^{-2(3z)^2} - \cdots\right]. \tag{2.6}$$

The function $K$, which also appears in the limit theory of the Kolmogorov test, has been tabled by Smirnov (1939b, 1948) to 6 decimals for its argument at intervals of 0.01 over the entire range. Alternative proofs or heuristic proofs of (2.5) and (2.6) have been given by Feller (1948), and Doob (1949). (See also Donsker 1952.)
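The two limiting expressions are easy to evaluate. The following sketch assumes the series form $1 - K(z) = 2\sum_{k \ge 1}(-1)^{k+1}e^{-2k^2z^2}$ and truncates it after a fixed number of terms, which is ample for the values of $z$ of practical interest; the function names are hypothetical.

```python
from math import exp, sqrt

def distance_variable(m, n, d):
    """z = sqrt(mn / (m + n)) * d."""
    return sqrt(m * n / (m + n)) * d

def p1_limit(z):
    """One-sided limit (2.5): exp(-2 z^2)."""
    return exp(-2 * z * z)

def p2_limit(z, terms=100):
    """Two-sided limit (2.6): 1 - K(z), as a truncated alternating series (z > 0)."""
    return 2 * sum((-1) ** (k + 1) * exp(-2 * (k * z) ** 2)
                   for k in range(1, terms + 1))

z = distance_variable(12, 12, 0.5)
print(z, p1_limit(z), p2_limit(z))
```

For moderately large $z$ the two-sided value is almost exactly twice the one-sided value, since the second term of the series is already negligible.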

Smirnov's theorems suggest the introduction of new random variables $Z = \sqrt{mn/(m+n)}\,D$ and $Z^+ = \sqrt{mn/(m+n)}\,D^+$. We shall hereafter always restrict $z$ to possible values of $Z$ and $Z^+$, and shall refer to $z$ as the distance variable of the boundary. It is clear that, if $z'$ and $z''$ are consecutive possible values of $Z$ or $Z^+$ and if $z' < z \le z''$, then $\Pr\{Z^{(+)} \ge z\} = \Pr\{Z^{(+)} \ge z''\}$, so we lose no generality by restricting $z$. To permit $z$ to assume arbitrary values, as is customary in the literature, leads to considerable and needless notational complication. Similarly, $d$ will always denote a possible value of $D$ or $D^+$, with $z = \sqrt{mn/(m+n)}\,d$, etc.

It is a truism that a limit theorem can be used to justify many different large-sample approximations. For example, we might on the basis of (2.5) approximate $\Pr(D^+ \ge d)$ by $\exp\{-2mnd^2/(m+n)\}$, but this quantity could with equal reason be used to approximate $\Pr(D^+ > d)$, which in some cases is substantially different.

It has been noted by Drion (1952) that good results are obtained, when $m = n$, if we approximate $P_2 = \Pr(D \ge d)$ by the quantity $1 - K(\sqrt{n/2}\,d)$. The corresponding observation for $P_1$ has been made by H. E. Daniels (J. Roy. Stat. Soc. B (1956), V. 18, p. 22). These observations suggest the use of the large-sample approximations

$$\hat P_1 = e^{-2z^2} \quad\text{and}\quad \hat P_2 = 1 - K(z),$$

for $P_1$ and $P_2$ respectively. This numerical finding can be reinforced by an asymptotic expansion of (2.3) and (2.4). Using Stirling's formula and expanding the logarithms involved, it is easy to show that when $n \to \infty$ and $a = O(\sqrt n)$,

$$P_1 = \exp\left\{-\frac{a^2}{n} + \frac{a^2}{2n^2} - \frac{a^4}{6n^3} + O\!\left(\frac{1}{n^2}\right)\right\}.$$

The substitution $a = z\sqrt{2n}$ gives

$$P_1 = \hat P_1 \exp\left\{\frac{z^2}{n} - \frac{2z^4}{3n} + O\!\left(\frac{1}{n^2}\right)\right\}, \tag{2.7}$$

which shows that the use of $\hat P_1$ as an approximation for $P_1$ will lead to a relative error of order $1/n$, when $m = n$. Note that in this case, the customary continuity correction would introduce an error of order $1/\sqrt n$; but see Section 5(a). A similar analysis shows that

$$P_2 = 2\left[e^{-2z^2} - e^{-2(2z)^2} + e^{-2(3z)^2} - \cdots\right] + O\!\left(\frac1n\right) = 2\left[\hat P_1 - \hat P_1^4 + \hat P_1^9 - \cdots\right] + O\!\left(\frac1n\right). \tag{2.8}$$

Essentially these expansions are to be found in Gnedenko (1952). (The version in Gnedenko (1954) appears to be in error.)

(ii) Korolyuk (1954, 1955b) has recently developed asymptotic expansions for $P_1$ and $P_2$ giving the terms of order $1/\sqrt n$ and $1/n$ for arbitrary $m$, $n$. His formulae would appear to provide means for dealing with the practical problem of determining $P_1$ and $P_2$. However, their correctness has been challenged by Blackman, in his review of Korolyuk's paper [Mathematical Reviews 16 (1955) 839], and again in Blackman (1956). He finds that Korolyuk's results are not consistent with earlier results of Smirnov (1944) on the one-sample problem, with some of Gnedenko's results for $m = n$, nor with Blackman's own results for $m = np$.


As a consequence of the theory developed in Section 4, it follows that

$$\sqrt n\left[e^{2z_{m,n}^2}\Pr\{Z^+ \ge z_{m,n}\} - 1\right]$$

does not tend to a limit as $m, n \to \infty$, $z_{m,n} \to z$, $m/n \to 1$. Thus, not only is the particular expression for $P_1$ given by Korolyuk wrong, but no general expression of this simple character can be right. Korolyuk uses analytic machinery on a function of two discrete variables in a formal way, without verifying its applicability. In particular he ignores the lattice structure of the boundary of his region, and it follows from the development of Section 4 that the structure of the boundary in some cases influences the term of order $1/\sqrt n$.

3. Numerical examination of the Smirnov approximations

The preceding section leads to this main conclusion: except for quite small $m$, we shall usually either have to carry out a rather heavy computation to obtain the significance probability, or else rely on the approximations $\hat P_1$ and $\hat P_2$ based on Smirnov's principal term limit theorem. Although these have been in use for nearly twenty years, I have not been able to find any numerical examination of their accuracy, except when $m = n$. The present section presents the results of a brief numerical study, for $n = 12$, $m = 13(1)18$, and $0.05 > P_2 > 0.002$.

We shall report our results in terms of the relative, not the absolute, accuracy of the approximations. This may be motivated from the point of view of the Neyman-Pearson theory, by observing that an error of given percentage in determining the significance level of a test will usually result in a percentage error of the same order of magnitude in the power function generally. Again, from the Bayesian viewpoint, a percentage error in small $P$ results in a posteriori odds in error by about the same percentage, without regard to the absolute error in $P$.

In the range studied, $\hat P_2/P_2$ and $\hat P_1/P_1$ will be nearly identical. From (2.6) we see that $\hat P_2 = 2\hat P_1[1 - \hat P_1^3 + \hat P_1^8 - \cdots]$ so that the two-tailed approximation is almost exactly double the one-tailed approximation. Correspondingly, the actual value $P_2$ is either exactly $2P_1$ or else very nearly so throughout our range. The required values of $P_1$ were computed by the method of Section 4 and by the outside method.

The results are shown as Figure 3, which for each $m$ gives $\log_e(\hat P_2/P_2)$ against $z$. An examination of this figure reveals several interesting features.

(i) The relative errors are perhaps surprisingly large. For comparison, for $m = n = 12$, the relative errors of the two values of $P_2$, in the same range of $P_2$, are 9% and 23%. When we add a single observation to one sample, the relative errors are increased to range from 38% to 180%. This example may serve as a useful warning against the common belief that increased sample sizes always favor an asymptotic approximation.

(ii) The relative error is by no means monotone in $z$; instead there are wild oscillations at $m = 13$, which gradually subside as $m$ is increased to 16, but which seem to reappear at $m = 17$ only to vanish at $m = 18$.

(iii) If we try to average out the oscillations by some sort of trend line, we see that the "average" relative error increases, perhaps linearly, with $z$.

Fig. 3. $\log_e(\hat P_2/P_2)$ against $z$ for $n = 12$, with one panel for each of $m = 13, 14, \ldots, 18$.

(iv) But, for fixed $z$, the average is not monotone in $m$. In general it falls as $m$ is increased, but at $m = 17$ it is higher than at $m = 16$.

These numerical results raise both practical and mathematical problems. The Smirnov approximation is seen to be highly inaccurate for values of $m$ and $n$ which are already large enough for direct computations to be arduous. Further, some mathematical explanation is desirable for the phenomena just observed. We shall in the two following sections give partial solutions for these problems.
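A comparison of the kind reported in this section can be sketched as follows. The code works with the one-sided probability (the text notes that the one-sided and two-sided ratios are nearly identical in the range studied), computes the exact $P_1$ by counting paths that never reach the boundary $nx - my \ge k$, and returns $\log_e(\hat P_1/P_1)$. The parametrization by the integer $k = mn\,d$ and the function names are assumptions of this sketch; no numerical values from Figure 3 are reproduced here.

```python
from fractions import Fraction
from math import comb, log

def exact_p1(m, n, k):
    """Exact Pr{max (n*x - m*y) >= k} under F = G, by a path-counting recursion."""
    if k <= 0:
        return Fraction(1)
    stay = [[0] * (n + 1) for _ in range(m + 1)]
    for x in range(m + 1):
        for y in range(n + 1):
            if n * x - m * y >= k:
                continue                       # boundary reached: not counted
            if x == 0 and y == 0:
                stay[x][y] = 1
            else:
                stay[x][y] = (stay[x - 1][y] if x else 0) + (stay[x][y - 1] if y else 0)
    return 1 - Fraction(stay[m][n], comb(m + n, n))

def log_ratio(m, n, k):
    """log_e(P1_hat / P1), where P1_hat = exp(-2 z^2) and z^2 = k^2 / (mn(m+n))."""
    z_squared = k * k / (m * n * (m + n))
    return -2 * z_squared - log(float(exact_p1(m, n, k)))

# All boundaries for n = 12, m = 13 whose exact P1 lies between 0.001 and 0.025.
for k in range(1, 12 * 13 + 1):
    p = exact_p1(13, 12, k)
    if Fraction(1, 1000) < p < Fraction(25, 1000):
        print(k, float(p), round(log_ratio(13, 12, k), 3))
```

Consecutive values of $k$ that give the same exact probability correspond to the same attainable value of $D^+$; the text restricts $z$ to attainable values.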

4. Nearly equal sample sizes

Considerations of efficiency or symmetry usually lead to specifying $m = n$ in the design of comparative experiments, but one or more observations is often lost. As a result, the problem of calculating $P_1$ and $P_2$ when $m$ and $n$ are nearly but not exactly equal acquires practical importance, especially since the Smirnov approximation is particularly bad in this case. We shall now develop a theory appropriate when $m - n = c$ is small. We begin by establishing a general lower bound for $P_1$.

For brevity we shall refer to the line segments $y = x - a$, for $a = c, c + 1, \ldots, m$ and $y \ge 0$, $x \le m$, as mirrors. Each path reaches at least one mirror; of these, the one with largest $a$ will be called the end mirror. A path touches its end mirror in at least one point; the lowest of these is the end point. The random coordinates of the end point will be denoted by $(X, Y)$ and we write $X - Y = A$. We shall use the random variable $A$ to index the mirrors.

The reflection method (Section 2c) gives at once

$$\Pr(A \ge a) = \binom{m+n}{n+a} \Big/ \binom{m+n}{n}. \tag{4.1}$$

Furthermore, it enables us to calculate the number of ways to go from (0,0) to $(a + y, y)$ without previously touching the mirror $A = a$, which will be denoted by $I(a,y)$. This number is the same as the number of ways to go from (0,0) to $(a + y - 1, y)$ without touching $A = a$. There are $\binom{a+2y-1}{y}$ ways to reach $(a + y - 1, y)$, but by reflection we see that $\binom{a+2y-1}{y-1}$ of these have touched $A = a$. Hence

$$I(a,y) = \binom{a+2y-1}{y} - \binom{a+2y-1}{y-1}. \tag{4.2}$$
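The function $I$ is simple to compute. The sketch below assumes the form $I(a,y) = \binom{a+2y-1}{y} - \binom{a+2y-1}{y-1}$ with the conventions $I(a,0) = 1$ and $I(a,y) = 0$ for $y < 0$; the latter convention is an assumption added here for convenience.

```python
from math import comb

def I(a, y):
    """Ways to go from (0,0) to (a+y, y) without previously touching the mirror A = a."""
    if y < 0:
        return 0          # convention assumed for out-of-range arguments
    if y == 0:
        return 1          # the single path of a steps to the right
    return comb(a + 2 * y - 1, y) - comb(a + 2 * y - 1, y - 1)

# A few values, and the identities stated with Table 2.
print([I(2, y) for y in range(2, 10)])
print(all(I(a, 1) == a for a in range(1, 20)))
print(all(I(1, y + 1) == I(2, y) for y in range(0, 20)))
```

The values $I(2, y)$ for $y = 2, 3, \ldots$ begin 5, 14, 42, 132, 429, which agrees with the first column of Table 2.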

Similarly, the number of ways to go from $(a + y, y)$ to $(m, n)$ without crossing (but possibly touching) the mirror $A = a$ is $I(n + a + 1 - m, m - a - y)$. Thus

$$\binom{m+n}{n}\Pr(A = a \text{ and } Y \le u) = \sum_{y=0}^{u} I(a,y)\,I(n + a + 1 - m,\ m - a - y). \tag{4.3}$$

Consider now a lower boundary line $L$, corresponding to a most distant point $(x^+, y^+)$. The line $L$: $m(y - y^+) = n(x - x^+)$ will intersect at least one mirror; let the highest mirror which $L$ intersects be $A = \alpha$. Then $\alpha$ is the smallest integer greater than or equal to $x^+ - my^+/n$. It is easy to show that the ordinate of the point of intersection of $L$ and $A = \alpha$, which we shall denote by $vn$, is equal to $n[1 - g(x^+ - my^+/n)]/c$, where $g(u)$ denotes the fractional part of $u$. The variable $v$ thus defined will be called the phase variable of $L$. The ordinate of the point of intersection of $L$ with $A = \alpha + i$ is then $n(v + i/c)$. Finally, for the distance variable $z$ we have the expression

$$z = \sqrt{\frac{mn}{m+n}}\left(\frac{x^+}{m} - \frac{y^+}{n}\right)$$

so that

$$\alpha = vc + z\sqrt{\frac{(n+c)(2n+c)}{n}}.$$

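Any path whose end point lies on or below $L$ must of course reach or pass $L$. Therefore

$$\binom{m+n}{n}P_1 \ge \binom{m+n}{m+\alpha} + \sum_{a=\alpha}^{\alpha+c-1}\ \sum_{y=0}^{[n(v + (a-\alpha)/c)]} I(a,y)\,I(n + a + 1 - m,\ m - a - y), \tag{4.4}$$

where terms with $y > m - a$, which lie outside the rectangle, are absent.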


We shall denote the right side of (4.4) by $\binom{m+n}{n}P_1^*$. $P_1^*$ is thus a lower bound for $P_1$, and equals $P_1$ if and only if $L$ cuts a single mirror, which will always be the case when $c = 1$. Thus (4.4) provides an exact closed expression for $P_1$ when $m = n + 1$.

It is intuitively plausible that $P_1^*$ should be close to $P_1$ when $c$ is small, and it is shown below that $P_1 - P_1^* = O(1/n)$ when $n \to \infty$ and $c$ is bounded.

The most notable feature of $P_1^*$ is that it depends in a simple way on a function, $I$, of only two variables, while $P_1$ itself depends on the four variables $m$, $n$, $x^+$, and $y^+$.

We give a short table of $I$, suitable for computing $P_1^*$ when $m \le 30$. From the 323 values of Table 2, and a table of combinatorials, one can easily obtain any of thousands of accurate estimates for $P_1$. (As explained in Section 3, $2P_1^*$ may then be used to estimate $P_2$ if the latter is not too large.)

As an example, we take $m = 20$, $n = 16$, and seek $P_1$ corresponding to a path with farthest point (17, 6). The boundary $L$ is $5(y - 6) = 4(x - 17)$, which cuts the mirrors $A = 10, 11$, so that $\alpha = 10$. The ordinates of the points of intersection of $L$ with $A = 10$ and $A = 11$ are respectively 2 and 6. As most of the mirror $A = 11$ lies below $L$, it is quicker to compute the complement of (4.3) with respect to $\binom{m+n}{n}\Pr(A = a)$. We have

$$\binom{36}{16}P_1^* = \binom{36}{27} - [I(11,7)I(8,2) + I(11,8)I(8,1) + I(11,9)I(8,0)] + [I(10,0)I(7,10) + I(10,1)I(7,9) + I(10,2)I(7,8)] = 9.1407 \times 10^7.$$

This gives $P_1^* = 0.01251$. Massey (1952) gives $P_2 = 0.02511$, and since the probability of reaching both boundaries is negligible, this implies $P_1 = 0.01256$. In the range $P_2 < 0.1$, $2P_1^*$ never differs from $P_2$ as given by Massey by as much as 1% of $P_2$.
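The worked example can be reproduced with a short program. The sketch below sums, over every mirror $a$ and ordinate $y$ for which the point $(a + y, y)$ lies on or below $L$, the number $I(a,y)\,I(n + a + 1 - m, m - a - y)$ of paths having that point as end point. This direct summation is a restatement of the lower bound chosen for illustration rather than the author's layout of the calculation, and the function names are hypothetical. An exact count by recursion is included for comparison.

```python
from math import comb

def I(a, y):
    """Formula (4.2), with I(a, 0) = 1 and I(a, y) = 0 for y < 0."""
    if y < 0:
        return 0
    if y == 0:
        return 1
    return comb(a + 2 * y - 1, y) - comb(a + 2 * y - 1, y - 1)

def lower_bound_count(m, n, x_far, y_far):
    """Number of paths whose end point lies on or below L (assumes m > n)."""
    c = m - n
    k = n * x_far - m * y_far                 # L is the line n*x - m*y = k
    total = 0
    for a in range(c, m + 1):                 # mirrors y = x - a
        for y in range(min(n, m - a) + 1):    # points (a + y, y) of the mirror
            if n * (a + y) - m * y >= k:      # on or below L
                total += I(a, y) * I(n + a + 1 - m, m - a - y)
    return total

def exact_count(m, n, x_far, y_far):
    """Number of paths that reach or pass L, by direct recursion."""
    k = n * x_far - m * y_far
    stay = [[0] * (n + 1) for _ in range(m + 1)]
    for x in range(m + 1):
        for y in range(n + 1):
            if n * x - m * y >= k:
                continue
            if x == 0 and y == 0:
                stay[x][y] = 1
            else:
                stay[x][y] = (stay[x - 1][y] if x else 0) + (stay[x][y - 1] if y else 0)
    return comb(m + n, n) - stay[m][n]

bound = lower_bound_count(20, 16, 17, 6)
exact = exact_count(20, 16, 17, 6)
total = comb(36, 16)
print(bound, bound / total, exact / total)
```

The lower-bound count agrees with the figure $9.1407 \times 10^7$ to the accuracy of a four-figure table, and the exact count shows how close $P_1^*$ is to $P_1$ in this case.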

The accuracy of $P_1^*$ will of course decrease as $c$ increases, but even at $n = 12$, $m = 20$, $P_1^*$ is considerably superior to $\hat P_1$.

We shall now develop an asymptotic expression for $P_1$, correct to terms of order $1/\sqrt n$, for what may be called "nearly equal" sample sizes. Consider a sequence $(m_k, n_k)$ of sample sizes, such that $n_k \to \infty$ and $m_k - n_k = c_k$ is bounded. With each $(m_k, n_k)$ is associated a boundary $L_k$ with distance variable $z_k$, phase variable $v_k$, and significance probability $P_{1k}$. We assume that $z_k$ is bounded and bounded away from 0, so that $\alpha_k$ will be of exact order $\sqrt n$. To simplify typography we shall omit the subscript $k$ hereafter.

A straightforward application of Stirling's formula to (4.1) shows that

$$\Pr(A \ge a) = \exp\left\{-\frac{a^2}{n} + \frac{ca}{n} + O\!\left(\frac1n\right)\right\} \tag{4.5}$$

when $a = O(\sqrt n)$. From this it follows that


Table 2. Table of $I(a, y)$. The superscript is the exponent of that power of 10 by which the entry is to be multiplied. $I(a,0) = 1$; $I(a,1) = a$; $I(1, y + 1) = I(2, y)$.
Columns: $a = 2, 3, \ldots, 17$. Rows: $y = 2, 3, \ldots, 28$.

5 14 42 132 429 1430 4862 16801 58791 2080" 7429 ~ 2674 a 9695 a 35364 12966 47766 1767" 6564 ~ 2447' 9148' 3431" 1290 ~ 4862 ~ 1837 lo 6953 lo 2637 I~ 1002 x~

9 14 28 48 90 165 297 572 1O01 2002 3432 7072 1193 ~ 25191 4199 x 90441 1492 ~ 3269" 5349 ~ 1189" 1932 ~ 4346 ~ 7020 ~ 15974 25664 58934 94294 2183 ~ 34806 81206 1290 ~ 3030 e 4797 a 11347 1790' 4255' 6702' 1601 ~ 25168 60388 9468 s 2282" 3572 ~ 8643 ~ 13511o 32801o 51171o 1247 ~ 194211 4747 ~ 7385 ix 20 27 75 110 275 429 1001 1638 3640 6188 13261 23261 48451 87211 1776" 32692 6538~ 12263 2414" 4602 a 89488 17304 33274 6513 ~ 1241 ~ 24566 46406 92805 1740e 3512 e 65416 1331' 2465' 5053' 9308 ~ 19218 3522 s 7315" 1335 ~ 27899 5071" 1065 l° 192910 4072 lo 7351 lo 155911 280511

35 154 637 2548 9996 38761 1492" 5720" 2187 a 8351 a 31874 12166 46406 1772" 6769 e 2588' 9904' 3793" 1454" 5579 g 21421° 8234 lo

44 208 910 3808 15501 62021 2452 z 9614 z 3749" 14574 56454 21836 84366 32576 1257' 48517 1872" 72258 27899 1077 lo 4162101

54 65 273 350 1260 1700 5508 7752 23261 33921 95931 14422 3894" 6009 ~ 15623 2467" 6216 e 10024 24584 40324 96774 16135 3796 a 6419 s 1486" 2545 e 5802 s 1006' 2263' 3965' 8815' 15608 3432" 61288 1335" 2405" 51949 94269 2020 lo

77 440 2244 10661 48281 2115" 9045" 37998 15744 64514 26236 10596 4255" 1702' 6784' 26968 1069' 42329

90 544 2907 14361 67301 3036" 1332" 5723" 24194 10106 4172 s 1710 ~ 6962" 2819' 1136" 4564" 1827 ~

104 663 3705 19021 92091 4276 I 19248 8454 s 3646' 15505 65095 27076 1117' 4581' 1868" 7582"

119 798 4655 24791 1240 ~ 5920" 2731" 12274 53994 23365 99756 4212" 1762' 7315' 3018"

135 950 5775 31881 16442 8073 m 3817 a 17534 78684 34666 15046 64466 2734' 1150 s

152 1120 7084 40481 2153 l 10863 52598 24684 11306 5067 s 2235 e 97216 41807

170 1309 8602 50831 27852 14428 71528 34304 1600 b 73065 32756 1446'

18 189 1518 10351 63181 ~" 35632 18933 9612 a 47074 2239 s 1040 e 4739 e


On substituting a = V~2--n z + vc + 0 ( 1 / ~ n ) i n t o (4.5), we find

P r ( A > ~ z + c ) = e x p - 2 z ~ - ( 2 v c + c ) ~ n + 0 •

Similarly

P r ( A = ~ + i ) = 2 V 2 Z e - ~ ' + O ( 1 ) f o r i = O , 1 , - - . . . , c .

Vn

(4.6)

(4.7) Since t h e mirrors cut b y L h a v e probability of order 1/Vn, it is enough to determine pr(y<__ u n l A =o~+i) to constant order. F r o m (4.2), it can be shown t h a t

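$$I(a, y) \approx \frac{2^{2y+a-1}\,a}{\sqrt{\pi}\,y^{3/2}} \exp\left\{-\frac{a^2}{4y}\right\} \tag{4.8}$$

when $y = O(n)$. When we substitute this into (4.3) and make the usual integral approximation of a sum, we find

$$\Pr(Y \le un \mid A = \alpha + i) = \Phi\left[\frac{z(2u - 1)}{\sqrt{u(1-u)}}\right] + o(1), \tag{4.9}$$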

where $\Phi$ is the normal distribution function. On substituting (4.6), (4.7), and (4.9) into (4.4), we obtain

$$P_1^* = \exp\left\{-2z^2 - \frac{\sqrt{2}\,z}{\sqrt{n}}\left[1 + 2\sum_{i=0}^{c-1}\Psi_z\left(v + \frac{i}{c}\right)\right] + O\left(\frac{1}{n}\right)\right\}, \tag{4.10}$$

where

$$\Psi_z(u) = u - \Phi\left[\frac{z(2u - 1)}{\sqrt{u(1-u)}}\right].$$

It remains to show that $P_1 - P_1^* = O(1/n)$ in order to establish (4.10) as an expression for $P_1$ with error of order $1/n$. The argument will be given for simplicity only in the case $m = n + 2$. A path counted in $P_1$ but not in $P_1^*$ must (i) reach but not pass the mirror $A = \alpha$ for $x + y < n$, and then (ii) reach but not pass the mirror $A = \alpha + 1$ for $x + y > n$. Let S be the point in which such a path crosses $x + y = n$.

For a given value of S, the behavior of the path after S is conditionally independent of its behavior before S. As a consequence of (4.6), each of the events (i) and (ii) has probability of order $1/\sqrt{n}$, uniformly in S. The probability that both occur is therefore of order $1/n$, as required.

The expression (4.10) has a number of interesting features. (i) Each of the three terms in the exponent has an interpretation. Thus $-2z^2$ corresponds to the Smirnov approximation $\hat P_1$. The second term $-\sqrt{2}\,z/\sqrt{n}$ represents the "average" error of the Smirnov approximation; we note that it is of order $1/\sqrt{n}$, and that it is proportional to z, as was suggested empirically in Section 3. Further, it does not depend on c. The third term represents the oscillatory component. This is also of order $1/\sqrt{n}$, which shows that it is not possible to express $P_1$ to accuracy $1/\sqrt{n}$ in a formula of the type proposed by Korolyuk.


(ii) The function $\Psi$ was derived by Malmquist (1954), using the heuristic argument of Doob (1949), as the limiting conditional distribution of the ordinate of the furthest point, say $Y^+$, given the value of $Z^+$. Our argument, which is restricted to the case when c is bounded, permits this limiting distribution to be derived rigorously when $Z^+$ is restricted to an appropriate interval, and also shows that the result is not correct for arbitrarily small intervals for $Z^+$. We thus have an example of a problem in which the heuristic argument fails unless properly restricted.

(iii) From symmetry we see that $\Psi_z(u) + \Psi_z(1 - u) = 0$. It follows that $\sum_{i=0}^{c-1}\Psi_z(v + i/c) = 0$ for $2cv$ an integer, and that

$$\int_0^{1/c} \sum_{i=0}^{c-1}\Psi_z\left(v + \frac{i}{c}\right) dv = 0,$$

so that the oscillatory part averages out to zero over a cycle of the phase variable v. Further, it can be shown from the Euler-Maclaurin formula that $\max_v \left|\sum_{i=0}^{c-1}\Psi_z(v + i/c)\right| \to 0$ as $c \to \infty$, which "explains" the phenomenon, observed in Section 3, of damped oscillations as m is increased. A few values of the maximum oscillation are given in Table 3.

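Table 3. $\max_v \left|\sum_{i=0}^{c-1}\Psi_z(v + i/c)\right|$.

| z   | c = 1 | c = 2 | c = 3 |
|-----|-------|-------|-------|
| 1   | 0.13  | 0.04  | 0.006 |
| 1.5 | 0.21  | 0.03  | 0.013 |
| 2   | 0.26  | 0.10  | 0.011 |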
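The entries of Table 3 can be checked numerically. The sketch below assumes that the oscillatory function has the form $\Psi_z(u) = u - \Phi[z(2u-1)/\sqrt{u(1-u)}]$ (a reading of the formula accompanying (4.10), not a quotation of it) and searches a grid of phases $v$ in $(0, 1/c)$. The function names and the grid size are illustrative choices only.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal distribution function Phi."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def psi(z: float, u: float) -> float:
    """Assumed form of Psi_z(u) = u - Phi[z (2u - 1) / sqrt(u (1 - u))], 0 < u < 1."""
    return u - normal_cdf(z * (2.0 * u - 1.0) / math.sqrt(u * (1.0 - u)))


def max_oscillation(z: float, c: int, grid: int = 20000) -> float:
    """Grid search for max over v in (0, 1/c) of |sum_{i=0}^{c-1} Psi_z(v + i/c)|."""
    best = 0.0
    for k in range(1, grid):
        v = k / (grid * c)  # interior points only, so every v + i/c lies in (0, 1)
        total = sum(psi(z, v + i / c) for i in range(c))
        best = max(best, abs(total))
    return best


if __name__ == "__main__":
    for z in (1.0, 1.5, 2.0):
        print(z, [round(max_oscillation(z, c), 3) for c in (1, 2, 3)])
```

A grid search is used rather than an optimizer because the sum is cheap to evaluate and may have several local extrema in $v$.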

(iv) The formula shows that the error of the Smirnov approximation may be quite large even for substantial values of n, particularly when $c = 1$. For example, when $m = n + 1$ and $z = 1.5$, corresponding to $\hat P_2 = 0.022$, the relative error of $\hat P_1$ at the least favorable phasing is about $3/\sqrt{n}$, so that the sample sizes must be about 900 to assure a 10 % relative error. The same goal is achieved at $n = 20$ when the sample sizes are equal.
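The figures quoted in (iv) follow from simple arithmetic. The sketch below assumes that the exponent of (4.10) differs from $-2z^2$ by $(\sqrt{2}\,z/\sqrt{n})(1 + 2\sum\Psi_z)$, so that the relative error at the least favorable phase is roughly $\sqrt{2}\,z\,(1 + 2\max)/\sqrt{n}$. The value 0.21 is the Table 3 entry for $z = 1.5$, $c = 1$; the function names are hypothetical.

```python
import math


def relative_error_coefficient(z: float, max_osc: float) -> float:
    """K such that the relative error is about K / sqrt(n) (assumed form of (4.10))."""
    return math.sqrt(2.0) * z * (1.0 + 2.0 * max_osc)


def sample_size_for(rel_err: float, z: float, max_osc: float) -> float:
    """Sample size n at which K / sqrt(n) equals the target relative error."""
    return (relative_error_coefficient(z, max_osc) / rel_err) ** 2


if __name__ == "__main__":
    print(relative_error_coefficient(1.5, 0.21))  # close to 3
    print(sample_size_for(0.10, 1.5, 0.21))       # close to 900
```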

We conclude with an extension of these results to the two-sided test. As remarked above, in practice the approximation $P_2 = 2P_1$ will usually serve; and since the results for $P_2$ are considerably more complicated than those for $P_1$, we shall give only an outline of methods. Corresponding to the end mirror A, we now have upper and lower end mirrors, say $A_u$ and $A_l$. Instead of (4.1) we now obtain, analogously to (2.4),

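$$\binom{m+n}{n}\Pr(A_u \ge a \text{ or } A_l \le c - a) = 2\binom{m+n}{n+a} - \binom{m+n}{n+2a} - \binom{m+n}{n+2a-c} + \cdots. \tag{4.11}$$

The first term on the right side of (4.4) is thus replaced by (writing $\alpha + c$ for $a$ in (4.11))

$$\left[2\binom{m+n}{n+\alpha+c} - \binom{m+n}{n+2\alpha+2c} - \binom{m+n}{n+2\alpha+c} + \cdots\right] \bigg/ \binom{m+n}{n}.$$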

If we carry out work analogous to that which gave (4.6), we find

$$\Pr(A_l \le -\alpha \text{ or } A_u \ge \alpha + c) = 2\Pr(A \ge \alpha + c)\left\{1 - [\Pr(A \ge \alpha + c)]^3 + \cdots\right\}, \tag{4.12}$$

a relation analogous to the formula $\hat P_2 = 2\hat P_1(1 - \hat P_1^3 + \cdots)$ of Section 2. If n or $P_2$ is very large, the second term in (4.12) may be worth using.

A similar extension is possible for the second term on the right side of (4.4). Instead of the function I, we have J defined by $J(a, y)$ = the number of ways to go from $(0, 0)$ to $(a + y, y)$ without previously touching either $A = a$ or $A = -a$. An application of the reflection method yields

$$J(a, y) = I(a, y) - I(3a, y - a) + I(5a, y - 2a) - I(7a, y - 3a) + \cdots,$$

which enables one to compute an analog to $P_1^*$. Corresponding to (4.8), $J(a, y)$ has an expansion consisting of the right side of (4.8) multiplied by $1 - 3e^{-2a^2/y} + \cdots$.
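The reflection series for $J(a, y)$ can be checked by brute force for small arguments. The sketch below assumes that $I(a, y)$ is the first-passage (ballot) count $\frac{a}{a+2y}\binom{a+2y}{y}$, that is, the number of paths from $(0,0)$ to $(a+y, y)$ that touch $A = x - y = a$ for the first time at the endpoint; the definition (4.2) is not reproduced here, so this form is an assumption. All function names are illustrative.

```python
from math import comb


def first_passage(a: int, y: int) -> int:
    """Assumed I(a, y): paths to (a + y, y) that first touch A = a at the endpoint."""
    if a <= 0 or y < 0:
        return 0
    n_steps = a + 2 * y
    return a * comb(n_steps, y) // n_steps  # ballot number, always an integer


def two_sided_series(a: int, y: int) -> int:
    """J(a, y) = I(a, y) - I(3a, y - a) + I(5a, y - 2a) - ... (reflection series)."""
    total, k = 0, 0
    while y - k * a >= 0:
        total += (-1) ** k * first_passage((2 * k + 1) * a, y - k * a)
        k += 1
    return total


def two_sided_brute(a: int, y: int) -> int:
    """Same count by dynamic programming on A = x - y, keeping -a < A < a until the last step."""
    counts = {0: 1}
    for _ in range(a + 2 * y - 1):
        nxt = {}
        for pos, ways in counts.items():
            for step in (1, -1):
                new = pos + step
                if -a < new < a:
                    nxt[new] = nxt.get(new, 0) + ways
        counts = nxt
    return counts.get(a - 1, 0)  # the final step goes from a - 1 to a


if __name__ == "__main__":
    for a in range(1, 5):
        print(a, [(two_sided_series(a, y), two_sided_brute(a, y)) for y in range(6)])
```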

If we ignore paths influencing $P_2$ only to order $1/n$, we can classify the paths reaching one or both boundaries into the following sets. (a) Paths reaching $A = \alpha + c$ or $A = -\alpha$ or both; ($b_i$) paths reaching $A = \alpha + i$ below its intersection with the lower boundary, but not reaching $A = \alpha + i + 1$ nor $A = -\alpha$; ($c_i$) a set of paths analogous to $b_i$ but with the boundaries interchanged. The set ($b_i$) will contribute to $P_2$ a quantity analogous to (4.9), but with (4.9) replaced by a series whose first term is (4.9). The remaining terms will be of order $1/\sqrt{n}$, but in practice they are usually negligibly small.

5. Conclusion

We conclude with three remarks or conjectures on the nature of the general problem of computing $P_1$ to order $1/\sqrt{n}$.

(a) It seems likely that the oscillatory behaviour demonstrated above for $m/n$ near 1 will also obtain for $m/n$ near any rational number $r/s$, where $r > s$ and r and s are relatively prime. We shall give an heuristic argument which also yields a quantitative formula for the oscillation. If the latter is correct, the oscillation will be of practical interest in only a few cases.

Suppose m is "near" $(r/s)n$. Analogously to the end mirrors of Section 4, we define the end line of a path as the line $r(y - y^+) = s(x - x^+)$, and again index these lines by their x-intercept, a. Now $sa$ takes on consecutive integral values, and consecutive z values differ by $1/\sqrt{n\,r(r+s)}$. This suggests that the limiting probability of an end line is $4ze^{-2z^2}/\sqrt{n\,r(r+s)}$. This is $\sqrt{2/(r(r+s))}$ times its value when $m = n$.

We define the end point of a path as the lowest (highest) point which the path reaches on its end line, when $m > (<)\; rn/s$. Malmquist's heuristic argument suggests that the limiting conditional distribution of the ordinate of the end point, given the end line, is again given by (4.9); in fact, any other form for this distribution would not be reconcilable with the result of Malmquist cited above.

Finally, we notice that the number of end lines cut by the boundary is, in the limit, (i) $cs$ when $m = rn/s \pm c$, or (ii) $cr$ when $n = sm/r + c$. Combining, we should find the asymptotic magnitude of the oscillation to be $\sqrt{2/(r(r+s))}$ times as great as it is when $m = n + cs$ in case (i), or $m = n + cr$ in case (ii). In particular, the oscillation at $(n = 12, m = 17)$ should be about $\sqrt{2/15} = 0.37$ times as great as that at $(n = 12, m = 14)$. The empirical results of Section 3 are in reasonable agreement with this prediction.
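The heuristic scaling factor is easy to tabulate. The sketch below evaluates $\sqrt{2/(r(r+s))}$ for a few ratios $r/s$; the function name is hypothetical, and the choice of ratios is purely illustrative.

```python
import math


def oscillation_ratio(r: int, s: int) -> float:
    """Heuristic factor sqrt(2 / (r (r + s))) relative to the case m/n near 1."""
    return math.sqrt(2.0 / (r * (r + s)))


if __name__ == "__main__":
    for r, s in [(1, 1), (3, 2), (4, 3)]:
        print(f"r/s = {r}/{s}: factor = {oscillation_ratio(r, s):.2f}")
```

For $r/s = 3/2$ (the case $n = 12$, $m = 17$) the factor rounds to the 0.37 quoted in the text, and for $r = s = 1$ it reduces to 1.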

A comparison of the conclusion of this heuristic argument with Table 3 suggests that the oscillatory aspect of $P_1$ is likely to be of practical interest only when m and n are nearly equal, and to a lesser extent when $m/n$ is near $3/2$. Thus, for $m/n$ near $4/3$, the oscillations should be at worst only $\sqrt{2/(4(4+3))} = 0.27$ as great as at $m = n + 3$; or for z near 1.5, they should contribute at worst $0.015/\sqrt{n}$ to the relative error. Even near $3/2$, the maximum oscillation is at worst about $0.05/\sqrt{n}$.

(b) We now ignore the oscillations, and examine the "average" value of $P_1$, say $\bar P_1$, for m and n nearly equal. From (4.10) we have

$$\bar P_1 = \exp\left\{-2z^2 - \frac{\sqrt{2}\,z}{\sqrt{n}} + O\left(\frac{1}{n}\right)\right\} \tag{5.1}$$

for any $c > 0$. On the other hand, for $c = 0$, we have from (2.7) $\bar P_1 = \exp\{-2z^2 + O(1/n)\}$.

We are confronted with a disconcerting discontinuity in $\bar P_1$ as a function of c. It is of interest to note that the two formulae are reconciled if a continuity correction is used.

In fact, for $c > 0$, the probabilities of individual z-values are of order $1/n^{3/2}$, so that (5.1) continues to hold when a continuity correction is employed. At $c = 0$, on the other hand, $\Pr(Z^+ = z) = (2\sqrt{2}\,z/\sqrt{n})\,e^{-2z^2}$. The use of a continuity correction implies increasing the estimate by half of this amount, so that (5.1) used with the correction will yield, to order $1/n$, the value $\hat P_1$. (Incidentally, the oscillations for $m = n + 1$ are made considerably more regular when the continuity correction is used.)
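The reconciliation at $c = 0$ can be written out in one line. Assuming that (5.1) has the form $\bar P_1 = \exp\{-2z^2 - \sqrt{2}\,z/\sqrt{n} + O(1/n)\}$, as the average term of (4.10) suggests, adding half of the point probability gives, to order $1/n$,

$$e^{-2z^2}\left(1 - \frac{\sqrt{2}\,z}{\sqrt{n}}\right) + \frac{1}{2}\cdot\frac{2\sqrt{2}\,z}{\sqrt{n}}\,e^{-2z^2} = e^{-2z^2} = \hat P_1 .$$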

(c) Since (5.1) holds for any fixed c, it is tempting to conjecture that this formula may be used for arbitrary values of m and n. This conjecture, however, is easily disproved by considering the case $m = np$, where p is an integer. By use of Stirling's formula it can be shown that the yth term of the sum given in Section 2 for $N(a)$ is

Tb~2 7~ba

a l/n exp (n - y) 2 (p + I) y (n - y) +

~/2 z t p ( p + i) ya ( n _ y) 2 p ( p + l ) y

( 2 y - n ) a . ( 2 p + l ) n ( n - 2 y ) a 3 ( 1 ) } + 2 p y ( n - y ) ~- 6 [ p ( p + 1 ) y ( n - y ) ] 2 + 0 w h e n y a n d n - y are b o t h of order n. If we m a k e t h e substitutions y = v n a n d a =

] / p - ~ + 1) z, we find

1

z exp - 1 - v (1 - v)

P1 - 1/2,~,,3 (1_ v) 2v (i-,,) 2 Vv(~-+l)~

0

(p+ 1 ) ( 2 v - 1 ) ( 2 p + 1) (1-2v)z2]}

v ( l - v ) ~ v~-(~-- ~ 2 ] d v .

The various integrals may be reduced by substituting $1 - 2v = u$ when $0 < v < \tfrac12$ and $2v - 1 = u$ when $\tfrac12 < v < 1$, adding the integrands, and transforming to the gamma integral. We find

$$P_1 = \exp\left\{-2z^2 - \frac{2(p-1)\,z}{3\sqrt{p(p+1)}\,\sqrt{n}} + O\left(\frac{1}{n}\right)\right\}, \tag{5.2}$$

which agrees with the appropriate special case of Corollary 2 in Blackman (1956) after making three changes, apparently misprints, in the latter.

When a continuity correction is employed, formula (5.2) becomes

$$P_1 \approx \exp\left\{-2z^2 - \frac{2z}{3}\cdot\frac{m + 2n}{\sqrt{mn(m+n)}}\right\}. \tag{5.3}$$

This simple formula is thus shown to be correct for $m = n$, for m an integral multiple of n, and "on the average" for m nearly equal to n. On the basis of a limited numerical investigation, it appears to work reasonably well in other cases, and can perhaps be recommended as a general interpolation formula.
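As an illustration, the interpolation formula can be evaluated directly. The sketch below assumes that it takes the form $\exp\{-2z^2 - \tfrac{2z}{3}(m+2n)/\sqrt{mn(m+n)}\}$ with $z = d\sqrt{mn/(m+n)}$ and $m \ge n$; both the explicit form and the function name `hodges_one_sided` are assumptions of this sketch rather than quotations from the text.

```python
import math


def hodges_one_sided(d: float, m: int, n: int) -> float:
    """Approximate one-sided significance probability Pr(D+ >= d), assumed formula."""
    if m < n:  # the paper takes m >= n without loss of generality
        m, n = n, m
    z = d * math.sqrt(m * n / (m + n))
    correction = 2.0 * z * (m + 2 * n) / (3.0 * math.sqrt(m * n * (m + n)))
    return math.exp(-2.0 * z * z - correction)


if __name__ == "__main__":
    d, m, n = 0.2, 50, 50
    z = d * math.sqrt(m * n / (m + n))
    print("Smirnov approximation:", math.exp(-2.0 * z * z))
    print("interpolation formula:", hodges_one_sided(d, m, n))
```

For $m = n$ the correction term reduces to $\sqrt{2}\,z/\sqrt{n}$, the "average" term of (4.10).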

University of California, Berkeley, Cal., U.S.A.

REFERENCES

BLACKMAN, JEROME (1956), "An extension of the Kolmogorov distribution". Ann. Math. Statistics 27, 513-520.

---- (1957), "Correction to 'An extension of the Kolmogorov distribution'", unpublished.

DONSKER, MONROE D. (1952), "Justification and extension of Doob's heuristic approach to the Kolmogorov-Smirnov theorems". Ann. Math. Statistics 23, 277-281.

DOOB, J. L. (1949), "Heuristic approach to the Kolmogorov-Smirnov theorems". Ann. Math. Statistics 20, 393-403.

DRION, E. F. (1952), "Some distribution-free tests for the difference between two empirical cumulative distribution functions". Ann. Math. Statistics 23, 563-574.

FELLER, W. (1948), "On the Kolmogorov-Smirnov limit theorems for empirical distributions". Ann. Math. Statistics 19, 177-189.

GNEDENKO, B. V. (1952), "Some results on the maximum discrepancy between two empirical distributions". Doklady Akad. Nauk SSSR 82, 661-663.

GNEDENKO, B. (1954), "Kriterien für die Unveränderlichkeit der Wahrscheinlichkeitsverteilung von zwei unabhängigen Stichprobenreihen". Math. Nachr. 12, 29-66.

GNEDENKO, B. V., and KOROLYUK, V. S. (1951), "On the maximum discrepancy between two empirical distributions". Doklady Akad. Nauk SSSR 80, 525-528.

GNEDENKO, B. V., and RVAČEVA, E. L. (1952), "On a problem of comparison of two empirical distributions". Doklady Akad. Nauk SSSR 82, 513-516.

GNEDENKO, B. V., and STUDNEV, YU. P. (1952), "Comparison of the effectiveness of several methods of testing homogeneity of statistical material". Dopovidi Akad. Nauk Ukrain. RSR, pp. 359-363.

KOLMOGOROFF, A. (1933), "Sulla determinazione empirica di una legge di distribuzione". Giorn. Ist. Ital. Attuari 4, 83-91.

KOLMOGOROV, A. N., and HINČIN, A. YA. (1951), "The work of N. V. Smirnov on the investigation of properties of variational series and on nonparametric problems of mathematical statistics". Uspehi Matem. Nauk 6, no. 4 (44), pp. 190-192.

KOROLYUK, V. S. (1954), "Asymptotic expansions for A. N. Kolmogorov's and N. V. Smirnov's criteria of fit". Doklady Akad. Nauk SSSR 95, 443-446.

---- (1955 a), "On the discrepancy of empiric distributions for the case of two independent samples". Izvestiya Akad. Nauk SSSR. Ser. Mat. 19, 81-96.

---- (1955 b), "Asymptotic expansions for the criteria of fit of A. N. Kolmogorov and N. V. Smirnov". Izvestiya Akad. Nauk SSSR. Ser. Mat. 19, 103-124.


KOROLYUK, V. S., and YAROŠEVS'KIĬ, B. I. (1951), "Study of the maximum discrepancy of two empirical distributions". Dopovidi Akad. Nauk Ukrain. RSR, pp. 243-247.

MALMQUIST, STEN (1954), "On certain confidence contours for distribution functions". Ann. Math. Statistics 25, 523-533.

MASSEY, F. J., JR. (1950), "A note on the power of a non-parametric test". Ann. Math. Statistics 21, 440-443.

MASSEY, FRANK J., JR. (1951), "The distribution of the maximum deviation between two sample cumulative step functions". Ann. Math. Statistics 22, 125-128.

---- (1952), "Distribution table for the deviation between two sample cumulatives". Ann. Math. Statistics 23, 435-441.

PÓLYA, GEORGE (1948), "Exact formulas in the sequential analysis of attributes". Univ. California Publ. Math. 1, 229-239.

RVAČEVA, E. L. (1952), "On the maximum discrepancy between two empirical distributions". Ukrain. Mat. Žurnal 4, 373-392.

SMIRNOFF, N. (1939), "Sur les écarts de la courbe de distribution empirique". Mat. Sbornik 6 (48), pp. 3-26.

SMIRNOV, N. (1939), "On the estimation of the discrepancy between empirical curves of distribution for two independent samples". Bull. Math. Univ. Moscou 2:2.

---- (1948), "Table for estimating the goodness of fit of empirical distributions". Ann. Math. Statistics 19, 279-281.

VAN DER WAERDEN, B. L. (1953), "Order tests for the two-sample problem. II". Indagationes Math. 15, 303-310.

Printed 17 December 1957. Uppsala 1957. Almqvist & Wiksells Boktryckeri AB
