Asymptotic normality in mixture models

(1)

ASYMPTOTIC NORMALITY IN MIXTURE MODELS

SARA VAN DE GEER

.

Abstract. We study the estimation of a linear function ⁰=^R ^adF⁰ of a distribution ^F⁰, using i.i.d. observations of the mixture ^p^F⁰ =

R

k(^y)^dF⁰(^y). Let ^^Fⁿ be the maximum likelihood estimator of ^F⁰ and ^ⁿ =^R ^ad^F^ⁿ. We examine the asymptotic distribution of ^ⁿ. A problem here is that usually, ^^Fⁿ does not dominate ^F⁰. Our main aim is to show that this can be overcome by considering the convex combination ^Fn^ + (1^;)^F0, with ^<1.

1. Introduction

Let

X

¹

:::X

n be independent identically distributed random variables on (^X

^A), with distribution

P

. Suppose that for some

-nite measure

,

p

=

dP

d

²^f

p

F =

Z

k

(

y

)

dF

(

y

) :

F

² ^g

where is the class of all probability measures on a measurable space (^Y

^B), and where

k

:^X^Y ^!0

¹) is a given kernel with^R

k

(

xy

)

d

(

x

) = 1 for all

y

²^Y. So

p

=

p

F⁰ =

Z

k

(

y

)

dF

⁰(

y

)

for some

F

⁰ ² .

In this paper, we assume that

F

⁰ is unknown, and that the maximum likelihood estimator ^

F

n of

F

⁰ exists. The latter is (not necessarily uniquely)

dened by _n

X

i⁼¹log

p

_F^{^}ⁿ(

X

i) = max_F

2

n

X

i⁼¹log

p

F(

X

i)

:

Consider now functionals of the form

(

F

) =

Z

adF F

²

with

a

a given function on (^Y

^B). Write

⁰ = ^R

adF

⁰ and ^

n = ^R

ad F

^{^}n. We shall investigate the asymptotic behaviour (and eciency) of the estimator of ^

n of

⁰. An important issue in this context is the appropriate dierentiability of

(

F

). Let us briey sketch the main idea.

ESAIM: Probability and Statistics is an electronic journal with URL address

http://www.emath.fr/ps.

Received by the journal 8 March 1995. Revised 2 June 1995. Accepted for publica- tion 28 September 1995

(2)

Consider a pair of random variables (

XY

) with values in ^X ^Y, let

k

(

y

) be the density of

X

given

Y

=

y

, and

F

⁰ the distribution of

Y

. For a function

b

:^X ^!^R, write

E

(

b

(

X

)^j

Y

=

y

) =

A

b

(

y

)

:

Thus

A

b

(

y

) =

Z

k

(

xy

)

b

(

x

)

d

(

x

)

:

Write for

h

:^Y ^!^R,

E

(

h

(

Y

)^j

X

=

x

) =

A

F⁰

h

(

x

)

where

A

F

h

(

x

) =

R

k

(

xy

)

h

(

y

)

dF

(

y

)

p

F(

x

)

:

If for some

b

,

E

(

b

(

X

)^j

Y

=

y

) =

a

(

y

)

for

F

⁰^;almost all

y

(1

:

1) then clearly

E

(

b

(

X

)) =

⁰

:

So if (1.1)holds for some

b

²

L

²(

P

) not depending on

F

⁰, then (1

=n

)^Pⁿ_i⁼¹

b

(

X

i) is a^p

n

-consistent and asymptotically normal estimator of

⁰. We say that

(

F

⁰) is dierentiable at

F

⁰ if a solution of (1.1) exists. Note that if

k

(

y

) is a complete family for

y

in the support of

F

⁰, there is at most one solution of (1.1). In general, there may also be several solutions, in which case we would like to take the one with the smallest variance. But such a solution possibly depends on

F

⁰. The arguments below indicate that perhaps the maximum likelihood procedure automatically picks the best solution, with the estimator ^

F

n plugged in for

F

⁰.

We shall now discuss the solution with the smallest variance. First, we center the functions. Instead of

a

in (1.1), we consider the gradient of

(

F

), which is dened as

_F =

a

^;^Z

adF:

If for some

h

_F ²

L

¹(

F

) with ^R

h

_F

dF

= 0,

A

F

h

F =

F

^;a

:

s

:

(1

:

2) we call

b

F =

A

F

h

F (1

:

3) the ecient inuence curve at

(

F

). Note that

b

F is also centered now:

R

b

F

dP

F = 0. It follows from Van der Vaart (1991) that (1

=n

)^R

b

²_F⁰

dP

is a lower bound for the asymptotic variance of an estimator of

⁰. He considers parametric submodels with Hilbert space structure to arrive at results of this type in a very general context. In our case, the parametric submodel would

(3)

be indexed by F =^f

F

t: (

dF

t)¹⁼²= (1+¹²

th

F)(

dF

)¹⁼²

^j

t

^jsmall^g. For this reason, we call

h

F a direction (along which one can consider a submodel).

The assumptions

h

F ²

L

¹(

F

) and ^R

h

F

dF

= 0 ensure that indeed F . Suppose now that the ecient inuence curves

b

_F^{^}n and

b

F⁰ exist. So

b

_F^{^}n =

A

_F^{^}n

h

_F^{^}n for some

h

_F^{^}n

2

L

¹( ^

F

n), and

A

b

_F^{^}ⁿ =

_F^{^}ⁿ

F

^{^}n^;a

:

s

::

(1

:

4) Observe that ^

F

nis an interior point of^f

F

^nt :

d F

^{^}nt = (1+

th

_F^{^}ⁿ)

d F

^{^}n

^j

t

^jsmall^g. So we have

dt d

n

X

i⁼¹log

p

_F^{^}nt(

X

i)^jt⁼⁰= 0

or _n

X

i⁼¹

b

_F^{^}ⁿ(

X

i) = 0

:

Write this as ^Z

b

_F^{^}ⁿ

dP

n = 0

with

P

n the empirical distribution based on

X

¹

:::X

n (see (2.1)). Now, let us compare this with^R

b

_F^{^}n

dP

. Changing the order of integration gives

Z

b

_F^{^}n

dP

=

Z

A

b

_F^{^}n

dF

⁰

:

Moreover, if ^

F

n dominates

F

⁰, then (1.4) yields

Z

A

b

_F^{^}ⁿ

dF

⁰ =

Z

_F^{^}ⁿ

dF

⁰=

⁰^;

^{^}n

:

So then we have the identity of van der Laan (1993, 1994):

^n^;

⁰ =

Z

b

_F^{^}ⁿ

d

(

P

n^;

P

)

:

(1

:

5) Finally, if

b

_F^{^}ⁿ converges to

b

F⁰ in an appropriate sense, one obtains asymptotic normality (and eciency) of ^

n.

The problem is now that the assumption that ^

F

n dominates

F

⁰ is often not valid. Nevertheless, van der Laan (1993) presents some examples that show that the identity (1.5) can hold even when ^

F

n does not dominate

F

⁰. We shall however be concerned with the situation where the identity (1.5) is not necessarily true.

Our approach is to consider for each 0

<

1 and

F

² , the convex combination

F

=

F

+ (1^; )

F

⁰

:

Then indeed, ^

F

n dominates

F

⁰. We shall obtain asymptotic normality of

^n by choosing = ^n in such a way that it tends to one at an appropriate speed.

(4)

The paper is organized as follows. In Section 2, we derive a linear ap- proximation for ^

n. Asymptotic normality and eciency follow from this.

The conditions we need to arrive at the result are consistency of the maximum likelihood estimator (obtained by separate means), dierentiability of

(

F

) at

F

=

F

, 0

<

1, bounded directions and suitable continuity conditions on the inuence curves. A discussion of these conditions can be found in Section 3. Section 4 presents some examples.

2. Asymptotic normality

As pseudo-metric on , we take the Hellinger distance between the mixtures:

d

(

F F

^~) =

h

(

p

_F

p

_F^~_{) = (12}^Z (^p

p

_F ^;^p

p

_F^~)²

d

)¹⁼²

F F

^~ ²

:

Consistency of ^

F

nin this metric holds in fairly general situations. It is closely related to the further assumptions as stated in Condition 1 (see Section 3.1 for more details).

We use the notation

P

n= 1

n

X

i⁼¹

Xⁱ

(2

:

1)

i.e.

P

n is the empirical distribution that puts mass (1

=n

) at each of the observations

X

¹

:::X

n. Moreover, we shall make frequent use of stochastic order symbols. If ^f

Z

n^g is a sequence of real-valued random variables, and

f

k

n^g a sequence of positive numbers, then we say that

Z

n =

O

^P(

k

n) if

Mlim^!1lim sup_n^!1^P(^j

Z

n^j

> k

n

M

) = 0

:

Similarily,

Z

n=

o

^P(

k

n) means that for all

>

0,

nlim^!1^P(^j

Z

n^j

> k

n

) = 0

:

Condition 1. (consistency and rates). The estimators ^

F

n and ^

n are consistent, i.e.

d

( ^

F

_n

F

⁰) =

o

^P(1)

(2

:

2) and

j

^n^;

⁰^j=

o

^P(1)

:

(2

:

3) Moreover, for some

_n² =

o

(

n

^;1⁼²),

Z

log( 2

p

_F^{^}n

p

_F^{^}n +

p

F⁰)

dP

n =

O

^P(

_n²)

:

(2

:

4) Condition 2 (dierentiability in a neighbourhood of

F

⁰ and existence of ecient inuence curves). For some

>

0, and for all 0

<

1 and

F

² with

d

(

FF

⁰)

, we have

A

b

F =

F

^;a

:

s

:

(2

:

5)

(5)

for

b

F satisfying

b

F(

x

) =

A

F

h

F(

x

)

p

F(

x

)

>

0

(2

:

6) for some

h

F ²

L

¹(

F

), with^R

h

F

dF

= 0.

Condition 3. (control on the direction

h

F). For some

>

0 and

M <

¹,

d⁽FFsup⁰⁾ sup

0<¹_y sup

2supp ort(F⁾^j(1^; )

h

F(

y

)^j

M:

(2

:

7) Condition 4. (control on the ecient inuence curves

b

F). The informa- tion for estimating

⁰ is positive:

Z

b

²_F⁰

dP >

0

:

(2

:

8) Moreover, the inuence curves are uniformly bounded: for some

>

0,

d⁽FFsup⁰⁾ sup

0<¹_p sup

F

(x⁾>⁰^j

b

F(

x

)^j

<

¹

:

(2

:

9) Finally, for 0

<

n^!0,

d⁽FFsup⁰⁾ⁿ sup

0<¹

Z

b

²_F

dP

n=

Z

b

²_F⁰

dP

+

o

^P(1)

(2

:

10) and

0<¹

Z

(

b

F^;

b

F⁰)

d

(

P

n^;

P

) =

o

^P(

n

^;1⁼²)

:

(2

:

11) Theorem 2.1. Assume that conditions 1-4 are met. Then

^n^;

⁰ =

Z

b

F⁰

d

(

P

n^;

P

) +

o

^P(

n

^;1⁼²)

:

(2

:

12) Proof. By (2.4), we can nd a non-random increasing function

: (0

¹)^! (0

¹), satisfying

(

x

)^!0 for

x

^!0

(2

:

13)

(

x x

) 1

2 for all

x >

0

(2

:

14)

and

x

(

x

) ^!0 for

x

^!0

(2

:

15) such that

_n² =

o

(

n

^;1⁼²)

(

n

^;1⁼²). This gives

Z

log( 2

p

_F^{^}ⁿ

p

_F^{^}ⁿ +

p

F⁰)

dP

n =

o

^P(

n

^;1⁼²)

(

n

^;1⁼²)

:

(2

:

16)

(6)

Choose

1^;^n = ^j

^_n^;

⁰^j+

n

^;1⁼²

(^j

^n^;

⁰^j+

n

^;1⁼²)

:

(2

:

17) Then ^n

<

1 and by (2.14), ^n 1

=

2. Since ^j

^n ^;

⁰^j =

o

^P(1), (2.15) implies that (1^;^n) =

o

^P(1). Furthermore, (2.13) yields

j

^n^;

⁰^j

1^;^n =

o

^P(1)

(2

:

18)

as well as

n

^;1⁼²

1^;^n =

o

^P(1)

:

(2

:

19)

Dene

F

~n = ^n

F

^n+ (1^;^n)

F

⁰

:

(2

:

20) Let

>

0 be small enough, so that (2.5), (2.6), (2.7) and (2.9) in conditions 2-4 are fullled for this value of

, and let

D

n be the set

D

n=^f

d

( ^

F

n

F

⁰)

^g

:

Because ^n

<

1, we can nd on

D

n an inuence curve

b

_F^~n and a direction

h

_F^~ⁿ, such that

A

b

_F^~ⁿ =

_F^~ⁿ

and

b

_F^~n(

x

) =

A

_F^~n

h

_F^~n(

x

)

x

²^f

p

_F^~n

>

0^g

:

We can write

Z

b

_F^~ⁿ

dP

nl^f

D

n^g=

Z

b

_F^~ⁿ

d

(

P

n^;

P

)l^f

D

n^g+

Z

b

_F^~ⁿ

dP

l^f

D

n^g

:

(2

:

21) Because

d

( ^

F

n

F

⁰) =

o

^P(1), we have that l^f

D

n^g = 1 +

o

^P(1). So, using (2.11),

Z

b

_F^~ⁿ

d

(

P

n^;

P

)l^f

D

n^g=

Z

b

F⁰

d

(

P

n^;

P

)+

o

^P(

n

^;1⁼²) =

O

^P(

n

^;1⁼²)

:

(2

:

22) Since ~

F

_n dominates

F

⁰, and ^n= 1 +

o

^P(1),

Z

b

_F^~ⁿ

dP

l^f

D

n^g=^;^n(^

n^;

⁰)l^f

D

n^g=^;(1 +

o

^P(1))(^

n^;

⁰)

:

(2

:

23) Insert (2.22) and (2.23) into (2.21) to get that

Z

b

_F^~n

dP

nl^f

D

n^g=

Z

b

F⁰

d

(

P

n^;

P

)+

o

^P(

n

^;1⁼²)^;(1+

o

^P(1))(^

n^;

⁰)

:

(2

:

24)

Let ^

t

n=

R

b

_F^~ⁿ

dP

n

(1^;^n)^R

b

²_F⁰

dP

^l^f

D

n^g

:

(2

:

25)

(7)

Equality (2.24), together with (2.8), (2.18) and (2.19) imply that

^

t

n =

o

^P(1)

:

(2

:

26)

Take

M

as in (2.7) and

E

n=

D

n^\^fj^

t

n^j

<

1

=M

^g. Dene on^f

E

n^g,

d F

^~n(^

t

n) = (1 + ^

t

n(1^;^n)

h

_F^~ⁿ)

d F

^~n

:

Then on^f

E

n^g,

F

~n(^

t

n)²

:

(2

:

27) Moreover, from (2.26) we know that l^f

E

n^g= 1+

o

^P(1), so that by (2.9) and (2.10),

Z

log(

p

_F^~ⁿ⁽_t^{^}ⁿ⁾

p

_F^~ⁿ ⁾

dP

nl^f

E

n^g=

Z

log(1 + ^

t

n(1^;^n)

b

_F^~n)

dP

nl^f

E

n^g (2

:

28)

= ^

t

n(1^;^n)

Z

b

_F^~ⁿ

dP

nl^f

E

n^g^;1

2^

t

²_n(1^;^n)²

Z

b

²_F^~n

dP

nl^f

E

n^g(1 +

o

^P(1))

= 12(^R

b

_F^~ⁿ

dP

n)²

R

b

²_F⁰

dP

^l^f

E

n^g(1 +

o

^P(1))

:

Since ^

F

n maximizes the likelihood, and (2.27) holds on

E

n, we have

Z

log

p

_F^{^}n

dP

_nl^f

E

_n^g^Z log

p

_F^~n (

^tⁿ⁾

dP

_nl^f

E

_n^g

:

(2

:

29) Furthermore, the concavity of the log-function and the fact that ^n 1

=

2 yield

Z

log

p

_F^~ⁿ

dP

n (2^n^;1)

Z

log

p

_F^{^}ⁿ

dP

n+ 2(1^;^n)

Z

log(

p

_F^{^}ⁿ+

p

F⁰

2 )

dP

n

:

(2

:

30) Combine (2.28), (2.29) and (2.30) to nd

2(1^;^n)

Z

log( 2^

p

_F^{^}n

p

_F^{^}n+

p

F⁰)

dP

n 1

2(^R

b

_F^~n

dP

n)²

R

b

²_F0

dP

^l^f

E

n^g(1 +

o

^P(1))

:

(2

:

31) From (2.16) and (2.17), we know that the left-hand side of this equality is (1^;^n)

o

^P(

n

^;1⁼²)

(

n

^;1⁼²) = (^j

^^;

⁰^j+

n

^;1⁼²)

(

n

^;1⁼²)

(^j

^_n^;

⁰^j+

n

^;1⁼²)

o

^P(

n

^;1⁼²)

= (^j

^n^;

⁰^j+

n

^;1⁼²)

o

^P(

n

^;1⁼²)

where in the last step, we used that

is increasing. In view of (2.24), the right-hand side of (2.31) is of the form

(

O

^P(

n

^;1⁼²)^;(1 +

o

^P(1))(^

n^;

⁰))²(1 +

o

^P(1)

R

b

²_F⁰

dP

⁾

:

(8)

So we nd from (2.31) that

j

^n^;

⁰^j² max^f

O

^P(

n

^;1)

(^j

^n^;

⁰^j+

n

^;1⁼²)

o

^P(

n

^;1⁼²)^g

which implies ^j

^n^;

⁰^j=

O

^P(

n

^;1⁼²). But then, the left-hand side of (2.31) is

o

^P(

n

^;1), so that it reads

(

Z

b

F⁰

d

(

P

n^;

P

)+

o

^P(

n

^;1⁼²)^;(1+

o

^P(1))(^

n^;

⁰))²(1+

o

^P(1)) =

o

^P(

n

^;1)

:

In other words,

^n^;

⁰ =

Z

b

F⁰

d

(

P

n^;

P

) +

o

^P(

n

^;1⁼²)

:

t u

3. Comments on conditions 1-4

Conditions 1 and 4 can be veried using the concept of entropy. Therefore, we introduce the following denitions. Let

Q

be a probability measure on (^X

^A) and^G^Lq(

Q

),

q

1.

Definition 1 The

-covering number

N

q(

^G

Q

) of ^G is dened as the number of balls with radius

necessary to cover ^G. More precisely, let

f

g

j^gmj⁼¹ be such that for each

g

²^G there is a

j

²^f1

:::m

^gsuch that

Z

(

g

^;

g

j)^q

dQ

^q

:

(3

:

1) Then

N

q(

^G

Q

) is the smallest

m

for which such a collection^f

g

j^gmj⁼¹exists.

The

-entropy of ^G is

H

_q(

^G

Q

) = log

N

_q(

^G

Q

)^_1.

Definition 2. Let ^f

g

_Lj

g

_Uj]^g_mj⁼¹ be such that for all

g

² ^G there is a

j

²^f1

:::m

^gsuch that

g

_Lj

g g

_Uj

(3

:

2)

and ^Z

(

g

_Uj^;

g

_Lj)^q

dQ

^q

:

(3

:

3) Write

N

_Bq(

^G

Q

)for the smallest

m

for which such a collection^f

g

_Lj

g

_Uj]^g_mj⁼¹ exists. Then

H

_Bq(

^G

Q

) = log

N

_Bq(

^G

Q

)^_1 is called the

-entropy with bracketing.

3.1. On Condition 1

Suppose that^Y is a locally compact Hausdor space with countable base, and that ^B is the Borel

-algebra. Let ^C⁰ be the class of all functions

c

: ^Y ^! ^R that vanish at innity. If

k

(

x

) ² ^C⁰ for

-almost all

x

² ^X, then

d

( ^

F

n

F

⁰)^!0 almost surely (see Pfanzagl (1988)).

(9)

Denote the class of all measures

F

on (^Y

^B) with

F

(^Y) 1 by . The vague topology on is the smallest topology such that

F

^7!^R

cdF

is continuous for every

c

²^C⁰. Let

be the metric corresponding to the vague topology. We say that

F

⁰ is identiable (for the metric

) if for all

F

² ,

d

(

FF

⁰) = 0 implies

(

FF

⁰) = 0. If

F

⁰ is identiable, then

d

( ^

F

n

F

⁰) ^!0 almost surely implies

( ^

F

_n

F

⁰) ^! 0 almost surely. So in particular, then

j

^n^;

⁰^j^!0 almost surely, whenever

a

²^C⁰. More details can be found in e.g. Pfanzagl (1988) or Van de Geer (1993a).

Let us now investigate the rate of convergence for the log-likelihood ratio. Note rst of all that

0

Z

log( 2

p

_F^{^}n

p

_F^{^}n+

p

F⁰)

dP

n 1 2

Z

log(

p

_F^{^}n

p

F⁰)

dP

n

so that nothing is lost (and indeed something could be gained) by comparing

p

_F^{^}n with the convex combination (

p

_F^{^}n+

p

F⁰)

=

2 instead of with

p

F⁰.

Lemma 3.1. Suppose that

Z

1

0

q

H

²^B(

^G

P

)

d <

¹

(3

:

4) where

G=^f

r

p

F +

p

F⁰

p

F⁰ :

F

² ^g

:

Then ^Z

log(

p

_F^{^}ⁿ

p

F⁰)

dP

n=

O

^P(

_n²)

with

_n² =

o

(

n

^;1⁼²)

:

Proof. See Van de Geer (1995). A slight modication can be found in Wong

and Shen (1992). ^t^u

We call the left-hand side of (3.4) the entropy integral. If the entropy integral diverges, suboptimal rates can emerge (see Birge and Massart (1993)).

Lemma 3.2 below makes use of the special structure of the mixing model. We need the following notation: for

0,

¹²(

) =

Z

p^F⁰

p

F⁰

d

and

²²(

) =

Z

p^F⁰> 1

p

F⁰

d:

Lemma 3.2. Let ^K= ^f

k

(

y

) :

y

² ^Yg. Suppose that the functions in ^K are uniformly bounded, and that for all probability measures

Q

with nite support,

N

²(

^K

Q

)

A

^;^w

>

0

(3

:

5)

(10)

where the constants

A

and

w

do not depend on

Q

. Let 0

n ^! 0,

n

¹(

n)^_

n

^;(2+^w⁾⁼⁽⁴⁺⁴^w⁾(

²(

n))^w=⁽²⁺²^w⁾. Then

Z

log( 2

p

_F^{^}n

p

_F^{^}n +

p

F⁰)

dP

n =

O

^P(

_n²)

:

(3

:

6)

Proof See Van de Geer (1993b). ^t^u

Lemma 3.2 may not yield the optimal rate, but all we need here is a rate faster than

n

^;1⁼². This is the case if

²(

n) =

o

(

n

^2w¹ ) . To put it dierently, if we dene

¹^;1(

) = sup^f

:

¹(

)

²^g

then

_n² =

o

(

n

^;1⁼²) if

²(

¹^;1(

)) =

o

(

^;^w²) for

^#0

:

3.2. On Condition 2

It is natural to require (2.5) and (2.6) for = 0. The fact that we need these equalities for all 0

<

1 is closely related to being able to estimate the ecient inuence curve. However, it should be noted that we only assume

b

F to exist for certain

F

that dominate

F

⁰.

In some applications,

b

_F^{^}ⁿ does exist, but this usually will not help to simplify the proofs (see Section 4 for an example).

3.3. On Condition 3

Here, we assume that

h

F behaves like 1

=

(1^; ). Clearly, this allows misbehaviour for ^! 1, but it also requires

h

_F to be bounded. In some applications, this reduces to assuming that

h

F⁰ is bounded. The proof of Theorem 2.1 reveals that we need that for all

t

suciently small,

dF

(

t

)

=dF

= 1+

t

(1^; )

h

F exists and is non-negative, i.e., that

F

(

t

)² . If for some function

g

,

dg=dF

⁰ exists and is bounded, say by

C

, then also

dg=dF

exists and (1^; )

dg=dF

is bounded by

C

. This is the reason why in Example 4.1 and 4.3 our approach works. But it fails in Example 4.2!

3.4. On Condition 4

Let us present a brief overview of some results from empirical process theory, that can be applied in this context. We cite them from Pollard (1984) and Ossiander (1987). Throughout, we assume that the necessary measurability conditions are satised.

Consider a class ^G of functions on (^X

^A), with envelope

G

= sup_g

2G j

g

^j

:

The class ^G is called a Glivenko-Cantelli class if

sup_g

2G j

Z

gd

(

P

n^;

P

)^j^!0

almost surely

:

(11)

It is called a Donsker class if

n gd

(

P

n^;

P

) converges in distribution to a mean zero Gaussian process on^G. This limiting process is assumed to have continuous sample paths with respect to

(

), with

(

g

¹

g

²)²=

Z

(

g

¹^;

g

²)²

dP

^;(

Z

(

g

¹^;

g

²)

dP

)²

:

Necessary and sucient conditions for^Gto be a Glivenko-Cantelli class

are:

G

²

L

¹(

P

)

and

H

¹(

^G

P

n) =

o

^P(

n

)

>

0

:

If^G is a Donsker class, then the asymptotic equicontinuity condition holds, i.e. for all

>

0 there exists a

>

0 such that

limsup_n

!1

P(sup_g

2G

j

p

n

^Z

g d

(

P

n^;

P

)^j

>

)

<

(3

:

7) with^G =^f(

g

¹^;

g

²) :

g

¹

g

²²^G (

g

¹

g

²)

^g. Sucient conditions for^G to be a Donsker class are:

G

²

L

²(

P

)

and ^Z 1

0

q

H

²^B(

^G

P

)

d <

¹

:

(3

:

8) Condition (3.8) may be replaced by

Z

1

0

p

(

^G)

d <

¹

(3

:

9) where

(

^G)sup_Q

H

²(

(

Z

G

²

dQ

)¹⁼²

^G

Q

)

(3

:

10) and where the supremum is taken over all probability measures

Q

with nite support.

Suppose now that (2.9) is met, and that for 0

<

_n^!0,

0<¹

Z

(

b

F^;

b

F⁰)²

dP

^!0

:

(3

:

11) Then it is clear from the above that (2.10) and (2.11) are fullled if one of the entropy conditions (3.8) or (3.9) holds, with

G=^f

b

F:

d

(

FF

⁰)

0

<

1^g

:

In many applications however, no explicit expressions are available for the ecient inuence curves, so that it may still be dicult to check the entropy conditions.

(12)

4. Examples

Example 4.1: interval censored obervations, case I

Suppose one observes

X

_i = (

T

_i

_i), where

_i = l^f

Y

_i

T

_i^g,

Y

_i and

T

_i are independent random variables, both with values in a bounded interval, say (0

1], and where

T

i has (unknown) distribution

G

,

i

= 1

:::n

. This is one of the models studied in Groeneboom and Wellner (1992). The density of

X

i with respect to

G

,

being the counting measure on^f0

1^g, is

p

F⁰(

t

) =

F

⁰(

t

) + (1^;

)(1^;

F

⁰(

t

)) =

Z

k

(

ty

)

dF

⁰(

y

)

with

k

(

ty

) =

l^f

y t

^g+ (1^;

)l^f

y > t

^g.

Lemma 4.1. Suppose

G

has density

g

with respect to Lebesgue measure.

Assume that _

a

(

y

) =

da

(

y

)

=dy

exists and

j

a

_(

t

)

g

(

t

)^j

C

¹

<

¹

t

²(0

1]

(4

:

1) and

j

d

(_

a

(

t

)

=g

(

t

))

dF

⁰(

t

) ^j

C

²

<

¹

t

²(0

1]

:

(4

:

2)

Then

^n^;

⁰ =

Z

b

F⁰

d

(

P

n^;

P

) +

o

^P(

n

^;1⁼²)

where

b

_F⁰(

t

) =^;

⁽¹^;

F

⁰(

t

))_

a

(

t

)

g

(

t

) + (1^;

)

F

⁰(

t

)_

a

(

t

)

g

(

t

)

t

²(0

1] ²^f0

1^g

:

Proof Condition 1 is met:

d

( ^

F

n

F

⁰)^!0 almost surely,^j

^n^;

⁰^j^!0 almost

surely and ^Z

log(

p

_F^{^}n

p

F⁰)

dP

n =

O

^P(

n

^;2⁼³)

:

This follows from the theory in Subsection 3.1 (see also Groeneboom and Wellner (1992) and Van de Geer (1993a)). Dene

b

F(

t

) =^;

⁽¹^;

F

(

t

))_

a

(

t

)

g

(

t

) + (1^;

)

F

(

t

)_

a

(

t

)

g

(

t

)

t

²(0

1] ²^f0

1^g

:

Then

A

b

F(

y

) = ^;

Z

1

y^;(1^;

F

(

t

))_

a

(

t

)

dt

+

Z y^;

0

F

(

t

)_

a

(

t

)

dt

=

a

(

y

)^;

a

(0)^;

Z

1

0

(1^;

F

(

t

^;))_

a

(

t

)

dt

=

F(

y

)

y

²(0

1]

:

(13)

Because

F

dominates

F

⁰,

dF

⁰

=dF

exists. So

h

F(

y

) =^;

d

(

F

(

y

)(1^;

F

(

y

))_

a

(

y

)

=g

(

y

))

dF

(

y

)

= (2

F

(

y

)^;1)_

a

(

y

)

g

(

y

) +

F

(

y

^;)(1^;

F

(

y

^;))

g

²(

y

)

d

(_

a

(

y

)

=g

(

y

))

dF

⁰(

y

)

dF

⁰(

y

)

dF

(

y

) exists too. Moreover, for

p

F(

t

)

>

0,

A

F

h

F(

t

) =

Rt

0

h

F

dF

F

(

t

) + (1^;

)

R

t1

h

F

dF

1^;

F

(

t

) =

b

F(

t

)

:

Thus, Condition 2 is fullled.

Clearly,

j

h

F(

y

)^j

C

¹+

C

² (1^; )

:

Therefore, Condition 3 holds too.

Finally, we shall verify Condition 4. Note rst that

Z

b

²_F⁰

dP

=

Z

F

⁰(

t

)(1^;

F

⁰(

t

))(_

a

(

t

))²

g

(

t

)

dt >

0

and

j

b

F(

t

)^j

C

¹

:

To check (2.10) and (2.11), we use the theory of Subsection 3.4. We have

d

²(

FF

⁰_{) = 12(}^Z (^p

F

^;^p

F

⁰)²

dG

+

Z

(^p1^;

F

^;^p1^;

F

⁰)²

dG

)

and

Z

(

b

_F^;

b

_F⁰)²

dP

=

Z (

F

(

t

)^;

F

⁰(

t

))²(_

a

(

t

))²

g

(

t

)

dt C

¹²

d

²(

F

⁰)

:

So indeed, (3.11) is satised: for 0

<

n ^!0,

0<¹

Z

(

b

F^;

b

F⁰)²

dP

^!0

:

Also (3.8) holds, namely for^G=^f

b

F:

d

(

FF

⁰)

0

<

1^g, we have for some constant

A

,

H

²^B(

^G

P

)

A

¹

>

⁰

:

This follows from entropy calculations for monotone functions (see Birman and Solomjak (1967) and, for the extension to entropy with bracketing, Van de Geer (1991)). Therefore, (2.10) and (2.11) are fullled. In view of

Theorem 2.1, this completes the proof. ^t^u

(14)

The result of Lemma 4.1 was established earlier in Groeneboom and Wellner (1992). They use specic properties of the maximum likelihood estimator ^

F

n, such as the distance between successive jumps of ^

F

n being smaller than

n

^;1⁼³log

n

with large probability. In this sense, local properties of ^

F

n had to be obtained rst, before one could arrive at the asymptotic behaviour of such global quantities as the mean of the maximum likelihood estimator. It inspired us to develop an alternative proof, which is hopefully applicable in more general situations.

The model for interval censored observations gives a good insight into the diculties that arise due to the fact that ^

F

n does not dominate

F

⁰. Let us have a closer look for the case

a

(

y

) =

y

. It is known that ^

F

n has nite support, say ^

z

¹

< ::: < z

^m. Dene

g

^(

t

) =

G

(^

z

j)^;

G

(^

z

j^;1)

z

^j ^;

z

^j^;1

t

²(^

z

j^;1

^

z

j]

and

b

_F^{^}n(

t

) =^;

¹^;

F

^{^}n(

t

)

g

^(

t

) +

F

^{^}n(

t

)

g

^(

t

)

:

Then for

y

²^f

z

^¹

::: z

^m^g,

A

b

_F^{^}ⁿ(

y

) =

y

^;

^{^}n =

_F^{^}ⁿ(

y

)

:

However, ^Z

_F^{^}n

dF

⁰ =^;(^

n^;

~n)

with

~n =

Z Z y 1

g

^(

t

)

dG

(

t

)

dF

⁰(

y

)

:

In general, ~

n ⁶=

⁰, unless

G

happens to be the uniform distribution on (0

1].

Write

h

_F^{^}ⁿ(

y

) =^;

d

( ^

F

n(

y

)(1^;

F

^n(

y

))

= g

^(

y

))

d F

^{^}n(

y

)

= 2

F

^n(

y

)^;1

^

g

(

y

) +

F

^n(

y

^;)(1^;

F

^n(

y

^;))

^

g

(

y

)^

g

(

y

^;)

d g

^(

y

)

d F

^{^}n(

y

)

:

In order to have that at ^

F

n, the derivative of the likelihood equals zero in the direction

h

_F^{^}ⁿ, one must have that this direction is bounded. This leads to showing that the jumps of ^

F

n are large enough.

Example 4.2: interval censored observations: case II Let

X

i= (

T

i

U

i

i), with

i= l^f

Y

i

T

i^g,

i= l^f

T

i

< Y

i

U

i^g and

Y

i

independent of (

T

i

U

i),

i

= 1

:::n

. We assume bounded support, say

T

i

U

i

and

Y

i²(0

2], and that

T

i

< U

i. This model is also studied in Groeneboom and Wellner (1992). The rate of convergence of the log-likelihood ratio can