HAL Id: hal-00451831
https://hal.archives-ouvertes.fr/hal-00451831v2
Submitted on 9 Apr 2010
Divergences and duality for estimation and test under moment condition model
Michel Broniatowski, Amor Keziou
To cite this version:
Michel Broniatowski, Amor Keziou. Divergences and duality for estimation and test under moment condition model. Journal of Statistical Planning and Inference, Elsevier, 2012, 142 (9), pp. 2554-2573. ⟨hal-00451831v2⟩
ON GENERALIZED EMPIRICAL LIKELIHOOD METHODS
MICHEL BRONIATOWSKI∗ AND AMOR KEZIOU∗∗
Abstract. We introduce estimation and test procedures through divergence minimization for models satisfying linear constraints with unknown parameter. These procedures extend the empirical likelihood (EL) method and share common features with the generalized empirical likelihood (GEL) approach. We treat the problems of existence and characterization of the divergence projections of probability measures on sets of signed finite measures. Our approach makes it possible to obtain the limit distributions of the estimates and test statistics (including the EL ones) under alternatives and misspecification. The asymptotic behavior of the estimates and test statistics is studied both under the model and under alternatives including misspecification, using the dual representation of the divergences and the explicit forms of the divergence projections. An approximation to the power function is deduced, as well as the sample size which ensures a desired power for a given alternative.
Keywords: Empirical likelihood; Generalized empirical likelihood; Minimum divergence; Efficiency; Power function; Duality; Divergence projection.
MSC (2000) Classification: 62G05; 62G10; 62G15; 62G20; 62G35.
Contents
1. Introduction and notation
2. Statistical divergences
3. Minimum divergence estimates
4. Dual representation of φ-divergences under constraints
5. Asymptotic properties of the estimates of the parameter and the estimates of the divergences
5.1. Under the model
5.2. Asymptotic properties of the estimates of the divergences for a given value of the parameter
5.3. Under misspecification
6. Simulation results: Approximation of the power function of the empirical likelihood ratio test
7. Concluding remarks and possible developments
8. Appendix
References
Date: April 2010.

1. Introduction and notation

Statistical models are often defined through estimating equations
\[ E[g(X,\theta)] = 0, \]
where $g(X,\theta)$ is some vector valued function of a random vector $X \in \mathbb{R}^m$ and a parameter vector $\theta \in \Theta \subset \mathbb{R}^d$. The function $g$ has $l$ real valued functions $g_j$ as its components. Examples of such models are numerous, see e.g. Qin and Lawless (1994), Haberman (1984), Sheehy (1987), McCullagh and Nelder (1983), Owen (2001) and the references therein. Denoting by $M^1$ the collection of all probability measures (p.m.) on $\mathbb{R}^m$, the submodel $\mathcal{M}^1_\theta$, associated to a given value $\theta$ of the parameter, consists of all distributions $Q$ satisfying the linear constraints induced by $g(\cdot,\theta)$, namely
\[ \mathcal{M}^1_\theta := \left\{ Q \in M^1 \text{ such that } \int g(x,\theta)\, dQ(x) = 0 \right\}. \]
The statistical model which we consider can be written as
\[ \mathcal{M}^1 := \bigcup_{\theta\in\Theta} \mathcal{M}^1_\theta. \tag{1.1} \]
Let $X_1,\dots,X_n$ denote an i.i.d. sample of $X$ with unknown distribution $P_0$. We denote by $\theta_0$, if it exists, the value of the parameter such that $P_0$ belongs to $\mathcal{M}^1_{\theta_0}$, namely the value satisfying $E[g(X,\theta_0)] = 0$, and we assume that $\theta_0$ is unique. This paper addresses the two following natural questions:
Problem 1: Does $P_0$ belong to the model $\mathcal{M}^1$?
Problem 2: When $P_0$ is in the model, which is the value $\theta_0$ of the parameter for which $E[g(X,\theta_0)] = 0$? Also, can we perform tests about $\theta_0$? Can we construct confidence areas for $\theta_0$?
We note that these problems have been investigated by many authors. Hansen (1982) considered the generalized method of moments (GMM). Hansen et al. (1996) introduced the continuous updating (CU) estimate. The empirical likelihood (EL) approach, developed by Owen (1988) and Owen (1990), has been investigated in the context of model (1.1) by Qin and Lawless (1994) and Imbens (1997), introducing the EL estimator. The recent literature in econometrics focuses on such models; Newey and Smith (2004) provided a class of estimates called generalized empirical likelihood (GEL) estimates which contains the EL and CU estimates. Schennach (2007) discussed the asymptotic properties of the empirical likelihood estimate under misspecification; she showed the important fact that the EL estimate may cease to be root-$n$ consistent when the functions defining the moment conditions are unbounded. Among other results pertaining to EL, Newey and Smith (2004) stated that the EL estimate, when bias corrected, enjoys optimality properties in terms of efficiency among all GEL estimates including the GMM one. Also Corcoran (1998) and Baggerly (1998) proved that in a class of minimum discrepancy statistics (called power divergence statistics), the EL ratio is the only one that is Bartlett correctable. Confidence areas for the parameter $\theta_0$ have been considered in the seminal paper by Owen (1990). Problems 1 and 2 have been handled via the EL approach in Qin and Lawless (1994) and in Newey and Smith (2004) under the null hypothesis $H_0 : P_0 \in \mathcal{M}^1$; however, the limit distributions of the EL estimate and the EL test statistic under misspecification have not been obtained so far. Our contribution is as follows:
(1) The approach which we develop is based on minimum discrepancy estimates, which extends the EL method and has common features with minimum distance and GEL techniques, using merely divergences. We present a wide class of estimates, test statistics and confidence regions for the parameter $\theta_0$ as well as various test statistics for Problems 1 and 2, all depending on the choice of the divergence.
(2) The limit distribution of the EL test statistic under the alternative and under misspecification has so far remained an open problem. The present paper fills this gap; indeed, we give the limit distributions of the proposed estimates and test statistics (including the EL ones) for Problems 1 and 2 under the null hypotheses, under alternatives and under misspecification.
(3) The limit distributions of the test statistics under the alternatives and misspecification are used to give an approximation to the power function and the sample size which ensures a desired power for a given alternative.
(4) We extend confidence region (C.R.) estimation techniques based on EL (see Owen (1990)), providing a wide range of such C.R.’s, each one depending upon a specific criterion.
From the point of view of the statistical criterion under consideration, the main advantage of using a divergence based approach lies in the fact that it leads to all statistical properties of the estimates and test statistics under the alternative, including misspecification, which cannot be achieved through the classical EL context. In the case of parametric models of densities, White (1982) studied the asymptotic properties of the parametric maximum likelihood estimate and the parametric likelihood ratio statistic under misspecification. Broniatowski and Keziou (2009) stated the consistency and obtained the limit distributions of the minimum divergence estimates and the corresponding test statistics (including the parametric likelihood ones) both under the null hypotheses and the alternatives, from which they deduced an approximation to the power function. In this paper, we extend the above results to the case of the semi-parametric models (1.1) in the global context of empirical divergences, including the EL method.
The paper is organized as follows. Section 2 describes the statistical divergences used in the sequel. Section 3 is devoted to the description of estimation and test procedures. In Section 4 we adapt the formalism of Lagrangian duality to the context of statistical divergences, and we use it to give practical formulas (for the study and the numerical computation) of the proposed estimates and test statistics. Section 5 deals with the asymptotic properties of the estimates and test statistics. Simulation results are given in Section 6. All proofs are postponed to the Appendix.
2. Statistical divergences
We first set some general definitions and notations. Let $P$ be some p.m. Denote by $M$ the space of all signed finite measures (s.f.m.) on $\mathbb{R}^m$. Let $\varphi$ be a convex function from $\mathbb{R}$ onto $[0,+\infty]$ with $\varphi(1) = 0$, and such that its domain $\mathrm{dom}\,\varphi := \{x \in \mathbb{R} \text{ such that } \varphi(x) < \infty\}$ is an interval with endpoints $a < 1 < b$ (which may be finite or infinite). We assume that $\varphi$ is closed.¹ For any s.f.m. $Q$, the $\varphi$-divergence between $Q$ and the p.m. $P$, when $Q$ is absolutely continuous with respect to (a.c.w.r.t.) $P$, is defined through
\[ D_\varphi(Q,P) := \int_{\mathbb{R}^m} \varphi\left(\frac{dQ}{dP}(x)\right) dP(x), \tag{2.1} \]
in which $\frac{dQ}{dP}(\cdot)$ denotes the Radon-Nikodym derivative. When $Q$ is not a.c.w.r.t. $P$, we set $D_\varphi(Q,P) = +\infty$. For any p.m. $P$, the mapping $Q \in M \mapsto D_\varphi(Q,P)$ is convex and takes nonnegative values. When $Q = P$ then $D_\varphi(Q,P) = 0$. Furthermore, if the function $x \mapsto \varphi(x)$ is strictly convex on a neighborhood of $x = 1$, then
\[ D_\varphi(Q,P) = 0 \text{ if and only if } Q = P. \tag{2.2} \]
All the above properties are presented in Csiszár (1963), Csiszár (1967) and Liese and Vajda (1987), Chapter 1, for $\varphi$-divergences defined on the set of all p.m.'s $M^1$. When the $\varphi$-divergences are defined on $M$, then the same arguments as developed on $M^1$ hold. When defined on $M^1$, the Kullback-Leibler ($KL$), modified Kullback-Leibler ($KL_m$), $\chi^2$, modified $\chi^2$ ($\chi^2_m$), Hellinger ($H$), and $L^1$ divergences are respectively associated to the convex functions $\varphi(x) = x\log x - x + 1$, $\varphi(x) = -\log x + x - 1$, $\varphi(x) = \frac{1}{2}(x-1)^2$, $\varphi(x) = \frac{1}{2}(x-1)^2/x$, $\varphi(x) = 2(\sqrt{x}-1)^2$ and $\varphi(x) = |x-1|$. All these divergences, except the $L^1$ one, belong to the class of power divergences introduced in Cressie and Read (1984) (see also Liese and Vajda (1987) and Pardo (2006)). They are defined through the class of convex functions
\[ x \in \mathbb{R}^*_+ \mapsto \varphi_\gamma(x) := \frac{x^\gamma - \gamma x + \gamma - 1}{\gamma(\gamma - 1)} \tag{2.3} \]
if $\gamma \in \mathbb{R}\setminus\{0,1\}$, and by $\varphi_0(x) := -\log x + x - 1$ and $\varphi_1(x) := x\log x - x + 1$. So, the $KL$-divergence is associated to $\varphi_1$, the $KL_m$ to $\varphi_0$, the $\chi^2$ to $\varphi_2$, the $\chi^2_m$ to $\varphi_{-1}$ and the Hellinger distance to $\varphi_{1/2}$. We extend the definition of the power divergence functions $Q \in M^1 \mapsto D_{\varphi_\gamma}(Q,P)$ onto the whole set of signed finite measures $M$ as follows. When the function $x \mapsto \varphi_\gamma(x)$ is not defined on $]-\infty,0[$, or when $\varphi_\gamma$ is defined on $\mathbb{R}$ but is not a convex function, we extend the definition of $\varphi_\gamma$ through
\[ x \in \mathbb{R} \mapsto \varphi_\gamma(x)\,\mathbb{1}_{[0,+\infty[}(x) + (+\infty)\,\mathbb{1}_{]-\infty,0[}(x). \tag{2.4} \]

¹ The closedness of $\varphi$ means that if $a$ or $b$ are finite then $\varphi(x) \to \varphi(a)$ when $x \downarrow a$, and $\varphi(x) \to \varphi(b)$ when $x \uparrow b$.
Note for instance that for the $\chi^2$-divergence, the corresponding function $\varphi(x) = \frac{1}{2}(x-1)^2$ is convex and defined on the whole of $\mathbb{R}$. In this paper, for technical considerations, we assume that the $\varphi$ functions are strictly convex on their domain $(a,b)$, twice continuously differentiable on the interior of their domain and satisfy $\varphi(1) = 0$, $\varphi'(1) = 0$ and $\varphi''(1) = 1$. We assume also that $\varphi$ is "essentially smooth" in the sense that $\lim_{x\downarrow a}\varphi'(x) = -\infty$ if $a > -\infty$ and $\lim_{x\uparrow b}\varphi'(x) = +\infty$ if $b < +\infty$. Note that all the power functions $\varphi_\gamma$, see (2.4), satisfy the above conditions, including all standard divergences.
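The normalization conditions above can be checked numerically. The following Python sketch evaluates the power functions of (2.3), approximates $\varphi'(1)$ and $\varphi''(1)$ by finite differences, and evaluates (2.1) for discrete measures with common support. The function names (`phi_gamma`, `divergence`, `derivatives_at_one`) and the step size `h` are illustrative assumptions, not notation from the paper.

```python
import math

def phi_gamma(x, gamma):
    """Power function (2.3) for x > 0, with the limiting cases gamma = 0 and gamma = 1."""
    if gamma == 0:
        return -math.log(x) + x - 1.0      # modified Kullback-Leibler
    if gamma == 1:
        return x * math.log(x) - x + 1.0   # Kullback-Leibler
    return (x**gamma - gamma * x + gamma - 1.0) / (gamma * (gamma - 1.0))

def divergence(q, p, gamma):
    """D_phi(Q, P) of (2.1) for discrete Q, P on a common support with q_i, p_i > 0."""
    return sum(pi * phi_gamma(qi / pi, gamma) for qi, pi in zip(q, p))

def derivatives_at_one(gamma, h=1e-4):
    """Central finite-difference approximations of phi'(1) and phi''(1)."""
    f = lambda x: phi_gamma(x, gamma)
    d1 = (f(1 + h) - f(1 - h)) / (2 * h)
    d2 = (f(1 + h) - 2 * f(1) + f(1 - h)) / h**2
    return d1, d2
```

For instance, `divergence(q, p, 2)` returns the $\chi^2$-divergence $\frac{1}{2}\sum_i (q_i - p_i)^2/p_i$, and `divergence(p, p, gamma)` is zero for every `gamma`, in line with (2.2).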
Definition 2.1. Let $\Omega$ be some subset of $M$. The $\varphi$-divergence between the set $\Omega$ and a p.m. $P$ is defined by
\[ D_\varphi(\Omega,P) := \inf_{Q\in\Omega} D_\varphi(Q,P). \]
A finite measure $Q^* \in \Omega$, such that $D_\varphi(Q^*,P) < \infty$ and
\[ D_\varphi(Q^*,P) \le D_\varphi(Q,P) \quad \text{for all } Q \in \Omega, \]
is called a projection of $P$ on $\Omega$. This projection may not exist, or may not be uniquely defined.
3. Minimum divergence estimates
Let $X_1,\dots,X_n$ denote an i.i.d. sample of a random vector $X \in \mathbb{R}^m$ with distribution $P_0$. Let $P_n$ be the empirical measure pertaining to this sample, namely
\[ P_n(\cdot) := \frac{1}{n}\sum_{i=1}^n \delta_{X_i}(\cdot), \]
in which $\delta_x$ denotes the Dirac measure at point $x$. We embed our statistical approach in the global context of s.f.m.'s with total mass 1 satisfying linear constraints:
\[ \mathcal{M}_\theta := \left\{ Q \in M \text{ such that } \int_{\mathbb{R}^m} dQ(x) = 1 \text{ and } \int_{\mathbb{R}^m} g(x,\theta)\, dQ(x) = 0 \right\} \tag{3.1} \]
and
\[ \mathcal{M} := \bigcup_{\theta\in\Theta} \mathcal{M}_\theta, \tag{3.2} \]
sets of signed finite measures that replace $\mathcal{M}^1_\theta$ and $\mathcal{M}^1$. Extending the model (1.1) to the above one (3.2) brings a number of improvements over existing results; this is argued at the end of the present Section. The "plug-in" estimate of $D_\varphi(\mathcal{M}_\theta,P_0)$ is
\[ \widehat{D}_\varphi(\mathcal{M}_\theta,P_0) := \inf_{Q\in\mathcal{M}_\theta} D_\varphi(Q,P_n) = \inf_{Q\in\mathcal{M}_\theta} \int_{\mathbb{R}^m} \varphi\left(\frac{dQ}{dP_n}(x)\right) dP_n(x). \tag{3.3} \]
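If the projection $Q_n$ of $P_n$ on $\mathcal{M}_\theta$ exists, then it is clear that $Q_n$ is a s.f.m. (or possibly a p.m.) a.c.w.r.t. $P_n$; this means that the support of $Q_n$ must be included in the set $\{X_1,\dots,X_n\}$. So, define the sets
\[ \mathcal{M}^{(n)}_\theta := \left\{ Q \in M \;\Big|\; Q \text{ a.c.w.r.t. } P_n,\ \sum_{i=1}^n Q(X_i) = 1 \text{ and } \sum_{i=1}^n Q(X_i)\, g(X_i,\theta) = 0 \right\}, \tag{3.4} \]
which may be seen as subsets of $\mathbb{R}^n$. Then, the plug-in estimate (3.3) can be written as
\[ \widehat{D}_\varphi(\mathcal{M}_\theta,P_0) = \inf_{Q\in\mathcal{M}^{(n)}_\theta} \frac{1}{n}\sum_{i=1}^n \varphi\left(nQ(X_i)\right). \tag{3.5} \]
In the same way, $D_\varphi(\mathcal{M},P_0) := \inf_{\theta\in\Theta}\inf_{Q\in\mathcal{M}_\theta} D_\varphi(Q,P_0)$ can be estimated by
\[ \widehat{D}_\varphi(\mathcal{M},P_0) = \inf_{\theta\in\Theta}\ \inf_{Q\in\mathcal{M}^{(n)}_\theta} \frac{1}{n}\sum_{i=1}^n \varphi\left(nQ(X_i)\right). \tag{3.6} \]
By uniqueness of $\arg\inf_{\theta\in\Theta} D_\varphi(\mathcal{M}_\theta,P_0)$ and since the infimum is reached at $\theta = \theta_0$ under the model, we estimate $\theta_0$ through
\[ \widehat{\theta}_\varphi = \arg\inf_{\theta\in\Theta}\ \inf_{Q\in\mathcal{M}^{(n)}_\theta} \frac{1}{n}\sum_{i=1}^n \varphi\left(nQ(X_i)\right). \tag{3.7} \]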
Extending $\mathcal{M}^1$ to $\mathcal{M}$, and accordingly extending the definitions of the $\varphi$ functions to $]-\infty,+\infty[$ and of the $\varphi$-divergences to the whole space of s.f.m.'s $M$, is motivated by the following arguments:
- If the domain $(a,b)$ of the function $\varphi$ is included in $[0,+\infty[$ then minimizing over $\mathcal{M}^1$ or over $\mathcal{M}$ leads to the same estimates and test statistics. Hence, both approaches coincide for instance in the case of the divergences $KL_m$, $KL$, modified $\chi^2$ and Hellinger.
- Let $\theta$ be a given value in $\Theta$. Denote by $Q^1_n$ and $Q_n$ respectively the projections of $P_n$ on $\mathcal{M}^1_\theta$ and on $\mathcal{M}_\theta$. If $Q^1_n$ satisfies $0 < Q^1_n(X_i) < 1$ for all $i = 1,\dots,n$ then it coincides with $Q_n$, i.e., $Q^1_n = Q_n$. Therefore, in this case, both approaches also lead to the same estimates and test statistics.
- It may occur that for some $\theta$ in $\Theta$ and some $i = 1,\dots,n$, $Q^1_n(X_i)$ is a boundary value of $[0,1]$; the first order conditions are then not met, which is a real difficulty for the calculation of the estimates over the sets of p.m.'s $\mathcal{M}^1_\theta$ and $\mathcal{M}^1$. However, when $\mathcal{M}^1$ is replaced by $\mathcal{M}$, this problem no longer arises, in particular when $\mathrm{dom}\,\varphi = \mathbb{R}$, which is the case for instance of the $\chi^2$-divergence. Other arguments are given in Remark 4.5 below.
The empirical likelihood paradigm (see Owen (1988), Owen (1990), Qin and Lawless (1994) and Owen (2001)) enters as a special case of the statistical issues related to estimation and tests based on $\varphi$-divergences with $\varphi(x) = \varphi_0(x) = -\log x + x - 1$, namely on the $KL_m$-divergence. Indeed, it is straightforward to see that the empirical log-likelihood ratio statistic for testing $P_0 \in \mathcal{M}$ against $P_0 \notin \mathcal{M}$, in the context of $\varphi$-divergences, can be written as $2n\widehat{D}_{KL_m}(\mathcal{M},P_0)$, and that the EL estimate of $\theta_0$ can be written as $\widehat{\theta}_{KL_m} = \arg\inf_{\theta\in\Theta}\widehat{D}_{KL_m}(\mathcal{M}_\theta,P_0)$; see Remark 4.3 below. In the case of the power functions $\varphi = \varphi_\gamma$, the corresponding estimates (3.7) belong to the class of GEL estimates introduced by Newey and Smith (2004), and (3.5) are the empirical Cressie-Read statistics introduced by Baggerly (1998) and Corcoran (1998).
The constrained optimization problems (3.5), (3.6) and (3.7) can be transformed into unconstrained ones making use of some arguments of "duality" which we briefly state hereunder from Rockafellar (1970). On the other hand, deriving asymptotic statistical results for the estimates and the test statistics, under misspecification or under alternative hypotheses, requires handling existence conditions and the characterization of the projection of $P_0$ on the submodel $\mathcal{M}_\theta$ or on the entire model $\mathcal{M}$. This will also be considered through duality in the following Section.
4. Dual representation of φ-divergences under constraints
This Section is central for our purposes. Indeed, it provides the explicit form of the proposed estimates by transforming the constrained problems (3.5) into unconstrained ones, using Lagrangian duality which is a classical tool in optimization theory. This Section adapts this formalism to the context of divergences and the present statistical setting. The Lagrangian "dual" problems, corresponding to the "primal" ones
\[ \inf_{Q\in\mathcal{M}_\theta} D_\varphi(Q,P_0) \tag{4.1} \]
and its empirical counterpart (3.5), make use of the Fenchel-Legendre transform of $\varphi$, defined through
\[ \psi : t \in \mathbb{R} \mapsto \psi(t) := \sup_{x\in\mathbb{R}} \{tx - \varphi(x)\}. \tag{4.2} \]
The "dual" problems associated to (4.1) and (3.5) are respectively
\[ \sup_{t\in\mathbb{R}^{1+l}} \left\{ t_0 - \int_{\mathbb{R}^m} \psi\Big(t_0 + \sum_{j=1}^l t_j g_j(x,\theta)\Big)\, dP_0(x) \right\} \tag{4.3} \]
and
\[ \sup_{t\in\mathbb{R}^{1+l}} \left\{ t_0 - \frac{1}{n}\sum_{i=1}^n \psi\Big(t_0 + \sum_{j=1}^l t_j g_j(X_i,\theta)\Big) \right\}. \tag{4.4} \]
In the following propositions, we state sufficient conditions under which the primal problems (4.1) and (3.5) coincide respectively with the dual ones (4.3) and (4.4). First, recall some properties of the convex conjugate $\psi$ of $\varphi$. For the proofs we refer to Rockafellar (1970), Section 26. The function $\psi$ is convex and closed, and its domain is an interval with endpoints
\[ a^* = \lim_{x\to-\infty} \frac{\varphi(x)}{x}, \qquad b^* = \lim_{x\to+\infty} \frac{\varphi(x)}{x}, \tag{4.5} \]
satisfying $a^* < 0 < b^*$ and $\psi(0) = 0$. The strict convexity of $\varphi$ on its domain $(a,b)$ is equivalent to the condition that its conjugate $\psi$ is essentially smooth, i.e., differentiable with
\[ \lim_{t\downarrow a^*} \psi'(t) = -\infty \text{ if } a^* > -\infty, \qquad \lim_{t\uparrow b^*} \psi'(t) = +\infty \text{ if } b^* < +\infty. \tag{4.6} \]
Conversely, $\varphi$ is essentially smooth on its domain $(a,b)$ if and only if $\psi$ is strictly convex on its domain $(a^*,b^*)$. In all the sequel, we assume additionally that $\varphi$ is essentially smooth. Hence, $\psi$ is strictly convex on its domain $(a^*,b^*)$, and it holds that
\[ a^* = \lim_{x\downarrow a} \varphi'(x), \qquad b^* = \lim_{x\uparrow b} \varphi'(x), \]
and
\[ \psi(t) = t\,\varphi'^{-1}(t) - \varphi\left(\varphi'^{-1}(t)\right), \quad \text{for all } t \in\, ]a^*,b^*[. \tag{4.7} \]
It holds also that $\psi$ is twice continuously differentiable on $]a^*,b^*[$, with
\[ \psi'(t) = \varphi'^{-1}(t) \quad \text{and} \quad \psi''(t) = \frac{1}{\varphi''\left(\varphi'^{-1}(t)\right)}. \tag{4.8} \]
In particular, $\psi'(0) = 1$ and $\psi''(0) = 1$. Since $\varphi$ is assumed to be closed, we have $\varphi(a) = \lim_{x\downarrow a}\varphi(x)$ and $\varphi(b) = \lim_{x\uparrow b}\varphi(x)$, which may be finite or infinite. Likewise, by closedness of $\psi$, we have
\[ \psi(a^*) = \lim_{t\downarrow a^*} \psi(t) \quad \text{and} \quad \psi(b^*) = \lim_{t\uparrow b^*} \psi(t). \]
Finally, the first and second derivatives of $\varphi$ at $a$ and $b$ are defined to be the limits of $\varphi'(x)$ and $\varphi''(x)$ when $x \downarrow a$ and when $x \uparrow b$. The first and second derivatives of $\psi$ at $a^*$ and $b^*$ are defined in a similar way. In Table 1, we give the convex conjugates $\psi$ of some functions $\varphi$ associated to standard divergences. We also determine their domains, $(a,b)$ and $(a^*,b^*)$.
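Formula (4.7) gives a direct way to compute conjugates. As a sanity check, the Python sketch below evaluates $\psi$ through (4.7) for the power functions (2.3) with $\gamma \notin \{0,1\}$ and compares it with the closed form listed for $\varphi_\gamma$ in Table 1. The helper names are illustrative assumptions; the inverse derivative $\varphi_\gamma'^{-1}(t) = ((\gamma-1)t+1)^{1/(\gamma-1)}$ is obtained by inverting $\varphi_\gamma'(x) = (x^{\gamma-1}-1)/(\gamma-1)$.

```python
import math

def phi(x, gamma):
    """Power function (2.3), gamma not in {0, 1}."""
    return (x**gamma - gamma * x + gamma - 1.0) / (gamma * (gamma - 1.0))

def phi_prime_inv(t, gamma):
    """Inverse of phi_gamma'(x) = (x**(gamma - 1) - 1) / (gamma - 1)."""
    return ((gamma - 1.0) * t + 1.0) ** (1.0 / (gamma - 1.0))

def psi_from_47(t, gamma):
    """Conjugate via (4.7): psi(t) = t * x_t - phi(x_t), where x_t = phi'^{-1}(t)."""
    x_t = phi_prime_inv(t, gamma)
    return t * x_t - phi(x_t, gamma)

def psi_closed_form(t, gamma):
    """Closed form of the conjugate of phi_gamma (last row of Table 1)."""
    return ((gamma * t - t + 1.0) ** (gamma / (gamma - 1.0)) - 1.0) / gamma
```

Both functions require $(\gamma-1)t + 1 > 0$, which restricts $t$ to the domain of $\psi$ reported in Table 1 for the corresponding divergence.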
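Table 1. Convex conjugates for some standard divergences.

| $D_\varphi$ | $\varphi(x)$ | $\mathrm{dom}\,\varphi$ | $\mathrm{dom}\,\psi$ | $\psi(t)$ |
| $D_{KL_m}$ | $-\log x + x - 1$ | $]0,+\infty[$ | $]-\infty,1[$ | $-\log(1-t)$ |
| $D_{KL}$ | $x\log x - x + 1$ | $[0,+\infty[$ | $\mathbb{R}$ | $e^t - 1$ |
| $D_{\chi^2_m}$ | $\frac{1}{2}\frac{(x-1)^2}{x}$ | $]0,+\infty[$ | $]-\infty,\frac{1}{2}[$ | $1 - \sqrt{1-2t}$ |
| $D_{\chi^2}$ | $\frac{1}{2}(x-1)^2$ | $\mathbb{R}$ | $\mathbb{R}$ | $\frac{1}{2}t^2 + t$ |
| $D_H$ | $2(\sqrt{x}-1)^2$ | $[0,+\infty[$ | $]-\infty,2[$ | $\frac{2t}{2-t}$ |
| $D_{\varphi_\gamma}$ | $\frac{x^\gamma - \gamma x + \gamma - 1}{\gamma(\gamma-1)}$ | - | - | $\frac{1}{\gamma}(\gamma t - t + 1)^{\frac{\gamma}{\gamma-1}} - \frac{1}{\gamma}$ |

Proposition 4.1. Let $\theta$ be a given value in $\Theta$. If there exists $Q_0$ in $\mathcal{M}^{(n)}_\theta$ such that
\[ a < Q_0(X_i) < b, \quad \text{for all } i = 1,\dots,n, \tag{4.9} \]
then
\[ \inf_{Q\in\mathcal{M}^{(n)}_\theta} D_\varphi(Q,P_n) = \sup_{t\in\mathbb{R}^{1+l}} \left\{ t_0 - \frac{1}{n}\sum_{i=1}^n \psi\Big(t_0 + \sum_{j=1}^l t_j g_j(X_i,\theta)\Big) \right\} \tag{4.10} \]
with dual attainment. Conversely, if there exists a dual optimal solution $\widehat{t}$ such that
\[ a^* < \widehat{t}_0 + \sum_{j=1}^l \widehat{t}_j g_j(X_i,\theta) < b^*, \quad \text{for all } i = 1,\dots,n, \tag{4.11} \]
then the equality (4.10) holds, and the unique optimal solution of the primal problem $\inf_{Q\in\mathcal{M}^{(n)}_\theta} D_\varphi(Q,P_n)$, namely the projection of $P_n$ on $\mathcal{M}^{(n)}_\theta$, is given by
\[ Q_n(X_i) = \frac{1}{n}\,\varphi'^{-1}\Big(\widehat{t}_0 + \sum_{j=1}^l \widehat{t}_j g_j(X_i,\theta)\Big), \quad i = 1,\dots,n, \]
where $\widehat{t}$ is a solution of the equations
\[ \begin{cases} 1 - \frac{1}{n}\sum_{i=1}^n \varphi'^{-1}\Big(\widehat{t}_0 + \sum_{j=1}^l \widehat{t}_j g_j(X_i,\theta)\Big) = 0, \\ -\frac{1}{n}\sum_{i=1}^n g_j(X_i,\theta)\,\varphi'^{-1}\Big(\widehat{t}_0 + \sum_{j=1}^l \widehat{t}_j g_j(X_i,\theta)\Big) = 0, \quad j = 1,\dots,l. \end{cases} \]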
Remark 4.1. For the $\chi^2$-divergence, we have $a = -\infty$ and $b = +\infty$. Hence, condition (4.9) holds whenever $\mathcal{M}^{(n)}_\theta$ is not void. More generally, the above Proposition holds for any $\varphi$-divergence whose $\varphi$ function satisfies $\mathrm{dom}\,\varphi = \mathbb{R}$.
Remark 4.2. Assume that $g(x,\theta) := x - \theta$. Then, for any divergence $D_\varphi$ with $\mathrm{dom}\,\varphi = ]0,+\infty[$, which is the case of the modified $\chi^2$ divergence and the modified Kullback-Leibler divergence (or equivalently the EL method), condition (4.9) means that $\theta$ is an interior point of the convex hull of the data $(X_1,\dots,X_n)$. This is precisely what is checked in Owen (1990), p. 100, for the EL method; see also Owen (2001).
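For the asymptotic counterpart of the above results we have the following; see Theorem 1 in Broniatowski and Keziou (2006):
Proposition 4.2. Let $\theta$ be a given value in $\Theta$. Assume that $\int |g_j(x,\theta)|\, dP_0(x) < \infty$ for all $j = 1,\dots,l$. If there exists $Q_0$ in $\mathcal{M}_\theta$ with $D_\varphi(Q_0,P_0) < \infty$ and²
\[ a < \inf_x \frac{dQ_0}{dP_0}(x) \le \sup_x \frac{dQ_0}{dP_0}(x) < b, \quad P_0\text{-a.s.}, \tag{4.12} \]
then
\[ \inf_{Q\in\mathcal{M}_\theta} D_\varphi(Q,P_0) = \sup_{t\in\mathbb{R}^{1+l}} \left\{ t_0 - \int_{\mathbb{R}^m} \psi\Big(t_0 + \sum_{j=1}^l t_j g_j(x,\theta)\Big)\, dP_0(x) \right\} \tag{4.13} \]
with dual attainment. Conversely, if there exists a dual optimal solution $t^*$ which is an interior point of the set
\[ \left\{ t \in \mathbb{R}^{1+l} \text{ such that } \int_{\mathbb{R}^m} \Big|\psi\Big(t_0 + \sum_{j=1}^l t_j g_j(x,\theta)\Big)\Big|\, dP_0(x) < \infty \right\}, \tag{4.14} \]
then the dual equality (4.13) holds, and the unique optimal solution $Q^*_\theta$ of the primal problem $\inf_{Q\in\mathcal{M}_\theta} D_\varphi(Q,P_0)$, namely the projection of $P_0$ on $\mathcal{M}_\theta$, is given by
\[ \frac{dQ^*_\theta}{dP_0}(x) = \varphi'^{-1}\Big(t^*_0 + \sum_{j=1}^l t^*_j g_j(x,\theta)\Big), \]
where $t^*$ is a solution of
\[ \begin{cases} 1 - \int \varphi'^{-1}\Big(t^*_0 + \sum_{j=1}^l t^*_j g_j(x,\theta)\Big)\, dP_0(x) = 0, \\ -\int g_j(x,\theta)\,\varphi'^{-1}\Big(t^*_0 + \sum_{j=1}^l t^*_j g_j(x,\theta)\Big)\, dP_0(x) = 0, \quad j = 1,\dots,l. \end{cases} \tag{4.15} \]
Furthermore, $t^*$ is unique if the functions $\mathbb{1}_{\mathbb{R}^m}, g_1(\cdot,\theta),\dots,g_l(\cdot,\theta)$ are linearly independent in the sense that $P_0\big\{x \mid t_0 + \sum_{j=1}^l t_j g_j(x,\theta) \neq 0\big\} > 0$ for all $t \in \mathbb{R}^{1+l}$ with $t \neq 0$.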
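For the sake of brevity and clarity, we introduce some additional notation. Denote by $\overline{g}$ the vector valued function $(\mathbb{1}_{\mathbb{R}^m}, g_1,\dots,g_l)^T$. For any p.m. $P$ and any measurable function $f$ on $\mathbb{R}^m$, $Pf$ denotes the integral $\int_{\mathbb{R}^m} f(x)\, dP(x)$. Let
\[ m(x,\theta,t) := t_0 - \psi\left(t^T\overline{g}(x,\theta)\right), \quad \text{for all } x \in \mathbb{R}^m,\ \theta \in \Theta \subset \mathbb{R}^d,\ t \in \mathbb{R}^{1+l}. \tag{4.16} \]
Note that the sup in (4.10) and (4.13) can be restricted respectively to the sets
\[ \Lambda_n(\theta) := \left\{ t \in \mathbb{R}^{1+l} \mid a^* < t^T\overline{g}(X_i,\theta) < b^*, \text{ for all } i = 1,\dots,n \right\} \tag{4.17} \]
and
\[ \Lambda(\theta) := \left\{ t \in \mathbb{R}^{1+l} \;\Big|\; \int_{\mathbb{R}^m} \Big|\psi\Big(t_0 + \sum_{j=1}^l t_j g_j(x,\theta)\Big)\Big|\, dP_0(x) < \infty \right\}. \tag{4.18} \]
In view of the above propositions, we redefine the estimates (3.5), (3.6) and (3.7) as follows:
\[ \widehat{D}_\varphi(\mathcal{M}_\theta,P_0) := \sup_{t\in\Lambda_n(\theta)} \frac{1}{n}\sum_{i=1}^n m(X_i,\theta,t) := \sup_{t\in\Lambda_n(\theta)} P_n m(\theta,t), \tag{4.19} \]
\[ \widehat{D}_\varphi(\mathcal{M},P_0) := \inf_{\theta\in\Theta}\sup_{t\in\Lambda_n(\theta)} \frac{1}{n}\sum_{i=1}^n m(X_i,\theta,t) := \inf_{\theta\in\Theta}\sup_{t\in\Lambda_n(\theta)} P_n m(\theta,t) \tag{4.20} \]
and
\[ \widehat{\theta}_\varphi := \arg\inf_{\theta\in\Theta}\sup_{t\in\Lambda_n(\theta)} \frac{1}{n}\sum_{i=1}^n m(X_i,\theta,t) := \arg\inf_{\theta\in\Theta}\sup_{t\in\Lambda_n(\theta)} P_n m(\theta,t). \tag{4.21} \]

² The strict inequalities in (4.12) mean that $P_0\big\{x \in \mathbb{R}^m \mid \frac{dQ_0}{dP_0}(x) \le a\big\} = P_0\big\{x \mid \frac{dQ_0}{dP_0}(x) \ge b\big\} = 0$.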
Remark 4.3. When $\varphi(x) = -\log x + x - 1$, the estimate (3.7) clearly coincides with the EL one, so it can be seen as the value of the parameter which minimizes the $KL_m$-divergence between the model $\mathcal{M}$ and the empirical measure $P_n$ of the data. The statistic $2n\widehat{D}_{KL_m}(\mathcal{M},P_0)$, see (3.6), coincides with the empirical likelihood ratio associated to the null hypothesis $H_0 : P_0 \in \mathcal{M}$ against the alternative $H_1 : P_0 \notin \mathcal{M}$. The dual representation of $\widehat{D}_{KL_m}(\mathcal{M},P_0)$, see (4.20), is
\[ \widehat{D}_{KL_m}(\mathcal{M},P_0) = \inf_{\theta\in\Theta}\sup_{t\in\Lambda_n(\theta)} \left\{ t_0 + \frac{1}{n}\sum_{i=1}^n \log\Big(1 - t_0 - \sum_{j=1}^l t_j g_j(X_i,\theta)\Big) \right\}. \]
For a given $\theta \in \Theta$, the $KL_m$-projection $Q_n$ of $P_n$ on $\mathcal{M}_\theta$ is given by (see Proposition 4.1)
\[ \frac{1}{Q_n(X_i)} = n\Big(1 - t^*_0 - \sum_{j=1}^l t^*_j g_j(X_i,\theta)\Big), \quad i = 1,\dots,n, \]
which, multiplying by $Q_n(X_i)$ and summing over $i$, yields $t^*_0 = 0$. Therefore, $t_0$ can be omitted, and (replacing $t_j$ by $-t_j$) the above representation can be rewritten as follows:
\[ \widehat{D}_{KL_m}(\mathcal{M},P_0) = \inf_{\theta\in\Theta}\sup_{t_1,\dots,t_l} \frac{1}{n}\sum_{i=1}^n \log\Big(1 + \sum_{j=1}^l t_j g_j(X_i,\theta)\Big) \]
and then
\[ \widehat{\theta}_{KL_m} = \widehat{\theta}_{EL} = \arg\inf_{\theta\in\Theta}\sup_{t_1,\dots,t_l} \frac{1}{n}\sum_{i=1}^n \log\Big(1 + \sum_{j=1}^l t_j g_j(X_i,\theta)\Big), \]
in which the sup is taken over the set
\[ \left\{ (t_1,\dots,t_l) \in \mathbb{R}^l \;\Big|\; -1 < \sum_{j=1}^l t_j g_j(X_i,\theta) < +\infty, \text{ for all } i = 1,\dots,n \right\}. \]
This is the ordinary dual representation of the EL estimate; see Qin and Lawless (1994) and Owen (2001).
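To make the reduced EL dual concrete, here is a minimal Python sketch for a scalar moment function ($l = 1$), assuming NumPy is available. It maximizes $t \mapsto \frac{1}{n}\sum_i \log(1 + t\,g(X_i,\theta))$ for a fixed $\theta$ by a damped Newton iteration and recovers the projection weights $Q_n(X_i) = 1/(n(1 + \widehat{t}\,g(X_i,\theta)))$. The function names, the tolerance and the damping rule are illustrative assumptions rather than the authors' algorithm.

```python
import numpy as np

def el_dual(g, tol=1e-12, max_iter=100):
    """Maximise t -> mean(log(1 + t * g_i)) for scalar moments g_i = g(X_i, theta).

    Assumes min(g) < 0 < max(g), i.e. zero is interior to the convex hull of the g_i.
    Returns (t_hat, value); the EL ratio statistic at this theta is 2 * n * value.
    """
    g = np.asarray(g, dtype=float)
    f = lambda s: np.mean(np.log1p(s * g))
    t = 0.0
    for _ in range(max_iter):
        w = 1.0 + t * g
        grad = np.mean(g / w)
        hess = -np.mean((g / w) ** 2)
        step = -grad / hess                       # Newton step (objective is concave)
        # Halve the step while it leaves the domain or clearly decreases the objective.
        while abs(step) > tol and (np.any(1.0 + (t + step) * g <= 0)
                                   or f(t + step) < f(t) - 1e-12):
            step *= 0.5
        t += step
        if abs(step) < tol:
            break
    return t, float(f(t))

def el_weights(g, t):
    """Projection weights Q_n(X_i) = 1 / (n * (1 + t * g_i))."""
    g = np.asarray(g, dtype=float)
    return 1.0 / (len(g) * (1.0 + t * g))
```

At the optimum the first order condition $\frac{1}{n}\sum_i g_i/(1+\widehat{t}g_i) = 0$ holds, which is exactly the moment constraint $\sum_i Q_n(X_i)\,g(X_i,\theta) = 0$; the weights then also sum to one.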
Remark 4.4. Consider the power divergences, associated to the power functions $\varphi_\gamma$; see (2.3) and (2.4). We will show that the estimates $\widehat{\theta}_{\varphi_\gamma}$ belong to the class of GEL estimators introduced by Newey and Smith (2004). By Proposition 4.1, the projection $Q_n$ of $P_n$ on $\mathcal{M}_\theta$ is given by
\[ Q_n(X_i) = \frac{1}{n}\Big[(\gamma-1)\Big(t^*_0 + \sum_{j=1}^l t^*_j g_j(X_i,\theta)\Big) + 1\Big]^{1/(\gamma-1)}, \quad i = 1,\dots,n. \]
Using the constraint $\sum_{i=1}^n Q_n(X_i) = 1$, we can express $t^*_0$ in terms of $t^*_1,\dots,t^*_l$, and hence the sup in the dual representation (4.21) can be reduced to a subset of $\mathbb{R}^l$, as in Newey and Smith (2004). When $\varphi(x) = \frac{1}{2}(x-1)^2$, then $\widehat{\theta}_\varphi$ coincides with the continuous updating estimator of Hansen et al. (1996).
Remark 4.5. (Numerical calculation of the estimates and the specific role of the $\chi^2$-divergence). The computation of $\widehat{t}(\theta)$ for fixed $\theta \in \Theta$ as defined in (4.15) is difficult when handling a generic divergence. In the case of the $\chi^2$-divergence, i.e., when $\varphi(x) = \frac{1}{2}(x-1)^2$, optimizing over all s.f.m.'s, the system (4.15) is linear; we thus easily obtain an explicit form for $\widehat{t}(\theta)$, which in turn allows for a single gradient descent when optimizing over $\Theta$. This procedure is useful in order to calculate the estimates for all other divergences (for which the corresponding system is nonlinear) including EL, since it provides an easy starting point for the resulting double gradient descent.
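The linearity mentioned in this remark can be illustrated as follows. For the $\chi^2$-divergence, $\varphi'^{-1}(t) = 1 + t$, so the empirical first order conditions of Proposition 4.1 reduce to the linear system $S\,t = e_0 - \frac{1}{n}\sum_i \overline{g}(X_i,\theta)$ with $S = \frac{1}{n}\sum_i \overline{g}(X_i,\theta)\overline{g}(X_i,\theta)^T$ and $e_0 = (1,0,\dots,0)^T$. The Python sketch below (NumPy assumed; the function name `chi2_dual` and its return convention are illustrative assumptions) solves this system and returns the projection weights and the estimated divergence.

```python
import numpy as np

def chi2_dual(G):
    """Closed-form dual solution for the chi-square divergence at a fixed theta.

    G is the (n, l) array whose rows are g(X_i, theta). Returns (t_hat, Q, value):
    Q_i = (1 + t_hat' gbar_i) / n are the projection weights (they may be negative,
    since the projection is taken over signed measures) and value is the dual optimum.
    """
    G = np.asarray(G, dtype=float)
    n = G.shape[0]
    Gbar = np.column_stack([np.ones(n), G])   # rows gbar_i = (1, g_1, ..., g_l)
    S = Gbar.T @ Gbar / n
    rhs = -Gbar.mean(axis=0)
    rhs[0] += 1.0                             # e_0 - mean(gbar)
    t_hat = np.linalg.solve(S, rhs)
    u = Gbar @ t_hat                          # t' gbar_i
    Q = (1.0 + u) / n
    value = t_hat[0] - np.mean(0.5 * u**2 + u)  # t_0 - mean(psi(t' gbar_i))
    return t_hat, Q, value
```

The solve step requires $S$ to be nonsingular, i.e. the functions $1, g_1,\dots,g_l$ must be linearly independent on the sample. The returned `value` coincides with the primal criterion $\frac{1}{n}\sum_i \frac{1}{2}(nQ_i - 1)^2$ of (3.5).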
5. Asymptotic properties of the estimates of the parameter and the estimates of the divergences
5.1. Under the model. This Section addresses Problems 1 and 2, aiming at testing the null hypothesis $H_0 : P_0 \in \mathcal{M}$ against the alternative $H_1 : P_0 \notin \mathcal{M}$. We present the limit distributions of the proposed test statistics, which are the estimated divergences between the model $\mathcal{M}$ and $P_0$. We also derive the limit distributions of the estimates of $\theta_0$. The following two Theorems extend Theorems 3.1 and 3.2 in Newey and Smith (2004) to the context of the divergence based approach. The assumptions which we consider match those of Theorems 3.1 and 3.2 in Newey and Smith (2004).
Assumption 1. (a) $P_0 \in \mathcal{M}$ and $\theta_0 \in \Theta$ is the unique solution to $E[g(X,\theta)] = 0$; (b) $\Theta \subset \mathbb{R}^d$ is compact; (c) $g(X,\theta)$ is continuous at each $\theta \in \Theta$ with probability one; (d) $E[\sup_{\theta\in\Theta}\|g(X,\theta)\|^\alpha] < \infty$ for some $\alpha > 2$; (e) the matrix $\Omega := E[g(X,\theta_0)g(X,\theta_0)^T]$ is nonsingular.
Theorem 5.1. Under Assumption 1, the estimate $\widehat{\theta}_\varphi$ exists and converges to $\theta_0$ in probability, $\frac{1}{n}\sum_{i=1}^n g(X_i,\widehat{\theta}_\varphi) = O_P(1/\sqrt{n})$, $\widehat{t}(\widehat{\theta}_\varphi) := \arg\sup_{t\in\Lambda_n(\widehat{\theta}_\varphi)} P_n m(\widehat{\theta}_\varphi,t)$ exists and belongs to $\mathrm{int}(\Lambda_n(\widehat{\theta}_\varphi))$ with probability approaching one as $n \to \infty$, and $\widehat{t}(\widehat{\theta}_\varphi) = O_P(1/\sqrt{n})$.
In order to obtain asymptotic normality, we need some additional assumptions. Denote by $G$ the matrix $G := E[\partial g(X,\theta_0)/\partial\theta]$.
Assumption 2. (a) $\theta_0 \in \mathrm{int}(\Theta)$; (b) with probability one, $g(X,\theta)$ is continuously differentiable in a neighborhood $N$ of $\theta_0$ and $E[\sup_{\theta\in N}\|\partial g(X,\theta)/\partial\theta\|] < \infty$; (c) $\mathrm{rank}(G) = d$.
Theorem 5.2. Assume that Assumptions 1 and 2 hold. Then,
(1) $\sqrt{n}\,(\widehat{\theta}_\varphi - \theta_0)$ converges in distribution to a centered normal vector with covariance matrix $V := [G\Omega^{-1}G^T]^{-1}$.
(2) If $l > d$, the statistic $2n\widehat{D}_\varphi(\mathcal{M},P_0)$ converges in distribution to a $\chi^2$ random variable with $(l-d)$ degrees of freedom.
Remark 5.1. The above Theorem makes it possible to perform statistical tests (of the model) with asymptotic level $\alpha$. Consider the null hypothesis
\[ H_0 : P_0 \in \mathcal{M} \quad \text{against the alternative} \quad H_1 : P_0 \notin \mathcal{M}. \tag{5.1} \]
The critical region is then
\[ C_\varphi := \left\{ 2n\widehat{D}_\varphi(\mathcal{M},P_0) > q_{(1-\alpha)} \right\}, \]
where $q_{(1-\alpha)}$ is the $(1-\alpha)$-quantile of the $\chi^2(l-d)$ distribution. When $\varphi(x) = -\log x + x - 1$, the corresponding test is the empirical likelihood ratio one; see Qin and Lawless (1994).
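As a rough illustration of this test, the Python sketch below computes $\widehat{\theta}_\varphi$ and $2n\widehat{D}_\varphi(\mathcal{M},P_0)$ for the $\chi^2$-divergence, chosen here only because its inner sup has a closed form. Everything specific is an assumption made for the example: the toy model with $l = 2$ and $d = 1$ (mean $\theta$, variance 1), the grid search over $\theta$ in place of a proper optimizer, and the default critical value 3.841, which approximates the 0.95-quantile of $\chi^2(1)$.

```python
import numpy as np

def chi2_divergence(G):
    """Estimated chi-square divergence for moment rows G of shape (n, l).

    The dual optimum of (4.19) equals t_0 / 2 for this divergence, where t solves
    the linear system S t = e_0 - mean(gbar).
    """
    n = G.shape[0]
    Gbar = np.column_stack([np.ones(n), G])
    S = Gbar.T @ Gbar / n
    rhs = -Gbar.mean(axis=0)
    rhs[0] += 1.0
    t = np.linalg.solve(S, rhs)
    return 0.5 * t[0]

def moments(x, theta):
    """Hypothetical toy model with l = 2, d = 1: E[X] = theta and Var(X) = 1."""
    return np.column_stack([x - theta, x**2 - theta**2 - 1.0])

def model_test(x, grid, quantile=3.841):
    """Return (theta_hat, statistic, reject) using a grid search over theta."""
    values = np.array([chi2_divergence(moments(x, th)) for th in grid])
    k = int(np.argmin(values))
    stat = 2.0 * len(x) * values[k]
    return grid[k], stat, bool(stat > quantile)
```

A possible call is `model_test(x, np.linspace(0.0, 2.0, 401))` for a sample `x`; the third returned value indicates whether $H_0$ is rejected at the approximate 5% level.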
5.2. Asymptotic properties of the estimates of the divergences for a given value of the parameter. For a given $\theta \in \Theta$, consider the test problems of the null hypothesis $H_0 : P_0 \in \mathcal{M}_\theta$ against two different families of alternative hypotheses: $H_1 : P_0 \notin \mathcal{M}_\theta$ and $H_1' : P_0 \in \mathcal{M}\setminus\mathcal{M}_\theta$. Those two tests address different situations since $H_1$ may include misspecification of the model. We present two different test statistics, each pertaining to one of the situations, and derive their limit distributions both under $H_0$ and under the alternatives. As a by-product we also derive confidence areas for the true value $\theta_0$ of the parameter. We will state the convergence in probability of $\widehat{D}_\varphi(\mathcal{M}_\theta,P_0)$ to $D_\varphi(\mathcal{M}_\theta,P_0)$, and we will obtain the limit law of $\widehat{D}_\varphi(\mathcal{M}_\theta,P_0)$ both when $P_0 \in \mathcal{M}_\theta$ and when $P_0 \notin \mathcal{M}_\theta$. Obviously, when $P_0 \in \mathcal{M}_\theta$, this means that $\theta = \theta_0$ since the true value $\theta_0$ of the parameter is assumed to be unique.
Assumption 3. (a) $P_0 \in \mathcal{M}_\theta$ and $\theta$ is the unique solution to $E[g(X,\theta)] = 0$; (b) $E[\|g(X,\theta)\|^\alpha] < \infty$ for some $\alpha > 2$; (c) the matrix $\Omega := E[g(X,\theta)g(X,\theta)^T]$ is nonsingular.
Theorem 5.3. Under Assumption 3, we have
(1) $\widehat{t}(\theta) := \arg\sup_{t\in\Lambda(\theta)} P_n m(\theta,t)$ exists and belongs to $\mathrm{int}(\Lambda(\theta))$ with probability approaching one as $n \to \infty$, and $\widehat{t}(\theta) = O_P(1/\sqrt{n})$.
(2) The statistic $2n\widehat{D}_\varphi(\mathcal{M}_\theta,P_0)$ converges in distribution to a $\chi^2(l)$ random variable.
In order to obtain the limit distribution of the test statistic $2n\widehat{D}_\varphi(\mathcal{M}_\theta,P_0)$ under the alternative $H_1 : P_0 \notin \mathcal{M}_\theta$, including misspecification, the following assumption is needed.
Assumption 4. (a) $P_0 \notin \mathcal{M}_\theta$, and $t^*(\theta) := \arg\sup_{t\in\Lambda(\theta)} E[m(X,\theta,t)]$ exists and is an interior point of $\Lambda(\theta)$; (b) $E[\sup_{t\in N}|m(X,\theta,t)|] < \infty$ for some compact set $N \subset \Lambda(\theta)$ such that $t^*(\theta) \in \mathrm{int}(N)$; (c) the functions $\mathbb{1}_{\mathbb{R}^m}, g_1,\dots,g_l$ are linearly independent in the following sense: $P_0\big\{x \mid t_0 + \sum_{j=1}^l t_j g_j(x,\theta) \neq 0\big\} > 0$ for all $t \in \mathbb{R}^{1+l}$ with $t \neq 0$.
Assumption (c) above ensures the strict concavity of the function $t \in \Lambda(\theta) \mapsto E[m(X,\theta,t)]$; otherwise $t^*(\theta)$ may not be uniquely defined, implying possible inconsistency of $\widehat{t}(\theta)$.
Theorem 5.4. Under Assumption 4, when $P_0 \notin \mathcal{M}_\theta$, we have
(1) $\widehat{t}(\theta)$ converges in probability to $t^*(\theta)$.
(2) $\widehat{D}_\varphi(\mathcal{M}_\theta,P_0)$ converges in probability to $D_\varphi(\mathcal{M}_\theta,P_0)$.
We now give the limit distribution of the test statistics under $H_1$. We need the following additional condition.
Assumption 5. (a) With probability one, the function $t \mapsto m(X,\theta,t)$ is $C^3$ in a neighborhood $N(t^*(\theta))$ of $t^*(\theta)$, and all third order partial derivatives (w.r.t. $t$) of $\{t \mapsto m(X,\theta,t);\ t \in N\}$ are dominated by some $P_0$-integrable function; (b) $E[m(X,\theta,t^*(\theta))^2] < \infty$, $E[\|\partial m(X,\theta,t^*(\theta))/\partial t\|^2] < \infty$, and the matrix $E[\partial^2 m(X,\theta,t^*(\theta))/\partial t^2]$ exists and is nonsingular.
Theorem 5.5. Under Assumptions 4 and 5, we have
(1) $\sqrt{n}\,(\widehat{t}(\theta) - t^*(\theta))$ converges in distribution to a centered normal vector with covariance matrix
\[ \left[E[m''(X,\theta,t^*)]\right]^{-1} E\left[m'(X,\theta,t^*)\,m'(X,\theta,t^*)^T\right] \left[E[m''(X,\theta,t^*)]\right]^{-1}. \]
(2) $\sqrt{n}\,\big(\widehat{D}_\varphi(\mathcal{M}_\theta,P_0) - D_\varphi(\mathcal{M}_\theta,P_0)\big)$ converges in distribution to a centered normal random variable with variance
\[ \sigma^2(\theta) = E\left[m(X,\theta,t^*(\theta))^2\right] - \left[E[m(X,\theta,t^*(\theta))]\right]^2. \]
Remark 5.2. Let θ be a given value in Θ. Consider the test problem of the null hypothesis (5.2) H
0: P
0∈ M
θagainst P
0∈ M /
θ.
In view of Theorem 5.3 part 2, we reject H
0against H
1at asymptotic level α when 2n D b
φ( M
θ, P
0) exceeds the (1 − α)- quantile of the χ
2(l) distribution. Theorem 5.5 part 2 is useful to give an approximation to the power function
P
0∈ M /
θ7→ β(P
0) := P
0h 2n D b
φ( M
θ, P
0) > q
(1−α)i . We obtain then the following approximation
(5.3) β(P
0) ≈ 1 − F
N√ n σ(θ)
h q
1−α2n − D
φ( M
θ, P
0) i ,
where $F_N$ is the cumulative distribution function of the standard normal distribution. From this approximation, we can give the approximate sample size that ensures a desired power $\beta$ for a given alternative $P_0 \notin \mathcal{M}_\theta$. Let $n_0$ be the positive root of the equation
$$\beta = 1 - F_N\left(\frac{\sqrt{n}}{\sigma(\theta)}\left[\frac{q_{(1-\alpha)}}{2n} - D_\phi(\mathcal{M}_\theta, P_0)\right]\right),$$
i.e.,
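$$n_0 = \frac{(a+b) - \sqrt{a(a+2b)}}{2 D_\phi(\mathcal{M}_\theta, P_0)^2}$$
with $a := \sigma(\theta)^2 \left[F_N^{-1}(1-\beta)\right]^2$ and $b := q_{(1-\alpha)} D_\phi(\mathcal{M}_\theta, P_0)$. The required sample size is then $\lfloor n_0 \rfloor + 1$, where $\lfloor n_0 \rfloor$ is the integer part of $n_0$.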
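The approximation (5.3) and the sample-size rule can be evaluated numerically. The following is a minimal sketch in Python using only the standard library; the function names and the numerical inputs ($\sigma(\theta) = 1$, $D_\phi(\mathcal{M}_\theta, P_0) = 0.1$, $q_{(1-\alpha)} = 3.841$) are hypothetical and are not taken from the paper. Rather than relying on a closed form, the sketch solves the defining equation as a quadratic in $x = \sqrt{n}$, namely $D x^2 + z\sigma x - q/2 = 0$ with $z := F_N^{-1}(1-\beta)$, which has a single positive root. Squaring that root gives $n_0 = [(a+b) \pm \sqrt{a(a+2b)}]/(2D^2)$, where the minus sign corresponds to $z \geq 0$ and the plus sign to $z < 0$ (that is, $\beta > 1/2$); working with $\sqrt{n}$ avoids choosing the sign by hand.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal law, plays the role of F_N


def approx_power(n, sigma, divergence, q):
    """Approximation (5.3): 1 - F_N( sqrt(n)/sigma * (q/(2n) - D) )."""
    arg = math.sqrt(n) / sigma * (q / (2.0 * n) - divergence)
    return 1.0 - _STD_NORMAL.cdf(arg)


def required_sample_size(beta, sigma, divergence, q):
    """Return (floor(n0) + 1, n0), where n0 solves approx_power(n0) = beta."""
    z = _STD_NORMAL.inv_cdf(1.0 - beta)
    # With x = sqrt(n), the equation reads: D * x**2 + z * sigma * x - q/2 = 0.
    disc = (z * sigma) ** 2 + 2.0 * q * divergence
    x = (-z * sigma + math.sqrt(disc)) / (2.0 * divergence)  # positive root
    n0 = x * x
    return math.floor(n0) + 1, n0


# Hypothetical inputs, for illustration only.
n_req, n0 = required_sample_size(beta=0.9, sigma=1.0, divergence=0.1, q=3.841)
print(n_req, approx_power(n_req, sigma=1.0, divergence=0.1, q=3.841))
```

In practice $\sigma(\theta)$ and $D_\phi(\mathcal{M}_\theta, P_0)$ depend on the alternative $P_0$ and have to be computed or estimated separately before calling these functions.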
Remark 5.3. (Generalized empirical likelihood ratio test). For testing $H_0: P_0 \in \mathcal{M}_\theta$ against the alternative $H_1': P_0 \in \mathcal{M} \setminus \mathcal{M}_\theta$, we propose to use the statistic
$$2nS_n^\phi := 2n\left[\widehat{D}_\phi(\mathcal{M}_\theta, P_0) - \inf_{\theta \in \Theta} \widehat{D}_\phi(\mathcal{M}_\theta, P_0)\right], \tag{5.4}$$
which converges in distribution to a $\chi^2(d)$ random variable under $H_0$ when assumptions 1 and 2 hold. This can be proved using arguments similar to those in Theorems 5.2 and 5.3. We then reject $H_0$ at asymptotic level $\alpha$ when $2nS_n^\phi > q_{(1-\alpha)}$, the $(1-\alpha)$-quantile of the $\chi^2(d)$ distribution. Under $H_1'$ and when assumptions 1, 2, 4 and 5 hold, as in Theorem 5.5, it can be proved that
$$\sqrt{n}\left(S_n^\phi - D_\phi(\mathcal{M}_\theta, P_0)\right) \tag{5.5}$$
converges to a centered normal random variable with variance
$$\sigma^2(\theta) := E[m(X,\theta,t^*(\theta))^2] - (E[m(X,\theta,t^*(\theta))])^2.$$
So, as in the above remark, we obtain the following approximation
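$$\beta(P_0) \approx 1 - F_N\left(\frac{\sqrt{n}}{\sigma(\theta)}\left[\frac{q_{(1-\alpha)}}{2n} - D_\phi(\mathcal{M}_\theta, P_0)\right]\right) \tag{5.6}$$
to the power function $P_0 \in \mathcal{M} \setminus \mathcal{M}_\theta \mapsto P_0\left[2nS_n^\phi > q_{(1-\alpha)}\right]$. The approximate sample size required to achieve a desired power for a given alternative can be obtained as in the above Remark.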
Remark 5.4. (Confidence region for the parameter). For a fixed level $\alpha$, using convergence (5.4), the set
$$\left\{\theta \in \Theta \text{ such that } 2nS_n^\phi \leq q_{(1-\alpha)}\right\}$$
is an asymptotic confidence region for $\theta_0$, where $q_{(1-\alpha)}$ is the $(1-\alpha)$-quantile of the $\chi^2(d)$ distribution.
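Remark 5.4 amounts to inverting the test: candidate values of $\theta$ are scanned and those that are not rejected are kept. The following is a minimal sketch under stated assumptions. It uses a hypothetical just-identified example that is not taken from the paper, $g(x,\theta) = x - \theta$ with the EL divergence, for which $\inf_\theta \widehat{D}_\phi(\mathcal{M}_\theta, P_0) = 0$ (attained at the sample mean), so that $2nS_n^\phi$ reduces to the usual empirical likelihood ratio statistic for the mean. The helper names, the grid and the data are illustrative, the dual equation is solved by bisection, and $q = 3.841$ is the 0.95-quantile of the $\chi^2(1)$ distribution.

```python
import math


def el_stat_mean(xs, theta, iters=60):
    """EL ratio statistic 2 * sum(log(1 + lam * (x_i - theta))) for the mean."""
    zs = [x - theta for x in xs]
    z_min, z_max = min(zs), max(zs)
    if z_min >= 0.0 or z_max <= 0.0:
        return math.inf  # theta is not inside the convex hull of the data
    # lam solves sum z_i / (1 + lam * z_i) = 0; the left-hand side is strictly
    # decreasing in lam on the interval where every 1 + lam * z_i is positive.
    lo, hi = -1.0 / z_max, -1.0 / z_min
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        if sum(z / (1.0 + lam * z) for z in zs) > 0.0:
            lo = lam
        else:
            hi = lam
    return 2.0 * sum(math.log1p(lam * z) for z in zs)


def confidence_region(xs, grid, q=3.841):
    """Grid points theta whose statistic does not exceed the quantile q."""
    return [theta for theta in grid if el_stat_mean(xs, theta) <= q]


# Illustrative data and grid (hypothetical, not from the paper).
data = [0.1 * i for i in range(21)]
grid = [0.05 + 0.01 * k for k in range(191)]
region = confidence_region(data, grid)
print(min(region), max(region))
```

For an over-identified model ($l > d$) the infimum over $\theta$ in (5.4) no longer vanishes and would have to be computed by a separate optimization.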
5.3. Under misspecification. We address Problem 1, stating the limit distribution of the proposed test statistics under the alternative $H_1: P_0 \notin \mathcal{M}$. This requires introducing $Q^*_{\theta^*}$, the projection of $P_0$ on $\mathcal{M}$. Assumption 6 below ensures the existence of the "pseudo-true" value $\theta^*$ as well as the existence of the projection $Q^*_{\theta^*}$ of $P_0$ on $\mathcal{M}$, and states some other necessary regularity conditions.
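Assumption 6. (a) $\Theta$ is compact, and $\theta^* := \arg\inf_{\theta \in \Theta} \sup_{t \in \Lambda(\theta)} E[m(X,\theta,t)]$ exists and is unique;
(b) $g(X,\theta)$ is continuous at each $\theta \in \Theta$ with probability one;
(c) $E\left[\sup_{\theta \in \Theta,\, t \in N(\theta)} |m(X,\theta,t)|\right] < \infty$, where $N(\theta) \subset \Lambda(\theta)$ is a compact set such that $t^*(\theta) \in \operatorname{int}(N(\theta))$;
(d) the functions $1_{\mathbb{R}^m}, g_1, \ldots, g_l$ are linearly independent in the following sense: $P_0\left\{x \mid t_0 + \sum_{j=1}^{l} t_j g_j(x,\theta) \neq 0\right\} > 0$ for all $t \in \mathbb{R}^{1+l}$ with $t \neq 0$.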
Theorem 5.6. Under assumption 6, we have
(1) $\|\widehat{t}(\theta) - t^*(\theta)\|$ converges in probability to 0 uniformly in $\theta \in \Theta$;
(2) $\widehat{\theta}_\phi$ converges in probability to $\theta^*$;
(3) $\widehat{D}_\phi(\mathcal{M}, P_0)$ converges in probability to $D_\phi(\mathcal{M}, P_0)$.
The asymptotic normality of the test statistics under misspecification requires the following additional conditions.
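Assumption 7. (a) $\theta^* \in \operatorname{int}(\Theta)$;
(b) with probability one, the function $(\theta,t) \mapsto m(X,\theta,t)$ is $C^3$ in a neighborhood $N \subset \Theta \times \Lambda(\Theta)$ of $(\theta^*, t^*(\theta^*))$, and all the third order partial derivative functions are dominated on $N$ by some $P_0$-integrable function;
(c) $E[m(X,\theta^*,t^*(\theta^*))^2]$, $E[\|\partial m(X,\theta^*,t^*(\theta^*))/\partial t\|^2]$ and $E[\|\partial m(X,\theta^*,t^*(\theta^*))/\partial\theta\|^2]$ are finite, and the matrix
$$S := \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}$$
exists and is nonsingular, where $S_{11} := E[\partial^2 m(X,\theta^*,t^*(\theta^*))/\partial t^2]$, $S_{12} = S_{21}^T := E[\partial^2 m(X,\theta^*,t^*(\theta^*))/\partial t\,\partial\theta]$ and $S_{22} := E[\partial^2 m(X,\theta^*,t^*(\theta^*))/\partial\theta^2]$.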
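Theorem 5.7. Under assumptions 6 and 7, we have
(1)
$$\sqrt{n} \begin{pmatrix} \widehat{t}(\widehat{\theta}_\phi) - t^*(\theta^*) \\ \widehat{\theta}_\phi - \theta^* \end{pmatrix}$$
converges in distribution to a centered normal vector with covariance matrix $W = S^{-1} M S^{-1}$, where
$$M := E\left[ \begin{pmatrix} \frac{\partial}{\partial t} m(X,\theta^*,t^*(\theta^*)) \\ \frac{\partial}{\partial\theta} m(X,\theta^*,t^*(\theta^*)) \end{pmatrix} \begin{pmatrix} \frac{\partial}{\partial t} m(X,\theta^*,t^*(\theta^*)) \\ \frac{\partial}{\partial\theta} m(X,\theta^*,t^*(\theta^*)) \end{pmatrix}^T \right];$$
(2) $\sqrt{n}\,(\widehat{D}_\phi(\mathcal{M}, P_0) - D_\phi(\mathcal{M}, P_0))$ converges in distribution to a centered normal variable with variance
$$\sigma^2(\theta^*) = E[m(X,\theta^*,t^*(\theta^*))^2] - [E[m(X,\theta^*,t^*(\theta^*))]]^2.$$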
Remark 5.5. In the case of EL, i.e., when $\phi(x) = -\log x + x - 1$, assumption (6-c) implies that (see (4.12))
$$-\infty < \inf_x\, t_0 + t^T g(x,\theta) \leq \sup_x\, t_0 + t^T g(x,\theta) < 1$$
$P_0$-a.s. for all $\theta \in N(\theta^*)$ and $t \in N(\theta)$. This imposes a restriction on the model when the support of $P_0$ is unbounded. Indeed, when the support of $P_0$ is, for example, the whole space $\mathbb{R}^m$, the condition above does not hold when $g$ is unbounded. In contrast, the same condition may hold for other divergences associated with $\phi$ functions such that $\operatorname{dom}\phi = \mathbb{R}$.
Remark 5.6. Theorem 5.7 is useful for the computation of the power function. For testing the null hypothesis $H_0: P_0 \in \mathcal{M}$ against the alternative $H_1: P_0 \notin \mathcal{M}$, the power function is
$$P_0 \notin \mathcal{M} \mapsto \beta(P_0) := P_0\left[2n\widehat{D}_\phi(\mathcal{M}, P_0) > q_{(1-\alpha)}\right]. \tag{5.7}$$
Using Theorem 5.7 part 2, we obtain the following approximation to the power function (5.7):
$$\beta(P_0) \approx 1 - F_N\left(\frac{\sqrt{n}}{\sigma(\theta^*)}\left[\frac{q_{(1-\alpha)}}{2n} - D_\phi(\mathcal{M}, P_0)\right]\right), \tag{5.8}$$
Nis the empirical cumulative distribution of the standard normal distribution. From the proxy value of β (P
0) hereabove, the approximate sample size that ensures a given power β for a given alternative P
06∈ M can be obtained as follows. Let n
0be the positive root of the equation
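$$\beta = 1 - F_N\left(\frac{\sqrt{n}}{\sigma(\theta^*)}\left[\frac{q_{(1-\alpha)}}{2n} - D_\phi(\mathcal{M}, P_0)\right]\right),$$
i.e.,
$$n_0 = \frac{(a+b) - \sqrt{a(a+2b)}}{2 D_\phi(\mathcal{M}, P_0)^2}$$
with $a := \sigma(\theta^*)^2 \left[F_N^{-1}(1-\beta)\right]^2$ and $b := q_{(1-\alpha)} D_\phi(\mathcal{M}, P_0)$. The required sample size is then $\lfloor n_0 \rfloor + 1$, where $\lfloor n_0 \rfloor$ is the integer part of $n_0$.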
6. Simulation results: Approximation of the power function of the empirical likelihood ratio test
We illustrate by simulation the accuracy of the power approximation (5.8) in the case of the EL method, i.e., when $\phi(x) = -\log x + x - 1$. Consider the test problem of the composite null hypothesis $H_0: P_0 \in \mathcal{M}$ against the alternative $H_1: P_0 \notin \mathcal{M}$, where $\mathcal{M} = \bigcup_{\theta \in \mathbb{R}} \mathcal{M}_\theta$ and $\mathcal{M}_\theta$ is the set of all s.f.m's satisfying the constraints $\int dQ(x) = 1$ and $\int g(x,\theta)\, dQ(x) = 0$ with $g(x,\theta) := (x, x^2 - \theta)$, namely
$$\mathcal{M}_\theta := \left\{ Q \text{ such that } \int_{\mathbb{R}} dQ(x) = 1 \text{ and } \int_{\mathbb{R}} g(x,\theta)\, dQ(x) = 0 \right\},$$
where $\theta \in \mathbb{R}$ is the parameter of interest. We consider the asymptotic level $\alpha = 0.05$ and the alternatives $P_0 := U([-1, 1+\epsilon]) \notin \mathcal{M}$ for different values of $\epsilon$ in the interval $]0, 1]$. Note that when $\epsilon = 0$ the uniform distribution $U([-1, 1])$ belongs to the model $\mathcal{M}$. For this model, we can also show that all assumptions of Theorem 5.2 are satisfied when $\epsilon = 0$, and that all assumptions of Theorem 5.7 are met under the alternatives. In Figure 1, the power function (5.7) is plotted (with a continuous line), with sample sizes $n = 50$, $n = 100$, $n = 200$ and $n = 500$, for different values of $\epsilon$. Each power entry was obtained by Monte-Carlo from 1000 independent runs. The approximation (5.8) is plotted (with a dashed line) as a function of $\epsilon$. The estimates $\widehat{\theta}_\phi$ and $\widehat{D}_\phi(\mathcal{M}, P_0)$ are calculated using the Newton algorithm. We observe from Figure 1 that the approximation is accurate even for moderate sample sizes.
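The Monte-Carlo part of this experiment can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation, and it rests on several assumptions that are not statements from the paper: (i) since $\theta$ is free and enters only the second constraint, profiling over $\theta$ leaves the single constraint $\int x\, dQ(x) = 0$, so that $2n\widehat{D}_\phi(\mathcal{M}, P_0)$ is computed as the EL ratio statistic for a zero mean; (ii) the critical value is the 0.95-quantile of $\chi^2(1)$, i.e. $q = 3.841$, taking $l - d = 1$ degree of freedom; (iii) the dual equation is solved by bisection instead of the Newton algorithm; (iv) fewer runs than the 1000 used in the paper are drawn to keep the example fast. All function names are hypothetical.

```python
import math
import random


def el_stat_mean_zero(xs, iters=60):
    """EL ratio statistic 2 * sum(log(1 + lam * x_i)) for the constraint E[X] = 0."""
    x_min, x_max = min(xs), max(xs)
    if x_min >= 0.0 or x_max <= 0.0:
        return math.inf  # 0 is not inside the convex hull of the sample
    # lam solves sum x_i / (1 + lam * x_i) = 0; the left-hand side is strictly
    # decreasing in lam on the interval where every 1 + lam * x_i is positive.
    lo, hi = -1.0 / x_max, -1.0 / x_min
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        if sum(x / (1.0 + lam * x) for x in xs) > 0.0:
            lo = lam
        else:
            hi = lam
    return 2.0 * sum(math.log1p(lam * x) for x in xs)


def mc_power(n, eps, runs, q=3.841, seed=0):
    """Monte-Carlo rejection frequency when sampling from U([-1, 1 + eps])."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(runs):
        xs = [rng.uniform(-1.0, 1.0 + eps) for _ in range(n)]
        if el_stat_mean_zero(xs) > q:
            rejections += 1
    return rejections / runs


# Rejection frequencies for a few alternatives (eps = 0 is the null model).
results = {eps: mc_power(n=100, eps=eps, runs=200) for eps in (0.0, 0.5, 1.0)}
print(results)
```

Comparing these frequencies with the approximation (5.8) additionally requires $\sigma(\theta^*)$ and $D_\phi(\mathcal{M}, P_0)$ for each alternative, which this sketch does not compute.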
Figure 1. Approximation of the power function. [Four panels, for n = 50, n = 100, n = 200 and n = 500. Horizontal axis: alternatives (0 to 1); vertical axis: power function and its approximation (0 to 1). Legend: Power, Approxim., Level.]
7. Concluding remarks and possible developments
We have proposed new estimates and tests for models satisfying linear constraints with unknown parameter, through divergence-based methods which generalize the EL approach. This leads to the limit distributions of the test statistics and the estimates under alternatives and under misspecification, which cannot be obtained through the likelihood point of view. Consistency of the test statistics under the alternatives is the starting point for the study of the optimality of the tests through the Bahadur approach; also, the generalized Neyman-Pearson optimality of the EL test (as developed by Kitamura (2001)) can be adapted for empirical divergence-based methods. Many problems remain to be studied, such as the choice of the divergence which leads to an optimal (in some sense) estimator or test in terms of efficiency and/or robustness. Preliminary simulation results show that the Hellinger divergence enjoys good properties in terms of efficiency-robustness; see Broniatowski and Keziou (2008). Comparisons under local alternatives should also be developed.
8. Appendix
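Proof of Theorem 5.1. The same arguments used for the proof of Theorem 3.1 in Newey and Smith (2004) hold when their criterion function $(\theta,\lambda) \in \Theta \times \mathbb{R}^l \mapsto \frac{1}{n}\sum_{i=1}^{n} \rho(\lambda^T g(X_i,\theta))$ is replaced by our function $(\theta,t) \in \Theta \times \mathbb{R}^{1+l} \mapsto \frac{1}{n}\sum_{i=1}^{n} m(t^T g(X_i,\theta))$. In particular, $\max_{i \leq n} \widehat{t}(\widehat{\theta}_\phi)^T g(X_i, \widehat{\theta}_\phi)$ tends to 0 in probability, which implies that $\widehat{t}(\widehat{\theta}_\phi) \in \operatorname{int}(\Lambda_n(\widehat{\theta}_\phi))$ with probability one as $n \to \infty$, since $a^* < 0 < b^*$.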
Proof of Theorem 5.2. The proof is similar to that of Newey and Smith (2004) Theorem 3.2.
Hence, it is omitted.
Proof of Theorem 5.3. (1) It is a particular case of Theorem 5.1 taking $\Theta = \{\theta\}$. (2) The first order conditions $P_n\, \partial m(\theta,\widehat{t})/\partial t = 0$ are satisfied with probability one as $n \to \infty$. Hence by a Taylor expansion we obtain
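$$0 = P_n\, \partial m(\theta,\widehat{t})/\partial t = P_n\, \partial m(\theta,0)/\partial t + \left[P_n\, \partial^2 m(\theta,\bar{t})/\partial t^2\right]^T \widehat{t}, \tag{8.1}$$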
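where $\bar{t} \in \mathbb{R}^{1+l}$ is a vector inside the segment that links 0 and $\widehat{t}$. By the uniform weak law of large numbers (UWLLN) and the dominated convergence theorem, $P_n\, \partial^2 m(\theta,\bar{t})/\partial t^2$ tends in probability to
$$E[\partial^2 m(X,\theta,0)/\partial t^2] = -\begin{pmatrix} 1 & 0^T \\ 0 & \Omega \end{pmatrix} =: -M,$$
which is nonsingular and symmetric. Hence, we can write
$$\sqrt{n}\,\widehat{t} = M^{-1} \sqrt{n}\, P_n\, \partial m(X,\theta,0)/\partial t + o_P(1). \tag{8.2}$$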
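Using similar arguments, we also get
$$\widehat{D}_\phi(\mathcal{M}_\theta, P_0) = P_n m(\theta,\widehat{t}) = [P_n\, \partial m(\theta,0)/\partial t]^T\, \widehat{t} - \frac{1}{2}\, \widehat{t}^{\,T} M\, \widehat{t} + o_P(1/n).$$
From this, using (8.2), we obtain
$$\widehat{D}_\phi(\mathcal{M}_\theta, P_0) = \frac{1}{2}\, [P_n\, \partial m(\theta,0)/\partial t]^T M^{-1} [P_n\, \partial m(\theta,0)/\partial t] + o_P(1/n).$$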
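This yields
$$2n\widehat{D}_\phi(\mathcal{M}_\theta, P_0) = [\sqrt{n}\, P_n\, \partial m(\theta,0)/\partial t]^T M^{-1} [\sqrt{n}\, P_n\, \partial m(\theta,0)/\partial t] + o_P(1). \tag{8.3}$$
On the other hand, direct calculation shows that
$$E[\partial m(X,\theta,0)\, \partial m(X,\theta,0)^T] = M.$$
Combining this with (8.3), we conclude the proof.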
Proof of Theorem 5.4. (1) First, note that condition (b) implies that $t^*(\theta)$ is unique, since $t \in \Lambda(\theta) \mapsto E[m(X,\theta,t)]$ is strictly concave by (c) and $\Lambda(\theta)$ is a convex set. By the UWLLN, using continuity of $m(X,\theta,t)$ in $t$ and condition (b), we obtain
$$|P_n m(\theta,t) - E[m(X,\theta,t)]| \to 0, \tag{8.4}$$
in probability uniformly in $t$ over the compact set $N$. Using this, the fact that $t^*(\theta) := \arg\sup_{t \in \Lambda(\theta)} P_0 m(\theta,t)$ is unique and belongs to $\operatorname{int}(N)$, and the strict concavity of $t \mapsto P_0 m(\theta,t)$, we conclude that any value
$$\bar{t} := \arg\sup_{t \in N} P_n m(\theta,t) \tag{8.5}$$
converges in probability to $t^*(\theta)$; see e.g. Theorem 5.7 in van der Vaart (1998). We end the proof by showing that $\widehat{t}(\theta)$ belongs to $\operatorname{int}(N)$ with probability one as $n \to \infty$, and therefore converges to $t^*(\theta)$. In fact, since for $n$ sufficiently large any value $\bar{t}$ lies in the interior of $N$, concavity of $t \mapsto P_n m(\theta,t)$ implies that no other point $t$ in the complement of $\operatorname{int}(N)$ can maximize $P_n m(\theta,t)$ over $t \in \mathbb{R}^{1+l}$; hence $\widehat{t}(\theta)$ must be in $\operatorname{int}(N)$.
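(2) We have $\widehat{D}_\phi(\mathcal{M}_\theta, P_0) = P_n m(\theta,\widehat{t}) = P_n m(\theta,\bar{t})$, where the second equality holds for $n$ sufficiently large. Hence we can write
$$\left|\widehat{D}_\phi(\mathcal{M}_\theta, P_0) - D_\phi(\mathcal{M}_\theta, P_0)\right| = \left|P_n m(\theta,\bar{t}) - P_0 m(\theta,t^*)\right| \leq \left|P_n m(\theta,\bar{t}) - P_0 m(\theta,\bar{t})\right| + \left|P_0 m(\theta,\bar{t}) - P_0 m(\theta,t^*)\right|.$$
The first term tends to 0 in probability by (8.4); the second term tends to 0 by the dominated convergence theorem, using assumption (b).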
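Proof of Theorem 5.5. (1) By Taylor expansion, there exists $\bar{t} \in \mathbb{R}^{l+1}$ inside the segment that links $\widehat{t}$ and $t^*$ with
$$0 = P_n m'(\theta,\widehat{t}) = P_n m'(\theta,t^*) + (P_n m''(\theta,t^*))^T (\widehat{t} - t^*) + \frac{1}{2} (\widehat{t} - t^*)^T P_n m'''(\theta,\bar{t})\, (\widehat{t} - t^*). \tag{8.6}$$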
By condition (a) of Assumption 5 and the Law of Large Numbers (LLN), we get $P_n m'''(\theta,\bar{t}) = O_P(1)$. Hence, we can write the last term in the right hand side of (8.6) as $o_P(1)\,(\widehat{t} - t^*)$. On the other hand, by the WLLN, $P_n m''(\theta,t^*)$ converges in probability to the matrix $P_0 m''(\theta,t^*)$. Write $P_n m''(\theta,t^*)$ as $P_0 m''(\theta,t^*) + o_P(1)$ to obtain from (8.6)
$$-P_n m'(\theta,t^*) = (P_0 m''(\theta,t^*) + o_P(1))\,(\widehat{t} - t^*). \tag{8.7}$$
By the Central Limit Theorem (CLT), we have $\sqrt{n}\, P_n m'(\theta,t^*) = O_P(1)$, which by (8.7) implies that $\sqrt{n}\,(\widehat{t} - t^*) = O_P(1)$. Hence, from (8.7), we get
$$\sqrt{n}\,(\widehat{t} - t^*) = [-P_0 m''(\theta,t^*)]^{-1} \sqrt{n}\, P_n m'(\theta,t^*) + o_P(1). \tag{8.8}$$
The CLT concludes the proof of part 1. (2) Using the fact that $\widehat{t} - t^* = O_P(1/\sqrt{n})$ and $P_n m'(\theta,t^*) = P_0 m'(\theta,t^*) + o_P(1) = 0 + o_P(1) = o_P(1)$, we obtain
$$\sqrt{n}\left(\widehat{D}_\phi(\mathcal{M}_\theta, P_0) - D_\phi(\mathcal{M}_\theta, P_0)\right) = \sqrt{n}\left(\widehat{D}_\phi(\mathcal{M}_\theta, P_0) - P_0 m(\theta,t^*)\right) = \sqrt{n}\left(P_n m(\theta,t^*) - P_0 m(\theta,t^*)\right) + o_P(1),$$
and the CLT yields the conclusion of the proof.
Proof of Theorem 5.6. (1) First note that condition (d) implies that the function $t \in \Lambda(\theta) \mapsto E[m(X,\theta,t)]$ is strictly concave for all $\theta \in \Theta$. Hence, condition (c) implies that $t^*(\theta)$ is unique for all $\theta \in \Theta$. By the UWLLN, using continuity of $m(X,\theta,t)$ in $\theta$ and $t$, and condition (c), we obtain the uniform convergence in probability, over the compact set $\{(\theta,t) \mid \theta \in \Theta,\ t \in N(\theta)\}$,
(8.9) sup
θ∈Θ,t∈N(θ)