SMOOTH MIN-DIVERGENCE INFERENCE IN SEMI PARAMETRIC MODELS


HAL Id: hal-02586204

https://hal.archives-ouvertes.fr/hal-02586204

Preprint submitted on 15 May 2020


Michel Broniatowski, Justin Moutsouka

To cite this version:

Michel Broniatowski, Justin Moutsouka. SMOOTH MIN-DIVERGENCE INFERENCE IN SEMI PARAMETRIC MODELS. 2020. ⟨hal-02586204⟩


MICHEL BRONIATOWSKI(1), JUSTIN STEWARD MOUTSOUKA(1)

Abstract. This paper considers inference in some semi parametric models through a specific class of statistical procedures which have proved to be of valuable interest in parametric estimation, namely the power divergence family defined by Basu, Harris, Hjort and Jones (1998).

At first glance their divergence is not fitted to semiparametric inference. However, extending the parametric setting to a smoothed semiparametric one, it is possible to make inference both on $\theta_T$ and on the density of $P_{\theta_T}$ in semiparametric models defined by moment conditions indexed by some parameter $\theta$, where the data are generated under some unknown $\theta_T$. This question is of interest; indeed the estimation of the density of $P_{\theta_T}$ with respect to a dominating measure (here the Lebesgue measure) is usually an open challenge in the realm of semi parametric models. This is the focus of the present paper.

Key words and phrases: Semi parametric models, Inference, Minimum divergence inference

1. Introduction

This paper considers inference in some semi parametric models through a specific class of statistical procedures which have proved to be of valuable interest in parametric estimation. The global paradigm considered here consists in the minimization of a pseudo-distance between the empirical measure defined by the data set and a model, defined loosely as a collection of probability measures which we consider as candidates for the generic distribution of the data set. This framework is generally referred to as a "divergence based approach"; according to the choice of the divergence (or "pseudo-distance"), many classical methods for estimation and testing can be recovered. Before entering into our topic in more detail, let us briefly describe some of the various divergences which have been discussed in the recent past, and present their specificities.

1.1. Divergences. A divergence (or discrepancy) between two probability measures $P$ and $Q$ defined on the same measurable space $\mathcal{X}$ equipped with its Borel field $\mathcal{B}(\mathcal{X})$ is a non-negative mapping

\[ (P,Q) \mapsto D(Q,P) \]

such that $D(Q,P) = 0$ if and only if $Q = P$. No symmetry is assumed, nor any triangle inequality; therefore a divergence need not be a distance. Constructions of such functions $D$ are numerous; we briefly sketch two main schemes, each of which leads to specific fields of applications in statistics and learning. We refer to Broniatowski and Stummer (2019) or Broniatowski and Vajda (2012) for description and further references. In this paper the space $\mathcal{X}$ is the Euclidean space $\mathbb{R}^d$, endowed with its Borel field. In the sequel $\mathcal{M}_1$ designates the class of all probability measures defined on $(\mathcal{X}, \mathcal{B}(\mathcal{X}))$.

1.1.1. Ali-Silvey and Csiszár divergences. A first class of divergences has been introduced by Ali and Silvey (1966) and by Csiszár (1967); for $Q$ absolutely continuous with respect to $P$,

\[ D(Q,P) := \int f\left(\frac{dQ}{dP}\right) dP \]

where $f$ is a non-negative convex function defined on $\mathbb{R}^+$ which satisfies $f(1) = 0$. When the support of $Q$ is not included in the support of $P$ then $D(Q,P) := +\infty$. Typical examples of functions $f$ are

\[ f_0(x) := -\log x + x - 1, \quad f_1(x) := x\log x - x + 1, \quad f_{1/2}(x) := 2\left(\sqrt{x}-1\right)^2, \]

\[ f_2(x) := \frac{1}{2}(x-1)^2, \quad f_{-1}(x) := \frac{1}{2}\frac{(x-1)^2}{x}. \]

In the above list, $f_0$ induces the likelihood divergence (modified Kullback-Leibler divergence), $f_1$ induces the Kullback-Leibler divergence, $f_{1/2}$ defines the Hellinger divergence, while $f_2$ and $f_{-1}$ respectively define the Pearson and the Neyman chi-square divergences. Note that by its very definition, given a sample of i.i.d. copies $X_1,\dots,X_n$ of a generic r.v. $X$ with continuous distribution $P$ on $\mathbb{R}^d$ and a model $\mathcal{M}$ of continuous distributions on $\mathbb{R}^d$, it may hold that the projection of $P$ on $\mathcal{M}$ is defined, although the natural proxy of $P$ defined through the empirical measure

\[ P_n := \frac{1}{n}\sum_{i=1}^{n}\delta_{X_i} \]

is at infinite "distance" from $\mathcal{M}$ for any $n$. This drawback can be overcome and leads to general techniques in parametric inference, encompassing the various classical ones associated with the names cited above; see Broniatowski and Keziou (2009). We will turn to semiparametric inference a bit later.

A convenient class of such functions $f$ can be defined through the so-called Cressie-Read functions, which are indexed by a real valued parameter $\gamma$,

\[ f_\gamma(x) := \frac{x^\gamma - \gamma x + \gamma - 1}{\gamma(\gamma-1)}, \]

and the examples listed above are indeed indexed by the value of $\gamma$ (with limit expansions for the cases $\gamma = 0, 1$).
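As a quick check of the correspondence between the index and the named examples, the following minimal sketch evaluates the Cressie-Read generator, written here as f_gamma(x) = (x^gamma - gamma x + gamma - 1) / (gamma (gamma - 1)). The function name `cressie_read` and the choice of test point are illustrative assumptions, not part of the paper; the cases gamma = 0 and gamma = 1 are coded through their limits.

```python
import math

def cressie_read(x, gamma):
    """Cressie-Read generator f_gamma(x) for x > 0 (illustrative helper).

    gamma = 0 and gamma = 1 are handled through their limits, which are the
    modified Kullback-Leibler and Kullback-Leibler generators.
    """
    if x <= 0:
        raise ValueError("x must be positive")
    if gamma == 0:
        return -math.log(x) + x - 1.0
    if gamma == 1:
        return x * math.log(x) - x + 1.0
    return (x**gamma - gamma * x + gamma - 1.0) / (gamma * (gamma - 1.0))

# The named examples are recovered for particular values of gamma.
x = 4.0
hellinger = cressie_read(x, 0.5)   # equals 2 * (sqrt(x) - 1)**2
pearson = cressie_read(x, 2)       # equals 0.5 * (x - 1)**2
neyman = cressie_read(x, -1)       # equals 0.5 * (x - 1)**2 / x
```

Evaluating the general formula at an index very close to 0 or 1 gives values close to the two limiting cases, which is one way to see the "limit expansions" numerically.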

1.1.2. The BHHJ power divergences. This class of divergences has been introduced by Basu, Harris, Hjort and Jones (Basu et al. (1998)), referred to as BHHJ divergence here, and is defined for distributions which are absolutely continuous with respect to the Lebesgue measure $\lambda$ on $\mathbb{R}^d$. Given $P$ and $Q$ with respective densities $p$ and $q$, the power divergence with real power index $\alpha$ is defined through

\[ D_\alpha(Q,P) = \int \varphi_\alpha(q(x),p(x))\,dx \tag{1.1} \]

where

\[ \varphi_\alpha(u,v) = u^{\alpha+1} - \left(1+\frac{1}{\alpha}\right)u^{\alpha}v + \frac{1}{\alpha}v^{\alpha+1}. \]

We will consider only values of $\alpha$ in $(0,1)$.

The developed form of $D_\alpha(Q,P)$ is therefore

\[ D_\alpha(Q,P) = \int \left[ q^{\alpha+1}(v) - \left(1+\frac{1}{\alpha}\right)q^{\alpha}(v)p(v) + \frac{1}{\alpha}p^{\alpha+1}(v) \right] dv. \tag{1.2} \]
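To make the definition concrete, here is a minimal numerical sketch of the BHHJ divergence between two densities on the real line, with integrand u^(alpha+1) - (1 + 1/alpha) u^alpha v + (1/alpha) v^(alpha+1). The two normal densities, the integration range, the midpoint rule and all function names are illustrative assumptions rather than anything prescribed by the paper.

```python
import math

def phi(u, v, alpha):
    """Integrand of the BHHJ divergence for density values u = q(x), v = p(x)."""
    return u**(alpha + 1) - (1 + 1 / alpha) * u**alpha * v + v**(alpha + 1) / alpha

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2), used only as a convenient test case."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def bhhj_divergence(q, p, alpha, lo, hi, n=4000):
    """Midpoint-rule approximation of D_alpha(Q, P) on [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(phi(q(lo + (i + 0.5) * h), p(lo + (i + 0.5) * h), alpha)
                   for i in range(n))

p = lambda x: normal_pdf(x, 0.0, 1.0)
q = lambda x: normal_pdf(x, 1.0, 1.0)
d_same = bhhj_divergence(p, p, 0.5, -10.0, 11.0)  # numerically zero
d_diff = bhhj_divergence(q, p, 0.5, -10.0, 11.0)  # strictly positive
```

The divergence vanishes when both arguments coincide and is positive otherwise, in line with its role as a pseudo-distance.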

Motivation for using the BHHJ divergence in parametric inference is considered in Basu et al. (1998). This class of divergences is well suited for parametric estimation; indeed consider a parametric model $\mathcal{M} := \{P_\theta, \theta\in\Theta\}$ of absolutely continuous measures, where $\Theta$ is some parameter space; it then holds

\[ D_\alpha(P_\theta,P) = R_\alpha(P_\theta,P) + \frac{1}{\alpha}\int p^{\alpha+1}(v)\,dv \]

where the last term is independent of $\theta$; therefore minimizing $D_\alpha(P_\theta,P)$ over $\Theta$ amounts to minimizing $R_\alpha(P_\theta,P)$. When dealing with estimation, $P$ is supposed to be $P_T$, the distribution of the generic observation $X$, and substitution of the unknown measure $P_T$ by the empirical measure $P_n$ yields the corresponding statistical criterion

\[ R_\alpha(P_\theta,P_n) := \int p_\theta^{\alpha+1}(v)\,dv - \left(1+\frac{1}{\alpha}\right)\frac{1}{n}\sum_{i=1}^{n} p_\theta^{\alpha}(X_i) \tag{1.3} \]

which can be minimized over $\theta$ and produces an estimator of $\theta_T$ whenever $P_T = P_{\theta_T}$.

Whenever the integral in the above display does not depend on the parameter $\theta$, as holds for location models, minimizing $R_\alpha(P_\theta,P_n)$ over $\theta$ amounts to smoothing the usual likelihood score by a factor $p_\theta^\alpha$ which damps the role of outliers in the estimating equation.
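The damping effect can be seen on a toy location model. The sketch below assumes a N(theta, 1) model, for which the integral term of the criterion is free of theta, so that minimizing the criterion amounts to maximizing the sum of p_theta(X_i)^alpha; the stationarity condition is then a weighted-mean fixed point with weights proportional to p_theta(X_i)^alpha. The contaminated sample, the median starting point and the function name `mdpde_location` are illustrative assumptions.

```python
import math
import random
import statistics

def mdpde_location(xs, alpha, n_iter=200, tol=1e-12):
    """Minimum power divergence estimate of theta in a N(theta, 1) model.

    Iterates theta <- sum(w_i x_i) / sum(w_i) with
    w_i = exp(-alpha (x_i - theta)^2 / 2), i.e. weights proportional to
    p_theta(x_i)**alpha, which solves the weighted score equation.
    """
    theta = statistics.median(xs)  # robust starting point (an assumption)
    for _ in range(n_iter):
        w = [math.exp(-0.5 * alpha * (x - theta) ** 2) for x in xs]
        new_theta = sum(wi * x for wi, x in zip(w, xs)) / sum(w)
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta

random.seed(0)
clean = [random.gauss(0.0, 1.0) for _ in range(200)]
data = clean + [10.0] * 20          # 20 gross outliers at x = 10
theta_mle = statistics.mean(data)   # sample mean = maximum likelihood estimate
theta_dpd = mdpde_location(data, alpha=0.5)
```

On such contaminated data the sample mean is pulled towards the outliers, while the weighted fixed point stays close to the centre of the clean observations, because observations far from the current value of theta receive a negligible weight.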

This procedure has been developed extensively and leads to classical limit results for estimation and testing; see Theorem 2 in Basu et al. (1998). The performance of this approach has been compared to similar treatments making use of Csiszár divergences, both under the model and under misspecification. Globally speaking, the performances of the Csiszár divergence approach and of the power divergence approach are quite similar: same limit distribution of the estimator and of the test statistics as for the maximum likelihood approach (which falls in the field of Csiszár divergences but not in the field of power ones for $\alpha$ in $(0,1)$), and nearly similar results in simulation runs on small or medium size samples. Tuning the parameter $\gamma$ or $\alpha$ makes it possible to obtain reasonably robust estimators under contamination, as measured through the influence function; see Toma and Broniatowski (2011).

The main properties of BHHJ divergences are:

Fact 1: $D_\alpha(Q,P)$ is a divergence in that it is non-negative for all absolutely continuous probability measures $P$ and $Q$ and equals 0 iff $P = Q$ a.e.

Fact 2: The mapping $Q \mapsto D_\alpha(Q,P)$ from $\mathcal{P}(\lambda)$ to $\mathbb{R}^+$ is convex.
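Both facts can be read off the integrand of the divergence. The following short derivation is a sketch under the standing assumption that the power index (written alpha) lies in (0,1) and that the densities are finite; the integrand is written $\varphi_\alpha(u,v) = u^{\alpha+1} - (1+1/\alpha)u^{\alpha}v + (1/\alpha)v^{\alpha+1}$.

```latex
% Fact 1: for fixed u >= 0, study v -> phi_alpha(u, v) on v > 0.
\[
\frac{\partial}{\partial v}\varphi_\alpha(u,v)
  = \Bigl(1+\frac{1}{\alpha}\Bigr)\bigl(v^{\alpha}-u^{\alpha}\bigr),
\qquad
\frac{\partial^{2}}{\partial v^{2}}\varphi_\alpha(u,v)
  = (\alpha+1)\,v^{\alpha-1} > 0 ,
\]
% so v -> phi_alpha(u, v) is strictly convex with its minimum at v = u, where
\[
\varphi_\alpha(u,u) = u^{\alpha+1}\Bigl(1-\Bigl(1+\frac{1}{\alpha}\Bigr)+\frac{1}{\alpha}\Bigr) = 0 .
\]
% Hence phi_alpha(u, v) >= 0 with equality iff u = v; integrating in x gives
% D_alpha(Q, P) >= 0, with equality iff q = p almost everywhere.

% Fact 2: for fixed v >= 0, u -> u^{alpha+1} is convex and, since
% u -> u^{alpha} is concave for alpha in (0,1), the map
% u -> -(1 + 1/alpha) v u^{alpha} is convex as well. Therefore
\[
u \mapsto \varphi_\alpha(u,v) \ \text{is convex, and so is}\
q \mapsto \int \varphi_\alpha\bigl(q(x),p(x)\bigr)\,dx .
\]
```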

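1.2. Semi parametric models. In this paper we extend the power divergence approach to some specific class of semi parametric models. Such models are defined through constraints on moments; define $l$ linearly independent functions

\[ (\mathcal{X},\Theta) \ni (x,\theta) \mapsto g_j(x,\theta), \quad 1\le j\le l. \tag{1.4} \]

For any $\theta$ let us denote by $\mathcal{M}_\theta$ the set of all measures in $\mathcal{M}_1$ defined by

\[ \mathcal{M}_\theta := \left\{ Q\in\mathcal{M}_1 \text{ such that } \int g_j(x,\theta)\,dQ(x) = 0, \ 1\le j\le l \right\}. \tag{1.5} \]

Measures in $\mathcal{M}_\theta$ therefore satisfy $l$ linear constraints. The model $\mathcal{M}$ is defined through

\[ \mathcal{M} = \bigcup_{\theta\in\Theta}\mathcal{M}_\theta. \tag{1.6} \]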

The inference on $\theta$ in the above model can be performed in a natural way for a number of statistical criteria. For example, for Cressie-Read criteria, or more generally for Csiszár type ones, a simple plug-in of the empirical measure $P_n$ in place of $P$ in the divergence $D(Q,P)$ allows to minimize it on $\mathcal{M}_\theta$ for given $\theta$, and then to optimize over $\theta$. This is due to the fact that the minimizer of $D(Q,P_n)$ on $\mathcal{M}_\theta$ has support included in the sample points $X_i$. Therefore the seemingly formidable search in this minimization problem boils down to a finite dimensional one, on the simplex of $\mathbb{R}^n$. This is the core argument for Empirical Likelihood (EL) methods and their extensions. All minimum empirical divergence methods (therefore including EL) aim at assessing whether the model $\mathcal{M}$ is valid and at the estimation of $\theta_T$, the true value of the parameter; so they do not provide any knowledge on the density of $P_{\theta_T}$ (whenever $P_0 = P_{\theta_T}$ belongs to $\mathcal{M}$) nor on the density of the projection of $P_0$ on $\mathcal{M}$ taking into account the very definition of the model.

In the present case, due to the very form of $D_\alpha(Q,P)$ as in (1.3), no plug-in of $P_n$ is feasible. Looking at (1.3) it may seem that the divergence $D_\alpha$ is not fitted to semiparametric inference. However, extending the parametric setting to a smoothed semiparametric one, it is possible to make inference both on $\theta_T$ and on the density of $P_{\theta_T}$. This is the focus of the present paper.

The setting when estimating $\theta$ in $\mathcal{M}$ is clearly quite different from the parametric case, where $\mathcal{M}_\theta$ is not defined by any such condition as above, but merely consists in a single distribution $P_\theta$. Consider the estimation of $\theta$ in $\mathcal{M}$ making use of (1.3). Clearly this leads to a two-step minimization; the first step consists in the search for the minimizer $Q_\theta$ of $R_\alpha(Q,P_n)$ for $Q$ in $\mathcal{M}_\theta$, and the subsequent minimization should select the value of $\theta$ which solves $\min_\theta R_\alpha(Q_\theta,P_n)$, where $Q_\theta$ solves the first minimization, whenever possible. Now the first minimization is indeed difficult, since the class $\mathcal{M}_\theta$ consists in an infinite family of distributions on which the minimization of $R_\alpha(Q,P_n)$ should be performed.

This program can however be carried out, as soon as some appropriate setting is defined. This setting should contain various ingredients; firstly the model should be such that all minimization procedures are well defined; our basic setting will imply that the mapping $\mathcal{M}_\theta \ni Q \mapsto D_\alpha(Q,P)$ is lower semicontinuous (l.s.c.) in a proper topology for which, for all $\theta$ and any $P$, the convex set $\mathcal{M}_\theta$ is closed and the level sets of the mapping $Q \mapsto D_\alpha(Q,P)$ are compact. As such its estimator can also be defined. Additional regularity assumptions on the model, with respect to the variation of $\theta$ in $\Theta$, will be necessary in order to perform the second optimization.

The problem at hand therefore writes

\[ \hat\theta := \arg\min_{\theta\in\Theta}\ \min_{Q\in\mathcal{M}_\theta} R_\alpha(Q,P_n), \tag{1.7} \]

where for all $\theta$, $\mathcal{M}_\theta$ consists in a family of distributions with densities w.r.t. the Lebesgue measure, with some prescribed regularity. We need to introduce some description of the model; this is done in the next Section.
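The two nested minimizations can be illustrated on a toy case. The sketch below is not the paper's smooth class: as a loudly labeled simplification it replaces the infinite-dimensional set of smooth densities satisfying the mean constraint (moment function g(x, theta) = x - theta on K = [0, 1]) by the one-parameter surrogate family Beta(theta s, (1 - theta) s), whose mean is theta. The grids for theta and s, the sample, the value alpha = 0.5 and all function names are assumptions made for illustration only.

```python
import math
import random

ALPHA = 0.5

def beta_pdf(x, a, b):
    """Beta(a, b) density on (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def criterion(a, b, xs, grid):
    """Empirical criterion for the Beta(a, b) density q:
    integral of q^(alpha+1) minus (1 + 1/alpha) times the mean of q^alpha(X_i).
    The integral is approximated by a midpoint rule on `grid`."""
    h = grid[1] - grid[0]
    integral = h * sum(beta_pdf(x, a, b) ** (ALPHA + 1) for x in grid)
    empirical = sum(beta_pdf(x, a, b) ** ALPHA for x in xs) / len(xs)
    return integral - (1 + 1 / ALPHA) * empirical

def inner_min(theta, xs, grid, scales):
    """Inner step: minimize over the surrogate class with mean theta."""
    return min(criterion(theta * s, (1 - theta) * s, xs, grid) for s in scales)

random.seed(1)
xs = [random.betavariate(4, 6) for _ in range(300)]   # true mean is 0.4
grid = [(i + 0.5) / 400 for i in range(400)]
scales = [10 + 2 * k for k in range(11)]              # s in {10, 12, ..., 30}
thetas = [0.20 + 0.02 * k for k in range(21)]         # theta in {0.20, ..., 0.60}

# Outer step: pick the theta whose inner minimum is smallest.
theta_hat = min(thetas, key=lambda t: inner_min(t, xs, grid, scales))
```

The shape parameters are kept above 2 so that every candidate density is bounded and Lipschitz on [0, 1], in the spirit of the regularity requirements; a faithful implementation would have to optimize over a much richer class of smooth densities in the inner step.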

2. Notation and properties of the semi parametric model

2.1. Constraints. All distributions in our model are defined on a compact subset $K$ of $\mathbb{R}^m$.

The linearly independent functions $(g_1,\dots,g_l)$ introduced in (1.4) should satisfy some basic requirements. Each of the functions $g_j$ is defined on $K\times\Theta$ with values in $\mathbb{R}$; hence $g := (g_1,\dots,g_l)^T$ is defined on $K\times\Theta$ with values in $\mathbb{R}^l$. The parameter space $\Theta$ is a compact subset of $\mathbb{R}^d$.

We assume that for all $\theta$ the mapping

\[ x \mapsto g(x,\theta) \ \text{is continuous on } \mathrm{int}(K). \tag{G1} \]

All functions $g_j$ are uniformly bounded:

\[ \sup_{\theta\in\Theta}\sup_{x\in K}\|g(x,\theta)\| < \infty \tag{G2} \]

where $\|x\|$ designates the usual norm in $\mathbb{R}^l$.

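We also assume uniform continuity of $g$ in the following sense: as $\theta_n \to \theta$,

\[ \lim_{n\to\infty}\ \sup_{x\in K}\|g(x,\theta_n) - g(x,\theta)\| = 0. \tag{G3} \]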

2.2. Regularity and smoothness assumptions. The semi parametric model $\mathcal{M}$ will be assumed to consist in regular measures, in the sense that they should have a density with respect to the Lebesgue measure $\lambda$ on $K$, and that their densities should be smooth. This is formalized as follows.

Let $\mathcal{P}$ be the class of all probability measures with support $K$, and $\mathcal{P}(\lambda)$ the class of all probability measures in $\mathcal{P}$ which are a.c. w.r.t. $\lambda$.

We now define a subset $E$ of p.m.'s in $\mathcal{P}(\lambda)$ which is identified by some smoothness properties pertaining to their densities. Any measure $Q$ in $\mathcal{P}(\lambda)$ is identified with its density $q$. An element in $E$ will be indifferently identified either by some probability measure $Q$ or by its density $q$.

The set $E$ is endowed with the metric induced by the sup norm on $K$; for $q$ and $q'$ in $E$, denote

\[ d(q,q') := \sup_{x\in K}|q(x) - q'(x)|. \]

Four conditions will be assumed on $E$:

1- Let $x_0$ be some point in $K$. There exists $N > 0$ such that for all $Q$ in $E$,

\[ q(x_0) < N. \tag{E1} \]

2- The class $E$ is equicontinuous: for all $\varepsilon > 0$, there exists $\delta > 0$ such that for all $Q$ in $E$,

\[ \sup_{|x-x'|<\delta}|q(x) - q(x')| < \varepsilon. \tag{E2} \]

3- For all $Q\in E$, the map $x \mapsto q^\alpha(x)$ is a Lipschitz function:

\[ \sup_{x,y\in K,\ x\ne y}\frac{|q^\alpha(x) - q^\alpha(y)|}{|x-y|} \le C \tag{E3} \]

for some $C > 0$.

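Remark 1. Since for any positive $\eta$, $|q(x) - q(y)| < \eta$ implies $|q^\alpha(x) - q^\alpha(y)| < \eta'$ for some $\eta' > 0$, due to $\alpha\in(0,1)$ (one may take $\eta' = \eta^\alpha$, since $|t^\alpha - s^\alpha| \le |t-s|^\alpha$ for $t, s\ge 0$), it follows that when (E2) holds, $\sup_{|x-y|<\varepsilon}|q^\alpha(x) - q^\alpha(y)| < \eta'$, which implies that $q^\alpha$ is equicontinuous. Therefore (E3) enforces (E2).

For each $\theta$ we consider the submodel

\[ \mathcal{M}_\theta := \left\{ Q\in\mathcal{P}(\lambda) \text{ such that } \int q(x)\,d\lambda(x) = 1,\ \int g(x,\theta)q(x)\,d\lambda(x) = 0 \right\} \]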

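and its smooth counterpart

\[ \mathcal{M}_\theta^E := \mathcal{M}_\theta \cap E, \]

which we assume to be non void. We define the model $\mathcal{M}$ through

\[ \mathcal{M} = \bigcup_{\theta\in\Theta}\mathcal{M}_\theta \]

and the smooth version of $\mathcal{M}$ is defined by

\[ \mathcal{M}^E = \bigcup_{\theta\in\Theta}\left(\mathcal{M}_\theta\cap E\right) = \bigcup_{\theta\in\Theta}\mathcal{M}_\theta^E. \]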

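The first additional condition is an identifiability property of the model with respect to $\theta$.

We assume that for $\theta \ne \theta'$,

\[ \mathcal{M}_\theta \cap \mathcal{M}_{\theta'} = \emptyset. \tag{M1} \]

We assume that the collection of submodels $\mathcal{M}_\theta$ is well separated, in the sense that for any $\eta > 0$ there exists $\delta > 0$ such that

\[ d(\theta,\theta') > \eta \ \Longrightarrow\ \inf_{\{q\in\mathcal{M}_\theta^E,\ q'\in\mathcal{M}_{\theta'}^E\}} d(q,q') > \delta. \tag{M2} \]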

As a consequence of those smoothness assumptions we denote indifferently $D_\alpha(Q,P)$ by $D_\alpha(q,P)$. The same notation is adopted for $R_\alpha(Q,P)$ (to be defined further on), etc.

Example 2.1. $g(x,\theta) = x - \theta$; $\mathcal{M}_\theta = \{Q : \int x\,dQ(x) = \theta\}$ and clearly $\mathcal{M}_\theta\cap\mathcal{M}_{\theta'} = \emptyset$ for $\theta\ne\theta'$. Whenever $\int x\,(q(x) - q'(x))\,dx > \varepsilon$ then $\int |q(x) - q'(x)|\,dx > \varepsilon/K$, and therefore $d(q,q') > \delta$ for some $\delta > 0$.

2.3. The estimator. Given an i.i.d. sample $(X_1, X_2, \dots, X_n)$ such that $X_1$ has distribution $P_{\theta_0}\in\mathcal{M}_{\theta_0}$ for some $\theta_0\in\Theta$, we intend to provide an estimator for $\theta_0$ minimizing the pseudo-distance between $P_n$ and $\mathcal{M}^E$, where

\[ P_n := \frac{1}{n}\sum_{i=1}^{n}\delta_{X_i} \]

is the empirical measure pertaining to the sample $(X_1, X_2, \dots, X_n)$. Note that the estimation is performed in the smooth model $\mathcal{M}^E$ and not in $\mathcal{M}$.

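We introduce the estimator of $\theta_0$ in $\mathcal{M}^E$ by

\[ \hat\theta := \arg\inf_{\theta\in\Theta}\ \inf_{Q\in\mathcal{M}_\theta^E} D_\alpha(Q,P_n). \tag{2.1} \]

Formula (2.1) provides a natural estimate of $\theta_0$ if $P_{\theta_0}\in\mathcal{M}_{\theta_0}^E$. Indeed under the identifiability condition (H3) we prove that the above estimator converges to $\theta_0 = \arg\inf_{\theta}\inf_{Q\in\mathcal{M}_\theta} D_\alpha(Q,P_{\theta_0})$ (see Theorem 1 and Theorem 9).

In the alternative case that $P_{\theta_0}\in\mathcal{M}_{\theta_0}$ but $P_{\theta_0}\notin E$, formula (2.1) defines an estimator of some

\[ \tilde\theta := \arg\inf_{\theta\in\Theta}\ \inf_{Q\in\mathcal{M}_\theta^E} D_\alpha(Q,P_{\theta_0}), \]

the parameter associated with the projection of $P_{\theta_0}$ on $\mathcal{M}^E$; $\tilde\theta$ may be different from $\theta_0$ but still represents a proxy of $\theta_0$ in the smooth model. We will consider a natural condition which entails that $\tilde\theta = \theta_0$ (Theorem 1).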

3. Projection and regularization

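We denote by $P_0$ the distribution of the variable $X_1$. In this section we consider both cases $P_0\in\mathcal{M}_{\theta_0}$ and $P_0\in\mathcal{M}_{\theta_0}^E$ for some $\theta_0$.

Suppose that the following condition holds:

\[ \inf_{Q\in\mathcal{M}_{\theta_0}^E} D_\alpha(Q,P_0) < \inf_{Q\in\mathcal{M}_\theta^E} D_\alpha(Q,P_0) \tag{3.1} \]

for all $\theta\ne\theta_0$, whenever $P_0$ belongs to $\mathcal{M}_{\theta_0}$. This formalizes the fact that $P_0$ is approximated smoothly with a better score in $\mathcal{M}_{\theta_0}^E$ than in any other $\mathcal{M}_\theta^E$, as soon as $P_0$ belongs to $\mathcal{M}_{\theta_0}$.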

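Theorem 1. Under (3.1) it holds, whenever $P_0$ belongs to $\mathcal{M}$ or to $\mathcal{M}^E$,

\[ \theta_0 = \arg\inf_{\theta}\ \inf_{Q\in\mathcal{M}_\theta^E} D_\alpha(Q,P_0) = \arg\inf_{\theta}\ \inf_{Q\in\mathcal{M}_\theta} D_\alpha(Q,P_0). \tag{3.2} \]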

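Proof. First case: Suppose that $P_0 = P_{\theta_0}\in\mathcal{M}^E$, i.e. $P_0\in\mathcal{M}_{\theta_0}^E$. Then

\[ \inf_{Q\in\mathcal{M}_{\theta_0}^E} D_\alpha(Q,P_0) = 0. \]

Since $\mathcal{M}^E \supset \mathcal{M}_{\theta_0}^E$, we have $\inf_{Q\in\mathcal{M}^E} D_\alpha(Q,P_0) = 0$. Furthermore, $\theta_0$ realizes $\inf_{Q\in\mathcal{M}_{\theta_0}^E} D_\alpha(Q,P_0) = 0$. So

\[ \theta_0 \in \arg\inf_{\theta}\ \inf_{Q\in\mathcal{M}_\theta^E} D_\alpha(Q,P_0). \]

It must be shown that $\theta_0$ is the only parameter $\theta$ that satisfies $\inf_{Q\in\mathcal{M}_\theta^E} D_\alpha(Q,P_0) = 0$.

Suppose that there exists $\theta_1\ne\theta_0$ such that $\theta_1\in\arg\inf_\theta\inf_{Q\in\mathcal{M}_\theta^E} D_\alpha(Q,P_0)$. Then

\[ \inf_{Q\in\mathcal{M}_{\theta_1}^E} D_\alpha(Q,P_0) = \inf_{Q\in\mathcal{M}_{\theta_0}^E} D_\alpha(Q,P_0) = 0. \]

Since $\mathcal{M}_{\theta_1}^E \subset \mathcal{M}_{\theta_1}$,

\[ 0 = \inf_{Q\in\mathcal{M}_{\theta_1}^E} D_\alpha(Q,P_0) \ \ge\ \inf_{Q\in\mathcal{M}_{\theta_1}} D_\alpha(Q,P_0) \ \ge\ 0. \]

Hence

\[ \inf_{Q\in\mathcal{M}_{\theta_1}} D_\alpha(Q,P_0) = 0. \]

Now $\theta_0$ is the only parameter for which $P_0\in\mathcal{M}_{\theta_0}$, so $\theta_1$ does not exist; otherwise $P_0 = P_{\theta_1}$, which contradicts (M1).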

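Second case: Suppose that $P_0 = P_{\theta_0}\in\mathcal{M}$ and $P_0\notin\mathcal{M}^E$. Recall that

\[ \theta_0 = \arg\inf_{\theta}\ \inf_{Q\in\mathcal{M}_\theta} D_\alpha(Q,P_0). \]

We want to show that

\[ \theta_0 = \arg\inf_{\theta}\ \inf_{Q\in\mathcal{M}_\theta^E} D_\alpha(Q,P_0). \]

We project $P_0 = P_{\theta_0}$ on $\mathcal{M}^E$ and define

\[ \theta_1 \in \arg\inf_{\theta}\ \inf_{Q\in\mathcal{M}_\theta^E} D_\alpha(Q,P_0). \]

Assume that $\theta_1\ne\theta_0$. We then have

\[ \inf_{Q\in\mathcal{M}_{\theta_1}^E} D_\alpha(Q,P_0) \le \inf_{Q\in\mathcal{M}_\theta^E} D_\alpha(Q,P_0) \]

for all $\theta$, by definition of $\theta_1$. So taking $\theta = \theta_0$, we have

\[ \inf_{Q\in\mathcal{M}_{\theta_1}^E} D_\alpha(Q,P_0) \le \inf_{Q\in\mathcal{M}_{\theta_0}^E} D_\alpha(Q,P_0). \tag{3.3} \]

Under (3.1) it holds $D_\alpha(\mathcal{M}_{\theta_0}^E,P_0) < D_\alpha(\mathcal{M}_\theta^E,P_0)$ for all $\theta\ne\theta_0$. Then (3.3) is impossible, so $\theta_1 = \theta_0$. We have proved (3.2).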

Before handling inference we need to explore some properties of minimum pseudo-distance approximations in $\mathcal{M}^E$. We will make use of a number of definitions, which we quote now. For fixed $P$ in $\mathcal{M}^E$ the divergence $D_\alpha(\cdot,P)|_E$ is the restriction of $Q\mapsto D_\alpha(Q,P)$ to $\mathcal{M}^E$.

For fixed $\theta$, define therefore the projection of $P$ on $\mathcal{M}_\theta^E$

\[ Q_\theta^* = \arg\inf_{Q\in\mathcal{M}_\theta^E} D_\alpha(Q,P)|_E \]

whenever defined. Since for $Q\in\mathcal{M}^E$ it holds $D_\alpha(Q,P)|_E = D_\alpha(Q,P)$, we have

\[ \arg\inf_{Q\in\mathcal{M}_\theta^E} D_\alpha(Q,P) = \arg\inf_{Q\in\mathcal{M}_\theta^E} D_\alpha(Q,P)|_E = Q_\theta^*. \]

We first set some general definition.

Definition 1. Let $\Omega$ be some subset of $\mathcal{P}$. The $D_\alpha$ divergence between the set $\Omega$ and a p.m. $P$ is defined by

\[ D_\alpha(\Omega,P) := \inf_{Q\in\Omega} D_\alpha(Q,P). \]

A probability measure $Q^*\in\Omega$ such that $D_\alpha(Q^*,P) < \infty$ and

\[ D_\alpha(Q^*,P) \le D_\alpha(Q,P) \quad \text{for all } Q\in\Omega \]

is called a $D_\alpha$ projection of $P$ on $\Omega$. This projection may not exist, or may not be uniquely defined.

Definition 2. The sequence of functions $q_n\in E$ tends to $q$ strongly if and only if

\[ \sup_{x\in K}|q_n(x) - q(x)| \to 0. \]

To $(Q_n)_n\subset\mathcal{M}^E$ we associate $(q_n)$. If there exists some $q$ in $E$ such that

\[ \sup_{x\in K}|q_n(x) - q(x)| \to 0, \tag{3.4} \]

then $Q_n$ converges strongly to $Q$ such that $Q(A) = \int 1_A(x)q(x)\,dx$ for all $A\in\mathcal{B}(\mathbb{R})$. Denote $Q_n \to_{st} Q$ when (3.4) holds; $Q$ may not be a probability measure.

4. Projection: existence and uniqueness

We need some preliminary results pertaining to the properties of $\mathcal{M}^E$.

4.1. Closure of $\mathcal{M}^E$. By the Arzelà-Ascoli Theorem the set $E$ is pre-compact when endowed with the strong topology defined in Definition 2.

Let $(Q_n)$ be a family of probability measures on $K$; by compactness of $K$, it holds

Proposition 1. $(Q_n)_{n\ge 0}$ is a tight family.

As a consequence it holds

Proposition 2. Let $(Q_n)_{n\ge 1}$ be a family of p.m.'s with densities in $E$. Assume that there exists $q$ in $E$ such that $\lim_{n\to\infty}\sup_{x\in K}|q_n(x) - q(x)| = 0$. Then $(Q_n)_{n\ge 1}$ is relatively compact.

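Proof. Let $\{n_j\}\subset\{n\}$ with $\frac{dQ_{n_j}}{d\lambda}(x) = q_{n_j}(x)$ and $\sup_{x\in K}|q_{n_j}(x) - q(x)| \to 0$; then $(Q_{n_j})$ converges to some p.m. $Q$ and $Q(A) = \int_A q(x)\,d\lambda(x)$ for all $A$ in $\mathcal{B}(K)$. Indeed

\[ \left| Q_{n_j}(A) - \int_A q(x)\,d\lambda(x) \right| = \left| \int 1_A(x)q_{n_j}(x)\,d\lambda(x) - \int 1_A(x)q(x)\,d\lambda(x) \right| \le \sup_{x\in K}\left|q_{n_j}(x) - q(x)\right|\,\lambda(A) \to 0. \]

So $(Q_{n_j})_{j\ge 1}$ converges to $Q$, such that $q(x) = \frac{dQ}{d\lambda}(x)$. That $Q$ is a probability measure is a consequence of Prohorov's Theorem since $(Q_n)_{n\ge 1}$ is a tight family of p.m.'s.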

Theorem 2. Under (G1), (G2) and (G3) the set $\mathcal{M}^E$ is closed for the strong topology of convergence defined in Definition 2.

Proof. Assume that $(Q_n)_{n\ge 1}\subset\mathcal{M}^E$ and assume that there exists $q$ such that

\[ \sup_{x\in K}|q_n(x) - q(x)| \to 0, \]

with $q_n(x) := (dQ_n/d\lambda)(x)$. Define $Q(A) := \int_A q(x)\,d\lambda(x)$ for any set $A$; we have $Q_n \to_{st} Q$ by Proposition 2. We want to prove that $Q\in\mathcal{M}^E$, namely:

(A) $q$ is a density;

(B) $\int_K g(x,\theta)q(x)\,dx = 0$ for some $\theta$;

(C) $q$ is equicontinuous.

We prove (A): this follows from Prohorov's Theorem. We prove (B): let $\theta_n$ be defined by $\int g(x,\theta_n)q_n(x)\,dx = 0$; such a $\theta_n$ indeed exists since $Q_n\in\mathcal{M}$.

Since $\Theta$ is a compact set in $\mathbb{R}^d$, we select $\{n_j\}\subset\{n\}$ such that the subsequence $\theta_{n_j}$ admits a limit $\theta$ and $\int g(x,\theta_{n_j})q_{n_j}(x)\,dx = 0$.

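We prove that $\int_K g(x,\theta)q(x)\,dx = 0$. Indeed

\[ \left\|\int_K g(x,\theta)q(x)\,dx\right\| \le \left\|\int_K g(x,\theta)q_{n_j}(x)\,dx\right\| + \left\|\int_K g(x,\theta)q(x)\,dx - \int_K g(x,\theta)q_{n_j}(x)\,dx\right\| =: B + A, \]

where

\[ A = \left\|\int_K g(x,\theta)\left(q(x) - q_{n_j}(x)\right)dx\right\| \]

tends to 0 by (G2). Next

\[ B \le \left\|\int_K \left(g(x,\theta) - g(x,\theta_{n_j})\right)q_{n_j}(x)\,dx\right\| + \left\|\int_K g(x,\theta_{n_j})q_{n_j}(x)\,dx\right\| =: C + D \]

and $D = 0$ by definition of $\theta_{n_j}$. Hence

\[ B \le C \le \sup_{x\in K}\left\|g(x,\theta) - g(x,\theta_{n_j})\right\|\int_K q_{n_j}(x)\,dx = \sup_{x\in K}\left\|g(x,\theta) - g(x,\theta_{n_j})\right\| \to 0, \]

where we used (G3).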

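We have proved that the limit $\theta = \lim_{n_j\to\infty}\theta_{n_j}$ of any converging subsequence $\theta_{n_j}$ satisfies $\int_K g(x,\theta)q(x)\,dx = 0$.

Consider two converging subsequences $\theta_{n_j}$ and $\theta_{n'_j}$ with $\theta_{n_j}\to\theta$ and $\theta_{n'_j}\to\theta'$; we have

\[ \int_K g(x,\theta)q(x)\,dx = \int_K g(x,\theta')q(x)\,dx = 0. \]

By (M1) it follows that $\theta = \theta'$; therefore we have proved that there exists a unique $\theta\in\Theta$ such that

\[ \int_K g(x,\theta)q(x)\,dx = 0, \]

which proves (B).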

We prove that there exists some $N > 0$ such that $|q(x_0)| \le N$. Indeed

\[ |q(x_0)| = |q(x_0) - q_n(x_0) + q_n(x_0)| \le |q_n(x_0)| + |q_n(x_0) - q(x_0)| \le N + |q_n(x_0) - q(x_0)| \le N + \varepsilon \]

for all $\varepsilon > 0$ and $n$ large enough, and therefore $|q(x_0)| \le N$, since $|q_n(x_0) - q(x_0)| \to 0$.

We prove that $q$ is uniformly equicontinuous on $K$; indeed

\[ |q(x) - q(x')| = |q(x) - q_n(x) + q_n(x) - q_n(x') + q_n(x') - q(x')|. \]

Hence

\[ \sup_{|x-x'|<\delta}|q(x) - q(x')| \le \sup_{|x-x'|<\delta}|q(x) - q_n(x)| + \sup_{|x-x'|<\delta}|q(x') - q_n(x')| + \sup_{|x-x'|<\delta}|q_n(x) - q_n(x')| \]

\[ \le 2\sup_{x\in K}|q(x) - q_n(x)| + \sup_{|x-x'|<\delta}|q_n(x) - q_n(x')| \le 2\varepsilon + \varepsilon. \]

The first term in the last display tends to 0 by hypothesis; the second one is smaller than any positive $\varepsilon$ for adequate $\delta > 0$. Hence $q\in E$.

4.2. Existence and uniqueness of the $D_\alpha$-projection of $P$ on $\mathcal{M}^E$. For any $P$ in $\mathcal{P}(\lambda)$ let $a > 0$ and let

\[ A_E(a) := \{Q\in\mathcal{M}^E : D_\alpha(Q,P) \le a\} \]

be some level set of the divergence $Q\mapsto D_\alpha(Q,P)$.

Proposition 3. For any $\alpha\in(0,1)$ the divergence function $Q\mapsto D_\alpha(Q,P)$ from $\mathcal{P}(\lambda)$ to $[0,+\infty]$ is lower semicontinuous (l.s.c.) for the strong topology.

Proof. We prove that $A_E(a)$ is a closed subset of $\mathcal{M}^E$ equipped with the strong topology. Recall that lower semicontinuity of $Q\mapsto D_\alpha(Q,P)$ is equivalent to the closedness of the level sets $A_E(a)$.

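Let $Q_n\in A_E(a)\cap\mathcal{M}^E$. Denote $\frac{dQ_n}{d\lambda}(x) = q_n(x)$ with $q_n\in E$, and assume that there exists a function $q$ defined on $K$ such that

\[ \sup_{x\in K}|q_n(x) - q(x)| \to 0. \]

Define

\[ \frac{dQ}{d\lambda}(x) = q(x); \]

we prove that $q\in E$ and that, with $Q(A) := \int 1_A(x)q(x)\,d\lambda(x)$, it holds $Q\in A_E(a)$. Since $\mathcal{M}^E$ is closed (see Theorem 2), the measure $Q$ defined by $Q(A) = \int 1_A(x)q(x)\,d\lambda(x)$ for all $A\in\mathcal{B}(\mathbb{R})$ is in $\mathcal{M}^E$.

It remains to prove that $D_\alpha(Q,P) \le a$.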

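Consider the mapping $t\mapsto t^\beta$ defined on $\mathbb{R}^+$, which by the mean value theorem satisfies

\[ |t^\beta - s^\beta| \le \beta\max\left(t^{\beta-1}, s^{\beta-1}\right)|t-s|, \tag{4.1} \]

and set $t := q_n^\alpha(x)$ and $s := q^\alpha(x)$ with $\beta := (\alpha+1)/\alpha$; we then have

\[ \sup_{x\in K}\left|q_n^{\alpha+1}(x) - q^{\alpha+1}(x)\right| \le \sup_{x\in K}\ \frac{\alpha+1}{\alpha}\max\left(q_n(x), q(x)\right)\left|q_n^\alpha(x) - q^\alpha(x)\right|. \tag{4.2} \]

It holds similarly

\[ \sup_{x\in K}\left|q_n^\alpha(x) - q^\alpha(x)\right| \le \sup_{x\in K}\left\{\alpha\max\left(q_n^{\alpha-1}(x), q^{\alpha-1}(x)\right)|q_n(x) - q(x)|\right\} \to 0, \tag{4.3} \]

since the function $q$ is bounded on $K$.

We have

\[ \sup_{x\in K}\left|q_n^{\alpha+1}(x) - q^{\alpha+1}(x)\right| \le (\alpha+1)\sup_{x\in K}|q^\alpha(x)|\,|q_n(x) - q(x)| \to 0. \]

Since $q_n$ is bounded on $K$, $|q_n^\alpha(x) - q^\alpha(x)| \le |q_n^\alpha(x)| + |q^\alpha(x)| < \infty$. So $|q_n^\alpha(x) - q^\alpha(x)|$ is bounded.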

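Consider now the mapping

\[ x \mapsto \varphi_\alpha(q_n(x),p(x)) - \varphi_\alpha(q(x),p(x)). \]

Since

\[ \varphi_\alpha(q_n(x),p(x)) - \varphi_\alpha(q(x),p(x)) = q_n^{\alpha+1}(x) - q^{\alpha+1}(x) - \left(1+\frac{1}{\alpha}\right)p(x)\left(q_n^\alpha(x) - q^\alpha(x)\right), \]

using (4.2) and (4.3) we obtain

\[ \sup_{x\in K}\left|\varphi_\alpha(q_n(x),p(x)) - \varphi_\alpha(q(x),p(x))\right| \to 0. \]

Integrating, we have

\[ \int\varphi_\alpha(q_n(x),p(x))\,dx - \eta \le \int\varphi_\alpha(q(x),p(x))\,dx = D_\alpha(Q,P) \le \int\varphi_\alpha(q_n(x),p(x))\,dx + \eta \tag{4.4} \]

for any $\eta > 0$, for $n$ large. Since $Q_n\in A_E(a)$, $\int\varphi_\alpha(q_n(x),p(x))\,dx \le a$; the inequality (4.4) becomes

\[ \int\varphi_\alpha(q_n(x),p(x))\,dx - \eta \le \int\varphi_\alpha(q(x),p(x))\,dx \le \int\varphi_\alpha(q_n(x),p(x))\,dx + \eta \le a + \eta. \]

So $\int\varphi_\alpha(q(x),p(x))\,dx \le a$; hence $Q\in A_E(a)$ and thus $A_E(a)$ is a closed set in $\mathcal{M}^E$.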

Theorem 3. For all $a > 0$ the set $A_E(a)$ is compact for the strong topology.

Proof. By the Arzelà-Ascoli Theorem, $E$ has a compact closure $\mathrm{Cl}(E)$. Since $A_E(a)$ is a closed subset of $\mathrm{Cl}(E)$, which is compact, $A_E(a)$ is compact.

Proposition 4. For any $\theta$ in $\Theta$,

\[ Q_\theta^* = \arg\inf_{Q\in\mathcal{M}_\theta^E} D_\alpha(Q,P) \]

exists and is unique.

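Proof. Let $a_\theta := \inf_{Q\in\mathcal{M}_\theta^E} D_\alpha(Q,P)$ and let $\varepsilon > 0$. Then $A_E(a_\theta+\varepsilon) \ne \emptyset$ and $A_E(a_\theta+\varepsilon)\cap\mathcal{M}_\theta^E \ne \emptyset$.

It can be observed that for all $\theta$ the set $\mathcal{M}_\theta^E$ is a closed set, following the same arguments as in Proposition 3. Since $\mathcal{M}_\theta^E$ is closed and $A_E(a_\theta+\varepsilon)$ is compact, $A_E(a_\theta+\varepsilon)\cap\mathcal{M}_\theta^E$ is compact. Since

\[ \arg\inf_{q\in\mathcal{M}_\theta^E} D_\alpha(Q,P) = \arg\inf_{q\in A_E(a_\theta+\varepsilon)\cap\mathcal{M}_\theta^E} D_\alpha(Q,P), \]

existence of the projection follows from the lower semicontinuity of $Q\mapsto D_\alpha(Q,P)$. Since $\varphi_\alpha$ is strictly convex, the function $Q\in\mathcal{P}(\lambda)\mapsto D_\alpha(Q,P)$ is also strictly convex, and the projection of $P$ on any closed convex set $\Omega$ in $\mathcal{M}^E$ is uniquely defined whenever it exists.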

Consider now the $D_\alpha$ projection of $P$ on a convex subset $\Omega$ in $\mathcal{M}^E$. Similarly as in Proposition 4 it holds

Theorem 4. For any convex set $\Omega$ in $\mathcal{M}^E$ the $D_\alpha$ projection of $P$ on $\Omega$ exists and is unique.

Proof. The proof mimics the one of Proposition 4. Let

\[ a := \inf_{Q\in\mathcal{M}^E} D_\alpha(Q,P) \]

and $\varepsilon > 0$. Then $A_E(a+\varepsilon)\cap\mathcal{M}^E \ne \emptyset$. Since $\mathcal{M}^E$ is closed (see Theorem 2) and $A_E(a+\varepsilon)$ is compact, existence of the projection follows. Uniqueness is due to convexity.


5. Minimum pseudo-distance estimator

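Let $X_1,\dots,X_n$ denote an i.i.d. sample of a random vector $X\in\mathbb{R}^m$ with distribution $P_0$. Let $P_n(\cdot)$ be the empirical measure pertaining to this sample, namely

\[ P_n(\cdot) := \frac{1}{n}\sum_{i=1}^{n}\delta_{X_i}(\cdot), \]

where $\delta_x(\cdot)$ denotes the Dirac measure at point $x$. We define

\[ D_\alpha(\mathcal{M}_\theta^E,P_0) = \inf_{Q\in\mathcal{M}_\theta^E} D_\alpha(Q,P_0) = \inf_{Q\in\mathcal{M}_\theta^E}\int\left[q^{\alpha+1}(x) - \left(1+\frac{1}{\alpha}\right)q^\alpha(x)p_0(x) + \frac{1}{\alpha}p_0^{\alpha+1}(x)\right]dx. \]

Since optimization only pertains to $Q$, define in the following

\[ R_\alpha(\mathcal{M}_\theta^E,P_0) = \inf_{Q\in\mathcal{M}_\theta^E} R_\alpha(Q,P_0) = \inf_{Q\in\mathcal{M}_\theta^E}\int\left[q^{\alpha+1}(x) - \left(1+\frac{1}{\alpha}\right)q^\alpha(x)p_0(x)\right]dx \]

and the "plug-in" estimate of $R_\alpha(\mathcal{M}_\theta^E,P_0)$ through

\[ \widehat{R}_\alpha(\mathcal{M}_\theta^E,P_0) := \inf_{Q\in\mathcal{M}_\theta^E} R_\alpha(Q,P_n) = \inf_{Q\in\mathcal{M}_\theta^E}\left\{\int q^{\alpha+1}(x)\,dx - \left(1+\frac{1}{\alpha}\right)\int q^\alpha(x)\,dP_n(x)\right\} = \inf_{Q\in\mathcal{M}_\theta^E}\left\{\int q^{\alpha+1}(x)\,dx - \left(1+\frac{1}{\alpha}\right)\frac{1}{n}\sum_{i=1}^{n}q^\alpha(X_i)\right\}. \]

In the same way,

\[ R_\alpha(\mathcal{M},P_0) := \inf_{\theta\in\Theta}\inf_{Q\in\mathcal{M}_\theta^E} R_\alpha(Q,P_0) = \inf_{\theta\in\Theta}\inf_{Q\in\mathcal{M}_\theta^E}\left\{\int q^{\alpha+1}(x)\,dx - \left(1+\frac{1}{\alpha}\right)\int q^\alpha(x)\,dP_0(x)\right\} \]

can be estimated by

\[ \widehat{R}_\alpha(\mathcal{M},P_0) := \inf_{\theta\in\Theta}\inf_{Q\in\mathcal{M}_\theta^E}\left\{\int q^{\alpha+1}(x)\,dx - \left(1+\frac{1}{\alpha}\right)\frac{1}{n}\sum_{i=1}^{n}q^\alpha(X_i)\right\}. \]

Since

\[ \arg\inf_{q\in\mathcal{M}_\theta^E} D_\alpha(Q,P_0) = \arg\inf_{q\in\mathcal{M}_\theta^E} R_\alpha(Q,P_0) \]

for any $\theta$, the minimizer $\arg\inf_{q\in\mathcal{M}_\theta^E} R_\alpha(Q,P_0)$ exists and is unique (whether $P_0\in\bigcup_\theta\mathcal{M}_\theta^E$ or not).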

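We will consider estimators of $\theta_0$ where $P_0 = P_{\theta_0}$ for a unique $\theta_0\in\Theta$; this corresponds to the fact that $P_0\in\mathcal{M}$. In this case, by uniqueness of $\arg\inf_{\theta\in\Theta} R_\alpha(\mathcal{M}_\theta^E,P_0)$ and since the infimum is reached at $\theta = \theta_0$ under the model, we estimate $\theta_0$ through

\[ \hat\theta := \arg\inf_{\theta\in\Theta}\inf_{Q\in\mathcal{M}_\theta^E}\left\{\int q^{\alpha+1}(x)\,dx - \left(1+\frac{1}{\alpha}\right)\frac{1}{n}\sum_{i=1}^{n}q^\alpha(X_i)\right\}. \]

6. Asymptotic properties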

The BHHJ pseudodistances will be applied in the standard statistical estimation model with i.i.d. observations $X_1,\dots,X_n$ governed by $P_0$ from a family $\mathcal{P} = \{P_\theta : \theta\in\Theta\}$ of probability measures on $(\mathbb{R}^k,\mathcal{B}(\mathbb{R}^k))$ indexed by a set of parameters $\Theta\subset\mathbb{R}^d$. All distributions in $\mathcal{P}$ are assumed absolutely continuous, and $\lambda$ denotes the Lebesgue measure on $\mathbb{R}^k$. Denote $p_\theta = dP_\theta/d\lambda$ for $\theta\in\Theta$.

Remark 2. If $P_0\in\mathcal{M}$ there exists a unique $P_{\theta_0}\in\mathcal{M}_{\theta_0}$ such that $P_0 = P_{\theta_0}$; then by identifiability

\[ \arg\inf_{\theta} D_\alpha(P_\theta,P_{\theta_0}) = \theta_0. \]

In other words the unknown parameter $\theta_0$ is the unique minimizer of the function $\theta\mapsto D_\alpha(P_\theta,P_0)$:

\[ \theta_0 = \arg\min_{\theta\in\Theta} D_\alpha(P_\theta,P_{\theta_0}). \tag{6.1} \]

The empirical probability measures $P_n$ are known to converge weakly to $P_0$ as $n\to\infty$. Therefore by plugging in (6.1) the measures $P_n$ for $P_0$ one intuitively expects to obtain the estimator under the form

\[ \hat\theta(n) = \arg\min_{\theta\in\Theta} M_n(P_\theta,P_n) \]

that converges to $\theta_0$ as $n\to\infty$, where $M_n(P_\theta,P_n)$ is some empirical criterion which estimates the objective function $R_\alpha(P_\theta,P_0)$.

We will repeatedly make use of a basic result which we recall for convenience. Denote by $M_n(\tau)$ a family of random functions of a parameter $\tau$ which belongs to a space $T$ endowed with a metric denoted $d$.

Assuming that the sequence $M_n$ converges uniformly to some deterministic function $M$ defined on $T$, the following result provides a set of sufficient conditions which entail the convergence in probability of minimizers of $M_n$ to the minimizer of $M$, if well defined.

Lemma 1. (van der Vaart (1998), Theorem 5.7) Assume that

(1) $\sup_{\tau\in T}|M_n(\tau) - M(\tau)| \to_P 0$;

(2) for any $\varepsilon > 0$, $\inf_{\{t\in T;\ d(t,t_0)\ge\varepsilon\}} M(t) > M(t_0)$;

(3) the sequence $t_n$ satisfies $M_n(t_n) \le M_n(t_0) + o_P(1)$.

Then the sequence $t_n$ satisfies

\[ d(t_n,t_0) \to_P 0. \]

Lemma 1 will be used according to the context of minimization at hand.

By (1.7) we consider the inner and the outer minimization problems leading to the estimator. This will be performed in two steps: the inner minimization with respect to $Q$ in $\mathcal{M}_\theta^E$ for fixed $\theta$, and the outer minimization w.r.t. $\theta$.

Here we establish the consistency of the minimum pseudodistance estimator on the closed set of measures a.c. w.r.t. $\lambda$.

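6.1. Inner minimization: convergence of the projection of $P_n$ on $\mathcal{M}_\theta^E$. Fix $\theta\in\Theta$. Denote $M_n(Q) := R_\alpha(Q,P_n)$ where $Q\in\mathcal{M}_\theta^E$. Denote

\[ q_n(\theta) := \arg\inf_{q\in\mathcal{M}_\theta^E} R_\alpha(Q,P_n), \tag{6.2} \]

where, as before, $Q$ is identified with its density $q$. Existence and uniqueness of a p.m. $Q_n(\theta)$ with density $q_n(\theta)$ follow from Proposition 4, following its proof verbatim and substituting $P_n$ for $P$.

Denote accordingly the unique minimizer of $R_\alpha(Q,P_0)$,

\[ q_\theta^* := \arg\inf_{q\in\mathcal{M}_\theta^E} R_\alpha(Q,P_0). \tag{6.3} \]

We prove that $q_n(\theta)$ converges to $q_\theta^*$ making use of Lemma 1. Setting

\[ M_n(\tau) := R_\alpha(Q,P_n), \]

with $\tau = dQ/d\lambda$, and setting $d(\tau,\tau') = \sup_{x\in K}|q(x) - q'(x)|$, it holds

Lemma 2. Fix $\theta$. Then Condition (1) in Lemma 1 holds:

\[ \sup_{q\in\mathcal{M}_\theta^E}|R_\alpha(Q,P_n) - R_\alpha(Q,P_0)| \to 0 \quad \text{in probability.} \]

Proof. It holds

\[ \sup_{q\in\mathcal{M}_\theta^E}|R_\alpha(Q,P_n) - R_\alpha(Q,P_0)| \le \left(1+\frac{1}{\alpha}\right)\sup_{q\in\mathcal{M}_\theta^E}\left|\frac{1}{n}\sum_{i=1}^{n}q^\alpha(X_i) - E_{P_0}\left(q^\alpha(X)\right)\right| \]

which tends to 0 almost surely as $n$ tends to infinity, since $q^\alpha$ is a Lipschitz function for all $q\in\mathcal{M}_\theta^E$ with a common constant (see (E3)), and such a class of Lipschitz functions on the compact set $K$ is a Glivenko-Cantelli class.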
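The bound used in this proof comes from a direct cancellation, spelled out in the sketch below, where alpha denotes the power index and q the density of Q: the two criteria share the same integral term, so only the term in the power alpha of q survives.

```latex
\[
R_\alpha(Q,P_n)-R_\alpha(Q,P_0)
 = \Bigl[\int q^{\alpha+1}\,dx-\Bigl(1+\frac{1}{\alpha}\Bigr)\frac{1}{n}\sum_{i=1}^{n}q^{\alpha}(X_i)\Bigr]
 - \Bigl[\int q^{\alpha+1}\,dx-\Bigl(1+\frac{1}{\alpha}\Bigr)E_{P_0}\,q^{\alpha}(X)\Bigr]
\]
\[
 = -\Bigl(1+\frac{1}{\alpha}\Bigr)
   \Bigl(\frac{1}{n}\sum_{i=1}^{n}q^{\alpha}(X_i)-E_{P_0}\,q^{\alpha}(X)\Bigr).
\]
% Taking absolute values and the supremum over q gives the display of the
% proof, which in fact holds with equality.
```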

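We now prove that the second condition in Lemma 1 holds.

Lemma 3. For any $\varepsilon > 0$,

\[ \inf_{\{q:\ \|q - q_\theta^*\| > \varepsilon,\ q\in\mathcal{M}_\theta^E\}} R_\alpha(Q,P_0) > R_\alpha(Q_\theta^*,P_0), \]

where $dQ/d\lambda = q$ and $dQ_\theta^*/d\lambda = q_\theta^*$.

Proof. We thus prove condition (2) in Lemma 1. By Proposition 4,

\[ Q_\theta^* := \arg\inf_{Q\in\mathcal{M}_\theta^E} R_\alpha(Q,P_0) \]

exists and is unique. Denote $q_\theta^* := dQ_\theta^*/d\lambda$. It holds

\[ \inf_{\|q - q_\theta^*\| > \varepsilon,\ q\in\mathcal{M}_\theta^E} R_\alpha(Q,P_0) > R_\alpha(Q_\theta^*,P_0). \]

Indeed by definition, for all $Q$ such that $\frac{dQ}{d\lambda}(x) = q(x)$,

\[ R_\alpha(Q_\theta^*,P_0) \le R_\alpha(Q,P_0) \]

and therefore

\[ R_\alpha(Q_\theta^*,P_0) \le \inf_{\|q - q_\theta^*\| > \varepsilon} R_\alpha(Q,P_0). \]

We now prove that the inequality is strict. From the above display we get

\[ R_\alpha(q_\theta^*,P_0) + \frac{1}{\alpha}\int p_0^{\alpha+1}(x)\,dx \le \inf_{\|q - q_\theta^*\| > \varepsilon} R_\alpha(q,P_0) + \frac{1}{\alpha}\int p_0^{\alpha+1}(x)\,dx, \]

i.e.

\[ D_\alpha(\mathcal{M}_\theta^E,P_0) \le \inf_{\|q - q_\theta^*\| > \varepsilon,\ q\in\mathcal{M}_\theta^E} D_\alpha(Q,P_0). \]

Now if equality holds, there exists $q\in\mathcal{M}_\theta^E$, $q\ne q_\theta^*$, such that

\[ D_\alpha(\mathcal{M}_\theta^E,P_0) = D_\alpha(q_\theta^*,P_0) = D_\alpha(q,P_0). \tag{6.4} \]

It holds $Q\ne Q_\theta^*$ since $q_\theta^*$ and $q$ are distinct elements of $E$. But the projection of $P_0$ on $\mathcal{M}_\theta^E$ is unique, so (6.4) cannot hold.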

We also prove that the third condition in Lemma 1 holds.

Lemma 4. It holds

\[ R_\alpha(q_n(\theta),P_n) \le R_\alpha(q_\theta^*,P_0) + o_P(1). \]

Proof. This follows from the very definition of $q_n(\theta)$, for which $R_\alpha(q_n(\theta),P_n) \le R_\alpha(q,P_n)$ for all $q\in\mathcal{M}_\theta^E$.

Theorem 5. For any $\theta\in\Theta$ it holds, with $q_n(\theta)$ defined in (6.2) and $q_\theta^*$ defined in (6.3),

\[ \sup_{x\in K}|q_n(\theta)(x) - q_\theta^*(x)| \to_P 0. \]

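6.2. Outer minimization. We now consider the minimization in $\theta$, with the following notation. Let

\[ \hat\theta_n := \arg\inf_{\theta}\inf_{q\in\mathcal{M}_\theta^E} R_\alpha(Q,P_n) = \arg\inf_{\theta} R_\alpha(q_n(\theta),P_n) \]

and

\[ \theta_0 := \arg\inf_{\theta}\inf_{q\in\mathcal{M}_\theta^E} R_\alpha(Q,P_0) = \arg\inf_{\theta} R_\alpha(q_\theta^*,P_0). \]

The parameter $\theta_0$ such that $P_0 = P_{\theta_0}$ is defined in a unique way by the above display; indeed firstly note that $\theta_0$ is well defined, either when $P_0\in\mathcal{M}$ (i.e. $P_0 = P_{\theta_0}$; see Theorem 1) or when $P_0\notin\mathcal{M}$, in which case $P_{\theta_0}$ is the $D_\alpha$ projection of $P_0$ on $\mathcal{M}^E$.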

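By Theorem 5, we have proved that

\[ \sup_{x\in K}|q_n(\theta)(x) - q_\theta^*(x)| \to_P 0, \]

where $q_\theta^*$ is defined in (6.3). We want to show that

\[ \arg\inf_{\theta} R_\alpha(q_n(\theta),P_n) \to_P \arg\inf_{\theta} R_\alpha(q_\theta^*,P_0), \]

where $q_\theta^* = \arg\inf_{q\in\mathcal{M}_\theta^E} R_\alpha(Q,P_0)$. By definition $\hat\theta_n := \arg\inf_\theta R_\alpha(q_n(\theta),P_n)$; we prove that

\[ \arg\inf_{\theta} R_\alpha(q_\theta^*,P_0) = \theta_0. \tag{6.5} \]

We consider two cases:

(Case 1) If $P_0\in\mathcal{M}$, i.e. if there exists a unique $\theta_0$ such that $P_0 = P_{\theta_0}$, then (6.5) holds.

(Case 2) If $P_0\notin\mathcal{M}$, then $\theta_0 = \arg\inf_{Q\in\mathcal{M}^E} D_\alpha(Q,P_0)$ and under Condition (3.1),

\[ \theta_0 = \arg\inf_{\theta}\inf_{Q\in\mathcal{M}_\theta^E} D_\alpha(Q,P_0). \]

Therefore (6.5) holds.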

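We make use of Lemma 1 with

$$M_n(\theta) := R_\alpha(q_n(\theta); P_n), \qquad M(\theta) := R_\alpha(Q_\theta; P_0). \tag{6.6}$$

We prove that $\hat\theta_n$ converges to $\theta_0$ making use of Lemma 1. With $q_n(\theta)(x) = \frac{dQ_n(\theta)}{d\lambda}(x)$, and setting

$$d(q_n(\theta), q_\theta) = \sup_{x \in K} |q_n(\theta)(x) - q_\theta(x)|,$$

the following holds.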

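Proposition 5. Suppose that the following condition

$$\sup_{\{q \in M_\theta^E,\; q' \in M_{\theta'}^E,\; d(\theta, \theta') < \delta\}} d(q, q') < K\delta \tag{6.7}$$

holds for some $K > 0$ independent of $\theta$ and $\theta'$; then

$$\sup_{\theta \in \Theta} \sup_{x \in K} |q_n(\theta)(x) - q_\theta(x)| \xrightarrow{P} 0.$$

Proof. By Theorem 5, for all $\theta$,

$$d(q_n(\theta), q_\theta) \to 0 \quad \text{in probability.}$$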

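We want to prove that the convergence is uniform in $\theta$. Define $\theta_n$ by

$$\sup_{\theta \in \Theta} d(q_n(\theta), q_\theta) = d(q_n(\theta_n), q_{\theta_n}). \tag{6.8}$$

Let $\{n_j\} \subset \{n\}$ and suppose that $\theta$ is such that $\theta_{n_j} \to \theta$.

We show that $d(q_{n_j}(\theta_{n_j}), q_{\theta_{n_j}}) > c > 0$ cannot hold.

Now by definition (6.8)

$$\sup_{\theta \in \Theta} d(q_{n_j}(\theta), q_\theta) = d(q_{n_j}(\theta_{n_j}), q_{\theta_{n_j}}) \le d(q_{n_j}(\theta_{n_j}), q_{n_j}(\theta)) + d(q_{n_j}(\theta), q_\theta) + d(q_{\theta_{n_j}}, q_\theta) =: I_1 + I_2 + I_3.$$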

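Now $I_1 = d(q_{n_j}(\theta_{n_j}), q_{n_j}(\theta))$ and $d(\theta_{n_j}, \theta) \to 0$. Hence, under (6.7), $I_1 \xrightarrow{P} 0$. Next, $I_2 = d(q_{n_j}(\theta), q_\theta)$; both $q_{n_j}(\theta)$ and $q_\theta$ belong to $M_\theta^E$. By Theorem 5 in $M_\theta^E$, $d(q_{n_j}(\theta), q_\theta) \xrightarrow{P} 0$, so $I_2 \xrightarrow{P} 0$. Finally, $I_3 = d(q_\theta, q_{\theta_{n_j}}) \xrightarrow{P} 0$ by the same argument as for $I_1$. We have proved that

$$\lim_{j \to \infty} \sup_{\theta \in \Theta} d(q_{n_j}(\theta), q_\theta) = 0 \quad \text{in probability.} \tag{6.9}$$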

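Assume now that the claimed uniform convergence does not hold. In such a case there exist a subsequence $\{m_k\} \subset \{n\}$ and $\eta > 0$ such that

$$\sup_{\theta} d(q_{m_k}(\theta), q_\theta) > \eta.$$

Let $\theta_{m_k} := \arg\sup_{\theta} d(q_{m_k}(\theta), q_\theta)$, whence

$$d(q_{m_k}(\theta_{m_k}), q_{\theta_{m_k}}) > \eta$$

for all $k$. Extract from $\{m_k\}$ a further subsequence $\{n_j\}$ along which $\theta_{n_j}$ converges to some $\theta$. Then (6.9) proves our claim, by contradiction.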

Under Condition (6.7) in Proposition 5, condition (1) in Lemma 1 holds, i.e.

$$\sup_{\theta \in \Theta} |M_n(\theta) - M(\theta)| \xrightarrow{P} 0,$$

with $M_n(\theta)$ and $M(\theta)$ defined in (6.6).

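Proof. Define $M_n(\theta) = R_\alpha(q_n(\theta); P_n)$ and $M(\theta) = R_\alpha(q_\theta; P_0)$, with

$$R_\alpha(q_n(\theta); P_n) = \int q_n^{\alpha+1}(\theta)(x)\,dx - \left(1 + \frac{1}{\alpha}\right) \int q_n^{\alpha}(\theta)(x)\,dP_n(x)$$

and

$$R_\alpha(q_\theta; P_0) = \int q_\theta^{\alpha+1}(x)\,dx - \left(1 + \frac{1}{\alpha}\right) \int q_\theta^{\alpha}(x)\,dP_0(x).$$

Hence

$$\sup_{\theta \in \Theta} |M_n(\theta) - M(\theta)| \le \sup_{\theta \in \Theta} \int \left| q_n^{\alpha+1}(\theta)(x) - q_\theta^{\alpha+1}(x) \right| dx + \left(1 + \frac{1}{\alpha}\right) \sup_{\theta \in \Theta} \int \left| q_n^{\alpha}(\theta)(x) - q_\theta^{\alpha}(x) \right| dP_n(x) + \left(1 + \frac{1}{\alpha}\right) \sup_{\theta \in \Theta} \left| \int q_\theta^{\alpha}(x)\,d(P_n - P_0) \right| =: R_1 + R_2 + R_3.$$

Now

$$R_1 = \sup_{\theta \in \Theta} \int \left| q_n^{\alpha+1}(\theta)(x) - q_\theta^{\alpha+1}(x) \right| dx \le \text{Cste} \cdot \sup_{\theta \in \Theta} \sup_{x \in K} |q_n(\theta)(x) - q_\theta(x)|,$$

which tends to 0 in probability by Proposition 5. Also

$$R_2 = \left(1 + \frac{1}{\alpha}\right) \sup_{\theta \in \Theta} \int \left| q_n^{\alpha}(\theta)(x) - q_\theta^{\alpha}(x) \right| dP_n(x) \le \sup_{\theta \in \Theta} \left(1 + \frac{1}{\alpha}\right) \frac{1}{n} \sum_{i=1}^{n} \left| q_n^{\alpha}(\theta)(X_i) - q_\theta^{\alpha}(X_i) \right| \le \text{Cste} \cdot \sup_{\theta \in \Theta} \sup_{x \in K} |q_n(\theta)(x) - q_\theta(x)|,$$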

which tends to 0 in probability, making use of Proposition 5. We now turn to $R_3$. We prove that the class of functions $q_\theta^{\alpha}$ indexed by $\theta$ satisfies the three following properties: (i) it is indexed by $\theta$ in $\Theta$, a compact subset of $\mathbb{R}^d$; (ii) it is continuous in $\theta$ for all $x$ in $K$; (iii) the function $F$ defined on $K$ by $F(x) := \sup_{\theta \in \Theta} |q_\theta^{\alpha}(x)|$ is such that

$$\int F(x)\,dP_0(x) < \infty.$$

Whenever these three facts hold, then

$$R_3 = \left(1 + \frac{1}{\alpha}\right) \sup_{\theta \in \Theta} \left| \int q_\theta^{\alpha}(x)\,d(P_n - P_0) \right|$$

tends to 0 in probability since $\{q_\theta^{\alpha}\}$ is a Glivenko-Cantelli class, making use of Lemma 1.1 in Empirical Processes: Glivenko-Cantelli Theorems by Moulinath Banerjee (see also J. Wellner, Chapter 1.6, Notes on Empirical Processes, Torgnon Conference).

We now prove the second condition in Lemma 1.

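Lemma 5. For any $\varepsilon > 0$,

$$\inf_{|\theta - \theta_0| > \varepsilon} M(\theta) > M(\theta_0),$$

where

$$M(\theta) = R_\alpha(q_\theta; P_0) = \int q_\theta^{\alpha+1}(x)\,dx - \left(1 + \frac{1}{\alpha}\right) \int q_\theta^{\alpha}(x)\,dP_0(x).$$

Proof. Denote by $q_{\theta_0}$ the projection of $P_0$ on $M^E$, thus $\theta_0 := \arg\inf_{\theta \in \Theta} R_\alpha(q_\theta; P_0)$. For any $\theta \in \Theta$, let $q_\theta$ be the projection of $P_0$ on $M_\theta^E$; hence

$$R_\alpha(q_\theta; P_0) \ge R_\alpha(q_{\theta_0}; P_0).$$

We prove that equality cannot hold in the above display. Let $|\theta - \theta_0| > \varepsilon$. Assume that there exists some $\theta_1$ with $d(q_{\theta_1}, q_{\theta_0}) > \varepsilon$ such that

$$R_\alpha(q_{\theta_1}; P_0) = R_\alpha(q_{\theta_0}; P_0). \tag{6.10}$$

We cannot have equality above because $\theta_0$ achieves the minimum of $R_\alpha(q_\theta; P_0)$ on $\Theta$, and $q \mapsto R_\alpha(q; P_0)$ is strictly convex. So (6.10) cannot hold.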

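We also prove the third condition in Lemma 1.

Lemma 6. $M_n(\theta) \le M(\theta) + o_p(1)$.

Proof. Recall that $M_n(\theta) = R_\alpha(q_n(\theta); P_n)$ and $M(\theta) = R_\alpha(q_\theta; P_0)$. Hence $M_n(\theta) \le R_\alpha(q_\theta; P_n)$ by definition of $q_n(\theta)$ as a minimizer.

Since $R_\alpha(q_\theta; P_n) - R_\alpha(q_\theta; P_0) \xrightarrow{P} 0$ by the Glivenko-Cantelli Theorem, it follows that

$$M_n(\theta) \le R_\alpha(q_\theta; P_0) + \varepsilon_n$$

for $n$ large enough, where $\varepsilon_n \xrightarrow{P} 0$.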
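The Glivenko-Cantelli step used for $R_3$ and in the proof of Lemma 6 can be illustrated by simulation. The sketch below is an assumption-laden toy check, not part of the proof: $P_0$ is the standard Gaussian law, the class is $\{q_\theta^{\alpha}\}$ with $q_\theta$ the $N(\theta, 1)$ density and $\theta$ in a compact grid of $[-2, 2]$, $\alpha = 0.5$, and `q_alpha`, `exact_mean` and `sup_deviation` are hypothetical names. For this family the integral $\int q_\theta^{\alpha}\,dP_0$ is available in closed form, so the uniform deviation between empirical and exact means can be computed directly.

```python
import numpy as np

ALPHA = 0.5  # assumed value of alpha > 0


def q_alpha(x, theta, alpha=ALPHA):
    """q_theta(x)^alpha for q_theta the N(theta, 1) density."""
    return (2.0 * np.pi) ** (-alpha / 2.0) * np.exp(-0.5 * alpha * (x - theta) ** 2)


def exact_mean(theta, alpha=ALPHA):
    """Closed form of int q_theta^alpha dP0 when P0 = N(0, 1)."""
    scale = (2.0 * np.pi) ** (-alpha / 2.0) / np.sqrt(1.0 + alpha)
    return scale * np.exp(-alpha * theta ** 2 / (2.0 * (1.0 + alpha)))


def sup_deviation(sample, thetas):
    """sup over the theta grid of | int q_theta^alpha d(P_n - P0) |."""
    empirical = np.array([np.mean(q_alpha(sample, t)) for t in thetas])
    exact = np.array([exact_mean(t) for t in thetas])
    return float(np.max(np.abs(empirical - exact)))


rng = np.random.default_rng(1)
thetas = np.linspace(-2.0, 2.0, 81)  # grid over an assumed compact parameter set
deviations = {n: sup_deviation(rng.normal(size=n), thetas) for n in (100, 1000, 100000)}
for n, dev in deviations.items():
    print(n, dev)
```

The printed uniform deviation is expected to shrink as $n$ grows, roughly at the rate $n^{-1/2}$, which is the behaviour the Glivenko-Cantelli property guarantees in the limit.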

As a consequence of the above arguments, the following convergence result holds for the minimization of power type divergences on semiparametric models defined by moment conditions.


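Theorem 6. Under all the above conditions, it holds, whenever $P_0$ belongs to $M$ or $P_0$ belongs to $M^E$, with corresponding $\theta_0$,

$$\lim_{n \to \infty} D_\alpha(M; P_n) = 0 \quad \text{and} \quad \lim_{n \to \infty} \hat\theta_n = \theta_0.$$

Also we get

$$\lim_{n \to \infty} d(\hat q, p_{\theta_0}) = 0,$$

and all convergences above hold in probability.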

References

Ali, S. M.; Silvey, S. D. A general class of coefficients of divergence of one distribution from another. J. Roy. Statist. Soc. Ser. B 28 (1966), 131–142.

Basu, Ayanendranath; Harris, Ian R.; Hjort, Nils L.; Jones, M. C. Robust and efficient estimation by minimising a density power divergence. Biometrika 85 (1998), no. 3, 549–559.

Broniatowski, Michel; Keziou, Amor. Minimization of φ-divergences on sets of signed measures. Studia Sci. Math. Hungar. 43 (2006), no. 4, 403–442.

Broniatowski, Michel; Keziou, Amor. Parametric estimation and tests through divergences and the duality technique. J. Multivariate Anal. 100 (2009), no. 1, 16–36.

Broniatowski, Michel; Stummer, Wolfgang. Some universal insights on divergences for statistics, machine learning and artificial intelligence. Geometric structures of information, 149–211, Signals Commun. Technol., Springer, Cham, 2019.

Broniatowski, M. and Keziou, A. (2012). Divergences and duality for estimation and test under moment condition models. J. Statist. Plann. Inference, 142(9), 2554–2573.

Broniatowski, Michel; Vajda, Igor. Several applications of divergence criteria in continuous families. Kybernetika (Prague) 48 (2012), no. 4, 600–636.

Chen, J. H. and Qin, J. (1993). Empirical likelihood estimation for finite populations and the effective usage of auxiliary information. Biometrika, 80(1), 107–116.

Cressie, N. and Read, T. R. C. (1984). Multinomial goodness-of-fit tests. J. Roy. Statist. Soc. Ser. B, 46(3), 440–464.

Csiszár, I. (1963). Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Magyar Tud. Akad. Mat. Kutató Int. Közl., 8, 85–108.

Csiszár, I. (1967). On topological properties of f-divergences. Studia Sci. Math. Hungar., 2, 329–339.


Godambe, V. P. and Thompson, M. E. (1989). An extension of quasi-likelihood estimation. J. Statist. Plann. Inference, 22(2), 137–172. With discussion and a reply by the authors.

Haberman, S. J. (1984). Adjustment by minimum discriminant information. Ann. Statist., 12(3), 971–988.

Hansen, L., Heaton, J., and Yaron, A. (1996). Finite-sample properties of some alternative GMM estimators. Journal of Business and Economic Statistics, 14, 262–280.

Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica, 50(4), 1029–1054.

Hoff, P. D. (2012). Equivariant estimation. Preprint.

Imbens, G. W. (1997). One-step estimators for over-identified generalized method of moments models. Rev. Econom. Stud., 64(3), 359–383.

Jurečková, J. and Picek, J. (2009). Minimum risk equivariant estimator in linear regression model. Statist. Decisions, 27(1), 37–54.

Kuk, A. Y. C. and Mak, T. K. (1989). Median estimation in the presence of auxiliary information. J. Roy. Statist. Soc. Ser. B, 51(2), 261–269.

Lehmann, E. L. and Casella, G. (1998). Theory of point estimation. Springer Texts in Statistics. Springer-Verlag, New York, second edition.

Liese, F. and Vajda, I. (1987). Convex statistical distances, volume 95. BSB B. G. Teubner Verlagsgesellschaft, Leipzig.

McCullagh, P. and Nelder, J. A. (1983). Generalized linear models. Monographs on Statistics and Applied Probability. Chapman & Hall, London.

Newey, W. K. and Smith, R. J. (2004). Higher order properties of GMM and generalized empirical likelihood estimators. Econometrica, 72(1), 219–255.

Owen, A. (1990). Empirical likelihood ratio confidence regions. Ann. Statist., 18(1), 90–120.

Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75(2), 237–249.

Owen, A. B. (2001). Empirical Likelihood. Chapman and Hall, New York.

Pardo, L. (2006). Statistical inference based on divergence measures, volume 185 of Statistics: Textbooks and Monographs. Chapman & Hall/CRC, Boca Raton, FL.

Qin, J. and Lawless, J. (1994). Empirical likelihood and general estimating equations. Ann. Statist., 22(1), 300–325.

Rockafellar, R. T. (1970). Convex analysis. Princeton University Press, Princeton, N.J.

Sheehy, A. (1987). Kullback-Leibler constrained estimation of probability measures. Report, Dept. Statistics, Stanford Univ.

Smith, R. J. (1997). Alternative semi-parametric likelihood approaches to generalized method of moments estimation. Economic Journal, 107, 503–519.

Toma, Aida; Broniatowski, Michel. Dual divergence estimators and tests: robustness results. J. Multivariate Anal. 102 (2011), no. 1, 20–36.


van der Vaart, A. W. Asymptotic statistics. Cambridge Series in Statistical and Probabilistic Mathematics, 3. Cambridge University Press, Cambridge, 1998. xvi+443 pp.