

HAL Id: hal-00613126

https://hal.sorbonne-universite.fr/hal-00613126v5

Submitted on 19 Aug 2011


Minimum divergence estimators, maximum likelihood and exponential families

Michel Broniatowski

To cite this version:
Michel Broniatowski. Minimum divergence estimators, maximum likelihood and exponential families. Statistics and Probability Letters, Elsevier, 2014, 93 (6), pp.27-33. ⟨hal-00613126v5⟩


Minimum divergence estimators, maximum likelihood and exponential families

Michel Broniatowski
LSTA, Université Pierre et Marie Curie
5 Place Jussieu, 75005 Paris, France
e-mail: michel.broniatowski@upmc.fr

Abstract

In this note we prove the dual representation formula of the divergence between two distributions in a parametric model. Resulting estimators for the divergence, as well as for the parameter, are derived. These estimators do not make use of any grouping or smoothing. It is proved that all differentiable divergences induce the same estimator of the parameter on any regular exponential family, which is nothing else but the MLE.

Key words: statistical divergence; minimum divergence estimator; maximum likelihood; exponential family

1 Introduction

1.1 Context and scope of this note

This note presents a short proof of the duality formula for ϕ-divergences defined through differentiable convex functions ϕ in parametric models, and discusses some unexpected phenomenon in the context of exponential families. First versions of this formula appear in [8] p. 33, in [1] in the context of the Kullback-Leibler divergence, and in [7] in a general form. The paper [3] introduces this form in the context of minimal χ²-estimation; a global approach to this formulation is presented in Broniatowski and Kéziou (2006) [2]. Independently, Liese and Vajda (2006) [9] obtained a similar expression based on a much simpler argument than those presented in the above mentioned papers (formula (118) in their paper); however, the proof of their result is merely sketched, and we have found it useful to present a complete treatment of this interesting result in the parametric setting, in contrast with the aforementioned approaches.


The main interest of the resulting expression is that it leads to a wide variety of estimators, by plugging in the empirical measure evaluated on the current data set; thus, for any type of sampling, it yields estimators and inference procedures, for any ϕ-divergence criterion. In the case of simple i.i.d. sampling, the resulting properties of those estimators and of the subsequent inferential procedures are studied in [4].

A striking fact is that all minimum divergence estimators defined through this dual formula coincide with the MLE in exponential families. They henceforth enjoy strong optimality under the standard exponential models, while leading to estimators different from the MLE under models other than the exponential one. Also, this result shows that MLEs of parameters of exponential families are strongly motivated, being generated by the whole continuum of ϕ-divergences.

This note results from joint cooperation with the late Igor Vajda.

1.2 Notation

Let P := {P_θ, θ ∈ Θ} be an identifiable parametric model on R^d, where Θ is a subset of R^s. All measures in P will be assumed to be measure equivalent, sharing therefore the same support. The parameter space Θ need not be open in the present setting. It may even happen that the model includes measures which are not probability distributions; cases of interest cover models including mixtures of probability distributions; see [4]. Let ϕ be a proper closed convex function from ]−∞, +∞[ to [0, +∞] with ϕ(1) = 0 and such that its domain, dom ϕ := {x ∈ R such that ϕ(x) < ∞}, is an interval with endpoints a_ϕ < 1 < b_ϕ (which may be finite or infinite). For two measures P_α and P_θ in P, the ϕ-divergence between P_α and P_θ is defined by

φ(α, θ) := ∫_X ϕ(dP_α/dP_θ (x)) dP_θ(x).

In a broader context, the ϕ-divergences were introduced by [5] as "f-divergences". The basic property of ϕ-divergences states that when ϕ is strictly convex on a neighborhood of x = 1, then

φ(α, θ) = 0 if and only if α = θ.

We refer to [8], chapter 1, for a complete study of those properties. Let us simply note that in general φ(α, θ) and φ(θ, α) are not equal. Hence, ϕ-divergences usually are not distances; they merely measure some difference between two measures. A main feature of divergences between distributions of random variables X and Y is their invariance with respect to a common smooth change of variables.


1.3 Examples of ϕ-divergences

The Kullback-Leibler (KL), modified Kullback-Leibler (KL_m), χ², modified χ² (χ²_m), Hellinger (H), and L¹ divergences are respectively associated to the convex functions ϕ(x) = x log x − x + 1, ϕ(x) = −log x + x − 1, ϕ(x) = (1/2)(x − 1)², ϕ(x) = (1/2)(x − 1)²/x, ϕ(x) = 2(√x − 1)² and ϕ(x) = |x − 1|. All these divergences, except the L¹ one, belong to the class of the so-called "power divergences" introduced in [6] (see also [8], chapter 2), a class which takes its origin from Rényi [10]. They are defined through the class of convex functions

x ∈ ]0, +∞[ ↦ ϕ_γ(x) := (x^γ − γx + γ − 1) / (γ(γ − 1))   (1)

if γ ∈ R \ {0, 1}, ϕ_0(x) := −log x + x − 1 and ϕ_1(x) := x log x − x + 1. So the KL divergence is associated to ϕ_1, the KL_m to ϕ_0, the χ² to ϕ_2, the χ²_m to ϕ_{−1} and the Hellinger distance to ϕ_{1/2}.
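For concreteness, here is a small numerical sketch (our addition, not part of the note) of the generators (1); the function name phi_gamma is ours, and NumPy is assumed available.

import numpy as np

def phi_gamma(x, gamma):
    """Convex generator of the power divergences, eq. (1); x > 0."""
    x = np.asarray(x, dtype=float)
    if gamma == 0.0:                      # KL_m generator: -log x + x - 1
        return -np.log(x) + x - 1.0
    if gamma == 1.0:                      # KL generator: x log x - x + 1
        return x * np.log(x) - x + 1.0
    return (x**gamma - gamma * x + gamma - 1.0) / (gamma * (gamma - 1.0))

# phi_gamma(1, .) = 0 for every gamma, as required of a divergence generator;
# gamma = 2, -1 and 1/2 recover the chi2, chi2_m and Hellinger generators.
for g in (-1.0, 0.0, 0.5, 1.0, 2.0):
    assert abs(float(phi_gamma(1.0, g))) < 1e-12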

It may be convenient to extend the definition of the power divergences in such a way that φ(α, θ) may be defined (possibly infinite) even when P_α or P_θ is not a probability measure. This is achieved by setting

x ∈ ]−∞, +∞[ ↦ ϕ_γ(x) if x ∈ [0, +∞[, and +∞ if x ∈ ]−∞, 0[,   (2)

when dom ϕ = R₊ \ {0}. Note that for the χ²-divergence, the corresponding function ϕ_2(x) := (1/2)(x − 1)² is defined and convex on the whole of R.

We will only consider divergences defined through differentiable functions ϕ, which we assume to satisfy

(RC) There exists a positive δ such that for all c in [1 − δ, 1 + δ], we can find numbers c₁, c₂, c₃ such that

ϕ(cx) ≤ c₁ϕ(x) + c₂|x| + c₃, for all real x.

Condition (RC) holds for all power divergences, including the KL and KL_m divergences.
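As a quick illustration (our computation, not in the original), (RC) can be checked directly for the KL generator ϕ₁: for x ≥ 0,

ϕ₁(cx) = cx log(cx) − cx + 1 = c ϕ₁(x) + (c log c) x + (1 − c),

so (RC) holds with c₁ = c, c₂ = |c log c| and c₃ = |1 − c|.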

2 Dual form of the divergence and dual estimators in parametric models

Let θ and θ_T be any parameters in Θ. We intend to provide a new expression for φ(θ, θ_T).

By strict convexity, for all a and b in the domain of ϕ it holds

ϕ(b) ≥ ϕ(a) + ϕ′(a)(b − a),   (3)

with equality if and only if a = b.


Denote

ϕ#(x) := x ϕ′(x) − ϕ(x).

For any α in Θ, set

a := (dP_θ/dP_α)(x) and b := (dP_θ/dP_{θ_T})(x).

Inserting these values in (3) and integrating with respect to P_{θ_T} yields

φ(θ, θ_T) ≥ ∫ [ ϕ′(dP_θ/dP_α) dP_θ − ϕ#(dP_θ/dP_α) dP_{θ_T} ].

Assume at present that this entails

φ(θ, θ_T) ≥ ∫ ϕ′(dP_θ/dP_α) dP_θ − ∫ ϕ#(dP_θ/dP_α) dP_{θ_T}   (4)

for suitable α's in some set F_θ included in Θ.

When α = θ_T the inequality in (4) turns into an equality, which yields

φ(θ, θ_T) = sup_{α ∈ F_θ} { ∫ ϕ′(dP_θ/dP_α) dP_θ − ∫ ϕ#(dP_θ/dP_α) dP_{θ_T} }.   (5)

Denote

h(θ, α, x) := ∫ ϕ′(dP_θ/dP_α) dP_θ − ϕ#(dP_θ/dP_α (x)),   (6)

from which

φ(θ, θ_T) = sup_{α ∈ F_θ} ∫ h(θ, α, x) dP_{θ_T}.   (7)

Furthermore, by (4), for all suitable α,

φ(θ, θ_T) − ∫ h(θ, α, x) dP_{θ_T} = ∫ h(θ, θ_T, x) dP_{θ_T} − ∫ h(θ, α, x) dP_{θ_T} ≥ 0,

and the function x ↦ h(θ, θ_T, x) − h(θ, α, x) is non-negative, due to (3). It follows that φ(θ, θ_T) − ∫ h(θ, α, x) dP_{θ_T} is zero only if h(θ, α, x) = h(θ, θ_T, x), P_{θ_T}-a.e. Therefore, for any x in the support of P_{θ_T},

[ ∫ ϕ′(dP_θ/dP_{θ_T}) dP_θ − ∫ ϕ′(dP_θ/dP_α) dP_θ ] − ϕ#(dP_θ/dP_α (x)) + ϕ#(dP_θ/dP_{θ_T} (x)) = 0,

which cannot hold for all x when the functions ϕ#(dP_θ/dP_α (·)), ϕ#(dP_θ/dP_{θ_T} (·)) and 1 are linearly independent, unless α = θ_T. We have proved that θ_T is the unique optimizer in (5).

We have skipped some sufficient conditions which ensure that (4) holds. Assume that

∫ ϕ′(dP_θ/dP_α) dP_θ < ∞.   (8)

Assume further that φ(θ, θ_T) is finite. Since

−∫ ϕ#(dP_θ/dP_α (x)) dP_{θ_T} ≤ φ(θ, θ_T) − ∫ ϕ′(dP_θ/dP_α) dP_θ ≤ φ(θ, θ_T) + |∫ ϕ′(dP_θ/dP_α) dP_θ| < +∞,

we obtain

∫ ϕ#(dP_θ/dP_α) dP_{θ_T} > −∞,

which entails (4). When ∫ ϕ#(dP_θ/dP_α) dP_{θ_T} = +∞, then clearly, under (8),

φ(θ, θ_T) > ∫ ϕ′(dP_θ/dP_α) dP_θ − ∫ ϕ#(dP_θ/dP_α) dP_{θ_T} = −∞.

We have proved that (5) holds when α satisfies (8).

Sufficient and simple conditions encompassing (8) can be assessed under standard requirements for nearly all divergences. We state the following lemma (see Liese and Vajda (1987) [8], and Broniatowski and Kéziou (2006) [2], Lemma 3.2).

Lemma 1 Assume that (RC) holds and φ(θ, α) is finite. Then (8) holds.

Summing up, we state

Theorem 2 Let θ belong to Θ and let φ(θ, θ_T) be finite. Assume that (RC) holds. Let F_θ be the subset of all α's in Θ such that φ(θ, α) is finite. Then

φ(θ, θ_T) = sup_{α ∈ F_θ} { ∫ ϕ′(dP_θ/dP_α) dP_θ − ∫ ϕ#(dP_θ/dP_α) dP_{θ_T} }.


For the Cressie-Read family of divergences with γ ≠ 0, 1 this representation writes

φ_γ(θ, θ_T) = sup_{α ∈ F_θ} { 1/(γ−1) ∫ (dP_θ/dP_α)^{γ−1} dP_θ − 1/γ ∫ (dP_θ/dP_α)^γ dP_{θ_T} − 1/(γ(γ−1)) }.
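For the reader's convenience, this display follows from (1) by a routine computation (ours), valid when P_θ and P_{θ_T} are probability measures:

ϕ′_γ(x) = (x^{γ−1} − 1)/(γ − 1),   ϕ#_γ(x) = x ϕ′_γ(x) − ϕ_γ(x) = (x^γ − 1)/γ,

and the constant −1/(γ(γ−1)) collects the terms −1/(γ−1) ∫ dP_θ and +1/γ ∫ dP_{θ_T}.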

The set F_θ may depend on the choice of the parameter θ. Such is the case for the χ² divergence, i.e. ϕ(x) = (x − 1)²/2, when p_θ(x) = θ exp(−θx) 1_{[0,∞)}(x). In most cases the difficulty of dealing with a specific set F_θ depending on θ can be circumvented under the condition

(A) There exists a neighborhood U of θ_T for which φ(θ, θ′) is finite whatever θ and θ′ in U,

which for example holds in the above case for any θ_T. This simplification deserves to be stated in the next result.

Theorem 3 When φ(θ, θ_T) is finite and (RC) holds, then under condition (A)

φ(θ, θ_T) = sup_{α ∈ U} { ∫ ϕ′(dP_θ/dP_α) dP_θ − ∫ ϕ#(dP_θ/dP_α) dP_{θ_T} }.

Furthermore the sup is reached at θ_T and uniqueness holds.

Remark 4 Identifying F_θ might be cumbersome. This difficulty also appears in the classical MLE case, a special case of the above statement with divergence function ϕ_0, for which it is assumed that

∫ log p_θ(x) p_{θ_T}(x) dλ(x) is finite

for θ in a neighborhood of θ_T.

Under the above notation and hypotheses define

T_θ(P_{θ_T}) := arg sup_{α ∈ F_θ} ∫ h(θ, α, x) dP_{θ_T}.   (9)

It then holds

T_θ(P_{θ_T}) = θ_T

for all θ_T in Θ. Also let

S(P_{θ_T}) := arg inf_{θ ∈ Θ} sup_{α ∈ F_θ} ∫ h(θ, α, x) dP_{θ_T},   (10)

which also satisfies

S(P_{θ_T}) = θ_T

for all θ_T in Θ. We thus state

Theorem 5 When φ(θ, θ_T) is finite for all θ in Θ and (RC) holds, both T_θ(P_{θ_T}) defined in (9) and S(P_{θ_T}) defined in (10) are well defined and equal θ_T, for all θ_T in Θ.


3 Plug-in estimators

From (7), simple estimators of θ_T can be defined by plugging any convergent empirical measure in place of P_{θ_T} and taking the infimum in θ in the resulting estimator of φ(θ, θ_T).

In the context of simple i.i.d. sampling, introducing the empirical measure

P_n := (1/n) Σ_{i=1}^n δ_{X_i},

where the X_i's are i.i.d. random variables with common unknown distribution P_{θ_T} in P, the natural estimator of φ(θ, θ_T) is

φ_n(θ, θ_T) := sup_{α ∈ F_θ} ∫ h(θ, α, x) dP_n(x)   (11)
             = sup_{α ∈ F_θ} { ∫ ϕ′(dP_θ/dP_α) dP_θ − (1/n) Σ_{i=1}^n ϕ#(dP_θ/dP_α (X_i)) }.

Since

inf_{θ ∈ Θ} φ(θ, θ_T) = φ(θ_T, θ_T) = 0,

the resulting estimator of φ(θ_T, θ_T) is

φ_n(θ_T, θ_T) := inf_{θ ∈ Θ} φ_n(θ, θ_T) = inf_{θ ∈ Θ} sup_{α ∈ F_θ} ∫ h(θ, α, x) dP_n(x).   (12)

Also, the estimator of θ_T is obtained as

θ̂ := arg inf_{θ ∈ Θ} sup_{α ∈ F_θ} ∫ h(θ, α, x) dP_n(x).   (13)

When (A) holds, F_θ may be substituted by U in the above definitions.

The resulting minimum dual divergence estimators (12) and (13) do not require any smoothing or grouping, in contrast with the classical approach, which involves quantization. The paper [4] provides a complete study of those estimators and of the subsequent inference tools for the usual i.i.d. sampling scheme. For all divergences considered here, these estimators are asymptotically efficient, in the sense that they asymptotically achieve the Cramér-Rao bound. The case ϕ = ϕ_0 leads to θ_ML, the celebrated Maximum Likelihood Estimator (MLE), in the context of simple sampling.
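To make the plug-in procedure concrete, here is a minimal numerical sketch (ours, not part of the note) of the estimators (11) and (13) for the KL divergence ϕ = ϕ₁ on the exponential model p_θ(x) = θe^{−θx} 1_{[0,∞)}(x) used in Section 2. The helper names (density, dual_objective, estimate) and the bounded search interval are arbitrary choices, and the inner integral is computed by generic quadrature.

import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

def density(theta, x):
    return theta * np.exp(-theta * x)

def dual_objective(theta, alpha, sample):
    # M_n(theta, alpha) = int phi'(dP_theta/dP_alpha) dP_theta
    #                     - (1/n) sum_i phi#(dP_theta/dP_alpha (X_i)),
    # with phi'(u) = log u and phi#(u) = u - 1 for phi = phi_1.
    ratio = lambda x: density(theta, x) / density(alpha, x)
    integral, _ = quad(lambda x: np.log(ratio(x)) * density(theta, x),
                       0.0, np.inf)
    empirical = np.mean(ratio(sample) - 1.0)
    return integral - empirical

def estimate(sample, bounds=(0.05, 20.0)):
    # theta_hat = arg inf_theta sup_alpha M_n(theta, alpha), estimator (13).
    def sup_in_alpha(theta):
        res = minimize_scalar(lambda a: -dual_objective(theta, a, sample),
                              bounds=bounds, method="bounded")
        return -res.fun
    return minimize_scalar(sup_in_alpha, bounds=bounds, method="bounded").x

rng = np.random.default_rng(0)
sample = rng.exponential(scale=1.0 / 2.5, size=500)    # theta_T = 2.5
print(estimate(sample), 1.0 / sample.mean())           # both approx theta_T

In accordance with Theorem 6 below, the printed value is close to the MLE 1/X̄_n; on this full exponential family, replacing ϕ₁ by any other smooth generator ϕ_γ yields the same estimator.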


4 Minimum divergence estimators in exponential families

In this section we prove the following result.

Theorem 6 For all divergences φ defined through a differentiable function ϕ satisfying condition (RC), the minimum dual divergence estimator defined by (13) coincides with the MLE on any full exponential family such that φ(θ, α) is finite for all θ and α in Θ.

Let P be an exponential family on R^s with canonical parameter in R^d,

P := { P_θ : p_θ(x) = (dP_θ/dλ)(x) = exp[T(x)′θ − C(θ)], θ ∈ Θ },

where x is in R^s, Θ is an open subset of R^d, and λ is a dominating measure for P. We assume P to be full, namely that the Hessian matrix (∂²/∂θ²)C(θ) is positive definite for all θ in Θ.
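For instance (our illustration), the exponential densities p_θ(x) = θ e^{−θx} 1_{[0,∞)}(x) of Section 2 form such a family, with s = d = 1, T(x) = −x, C(θ) = −log θ, λ the Lebesgue measure on [0, ∞), and (∂²/∂θ²)C(θ) = 1/θ² > 0, so that this family is full.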

Let X₁, ..., X_n be n i.i.d. random variables with common distribution P_{θ_T}, θ_T in Θ. Introduce

M_n(θ, α) := ∫ ϕ′(dP_θ/dP_α) dP_θ − (1/n) Σ_{i=1}^n ϕ#(dP_θ/dP_α (X_i)).

We will prove that

inf_θ sup_α M_n(θ, α) = 0   (14)

for any function ϕ satisfying the assumptions of the theorem; in (14), θ and α run in Θ. This result extends the maximum likelihood case, for which

inf_θ sup_α M_n(θ, α) = inf_θ sup_α { (1/n) Σ_{i=1}^n log p_α(X_i) − (1/n) Σ_{i=1}^n log p_θ(X_i) } = 0.

Direct substitution shows that for any θ,

sup_α M_n(θ, α) ≥ M_n(θ, θ) = 0,

from which

inf_θ sup_α M_n(θ, α) ≥ 0.   (15)

We prove that

α = θ_ML is the unique maximizer of M_n(θ_ML, α),   (16)

which yields

inf_θ sup_α M_n(θ, α) ≤ sup_α M_n(θ_ML, α) = M_n(θ_ML, θ_ML) = 0,

which together with (15) completes the proof. Define

M_{n,1}(θ, α) := ∫ ϕ′(exp A(θ, α, x)) exp B(θ, x) dλ(x),

M_{n,2}(θ, α) := (1/n) Σ_{i=1}^n exp(A(θ, α, X_i)) ϕ′(exp A(θ, α, X_i)),

M_{n,3}(θ, α) := (1/n) Σ_{i=1}^n ϕ(exp A(θ, α, X_i)),

with

A(θ, α, x) := T(x)′(θ − α) + C(α) − C(θ),
B(θ, x) := T(x)′θ − C(θ).

It holds

M_n(θ, α) = M_{n,1}(θ, α) − M_{n,2}(θ, α) + M_{n,3}(θ, α),

with

(∂/∂α) M_{n,1}(θ, α)|_{α=θ} = −ϕ⁽²⁾(1) [∇C(θ) − ∇C(α)|_{α=θ}] = 0 for all θ,

(∂/∂α) M_{n,2}(θ_ML, α)|_{α=θ_ML} = ϕ⁽²⁾(1) (1/n) Σ_{i=1}^n [−T(X_i) + ∇C(α)|_{α=θ_ML}] = 0,

and

(∂/∂α) M_{n,3}(θ_ML, α)|_{α=θ_ML} = (1/n) Σ_{i=1}^n [−T(X_i) + ∇C(α)|_{α=θ_ML}] = 0,

where the two last displays hold iff α = θ_ML. Now

(∂²/∂α²) M_{n,1}(θ_ML, α)|_{α=θ_ML} = [ϕ⁽³⁾(1) + 2ϕ⁽²⁾(1)] (∂²/∂θ²)C(θ_ML),

(∂²/∂α²) M_{n,2}(θ_ML, α)|_{α=θ_ML} = [ϕ⁽³⁾(1) + 4ϕ⁽²⁾(1)] (∂²/∂θ²)C(θ_ML),

(∂²/∂α²) M_{n,3}(θ_ML, α)|_{α=θ_ML} = ϕ⁽²⁾(1) (∂²/∂θ²)C(θ_ML),

whence


(∂/∂α) M_n(θ_ML, α)|_{α=θ_ML} = 0,

(∂²/∂α²) M_n(θ_ML, α)|_{α=θ_ML} = −ϕ⁽²⁾(1) (∂²/∂θ²)C(θ_ML),

which proves (16) and closes the proof.
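As a symbolic sanity check (our addition, not in the paper), the identities ϕ′_γ(1) = 0, ϕ#_γ(x) = (x^γ − 1)/γ and ϕ⁽²⁾_γ(1) = 1, which feed the displays above for the power divergences, can be verified with sympy:

import sympy as sp

x, g = sp.symbols("x gamma", positive=True)
phi = (x**g - g*x + g - 1) / (g * (g - 1))    # generator (1)

phi_prime = sp.diff(phi, x)
phi_sharp = sp.simplify(x * phi_prime - phi)  # phi#(x) := x phi'(x) - phi(x)

assert sp.simplify(phi_prime.subs(x, 1)) == 0           # phi'(1) = 0
assert sp.simplify(phi_sharp - (x**g - 1) / g) == 0     # phi# = (x^g - 1)/g
assert sp.simplify(sp.diff(phi, x, 2).subs(x, 1)) == 1  # phi''(1) = 1

In particular ϕ⁽²⁾_γ(1) = 1 for every γ, so for power divergences the last display reduces to −(∂²/∂θ²)C(θ_ML).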

References

[1] Broniatowski, M. Estimation of the Kullback-Leibler divergence. Math. Methods Statist. 12 (2003), no. 4, 391–409.

[2] Broniatowski, M.; Keziou, A. Minimization of ϕ-divergences on sets of signed measures. Studia Sci. Math. Hungar. 43 (2006), no. 4, 403–442.

[3] Broniatowski, M.; Leorato, S. An estimation method for the Neyman chi-square divergence with application to test of hypotheses. J. Multivariate Anal. 97 (2006), no. 6, 1409–1436.

[4] Broniatowski, M.; Keziou, A. Parametric estimation and tests through divergences and the duality technique. J. Multivariate Anal. 100 (2009), no. 1, 16–36.

[5] Csiszár, I. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. (German) Magyar Tud. Akad. Mat. Kutató Int. Közl. 8 (1963), 85–108.

[6] Read, T. R. C.; Cressie, N. A. C. Goodness-of-fit statistics for discrete multivariate data. Springer Series in Statistics. Springer-Verlag, New York, 1988. xii+211 pp. ISBN: 0-387-96682-X.

[7] Keziou, A. Dual representation of ϕ-divergences and applications. C. R. Math. Acad. Sci. Paris 336 (2003), no. 10, 857–862.

[8] Liese, F.; Vajda, I. Convex statistical distances. Teubner-Texte zur Mathematik [Teubner Texts in Mathematics], 95. BSB B. G. Teubner Verlagsgesellschaft, Leipzig, 1987. ISBN: 3-322-00428-7.

[9] Liese, F.; Vajda, I. On divergences and informations in statistics and information theory. IEEE Trans. Inform. Theory 52 (2006), no. 10, 4394–4412.

[10] Rényi, A. On measures of entropy and information. Proc. 4th Berkeley Sympos. Math. Statist. and Prob. (1961), Vol. I, pp. 547–561. Univ. California Press, Berkeley, Calif.
