arXiv:1711.08328

ROBUST BAYES-LIKE ESTIMATION: RHO-BAYES ESTIMATION

By Yannick Baraud∗ and Lucien Birgé†
University of Luxembourg∗ and Sorbonne Université†

We observe n independent random variables with joint distribution P and pretend that they are i.i.d. with some common density s (with respect to a known measure µ) that we wish to estimate. We consider a density model S for s that we endow with a prior distribution π (with support in S) and build a robust alternative to the classical Bayes posterior distribution which possesses similar concentration properties around s whenever the data are truly i.i.d. and their density s belongs to the model S. Furthermore, in this case, the Hellinger distance between the classical and the robust posterior distributions tends to 0, as the number of observations tends to infinity, under suitable assumptions on the model and the prior. However, unlike what happens with the classical Bayes posterior distribution, we show that the concentration properties of this new posterior distribution are still preserved when the model is misspecified or when the data are not i.i.d. but the marginal densities of their joint distribution are close enough in Hellinger distance to the model S.

MSC 2010 subject classifications: Primary 62G35, 62F15, 62G05, 62G07, 62C20, 62F99.
Keywords and phrases: Bayesian estimation, Rho-Bayes estimation, Robust estimation, Density estimation, Statistical models, Metric dimension, VC-classes.

1. Introduction. The purpose of this paper is to define and study a robust substitute for the classical posterior distribution in the Bayesian framework. It is known that the posterior is not robust with respect to misspecifications of the model. More precisely, if the true distribution P of an n-sample $\boldsymbol X=(X_1,\ldots,X_n)$ does not belong to the support $\mathscr P$ of the prior, and even if it is close to this support in total variation or Hellinger distance, the posterior may concentrate around a point of this support which is quite far from the truth. A simple example is the following one.

Let $P_t$ be the uniform distribution on $[0,t]$ with $t\in S=(0,+\infty)$ and, given $a>0$ and $\alpha>1$, let $\pi$ be the prior with density $t\mapsto Ct^{-\alpha}\mathbb 1_{[a,+\infty)}(t)$, $C=(\alpha-1)a^{\alpha-1}$, with respect to the Lebesgue measure on $\mathbb R_+$. Given an $n$-sample $\boldsymbol X=(X_1,\ldots,X_n)$ with distribution $P_{t_0}$, the posterior distribution function writes as
$$
(1)\qquad t\mapsto G^L(t\,|\,\boldsymbol X)=\left[1-\left(\frac{a\vee X_{(n)}}{t}\right)^{n+\alpha-1}\right]\mathbb 1_{[a\vee X_{(n)},+\infty)}(t)
$$

and, for $t_0>a$, we see that this posterior is highly concentrated on intervals of the form $\big[a\vee X_{(n)},\,(1+cn^{-1})(a\vee X_{(n)})\big]$ with $c>0$ large enough. Now assume that the true distribution has been contaminated and is rather
$$
P=\big(1-n^{-1}\big)\,U\big([0,t_0]\big)+n^{-1}\,U\big([t_0+100,\,t_0+100+n^{-1}]\big).
$$
Although it is quite close to the initial distribution $P_{t_0}$ in total variation distance (their distance is $1/n$), on an event of probability $1-(1-n^{-1})^n>1/2$ we have $t_0+100<X_{(n)}<t_0+100+n^{-1}$, and the posterior distribution is therefore concentrated around $t_0+100$ according to (1). The same problem would occur if we were using the maximum likelihood estimator (MLE for short) as an estimator of $t$.
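The phenomenon is easy to reproduce numerically. Below is a minimal sketch assuming only numpy (the values a = 1, α = 2, t₀ = 5 and the seed are arbitrary choices of ours): it samples from the contaminated distribution P and evaluates the posterior distribution function (1), whose mass sits just above t₀ + 100 whenever at least one observation falls in the contaminating component.

```python
import numpy as np

rng = np.random.default_rng(0)
n, a, alpha, t0 = 200, 1.0, 2.0, 5.0

# Sample from the contaminated distribution P:
# with prob. 1 - 1/n draw from U([0, t0]), else from U([t0+100, t0+100+1/n])
contaminated = rng.random(n) < 1.0 / n
X = np.where(contaminated,
             t0 + 100 + rng.random(n) / n,
             t0 * rng.random(n))
x_max = max(a, X.max())

def posterior_cdf(t):
    # Equation (1): G^L(t | X) = [1 - ((a v X_(n))/t)^(n+alpha-1)] on [a v X_(n), +inf)
    return np.where(t >= x_max, 1 - (x_max / t) ** (n + alpha - 1), 0.0)

# Mass of the interval [x_max, (1 + 5/n) x_max]: close to 1; x_max itself sits
# above t0 + 100 whenever the contamination produced at least one point there
print(x_max, posterior_cdf((1 + 5.0 / n) * x_max))
```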

In the literature, most results about the behaviour of the posterior do not say anything about misspecification. Some papers like Kleijn and van der Vaart (2006, 2012) and Panov and Spokoiny (2015) address this problem, but their results involve the behaviour of the Kullback-Leibler divergence between P and the distributions in $\mathscr P$, as is also often the case when studying the MLE — see for instance Massart (2007). However, two distributions may be very close in Hellinger distance, and therefore indistinguishable with our sample $\boldsymbol X$, but have a large Kullback-Leibler divergence.

Even when the model is exact, the Kullback-Leibler divergence is used to analyze the properties of the Bayes posterior. It is known, mainly from the work of van der Vaart and co-authors — see in particular Ghosal, Ghosh and van der Vaart (2000) — that the posterior distribution concentrates around $P\in\mathscr P$ as n goes to infinity, but those general results require that the prior puts enough mass on neighbourhoods of $P\in\mathscr P$ of the form $\mathscr K(P,\varepsilon)=\{P'\in\mathscr P,\ K(P,P')<\varepsilon\}$, where ε is a positive number and $K(P,P')$ the Kullback-Leibler divergence between P and P′. Unfortunately such neighbourhoods may be empty (and consequently the condition unsatisfied) when the probabilities in $\mathscr P$ are not equivalent, which is for example the case for the translation model of the uniform distribution on [0,1], even though the Bayes method may work well in such cases.


… suitably strong regularity assumptions.

The aim of this paper is to extend the theory developed in BBS and BB (which refer throughout to Baraud, Birgé and Sart (2017) and to Baraud and Birgé (2018) respectively) to a Bayesian paradigm, in view of designing a robust substitute for the classical Bayes posterior distribution. To be somewhat more precise, let us consider a classical Bayesian framework of density estimation from n i.i.d. observations, although other situations could be considered as well. We observe $\boldsymbol X=(X_1,\ldots,X_n)$ where the $X_i$ belong to some measurable space $(\mathscr X,\mathscr A)$ with an unknown distribution P on $\mathscr X$. We have at our disposal a family $\mathscr P=\{P_t,\ t\in S\}$ of possible distributions on $\mathscr X$, which is dominated by a σ-finite measure µ with respective densities $f(x\,|\,t)=(dP_t/d\mu)(x)$. We set $f(\boldsymbol X\,|\,t)=\prod_{i=1}^n f(X_i\,|\,t)$ for the likelihood of t. Assuming that S is a measurable space endowed with a σ-algebra $\mathscr S$, we choose a prior distribution π on S, which leads to a posterior $\pi^L_{\boldsymbol X}$ that is absolutely continuous with respect to π with density $g^L(t\,|\,\boldsymbol X)=(d\pi^L_{\boldsymbol X}/d\pi)(t)$. With these notations, the log-likelihood function and the log-likelihood ratios write respectively as $L(\boldsymbol X\,|\,t)=\log f(\boldsymbol X\,|\,t)=\sum_{i=1}^n\log f(X_i\,|\,t)$ and $L(\boldsymbol X,t,t')=L(\boldsymbol X\,|\,t')-L(\boldsymbol X\,|\,t)$, so that the density $g^L(t\,|\,\boldsymbol X)$ of the posterior distribution $\pi^L_{\boldsymbol X}$ with respect to π is given by
$$
\frac{\exp\left[L(\boldsymbol X\,|\,t)\right]}{\int_S\exp\left[L(\boldsymbol X\,|\,t)\right]d\pi(t)}
=\frac{\exp\left[L(\boldsymbol X\,|\,t)-\sup_{t'\in S}L(\boldsymbol X\,|\,t')\right]}{\int_S\exp\left[L(\boldsymbol X\,|\,t)-\sup_{t'\in S}L(\boldsymbol X\,|\,t')\right]d\pi(t)}
$$
and consequently,
$$
(2)\qquad g^L(t\,|\,\boldsymbol X)=\frac{f(\boldsymbol X\,|\,t)}{\int_Sf(\boldsymbol X\,|\,t)\,d\pi(t)}
=\frac{\exp\left[-\sup_{t'\in S}L(\boldsymbol X,t,t')\right]}{\int_S\exp\left[-\sup_{t'\in S}L(\boldsymbol X,t,t')\right]d\pi(t)}.
$$

Note that, if the MLE $\widehat t(\boldsymbol X)$ exists,
$$
\sup_{t'\in S}L(\boldsymbol X,t,t')=L\big(\boldsymbol X\,|\,\widehat t(\boldsymbol X)\big)-L(\boldsymbol X\,|\,t),
$$
and that we could as well consider, for all β > 0, the distributions $g^L_\beta(t\,|\,\boldsymbol X)\cdot\pi$ with
$$
g^L_\beta(t\,|\,\boldsymbol X)=\frac{\exp\left[\beta L(\boldsymbol X\,|\,t)\right]}{\int_S\exp\left[\beta L(\boldsymbol X\,|\,t)\right]d\pi(t)}.
$$


The posterior corresponds to β = 1 and, when β goes to infinity, the distribution $g^L_\beta(t\,|\,\boldsymbol X)\cdot\pi$ converges weakly, under mild assumptions, to the Dirac measure located at the MLE. All values of β ∈ (1,+∞) then lead to interpolations between the posterior and the Dirac measure at the MLE.

Most problems connected with the maximum likelihood or Bayes estimators are due to the fact that the log-likelihood ratios $L(\boldsymbol X,t,t')$ involve the logarithmic function, which is unbounded. As a result, we may have
$$
\mathbb E_t\big[L(\boldsymbol X,t,t')\big]=-n\,\mathbb E_t\big[\log\big(dP_t/dP_{t'}\big)(X_1)\big]=-\infty;
$$
this happens, for instance, in the uniform family above, where $P_t$ is not absolutely continuous with respect to $P_{t'}$ when $t'<t$. The situation is even more delicate when the true distribution of the $X_i$ is different (even slightly) from $P_t$.

In BBS and BB we offered an alternative to the MLE by replacing the logarithmic function in the log-likelihood ratios by other functions. One possibility is the function ϕ defined by
$$
\varphi(x)=4\,\frac{\sqrt x-1}{\sqrt x+1}\quad\text{for all }x\ge0,
$$
so that, for x > 0,
$$
\varphi'(x)=\frac4{(1+\sqrt x)^2\sqrt x}>0\quad\text{and}\quad\varphi''(x)=-\frac{2(1+3\sqrt x)}{(1+\sqrt x)^3x^{3/2}}<0.
$$
Like the log function, ϕ is increasing, concave and satisfies ϕ(1/x) = −ϕ(x). In fact, the two functions coincide at x = 1, as do their first and second derivatives, and for all x ∈ [1/2, 2],
$$
(3)\qquad 0.99<\frac{\varphi(x)}{\log x}\le1\quad\text{and}\quad|\varphi(x)-\log x|\le0.055\,|x-1|^3.
$$
The main advantage of the function ϕ as compared to the log function lies in its boundedness. It can also be extended to [0,+∞] by continuity by setting ϕ(+∞) = 4. As a consequence, the quantity $\varphi\big(t'(x)/t(x)\big)$ is well-defined (with the conventions a/0 = +∞ for a > 0 and 0/0 = 1) and bounded, and we can use it as a surrogate for $\log\big(t'(x)/t(x)\big)$. This suggests replacing $L(\boldsymbol X,t,t')$ by $4\Psi(\boldsymbol X,t,t')$, where the function Ψ is defined as

$$
(4)\qquad \Psi(\boldsymbol x,t,t')=\sum_{i=1}^n\psi\!\left(\sqrt{\frac{t'(x_i)}{t(x_i)}}\right)
$$
for all $\boldsymbol x\in\mathscr X^n$ and $(t,t')\in S^2$, with the conventions 0/0 = 1, a/0 = +∞ for a > 0 and
$$
(5)\qquad \psi(x)=\frac{x-1}{x+1}\quad\text{for all }x\in[0,+\infty],
$$
so that ϕ(x) = 4ψ(√x). Note that ψ is Lipschitz with Lipschitz constant 2. The important point here is that we have already studied in detail in BB the behaviour and properties of a process which is closely related to $(t,t')\mapsto\Psi(\boldsymbol X,t,t')$.
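Since ϕ and ψ are central to everything that follows, here is a quick numerical sanity check of (5) and of the inequalities (3); a minimal sketch assuming only numpy.

```python
import numpy as np

def psi(x):
    # psi(x) = (x - 1)/(x + 1) on [0, +inf], extended by psi(+inf) = 1; see (5)
    x = np.asarray(x, dtype=float)
    out = np.ones_like(x)
    finite = np.isfinite(x)
    out[finite] = (x[finite] - 1.0) / (x[finite] + 1.0)
    return out

def phi(x):
    # phi(x) = 4 psi(sqrt(x)) = 4 (sqrt(x) - 1)/(sqrt(x) + 1)
    return 4.0 * psi(np.sqrt(np.asarray(x, dtype=float)))

x = np.linspace(0.5, 2.0, 2001)
mask = np.abs(x - 1.0) > 1e-2            # avoid the 0/0 cancellation at x = 1
print((phi(x) / np.log(x))[mask].min())  # ~0.9901 > 0.99, consistent with (3)
print((np.abs(phi(x) - np.log(x)) / np.abs(x - 1.0) ** 3)[mask].max())  # ~0.0549 <= 0.055
```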

We get a pseudo-posterior density with respect to π by replacing in (2) the quantity $\sup_{t'\in S}L(\boldsymbol X,t,t')$ by $4\sup_{t'\in S}\Psi(\boldsymbol X,t,t')$. This pseudo-posterior density can therefore be written
$$
g(t\,|\,\boldsymbol X)=\frac{\exp\left[-4\sup_{t'\in S}\Psi(\boldsymbol X,t,t')\right]}{\int_S\exp\left[-4\sup_{t'\in S}\Psi(\boldsymbol X,t,t')\right]d\pi(t)}.
$$
More generally, we may consider, for β > 0, the random distribution $\pi_{\boldsymbol X}$ given by
$$
(6)\qquad \frac{d\pi_{\boldsymbol X}}{d\pi}(t)=\frac{\exp\left[-\beta\sup_{t'\in S}\Psi(\boldsymbol X,t,t')\right]}{\int_S\exp\left[-\beta\sup_{t'\in S}\Psi(\boldsymbol X,t,t')\right]d\pi(t)}.
$$

This will be the starting point for our study of this Bayes-like framework, with a posterior-like distribution $\pi_{\boldsymbol X}$ defined by (6) that will play a role similar to that of the posterior distribution in the classical Bayesian paradigm, except for the fact that a random variable with distribution $\pi_{\boldsymbol X}$ (conditionally on our sample $\boldsymbol X$) will possess robustness properties with respect to the hypothesis that P belongs to $\mathscr P$. We shall call it the ρ-posterior, by analogy with our construction of ρ-estimators as described in BBS and BB.

To conclude this introduction, let us emphasize the specific properties of our method that distinguish it from classical Bayesian procedures.

— Contrary to the classical Bayesian framework, concentration properties of the ρ-Bayes method do not involve the Kullback-Leibler divergence but only the Hellinger distance.

— Our results are non-asymptotic and given in the form of large deviations of the pseudo-posterior distribution from the true density for a given value n of the number of observations.

— … that this distance is small.

— Due to the just mentioned robustness properties, we may work with an approximate model for the true density. In particular, when the density is assumed to belong to a non-parametric class, it is actually enough to apply our ρ-Bayes procedure on a parametric set S possessing good approximation properties with respect to the elements of that class. Besides, starting from a continuous prior on a continuous model, we can discretize both of them without losing much, provided that our discretization scale is small enough.

— The ρ-posterior also possesses robustness properties with respect to the assumption that the data are i.i.d., provided that the densities of the $X_i$ are close enough to the model S.

Substituting another function for the log-likelihood in the expression of the posterior distribution, as we do here, is not new in the literature. It has often been motivated by the wish to replace the Kullback-Leibler loss, which is naturally associated with the likelihood function, by other losses that are more specifically associated with the problem to be solved (estimation of a mean, classification, etc.) or to deal with the problem of misspecification. This approach leads to quasi-posterior distributions whose properties have been studied by many authors, among which Chernozhukov and Hong (2003) and Bissiri et al. (2016) (see also the references therein). These results do not include robustness, but Chernozhukov and Hong (2003) proved some analogues of the Bernstein-von Mises theorem under suitable assumptions on the model and the loss function. The use of fractional likelihoods by Jiang and Tanner (2008) was motivated by the problem of misspecification. In a sparse parametric framework (the true parameter θ ∈ $\mathbb R^d$ has a small number of nonzero components), Atchadé (2017) replaces the joint density $f_{n,\theta}$ of the observations by a suitable function $q_{n,\theta}$. Together with a prior that forces sparsity, this results in tractable and consistent procedures for high-dimensional parametric problems. All the cited results are of an asymptotic nature, contrary to the next one. Bhattacharya, Pati and Yang (2019) investigate the replacement, in the definition of the posterior, of the likelihood by a fractional one, also considering the case of misspecified models, but use what they call α-divergences instead of the KL one (which may also be infinite) to evaluate the amount of misspecification.

Closer to our approach is the PAC-Bayesian one developed by Catoni (2007), and our parameter β in the definition of the ρ-posterior (6) corresponds to the inverse of the so-called temperature parameter in the definition of the Gibbs measure. This parameter essentially plays no role in our results.

The paper is organized as follows. In Section 2 we present our statistical framework and state our main assumption, which allows us to solve the measurability issues that are inherent in the construction of the posterior. An account of what can be achieved with a ρ-posterior distribution is presented and commented in Section 3 in the density and regression frameworks (with a random design). Our main result can be found in Section 4, where we present the concentration properties of our ρ-posterior distribution. These properties involve two quantities, one of which depends on the choice of the prior while the other is independent of it but depends on the model and the true density. We show how one can control these quantities in Sections 6 and 5 respectively, giving there illustrative examples as well as general theorems that can be applied to many parametric models of interest. Our results on the connection between the classical Bayes posterior and the ρ-one are presented in Section 7. We show that, under suitable assumptions on the density model and the prior, the Hellinger distance between these two distributions tends to 0 at rate $n^{-1/4}(\log n)^{3/4}$ as the sample size n tends to infinity. In particular, this result shows that under suitable assumptions our ρ-Bayes posterior satisfies a Bernstein-von Mises theorem. The problem of a hierarchical prior or, equivalently, that of model selection is handled in Section 8. The proofs and discussions about measurability issues are to be found in the supplemental article [Baraud and Birgé (2020)], while additional results and examples can be found in the original version of this paper, Baraud and Birgé (2017).

2. Framework, notations and basic assumptions.

2.1. The framework and the basic notations. We actually want to deal with more general situations than the one presented in the introduction, namely the case of independent but possibly non-i.i.d. observations, even though the statistician assumes them to be i.i.d. By doing so, our aim is to emphasize the robustness of our ρ-posterior distribution with respect to the assumption that the data are i.i.d. This generalization leads to the following statistical framework. For $n\in\mathbb N^\star=\mathbb N\setminus\{0\}$, we observe a random variable $\boldsymbol X=(X_1,\ldots,X_n)$ defined on $(\Omega,\Xi)$, where the $X_i$ are independent with values in a measurable space $(\mathscr X,\mathscr A)$ endowed with a σ-finite measure µ. We denote by $\mathscr L$ the set of all probability densities u with respect to µ (which means that u is a nonnegative measurable function on $\mathscr X$ such that $\int_{\mathscr X}u(x)\,d\mu(x)=1$) and by $P_u=u\cdot\mu$ the probability on $(\mathscr X,\mathscr A)$ with density $u\in\mathscr L$. We assume that, for each $i\in\{1,\ldots,n\}$, $X_i$ admits a density with respect to µ, i.e. has distribution $P_{s_i}=s_i\cdot\mu$ with $s_i\in\mathscr L$. We set $\boldsymbol s=(s_1,\ldots,s_n)$ and denote by $\mathbb P_{\boldsymbol s}$ the probability on $(\Omega,\Xi)$ that gives $\boldsymbol X$ the distribution $P_{\boldsymbol s}=\bigotimes_{i=1}^nP_{s_i}$ on $\mathscr X^n$, and by $\mathbb E_{\boldsymbol s}$ the corresponding expectation. We shall abusively refer to $\boldsymbol s$ as the (true) density of $\boldsymbol X$. We denote by |A| the cardinality of a finite set A and use the word countable to mean finite or countable. Parametric models will be indexed by some subset Θ of $\mathbb R^d$ and |·| will also denote the Euclidean norm on $\mathbb R^d$. Finally, we shall often use the inequalities

$$
(7)\qquad 2ab\le\alpha a^2+\alpha^{-1}b^2;\qquad(a+b)^2\le(1+\alpha)a^2+(1+\alpha^{-1})b^2\qquad\text{for all }\alpha>0.
$$

2.2. Hellinger type metrics. For all $t,t'\in\mathscr L$, we shall write $h(t,t')$ and $\rho(t,t')$ for the Hellinger distance and affinity between $P_t$ and $P_{t'}$. We recall that the Hellinger distance and affinity between two probabilities P, Q on a measurable space $(\mathscr X,\mathscr A)$ are given respectively by
$$
h(P,Q)=\left[\frac12\int_{\mathscr X}\left(\sqrt{\frac{dP}{d\nu}}-\sqrt{\frac{dQ}{d\nu}}\right)^{\!2}d\nu\right]^{1/2};\qquad
\rho(P,Q)=\int_{\mathscr X}\sqrt{\frac{dP}{d\nu}\,\frac{dQ}{d\nu}}\,d\nu,
$$
where ν denotes an arbitrary measure which dominates both P and Q, the result being independent of the choice of ν. It is well known since Le Cam (1973) that $0\le\rho(P,Q)=1-h^2(P,Q)$ and that the Hellinger distance is related to the total variation distance by the following inequalities:
$$
(8)\qquad h^2(P,Q)\le\sup_{A\in\mathscr A}|P(A)-Q(A)|\le h(P,Q)\sqrt{2-h^2(P,Q)}\le\sqrt2\,h(P,Q).
$$
Therefore robustness with respect to the Hellinger distance implies robustness with respect to the total variation distance.
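For intuition, the quantities h, ρ and the total variation distance are straightforward to approximate for densities on a grid; the sketch below (assuming numpy, with two arbitrary Gaussian densities chosen purely for illustration) checks the identity ρ = 1 − h² and the inequalities (8) numerically.

```python
import numpy as np

# Discretize two densities p, q on a fine grid (w.r.t. Lebesgue measure)
x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
p = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)          # N(0,1), arbitrary choice
q = np.exp(-0.5 * (x - 1.0)**2) / np.sqrt(2 * np.pi)  # N(1,1), arbitrary choice

hell = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q))**2) * dx)
rho  = np.sum(np.sqrt(p * q)) * dx
tv   = 0.5 * np.sum(np.abs(p - q)) * dx               # total variation distance

print(abs(rho - (1 - hell**2)))                       # ~0: rho = 1 - h^2
print(hell**2 <= tv <= hell * np.sqrt(2 - hell**2))   # True: inequalities (8)
```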

The Hellinger closed ball centred at $t\in\mathscr L$ with radius r > 0 is denoted $\mathscr B(t,r)$ and, for $\boldsymbol s\in\mathscr L^n$, we define
$$
\mathscr B(\boldsymbol s,r)=\left\{t\in\mathscr L,\ h^2(\boldsymbol s,t)\le r^2\right\}\quad\text{with}\quad h^2(\boldsymbol s,t)=\frac1n\sum_{i=1}^nh^2(s_i,t)\le1.
$$
Then, for $S\subset\mathscr L$, we set $\mathscr B_S(t,r)=S\cap\mathscr B(t,r)$ and $\mathscr B_S(\boldsymbol s,r)=S\cap\mathscr B(\boldsymbol s,r)$. If the $X_i$ are truly i.i.d. with density s, then $\boldsymbol s=(s,\ldots,s)$ and $h^2(\boldsymbol s,t)=h^2(s,t)$ for all $t\in\mathscr L$, hence $\mathscr B_S(\boldsymbol s,r)=\mathscr B_S(s,r)$.

Note that although h is a genuine distance on the space of all probabilities on $\mathscr X$, therefore on $\{P_t,\ t\in\mathscr L\}$, it is only a pseudo-distance on $\mathscr L$ itself since $h(t,t')=0$ if $t\ne t'$ but $t=t'$ µ-a.e. For simplicity, we shall nevertheless still call h a distance on $\mathscr L$ and set $h(t,A)=\inf_{u\in A}h(t,u)$ for the distance of a point $t\in\mathscr L$ to the subset A of $\mathscr L$. Similarly, $h(\boldsymbol s,A)=\inf_{t\in A}h(\boldsymbol s,t)$.


2.3. Models and main assumptions. We consider a density model S, i.e. a subset of $\mathscr L$, acting as if the data were i.i.d., and our aim is to estimate the n-tuple $\boldsymbol s=(s_1,\ldots,s_n)$ from the observation of $\boldsymbol X$ on the basis of this model. Adopting the Bayesian paradigm, we endow S with a σ-algebra $\mathscr S$ as well as a prior π on $(S,\mathscr S)$. There is no reason for $t\mapsto\Psi(\boldsymbol X,t,t')$ defined by (4) and $t\mapsto\sup_{t'\in S}\Psi(\boldsymbol X,t,t')$ to be measurable functions of t on $(S,\mathscr S)$, nor for the function $\omega\mapsto\sup_{t'\in S}\Psi(\boldsymbol X(\omega),t,t')$ to be a random variable on $(\Omega,\Xi)$. Therefore our ρ-posterior distribution $\pi_{\boldsymbol X}$, as given by (6), might not be well-defined. In order to overcome these difficulties, we introduce the following assumption and also slightly modify the definition of our ρ-posterior distribution that was originally given by (6) in the density framework. The following assumption ensures that the sets and random variables that we shall introduce later are suitably measurable. We refer the reader to the supplemental article [Baraud and Birgé (2020)] for a discussion of Assumption 1 and of how it can be checked on examples.

Assumption 1.
(i) The function $(x,t)\mapsto t(x)$ on $\mathscr X\times S$ is measurable with respect to the σ-algebra $\mathscr A\otimes\mathscr S$.
(ii) There exists a countable subset $\overline S$ of S such that, given $t\in S$ and $t'\in\overline S$, one can find a sequence $(t_k)_{k\ge0}$ in $\overline S$ such that, for all $x\in\mathscr X$,
$$
(9)\qquad \lim_{k\to+\infty}t_k(x)=t(x)\quad\text{and}\quad\lim_{k\to+\infty}\psi\!\left(\sqrt{\frac{t'(x)}{t_k(x)}}\right)=\psi\!\left(\sqrt{\frac{t'(x)}{t(x)}}\right).
$$

Note that it follows from Proposition 15 in the supplemental article [Baraud and Birgé (2020)] that $\overline S$ is dense in S with respect to the distance h. Of course, when S is countable, we shall set $\overline S=S$ without further notice and Assumption 1-(ii) will be automatically satisfied with the σ-algebra $\mathscr S$ gathering all the subsets of S. In the sequel, we shall always assume that the set $\overline S$ associated with the model S has been fixed once and for all.

The following proposition (to be proven in the supplemental article) ensures that the measurability properties required for a proper definition of the posterior distribution hold.

Proposition 1. Under Assumption 1, given $t'\in\overline S$ and $\Psi(\boldsymbol x,t,t')$ defined by (4), the functions
$$
(\boldsymbol x,t)\mapsto\Psi(\boldsymbol x,t,t')\quad\text{and}\quad(\boldsymbol x,t)\mapsto\Psi(\boldsymbol x,t)=\sup_{u\in\overline S}\Psi(\boldsymbol x,t,u)
$$
are measurable with respect to the σ-algebra $\mathscr A\otimes\mathscr S$. Hence the function
$$
\boldsymbol x\mapsto\int_S\exp\left[-\beta\Psi(\boldsymbol x,t)\right]d\pi(t)
$$
is measurable with respect to $\mathscr A$, and the function $t\mapsto h(t,s)$ is measurable with respect to $\mathscr S$ whatever $s\in\mathscr L$.

2.4. The ρ-posterior distribution $\pi_{\boldsymbol X}$. Let $\overline S$ be the countable subset of S provided by Assumption 1. For ω ∈ Ω and β > 0, we define the distribution $\pi_{\boldsymbol X(\omega)}$ on S by its density with respect to the prior π:
$$
(10)\qquad \frac{d\pi_{\boldsymbol X(\omega)}}{d\pi}(t)=g(t\,|\,\boldsymbol X(\omega))=\frac{\exp\left[-\beta\Psi(\boldsymbol X(\omega),t)\right]}{\int_S\exp\left[-\beta\Psi(\boldsymbol X(\omega),t')\right]d\pi(t')}.
$$
Proposition 1 implies that the function $(\omega,t)\mapsto g(t\,|\,\boldsymbol X(\omega))$ is measurable with respect to the σ-algebra $\Xi\otimes\mathscr S$. We recall that the choice β = 4 leads to an analogue of the classical Bayes posterior since the function $x\mapsto4\psi(\sqrt x)$ is close to log x as soon as x is not far from one. Throughout the paper the parameter β will remain fixed, and part of our results will depend on it.

Definition 1. The method that leads from the set S and the prior π on S to the distribution $\pi_{\boldsymbol X}$ (and all related estimators) will be called ρ-Bayes estimation, and $\pi_{\boldsymbol X}$ is the ρ-posterior distribution.
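When the model S is finite (so that $\overline S=S$ and π is a discrete prior), the ρ-posterior (10) is directly computable. The following minimal sketch (assuming numpy; the function and variable names are ours, not the paper's) implements Ψ from (4)-(5) and the weights (10), using the conventions 0/0 = 1 and a/0 = +∞ for a > 0.

```python
import numpy as np

def rho_posterior(X, model, prior, beta=4.0):
    """Rho-posterior weights (10) over a finite model S (here S-bar = S).

    X     : (n,) array of observations
    model : list of vectorized density functions t(.) with respect to mu
    prior : (m,) array of prior probabilities pi({t}) for t in the model
    beta  : the parameter of (10); beta = 4 mimics the classical posterior
    """
    vals = np.array([t(X) for t in model])          # vals[j, i] = t_j(X_i)

    def psi_term(num, den):
        # psi(sqrt(t'(x)/t(x))) with 0/0 = 1, a/0 = +inf and psi(+inf) = 1
        with np.errstate(divide="ignore", invalid="ignore"):
            r = np.sqrt(num / den)
        r = np.where((num == 0) & (den == 0), 1.0, r)
        out = np.ones_like(r)                       # handles r = +inf
        finite = np.isfinite(r)
        out[finite] = (r[finite] - 1.0) / (r[finite] + 1.0)
        return out

    m = len(model)
    # Psi(X, t_j) = sup over t' in S of Psi(X, t_j, t'), a max for finite S
    Psi = np.array([max(psi_term(vals[k], vals[j]).sum() for k in range(m))
                    for j in range(m)])
    logw = np.log(prior) - beta * Psi
    w = np.exp(logw - logw.max())                   # stable normalization
    return w / w.sum()

# Toy usage: uniform densities U([0, theta]) on a grid of theta values
thetas = np.linspace(0.5, 8.0, 40)
model = [lambda x, th=th: (x >= 0) * (x <= th) / th for th in thetas]
X = np.random.default_rng(1).uniform(0.0, 5.0, size=100)
w = rho_posterior(X, model, np.full(len(model), 1.0 / len(model)))
print(thetas[np.argmax(w)])                         # mass near theta = 5
```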

3. A flavour of what a ρ-Bayes procedure can achieve. Throughout this section we take β = 4, the value for which the ρ-posterior distribution is the analogue of the classical Bayes posterior.

3.1. The density framework. Let S be a density model for the supposed common density of our observations $X_1,\ldots,X_n$ and consider the following entropy condition.

Assumption 2. There exists a non-increasing function H from (0,1] to [3,+∞) such that, for any ε ∈ (0,1], there exists a subset $S_\varepsilon$ of S with cardinality not larger than exp[H(ε)] and such that $h(t,S_\varepsilon)\le\varepsilon$ for all t ∈ S.

Proposition 2. Let S satisfy Assumption 2 and let $\varepsilon_n$ be such that
$$
(11)\qquad \varepsilon_n\ge\frac1{2\sqrt n}\quad\text{and}\quad H(\varepsilon_n)\le10^{-6}\,n\varepsilon_n^2.
$$
There exists a prior π on S (depending on $\varepsilon_n$ only) such that, whatever the true density $\boldsymbol s=(s_1,\ldots,s_n)$ and ξ > 0, there exists a measurable subset $\Omega_\xi$ of Ω satisfying $\mathbb P_{\boldsymbol s}(\Omega_\xi)\ge1-e^{-\xi}$ and, for all $\omega\in\Omega_\xi$,
$$
\pi_{\boldsymbol X(\omega)}\big(\{t\in S,\ h(\boldsymbol s,t)\le Cr_n\}\big)\ge1-e^{-\xi'}\quad\text{for all }\xi'>0,
$$
with C a positive universal constant and
$$
r_n=h(\boldsymbol s,S)+\varepsilon_n+\sqrt{\frac{\xi+\xi'}n}.
$$
In particular, if $X_1,\ldots,X_n$ are truly i.i.d. with density $s\in\mathscr L$,
$$
\pi_{\boldsymbol X(\omega)}\big(\mathscr B_S(s,Cr_n)\big)\ge1-e^{-\xi'}\quad\text{with}\quad r_n=h(s,S)+\varepsilon_n+\sqrt{\frac{\xi+\xi'}n}.
$$
This result shows that, with probability close to 1, the ρ-posterior distribution concentrates around points t in the density model S which satisfy
$$
\left[\frac1n\sum_{i=1}^nh^2(s_i,t)\right]^{1/2}\le Cr_n\quad\text{with }r_n\text{ of order }h(\boldsymbol s,S)+\varepsilon_n.
$$

The quantity $\varepsilon_n$ corresponds to the concentration rate we get when the $X_i$ are truly i.i.d. with density in S. For instance, when $H(\varepsilon)=A\varepsilon^{-V}$ for all ε > 0 and some constants A, V > 0, $\varepsilon_n$ is of order $n^{-1/(V+2)}$, as the short computation below shows. This concentration rate remains of the same order as long as $h(\boldsymbol s,S)$ is small enough compared to $\varepsilon_n$, which is actually possible even when none of the densities $s_i$ belongs to S. This stability result accounts for the robustness property of our procedure.
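In detail, with $H(\varepsilon)=A\varepsilon^{-V}$ the two constraints of (11) balance (up to constants) when
$$
A\varepsilon_n^{-V}\asymp n\varepsilon_n^2\iff\varepsilon_n^{V+2}\asymp\frac An\iff\varepsilon_n\asymp\left(\frac An\right)^{1/(V+2)},
$$
which is where the rate $n^{-1/(V+2)}$ quoted above comes from.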

It is well known — see for instance Birgé (1983) and (1986) — that, in many cases, the smallest value of $\varepsilon_n$ which satisfies (11) corresponds to the minimax rate of estimation (with respect to n) over S. Here are two typical illustrations for densities with respect to the Lebesgue measure.

i) Assume that S is the set of all non-increasing densities on [0,1] which are bounded by M < +∞. Of course, if s ∈ S, √s is also non-increasing and bounded by √M, and the Hellinger entropy of S corresponds to the $\mathbb L_2$-entropy of the set $\{\sqrt s,\ s\in S\}$, which is known from van de Geer (2000) to be bounded by $A\varepsilon^{-1}$, leading to an $\varepsilon_n$ of order $n^{-1/3}$, which is known to be the minimax rate for this problem.

ii) … in Birgé (1986) (see in particular his Corollary 3.2), where it is also proved that this rate is minimax (see his Proposition 4.3).

Assumption 2 can actually be replaced by the more general one that S admits a metric dimension bounded by D (according to Definition 3 below), in which case the same conclusion holds with $\varepsilon_n\ge1/(2\sqrt n)$ satisfying $D(\varepsilon_n)\le10^{-6}n\varepsilon_n^2$.

3.2. The regression framework. We observe i.i.d. pairs $X_i=(W_i,Y_i)$ with values in $\mathscr W\times\mathbb R$ drawn from the regression model
$$
Y_i=f^\star(W_i)+\varepsilon_i\quad\text{for }i=1,\ldots,n.
$$
We assume that the regression function $f^\star$ is bounded in sup-norm (denoted $\|\cdot\|_\infty$) by some known number B > 0, that the $W_i$ are i.i.d. with unknown distribution $P_W$ on $\mathscr W$, and that the $\varepsilon_i$ are i.i.d. with unknown density p with respect to the Lebesgue measure λ on $\mathbb R$.

We consider a model $\mathcal F$ for $f^\star$ which is a set of functions on $\mathscr W$ satisfying the following property.

Assumption 3. For all $f\in\mathcal F$, $\|f\|_\infty\le B$, and there exists a non-increasing function H with values in [3,+∞) such that, for all ε > 0, one can find a subset $\mathcal F_\varepsilon\subset\mathcal F$ with cardinality not larger than exp[H(ε)] which satisfies $\inf_{g\in\mathcal F_\varepsilon}\|f-g\|_\infty\le\varepsilon$ for all $f\in\mathcal F$.

The density p being unknown, we consider a candidate density q for p. Denoting by $q_\delta$ the translated density $q_\delta(\cdot)=q(\cdot-\delta)$ for $\delta\in\mathbb R$, we assume that q is of order α ∈ (−1,1], i.e. satisfies, for some constant a > 1,
$$
(12)\qquad a^{-1}\left(|\delta|^{1+\alpha}\wedge a^{-1}\right)\le h^2(q_\delta,q)\le a\left(|\delta|^{1+\alpha}\wedge a^{-1}\right)\quad\text{for all }\delta\in\mathbb R.
$$
Note that the mapping $\delta\mapsto q_\delta$ is one-to-one.

For $f\in\mathcal F$, we denote by $q_f$ the density of $X_1$ (with respect to $\mu=P_W\otimes\lambda$) when p = q and $f^\star=f$, which is given by $q_f(w,y)=q(y-f(w))$. The set $S=\{q_f,\ f\in\mathcal F\}$ is a density model for the true density s of $X_1$. A prior $\pi'$ on $\mathcal F$ induces a prior π on S by taking the image of $\pi'$ under the mapping $f\mapsto q_f$. In turn, the ρ-posterior $\pi_{\boldsymbol X}$ on S induces a ρ-posterior distribution $\pi'_{\boldsymbol X}$ on $\mathcal F$, which is the image of $\pi_{\boldsymbol X}$ under the reciprocal mapping $q_f\mapsto f$. The following result describes its performance.


Proposition 3. Let Assumption 3 hold, let q satisfy (12) for some a > 1 and α ∈ (−1,1], and let $\varepsilon_n$ satisfy
$$
(13)\qquad \varepsilon_n\ge\frac1{2\sqrt n}\quad\text{and}\quad H\!\left[\big(\varepsilon_n^2/a\big)^{1/(1+\alpha)}\right]\le4\cdot10^{-6}\,n\varepsilon_n^2.
$$
There exists a prior $\pi'$ on $\mathcal F$ (which only depends on $\varepsilon_n$) such that, whatever the function $f^\star$ bounded by B, the distribution $P_W$, the density p and the positive number ξ, there exists a measurable subset $\Omega_\xi$ of Ω satisfying $\mathbb P_{\boldsymbol s}(\Omega_\xi)\ge1-e^{-\xi}$ and, for all $\omega\in\Omega_\xi$,
$$
\pi'_{\boldsymbol X(\omega)}\big(\big\{f\in\mathcal F,\ \|f^\star-f\|^{1+\alpha}\le Cr_n^2\big\}\big)\ge1-e^{-\xi'}\quad\text{for all }\xi'>0,
$$
where
$$
(14)\qquad r_n^2=h^2(p,q)+\inf_{f\in\mathcal F}\|f^\star-f\|_\infty^{1+\alpha}+\varepsilon_n^2+\frac{\xi+\xi'}n,
$$
for some constant C > 0 depending on a, B and α only.

Let us first emphasize that neither the prior nor the construction of the posterior requires the knowledge of the distribution $P_W$ of the design, or any assumption about it. The result shows that, with probability close to 1, the ρ-posterior on $\mathcal F$ concentrates around functions $f\in\mathcal F$ for which $\|f-f^\star\|$ is of order
$$
\left[h^2(p,q)+\inf_{f\in\mathcal F}\|f^\star-f\|_\infty^{1+\alpha}+\varepsilon_n^2\right]^{1/(1+\alpha)}.
$$
The quantity $\varepsilon_n^{2/(1+\alpha)}$ corresponds to the concentration rate we get when p is equal to q and $f^\star$ belongs to $\mathcal F$, while the terms $\inf_{f\in\mathcal F}\|f^\star-f\|_\infty^{1+\alpha}$ and $h^2(p,q)$ account for the robustness of the procedure with respect to a misspecification of the class $\mathcal F$ of regression functions and of the noise distribution respectively. The loss and the quantity $\varepsilon_n$ depend on the specific features of the chosen density q.

When $H(\varepsilon)=A\varepsilon^{-V}$ for some constants A, V > 0, then $\varepsilon_n^{2/(1+\alpha)}=C'n^{-1/(V+1+\alpha)}$, where C′ > 0 depends on A, a, α and V only (a short computation is given below). We refer to Ibragimov and Has'minskiĭ (1981), Chapter VI p. 281, for sufficient conditions for the density q to be of order α. For illustration, when q is Gaussian, α = 1, the loss corresponds to the $\mathbb L_2(P_W)$-norm and $\varepsilon_n^{2/(1+\alpha)}=\varepsilon_n$ is of order $n^{-1/(V+2)}$.
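For the record, the rate follows from (13) by the same balancing argument as in the density case: with $H(\varepsilon)=A\varepsilon^{-V}$,
$$
A\big(\varepsilon_n^2/a\big)^{-V/(1+\alpha)}\asymp n\varepsilon_n^2\iff\varepsilon_n^2\asymp n^{-(1+\alpha)/(V+1+\alpha)}\iff\varepsilon_n^{2/(1+\alpha)}\asymp n^{-1/(V+1+\alpha)},
$$
up to constants depending on A, a, α and V only.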


When $\mathcal F$ is a subset of the $\mathbb L_\infty(P_W)$-ball with radius B and center 0 of a linear space with dimension d ≥ 1, a classical result on the entropy of balls in a finite-dimensional linear space implies that Assumption 3 is satisfied with $H(\varepsilon)=d\log(1+[2B/\varepsilon])$, which leads to an upper bound for $\varepsilon_n^{2/(1+\alpha)}$ of order $[d\log(nB/d)/n]^{1/(1+\alpha)}$. Note that this rate is faster than the usual parametric rate $1/\sqrt n$ when α ∈ (−1,1).

Choosing a specific density q and a single model $\mathcal F$ for $f^\star$ is usually not enough for many applications. It is however possible to mix several choices of q and $\mathcal F$ by using a hierarchical prior, as we shall show in Section 8, arguing as in BBS, Sections 7.2 and 7.3.

4. Our main results. Our main results and definitions involve some numerical constants that we list below for further reference:
$$
(15)\qquad
\begin{cases}
c_0=10^3;\quad c_1=15;\quad c_2=16;\quad c_3=0.62;\quad c_4=3.5\max\{375;\beta^{-1/2}\};\\[2pt]
c_5=16\times10^{-3};\quad c_6=7\times10^4;\quad c_7=4.01;\quad c_8=0.365;\\[2pt]
c_9=c_8^{-1}\left[(2c_6)\vee\beta^{-1}\right];\quad c_n=1+\left[(\log2)/\log(en)\right];\quad\gamma=\beta/8.
\end{cases}
$$

The properties of $\pi_{\boldsymbol X}$ actually depend on two quantities, namely $\varepsilon_n^S(\boldsymbol s)$ and $\eta_n^{S,\pi}(t)$ for t ∈ S, that we shall now define. The former depends on S only via $\overline S$, and possibly on $\boldsymbol s$, while the latter depends on the choice of the prior π but not on $\boldsymbol s$.

4.1. The quantity $\varepsilon_n^S(\boldsymbol s)$. Given $\boldsymbol X$ with distribution $\mathbb P_{\boldsymbol s}$ and y > 0, we set
$$
Z(\boldsymbol X,t,t')=\Psi(\boldsymbol X,t,t')-\mathbb E_{\boldsymbol s}\big[\Psi(\boldsymbol X,t,t')\big]\quad\text{and}\quad
w_S(\boldsymbol s,y)=\mathbb E_{\boldsymbol s}\left[\sup_{t,t'\in\mathscr B_{\overline S}(\boldsymbol s,y)}Z(\boldsymbol X,t,t')\right],
$$
with the convention $\sup_\varnothing=0$.

Note that $w_S(\boldsymbol s,y)=w_S(\boldsymbol s,1)$ for y ≥ 1. We then define $\varepsilon_n^S(\boldsymbol s)$ as
$$
(16)\qquad \varepsilon_n^S(\boldsymbol s)=\sup\left\{y>0\ \Big|\ w_S(\boldsymbol s,y)>6c_0^{-1}ny^2\right\}\vee\frac1{\sqrt n},\quad\text{with }\sup\varnothing=0.
$$
Since the function ψ is bounded by 1, $w_S(\boldsymbol s,y)$ is not larger than 2n, hence $\varepsilon_n^S(\boldsymbol s)$ is not larger than $(c_0/3)^{1/2}$. The quantity $\varepsilon_n^S(\boldsymbol s)$ measures in some sense the complexity of the model around $\boldsymbol s$.


4.2. The quantity $\eta_n^{S,\pi}(t)$.

Definition 2. Let γ = β/8. Given the prior π on the model S, we define the function $\eta_n^{S,\pi}$ on S by
$$
\eta_n^{S,\pi}(t)=\sup\left\{\eta\in(0,1]\ \Big|\ \pi\big(\mathscr B_S(t,2\eta)\big)>\exp\left[\gamma n\eta^2\right]\pi\big(\mathscr B_S(t,\eta)\big)\right\},
$$
with the convention sup ∅ = 0.

Note that $\eta_n^{S,\pi}(t)\le1$ since $\pi\big(\mathscr B_S(t,r)\big)=\pi(S)=1$ for r ≥ 1, and that
$$
(17)\qquad \pi\big(\mathscr B_S(t,2r)\big)\le\exp\left[\gamma nr^2\right]\pi\big(\mathscr B_S(t,r)\big)\quad\text{for all }r\in\left[\eta_n^{S,\pi}(t),1\right].
$$
This inequality indeed holds by definition for $r>\eta_n^{S,\pi}(t)$, which implies by monotonicity that it also holds for $r=\eta_n^{S,\pi}(t)$. Conversely, if 0 < η ≤ 1 and
$$
(18)\qquad \pi\big(\mathscr B_S(t,2r)\big)\le\exp\left[\gamma nr^2\right]\pi\big(\mathscr B_S(t,r)\big)\quad\text{for all }r\in[\eta,1],
$$
then $\eta_n^{S,\pi}(t)\le\eta$.

The quantity $\eta_n^{S,\pi}(t)$ corresponds to some critical radius over which the π-probability of balls centred at t does not increase too quickly. In particular, if the prior puts enough mass on a small neighbourhood of t, $\eta_n^{S,\pi}(t)$ is small. Indeed, since $\pi\big(\mathscr B_S(t,2r)\big)\le1$ for all r > 0, the inequality
$$
(19)\qquad \pi\big(\mathscr B_S(t,\eta)\big)\ge\exp\left[-\gamma n\eta^2\right]
$$
for some η ∈ (0,1] implies that, for 1 ≥ r ≥ η,
$$
\pi\big(\mathscr B_S(t,r)\big)\ge\exp\left[-\gamma nr^2\right]\ge\pi\big(\mathscr B_S(t,2r)\big)\exp\left[-\gamma nr^2\right],
$$
hence that $\eta_n^{S,\pi}(t)\le\eta$. However, the upper bounds on $\eta_n^{S,\pi}(t)$ that are derived from (19) are usually less accurate than those derived from (18).

4.3. Our main theorem. The concentration properties of the ρ-posterior distribution $\pi_{\boldsymbol X}$ are given by the following theorem.

Theorem 1. Let Assumption 1 be satisfied. Then, whatever the true density $\boldsymbol s$ of $\boldsymbol X$ and ξ > 0, there exists a measurable subset $\Omega_\xi$ of Ω with $\mathbb P_{\boldsymbol s}(\Omega_\xi)\ge1-e^{-\xi}$ such that
$$
(20)\qquad \pi_{\boldsymbol X(\omega)}\big(\mathscr B_S(\boldsymbol s,r)\big)\ge1-e^{-\xi'}\quad\text{for all }\omega\in\Omega_\xi,\ \xi'>0\text{ and }r\ge r_n,
$$
with
$$
(21)\qquad r_n=\inf_{t\in S}\left[c_1h(\boldsymbol s,t)+c_2\eta_n^{S,\pi}(t)\right]+c_3\varepsilon_n^S(\boldsymbol s)+c_4\sqrt{\frac{\xi+\xi'+2.61}n}.
$$

The constants $c_j$, 1 ≤ j ≤ 4, are given in (15) and are actually universal as soon as $\beta\ge7.2\times10^{-6}$.

In the favorable situation where the observations $X_1,\ldots,X_n$ are truly i.i.d., so that $\boldsymbol s=(s,\ldots,s)$, (20) can be reformulated equivalently as
$$
\pi_{\boldsymbol X(\omega)}\big(\mathscr B_S(s,r)\big)\ge1-e^{-\xi'}\quad\text{for all }\omega\in\Omega_\xi,\ \xi'>0\text{ and }r\ge r_n,
$$
with
$$
(22)\qquad r_n=\inf_{t\in S}\left[c_1h(s,t)+c_2\eta_n^{S,\pi}(t)\right]+c_3\varepsilon_n^S(s)+c_4\sqrt{\frac{\xi+\xi'+2.61}n},
$$
which measures the concentration of the ρ-posterior distribution $\pi_{\boldsymbol X}$ around the true density s of our i.i.d. observations $X_1,\ldots,X_n$. It involves three main terms: $h(s,t)$, $\eta_n^{S,\pi}(t)$ and $\varepsilon_n^S(s)$. For many models S of interest, as we shall see in Section 5, it is possible to show an upper bound of the form
$$
(23)\qquad \varepsilon_n^S(\boldsymbol s)\le v_n(S)\quad\text{for all }\boldsymbol s\in\mathscr L^n,
$$
where $v_n(S)$ is of the order of the minimax rate of estimation on S (up to possible logarithmic factors), i.e. the rate one would expect by using a frequentist or a classical Bayes estimator, provided that the true density s does belong to the model S and the prior distribution puts enough mass around s. Under (23), if s does belong to S, we deduce from (22) that
$$
(24)\qquad r_n\le(c_2+c_3)\max\left\{\eta_n^{S,\pi}(s);\,v_n(S)\right\}+c_4\sqrt{\frac{\xi+\xi'+2.61}n}.
$$

In many cases the quantity $\eta_n^{S,\pi}(s)$ turns out to be of the same order as, or smaller than, $v_n(S)$, provided that the prior π puts enough mass around s. In (22), the term $\inf_{t\in S}\left[c_1h(s,t)+c_2\eta_n^{S,\pi}(t)\right]$ shows that the ρ-posterior may concentrate around s itself when $\eta_n^{S,\pi}(s)$ is not too large, or alternatively around some point t that may be slightly further away from s but for which $\eta_n^{S,\pi}(t)$ is smaller than $\eta_n^{S,\pi}(s)$, so as to minimize the function $t'\mapsto c_1h(s,t')+c_2\eta_n^{S,\pi}(t')$ over S.

If $X_1,\ldots,X_n$ are not truly i.i.d. but are independent and close to being drawn from a common density $s_0\in S$, i.e. $\boldsymbol s=(s_1,\ldots,s_n)$ with $h(s_i,s_0)\le\varepsilon$ for some small ε > 0 and all $i\in\{1,\ldots,n\}$, then $h(\boldsymbol s,s_0)\le\varepsilon$ and $\mathscr B_S(\boldsymbol s,r)\subset\mathscr B_S(s_0,\varepsilon+r)$. We therefore deduce from (20) and (21) with $t=s_0$ that, if (23) holds, the posterior distribution concentrates on Hellinger balls around $s_0$ with radius not larger than
$$
\varepsilon+r_n\le(1+c_1)\varepsilon+(c_2+c_3)\max\left\{\eta_n^{S,\pi}(s_0);\,v_n(S)\right\}+c_4\sqrt{\frac{\xi+\xi'+2.61}n},
$$
which is similar to (24) with $s=s_0$, except for the additional term $(1+c_1)\varepsilon$ which expresses the fact that our procedure is robust with respect to a possible departure from the assumption of equidistribution.

5. Upper bounds for $\varepsilon_n^S(\boldsymbol s)$.

5.1. Case of a finite set S. There are many situations for which it is natural, in view of the robustness properties of the ρ-Bayes posterior, to choose for S a finite set, in which case we take $\overline S=S$ and the quantity $\varepsilon_n^S(\boldsymbol s)$ can then be bounded from above as follows.

Proposition 4. If S is a finite set and $\overline S=S$,
$$
\varepsilon_n^S(\boldsymbol s)<\sqrt{c_0/3}\,\min\left\{\sqrt{2c_0n^{-1}\log\big(2|S|^2\big)};\,1\right\}.
$$

An important example of such a finite set S is that of an ε-net for a totally bounded set. We recall that, if $\widetilde S$ is a subset of some pseudo-metric space M endowed with a pseudo-distance d and ε > 0, a subset $S_\varepsilon$ of M is an ε-net for $\widetilde S$ if, for all $t\in\widetilde S$, one can find $t'\in S_\varepsilon$ such that $d(t,t')\le\varepsilon$. When $\widetilde S$ is totally bounded, one can find a finite ε-net for $\widetilde S$ whatever ε > 0. This applies in particular to totally bounded subsets $\widetilde S$ of $(\mathscr L,h)$. The smallest possible size of such nets depends on the metric properties of $(\widetilde S,h)$, and the following notion of metric dimension, as introduced in Birgé (2006a) (Definition 6 p. 293), turns out to be a central tool.
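For intuition, a finite ε-net for a totally bounded family of densities can be produced by a simple greedy procedure; the sketch below assumes numpy, and the greedy construction is a standard covering device of ours, not the paper's.

```python
import numpy as np

def hellinger(p, q, dx):
    # Hellinger distance between two densities discretized on a common grid
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q))**2) * dx)

def greedy_eps_net(family, eps, dx):
    """Indices of a finite eps-net for `family`: a point is selected only if
    it is farther than eps from all previously selected centres, so every
    member of the family ends up within eps of some centre."""
    net = []
    for j, f in enumerate(family):
        if all(hellinger(f, family[i], dx) > eps for i in net):
            net.append(j)
    return net

# Toy family: uniform densities U([0, theta]) discretized on [0, 10]
x = np.linspace(0.0, 10.0, 10001)
dx = x[1] - x[0]
family = [(x <= th) / th for th in np.linspace(1.0, 9.0, 400)]
net = greedy_eps_net(family, eps=0.2, dx=dx)
print(len(net))   # far fewer centres than the 400 family members
```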

Definition 3. Let D be a right-continuous function from (0,1] to [3/4,+∞). A model $\widetilde S\subset\mathscr L$ admits a metric dimension bounded by D if, for all ε ∈ (0,1], there exists an ε-net $S_\varepsilon$ for $\widetilde S$ such that, for any s in $\mathscr L$,
$$
(25)\qquad \big|\{t\in S_\varepsilon,\ h(s,t)\le r\}\big|\le\exp\left[D(\varepsilon)(r/\varepsilon)^2\right].
$$

Note that this implies that $S_\varepsilon$ is finite and that one can always take D(1) = 3/4 since h is bounded by 1. The following result shows how a bound D on the metric dimension can be used to bound $\varepsilon_n^S(\boldsymbol s)$ for a model S which is an ε-net for $\widetilde S$ satisfying (25).

Proposition 5. Let $\widetilde S$ be a totally bounded subset of $(\mathscr L,h)$ with metric dimension bounded by D and let ε be a positive number satisfying
$$
(26)\qquad \varepsilon\ge\frac1{2\sqrt n}\quad\text{and}\quad D(\varepsilon)\le n(\varepsilon/c_0)^2.
$$
If $S_\varepsilon$ is an ε-net for $\widetilde S$ satisfying (25) and $S=\overline S=S_\varepsilon$, then $\varepsilon_n^S(\boldsymbol s)\le2\varepsilon$ whatever $\boldsymbol s\in\mathscr L^n$.

Starting from a classical statistical model $\widetilde S$ with metric dimension bounded by D, we may therefore replace it by a suitable ε-net S in order to build a ρ-Bayes posterior based on some prior distribution on S. The robustness of the procedure, as shown by Theorem 1, implies that the replacement of $\widetilde S$ by S will only entail an additional bias term of order ε.

5.2. Weak VC-major classes.

Definition 4. A class $\mathcal F$ of real-valued functions on a set $\mathscr X$ is said to be weak VC-major with dimension not larger than $d\in\mathbb N$ if, for all $u\in\mathbb R$, the class of sets
$$
\mathcal C_u(\mathcal F)=\big\{\{f>u\},\ f\in\mathcal F\big\}
$$
is VC on $\mathscr X$ with VC-dimension not larger than d. The weak VC-major dimension of $\mathcal F$ is the smallest such integer d.

For details on the definition and properties of VC-classes, we refer to van der Vaart and Wellner (1996), and for weak VC-major classes to Baraud (2016). One major point about weak VC-major classes is the fact that, if $\mathcal F$ is weak VC-major with dimension not larger than $d\in\mathbb N$, the same holds for any subset $\mathcal F'$ of $\mathcal F$.

Proposition 6. Let $\mathcal F$ be the class of functions on $\mathscr X$ given by
$$
(27)\qquad \mathcal F=\left\{\psi\!\left(\sqrt{t'/t}\right),\ (t,t')\in S^2\right\}.
$$
If $\mathcal F$ is weak VC-major with dimension not larger than d ≥ 1, then
$$
(28)\qquad \varepsilon_n^S(\boldsymbol s)\le\frac{11c_0}4\sqrt{\frac{c_nd}n}\,\log^{3/2}(en)\quad\text{for all }\boldsymbol s\in\mathscr L^n.
$$

5.3. Examples. We provide below examples of parametric models indexed by some subset Θ of a Euclidean space, putting on our models the σ-algebra induced by the Borel one on Θ. Since our results are in terms of VC-dimensions, they hold for all submodels of those described below.

Proposition 7. Let $(g_j)_{1\le j\le J}$, with J ≥ 1, be real-valued functions on a set $\mathscr X$.

a) If the elements t of the model S are of the form
$$
(29)\qquad t(x)=\exp\left[\theta_0+\sum_{j=1}^J\theta_jg_j(x)\right]\quad\text{for all }x\in\mathscr X,
$$
with $\theta_0,\ldots,\theta_J\in\mathbb R$, then $\mathcal F$ defined by (27) is weak VC-major with dimension not larger than d = J + 2.

b) Let $\mathcal J=(I_i)_{i=1,\ldots,k}$ (k ≥ 2) be a partition of $\mathscr X$. If the elements t of the model S are of the form
$$
(30)\qquad t(x)=\sum_{i=1}^k\exp\left[\sum_{j=1}^J\theta_{i,j}g_j(x)\right]\mathbb 1_{I_i}(x)\quad\text{for all }x\in\mathscr X,
$$
with $\theta_{i,j}\in\mathbb R$ for $i=1,\ldots,k$ and $j=1,\ldots,J$, then $\mathcal F$ defined by (27) is weak VC-major with dimension not larger than d = k(J + 2).

If $\mathscr X$ is an interval of $\mathbb R$ (possibly $\mathbb R$ itself), the second part of the proposition extends to densities based on variable partitions of $\mathscr X$.

Proposition 8. Let $(g_j)_{1\le j\le J}$ (J ≥ 1) be real-valued functions on an interval I of $\mathbb R$. Let the elements t of the model S be of the form
$$
(31)\qquad t(x)=\sum_{I\in\mathcal J(t)}\exp\left[\sum_{j=1}^J\theta_{I,j}g_j(x)\right]\mathbb 1_I(x)\quad\text{for all }x\in\mathscr X,
$$
where $\mathcal J(t)$ is a partition of I, which may depend on t, into at most k intervals (k ≥ 2), and $(\theta_{I,j})_{j=1,\ldots,J}\in\mathbb R^J$ for all $I\in\mathcal J(t)$. Then $\mathcal F$ defined by (27) is weak VC-major with dimension not larger than $d=\lceil18.8k(J+2)\rceil$, the smallest integer not smaller than 18.8k(J+2).

For instance, for J = 1, Proposition 8 implies that $\mathcal F$ is weak VC-major with dimension not larger than 56.4k.

Note that the densities t given by (30) can be viewed as elements of a piecewise exponential family. Let us indeed consider a classical exponential family on the set $\mathscr X$ with densities (with respect to µ) of the form
$$
(32)\qquad t_\theta(x)=\exp\left[\sum_{j=1}^J\theta_jT_j(x)-A(\theta)\right]\quad\text{for all }x\in\mathscr X,
$$
with $\theta=(\theta_1,\ldots,\theta_J)\in\Theta\subset\mathbb R^J$. It leads to a model S of the form (29) with $g_j=T_j$ for 1 ≤ j ≤ J and $\theta_0=-A(\theta)$. In particular, $\mathcal F$ is weak VC-major with dimension not larger than d = J + 2 and we deduce from Propositions 6 and 7 that
$$
(33)\qquad \varepsilon_n^S(\boldsymbol s)\le\frac{11c_0}4\sqrt{\frac{c_n(J+2)}n}\,\log^{3/2}(en)\quad\text{for all }\boldsymbol s\in\mathscr L^n.
$$

If all elements of S are piecewise of the form (32) on some partition $\mathcal J=(I_i)_{i=1,\ldots,k}$ of $\mathscr X$ into k subsets, $\mathcal F$ is then weak VC-major with dimension not larger than k(J + 3) and, for some positive universal constant c′,
$$
(34)\qquad \varepsilon_n^S(\boldsymbol s)\le c'\sqrt{\frac{kJ}n}\,\log^{3/2}(en)\quad\text{for all }\boldsymbol s\in\mathscr L^n.
$$

When $\mathscr X=[0,1]$, one illustration of case b) is provided by $\Theta_i=[-M,M]^J$ for $i\in\{1,\ldots,k\}$ and $T_j(x)=x^{j-1}$ for $j\in\{1,\ldots,J\}$. We may then apply Proposition 7, and the performance of the ρ-posterior distribution will depend on the approximation properties of the family of piecewise polynomials on the partition $\mathcal J$ with respect to the logarithm of the true density. Numerous results about such approximations can be found in DeVore and Lorentz (1993).

6. Upper bounds for $\eta_n^{S,\pi}(t)$.

6.1. Uniform distribution on an ε-net. We consider here the situation where $\widetilde S$ is a totally bounded subset of $(\mathscr L,h)$ with metric dimension bounded by D, ε ∈ (0,1], $S=S_\varepsilon$ is an ε-net for $\widetilde S$ which satisfies (25), and we choose π as the uniform distribution on S.

Proposition 9. If $4D(\varepsilon)\le\gamma n\varepsilon^2$, then $\eta_n^{S,\pi}(t)\le\varepsilon$ for all t ∈ S.

Proof. Let t ∈ S. For all r > 0, $\pi\big(\mathscr B_S(t,r)\big)\ge\pi(\{t\})=|S|^{-1}$. Using (25) we derive that
$$
\frac{\pi\big(\mathscr B_S(t,2r)\big)}{\pi\big(\mathscr B_S(t,r)\big)}\le|S|\,\pi\big(\mathscr B_S(t,2r)\big)=\big|\mathscr B_S(t,2r)\big|\le\exp\left[4D(\varepsilon)(r/\varepsilon)^2\right]
$$
for all r ≥ ε. The conclusion follows from the fact that $4D(\varepsilon)/\varepsilon^2\le\gamma n$.

6.2. Parametric models indexed by a bounded subset of $\mathbb R^d$. In this section we consider the situation where S is a parametric model $\{t_\theta,\ \theta\in\Theta\}$ indexed by a measurable (with respect to the Borel σ-algebra) bounded subset $\Theta\subset\mathbb R^d$, and we assume that the prior π is the image under the mapping $\theta\mapsto t_\theta$ of some probability ν on Θ. Besides, we assume that the Hellinger distance on S is related on Θ to some norm $|\cdot|_*$ on $\mathbb R^d$ in the following way:
$$
(35)\qquad \underline a\,\big|\theta-\theta'\big|_*^\alpha\le h(t_\theta,t_{\theta'})\le\overline a\,\big|\theta-\theta'\big|_*^\alpha\quad\text{for all }\theta,\theta'\in\Theta,
$$
where $\underline a$, $\overline a$ and α are positive numbers. Since h is bounded by 1, (35) implies that Θ is necessarily bounded. Let us denote by $B_*(\theta,r)$ the closed ball (with respect to the norm $|\cdot|_*$) with center θ and radius r in $\mathbb R^d$.

Proposition 10. Assume that Θ is measurable and bounded in $\mathbb R^d$, that (35) holds and that ν satisfies
$$
(36)\qquad \nu\big(B_*(\theta,2x)\big)\le\kappa_\theta(x)\,\nu\big(B_*(\theta,x)\big)\quad\text{for all }\theta\in\Theta\text{ and }x>0,
$$
where $\kappa_\theta(x)$ denotes some positive nonincreasing function on $\mathbb R_+$. Then, for all θ ∈ Θ,
$$
(37)\qquad \eta_n^{S,\pi}(t_\theta)\le\inf\left\{\eta>0\ \Big|\ \eta^2\ge\frac{\log\kappa_\theta\big([\eta/\overline a]^{1/\alpha}\big)}{\gamma n}\left[\frac{\log(2\overline a/\underline a)}{\alpha\log2}+1\right]\right\}.
$$
If $\kappa_\theta(x)\equiv\kappa_0$ for all θ ∈ Θ and x > 0, then
$$
(38)\qquad \eta_n^{S,\pi}(t_\theta)\le\sqrt{\frac{\log\kappa_0}{\gamma n}\left[\frac{\log(2\overline a/\underline a)}{\alpha\log2}+1\right]}\quad\text{for all }\theta\in\Theta.
$$
In particular, if Θ is convex and ν admits a density g with respect to the Lebesgue measure λ on $\mathbb R^d$ which satisfies
$$
(39)\qquad \underline b\le g(\theta)\le\overline b\quad\text{for λ-almost all }\theta\in\Theta,\text{ with }0<\underline b\le\overline b,
$$
then (36) holds with $\kappa_\theta(x)\equiv\kappa_0=2^d\,(\overline b/\underline b)$, hence, for all t ∈ S,
$$
(40)\qquad \eta_n^{S,\pi}(t)\le c\sqrt{\frac dn},\quad\text{where }c\text{ depends on }\overline a/\underline a,\ \overline b/\underline b,\ \alpha\text{ and }\gamma\text{ only}.
$$
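The value of $\kappa_0$ comes from convexity and the scaling of Lebesgue volumes; here is a short derivation (our reconstruction of the standard argument): for θ ∈ Θ and x > 0,
$$
\nu\big(B_*(\theta,2x)\big)\le\overline b\,\lambda\big(\Theta\cap B_*(\theta,2x)\big)\le\overline b\,2^d\,\lambda\big(\Theta\cap B_*(\theta,x)\big)\le2^d\,\frac{\overline b}{\underline b}\,\nu\big(B_*(\theta,x)\big),
$$
the middle inequality holding because, Θ being convex, the contraction $y\mapsto\theta+(y-\theta)/2$ maps $\Theta\cap B_*(\theta,2x)$ into $\Theta\cap B_*(\theta,x)$ while dividing Lebesgue volumes by $2^d$.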


6.3. Example. Let us consider, in the density model with n i.i.d. observations on $\mathbb R$, the following translation family: $t_\theta(x)=t(x-\theta)$, where t is the density of the Gamma(2α,1) distribution, namely
$$
t(x)=c(\alpha)\,x^{2\alpha-1}e^{-x}\mathbb 1_{x>0}\quad\text{with }0<\alpha<1,
$$
and θ belongs to the interval Θ = [−1,1]. It is known from Example 1.3 p. 287 of Ibragimov and Has'minskiĭ (1981) that, in this situation, (35) holds for $|\cdot|_*$ the absolute value and $\underline a$, $\overline a$ depending on α. Let us now derive upper bounds for $\eta_n^{S,\pi}(t_\theta)$ when ν has a density g with respect to the Lebesgue measure.

— If ν is uniform on Θ, then $\underline b=\overline b$ and (40) is satisfied for some constant c depending on α and γ only.

— If $g(z)=(\xi/2)|z|^{\xi-1}\mathbb 1_{[-1,1]}(z)$ with 0 < ξ < 1, in order to compute $\kappa_0$ one has to compare the ν-measures of the intervals $I_1=[(\theta-x)\vee-1,(\theta+x)\wedge1]$ and $I_2=[(\theta-2x)\vee-1,(\theta+2x)\wedge1]$ for x > 0.

Proposition 11. If in this example $g(z)=(\xi/2)|z|^{\xi-1}\mathbb 1_{[-1,1]}(z)$, then (38) holds since
$$
\nu(I_2)\le\kappa_0\,\nu(I_1)\quad\text{with}\quad\kappa_0=2^{1+\xi}\big(2^\xi-1\big)^{-1}.
$$
One should therefore note that, if (39) is sufficient for $\kappa_\theta(r)$ to be constant,

it is by no means necessary.

— Let us now set $g(z)=c_\delta^{-1}\exp\left[-2|z|^{-\delta}\right]\mathbb 1_{[-1,1]}(z)$ for some δ > 0, which means that the prior puts very little mass around the point θ = 0. Then:

Proposition 12. In this example, $\eta_n^{S,\pi}(t_0)\le Kn^{-\alpha/(2\alpha+\delta)}$ for some K depending on α, δ, $\underline a$, $\overline a$ and γ.

Combining these bounds when the true parameter θ is 0 leads to a bound for (22) of the form
$$
r_n\le K\left[n^{-\alpha/(2\alpha+\delta)}+\sqrt{\frac{\log^3n}n}+\sqrt{\frac{\xi+\xi'+2.61}n}\right],
$$
which is of order $n^{-\alpha/(2\alpha+\delta)}$ and clearly depends on the relative values of α and δ. In particular, if α = 1/2, which corresponds to the exponential density, we get a bound for $r_n$ of order $n^{-1/[2(1+\delta)]}$.

7. Connection with classical Bayes estimators. Throughout this section we assume that the data $X_1,\ldots,X_n$ are i.i.d. with density s on the measured space $(\mathscr X,\mathscr A,\mu)$.

We consider a parametric set of real nonnegative functions $\{t_\theta,\ \theta\in\Theta\}$ satisfying $\int_{\mathscr X}t_\theta(x)\,d\mu(x)=1$, indexed by some subset Θ of $\mathbb R^d$ and such that the mapping $\theta\mapsto P_\theta=t_\theta\cdot\mu$ is one-to-one, so that our statistical model is identifiable. Our model for s is $S=\{t_\theta,\ \theta\in\Theta\}$. We set $\|t\|_\infty=\sup_{x\in\mathscr X}|t(x)|$ for any function t on $\mathscr X$. Since the mapping $\theta\mapsto t_\theta$ is one-to-one, the Hellinger distance can be transferred to Θ and we shall write $h(\theta,\theta')$ for $h(t_\theta,t_{\theta'})=h(P_\theta,P_{\theta'})$.

We consider on (S,h) the Borel σ-algebra $\mathscr S$ and, given a prior π on $(S,\mathscr S)$, we consider both the usual Bayes posterior distribution $\pi_{\boldsymbol X}^L$ and our ρ-posterior distribution $\pi_{\boldsymbol X}$ given by (6) with β = 4. A natural question is whether these two distributions are similar or not, at least asymptotically when n tends to +∞. This question is suggested by the fact, proven in Section 5.1 of BBS, that, under suitable regularity assumptions, the maximum likelihood estimator is a ρ-estimator, at least asymptotically.

In order to show that the two distributions $\pi_{\boldsymbol X}^L$ and $\pi_{\boldsymbol X}$ are asymptotically close, we shall introduce the following assumptions, which are certainly not minimal but at least lead to simpler proofs.

Assumption 4.
(i) The function $(x,\theta)\mapsto t_\theta(x)$ is measurable from $(\mathscr X\times\Theta,\mathscr A\otimes\mathscr G(\Theta))$ to $(\mathbb R_+,\mathscr R)$, where $\mathscr G(\Theta)$ and $\mathscr R$ denote respectively the Borel σ-algebras on $\Theta\subset\mathbb R^d$ and $\mathbb R_+$.
(ii) The parameter set Θ is a compact and convex subset of $\Theta_0\subset\mathbb R^d$ and the true density $s=t_\vartheta$ belongs to S.
(iii) There exists a positive function $A_2$ on Θ such that the following holds:
$$
(41)\qquad h\!\left(\frac{t_\theta+t_{\theta'}}2,\frac{t_{\overline\theta}+t_{\theta'}}2\right)\ge\frac{A_2(\theta')}2\,\big|\theta-\overline\theta\big|\quad\text{for all }\theta,\overline\theta,\theta'\in\Theta.
$$
(iv) Whatever θ ∈ Θ, the density $t_\theta$ is positive on $\mathscr X$ and there exists a constant $A_1$ such that
$$
\left|\sqrt{\frac{t_\theta}{t_{\theta'}}}-\sqrt{\frac{t_{\overline\theta}}{t_{\theta'}}}\right|\le A_1\big|\theta-\overline\theta\big|\quad\text{for all }\theta,\overline\theta\text{ and }\theta'\in\Theta.
$$

These assumptions imply that (S,h) is a metric space and that the function $t\mapsto t(x)$ from (S,h) to (0,+∞) is continuous whatever $x\in\mathscr X$. Furthermore, Assumption 4-(iv) implies that the Hellinger distance on Θ is controlled by the Euclidean one in the following way:
$$
h^2(\theta,\overline\theta)=\frac12\int\left(\sqrt{\frac{t_\theta}{t_{\theta'}}}-\sqrt{\frac{t_{\overline\theta}}{t_{\theta'}}}\right)^{\!2}t_{\theta'}\,d\mu\le\frac{A_1^2}2\big|\theta-\overline\theta\big|^2.
$$
Since the concavity of the square root implies that
$$
(42)\qquad h\!\left(\frac{t_\theta+t_{\theta'}}2,\frac{t_{\overline\theta}+t_{\theta'}}2\right)\le\frac12\,h\big(\theta,\overline\theta\big),
$$
we derive from (41) with $\theta'=\vartheta$ that $h(\theta,\overline\theta)\ge A_2(\vartheta)\big|\theta-\overline\theta\big|$. The Hellinger and Euclidean distances are therefore equivalent on Θ:
$$
(43)\qquad A_2\big|\theta-\overline\theta\big|\le h\big(\theta,\overline\theta\big)=h(t_\theta,t_{\overline\theta})\le A_3\big|\theta-\overline\theta\big|\quad\text{for all }\theta,\overline\theta\in\Theta,
$$
with $A_2=A_2(\vartheta)<A_3=A_1/\sqrt2$.

In particular, the mapping $t_\theta\mapsto\theta$ is continuous from (S,h) to (Θ,|·|), hence measurable from $(S,\mathscr S)$ to $(\Theta,\mathscr G(\Theta))$, and so are $f:(x,t_\theta)\mapsto(x,\theta)$ from $(\mathscr X\times S,\mathscr A\otimes\mathscr S)$ to $(\mathscr X\times\Theta,\mathscr A\otimes\mathscr G(\Theta))$ and $(x,t_\theta)\mapsto t_\theta(x)$ from $(\mathscr X\times S,\mathscr A\otimes\mathscr S)$ to $(\mathbb R_+,\mathscr R)$, as the composition of f with $(x,\theta)\mapsto t_\theta(x)$, which is measurable under Assumption 4-(i). Consequently Assumption 1-(i) is satisfied, and so is (9) if we take for $\overline S$ the image under the mapping $\theta\mapsto t_\theta$ of a countable and dense subset of (Θ,|·|) and use the fact that, for all $x\in\mathscr X$, the function $t\mapsto t(x)$ is continuous and positive on (S,h). We deduce from (41) and (42) that
$$
(44)\qquad \frac{A_2}2\big|\theta-\overline\theta\big|\le h\!\left(\frac{t_\theta+s}2,\frac{t_{\overline\theta}+s}2\right)\le\frac{A_3}2\big|\theta-\overline\theta\big|\quad\text{for all }\theta,\overline\theta\in\Theta,
$$
and, since ψ is a Lipschitz function with Lipschitz constant 2, Assumption 4-(iv) implies that
$$
\left|\psi\!\left(\sqrt{\frac{t_\theta}{t_{\theta'}}}\right)-\psi\!\left(\sqrt{\frac{t_{\overline\theta}}{t_{\theta'}}}\right)\right|\le2A_1\big|\theta-\overline\theta\big|.
$$


If Θ is a compact subset of an open set $\Theta_0$ and the parametric family $\{t_\theta,\ \theta\in\Theta_0\}$ is regular with invertible Fisher information matrix, the same holds for the family $\{[t_\theta+t_{\theta'}]/2,\ \theta\in\Theta_0\}$ for each given θ′ in Θ, which implies that Assumption 4-(iii) holds.

Assumption 5. The prior π on $(S,\mathscr S)$ is the image under the mapping $\theta\mapsto t_\theta$ of a probability ν on $(\Theta,\mathscr G(\Theta))$ that satisfies the following requirements for suitable constants B > 1 and $\overline\gamma\in[1,4)$: if B(θ,r) denotes the closed Euclidean ball in Θ with center θ and radius r, then, whatever θ ∈ Θ and r > 0,
$$
(45)\qquad \nu\left[B\big(\theta,2^kr\big)\right]\le\exp\left[B\overline\gamma^k\right]\nu\big[B(\theta,r)\big]\quad\text{for all }k\in\mathbb N^\star.
$$

The convexity of Θ and the well-known formulas for the volume of Euclidean balls imply that this property holds for all probabilities which are absolutely continuous with respect to the Lebesgue measure with a density which is bounded from above and below, but other situations are also possible. One simple example would be Θ = [−1,1] and ν with density (1/2)(α+1)|x|^α, α > 0, with respect to the Lebesgue measure.

Theorem 2. Under Assumptions 4 and 5, one can find two functions C and $n_1$ on (0,+∞), also depending on s and all the parameters involved in these assumptions but independent of n, such that, for all $n\ge n_1(z)$,
$$
\mathbb P_s\left[h^2\big(\pi_{\boldsymbol X}^L,\pi_{\boldsymbol X}\big)\le C(z)\,\frac{(\log n)^{3/2}}{\sqrt n}\right]\ge1-e^{-z}\quad\text{for all }z>0.
$$

This means that, under suitably strong assumptions, the usual posterior and our ρ-posterior distributions are asymptotically the same, which shows that our construction is a genuine generalization of the classical Bayesian approach. It also implies that the Bernstein-von Mises theorem holds for $\pi_{\boldsymbol X}$ as well, as shown by the following result.

Corollary 1. Let Assumptions 4 and 5 hold, let $(\widehat\theta_n)$ be an asymptotically efficient sequence of estimators of the true parameter ϑ and assume that the following version of the Bernstein-von Mises theorem is true:
$$
\left\|\pi_{\boldsymbol X}^L-\mathcal N\big(\widehat\theta_n,[nI(\vartheta)]^{-1}\big)\right\|_{TV}\ \xrightarrow[n\to+\infty]{\mathbb P}\ 0.
$$
Then $\pi_{\boldsymbol X}$ also satisfies the Bernstein-von Mises theorem, i.e.
$$
\left\|\pi_{\boldsymbol X}-\mathcal N\big(\widehat\theta_n,[nI(\vartheta)]^{-1}\big)\right\|_{TV}\ \xrightarrow[n\to+\infty]{\mathbb P}\ 0.
$$
Proof. It follows from the triangle inequality and the classical relationship between the Hellinger and total variation distances given by (8).

8. Combining different models.

8.1. Priors and models. In the case of simple parametric problems with parameter set Θ, such as those we considered in Section 7, S is the image of a subset of some Euclidean space $\mathbb R^d$ and one often chooses for π the image of a probability on Θ which has a density with respect to the Lebesgue measure. The choice of a convenient prior π becomes more complex when S is a complicated function space which is very inhomogeneous with respect to the Hellinger distance. In such a case it is often useful to introduce "models", that is, to consider S as a countable union of more elementary and homogeneous disjoint subsets $S_m$, $m\in\mathcal M$, and to choose a prior $\pi_m$ on each $S_m$ in such a way that Theorem 1 applies to each model $S_m$ and leads to a non-trivial result. It remains to put all models together by choosing some prior ν on $\mathcal M$ and defining our final prior π on $S=\bigcup_{m\in\mathcal M}S_m$ as $\sum_{m\in\mathcal M}\nu(\{m\})\pi_m$. This corresponds to a hierarchical prior.

One can as well proceed in the opposite way, starting from a global prior π on S and partitioning S into subsets $S_m$, $m\in\mathcal M$, of positive prior probability, then setting $\nu(\{m\})=\pi(S_m)$ and defining $\pi_m$ as the conditional distribution of a random element t ∈ S when it belongs to $S_m$. The two points of view are actually clearly equivalent, the important fact for us being that the pairs $(S_m,\pi_m)$ are such that Theorem 1 can be applied to each of them.

Throughout this section, we work within the following framework. Given a countable sequence of disjoint probability spaces $(S_m,\mathscr S_m,\pi_m)_{m\in\mathcal M}$ on $(\mathscr X,\mathscr A)$, we consider $S=\bigcup_{m\in\mathcal M}S_m$ endowed with the σ-algebra $\mathscr S$ defined as
$$
\mathscr S=\{A\subset S,\ A\cap S_m\in\mathscr S_m\text{ for all }m\in\mathcal M\}.
$$

In order to define our prior, we introduce a mapping pen from $\mathcal M$ into $\mathbb R_+$ that will also be involved in the definition of our ρ-posterior distribution. The prior π on S is given by
$$
(46)\qquad \pi(A)=\Delta\sum_{m\in\mathcal M}\int_{A\cap S_m}\exp[-\beta\,\mathrm{pen}(m)]\,d\pi_m(t),
$$
with
$$
\Delta=\left[\sum_{m\in\mathcal M}\int_{S_m}\exp[-\beta\,\mathrm{pen}(m)]\,d\pi_m(t)\right]^{-1},
$$
so that π is a genuine prior. This amounts to putting a prior weight proportional to $\exp[-\beta\,\mathrm{pen}(m)]$ on the model $S_m$. We shall assume the following.

Assumption 6.
(i) For all $m\in\mathcal M$, the function $(x,t)\mapsto t(x)$ on $\mathscr X\times S_m$ is measurable with respect to the σ-algebra $\mathscr A\otimes\mathscr S_m$.
(ii) For all $m\in\mathcal M$, there exists a countable subset $\overline S_m$ of $S_m$ with the following property: given $t\in S_m$ and $t'\in\overline S=\bigcup_{m'\in\mathcal M}\overline S_{m'}$, one can find a sequence $(t_k)_{k\ge0}$ in $\overline S_m$ such that (9) holds for all $x\in\mathscr X$.
(iii) There exists a mapping $m\mapsto\varepsilon_m^2$ from $\mathcal M$ to $\mathbb R_+$ such that, whatever the density $\boldsymbol s\in\mathscr L^n$,
$$
(47)\qquad \varepsilon_n^{S_m\cup S_{m'}}(\boldsymbol s)\le\sqrt{\varepsilon_m^2+\varepsilon_{m'}^2}\quad\text{for all }m,m'\in\mathcal M.
$$
(iv) Given a set $\{L_m,\ m\in\mathcal M\}$ of nonnegative numbers satisfying
$$
(48)\qquad \sum_{m\in\mathcal M}\exp[-L_m]=1,
$$
the penalty function pen is lower bounded in the following way:
$$
(49)\qquad \mathrm{pen}(m)\ge c_5n\varepsilon_m^2+\big(c_6+\beta^{-1}\big)L_m\quad\text{for all }m\in\mathcal M,
$$
with the constants $c_5$ and $c_6$ defined in (15).

8.2. The results. We define the ρ-posterior distribution $\pi_{\boldsymbol X}$ on S by its density with respect to the prior π given by (46) as follows:
$$
(50)\qquad \frac{d\pi_{\boldsymbol X}}{d\pi}(t)=\frac{\exp\left[-\beta\Psi(\boldsymbol X,t)\right]}{\int_S\exp\left[-\beta\Psi(\boldsymbol X,t')\right]d\pi(t')}\quad\text{for all }t\in S,
$$
with
$$
\Psi(\boldsymbol X,t)=\sup_{m\in\mathcal M}\left[\sup_{t'\in\overline S_m}\Psi(\boldsymbol X,t,t')-\mathrm{pen}(m)\right].
$$
Note that, if we chose β = 1 and replaced $\Psi(\boldsymbol X,t,t')$ by the difference of the log-likelihoods $\sum_{i=1}^n\log t'(X_i)-\sum_{i=1}^n\log t(X_i)$, then $\pi_{\boldsymbol X}$ would be the usual posterior distribution.
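As in Section 2.4, this penalized variant is directly computable when there are finitely many finite models; below is a minimal sketch extending the earlier rho_posterior snippet (the function names and the organisation of the code are ours, not the paper's), implementing the prior (46) and the density (50).

```python
import numpy as np

def rho_posterior_models(X, models, priors, pens, beta=4.0):
    """Rho-posterior (50) for a finite collection of finite models.

    models : list over m of lists of vectorized densities (the sets S_m)
    priors : list over m of arrays pi_m (each summing to one)
    pens   : list over m of penalty values pen(m), as in (49)
    Returns one weight array per model, jointly normalized.
    """
    flat = [t for Sm in models for t in Sm]
    vals = np.array([t(X) for t in flat])

    def psi_term(num, den):
        # psi(sqrt(t'/t)) with 0/0 = 1, a/0 = +inf and psi(+inf) = 1
        with np.errstate(divide="ignore", invalid="ignore"):
            r = np.sqrt(num / den)
        r = np.where((num == 0) & (den == 0), 1.0, r)
        out = np.ones_like(r)
        fin = np.isfinite(r)
        out[fin] = (r[fin] - 1.0) / (r[fin] + 1.0)
        return out

    # Psi(X, t) = sup_m [ sup_{t' in S_m} Psi(X, t, t') - pen(m) ]
    sizes = [len(Sm) for Sm in models]
    starts = np.cumsum([0] + sizes)
    Psi = np.empty(len(flat))
    for j in range(len(flat)):
        per_model = [max(psi_term(vals[k], vals[j]).sum()
                         for k in range(starts[m], starts[m + 1])) - pens[m]
                     for m in range(len(models))]
        Psi[j] = max(per_model)

    # Prior (46): weight exp(-beta*pen(m)) * pi_m, then the density (50);
    # the normalizing constant Delta is absorbed by the final normalization
    logw = np.concatenate([np.log(priors[m]) - beta * pens[m] +
                           np.zeros(sizes[m]) for m in range(len(models))])
    logw -= beta * Psi
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return [w[starts[m]:starts[m + 1]] for m in range(len(models))]
```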


We now introduce a function η on S which associates to an element $t\in S_m$, with $m\in\mathcal M$, the quantity $\eta_n^2(t)$ given by
$$
(51)\qquad \eta_n^2(t)=\inf_{r\in(0,1]}\left[c_7r^2+\frac1{2n\beta}\log\left(\frac1{\pi_m\big(\mathscr B_{S_m}(t,r)\big)}\right)\right]\quad\text{for all }t\in S_m,
$$
which only depends on the choice of the prior $\pi_m$ on $S_m$. Taking r = 1, we see that $\eta_n^2(t)\le c_7$ for all t ∈ S. Moreover, if, for some η ∈ (0,1] and λ > 0,
$$
\pi_m\big(\mathscr B_{S_m}(t,r)\big)\ge\exp\left[-\lambda nr^2\right]\quad\text{for all }r\ge\eta,\ m\in\mathcal M\text{ and }t\in S_m,
$$
then
$$
\eta_n^2(t)\le\inf_{r\ge\eta}\left[c_7r^2+\frac{\lambda r^2}{2\beta}\right]=\left(c_7+\frac\lambda{2\beta}\right)\eta^2,
$$
a result which is similar to the one we derived for $\eta_n^{S,\pi}(t)$ in Section 4.2 under an analogous assumption.

Theorem 3. Let Assumption 6 hold. For all ξ > 0 and whatever the density $\boldsymbol s\in\mathscr L^n$ of $\boldsymbol X$, there exists a set $\Omega_\xi$ with $\mathbb P_{\boldsymbol s}(\Omega_\xi)\ge1-e^{-\xi}$ and such that
$$
\pi_{\boldsymbol X(\omega)}\big(\mathscr B_S(\boldsymbol s,r)\big)\ge1-e^{-\xi'}\quad\text{for all }\omega\in\Omega_\xi,\ \xi'>0\text{ and }r\ge r_n,
$$
with
$$
r_n^2=\inf_{m\in\mathcal M}\inf_{s\in S_m}\left[\frac{3c_7}{c_8}\left(h^2(\boldsymbol s,s)-h^2(\boldsymbol s,S)\right)+\frac2{c_8}\left(\frac{2\,\mathrm{pen}(m)}n+\eta_n^2(s)-\frac{L_m}{\beta n}\right)\right]+c_9\,\frac{\xi+\xi'+2.4}n
$$
and the constants $c_j$, 7 ≤ j ≤ 9, defined in (15).

This result about the concentration of the ρ-posterior distribution is analogous to what one can obtain from a frequentist point of view by using a model selection method. Up to possible extra logarithmic terms, the ρ-posterior concentrates at a rate which achieves the best compromise between the approximation and complexity terms among the family of models.

8.3. Model selection among exponential families. In this section we pretend that the observations $X_1,\ldots,X_n$ are i.i.d. but keep in mind that the $X_i$ might not be equidistributed, so that their true joint density $\boldsymbol s$ might not be of the form $(s,\ldots,s)$.

Hereafter, $\ell_2(\mathbb N)$ denotes the Hilbert space of all square-summable sequences $\theta=(\theta_j)_{j\ge0}$ of real numbers, that we endow with the Hilbert norm |·| and the inner product ⟨·,·⟩. Let $\mathcal M=\mathbb N$, let M be some positive number and, for $m\in\mathcal M$, let $\Theta'_m$ be the subset of $\ell_2(\mathbb N)$ of those sequences $\theta=(\theta_j)_{j\ge0}$ such that $\theta_j\in[-M,M]$ for 0 ≤ j ≤ m and $\theta_j=0$ for all j > m.

For a sequence $\mathbf T=(T_j)_{j\ge0}$ of linearly independent measurable real-valued functions on $\mathscr X$ with $T_0\equiv1$ and $m\in\mathcal M$, we define the density model $S_m$ as the exponential family
$$
S_m=\big\{t_\theta=\exp\left[\langle\theta,\mathbf T\rangle-A(\theta)\right],\ \theta\in\Theta'_m,\ \theta_m\ne0\big\},
$$
where A denotes the mapping from $\Theta=\bigcup_{m\in\mathcal M}\Theta'_m$ to $\mathbb R$ defined by
$$
A(\theta)=\log\int_{\mathscr X}\exp\left[\langle\theta,\mathbf T(x)\rangle\right]d\mu(x),
$$
and µ is a finite measure on $\mathscr X$. Note that, whatever θ ∈ Θ, $x\mapsto\langle\theta,\mathbf T(x)\rangle$ is well-defined on $\mathscr X$ since only a finite number of coefficients of θ are non-zero.

and µ is a finite measure onX . Note that, whatever θ ∈ Θ, x 7→ hθ, T(x)i is well-defined onX since only a finite number of coefficients of θ are non-zero. For all m ∈ M, we endow Sm with the Borel σ-algebraSm and the prior

πm which is the image of the uniform distribution on Θ0m (identified with

[−M, M ]m+1) by the mapping θ 7→ tθ on Θ0m. Throughout this section,

we consider the family of (disjunct) measured spaces (Sm,Sm, πm) with

m ∈ M together with the choice Lm = (m + 1) log 2 for all m ∈ M, so

thatP

m∈Me−Lm = 1. Then S =

S

m∈MSm, π is given by (46) and for all

m ∈ M, pen(m) = c5nε2m+ (c6+ β−1)Lm with εm = 11c0 4 r cn(m + 3) n log 3/2(en)

and c0, c5, c6, cn defined in (15). In such a situation we derive the following

result.

Proposition 13. Assume that, for all $m\in\mathcal M$, the restriction $A_m$ of A to $\Theta'_m$ is convex and twice differentiable on the interior of $\Theta'_m$ with a Hessian whose eigenvalues lie in $(0,\sigma_m]$ for some $\sigma_m>0$. Whatever the density $\boldsymbol s$ of $\boldsymbol X$, for all ξ > 0, with $\mathbb P_{\boldsymbol s}$-probability at least $1-e^{-\xi}$,
$$
\pi_{\boldsymbol X}\big(\mathscr B_S(\boldsymbol s,r)\big)\ge1-e^{-\xi'}\quad\text{for all }\xi'>0\text{ and all }r\in[r_n,1],
$$
with
$$
r_n^2\le C(\beta)\inf_{m\ge1}\left[h^2(\boldsymbol s,S_m)+\frac{m+1}n\left(\log^3(en)+\log\big(1+n\sigma_m^2M^2\big)\right)\right]+c_9\,\frac{\xi+\xi'+2.4}n.
$$


Acknowledgements. The first author has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 811017.

The second author was supported by the grant ANR-17-CE40-0001-01 of the French National Research Agency ANR (project BASICS) and by Laboratoire J.A. Dieudonné (Nice).

References.

Atchadé, Y. A. (2017). On the contraction properties of some high-dimensional quasi-posterior distributions. Ann. Statist., 45(5):2248–2273.

Baraud, Y. (2016). Bounding the expectation of the supremum of an empirical process over a (weak) VC-major class. Electron. J. Statist., 10(2):1709–1728.

Baraud, Y. and Birgé, L. (2016). Rho-estimators for shape restricted density estimation. Stochastic Process. Appl., 126(12):3888–3912.

Baraud, Y. and Birgé, L. (2017). Robust Bayes-like estimation: Rho-Bayes estimation. Technical report, arXiv:1711.08328v1.

Baraud, Y. and Birgé, L. (2018). Rho-estimators revisited: General theory and applications. Ann. Statist., 46(6B):3767–3804.

Baraud, Y. and Birgé, L. (2020). Supplement to “Robust Bayes-like estimation: Rho-Bayes estimation.” DOI: .

Baraud, Y., Birgé, L., and Sart, M. (2017). A new method for estimation and model selection: ρ-estimation. Invent. Math., 207(2):425–517.

Bhattacharya, A., Pati, D., and Yang, Y. (2019). Bayesian fractional posteriors. Ann. Statist., 47(1):39–66.

Birgé, L. (1983). Approximation dans les espaces métriques et théorie de l’estimation. Z. Wahrsch. Verw. Gebiete, 65(2):181–237.

Birgé, L. (1984). Stabilité et instabilité du risque minimax pour des variables indépendantes équidistribuées. Ann. Inst. H. Poincaré Probab. Statist., 20(3):201–223.

Birgé, L. (1986). On estimating a density using Hellinger distance and some other strange facts. Probab. Theory Relat. Fields, 71(2):271–291.

Birgé, L. (2006a). Model selection via testing: an alternative to (penalized) maximum likelihood estimators. Ann. Inst. H. Poincaré Probab. Statist., 42(3):273–325.

Birgé, L. (2006b). Statistical estimation with model selection. Indag. Math. (N.S.), 17(4):497–537.

Bissiri, P. G., Holmes, C. C., and Walker, S. G. (2016). A general framework for updating belief distributions. J. R. Stat. Soc. Ser. B. Stat. Methodol., 78(5):1103–1130.

Catoni, O. (2007). PAC-Bayesian supervised classification: the thermodynamics of statistical learning, volume 56 of Institute of Mathematical Statistics Lecture Notes—Monograph Series. Institute of Mathematical Statistics, Beachwood, OH.

Chernozhukov, V. and Hong, H. (2003). An MCMC approach to classical estimation. J. Econometrics, 115(2):293–346.

DeVore, R. and Lorentz, G. (1993). Constructive approximation. Springer-Verlag.

Ghosal, S., Ghosh, J. K., and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. Ann. Statist., 28(2):500–531.

Ibragimov, I. A. and Has’minskiĭ, R. Z. (1981). Statistical Estimation. Asymptotic Theory, volume 16. Springer-Verlag, New York.


Kleijn, B. J. K. and van der Vaart, A. W. (2006). Misspecification in infinite-dimensional Bayesian statistics. Ann. Statist., 34(2):837–877.

Kleijn, B. J. K. and van der Vaart, A. W. (2012). The Bernstein-Von-Mises theorem under misspecification. Electron. J. Stat., 6:354–381.

Le Cam, L. (1973). Convergence of estimates under dimensionality restrictions. Ann. Statist., 1:38–53.

Le Cam, L. (1975). On local and global properties in the theory of asymptotic normality of experiments. In Stochastic processes and related topics (Proc. Summer Res. Inst. Statist. Inference for Stochastic Processes, Indiana Univ., Bloomington, Ind., 1974, Vol. 1; dedicated to Jerzy Neyman), pages 13–54. Academic Press, New York.

Le Cam, L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer Series in Statistics. Springer-Verlag, New York.

Massart, P. (2007). Concentration Inequalities and Model Selection, volume 1896 of Lecture Notes in Mathematics. Springer, Berlin. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6–23, 2003.

Panov, M. and Spokoiny, V. (2015). Finite sample Bernstein–von Mises theorem for semiparametric problems. Bayesian Anal., 10(3):665–710.

van de Geer, S. A. (2000). Applications of empirical process theory, volume 6 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge.


Supplement to “Robust Bayes-Like Estimation: Rho-Bayes Estimation”

9. Measurability issues.

9.1. Proof of Proposition 1. Since (x, t) ↦ t(x) is measurable, the same holds for (x, t, t′) ↦ (t(x), t′(x)), hence for (x, t, t′) ↦ ψ(√(t′(x)/t(x))) and for (x, t, t′) ↦ Ψ(x, t, t′) as a sum of measurable functions; the first assertion follows since S̄ is countable. The second assertion immediately follows from the first. As to the last one, the conclusion follows from the fact that the mapping t ↦ ρ(s, t) = ∫X √(s(x)t(x)) dµ(x) is a measurable function of t ∈ S for all s ∈ L.

9.2. Piecewise exponential families. The following proposition provides sufficient conditions for some exponential families of interest to satisfy our Assumption 1.

Proposition 14. Let (Tj)1≤j≤J be J ≥ 1 measurable functions on (X, A) and Θ ⊂ R^J. If the density model S = {tθ, θ ∈ Θ} on (X, A, µ) has one of the two following forms a) or b) below, it satisfies Assumption 1 with 𝒮 the Borel σ-algebra on (S, h) and S̄ = {tθ, θ ∈ Θ′}, where Θ′ denotes any dense and countable subset of Θ.

a) Θ is a convex and compact subset of R^J and tθ is of the form (32) where A is strictly convex and continuous on Θ;

b) (Ii)i=1,...,k is a partition of X into k ≥ 2 measurable subsets of positive measure, Θ = Π_{i=1}^k Θi where each Θi is a convex and compact subset of R^J, and tθ has the form

tθ(x) = Σ_{i=1}^k exp[ Σ_{j=1}^J θi,j Tj(x) − A(θ) ] 1l_{Ii}(x) for all x ∈ X,

where θ = (θi,j)_{i=1,...,k; j=1,...,J} and

A(θ) = log[ Σ_{i=1}^k ∫_{Ii} exp( Σ_{j=1}^J θi,j Tj(x) ) dµ(x) ].


It is well-known that in (32), A is strictly convex and continuous on Θ when Θ is a subset of the interior of the set

{ θ = (θ1, . . . , θJ) ∈ R^J, ∫X exp[ Σ_{j=1}^J θj Tj(x) ] dµ(x) < +∞ }

and T1, . . . , TJ are almost surely affinely independent, which means that, for all (λ1, . . . , λJ) ∈ R^J \ {0}, Σ_{j=1}^J λj Tj is not constant a.s. If S is not of the form a) or b) but is a subset of a density model of one of these forms, then S also satisfies Assumption 1 with the Borel σ-algebra 𝒮 on (S, h).
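As an illustration of form b), here is a hedged numerical sketch with hypothetical choices (X = [0, 1), µ the Lebesgue measure, J = 1, T1(x) = x and k = 4 equal pieces); it only illustrates the definition of tθ and is not part of the paper.

```python
# Hedged sketch of a density of form b): X = [0, 1) split into k intervals
# I_i, J = 1 with T_1(x) = x, mu = Lebesgue; all choices are illustrative.
import numpy as np

k = 4
edges = np.linspace(0.0, 1.0, k + 1)            # I_i = [edges[i], edges[i+1])
grid = np.linspace(0.0, 1.0, 4000, endpoint=False)

def log_kernel(x, theta):
    # log of sum_i exp(theta_i * x) 1l_{I_i}(x), before normalisation by A
    i = np.minimum(np.searchsorted(edges, x, side="right") - 1, k - 1)
    return theta[i] * x

def A(theta):
    # A(theta) = log sum_i int_{I_i} exp(theta_i * x) dx, by Riemann sum
    return np.log(np.mean(np.exp(log_kernel(grid, theta))))

def t_theta(x, theta):
    return np.exp(log_kernel(x, theta) - A(theta))

theta = np.array([1.0, -2.0, 0.5, 3.0])         # one theta_i per piece
print(np.mean(t_theta(grid, theta)))            # ~ 1.0: a probability density
```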

Proof. In case a), for all θ, θ′ ∈ Θ, the Hellinger affinity writes

ρ(tθ, tθ′) = exp[ −(A(θ) + A(θ′))/2 ] ∫X exp[ Σ_{j=1}^J ((θj + θ′j)/2) Tj(x) ] dµ(x)

and, since Θ is convex, (θ + θ′)/2 ∈ Θ so that

(52)   ρ(tθ, tθ′) = exp[ A((θ + θ′)/2) − (A(θ) + A(θ′))/2 ].

In case b), the same argument shows that

ρ(tθ, tθ′) = exp[ −(A(θ) + A(θ′))/2 ] × Σ_{i=1}^k ∫_{Ii} exp[ Σ_{j=1}^J ((θi,j + θ′i,j)/2) Tj(x) ] dµ(x)

and (52) also holds. The function A being strictly convex, h(tθ, tθ′) = 0 implies that θ = θ′, hence tθ = tθ′, and (S, h) is therefore a metric space.

Moreover, the mapping θ ↦ tθ is continuous from (Θ, |·|) to the metric space (S, h) and, since it is also one-to-one from the compact set Θ onto S, it is a homeomorphism from (Θ, |·|) to (S, h). If we denote by g the continuous function from S to Θ defined by g(t) = θ if t = tθ, the function (x, tθ) ↦ (x, θ) is measurable and, since the function (x, tθ) ↦ tθ(x) is the composition of the maps (x, tθ) ↦ (x, θ) and (x, θ) ↦ tθ(x), it is enough, in order to check Assumption 1-(i), to show that (x, θ) ↦ tθ(x) is measurable. The functions x ↦ Tj(x) are measurable, and so is the function θ ↦ A(θ) by continuity. Since products and sums of measurable functions are measurable, (x, θ) ↦ Σ_{j=1}^J θj Tj(x) − A(θ) is measurable, and so is (x, θ) ↦ tθ(x) as the exponential of a measurable function.


Since (Θ, |·|) is compact, it is separable and a countable and dense subset Θ′ of Θ does exist. We take S̄ = {tθ, θ ∈ Θ′}. Since A is continuous, the mapping θ ↦ tθ(x) is continuous on Θ for all x ∈ X. Moreover, since t′(x) > 0 for all t′ ∈ S̄ and x ∈ X, if tk(x) → t(x), then ψ(√(t′(x)/tk(x))) → ψ(√(t′(x)/t(x))), so that Assumption 1-(ii) is satisfied.
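Identity (52) is easy to check numerically. The sketch below (not from the paper) does so for a hypothetical one-parameter family on [0, 1] with T1(x) = x and µ the Lebesgue measure, comparing the affinity ρ(tθ, tθ′) = ∫ √(tθ tθ′) dµ computed by quadrature with the closed form (52).

```python
# Numerical sanity check of (52); hypothetical choices: X = [0, 1],
# mu = Lebesgue, J = 1 and T_1(x) = x.
import numpy as np

n_pts = 20000
grid = np.linspace(0.0, 1.0, n_pts, endpoint=False) + 0.5 / n_pts  # midpoints

def A(th):
    # A(theta) = log int_0^1 exp(theta * x) dx, midpoint quadrature
    return np.log(np.mean(np.exp(th * grid)))

def t(th):
    return np.exp(th * grid - A(th))

th1, th2 = -1.3, 2.0
direct = np.mean(np.sqrt(t(th1) * t(th2)))                        # int sqrt(t t') dmu
closed = np.exp(A(0.5 * (th1 + th2)) - 0.5 * (A(th1) + A(th2)))   # right side of (52)
print(direct, closed)  # the two values agree up to quadrature error
```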

10. Proof of Theorem 1. The proof relies on suitable bounds for the numerator and denominator of the density

(53)   pX(ω)(t) = (dπX(ω)/dπ)(t) = exp[−βΨ(X(ω), t)] / ∫S exp[−βΨ(X(ω), t′)] dπ(t′),

which themselves derive from bounds on Ψ(X(ω), t). To get such bounds, we shall use the three following results, to be proven in Section 12.
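Before turning to these results, and for intuition only, here is a minimal sketch of the density (53) when the model S is finite, the prior π is uniform on S and S̄ = S. It assumes ψ(u) = (u − 1)/√(1 + u²), one admissible choice from the ρ-estimation literature, and is in no way part of the proof.

```python
# Minimal sketch of the density p_x(t) in (53) for a finite model S with a
# uniform prior pi and S-bar = S; psi(u) = (u - 1)/sqrt(1 + u^2) is one
# admissible choice from the rho-estimation literature (an assumption here).
import numpy as np

def psi(u):
    return (u - 1.0) / np.sqrt(1.0 + u * u)

def Psi(X, t, t_prime):
    # Psi(X, t, t') = sum_i psi(sqrt(t'(X_i) / t(X_i)))
    return np.sum(psi(np.sqrt(t_prime(X) / t(X))))

def rho_posterior(X, S, beta):
    # Psi(X, t) = sup over t' in S-bar of Psi(X, t, t'); p_x is proportional
    # to exp(-beta * Psi(X, t)) with respect to the uniform prior on S
    Psi_sup = np.array([max(Psi(X, t, tp) for tp in S) for t in S])
    w = np.exp(-beta * (Psi_sup - Psi_sup.min()))   # stabilised weights
    return w / w.sum()

# Toy use: three candidate exponential densities, data drawn from the second
rng = np.random.default_rng(1)
X = rng.exponential(scale=2.0, size=200)
S = [lambda x, a=a: (1.0 / a) * np.exp(-x / a) for a in (1.0, 2.0, 4.0)]
print(rho_posterior(X, S, beta=1.0))  # mass should concentrate near scale 2
```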

Proposition 15. Under Assumption 1, the sequence (tk)k≥0 satisfies

(54)   lim_{k→+∞} h(t, tk) = 0 and lim_{k→+∞} Ψ(x, tk, t′) = Ψ(x, t, t′).

Theorem 4. Under Assumption 1, whatever the density s of X and ξ > 0, there exists a measurable subset Ωξ of Ω, the Ps-probability of which is at least 1 − e^{−ξ}, such that for all ω ∈ Ωξ, t ↦ Ψ(X(ω), t, t′) is measurable on (S, 𝒮) for all t′ ∈ S̄ and, for all t ∈ S,

(55)   (1/n) Ψ(X(ω), t, t′) ≤ c7 h²(s, t) − c8 h²(s, t′) + c5 (εSn(s))² + c6 (ξ + 2.4)/n,

where εSn(s) is given by (16) and the constants cj have been defined in (15).

Proposition 16. Let r, a and b satisfy 0 < (√n r)^{−2} ≤ a ≤ b. If J ∈ N* is such that 4^{J−1} ≥ b/a, r′ = 2^{−J} r and t ∈ S is such that

(56)   π(BS(t, 2r″)) ≤ exp[(3a/8) n r″²] π(BS(t, r″)) for all r″


Let us now fix ξ > 0 and ω in the set Ωξ provided by Theorem 4, and write x for X(ω), ε(s) for εSn(s) and η(t) for ηS,πn(t) in order to keep our formulae as simple as possible. For u ∈ S and u′ ∈ S̄, (55) can be written

(58)   Ψ(x, u, u′) ≤ n[c7 h²(s, u) − c8 h²(s, u′) + c5 ε²(s)] + c6(ξ + 2.4).

To bound the denominator of px(t) from below, we apply (58) with u = t′ ∈ S and, since h(s, u′) ≥ h(s, S̄) = h(s, S), derive that, whatever t′ ∈ S,

Ψ(x, t′)/n = sup_{u′∈S̄} Ψ(x, t′, u′)/n ≤ c7 h²(s, t′) − c8 h²(s, S) + c5 ε²(s) + c6 (ξ + 2.4)/n.

Therefore,

∫S exp[−βΨ(x, t′)] dπ(t′) ≥ exp[β(c8 n h²(s, S) − c5 n ε²(s) − c6(ξ + 2.4))] × ∫S exp[−β c7 n h²(s, t′)] dπ(t′).

Let us now bound the numerator of px(t) from above for t ∈ S. Assumption 1 implies (54), hence, for u ∈ S̄, there exists a sequence (tk)k≥0 in S̄ such that

Ψ(x, tk, u) → Ψ(x, t, u) and h²(t, tk) → 0 as k → +∞.

Applying (58) with u′ = tk leads to

−Ψ(x, tk, u) = Ψ(x, u, tk) ≤ n[c7 h²(s, u) − c8 h²(s, tk) + c5 ε²(s)] + c6(ξ + 2.4)

and letting k tend to infinity shows that, for t ∈ S and u ∈ S̄,

−Ψ(x, t, u) ≤ n[c7 h²(s, u) − c8 h²(s, t) + c5 ε²(s)] + c6(ξ + 2.4).

Consequently,

−Ψ(x, t) = inf_{u∈S̄} [−Ψ(x, t, u)] ≤ c7 inf_{u∈S̄} [n h²(s, u)] − c8 n h²(s, t) + c5 n ε²(s) + c6(ξ + 2.4)

and, since inf_{u∈S̄} h²(s, u) = h²(s, S̄) = h²(s, S), for all t ∈ S,

exp[−βΨ(x, t)] ≤ exp[β(c7 n h²(s, S) − c8 n h²(s, t) + c5 n ε²(s) + c6(ξ + 2.4))].

Putting the bounds for the numerator and denominator of (53) together, we derive that, for all t ∈ S,

px(t) ≤ exp[β((c7 − c8) n h²(s, S) + 2c5 n ε²(s) + 2c6(ξ + 2.4))] × exp[−β c8 n h²(s, t)] / ∫S exp[−β c7 n h²(s, t′)] dπ(t′)

(59)   ≤ exp[β((c7 − c8) n h²(s, u) + 2c5 n ε²(s) + 2c6(ξ + 2.4))] × exp[−β c8 n h²(s, t)] / ∫S exp[−β c7 n h²(s, t′)] dπ(t′) for all u ∈ S,

since c7 − c8 > 0. It follows from the inequalities h(t, u) ≤ h(t, s) + h(s, u), h(s, t′) ≤ h(s, u) + h(u, t′) and (7) that, whatever the positive numbers a, b, for all t, t′, u ∈ S,

h²(s, t) ≥ a(1 + a)^{−1} h²(t, u) − a h²(s, u) and h²(s, t′) ≤ (1 + b) h²(u, t′) + (1 + b^{−1}) h²(s, u).

Together with (59) this leads to

px(t) ≤ exp[β(c7(2 + b^{−1}) + c8(a − 1)) n h²(s, u) + 2β c5 n ε²(s)] × exp[2β c6(ξ + 2.4)] × exp[−β(a c8/(a + 1)) n h²(t, u)] / ∫S exp[−β c7(1 + b) n h²(u, t′)] dπ(t′),

for all t, u ∈ S. Choosing a = 11, so that c8(a − 1) < c7, and b = 1, then setting c′7 = (1 + b) c7 = 8.02 and c′8 = 1/3 < a(a + 1)^{−1} c8, we get

(60)   px(t) ≤ exp[β(4c7 n h²(s, u) + 2c5 n ε²(s) + 2c6(ξ + 2.4))] × exp[−β c′8 n h²(t, u)] / ∫S exp[−β c′7 n h²(u, t′)] dπ(t′) for all t, u ∈ S.
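For the reader's convenience, here is presumably how the two Hellinger inequalities above follow from the triangle inequality and (7), assuming (7) is the elementary bound (x + y)² ≤ (1 + c)x² + (1 + c^{−1})y², valid for all x, y ≥ 0 and c > 0. Taking x = h(u, t′), y = h(s, u) and c = b gives

h²(s, t′) ≤ [h(s, u) + h(u, t′)]² ≤ (1 + b) h²(u, t′) + (1 + b^{−1}) h²(s, u),

while taking x = h(t, s), y = h(s, u) and c = a^{−1} gives

h²(t, u) ≤ [h(t, s) + h(s, u)]² ≤ (1 + a^{−1}) h²(s, t) + (1 + a) h²(s, u);

dividing the latter by 1 + a^{−1} and using (1 + a)/(1 + a^{−1}) = a then yields

h²(s, t) ≥ (1 + a^{−1})^{−1} h²(t, u) − a h²(s, u) = a(1 + a)^{−1} h²(t, u) − a h²(s, u).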
