The LAN property for McKean-Vlasov models in a mean-field regime


arXiv:2205.05932v1 [math.ST] 12 May 2022

LAETITIA DELLA MAESTRA AND MARC HOFFMANN

ABSTRACT. We establish the local asymptotic normality (LAN) property for estimating a multidimensional parameter in the drift of a system of $N$ interacting particles observed over a fixed time horizon in a mean-field regime $N \to \infty$. By implementing the classical theory of Ibragimov and Hasminski, we obtain in particular sharp results for the maximum likelihood estimator that go beyond its simple asymptotic normality, thanks to Hájek's convolution theorem and strong controls of the likelihood process that yield asymptotic minimax optimality (up to constants). Our structural results shed some light on the accompanying nonlinear McKean-Vlasov experiment, and enable us to derive simple and explicit criteria to obtain identifiability and non-degeneracy of the Fisher information matrix. These conditions are also of interest for other recent studies on the topic of parametric inference for interacting diffusions.

Mathematics Subject Classification (2010): 62C20, 62F12, 62F99, 62M99.

Keywords: Parametric estimation; LAN property; maximum likelihood estimation; statistics and PDE; interacting particle systems; McKean-Vlasov models.

CONTENTS

1. Introduction
1.1. Motivation
1.2. Setting
1.3. Results and organisation of the paper
2. Construction and properties of the statistical model
2.1. Notation
2.2. Model assumptions
2.3. The companion McKean-Vlasov product experiment
2.4. Identifiability and non-degeneracy of the Fisher information
3. Main results
3.1. The LAN property
3.2. Maximum likelihood estimation and properties
4. Examples
4.1. McKean-like models
4.2. Generalised linear like models
4.3. A double layer potential model
4.4. A genuinely non-linear example
5. Proof of the main results
5.1. Preliminaries: couplings
5.2. Proof of Theorem 17
5.3. Proof of Theorem 19
6. Remaining proofs
6.1. Proof of Proposition 9
7. Appendix
7.1. Proof of Lemma 7
References

Date: May 13, 2022.

1. INTRODUCTION

1.1. Motivation. Collective dynamics models are becoming increasingly popular in modelling complex stochastic systems, with a versatility of applications, ranging from mathematical biology (neurosciences, Baladron et al. [2]; structured models in population dynamics, Mogilner et al. [40], Burger et al. [8]) to social sciences (opinion dynamics, Chazelle et al. [13]; cooperative behaviours, Canuto et al. [9]) and finance (systemic risk, Fouque and Sun [17]), or more recently, mean-field games (Cardaliaguet et al. [10], Cardaliaguet and Lehalle [11]). Whereas stochastic systems of interacting particles and associated nonlinear Markov processes in the sense of McKean [38] date back to the 1960s and have been studied extensively over more than half a century, see e.g. [7, 44, 45, 39, 47] among a myriad of references, the development of statistical inference in this setting is only emerging (with some notable exceptions like Löcherbach [35] in large time, or e.g. Kasonga [27] or Bishwal [4] in a mean-field limit). Recently, Giesecke et al. [22] and Sharrock, Kantas, Parpas and Grigorios [43] revisited the work of Kasonga and considered a parametric framework where convergent and asymptotically normal contrast estimators are constructed. Several other parametric frameworks (that consider various observation schemes and asymptotic frameworks) have also been recently considered, like [14, 34, 49] or Genon-Catalot and Laredo [20, 21]. There also exist recent results in nonparametric inference: we mention our work [16] and Belomestny et al. [3], together with studies in identification like [31, 32, 33] or learning [30, 36, 37].

The present paper, close in spirit to [4, 22, 27] and [43] (in their so-called offline case), considers a parametric framework in a mean-field regime over a fixed time horizon. We take a deeper look at the asymptotic structure of the associated statistical experiment, in the sense of local asymptotic normality or LAN, in order to derive strong results for the maximum likelihood estimator, both in asymptotic distribution and in an asymptotic minimax sense (up to constants) for various loss functions. For simplicity, we stick to continuous observations, but we briefly explain how to move to a discrete data setting. We also look for simple and explicit criteria that enable us to verify identifiability and non-degeneracy of the model. This is a non-trivial issue in the context of nonlinear McKean-Vlasov models that is usually somewhat overlooked in the literature.

1.2. Setting. We have a parameter of interest $\vartheta$ lying in a compact set $\Theta \subset \mathbb{R}^p$ (with non-empty interior), for some fixed $p \geq 1$. For some fixed time horizon $T > 0$, we continuously observe a stochastic system of $N$ interacting particles

\[ X^{(N)} = (X_t^1, \ldots, X_t^N)_{t \in [0,T]}, \tag{1} \]

evolving in a Euclidean ambient space $\mathbb{R}^d$, that solves

\[ \begin{cases} dX_t^i = b(\vartheta; t, X_t^i, \mu_t^{(N)})\,dt + \sigma(t, X_t^i)\,dB_t^i, & 1 \leq i \leq N,\ t \in [0,T], \\ \mathcal{L}(X_0^1, \ldots, X_0^N) = \mu_0^{\otimes N}, \end{cases} \tag{2} \]

where $\mu_t^{(N)} = N^{-1}\sum_{i=1}^N \delta_{X_t^i}$ is the empirical measure of the system. The $(B_t^i)_{t \in [0,T]}$ are independent $\mathbb{R}^d$-valued Brownian motions. The initial condition $\mu_0$, the drift $b$ and the diffusion coefficient $\sigma$ are at least sufficiently regular so that

\[ \mu^{(N)} = (\mu_t^{(N)})_{t \in [0,T]} \to \mu = (\mu_t)_{t \in [0,T]} \]

weakly as $N \to \infty$, where $\mu$ is a family of probability measures that solves (in a weak sense) the parabolic nonlinear equation

\[ \begin{cases} \partial_t \mu + \mathrm{div}\big(b(\vartheta; \cdot, \mu)\,\mu\big) = \frac{1}{2}\sum_{k,k'=1}^d \partial^2_{kk'}\big(c_{kk'}\,\mu\big), & t \in [0,T], \\ \mu_{t=0} = \mu_0, \end{cases} \tag{3} \]

with $c = \sigma\sigma^\top$. We will write $\mu^\vartheta = (\mu_t^\vartheta)_{t \in [0,T]}$ to emphasise the dependence on $\vartheta$. In this context, we are interested in estimating from data (1) the parameter $\vartheta \in \Theta$ of the function $(\vartheta; t, x, \nu) \mapsto b(\vartheta; t, x, \nu) \in \mathbb{R}^d$. Asymptotics are taken as $N \to \infty$.

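A particular case of interest that covers many examples is when the dependence on the measure variable for $b$ is linear: we then have

\[ b(\vartheta; t, X_t^i, \mu_t^{(N)}) = \int_{\mathbb{R}^d} \widetilde{b}(\vartheta; X_t^i, y)\,\mu_t^{(N)}(dy) = N^{-1}\sum_{j=1}^N \widetilde{b}(\vartheta; X_t^i, X_t^j), \tag{4} \]

for some function $\widetilde{b} : \Theta \times \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$. A typical form is $\widetilde{b}(\vartheta; x, y) = G_\vartheta(x) + F_\vartheta(x - y)$, where $G_\vartheta, F_\vartheta : \mathbb{R}^d \to \mathbb{R}^d$ play the role of a common external force to the system and an interaction force respectively.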
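To make the dynamics (2) with a linear measure dependence (4) concrete, here is a minimal simulation sketch. It rests on assumptions that are not in the text: dimension $d = 1$, $\sigma \equiv 1$, no external force ($G_\vartheta = 0$), the interaction force $F_\vartheta(z) = -\vartheta z$, a standard Gaussian initial law, and an Euler-Maruyama time discretisation. The function name `simulate_particles` is hypothetical.

```python
import numpy as np


def simulate_particles(theta, n_particles=200, n_steps=100, horizon=1.0, seed=0):
    """Euler-Maruyama sketch of dX^i = -theta * (X^i - mean_j X^j) dt + dB^i (d = 1).

    Toy choices (assumptions, not from the paper): sigma = 1, G_theta = 0,
    F_theta(z) = -theta * z, and initial law mu_0 = N(0, 1).
    """
    rng = np.random.default_rng(seed)
    dt = horizon / n_steps
    paths = np.empty((n_steps + 1, n_particles))
    paths[0] = rng.standard_normal(n_particles)
    for k in range(n_steps):
        x = paths[k]
        # Linear measure dependence as in (4):
        # N^{-1} sum_j F_theta(x_i - x_j) = -theta * (x_i - empirical mean).
        drift = -theta * (x - x.mean())
        noise = np.sqrt(dt) * rng.standard_normal(n_particles)
        paths[k + 1] = x + drift * dt + noise
    return paths


paths = simulate_particles(theta=1.0)
```

Each row of `paths` is the state of the whole system at one grid time, so the empirical measure $\mu_t^{(N)}$ at that time is simply the uniform distribution on the entries of the row.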

1.3. Results and organisation of the paper. In Section 2, we rigorously construct the (sequence of) statistical experiment(s) generated by the observation (1) under the dynamics (2), which we denote by $(\mathcal{E}^N)_{N \geq 1}$. It is well defined and regular in the classical sense of Ibragimov and Hasminski [24] under strong integrability of the initial condition $\mu_0$ and standard smoothness assumptions on the drift $\vartheta \mapsto b(\vartheta; \cdot)$ and the diffusion matrix $c = \sigma\sigma^\top$, see Assumptions 1, 2, 3 and 4 and Proposition 6. The study of the identifiability of $\mathcal{E}^N$ and of the non-degeneracy of its information matrix $\mathbb{I}_{\mathcal{E}^N}(\vartheta)$ is simplified via the accompanying experiment $\mathcal{G}^{\otimes N}$, where $\mathcal{G}$ is generated by the continuous observation of a solution to the McKean-Vlasov equation

\[ dX_t = b(\vartheta; t, X_t, \mu_t^\vartheta)\,dt + \sigma(t, X_t)\,dB_t, \quad t \in [0,T], \quad \mathcal{L}(X_0) = \mu_0, \]

for a standard Brownian motion $(B_t)_{t \in [0,T]}$ on $\mathbb{R}^d$ and where $\mu_t^\vartheta$ is the marginal distribution of the solution at time $t$. In particular, in the case of representation (4), $\mathcal{E}^N$ and $\mathcal{G}^{\otimes N}$ do not separate asymptotically by a simple entropy argument, see Proposition 10, and we always have the convergence of the corresponding Fisher information matrices:

\[ N^{-1}\,\mathbb{I}_{\mathcal{E}^N}(\vartheta) \to \mathbb{I}_{\mathcal{G}}(\vartheta) \]

in a mean-field limit $N \to \infty$, as established in Proposition 11. This approximation is the gateway to explicit identifiability and non-degeneracy criteria, as detailed in Section 2.4. In particular, under additional regularity assumptions, we obtain a quite simple criterion for $\mathbb{I}_{\mathcal{G}}(\vartheta)$ to be non-degenerate in Proposition 15, namely the property that one of the functions

\[ x \mapsto \nabla_\vartheta (c^{-1/2} b)^j(\vartheta; 0, x, \mu_0)^\top z, \quad j = 1, \ldots, d, \tag{5} \]

is not identically vanishing, for every $z \in \mathbb{R}^p$ with $|z| = 1$, with $c^{-1/2}$ the inverse of a square root of $c = \sigma\sigma^\top$. We use the notation $f = (f^j)_{1 \leq j \leq d}$ componentwise, the $f^j$ being real-valued functions. In particular, (5) has the advantage of involving only the initial condition $\mu_0$ in the measure argument and not the whole flow $(\mu_t^\vartheta)_{t \in [0,T]}$, which is (almost) never explicit. Having a simple criterion to achieve the non-degeneracy of the Fisher information seems to have been somewhat overlooked in the literature (where it is usually simply assumed to hold true), and our result is thus of interest for other studies.

In Section 3, we state the main results of the paper. In Theorem 17, we establish the LAN property: if we reparametrise the experiments via $\vartheta = \vartheta_0 + N^{-1/2}u$ locally around a fixed point $\vartheta_0$, with $u \in \mathbb{R}^p$ being now the unknown parameter, then both $\mathcal{E}^N$ and $\mathcal{G}^{\otimes N}$ look like a Gaussian shift: we observe

\[ Y^N = u + \mathbb{I}_{\mathcal{G}}(\vartheta_0)^{-1/2}\xi, \]

where $\xi$ is a standard Gaussian random vector in $\mathbb{R}^p$. This has important consequences in terms of existence and properties of optimal procedures: we have Hájek's convolution theorem (Corollary 18), namely for any estimator $\widehat{\vartheta}_N$,

\[ \liminf_{N \to \infty}\ \sup_{|\vartheta' - \vartheta| \leq \delta} \mathbb{E}_{\mathbb{P}^N_{\vartheta'}}\Big[ w\big(N^{1/2}\,\mathbb{I}_{\mathcal{G}}(\vartheta)^{1/2}(\widehat{\vartheta}_N - \vartheta')\big) \Big] \geq \mathbb{E}[w(\xi)], \tag{6} \]

for small enough $\delta > 0$, where $\mathbb{P}^N_{\vartheta'}$ is the distribution of the data when the parameter is $\vartheta'$ and $w$ is an arbitrary loss function satisfying some regularity properties. The bound (6) is achieved by the maximum likelihood estimator $\widehat{\vartheta}_N^{\,\mathrm{mle}}$ obtained by maximising the contrast

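\[ \vartheta \mapsto \ell^N(\vartheta; X^{(N)}) = \sum_{i=1}^N \int_0^T \Big( (c^{-1}b)(\vartheta; t, X_t^i, \mu_t^{(N)})^\top dX_t^i - \tfrac{1}{2}\big|(c^{-1/2}b)(\vartheta; t, X_t^i, \mu_t^{(N)})\big|^2 dt \Big). \tag{7} \]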

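This implies in particular the convergence

\[ \sqrt{N}\big(\widehat{\vartheta}_N^{\,\mathrm{mle}} - \vartheta\big) \to \mathcal{N}\big(0, \mathbb{I}_{\mathcal{G}}(\vartheta)^{-1}\big) \tag{8} \]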

in distribution. Moreover, we have in Theorem 19 the minimax asymptotic optimality of $\widehat{\vartheta}_N^{\,\mathrm{mle}}$, in the sense that

\[ \mathcal{R}^N_w(\widehat{\vartheta}_N^{\,\mathrm{mle}}; \Theta) = \inf_{\widehat{\vartheta}_N} \mathcal{R}^N_w(\widehat{\vartheta}_N; \Theta)\,(1 + o(1)), \]

where $\mathcal{R}^N_w(\widehat{\vartheta}_N; \Theta) = \sup_{\vartheta \in \Theta} \mathbb{E}_{\mathbb{P}^N_\vartheta}\big[w\big(N^{1/2}\,\mathbb{I}_{\mathcal{G}}(\vartheta)^{1/2}(\widehat{\vartheta}_N - \vartheta)\big)\big]$ is the classical minimax risk. Thus the LAN property enables us to obtain considerably stronger results than simply (8). In Section 4, we investigate several non-trivial examples that generalise the results of [27], and where our identifiability and non-degeneracy criteria easily apply. We treat in particular the case of a kinetic mean-field double layer potential that may serve as a representative model for swarming, see in particular [6] and the references therein. The proofs are delayed until Sections 5 and 6, with an appendix (Section 7) that contains useful technical results.

In practice, maximising the function (7) is not feasible, since only discrete data are available. It is then reasonable to replace the ideal observation (1) by the more realistic

\[ X^{(N,m)} = (X_t^1, \ldots, X_t^N)_{t \in \{t_0^m, \ldots, t_m^m\}}, \]

where $(0 = t_0^m < t_1^m < \ldots < t_m^m = T)$ is a subdivision of $[0,T]$ with mesh

\[ \max_{1 \leq j \leq m}\,(t_j^m - t_{j-1}^m) \leq C m^{-1}. \]

We thus have $(m+1) \times N$ data with values in $\mathbb{R}^d$. We may then replace (7) by

\[ \vartheta \mapsto N^{-1}\sum_{i=1}^N \sum_{j=1}^m \Big( (c^{-1}b)\big(\vartheta; t_{j-1}^m, X^i_{t_{j-1}^m}, \mu^{(N)}_{t_{j-1}^m}\big)^\top \big(X^i_{t_j^m} - X^i_{t_{j-1}^m}\big) - \tfrac{1}{2}\big|(c^{-1/2}b)\big(\vartheta; t_{j-1}^m, X^i_{t_{j-1}^m}, \mu^{(N)}_{t_{j-1}^m}\big)\big|^2 (t_j^m - t_{j-1}^m) \Big). \]

Assuming the function $(t, x) \mapsto (c^{-1/2}b)(\vartheta; t, x, \mu_t^{(N)})$ to be smooth, we may safely expect the discrete approximation to be close to its continuous counterpart up to an additional error of order $m^{-1/2}$, by standard high-frequency discretisation techniques, see the textbooks of Jacod and co-authors [1, 25, 26]. In particular, if $m \gg N$, the same results as for continuous observations are likely to hold true.
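The discretised contrast above can be sketched numerically as follows. This is only an illustration under assumptions that are not in the text: $d = p = 1$, $\sigma \equiv 1$, a regular grid, data generated by an Euler scheme, and the toy drift $b(\vartheta; t, x, \nu) = -\vartheta \int (x - y)\,\nu(dy)$, for which the contrast is quadratic in $\vartheta$ and its maximiser is explicit. The names `euler_paths`, `discrete_contrast` and `discrete_mle` are hypothetical.

```python
import numpy as np


def euler_paths(theta, n_particles, n_steps, horizon, rng):
    """Toy data: dX^i = -theta * (X^i - mean_j X^j) dt + dB^i on a regular grid."""
    dt = horizon / n_steps
    x = np.empty((n_steps + 1, n_particles))
    x[0] = rng.standard_normal(n_particles)  # assumed initial law N(0, 1)
    for k in range(n_steps):
        drift = -theta * (x[k] - x[k].mean())
        x[k + 1] = x[k] + drift * dt + np.sqrt(dt) * rng.standard_normal(n_particles)
    return x


def _regressor_and_increments(x):
    # phi = drift / theta evaluated at the left endpoint t_{j-1};
    # dx = X_{t_j} - X_{t_{j-1}} for every particle.
    phi = -(x[:-1] - x[:-1].mean(axis=1, keepdims=True))
    return phi, np.diff(x, axis=0)


def discrete_contrast(theta, x, horizon):
    """Discretised contrast (normalised by N) for the toy drift, with c = 1."""
    dt = horizon / (x.shape[0] - 1)
    phi, dx = _regressor_and_increments(x)
    return float(np.sum(theta * phi * dx - 0.5 * (theta * phi) ** 2 * dt) / x.shape[1])


def discrete_mle(x, horizon):
    """Explicit maximiser of the (quadratic in theta) discretised contrast."""
    dt = horizon / (x.shape[0] - 1)
    phi, dx = _regressor_and_increments(x)
    return float(np.sum(phi * dx) / (np.sum(phi ** 2) * dt))


rng = np.random.default_rng(1)
paths = euler_paths(theta=1.5, n_particles=2000, n_steps=100, horizon=1.0, rng=rng)
theta_hat = discrete_mle(paths, horizon=1.0)
```

For drifts that are not linear in $\vartheta$, the closed-form maximiser is no longer available and `discrete_contrast` would have to be maximised numerically over $\Theta$.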

2. CONSTRUCTION AND PROPERTIES OF THE STATISTICAL MODEL

2.1. Notation. The dimension $d \geq 1$ of the state space $\mathbb{R}^d$ and the dimension $p \geq 1$ of the parameter space $\Theta$, as well as the time horizon $T > 0$, are fixed once and for all. We write $|\cdot|$ for the Euclidean norm on $\mathbb{R}^q$ ($q = p$, $d$ or any other integer, depending on the context) or for a matrix norm on $\mathbb{R}^p \otimes \mathbb{R}^p$ fixed throughout.

We consider functions that are mappings defined on products of metric spaces (typically $\Theta \times [0,T] \times \mathbb{R}^d \times \mathcal{P}_1$ or subsets of these) with values in $\mathbb{R}$ or $\mathbb{R}^d$. Here, $\mathcal{P}_1$ denotes the set of probability measures on $\mathbb{R}^d$ with a first moment, endowed with the Wasserstein 1-metric

\[ W_1(\mu, \nu) = \inf_{m \in \Gamma(\mu,\nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} |x - y|\, m(dx, dy) = \sup_{|\varphi|_{\mathrm{Lip}} \leq 1} \int_{\mathbb{R}^d} \varphi\, d(\mu - \nu), \]

where $\Gamma(\mu, \nu)$ denotes the set of probability measures on the product space $\mathbb{R}^d \times \mathbb{R}^d$ with marginals $\mu$ and $\nu$. For a probability measure $\mu$ on $\mathbb{R}^d$, we also set

\[ m_r(\mu) = \int_{\mathbb{R}^d} |y|^r \mu(dy) \]

for its moment of order $r \geq 1$ and we say that $\mu \in \mathcal{P}_r$ if $m_r(\mu)$ is finite. All the functions in the paper are implicitly measurable with respect to the Borel sigma-field induced by the product topology. An $\mathbb{R}^d$-valued function $f$ is written componentwise as $f = (f^k)_{1 \leq k \leq d}$, where the $f^k$ are real-valued. We denote by $\partial_{\vartheta_k}$, $\nabla_\vartheta$, $\partial^2_{\vartheta_k \vartheta_l}$ respectively the partial derivative of a function with respect to the $k$-th component $\vartheta_k$, the gradient of a real-valued function with respect to $\vartheta$, and the second order partial derivative of a function with respect to the $k$-th and $l$-th components $\vartheta_k, \vartheta_l$.
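The two quantities $W_1$ and $m_r$ defined above are easy to evaluate for empirical measures in a special case, which the following sketch illustrates. It assumes $d = 1$ and two samples of equal size with uniform weights, in which case the optimal coupling pairs the sorted samples; the function names are hypothetical.

```python
import numpy as np


def wasserstein1_empirical(x, y):
    """W_1 between two empirical measures on the real line with equal sample sizes.

    In dimension one with uniform weights, an optimal coupling matches the
    order statistics, so W_1 is the mean absolute difference of sorted samples.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("expected two one-dimensional samples of equal size")
    return float(np.mean(np.abs(x - y)))


def empirical_moment(x, r):
    """Moment m_r of the empirical measure of a one-dimensional sample."""
    return float(np.mean(np.abs(np.asarray(x, dtype=float)) ** r))


sample = np.array([0.0, 1.0, 2.0])
shifted = sample + 0.5
distance = wasserstein1_empirical(sample, shifted)
```

The dual formulation gives a quick sanity check: since $\varphi(x) = x$ is 1-Lipschitz, the difference of the two sample means can never exceed `wasserstein1_empirical`.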

Finally, we repeatedly use the notation $C$ for a positive number that depends neither on $N$ nor on $\vartheta$, that may vary from line to line and that we call a constant, although it usually depends on some other (fixed) quantities of the model. In most cases, it is explicitly computable.

2.2. Model assumptions.


Well-posedness of the model and its associated statistical experiment. We work under the following strong integrability property for the initial condition $\mu_0$.

Assumption 1. For every $r \geq 1$, we have $\mu_0 \in \mathcal{P}_r$.

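As for the diffusion matrix $\sigma : [0,T] \times \mathbb{R}^d \to \mathbb{R}^d \otimes \mathbb{R}^d$, we make the following strong ellipticity and Lipschitz smoothness assumption.

Assumption 2. The diffusion matrix $\sigma$ is measurable and for some $C \geq 0$, we have

\[ |\sigma(t, x') - \sigma(t, x)| \leq C|x' - x|. \]

Moreover, $c = \sigma\sigma^\top$ is such that $\sigma_-^2|y|^2 \leq (c(t,x)y)^\top y \leq \sigma_+^2|y|^2$ for some $\sigma_\pm > 0$.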

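As for the drift part $b : \Theta \times [0,T] \times \mathbb{R}^d \times \mathcal{P}_1 \to \mathbb{R}^d$, we work under usual Lipschitz smoothness assumptions.

Assumption 3. The drift $b$ is measurable and for some $C \geq 0$, we have

\[ \sup_{t \in [0,T],\, \vartheta \in \Theta} \big|b(\vartheta; t, x', \nu') - b(\vartheta; t, x, \nu)\big| \leq C\big(|x' - x| + W_1(\nu', \nu)\big), \]

and there exists some $\vartheta_0 \in \Theta$ such that

\[ b_0 = \sup_{t \in [0,T]} |b(\vartheta_0; t, 0, \delta_0)| < \infty. \]

We let $|b|_{\mathrm{Lip}}$ denote the smallest $C \geq 0$ for which Assumption 3 holds.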

Assumptions 1, 2, 3 together are sufficient to guarantee the well-posedness of the statistical model: there exists a unique weak solution to (2) for every $\vartheta \in \Theta$, hence the data $X^{(N)}$ of (1) are well defined. More precisely, we let $\mathcal{C}^N = \mathcal{C}([0,T], (\mathbb{R}^d)^N)$ denote the space of continuous functions from $[0,T]$ to $(\mathbb{R}^d)^N$, equipped with the natural filtration $(\mathcal{F}_t)_{t \in [0,T]}$ induced by the canonical mappings

\[ X_t^{(N)}(\omega) = \big(X_t^1(\omega), \ldots, X_t^N(\omega)\big) = \omega_t. \]

For $\mu_0 \in \mathcal{P}_1$ and $\vartheta \in \Theta$, the probability $\mathbb{P}^N_\vartheta$ on $(\mathcal{C}^N, \mathcal{F}^N)$ under which the canonical process $X^{(N)} = (X_t^{(N)})_{t \in [0,T]}$ is a solution of (2) for the initial condition $\mu_0^{\otimes N}$ is uniquely defined under Assumptions 1, 2 and 3. A recommended reference (that covers our set of assumptions) is the textbook by Carmona and Delarue [12] or the lecture notes of Lacker [28]. Moreover, for every $\vartheta \in \Theta$, the parabolic nonlinear equation (3) has a unique probability solution $\mu = (\mu_t^\vartheta)_{t \in [0,T]}$ and we have the weak convergence $\mu_t^{(N)} \to \mu_t^\vartheta$ under $\mathbb{P}^N_\vartheta$, for every $\vartheta \in \Theta$.

We thus study under Assumptions 1, 2, 3 the (sequence of) statistical experiment(s) generated by the observation (1) under the dynamics (2), which we realise as

\[ (\mathcal{E}^N)_{N \geq 1} = \big(\mathcal{C}^N, \mathcal{F}^N, (\mathbb{P}^N_\vartheta)_{\vartheta \in \Theta}\big)_{N \geq 1}. \]

Note that at this stage, we do not impose any identifiability assumption, i.e. we do not assume that the mapping $\vartheta \mapsto \mathbb{P}^N_\vartheta$ is one-to-one. We will discuss that matter together with the non-degeneracy of the model later in Section 2.4.


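Regularity of the experiment $\mathcal{E}^N$. In order to study the regularity of the model, we need specific smoothness properties for the function $\vartheta \mapsto b(\vartheta; \cdot)$.

Assumption 4. There exist $r_1, r_2 \geq 1$ and $C > 0$ such that the function $\vartheta \mapsto b(\vartheta; t, x, \nu)$ is twice differentiable at every point $\vartheta$ in the interior of $\Theta$ and, for every $1 \leq \ell, \ell' \leq p$,

\[ \sup_{t \in [0,T]} \big(|\partial_{\vartheta_\ell} b(\vartheta; t, x, \nu)| + |\partial^2_{\vartheta_\ell \vartheta_{\ell'}} b(\vartheta; t, x, \nu)|\big) \leq C\big(1 + |x|^{r_1} + m_{r_2}(\nu)\big), \]

\[ \sup_{t \in [0,T]} |\partial_{\vartheta_\ell} b(\vartheta; t, x', \nu') - \partial_{\vartheta_\ell} b(\vartheta; t, x, \nu)| \leq C\big(|x' - x| + W_1(\nu', \nu)\big). \]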

The smoothness properties of the map $\vartheta \mapsto b(\vartheta; \cdot)$ granted by Assumption 4 enable us to explore further the regularity of the experiment $\mathcal{E}^N$. First, note that we have a log-likelihood by setting

\[ \ell^N(\vartheta; X^{(N)}) = \sum_{i=1}^N \int_0^T (c^{-1}b)(\vartheta; t, X_t^i, \mu_t^{(N)})^\top dX_t^i - \frac{1}{2}\sum_{i=1}^N \int_0^T \big|(c^{-1/2}b)(\vartheta; t, X_t^i, \mu_t^{(N)})\big|^2 dt, \tag{9} \]

where $c^{-1/2}$ is fixed once and for all. Indeed, by Girsanov's theorem, the laws $\mathbb{P}^N_\vartheta$ are all absolutely continuous w.r.t. $\mathbb{W}^N$, defined as the unique probability on $(\mathcal{C}^N, \mathcal{F}^N)$ under which the processes

\[ \Big(\int_0^t c^{-1/2}(s, X_s^i)\,dX_s^i\Big)_{t \in [0,T]}, \quad 1 \leq i \leq N, \]

are independent standard Brownian motions on $\mathbb{R}^d$, together with $\mathcal{L}(X_0^1, \ldots, X_0^N) = \mu_0^{\otimes N}$. In turn, for every $\vartheta \in \Theta$,

\[ \frac{d\mathbb{P}^N_\vartheta}{d\mathbb{W}^N}(X^{(N)}) = \exp\big(\ell^N(\vartheta; X^{(N)})\big) \]

holds $\mathbb{W}^N$-almost surely. We further write $L^N(\vartheta; X^{(N)}) = \exp\big(\ell^N(\vartheta; X^{(N)})\big)$ for the likelihood process, indexed by the parameter $\vartheta \in \Theta$. We recall one possible classical definition of a regular statistical experiment, following [24].

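Definition 5. The dominated (sequence of) experiment(s) $(\mathcal{E}^N)_{N \geq 1}$ is regular if

(i) $\vartheta \mapsto L^N(\vartheta; X^{(N)})$ is differentiable for every $\vartheta$ in (the interior of) $\Theta$, $\mathbb{W}^N$-almost surely,

(ii) $\vartheta \mapsto \nabla_\vartheta L^N(\vartheta; X^{(N)})$ is continuous in quadratic $\mathbb{W}^N$-mean, for every $\vartheta$ in (the interior of) $\Theta$,

(iii) we have finite Fisher information

\[ \mathbb{E}_{\mathbb{P}^N_\vartheta}\big[|\nabla_\vartheta \ell^N(\vartheta; X^{(N)})|^2\big] < \infty \]

for every $\vartheta$ in (the interior of) $\Theta$.

Proposition 6. Under Assumptions 1, 2, 3 and 4, the (sequence of) experiment(s) $(\mathcal{E}^N)_{N \geq 1}$ is regular.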

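(Sketch of) Proof. By exchanging the order of the differentiation with respect to $\vartheta$ and the stochastic integral, we have

\[ \partial_{\vartheta_k}\ell^N(\vartheta; X^{(N)}) = \sum_{i=1}^N \int_0^T \partial_{\vartheta_k}(c^{-1}b)(\vartheta; t, X_t^i, \mu_t^{(N)})^\top dX_t^i - \sum_{i=1}^N \int_0^T \partial_{\vartheta_k}(c^{-1/2}b)(\vartheta; t, X_t^i, \mu_t^{(N)})^\top (c^{-1/2}b)(\vartheta; t, X_t^i, \mu_t^{(N)})\,dt. \]

We obtain the representation

\[ \partial_{\vartheta_k}\ell^N(\vartheta; X^{(N)}) = \sum_{i=1}^N \int_0^T \partial_{\vartheta_k}(c^{-1/2}b)(\vartheta; t, X_t^i, \mu_t^{(N)})^\top dB_t^{i,N,\vartheta}, \tag{10} \]

where the

\[ (B_t^{i,N,\vartheta})_{t \in [0,T]} = \Big(\int_0^t c^{-1/2}(s, X_s^i)\big(dX_s^i - b(\vartheta; s, X_s^i, \mu_s^{(N)})\,ds\big)\Big)_{t \in [0,T]}, \quad 1 \leq i \leq N, \]

are independent Brownian motions on $\mathbb{R}^d$ under $\mathbb{P}^N_\vartheta$. The properties (i), (ii) and (iii) are then a simple consequence of Assumption 4 together with the following moment bound.

Lemma 7. Under Assumptions 1, 2, 3, for every $r \geq 1$, we have

\[ \sup_{\vartheta \in \Theta,\, t \in [0,T],\, N \geq 1} \mathbb{E}_{\mathbb{P}^N_\vartheta}\big[|X_t^i|^r\big] < \infty. \]

Note that $\mathbb{E}_{\mathbb{P}^N_\vartheta}[|X_t^i|^r]$ does not depend on $i$. The proof of Lemma 7 is given in Appendix 7.1.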
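The step from the differentiated log-likelihood to the representation (10) can be spelled out as follows. The computation below is a sketch under one additional assumption that is not stated explicitly in the text, namely that the fixed square root $c^{-1/2}$ is symmetric, so that $c^{-1/2}c^{-1/2} = c^{-1}$; it also uses that $c$ does not depend on $\vartheta$. Arguments $(\vartheta; t, X_t^i, \mu_t^{(N)})$ are suppressed for readability.

```latex
\begin{align*}
\partial_{\vartheta_k}\ell^N
 &= \sum_{i=1}^N \int_0^T (\partial_{\vartheta_k} b)^\top c^{-1}\, dX_t^i
  - \sum_{i=1}^N \int_0^T (\partial_{\vartheta_k} b)^\top c^{-1/2} c^{-1/2}\, b \, dt \\
 &= \sum_{i=1}^N \int_0^T (\partial_{\vartheta_k} b)^\top c^{-1/2}\;
    c^{-1/2}\big( dX_t^i - b\, dt \big) \\
 &= \sum_{i=1}^N \int_0^T \partial_{\vartheta_k}(c^{-1/2} b)^\top \, dB_t^{i,N,\vartheta}.
\end{align*}
```

The last line uses the definition of $B^{i,N,\vartheta}$ as the stochastic integral of $c^{-1/2}$ against $dX^i - b\,dt$, which is how the score becomes a martingale under $\mathbb{P}^N_\vartheta$.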

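Finally, we have a notion of Fisher information matrix by setting

\[ \mathbb{I}_{\mathcal{E}^N}(\vartheta) = \mathbb{E}_{\mathbb{P}^N_\vartheta}\big[\nabla_\vartheta \ell^N(\vartheta; X^{(N)})\,\nabla_\vartheta \ell^N(\vartheta; X^{(N)})^\top\big]. \]

Thanks to (10), we also have

\[ \mathbb{I}_{\mathcal{E}^N}(\vartheta) = \sum_{i=1}^N \Big(\mathbb{E}_{\mathbb{P}^N_\vartheta}\Big[\int_0^T \partial_{\vartheta_\ell}(c^{-1/2}b)(\vartheta; t, X_t^i, \mu_t^{(N)})^\top \partial_{\vartheta_{\ell'}}(c^{-1/2}b)(\vartheta; t, X_t^i, \mu_t^{(N)})\,dt\Big]\Big)_{1 \leq \ell, \ell' \leq p}. \tag{11} \]

2.3. The companion McKean-Vlasov product experiment.

We let $\mathcal{C} = \mathcal{C}([0,T], \mathbb{R}^d)$ denote the space of continuous functions from $[0,T]$ to $\mathbb{R}^d$, equipped with the natural filtration $(\mathcal{F}_t)_{0 \leq t \leq T}$ induced by the canonical mapping $X_t(\omega) = \omega_t$. For every $\vartheta \in \Theta$, we let $\mathbb{P}_\vartheta$ denote the unique law under which the process

\[ (B_t^\vartheta)_{t \in [0,T]} = \Big(\int_0^t c^{-1/2}(s, X_s)\big(dX_s - b(\vartheta; s, X_s, \mu_s^\vartheta)\,ds\big)\Big)_{t \in [0,T]} \]

is a standard Brownian motion on $\mathbb{R}^d$, appended with the condition $\mathcal{L}(X_0) = \mu_0$, and $\mu^\vartheta = (\mu_t^\vartheta)_{t \in [0,T]}$ is a probability solution of (3). The family $(\mathbb{P}_\vartheta)_{\vartheta \in \Theta}$ is well defined under Assumptions 1, 2, 3. In particular, the canonical process $X$ on $(\mathcal{C}, \mathcal{F}_T)$ is a solution to the McKean-Vlasov equation

\[ dX_t = b(\vartheta; t, X_t, \mu_t^\vartheta)\,dt + \sigma(t, X_t)\,dB_t^\vartheta, \quad t \in [0,T], \quad \mathcal{L}(X_0) = \mu_0. \tag{12} \]

The following result is the counterpart of Lemma 7. Note in particular that the marginals of $\mathbb{P}_\vartheta$ coincide with the solution $\mu^\vartheta = (\mu_t^\vartheta)_{t \in [0,T]}$ of the Fokker-Planck equation (3).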

Lemma 8. Under Assumptions 1, 2, 3, for every $r \geq 1$, we have

\[ \sup_{\vartheta \in \Theta,\, t \in [0,T]} \mathbb{E}_{\mathbb{P}_\vartheta}\big[|X_t|^r\big] = \sup_{\vartheta \in \Theta,\, t \in [0,T]} \int_{\mathbb{R}^d} |x|^r \mu_t^\vartheta(dx) < \infty. \]

The proof is given in Section 7.2. We also have the following smoothness property in the parameter $\vartheta$, the proof of which is delayed until Section 6.1.

Proposition 9. Under Assumptions 1, 2, 3 and 4, the mapping $\vartheta \mapsto \mu_t^\vartheta$ is Lipschitz continuous in the Wasserstein-1 metric $W_1$, uniformly in $t \in [0,T]$.


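We next consider the limit experiment

\[ \mathcal{G} = \big(\mathcal{C}, \mathcal{F}_T, (\mathbb{P}_\vartheta)_{\vartheta \in \Theta}\big) \]

and its $N$-fold counterpart

\[ \mathcal{G}^{\otimes N} = \big(\mathcal{C}^N, \mathcal{F}^N_T, (\mathbb{P}^{\otimes N}_\vartheta)_{\vartheta \in \Theta}\big) \]

that serves as an approximation for the experiment $\mathcal{E}^N$. Inspired by classical propagation of chaos techniques (see in particular [29]), we can easily show that the measures $\mathbb{P}^N_\vartheta$ and $\mathbb{P}^{\otimes N}_\vartheta$ are asymptotically indistinguishable when the drift is of the form

\[ b(\vartheta; t, x, \nu) = \int_{\mathbb{R}^d} \widetilde{b}(\vartheta; t, x, y)\,\nu(dy), \tag{13} \]

for some kernel $\widetilde{b}(\vartheta; \cdot) : [0,T] \times \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$ such that

\[ \sup_{t \in [0,T],\, \vartheta \in \Theta} \big|\widetilde{b}(\vartheta; t, x, y)\big| \leq C\big(1 + |x|^{r_1} + |y|^{r_2}\big) \tag{14} \]

for some $r_1, r_2 \geq 1$, a situation that covers most of our examples, see Section 4 below. More precisely, we have the following result.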

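Proposition 10. Under Assumptions 1, 2, 3, if $b$ has moreover the form (13)-(14), we have

\[ \limsup_{N \to \infty}\ \sup_{\vartheta \in \Theta} \mathbb{E}_{\mathbb{P}^{\otimes N}_\vartheta}\Big[\log \frac{d\mathbb{P}^{\otimes N}_\vartheta}{d\mathbb{P}^N_\vartheta}\Big] < \infty. \tag{15} \]

In particular, if

\[ \sup_{\vartheta \in \Theta} \int_0^T \int_{\mathbb{R}^d \times \mathbb{R}^d} |\widetilde{b}(\vartheta; t, x, y)|^2 (\mu_t^\vartheta \otimes \mu_t^\vartheta)(dx, dy)\,dt < 4, \]

then

\[ \limsup_{N \to \infty}\ \sup_{\vartheta \in \Theta} \|\mathbb{P}^N_\vartheta - \mathbb{P}^{\otimes N}_\vartheta\|_{TV} < 1, \tag{16} \]

where $\|\cdot\|_{TV}$ denotes total variation distance.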

The proof is given in Section 6.2. Some remarks are in order:

1) The estimate (15) tells us that it is impossible to statistically discriminate between $\mathbb{P}^N_\vartheta$ and $\mathbb{P}^{\otimes N}_\vartheta$ asymptotically. More precisely, inequality (16) shows in particular that, provided $\widetilde{b}$ is not too big or $T$ not too large, there exists no test of the null $H_0 : \mathbb{P}^N_\vartheta = \mathbb{P}^{\otimes N}_\vartheta$ against the alternative $H_1 : \mathbb{P}^N_\vartheta \neq \mathbb{P}^{\otimes N}_\vartheta$ with asymptotically arbitrarily small first and second kind errors in the limit $N \to \infty$.

2) We will actually prove a stronger result in Section 3 below, showing that both $(\mathcal{E}^N)_{N \geq 1}$ and $(\mathcal{G}^{\otimes N})_{N \geq 1}$ share the LAN property, with the same asymptotic Fisher information.

3) Finally, (15) may hold in wider generality when the dependence on the measure variable in the drift is nonlinear, as soon as we have some differentiability in the following sense: there exists $\partial_\nu b(\vartheta; t, x, \cdot) : \mathbb{R}^d \times \mathcal{P}_1 \to \mathbb{R}^d$ such that

\[ b(\vartheta; t, x, \nu) - b(\vartheta; t, x, \nu') = \int_0^1 \int_{\mathbb{R}^d} \partial_\nu b\big(\vartheta; t, x, y, \lambda\nu + (1 - \lambda)\nu'\big)(\nu - \nu')(dy)\,d\lambda \]

for every $\nu, \nu' \in \mathcal{P}_1$, and $\partial_\nu b(\vartheta; t, x, \cdot)$ satisfies additional smoothness properties. Iterating the operator $\partial_\nu$, if $\partial_\nu^k b(\vartheta; t, x, \cdot) : (\mathbb{R}^d)^k \times \mathcal{P}_1 \to \mathbb{R}^d$ exists and satisfies some smoothness and integrability properties, we may expect (15) to hold as soon as $k \geq d/2$. We refer to Assumption 4 and Proposition 19 of [16], where this approach is developed.


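We also have a log-likelihood in the experiment $\mathcal{G}^{\otimes N}$ by setting

\[ \ell^N(\vartheta; X^{(N)}) = \sum_{i=1}^N \int_0^T (c^{-1}b)(\vartheta; t, X_t^i, \mu_t^\vartheta)^\top dX_t^i - \frac{1}{2}\sum_{i=1}^N \int_0^T \big|(c^{-1/2}b)(\vartheta; t, X_t^i, \mu_t^\vartheta)\big|^2 dt. \tag{17} \]

This is the same argument as before: the laws $\mathbb{P}^{\otimes N}_\vartheta$ are all absolutely continuous w.r.t. $\mathbb{W}^N$, and for every $\vartheta \in \Theta$,

\[ \frac{d\mathbb{P}^{\otimes N}_\vartheta}{d\mathbb{W}^N}(X^{(N)}) = \exp\big(\ell^N(\vartheta; X^{(N)})\big) \]

holds $\mathbb{W}^N$-almost surely.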

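Finally, under Assumptions 1, 2, 3 and 4, the (sequence of) experiment(s) $\mathcal{G}^{\otimes N}$ is also a regular model and its (normalised) Fisher information $\mathbb{I}_{\mathcal{G}}(\vartheta) = N^{-1}\,\mathbb{I}_{\mathcal{G}^{\otimes N}}(\vartheta)$, with $\ell^N$ as in (17), is given by

\[ \begin{aligned} N^{-1}\,\mathbb{E}_{\mathbb{P}^{\otimes N}_\vartheta}\big[\nabla_\vartheta \ell^N(\vartheta; X^{(N)})\,\nabla_\vartheta \ell^N(\vartheta; X^{(N)})^\top\big] &= \sum_{j=1}^d \int_0^T \int_{\mathbb{R}^d} \nabla_\vartheta (c^{-1/2}b)^j(\vartheta; t, x, \mu_t^\vartheta)\,\nabla_\vartheta (c^{-1/2}b)^j(\vartheta; t, x, \mu_t^\vartheta)^\top \mu_t^\vartheta(dx)\,dt \\ &= \Big(\int_0^T \int_{\mathbb{R}^d} \partial_{\vartheta_\ell}(c^{-1/2}b)(\vartheta; t, x, \mu_t^\vartheta)^\top \partial_{\vartheta_{\ell'}}(c^{-1/2}b)(\vartheta; t, x, \mu_t^\vartheta)\,\mu_t^\vartheta(dx)\,dt\Big)_{1 \leq \ell, \ell' \leq p}. \end{aligned} \]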

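Moreover, the mapping $\vartheta \mapsto \mathbb{I}_{\mathcal{G}}(\vartheta)$ is smooth and appears as the (normalised) asymptotic information of $\mathcal{E}^N$:

Proposition 11. Under Assumptions 1, 2, 3 and 4, the mapping $\vartheta \mapsto \mathbb{I}_{\mathcal{G}}(\vartheta)$ is Lipschitz continuous. Moreover, for every $\vartheta$ in (the interior of) $\Theta$, we have

\[ N^{-1}\,\mathbb{I}_{\mathcal{E}^N}(\vartheta) \to \mathbb{I}_{\mathcal{G}}(\vartheta) \]

as $N \to \infty$, where $\mathbb{I}_{\mathcal{E}^N}(\vartheta)$ is the Fisher information matrix of the experiment $\mathcal{E}^N$ defined in (11) above.

The proof is given in Section 6.3.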
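To see what the limiting information looks like in a case where everything is explicit, consider the following toy model, which is an assumption made for illustration only: $d = p = 1$, $\sigma \equiv 1$, $b(\vartheta; t, x, \nu) = -\vartheta \int (x - y)\,\nu(dy)$ with $\vartheta > 0$, and an initial law with variance $v_0$. In the limit experiment the mean of $X_t$ is constant, the variance $v_t$ solves $v_t' = -2\vartheta v_t + 1$, and the normalised Fisher information reduces to $\int_0^T v_t\,dt$. The sketch below evaluates this closed form; the function names are hypothetical.

```python
import numpy as np


def limit_variance(theta, t, v0=1.0):
    """Variance of X_t in the toy McKean-Vlasov limit: solves v' = -2*theta*v + 1."""
    stationary = 1.0 / (2.0 * theta)
    return stationary + (v0 - stationary) * np.exp(-2.0 * theta * t)


def fisher_information_limit(theta, horizon=1.0, v0=1.0):
    """Closed form of int_0^T Var(X_t) dt, the limiting information of the toy model.

    Assumes theta > 0, sigma = 1 and drift -theta * (x - mean), so that the
    derivative of the drift in theta is -(x - mean).
    """
    stationary = 1.0 / (2.0 * theta)
    decay = (1.0 - np.exp(-2.0 * theta * horizon)) / (2.0 * theta)
    return float(stationary * horizon + (v0 - stationary) * decay)


info = fisher_information_limit(theta=1.5, horizon=1.0, v0=1.0)

# Cross-check the closed form against a trapezoidal quadrature of the variance path.
grid = np.linspace(0.0, 1.0, 20001)
values = limit_variance(1.5, grid)
quadrature = float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid)))
```

In this toy case the information is strictly positive for every $\vartheta > 0$ as soon as $T > 0$, since the variance path never vanishes on $(0, T]$.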

2.4. Identifiability and non-degeneracy of the Fisher information.

Motivation. In the preceding section, we have built $\mathcal{E}^N$ and $\mathcal{G}^{\otimes N}$ (equivalently $\mathcal{G}$) as possibly redundant, in the sense that the mappings $\vartheta \mapsto \mathbb{P}^N_\vartheta$ and $\vartheta \mapsto \mathbb{P}_\vartheta$ are not necessarily one-to-one on $\Theta$. Having a well-posed parametrisation is required since we wish to have at least consistent estimators. Arguing asymptotically, we only need to work in the limit model $\mathcal{G}$.

Also, asymptotic identifiability is somehow linked to the non-degeneracy of the (normalised) Fisher information matrix $\mathbb{I}_{\mathcal{G}}$. Following [42], see also [48], we say that a point $\vartheta$ in (the interior of) $\Theta$ is regular if $\vartheta' \mapsto \mathbb{I}_{\mathcal{G}}(\vartheta')$ has constant rank in a neighbourhood of $\vartheta$, and the experiment $\mathcal{G}$ is called locally identifiable at $\vartheta$ if the mapping $\vartheta' \mapsto \mathbb{P}_{\vartheta'}$ is injective in a neighbourhood of $\vartheta$. We have the following classical result (that goes back at least to Cramér [15]):

Proposition 12 (Theorem 1 in [42]). If $\vartheta$ is regular, then $\mathcal{G}$ is locally identifiable at $\vartheta$ if and only if $\mathbb{I}_{\mathcal{G}}(\vartheta)$ has full rank.

Unfortunately, there is no hope of obtaining a global result that links the two notions except in very specific cases, see Proposition 16 below. We next give ad hoc assumptions that provide sufficient (and independent) conditions for both identifiability and non-degeneracy of the Fisher information.


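An identifiability assumption. We first have a relatively weak assumption that guarantees global identifiability in $\mathcal{G}$.

Assumption 13. For all $\vartheta \in \Theta$, for $\mathbb{P}_\vartheta$-almost all $\omega$, for all $\vartheta' \neq \vartheta$, the functions $t \mapsto b(\vartheta; t, X_t(\omega), \mu_t^\vartheta)$ and $t \mapsto b(\vartheta'; t, X_t(\omega), \mu_t^{\vartheta'})$ are not $dt$-a.e. equal.

Assumption 13 is relatively standard in the literature of statistics of random processes and minimal (see e.g. [19] in a somewhat analogous context). Indeed, by Girsanov's theorem, for two different parameters $\vartheta, \vartheta' \in \Theta$, the laws $\mathbb{P}_\vartheta$ and $\mathbb{P}_{\vartheta'}$ are mutually absolutely continuous and

\[ \log \frac{d\mathbb{P}_\vartheta}{d\mathbb{P}_{\vartheta'}}(X) = \int_0^T \big((c^{-1}b)(\vartheta; s, X_s, \mu_s^\vartheta) - (c^{-1}b)(\vartheta'; s, X_s, \mu_s^{\vartheta'})\big)^\top dX_s - \frac{1}{2}\int_0^T \big(|(c^{-1/2}b)(\vartheta; s, X_s, \mu_s^\vartheta)|^2 - |(c^{-1/2}b)(\vartheta'; s, X_s, \mu_s^{\vartheta'})|^2\big)\,ds. \]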

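Having Assumption 13 fail for some $\vartheta'$ implies that $\frac{d\mathbb{P}_\vartheta}{d\mathbb{P}_{\vartheta'}}(X) = 1$ holds $\mathbb{P}_\vartheta$-almost surely, i.e. $\mathbb{P}_\vartheta = \mathbb{P}_{\vartheta'}$. Assumption 13 may be difficult to check in practice. Yet, it is satisfied as soon as the mapping $\vartheta \mapsto \big((t, x) \mapsto b(\vartheta; t, x, \mu_t^\vartheta)\big)$ is one-to-one. Also, for certain forms of the likelihood, we have other criteria, see Proposition 16 below.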

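Non-degeneracy of the information. We need some notation. For any $\vartheta, \vartheta' \in \Theta$ such that the segment $[\vartheta, \vartheta'] = \{\vartheta + \lambda(\vartheta' - \vartheta),\ \lambda \in [0,1]\} \subset \Theta$ and a function $\varphi$ defined on $\Theta$, we set

\[ \varphi([\vartheta, \vartheta']) = \int_0^1 \varphi\big(\vartheta + \lambda(\vartheta' - \vartheta)\big)\,d\lambda. \]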

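Definition 14. The statistical experiment $\mathcal{G}$ is non-degenerate if

\[ \inf_{[\vartheta, \vartheta'] \subset \Theta} \det \mathbb{E}_{\mathbb{P}_\vartheta}\big[\nabla_\vartheta \ell^1([\vartheta, \vartheta'])\,\nabla_\vartheta \ell^1([\vartheta, \vartheta'])^\top\big] > 0, \tag{18} \]

where $\det$ denotes the determinant.

Equivalently, we can rewrite (18) as

\[ \inf \det\Big(\sum_{j=1}^d \int_0^T \int_{\mathbb{R}^d} \nabla_\vartheta (c^{-1/2}b)^j([\vartheta, \vartheta']; t, x, \mu_t^\vartheta)\,\nabla_\vartheta (c^{-1/2}b)^j([\vartheta, \vartheta']; t, x, \mu_t^\vartheta)^\top \mu_t^\vartheta(dx)\,dt\Big) > 0, \]

where the infimum is taken over all segments $[\vartheta, \vartheta'] \subset \Theta$. Obviously, if $\mathcal{G}$ is non-degenerate, taking $\vartheta = \vartheta'$, Definition 14 boils down to

\[ \inf_{\vartheta \in \Theta} \det \mathbb{I}_{\mathcal{G}}(\vartheta) > 0, \tag{19} \]

i.e. $\vartheta \mapsto \mathbb{I}_{\mathcal{G}}(\vartheta)$ has full rank uniformly in $\vartheta$, and we recover the usual non-degeneracy of the Fisher information. The somewhat stronger non-degeneracy criterion that we pick in Definition 14 enables us to check the assumptions of the theory of Ibragimov and Hasminski for obtaining sharp properties for the maximum likelihood estimator (see in particular Step 2 of the proof of Theorem 19 in Section 5.3 below). In explicit examples, proving (18) is no more difficult than proving (19), see Section 4 below.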


Checking (18) or (19) in practice. A special difficulty for the statistical analysis of $\mathcal{E}^N$, or rather $\mathcal{G}$, lies in the asymptotic form (12) with the presence of $(\mu_t^\vartheta)_{0 \leq t \leq T}$ in the drift, which is never explicit, except in very special cases with a specific moment structure in the measure dependence, see Section 4 below.

It is noteworthy that (18) can usually be tested in a simple way given an explicit parametrisation. Indeed, Definition 14 is equivalent to showing that

\[ \inf_{[\vartheta, \vartheta'] \subset \Theta}\ \min_{|z| = 1}\ \sum_{j=1}^d \int_0^T \int_{\mathbb{R}^d} \big(\nabla_\vartheta (c^{-1/2}b)^j([\vartheta, \vartheta']; t, x, \mu_t^\vartheta)^\top z\big)^2 \mu_t^\vartheta(dx)\,dt > 0. \]

Under Assumptions 1, 2, 3, we have that $\mu_t^\vartheta(dx) = \mu_t^\vartheta(x)\,dx$ is absolutely continuous on $\mathbb{R}^d$ for $t > 0$, and we may pick a version $\mu_t^\vartheta$ of the density that is continuous and positive on $\mathbb{R}^d$. This follows from classical Gaussian tail estimates for the solution of parabolic equations. We refer for example to Corollary 8.2.2 of [5]. By a simple continuity argument, it is then sufficient to show that there cannot exist a segment $[\vartheta, \vartheta'] \subset \Theta$ and some $|z| = 1$ such that the function

\[ x \mapsto \int_0^T \sum_{j=1}^d \big(\nabla_\vartheta (c^{-1/2}b)^j([\vartheta, \vartheta']; t, x, \mu_t^\vartheta)^\top z\big)^2 dt \]

vanishes identically or, as soon as we have continuity in $t$ as $t \to 0$, that one of the functions

\[ x \mapsto \nabla_\vartheta (c^{-1/2}b)^j([\vartheta, \vartheta']; 0, x, \mu_0)^\top z, \quad j = 1, \ldots, d, \]

does not vanish identically. This last criterion has the advantage of avoiding the term $\mu_t^\vartheta$ for $t > 0$.

We gather these observations in the following:

Proposition 15. Work under Assumptions 1, 2, 3 and 4. Assume moreover that the functions

\[ t \mapsto \nabla_\vartheta (c^{-1/2}b)^j([\vartheta, \vartheta']; t, x, \mu_t^\vartheta), \quad j = 1, \ldots, d, \]

are all continuous at $t = 0$ for every $[\vartheta, \vartheta'] \subset \Theta$ and almost every $x \in \mathbb{R}^d$. If, for every $[\vartheta, \vartheta'] \subset \Theta$ and any $z \in \mathbb{R}^p$ with $|z| = 1$, one of the functions

\[ x \mapsto \nabla_\vartheta (c^{-1/2}b)^j([\vartheta, \vartheta']; 0, x, \mu_0)^\top z, \quad j = 1, \ldots, d, \tag{20} \]

does not vanish identically, then $\mathcal{G}$ is non-degenerate in the sense of Definition 14.

We specifically apply this criterion in the examples of Section 4 and check that the criterion (20) is particularly simple to establish when the dependence on the measure argument of the function $b$ is of the form (13).
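As a hedged illustration of how criterion (20) is checked, consider the following toy model, which is an assumption made for this sketch and is not taken from the examples of Section 4: $d = p = 1$, $\sigma \equiv 1$, $\Theta$ a compact interval, and $b(\vartheta; t, x, \nu) = -\vartheta \int_{\mathbb{R}} (x - y)\,\nu(dy)$, which is of the form (13). Since the derivative in $\vartheta$ does not depend on $\vartheta$, averaging over a segment $[\vartheta, \vartheta']$ leaves it unchanged:

```latex
\begin{align*}
\nabla_\vartheta (c^{-1/2} b)(\vartheta; t, x, \nu)
  &= -\Big(x - \int_{\mathbb{R}} y\,\nu(dy)\Big), \\
\nabla_\vartheta (c^{-1/2} b)([\vartheta,\vartheta']; 0, x, \mu_0)\, z
  &= -\int_0^1 \Big(x - \int_{\mathbb{R}} y\,\mu_0(dy)\Big)\, d\lambda \; z
   = -\Big(x - \int_{\mathbb{R}} y\,\mu_0(dy)\Big)\, z .
\end{align*}
```

For $z = \pm 1$ this is a non-constant affine function of $x$, hence not identically zero. The continuity requirement at $t = 0$ is also met in this toy case, because the drift averages to zero under $\mu_t^\vartheta$, so that $t \mapsto \int y\,\mu_t^\vartheta(dy)$ is constant. Under Assumptions 1 to 4, Proposition 15 would then give non-degeneracy of $\mathcal{G}$ for this model.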

A case of equivalence between global identifiability and non-degeneracy of the information. We revisit Theorem 3 in [42] to obtain the following criterion:

Proposition 16. Work under Assumptions 1, 2, 3 and 4. Assume that the log-likelihood $\ell^N(\vartheta; X^{(N)})$ in $\mathcal{E}^N$ defined by (9) has the form

\[ \ell^N(\vartheta; X^{(N)}) = \vartheta^\top G^N(X^{(N)}) + \vartheta^\top H^N(X^{(N)})\,\vartheta, \tag{21} \]

where $G^N$ and $H^N$ are functions of the trajectory $X^{(N)}$ with values in $\mathbb{R}^p$ and $\mathbb{R}^p \otimes \mathbb{R}^p$ respectively, and $(H^N)^\top = H^N$ is symmetric. If $\Theta_0 \subset \Theta$ is a convex set such that $\mathbb{I}_{\mathcal{G}}(\vartheta)$ is non-singular for every $\vartheta \in \Theta_0$, then both $(\mathcal{E}^N)_{N \geq 1}$ and $\mathcal{G}$ are identifiable on $\Theta_0$.


By identifiability of the sequence of experiments $(\mathcal{E}^N)_{N \geq 1}$, we mean injectivity of the mapping $\vartheta \mapsto (\mathbb{P}^N_\vartheta)_{N \geq 1}$ (i.e. simultaneously for every $N \geq 1$). The proof is given in Section 6.4. In the specific case of McKean type models, which date back to [38, 44, 46] and are widely used in practice (see e.g. [12, 17] or [27] in statistics), we have in some instances a representation like (21) and explicit formulas for $\mathbb{I}_{\mathcal{G}}(\vartheta)$, which gives global identifiability for free as soon as $\mathbb{I}_{\mathcal{G}}(\vartheta)$ is non-degenerate. See the examples in Section 4.

3. MAIN RESULTS

3.1. The LAN property. The local asymptotic normality property of a statistical model characterises its regularity: it expresses the fact that the experiment locally resembles a Gaussian shift in an optimal scale driven by the Fisher information. It has powerful consequences in terms of properties of optimal procedures via the celebrated Hájek convolution theorem [23]. More precisely, the sequence of experiments $(\mathcal{E}^N)_{N \geq 1}$ satisfies the LAN property at $\vartheta \in \Theta$ with information rate $N\,\mathbb{I}_{\mathcal{G}}(\vartheta)$ if

\[ \log \frac{d\mathbb{P}^N_{\vartheta + (N\mathbb{I}_{\mathcal{G}}(\vartheta))^{-1/2}u}}{d\mathbb{P}^N_\vartheta} = u^\top \xi_\vartheta^N - \tfrac{1}{2}|u|^2 + r_N(\vartheta, u), \tag{22} \]

where $\xi_\vartheta^N$ converges in distribution under $\mathbb{P}^N_\vartheta$ to a standard Gaussian variable in $\mathbb{R}^p$ and $r_N(\vartheta, u) \to 0$ in $\mathbb{P}^N_\vartheta$-probability. Of course, the convergence (22) is meaningful only if $\vartheta + (N\,\mathbb{I}_{\mathcal{G}}(\vartheta))^{-1/2}u \in \Theta$ and is well defined, i.e. if $\det \mathbb{I}_{\mathcal{G}}(\vartheta) > 0$. This is granted for instance for $\vartheta$ in the interior of $\Theta$ for large enough $N$ and under (19).
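For comparison, the model case behind (22) is the exact Gaussian shift, where the expansion holds with zero remainder. The short computation below is a standard fact, written in notation introduced here for the purpose ($Z$ for the observation and $\mathbb{Q}_u$ for the law $\mathcal{N}(u, \mathrm{Id}_{\mathbb{R}^p})$); it is not part of the argument of the paper.

```latex
\begin{align*}
Z &= u + \xi, \qquad \xi \sim \mathcal{N}(0, \mathrm{Id}_{\mathbb{R}^p}), \\
\log \frac{d\mathbb{Q}_u}{d\mathbb{Q}_0}(Z)
  &= -\tfrac{1}{2}|Z - u|^2 + \tfrac{1}{2}|Z|^2
   = u^\top Z - \tfrac{1}{2}|u|^2 .
\end{align*}
```

Under $\mathbb{Q}_0$, the variable $Z$ is exactly standard Gaussian, so the right-hand side has the form of (22) with $\xi_\vartheta^N$ replaced by $Z$ and $r_N \equiv 0$. The LAN property states that the log-likelihood ratio of $\mathcal{E}^N$ has this structure asymptotically.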

Theorem 17. Work under Assumptions 1, 2, 3, 4 and 13. Assume moreover that $\mathcal{G}$ is non-degenerate according to Definition 14. For every $\vartheta$ in (the interior of) $\Theta$, the sequence of experiments $(\mathcal{E}^N)_{N \geq 1}$ is locally asymptotically normal at $\vartheta$ with information rate $N\,\mathbb{I}_{\mathcal{G}}(\vartheta)$. The same result holds for $(\mathcal{G}^{\otimes N})_{N \geq 1}$.

Several remarks are in order:

1) Theorem 17 is the most powerful result one can obtain about the structure of $(\mathcal{E}^N)_{N \geq 1}$ and $(\mathcal{G}^{\otimes N})_{N \geq 1}$: it tells us that around a given point $\vartheta_0$, if we parametrise locally the experiment via $\vartheta = \vartheta_0 + N^{-1/2}u$ with $u \in \mathbb{R}^p$ being the unknown parameter, then the experiments look like the simplest possible experiment, namely a Gaussian shift

\[ Y^N = u + \mathbb{I}_{\mathcal{G}}(\vartheta_0)^{-1/2}\xi + o(1), \]

where $\xi$ is a standard normal $\mathcal{N}(0, \mathrm{Id}_{\mathbb{R}^p})$ and $o(1)$ is a small term that vanishes in $\mathbb{P}^N_\vartheta$ or $\mathbb{P}^{\otimes N}_\vartheta$ probability, locally uniformly in $u$.

2) The fact that both $(\mathcal{E}^N)_{N \geq 1}$ and $(\mathcal{G}^{\otimes N})_{N \geq 1}$ share the LAN property with the same asymptotic Fisher information quantifies their asymptotic similarity, see in particular Proposition 10.

3) The LAN property has several consequences in terms of strong properties of the maximum likelihood estimator, see Theorem 19 below. In particular, the first simple consequence is given in terms of exact asymptotic minimax lower bounds: call a centrally symmetric function $w : \mathbb{R}^p \to [0, \infty)$ such that the sets $\{w < c\}$, $c > 0$, are all convex a polynomial loss function if it admits a polynomial majorant.

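Corollary 18. In the setting of Theorem 17, let $w$ be a polynomial loss function. Then, for any estimator $\widehat\vartheta_N$ in $E^N$ and any sufficiently small $\delta>0$, for every $\vartheta$ in (the interior of) $\Theta$ for which $\det I_G(\vartheta)>0$, we have

$$\liminf_{N\to\infty}\ \sup_{|\vartheta'-\vartheta|\le\delta} E_{P^N_{\vartheta'}}\Big[w\big(N^{1/2}I_G(\vartheta)^{1/2}(\widehat\vartheta_N-\vartheta')\big)\Big] \ge (2\pi)^{-p/2}\int_{\mathbb{R}^p} w(x)\exp\big(-\tfrac12|x|^2\big)\,dx.$$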

The same result holds true for $\widehat\vartheta_N$ in $G^{\otimes N}$, replacing $P^N_\vartheta$ by $P^{\otimes N}_\vartheta$.


Corollary 18 is a simple application of Hájek's convolution theorem, given the LAN property of Theorem 17, see e.g. Theorem II.12.1 (and in particular Remark III.12.1) in [24]. It provides a sharp local asymptotic minimax bound, up to constants. We shall see below that the maximum likelihood estimator achieves this bound.
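To make the right-hand side of Corollary 18 concrete, one may take the quadratic loss $w(x)=|x|^2$, which is centrally symmetric, has convex sublevel sets and is its own polynomial majorant. The following computation is a standard Gaussian moment identity, included here only as an illustration:

```latex
\begin{align*}
(2\pi)^{-p/2}\int_{\mathbb{R}^p}|x|^2\exp\big(-\tfrac12|x|^2\big)\,dx
 &= E\big[|\xi|^2\big] = \sum_{j=1}^p E\big[\xi_j^2\big] = p,
 \qquad \xi\sim\mathcal N(0,\mathrm{Id}_{\mathbb{R}^p}), \\
\liminf_{N\to\infty}\ \sup_{|\vartheta'-\vartheta|\le\delta}
 N\,E_{P^N_{\vartheta'}}\Big[(\widehat\vartheta_N-\vartheta')^\top I_G(\vartheta)\,(\widehat\vartheta_N-\vartheta')\Big]
 &\ge p .
\end{align*}
```

In other words, for this loss no estimator can have a locally uniform rescaled quadratic risk asymptotically below the dimension $p$ of the parameter.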

3.2. Maximum likelihood estimation and properties. We elaborate on the properties of the maximum likelihood estimator by relying on (a uniform version of) the LAN property of Theorem 17.

It implies several fine results that go beyond the usual asymptotic weak expansions obtained by an ad hoc study of the form of the estimator, as is usually the case in the literature.

Theorem 19. Work under Assumptions 1, 2, 3, 4 and 13. Then, for large enough $N$, the solution $\widehat\vartheta^{\mathrm{mle}}_N$ to

$$L^N(\widehat\vartheta^{\mathrm{mle}}_N;X^{(N)}) = \sup_{\vartheta\in\Theta}L^N(\vartheta;X^{(N)}) \tag{23}$$

is well-defined. Moreover, the following asymptotic upper bounds are valid:

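(i) If $G$ is non-degenerate in the sense of Definition 14,

$$\sqrt{N}\big(\widehat\vartheta^{\mathrm{mle}}_N-\vartheta\big)\to\mathcal N\big(0,I_G(\vartheta)^{-1}\big)$$

in $P^N_\vartheta$-distribution as $N\to\infty$.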

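(ii) For every polynomial loss function $w$ and any $\vartheta$ in the interior of $\Theta$, we have exact local asymptotic minimax optimality:

$$\limsup_{N\to\infty}\ \sup_{|\vartheta'-\vartheta|\le\delta} E_{P^N_{\vartheta'}}\Big[w\big(N^{1/2}I_G(\vartheta)^{1/2}(\widehat\vartheta^{\mathrm{mle}}_N-\vartheta')\big)\Big] \to (2\pi)^{-p/2}\int_{\mathbb{R}^p} w(x)\exp\big(-\tfrac12|x|^2\big)\,dx \quad\text{as } \delta\to 0.$$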

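(iii) For every polynomial loss function $w$ and any (non-empty) open set $\Theta_0\subset\Theta$, we have global asymptotic minimax optimality:

$$R^N_w(\widehat\vartheta^{\mathrm{mle}}_N;\Theta_0) = \inf_{\widehat\vartheta_N}R^N_w(\widehat\vartheta_N;\Theta_0)\,(1+o(1)) \quad\text{as } N\to\infty,$$

where

$$R^N_w(\widehat\vartheta_N;\Theta_0) = \sup_{\vartheta\in\Theta_0} E_{P^N_\vartheta}\Big[w\big(N^{1/2}I_G(\vartheta)^{1/2}(\widehat\vartheta_N-\vartheta)\big)\Big].$$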

Some further remarks: 1) We recover in (i) the classical asymptotic properties of the maximum likelihood estimator given in the literature, but the result is supplemented by the much stronger convergence (ii), which matches in particular the lower bound of Corollary 18. 2) We finally obtain global asymptotic minimax optimality in (iii), which is the parametric analogue (in a much more precise way) of our minimax results of Section 4 in [16] in the nonparametric case.
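One practical reading of Theorem 19 (i) is that coordinatewise confidence intervals can be built from the Gaussian limit. The sketch below is only an illustration: the function name `wald_intervals`, the default quantile `z=1.96` and the idea of passing a user-supplied matrix `info` in place of $I_G(\vartheta)$ (for instance a plug-in estimate, whose validity is not discussed in the text) are all assumptions made here.

```python
import numpy as np

def wald_intervals(estimate, info, n_particles, z=1.96):
    """Approximate coordinatewise intervals based on the limit N(0, info^{-1} / N).

    `info` stands in for the Fisher information I_G(theta); how it is obtained
    is left to the user and is an assumption of this sketch.
    """
    estimate = np.asarray(estimate, dtype=float)
    # Asymptotic covariance of the estimator: I_G(theta)^{-1} / N.
    cov = np.linalg.inv(np.asarray(info, dtype=float)) / n_particles
    half_width = z * np.sqrt(np.diag(cov))
    return np.column_stack((estimate - half_width, estimate + half_width))

# Toy input with a diagonal information matrix, only to exercise the function.
intervals = wald_intervals([0.0, 1.0], np.diag([4.0, 1.0]), n_particles=100)
```

Each row of `intervals` holds the lower and upper endpoint for one coordinate of the parameter.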

4. EXAMPLES

In this section, we elaborate on specific examples that appear in the literature and in applications. We first revisit the linear McKean model studied at length in [27]. In Section 4.1, we slightly extend example (1.3) of [27] from $p=2$ to $p=3$. In Section 4.2, we develop an example of a generalised linear form and show in particular how our identifiability and non-degeneracy criteria of Section 2.4 are easily implementable and avoid using the machinery of [27]. In Section 4.3, we develop a non-trivial example of a kinetic mean-field model with a double layer potential that may serve in many applications, like swarming models or more general individual-based models, see [6] and the references therein. We finally develop in Section 4.4 a genuinely non-linear example, i.e. when the measure argument is not linear as in (4), as for instance in the examples of [41]. Assumption 1 is in force throughout.

4.1. McKean-like models. In many applications, (2) takes the explicit form

$$dX^i_t = (\vartheta_1X^i_t+\vartheta_2)\,dt - \vartheta_3N^{-1}\sum_{j=1}^N(X^i_t-X^j_t)\,dt + dB^i_t,\qquad i=1,\dots,N, \tag{24}$$

with $X^i_t\in\mathbb{R}$. The parameter is $\vartheta=(\vartheta_1\ \vartheta_2\ \vartheta_3)^\top$. In [27], the case $\vartheta_2=0$ in particular is studied at length. In our setting, we can encompass a more general situation with $X^i_t\in\mathbb{R}^d$ for some arbitrary $d\ge 1$ and replace $\vartheta_3$ by a parameter in $\mathbb{R}^d\otimes\mathbb{R}^d$ as well as $\vartheta_2$ by a parameter in $\mathbb{R}^d$. In this case, Assumptions 2, 3 and 4 are readily checked. Likewise, the identifiability and non-degeneracy assumptions can be obtained with some extra care on the initial condition. We elaborate on a specific case below.
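For readers who wish to experiment with (24) in dimension $d=1$, here is a minimal simulation sketch. It relies on an Euler-Maruyama discretisation, which is not part of the paper; the function name `simulate_mckean`, the parameter values, the step size and the Gaussian initial condition are all illustrative assumptions.

```python
import numpy as np

def simulate_mckean(theta, x0, horizon, n_steps, rng):
    """Euler-Maruyama scheme for the particle system (24) with d = 1.

    Returns an array of shape (n_steps + 1, N) whose row k holds the
    particle positions at time k * horizon / n_steps.
    """
    th1, th2, th3 = theta
    dt = horizon / n_steps
    paths = np.empty((n_steps + 1, x0.size))
    paths[0] = x0
    for k in range(n_steps):
        x = paths[k]
        # N^{-1} sum_j (x_i - x_j) reduces to x_i minus the empirical mean.
        interaction = x - x.mean()
        drift = th1 * x + th2 - th3 * interaction
        noise = rng.normal(0.0, np.sqrt(dt), size=x0.size)
        paths[k + 1] = x + drift * dt + noise
    return paths

rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 1.0, size=200)   # illustrative initial condition
paths = simulate_mckean((-1.0, 0.5, 1.5), x0, horizon=1.0, n_steps=500, rng=rng)
```

The interaction term costs only $O(N)$ operations per time step here, because the pairwise average in (24) collapses to a deviation from the empirical mean.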

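Likelihood equations. To keep the notation simple, we detail the case $p=3$ with $\vartheta=(\vartheta_1\ \vartheta_2\ \vartheta_3)^\top\in\Theta$, where $\Theta$ is a compact subset of $\mathbb{R}^3$, for an ambient dimension $d=1$, with $\vartheta_1\neq\vartheta_3$ and $\vartheta_1\neq 0$. Introduce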

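$$A^N_t(x)=\begin{pmatrix} x^2 & x & -\langle\cdot-x,\mu^{(N)}_t\rangle^2\\ x & 1 & 0\\ -\langle\cdot-x,\mu^{(N)}_t\rangle^2 & 0 & \langle\cdot-x,\mu^{(N)}_t\rangle^2\end{pmatrix},\qquad B^N_t(x)=\begin{pmatrix} x\\ 1\\ \langle\cdot-x,\mu^{(N)}_t\rangle\end{pmatrix},$$

where we use the bracket notation $\langle\cdot,\nu\rangle$ to denote integration w.r.t. the measure $\nu$. Define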

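$$A^N_T=\int_0^T\langle A^N_t(x),\mu^{(N)}_t\rangle\,dt \tag{25}$$

and

$$B^N_T=N^{-1}\sum_{i=1}^N\int_0^T B^N_t(X^i_t)\,dX^i_t. \tag{26}$$

Thanks to the linearity in $\vartheta$ of the drift $b(\vartheta;t,x,\nu)=\vartheta_1x+\vartheta_2-\vartheta_3\int_{\mathbb{R}}(x-y)\,\nu(dy)$, the likelihood equations are explicit and the maximum likelihood estimator $\widehat\vartheta^{\mathrm{mle}}_N$ solves

$$A^N_T\,\widehat\vartheta^{\mathrm{mle}}_N=B^N_T. \tag{27}$$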
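The linear system (27) can be approximated from discretely sampled paths by replacing the time integral in (25) with a Riemann sum and the stochastic integral in (26) with an Itô-type sum over left endpoints. The sketch below does this under assumptions that are not in the paper: the names `euler_paths` and `mle_mckean`, the Euler-Maruyama simulator, the extra `sigma` argument (set to 0 only as a noiseless sanity check) and all numerical values are hypothetical.

```python
import numpy as np

def euler_paths(theta, x0, dt, n_steps, rng, sigma=1.0):
    """Euler-Maruyama paths of (24); sigma=1 is the model, sigma=0 a noiseless check."""
    th1, th2, th3 = theta
    paths = np.empty((n_steps + 1, x0.size))
    paths[0] = x0
    for k in range(n_steps):
        x = paths[k]
        drift = th1 * x + th2 - th3 * (x - x.mean())
        paths[k + 1] = x + drift * dt + sigma * np.sqrt(dt) * rng.normal(size=x0.size)
    return paths

def mle_mckean(paths, dt):
    """Solve a discretised version of (27) for (theta1, theta2, theta3)."""
    x = paths[:-1]                          # left endpoints X_{t_k}^i
    dx = np.diff(paths, axis=0)             # increments of each particle
    g = x.mean(axis=1, keepdims=True) - x   # <. - x, mu_t^(N)> at x = X_t^i
    # Riemann sum for (25), using the matrix A_t^N of the text.
    a = np.zeros((3, 3))
    a[0, 0] = (x ** 2).mean(axis=1).sum() * dt
    a[0, 1] = a[1, 0] = x.mean(axis=1).sum() * dt
    a[1, 1] = x.shape[0] * dt
    a[2, 2] = (g ** 2).mean(axis=1).sum() * dt
    a[0, 2] = a[2, 0] = -a[2, 2]
    # Ito-type sum for (26), with B_t^N(x) = (x, 1, g)^T.
    b = np.array([(x * dx).sum(), dx.sum(), (g * dx).sum()]) / x.shape[1]
    return np.linalg.solve(a, b)

rng = np.random.default_rng(1)
theta_true = np.array([-1.0, 0.5, 1.5])
x0 = rng.normal(2.0, 1.0, size=2000)        # initial mean away from -theta2/theta1
dt, n_steps = 0.002, 500
est_noiseless = mle_mckean(euler_paths(theta_true, x0, dt, n_steps, rng, sigma=0.0), dt)
est = mle_mckean(euler_paths(theta_true, x0, dt, n_steps, rng), dt)
```

On noiseless Euler paths the discretised equations are satisfied by the true parameter, so `est_noiseless` should reproduce `theta_true` up to rounding; `est` is the estimate from noisy paths. The initial mean is chosen so that the empirical mean moves over time: since $\langle\cdot-x,\mu\rangle$ equals the mean minus $x$, a mean that stays constant in time would make the three components of $B^N_t$ linearly dependent and the system (27) singular.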

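Moreover, the Fisher information matrix is given by

$$I_G(\vartheta)=\int_0^T\langle A_t(\vartheta;x),\mu^\vartheta_t\rangle\,dt,$$

with

$$A_t(\vartheta;x)=\begin{pmatrix} x^2 & x & -\langle\cdot-x,\mu^\vartheta_t\rangle^2\\ x & 1 & 0\\ -\langle\cdot-x,\mu^\vartheta_t\rangle^2 & 0 & \langle\cdot-x,\mu^\vartheta_t\rangle^2\end{pmatrix}.$$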