On the mean speed of convergence of empirical and occupation measures in Wasserstein distance

(1)

www.imstat.org/aihp 2014, Vol. 50, No. 2, 539–563

DOI:10.1214/12-AIHP517

On the mean speed of convergence of empirical and occupation measures in Wasserstein distance

Emmanuel Boissard and Thibaut Le Gouic

Institut de Mathématiques de Toulouse (CNRS UMR 5219). Université Paul Sabatier, 118 route de Narbonne, 31062 Toulouse, France.

E-mail:[email protected];[email protected] Received 22 July 2011; revised 16 February 2012; accepted 15 July 2012

Abstract. In this work, we provide non-asymptotic bounds for the average speed of convergence of the empirical measure in the law of large numbers, in Wasserstein distance. We also consider occupation measures of ergodic Markov chains. One motivation is the approximation of a probability measure by finitely supported measures (the quantization problem). It is found that rates for empirical or occupation measures match or are close to previously known optimal quantization rates in several cases. This is notably highlighted in the example of infinite-dimensional Gaussian measures.

Résumé. Dans ce travail, on exhibe des bornes non asymptotiques pour la vitesse de convergence en moyenne de la mesure empirique dans la loi des grands nombres, en distance de Wasserstein. On considère également la mesure d’occupation d’une chaîne de Markov ergodique. L’une des motivations est l’approximation d’une mesure de probabilité par des mesures à support fini (le problème de la quantification). On détermine que les taux de convergence des mesures empiriques ou des mesures d’occupation correspondent dans plusieurs cas aux taux de quantification optimale déjà établis par ailleurs. Ce fait est notamment établi pour des mesures gaussiennes dans des espaces de dimension infinie.

MSC:60B10; 65C50; 60J05

Keywords:Wasserstein metrics; Optimal transportation; Functional quantization; Transportation inequalities; Markov chains; Measure theory

1. Introduction

This paper is concerned with the rate of convergence in Wasserstein distance for the so-calledempirical law of large numbers: let(E, d, μ)denote a measured Polish space, and let

Ln=1 n

n i=1

δX_i (1)

denote the empirical measure associated with the i.i.d. sample(X_i)₁_≤_i_≤_n of lawμ, then with probability 1,L_n μ asn→ +∞(convergence is understood in the sense of the weak topology of measures). This theorem is also known as Glivenko–Cantelli theorem and is due in this form to Varadarajan [31].

For 1≤p <+∞, thep-Wasserstein distanceis defined on the setPp(E)²of couples of measures with a finitepth moment by

W_p^p(μ, ν)= inf

π∈P(μ,ν)

d^p(x, y)π(dx,dy),

(2)

where the infimum is taken over the setP(μ, ν)of probability measures with first, resp. second, marginalμ, resp.ν.

This defines a metric onPp, and convergence in this metric is equivalent to weak convergence plus convergence of the moment of orderp. These metrics, and more generally the Monge transportation problem from which they originate, have played a prominent role in several areas of probability, statistics and the analysis of P.D.E.s: for a rich account, see C. Villani’s St-Flour course [32].

Our purpose in this paper is to give bounds on the speed of convergence inWpdistance for the Glivenko–Cantelli theorem, i.e. bounds for the convergenceE(Wp(Ln, μ))→0. Such results are desirable notably in view of numerical and statistical applications: indeed, the approximation of a given probability measure by a measure with finite support in Wasserstein distance is a topic that appears in various guises in the literature, see for example [16]. The first motivation for this work was to extend the results obtained by F. Bolley, A. Guillin and C. Villani [5] in the case of variables with support inR^d. As in this paper, we aim to produce bounds that are non-asymptotic and effective (that is with explicit constants), in order to achieve practical relevance.

We also extend the investigation to the convergence of occupation measure for suitably ergodic Markov chains:

again, we have practical applications in mind, as this allows to use Metropolis–Hastings-type algorithms to approxi- mate an unknown measure (see Section1.3for a discussion of this).

There are many works in statistics devoted to convergence rates in some metric associated with the weak convergence of measures, see e.g. the book of A. Van der Vaart and J. Wellner [30]. Of particular interest to us is R. M.

Dudley’s article [12], see Remark1.1.

Other works have been devoted to convergence of empirical measures in Wasserstein distance, we quote some of them. Horowitz and Karandikar [19] gave a bound for the rate of convergence ofE[W₂²(Ln, μ)]to 0 for general measures supported inR^dunder a moment condition. M. Ajtai, J. Komlos and G. Tusnády [1] and M. Talagrand [29]

studied the related problem of the average cost of matching two i.i.d. samples from the uniform law on the unit cube in dimensiond≥2. This line of research was pushed further, among others, by V. Dobri´c and J. E. Yukich [11] or F. Barthe and C. Bordenave [2] (the reader may refer to this last paper for an up-to-date account of the Euclidean matching problem). These papers give a sharp result for measures inR^d, with an improvement both over [19] and [5].

In the caseμ∈P(R), E. del Barrio, E. Giné and C. Matrán [8] obtain a central limit theorem forW₁(L_n, μ)under the condition that_+∞

−∞

√F (t )(1−F (t ))dt <+∞whereF is the cumulative distribution function (c.d.f.) ofμ. In the companion paper [4], we investigate the case of theW1distance by using the dual expression of theW1transportation cost by Kantorovich and Rubinstein, see therein for more references.

We will also discuss problems related to the optimal quantization of probability measures, that is the approximation of probability distributions by distributions with finite support. A general reference on this topic is the book by S. Graf and H. Luschgy [16]. Let us also mention the paper [17] that lays out the connections between optimal quantization and empirical processes.

Before moving on to our results, we make a remark on the scope of this work. Generally speaking, the problem of convergence ofW_p(L_n, μ)to 0 can be divided in two separate questions:

• the first one is to estimate themean rate of convergence, that is the convergence rate ofE[W_p(L_n, μ)],

• while the second one is to study the concentration properties ofW_p(L_n, μ)around its mean, that is to find bounds on the quantities

P

W_p(L_n, μ)−E

W_p(L_n, μ)

≥t .

Our main concern here is the first point. The second one can be dealt with by techniques of measure concentration.

We will elaborate on this in AppendicesAandB.

1.1. Main result and first consequences

Definition 1.1. ForS⊂E,the covering number of orderδforS,denoted byN (S, δ),is defined as the minimaln∈N such that there existx₁, . . . , x_ninSwith

S⊂

n j=1

B(x_i, δ).

(3)

Our main statement is summed up in the following result.

Theorem 1.1. Choose t >0. Let μ∈P(E) with support included in S⊂E with finite diameter ΔS such that N (S, t ) <+∞.We have the bound:

E

W_p(L_n, μ)

≤c

t+n⁻^1/2p _Δ_S/4

t

N (S, δ)^1/2pdδ

withc≤64/3.

Remark. Theorem1.1is related in spirit and proof to the results of R.M.Dudley[12]in the case of the bounded Lipschitz metric

dBL(μ, ν)= inf

f1−Lip,f _∞≤1

fd(μ−ν).

The analogy is not at all fortuitous:indeed,the bounded Lipschitz metric is linked to the1-Wasserstein distance via the well-known Kantorovich–Rubinstein dual definition ofW₁:

W1(μ, ν)= sup

f1−Lip

fd(μ−ν).

The analogy stops at p=1 since there is no representation of Wp as an empirical process for p >1 (there is,however,a general dual expression of the transport cost).In spite of this,the technique of proof in[12]proves useful in our case,and the technique of using a sequence of coarser and coarser partitions is at the heart of many later results,notably in the literature concerned with the problem of matching two independent samples in Euclidean space,see e.g. [29]or the recent paper[2].

We now give a first example of application, under an assumption that the underlying metric space is of finite- dimensional type in some sense. More precisely, we assume that there existkE>0,α >0 such that

N (E, δ)≤k_E(DiamE/δ)^α. (2)

Here, the parameterαplays the role of a dimension.

Corollary 1.2. Assume thatEsatisfies(2),and thatα >2p.With notations as earlier,the following holds:

E

Wp(Ln, μ)

≤c 2p

α−2p 2p/α

DiamEk_E^1/αn⁻^1/α withc≤64/3.

Remark. In the case of measures supported inR^d,this result is neither new nor fully optimal.For a sharp statement in this case,the reader may refer to[2]and references therein.However,we recover at least the exponent ofn⁻^1/d which is sharp ford≥3,see[2]for a discussion.And on the other hand,Corollary1.2extends to more general metric spaces of finite-dimensional type,for example manifolds.

Remark. Corollary1.2and other results throughout this work require a lower bound on the dimension parameterα.

Here for instance we imposeα >2p,which implies thatα >2sincep≥1.This high-dimensional hypothesis is also commonplace in matching problems: for instance in[1] (where matchings over the cube with p=1 are studied), the convergence rates for dimension3and above differ from those in dimension1and2.In low dimensions,further difficulties arise and Theorem1.1will overestimate the convergence rate.For instance on the2-dimensional cube we would get a convergence rate oflogn/√

ninstead of a correct√

logn/nas established in[1].

(4)

It is possible to remove the assumption of boundedness of the metric space and replace it with the following: we assume that there existk_E>0,α >0 such that for all boundedS⊂E,

N (S, δ)≤kE(DiamS/δ)^α. (3)

In this context, the following result holds.

Corollary 1.3. Assume that (E, d) satisfies (3),that thatμ∈Pp(E)has some finite moment of orderq >2p∨ αp/(α−p),meaning thatMq=

d^q(x0, x)dμ <+∞for somex0∈E.Also assume thatα >2p.

Then there existsC >0,depending onE,p,αandM_q,such that E

W_p(L_n, μ)

≤Cn⁻^1/α.

Remark. A more precise statement is the following:with the same notations as above,for allξ >1andζ >1,we have

E

W_p(L_n, μ)

≤C(ξ )n⁻^1/2p+C(ζ )n⁻^1/α,

where there exist constantsc, cdepending onp, α, E,but not onμ(cis universal,cis the constant that appears in Corollary1.2),such that

C(ξ )=c

1+M_p^1/p+M_2ξp^1/2p

M_p^1/p 2⁻^ξ

1−2⁻^ξ + 2¹⁻^ξ 1−2¹⁻^ξ

,

C(ζ )=c

1+M_{ζ αp/(α}^(α⁻^p)/αp₋_p) 4¹⁻^ζ 1−2¹⁻^ζ

.

As opposed to Corollaries1.2and1.3, our next result is set in an infinite-dimensional framework.

1.2. An application to Gaussian r.v.s in Banach spaces

We apply the results above to the case whereE is a separable Banach space with norm · , andμ is a centered Gaussian random variable with values in E, meaning that the image of μ by every continuous linear functional f∈E^∗is a centered Gaussian variable inR. The couple(E, μ)is called a (separable) Gaussian–Banach space.

LetXbe anE-valued r.v. with lawμ, and define the weak variance ofμas σ= sup

f∈E∗,|f|≤1

Ef²(X)1/2

.

The small ball function of a Gaussian–Banach space(E, μ)is the function ψ (t )= −logμ

B(0, t ) .

We can associate to the couple(E, μ)their Cameron–Martin Hilbert spaceH⊂E, see e.g. [22] for a reference.

It is known that the small ball function has deep links with the covering numbers of the unit ball ofH, see e.g. the papers by Kuelbs and Li [21] and Li and Linde [25], as well as with the approximation ofμby measures with finite support in Wasserstein distance (the quantization or optimal quantization problem), see Fehringer’s Ph.D. thesis [13], Dereich, Fehringer, Matoussi and Scheutzow [9], Graf, Luschgy and Pagès [18].

We make the following assumptions on the small ball function:

(1) there existsκ >1 such thatψ (t )≤κψ (2t )for 0< t≤t0, (2) for allε >0,n⁻^ε=o(ψ⁻¹(logn)).

Assumption(2) implies that the Gaussian measure is genuinely infinite dimensional: indeed, in the case when dimK <+∞, the measure is supported in a finite-dimensional Banach space, and in this case the small ball function behaves as logt.

(5)

Theorem 1.4. Let(E, μ)be a Gaussian–Banach space with weak varianceσ and small ball functionψ.Assume that assumptions(1)and(2)hold.

Then there exists a universal constantcsuch that for all integersn≥1such that logn≥(6+κ)

log 2∨ψ (1)∨ψ (2t0)∨1/σ² , the following holds:

E

W₂(L_n, μ)

≤c

ψ⁻¹ 1

6+κlogn

+σ n⁻^1/^[⁴⁽⁶⁺^κ)^]

. (4)

In particular,there is aC=C(μ)such that E

W₂(L_n, μ)

≤Cψ⁻¹(logn). (5)

Moreover,forλ >0,

W₂(L_n, μ)≤(C+λ)ψ⁻¹(logn) with probability1−exp

−n

ψ⁻¹(logn)2 λ² 2σ²

. (6)

Remark. Note that the choice of6+κ is not particularly sharp and may likely be improved.

In order to underline the interest of the result above, we introduce some definitions from optimal quantization. For n≥1 and 1≤r <+∞, define the optimal quantization error at ratenas

δ_n,r(μ)= inf

ν∈Θn

W_r(μ, ν),

where the infimum runs over the setΘnof probability measures with finite support of cardinal bounded byn.

Precise connections have been made in the literature between the rate of optimal quantization (i.e. the speed of convergence ofδ_n,r to 0) and the behaviour of the small ball function near 0. Two rather complete works on this topic are [9] and [18]. Assume thatt→ψ (1/t )isregularly varying at infinity, i.e. that there exists a functionLanda >0 such that

ψ (ε)=ε⁻^aL 1

ε

, L(st )

L(t ) →t→+∞1 for alls >0.

Roughly speaking, Theorem 4.1 in [9] and Theorem 2 in [18] imply that there existc, c>0 such that cψ⁻¹(logn)δn,rcψ⁻¹(logn)

(wherea_nb_nmeans lim infb_n/a_n≥1).

Please note that the regular variation condition is not the sharpest condition stated in either paper; however it is satisfied by usual Gaussian processes. In the terminology of quantization, the quantization rate is given by the sequence ψ⁻¹(logn). We can restate Theorem1.4by saying that the empirical measure is a rate-optimal quantizer with high probability (under some assumptions on the small ball function and when the distortion index is r=2). This is of practical interest, since obtaining the empirical measure is only as difficult as simulating an instance of the Gaussian vector, and one avoids dealing with computation of appropriate weights in the approximating discrete measure.

We now quote some results on the asymptotic behaviour of the quantization rate and the small ball function for classic Gaussian processes, to illustrate the result above. In all of these examples, the assumptions of Theorem1.4on the small ball function are satisfied, so that the convergence rate is also the proper one for empirical measures.

(6)

• E=(L²([0,1]), · 2)andμis the Wiener measure. In this case, we quote [9] to get ψ (t )∼t→0

1 8t². Thus,

√ 1

8 lognδ_n,r 1

√logn. Actually, a sharper result isδ_n,r∼√

2/π√

logn, c.f. [26]. In our case, we get the bound EW₂(L_n, μ)=O

1

√logn

.

• E=(C([0,1]), · ∞)(the space of continuous functions endowed with the sup-norm) andμis the Wiener measure. Quoting again from [9], we have

ψ (t )∼ π² 8t²

from which we deduce again that there existc, c>0 with

√c

lognδ_n,r c

√logn andEW2(Ln, μ)=O(1/√

logn).

• E=(C([0,1]), · ∞), and this timeμis the law of a fractional Brownian motion with Hurst exponentρ∈(0,1).

As stated in [18,25,26], we have ψ (t )∼ C_ρ

t^1/ρ

which entails a boundEW2(L_n, μ)=O((logn)⁻^ρ)(with a matching lower bound onδn,2).

• Finally, considerE=(C([0,1]²), · ∞)and the law of the Brownian sheet onI= [0,1]², i.e. the centered continuous Gaussian process(X_t)_t_∈_Isuch that

EX_sX_t=(s₁∧t₁)(s₂∧t₂)

ifs=(s₁, s₂),t=(t₁, t₂). Quoting again from [26], we get ct⁻²log(1/t )³ψ (t )ct⁻²log(1/t )³.

We obtainEW2(Ln, μ)=O((logn)⁻^1/2(log logn)^3/2)(and a matching lower bound onδn,2).

Many more results may be readily obtained from the literature. References are given e.g. in the two articles we quoted. One may also take advantage directly of some estimates of the Kolmogorov entropy of Gaussian processes (i.e. covering numbers of the Cameron–Martin ball), bypassing small ball estimates. Such direct estimates are provided for example in [27].

We leave aside the question of determining the sharp asymptotics for the average errorE(W₂(L_n, μ)), that is of finding whether there existsc >0 such that

E

W₂(L_n, μ)

∼cψ⁻¹(logn).

Let us underline that the corresponding question for the quantization rate is tackled for example in [26]. In this paper, instead of connecting the quantization rate to small deviation asymptotics, a truncation of the Karhunen–Loève expansion of the Gaussian vector is used. Proving the existence ofcas above and computing its value on classical Gaussian processes would allow a sharp comparison of the relative asymptotic performance of empirical measures and optimal quantizers.

(7)

1.3. The case of Markov chains

We wish to extend the control of the speed of convergence to weakly dependent sequences, such as rapidly-mixing Markov chains. There is a natural incentive to consider this question: there are cases when one does not know how to sample from a given measureπ, but a Markov chain with stationary measureπis nevertheless available for simulation.

This is the basic set-up of the Markov Chain Monte Carlo framework, and a very frequent situation, even in finite dimension.

When looking at the proof of Theorem1.1, it is apparent that the main ingredient missing in the dependent case is the argument following (19), i.e. that wheneverA⊂Xis measurable,nLn(A)follows a binomial law with parameters n andμ(A), and this must be remedied in some way. It is natural to look for some type of quantitative ergodicity property of the chain, expressing almost-independence ofX_iandX_j in the long range (|i−j|large).

We will consider decay-of-variance inequalities of the following form:

VarπPⁿf ≤CλⁿVarπf. (7)

In the reversible case, a bound of the type of (7) is ensured by Poincaré or spectral gap inequalities. We recall one possible definition in the discrete-time Markov chain setting.

Definition 1.2. LetP be a Markov kernel with reversible measureπ∈P(E).We say that a Poincaré inequality with constantC_P >0holds if

Varπf≤CP

f

I−P²

fdπ (8)

for allf ∈L²(π ).

If(8)holds,we have VarπPⁿf ≤λⁿVarπf withλ=(CP−1)/CP.

The choice of assumption (7) is fairly standard. More generally, one may assume that we have a control of the decay of the variance in the following form:

VarπPⁿf ≤Cλⁿ f−

fdπ

L^p

. (9)

As soon as p >2, these inequalities are weaker than (7). Our proof would be easily adaptable to this weaker decay-of-variance setting. We do not provide a complete statement of this claim.

For a discussion of the links between Poincaré inequality and other notions of weak dependence (e.g. mixing coefficients), see the recent paper [7].

For the next two theorems, we make the following dimension assumption onE: there existsk_E>0 andα >0 such that for allS⊂Ewith finite diameter,

N (S, δ)≤k_E(DiamS/δ)^α. (10)

The following theorem is the analogue of Corollary1.2under the assumption that the Markov chain satisfies a decay-of-variance inequality.

Theorem 1.5. Assume thatE has finite diameterΔ >0and (10)holds.Letπ∈P(E),and let(X_i)_i_≥₀ be anE- valued Markov chain with initial lawνsuch thatπis its unique invariant probability.Assume also that(7)holds for someC >0andλ <1.

(8)

Then if2p < α(1+1/r)andL_ndenotes the occupation measure1/n_n

i=1δ_X_i,the following holds:

Eν

W_p(L_n, π )

≤C₁k_E^(1/α(1⁺^1/r))Δ

C dν/dπ L^r(π )

(1−λ)n

1/[α(1+1/r)]

withC1≤64/3(_α(1₊_1/r)^2p ₋_2p)^2p/(α(1⁺^1/r)).

Our next theorem is an extension to the unbounded case under some moment conditions onπ.

Theorem 1.6. Assume that(10)holds.Letπ∈P(E),and let(Xi)i≥0be anE-valued Markov chain with initial law νsuch thatπis its unique invariant probability.Assume also that(7)holds for someC >0andλ <1.Letx₀∈Eand for allθ≥1,denoteM_θ=

d(x₀, x)^θdπ.Fixrand assume2p < α(1+1/r).Assume thatπadmits a finite moment of orderq >2p/(1−1/r)∨αp/(α−p).

Then there existsC₁>0depending onE, α, p, r, qand onM_qsuch that Eν

W_p(L_n, π )

≤C₁

C dν/dπ L^r(π )

(1−λ)n

1/[α(1+1/r)]

.

Remark. As in the case of Corollary1.3,a more precise expression may be found in the proof.

To conclude this section, we include without proof a possible variant of the results above. We no longer assume that (X_n)_n_≥₀is the trajectory of a Markov chain, but instead that(X_n)_n_∈Zis a generalπ-stationary sequence of variables with controlledρ-mixing coefficients. Remember that theρ-mixing coefficient of two sub-σ-algebrasFandGover a common probability space is

ρ(F,G)= sup

X∈L²(F),Y∈L²G

Cov(X, Y )

√Var(X)Var(Y ),

and that the sequence ofρ-mixing coefficients for theπ-stationary sequence(X_n)_n_∈Zis the sequence given byρ0=1 and

ρ_n=ρ

σ (X_s, s≤0), σ (Xt, t≥n)

forn≥1. The proof of the following proposition may be obtained along the same lines as in the Markov case.

Proposition 1.7. Assume that(X_n)is aπ-stationary sequence as above,such thatρ_n→0asn→ +∞.Define

χ_n= 1 n²

n i=1

i j=1

ρ_j.

Then ifΔ=DiamEandt >0is fixed,we have EW_p(L_n, π )≤c

t+χ_n^1/2p _Δ/4

t

N (E, δ)^1/2pdδ

.

If for exampleEhas finite-dimensional type with parameterαas defined earlier,we would get a convergence rate of the form

EW_p(L_n, π )≤cΔχ_n^1/α.

Remark. If(ρ_n)is exponentially decaying we retrieve our more usual case sinceχ_nis of ordern⁻¹.

(9)

2. Proofs in the independent case

Lemma 2.1. LetS⊂E,s >0andu, v∈Nwithu < v.Suppose thatN (S,4⁻^vs) <+∞.Foru≤j≤v,there exist integers

m(j )≤N

S,4⁻^js

(11) and non-empty subsetsS_j,l ofS,u≤j≤v, 1≤l≤m(j ),such that the setsS_j,l 1≤l≤m(j )satisfy

(1) for eachj,(S_j,l)₁_≤_l_≤_{m(j )}is a partition ofS, (2) DiamS_j,l≤4⁻^j⁺¹s,

(3) for eachj > u,for each1≤l≤m(j )there exists1≤l≤m(j−1)such thatS_j,l⊂S_j₋_1,l.

In other words,the setsS_j,l form a sequence of partitions ofS that get coarser asj decreases(tiles at the scale j−1are unions of tiles at the scalej).

Proof. We begin by picking a set of ballsB_j,l=B(x_j,l,4⁻^js)∩Swithu≤j≤vand 1≤l≤N (S,4⁻^js), such that for allj,

S⊂

N (S,4⁻^js) l=1

Bj,l.

DefineS_v,1=B_v,1, and successively set S_v,l=B_v,l

1≤l≤l−1

S_v,l.

Discard the possible empty sets and relabel the existing sets accordingly. We have obtained the finest partition, obviously satisfying conditions(1)–(2).

Assume now that the sets S_j,l have been built for k+1≤j ≤v. Set Sk,1 to be the reunion of all S_k₊_1,l

such thatS_k₊_1,l ∩Bk,1=∅. Likewise, define by induction onl the setS_k,l as the reunion of allS_k₊_1,l such that S_k₊_1,l∩B_k,l=∅andS_k₊_1,lS_k,p for 1≤p < l. Again, discard the possible empty sets and relabel the remaining tiles. It is readily checked that the sets obtained satisfy assumptions (1)and(3). We check assumption(2): letx_k,l denote the center ofB_k,land lety∈S_k₊_1,l⊂S_k,l. We have

d(x_k,l, y)≤4⁻^ks+DiamS_k₊_1,l≤2×4⁻^ks,

thus DiamSk,l≤4⁻^k⁺¹sas desired.

Consider as above a subsetSofEwith finite diameterΔ_S, and assume thatN (S,4⁻^kΔ_S) <+∞. Pick a sequence of partitions(S_j,l)₁_≤_l_≤_{m(j )}for 1≤j≤k, as per Lemma2.1. For each(j, l)choose a pointx_j,l∈S_j,l. Define the set of points of levelj as the setL(j )= {x_j,l}1≤l≤m(j ). Say thatx_j,l is an ancestor ofx_j,l ifS_j,l⊂S_j,l: we will denote this relation by(j, l)(j, l).

The next two lemmas study the cost of transporting a finite measurem_kto another measuren_kwhen these measures have support inL(k). The underlying idea is that we consider the finite metric space

T =

x_j,l,1≤l≤m(j ),1≤j≤k

as ametric tree, where the ancestry relationshipdefined above corresponds to the hierarchical structure of the tree.

This tree is naturally endowed with a metric by considering that the distance from a point to its child (i.e. one-step descendant) is given by their distance on the original space(E, d), and the distance between two points ofT is the sum of distances on the unique tree path that joins them. By the triangle inequality it is immediate to see that this tree metric dominates the metric inherited from(E, d). We consider the problem of transportation between two masses

(10)

at the leaves of the tree. The transportation algorithm we consider consists in allocating as much mass as possible at each point, then moving the remaining mass up one level in the tree, and iterating the procedure.

A technical warning: please note that the transportation cost is usually defined between two probability measures;

however there is no difficulty in extending its definition to the transportation between two finite measures of equal total mass, and we will freely use this fact in the sequel.

Lemma 2.2. Letm_j,n_j be measures with support inL_j with same mass.Define the measuresm˜_j₋₁ andn˜_j₋₁on L_j₋₁by setting

˜

mj−1(x_j₋_1,l)=

(j,l)(j−1,l)

mj(xj,l)−nj(xj,l)

∨0, (12)

˜

nj−1(x_j₋_1,l)=

(j,l)(j−1,l)

nj(xj,l)−mj(xj,l)

∨0. (13)

The measuresm˜_j₋₁andn˜_j₋₁have same mass,so the transportation cost between them may be defined.Moreover, ifΔ_S=DiamS,the following bound holds:

Wp(mj, nj)≤2×4⁻^j⁺²ΔS mj−nj 1/p

TV +Wp(m˜j−1,n˜j−1). (14)

Proof. Setmj∧nj(xj,l)=mj(xj,l)∧nj(xj,l). We introduce the measuremj∧nj+ ˜mj−1defined by [mj∧nj+ ˜mj−1](xj,l)=mj∧nj(xj,l),

[m_j∧n_j+ ˜m_j₋₁](x_j₋_1,l)= ˜m_j₋₁(x_j₋_1,l).

Likewise, we define the measurem_j∧n_j+ ˜n_j₋₁. We stress that they have the same mass asm_j, n_j. By the triangle inequality,

Wp(mj, nj)≤Wp(mj, mj∧nj+ ˜mj−1)+Wp(mj∧nj+ ˜mj−1, mj∧nj+ ˜nj−1) +W_p(m_j∧n_j+ ˜n_j₋₁, n_j).

We bound the term on the left. Introduce the transport planπ_mdefined by π_m(x_j,l, x_j,l)=m_j∧n_j(x_j,l),

πm(xj,l, x_j₋_1,l)=

mj(xj,l)−nj(xj,l)

+ when(j, l)

j−1, l . The reader can check thatπm∈P(mj, mj∧nj+ ˜mj−1). Moreover,

W_p(m_j, m_j∧n_j+ ˜m_j₋₁)≤

d^p(x, y)π_m(dx,dy) 1/p

≤4⁻^j⁺²Δ_S _{m(j )}

l=1

m_j(x_j,l)−n_j(x_j,l)

+

1/p

.

Likewise,

Wp(nj, mj∧nj+ ˜nj−1)≤4⁻^j⁺²ΔS

_{m(j )}

l=1

nj(xj,l)−mj(xj,l)

+

1/p

.

(11)

As for the term in the middle, it is bounded byW_p(m˜_j₋₁,n˜_j₋₁): indeed, this follows by considering a transport plan that leaves the massm_j∧n_j in place and optimally mapsm˜_j₋₁towardsn˜_j₋₁. Putting this together and using the inequalityx+y≤2¹⁻^1/p(x^p+y^p)^1/p, we get

W_p(m_j, n_j)≤2¹⁻^1/p4⁻^j⁺²Δ_S _{m(j )}

l=1

m_j(x_j,l)−n_j(x_j,l)1/p

+W_p(m˜_j₋1,n˜_j₋1).

Lemma 2.3. Letm_j,n_jbe measures with support inL_j.Define for1≤i < jthe measuresm_i,n_i with support inL_i by

mi(x_i,l)=

(j,l)(i,l)

mj(xj,l), ni(x_i,l)=

(j,l)(i,l)

nj(xj,l). (15)

The following bound holds:

W_p(m_j, n_j)≤ j

i=1

2×4⁻ⁱ⁺²Δ_S m_i−n_i ^1/p_TV. (16)

Proof. We proceed by induction onj. For j =1, the result is obtained by using the simple bound W_p(m₁, n₁)≤ Δ_S m₁−n₁ ^1/p_TV.

Suppose that (16) holds for measures with support inL_j₋₁. By Lemma2.2, we have W_p(m_j, n_j)≤2×4⁻^j⁺²Δ_S m_j−n_j ^1/p_TV +W_p(m˜_j₋₁,n˜_j₋₁),

wherem˜_j₋₁andn˜_j₋₁are defined by (12) and (13) respectively. For 1≤i < j−1, define following (15)

˜

mi(x_i,l)=

(j−1,l)(i,l)

˜

m_j₋₁(x_j₋_1,l), n˜i(x_i,l)=

(j−1,l)(i,l)

˜

n_j₋₁(x_j₋_1,l).

We have by induction hypothesis

W_p(m_j, n_j)≤2×4⁻^j⁺²Δ_S m_j−n_j ^1/p_TV +

j−1

i=1

2×4⁻ⁱ⁺²Δ_S ˜m_i− ˜n_i ^1/p_TV.

To conclude, it suffices to check that for 1≤i≤j−1, ˜mi− ˜ni TV= mi−ni TV. Proof of Theorem1.1. We pick some positive integerk whose value will be determined further on. Introduce the sequence of partitions(S_j,l)₁_≤_l_≤_{m(j )}for 0≤j≤kas in the lemmas above, as well as the pointsx_j,l. Defineμ_k as the measure with support inL(k)such thatμk(xk,l)=μ(Sk,l)for 1≤l≤m(k). The diameter of the setsSk,lis bounded by 4⁻^k⁺¹Δ_S, thereforeW_p(μ, μ_k)≤4⁻^k⁺¹Δ_S.

LetL^k_ndenote the empirical measure associated toμk.

For 0≤j≤k−1, define as in Lemma2.3the measuresμjandL^j_nwith support inL(j )by μ_j(x_j,l)=

(k,l)(j,l)

μ_k(x_k,l), (17)

L^j_n(x_j,l)=

(k,l)(j,l)

L^k_n(x_k,l). (18)

(12)

It is simple to check thatμ_j(x_j,l)=μ(S_j,l), and thatL^j_n is the empirical measure associated withμ_j. Applying (16), we get

W_p μ_k, L^k_n

≤ k j=1

2×4⁻^j⁺²Δ_Sμ_j−L^j_n^1/p

TV. (19)

Observe thatnL^j_n(xj,l) is a binomial law with parameters n andμ(Sj,l). The expectation of μj −L^j_n TV is bounded as follows:

Eμ_j−L^jn

TV

=1/2

m(j )

l=1

EL^j_n−μj

(xj,l)

≤1/2

m(j )

l=1

EL^j_n−μj

(xj,l)²

=1/2

m(j )

l=1

μ(S_j,l)(1−μ(S_j,l)) n

≤1/2 m(j )

n .

In the last inequality, we use Cauchy–Schwarz’s inequality and the fact that (Sj,l)1≤l≤m(j ) is a partition of S.

Plugging this back in (19), we get E

Wp

μk, L^k_n

≤n⁻^1/2p k j=1

2¹⁻^1/p4⁽⁻^j⁺²⁾ΔSm(i)^1/2p

≤2⁵⁻^1/pn⁻^1/2p k j=1

4⁻^jΔSN

S,4⁻^jΔS

1/2p

≤2⁶⁻^1/p/3n⁻^1/2p _Δ_S_/4

4^−(k+1)Δ_S

N (S, δ)^1/2pdδ.

In the last line, we use a standard sum-integral comparison argument.

By the triangle inequality, we have W_p(μ, L_n)≤W_p(μ, μ_k)+W_p

μ_k, L^k_n +W_p

L^k_n, L_n .

We claim thatE(W_p(L^k_n, L_n))≤W_p(μ, μ_k). Indeed, chooseni.i.d. couples(X_i, X_i^k)such thatX_i∼μ,X^k_i ∼μ_k, and the joint law of(X_i, X^k_i)achieves an optimal coupling, i.e.E|X_i−X_i^k|^p=W_p^p(μ, μ^k). Observe that existence of this optimal coupling is guaranteed e.g. by Theorem 4.1 in [32]. We have the identities in law

L_n∼1 n

n i=1

δ_X_i, L^k_n∼1 n

n i=1

δ_Xk i.

Choose the transport plan that sendsXi toX^k_i: this gives the upper bound

W_p^p Ln, L^k_n

≤1/n n i=1

Xi−X_i^k^p