Squared quadratic Wasserstein distance: optimal couplings and Lions differentiability


HAL Id: hal-01934705

https://hal.archives-ouvertes.fr/hal-01934705v2

Submitted on 16 Nov 2020



Aurélien Alfonsi, Benjamin Jourdain

To cite this version:

Aurélien Alfonsi, Benjamin Jourdain. Squared quadratic Wasserstein distance: optimal couplings and Lions differentiability. ESAIM: Probability and Statistics, EDP Sciences, 2020, 24, pp.703-717.

⟨10.1051/ps/2020013⟩. ⟨hal-01934705v2⟩


https://doi.org/10.1051/ps/2020013 www.esaim-ps.org

SQUARED QUADRATIC WASSERSTEIN DISTANCE: OPTIMAL COUPLINGS AND LIONS DIFFERENTIABILITY^∗

Aurélien Alfonsi^{1,2,∗∗} and Benjamin Jourdain^{1,2}

Abstract. In this paper, we remark that any optimal coupling for the quadratic Wasserstein distance W₂²(µ, ν) between two probability measures µ and ν with finite second order moments on R^d is the composition of a martingale coupling with an optimal transport map T. We check the existence of an optimal coupling in which this map gives the unique optimal coupling between µ and T#µ. Next, we give a direct proof that σ ↦ W₂²(σ, ν) is differentiable at µ in the Lions (Cours au Collège de France. 2008) sense iff there is a unique optimal coupling between µ and ν and this coupling is given by a map. It was known combining results by Ambrosio, Gigli and Savaré (Lectures in Mathematics ETH Zürich. Birkhäuser Verlag, Basel, 2005) and Ambrosio and Gangbo (Comm. Pure Appl. Math., 61:18–53, 2008) that, under the latter condition, geometric differentiability holds. Moreover, the two notions of differentiability are equivalent according to the recent paper of Gangbo and Tudorascu (J. Math. Pures Appl. 125:119–174, 2019). Besides, we give a self-contained probabilistic proof that mere Fréchet differentiability of a law invariant function F on L²(Ω,P;R^d) is enough for the Fréchet differential at X to be a measurable function of X.

Mathematics Subject Classification. 90C08, 60G42, 60E15, 58B10, 49J50.

Received December 23, 2019. Accepted March 4, 2020.

1. Introduction

In this paper, we are interested in the structure of optimal couplings for the squared quadratic Wasserstein distance W₂²(µ, ν) between µ and ν in the set P₂(R^d) of probability measures with finite second order moments on R^d, and in the differentiability of W₂²(µ, ν) with respect to µ. By definition, W₂²(µ, ν) = inf_{π∈Π(µ,ν)} ∫_{R^d×R^d} |y−x|² π(dx,dy), where Π(µ, ν) denotes the set of coupling measures on R^d×R^d with first and second marginals respectively equal to µ and ν and |.| denotes the Euclidean norm on R^d. There always exists an optimal coupling and we denote by Π^opt(µ, ν) the set of optimal couplings. According to [11], there exists only one W₂-optimal coupling π between µ and each ν ∈ P₂(R^d) and this coupling is given by a map T (i.e. π = (I_d, T)#µ where I_d denotes the identity function on R^d) iff µ gives 0 mass to the c−c hypersurfaces of dimension d−1. Even when µ does not satisfy this condition, which is implied by absolute continuity with respect to the Lebesgue measure, according to Proposition 5.13 [8], if ϕ: R^d → R is a C² strictly convex function such that ∫_{R^d} |∇ϕ(x)|² µ(dx) < ∞, then there is a unique W₂-optimal coupling between µ and ν = ∇ϕ#µ and this coupling is given by the map ∇ϕ. But there also exist measures ν ∈ P₂(R^d) such that either the unique optimal coupling (uniqueness holds in dimension d = 1 for instance) is not given by a map or there exist distinct optimal couplings. In the latter case, any strictly convex combination of these couplings is an optimal coupling which is not given by a map.

∗ This research benefited from the support of the “Chaire Risques Financiers”, Fondation du Risque.

Keywords and phrases: Optimal transport, Wasserstein distance, differentiability, couplings of probability measures, convex order.

¹ CERMICS, Ecole des Ponts, Marne-la-Vallée, France.

² MathRisk, Inria, Paris, France.

∗∗ Corresponding author: [email protected]

© The authors. Published by EDP Sciences, SMAI 2020

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In Section 2, we study optimal couplings π which are not given by a map. By disintegration, π(dx,dy) = µ(dx)k(x,dy) for some Markov kernel k on R^d (which is µ(dx) a.e. unique). Setting T(x) = ∫_{R^d} y k(x,dy) and using the bias-variance decomposition under the kernel k, we obtain that π is the composition of a martingale coupling between T#µ and ν with the map T which gives a W₂-optimal coupling between µ and T#µ. Note that couplings of this form have recently been studied by Gozlan and Juillet [12] when considering the barycentric optimal cost problem. For φ: R^d → R a strictly convex function such that ∫_{R^d} φ(y) ν(dy) < ∞, by minimizing ∫_{R^d} φ(T(x)) µ(dx) over the W₂-optimal couplings between µ and ν, we obtain optimal couplings such that the associated map T_φ gives the only optimal coupling between µ and T_φ#µ. There is a unique such coupling when φ(x) = |x|².

In Section 3, we are interested in the differentiability of W₂²(µ, ν) in the Lions sense with respect to µ.

Gangbo and Tudorascu have recently proved in Corollary 3.22 [10] that the Lions differentiability [15] of a function f: P₂(R^d) → R is equivalent to the geometric differentiability and that the Fréchet derivative of the lift at X ∼ µ is then given by ∇_µ f(X) where ∇_µ f ∈ L²(R^d, µ; R^d) is the geometric (or Wasserstein) gradient of f at µ. While the lifted space that they consider is the ball centered at the origin of unit volume in R^d endowed with the Lebesgue measure, the result can be transferred to any atomless lifted space by considering an almost isomorphism between those spaces¹. Theorem 10.2.6 [4] states that σ ↦ W₂²(σ, ν) is subdifferentiable in the geometric sense at µ when Π^opt(µ, ν) = {(I_d, T)#µ} for some measurable transport map T: R^d → R^d. On the other hand, Proposition 4.3 [3] states that σ ↦ W₂²(σ, ν) is always superdifferentiable in the geometric sense at µ with x ↦ 2(x − ∫_{R^d} y k(x,dy)) belonging to the superdifferential for each Markov kernel k on R^d such that µ(dx)k(x,dy) ∈ Π^opt(µ, ν). Since geometric differentiability amounts to simultaneous geometric sub- and superdifferentiability, as soon as Π^opt(µ, ν) = {(I_d, T)#µ}, σ ↦ W₂²(σ, ν) is differentiable in the geometric sense at µ. On the other hand, geometric differentiability implies that the geometric sub- and superdifferential considered as subsets of L²(R^d, µ; R^d) coincide and contain one element only (see for instance [8], Prop. 5.63).

The fact that the quotient of {x ↦ ∫_{R^d} y k(x,dy) : µ(dx)k(x,dy) ∈ Π^opt(µ, ν)} for the µ(dx) a.e. equality is a singleton is therefore necessary for the geometric differentiability of σ ↦ W₂²(σ, ν) to hold at µ.

We prove that σ ↦ W₂²(σ, ν) is differentiable at µ in the Lions sense iff Π^opt(µ, ν) = {(I_d, T)#µ}. We give a direct probabilistic proof of the sufficient condition, which also follows from the results just mentioned. To prove the necessary condition, we use that the Fréchet differentiability at X ∼ µ of the lift on an atomless probability space is enough for the Fréchet derivative at X to be a.s. equal to a measurable function of X, a consequence of [10] that we show again using simple probabilistic arguments. Let us emphasize that the quotient of {x ↦ ∫_{R^d} y k(x,dy) : µ(dx)k(x,dy) ∈ Π^opt(µ, ν)} for the µ(dx) a.e. equality may be a singleton while Π^opt(µ, ν) is not equal to {(I_d, T)#µ} for some measurable map T: R^d → R^d (see, in dimension d = 1, Rem. 2.4 below).

2. Structure of quadratic Wasserstein optimal couplings

In this section, we are interested in characterizing the set

Π^opt(µ, ν) = {π(dx,dy) ∈ P₂(R^d×R^d) : µ(dx) = ∫_{y∈R^d} π(dx,dy), ν(dy) = ∫_{x∈R^d} π(dx,dy) and W₂²(µ, ν) = ∫_{R^d×R^d} |y−x|² π(dx,dy)}

of optimal couplings between two probability measures µ, ν ∈ P₂(R^d) for the quadratic cost. This set is not empty: see e.g. [4], page 133.

¹ We thank one of the referees for pointing out this argument to us.

The refined version of the Brenier theorem in [11] ensures that Π^opt(µ, ν) contains a single element (I_d, T)#µ which is given by a measurable transport map T: R^d → R^d for each ν ∈ P₂(R^d) iff µ does not give mass to the c−c hypersurfaces parametrized by an index i ∈ {0, . . . , d−1} and two convex functions f and g from R^{d−1} to R:

{(x₁, . . . , x_i, f(x) − g(x), x_{i+1}, . . . , x_{d−1}) : x = (x₁, . . . , x_{d−1}) ∈ R^{d−1}}.

The next lemma deals with the case where Π^opt(µ, ν) is not of the form {(I_d, T)#µ} for some measurable transport map T.

Lemma 2.1. Let µ, ν ∈ P₂(R^d). One of the two conditions holds:

– Π^opt(µ, ν) = {(I_d, T)#µ} for some measurable transport map T: R^d → R^d,
– ∃ µ(dx)k(x,dy) ∈ Π^opt(µ, ν) such that ∫_{R^d×R^d} |y − ∫_{R^d} z k(x,dz)|² k(x,dy) µ(dx) > 0.

Moreover, if every coupling in Π^opt(µ, ν) is given by a map, i.e. writes (I_d, T)#µ for some measurable function T: R^d → R^d, then Π^opt(µ, ν) is a singleton.

Proof. If the set Π^opt(µ, ν) has a single element µ(dx)k(x,dy), defining T(x) = ∫_{R^d} y k(x,dy), we either have ∫_{R^d×R^d} |y − T(x)|² k(x,dy) µ(dx) > 0 or µ(dx)k(x,dy) = µ(dx)δ_{T(x)}(dy). Otherwise, we can pick two distinct elements µ(dx)k₁(x,dy), µ(dx)k₂(x,dy) ∈ Π^opt(µ, ν) and k(x,dy) = ½(k₁(x,dy) + k₂(x,dy)) is such that µ(dx)k(x,dy) ∈ Π^opt(µ, ν) and ∫_{R^d×R^d} |y − ∫_{R^d} z k(x,dz)|² k(x,dy) µ(dx) > 0. The second statement easily follows.

Remarking that if ν is the Dirac mass at x ∈ R^d and ν_ε the uniform distribution on the ball centered at x with radius ε, then W₂(ν, ν_ε) ≤ ε, we deduce from the next proposition that for any µ, ν ∈ P₂(R^d), we can always find µ_ε, ν_ε ∈ P₂(R^d) such that W₂(µ, µ_ε) ≤ ε, W₂(ν, ν_ε) ≤ ε and ∃ µ_ε(dx)k_ε(x,dy) ∈ Π^opt(µ_ε, ν_ε) such that ∫_{R^d×R^d} |y − ∫_{R^d} z k_ε(x,dz)|² k_ε(x,dy) µ_ε(dx) > 0.

Proposition 2.2. Assume that ν ∈ P₂(R^d) is not a Dirac mass. Then for all µ ∈ P₂(R^d), there exists a sequence (µ_n)_n of elements of P₂(R^d) such that lim_{n→∞} W₂(µ_n, µ) = 0 and for each n, there does not exist T_n: R^d → R^d measurable such that Π^opt(µ_n, ν) = {(I_d, T_n)#µ_n}.

Proof. Let (X_i)_{i≥1} be an i.i.d. sequence of random variables with law µ, and (Y_i)_{i≥1} an independent i.i.d. sequence of uniform random variables on the unit ball {x ∈ R^d, |x| ≤ 1}. We set µ̃_n = (1/n) Σ_{i=1}^n δ_{X_i} the empirical measure and µ_n = (1/n) Σ_{i=1}^n δ_{X_i+Y_i/n}. By construction, we have W₂²(µ_n, µ̃_n) ≤ (1/n) Σ_{i=1}^n |Y_i/n|² ≤ 1/n² and P(∃ i ≠ j, X_i + Y_i/n = X_j + Y_j/n) = 0, which means that a.s., for each n ∈ N^∗, µ_n weights exactly n points. The law of large numbers gives the almost sure weak convergence of µ̃_n towards µ and the almost sure convergence of (1/n) Σ_{i=1}^n |X_i|² to E[|X₁|²]. Proposition 7.1.5 in [4] ensures that W₂(µ̃_n, µ) → 0 almost surely as n → +∞. By the triangle inequality, we get W₂(µ_n, µ) → 0 almost surely as n → +∞.

Now, we consider (p_n)_{n≥1} the increasing sequence of prime numbers. Suppose that ∃ n₀ ∈ N^∗ such that T#µ_{p_{n₀}} = ν. Then, ν weights at most p_{n₀} points and the masses are equal to k/p_{n₀} with 1 ≤ k ≤ p_{n₀}−1 since ν is not a Dirac mass. Then, if we had T#µ_{p_n} = ν for some n > n₀, we would have k/p_{n₀} = k′/p_n with 1 ≤ k′ ≤ p_n−1. This would imply that p_{n₀} divides k p_n and thus k, which is impossible since 1 ≤ k ≤ p_{n₀}−1.

Thus, there is at most one n₀ ∈ N^∗ such that there is a transport map T_{n₀} satisfying T_{n₀}#µ_{p_{n₀}} = ν. Discarding this index if it exists, the sequence (µ_{p_n})_n has the required properties.
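The divisibility argument in the proof above can be illustrated with exact rational arithmetic. The following Python sketch is only an illustration under stated assumptions: the target ν with masses 1/3 and 2/3 and the helper name `map_can_exist` are hypothetical choices, not taken from the paper. It checks the necessary condition used in the proof, namely that a measure with n equally weighted distinct atoms can be pushed forward onto ν by a map only if every mass of ν is an integer multiple of 1/n.

```python
from fractions import Fraction

def map_can_exist(n_atoms, target_masses):
    """Necessary condition for T#mu_n = nu when mu_n has n_atoms distinct
    atoms of mass 1/n_atoms: each mass of nu is a multiple of 1/n_atoms."""
    return all((m * n_atoms).denominator == 1 for m in target_masses)

# Hypothetical target: nu puts mass 1/3 and 2/3 on two points (not a Dirac mass).
nu_masses = [Fraction(1, 3), Fraction(2, 3)]

primes = [2, 3, 5, 7, 11, 13]
admissible = [p for p in primes if map_can_exist(p, nu_masses)]
print(admissible)  # only the prime 3 passes the test
```

In line with the proof, at most one prime number of atoms is compatible with a given non-Dirac target.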

Let us now give a necessary and sufficient condition for the existence of an optimal transport map in dimension d = 1. We denote by F_η(x) = η((−∞, x]) and F_η⁻¹(u) = inf{x ∈ R : η((−∞, x]) ≥ u} the cumulative distribution function and the quantile function of a probability measure η on R. For µ, ν ∈ P₂(R), by Theorem 2.9 in [16], the only element of Π^opt(µ, ν) is the image of the Lebesgue measure on [0,1] by (F_µ⁻¹, F_ν⁻¹). The next lemma characterizes the case when this coupling is given by a map.


Lemma 2.3. Let µ, ν ∈ P₂(R). There exists T ∈ L²(R, µ; R) such that Π^opt(µ, ν) = {(I₁, T)#µ} iff for all x ∈ R such that µ({x}) > 0, F_ν⁻¹ is constant on (F_µ(x−), F_µ(x)]. Then, the unique optimal transport map is T(x) = F_ν⁻¹(F_µ(x)).

Remark 2.4. When F_ν⁻¹ is not constant on (F_µ(x−), F_µ(x)] for some x ∈ R such that µ({x}) > 0, then Π^opt(µ, ν) is not equal to {(I₁, T)#µ} for some measurable map T: R → R while, since Π^opt(µ, ν) is a singleton, the quotient of {x ↦ ∫_{R^d} y k(x,dy) : µ(dx)k(x,dy) ∈ Π^opt(µ, ν)} for the µ(dx) a.e. equality is a singleton.

Proof. Let X ∼ µ and U be an independent random variable uniform on [0,1]. The random variable V = F_µ(X−) + U(F_µ(X) − F_µ(X−)) is such that P({F_µ(X−) < V ≤ F_µ(X)} ∪ {F_µ(X−) = V = F_µ(X)}) = 1. This is a uniform random variable on [0,1]: for u ∈ (0,1), u ∈ [F_µ(x−), F_µ(x)] for some x ∈ R and

P(V ≤ u) = P(X < x) + P(X = x, U ≤ (u − F_µ(x−))/(F_µ(x) − F_µ(x−))) = u

since X is independent of U. Since F_µ⁻¹(V) = X for V ∈ (F_µ(X−), F_µ(X)] and F_µ⁻¹(V) ≤ X for V = F_µ(X−) = F_µ(X), we have F_µ⁻¹(V) ≤ X a.s. Since F_µ⁻¹(V) and X have the same law, we necessarily have F_µ⁻¹(V) = X a.s. By the inverse transform sampling, F_ν⁻¹(V) is distributed according to ν. Let us assume that F_ν⁻¹ is constant on (F_µ(x−), F_µ(x)] for all x ∈ R such that µ({x}) > 0. Then F_ν⁻¹(V) = F_ν⁻¹(F_µ(X)) a.s., (F_ν⁻¹ ∘ F_µ)#µ = ν and

∫_0^1 (F_µ⁻¹(v) − F_ν⁻¹(v))² dv = E[(X − F_ν⁻¹(F_µ(X)))²] = ∫_R (x − F_ν⁻¹(F_µ(x)))² µ(dx).

Hence T(x) = F_ν⁻¹(F_µ(x)) is an optimal transport map. Conversely, if T is an optimal transport map such that T#µ = ν, we have T(F_µ⁻¹(v)) = F_ν⁻¹(v), dv-a.e. For x ∈ R such that µ({x}) > 0, F_µ⁻¹ is constant on (F_µ(x−), F_µ(x)], and therefore F_ν⁻¹ is necessarily constant on (F_µ(x−), F_µ(x)].
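For finitely supported measures on R, the criterion of Lemma 2.3 can be tested directly. The Python sketch below is a minimal illustration, not part of the paper: the function names `quantile` and `monotone_map` and the two test measures are assumptions made for the example. For each atom x of µ it compares the right limit of F_ν⁻¹ at F_µ(x−) with the value F_ν⁻¹(F_µ(x)); since F_ν⁻¹ is nondecreasing, the two agree exactly when F_ν⁻¹ is constant on (F_µ(x−), F_µ(x)].

```python
from fractions import Fraction
from itertools import accumulate

def quantile(atoms, masses, u, strict=False):
    """Left-continuous quantile F^{-1}(u) of a discrete law with sorted atoms.
    With strict=True, return the right limit F^{-1}(u+) instead."""
    for y, c in zip(atoms, accumulate(masses)):
        if (c > u) if strict else (c >= u):
            return y
    return atoms[-1]

def monotone_map(mu_atoms, mu_masses, nu_atoms, nu_masses):
    """Return T = F_nu^{-1} o F_mu on the atoms of mu when F_nu^{-1} is constant
    on each interval (F_mu(x-), F_mu(x)], and None otherwise."""
    T, left = {}, Fraction(0)
    for x, m in zip(mu_atoms, mu_masses):
        right = left + m
        lower = quantile(nu_atoms, nu_masses, left, strict=True)
        upper = quantile(nu_atoms, nu_masses, right)
        if lower != upper:
            return None  # the atom x has to be split: no transport map
        T[x] = upper
        left = right
    return T

half = Fraction(1, 2)
# Two atoms onto two atoms with matching masses: a map exists.
T_ok = monotone_map([0, 1], [half, half], [2, 5], [half, half])
# A Dirac mass onto two atoms: the optimal coupling is not given by a map.
T_none = monotone_map([0], [Fraction(1)], [-1, 1], [half, half])
print(T_ok, T_none)
```

The second case is the situation of Remark 2.4: the optimal coupling is unique but splits the atom of µ.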

Remark 2.5. Lemma 2.3 still holds true for µ, ν probability measures on R with finite moments of order ρ ≥ 1, and a transport cost c(x, y) = h(|y−x|), with h: R_+ → R strictly convex such that ∃ C < ∞, ∀x ∈ R, h(|x|) ≤ C(1 + |x|^ρ). The same proof applies since, by Theorem 2.9 in [16], the only optimal coupling for such a cost is the image of the Lebesgue measure on [0,1] by (F_µ⁻¹, F_ν⁻¹).

The next proposition, which is one of the main results of this section, shows that any W₂-optimal coupling can be written as the composition of a transport map and a martingale kernel, i.e. a Markov kernel k such that for all x ∈ R^d, ∫_{R^d} |y| k(x,dy) < ∞ and ∫_{R^d} y k(x,dy) = x. Let us now give the definition of the convex order on probability measures before recalling its link with the existence of martingale couplings.

Definition 2.6. Let η, ν be two probability measures on R^d. We say that η is smaller than ν in the convex order and write η ≤_cx ν if for each convex function φ: R^d → R such that the integrals make sense,

∫_{R^d} φ(x) η(dx) ≤ ∫_{R^d} φ(y) ν(dy).

Notice that since a convex function φ on R^d is bounded from below by an affine function, for a probability measure η on R^d with finite first order moment (and in particular for η ∈ P₂(R^d)), ∫_{R^d} φ(x) η(dx) always makes sense, possibly equal to +∞.

Theorem 8 in Strassen [17] ensures that, when ∫_{R^d} |y| ν(dy) < ∞, η ≤_cx ν iff there exists a martingale Markov kernel k such that η(dx)k(x,dy) ∈ Π(η, ν).
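In dimension d = 1 and for finitely supported measures, the convex order of Definition 2.6 can be tested numerically. The Python sketch below relies on a classical criterion that is not stated in the text and is used here as an assumption: η ≤_cx ν iff both measures have the same mean and ∫(x−t)⁺ η(dx) ≤ ∫(y−t)⁺ ν(dy) for every real t. For discrete measures both sides are piecewise linear in t, so it suffices to test t at the atoms. The function names are illustrative.

```python
from fractions import Fraction

def call_price(atoms, masses, t):
    """Integral of (x - t)^+ against the discrete measure sum_i masses[i] * delta_{atoms[i]}."""
    return sum(m * max(x - t, 0) for x, m in zip(atoms, masses))

def mean(atoms, masses):
    return sum(m * x for x, m in zip(atoms, masses))

def leq_cx(eta, nu):
    """Test eta <=_cx nu for discrete measures on R given as (atoms, masses)."""
    (eta_atoms, eta_masses), (nu_atoms, nu_masses) = eta, nu
    if mean(eta_atoms, eta_masses) != mean(nu_atoms, nu_masses):
        return False
    kinks = set(eta_atoms) | set(nu_atoms)
    return all(call_price(eta_atoms, eta_masses, t) <= call_price(nu_atoms, nu_masses, t)
               for t in kinks)

half = Fraction(1, 2)
dirac0 = ([0], [Fraction(1)])
spread = ([-1, 1], [half, half])

print(leq_cx(dirac0, spread))  # True: a martingale kernel spreads 0 to -1 and 1
print(leq_cx(spread, dirac0))  # False
```

By Strassen's theorem, the first test being positive corresponds to the existence of a martingale kernel from the Dirac mass at 0 to the two-point measure.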


Proposition 2.7. Let µ, ν ∈ P₂(R^d), µ(dx)k(x,dy) ∈ Π^opt(µ, ν), T(x) = ∫_{R^d} y k(x,dy) and η = T#µ. Then η ≤_cx ν,

W₂²(µ, ν) = W₂²(µ, η) + ∫_{R^d} |y|² ν(dy) − ∫_{R^d} |z|² η(dz)    (2.1)

and (I_d, T)#µ ∈ Π^opt(µ, η).

On the other hand, if η ≤_cx ν is such that (2.1) holds, then combining µ(dx)q(x,dz) ∈ Π^opt(µ, η) with any martingale coupling η(dz)m(z,dy) between η and ν, we obtain a W₂-optimal coupling µ(dx)qm(x,dy) (where, as usual, qm(x,dy) = ∫_{z∈R^d} q(x,dz)m(z,dy)) between µ and ν.

The first part of this proposition is also a consequence of Theorem 12.4.4 in [4]: the barycentric projection of µ(dx)k(x,dy) is precisely (I_d, T)#µ. Here, we present this result in a probabilistic fashion. For µ(dx)k(x,dy) as in the first statement and (X, Y) ∼ µ(dx)k(x,dy), by definition of T, E[Y|X] = T(X) a.s. so that E[Y|T(X)] = T(X) a.s. and this optimal coupling is the composition of the martingale coupling given by the law of (T(X), Y) and the transport map T. Notice that since it relies on the bias-variance decomposition, this structure of optimal couplings does not seem to generalize to other Wasserstein distances W_ρ(µ, ν) = inf_{π∈Π(µ,ν)} (∫_{R^d×R^d} |y−x|^ρ π(dx,dy))^{1/ρ}, ρ ∈ [1,∞) \ {2}. Nevertheless, Gozlan and Juillet [12] have recently obtained optimal couplings that are the composition of a martingale coupling and a deterministic transport map by considering the barycentric optimal cost problem, which consists in minimizing for a given cost function θ: R^d → R_+ the quantity ∫_{R^d} θ(x − ∫_{R^d} y k(x,dy)) µ(dx) among all couplings µ(dx)k(x,dy) between µ and ν.
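The bias-variance decomposition behind this structure holds for any coupling with finite support and can be checked exactly. The Python sketch below is an illustration only: the coupling used (two points of the horizontal axis sent to two points of the vertical axis, with a hypothetical weight p = 1/4) and the function names are assumptions chosen for the example. It computes the barycentric map T(x) = ∫ y k(x,dy) and verifies that the quadratic cost splits into a conditional variance term ∫|y − T(x)|² and a bias term ∫|T(x) − x|².

```python
from fractions import Fraction

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def barycentric_map(coupling):
    """coupling: dict {(x, y): mass}. Returns the first marginal mu and the map
    T(x) = conditional mean of y given x."""
    mu, first_moment = {}, {}
    for (x, y), w in coupling.items():
        mu[x] = mu.get(x, 0) + w
        acc = first_moment.get(x, (0,) * len(y))
        first_moment[x] = tuple(a + w * yi for a, yi in zip(acc, y))
    T = {x: tuple(c / mu[x] for c in first_moment[x]) for x in mu}
    return mu, T

def decomposition(coupling):
    """Return (total cost, conditional variance term, bias term)."""
    mu, T = barycentric_map(coupling)
    total = sum(w * sq_dist(x, y) for (x, y), w in coupling.items())
    variance = sum(w * sq_dist(T[x], y) for (x, y), w in coupling.items())
    bias = sum(mu[x] * sq_dist(x, T[x]) for x in mu)
    return total, variance, bias

p = Fraction(1, 4)  # hypothetical weight
half = Fraction(1, 2)
coupling = {
    ((-1, 0), (0, -1)): half * p,
    ((-1, 0), (0, 1)): half * (1 - p),
    ((1, 0), (0, -1)): half * (1 - p),
    ((1, 0), (0, 1)): half * p,
}
total, variance, bias = decomposition(coupling)
print(total, variance, bias)  # 2 3/4 5/4
```

Exact fractions are used so that the identity total = variance + bias is checked without rounding.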

Proof. Let us first prove the second statement. Let η ≤_cx ν, q be a Markov kernel such that µ(dx)q(x,dz) ∈ Π^opt(µ, η) and m be any martingale kernel such that ηm = ν. Then µ(dx)qm(x,dy) is a coupling between µ and ν such that

W₂²(µ, ν) ≤ ∫_{R^d×R^d} |y−x|² µ(dx)qm(x,dy) = ∫_{R^d×R^d×R^d} |y−z+z−x|² µ(dx)q(x,dz)m(z,dy)

= ∫_{R^d×R^d} |y−z|² η(dz)m(z,dy) + ∫_{R^d×R^d} |z−x|² µ(dx)q(x,dz)

= ∫_{R^d} |y|² ν(dy) − ∫_{R^d} |z|² η(dz) + W₂²(µ, η)    (2.2)

where we used the variance-bias decomposition under the martingale kernel m for the third equality. Hence, if (2.1) holds, then µ(dx)qm(x,dy) ∈ Π^opt(µ, ν).

Let now µ(dx)k(x,dy) ∈ Π^opt(µ, ν), T(x) = ∫_{R^d} y k(x,dy) and η = T#µ. Jensen’s inequality immediately gives η ≤_cx ν and thus η ∈ P₂(R^d). We have

W₂²(µ, ν) = ∫_{R^d} ∫_{R^d} |y − T(x) + T(x) − x|² µ(dx)k(x,dy)

= ∫_{R^d} ∫_{R^d} |y − T(x)|² µ(dx)k(x,dy) + ∫_{R^d} |T(x) − x|² µ(dx)

= ∫_{R^d} ∫_{R^d} (|y|² − |T(x)|²) µ(dx)k(x,dy) + ∫_{R^d} |T(x) − x|² µ(dx)

= ∫_{R^d} |y|² ν(dy) − ∫_{R^d} |z|² η(dz) + ∫_{R^d} |T(x) − x|² µ(dx),

where we used the variance-bias decomposition with respect to k(x, .) for the second equality. With (2.2), we deduce that ∫_{R^d} |T(x) − x|² µ(dx) ≤ W₂²(µ, η) and T is a W₂-optimal transport map between µ and η.


For µ, ν ∈ P₂(R^d), let us define the sets

I_µ^ν = {η ∈ P₂(R^d) : η ≤_cx ν and W₂²(µ, ν) = W₂²(µ, η) + ∫_{R^d} |y|² ν(dy) − ∫_{R^d} |z|² η(dz)},

Ĩ_µ^ν = {T#µ : ∃ µ(dx)k(x,dy) ∈ Π^opt(µ, ν), T(x) = ∫_{R^d} y k(x,dy)}.

By Proposition 2.7, we have Ĩ_µ^ν ⊂ I_µ^ν and Ĩ_µ^ν ≠ ∅ since Π^opt(µ, ν) ≠ ∅. Moreover, there exists an optimal transport map between µ and any element of Ĩ_µ^ν. The measure T#µ associated with an optimal coupling in Π^opt(µ, ν) is possibly equal to ν, which always belongs to I_µ^ν.

Lemma 2.8. Let µ, ν ∈ P₂(R^d). If η ∈ I_µ^ν, then for any η̃ such that η ≤_cx η̃ ≤_cx ν, η̃ ∈ I_µ^ν and η ∈ I_µ^{η̃}. Moreover, I_µ^ν = {η ∈ P₂(R^d) : ∃ η̃ ∈ Ĩ_µ^ν, η̃ ≤_cx η ≤_cx ν}. Last, the set I_µ^ν is convex.

Proof. Let η ∈ I_µ^ν and η̃ be such that η ≤_cx η̃ ≤_cx ν. We have

W₂²(µ, ν) = W₂²(µ, η) + ∫_{R^d} |y|² ν(dy) − ∫_{R^d} |z̃|² η̃(dz̃) + ∫_{R^d} |z̃|² η̃(dz̃) − ∫_{R^d} |z|² η(dz).    (2.3)

Now, we consider µ(dx)k(x,dz) ∈ Π^opt(µ, η) and η(dz)m(z,dz̃) a martingale coupling between η and η̃. Then, W₂²(µ, η̃) ≤ ∫_{(R^d)³} |z̃−z+z−x|² µ(dx)k(x,dz)m(z,dz̃) = W₂²(µ, η) + ∫_{R^d} |z̃|² η̃(dz̃) − ∫_{R^d} |z|² η(dz). This inequality cannot be strict: otherwise, by combining an optimal coupling between µ and η̃ and a martingale coupling between η̃ and ν, we would contradict (2.3). The equality gives η ∈ I_µ^{η̃} and η̃ ∈ I_µ^ν by using (2.3).

If η̃ ∈ Ĩ_µ^ν, since Ĩ_µ^ν ⊂ I_µ^ν, by the first statement, each probability measure η such that η̃ ≤_cx η ≤_cx ν belongs to I_µ^ν. Hence {η ∈ P₂(R^d) : ∃ η̃ ∈ Ĩ_µ^ν, η̃ ≤_cx η ≤_cx ν} ⊂ I_µ^ν. On the other hand, for η ∈ I_µ^ν, µ(dx)q(x,dz) ∈ Π^opt(µ, η) and a martingale coupling η(dz)m(z,dy) between η and ν, we have µ(dx)qm(x,dy) ∈ Π^opt(µ, ν), by the second assertion in Proposition 2.7. Since, by the martingale property, ∫_{R^d} y qm(x,dy) = ∫_{R^d} ∫_{R^d} y m(z,dy) q(x,dz) = ∫_{R^d} z q(x,dz), setting T(x) = ∫_{R^d} z q(x,dz), we have T#µ ∈ Ĩ_µ^ν, by the first assertion in Proposition 2.7. Since T#µ ≤_cx η, we conclude that I_µ^ν ⊂ {η ∈ P₂(R^d) : ∃ η̃ ∈ Ĩ_µ^ν, η̃ ≤_cx η ≤_cx ν}.

Last, let us consider η₁, η₂ ∈ I_µ^ν and λ ∈ (0,1). Using a convex combination of couplings in Π^opt(µ, η₁) and Π^opt(µ, η₂), we obtain that W₂²(µ, λη₁ + (1−λ)η₂) ≤ λW₂²(µ, η₁) + (1−λ)W₂²(µ, η₂). Since η₁, η₂ ∈ I_µ^ν, we deduce that

W₂²(µ, ν) ≥ W₂²(µ, λη₁ + (1−λ)η₂) + ∫_{R^d} |y|² ν(dy) − ∫_{R^d} |z|² (λη₁ + (1−λ)η₂)(dz).

Since λη₁ + (1−λ)η₂ ≤_cx ν, there exists a martingale coupling between λη₁ + (1−λ)η₂ and ν. Composing it with an element of Π^opt(µ, λη₁ + (1−λ)η₂), we obtain a coupling between µ and ν which ensures that

W₂²(µ, ν) ≤ W₂²(µ, λη₁ + (1−λ)η₂) + ∫_{R^d} |y|² ν(dy) − ∫_{R^d} |z|² (λη₁ + (1−λ)η₂)(dz).

Hence λη₁ + (1−λ)η₂ ∈ I_µ^ν.

In dimension d = 1, since Π^opt(µ, ν) is a singleton, we can specify the sets I_µ^ν and Ĩ_µ^ν.

Proposition 2.9. Let µ, ν ∈ P₂(R) and

T(x) = ∫_0^1 F_ν⁻¹(F_µ(x−) + u[F_µ(x) − F_µ(x−)]) du.    (2.4)

We have Ĩ_µ^ν = {T#µ} and I_µ^ν = {η ∈ P₂(R) : T#µ ≤_cx η ≤_cx ν}. Moreover, Π^opt(µ, T#µ) = {(I₁, T)#µ} and there is a unique martingale coupling between T#µ and ν and it is W₂-optimal.
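For finitely supported µ and ν on R, the map (2.4) can be evaluated exactly at the atoms of µ: T(x) is the average of the quantile function F_ν⁻¹ over the interval (F_µ(x−), F_µ(x)]. The Python sketch below is an illustration under assumptions: the function names and the two example measures are hypothetical and only the atoms of µ are handled.

```python
from fractions import Fraction

def averaged_quantile(nu_atoms, nu_masses, a, b):
    """Average of F_nu^{-1} over (a, b], i.e. (1/(b-a)) * integral_a^b F_nu^{-1}(v) dv,
    computed exactly for a discrete nu with sorted atoms."""
    total, left = Fraction(0), Fraction(0)
    for y, m in zip(nu_atoms, nu_masses):
        right = left + m
        overlap = min(b, right) - max(a, left)  # length of (a, b] where F_nu^{-1} = y
        if overlap > 0:
            total += overlap * y
        left = right
    return total / (b - a)

def barycentric_quantile_map(mu_atoms, mu_masses, nu_atoms, nu_masses):
    """Values of the map (2.4) at the (sorted) atoms of a discrete mu."""
    T, left = {}, Fraction(0)
    for x, m in zip(mu_atoms, mu_masses):
        T[x] = averaged_quantile(nu_atoms, nu_masses, left, left + m)
        left += m
    return T

half, third = Fraction(1, 2), Fraction(1, 3)
# Hypothetical example: mu uniform on {0, 1}, nu uniform on {0, 3, 6}.
T = barycentric_quantile_map([0, 1], [half, half], [0, 3, 6], [third, third, third])
print(T)
```

In this example the middle atom of ν is shared between the two atoms of µ, so the optimal coupling is not a map, while T#µ is the uniform measure on {1, 5}, which has the same mean as ν.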

Proof. By the second assertion in Lemma 2.8, the characterization of I_µ^ν easily follows from the one of Ĩ_µ^ν, which, with the definition of Ĩ_µ^ν, the first statement in Proposition 2.7 and the uniqueness of the optimal coupling in dimension d = 1, also implies that Π^opt(µ, T#µ) = {(I₁, T)#µ}. Let U, U′ be two independent uniform random variables on [0,1]. We define

V = F_µ(F_µ⁻¹(U)−) + U′[F_µ(F_µ⁻¹(U)) − F_µ(F_µ⁻¹(U)−)],    (2.5)

and have by construction

F_µ⁻¹(V) = F_µ⁻¹(U) a.s.    (2.6)

For u ∈ (0,1), u ∈ [F_µ(x−), F_µ(x)] for some x ∈ R and

P(V ≤ u) = P(F_µ⁻¹(U) < x) + P(F_µ⁻¹(U) = x, U′ ≤ (u − F_µ(x−))/(F_µ(x) − F_µ(x−))) = u

since U′ is independent of U. Hence V is uniformly distributed on [0,1]. According to Theorem 2.9 [16], the law of (F_µ⁻¹(V), F_ν⁻¹(V)) is the unique element of Π^opt(µ, ν). From (2.5), we get E[F_ν⁻¹(V)|U] = T(F_µ⁻¹(U)) and by (2.6),

E[F_ν⁻¹(V)|F_µ⁻¹(V)] = E[E[F_ν⁻¹(V)|U]|F_µ⁻¹(V)] = E[T(F_µ⁻¹(V))|F_µ⁻¹(V)] = T(F_µ⁻¹(V)).

Hence the single element of Ĩ_µ^ν is the law T#µ of T(F_µ⁻¹(V)). Since T is nondecreasing, T(F_µ⁻¹(V)) = F_{T#µ}⁻¹(V) a.s. and E[F_ν⁻¹(V)|F_{T#µ}⁻¹(V)] = F_{T#µ}⁻¹(V) a.s. Hence the law of (F_{T#µ}⁻¹(V), F_ν⁻¹(V)), which is the single element of Π^opt(T#µ, ν), is a martingale coupling. Since all the martingale couplings share the quadratic cost ∫_R y² ν(dy) − ∫_R (T(x))² µ(dx), each martingale coupling belongs to Π^opt(T#µ, ν) and is therefore equal to the previous one.

In dimension d = 1, there is a single element η ∈ Ĩ_µ^ν, a unique element in Π^opt(µ, η) and the unique martingale coupling between η and ν is W₂-optimal. We now provide an example in dimension d = 2 where these properties fail.

Example 2.10. Let µ = ½(δ_{(−1,0)} + δ_{(1,0)}) and ν = ½(δ_{(0,−1)} + δ_{(0,1)}). Since |(0,−1) − (−1,0)| = |(0,1) − (−1,0)| = |(0,−1) − (1,0)| = |(0,1) − (1,0)|, any coupling between µ and ν is W₂-optimal. The couplings write µ(dx)k_p(x,dy) with k_p((−1,0),dy) = (pδ_{(0,−1)} + (1−p)δ_{(0,1)})(dy) and k_p((1,0),dy) = ((1−p)δ_{(0,−1)} + pδ_{(0,1)})(dy) for p ∈ [0,1]. One has T_p((−1,0)) = (0, 1−2p), T_p((1,0)) = (0, 2p−1), and η_p = ½(δ_{(0,1−2p)} + δ_{(0,2p−1)}). Any coupling between µ and η_p is W₂-optimal and as soon as p ≠ 1/2, there is an optimal coupling different from (I₂, T_p)#µ. Moreover, unless p ∈ {0, 1/2, 1}, the martingale coupling between η_p and ν is not W₂-optimal.
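The last claim of Example 2.10 can be checked by hand or with a few lines of Python. The sketch below is a minimal illustration with hypothetical function names. It uses two facts: every martingale coupling between η_p and ν has cost ∫|y|²ν(dy) − ∫|z|²η_p(dz) = 1 − (1−2p)², and, since η_p and ν are both uniform on two points, the optimal cost between them is attained at one of the two pairings of atoms (the cost is linear along the segment of couplings).

```python
from fractions import Fraction

def martingale_cost(p):
    """Quadratic cost shared by all martingale couplings between eta_p and nu."""
    return 1 - (1 - 2 * p) ** 2

def w2_squared_eta_nu(p):
    """W_2^2(eta_p, nu), minimum over the two pairings of the atoms."""
    a, b = 1 - 2 * p, 2 * p - 1  # second coordinates of the atoms of eta_p
    pairing_1 = ((a - 1) ** 2 + (b + 1) ** 2) / 2  # (0,a) -> (0,1), (0,b) -> (0,-1)
    pairing_2 = ((a + 1) ** 2 + (b - 1) ** 2) / 2  # (0,a) -> (0,-1), (0,b) -> (0,1)
    return min(pairing_1, pairing_2)

grid = [Fraction(k, 8) for k in range(9)]
optimal_p = [p for p in grid if martingale_cost(p) == w2_squared_eta_nu(p)]
print(optimal_p)  # the values p = 0, 1/2 and 1
```

On this grid the martingale cost coincides with W₂²(η_p, ν) only for p ∈ {0, 1/2, 1} and is strictly larger otherwise.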

According to the next theorem, we can find elements η in Ĩ_µ^ν such that Π^opt(µ, η) = {(I_d, T)#µ} for some measurable transport map T by minimizing over I_µ^ν the integral of a strictly convex function.

Theorem 2.11. Let µ, ν ∈ P₂(R^d), φ: R^d → R be strictly convex such that ∫_{R^d} φ(y) ν(dy) < ∞ and I_{µ,φ}^ν := {η ∈ I_µ^ν : ∫_{R^d} φ(z) η(dz) = inf_{η∈I_µ^ν} ∫_{R^d} φ(z) η(dz)}. We have ∅ ≠ I_{µ,φ}^ν ⊂ Ĩ_µ^ν and for each η ∈ I_{µ,φ}^ν, Π^opt(µ, η) = {(I_d, T)#µ} for some measurable transport map T: R^d → R^d. Moreover, there is a single η_φ ∈ I_{µ,φ}^ν such that ∫_{R^d} |z|² η_φ(dz) = inf_{η∈I_{µ,φ}^ν} ∫_{R^d} |z|² η(dz). Last, there is a single element η̲ in I_{µ,|x|²}^ν.


This theorem makes it possible to select extreme elements of I_µ^ν and provides the following characterization of the existence of a minimal element for the convex order in this set.

Corollary 2.12. For µ, ν ∈ P₂(R^d), there exists η₀ ∈ P₂(R^d) such that I_µ^ν = {η : η₀ ≤_cx η ≤_cx ν} if and only if {η_φ : φ: R^d → R strictly convex and such that ∫_{R^d} φ(y) ν(dy) < ∞} = {η̲}, and then η₀ = η̲.

Let us show the corollary before proving the theorem.

Proof of Corollary 2.12. The necessary condition is obvious. Let us show that it is sufficient. It is enough to check that for any φ: R^d → R convex such that ∃ C < ∞, ∀x ∈ R^d, |φ(x)| ≤ C(1 + |x|), we have ∀η ∈ I_µ^ν, ∫_{R^d} φ(x) η̲(dx) ≤ ∫_{R^d} φ(x) η(dx) (see e.g. [1], Lem. A.1). For such a function φ and for ε > 0, φ_ε(x) := φ(x) + ε|x|² is strictly convex and, since η_{φ_ε} = η̲, we have

∀η ∈ I_µ^ν, ∫_{R^d} φ_ε(x) η̲(dx) ≤ ∫_{R^d} φ_ε(x) η(dx).

We conclude by letting ε → 0 using the dominated convergence theorem.

To prove Theorem 2.11, we will need the following lemma.

Lemma 2.13. Let ν be a probability measure on R^d such that ∫_{R^d} |y| ν(dy) < ∞ and φ: R^d → R a convex function such that ∫_{R^d} φ(y) ν(dy) < ∞. Then the family of probability measures {φ#η : η ≤_cx ν} is uniformly integrable.

Proof of Lemma 2.13. Let us first suppose that φ is nonnegative. Let M ∈ (0,+∞), η ≤_cx ν and m be a martingale kernel such that ∫_{x∈R^d} η(dx)m(x,dy) = ν(dy). Using Jensen’s inequality for the first inequality and the Markov inequality combined with η ≤_cx ν for the third one, we obtain that

∫_{R^d} φ(x) 1_{φ(x)≥M} η(dx) ≤ ∫_{R^d} ∫_{R^d} φ(y) m(x,dy) 1_{φ(x)≥M} η(dx)

≤ ∫_{R^d×R^d} (φ(y) 1_{φ(y)≥√M} + √M 1_{φ(x)≥M}) m(x,dy) η(dx)

= ∫_{R^d} φ(y) 1_{φ(y)≥√M} ν(dy) + √M ∫_{R^d} 1_{φ(x)≥M} η(dx)

≤ ∫_{R^d} φ(y) 1_{φ(y)≥√M} ν(dy) + (1/√M) ∫_{R^d} φ(y) ν(dy).

Hence lim_{M→∞} sup_{η≤_cx ν} ∫_{R^d} φ(x) 1_{φ(x)≥M} η(dx) = 0. In particular, the family {|x|#η : η ≤_cx ν} is uniformly integrable. When the sign of φ is not constant, we obtain a nonnegative convex function φ̃ such that ∫_{R^d} φ̃(y) ν(dy) < ∞ by addition to φ of a suitable affine function ψ. The conclusion follows from the uniform integrability of both the families {ψ#η : η ≤_cx ν} and {φ̃#η : η ≤_cx ν}.

Proof of Theorem 2.11. Let (η_n)_{n∈N} be a sequence in I_µ^ν minimizing ∫_{R^d} φ(z) η(dz). For n ∈ N, let µ(dx)q_n(x,dz) ∈ Π^opt(µ, η_n) and η_n(dz)m_n(z,dy) be a martingale coupling between η_n and ν. By the second part in Proposition 2.7, µ(dx)q_n m_n(x,dy) ∈ Π^opt(µ, ν). Up to extracting a subsequence, we may suppose that (µ(dx)q_n(x,dz)m_n(z,dy))_n converges weakly to µ(dx)r_∞(x,dz,dy) where µ(dx) ∫_{z∈R^d} r_∞(x,dz,dy) ∈ Π^opt(µ, ν). Let T_∞(x) = ∫_{R^d×R^d} y r_∞(x,dz,dy) and η_∞ = T_∞#µ. By the first part of Proposition 2.7, η_∞ ∈ Ĩ_µ^ν. Moreover, by the above weak convergence and the uniform integrability deduced from Lemma 2.13,

∫_{R^d×R^d×R^d} φ(z) µ(dx)r_∞(x,dz,dy) = lim_{n→∞} ∫_{R^d} φ(z) η_n(dz).


Taking the limit n → ∞ in the equality ∫_{R^d×R^d×R^d} ϕ(x, z)(y−z) µ(dx)q_n(x,dz)m_n(z,dy) = 0, we obtain that ∫_{R^d×R^d×R^d} ϕ(x, z)(y−z) µ(dx)r_∞(x,dz,dy) = 0 for any continuous and bounded function ϕ: R^d×R^d → R. Hence, for (X, Z, Y) distributed according to µ(dx)r_∞(x,dz,dy), Z = E[Y|(X, Z)] and T_∞(X) = E[Y|X] = E[E[Y|(X, Z)]|X] = E[Z|X]. By using Jensen’s inequality for the conditional expectation, we get

∫_{R^d} φ(z) η_∞(dz) ≤ ∫_{R^d×R^d×R^d} φ(z) µ(dx)r_∞(x,dz,dy) = lim_{n→∞} ∫_{R^d} φ(z) η_n(dz).

Thus, η_∞ satisfies ∫_{R^d} φ(z) η_∞(dz) = inf_{η∈I_µ^ν} ∫_{R^d} φ(z) η(dz). Hence I_{µ,φ}^ν ≠ ∅.

Let η ∈ I_{µ,φ}^ν. We now check that η ∈ Ĩ_µ^ν and Π^opt(µ, η) is a singleton. Let µ(dx)q(x,dz) ∈ Π^opt(µ, η) and η(dz)m(z,dy) be a martingale coupling between η and ν. By the second assertion in Proposition 2.7, µ(dx)qm(x,dy) ∈ Π^opt(µ, ν) and, by the first assertion, for T(x) = ∫_{R^d} y qm(x,dy), T#µ ∈ Ĩ_µ^ν. By the martingale property of m, T(x) = ∫_{R^d} z q(x,dz) so that T#µ ≤_cx η. Since T#µ ∈ I_µ^ν and η ∈ I_{µ,φ}^ν implies that ∫_{R^d} φ(z) T#µ(dz) ≥ ∫_{R^d} φ(z) η(dz), we deduce with the strict convexity of φ that η = T#µ and µ(dx)q(x,dz) = µ(dx)δ_{T(x)}(dz). Hence any coupling in Π^opt(µ, η) is given by a map. By the second statement in Lemma 2.1, we conclude that this set is a singleton.

By repeating the first argument with (φ, I_µ^ν) replaced by (|x|², I_{µ,φ}^ν), we obtain the existence of η_φ ∈ I_µ^ν such that ∫_{R^d} |z|² η_φ(dz) ≤ inf_{η∈I_{µ,φ}^ν} ∫_{R^d} |z|² η(dz). Since the construction also reduces the integral of φ, η_φ ∈ I_{µ,φ}^ν. Let us now check that if η̃ ∈ I_{µ,φ}^ν is such that ∫_{R^d} |z|² η̃(dz) = inf_{η∈I_{µ,φ}^ν} ∫_{R^d} |z|² η(dz), then η̃ = η_φ. By the first statement, Π^opt(µ, η_φ) = {(I_d, T_φ)#µ} and Π^opt(µ, η̃) = {(I_d, T̃)#µ} for measurable transport maps T_φ and T̃: R^d → R^d. One has ∫_{R^d} |z|² η_φ(dz) = ∫_{R^d} |z|² η̃(dz) and therefore, since η_φ, η̃ ∈ I_µ^ν, W₂²(µ, η_φ) = W₂²(µ, η̃). Let now η̄ = (η_φ + η̃)/2. One has ∫_{R^d} |z|² η̄(dz) = ∫_{R^d} |z|² η_φ(dz) = ∫_{R^d} |z|² η̃(dz). The coupling µ(dx)½(δ_{T_φ(x)}(dz) + δ_{T̃(x)}(dz)) between µ and η̄ implies that W₂²(µ, η̄) ≤ W₂²(µ, η_φ) = W₂²(µ, η̃). Since η_φ ∈ I_µ^ν, we deduce that

W₂²(µ, ν) ≥ W₂²(µ, η̄) + ∫_{R^d} |y|² ν(dy) − ∫_{R^d} |z|² η̄(dz).

Moreover, η̄ ≤_cx ν and combining a coupling in Π^opt(µ, η̄) with a martingale coupling between η̄ and ν, we deduce that the previous inequality is an equality so that η̄ ∈ I_µ^ν and µ(dx)½(δ_{T_φ(x)}(dz) + δ_{T̃(x)}(dz)) ∈ Π^opt(µ, η̄). As η_φ, η̃ ∈ I_{µ,φ}^ν, ∫_{R^d} φ(z) η̄(dz) = inf_{η∈I_µ^ν} ∫_{R^d} φ(z) η(dz) and η̄ ∈ I_{µ,φ}^ν. By the first assertion, Π^opt(µ, η̄) = {(I_d, T̄)#µ} for some measurable transport map T̄: R^d → R^d. Therefore µ(dx) a.e., T_φ(x) = T̃(x) and η_φ = η̃. For the choice φ(x) = |x|², we deduce that I_{µ,|x|²}^ν is a singleton.

From the equality W₂²(µ, ν) = W₂²(µ, η) + ∫_{R^d} |y|² ν(dy) − ∫_{R^d} |z|² η(dz) valid for η ∈ I_µ^ν, we see that minimizing ∫_{R^d} |z|² η(dz) over I_µ^ν is equivalent to minimizing W₂²(µ, η). Therefore the probability measure η̲ (the single element of I_{µ,|x|²}^ν) can be seen as the W₂-projection of µ on the set I_µ^ν. It is in general different from the W₂-projection µ_{P(ν)} of µ on the set P(ν) := {η : η ≤_cx ν}, which has been studied recently in dimension d = 1 by Gozlan et al. [13] and in general dimension d by Alfonsi et al. [1] (who also give an explicit formula for the antiderivative of the quantile function of this projection when d = 1), Alibert et al. [2], Gozlan and Juillet [12] and Backhoff-Veraguas et al. [5]. Notice that since I_µ^ν ⊂ P(ν), one always has W₂(µ, µ_{P(ν)}) ≤ W₂(µ, η̲).

Example 2.14. For µ and ν the respective uniform distributions on [0,1] and [0,2], we have I_µ^ν = {ν} and thus η̲ = ν. By using the characterization in Theorem 2.6 [1], we obtain that the W₂-projection µ_{P(ν)} of µ on the set P(ν) is the uniform distribution on [1/2, 3/2].


The next example shows that the set {η_φ : φ: R^d → R strictly convex and such that ∫_{R^d} φ(y) ν(dy) < ∞} may contain distinct elements.

Example 2.15. Let µ = ½(δ_{(−1,0)} + δ_{(1,0)}) and ν = ¼(δ_{(−1,−1)} + δ_{(0,−1)} + δ_{(0,1)} + δ_{(1,1)}). Any optimal coupling between µ and ν can be written as µ(dx)k_p(x,dy) with k_p((−1,0),dy) = ½(δ_{(−1,−1)} + pδ_{(0,−1)} + (1−p)δ_{(0,1)})(dy) and k_p((1,0),dy) = ½(δ_{(1,1)} + (1−p)δ_{(0,−1)} + pδ_{(0,1)})(dy) for p ∈ [0,1]. One has T_p((−1,0)) = (−1/2, −p) and T_p((1,0)) = (1/2, p). The measures η_p = ½(δ_{(−1/2,−p)} + δ_{(1/2,p)}) are not comparable for the convex order since for p ≠ p′ there is no martingale coupling between η_p and η_{p′}. Moreover, for each p ∈ [0,1] the unique optimal transport plan ½(δ_{((−1,0),(−1/2,−p))} + δ_{((1,0),(1/2,p))}) between µ and η_p is given by a map. For this example, η̲ = η₀ = ½(δ_{(−1/2,0)} + δ_{(1/2,0)}) and η_p = η_{φ_p}, with φ_p(x) = x₁² + (x₂ − 2p x₁)². The W₂-optimal couplings between η̲ and ν can be written as η₀(dz)k_p(2z,dy) for p ∈ [0,1] and in particular the unique martingale coupling η₀(dz)k₀(2z,dy) is optimal.
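The identification η_p = η_{φ_p} in Example 2.15 can be made concrete: integrating φ_p(x) = x₁² + (x₂ − 2p x₁)² against η_q gives 1/4 + (q − p)², which is smallest at q = p. The Python sketch below is only an illustration with a hypothetical function name. It scans the family (η_q)_q on a grid, relying on Theorem 2.11 for the fact that the minimizers of the integral of φ_p over I_µ^ν belong to this family.

```python
from fractions import Fraction

def integral_phi(p, q):
    """Integral of phi_p(x) = x1^2 + (x2 - 2 p x1)^2 against
    eta_q = (delta_{(-1/2,-q)} + delta_{(1/2,q)}) / 2."""
    atoms = [(Fraction(-1, 2), -q), (Fraction(1, 2), q)]
    return sum(Fraction(1, 2) * (x1 ** 2 + (x2 - 2 * p * x1) ** 2) for x1, x2 in atoms)

grid = [Fraction(k, 10) for k in range(11)]
# For each p, the index q minimizing the integral of phi_p over the grid.
minimizers = {p: min(grid, key=lambda q: integral_phi(p, q)) for p in grid}
print(all(minimizers[p] == p for p in grid))  # True
```

Since the second moment of η_q equals 1/4 + q², the same computation with p = 0 is consistent with η₀ being the element of smallest second moment.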

The last example shows that, unlike in the previous one, the martingale couplings between η̲ and ν are not necessarily W₂-optimal (even when Π^opt(µ, ν) is a singleton).

Example 2.16. Let µ = ½(δ_{(−1,0)} + δ_{(1,0)}), ν_a = ¼(δ_{(−1,−1)} + δ_{(−1,2a+1)} + δ_{(1,−2a−1)} + δ_{(1,1)}) with a ∈ R. The unique W₂-optimal coupling between µ and ν_a is µ(dx)k_a(x,dy) with k_a((−1,0),dy) = ½(δ_{(−1,−1)} + δ_{(−1,2a+1)})(dy) and k_a((1,0),dy) = ½(δ_{(1,−2a−1)} + δ_{(1,1)})(dy) so that η_a = ½(δ_{(−1,a)} + δ_{(1,−a)}). Since |(−1,−1) − (−1,a)|² − |(1,1) − (−1,a)|² = (a+1)² − 4 − (a−1)² = 4(a−1), for a > 1,

W₂²(η_a, ν_a) = ½((a+1)² + 4 + (a−1)²) < (a+1)² = ½(3 + (2a+1)²) − (1 + a²) = ∫ |y|² ν_a(dy) − ∫ |z|² η_a(dz),

so that the martingale coupling between η_a and ν_a is not W₂-optimal.
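The strict inequality of Example 2.16 can be confirmed by brute force for integer values of a. The Python sketch below is an illustration with hypothetical function names. It uses the fact that, between two uniform measures on the same number of points (atoms listed with multiplicity), the optimal quadratic cost is attained at a permutation, because the extreme points of the set of couplings are permutation matrices. Each atom of η_a is therefore listed twice with mass 1/4.

```python
from fractions import Fraction
from itertools import permutations

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def w2_squared_uniform(xs, ys):
    """W_2^2 between the uniform measures on the lists xs and ys (same length),
    obtained by enumerating all permutations."""
    n = len(xs)
    best = min(sum(sq_dist(x, ys[j]) for x, j in zip(xs, perm))
               for perm in permutations(range(n)))
    return Fraction(best, n)

def example_2_16(a):
    """Return (W_2^2(eta_a, nu_a), cost of the martingale coupling) for an integer a."""
    eta_atoms = [(-1, a), (-1, a), (1, -a), (1, -a)]  # atoms of eta_a, each split in two
    nu_atoms = [(-1, -1), (-1, 2 * a + 1), (1, -2 * a - 1), (1, 1)]
    w2 = w2_squared_uniform(eta_atoms, nu_atoms)
    # Martingale cost = second moment of nu_a minus second moment of eta_a.
    martingale = Fraction(sum(sq_dist(y, (0, 0)) for y in nu_atoms), 4) - (1 + a * a)
    return w2, martingale

print(example_2_16(2))  # W_2^2 = 7 is strictly below the martingale cost 9
```

For a = 2 this reproduces ½((a+1)² + 4 + (a−1)²) = 7 and (a+1)² = 9.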

3. Differentiability of the squared quadratic Wasserstein distance

We now present the notion of differentiability introduced by Lions [15]. Let f: P₂(R^d) → R. We consider an atomless probability space (Ω, A, P) and denote by L²(Ω,P;R^d) the set of R^d-valued square integrable random variables on this space. The lift of the function f on L²(Ω,P;R^d) is the function F: L²(Ω,P;R^d) → R such that

∀X ∈ L²(Ω,P;R^d), F(X) = f(L(X)),

where L(X) ∈ P₂(R^d) is the probability distribution of X. The atomless property is equivalent to the existence of a random variable U: Ω → R uniformly distributed on [0,1] (see e.g. [9], Prop. A.27). By the fundamental theorem of simulation (see e.g. Bouleau and Lépingle [6], Thm. A.3.1, p. 38), it ensures the existence on (Ω, A, P) of a random variable distributed according to each probability measure on each Polish space, and in particular of X: Ω → R^d distributed according to µ, for each µ ∈ P₂(R^d).

Definition 3.1. A function f: P₂(R^d) → R is L-differentiable at µ ∈ P₂(R^d) if there exists X ∈ L²(Ω,P;R^d) such that X ∼ µ and F is Fréchet differentiable at X.

Let f: P₂(R^d) → R and F(X) = f(L(X)) for X ∈ L²(Ω,P;R^d). The Fréchet differentiability of F at X amounts to the existence of a bounded linear operator DF_X: L²(Ω,P;R^d) → R such that F(X+Y) = F(X) +