HAL Id: hal-01934705
https://hal.archives-ouvertes.fr/hal-01934705v2
Submitted on 16 Nov 2020
Squared quadratic Wasserstein distance: optimal couplings and Lions differentiability
Aurélien Alfonsi, Benjamin Jourdain
To cite this version:
Aurélien Alfonsi, Benjamin Jourdain. Squared quadratic Wasserstein distance: optimal couplings and Lions differentiability. ESAIM: Probability and Statistics, EDP Sciences, 2020, 24, pp.703-717. ⟨10.1051/ps/2020013⟩. ⟨hal-01934705v2⟩
https://doi.org/10.1051/ps/2020013 www.esaim-ps.org
SQUARED QUADRATIC WASSERSTEIN DISTANCE: OPTIMAL COUPLINGS AND LIONS DIFFERENTIABILITY∗
Aurélien Alfonsi$^{1,2,\ast\ast}$ and Benjamin Jourdain$^{1,2}$

Abstract. In this paper, we remark that any optimal coupling for the quadratic Wasserstein distance $W_2^2(\mu,\nu)$ between two probability measures $\mu$ and $\nu$ with finite second order moments on $\mathbb{R}^d$ is the composition of a martingale coupling with an optimal transport map $T$. We check the existence of an optimal coupling in which this map gives the unique optimal coupling between $\mu$ and $T\#\mu$. Next, we give a direct proof that $\sigma\mapsto W_2^2(\sigma,\nu)$ is differentiable at $\mu$ in the Lions (Cours au Collège de France. 2008) sense iff there is a unique optimal coupling between $\mu$ and $\nu$ and this coupling is given by a map. It was known combining results by Ambrosio, Gigli and Savaré (Lectures in Mathematics ETH Zürich. Birkhäuser Verlag, Basel, 2005) and Ambrosio and Gangbo (Comm. Pure Appl. Math., 61:18–53, 2008) that, under the latter condition, geometric differentiability holds. Moreover, the two notions of differentiability are equivalent according to the recent paper of Gangbo and Tudorascu (J. Math. Pures Appl. 125:119–174, 2019). Besides, we give a self-contained probabilistic proof that mere Fréchet differentiability of a law invariant function $F$ on $L^2(\Omega,\mathbb{P};\mathbb{R}^d)$ is enough for the Fréchet differential at $X$ to be a measurable function of $X$.

Mathematics Subject Classification. 90C08, 60G42, 60E15, 58B10, 49J50.
Received December 23, 2019. Accepted March 4, 2020.
∗ This research benefited from the support of the “Chaire Risques Financiers”, Fondation du Risque.
Keywords and phrases: Optimal transport, Wasserstein distance, differentiability, couplings of probability measures, convex order.
1 CERMICS, Ecole des Ponts, Marne-la-Vallée, France.
2 MathRisk, Inria, Paris, France.
∗∗ Corresponding author: [email protected]
© The authors. Published by EDP Sciences, SMAI 2020
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction
In this paper, we are interested in the structure of optimal couplings for the squared quadratic Wasserstein distance $W_2^2(\mu,\nu)$ between $\mu$ and $\nu$ in the set $\mathcal{P}_2(\mathbb{R}^d)$ of probability measures with finite second order moments on $\mathbb{R}^d$, and in the differentiability of $W_2^2(\mu,\nu)$ with respect to $\mu$. By definition, $W_2^2(\mu,\nu)=\inf_{\pi\in\Pi(\mu,\nu)}\int|y-x|^2\,\pi(dx,dy)$ where $\Pi(\mu,\nu)$ denotes the set of coupling measures on $\mathbb{R}^d\times\mathbb{R}^d$ with first and second marginals respectively equal to $\mu$ and $\nu$ and $|\cdot|$ denotes the Euclidean norm on $\mathbb{R}^d$. There always exists an optimal coupling and we denote by $\Pi^{\mathrm{opt}}(\mu,\nu)$ the set of optimal couplings. According to [11], there exists only one $W_2$-optimal coupling $\pi$ between $\mu$ and each $\nu\in\mathcal{P}_2(\mathbb{R}^d)$ and this coupling is given by a map $T$ (i.e. $\pi=(I_d,T)\#\mu$ where $I_d$ denotes the identity function on $\mathbb{R}^d$) iff $\mu$ gives 0 mass to the c − c hypersurfaces of dimension $d-1$. Even when $\mu$ does not satisfy this condition, which is implied by absolute continuity with respect to the Lebesgue measure, according to Proposition 5.13 [8], if $\varphi:\mathbb{R}^d\to\mathbb{R}$ is a $C^2$ strictly convex function such that $\int_{\mathbb{R}^d}|\nabla\varphi(x)|^2\mu(dx)<\infty$, then there is a unique $W_2$-optimal coupling between $\mu$ and $\nu=\nabla\varphi\#\mu$ and this coupling is given by the map $\nabla\varphi$. But there also exist measures $\nu\in\mathcal{P}_2(\mathbb{R}^d)$ such that either the unique optimal coupling (uniqueness holds in dimension $d=1$ for instance) is not given by a map or there exist distinct optimal couplings. In the latter case, any strictly convex combination of these couplings is an optimal coupling which is not given by a map.
In Section 2, we study optimal couplings $\pi$ which are not given by a map. By disintegration, $\pi(dx,dy)=\mu(dx)k(x,dy)$ for some Markov kernel $k$ on $\mathbb{R}^d$ (which is $\mu(dx)$ a.e. unique). Setting $T(x)=\int_{\mathbb{R}^d}y\,k(x,dy)$ and using the bias-variance decomposition under the kernel $k$, we obtain that $\pi$ is the composition of a martingale coupling between $T\#\mu$ and $\nu$ with the map $T$ which gives a $W_2$-optimal coupling between $\mu$ and $T\#\mu$. Note that couplings of this form have recently been studied by Gozlan and Juillet [12] when considering the barycentric optimal cost problem. For $\phi:\mathbb{R}^d\to\mathbb{R}$ a strictly convex function such that $\int_{\mathbb{R}^d}\phi(y)\nu(dy)<\infty$, by minimizing $\int_{\mathbb{R}^d}\phi(T(x))\mu(dx)$ over the $W_2$-optimal couplings between $\mu$ and $\nu$, we obtain optimal couplings such that the associated map $T_\phi$ gives the only optimal coupling between $\mu$ and $T_\phi\#\mu$. There is a unique such coupling when $\phi(x)=|x|^2$.
In Section 3, we are interested in the differentiability of $W_2^2(\mu,\nu)$ in the Lions sense with respect to $\mu$. Gangbo and Tudorascu have recently proved in Corollary 3.22 [10] that the Lions differentiability [15] of a function $f:\mathcal{P}_2(\mathbb{R}^d)\to\mathbb{R}$ is equivalent to the geometric differentiability and that the Fréchet derivative of the lift at $X\sim\mu$ is then given by $\nabla_\mu f(X)$ where $\nabla_\mu f\in L^2(\mathbb{R}^d,\mu;\mathbb{R}^d)$ is the geometric (or Wasserstein) gradient of $f$ at $\mu$. While the lifted space that they consider is the ball centered at the origin of unit volume in $\mathbb{R}^d$ endowed with the Lebesgue measure, the result can be transferred to any atomless lifted space by considering an almost isomorphism between those spaces$^1$. Theorem 10.2.6 [4] states that $\sigma\mapsto W_2^2(\sigma,\nu)$ is subdifferentiable in the geometric sense at $\mu$ when $\Pi^{\mathrm{opt}}(\mu,\nu)=\{(I_d,T)\#\mu\}$ for some measurable transport map $T:\mathbb{R}^d\to\mathbb{R}^d$. On the other hand, Proposition 4.3 [3] states that $\sigma\mapsto W_2^2(\sigma,\nu)$ is always superdifferentiable in the geometric sense at $\mu$ with $x\mapsto 2\left(x-\int_{\mathbb{R}^d}y\,k(x,dy)\right)$ belonging to the superdifferential for each Markov kernel $k$ on $\mathbb{R}^d$ such that $\mu(dx)k(x,dy)\in\Pi^{\mathrm{opt}}(\mu,\nu)$. Since geometric differentiability amounts to simultaneous geometric sub- and superdifferentiability, as soon as $\Pi^{\mathrm{opt}}(\mu,\nu)=\{(I_d,T)\#\mu\}$, then $\sigma\mapsto W_2^2(\sigma,\nu)$ is differentiable in the geometric sense at $\mu$. On the other hand, geometric differentiability implies that the geometric sub- and superdifferential considered as subsets of $L^2(\mathbb{R}^d,\mu;\mathbb{R}^d)$ coincide and contain one element only (see for instance [8], Prop. 5.63). The fact that the quotient of $\{x\mapsto\int_{\mathbb{R}^d}y\,k(x,dy):\mu(dx)k(x,dy)\in\Pi^{\mathrm{opt}}(\mu,\nu)\}$ for the $\mu(dx)$ a.e. equality is a singleton is therefore necessary for the geometric differentiability of $\sigma\mapsto W_2^2(\sigma,\nu)$ to hold at $\mu$.
We prove that $\sigma\mapsto W_2^2(\sigma,\nu)$ is differentiable at $\mu$ in the Lions sense iff $\Pi^{\mathrm{opt}}(\mu,\nu)=\{(I_d,T)\#\mu\}$. We give a direct probabilistic proof of the sufficient condition, which also follows from the results just mentioned. To prove the necessary condition, we use that the Fréchet differentiability at $X\sim\mu$ of the lift on an atomless probability space is enough for the Fréchet derivative at $X$ to be a.s. equal to a measurable function of $X$, a consequence of [10] that we show again using simple probabilistic arguments. Let us emphasize that the quotient of $\{x\mapsto\int_{\mathbb{R}^d}y\,k(x,dy):\mu(dx)k(x,dy)\in\Pi^{\mathrm{opt}}(\mu,\nu)\}$ for the $\mu(dx)$ a.e. equality may be a singleton while $\Pi^{\mathrm{opt}}(\mu,\nu)$ is not equal to $\{(I_d,T)\#\mu\}$ for some measurable map $T:\mathbb{R}^d\to\mathbb{R}^d$ (see, in dimension $d=1$, Rem. 2.4 below).
2. Structure of quadratic Wasserstein optimal couplings
In this section, we are interested in characterizing the set
$$\Pi^{\mathrm{opt}}(\mu,\nu)=\left\{\pi(dx,dy)\in\mathcal{P}_2(\mathbb{R}^d\times\mathbb{R}^d):\ \mu(dx)=\int_{y\in\mathbb{R}^d}\pi(dx,dy),\ \nu(dy)=\int_{x\in\mathbb{R}^d}\pi(dx,dy)\ \text{and}\ W_2^2(\mu,\nu)=\int_{\mathbb{R}^d\times\mathbb{R}^d}|y-x|^2\pi(dx,dy)\right\}$$
of optimal couplings between two probability measures $\mu,\nu\in\mathcal{P}_2(\mathbb{R}^d)$ for the quadratic cost. This set is not empty: see e.g. [4], page 133.

1 We thank one of the referees for pointing out this argument to us.
The refined version of the Brenier theorem in [11] ensures that $\Pi^{\mathrm{opt}}(\mu,\nu)$ contains a single element $(I_d,T)\#\mu$ which is given by a measurable transport map $T:\mathbb{R}^d\to\mathbb{R}^d$ for each $\nu\in\mathcal{P}_2(\mathbb{R}^d)$ iff $\mu$ does not give mass to the c − c hypersurfaces parametrized by an index $i\in\{0,\dots,d-1\}$ and two convex functions $f$ and $g$ from $\mathbb{R}^{d-1}$ to $\mathbb{R}$:
$$\{(x_1,\dots,x_i,f(x)-g(x),x_{i+1},\dots,x_{d-1}):\ x=(x_1,\dots,x_{d-1})\in\mathbb{R}^{d-1}\}.$$
The next lemma deals with the case where $\Pi^{\mathrm{opt}}(\mu,\nu)\neq\{(I_d,T)\#\mu\}$ for some measurable transport map.
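Lemma 2.1. Let $\mu,\nu\in\mathcal{P}_2(\mathbb{R}^d)$. One of the two conditions holds:
– $\Pi^{\mathrm{opt}}(\mu,\nu)=\{(I_d,T)\#\mu\}$ for some measurable transport map $T:\mathbb{R}^d\to\mathbb{R}^d$,
– $\exists\,\mu(dx)k(x,dy)\in\Pi^{\mathrm{opt}}(\mu,\nu)$ such that $\int_{\mathbb{R}^d\times\mathbb{R}^d}\left|y-\int_{\mathbb{R}^d}z\,k(x,dz)\right|^2k(x,dy)\mu(dx)>0$.
Moreover, if every coupling in $\Pi^{\mathrm{opt}}(\mu,\nu)$ is given by a map, i.e. writes $(I_d,T)\#\mu$ for some measurable function $T:\mathbb{R}^d\to\mathbb{R}^d$, then $\Pi^{\mathrm{opt}}(\mu,\nu)$ is a singleton.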
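Proof. If the set $\Pi^{\mathrm{opt}}(\mu,\nu)$ has a single element $\mu(dx)k(x,dy)$, defining $T(x)=\int_{\mathbb{R}^d}y\,k(x,dy)$ we either have $\int_{\mathbb{R}^d\times\mathbb{R}^d}|y-T(x)|^2k(x,dy)\mu(dx)>0$ or $\mu(dx)k(x,dy)=\mu(dx)\delta_{T(x)}(dy)$. Otherwise, we can pick two distinct elements $\mu(dx)k_1(x,dy)$ and $\mu(dx)k_2(x,dy)$ of $\Pi^{\mathrm{opt}}(\mu,\nu)$, and $k(x,dy)=\frac12(k_1(x,dy)+k_2(x,dy))$ is such that $\mu(dx)k(x,dy)\in\Pi^{\mathrm{opt}}(\mu,\nu)$ and $\int_{\mathbb{R}^d\times\mathbb{R}^d}\left|y-\int_{\mathbb{R}^d}z\,k(x,dz)\right|^2k(x,dy)\mu(dx)>0$. The second statement easily follows.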
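Remarking that if $\nu$ is the Dirac mass at $x\in\mathbb{R}^d$ and $\nu_\varepsilon$ the uniform distribution on the ball centered at $x$ with radius $\varepsilon$, then $W_2(\nu,\nu_\varepsilon)\le\varepsilon$, we deduce from the next proposition that for any $\mu,\nu\in\mathcal{P}_2(\mathbb{R}^d)$, we can always find $\mu_\varepsilon,\nu_\varepsilon\in\mathcal{P}_2(\mathbb{R}^d)$ such that $W_2(\mu,\mu_\varepsilon)\le\varepsilon$, $W_2(\nu,\nu_\varepsilon)\le\varepsilon$ and $\exists\,\mu_\varepsilon(dx)k_\varepsilon(x,dy)\in\Pi^{\mathrm{opt}}(\mu_\varepsilon,\nu_\varepsilon)$ such that $\int_{\mathbb{R}^d\times\mathbb{R}^d}\left|y-\int_{\mathbb{R}^d}z\,k_\varepsilon(x,dz)\right|^2k_\varepsilon(x,dy)\mu_\varepsilon(dx)>0$.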
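Proposition 2.2. Assume that $\nu\in\mathcal{P}_2(\mathbb{R}^d)$ is not a Dirac mass. Then for all $\mu\in\mathcal{P}_2(\mathbb{R}^d)$, there exists a sequence $(\mu_n)_n$ of elements of $\mathcal{P}_2(\mathbb{R}^d)$ such that $\lim_{n\to\infty}W_2(\mu_n,\mu)=0$ and for each $n$, there does not exist $T_n:\mathbb{R}^d\to\mathbb{R}^d$ measurable such that $\Pi^{\mathrm{opt}}(\mu_n,\nu)=\{(I_d,T_n)\#\mu_n\}$.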
Proof. Let $(X_i)_{i\ge1}$ be an i.i.d. sequence of random variables with law $\mu$, and $(Y_i)_{i\ge1}$ an independent i.i.d. sequence of uniform random variables on the unit ball $\{x\in\mathbb{R}^d,\ |x|\le1\}$. We set $\tilde\mu_n=\frac1n\sum_{i=1}^n\delta_{X_i}$ the empirical measure and $\mu_n=\frac1n\sum_{i=1}^n\delta_{X_i+Y_i/n}$. By construction, we have $W_2^2(\mu_n,\tilde\mu_n)\le\frac1n\sum_{i=1}^n|Y_i/n|^2\le1/n^2$ and $\mathbb{P}(\exists i\neq j,\ X_i+Y_i/n=X_j+Y_j/n)=0$, which means that a.s., for each $n\in\mathbb{N}^*$, $\mu_n$ weights exactly $n$ points. The law of large numbers gives the almost sure weak convergence of $\tilde\mu_n$ towards $\mu$ and the almost sure convergence of $\frac1n\sum_{i=1}^n|X_i|^2$ to $\mathbb{E}[|X_1|^2]$. Proposition 7.1.5 in [4] ensures that $W_2(\tilde\mu_n,\mu)\to0$ almost surely as $n\to+\infty$. By the triangle inequality, we get $W_2(\mu_n,\mu)\to0$ almost surely as $n\to+\infty$.
Now, we consider $(p_n)_{n\ge1}$ the increasing sequence of prime numbers. Suppose that $\exists n_0\in\mathbb{N}^*$ such that $T\#\mu_{p_{n_0}}=\nu$. Then, $\nu$ weights at most $p_{n_0}$ points and the masses are equal to $k/p_{n_0}$ with $1\le k\le p_{n_0}-1$ since $\nu$ is not a Dirac mass. Then, if we had $T\#\mu_{p_n}=\nu$ for some $n>n_0$, we would have $k/p_{n_0}=k'/p_n$ with $1\le k'\le p_n-1$. This would imply that $p_{n_0}$ divides $kp_n$ and thus $k$, which is impossible since $1\le k\le p_{n_0}-1$. Thus, there is at most one $n_0\in\mathbb{N}^*$ such that there is a transport map $T_{n_0}$ satisfying $T_{n_0}\#\mu_{p_{n_0}}=\nu$. Discarding this index if it exists, the sequence $(\mu_{p_n})_n$ has the required properties.
Let us now give a necessary and sufficient condition for the existence of an optimal transport map in dimension $d=1$. We denote by $F_\eta(x)=\eta((-\infty,x])$ and $F_\eta^{-1}(u)=\inf\{x\in\mathbb{R}:\eta((-\infty,x])\ge u\}$ the cumulative distribution function and the quantile function of a probability measure $\eta$ on $\mathbb{R}$. For $\mu,\nu\in\mathcal{P}_2(\mathbb{R})$, by Theorem 2.9 in [16], the only element of $\Pi^{\mathrm{opt}}(\mu,\nu)$ is the image of the Lebesgue measure on $[0,1]$ by $(F_\mu^{-1},F_\nu^{-1})$. The next lemma characterizes the case when this coupling is given by a map.
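Lemma 2.3. Let $\mu,\nu\in\mathcal{P}_2(\mathbb{R})$. There exists $T\in L^2(\mathbb{R},\mu;\mathbb{R})$ such that $\Pi^{\mathrm{opt}}(\mu,\nu)=\{(I_1,T)\#\mu\}$ iff for all $x\in\mathbb{R}$ such that $\mu(\{x\})>0$, $F_\nu^{-1}$ is constant on $(F_\mu(x-),F_\mu(x)]$. Then, the unique optimal transport map is $T(x)=F_\nu^{-1}(F_\mu(x))$.
Remark 2.4. When $F_\nu^{-1}$ is not constant on $(F_\mu(x-),F_\mu(x)]$ for some $x\in\mathbb{R}$ such that $\mu(\{x\})>0$, then $\Pi^{\mathrm{opt}}(\mu,\nu)$ is not equal to $\{(I_1,T)\#\mu\}$ for some measurable map $T:\mathbb{R}\to\mathbb{R}$ while, since $\Pi^{\mathrm{opt}}(\mu,\nu)$ is a singleton, the quotient of $\{x\mapsto\int_{\mathbb{R}}y\,k(x,dy):\mu(dx)k(x,dy)\in\Pi^{\mathrm{opt}}(\mu,\nu)\}$ for the $\mu(dx)$ a.e. equality is a singleton.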
Proof. Let $X\sim\mu$ and $U$ be an independent random variable uniform on $[0,1]$. The random variable $V=F_\mu(X-)+U(F_\mu(X)-F_\mu(X-))$ is such that $\mathbb{P}(\{F_\mu(X-)<V\le F_\mu(X)\}\cup\{F_\mu(X-)=V=F_\mu(X)\})=1$. This is a uniform random variable on $[0,1]$: for $u\in(0,1)$, $u\in[F_\mu(x-),F_\mu(x)]$ for some $x\in\mathbb{R}$ and
$$\mathbb{P}(V\le u)=\mathbb{P}(X<x)+\mathbb{P}\left(X=x,\ U\le\frac{u-F_\mu(x-)}{F_\mu(x)-F_\mu(x-)}\right)=u$$
since $X$ is independent of $U$. Since $F_\mu^{-1}(V)=X$ for $V\in(F_\mu(X-),F_\mu(X)]$ and $F_\mu^{-1}(V)\le X$ for $V=F_\mu(X-)=F_\mu(X)$, we have $F_\mu^{-1}(V)\le X$ a.s. Since $F_\mu^{-1}(V)$ and $X$ have the same law, we necessarily have $F_\mu^{-1}(V)=X$ a.s. By the inverse transform sampling, $F_\nu^{-1}(V)$ is distributed according to $\nu$. Let us assume that $F_\nu^{-1}$ is constant on $(F_\mu(x-),F_\mu(x)]$ for all $x\in\mathbb{R}$ such that $\mu(\{x\})>0$. Then $F_\nu^{-1}(V)=F_\nu^{-1}(F_\mu(X))$ a.s., $F_\nu^{-1}\circ F_\mu\#\mu=\nu$ and
$$\int_0^1\left(F_\mu^{-1}(v)-F_\nu^{-1}(v)\right)^2dv=\mathbb{E}\left[\left(X-F_\nu^{-1}(F_\mu(X))\right)^2\right]=\int_{\mathbb{R}}\left(x-F_\nu^{-1}(F_\mu(x))\right)^2\mu(dx).$$
Hence $T(x)=F_\nu^{-1}(F_\mu(x))$ is an optimal transport map. Conversely, if $T$ is an optimal transport map such that $T\#\mu=\nu$, we have $T(F_\mu^{-1}(v))=F_\nu^{-1}(v)$, $dv$-a.e. For $x\in\mathbb{R}$ such that $\mu(\{x\})>0$, $F_\mu^{-1}$ is constant on $(F_\mu(x-),F_\mu(x)]$, and therefore $F_\nu^{-1}$ is necessarily constant on $(F_\mu(x-),F_\mu(x)]$.
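The criterion of Lemma 2.3 can be checked by hand on finitely supported measures. The following Python sketch is an illustration added for the reader and is not part of the original argument: the helper names `quantile_coupling` and `is_given_by_map` are hypothetical, masses are exact `Fraction`s, and every atom is assumed to carry positive mass. It builds the image of the Lebesgue measure on $[0,1]$ by $(F_\mu^{-1},F_\nu^{-1})$ and tests whether each atom of $\mu$ is sent to a single point.

```python
from fractions import Fraction


def quantile_coupling(mu, nu):
    """Comonotone coupling of two finitely supported laws on the real line.

    mu, nu: dicts {atom: mass} with positive masses summing to 1.
    Returns {(x, y): mass}, i.e. the image of the Lebesgue measure on [0, 1]
    by the pair of quantile functions.
    """
    xs, ys = sorted(mu), sorted(nu)
    i = j = 0
    rem_x, rem_y = mu[xs[0]], nu[ys[0]]
    coupling = {}
    while True:
        m = min(rem_x, rem_y)  # length of the next common quantile interval
        coupling[(xs[i], ys[j])] = coupling.get((xs[i], ys[j]), 0) + m
        rem_x -= m
        rem_y -= m
        if rem_x == 0:  # atom xs[i] is exhausted, move to the next one
            i += 1
            if i == len(xs):
                break
            rem_x = mu[xs[i]]
        if rem_y == 0:  # atom ys[j] is exhausted, move to the next one
            j += 1
            if j == len(ys):
                break
            rem_y = nu[ys[j]]
    return coupling


def is_given_by_map(coupling):
    """True iff every atom x of the first marginal is sent to a single y."""
    images = {}
    for x, y in coupling:
        images.setdefault(x, set()).add(y)
    return all(len(targets) == 1 for targets in images.values())


half, quarter = Fraction(1, 2), Fraction(1, 4)
coarse = {0: half, 1: half}
fine = {0: quarter, 1: quarter, 2: half}
print(is_given_by_map(quantile_coupling(coarse, fine)))  # atom 0 must be split
print(is_given_by_map(quantile_coupling(fine, coarse)))  # quantile of nu constant on each block
```

In the first call the atom $0$ of $\mu$ occupies the quantile levels $(0,1/2]$, on which $F_\nu^{-1}$ takes the two values $0$ and $1$, so the condition of the lemma fails; in the second call it holds for every atom.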
Remark 2.5. Lemma 2.3 still holds true for $\mu,\nu$ probability measures on $\mathbb{R}$ with finite moments of order $\rho\ge1$, and a transport cost $c(x,y)=h(|y-x|)$, with $h:\mathbb{R}_+\to\mathbb{R}$ strictly convex such that $\exists C<\infty$, $\forall x\in\mathbb{R}$, $h(|x|)\le C(1+|x|^\rho)$. The same proof applies since, by Theorem 2.9 in [16], the only optimal coupling for such a cost is the image of the Lebesgue measure on $[0,1]$ by $(F_\mu^{-1},F_\nu^{-1})$.
The next proposition, which is one of the main results of this section, shows that any $W_2$-optimal coupling can be written as the composition of a transport map and a martingale kernel, i.e. a Markov kernel $k$ such that for all $x\in\mathbb{R}^d$, $\int_{\mathbb{R}^d}|y|\,k(x,dy)<\infty$ and $\int_{\mathbb{R}^d}y\,k(x,dy)=x$. Let us now give the definition of the convex order on probability measures before recalling its link with the existence of martingale couplings.
Definition 2.6. Let $\eta,\nu$ be two probability measures on $\mathbb{R}^d$. We say that $\eta$ is smaller than $\nu$ in the convex order and write $\eta\le_{cx}\nu$ if for each convex function $\phi:\mathbb{R}^d\to\mathbb{R}$ such that the integrals make sense,
$$\int_{\mathbb{R}^d}\phi(x)\eta(dx)\le\int_{\mathbb{R}^d}\phi(y)\nu(dy).$$
Notice that since a convex function $\phi$ on $\mathbb{R}^d$ is bounded from below by an affine function, for a probability measure $\eta$ on $\mathbb{R}^d$ with finite first order moment (and in particular for $\eta\in\mathcal{P}_2(\mathbb{R}^d)$), $\int_{\mathbb{R}^d}\phi(x)\eta(dx)$ always makes sense, possibly equal to $+\infty$.
Theorem 8 in Strassen [17] ensures that, when $\int_{\mathbb{R}^d}|y|\nu(dy)<\infty$, $\eta\le_{cx}\nu$ iff there exists a martingale Markov kernel $k$ such that $\eta(dx)k(x,dy)\in\Pi(\eta,\nu)$.
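Proposition 2.7. Let $\mu,\nu\in\mathcal{P}_2(\mathbb{R}^d)$, $\mu(dx)k(x,dy)\in\Pi^{\mathrm{opt}}(\mu,\nu)$, $T(x)=\int_{\mathbb{R}^d}y\,k(x,dy)$ and $\eta=T\#\mu$. Then $\eta\le_{cx}\nu$,
$$W_2^2(\mu,\nu)=W_2^2(\mu,\eta)+\int_{\mathbb{R}^d}|y|^2\nu(dy)-\int_{\mathbb{R}^d}|z|^2\eta(dz)\tag{2.1}$$
and $(I_d,T)\#\mu\in\Pi^{\mathrm{opt}}(\mu,\eta)$.
On the other hand, if $\eta\le_{cx}\nu$ is such that (2.1) holds, then combining $\mu(dx)q(x,dz)\in\Pi^{\mathrm{opt}}(\mu,\eta)$ with any martingale coupling $\eta(dz)m(z,dy)$ between $\eta$ and $\nu$, we obtain a $W_2$-optimal coupling $\mu(dx)qm(x,dy)$ (where, as usual, $qm(x,dy)=\int_{z\in\mathbb{R}^d}q(x,dz)m(z,dy)$) between $\mu$ and $\nu$.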
The first part of this proposition is also a consequence of Theorem 12.4.4 in [4]: the barycentric projection of $\mu(dx)k(x,dy)$ is precisely $(I_d,T)\#\mu$. Here, we present this result in a probabilistic fashion. For $\mu(dx)k(x,dy)$ as in the first statement and $(X,Y)\sim\mu(dx)k(x,dy)$, by definition of $T$, $\mathbb{E}[Y|X]=T(X)$ a.s. so that $\mathbb{E}[Y|T(X)]=T(X)$ a.s. and this optimal coupling is the composition of the martingale coupling given by the law of $(T(X),Y)$ and the transport map $T$. Notice that since it relies on the bias-variance decomposition, this structure of optimal couplings does not seem to generalize to other Wasserstein distances $W_\rho(\mu,\nu)=\inf_{\pi\in\Pi(\mu,\nu)}\left(\int|y-x|^\rho\pi(dx,dy)\right)^{1/\rho}$, $\rho\in[1,\infty)\setminus\{2\}$. Nevertheless, Gozlan and Juillet [12] have recently obtained optimal couplings that are the composition of a martingale coupling and a deterministic transport map by considering the barycentric optimal cost problem, which consists in minimizing, for a given cost function $\theta:\mathbb{R}^d\to\mathbb{R}_+$, the quantity $\int_{\mathbb{R}^d}\theta\left(x-\int_{\mathbb{R}^d}y\,k(x,dy)\right)\mu(dx)$ among all couplings $\mu(dx)k(x,dy)$ between $\mu$ and $\nu$.
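Proof. Let us first prove the second statement. Let $\eta\le_{cx}\nu$, $q$ be a Markov kernel such that $\mu(dx)q(x,dz)\in\Pi^{\mathrm{opt}}(\mu,\eta)$ and $m$ be any martingale kernel such that $\eta m=\nu$. Then $\mu(dx)qm(x,dy)$ is a coupling between $\mu$ and $\nu$ such that
$$\begin{aligned}W_2^2(\mu,\nu)&\le\int_{\mathbb{R}^d\times\mathbb{R}^d}|y-x|^2\mu(dx)qm(x,dy)=\int_{\mathbb{R}^d\times\mathbb{R}^d\times\mathbb{R}^d}|y-z+z-x|^2\mu(dx)q(x,dz)m(z,dy)\\&=\int_{\mathbb{R}^d\times\mathbb{R}^d}|y-z|^2\eta(dz)m(z,dy)+\int_{\mathbb{R}^d\times\mathbb{R}^d}|z-x|^2\mu(dx)q(x,dz)\\&=\int_{\mathbb{R}^d}|y|^2\nu(dy)-\int_{\mathbb{R}^d}|z|^2\eta(dz)+W_2^2(\mu,\eta)\end{aligned}\tag{2.2}$$
where we used the variance-bias decomposition under the martingale kernel $m$ for the third equality. Hence, if (2.1) holds, then $\mu(dx)qm(x,dy)\in\Pi^{\mathrm{opt}}(\mu,\nu)$.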
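Let now $\mu(dx)k(x,dy)\in\Pi^{\mathrm{opt}}(\mu,\nu)$, $T(x)=\int_{\mathbb{R}^d}y\,k(x,dy)$ and $\eta=T\#\mu$. Jensen’s inequality immediately gives $\eta\le_{cx}\nu$ and thus $\eta\in\mathcal{P}_2(\mathbb{R}^d)$. We have
$$\begin{aligned}W_2^2(\mu,\nu)&=\int_{\mathbb{R}^d}\int_{\mathbb{R}^d}|y-T(x)+T(x)-x|^2\mu(dx)k(x,dy)\\&=\int_{\mathbb{R}^d}\int_{\mathbb{R}^d}|y-T(x)|^2\mu(dx)k(x,dy)+\int_{\mathbb{R}^d}|T(x)-x|^2\mu(dx)\\&=\int_{\mathbb{R}^d}\int_{\mathbb{R}^d}\left(|y|^2-|T(x)|^2\right)\mu(dx)k(x,dy)+\int_{\mathbb{R}^d}|T(x)-x|^2\mu(dx)\\&=\int_{\mathbb{R}^d}|y|^2\nu(dy)-\int_{\mathbb{R}^d}|z|^2\eta(dz)+\int_{\mathbb{R}^d}|T(x)-x|^2\mu(dx),\end{aligned}$$
where we used the variance-bias decomposition with respect to $k(x,\cdot)$ for the second equality. With (2.2), we deduce that $\int_{\mathbb{R}^d}|T(x)-x|^2\mu(dx)\le W_2^2(\mu,\eta)$ and $T$ is a $W_2$-optimal transport map between $\mu$ and $\eta$.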
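The bias-variance computation used in this proof is an algebraic identity valid for any coupling $\mu(dx)k(x,dy)$ with finite second moments, optimal or not: the total quadratic cost equals $\int|T(x)-x|^2\mu(dx)+\int|y|^2\nu(dy)-\int|z|^2\eta(dz)$. The short Python sketch below, added as an illustration, checks it exactly on a finitely supported coupling. The function name `barycentric_split` and the sample coupling are assumptions made for this example only.

```python
from fractions import Fraction


def sq(u, v):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def barycentric_split(coupling):
    """Split the quadratic cost of a discrete coupling {(x, y): mass}.

    Returns (total, transport, gap) where
      total     = sum of mass * |y - x|^2,
      transport = integral of |T(x) - x|^2 against mu, T(x) being the barycenter of k(x, dy),
      gap       = second moment of nu minus second moment of eta = T # mu.
    """
    mu, first_moment = {}, {}
    for (x, y), m in coupling.items():
        mu[x] = mu.get(x, 0) + m
        acc = first_moment.get(x, (0,) * len(y))
        first_moment[x] = tuple(a + m * b for a, b in zip(acc, y))
    T = {x: tuple(c / mu[x] for c in first_moment[x]) for x in mu}
    origin = (0,) * len(next(iter(mu)))
    total = sum(m * sq(y, x) for (x, y), m in coupling.items())
    transport = sum(mu[x] * sq(T[x], x) for x in mu)
    second_nu = sum(m * sq(y, origin) for (_, y), m in coupling.items())
    second_eta = sum(mu[x] * sq(T[x], origin) for x in mu)
    return total, transport, second_nu - second_eta


q = Fraction(1, 4)
sample = {
    ((-1, 0), (-1, -1)): q,
    ((-1, 0), (0, 1)): q,
    ((1, 0), (1, 1)): q,
    ((1, 0), (0, -1)): q,
}
total, transport, gap = barycentric_split(sample)
print(total, transport, gap)  # total should equal transport + gap
```

When the coupling is optimal, `total` is $W_2^2(\mu,\nu)$ and the proposition states in addition that `transport` is $W_2^2(\mu,\eta)$; the sketch only verifies the decomposition itself.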
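For $\mu,\nu\in\mathcal{P}_2(\mathbb{R}^d)$, let us define the sets
$$\mathcal{I}_\mu^\nu=\left\{\eta\in\mathcal{P}_2(\mathbb{R}^d):\ \eta\le_{cx}\nu\ \text{and}\ W_2^2(\mu,\nu)=W_2^2(\mu,\eta)+\int_{\mathbb{R}^d}|y|^2\nu(dy)-\int_{\mathbb{R}^d}|z|^2\eta(dz)\right\},$$
$$\tilde{\mathcal{I}}_\mu^\nu=\left\{T\#\mu:\ \exists\,\mu(dx)k(x,dy)\in\Pi^{\mathrm{opt}}(\mu,\nu),\ T(x)=\int_{\mathbb{R}^d}y\,k(x,dy)\right\}.$$
By Proposition 2.7, we have $\tilde{\mathcal{I}}_\mu^\nu\subset\mathcal{I}_\mu^\nu$ and $\tilde{\mathcal{I}}_\mu^\nu\neq\emptyset$ since $\Pi^{\mathrm{opt}}(\mu,\nu)\neq\emptyset$. Moreover, there exists an optimal transport map between $\mu$ and any element of $\tilde{\mathcal{I}}_\mu^\nu$. The measure $T\#\mu$ associated with an optimal coupling in $\Pi^{\mathrm{opt}}(\mu,\nu)$ is possibly equal to $\nu$, which always belongs to $\mathcal{I}_\mu^\nu$.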
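Lemma 2.8. Let $\mu,\nu\in\mathcal{P}_2(\mathbb{R}^d)$. If $\eta\in\mathcal{I}_\mu^\nu$, then for any $\tilde\eta$ such that $\eta\le_{cx}\tilde\eta\le_{cx}\nu$, $\tilde\eta\in\mathcal{I}_\mu^\nu$ and $\eta\in\mathcal{I}_\mu^{\tilde\eta}$. Moreover, $\mathcal{I}_\mu^\nu=\{\eta\in\mathcal{P}_2(\mathbb{R}^d):\exists\tilde\eta\in\tilde{\mathcal{I}}_\mu^\nu,\ \tilde\eta\le_{cx}\eta\le_{cx}\nu\}$. Last, the set $\mathcal{I}_\mu^\nu$ is convex.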
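Proof. Let $\eta\in\mathcal{I}_\mu^\nu$ and $\tilde\eta$ be such that $\eta\le_{cx}\tilde\eta\le_{cx}\nu$. We have
$$W_2^2(\mu,\nu)=W_2^2(\mu,\eta)+\int_{\mathbb{R}^d}|y|^2\nu(dy)-\int_{\mathbb{R}^d}|\tilde z|^2\tilde\eta(d\tilde z)+\int_{\mathbb{R}^d}|\tilde z|^2\tilde\eta(d\tilde z)-\int_{\mathbb{R}^d}|z|^2\eta(dz).\tag{2.3}$$
Now, we consider $\mu(dx)k(x,dz)\in\Pi^{\mathrm{opt}}(\mu,\eta)$ and $\eta(dz)m(z,d\tilde z)$ a martingale coupling between $\eta$ and $\tilde\eta$. Then,
$$W_2^2(\mu,\tilde\eta)\le\int_{(\mathbb{R}^d)^3}|\tilde z-z+z-x|^2\mu(dx)k(x,dz)m(z,d\tilde z)=W_2^2(\mu,\eta)+\int_{\mathbb{R}^d}|\tilde z|^2\tilde\eta(d\tilde z)-\int_{\mathbb{R}^d}|z|^2\eta(dz).$$
This inequality cannot be strict: otherwise, by combining an optimal coupling between $\mu$ and $\tilde\eta$ and a martingale coupling between $\tilde\eta$ and $\nu$, we would contradict (2.3). The equality gives $\eta\in\mathcal{I}_\mu^{\tilde\eta}$ and $\tilde\eta\in\mathcal{I}_\mu^\nu$ by using (2.3).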
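If $\tilde\eta\in\tilde{\mathcal{I}}_\mu^\nu$, since $\tilde{\mathcal{I}}_\mu^\nu\subset\mathcal{I}_\mu^\nu$, by the first statement, each probability measure $\eta$ such that $\tilde\eta\le_{cx}\eta\le_{cx}\nu$ belongs to $\mathcal{I}_\mu^\nu$. Hence $\{\eta\in\mathcal{P}_2(\mathbb{R}^d):\exists\tilde\eta\in\tilde{\mathcal{I}}_\mu^\nu,\ \tilde\eta\le_{cx}\eta\le_{cx}\nu\}\subset\mathcal{I}_\mu^\nu$. On the other hand, for $\eta\in\mathcal{I}_\mu^\nu$, $\mu(dx)q(x,dz)\in\Pi^{\mathrm{opt}}(\mu,\eta)$ and a martingale coupling $\eta(dz)m(z,dy)$ between $\eta$ and $\nu$, we have $\mu(dx)qm(x,dy)\in\Pi^{\mathrm{opt}}(\mu,\nu)$, by the second assertion in Proposition 2.7. Since, by the martingale property, $\int_{\mathbb{R}^d}y\,qm(x,dy)=\int_{\mathbb{R}^d}\int_{\mathbb{R}^d}y\,m(z,dy)q(x,dz)=\int_{\mathbb{R}^d}z\,q(x,dz)$, setting $T(x)=\int_{\mathbb{R}^d}z\,q(x,dz)$, we have $T\#\mu\in\tilde{\mathcal{I}}_\mu^\nu$, by the first assertion in Proposition 2.7. Since $T\#\mu\le_{cx}\eta$, we conclude that $\mathcal{I}_\mu^\nu\subset\{\eta\in\mathcal{P}_2(\mathbb{R}^d):\exists\tilde\eta\in\tilde{\mathcal{I}}_\mu^\nu,\ \tilde\eta\le_{cx}\eta\le_{cx}\nu\}$.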
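Last, let us consider $\eta_1,\eta_2\in\mathcal{I}_\mu^\nu$ and $\lambda\in(0,1)$. Using a convex combination of couplings in $\Pi^{\mathrm{opt}}(\mu,\eta_1)$ and $\Pi^{\mathrm{opt}}(\mu,\eta_2)$, we obtain that $W_2^2(\mu,\lambda\eta_1+(1-\lambda)\eta_2)\le\lambda W_2^2(\mu,\eta_1)+(1-\lambda)W_2^2(\mu,\eta_2)$. Since $\eta_1,\eta_2\in\mathcal{I}_\mu^\nu$, we deduce that
$$W_2^2(\mu,\nu)\ge W_2^2(\mu,\lambda\eta_1+(1-\lambda)\eta_2)+\int_{\mathbb{R}^d}|y|^2\nu(dy)-\int_{\mathbb{R}^d}|z|^2(\lambda\eta_1+(1-\lambda)\eta_2)(dz).$$
Since $\lambda\eta_1+(1-\lambda)\eta_2\le_{cx}\nu$, there exists a martingale coupling between $\lambda\eta_1+(1-\lambda)\eta_2$ and $\nu$. Composing it with an element of $\Pi^{\mathrm{opt}}(\mu,\lambda\eta_1+(1-\lambda)\eta_2)$, we obtain a coupling between $\mu$ and $\nu$ which ensures that
$$W_2^2(\mu,\nu)\le W_2^2(\mu,\lambda\eta_1+(1-\lambda)\eta_2)+\int_{\mathbb{R}^d}|y|^2\nu(dy)-\int_{\mathbb{R}^d}|z|^2(\lambda\eta_1+(1-\lambda)\eta_2)(dz).$$
Hence $\lambda\eta_1+(1-\lambda)\eta_2\in\mathcal{I}_\mu^\nu$.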
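In dimension $d=1$, since $\Pi^{\mathrm{opt}}(\mu,\nu)$ is a singleton, we can specify the sets $\mathcal{I}_\mu^\nu$ and $\tilde{\mathcal{I}}_\mu^\nu$.
Proposition 2.9. Let $\mu,\nu\in\mathcal{P}_2(\mathbb{R})$ and
$$T(x)=\int_0^1F_\nu^{-1}\left(F_\mu(x-)+u[F_\mu(x)-F_\mu(x-)]\right)du.\tag{2.4}$$
We have $\tilde{\mathcal{I}}_\mu^\nu=\{T\#\mu\}$ and $\mathcal{I}_\mu^\nu=\{\eta\in\mathcal{P}_2(\mathbb{R}):T\#\mu\le_{cx}\eta\le_{cx}\nu\}$. Moreover, $\Pi^{\mathrm{opt}}(\mu,T\#\mu)=\{(I_1,T)\#\mu\}$ and there is a unique martingale coupling between $T\#\mu$ and $\nu$ and it is $W_2$-optimal.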
Proof. By the second assertion in Lemma 2.8, the characterization of $\mathcal{I}_\mu^\nu$ easily follows from the one of $\tilde{\mathcal{I}}_\mu^\nu$, which, with the definition of $\tilde{\mathcal{I}}_\mu^\nu$, the first statement in Proposition 2.7 and the uniqueness of the optimal coupling in dimension $d=1$, also implies that $\Pi^{\mathrm{opt}}(\mu,T\#\mu)=\{(I_1,T)\#\mu\}$. Let $U,U'$ be two independent uniform random variables on $[0,1]$. We define
$$V=F_\mu(F_\mu^{-1}(U)-)+U'\left[F_\mu(F_\mu^{-1}(U))-F_\mu(F_\mu^{-1}(U)-)\right],\tag{2.5}$$
and have by construction
$$F_\mu^{-1}(V)=F_\mu^{-1}(U)\quad\text{a.s.}\tag{2.6}$$
For $u\in(0,1)$, $u\in[F_\mu(x-),F_\mu(x)]$ for some $x\in\mathbb{R}$ and
$$\mathbb{P}(V\le u)=\mathbb{P}(F_\mu^{-1}(U)<x)+\mathbb{P}\left(F_\mu^{-1}(U)=x,\ U'\le\frac{u-F_\mu(x-)}{F_\mu(x)-F_\mu(x-)}\right)=u$$
since $U'$ is independent of $U$. Hence $V$ is uniformly distributed on $[0,1]$. According to Theorem 2.9 [16], the law of $(F_\mu^{-1}(V),F_\nu^{-1}(V))$ is the unique element of $\Pi^{\mathrm{opt}}(\mu,\nu)$. From (2.5), we get $\mathbb{E}[F_\nu^{-1}(V)|U]=T(F_\mu^{-1}(U))$ and by (2.6),
$$\mathbb{E}[F_\nu^{-1}(V)|F_\mu^{-1}(V)]=\mathbb{E}\left[\mathbb{E}[F_\nu^{-1}(V)|U]\,\big|\,F_\mu^{-1}(V)\right]=\mathbb{E}[T(F_\mu^{-1}(V))|F_\mu^{-1}(V)]=T(F_\mu^{-1}(V)).$$
Hence the single element of $\tilde{\mathcal{I}}_\mu^\nu$ is the law $T\#\mu$ of $T(F_\mu^{-1}(V))$. Since $T$ is nondecreasing, $T(F_\mu^{-1}(V))=F_{T\#\mu}^{-1}(V)$ a.s. and $\mathbb{E}[F_\nu^{-1}(V)|F_{T\#\mu}^{-1}(V)]=F_{T\#\mu}^{-1}(V)$ a.s. Hence the law of $(F_{T\#\mu}^{-1}(V),F_\nu^{-1}(V))$, which is the single element of $\Pi^{\mathrm{opt}}(T\#\mu,\nu)$, is a martingale coupling. Since all the martingale couplings share the quadratic cost $\int_{\mathbb{R}}y^2\nu(dy)-\int_{\mathbb{R}}(T(x))^2\mu(dx)$, each martingale coupling belongs to $\Pi^{\mathrm{opt}}(T\#\mu,\nu)$ and is therefore equal to the previous one.
In dimension $d=1$, there is a single element $\eta\in\tilde{\mathcal{I}}_\mu^\nu$, a unique element in $\Pi^{\mathrm{opt}}(\mu,\eta)$ and the unique martingale coupling between $\eta$ and $\nu$ is $W_2$-optimal. We now provide an example in dimension $d=2$ where these properties fail.
Example 2.10. Let $\mu=\frac12\left(\delta_{(-1,0)}+\delta_{(1,0)}\right)$ and $\nu=\frac12\left(\delta_{(0,-1)}+\delta_{(0,1)}\right)$. Since $|(0,-1)-(-1,0)|=|(0,1)-(-1,0)|=|(0,-1)-(1,0)|=|(0,1)-(1,0)|$, any coupling between $\mu$ and $\nu$ is $W_2$-optimal. The couplings write $\mu(dx)k_p(x,dy)$ with $k_p((-1,0),dy)=\left(p\,\delta_{(0,-1)}+(1-p)\delta_{(0,1)}\right)(dy)$ and $k_p((1,0),dy)=\left((1-p)\delta_{(0,-1)}+p\,\delta_{(0,1)}\right)(dy)$ for $p\in[0,1]$. One has $T_p((-1,0))=(0,1-2p)$, $T_p((1,0))=(0,2p-1)$, and $\eta_p=\frac12\left(\delta_{(0,1-2p)}+\delta_{(0,2p-1)}\right)$. Any coupling between $\mu$ and $\eta_p$ is $W_2$-optimal and as soon as $p\neq1/2$, there is an optimal coupling different from $(I_2,T_p)\#\mu$. Moreover, unless $p\in\{0,1/2,1\}$, the martingale coupling between $\eta_p$ and $\nu$ is not $W_2$-optimal.
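The claims of Example 2.10 can be verified by direct computation. The Python sketch below is an added illustration with hypothetical helper names (`kernel`, `cost_mu_nu`, `barycenter`, `martingale_cost`, `optimal_cost_eta_nu`). It uses the fact that, between two measures with two atoms of mass $1/2$ each, the optimal cost is the smaller of the costs of the two one-to-one matchings, and that any martingale coupling between $\eta_p$ and $\nu$ costs $\int|y|^2\nu(dy)-\int|z|^2\eta_p(dz)$.

```python
from fractions import Fraction

MU = [(-1, 0), (1, 0)]   # atoms of mu, mass 1/2 each
NU = [(0, -1), (0, 1)]   # atoms of nu, mass 1/2 each
HALF = Fraction(1, 2)


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kernel(p):
    """Kernel k_p of the example: weights of the atoms of nu given each atom of mu."""
    return {
        (-1, 0): {(0, -1): p, (0, 1): 1 - p},
        (1, 0): {(0, -1): 1 - p, (0, 1): p},
    }


def cost_mu_nu(p):
    """Quadratic cost of the coupling mu(dx) k_p(x, dy)."""
    k = kernel(p)
    return sum(HALF * w * sq_dist(x, y) for x in MU for y, w in k[x].items())


def barycenter(p, x):
    """T_p(x), the mean of k_p(x, dy)."""
    return tuple(sum(w * y[i] for y, w in kernel(p)[x].items()) for i in range(2))


def martingale_cost(p):
    """Second moment of nu minus second moment of eta_p."""
    origin = (0, 0)
    second_nu = HALF * sum(sq_dist(y, origin) for y in NU)
    second_eta = HALF * sum(sq_dist(barycenter(p, x), origin) for x in MU)
    return second_nu - second_eta


def optimal_cost_eta_nu(p):
    """Squared W2 distance between eta_p and nu (best of the two matchings)."""
    z1, z2 = barycenter(p, MU[0]), barycenter(p, MU[1])
    straight = sq_dist(z1, NU[0]) + sq_dist(z2, NU[1])
    crossed = sq_dist(z1, NU[1]) + sq_dist(z2, NU[0])
    return HALF * min(straight, crossed)


for p in [Fraction(k, 4) for k in range(5)]:
    print(p, cost_mu_nu(p), optimal_cost_eta_nu(p), martingale_cost(p))
```

For each $p$ the cost between $\mu$ and $\nu$ is the same, while the last two columns agree only for $p\in\{0,1/2,1\}$.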
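According to the next theorem, we can find elements $\eta$ in $\tilde{\mathcal{I}}_\mu^\nu$ such that $\Pi^{\mathrm{opt}}(\mu,\eta)=\{(I_d,T)\#\mu\}$ for some measurable transport map $T$ by minimizing over $\mathcal{I}_\mu^\nu$ the integral of a strictly convex function.
Theorem 2.11. Let $\mu,\nu\in\mathcal{P}_2(\mathbb{R}^d)$, $\phi:\mathbb{R}^d\to\mathbb{R}$ be strictly convex such that $\int_{\mathbb{R}^d}\phi(y)\nu(dy)<\infty$ and
$$\mathcal{I}_{\mu,\phi}^\nu:=\left\{\eta\in\mathcal{I}_\mu^\nu:\ \int_{\mathbb{R}^d}\phi(z)\eta(dz)=\inf_{\eta\in\mathcal{I}_\mu^\nu}\int_{\mathbb{R}^d}\phi(z)\eta(dz)\right\}.$$
We have $\emptyset\neq\mathcal{I}_{\mu,\phi}^\nu\subset\tilde{\mathcal{I}}_\mu^\nu$ and for each $\eta\in\mathcal{I}_{\mu,\phi}^\nu$, $\Pi^{\mathrm{opt}}(\mu,\eta)=\{(I_d,T)\#\mu\}$ for some measurable transport map $T:\mathbb{R}^d\to\mathbb{R}^d$. Moreover, there is a single $\eta_\phi\in\mathcal{I}_{\mu,\phi}^\nu$ such that $\int_{\mathbb{R}^d}|z|^2\eta_\phi(dz)=\inf_{\eta\in\mathcal{I}_{\mu,\phi}^\nu}\int_{\mathbb{R}^d}|z|^2\eta(dz)$. Last, there is a single element $\underline{\eta}$ in $\mathcal{I}_{\mu,|x|^2}^\nu$.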
This theorem makes it possible to select extreme elements of $\mathcal{I}_\mu^\nu$ and provides the following characterization of the existence of a minimal element for the convex order in this set.
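Corollary 2.12. For $\mu,\nu\in\mathcal{P}_2(\mathbb{R}^d)$, there exists $\eta_0\in\mathcal{P}_2(\mathbb{R}^d)$ such that $\mathcal{I}_\mu^\nu=\{\eta:\eta_0\le_{cx}\eta\le_{cx}\nu\}$ if and only if $\left\{\eta_\phi:\ \phi:\mathbb{R}^d\to\mathbb{R}\ \text{strictly convex and such that}\ \int_{\mathbb{R}^d}\phi(y)\nu(dy)<\infty\right\}=\{\underline{\eta}\}$, and then $\eta_0=\underline{\eta}$.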
Let us show the corollary before proving the theorem.
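Proof of Corollary 2.12. The necessary condition is obvious. Let us show that it is sufficient. It is enough to check that for any $\phi:\mathbb{R}^d\to\mathbb{R}$ convex such that $\exists C<\infty$, $\forall x\in\mathbb{R}^d$, $|\phi(x)|\le C(1+|x|)$, we have $\forall\eta\in\mathcal{I}_\mu^\nu$, $\int_{\mathbb{R}^d}\phi(x)\underline{\eta}(dx)\le\int_{\mathbb{R}^d}\phi(x)\eta(dx)$ (see e.g. [1], Lem. A.1). For such a function $\phi$ and for $\varepsilon>0$, $\phi_\varepsilon(x):=\phi(x)+\varepsilon|x|^2$ is strictly convex and, since $\eta_{\phi_\varepsilon}=\underline{\eta}$, we have
$$\forall\eta\in\mathcal{I}_\mu^\nu,\quad\int_{\mathbb{R}^d}\phi_\varepsilon(x)\underline{\eta}(dx)\le\int_{\mathbb{R}^d}\phi_\varepsilon(x)\eta(dx).$$
We conclude by letting $\varepsilon\to0$ using the dominated convergence theorem.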
To prove Theorem 2.11, we will need the following lemma.
Lemma 2.13. Let $\nu$ be a probability measure on $\mathbb{R}^d$ such that $\int_{\mathbb{R}^d}|y|\nu(dy)<\infty$ and $\phi:\mathbb{R}^d\to\mathbb{R}$ a convex function such that $\int_{\mathbb{R}^d}\phi(y)\nu(dy)<\infty$. Then the family of probability measures $\{\phi\#\eta:\eta\le_{cx}\nu\}$ is uniformly integrable.
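Proof of Lemma 2.13. Let us first suppose that $\phi$ is nonnegative. Let $M\in(0,+\infty)$, $\eta\le_{cx}\nu$ and $m$ be a martingale kernel such that $\int_{x\in\mathbb{R}^d}\eta(dx)m(x,dy)=\nu(dy)$. Using Jensen’s inequality for the first inequality and the Markov inequality combined with $\eta\le_{cx}\nu$ for the third one, we obtain that
$$\begin{aligned}\int_{\mathbb{R}^d}\phi(x)1_{\{\phi(x)\ge M\}}\eta(dx)&\le\int_{\mathbb{R}^d}\int_{\mathbb{R}^d}\phi(y)m(x,dy)1_{\{\phi(x)\ge M\}}\eta(dx)\\&\le\int_{\mathbb{R}^d\times\mathbb{R}^d}\left(\phi(y)1_{\{\phi(y)\ge\sqrt{M}\}}+\sqrt{M}1_{\{\phi(x)\ge M\}}\right)m(x,dy)\eta(dx)\\&=\int_{\mathbb{R}^d}\phi(y)1_{\{\phi(y)\ge\sqrt{M}\}}\nu(dy)+\sqrt{M}\int_{\mathbb{R}^d}1_{\{\phi(x)\ge M\}}\eta(dx)\\&\le\int_{\mathbb{R}^d}\phi(y)1_{\{\phi(y)\ge\sqrt{M}\}}\nu(dy)+\frac{1}{\sqrt{M}}\int_{\mathbb{R}^d}\phi(y)\nu(dy).\end{aligned}$$
Hence $\lim_{M\to\infty}\sup_{\eta\le_{cx}\nu}\int_{\mathbb{R}^d}\phi(x)1_{\{\phi(x)\ge M\}}\eta(dx)=0$. In particular, the family $\{|x|\#\eta:\eta\le_{cx}\nu\}$ is uniformly integrable. When the sign of $\phi$ is not constant, we obtain a nonnegative convex function $\tilde\phi$ such that $\int_{\mathbb{R}^d}\tilde\phi(y)\nu(dy)<\infty$ by addition to $\phi$ of a suitable affine function $\psi$. The conclusion follows from the uniform integrability of both the families $\{\psi\#\eta:\eta\le_{cx}\nu\}$ and $\{\tilde\phi\#\eta:\eta\le_{cx}\nu\}$.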
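Proof of Theorem 2.11. Let $(\eta_n)_{n\in\mathbb{N}}$ be a sequence in $\mathcal{I}_\mu^\nu$ minimizing $\int_{\mathbb{R}^d}\phi(z)\eta(dz)$. For $n\in\mathbb{N}$, let $\mu(dx)q_n(x,dz)\in\Pi^{\mathrm{opt}}(\mu,\eta_n)$ and $\eta_n(dz)m_n(z,dy)$ be a martingale coupling between $\eta_n$ and $\nu$. By the second part in Proposition 2.7, $\mu(dx)q_nm_n(x,dy)\in\Pi^{\mathrm{opt}}(\mu,\nu)$. Up to extracting a subsequence, we may suppose that $(\mu(dx)q_n(x,dz)m_n(z,dy))_n$ converges weakly to $\mu(dx)r_\infty(x,dz,dy)$ where $\mu(dx)\int_{z\in\mathbb{R}^d}r_\infty(x,dz,dy)\in\Pi^{\mathrm{opt}}(\mu,\nu)$. Let $T_\infty(x)=\int_{\mathbb{R}^d\times\mathbb{R}^d}y\,r_\infty(x,dz,dy)$ and $\eta_\infty=T_\infty\#\mu$. By the first part of Proposition 2.7, $\eta_\infty\in\tilde{\mathcal{I}}_\mu^\nu$. Moreover, by the above weak convergence and the uniform integrability deduced from Lemma 2.13,
$$\int_{\mathbb{R}^d\times\mathbb{R}^d\times\mathbb{R}^d}\phi(z)\mu(dx)r_\infty(x,dz,dy)=\lim_{n\to\infty}\int_{\mathbb{R}^d}\phi(z)\eta_n(dz).$$
Taking the limit $n\to\infty$ in the equality $\int_{\mathbb{R}^d\times\mathbb{R}^d\times\mathbb{R}^d}\varphi(x,z)(y-z)\mu(dx)q_n(x,dz)m_n(z,dy)=0$, we obtain that $\int_{\mathbb{R}^d\times\mathbb{R}^d\times\mathbb{R}^d}\varphi(x,z)(y-z)\mu(dx)r_\infty(x,dz,dy)=0$ for any continuous and bounded function $\varphi:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$. Hence, for $(X,Z,Y)$ distributed according to $\mu(dx)r_\infty(x,dz,dy)$, $Z=\mathbb{E}[Y|(X,Z)]$ and $T_\infty(X)=\mathbb{E}[Y|X]=\mathbb{E}[\mathbb{E}[Y|(X,Z)]|X]=\mathbb{E}[Z|X]$. By using Jensen’s inequality for the conditional expectation, we get
$$\int_{\mathbb{R}^d}\phi(z)\eta_\infty(dz)\le\int_{\mathbb{R}^d\times\mathbb{R}^d\times\mathbb{R}^d}\phi(z)\mu(dx)r_\infty(x,dz,dy)=\lim_{n\to\infty}\int_{\mathbb{R}^d}\phi(z)\eta_n(dz).$$
Thus, $\eta_\infty$ satisfies $\int_{\mathbb{R}^d}\phi(z)\eta_\infty(dz)=\inf_{\eta\in\mathcal{I}_\mu^\nu}\int_{\mathbb{R}^d}\phi(z)\eta(dz)$. Hence $\mathcal{I}_{\mu,\phi}^\nu\neq\emptyset$.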
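Let $\eta\in\mathcal{I}_{\mu,\phi}^\nu$. We now check that $\eta\in\tilde{\mathcal{I}}_\mu^\nu$ and $\Pi^{\mathrm{opt}}(\mu,\eta)$ is a singleton. Let $\mu(dx)q(x,dz)\in\Pi^{\mathrm{opt}}(\mu,\eta)$ and $\eta(dz)m(z,dy)$ be a martingale coupling between $\eta$ and $\nu$. By the second assertion in Proposition 2.7, $\mu(dx)qm(x,dy)\in\Pi^{\mathrm{opt}}(\mu,\nu)$ and, by the first assertion, for $T(x)=\int_{\mathbb{R}^d}y\,qm(x,dy)$, $T\#\mu\in\tilde{\mathcal{I}}_\mu^\nu$. By the martingale property of $m$, $T(x)=\int_{\mathbb{R}^d}z\,q(x,dz)$ so that $T\#\mu\le_{cx}\eta$. Since $T\#\mu\in\mathcal{I}_\mu^\nu$ and $\eta\in\mathcal{I}_{\mu,\phi}^\nu$ implies that $\int_{\mathbb{R}^d}\phi(z)T\#\mu(dz)\ge\int_{\mathbb{R}^d}\phi(z)\eta(dz)$, we deduce with the strict convexity of $\phi$ that $\eta=T\#\mu$ and $\mu(dx)q(x,dz)=\mu(dx)\delta_{T(x)}(dz)$. Hence any coupling in $\Pi^{\mathrm{opt}}(\mu,\eta)$ is given by a map. By the second statement in Lemma 2.1, we conclude that this set is a singleton.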
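By repeating the first argument with $(\phi,\mathcal{I}_\mu^\nu)$ replaced by $(|x|^2,\mathcal{I}_{\mu,\phi}^\nu)$, we obtain the existence of $\eta_\phi\in\mathcal{I}_\mu^\nu$ such that $\int_{\mathbb{R}^d}|z|^2\eta_\phi(dz)\le\inf_{\eta\in\mathcal{I}_{\mu,\phi}^\nu}\int_{\mathbb{R}^d}|z|^2\eta(dz)$. Since the construction also reduces the integral of $\phi$, $\eta_\phi\in\mathcal{I}_{\mu,\phi}^\nu$. Let us now check that if $\tilde\eta\in\mathcal{I}_{\mu,\phi}^\nu$ is such that $\int_{\mathbb{R}^d}|z|^2\tilde\eta(dz)=\inf_{\eta\in\mathcal{I}_{\mu,\phi}^\nu}\int_{\mathbb{R}^d}|z|^2\eta(dz)$, then $\tilde\eta=\eta_\phi$. By the first statement, $\Pi^{\mathrm{opt}}(\mu,\eta_\phi)=\{(I_d,T_\phi)\#\mu\}$ and $\Pi^{\mathrm{opt}}(\mu,\tilde\eta)=\{(I_d,\tilde T)\#\mu\}$ for measurable transport maps $T_\phi$ and $\tilde T:\mathbb{R}^d\to\mathbb{R}^d$. One has $\int_{\mathbb{R}^d}|z|^2\eta_\phi(dz)=\int_{\mathbb{R}^d}|z|^2\tilde\eta(dz)$ and therefore, since $\eta_\phi,\tilde\eta\in\mathcal{I}_\mu^\nu$, $W_2^2(\mu,\eta_\phi)=W_2^2(\mu,\tilde\eta)$. Let now $\bar\eta=\frac{\eta_\phi+\tilde\eta}{2}$. One has $\int_{\mathbb{R}^d}|z|^2\bar\eta(dz)=\int_{\mathbb{R}^d}|z|^2\eta_\phi(dz)=\int_{\mathbb{R}^d}|z|^2\tilde\eta(dz)$. The coupling $\mu(dx)\frac12\left(\delta_{T_\phi(x)}(dz)+\delta_{\tilde T(x)}(dz)\right)$ between $\mu$ and $\bar\eta$ implies that $W_2^2(\mu,\bar\eta)\le W_2^2(\mu,\eta_\phi)=W_2^2(\mu,\tilde\eta)$. Since $\eta_\phi\in\mathcal{I}_\mu^\nu$, we deduce that
$$W_2^2(\mu,\nu)\ge W_2^2(\mu,\bar\eta)+\int_{\mathbb{R}^d}|y|^2\nu(dy)-\int_{\mathbb{R}^d}|z|^2\bar\eta(dz).$$
Moreover, $\bar\eta\le_{cx}\nu$ and combining a coupling in $\Pi^{\mathrm{opt}}(\mu,\bar\eta)$ with a martingale coupling between $\bar\eta$ and $\nu$, we deduce that the previous inequality is an equality so that $\bar\eta\in\mathcal{I}_\mu^\nu$ and $\mu(dx)\frac12\left(\delta_{T_\phi(x)}(dz)+\delta_{\tilde T(x)}(dz)\right)\in\Pi^{\mathrm{opt}}(\mu,\bar\eta)$. As $\eta_\phi,\tilde\eta\in\mathcal{I}_{\mu,\phi}^\nu$, $\int_{\mathbb{R}^d}\phi(z)\bar\eta(dz)=\inf_{\eta\in\mathcal{I}_\mu^\nu}\int_{\mathbb{R}^d}\phi(z)\eta(dz)$ and $\bar\eta\in\mathcal{I}_{\mu,\phi}^\nu$. By the first assertion, $\Pi^{\mathrm{opt}}(\mu,\bar\eta)=\{(I_d,\bar T)\#\mu\}$ for some measurable transport map $\bar T:\mathbb{R}^d\to\mathbb{R}^d$. Therefore $\mu(dx)$ a.e., $T_\phi(x)=\tilde T(x)$ and $\eta_\phi=\tilde\eta$. For the choice $\phi(x)=|x|^2$, we deduce that $\mathcal{I}_{\mu,|x|^2}^\nu$ is a singleton.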
From the equality $W_2^2(\mu,\nu)=W_2^2(\mu,\eta)+\int_{\mathbb{R}^d}|y|^2\nu(dy)-\int_{\mathbb{R}^d}|z|^2\eta(dz)$ valid for $\eta\in\mathcal{I}_\mu^\nu$, we see that minimizing $\int_{\mathbb{R}^d}|z|^2\eta(dz)$ over $\mathcal{I}_\mu^\nu$ is equivalent to minimizing $W_2^2(\mu,\eta)$. Therefore the probability measure $\underline{\eta}$ can be seen as the $W_2$-projection of $\mu$ on the set $\mathcal{I}_\mu^\nu$. It is in general different from the $W_2$-projection $\mu_{\mathcal{P}(\nu)}$ of $\mu$ on the set $\mathcal{P}(\nu):=\{\eta:\eta\le_{cx}\nu\}$, which has been studied recently in dimension $d=1$ by Gozlan et al. [13] and in general dimension $d$ by Alfonsi et al. [1] (who also give an explicit formula for the antiderivative of the quantile function of this projection when $d=1$), Alibert et al. [2], Gozlan and Juillet [12] and Backhoff-Veraguas et al. [5]. Notice that since $\mathcal{I}_\mu^\nu\subset\mathcal{P}(\nu)$, one always has $W_2(\mu,\mu_{\mathcal{P}(\nu)})\le W_2(\mu,\underline{\eta})$.
Example 2.14. For $\mu$ and $\nu$ the respective uniform distributions on $[0,1]$ and $[0,2]$, we have $\mathcal{I}_\mu^\nu=\{\nu\}$ and thus $\underline{\eta}=\nu$. By using the characterization in Theorem 2.6 [1], we obtain that the $W_2$-projection $\mu_{\mathcal{P}(\nu)}$ of $\mu$ on the set $\mathcal{P}(\nu)$ is the uniform distribution on $[1/2,3/2]$.
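The two distances compared in Example 2.14 can be computed in closed form from the one-dimensional formula $W_2^2(\mu,\nu)=\int_0^1(F_\mu^{-1}(u)-F_\nu^{-1}(u))^2du$, since the quantile function of the uniform distribution on $[a,b]$ is $u\mapsto a+(b-a)u$. The following Python sketch is an added illustration (the function name `w2_squared_uniform_intervals` is hypothetical); it integrates the square of the affine difference of quantile functions exactly.

```python
from fractions import Fraction


def w2_squared_uniform_intervals(first, second):
    """Squared W2 distance between the uniform laws on two intervals (a, b) and (c, d).

    The difference of the quantile functions is shift + slope * u, and
    the integral over [0, 1] of (shift + slope * u)^2 is
    shift^2 + shift * slope + slope^2 / 3.
    """
    (a, b), (c, d) = first, second
    shift = Fraction(c) - Fraction(a)
    slope = (Fraction(d) - Fraction(c)) - (Fraction(b) - Fraction(a))
    return shift ** 2 + shift * slope + slope ** 2 / 3


mu = (0, 1)
nu = (0, 2)
projection = (Fraction(1, 2), Fraction(3, 2))
print(w2_squared_uniform_intervals(mu, nu))          # squared distance to nu
print(w2_squared_uniform_intervals(mu, projection))  # squared distance to the projection
```

The second value is smaller than the first one, in line with the inequality $W_2(\mu,\mu_{\mathcal{P}(\nu)})\le W_2(\mu,\underline{\eta})$.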
The next example shows that the set
$$\left\{\eta_\phi:\ \phi:\mathbb{R}^d\to\mathbb{R}\ \text{strictly convex and such that}\ \int_{\mathbb{R}^d}\phi(y)\nu(dy)<\infty\right\}$$
may contain distinct elements.
Example 2.15. Let $\mu=\frac12(\delta_{(-1,0)}+\delta_{(1,0)})$ and $\nu=\frac14(\delta_{(-1,-1)}+\delta_{(0,-1)}+\delta_{(0,1)}+\delta_{(1,1)})$. Any optimal coupling between $\mu$ and $\nu$ can be written as $\mu(dx)k_p(x,dy)$ with $k_p((-1,0),dy)=\frac12(\delta_{(-1,-1)}+p\,\delta_{(0,-1)}+(1-p)\delta_{(0,1)})(dy)$ and $k_p((1,0),dy)=\frac12(\delta_{(1,1)}+(1-p)\delta_{(0,-1)}+p\,\delta_{(0,1)})(dy)$ for $p\in[0,1]$. One has $T_p((-1,0))=(-1/2,-p)$ and $T_p((1,0))=(1/2,p)$. The measures $\eta_p=\frac12(\delta_{(-1/2,-p)}+\delta_{(1/2,p)})$ are not comparable for the convex order since for $p\neq p'$ there is no martingale coupling between $\eta_p$ and $\eta_{p'}$. Moreover, for each $p\in[0,1]$ the unique optimal transport plan $\frac12\left(\delta_{((-1,0),(-1/2,-p))}+\delta_{((1,0),(1/2,p))}\right)$ between $\mu$ and $\eta_p$ is given by a map. For this example, $\underline{\eta}=\eta_0=\frac12\left(\delta_{(-1/2,0)}+\delta_{(1/2,0)}\right)$ and $\eta_p=\eta_{\phi_p}$, with $\phi_p(x)=x_1^2+(x_2-2px_1)^2$. The $W_2$-optimal couplings between $\underline{\eta}$ and $\nu$ can be written as $\eta_0(dz)k_p(2z,dy)$ for $p\in[0,1]$ and in particular the unique martingale coupling $\eta_0(dz)k_0(2z,dy)$ is optimal.
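The identification $\eta_p=\eta_{\phi_p}$ can be illustrated numerically. Since every optimal coupling is of the form $\mu(dx)k_q(x,dy)$, the minimizers of $\int\phi_p\,d\eta$ over $\mathcal{I}_\mu^\nu$, which belong to $\tilde{\mathcal{I}}_\mu^\nu$ by Theorem 2.11, are to be found among the measures $\eta_q$, $q\in[0,1]$. The Python sketch below is an added illustration with hypothetical helper names; as a simplifying assumption it only scans a finite grid of values of $q$ and checks that the minimum of $q\mapsto\int\phi_p\,d\eta_q$ is attained at $q=p$.

```python
from fractions import Fraction


def phi(p, x):
    """The strictly convex function phi_p(x) = x1^2 + (x2 - 2 p x1)^2."""
    return x[0] ** 2 + (x[1] - 2 * p * x[0]) ** 2


def eta_atoms(q):
    """Atoms of eta_q, each carrying mass 1/2."""
    half = Fraction(1, 2)
    return [(-half, -q), (half, q)]


def integral_phi(p, q):
    """Integral of phi_p against eta_q."""
    return Fraction(1, 2) * sum(phi(p, z) for z in eta_atoms(q))


def argmin_q(p, grid):
    """Grid point q minimizing the integral of phi_p against eta_q."""
    return min(grid, key=lambda q: integral_phi(p, q))


grid = [Fraction(k, 10) for k in range(11)]
for p in grid:
    print(p, argmin_q(p, grid), integral_phi(p, p))
```

A direct computation gives $\int\phi_p\,d\eta_q=\frac14+(q-p)^2$, which explains why the minimizer is $q=p$.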
The last example shows that, unlike in the previous one, the martingale couplings between $\underline{\eta}$ and $\nu$ are not necessarily $W_2$-optimal (even when $\Pi^{\mathrm{opt}}(\mu,\nu)$ is a singleton).
Example 2.16. Let $\mu=\frac12\left(\delta_{(-1,0)}+\delta_{(1,0)}\right)$, $\nu_a=\frac14\left(\delta_{(-1,-1)}+\delta_{(-1,2a+1)}+\delta_{(1,-2a-1)}+\delta_{(1,1)}\right)$ with $a\in\mathbb{R}$. The unique $W_2$-optimal coupling between $\mu$ and $\nu_a$ is $\mu(dx)k_a(x,dy)$ with $k_a((-1,0),dy)=\frac12(\delta_{(-1,-1)}+\delta_{(-1,2a+1)})(dy)$ and $k_a((1,0),dy)=\frac12(\delta_{(1,-2a-1)}+\delta_{(1,1)})(dy)$ so that $\eta_a=\frac12\left(\delta_{(-1,a)}+\delta_{(1,-a)}\right)$. Since $|(-1,-1)-(-1,a)|^2-|(1,1)-(-1,a)|^2=(a+1)^2-4-(a-1)^2=4(a-1)$, for $a>1$,
$$W_2^2(\eta_a,\nu_a)=\frac12\left((a+1)^2+4+(a-1)^2\right)<(a+1)^2=\frac12\left(3+(2a+1)^2\right)-(1+a^2)=\int|y|^2\nu_a(dy)-\int|z|^2\eta_a(dz),$$
so that the martingale coupling between $\eta_a$ and $\nu_a$ is not $W_2$-optimal.
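The strict inequality of Example 2.16 can be confirmed by brute force. In the Python sketch below, added as an illustration, each atom of $\eta_a$ (mass $1/2$) is split into two copies of mass $1/4$, so that $\eta_a$ and $\nu_a$ become uniform measures on four points; the optimal cost between two such measures is attained at a permutation, which allows an exhaustive search. The helper names `w2_squared_uniform` and `example` are hypothetical, and $a$ is taken to be an integer only to keep the arithmetic exact.

```python
from fractions import Fraction
from itertools import permutations


def sq_dist(u, v):
    return sum((s - t) ** 2 for s, t in zip(u, v))


def w2_squared_uniform(sources, targets):
    """Squared W2 distance between uniform measures on two lists of equal length.

    Repeated points are allowed. The minimum over couplings is attained at a
    permutation, so an exhaustive search over permutations is exact.
    """
    n = len(sources)
    best = min(
        sum(sq_dist(s, targets[j]) for s, j in zip(sources, perm))
        for perm in permutations(range(n))
    )
    return Fraction(best, n)


def example(a):
    """Return (W2^2(eta_a, nu_a), second moment of nu_a minus second moment of eta_a)."""
    eta = [(-1, a), (-1, a), (1, -a), (1, -a)]  # each atom of eta_a split in two
    nu = [(-1, -1), (-1, 2 * a + 1), (1, -2 * a - 1), (1, 1)]
    origin = (0, 0)
    gap = Fraction(sum(sq_dist(y, origin) for y in nu), 4) - Fraction(
        sum(sq_dist(z, origin) for z in eta), 4
    )
    return w2_squared_uniform(eta, nu), gap


for a in range(0, 5):
    print(a, *example(a))
```

The second printed column is the cost of the martingale coupling; it coincides with the first one for $a\in\{0,1\}$ and is strictly larger for the tested values $a>1$.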
3. Differentiability of the squared quadratic Wasserstein distance
We now present the notion of differentiability introduced by Lions [15]. Let $f:\mathcal{P}_2(\mathbb{R}^d)\to\mathbb{R}$. We consider an atomless probability space $(\Omega,\mathcal{A},\mathbb{P})$ and denote by $L^2(\Omega,\mathbb{P};\mathbb{R}^d)$ the set of $\mathbb{R}^d$-valued square integrable random variables on this space. The lift of the function $f$ on $L^2(\Omega,\mathbb{P};\mathbb{R}^d)$ is the function $F:L^2(\Omega,\mathbb{P};\mathbb{R}^d)\to\mathbb{R}$ such that
$$\forall X\in L^2(\Omega,\mathbb{P};\mathbb{R}^d),\quad F(X)=f(\mathcal{L}(X)),$$
where $\mathcal{L}(X)\in\mathcal{P}_2(\mathbb{R}^d)$ is the probability distribution of $X$. The atomless property is equivalent to the existence of a random variable $U:\Omega\to\mathbb{R}$ uniformly distributed on $[0,1]$ (see e.g. [9], Prop. A.27). By the fundamental theorem of simulation (see e.g. Bouleau and Lépingle [6], Thm. A.3.1, p. 38), it ensures the existence on $(\Omega,\mathcal{A},\mathbb{P})$ of a random variable distributed according to each probability measure on each Polish space, and in particular of $X:\Omega\to\mathbb{R}^d$ distributed according to $\mu$, for each $\mu\in\mathcal{P}_2(\mathbb{R}^d)$.
Definition 3.1. A function $f:\mathcal{P}_2(\mathbb{R}^d)\to\mathbb{R}$ is L-differentiable at $\mu\in\mathcal{P}_2(\mathbb{R}^d)$ if there exists $X\in L^2(\Omega,\mathbb{P};\mathbb{R}^d)$ such that $X\sim\mu$ and $F$ is Fréchet differentiable at $X$.
Let f :P2(Rd)→R and F(X) =f(L(X)) for X ∈L2(Ω,P;Rd). The Fr´echet differentiability of F at X amounts to the existence of a bounded linear operatorDFX:L2(Ω,P;Rd)→Rsuch thatF(X+Y) =F(X) +