HAL Id: hal-03197398
https://hal.archives-ouvertes.fr/hal-03197398v2
Preprint submitted on 16 Apr 2021
Gromov-Wasserstein Distances between Gaussian Distributions
Antoine Salmona¹, Julie Delon², Agnès Desolneux¹
¹ ENS Paris-Saclay, CNRS, Centre Borelli UMR 9010
² Université de Paris, CNRS, MAP5 UMR 8145 and Institut Universitaire de France
April 16, 2021
Abstract
The Gromov-Wasserstein distances were proposed a few years ago to compare distributions which do not lie in the same space. In particular, they offer an interesting alternative to the Wasserstein distances for comparing probability measures living on Euclidean spaces of different dimensions. In this paper, we focus on the Gromov-Wasserstein distance with a ground cost defined as the squared Euclidean distance and we study the form of the optimal plan between Gaussian distributions. We show that when the optimal plan is restricted to Gaussian distributions, the problem has a very simple linear solution, which is also a solution of the linear Gromov-Monge problem. We also study the problem without restriction on the optimal plan, and provide lower and upper bounds for the value of the Gromov-Wasserstein distance between Gaussian distributions.
Keywords— optimal transport, Wasserstein distance, Gromov-Wasserstein distance, Gaussian distributions.
MSC 2020 subject classifications: 60E99, 68T09, 62H25, 49Q22.
1 Introduction
Optimal transport (OT) theory has become a major tool to compare probability distributions. It has been increasingly used over the past years in various applied fields such as economics [11], image processing [20, 21], machine learning [4, 5] and, more generally, data science [18], with applications to domain adaptation [9] or generative models [3, 12], to name just a few.
Given two probability distributions µ and ν on two Polish spaces (X, d_X) and (Y, d_Y) and a positive lower semi-continuous cost function c : X × Y → R_+, optimal transport focuses on solving the following optimization problem

inf_{π ∈ Π(µ,ν)} ∫_{X×Y} c(x, y) dπ(x, y),   (1.1)

where Π(µ, ν) is the set of measures on X × Y with marginals µ and ν. When X and Y are equal and Euclidean, typically R^d, and c(x, y) = ‖x − y‖^p with p ≥ 1, Equation (1.1) induces a distance over the set of measures with finite moment of order p, known as the p-Wasserstein distance W_p:

W_p(µ, ν) = ( inf_{π ∈ Π(µ,ν)} ∫_{R^d×R^d} ‖x − y‖^p dπ(x, y) )^{1/p},   (1.2)

or equivalently

W_p^p(µ, ν) = inf_{X∼µ, Y∼ν} E[ ‖X − Y‖^p ],   (1.3)
∗The authors acknowledge support from the French Research Agency through the PostProdLEAP project (ANR-19-CE23-0027-01) and the MISTIC project (ANR-19-CE40-005).
where the notation X ∼ µ means that X is a random variable with probability distribution µ. It is known that Equation (1.1) always admits a solution [27, 26, 22], i.e. the infimum is always reached.
Moreover, in the case of W_2, it is known [6] that if µ is absolutely continuous, then the optimal transport plan π* is unique and has the form π* = (Id, T)#µ, where # is the push-forward operator and T : R^d → R^d is an application called the optimal transport map, satisfying T#µ = ν. The 2-Wasserstein distance W_2 admits a closed-form expression [10, 24] when µ = N(m_0, Σ_0) and ν = N(m_1, Σ_1) are two Gaussian measures with means m_0 ∈ R^d, m_1 ∈ R^d and covariance matrices Σ_0 ∈ R^{d×d} and Σ_1 ∈ R^{d×d}, that is given by

W_2²(µ, ν) = ‖m_1 − m_0‖² + tr( Σ_0 + Σ_1 − 2 ( Σ_0^{1/2} Σ_1 Σ_0^{1/2} )^{1/2} ),   (1.4)

where for any symmetric positive semi-definite M, M^{1/2} is the unique symmetric positive semi-definite square root of M. Moreover, if Σ_0 is non-singular, then the optimal transport map T is affine and is given by

∀x ∈ R^d,  T(x) = m_1 + Σ_0^{-1/2} ( Σ_0^{1/2} Σ_1 Σ_0^{1/2} )^{1/2} Σ_0^{-1/2} (x − m_0),   (1.5)

and the corresponding optimal transport plan π* is a degenerate Gaussian measure.
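As an illustration, the closed forms (1.4) and (1.5) are straightforward to evaluate numerically. The following is a minimal sketch, assuming NumPy and SciPy are available (the function names are ours, for illustration only):

import numpy as np
from scipy.linalg import sqrtm

def w2_squared_gaussian(m0, S0, m1, S1):
    # Closed-form W_2^2 between N(m0, S0) and N(m1, S1), cf. Equation (1.4).
    S0_half = np.real(sqrtm(S0))                     # unique PSD square root of S0
    cross = np.real(sqrtm(S0_half @ S1 @ S0_half))   # (S0^{1/2} S1 S0^{1/2})^{1/2}
    return float(np.sum((m1 - m0) ** 2) + np.trace(S0 + S1 - 2.0 * cross))

def w2_map(x, m0, S0, m1, S1):
    # Affine optimal map T of Equation (1.5); assumes S0 is non-singular.
    S0_half = np.real(sqrtm(S0))
    S0_half_inv = np.linalg.inv(S0_half)
    A = S0_half_inv @ np.real(sqrtm(S0_half @ S1 @ S0_half)) @ S0_half_inv
    return m1 + A @ (x - m0)

# sanity check in dimension 1: W_2^2(N(0,1), N(0,4)) = (2 - 1)^2 = 1
print(w2_squared_gaussian(np.zeros(1), np.eye(1), np.zeros(1), 4 * np.eye(1)))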
For some applications such as shape matching or word embedding, an important limitation of classic OT lies in the fact that it is not invariant to rotations and translations, and more generally to isometries. Moreover, OT requires that we can define a relevant cost function to compare the spaces X and Y. Thus, when for instance µ is a measure on R² and ν a measure on R³, it is not straightforward to design a cost function c : R² × R³ → R, and so one cannot easily define an OT distance to compare µ with ν. To overcome these limitations, several extensions of OT have been proposed [1, 7, 17]. Among them, the most famous one is probably the Gromov-Wasserstein (GW) problem [16]: given two Polish spaces (X, d_X) and (Y, d_Y), each endowed respectively with probability measures µ and ν, and given two measurable functions c_X : X × X → R and c_Y : Y × Y → R, it aims at finding

GW_p(c_X, c_Y, µ, ν) = ( inf_{π ∈ Π(µ,ν)} ∫_{X²×Y²} |c_X(x, x') − c_Y(y, y')|^p dπ(x, y) dπ(x', y') )^{1/p},   (1.6)

with p ≥ 1. As for classic OT, it can be shown that Equation (1.6) always admits a solution (see [25]). The GW problem can be seen as a quadratic optimization problem in π, as opposed to OT, which is a linear optimization problem in π. It induces a distance over the space of metric measure spaces (i.e. the triplets (X, d_X, µ)) quotiented by the strong isomorphisms [18]¹. The fundamental metric properties of GW_p have been studied in depth in [23, 16, 8]. In the Euclidean setting, when X = R^m and Y = R^n, with m not necessarily equal to n, and for the natural choice of costs c_X = ‖.‖²_{R^m} and c_Y = ‖.‖²_{R^n}, where ‖.‖_{R^m} denotes the Euclidean norm on R^m, it can be easily shown that GW_2(c_X, c_Y, µ, ν) is invariant to isometries. With a slight abuse of notation, we will write in the following GW_2(µ, ν) instead of GW_2(‖.‖²_{R^m}, ‖.‖²_{R^n}, µ, ν).
In this work, we focus on the Gromov-Wasserstein problem between Gaussian measures. Given µ = N(m_0, Σ_0), with mean m_0 ∈ R^m and covariance matrix Σ_0 ∈ R^{m×m}, and ν = N(m_1, Σ_1), with mean m_1 ∈ R^n and covariance matrix Σ_1 ∈ R^{n×n}, we aim to solve

GW_2²(µ, ν) = inf_{π ∈ Π(µ,ν)} ∫∫ ( ‖x − x'‖²_{R^m} − ‖y − y'‖²_{R^n} )² dπ(x, y) dπ(x', y'),   (GW)

or equivalently

GW_2²(µ, ν) = inf_{(X, X', Y, Y') ∼ π⊗π} E[ ( ‖X − X'‖²_{R^m} − ‖Y − Y'‖²_{R^n} )² ],   (1.7)
¹We say that (X, d_X, µ) is strongly isomorphic to (Y, d_Y, ν) if there exists a bijection φ : X → Y such that φ is an isometry (d_Y(φ(x), φ(x')) = d_X(x, x') for all x, x' ∈ X) and φ#µ = ν.
where for x, x' ∈ R^m and y, y' ∈ R^n, (π ⊗ π)(x, x', y, y') = π(x, y) π(x', y'). In particular, can we find formulas equivalent to (1.4) and (1.5) in the case of Gromov-Wasserstein? In Section 2, we derive an equivalent formulation of the Gromov-Wasserstein problem. This formulation is not specific to Gaussian measures but holds for all measures with a finite fourth-order moment. It takes the form of a sum of two terms depending respectively on the co-moments of order 2 and 4 of π. Then in Section 3, we derive a lower bound by simply optimizing both terms separately. In Section 4, we show that the problem restricted to Gaussian optimal plans admits an explicit solution, and that this solution is closely related to Principal Component Analysis (PCA). In Section 5, we study the tightness of the bounds found in the previous sections and we exhibit a particular case where we are able to compute exactly the value of GW_2²(µ, ν) and the optimal plan π* which achieves it. Finally, Section 6 discusses the form of the solution in the general case, and the possibility that the optimal plan between two Gaussian distributions is always Gaussian.
Notations
We define in the following some of the notations that will be used in the paper.
• The notation Y ∼ µ means that Y is a random variable with probability distribution µ.
• If µ is a positive measure on X and T : X → Y is an application, T#µ stands for the push-forward measure of µ by T, i.e. the measure on Y such that for every measurable set A ⊆ Y, (T#µ)(A) = µ(T^{-1}(A)).
• If X and Y are random vectors on R^m and R^n, we denote Cov(X, Y) the matrix of size m × n of the form E[(X − E[X])(Y − E[Y])^T].
• The notation tr(M) denotes the trace of a matrix M.
• ‖M‖_F stands for the Frobenius norm of a matrix M, i.e. ‖M‖_F = √(tr(M^T M)).
• rk(M) stands for the rank of a matrix M.
• I_n is the identity matrix of size n.
• Ĩ_n stands for any matrix of size n of the form diag((±1)_{i≤n}).
• Suppose n ≤ m. For A ∈ R^{m×m}, we denote A^{(n)} ∈ R^{n×n} the submatrix containing the first n rows and the first n columns of A.
• Suppose n ≤ m. For A ∈ R^{n×n}, we denote A^{[m]} ∈ R^{m×m} the matrix of the form ( A  0 ; 0  0 ).
• We denote S_n(R) the set of symmetric matrices of size n, S_n^+(R) the set of positive semi-definite matrices, and S_n^{++}(R) the set of positive definite matrices.
• 1_{n,m} = (1)_{i≤n, j≤m} denotes the matrix of ones with n rows and m columns.
• ‖x‖_{R^n} stands for the Euclidean norm of x ∈ R^n. We will write ‖x‖ when there is no ambiguity about the dimension.
• ⟨x, x'⟩_n stands for the Euclidean inner product in R^n between x and x'.
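Two of these notations, A^{(n)} and A^{[m]}, are used repeatedly in what follows; as a reading aid, here is a minimal NumPy sketch of what they mean (illustrative helpers, not part of the paper):

import numpy as np

def upper_left_block(A, n):
    # A^{(n)}: the submatrix of the first n rows and first n columns of A (n <= m).
    return A[:n, :n]

def zero_padded(A, m):
    # A^{[m]}: the n x n matrix A embedded in the upper-left corner of an m x m zero matrix.
    n = A.shape[0]
    B = np.zeros((m, m), dtype=A.dtype)
    B[:n, :n] = A
    return B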
2 Derivation of the general problem
In this section, we derive an equivalent² formulation of problem (GW) which takes the form of a functional of the co-moments of order 2 and 4 of π. This formulation is not specific to Gaussian measures but holds for all measures with a finite fourth-order moment.
Theorem 2.1. Let µ be a probability measure on R^m with mean vector m_0 ∈ R^m and covariance matrix Σ_0 ∈ R^{m×m} such that ∫ ‖x‖⁴ dµ < +∞, and let ν be a probability measure on R^n with mean vector m_1 and covariance matrix Σ_1 ∈ R^{n×n} such that ∫ ‖y‖⁴ dν < +∞. Let P_0, D_0 and P_1, D_1 be respective diagonalizations of Σ_0 (= P_0 D_0 P_0^T) and Σ_1 (= P_1 D_1 P_1^T). Let us define T_0 : x ∈ R^m ↦ P_0^T (x − m_0) and T_1 : y ∈ R^n ↦ P_1^T (y − m_1). Then problem (GW) is equivalent to the problem

sup_{X ∼ T_0#µ, Y ∼ T_1#ν}  Σ_{i,j} Cov(X_i², Y_j²) + 2 ‖Cov(X, Y)‖_F²,   (supCOV)

where X = (X_1, X_2, ..., X_m)^T, Y = (Y_1, Y_2, ..., Y_n)^T, and ‖.‖_F is the Frobenius norm.
This theorem is a direct consequence of the following two intermediary results.
Lemma 2.1. We denote O_m = {O ∈ R^{m×m} | O^T O = I_m} the set of orthogonal matrices of size m. Let µ and ν be two probability measures on R^m and R^n. Let T_m : x ↦ O_m x + x_m and T_n : y ↦ O_n y + y_n be two affine applications with x_m ∈ R^m, O_m ∈ O_m, y_n ∈ R^n, and O_n ∈ O_n. Then GW_2(T_m#µ, T_n#ν) = GW_2(µ, ν).
Lemma 2.2 (Vayer, 2020, [25]). Suppose there exist scalars a, b, c such that c_X(x, x') = a‖x‖²_{R^m} + b‖x'‖²_{R^m} + c⟨x, x'⟩_m, where ⟨., .⟩_m denotes the inner product on R^m, and c_Y(y, y') = a‖y‖²_{R^n} + b‖y'‖²_{R^n} + c⟨y, y'⟩_n. Let µ and ν be two probability measures respectively on R^m and R^n. Then

GW_2²(c_X, c_Y, µ, ν) = C_{µ,ν} − 2 sup_{π ∈ Π(µ,ν)} Z(π),   (2.1)

where C_{µ,ν} = ∫ c_X² dµ dµ + ∫ c_Y² dν dν − 4ab ∫ ‖x‖²_{R^m} ‖y‖²_{R^n} dµ(x) dν(y) and

Z(π) = (a² + b²) ∫ ‖x‖²_{R^m} ‖y‖²_{R^n} dπ(x, y) + c² ‖ ∫ x y^T dπ(x, y) ‖_F²
       + (a + b) c ∫ ( ‖x‖²_{R^m} ⟨E_{Y∼ν}[Y], y⟩_n + ‖y‖²_{R^n} ⟨E_{X∼µ}[X], x⟩_m ) dπ(x, y).   (2.2)
Proof of Theorem 2.1. Using Lemma 2.1, we can focus without any loss of generality on centered measures with diagonal covariance matrices. Thus, defining T_0 : x ∈ R^m ↦ P_0^T (x − m_0) and T_1 : y ∈ R^n ↦ P_1^T (y − m_1) and then applying Lemma 2.2 to GW_2(T_0#µ, T_1#ν) with a = 1, b = 1, and c = −2, while remarking that the last term in Equation (2.2) is null because E_{X ∼ T_0#µ}[X] = 0 and E_{Y ∼ T_1#ν}[Y] = 0, it follows that problem (GW) is equivalent to

sup_{π ∈ Π(T_0#µ, T_1#ν)}  ∫ ‖x‖²_{R^m} ‖y‖²_{R^n} dπ(x, y) + 2 ‖ ∫ x y^T dπ(x, y) ‖_F².   (2.3)

Since T_0#µ and T_1#ν are centered, we have ∫ x y^T dπ(x, y) = Cov(X, Y) where X ∼ T_0#µ and Y ∼ T_1#ν. Furthermore, it can be easily computed that

∫ ‖x‖²_{R^m} ‖y‖²_{R^n} dπ(x, y) = Σ_{i,j} Cov(X_i², Y_j²) + Σ_{i,j} E[X_i²] E[Y_j²].   (2.4)

Since the second term does not depend on π, we get that problem (GW) is equivalent to problem (supCOV).
²We say that two optimization problems are equivalent if the solutions of one are readily obtained from the solutions of the other, and vice versa.
The left-hand term of (supCOV) is closely related to the sum of symmetric co-kurtoses and thus depends on the co-moments of order 4 of π. On the other hand, the right-hand term is directly related to the co-moments of order 2 of π. For this reason, problem (supCOV) is hard to solve, because it requires optimizing simultaneously the co-moments of order 2 and 4 of π, and thus knowing the probabilistic rule which links them. This rule is well known when π is Gaussian (Isserlis lemma), but not in the general case to the best of our knowledge, and there is no reason for the solution of problem (supCOV) to be Gaussian.
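To make the functional in (supCOV) concrete, the following sketch (assuming NumPy) estimates its two terms from paired samples of a given coupling π; the deterministic coupling used at the end, where Y consists of the first two coordinates of X, is purely a hypothetical illustration:

import numpy as np

def supcov_value(X, Y):
    # Empirical value of the (supCOV) functional for paired samples of (X, Y) ~ pi.
    # X has shape (k, m), Y has shape (k, n); returns
    # sum_{i,j} Cov(X_i^2, Y_j^2) + 2 * ||Cov(X, Y)||_F^2.
    k = X.shape[0]
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    cov_xy = Xc.T @ Yc / k                                    # co-moments of order 2
    X2, Y2 = X ** 2, Y ** 2
    cov_sq = (X2 - X2.mean(0)).T @ (Y2 - Y2.mean(0)) / k      # co-moments of order 4
    return cov_sq.sum() + 2.0 * np.linalg.norm(cov_xy, 'fro') ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 3)) * np.sqrt([3.0, 2.0, 1.0])   # X ~ N(0, diag(3, 2, 1))
Y = X[:, :2]                                                    # a simple deterministic coupling
# By the Isserlis lemma, the value is 2*(9 + 4) + 2*(9 + 4) = 52 for this Gaussian coupling.
print(supcov_value(X, Y))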
3 Study of the general problem
Since problem (supCOV) is hard to solve because of its dependence on the co-moments of order 2 and 4 of π, one can optimize both terms separately in order to find a lower bound of GW_2(µ, ν). In the rest of the paper, we suppose for convenience and without any loss of generality that n ≤ m.
Theorem 3.1. Suppose without any loss of generality that n ≤ m. Let µ = N(m_0, Σ_0) and ν = N(m_1, Σ_1) be two Gaussian measures on R^m and R^n. Let P_0, D_0 and P_1, D_1 be the respective diagonalizations of Σ_0 (= P_0 D_0 P_0^T) and Σ_1 (= P_1 D_1 P_1^T) which sort the eigenvalues in decreasing order. We suppose that Σ_0 is non-singular. A lower bound for GW_2(µ, ν) is then

GW_2²(µ, ν) ≥ LGW_2²(µ, ν),   (3.1)

where

LGW_2²(µ, ν) = 4 (tr(D_0) − tr(D_1))² + 4 (‖D_0‖_F − ‖D_1‖_F)² + 4 ‖D_0^{(n)} − D_1‖_F² + 4 ( ‖D_0‖_F² − ‖D_0^{(n)}‖_F² ).   (LGW)
The proof of this theorem is divided into smaller intermediary results. First we recall the Isserlis lemma (see [14]), which allows one to derive the co-moments of order 4 of a Gaussian distribution as a function of its co-moments of order 2.
Lemma 3.1 (Isserlis, 1918, [14]). Let X be a zero-mean Gaussian vector of size n. Then

∀ i, j, k, l ≤ n,  E[X_i X_j X_k X_l] = E[X_i X_j] E[X_k X_l] + E[X_i X_k] E[X_j X_l] + E[X_i X_l] E[X_j X_k].   (3.2)
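The relation (3.2) is easy to check by simulation; here is a small Monte Carlo sketch (assuming NumPy), given purely as an illustration:

import numpy as np

rng = np.random.default_rng(0)
n = 3
A = rng.normal(size=(n, n))
Sigma = A @ A.T                       # an arbitrary covariance matrix
X = rng.multivariate_normal(np.zeros(n), Sigma, size=500_000)

i, j, k, l = 0, 1, 2, 0
empirical = np.mean(X[:, i] * X[:, j] * X[:, k] * X[:, l])
isserlis = Sigma[i, j] * Sigma[k, l] + Sigma[i, k] * Sigma[j, l] + Sigma[i, l] * Sigma[j, k]
print(empirical, isserlis)            # the two values agree up to Monte Carlo error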
Then we derive the following general optimization lemmas. The proofs of these two lemmas are postponed to the Appendix (Section 8).
Lemma 3.2. Suppose that n ≤ m. Let Σ be a positive semi-definite matrix of size m + n of the form

Σ = ( Σ_0  K ; K^T  Σ_1 ),

with Σ_0 ∈ S_m^{++}(R), Σ_1 ∈ S_n^+(R) and K ∈ R^{m×n}. Let P_0, D_0 and P_1, D_1 be the respective diagonalizations of Σ_0 (= P_0^T D_0 P_0) and Σ_1 (= P_1^T D_1 P_1) which sort the eigenvalues in decreasing order. Then

max_{Σ_1 − K^T Σ_0^{-1} K ∈ S_n^+(R)} ‖K‖_F² = tr(D_0^{(n)} D_1),   (3.3)

and the maximum is achieved at any

K* = P_0^T ( Ĩ_n (D_0^{(n)})^{1/2} D_1^{1/2} ; 0_{m−n,n} ) P_1,   (opKl2)

where Ĩ_n is of the form diag((±1)_{i≤n}).
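As a quick numerical illustration of Lemma 3.2 in the diagonal case (taking P_0 = P_1 = I and Ĩ_n = I_n), the following sketch (assuming NumPy) checks that K* given by (opKl2) is feasible, i.e. that Σ_1 − K*^T Σ_0^{-1} K* is positive semi-definite, and that ‖K*‖_F² equals tr(D_0^{(n)} D_1):

import numpy as np

m, n = 4, 2
d0 = np.array([5.0, 3.0, 2.0, 1.0])            # eigenvalues of Sigma_0, decreasing
d1 = np.array([2.5, 0.5])                      # eigenvalues of Sigma_1, decreasing
Sigma0, Sigma1 = np.diag(d0), np.diag(d1)

K = np.zeros((m, n))                           # K* of (opKl2) with P_0 = P_1 = I
K[:n, :n] = np.diag(np.sqrt(d0[:n] * d1))

schur = Sigma1 - K.T @ np.linalg.inv(Sigma0) @ K
print(np.linalg.eigvalsh(schur))               # nonnegative up to numerical precision: K* is feasible
print(np.linalg.norm(K, 'fro') ** 2, np.sum(d0[:n] * d1))   # both equal tr(D_0^(n) D_1)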
Lemma 3.3. Suppose that n ≤ m. Let Σ be a positive semi-definite matrix of size m + n of the form

Σ = ( Σ_0  K ; K^T  Σ_1 ),

where Σ_0 ∈ S_m^{++}(R), Σ_1 ∈ S_n^+(R), and K ∈ R^{m×n}. Let A ∈ R^{n×m} be a matrix of rank 1. Then

max_{Σ_1 − K^T Σ_0^{-1} K ∈ S_n^+(R)} tr(KA) = √( tr(A Σ_0 A^T Σ_1) ).   (3.4)

In particular, if Σ_0 = diag(α) and Σ_1 = diag(β) with α ∈ R^m and β ∈ R^n, then

max_{Σ_1 − K^T Σ_0^{-1} K ∈ S_n^+(R)} tr(K 1_{n,m}) = √( tr(Σ_0) tr(Σ_1) ),   (3.5)

with 1_{n,m} = (1)_{i≤n, j≤m}, and the maximum is achieved at

K* = α β^T / √( tr(Σ_0) tr(Σ_1) ).   (opKl1)
Proof of Theorem 3.1. For µ = N(m_0, Σ_0) and ν = N(m_1, Σ_1), we denote P_0, D_0 and P_1, D_1 the respective diagonalizations of Σ_0 and Σ_1 which sort the eigenvalues in decreasing order. Let T_0 : x ∈ R^m ↦ P_0^T (x − m_0) and T_1 : y ∈ R^n ↦ P_1^T (y − m_1). For π ∈ Π(T_0#µ, T_1#ν) and (X, Y) ∼ π, we denote Σ the covariance matrix of π and Σ̃ the covariance matrix of (X², Y²), with X² := ([XX^T]_{i,i})_{i≤m} and Y² := ([YY^T]_{j,j})_{j≤n}. Using the Isserlis lemma to compute Cov(X², X²) and Cov(Y², Y²), it follows that Σ and Σ̃ are of the form

Σ = ( D_0  K ; K^T  D_1 )   and   Σ̃ = ( 2D_0²  K̃ ; K̃^T  2D_1² ).   (3.6)

In order to find a supremum for each term of (supCOV), we use a necessary condition for π to be in Π(T_0#µ, T_1#ν), which is that Σ and Σ̃ must be positive semi-definite. To do so, we can use the equivalent condition that the Schur complements of Σ and Σ̃, namely D_1 − K^T D_0^{-1} K and 2D_1² − (1/2) K̃^T D_0^{-2} K̃, must also be positive semi-definite. Remarking that the left-hand term in (supCOV) can be rewritten tr(K̃ 1_{n,m}), we have the two following inequalities

sup_{X ∼ T_0#µ, Y ∼ T_1#ν} Σ_{i,j} Cov(X_i², Y_j²) ≤ max_{2D_1² − (1/2) K̃^T D_0^{-2} K̃ ∈ S_n^+(R)} tr(K̃ 1_{n,m}),   (3.7)

and

sup_{X ∼ T_0#µ, Y ∼ T_1#ν} ‖Cov(X, Y)‖_F² ≤ max_{D_1 − K^T D_0^{-1} K ∈ S_n^+(R)} ‖K‖_F².   (3.8)

Applying Lemmas 3.2 and 3.3 to the two right-hand terms, we get on one hand

sup_{X ∼ T_0#µ, Y ∼ T_1#ν} ‖Cov(X, Y)‖_F² ≤ tr(D_0^{(n)} D_1),   (3.9)

and on the other hand

sup_{X ∼ T_0#µ, Y ∼ T_1#ν} Σ_{i,j} Cov(X_i², Y_j²) ≤ 2 √( tr(D_0²) tr(D_1²) ) = 2 ‖D_0‖_F ‖D_1‖_F.   (3.10)

Furthermore, using Lemma 2.2, it follows that

GW_2²(µ, ν) = C_{µ,ν} − 4 sup_{X ∼ T_0#µ, Y ∼ T_1#ν} ( Σ_{i,j} Cov(X_i², Y_j²) + Σ_{i,j} E[X_i²] E[Y_j²] + 2 ‖Cov(X, Y)‖_F² )
            ≥ C_{µ,ν} − 8 √( tr(D_0²) tr(D_1²) ) − 4 tr(D_0) tr(D_1) − 8 tr(D_0^{(n)} D_1),   (3.11)

where

C_{µ,ν} = E_{U ∼ N(0, 2D_0)}[ ‖U‖⁴_{R^m} ] + E_{V ∼ N(0, 2D_1)}[ ‖V‖⁴_{R^n} ] − 4 E_{X∼µ}[ ‖X‖²_{R^m} ] E_{Y∼ν}[ ‖Y‖²_{R^n} ]
        = 8 tr(D_0²) + 4 (tr(D_0))² + 8 tr(D_1²) + 4 (tr(D_1))² − 4 tr(D_0) tr(D_1).   (3.12)

Finally,

GW_2²(µ, ν) ≥ 4 (tr(D_0))² + 4 (tr(D_1))² − 8 tr(D_0) tr(D_1) + 8 tr(D_0²) + 8 tr(D_1²) − 8 √( tr(D_0²) tr(D_1²) ) − 8 tr(D_0^{(n)} D_1) = LGW_2²(µ, ν).   (3.13)
Inequalities (3.7) and (3.8) become equalities if one can exhibit a plan π such that Σ or Σ̃ is such that ‖K‖_F² or tr(K̃ 1_{n,m}) is maximized. This is the case for (3.8), where we can exhibit a Gaussian plan π* such that K is of the form (opKl2); it seems however more tricky to exhibit such a plan for inequality (3.7). Indeed, it can be shown that there does not exist a Gaussian plan such that K̃ is of the form (opKl1). The lower bound LGW_2 is reached if there exists a plan π which optimizes both terms simultaneously. This seems rather unlikely, because if a probability distribution has a covariance matrix such that K is of the form (opKl2), then it is necessarily Gaussian thanks to the equality case in the Cauchy-Schwarz inequality: if D_0 = diag(α) and D_1 = diag(β) with α ∈ R^m and β ∈ R^n, and if π has a covariance matrix such that K is of the form (opKl2), then for all i ≤ n, Cov(X_i, Y_i) = ±√(α_i β_i) and Y_i depends linearly on X_i. As a consequence, π is Gaussian and we can compute, using the Isserlis lemma, that tr(K̃ 1_{n,m}) = 2 tr(D_0^{(n)} D_1), and so K̃ cannot be of the form (opKl1). However, we did not prove that the solution of the form (opKl2) is unique, so there may exist another solution which does not force π to be Gaussian.
4 Problem restricted to Gaussian transport plans
In this section, we study the following problem, where we constrain the optimal transport plan to be Gaussian.
GGW_2²(µ, ν) = inf_{π ∈ Π(µ,ν) ∩ N_{m+n}} ∫∫ ( ‖x − x'‖²_{R^m} − ‖y − y'‖²_{R^n} )² dπ(x, y) dπ(x', y'),   (GaussGW)

where N_{m+n} is the set of Gaussian measures on R^{m+n}. We show the following main result.
Theorem 4.1. Suppose without any loss of generality that n ≤ m. Let µ = N(m_0, Σ_0) and ν = N(m_1, Σ_1) be two Gaussian measures on R^m and R^n. Let P_0, D_0 and P_1, D_1 be the respective diagonalizations of Σ_0 (= P_0 D_0 P_0^T) and Σ_1 (= P_1 D_1 P_1^T) which sort the eigenvalues in decreasing order. We suppose that Σ_0 is non-singular (µ is not degenerate). Then problem (GaussGW) admits a solution of the form π* = (I_m, T)#µ with T affine of the form

∀x ∈ R^m,  T(x) = m_1 + P_1 A P_0^T (x − m_0),   (4.1)

where A ∈ R^{n×m} is written

A = ( Ĩ_n D_1^{1/2} (D_0^{(n)})^{-1/2}   0_{n,m−n} ),

where Ĩ_n is of the form diag((±1)_{i≤n}). Moreover,

GGW_2²(µ, ν) = 4 (tr(D_0) − tr(D_1))² + 8 ‖D_0^{(n)} − D_1‖_F² + 8 ( ‖D_0‖_F² − ‖D_0^{(n)}‖_F² ).   (GGW)
Proof. This theorem is a direct consequence of the Isserlis lemma (Lemma 3.1): indeed, the left term in equation (supCOV) can in that case be rewritten 2‖Cov(X, Y)‖_F², and so problem (GaussGW) is equivalent to

sup_{X ∼ T_0#µ, Y ∼ T_1#ν} ‖Cov(X, Y)‖_F².   (4.2)

Applying Lemma 3.2, we can exhibit a Gaussian optimal plan π* ∈ Π(µ, ν) with covariance matrix Σ of the form

Σ = ( Σ_0  K* ; K*^T  Σ_1 ),   (4.3)

with

K* = P_0^T ( Ĩ_n (D_0^{(n)})^{1/2} D_1^{1/2} ; 0_{m−n,n} ) P_1.   (4.4)

Thus, using the equality case in the Cauchy-Schwarz inequality, we can exhibit an optimal transport map T of the form

∀x ∈ R^m,  T(x) = m_1 + P_1 A P_0^T (x − m_0),   (4.5)

with

A = ( Ĩ_n D_1^{1/2} (D_0^{(n)})^{-1/2}   0_{n,m−n} ),

where Ĩ_n is of the form diag((±1)_{i≤n}). Moreover, using Lemmas 2.2 and 3.2, it follows that

GGW_2²(µ, ν) = C_{µ,ν} − 4 tr(D_0) tr(D_1) − 16 sup_{X ∼ T_0#µ, Y ∼ T_1#ν} ‖Cov(X, Y)‖_F²
             = 8 tr(D_0²) + 4 (tr(D_0))² + 8 tr(D_1²) + 4 (tr(D_1))² − 8 tr(D_0) tr(D_1) − 16 tr(D_0^{(n)} D_1)
             = 4 (tr(D_0) − tr(D_1))² + 8 tr( (D_0^{(n)} − D_1)² ) + 8 ( tr(D_0²) − tr( (D_0^{(n)})² ) ).   (4.6)
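For reference, the closed form (GGW) and the map (4.1) are straightforward to evaluate; the sketch below assumes NumPy and takes Ĩ_n = I_n, one of the admissible sign choices:

import numpy as np

def ggw2(eig0, eig1):
    # GGW_2^2 of Theorem 4.1 from eigenvalues sorted in decreasing order (n <= m).
    d0 = np.sort(np.asarray(eig0, dtype=float))[::-1]
    d1 = np.sort(np.asarray(eig1, dtype=float))[::-1]
    d0n = d0[: d1.size]
    return (4 * (d0.sum() - d1.sum()) ** 2
            + 8 * np.sum((d0n - d1) ** 2)
            + 8 * (np.sum(d0 ** 2) - np.sum(d0n ** 2)))

def ggw_map(x, m0, Sigma0, m1, Sigma1):
    # Affine map T of Equation (4.1); assumes Sigma0 non-singular and n <= m.
    w0, P0 = np.linalg.eigh(Sigma0)
    w1, P1 = np.linalg.eigh(Sigma1)
    w0, P0 = w0[::-1], P0[:, ::-1]             # sort the eigenvalues in decreasing order
    w1, P1 = w1[::-1], P1[:, ::-1]
    n, m = w1.size, w0.size
    A = np.zeros((n, m))
    A[:, :n] = np.diag(np.sqrt(w1) / np.sqrt(w0[:n]))   # tilde(I)_n taken equal to I_n
    return m1 + P1 @ A @ P0.T @ (x - m0)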
Link with Gromov-Monge. The previous result generalizes Theorem 4.2.6 in [25], which studies the solutions of the linear Gromov-Monge problem between Gaussian distributions

inf_{T#µ = ν, T linear} ∫∫ ( ‖x − x'‖²_{R^m} − ‖T(x) − T(x')‖²_{R^n} )² dµ(x) dµ(x').   (4.7)

Indeed, solutions of (4.7) necessarily provide Gaussian transport plans π = (I_m, T)#µ when T is linear. Conversely, Theorem 4.1 shows that restricting the optimal plan to be Gaussian in the Gromov-Wasserstein problem between two Gaussian distributions yields an optimal plan of the form π = (I_m, T)#µ with a linear T, whatever the dimensions m and n of the two Euclidean spaces.
Link with Principal Component Analysis. We can easily draw connections between GGW_2² and PCA. Indeed, we can remark that the optimal plan can be derived by performing PCA on both distributions µ and ν in order to obtain distributions µ̃ and ν̃ with zero mean vectors and diagonal covariance matrices with eigenvalues in decreasing order (µ̃ = T_0#µ and ν̃ = T_1#ν), then by keeping only the n first components of µ̃, and finally by deriving the optimal transport plan which achieves W_2² between the obtained truncated distribution and ν̃. In other terms, denoting P_n : R^m → R^n the linear mapping which, for x ∈ R^m, keeps only its n first components (n-frame), and T_{W_2} the optimal transport map such that π_{W_2} = (I_n, T_{W_2})#P_n#µ̃ achieves W_2(P_n#µ̃, ν̃), it follows that the optimal plan π_{GGW_2} which achieves GGW_2(µ̃, ν̃) can be written

π_{GGW_2} = (I_m, Ĩ_n#T_{W_2}#P_n)#µ̃.   (4.8)

An example of π_{GGW_2} is shown in Figure 1, for m = 2 and n = 1.
Figure 1: Transport plan π_{GGW_2} solution of problem (GaussGW) with m = 2 and n = 1. In that case, π_{GGW_2} is the degenerate Gaussian distribution supported by the plane of equation y = T_{W_2}(x), where T_{W_2} is the classic W_2 optimal transport map when the distributions are rotated and centered first.
Case of equal dimensions. When m = n, the optimal plan π_{GGW_2} which achieves GGW_2(µ, ν) is closely related to the optimal transport plan π_{W_2} = (I_m, T_{W_2})#T_0#µ. Indeed, π_{GGW_2} can be simply derived by applying the transformations T_0 and T_1 to respectively µ and ν, then by computing π_{W_2} between T_0#µ and T_1#ν, and finally by applying the inverse transformations T_0^{-1} and T_1^{-1}. In other terms, π_{GGW_2} can be written

π_{GGW_2} = (I_m, T_1^{-1}#Ĩ_n#T_{W_2}#T_0)#µ.   (4.9)

An example of transport between two Gaussian measures in dimension 2 is shown in Figure 2.
As illustrated in Figure 3, the GGW_2 optimal transport map T_{GGW_2} defined in Equation (4.1) is not equivalent to the W_2 optimal transport map T_{W_2} defined in (1.5), even when the dimensions m and n are equal. More precisely, if Σ_0 and Σ_1 can be diagonalized in the same orthonormal basis with eigenvalues in the same order (decreasing or increasing), then T_{W_2} and T_{GGW_2} are equivalent (top of Figure 3). On the other hand, if Σ_0 and Σ_1 can be diagonalized in the same orthonormal basis but with eigenvalues not in the same order, T_{W_2} and T_{GGW_2} will have very different behaviors (bottom of Figure 3). Between those two extreme cases, we can say that the closer the columns of P_0 are to being collinear with the columns of P_1 (with the eigenvalues in decreasing order), the more T_{W_2} and T_{GGW_2} will tend to have similar behaviors (middle of Figure 3).
Figure 2: Solution of (GaussGW) between two Gaussian measures in dimension 2. First the distributions are centered and rotated. Then a classic W_2 transport is applied between the two aligned distributions.
Figure 3: Comparison between W_2 and GGW_2 mappings between empirical distributions. Left: 2D source distribution (colored) and target distribution (transparent). Middle: resulting mapping of Wasserstein T_{W_2}. Right: resulting mapping of Gaussian Gromov-Wasserstein T_{GGW_2}. The colors are added in order to visualize where each sample has been sent.
Link with Gromov-Wasserstein with inner product as cost function. If µ and ν are centered Gaussian measures, let us consider the following problem

GW_2²(⟨., .⟩_m, ⟨., .⟩_n, µ, ν) = inf_{π ∈ Π(µ,ν)} ∫∫ ( ⟨x, x'⟩_m − ⟨y, y'⟩_n )² dπ(x, y) dπ(x', y').   (innerGW)

Notice that the above problem is not restricted to Gaussian plans, but the following proposition shows that its solution is in fact Gaussian.
Proposition 4.1. Suppose n ≤ m. Let µ = N(0, Σ_0) and ν = N(0, Σ_1) be two centered Gaussian measures respectively on R^m and R^n. Then the solution of problem (GaussGW) exhibited in Theorem 4.1 is also a solution of problem (innerGW).
Proof. The proof of this proposition is a direct consequence of Lemma 2.2: indeed, applying it with a = 0, b = 0, and c = 1, it follows that problem (innerGW) is equivalent to

sup_{π ∈ Π(µ,ν)} ‖ ∫ x y^T dπ(x, y) ‖_F².   (4.10)

Since µ and ν are centered, it follows that problem (innerGW) is equivalent to

sup_{X ∼ µ, Y ∼ ν} ‖Cov(X, Y)‖_F².   (4.11)

Applying Lemma 3.2, it follows that the solution exhibited in Theorem 4.1 is also a solution of problem (innerGW).
Since GGW_2 is the Gromov-Wasserstein problem restricted to Gaussian transport plans, it is clear that (GaussGW) is an upper bound of (GW). Combining this result with Theorem 3.1, we get the following simple but important result.
Proposition 4.2. If µ = N(m_0, Σ_0) and ν = N(m_1, Σ_1) and Σ_0 is non-singular, then

LGW_2²(µ, ν) ≤ GW_2²(µ, ν) ≤ GGW_2²(µ, ν).   (4.12)
5 Tightness of the bounds and particular cases
5.1 Bound on the difference
Proposition 5.1. Suppose without loss of generality that n ≤ m. If µ = N(m_0, Σ_0) and ν = N(m_1, Σ_1), then

GGW_2²(µ, ν) − LGW_2²(µ, ν) ≤ 8 ‖Σ_0‖_F ‖Σ_1‖_F ( 1 − 1/√m ).   (5.1)
To prove this proposition, we will use the following technical result (the proof is postponed to the Appendix (Section 8)):
Lemma 5.1. Let u ∈ R^m and v ∈ R^m be two unit vectors with non-negative coordinates ordered in decreasing order. Then

u^T v ≥ 1/√m,   (5.2)

with equality if u = (1/√m, 1/√m, ..., 1/√m)^T and v = (1, 0, ..., 0)^T.
Proof of Proposition 5.1. By subtracting (LGW) from (GGW), it follows that

GGW_2²(µ, ν) − LGW_2²(µ, ν) = 8 ( ‖D_0‖_F ‖D_1‖_F − tr(D_0^{(n)} D_1) ) = 8 ( ‖D_0‖_F ‖D_1^{[m]}‖_F − tr(D_0 D_1^{[m]}) ),   (5.3)

where D_1^{[m]} = ( D_1  0 ; 0  0 ) ∈ R^{m×m}. Noting α ∈ R^m and β ∈ R^m the vectors of eigenvalues of D_0 and D_1^{[m]}, it follows that

GGW_2²(µ, ν) − LGW_2²(µ, ν) = 8 ( ‖α‖ ‖β‖ − α^T β ) = 8 ‖α‖ ‖β‖ (1 − u^T v),   (5.4)

where u = α/‖α‖ and v = β/‖β‖. Applying Lemma 5.1, we get directly that

GGW_2²(µ, ν) − LGW_2²(µ, ν) ≤ 8 ‖D_0‖_F ‖D_1^{[m]}‖_F ( 1 − 1/√m ) = 8 ‖Σ_0‖_F ‖Σ_1‖_F ( 1 − 1/√m ).   (5.5)
The difference between GGW_2²(µ, ν) and LGW_2²(µ, ν) can be seen as the difference between the right-hand and left-hand terms of the Cauchy-Schwarz inequality applied to the two vectors of eigenvalues α ∈ R^m and β ∈ R^m. The difference is maximized when the vectors α and β are the least collinear possible. This happens when the eigenvalues of D_0 are all equal and n = 1, or when ν is degenerate of true dimension 1. On the other hand, this difference is null when α and β are collinear. Between those two extremal cases, we can say that the difference between GGW_2²(µ, ν) and LGW_2²(µ, ν) will be relatively small if the last m − n eigenvalues of D_0 are small compared to the n first eigenvalues and if the n first eigenvalues are close to being proportional to the eigenvalues of D_1. An example in the case where m = 2 and n = 1 can be found in Figure 4.
Figure 4: Plot of GGW_2²(µ, ν) and LGW_2²(µ, ν) as a function of α_2, for µ = N(0, diag(α)), ν = N(0, β_1), α = (α_1, α_2)^T, with (α_1, β_1) = (1, 1) (left), (α_1, β_1) = (1, 2) (middle), and (α_1, β_1) = (1, 10) (right). One can easily compute using (GGW) and (LGW) that GGW_2²(µ, ν) = 12α_2² + 8α_2(α_1 − β_1) + 12(α_1 − β_1)² and LGW_2²(µ, ν) = 12α_2² + 8α_2(α_1 − β_1) − 8√(α_1² + α_2²) β_1 + 12(α_1 − β_1)² + 8α_1β_1.
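The m = 2, n = 1 example above can also be checked numerically against the bound of Proposition 5.1; a small sketch (assuming NumPy, with hypothetical values of α and β_1):

import numpy as np

a1, a2, b1 = 1.0, 0.5, 2.0                 # alpha_1 >= alpha_2 and beta_1 (hypothetical values)
d0 = np.array([a1, a2])                    # eigenvalues of Sigma_0
d1 = np.array([b1])                        # eigenvalue of Sigma_1

ggw2 = 4 * (d0.sum() - d1.sum()) ** 2 + 8 * (a1 - b1) ** 2 + 8 * a2 ** 2       # (GGW)
lgw2 = ggw2 - 8 * (np.linalg.norm(d0) * np.linalg.norm(d1) - a1 * b1)          # Equation (5.3)

# Proposition 5.1: the gap is at most 8 ||Sigma_0||_F ||Sigma_1||_F (1 - 1/sqrt(m)), with m = 2.
bound = 8 * np.linalg.norm(d0) * np.linalg.norm(d1) * (1 - 1 / np.sqrt(2))
print(ggw2 - lgw2, "<=", bound)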
5.2 Explicit case
As seen before, the difference between GGW_2²(µ, ν) and LGW_2²(µ, ν), with µ = N(m_0, Σ_0) and ν = N(m_1, Σ_1), is null when the two vectors of eigenvalues of Σ_0 and Σ_1 (sorted in decreasing order) are collinear. When we suppose Σ_0 non-singular, this implies that m = n and that the eigenvalues of Σ_1 are proportional to the eigenvalues of Σ_0 (rescaling). This case includes the more particular case where m = n = 1, that is µ = N(m_0, σ_0²) and ν = N(m_1, σ_1²), because σ_1 is always proportional to σ_0.
Proposition 5.2. Suppose m = n. Let µ = N(m_0, Σ_0) and ν = N(m_1, Σ_1) be two Gaussian measures on R^m = R^n. Let P_0, D_0 and P_1, D_1 be the respective diagonalizations of Σ_0 (= P_0 D_0 P_0^T) and Σ_1 (= P_1 D_1 P_1^T) which sort the eigenvalues in non-increasing order. Suppose that Σ_0 is non-singular and that there exists a scalar λ ≥ 0 such that D_1 = λ D_0. In that case, GW_2²(µ, ν) = GGW_2²(µ, ν) = LGW_2²(µ, ν) and the problem admits a solution of the form (I_m, T)#µ with T affine of the form

∀x ∈ R^m,  T(x) = m_1 + √λ P_1 Ĩ_m P_0^T (x − m_0),   (5.6)

where Ĩ_m is of the form diag((±1)_{i≤m}). Moreover,

GW_2²(µ, ν) = (λ − 1)² ( 4 (tr(Σ_0))² + 8 ‖Σ_0‖_F² ).   (5.7)
Proof. From (5.3), we have

GGW_2²(µ, ν) − LGW_2²(µ, ν) = 8 ( ‖D_0‖_F ‖D_1‖_F − tr(D_0 D_1) ).   (5.8)

Noting α ∈ R^m and β ∈ R^m the vectors of eigenvalues of D_0 and D_1, it follows that

GGW_2²(µ, ν) − LGW_2²(µ, ν) = 8 ( ‖α‖ ‖β‖ − α^T β ).   (5.9)

Since there exists λ ≥ 0 such that D_1 = λ D_0, we have β = λα, and so α^T β = ‖α‖ ‖β‖. Thus GGW_2²(µ, ν) − LGW_2²(µ, ν) = 0 and, using Proposition 4.2, we get that GW_2²(µ, ν) = GGW_2²(µ, ν) = LGW_2²(µ, ν). We get (5.6) and (5.7) by simply reinjecting into (4.1) and (GGW).
Corollary 5.1. Let µ = N(m_0, σ_0²) and ν = N(m_1, σ_1²) be two Gaussian measures on R. Then

GW_2²(µ, ν) = 12 ( σ_0² − σ_1² )²,   (5.10)

and the optimal transport plan π* has the form (I_1, T)#µ with T affine of the form

∀x ∈ R,  T(x) = m_1 ± (σ_1/σ_0)(x − m_0).   (5.11)

Thus, the solution of W_2²(µ, ν) is also a solution of GW_2²(µ, ν).
5.3 Case of degenerate measures
In all the results exposed above, we have supposed Σ_0 non-singular, which means that µ is not degenerate. Yet, if Σ_0 is not full rank, one can easily extend the previous results thanks to the following proposition.
Proposition 5.3. Let µ = N(0, D_0) and ν = N(0, D_1) be two centered Gaussian measures on R^m and R^n with diagonal covariance matrices D_0 and D_1 with eigenvalues in decreasing order. We denote r = rk(D_0) the rank of D_0 and we suppose that r < m. Let us define P_r = ( I_r  0_{r,m−r} ) ∈ R^{r×m}. Then GW_2²(µ, ν) = GW_2²(P_r#µ, ν), GGW_2²(µ, ν) = GGW_2²(P_r#µ, ν), and LGW_2²(µ, ν) = LGW_2²(P_r#µ, ν).
Proof. For r < m, we denote Γ_r(R^m) the set of vectors x = (x_1, ..., x_m)^T of R^m such that x_{r+1} = ... = x_m = 0. For π ∈ Π(µ, ν), one can remark that for any Borel set A ⊂ R^m \ Γ_r(R^m) and any Borel set B ⊂ R^n, we have π(A, B) = 0, and so

GW_2²(µ, ν) = inf_{π ∈ Π(µ,ν)} ∫_{R^m×R^n} ∫_{R^m×R^n} ( ‖x − x'‖²_{R^m} − ‖y − y'‖²_{R^n} )² dπ(x, y) dπ(x', y')
            = inf_{π ∈ Π(µ,ν)} ∫_{Γ_r(R^m)×R^n} ∫_{Γ_r(R^m)×R^n} ( ‖x − x'‖²_{R^m} − ‖y − y'‖²_{R^n} )² dπ(x, y) dπ(x', y')
            = inf_{π ∈ Π(µ,ν)} ∫_{Γ_r(R^m)×R^n} ∫_{Γ_r(R^m)×R^n} ( ‖P_r(x − x')‖²_{R^r} − ‖y − y'‖²_{R^n} )² dπ(x, y) dπ(x', y').   (5.12)
Now, observe that for π ∈ Π(µ, ν), (P_r, I_n)#π ∈ Π(P_r#µ, ν) and, by (5.12), it has the same cost. It follows that

GW_2²(µ, ν) ≥ inf_{π ∈ Π(P_r#µ, ν)} ∫_{R^r×R^n} ∫_{R^r×R^n} ( ‖x − x'‖²_{R^r} − ‖y − y'‖²_{R^n} )² dπ(x, y) dπ(x', y') = GW_2²(P_r#µ, ν).   (5.13)

Conversely, since µ has no mass outside of Γ_r(R^m), P_r^T#P_r#µ = µ, which implies that for π ∈ Π(P_r#µ, ν), (P_r^T, I_n)#π ∈ Π(µ, ν). It follows that

GW_2²(P_r#µ, ν) = inf_{π ∈ Π(P_r#µ, ν)} ∫_{R^r×R^n} ∫_{R^r×R^n} ( ‖x − x'‖²_{R^r} − ‖y − y'‖²_{R^n} )² dπ(x, y) dπ(x', y')
               = inf_{π ∈ Π(P_r#µ, ν)} ∫_{R^r×R^n} ∫_{R^r×R^n} ( ‖P_r^T(x − x')‖²_{R^m} − ‖y − y'‖²_{R^n} )² dπ(x, y) dπ(x', y')
               ≥ inf_{π ∈ Π(µ,ν)} ∫_{R^m×R^n} ∫_{R^m×R^n} ( ‖x − x'‖²_{R^m} − ‖y − y'‖²_{R^n} )² dπ(x, y) dπ(x', y') = GW_2²(µ, ν).   (5.14)

The exact same reasoning can be made in the case of GGW_2. Moreover, it can easily be seen when looking at (LGW) that LGW_2²(µ, ν) = LGW_2²(P_r#µ, ν).
Thus, when Σ_0 is not full rank, one can apply Proposition 5.3 and consider directly the Gromov-Wasserstein distance between the projected (non-degenerate) measure P_r#µ on R^r and ν, and so Proposition 4.2 still holds when µ is degenerate.
In the case of GGW_2, an explicit optimal transport plan can still be exhibited. In the following, we denote r_0 and r_1 the ranks of Σ_0 and Σ_1, and we suppose without loss of generality that r_0 ≥ r_1, but this time not necessarily that m ≥ n. If µ = N(m_0, Σ_0) and ν = N(m_1, Σ_1) are two Gaussian measures on R^m and R^n, and (P_0, D_0) and (P_1, D_1) are the respective diagonalizations of Σ_0 (= P_0 D_0 P_0^T) and Σ_1 (= P_1 D_1 P_1^T) which sort the eigenvalues in decreasing order, an optimal transport plan which achieves GGW_2(µ, ν) is of the form π* = (I_m, T)#µ with

∀x ∈ R^m,  T(x) = m_1 + P_1 A P_0^T (x − m_0),   (5.15)

where A ∈ R^{n×m} is of the form

A = ( Ĩ_{r_1} (D_1^{(r_1)})^{1/2} (D_0^{(r_1)})^{-1/2}   0_{r_1, m−r_1} ; 0_{n−r_1, r_1}   0_{n−r_1, m−r_1} ),

where Ĩ_{r_1} is any matrix of the form diag((±1)_{i≤r_1}).
6 Behavior of the empirical solution
To complete the previous study, we perform a simple experiment to illustrate the behavior of the solution of the Gromov-Wasserstein problem. In this experiment, we draw independently k samples (X_j)_{j≤k} and (Y_i)_{i≤k} from respectively µ = N(0, diag(α)) and ν = N(0, diag(β)), with α ∈ R^m and β ∈ R^n. Then we compute the Gromov-Wasserstein distance between the two histograms X and Y with the algorithm proposed in [19], using the Python Optimal Transport library³. In Figure 5, we plot the first coordinate of the samples Y_i as a function of the first coordinate of the samples X_j they have been assigned to by the algorithm (blue dots). We also draw the line of equation y = ±√β x to compare with the theoretical solution of the Gaussian restricted problem (orange line), for k = 2000,
³The library is accessible here: https://pythonot.github.io/index.html
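A minimal version of this experiment can be written with the POT library; the sketch below (with hypothetical values of α, β and k) computes the empirical Gromov-Wasserstein coupling between the two point clouds and reads off, for each sample of µ, the sample of ν it is mainly assigned to. The solver options are assumptions on our part and may have to be adapted to the installed version of POT:

import numpy as np
import ot   # Python Optimal Transport, https://pythonot.github.io/index.html

rng = np.random.default_rng(0)
k = 500
alpha, beta = np.array([4.0, 1.0]), np.array([2.0])            # hypothetical eigenvalues
X = rng.normal(size=(k, alpha.size)) * np.sqrt(alpha)          # samples of mu = N(0, diag(alpha))
Y = rng.normal(size=(k, beta.size)) * np.sqrt(beta)            # samples of nu = N(0, diag(beta))

C1 = ot.dist(X, X)                 # pairwise squared Euclidean distances, i.e. the costs c_X
C2 = ot.dist(Y, Y)                 # and c_Y
p = np.full(k, 1.0 / k)
q = np.full(k, 1.0 / k)

# Gromov-Wasserstein coupling between the two empirical distributions (solver of [19])
pi = ot.gromov.gromov_wasserstein(C1, C2, p, q, loss_fun='square_loss')

# First coordinate of the sample Y_i to which each X_j is mainly assigned, to be compared
# with the linear map of Theorem 4.1.
assignment = pi.argmax(axis=1)
print(np.c_[X[:5, 0], Y[assignment[:5], 0]])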