HAL Id: hal-03197398
https://hal.archives-ouvertes.fr/hal-03197398v2
Preprint submitted on 16 Apr 2021
Gromov-Wasserstein Distances between Gaussian Distributions
Antoine Salmona¹, Julie Delon², Agnès Desolneux¹
¹ ENS Paris-Saclay, CNRS, Centre Borelli UMR 9010
² Université de Paris, CNRS, MAP5 UMR 8145 and Institut Universitaire de France
April 16, 2021
Abstract
The Gromov-Wasserstein distances were proposed a few years ago to compare distributions which do not lie in the same space. In particular, they offer an interesting alternative to the Wasserstein distances for comparing probability measures living on Euclidean spaces of different dimensions. In this paper, we focus on the Gromov-Wasserstein distance with a ground cost defined as the squared Euclidean distance and we study the form of the optimal plan between Gaussian distributions. We show that when the optimal plan is restricted to Gaussian distributions, the problem has a very simple linear solution, which is also a solution of the linear Gromov-Monge problem. We also study the problem without restriction on the optimal plan, and provide lower and upper bounds for the value of the Gromov-Wasserstein distance between Gaussian distributions.
Keywords— optimal transport, Wasserstein distance, Gromov-Wasserstein distance, Gaussian distributions.
MSC 2020 subject classifications: 60E99, 68T09, 62H25, 49Q22.
1 Introduction
Optimal transport (OT) theory has become a major tool to compare probability distributions. It has been increasingly used over the past years in various applied fields such as economics [11], image processing [20, 21], machine learning [4, 5] and, more generally, data science [18], with applications to domain adaptation [9] or generative models [3, 12], to name just a few.
Given two probability distributions µ and ν on two Polish spaces (X, d_X) and (Y, d_Y) and a positive lower semi-continuous cost function c : X × Y → R_+, optimal transport focuses on solving the following optimization problem

inf_{π ∈ Π(µ,ν)} ∫_{X×Y} c(x, y) dπ(x, y),   (1.1)

where Π(µ, ν) is the set of measures on X × Y with marginals µ and ν. When X and Y are equal and Euclidean, typically R^d, and c(x, y) = ‖x − y‖^p with p ≥ 1, Equation (1.1) induces a distance over the set of measures with finite moment of order p, known as the p-Wasserstein distance W_p:

W_p(µ, ν) = ( inf_{π ∈ Π(µ,ν)} ∫_{R^d×R^d} ‖x − y‖^p dπ(x, y) )^{1/p},   (1.2)

or equivalently

W_p^p(µ, ν) = inf_{X∼µ, Y∼ν} E[ ‖X − Y‖^p ],   (1.3)
∗The authors acknowledge support from the French Research Agency through the PostProdLEAP project (ANR-19-CE23-0027-01) and the MISTIC project (ANR-19-CE40-005).
where the notation X ∼ µ means that X is a random variable with probability distribution µ. It is known that Equation (1.1) always admits a solution [27, 26, 22], i.e. the infimum is always reached.
Moreover, in the case of W_2, it is known [6] that if µ is absolutely continuous, then the optimal transport plan π* is unique and has the form π* = (Id, T)#µ, where # is the push-forward operator and T : R^d → R^d is an application called the optimal transport map, satisfying T#µ = ν. The 2-Wasserstein distance W_2 admits a closed-form expression [10, 24] when µ = N(m_0, Σ_0) and ν = N(m_1, Σ_1) are two Gaussian measures with means m_0 ∈ R^d, m_1 ∈ R^d and covariance matrices Σ_0 ∈ R^{d×d} and Σ_1 ∈ R^{d×d}, that is given by

W_2²(µ, ν) = ‖m_1 − m_0‖² + tr( Σ_0 + Σ_1 − 2 ( Σ_0^{1/2} Σ_1 Σ_0^{1/2} )^{1/2} ),   (1.4)

where for any symmetric positive semi-definite M, M^{1/2} is the unique symmetric positive semi-definite square root of M. Moreover, if Σ_0 is non-singular, then the optimal transport map T is affine and is given by

∀x ∈ R^d,  T(x) = m_1 + Σ_0^{-1/2} ( Σ_0^{1/2} Σ_1 Σ_0^{1/2} )^{1/2} Σ_0^{-1/2} (x − m_0),   (1.5)

and the corresponding optimal transport plan π* is a degenerate Gaussian measure.
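As an illustration, the closed forms (1.4) and (1.5) are straightforward to evaluate numerically. The following is a minimal sketch, assuming NumPy and SciPy are available (the function names are ours, for illustration only):

import numpy as np
from scipy.linalg import sqrtm

def w2_squared_gaussian(m0, S0, m1, S1):
    # Closed-form W_2^2 between N(m0, S0) and N(m1, S1), cf. Equation (1.4).
    S0_half = np.real(sqrtm(S0))                     # unique PSD square root of S0
    cross = np.real(sqrtm(S0_half @ S1 @ S0_half))   # (S0^{1/2} S1 S0^{1/2})^{1/2}
    return float(np.sum((m1 - m0) ** 2) + np.trace(S0 + S1 - 2.0 * cross))

def w2_map(x, m0, S0, m1, S1):
    # Affine optimal map T of Equation (1.5); assumes S0 is non-singular.
    S0_half = np.real(sqrtm(S0))
    S0_half_inv = np.linalg.inv(S0_half)
    A = S0_half_inv @ np.real(sqrtm(S0_half @ S1 @ S0_half)) @ S0_half_inv
    return m1 + A @ (x - m0)

# sanity check in dimension 1: W_2^2(N(0,1), N(0,4)) = (2 - 1)^2 = 1
print(w2_squared_gaussian(np.zeros(1), np.eye(1), np.zeros(1), 4 * np.eye(1)))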
For some applications such as shape matching or word embedding, an important limitation of classic OT lies in the fact that it is not invariant to rotations and translations, and more generally to isometries. Moreover, OT requires that we can define a relevant cost function to compare the spaces X and Y. Thus, when for instance µ is a measure on R² and ν a measure on R³, it is not straightforward to design a cost function c : R² × R³ → R, and so one cannot easily define an OT distance to compare µ with ν. To overcome these limitations, several extensions of OT have been proposed [1, 7, 17]. Among them, the most famous one is probably the Gromov-Wasserstein (GW) problem [16]: given two Polish spaces (X, d_X) and (Y, d_Y), each endowed respectively with probability measures µ and ν, and given two measurable functions c_X : X × X → R and c_Y : Y × Y → R, it aims at finding

GW_p(c_X, c_Y, µ, ν) = ( inf_{π ∈ Π(µ,ν)} ∫_{X²×Y²} |c_X(x, x') − c_Y(y, y')|^p dπ(x, y) dπ(x', y') )^{1/p},   (1.6)

with p ≥ 1. As for classic OT, it can be shown that Equation (1.6) always admits a solution (see [25]). The GW problem can be seen as a quadratic optimization problem in π, as opposed to OT, which is a linear optimization problem in π. It induces a distance over the space of metric measure spaces (i.e. the triplets (X, d_X, µ)) quotiented by the strong isomorphisms [18]¹. The fundamental metric properties of GW_p have been studied in depth in [23, 16, 8]. In the Euclidean setting, when X = R^m and Y = R^n, with m not necessarily equal to n, and for the natural choice of costs c_X = ‖.‖²_{R^m} and c_Y = ‖.‖²_{R^n}, where ‖.‖_{R^m} denotes the Euclidean norm on R^m, it can be easily shown that GW_2(c_X, c_Y, µ, ν) is invariant to isometries. With a slight abuse of notation, we will write in the following GW_2(µ, ν) instead of GW_2(‖.‖²_{R^m}, ‖.‖²_{R^n}, µ, ν).
In this work, we focus on the Gromov-Wasserstein problem between Gaussian measures. Given µ = N(m_0, Σ_0), with mean m_0 ∈ R^m and covariance matrix Σ_0 ∈ R^{m×m}, and ν = N(m_1, Σ_1), with mean m_1 ∈ R^n and covariance matrix Σ_1 ∈ R^{n×n}, we aim to solve

GW_2²(µ, ν) = inf_{π ∈ Π(µ,ν)} ∫∫ ( ‖x − x'‖²_{R^m} − ‖y − y'‖²_{R^n} )² dπ(x, y) dπ(x', y'),   (GW)

or equivalently

GW_2²(µ, ν) = inf_{(X, X', Y, Y') ∼ π⊗π} E[ ( ‖X − X'‖²_{R^m} − ‖Y − Y'‖²_{R^n} )² ],   (1.7)
¹We say that (X, d_X, µ) is strongly isomorphic to (Y, d_Y, ν) if there exists a bijection φ : X → Y such that φ is an isometry (d_Y(φ(x), φ(x')) = d_X(x, x') for all x, x' ∈ X) and φ#µ = ν.
where for x, x' ∈ R^m and y, y' ∈ R^n, (π ⊗ π)(x, x', y, y') = π(x, y) π(x', y'). In particular, can we find formulas equivalent to (1.4) and (1.5) in the case of Gromov-Wasserstein? In Section 2, we derive an equivalent formulation of the Gromov-Wasserstein problem. This formulation is not specific to Gaussian measures but holds for all measures with a finite fourth-order moment. It takes the form of a sum of two terms depending respectively on the co-moments of order 2 and 4 of π. Then in Section 3, we derive a lower bound by simply optimizing both terms separately. In Section 4, we show that the problem restricted to Gaussian optimal plans admits an explicit solution, and that this solution is closely related to Principal Component Analysis (PCA). In Section 5, we study the tightness of the bounds found in the previous sections and we exhibit a particular case where we are able to compute exactly the value of GW_2²(µ, ν) and the optimal plan π* which achieves it. Finally, Section 6 discusses the form of the solution in the general case, and the possibility that the optimal plan between two Gaussian distributions is always Gaussian.
Notations
We define in the following some of the notations that will be used in the paper.
• The notation Y ∼ µ means that Y is a random variable with probability distribution µ.
• If µ is a positive measure on X and T : X → Y is an application, T#µ stands for the push-forward measure of µ by T, i.e. the measure on Y such that for every measurable set A ⊆ Y, (T#µ)(A) = µ(T^{-1}(A)).
• If X and Y are random vectors on R^m and R^n, we denote Cov(X, Y) the matrix of size m × n of the form E[(X − E[X])(Y − E[Y])^T].
• The notation tr(M) denotes the trace of a matrix M.
• ‖M‖_F stands for the Frobenius norm of a matrix M, i.e. ‖M‖_F = √(tr(M^T M)).
• rk(M) stands for the rank of a matrix M.
• I_n is the identity matrix of size n.
• Ĩ_n stands for any matrix of size n of the form diag((±1)_{i≤n}).
• Suppose n ≤ m. For A ∈ R^{m×m}, we denote A^{(n)} ∈ R^{n×n} the submatrix containing the first n rows and the first n columns of A.
• Suppose n ≤ m. For A ∈ R^{n×n}, we denote A^{[m]} ∈ R^{m×m} the matrix of the form ( A  0 ; 0  0 ).
• We denote S_n(R) the set of symmetric matrices of size n, S_n^+(R) the set of positive semi-definite matrices, and S_n^{++}(R) the set of positive definite matrices.
• 1_{n,m} = (1)_{i≤n, j≤m} denotes the matrix of ones with n rows and m columns.
• ‖x‖_{R^n} stands for the Euclidean norm of x ∈ R^n. We will write ‖x‖ when there is no ambiguity about the dimension.
• ⟨x, x'⟩_n stands for the Euclidean inner product in R^n between x and x'.
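Two of these notations, A^{(n)} and A^{[m]}, are used repeatedly in what follows; as a reading aid, here is a minimal NumPy sketch of what they mean (illustrative helpers, not part of the paper):

import numpy as np

def upper_left_block(A, n):
    # A^{(n)}: the submatrix of the first n rows and first n columns of A (n <= m).
    return A[:n, :n]

def zero_padded(A, m):
    # A^{[m]}: the n x n matrix A embedded in the upper-left corner of an m x m zero matrix.
    n = A.shape[0]
    B = np.zeros((m, m), dtype=A.dtype)
    B[:n, :n] = A
    return B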
2 Derivation of the general problem
In this section, we derive an equivalent² formulation of problem (GW) which takes the form of a functional of the co-moments of order 2 and 4 of π. This formulation is not specific to Gaussian measures but holds for all measures with a finite fourth-order moment.
Theorem 2.1. Let µ be a probability measure on R^m with mean vector m_0 ∈ R^m and covariance matrix Σ_0 ∈ R^{m×m} such that ∫ ‖x‖⁴ dµ < +∞, and let ν be a probability measure on R^n with mean vector m_1 and covariance matrix Σ_1 ∈ R^{n×n} such that ∫ ‖y‖⁴ dν < +∞. Let P_0, D_0 and P_1, D_1 be respective diagonalizations of Σ_0 (= P_0 D_0 P_0^T) and Σ_1 (= P_1 D_1 P_1^T). Let us define T_0 : x ∈ R^m ↦ P_0^T (x − m_0) and T_1 : y ∈ R^n ↦ P_1^T (y − m_1). Then problem (GW) is equivalent to the problem

sup_{X ∼ T_0#µ, Y ∼ T_1#ν}  Σ_{i,j} Cov(X_i², Y_j²) + 2 ‖Cov(X, Y)‖_F²,   (supCOV)

where X = (X_1, X_2, ..., X_m)^T, Y = (Y_1, Y_2, ..., Y_n)^T, and ‖.‖_F is the Frobenius norm.
This theorem is a direct consequence of the following two intermediary results.
Lemma 2.1. We denote O_m = {O ∈ R^{m×m} | O^T O = I_m} the set of orthogonal matrices of size m. Let µ and ν be two probability measures on R^m and R^n. Let T_m : x ↦ O_m x + x_m and T_n : y ↦ O_n y + y_n be two affine applications with x_m ∈ R^m, O_m ∈ O_m, y_n ∈ R^n, and O_n ∈ O_n. Then GW_2(T_m#µ, T_n#ν) = GW_2(µ, ν).
Lemma 2.2 (Vayer, 2020, [25]). Suppose there exist scalars a, b, c such that c_X(x, x') = a‖x‖²_{R^m} + b‖x'‖²_{R^m} + c⟨x, x'⟩_m, where ⟨., .⟩_m denotes the inner product on R^m, and c_Y(y, y') = a‖y‖²_{R^n} + b‖y'‖²_{R^n} + c⟨y, y'⟩_n. Let µ and ν be two probability measures respectively on R^m and R^n. Then

GW_2²(c_X, c_Y, µ, ν) = C_{µ,ν} − 2 sup_{π ∈ Π(µ,ν)} Z(π),   (2.1)

where C_{µ,ν} = ∫ c_X² dµ dµ + ∫ c_Y² dν dν − 4ab ∫ ‖x‖²_{R^m} ‖y‖²_{R^n} dµ(x) dν(y) and

Z(π) = (a² + b²) ∫ ‖x‖²_{R^m} ‖y‖²_{R^n} dπ(x, y) + c² ‖ ∫ x y^T dπ(x, y) ‖_F²
       + (a + b) c ∫ ( ‖x‖²_{R^m} ⟨E_{Y∼ν}[Y], y⟩_n + ‖y‖²_{R^n} ⟨E_{X∼µ}[X], x⟩_m ) dπ(x, y).   (2.2)
Proof of Theorem 2.1. Using Lemma 2.1, we can focus without any loss of generality on centered measures with diagonal covariance matrices. Thus, defining T_0 : x ∈ R^m ↦ P_0^T (x − m_0) and T_1 : y ∈ R^n ↦ P_1^T (y − m_1) and then applying Lemma 2.2 to GW_2(T_0#µ, T_1#ν) with a = 1, b = 1, and c = −2, while remarking that the last term in Equation (2.2) is null because E_{X ∼ T_0#µ}[X] = 0 and E_{Y ∼ T_1#ν}[Y] = 0, it follows that problem (GW) is equivalent to

sup_{π ∈ Π(T_0#µ, T_1#ν)}  ∫ ‖x‖²_{R^m} ‖y‖²_{R^n} dπ(x, y) + 2 ‖ ∫ x y^T dπ(x, y) ‖_F².   (2.3)

Since T_0#µ and T_1#ν are centered, we have ∫ x y^T dπ(x, y) = Cov(X, Y) where X ∼ T_0#µ and Y ∼ T_1#ν. Furthermore, it can be easily computed that

∫ ‖x‖²_{R^m} ‖y‖²_{R^n} dπ(x, y) = Σ_{i,j} Cov(X_i², Y_j²) + Σ_{i,j} E[X_i²] E[Y_j²].   (2.4)

Since the second term does not depend on π, we get that problem (GW) is equivalent to problem (supCOV).
²We say that two optimization problems are equivalent if the solutions of one are readily obtained from the solutions of the other, and vice versa.
The left-hand term of (supCOV) is closely related to the sum of symmetric co-kurtoses and thus depends on the co-moments of order 4 of π. On the other hand, the right-hand term is directly related to the co-moments of order 2 of π. For this reason, problem (supCOV) is hard to solve, because it requires optimizing simultaneously the co-moments of order 2 and 4 of π, and thus knowing the probabilistic rule which links them. This rule is well known when π is Gaussian (Isserlis lemma), but not in the general case to the best of our knowledge, and there is no reason for the solution of problem (supCOV) to be Gaussian.
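To make the functional in (supCOV) concrete, the following sketch (assuming NumPy) estimates its two terms from paired samples of a given coupling π; the deterministic coupling used at the end, where Y consists of the first two coordinates of X, is purely a hypothetical illustration:

import numpy as np

def supcov_value(X, Y):
    # Empirical value of the (supCOV) functional for paired samples of (X, Y) ~ pi.
    # X has shape (k, m), Y has shape (k, n); returns
    # sum_{i,j} Cov(X_i^2, Y_j^2) + 2 * ||Cov(X, Y)||_F^2.
    k = X.shape[0]
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    cov_xy = Xc.T @ Yc / k                                    # co-moments of order 2
    X2, Y2 = X ** 2, Y ** 2
    cov_sq = (X2 - X2.mean(0)).T @ (Y2 - Y2.mean(0)) / k      # co-moments of order 4
    return cov_sq.sum() + 2.0 * np.linalg.norm(cov_xy, 'fro') ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 3)) * np.sqrt([3.0, 2.0, 1.0])   # X ~ N(0, diag(3, 2, 1))
Y = X[:, :2]                                                    # a simple deterministic coupling
# By the Isserlis lemma, the value is 2*(9 + 4) + 2*(9 + 4) = 52 for this Gaussian coupling.
print(supcov_value(X, Y))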
3 Study of the general problem
Since problem (supCOV) is hard to solve because of its dependence on the co-moments of order 2 and 4 of π, one can optimize both terms separately in order to find a lower bound of GW_2(µ, ν). In the rest of the paper, we suppose for convenience and without any loss of generality that n ≤ m.
Theorem 3.1. Suppose without any loss of generality that n ≤ m. Let µ = N(m_0, Σ_0) and ν = N(m_1, Σ_1) be two Gaussian measures on R^m and R^n. Let P_0, D_0 and P_1, D_1 be the respective diagonalizations of Σ_0 (= P_0 D_0 P_0^T) and Σ_1 (= P_1 D_1 P_1^T) which sort the eigenvalues in decreasing order. We suppose that Σ_0 is non-singular. A lower bound for GW_2(µ, ν) is then

GW_2²(µ, ν) ≥ LGW_2²(µ, ν),   (3.1)

where

LGW_2²(µ, ν) = 4 (tr(D_0) − tr(D_1))² + 4 (‖D_0‖_F − ‖D_1‖_F)² + 4 ‖D_0^{(n)} − D_1‖_F² + 4 ( ‖D_0‖_F² − ‖D_0^{(n)}‖_F² ).   (LGW)
The proof of this theorem is divided into smaller intermediary results. First we recall the Isserlis lemma (see [14]), which allows one to derive the co-moments of order 4 of a Gaussian distribution as a function of its co-moments of order 2.
Lemma 3.1 (Isserlis, 1918, [14]). Let X be a zero-mean Gaussian vector of size n. Then

∀ i, j, k, l ≤ n,  E[X_i X_j X_k X_l] = E[X_i X_j] E[X_k X_l] + E[X_i X_k] E[X_j X_l] + E[X_i X_l] E[X_j X_k].   (3.2)
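The relation (3.2) is easy to check by simulation; here is a small Monte Carlo sketch (assuming NumPy), given purely as an illustration:

import numpy as np

rng = np.random.default_rng(0)
n = 3
A = rng.normal(size=(n, n))
Sigma = A @ A.T                       # an arbitrary covariance matrix
X = rng.multivariate_normal(np.zeros(n), Sigma, size=500_000)

i, j, k, l = 0, 1, 2, 0
empirical = np.mean(X[:, i] * X[:, j] * X[:, k] * X[:, l])
isserlis = Sigma[i, j] * Sigma[k, l] + Sigma[i, k] * Sigma[j, l] + Sigma[i, l] * Sigma[j, k]
print(empirical, isserlis)            # the two values agree up to Monte Carlo error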
Then we derive the following general optimization lemmas. The proofs of these two lemmas are postponed to the Appendix (Section 8).
Lemma 3.2. Suppose that n ≤ m. Let Σ be a positive semi-definite matrix of size m + n of the form

Σ = ( Σ_0  K ; K^T  Σ_1 ),

with Σ_0 ∈ S_m^{++}(R), Σ_1 ∈ S_n^+(R) and K ∈ R^{m×n}. Let P_0, D_0 and P_1, D_1 be the respective diagonalizations of Σ_0 (= P_0^T D_0 P_0) and Σ_1 (= P_1^T D_1 P_1) which sort the eigenvalues in decreasing order. Then

max_{Σ_1 − K^T Σ_0^{-1} K ∈ S_n^+(R)} ‖K‖_F² = tr(D_0^{(n)} D_1),   (3.3)

and the maximum is achieved at any

K* = P_0^T ( Ĩ_n (D_0^{(n)})^{1/2} D_1^{1/2} ; 0_{m−n,n} ) P_1,   (opKl2)

where Ĩ_n is of the form diag((±1)_{i≤n}).
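As a quick numerical illustration of Lemma 3.2 in the diagonal case (taking P_0 = P_1 = I and Ĩ_n = I_n), the following sketch (assuming NumPy) checks that K* given by (opKl2) is feasible, i.e. that Σ_1 − K*^T Σ_0^{-1} K* is positive semi-definite, and that ‖K*‖_F² equals tr(D_0^{(n)} D_1):

import numpy as np

m, n = 4, 2
d0 = np.array([5.0, 3.0, 2.0, 1.0])            # eigenvalues of Sigma_0, decreasing
d1 = np.array([2.5, 0.5])                      # eigenvalues of Sigma_1, decreasing
Sigma0, Sigma1 = np.diag(d0), np.diag(d1)

K = np.zeros((m, n))                           # K* of (opKl2) with P_0 = P_1 = I
K[:n, :n] = np.diag(np.sqrt(d0[:n] * d1))

schur = Sigma1 - K.T @ np.linalg.inv(Sigma0) @ K
print(np.linalg.eigvalsh(schur))               # nonnegative up to numerical precision: K* is feasible
print(np.linalg.norm(K, 'fro') ** 2, np.sum(d0[:n] * d1))   # both equal tr(D_0^(n) D_1)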
Lemma 3.3. Suppose that n ≤ m. Let Σ be a positive semi-definite matrix of size m + n of the form

Σ = ( Σ_0  K ; K^T  Σ_1 ),

where Σ_0 ∈ S_m^{++}(R), Σ_1 ∈ S_n^+(R), and K ∈ R^{m×n}. Let A ∈ R^{n×m} be a matrix of rank 1. Then

max_{Σ_1 − K^T Σ_0^{-1} K ∈ S_n^+(R)} tr(KA) = √( tr(A Σ_0 A^T Σ_1) ).   (3.4)

In particular, if Σ_0 = diag(α) and Σ_1 = diag(β) with α ∈ R^m and β ∈ R^n, then

max_{Σ_1 − K^T Σ_0^{-1} K ∈ S_n^+(R)} tr(K 1_{n,m}) = √( tr(Σ_0) tr(Σ_1) ),   (3.5)

with 1_{n,m} = (1)_{i≤n, j≤m}, and the maximum is achieved at

K* = α β^T / √( tr(Σ_0) tr(Σ_1) ).   (opKl1)
Proof of Theorem 3.1. For µ = N(m_0, Σ_0) and ν = N(m_1, Σ_1), we denote P_0, D_0 and P_1, D_1 the respective diagonalizations of Σ_0 and Σ_1 which sort the eigenvalues in decreasing order. Let T_0 : x ∈ R^m ↦ P_0^T (x − m_0) and T_1 : y ∈ R^n ↦ P_1^T (y − m_1). For π ∈ Π(T_0#µ, T_1#ν) and (X, Y) ∼ π, we denote Σ the covariance matrix of π and Σ̃ the covariance matrix of (X², Y²), with X² := ([XX^T]_{i,i})_{i≤m} and Y² := ([YY^T]_{j,j})_{j≤n}. Using the Isserlis lemma to compute Cov(X², X²) and Cov(Y², Y²), it follows that Σ and Σ̃ are of the form

Σ = ( D_0  K ; K^T  D_1 )   and   Σ̃ = ( 2D_0²  K̃ ; K̃^T  2D_1² ).   (3.6)

In order to find a supremum for each term of (supCOV), we use a necessary condition for π to be in Π(T_0#µ, T_1#ν), which is that Σ and Σ̃ must be positive semi-definite. To do so, we can use the equivalent condition that the Schur complements of Σ and Σ̃, namely D_1 − K^T D_0^{-1} K and 2D_1² − (1/2) K̃^T D_0^{-2} K̃, must also be positive semi-definite. Remarking that the left-hand term in (supCOV) can be rewritten tr(K̃ 1_{n,m}), we have the two following inequalities

sup_{X ∼ T_0#µ, Y ∼ T_1#ν} Σ_{i,j} Cov(X_i², Y_j²) ≤ max_{2D_1² − (1/2) K̃^T D_0^{-2} K̃ ∈ S_n^+(R)} tr(K̃ 1_{n,m}),   (3.7)

and

sup_{X ∼ T_0#µ, Y ∼ T_1#ν} ‖Cov(X, Y)‖_F² ≤ max_{D_1 − K^T D_0^{-1} K ∈ S_n^+(R)} ‖K‖_F².   (3.8)

Applying Lemmas 3.2 and 3.3 to the two right-hand terms, we get on one hand

sup_{X ∼ T_0#µ, Y ∼ T_1#ν} ‖Cov(X, Y)‖_F² ≤ tr(D_0^{(n)} D_1),   (3.9)

and on the other hand

sup_{X ∼ T_0#µ, Y ∼ T_1#ν} Σ_{i,j} Cov(X_i², Y_j²) ≤ 2 √( tr(D_0²) tr(D_1²) ) = 2 ‖D_0‖_F ‖D_1‖_F.   (3.10)

Furthermore, using Lemma 2.2, it follows that

GW_2²(µ, ν) = C_{µ,ν} − 4 sup_{X ∼ T_0#µ, Y ∼ T_1#ν} ( Σ_{i,j} Cov(X_i², Y_j²) + Σ_{i,j} E[X_i²] E[Y_j²] + 2 ‖Cov(X, Y)‖_F² )
            ≥ C_{µ,ν} − 8 √( tr(D_0²) tr(D_1²) ) − 4 tr(D_0) tr(D_1) − 8 tr(D_0^{(n)} D_1),   (3.11)

where

C_{µ,ν} = E_{U ∼ N(0, 2D_0)}[ ‖U‖⁴_{R^m} ] + E_{V ∼ N(0, 2D_1)}[ ‖V‖⁴_{R^n} ] − 4 E_{X∼µ}[ ‖X‖²_{R^m} ] E_{Y∼ν}[ ‖Y‖²_{R^n} ]
        = 8 tr(D_0²) + 4 (tr(D_0))² + 8 tr(D_1²) + 4 (tr(D_1))² − 4 tr(D_0) tr(D_1).   (3.12)

Finally,

GW_2²(µ, ν) ≥ 4 (tr(D_0))² + 4 (tr(D_1))² − 8 tr(D_0) tr(D_1) + 8 tr(D_0²) + 8 tr(D_1²) − 8 √( tr(D_0²) tr(D_1²) ) − 8 tr(D_0^{(n)} D_1) = LGW_2²(µ, ν).   (3.13)
Inequalities (3.7) and (3.8) become equalities if one can exhibit a plan π such that Σ or Σ̃ is such that ‖K‖_F² or tr(K̃ 1_{n,m}) is maximized. This is the case for (3.8), where we can exhibit a Gaussian plan π* such that K is of the form (opKl2); it seems however more tricky to exhibit such a plan for inequality (3.7). Indeed, it can be shown that there does not exist a Gaussian plan such that K̃ is of the form (opKl1). The lower bound LGW_2 is reached if there exists a plan π which optimizes both terms simultaneously. This seems rather unlikely, because if a probability distribution has a covariance matrix such that K is of the form (opKl2), then it is necessarily Gaussian thanks to the equality case in the Cauchy-Schwarz inequality: if D_0 = diag(α) and D_1 = diag(β) with α ∈ R^m and β ∈ R^n, and if π has a covariance matrix such that K is of the form (opKl2), then for all i ≤ n, Cov(X_i, Y_i) = ±√(α_i β_i) and Y_i depends linearly on X_i. As a consequence, π is Gaussian and we can compute, using the Isserlis lemma, that tr(K̃ 1_{n,m}) = 2 tr(D_0^{(n)} D_1), and so K̃ cannot be of the form (opKl1). However, we did not prove that the solution of the form (opKl2) is unique, so there may exist another solution which does not force π to be Gaussian.
4 Problem restricted to Gaussian transport plans
In this section, we study the following problem, where we constrain the optimal transport plan to be Gaussian.
GGW_2²(µ, ν) = inf_{π ∈ Π(µ,ν) ∩ N_{m+n}} ∫∫ ( ‖x − x'‖²_{R^m} − ‖y − y'‖²_{R^n} )² dπ(x, y) dπ(x', y'),   (GaussGW)

where N_{m+n} is the set of Gaussian measures on R^{m+n}. We show the following main result.
Theorem 4.1. Suppose without any loss of generality that n ≤ m. Let µ = N(m_0, Σ_0) and ν = N(m_1, Σ_1) be two Gaussian measures on R^m and R^n. Let P_0, D_0 and P_1, D_1 be the respective diagonalizations of Σ_0 (= P_0 D_0 P_0^T) and Σ_1 (= P_1 D_1 P_1^T) which sort the eigenvalues in decreasing order. We suppose that Σ_0 is non-singular (µ is not degenerate). Then problem (GaussGW) admits a solution of the form π* = (I_m, T)#µ with T affine of the form

∀x ∈ R^m,  T(x) = m_1 + P_1 A P_0^T (x − m_0),   (4.1)

where A ∈ R^{n×m} is written

A = ( Ĩ_n D_1^{1/2} (D_0^{(n)})^{-1/2}   0_{n,m−n} ),

where Ĩ_n is of the form diag((±1)_{i≤n}). Moreover,

GGW_2²(µ, ν) = 4 (tr(D_0) − tr(D_1))² + 8 ‖D_0^{(n)} − D_1‖_F² + 8 ( ‖D_0‖_F² − ‖D_0^{(n)}‖_F² ).   (GGW)
Proof. This theorem is a direct consequence of the Isserlis lemma (Lemma 3.1): indeed, the left term in equation (supCOV) can in that case be rewritten 2‖Cov(X, Y)‖_F², and so problem (GaussGW) is equivalent to

sup_{X ∼ T_0#µ, Y ∼ T_1#ν} ‖Cov(X, Y)‖_F².   (4.2)

Applying Lemma 3.2, we can exhibit a Gaussian optimal plan π* ∈ Π(µ, ν) with covariance matrix Σ of the form

Σ = ( Σ_0  K* ; K*^T  Σ_1 ),   (4.3)

with

K* = P_0^T ( Ĩ_n (D_0^{(n)})^{1/2} D_1^{1/2} ; 0_{m−n,n} ) P_1.   (4.4)

Thus, using the equality case in the Cauchy-Schwarz inequality, we can exhibit an optimal transport map T of the form

∀x ∈ R^m,  T(x) = m_1 + P_1 A P_0^T (x − m_0),   (4.5)

with

A = ( Ĩ_n D_1^{1/2} (D_0^{(n)})^{-1/2}   0_{n,m−n} ),

where Ĩ_n is of the form diag((±1)_{i≤n}). Moreover, using Lemmas 2.2 and 3.2, it follows that

GGW_2²(µ, ν) = C_{µ,ν} − 4 tr(D_0) tr(D_1) − 16 sup_{X ∼ T_0#µ, Y ∼ T_1#ν} ‖Cov(X, Y)‖_F²
             = 8 tr(D_0²) + 4 (tr(D_0))² + 8 tr(D_1²) + 4 (tr(D_1))² − 8 tr(D_0) tr(D_1) − 16 tr(D_0^{(n)} D_1)
             = 4 (tr(D_0) − tr(D_1))² + 8 tr( (D_0^{(n)} − D_1)² ) + 8 ( tr(D_0²) − tr( (D_0^{(n)})² ) ).   (4.6)
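For reference, the closed form (GGW) and the map (4.1) are straightforward to evaluate; the sketch below assumes NumPy and takes Ĩ_n = I_n, one of the admissible sign choices:

import numpy as np

def ggw2(eig0, eig1):
    # GGW_2^2 of Theorem 4.1 from eigenvalues sorted in decreasing order (n <= m).
    d0 = np.sort(np.asarray(eig0, dtype=float))[::-1]
    d1 = np.sort(np.asarray(eig1, dtype=float))[::-1]
    d0n = d0[: d1.size]
    return (4 * (d0.sum() - d1.sum()) ** 2
            + 8 * np.sum((d0n - d1) ** 2)
            + 8 * (np.sum(d0 ** 2) - np.sum(d0n ** 2)))

def ggw_map(x, m0, Sigma0, m1, Sigma1):
    # Affine map T of Equation (4.1); assumes Sigma0 non-singular and n <= m.
    w0, P0 = np.linalg.eigh(Sigma0)
    w1, P1 = np.linalg.eigh(Sigma1)
    w0, P0 = w0[::-1], P0[:, ::-1]             # sort the eigenvalues in decreasing order
    w1, P1 = w1[::-1], P1[:, ::-1]
    n, m = w1.size, w0.size
    A = np.zeros((n, m))
    A[:, :n] = np.diag(np.sqrt(w1) / np.sqrt(w0[:n]))   # tilde(I)_n taken equal to I_n
    return m1 + P1 @ A @ P0.T @ (x - m0)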
Link with Gromov-Monge. The previous result generalizes Theorem 4.2.6 in [25], which studies the solutions of the linear Gromov-Monge problem between Gaussian distributions

inf_{T#µ = ν, T linear} ∫∫ ( ‖x − x'‖²_{R^m} − ‖T(x) − T(x')‖²_{R^n} )² dµ(x) dµ(x').   (4.7)

Indeed, solutions of (4.7) necessarily provide Gaussian transport plans π = (I_m, T)#µ when T is linear. Conversely, Theorem 4.1 shows that restricting the optimal plan to be Gaussian in the Gromov-Wasserstein problem between two Gaussian distributions yields an optimal plan of the form π = (I_m, T)#µ with a linear T, whatever the dimensions m and n of the two Euclidean spaces.
Link with Principal Component Analysis. We can easily draw connections between GGW_2² and PCA. Indeed, we can remark that the optimal plan can be derived by performing PCA on both distributions µ and ν in order to obtain distributions µ̃ and ν̃ with zero mean vectors and diagonal covariance matrices with eigenvalues in decreasing order (µ̃ = T_0#µ and ν̃ = T_1#ν), then by keeping only the n first components of µ̃, and finally by deriving the optimal transport plan which achieves W_2² between the obtained truncated distribution and ν̃. In other terms, denoting P_n : R^m → R^n the linear mapping which, for x ∈ R^m, keeps only its n first components (n-frame), and T_{W_2} the optimal transport map such that π_{W_2} = (I_n, T_{W_2})#P_n#µ̃ achieves W_2(P_n#µ̃, ν̃), it follows that the optimal plan π_{GGW_2} which achieves GGW_2(µ̃, ν̃) can be written

π_{GGW_2} = (I_m, Ĩ_n#T_{W_2}#P_n)#µ̃.   (4.8)

An example of π_{GGW_2} is shown in Figure 1, for m = 2 and n = 1.
Figure 1: Transport plan π_{GGW_2} solution of problem (GaussGW) with m = 2 and n = 1. In that case, π_{GGW_2} is the degenerate Gaussian distribution supported by the plane of equation y = T_{W_2}(x), where T_{W_2} is the classic W_2 optimal transport map when the distributions are rotated and centered first.
Case of equal dimensions. When m = n, the optimal plan π_{GGW_2} which achieves GGW_2(µ, ν) is closely related to the optimal transport plan π_{W_2} = (I_m, T_{W_2})#T_0#µ. Indeed, π_{GGW_2} can be simply derived by applying the transformations T_0 and T_1 to respectively µ and ν, then by computing π_{W_2} between T_0#µ and T_1#ν, and finally by applying the inverse transformations T_0^{-1} and T_1^{-1}. In other terms, π_{GGW_2} can be written

π_{GGW_2} = (I_m, T_1^{-1}#Ĩ_n#T_{W_2}#T_0)#µ.   (4.9)

An example of transport between two Gaussian measures in dimension 2 is shown in Figure 2.
As illustrated in Figure 3, the GGW_2 optimal transport map T_{GGW_2} defined in Equation (4.1) is not equivalent to the W_2 optimal transport map T_{W_2} defined in (1.5), even when the dimensions m and n are equal. More precisely, if Σ_0 and Σ_1 can be diagonalized in the same orthonormal basis with eigenvalues in the same order (decreasing or increasing), then T_{W_2} and T_{GGW_2} are equivalent (top of Figure 3). On the other hand, if Σ_0 and Σ_1 can be diagonalized in the same orthonormal basis but with eigenvalues not in the same order, T_{W_2} and T_{GGW_2} will have very different behaviors (bottom of Figure 3). Between those two extreme cases, we can say that the closer the columns of P_0 are to being collinear with the columns of P_1 (with the eigenvalues in decreasing order), the more T_{W_2} and T_{GGW_2} will tend to have similar behaviors (middle of Figure 3).
Figure 2: Solution of (GaussGW) between two Gaussian measures in dimension 2. First the distributions are centered and rotated. Then a classic W_2 transport is applied between the two aligned distributions.
Figure 3: Comparison between W_2 and GGW_2 mappings between empirical distributions. Left: 2D source distribution (colored) and target distribution (transparent). Middle: resulting mapping of Wasserstein T_{W_2}. Right: resulting mapping of Gaussian Gromov-Wasserstein T_{GGW_2}. The colors are added in order to visualize where each sample has been sent.
Link with Gromov-Wasserstein with inner product as cost function. If µ and ν are centered Gaussian measures, let us consider the following problem

GW_2²(⟨., .⟩_m, ⟨., .⟩_n, µ, ν) = inf_{π ∈ Π(µ,ν)} ∫∫ ( ⟨x, x'⟩_m − ⟨y, y'⟩_n )² dπ(x, y) dπ(x', y').   (innerGW)

Notice that the above problem is not restricted to Gaussian plans, but the following proposition shows that its solution is in fact Gaussian.
Proposition 4.1. Suppose n ≤ m. Let µ = N(0, Σ_0) and ν = N(0, Σ_1) be two centered Gaussian measures respectively on R^m and R^n. Then the solution of problem (GaussGW) exhibited in Theorem 4.1 is also a solution of problem (innerGW).
Proof. The proof of this proposition is a direct consequence of Lemma 2.2: indeed, applying it with a = 0, b = 0, and c = 1, it follows that problem (innerGW) is equivalent to

sup_{π ∈ Π(µ,ν)} ‖ ∫ x y^T dπ(x, y) ‖_F².   (4.10)

Since µ and ν are centered, it follows that problem (innerGW) is equivalent to

sup_{X ∼ µ, Y ∼ ν} ‖Cov(X, Y)‖_F².   (4.11)

Applying Lemma 3.2, it follows that the solution exhibited in Theorem 4.1 is also a solution of problem (innerGW).
Since GGW_2 is the Gromov-Wasserstein problem restricted to Gaussian transport plans, it is clear that (GaussGW) is an upper bound of (GW). Combining this result with Theorem 3.1, we get the following simple but important result.
Proposition 4.2. If µ = N(m_0, Σ_0) and ν = N(m_1, Σ_1) and Σ_0 is non-singular, then

LGW_2²(µ, ν) ≤ GW_2²(µ, ν) ≤ GGW_2²(µ, ν).   (4.12)
5 Tightness of the bounds and particular cases
5.1 Bound on the difference
Proposition 5.1. Suppose without loss of generality that n ≤ m. If µ = N(m_0, Σ_0) and ν = N(m_1, Σ_1), then

GGW_2²(µ, ν) − LGW_2²(µ, ν) ≤ 8 ‖Σ_0‖_F ‖Σ_1‖_F ( 1 − 1/√m ).   (5.1)
To prove this proposition, we will use the following technical result (the proof is postponed to the Appendix (Section 8)):
Lemma 5.1. Let u ∈ R^m and v ∈ R^m be two unit vectors with non-negative coordinates ordered in decreasing order. Then

u^T v ≥ 1/√m,   (5.2)

with equality if u = (1/√m, 1/√m, ..., 1/√m)^T and v = (1, 0, ..., 0)^T.
Proof of Proposition 5.1. By subtracting (LGW) from (GGW), it follows that

GGW_2²(µ, ν) − LGW_2²(µ, ν) = 8 ( ‖D_0‖_F ‖D_1‖_F − tr(D_0^{(n)} D_1) ) = 8 ( ‖D_0‖_F ‖D_1^{[m]}‖_F − tr(D_0 D_1^{[m]}) ),   (5.3)

where D_1^{[m]} = ( D_1  0 ; 0  0 ) ∈ R^{m×m}. Noting α ∈ R^m and β ∈ R^m the vectors of eigenvalues of D_0 and D_1^{[m]}, it follows that

GGW_2²(µ, ν) − LGW_2²(µ, ν) = 8 ( ‖α‖ ‖β‖ − α^T β ) = 8 ‖α‖ ‖β‖ (1 − u^T v),   (5.4)

where u = α/‖α‖ and v = β/‖β‖. Applying Lemma 5.1, we get directly that

GGW_2²(µ, ν) − LGW_2²(µ, ν) ≤ 8 ‖D_0‖_F ‖D_1^{[m]}‖_F ( 1 − 1/√m ) = 8 ‖Σ_0‖_F ‖Σ_1‖_F ( 1 − 1/√m ).   (5.5)
The difference between GGW_2²(µ, ν) and LGW_2²(µ, ν) can be seen as the difference between the right-hand and left-hand terms of the Cauchy-Schwarz inequality applied to the two vectors of eigenvalues α ∈ R^m and β ∈ R^m. The difference is maximized when the vectors α and β are the least collinear possible. This happens when the eigenvalues of D_0 are all equal and n = 1, or when ν is degenerate of true dimension 1. On the other hand, this difference is null when α and β are collinear. Between those two extremal cases, we can say that the difference between GGW_2²(µ, ν) and LGW_2²(µ, ν) will be relatively small if the last m − n eigenvalues of D_0 are small compared to the n first eigenvalues and if the n first eigenvalues are close to being proportional to the eigenvalues of D_1. An example in the case where m = 2 and n = 1 can be found in Figure 4.
Figure 4: Plot of GGW_2²(µ, ν) and LGW_2²(µ, ν) as a function of α_2, for µ = N(0, diag(α)), ν = N(0, β_1), α = (α_1, α_2)^T, with (α_1, β_1) = (1, 1) (left), (α_1, β_1) = (1, 2) (middle), and (α_1, β_1) = (1, 10) (right). One can easily compute using (GGW) and (LGW) that GGW_2²(µ, ν) = 12α_2² + 8α_2(α_1 − β_1) + 12(α_1 − β_1)² and LGW_2²(µ, ν) = 12α_2² + 8α_2(α_1 − β_1) − 8√(α_1² + α_2²) β_1 + 12(α_1 − β_1)² + 8α_1β_1.
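The m = 2, n = 1 example above can also be checked numerically against the bound of Proposition 5.1; a small sketch (assuming NumPy, with hypothetical values of α and β_1):

import numpy as np

a1, a2, b1 = 1.0, 0.5, 2.0                 # alpha_1 >= alpha_2 and beta_1 (hypothetical values)
d0 = np.array([a1, a2])                    # eigenvalues of Sigma_0
d1 = np.array([b1])                        # eigenvalue of Sigma_1

ggw2 = 4 * (d0.sum() - d1.sum()) ** 2 + 8 * (a1 - b1) ** 2 + 8 * a2 ** 2       # (GGW)
lgw2 = ggw2 - 8 * (np.linalg.norm(d0) * np.linalg.norm(d1) - a1 * b1)          # Equation (5.3)

# Proposition 5.1: the gap is at most 8 ||Sigma_0||_F ||Sigma_1||_F (1 - 1/sqrt(m)), with m = 2.
bound = 8 * np.linalg.norm(d0) * np.linalg.norm(d1) * (1 - 1 / np.sqrt(2))
print(ggw2 - lgw2, "<=", bound)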
5.2 Explicit case
As seen before, the difference between GGW_2²(µ, ν) and LGW_2²(µ, ν), with µ = N(m_0, Σ_0) and ν = N(m_1, Σ_1), is null when the two vectors of eigenvalues of Σ_0 and Σ_1 (sorted in decreasing order) are collinear. When we suppose Σ_0 non-singular, this implies that m = n and that the eigenvalues of Σ_1 are proportional to the eigenvalues of Σ_0 (rescaling). This case includes the more particular case where m = n = 1, that is µ = N(m_0, σ_0²) and ν = N(m_1, σ_1²), because σ_1 is always proportional to σ_0.
Proposition 5.2. Suppose m = n. Let µ = N(m_0, Σ_0) and ν = N(m_1, Σ_1) be two Gaussian measures on R^m = R^n. Let P_0, D_0 and P_1, D_1 be the respective diagonalizations of Σ_0 (= P_0 D_0 P_0^T) and Σ_1 (= P_1 D_1 P_1^T) which sort the eigenvalues in non-increasing order. Suppose that Σ_0 is non-singular and that there exists a scalar λ ≥ 0 such that D_1 = λ D_0. In that case, GW_2²(µ, ν) = GGW_2²(µ, ν) = LGW_2²(µ, ν) and the problem admits a solution of the form (I_m, T)#µ with T affine of the form

∀x ∈ R^m,  T(x) = m_1 + √λ P_1 Ĩ_m P_0^T (x − m_0),   (5.6)

where Ĩ_m is of the form diag((±1)_{i≤m}). Moreover,

GW_2²(µ, ν) = (λ − 1)² ( 4 (tr(Σ_0))² + 8 ‖Σ_0‖_F² ).   (5.7)
Proof. From (5.3), we have

GGW_2²(µ, ν) − LGW_2²(µ, ν) = 8 ( ‖D_0‖_F ‖D_1‖_F − tr(D_0 D_1) ).   (5.8)

Noting α ∈ R^m and β ∈ R^m the vectors of eigenvalues of D_0 and D_1, it follows that

GGW_2²(µ, ν) − LGW_2²(µ, ν) = 8 ( ‖α‖ ‖β‖ − α^T β ).   (5.9)

Since there exists λ ≥ 0 such that D_1 = λ D_0, we have β = λα, and so α^T β = ‖α‖ ‖β‖. Thus GGW_2²(µ, ν) − LGW_2²(µ, ν) = 0 and, using Proposition 4.2, we get that GW_2²(µ, ν) = GGW_2²(µ, ν) = LGW_2²(µ, ν). We get (5.6) and (5.7) by simply reinjecting into (4.1) and (GGW).
Corollary 5.1. Let µ = N(m_0, σ_0²) and ν = N(m_1, σ_1²) be two Gaussian measures on R. Then

GW_2²(µ, ν) = 12 ( σ_0² − σ_1² )²,   (5.10)

and the optimal transport plan π* has the form (I_1, T)#µ with T affine of the form

∀x ∈ R,  T(x) = m_1 ± (σ_1/σ_0)(x − m_0).   (5.11)

Thus, the solution of W_2²(µ, ν) is also a solution of GW_2²(µ, ν).
5.3 Case of degenerate measures
In all the results exposed above, we have supposed Σ_0 non-singular, which means that µ is not degenerate. Yet, if Σ_0 is not full rank, one can easily extend the previous results thanks to the following proposition.
Proposition 5.3. Let µ = N(0, D_0) and ν = N(0, D_1) be two centered Gaussian measures on R^m and R^n with diagonal covariance matrices D_0 and D_1 with eigenvalues in decreasing order. We denote r = rk(D_0) the rank of D_0 and we suppose that r < m. Let us define P_r = ( I_r  0_{r,m−r} ) ∈ R^{r×m}. Then GW_2²(µ, ν) = GW_2²(P_r#µ, ν), GGW_2²(µ, ν) = GGW_2²(P_r#µ, ν), and LGW_2²(µ, ν) = LGW_2²(P_r#µ, ν).
Proof. For r < m, we denote Γ_r(R^m) the set of vectors x = (x_1, ..., x_m)^T of R^m such that x_{r+1} = ... = x_m = 0. For π ∈ Π(µ, ν), one can remark that for any Borel set A ⊂ R^m \ Γ_r(R^m) and any Borel set B ⊂ R^n, we have π(A, B) = 0, and so

GW_2²(µ, ν) = inf_{π ∈ Π(µ,ν)} ∫_{R^m×R^n} ∫_{R^m×R^n} ( ‖x − x'‖²_{R^m} − ‖y − y'‖²_{R^n} )² dπ(x, y) dπ(x', y')
            = inf_{π ∈ Π(µ,ν)} ∫_{Γ_r(R^m)×R^n} ∫_{Γ_r(R^m)×R^n} ( ‖x − x'‖²_{R^m} − ‖y − y'‖²_{R^n} )² dπ(x, y) dπ(x', y')
            = inf_{π ∈ Π(µ,ν)} ∫_{Γ_r(R^m)×R^n} ∫_{Γ_r(R^m)×R^n} ( ‖P_r(x − x')‖²_{R^r} − ‖y − y'‖²_{R^n} )² dπ(x, y) dπ(x', y').   (5.12)
Now, observe that for π ∈ Π(µ, ν), (P_r, I_n)#π ∈ Π(P_r#µ, ν) and, by (5.12), it has the same cost. It follows that

GW_2²(µ, ν) ≥ inf_{π ∈ Π(P_r#µ, ν)} ∫_{R^r×R^n} ∫_{R^r×R^n} ( ‖x − x'‖²_{R^r} − ‖y − y'‖²_{R^n} )² dπ(x, y) dπ(x', y') = GW_2²(P_r#µ, ν).   (5.13)

Conversely, since µ has no mass outside of Γ_r(R^m), P_r^T#P_r#µ = µ, which implies that for π ∈ Π(P_r#µ, ν), (P_r^T, I_n)#π ∈ Π(µ, ν). It follows that

GW_2²(P_r#µ, ν) = inf_{π ∈ Π(P_r#µ, ν)} ∫_{R^r×R^n} ∫_{R^r×R^n} ( ‖x − x'‖²_{R^r} − ‖y − y'‖²_{R^n} )² dπ(x, y) dπ(x', y')
               = inf_{π ∈ Π(P_r#µ, ν)} ∫_{R^r×R^n} ∫_{R^r×R^n} ( ‖P_r^T(x − x')‖²_{R^m} − ‖y − y'‖²_{R^n} )² dπ(x, y) dπ(x', y')
               ≥ inf_{π ∈ Π(µ,ν)} ∫_{R^m×R^n} ∫_{R^m×R^n} ( ‖x − x'‖²_{R^m} − ‖y − y'‖²_{R^n} )² dπ(x, y) dπ(x', y') = GW_2²(µ, ν).   (5.14)

The exact same reasoning can be made in the case of GGW_2. Moreover, it can easily be seen when looking at (LGW) that LGW_2²(µ, ν) = LGW_2²(P_r#µ, ν).
Thus, when Σ_0 is not full rank, one can apply Proposition 5.3 and consider directly the Gromov-Wasserstein distance between the projected (non-degenerate) measure P_r#µ on R^r and ν, and so Proposition 4.2 still holds when µ is degenerate.
In the case of GGW_2, an explicit optimal transport plan can still be exhibited. In the following, we denote r_0 and r_1 the ranks of Σ_0 and Σ_1, and we suppose without loss of generality that r_0 ≥ r_1, but this time not necessarily that m ≥ n. If µ = N(m_0, Σ_0) and ν = N(m_1, Σ_1) are two Gaussian measures on R^m and R^n, and (P_0, D_0) and (P_1, D_1) are the respective diagonalizations of Σ_0 (= P_0 D_0 P_0^T) and Σ_1 (= P_1 D_1 P_1^T) which sort the eigenvalues in decreasing order, an optimal transport plan which achieves GGW_2(µ, ν) is of the form π* = (I_m, T)#µ with

∀x ∈ R^m,  T(x) = m_1 + P_1 A P_0^T (x − m_0),   (5.15)

where A ∈ R^{n×m} is of the form

A = ( Ĩ_{r_1} (D_1^{(r_1)})^{1/2} (D_0^{(r_1)})^{-1/2}   0_{r_1, m−r_1} ; 0_{n−r_1, r_1}   0_{n−r_1, m−r_1} ),

where Ĩ_{r_1} is any matrix of the form diag((±1)_{i≤r_1}).
6 Behavior of the empirical solution
To complete the previous study, we perform a simple experiment to illustrate the behavior of the solution of the Gromov-Wasserstein problem. In this experiment, we draw independently k samples (X_j)_{j≤k} and (Y_i)_{i≤k} from respectively µ = N(0, diag(α)) and ν = N(0, diag(β)), with α ∈ R^m and β ∈ R^n. Then we compute the Gromov-Wasserstein distance between the two histograms X and Y with the algorithm proposed in [19], using the Python Optimal Transport library³. In Figure 5, we plot the first coordinate of the samples Y_i as a function of the first coordinate of the samples X_j they have been assigned to by the algorithm (blue dots). We also draw the line of equation y = ±√β x to compare with the theoretical solution of the Gaussian restricted problem (orange line), for k = 2000,
³The library is accessible here: https://pythonot.github.io/index.html
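A minimal version of this experiment can be written with the POT library; the sketch below (with hypothetical values of α, β and k) computes the empirical Gromov-Wasserstein coupling between the two point clouds and reads off, for each sample of µ, the sample of ν it is mainly assigned to. The solver options are assumptions on our part and may have to be adapted to the installed version of POT:

import numpy as np
import ot   # Python Optimal Transport, https://pythonot.github.io/index.html

rng = np.random.default_rng(0)
k = 500
alpha, beta = np.array([4.0, 1.0]), np.array([2.0])            # hypothetical eigenvalues
X = rng.normal(size=(k, alpha.size)) * np.sqrt(alpha)          # samples of mu = N(0, diag(alpha))
Y = rng.normal(size=(k, beta.size)) * np.sqrt(beta)            # samples of nu = N(0, diag(beta))

C1 = ot.dist(X, X)                 # pairwise squared Euclidean distances, i.e. the costs c_X
C2 = ot.dist(Y, Y)                 # and c_Y
p = np.full(k, 1.0 / k)
q = np.full(k, 1.0 / k)

# Gromov-Wasserstein coupling between the two empirical distributions (solver of [19])
pi = ot.gromov.gromov_wasserstein(C1, C2, p, q, loss_fun='square_loss')

# First coordinate of the sample Y_i to which each X_j is mainly assigned, to be compared
# with the linear map of Theorem 4.1.
assignment = pi.argmax(axis=1)
print(np.c_[X[:5, 0], Y[assignment[:5], 0]])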