On the linear convergence of the multi-marginal Sinkhorn algorithm

(1)

HAL Id: hal-03176512

https://hal.archives-ouvertes.fr/hal-03176512

Preprint submitted on 22 Mar 2021

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de

On the linear convergence of the multi-marginal

Sinkhorn algorithm

Guillaume Carlier

To cite this version:

Guillaume Carlier. On the linear convergence of the multi-marginal Sinkhorn algorithm. 2021. �hal-03176512�

(2)

On the linear convergence of the

multi-marginal Sinkhorn algorithm

Guillaume Carlier

∗

March 22, 2021

Abstract

The aim of this short note is to give an elementary proof of linear convergence of the Sinkhorn algorithm for the entropic regularization of multi-marginal optimal transport. The proof simply relies on: i) the fact that Sinkhorn iterates are bounded, ii) strong convexity of the exponential on bounded intervals and iii) the convergence analysis of the coordinate descent (Gauss-Seidel) method of Beck and Tetru-ashvili [1].

Keywords: multi-marginal entropic optimal transport, Sinkhorn algo-rithm, linear convergence, block coordinate descent.

MS Classification: 45G15, 49M05.

1 Introduction

Eventhough the Sinkhorn algorithm1 _{is more than 50 years old [16], it has}

attracted a considerable attention in the last years. It is now at the heart of efficient solvers for the entropic regularization of optimal transport problems, a field on which Cuturi’s paper [6] had a tremendous impact (see Cuturi and Doucet [5], Cuturi and Peyré [13], Benamou et al. [2]...). The Sinkhorn algorithm remains fascinating by its simplicity and its connections with the Schrödinger bridge problem first addressed by Schrödinger in [15] and large deviations theory, see Dawson and Gärtner [7], Föllmer [9], Léonard [12, 11].

∗_{CEREMADE, Universit´e Paris Dauphine, PSL and INRIA-Paris, MOKAPLAN,} [email protected]

1_{also known as iterative proportional fitting procedure (IPFP) in the probability and} statistics literature, see R¨uschendorf, [14] and the references therein.

(3)

The linear convergence of the Sinhkorn algorithm for two marginals is well-known. A very elegant proof consists in using a celebrated theorem of Birkhoff to show that the Sinkhorn algorithm consists in iterating a contrac-tion for the Hilbert projective metric, see Franklin and Lorenz [10], and more recently, Chen, Georgiou and Pavon, [4].

To the best of our knowledge, the elegant Hilbert metric proof does not carry over to the multi-marginal case for which an annoying N − 1 factor (N being the number of marginals) appears in the Lipschitz constant for the Hilbert metric. Convergence of the multi-marginal Sinkhorn algorithm was recently obtained by Di Marino and Gerolin [8] and the well-posedness (ex-istence, uniqueness and smooth dependence on the data) of the Schr¨odinger system (see (2.3) below) was addressed by completely different arguments (local and global inversion theorems) by the author and Laborde in [3]. In the analysis of [8], a key ingredient is that Sinkhorn iterates are coordinate descent updates for a convex minimization problem (dual to an entropy min-imization subject to multi-marginal constraints), see the definition of F in (2.4) below. In this note, we slightly improve the results of Di Marino and Gerolin, by showing linear convergence. The proof relies on the convergence analysis of the coordinate descent method of Beck and Tetruashvili [1] which can easily be used here, since Sinkhorn iterates are bounded in L∞ _{so remain}

in a set where the functional F is uniformly convex.

2 Multi-marginal Sinkhorn algorithm

We are given an integer N ≥ 2, N probability spaces (Xi,Fi, mi), i =

1, . . . , N and set X := N Y i=1 Xi,F := N O i=1 Fi, m := N O i=1 mi. (2.1)

Given i ∈ {1, . . . , N}, we will denote by X−i := QN_j6=iXj, m−i := NN_j6=imj

and will always identify X to Xi× X−i i.e. will denote x = (x1, . . . , xN) ∈ X

as x = (xi, x−i). The set of measures on (X, F ) having m1, . . . , mN as

marginals will be denoted Π(m1, . . . , mN). Given p ∈ [1, ∞] and (ϕ1, . . . , ϕN) ∈

QN

i=1Lp(Xi,Fi, mi) we will use the notations

⊕N i=1ϕi : (x1, . . . , xN) ∈ X 7→ N X i=1 ϕi(xi),

(4)

and ⊗N_i=1ϕi : (x1, . . . , xN) ∈ X 7→ N Y i=1 ϕi(xi).

Given a cost c ∈ L∞_{(X, F , m), we set}

kck∞ := kckL∞_{(X,F ,m)}. (2.2)

The associated Gibbs kernel is

K := e−c,

so that e−kck∞ _{≤ K ≤ e}kck∞_{, m-almost everywhere. We look for potentials}

ϕ = (ϕ1, . . . , ϕN) ∈

QN

i=1L∞(Xi,Fi, mi) such that the measure

Qϕ := Ke⊕

N i=1ϕi_m

belongs to Π(m1, . . . , mN) i.e. solve the Schr¨odinger system:

eϕi(xi)

Z

X−i

e−c(x1,...,xN)+Pj6=iϕj(xj)_dm

−i(x−i) = 1, (2.3)

for every i and mi-a.e. xi. The system (2.3) is well-known to be the

Euler-Lagrange optimality condition for the convex minimization problem inf ϕ∈QNi=1L∞(Xi,Fi,mi) F(ϕ) := − N X i=1 Z Xi ϕidmi+ Z X dQϕ (2.4)

and if ϕ solves (2.3), the measure Qϕ solves the multi-marginal entropy

min-imization:

inf

Q∈Π(m1,...,mN)

H(Q|e−c_m).

Let us observe that whenever λ1, . . . , λN are constants which sum to 0, then

F(ϕ1+ λ1, . . . , ϕN + λN) = F (ϕ1, . . . , ϕN)

so that one can impose the N − 1 normalizing constraints: Z X1 ϕ1dm1 = . . . = Z XN−1 ϕN−1dmN−1 = 0. (2.5) Denoting by Lp

⋄(Xi,Fi, mi) the space of zero-mean Lp potentials:

Lp_⋄(Xi,Fi, mi) := {ϕi ∈ Lp(Xi,Fi, mi) :

Z

Xi

(5)

we thus consider inf ϕ∈EF(ϕ) where E := N−1_Y i=1 L∞_⋄ (Xi,Fi, mi) × L∞(XN,FN, mN). (2.6)

The Sinkhorn algorithm is nothing but block coordinate descent for the min-imization of F over E. Starting from ϕ0 _{∈ E, the updates of the Sinkhorn}

algorithm, consists, given ϕt_{= (ϕ}t

1, . . . , ϕtN) ∈ E, in: ϕt+1₁ := argminϕ1∈L∞⋄ (X1,F1,m1)F(ϕ1, ϕ t 2, . . . , ϕt2) (2.7) i.e. ϕt+1₁ (x1) := − log Z X−1 ePNj=2ϕtj(xj)_K(x 1, x−1)dm−1(x−1) + λt 1, ∀x1 ∈ X1 (2.8) where λt₁ = Z X1 log Z X−1 e⊕Nj=2ϕtjK(x 1, x−1)dm−1(x−1) dm1(x1). (2.9) Then, for i = 2, . . . , N − 1, ϕt+1_i := argmin_ϕ_i_∈L∞ ⋄ (Xi,Fi,mi)F(ϕ t+1 1 , . . . , ϕt+1i−1, ϕi, ϕti+1, . . . , ϕtN) (2.10) i.e. ϕt+1_i (xi) := − log Z X−i e⊕i−1j=1ϕ t+1 j ⊕Nj=i+1ϕtjK(x

i, x−i)dm−i(x−i)

+λt i, ∀xi ∈ Xi (2.11) where λt_i = Z Xi log Z X−i e⊕i−1j=1ϕ t+1 j ⊕Nj=i+1ϕtjK(x

i, x−i)dm−i(x−i)

dmi(xi). (2.12) Finally, for i = N, ϕt+1_N := argmin_ϕ_N_∈L∞_(X N,FN,mN)F(ϕ t+1 1 , . . . , ϕt+1N−1, . . . , ϕN) (2.13) i.e. ϕt+1_N (xN) := − log Z X−N e⊕N−1j=1 ϕ t+1 j K(x N, x−N)dm−N(x−N) , ∀xN ∈ XN. (2.14) The convergence of the Sinkhorn iterates to a solution of (2.3) (hence a minimizer of (2.4)) was established by Di Marino and Gerolin [8]. The aim of the next paragraph is to slightly improve this result by showing that this convergence is linear.

(6)

3 Linear convergence

Thanks to the normalization (2.5), arguing as in [8], we have uniform bounds on the Sinkhorn iterates:

Lemma 3.1. For every t ≥ 1, the Sinkhorn iterates ϕt _{satisfy the bounds:}

kϕt ikL∞_(X

i,Fi,mi)≤ 2kck∞, i= 1, . . . , N − 1, (3.1)

kϕtNkL∞_(X

N,FN,mN) ≤ (2N − 1)kck∞. (3.2)

Proof. Since for mi⊗ mi-a.e. (xi, yi) and m−i-a.e. x−i ∈ X−i, one has

c(yi, x−i) ≥ c(xi, x−i) − 2kck∞

we deduce from (2.8) and (2.11) that for i = 1, . . . , N − 1, ϕt

i(xi) − ϕti(yi) ≤

2kck∞, using the fact that ϕti ∈ L∞⋄ (Xi,Fi, mi) and integrating the previous

inequality, we immediately deduce (3.1). Once we have these bounds on ϕt i,

for i = 1, . . . , N − 1, using (2.14), together with e−kck∞ _{≤ K ≤ e}kck∞_{, we}

deduce (3.2).

Now that we have uniform, bounds on ϕt_{, we can take advantage of}

the strong convexity and Lipschitz continuity of the exponential function on bounded intervals, to use the analysis of Beck and Tetruashvili [1]. Indeed, given M > 0, one obviously has, ∀(a, b) ∈ [−M, M]2_:

eb− ea− ea(b − a) ≥ e −M 2 (b − a) 2_, _|eb − ea| ≤ eM|b − a|. (3.3) Lemma 3.2. Defining ν = e−(4N −2)kck∞_, _(3.4) one has F(ϕt_{) − F (ϕ}t+1_{) ≥} ν 2 N X i=1 kϕt i− ϕ t+1 i k 2 L2_(X i,Fi,mi). (3.5) Proof. Define e ϕt_i := (ϕt+1₁ , . . . , ϕt+1_i , ϕ_i+1t , . . . , ϕt_N_{), i = 1, . . . , N − 1, e}ϕt_N := ϕt+1, (3.6) and write in a telescopic fashion

F(ϕt_{) − F (ϕ}t+1_{) = F (ϕ}t

) − F ( eϕt₁) +

N_X−1 i=1

(7)

Using successively, the first basic inequality in (3.3), (2.8), the fact that ϕt

1− ϕt+11 has zero mean against m1, and the bounds from lemma 3.1, we get

F(ϕt_{) − F ( e}ϕt₁) = Z X (eϕt1(x1)_{− e}ϕt+11 (x1)₎ N Y j=2 eϕtj(xj)_e−c(x)_dm(x) ≥ Z X (ϕt 1(x1) − ϕt+11 (x1))eϕ t+1 1 (x1) N Y j=2 eϕtj(xj)e−c(x)dm(x) +e −2kck∞ 2 Z X (ϕt 1(x1) − ϕt+11 (x1))2 N Y j=2 eϕtj(xj)_e−c(x)_dm(x) ≥ eλt 1 Z X1 (ϕt 1(x1) − ϕt+11 (x1))dm1(x1) +e −(4N −2)kck∞ 2 Z X1 (ϕt 1(x1) − ϕt+11 (x1))2dm1(x1) = e −(4N −2)kck∞ 2 Z X1 (ϕt 1− ϕt+11 )2dm1.

Similarly, for i = 1, . . . , N − 1, we have F_{( e}ϕt_i_{) − F ( e}ϕt_i+1) ≥ e

−(4N −2)kck∞

2

Z

Xi+1

(ϕti+1− ϕt+1i+1)2dmi+1

which shows (3.5).

Since F is bounded from below on E, the left-hand side of (3.5) converges to 0. Note also that since ϕt _{and ϕ}t+1 _{belong to E, one has the identity}

k ⊕N_i=1(ϕt+1− ϕt)k2L2_{(X,F ,m)} = N X i=1 kϕti− ϕt+1i k2L2_(X i,Fi,mi) (3.7)

and we deduce from (3.5) lim t→+∞k ⊕ N i=1 (ϕt+1− ϕ t )k2_L2_{(X,F ,m)} = 0. (3.8)

Together with the uniform bounds from lemma 3.1, we deduce that ϕt i− ϕ t+1 i as well as eϕt i − eϕ t+1

i converge strongly to 0 in Lp(m_i) = Lp(X_i,F_i, m_i) for

(8)

Theorem 3.3. The sequence of Sinkhorn iterates ϕt _{converges strongly in}

Lp

⋄(X1,F1, m1) × . . . × Lp⋄(XN−1,FN−1, mN−1) × Lp(XN,FN, mN) for every

p∈ [1, +∞), to the unique solution ϕ of (2.6). Moreover, there holds F(ϕt) − F (ϕ) ≤1 − e

−(16N −8)kck∞

N

t

(F (ϕ0) − F (ϕ)). (3.9)

Proof. The convergence of ϕt _{in every L}p _{was obtained by Di Marino and}

Gerolin [8], we include a short proof for the sake of completeness. Setting ati := eϕ

t

i, passing to a subsequence if necessary, we may assume the constants

λt

i in (2.9)-(2.11) converge and that ati, at+1i converges weakly to some ai in

L2(mi). Hence, for every i, ⊗j<iat+1j ⊗j>iatj weakly converges in L2(m−i) to

⊗j<iaj ⊗j>iaj. By construction of the Sinkhorn iterates, e

λt_i

at+1_i is expressed

as a Hilbert-Schmidt hence compact integral functional of ⊗j<iat+1j ⊗j>iatj,

hence _at+11 i

converges strongly in L2_(m

i). Since ati is uniformly bounded and

uniformly bounded away from 0, at+1_i converges strongly in L2_(m

i) as well

as in in Lp_(m

i) for any p ∈ [1, +∞) by the bounds from lemma 3.1.

Us-ing again that at

i is uniformly bounded and uniformly bounded away from

0, ϕt

i also strongly converges in Lp(mi) to ϕi := eai and of course ϕ ∈ E.

Observing that, by construction, ⊗j≤iat+1j ⊗j>i atjKm admits eλ

t im

i as i-th

marginal for i = 1, . . . , N − 1 and mN as N-th marginal, one easily checks

that e⊕N

i=1ϕiKm= ⊗N

i=1aiKm admits m1, . . . , mN as marginals (and all the

constants λt

i, i = 1, . . . , N − 1, tend to 0). Thus ϕ solves the system (2.3)

hence minimizes F over E, but since F is strictly convex over E, this min-imizer is unique and in fact the whole sequence ϕt _{strongly converges in L}p

to ϕ for every p ∈ [1, +∞).

Since ϕ satisfies the bounds of lemma 3.1, using (3.3) as we did in the proof of lemma 3.2, we arrive at

F(ϕ) − F (ϕt_{) ≥} N X i=1 Z Xi ∂iF(ϕt)(xi)(ϕi(xi) − ϕti(xi))dmi(xi) +ν 2 N X i=1 kϕi− ϕ t ik2L2_(m i)

where ν is the constant in (3.4) and ∂iF(ϕ)(xi) = −1 + eϕi(xi)

Z

X−i

e⊕j6=iϕj(xj)_e−c(xi,x−i)_dm

(9)

Defining eϕt

iby (3.6), by construction of the Sinkhorn iterates, for i = 1, . . . , N,

we have _Z Xi ∂iF( eϕti)(ϕi− ϕ t i)dmi = 0 hence F(ϕ) − F (ϕt_{) ≥} N X i=1 Z Xi (∂iF(ϕt) − ∂iF( eϕti))(ϕi− ϕti)dmi +ν 2 N X i=1 kϕi− ϕtik 2 L2_(m i) ≥ − 1 2ν N X i=1 k∂iF(ϕt) − ∂iF( eϕti)k2L2_(m i)

where we have used Young’s inequality in the last line. We thus have shown that F(ϕt_{) − F (ϕ) ≤} 1 2ν N X i=1 k∂iF(ϕt) − ∂iF( eϕti)k 2 L2_(m i). (3.10)

Using the second inequality in (3.3) together with the L∞_{bounds on ϕ}t_from

lemma 3.1 and Jensen’s inequality yield (∂iF(ϕt)(xi) − ∂iF( eϕti(xi))2 ≤ 1 ν2 Z X−i (⊕N j=1ϕtj − ⊕ N j=1( eϕti)j)2m−i so that k∂iF(ϕt) − ∂iF( eϕti)k2L2_(m i) ≤ 1 ν2 N X j=1 kϕtj− ( eϕ t i)jk2L2_(m j) ≤ 1 ν2 N X j=1 kϕtj − ϕt+1j k2L2_(m j),

together with (3.10), we thus obtain F(ϕt_{) − F (ϕ) ≤} N 2ν3 N X i=1 kϕt i − ϕ t+1 i k 2 L2_(m i). (3.11)

Finally, combining (3.11) with (3.5), we deduce F(ϕt_{)−F (ϕ) ≤} N

ν4(F (ϕ

t_{)−F (ϕ}t+1_{)) =} N

ν4((F (ϕ

t_{)−F (ϕ))−(F (ϕ}t+1_{)−F (ϕ)))}

(10)

Remark 3.4. We also have linear convergence of ϕt _{to ϕ in L}2 _{and every L}p_,

p∈ [1, +∞).

Acknowledgments: G.C. is grateful to the Agence Nationale de la Recherche for its support through the project MAGA (ANR-16-CE40-0014).

References

[1] Amir Beck and Luba Tetruashvili. On the convergence of block coordi-nate descent type methods. SIAM J. Optim., 23(4):2037–2060, 2013. [2] Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna,

and Gabriel Peyr´e. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing,

37(2):A1111–A1138, 2015.

[3] Guillaume Carlier and Maxime Laborde. A differential approach to the multi-marginal Schr¨odinger system. SIAM J. Math. Anal., 52(1):709– 717, 2020.

[4] Yongxin Chen, Tryphon Georgiou, and Michele Pavon. Entropic and displacement interpolation: a computational approach using the Hilbert metric. SIAM J. Appl. Math., 76(6):2375–2396, 2016.

[5] M. Cuturi and A. Doucet. Fast computation of Wasserstein barycen-ters. In Proceedings of the 31st International Conference on Machine

Learning (ICML), JMLR W&CP, volume 32, 2014.

[6] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.

[7] Donald A. Dawson and J¨urgen G¨artner. Large deviations from the McKean-Vlasov limit for weakly interacting diffusions. Stochastics,

20(4):247–308, 1987.

[8] Simone Di Marino and Augusto Gerolin. An optimal transport approach for the Schr¨odinger bridge problem and convergence of Sinkhorn algo-rithm. J. Sci. Comput., 85(2):Paper No. 27, 28, 2020.

[9] Hans Föllmer. Random fields and diffusion processes. In École d’ Été de Probabilités de Saint-Flour XV–XVII, 1985–87, volume 1362 of Lecture Notes in Math., pages 101–203. Springer, Berlin, 1988.

(11)

[10] Joel Franklin and Jens Lorenz. On the scaling of multidimensional ma-trices. Linear Algebra and its applications, 114:717–735, 1989.

[11] C. L´eonard. A survey of the Schr¨odinger problem and some of its connec-tions with optimal transport. Discrete Contin. Dyn. Syst. A, 34(4):1533– 1574, 2014.

[12] Christian L´eonard. From the Schr¨odinger problem to the Monge– Kantorovich problem. Journal of Functional Analysis, 262(4):1879–

1920, 2012.

[13] Gabriel Peyr´e and Marco Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends in Machine

Learning, 11(5-6):355–607, 2019.

[14] Ludger R¨uschendorf. Convergence of the iterative proportional fitting procedure. Ann. Statist., 23(4):1160–1174, 1995.

[15] Erwin Schrödinger. Sur la théorie relativiste de l’électron et l’interprétation de la mécanique quantique. In Annales de l’institut

Henri Poincar´e, volume 2, pages 269–310, 1932.

[16] R. Sinkhorn. Diagonal equivalence to matrices with prescribed row and column sums. Amer. Math. Monthly, 74:402–405, 1967.