HAL Id: hal-03176512
https://hal.archives-ouvertes.fr/hal-03176512
Preprint submitted on 22 Mar 2021
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from
L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de
On the linear convergence of the multi-marginal
Sinkhorn algorithm
Guillaume Carlier
To cite this version:
Guillaume Carlier. On the linear convergence of the multi-marginal Sinkhorn algorithm. 2021. �hal-03176512�
On the linear convergence of the
multi-marginal Sinkhorn algorithm
Guillaume Carlier
∗March 22, 2021
Abstract
The aim of this short note is to give an elementary proof of linear convergence of the Sinkhorn algorithm for the entropic regularization of multi-marginal optimal transport. The proof simply relies on: i) the fact that Sinkhorn iterates are bounded, ii) strong convexity of the exponential on bounded intervals and iii) the convergence analysis of the coordinate descent (Gauss-Seidel) method of Beck and Tetru-ashvili [1].
Keywords: multi-marginal entropic optimal transport, Sinkhorn algo-rithm, linear convergence, block coordinate descent.
MS Classification: 45G15, 49M05.
1
Introduction
Eventhough the Sinkhorn algorithm1 is more than 50 years old [16], it has
attracted a considerable attention in the last years. It is now at the heart of efficient solvers for the entropic regularization of optimal transport problems, a field on which Cuturi’s paper [6] had a tremendous impact (see Cuturi and Doucet [5], Cuturi and Peyr´e [13], Benamou et al. [2]...). The Sinkhorn algorithm remains fascinating by its simplicity and its connections with the Schr¨odinger bridge problem first addressed by Schr¨odinger in [15] and large deviations theory, see Dawson and G¨artner [7], F¨ollmer [9], L´eonard [12, 11].
∗CEREMADE, Universit´e Paris Dauphine, PSL and INRIA-Paris, MOKAPLAN, [email protected]
1also known as iterative proportional fitting procedure (IPFP) in the probability and statistics literature, see R¨uschendorf, [14] and the references therein.
The linear convergence of the Sinhkorn algorithm for two marginals is well-known. A very elegant proof consists in using a celebrated theorem of Birkhoff to show that the Sinkhorn algorithm consists in iterating a contrac-tion for the Hilbert projective metric, see Franklin and Lorenz [10], and more recently, Chen, Georgiou and Pavon, [4].
To the best of our knowledge, the elegant Hilbert metric proof does not carry over to the multi-marginal case for which an annoying N − 1 factor (N being the number of marginals) appears in the Lipschitz constant for the Hilbert metric. Convergence of the multi-marginal Sinkhorn algorithm was recently obtained by Di Marino and Gerolin [8] and the well-posedness (ex-istence, uniqueness and smooth dependence on the data) of the Schr¨odinger system (see (2.3) below) was addressed by completely different arguments (local and global inversion theorems) by the author and Laborde in [3]. In the analysis of [8], a key ingredient is that Sinkhorn iterates are coordinate descent updates for a convex minimization problem (dual to an entropy min-imization subject to multi-marginal constraints), see the definition of F in (2.4) below. In this note, we slightly improve the results of Di Marino and Gerolin, by showing linear convergence. The proof relies on the convergence analysis of the coordinate descent method of Beck and Tetruashvili [1] which can easily be used here, since Sinkhorn iterates are bounded in L∞ so remain
in a set where the functional F is uniformly convex.
2
Multi-marginal Sinkhorn algorithm
We are given an integer N ≥ 2, N probability spaces (Xi,Fi, mi), i =
1, . . . , N and set X := N Y i=1 Xi,F := N O i=1 Fi, m := N O i=1 mi. (2.1)
Given i ∈ {1, . . . , N}, we will denote by X−i := QNj6=iXj, m−i := NNj6=imj
and will always identify X to Xi× X−i i.e. will denote x = (x1, . . . , xN) ∈ X
as x = (xi, x−i). The set of measures on (X, F ) having m1, . . . , mN as
marginals will be denoted Π(m1, . . . , mN). Given p ∈ [1, ∞] and (ϕ1, . . . , ϕN) ∈
QN
i=1Lp(Xi,Fi, mi) we will use the notations
⊕N i=1ϕi : (x1, . . . , xN) ∈ X 7→ N X i=1 ϕi(xi),
and ⊗Ni=1ϕi : (x1, . . . , xN) ∈ X 7→ N Y i=1 ϕi(xi).
Given a cost c ∈ L∞(X, F , m), we set
kck∞ := kckL∞(X,F ,m). (2.2)
The associated Gibbs kernel is
K := e−c,
so that e−kck∞ ≤ K ≤ ekck∞, m-almost everywhere. We look for potentials
ϕ = (ϕ1, . . . , ϕN) ∈
QN
i=1L∞(Xi,Fi, mi) such that the measure
Qϕ := Ke⊕
N i=1ϕim
belongs to Π(m1, . . . , mN) i.e. solve the Schr¨odinger system:
eϕi(xi)
Z
X−i
e−c(x1,...,xN)+Pj6=iϕj(xj)dm
−i(x−i) = 1, (2.3)
for every i and mi-a.e. xi. The system (2.3) is well-known to be the
Euler-Lagrange optimality condition for the convex minimization problem inf ϕ∈QNi=1L∞(Xi,Fi,mi) F(ϕ) := − N X i=1 Z Xi ϕidmi+ Z X dQϕ (2.4)
and if ϕ solves (2.3), the measure Qϕ solves the multi-marginal entropy
min-imization:
inf
Q∈Π(m1,...,mN)
H(Q|e−cm).
Let us observe that whenever λ1, . . . , λN are constants which sum to 0, then
F(ϕ1+ λ1, . . . , ϕN + λN) = F (ϕ1, . . . , ϕN)
so that one can impose the N − 1 normalizing constraints: Z X1 ϕ1dm1 = . . . = Z XN−1 ϕN−1dmN−1 = 0. (2.5) Denoting by Lp
⋄(Xi,Fi, mi) the space of zero-mean Lp potentials:
Lp⋄(Xi,Fi, mi) := {ϕi ∈ Lp(Xi,Fi, mi) :
Z
Xi
we thus consider inf ϕ∈EF(ϕ) where E := N−1Y i=1 L∞⋄ (Xi,Fi, mi) × L∞(XN,FN, mN). (2.6)
The Sinkhorn algorithm is nothing but block coordinate descent for the min-imization of F over E. Starting from ϕ0 ∈ E, the updates of the Sinkhorn
algorithm, consists, given ϕt= (ϕt
1, . . . , ϕtN) ∈ E, in: ϕt+11 := argminϕ1∈L∞⋄ (X1,F1,m1)F(ϕ1, ϕ t 2, . . . , ϕt2) (2.7) i.e. ϕt+11 (x1) := − log Z X−1 ePNj=2ϕtj(xj)K(x 1, x−1)dm−1(x−1) + λt 1, ∀x1 ∈ X1 (2.8) where λt1 = Z X1 log Z X−1 e⊕Nj=2ϕtjK(x 1, x−1)dm−1(x−1) dm1(x1). (2.9) Then, for i = 2, . . . , N − 1, ϕt+1i := argminϕi∈L∞ ⋄ (Xi,Fi,mi)F(ϕ t+1 1 , . . . , ϕt+1i−1, ϕi, ϕti+1, . . . , ϕtN) (2.10) i.e. ϕt+1i (xi) := − log Z X−i e⊕i−1j=1ϕ t+1 j ⊕Nj=i+1ϕtjK(x
i, x−i)dm−i(x−i)
+λt i, ∀xi ∈ Xi (2.11) where λti = Z Xi log Z X−i e⊕i−1j=1ϕ t+1 j ⊕Nj=i+1ϕtjK(x
i, x−i)dm−i(x−i)
dmi(xi). (2.12) Finally, for i = N, ϕt+1N := argminϕN∈L∞(X N,FN,mN)F(ϕ t+1 1 , . . . , ϕt+1N−1, . . . , ϕN) (2.13) i.e. ϕt+1N (xN) := − log Z X−N e⊕N−1j=1 ϕ t+1 j K(x N, x−N)dm−N(x−N) , ∀xN ∈ XN. (2.14) The convergence of the Sinkhorn iterates to a solution of (2.3) (hence a minimizer of (2.4)) was established by Di Marino and Gerolin [8]. The aim of the next paragraph is to slightly improve this result by showing that this convergence is linear.
3
Linear convergence
Thanks to the normalization (2.5), arguing as in [8], we have uniform bounds on the Sinkhorn iterates:
Lemma 3.1. For every t ≥ 1, the Sinkhorn iterates ϕt satisfy the bounds:
kϕt ikL∞(X
i,Fi,mi)≤ 2kck∞, i= 1, . . . , N − 1, (3.1)
kϕtNkL∞(X
N,FN,mN) ≤ (2N − 1)kck∞. (3.2)
Proof. Since for mi⊗ mi-a.e. (xi, yi) and m−i-a.e. x−i ∈ X−i, one has
c(yi, x−i) ≥ c(xi, x−i) − 2kck∞
we deduce from (2.8) and (2.11) that for i = 1, . . . , N − 1, ϕt
i(xi) − ϕti(yi) ≤
2kck∞, using the fact that ϕti ∈ L∞⋄ (Xi,Fi, mi) and integrating the previous
inequality, we immediately deduce (3.1). Once we have these bounds on ϕt i,
for i = 1, . . . , N − 1, using (2.14), together with e−kck∞ ≤ K ≤ ekck∞, we
deduce (3.2).
Now that we have uniform, bounds on ϕt, we can take advantage of
the strong convexity and Lipschitz continuity of the exponential function on bounded intervals, to use the analysis of Beck and Tetruashvili [1]. Indeed, given M > 0, one obviously has, ∀(a, b) ∈ [−M, M]2:
eb− ea− ea(b − a) ≥ e −M 2 (b − a) 2, |eb − ea| ≤ eM|b − a|. (3.3) Lemma 3.2. Defining ν = e−(4N −2)kck∞, (3.4) one has F(ϕt) − F (ϕt+1) ≥ ν 2 N X i=1 kϕt i− ϕ t+1 i k 2 L2(X i,Fi,mi). (3.5) Proof. Define e ϕti := (ϕt+11 , . . . , ϕt+1i , ϕi+1t , . . . , ϕtN), i = 1, . . . , N − 1, eϕtN := ϕt+1, (3.6) and write in a telescopic fashion
F(ϕt) − F (ϕt+1) = F (ϕt
) − F ( eϕt1) +
NX−1 i=1
Using successively, the first basic inequality in (3.3), (2.8), the fact that ϕt
1− ϕt+11 has zero mean against m1, and the bounds from lemma 3.1, we get
F(ϕt) − F ( eϕt1) = Z X (eϕt1(x1)− eϕt+11 (x1)) N Y j=2 eϕtj(xj)e−c(x)dm(x) ≥ Z X (ϕt 1(x1) − ϕt+11 (x1))eϕ t+1 1 (x1) N Y j=2 eϕtj(xj)e−c(x)dm(x) +e −2kck∞ 2 Z X (ϕt 1(x1) − ϕt+11 (x1))2 N Y j=2 eϕtj(xj)e−c(x)dm(x) ≥ eλt 1 Z X1 (ϕt 1(x1) − ϕt+11 (x1))dm1(x1) +e −(4N −2)kck∞ 2 Z X1 (ϕt 1(x1) − ϕt+11 (x1))2dm1(x1) = e −(4N −2)kck∞ 2 Z X1 (ϕt 1− ϕt+11 )2dm1.
Similarly, for i = 1, . . . , N − 1, we have F( eϕti) − F ( eϕti+1) ≥ e
−(4N −2)kck∞
2
Z
Xi+1
(ϕti+1− ϕt+1i+1)2dmi+1
which shows (3.5).
Since F is bounded from below on E, the left-hand side of (3.5) converges to 0. Note also that since ϕt and ϕt+1 belong to E, one has the identity
k ⊕Ni=1(ϕt+1− ϕt)k2L2(X,F ,m) = N X i=1 kϕti− ϕt+1i k2L2(X i,Fi,mi) (3.7)
and we deduce from (3.5) lim t→+∞k ⊕ N i=1 (ϕt+1− ϕ t )k2L2(X,F ,m) = 0. (3.8)
Together with the uniform bounds from lemma 3.1, we deduce that ϕt i− ϕ t+1 i as well as eϕt i − eϕ t+1
i converge strongly to 0 in Lp(mi) = Lp(Xi,Fi, mi) for
Theorem 3.3. The sequence of Sinkhorn iterates ϕt converges strongly in
Lp
⋄(X1,F1, m1) × . . . × Lp⋄(XN−1,FN−1, mN−1) × Lp(XN,FN, mN) for every
p∈ [1, +∞), to the unique solution ϕ of (2.6). Moreover, there holds F(ϕt) − F (ϕ) ≤1 − e
−(16N −8)kck∞
N
t
(F (ϕ0) − F (ϕ)). (3.9)
Proof. The convergence of ϕt in every Lp was obtained by Di Marino and
Gerolin [8], we include a short proof for the sake of completeness. Setting ati := eϕ
t
i, passing to a subsequence if necessary, we may assume the constants
λt
i in (2.9)-(2.11) converge and that ati, at+1i converges weakly to some ai in
L2(mi). Hence, for every i, ⊗j<iat+1j ⊗j>iatj weakly converges in L2(m−i) to
⊗j<iaj ⊗j>iaj. By construction of the Sinkhorn iterates, e
λti
at+1i is expressed
as a Hilbert-Schmidt hence compact integral functional of ⊗j<iat+1j ⊗j>iatj,
hence at+11 i
converges strongly in L2(m
i). Since ati is uniformly bounded and
uniformly bounded away from 0, at+1i converges strongly in L2(m
i) as well
as in in Lp(m
i) for any p ∈ [1, +∞) by the bounds from lemma 3.1.
Us-ing again that at
i is uniformly bounded and uniformly bounded away from
0, ϕt
i also strongly converges in Lp(mi) to ϕi := eai and of course ϕ ∈ E.
Observing that, by construction, ⊗j≤iat+1j ⊗j>i atjKm admits eλ
t im
i as i-th
marginal for i = 1, . . . , N − 1 and mN as N-th marginal, one easily checks
that e⊕N
i=1ϕiKm= ⊗N
i=1aiKm admits m1, . . . , mN as marginals (and all the
constants λt
i, i = 1, . . . , N − 1, tend to 0). Thus ϕ solves the system (2.3)
hence minimizes F over E, but since F is strictly convex over E, this min-imizer is unique and in fact the whole sequence ϕt strongly converges in Lp
to ϕ for every p ∈ [1, +∞).
Since ϕ satisfies the bounds of lemma 3.1, using (3.3) as we did in the proof of lemma 3.2, we arrive at
F(ϕ) − F (ϕt) ≥ N X i=1 Z Xi ∂iF(ϕt)(xi)(ϕi(xi) − ϕti(xi))dmi(xi) +ν 2 N X i=1 kϕi− ϕ t ik2L2(m i)
where ν is the constant in (3.4) and ∂iF(ϕ)(xi) = −1 + eϕi(xi)
Z
X−i
e⊕j6=iϕj(xj)e−c(xi,x−i)dm
Defining eϕt
iby (3.6), by construction of the Sinkhorn iterates, for i = 1, . . . , N,
we have Z Xi ∂iF( eϕti)(ϕi− ϕ t i)dmi = 0 hence F(ϕ) − F (ϕt) ≥ N X i=1 Z Xi (∂iF(ϕt) − ∂iF( eϕti))(ϕi− ϕti)dmi +ν 2 N X i=1 kϕi− ϕtik 2 L2(m i) ≥ − 1 2ν N X i=1 k∂iF(ϕt) − ∂iF( eϕti)k2L2(m i)
where we have used Young’s inequality in the last line. We thus have shown that F(ϕt) − F (ϕ) ≤ 1 2ν N X i=1 k∂iF(ϕt) − ∂iF( eϕti)k 2 L2(m i). (3.10)
Using the second inequality in (3.3) together with the L∞bounds on ϕtfrom
lemma 3.1 and Jensen’s inequality yield (∂iF(ϕt)(xi) − ∂iF( eϕti(xi))2 ≤ 1 ν2 Z X−i (⊕N j=1ϕtj − ⊕ N j=1( eϕti)j)2m−i so that k∂iF(ϕt) − ∂iF( eϕti)k2L2(m i) ≤ 1 ν2 N X j=1 kϕtj− ( eϕ t i)jk2L2(m j) ≤ 1 ν2 N X j=1 kϕtj − ϕt+1j k2L2(m j),
together with (3.10), we thus obtain F(ϕt) − F (ϕ) ≤ N 2ν3 N X i=1 kϕt i − ϕ t+1 i k 2 L2(m i). (3.11)
Finally, combining (3.11) with (3.5), we deduce F(ϕt)−F (ϕ) ≤ N
ν4(F (ϕ
t)−F (ϕt+1)) = N
ν4((F (ϕ
t)−F (ϕ))−(F (ϕt+1)−F (ϕ)))
Remark 3.4. We also have linear convergence of ϕt to ϕ in L2 and every Lp,
p∈ [1, +∞).
Acknowledgments: G.C. is grateful to the Agence Nationale de la Recherche for its support through the project MAGA (ANR-16-CE40-0014).
References
[1] Amir Beck and Luba Tetruashvili. On the convergence of block coordi-nate descent type methods. SIAM J. Optim., 23(4):2037–2060, 2013. [2] Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna,
and Gabriel Peyr´e. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing,
37(2):A1111–A1138, 2015.
[3] Guillaume Carlier and Maxime Laborde. A differential approach to the multi-marginal Schr¨odinger system. SIAM J. Math. Anal., 52(1):709– 717, 2020.
[4] Yongxin Chen, Tryphon Georgiou, and Michele Pavon. Entropic and displacement interpolation: a computational approach using the Hilbert metric. SIAM J. Appl. Math., 76(6):2375–2396, 2016.
[5] M. Cuturi and A. Doucet. Fast computation of Wasserstein barycen-ters. In Proceedings of the 31st International Conference on Machine
Learning (ICML), JMLR W&CP, volume 32, 2014.
[6] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.
[7] Donald A. Dawson and J¨urgen G¨artner. Large deviations from the McKean-Vlasov limit for weakly interacting diffusions. Stochastics,
20(4):247–308, 1987.
[8] Simone Di Marino and Augusto Gerolin. An optimal transport approach for the Schr¨odinger bridge problem and convergence of Sinkhorn algo-rithm. J. Sci. Comput., 85(2):Paper No. 27, 28, 2020.
[9] Hans F¨ollmer. Random fields and diffusion processes. In ´Ecole d’ ´Et´e de Probabilit´es de Saint-Flour XV–XVII, 1985–87, volume 1362 of Lecture Notes in Math., pages 101–203. Springer, Berlin, 1988.
[10] Joel Franklin and Jens Lorenz. On the scaling of multidimensional ma-trices. Linear Algebra and its applications, 114:717–735, 1989.
[11] C. L´eonard. A survey of the Schr¨odinger problem and some of its connec-tions with optimal transport. Discrete Contin. Dyn. Syst. A, 34(4):1533– 1574, 2014.
[12] Christian L´eonard. From the Schr¨odinger problem to the Monge– Kantorovich problem. Journal of Functional Analysis, 262(4):1879–
1920, 2012.
[13] Gabriel Peyr´e and Marco Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends in Machine
Learning, 11(5-6):355–607, 2019.
[14] Ludger R¨uschendorf. Convergence of the iterative proportional fitting procedure. Ann. Statist., 23(4):1160–1174, 1995.
[15] Erwin Schr¨odinger. Sur la th´eorie relativiste de l’´electron et l’interpr´etation de la m´ecanique quantique. In Annales de l’institut
Henri Poincar´e, volume 2, pages 269–310, 1932.
[16] R. Sinkhorn. Diagonal equivalence to matrices with prescribed row and column sums. Amer. Math. Monthly, 74:402–405, 1967.