
On the linear convergence of the multi-marginal Sinkhorn algorithm


HAL Id: hal-03176512

https://hal.archives-ouvertes.fr/hal-03176512

Preprint submitted on 22 Mar 2021



Guillaume Carlier

To cite this version:

Guillaume Carlier. On the linear convergence of the multi-marginal Sinkhorn algorithm. 2021. ⟨hal-03176512⟩


On the linear convergence of the multi-marginal Sinkhorn algorithm

Guillaume Carlier

March 22, 2021

Abstract

The aim of this short note is to give an elementary proof of linear convergence of the Sinkhorn algorithm for the entropic regularization of multi-marginal optimal transport. The proof simply relies on: i) the fact that Sinkhorn iterates are bounded, ii) strong convexity of the exponential on bounded intervals and iii) the convergence analysis of the coordinate descent (Gauss-Seidel) method of Beck and Tetruashvili [1].

Keywords: multi-marginal entropic optimal transport, Sinkhorn algorithm, linear convergence, block coordinate descent.

MS Classification: 45G15, 49M05.

1 Introduction

Even though the Sinkhorn algorithm¹ is more than 50 years old [16], it has attracted considerable attention in recent years. It is now at the heart of efficient solvers for the entropic regularization of optimal transport problems, a field on which Cuturi's paper [6] had a tremendous impact (see Cuturi and Doucet [5], Cuturi and Peyré [13], Benamou et al. [2]...). The Sinkhorn algorithm remains fascinating for its simplicity and for its connections with the Schrödinger bridge problem, first addressed by Schrödinger in [15], and with large deviations theory, see Dawson and Gärtner [7], Föllmer [9], Léonard [12, 11].

CEREMADE, Université Paris Dauphine, PSL and INRIA-Paris, MOKAPLAN, [email protected]

¹ Also known as the iterative proportional fitting procedure (IPFP) in the probability and statistics literature, see Rüschendorf [14] and the references therein.


The linear convergence of the Sinkhorn algorithm for two marginals is well known. A very elegant proof uses a celebrated theorem of Birkhoff to show that the Sinkhorn algorithm iterates a contraction for the Hilbert projective metric, see Franklin and Lorenz [10] and, more recently, Chen, Georgiou and Pavon [4].

To the best of our knowledge, the elegant Hilbert metric proof does not carry over to the multi-marginal case, for which an annoying $N - 1$ factor ($N$ being the number of marginals) appears in the Lipschitz constant for the Hilbert metric. Convergence of the multi-marginal Sinkhorn algorithm was recently obtained by Di Marino and Gerolin [8], and the well-posedness (existence, uniqueness and smooth dependence on the data) of the Schrödinger system (see (2.3) below) was addressed by completely different arguments (local and global inversion theorems) by the author and Laborde in [3]. In the analysis of [8], a key ingredient is that Sinkhorn iterates are coordinate descent updates for a convex minimization problem (dual to an entropy minimization subject to multi-marginal constraints), see the definition of $F$ in (2.4) below. In this note, we slightly improve the results of Di Marino and Gerolin by showing linear convergence. The proof relies on the convergence analysis of the coordinate descent method of Beck and Tetruashvili [1], which can easily be used here since Sinkhorn iterates are bounded in $L^\infty$ and so remain in a set where the functional $F$ is uniformly convex.

2 Multi-marginal Sinkhorn algorithm

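We are given an integer $N \geq 2$ and $N$ probability spaces $(X_i, \mathcal{F}_i, m_i)$, $i = 1, \dots, N$, and set

\[ X := \prod_{i=1}^N X_i, \quad \mathcal{F} := \bigotimes_{i=1}^N \mathcal{F}_i, \quad m := \bigotimes_{i=1}^N m_i. \tag{2.1} \]

Given $i \in \{1, \dots, N\}$, we will denote $X_{-i} := \prod_{j \neq i} X_j$, $m_{-i} := \bigotimes_{j \neq i} m_j$ and will always identify $X$ with $X_i \times X_{-i}$, i.e. will denote $x = (x_1, \dots, x_N) \in X$ as $x = (x_i, x_{-i})$. The set of measures on $(X, \mathcal{F})$ having $m_1, \dots, m_N$ as marginals will be denoted $\Pi(m_1, \dots, m_N)$. Given $p \in [1, \infty]$ and $(\varphi_1, \dots, \varphi_N) \in \prod_{i=1}^N L^p(X_i, \mathcal{F}_i, m_i)$ we will use the notations

\[ \oplus_{i=1}^N \varphi_i : (x_1, \dots, x_N) \in X \mapsto \sum_{i=1}^N \varphi_i(x_i), \]

and

\[ \otimes_{i=1}^N \varphi_i : (x_1, \dots, x_N) \in X \mapsto \prod_{i=1}^N \varphi_i(x_i). \]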

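Given a cost $c \in L^\infty(X, \mathcal{F}, m)$, we set

\[ \|c\|_\infty := \|c\|_{L^\infty(X, \mathcal{F}, m)}. \tag{2.2} \]

The associated Gibbs kernel is

\[ K := e^{-c}, \]

so that $e^{-\|c\|_\infty} \leq K \leq e^{\|c\|_\infty}$, $m$-almost everywhere. We look for potentials $\varphi = (\varphi_1, \dots, \varphi_N) \in \prod_{i=1}^N L^\infty(X_i, \mathcal{F}_i, m_i)$ such that the measure

\[ Q_\varphi := K e^{\oplus_{i=1}^N \varphi_i} m \]

belongs to $\Pi(m_1, \dots, m_N)$, i.e. solve the Schrödinger system:

\[ e^{\varphi_i(x_i)} \int_{X_{-i}} e^{-c(x_1, \dots, x_N) + \sum_{j \neq i} \varphi_j(x_j)} \, dm_{-i}(x_{-i}) = 1, \tag{2.3} \]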

for every $i$ and $m_i$-a.e. $x_i$. The system (2.3) is well known to be the Euler-Lagrange optimality condition for the convex minimization problem

\[ \inf_{\varphi \in \prod_{i=1}^N L^\infty(X_i, \mathcal{F}_i, m_i)} F(\varphi) := -\sum_{i=1}^N \int_{X_i} \varphi_i \, dm_i + \int_X dQ_\varphi \tag{2.4} \]

and if $\varphi$ solves (2.3), the measure $Q_\varphi$ solves the multi-marginal entropy minimization:

\[ \inf_{Q \in \Pi(m_1, \dots, m_N)} H(Q \mid e^{-c} m). \]

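Let us observe that whenever $\lambda_1, \dots, \lambda_N$ are constants which sum to 0, then

\[ F(\varphi_1 + \lambda_1, \dots, \varphi_N + \lambda_N) = F(\varphi_1, \dots, \varphi_N) \]

so that one can impose the $N - 1$ normalizing constraints:

\[ \int_{X_1} \varphi_1 \, dm_1 = \dots = \int_{X_{N-1}} \varphi_{N-1} \, dm_{N-1} = 0. \tag{2.5} \]

Denoting by $L^p_\diamond(X_i, \mathcal{F}_i, m_i)$ the space of zero-mean $L^p$ potentials:

\[ L^p_\diamond(X_i, \mathcal{F}_i, m_i) := \Big\{ \varphi_i \in L^p(X_i, \mathcal{F}_i, m_i) : \int_{X_i} \varphi_i \, dm_i = 0 \Big\}, \]

we thus consider

\[ \inf_{\varphi \in E} F(\varphi) \quad \text{where} \quad E := \prod_{i=1}^{N-1} L^\infty_\diamond(X_i, \mathcal{F}_i, m_i) \times L^\infty(X_N, \mathcal{F}_N, m_N). \tag{2.6} \]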
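As an illustration of the invariance of $F$ under constant shifts summing to zero, and of the normalization (2.5), here is a minimal sketch on finite spaces. It assumes each $X_i$ is a finite set, each $m_i$ is a probability vector and $c$ is an $N$-dimensional array. The helper names `outer_sum`, `dual_objective` and `normalize` are hypothetical and not taken from the paper.

```python
import numpy as np
from functools import reduce


def outer_sum(phis):
    """Array (x_1, ..., x_N) -> sum_i phi_i(x_i), built by broadcasting."""
    N = len(phis)
    total = np.zeros([1] * N)
    for i, phi in enumerate(phis):
        shape = [1] * N
        shape[i] = -1
        total = total + phi.reshape(shape)
    return total


def dual_objective(phis, c, ms):
    """Discrete version of F in (2.4): total mass of Q_phi minus the linear term."""
    m = reduce(np.multiply.outer, ms)      # product measure m on the grid
    Q = np.exp(outer_sum(phis) - c) * m    # weights of Q_phi = K e^{sum phi_i} m
    return Q.sum() - sum(p @ w for p, w in zip(phis, ms))


def normalize(phis, ms):
    """Shift phi into E: zero mean for i < N, the removed means are added to phi_N."""
    means = [p @ w for p, w in zip(phis[:-1], ms[:-1])]
    shifted = [p - mu for p, mu in zip(phis[:-1], means)]
    shifted.append(phis[-1] + sum(means))  # the N shifts sum to zero
    return shifted


# Illustrative data (assumed, not from the paper): N = 3 small finite spaces.
rng = np.random.default_rng(0)
sizes = (2, 3, 4)
c = rng.uniform(size=sizes)
ms = [rng.dirichlet(np.ones(n)) for n in sizes]
phis = [rng.normal(size=n) for n in sizes]
shifted = normalize(phis, ms)
```

In this sketch `normalize` leaves the value of `dual_objective` unchanged, which is why restricting the minimization to $E$ loses nothing.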

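The Sinkhorn algorithm is nothing but block coordinate descent for the minimization of $F$ over $E$. Starting from $\varphi^0 \in E$, the updates of the Sinkhorn algorithm consist, given $\varphi^t = (\varphi_1^t, \dots, \varphi_N^t) \in E$, in:

\[ \varphi_1^{t+1} := \operatorname{argmin}_{\varphi_1 \in L^\infty_\diamond(X_1, \mathcal{F}_1, m_1)} F(\varphi_1, \varphi_2^t, \dots, \varphi_N^t) \tag{2.7} \]

i.e.

\[ \varphi_1^{t+1}(x_1) := -\log\Big( \int_{X_{-1}} e^{\sum_{j=2}^N \varphi_j^t(x_j)} K(x_1, x_{-1}) \, dm_{-1}(x_{-1}) \Big) + \lambda_1^t, \quad \forall x_1 \in X_1 \tag{2.8} \]

where

\[ \lambda_1^t = \int_{X_1} \log\Big( \int_{X_{-1}} e^{\oplus_{j=2}^N \varphi_j^t} K(x_1, x_{-1}) \, dm_{-1}(x_{-1}) \Big) \, dm_1(x_1). \tag{2.9} \]

Then, for $i = 2, \dots, N - 1$,

\[ \varphi_i^{t+1} := \operatorname{argmin}_{\varphi_i \in L^\infty_\diamond(X_i, \mathcal{F}_i, m_i)} F(\varphi_1^{t+1}, \dots, \varphi_{i-1}^{t+1}, \varphi_i, \varphi_{i+1}^t, \dots, \varphi_N^t) \tag{2.10} \]

i.e.

\[ \varphi_i^{t+1}(x_i) := -\log\Big( \int_{X_{-i}} e^{\oplus_{j=1}^{i-1} \varphi_j^{t+1} \oplus_{j=i+1}^N \varphi_j^t} K(x_i, x_{-i}) \, dm_{-i}(x_{-i}) \Big) + \lambda_i^t, \quad \forall x_i \in X_i \tag{2.11} \]

where

\[ \lambda_i^t = \int_{X_i} \log\Big( \int_{X_{-i}} e^{\oplus_{j=1}^{i-1} \varphi_j^{t+1} \oplus_{j=i+1}^N \varphi_j^t} K(x_i, x_{-i}) \, dm_{-i}(x_{-i}) \Big) \, dm_i(x_i). \tag{2.12} \]

Finally, for $i = N$,

\[ \varphi_N^{t+1} := \operatorname{argmin}_{\varphi_N \in L^\infty(X_N, \mathcal{F}_N, m_N)} F(\varphi_1^{t+1}, \dots, \varphi_{N-1}^{t+1}, \varphi_N) \tag{2.13} \]

i.e.

\[ \varphi_N^{t+1}(x_N) := -\log\Big( \int_{X_{-N}} e^{\oplus_{j=1}^{N-1} \varphi_j^{t+1}} K(x_N, x_{-N}) \, dm_{-N}(x_{-N}) \Big), \quad \forall x_N \in X_N. \tag{2.14} \]

The convergence of the Sinkhorn iterates to a solution of (2.3) (hence a minimizer of (2.4)) was established by Di Marino and Gerolin [8]. The aim of the next section is to slightly improve this result by showing that this convergence is linear.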

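The updates (2.8)-(2.14) can be sketched numerically as follows. This is only an illustration under the assumption that every $X_i$ is finite, so that integrals become weighted sums over an $N$-dimensional array. The function names `plan`, `dual_objective`, `sinkhorn_sweep` and `marginals`, as well as the random test data, are hypothetical choices and not part of the paper.

```python
import numpy as np


def plan(phis, K, ms):
    """Weights of Q_phi on the grid: K(x) * prod_i exp(phi_i(x_i)) m_i(x_i)."""
    N = len(phis)
    Q = K.copy()
    for j in range(N):
        shape = [1] * N
        shape[j] = -1
        Q = Q * (np.exp(phis[j]) * ms[j]).reshape(shape)
    return Q


def dual_objective(phis, K, ms):
    """Discrete version of F in (2.4)."""
    return plan(phis, K, ms).sum() - sum(p @ w for p, w in zip(phis, ms))


def sinkhorn_sweep(phis, K, ms):
    """One full pass i = 1, ..., N of the updates (2.8)-(2.14)."""
    N = len(phis)
    phis = [p.copy() for p in phis]
    for i in range(N):
        # Integrate exp(sum_{j != i} phi_j) * K against m_{-i}.
        integrand = K.copy()
        for j in range(N):
            if j != i:
                shape = [1] * N
                shape[j] = -1
                integrand = integrand * (np.exp(phis[j]) * ms[j]).reshape(shape)
        other_axes = tuple(j for j in range(N) if j != i)
        log_integral = np.log(integrand.sum(axis=other_axes))
        if i < N - 1:
            # lambda_i^t is the m_i-mean of log_integral; adding it makes phi_i zero-mean.
            phis[i] = -log_integral + log_integral @ ms[i]
        else:
            phis[i] = -log_integral  # the last block is not normalized, see (2.14)
    return phis


def marginals(phis, K, ms):
    """The N one-dimensional marginals of Q_phi."""
    Q = plan(phis, K, ms)
    N = Q.ndim
    return [Q.sum(axis=tuple(j for j in range(N) if j != i)) for i in range(N)]


# Illustrative run (assumed data): N = 3, cost with values in [0, 1).
rng = np.random.default_rng(0)
sizes = (3, 4, 5)
c = rng.uniform(0.0, 1.0, size=sizes)
K = np.exp(-c)
ms = [rng.dirichlet(np.ones(n)) for n in sizes]
phis = [np.zeros(n) for n in sizes]  # phi^0 = 0 belongs to E
values = [dual_objective(phis, K, ms)]
for _ in range(200):
    phis = sinkhorn_sweep(phis, K, ms)
    values.append(dual_objective(phis, K, ms))
```

Each block update is an exact minimization in one potential, so the recorded `values` should be nonincreasing, and the marginals of the final plan should match the prescribed `ms` up to numerical tolerance.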

3 Linear convergence

Thanks to the normalization (2.5), arguing as in [8], we have uniform bounds on the Sinkhorn iterates:

Lemma 3.1. For every $t \geq 1$, the Sinkhorn iterates $\varphi^t$ satisfy the bounds:

\[ \|\varphi_i^t\|_{L^\infty(X_i, \mathcal{F}_i, m_i)} \leq 2\|c\|_\infty, \quad i = 1, \dots, N - 1, \tag{3.1} \]

\[ \|\varphi_N^t\|_{L^\infty(X_N, \mathcal{F}_N, m_N)} \leq (2N - 1)\|c\|_\infty. \tag{3.2} \]

Proof. Since for $m_i \otimes m_i$-a.e. $(x_i, y_i)$ and $m_{-i}$-a.e. $x_{-i} \in X_{-i}$, one has

\[ c(y_i, x_{-i}) \geq c(x_i, x_{-i}) - 2\|c\|_\infty, \]

we deduce from (2.8) and (2.11) that for $i = 1, \dots, N - 1$, $\varphi_i^t(x_i) - \varphi_i^t(y_i) \leq 2\|c\|_\infty$. Using the fact that $\varphi_i^t \in L^\infty_\diamond(X_i, \mathcal{F}_i, m_i)$ and integrating the previous inequality, we immediately deduce (3.1). Once we have these bounds on $\varphi_i^t$ for $i = 1, \dots, N - 1$, using (2.14) together with $e^{-\|c\|_\infty} \leq K \leq e^{\|c\|_\infty}$, we deduce (3.2).

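Now that we have uniform bounds on $\varphi^t$, we can take advantage of the strong convexity and Lipschitz continuity of the exponential function on bounded intervals to use the analysis of Beck and Tetruashvili [1]. Indeed, given $M > 0$, one obviously has, for all $(a, b) \in [-M, M]^2$:

\[ e^b - e^a - e^a(b - a) \geq \frac{e^{-M}}{2}(b - a)^2, \quad |e^b - e^a| \leq e^M |b - a|. \tag{3.3} \]

Lemma 3.2. Defining

\[ \nu = e^{-(4N - 2)\|c\|_\infty}, \tag{3.4} \]

one has

\[ F(\varphi^t) - F(\varphi^{t+1}) \geq \frac{\nu}{2} \sum_{i=1}^N \|\varphi_i^t - \varphi_i^{t+1}\|^2_{L^2(X_i, \mathcal{F}_i, m_i)}. \tag{3.5} \]

Proof. Define

\[ \tilde{\varphi}_i^t := (\varphi_1^{t+1}, \dots, \varphi_i^{t+1}, \varphi_{i+1}^t, \dots, \varphi_N^t), \quad i = 1, \dots, N - 1, \quad \tilde{\varphi}_N^t := \varphi^{t+1}, \tag{3.6} \]

and write in a telescopic fashion

\[ F(\varphi^t) - F(\varphi^{t+1}) = F(\varphi^t) - F(\tilde{\varphi}_1^t) + \sum_{i=1}^{N-1} \big( F(\tilde{\varphi}_i^t) - F(\tilde{\varphi}_{i+1}^t) \big). \]

Using successively the first basic inequality in (3.3), (2.8), the fact that $\varphi_1^t - \varphi_1^{t+1}$ has zero mean against $m_1$, and the bounds from Lemma 3.1, we get

\[
\begin{aligned}
F(\varphi^t) - F(\tilde{\varphi}_1^t) &= \int_X \big( e^{\varphi_1^t(x_1)} - e^{\varphi_1^{t+1}(x_1)} \big) \prod_{j=2}^N e^{\varphi_j^t(x_j)} e^{-c(x)} \, dm(x) \\
&\geq \int_X \big( \varphi_1^t(x_1) - \varphi_1^{t+1}(x_1) \big) e^{\varphi_1^{t+1}(x_1)} \prod_{j=2}^N e^{\varphi_j^t(x_j)} e^{-c(x)} \, dm(x) \\
&\quad + \frac{e^{-2\|c\|_\infty}}{2} \int_X \big( \varphi_1^t(x_1) - \varphi_1^{t+1}(x_1) \big)^2 \prod_{j=2}^N e^{\varphi_j^t(x_j)} e^{-c(x)} \, dm(x) \\
&\geq e^{\lambda_1^t} \int_{X_1} \big( \varphi_1^t(x_1) - \varphi_1^{t+1}(x_1) \big) \, dm_1(x_1) + \frac{e^{-(4N - 2)\|c\|_\infty}}{2} \int_{X_1} \big( \varphi_1^t(x_1) - \varphi_1^{t+1}(x_1) \big)^2 \, dm_1(x_1) \\
&= \frac{e^{-(4N - 2)\|c\|_\infty}}{2} \int_{X_1} (\varphi_1^t - \varphi_1^{t+1})^2 \, dm_1.
\end{aligned}
\]

Similarly, for $i = 1, \dots, N - 1$, we have

\[ F(\tilde{\varphi}_i^t) - F(\tilde{\varphi}_{i+1}^t) \geq \frac{e^{-(4N - 2)\|c\|_\infty}}{2} \int_{X_{i+1}} (\varphi_{i+1}^t - \varphi_{i+1}^{t+1})^2 \, dm_{i+1} \]

which shows (3.5).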

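Since $F$ is bounded from below on $E$, the left-hand side of (3.5) converges to 0. Note also that since $\varphi^t$ and $\varphi^{t+1}$ belong to $E$, one has the identity

\[ \|\oplus_{i=1}^N (\varphi_i^{t+1} - \varphi_i^t)\|^2_{L^2(X, \mathcal{F}, m)} = \sum_{i=1}^N \|\varphi_i^t - \varphi_i^{t+1}\|^2_{L^2(X_i, \mathcal{F}_i, m_i)} \tag{3.7} \]

and we deduce from (3.5)

\[ \lim_{t \to +\infty} \|\oplus_{i=1}^N (\varphi_i^{t+1} - \varphi_i^t)\|^2_{L^2(X, \mathcal{F}, m)} = 0. \tag{3.8} \]

Together with the uniform bounds from Lemma 3.1, we deduce that $\varphi_i^t - \varphi_i^{t+1}$ as well as $\tilde{\varphi}_i^t - \tilde{\varphi}_i^{t+1}$ converge strongly to 0 in $L^p(m_i) = L^p(X_i, \mathcal{F}_i, m_i)$ for every $p \in [1, +\infty)$.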

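Theorem 3.3. The sequence of Sinkhorn iterates $\varphi^t$ converges strongly in $L^p_\diamond(X_1, \mathcal{F}_1, m_1) \times \dots \times L^p_\diamond(X_{N-1}, \mathcal{F}_{N-1}, m_{N-1}) \times L^p(X_N, \mathcal{F}_N, m_N)$ for every $p \in [1, +\infty)$, to the unique solution $\varphi$ of (2.6). Moreover, there holds

\[ F(\varphi^t) - F(\varphi) \leq \Big( 1 - \frac{e^{-(16N - 8)\|c\|_\infty}}{N} \Big)^t \big( F(\varphi^0) - F(\varphi) \big). \tag{3.9} \]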
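To get a feeling for the constant in (3.9), the following small sketch evaluates the gap $e^{-(16N-8)\|c\|_\infty}/N$ and the number of sweeps after which the bound (3.9) guarantees a given relative accuracy. The function names `contraction_gap` and `sweeps_for_accuracy` and the sample values of $N$ and $\|c\|_\infty$ are illustrative assumptions, not quantities from the paper.

```python
import math


def contraction_gap(N, c_sup):
    """Return exp(-(16N - 8) * c_sup) / N, so that the factor in (3.9) is 1 - gap."""
    return math.exp(-(16 * N - 8) * c_sup) / N


def sweeps_for_accuracy(N, c_sup, eps):
    """Smallest integer t with (1 - gap)**t <= eps, or math.inf if gap underflows to 0."""
    gap = contraction_gap(N, c_sup)
    if gap == 0.0:
        return math.inf
    # log1p keeps precision when gap is far below machine epsilon.
    return math.ceil(math.log(eps) / math.log1p(-gap))


# Sample evaluations (hypothetical parameters).
gap_trivial_cost = contraction_gap(2, 0.0)        # zero cost, two marginals
gap_unit_cost = contraction_gap(3, 1.0)           # three marginals, sup norm of c equal to 1
sweeps_unit_cost = sweeps_for_accuracy(3, 1.0, 1e-3)
```

The code works with the gap rather than with the factor $1 - \text{gap}$ because the guaranteed rate is very conservative: for $N = 3$ and $\|c\|_\infty = 1$ the gap equals $e^{-40}/3$, roughly $1.4 \times 10^{-18}$, which is below double-precision machine epsilon, so $1 - \text{gap}$ would round to exactly 1.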

Proof. The convergence of $\varphi^t$ in every $L^p$ was obtained by Di Marino and Gerolin [8]; we include a short proof for the sake of completeness. Setting $a_i^t := e^{\varphi_i^t}$, passing to a subsequence if necessary, we may assume that the constants $\lambda_i^t$ in (2.9)-(2.11) converge and that $a_i^t$, $a_i^{t+1}$ converge weakly to some $a_i$ in $L^2(m_i)$. Hence, for every $i$, $\otimes_{j<i} a_j^{t+1} \otimes_{j>i} a_j^t$ weakly converges in $L^2(m_{-i})$ to $\otimes_{j<i} a_j \otimes_{j>i} a_j$. By construction of the Sinkhorn iterates, $\frac{e^{\lambda_i^t}}{a_i^{t+1}}$ is expressed as a Hilbert-Schmidt, hence compact, integral functional of $\otimes_{j<i} a_j^{t+1} \otimes_{j>i} a_j^t$, hence $\frac{1}{a_i^{t+1}}$ converges strongly in $L^2(m_i)$. Since $a_i^t$ is uniformly bounded and uniformly bounded away from 0, $a_i^{t+1}$ converges strongly in $L^2(m_i)$ as well as in $L^p(m_i)$ for any $p \in [1, +\infty)$ by the bounds from Lemma 3.1. Using again that $a_i^t$ is uniformly bounded and uniformly bounded away from 0, $\varphi_i^t$ also strongly converges in $L^p(m_i)$ to $\varphi_i := \log a_i$ and of course $\varphi \in E$. Observing that, by construction, $\otimes_{j \leq i} a_j^{t+1} \otimes_{j>i} a_j^t K m$ admits $e^{\lambda_i^t} m_i$ as $i$-th marginal for $i = 1, \dots, N - 1$ and $m_N$ as $N$-th marginal, one easily checks that $e^{\oplus_{i=1}^N \varphi_i} K m = \otimes_{i=1}^N a_i K m$ admits $m_1, \dots, m_N$ as marginals (and all the constants $\lambda_i^t$, $i = 1, \dots, N - 1$, tend to 0). Thus $\varphi$ solves the system (2.3), hence minimizes $F$ over $E$, but since $F$ is strictly convex over $E$, this minimizer is unique and in fact the whole sequence $\varphi^t$ strongly converges in $L^p$ to $\varphi$ for every $p \in [1, +\infty)$.

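Since $\varphi$ satisfies the bounds of Lemma 3.1, using (3.3) as we did in the proof of Lemma 3.2, we arrive at

\[ F(\varphi) - F(\varphi^t) \geq \sum_{i=1}^N \int_{X_i} \partial_i F(\varphi^t)(x_i) \big( \varphi_i(x_i) - \varphi_i^t(x_i) \big) \, dm_i(x_i) + \frac{\nu}{2} \sum_{i=1}^N \|\varphi_i - \varphi_i^t\|^2_{L^2(m_i)} \]

where $\nu$ is the constant in (3.4) and

\[ \partial_i F(\varphi)(x_i) = -1 + e^{\varphi_i(x_i)} \int_{X_{-i}} e^{\oplus_{j \neq i} \varphi_j(x_j)} e^{-c(x_i, x_{-i})} \, dm_{-i}(x_{-i}). \]

Defining $\tilde{\varphi}_i^t$ by (3.6), by construction of the Sinkhorn iterates, for $i = 1, \dots, N$, we have

\[ \int_{X_i} \partial_i F(\tilde{\varphi}_i^t)(\varphi_i - \varphi_i^t) \, dm_i = 0 \]

hence

\[
\begin{aligned}
F(\varphi) - F(\varphi^t) &\geq \sum_{i=1}^N \int_{X_i} \big( \partial_i F(\varphi^t) - \partial_i F(\tilde{\varphi}_i^t) \big)(\varphi_i - \varphi_i^t) \, dm_i + \frac{\nu}{2} \sum_{i=1}^N \|\varphi_i - \varphi_i^t\|^2_{L^2(m_i)} \\
&\geq -\frac{1}{2\nu} \sum_{i=1}^N \|\partial_i F(\varphi^t) - \partial_i F(\tilde{\varphi}_i^t)\|^2_{L^2(m_i)}
\end{aligned}
\]

where we have used Young's inequality in the last line. We thus have shown that

\[ F(\varphi^t) - F(\varphi) \leq \frac{1}{2\nu} \sum_{i=1}^N \|\partial_i F(\varphi^t) - \partial_i F(\tilde{\varphi}_i^t)\|^2_{L^2(m_i)}. \tag{3.10} \]

Using the second inequality in (3.3) together with the $L^\infty$ bounds on $\varphi^t$ from Lemma 3.1 and Jensen's inequality yields

\[ \big( \partial_i F(\varphi^t)(x_i) - \partial_i F(\tilde{\varphi}_i^t)(x_i) \big)^2 \leq \frac{1}{\nu^2} \int_{X_{-i}} \big( \oplus_{j=1}^N \varphi_j^t - \oplus_{j=1}^N (\tilde{\varphi}_i^t)_j \big)^2 \, dm_{-i} \]

so that

\[ \|\partial_i F(\varphi^t) - \partial_i F(\tilde{\varphi}_i^t)\|^2_{L^2(m_i)} \leq \frac{1}{\nu^2} \sum_{j=1}^N \|\varphi_j^t - (\tilde{\varphi}_i^t)_j\|^2_{L^2(m_j)} \leq \frac{1}{\nu^2} \sum_{j=1}^N \|\varphi_j^t - \varphi_j^{t+1}\|^2_{L^2(m_j)}, \]

together with (3.10), we thus obtain

\[ F(\varphi^t) - F(\varphi) \leq \frac{N}{2\nu^3} \sum_{i=1}^N \|\varphi_i^t - \varphi_i^{t+1}\|^2_{L^2(m_i)}. \tag{3.11} \]

Finally, combining (3.11) with (3.5), we deduce

\[ F(\varphi^t) - F(\varphi) \leq \frac{N}{\nu^4} \big( F(\varphi^t) - F(\varphi^{t+1}) \big) = \frac{N}{\nu^4} \Big( \big( F(\varphi^t) - F(\varphi) \big) - \big( F(\varphi^{t+1}) - F(\varphi) \big) \Big), \]

that is, $F(\varphi^{t+1}) - F(\varphi) \leq \big( 1 - \frac{\nu^4}{N} \big) \big( F(\varphi^t) - F(\varphi) \big)$. Since $\nu^4 = e^{-(16N - 8)\|c\|_\infty}$, (3.9) follows by induction on $t$.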

Remark 3.4. We also have linear convergence of $\varphi^t$ to $\varphi$ in $L^2$ and in every $L^p$, $p \in [1, +\infty)$.

Acknowledgments: G.C. is grateful to the Agence Nationale de la Recherche for its support through the project MAGA (ANR-16-CE40-0014).

References

[1] Amir Beck and Luba Tetruashvili. On the convergence of block coordinate descent type methods. SIAM J. Optim., 23(4):2037–2060, 2013.

[2] Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.

[3] Guillaume Carlier and Maxime Laborde. A differential approach to the multi-marginal Schrödinger system. SIAM J. Math. Anal., 52(1):709–717, 2020.

[4] Yongxin Chen, Tryphon Georgiou, and Michele Pavon. Entropic and displacement interpolation: a computational approach using the Hilbert metric. SIAM J. Appl. Math., 76(6):2375–2396, 2016.

[5] M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning (ICML), JMLR W&CP, volume 32, 2014.

[6] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.

[7] Donald A. Dawson and Jürgen Gärtner. Large deviations from the McKean-Vlasov limit for weakly interacting diffusions. Stochastics, 20(4):247–308, 1987.

[8] Simone Di Marino and Augusto Gerolin. An optimal transport approach for the Schrödinger bridge problem and convergence of Sinkhorn algorithm. J. Sci. Comput., 85(2):Paper No. 27, 28, 2020.

[9] Hans Föllmer. Random fields and diffusion processes. In École d'Été de Probabilités de Saint-Flour XV–XVII, 1985–87, volume 1362 of Lecture Notes in Math., pages 101–203. Springer, Berlin, 1988.

[10] Joel Franklin and Jens Lorenz. On the scaling of multidimensional matrices. Linear Algebra and its Applications, 114:717–735, 1989.

[11] C. Léonard. A survey of the Schrödinger problem and some of its connections with optimal transport. Discrete Contin. Dyn. Syst. A, 34(4):1533–1574, 2014.

[12] Christian Léonard. From the Schrödinger problem to the Monge–Kantorovich problem. Journal of Functional Analysis, 262(4):1879–1920, 2012.

[13] Gabriel Peyré and Marco Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

[14] Ludger Rüschendorf. Convergence of the iterative proportional fitting procedure. Ann. Statist., 23(4):1160–1174, 1995.

[15] Erwin Schrödinger. Sur la théorie relativiste de l'électron et l'interprétation de la mécanique quantique. In Annales de l'institut Henri Poincaré, volume 2, pages 269–310, 1932.

[16] R. Sinkhorn. Diagonal equivalence to matrices with prescribed row and column sums. Amer. Math. Monthly, 74:402–405, 1967.
