Convergence Rates of Gradient Methods for Convex Optimization in the Space of Measures

Lénaïc Chizat

May 17, 2021

Abstract

We study the convergence rate of Bregman gradient methods for convex optimization in the space of measures on a $d$-dimensional manifold. Under basic regularity assumptions, we show that the suboptimality gap at iteration $k$ is in $O(\log(k)\,k^{-1})$ for multiplicative updates, while it is in $O(k^{-q/(d+q)})$ for additive updates, for some $q \in \{1,2,4\}$ determined by the structure of the objective function. Our flexible proof strategy, based on approximation arguments, allows us to painlessly cover all Bregman Proximal Gradient Methods (PGM) and their acceleration (APGM) under various geometries such as the hyperbolic entropy and $L^p$ divergences. We also prove the tightness of our analysis with matching lower bounds and confirm the theoretical results with numerical experiments on low-dimensional problems. Note that all these optimization methods must additionally pay the computational cost of discretization, which can be exponential in $d$.

1 Introduction

Convex optimization in the space of measures is a theoretical framework that leads to fruitful points of view on a large variety of problems, ranging from sparse deconvolution [Bredies and Pikkarainen, 2013], two-layer neural networks [Bengio et al., 2006] and global optimization [Lasserre, 2001] to many more [Boyd et al., 2017]. Various algorithms have been proposed to solve such problems, including moment methods [Lasserre, 2001], conditional gradient [Bredies and Pikkarainen, 2013, Denoyelle et al., 2019], (non-convex) particle gradient flows [Chizat, 2021] and their noisy versions [Mei et al., 2018, Nitanda et al., 2020].

In this paper, we consider perhaps the simplest methods: gradient descent and its extensions that handle non-smooth regularizers and non-Euclidean geometries, namely the Bregman Proximal Gradient Method (PGM) (an extension of mirror descent [Nemirovskij and Yudin, 1983] that handles composite objectives) and its acceleration (APGM) [Tseng, 2010]. Our aim is to establish their convergence rates in this setting. Our contributions are the following:

• While the standard convergence analysis of (A)PGM fails when the Bregman divergence between the initialization and a minimizer is infinite, we propose in Section 2 an elementary strategy to obtain convergence rates anyway, via approximation arguments;

• We recall and adapt (A)PGM in Section 3, taking care of the subtleties that appear in our context (definition of the iterates and lack of strong convexity of the divergence);

• We prove in Section 4 upper bounds on the convergence rate for (A)PGM under various structural assumptions, summarized in Table 2. These rates depend on the choice of the Bregman divergence and on the structure of the objective function;

CNRS, Université Paris-Saclay, Laboratoire de mathématiques d'Orsay, 91405 Orsay, France. lenaic.chizat@universite-paris-saclay.fr


• Tight lower bounds of two kinds are proved in Section 5: proof-technique-dependent lower bounds, and algorithm-dependent lower bounds (the latter are stronger but do not cover all cases);

• Numerical experiments on synthetic toy problems in Section 6 often show an excellent agreement between the theoretical rates and those observed in practice. Even in cases with an apparent mismatch, a closer look at the structure of the problem shows that the theory still sheds light on the observed rates.

Our motivation for studying this problem is twofold. First, we believe that a precise understanding of (A)PGM in this context is useful to develop and analyze more complex methods, such as the grid-free approaches mentioned above. Second, this setting offers a rich test case to deepen our understanding of Bregman gradient methods in Banach spaces.

However, our goal is not to advocate for a widespread use of (A)PGM for optimization in the space of measures; in fact, grid-free approaches are often more relevant choices in practice.

Related work. The comparison between additive updates ($L^2$ geometry) and multiplicative updates (entropy geometry) is well-known in finite-dimensional spaces [Kivinen and Warmuth, 1997]. For instance, for convex optimization in the $n$-dimensional simplex, the two methods typically converge at the same rate but the "constant" factor is polynomial in $n$ for additive updates while it is logarithmic in $n$ for multiplicative updates, see [Bubeck, 2015, Section 4].

We obtain in this paper an infinite-dimensional ($n = \infty$) version of this separation, but where the distinction appears directly in the rates rather than in the constants.

The analysis of convex optimization in infinite-dimensional (Banach) spaces is a classical subject [Bauschke et al., 2001, 2003]. Here, we study a concrete class of problems defined on the space of measures which exhibits specific features. This problem-specific approach to infinite-dimensional problems has proved fruitful for the analysis of gradient methods for least-squares (e.g. Yao et al. [2007], Dieuleveut [2017] and references therein), for partly smooth problems [Liang et al., 2014] and for the Iterative Soft Thresholding Algorithm (ISTA) in Hilbert spaces [Bredies and Lorenz, 2008, Garrigos et al., 2020].

The latter is close to our subject since ISTA is in fact an instance of PGM with the $L^2$-divergence, and FISTA [Beck and Teboulle, 2009] is analogous to APGM with the $L^2$-divergence. These prior works perform the analysis in a Hilbert space, while we work in the space of measures or in $L^1$, which are non-reflexive Banach spaces. This is also the context of Chambolle and Tovey [2021] who, for a modified version of FISTA, obtained in particular the convergence rate of Table 2 when $p = 2$ and $q = 1$, and who also discuss discretization. Our analysis allows us to compare various algorithms and shows that FISTA is always slower than APGM with the hyperbolic entropy geometry [Ghai et al., 2020] when the solution is truly sparse, see the rates in Table 2. This is clearly observed in numerical experiments and suggests that the latter forms a stronger baseline for our class of problems.

Notation. The domain of a function $F : V \to \mathbb{R} \cup \{+\infty\}$ is $\operatorname{dom} F = \{x \in V \,;\, F(x) < +\infty\}$.

Throughout, $\Theta$ is a compact $d$-dimensional manifold, $\mathcal{M}(\Theta)$ (resp. $\mathcal{M}_+(\Theta)$) is the set of finite signed (resp. nonnegative) Borel measures on $\Theta$ and $\mathcal{P}(\Theta)$ is the set of Borel probability measures. For $\mu \in \mathcal{M}(\Theta)$, $\|\mu\|$ is its total variation norm. For a Hilbert space $\mathcal{F}$, $C^p(\Theta; \mathcal{F})$ is the set of $p$-times continuously differentiable functions from $\Theta$ to $\mathcal{F}$. $\operatorname{Lip}(f)$ is the Lipschitz constant of a function $f$. For $\tau \in \mathcal{P}(\Theta)$ and $p \geq 1$, $L^p(\tau)$ is the space of (equivalence classes of) measurable functions $f : \Theta \to \mathbb{R}$ such that $\int_\Theta |f(\theta)|^p \,\mathrm{d}\tau(\theta) < +\infty$ or, for $p = +\infty$, such that $|f|$ is $\tau$-almost everywhere bounded by some $K > 0$. The asymptotic notation $a(k) \lesssim b(k)$ means that there exists $c > 0$ independent of $k$ such that $a(k) \leq c \cdot b(k)$, and $a(k) \asymp b(k)$ means [$a \lesssim b$ and $b \lesssim a$].


Figure 1: Shape of $\psi$ defined in Eq. (2) in our situation of interest, where $D(x_k, x_0)$ explodes for any minimizing sequence $(x_k)_{k \in \mathbb{N}}$ (Prop. 2.1). When an optimization method satisfies Eq. (1) for some sequence $(\alpha_k)_{k \in \mathbb{N}}$, then $\psi(\alpha_k)$ bounds its convergence rate in objective values. (The figure plots $\psi(\alpha)$ as a function of $\alpha$, bounded above by $F(x_0) - \inf F$.)

2 A general strategy to obtain convergence rates

Let $F$ be a lower-bounded convex function defined on a real vector space. Suppose that an iterative method designed to minimize $F$, initialized at $x_0 \in \operatorname{dom} F$, generates a sequence $x_1, x_2, \dots \in \operatorname{dom} F$ that satisfies

$$F(x_k) - F(x) \leq \alpha_k \cdot D(x, x_0), \quad \forall x \in \operatorname{dom} F, \ \forall k \geq 1, \qquad (1)$$

where $(\alpha_k)_{k \in \mathbb{N}}$ is a positive sequence converging to $0$ and $D$ is a divergence, i.e. $D(x, x_0) \in [0, +\infty]$ and $D(x_0, x_0) = 0$. Most first-order methods enjoy guarantees of this form. For instance, PGM and APGM enjoy such guarantees with respectively $\alpha_k \lesssim k^{-1}$ and $\alpha_k \lesssim k^{-2}$ under suitable assumptions, see Section 3.3.

While Eq. (1) is often the endpoint of the analysis in the optimization literature, it is here our starting point: we are interested in the case where, for any quasi-minimizer $x$ of $F$, the quantity $D(x, x_0)$ is very large or even infinite. In this case, choosing a fixed $x$ independent of $k$ in Eq. (1) leads to a poor upper bound which often does not match the observed practical behavior. Instead, we propose to exploit the flexibility offered by the guarantee of Eq. (1) and to choose a different reference point at each time step. This means that we reformulate the guarantee in the equivalent form:

$$F(x_k) - \inf F \leq \psi(\alpha_k) \quad \text{where} \quad \psi(\alpha) := \inf_{x} \big\{ F(x) - \inf F + \alpha D(x, x_0) \big\}. \qquad (2)$$

As we will see, studying $\psi$ is particularly fruitful to understand optimization algorithms satisfying Eq. (1). In particular, its behavior at $0$ determines the asymptotic convergence rate.

This function can be interpreted as the value at $x_0$ of the (Bregman) Moreau envelope [Kan and Song, 2012] of $(F - \inf F)$ with regularization parameter $\alpha$, and it intervenes in many areas of applied mathematics. For instance, when $D(x, x_0)$ is a squared Hilbertian norm, $\psi$ has a variety of behaviors which characterize the performance of kernel ridge regression in machine learning (see e.g. [Bach, 2021, Chap. 7.5]). Before we move to a more concrete setting, let us gather a few relevant properties of the function $\psi$ that hold in full generality.

Proposition 2.1. Assume that $F(x_0) < +\infty$ and $D(\cdot, x_0) \geq 0$ with equality at $x_0$. Then the function $\psi$ is concave on $[0, +\infty[$ and satisfies $0 = \psi(0) \leq \psi(\alpha) \leq F(x_0) - \inf F$. Moreover,

(i) $\psi$ is right-continuous at $0$ if and only if there exists a minimizing sequence $(x_k)_{k \in \mathbb{N}}$ such that $F(x_k) \to \inf F$ and $D(x_k, x_0) < +\infty$, $\forall k \in \mathbb{N}$;

(ii) $\psi'(0) := \lim_{\alpha \to 0^+} \psi(\alpha)/\alpha$ is finite if and only if there exists a minimizing sequence $(x_k)_{k \in \mathbb{N}}$ such that $F(x_k) \to \inf F$ and $D(x_k, x_0)$ is bounded.

Proof. The function $\psi$ is concave as the pointwise infimum of affine functions. The lower bound is immediate and the upper bound is obtained by taking $x_0$ as a candidate in the infimum.


Let us prove (ii) (the proof of (i) follows a similar scheme and is simpler). By concavity, the limit defining $\psi'(0)$ always exists and belongs to $]0, +\infty]$. If a sequence $(x_k)$ exists as in the statement, then for any $\alpha > 0$, there exists $x_k$ such that $\psi(\alpha) \leq \alpha D(x_k, x_0) + \alpha^2$, so $\psi'(0)$ is finite. Conversely, if $\psi'(0)$ is finite, take a decreasing sequence $(\alpha_k)$ that converges to $0$ and let $(x_k)$ be a sequence of quasi-minimizers for Eq. (2) satisfying $F(x_k) - \inf F + \alpha_k D(x_k, x_0) \leq \psi(\alpha_k) + \alpha_k^2$. By dividing by $\alpha_k$, we see that $F(x_k) \to \inf F$ and $D(x_k, x_0)$ is bounded.

Thus, in general, the rates $(\psi(\alpha_k))_k$ obtained from Eq. (2) are slower than $(\alpha_k)_k$, in the sense that $\alpha = o(\psi(\alpha))$ is possible. Figure 1 illustrates the general shape of the function $\psi$.
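To make this strategy concrete, here is a minimal numerical sketch (ours, not from the paper) that upper-bounds $\psi(\alpha)$ by restricting the infimum in Eq. (2) to a one-parameter family of smoothed candidates, anticipating the construction used later in Theorem 4.1; the toy problem and all names are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (not from the paper): upper-bounding psi(alpha) by restricting the
# infimum in Eq. (2) to "box-smoothed" candidates f_eps, as in the proof of Thm. 4.1.
# Toy problem on Theta = [0, 1) (d = 1): F(f) = int |theta - theta0| f(theta) dtheta
# over probability densities, minimized by a Dirac at theta0 (infinite KL to f0 = 1).

theta0 = 0.5

def F(eps):
    # F(f_eps) for f_eps uniform on [theta0 - eps, theta0 + eps]: mean distance = eps/2
    return eps / 2.0

def D_entropy(eps):
    # KL divergence of f_eps = 1/(2 eps) on its support w.r.t. f0 = 1: log(1/(2 eps))
    return np.log(1.0 / (2.0 * eps))

def psi_upper(alpha, eps_grid):
    return min(F(e) + alpha * D_entropy(e) for e in eps_grid)

eps_grid = np.logspace(-6, np.log10(0.49), 400)
for alpha in [1e-1, 1e-2, 1e-3, 1e-4]:
    print(f"alpha={alpha:.0e}  psi(alpha) <= {psi_upper(alpha, eps_grid):.4f}")
# The bound decays like alpha*log(1/alpha), matching the log(k)/k rate of Table 2
# for the entropy geometry (with alpha_k ~ 1/k).
```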

3 Gradient methods for optimization in the space of measures

In the rest of this paper, we apply the general method of Section 2 to a class of optimization problems in the space of measures, where it leads to a zoo of convergence rates.

3.1 Objective function

Let $\Theta$ be a compact Riemannian manifold without boundary, with distance $\operatorname{dist}$ and with a reference probability measure $\tau \in \mathcal{P}(\Theta)$ that is proportional to the volume measure. We consider an objective function on the space of measures $\bar{F} : \mathcal{M}(\Theta) \to \mathbb{R} \cup \{+\infty\}$ of the form

$$\bar{F}(\mu) := \bar{G}(\mu) + \bar{H}(\mu) \quad \text{where} \quad \bar{G}(\mu) := R\Big(\int \Phi \,\mathrm{d}\mu\Big).$$

Typically, $\bar{G}$ is a data-fitting term and $\bar{H}$ a regularizer. We make the following assumptions, where $\iota_C$ is the convex indicator of a convex set $C$ and $\lambda \geq 0$ a regularization parameter:

(A1) $\Phi \in C^0(\Theta; \mathcal{F})$ where $\mathcal{F}$ is a Hilbert space, $R : \mathcal{F} \to \mathbb{R}$ is convex and differentiable with a Lipschitz gradient $\nabla R$, and $\bar{H}$ is a sum of functions from the following list: $\iota_{\mathcal{P}(\Theta)}$, $\iota_{\mathcal{M}_+(\Theta)}$, $\lambda\|\mu\|$ and $\iota_{\{\mu\,;\,\|\mu\| \leq 1\}}$.

One specific property of $\bar{H}$ that we use in our proof is that it should not increase under convolution with a probability kernel, but we prefer to work with these specific instances rather than giving abstract conditions. We finally denote by $F : L^1(\tau) \to \mathbb{R} \cup \{+\infty\}$ the function defined, for $f \in L^1(\tau)$, by

$$F(f) := \bar{F}(f\tau),$$

and similarly $H(f) := \bar{H}(f\tau)$ and $G(f) := \bar{G}(f\tau)$, so that $F = G + H$. These notations convey the idea that $\bar{F}, \bar{G}, \bar{H}$ are the lower-semicontinuous (l.s.c.) extensions of $F, G, H$ for the weak* topology induced by $C^0(\Theta)$ on $\mathcal{M}(\Theta)$.

Here are examples of problems that fall under this setting:

• (Sparse deconvolution) The goal is, given a signal $y \in L^2(\tau)$, to find a sparse measure $\mu$ such that the convolution of $\mu$ with a filter $\phi \in C(\Theta)$ approximately recovers $y$. Here the domain is typically the $d$-dimensional torus $\Theta = \mathbb{T}^d$ endowed with the Lebesgue measure $\tau$ and the objective is [De Castro and Gamboa, 2012, Candès and Fernandez-Granda, 2014]

$$\bar{F}(\mu) := \int_\Theta \Big( \int_\Theta \phi(\theta_1 - \theta_2)\,\mathrm{d}\mu(\theta_2) - y(\theta_1) \Big)^2 \mathrm{d}\tau(\theta_1) + \lambda\|\mu\|.$$

Adding the nonnegativity constraint $\iota_{\mathcal{M}_+(\Theta)}$ is also relevant in certain applications. (A discretized numerical sketch of this objective is given after these examples.)


• (Two-layer ReLU neural networks) The goal is, given $n$ observations $(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$, to find a regressor written as a linear combination of simple "ridge" functions. Consider a loss $\ell : \mathbb{R}^2 \to \mathbb{R}$ convex and smooth in its second argument, let $\phi(s) = (s)_+$ and let

$$\bar{F}(\mu) := \frac{1}{n} \sum_{i=1}^n \ell\Big( y_i, \int_\Theta \phi\big([x_i; 1]^\top \theta\big)\,\mathrm{d}\mu(\theta) \Big) + \bar{H}(\mu).$$

Key differences with the previous setting are that $\Theta = \mathbb{S}^d$ is the sphere, with $d$ potentially large, and that the object that is truly sought after is the regressor $x \mapsto \int \phi([x; 1]^\top \theta)\,\mathrm{d}\mu(\theta)$ rather than the measure $\mu$. Typical choices for $\ell$ are the logistic loss $\ell(y, z) = \log(1 + \exp(-yz))$ when $y_i \in \{-1, 1\}$ or the square loss $\ell(y, z) = \frac{1}{2}|y - z|^2$. The signed setting with regularization $\bar{H}(\mu) = \lambda\|\mu\|$ is the most common one [Bengio et al., 2006, Bach, 2017], but the regularization $\iota_{\mathcal{P}(\Theta)}$ also appears in the context of max-margin problems [Chizat and Bach, 2020].

The following smoothness lemma will be useful to analyze optimization algorithms and is analogous to the usual "Lipschitz gradient" property in convex optimization. Since the dual of $(\mathcal{M}(\Theta), \|\cdot\|)$ is a bit exotic, we avoid using the notion of gradient at all.

Lemma 3.1 (Smoothness). Under Assumption (A1), if $\Phi \in C^p(\Theta, \mathcal{F})$ for $p \in \mathbb{N}$, then the differential of $\bar{G}$ at $\mu \in \mathcal{M}(\Theta)$ can be represented by the function $\bar{G}'[\mu] \in C^p(\Theta, \mathbb{R})$ defined by

$$\bar{G}'[\mu](\theta) = \Big\langle \nabla R\Big(\int \Phi\,\mathrm{d}\mu\Big),\, \Phi(\theta) \Big\rangle_{\mathcal{F}}$$

in the sense that it holds $\bar{G}(\mu + \nu) - \bar{G}(\mu) = \int \bar{G}'[\mu](\theta)\,\mathrm{d}\nu(\theta) + o(\|\nu\|)$. Moreover, $\mu \mapsto \bar{G}'[\mu]$ is Lipschitz continuous as a function from $\mathcal{M}(\Theta)$ to $C^p(\Theta, \mathbb{R})$. The following smoothness inequality holds, with $\operatorname{Lip}(\bar{G}') \leq \|\Phi\|_\infty^2 \cdot \operatorname{Lip}(\nabla R)$ and for all $\mu, \nu \in \mathcal{M}(\Theta)$,

$$0 \leq \bar{G}(\nu) - \bar{G}(\mu) - \int \bar{G}'[\mu]\,\mathrm{d}[\nu - \mu] \leq \frac{1}{2}\operatorname{Lip}(\bar{G}')\,\|\nu - \mu\|^2.$$

Those results hold true when replacing $(\mathcal{M}(\Theta), \bar{G}(\mu), \bar{G}'[\mu])$ by $(L^1(\tau), G(f), G'[f] := \bar{G}'[f\tau])$.

Proof. For the first part, the differentiability of $R$ implies that

$$\bar{G}(\mu + \nu) - \bar{G}(\mu) = \Big\langle \nabla R\Big(\int \Phi\,\mathrm{d}\mu\Big),\, \int \Phi\,\mathrm{d}\nu \Big\rangle_{\mathcal{F}} + o\Big( \Big\| \int \Phi\,\mathrm{d}\nu \Big\|_{\mathcal{F}} \Big) = \int \bar{G}'[\mu]\,\mathrm{d}\nu + o(\|\Phi\|_\infty \|\nu\|).$$

For the regularity of $\mu \mapsto \bar{G}'[\mu]$, we have for $\mu, \nu \in \mathcal{M}(\Theta)$,

$$\|\bar{G}'[\mu] - \bar{G}'[\nu]\|_{C^p} \leq \|\Phi\|_{C^p(\Theta, \mathcal{F})} \cdot \operatorname{Lip}(\nabla R) \cdot \|\Phi\|_\infty \cdot \|\mu - \nu\|.$$

The smoothness inequality can be shown by bounding a 1-dimensional integral as in the Euclidean case [Nesterov, 2003, Thm. 2.1.5]. Finally, $L^1(\tau) \ni f \mapsto f\tau \in \mathcal{M}(\Theta)$ is an isometry, so those results hold mutatis mutandis in $L^1(\tau)$.

3.2 Bregman divergences

Let us consider $\eta : \mathbb{R} \to [0, \infty]$, a differentiable divergence-generating function. For $f \in L^1(\tau)$ we write $\eta(f) := \eta \circ f$ and we define

$$\bar{\eta}(f) := \int_\Theta \eta(f(\theta))\,\mathrm{d}\tau(\theta) = \int \eta(f)\,\mathrm{d}\tau.$$


Figure 2: Divergence-generating functions (plain) and their second-order derivatives (dashed).

Let $D_\eta$ (resp. $D_{\bar\eta}$) be the Bregman divergence associated to $\eta$ (resp. $\bar\eta$), given for $f, g \in L^1(\tau)$ by

$$D_\eta(a, b) := \eta(a) - \eta(b) - \eta'(b)(a - b) \quad \text{and} \quad D_{\bar\eta}(f, g) := \int_\Theta D_\eta(f(\theta), g(\theta))\,\mathrm{d}\tau(\theta).$$

We consider the following assumptions on the divergence-generating function $\eta$:

(A2) $\eta : \mathbb{R} \to [0, \infty]$ is strictly convex, l.s.c., continuously differentiable in $\operatorname{int}(\operatorname{dom}\eta)$, such that $\eta'(\operatorname{int}(\operatorname{dom}\eta)) = \mathbb{R}$ and $\eta$ has polynomial growth at $+\infty$. Moreover, either:

(A2)+ $\operatorname{dom}\eta = [0, +\infty[$ and $\eta(1) = \eta'(1) = 0$, or

(A2)± $\operatorname{dom}\eta = \mathbb{R}$, $\eta$ is even and $\eta(0) = \eta'(0) = 0$.

Specifying the values of $\eta$ and $\eta'$ at a point in $\operatorname{int}(\operatorname{dom}\eta)$ is just for convenience and is not restrictive, since $D_\eta$ is not affected by affine perturbations of $\eta$. Under assumption (A2)+, we have that $\eta'(0) := \lim_{t \to 0^+} \eta'(t) = -\infty$, which automatically enforces a nonnegativity constraint in the methods of the next section.

Here are examples of divergence-generating functions that fall under these assumptions:

• (Power functions $\eta_p$) Defined on $\mathbb{R}$ for $p > 1$ by $\eta_p(s) := \frac{|s|^p}{p(p-1)}$; they satisfy (A2)±;

• (Shannon entropy $\eta_{\mathrm{ent}}$) Defined on $\mathbb{R}_+$ by $\eta_{\mathrm{ent}}(s) := s\log(s) - s + 1$; it satisfies (A2)+;

• (Hyperbolic entropy $\eta_{\mathrm{hyp}}$) Defined on $\mathbb{R}$ by $\eta_{\mathrm{hyp}}(s) := s\,\mathrm{arcsinh}(s/\beta) - \sqrt{s^2 + \beta^2} + \beta$ with $\beta > 0$; it satisfies (A2)± (introduced by [Ghai et al., 2020]).

When $\eta$ is smooth, it holds $D_\eta(a, b) = \eta''(b) \cdot |a - b|^2/2 + o(|a - b|^2)$, so locally $D_\eta$ is equivalent to a squared Riemannian metric on the real axis given by $\eta''/2$. For the examples listed above, it holds $\eta_p''(s) = |s|^{p-2}$, $\eta_{\mathrm{ent}}''(s) = s^{-1}$ and $\eta_{\mathrm{hyp}}''(s) = (s^2 + \beta^2)^{-1/2}$, see Figure 2 for an illustration. The hyperbolic entropy $\eta_{\mathrm{hyp}}$ can be interpreted as a "signed" version of $\eta_{\mathrm{ent}}$ (see Proposition 3.6 for a precise version of this remark).
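For concreteness, here is a small sketch (ours; grid and parameter values are illustrative assumptions) implementing these divergence-generating functions and the associated divergence $D_{\bar\eta}$ between densities on a uniform grid.

```python
import numpy as np

# Illustrative sketch: divergence-generating functions and the Bregman divergence
# D_etabar(f, g) = int D_eta(f(theta), g(theta)) dtau(theta) on a uniform grid.

def eta_p(s, p=2.0):
    return np.abs(s) ** p / (p * (p - 1))

def eta_ent(s):
    # Shannon entropy, with the convention 0*log(0) = 0
    return np.where(s > 0, s * np.log(np.maximum(s, 1e-300)) - s + 1.0, 1.0 * (s == 0))

def eta_hyp(s, beta=1.0):
    return s * np.arcsinh(s / beta) - np.sqrt(s**2 + beta**2) + beta

def bregman(eta, f, g, tau_w):
    """D_etabar(f, g), using a finite difference of eta for the derivative term."""
    h = 1e-6
    eta_prime_g = (eta(g + h) - eta(g - h)) / (2 * h)     # numerical eta'(g)
    pointwise = eta(f) - eta(g) - eta_prime_g * (f - g)    # D_eta(f(theta), g(theta))
    return np.sum(pointwise) * tau_w

m = 200
tau_w = 1.0 / m
f = 1.0 + 0.5 * np.sin(2 * np.pi * np.arange(m) / m)   # a positive density
g = np.ones(m)                                          # initialization f0 = 1
for name, eta in [("eta_2", eta_p), ("eta_ent", eta_ent), ("eta_hyp", eta_hyp)]:
    print(name, bregman(eta, f, g, tau_w))
```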

The next lemma states the strong convexity of these divergences with respect to the $L^1(\tau)$ norm, which is needed in the next section. It is a generalization of Pinsker's inequality, recovered when $K = 1$ and with $\eta_{\mathrm{ent}}$. Notice that when $p < 2$, the bound worsens as the norm increases.

Lemma 3.2 (Strong convexity of Bregman divergences). Assume that $f, g$ have $L^1(\tau)$-norm bounded by $K$. Then for $p \in \,]1, 2]$,

$$D_{\bar\eta_p}(f, g) \geq \frac{K^{p-2}}{2}\,\|f - g\|_{L^1(\tau)}^2.$$

The inequality also holds for $\eta_{\mathrm{ent}}$ and $\eta_{\mathrm{hyp}}$ with $p = 1$.


Proof. For $f \in L^1(\tau)$ and $p \geq 1$, since $2 - p \leq 1$, it holds by the (reverse) Jensen's inequality

$$\int |f|^{2-p}\,\mathrm{d}\tau \leq \Big( \int |f|\,\mathrm{d}\tau \Big)^{2-p} = \|f\|_{L^1(\tau)}^{2-p}.$$

Let $g \in L^1(\tau)$ be such that $\|g\|_{L^1(\tau)} = 1$. It holds by the Cauchy-Schwarz inequality

$$1 = \Big( \int |g|\,\mathrm{d}\tau \Big)^2 = \Big( \int |g|\,|f|^{\frac{p-2}{2}}\,|f|^{\frac{2-p}{2}}\,\mathrm{d}\tau \Big)^2 \leq \Big( \int |g|^2 |f|^{p-2}\,\mathrm{d}\tau \Big)\Big( \int |f|^{2-p}\,\mathrm{d}\tau \Big).$$

Combining both inequalities and by homogeneity, it follows that if $\|f\|_{L^1(\tau)} \leq K$, then for all $g \in L^1(\tau)$ it holds

$$\int |g|^2 |f|^{p-2}\,\mathrm{d}\tau \geq K^{p-2}\,\|g\|_{L^1(\tau)}^2.$$

Notice that $\eta_p''(a) = |a|^{p-2}$ and $\eta_{\mathrm{ent}}''(a) = |a|^{-1}$, thus the above expression is in fact a comparison between the Hessian of $\bar\eta$ and the squared norm. It only remains to integrate this comparison twice to prove the lemma, see e.g. [Yu, 2013, Thm. 3]. For the hyperbolic entropy $\eta_{\mathrm{hyp}}$, it holds $\eta_{\mathrm{hyp}}''(a) = (a^2 + \beta^2)^{-1/2} \geq (|a| + \beta)^{-1}$, and the same argument gives the bound with $p = 1$.

3.3 Gradient methods and their classical guarantees

We now detail two classical algorithms that enjoy guarantees of the form Eq. (1) for a large class of composite optimization problems. Algorithm 1 (PGM) is closely related to mirror descent [Nemirovskij and Yudin, 1983] and is discussed in Bauschke et al. [2003], Auslender and Teboulle [2006]. Algorithm 2 (APGM) is taken from Tseng [2010], who presents it as a generalization of Auslender and Teboulle [2006], itself an extension of Nesterov's second method [Nesterov, 1988]. For the sake of concreteness, we instantiate these algorithms in the context of optimization in the space of measures, where small adaptations have to be made.

Algorithm 1: (Bregman) Proximal Gradient Method (PGM)
Initialization: $f_0 \in \operatorname{dom} H$, step-size $s > 0$
for $k = 0, 1, \dots$ do
    $f_{k+1} = \arg\min_f \; G(f_k) + \int G'[f_k](f - f_k)\,\mathrm{d}\tau + H(f) + \frac{1}{s} D_{\bar\eta}(f, f_k)$
end
Output: $f_{k+1}$

Algorithm 2: Accelerated (Bregman) Proximal Gradient Method (APGM)
Initialization: $f_0 = h_0 \in \operatorname{dom} H$, $\gamma_0 = 1$, step-size $s > 0$
for $k = 0, 1, \dots$ do
    $g_k = (1 - \gamma_k) f_k + \gamma_k h_k$
    $h_{k+1} = \arg\min_f \; G(g_k) + \langle \nabla G(g_k), f - g_k \rangle + H(f) + \frac{\gamma_k}{s} D_{\bar\eta}(f, h_k)$
    $f_{k+1} = (1 - \gamma_k) f_k + \gamma_k h_{k+1}$
    $\gamma_{k+1} = \frac{1}{2}\big( \sqrt{\gamma_k^4 + 4\gamma_k^2} - \gamma_k^2 \big)$
end
Output: $f_{k+1}$

In the next proposition, we verify that the updates are well-defined under suitable assumptions. Table 1 lists some update formulas which are directly implementable, after discretization.


| $\bar{H}$ | Assumption (A2)± ($\operatorname{dom}\eta = \mathbb{R}$) | Assumption (A2)+ ($\operatorname{dom}\eta = \mathbb{R}_+$) |
| (i) $\iota_{\mathcal{M}_+(\Theta)} + \lambda\|\cdot\|$ | $[\eta']^{-1}\big(\big(\eta'(h_k) - sG'[g_k] - s\lambda\big)_+\big)$ | $[\eta']^{-1}\big(\eta'(h_k) - sG'[g_k] - s\lambda\big)$ |
| (ii) $\iota_{\mathcal{P}(\Theta)}$ | $[\eta']^{-1}\big(\big(\eta'(h_k) - sG'[g_k] - \kappa\big)_+\big)$ | $[\eta']^{-1}\big(\eta'(h_k) - sG'[g_k] - \kappa\big)$ |
| (iii) $\lambda\|\cdot\|$ | $[\eta']^{-1}\big(\mathrm{sfth}_{\lambda s}\big(\eta'(h_k) - sG'[g_k]\big)\big)$ | $[\eta']^{-1}\big(\eta'(h_k) - sG'[g_k] - s\lambda\big)$ |
| (iv) $\iota_{\{\mu\,;\,\|\mu\| \leq K\}}$ | $[\eta']^{-1}\big(\mathrm{sfth}_{\kappa}\big(\eta'(h_k) - sG'[g_k]\big)\big)$ | $[\eta']^{-1}\big(\eta'(h_k) - sG'[g_k] - \kappa\big)$ |

Table 1: Update step $h_{k+1}$ for APGM, and for PGM (where $f_k = g_k = h_k$), where $\mathrm{sfth}_\kappa(a) = \operatorname{sign}(a)\,(|a| - \kappa)_+$ is a soft-thresholding. In (ii), $\kappa \in \mathbb{R}$ is the unique number such that the update satisfies the constraint. In (iv), $\kappa \geq 0$ is the smallest number such that the update satisfies the constraint (see Condat [2016] for efficient algorithms to compute $\kappa$ in practice).
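To make Table 1 concrete, here is a minimal sketch (ours, with illustrative names) of the row (iii) update under the Shannon entropy geometry, where $[\eta_{\mathrm{ent}}']^{-1} = \exp$ turns the Bregman step into a multiplicative update; the gradient routine assumes the discretized deconvolution objective sketched in Section 3.1.

```python
import numpy as np

# Illustrative sketch (not the paper's code): one PGM update from Table 1, row (iii),
# with the Shannon entropy eta_ent (eta' = log, [eta']^{-1} = exp) on a grid.
# Under (A2)+ the update is multiplicative: f_{k+1} = f_k * exp(-s*G'[f_k] - s*lam).

def pgm_entropy_step(f_k, grad, s, lam):
    """h_{k+1} = [eta_ent']^{-1}( eta_ent'(f_k) - s*grad - s*lam )."""
    return f_k * np.exp(-s * grad - s * lam)

def grad_G(f, Phi, y, tau_w):
    # G'[f](theta_j) = 2 * int phi(theta_j - theta') (conv(f) - y)(theta') dtau(theta'),
    # for the discretized deconvolution data-fitting term (Phi, y, tau_w as assumed earlier)
    residual = Phi @ (f * tau_w) - y
    return 2.0 * (Phi.T @ residual) * tau_w

# Usage (assuming Phi, y, tau_w built as in the earlier sketch):
# f = pgm_entropy_step(f, grad_G(f, Phi, y, tau_w), s=1.0, lam=0.1)

# Tiny standalone check: with zero gradient and lam = 0, the step leaves f unchanged.
print(pgm_entropy_step(np.array([0.2, 0.5, 0.3]), np.zeros(3), s=1.0, lam=0.0))
```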

Proposition 3.3 (Well-defined updates). Assume (A1) and (A2). If $\eta'(h_k) \in L^\infty(\tau)$ and $g_k \in L^1(\tau)$, then there exists a unique solution $h_{k+1} \in L^1(\tau)$ to the optimization problem

$$\min_{f \in L^1(\tau)} \; \int G'[g_k]\, f\,\mathrm{d}\tau + H(f) + \frac{1}{s}\Big( \int \eta(f)\,\mathrm{d}\tau - \int \eta'(h_k)\, f\,\mathrm{d}\tau \Big)$$

which moreover satisfies $\eta'(h_{k+1}) \in L^\infty(\tau)$. It is characterized by the fact that there exists $\phi \in \partial H(h_{k+1}) \subset L^\infty(\tau)$ such that

$$G'[g_k] + \frac{1}{s}\big(\eta'(h_{k+1}) - \eta'(h_k)\big) + \phi = 0. \qquad (3)$$

Proof. Let $J_k$ be the function to minimize. Thanks to our assumption that $\eta'(\operatorname{int}(\operatorname{dom}\eta)) = \mathbb{R}$ and since $H$ is lower-bounded, by [Rockafellar, 1971, Cor. 2B] the sublevel sets of $J_k$ are compact with respect to the weak topology (induced on $L^1(\tau)$ by $L^\infty(\tau)$). Moreover, $J_k$ is convex and l.s.c. for the same topology; in particular, this holds because $G'[g_k], \eta'(h_k) \in L^\infty(\tau)$, and for the term $\int \eta(f)\,\mathrm{d}\tau$ it follows from [Rockafellar, 1971, Cor. 2A]. Thus, by the direct method of the calculus of variations, there exists a minimizer $h_{k+1} \in L^1(\tau)$. Since $\eta$ is strictly convex, so is $J_k$ and this minimizer is unique. The condition of Eq. (3) is always a sufficient optimality condition since, by the subdifferential inclusion rule, it implies that $0 \in \partial J_k(h_{k+1})$. It thus remains to show that it is also necessary, in which case the property $\eta'(h_{k+1}) \in L^\infty(\tau)$ immediately follows.

This is done on a case-by-case basis for the functions $\bar{H}$ admissible under Assumption (A1). Consider for instance the nonnegativity constraint $\bar{H} = \iota_{\mathcal{M}_+(\Theta)}$ and $\eta$ that satisfies Assumption (A2)±. Then with the update $h_{k+1}$ given in Table 1 (take $\lambda = 0$), it holds

$$\phi := \frac{1}{s}\eta'(h_k) - G'[g_k] - \frac{1}{s}\eta'(h_{k+1}) = \min\Big\{0,\, \frac{1}{s}\eta'(h_k) - G'[g_k]\Big\}.$$

Clearly $\phi \in L^\infty(\tau)$, $\phi \leq 0$ and $\int \phi\, h_{k+1}\,\mathrm{d}\tau = 0$, and thus $\phi \in \partial H(h_{k+1})$, which shows that $h_{k+1}$ is a minimizer and satisfies Eq. (3). The other cases for $\bar{H}$ and $\eta$ that are admissible under (A1) and (A2) (such as those listed in Table 1) can be treated similarly and follow computations which are standard in the finite-dimensional setting.

Let us now recall the guarantees for these methods. We stress that, as discussed in Section 2, these guarantees do not necessarily lead to convergence rates.

Proposition 3.4. Assume (A1) and (A2) and that $\eta$ satisfies the conclusion of Lemma 3.2 for some $p \in [1, 2]$. Consider an initialization $f_0 \in \operatorname{dom} H$ and, for some $K \geq \|f_0\|_{L^1(\tau)}$, a step-size $s \leq K^{p-2}\big(\|\Phi\|_\infty^2 \operatorname{Lip}(\nabla R)\big)^{-1}$.


(i) Let $(f_k)_{k \in \mathbb{N}}$ be generated by Algorithm 1 (PGM). If $\sup_k \|f_k\|_{L^1(\tau)} \leq K$, then Eq. (1) holds with $\alpha_k = 1/(sk)$, i.e.

$$F(f_k) - F(f) \leq \frac{1}{sk}\, D_{\bar\eta}(f, f_0), \quad \forall f \in L^1(\tau),\ \forall k \geq 1.$$

(ii) Let $(f_k, g_k, h_k)_{k \in \mathbb{N}}$ be generated by Algorithm 2 (APGM). If $\sup_k \|h_k\|_{L^1(\tau)} \leq K$, then Eq. (1) holds for $f_k$ with $\alpha_k = 4/((k+1)^2 s)$, i.e.

$$F(f_k) - F(f) \leq \frac{4}{s(k+1)^2}\, D_{\bar\eta}(f, f_0), \quad \forall f \in L^1(\tau),\ \forall k \geq 1.$$

Moreover, it holds $0 < \gamma_k \leq 1$ and $\gamma_k \leq 2/(k+2)$ for all $k \geq 0$.

Proof. By Proposition 3.3, the updates are well-defined. The proof of [Tseng, 2010, Thm. 1] goes through, in particular thanks to Lemma 3.1 (smoothness) and since $D_{\bar\eta}/K^{p-2}$ is 1-strongly convex with respect to $\|\cdot\|_{L^1(\tau)}$ whenever this property is needed in the proof. A particularly simple exposition of the proof for APGM can be found in [d'Aspremont et al., 2021, Thm. 4.24].

Remark 3.5. A difficulty in Proposition 3.4 is that when $p < 2$, one needs to assume a priori bounds on the $L^1$-norm of certain iterates to obtain convergence guarantees, because the metric induced by the divergence $D_{\bar\eta}$ becomes weaker as the $L^1$-norm increases. Since Algorithm 1 (PGM) is a descent method, $\|f_k\|_{L^1}$ is bounded, uniformly in $k$, as soon as the objective is coercive for the $L^1$-norm. But for Algorithm 2 (APGM), even if variants exist where $(F(f_k))_k$ is monotone [d'Aspremont et al., 2021], this property does not seem to be sufficient to control $\|h_k\|_{L^1}$, even for coercive objectives. Of course, uniform bounds are always trivially satisfied when $\bar{H}$ includes the constraint $\iota_{\{\|\mu\| \leq K\}}$ or $\iota_{\mathcal{P}(\Theta)}$.

3.4 Reparameterized gradient descent as a Bregman descent

In this paragraph, we recall a link between Bregman gradient descent, a.k.a. mirror descent (an instance of Algorithm 1), and $L^2$ gradient descent dynamics on certain reparameterized objectives. The purpose is to show that the convergence rates proved in Section 4 with $\eta_{\mathrm{ent}}$ and $\eta_{\mathrm{hyp}}$ are also relevant to understand $L^2$ gradient descent in certain contexts. While these remarks are well-known [Amid and Warmuth, 2020, Vaskevicius et al., 2019, Azulay et al., 2021], we find it instructive to state them clearly in our context. In order to reduce the discussion to its simplest setting, we consider the continuous-time dynamics in the unregularized setting, and we assume that they are well-defined.

The optimality conditions of the update of Algorithm 1 (see Proposition 3.3) can be written as $\eta'(f_{k+1}) - \eta'(f_k) = -sF'[f_k]$. As the step-size $s$ vanishes, this leads to a continuous trajectory $(f_t)_{t \geq 0}$, which we refer to as the $\eta$-mirror flow, that solves

$$\frac{\mathrm{d}}{\mathrm{d}t}\eta'(f_t) = -F'[f_t] \quad \Leftrightarrow \quad \frac{\mathrm{d}}{\mathrm{d}t}f_t = -[\eta''(f_t)]^{-1} F'[f_t]. \qquad (4)$$

Proposition 3.6 (Reparameterized mirror flows as gradient flows). (i) (Square parameterization) Let $(f_t)_{t \geq 0}$ be the $L^2(\tau)$ gradient flow of $\hat{F} : f \mapsto F(f^2)$, initialized such that $\log(f_0) \in L^\infty(\tau)$. Then $h_t := f_t^2$ is the $\eta_{\mathrm{ent}}$-mirror flow of $4F$.

(ii) (Difference of squares parameterization) Let $(f_t, g_t)_{t \geq 0}$ be the $(L^2(\tau))^2$ gradient flow of $\hat{F} : (f, g) \mapsto F(f^2 - g^2)$, initialized such that $\log(f_0 g_0) \in L^\infty(\tau)$. Then $h_t := f_t^2 - g_t^2$ is the $\eta_{\mathrm{hyp}}$-mirror flow of $4F$ with parameter $\beta = 2 f_0 g_0$ (here $\beta$ is a function instead of a scalar).

We can make the following remarks:


• Combining (i) and (ii), we find that if $(h_t^+, h_t^-)_{t \geq 0}$ follows an $\eta_{\mathrm{ent}}$-mirror flow for $(h^+, h^-) \mapsto F(h^+ - h^-)$, then $h_t^+ - h_t^-$ is an $\eta_{\mathrm{hyp}}$-mirror flow for $F$. This confirms the interpretation of $\eta_{\mathrm{hyp}}$ as a "signed" version of the entropy (see also [Ghai et al., 2020, Thm. 23]).

• These exact equivalences are lost in discrete time, with an error term that scales as the squared step-size. It is thus difficult to convert the most efficient guarantees for (Bregman) PGM into guarantees with the same convergence rate for gradient descent.

Proof. (i) The $L^2(\tau)$ gradient flow of $\hat{F}$ satisfies $\frac{\mathrm{d}}{\mathrm{d}t}f_t = -\hat{F}'[f_t] = -2 f_t F'[f_t^2]$. Thus the function $h_t := f_t^2$ evolves according to

$$\frac{\mathrm{d}}{\mathrm{d}t}h_t = 2 f_t \frac{\mathrm{d}}{\mathrm{d}t}f_t = -4 f_t^2 F'[f_t^2] = -4 h_t F'[h_t],$$

which is precisely the $\eta_{\mathrm{ent}}$-mirror flow of $4F$ since $\eta_{\mathrm{ent}}''(s) = s^{-1}$ for $s > 0$.

(ii) The $(L^2(\tau))^2$ gradient flow of $\hat{F}$ satisfies $\frac{\mathrm{d}}{\mathrm{d}t}f_t = -2 f_t F'[f_t^2 - g_t^2]$ and $\frac{\mathrm{d}}{\mathrm{d}t}g_t = 2 g_t F'[f_t^2 - g_t^2]$. As a consequence, we have for $h_t := f_t^2 - g_t^2$ and $\tilde{h}_t := f_t^2 + g_t^2$ that $\frac{\mathrm{d}}{\mathrm{d}t}h_t = -4 \tilde{h}_t F'[h_t]$.

To conclude, it remains to show that $\tilde{h}_t = [\eta_{\mathrm{hyp}}''(h_t)]^{-1} = (h_t^2 + \beta^2)^{1/2}$ for some $\beta > 0$. To prove this, observe that

$$\frac{\mathrm{d}}{\mathrm{d}t}(\tilde{h}_t^2 - h_t^2) = \frac{\mathrm{d}}{\mathrm{d}t}(4 f_t^2 g_t^2) = 8 f_t' f_t g_t^2 + 8 g_t' g_t f_t^2 = 0.$$

Hence $\tilde{h}_t^2 - h_t^2 = \tilde{h}_0^2 - h_0^2 = 4 f_0^2 g_0^2$, which proves that $\tilde{h}_t = (h_t^2 + \beta^2)^{1/2}$ with $\beta = 2 f_0 g_0$.
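As a quick numerical sanity check (ours, not from the paper), the following scalar sketch integrates both dynamics of Proposition 3.6(i) with an explicit Euler scheme; the test objective $F(h) = \frac12(h - 0.3)^2$ and the step size are arbitrary assumptions.

```python
import numpy as np

# Sanity check of Prop. 3.6(i) in a scalar toy case: gradient flow of F_hat(f) = F(f^2)
# versus the eta_ent-mirror flow of 4F, i.e. h' = -4 h F'(h). Explicit Euler scheme.

def Fprime(h):
    # F(h) = 0.5 * (h - 0.3)**2, so F'(h) = h - 0.3 (arbitrary smooth test objective)
    return h - 0.3

dt, T = 1e-4, 5.0
f = 1.0                  # initialization of the reparameterized flow
h_mirror = f ** 2        # same initialization for the mirror flow
for _ in range(int(T / dt)):
    f += dt * (-2.0 * f * Fprime(f ** 2))                   # d/dt f = -2 f F'(f^2)
    h_mirror += dt * (-4.0 * h_mirror * Fprime(h_mirror))   # d/dt h = -4 h F'(h)

print(f ** 2, h_mirror)  # both should be close to each other (and to the minimizer 0.3)
```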

4 Upper bounds on the convergence rates

This section contains the main result of this paper, Theorem 4.1, which is summarized in Table 2. As discussed in Section 2 and thanks to Proposition 3.4, in order to derive convergence rates for Algorithms 1 and 2 it is sufficient to control the function

$$\psi(\alpha) := \inf_{f \in L^1(\tau)} \big\{ F(f) - \inf F + \alpha D_{\bar\eta}(f, f_0) \big\}. \qquad (5)$$

For the class of problems we consider, the behavior of $\psi$ highly depends on the context. The simplest situation is when $F$ admits a minimizer $f^\star \in L^q(\tau)$ with $q > 1$ (since $\tau$ is finite, it holds $L^q(\tau) \subset L^1(\tau)$). Then for the divergence-generating functions $\eta_p$ with $1 < p \leq q$, or for $\eta_{\mathrm{hyp}}$, it is easy to see that $D_{\bar\eta}(f^\star, f_0) < +\infty$ (this further requires $f^\star \geq 0$ for $\eta_{\mathrm{ent}}$). Thus, by Proposition 2.1, it holds $\psi(\alpha) \lesssim \alpha$ and the convergence rates $(\alpha_k)$ given in Proposition 3.4 are preserved.

In the more subtle case where the minimizer of $\bar{F}$ is only assumed to be in $\mathcal{M}(\Theta)$, the variety of behaviors is captured by the following result.

Theorem 4.1. Under Assumptions (A1) and (A2), let $f_0$ be such that $\eta'(f_0) \in L^\infty(\tau)$. Assume that there exists $\mu^\star \in \mathcal{M}(\Theta)$ such that $\bar{F}(\mu^\star) = \inf F$. Under setting (A2)+ (i.e. $\operatorname{dom}\eta = \mathbb{R}_+$), assume moreover $\mu^\star \in \mathcal{M}_+(\Theta)$. Let $q \in \{1, 2, 4\}$ be determined by Table 2-(b). Then it holds

$$\psi(\alpha) \lesssim \inf_{\epsilon > 0} \big\{ \epsilon^q + \alpha\, \epsilon^d\, \eta(\epsilon^{-d}) \big\}. \qquad (6)$$

Given this expression, it is straightforward to compute the rates given in Table 2. The exponent $q \in \{1, 2, 4\}$ can be interpreted as follows: it is such that $F(\mu^\star_\epsilon) - F(\mu^\star) \lesssim \epsilon^q$, where $\mu^\star_\epsilon$ is the convolution of $\mu^\star$ with a box kernel of radius $\epsilon > 0$. An asymptotic analysis leads to lower bounds for this exponent under several assumptions (Table 2-(b)), but in practice, non-asymptotic effects may play an important role (see experiments in Section 6.2).
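For the reader's convenience, here is the short computation behind Table 2-(a) (it is implicit in the theorem but not spelled out here). For $\eta_p$ with $p > 1$, $\epsilon^d\,\eta_p(\epsilon^{-d}) \asymp \epsilon^{-(p-1)d}$, so Eq. (6) gives

$$\psi(\alpha) \lesssim \inf_{\epsilon > 0} \big\{ \epsilon^q + \alpha\, \epsilon^{-(p-1)d} \big\} \asymp \alpha^{\frac{q}{(p-1)d + q}},$$

the infimum being attained at $\epsilon \asymp \alpha^{1/((p-1)d + q)}$; combined with $\alpha_k \lesssim k^{-1}$ (PGM) or $\alpha_k \lesssim k^{-2}$ (APGM) from Proposition 3.4, this yields the $\eta_p$ entries of Table 2-(a). For $\eta_{\mathrm{ent}}$ (and similarly $\eta_{\mathrm{hyp}}$), $\epsilon^d\,\eta_{\mathrm{ent}}(\epsilon^{-d}) \asymp \log(1/\epsilon)$ up to constants, so

$$\psi(\alpha) \lesssim \inf_{\epsilon > 0} \big\{ \epsilon^q + \alpha\,\log(1/\epsilon) \big\} \asymp \alpha \log(1/\alpha) \quad \text{as } \alpha \to 0,$$

which gives the $\log(k)\,k^{-1}$ and $\log(k)\,k^{-2}$ entries.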


(a) Convergence rates

| | PGM | APGM |
| $\eta_{\mathrm{ent}}$, $\eta_{\mathrm{hyp}}$ | $\log(k)\,k^{-1}$ | $\log(k)\,k^{-2}$ |
| $\eta_p$, $p > 1$ | $k^{-\frac{q}{(p-1)d+q}}$ | $k^{-\frac{2q}{(p-1)d+q}}$ |

(b) Value of $q$

| | $\Phi$ Lipschitz | $\nabla_\theta \Phi$ Lipschitz |
| $\bar{G}'[\mu^\star]$ arbitrary | $q = 1$ (I) | $q = 2$ (II) |
| $\bar{G}'[\mu^\star] = 0$ | $q = 2$ (I*) | $q = 4$ (II*) |

Table 2: (a) Upper bounds on the convergence rates of $F(f_k) - \inf F$ for Algorithm 1 (PGM) and Algorithm 2 (APGM). (b) The value of $q$ that appears in the rate depends on the regularity of $\Phi$ and on whether $\bar{G}'$ vanishes at optimality or not. This defines 4 settings referred to as (I), (I*), (II) and (II*). Upper bounds are derived in Thm. 4.1, lower bounds are proved in Section 5.

Proof. The upper bound in Eq. (6) corresponds to an upper bound on $F(f_\epsilon) - \inf F + \alpha D_{\bar\eta}(f_\epsilon, f_0)$ for a specific family of candidates $f_\epsilon \in L^1(\tau)$. A special case of this argument, for sparse $\mu^\star$, $\eta_{\mathrm{ent}}$ and $q = 1$, appeared in Chizat [2021] and was extended to $\mu^\star \in \mathcal{M}_+(\Theta)$ in Domingo-Enrich et al. [2020]. In the following we write $F, G, H$ for $\bar{F}, \bar{G}, \bar{H}$ to lighten notation.

Step 1. Smoothing with a box kernel. For $\epsilon > 0$ (smaller than the injectivity radius of the exponential map in $\Theta$), consider the transition kernel $(\gamma_{\theta,\epsilon})_{\theta \in \Theta}$ where $\gamma_{\theta,\epsilon} \in \mathcal{P}(\Theta)$ is proportional to the restriction of $\tau$ to the closed geodesic ball $B_\epsilon(\theta)$ of radius $\epsilon$ centered at $\theta$. We define $\gamma_\epsilon \in \mathcal{M}(\Theta^2)$ as $\gamma_\epsilon(\mathrm{d}\theta, \mathrm{d}\theta') := \gamma_{\theta,\epsilon}(\mathrm{d}\theta')\,\mathrm{d}\mu^\star(\theta)$. By construction, the first marginal of $\gamma_\epsilon$ is $\mu^\star$ and we call $\mu_\epsilon$ its second marginal, which is absolutely continuous with respect to $\tau$ with density $f_\epsilon$. Since $\gamma_{\theta,\epsilon}(\mathrm{d}\theta') = \tau(B_\epsilon(\theta))^{-1}\,\mathbf{1}_{B_\epsilon(\theta)}(\theta')\,\mathrm{d}\tau(\theta')$, it holds

$$\mu_\epsilon(\mathrm{d}\theta') = \int_\Theta \gamma_{\theta,\epsilon}(\mathrm{d}\theta')\,\mathrm{d}\mu^\star(\theta), \quad \text{and} \quad f_\epsilon(\theta') = \int_\Theta \frac{\mathbf{1}_{B_\epsilon(\theta)}(\theta')}{\tau(B_\epsilon(\theta))}\,\mathrm{d}\mu^\star(\theta).$$

Note that $\tau(B_\epsilon(\theta)) \asymp \epsilon^d$, see e.g. [Gray and Vanhecke, 1979, Thm. 3.3].

Step 2. Bounding $F(\mu_\epsilon) - F(\mu^\star)$. For our admissible regularizers, it is easy to verify that $H(\mu_\epsilon) \leq H(\mu^\star)$. By convexity of $G$, we have

$$F(\mu_\epsilon) - F(\mu^\star) \leq G(\mu_\epsilon) - G(\mu^\star) \leq \int G'[\mu_\epsilon]\,\mathrm{d}[\mu_\epsilon - \mu^\star] = \int_{\Theta^2} \big( G'[\mu_\epsilon](\theta') - G'[\mu_\epsilon](\theta) \big)\,\mathrm{d}\gamma_\epsilon(\theta, \theta').$$

It is clear that the magnitude and regularity of $G'[\mu_\epsilon]$ play a role in the magnitude of this quantity. To go further, let us consider the various cases of Table 2-(b) successively.

(I) If $\Phi$ is Lipschitz, then for $\theta, \theta' \in \Theta$ it holds

$$|G'[\mu_\epsilon](\theta) - G'[\mu_\epsilon](\theta')| \leq \Big\| \nabla R\Big(\int \Phi\,\mathrm{d}\mu_\epsilon\Big) \Big\|_{\mathcal{F}} \cdot \operatorname{Lip}(\Phi) \cdot \operatorname{dist}(\theta, \theta').$$

Since $\nabla R$ is Lipschitz continuous, we deduce that there exists $K > 0$ such that $|G'[\mu_\epsilon](\theta) - G'[\mu_\epsilon](\theta')| \leq K \operatorname{dist}(\theta, \theta')$. It follows that

$$F(\mu_\epsilon) - F(\mu^\star) \leq K \int_{\Theta \times \Theta} \operatorname{dist}(\theta, \theta')\,\mathrm{d}\gamma_\epsilon(\theta, \theta') = K \int_\Theta \frac{1}{\tau(B_\epsilon(\theta))} \int_{B_\epsilon(\theta)} \operatorname{dist}(\theta, \theta')\,\mathrm{d}\tau(\theta')\,\mathrm{d}\mu^\star(\theta) \leq K \cdot \epsilon \cdot \|\mu^\star\|.$$

(I*) If $\Phi$ is Lipschitz and moreover $G'[\mu^\star] = 0$, it holds

$$G'[\mu_\epsilon](\theta) = \Big\langle \nabla R\Big(\int \Phi\,\mathrm{d}\mu_\epsilon\Big) - \nabla R\Big(\int \Phi\,\mathrm{d}\mu^\star\Big),\, \Phi(\theta) \Big\rangle_{\mathcal{F}}.$$


Since $\nabla R$ is Lipschitz continuous, the first factor is bounded by $\operatorname{Lip}(\nabla R) \cdot \big\| \int \Phi\,\mathrm{d}[\mu_\epsilon - \mu^\star] \big\|_{\mathcal{F}} = \operatorname{Lip}(\nabla R) \cdot \sup_{\|\tilde\Phi\|_{\mathcal{F}} \leq 1} \int \langle \tilde\Phi, \Phi(\theta) \rangle\,\mathrm{d}[\mu_\epsilon - \mu^\star](\theta)$. Under assumption (I*), the functions $\{\theta \mapsto \langle \tilde\Phi, \Phi(\theta)\rangle \;;\; \|\tilde\Phi\|_{\mathcal{F}} \leq 1\}$ are uniformly Lipschitz, so by the reasoning above it follows that $\big\|\int \Phi\,\mathrm{d}[\mu_\epsilon - \mu^\star]\big\|_{\mathcal{F}} \lesssim \epsilon$, hence $\operatorname{Lip}(G'[\mu_\epsilon]) \lesssim \epsilon$. It follows that the previous bound improves to $F(\mu_\epsilon) - F(\mu^\star) \lesssim \epsilon^2$.

(II) Here we have that $G'[\mu_\epsilon] \in C^1(\Theta; \mathbb{R})$ and $\nabla_\theta G'[\mu_\epsilon]$ is Lipschitz. By the mean value theorem on Riemannian manifolds (see e.g. [Gray and Willmore, 1982, Thm. 4.6]), there exists a constant $K' \geq 0$ such that for all $\theta \in \Theta$,

$$\Big| \frac{1}{\tau(B_\epsilon(\theta))} \int_{B_\epsilon(\theta)} G'[\mu_\epsilon](\theta')\,\mathrm{d}\tau(\theta') - G'[\mu_\epsilon](\theta) \Big| \leq K' \epsilon^2.$$

It follows that

$$F(\mu_\epsilon) - F(\mu^\star) \leq \int \mathrm{d}\mu^\star(\theta) \int \big( G'[\mu_\epsilon](\theta') - G'[\mu_\epsilon](\theta) \big)\,\mathrm{d}\gamma_{\theta,\epsilon}(\theta') \leq K' \|\mu^\star\|\, \epsilon^2.$$

(II*) If $\nabla\Phi$ is Lipschitz and moreover $G'[\mu^\star] = 0$, then an improvement as in (I*) applies. The functions $\{\theta \mapsto \langle \tilde\Phi, \Phi(\theta)\rangle \;;\; \|\tilde\Phi\|_{\mathcal{F}} \leq 1\}$ are differentiable with a uniformly Lipschitz derivative, so arguments as in the previous paragraph show that $\big\|\int \Phi\,\mathrm{d}[\mu_\epsilon - \mu^\star]\big\|_{\mathcal{F}} \lesssim \epsilon^2$, hence $\operatorname{Lip}(\nabla_\theta G'[\mu_\epsilon]) \lesssim \epsilon^2$. Thus, going through the argument for (II) with all the constants multiplied by $\epsilon^2$, we obtain that the bound improves to $F(\mu_\epsilon) - F(\mu^\star) \lesssim \epsilon^4$.

Step 3. Bounding $D_{\bar\eta}(f_\epsilon, f_0)$. It holds $D_{\bar\eta}(f_\epsilon, f_0) = \int \eta(f_\epsilon)\,\mathrm{d}\tau - \int \eta(f_0)\,\mathrm{d}\tau - \int \eta'(f_0)(f_\epsilon - f_0)\,\mathrm{d}\tau$. Since $\eta'(f_0) \in L^\infty$, all these terms are bounded by a constant independent of $\epsilon$ except $\int \eta(f_\epsilon)\,\mathrm{d}\tau$, so it remains to bound the latter. If $\mu^\star = 0$ then this quantity is bounded by a constant independent of $\epsilon$ and we are done. Otherwise, let us first assume that $\mu^\star \in \mathcal{M}_+(\Theta)$. By Jensen's inequality, one has

$$\eta(f_\epsilon(\theta')) = \eta\Big( \int_\Theta \mu^\star(\Theta)\,\frac{\mathbf{1}_{B_\epsilon(\theta)}(\theta')}{\tau(B_\epsilon(\theta))}\,\frac{\mathrm{d}\mu^\star(\theta)}{\mu^\star(\Theta)} \Big) \leq \int_\Theta \eta\Big( \mu^\star(\Theta)\,\frac{\mathbf{1}_{B_\epsilon(\theta)}(\theta')}{\tau(B_\epsilon(\theta))} \Big)\,\frac{\mathrm{d}\mu^\star(\theta)}{\mu^\star(\Theta)}.$$

It follows, by Fubini's theorem,

$$\int \eta(f_\epsilon(\theta'))\,\mathrm{d}\tau(\theta') \leq \int_\Theta \frac{\mathrm{d}\mu^\star(\theta)}{\mu^\star(\Theta)} \int_\Theta \mathrm{d}\tau(\theta')\, \eta\Big( \mu^\star(\Theta)\,\frac{\mathbf{1}_{B_\epsilon(\theta)}(\theta')}{\tau(B_\epsilon(\theta))} \Big) \leq \sup_\theta \Big\{ \tau(B_\epsilon(\theta))\, \eta\Big( \frac{\mu^\star(\Theta)}{\tau(B_\epsilon(\theta))} \Big) + |\eta(0)| \Big\}.$$

Now we use the fact that $\tau(B_\epsilon(\theta)) \asymp \epsilon^d$ and that $\eta$ has at most polynomial growth, so that the constants can be ignored, to bound this quantity by $O(\epsilon^d\, \eta(\epsilon^{-d}))$.

For the general case where $\mu^\star \in \mathcal{M}(\Theta)$, let $\mu_+, \mu_- \in \mathcal{M}_+(\Theta)$ be given by the Jordan decomposition $\mu^\star = \mu_+ - \mu_-$. It holds $|f_\epsilon(\theta)| \leq \max\{f_{+,\epsilon}(\theta), f_{-,\epsilon}(\theta)\}$, where $f_{+,\epsilon}$ and $f_{-,\epsilon}$ are obtained by applying the smoothing procedure of Step 1 to $\mu_+$ and $\mu_-$ respectively. Using the fact that, under Assumption (A2)±, $\eta$ is even and increasing on $\mathbb{R}_+$, we obtain the bound

$$\int \eta(f_\epsilon)\,\mathrm{d}\tau \leq \int \eta(f_{+,\epsilon})\,\mathrm{d}\tau + \int \eta(f_{-,\epsilon})\,\mathrm{d}\tau.$$

We finally bound each of these terms as in the case $\mu^\star \in \mathcal{M}_+(\Theta)$ to conclude.

5 Lower bounds

We will consider two types of lower bounds: (i) lower bounds on $\psi(\alpha)$, in order to confirm that the analysis in Thm. 4.1 is tight, and (ii) direct lower bounds on the convergence rates of Algorithms 1 and 2. Of course, the latter imply the former, but studying $\psi$ directly has its own interest and makes it simpler to cover all the cases.


5.1 Tight lower bounds on ψ

Let us show that the bounds on $\psi$ in Theorem 4.1 cannot be improved without additional assumptions.

Proposition 5.1 (Lower bounds). For each of the 4 settings of Table 2-(b) (determining the value of $q \in \{1, 2, 4\}$), there exists an objective function $\bar{F}$ satisfying Assumption (A1) and $\inf F = \bar{F}(\mu^\star)$ with $\mu^\star \in \mathcal{M}_+(\Theta)$, such that for any divergence-generating function $\eta$ satisfying Assumption (A2) and any $f_0$ such that $\eta'(f_0) \in L^\infty(\tau)$, it holds

$$\psi(\alpha) \gtrsim \inf_{\epsilon > 0} \big\{ \epsilon^q + \alpha\, \epsilon^d\, \eta(\epsilon^{-d}) \big\}. \qquad (7)$$

Proof. Let us build explicit objective functions with $\Theta = \mathbb{T}^d$ the $d$-dimensional torus, $\mathcal{F} = \mathbb{R}$ and $\bar{H} = \iota_{\mathcal{P}(\Theta)}$. Let $\theta_0 \in \mathbb{T}^d$ and, for $f \in L^1(\Theta)$ such that $f\tau \in \mathcal{P}(\Theta)$, let $\epsilon_f > 0$ be such that $\int_{B_{\epsilon_f}(\theta_0)} f\,\mathrm{d}\tau = \frac12$. Such an $\epsilon_f$ exists because $\epsilon \mapsto \int_{B_\epsilon(\theta_0)} f\,\mathrm{d}\tau$ continuously interpolates between $0$ when $\epsilon = 0$ and $1$ when $\epsilon$ is large. For any $\eta$ satisfying Assumption (A2), using the fact that $\eta \geq 0$ and Jensen's inequality, it holds

$$\int \eta(f)\,\mathrm{d}\tau \geq \int_{B_{\epsilon_f}(\theta_0)} \eta(f)\,\mathrm{d}\tau \geq \tau(B_{\epsilon_f}(\theta_0))\,\eta\Big( \frac{1}{\tau(B_{\epsilon_f}(\theta_0))} \int_{B_{\epsilon_f}(\theta_0)} f\,\mathrm{d}\tau \Big) \gtrsim \epsilon_f^d\, \eta(\epsilon_f^{-d}).$$

Thus, for any $f_0$ such that $\eta'(f_0) \in L^\infty(\tau)$, it holds $D_{\bar\eta}(f, f_0) \gtrsim \epsilon_f^d\, \eta(\epsilon_f^{-d})$.

(I) Consider $\bar{G}(\mu) = \int_\Theta \operatorname{dist}(\theta_0, \theta)\,\mathrm{d}\mu(\theta)$. This satisfies Assumption (A1)-(I) with $R$ the identity on $\mathbb{R}$ and $\Phi(\theta) = \operatorname{dist}(\theta_0, \theta)$, which is clearly Lipschitz. Since $\bar{H} = \iota_{\mathcal{P}(\Theta)}$, $\bar{F}$ admits a unique minimizer $\mu^\star = \delta_{\theta_0}$ and it holds $\bar{F}(\mu^\star) = \inf \bar{F} = 0$. For any $f \in \operatorname{dom} H$, we have the following lower bound, where $\epsilon_f$ is defined above:

$$F(f) - \inf F = \int_\Theta \operatorname{dist}(\theta, \theta_0)\, f(\theta)\,\mathrm{d}\tau(\theta) \geq \int_{\Theta \setminus B_{\epsilon_f}(\theta_0)} \operatorname{dist}(\theta, \theta_0)\, f(\theta)\,\mathrm{d}\tau(\theta) \geq \frac{\epsilon_f}{2}.$$

This proves the lower bound of Eq. (7) with $q = 1$.

(II) Consider $\bar{G}(\mu) = \int_\Theta \tilde\Phi(\theta)\,\mathrm{d}\mu(\theta)$, where $\tilde\Phi$ is any function which is continuously twice differentiable, coincides with $\operatorname{dist}(\theta_0, \cdot)^2$ on $B_{1/2}(\theta_0)$ and is larger than $1/4$ outside of this ball. We cannot directly take $\operatorname{dist}(\theta_0, \cdot)^2$ because this function is not smooth everywhere on $\Theta$ due to the existence of a cut locus, but it is smooth on $B_{1/2}(\theta_0)$. Assumptions (A1)-(II) are satisfied. Again, $\mu^\star = \delta_{\theta_0}$ is the unique minimizer and $\bar{F}(\mu^\star) = 0$. Analogous computations show that $F(f) - \inf F \geq \min\{\frac14, \epsilon_f^2\}$. This proves the lower bound of Eq. (7) with $q = 2$.

(I*) Consider $\bar{G}(\mu) = \frac12 \big(\int \Phi\,\mathrm{d}\mu\big)^2$ where $\Phi(\theta) = \operatorname{dist}(\theta, \theta_0)$. Clearly, $\mu^\star = \delta_{\theta_0}$ is the unique minimizer and $\bar{G}'[\mu^\star] = 0$, so Assumption (A1)-(I*) is satisfied. By direct computations, it holds $F(f) - \inf F \gtrsim \epsilon_f^2$. This proves the lower bound of Eq. (7) with $q = 2$.

(II*) Consider $\bar{G}(\mu) = \frac12 \big(\int \tilde\Phi\,\mathrm{d}\mu\big)^2$ where $\tilde\Phi$ is defined as in the analysis of (II). Again, $\mu^\star = \delta_{\theta_0}$ is the unique minimizer and $\bar{G}'[\mu^\star] = 0$, so Assumption (A1)-(II*) is satisfied. By direct computations, it holds $F(f) - \inf F \gtrsim \epsilon_f^4$, which proves the lower bound with $q = 4$.

Remark 5.2 (Exact decay of $\psi$ for a natural class of problems). There exists in fact a broad class of problems satisfying Assumption (A1)-(II) for which the bound on $\psi$ with $q = 2$ is exact. These are problems with a sparse solution $\mu^\star$ that satisfy an additional non-degeneracy condition at optimality, and they appear naturally in certain contexts [Poon et al., 2018]. For these problems, it is shown in [Chizat, 2021, Prop. 3.2] that $\bar{F}(\mu) - \inf F \gtrsim \big( \sup_g \int g\,\mathrm{d}[\mu - \mu^\star] \big)^2$, where the supremum is over 1-Lipschitz functions $g : \Theta \to \mathbb{R}$ uniformly bounded by $1$. Reasoning as in the proof of Proposition 5.1, this implies $F(f) - \inf F \gtrsim \epsilon_f^2$, which leads to

$$\psi(\alpha) \asymp \inf_{\epsilon > 0} \big\{ \epsilon^2 + \alpha\, \epsilon^d\, \eta(\epsilon^{-d}) \big\}.$$


5.2 Direct lower bounds on the convergence rates

In this section, we directly lower-bound the convergence rates of Algorithm 1 and Algorithm 2. We focus on the $L^2$ geometry ($\eta_2$), for which we prove all the lower bounds (there are 8 cases to consider), and on the relative entropy geometry ($\eta_{\mathrm{ent}}$), for which we omit certain settings for the sake of conciseness. In all the cases considered, the lower bounds match the upper bounds (up to logarithmic terms for $\eta_{\mathrm{ent}}$).

Proposition 5.3. For each of the settings (I), (I*), (II) and (II*) under Assumption (A1), there exists a function $F$ such that the iterates $(f_k)_{k \geq 0}$ of Algorithm 1 (PGM) with the divergence-generating function $\eta_2$, initialized with $f_0 = 1$ and with any step-size $s > 0$, satisfy

$$F(f_k) - \inf F \gtrsim k^{-\frac{q}{d+q}},$$

where $q$ is the constant associated to the setting via Table 2-(b).

Proof. (I) As in the proof of Proposition 5.1, we consider $\Theta = \mathbb{T}^d$, $\theta_0 \in \mathbb{T}^d$, $\bar{H} = \iota_{\mathcal{P}(\Theta)}$ and $\bar{G}(\mu) = \int \Phi\,\mathrm{d}\mu$ with $\Phi(\theta) = \operatorname{dist}(\theta, \theta_0)$. We set $s = 1$ as the step-size plays no role in what follows. In this case, the update equation of Algorithm 1 writes $f_{k+1} = (f_k - \Phi - \kappa_k)_+$, where $\kappa_k \in \mathbb{R}$ is such that $f_{k+1} \in \operatorname{dom} H$. Thanks to the symmetries of the problem, a simple recursion shows that it holds

$$f_k = (m(k) - k\Phi)_+$$

for some $m(k) \in \mathbb{R}_+$. Let $r(k) := m(k)/k$, which is such that $B_{r(k)}(\theta_0)$ is the support of $f_k$. We have

$$1 = \int f_k\,\mathrm{d}\tau \asymp \int_0^{r(k)} u^{d-1}\big(m(k) - ku\big)\,\mathrm{d}u \asymp \frac{m(k)}{d}\, r(k)^d - \frac{k}{d+1}\, r(k)^{d+1} \asymp m(k)^{d+1} k^{-d},$$

thus $m(k) \asymp k^{d/(d+1)}$ and $r(k) \asymp k^{-1/(d+1)}$. We can compute the objective

$$F(f_k) \asymp \int_0^{r(k)} u^{d-1}\, u\,\big(m(k) - ku\big)\,\mathrm{d}u \asymp \frac{m(k)}{d+1}\, r(k)^{d+1} - \frac{k}{d+2}\, r(k)^{d+2} \asymp k^{-1/(d+1)},$$

which proves case (I) (here $q = 1$) since $\inf F = 0$.

(II) Consider a smooth function $\tilde\Phi$ which equals $\operatorname{dist}(\theta, \theta_0)^2$ on the ball $B_{1/2}(\theta_0)$, as in the proof of Proposition 5.1. Again the iterates have the form $f_k = (m(k) - k\tilde\Phi)_+$, and now let $r(k)^2 = m(k)/k$. For $k$ large enough, so that $\tilde\Phi = \operatorname{dist}(\cdot, \theta_0)^2$ over $B_{r(k)}(\theta_0)$, it holds

$$1 = \int f_k\,\mathrm{d}\tau \asymp \int_0^{r(k)} u^{d-1}\big(m(k) - ku^2\big)\,\mathrm{d}u \asymp \frac{m(k)}{d}\, r(k)^d - \frac{k}{d+2}\, r(k)^{d+2} \asymp m(k)^{(d+2)/2} k^{-d/2},$$

thus $m(k) \asymp k^{d/(d+2)}$ and $r(k) \asymp k^{-1/(d+2)}$. It also holds

$$F(f_k) \asymp \int_0^{r(k)} u^{d-1}\, u^2\,\big(m(k) - ku^2\big)\,\mathrm{d}u \asymp \frac{m(k)}{d+2}\, r(k)^{d+2} - \frac{k}{d+4}\, r(k)^{d+4} \asymp k^{-2/(d+2)},$$

which proves case (II) (here $q = 2$).

(I*) We consider the function $\bar{G}(\mu) = \frac12 \big(\int \Phi\,\mathrm{d}\mu\big)^2$. Now the reasoning is slightly more subtle because of the non-linearity. It holds $f_k = (m(k) - s(k)\Phi)_+$ where $s(k+1) = s(k) + \int \Phi f_k\,\mathrm{d}\tau$. Again, $m(k)$ is such that the iterate is feasible. Our upper bounds imply that $\int \Phi f_k\,\mathrm{d}\tau \lesssim k^{-1/(d+2)}$, thus $s(k+1) - s(k) \lesssim k^{-1/(d+2)}$ and it follows that $s(k) \lesssim k^{(d+1)/(d+2)}$ and $F(f_k) \gtrsim \big(s(k)^{-1/(d+1)}\big)^2 \gtrsim k^{-2/(d+2)}$, as desired.
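To illustrate Proposition 5.3 numerically, here is a small self-contained sketch (ours, not the experiments of Section 6; grid size, horizon and all names are illustrative) that runs PGM with the $\eta_2$ geometry on the case (I) construction for $d = 1$ and checks the predicted $k^{-1/(d+1)} = k^{-1/2}$ decay.

```python
import numpy as np

# Illustrative check of Prop. 5.3, case (I), for d = 1: PGM with the L^2 geometry (eta_2)
# on Theta = T^1, H = indicator of probability densities, Phi(theta) = dist(theta, theta0).
# Each update is a Euclidean projection onto the simplex {f >= 0, int f dtau = 1}.
m, s = 2000, 1.0
grid = np.arange(m) / m
tau_w = 1.0 / m
theta0 = 0.5
Phi = np.minimum(np.abs(grid - theta0), 1.0 - np.abs(grid - theta0))  # distance on T^1

def project_simplex(v):
    # find kappa such that sum((v - kappa)_+ * tau_w) = 1, by bisection
    lo, hi = v.min() - 2.0, v.max()
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if np.sum(np.maximum(v - mid, 0.0)) * tau_w > 1.0:
            lo = mid
        else:
            hi = mid
    return np.maximum(v - 0.5 * (lo + hi), 0.0)

f = np.ones(m)  # f_0 = 1
for k in range(1, 1001):
    f = project_simplex(f - s * Phi)       # PGM step with eta_2 (here G'[f] = Phi)
    if k in (10, 100, 1000):
        obj = np.sum(Phi * f) * tau_w      # F(f_k); inf F = 0
        print(k, obj, obj * k ** 0.5)      # obj * sqrt(k) should be roughly constant
```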
