Convergence Rates of Gradient Methods for Convex Optimization in the Space of Measures

Lénaïc Chizat

May 17, 2021

Abstract

We study the convergence rate of Bregman gradient methods for convex optimization in the space of measures on a $d$-dimensional manifold. Under basic regularity assumptions, we show that the suboptimality gap at iteration $k$ is in $O(\log(k)\,k^{-1})$ for multiplicative updates, while it is in $O(k^{-q/(d+q)})$ for additive updates, for some $q \in \{1,2,4\}$ determined by the structure of the objective function. Our flexible proof strategy, based on approximation arguments, allows us to painlessly cover all Bregman Proximal Gradient Methods (PGM) and their acceleration (APGM) under various geometries such as the hyperbolic entropy and $L^p$ divergences. We also prove the tightness of our analysis with matching lower bounds and confirm the theoretical results with numerical experiments on low-dimensional problems. Note that all these optimization methods must additionally pay the computational cost of discretization, which can be exponential in $d$.

1 Introduction

Convex optimization in the space of measures is a theoretical framework that leads to fruitful points of view on a large variety of problems, ranging from sparse deconvolution [Bredies and Pikkarainen, 2013], two-layer neural networks [Bengio et al., 2006] and global optimization [Lasserre, 2001] to many more [Boyd et al., 2017]. Various algorithms have been proposed to solve such problems, including moment methods [Lasserre, 2001], conditional gradient [Bredies and Pikkarainen, 2013, Denoyelle et al., 2019], (non-convex) particle gradient flows [Chizat, 2021] and their noisy versions [Mei et al., 2018, Nitanda et al., 2020].

In this paper, we consider perhaps the simplest methods: gradient descent and its extensions that handle non-smooth regularizers and non-Euclidean geometries, namely the Bregman Proximal Gradient Method (PGM) (an extension of mirror descent [Nemirovskij and Yudin, 1983] that handles composite objectives) and its acceleration (APGM) [Tseng, 2010]. Our aim is to establish their convergence rates in this setting. Our contributions are the following:

• While the standard convergence analysis of (A)PGM fails when the Bregman divergence between the initialization and a minimizer is infinite, we propose in Section 2 an elementary strategy to obtain convergence rates anyway, via approximation arguments;

• We recall and adapt (A)PGM in Section 3, taking care of the subtleties that appear in our context (definition of the iterates and lack of strong convexity of the divergence);

• We prove in Section 4 upper bounds on the convergence rate for (A)PGM under various structural assumptions, summarized in Table 2. These rates depend on the choice of the Bregman divergence and on the structure of the objective function;

CNRS, Université Paris-Saclay, Laboratoire de mathématiques d'Orsay, 91405 Orsay, France. lenaic.chizat@universite-paris-saclay.fr


• Tight lower bounds of two kinds are proved in Section 5: proof-technique-dependent lower bounds, and algorithm-dependent lower bounds (the latter are stronger but do not cover all cases);

• Numerical experiments on synthetic toy problems in Section 6 often show an excellent agreement between the theoretical rates and those observed in practice. Even in cases with an apparent mismatch, a closer look at the structure of the problem shows that the theory still sheds light on the observed rates.

Our motivation for studying this problem is twofold. First, we believe that a precise understanding of (A)PGM in this context is useful to develop and analyze more complex methods, such as the grid-free approaches mentioned above. Second, this setting offers a rich test case to deepen our understanding of Bregman gradient methods in Banach spaces.

However, our goal is not to advocate for a widespread use of (A)PGM for optimization in the space of measures; in fact, grid-free approaches are often more relevant choices in practice.

Related work. The comparison between additive updates ($L^2$ geometry) and multiplicative updates (entropy geometry) is well-known in finite-dimensional spaces [Kivinen and Warmuth, 1997]. For instance, for convex optimization in the $n$-dimensional simplex, the two methods typically converge at the same rate but the "constant" factor is polynomial in $n$ for additive updates while it is logarithmic in $n$ for multiplicative updates, see [Bubeck, 2015, Section 4].

We obtain in this paper an infinite-dimensional ($n = \infty$) version of this separation, but where the distinction appears directly in the rates rather than in the constants.

The analysis of convex optimization in infinite-dimensional (Banach) spaces is a classical subject [Bauschke et al., 2001, 2003]. Here, we study a concrete class of problems defined on the space of measures which exhibits specific features. This problem-specific approach to infinite-dimensional problems has proved fruitful for the analysis of gradient methods for least-squares (e.g. Yao et al. [2007], Dieuleveut [2017] and references therein), for partly smooth problems [Liang et al., 2014] and for the Iterative Soft Thresholding Algorithm (ISTA) in Hilbert spaces [Bredies and Lorenz, 2008, Garrigos et al., 2020].

The latter is close to our subject since ISTA is in fact an instance of PGM with the $L^2$-divergence, and FISTA [Beck and Teboulle, 2009] is analogous to APGM with the $L^2$-divergence. These prior works perform the analysis in a Hilbert space, while we work in the space of measures or in $L^1$, which are non-reflexive Banach spaces. This is also the context of Chambolle and Tovey [2021] who, for a modified version of FISTA, obtained in particular the convergence rate of Table 2 when $p = 2$ and $q = 1$, and who also discuss discretization. Our analysis allows us to compare various algorithms and shows that FISTA is always slower than APGM with the hyperbolic entropy geometry [Ghai et al., 2020] when the solution is truly sparse, see the rates in Table 2. This is clearly observed in numerical experiments and suggests that the latter forms a stronger baseline for our class of problems.

Notation. The domain of a function $F : V \to \mathbb{R} \cup \{+\infty\}$ is $\operatorname{dom} F = \{x \in V \,;\, F(x) < +\infty\}$.

Throughout, $\Theta$ is a compact $d$-dimensional manifold, $\mathcal{M}(\Theta)$ (resp. $\mathcal{M}_+(\Theta)$) is the set of finite signed (resp. nonnegative) Borel measures on $\Theta$ and $\mathcal{P}(\Theta)$ is the set of Borel probability measures. For $\mu \in \mathcal{M}(\Theta)$, $\|\mu\|$ is its total variation norm. For a Hilbert space $\mathcal{F}$, $C^p(\Theta; \mathcal{F})$ is the set of $p$-times continuously differentiable functions from $\Theta$ to $\mathcal{F}$. $\operatorname{Lip}(f)$ is the Lipschitz constant of a function $f$. For $\tau \in \mathcal{P}(\Theta)$ and $p \geq 1$, $L^p(\tau)$ is the space of (equivalence classes of) measurable functions $f : \Theta \to \mathbb{R}$ such that $\int_\Theta |f(\theta)|^p \,\mathrm{d}\tau(\theta) < +\infty$ or, for $p = +\infty$, such that $|f|$ is $\tau$-almost everywhere bounded by some $K > 0$. The asymptotic notation $a(k) \lesssim b(k)$ means that there exists $c > 0$ independent of $k$ such that $a(k) \leq c \cdot b(k)$, and $a(k) \asymp b(k)$ means [$a \lesssim b$ and $b \lesssim a$].


Figure 1: Shape of $\psi$ defined in Eq. (2) in our situation of interest, where $D(x_k, x_0)$ explodes for any minimizing sequence $(x_k)_{k \in \mathbb{N}}$ (Prop. 2.1). When an optimization method satisfies Eq. (1) for some sequence $(\alpha_k)_{k \in \mathbb{N}}$, then $\psi(\alpha_k)$ bounds its convergence rate in objective values. (The figure plots $\psi(\alpha)$ as a function of $\alpha$, bounded above by $F(x_0) - \inf F$.)

2 A general strategy to obtain convergence rates

Let $F$ be a lower-bounded convex function defined on a real vector space. Suppose that an iterative method designed to minimize $F$, initialized at $x_0 \in \operatorname{dom} F$, generates a sequence $x_1, x_2, \dots \in \operatorname{dom} F$ that satisfies

$$F(x_k) - F(x) \leq \alpha_k \cdot D(x, x_0), \quad \forall x \in \operatorname{dom} F, \ \forall k \geq 1, \qquad (1)$$

where $(\alpha_k)_{k \in \mathbb{N}}$ is a positive sequence converging to $0$ and $D$ is a divergence, i.e. $D(x, x_0) \in [0, +\infty]$ and $D(x_0, x_0) = 0$. Most first-order methods enjoy guarantees of this form. For instance, PGM and APGM enjoy such guarantees with respectively $\alpha_k \lesssim k^{-1}$ and $\alpha_k \lesssim k^{-2}$ under suitable assumptions, see Section 3.3.

While Eq. (1) is often the endpoint of the analysis in the optimization literature, it is here our starting point: we are interested in the case where, for any quasi-minimizer $x$ of $F$, the quantity $D(x, x_0)$ is very large or even infinite. In this case, choosing a fixed $x$ independent of $k$ in Eq. (1) leads to a poor upper bound which often does not match the observed practical behavior. Instead, we propose to exploit the flexibility offered by the guarantee of Eq. (1) and to choose a different reference point at each time step. This means that we reformulate the guarantee in the equivalent form:

$$F(x_k) - \inf F \leq \psi(\alpha_k) \quad \text{where} \quad \psi(\alpha) := \inf_{x} \big\{ F(x) - \inf F + \alpha D(x, x_0) \big\}. \qquad (2)$$

As we will see, studying $\psi$ is particularly fruitful to understand optimization algorithms satisfying Eq. (1). In particular, its behavior at $0$ determines the asymptotic convergence rate.

This function can be interpreted as the value at $x_0$ of the (Bregman) Moreau envelope [Kan and Song, 2012] of $(F - \inf F)$ with regularization parameter $\alpha$, and it intervenes in many areas of applied mathematics. For instance, when $D(x, x_0)$ is a squared Hilbertian norm, $\psi$ has a variety of behaviors which characterize the performance of kernel ridge regression in machine learning (see e.g. [Bach, 2021, Chap. 7.5]). Before we move to a more concrete setting, let us gather a few relevant properties of the function $\psi$ that hold in full generality.

Proposition 2.1. Assume that $F(x_0) < +\infty$ and $D(\cdot, x_0) \geq 0$ with equality at $x_0$. Then the function $\psi$ is concave on $[0, +\infty[$ and satisfies $0 = \psi(0) \leq \psi(\alpha) \leq F(x_0) - \inf F$. Moreover,

(i) $\psi$ is right-continuous at $0$ if and only if there exists a minimizing sequence $(x_k)_{k \in \mathbb{N}}$ such that $F(x_k) \to \inf F$ and $D(x_k, x_0) < +\infty$, $\forall k \in \mathbb{N}$;

(ii) $\psi'(0) := \lim_{\alpha \to 0^+} \psi(\alpha)/\alpha$ is finite if and only if there exists a minimizing sequence $(x_k)_{k \in \mathbb{N}}$ such that $F(x_k) \to \inf F$ and $D(x_k, x_0)$ is bounded.

Proof. The function $\psi$ is concave as the pointwise infimum of affine functions. The lower bound is immediate and the upper bound is obtained by taking $x_0$ as a candidate in the infimum.


Let us prove (ii) (the proof of (i) follows a similar scheme and is simpler). By concavity, the limit defining $\psi'(0)$ always exists and belongs to $]0, +\infty]$. If a sequence $(x_k)$ exists as in the statement, then for any $\alpha > 0$, there exists $x_k$ such that $\psi(\alpha) \leq \alpha D(x_k, x_0) + \alpha^2$, so $\psi'(0)$ is finite. Conversely, if $\psi'(0)$ is finite, take a decreasing sequence $(\alpha_k)$ that converges to $0$ and let $(x_k)$ be a sequence of quasi-minimizers for Eq. (2) satisfying $F(x_k) - \inf F + \alpha_k D(x_k, x_0) \leq \psi(\alpha_k) + \alpha_k^2$. By dividing by $\alpha_k$, we see that $F(x_k) \to \inf F$ and $D(x_k, x_0)$ is bounded.

Thus, in general, the rates $(\psi(\alpha_k))_k$ obtained from Eq. (2) are slower than $(\alpha_k)_k$, in the sense that $\alpha = o(\psi(\alpha))$ is possible. Figure 1 illustrates the general shape of the function $\psi$.
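To make this strategy concrete, here is a minimal numerical sketch (ours, not from the paper) that upper-bounds $\psi(\alpha)$ by restricting the infimum in Eq. (2) to a one-parameter family of smoothed candidates, anticipating the construction used later in Theorem 4.1; the toy problem and all names are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (not from the paper): upper-bounding psi(alpha) by restricting the
# infimum in Eq. (2) to "box-smoothed" candidates f_eps, as in the proof of Thm. 4.1.
# Toy problem on Theta = [0, 1) (d = 1): F(f) = int |theta - theta0| f(theta) dtheta
# over probability densities, minimized by a Dirac at theta0 (infinite KL to f0 = 1).

theta0 = 0.5

def F(eps):
    # F(f_eps) for f_eps uniform on [theta0 - eps, theta0 + eps]: mean distance = eps/2
    return eps / 2.0

def D_entropy(eps):
    # KL divergence of f_eps = 1/(2 eps) on its support w.r.t. f0 = 1: log(1/(2 eps))
    return np.log(1.0 / (2.0 * eps))

def psi_upper(alpha, eps_grid):
    return min(F(e) + alpha * D_entropy(e) for e in eps_grid)

eps_grid = np.logspace(-6, np.log10(0.49), 400)
for alpha in [1e-1, 1e-2, 1e-3, 1e-4]:
    print(f"alpha={alpha:.0e}  psi(alpha) <= {psi_upper(alpha, eps_grid):.4f}")
# The bound decays like alpha*log(1/alpha), matching the log(k)/k rate of Table 2
# for the entropy geometry (with alpha_k ~ 1/k).
```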

3 Gradient methods for optimization in the space of measures

In the rest of this paper, we apply the general method of Section 2 to a class of optimization problems in the space of measures, where it leads to a zoo of convergence rates.

3.1 Objective function

Let $\Theta$ be a compact Riemannian manifold without boundary, with distance $\operatorname{dist}$ and with a reference probability measure $\tau \in \mathcal{P}(\Theta)$ that is proportional to the volume measure. We consider an objective function on the space of measures $\bar{F} : \mathcal{M}(\Theta) \to \mathbb{R} \cup \{+\infty\}$ of the form

$$\bar{F}(\mu) := \bar{G}(\mu) + \bar{H}(\mu) \quad \text{where} \quad \bar{G}(\mu) := R\Big(\int \Phi \,\mathrm{d}\mu\Big).$$

Typically, $\bar{G}$ is a data-fitting term and $\bar{H}$ a regularizer. We make the following assumptions, where $\iota_C$ is the convex indicator of a convex set $C$ and $\lambda \geq 0$ a regularization parameter:

(A1) $\Phi \in C^0(\Theta; \mathcal{F})$ where $\mathcal{F}$ is a Hilbert space, $R : \mathcal{F} \to \mathbb{R}$ is convex and differentiable with a Lipschitz gradient $\nabla R$, and $\bar{H}$ is a sum of functions from the following list: $\iota_{\mathcal{P}(\Theta)}$, $\iota_{\mathcal{M}_+(\Theta)}$, $\lambda\|\mu\|$ and $\iota_{\{\mu\,;\,\|\mu\| \leq 1\}}$.

One specific property of $\bar{H}$ that we use in our proof is that it should not increase under convolution with a probability kernel, but we prefer to work with these specific instances rather than giving abstract conditions. We finally denote by $F : L^1(\tau) \to \mathbb{R} \cup \{+\infty\}$ the function defined, for $f \in L^1(\tau)$, by

$$F(f) := \bar{F}(f\tau),$$

and similarly $H(f) := \bar{H}(f\tau)$ and $G(f) := \bar{G}(f\tau)$, so that $F = G + H$. These notations convey the idea that $\bar{F}, \bar{G}, \bar{H}$ are the lower-semicontinuous (l.s.c.) extensions of $F, G, H$ for the weak* topology induced by $C^0(\Theta)$ on $\mathcal{M}(\Theta)$.

Here are examples of problems that fall under this setting:

• (Sparse deconvolution) The goal is, given a signal $y \in L^2(\tau)$, to find a sparse measure $\mu$ such that the convolution of $\mu$ with a filter $\phi \in C(\Theta)$ approximately recovers $y$. Here the domain is typically the $d$-dimensional torus $\Theta = \mathbb{T}^d$ endowed with the Lebesgue measure $\tau$ and the objective is [De Castro and Gamboa, 2012, Candès and Fernandez-Granda, 2014]

$$\bar{F}(\mu) := \int_\Theta \Big( \int_\Theta \phi(\theta_1 - \theta_2)\,\mathrm{d}\mu(\theta_2) - y(\theta_1) \Big)^2 \mathrm{d}\tau(\theta_1) + \lambda\|\mu\|.$$

Adding the nonnegativity constraint $\iota_{\mathcal{M}_+(\Theta)}$ is also relevant in certain applications. (A discretized numerical sketch of this objective is given after these examples.)


• (Two-layer ReLU neural networks) The goal is, given $n$ observations $(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$, to find a regressor written as a linear combination of simple "ridge" functions. Consider a loss $\ell : \mathbb{R}^2 \to \mathbb{R}$ convex and smooth in its second argument, let $\phi(s) = (s)_+$ and let

$$\bar{F}(\mu) := \frac{1}{n} \sum_{i=1}^n \ell\Big( y_i, \int_\Theta \phi\big([x_i; 1]^\top \theta\big)\,\mathrm{d}\mu(\theta) \Big) + \bar{H}(\mu).$$

Key differences with the previous setting are that $\Theta = \mathbb{S}^d$ is the sphere, with $d$ potentially large, and that the object that is truly sought after is the regressor $x \mapsto \int \phi([x; 1]^\top \theta)\,\mathrm{d}\mu(\theta)$ rather than the measure $\mu$. Typical choices for $\ell$ are the logistic loss $\ell(y, z) = \log(1 + \exp(-yz))$ when $y_i \in \{-1, 1\}$ or the square loss $\ell(y, z) = \frac{1}{2}|y - z|^2$. The signed setting with regularization $\bar{H}(\mu) = \lambda\|\mu\|$ is the most common one [Bengio et al., 2006, Bach, 2017], but the regularization $\iota_{\mathcal{P}(\Theta)}$ also appears in the context of max-margin problems [Chizat and Bach, 2020].

The following smoothness lemma will be useful to analyze optimization algorithms and is analogous to the usual "Lipschitz gradient" property in convex optimization. Since the dual of $(\mathcal{M}(\Theta), \|\cdot\|)$ is a bit exotic, we avoid using the notion of gradient at all.

Lemma 3.1 (Smoothness). Under Assumption (A1), if $\Phi \in C^p(\Theta, \mathcal{F})$ for $p \in \mathbb{N}$, then the differential of $\bar{G}$ at $\mu \in \mathcal{M}(\Theta)$ can be represented by the function $\bar{G}'[\mu] \in C^p(\Theta, \mathbb{R})$ defined by

$$\bar{G}'[\mu](\theta) = \Big\langle \nabla R\Big(\int \Phi\,\mathrm{d}\mu\Big),\, \Phi(\theta) \Big\rangle_{\mathcal{F}}$$

in the sense that it holds $\bar{G}(\mu + \nu) - \bar{G}(\mu) = \int \bar{G}'[\mu](\theta)\,\mathrm{d}\nu(\theta) + o(\|\nu\|)$. Moreover, $\mu \mapsto \bar{G}'[\mu]$ is Lipschitz continuous as a function from $\mathcal{M}(\Theta)$ to $C^p(\Theta, \mathbb{R})$. The following smoothness inequality holds, with $\operatorname{Lip}(\bar{G}') \leq \|\Phi\|_\infty^2 \cdot \operatorname{Lip}(\nabla R)$ and for all $\mu, \nu \in \mathcal{M}(\Theta)$,

$$0 \leq \bar{G}(\nu) - \bar{G}(\mu) - \int \bar{G}'[\mu]\,\mathrm{d}[\nu - \mu] \leq \frac{1}{2}\operatorname{Lip}(\bar{G}')\,\|\nu - \mu\|^2.$$

Those results hold true when replacing $(\mathcal{M}(\Theta), \bar{G}(\mu), \bar{G}'[\mu])$ by $(L^1(\tau), G(f), G'[f] := \bar{G}'[f\tau])$.

Proof. For the first part, the differentiability of $R$ implies that

$$\bar{G}(\mu + \nu) - \bar{G}(\mu) = \Big\langle \nabla R\Big(\int \Phi\,\mathrm{d}\mu\Big),\, \int \Phi\,\mathrm{d}\nu \Big\rangle_{\mathcal{F}} + o\Big( \Big\| \int \Phi\,\mathrm{d}\nu \Big\|_{\mathcal{F}} \Big) = \int \bar{G}'[\mu]\,\mathrm{d}\nu + o(\|\Phi\|_\infty \|\nu\|).$$

For the regularity of $\mu \mapsto \bar{G}'[\mu]$, we have for $\mu, \nu \in \mathcal{M}(\Theta)$,

$$\|\bar{G}'[\mu] - \bar{G}'[\nu]\|_{C^p} \leq \|\Phi\|_{C^p(\Theta, \mathcal{F})} \cdot \operatorname{Lip}(\nabla R) \cdot \|\Phi\|_\infty \cdot \|\mu - \nu\|.$$

The smoothness inequality can be shown by bounding a 1-dimensional integral as in the Euclidean case [Nesterov, 2003, Thm. 2.1.5]. Finally, $L^1(\tau) \ni f \mapsto f\tau \in \mathcal{M}(\Theta)$ is an isometry, so those results hold mutatis mutandis in $L^1(\tau)$.

3.2 Bregman divergences

Let us consider $\eta : \mathbb{R} \to [0, \infty]$, a differentiable divergence-generating function. For $f \in L^1(\tau)$ we write $\eta(f) := \eta \circ f$ and we define

$$\bar{\eta}(f) := \int_\Theta \eta(f(\theta))\,\mathrm{d}\tau(\theta) = \int \eta(f)\,\mathrm{d}\tau.$$


Figure 2: Divergence-generating functions (plain) and their second-order derivatives (dashed).

Let $D_\eta$ (resp. $D_{\bar\eta}$) be the Bregman divergence associated to $\eta$ (resp. $\bar\eta$), given for $f, g \in L^1(\tau)$ by

$$D_\eta(a, b) := \eta(a) - \eta(b) - \eta'(b)(a - b) \quad \text{and} \quad D_{\bar\eta}(f, g) := \int_\Theta D_\eta(f(\theta), g(\theta))\,\mathrm{d}\tau(\theta).$$

We consider the following assumptions on the divergence-generating function $\eta$:

(A2) $\eta : \mathbb{R} \to [0, \infty]$ is strictly convex, l.s.c., continuously differentiable in $\operatorname{int}(\operatorname{dom}\eta)$, such that $\eta'(\operatorname{int}(\operatorname{dom}\eta)) = \mathbb{R}$ and $\eta$ has polynomial growth at $+\infty$. Moreover, either:

(A2)+ $\operatorname{dom}\eta = [0, +\infty[$ and $\eta(1) = \eta'(1) = 0$, or

(A2)± $\operatorname{dom}\eta = \mathbb{R}$, $\eta$ is even and $\eta(0) = \eta'(0) = 0$.

Specifying the values of $\eta$ and $\eta'$ at a point in $\operatorname{int}(\operatorname{dom}\eta)$ is just for convenience and is not restrictive, since $D_\eta$ is not affected by affine perturbations of $\eta$. Under assumption (A2)+, we have that $\eta'(0) := \lim_{t \to 0^+} \eta'(t) = -\infty$, which automatically enforces a nonnegativity constraint in the methods of the next section.

Here are examples of divergence-generating functions that fall under these assumptions:

• (Power functions $\eta_p$) Defined on $\mathbb{R}$ for $p > 1$ by $\eta_p(s) := \frac{|s|^p}{p(p-1)}$; they satisfy (A2)±;

• (Shannon entropy $\eta_{\mathrm{ent}}$) Defined on $\mathbb{R}_+$ by $\eta_{\mathrm{ent}}(s) := s\log(s) - s + 1$; it satisfies (A2)+;

• (Hyperbolic entropy $\eta_{\mathrm{hyp}}$) Defined on $\mathbb{R}$ by $\eta_{\mathrm{hyp}}(s) := s\,\mathrm{arcsinh}(s/\beta) - \sqrt{s^2 + \beta^2} + \beta$ with $\beta > 0$; it satisfies (A2)± (introduced by [Ghai et al., 2020]).

When $\eta$ is smooth, it holds $D_\eta(a, b) = \eta''(b) \cdot |a - b|^2/2 + o(|a - b|^2)$, so locally $D_\eta$ is equivalent to a squared Riemannian metric on the real axis given by $\eta''/2$. For the examples listed above, it holds $\eta_p''(s) = |s|^{p-2}$, $\eta_{\mathrm{ent}}''(s) = s^{-1}$ and $\eta_{\mathrm{hyp}}''(s) = (s^2 + \beta^2)^{-1/2}$, see Figure 2 for an illustration. The hyperbolic entropy $\eta_{\mathrm{hyp}}$ can be interpreted as a "signed" version of $\eta_{\mathrm{ent}}$ (see Proposition 3.6 for a precise version of this remark).
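For concreteness, here is a small sketch (ours; grid and parameter values are illustrative assumptions) implementing these divergence-generating functions and the associated divergence $D_{\bar\eta}$ between densities on a uniform grid.

```python
import numpy as np

# Illustrative sketch: divergence-generating functions and the Bregman divergence
# D_etabar(f, g) = int D_eta(f(theta), g(theta)) dtau(theta) on a uniform grid.

def eta_p(s, p=2.0):
    return np.abs(s) ** p / (p * (p - 1))

def eta_ent(s):
    # Shannon entropy, with the convention 0*log(0) = 0
    return np.where(s > 0, s * np.log(np.maximum(s, 1e-300)) - s + 1.0, 1.0 * (s == 0))

def eta_hyp(s, beta=1.0):
    return s * np.arcsinh(s / beta) - np.sqrt(s**2 + beta**2) + beta

def bregman(eta, f, g, tau_w):
    """D_etabar(f, g), using a finite difference of eta for the derivative term."""
    h = 1e-6
    eta_prime_g = (eta(g + h) - eta(g - h)) / (2 * h)     # numerical eta'(g)
    pointwise = eta(f) - eta(g) - eta_prime_g * (f - g)    # D_eta(f(theta), g(theta))
    return np.sum(pointwise) * tau_w

m = 200
tau_w = 1.0 / m
f = 1.0 + 0.5 * np.sin(2 * np.pi * np.arange(m) / m)   # a positive density
g = np.ones(m)                                          # initialization f0 = 1
for name, eta in [("eta_2", eta_p), ("eta_ent", eta_ent), ("eta_hyp", eta_hyp)]:
    print(name, bregman(eta, f, g, tau_w))
```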

The next lemma states the strong convexity of these divergences with respect to the $L^1(\tau)$ norm, which is needed in the next section. It is a generalization of Pinsker's inequality, recovered when $K = 1$ and with $\eta_{\mathrm{ent}}$. Notice that when $p < 2$, the bound worsens as the norm increases.

Lemma 3.2 (Strong convexity of Bregman divergences). Assume that $f, g$ have $L^1(\tau)$-norm bounded by $K$. Then for $p \in \,]1, 2]$,

$$D_{\bar\eta_p}(f, g) \geq \frac{K^{p-2}}{2}\,\|f - g\|_{L^1(\tau)}^2.$$

The inequality also holds for $\eta_{\mathrm{ent}}$ and $\eta_{\mathrm{hyp}}$ with $p = 1$.


Proof. For $f \in L^1(\tau)$ and $p \geq 1$, since $2 - p \leq 1$, it holds by the (reverse) Jensen's inequality

$$\int |f|^{2-p}\,\mathrm{d}\tau \leq \Big( \int |f|\,\mathrm{d}\tau \Big)^{2-p} = \|f\|_{L^1(\tau)}^{2-p}.$$

Let $g \in L^1(\tau)$ be such that $\|g\|_{L^1(\tau)} = 1$. It holds by the Cauchy-Schwarz inequality

$$1 = \Big( \int |g|\,\mathrm{d}\tau \Big)^2 = \Big( \int |g|\,|f|^{\frac{p-2}{2}}\,|f|^{\frac{2-p}{2}}\,\mathrm{d}\tau \Big)^2 \leq \Big( \int |g|^2 |f|^{p-2}\,\mathrm{d}\tau \Big)\Big( \int |f|^{2-p}\,\mathrm{d}\tau \Big).$$

Combining both inequalities and by homogeneity, it follows that if $\|f\|_{L^1(\tau)} \leq K$, then for all $g \in L^1(\tau)$ it holds

$$\int |g|^2 |f|^{p-2}\,\mathrm{d}\tau \geq K^{p-2}\,\|g\|_{L^1(\tau)}^2.$$

Notice that $\eta_p''(a) = |a|^{p-2}$ and $\eta_{\mathrm{ent}}''(a) = |a|^{-1}$, thus the above expression is in fact a comparison between the Hessian of $\bar\eta$ and the squared norm. It only remains to integrate this comparison twice to prove the lemma, see e.g. [Yu, 2013, Thm. 3]. For the hyperbolic entropy $\eta_{\mathrm{hyp}}$, it holds $\eta_{\mathrm{hyp}}''(a) = (a^2 + \beta^2)^{-1/2} \geq (|a| + \beta)^{-1}$, and the same argument gives the bound with $p = 1$.

3.3 Gradient methods and their classical guarantees

We now detail two classical algorithms that enjoy guarantees of the form Eq. (1) for a large class of composite optimization problems. Algorithm 1 (PGM) is closely related to mirror descent [Nemirovskij and Yudin, 1983] and is discussed in Bauschke et al. [2003], Auslender and Teboulle [2006]. Algorithm 2 (APGM) is taken from Tseng [2010], who presents it as a generalization of Auslender and Teboulle [2006], itself an extension of Nesterov's second method [Nesterov, 1988]. For the sake of concreteness, we instantiate these algorithms in the context of optimization in the space of measures, where small adaptations have to be made.

Algorithm 1: (Bregman) Proximal Gradient Method (PGM)
Initialization: $f_0 \in \operatorname{dom} H$, step-size $s > 0$
for $k = 0, 1, \dots$ do
    $f_{k+1} = \arg\min_f \; G(f_k) + \int G'[f_k](f - f_k)\,\mathrm{d}\tau + H(f) + \frac{1}{s} D_{\bar\eta}(f, f_k)$
end
Output: $f_{k+1}$

Algorithm 2: Accelerated (Bregman) Proximal Gradient Method (APGM)
Initialization: $f_0 = h_0 \in \operatorname{dom} H$, $\gamma_0 = 1$, step-size $s > 0$
for $k = 0, 1, \dots$ do
    $g_k = (1 - \gamma_k) f_k + \gamma_k h_k$
    $h_{k+1} = \arg\min_f \; G(g_k) + \langle \nabla G(g_k), f - g_k \rangle + H(f) + \frac{\gamma_k}{s} D_{\bar\eta}(f, h_k)$
    $f_{k+1} = (1 - \gamma_k) f_k + \gamma_k h_{k+1}$
    $\gamma_{k+1} = \frac{1}{2}\big( \sqrt{\gamma_k^4 + 4\gamma_k^2} - \gamma_k^2 \big)$
end
Output: $f_{k+1}$

In the next proposition, we verify that the updates are well-defined under suitable assumptions. Table 1 lists some update formulas which are directly implementable, after discretization.


| $\bar{H}$ | Assumption (A2)± ($\operatorname{dom}\eta = \mathbb{R}$) | Assumption (A2)+ ($\operatorname{dom}\eta = \mathbb{R}_+$) |
| (i) $\iota_{\mathcal{M}_+(\Theta)} + \lambda\|\cdot\|$ | $[\eta']^{-1}\big(\big(\eta'(h_k) - sG'[g_k] - s\lambda\big)_+\big)$ | $[\eta']^{-1}\big(\eta'(h_k) - sG'[g_k] - s\lambda\big)$ |
| (ii) $\iota_{\mathcal{P}(\Theta)}$ | $[\eta']^{-1}\big(\big(\eta'(h_k) - sG'[g_k] - \kappa\big)_+\big)$ | $[\eta']^{-1}\big(\eta'(h_k) - sG'[g_k] - \kappa\big)$ |
| (iii) $\lambda\|\cdot\|$ | $[\eta']^{-1}\big(\mathrm{sfth}_{\lambda s}\big(\eta'(h_k) - sG'[g_k]\big)\big)$ | $[\eta']^{-1}\big(\eta'(h_k) - sG'[g_k] - s\lambda\big)$ |
| (iv) $\iota_{\{\mu\,;\,\|\mu\| \leq K\}}$ | $[\eta']^{-1}\big(\mathrm{sfth}_{\kappa}\big(\eta'(h_k) - sG'[g_k]\big)\big)$ | $[\eta']^{-1}\big(\eta'(h_k) - sG'[g_k] - \kappa\big)$ |

Table 1: Update step $h_{k+1}$ for APGM, and for PGM (where $f_k = g_k = h_k$), where $\mathrm{sfth}_\kappa(a) = \operatorname{sign}(a)\,(|a| - \kappa)_+$ is a soft-thresholding. In (ii), $\kappa \in \mathbb{R}$ is the unique number such that the update satisfies the constraint. In (iv), $\kappa \geq 0$ is the smallest number such that the update satisfies the constraint (see Condat [2016] for efficient algorithms to compute $\kappa$ in practice).
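To make Table 1 concrete, here is a minimal sketch (ours, with illustrative names) of the row (iii) update under the Shannon entropy geometry, where $[\eta_{\mathrm{ent}}']^{-1} = \exp$ turns the Bregman step into a multiplicative update; the gradient routine assumes the discretized deconvolution objective sketched in Section 3.1.

```python
import numpy as np

# Illustrative sketch (not the paper's code): one PGM update from Table 1, row (iii),
# with the Shannon entropy eta_ent (eta' = log, [eta']^{-1} = exp) on a grid.
# Under (A2)+ the update is multiplicative: f_{k+1} = f_k * exp(-s*G'[f_k] - s*lam).

def pgm_entropy_step(f_k, grad, s, lam):
    """h_{k+1} = [eta_ent']^{-1}( eta_ent'(f_k) - s*grad - s*lam )."""
    return f_k * np.exp(-s * grad - s * lam)

def grad_G(f, Phi, y, tau_w):
    # G'[f](theta_j) = 2 * int phi(theta_j - theta') (conv(f) - y)(theta') dtau(theta'),
    # for the discretized deconvolution data-fitting term (Phi, y, tau_w as assumed earlier)
    residual = Phi @ (f * tau_w) - y
    return 2.0 * (Phi.T @ residual) * tau_w

# Usage (assuming Phi, y, tau_w built as in the earlier sketch):
# f = pgm_entropy_step(f, grad_G(f, Phi, y, tau_w), s=1.0, lam=0.1)

# Tiny standalone check: with zero gradient and lam = 0, the step leaves f unchanged.
print(pgm_entropy_step(np.array([0.2, 0.5, 0.3]), np.zeros(3), s=1.0, lam=0.0))
```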

Proposition 3.3 (Well-defined updates). Assume (A1) and (A2). If $\eta'(h_k) \in L^\infty(\tau)$ and $g_k \in L^1(\tau)$, then there exists a unique solution $h_{k+1} \in L^1(\tau)$ to the optimization problem

$$\min_{f \in L^1(\tau)} \; \int G'[g_k]\, f\,\mathrm{d}\tau + H(f) + \frac{1}{s}\Big( \int \eta(f)\,\mathrm{d}\tau - \int \eta'(h_k)\, f\,\mathrm{d}\tau \Big)$$

which moreover satisfies $\eta'(h_{k+1}) \in L^\infty(\tau)$. It is characterized by the fact that there exists $\phi \in \partial H(h_{k+1}) \subset L^\infty(\tau)$ such that

$$G'[g_k] + \frac{1}{s}\big(\eta'(h_{k+1}) - \eta'(h_k)\big) + \phi = 0. \qquad (3)$$

Proof. Let $J_k$ be the function to minimize. Thanks to our assumption that $\eta'(\operatorname{int}(\operatorname{dom}\eta)) = \mathbb{R}$ and since $H$ is lower-bounded, by [Rockafellar, 1971, Cor. 2B] the sublevel sets of $J_k$ are compact with respect to the weak topology (induced on $L^1(\tau)$ by $L^\infty(\tau)$). Moreover, $J_k$ is convex and l.s.c. for the same topology; in particular, this holds because $G'[g_k], \eta'(h_k) \in L^\infty(\tau)$, and for the term $\int \eta(f)\,\mathrm{d}\tau$ it follows from [Rockafellar, 1971, Cor. 2A]. Thus, by the direct method of the calculus of variations, there exists a minimizer $h_{k+1} \in L^1(\tau)$. Since $\eta$ is strictly convex, so is $J_k$ and this minimizer is unique. The condition of Eq. (3) is always a sufficient optimality condition since, by the subdifferential inclusion rule, it implies that $0 \in \partial J_k(h_{k+1})$. It thus remains to show that it is also necessary, in which case the property $\eta'(h_{k+1}) \in L^\infty(\tau)$ immediately follows.

This is done on a case-by-case basis for the functions $\bar{H}$ admissible under Assumption (A1). Consider for instance the nonnegativity constraint $\bar{H} = \iota_{\mathcal{M}_+(\Theta)}$ and $\eta$ that satisfies Assumption (A2)±. Then with the update $h_{k+1}$ given in Table 1 (take $\lambda = 0$), it holds

$$\phi := \frac{1}{s}\eta'(h_k) - G'[g_k] - \frac{1}{s}\eta'(h_{k+1}) = \min\Big\{0,\, \frac{1}{s}\eta'(h_k) - G'[g_k]\Big\}.$$

Clearly $\phi \in L^\infty(\tau)$, $\phi \leq 0$ and $\int \phi\, h_{k+1}\,\mathrm{d}\tau = 0$, and thus $\phi \in \partial H(h_{k+1})$, which shows that $h_{k+1}$ is a minimizer and satisfies Eq. (3). The other cases for $\bar{H}$ and $\eta$ that are admissible under (A1) and (A2) (such as those listed in Table 1) can be treated similarly and follow computations which are standard in the finite-dimensional setting.

Let us now recall the guarantees for these methods. We stress that, as discussed in Section 2, these guarantees do not necessarily lead to convergence rates.

Proposition 3.4. Assume (A1) and (A2) and that $\eta$ satisfies the conclusion of Lemma 3.2 for some $p \in [1, 2]$. Consider an initialization $f_0 \in \operatorname{dom} H$ and, for some $K \geq \|f_0\|_{L^1(\tau)}$, a step-size $s \leq K^{p-2}\big(\|\Phi\|_\infty^2 \operatorname{Lip}(\nabla R)\big)^{-1}$.


(i) Let $(f_k)_{k \in \mathbb{N}}$ be generated by Algorithm 1 (PGM). If $\sup_k \|f_k\|_{L^1(\tau)} \leq K$, then Eq. (1) holds with $\alpha_k = 1/(sk)$, i.e.

$$F(f_k) - F(f) \leq \frac{1}{sk}\, D_{\bar\eta}(f, f_0), \quad \forall f \in L^1(\tau),\ \forall k \geq 1.$$

(ii) Let $(f_k, g_k, h_k)_{k \in \mathbb{N}}$ be generated by Algorithm 2 (APGM). If $\sup_k \|h_k\|_{L^1(\tau)} \leq K$, then Eq. (1) holds for $f_k$ with $\alpha_k = 4/((k+1)^2 s)$, i.e.

$$F(f_k) - F(f) \leq \frac{4}{s(k+1)^2}\, D_{\bar\eta}(f, f_0), \quad \forall f \in L^1(\tau),\ \forall k \geq 1.$$

Moreover, it holds $0 < \gamma_k \leq 1$ and $\gamma_k \leq 2/(k+2)$ for all $k \geq 0$.

Proof. By Proposition 3.3, the updates are well-defined. The proof of [Tseng, 2010, Thm. 1] goes through, in particular thanks to Lemma 3.1 (smoothness) and since $D_{\bar\eta}/K^{p-2}$ is 1-strongly convex with respect to $\|\cdot\|_{L^1(\tau)}$ whenever this property is needed in the proof. A particularly simple exposition of the proof for APGM can be found in [d'Aspremont et al., 2021, Thm. 4.24].

Remark 3.5. A difficulty in Proposition 3.4 is that when $p < 2$, one needs to assume a priori bounds on the $L^1$-norm of certain iterates to obtain convergence guarantees, because the metric induced by the divergence $D_{\bar\eta}$ becomes weaker as the $L^1$-norm increases. Since Algorithm 1 (PGM) is a descent method, $\|f_k\|_{L^1}$ is bounded, uniformly in $k$, as soon as the objective is coercive for the $L^1$-norm. But for Algorithm 2 (APGM), even if variants exist where $(F(f_k))_k$ is monotone [d'Aspremont et al., 2021], this property does not seem to be sufficient to control $\|h_k\|_{L^1}$, even for coercive objectives. Of course, uniform bounds are always trivially satisfied when $\bar{H}$ includes the constraint $\iota_{\{\|\mu\| \leq K\}}$ or $\iota_{\mathcal{P}(\Theta)}$.

3.4 Reparameterized gradient descent as a Bregman descent

In this paragraph, we recall a link between Bregman gradient descent, a.k.a. mirror descent (an instance of Algorithm 1), and $L^2$ gradient descent dynamics on certain reparameterized objectives. The purpose is to show that the convergence rates proved in Section 4 with $\eta_{\mathrm{ent}}$ and $\eta_{\mathrm{hyp}}$ are also relevant to understand $L^2$ gradient descent in certain contexts. While these remarks are well-known [Amid and Warmuth, 2020, Vaskevicius et al., 2019, Azulay et al., 2021], we find it instructive to state them clearly in our context. In order to reduce the discussion to its simplest setting, we consider the continuous-time dynamics in the unregularized setting, and we assume that they are well-defined.

The optimality conditions of the update of Algorithm 1 (see Proposition 3.3) can be written as $\eta'(f_{k+1}) - \eta'(f_k) = -sF'[f_k]$. As the step-size $s$ vanishes, this leads to a continuous trajectory $(f_t)_{t \geq 0}$, which we refer to as the $\eta$-mirror flow, that solves

$$\frac{\mathrm{d}}{\mathrm{d}t}\eta'(f_t) = -F'[f_t] \quad \Leftrightarrow \quad \frac{\mathrm{d}}{\mathrm{d}t}f_t = -[\eta''(f_t)]^{-1} F'[f_t]. \qquad (4)$$

Proposition 3.6 (Reparameterized mirror flows as gradient flows). (i) (Square parameterization) Let $(f_t)_{t \geq 0}$ be the $L^2(\tau)$ gradient flow of $\hat{F} : f \mapsto F(f^2)$, initialized such that $\log(f_0) \in L^\infty(\tau)$. Then $h_t := f_t^2$ is the $\eta_{\mathrm{ent}}$-mirror flow of $4F$.

(ii) (Difference of squares parameterization) Let $(f_t, g_t)_{t \geq 0}$ be the $(L^2(\tau))^2$ gradient flow of $\hat{F} : (f, g) \mapsto F(f^2 - g^2)$, initialized such that $\log(f_0 g_0) \in L^\infty(\tau)$. Then $h_t := f_t^2 - g_t^2$ is the $\eta_{\mathrm{hyp}}$-mirror flow of $4F$ with parameter $\beta = 2 f_0 g_0$ (here $\beta$ is a function instead of a scalar).

We can make the following remarks:


• Combining (i) and (ii), we find that if $(h_t^+, h_t^-)_{t \geq 0}$ follows an $\eta_{\mathrm{ent}}$-mirror flow for $(h^+, h^-) \mapsto F(h^+ - h^-)$, then $h_t^+ - h_t^-$ is an $\eta_{\mathrm{hyp}}$-mirror flow for $F$. This confirms the interpretation of $\eta_{\mathrm{hyp}}$ as a "signed" version of the entropy (see also [Ghai et al., 2020, Thm. 23]).

• These exact equivalences are lost in discrete time, with an error term that scales as the squared step-size. It is thus difficult to convert the most efficient guarantees for (Bregman) PGM into guarantees with the same convergence rate for gradient descent.

Proof. (i) The $L^2(\tau)$ gradient flow of $\hat{F}$ satisfies $\frac{\mathrm{d}}{\mathrm{d}t}f_t = -\hat{F}'[f_t] = -2 f_t F'[f_t^2]$. Thus the function $h_t := f_t^2$ evolves according to

$$\frac{\mathrm{d}}{\mathrm{d}t}h_t = 2 f_t \frac{\mathrm{d}}{\mathrm{d}t}f_t = -4 f_t^2 F'[f_t^2] = -4 h_t F'[h_t],$$

which is precisely the $\eta_{\mathrm{ent}}$-mirror flow of $4F$ since $\eta_{\mathrm{ent}}''(s) = s^{-1}$ for $s > 0$.

(ii) The $(L^2(\tau))^2$ gradient flow of $\hat{F}$ satisfies $\frac{\mathrm{d}}{\mathrm{d}t}f_t = -2 f_t F'[f_t^2 - g_t^2]$ and $\frac{\mathrm{d}}{\mathrm{d}t}g_t = 2 g_t F'[f_t^2 - g_t^2]$. As a consequence, we have for $h_t := f_t^2 - g_t^2$ and $\tilde{h}_t := f_t^2 + g_t^2$ that $\frac{\mathrm{d}}{\mathrm{d}t}h_t = -4 \tilde{h}_t F'[h_t]$.

To conclude, it remains to show that $\tilde{h}_t = [\eta_{\mathrm{hyp}}''(h_t)]^{-1} = (h_t^2 + \beta^2)^{1/2}$ for some $\beta > 0$. To prove this, observe that

$$\frac{\mathrm{d}}{\mathrm{d}t}(\tilde{h}_t^2 - h_t^2) = \frac{\mathrm{d}}{\mathrm{d}t}(4 f_t^2 g_t^2) = 8 f_t' f_t g_t^2 + 8 g_t' g_t f_t^2 = 0.$$

Hence $\tilde{h}_t^2 - h_t^2 = \tilde{h}_0^2 - h_0^2 = 4 f_0^2 g_0^2$, which proves that $\tilde{h}_t = (h_t^2 + \beta^2)^{1/2}$ with $\beta = 2 f_0 g_0$.
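As a quick numerical sanity check (ours, not from the paper), the following scalar sketch integrates both dynamics of Proposition 3.6(i) with an explicit Euler scheme; the test objective $F(h) = \frac12(h - 0.3)^2$ and the step size are arbitrary assumptions.

```python
import numpy as np

# Sanity check of Prop. 3.6(i) in a scalar toy case: gradient flow of F_hat(f) = F(f^2)
# versus the eta_ent-mirror flow of 4F, i.e. h' = -4 h F'(h). Explicit Euler scheme.

def Fprime(h):
    # F(h) = 0.5 * (h - 0.3)**2, so F'(h) = h - 0.3 (arbitrary smooth test objective)
    return h - 0.3

dt, T = 1e-4, 5.0
f = 1.0                  # initialization of the reparameterized flow
h_mirror = f ** 2        # same initialization for the mirror flow
for _ in range(int(T / dt)):
    f += dt * (-2.0 * f * Fprime(f ** 2))                   # d/dt f = -2 f F'(f^2)
    h_mirror += dt * (-4.0 * h_mirror * Fprime(h_mirror))   # d/dt h = -4 h F'(h)

print(f ** 2, h_mirror)  # both should be close to each other (and to the minimizer 0.3)
```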

4 Upper bounds on the convergence rates

This section contains the main result of this paper, Theorem 4.1, which is summarized in Table 2. As discussed in Section 2 and thanks to Proposition 3.4, in order to derive convergence rates for Algorithms 1 and 2 it is sufficient to control the function

$$\psi(\alpha) := \inf_{f \in L^1(\tau)} \big\{ F(f) - \inf F + \alpha D_{\bar\eta}(f, f_0) \big\}. \qquad (5)$$

For the class of problems we consider, the behavior of $\psi$ highly depends on the context. The simplest situation is when $F$ admits a minimizer $f^\star \in L^q(\tau)$ with $q > 1$ (since $\tau$ is finite, it holds $L^q(\tau) \subset L^1(\tau)$). Then for the divergence-generating functions $\eta_p$ with $1 < p \leq q$, or for $\eta_{\mathrm{hyp}}$, it is easy to see that $D_{\bar\eta}(f^\star, f_0) < +\infty$ (this further requires $f^\star \geq 0$ for $\eta_{\mathrm{ent}}$). Thus, by Proposition 2.1, it holds $\psi(\alpha) \lesssim \alpha$ and the convergence rates $(\alpha_k)$ given in Proposition 3.4 are preserved.

In the more subtle case where the minimizer of $\bar{F}$ is only assumed to be in $\mathcal{M}(\Theta)$, the variety of behaviors is captured by the following result.

Theorem 4.1. Under Assumptions (A1) and (A2), let $f_0$ be such that $\eta'(f_0) \in L^\infty(\tau)$. Assume that there exists $\mu^\star \in \mathcal{M}(\Theta)$ such that $\bar{F}(\mu^\star) = \inf F$. Under setting (A2)+ (i.e. $\operatorname{dom}\eta = \mathbb{R}_+$), assume moreover $\mu^\star \in \mathcal{M}_+(\Theta)$. Let $q \in \{1, 2, 4\}$ be determined by Table 2-(b). Then it holds

$$\psi(\alpha) \lesssim \inf_{\epsilon > 0} \big\{ \epsilon^q + \alpha\, \epsilon^d\, \eta(\epsilon^{-d}) \big\}. \qquad (6)$$

Given this expression, it is straightforward to compute the rates given in Table 2. The exponent $q \in \{1, 2, 4\}$ can be interpreted as follows: it is such that $F(\mu^\star_\epsilon) - F(\mu^\star) \lesssim \epsilon^q$, where $\mu^\star_\epsilon$ is the convolution of $\mu^\star$ with a box kernel of radius $\epsilon > 0$. An asymptotic analysis leads to lower bounds for this exponent under several assumptions (Table 2-(b)), but in practice, non-asymptotic effects may play an important role (see experiments in Section 6.2).
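For the reader's convenience, here is the short computation behind Table 2-(a) (it is implicit in the theorem but not spelled out here). For $\eta_p$ with $p > 1$, $\epsilon^d\,\eta_p(\epsilon^{-d}) \asymp \epsilon^{-(p-1)d}$, so Eq. (6) gives

$$\psi(\alpha) \lesssim \inf_{\epsilon > 0} \big\{ \epsilon^q + \alpha\, \epsilon^{-(p-1)d} \big\} \asymp \alpha^{\frac{q}{(p-1)d + q}},$$

the infimum being attained at $\epsilon \asymp \alpha^{1/((p-1)d + q)}$; combined with $\alpha_k \lesssim k^{-1}$ (PGM) or $\alpha_k \lesssim k^{-2}$ (APGM) from Proposition 3.4, this yields the $\eta_p$ entries of Table 2-(a). For $\eta_{\mathrm{ent}}$ (and similarly $\eta_{\mathrm{hyp}}$), $\epsilon^d\,\eta_{\mathrm{ent}}(\epsilon^{-d}) \asymp \log(1/\epsilon)$ up to constants, so

$$\psi(\alpha) \lesssim \inf_{\epsilon > 0} \big\{ \epsilon^q + \alpha\,\log(1/\epsilon) \big\} \asymp \alpha \log(1/\alpha) \quad \text{as } \alpha \to 0,$$

which gives the $\log(k)\,k^{-1}$ and $\log(k)\,k^{-2}$ entries.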


(a) Convergence rates

| | PGM | APGM |
| $\eta_{\mathrm{ent}}$, $\eta_{\mathrm{hyp}}$ | $\log(k)\,k^{-1}$ | $\log(k)\,k^{-2}$ |
| $\eta_p$, $p > 1$ | $k^{-\frac{q}{(p-1)d+q}}$ | $k^{-\frac{2q}{(p-1)d+q}}$ |

(b) Value of $q$

| | $\Phi$ Lipschitz | $\nabla_\theta \Phi$ Lipschitz |
| $\bar{G}'[\mu^\star]$ arbitrary | $q = 1$ (I) | $q = 2$ (II) |
| $\bar{G}'[\mu^\star] = 0$ | $q = 2$ (I*) | $q = 4$ (II*) |

Table 2: (a) Upper bounds on the convergence rates of $F(f_k) - \inf F$ for Algorithm 1 (PGM) and Algorithm 2 (APGM). (b) The value of $q$ that appears in the rate depends on the regularity of $\Phi$ and on whether $\bar{G}'$ vanishes at optimality or not. This defines 4 settings referred to as (I), (I*), (II) and (II*). Upper bounds are derived in Thm. 4.1, lower bounds are proved in Section 5.

Proof. The upper bound in Eq. (6) corresponds to an upper bound on $F(f_\epsilon) - \inf F + \alpha D_{\bar\eta}(f_\epsilon, f_0)$ for a specific family of candidates $f_\epsilon \in L^1(\tau)$. A special case of this argument, for sparse $\mu^\star$, $\eta_{\mathrm{ent}}$ and $q = 1$, appeared in Chizat [2021] and was extended to $\mu^\star \in \mathcal{M}_+(\Theta)$ in Domingo-Enrich et al. [2020]. In the following we write $F, G, H$ for $\bar{F}, \bar{G}, \bar{H}$ to lighten notation.

Step 1. Smoothing with a box kernel. For $\epsilon > 0$ (smaller than the injectivity radius of the exponential map in $\Theta$), consider the transition kernel $(\gamma_{\theta,\epsilon})_{\theta \in \Theta}$ where $\gamma_{\theta,\epsilon} \in \mathcal{P}(\Theta)$ is proportional to the restriction of $\tau$ to the closed geodesic ball $B_\epsilon(\theta)$ of radius $\epsilon$ centered at $\theta$. We define $\gamma_\epsilon \in \mathcal{M}(\Theta^2)$ as $\gamma_\epsilon(\mathrm{d}\theta, \mathrm{d}\theta') := \gamma_{\theta,\epsilon}(\mathrm{d}\theta')\,\mathrm{d}\mu^\star(\theta)$. By construction, the first marginal of $\gamma_\epsilon$ is $\mu^\star$ and we call $\mu_\epsilon$ its second marginal, which is absolutely continuous with respect to $\tau$ with density $f_\epsilon$. Since $\gamma_{\theta,\epsilon}(\mathrm{d}\theta') = \tau(B_\epsilon(\theta))^{-1}\,\mathbf{1}_{B_\epsilon(\theta)}(\theta')\,\mathrm{d}\tau(\theta')$, it holds

$$\mu_\epsilon(\mathrm{d}\theta') = \int_\Theta \gamma_{\theta,\epsilon}(\mathrm{d}\theta')\,\mathrm{d}\mu^\star(\theta), \quad \text{and} \quad f_\epsilon(\theta') = \int_\Theta \frac{\mathbf{1}_{B_\epsilon(\theta)}(\theta')}{\tau(B_\epsilon(\theta))}\,\mathrm{d}\mu^\star(\theta).$$

Note that $\tau(B_\epsilon(\theta)) \asymp \epsilon^d$, see e.g. [Gray and Vanhecke, 1979, Thm. 3.3].

Step 2. Bounding $F(\mu_\epsilon) - F(\mu^\star)$. For our admissible regularizers, it is easy to verify that $H(\mu_\epsilon) \leq H(\mu^\star)$. By convexity of $G$, we have

$$F(\mu_\epsilon) - F(\mu^\star) \leq G(\mu_\epsilon) - G(\mu^\star) \leq \int G'[\mu_\epsilon]\,\mathrm{d}[\mu_\epsilon - \mu^\star] = \int_{\Theta^2} \big( G'[\mu_\epsilon](\theta') - G'[\mu_\epsilon](\theta) \big)\,\mathrm{d}\gamma_\epsilon(\theta, \theta').$$

It is clear that the magnitude and regularity of $G'[\mu_\epsilon]$ play a role in the magnitude of this quantity. To go further, let us consider the various cases of Table 2-(b) successively.

(I) If $\Phi$ is Lipschitz, then for $\theta, \theta' \in \Theta$ it holds

$$|G'[\mu_\epsilon](\theta) - G'[\mu_\epsilon](\theta')| \leq \Big\| \nabla R\Big(\int \Phi\,\mathrm{d}\mu_\epsilon\Big) \Big\|_{\mathcal{F}} \cdot \operatorname{Lip}(\Phi) \cdot \operatorname{dist}(\theta, \theta').$$

Since $\nabla R$ is Lipschitz continuous, we deduce that there exists $K > 0$ such that $|G'[\mu_\epsilon](\theta) - G'[\mu_\epsilon](\theta')| \leq K \operatorname{dist}(\theta, \theta')$. It follows that

$$F(\mu_\epsilon) - F(\mu^\star) \leq K \int_{\Theta \times \Theta} \operatorname{dist}(\theta, \theta')\,\mathrm{d}\gamma_\epsilon(\theta, \theta') = K \int_\Theta \frac{1}{\tau(B_\epsilon(\theta))} \int_{B_\epsilon(\theta)} \operatorname{dist}(\theta, \theta')\,\mathrm{d}\tau(\theta')\,\mathrm{d}\mu^\star(\theta) \leq K \cdot \epsilon \cdot \|\mu^\star\|.$$

(I*) If $\Phi$ is Lipschitz and moreover $G'[\mu^\star] = 0$, it holds

$$G'[\mu_\epsilon](\theta) = \Big\langle \nabla R\Big(\int \Phi\,\mathrm{d}\mu_\epsilon\Big) - \nabla R\Big(\int \Phi\,\mathrm{d}\mu^\star\Big),\, \Phi(\theta) \Big\rangle_{\mathcal{F}}.$$


Since $\nabla R$ is Lipschitz continuous, the first factor is bounded by $\operatorname{Lip}(\nabla R) \cdot \big\| \int \Phi\,\mathrm{d}[\mu_\epsilon - \mu^\star] \big\|_{\mathcal{F}} = \operatorname{Lip}(\nabla R) \cdot \sup_{\|\tilde\Phi\|_{\mathcal{F}} \leq 1} \int \langle \tilde\Phi, \Phi(\theta) \rangle\,\mathrm{d}[\mu_\epsilon - \mu^\star](\theta)$. Under assumption (I*), the functions $\{\theta \mapsto \langle \tilde\Phi, \Phi(\theta)\rangle \;;\; \|\tilde\Phi\|_{\mathcal{F}} \leq 1\}$ are uniformly Lipschitz, so by the reasoning above it follows that $\big\|\int \Phi\,\mathrm{d}[\mu_\epsilon - \mu^\star]\big\|_{\mathcal{F}} \lesssim \epsilon$, hence $\operatorname{Lip}(G'[\mu_\epsilon]) \lesssim \epsilon$. It follows that the previous bound improves to $F(\mu_\epsilon) - F(\mu^\star) \lesssim \epsilon^2$.

(II) Here we have that $G'[\mu_\epsilon] \in C^1(\Theta; \mathbb{R})$ and $\nabla_\theta G'[\mu_\epsilon]$ is Lipschitz. By the mean value theorem on Riemannian manifolds (see e.g. [Gray and Willmore, 1982, Thm. 4.6]), there exists a constant $K' \geq 0$ such that for all $\theta \in \Theta$,

$$\Big| \frac{1}{\tau(B_\epsilon(\theta))} \int_{B_\epsilon(\theta)} G'[\mu_\epsilon](\theta')\,\mathrm{d}\tau(\theta') - G'[\mu_\epsilon](\theta) \Big| \leq K' \epsilon^2.$$

It follows that

$$F(\mu_\epsilon) - F(\mu^\star) \leq \int \mathrm{d}\mu^\star(\theta) \int \big( G'[\mu_\epsilon](\theta') - G'[\mu_\epsilon](\theta) \big)\,\mathrm{d}\gamma_{\theta,\epsilon}(\theta') \leq K' \|\mu^\star\|\, \epsilon^2.$$

(II*) If $\nabla\Phi$ is Lipschitz and moreover $G'[\mu^\star] = 0$, then an improvement as in (I*) applies. The functions $\{\theta \mapsto \langle \tilde\Phi, \Phi(\theta)\rangle \;;\; \|\tilde\Phi\|_{\mathcal{F}} \leq 1\}$ are differentiable with a uniformly Lipschitz derivative, so arguments as in the previous paragraph show that $\big\|\int \Phi\,\mathrm{d}[\mu_\epsilon - \mu^\star]\big\|_{\mathcal{F}} \lesssim \epsilon^2$, hence $\operatorname{Lip}(\nabla_\theta G'[\mu_\epsilon]) \lesssim \epsilon^2$. Thus, going through the argument for (II) with all the constants multiplied by $\epsilon^2$, we obtain that the bound improves to $F(\mu_\epsilon) - F(\mu^\star) \lesssim \epsilon^4$.

Step 3. Bounding $D_{\bar\eta}(f_\epsilon, f_0)$. It holds $D_{\bar\eta}(f_\epsilon, f_0) = \int \eta(f_\epsilon)\,\mathrm{d}\tau - \int \eta(f_0)\,\mathrm{d}\tau - \int \eta'(f_0)(f_\epsilon - f_0)\,\mathrm{d}\tau$. Since $\eta'(f_0) \in L^\infty$, all these terms are bounded by a constant independent of $\epsilon$ except $\int \eta(f_\epsilon)\,\mathrm{d}\tau$, so it remains to bound the latter. If $\mu^\star = 0$ then this quantity is bounded by a constant independent of $\epsilon$ and we are done. Otherwise, let us first assume that $\mu^\star \in \mathcal{M}_+(\Theta)$. By Jensen's inequality, one has

$$\eta(f_\epsilon(\theta')) = \eta\Big( \int_\Theta \mu^\star(\Theta)\,\frac{\mathbf{1}_{B_\epsilon(\theta)}(\theta')}{\tau(B_\epsilon(\theta))}\,\frac{\mathrm{d}\mu^\star(\theta)}{\mu^\star(\Theta)} \Big) \leq \int_\Theta \eta\Big( \mu^\star(\Theta)\,\frac{\mathbf{1}_{B_\epsilon(\theta)}(\theta')}{\tau(B_\epsilon(\theta))} \Big)\,\frac{\mathrm{d}\mu^\star(\theta)}{\mu^\star(\Theta)}.$$

It follows, by Fubini's theorem,

$$\int \eta(f_\epsilon(\theta'))\,\mathrm{d}\tau(\theta') \leq \int_\Theta \frac{\mathrm{d}\mu^\star(\theta)}{\mu^\star(\Theta)} \int_\Theta \mathrm{d}\tau(\theta')\, \eta\Big( \mu^\star(\Theta)\,\frac{\mathbf{1}_{B_\epsilon(\theta)}(\theta')}{\tau(B_\epsilon(\theta))} \Big) \leq \sup_\theta \Big\{ \tau(B_\epsilon(\theta))\, \eta\Big( \frac{\mu^\star(\Theta)}{\tau(B_\epsilon(\theta))} \Big) + |\eta(0)| \Big\}.$$

Now we use the fact that $\tau(B_\epsilon(\theta)) \asymp \epsilon^d$ and that $\eta$ has at most polynomial growth, so that the constants can be ignored, to bound this quantity by $O(\epsilon^d\, \eta(\epsilon^{-d}))$.

For the general case where $\mu^\star \in \mathcal{M}(\Theta)$, let $\mu_+, \mu_- \in \mathcal{M}_+(\Theta)$ be given by the Jordan decomposition $\mu^\star = \mu_+ - \mu_-$. It holds $|f_\epsilon(\theta)| \leq \max\{f_{+,\epsilon}(\theta), f_{-,\epsilon}(\theta)\}$, where $f_{+,\epsilon}$ and $f_{-,\epsilon}$ are obtained by applying the smoothing procedure of Step 1 to $\mu_+$ and $\mu_-$ respectively. Using the fact that, under Assumption (A2)±, $\eta$ is even and increasing on $\mathbb{R}_+$, we obtain the bound

$$\int \eta(f_\epsilon)\,\mathrm{d}\tau \leq \int \eta(f_{+,\epsilon})\,\mathrm{d}\tau + \int \eta(f_{-,\epsilon})\,\mathrm{d}\tau.$$

We finally bound each of these terms as in the case $\mu^\star \in \mathcal{M}_+(\Theta)$ to conclude.

5 Lower bounds

We will consider two types of lower bounds: (i) lower bounds on $\psi(\alpha)$, in order to confirm that the analysis in Thm. 4.1 is tight, and (ii) direct lower bounds on the convergence rates of Algorithms 1 and 2. Of course, the latter imply the former, but studying $\psi$ directly has its own interest and makes it simpler to cover all the cases.


5.1 Tight lower bounds on ψ

Let us show that the bounds on $\psi$ in Theorem 4.1 cannot be improved without additional assumptions.

Proposition 5.1 (Lower bounds). For each of the 4 settings of Table 2-(b) (determining the value of $q \in \{1, 2, 4\}$), there exists an objective function $\bar{F}$ satisfying Assumption (A1) and $\inf F = \bar{F}(\mu^\star)$ with $\mu^\star \in \mathcal{M}_+(\Theta)$, such that for any divergence-generating function $\eta$ satisfying Assumption (A2) and any $f_0$ such that $\eta'(f_0) \in L^\infty(\tau)$, it holds

$$\psi(\alpha) \gtrsim \inf_{\epsilon > 0} \big\{ \epsilon^q + \alpha\, \epsilon^d\, \eta(\epsilon^{-d}) \big\}. \qquad (7)$$

Proof. Let us build explicit objective functions with $\Theta = \mathbb{T}^d$ the $d$-dimensional torus, $\mathcal{F} = \mathbb{R}$ and $\bar{H} = \iota_{\mathcal{P}(\Theta)}$. Let $\theta_0 \in \mathbb{T}^d$ and, for $f \in L^1(\Theta)$ such that $f\tau \in \mathcal{P}(\Theta)$, let $\epsilon_f > 0$ be such that $\int_{B_{\epsilon_f}(\theta_0)} f\,\mathrm{d}\tau = \frac12$. Such an $\epsilon_f$ exists because $\epsilon \mapsto \int_{B_\epsilon(\theta_0)} f\,\mathrm{d}\tau$ continuously interpolates between $0$ when $\epsilon = 0$ and $1$ when $\epsilon$ is large. For any $\eta$ satisfying Assumption (A2), using the fact that $\eta \geq 0$ and Jensen's inequality, it holds

$$\int \eta(f)\,\mathrm{d}\tau \geq \int_{B_{\epsilon_f}(\theta_0)} \eta(f)\,\mathrm{d}\tau \geq \tau(B_{\epsilon_f}(\theta_0))\,\eta\Big( \frac{1}{\tau(B_{\epsilon_f}(\theta_0))} \int_{B_{\epsilon_f}(\theta_0)} f\,\mathrm{d}\tau \Big) \gtrsim \epsilon_f^d\, \eta(\epsilon_f^{-d}).$$

Thus, for any $f_0$ such that $\eta'(f_0) \in L^\infty(\tau)$, it holds $D_{\bar\eta}(f, f_0) \gtrsim \epsilon_f^d\, \eta(\epsilon_f^{-d})$.

(I) Consider $\bar{G}(\mu) = \int_\Theta \operatorname{dist}(\theta_0, \theta)\,\mathrm{d}\mu(\theta)$. This satisfies Assumption (A1)-(I) with $R$ the identity on $\mathbb{R}$ and $\Phi(\theta) = \operatorname{dist}(\theta_0, \theta)$, which is clearly Lipschitz. Since $\bar{H} = \iota_{\mathcal{P}(\Theta)}$, $\bar{F}$ admits a unique minimizer $\mu^\star = \delta_{\theta_0}$ and it holds $\bar{F}(\mu^\star) = \inf \bar{F} = 0$. For any $f \in \operatorname{dom} H$, we have the following lower bound, where $\epsilon_f$ is defined above:

$$F(f) - \inf F = \int_\Theta \operatorname{dist}(\theta, \theta_0)\, f(\theta)\,\mathrm{d}\tau(\theta) \geq \int_{\Theta \setminus B_{\epsilon_f}(\theta_0)} \operatorname{dist}(\theta, \theta_0)\, f(\theta)\,\mathrm{d}\tau(\theta) \geq \frac{\epsilon_f}{2}.$$

This proves the lower bound of Eq. (7) with $q = 1$.

(II) Consider $\bar{G}(\mu) = \int_\Theta \tilde\Phi(\theta)\,\mathrm{d}\mu(\theta)$, where $\tilde\Phi$ is any function which is continuously twice differentiable, coincides with $\operatorname{dist}(\theta_0, \cdot)^2$ on $B_{1/2}(\theta_0)$ and is larger than $1/4$ outside of this ball. We cannot directly take $\operatorname{dist}(\theta_0, \cdot)^2$ because this function is not smooth everywhere on $\Theta$ due to the existence of a cut locus, but it is smooth on $B_{1/2}(\theta_0)$. Assumptions (A1)-(II) are satisfied. Again, $\mu^\star = \delta_{\theta_0}$ is the unique minimizer and $\bar{F}(\mu^\star) = 0$. Analogous computations show that $F(f) - \inf F \geq \min\{\frac14, \epsilon_f^2\}$. This proves the lower bound of Eq. (7) with $q = 2$.

(I*) Consider $\bar{G}(\mu) = \frac12 \big(\int \Phi\,\mathrm{d}\mu\big)^2$ where $\Phi(\theta) = \operatorname{dist}(\theta, \theta_0)$. Clearly, $\mu^\star = \delta_{\theta_0}$ is the unique minimizer and $\bar{G}'[\mu^\star] = 0$, so Assumption (A1)-(I*) is satisfied. By direct computations, it holds $F(f) - \inf F \gtrsim \epsilon_f^2$. This proves the lower bound of Eq. (7) with $q = 2$.

(II*) Consider $\bar{G}(\mu) = \frac12 \big(\int \tilde\Phi\,\mathrm{d}\mu\big)^2$ where $\tilde\Phi$ is defined as in the analysis of (II). Again, $\mu^\star = \delta_{\theta_0}$ is the unique minimizer and $\bar{G}'[\mu^\star] = 0$, so Assumption (A1)-(II*) is satisfied. By direct computations, it holds $F(f) - \inf F \gtrsim \epsilon_f^4$, which proves the lower bound with $q = 4$.

Remark 5.2 (Exact decay of $\psi$ for a natural class of problems). There exists in fact a broad class of problems satisfying Assumption (A1)-(II) for which the bound on $\psi$ with $q = 2$ is exact. These are problems with a sparse solution $\mu^\star$ that satisfy an additional non-degeneracy condition at optimality, and they appear naturally in certain contexts [Poon et al., 2018]. For these problems, it is shown in [Chizat, 2021, Prop. 3.2] that $\bar{F}(\mu) - \inf F \gtrsim \big( \sup_g \int g\,\mathrm{d}[\mu - \mu^\star] \big)^2$, where the supremum is over 1-Lipschitz functions $g : \Theta \to \mathbb{R}$ uniformly bounded by $1$. Reasoning as in the proof of Proposition 5.1, this implies $F(f) - \inf F \gtrsim \epsilon_f^2$, which leads to

$$\psi(\alpha) \asymp \inf_{\epsilon > 0} \big\{ \epsilon^2 + \alpha\, \epsilon^d\, \eta(\epsilon^{-d}) \big\}.$$


5.2 Direct lower bounds on the convergence rates

In this section, we directly lower-bound the convergence rates of Algorithm 1 and Algorithm 2. We focus on the $L^2$ geometry ($\eta_2$), for which we prove all the lower bounds (there are 8 cases to consider), and on the relative entropy geometry ($\eta_{\mathrm{ent}}$), for which we omit certain settings for the sake of conciseness. In all the cases considered, the lower bounds match the upper bounds (up to logarithmic terms for $\eta_{\mathrm{ent}}$).

Proposition 5.3. For each of the settings (I), (I*), (II) and (II*) under Assumption (A1), there exists a function $F$ such that the iterates $(f_k)_{k \geq 0}$ of Algorithm 1 (PGM) with the divergence-generating function $\eta_2$, initialized with $f_0 = 1$ and with any step-size $s > 0$, satisfy

$$F(f_k) - \inf F \gtrsim k^{-\frac{q}{d+q}},$$

where $q$ is the constant associated to the setting via Table 2-(b).

Proof. (I) As in the proof of Proposition 5.1, we consider $\Theta = \mathbb{T}^d$, $\theta_0 \in \mathbb{T}^d$, $\bar{H} = \iota_{\mathcal{P}(\Theta)}$ and $\bar{G}(\mu) = \int \Phi\,\mathrm{d}\mu$ with $\Phi(\theta) = \operatorname{dist}(\theta, \theta_0)$. We set $s = 1$ as the step-size plays no role in what follows. In this case, the update equation of Algorithm 1 writes $f_{k+1} = (f_k - \Phi - \kappa_k)_+$, where $\kappa_k \in \mathbb{R}$ is such that $f_{k+1} \in \operatorname{dom} H$. Thanks to the symmetries of the problem, a simple recursion shows that it holds

$$f_k = (m(k) - k\Phi)_+$$

for some $m(k) \in \mathbb{R}_+$. Let $r(k) := m(k)/k$, which is such that $B_{r(k)}(\theta_0)$ is the support of $f_k$. We have

$$1 = \int f_k\,\mathrm{d}\tau \asymp \int_0^{r(k)} u^{d-1}\big(m(k) - ku\big)\,\mathrm{d}u \asymp \frac{m(k)}{d}\, r(k)^d - \frac{k}{d+1}\, r(k)^{d+1} \asymp m(k)^{d+1} k^{-d},$$

thus $m(k) \asymp k^{d/(d+1)}$ and $r(k) \asymp k^{-1/(d+1)}$. We can compute the objective

$$F(f_k) \asymp \int_0^{r(k)} u^{d-1}\, u\,\big(m(k) - ku\big)\,\mathrm{d}u \asymp \frac{m(k)}{d+1}\, r(k)^{d+1} - \frac{k}{d+2}\, r(k)^{d+2} \asymp k^{-1/(d+1)},$$

which proves case (I) (here $q = 1$) since $\inf F = 0$.

(II) Consider a smooth function $\tilde\Phi$ which equals $\operatorname{dist}(\theta, \theta_0)^2$ on the ball $B_{1/2}(\theta_0)$, as in the proof of Proposition 5.1. Again the iterates have the form $f_k = (m(k) - k\tilde\Phi)_+$, and now let $r(k)^2 = m(k)/k$. For $k$ large enough, so that $\tilde\Phi = \operatorname{dist}(\cdot, \theta_0)^2$ over $B_{r(k)}(\theta_0)$, it holds

$$1 = \int f_k\,\mathrm{d}\tau \asymp \int_0^{r(k)} u^{d-1}\big(m(k) - ku^2\big)\,\mathrm{d}u \asymp \frac{m(k)}{d}\, r(k)^d - \frac{k}{d+2}\, r(k)^{d+2} \asymp m(k)^{(d+2)/2} k^{-d/2},$$

thus $m(k) \asymp k^{d/(d+2)}$ and $r(k) \asymp k^{-1/(d+2)}$. It also holds

$$F(f_k) \asymp \int_0^{r(k)} u^{d-1}\, u^2\,\big(m(k) - ku^2\big)\,\mathrm{d}u \asymp \frac{m(k)}{d+2}\, r(k)^{d+2} - \frac{k}{d+4}\, r(k)^{d+4} \asymp k^{-2/(d+2)},$$

which proves case (II) (here $q = 2$).

(I*) We consider the function $\bar{G}(\mu) = \frac12 \big(\int \Phi\,\mathrm{d}\mu\big)^2$. Now the reasoning is slightly more subtle because of the non-linearity. It holds $f_k = (m(k) - s(k)\Phi)_+$ where $s(k+1) = s(k) + \int \Phi f_k\,\mathrm{d}\tau$. Again, $m(k)$ is such that the iterate is feasible. Our upper bounds imply that $\int \Phi f_k\,\mathrm{d}\tau \lesssim k^{-1/(d+2)}$, thus $s(k+1) - s(k) \lesssim k^{-1/(d+2)}$ and it follows that $s(k) \lesssim k^{(d+1)/(d+2)}$ and $F(f_k) \gtrsim \big(s(k)^{-1/(d+1)}\big)^2 \gtrsim k^{-2/(d+2)}$, as desired.
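To illustrate Proposition 5.3 numerically, here is a small self-contained sketch (ours, not the experiments of Section 6; grid size, horizon and all names are illustrative) that runs PGM with the $\eta_2$ geometry on the case (I) construction for $d = 1$ and checks the predicted $k^{-1/(d+1)} = k^{-1/2}$ decay.

```python
import numpy as np

# Illustrative check of Prop. 5.3, case (I), for d = 1: PGM with the L^2 geometry (eta_2)
# on Theta = T^1, H = indicator of probability densities, Phi(theta) = dist(theta, theta0).
# Each update is a Euclidean projection onto the simplex {f >= 0, int f dtau = 1}.
m, s = 2000, 1.0
grid = np.arange(m) / m
tau_w = 1.0 / m
theta0 = 0.5
Phi = np.minimum(np.abs(grid - theta0), 1.0 - np.abs(grid - theta0))  # distance on T^1

def project_simplex(v):
    # find kappa such that sum((v - kappa)_+ * tau_w) = 1, by bisection
    lo, hi = v.min() - 2.0, v.max()
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if np.sum(np.maximum(v - mid, 0.0)) * tau_w > 1.0:
            lo = mid
        else:
            hi = mid
    return np.maximum(v - 0.5 * (lo + hi), 0.0)

f = np.ones(m)  # f_0 = 1
for k in range(1, 1001):
    f = project_simplex(f - s * Phi)       # PGM step with eta_2 (here G'[f] = Phi)
    if k in (10, 100, 1000):
        obj = np.sum(Phi * f) * tau_w      # F(f_k); inf F = 0
        print(k, obj, obj * k ** 0.5)      # obj * sqrt(k) should be roughly constant
```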
