To cite this version: Hedy Attouch, Zaki Chbani, Jalal M. Fadili, Hassan Riahi. First-order optimization algorithms via inertial systems with Hessian driven damping. Mathematical Programming, Springer Verlag, in press. ⟨hal-02193846⟩

FIRST-ORDER OPTIMIZATION ALGORITHMS VIA INERTIAL SYSTEMS WITH HESSIAN DRIVEN DAMPING

HEDY ATTOUCH, ZAKI CHBANI, JALAL FADILI, AND HASSAN RIAHI

Abstract. In a Hilbert space setting, for convex optimization, we analyze the convergence rate of a class of first-order algorithms involving inertial features. They can be interpreted as discrete time versions of inertial dynamics involving both viscous and Hessian-driven dampings. The geometrical damping driven by the Hessian intervenes in the dynamics in the form $\nabla^2 f(x(t))\dot x(t)$. By treating this term as the time derivative of $\nabla f(x(t))$, this gives, in discretized form, first-order algorithms in time and space. In addition to the convergence properties attached to Nesterov-type accelerated gradient methods, the algorithms thus obtained are new and show a rapid convergence towards zero of the gradients. On the basis of a regularization technique using the Moreau envelope, we extend these methods to non-smooth convex functions with extended real values. The introduction of time scale factors makes it possible to further accelerate these algorithms. We also report numerical results on structured problems to support our theoretical findings.

Key words: Hessian-driven damping; inertial optimization algorithms; Nesterov accelerated gradient method; Ravine method; time rescaling.

AMS subject classification. 37N40, 46N10, 49M30, 65B99, 65K05, 65K10, 90B50, 90C25.

1. Introduction

Unless otherwise specified, throughout the paper we make the following assumptions:

• $\mathcal H$ is a real Hilbert space;

• $f : \mathcal H \to \mathbb R$ is a convex function of class $C^2$, with $S = \operatorname{argmin} f \neq \emptyset$;

• $\gamma, \beta, b : [t_0,+\infty[ \to \mathbb R_+$ are non-negative continuous functions, with $t_0 > 0$.

As a guide in our study, we will rely on the asymptotic behavior, as $t \to +\infty$, of the trajectories of the inertial system with Hessian-driven damping
\[
\ddot x(t) + \gamma(t)\dot x(t) + \beta(t)\nabla^2 f(x(t))\dot x(t) + b(t)\nabla f(x(t)) = 0.
\]
Here $\gamma(t)$ and $\beta(t)$ are damping parameters, and $b(t)$ is a time scale parameter.

The time discretization of this system will provide a rich family of first-order methods for minimizing $f$. At first glance, the presence of the Hessian may seem to entail numerical difficulties. However, this is not the case, as the Hessian intervenes in the above ODE in the form $\nabla^2 f(x(t))\dot x(t)$, which is nothing but the time derivative of $\nabla f(x(t))$. This explains why the time discretization of this dynamic provides first-order algorithms. Thus, the Nesterov extrapolation scheme [25, 26] is modified by the introduction of the difference of the gradients at consecutive iterates. This gives algorithms of the form

\[
\begin{cases}
y_k = x_k + \alpha_k(x_k - x_{k-1}) - \beta_k\big(\nabla f(x_k) - \nabla f(x_{k-1})\big)\\
x_{k+1} = T(y_k),
\end{cases}
\]
where $T$, to be specified later, is an operator involving the gradient or the proximal operator of $f$.

Coming back to the continuous dynamic, we will pay particular attention to the following two cases, specifically adapted to the properties of $f$:

• For a general convex function $f$, taking $\gamma(t) = \frac{\alpha}{t}$ gives
\[
\text{(DIN-AVD)}_{\alpha,\beta,b} \qquad \ddot x(t) + \frac{\alpha}{t}\dot x(t) + \beta(t)\nabla^2 f(x(t))\dot x(t) + b(t)\nabla f(x(t)) = 0.
\]


In the case $\beta \equiv 0$, $\alpha = 3$, $b(t) \equiv 1$, it can be interpreted as a continuous version of the Nesterov accelerated gradient method [31]. Accordingly, in this case we will obtain $O(t^{-2})$ convergence rates for the objective values.

• For a $\mu$-strongly convex function $f$, we will rely on the autonomous inertial system with Hessian-driven damping
\[
\text{(DIN)}_{2\sqrt\mu,\beta} \qquad \ddot x(t) + 2\sqrt\mu\,\dot x(t) + \beta\nabla^2 f(x(t))\dot x(t) + \nabla f(x(t)) = 0,
\]
and show exponential (linear) convergence rates for both objective values and gradients.

For an appropriate setting of the parameters, the time discretization of these dynamics provides first-order algorithms with fast convergence properties. Notably, we will show a rapid convergence towards zero of the gradients.

1.1. A historical perspective. B. Polyak initiated the use of inertial dynamics to accelerate the gradient method in optimization. In [27, 28], based on the inertial system with a fixed viscous damping coefficient $\gamma > 0$
\[
\text{(HBF)} \qquad \ddot x(t) + \gamma\dot x(t) + \nabla f(x(t)) = 0,
\]
he introduced the Heavy Ball with Friction method. For a strongly convex function $f$, (HBF) provides convergence of $f(x(t))$ to $\min_{\mathcal H} f$ at an exponential rate. For general convex functions, the asymptotic convergence rate of (HBF) is $O(1/t)$ (in the worst case), which is no better than the steepest descent. A decisive step to improve (HBF) was taken by Alvarez-Attouch-Bolte-Redont [2] with the introduction of the Hessian-driven damping term $\beta\nabla^2 f(x(t))\dot x(t)$, that is (DIN)$_{0,\beta}$. The next important step was accomplished by Su-Boyd-Candès [31] with the introduction of a vanishing viscous damping coefficient $\gamma(t) = \frac{\alpha}{t}$, that is (AVD)$_\alpha$ (see Section 1.1.2). The system (DIN-AVD)$_{\alpha,\beta,1}$ (see Section 2) has emerged as a combination of (DIN)$_{0,\beta}$ and (AVD)$_\alpha$. Let us review some basic facts concerning these systems.

1.1.1. The (DIN)$_{\gamma,\beta}$ dynamic. The inertial system
\[
\text{(DIN)}_{\gamma,\beta} \qquad \ddot x(t) + \gamma\dot x(t) + \beta\nabla^2 f(x(t))\dot x(t) + \nabla f(x(t)) = 0,
\]
was introduced in [2]. In line with (HBF), it contains a fixed positive friction coefficient $\gamma$. The introduction of the Hessian-driven damping makes it possible to neutralize the transversal oscillations likely to occur with (HBF), as observed in [2] in the case of the Rosenbrock function. The need for a geometric damping adapted to $f$ had already been observed by Alvarez [1], who considered
\[
\ddot x(t) + \Gamma\dot x(t) + \nabla f(x(t)) = 0,
\]
where $\Gamma : \mathcal H \to \mathcal H$ is a linear positive anisotropic operator. But this damping operator is still fixed. For a general convex function, the Hessian-driven damping in (DIN)$_{\gamma,\beta}$ performs a similar operation in a closed-loop adaptive way. The terminology (DIN) stands for Dynamical Inertial Newton; it refers to the natural link between this dynamic and the continuous Newton method.

1.1.2. The (AVD)$_\alpha$ dynamic. The inertial system
\[
\text{(AVD)}_\alpha \qquad \ddot x(t) + \frac{\alpha}{t}\dot x(t) + \nabla f(x(t)) = 0,
\]
was introduced in the context of convex optimization in [31]. For general convex functions it provides a continuous version of the accelerated gradient method of Nesterov. For $\alpha \ge 3$, each trajectory $x(\cdot)$ of (AVD)$_\alpha$ satisfies the asymptotic convergence rate of the values $f(x(t)) - \inf_{\mathcal H} f = O(1/t^2)$. As a specific feature, the viscous damping coefficient $\frac{\alpha}{t}$ vanishes (tends to zero) as time $t$ goes to infinity, hence the terminology. The convergence properties of the dynamic (AVD)$_\alpha$ have been the subject of many recent studies, see [3, 4, 5, 6, 8, 9, 10, 14, 15, 24, 31]. They helped to explain why $\frac{\alpha}{t}$ is a wise choice of the damping coefficient.

In [20], the authors showed that a vanishing damping coefficient $\gamma(\cdot)$ dissipates the energy, and hence makes the dynamic interesting for optimization, as long as $\int_{t_0}^{+\infty}\gamma(t)\,dt = +\infty$. The damping coefficient can go to zero asymptotically, but not too fast: the smallest admissible decay is of order $\frac{1}{t}$. This enforces the inertial effect with respect to the friction effect.


The tuning of the parameter $\alpha$ in front of $\frac{1}{t}$ comes from the Lyapunov analysis and the optimality of the convergence rates obtained. The case $\alpha = 3$, which corresponds to Nesterov's historical algorithm, is critical. In the case $\alpha = 3$, the question of the convergence of the trajectories remains an open problem (except in one dimension, where convergence holds [9]). As a remarkable property, for $\alpha > 3$, it has been shown by Attouch-Chbani-Peypouquet-Redont [8] that each trajectory converges weakly to a minimizer. The corresponding algorithmic result has been obtained by Chambolle-Dossal [21]. For $\alpha > 3$, it is shown in [10] and [24] that the asymptotic convergence rate of the values is actually $o(1/t^2)$. The subcritical case $\alpha \le 3$ has been examined by Apidopoulos-Aujol-Dossal [3] and Attouch-Chbani-Riahi [9], with the convergence rate of the objective values $O\!\big(t^{-\frac{2\alpha}{3}}\big)$. These rates are optimal, that is, they can be reached, or approached arbitrarily closely:

• $\alpha \ge 3$: the optimal rate $O(t^{-2})$ is achieved by taking $f(x) = \|x\|^r$ with $r \to +\infty$ ($f$ becomes very flat around its minimum), see [8].

• $\alpha < 3$: the optimal rate $O\!\big(t^{-\frac{2\alpha}{3}}\big)$ is achieved by taking $f(x) = \|x\|$, see [3].

The inertial system with a general damping coefficient $\gamma(\cdot)$ was recently studied by Attouch-Cabot in [4, 5], and by Attouch-Cabot-Chbani-Riahi in [6].

1.1.3. The (DIN-AVD)$_{\alpha,\beta}$ dynamic. The inertial system
\[
\text{(DIN-AVD)}_{\alpha,\beta} \qquad \ddot x(t) + \frac{\alpha}{t}\dot x(t) + \beta\nabla^2 f(x(t))\dot x(t) + \nabla f(x(t)) = 0,
\]
was introduced in [11]. It combines the two types of damping considered above. Its formulation looks at first glance more complicated than (AVD)$_\alpha$. In [12], Attouch-Peypouquet-Redont showed that (DIN-AVD)$_{\alpha,\beta}$ is equivalent to the first-order system in time and space
\[
\begin{cases}
\dot x(t) + \beta\nabla f(x(t)) - \Big(\dfrac{1}{\beta} - \dfrac{\alpha}{t}\Big) x(t) + \dfrac{1}{\beta} y(t) = 0;\\[6pt]
\dot y(t) - \Big(\dfrac{1}{\beta} - \dfrac{\alpha}{t} + \dfrac{\alpha\beta}{t^2}\Big) x(t) + \dfrac{1}{\beta} y(t) = 0.
\end{cases}
\]
This provides a natural extension to $f : \mathcal H \to \mathbb R \cup \{+\infty\}$ proper, lower semicontinuous and convex, by just replacing the gradient with the subdifferential.

To get better insight, let us compare the two dynamics (AVD)$_\alpha$ and (DIN-AVD)$_{\alpha,\beta}$ on a simple quadratic minimization problem, in which case the trajectories can be computed in closed form, as explained in Appendix A.3. Take $\mathcal H = \mathbb R^2$ and $f(x_1, x_2) = \frac{1}{2}(x_1^2 + 1000 x_2^2)$, which is ill-conditioned. We take the parameters $\alpha = 3.1$, $\beta = 1$, so as to obey the condition $\alpha > 3$. Starting with initial conditions $(x_1(1), x_2(1)) = (1,1)$, $(\dot x_1(1), \dot x_2(1)) = (0,0)$, we obtain the trajectories displayed in Figure 1. This illustrates the typical situation of an ill-conditioned minimization problem, where the wild oscillations of (AVD)$_\alpha$ are neutralized by the Hessian damping in (DIN-AVD)$_{\alpha,\beta}$ (see Appendix A.3 for further details).
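For readers who wish to reproduce this comparison, both systems can be integrated with a standard ODE solver. The sketch below is illustrative only (it is not the code behind Figure 1) and assumes NumPy/SciPy; it uses the quadratic, parameters and initial conditions above.

import numpy as np
from scipy.integrate import solve_ivp

A = np.diag([1.0, 1000.0])          # Hessian of f(x1,x2) = (x1^2 + 1000*x2^2)/2
alpha, beta = 3.1, 1.0

def avd(t, z):                      # (AVD)_alpha: x'' + (alpha/t) x' + grad f(x) = 0
    x, v = z[:2], z[2:]
    return np.concatenate([v, -(alpha / t) * v - A @ x])

def din_avd(t, z):                  # adds the Hessian-driven damping beta * Hess f(x) x'
    x, v = z[:2], z[2:]
    return np.concatenate([v, -(alpha / t) * v - beta * (A @ v) - A @ x])

z0 = np.array([1.0, 1.0, 0.0, 0.0]) # x(1) = (1, 1), x'(1) = (0, 0)
ts = np.linspace(1.0, 10.0, 2000)
for rhs, name in [(avd, "(AVD)"), (din_avd, "(DIN-AVD)")]:
    sol = solve_ivp(rhs, (1.0, 10.0), z0, t_eval=ts, rtol=1e-8, atol=1e-10)
    x1, x2 = sol.y[0], sol.y[1]
    f_end = 0.5 * (x1[-1]**2 + 1000.0 * x2[-1]**2)
    print(f"{name}: f(x(10)) = {f_end:.3e}")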

1.2. Main algorithmic results. Let us describe our main convergence rates for the gradient-type algorithms. Corresponding results for the proximal algorithms are also obtained.

General convex function. Let $f : \mathcal H \to \mathbb R$ be a convex function whose gradient is $L$-Lipschitz continuous. Based on the discretization of (DIN-AVD)$_{\alpha,\beta,1+\frac{\beta}{t}}$, we consider
\[
\begin{cases}
y_k = x_k + \big(1 - \frac{\alpha}{k}\big)(x_k - x_{k-1}) - \beta\sqrt s\,\big(\nabla f(x_k) - \nabla f(x_{k-1})\big) - \frac{\beta\sqrt s}{k}\nabla f(x_{k-1})\\
x_{k+1} = y_k - s\nabla f(y_k).
\end{cases}
\]
Suppose that $\alpha \ge 3$, $0 < \beta < 2\sqrt s$, and $sL \le 1$. In Theorem 3.3, we show that

i) $f(x_k) - \min_{\mathcal H} f = O\!\big(\frac{1}{k^2}\big)$ as $k \to +\infty$;

ii) $\sum_k k^2\|\nabla f(y_k)\|^2 < +\infty$ and $\sum_k k^2\|\nabla f(x_k)\|^2 < +\infty$.


Figure 1. Evolution of the objective (left) and trajectories (right) for (AVD)$_\alpha$ ($\alpha = 3.1$) and (DIN-AVD)$_{\alpha,\beta}$ ($\alpha = 3.1$, $\beta = 1$) on an ill-conditioned quadratic problem in $\mathbb R^2$.

Strongly convex function. When $f : \mathcal H \to \mathbb R$ is $\mu$-strongly convex for some $\mu > 0$, our analysis relies on the autonomous dynamic (DIN)$_{\gamma,\beta}$ with $\gamma = 2\sqrt\mu$. Based on its time discretization, we obtain linear convergence results for the values (hence for the trajectory) and for the gradients. Explicit discretization gives the inertial gradient algorithm
\[
x_{k+1} = x_k + \frac{1-\sqrt{\mu s}}{1+\sqrt{\mu s}}(x_k - x_{k-1}) - \frac{\beta\sqrt s}{1+\sqrt{\mu s}}\big(\nabla f(x_k) - \nabla f(x_{k-1})\big) - \frac{s}{1+\sqrt{\mu s}}\nabla f(x_k).
\]
Assuming that $\nabla f$ is $L$-Lipschitz continuous, $L$ sufficiently small and $\beta \le \frac{1}{\sqrt\mu}$, it is shown in Theorem 5.4 that, with $q = \frac{1}{1 + \frac{1}{2}\sqrt{\mu s}}$ (so $0 < q < 1$),
\[
f(x_k) - \min_{\mathcal H} f = O(q^k) \quad\text{and}\quad \|x_k - x^\star\| = O(q^{k/2}) \quad\text{as } k \to +\infty.
\]
Moreover, the gradients converge exponentially fast to zero.
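To make the scheme concrete, here is a minimal sketch (not from the paper) on a strongly convex quadratic; the test matrix, the conservative step size (reflecting the requirement that $L$ be sufficiently small) and the iteration count are illustrative choices, and NumPy is assumed.

import numpy as np

A = np.diag([1.0, 4.0])                 # f(x) = 0.5 * x^T A x, mu = 1, L = 4
grad = lambda x: A @ x
mu, L = 1.0, 4.0
s = 1.0 / (10.0 * L)                    # conservative step size
beta = 1.0 / np.sqrt(mu)                # geometric (Hessian-driven) damping
a = np.sqrt(mu * s)

x_prev = x = np.array([1.0, 1.0])
for k in range(300):
    g, g_prev = grad(x), grad(x_prev)
    x, x_prev = (x + (1 - a) / (1 + a) * (x - x_prev)
                 - beta * np.sqrt(s) / (1 + a) * (g - g_prev)
                 - s / (1 + a) * g), x

print("f(x_300) =", 0.5 * x @ A @ x)            # decays geometrically in k
print("||grad f(x_300)|| =", np.linalg.norm(grad(x)))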

1.3. Contents. The paper is organized as follows. Sections 2 and 3 deal with the case of general convex functions, in the continuous and algorithmic settings respectively. We improve the Nesterov convergence rates by showing, in addition, fast convergence of the gradients. Sections 4 and 5 deal with the same questions in the case of strongly convex functions, for which linear convergence results are obtained. Section 6 is devoted to numerical illustrations. We conclude with some perspectives.

2. Inertial dynamics for general convex functions

Our analysis deals with the inertial system with Hessian-driven damping
\[
\text{(DIN-AVD)}_{\alpha,\beta,b} \qquad \ddot x(t) + \frac{\alpha}{t}\dot x(t) + \beta(t)\nabla^2 f(x(t))\dot x(t) + b(t)\nabla f(x(t)) = 0.
\]

2.1. Convergence rates. By specializing the functions $\beta$ and $b$, the convergence rates obtained in the following theorem make it possible to recover most of the related results existing in the literature. The following quantities play a central role in our analysis:
\[
(1)\qquad w(t) := b(t) - \dot\beta(t) - \frac{\beta(t)}{t} \quad\text{and}\quad \delta(t) := t^2 w(t).
\]


Theorem 2.1. Take $\alpha \ge 1$. Let $x : [t_0,+\infty[ \to \mathcal H$ be a solution trajectory of (DIN-AVD)$_{\alpha,\beta,b}$. Suppose that the following growth conditions are satisfied:
\[
(\mathrm G_2)\quad b(t) > \dot\beta(t) + \frac{\beta(t)}{t}; \qquad
(\mathrm G_3)\quad t\dot w(t) \le (\alpha - 3)\, w(t).
\]
Then $w(t)$ is positive and

i) $f(x(t)) - \min_{\mathcal H} f = O\!\Big(\dfrac{1}{t^2 w(t)}\Big)$ as $t \to +\infty$;

ii) $\displaystyle\int_{t_0}^{+\infty} t^2\beta(t) w(t)\,\|\nabla f(x(t))\|^2\,dt < +\infty$;

iii) $\displaystyle\int_{t_0}^{+\infty} t\big[(\alpha-3)w(t) - t\dot w(t)\big]\big(f(x(t)) - \min_{\mathcal H} f\big)\,dt < +\infty$.

Proof. Given $x^\star \in \operatorname{argmin}_{\mathcal H} f$, define for $t \ge t_0$
\[
(2)\qquad E(t) := \delta(t)\big(f(x(t)) - f(x^\star)\big) + \frac{1}{2}\|v(t)\|^2, \quad\text{where}\quad v(t) := (\alpha-1)(x(t) - x^\star) + t\big(\dot x(t) + \beta(t)\nabla f(x(t))\big).
\]
The function $E(\cdot)$ will serve as a Lyapunov function. Differentiating $E$ gives
\[
(3)\qquad \frac{d}{dt}E(t) = \dot\delta(t)\big(f(x(t)) - f(x^\star)\big) + \delta(t)\langle\nabla f(x(t)), \dot x(t)\rangle + \langle v(t), \dot v(t)\rangle.
\]
Using equation (DIN-AVD)$_{\alpha,\beta,b}$, we have
\begin{align*}
\dot v(t) &= \alpha\dot x(t) + \beta(t)\nabla f(x(t)) + t\Big(\ddot x(t) + \dot\beta(t)\nabla f(x(t)) + \beta(t)\nabla^2 f(x(t))\dot x(t)\Big)\\
&= \alpha\dot x(t) + \beta(t)\nabla f(x(t)) + t\Big(-\frac{\alpha}{t}\dot x(t) + (\dot\beta(t) - b(t))\nabla f(x(t))\Big)\\
&= t\Big(\dot\beta(t) + \frac{\beta(t)}{t} - b(t)\Big)\nabla f(x(t)).
\end{align*}
Hence,
\begin{align*}
\langle v(t), \dot v(t)\rangle &= (\alpha-1)\,t\Big(\dot\beta(t) + \frac{\beta(t)}{t} - b(t)\Big)\langle\nabla f(x(t)), x(t) - x^\star\rangle\\
&\quad + t^2\Big(\dot\beta(t) + \frac{\beta(t)}{t} - b(t)\Big)\langle\nabla f(x(t)), \dot x(t)\rangle + t^2\beta(t)\Big(\dot\beta(t) + \frac{\beta(t)}{t} - b(t)\Big)\|\nabla f(x(t))\|^2.
\end{align*}
Let us go back to (3). According to the choice of $\delta(t)$, the terms $\langle\nabla f(x(t)), \dot x(t)\rangle$ cancel, which gives
\[
\frac{d}{dt}E(t) = \dot\delta(t)\big(f(x(t)) - f(x^\star)\big) + \frac{(\alpha-1)}{t}\delta(t)\langle\nabla f(x(t)), x^\star - x(t)\rangle - \beta(t)\delta(t)\|\nabla f(x(t))\|^2.
\]
Condition (G$_2$) gives $\delta(t) > 0$. Combining this equation with the convexity of $f$,
\[
f(x^\star) - f(x(t)) \ge \langle\nabla f(x(t)), x^\star - x(t)\rangle,
\]
we obtain the inequality
\[
(4)\qquad \frac{d}{dt}E(t) + \beta(t)\delta(t)\|\nabla f(x(t))\|^2 + \Big[\frac{(\alpha-1)}{t}\delta(t) - \dot\delta(t)\Big]\big(f(x(t)) - f(x^\star)\big) \le 0.
\]
Then note that
\[
(5)\qquad \frac{(\alpha-1)}{t}\delta(t) - \dot\delta(t) = t\big[(\alpha-3)w(t) - t\dot w(t)\big].
\]
Hence, condition (G$_3$) writes equivalently
\[
(6)\qquad \frac{(\alpha-1)}{t}\delta(t) - \dot\delta(t) \ge 0,
\]
which, by (4), gives $\frac{d}{dt}E(t) \le 0$. Therefore $E(\cdot)$ is nonincreasing, and hence $E(t) \le E(t_0)$. Since all the terms entering $E(\cdot)$ are nonnegative, we obtain
\[
f(x(t)) - f(x^\star) \le \frac{E(t_0)}{t^2\Big(b(t) - \dot\beta(t) - \frac{\beta(t)}{t}\Big)}.
\]
Then, by integrating (4), we obtain
\[
\int_{t_0}^{+\infty} \beta(t)\delta(t)\|\nabla f(x(t))\|^2\,dt \le E(t_0) < +\infty
\quad\text{and}\quad
\int_{t_0}^{+\infty} t\big[(\alpha-3)w(t) - t\dot w(t)\big]\big(f(x(t)) - f(x^\star)\big)\,dt \le E(t_0) < +\infty,
\]
which gives ii) and iii), and completes the proof. $\qed$

2.2. Particular cases.

Case 1. The (DIN-AVD)$_{\alpha,\beta}$ system corresponds to $\beta(t) \equiv \beta$ and $b(t) \equiv 1$. In this case, $w(t) = 1 - \frac{\beta}{t}$. Conditions (G$_2$) and (G$_3$) are satisfied by taking $\alpha > 3$ and $t > \frac{\alpha-2}{\alpha-3}\beta$. Hence, as a consequence of Theorem 2.1, we obtain the following result of Attouch-Peypouquet-Redont [12]:

Theorem 2.2 ([12]). Let $x : [t_0,+\infty[ \to \mathcal H$ be a trajectory of the dynamical system (DIN-AVD)$_{\alpha,\beta}$. Suppose $\alpha > 3$. Then
\[
f(x(t)) - \min_{\mathcal H} f = O\!\Big(\frac{1}{t^2}\Big) \quad\text{and}\quad \int_{t_0}^{+\infty} t^2\|\nabla f(x(t))\|^2\,dt < +\infty.
\]

Case 2. The system (DIN-AVD)$_{\alpha,\beta,1+\frac{\beta}{t}}$, which corresponds to $\beta(t) \equiv \beta$ and $b(t) = 1 + \frac{\beta}{t}$, was considered in [30]. Compared to (DIN-AVD)$_{\alpha,\beta}$, it has the additional coefficient $\frac{\beta}{t}$ in front of the gradient term. This vanishing coefficient facilitates the computational aspects while keeping the structure of the dynamic. Observe that in this case $w(t) \equiv 1$, and conditions (G$_2$) and (G$_3$) boil down to $\alpha \ge 3$. Hence, as a consequence of Theorem 2.1, we obtain

Theorem 2.3. Let $x : [t_0,+\infty[ \to \mathcal H$ be a solution trajectory of the dynamical system (DIN-AVD)$_{\alpha,\beta,1+\frac{\beta}{t}}$. Suppose $\alpha \ge 3$. Then
\[
f(x(t)) - \min_{\mathcal H} f = O\!\Big(\frac{1}{t^2}\Big) \quad\text{and}\quad \int_{t_0}^{+\infty} t^2\|\nabla f(x(t))\|^2\,dt < +\infty.
\]

Case 3. The dynamical system (DIN-AVD)$_{\alpha,0,b}$, which corresponds to $\beta(t) \equiv 0$, was considered by Attouch-Chbani-Riahi in [7]. It also comes naturally from the time scaling of (AVD)$_\alpha$. In this case we have $w(t) = b(t)$. Condition (G$_2$) is equivalent to $b(t) > 0$, and (G$_3$) becomes
\[
t\dot b(t) \le (\alpha - 3)\, b(t),
\]
which is precisely the condition introduced in [7, Theorem 8.1]. Under this condition, we have the convergence rate
\[
f(x(t)) - \min_{\mathcal H} f = O\!\Big(\frac{1}{t^2 b(t)}\Big) \quad\text{as } t \to +\infty.
\]
This makes clear the acceleration effect due to the time scaling. For $b(t) = t^r$, we have $f(x(t)) - \min_{\mathcal H} f = O\!\big(\frac{1}{t^{2+r}}\big)$ under the assumption $\alpha \ge 3 + r$.
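For completeness, the verification behind this last claim is a routine specialization of Theorem 2.1 (nothing beyond what is stated there): with $\beta(t) \equiv 0$ and $b(t) = t^r$,
\[
w(t) = b(t) = t^{r}, \qquad \delta(t) = t^{2}w(t) = t^{2+r}, \qquad t\dot w(t) = r\,t^{r} \le (\alpha-3)\,t^{r} \iff \alpha \ge 3 + r,
\]
so item i) of Theorem 2.1 yields $f(x(t)) - \min_{\mathcal H} f = O\!\big(\frac{1}{t^{2+r}}\big)$.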


Figure 2. Convergence of the objective values and trajectories associated with the system (DIN-AVD)$_{\alpha,\beta,b}$ for different choices of $\beta(t)$ and $b(t)$.

Case 4. Let us illustrate our results in the case $b(t) = c\,t^{b}$, $\beta(t) = t^{\beta}$. We have $w(t) = c\,t^{b} - (\beta+1)t^{\beta-1}$ and $\dot w(t) = c\,b\,t^{b-1} - (\beta^2-1)t^{\beta-2}$. The conditions (G$_2$), (G$_3$) can be written respectively as
\[
(7)\qquad c\,t^{b} > (\beta+1)t^{\beta-1} \quad\text{and}\quad c\,(b-\alpha+3)\,t^{b} \le (\beta+1)(\beta-\alpha+2)\,t^{\beta-1}.
\]
When $b = \beta - 1$, the conditions (7) are equivalent to $\beta < c - 1$ and $\beta \le \alpha - 2$, which gives the convergence rate $f(x(t)) - \min_{\mathcal H} f = O\!\big(\frac{1}{t^{\beta+1}}\big)$.

Let us apply these choices to the quadratic function $f : (x_1, x_2) \in \mathbb R^2 \mapsto (x_1 + x_2)^2$. $f$ is convex but not strongly so, and $\operatorname{argmin} f = \{(x_1, x_2) \in \mathbb R^2 : x_2 = -x_1\}$. The closed-form solution of the ODE with this choice of $\beta(t)$ and $b(t)$ is given in Appendix A.3. We choose the values $\alpha = 5$, $\beta = 3$, $b = \beta - 1 = 2$ and $c = 5$ in order to satisfy condition (7). The left panel of Figure 2 depicts the convergence profile of the function values, and its right panel the trajectories associated with the system (DIN-AVD)$_{\alpha,\beta,b}$ for different parameter scenarios. Once again, the damping of oscillations due to the presence of the Hessian is observed.
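As a quick sanity check (not part of the paper), the growth conditions (G$_2$)-(G$_3$) for this parameter choice can be verified symbolically; the snippet below assumes SymPy and encodes the Case 4 data $\alpha = 5$, $c = 5$, $\beta(t) = t^3$, $b(t) = 5t^2$.

import sympy as sp

t = sp.symbols('t', positive=True)
alpha, c = 5, 5
beta_t = t**3                         # beta(t) = t^beta with beta = 3
b_t = c * t**2                        # b(t)   = c*t^b  with b = beta - 1 = 2

w = sp.simplify(b_t - sp.diff(beta_t, t) - beta_t / t)        # w = b - beta' - beta/t
g3_slack = sp.simplify((alpha - 3) * w - t * sp.diff(w, t))   # (G3): t*w' <= (alpha-3)*w

print("w(t) =", w)               # t**2 > 0, so (G2) holds
print("(G3) slack =", g3_slack)  # 0, so (G3) holds with equality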

3. Inertial algorithms for general convex functions

3.1. Proximal algorithms.

3.1.1. Smooth case. Implicit time discretization of (DIN-AVD)$_{\alpha,\beta,b}$, with step size $h > 0$, gives
\[
\frac{x_{k+1} - 2x_k + x_{k-1}}{h^2} + \frac{\alpha}{kh}\,\frac{x_{k+1} - x_k}{h} + \frac{\beta_k}{h}\big(\nabla f(x_{k+1}) - \nabla f(x_k)\big) + b_k\nabla f(x_{k+1}) = 0.
\]
Equivalently,
\[
(8)\qquad k(x_{k+1} - 2x_k + x_{k-1}) + \alpha(x_{k+1} - x_k) + \beta_k h k\big(\nabla f(x_{k+1}) - \nabla f(x_k)\big) + b_k h^2 k\,\nabla f(x_{k+1}) = 0.
\]


Set $s = h^2$. We obtain the following algorithm, with $\beta_k$ and $b_k$ varying with $k$:

(IPAHD): Inertial Proximal Algorithm with Hessian Damping.

Step $k$: set $\mu_k := \frac{k}{k+\alpha}\big(\beta_k\sqrt s + s\,b_k\big)$ and
\[
\text{(IPAHD)}\qquad
\begin{cases}
y_k = x_k + \Big(1 - \dfrac{\alpha}{k+\alpha}\Big)(x_k - x_{k-1}) + \beta_k\sqrt s\,\Big(1 - \dfrac{\alpha}{k+\alpha}\Big)\nabla f(x_k)\\[6pt]
x_{k+1} = \operatorname{prox}_{\mu_k f}(y_k).
\end{cases}
\]
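As a minimal illustration (not taken from the paper), the sketch below runs (IPAHD) with constant $\beta_k \equiv \beta$ and $b_k \equiv 1$ on a quadratic $f(x) = \frac{1}{2}x^\top A x$, for which $\operatorname{prox}_{\mu f}(y) = (I + \mu A)^{-1}y$; the test matrix and parameter values are illustrative choices, and NumPy is assumed.

import numpy as np

A = np.diag([1.0, 1000.0])               # f(x) = 0.5 * x^T A x
grad = lambda x: A @ x
I2 = np.eye(2)

alpha, beta, s = 3.0, 1.0, 1.0
h = np.sqrt(s)

x_prev = x = np.array([1.0, 1.0])
for k in range(1, 301):
    r = k / (k + alpha)                  # = 1 - alpha/(k + alpha)
    mu_k = r * (beta * h + s * 1.0)      # b_k = 1
    y = x + r * (x - x_prev) + beta * h * r * grad(x)
    x, x_prev = np.linalg.solve(I2 + mu_k * A, y), x   # proximal step

print("f(x_300) =", 0.5 * x @ A @ x)
print("||grad f(x_300)|| =", np.linalg.norm(grad(x)))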

Theorem 3.1. Suppose that $\alpha \ge 1$. Set
\[
(9)\qquad \delta_k := h\big[b_k h k - \beta_{k+1} - k(\beta_{k+1} - \beta_k)\big](k+1),
\]
and suppose that the following growth conditions are satisfied:
\[
(\mathrm G_2)\quad b_k h k - \beta_{k+1} - k(\beta_{k+1} - \beta_k) > 0; \qquad
(\mathrm G_3)\quad \delta_{k+1} - \delta_k \le (\alpha - 1)\,\frac{\delta_k}{k+1}.
\]
Then $\delta_k$ is positive and, for any sequence $(x_k)_{k\in\mathbb N}$ generated by (IPAHD),

i) $f(x_k) - \min_{\mathcal H} f = O\!\Big(\dfrac{1}{\delta_k}\Big) = O\!\Bigg(\dfrac{1}{k(k+1)\big(b_k h - \frac{1}{k}\beta_{k+1} - (\beta_{k+1}-\beta_k)\big)}\Bigg)$;

ii) $\displaystyle\sum_k \delta_k\beta_{k+1}\|\nabla f(x_{k+1})\|^2 < +\infty$.

Proof. Given $x^\star \in \operatorname{argmin} f$, set
\[
E_k := \delta_k\big(f(x_k) - f(x^\star)\big) + \tfrac{1}{2}\|v_k\|^2, \quad\text{where}\quad
v_k := (\alpha-1)(x_k - x^\star) + k\big(x_k - x_{k-1} + \beta_k h\nabla f(x_k)\big),
\]
and $(\delta_k)_{k\in\mathbb N}$ is a positive sequence that will be adjusted. Set $\Delta E_k := E_{k+1} - E_k$, i.e.,
\[
\Delta E_k = (\delta_{k+1} - \delta_k)\big(f(x_{k+1}) - f(x^\star)\big) + \delta_k\big(f(x_{k+1}) - f(x_k)\big) + \tfrac{1}{2}\big(\|v_{k+1}\|^2 - \|v_k\|^2\big).
\]
Let us evaluate the last term with the help of the three-point identity $\tfrac{1}{2}\|v_{k+1}\|^2 - \tfrac{1}{2}\|v_k\|^2 = \langle v_{k+1} - v_k, v_{k+1}\rangle - \tfrac{1}{2}\|v_{k+1} - v_k\|^2$. Using successively the definition of $v_k$ and (8), we get
\begin{align*}
v_{k+1} - v_k &= (\alpha-1)(x_{k+1} - x_k) + (k+1)\big(x_{k+1} - x_k + \beta_{k+1}h\nabla f(x_{k+1})\big) - k\big(x_k - x_{k-1} + \beta_k h\nabla f(x_k)\big)\\
&= \alpha(x_{k+1} - x_k) + k(x_{k+1} - 2x_k + x_{k-1}) + \beta_{k+1}h\nabla f(x_{k+1}) + hk\big(\beta_{k+1}\nabla f(x_{k+1}) - \beta_k\nabla f(x_k)\big)\\
&= \big[\alpha(x_{k+1} - x_k) + k(x_{k+1} - 2x_k + x_{k-1}) + kh\beta_k\big(\nabla f(x_{k+1}) - \nabla f(x_k)\big)\big]\\
&\qquad + \beta_{k+1}h\nabla f(x_{k+1}) + kh(\beta_{k+1} - \beta_k)\nabla f(x_{k+1})\\
&= -b_k h^2 k\,\nabla f(x_{k+1}) + \beta_{k+1}h\nabla f(x_{k+1}) + kh(\beta_{k+1} - \beta_k)\nabla f(x_{k+1})\\
&= h\big[\beta_{k+1} + k(\beta_{k+1} - \beta_k) - b_k h k\big]\nabla f(x_{k+1}).
\end{align*}
Set shortly $C_k := \beta_{k+1} + k(\beta_{k+1} - \beta_k) - b_k h k$. We have obtained
\begin{align*}
\tfrac{1}{2}\|v_{k+1}\|^2 - \tfrac{1}{2}\|v_k\|^2 &= -\frac{h^2}{2}C_k^2\|\nabla f(x_{k+1})\|^2 + hC_k\Big\langle\nabla f(x_{k+1}),\,(\alpha-1)(x_{k+1} - x^\star) + (k+1)\big(x_{k+1} - x_k + \beta_{k+1}h\nabla f(x_{k+1})\big)\Big\rangle\\
&= -h^2\Big(\tfrac{1}{2}C_k^2 - (k+1)C_k\beta_{k+1}\Big)\|\nabla f(x_{k+1})\|^2 - (\alpha-1)hC_k\langle\nabla f(x_{k+1}), x^\star - x_{k+1}\rangle\\
&\qquad - hC_k(k+1)\langle\nabla f(x_{k+1}), x_k - x_{k+1}\rangle.
\end{align*}
Let us assume that, for $k$ large enough,
\[
-C_k = b_k h k - \beta_{k+1} - k(\beta_{k+1} - \beta_k) \ge 0.
\]
Then, in the above expression, the coefficient of $\|\nabla f(x_{k+1})\|^2$ is less than or equal to zero, which gives
\[
\tfrac{1}{2}\|v_{k+1}\|^2 - \tfrac{1}{2}\|v_k\|^2 \le -(\alpha-1)hC_k\langle\nabla f(x_{k+1}), x^\star - x_{k+1}\rangle - hC_k(k+1)\langle\nabla f(x_{k+1}), x_k - x_{k+1}\rangle.
\]
According to the (convex) subdifferential inequality and $C_k \le 0$, we infer
\[
\tfrac{1}{2}\|v_{k+1}\|^2 - \tfrac{1}{2}\|v_k\|^2 \le -(\alpha-1)hC_k\big(f(x^\star) - f(x_{k+1})\big) - hC_k(k+1)\big(f(x_k) - f(x_{k+1})\big).
\]
Take $\delta_k := -hC_k(k+1) = h\big[b_k h k - \beta_{k+1} - k(\beta_{k+1} - \beta_k)\big](k+1)$, so that the terms $f(x_k) - f(x_{k+1})$ cancel in $E_{k+1} - E_k$. We obtain
\[
E_{k+1} - E_k \le \Big(\delta_{k+1} - \delta_k - (\alpha-1)h\big[b_k h k - \beta_{k+1} - k(\beta_{k+1} - \beta_k)\big]\Big)\big(f(x_{k+1}) - f(x^\star)\big),
\]
equivalently
\[
E_{k+1} - E_k \le \Big(\delta_{k+1} - \delta_k - (\alpha-1)\frac{\delta_k}{k+1}\Big)\big(f(x_{k+1}) - f(x^\star)\big).
\]
By assumption (G$_3$), we have $\delta_{k+1} - \delta_k - (\alpha-1)\frac{\delta_k}{k+1} \le 0$. Therefore the sequence $(E_k)_{k\in\mathbb N}$ is non-increasing, which, by definition of $E_k$, gives, for $k \ge 0$,
\[
f(x_k) - \min_{\mathcal H} f \le \frac{E_0}{\delta_k}.
\]
By summing the inequalities
\[
E_{k+1} - E_k + h^2\Big(\tfrac{1}{2}\big(\beta_{k+1} + k(\beta_{k+1} - \beta_k) - b_k h k\big)^2 + (k+1)\beta_{k+1}\big(b_k h k - \beta_{k+1} - k(\beta_{k+1} - \beta_k)\big)\Big)\|\nabla f(x_{k+1})\|^2 \le 0,
\]
we finally obtain $\sum_k \delta_k\beta_{k+1}\|\nabla f(x_{k+1})\|^2 < +\infty$. $\qed$

3.1.2. Non-smooth case. Let $f : \mathcal H \to \mathbb R \cup \{+\infty\}$ be a proper, lower semicontinuous and convex function. We rely on the basic properties of the Moreau-Yosida regularization. Let $f_\lambda$ be the Moreau envelope of $f$ of index $\lambda > 0$, which is defined by
\[
f_\lambda(x) = \min_{z\in\mathcal H}\Big\{f(z) + \frac{1}{2\lambda}\|z - x\|^2\Big\}, \quad\text{for any } x \in \mathcal H.
\]
We recall that $f_\lambda$ is a convex function whose gradient is $\lambda^{-1}$-Lipschitz continuous, and that $\operatorname{argmin} f_\lambda = \operatorname{argmin} f$. The interested reader may refer to [17, 19] for a comprehensive treatment of the Moreau envelope in a Hilbert setting. Since the set of minimizers is preserved by taking the Moreau envelope, the idea is to replace $f$ by $f_\lambda$ in the previous algorithm, and to take advantage of the fact that $f_\lambda$ is continuously differentiable.

The Hessian-driven dynamic attached to $f_\lambda$ becomes
\[
\ddot x(t) + \frac{\alpha}{t}\dot x(t) + \beta\nabla^2 f_\lambda(x(t))\dot x(t) + b(t)\nabla f_\lambda(x(t)) = 0.
\]

However, we do not really need to work with this system (which requires $f_\lambda$ to be $C^2$), but with its discretized form, which only requires the function to be continuously differentiable, as is the case of $f_\lambda$. Then algorithm (IPAHD) reads
\[
\begin{cases}
y_k = x_k + \Big(1 - \dfrac{\alpha}{k+\alpha}\Big)(x_k - x_{k-1}) + \beta\sqrt s\,\Big(1 - \dfrac{\alpha}{k+\alpha}\Big)\nabla f_\lambda(x_k)\\[6pt]
x_{k+1} = \operatorname{prox}_{\frac{k}{k+\alpha}(\beta\sqrt s + s b_k)\,f_\lambda}(y_k).
\end{cases}
\]
By applying Theorem 3.1 we obtain that, under the assumptions (G$_2$) and (G$_3$),
\[
f_\lambda(x_k) - \min_{\mathcal H} f = O\!\Big(\frac{1}{k^2 b_k}\Big), \qquad \sum_k k^2 b_k\|\nabla f_\lambda(x_{k+1})\|^2 < +\infty.
\]

Thus, we just need to formulate these results in terms of $f$ and its proximal mapping. This is straightforward thanks to the following formulae from proximal calculus [17]:

• $f_\lambda(x) = f(\operatorname{prox}_{\lambda f}(x)) + \frac{1}{2\lambda}\|x - \operatorname{prox}_{\lambda f}(x)\|^2$;

• $\nabla f_\lambda(x) = \frac{1}{\lambda}\big(x - \operatorname{prox}_{\lambda f}(x)\big)$;

• $\operatorname{prox}_{\theta f_\lambda}(x) = \frac{\lambda}{\lambda+\theta}\,x + \frac{\theta}{\lambda+\theta}\operatorname{prox}_{(\lambda+\theta) f}(x)$.

We obtain the following relaxed inertial proximal algorithm (NS stands for Non-Smooth):

(IPAHD-NS): set $\mu_k := \dfrac{\lambda(k+\alpha)}{\lambda(k+\alpha) + k(\beta\sqrt s + s b_k)}$ and
\[
\begin{cases}
y_k = x_k + \Big(1 - \dfrac{\alpha}{k+\alpha}\Big)(x_k - x_{k-1}) + \dfrac{\beta\sqrt s}{\lambda}\Big(1 - \dfrac{\alpha}{k+\alpha}\Big)\big(x_k - \operatorname{prox}_{\lambda f}(x_k)\big)\\[6pt]
x_{k+1} = \mu_k y_k + (1 - \mu_k)\operatorname{prox}_{\frac{\lambda}{\mu_k} f}(y_k).
\end{cases}
\]

Theorem 3.2. Let $f : \mathcal H \to \mathbb R \cup \{+\infty\}$ be a convex, lower semicontinuous, proper function. Suppose that the following growth conditions are satisfied:
\[
(\mathrm G_2)\quad b_k h k - \beta_{k+1} - k(\beta_{k+1} - \beta_k) > 0; \qquad
(\mathrm G_3)\quad \delta_{k+1} - \delta_k \le (\alpha - 1)\,\frac{\delta_k}{k+1},
\]
where the sequence $(\delta_k)$ has been defined in (9). Then, for any sequence $(x_k)_{k\in\mathbb N}$ generated by (IPAHD-NS), the following holds:
\[
f(\operatorname{prox}_{\lambda f}(x_k)) - \min_{\mathcal H} f = O\!\Big(\frac{1}{k^2 b_k}\Big), \qquad
\sum_k \delta_k\beta_{k+1}\,\big\|x_{k+1} - \operatorname{prox}_{\lambda f}(x_{k+1})\big\|^2 < +\infty.
\]

3.2. Gradient algorithms. Take $f$ a convex function whose gradient is $L$-Lipschitz continuous. Our analysis is based on the dynamic (DIN-AVD)$_{\alpha,\beta,1+\frac{\beta}{t}}$ considered in Theorem 2.3, with damping parameters $\alpha \ge 3$, $\beta \ge 0$. Consider the time discretization of (DIN-AVD)$_{\alpha,\beta,1+\frac{\beta}{t}}$
\[
\frac{1}{s}(x_{k+1} - 2x_k + x_{k-1}) + \frac{\alpha}{ks}(x_k - x_{k-1}) + \frac{\beta}{\sqrt s}\big(\nabla f(x_k) - \nabla f(x_{k-1})\big) + \frac{\beta}{k\sqrt s}\nabla f(x_{k-1}) + \nabla f(y_k) = 0,
\]
with $y_k$ inspired by Nesterov's accelerated scheme. We obtain the following scheme:

(IGAHD): Inertial Gradient Algorithm with Hessian Damping.

Step $k$: $\alpha_k = 1 - \frac{\alpha}{k}$ and
\[
\text{(IGAHD)}\qquad
\begin{cases}
y_k = x_k + \alpha_k(x_k - x_{k-1}) - \beta\sqrt s\,\big(\nabla f(x_k) - \nabla f(x_{k-1})\big) - \dfrac{\beta\sqrt s}{k}\nabla f(x_{k-1})\\[6pt]
x_{k+1} = y_k - s\nabla f(y_k).
\end{cases}
\]
Following [5], set $t_{k+1} = \frac{k}{\alpha-1}$, whence $t_k = 1 + t_{k+1}\alpha_k$.
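For illustration, a direct transcription of (IGAHD) on the ill-conditioned quadratic of Section 1.1.3 looks as follows (a sketch, not the experimental code of Section 6; NumPy is assumed, and the step size obeys $sL \le 1$ with $0 \le \beta < 2\sqrt s$).

import numpy as np

A = np.diag([1.0, 1000.0])             # f(x1,x2) = 0.5*(x1^2 + 1000*x2^2)
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

L = 1000.0
s = 1.0 / L
alpha, beta = 3.0, np.sqrt(s)          # alpha >= 3, 0 <= beta < 2*sqrt(s)

x_prev = x = np.array([1.0, 1.0])
for k in range(1, 501):
    ak = 1.0 - alpha / k
    g, g_prev = grad(x), grad(x_prev)
    y = (x + ak * (x - x_prev)
         - beta * np.sqrt(s) * (g - g_prev)
         - (beta * np.sqrt(s) / k) * g_prev)
    x, x_prev = y - s * grad(y), x

print("f(x_500) =", f(x))              # O(1/k^2) decay of the values
print("||grad f(x_500)|| =", np.linalg.norm(grad(x)))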


Given $x^\star \in \operatorname{argmin} f$, our Lyapunov analysis is based on the sequence $(E_k)_{k\in\mathbb N}$ defined by
\[
(10)\qquad E_k := t_k^2\big(f(x_k) - f(x^\star)\big) + \frac{1}{2s}\|v_k\|^2,
\]
\[
(11)\qquad v_k := (x_{k-1} - x^\star) + t_k\big(x_k - x_{k-1} + \beta\sqrt s\,\nabla f(x_{k-1})\big).
\]

Theorem 3.3. Let $f : \mathcal H \to \mathbb R$ be a convex function whose gradient is $L$-Lipschitz continuous. Let $(x_k)_{k\in\mathbb N}$ be a sequence generated by algorithm (IGAHD), where $\alpha \ge 3$, $0 \le \beta < 2\sqrt s$ and $sL \le 1$. Then the sequence $(E_k)_{k\in\mathbb N}$ defined by (10)-(11) is non-increasing, and the following convergence rates are satisfied:

i) $f(x_k) - \min_{\mathcal H} f = O\!\big(\frac{1}{k^2}\big)$ as $k \to +\infty$;

ii) Suppose that $\beta > 0$. Then $\sum_k k^2\|\nabla f(y_k)\|^2 < +\infty$ and $\sum_k k^2\|\nabla f(x_k)\|^2 < +\infty$.

Proof. We rely on the following reinforced version of the gradient descent lemma (Lemma A.1 in Appendix A.1). Since $s \le \frac{1}{L}$ and $\nabla f$ is $L$-Lipschitz continuous,
\[
f(y - s\nabla f(y)) \le f(x) + \langle\nabla f(y), y - x\rangle - \frac{s}{2}\|\nabla f(y)\|^2 - \frac{s}{2}\|\nabla f(x) - \nabla f(y)\|^2
\]
for all $x, y \in \mathcal H$. Let us write it successively at $y = y_k$, $x = x_k$, then at $y = y_k$, $x = x^\star$. According to $x_{k+1} = y_k - s\nabla f(y_k)$ and $\nabla f(x^\star) = 0$, we get
\[
(12)\qquad f(x_{k+1}) \le f(x_k) + \langle\nabla f(y_k), y_k - x_k\rangle - \frac{s}{2}\|\nabla f(y_k)\|^2 - \frac{s}{2}\|\nabla f(x_k) - \nabla f(y_k)\|^2,
\]
\[
(13)\qquad f(x_{k+1}) \le f(x^\star) + \langle\nabla f(y_k), y_k - x^\star\rangle - \frac{s}{2}\|\nabla f(y_k)\|^2 - \frac{s}{2}\|\nabla f(y_k)\|^2.
\]
Multiplying (12) by $t_{k+1} - 1 \ge 0$, then adding (13), we derive that
\begin{align*}
(14)\qquad t_{k+1}\big(f(x_{k+1}) - f(x^\star)\big) \le\;& (t_{k+1} - 1)\big(f(x_k) - f(x^\star)\big) + \big\langle\nabla f(y_k), (t_{k+1} - 1)(y_k - x_k) + y_k - x^\star\big\rangle\\
&- \frac{s}{2}t_{k+1}\|\nabla f(y_k)\|^2 - \frac{s}{2}(t_{k+1} - 1)\|\nabla f(x_k) - \nabla f(y_k)\|^2 - \frac{s}{2}\|\nabla f(y_k)\|^2.
\end{align*}

Let us multiply (14) by $t_{k+1}$ to make $E_k$ appear. We obtain
\begin{align*}
t_{k+1}^2\big(f(x_{k+1}) - f(x^\star)\big) \le\;& (t_{k+1}^2 - t_{k+1} - t_k^2)\big(f(x_k) - f(x^\star)\big) + t_k^2\big(f(x_k) - f(x^\star)\big)\\
&+ t_{k+1}\big\langle\nabla f(y_k), (t_{k+1} - 1)(y_k - x_k) + y_k - x^\star\big\rangle - \frac{s}{2}t_{k+1}^2\|\nabla f(y_k)\|^2\\
&- \frac{s}{2}(t_{k+1}^2 - t_{k+1})\|\nabla f(x_k) - \nabla f(y_k)\|^2 - \frac{s}{2}t_{k+1}\|\nabla f(y_k)\|^2.
\end{align*}
Since $\alpha \ge 3$ we have $t_{k+1}^2 - t_{k+1} - t_k^2 \le 0$, which gives
\begin{align*}
t_{k+1}^2\big(f(x_{k+1}) - f(x^\star)\big) \le\;& t_k^2\big(f(x_k) - f(x^\star)\big) + t_{k+1}\big\langle\nabla f(y_k), (t_{k+1} - 1)(y_k - x_k) + y_k - x^\star\big\rangle - \frac{s}{2}t_{k+1}^2\|\nabla f(y_k)\|^2\\
&- \frac{s}{2}(t_{k+1}^2 - t_{k+1})\|\nabla f(x_k) - \nabla f(y_k)\|^2 - \frac{s}{2}t_{k+1}\|\nabla f(y_k)\|^2.
\end{align*}
According to the definition of $E_k$, we infer
\begin{align*}
E_{k+1} - E_k \le\;& t_{k+1}\big\langle\nabla f(y_k), (t_{k+1} - 1)(y_k - x_k) + y_k - x^\star\big\rangle - \frac{s}{2}t_{k+1}^2\|\nabla f(y_k)\|^2\\
&- \frac{s}{2}(t_{k+1}^2 - t_{k+1})\|\nabla f(x_k) - \nabla f(y_k)\|^2 - \frac{s}{2}t_{k+1}\|\nabla f(y_k)\|^2 + \frac{1}{2s}\|v_{k+1}\|^2 - \frac{1}{2s}\|v_k\|^2.
\end{align*}
Let us compute this last expression with the help of the elementary identity
\[
\tfrac{1}{2}\|v_{k+1}\|^2 - \tfrac{1}{2}\|v_k\|^2 = \langle v_{k+1} - v_k, v_{k+1}\rangle - \tfrac{1}{2}\|v_{k+1} - v_k\|^2.
\]

By definition of $v_k$, according to (IGAHD) and $t_k - 1 = t_{k+1}\alpha_k$, we have
\begin{align*}
v_{k+1} - v_k &= x_k - x_{k-1} + t_{k+1}\big(x_{k+1} - x_k + \beta\sqrt s\,\nabla f(x_k)\big) - t_k\big(x_k - x_{k-1} + \beta\sqrt s\,\nabla f(x_{k-1})\big)\\
&= t_{k+1}(x_{k+1} - x_k) - (t_k - 1)(x_k - x_{k-1}) + \beta\sqrt s\,\big(t_{k+1}\nabla f(x_k) - t_k\nabla f(x_{k-1})\big)\\
&= t_{k+1}\big(x_{k+1} - (x_k + \alpha_k(x_k - x_{k-1}))\big) + \beta\sqrt s\,\big(t_{k+1}\nabla f(x_k) - t_k\nabla f(x_{k-1})\big)\\
&= t_{k+1}(x_{k+1} - y_k) - t_{k+1}\beta\sqrt s\,\big(\nabla f(x_k) - \nabla f(x_{k-1})\big) - \frac{t_{k+1}\beta\sqrt s}{k}\nabla f(x_{k-1}) + \beta\sqrt s\,\big(t_{k+1}\nabla f(x_k) - t_k\nabla f(x_{k-1})\big)\\
&= t_{k+1}(x_{k+1} - y_k) + \beta\sqrt s\,\Big(t_{k+1}\big(1 - \tfrac{1}{k}\big) - t_k\Big)\nabla f(x_{k-1})\\
&= t_{k+1}(x_{k+1} - y_k) = -s\,t_{k+1}\nabla f(y_k).
\end{align*}
Hence
\[
\frac{1}{2s}\|v_{k+1}\|^2 - \frac{1}{2s}\|v_k\|^2 = -\frac{s}{2}t_{k+1}^2\|\nabla f(y_k)\|^2 - t_{k+1}\Big\langle\nabla f(y_k),\, x_k - x^\star + t_{k+1}\big(x_{k+1} - x_k + \beta\sqrt s\,\nabla f(x_k)\big)\Big\rangle.
\]
Collecting the above results, we obtain

\begin{align*}
E_{k+1} - E_k \le\;& t_{k+1}\big\langle\nabla f(y_k), (t_{k+1} - 1)(y_k - x_k) + y_k - x^\star\big\rangle - s\,t_{k+1}^2\|\nabla f(y_k)\|^2\\
&- t_{k+1}\Big\langle\nabla f(y_k),\, x_k - x^\star + t_{k+1}\big(x_{k+1} - x_k + \beta\sqrt s\,\nabla f(x_k)\big)\Big\rangle\\
&- \frac{s}{2}(t_{k+1}^2 - t_{k+1})\|\nabla f(x_k) - \nabla f(y_k)\|^2 - \frac{s}{2}t_{k+1}\|\nabla f(y_k)\|^2.
\end{align*}
Equivalently,
\[
E_{k+1} - E_k \le t_{k+1}\langle\nabla f(y_k), A_k\rangle - s\,t_{k+1}^2\|\nabla f(y_k)\|^2 - \frac{s}{2}(t_{k+1}^2 - t_{k+1})\|\nabla f(x_k) - \nabla f(y_k)\|^2 - \frac{s}{2}t_{k+1}\|\nabla f(y_k)\|^2,
\]
with
\begin{align*}
A_k &= (t_{k+1} - 1)(y_k - x_k) + y_k - x_k - t_{k+1}\big(x_{k+1} - x_k + \beta\sqrt s\,\nabla f(x_k)\big)\\
&= t_{k+1}y_k - t_{k+1}x_k - t_{k+1}(x_{k+1} - x_k) - t_{k+1}\beta\sqrt s\,\nabla f(x_k)\\
&= t_{k+1}(y_k - x_{k+1}) - t_{k+1}\beta\sqrt s\,\nabla f(x_k)\\
&= s\,t_{k+1}\nabla f(y_k) - t_{k+1}\beta\sqrt s\,\nabla f(x_k).
\end{align*}
Consequently,
\begin{align*}
E_{k+1} - E_k &\le t_{k+1}\big\langle\nabla f(y_k),\, s\,t_{k+1}\nabla f(y_k) - t_{k+1}\beta\sqrt s\,\nabla f(x_k)\big\rangle - s\,t_{k+1}^2\|\nabla f(y_k)\|^2\\
&\qquad - \frac{s}{2}(t_{k+1}^2 - t_{k+1})\|\nabla f(x_k) - \nabla f(y_k)\|^2 - \frac{s}{2}t_{k+1}\|\nabla f(y_k)\|^2\\
&= -t_{k+1}^2\beta\sqrt s\,\langle\nabla f(y_k), \nabla f(x_k)\rangle - \frac{s}{2}(t_{k+1}^2 - t_{k+1})\|\nabla f(x_k) - \nabla f(y_k)\|^2 - \frac{s}{2}t_{k+1}\|\nabla f(y_k)\|^2\\
&= -t_{k+1}B_k,
\end{align*}
where
\[
B_k := t_{k+1}\beta\sqrt s\,\langle\nabla f(y_k), \nabla f(x_k)\rangle + \frac{s}{2}(t_{k+1} - 1)\|\nabla f(x_k) - \nabla f(y_k)\|^2 + \frac{s}{2}\|\nabla f(y_k)\|^2.
\]

When $\beta = 0$ we have $B_k \ge 0$. Let us analyze the sign of $B_k$ in the case $\beta > 0$. Set $Y = \nabla f(y_k)$, $X = \nabla f(x_k)$. We have
\begin{align*}
B_k &= \frac{s}{2}\|Y\|^2 + \frac{s}{2}(t_{k+1} - 1)\|Y - X\|^2 + t_{k+1}\beta\sqrt s\,\langle Y, X\rangle\\
&= \frac{s}{2}t_{k+1}\|Y\|^2 + \big(t_{k+1}(\beta\sqrt s - s) + s\big)\langle Y, X\rangle + \frac{s}{2}(t_{k+1} - 1)\|X\|^2\\
&\ge \frac{s}{2}t_{k+1}\|Y\|^2 - \big(t_{k+1}(\beta\sqrt s - s) + s\big)\|Y\|\,\|X\| + \frac{s}{2}(t_{k+1} - 1)\|X\|^2.
\end{align*}
Elementary algebra gives that the above quadratic form is non-negative when
\[
\big(t_{k+1}(\beta\sqrt s - s) + s\big)^2 \le s^2\,t_{k+1}(t_{k+1} - 1).
\]
Recall that $t_k$ is of order $k$. Hence, this inequality is satisfied for $k$ large enough if $(\beta\sqrt s - s)^2 < s^2$, which is equivalent to $\beta < 2\sqrt s$. Under this condition $E_{k+1} - E_k \le 0$, which gives conclusion i). A similar argument gives that, for $0 < \epsilon < 2\sqrt s\,\beta - \beta^2$ (such an $\epsilon$ exists according to the assumption $0 < \beta < 2\sqrt s$),
\[
E_{k+1} - E_k + \frac{\epsilon}{2}\,t_{k+1}^2\|\nabla f(y_k)\|^2 \le 0.
\]
After summation of these inequalities, we obtain conclusion ii). $\qed$

Remark 3.4. From $\sum_k k^2\|\nabla f(x_k)\|^2 < +\infty$ we immediately infer that, for $k \ge 1$,
\[
\Big(\inf_{i=1,\dots,k}\|\nabla f(x_i)\|^2\Big)\sum_{i=1}^{k} i^2 \;\le\; \sum_{i=1}^{k} i^2\|\nabla f(x_i)\|^2 \;\le\; \sum_{i\in\mathbb N} i^2\|\nabla f(x_i)\|^2 < +\infty.
\]
A similar argument holds for $y_k$. Hence
\[
\inf_{i=1,\dots,k}\|\nabla f(x_i)\|^2 = O\!\Big(\frac{1}{k^3}\Big), \qquad \inf_{i=1,\dots,k}\|\nabla f(y_i)\|^2 = O\!\Big(\frac{1}{k^3}\Big).
\]

Remark 3.5. In Theorem 3.3, the convergence of the values is expressed in terms of the sequence $(x_k)_{k\in\mathbb N}$. It is natural to ask whether a similar result holds for the sequence $(y_k)_{k\in\mathbb N}$. This is an open question in the case of Nesterov's accelerated gradient method and of the corresponding FISTA algorithm for structured minimization [26, 18]. In the case of the Hessian-driven damping algorithms, we give a partial answer to this question. By the classical descent lemma and the monotonicity of $\nabla f$, we have
\begin{align*}
f(y_k) &\le f(x_{k+1}) + \langle y_k - x_{k+1}, \nabla f(x_{k+1})\rangle + \frac{L}{2}\|y_k - x_{k+1}\|^2\\
&\le f(x_{k+1}) + \langle y_k - x_{k+1}, \nabla f(y_k)\rangle + \frac{L}{2}\|y_k - x_{k+1}\|^2.
\end{align*}
According to $x_{k+1} = y_k - s\nabla f(y_k)$ we obtain
\[
f(y_k) - \min_{\mathcal H} f \le f(x_{k+1}) - \min_{\mathcal H} f + s\|\nabla f(y_k)\|^2 + \frac{s^2 L}{2}\|\nabla f(y_k)\|^2.
\]
From Theorem 3.3 we deduce that
\[
f(y_k) - \min_{\mathcal H} f \le O\!\Big(\frac{1}{k^2}\Big) + \Big(s + \frac{s^2 L}{2}\Big)\|\nabla f(y_k)\|^2 = O\!\Big(\frac{1}{k^2}\Big) + o\!\Big(\frac{1}{k^2}\Big).
\]

Remark 3.6. When $f$ is a proper, lower semicontinuous convex function, but not necessarily smooth, we follow the same reasoning as in Section 3.1.2. We consider minimizing the Moreau envelope $f_\lambda$ of $f$, whose gradient is $1/\lambda$-Lipschitz continuous, and then apply (IGAHD) to $f_\lambda$. We omit the details for the sake of brevity. This observation will be very useful for solving structured composite problems, as we will describe in Section 6.


4. Inertial dynamics for strongly convex functions

4.1. Smooth case. Recall the classical definition of strong convexity:

Definition 4.1. A function $f : \mathcal H \to \mathbb R$ is said to be $\mu$-strongly convex for some $\mu > 0$ if $f - \frac{\mu}{2}\|\cdot\|^2$ is convex.

For strongly convex functions, a suitable choice of $\gamma$ and $\beta$ in (DIN)$_{\gamma,\beta}$ provides exponential decay of the function values (hence of the trajectory) and of the gradients. This corresponds to linear convergence in the algorithmic case. It can be seen as an extension of the Nesterov accelerated method for strongly convex functions, which corresponds to the particular case $\beta = 0$. The result in the case $\beta = 0$ was considered in [29, Theorem 2.2]. In the case $\beta > 0$, a related but different result can be found in [32, Theorem 1]. The gradient estimate is new.

Theorem 4.2. Suppose that $f : \mathcal H \to \mathbb R$ is $\mu$-strongly convex for some $\mu > 0$. Let $x(\cdot) : [t_0,+\infty[ \to \mathcal H$ be a solution trajectory of
\[
(15)\qquad \ddot x(t) + 2\sqrt\mu\,\dot x(t) + \beta\nabla^2 f(x(t))\dot x(t) + \nabla f(x(t)) = 0.
\]
Suppose that $0 \le \beta \le \frac{1}{2\sqrt\mu}$. Then the following hold:

i) for all $t \ge t_0$,
\[
\frac{\mu}{2}\|x(t) - x^\star\|^2 \le f(x(t)) - \min_{\mathcal H} f \le C e^{-\frac{\sqrt\mu}{2}(t - t_0)},
\]
where $C := f(x(t_0)) - \min_{\mathcal H} f + \mu\,\mathrm{dist}(x(t_0), S)^2 + \|\dot x(t_0) + \beta\nabla f(x(t_0))\|^2$.

ii) There exists some constant $C_1 > 0$ such that, for all $t \ge t_0$,
\[
e^{-\sqrt\mu\,t}\int_{t_0}^{t} e^{\sqrt\mu\,s}\|\nabla f(x(s))\|^2\,ds \le C_1 e^{-\frac{\sqrt\mu}{2}t}.
\]
Moreover, $\int_{t_0}^{+\infty} e^{\frac{\sqrt\mu}{2}t}\|\dot x(t)\|^2\,dt < +\infty$.

When $\beta = 0$, we have $f(x(t)) - \min_{\mathcal H} f = O\big(e^{-\sqrt\mu\,t}\big)$ as $t \to +\infty$.

Proof. i) Let $x^\star$ be the unique minimizer of $f$. Define $E : [t_0,+\infty[ \to \mathbb R_+$ by
\[
E(t) := f(x(t)) - \min_{\mathcal H} f + \frac{1}{2}\big\|\sqrt\mu\,(x(t) - x^\star) + \dot x(t) + \beta\nabla f(x(t))\big\|^2.
\]
Set $v(t) = \sqrt\mu\,(x(t) - x^\star) + \dot x(t) + \beta\nabla f(x(t))$. Differentiating $E(\cdot)$ gives
\[
\frac{d}{dt}E(t) = \langle\nabla f(x(t)), \dot x(t)\rangle + \big\langle v(t),\, \sqrt\mu\,\dot x(t) + \ddot x(t) + \beta\nabla^2 f(x(t))\dot x(t)\big\rangle.
\]
Using (15), we get
\[
\frac{d}{dt}E(t) = \langle\nabla f(x(t)), \dot x(t)\rangle + \big\langle v(t),\, -\sqrt\mu\,\dot x(t) - \nabla f(x(t))\big\rangle.
\]
After developing and simplifying, we obtain
\[
\frac{d}{dt}E(t) + \sqrt\mu\,\langle\nabla f(x(t)), x(t) - x^\star\rangle + \mu\,\langle x(t) - x^\star, \dot x(t)\rangle + \sqrt\mu\,\|\dot x(t)\|^2 + \beta\sqrt\mu\,\langle\nabla f(x(t)), \dot x(t)\rangle + \beta\|\nabla f(x(t))\|^2 = 0.
\]
By strong convexity of $f$ we have
\[
\langle\nabla f(x(t)), x(t) - x^\star\rangle \ge f(x(t)) - f(x^\star) + \frac{\mu}{2}\|x(t) - x^\star\|^2.
\]
Thus, combining the last two relations, we obtain
\[
\frac{d}{dt}E(t) + \sqrt\mu\,A \le 0,
\]
where (the variable $t$ is omitted to lighten the notation)
\[
A := f(x) - f(x^\star) + \frac{\mu}{2}\|x - x^\star\|^2 + \sqrt\mu\,\langle x - x^\star, \dot x\rangle + \|\dot x\|^2 + \beta\langle\nabla f(x), \dot x\rangle + \frac{\beta}{\sqrt\mu}\|\nabla f(x)\|^2.
\]

Let us express $A$ in terms of $E(t)$:
\[
A = E - \frac{1}{2}\|\dot x + \beta\nabla f(x)\|^2 - \sqrt\mu\,\langle x - x^\star, \dot x + \beta\nabla f(x)\rangle + \sqrt\mu\,\langle x - x^\star, \dot x\rangle + \|\dot x\|^2 + \beta\langle\nabla f(x), \dot x\rangle + \frac{\beta}{\sqrt\mu}\|\nabla f(x)\|^2.
\]
After developing and simplifying, we obtain
\[
\frac{d}{dt}E(t) + \sqrt\mu\Big(E(t) + \frac{1}{2}\|\dot x\|^2 + \Big(\frac{\beta}{\sqrt\mu} - \frac{\beta^2}{2}\Big)\|\nabla f(x)\|^2 - \beta\sqrt\mu\,\langle x - x^\star, \nabla f(x)\rangle\Big) \le 0.
\]
Since $0 \le \beta \le \frac{1}{\sqrt\mu}$, we immediately get $\frac{\beta}{\sqrt\mu} - \frac{\beta^2}{2} \ge \frac{\beta}{2\sqrt\mu}$. Hence
\[
\frac{d}{dt}E(t) + \sqrt\mu\Big(E(t) + \frac{1}{2}\|\dot x\|^2 + \frac{\beta}{2\sqrt\mu}\|\nabla f(x)\|^2 - \beta\sqrt\mu\,\langle x - x^\star, \nabla f(x)\rangle\Big) \le 0.
\]

Let us use again the strong convexity of $f$ to write
\[
E(t) = \frac{1}{2}E(t) + \frac{1}{2}E(t) \ge \frac{1}{2}E(t) + \frac{1}{2}\big(f(x(t)) - f(x^\star)\big) \ge \frac{1}{2}E(t) + \frac{\mu}{4}\|x(t) - x^\star\|^2.
\]
By combining the two inequalities above, we obtain
\[
\frac{d}{dt}E(t) + \frac{\sqrt\mu}{2}E(t) + \frac{\sqrt\mu}{2}\|\dot x(t)\|^2 + \sqrt\mu\,B \le 0,
\]
where $B = \frac{\mu}{4}\|x(t) - x^\star\|^2 + \frac{\beta}{2\sqrt\mu}\|\nabla f(x)\|^2 - \beta\sqrt\mu\,\|x - x^\star\|\,\|\nabla f(x)\|$.

Set $X = \|x - x^\star\|$, $Y = \|\nabla f(x)\|$. An elementary algebraic computation gives that, under the condition $0 \le \beta \le \frac{1}{2\sqrt\mu}$,
\[
\frac{\mu}{4}X^2 + \frac{\beta}{2\sqrt\mu}Y^2 - \beta\sqrt\mu\,XY \ge 0.
\]
Hence, for $0 \le \beta \le \frac{1}{2\sqrt\mu}$,
\[
\frac{d}{dt}E(t) + \frac{\sqrt\mu}{2}E(t) + \frac{\sqrt\mu}{2}\|\dot x(t)\|^2 \le 0.
\]

By integrating the above differential inequality we obtain $E(t) \le E(t_0)e^{-\frac{\sqrt\mu}{2}(t - t_0)}$. By definition of $E(t)$, we infer
\[
f(x(t)) - \min_{\mathcal H} f \le E(t_0)e^{-\frac{\sqrt\mu}{2}(t - t_0)},
\]
and
\[
\big\|\sqrt\mu\,(x(t) - x^\star) + \dot x(t) + \beta\nabla f(x(t))\big\|^2 \le 2E(t_0)e^{-\frac{\sqrt\mu}{2}(t - t_0)}.
\]
ii) Set $C = 2E(t_0)e^{\frac{\sqrt\mu}{2}t_0}$. Developing the above expression, we obtain
\[
\mu\|x(t) - x^\star\|^2 + \|\dot x(t)\|^2 + \beta^2\|\nabla f(x(t))\|^2 + 2\beta\sqrt\mu\,\langle x(t) - x^\star, \nabla f(x(t))\rangle + \big\langle\dot x(t),\, 2\beta\nabla f(x(t)) + 2\sqrt\mu\,(x(t) - x^\star)\big\rangle \le Ce^{-\frac{\sqrt\mu}{2}t}.
\]

By convexity of $f$ we have $\langle x(t) - x^\star, \nabla f(x(t))\rangle \ge f(x(t)) - f(x^\star)$. Moreover,
\[
\big\langle\dot x(t),\, 2\beta\nabla f(x(t)) + 2\sqrt\mu\,(x(t) - x^\star)\big\rangle = \frac{d}{dt}\Big(2\beta\big(f(x(t)) - f(x^\star)\big) + \sqrt\mu\,\|x(t) - x^\star\|^2\Big).
\]
Combining the above results, we obtain
\[
\sqrt\mu\Big[2\beta\big(f(x(t)) - f(x^\star)\big) + \sqrt\mu\,\|x(t) - x^\star\|^2\Big] + \beta^2\|\nabla f(x(t))\|^2 + \frac{d}{dt}\Big(2\beta\big(f(x(t)) - f(x^\star)\big) + \sqrt\mu\,\|x(t) - x^\star\|^2\Big) \le Ce^{-\frac{\sqrt\mu}{2}t}.
\]
Set $Z(t) := 2\beta\big(f(x(t)) - f(x^\star)\big) + \sqrt\mu\,\|x(t) - x^\star\|^2$. We have
\[
\frac{d}{dt}Z(t) + \sqrt\mu\,Z(t) + \beta^2\|\nabla f(x(t))\|^2 \le Ce^{-\frac{\sqrt\mu}{2}t}.
\]
By integrating this differential inequality, an elementary computation gives
\[
e^{-\sqrt\mu\,t}\int_{t_0}^{t} e^{\sqrt\mu\,s}\|\nabla f(x(s))\|^2\,ds \le C_1 e^{-\frac{\sqrt\mu}{2}t}.
\]
Noticing that the integral of $e^{\sqrt\mu\,s}$ over $[t_0, t]$ is of order $e^{\sqrt\mu\,t}$, the above estimate reflects the fact that, as $t \to +\infty$, the gradient terms $\|\nabla f(x(t))\|^2$ tend to zero at an exponential rate (in average, not pointwise). $\qed$

Remark 4.3. Let us justify the choice $\gamma = 2\sqrt\mu$ in Theorem 4.2. Indeed, considering
\[
\ddot x(t) + 2\gamma\dot x(t) + \beta\nabla^2 f(x(t))\dot x(t) + \nabla f(x(t)) = 0,
\]
a similar proof to the one described above can be performed on the basis of the Lyapunov function
\[
E(t) := f(x(t)) - \min_{\mathcal H} f + \frac{1}{2}\big\|\gamma(x(t) - x^\star) + \dot x(t) + \beta\nabla f(x(t))\big\|^2.
\]
Under the condition $\gamma \le \sqrt\mu$, together with a corresponding smallness condition on $\beta$, we obtain the exponential convergence rate
\[
f(x(t)) - \min_{\mathcal H} f = O\big(e^{-\frac{\gamma}{2}t}\big) \quad\text{as } t \to +\infty.
\]
Taking $\gamma = \sqrt\mu$ gives the best convergence rate, and the result of Theorem 4.2.

4.2. Non-smooth case. Following [2], (DIN)$_{\gamma,\beta}$ is equivalent to the first-order system
\[
\begin{cases}
\dot x(t) + \beta\nabla f(x(t)) + \Big(\gamma - \dfrac{1}{\beta}\Big)x(t) + \dfrac{1}{\beta}y(t) = 0;\\[6pt]
\dot y(t) + \Big(\gamma - \dfrac{1}{\beta}\Big)x(t) + \dfrac{1}{\beta}y(t) = 0.
\end{cases}
\]
This permits to extend (DIN)$_{\gamma,\beta}$ to the case of a proper lower semicontinuous convex function $f : \mathcal H \to \mathbb R \cup \{+\infty\}$. Replacing the gradient of $f$ by its subdifferential, we obtain its non-smooth version:
\[
\text{(DIN-NS)}_{\gamma,\beta}\qquad
\begin{cases}
\dot x(t) + \beta\partial f(x(t)) + \Big(\gamma - \dfrac{1}{\beta}\Big)x(t) + \dfrac{1}{\beta}y(t) \ni 0;\\[6pt]
\dot y(t) + \Big(\gamma - \dfrac{1}{\beta}\Big)x(t) + \dfrac{1}{\beta}y(t) = 0.
\end{cases}
\]
Most properties of (DIN)$_{\gamma,\beta}$ remain valid for this generalized version. To illustrate this, let us consider the following extension of Theorem 4.2.

Theorem 4.4. Suppose that $f : \mathcal H \to \mathbb R \cup \{+\infty\}$ is lower semicontinuous and $\mu$-strongly convex for some $\mu > 0$. Let $x(\cdot)$ be a trajectory of (DIN-NS)$_{2\sqrt\mu,\beta}$. Suppose that $0 \le \beta \le \frac{1}{2\sqrt\mu}$. Then
\[
\frac{\mu}{2}\|x(t) - x^\star\|^2 \le f(x(t)) - \min_{\mathcal H} f = O\big(e^{-\frac{\sqrt\mu}{2}t}\big) \quad\text{as } t \to +\infty,
\]
and
\[
\int_{t_0}^{+\infty} e^{\frac{\sqrt\mu}{2}t}\|\dot x(t)\|^2\,dt < +\infty.
\]

Proof. Let us introduce $E : [t_0,+\infty[ \to \mathbb R_+$ defined by
\[
E(t) := f(x(t)) - \min_{\mathcal H} f + \frac{1}{2}\Big\|\sqrt\mu\,(x(t) - x^\star) - \Big(2\sqrt\mu - \frac{1}{\beta}\Big)x(t) - \frac{1}{\beta}y(t)\Big\|^2,
\]
which will serve as a Lyapunov function. The proof then follows the lines of that of Theorem 4.2, using the derivation rule of Brezis [19, Lemme 3.3, p. 73]. $\qed$

5. Inertial algorithms for strongly convex functions

5.1. Proximal algorithms.
