A relaxed proximal gradient descent algorithm for convergent plug-and-play with proximal denoiser

(1)

A relaxed proximal gradient descent algorithm for convergent plug-and-play with proximal

denoiser

Samuel Hurault

¹

, Antonin Chambolle

²

, Arthur Leclaire

¹

, and Nicolas Papadakis

¹

1

Univ. Bordeaux, CNRS, Bordeaux INP, IMB, UMR 5251, F-33400 Talence, France

2

CEREMADE, CNRS, Université Paris-Dauphine, PSL, Palaiseau 91128, France

Abstract

This paper presents a new convergent Plug-and-Play (PnP) algorithm.

PnP methods are efficient iterative algorithms for solving image inverse problems formulated as the minimization of the sum of a data-fidelity term and a regularization term. PnP methods perform regularization by plugging a pre-trained denoiser in a proximal algorithm, such as Proximal Gradient Descent (PGD). To ensure convergence of PnP schemes, many works study specific parametrizations of deep denoisers. However, existing results require either unverifiable or suboptimal hypotheses on the denoiser, or assume restrictive conditions on the parameters of the inverse problem.

Observing that these limitations can be due to the proximal algorithm in use, we study a relaxed version of the PGD algorithm for minimizing the sum of a convex function and a weakly convex one. When plugged with a relaxed proximal denoiser, we show that the proposed PnP-αPGD algorithm converges for a wider range of regularization parameters, thus allowing more accurate image restoration.

1 Introduction

We focus on the convergence of Plug-and-Play methods associated to the class of inverse problems:

ˆ

x∈arg min

x λf(x) +φ(x), (1)

wheref is aLf-Lipschitz gradient function acting as a data-fidelity term with respect to a degraded observation y, φ is a M-weakly convex regularization function (such thatφ+^M₂ ||.||²is convex); and λ >0is a parameter weighting the influence of both terms. In our experimental setting, we consider degradation models y = Ax^∗+ν ∈ R^m for some groundtruth signal x^∗ ∈ Rⁿ, a linear operator A ∈ R^n×m and a white Gaussian noise ν ∈ Rⁿ. We thus deal with convex data-fidelity terms of the formf(x) =¹₂||Ax−y||², withLf =||A^TA||.

arXiv:2301.13731v1 [stat.ML] 31 Jan 2023

(2)

Our analysis can nevertheless apply to a broader class of convex or nonconvex functions f.

To ﬁnd an adequate solution of the ill-posed problem of recoveringx^∗ fromy, the choice of the regularization φ is crucial. Convex [19] and nonconvex [26]

handcrafted functions are now largely outperformed by learning approaches [18, 27], that may not even be associated to a closed-form regularization functionφ.

1.1 Proximal algorithms

Estimating a local or global optima of problem (1) is classically done using proximal splitting algorithms such as Proximal Gradient Descent (PGD) or Douglas-Rashford Splitting (DRS). Given an adequate stepsizeτ >0, these methods alternate between explicit gradient descent,Id−τ∇hfor smooth functionsh, and/or implicit gradient steps using the proximal operatorProxτ h(x)∈ arg min_z_2τ¹||z−x||²+h(z), for proper lower semi-continuous functionsh. Proxi- mal algorithms are originally designed for convex functions, but under appropriate assumptions, PGD [2] and DRS [23] algorithms converge to a stationary point of problem (1) associated to nonconvex functionsf andφ.

1.2 Plug-and-Play algorithms

Plug-and-Play (PnP) [25] and Regularization by Denoising (RED) [18] methods consist in splitting algorithms in which the descent step on the regularization function is performed by an oﬀ-the-shelf image denoiser. They are respectively built from proximal splitting schemes by replacing the proximal operator (PnP) or the gradient operator (RED) of the regularization φby an image denoiser.

When used with a deep denoiser (i.e parameterized by a neural network) these approaches produce impressive results for various image restoration tasks [27].

Theoretical convergence of PnP and RED algorithms with deep denoisers has recently been addressed by a variety of studies [20, 21, 17, 9]. Most of these works require speciﬁc constraints on the deep denoiser, such as nonexpansivity.

However, imposing nonexpansivity of a denoiser can severely degrade its denoising performance.

Another line of works tries to address convergence by making PnP and RED algorithms be exact proximal algorithms. The idea is to replace the denoiser of RED algorithms by a gradient descent operator and the one of PnP algorithm by a proximal operator. Theoretical convergence then follows from known convergence results of proximal splitting algorithms. The authors of [7, 10] thus propose to plug an explicitgradient-step denoiser of the formD= Id−∇g, for a tractable and potentially nonconvex potential g parameterized by a neural network. As shown in [10], such a constrained parametrization does not harm denoising performance. The gradient-step denoiser guarantees convergence of RED methods without sacriﬁcing performance, but it does not cover convergence of PnP algorithms. An extension to PnP has been addressed in [11]: following [8], wheng is trained with contractive gradient, the gradient-step denoiser can be written as a proximal operatorD= Id−∇g= Proxφ of a nonconvex potentialφ.

A PnP scheme with thisproximal denoiser becomes again a genuine proximal splitting algorithm associated to an explicit functional. Following existing convergence results of the PGD and DRS algorithms in the nonconvex setting, [11] proves convergence of PnP-PGD and PnP-DRS with proximal denoiser.

(3)

The main limitation of this approach is that the proximal denoiserD= Prox_φ does not give tractability of Proxτ φ for τ 6= 1. Therefore, to be a provable converging proximal splitting algorithm, the stepsize of the overall PnP algorithm has to be ﬁxed to τ = 1. For instance, for the PGD algorithm with stepsize τ = 1:

xk+1∈Proxφ(Id−λ∇f)(xk) (2) the convergence ofxkto a stationary point of (1) is only ensured for regularization parametersλsatisfyingLfλ <1[2]. This is an issue for low noise levelsν, for which relevant solutions are obtained with a dominant data-ﬁdelity term in (1).

Our objective is to design a convergent PnP algorithm with a proximal denoiser, and with minimal restriction on the regularization parameterλ. Contrary to previous work on PnP convergence [20, 21, 22, 7, 10, 9], we not only wish to adapt the denoiser but also the original optimization scheme of interest. We study a new proximal algorithm able to deal with a proximal operator that can only be computed for a predeﬁned and ﬁxed stepsize τ= 1.

1.3 Contributions and outline

In this paper, we propose a relaxation of the Proximal Gradient Descent algorithm, calledαPGD, such that when used with a proximal denoiser, the corresponding PnP schemeProx-PnP-αPGD can converge for any regularization parameterλ.

In section 2, extending the result from [11], we show how building a denoiser D that corresponds to the proximal operator of aM-weakly convex functionφ.

Then we introduce a relaxation of the denoiser that allows to controlM, the constant of weak convexity ofφ.

In section 3, we give new results on the convergence of Prox-PnP-PGD [11] with regularization constraint λ(Lf +M) < 2. In particular, using the convergence of PGD for nonconvexf and weakly convexφgiven in Theorem 1, Corollary 1 improves previous PnP convergence results [11] forM < Lf.

In section 4, we present αPGD, a relaxed version of the PGD algorithm¹ reminiscent of the accelerated PGD scheme from [24]. Its convergence is shown Theorem 2 for a smooth convex functionf and a weakly convex oneφ. Corol- lary 2 then illustrates how the relaxation parameterαcan be tuned to make the proposed PnP-αPGD algorithm convergent for regularization parametersλ satisfyingλLfM <1. Having a multiplication, instead of an addition, between the constants L_f and M opens new perspectives. In particular, by plugging a relaxed denoiser with controllable weak convexity constant, Corollary 3 demon- strates that, for all regularization parameterλ, we can always decreaseM such that λLfM <1 i.e. such that theαPGD algorithm converges.

In section 5, we provide experiments for both image deblurring and image super-resolution applications. We demonstrate the eﬀectiveness of our PnP- αPGD algorithm, which closes the performance gap between Prox-PnP-PGD and the state-of-the-art plug-and-play algorithms.

1There are two different notions of relaxation in this paper. One is for the relaxation of the proximal denoiser and the other for the relaxation of the optimization scheme.

(4)

2 Relaxed Proximal Denoiser

This section introduces the denoiser used in our PnP algorithm. We first redefine the Gradient Step denoiser and show in Proposition 1 how it can be constrained to be a proximal denoiser; and finally introduced the relaxed proximal denoiser.

2.1 Gradient Step Denoiser

In this paper, we make use of the Gradient Step Denoiser introduced in [10, 7].

It writes as a gradient step over a diﬀerentiable potentialgσ parametrized by a neural network.

Dσ = Id−∇gσ. (3)

This denoiser can then be trained to denoise white Gaussian noiseνσof various standard deviationsσby minimizing theℓ2denoising lossE[||Dσ(x+νσ)−x)||²].

When parametrized using a DRUNet architecture [27], it was shown in [10] that the Gradient Step Denoiser (3), despite being constrained to be a conservative vector ﬁeld (as in [18]), achieves state-of-the-art denoising performance.

2.2 Proximal Denoiser

We propose here an new version of the result of [11] on the characterization of the gradient-step denoiser as a proximal operator of some potential φ. In particular, we state a new result regarding the weak convexity of theφfunction.

The proof of this result, given in Appendix A relies on the results from [8].

Proposition 1 (Proximal denoisers) Let gσ:Rⁿ→R a C² function with

∇gσ Lgσ-Lipschitz with Lgσ <1. Then, forDσ:= Id−∇gσ, there exists a potential φσ:Rⁿ→R∪ {+∞}such that Proxφσ is one-to-one and

Dσ= Proxφσ (4)

Moreover, φσ is _L^L^gσ

gσ+1-weakly convex and it can be written φσ = ˆφσ+K on Im(Dσ) (which is open) for some constant K ∈R, withφˆσ:X →R∪ {+∞}

defined by φˆσ(x) :=

gσ(Dσ−1(x)))−¹₂

Dσ−1(x)−x

2 if x∈Im(Dσ),

+∞ otherwise. (5)

Additionally φˆ_σ verifies∀x∈Rⁿ,φˆ_σ(x)≥g_σ(x).

To get a proximal denoiser from the denoiser (3), the gradient of the learned potential gσ must be contractive. In [11] the Lipschitz constant of∇gσ is softly constrained to satisfyLgσ <1, by penalizing the spectral norm

∇²gσ(x+νσ) S

in the denoiser training loss.

2.3 Relaxed Denoiser

Once trained, the Gradient Step Denoiser Dσ= Id−∇gσ can be relaxed as in [10] with a parameterγ∈[0,1]

D_σ^γ =γDσ+ (1−γ) Id = Id−γ∇gσ. (6)

(5)

Applying Proposition 1 withg_σ^γ =γg_σ which has a γL_g_σ-Lipschitz gradient, we get that ifγLg<1, there exists a _γL^γL^gσ

gσ+1-weakly convex φ^γ_σ such that

D_σ^γ= Prox_φ^γ_σ, (7)

satisfyingφ⁰_σ = 0andφ¹_σ =φσ. Hence, one can control the weak convexity of the regularization function by relaxing the proximal denoising operatorD_σ^γ.

3 Plug-and-Play Proximal Gradient Descent (PnP- PGD)

In this section, we give convergence results for the Prox-PnP-PGD algorithm, xk+1=Dσ◦(Id−λf)(xk) = Proxφσ◦(Id−λf)(xk),

which is the PnP version of PGD, with plugged Proximal Denoiser (4). The authors of [11] proposed a suboptimal convergence results as the semiconvexity of φσ was not exploited. Doing so, we improve the condition on the regularization parameter λfor convergence.

We present properties of smooth functions and weakly convex functions in Section 3.1. Then we show in Section 3.2 the convergence of PGD in the smooth/weakly convex setting. We ﬁnally apply this result to PnP in Section 3.3.

3.1 Useful inequalities

We present two results relative to weakly convex functions and smooth ones.

We make use of the subdiﬀerential of a proper, nonconvex function φdeﬁned as

∂φ(x)=n

v∈Rⁿ,∃(xk), φ(xk)→φ(x), vk→v,lim_z→x_k^φ(z)−φ(x_||z−x^k^)−hv_k_||^k^,z−x^kⁱ≥0∀ko . Proposition 2 (Weakly convex functions, proof in Appendix B) Forφ proper lsc and M-weakly convex withM >0,

(i) ∀x, yandt∈[0,1],

φ(tx+ (1−t)y)≤tφ(x) + (1−t)φ(y) +M

2 t(1−t)||x−y||²; (8) (ii) ∀x, y, we have∀z∈∂φ(y),

φ(x)≥φ(y) +hz, x−yi −M

2 ||x−y||²; (9) (iii) Three-points inequality. For z⁺∈Proxφ(z), we have,∀x

φ(x) +1

2||x−z||²≥φ(z⁺) +1 2

z⁺−z

2+1−M 2

x−z⁺

2. (10) Lemma 1 (Descent Lemma for smooth functions) Forf proper differentiable and with a Lf-Lipschitz gradient, we have∀x, y

f(x)≤f(y) +h∇f(y), x−yi+Lf

2 ||x−y||². (11)

(6)

3.2 Proximal Gradient Descent with a weakly convex func- tion

We consider the following minimization problem for a smooth nonconvex functionf and a weakly convex functionφthat are both bounded from below:

minx F(x) :=λf(x) +φ(x). (12) We now show under which conditions the classical Proximal Gradient Descent

xk+1∈Proxτ φ◦(Id−τ λ∇f)(xk) (13) converges to a stationary point of (12). We ﬁrst show convergence of function values, and then convergence of the iterates, ifFveriﬁes the Kurdyka-Lojasiewicz (KL) property [2]. Large classes of functions, in particular all the proper, closed, semi-algebraic functions [1] satisfy this property, which is, in practice, the case of all the functions considered in this analysis.

Theorem 1 (Convergence of PGD algorithm (13)) Assumef andφproper lsc, bounded from below withf differentiable withLf-Lipschitz gradient, andφ M-weakly convex. Then for τ <2/(λL_f+M), the iterates (13)verify

(i) (F(xk))monotonically decreases and converges.

(ii) ||xk+1−xk|| converges to0 at ratemink≤K||xk+1−xk||=O(1/√ K) (iii) All cluster points of the sequencexk are stationary points ofF.

(iv) If the sequence xk is bounded and if F verifies the KL property at the cluster points of xk, thenxk converges, with finite length, to a stationary

point ofF.

The proof follows standard arguments of the convergence analysis of the PGD in the nonconvex setting [4, 2, 16]. We only demonstrate here the ﬁrst point of the theorem, the rest of the proof is detailed in Appendix C.Proof.[i]

Relation (13) leads to ^x^k^−x_τ^k+1 −λ∇f(xk) ∈ ∂φ(xk+1), by deﬁnition of the proximal operator. AsφisM-weakly convex, Proposition 2 (ii) leads to

φ(xk)≥φ(xk+1)+||xk−xk+1||²

τ +λh∇f(xk), xk+1−xki −M

2 ||xk−xk+1||². (14) The descent Lemma 1 gives forf:

f(xk+1)≤f(xk) +h∇f(xk), xk+1−xki+Lf

2 ||xk−xk+1||². (15) Combining both inequalities, forFλ,σ=λf+φσ, we obtain

F(xk)≥F(xk+1) + 1

τ −M+λL_f 2

||xk−xk+1||². (16) Therefore, if τ < 2/(M+λLf), (F(xk))is monotically deacreasing. As F is

assumed lower-bounded,(F(xk))converges.

(7)

3.3 Prox-PnP Proximal Gradient Descent (Prox-PnP-PGD)

Equipped with the convergence of PGD, we can now study the convergence of Prox-PnP-PGD, the PnP-PGD algorithm with plugged Proximal Denoiser (4):

xk+1=Dσ(Id−λf)(xk) = Proxφσ(Id−λf)(xk). (17) This algorithm targets stationary points of the functionalFλ,σ deﬁned as:

Fλ,σ:=λf+φσ. (18)

The following result, obtained from Theorem 1, improves [11] using the fact that the potentialφσ is not any nonconvex function but a weakly convex one.

Corollary 1 (Convergence of Prox-PnP-PGD (17)) Let gσ : Rⁿ → R∪ {+∞}of classC², coercive, withLgσ-Lipschitz gradient,Lgσ <1, andDσ:= Id−∇gσ. Let φσ be defined fromgσ as in Proposition 1. Letf :Rⁿ →R∪ {+∞} differentiable withLf-Lipschitz gradient. Assumef andDσ respectively KL and semi- algebraic, andf andgσbounded from below. Then, forλLf <(Lgσ+2)/(Lgσ+1), the iteratesxk given by the iterative scheme(17)verify the convergence properties (i)-(iv) of Theorem 1 for F =Fλ,σ.

The proof of this result is given in Appendix D. It is a direct application of Theorem 1 using τ = 1 and the fact that φσ deﬁned in Proposition 1 is M =L_g_σ/(L_g_σ + 1)-weakly convex.

By exploiting the weak convexity ofφσ, the convergence conditionλLf <1 of [11] is here replaced byλLf < ^L_L^gσ⁺²

gσ+1. Even if the bound is improved, we are still limited to regularization parameters satisfyingλLf <2. In the next section, we propose a modiﬁcation of the PGD algorithm to relax this constraint.

4 PnP Relaxed Proximal Gradient Descent (PnP- α PGD)

In this section, we study the convergence of a relaxed PGD algorithm applied to problems (1) involving a smooth convex function f and a weakly convex function φ. Our objective is to design a convergent algorithm in which the proximal operator is only computable for τ = 1 and the data-ﬁdelity term constraint is less restrictive than the bound τ <2/(M +λLf)of Theorem 1.

4.1 α PGD algorithm

We present our main result which concerns, for weakly convex functions φ, the convergence of the followingα-relaxed PGD algorithm, deﬁned for0< α <1as







qk+1= (1−α)yk+αxk

xk+1= Proxτ φ(xk−τ λ∇f(qk+1)) yk+1= (1−α)yk+αxk+1.

(19a) (19b) (19c) Algorithm (19) withα= 1exactly corresponds to the PGD algorithm (13). This scheme is reminiscent of [24] (takingα=θkandτ= _θ¹

kLf in Algorithm 1 of [24]), which generalizes Nesterov-like accelerated proximal gradient methods [4, 15].

(8)

As shown in [12], there is a strong connection between the proposed algorithm (19) and the Primal-Dual algorithm [5] with Bregman proximal operator [6]. In the convex setting, one can show that ergodic convergence is obtained withτ λLf >2 and small valuesα. Convergence of a close algorithm is also shown in [14] for a M-semi convexφand ac > M-strongly convexf. However,φσis here nonconvex whilef is only convex, so that a new convergence result needs to be derived.

Theorem 2 (Convergence of αPGD (19)) Assumef andφproper lsc, bounded from below, f convex differentiable withL_f-Lipschitz gradient andφ M-weakly convex. Then² forα∈(0,1)andτ <min

1 αλLf,_M^α

, the updates (19)verify (i) F(yk) +_2τ^α 1−_α¹2

||yk−yk−1||² monotonically decreases and converges.

(ii) ||yk+1−yk||converges to 0 at ratemink≤K||yk+1−yk||=O(1/√ K) (iii) All cluster points of the sequenceyk are stationary points of F.

The proof, given in Appendix E, relies on Lemma 1 and Proposition 2. It follows the general strategy of the proofs in [24], and also requires the convexity off.

With this theorem,αPGD is shown to verify convergence of the iterates and of the norm of the residual to0. Note that we do not have here the analog of Theorem 1(iv) on the iterates convergence using the KL hypothesis. Indeed, as we detail in Appendix E.0.1, the nonconvex convergence analysis with KL functions from [2] or [16] do not extend to our case.

Whenα= 1, Algorithms (19) and (13) are equivalent, but we get a slightly worst bound in Theorem 2 than in Theorem 1 (τ <min

1 λLf,_M¹

≤ _λL_f²_+M).

Nevertheless, when used withα <1, we next show that the relaxed algorithm is more relevant in the perspective of PnP with proximal denoiser.

4.2 Prox-PnP- α PGD algorithm

We can now study the Prox-PnP-αPGD algorithm obtained by taking the proximal denoiser (4) in theαPGD algorithm (19):







qk+1= (1−α)yk+αxk

xk+1=Dσ(xk−λ∇f(qk+1)) yk+1= (1−α)yk+αxk+1

(20a) (20b) (20c) This scheme targets the minimization of the functionalFλ,σgiven in (18).

Corollary 2 (Convergence of Prox-PnP-αPGD (20)) Let gσ:Rⁿ →R∪ {+∞} of class C², coercive, withLg <1-Lipschitz gradient andDσ:= Id−∇gσ. Let φσ be the M =Lgσ/(Lgσ + 1)-weakly convex function defined from gσ as in Proposition 1. Let f be proper, convex and differentiable with Lf-Lipschitz gradient. Assumef and g_σ bounded from below. Then, if λL_fM <1 and for any α∈[0,1]

M < α <1/(λLf) (21) the iteratesxk given by the iterative scheme(20)verify the convergence properties (i)-(iii) of Theorem 2 forF =Fλ,σ defined in (18).

2As shown in the proof, a better bound can be found, but with little numerical gain.

(9)

This PnP corollary is obtained by taking τ = 1 in Theorem 2 and using the M = (Lgσ)/(Lgσ+ 1)-weakly convex potential φσ deﬁned in Proposition 1.

The existence of α ∈ [0,1] satisfying relation (21) is ensured as soon as λLfM <1. As a consequence, whenM gets small (i.e φσ gets "more convex") λLf can get arbitrarily large. This is a major advance compared to Prox-PnP- PGD that was limited (Corollary 1) toλLf <2 even for convexφ(M = 0). To further exploit this property, we now consider the relaxed denoiserD^γ_σ (6) that is associated to a function φ^γ_σ with a tunable weak convexity constantM^γ. Corollary 3 (Convergence of Prox-PnP-αPGD with relaxed denoiser) LetF_λ,σ^γ :=λf+φ^γ_σ, with theM^γ= _γL^γL_g₊₁^g -weakly convex potentialφ^γ_σ introduced in (7) and Lg < 1. Then, for M^γ < α < 1/(λLf), the iterates xk given by the Prox-PnP-αPGD (20)with γ-relaxed denoiserD^γ_σ defined in (6)verify the convergence properties (i)-(iii) of Theorem 2 for F=F_λ,σ^γ .

Therefore, using the γ-relaxed denoiser D_σ^γ = γDσ+ (1−γ) Id, the overall convergence condition onλis nowλ < _L¹

f

γLg

γLg+1.

Provided γ gets small, λ can be arbitrarily large. Small γ means small amount of regularization brought by denoising at each step of the PnP algorithm.

Moreover, for small γ, the targeted regularization function φ^γ_σ gets close to a convex function and it has already been observed that deep convex regularization can be sub-optimal compared to more ﬂexible nonconvex ones [7]. Depending on the inverse problem, and on the necessary amount of regularization, the choice of the couple(γ, λ)will be of paramount importance for eﬃcient restoration.

5 Experiments

The eﬃciency of the proposed Prox-PnP-αPGD algorithm (20) is now demonstrated on deblurring and super-resolution. For both applications, we consider a degraded observation y = Ax^∗+ν ∈ R^m of a clean image x^∗ ∈ Rⁿ that is estimated by solving problem (1) with f(x) = ¹₂||Ax−y||². Its gradient

∇f =A^T(Ax−y)is thus Lipschitz with constantLf =

A^TA

S. We use for eval- uation and comparison the 68 images from the CBSD68 dataset, center-cropped ton= 256×256and Gaussian noise with3 noise levelsν∈ {0.01,0.03,0.05}.

Fordeblurring, the degradation operatorA=H is a convolution performed with circular boundary conditions. As in [28, 10, 17, 27], we consider the 8 real-world camera shake kernels of [13], the9×9uniform kernel and the25×25 Gaussian kernel with standard deviation1.6.

For single imagesuper-resolution (SR), the low-resolution imagey∈R^m is obtained from the high-resolution onex∈Rⁿ viay=SHx+νwhereH ∈R^n×n is the convolution with anti-aliasing kernel. The matrixS is the standards-fold downsampling matrix of sizem×nandn=s²×m. As in [27], we evaluate SR performance on 4 isotropic Gaussian blur kernels with standard deviations 0.7, 1.2,1.6 and2.0; and consider downsampled images at scales= 2ands= 3.

The proximal denoiserDσ deﬁned in Proposition 1 is trained following [11]

withLg <1. For both Prox-PnP-PGD and Prox-PnP-αPGD algorithm, we use theγ-relaxed version of the denoiser (6). All the hypotheses onf andgσ

from Corollaries 1 and 3 are thus veriﬁed and convergence of Prox-PnP-PGD and Prox-PnP-αPGD are theoretically guaranteed provided the corresponding

(10)

conditions onλare satisﬁed. Hyper-parametersγ∈[0,1],λandσare optimized via grid-search. In practice, we found that the same choice of parametersγ and σare optimal for both PGD andαPGD, with values depending on the amount of noise ν in the input image. We thus choose λ ∈ [0, λlim] where for Prox- PnP-PGDλ^PGD_lim = _L¹

f

γ+2

γ+1 and for Prox-PnP-αPGDλ^α_lim^PGD =_L¹

f

γ+1

γ ≥λ^PGD_lim . For both ν = 0.01 and ν = 0.03, λ is set to its maximal allowed value λlim. Prox-PnP-αPGD is expected to outperform Prox-PnP-PGD at these noise levels.

Finally, for Prox-PnP-αPGD,αis set to its maximum possible value1/(λLf).

We numerically compare in Table 1 the presented methods Prox-PnP-PGD (that improves [11]) and Prox-PnP-PGD against three state-of-the-art deep PnP approaches: IRCNN [28], DPIR [27], and GS-PnP [10]. Among them, only GS-PnP has convergence guarantees. Both IRCNN and DPIR use PnP-HQS, the PnP version of the Half-Quadratic Splitting algorithm, with well-chosen varying stepsizes. GS-PnP uses the gradient-step denoiser (3) in PnP-HQS.

As expected, by allowing larger values forλ, we observe that Prox-PnP-αPGD outperforms Prox-PnP-PGD in PSNR at low noise levelν∈ {0.01,0.03}. The performance gap is significant for deblurring and super-resolution with scale 2 andν= 0.01, in which case only a low amount of regularization is necessary, that is to say a largeλvalue. In these conditions, Prox-PnP-αPGD almost closes the PSNR gap between Prox-PnP-PGD and the state-of-the-art PnP methods DPIR and GS-PnP that do not have any restriction on the choice of λ. We assume that the remaining difference of performance is due to theγ-relaxation of the denoising operation that affects the regularizer φγ. We qualitatively verify in Figure 1 (deblurring) and Figure 2 (super-resolution), on "starfish" and "leaves"

images, the performance gain of Prox-PnP-αPGD over Prox-PnP-PGD. We also plot the evolution ofFλ,σ(xk)and||xk+1−xk||²to empirically validate the convergence of both algorithms.

Deblurring Super-resolution

scales= 2 scales= 3 Noise levelν 0.01 0.03 0.05 0.01 0.03 0.05 0.01 0.03 0.05

IRCNN [28] 31.42 28.01 26.40 26.97 25.86 25.45 25.60 24.72 24.38 DPIR [27] 31.93 28.30 26.82 27.79 26.58 25.83 26.05 25.27 24.66 GS-PnP [10] 31.70 28.28 26.86 27.88 26.81 26.01 25.97 25.35 24.74 Prox-PnP-PGD 30.91 27.97 26.66 27.68 26.57 25.81 25.94 25.20 24.62 Prox-PnP-αPGD 31.55 28.03 26.66 27.92 26.61 25.80 26.03 25.26 24.61

Table 1: PSNR (dB) results on CBSD68 for deblurring (left) and super-resolution (right). PSNR are averaged over10blur kernels for deblurring (left) and4 blur

kernels along various scalessfor super-resolution (right).

6 Conclusion

In this paper, we propose a new convergent plug-and-play algorithm built from a relaxed version of the Proximal Gradient Descent (PGD) algorithm. When used with a proximal denoiser, while the original PnP-PGD imposes restrictive conditions on the parameters of the problem, the proposed algorithms converge, with minor conditions, towards stationary points of a weakly convex functional.

We illustrate numerically the convergence and the eﬃciency of the method.

(11)

(a) Clean (b) Observed (c) Prox-PnP-PGD (33.32dB)

(d) Prox-PnP-αPGD (33.62dB)

(e)F_λ,σ(x_k) Prox-PnP-PGD

(f)F_λ,σ(x_k)

Prox-PnP-αPGD (g)||x_i+1−x_i||²

Figure 1: Deblurring of “starﬁsh" blurred with the shown kernel and noise ν = 0.01.

(a) Clean (b) Observed (c) Prox-PnP-PGD (28.19dB)

(d) Prox-PnP-αPGD (28.59dB)

(e)F_λ,σ(x_k) Prox-PnP-PGD

(f)F_λ,σ(x_k)

Prox-PnP-αPGD (g)||x_i+1−x_i||²

Figure 2: SR of "leaves" downscaled with the shown kernel, scale2andν= 0.01.

Acknowledgements

This work was funded by the French ministry of research through a CDSN grant of ENS Paris-Saclay. This study has also been carried out with ﬁnancial support from the French Research Agency through the PostProdLEAP and Mistic projects (ANR-19-CE23-0027-01 and ANR-19-CE40-005).

References

[1] Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: An approach

(12)

based on the kurdyka-łojasiewicz inequality. Mathematics of operations research35(2), 438–457 (2010)

[2] Attouch, H., Bolte, J., Svaiter, B.F.: Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized gauss–seidel methods. Math. Programming137(1- 2), 91–129 (2013)

[3] Bauschke, H.H., Combettes, P.L.: Convex analysis and monotone operator theory in Hilbert spaces, vol. 408. Springer (2011)

[4] Beck, A., Teboulle, M.: Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE TIP18(11), 2419–2434 (2009)

[5] Chambolle, A., Pock, T.: A ﬁrst-order primal-dual algorithm for convex problems with applications to imaging. J. of Math. Imaging and Vision 40(1), 120–145 (2011)

[6] Chambolle, A., Pock, T.: On the ergodic convergence rates of a ﬁrst-order primal–dual algorithm. Mathematical Programming159(1), 253–287 (2016) [7] Cohen, R., Blau, Y., Freedman, D., Rivlin, E.: It has potential: Gradient- driven denoisers for convergent solutions to inverse problems. NeurIPS34 (2021)

[8] Gribonval, R., Nikolova, M.: A characterization of proximity operators.

Journal of Mathematical Imaging and Vision62(6), 773–789 (2020) [9] Hertrich, J., Neumayer, S., Steidl, G.: Convolutional proximal neural

networks and plug-and-play algorithms. Linear Algebra and its Applications 631, 203–234 (2021)

[10] Hurault, S., Leclaire, A., Papadakis, N.: Gradient step denoiser for convergent plug-and-play. In: ICLR (2022)

[11] Hurault, S., Leclaire, A., Papadakis, N.: Proximal denoiser for convergent plug-and-play optimization with nonconvex regularization. In: ICML (2022) [12] Lan, G., Zhou, Y.: An optimal randomized incremental gradient method.

Mathematical programming171(1), 167–215 (2018)

[13] Levin, A., Weiss, Y., Durand, F., Freeman, W.T.: Understanding and evaluating blind deconvolution algorithms. In: IEEE CVPR. pp. 1964–1971 (2009)

[14] Möllenhoﬀ, T., Strekalovskiy, E., Moeller, M., Cremers, D.: The primal-dual hybrid gradient method for semiconvex splittings. SIAM J. on Im. Sc.8(2), 827–857 (2015)

[15] Nesterov, Y.: Gradient methods for minimizing composite functions. Math- ematical programming140(1), 125–161 (2013)

[16] Ochs, P., Chen, Y., Brox, T., Pock, T.: ipiano: Inertial proximal algorithm for nonconvex optimization. SIAM J. on Im. Sc.7(2), 1388–1419 (2014)

(13)

[17] Pesquet, J.C., Repetti, A., Terris, M., Wiaux, Y.: Learning maximally monotone operators for image recovery. SIAM J. on Im. Sciences 14(3), 1206–1237 (2021)

[18] Romano, Y., Elad, M., Milanfar, P.: The little engine that could: Regular- ization by denoising (red). SIAM J. on Im. Sc.10(4), 1804–1844 (2017) [19] Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise

removal algorithms. Phys. D60, 259–268 (1992)

[20] Ryu, E., Liu, J., Wang, S., Chen, X., Wang, Z., Yin, W.: Plug-and-play methods provably converge with properly trained denoisers. In: ICML. pp.

5546–5557 (2019)

[21] Sun, Y., Wu, Z., Xu, X., Wohlberg, B., Kamilov, U.S.: Scalable plug-and- play admm with convergence guarantees. IEEE Trans. on Comput. Im.7, 849–863 (2021)

[22] Terris, M., Repetti, A., Pesquet, J.C., Wiaux, Y.: Building ﬁrmly nonex- pansive convolutional neural networks. In: IEEE ICASSP. pp. 8658–8662 (2020)

[23] Themelis, A., Patrinos, P.: Douglas–rachford splitting and ADMM for nonconvex optimization: Tight convergence results. SIAM J. on Opt.30(1), 149–181 (2020)

[24] Tseng, P.: On accelerated proximal gradient methods for convex-concave optimization. submitted to SIAM Journal on Optimization2(3) (2008) [25] Venkatakrishnan, S.V., Bouman, C.A., Wohlberg, B.: Plug-and-play priors

for model based reconstruction. In: IEEE GlobalSIP. pp. 945–948 (2013) [26] Wen, F., Chu, L., Liu, P., Qiu, R.C.: A survey on nonconvex regularization-

based sparse and low-rank recovery in signal processing, statistics, and machine learning. IEEE Access6, 69883–69906 (2018)

[27] Zhang, K., Li, Y., Zuo, W., Zhang, L., Van Gool, L., Timofte, R.: Plug- and-play image restoration with deep denoiser prior. IEEE Trans. on Im.

Proces. (2021)

[28] Zhang, K., Zuo, W., Gu, S., Zhang, L.: Learning deep CNN denoiser prior for image restoration. In: IEEE CVPR. pp. 3929–3938 (2017)

(14)

A Proof of Proposition 1 (Characterization of the Proximal Denoiser)

Proof. This result is an extension of the Proposition 3.1 from [11]. Under the same assumptions as ours, it is shown, using the characterization of the Proximity operator from [8], that forφˆσ:Rⁿ→Rdeﬁned as

φˆσ(x) :=

gσ(Dσ−1(x)))−¹2

Dσ−1(x)−x

2 if x∈Im(Dσ),

+∞ otherwise (22)

we haveDσ = Proxφˆσ and∀x∈Rⁿ,φˆσ(x)≥gσ(x).

We would like to show thatφˆσis semiconvex. However, asdom( ˆφσ) =Im(Dσ) is non necessarily convex, we can not obtain this result. We thus propose another choice of functionφ_σsuch thatD_σ = Prox_φ_σ. As we suppose that∇g_σ= Id−D_σ isLLipschitz,Dσ isL+ 1 Lipschitz. By [8, Proposition 2], we have∀x∈ X,

D_σ(x)∈Prox_φ_σ(x) (23) for some proper lsc fonctionφσ:X →Rⁿsuch thatx→φσ(x)+

1−_L+1¹ _||x||2

2

is convex, i.e. such that φσ is 1−L+1¹ = _L+1^L weakly convex. For ally ∈Rⁿ, the function x→φσ(x) +¹₂||x−y||² is then strongly convex and the proximal operator is single valued. Then∀x∈ X,

Dσ(x) = Proxφσ(x) = Proxφˆσ(x) (24) By [8, Corollary 5], for anyC ⊂ Im(Dσ)polygonally connected, there is K ∈ R such that φσ = ˆφσ +K on C. As X is convex, it is connected and, as Dσ is continuous,Im(Dσ)is connected and open, thus polygonally connected.

Therefore,∃K∈Rsuch thatφσ= ˆφσ+K onIm(Dσ).

B Proof of Proposition 2 (Three points inequality for weakly convex functions)

Proof. (i) and (ii) follow from the fact thatφ+^M₂ ||x||²is convex. We now prove (iii). Optimality conditions of the proximal operatorz⁺∈Proxφ(z)gives

z−z⁺∈∂φ(z⁺). (25)

Hence, by (ii), we have∀x,

φ(x)≥φ(z⁺) +hz−z⁺, x−z⁺i −M 2

x−z⁺

2, (26)

and therefore, φ(x) +1

2||x−z||²≥φ(z⁺) +1

2||x−z||²+hz−z⁺, x−z⁺i −M 2

x−z⁺

2

=φ(z⁺) +1 2

x−z⁺

2+1 2

z−z⁺

2−M 2

x−z⁺

2

=φ(z⁺) +1 2

z−z⁺

2+1−M 2

x−z⁺

2.

(15)

C Proof of Theorem 1 (PGD for nonconvex and weakly convex optimization)

We here demonstrate the last three points of Theorem 1. Proof.

(i) We have demonstrated Section 3.2 that(F(xk))is monotically deacreasing and converges. We callF^∗ its limit.

(ii) Summing (16) overk= 0,1, ..., mgives

m

X

k=0

||xk+1−xk||²≤ 1

1

τ −^M^+L₂ ^f (F(x0)−F(xm+1))

≤ 1

1

τ −^M^+L2 ^f

(F(x0)−F^∗).

(27)

Therefore,limk→∞||xk+1−xk||= 0with the convergence rate γk= min

0≤i≤k||xi+1−xi||²≤ 1 k

F(x0)−F^∗

1

τ −^M+L₂ ^f (28)

(iii) Suppose that a subsequence (xki)i is converging towards y. Let’s show thatxis a critical point of F. We had

xk+1−xk

τ −λ∇f(xk+1)∈∂φ(xk+1). (29) From the continuity of∇f, we have∇f(xki)→ ∇f(y). As||xk+1−xk|| → 0, we get

xki−xki−1

τ −λ∇f(xki)→ −λ∇f(y). (30) We recall that the subdifferential of a proper, nonconvex function φ is defined as the limiting subdifferential

∂φ(x) :=

v∈Rⁿ,∃xk, φ(xk)→φ(x), vk→v,

lim

z→x_k

φ(z)−φ(xk)− hvk, z−xki

||z−xk|| ≥0∀k

.

(31)

which veriﬁes

{v∈Rⁿ,∃xk, φ(xk)→φ(x), vk →v, vk∈∂φ(xk)} ⊆∂φ(x). (32) Therefore, if we can show thatφ(xki)→φ(y), we get−λ∇f(y)∈∂φ(y) i.e. y is a critical point ofF. Using the fact that φis lsc,

lim inf

i→∞ φ(xki)≥φ(y). (33)

(16)

On the other hand, by weak convexity ofφ, Proposition 2 (ii) gives, for zki= ^x^ki−¹_τ^−x^ki −λ∇f(xki)∈∂φ(xki),

φ(xki)≤φ(y)− hzki, xki−yi+M

2 ||xki−y||²

≤φ(y) +||zki|| ||xki−y||+M

2 ||xki−y||²

≤φ(y) +

||xki−1−xki||

τ +λ||∇f(x_k_i)||+M

2 ||x_k_i−y||

||x_k_i−y||.

(34)

Therefore,

lim sup

i→∞ φ(xki)≤φ(y), (35) and

i→∞lim φ(x_k_i) =φ(y). (36) (iv) We wish to apply Theorem 2.9 from [2]. We need to satisfy H1, H2, and H3 speciﬁed in [2, Section 2.3]. Condition H1 corresponds to the suﬃcient decrease condition shown in (i). For condition H3, we use that, as(xk) is bounded, there exists a subsequence(xki)converging towardsy. Then F(xki)→F(y)has been shown in (iii). Finally, for condition H2, from (14), we had that

zk+1+λ∇f(xk+1) =xk+1−x_k

τ (37)

wherezk+1∈∂φ(xk+1). Therefore,

||zk+1+λ∇f(xk+1)||=||xk+1−xk||

τ (38)

which gives condition H2.

D Proof of Corollary 1 (Prox-PnP-PGD conver- gence)

Proof. The convergence results is obtained applying Theorem 1 forφ=φσdeﬁned in Proposition 1 and τ = 1. The condition τ λLf+M < 2from Theorem 1 becomes, withτ= 1 and the fact thatφσ isM =Lgσ/(Lgσ + 1)-weakly convex

λLf <2−M =Lgσ + 2

Lgσ + 1 (39)

Point (iv) requiresFλ,σ to verify the Kurdyka-Lojasiewicz (KL) property at the cluster points ofxk, and the iteratesxk to be bounded. We now show these two points.

(17)

• Asf is supposed KL, we only need to show thatφ_σ is KL on the open Im(Dσ). Indeed, cluster points ofxk are ﬁxed points of the PGD operator Dσ◦(Id−τ∇f)and thus belong toIm(Dσ). We supposedDσto be a semi- algebraic mapping. As mentioned in [2], by the Tarski–Seidenberg theorem, the composition and inverse of semi-algebraic mappings are semi-algebraic mappings. Moreover, semi-algebraic scalar functions are KL. Therefore, by the deﬁnition ofφˆσ equation (5), we get that φσ is semi-algebraic and thus KL onIm(Dσ).

Let us precise here that, in practice,Dσ will be deﬁned, as in [11], as the gradient of a neural network. It is easy to check, using the chain rule and the fact that semi-algebraicity is stable by product, sum, inverse and composition thatDσ is a then a semi-algebraic mapping.

• Using Proposition 1 with the notations of the same Proposition, we have that ∀x ∈ Rⁿ, gσ(x) ≤ φˆσ(x) Thus, the coercivity of gσ implies the coercivity of φˆσ. Moreover ∀x∈Im(Dσ), φσ = ˆφσ+K. Therefore, as xk ∈Im(Dσ), we have along the sequence thatFλ,σ=λf(xk) +φσ(xk) = λf(x_k) + ˆφ_σ+K. Therefore, by coercivity ofφˆ_σ (and lower boundedness off) and the fact thatFλ,σ(xk)monotonically decreases, the sequencexk

remains necessarily bounded.

E Proof of Theorem 2 ( α PGD for convex and weakly convex optimization)

Proof. (i) and (ii) : We can write (19b) as xk+1∈arg min

y∈Rn

φ(y) +λh∇f(qk+1), y−xki+ 1

2τ||y−xk||²

∈arg min

y∈Rn φ(y) +λf(qk+1) +λh∇f(qk+1), y−qk+1i+ 1

2τ||y−xk||²

∈arg min

y∈Rn Φ(y) + 1

2τ ||y−xk||²

(40)

withΦ(y) :=φ(y) +λf(qk+1) +λh∇f(qk+1), y−qk+1i. AsφisM-weakly convex, so doesΦ. The three-points inequality of Proposition 2 (iii) applied toΦthus gives∀y∈Rⁿ,

Φ(y) + 1

2τ ||y−xk||²≥Φ(xk+1) + 1

2τ||xk+1−xk||²+ 1

2τ −M 2

||xk+1−y||²

(41) that is to say,

φ(y) +λf(qk+1) +λh∇f(qk+1), y−qk+1i+ 1

2τ||y−xk||²≥ φ(xk+1) +λf(qk+1) +λh∇f(qk+1), xk+1−qk+1i

+ 1

2τ ||xk+1−xk||²+ 1

2τ −M 2

||xk+1−y||²

(42)

(18)

Using relation (19c), and the descent Lemma 1 as well as the convexity on f, f(qk+1) +h∇f(qk+1), xk+1−qk+1i

=f(qk+1) +

∇f(qk+1), 1 αyk+1+

1− 1

α

yk−qk+1

=1 α

f(qk+1) +h∇f(qk+1), yk+1−qk+1i

+

1− 1 α

f(qk+1) +h∇f(qk+1), yk−qk+1i

≥1 α

f(yk+1)−Lf

2 ||yk+1−qk+1||²

+

1− 1 α

f(qk+1) +h∇f(qk+1), yk−qk+1i

≥1 α

f(yk+1)−Lf

2 ||yk+1−qk+1||²

+

1− 1 α

f(yk).

(43)

Sinceyk+1−qk+1=α(xk+1−xk)(from relations (19a) and (19c)), by combining relations (42) and (43), we now have for ally∈Rⁿ,

φ(y) +λ 1

α−1

f(yk) + 1

2τ||y−xk||²+λf(qk+1) +λh∇f(qk+1), y−qk+1i ≥ φ(xk+1) +λ

αf(yk+1) + 1

2τ−αλLf

2

||xk+1−x_k||²+ 1

2τ−M 2

||xk+1−y||².

(44) Using again the convexity off we get for ally∈Rⁿ,

φ(y) +λf(y) +λ 1

α−1

f(yk) + 1

2τ ||y−xk||²≥ φ(xk+1) +λ

αf(yk+1) + 1

2τ −αλLf

2

||xk+1−xk||²+ 1

2τ −M 2

||xk+1−y||².

(45) Now, the weak convexity ofφwith relation (19c) gives

φ(xk+1)≥ 1

αφ(yk+1) +

1− 1 α

φ(yk)−M

2 (1−α)||yk−xk+1||². (46) Combining (45) and (46), and usingF =λf+φleads to

∀y∈Rⁿ 1

α−1

(F(yk)−F(y)) + 1

2τ||y−xk||²≥ 1

α(F(yk+1)−F(y)) + 1

2τ −αλL_f 2

||xk+1−xk||²

+ 1

2τ −M 2

||xk+1−y||²−M

2 (1−α)||yk−xk+1||².

(47)

(19)

Fory=y_k, we get 1

α(F(yk)−F(yk+1))≥ − 1

2τ||yk−xk||²+ 1

2τ −αλL_f 2

||xk+1−xk||²

+ 1

2τ −M(2−α) 2

||xk+1−y_k||².

(48)

For constantα∈(0,1), using that yk−xk+1= 1

α(yk−yk+1) yk−xk =

1− 1

α

(yk−yk−1)

(49) (50) we get

F(y_k)−F(yk+1)≥ − α 2τ

1− 1

α 2

||y_k−yk−1||²

+α 1

2τ −αλLf

2

||xk+1−xk||²

+ 1 α

1

2τ −M(2−α) 2

||yk+1−yk||².

(51)

With the assumption τ αλLf < 1, the second term of the right-hand side is non-negative and therefore,

F(yk)−F(yk+1)≥ −α 2τ

1− 1

α 2

||yk−yk−1||²

+ 1 α

1

2τ −M(2−α) 2

||yk+1−yk||².

=−δ||yk−yk−1||²+δ||yk+1−yk||² + (γ−δ)||yk+1−yk||²

(52)

with

δ= α 2τ

1− 1

α 2

γ= 1 α

1

2τ −M(2−α) 2

(53) (54) We now make use of the following lemma.

Lemma 2 ([3]) Let(an)n∈Nand(bn)n∈Nbe two real sequences such thatbn≥0

∀n∈N,(an)is bounded from below andan+1+bn≤an ∀n∈N. Then (an)n∈N

is a monotonically non-increasing and convergent sequence and P

n∈Nbn<+∞ To apply Lemma 2 we look for a stepsize satisfying

γ−δ >0 i.e. τ < α

M. (55)

Therefore, hypothesisτ <min

1 αλLf,_M^α

gives that(F(yk) +δ||yk−yk−1||²) is a non-increasing and convergent sequence and that P

k||yk−yk+1||²<+∞.

(20)

Note that a slightly more precise bound can be found keeping the second term in (51). For sake of completeness, we develop it here. Keeping the assumption τ αλLf <1in (51). We can use that

xk+1−xk = 1

α(yk+1−yk) +

1− 1 α

(yk−yk−1). (56) Then, by convexity of the squaredℓ2 norm, for0< α <1, we have

||yk+1−y_k||²≤α||xk+1−x_k||²+ (1−α)||y_k−yk−1||² (57) and

||xk+1−x_k||²≥ 1

α||yk+1−y_k||²+

1− 1 α

||y_k−yk−1||². (58) which gives ﬁnally

F(yk)−F(yk+1)≥ α

1− 1

α 1

2τ −αLf

2

− α 2τ

1− 1

α 2!

||yk−yk−1||²

+ 1

2τ −αλLf

2 + 1 α( 1

2τ −M(2−α)

2 )

||yk+1−yk||².

=−1−α

2ατ 1−α²τ λLf

||yk−yk−1||²

+ 1

2ατ 1 +α−α²τ λLf−τ M(2−α)

||yk+1−yk||².

=−δ||yk−yk−1||²+δ||yk+1−yk||²+ (γ−δ)||yk+1−yk||²

(59)

with

δ=1−α

2ατ 1−α²τ λLf

γ= 1

2ατ 1−α²τ λLf+α−τ M(2−α)

(60) (61) The condition on the stepsize becomes

γ−δ >0

⇔τ < 2α α³Lf+ (2−α)M

(62)

And the overall condition is τ <min

1

αL_f, 2α α³L_f+ (2−α)M

(63) (iii) The proof of this result is an extension of the proof proposed in Ap- pendix C in the context of the classical PGD. Suppose that a subsequence(yki) is converging towards y. Let’s show that yis a critical point ofF. From (19b), we have

xk+1−xk

τ −λ∇f(qk+1)∈∂φ(xk+1). (64)

(21)

First we show thatx_k_i+1−x_k_i→0. We have∀k >1,

||xk+1−xk||= 1

αyk+1+ (1− 1

α)yk− 1

αyk−(1−1 α)yk−1

≤ 1

α||yk+1−yk||+ (1

α−1)||yk−yk−1||

→0.

(65)

From (19a), we also get ||qk+1−q_k|| → 0. Now, let’s show that x_k_i →y and qki →y. First using (19c), we have

||xki−y|| ≤ ||xki+1−y||+||xki+1−xki||

≤ 1

α||yki+1−y||+ (1

α−1)||yki−y||+||xki+1−xki||

→0.

(66)

Second, from (19a), we get in the same wayqki→y. From the continuity of∇f, we get∇f(qki)→ ∇f(y)and therefore

xki−xki−1

τ −λ∇f(qki)→ −λ∇f(y). (67) As explained in the proof Appendix C (iii), if we can also show thatφ(xki)→φ(y), we get from the subdiﬀerential characterization (32) that−λ∇f(y)∈∂φ(y)i.e.

y is a critical point ofF.

Using the fact thatφis lsc andx_k_i→y.

lim inf

i→∞ φ(xki)≥φ(y). (68)

On the other hand, with Equation (45) for k + 1 = ki, taking i → +∞,

||y−xki+1|| → 0, ||y−xki|| → 0, f(yki) → f(y), f(yki+1) → f(y) and we get

lim sup

i→∞

φ(xki)≤φ(y), (69)

and therefore

i→∞lim φ(xki) =φ(y). (70) E.0.1 On the convergence of the iterates with the KL hypothesis In order to prove a result similar to Theorem 1 (iv) on the convergence of the iterates with the KL hypothesis, we can not directly apply Theorem 2.9 from [2] onF as the objective functionF(xk)by itself does not decrease along the sequence butF(xk) +δ||xk+1−xk||² does (whereδ= _2τ^α 1−_α¹²

).

Our situation is more similar to the variant of this result presented in [16, Theorem 3.7]. Indeed, denoting F : Rⁿ ×Rⁿ → R defined as F(x, y) = F(x) +δ||x−y||²and considering∀k≥1, the sequencez_k = (y_k, yk−1)withy_k following our algorithm, we can easily show that zk verifies the conditions H1 and H3 specified in [16, Section 3.2]. However, condition H2 does not extend to our algorithm. We plan as future work to derive a new version of [16, Section 3.2]

that ﬁts to our case of interest.

21