Optimal Multivariate Gaussian Fitting with Applications to PSF Modeling in Two-Photon Microscopy Imaging


HAL Id: hal-01985663

https://hal.archives-ouvertes.fr/hal-01985663

Submitted on 18 Jan 2019



Emilie Chouzenoux, Tim Tsz-Kit Lau, Claire Lefort, Jean-Christophe Pesquet

To cite this version:

Emilie Chouzenoux, Tim Tsz-Kit Lau, Claire Lefort, Jean-Christophe Pesquet. Optimal Multivariate Gaussian Fitting with Applications to PSF Modeling in Two-Photon Microscopy Imaging. Journal of Mathematical Imaging and Vision, Springer Verlag, 2019, 61 (7), pp.1037-1050. ⟨10.1007/s10851-019-00884-1⟩. ⟨hal-01985663⟩


Emilie Chouzenoux¹,² · Tim Tsz-Kit Lau³ · Claire Lefort⁴ · Jean-Christophe Pesquet¹

Received: date / Accepted: date

Abstract Fitting Gaussian functions to empirical data is a crucial task in a variety of scientific applications, especially in image processing. However, most of the existing approaches for performing such fitting are restricted to two dimensions and they cannot be easily extended to higher dimensions. Moreover, they are usually based on alternating minimization schemes which benefit from few theoretical guarantees in the underlying nonconvex setting. In this paper, we provide a novel variational formulation of the multivariate Gaussian fitting problem, which is applicable to any dimension and accounts for possible non-zero background and noise in the input data. The block multiconvexity of our objective function leads us to propose a proximal alternating method to minimize it in order to estimate the Gaussian shape parameters. The resulting FIGARO algorithm is shown to converge to a critical point under mild assumptions. The algorithm shows a good robustness when tested on synthetic datasets. To demonstrate the versatility of FIGARO, we also illustrate its excellent performance in the fitting of the Point Spread Functions of experimental raw data from a two-photon fluorescence microscope.

Keywords Gaussian fitting · Kullback-Leibler divergence · Alternating minimization · Proximal methods · PSF identification · Two-photon fluorescence microscopy

Emilie Chouzenoux·[email protected]

1 Center for Visual Computing, CentraleSupélec, INRIA Saclay, Université Paris-Saclay, 91190 Gif-sur-Yvette, France

2 Laboratoire d’Informatique Gaspard Monge, UMR CNRS 8049, Université Paris-Est Marne-la-Vallée, 77454 Marne-la-Vallée Cedex 2, France

3 Department of Statistics, Northwestern University, Evanston, IL 60208, United States of America

4 XLIM Research Institute, UMR CNRS 7252, Université de Limoges, 87032 Limoges, France

1 Introduction

Fitting Gaussian shapes to noisy observed data points is an essential task in various science and engineering applications. In the one-dimensional (1D) case, it lies for instance at the core of spectroscopy signal analysis techniques in physical science [21, 31]. In the two-dimensional (2D) case, where Gaussian profile parameters are estimated from images, applications worth mentioning include Gaussian beam characterization, particle tracking, and sensor calibration [28, 37, 15]. In the domain of image recovery, a particularly important application of Gaussian shape fitting is the modeling of Point Spread Functions (PSF) from raw data of optical systems (e.g., microscopes, telescopes). The success of image restoration strategies strongly depends on the accuracy of the PSF estimation [13]. This estimation is often performed through a preliminary step of image acquisition of normalized and calibrated objects, associated with a model fitting strategy. The PSF model is chosen as a trade-off between accuracy and simplicity. Gaussian models often lead to both tractable and good quality approximations [35, 32, 1, 42, 41].

Let $L^1(\mathbb{R}^Q)$ denote the space of real-valued summable functions defined on $\mathbb{R}^Q$. In this paper, we address the problem of fitting a Gaussian model to an observed function $y \in L^1(\mathbb{R}^Q)$. We assume that the observed function $y$ can be modeled as

$$(\forall \mathbf{u} \in \mathbb{R}^Q)\quad y(\mathbf{u}) = a + b\,p(\mathbf{u}) + v(\mathbf{u}), \tag{1.1}$$

where $a \in \mathbb{R}$ is a background term, $b \in (0,+\infty)$ is a scaling parameter, $p \in L^1(\mathbb{R}^Q)$ represents a noiseless version of the observed field, and $v$ is a function accounting for acquisition errors. The main assumption is that $p$ is close, in a sense to be made precise, to the probability density function $\mathbf{u} \mapsto g(\mathbf{u}, \boldsymbol{\mu}, \mathbf{C})$ of a $Q$-dimensional normal distribution with mean $\boldsymbol{\mu} \in \mathbb{R}^Q$ and precision (i.e., inverse covariance) matrix $\mathbf{C} \in \mathcal{S}_Q^{++}$ (see footnote 1). This distribution is expressed as

$$(\forall \mathbf{u} \in \mathbb{R}^Q)(\forall \boldsymbol{\mu} \in \mathbb{R}^Q)(\forall \mathbf{C} \in \mathcal{S}_Q^{++})\quad g(\mathbf{u}, \boldsymbol{\mu}, \mathbf{C}) = \sqrt{\frac{|\mathbf{C}|}{(2\pi)^Q}}\, \exp\!\left(-\frac{1}{2}(\mathbf{u}-\boldsymbol{\mu})^\top \mathbf{C}\,(\mathbf{u}-\boldsymbol{\mu})\right), \tag{1.2}$$

where $|\mathbf{C}|$ denotes the determinant of matrix $\mathbf{C}$. The fitting problem thus consists of finding an estimate $(\widehat{a}, \widehat{b}, \widehat{p}, \widehat{\boldsymbol{\mu}}, \widehat{\mathbf{C}})$ of $(a, b, p, \boldsymbol{\mu}, \mathbf{C})$ in accordance with model (1.1).
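As an illustration of (1.2), the density can be evaluated directly from the precision matrix, without inverting it. The following is a minimal sketch in Python with NumPy; the function name `gaussian_density` and the array layout (one evaluation point per row) are assumptions made for this example and do not come from the paper.

```python
import numpy as np

def gaussian_density(u, mu, C):
    """Evaluate g(u, mu, C) of (1.2), where C is the precision (inverse covariance) matrix.

    u  : array of shape (M, Q) or (Q,), evaluation points (hypothetical layout)
    mu : array of shape (Q,), mean
    C  : array of shape (Q, Q), symmetric positive definite
    """
    u = np.atleast_2d(u)
    Q = mu.shape[0]
    diff = u - mu                                    # broadcast the mean over all points
    quad = np.einsum("mi,ij,mj->m", diff, C, diff)   # (u - mu)^T C (u - mu) for each point
    sign, logdet = np.linalg.slogdet(C)              # numerically stable log-determinant
    if sign <= 0:
        raise ValueError("C must be symmetric positive definite")
    log_norm = 0.5 * (logdet - Q * np.log(2.0 * np.pi))
    return np.exp(log_norm - 0.5 * quad)
```

Working with the log-determinant rather than the determinant itself is a design choice of this sketch; it avoids overflow or underflow when $Q$ is large or $\mathbf{C}$ is badly scaled.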

Because of its prominent importance in applications, a significant amount of work has been devoted to this subject [12, 25, 24, 23, 34, 42]. To the best of our knowledge, all existing works consider that $p = g(\cdot, \boldsymbol{\mu}, \mathbf{C})$ and focus on fitting the parameters $(\widehat{a}, \widehat{b}, \widehat{\boldsymbol{\mu}}, \widehat{\mathbf{C}})$ from $y$. Two main classes of methods can be distinguished. The first set of approaches [25, 24, 34] is based on the search for the best fitting parameters minimizing a least-squares cost between the observations and the sought model. The minimization process is based on the famous Levenberg-Marquardt alternating minimization strategy. However, it is worth mentioning that few established convergence guarantees are available for this method, which may be detrimental to its reliable use in practice. The second class of methods uses the so-called Caruana’s formulation [12]. The idea here is to assume that the background term $a$ is zero and to search for $(\widehat{b}, \widehat{\boldsymbol{\mu}}, \widehat{\mathbf{C}})$ which minimize the difference of logarithms between the data and the model [23, 1]. The advantage of such a strategy is that it gives rise to a convex formulation, for which efficient and reliable optimization techniques can be applied. It is however worth emphasizing that all the aforementioned works are focused on the resolution of the fitting problem in low dimensions, that is when $Q = 1$ [12, 25, 23, 34] or $Q = 2$ [24, 1, 42]. Moreover, except in [34] where a polynomial background is accounted for, the background term $a$ is considered to be zero. These assumptions, however, usually do not correspond to constraints inherent to an experimental setup or environment.

The aim of this paper is to propose a new multivariate Gaussian fitting strategy which avoids the aforementioned limitations. Our method relies on the minimization of a hybrid cost function combining a least-squares data fidelity term, a Kullback-Leibler divergence regularizer for improved robustness, and range constraints on the parameters. This original variational formulation results in a nonconvex minimization problem for which we propose a theoretically sound and efficient proximal alternating iterative resolution scheme. When applied to the analysis of 3D raw data acquired with a two-photon fluorescence microscope, our new computational strategy shows an unprecedented accuracy and reliability.

1 Throughout the paper, $\mathcal{S}_Q^{++}$ will denote the set of symmetric positive definite matrices of $\mathbb{R}^{Q\times Q}$, $\mathcal{S}_Q^{+}$ the set of symmetric positive semidefinite matrices of $\mathbb{R}^{Q\times Q}$, and $\mathcal{S}_Q$ the set of symmetric matrices of $\mathbb{R}^{Q\times Q}$.

In Section 2, the data fitting problem is formulated in a variational manner. A proximal alternating optimization method called FIGARO is then proposed in Section 3 for finding a minimizer of the proposed nonconvex cost function. The implementation of the algorithm steps is discussed.

The convergence of the sequence of iterates resulting from FIGARO is established in Section 4. Section 5 illustrates the high robustness of our approach to a model mismatch, when compared to a standard nonlinear least squares fitting strategy on 3D synthetic data. In Section 6, the scope of our approach is demonstrated through the analysis of the Point Spread Function of a 3D two-photon fluorescence microscope. Finally, Section 7 concludes the paper.

2 Proposed Variational Formulation

The key ingredient of our method relies on measuring the closeness of $p$ to the Gaussian probability density functions by using the Kullback-Leibler (KL) divergence [5]. Let us first recall the definition of the KL divergence. Let $\mathcal{P}$ denote the set of probability density functions supported on $\mathbb{R}^Q$:

$$\mathcal{P} = \left\{ q \in L^1(\mathbb{R}^Q) \;\middle|\; (\forall \mathbf{u} \in \mathbb{R}^Q)\ q(\mathbf{u}) \ge 0,\ \int_{\mathbb{R}^Q} q(\mathbf{u})\,\mathrm{d}\mathbf{u} = 1 \right\}. \tag{2.1}$$

Suppose that $(p, q) \in \mathcal{P}^2$ and that $q$ takes (strictly) positive values. Then the KL divergence from $q$ to $p$ reads

$$\mathrm{KL}(p \,\|\, q) = \int_{\mathbb{R}^Q} p(\mathbf{u}) \log\!\left(\frac{p(\mathbf{u})}{q(\mathbf{u})}\right) \mathrm{d}\mathbf{u}, \tag{2.2}$$

with the convention $0 \log 0 = 0$.

In order to avoid singularity issues, we will assume that the Gaussian variances in each direction are bounded above by some maximal values. The spectrum of the precision matrix $\mathbf{C}$ is thus bounded from below, in the sense that there exists some $\varepsilon > 0$ such that $\mathbf{C} = \mathbf{D} + \varepsilon \mathbf{I}_Q$, where $\mathbf{D}$ belongs to $\mathcal{S}_Q^{+}$ and $\mathbf{I}_Q \in \mathbb{R}^{Q\times Q}$ denotes the identity matrix of $\mathbb{R}^Q$. We then propose to define $(\widehat{a}, \widehat{b}, \widehat{p}, \widehat{\boldsymbol{\mu}}, \widehat{\mathbf{D}})$ as a minimizer of a hybrid cost function, gathering information regarding the observation model (1.1) and the Gaussian shape prior (1.2).

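The minimization problem reads

$$\underset{a\in\mathcal{A},\ b\in\mathcal{B},\ \boldsymbol{\mu}\in\mathbb{R}^Q,\ p\in\mathcal{P},\ \mathbf{D}\in\mathcal{S}_Q^{+}}{\text{minimize}}\quad \frac{1}{2}\int_{\mathbb{R}^Q} \big(y(\mathbf{u}) - a - b\,p(\mathbf{u})\big)^2\,\mathrm{d}\mathbf{u} + \lambda\, \mathrm{KL}\big(p \,\|\, g(\cdot, \boldsymbol{\mu}, \mathbf{D} + \varepsilon \mathbf{I}_Q)\big). \tag{2.3}$$

Hereabove, $\mathcal{A}$ and $\mathcal{B}$ are some nonempty closed bounded real intervals corresponding to known bounds on $a$ and $b$ respectively, and $\lambda > 0$ is a regularization parameter weighting the KL penalty term favoring the proximity between $p$ and the Gaussian model (1.2) parametrized by $(\boldsymbol{\mu}, \mathbf{D})$.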

In practice, however, one generally has access only to a sampling of $y$, which is performed on a bounded Borel set $\Omega$ of $\mathbb{R}^Q$. The set $\Omega$ is assumed to be chosen large enough so that it captures most of the probability mass of the sought Gaussian distribution. More precisely, we will assume that $\Omega$ is paved into $N \in \mathbb{N}$ voxels of volume $\Delta \in (0,+\infty)$ and mass centers $(\mathbf{x}_n)_{1\le n\le N}$. The available vector of observations is then $\mathbf{y} = (y_n)_{1\le n\le N}$ where, for every $n \in \{1,\ldots,N\}$, $y_n = y(\mathbf{x}_n)$. After this discretization, by assuming that $y$ and $p$ are continuous functions in (2.3) and that $\Delta$ is small enough, the following more tractable optimization problem is thus substituted for the original variational formulation:

$$\underset{a\in\mathcal{A},\ b\in\mathcal{B},\ \boldsymbol{\mu}\in\mathbb{R}^Q,\ \mathbf{p}\in\mathcal{P}_d,\ \mathbf{D}\in\mathcal{S}_Q^{+}}{\text{minimize}}\quad \frac{1}{2}\|\mathbf{y} - a\mathbf{1}_N - b\,\mathbf{p}\|^2 + \lambda \sum_{n=1}^{N} p_n \log\!\left(\frac{p_n}{g(\mathbf{x}_n, \boldsymbol{\mu}, \mathbf{D} + \varepsilon \mathbf{I}_Q)}\right), \tag{2.4}$$

where $\|\cdot\|$ denotes the standard Euclidean norm. The probability density function $p$ has been replaced by the vector $\mathbf{p} = (p_n)_{1\le n\le N}$, which belongs to $\mathcal{P}_d = [0,+\infty)^N \cap \mathcal{C}$, where $\mathcal{C}$ is the affine hyperplane

$$\mathcal{C} = \left\{ \mathbf{p} \in \mathbb{R}^N \;\middle|\; \sum_{n=1}^{N} p_n = \Delta^{-1} \right\}. \tag{2.5}$$
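To make the discretization concrete, the voxel centers $\mathbf{x}_n$, the common volume $\Delta$, and a vector satisfying the constraint (2.5) can be built as follows. This is only a sketch under the assumption that $\Omega$ is an axis-aligned box paved by a regular grid; the helper names `voxel_grid` and `normalize_to_Pd` are hypothetical and not part of the paper.

```python
import numpy as np

def voxel_grid(lower, upper, shape):
    """Centers x_n and common volume Delta of a regular paving of the box [lower, upper].

    lower, upper : sequences of length Q (box corners, an assumption of this sketch)
    shape        : number of voxels along each of the Q axes
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    steps = (upper - lower) / np.asarray(shape)
    # Voxel centers along each axis: half a step inside the box, then one step apart.
    axes = [lo + (np.arange(n) + 0.5) * h for lo, n, h in zip(lower, shape, steps)]
    centers = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, len(shape))
    return centers, float(np.prod(steps))

def normalize_to_Pd(values, delta):
    """Map samples with a positive sum to a vector in P_d: nonnegative entries summing to 1/delta."""
    values = np.maximum(np.asarray(values, dtype=float), 0.0)
    return values / (values.sum() * delta)
```

For instance, `voxel_grid([0, 0], [1, 2], (2, 4))` returns eight centers in $\mathbb{R}^2$ and $\Delta = 0.25$; applying `normalize_to_Pd` to any nonnegative image sampled on that grid gives an admissible starting point for $\mathbf{p}$.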

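The discrete KL term in (2.4) can be rewritten as

$$\sum_{n=1}^{N} p_n \log\!\left(\frac{p_n}{g(\mathbf{x}_n, \boldsymbol{\mu}, \mathbf{D} + \varepsilon \mathbf{I}_Q)}\right) = \sum_{n=1}^{N} \left( \operatorname{ent}(p_n) + p_n \left( \frac{Q}{2}\log(2\pi) - \frac{1}{2}\log(|\mathbf{D} + \varepsilon \mathbf{I}_Q|) + \frac{1}{2}(\mathbf{x}_n - \boldsymbol{\mu})^\top (\mathbf{D} + \varepsilon \mathbf{I}_Q)(\mathbf{x}_n - \boldsymbol{\mu}) \right) \right), \tag{2.6}$$

where

$$(\forall \upsilon \in \mathbb{R})\quad \operatorname{ent}(\upsilon) = \begin{cases} \upsilon \log \upsilon, & \upsilon > 0, \\ 0, & \upsilon = 0, \\ +\infty, & \text{otherwise.} \end{cases} \tag{2.7}$$

Note that the above definition of the function ent allows us to impose directly the nonnegativity of the components of $\mathbf{p}$.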
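The identity (2.6) and the convention encoded in (2.7) can be checked numerically. The sketch below, in Python with NumPy, evaluates the right-hand side of (2.6); the function names `ent` and `discrete_kl` and the argument layout (centers stored row-wise in `x`) are choices made for this example only.

```python
import numpy as np

def ent(v):
    """Function ent of (2.7), applied entrywise: v log v for v > 0, 0 at v = 0, +inf for v < 0."""
    v = np.asarray(v, dtype=float)
    out = np.full(v.shape, np.inf)
    pos = v > 0
    out[pos] = v[pos] * np.log(v[pos])
    out[v == 0] = 0.0
    return out

def discrete_kl(p, x, mu, D, eps):
    """Right-hand side of (2.6) for the precision matrix D + eps * I_Q.

    p : (N,) vector, x : (N, Q) voxel centers, mu : (Q,), D : (Q, Q) symmetric PSD.
    """
    Q = mu.shape[0]
    C = D + eps * np.eye(Q)
    diff = x - mu
    quad = np.einsum("ni,ij,nj->n", diff, C, diff)   # (x_n - mu)^T C (x_n - mu)
    logdet = np.linalg.slogdet(C)[1]                 # log|D + eps I_Q|
    c = 0.5 * Q * np.log(2.0 * np.pi) - 0.5 * logdet + 0.5 * quad
    return float(np.sum(ent(p) + p * c))
```

Because `ent` returns $+\infty$ for negative entries, any vector $\mathbf{p}$ with a negative component yields an infinite cost, which is how the nonnegativity constraint is enforced without a separate indicator term.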

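For technical reasons which will appear later, we will also need to perform a twice continuously differentiable extension of the function $\mathbf{D} \mapsto -\log(|\mathbf{D} + \varepsilon \mathbf{I}_Q|)$ on the whole domain $\mathcal{S}_Q$. This extension $\varphi$ is defined as follows. For every $\mathbf{D} \in \mathcal{S}_Q$ decomposed as $\mathbf{U}\operatorname{Diag}(\boldsymbol{\sigma})\mathbf{U}^\top$, with $\mathbf{U} \in \mathbb{R}^{Q\times Q}$ an orthogonal matrix and $\boldsymbol{\sigma} = (\sigma_q)_{1\le q\le Q}$ the associated vector of eigenvalues of $\mathbf{D}$,

$$\varphi(\mathbf{D}) = \widetilde{\varphi}(\boldsymbol{\sigma}) = \begin{cases} -\log(|\mathbf{D} + \varepsilon \mathbf{I}_Q|) = -\displaystyle\sum_{q=1}^{Q} \log(\sigma_q + \varepsilon), & \text{if } \mathbf{D} \in \mathcal{S}_Q^{+}, \\[2mm] \widetilde{\varphi}(\mathbf{0}_Q) + \boldsymbol{\sigma}^\top \nabla\widetilde{\varphi}(\mathbf{0}_Q) + \dfrac{1}{2}\boldsymbol{\sigma}^\top \nabla^2\widetilde{\varphi}(\mathbf{0}_Q)\,\boldsymbol{\sigma}, & \text{otherwise,} \end{cases} \tag{2.8}$$

where $\mathbf{0}_Q$ is the $Q$-dimensional null vector, $\mathbf{1}_Q$ the $Q$-dimensional vector of all ones, and

$$\widetilde{\varphi}(\mathbf{0}_Q) = -Q \log \varepsilon, \qquad \nabla\widetilde{\varphi}(\mathbf{0}_Q) = -\varepsilon^{-1}\mathbf{1}_Q, \qquad \nabla^2\widetilde{\varphi}(\mathbf{0}_Q) = \varepsilon^{-2}\mathbf{I}_Q. \tag{2.9}$$

Let us denote by $\iota_{\mathcal{S}}$ the indicator function of a set $\mathcal{S}$, which is equal to 0 on this set and $+\infty$ otherwise. We are now ready to define the cost function which is minimized in our Gaussian fitting approach: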

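$$(\forall a \in \mathbb{R})(\forall b \in \mathbb{R})(\forall \mathbf{p} \in \mathbb{R}^N)(\forall \boldsymbol{\mu} \in \mathbb{R}^Q)(\forall \mathbf{D} \in \mathcal{S}_Q)\quad F(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) = \frac{1}{2}\|\mathbf{y} - a\mathbf{1}_N - b\,\mathbf{p}\|^2 + \iota_{\mathcal{A}}(a) + \iota_{\mathcal{B}}(b) + \lambda \Psi(\mathbf{p}, \boldsymbol{\mu}, \mathbf{D}), \tag{2.10}$$

where

$$(\forall \mathbf{p} \in \mathbb{R}^N)(\forall \boldsymbol{\mu} \in \mathbb{R}^Q)(\forall \mathbf{D} \in \mathcal{S}_Q)\quad \Psi(\mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) = \sum_{n=1}^{N} \left( \operatorname{ent}(p_n) + \frac{p_n}{2}\Big( Q\log(2\pi) + \varphi(\mathbf{D}) + (\mathbf{x}_n - \boldsymbol{\mu})^\top(\mathbf{D} + \varepsilon \mathbf{I}_Q)(\mathbf{x}_n - \boldsymbol{\mu}) \Big) \right) + \iota_{\mathcal{C}}(\mathbf{p}) + \iota_{\mathcal{S}_Q^{+}}(\mathbf{D}). \tag{2.11}$$

Remark 1 The proposed formulation deals with a regular grid but it can be easily extended to the case of irregular sampling by changing the definition of $\mathcal{C}$ into

$$\mathcal{C} = \left\{ \mathbf{p} \in \mathbb{R}^N \;\middle|\; \sum_{n=1}^{N} \Delta_n p_n = 1 \right\}, \tag{2.12}$$

where, for every $n \in \{1,\ldots,N\}$, $\Delta_n \in (0,+\infty)$ is the volume of the $n$-th voxel.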
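The extension $\varphi$ of (2.8)-(2.9), which enters the cost through $\Psi$ in (2.11), only depends on the eigenvalues of $\mathbf{D}$. A minimal NumPy sketch of its evaluation is given below; the function name `phi` is hypothetical, and the positive semidefiniteness test on computed eigenvalues is exact only up to floating-point error, which a practical implementation may want to handle with a small tolerance.

```python
import numpy as np

def phi(D, eps):
    """Extension (2.8) of D -> -log|D + eps I_Q| to all symmetric matrices.

    D   : (Q, Q) symmetric matrix
    eps : positive lower bound on the spectrum of the precision matrix
    """
    sigma = np.linalg.eigvalsh(D)      # eigenvalues of the symmetric matrix D
    Q = sigma.size
    if np.all(sigma >= 0):             # D belongs to S_Q^+: first branch of (2.8)
        return float(-np.sum(np.log(sigma + eps)))
    # Otherwise: second-order Taylor model of phi~ at 0_Q, with the values given in (2.9).
    return float(-Q * np.log(eps) - np.sum(sigma) / eps + 0.5 * np.sum(sigma ** 2) / eps ** 2)
```

Both branches return $-Q\log\varepsilon$ at $\mathbf{D} = \mathbf{0}$, which is consistent with the values listed in (2.9).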

3 FIGARO Minimization Algorithm

3.1 Proposed Algorithm

The objective function (4.1) is nonconvex, yet convex with respect to each variable. A standard resolution approach is thus to adopt an alternating minimization strategy, where, at each iteration, $F$ is minimized with respect to one variable while the others remain fixed. This approach, sometimes referred to as Block Coordinate Descent or nonlinear Gauss-Seidel method, has been widely used in the context of PSF model fitting [42, 30, 32]. However, its convergence is only guaranteed under restrictive assumptions [38]. In order to get sounder convergence results, we propose to use an alternative strategy based on proximal tools, which consists of replacing, at each iteration, the direct minimization step by a proximal one ([33, Def. 1.22], [6, Def. 12.23], [18, Def. 10.1], [11]).

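Definition 1 (Domain) Let $f$ be a function from $\mathbb{R}^n$ to $(-\infty,+\infty]$. The domain of $f$ is defined by

$$\operatorname{dom} f := \{ x \in \mathbb{R}^n : f(x) < +\infty \}.$$

The function $f$ is proper if and only if $\operatorname{dom} f$ is nonempty.

Definition 2 (Proximity operator) Let $f : \mathbb{R}^n \to (-\infty,+\infty]$ be a convex, proper, lower semi-continuous function. The proximity operator of $f$ at $x \in \mathbb{R}^n$ is defined as

$$\operatorname{prox}_f(x) = \underset{y\in\mathbb{R}^n}{\operatorname{argmin}}\ \left( f(y) + \frac{1}{2}\|y - x\|^2 \right).$$

Let $\mathcal{S}$ be a nonempty closed convex subset of $\mathbb{R}^n$. Then $\operatorname{prox}_{\iota_{\mathcal{S}}}$ is equal to the projection $P_{\mathcal{S}}$ onto $\mathcal{S}$.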

The application of the proximal alternating method [4, 2, 8] to the minimization of (4.1) yields Algorithm 1, called FIGARO (Fitting Gaussians with Proximal Optimization).

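Algorithm 1 FIGARO method

Initialization: $a^{(0)} \in \mathcal{A}$, $b^{(0)} \in \mathcal{B}$, $\mathbf{p}^{(0)} \in \mathcal{C}$, $\boldsymbol{\mu}^{(0)} \in \mathbb{R}^Q$, $\mathbf{D}^{(0)} \in \mathcal{S}_Q^{+}$, $(\gamma_a, \gamma_b, \gamma_p, \gamma_\mu, \gamma_D) \in (0,+\infty)^5$.

for $i = 0, 1, \ldots$ do

    $a^{(i+1)} = \operatorname{prox}_{\gamma_a F(\cdot,\, b^{(i)},\, \mathbf{p}^{(i)},\, \boldsymbol{\mu}^{(i)},\, \mathbf{D}^{(i)})}\big(a^{(i)}\big)$

    $b^{(i+1)} = \operatorname{prox}_{\gamma_b F(a^{(i+1)},\, \cdot,\, \mathbf{p}^{(i)},\, \boldsymbol{\mu}^{(i)},\, \mathbf{D}^{(i)})}\big(b^{(i)}\big)$

    $\mathbf{p}^{(i+1)} = \operatorname{prox}_{\gamma_p F(a^{(i+1)},\, b^{(i+1)},\, \cdot,\, \boldsymbol{\mu}^{(i)},\, \mathbf{D}^{(i)})}\big(\mathbf{p}^{(i)}\big)$

    $\boldsymbol{\mu}^{(i+1)} = \operatorname{prox}_{\gamma_\mu F(a^{(i+1)},\, b^{(i+1)},\, \mathbf{p}^{(i+1)},\, \cdot,\, \mathbf{D}^{(i)})}\big(\boldsymbol{\mu}^{(i)}\big)$

    $\mathbf{D}^{(i+1)} = \operatorname{prox}_{\gamma_D F(a^{(i+1)},\, b^{(i+1)},\, \mathbf{p}^{(i+1)},\, \boldsymbol{\mu}^{(i+1)},\, \cdot)}\big(\mathbf{D}^{(i)}\big)$

end for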

Remark that other methods such as those proposed in [40, 17, 10] are also applicable to our problem, but the considered alternating proximal point algorithm may appear preferable because of its simplicity.

3.2 Expressions of the Proximity Operators

In this part, we show that the proximity operators required in Algorithm 1 have closed form expressions.

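Proposition 1 Let $(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) \in \mathbb{R}\times\mathbb{R}\times\mathbb{R}^N\times\mathbb{R}^Q\times\mathcal{S}_Q$ and $(\gamma_a, \gamma_b) \in (0,+\infty)^2$. The proximity operator of $\gamma_a F(\cdot, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D})$ at $a$ is given by

$$\operatorname{prox}_{\gamma_a F(\cdot, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D})}(a) = P_{\mathcal{A}}\!\left( \frac{a + \gamma_a \mathbf{1}_N^\top(\mathbf{y} - b\,\mathbf{p})}{1 + \gamma_a N} \right) \tag{3.1}$$

and the proximity operator of $\gamma_b F(a, \cdot, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D})$ at $b$ is given by

$$\operatorname{prox}_{\gamma_b F(a, \cdot, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D})}(b) = P_{\mathcal{B}}\!\left( \frac{b + \gamma_b (\mathbf{y} - a\mathbf{1}_N)^\top \mathbf{p}}{1 + \gamma_b \|\mathbf{p}\|^2} \right). \tag{3.2}$$

Proof Calculating the proximity operator of $\gamma_a F(\cdot, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D})$ is equivalent to calculating the proximity operator of the one-variable function $\vartheta + \iota_{\mathcal{A}}$ where

$$(\forall a \in \mathbb{R})\quad \vartheta(a) = \frac{\gamma_a}{2}\sum_{n=1}^{N} (y_n - a - b\,p_n)^2. \tag{3.3}$$

It follows from [14] that

$$\operatorname{prox}_{\gamma_a F(\cdot, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D})} = P_{\mathcal{A}} \circ \operatorname{prox}_{\vartheta}. \tag{3.4}$$

On the other hand, it follows from [18] that

$$\operatorname{prox}_{\vartheta}(a) = \frac{a + \gamma_a \mathbf{1}_N^\top(\mathbf{y} - b\,\mathbf{p})}{1 + \gamma_a N}. \tag{3.5}$$

Expression (3.2) is obtained by similar arguments. □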
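The closed forms (3.1) and (3.2) translate into a few lines of code. The sketch below assumes that the intervals $\mathcal{A}$ and $\mathcal{B}$ are given as `(lower, upper)` pairs, so that the projections $P_{\mathcal{A}}$ and $P_{\mathcal{B}}$ reduce to clipping; the function names and this interval encoding are assumptions of the example.

```python
import numpy as np

def prox_a(a, b, p, y, gamma_a, a_bounds):
    """Formula (3.1): proximity operator of gamma_a * F(., b, p, mu, D) at a.

    a_bounds = (lower, upper) encodes the closed bounded interval A.
    """
    N = y.size
    unconstrained = (a + gamma_a * np.sum(y - b * p)) / (1.0 + gamma_a * N)
    return float(np.clip(unconstrained, a_bounds[0], a_bounds[1]))   # projection P_A

def prox_b(a, b, p, y, gamma_b, b_bounds):
    """Formula (3.2): proximity operator of gamma_b * F(a, ., p, mu, D) at b.

    b_bounds = (lower, upper) encodes the closed bounded interval B.
    """
    unconstrained = (b + gamma_b * np.dot(y - a, p)) / (1.0 + gamma_b * np.dot(p, p))
    return float(np.clip(unconstrained, b_bounds[0], b_bounds[1]))   # projection P_B
```

Neither update depends on $\boldsymbol{\mu}$ or $\mathbf{D}$, since these variables only appear in $\Psi$; this is why they are absent from the argument lists.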

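Proposition 2 Let $(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) \in \mathbb{R}\times\mathbb{R}\times\mathbb{R}^N\times\mathbb{R}^Q\times\mathcal{S}_Q$ and $\gamma_p > 0$. The proximity operator of $\gamma_p F(a, b, \cdot, \boldsymbol{\mu}, \mathbf{D})$ at $\mathbf{p}$ is given by

$$\operatorname{prox}_{\gamma_p F(a, b, \cdot, \boldsymbol{\mu}, \mathbf{D})}(\mathbf{p}) = \Big( \rho^{-1}\, W\big(\rho \exp(w_n(\widehat{\nu}))\big) \Big)_{1\le n\le N}, \tag{3.6}$$

where $W$ denotes the Lambert-W function [19],

$$\rho = \frac{\gamma_p b^2 + 1}{\gamma_p \lambda}, \tag{3.7}$$

and, for every $n \in \{1,\ldots,N\}$, $w_n$ is the function defined as

$$(\forall \nu \in \mathbb{R})\quad w_n(\nu) = -1 - c_n + (\gamma_p\lambda)^{-1}\big(p_n + \gamma_p b\,(y_n - a) - \nu\big), \tag{3.8}$$

with

$$c_n = \frac{Q}{2}\log(2\pi) + \frac{1}{2}\varphi(\mathbf{D}) + \frac{1}{2}(\mathbf{x}_n - \boldsymbol{\mu})^\top(\mathbf{D} + \varepsilon \mathbf{I}_Q)(\mathbf{x}_n - \boldsymbol{\mu}). \tag{3.9}$$

Moreover, $\widehat{\nu} \in \mathbb{R}$ is the unique zero of the function

$$(\forall \nu \in \mathbb{R})\quad \Phi(\nu) = \rho^{-1}\sum_{n=1}^{N} W\big(\rho \exp(w_n(\nu))\big) - \Delta^{-1}. \tag{3.10}$$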

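Proof Let $\widetilde{\mathbf{p}} \in \mathbb{R}^N$. Then,

$$\begin{aligned} \widehat{\mathbf{p}} &= \operatorname{prox}_{\gamma_p F(a, b, \cdot, \boldsymbol{\mu}, \mathbf{D})}(\widetilde{\mathbf{p}}) \\ &= \underset{\mathbf{p}\in\mathbb{R}^N}{\operatorname{argmin}}\ \left( \gamma_p F(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) + \frac{1}{2}\|\mathbf{p} - \widetilde{\mathbf{p}}\|^2 \right) \\ &= \underset{\mathbf{p}\in\mathcal{C}}{\operatorname{argmin}}\ \left( \gamma_p \sum_{n=1}^{N} \frac{1}{2}(y_n - a - b\,p_n)^2 + \gamma_p\lambda \sum_{n=1}^{N} (p_n \log p_n + p_n c_n) + \frac{1}{2}\sum_{n=1}^{N} (p_n - \widetilde{p}_n)^2 \right). \end{aligned} \tag{3.11}$$

The Lagrangian function associated with the above constrained problem reads

$$(\forall \mathbf{p} \in [0,+\infty)^N)(\forall \nu \in \mathbb{R})\quad \mathcal{L}(\mathbf{p}, \nu) = \gamma_p \sum_{n=1}^{N} \frac{1}{2}(y_n - a - b\,p_n)^2 + \sum_{n=1}^{N} \left( \gamma_p\lambda\,(p_n \log p_n + p_n c_n) + \frac{1}{2}(p_n - \widetilde{p}_n)^2 \right) + \nu \left( \sum_{n=1}^{N} p_n - \Delta^{-1} \right). \tag{3.12}$$

Since Slater’s condition obviously holds, there exists $\widehat{\nu} \in \mathbb{R}$ such that $(\widehat{\mathbf{p}}, \widehat{\nu})$ is a saddle point of $\mathcal{L}$ [7]. By Fermat’s rule [6], $\widehat{\mathbf{p}} = (\widehat{p}_n)_{1\le n\le N}$ is thus obtained by finding a zero of the partial subdifferential of $\mathcal{L}$ with respect to variable $\mathbf{p}$. By using (3.7), this yields, for every $n \in \{1,\ldots,N\}$,

$$\begin{aligned} &\gamma_p\,(b^2 \widehat{p}_n - b\,y_n + ab) + \gamma_p\lambda\,(1 + \log \widehat{p}_n + c_n) + \widehat{p}_n - \widetilde{p}_n + \widehat{\nu} = 0 \\ \Leftrightarrow\ &\rho\,\widehat{p}_n + \log \widehat{p}_n = w_n(\widehat{\nu}) \\ \Leftrightarrow\ &\rho\,\widehat{p}_n \exp(\rho\,\widehat{p}_n) = \rho \exp(w_n(\widehat{\nu})), \end{aligned} \tag{3.13}$$

where $w_n$ is defined as in (3.8) with $p_n$ replaced by $\widetilde{p}_n$. By recalling that the Lambert-W function is such that $(\forall z \in (0,+\infty))$ $W(z)\exp(W(z)) = z$, we deduce (3.6).

In addition, canceling the derivative of $\mathcal{L}$ with respect to $\nu$ amounts to finding a zero of function $\Phi$ defined in (3.10). The existence of a zero is guaranteed by the existence of $\widehat{\mathbf{p}}$. Let us now establish its uniqueness by evaluating the derivative $\Phi'$ using the following property of the Lambert-W function:

$$(\forall z \in (0,+\infty))\quad W'(z) = \frac{1}{(W(z) + 1)\,e^{W(z)}} = \frac{W(z)}{(W(z) + 1)\,z}. \tag{3.14}$$

We have then

$$(\forall \nu \in \mathbb{R})\quad \Phi'(\nu) = \sum_{n=1}^{N} W'\big(\rho \exp(w_n(\nu))\big) \exp(w_n(\nu))\, w_n'(\nu) = -\frac{1}{\gamma_p\lambda\rho} \sum_{n=1}^{N} \left( 1 - \frac{1}{W\big(\rho \exp(w_n(\nu))\big) + 1} \right). \tag{3.15}$$

Therefore, since $W$ takes positive values on $(0,+\infty)$, $\Phi'(\nu) < 0$ for every $\nu \in \mathbb{R}$, i.e., $\Phi$ is strictly decreasing. We thus conclude that it has a unique zero $\widehat{\nu}$. □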

The computation of the above proximity operator requires determining the zero of the scalar function $\Phi$. The following lemma shows that this can be achieved with high precision using Newton’s algorithm, the convergence of which is guaranteed for any initialization.

Lemma 1 The Newton iteration

$$(\forall t \in \mathbb{N})\quad \nu^{(t+1)} = \nu^{(t)} - \frac{\Phi(\nu^{(t)})}{\Phi'(\nu^{(t)})} \tag{3.16}$$

converges to the unique zero of $\Phi$ from any starting point $\nu^{(0)} \in \mathbb{R}$.

Proof We have already shown that $\Phi$ is strictly decreasing on $\mathbb{R}$ and has a unique zero. Let us now establish the convexity of $\Phi$ by calculating its second-order derivative:

$$(\forall \nu \in \mathbb{R})\quad \Phi''(\nu) = -\frac{1}{\gamma_p\lambda} \sum_{n=1}^{N} \frac{W'\big(\rho \exp(w_n(\nu))\big) \exp(w_n(\nu))\, w_n'(\nu)}{\big(W(\rho \exp(w_n(\nu))) + 1\big)^2} = \frac{1}{\gamma_p^2\lambda^2\rho} \sum_{n=1}^{N} \frac{W\big(\rho \exp(w_n(\nu))\big)}{\big(W(\rho \exp(w_n(\nu))) + 1\big)^3}. \tag{3.17}$$

Since $W(\rho \exp(w_n(\nu))) > 0$ for every $n \in \{1,\ldots,N\}$ and $\nu \in \mathbb{R}$, we have $\Phi''(\nu) > 0$ for all $\nu \in \mathbb{R}$, i.e., $\Phi$ is strictly convex. Now, let us ascertain the convergence of Newton’s method for finding the unique root of $\Phi$. The remainder of our proof follows similar arguments to those of [29, Chapter 3, Theorem 2]. For every $t \in \mathbb{N}$, let $e^{(t)}$ be the error defined as

$$(\forall t \in \mathbb{N})\quad e^{(t)} = \nu^{(t)} - \widehat{\nu},$$

where $\widehat{\nu}$ is the zero of $\Phi$. From the definition of the Newton iteration, we have

$$(\forall t \in \mathbb{N})\quad e^{(t+1)} = \nu^{(t+1)} - \widehat{\nu} = \nu^{(t)} - \frac{\Phi(\nu^{(t)})}{\Phi'(\nu^{(t)})} - \widehat{\nu} = e^{(t)} - \frac{\Phi(\nu^{(t)})}{\Phi'(\nu^{(t)})} = \frac{e^{(t)}\Phi'(\nu^{(t)}) - \Phi(\nu^{(t)})}{\Phi'(\nu^{(t)})}. \tag{3.18}$$

By performing a second-order Taylor expansion, we get

$$(\forall t \in \mathbb{N})\quad 0 = \Phi(\widehat{\nu}) = \Phi(\nu^{(t)} - e^{(t)}) = \Phi(\nu^{(t)}) - e^{(t)}\Phi'(\nu^{(t)}) + \frac{1}{2}(e^{(t)})^2\, \Phi''(\xi^{(t)}), \tag{3.19}$$

where, for all $t \in \mathbb{N}$, $\xi^{(t)} \in [\min(\widehat{\nu}, \nu^{(t)}), \max(\widehat{\nu}, \nu^{(t)})]$. Combining the latter equality with (3.18) yields

$$(\forall t \in \mathbb{N})\quad e^{(t+1)} = \frac{1}{2}\,\frac{\Phi''(\xi^{(t)})}{\Phi'(\nu^{(t)})}\,(e^{(t)})^2. \tag{3.20}$$

Recall that $\Phi'(\nu) < 0$ and $\Phi''(\nu) > 0$ for all $\nu \in \mathbb{R}$. According to (3.20), for every $t \in \mathbb{N}$, $e^{(t+1)} < 0$, which implies that $\nu^{(t)} < \widehat{\nu}$ for all $t \ge 1$. Thus, since $\Phi$ is strictly decreasing, $(\forall t \ge 1)$ $\Phi(\nu^{(t)}) > \Phi(\widehat{\nu}) = 0$. By (3.18), $(\forall t \ge 1)$ $e^{(t+1)} > e^{(t)}$, and thus $(e^{(t)})_{t\ge 1}$ is increasing and upper bounded by 0. Hence, $(\nu^{(t)})_{t\ge 1}$ is also increasing and upper bounded by $\widehat{\nu}$. Therefore, the limits $e^* = \lim_{t\to+\infty} e^{(t)}$ and $\nu^* = \lim_{t\to+\infty} \nu^{(t)}$ exist. We deduce from (3.18) that $e^* = e^* - \Phi(\nu^*)/\Phi'(\nu^*)$, which implies that $\Phi(\nu^*) = 0$ and $\nu^* = \widehat{\nu}$. □
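Proposition 2 and Lemma 1 together suggest the following computation of the proximity operator with respect to $\mathbf{p}$. The code is a sketch only: it relies on `scipy.special.lambertw` for the principal branch of $W$, expects the constants $c_n$ of (3.9) to be precomputed in the array `c`, and uses a starting point $\nu^{(0)} = 0$ and a stopping rule that are choices of this example rather than prescriptions of the paper. It also evaluates $\exp(w_n(\nu))$ directly, which may overflow for large $w_n$; a robust implementation would need to guard against this.

```python
import numpy as np
from scipy.special import lambertw

def prox_p(p, a, b, y, c, gamma_p, lam, delta, tol=1e-12, max_iter=100):
    """Formulas (3.6)-(3.10), with the zero of Phi found by the Newton iteration (3.16).

    p, y, c : (N,) arrays (point where the prox is evaluated, data, constants c_n of (3.9))
    delta   : voxel volume, so that the output sums to 1 / delta
    Returns the updated vector and the multiplier nu_hat.
    """
    rho = (gamma_p * b ** 2 + 1.0) / (gamma_p * lam)                  # (3.7)

    def W_of(nu):
        w = -1.0 - c + (p + gamma_p * b * (y - a) - nu) / (gamma_p * lam)   # (3.8)
        return lambertw(rho * np.exp(w)).real      # principal branch, real for positive input

    nu = 0.0                                        # arbitrary start, see Lemma 1
    for _ in range(max_iter):
        W = W_of(nu)
        Phi = np.sum(W) / rho - 1.0 / delta                           # (3.10)
        dPhi = -np.sum(W / (W + 1.0)) / (gamma_p * lam * rho)         # (3.15)
        step = Phi / dPhi
        nu -= step                                                    # (3.16)
        if abs(step) <= tol * (1.0 + abs(nu)):
            break
    return W_of(nu) / rho, nu                                         # (3.6)
```

By construction the returned vector has positive entries and satisfies the constraint (2.5) up to the Newton tolerance, so no separate projection onto $\mathcal{P}_d$ is required.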

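Proposition 3 Let $(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) \in \mathbb{R}\times\mathbb{R}\times\mathbb{R}^N\times\mathbb{R}^Q\times\mathcal{S}_Q$ and $\gamma_\mu > 0$. The proximity operator of $\gamma_\mu F(a, b, \mathbf{p}, \cdot, \mathbf{D})$ at $\boldsymbol{\mu}$ is given by

$$\operatorname{prox}_{\gamma_\mu F(a, b, \mathbf{p}, \cdot, \mathbf{D})}(\boldsymbol{\mu}) = \Big( \mathbf{I}_Q + \gamma_\mu\lambda\,(\mathbf{1}_N^\top \mathbf{p})(\mathbf{D} + \varepsilon \mathbf{I}_Q) \Big)^{-1} \left( \boldsymbol{\mu} + \gamma_\mu\lambda \sum_{n=1}^{N} p_n (\mathbf{D} + \varepsilon \mathbf{I}_Q)\,\mathbf{x}_n \right). \tag{3.21}$$

Proof Calculating the proximity operator of $\gamma_\mu F(a, b, \mathbf{p}, \cdot, \mathbf{D})$ is equivalent to calculating the proximity operator of the quadratic function

$$\boldsymbol{\mu} \mapsto \gamma_\mu\lambda \sum_{n=1}^{N} \frac{p_n}{2} (\mathbf{x}_n - \boldsymbol{\mu})^\top(\mathbf{D} + \varepsilon \mathbf{I}_Q)(\mathbf{x}_n - \boldsymbol{\mu}). \tag{3.22}$$

The result then follows from [18]. □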
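Formula (3.21) amounts to solving a $Q \times Q$ linear system. A possible NumPy sketch is given below; the function name `prox_mu` and the row-wise storage of the voxel centers in `x` are assumptions of the example. Solving the system with `numpy.linalg.solve` instead of forming the inverse explicitly is a standard numerical choice, not something stated in the paper.

```python
import numpy as np

def prox_mu(mu, p, x, D, eps, gamma_mu, lam):
    """Formula (3.21): proximity operator of gamma_mu * F(a, b, p, ., D) at mu.

    mu : (Q,), p : (N,), x : (N, Q) voxel centers, D : (Q, Q) symmetric PSD.
    """
    Q = mu.size
    C = D + eps * np.eye(Q)                              # precision matrix D + eps I_Q
    lhs = np.eye(Q) + gamma_mu * lam * np.sum(p) * C     # I_Q + gamma*lambda*(1^T p) C
    rhs = mu + gamma_mu * lam * C @ (p @ x)              # p @ x equals sum_n p_n x_n
    return np.linalg.solve(lhs, rhs)
```

As $\gamma_\mu \to +\infty$ the output tends to the weighted barycenter $\sum_n p_n \mathbf{x}_n / \sum_n p_n$, which is the exact minimizer of (3.22) in $\boldsymbol{\mu}$.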

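Proposition 4 Let $(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) \in \mathbb{R}\times\mathbb{R}\times\mathbb{R}^N\times\mathbb{R}^Q\times\mathcal{S}_Q$ and $\gamma_D > 0$. The proximity operator of $\gamma_D F(a, b, \mathbf{p}, \boldsymbol{\mu}, \cdot)$ at $\mathbf{D}$ is given by

$$\operatorname{prox}_{\gamma_D F(a, b, \mathbf{p}, \boldsymbol{\mu}, \cdot)}(\mathbf{D}) = \frac{1}{2}\,\mathbf{V}\operatorname{Diag}\!\left( \Big( \max\Big( \omega_q - \varepsilon + \sqrt{(\omega_q + \varepsilon)^2 + 4m},\ 0 \Big) \Big)_{1\le q\le Q} \right) \mathbf{V}^\top, \tag{3.23}$$

where $\boldsymbol{\omega} = (\omega_q)_{1\le q\le Q}$ is a vector of eigenvalues of $\mathbf{D} - \mathbf{S}$ and $\mathbf{V}$ is a $Q\times Q$ orthogonal matrix such that $\mathbf{D} - \mathbf{S} = \mathbf{V}\operatorname{Diag}(\boldsymbol{\omega})\mathbf{V}^\top$, with

$$\mathbf{S} = \frac{1}{2}\gamma_D\lambda \sum_{n=1}^{N} p_n (\mathbf{x}_n - \boldsymbol{\mu})(\mathbf{x}_n - \boldsymbol{\mu})^\top \quad\text{and}\quad m = \frac{1}{2}\gamma_D\lambda\,(\mathbf{1}_N^\top \mathbf{p}).$$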

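Proof Let $\|\cdot\|_F$ denote the Frobenius norm and let $\widetilde{\mathbf{D}} \in \mathcal{S}_Q$. We have

$$\begin{aligned} \operatorname{prox}_{\gamma_D F(a, b, \mathbf{p}, \boldsymbol{\mu}, \cdot)}(\widetilde{\mathbf{D}}) &= \underset{\mathbf{D}\in\mathcal{S}_Q}{\operatorname{argmin}}\ \left( \gamma_D F(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) + \frac{1}{2}\|\mathbf{D} - \widetilde{\mathbf{D}}\|_F^2 \right) \\ &= \underset{\mathbf{D}\in\mathcal{S}_Q}{\operatorname{argmin}}\ \left( \frac{1}{2}\|\mathbf{D} - \widetilde{\mathbf{D}}\|_F^2 + \operatorname{tr}(\mathbf{D}\mathbf{S}) + m\,\varphi(\mathbf{D}) + \iota_{\mathcal{S}_Q^{+}}(\mathbf{D}) \right) \\ &= \operatorname{prox}_{m\varphi + \iota_{\mathcal{S}_Q^{+}}}(\widetilde{\mathbf{D}} - \mathbf{S}). \end{aligned} \tag{3.24}$$

Since $\varphi$ and $\iota_{\mathcal{S}_Q^{+}}$ are spectral functions on $\mathcal{S}_Q$ associated with the functions $\widetilde{\varphi}$ and $\iota_{[0,+\infty)^Q}$, respectively, it follows from [6, Corollary 24.65] that

$$\operatorname{prox}_{\gamma_D F(a, b, \mathbf{p}, \boldsymbol{\mu}, \cdot)}(\widetilde{\mathbf{D}}) = \mathbf{V}\operatorname{Diag}\!\Big( \operatorname{prox}_{m\widetilde{\varphi} + \iota_{[0,+\infty)^Q}}(\boldsymbol{\omega}) \Big)\mathbf{V}^\top, \tag{3.25}$$

where $\boldsymbol{\omega} = (\omega_q)_{1\le q\le Q}$ is a vector of eigenvalues of $\widetilde{\mathbf{D}} - \mathbf{S}$ and $\mathbf{V}$ is a $Q\times Q$ orthogonal matrix such that $\widetilde{\mathbf{D}} - \mathbf{S} = \mathbf{V}\operatorname{Diag}(\boldsymbol{\omega})\mathbf{V}^\top$. Since $m\widetilde{\varphi} + \iota_{[0,+\infty)^Q}$ is a separable function,

$$\operatorname{prox}_{m\widetilde{\varphi} + \iota_{[0,+\infty)^Q}}(\boldsymbol{\omega}) = (\widehat{\sigma}_q)_{1\le q\le Q}, \tag{3.26}$$

where, for every $q \in \{1,\ldots,Q\}$,

$$\widehat{\sigma}_q = \underset{\sigma_q\in[0,+\infty)}{\operatorname{argmin}}\ \left( -m\log(\sigma_q + \varepsilon) + \frac{1}{2}(\sigma_q - \omega_q)^2 \right) = \frac{1}{2}\max\Big( \omega_q - \varepsilon + \sqrt{(\omega_q + \varepsilon)^2 + 4m},\ 0 \Big). \tag{3.27}$$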
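In practice, (3.23) requires one symmetric eigendecomposition per iteration. The following NumPy sketch illustrates it; the function name `prox_D` and the row-wise storage of the centers in `x` are assumptions of the example.

```python
import numpy as np

def prox_D(D, p, x, mu, eps, gamma_D, lam):
    """Formula (3.23): proximity operator of gamma_D * F(a, b, p, mu, .) at D.

    D : (Q, Q) symmetric, p : (N,), x : (N, Q) voxel centers, mu : (Q,).
    """
    diff = x - mu
    # S = (gamma_D * lambda / 2) * sum_n p_n (x_n - mu)(x_n - mu)^T
    S = 0.5 * gamma_D * lam * np.einsum("n,ni,nj->ij", p, diff, diff)
    m = 0.5 * gamma_D * lam * np.sum(p)
    omega, V = np.linalg.eigh(D - S)                     # D - S = V Diag(omega) V^T
    sigma = 0.5 * np.maximum(omega - eps + np.sqrt((omega + eps) ** 2 + 4.0 * m), 0.0)
    return (V * sigma) @ V.T                             # V Diag(sigma) V^T
```

The thresholding by `np.maximum(..., 0.0)` guarantees that the output lies in $\mathcal{S}_Q^{+}$, so that $\mathbf{D} + \varepsilon\mathbf{I}_Q$ always remains a valid precision matrix.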

4 Convergence Analysis

Let us now establish the convergence of the iterates generated by Algorithm 1. Our analysis will rely on the observation that FIGARO can be viewed as a special instance of the regularized Gauss-Seidel method from [4].

4.1 Preliminaries

Let us first recall some useful definitions concerning variational analysis and the fundamental Kurdyka-Łojasiewicz property that will be at the core of the convergence analysis of our algorithm.

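Definition 3 (Subdifferential) [33, Def. 8.3] Let $f : \mathbb{R}^n \to (-\infty,+\infty]$ be a proper function.

(a) For a given $x \in \operatorname{dom} f$, the Fréchet subdifferential of $f$ at $x$, written $\widehat{\partial} f(x)$, is the set of all vectors $u \in \mathbb{R}^n$ which satisfy

$$\liminf_{\substack{y \ne x \\ y \to x}} \frac{f(y) - f(x) - \langle u, y - x\rangle}{\|y - x\|} \ge 0.$$

When $x \notin \operatorname{dom} f$, we set $\widehat{\partial} f(x) = \varnothing$.

(b) The limiting-subdifferential, or simply the subdifferential, of $f$ at $x \in \operatorname{dom} f$, written $\partial f(x)$, is defined as

$$\partial f(x) = \Big\{ v \in \mathbb{R}^n \;\Big|\; \exists\, x^{(t)} \to x,\ f(x^{(t)}) \to f(x),\ v^{(t)} \in \widehat{\partial} f(x^{(t)}) \to v \Big\}.$$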

Definition 4 (Kurdyka-Łojasiewicz property) [10] The function $f : \mathbb{R}^n \to (-\infty,+\infty]$ is said to satisfy the Kurdyka-Łojasiewicz (KL) property at $x^* \in \operatorname{dom} \partial f$ if there exist $\eta \in (0,+\infty]$, a neighbourhood $U$ of $x^*$, and a continuous concave function $\varphi : [0,\eta) \to [0,+\infty)$ such that

(a) $\varphi(0) = 0$,

(b) $\varphi$ is $C^1$ on $(0,\eta)$,

(c) for all $s \in (0,\eta)$, $\varphi'(s) > 0$,

(d) for all $x \in U \cap [f(x^*) < f < f(x^*) + \eta]$, the Kurdyka-Łojasiewicz inequality holds:

$$\varphi'\big(f(x) - f(x^*)\big)\,\operatorname{dist}(0, \partial f(x)) \ge 1.$$

Moreover, $f$ is called a KL function if it satisfies the Kurdyka-Łojasiewicz inequality at every point in $\operatorname{dom} \partial f$.
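As a simple illustration of Definition 4 (this toy example is not taken from the paper), consider $f(x) = x^2$ on $\mathbb{R}$ at its minimizer $x^* = 0$, with $\eta = +\infty$, $U = \mathbb{R}$, and the concave function $\varphi(s) = \sqrt{s}$, which satisfies conditions (a)-(c). Since $\partial f(x) = \{2x\}$, condition (d) can be verified directly:

```latex
\[
(\forall x \neq 0)\qquad
\varphi'\big(f(x) - f(x^*)\big)\,\operatorname{dist}\big(0, \partial f(x)\big)
= \frac{1}{2\sqrt{x^2}} \cdot |2x|
= \frac{1}{2|x|} \cdot 2|x|
= 1 \;\ge\; 1 .
\]
```

The inequality thus holds with equality here; intuitively, $\varphi$ reparametrizes the values of $f$ so that the reparametrized function is at least as sharp as an absolute value near $x^*$.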

4.2 Convergence Theorem

In order to establish convergence results, we will show that the objective function is KL, and that it can be split into the sum of a locally Lipschitz differentiable part involving all the variables, and non differentiable separable terms.

Lemma 2 Function (4.1) is a KL function.

Proof Let us recall that there exists an o-minimal structure, denoted by $\mathcal{S}(\mathbb{R}_{\mathrm{an},\exp})$ with $\mathbb{R}_{\mathrm{an},\exp} := (\mathbb{R}, +, \cdot, (f), \exp)$, that contains the exponential function and all restricted analytic functions (see [20, Example (6), p. 505]). Note that $\mathcal{S}(\mathbb{R}_{\mathrm{an},\exp})$ also contains the logarithm function $\log : (0,+\infty) \to \mathbb{R}$ and the functions $(\cdot)^r : \mathbb{R} \to \mathbb{R}$ defined by

$$a \mapsto \begin{cases} a^r, & a > 0, \\ 0, & a \le 0, \end{cases}$$

where $r \in \mathbb{R}$. Then, by using [20, Section 5], we conclude that $F$ is definable in an o-minimal structure. As a consequence, the results of [9] and Theorem 4.1 of [3] apply and hence $F$ is a KL function.

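Lemma 3 Function (4.1) can be rewritten as

$$(\forall a \in \mathbb{R})(\forall b \in \mathbb{R})(\forall \mathbf{p} \in \mathbb{R}^N)(\forall \boldsymbol{\mu} \in \mathbb{R}^Q)(\forall \mathbf{D} \in \mathcal{S}_Q)\quad F(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) = G(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) + f_1(a) + f_2(b) + f_3(\mathbf{p}) + f_4(\mathbf{D}), \tag{4.1}$$

where

$$(\forall a \in \mathbb{R})(\forall b \in \mathbb{R})(\forall \mathbf{p} \in \mathbb{R}^N)(\forall \boldsymbol{\mu} \in \mathbb{R}^Q)(\forall \mathbf{D} \in \mathcal{S}_Q)\quad G(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) = \frac{1}{2}\|\mathbf{y} - a\mathbf{1}_N - b\,\mathbf{p}\|^2 + \lambda \sum_{n=1}^{N} \frac{p_n}{2}\Big( (\mathbf{x}_n - \boldsymbol{\mu})^\top(\mathbf{D} + \varepsilon \mathbf{I}_Q)(\mathbf{x}_n - \boldsymbol{\mu}) + \varphi(\mathbf{D}) \Big), \tag{4.2}$$

$$(\forall a \in \mathbb{R})\quad f_1(a) = \iota_{\mathcal{A}}(a), \tag{4.3}$$

$$(\forall b \in \mathbb{R})\quad f_2(b) = \iota_{\mathcal{B}}(b), \tag{4.4}$$

$$(\forall \mathbf{p} \in \mathbb{R}^N)\quad f_3(\mathbf{p}) = \iota_{\mathcal{C}}(\mathbf{p}) + \lambda \sum_{n=1}^{N} \left( \operatorname{ent}(p_n) + p_n \frac{Q}{2}\log(2\pi) \right), \tag{4.5}$$

$$(\forall \mathbf{D} \in \mathcal{S}_Q)\quad f_4(\mathbf{D}) = \iota_{\mathcal{S}_Q^{+}}(\mathbf{D}). \tag{4.6}$$

Moreover, $G$ is $C^2$ on $\mathbb{R}\times\mathbb{R}\times\mathbb{R}^N\times\mathbb{R}^Q\times\mathcal{S}_Q$.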

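Proof We first calculate the gradients $\nabla_a G$, $\nabla_b G$, $\nabla_{\mathbf{p}} G$, $\nabla_{\boldsymbol{\mu}} G$ and $\nabla_{\mathbf{D}} G$ of $G$ with respect to the different variables. Let us denote by $(\mathbf{e}_{n,N})_{1\le n\le N}$ the canonical basis of $\mathbb{R}^N$. For every $(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) \in \mathbb{R}\times\mathbb{R}\times\mathbb{R}^N\times\mathbb{R}^Q\times\mathcal{S}_Q$,

$$\begin{aligned} \nabla_a G(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) &= b\,\mathbf{1}_N^\top \mathbf{p} + Na - \mathbf{1}_N^\top \mathbf{y}, \\ \nabla_b G(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) &= b\|\mathbf{p}\|^2 - \mathbf{y}^\top \mathbf{p} + a\,\mathbf{1}_N^\top \mathbf{p} = (b\,\mathbf{p} - \mathbf{y} + a\mathbf{1}_N)^\top \mathbf{p}, \\ \nabla_{\mathbf{p}} G(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) &= b^2\mathbf{p} - b\,\mathbf{y} + ab\,\mathbf{1}_N + \frac{\lambda}{2}\sum_{n=1}^{N} \Big( (\mathbf{x}_n - \boldsymbol{\mu})^\top(\mathbf{D} + \varepsilon \mathbf{I}_Q)(\mathbf{x}_n - \boldsymbol{\mu}) + \varphi(\mathbf{D}) \Big)\mathbf{e}_{n,N}, \\ \nabla_{\boldsymbol{\mu}} G(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) &= \lambda \sum_{n=1}^{N} p_n (\mathbf{D} + \varepsilon \mathbf{I}_Q)(\boldsymbol{\mu} - \mathbf{x}_n), \\ \nabla_{\mathbf{D}} G(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) &= \begin{cases} \dfrac{\lambda}{2}\displaystyle\sum_{n=1}^{N} p_n \Big( (\mathbf{x}_n - \boldsymbol{\mu})(\mathbf{x}_n - \boldsymbol{\mu})^\top - (\mathbf{D} + \varepsilon \mathbf{I}_Q)^{-1} \Big) & \text{if } \mathbf{D} \in \mathcal{S}_Q^{+}, \\[3mm] \dfrac{\lambda}{2}\displaystyle\sum_{n=1}^{N} p_n \Big( (\mathbf{x}_n - \boldsymbol{\mu})(\mathbf{x}_n - \boldsymbol{\mu})^\top - \varepsilon^{-1}\mathbf{I}_Q + \varepsilon^{-2}\mathbf{D} \Big) & \text{otherwise.} \end{cases} \end{aligned} \tag{4.7}$$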

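Let us now calculate the partial second-order derivatives of $G$. In the following, $\otimes$ denotes the matrix Kronecker product and $\operatorname{vec}(\mathbf{M})$ the columnwise ordering of a matrix $\mathbf{M}$. For every $(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) \in \mathbb{R}\times\mathbb{R}\times\mathbb{R}^N\times\mathbb{R}^Q\times\mathcal{S}_Q$, by setting $\mathbf{d} = \operatorname{vec}(\mathbf{D})$, we have

$$\begin{aligned} \nabla^2_{a} G(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) &= N, \\ \nabla^2_{b} G(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) &= \|\mathbf{p}\|^2, \\ \nabla^2_{\mathbf{p}} G(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) &= b^2\mathbf{I}_N, \\ \nabla^2_{\boldsymbol{\mu}} G(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) &= \lambda \sum_{n=1}^{N} p_n (\mathbf{D} + \varepsilon \mathbf{I}_Q), \\ \nabla^2_{a,b} G(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) &= \mathbf{1}_N^\top \mathbf{p}, \\ \nabla^2_{\mathbf{p},a} G(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) &= b\,\mathbf{1}_N, \\ \nabla^2_{\boldsymbol{\mu},a} G(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) &= \mathbf{0}_Q, \\ \nabla^2_{\mathbf{d},a} G(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) &= \mathbf{0}_{Q^2}, \\ \nabla^2_{\mathbf{p},b} G(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) &= 2b\,\mathbf{p} - \mathbf{y} + a\mathbf{1}_N, \\ \nabla^2_{\boldsymbol{\mu},b} G(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) &= \mathbf{0}_Q, \\ \nabla^2_{\mathbf{d},b} G(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) &= \mathbf{0}_{Q^2}, \\ \nabla^2_{\boldsymbol{\mu},\mathbf{p}} G(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) &= \lambda\,(\mathbf{D} + \varepsilon \mathbf{I}_Q)\sum_{n=1}^{N} (\boldsymbol{\mu} - \mathbf{x}_n)\,\mathbf{e}_{n,N}^\top, \\ \nabla^2_{\mathbf{p},\mathbf{d}} G(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) &= \begin{cases} \dfrac{\lambda}{2}\displaystyle\sum_{n=1}^{N} \Big( (\mathbf{x}_n - \boldsymbol{\mu})^\top \otimes \mathbf{e}_{n,N}(\mathbf{x}_n - \boldsymbol{\mu})^\top - \mathbf{e}_{n,N}\operatorname{vec}\big((\mathbf{D} + \varepsilon \mathbf{I}_Q)^{-1}\big)^\top \Big) & \text{if } \mathbf{D} \in \mathcal{S}_Q^{+}, \\[3mm] \dfrac{\lambda}{2}\displaystyle\sum_{n=1}^{N} \Big( (\mathbf{x}_n - \boldsymbol{\mu})^\top \otimes \mathbf{e}_{n,N}(\mathbf{x}_n - \boldsymbol{\mu})^\top - \varepsilon^{-1}\mathbf{e}_{n,N}(\mathbf{1}_{Q^2} - \varepsilon^{-1}\mathbf{d})^\top \Big) & \text{otherwise,} \end{cases} \\ \nabla^2_{\boldsymbol{\mu},\mathbf{d}} G(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) &= \lambda \sum_{n=1}^{N} p_n (\boldsymbol{\mu} - \mathbf{x}_n)^\top \otimes \mathbf{I}_Q, \\ \nabla^2_{\mathbf{d}} G(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) &= \begin{cases} \dfrac{\lambda}{2}(\mathbf{1}_N^\top \mathbf{p})\,(\mathbf{D} + \varepsilon \mathbf{I}_Q)^{-1} \otimes (\mathbf{D} + \varepsilon \mathbf{I}_Q)^{-1} & \text{if } \mathbf{D} \in \mathcal{S}_Q^{+}, \\[3mm] \dfrac{\lambda}{2}(\mathbf{1}_N^\top \mathbf{p})\,\varepsilon^{-2}\mathbf{I}_{Q^2} & \text{otherwise.} \end{cases} \end{aligned} \tag{4.8}$$

Thanks to the definition of $\varphi$, the Hessian of $G$ is thus defined and continuous on $\mathbb{R}\times\mathbb{R}\times\mathbb{R}^N\times\mathbb{R}^Q\times\mathcal{S}_Q$. Hence the result.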
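Some of the first-order expressions in (4.7) can be sanity-checked by finite differences. The sketch below does so for $\nabla_a G$ and $\nabla_{\boldsymbol{\mu}} G$; it evaluates $G$ from (4.2) only for $\mathbf{D} \in \mathcal{S}_Q^{+}$ (first branch of (2.8)), and the function names `G`, `grad_a` and `grad_mu` are hypothetical helpers introduced for this illustration.

```python
import numpy as np

def G(a, b, p, mu, D, y, x, eps, lam):
    """Smooth part (4.2), evaluated for D in S_Q^+ only (first branch of (2.8))."""
    C = D + eps * np.eye(D.shape[0])
    diff = x - mu
    quad = np.einsum("ni,ij,nj->n", diff, C, diff)   # (x_n - mu)^T C (x_n - mu)
    phi = -np.linalg.slogdet(C)[1]                   # -log|D + eps I_Q|
    return 0.5 * np.sum((y - a - b * p) ** 2) + 0.5 * lam * np.sum(p * (quad + phi))

def grad_a(a, b, p, y):
    """First line of (4.7): b 1^T p + N a - 1^T y."""
    return b * np.sum(p) + y.size * a - np.sum(y)

def grad_mu(p, mu, D, x, eps, lam):
    """Fourth line of (4.7): lambda * sum_n p_n (D + eps I_Q)(mu - x_n)."""
    C = D + eps * np.eye(D.shape[0])
    return lam * C @ (np.sum(p) * mu - p @ x)
```

Since $G$ is quadratic in $a$ and in $\boldsymbol{\mu}$, central differences with a small step reproduce these gradients up to rounding error, which makes such a check a cheap safeguard when implementing the other blocks.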

We are now ready to prove the convergence of FIGARO.

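Theorem 4.1 Let $(\mathbf{t}^{(i)})_{i\in\mathbb{N}} = (a^{(i)}, b^{(i)}, \mathbf{p}^{(i)}, \boldsymbol{\mu}^{(i)}, \mathbf{D}^{(i)})_{i\in\mathbb{N}}$ be a sequence generated by Algorithm 1. If $(\mathbf{D}^{(i)})_{i\in\mathbb{N}}$ is upper bounded, then $(\mathbf{t}^{(i)})_{i\in\mathbb{N}}$ converges to $\widehat{\mathbf{t}} = (\widehat{a}, \widehat{b}, \widehat{\mathbf{p}}, \widehat{\boldsymbol{\mu}}, \widehat{\mathbf{D}})$ satisfying the following equilibrium:

$$(\forall a \in \mathbb{R})(\forall b \in \mathbb{R})(\forall \mathbf{p} \in \mathbb{R}^N)(\forall \boldsymbol{\mu} \in \mathbb{R}^Q)(\forall \mathbf{D} \in \mathcal{S}_Q)\quad \begin{cases} F(a, \widehat{b}, \widehat{\mathbf{p}}, \widehat{\boldsymbol{\mu}}, \widehat{\mathbf{D}}) \ge F(\widehat{a}, \widehat{b}, \widehat{\mathbf{p}}, \widehat{\boldsymbol{\mu}}, \widehat{\mathbf{D}}), \\ F(\widehat{a}, b, \widehat{\mathbf{p}}, \widehat{\boldsymbol{\mu}}, \widehat{\mathbf{D}}) \ge F(\widehat{a}, \widehat{b}, \widehat{\mathbf{p}}, \widehat{\boldsymbol{\mu}}, \widehat{\mathbf{D}}), \\ F(\widehat{a}, \widehat{b}, \mathbf{p}, \widehat{\boldsymbol{\mu}}, \widehat{\mathbf{D}}) \ge F(\widehat{a}, \widehat{b}, \widehat{\mathbf{p}}, \widehat{\boldsymbol{\mu}}, \widehat{\mathbf{D}}), \\ F(\widehat{a}, \widehat{b}, \widehat{\mathbf{p}}, \boldsymbol{\mu}, \widehat{\mathbf{D}}) \ge F(\widehat{a}, \widehat{b}, \widehat{\mathbf{p}}, \widehat{\boldsymbol{\mu}}, \widehat{\mathbf{D}}), \\ F(\widehat{a}, \widehat{b}, \widehat{\mathbf{p}}, \widehat{\boldsymbol{\mu}}, \mathbf{D}) \ge F(\widehat{a}, \widehat{b}, \widehat{\mathbf{p}}, \widehat{\boldsymbol{\mu}}, \widehat{\mathbf{D}}). \end{cases} \tag{4.9}$$

Moreover, the sequence $(\mathbf{t}^{(i)})_{i\in\mathbb{N}}$ has a finite length.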

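Proof In (4.1), it appears that, if $\mathbf{p} \notin [0,+\infty)^N \cap \mathcal{C}$ or $\mathbf{D} \notin \mathcal{S}_Q^{+}$, then $F(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) = +\infty$, whereas, if $\mathbf{p} \in [0,+\infty)^N \cap \mathcal{C}$ and $\mathbf{D} \in \mathcal{S}_Q^{+}$,

$$G(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) \ge \lambda \sum_{n=1}^{N} \frac{p_n}{2}\Big( (\mathbf{x}_n - \boldsymbol{\mu})^\top(\mathbf{D} + \varepsilon \mathbf{I}_Q)(\mathbf{x}_n - \boldsymbol{\mu}) + \varphi(\mathbf{D}) \Big) \ge \frac{\lambda\Delta^{-1}}{2}\Big( \varepsilon \inf_{n\in\{1,\ldots,N\}} \|\mathbf{x}_n - \boldsymbol{\mu}\|^2 - Q\log\varepsilon \Big), \tag{4.10}$$

$$f_1(a) \ge 0, \quad f_2(b) \ge 0, \quad f_4(\mathbf{D}) = 0, \tag{4.11}$$

$$f_3(\mathbf{p}) \ge \lambda \sum_{n=1}^{N} \operatorname{ent}(p_n) \ge -N e^{-1}\lambda. \tag{4.12}$$

Hence

$$F(a, b, \mathbf{p}, \boldsymbol{\mu}, \mathbf{D}) \ge \frac{\lambda\Delta^{-1}}{2}\Big( \varepsilon \inf_{n\in\{1,\ldots,N\}} \|\mathbf{x}_n - \boldsymbol{\mu}\|^2 - Q\log\varepsilon \Big) - N e^{-1}\lambda. \tag{4.13}$$

This shows that $F$ is bounded from below. Moreover, since FIGARO alternates proximal steps, $(F(\mathbf{t}^{(i)}))_{i\in\mathbb{N}}$ is a decreasing convergent sequence. It then follows from (4.13) that $(\boldsymbol{\mu}^{(i)})_{i\in\mathbb{N}}$ is bounded (otherwise the function value sequence would be divergent). Since $(a^{(i)})_{i\in\mathbb{N}}$, $(b^{(i)})_{i\in\mathbb{N}}$ and $(\mathbf{D}^{(i)})_{i\in\mathbb{N}}$ are bounded sequences, $(\mathbf{t}^{(i)})_{i\in\mathbb{N}}$ is bounded. Moreover, according to Lemma 3, $G$ is $C^2$ on $\mathbb{R}\times\mathbb{R}\times\mathbb{R}^N\times\mathbb{R}^Q\times\mathcal{S}_Q$, which implies that $G$ is $C^1$ with locally Lipschitz gradient on $\mathbb{R}\times\mathbb{R}\times\mathbb{R}^N\times\mathbb{R}^Q\times\mathcal{S}_Q$. Consequently, all the conditions in [4, Theorem 6.2] are met to guarantee that $(\mathbf{t}^{(i)})_{i\in\mathbb{N}}$ is a finite length sequence converging to a critical point of $F$. We then deduce (4.9) from the fact that $F$ is convex with respect to each of its arguments.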

Remark 2 Note that the assumption on the boundedness of $(\mathbf{D}^{(i)})_{i\in\mathbb{N}}$ becomes unnecessary if an upper bound on $\mathbf{D}$ is introduced in the formulation of the optimization problem. This however was not observed to influence the practical behaviour of the algorithm.