HAL Id: hal-00443842

https://hal.archives-ouvertes.fr/hal-00443842v4

Submitted on 17 Jun 2011


From Bernoulli-Gaussian deconvolution to sparse signal restoration

Charles Soussen, Jérôme Idier, David Brie, Junbo Duan

To cite this version:

Charles Soussen, Jérôme Idier, David Brie, Junbo Duan. From Bernoulli-Gaussian deconvolution to sparse signal restoration. IEEE Transactions on Signal Processing, Institute of Electrical and Electronics Engineers, 2011, 59 (10), pp. 4572-4584. 10.1109/TSP.2011.2160633. hal-00443842v4

From Bernoulli-Gaussian deconvolution to sparse signal restoration

Charles Soussen, Jérôme Idier, Member, IEEE, David Brie, and Junbo Duan

Abstract

Formulated as a least squares problem under an ℓ0 constraint, sparse signal restoration is a discrete optimization problem, known to be NP-complete. Classical algorithms include, by increasing cost and efficiency, Matching Pursuit (MP), Orthogonal Matching Pursuit (OMP), Orthogonal Least Squares (OLS), stepwise regression algorithms and the exhaustive search. We revisit the Single Most Likely Replacement (SMLR) algorithm, developed in the mid-80's for Bernoulli-Gaussian signal restoration. We show that the formulation of sparse signal restoration as a limit case of Bernoulli-Gaussian signal restoration leads to an ℓ0-penalized least squares minimization problem, to which SMLR can be straightforwardly adapted.

The resulting algorithm, called Single Best Replacement (SBR), can be interpreted as a forward-backward extension of OLS sharing similarities with stepwise regression algorithms. Some structural properties of SBR are put forward. A fast and stable implementation is proposed. The approach is illustrated on two inverse problems involving highly correlated dictionaries. We show that SBR is very competitive with popular sparse algorithms in terms of trade-off between accuracy and computation time.

Copyright (c) 2011 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.

This work was carried out in part while C. Soussen was visiting IRCCyN during the academic year 2010-2011 with the financial support of CNRS.

C. Soussen and D. Brie are with the Centre de Recherche en Automatique de Nancy (CRAN, UMR 7039, Nancy-University, CNRS). Campus Sciences, B.P. 70239, F-54506 Vandœuvre-lès-Nancy, France. Tel: (+33)-3 83 68 44 71, Fax: (+33)-3 83 68 44 62. E-mail: {FirstName.SecondName}@cran.uhp-nancy.fr.

J. Idier is with the Institut de Recherche en Communications et Cybernétique de Nantes (IRCCyN, UMR CNRS 6597), BP 92101, 1 rue de la Noë, F-44321 Nantes Cedex 3, France. Tel: (+33)-2 40 37 69 09, Fax: (+33)-2 40 37 69 30. E-mail:

Jerome.Idier@irccyn.ec-nantes.fr.

J. Duan was with CRAN. He is now with the Department of Biomedical Engineering and Biostatistics, Tulane University, 1440 Canal Street, Suite 2051, New Orleans, LA 70112, USA. Tel: (+1)-504 988 1341, Fax: (+1)-504 988 1706. E-mail:

Index Terms

Sparse signal estimation; inverse problems; Bernoulli-Gaussian signal restoration; SMLR algorithm; mixed ℓ2-ℓ0 criterion minimization; Orthogonal Least Squares; stepwise regression algorithms.

I. INTRODUCTION

Sparse signal restoration arises in inverse problems such as Fourier synthesis, mono- and multidimensional deconvolution, and statistical regression. It consists in the decomposition of a signal y as a linear combination of a limited number of elements from a dictionary A. While formally very similar, sparse signal restoration has to be distinguished from sparse signal approximation. In sparse signal restoration, the choice of the dictionary is imposed by the inverse problem at hand whereas in sparse approximation, the dictionary has to be chosen according to its ability to represent the data with a limited number of coefficients.

Sparse signal restoration can be formulated as the minimization of the squared error ‖y − Ax‖² (where ‖·‖ refers to the Euclidean norm) under the constraint that the ℓ0 pseudo-norm of x, defined as the number of non-zero entries of x, is small. This problem is often referred to as subset selection because it consists in selecting a subset of columns of A. This yields a discrete problem (since there is a finite number of possible subsets) which is known to be NP-complete [1]. In this paper, we focus on "difficult" situations in which some of the columns of A are highly correlated, the unknown weight vector x is only approximately sparse, and/or the data are noisy. To address subset selection in a fast and sub-optimal manner, two approaches can be distinguished.

The first one, which has been the most popular in the last decade, approximates the subset selection problem by a continuous optimization problem, convex or not, that is easier to solve [2–7]. In particular, the ℓ1 relaxation of the ℓ0-norm has been increasingly investigated [2, 3], leading to the LASSO optimization problem.

The second approach addresses the exact subset selection problem using either iterative thresholding [8–11] or greedy search algorithms. The latter gradually increase or decrease the set of active columns by one element. The simplest greedy algorithms are Matching Pursuit (MP) [12] and its improved version, Orthogonal Matching Pursuit (OMP) [13]. Both are referred to as forward greedy algorithms since they start from the empty active set and then gradually increase it by one element. In contrast, the backward algorithm of Couvreur and Bresler [14] starts from a complete active set which is gradually decreased by one element. It is, however, only valid for undercomplete dictionaries. Forward-backward algorithms (also

known as stepwise regression algorithms) in which insertions and removals of dictionary elements are both allowed, are known to yield better recovery performance since an early wrong selection can be counteracted by its further removal from the active set [15–18]. In contrast, the insertion of a wrong element is irreversible when using forward algorithms. We refer the reader to [18, Chapter 3] for an overview of the forward-backward algorithms in subset selection.

The choice of the algorithm depends on the amount of time available and on the structure of matrix

A. In favorable cases, the sub-optimal search algorithms belonging to the first or the second approach

provide solutions having the same support as the exhaustive search solution. Specifically, if the unknown signal is highly sparse and if the correlation between any pair of columns of A is low, the ℓ1-norm approximation provides optimal solutions [3]. But when fast algorithms are unsatisfactory, it is relevant to consider slower algorithms that are more accurate while remaining very fast compared to the exhaustive search. The Orthogonal Least Squares algorithm (OLS) [19], which is sometimes confused with OMP [20], falls into this category. Both OLS and OMP share the same structure, the difference being that at each iteration, OLS solves as many least squares problems as there are non-active columns while OMP only performs one linear inversion. In this paper, we derive a forward-backward extension of OLS allowing an insertion or a removal per iteration, each iteration requiring the solution of n least squares problems, where n is the size of x.

The proposed forward-backward extension of OLS can be viewed as a new member of the family of stepwise regression algorithms. The latter family traces back to 1960 [15], and other popular algorithms were proposed in the 1980's [18] and more recently [21]. Note that forward-backward extensions of OMP have also been proposed [22, 23]. In contrast with the other stepwise regression algorithms, our approach relies on a bi-objective formulation in order to handle the trade-off between low residual and low cardinality. This formulation reads as the minimization of the ℓ0-penalized least squares cost function ‖y − Ax‖² + λ‖x‖0. Then, we design a heuristic algorithm to minimize this cost function in a suboptimal way. While the other forward-backward strategies [15–17, 21, 22] aim at handling the same trade-off, most of them are not expressed as optimization algorithms, but rather as empirical schemes without any connection with an objective function. Moreover, some of them involve discrete search parameters that control variable selection or de-selection [15, 16, 22] while others do not involve any parameter [17, 21]. An exception can be made for Broersen's algorithm [17] since it aims at minimizing ‖y − Ax‖² + λ‖x‖0 for a specific λ value corresponding to Mallows' Cp statistic. However, it is only valid for undercomplete problems. On the contrary, our proposed algorithm is general and valid for any λ value. It does not

Our starting point is the Single Most Likely Replacement (SMLR) algorithm, which proved to be a very efficient tool for the deconvolution of a Bernoulli-Gaussian signal [24–27]. We show that sparse signal restoration can be seen as a limit case of maximum a posteriori (MAP) Bernoulli-Gaussian restoration, which results in an adaptation of SMLR to subset selection. The paper is organized as follows. In Section II, we introduce the Bernoulli-Gaussian model and the Bayesian framework from which we formulate the sparse signal restoration problem. In Section III, we adapt SMLR, resulting in the so-called Single Best Replacement (SBR) algorithm. In Section IV, we propose a fast and stable SBR implementation. Finally, Sections V and VI illustrate the method on sparse spike deconvolution with a Gaussian impulse response and on the joint detection of discontinuities at different orders in a signal.

II. SPARSE SIGNAL ESTIMATION USING A LIMIT BERNOULLI-GAUSSIAN MODEL

A. Preliminary definitions and working assumptions

Given an observation vector y ∈ ℝ^m and a dictionary A = [a1, . . . , an] ∈ ℝ^{m×n}, a subset selection algorithm aims at computing a weight vector x ∈ ℝ^n yielding an accurate approximation y ≈ Ax. The columns ai corresponding to the non-zero weights xi are referred to as the active (or selected) columns. Throughout this paper, no assumption is made on the size of A: m can be either smaller or larger than n. A is assumed to satisfy the unique representation property (URP): any min(m, n) columns of A are linearly independent. This assumption is usual when m ≤ n; it is stronger than the full rank assumption [28]. When m ≥ n, it amounts to the full rank assumption. Although URP was originally introduced to guarantee uniqueness of sparse solutions [28], we use this assumption to propose a valid algorithm. It can actually be relaxed provided that the search strategy guarantees that the selected columns are linearly independent (see Section VI-C for details).

The support of a vector x ∈ ℝ^n is the set S(x) ⊆ {1, . . . , n} defined by i ∈ S(x) if and only if xi ≠ 0. We denote by Q ⊆ {1, . . . , n} the active set and by q ∈ {0, 1}^n the related vector defined by qi = 1 if and only if i ∈ Q. When Card[Q] ≤ min(m, n), let AQ be the submatrix of size m × Card[Q] formed of the active columns of A. We define the least squares solution and the related squared error:

xQ ≜ arg min_{S(x) ⊆ Q} { E(x) = ‖y − Ax‖² }   (1)

EQ ≜ E(xQ) = min_{S(x) ⊆ Q} ‖y − Ax‖²   (2)
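As an illustration, here is a minimal NumPy sketch of (1)-(2) — the restricted least-squares solution xQ and the squared error EQ — assuming A, y, and an index set Q are given (the helper name is ours):

```python
import numpy as np

def least_squares_on_support(A, y, Q):
    """Return (x_Q, E_Q): the least-squares solution whose support is
    restricted to the active set Q, and the corresponding squared error."""
    Q = sorted(Q)
    x = np.zeros(A.shape[1])
    if Q:
        AQ = A[:, Q]                                 # active columns
        t, *_ = np.linalg.lstsq(AQ, y, rcond=None)   # arg min_t ||y - AQ t||^2
        x[Q] = t
    E = np.sum((y - A @ x) ** 2)                     # E_Q = ||y - A x_Q||^2
    return x, E
```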


B. Bayesian formulation of sparse signal restoration

We consider the restoration of a sparse signal x from a linear observation y = Ax + n, where n stands for the observation noise. An acknowledged probabilistic model dedicated to sparse signals is the Bernoulli-Gaussian (BG) model [24, 25, 27]. For such a model, deterministic optimization algorithms [27] and Markov chain Monte Carlo techniques [29] are used to compute the MAP and the posterior mean, respectively. Hereafter, we define the BG model and then consider its estimation in the joint MAP sense. A BG process can be defined using a Bernoulli random vector q ∈ {0, 1}^n coding for the support and a Gaussian random vector r ∼ N(0, σx² In), with In the identity matrix of size n. Each sample xi of x is modeled as xi = qi ri [24, 25]. The Bernoulli parameter ρ = Pr(qi = 1) is the probability of presence of signal and σx² controls the variance of the nonzero amplitudes xi = ri. The Bayesian formulation consists in inferring x = (q, r) knowing y. The MAP estimator can be obtained by maximizing the marginal likelihood l(q | y) [27] or the joint likelihood l(q, r | y) [25, 26]. Following [25] and assuming a Gaussian white noise n ∼ N(0, σn² Im), independent from x, Bayes' rule leads to:

L(q, r) ≜ −2σn² log[l(q, r | y)] = ‖y − A∆q r‖² + (σn²/σx²) ‖r‖² + λ‖q‖0 + c   (3)

where λ = 2σn² log(1/ρ − 1), ∆q is the diagonal matrix of size n whose diagonal elements are qi (x reads x = ∆q r), and c is a constant.

Now, a signal x is sparse if some entries xi are equal to 0. Since this definition does not impose constraints on the range of the non-zero amplitudes, we choose to use a limit Bernoulli-Gaussian model in which the amplitude variance σx² is set to infinity. Note that a parallel limit development was done, independently from our work, in the conference paper [23]. In Appendix A, we show that the minimization of L w.r.t. x = (q, r) rereads:

min_{x ∈ ℝ^n} { J(x; λ) = ‖y − Ax‖² + λ‖x‖0 }.   (4)

This formulation is close to that obtained in the Bayesian subset selection literature [18, Chapter 7] using an alternative Bernoulli-Gaussian model. In the latter model, the Gaussian prior relies on RQ r instead of r, with RQ the Cholesky factor of the Gram matrix AQ^t AQ. This leads to a cost function of the form (4), the difference being that λ depends on the amplitude variance σx² and tends to infinity as σx² tends to infinity [30, 31].

Remark 1 (Noise-free case) The Bayesian development above is valid for noisy data. In the noise-free case, according to classical results in optimization [32, Chapter 17], if {λk} is a sequence decreasing towards 0 and xk is an exact global minimizer of J(x; λk), then every limit point of the sequence {xk} is a solution of arg min_x ‖x‖0 s.t. ‖y − Ax‖² is minimal. In Appendix B, we derive a more precise result: "the set of minimizers of J(x; λ) is constant when λ is close enough to 0 (λ ≠ 0). It is equal to the set of sparsest solutions to y = Ax in the overcomplete case, and to the unconstrained least-squares solution in the undercomplete case."

In the following, we focus on the minimization problem (4). The hyperparameter λ is fixed. It controls the level of sparsity of the desired solution. The algorithm that will be developed relies on an efficient search of the support of x. The search strategy is based on the definition of a neighborhood relationship between two supports: two supports are neighbors if one is nested inside the other and the larger support has one more element.

III. SINGLE BEST REPLACEMENT ALGORITHM

We propose to adapt the SMLR algorithm to the minimization of the mixed ℓ2-ℓ0 cost function J(x; λ) defined in (4). To clearly distinguish it from SMLR, which specifically aims at minimizing (3), the adapted algorithm will be termed Single Best Replacement (SBR).

A. Principle of SMLR and main notations

SMLR [24] is a deterministic coordinatewise ascent algorithm to maximize likelihood functions of the form l(q | y) (marginal MAP estimation) or l(q, r | y) (joint MAP estimation). In the latter case, it is easy to check from (3) that given q, the minimizer of L(q, r) w.r.t. r has a closed-form expression r = r(q). Consequently, the joint MAP estimation reduces to the minimization of L(q, r(q)) w.r.t. q.

At each SMLR iteration, all the possible single replacements of the support q (set qi = 1 − qi while keeping the other qj, j ≠ i, unchanged) are tested, then the replacement yielding the maximal decrease of L(q, r(q)) is chosen. This task is repeated until no single replacement can decrease L(q, r(q)) anymore. The number of possible supports q being finite and SMLR being a descent algorithm, it terminates after a finite number of iterations.

Before adapting SMLR, let us introduce some useful notations. We denote by Q • i a single replacement, i.e., an insertion into or a removal from the active set Q:

Q • i ≜ Q ∪ {i} if i ∉ Q, and Q • i ≜ Q \ {i} otherwise.

When Card[Q] ≤ min(m, n), we define the cost function

JQ(λ) ≜ EQ + λ Card[Q]   (5)

involving the squared error EQ defined in (2). By definition, J(xQ; λ) = EQ + λ‖xQ‖0, so that JQ(λ) coincides with J(xQ; λ) when the support of xQ is equal to Q.

Although it aims at minimizing J(x; λ), the proposed SBR algorithm involves the computation of JQ(λ) rather than J(xQ; λ). We make this choice because JQ(λ) can be computed and updated more efficiently, the computation of xQ being no longer necessary. In subsection III-C, we show that for noisy data, the replacement of J(xQ; λ) by JQ(λ) has a negligible effect.

B. The Single Best Replacement algorithm

SMLR can be seen as an exploration strategy for discrete optimization rather than an algorithm specific to a posterior likelihood function. Here, we use this strategy to minimize J(x; λ). We rename the algorithm Single Best Replacement to remove any statistical connotation.

SBR works as follows. Consider the current support Q. The n single replacements Q • i are tested,

i.e., we compute the squared errors EQ•i and we memorize the values of JQ•i(λ). If the minimum of

JQ•i(λ) is lower than JQ(λ), then we select the index yielding this minimum value:

ℓ ∈ arg min

i∈{1,...,n}

JQ•i(λ). (6)

The next SBR iterate is thus defined as Q′ = Q • ℓ. This task is repeated until JQ(λ) cannot decrease anymore. By default, we use the initial empty support. The algorithm is summarized in Table I.
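A naive Python sketch of this loop, assuming a dense NumPy dictionary and re-solving each least-squares subproblem from scratch (unlike the fast implementation of Section IV); the helper names are ours:

```python
import numpy as np

def _J(A, y, Q, lam):
    """J_Q(lambda) = E_Q + lambda * Card[Q] for an active set Q."""
    if Q:
        AQ = A[:, sorted(Q)]
        t, *_ = np.linalg.lstsq(AQ, y, rcond=None)
        E = np.sum((y - AQ @ t) ** 2)
    else:
        E = np.sum(y ** 2)
    return E + lam * len(Q)

def sbr(A, y, lam, Q_init=()):
    """Single Best Replacement: repeatedly apply the single replacement
    Q • i that most decreases J_Q(lambda); stop when no decrease is possible."""
    n = A.shape[1]
    Q = set(Q_init)
    J_cur = _J(A, y, Q, lam)
    while True:
        # test the n single replacements (insertion if i not in Q, removal otherwise)
        J_best, i_best = min((_J(A, y, Q ^ {i}, lam), i) for i in range(n))
        if J_best < J_cur:
            Q ^= {i_best}
            J_cur = J_best
        else:
            return sorted(Q)
```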

C. Case where some active amplitudes are zero

We show that this case almost surely never arises when the data y are corrupted with “non degenerate” noise.

Theorem 1 Let y = y0 + n where y0 ∈ ℝ^m is fixed and n is an absolutely continuous random vector, i.e., admitting a probability density w.r.t. the Lebesgue measure. Then, when Card[Q] ≤ min(m, n), the probability that ‖xQ‖0 < Card[Q] is equal to 0.

Proof: Let k = Card[Q] and tQ be the minimizer of ‖y − AQ t‖² over ℝ^k. tQ reads tQ = VQ y where matrix VQ = (AQ^t AQ)^{-1} AQ^t is of size k × m, and ‖xQ‖0 = ‖tQ‖0 ≤ k. Denoting by v1, . . . , vk ∈ ℝ^m the row vectors of VQ, ‖tQ‖0 < k if and only if there exists i such that ⟨y, vi⟩ = 0 (where ⟨· , ·⟩ denotes

TABLE I
SBR ALGORITHM. BY DEFAULT, Q1 = ∅.

Input: A, y, λ and support Q1 (Card[Q1] ≤ min(m, n))
Step 1: Set j = 1.
Step 2: For i ∈ {1, . . . , n}, compute JQj•i(λ).
        Compute ℓ using (6).
        If JQj•ℓ(λ) < JQj(λ), set Qj+1 = Qj • ℓ; else, terminate SBR. End if.
        Set j = j + 1 and go to Step 2.
Output: support Qj = SBR(Q1; λ)

the inner product). Because AQ is full rank, VQ is full rank and then ∀i, vi ≠ 0. Denoting by H⊥(vi) the hyperplane of ℝ^m which is orthogonal to vi, we have

‖xQ‖0 < k ⟺ y ∈ ∪_{i=1}^{k} H⊥(vi).   (7)

Because the set ∪i H⊥(vi) has a Lebesgue measure equal to zero and the random vector y admits a probability density, the probability of event (7) is zero.

Theorem 1 implies that when dealing with real noisy data, it is almost sure that all active coefficients xi are non-zero. Hence, each SBR iterate Q almost surely satisfies J(xQ; λ) = JQ(λ). In any case, SBR can be applied without restriction and the properties stated below (e.g., termination after a finite number of iterations) remain valid when an SBR iterate satisfies ‖xQ‖0 < Card[Q].

D. Properties of SBR

Proposition 1 Under the assumptions of Theorem 1, each SBR iterate xQ is almost surely a local minimizer of J(x; λ). In particular, the SBR output satisfies this property.

Proof: Let x = xQ be an SBR iterate. According to Theorem 1, the support S(x) = Q almost surely. Setting ε = min_{i∈Q} |xi| > 0, it is easy to check that if x′ ∈ ℝ^n satisfies ‖x′ − x‖ < ε, then S(x′) ⊇ S(x) = Q, thus ‖x′‖0 ≥ ‖x‖0. Assume that x′ satisfies ‖x′ − x‖ < ε.
• If S(x′) = Q, then, by definition of x = xQ as the least-squares solution whose support is included in Q, E(x′) ≥ E(x), hence J(x′; λ) ≥ J(x; λ).
• Otherwise, J(x′; λ) = E(x′) + λ‖x′‖0 ≥ E(x′) + λ(‖x‖0 + 1). By continuity of E, there exists a neighborhood V(x) of x such that if x′ ∈ V(x), |E(x′) − E(x)| < λ. Thus, if x′ ∈ V(x), ‖x′ − x‖ < ε and S(x′) ⊋ Q, then J(x′; λ) > E(x) + λ‖x‖0 = J(x; λ).
Finally, if x′ ∈ V(x) and ‖x′ − x‖ < ε, then J(x′; λ) ≥ J(x; λ).

Termination: Because SBR is a descent algorithm, a support Q cannot be explored twice and SBR terminates after a finite number of iterations. We emphasize that no stopping condition is needed, unlike many algorithms which require setting a maximum number of iterations and/or a threshold on the squared error variation (CoSaMP, Subspace Pursuit, Iterative Hard Thresholding, Iterative Reweighted ℓ1).

OLS as a special case: When λ = 0, SBR coincides with the well known OLS algorithm [19, 33].

The removal operation never occurs because it yields an increase of the squared error JQ(0) = EQ.

Empty solutions: We characterize the λ-values for which SBR yields an empty solution.

Remark 2 SBR(∅; λ) yields the empty set if and only if λ ≥ λmax ≜ max_i (⟨ai, y⟩² / ‖ai‖²).

This result directly follows from checking that any insertion trial fails, i.e., ∀i, E{i} + λ ≥ E∅. It allows us to design an automatic procedure which sets a number of λ-values adaptively to the data in order to compute SBR solutions at different sparsity levels (see Section VI-D).
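For illustration, Remark 2 and the adaptive choice of λ-values can be sketched as follows; the geometric spacing mirrors the grid λj = λmax 10^{−j/2} used in Section VI-D, and the function names are ours:

```python
import numpy as np

def lambda_max(A, y):
    """Penalty level above which SBR started from the empty support stays
    empty: lambda_max = max_i <a_i, y>^2 / ||a_i||^2 (Remark 2)."""
    return np.max((A.T @ y) ** 2 / np.sum(A ** 2, axis=0))

def lambda_grid(A, y, n_levels=21):
    """Decreasing lambda-values lambda_j = lambda_max * 10**(-j/2)."""
    lmax = lambda_max(A, y)
    return [lmax * 10 ** (-j / 2) for j in range(n_levels)]
```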

Relation between SBR and SMLR: The main difference between both algorithms is that SMLR involves the inversion of a matrix of the form AQ^t AQ + α I_{Card[Q]} whereas SBR computes the inverse of AQ^t AQ. In the case of SMLR, the term α I_{Card[Q]} acts as a regularization on the amplitude values. It avoids instabilities when AQ is ill-conditioned at the price of handling the additional hyperparameter α. On the contrary, instabilities may occur while using SBR. In the next section, we focus on this issue and propose a stable implementation.

IV. IMPLEMENTATION ISSUES

Given the current support Q, an SBR iteration consists in computing the squared error EQ′ for any replacement Q′ = Q • i, leading to the computation of JQ′(λ) = EQ′ + λ Card[Q′]. Our implementation is inspired by the fast implementation of the homotopy algorithm for ℓ1 regression [3, 34]. It consists in maintaining the Cholesky factorization of the Gram matrix GQ ≜ AQ^t AQ when Q is modified by one element. The Cholesky factorization takes the form GQ = LQ LQ^t, where LQ is a lower triangular matrix of size k = Card[Q]. Also, LQ is better conditioned than GQ, improving the stability of matrix inversion. We now give the main updating equations. A fully detailed derivation can be found in Appendix C.

A. Efficient strategy based on the Cholesky factorization

The replacement tests only rely on the current matrix LQ and do not require its update.

1) Single replacement tests: An insertion test Q′ = Q ∪ {i} takes the form:

JQ′(λ) − JQ(λ) = λ − (lQ,i^t LQ^{-1} AQ^t y − ai^t y)² / (‖ai‖² − ‖lQ,i‖²)   (8)

with lQ,i = LQ^{-1} AQ^t ai. This computation mainly requires a triangular system inversion (computation of lQ,i in O(k²) elementary operations) up to the pre-computation of LQ^{-1}(AQ^t y) at the beginning of the current SBR iteration.
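A sketch of the insertion test (8) with SciPy, assuming LQ, AQ, and the candidate column are passed explicitly and that Q is non-empty (function names are ours):

```python
import numpy as np
from scipy.linalg import solve_triangular

def insertion_test(L, AQ, a_i, y, lam):
    """Return J_{Q u {i}}(lambda) - J_Q(lambda) as in Eq. (8),
    where L is the lower Cholesky factor of G_Q = A_Q^t A_Q."""
    z = solve_triangular(L, AQ.T @ y, lower=True)      # L_Q^{-1} A_Q^t y (precomputable)
    l_i = solve_triangular(L, AQ.T @ a_i, lower=True)  # l_{Q,i} = L_Q^{-1} A_Q^t a_i
    num = (l_i @ z - a_i @ y) ** 2
    den = a_i @ a_i - l_i @ l_i
    return lam - num / den
```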

According to [18, 35], a removal test Q′ = Q \ {i} reads JQ′(λ) − JQ(λ) = xQ(i)²/γi − λ, where xQ(i) is the ith element of vector xQ and γi is the diagonal element of GQ^{-1} corresponding to the position of ai in AQ. The overall removal tests mainly amount to the inversion of the triangular matrix LQ (in O(k³) operations) as the computation of γi for all i and of GQ^{-1} AQ^t y (i.e., the values of xQ(i)) from LQ^{-1} are both in O(k²).
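A corresponding sketch of the removal tests, recomputing GQ^{-1} from LQ as described above (names are ours):

```python
import numpy as np
from scipy.linalg import solve_triangular

def removal_tests(L, AQ, y, lam):
    """Cost variations for removing each active column i:
    x_Q(i)^2 / gamma_i - lambda, one value per active column."""
    L_inv = solve_triangular(L, np.eye(L.shape[0]), lower=True)  # L_Q^{-1}, O(k^3)
    G_inv = L_inv.T @ L_inv                                      # G_Q^{-1}
    xQ = G_inv @ (AQ.T @ y)                                      # amplitudes x_Q(i)
    gamma = np.diag(G_inv)                                       # diagonal of G_Q^{-1}
    return xQ ** 2 / gamma - lam
```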

Note that insertion and removal tests can easily be done in parallel. In Matlab, this parallel implementation leads to a significant saving of computation time due to the SIMD capabilities of Matlab.

2) Updating the Cholesky factorization: The update of LQ can easily be done in the insertion case by adding the new column ai at the last position in AQ∪{i}. The new matrix LQ′ is a 2 × 2 block matrix whose upper left block is LQ (see Appendix C). The removal case requires more care since a removal breaks the triangular structure of LQ. The update can be done by performing either a series of Givens planar rotations [21] or a positive rank 1 Cholesky update [36]. We describe the latter strategy in Appendix C. The Cholesky factorization update is in O(k²) in the insertion case and in O((k − I)²) in the removal case, where I denotes the position of the column to be removed in AQ.
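Both factor updates can be sketched as follows; for brevity, the removal case below simply re-factorizes the trailing block, whereas the implementation described in Appendix C uses the O((k − I)²) positive rank 1 update. Function names are ours, and Q is assumed non-empty for the insertion:

```python
import numpy as np
from scipy.linalg import solve_triangular, cholesky

def chol_insert(L, AQ, a_i):
    """Grow the lower Cholesky factor of A_Q^t A_Q when column a_i is
    appended after the existing active columns (Eq. (11))."""
    l_i = solve_triangular(L, AQ.T @ a_i, lower=True)
    d = np.sqrt(a_i @ a_i - l_i @ l_i)
    k = L.shape[0]
    L_new = np.zeros((k + 1, k + 1))
    L_new[:k, :k] = L
    L_new[k, :k] = l_i
    L_new[k, k] = d
    return L_new

def chol_remove(L, I):
    """Shrink the factor when the I-th active column (0-based) is removed:
    delete row/column I and restore triangularity of the trailing block,
    which must satisfy X X^t = F F^t + e e^t (Eq. (13))."""
    F, e = L[I + 1:, I + 1:], L[I + 1:, I]
    X = cholesky(F @ F.T + np.outer(e, e), lower=True)
    L_new = np.delete(np.delete(L, I, axis=0), I, axis=1)
    L_new[I:, I:] = X
    return L_new
```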

B. Reduced search

Additionally, we propose an acceleration of SBR yielding the same iterates with a reduced search. We notice that a column removal Q′ = Q \ {i} yields an increase of the squared error and a decrease of the penalty equal to λ. Hence, the maximum decrease of JQ(λ) which can be expected from a removal is λ. The acceleration of SBR consists in testing insertions first. If any insertion leads to JQ(λ) − JQ′(λ) ≥ λ, then removals are not worth being tested. Otherwise, the removals have to be tested as stated in Table I. We have implemented this acceleration systematically.

C. Memory requirements and computation burden

The actual implementation may vary depending on the size and the structure of matrix A. We briefly describe the main possible implementations.

When the size of A is relatively small, the computation and storage of the Gram matrix A^t A prior to any SBR iteration (storage of n² scalar elements) avoids recomputing the vectors AQ^t ai which are needed when the insertion of ai into the active set is tested. The storage of the other quantities (mainly LQ) that are being updated amounts to O(k²) scalar elements, and a replacement test costs O(k²) elementary operations on average.

When A is larger, the storage of A^t A is no longer possible, thus AQ^t ai must be recomputed for any SBR iteration. This computation costs km elementary operations and now represents the most important part of an insertion test. When the dictionary has some specific structure, this limitation can be alleviated, enabling a fast implementation even for large n. For instance, if a large number of pairs of columns of A are orthogonal to each other, A^t A can be stored as a sparse array. Also, finite impulse response deconvolution problems enable a fast implementation since A^t A is then a Toeplitz matrix (save for the north-west and/or south-east submatrices, depending on the boundary conditions). The knowledge of the auto-correlation of the impulse response is sufficient to describe most of the Gram matrix.

All these variants have been implemented1. In the following, we analyze the behavior of SBR for two difficult problems involving highly correlated dictionaries: the deconvolution of a sparse signal with a Gaussian impulse response (Section V) and the joint detection of discontinuities at different orders in a signal (Section VI).

V. DECONVOLUTION OF A SPARSE SIGNAL WITH A GAUSSIAN IMPULSE RESPONSE

This is a typical problem for which SMLR was introduced [27]. It allows us to study the ability of SBR to perform an exact recovery in a simple noise-free case (separation of two Gaussian signals) and to test SBR in a noisy case (estimation of a larger number of Gaussians) and compare it with other algorithms. For simulated problems, we denote by x⋆ the exact sparse signal, the data reading y = Ax⋆ + n. The dictionary columns are always normalized: ‖ai‖² = 1. The signal-to-noise ratio (SNR) is defined by SNR = 10 log(Py/Pn), where Py = ‖Ax⋆‖²/m is the average power of the noise-free data and Pn is the variance of the noise process n.

1 Matlab codes provided by the authors can be downloaded at http://ieeexplore.org. In our Matlab implementation, the insertion and removal tests are done in parallel.

TABLE II
SEPARATION OF TWO GAUSSIAN FEATURES FROM NOISE-FREE DATA WITH SBR. d STANDS FOR THE DISTANCE BETWEEN THE GAUSSIAN FEATURES. WE DISPLAY THE SIZE OF THE SUPPORT OBTAINED FOR A SEQUENCE OF DECREASING λ-VALUES λ0 > λ1 > . . . > λ7. THE LABEL ⋆ INDICATES AN EXACT RECOVERY FOR A SUPPORT OF CARDINALITY 2.

λ        λ0   λ1   λ2   λ3   λ4   λ5   λ6   λ7
d = 20    0    0   2⋆   2⋆   2⋆   2⋆   2⋆   2⋆
d = 13    0    1    3    4    5   2⋆   2⋆   2⋆
d = 6     0    1    1    3    5    6    8   2⋆

A. Dictionary and simulated data

The impulse response h is a Gaussian signal of standard deviation σ, sampled on a regular grid at integer locations. It is approximated by a finite impulse response of length 6σ by thresholding the smallest values, allowing for a fast implementation even for large size problems (see subsection IV-C). The deconvolution problem leads to a Toeplitz matrix A whose columns are obtained by shifting the signal h. The dimension of A is chosen so that any Gaussian feature resulting from the convolution h ∗ x⋆ belongs to the observation window {1, . . . , m}. This implies that A is slightly undercomplete (m > n).
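A possible NumPy construction of this dictionary; the truncation length and the enlargement of the observation window follow the description above, but the exact choices are assumptions:

```python
import numpy as np

def gaussian_deconv_dictionary(n, sigma):
    """Toeplitz dictionary whose columns are shifted, truncated Gaussian
    impulses of standard deviation sigma, normalized to unit norm.
    The window is enlarged so that every shifted feature fits entirely
    inside it, hence m > n (slightly undercomplete)."""
    half = int(np.ceil(3 * sigma))                 # total length ~ 6*sigma
    t = np.arange(-half, half + 1)
    h = np.exp(-t ** 2 / (2 * sigma ** 2))
    m = n + len(h) - 1
    A = np.zeros((m, n))
    for i in range(n):
        A[i:i + len(h), i] = h                     # column i: h shifted by i samples
    return A / np.linalg.norm(A, axis=0)           # ||a_i|| = 1

# e.g. gaussian_deconv_dictionary(270, 5.0) gives a 300 x 270 problem as in Table II
```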

B. Separation of two close Gaussian features

We first analyze the ability of SBR to separate two Gaussian features (‖x⋆‖0 = 2) from noise-free data. The centers of both Gaussian features lie at a relative distance d (expressed as a number of samples) and their weights x⋆i are set to 1. We analyze the SBR outputs for decreasing λ-values by computing their cardinality and testing whether they coincide with the true support S(x⋆). Table II shows the results obtained for a problem of size 300 × 270 (σ = 5) with distances equal to d = 20, 13, and 6 samples.

It is noticeable that the exact recovery always occurs provided that λ is sufficiently small. This result remains true even for smaller distances (from d = 2). When the Gaussian features strongly overlap, i.e., for d ≤ 13, the size of the output support first increases while λ decreases, and then removals start to occur, enabling the exact recovery for lower λ's.

C. Behavior of SBR for noisy data

We consider a more realistic simulation in which the data are of larger size (m = 3000 samples) and noisy. The impulse response h is of size 301 (σ = 50), yielding a matrix A of size 3000 × 2700.

[Fig. 1 panels: (a) Simulated data (17 Gaussians); (b) λ = 500; (c) λ = 10; (d) λ = 0.5.]

Fig. 1. Gaussian deconvolution results. Problem of size 3000 × 2700 (σ = 50). (a) Generated data, with 17 Gaussian features and with SNR = 20 dB. The exact locations x⋆ are labeled o. (b,c,d) SBR outputs and data approximations with empirical settings of λ. The estimated amplitudes x are shown with vertical spikes. The SBR outputs (supports) are of size 5, 12, and 18, respectively. The computation time always remains below 3 seconds (Matlab implementation).

The sparse signal x⋆ is composed of 17 spikes that are uniformly located in {1, . . . , n}. The non-zero amplitudes x⋆i are drawn according to an i.i.d. Laplacian distribution. Let us remark that the limit Bernoulli-Gaussian model is not a proper probabilistic model, so one cannot use it to design simulated data. We choose a Laplacian distribution since the non-zero amplitudes are more heterogeneous than with a Gaussian distribution of finite variance.

In Fig. 1(b-d), we display the SBR results for three λ-values. For large λ's, only the main Gaussian features are found. When λ decreases, the smaller features are recovered together with spurious features. Removals occur for λ ≤ 0.8, yielding approximations that are more accurate than those obtained with OLS for the same cardinality (the residual ‖y − Ax‖² is lower), while for λ > 0.8, the SBR output coincides with the OLS solution of the same cardinality. Note that the theoretical value of λ obtained from (3) is equal to 0.3, yielding a support of cardinality 18. The residual is slightly lower

[Fig. 2: J(x; λ) versus CPU time (sec.) for SBR, OLS, OMP, ℓ1, and IRℓ1.]

Fig. 2. Comparison of sparse algorithms in terms of trade-off between accuracy (J(x; λ)) and CPU time for the deconvolution problem of Fig. 1. SBR(λ = 0.5) is run first, yielding a support of cardinality ksbr = 18. Then, we run OLS(ksbr), OMP(ksbr), homotopy for ℓ1 regression [39], and IRℓ1(λ) [40]. The ℓ1 result is the homotopy iterate of cardinality ksbr yielding the least value of J(x; λ).

and the neighboring columns of A are highly correlated. In such a difficult case, one needs to perform a wider exploration of the discrete set {0, 1}^n by introducing moves that are more complex than single replacements. Such extensions were already proposed in the case of SMLR. One can for instance shift an existing spike xi forwards or backwards [37] or update a block of neighboring amplitudes jointly (e.g., xi and xi+1) [38]. Various search strategies are also reported in [18, Chapter 3].

D. Comparison of SBR with other sparse algorithms

We compared SBR with classical and recent sparse algorithms: OMP, OLS, CoSaMP [8], Subspace Pursuit [9], Iterative Hard Thresholding (IHT) [10, 11], ℓ1 regression [3] and Iterative Reweighted ℓ1 (IRℓ1) [5, 40]. A general trend is that thresholding algorithms perform poorly when the dictionary columns are strongly correlated. CoSaMP and Subspace Pursuit yield the worst results: they stop after very few iterations as the squared error increases from one iteration to the next. On the contrary, IHT guarantees that the squared error decreases, but the convergence is very slow and the results remain poor in comparison with SBR. In the simulation of Fig. 1(c), SBR performs 12 iterations (only insertions) leading to a support of cardinality 12. Meanwhile, the number of iterations of IHT before convergence is huge: both versions of IHT presented in [10] require at least 10,000 iterations to converge, leading to an overall computation time (22 and 384 seconds) that is much larger than the SBR computation time (3 seconds).

Fig. 2 is a synthetic view of the performance of SBR, OLS, OMP, ℓ1 regression, and IRℓ1 for a given sparsity level λ. The computation time and the value of J(x; λ) are shown on the horizontal and

vertical axes, respectively. This enables us to define several categories of algorithms depending on their locations on the 2D plane: the outputs of fast algorithms (OMP and ℓ1) lie in the upper left region whereas slower but more efficient algorithms (OLS, SBR, and IRℓ1) yield points lying in the lower right region. We chose not to represent the outputs of thresholding algorithms since they yield poorer performance, i.e., points located either in the upper right (IHT) or upper left (CoSaMP, Subspace Pursuit) regions. In detail, we observed that ℓ1 regression tends to overestimate the support cardinality and to place several spikes at very close locations. We used Donoho's homotopy implementation [3, 39] and found that it requires many iterations: homotopy runs for 200 iterations before reaching a support of cardinality 18 when processing the data of Fig. 1 (we recall that homotopy starts from the empty set and performs a single support replacement per iteration). The performance of ℓ1 regression fluctuates around that of OMP depending on the trials and the sparsity level. Regarding IRℓ1, we used the Adaptive LASSO implementation of Zou [40] since it is dedicated to the minimization of J(x; λ). We stopped the algorithm when two successive ℓ1 iterates share the same support. For the simulation of Fig. 1, IRℓ1 and SBR yield comparable results in that one algorithm does not outperform the other for all λ values, but IRℓ1 generally performs slightly better (Fig. 2). We designed other simulations in which the nonzero weights x⋆i are spread over a wider interval. In this case, SBR most often yields the best approximations. Fig. 2 is representative of the empirical results obtained while performing many trials. Obviously, the figure may change significantly depending on several factors, among which the λ-value and the tuning parameters of IRℓ1. The goal is definitely not to conclude that an algorithm always outperforms the others but rather to sketch a classification of groups of algorithms according to the trade-off between accuracy and computation time.

VI. JOINT DETECTION OF DISCONTINUITIES AT DIFFERENT ORDERS IN A SIGNAL

We now consider another challenging problem: the joint detection of discontinuities at different orders in a signal [41, 42]. We process both simulated and real data and compare the performance of SBR with that of OMP, Bayesian OMP (BOMP), which is an OMP based forward-backward algorithm [23], OLS, ℓ1 regression [3], and IRℓ1 [5, 7, 40]. Firstly, we formulate the detection of discontinuities at a single order as a spline approximation problem. Then, we take advantage of this formulation to introduce the joint detection problem.


Fig. 3. Signals ai^p related to the pth order discontinuities at location i. ai^0 is the Heaviside step function, ai^1 is the ramp function, and ai^2 is the one-sided quadratic function. Each signal is equal to 1 at location i and its support is equal to {i, . . . , m}.

A. Approximation of a spline of degree p

Following [41], we introduce the dictionary A^p of size m × (m − p) formed of shifted versions of the one-sided power function k ↦ [max(k, 0)]^p for all possible shifts (see Fig. 3), and we address the sparse approximation of y by the piecewise polynomial A^p x^p (actually, we impose as initial condition that the spline function is equal to 0 for k ≤ 0). It consists in the detection of the discontinuity locations (also referred to as knots in the spline approximation literature) and the estimation of their amplitudes: xi^p codes for the amplitude of a jump at location i (p = 0), the change of slope at location i (p = 1), etc. Here, the notion of sparsity is related to the number of discontinuity locations.
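A sketch of the elementary dictionary A^p (and of the global dictionary of the next subsection) in NumPy; the exact set of shifts kept to obtain m − p columns is our assumption:

```python
import numpy as np

def spline_dictionary(m, p):
    """Dictionary A^p of size m x (m - p): the column for location i contains
    the shifted one-sided power function, equal to (k - i + 1)^p for samples
    k >= i and 0 before, so that it equals 1 at its discontinuity location i
    (cf. Fig. 3). Shifts i = p+1, ..., m are kept here (assumption)."""
    A = np.zeros((m, m - p))
    for col, i in enumerate(range(p + 1, m + 1)):   # 1-based locations i
        k = np.arange(i, m + 1)
        A[k - 1, col] = (k - i + 1) ** p
    return A

def joint_dictionary(m, P):
    """Global dictionary A = [A^0, ..., A^P] for the joint detection of
    discontinuities of orders 0, ..., P (Section VI-B)."""
    return np.hstack([spline_dictionary(m, p) for p in range(P + 1)])
```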

B. Piecewise polynomial approximation

We formulate the joint detection of discontinuities of orders p = 0, . . . , P by appending the elementary dictionaries A^p in a global dictionary A = [A^0, . . . , A^P]. The product Ax yields a sum of piecewise polynomials of degree lower than P with a limited number of pieces. The dictionary A is overcomplete since it is of size m × s, with s = (P + 1)(m − P/2) > m for P ≥ 1. Moreover, any column ai^p of A^p overlaps all other columns aj^q because their respective supports are the intervals {i, . . . , m} and {j, . . . , m}. The discontinuity detection problem is difficult as most algorithms are very likely to place wrong discontinuities.

Typically, when the data contain two discontinuities at distinct locations i and j, greedy algorithms start to position a first (wrong) discontinuity in between i and j, and forward greedy algorithms cannot remove it.

C. Adaptation of SBR

The above-defined dictionary does not satisfy the unique representation property. Indeed, it is easy to check that the difference between two discrete ramps at locations i and i + 1 yields the discrete Heaviside function at location i: a_i^1 − a_{i+1}^1 = a_i^0. We thus need to slightly modify SBR in order to ensure that only full rank matrices AQ are explored. The modification is based on the following proposition, which gives a sufficient condition for full rankness of AQ.

Proposition 2 Let ni denote the number of columns ai^p, p ∈ {0, . . . , P }, which are active for sample i. Let us define the binary condition C(i):

if ni = 0, C(i) ≜ 1;
if ni ≥ 1, C(i) ≜ {n_{i+j} = 0 for j = 1, . . . , ni − 1}.

If Q satisfies ∀i, C(i) = 1, then AQ is full rank.

Proposition 2 is proved in Appendix D. Basically, it states that we can allow several discontinuities to be active at the same location i, but then the next samples i + 1, . . . , i + ni − 1 must not host any discontinuity. This condition ensures that there are at most ni discontinuities in the interval {i, . . . , i + ni − 1} of length ni. The SBR adaptation consists in testing an insertion only when the new support Q′ = Q ∪ {(i, p)} satisfies the above condition.
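A sketch of this feasibility check, with the candidate support represented as a set of (location, order) pairs (the representation and function name are ours):

```python
def satisfies_condition(Q, m):
    """Sufficient full-rank condition of Proposition 2 for a candidate support
    Q = {(i, p)}: if n_i >= 1 columns are active at location i, then samples
    i+1, ..., i + n_i - 1 must host no discontinuity."""
    n = [0] * (m + 2)                       # n[i] = number of active orders at location i
    for (i, _p) in Q:
        n[i] += 1
    for i in range(1, m + 1):
        if n[i] >= 1 and any(n[i + j] for j in range(1, n[i]) if i + j <= m):
            return False
    return True

# The adapted SBR simply skips an insertion trial Q u {(i, p)} whenever
# satisfies_condition(Q | {(i, p)}, m) is False.
```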

D. Numerical simulations

We first set P = 1, leading to the piecewise affine approximation problem. The noise-free data y = Ax⋆ of Fig. 4(a) are of size m = 1000 with ‖x⋆‖0 = 18 discontinuities. According to Remark 2, we compute the value λmax above which the SBR output is the empty set, and we run SBR with λj = λmax 10^{−j/2} for j = 0, . . . , 20. For the least λ-value, SBR yields an exact recovery (see Fig. 4(a)). For comparison purposes, we also run 27 iterations of OMP and OLS. The "ℓ2-ℓ0" curves represented in Fig. 4(b) express the squared residual ‖y − Ax‖² versus the cardinality ‖x‖0 for each algorithm (we plot the first 27 iterates of OMP and OLS and, for all j, we plot the output of SBR(λj) after full convergence of SBR). Whatever the cardinality, SBR yields the least residual. For noisy data, the "ℓ2-ℓ0" curve corresponding to SBR still lies below the OMP and OLS curves for most cardinalities. In the next paragraph, we also consider the Bayesian OMP, ℓ1 regression, and IRℓ1 algorithms for further comparisons.

[Fig. 4 panels: (a) Noise-free data and SBR approximation; (b) "ℓ2-ℓ0" curves (noise-free data); (c) Noisy data (SNR = 20 dB) and SBR approximation; (d) "ℓ2-ℓ0" curves (noisy data).]

Fig. 4. Joint detection of discontinuities of orders 0 and 1. The dictionary is of size 1000 × 1999 and the data signal includes 18 discontinuities. The true and estimated discontinuity locations are represented with unfilled black and filled gray labels. The shape of the labels (circular or triangular) indicates the discontinuity order. The dashed gray and solid black curves represent the data signal y and its approximation Ax for the least λ-value. (a) Approximation from noise-free data. The recovery is exact. (b) "ℓ2-ℓ0" curves showing the squared residual versus the cardinality for the SBR, OLS, and OMP solutions. (c,d) Similar results for noisy data (SNR = 20 dB).

E. AFM data processing

In Atomic Force Microscopy (AFM), a force curve measures the interatomic forces exerting between a probe associated to a cantilever and a nano-object. Specifically, the recorded signal z ↦ y(z) shows the force evolution versus the probe-sample distance z, expressed in nanometers. Searching for discontinuities (location, order, and amplitude) in a force curve is a challenging task because they are used to provide a precise characterization of the physico-chemical properties of the nano-object (topography, energy of adhesion, etc.) [43].

The data displayed in Fig. 5(a) are related to a bacterial cell Shewanella putrefaciens lying in aqueous solution, interacting with the tip of the AFM probe [44]. A retraction force curve is recorded by positioning the tip in contact with the bacterial cell, and then gradually retracting the tip from the sample until it

[Fig. 5 panels: (a) force (nN) versus Z (nm); (b) residual (log scale) versus cardinality for OMP, OLS, SBR; (c) reconstruction time (sec.) versus cardinality for OMP, OLS, SBR.]

Fig. 5. Joint detection of discontinuities of orders 0, 1, and 2 (problem of size 2167 × 6498). (a) Experimental AFM data showing the force evolution versus the probe-sample distance z. (b) Squared residual versus cardinality for the SBR, OLS, and OMP solutions. (c) Time of reconstruction versus cardinality.

loses contact. In the retraction curve shown on Fig. 5(a), three regions of interest can be distinguished from right to left. The linear region on the right characterizes the rigid contact between the probe and the sample. It describes the mechanical interactions of the cantilever and the sample. The rigid contact is maintained until z ≈ −2840 nm. The interactions occurring in the interval z ∈ [−3050, −2840] nm are

adhesion forces during the tip retraction. In the flat part on the left, no interaction occurs as the cantilever has lost contact with the sample.

We search for the discontinuities of orders 0, 1, and 2. Similarly to the processing of simulated data, we run SBR with 14 λ-values, and we run OLS and OMP until iteration 41. For each algorithm, we plot the "ℓ2-ℓ0" curve and the curve displaying the time of reconstruction versus the cardinality (Figs. 5(b,c)). These figures show that the performance of SBR is at least equal to and sometimes better than that of OLS. Both algorithms yield results that are far more accurate than OMP at the price of a larger computation time.

Fig. 6 displays the approximations yielded by the three algorithms together with the BOMP, ℓ1, and IRℓ1 approximations. For the largest value λ1, SBR runs for 6 iterations (4 insertions and 2 removals), yielding a support of cardinality 2. SBR performs better than the other algorithms (Figs. 6(a-f)). Although

[Fig. 6 panels: (a) SBR(λ1), card. 2; (b) OLS, card. 2; (c) BOMP(λ1), card. 3; (d) OMP, card. 2; (e) LASSO, card. 3; (f) IRℓ1(λ1), card. 4; (g) SBR(λ2), card. 5; (h) OLS, card. 5; (i) BOMP(λ2), card. 9; (j) OMP, card. 5; (k) LASSO, card. 8; (l) IRℓ1(λ2), card. 7.]

Fig. 6. AFM data processing: joint detection of discontinuities at orders 0, 1, and 2. The estimated discontinuities x are represented with vertical spikes and with a label indicating the discontinuity order. (a) SBR output of cardinality 2: 4 insertions and 2 removals have been done (λ1 = 120). (b-f) OLS and OMP outputs after 2 iterations, BOMP and IRℓ1 [40] outputs for λ = λ1, homotopy iterate (LASSO) leading to the minimal value of J(x; λ1). (g-l) Same simulation with a lower λ-value (λ2 = 8.5). The SBR output is of cardinality 5 (7 insertions and 2 removals).

IRℓ1 yields the most accurate approximation, it relies on 4 dictionary columns, leading to a larger value of J(x; λ1). We observed the same behavior for the lowest value λ2 (subfigures (g-l)). Again, SBR yields the least value of J(x; λ2) among all algorithms. Moreover, SBR provides a very precise localization of both first order discontinuities (subfigure (a)), which is crucial information for the physical interpretation of the data. On the contrary, all other algorithms fail for the highest sparsity level, and some do not even succeed for the lowest. Specifically, OLS accurately locates both first order discontinuities when 5 iterations have been performed (the desired discontinuities are the first and the last ones among the 5) while OMP fails even after 5 iterations. LASSO and BOMP yield very poor approximations for the highest sparsity level and approximations with many dictionary columns for the lowest sparsity level. In terms of value of the cost function J(x; λ), BOMP and LASSO fluctuate around OMP but they are far outperformed by OLS, SBR, and IRℓ1.

VII. CONCLUSION

A. Discussion

We performed comparisons for two problems involving highly correlated dictionary columns. SBR is at least as accurate as OLS and sometimes more accurate, at a slightly larger computational cost. We also considered sparse algorithms that are slower than OLS. SBR was found to be very competitive in terms of trade-off between accuracy and computation time. Although OLS-based forward-backward algorithms yield a relatively large computational cost per iteration, we have noticed that for correlated dictionaries, the number of SBR iterations (i.e., of elementary modifications of the support) is much lower than the number of support modifications performed by several other algorithms. Typically, IHT and IRℓ1 can often be more expensive than SBR. Additionally, SBR terminates within a finite number of iterations, thus it does not require tuning any empirical stopping parameter. The limitation of SBR in terms of speed arises when the dictionary A is unstructured and the size of A is too large to store A^t A.

The inner products ai^t aj must then be recomputed at each iteration, which is relatively burdensome. In the recent literature, it is often acknowledged that the cost function J(x; λ) has a large number of local minimizers, therefore discouraging its direct optimization [5, 7]. Many authors thus choose to minimize an approximate cost function in which the ℓ0 norm |xi|0 is replaced with a nonconvex continuous function ϕ(xi). However, when the range of values of the (expected) nonzero amplitudes xi ≠ 0 is wide, it is difficult to find a good approximation ϕ(xi) of |xi|0 for all xi. Selecting an appropriate ϕ function generally relies on the introduction of a degree of freedom whose tuning is not obvious [5, 6]. For instance, the IRℓ1 algorithm can be interpreted as an approximate ℓ2-ℓ0 minimization method where the ℓ0 norm is replaced with ϕ(xi; ε) = log(|xi| + ε) [5, 7]. The parameter ε controls the "degree of nonconvexity" of the surrogate function ϕ.

Although J(x; λ) has a large number of local minima, we have found that SBR is often as accurate as algorithms based on the nonconvex approximation of J. Moreover, SBR is simple to use. The good behavior of SBR is somehow related to the result of Proposition 1, which states that any SBR iterate is


almost surely a local minimizer of J. We conclude that SBR is actually capable of "skipping" local minima with a large cost J(x; λ).

B. Perspectives

In the proposed approach, the main difficulty lies in the choice of the λ-value. If a specific cardinality or approximation residual is desired, one can resort to a trial and error procedure in which a number of λ-values are tried until the desired approximation level is found. In [45], we sketched a continuation version in which a series of SBR solutions are computed for decreasing levels of sparsity λ, and the λ-values are recursively computed. This continuation version shows promising results and will be the subject of a future extended contribution. A similar perspective was actually proposed by Zhang to generalize his FoBa algorithm into a path-following algorithm (see the discussion section in [22]).

Another important perspective is to investigate whether SBR can guarantee exact recovery in the noise-free case under some conditions on matrix A and on the unknown sparse signal x⋆. According to Remark 1, we will study the behavior of SBR when λ → 0. In the simulations done in Sections V and VI, we observed that SBR is able to perform exact recoveries provided that λ is sufficiently small. This promising result is a first step towards a more general theoretical study.

APPENDIX A

DETAILED DEVELOPMENT OF LIMIT BG SIGNAL RESTORATION

Consider the Bernoulli-Gaussian model x = (q, r) introduced in Section II-B and the joint MAP formulation (3) involving the cost function L(q, r). Given q, let us split r into two subvectors u and t indexed by the null and non-null entries of q, respectively. Since ‖r‖² = ‖t‖² + ‖u‖² and A∆q r = AQ t does not depend on u, we have min_u L(q, t, u) = L(q, t, 0). Thus, the joint MAP estimation problem reduces to the minimization of L(q, t, 0) w.r.t. (q, t). In the limit case σx² → ∞, this problem rereads:

min_{q,t} { L(q, t, 0) = ‖y − AQ t‖² + λ‖q‖0 }.   (9)

The equivalence between (9) and (4) directly follows from the change of variable x = {q, t} where q and t are the support and non-zero amplitudes of x.

APPENDIX B

PROOF OF REMARK 1

Lemma 1 For λ > 0, any minimizer of J(x; λ) takes the form xQ with Card[Q] ≤ min(m, n).

Proof of lemma: According to the URP assumption, any min(m, n) columns of A yield an unconstrained minimizer of ‖y − Ax‖². Let xLS be such a minimizer, with ‖xLS‖0 ≤ min(m, n), and let u be a minimizer of J(x; λ). J(u; λ) ≤ J(xLS; λ) implies that ‖u‖0 ≤ ‖xLS‖0 + (E(xLS) − E(u))/λ ≤ ‖xLS‖0 ≤ min(m, n).

We denote by Q the support of u. The related least-squares solution xQ obviously satisfies E(xQ) ≤ E(u) and ‖xQ‖0 ≤ Card[Q] = ‖u‖0, thus J(xQ; λ) ≤ J(u; λ). Since u is a minimizer of J(x; λ), we have J(xQ; λ) = J(u; λ), hence E(xQ) = E(u). Because of the URP assumption, the least-squares minimizer over Q is unique, thus u = xQ.

Lemma 2 There exists λmin > 0 such that for 0 < λ ≤ λmin, the minimizers of J(x; λ) are unconstrained minimizers of ‖y − Ax‖².

Proof of lemma: When λ tends towards 0, we have, for all Q, J(xQ; λ) = EQ + λ‖xQ‖0 → EQ. In particular, J(xQLS; λ) → EQLS with xQLS an unconstrained minimizer of ‖y − Ax‖² yielded by a subset QLS of cardinality min(m, n). Because the number of possible subsets Q is finite and for all Q, EQ ≥ EQLS, there exists λmin > 0 such that for 0 < λ ≤ λmin, the subsets Q⋆ minimizing J(xQ; λ) satisfy EQ⋆ = EQLS. Consequently, the minimizers of J(x; λ) are unconstrained least-squares solutions according to Lemma 1.

Proof of Remark 1: The proof directly follows from the application of Lemma 2. We denote by Xλ the set of minimizers of J(x; λ).

In the undercomplete case, there is a unique unconstrained least-squares minimizer xLS. Thus, Xλ = {xLS} for 0 ≤ λ ≤ λmin.

In the overcomplete case, we denote by X⋆ the set of sparsest solutions to y = Ax. To show that Xλ = X⋆ for 0 < λ ≤ λmin, we consider x ∈ X⋆ and x′ ∈ Xλ. According to Lemma 2, x′ satisfies y = Ax′, then J(x′; λ) = λ‖x′‖0. By definition of X⋆, we have y = Ax and J(x; λ) = λ‖x‖0 ≤ J(x′; λ). Because x′ ∈ Xλ is a minimizer of J, we deduce that ‖x′‖0 = ‖x‖0, then x′ ∈ X⋆ and x ∈ Xλ. We have proved that Xλ = X⋆ for 0 < λ ≤ λmin.

APPENDIX C

UPDATE OF THE CHOLESKY FACTORIZATION

At each SBR iteration, n linear systems of the form tQ ≜ GQ^{-1} AQ^t y must be solved, the corresponding squared errors reading EQ = ‖y − AQ tQ‖² = ‖y‖² − y^t AQ tQ. Using the Cholesky factorization GQ = LQ LQ^t, tQ rereads tQ = LQ^{-t} LQ^{-1} AQ^t y, thus

EQ = ‖y‖² − ‖LQ^{-1} AQ^t y‖².   (10)

Insertion of a new column after the existing columns: Including a new column leads to AQ′ = [AQ, ai]. Thus, the new Gram matrix reads as a 2 × 2 block matrix:

GQ′ = [ GQ, AQ^t ai ; (AQ^t ai)^t, ‖ai‖² ]

and the Cholesky factor of GQ′ can be straightforwardly updated:

LQ′ = [ LQ, 0 ; lQ,i^t, √(‖ai‖² − ‖lQ,i‖²) ]   (11)

with lQ,i = LQ^{-1} AQ^t ai. The update (8) of JQ(λ) = EQ + λ Card[Q] directly follows from (10) and (11).

Removal of an arbitrary column: When removing a column ai, updating LQ remains possible although more complex. This idea was developed by Ge et al. [46], who update the Cholesky factorization of matrix GQ^{-1}. We adapt it to the direct (simpler) factorization of GQ. Let I be the position of ai in AQ (with 1 ≤ I ≤ Card[Q]). LQ can be written in block matrix form:

LQ = [ Λ, 0, 0 ; b^t, d, 0 ; C, e, F ]   (12)

where the lowercase characters refer to the scalar (d) and vector quantities (b, e) appearing in the Ith row and in the Ith column. The computation of GQ = LQ LQ^t and the removal of the Ith row and the Ith column in GQ lead to

GQ′ = [ Λ, 0 ; C, F ] [ Λ^t, C^t ; 0, F^t ] + [ 0 ; e ] [ 0, e^t ].

By identification with GQ′ = LQ′ LQ′^t and because the Cholesky factorization is unique, LQ′ necessarily reads:

LQ′ = [ Λ, 0 ; C, X ],   (13)

where X is a lower triangular matrix satisfying X X^t = F F^t + e e^t. The problem of computing X from F and e is classical; it is known as a positive rank 1 Cholesky update and there exists a stable algorithm to perform it [36].
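For illustration, one classical O(p²) formulation of this positive rank 1 update (an illustrative sketch rather than the exact routine of [36]):

```python
import numpy as np

def chol_rank1_update(F, e):
    """Return a lower-triangular X such that X X^t = F F^t + e e^t,
    where F is lower triangular; runs in O(p^2) operations."""
    X = np.array(F, dtype=float)
    v = np.array(e, dtype=float)
    p = X.shape[0]
    for k in range(p):
        r = np.hypot(X[k, k], v[k])          # updated diagonal entry
        c, s = r / X[k, k], v[k] / X[k, k]   # rotation coefficients
        X[k, k] = r
        if k + 1 < p:
            X[k + 1:, k] = (X[k + 1:, k] + s * v[k + 1:]) / c
            v[k + 1:] = c * v[k + 1:] - s * X[k + 1:, k]
    return X
```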


APPENDIX D

PROOF OF PROPOSITION 2

Let us first introduce some notations specific to the piecewise polynomial dictionary problem. Consider a subset $Q$ of columns $a_i^p$ and let $i_- = \min\{i \,|\, n_i > 0\}$ denote the lowest location of an active entry (we recall that $n_i$ denotes the number of active columns for sample $i$). Up to a reordering of the columns of $A_Q$, $A_Q$ rereads $A_Q = [A_{i_-}, \tilde{A}_{i_-}]$, where $A_{i_-}$ gathers the $n_{i_-}$ active columns $a_i^p$ such that $i = i_-$ and $\tilde{A}_{i_-}$ gathers the remaining active columns (with $i > i_-$). The following lemma is a key element to prove Proposition 2.

Lemma 3: Assume that $Q$ satisfies the condition of Proposition 2. If $\tilde{A}_{i_-}$ is full rank, then $A_Q$ is full rank.

Proof: Let $I = n_{i_-}$ denote the number of discontinuities at location $i_-$ and let $0 \leq p_1 < p_2 < \ldots < p_I$ denote their orders, sorted in ascending order. Suppose that there exist two families of scalars $\{\mu_{i_-}^{p_1}, \ldots, \mu_{i_-}^{p_I}\}$ and $\{\mu_i^p \,|\, i \neq i_- \text{ and } i \text{ is active at order } p\}$ such that
$$\sum_{j=1}^{I} \mu_{i_-}^{p_j} a_{i_-}^{p_j} + \sum_{i \neq i_-} \sum_{p} \mu_i^p a_i^p = 0. \quad (14)$$
Let us show that all $\mu$-values are then equal to 0.

Rewriting the first $I$ nonzero equations in this system and because $Q$ satisfies the condition of Proposition 2, we have, for all $k \in \{i_-, \ldots, i_- + I - 1\}$, $\sum_{j=1}^{I} \mu_{i_-}^{p_j} (k - i_- + 1)^{p_j} = 0$. Hence, the polynomial $F(X) = \sum_{j=1}^{I} \mu_{i_-}^{p_j} X^{p_j}$ has $I$ positive roots. Because any non-zero polynomial formed of $I$ monomials of different degrees has at most $I - 1$ positive roots [47, p. 76], $F$ is the zero polynomial, thus all scalars $\mu_{i_-}^{p_j}$ are 0. We deduce from (14) and from the full rankness of $\tilde{A}_{i_-}$ that $\mu_i^p = 0$ for all $(i, p)$. We have shown that the column vectors of $A_Q$ are linearly independent, i.e., that $A_Q$ is full rank.

The proof of Proposition 2 directly results from the recursive application of Lemma 3. Starting from the empty set, all the indices, sorted by decreasing order, are successively included.
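As a purely numerical side illustration (not part of the proof), the full column rank of such a subset of piecewise polynomial columns can be checked on a toy example. The column definition below, $a_i^p[k] = (k - i + 1)^p$ for $k \geq i$ and 0 otherwise, is our assumption, consistent with the Heaviside/ramp/one-sided quadratic description of Fig. 3; the signal length, locations and orders are arbitrary.

```python
import numpy as np

def pp_column(m, i, p):
    """Assumed p-th order one-sided column at location i (1-based):
    a_i^p[k] = (k - i + 1)^p for k >= i, and 0 otherwise."""
    k = np.arange(1, m + 1, dtype=float)
    return np.where(k >= i, (k - i + 1) ** p, 0.0)

m = 30
locations = [3, 11, 20]          # well-separated discontinuity locations
orders = [0, 1, 2]               # Heaviside, ramp and one-sided quadratic at each
A_Q = np.column_stack([pp_column(m, i, p) for i in locations for p in orders])
print(np.linalg.matrix_rank(A_Q), A_Q.shape[1])   # expected: 9 9 (full column rank)
```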

ACKNOWLEDGMENT

The authors would like to thank Dr. Grégory Francius from LCPME (UMR CNRS 7564, Nancy, France) for providing them with real AFM data.


REFERENCES

[1] B. K. Natarajan, “Sparse approximate solutions to linear systems”, SIAM J. Comput., vol. 24, no. 2, pp. 227–234, Apr. 1995.

[2] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit”, SIAM J. Sci. Comput., vol. 20, no. 1, pp. 33–61, 1998.

[3] D. L. Donoho and Y. Tsaig, “Fast solution of ℓ1-norm minimization problems when the solution may be sparse”, IEEE

Trans. Inf. Theory, vol. 54, no. 11, pp. 4789–4812, Nov. 2008.

[4] B. D. Rao, K. Engan, S. F. Cotter, J. Palmer, and K. Kreutz-Delgado, “Subset selection in noise based on diversity measure minimization”, IEEE Trans. Signal Process., vol. 51, no. 3, pp. 760–770, Mar. 2003.

[5] E. J. Candès, M. B. Wakin, and S. P. Boyd, “Enhancing sparsity by reweighted ℓ1 minimization”, J. Fourier Anal. Appl.,

vol. 14, no. 5-6, pp. 877–905, Dec. 2008.

[6] G. H. Mohimani, M. Babaie-Zadeh, and C. Jutten, “A fast approach for overcomplete sparse decomposition based on smoothed ℓ0

norm”, IEEE Trans. Signal Process., vol. 57, no. 1, pp. 289–301, Jan. 2009.

[7] D. P. Wipf and S. Nagarajan, “Iterative reweighted ℓ1 and ℓ2 methods for finding sparse solutions”, IEEE J. Sel. Top.

Signal Process. (Special Issue on Compressive Sensing), vol. 4, no. 2, pp. 317–329, Apr. 2010.

[8] D. Needell and J. A. Tropp, “CoSaMP: Iterative signal recovery from incomplete and inaccurate samples”, Appl. Comp.

Harmonic Anal., vol. 26, no. 3, pp. 301–321, May 2009.

[9] W. Dai and O. Milenkovic, “Subspace pursuit for compressive sensing signal reconstruction”, IEEE Trans. Inf. Theory, vol. 55, no. 5, pp. 2230–2249, May 2009.

[10] T. Blumensath and M. E. Davies, “Iterative thresholding for sparse approximations”, J. Fourier Anal. Appl., vol. 14, no. 5, pp. 629–654, Dec. 2008.

[11] T. Blumensath and M. E. Davies, “Normalized iterative hard thresholding: Guaranteed stability and performance”, IEEE

J. Sel. Top. Signal Process., vol. 4, no. 2, pp. 298–309, Apr. 2010.

[12] S. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries”, IEEE Trans. Signal Process., vol. 41, no. 12, pp. 3397–3415, Dec. 1993.

[13] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, “Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition”, in Proc. 27th Asilomar Conf. on Signals, Systems and Computers, Nov. 1993, vol. 1, pp. 40–44.

[14] C. Couvreur and Y. Bresler, “On the optimality of the backward greedy algorithm for the subset selection problem”,

SIAM J. Matrix Anal. Appl., vol. 21, no. 3, pp. 797–808, Feb. 2000.

[15] M. A. Efroymson, “Multiple regression analysis”, in Mathematical Methods for Digital Computers, A. Ralston and H. S. Wilf, Eds., vol. 1, pp. 191–203. Wiley, New York, NY, 1960.

[16] K. N. Berk, “Forward and backward stepping in variable selection”, J. Statist. Comput. Simul., vol. 10, no. 3-4, pp. 177–185, Apr. 1980.

[17] P. M. T. Broersen, “Subset regression with stepwise directed search”, J. R. Statist. Soc. C, vol. 35, no. 2, pp. 168–177, 1986.

[18] A. J. Miller, Subset Selection in Regression, Chapman and Hall, London, UK, 2nd edition, Apr. 2002.

[19] S. Chen, S. A. Billings, and W. Luo, “Orthogonal least squares methods and their application to non-linear system identification”, Int. J. Control, vol. 50, no. 5, pp. 1873–1896, Nov. 1989.


[20] T. Blumensath and M. E. Davies, “On the difference between Orthogonal Matching Pursuit and Orthogonal Least Squares”, Tech. Rep., University of Edinburgh, Mar. 2007.

[21] D. Haugland, “A bidirectional greedy heuristic for the subspace selection problem”, in Engineering Stochastic Local Search Algorithms: Designing, Implementing and Analyzing Effective Heuristics, vol. 4638 of Lect. Notes Comput. Sci., pp. 162–176, Springer Verlag, Berlin, Germany, 2007.

[22] T. Zhang, “Adaptive forward-backward greedy algorithm for learning sparse representations”, Tech. Rep., Rutgers Statistics Department, Apr. 2008.

[23] C. Herzet and A. Drémeau, “Bayesian pursuit algorithms”, in Proc. Eur. Sig. Proc. Conf., Aalborg, Denmark, Aug. 2010, pp. 1474–1478.

[24] J. J. Kormylo and J. M. Mendel, “Maximum-likelihood detection and estimation of Bernoulli-Gaussian processes”, IEEE

Trans. Inf. Theory, vol. 28, pp. 482–488, May 1982.

[25] J. M. Mendel, Optimal Seismic Deconvolution, Academic Press, New York, 1983.

[26] Y. Goussard, G. Demoment, and J. Idier, “A new algorithm for iterative deconvolution of sparse spike trains”, in Proc.

IEEE ICASSP, Albuquerque, NM, Apr. 1990, pp. 1547–1550.

[27] F. Champagnat, Y. Goussard, and J. Idier, “Unsupervised deconvolution of sparse spike trains using stochastic approximation”, IEEE Trans. Signal Process., vol. 44, no. 12, pp. 2988–2998, Dec. 1996.

[28] I. F. Gorodnitsky and B. D. Rao, “Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm”, IEEE Trans. Signal Process., vol. 45, no. 3, pp. 600–616, Mar. 1997.

[29] Q. Cheng, R. Chen, and T.-H. Li, “Simultaneous wavelet estimation and deconvolution of reflection seismic signals”, IEEE

Trans. Geosci. Remote Sensing, vol. 34, pp. 377–384, Mar. 1996.

[30] E. I. George and D. P. Foster, “Calibration and empirical Bayes variable selection”, Biometrika, vol. 87, no. 4, pp. 731–747, 2000.

[31] H. Chipman, E. I. George, and R. E. McCulloch, “The practical implementation of Bayesian model selection”, IMS Lecture

Notes – Monograph Series, vol. 38, pp. 65–134, 2001.

[32] J. Nocedal and S. J. Wright, Numerical optimization, Springer Series in Operations Research and Financial Engineering. Springer Verlag, New York, 1999.

[33] S. Chen and J. Wigger, “Fast orthogonal least squares algorithm for efficient subset model selection”, IEEE Trans. Signal

Process., vol. 43, no. 7, pp. 1713–1715, July 1995.

[34] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least angle regression”, Ann. Statist., vol. 32, no. 2, pp. 407–451, 2004.

[35] S. J. Reeves, “An efficient implementation of the backward greedy algorithm for sparse signal reconstruction”, IEEE

Signal Process. Lett., vol. 6, no. 10, pp. 266–268, Oct. 1999.

[36] P. E. Gill, G. H. Golub, W. Murray, and M. A. Saunders, “Methods for modifying matrix factorizations”, Math. Comp., vol. 28, no. 126, pp. 505–535, Apr. 1974.

[37] C. Y. Chi and J. M. Mendel, “Improved maximum-likelihood detection and estimation of Bernoulli-Gaussian processes”,

IEEE Trans. Inf. Theory, vol. 30, pp. 429–435, Mar. 1984.

[38] M. Allain and J. Idier, “Efficient binary reconstruction for non-destructive evaluation using gammagraphy”, Inverse Problems, vol. 23, no. 4, pp. 1371–1393, Aug. 2007.


[40] H. Zou, “The adaptive Lasso and its oracle properties”, J. Amer. Statist. Assoc., vol. 101, no. 476, pp. 1418–1429, Dec. 2006.

[41] M. S. Smith and R. Kohn, “Nonparametric regression using Bayesian variable selection”, J. Econometrics, vol. 75, no. 2, pp. 317–343, Dec. 1996.

[42] M. Vetterli, P. Marziliano, and T. Blu, “Sampling signals with finite rate of innovation”, IEEE Trans. Signal Process., vol. 50, no. 6, pp. 1417–1428, June 2002.

[43] H.-J. Butt, B. Cappella, and M. Kappl, “Force measurements with the atomic force microscope: Technique, interpretation and applications”, Surf. Sci. Rep., vol. 59, no. 1–6, pp. 1–152, Oct. 2005.

[44] F. Gaboriaud, B. S. Parcha, M. L. Gee, J. A. Holden, and R. A. Strugnell, “Spatially resolved force spectroscopy of bacterial surfaces using force-volume imaging”, Colloids Surf. B., vol. 62, no. 2, pp. 206–213, Apr. 2008.

[45] J. Duan, C. Soussen, D. Brie, and J. Idier, “A continuation approach to estimate a solution path of mixed L2-L0 minimization problems”, in Signal Processing with Adaptive Sparse Structured Representations (SPARS workshop), Saint-Malo, France, Apr. 2009, pp. 1–6.

[46] D. Ge, J. Idier, and E. Le Carpentier, “Enhanced sampling schemes for MCMC based blind Bernoulli-Gaussian deconvolution”, Signal Process., vol. 91, no. 4, pp. 759–772, Apr. 2011.

[47] F. R. Gantmacher and M. G. Krein, Oscillation matrices and kernels and small vibrations of mechanical systems, AMS Chelsea Publishing, Providence, RI, revised edition, 2002.

Charles Soussen was born in France in 1972. He received the engineering degree from the École Nationale Supérieure en Informatique et Mathématiques Appliquées, Grenoble, France, and the Ph.D. degree in physics from the Laboratoire des Signaux et Systèmes, Université de Paris-Sud, Orsay, France, in 1996 and 2000, respectively. He is currently an Assistant Professor at Nancy-University, France. He has been with the Centre de Recherche en Automatique de Nancy since 2005. His research interests are in inverse problems and sparse approximation.

Jérôme Idier was born in France in 1966. He received the diploma degree in electrical engineering

from École Supérieure d'Électricité, Gif-sur-Yvette, France, in 1988 and the Ph.D. degree in physics from University of Paris-Sud, Orsay, France, in 1991.

He joined the Centre National de la Recherche Scientifique in 1991. He is currently a Senior Researcher at the Institut de Recherche en Communications et Cybernétique in Nantes. His major scientific interests are in probabilistic approaches to inverse problems for signal and image processing. He is serving as an Associate Editor for the IEEE Transactions on Signal Processing.


David Brie received the Ph.D. degree in 1992 and the Habilitation à diriger des Recherches degree in 2000,

both from the Henri Poincaré University, Nancy, France. He is currently a Professor at the telecommunication and network department of the Institut Universitaire de Technologie, Nancy-University. Since 1990, he has been with the Centre de Recherche en Automatique de Nancy. His research interests mainly concern inverse problems and multidimensional signal processing.

Junbo Duan was born in China in 1981. He received the B.S. degree in information engineering and

M.S. degree in communication and information system from Xi'an Jiaotong University, China, in 2004 and 2007, respectively, and the Ph.D. degree in signal processing from the Université Henri Poincaré, Nancy, France, in 2010. He is currently a postdoctoral researcher in the Department of Biomedical Engineering and Biostatistics, Tulane University. His major research interests are in probabilistic approaches to inverse problems in bioinformatics.

Figure captions:
Fig. 1. Gaussian deconvolution results. Problem of size 3000 × 2700 (σ = 50). (a) Generated data, with 17 Gaussian features and with SNR = 20 dB.
Fig. 2. Comparison of sparse algorithms in terms of trade-off between accuracy (J(x; λ)) and CPU time for the deconvolution problem of Fig. 1.
Fig. 3. Signals a_i^p related to the pth order discontinuities at location i: a_i^0 is the Heaviside step function, a_i^1 is the ramp function, and a_i^2 is the one-sided quadratic function.
