Posterior Expectation of the Total Variation model: Properties and Experiments

(1)

HAL Id: hal-00764175

https://hal.archives-ouvertes.fr/hal-00764175v3

Submitted on 27 Apr 2015

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Posterior Expectation of the Total Variation model:

Properties and Experiments

Cécile Louchet, Lionel Moisan

To cite this version:

Cécile Louchet, Lionel Moisan. Posterior Expectation of the Total Variation model: Properties and

Experiments. SIAM Journal on Imaging Sciences, Society for Industrial and Applied Mathematics,

2013, 6 (4), pp.2640-2684. �10.1137/120902276�. �hal-00764175v3�

(2)

Vol. 6, No. 4, pp. 2640–2684

Posterior Expectation of the Total Variation Model:

Properties and Experiments ^∗ C´ ecile Louchet ^† and Lionel Moisan ^‡

Abstract. The total variation image (or signal) denoising model is a variational approach that can be in- terpreted, in a Bayesian framework, as a search for the maximum point of the posterior density (maximum a posteriori estimator). This maximization aspect is partly responsible for a restoration bias called the “staircasing effect,” that is, the outbreak of quasi-constant regions separated by sharp edges in the intensity map. In this paper we study a variant of this model that considers the expectation of the posterior distribution instead of its maximum point. Apart from the least square error optimality, this variant seems to better account for the global properties of the posterior dis- tribution. We present theoretical and numerical results that demonstrate in particular that images denoised with this model do not suffer from the staircasing effect.

Key words. image denoising, total variation, Bayesian model, least square estimate, maximum a posteriori, estimation in high-dimensional spaces, proximity operators, staircasing eﬀect

AMS subject classiﬁcations. 68U10, 62H35 DOI. 10.1137/120902276

1. Introduction. Total variation (TV) is probably one of the simplest analytic priors on images that favor smoothness while allowing discontinuities at the same time. Its typical use, introduced in the celebrated Rudin, Osher, and Fatemi (ROF) image restoration model [64], consists in solving an inverse problem like Au = v (where v is the observed image, u is the unknown ideal image, and A is a given operator) by minimizing the energy

(1) E(u) = Au − v ² + λT V (u).

This energy establishes a trade-off between data fidelity (the first term) and data regularity (TV), the relative weight of the latter being specified by the hyperparameter λ. In a continuous formulation, the TV of a gray-level image u : R ² → R is defined by

T V (u) = inf

R

²

u div p; p ∈ C c ^∞ ( R ² , R ² ), p _∞ ≤ 1

,

which boils down to

(2) T V (u) =

R

²

| Du | with | Du | =

∂u

∂x ₂

+ ∂u

∂y ₂

∗

Received by the editors December 12, 2012; accepted for publication (in revised form) August 26, 2013; published electronically December 17, 2013.

http://www.siam.org/journals/siims/6-4/90227.html

†

MAPMO, Universit´ e d’Orl´ eans, CNRS, UMR 6628, 45067 Orl´ eans, France ([email protected]).

‡

MAP5, Universit´ e Paris Descartes, CNRS UMR 8145, 75006 Paris, France ([email protected]).

2640

(3)

for smooth images. Depending on the choice of A, (1) can be used for image denoising (A is the identity operator), image deblurring (A is a convolution with a given blur kernel), tomography (A is a Radon transform), superresolution (A is an image subsampling operator), etc. In the last two decades, the TV prior has been used in a large variety of image processing and computer vision applications: image inpainting [23], interpolation [35], segmentation [20], image quality assessment [8], scale detection [49], cartoon+texture decomposition [4, 5], motion estimation [76], and many others (and also, of course, in applications that do not concern images). For a more complete list of applications concerning image processing and the TV model, we invite the interested reader to consult [15, 18] and references therein.

Even if other prior functionals have been proposed (Besov priors, Markov random fields learned on a large bench of images, sparsity priors, fields of experts [63]), TV still frequently appears in nonlinear image processing algorithms. A possible explanation for this is the simplicity of the TV operator and its ability to penalize edges (that is, sharp transitions), but not too much: images that are smooth away from a jump set which is a finite union of smooth curves of finite length will have a finite TV. Conversely, TV does penalize highly oscillating patterns, noise in particular. Among other reasons that make TV worth studying, we can mention the following:

• The prior model based on TV (or the median pixel prior, its discrete counterpart) shows a natural connection with purely discrete Markov models [7, 9].

• If u is a binary image, that is, the characteristic function of some—regular enough—

subset S of R ² , then T V (u) is simply the perimeter of S. For a general real-valued image u, this correspondence is generalized thanks to the coarea formula [1]: the idea is to decompose u into nested binary images (corresponding to the level sets of u) and to sum up the inﬁnitesimal contribution of the TV of each binary image. This geometric characterization of TV allows us to interpret the ROF model as a regularization of the level lines of v. If the data-ﬁdelity term u − v ² is replaced by its L ¹ -norm counterpart u − v ₁ in (1), which is more suitable in the case of impulse noise, we even have a contrast-invariant transform [26], which processes the level sets of u independently.

This nice analytical framework around TV and bounded variation (BV) spaces [1]

makes it particularly ﬁtted to mathematical image analysis.

• The TV model is simple enough to produce few artifacts, which is important for applications in medical imaging, for instance, be it the segmentation of an organ or the analysis of a pathology. This may not be the case for more sophisticated methods like BM3D [24] or dictionary learning methods, where the higher performance comes along with artifacts that are diﬃcult to control and anticipate.

TV is simple and convenient, but it has its own drawbacks. First, textured image parts, which are very oscillatory in general, are highly penalized by TV and are often destroyed (or at least strongly attenuated) by the ROF model. Another well-known artifact is the staircasing eﬀect.

The staircasing eﬀect. A simple example of the staircasing eﬀect is obtained when a noisy

aﬃne one-dimensional signal is processed with the ROF model: the denoised signal is not

smooth but piecewise constant (see, e.g., the ROF-denoised signal that is displayed in the

middle plot of Figure 2). The staircase shape obtained in this case is quite general: as

ﬁrst noticed in [25] and then analyzed by [11, 15, 33, 57, 61] in diﬀerent frameworks, the

(4)

application of ROF denoising leads to blocky structures on most signals and images; this phenomenon is called the staircasing effect. As far as we know, all image processing papers discussing this effect on natural images consider it an artifact rather than a desirable property [6, 10, 13, 15, 21, 22, 45]. Indeed, in [36], histograms of local features (such as directional gradients) computed on a large bunch of natural images are faithfully modeled by generalized Laplace distributions with a probability density function (p.d.f.) proportional to exp( − s | x | ^α ) with α ≈ 0.55. An image restored with the ROF model will have a potentially large proportion of pixels with a strictly zero gradient, thus adding a Dirac mass to the generalized Laplace distribution and contradicting the observed statistics of natural images. Furthermore, the staircasing effect is incompatible with the Shannon sampling theory [68] in the sense that a nonconstant bandlimited image (in the continuous domain) cannot be locally constant. Even from a physical image formation viewpoint, truly constant regions are rare: because of little variations in illumination, orientation, or simply because of the perspective, a minimum drift of gray levels is generally observed. Last, the staircasing effect comes with the creation of spurious edges in what should be smooth areas. Using an ROF filter as a preprocessing step before a higher-level processing, such as segmentation, may be a source of systematic failures.

It is commonly admitted that the staircasing eﬀect is due to the nonregularity of TV [53, 55, 57], which, in the discrete framework, comes from the singularity of TV at zero gradients. More than that, under several hypotheses [71], the ROF model is equivalent to minimizing an energy like (1), but where the TV (the ¹ -norm of the gradient) is replaced with the ⁰ -“norm” of the gradient, hence promoting sparsity for the gradient and favoring piecewise constant images.

A way to avoid this staircasing artifact is to regularize the TV operator, as proved in [55].

Among variants that have been proposed [3, 7, 28, 56, 74], some introduce a parameter ε > 0 and replace the term | Du | in (2) by f _ε ( | Du | ), where

f _ε (t) =

ε ² + t ² , or f _ε (t) = t ² if | t | < ε,

ε ² otherwise, or f _ε (t) =

t

²

2 ε + ^ε ₂ if | t | < ε,

| t | otherwise, or f _ε is another even, smooth function that is nondecreasing on R ⁺ . More recently, different authors managed to promote sparsity for higher-order derivatives [6, 10, 21, 22, 45], leading to piecewise affine or piecewise polynomial images (hence pushing the staircasing effect to higher orders). In [38], an elegant modification of the TV operator seems to avoid staircasing in denoising and deblurring experiments, but no proof is provided.

All the above-mentioned variants require modiﬁcations of TV, or the addition of higher-

order terms in the variational model. One contribution of the present paper is to show that

the true TV prior is compatible with the avoidance of the staircasing artifact, provided that an

appropriate framework is used. Indeed, the ROF model can be reinterpreted in a statistical

(Bayesian) framework, where it exactly corresponds to the maximum a posteriori (MAP)

estimate, which means that the ROF model selects the image that maximizes the p.d.f. of a

certain distribution (the posterior distribution, associated to the TV prior and the data-ﬁdelity

term). Several authors [57, 75] pointed out that MAP estimates tend to be very singular with

regard to the prior distribution. The staircasing artifact can be considered one of these prior

statistics singularities.

(5)

Let us also mention the nonlocal extension of ROF (NL-ROF) proposed in [32], where the neighborhood between pixels is based on a comparison of patches as in the NL-means method [13]. As the authors notice in their experiments, it clearly improves the basic ROF model, but produces staircasing effects. Another NL extension of ROF (NLBV) proposed in [39] avoids staircasing artifacts by considering differences of gradients instead of differences of gray values, in a spirit similar to that of the second-order TV models considered in [6, 10].

In the present work, we propose keeping the statistical framework associated to the ROF model, but moving away from MAP estimation and considering instead the mean of the posterior distribution (rather than its maximum). As in the preliminary work [47], we will denote this approach by TV-LSE, for it reaches the least square error. This kind of approach is also often called MMSE (minimizer of the mean square error) in the literature, or sometimes CM (conditional mean).

LSE estimates versus MAP estimates. LSE estimates have been proposed for a long time in the context of Bayesian image restoration. As early as 1989, Besag [7] mentioned the possibility of using the LSE estimate instead of MAP in the discrete TV framework (then called median pixel prior), as well as the marginal posterior mode and the median estimate. In the case of a TV prior model, LSE is presented in [27] as a favorable alternative to MAP concerning the statistics of the reconstructed image, relying on the example of binary image denoising (the TV model is then equivalent to the Ising model), where MAP provides a nonrobust estimate.

Lassas, Siltanen, and colleagues [40, 42, 41] focus on one-dimensional signal restoration with a TV prior, and make a comparative study of MAP and LSE at the interface between the discrete and the continuous settings, when the quantization step goes to zero (so that the dimension goes to inﬁnity). They show that in their asymptotic framework, the TV prior may lead only to trivial estimates (MAP equal to 0, LSE equivalent to Gaussian smoothing), and they conclude by switching to a Besov prior which behaves properly when the quantization step goes to 0.

MAP estimation, seen as the minimization of an energy, is often preferred to LSE estima- tion because the computation is made easier and faster by a whole world of energy minimiza- tion algorithms, contrary to LSE which requires Monte-Carlo Markov chain algorithms [31] or Gibbs samplers [29], which are known to be slow. This computational issue can motivate one to use MAP instead of LSE or, more interestingly, to see an LSE estimate as a MAP estimate in another Bayesian framework, as was done in [34] and [58].

The debate between MAP and LSE goes far beyond algorithmic issues, as the literature, mostly on learned prior Markov random ﬁelds, testiﬁes. LSE estimates, regarding [58, 63, 67], seem to recover the prior statistics in a better way than MAP estimates. But in [59], it is argued that the prior learning method (maximum margin principle or maximum likelihood) has to be connected to the estimation function: maximum likelihood seems to perform better while associated to an LSE estimator, but learning with a maximum margin principle seems to perform even better while associated to a MAP estimator.

Since the preliminary work [47] in 2008, several researchers have taken an interest in TV- LSE. Jalalzai and Chambolle [38], Lefkimmiatis, Bourquard, and Unser [45], and Salmon [66]

mention the TV-LSE model for its ability to naturally remove staircasing artifacts. In the

conclusion of [51], Mirebeau and Cohen propose a TV-LSE–like approach to denoising images

using anisotropic smoothness features, an interesting counterpart to TV, arguing that LSE

(6)

is able to deal with nonconvex functionals. Chaari and colleagues propose LSE estimates for a frame-based Bayesian denoising task [16] and for a parameter estimation task [17], where a TV prior is used jointly with a prior on the frame coeﬃcients; the abundant numerical experiments show that the proposed method compares favorably with the MAP estimate. In the handbook chapter [15], Caselles, Chambolle, and Novaga dedicate a section to TV-LSE.

Outline of the paper. The paper is organized as follows. In section 2 we recall the Bayesian point of view on the ROF model and justify the LSE approach using measure concentration arguments. In section 3 we analyze the proposed TV-LSE estimator in a finite-dimensional framework (finite number of pixels but real-valued images). Simple invariance and convergence properties are first given in sections 3.1 and 3.2. Then in section 3.3, a deeper insight is developed, where the TV-LSE denoiser is viewed as the gradient of a convex function, which allows us to prove, using convex duality tools, that TV-LSE avoids the constant regions of the staircasing effect while allowing the restoration of sharp edges. We also interpret the TV-LSE denoiser as a MAP estimate, whose prior is carefully analyzed. In section 4, we give numerical experiments on image denoising, showing that the TV-LSE offers an interesting compromise between blur and staircasing, and generally gives rise to more natural images than ROF. We then conclude in section 5.

2. From ROF to TV-LSE: Bayes TV-based models.

2.1. ROF Bayesian interpretation. Let u : Ω → R be a discrete gray-level image defined on a finite rectangular domain Ω ⊂ Z ² , which maps each pixel x = (x, y) ∈ Ω to the gray level u(x). The (discrete) total variation of the image u is defined by

(3) T V (u) =

x∈Ω

| Du(x) | ,

where | Du(x) | is a discrete scheme used to estimate the gradient norm of u at point x. In what follows we shall consider either the ¹ - or the ² -norm on R ² , associated with the simplest possible approximation of the gradient vector, given by

(4) Du(x, y) =

u(x + 1, y) − u(x, y) u(x, y + 1) − u(x, y)

(note that all the results of this paper hold for a large variety of discrete TV operators; see Appendix A). Concerning boundary conditions, we shall use the convention that diﬀerences involving pixels outside the domain Ω are zero. Given a (noisy) image v, the ROF method proposes selecting the unique image u minimizing the energy

(5) E _v,λ (u) = u − v ² + λT V (u),

where · is the classical L ² -norm on images and λ is a hyperparameter which controls the denoising level. This formulation as energy minimizer can be transposed in a Bayesian framework. Indeed, for β > 0 and μ ∈ R, let us consider the p.d.f.

(6) ∀u ∈ E μ , p β (u) = 1

Z β e ⁻ ^{βT V} ⁽ ^u ⁾ , where Z β =

E

μ

e ⁻ ^{βT V} ⁽ ^u ⁾ du,

(7)

(7) and ∀ μ ∈ R , E μ =

u ∈ R ^Ω , u ¯ = μ

with u ¯ = 1

| Ω |

x∈Ω

u(x).

Let us now suppose that instead of u, we observe the noisy image v = u + N , where N is a white Gaussian noise with zero mean and variance σ ² . Applying Bayes’ rule with prior distribution p β leads to the following posterior p.d.f.:

(8) p(u | v) = p(v | u)p β (u) p(v) = 1

Z exp

− E v,λ (u) 2σ ²

,

where λ = 2βσ ² and Z is a normalizing constant depending on v and λ only, ensuring that u → p(u | v) remains a p.d.f. on R ^Ω . Hence, the variational formulation (arg min _u E _v,λ (u)) is equivalent to a Bayesian formulation in terms of MAP

(9) u ˆ _ROF = arg max

u ∈E

μ

p(u | v).

This means that ROF denoising amounts to selecting the most probable image under the posterior probability deﬁned by p(u|v). Notice that the constraint u ∈ E μ , which was imposed to obtain a proper (that is, integrable) prior p.d.f., p _β , can be dropped out when μ = ¯ v, since this leaves the MAP estimate unchanged [2].

In a certain sense, the most complete information is given by the whole posterior distri- bution function. However, for obvious practical reasons, one generally seeks an “optimal”

estimate of the original image built from the posterior distribution, with respect to a certain criterion. The MAP estimate is obtained by minimizing the Bayes risk when the associated cost function is a Dirac mass located on the true solution. In a certain sense, this estimator is not very representative of the posterior distribution, since it only “sees” its maximum; in particular, as (8) shows, the solution does not depend on σ, which measures the “spread”

of the posterior distribution. As ˆ u _ROF minimizes the energy E v,λ (u), it tends to concentrate certain exceptional structures which are cheap in energy, in particular, regions of constant intensity, leading to the well-known staircasing eﬀect (see Figure 1).

2.2. The staircasing eﬀect. In order to establish the existence of this staircasing eﬀect for ROF denoising in the discrete setting, Nikolova [55] remarks that the discrete TV operator (3) can be written under the more general form

T V (u) = r i =1

ϕ i (G i u),

where G _i : R ^Ω → R ^m (1 ≤ i ≤ r) are linear operators (here, diﬀerences between neighboring pixels), and ϕ i : R ^m → R are piecewise smooth functions that are not diﬀerentiable in zero (here, all ϕ i correspond to the L ¹ -norm on R ² ).

Proposition 2.1 (see [55]). If S(v) = arg min _u u − v ² +λJ(u), where J (u) = _r

i =1 ϕ _i (G _i u) and, for all i, G _i is linear and ϕ _i is piecewise smooth and not diﬀerentiable in 0, then, under some technical assumptions, there exists an open neighborhood V of v for which

(10) ∀ v ∈ V,

i | G _i (S(v )) = 0

=

i | G _i (S(v)) = 0

.

(8)

(a) (c) (e)

(b) (d) (f)

0 20000 40000 60000 80000 100000 120000 140000

-40 -30 -20 -10 0 10 20 30 40

original image

0 20000 40000 60000 80000 100000 120000 140000

-40 -30 -20 -10 0 10 20 30 40

ROF image

(g) (h)

Figure 1. The staircasing eﬀect. A noisy version (a) of the Lena image (additive white Gaussian noise with standard deviation σ = 10) is denoised with the ROF model with λ = 40 (b). The details (c) and (d) of (b) reveal the so-called staircasing eﬀect: ROF denoising tends to create smooth regions separated by spurious edges.

This eﬀect clearly appears on the level lines (e) and (f) of images (c) and (d): most level lines (here computed using a bilinear interpolation) tend to be concentrated along spurious edges. The histograms of the horizontal derivative of the original Lena image (g) and the ROF-denoised image (h) also reveal this staircasing eﬀect:

whereas such a histogram is generally well modeled by a generalized Laplace distribution for natural images, the

ROF version presents a large peak in 0 that is a direct consequence of the staircasing eﬀect. Similar plots would

be obtained with the vertical derivative.

(9)

In the case of TV, the G _i are differences between neighboring pixels, so the set { i ∈ { 1, . . . , r } | G _i (S(v)) = 0 } corresponds to the constant sets of S(v) = ˆ u _ROF . The existence of an open neighborhood V = { v } of v for which any S(v ) has the same constant set as S(v) indicates that the regions of constant gray level have a certain stability with respect to perturbations of the observed data v. This gives a first theoretical explanation of the staircasing effect. In other words, if the space of noisy images R ^Ω is endowed with a probability distribution that is absolutely continuous with respect to the Lebesgue measure on R ^Ω , then for any x ∈ Ω, the probability of having a zero gradient at pixel x in the denoised image is positive. Hence, there is a bias toward constant regions, which can be measured by a Dirac mass at zero on the histograms of gradients (see Figure 1, plots (g) and (h)).

In the continuous domain, Jalalzai [37] assesses the presence of staircasing by testing the positivity of |{ x ∈ Ω | S(v)(x) = c}| for some c. He proves that this property occurs for c = max S(v) and c = min S(v) when the datum v is in L ² (Ω) ∩ L ^∞ (Ω) and Ω = R ² . In particular the gradient of S(v) is zero in the interior of { x ∈ R ² | S(v)(x) = c } . As this deﬁnition is speciﬁc to the continuous setting, in what follows we shall rather focus on Nikolova’s viewpoint [55].

Let us also cite the recent work of Caselles, Chambolle, and Novaga [14], where staircasing is studied not from the point of view of constant regions but in terms of discontinuities. An interesting property concerning the jump set of the reconstructed image in the continuous framework is proved, which could suggest that staircasing is due only to a bad quantization of the TV. The (approximate) jump set of a continuous image u is deﬁned as the set of points x ∈ R ² satisfying

∃ u ⁺ (x) = u ⁻ (x), ∃ ν _u (x) ∈ R ² ,

⎧ ⎪

⎪ ⎪

⎪ ⎨

⎪ ⎪

⎩

|ν u (x)| = 1, lim ρ ↓0

B

_ρ⁺

( x,ν

_u

( x )) | u(y) − u ⁺ (x) | dy

B

_ρ⁺

( x,ν

_u

( x )) dy = 0, lim ρ ↓0

B

_ρ⁻

( x,ν

_u

( x )) | u(y) − u ⁻ (x) | dy

B

_ρ⁻

( x,ν

_u

( x )) dy = 0, where

B _ρ ⁺ (x, ν _u (x)) = { y | y − x < ρ , y − x, ν _u (x) > 0 }

and B _ρ ⁻ (x, ν u (x)) is the same with a negative inner product. Intuitively, the jump set of an image u is the set of points where u can be locally described as a two-dimensional Heaviside function, which corresponds to regular edges. It is shown that if the datum image v has bounded variation, then the jump set of the solution ˆ u to the continuous ROF denoising problem is contained within the jump set of v. In other words, ROF denoising does not create edges which did not already exist in v. This would contradict some kind of staircasing eﬀect (the discontinuity part) if we forgot that v is generally noisy so that the jump set contains almost every point of the domain.

2.3. Concentration of the posterior distribution. Another distortion induced by the

MAP approach comes from the high dimension of the problem. Indeed, the MAP estimate

depends only on the location of the mode (that is, the point with maximum density), not

(10)

on the probability mass that this mode contains [65], and this diﬀerence may become huge in high-dimensional spaces. Let us illustrate this with a simple example. If n is a positive integer and X is a random vector distributed as N (0, σ ² I n ) (centered normal distribution with covariance matrix σ ² I _n , I _n being the n-dimensional identity matrix), then by applying the Bienaym´ e–Chebyshev inequality to the random variable X ² = _n

i =1 X _i ² (which follows a σ ² χ ² (n) distribution), we obtain

(11) ∀ ε > 0, P

1 n X ² − σ ² > ε

≤ 2σ ⁴ nε ² ,

and the right-hand term decreases toward 0 when the dimension n grows to inﬁnity. In this example, the mode of X is located in 0, but when n goes to + ∞ all the mass of the distribution

“concentrates” around the sphere centered in 0 with radius σ √

n and therefore goes away from the mode. This kind of situation is quite common in high dimension. A similar example is the case of the uniform distribution on the unit ball, whose mass concentrates in an arbitrarily small neighborhood of the unit sphere when the dimension grows. Hence, the MAP estimate may be, especially in high dimension, a very special image whose properties may strongly diﬀer from those of typical samples of the posterior. This remark particularly makes sense for images, whose typical dimension can be n = 10 ⁶ or more.

In our image denoising problem, we should deal with the posterior probability π asso- ciated to the p.d.f. π(u) = p(u | v) (see (8)), which is a log-concave Gibbs ﬁeld. For such probability distributions, we can have concentration results similar to (11), but they require more sophisticated tools [44].

A key notion in studying the concentrating power of a probability distribution π on a Euclidean space X is the concentration function α _π (r) [44], deﬁned for each r > 0 by

(12) α _π (r) = sup

1 − π (A r ); A Borel set of R ^Ω and π (A) ≥ 1 2

,

where A r = { u ∈ R ^Ω , d(u, A) < r } (d is the Euclidean distance on X ). For instance, in the case of a uniform distribution on the d-dimensional unit sphere, the concentration function can be proved to be smaller than a Gaussian function of r [43], whose fast decay for large r indicates, thanks to (12), that the probability is very concentrated near any equator.

An analytical point of view for α _π is useful when considering other distributions. The concentration function can be addressed in terms of the concentration of Lipschitz-continuous functions around their medians. Namely, a measurable function F : X → R is said to be 1-Lipschitz when

F _Lip := sup

u,v ∈X

| F (u) − F (v) | u − v ≤ 1, and m F is called a median of F if it satisﬁes

π (F ≤ m F ) ≥ 1

2 and π (F ≥ m F ) ≥ 1 2 . The concentration function can be characterized for any r > 0 by [43]

(13) α _π (r) = sup

F π (F − m F ≥ r),

(11)

where the supremum runs over all real-valued measurable 1-Lipschitz functions F , and where m _F is any median of F .

In the case of the posterior probability having p.d.f. (8) of our image denoising problem, we can have a Gaussian concentration inequality similar to the uniform distribution on the unit sphere, as shown in the following proposition.

Proposition 2.2 (concentration property for the ROF posterior distribution). Let π denote the posterior probability with p.d.f. (8). Then

(14) ∀ r > 0, α _π (r) ≤ 2e ⁻

^r

2 4σ2

.

Proof. The probability π has p.d.f. π = _Z ¹ e ⁻ ^V , where V = ₂ _σ ¹

₂

E v,λ satisﬁes the strong convexity inequality

(15) ∃c > 0 ∀u, v ∈ R ^Ω , V (u) + V (v) − 2 V

u + v 2

≥ c

4 u − v ² with c = 1/σ ² . Then applying [44, Theorem 2.15, p. 36], we obtain (14).

This shows that any Lipschitz-continuous function F is concentrated around its median m _F : indeed, combining (13) and (14) yields

(16) π (F − m F ≥ r) ≤ 2e ⁻

F2 Lipr2 4σ2

, and the same argument with − F leads to

(17) π (F − m _F ≤ − r) ≤ 2e ⁻

F2 Lipr2 4σ2

. Putting both inequalities together, we deduce that for each r > 0,

(18) π ( | F − m F | ≥ r) ≤ 4e ⁻

F2 Lipr2 4σ2

.

Now we prove that the energy E _v,λ is concentrated around a particular value. E _v,λ is not Lipschitz-continuous (because the data-ﬁdelity term is quadratic), but its square root is Lipschitz-continuous as soon as v is not constant, which leads to a weaker form of concentra- tion for the energy.

Proposition 2.3. Let π denote the posterior probability with p.d.f. (8). Assume that v is not constant. Then there exist m ∈ R and c > 0 such that for any r ≥ 0,

(19) π ( |

E _v,λ − m | ≥ r) ≤ 4e ⁻

^c

2r2 4σ2

. Proof. For any images u and u , let us write

E _v,λ (u ) −

E _v,λ (u) = C ₁ + C ₂ with

C ₁ =

u − v ² + λT V (u ) −

u − v ² + λT V (u )

(12)

and

C ₂ =

u − v ² + λT V (u ) −

u − v ² + λT V (u).

For C ₁ , let us recall that when u is ﬁxed, u →

u − v ² + λT V (u ) is the composition of u → u − v and x ∈ R → √

x ² + ε with ε = λT V (u ) ≥ 0, which are both 1-Lipschitz. This gives the inequality

(20) |C ₁ | ≤ u − u.

To bound C ₂ , we need to compute the Lipschitz constant of the discrete TV. It depends on the scheme for T V which is used and depends monotonically on | Ω | . Writing T V Lip = κ

| Ω | , κ can be evaluated to κ = 4 + O(1/

| Ω | ) for the ¹ -scheme, and κ = 2 √

2 + O(1/

| Ω | ) for the ² -scheme (the approximation is due to the domain’s border eﬀect, and in both cases the Lipschitz constant is reached when computing T V (u) − T V (0), where u is the chessboard image deﬁned by u(i, j) = ( − 1) ⁱ ⁺ ^j ). We have

C ₂ = u − v ² + λT V (u ) − u − v ² − λT V (u) u − v ² + λT V (u ) +

u − v ² + λT V (u) .

But as v is supposed to be nonconstant, E v,λ is coercive and cannot equal zero, so that it is bounded from below by a positive constant. Hence, since

u − v ² + λT V (u ) is nonnega- tive, we have

(21) | C ₂ | ≤ λ|T V (u ) − T V (u)|

0 +

min E _v,λ ≤ λκ

| Ω | u − u min E _v,λ . Then, combining (20) and (21), we obtain

E _v,λ (u ) −

E _v,λ (u) ≤

1 + λκ

| Ω | min E v,λ

u − u , and

E _v,λ is Lipschitz-continuous, with constant c, with c = λκ

| Ω | / min E _v,λ + O(1) when

| Ω | goes to ∞ . We conclude by applying (18).

By homogeneity of · − v ² and T V with respect to the dimension | Ω | of the images, it is not restrictive to assume that the median m goes to ∞ as the dimension | Ω | of images increases (by juxtaposing several versions of v, for instance). This means that as the dimension increases, π ( |

E v,λ − m | / | m | ≥ r) is bounded by 4 exp ( − ^c

²

₄ ^r _σ

²

^m

2²

), where c ² = λ ² κ ² | Ω |

min E v,λ + O(1)

is bounded because min E _v,λ is proportional to | Ω | , while m goes to + ∞ as | Ω | → + ∞ . Hence π ( |

E _v,λ − m | / | m | ≥ r) converges to 0, and for large domains Ω, almost any image u drawn from π satisﬁes E v,λ (u) ≈ m ² .

As E _v,λ is strictly convex and continuous, the lower set { u, E _v,λ (u) < m ² } is a bounded

convex set. It is not symmetric, and its boundary is not smooth as soon as v is not constant

(13)

and λ > 0, but it always contains ˆ u _ROF (because it reaches the lowest energy). Let us deﬁne the median energy set as the boundary of { u, E _v,λ (u) < m ² } . In high dimension, (19) means that almost all the mass of π is supported by a thin dilation of this median energy set.

Estimating the original image u by ˆ u _ROF does not take the geometry of this median energy set into consideration, its asymmetrical shape in particular. In high dimension, the mean of π approximately corresponds to the isobarycenter of the median energy set, which is likely to give interesting results in terms of image denoising performance.

2.4. Deﬁnition of the TV-LSE operator. Instead of using the risk associated to a Dirac cost (leading to a MAP estimate), we propose using a least square risk, which amounts to searching the image ˆ u(v) minimizing

(22) E u,v

u − u(v) ˆ ²

=

R

^Ω

E

μ

u − u(v) ˆ ² p(u, v) dv du.

The image reaching this minimum is the expectation of the posterior distribution (least square estimate (LSE)), that is,

(23) u ˆ _LSE := E (u | v) =

R

^Ω

p(u | v) u du, which, thanks to (8), can be rewritten in the form below.

Definition 2.4. The TV-LSE operator (denoted by S _LSE ) maps a discrete image v ∈ R ^Ω into the discrete image u ˆ _LSE deﬁned by

(24) u ˆ _LSE = S _LSE (v) =

R

^Ω

exp

− E v,λ (u) 2σ ²

· u du

R

^Ω

exp

− E v,λ (u) 2σ ²

du

,

where λ and σ are positive parameters and E _v,λ is the energy function deﬁned in (5).

In this paper we concentrate on TV-LSE, but we are conscious that minimizing risks other than the least square risk in (22) can lead to other interesting estimates (a median estimate for an L ¹ risk, for instance), though they seem to be more diﬃcult to analyze.

3. Properties of TV-LSE. In this section, we explore several theoretical aspects of the TV-LSE operator. We give geometric invariance properties and study the limiting operator when one of the parameters goes either to 0 or to + ∞ . Finally, we use Moreau’s theory of proximations (or proximity operators) [52, 62] to state ﬁner properties of TV-LSE, among which the fact that the staircasing eﬀect cannot occur in TV-LSE denoising.

3.1. Invariance properties. Here we give several geometric invariance properties of S _LSE (such as gray-level average preservation, translation, and symmetry invariance), all shared with ROF denoising [2], which are basic but essential requirements for image processing.

First we establish a facilitating formulation of the TV-LSE operator. It makes use of

integrals on the smaller space E v _¯ , on which the prior p _β is proper and compatible with (6).

(14)

Lemma 3.1. Let v ¯ be the average of v, and let E _¯ v be the space of images having average v ¯ (see (7)). Then (24) can be rewritten as

(25) ∀ v ∈ R ^Ω , S _LSE (v) =

E

¯v

exp

− E _v,λ (u) 2σ ²

u du

E

v¯

exp

− E _v,λ (u) 2σ ²

du

.

Proof. For a given v ∈ R ^Ω , let us make the change of variable u = ¯ u + z in (24), where

¯

u ∈ R is the mean of u and z is in E _¯ v . We have

S _LSE (v) =

z ∈E

¯v

¯

u ∈R (¯ u + z)e ⁻

^2σ¹²

^(¯ ^u ⁺ ^z ⁻ ^v

²

⁺ ^{λT V} ⁽ ^z ⁾⁾ d¯ u dz

z ∈E

¯v

¯

u ∈R e ⁻

^2σ¹²

^(¯ ^u ⁺ ^z ⁻ ^v

²

⁺ ^{λT V} ⁽ ^z ⁾⁾ d¯ u dz .

As both z and v have mean ¯ v, the quadratic term u ¯ + z − v ² equals | Ω | u ¯ ² + z − v ² . Hence ˆ

u _LSE becomes

S _LSE (v) =

z ∈E

v¯

e ⁻

^2σ¹²

⁽ ^z ⁻ ^v

²

⁺ ^{λT V} ⁽ ^z ⁾⁾

¯

u ∈R (¯ u + z)e ⁻

^|Ω|¯^u

2 2σ2

d¯ u dz

z ∈E

v¯

e ⁻

^2σ¹²

⁽ ^z ⁻ ^v

²

⁺ ^{λT V} ⁽ ^z ⁾⁾

¯

u ∈R e ⁻

^|Ω|^u^¯

2 2σ2

d¯ u dz

,

which, thanks to the properties of the normal distribution N (0, _|Ω| ^σ

²

), simpliﬁes into the desired expression (25).

Proposition 3.2 (average preservation). For any image u, let u ¯ = _|Ω| ¹

x∈Ω u(x) denote the mean gray level of u. Then for every v ∈ R ^Ω ,

S _LSE (v) = v.

Proof. Thanks to Lemma 3.1, S _LSE (v) is written as a weighted average of images all having mean ¯ v. Hence the result has mean ¯ v.

Proposition 3.3 (invariance by composition with a linear isometry). Let s : R ^Ω → R ^Ω be a linear isometry such that for all u ∈ R ^Ω , T V ◦ s(u) = T V (u) holds. Then

∀v ∈ R ^Ω , S _LSE ◦ s(v) = s ◦ S _LSE (v).

Proof. The change of variable u = s ⁻¹ (u) in the numerator and the denominator of (24) yields

S _LSE (s(v)) =

s(u )e ⁻

^s(u

)−s(v)2+λT V(s(u))

2σ2

du

e ⁻

^s(u

)−s(v)2+λT V(s(u))

2σ2

du

,

(15)

because s being an isometry implies ds(u ) = du . Furthermore, s is isometric, so we have s(u ) − s(v) ² = u − v ² and T V (s(u )) = T V (u ); thus

S _LSE ◦ s(v) =

s(u )e ⁻

^{u−v2+λT V}^(u

)

2σ2

du

e ⁻

^{u−v2+λT V}^(u

)

2σ2

du

= s(S _LSE (v)),

because s is linear.

A consequence of Proposition 3.3 is that the TV-LSE operator inherits many properties of the discrete scheme used for TV. For the classical ¹ - or ² -schemes used in (3) and (4), we obtain in particular the following invariances:

(1) translation invariance: S _LSE ◦ τ _t = τ _t ◦ S _LSE , where τ _t is the translation operator of vector t ∈ Z ² deﬁned by τ t ◦ u(x) = u(x − t) (Ω is assumed to be a torus);

(2) π/2-rotation invariance: if ρ is a π/2-rotation sending Ω onto itself, then S _LSE ◦ ρ = ρ ◦ S _LSE ;

(3) gray-level shift invariance: for all u ∈ R ^Ω , for all c ∈ R , S _LSE (u + c) = S _LSE (u) + c (this is not a direct consequence of Proposition 3.3, but the proof is easily adapted to the case s(u) = u + c).

These properties can help ﬁnd the structure of S _LSE (v) when v contains many redundancies and much structure. For example, if v is a constant image, then S _LSE (v) = v. Indeed, v is invariant under the translations of vectors (1, 0) and (0, 1), and so is S _LSE (v); moreover, the average gray level of S _LSE (v) is the same as v. Finally S _LSE (v) is a constant equal to v.

Another example is the checkerboard, deﬁned by

v _i,j = a if i + j is even, b if i + j is odd

for some constants a, b ∈ R . It is quite easy to see that v = S _LSE (v) is also a checkerboard (use the invariance by translations of vectors (1, 1) and (1, − 1)), even if it seems diﬃcult to get the associated gray levels a and b .

3.2. Asymptotics. Unlike ROF denoising (which depends on the single parameter λ), TV-LSE denoising depends on two distinct parameters λ and σ. The strict Bayesian point of view (see section 2.1) would rather encourage the focus on the single parameter β = λ/(2σ ² ) associated to the TV prior, while σ ² is set as the actual noise variance. In practice, it is more interesting to relax this point of view and to consider σ as a parameter, because, as we shall see later, in general the best denoising results are not obtained when σ ² equals the actual noise variance. A second reason is that when an image is corrupted with a noise of variance σ ² , the parameter λ in ROF achieving the best peak signal-to-noise ratio (PSNR) is not proportional to σ ² (as the identity λ = 2βσ ² would suggest) but behaves roughly like a linear function for large values of σ (in [30] a regression yields the estimate λ _opt (σ) ≈ 2.92σ ² /(1 + 1.03σ)). This is why we propose taking (λ, σ) as the parameters of the TV-LSE model. In section 4.3, the role of these parameters is further discussed and illustrated by numerical experiments.

Theorem 3.4 below sums up several asymptotic behaviors of ˆ u _LSE when one of the param-

eters goes to 0 or + ∞ .

(16)

Remark 1. By the change of variables v = v/σ, u = u/σ, λ = λ/σ, σ = 1, the trans- formed operator, with obvious notation, satisﬁes S _LSE ^λ,σ (v) = σS _LSE ^λ/σ, ¹ ( ^v _σ ).

Theorem 3.4. For a given image v ∈ R ^Ω , let us write ˆ u _LSE (λ, σ) = S _LSE (v) to recall the dependency of u ˆ _LSE with respect to λ and σ. For any ﬁxed λ > 0, we have

(i) u ˆ _LSE (λ, σ) −−−→

σ →0 u ˆ _ROF (λ), (ii) u ˆ _LSE (λ, σ) −−−−−→

σ →+∞ v, while for any σ > 0, we have

(iii) u ˆ _LSE (λ, σ) −−−→

λ →0 v, (iv) u ˆ _LSE (λ, σ) −−−−→

λ →+∞ v ¯ 1 ,

where ¯ v 1 is the constant image equal to the average of v. Moving λ such that β = λ/(2σ ² ) is kept constant, we have

(v) u ˆ _LSE (2βσ ² , σ) −−−→ _σ

→0 v, (vi) u ˆ _LSE (2βσ ² , σ) −−−−−→

σ →+∞ ¯ v 1 .

Proof. As E _v,λ is strongly convex (15), the probability distribution with density _Z ¹ exp( − ^E ₂

^v,λ

_σ

2

) (where Z is a normalizing constant depending on σ) weakly converges when σ → 0 to the Dirac distribution located at ˆ u _ROF (λ) = arg min _u E _v,λ (u), whose expectation is ˆ u _ROF (λ), which proves (i).

For (ii), let us consider the change of variable w = (u − v)/σ. Then

(26) u ˆ _LSE (λ, σ) = v +

R

^Ω

σwe ⁻

¹²

⁽ ^w

²

⁺

^λ^σ

^{T V} ⁽ ^w ⁺

^v^σ

⁾⁾ dw

R

^Ω

e ⁻

¹²

⁽ ^w

²

⁺

^λ^σ

^{T V} ⁽ ^w ⁺

^v^σ

⁾⁾ dw

= v + N D .

When σ → ∞ , the function inside the denominator D converges almost everywhere (a.e.) to e ⁻ ^w

²

^/ ² and is uniformly bounded by e ⁻ ^w

²

^/ ² ; thus thanks to Lebesgue’s dominated convergence theorem, D converges toward

e ⁻ ^w

²

^/ ² dw. For the numerator, notice that the mean value theorem applied to x → e ⁻ ^x implies the existence of a real number c _w,σ ∈ [0, ₂ ^λ _σ T V (w + _σ ^v )] such that

e ⁻

^2σ^λ

^{T V} ⁽ ^w ⁺

^σ^v

⁾ = 1 − λ 2σ T V

w + v

σ

e ⁻ ^c

^w,σ

.

Hence N can be split into N = σ

we ⁻

^w

2

dw − λ

2 we ⁻

^w

2 2

T V

w + v

σ

e ⁻ ^c

^w,σ

f

_σ

( w )

dw.

(17)

The ﬁrst integral is equal to zero. Concerning the second integral, when σ → ∞ , c _w,σ goes to 0, and as T V is Lipschitz-continuous, f _σ satisﬁes, for every σ ≥ 1,

f _σ (w) −−−→ _σ

→∞ we ⁻

^w

2

T V (w) a.e.

and

(27) f σ (w) ≤ we ⁻

^w²²

(T V (w) + αv),

where α is the Lipschitz-continuity coeﬃcient of T V . As the right-hand term of (27) belongs to L ¹ ( R ^Ω ) (as a function of w), again Lebesgue’s dominated convergence theorem applies and

f _σ (w) dw −−−→ _σ

→∞

we ⁻

^w

2

T V (w) dw = 0

because the function inside the integral is odd (since T V is even). Hence, N goes to 0 as σ tends to inﬁnity, which implies the convergence of ˆ u _LSE (λ, σ) toward v and proves (ii).

The proof of (iii) is a simple application of Lebesgue’s dominated convergence theorem on both integrals of (24).

For (iv), let us assume that v has zero mean (which does not reduce the generality of the proof, because of the gray-level shift invariance of section 3.1). Then, thanks to the average invariance property (Proposition 3.2), we simply have to show that ˆ u _LSE (λ, σ) converges to 0 when λ goes to ∞ . Using Lemma 3.1 and making the change of variable u = z/λ, we obtain

(28) u ˆ _LSE (λ, σ) = 1

λ

E

0

ze ⁻

^2σ¹²

⁽

^λ^z

⁻ ^v

²

⁺ ^{T V} ⁽ ^z ⁾⁾ dz

E

0

e ⁻

^2σ¹²

⁽

^λ^z

⁻ ^v

²

⁺ ^{T V} ⁽ ^z ⁾⁾ dz .

Now, for both functions g(z) = 1 and g(z) = z, we have

⎧ ⎪

⎨

⎪ ⎩

g(z)e ⁻

^2σ¹²

⁽

^λ^z

⁻ ^v

²

⁺ ^{T V} ⁽ ^z ⁾⁾ −−−→

λ →∞ g(z)e ⁻

^2σ¹²

⁽ ^v

²

⁺ ^{T V} ⁽ ^z ⁾⁾ ,

!! ! g(z)e ⁻

^2σ¹²

⁽

^λ^z

⁻ ^v

²

⁺ ^{T V} ⁽ ^z ⁾⁾ !! ! ≤ g(z) e ⁻

^2σ¹²

^{T V} ⁽ ^z ⁾ ≤ g(z) e ⁻

^2σ^C²

^z

¹

^,

where the last inequality comes from the fact that since T V is a norm on the ﬁnite-dimensional space E ₀ , there exists C > 0 such that for every z ∈ E ₀ , T V (z) ≥ Cz ₁ (this can be considered as a discrete version of the Poincar´ e inequality [1]). Thus thanks to Lebesgue’s dominated convergence theorem, each integral in (28) converges to a positive value when λ → + ∞ , and dividing by λ yields the desired limit ˆ u _LSE (λ, σ) → 0.

To prove (v) it is enough to apply Lebesgue’s dominated convergence theorem to the integrals that appear in (26).

For (vi), we use Lemma 3.1 and Lebesgue’s dominated convergence theorem as in the

proof of (iv) above.

(18)

3.3. TV-LSE as a proximity operator and several consequences. Proximity operators [52, 62] are mappings of a Hilbert space into itself, which extend the notion of projection onto a convex space; here we prove that the TV-LSE denoiser is a proximity operator on R ^Ω . From that, we deduce several stability and regularity properties of TV-LSE, and prove that it cannot create staircasing artifacts.

3.3.1. S _LSE is a proximity operator. Let us start by setting a frame of convex analysis (in ﬁnite dimension) around TV-LSE. Let n = | Ω | denote the total size of the considered images. An image is therefore an element of R ⁿ . Let Γ ₀ (R ⁿ ) be the space of convex, lower semi-continuous functions from R ⁿ to ( −∞ , + ∞ ] that are proper (that is, nonidentically equal to + ∞ ).

Definition 3.5 (see [52, 62]). Let f be an arbitrary function in Γ ₀ . The proximity operator associated to f is the mapping prox _f : R ⁿ → R ⁿ deﬁned by

prox _f (u) = arg min

v ∈R

ⁿ

1 2 v − u ² + f (v).

Notice that if f is the characteristic function associated to a closed, convex and nonempty set C (f = 0 on C and f = + ∞ elsewhere), prox _f simply reduces to the projection on C, and that prox

λ

2

T V corresponds to the ROF denoising operator.

Note 1 (see [52, 62]). Whenever f is in Γ ₀ , its convex conjugate f ^∗ (Legendre–Fenchel transform), deﬁned by

f ^∗ (v) = sup

u ∈R

ⁿ

u, v − f (u),

is in Γ ₀ ( R ⁿ ) and satisﬁes f ^∗∗ = f . Moreover, Moreau’s decomposition theorem states that given f ∈ Γ ₀ , every z ∈ R ⁿ can be decomposed into z = u + v, with u = prox _f (z) and v = prox _f

∗

(z).

Definition 3.6 (see [52, 62]). The primitive function associated to prox _f is the function Φ ∈ Γ ₀ ( R ⁿ ) deﬁned by

∀ z ∈ R ⁿ , Φ(z) = 1

2 v ² + f (u), where u = prox _f (z) and v = prox _f

∗

(z).

The function f = ₂ _σ ^λ

₂

T V is an element of Γ ₀ ( R ⁿ ) whose domain { u ∈ R ⁿ | f (u) < ∞} has a nonempty interior. In addition, f can be viewed as the potential of the (improper) prior distribution in our Bayesian framework whose p.d.f. is p = exp( − f ).

Letting G σ denote the Gaussian kernel u ∈ R ⁿ → _σ

_n

₍₂ ¹ _π ₎

_n/2

exp(− ₂ ^u _σ

2²

Posterior Expectation of the Total Variation model: Properties and Experiments

HAL Id: hal-00764175

https://hal.archives-ouvertes.fr/hal-00764175v3

Submitted on 27 Apr 2015

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Posterior Expectation of the Total Variation model:

Properties and Experiments

Cécile Louchet, Lionel Moisan

To cite this version:

Cécile Louchet, Lionel Moisan. Posterior Expectation of the Total Variation model: Properties and

Experiments. SIAM Journal on Imaging Sciences, Society for Industrial and Applied Mathematics,

2013, 6 (4), pp.2640-2684. �10.1137/120902276�. �hal-00764175v3�

Posterior Expectation of the Total Variation Model:

Properties and Experiments ∗ C´ ecile Louchet † and Lionel Moisan ‡

Key words. image denoising, total variation, Bayesian model, least square estimate, maximum a posteriori, estimation in high-dimensional spaces, proximity operators, staircasing eﬀect

AMS subject classiﬁcations. 68U10, 62H35 DOI. 10.1137/120902276

(1) E(u) = Au − v 2 + λT V (u).

This energy establishes a trade-off between data fidelity (the first term) and data regularity (TV), the relative weight of the latter being specified by the hyperparameter λ. In a continuous formulation, the TV of a gray-level image u : R 2 → R is defined by

T V (u) = inf

R

u div p; p ∈ C c ∞ ( R 2 , R 2 ), p ∞ ≤ 1

,

which boils down to

(2) T V (u) =

R

| Du | with | Du | =

∂u

∂x 2

+ ∂u

∂y 2

Received by the editors December 12, 2012; accepted for publication (in revised form) August 26, 2013; published electronically December 17, 2013.

http://www.siam.org/journals/siims/6-4/90227.html

MAPMO, Universit´ e d’Orl´ eans, CNRS, UMR 6628, 45067 Orl´ eans, France ([email protected]).

MAP5, Universit´ e Paris Descartes, CNRS UMR 8145, 75006 Paris, France ([email protected]).

2640

• The prior model based on TV (or the median pixel prior, its discrete counterpart) shows a natural connection with purely discrete Markov models [7, 9].

• If u is a binary image, that is, the characteristic function of some—regular enough—

This nice analytical framework around TV and bounded variation (BV) spaces [1]

makes it particularly ﬁtted to mathematical image analysis.

TV is simple and convenient, but it has its own drawbacks. First, textured image parts, which are very oscillatory in general, are highly penalized by TV and are often destroyed (or at least strongly attenuated) by the ROF model. Another well-known artifact is the staircasing eﬀect.

The staircasing eﬀect. A simple example of the staircasing eﬀect is obtained when a noisy

aﬃne one-dimensional signal is processed with the ROF model: the denoised signal is not

smooth but piecewise constant (see, e.g., the ROF-denoised signal that is displayed in the

middle plot of Figure 2). The staircase shape obtained in this case is quite general: as

ﬁrst noticed in [25] and then analyzed by [11, 15, 33, 57, 61] in diﬀerent frameworks, the

A way to avoid this staircasing artifact is to regularize the TV operator, as proved in [55].

Among variants that have been proposed [3, 7, 28, 56, 74], some introduce a parameter ε > 0 and replace the term | Du | in (2) by f ε ( | Du | ), where

f ε (t) =

ε 2 + t 2 , or f ε (t) = t 2 if | t | < ε,

ε 2 otherwise, or f ε (t) =

t

2 ε + ε 2 if | t | < ε,

All the above-mentioned variants require modiﬁcations of TV, or the addition of higher-

order terms in the variational model. One contribution of the present paper is to show that

the true TV prior is compatible with the avoidance of the staircasing artifact, provided that an

appropriate framework is used. Indeed, the ROF model can be reinterpreted in a statistical

(Bayesian) framework, where it exactly corresponds to the maximum a posteriori (MAP)

estimate, which means that the ROF model selects the image that maximizes the p.d.f. of a

certain distribution (the posterior distribution, associated to the TV prior and the data-ﬁdelity

term). Several authors [57, 75] pointed out that MAP estimates tend to be very singular with

regard to the prior distribution. The staircasing artifact can be considered one of these prior

statistics singularities.

Since the preliminary work [47] in 2008, several researchers have taken an interest in TV- LSE. Jalalzai and Chambolle [38], Lefkimmiatis, Bourquard, and Unser [45], and Salmon [66]

mention the TV-LSE model for its ability to naturally remove staircasing artifacts. In the

conclusion of [51], Mirebeau and Cohen propose a TV-LSE–like approach to denoising images

using anisotropic smoothness features, an interesting counterpart to TV, arguing that LSE

2. From ROF to TV-LSE: Bayes TV-based models.

2.1. ROF Bayesian interpretation. Let u : Ω → R be a discrete gray-level image defined on a finite rectangular domain Ω ⊂ Z 2 , which maps each pixel x = (x, y) ∈ Ω to the gray level u(x). The (discrete) total variation of the image u is defined by

(3) T V (u) =

x∈Ω

| Du(x) | ,

where | Du(x) | is a discrete scheme used to estimate the gradient norm of u at point x. In what follows we shall consider either the 1 - or the 2 -norm on R 2 , associated with the simplest possible approximation of the gradient vector, given by

(4) Du(x, y) =

u(x + 1, y) − u(x, y) u(x, y + 1) − u(x, y)

(5) E v,λ (u) = u − v 2 + λT V (u),

where · is the classical L 2 -norm on images and λ is a hyperparameter which controls the denoising level. This formulation as energy minimizer can be transposed in a Bayesian framework. Indeed, for β > 0 and μ ∈ R, let us consider the p.d.f.

(6) ∀u ∈ E μ , p β (u) = 1

Z β e − βT V ( u ) , where Z β =

E

Properties and Experiments ^∗ C´ ecile Louchet ^† and Lionel Moisan ^‡

(1) E(u) = Au − v ² + λT V (u).

This energy establishes a trade-off between data fidelity (the first term) and data regularity (TV), the relative weight of the latter being specified by the hyperparameter λ. In a continuous formulation, the TV of a gray-level image u : R ² → R is defined by

u div p; p ∈ C c ^∞ ( R ² , R ² ), p _∞ ≤ 1

∂x ₂

∂y ₂

Among variants that have been proposed [3, 7, 28, 56, 74], some introduce a parameter ε > 0 and replace the term | Du | in (2) by f _ε ( | Du | ), where

f _ε (t) =

ε ² + t ² , or f _ε (t) = t ² if | t | < ε,

ε ² otherwise, or f _ε (t) =

2 ε + ^ε ₂ if | t | < ε,

2.1. ROF Bayesian interpretation. Let u : Ω → R be a discrete gray-level image defined on a finite rectangular domain Ω ⊂ Z ² , which maps each pixel x = (x, y) ∈ Ω to the gray level u(x). The (discrete) total variation of the image u is defined by

where | Du(x) | is a discrete scheme used to estimate the gradient norm of u at point x. In what follows we shall consider either the ¹ - or the ² -norm on R ² , associated with the simplest possible approximation of the gradient vector, given by

(5) E _v,λ (u) = u − v ² + λT V (u),

where · is the classical L ² -norm on images and λ is a hyperparameter which controls the denoising level. This formulation as energy minimizer can be transposed in a Bayesian framework. Indeed, for β > 0 and μ ∈ R, let us consider the p.d.f.

Z β e ⁻ ^{βT V} ⁽ ^u ⁾ , where Z β =

e ⁻ ^{βT V} ⁽ ^u ⁾ du,

u ∈ R ^Ω , u ¯ = μ

Let us now suppose that instead of u, we observe the noisy image v = u + N , where N is a white Gaussian noise with zero mean and variance σ ² . Applying Bayes’ rule with prior distribution p β leads to the following posterior p.d.f.:

− E v,λ (u) 2σ ²

where λ = 2βσ ² and Z is a normalizing constant depending on v and λ only, ensuring that u → p(u | v) remains a p.d.f. on R ^Ω . Hence, the variational formulation (arg min _u E _v,λ (u)) is equivalent to a Bayesian formulation in terms of MAP

(9) u ˆ _ROF = arg max

of the posterior distribution. As ˆ u _ROF minimizes the energy E v,λ (u), it tends to concentrate certain exceptional structures which are cheap in energy, in particular, regions of constant intensity, leading to the well-known staircasing eﬀect (see Figure 1).

where G _i : R ^Ω → R ^m (1 ≤ i ≤ r) are linear operators (here, diﬀerences between neighboring pixels), and ϕ i : R ^m → R are piecewise smooth functions that are not diﬀerentiable in zero (here, all ϕ i correspond to the L ¹ -norm on R ² ).

Proposition 2.1 (see [55]). If S(v) = arg min _u u − v ² +λJ(u), where J (u) = _r

i =1 ϕ _i (G _i u) and, for all i, G _i is linear and ϕ _i is piecewise smooth and not diﬀerentiable in 0, then, under some technical assumptions, there exists an open neighborhood V of v for which

i | G _i (S(v )) = 0

i | G _i (S(v)) = 0

∃ u ⁺ (x) = u ⁻ (x), ∃ ν _u (x) ∈ R ² ,