Maximum Entropy Models from Phase Harmonic Covariances

(1)

HAL Id: hal-02982280

https://hal.archives-ouvertes.fr/hal-02982280v2

Preprint submitted on 29 Jan 2021

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Maximum Entropy Models from Phase Harmonic Covariances

Sixin Zhang, Stéphane Mallat

To cite this version:

Sixin Zhang, Stéphane Mallat. Maximum Entropy Models from Phase Harmonic Covariances. 2021.

�hal-02982280v2�

(2)

Maximum Entropy Models from Phase Harmonic Covariances

Sixin Zhang

^1,4

, St´ ephane Mallat

^1,2,3∗

1

ENS, PSL University, Paris, France

2

Coll` ege de France, Paris, France

3

Flatiron Institute, New York, USA

4

Center for Data Science, Peking University, Beijing, China January 29, 2021

Abstract

The covariance of a stationary process Xis diagonalized by a Fourier transform. It does not take into account the complex Fourier phase and defines Gaussian maximum entropy models. We introduce a general family of phase harmonic covariance moments, which rely on complex phases to capture non-Gaussian properties. They are defined as the covariance ofH(LX), whereb L is a complex linear operator andHb is a non-linear phase harmonic operator which multiplies the phase of each complex coefficient by integers. The operator Hb can also be calculated from rectifiers, which relatesH(LX)b to neural network coefficients. If L is a Fourier transform then the covariance is a sparse matrix whose non-zero off-diagonal coefficients capture dependencies between frequencies. These coefficients have similarities with high order moment, but smaller statistical variabilities becauseHbis Lipschitz. IfLis a complex wavelet transform then off-diagonal coefficients reveal dependencies across scales, which specify the geometry of local coherent structures. We introduce maximum entropy models conditioned by these wavelet phase harmonic covariances. The precision of these models is numerically evaluated to synthesize images of turbulent flows and other stationary processes.

Keywords— covariance, stationary process, phase, Fourier, wavelets, turbulence

1 Introduction

Many phenomena in physics, finance, signal processing and image analysis can be modeled as realizations of a non-Gaussian stationary process X. We often observe a single realization of dimensiondfrom which the model must be estimated. Models of stochastic processes may be defined as maximum entropy distributions conditioned by a predefined family of moments [1, 2]. To estimate accurately these moments from a

∗This work is supported by the ERC InvariantClass 320959. This work was supported by the PRAIRIE 3IA Institute of the French ANR-19-P3IA-0001 program.

(3)

single or few realizations, the number of moments should not be too large. The Fourier transform diagonalizes the covariance of stationary processes but a maximum entropy model conditioned by these second order moments is Gaussian. The main difficulty is to specify moments which characterize important classes of non-Gaussian stationary distributions.

Second order moments do not depend upon the complex phase of Fourier coefficients. We shall see that this complex phase plays a crucial role to specify non Gaussian properties of probability distributions, which can be captured with appropriate covariance moments. We define a phase harmonic covariances as the covariance of H(LX),b where L is a complex linear operator and Hb is a non-linear phase harmonic operator introduced in [3]. The operator Hb multiplies by integers the phase of each complex coefficient ofLX. The paper studies general properties of phase harmonic covariances whenLis a Fourier or a wavelet transform, to approximate non-Gaussian distributions through maximum entropy models.

IfLis a Fourier transform, Section 2 proves that the covariance ofH(LX) is a sparseb matrix whose non-zero off-diagonal coefficients reveal non-linear dependencies between different frequencies. Similarly to high order moments, the covariance of H(LX) cap-b tures this dependence across frequencies by cancelling relative phase fluctuations. A high order moment of a complex random variable z =|z|eîϕ(z) ∈ C computes an ex- ponentiation z^k = |z|^keîkϕ(z), which amplifies the variability of large |z| if k > 1. A phase harmonic operator Hb performs a similar transformation of the phase by calculating|z|eîkϕ(z). It also produces non-zero off-diagonal covariance coefficients without amplifying |z|. It is a Lipschitz operator as opposed to high order exponentiations, which reduces the variance of empirical statistical estimators.

The error of a maximum entropy model can be measured with a Kullback-Leibler divergence relatively to the true probability distribution ofX. Section 2.1 explains that this error can be reduced by improving the sparsity of H(LX). For large classes ofb random processesX, sparsity is improved by definingLas a complex wavelet transform as opposed to a Fourier transform. High order moments of wavelet coefficients have been used to characterize non-Gaussian multifractal properties of random processes and turbulent flows [4, 5]. We replace these high order moments by the covariance matrix of H(LX), to capture non-linear dependencies across scales, without sufferingb from high statistical variabilities. Dependencies across scales reveal the existence of coherent signal structures as opposed to Gaussian fluctuations. Section 3 studies the covariance of this wavelet phase harmonic operator.

Wavelet phase harmonics are related to the pioneer work of Grossmann and Morlet [6], who showed qualitatively that complex wavelet transforms have a phase whose variations across scales provide important information on the geometry of transient structures. In image processing, Portilla and Simoncelli [7] brought a new perspective by showing that one can synthesize non-Gaussian image textures from the covariance of the modulus of wavelet coefficients and by using their phase at different scales. We shall see that Portilla and Simoncelli covariances coefficients [7] are low order phase harmonic covariances H(LX), computed with a wavelet transformb L. Other remark-

(4)

able image texture syntheses have been obtained in [8], by computing the covariance of output coefficients of an one-layer convolutional neural network with a rectifier.

Section 3.2 proves that a non-linear phase harmonic operator Hb can be calculated with a rectifier followed by a Fourier transform along a phase variable. These neural network covariances are therefore equivalent to phase harmonic covariances, but for a convolutional operator L which is not a wavelet transform. The analysis of phase harmonic operators and the properties of covariance matrices gives a general mathe- matical framework to understand these algorithms. It guides the choice of the linear transform L, of the diagonal and off-diagonal covariance moments, to model specific classes of random processes.

Section 4 reviews the properties of maximum entropy models conditioned by empirical estimators of covariance moments. Sampling a maximum entropy distribution requires to use expensive algorithms that iterate over Gibbs samplers [9], which is not feasible over large size images. We thus rather use microcanonical models studied in [10]. Samples of microcanonical models are estimated by transporting an initial maximum entropy measure with a gradient descent which adjusts its moments. Section 5 defines low-dimensional foveal models conditioned by a small subset of wavelet phase harmonics covariances. These moments capture local dependencies within and across scales. We evaluate the numerical precision of foveal wavelet phase harmonic models, to approximate two-dimensional non-Gaussian processes including images of turbulent flows and multifractals. All calculations can be reproduced by a Python software available athttps://github.com/kymatio/phaseharmonics.

Notations: We write z^∗ the complex conjugate of z ∈ C, Y^∗ is the complex transpose of a matrixY and A^∗ is the adjoint of an operatorA. The covariance of two random variables A and B is written Cov(A, B) =E(A B^∗)−E(A)E(B)^∗. An inner product is writtenhx, yi=R

x(u)y(u)duin L²(R^d) andhx, yi=P

ux(u)y(u) inR^d. The cardinal of a setSis|S|. The norm ofxinL²(R^d) orR^dis writtenkxk=hx, xi^1/2.

2 Fourier Phase Harmonic Representation

Section 2.1 introduces general properties of maximum entropy models conditioned by covariance coefficients of a non-linear representation R(X) over a graph. Section 2.2 then shows that the Fourier coefficients of a stationary process are uncorrelated because of phase fluctuations. It motivates the use of higher order moments to define a representation which can cancel random phase fluctuations and capture dependence across frequencies. Section 2.3 reviews the properties of phase harmonic operators Hb introduced in [3]. IfL is a Fourier transform, Section 2.4 shows that R(X) =H(LX)b is a bi-Lipschitz representation, whose covariance captures non-Gaussian properties through dependencies across frequencies, similarly to high order moments.

(5)

2.1 Maximum Entropy Covariance Graph Models

We introduce maximum entropy models of a stationary random vectorX, conditioned by covariance coefficients of a representationR(X) which is also a random vector:

KR = Cov(R(X),R(X)) =E

(R(X)−MR)(R(X)−MR)^∗

withMR =E(R(X)). We suppose that X(u) is defined foru in a cube Λd⊂Z^r with dgrid points, for example in [1, d^1/r]^r. Ifr = 2 then each realization is an image of d pixels. This section reviews general properties of maximum entropy models defined on a graph, which take advantage of known symmetries of the distribution ofX.

Maximum entropy on a graph with symmetries Let us write R(X) = {R_v(X)}_v∈V, where R_v(X) ∈ C and V is a finite set of vertices in a graph that we shall define. The representationR(X) is a complex-valued vector inC^|V^|.

We introduce a maximum entropy model conditioned by the covariance between vertices

KR(v, v⁰) = Cov(R_v(X),R_v⁰(X)),

wherev⁰ is in a neighborhoodN_v ⊂V ofv. We suppose that v∈ N_v. If v⁰ ∈ N_v then v∈ N_v⁰ so it defines a reflexive and symmetric undirected graph (V, E), where the set of edges relates all neighborsE={(v, v⁰) :v∈V , v⁰∈ N_v}. The edge weights are the covariancesKR(v, v⁰) overE. In statistical physics, (R_v(x)−MR(v))(R_v⁰(x)−MR(v⁰))^∗ is called the interaction potential of (v, v⁰)∈E, where the meanMRis considered here as a predefined constant vector.

Let p be the probability density ofX. We can reduce the constraints of the maximum entropy model if we also know a finite groupGof symmetries of p. We consider linear unitary symmetriesg fromR^d toR^d, which satisfy p(g.x) =p(x) for allx∈R^d. The stationarity ofX in Λdmeans thatpis invariant to periodic translations, that we write g.x(u) = x(u−g), so G includes all translations. If p is invariant to rotations thenGalso includes rotations. Sincep is invariant to the action ofG, its covariance is also invariant to the action of anyg∈G

Cov(R_v(g.X),R_v⁰(g.X)) = Cov(R_v(X),R_v⁰(X)).

Let |G| be the total number of symmetries. These |G| conditions will be included in the maximum entropy model.

The entropy of a density ˜p on R^dis H(˜p) =−

Z

˜

p(x) log ˜p(x)dx.

A maximum entropy macrocanonical modelXe conditioned by the covarianceKR over E and by the symmetry group G has a probability density ˜p which maximizes H(˜p) and satisfies covariance moment conditions for all (v, v⁰, g)∈E×G

Z

(R_v(g.x)−MR(v)) (R_v⁰(g.x)−MR(v⁰))^∗p(x)˜ dx=KR(v, v⁰). (1)

(6)

If there exist Lagrange multipliers βv,v⁰ ∈C such that

˜

p(x) =Z⁻¹ exp

− 1

|G|

X

(v,v⁰,g)∈E×G

β_v,v⁰(R_v(g.x)−M_R(v)) (R_v⁰(g.x)−M_R(v⁰))^∗ (2)

is a probability distribution which satisfies each equality constraint (1), then it is the unique solution to the maximum-entropy problem [2]. The sum in the exponential is the Gibbs energy, and Z is the partition function. Since KR(v, v⁰) = K_R^∗(v⁰, v) we have β_v,v⁰ =β_v^∗0,v. We verify that for all g ∈G, ˜p(g.x) = ˜p(x) so G is also a group of symmetries of ˜p. IfGincludes all translations then ˜pis stationary.

In the particular case where allR_v(x) are linear operators then the Gibbs energy is bilinear and ˜p(x) is thus a Gaussian distribution. Appendix A shows that the Lagrange multipliers can then be computed efficiently [2]. If some of the R_v(x) are non-linear then computing the Lagrange multipliers is computationally very expensive when the dimensiondof X and the total number |E|of moments is large.

Covariance estimation from symmetries The covariance estimation from a single realization ¯x ofX is calculated with an average over the symmetries ofG. The orbit of the action of G on ¯x is the set of {g.¯x}_g∈G. The empirical estimation of MR(v) =E(R_v(X)) is computed as an empirical average over this orbit

MfR= 1

|G|

X

g∈G

R(g.¯x). (3)

Similarly, KR is estimated with an average on the same orbit KeR¯x = 1

|G|

X

g∈G

R(g.¯x)−MfR R(g.¯x)−MfR

∗

. (4)

Summing over all transformations of ¯x by g ∈ G is called “data augmentation” in machine learning. The larger |G| the more accurate the estimation. The maximum entropy model is estimated from a single realization ¯x by replacing the meanMR and covarianceKR in (1) by their estimationMfR and KeR¯x.

The existence of symmetries also reduces the number of covariances that must to be estimated. We defineEGas a minimum set of edges (v, v⁰) such that for any (v, v⁰)∈E there exists (v₁, v₁⁰, g) ∈E_G×G such that R_v(x) = R_v₁(g.x) and R_v⁰(x) = R_v⁰

1(g.x).

Since pis invariant to the action ofG, its covariance KR is also invariant:

KR(v, v⁰) = Cov(R_v₁(g.X),R_v⁰

1(g.X)) =KR(v₁, v⁰₁).

It is therefore sufficient to estimate covariance coefficients indexed byE_G to specify all covariances indexed by E. The set E_G is interpreted as a set of sufficient statistics.

Given a realization ¯xof dimensiond, to control the overall covariance estimation errors we must insure that|E_G| d.

(7)

Bi-Lipschitz continuity The estimation of covariance coefficients may have a large variance in the presence of rare outliers. These outliers induce a large variability in the empirical sum (4) depending upon the realization ¯x ofX. To avoid amplifying these outliers, we impose that Ris bi-Lipschitz, so that the variations of R(X) are of the same order as the variations ofX. It also insures that Ris an invertible operator.

The representation R is bi-Lipschitz if there exists AR >0 and BR such that for all (x, x⁰)∈R^2d

ARkx−x⁰k² ≤ kR(x)− R(x⁰)k² ≤BRkx−x⁰k². (5) For any random vector Z, we write σ²(Z) = E(kZ−E(Z)k²), which is the trace of its covariance. The following proposition proves that the variance ofR(X) andX have the same order of magnitude.

Proposition 2.1. If Ris bi-Lipschitz then

ARσ²(X)≤σ²(R(X))≤BRσ²(X). (6) Proof: If X⁰ and X are two independent random vectors having same probability distribution then the bi-Lipschitz bounds (5) ofR imply that

ARE(kX−X⁰k²)≤E(kR(X)− R(X⁰)k²)≤BHE(kX−X⁰k²). (7) IfY and Y⁰ are two i.i.d random variables then we verify that

E(|Y −Y⁰|²) = 2E(|Y −E(Y)|²).

where the first expected value is taken relatively to the joint distribution ofY andY⁰. It results that E(kX−E(X)k²) = 2⁻¹E(kX−X⁰k²) and E(kR(X)−E(R(X))k²) = 2⁻¹E(kR(X)− R(X⁰))k²). Inserting these equalities in (7) proves (6).

Maximum entropy reduction with sparsity Zhu, Wu and Mumford [11] have proposed to optimize maximum entropy parameterized models by minimizing the resulting maximum entropy. Indeed, if X has a probability density p then model error can be evaluated with a Kullback-Leibler divergence

D_KL(p||˜p) = Z

p(x) logp(x)

˜ p(x)dx.

We verify that R

p(x) log ˜p(x)dx = R

˜

p(x) log ˜p(x)dx by inserting (2) and by using the equalities (1). SinceH(p) =R

p(x) logp(x)dx, it results that the Kullback-Leibler divergence is equal to the excess of entropy of the maximum entropy model:

DKL(p||˜p) =H(˜p)−H(p)≥0. (8) We thus reduce the model error by reducing the maximum entropy H(˜p) to obtain a lower entropy model. The minimum H(˜p) = H(p) is reached if and only if ˜p = p almost-surely.

(8)

In our case, the model depends upon the choice of R and of the edges E ⊂ V² of the covariance graph. Optimizing R by calculating H(˜p) is computationally too expensive. However, we explain below that we can minimize an upper bound ofH(˜p) by finding a representation such thatR(X) is as sparse as possible, which gives a partial control on the maximum entropy. Let Xe be a maximum entropy model of density ˜p.

If R has a stable inverse, which we guarantee with the bi-Lipschitz condition, then an upper bound ofH(˜p) can be computed from the entropy of each marginal R_v(X).e Such an entropy is small ifR_v(X) is sparse, because it then has a narrow probabilitye density centered at 0. To impose this sparsity we must find a representation R such that R_v(X) is sparse. This sparsity must also be captured by diagonal covariance coefficients KR(v, v) so that the maximum entropy model X, conditioned by thesee moments retains this sparsity. This is the case for Fourier and wavelet phase harmonic representations studied in Sections 2.4 and 3.2. For non-Gaussian processes, it amounts to represent “coherent structures” with as few non-zero coefficients as possible.

Increasing the number|E|of moment conditions can further reduceH(˜p) but it also increases the statistical error when estimating these moments from a single realization of X. The choice of E is thus a trade-off between the model error (bias) and the estimation error (variance).

2.2 Fourier Phase and High Order Moments

The covariance of a stationary random vector is diagonalized by the discrete Fourier transform, because of random phase fluctuations. This suggests defining a covariance representation in a Fourier basis. For non-Gaussian processes, the dependence of Fourier coefficients is partly captured by high order moments which cancel random phase fluctuations.

For R = Id, since X(u) for u ∈ Λ_d is stationary, KR(u, u⁰) = Cov(X(u), X(u⁰)) only depends on u−u⁰. We write |u|the norm ofu ∈Λ_d and u⁰.u the inner product.

A low-dimensional maximum entropy model is constructed on V = Λd by restricting covariances to neighborhoods of fixed radius: N_v = {v⁰ : |v−v⁰| ≤ c}. The radius c is chosen to be large enough to capture long range correlations. SinceX is stationary, the symmetry group includes translations and we can thus defineEGby settingv= 0.

The maximum entropy model then defines a Gauss-Markov random vector [2] specified by covariances in a small neighborhood. This model does not capture non-Gaussian properties. If Λdis an r-dimensional grid, then the model size is |E_G|=O(c^r). It may be large if the process has long-range spatial correlations.

To better understand how to capture non-Gaussian properties, we study these covariance coefficients in a Fourier basis. We writexb=F_uxthe discrete Fourier transform of xover Λ_d:

x(ω) =b X

u∈Λ_d

x(u)e^−iω.u for ω= 2πd^−1/rm with m∈Λ_d. (9) The Fourier representationR=F_u is indexed byω∈V = 2πd^−1/rΛ_d. The covariance

(9)

for (ω, ω⁰)∈V² is

KR(ω, ω⁰) = Cov(X(ω),b X(ωb ⁰)).

If ω 6= 0 then E(X(ω)) = 0.b If ω 6= ω⁰ then KR(ω, ω⁰) = 0 because of random phase fluctuations. Indeed, translating X(u) by any τ ∈G_q multipliesX(ω) byb e^−iτ.ω. Since X is stationary, this translation is a symmetry which does not modify Cov(X(ω),b X(ωb ⁰)). It results that

Cov(X(ω),b X(ωb ⁰)) =e^i(ω−ω⁰^).τCov(X(ω),b X(ωb ⁰)). (10) Since this is true for any τ ∈Λ_d it implies that

Cov(X(ω),b X(ωb ⁰)) = 0 if ω6=ω⁰. (11) Since the discrete Fourier transform is periodic beyond [0,2π]^r, any equality or non- equality between frequencies must be understood modulo 2π along the r directions.

If X is Gaussian then Xb is also Gaussian so this non-correlation implies that X(ω)b and X(ωb ⁰) are independent. However, ifX is non-Gaussian thenX(ω) andb X(ωb ⁰) are typically not independent.

Phase cancellation with higher order moments To capture the dependencies of Fourier coefficients across frequencies, one can use high order moments [12]. For any k∈Z, similarly to (10), translating X by τ ∈Λ_d yields

Cov(X(ω)b ^k,X(ωb ⁰)^k⁰) =e^i(kω−k⁰^ω⁰^).τCov(X(ω)b ^k,X(ωb ⁰)^k⁰). (12) A high-order Fourier representationR_v(X) =X(ω)b ^kis indexed byv = (ω, k)∈V with 0≤k≤kmax. It results from (12) that

KR(v, v⁰) = Cov(X(ω)b ^k,X(ωb ⁰)^k⁰) = 0 if kω 6=k⁰ω⁰ . (13) Ifkω =k⁰ω⁰ and X is not Gaussian then this covariance is typically non-zero because the phase variations of X(ωb ⁰)^k⁰^∗ cancel the phase variations of X(ω)b ^k. This is also the key idea behind the use of bi-spectrum moments [13]. By adjusting (k, k⁰) these moments provide some dependency information between X(ω) andb X(ωb ⁰) forω 6=ω⁰.

For example, let X(u) =x(u−S) be a random shift vector, where x(u) is a fixed signal supported in Λ_d andS is a random periodic shift which is uniformly distributed in Λ_d. It is a stationary process whose Fourier coefficients have a random phase:

X(ω) =b x(ω)b e^−iSω. In this case, ifkω =k⁰ω⁰ then

Cov(X(ω)b ^k,X(ωb ⁰)^k⁰) =bx(ω)^kx(ωb ⁰)^∗k.

It is non-zero at frequencies where bx does not vanish. This shows that covariances of the high order Fourier exponents R(X) can capture the dependence of Fourier coefficients at different frequencies. However, this high order Fourier representation is not Lipschitz. Indeed, whenk >1, the exponent k amplifies the variability of each random variables X(ω), so estimators of covariance coefficients have a large variance.b

(10)

2.3 Phase Harmonic Operator

We review the properties of phase harmonic operators Hb introduced in [3], which replace high order exponents by an exponent on the phase only. These operators are bi-Lipschitz and can be computed by applying a rectifier followed by a Fourier transform on a phase variable. If L is a Fourier tranform, next section shows that H(LX) can have non-zero covariance coefficients across frequencies because of relativeb phase cancellations.

If z =|z|e^iϕ(z) ∈C then z^k =|z|^ke^ikϕ(z). If k > 1 then |z|^k amplifies large values of |z| and hence the variance of a random variable z. A phase harmonic computes a power k∈Zof the phase only:

[z]^k=|z|e^ikϕ(z) .

It preserves the modulus: |[z]^k|=|z|. A phase harmonic operator computes all such phase harmonics

H(z) =b {bh(k) [z]^k}_k∈_Z , (14) with finite energy weights kbhk² = P

k∈Z|bh(k)|² < ∞. The harmonic weights bh(k) amplify or eliminate different phase harmonics.

Phase windows and phase Fourier transform We show thatHbis the Fourier transform along a phase variable of a non-linear phase windowing operator, which can be computed with rectifiers used in neural networks.

Let us writebh=F_α(h) the Fourier transform along phases:

bh(k) = 1 2π

Z 2π 0

h(α)e^−ikαdα. (15)

Observe thatF_αh(ϕ(z) +α) ={ˆh(k)e^ikϕ(z)}_k∈_Z. It results that H(z) =b {bh(k) [z]^k}_k∈_Z =F_αH(z) with

H(z) ={|z|h(ϕ(z) +α)}_α∈[0,2π]. (16) The operatorHtranslates the phaseϕ(z) ofz byα∈[0,2π] and applies a 2π periodic phase windowing h(α).

A rectifier ρ(a) = max(a,0) is an important example of non-linearity which acts as a phase windowing when applied on the real part ofz. Indeed

ρ(Real(z)) =|z|ρ(cosϕ(z)), so

{ρ(Real(e^iαz))}_α∈[0,2π]=H(z) with h(α) =ρ(cosα). (17)

(11)

The rectifier phase window ρ(cosα) is positive and supported in [−π/2, π/2]. The corresponding harmonic weights are computed in [3] with the Fourier integral (15):

bh(k) =







(−1)^k/2+1

π(k²−1) ifkis even

1

4 ifk=±1

0 if|k|>1 is odd

. (18)

Their decay is slow becauseh(α) has discontinuous derivatives at ±π/2.

Bi-Lipschitz continuity We mentioned that polynomial exponents amplify the variability of random variables around their mean. Indeed if (z, z⁰) ∈ C² then |z^k− z^0k|/|z−z⁰| may be arbitrarily large if k > 1. On the contrary, a phase harmonic preserves the modulus which provides a bound on such amplification. It is proved in [3] that it is Lipschitz continuous

∀(z, z⁰)∈C² , |[z]^k−[z⁰]^k| ≤max(|k|,1)|z−z⁰|. (19) The distance|z−z⁰|is therefore amplified by at most |k|.

This Lipschitz continuity is extended to phase windowed Fourier transforms, which are shown to be bi-Lipschitz. The Fourier transform preserves norms and distances.

Calculating the norm of (14) gives

kH(z)k=kH(z)kb =khk²|z|², withkhk² =P

k|bh(k)|². Following [3], we also derive from (19) that a phase windowed Fourier transform is bi-Lipschitz over complex numbers (z, z⁰)∈C²

AH|z−z⁰|² ≤ kH(z)b −H(z)b ⁰k²≤BH|z−z⁰|² , (20) with AH = 2|bh(1)|² and BH = |bh(0)|² +P

k∈Zk²|bh(k)|². The upper bound BH is derived from (19) and the lower bound is obtained isolating the terms corresponding tok =±1. For the rectifier filter (18), we get AH = 1/8 and BH = 1/4 + 1/π². The following theorem gives a tight upper Lipschitz constant for the rectifier filter, proved in Appendix B.

Theorem 2.1. The rectifier phase windowh(α) = max(cosα,0)has a Lipschitz upper- bound BH= 1/4.

2.4 Phase Harmonics of Fourier Coefficients

IfL=F_u computes the Fourier transform of X(u) then the resulting phase harmonic representation is

R(X) =H(Fb _uX) =n

bh(k) [X(ω)]b ^ko

k∈Z,ω∈R

.

It is indexed byv= (ω, k). We show that the covariance ofH(Fb _uX) is sparse but not necessarily diagonal. Section 2.2 explains that ifX is stationary thenX(ω) andb X(ωb ⁰)

(12)

are not correlated if ω 6= ω⁰ but X(ω)b ^k and X(ωb ⁰)^k⁰ may be correlated if X is not Gaussian. Similarly we show that the covariance of [X(ω)]b ^kand [X(ωb ⁰)]^k⁰ may become non-zero forω 6=ω⁰, which reveals non-Gaussian properties.

Since Hb is bi-Lipschitz and the Fourier transform is unitary up to a factor d, it results that R=HFb _u is bi-Lipschitz with Lipschitz constants AR =dAH and BR = dBH. Proposition 2.1 implies that

dAHσ²(X)≤σ²(H(X))b ≤dBHσ²(X). (21) Sparse phase harmonic covariance Phase harmonic covariance coefficients are

KHFb u(ω, k, ω⁰, k⁰) =bh(k)bh(k⁰)^∗Cov([X(ω)]b ^k,[X(ωb ⁰)]^k⁰).

Ifbhis compactly supported in [0, k_max], sinceX is of dimensiond, the covarianceK

HFb

has (kmax+ 1)²d² coefficients. The following proposition proves that coefficients of KHFb are mostly zero and it is diagonal ifXis Gaussian. Diagonal values are specified.

Theorem 2.2. If X is real stationary then

Cov([X(ω)]b ^k,[X(ωb ⁰)]^k⁰) = 0 if kω6=k⁰ω⁰. (22) Along the diagonal, if ω6= 0 and k6= 0 or if ω = 0 and k is odd then

Cov([X(ω)]b ^k,[X(ω)]b ^k) = Cov(X(ω),b X(ω))b . (23) If ω= 0 andk is even then

Cov([X(0)]b ^k,[X(0)]b ^k) = Cov(|X(0)|,b |X(0)|)b . (24) For all ω6= 0

Cov(|X(ω)|,b |X(ω)|)b

Cov(X(ω),b X(ω))b = 1−E(|X(ω)|)b ²

E(|X(ω)|b ²). (25) If X is Gaussian thenK

HFb is diagonal and E(|X(ω)|)b ²/E(|X(ω)|b ²) =π/4 if ω6= 0.

The proof is in Appendix C. All equalities or non-equalities on frequencies must be understood modulo 2π in dimension r. Property (22) proves that K

HFb _u is highly sparse. If X is not Gaussian then off-diagonal coefficients are typically not zero when kω=k⁰ω⁰. Fork=k⁰ = 0, we getd² modulus covariances Cov(|X(ω)|,b |X(ωb ⁰)|) which are a priori non-zero for all (ω, ω⁰). It provides no information on the phase of X(ω)b and X(ωb ⁰). Phase correlations are captured whenk6= 0. Ifω 6= 0, it results from (22) that Cov([X(ω)]b ^k,[X(ωb ⁰)]^k⁰) 6= 0 only if ω⁰ is colinear with ω. For a grid Λd of size d, there are O(d^1/r) such frequencies ω⁰. The total number of non-zero off-diagonal covariance coefficients is thus at mostd²+O(d^1+1/r).

For a non-Gaussian process, off-diagonal coefficients with kω =k⁰ω⁰ are typically non-zero. This is illustrated with a random shift processX(u) =x(u−S) where S is uniformly distributed in Λd. Ifkω =k⁰ω⁰ then one can verify that

Cov(X(ω)b ^k,X(ωb ⁰)^k⁰) = [x(ω)]b ^k[bx(ω⁰)]^−k⁰ if ω6= 0, ω⁰ 6= 0.

(13)

Ifx(ω) does not vanish then all these coefficients are non-zero.b

Along the diagonal, (23) and (24) prove that all coefficients are either equal to Cov(X(ω),b X(ω)) or to Cov(|b X(ω)|,b |X(ω)|). Property (25) also proves that the ratiob between these covariance values depend upon the ratio E(|X(ω)|)b ²/E(|X(ω)|b ²). This ratio measures the sparsity of multiple realizations of the random variable X(ω). In-b deed, expected values are estimated by normalized sums, so ratio of expected values can be interpreted as ratios of `1 over `2 norms, which is a measure or sparsity. For all Gaussian random variables, this ratio is equal to π/4. If the ratio is smaller than π/4 thenX(ω) has a high probability to be relatively small and it has large amplitudeb outliers of low probability. The probability distribution ofX(ω) is then highly peakedb in zero with a long tail. The smaller this ratio the lower the entropy of this marginal probability distribution.

We saw in (8) that a maximum entropy model Xe introduces errors if its entropy is much larger than the entropy of X. The entropy H(˜p) is bounded by the sum of the entropy of each random variable X(ω), which is conditioned by Cov(b X(ω),b X(ω) andb Cov(|X(ω)|,b |X(ω)|). It gives an upper bound on the entropyb H(˜p), which decreases if the sparsity of X(ω) increases. However, ifb X(u) has sharp localized transitions then its Fourier coefficients have typically a large amplitude across most frequencies and are not sparse. On the contrary, wavelet coefficients may be sparse. In this case, we shall capture this sparsity and thus compute maximum entropy models of lower entropy, by replacing the Fourier transform by a wavelet transform.

3 Wavelet Phase Harmonics

To model random vectorsX whose realizations include singularities and sharp transitions, we define a phase harmonic representationR=HbLwhereL is a wavelet transform as opposed to a Fourier transform. Indeed, wavelet transform provides sparse representations of such signals and Section 2.1 explains that imposing sparsity constraints typically improves the accuracy of maximum entropy models. We concentrate on two-dimensional image applications. The same approach applies to signals of any dimension, with an appropriate wavelet transform. Section 3.1 reviews the construction of complex steerable wavelet frames and the properties of their covariance matrices.

Section 3.2 studies the covariance of wavelet phase harmonics.

3.1 Steerable Wavelet Frame Covariance

Complex steerable wavelet frames were introduced in [14] and are further studied in [15], to easily compute wavelet coefficients of rotated images. For simplicity, we begin by introducing wavelets as a localized functionsψ(u) for u ∈R², with R

ψ(u)du= 0.

Complex steerable wavelets have a Fourier ψ(ω) concentrated over one-half of theb Fourier domain. We impose that ψ(−u) = ψ^∗(u) so that ψ(ω) is real. This Fourierb transform is centered at a frequencyξ ∈R² and is non negligible forω ∈R² such that

|ω−ξ| ≤C⁰|ξ|for someC⁰ >0. Figure 1 gives an example of such a wavelet, which is

(14)

(a)

-4 -2 0 2 4

-0.02 -0.01 0 0.01 0.02

(b)

-2 - 0 2

0 0.1 0.2 0.3 0.4 0.5

Figure 1: (a): Real-part of steerable bump wavelet ψ(u). (b): Fourier transform ψ(ω).b

specified in Appendix D.

Let r_` be a rotation by an angle 2`π/Q. Multiscale steerable wavelets are derived from ψ with dilations by 2^j for j ∈ Z, and rotations over Q angles θ = 2`π/Q for 0≤` < Q

ψ_λ(u) = 2^−jψ(2^−jr−`u) ⇒ ψb_λ(ω) = 2^jψ(2b ^jr_`ω) with λ= 2^−jr−`ξ. (26) Sinceψ(ω) is non negligible forb |ω−ξ| ≤C|ξ|it results thatψb_λ(ω) is centered atλand non negligible for|ω−λ| ≤C|λ|. In space,ψ_λ(u) is non-negligible for|u| ≤C⁰|λ|⁻¹. We limit the scale 2^j to a maximum 2^J. The lowest frequencies are captured by a wavelet centered at λ= 0. It is computed by dilating a functionφ(u) such thatR

φ(u)du= 1:

ψ₀(u) = 2^−Jφ(2^−Ju) ⇒ ψb₀(ω) = 2^Jφ(2b ^Jω). (27) A wavelet frame is constructed by translating each ψ_λ forλ6= 0 byu= 2^j−1n and ψ₀ by u = 2^J−1n for all n∈Z². It introduces a factor 2 oversampling relatively to a wavelet orthonormal basis [16], which creates some redundancy. The wavelet transform of x∈L²(R²) is defined by

Wx={x ? ψ_λ(u)}_(λ,u)∈Γ

where Γ is a frequency-space index set with (λ, u) = (2^−jr−`ξ,2^j−1n) for 1 ≤j ≤ J, 0≤` < Q,n∈Z, or (λ, u) = (0,2^J−1n).

Under appropriate conditions onψ, the wavelet family{ψ_λ(·−u)}_(λ,u)∈Γis a frame of L²(R²) [15]. This means that there exists 0< AW ≤BW such that for anyx∈L²(R²) AWkxk²≤ kWxk² ≤BWkxk² (28) withkWxk² =P

(λ,u)∈Γ|x ? ψ_λ(u)|².

The wavelet transform can be redefined over discrete imagesxofdpixels supported in a two-dimensional square grid Λd, uniformly sampled at intervals 1 in [1, d^1/2]². It requires to discretize and modify “boundary wavelets” whose supports intersect image boundaries. This can be done over steerable wavelets [14, 15], while preserving the frame constants AW and BW. The resulting wavelet ψλ(· −u) are supported in Λd.

(15)

They are still indexed by (λ, u) ∈Γ. If λ=r−`ξ 6= 0 for 1≤j ≤ J, 0 ≤` < Q then P

u∈Λ_dψ_λ(u) = 0. If λ= 0 then P

u∈Λ_dψ0(u) = 2^J. Since J ≤(log₂d)/2, there are at mostQ(log₂d)/2 + 1 different frequency channelsλ. Forλ=r−`ξ6= 0,ψ_λ is translated by u = 2^j−1n ∈ Λd which yields 2^−j+1d wavelet coefficients. The total number of wavelets coefficients is about 4Qd/3 if J = (log₂d)/2.

The mother wavelet ψis chosen in order to obtain a sparse wavelet representation of realizations ofX, with few large amplitude wavelet coefficients. This sparsity high- lights non-Gaussian properties. Figure 2 displays the modulus and phase of wavelet coefficients of the vorticity field of a turbulent flow. This flow is obtained by running the 2D Navier Stokes equation with periodic boundary conditions, initialized with a random Gaussian field [17]. After a fixed time, it defines a stationary but non-Gaussian random process. For each scale and orientation, large amplitude modulus coefficients are located at positions where the image has sharp transitions, and the phase depends upon the position of these sharp transitions.

Wavelet covariance A wavelet transform defines a linear representation R= W indexed by v = (λ, u). Similarly to Fourier coefficients, we show that wavelet coefficients have a covariance which nearly vanish at different frequencies.

Similarly to Fourier coefficients, wavelet coefficients have a zero mean at non-zero frequencies. If λ 6= 0 then E(X ? ψλ(u)) = 0 because P

uψλ(u) = 0. If λ = 0 then P

uψ₀(u) = 2^J so E(X ? ψ₀(u)) = 2^JE(X(u)). The covariance at v = (λ, u) and v⁰ = (λ⁰, u⁰) is

KW(v, v⁰) = Cov(X ? ψλ(u), X ? ψ_λ⁰(u⁰)).

It depends on u−u⁰ because X is stationary. Let ψb_λ(ω) be the discrete Fourier transform ofψλ(u) defined in (9). Since wavelet coefficients are convolutions, covariance values can be rewritten from the power spectrumK(ω) =b ¹_dCov(X(ω),b X(ω)) ofb X

KW(v, v⁰) = X

ω∈Λ_d

K(ω)b ψbλ⁰(ω)ψb^∗_λ(ω)e^i(u−u⁰^).ω. (29) It results thatKW(v, v⁰) = 0 ifψb_λ(ω)ψb_λ⁰(ω) = 0 for allω. Sinceψb_λ(ω) is non-negligible only if |ω−λ| ≤C|λ|, the covarianceKW(v, v⁰) is non-negligible for λ6=λ⁰ only if

|λ−λ⁰|

|λ|+|λ⁰| ≤C . (30)

It shows that similarly to Fourier coefficients, wavelet covariances are negligible across frequencies which are sufficiently far apart.

Maximum entropy wavelet graph model A maximum entropy model conditioned by wavelet covariances is Gaussian because the wavelet transform is linear.

Figure 3(a) gives a realization of a stationary turbulent flow X. A low-dimensional maximum entropy model is defined on a graph of covariance coefficients KW(v, v⁰) for v⁰ = (λ⁰, u⁰) in a small neighborhood ofv = (λ, u). Since wavelet coefficients are nearly decorrelated across frequencies, at each scale 2^j, the neighborhood of v = (λ, u) is

(16)

-1.5 -1 -0.5 0 0.5 1 1.5

(a)

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

-3 -2 -1 0 1 2 3

(b)

0 0.05 0.1 0.15 0.2

-3 -2 -1 0 1 2 3

(c)

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

-3 -2 -1 0 1 2 3

(d)

0 0.05 0.1 0.15 0.2

-3 -2 -1 0 1 2 3

Figure 2: Top: turbulent vorticity field. Bottom: Each image gives the modulus (above) or the phase (below) of wavelet coefficients x ? ψ_λ(u) for different frequency channels λ = 2^−jr−`ξ. Large modulus coefficients are shown in black. The columns correspond to different scales and angles (j, `). (a): (j, `) = (1,0). (b): (j, `) = (2,0). (c): (j, `) = (1, Q/4). (d):

(j, `) = (2, Q/4).

(17)

defined as the set ofv⁰ = (u⁰, λ⁰) such that λ⁰ =λand |u−u⁰| ≤2^j−1∆ for a fixed ∆.

Covariances are thus specified over a spatial range proportional to the scale. This is a foveal neighborhood which is sufficient to approximate the covariances of large classes of random processes such as fractional Brownian motions [18]. Since (u, u⁰) = 2^j−1(n, n⁰), the neighborhoods of all v have the same size, which is smaller than (2∆ + 1)².

Since X is stationary, its probability distribution is invariant to the group G of translations. The number of covariance coefficients that must be estimated is equal to the number |E_G|of edges in the graph modulo translations. It is equal to the number of wavelet frequenciesλmultiplied by the size of each neighborhood, and thus bounded by (J Q+ 1)(2∆ + 1)² = O(log₂d). This model size is much smaller than the image sizedwhen dis large.

Figure 3(b) shows a realization of the maximum entropy Gaussian model X. Ite is conditioned by wavelet covariances on a foveal graph with ∆ = 2. The wavelet transform is computed with a bump wavelet specified in Appendix D, with J = 5, Q= 16 and d= 256². In this case |E_G|/d= 3.6 10⁻². The covariances are estimated from a single realization of X. The calculation of the Lagrange multipliers in (2) is explained in Appendix A. To measure the accuracy of this model we compare the power spectrum ofX and the Gaussian modelX. Both processes are isotropic so Figure 3(c)e gives the radial log power spectrum of X and Xe as a function of |ω|. These power spectrum are nearly the same, which means that the wavelet covariance graph gives an accurate estimation of the second order moments ofX from a single realization.

The realization of the Gaussian model in Figure 3(b) has a geometry which is very different from the turbulence flow X in Figure 3(a). It shows that X is highly non- Gaussian, which also appears in Figure 2. The wavelet coefficients of a stationary process X are not correlated at different scales and angles, which would imply that they are independent if X was Gaussian. On the contrary, Figure 2 shows that the the modulus and phases of wavelet coefficients of X are strongly dependent across scales and angles. High amplitude modulus coefficients are located in the same spatial neighborhoods because they are produced by the same sharp transitions of the flow.

Next section explains how to capture this dependence with phase harmonics.

3.2 Wavelet Phase Harmonic Covariance

A wavelet phase harmonic representation Φ =HbL is computed with a linear wavelet transformL=W. As in the Fourier case, the covariance of Φ(X) =H(WX) has non-b zero covariance coefficients across different frequency bands. We study the properties of the resulting covariance matrix, and the role of sparsity.

Wavelet phase harmonics To specify the dependence across frequencies, we ap- ply a phase harmonic operator to wavelet coefficients:

H(Wb x) = n

bh(k) [x ? ψ_λ(u)]^k o

(λ,u)∈Γ,k∈Z

.

(18)

(a)

-1.5 -1 -0.5 0 0.5 1 1.5

(b)

-1.5 -1 -0.5 0 0.5 1 1.5

(c)

Figure 3: (a): Realization of a stationary turbulent vorticity field X. (b): Realization of a Gaussian maximum entropy model Xe calculated from wavelet covariances estimated on a foveal graph. (c): The full line and dashed lines are the power spectrum ofX and the power spectrum ofXe respectively, as a function of the radial frequency|ω|(shown in log-log scale).

The coefficients ofR =HWb are indexed by v = (λ, k, u). The covariance coefficients of H(WX) areb

K_HW_b (v, v⁰) =bh(k)bh(k⁰)^∗Cov([X ? ψλ(u)]^k,[X ? ψλ⁰(u⁰)]^k⁰).

Since X is stationary, it only depends on u−u⁰. Wavelet harmonic covariances only need to be calculated fork≥0 andk⁰≥0. Indeed, sinceX is real andψ(−u) =ψ^∗(u) one can verify thatK

HWb (v, v⁰) does not change its value if (λ, k) becomes (−λ,−k) or if (k, k⁰) becomes (−k,−k⁰). Such wavelet harmonic covariances have first been computed by Portilla and Simoncelli [7] to characterize the statistics of image textures. Their representation correspond to (k, k⁰) equal to (0,0), (1,1), (1,2), which amounts to choosing ˆh(k) = 1_[0,2](k).

Since Hb and W are bi-Lipschitz the operator HWb is also bi-Lipschitz, with lower and upper boundsAH AW and BHBW. Proposition 2.1 implies that

AHAWσ²(X)≤σ²(H(WX))b ≤BHBWσ²(X), (31) which controls the variance of wavelet harmonic coefficients.

Rectified neural network coefficients Ustyuzhaninov et. al. in [8] have shown that one can get good texture synthesis from the covariance of a one-layer convolutional neural network, computed with a rectifier. In the following we show that these statistics are equivalent to phase harmonic covariances, computed with a rectifier phase window h(α).

Section 2.3 proves thatHb =F_αH, whereHcomputes a phase windowing of wavelet coefficients

H(Wx) =n

|x ? ψ_λ(u)|h(ϕ(x ? ψ_λ(u)) +α)}(λ,u)∈Γ,α∈[0,2π].

(19)

The covariance ofH(Wb X) andH(WX) thus satisfyK_HW_b =F_αKHWF_α⁻¹. The following proposition proves thatKHW gives the covariance of rectified wavelet coefficients ifh is a rectifier phase window.

Proposition 3.1. Let v = (λ, α, u) and v⁰ = (λ⁰, α⁰, u⁰). For a rectifier phase window h(α) =ρ(cosα)

KHW(v, v⁰) = Cov

ρ(X ? ψ_λ,α(u)), ρ(X ? ψ_λ⁰_,α⁰(u⁰))

(32) withψλ,α(u) = Real(e^−iαψλ(u)).

Proof: We proved in (17) that ifh(α) =ρ(cosα) then H(z) ={ρ(Real(e^iαz))}_α∈[0,2π]. It results that

KHW(v, v⁰) = Cov(ρ(Real(e^iαX ? ψ_λ(u))), ρ(Real(e^iα⁰X ? ψ_λ⁰(u⁰)))i, which proves (32).

Rectified wavelet coefficients ρ(X ? ψ_λ,α(u)) can be interpreted as one-layer convolutional network coefficients, computed with wavelet filtersψ_λ,αof different frequencies λand different phasesα. This result applies to any linear operatorL, not necessarily a wavelet transform. The statistics used by Ustyuzhaninov et. al. in [8] use local cosine or random filters in their network as opposed to steerable wavelets. It corresponds to convolutional operators Lthat are not wavelet transforms.

Sparse harmonic covariance Proposition 2.2 specifies the properties of a Fourier phase harmonic covariance. We qualitatively explain, without proof, why Cov([X ? ψ_λ(u)]^k,[X ? ψ_λ⁰(u⁰)]^k⁰) has similar sparsity properties.

For a fixedλandk, [X ? ψ_λ(u)]^kis a stationary random vector inu. The covariance of [X ? ψλ(u)]^k and [X ? ψλ⁰(u⁰)]^k⁰ is non-zero only if their power-spectrum have a support which overlap. We give a necessary condition by approximating these spectrum supports.

For k = 1, the spectrum of X ? ψλ(u) has an energy concentrated at frequencies where the Fourier transform ofψb_λ is concentrated, which is included in a ball centered atλof radiusC|λ|. For|k|>1, [3] explains that the Fourier transform of [X ? ψ_λ(u)]^k and hence its power spectrum is concentrated in a ball centered atkλof radius|k|C|λ|.

If k = 0 then the spectrum of|X ? ψ_λ(u)| is concentrated in a ball centered at the 0 frequency, of radiusC|λ|. This is illustrated numerically in Figure 4 which displays the power spectrum of [X ? ψλ(u)]^k fork= 0,1,2. In this case,X is a stationary vorticity field shown at the top of Figure 3(a). These power spectrum are estimated for a fixed λfrom 100 independent realizations ofX. Fork= 1, the power spectrum is supported over the Fourier support of ψb_λ. For k = 2 its support is approximately dilated by 2, whereas fork= 0 it is centered at the zero frequency.

(20)

(a)

k=0

- - /2 0 /2

- - /2

0 /2

-3 -2 -1 0 1 2

(b)

k=1

- - /2 0 /2

- - /2

0 /2

-3 -2 -1 0 1 2

(c)

k=2

- - /2 0 /2

- - /2

0 /2

-3 -2 -1 0 1 2

Figure 4: Power spectrum of [X ? ψλ(u)]^k for a fixed λ, for the turbulent vorticity flow at the top of Figure 5(a), and different k. The power spectrum is shown in log base 10. (a):

k = 0. (b): k = 1. (c): k = 2.

It results that the spectrum of [X ? ψ_λ(u)]^kand [X ? ψ_λ⁰(u⁰)]^k⁰ have a support which overlap only if

|kλ−k⁰λ⁰| ≤C(max(|k|,1)|λ|+ max(|k⁰|,1)|λ⁰|) . (33) Similarly to the Fourier case, ifk=k⁰ = 0 then this condition is satisfied for any pair of frequencies (λ, λ⁰). It corresponds to covariances of modulus coefficients Cov(|X ? ψ_λ(u)|,|X ? ψ_λ⁰(u⁰)|), which are typically non-zero when the process is non-Gaussian.

If k6= 0 andk⁰ 6= 0 then non-negligible values of Cov([X ? ψ_λ(u)]^k,[X ? ψ_λ⁰(u⁰)]^k⁰) occur forkλ≈k⁰λ⁰. This is similar to the vanishing property (22) of Fourier harmonic coefficients, at frequencies which are colinear. Since λ= 2^−jr−`ξ and λ⁰ = 2^−j⁰r−`⁰ξ, it requires that 2^j−j⁰ ≈ |k⁰|/|k|and that`≈`⁰.

Diagonal covariance coefficients have similar properties as Fourier coefficients. They are specified by Cov(|X ? ψ_λ(u)|,|X ? ψ_λ(u)|) and Cov(X ? ψ_λ(u), X ? ψ_λ(u)), which do not depend uponu. Moreover whenλ6= 0,

Cov(|X ? ψ_λ(u)|,|X ? ψ_λ(u)|)

Cov(X ? ψ_λ(u), X ? ψ_λ(u)) = 1−E(|X ? ψ_λ(u)|)² E(|X ? ψ_λ(u)|²).

The ratioE(|X ? ψ_λ(u)|)²/E(|X ? ψ_λ(u)|²) measures the sparsity of wavelet coefficients X ? ψ_λ(u). Indeed, the sparsity of a time series is usually quantified by its `₁ norm for a fixed `₂ norm [16], which amounts to compute their ratios. Since X ? ψ_λ(u) is stationary, spatial sums over u converge to expected values. It results that a ratio of an`₁ over an`₂ norm (P

u|X ? ψ_λ(u)|)²/(P

u|X ? ψ_λ(u)|²) is an estimator of the ratio of expected values. If X is Gaussian then X ? ψλ(u) is a complex Gaussian random variable and one can verify that this ratio is ^π₄ for allλ. A ratio smaller than ^π₄ implies that wavelet coefficientsX ? ψ_λ(u) are sparse and hence thatX is non-Gaussian.

To increase the accuracy of a maximum entropy model X, the Kullback-Leiblere divergence (8) shows that we must minimize its entropy. As in the Fourier case, an upper bound of the entropy is obtained as a sum of the entropy of the marginals of wavelet coefficients. The marginal entropies get smaller by reducing the sparsity ratio

(21)

E(|X ? ψ_λ(u)|)²/E(|X ? ψ_λ(u)|²). To minimize this upper bound of the model entropy, this suggests by finding a mother waveletψ which yields sparse coefficients.

Gaussianity test Even though ψ_λ and ψ_λ⁰ may have disjoint Fourier supports, the previous analysis showed that one can find k and k⁰ such that the spectrum of [X ? ψ_λ(u)]^k and [X ? ψ_λ⁰(u⁰)]^k⁰ overlap, for example with k = k⁰ = 0. A priori, the covariance K

HWb (v, v⁰) is then non-zero, unlessX is Gaussian in which case these coefficients vanish, as proved by the following theorem.

Proposition 3.2. Let(λ, λ⁰)be such that ψb_λψb_λ⁰ = 0. IfX is Gaussian and stationary thenK

HbW(v, v⁰) = 0 if v= (λ, k, u) and v⁰= (λ⁰, k⁰, u⁰), for any (k, u, k⁰, u⁰).

Proof: IfXis Gaussian thenX ? ψλ(u) andX ? ψλ⁰(u⁰) are jointly Gaussian random variables. Equation (29) proves that if ψb_λψb_λ⁰ = 0 then Cov(X ? ψ_λ(u), X ? ψ_λ⁰(u⁰)) = 0. These Gaussian random variables are uncorrelated and therefore independent. It results that [X ? ψλ(u)]^k and [X ? ψλ⁰(u⁰)]^k⁰ are also independent and thus have a covariance which is zero.

This proposition gives a test of Gaussianity by considering two frequencies λ and λ⁰ where|λ−λ⁰|is sufficiently large so that the support ofψbλ and ψb_λ⁰ do not overlap.

If kλ ≈ k⁰λ⁰ then wavelet harmonic covariances are zero if X is Gaussian but it is typically non-zero ifX is not Gaussian.

4 Microcanonical Models

Section 2.1 introduces macrocanonical maximum entropy models conditioned by covariance coefficients of a representationR(X). The resulting maximum entropy distribution ˜p depend upon Lagrange coefficients which are computationally very expensive to calculate, despite the development of efficient algorithms [19]. Section 4.1 reviews maximum entropy microcanonical models which avoid computing these Lagrange multipliers. Section 4.2 gives an alternative microcanonical model which is calculated with a faster algorithm, which transports a maximum entropy measure by gradient descent.

4.1 Maximum Entropy Microcanonical Models

Microcanonical models avoid the calculation of Lagrange multipliers and are guaranteed to exist as opposed to macrocanonical models. We first specify covariance estimators and then briefly review the properties of these microcanonical models.

Microcanonical models rely on an ergodicity property which insures that covariance estimations concentrate near the true covariance KR when d is sufficiently large. Let G be a known group of linear unitary symmetries of the density p of X. It includes translations becauseX is stationary. An estimation of the covarianceKRover the edge setEG is computed in (4) from a single realization ¯xof X, by transforming ¯x with all g ∈ G. To differentiate realizations of X from other x ∈ R^d, we associate a similar