
HAL Id: hal-00331113

https://hal.archives-ouvertes.fr/hal-00331113

Submitted on 15 Oct 2008


Detecting spatial patterns with the cumulant function –

Part 1: The theory

A. Bernacchia, P. Naveau

To cite this version:

A. Bernacchia, P. Naveau. Detecting spatial patterns with the cumulant function – Part 1: The theory. Nonlinear Processes in Geophysics, European Geosciences Union (EGU), 2008, 15 (1), pp. 159-167. doi:10.5194/npg-15-159-2008. hal-00331113


Nonlinear Processes in Geophysics

www.nonlin-processes-geophys.net/15/159/2008/

© Author(s) 2008. This work is licensed under a Creative Commons License.

Detecting spatial patterns with the cumulant function –

Part 1: The theory

A. Bernacchia (1) and P. Naveau (2)

(1) Dipartimento di Fisica E. Fermi, Università La Sapienza, Roma, Italy

(2) Laboratoire des Sciences du Climat et de l'Environnement, IPSL-CNRS, France

Received: 5 July 2007 – Revised: 3 September 2007 – Accepted: 17 September 2007 – Published: 19 February 2008

Abstract. In climate studies, detecting spatial patterns that largely deviate from the sample mean still remains a statistical challenge. Although a Principal Component Analysis (PCA), or equivalently an Empirical Orthogonal Functions (EOF) decomposition, is often applied for this purpose, it provides meaningful results only if the underlying multivariate distribution is Gaussian. Indeed, PCA is based on optimizing second order moments, and the covariance matrix captures the full dependence structure of multivariate Gaussian vectors. Whenever the application at hand cannot satisfy this normality hypothesis (e.g. precipitation data), alternatives and/or improvements to PCA have to be developed and studied.

To go beyond this second order statistics constraint, which limits the applicability of the PCA, we take advantage of the cumulant function, which provides information on higher order moments. The cumulant function, well known in the statistical literature, allows us to propose a new, simple and fast procedure to identify spatial patterns for non-Gaussian data. Our algorithm consists in maximizing the cumulant function. Three families of multivariate random vectors, for which explicit computations are obtained, are implemented to illustrate our approach. In addition, we show that our algorithm corresponds to selecting the directions along which projected data display the largest spread over the marginal probability density tails.

1 Introduction

In geosciences, Principal Component Analysis (PCA) has been an essential and powerful tool for detecting spatial structures amongst time series recorded at different locations. PCA is a dimensionality reduction technique that extracts relevant components of the data, those responsible for the largest proportion of variability (Rencher, 1998). PCA builds decorrelated components of the data and finds the spatial patterns that maximize the variance. Hence, second order moments are the foundation of PCA. But relying exclusively on second moments implies that PCA is only optimal when applied to multivariate Gaussian vectors. Although rarely stated and even more rarely checked, this underlying normality assumption is not always satisfied in practice.

Correspondence to: A. Bernacchia (a.bernacchia@gmail.com)

Recently, different approaches have been tested to extend the applicability of PCA in geosciences. NonLinear PCA (NLPCA) has been applied to several geophysical datasets (e.g. Hsieh, 2004; Monahan et al., 2001). In the NLPCA algorithm, data are considered as the input of an auto-associative neural network with five layers, with a bottleneck in the third layer (Kramer, 1991). Through the minimization of a cost function, the output is forced to be as close as possible to the input, and the bottleneck layer is a low dimensional representation of the input. Since this neural network is nonlinear, NLPCA goes automatically beyond correlations. However, NLPCA suffers the intrinsic limitations of multilayered networks (e.g. Christiansen, 2005; Malthouse, 1998): it is computationally expensive and does not always converge to a global solution. Independent Component Analysis (ICA) has also been applied in the geosciences (Aires et al., 2000). ICA builds independent (rather than uncorrelated) components of the data, if any exist, by minimizing the entropy of the marginal density in the general non-Gaussian case (Bell and Sejnowski, 1995; Hyvarinen and Oja, 2000). Here we pursue a less ambitious aim: instead of trying to find decompositions that can explain the entire body of data with respect to a criterion, we focus on the part of the data responsible for large anomalous behaviors.

In contrast to PCA, our approach tends to give maximal weight to data points which largely deviate from the mean, and to find the corresponding representative spatial patterns, i.e. the directions along which such points are prominently distributed. The key element in our procedure is the expansion of the cumulant function, which can provide information beyond the first two moments. By maximizing the cumulant function (Kenney and Keeping, 1951) over growing hyperspheres in the data space, a set of components can be derived. We first illustrate our procedure by applying it to three synthetic types of multivariate random vectors (Normal, Skew-Normal, Gamma). Then we demonstrate that, for any multivariate distribution, a larger cumulant function along a given direction is implied by a fatter tail of the corresponding marginal probability density. Hence, our algorithm provides the directions along which anomalies are mostly expected.

We show that the first principal component of PCA is a special case of our approach whenever the Gaussian assumption is satisfied. Besides the Gaussian case we show that, for any probability density, the solution derived from the cumulant function can be transformed to the first principal component by decreasing the radius of the hypersphere. Other principal components could be found as well, by a generalization of the proposed method. Our method is computationally cheap, and the solutions are found in the form of unit (normalized) vectors, as in the case of PCA, allowing a unidimensional projection with an easy geometrical interpretation.

In summary, this paper focuses on the problem of characterizing spatial patterns associated with large anomalies, i.e. large deviations from the sample mean, when the data set under study cannot be assumed to be normally distributed.

2 Maximizing the cumulant function

In the univariate case, the cumulant function of the random variable X with finite moments is defined as the following scalar function:

\log E\{\exp(sX)\} = \sum_{n=1}^{\infty} \kappa_n \frac{s^n}{n!},

where s ∈ R, E(.) represents the expectation and the scalar κ_n corresponds to the n-th cumulant of X.

The first two cumulants κ_1 and κ_2 are simply the mean and the variance of X, respectively. The third and fourth cumulants are classically called the skewness and the kurtosis parameters. Concerning the existence of cumulants, we assume in this paper that all the cumulant coefficients are finite and that the cumulant function is always well defined. The cumulant function and its coefficients have many interesting properties. For example, if X and Y are two independent random variables, then the n-th cumulant of the sum X + Y is equal to the sum of the n-th cumulant of X and the n-th cumulant of Y, for any integer n. If X follows a Gaussian distribution, then all but the first two cumulants are equal to zero. In a multivariate framework, the cumulant function of

the random vector X = (X_1, ..., X_m)^t is simply defined as

\log E\{\exp(s^t X)\}, for all s^t = (s_1, ..., s_m) ∈ R^m.  (1)

As in the univariate case, the linear and Gaussian properties associated with the cumulant function defined by Eq. (1) still hold, but the cumulant coefficient formulas are more cumbersome to write down in a multivariate framework. For more information about cumulants, we refer the reader to Kenney and Keeping (1951).
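In practice, the expectation E(.) is replaced by a sample mean. The following minimal Python sketch (ours, not part of the paper) estimates the cumulant function of Eq. (1) from a finite sample; the shifted log-mean-exp guards against numerical overflow:

```python
import numpy as np

def cumulant_function(X, s):
    """Empirical cumulant function log E{exp(s^t X)} of Eq. (1).

    X : (N, m) array of N observations of an m-dimensional vector.
    s : (m,) array, the argument of the cumulant function.
    The expectation is replaced by the sample mean.
    """
    t = X @ s                      # projections s^t x_k, shape (N,)
    tmax = t.max()                 # shift for numerical stability
    return tmax + np.log(np.mean(np.exp(t - tmax)))

# Sanity check: for a centered Gaussian sample the estimate should be
# close to the theoretical value (1/2) s^t Sigma s (see Sect. 3.1).
rng = np.random.default_rng(0)
Sigma = np.array([[1.2, 0.0], [0.0, 0.5143]])
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=200000)
s = np.array([0.5, 0.5])
print(cumulant_function(X, s), 0.5 * s @ Sigma @ s)
```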

To identify possible favorite projection directions with respect to the multivariate cumulant function, we first rewrite the vector s = (s_1, ..., s_m)^t in Eq. (1) as the product s = |s| × θ, where |s|^2 = \sum_i s_i^2, the scalar |s| represents the norm ("radius") of s, and θ is the unit "angular/direction" vector defined as θ_i = s_i/|s| (note that θ^t θ = 1). Secondly, the cumulant function for our vector X = (X_1, ..., X_m)^t projected along the direction vector θ is introduced as

G_{|s|}(θ) = \log E\{\exp(|s| θ^t X)\} = \sum_{n=1}^{\infty} k_n(θ) \frac{|s|^n}{n!}.  (2)

Our algorithmic strategy is to maximize the cumulant function, at fixed non-small |s|, with respect to the angular component θ that varies over a unit hypersphere. Practically, we have to find the optimal θ_s directions defined by

θ_s = \arg\max G_{|s|}(θ) such that θ^t θ = 1  (3)

for a fixed value of |s|. If the radius |s| is small enough, and the mean k_1 is zero, the variance k_2(θ) dominates the cumulant function and the contributions of the other cumulants can be neglected. In this situation, finding the first PCA component can be viewed as a special case of this optimization procedure, because maximizing the cumulant function for small |s| is equivalent to maximizing the variance. As the value of |s| grows, higher and higher order cumulants become dominant. Our main goal is to find θ_s in Eq. (3) for the largest admissible |s| and to study their properties. We will call the solutions of such an optimization scheme the "Maxima of the Cumulant Function" (MCF) directions.
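For two-dimensional data, the optimization in Eq. (3) can be carried out by a plain grid search over the unit circle. The sketch below is our illustration, not the authors' code; the function name and the grid resolution are arbitrary choices:

```python
import numpy as np

def _logmeanexp(t):
    # stabilized log of the sample mean of exp(t)
    tmax = t.max()
    return tmax + np.log(np.mean(np.exp(t - tmax)))

def mcf_directions(X, radius, n_angles=720):
    """Grid search for the Maxima of the Cumulant Function (MCF)
    of centered 2-D data X (shape (N, 2)) at fixed radius |s|,
    following Eq. (3). Returns the angles of the local maxima of
    G_|s|(theta) on the unit circle, sorted by decreasing G."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    G = np.array([_logmeanexp(radius * (X @ np.array([np.cos(a), np.sin(a)])))
                  for a in angles])
    # local maxima on the periodic grid are the candidate MCF directions
    is_max = (G > np.roll(G, 1)) & (G > np.roll(G, -1))
    order = np.argsort(G[is_max])[::-1]
    return angles[is_max][order], G[is_max][order]
```

In higher dimensions the grid search becomes impractical, and a constrained optimizer over the hypersphere would be used instead.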

We anticipate that, since the scalar product θ^t X is invariant under orthogonal transformations, the cumulant function is invariant as well. Given that the unit hypersphere is also invariant, our algorithm is symmetric with respect to orthogonal transformations. For instance, if data vectors are rotated by a given angle, the solutions of the algorithm, in terms of θ, are rotated by the same amount. This symmetry implies that if the probability density is isotropic, then the cumulant function is isotropic as well, and no relative maxima of G exist on the unit hypersphere. In that case no directions are selected: for a rotationally symmetric distribution there is indeed no preferred direction along which anomalies are prominent; they are distributed uniformly over all angles.


To illustrate our optimization procedure, we derive in Sect. 3 explicit cumulant maximization schemes for three special cases of multivariate family distributions. Finally, in Sect. 4 we demonstrate that, in the general case, if a direction exists for which the marginal probability density of projected data displays a larger tail than in other directions, our procedure is able to select that direction, which corresponds to the maximum of the cumulant function. Hence, our algorithm provides the directions along which anomalies are mostly expected. For assessing the outputs of our algorithm when applied to real data, we refer the reader to the second part of this paper (Bernacchia et al., 2008).

3 Theoretical examples

To study the properties of the maximization method developed in Sect. 2, three examples of distribution functions are considered in this paper. We choose these three families because explicit results can be derived and because they have been classically used in the statistical modeling of temperature and precipitation data.

Without loss of generality, some relevant matrices are taken to be diagonal in what follows. However, since the solutions covary with orthogonal transformations, they may be rotated along with the corresponding coordinate change. Analytical calculations are performed for the general multivariate case, while figures are given for the bivariate case.

3.1 Multivariate Gaussian vectors

Suppose that the data at hand can be appropriately fitted by a multivariate Gaussian vector. We assume that the observations have been centered (zero mean) and we denote the covariance matrix by Σ. The cumulant function of the centered Gaussian vector (e.g. Kenney and Keeping, 1951) is equal to

\log E\{\exp(s^t X)\} = \frac{1}{2} s^t Σ s.

Hence, it is easy to show that all cumulants but the second are equal to zero. The decomposition in Eq. (2), s = |s| × θ, implies that the cumulant function becomes

G_{|s|}(θ) = \frac{|s|^2}{2} k_2(θ) = \frac{|s|^2}{2} θ^t Σ θ.

Our optimization problem is to maximize G_{|s|}(θ) under the constraint θ^t θ = 1. To find the optimal θ_s defined by Eq. (3), we introduce a function L to be maximized, constrained by the Lagrange multiplier λ, as

L(λ, θ) = \frac{|s|^2}{2} θ^t Σ θ − λ(θ^t θ − 1).

Setting the derivative of L with respect to λ to zero gives the constraint θ^t θ = 1, while setting the gradient with respect to θ to zero gives

∇_θ L = |s|^2 Σ θ − 2λθ = 0.  (4)

[Figure] Fig. 1. Isoprobability contours of the bivariate Gaussian distribution, with zero mean and variances 1.2 and 0.5143 (entries of the diagonal covariance matrix). The first principal component is shown (PC1), together with the two (opposite) maxima of the cumulant function (MCF1 and MCF2), for any value of |s|. All vectors are in arbitrary scale. The maxima of the cumulant function are parallel to the first principal component, all pointing towards the large anomalies, in terms of high probability (at fixed vector norm). Probability contours are 10^{-1}, 10^{-3}, 10^{-5}, ..., 10^{-11}.

Let λ_Σ be the largest eigenvalue of the covariance matrix Σ and θ_Σ its associated eigenvector. Introducing λ_s = \frac{|s|^2}{2} λ_Σ, we can write |s|^2 Σ θ_Σ = 2 λ_s θ_Σ. Since the eigenvalue λ_s is the largest one, for a fixed |s|, θ_Σ is the maximum of the cumulant function. Note that θ_Σ depends on Σ but not on |s|. The optimal direction, for the Gaussian case, is θ_s = θ_Σ, and it corresponds to the classical first principal component.

To illustrate this result, a bivariate Normal distribution is presented in Fig. 1 (in which a contour plot is drawn in logarithmic scale). The matrix Σ is assumed to be diagonal, with entries 1.2 and 0.5143. The first principal component (PC1), corresponding to the eigenvalue 1.2, is horizontal, while the second principal component, corresponding to the eigenvalue 0.5143, is vertical and is not displayed. The two maxima of the cumulant function (for any value of |s|), MCF1 and MCF2, are just the positive and negative parts of PC1. Indeed, both θ_Σ and −θ_Σ are solutions of Eq. (4). While this is always true for PCA, the maxima of the cumulant function may in general be neither parallel nor orthogonal.
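As a quick numerical check of this result (our sketch, reusing mcf_directions from the sketch in Sect. 2), the estimated MCF direction on a synthetic Gaussian sample should be parallel, up to sign, to the leading eigenvector of the sample covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.diag([1.2, 0.5143])
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=100000)

# PC1: leading eigenvector of the sample covariance matrix
evals, evecs = np.linalg.eigh(np.cov(X.T))
pc1 = evecs[:, np.argmax(evals)]

# MCF at an arbitrary fixed radius, via mcf_directions() from Sect. 2
angles, G = mcf_directions(X, radius=1.0)
mcf1 = np.array([np.cos(angles[0]), np.sin(angles[0])])
print(pc1, mcf1)  # parallel up to sign: both approx (+-1, 0)
```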

PC1 is indeed the direction along which large anomalies are distributed in the Gaussian case. In order to derive PC2 and other higher principal components from the cumulant function, one would have to determine not only its maxima, but also its minima and saddle points.

[Figure] Fig. 2. Isoprobability contours of the bivariate Skew-Normal distribution, with parameters α = (4.365, −1.455) and the matrix Σ chosen to be diagonal with entries 1.2 and 0.5143. The first principal component is shown (PC1), together with the two maxima of the cumulant function (MCF1 and MCF2), for large |s|. All vectors are in arbitrary scale. In this case, the two maxima of the cumulant function do not correspond to the first principal component, they are neither parallel nor orthogonal, and they point towards the (local) large anomalies, in terms of high probability (at fixed, and large, vector norm). Probability contours are 10^{-1}, 10^{-3}, 10^{-5}, ..., 10^{-19}.

3.2 Multivariate Skew-Normal vectors

To introduce skewness into the Gaussian density, while keeping some of the valuable properties of the normal distribution, Azzalini and his co-authors have extended the normal density to a larger class, called the Skew-Normal (SN) density (e.g. Azzalini and Dalla Valle, 1996; Azzalini and Capitanio, 1999; Gonzalez-Farias et al., 2004), which is defined as

f(x) = 2 φ_Σ(x) Φ(α^t x),  (5)

where φ_Σ is a multivariate Normal probability density function with zero mean and covariance matrix Σ, and Φ is the cumulative distribution function of a univariate Gaussian random variable with zero mean and unit variance. The vector α corresponds to the degree of skewness. When α = 0, there is no skewness, and the SN distribution reduces to the Gaussian case. From Eq. (5), it is possible to derive the cumulant function (e.g. Azzalini and Dalla Valle, 1996) of an SN vector:

\log E\{\exp(s^t X)\} = \frac{1}{2} s^t Σ s + \log\left[2 Φ\left(\sqrt{π/2}\, μ^t s\right)\right],

where μ represents the mean vector, and is equal to

μ = \frac{Σ α}{\sqrt{(π/2)(1 + α^t Σ α)}}.

Note that the covariance matrix of the SN distribution is not Σ but Σ − μμ^t (Azzalini and Capitanio, 1999).

A bivariate example of the SN distribution is presented in Fig. 2, in which a contour plot is drawn in logarithmic scale. The matrix Σ is chosen to be diagonal with entries 1.2 and 0.5143. The skewness vector α is taken to be equal to (4.365, −1.455). The distribution in Fig. 2 has been centered such that the mean vector is zero, i.e. the original distribution is translated by μ, which is equal to (0.8367, −0.1195).

From the SN cumulant function (Azzalini and Capitanio, 1999), we can write the cumulant function of the centered vector (X − μ) as

G_{|s|}(θ) = −|s| μ^t θ + \frac{|s|^2}{2} θ^t Σ θ + \log\left[2 Φ\left(\sqrt{π/2}\, |s| μ^t θ\right)\right].  (6)

While it is not possible to find explicit solutions of the maximization problem defined by Eq. (3) for Eq. (6), one can provide valuable approximate solutions for both small and large |s|. In the former case, the following two Taylor expansions are the key elements to derive our results:

\log(1 + s) = s − \frac{s^2}{2} + O(s^3), and  (7)

1 + \mathrm{erf}(s) = 1 + \frac{2s}{\sqrt{π}} + O(s^3),  (8)

where erf corresponds to the error function, defined by 2Φ(s) = 1 + \mathrm{erf}(s/\sqrt{2}). Then we can write the following approximation:

\log\left[2 Φ\left(\sqrt{π/2}\, |s| μ^t θ\right)\right] = \log\left[1 + \mathrm{erf}\left(\frac{\sqrt{π}}{2} |s| μ^t θ\right)\right]
= \log\left[1 + |s| μ^t θ + O(|s|^3)\right], by (8),
= |s| μ^t θ − \frac{|s|^2}{2} θ^t μμ^t θ + O(|s|^3), by (7).

From Eq. (6), it follows that the cumulant function G_{|s|}(θ) is approximately equal to

G_{|s|}(θ) ≃ \frac{|s|^2}{2} θ^t (Σ − μμ^t) θ, for small |s|.  (9)


As previously noticed, the matrix Σ − μμ^t represents the covariance of the SN distribution (Azzalini and Capitanio, 1999). Hence, maximizing the right hand side of Eq. (9) is equivalent to solving the system defined by Eq. (4), with Σ replaced by Σ − μμ^t. Consequently, the solution maximizing the SN cumulant function in the neighborhood of zero is the leading eigenvector of the matrix Σ − μμ^t, i.e. the PC1 of the SN covariance matrix.

For large |s|, this result does not hold and different directions are obtained. We need to recall the asymptotic expansion of the error function:

1 + \mathrm{erf}(s) ≃ −\frac{1}{s\sqrt{π}} \exp(−s^2) \left[1 + O(s^{−2})\right], as s → −∞, and 1 + \mathrm{erf}(s) ≃ 2, as s → +∞.

If μ^t θ < 0, the logarithm in Eq. (6) can be expanded as

\log\left[2 Φ\left(\sqrt{π/2}\, |s| μ^t θ\right)\right] = \log\left[1 + \mathrm{erf}\left(\frac{\sqrt{π}}{2} |s| μ^t θ\right)\right] ≃ −\frac{π}{2} \frac{|s|^2}{2} θ^t μμ^t θ + O(\log |s|).

In this case, we have G_{|s|}(θ) ≃ \frac{|s|^2}{2} θ^t \left(Σ − \frac{π}{2} μμ^t\right) θ. The direction that maximizes G_{|s|}(θ) is an eigenvector of the matrix Σ − \frac{π}{2} μμ^t, pointing towards μ^t θ < 0. If μ^t θ ≥ 0, we have

\log\left[2 Φ\left(\sqrt{π/2}\, |s| μ^t θ\right)\right] ≃ \log 2 if μ^t θ > 0, and ≃ 0 if μ^t θ = 0.

The cumulant function G_{|s|}(θ) can then be approximated by G_{|s|}(θ) ≃ \frac{|s|^2}{2} θ^t Σ θ; it is maximized by the leading eigenvector of Σ, pointing towards μ^t θ ≥ 0. In summary, depending on the size of |s| (small or large) and the sign of μ^t θ (positive or negative), the solutions of Eq. (3) for the SN distribution can be viewed as the leading eigenvectors of three different matrices: Σ − μμ^t, Σ − \frac{π}{2} μμ^t, and Σ.
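These closed-form limits are straightforward to evaluate numerically. The sketch below (ours, not the authors' code) computes the three leading eigenvectors for the bivariate example of Fig. 2, orienting the two large-|s| solutions towards the half-spaces μ^t θ ≥ 0 and μ^t θ < 0:

```python
import numpy as np

def leading_eigvec(M):
    """Eigenvector associated with the largest eigenvalue of a
    symmetric matrix M, i.e. the solution of Eq. (4) for M."""
    evals, evecs = np.linalg.eigh(M)
    return evecs[:, np.argmax(evals)]

# Bivariate Skew-Normal example of Fig. 2
Sigma = np.diag([1.2, 0.5143])
alpha = np.array([4.365, -1.455])
mu = Sigma @ alpha / np.sqrt((np.pi / 2) * (1 + alpha @ Sigma @ alpha))

M = np.outer(mu, mu)
pc1  = leading_eigvec(Sigma - M)                 # small |s|: PC1 of the SN covariance
mcf1 = leading_eigvec(Sigma)                     # large |s|, mu^t theta >= 0
mcf2 = leading_eigvec(Sigma - (np.pi / 2) * M)   # large |s|, mu^t theta < 0

# orient the large-|s| solutions towards the proper half-spaces
if mu @ mcf1 < 0: mcf1 = -mcf1
if mu @ mcf2 > 0: mcf2 = -mcf2

print(mu)                 # approx (0.8367, -0.1195), as in the text
print(pc1, mcf1, mcf2)    # MCF1 and MCF2 are neither parallel nor orthogonal
```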

For the bivariate example of Fig. 2, PC1 is shown (in arbitrary scale), explaining 60% of the variance. The second PC (40% of the variance) is orthogonal to PC1 and is not displayed. Both maxima of the cumulant function for large |s|, denoted MCF1 and MCF2 (for μ^t θ ≥ 0 and μ^t θ < 0, respectively), are presented in Fig. 2 for the bivariate example (same scale as PC1). The two local maxima point towards the large anomalies of the distribution: this can be seen by noting that a point at the upper-right end of PC1 corresponds to a smaller probability, and hence is less likely to be found, than a point at the right end of MCF1 (the two points being of equal norm). Similarly, a point at the lower-left end of PC1 is less likely, in probability, than a point at the lower end of MCF2.

3.3 Multivariate Gamma vectors

This section investigates the multivariate Gamma distribution defined by Cheriyan and Ramabhadran (see Kotz et al., 1998). Each component of the data vector X is distributed following a Gamma distribution, and the components depend on each other by means of an auxiliary variable z. The joint distribution is

f(x) = \int g(z; α_0) \prod_{i=1}^{n} g(x_i − z; α_i)\, dz,  (10)

where g is a Gamma density, i.e.

g(z; α) = \frac{e^{−z} z^{α−1}}{Γ(α)}

for z ≥ 0, and equal to zero otherwise. A bivariate example is presented in Fig. 3, with n = 2, α_0 = 2, α_1 = 0.5 and α_2 = 4 in Eq. (10).

[Figure] Fig. 3. Isoprobability contours of the bivariate Gamma distribution, with parameters α_0 = 2, α_1 = 0.5, α_2 = 4. The first principal component is shown (PC1), together with the maximum of the cumulant function (MCF1), for the largest admissible value of |s| = 1/\sqrt{2}. All vectors are in arbitrary scale. Again, the maximum of the cumulant function is different from the first principal component, and points towards the large anomalies, in terms of high probability (at fixed, and large, vector norm). Probability contours are 10^{−1/2}, 10^{−1}, 10^{−3/2}, ....

The cumulant function for this multivariate Gamma distribution can be written as (Kotz et al., 1998)

\log E\{\exp(s^t X)\} = −α_0 \log\left(1 − \sum_{i=1}^{n} s_i\right) − \sum_{i=1}^{n} α_i \log(1 − s_i),


and the mean of the i-th component is μ_i = α_0 + α_i. By replacing s by |s| × θ, the cumulant function of the centered vector (X − μ) can be written as

G_{|s|}(θ) = −α_0 \log\left(1 − |s| \sum_{i=1}^{n} θ_i\right) − \sum_{i=1}^{n} α_i \log(1 − |s| θ_i) − |s| \sum_{i=1}^{n} (α_0 + α_i) θ_i.  (11)

For small |s|, the logarithms are approximated by truncated Taylor expansions, i.e.

\sum_{i=1}^{n} α_i \log(1 − |s| θ_i) ≃ −|s| \sum_{i=1}^{n} α_i θ_i − \frac{|s|^2}{2} \sum_{i=1}^{n} α_i θ_i^2, and

α_0 \log\left(1 − |s| \sum_{i=1}^{n} θ_i\right) ≃ −|s| α_0 \sum_{i=1}^{n} θ_i − \frac{|s|^2}{2} α_0 \left(\sum_{i=1}^{n} θ_i\right)^2.

It follows that the cumulant function can be approximated by

G_{|s|}(θ) ≃ \frac{|s|^2}{2} θ^t C θ,  (12)

where the covariance matrix C is defined by (Kotz et al., 1998)

C = \begin{pmatrix} α_0 + α_1 & \cdots & α_0 \\ \vdots & \ddots & \vdots \\ α_0 & \cdots & α_0 + α_n \end{pmatrix},

i.e. C has diagonal entries α_0 + α_i and off-diagonal entries α_0.

Hence, for small |s|, the solution of Eq. (3) is the classical PC1, the eigenvector of the covariance matrix C associated with the largest eigenvalue. It is plotted in Fig. 3 for the bivariate example. Neither the negative part of PC1 nor the second principal component, which is simply orthogonal to the first, is displayed.

For large |s|, the cumulant function is not defined, since the logarithms in Eq. (11) must have positive arguments for all unit vectors θ. In particular, the following two inequalities have to be satisfied:

|s| θ_i < 1 and |s| \sum_{i=1}^{n} θ_i < 1.

In other words,

|s| < \min\left( \min_i \frac{1}{θ_i}, \frac{1}{\sum_i θ_i} \right).

However, we take the largest allowed value of |s|, which is considered as a valid limit. Since θ is a unit vector, the maximum of each component θ_i is 1, which holds when all the other components are zero, while the maximum of the sum \sum_i θ_i is \sqrt{n}, which holds when θ_1 = θ_2 = ... = θ_n = 1/\sqrt{n}. The largest allowed value of |s| is then 1/\sqrt{n}: in that case G remains finite for all θ's, except for θ_1 = θ_2 = ... = θ_n = 1/\sqrt{n}, where it diverges due to the first logarithm of Eq. (11) (all the others remain finite). For larger values of |s|, G diverges over subspaces larger than a single point. Hence the boundary case |s| = 1/\sqrt{n} is taken as representative of the "large" |s| limit, and the point θ_1 = θ_2 = ... = θ_n = 1/\sqrt{n} is taken as the maximum of the cumulant function.

For the bivariate example the largest admissible value is |s| = 1/\sqrt{2}, and the maximum of the cumulant function is plotted in Fig. 3, denoted MCF1, and scaled to PC1. Note that for high values of the probability distribution (e.g. the inner contour), PC1 seems a representative direction of the egg-like shape of the distribution, but for low probabilities it becomes clear that MCF1 is responsible for the large deviations. A point at the end of PC1 is indeed less probable than a point at the end of MCF1.

This result is understood by noting that the joint density (10) is the probability distribution of variables x_i defined as x_i = z_0 + z_i, where the variables z_0, z_1, ..., z_n are independent and Gamma distributed with parameters α_0, α_1, ..., α_n. Hence, a large deviation of z_0, which occurs independently of the other z's, corresponds to a large deviation of x which is placed, on average, along the line x_1 = x_2 = ... = x_n (which corresponds to θ_1 = θ_2 = ... = θ_n for the cumulant function).
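This additive representation also gives a direct way to simulate the distribution of Eq. (10). The following sketch (ours; the function name is an arbitrary choice) samples x_i = z_0 + z_i from independent Gamma variables, so that large shared deviations of z_0 line up along θ_1 = θ_2 = ... = θ_n:

```python
import numpy as np

def sample_multigamma(alpha0, alphas, size, rng=None):
    """Sample the Cheriyan-Ramabhadran multivariate Gamma of Eq. (10)
    through its additive representation x_i = z_0 + z_i, with
    z_0, z_1, ..., z_n independent Gamma(alpha_0), ..., Gamma(alpha_n)."""
    rng = rng or np.random.default_rng()
    z0 = rng.gamma(alpha0, size=size)                    # shared component
    zi = np.stack([rng.gamma(a, size=size) for a in alphas], axis=1)
    return z0[:, None] + zi

# Bivariate example of Fig. 3: a large z0 pushes x along x1 = x2,
# i.e. along theta_1 = theta_2 = 1/sqrt(2), the MCF1 direction.
X = sample_multigamma(2.0, [0.5, 4.0], size=100000,
                      rng=np.random.default_rng(2))
Xc = X - X.mean(axis=0)   # center; the mean is mu_i = alpha_0 + alpha_i
```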

4 Maximizing the marginal density tail

We have seen that the multivariate cumulant function reduces to the variance if the radius |s| tends to zero. In that case, its maximum corresponds to the first principal component of the data set. If |s| grows, higher order cumulants come into play, but it is not clear what the corresponding maxima represent. In order to clarify this point, we rewrite the cumulant function defined by Eq. (2) in terms of the explicit integral over the probability density:

G_{|s|}(θ) = \log \int_{R^n} f(x) \exp(|s| θ^t x)\, dx.

This expression can be reduced to a unidimensional integral, by defining the projected data as Z = θ^t X, with marginal probability density f_θ(z). The cumulant function is then

G_{|s|}(θ) = \log \int_{−∞}^{+∞} f_θ(z) \exp(|s| z)\, dz,  (13)

which corresponds to the cumulant function of the univariate variable Z = θ^t X with probability density f_θ(z). In the light of this representation, our maximization procedure is better understood: we are looking for directions θ that correspond to a marginal probability density f_θ(z) displaying a maximal cumulant function, at fixed |s|. We want to demonstrate that if |s| grows, a larger cumulant function corresponds to a marginal density with a fatter tail. Hence our procedure selects the directions corresponding to the marginal densities with fatter tails, where the anomalous behaviour is expected. Specifically, consider two different directions θ and θ′: we want to demonstrate that if the marginal distribution along θ has a fatter tail than the distribution along θ′, then the cumulant function is eventually larger along θ than along θ′. More formally, we have the following theorem.

Theorem 1. Let θ and θ′ be two directions. If there exists a real z∗ such that the density f_θ(z) of the random variable θ^t X is strictly larger than the density f_{θ′}(z) of the random variable θ′^t X, for all z > z∗, i.e.

f_θ(z) > f_{θ′}(z), for all z > z∗,

then there exists a radius |s|∗ such that the cumulant functions of θ^t X and θ′^t X satisfy

G_{|s|}(θ) > G_{|s|}(θ′), for all |s| > |s|∗.

Proof. In order to prove the result, we start by noting that the following inequality holds:

\exp(|s| z) [f_θ(z) − f_{θ′}(z)] > \exp(|s| z∗) [f_θ(z) − f_{θ′}(z)]

for all z > z∗, where we have replaced the exponential function with its minimum value over the interval z ∈ (z∗, +∞). The inequality holds because the density difference f_θ(z) − f_{θ′}(z) is positive in this interval, by assumption. Since the above inequality holds in the whole interval z ∈ (z∗, +∞), it can be integrated over, i.e.

\int_{z∗}^{+∞} \exp(|s| z) [f_θ(z) − f_{θ′}(z)]\, dz > \exp(|s| z∗) \int_{z∗}^{+∞} [f_θ(z) − f_{θ′}(z)]\, dz.  (14)

The two densities are normalized, i.e.

\int_{−∞}^{+∞} f_θ(z)\, dz = \int_{−∞}^{+∞} f_{θ′}(z)\, dz = 1.

Splitting the integrals at z∗, and rearranging terms, the normalization condition is rewritten as

\int_{z∗}^{+∞} [f_θ(z) − f_{θ′}(z)]\, dz = \int_{−∞}^{z∗} [f_{θ′}(z) − f_θ(z)]\, dz.  (15)

Note that since the left hand side (l.h.s.) is positive, by assumption, the right hand side (r.h.s.) is positive as well. Equation (15) can be substituted into the r.h.s. of the inequality (14), giving

\int_{z∗}^{+∞} \exp(|s| z) [f_θ(z) − f_{θ′}(z)]\, dz > \exp(|s| z∗) \int_{−∞}^{z∗} [f_{θ′}(z) − f_θ(z)]\, dz.

A lower bound for the r.h.s. can be found, depending on the value of |s|: in the following, we demonstrate that there exists an |s|∗ such that, for all |s| > |s|∗,

\exp(|s| z∗) \int_{−∞}^{z∗} [f_{θ′}(z) − f_θ(z)]\, dz > \int_{−∞}^{z∗} \exp(|s| z) [f_{θ′}(z) − f_θ(z)]\, dz,  (16)

where the value of |s|∗ must be determined. The theorem is proven once we have demonstrated the last inequality (16), since then we have, for all |s| > |s|∗,

\int_{z∗}^{+∞} \exp(|s| z) [f_θ(z) − f_{θ′}(z)]\, dz > \int_{−∞}^{z∗} \exp(|s| z) [f_{θ′}(z) − f_θ(z)]\, dz.

By rearranging terms, this is equivalent to G_{|s|}(θ) > G_{|s|}(θ′). We rewrite the inequality (16) as

\int_{−∞}^{z∗} [f_{θ′}(z) − f_θ(z)]\, dz > \int_{−∞}^{z∗} \exp[|s|(z − z∗)] [f_{θ′}(z) − f_θ(z)]\, dz

for all |s| > |s|∗. Note that even if the integral on the l.h.s. is positive, its integrand, the density difference f_{θ′}(z) − f_θ(z), is not guaranteed to be positive for all z ∈ (−∞, z∗). If it were positive as well, the inequality would hold trivially for all values of |s|, because the exponential on the r.h.s. is smaller than one. This corresponds to the case where, in addition to f_θ(z) > f_{θ′}(z) ∀z > z∗, we assume f_θ(z) < f_{θ′}(z) ∀z < z∗.

However, we do not need this additional assumption, and we simply note that it would help our procedure by removing the need for a large |s|. The integral on the l.h.s. is independent of |s|, while the integral on the r.h.s. converges to zero for |s| → +∞, as long as the density difference remains finite, because the exponential tends to zero in the whole interval z ∈ (−∞, z∗). If we define |s|∗ as the largest possible value of |s| for which the two integrals are equal, then for all |s| > |s|∗ the integral on the l.h.s. is larger than that on the r.h.s., and the theorem is proven.
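The theorem can be illustrated numerically (our sketch, not part of the paper): comparing a centered Gamma sample, whose upper tail decays exponentially, with a variance-matched Gaussian sample, the difference of the empirical cumulant functions becomes, and stays, positive as |s| grows. Note that finite-sample estimates of G at large |s| are noisy, since a few extreme points dominate the sample mean:

```python
import numpy as np

rng = np.random.default_rng(3)
# Two projected samples: z has a fatter (exponential) upper tail
# than z_prime (Gaussian with the same variance).
z = rng.gamma(2.0, size=500000) - 2.0               # centered Gamma(2), var = 2
z_prime = rng.normal(0.0, np.sqrt(2.0), size=500000)

def G(sample, s):
    # empirical univariate cumulant function, Eq. (13)
    t = s * sample
    tmax = t.max()
    return tmax + np.log(np.mean(np.exp(t - tmax)))

for s in [0.2, 0.5, 0.8]:
    print(s, G(z, s) - G(z_prime, s))   # the difference grows with s
```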

5 Discussion

In this paper, we have introduced a novel method for selecting the spatial patterns representative of the large deviations in a dataset. The method consists in finding the vectors in the space of data for which the cumulant function is maximal. As in the case of PCA, the spatial patterns are found as normalized directions in the space of data, and a linear projection can be performed, with an easy geometrical interpretation. However, while PCA accounts only for the mass of the distribution, the cumulant function can also give information about its tails. If one is interested in the large deviations, the projection allows one to safely perform Extreme Value Analysis (Coles, 2001). In both cases the subspaces are ordered: in PCA the order follows the fraction of variance of each subspace; the maxima of the cumulant function are ordered by the value of G, expressing the relative importance of each marginal density tail.

Principal components are always symmetric, while large anomalous patterns, if generated by nonlinear processes, are expected to be neither specular nor orthogonal. Accordingly, the maxima of the cumulant function are not necessarily symmetric, since they account for the whole structure of dependencies, and not only covariances. Vector solutions of other nonsymmetric techniques, such as oblique Varimax rotations (Horel, 1981), generally do not covary with the space of data under orthogonal transformations. Hence, those solutions depend not only on the shape of the underlying probability distribution, but also on its orientation: an undesirable property for our purposes. The maxima of the cumulant function, instead, covary with the probability density, whose shape is the only feature determining the maxima.

In the case of normally distributed data, the maximization of the cumulant function yields the first principal component for all values of |s|: the elliptically symmetric distribution is characterized by the two tails along the major axis of the ellipse, i.e. the first principal component. When the method is applied to Skew-Normal and Gamma distributions, for non-small |s|, the maxima of the cumulant function determine large anomalies: high probability directions far from the center of mass. Note that the limit radius |s| → ∞ is the innovative key from a technical point of view, allowing for analytical solutions. Using this limit, we were also able to demonstrate that the solutions of our algorithm correspond, in general, to the directions along which the marginal probability density displays the fattest tails.

Using the cumulant function is computationally cheap, involves no free parameter, and has the advantage of searching for local solutions, all of which are of interest. When a solution is found, it is always a valid local solution, in contrast with neural network applications, where local solutions are not of interest. In real applications, the radius |s| must be taken as large as possible, until the expected error in the estimate of the cumulant function, due to the finite sample, reaches a tolerance value (see Bernacchia et al., 2008). This corresponds to maximizing a combination of cumulants of the highest order that is reliable with the given amount of data, accounting for the available set of anomalies.
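The selection rule for |s| is specified in Part 2 (Bernacchia et al., 2008); as a plausible stand-in, one can increase |s| until a bootstrap estimate of the sampling error of G exceeds a tolerance. In the sketch below (ours), both the tolerance and the number of bootstrap replicates are arbitrary choices:

```python
import numpy as np

def largest_reliable_radius(proj, radii, tol=0.05, n_boot=200, rng=None):
    """Largest |s| (from the increasing sequence `radii`) at which the
    empirical cumulant function of the 1-D projected sample `proj` is
    still reliable, in the sense that a bootstrap estimate of its
    sampling error stays below `tol`. This is only a stand-in for the
    criterion of Part 2 (Bernacchia et al., 2008)."""
    rng = rng or np.random.default_rng()

    def G(sample, s):
        t = s * sample
        tmax = t.max()
        return tmax + np.log(np.mean(np.exp(t - tmax)))

    best = 0.0
    for s in radii:
        boots = [G(rng.choice(proj, size=proj.size, replace=True), s)
                 for _ in range(n_boot)]
        if np.std(boots) > tol:
            break
        best = s
    return best
```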

The solutions of our algorithm are expected to transform continuously as the radius |s| varies. Hence, even if the limit |s| → ∞ cannot be taken in practice, the solutions for a finite value of |s| are expected to represent a substantial departure from the PCA solution, towards the formal solution at |s| → ∞. From the theoretical point of view, future work could be devoted to studying in detail the nature of the solutions at varying |s|. For instance, one could attempt to find under which conditions, and to what extent, the solutions lie in between the PCA solution and the formal solution at infinite |s|. From the applicative point of view, we expect several datasets to be fruitfully analyzed with our new method (e.g. Bernacchia et al., 2008).

Note that the logarithm is taken for illustrative purposes: the moment generating function could be used instead of the cumulant function, since the maximization is invariant under the application of a monotonic function. The present definition is however convenient in avoiding extremely large numbers. Centering the data about the mean is also a practical step, related to the constraint of dealing with finite samples: if the limit |s| → ∞ could really be taken, the mean would be irrelevant.

Results of our procedure are corrupted if variables are standardized by a rescaling, since the relative scale of different directions is the key in detecting anomalies and comparing the size of tails. If variables are standardized, our procedure reduces to a special case of Independent Component Analysis (ICA, see Hyvarinen and Oja, 2000), detecting independent components rather than large anomalies. Results are also corrupted if we try to estimate the cumulant function when the underlying probability density decays more slowly than an exponential. In that case, the cumulant function diverges, and the variance of the empirical estimate increases with the size of the sample (Sornette, 2000), implying that the estimate is always unreliable.

Acknowledgements. This work was supported by the European E2-C2 grant, the National Science Foundation (grant: NSF-GMC (ATM-0327936)), by The Weather and Climate Impact Assessment Science Initiative at the National Center for Atmospheric Research (NCAR) and the ANR-AssimilEx project. Finally, we would like to thank I. Bordi, M. Petitta and A. Sutera for valuable discussions, and F. Aires and an anonymous referee for improving the manuscript with their comments.

Edited by: H. Rust

Reviewed by: F. Aires and another anonymous referee

References

Aires, F., Chedin, A., and Nadal, J.: Independent component analysis of multivariate time series: Application to the tropical SST variability, J. Geophys. Res., 105(D13), 17437–17455, 2000.

Azzalini, A. and Capitanio, A.: Statistical applications of the multivariate skew-normal distribution, J. Roy. Stat. Soc., B61, 579–602, 1999.

Azzalini, A. and Dalla Valle, A.: The multivariate skew-normal distribution, Biometrika, 83, 715–726, 1996.

Bell, A. J. and Sejnowski, T. J.: An information-maximization approach to blind separation and blind deconvolution, Neural Computation, 7, 1129–1159, 1995.

Bernacchia, A., Naveau, P., Yiou, P., and Vrac, M.: Detecting spatial patterns with the cumulant function – Part 2: An application to El Niño, Nonlin. Processes Geophys., 15, 169–177, 2008, http://www.nonlin-processes-geophys.net/15/169/2008/.

Christiansen, B.: The shortcomings of nonlinear principal component analysis in identifying circulation regimes, J. Climate, 18, 4814–4823, 2005.

Coles, S.: An introduction to statistical modeling of extreme values, Springer, Berlin, 2001.

Gonzalez-Farias, G., Dominguez-Molina, A., and Gupta, A. K.: Additive properties of skew-normal random vectors, J. Stat. Plan. Infer., 126, 521–534, 2004.

Horel, J. D.: A rotated principal component analysis of the interannual variability of the Northern Hemisphere 500 mb height field, Mon. Weather Rev., 109, 2080–2092, 1981.

Hsieh, W. W.: Nonlinear multivariate and time series analysis by neural network methods, Rev. Geophys., 42, RG1003, doi:10.1029/2002RG000112, 2004.

Hyvarinen, A. and Oja, E.: Independent Component Analysis: algorithms and applications, Neural Networks, 13, 411–430, 2000.

Kenney, J. F. and Keeping, E. S.: Cumulants and the Cumulant-Generating Function, in: Mathematics of Statistics, Princeton, NJ, 1951.

Kotz, S., Balakrishnan, N., and Johnson, N. L.: Continuous multivariate distributions, John Wiley and Sons, New York, 1998.

Kramer, M. A.: Nonlinear principal component analysis using autoassociative neural networks, J. Amer. Inst. Chem. Eng., 37, 233–243, 1991.

Malthouse, E. C.: Limitations of nonlinear PCA as performed with generic neural networks, IEEE Trans. Neural Networks, 9, 165–173, 1998.

Monahan, A. H., Pandolfo, L., and Fyfe, J. C.: The preferred structure of variability of the northern hemisphere atmospheric circulation, Geophys. Res. Lett., 28, 1019–1022, 2001.

Rencher, A. C.: Multivariate statistical inference and applications, John Wiley and Sons, New York, 1998.

Sornette, D.: Critical phenomena in natural sciences, Springer-Verlag, Berlin, 2000.
