
HAL Id: hal-01966116

https://hal.archives-ouvertes.fr/hal-01966116

Submitted on 27 Dec 2018


Paul Honeine

To cite this version:

Paul Honeine. An eigenanalysis of data centering in machine learning. [Research Report] ArXiv. 2016, pp. 1-13. ⟨hal-01966116⟩


An eigenanalysis of data centering in machine learning

Paul Honeine, Member, IEEE

Abstract—Many pattern recognition methods rely on statistical information from centered data, with the eigenanalysis of an empirical central moment, such as the covariance matrix in principal component analysis (PCA), as well as partial least squares regression, canonical-correlation analysis and Fisher discriminant analysis. Recently, many researchers have advocated working on non-centered data.

This is the case for instance with the singular value decomposition approach, with the (kernel) entropy component analysis, with the information-theoretic learning framework, and even with nonnegative matrix factorization. Moreover, one can also consider a non-centered PCA by using the second-order non-central moment.

The main purpose of this paper is to bridge the gap between these two viewpoints in designing machine learning methods. To provide a study at the cornerstone of kernel-based machines, we conduct an eigenanalysis of the inner product matrices from centered and non-centered data. We derive several results connecting their eigenvalues and their eigenvectors. Furthermore, we explore the outer product matrices, by providing several results connecting the largest eigenvectors of the covariance matrix and its non-centered counterpart.

These results lay the groundwork for several extensions beyond conventional centering, with the weighted mean shift, the rank-one update, and the multidimensional scaling. Experiments conducted on simulated and real data illustrate the relevance of this work.

Index Terms—Kernel-based methods, Gram matrix, machine learning, pattern recognition, principal component analysis, kernel entropy component analysis, centering data.

1 INTRODUCTION

Most pattern recognition in machine learning can be explained by performing an eigenanalysis, with an eigendecomposition or a spectral decomposition (i.e., singular value decomposition or SVD) [1]. These machines seek a set of relevant axes from a given dataset. The principal component analysis (PCA) [2] is the most prominent eigenanalysis problem for feature extraction and dimensionality reduction. In this case, the most relevant axes, obtained from the eigendecomposition of the covariance matrix, capture the largest amount of variance in the data. Other machines include multidimensional scaling [3], partial least squares regression (PLS) [4], canonical-correlation analysis (CCA) [5] and its classification-based version known as Fisher discriminant analysis (FDA) [6]. The latter two methods solve a generalized eigendecomposition problem.

Kernel-based machines provide an elegant framework to generalize these linear pattern recognition methods to the nonlinear domain. They rely on the concept of the kernel trick, initially introduced by Aizerman et al. in [7]. The main breakthrough is twofold. On the one hand, most pattern recognition, classification and regression algorithms can be written in terms of inner products between data. On the other hand, by substituting each inner product with a (positive definite) kernel function, a nonlinear transformation is implicitly operated on the data without any significant computational cost. Therefore, the eigenanalysis, as well as most of the operations, are performed on the kernel matrix, which corresponds to an inner product "Gram" matrix in some feature space. This property is revealed in kernelized versions of PCA [8], PLS [9], CCA [10] and FDA [11]. See [12], [13], [14] for a survey of kernel-based machines.

P. Honeine is with the Institut Charles Delaunay (CNRS), Université de Technologie de Troyes, Troyes, France.

In several kernel-based machines, as given for instance in PCA, CCA, and FDA, data should be centered in the feature space, by shifting the origin to the centroid of the data. From an algorithmic point of view, centering is performed easily with matrix algebra, either in batch mode by a subsequent column and row centering of the kernel matrix [8], or in a recursive way when dealing with online learning [15]. From a theoretical point of view, centering reveals central moments, i.e., moments about the centroid/mean of the available data, as well as other related statistics. Well-known central moments include the second-order central moment, also called covariance, which is investigated by the PCA for estimating the maximum-variance directions. Furthermore, these directions minimize the reconstruction error.

Many researchers advocate the use of non-centered data in pattern recognition. Information is extracted directly, either with the data matrix and its Gram matrix, or with non-central moments. Several motivations were revealed in favor of working on non-centered data. The intuitive motivation is the application of the spectral decomposition without data-centering in many pattern recognition and machine learning problems, thus providing a sort of non-centered PCA by using the second-order non-central moment. This is the case for instance in signal analysis and classification [16] and in designing dictionaries for sparse representation [17]. A key motivation towards keeping data non-centered is the nonparametric density estimation with kernel functions [18], as revealed recently with exceptional performance in the (kernel) entropy component analysis (ECA) [19] and the information-theoretic learning framework [20]. See also [21]. A further motivation is that data often deviate from the origin, and the measure of such deviation may constitute an interesting feature, such as in hyperspectral unmixing [22]. It turns out that in many fields of computer science, signal processing and machine learning, acquired data are nonnegative, and even positive. This is the case with the study of gene expressions in bioinformatics [23], with eigenfaces in the computer vision problem of human face recognition [24], with online handwritten character recognition [25], and with numerous applications of nonnegative matrix factorization [26], [27]. As a consequence, nonnegative Gram matrices are pervasive. Such information is lost when centering the data, as well as several interesting properties¹.

The issue of centering the data versus keeping the data uncentered is an open question in pattern recognition: (Pearson) correlation versus congruence coefficients, (centered) PCA versus non-centered PCA (or SVD), covariance and centered Gram matrices versus their non-centered counterparts. In this paper, we study the impact of centering the data on the distribution of the eigenvalues and eigenvectors of both inner-product and outer-product matrices. By examining the Gram matrix and its centered counterpart, we show the interlacing property of their eigenvalues. We devise bounds connecting the eigenvalues of these two matrices, including a lower bound on the largest eigenvalue of the centered Gram matrix. Furthermore, we examine the eigenvectors of the inner product and outer product matrices. We provide connections between the most relevant eigenvector of the covariance matrix and that of the non-centered matrix, a result that corroborates the work in [30].

In our study, we focus on the eigenanalysis of the Gram matrices. This work opens the way to understanding the impact of centering the data in most kernel-based machines. This is shown by bridging the gap between the (centered) PCA and the (non-centered) ECA. Moreover, our work goes beyond the PCA and ECA, since it extends naturally to many kernel-based machines where the eigendecomposition of the Gram matrix is crucial. To this end, we revisit the multidimensional scaling problem where centering is essential. Moreover, we provide extensions beyond conventional mean-centering.

Related (and unrelated) work

In machine learning for classification and discrimination, the issue of centering the data has been addressed in a few publications. In the Bayesian framework proposed in [31], the bias-free formulation of support vector machines (SVM) and least-squares SVM yields the eigendecomposition of the centered Gram matrix, while Gaussian processes yield similar expressions applied on the non-centered Gram matrix. In [32], the authors propose a modified FDA to take into account the fact that centering leads to a singular Gram matrix, even when the non-centered Gram matrix is non-singular. More recently, it is shown in [33] that data and labels should be centered when dealing with the alignment criterion. In [34], the author considers the issue of optimizing the centering as well as the low-rank approximation problem.

1. For instance, the Perron–Frobenius theorem [28], [29] states that, under some mild conditions, the non-negativity of a matrix is inherited by its unique largest eigenvalue and that the corresponding eigenvector has positive components. This result is no longer valid when the data are centered.

In Bayesian statistics, the impact of centering has been extensively studied in multilevel models, when dealing with hierarchically nested models [35], [36]. In this case, centering is performed either with the grand mean, or "partially" with a group mean centering, at different levels [37], [38]. See [39, Chapter 5.2] for a comprehensive review on the centering issue in multilevel modeling. This problem is revisited within a Bayesian framework in [40], [41], with the issue of centered or non-centered parameterisations. This is beyond the scope of this paper, since we investigate non-parametric methods such as PCA and ECA.

To the best of our knowledge, only Cadima and Jolliffe investigated in [30] the relationship between the PCA and its non-centered variant. To this end, they confronted the eigendecomposition of the covariance matrix with the eigendecomposition of its non-centered counterpart. Such a connection can be made thanks to the rank-one update that connects both matrices. In this paper, we study the eigendecompositions of the centered and the non-centered Gram matrices. It is easy to see that this is a much harder problem. By performing an eigenanalysis of the Gram matrices, we provide a framework that integrates the analysis of PCA, ECA, MDS, and beyond. This work opens the door to the study of centering or not in kernel-based machines.

Notation

The matrix $I$ is the $n$-by-$n$ identity matrix, and the vector $\mathbf{1}$ is the all-ones vector of $n$ entries. Then, $\mathbf{1}\mathbf{1}^\top$ is the $n$-by-$n$ matrix of all ones, and $\mathbf{1}^\top\mathbf{1} = n$. Moreover, $\mathbf{1}^\top M \mathbf{1}$ is the grand sum of the matrix $M$. The subscript c identifies the case of centered data, as opposed to the non-centered one: $X_c$ versus $X$, $K_c$ versus $K$, $C_c$ versus $C$, $\lambda_{ci}$ versus $\lambda_i$, etc.

The spectral decomposition (i.e., singular value decomposition) of a matrix $M$ defines its singular values $\sigma_1(M), \sigma_2(M), \ldots$, given in non-increasing order. The $i$-th largest eigenvalue of a square matrix $M$ is denoted by $\lambda_i(M)$, and its trace by $\mathrm{tr}(M)$. A first analysis of the eigenvalues of a matrix is given by its trace, with

$\mathrm{tr}(M) = \sum_i \lambda_i(M)$. (1)

This corresponds to the variance of the data when dealing with the data covariance matrix.


For the sake of clarity, the $i$-th largest eigenvalues of the inner product matrices $K$ and $K_c$ are respectively $\lambda_i$ and $\lambda_{ci}$, namely $\lambda_i = \lambda_i(K)$ and $\lambda_{ci} = \lambda_i(K_c)$. The eigenpair $(\lambda_i, \alpha_i)$ of $K$ denotes its $i$-th largest eigenvalue $\lambda_i$ and its corresponding eigenvector $\alpha_i$. The first eigenvector is the one associated to the largest eigenvalue.

2 PRELIMINARIES

In this section, we present the eigendecomposition problems of the inner product and outer product matrices, for both non-centered and centered data. Then, two kernel-based machines are presented, PCA and ECA, in order to contrast the paradigm of centering or not the data.

2.1 Non-centered data

Consider a set of $n$ available samples, $x_1, x_2, \ldots, x_n$, from a vector space of dimension $d$, with the conventional inner product $x_i^\top x_j$. Let $X = [x_1\ x_2\ \cdots\ x_n]$ be the data matrix, and $K = X^\top X$ be the corresponding Gram matrix, i.e., the inner product matrix. It turns out that this matrix encapsulates the essence of the information in the data, as illustrated with the kernel trick throughout the literature. It is therefore natural to focus on the Gram matrix in our study.

Let $(\lambda_i, \alpha_i)$ be an eigenpair of the matrix $K$, namely

$K \alpha_i = \lambda_i \alpha_i$, (2)

for $i = 1, 2, \ldots, n$, the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ being given in non-increasing order. These quantities describe the spectral decomposition of the matrix $K$, with

$K = A \Lambda A^\top$. (3)

The eigendecomposition of the Gram matrix $K$ is related to the spectral decomposition of the data matrix $X$, since the latter is given by

$X = W \Lambda^{\frac{1}{2}} A^\top$, (4)

where $A$ is the $n$-by-$n$ matrix whose columns are the eigenvectors $\alpha_1, \alpha_2, \ldots, \alpha_n$, and $\Lambda^{\frac{1}{2}}$ is the $d$-by-$n$ rectangular diagonal matrix whose $i$-th diagonal entry is

$\sigma_i(X) = \sqrt{\lambda_i}$. (5)

The $d$ columns of $W$ are called the left-singular vectors of $X$. They define the spectral decomposition of the matrix $X X^\top$, which is known as the realized covariation matrix [42] in financial economics. For a coherent analysis, we consider in this paper the second-order non-central moment matrix $C = \frac{1}{n}\sum_{i=1}^n x_i x_i^\top$, written in matrix form as $C = \frac{1}{n} X X^\top$. Following the spectral decomposition of $X$ in (4), we get

$C w_i = \frac{1}{n} \lambda_i w_i$,

where $W = [w_1\ w_2\ \cdots\ w_d]$. The matrix $C$ is less known than the second-order central moment matrix.

The latter, called covariance matrix, is obtained after centering the data, as given in the following.
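To make these relations concrete, here is a minimal numpy sketch (the dimensions, random data and seed are illustrative assumptions, not taken from the paper) checking the link between the spectral decomposition of $X$ in (4)-(5), the Gram matrix $K$, and the eigenproblem of $C$.

```python
# Minimal sketch: K = X^T X, the SVD X = W Lambda^(1/2) A^T of (4), and the
# non-central moment matrix C = (1/n) X X^T. Data are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20
X = rng.normal(size=(d, n))                 # data matrix, one sample per column

K = X.T @ X                                 # Gram (inner product) matrix, n-by-n
lam = np.sort(np.linalg.eigvalsh(K))[::-1]  # eigenvalues of K, non-increasing order

W, sv, At = np.linalg.svd(X, full_matrices=False)  # X = W diag(sv) At, cf. (4)
assert np.allclose(sv**2, lam[:d])          # sigma_i(X)^2 = lambda_i, cf. (5)

C = X @ X.T / n                             # second-order non-central moment matrix
# Each left-singular vector w_i satisfies C w_i = (lambda_i / n) w_i
assert np.allclose(C @ W, W * (sv**2 / n))
```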

2.2 Centered data

Consider centering the data, $x_{ci} = x_i - \mu$ for $i = 1, 2, \ldots, n$, where $\mu = \frac{1}{n}\sum_{i=1}^n x_i$ is the empirical mean (i.e., centroid). We get in matrix form $X_c = X - \mu \mathbf{1}^\top$, where $\mu = \frac{1}{n} X \mathbf{1}$.

From the spectral decomposition of the matrix $K$ in (3), we can rewrite the norm of the mean as

$\|\mu\|^2 = \frac{1}{n^2} \mathbf{1}^\top K \mathbf{1} = \frac{1}{n^2} \sum_{i=1}^n \lambda_i (\alpha_i^\top \mathbf{1})^2$, (6)

which illustrates how each pair of eigenvalue and eigenvector contributes to the norm of the data mean. The impact of centering the data is clear on both the Gram and the covariance matrices, as illustrated next.

Let $K_c = X_c^\top X_c$ be the Gram matrix of the centered data, with entries $x_{ci}^\top x_{cj}$ for $i, j = 1, 2, \ldots, n$, namely

$K_c = K - \frac{1}{n} \mathbf{1}\mathbf{1}^\top K - \frac{1}{n} K \mathbf{1}\mathbf{1}^\top + \frac{1}{n^2} \mathbf{1}\mathbf{1}^\top K \mathbf{1}\mathbf{1}^\top$. (7)

Let $(\lambda_{ci}, \alpha_{ci})$ be an eigenpair of this matrix, then

$K_c \alpha_{ci} = \lambda_{ci} \alpha_{ci}$. (8)

Let $C_c = \frac{1}{n} X_c X_c^\top$ be the covariance matrix, namely the second-order central moment matrix defined by

$C_c = C - \mu \mu^\top$. (9)

The eigenpairs of this matrix define the eigenproblem

$C_c w_{ci} = \frac{1}{n} \lambda_{ci} w_{ci}$. (10)

Nonlinear extension using kernel functions

A symmetric kernel $\kappa(\cdot,\cdot)$ is called a positive definite kernel if it gives rise to a positive definite matrix, namely for all $n \in \mathbb{N}$ and $x_1, x_2, \ldots, x_n$ we have

$\beta^\top K \beta \geq 0$ (11)

for all vectors $\beta$, where the matrix $K$ has entries $\kappa(x_i, x_j)$. Such kernel functions provide a nonlinear extension of the conventional inner product since, thanks to Mercer's theorem [43], [44], $\kappa(x_i, x_j)$ corresponds to an inner product between transformed samples $x_i^\phi$ and $x_j^\phi$ in some feature space, namely

$\kappa(x_i, x_j) = x_i^{\phi\top} x_j^\phi$. (12)

Examples of nonlinear kernels include the polynomial kernel, of the form $(c + x_i^\top x_j)^p$, and the Gaussian kernel $\exp(-\frac{1}{2\sigma^2}\|x_i - x_j\|^2)$, where $\sigma$ is the tunable bandwidth parameter.

It turns out that the derivations given in this paper can be easily generalized to kernel functions, since a kernel matrix corresponds to a Gram matrix in some feature space. Centering is performed in the feature space, with the centroid² $\mu^\phi = \frac{1}{n} X^\phi \mathbf{1}$. As a consequence, we get the same expression as in (6), with $\|\mu^\phi\|^2 = \frac{1}{n^2} \mathbf{1}^\top K \mathbf{1}$.

2. In most pattern recognition tasks, one does not need to quantify $\mu^\phi$, but only its inner product with any $x_i^\phi$. One can estimate its counterpart in the input space. This is the pre-image problem, which is clearly beyond the scope of this paper. See [45] for a recent survey.
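As a rough illustration of the above, the following numpy sketch (the Gaussian kernel, its bandwidth and the random data are assumptions made for the example) builds a kernel matrix, applies the double centering of (7), and checks the identity (6) for the squared norm of the (feature-space) mean.

```python
# Minimal sketch: double centering of a kernel matrix, cf. (7), and the
# mean-norm identity (6). Data and kernel bandwidth are placeholders.
import numpy as np

rng = np.random.default_rng(1)
d, n, sigma = 4, 30, 1.5
X = rng.normal(size=(d, n))

# Gaussian kernel matrix with entries exp(-||x_i - x_j||^2 / (2 sigma^2))
sq_dists = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
K = np.exp(-sq_dists / (2 * sigma**2))

one = np.ones((n, 1))
P1 = one @ one.T / n                          # (1/n) 1 1^T, projection onto the all-ones vector
Kc = (np.eye(n) - P1) @ K @ (np.eye(n) - P1)  # centered kernel matrix

# Same matrix via the explicit double centering of (7)
Kc2 = K - P1 @ K - K @ P1 + (one.T @ K @ one) / n**2 * (one @ one.T)
assert np.allclose(Kc, Kc2)

# Identity (6): ||mu||^2 = (1/n^2) 1^T K 1 = (1/n^2) sum_i lambda_i (alpha_i^T 1)^2
lam, A = np.linalg.eigh(K)
mu_sq = (one.T @ K @ one).item() / n**2
assert np.isclose(mu_sq, np.sum(lam * (A.T @ one).ravel() ** 2) / n**2)
```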


2.3 The paradigm of centering or not the data

In order to confront centering the data with keeping the data non-centered, we present next two well-known machines for pattern recognition. On the one hand, advocating the use of central matrices, the PCA is presented in its two forms, the conventional and the kernelized formulations. On the other hand, advocating the exploration of non-centered data, nonparametric density estimation is presented through the ECA.

2.3.1 Case study 1: Principal component analysis

The PCA seeks the axes that capture most of the variance within the data [2]. It is well-known³ that these axes are defined by the eigenvectors associated to the largest eigenvalues of the covariance matrix $C_c$, with

$C_c w_{ci} = \frac{1}{n} \lambda_{ci} w_{ci}$.

Then, $\frac{1}{n} \lambda_{ci}$ measures the variance along the axis $w_{ci}$, while the normalized $i$-th eigenvalue $\pi_{ci} = \frac{\lambda_{ci}}{\sum_j \lambda_{cj}}$ accounts for the proportion of total variation. It is worth noting from (1) that the total variance of the data is given by the trace of $C_c$.

By substituting the definition of $C_c$ in the eigenproblem (10), we get

$w_{ci} = \frac{n}{\lambda_{ci}} C_c w_{ci} = \frac{1}{\lambda_{ci}} \sum_{j=1}^n (x_{cj}^\top w_{ci})\, x_{cj}$, (13)

and therefore each $w_{ci}$ lies in the span of the data. This means that there exists a vector $\alpha_{ci}$ such that $w_{ci} = X_c \alpha_{ci}$. By injecting this relation in the above expression, we get $X_c \alpha_{ci} = \frac{1}{\lambda_{ci}} X_c X_c^\top X_c \alpha_{ci}$. Equivalently, by multiplying each side by $X_c^\top$, we get $K_c \alpha_{ci} = \frac{1}{\lambda_{ci}} K_c^2 \alpha_{ci}$. Thus, we have the following eigenproblem

$K_c \alpha_{ci} = \lambda_{ci} \alpha_{ci}$.

The eigenvectors of $K_c$ allow one to identify the projection of any sample $x$ onto the corresponding eigenvectors of $C_c$. Representing it on a set of eigenvectors is given by

$\sum_k (x^\top w_{ck})\, w_{ck} = \sum_k (x^\top X_c \alpha_{ck})\, X_c \alpha_{ck}$. (14)

In order to satisfy the unit-norm of $w_{ci}$ for any $i$, the eigenvectors of $K_c$ should be normalized. Indeed, we have $w_{ci}^\top w_{ci} = \alpha_{ci}^\top X_c^\top X_c \alpha_{ci} = \alpha_{ci}^\top K_c \alpha_{ci} = \lambda_{ci}\, \alpha_{ci}^\top \alpha_{ci}$. Therefore, one can define each feature $w_{ci}$ directly from the eigenvector $\alpha_{ci}$ of the Gram matrix $K_c$, after normalization such that

$\|\alpha_{ci}\|^2 = \frac{1}{\lambda_{ci}}$. (15)

This scaling preserves the variance of the data along the respective axes [46]. When one needs to fix the scale along all these directions, the following normalization is considered:

$\|\alpha_{ci}\|^2 = \frac{1}{\lambda_{ci}^2}$. (16)

This scaling sets a unit variance for the projected data on each axis [47].

3. By writing these vectors in a matrix $W$, maximizing the variance of the projected data can be written as $\arg\max \mathrm{tr}(W^\top C_c W)$, under the constraint $W^\top W = I$. By using Lagrange multipliers and taking the derivative of the resulting cost function, we get the corresponding eigenproblem.
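The two routes of this subsection can be contrasted numerically. The sketch below (a linear kernel on assumed random data, with hypothetical dimensions) computes the projections either from the eigenvectors of the covariance matrix $C_c$ or from the eigenvectors of the centered Gram matrix $K_c$ normalized as in (15); both give the same values, up to the sign ambiguity of eigenvectors.

```python
# Minimal sketch: PCA through the centered Gram matrix Kc. The normalization
# ||alpha_ci||^2 = 1/lambda_ci of (15) makes w_ci = Xc alpha_ci unit-norm, so the
# projections match those obtained from the covariance matrix Cc.
import numpy as np

rng = np.random.default_rng(2)
d, n, r = 4, 25, 3                       # keep the r leading components
X = rng.normal(size=(d, n))
Xc = X - X.mean(axis=1, keepdims=True)   # centered data

# Outer product route: leading eigenvectors of the covariance matrix Cc
Cc = Xc @ Xc.T / n
_, W = np.linalg.eigh(Cc)
W = W[:, ::-1][:, :r]                    # w_c1, ..., w_cr

# Inner product route: leading eigenvectors of the centered Gram matrix Kc
Kc = Xc.T @ Xc
lam_c, Ac = np.linalg.eigh(Kc)
lam_c, Ac = lam_c[::-1][:r], Ac[:, ::-1][:, :r]
Ac = Ac / np.sqrt(lam_c)                 # normalization (15)

x = rng.normal(size=d)                   # a new sample to project
proj_outer = W.T @ x                     # x^T w_ci
proj_inner = Ac.T @ (Xc.T @ x)           # x^T Xc alpha_ci, cf. (14)
assert np.allclose(np.abs(proj_outer), np.abs(proj_inner))  # equal up to sign
```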

2.3.2 Case study 2: Nonparametric density estimation

Nonparametric density estimation is essential in many applied mathematical problems. Many machine learning techniques are based on density estimation, often with a Parzen window approach [48], [49]. This is the case of the information-theoretic methods [20], which are essentially based on the quadratic Rényi entropy, of the form $-\log \int p(x)^2\, dx$ for a probability density $p(\cdot)$. It is also the case of the (kernel) entropy component analysis (ECA) for data transformation and dimensionality reduction [19]. See also [18] for more details.

Often unknown, the probability density $p(\cdot)$ is estimated using a Parzen estimator of the form $\hat{p}(x) = \frac{1}{n}\sum_{i=1}^n \kappa(x, x_i)$, for a given kernel function centered at each available sample $x_i$. By using the kernel matrix $K$, the estimator is given by

$\hat{p}(x) = \frac{1}{n} \sum_{i=1}^n \kappa(x, x_i) = \frac{1}{n} \mathbf{1}^\top \kappa(x)$,

where $\kappa(x)$ is the vector of entries $\kappa(x, x_i)$ for $i = 1, 2, \ldots, n$. From the definition (12), it is easy to see that this expression corresponds to the inner product between the sample and the mean, with

$\hat{p}(x) = x^{\phi\top} \Big( \frac{1}{n} \sum_{i=1}^n x_i^\phi \Big) = x^{\phi\top} \mu^\phi$.

The quantity $\int p(x)^2\, dx$, related to the quadratic Rényi entropy, is therefore estimated by $\int \hat{p}(x)^2\, dx = \|\mu^\phi\|^2$, which leads to

$\int \hat{p}(x)^2\, dx = \frac{1}{n^2} \sum_{i=1}^n \lambda_i (\alpha_i^\top \mathbf{1})^2$, (17)

where (6) is used. This expression uncovers the composition of the entropy in terms of the eigenvectors of $K$. This is the main motivation of the ECA, where the relevant eigenvectors are selected in order to maximize the estimated entropy, thus favoring the smallest terms $\lambda_i (\alpha_i^\top \mathbf{1})^2$. Therefore, any eigenvector $\alpha_i$ for which $\alpha_i^\top \mathbf{1} \neq 0$ and $\lambda_i \neq 0$ contributes to the entropy estimate, in contrast with the PCA where only eigenvectors associated to non-zero eigenvalues contribute to the variance.

Finally, we emphasize that non-centered data is used in the density estimation. Centering the data leads to a null density estimator, which yields an infinite quadratic Rényi entropy. This fact is also corroborated by several studies, including the (kernel) entropy component analysis. See [19] for more details.
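The decomposition (17) is easy to reproduce numerically. The sketch below (a one-dimensional Gaussian Parzen window on assumed random samples, with a hypothetical bandwidth and test point) computes the estimate of $\int \hat{p}(x)^2 dx$ both from the grand sum of $K$ and from its eigen-decomposition, the quadratic Rényi entropy estimate being the negative logarithm of that quantity.

```python
# Minimal sketch: Parzen estimate and the entropy-related quantity of (17).
# Samples, bandwidth and the test point are placeholders.
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 50, 0.8
x = rng.normal(size=n)                       # 1-D samples

# Gaussian kernel matrix on the samples
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma**2))
one = np.ones(n)

# Parzen estimate at an arbitrary test point t: phat(t) = (1/n) 1^T kappa(t)
t = 0.3
phat_t = np.exp(-(t - x) ** 2 / (2 * sigma**2)).sum() / n

# Estimate of int phat^2 via the grand sum of K ...
V_gram = one @ K @ one / n**2
# ... and via the eigen-decomposition of K, cf. (17)
lam, A = np.linalg.eigh(K)
V_eig = np.sum(lam * (A.T @ one) ** 2) / n**2
assert np.isclose(V_gram, V_eig)

renyi_entropy_estimate = -np.log(V_gram)     # quadratic Renyi entropy estimate
print(phat_t, renyi_entropy_estimate)
```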


2.4 Features from centered vs non-centered data

Even in conventional PCA, the issue of centering is still an open question⁴. To the best of our knowledge, only the work in [30] studied the link between the eigendecompositions of $C$ and $C_c$. Expression (9) reveals the rank-one update nature between these two matrices. The analysis of the Gram matrices $K$ and $K_c$ is obviously a much harder problem, as illustrated in expression (7).

In this paper, we take the initiative to study the eigenanalysis of the inner product matrices, $K$ and $K_c$, and carry on with the eigenanalysis of the outer product matrices, $C$ and $C_c$. It turns out that the work [30] on $C$ and $C_c$ can be derived easily from the proposed approach. Moreover, the study of the inner product matrices broadens the scope of the work to the analysis of all kernel-based techniques [12], [17], [19], beyond the PCA approach. The next section gives the main contributions of this paper, while this study is completed in Section 4 by several extensions, beyond centering.

3 MAIN RESULTS

The main results are given next, in two parts: the (inner product) Gram matrix and the (outer product) covariance matrix. For each, we describe the relations between the eigenvectors of the centered matrix and those obtained from the non-centered one. The relations between the eigenvalues are studied by examining the inner product matrices, which allows the generalization of these results to nonlinear kernel functions. Before that, we briefly introduce the orthogonal projection and its link to the centering operation.

Background on orthogonal projections

The matrix of orthogonal projections onto the subspace spanned by the columns of a given matrix $M$ is defined by $P_M = M (M^\top M)^{-1} M^\top$, while $I - P_M$ denotes the projection onto its orthogonal complement. Projections are idempotent transformations, i.e., $P_M P_M = P_M$. In particular, we are interested in this paper in the projection onto the all-ones vector $\mathbf{1}$ of $n$ entries, with

$P_1 = \frac{1}{n} \mathbf{1}\mathbf{1}^\top$. (18)

By considering the projection of the data onto the subspace spanned by $\mathbf{1}$ and its complement, we have⁵

$X P_1 = \mu \mathbf{1}^\top$. (19)

4. For instance in R (The R Project for Statistical Computing) [50], there are two ways to perform PCA: on the one hand, the R-mode, by using the eigendecomposition of the covariance matrix as given in (10), with the function princomp; and on the other hand, the Q-mode, by using the singular value decomposition of the non-centered data, as given in (4), with the function svd.

5. Since the data are given column-wise in the matrix $X$, the mean $\mu$ obtained from the operation $X P_1$ is the vector of means of each row of $X$. This is in opposition to the operation $P_1 X$, which provides the means of each of the columns, i.e., each $x_i$.

It is easy to see that the matrix of centered data is given by $X_c = X(I - P_1)$ and verifies the identity $X_c \mathbf{1} = 0$. More generally, we have

$(I - P_1)\mathbf{1} = \mathbf{1}^\top (I - P_1) = 0$. (20)

Finally, for any square matrix $M$ of appropriate size:

$P_1 M P_1 = \frac{1}{n^2} \mathbf{1}\mathbf{1}^\top M \mathbf{1}\mathbf{1}^\top = \frac{\mathbf{1}^\top M \mathbf{1}}{n} P_1$. (21)

This yields

$\mathrm{tr}(P_1 M P_1) = \mathrm{tr}(P_1 M) = \mathrm{tr}(M P_1) = \frac{1}{n} \mathbf{1}^\top M \mathbf{1}$, (22)

where the cyclic property of the trace operator is used in the first two equalities. Therefore, we have

$\mathrm{tr}\big((I - P_1) M (I - P_1)\big) = \mathrm{tr}(M) - \frac{1}{n} \mathbf{1}^\top M \mathbf{1}$. (23)
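These identities are straightforward to verify numerically; a minimal sketch (random data and matrix assumed) follows.

```python
# Minimal sketch checking the projection identities (18)-(23).
import numpy as np

rng = np.random.default_rng(4)
d, n = 3, 12
X = rng.normal(size=(d, n))
one = np.ones((n, 1))
P1 = one @ one.T / n                                  # (18)

mu = X.mean(axis=1, keepdims=True)
assert np.allclose(X @ P1, mu @ one.T)                # (19): X P1 = mu 1^T
assert np.allclose((np.eye(n) - P1) @ one, 0)         # (20)
Xc = X @ (np.eye(n) - P1)                             # centered data
assert np.allclose(Xc @ one, 0)                       # Xc 1 = 0

M = rng.normal(size=(n, n))
grand = (one.T @ M @ one).item()                      # grand sum 1^T M 1
assert np.allclose(P1 @ M @ P1, grand / n * P1)       # (21)
assert np.isclose(np.trace(P1 @ M @ P1), grand / n)   # (22)
lhs = np.trace((np.eye(n) - P1) @ M @ (np.eye(n) - P1))
assert np.isclose(lhs, np.trace(M) - grand / n)       # (23)
```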

3.1 Inner product – the Gram matrix $K_c$

Let $K = X^\top X$ be the Gram matrix, then its centered counterpart is $K_c = X_c^\top X_c$, namely

$K_c = (I - P_1) K (I - P_1)$ (24)
$\;\;\;\;\;= K - P_1 K - K P_1 + \frac{\mathbf{1}^\top K \mathbf{1}}{n} P_1$ (25)
$\;\;\;\;\;= K - P_1 K - K P_1 + n \|\mu\|^2 P_1$, (26)

where relation (21) is applied on $K$, and the last equality is due to (6). These expressions reveal the double centering, which corresponds to subtracting the row and column means of the matrix $K$ from its entries, and adding the grand mean. A byproduct of the double centering is obtained from (24) and (20): 0 is the eigenvalue of $K_c$ associated to the eigenvector $\frac{1}{\sqrt{n}} \mathbf{1}$.

A first study to understand the distribution of the data associated to each matrix, K and Kc, is given by their traces, since they correspond to the sum of all eigenvalues. The following lemma provides a measure of the variance reduction due to centering.

Lemma 1. Let $K$ and $K_c$ be respectively the Gram matrix and its centered counterpart, then their corresponding traces verify the following relationships:

$\mathrm{tr}(K_c) = \mathrm{tr}(K) - n \|\mu\|^2$, and $\frac{\mathrm{tr}(K_c)}{\mathrm{tr}(K)} = 1 - \frac{\mathbf{1}^\top K \mathbf{1}}{n\, \mathrm{tr}(K)}$.

Hence, their eigenvalues verify $\sum_{i=1}^n \lambda_{ci} = \sum_{i=1}^n \lambda_i - n \|\mu\|^2$.

Proof: The proof is straightforward, by applying (23) to $K$ given in definition (24), with $\mathbf{1}^\top K \mathbf{1} = n^2 \|\mu\|^2$.
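A quick numerical check of Lemma 1, on assumed random data:

```python
# Minimal sketch: tr(Kc) = tr(K) - n ||mu||^2, hence the sums of eigenvalues
# differ exactly by n ||mu||^2 (Lemma 1).
import numpy as np

rng = np.random.default_rng(5)
d, n = 4, 15
X = rng.normal(size=(d, n))
one = np.ones((n, 1))
P1 = one @ one.T / n

K = X.T @ X
Kc = (np.eye(n) - P1) @ K @ (np.eye(n) - P1)
mu = X.mean(axis=1)

assert np.isclose(np.trace(Kc), np.trace(K) - n * (mu @ mu))
assert np.isclose(np.linalg.eigvalsh(Kc).sum(),
                  np.linalg.eigvalsh(K).sum() - n * (mu @ mu))
```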

Next, we explore beyond the sum of the eigenvalues. In Section 3.1.1, we show that the eigenvalues of $K$ are interlaced with those of $K_c$. In Section 3.1.2, we provide bounds on the sum of the largest $t$ eigenvalues, for any $t$, and in particular a lower bound on the largest eigenvalue of $K_c$.


3.1.1 Eigenvalue interlacing theorems for $K$ and $K_c$

Before proceeding, the following theorem from [51, Theorem 5.9] is central to our study. It is worth noting that this theorem has been known in the literature for some time, see for instance [52, Appendix A], and is obtained from the Poincaré Separation Theorem [53], [54] (see also [29, Corollary 4.3.37, page 248]).

Theorem 2 (Separation Theorem [51], [52]). Let $M$ be a $d$-by-$n$ matrix. Let two orthogonal projection matrices be $P_{\mathrm{left}}$, of size $d$-by-$d$, and $P_{\mathrm{right}}$, of size $n$-by-$n$. Then,

$\sigma_{j+t}(M) \leq \sigma_j(P_{\mathrm{left}}\, M\, P_{\mathrm{right}}) \leq \sigma_j(M)$,

where $\sigma_j(\cdot)$ denotes the $j$-th largest singular value of the matrix, $t = d - r(P_{\mathrm{right}}) + n - r(P_{\mathrm{left}})$, and $r(\cdot)$ is the rank of the matrix.

Theorem 3. Let $K$ and $K_c$ be respectively the Gram matrix and its centered counterpart, then their eigenvalues are interlaced, such that

$\lambda_{j+1} \leq \lambda_{cj} \leq \lambda_j$,

with $\lambda_{cn} = 0$, where $\lambda_j$ and $\lambda_{cj}$ denote respectively the $j$-th largest eigenvalue of the matrices $K$ and $K_c$.

Proof: To prove this, we apply the Separation Theorem 2 with $M = X$, $P_{\mathrm{left}}$ being the $d$-by-$d$ identity matrix, and $P_{\mathrm{right}} = (I - P_1)$, where $r(P_{\mathrm{left}}) = d$ and $r(P_{\mathrm{right}}) = n - 1$, and thus $t = 1$. In this case, we get

$\sigma_{j+1}(X) \leq \sigma_j(X(I - P_1)) \leq \sigma_j(X)$.

Relations (5) and (24) are used to conclude the proof.
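The interlacing can be observed numerically; a minimal sketch (assumed random data) sorts both spectra and checks the two inequalities of Theorem 3, together with $\lambda_{cn} = 0$.

```python
# Minimal sketch: eigenvalue interlacing lambda_{j+1} <= lambda_c_j <= lambda_j.
import numpy as np

rng = np.random.default_rng(6)
d, n = 6, 10
X = rng.normal(size=(d, n))
one = np.ones((n, 1))
P1 = one @ one.T / n

K = X.T @ X
Kc = (np.eye(n) - P1) @ K @ (np.eye(n) - P1)

lam = np.sort(np.linalg.eigvalsh(K))[::-1]      # non-increasing
lam_c = np.sort(np.linalg.eigvalsh(Kc))[::-1]

tol = 1e-10
assert np.all(lam_c <= lam + tol)               # lambda_c_j <= lambda_j
assert np.all(lam[1:] <= lam_c[:-1] + tol)      # lambda_{j+1} <= lambda_c_j
assert abs(lam_c[-1]) < 1e-8                    # lambda_c_n = 0
```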

Corollary 4. Let $K$ and $K_c$ be respectively the Gram matrix and its centered counterpart, then their proportions of total variation accounted for by the eigenvectors are interlaced, such that

$\pi_{j+1} \leq \gamma\, \pi_{cj} \leq \pi_j$,

where $\pi_j = \frac{\lambda_j}{\sum_i \lambda_i}$, $\pi_{cj} = \frac{\lambda_{cj}}{\sum_i \lambda_{ci}}$, and $\gamma = \frac{\mathrm{tr}(K_c)}{\mathrm{tr}(K)}$.

Proof: The proof is direct, on the one hand by dividing the inequalities of Theorem 3 by the trace of $K$, and on the other hand by setting $\gamma = \frac{\mathrm{tr}(K_c)}{\mathrm{tr}(K)}$ with the direct application of (1).

All these results show the impact of centering the data on the distribution of the eigenvalues, where the eigenvalues of $K_c$ are sandwiched between the eigenvalues of $K$. This illustrates that $K_c$ behaves like a "coarse" matrix compared to $K$.

3.1.2 Bounds on the eigenvalues of $K$ and $K_c$

In this section, we provide lower bounds on the largest eigenvalues of the matrices $K$ and $K_c$. To this end, we state the Schur–Horn Theorem [55]. See [56, Chapter 9] for a recent review on the theory of majorization.

Theorem 5 (Schur–Horn Theorem). For any $n$-by-$n$ symmetric matrix with diagonal entries $d_1, d_2, \ldots, d_n$ and eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ given in non-increasing order, we have

$\sum_{i=1}^t d_i \leq \sum_{i=1}^t \lambda_i$,

for any $t = 1, 2, \ldots, n$, with equality for $t = n$.

The Schur–Horn Theorem has been proven in the situation when the diagonal entries $d_1, d_2, \ldots, d_n$ are also given in non-increasing order. Still, one can also use any subset of the diagonal entries, although the resulting lower bound in the above theorem may not be as tight as when using the statement $d_n \leq \ldots \leq d_2 \leq d_1$.

By applying this theorem to both matrices $K$ and $K_c$, we get the following result.

Lemma 6. Let $K$ and $K_c$ be the Gram matrix and its centered counterpart with, respectively, diagonal entries $d_1, d_2, \ldots, d_n$ and $d_{c1}, d_{c2}, \ldots, d_{cn}$, and eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ and $\lambda_{c1}, \lambda_{c2}, \ldots, \lambda_{cn}$, given in non-increasing order. Then

$\sum_{i=1}^t d_i \leq \sum_{i=1}^t \lambda_i$ and $\sum_{i=1}^t d_{ci} \leq \sum_{i=1}^t \lambda_{ci}$,

with equalities for $t = n$.

The direct application of the Schur–Horn Theorem separately on each matrix, $K$ and $K_c$, does not give any particular result. The following lemma is a first step towards a connection between the eigenvalues of both matrices, and allows us to establish the bounds given in Theorem 8.

Lemma 7. Let $K = A \Lambda A^\top$ and $K_c = A_c \Lambda_c A_c^\top$ be the spectral decompositions of the Gram matrices. Then the spectral decomposition of the matrix $A^\top K_c A$ is $A^\top A_c \Lambda_c (A^\top A_c)^\top$, and that of the matrix $A_c^\top K A_c$ is $A_c^\top A \Lambda (A_c^\top A)^\top$.

It is easy to prove these results, by replacing either $K_c$ or $K$ by its spectral decomposition and verifying that the product of two orthonormal matrices is orthonormal.

This lemma shows that the matrix $A^\top K_c A$ has the same eigenvalues as the matrix $K_c$, and its eigenvectors are the columns of $A^\top A_c$, which is the matrix whose entries are the inner products between the eigenvectors of $K$ and the eigenvectors of $K_c$ (i.e., $\alpha_i^\top \alpha_{cj}$). Since both matrices $K_c$ and $A^\top K_c A$ share the same eigenvalues, we propose next to apply the Schur–Horn Theorem on the latter matrix.
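A quick numerical check of Lemma 7, on assumed random data: $A^\top K_c A$ has the spectrum of $K_c$ and is diagonalized by the columns of $A^\top A_c$.

```python
# Minimal sketch: A^T Kc A shares the eigenvalues of Kc, with eigenvectors A^T Ac.
import numpy as np

rng = np.random.default_rng(8)
d, n = 4, 9
X = rng.normal(size=(d, n))
P1 = np.ones((n, n)) / n

K = X.T @ X
Kc = (np.eye(n) - P1) @ K @ (np.eye(n) - P1)

_, A = np.linalg.eigh(K)                  # eigenvectors of K (columns of A)
lam_c, Ac = np.linalg.eigh(Kc)            # eigenpairs of Kc

M = A.T @ Kc @ A
assert np.allclose(np.sort(np.linalg.eigvalsh(M)), np.sort(lam_c))  # same spectrum
V = A.T @ Ac                              # inner products alpha_i^T alpha_cj
assert np.allclose(V.T @ M @ V, np.diag(lam_c))                     # V diagonalizes M
```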

Theorem 8. The sum of the largest $t$ (for any $t = 1, 2, \ldots, n$) eigenvalues of the matrix $K_c$ is lower bounded as follows:

$\sum_{i=1}^t d_i \leq \sum_{i=1}^t \lambda_{ci}$,

where $d_i$ is the $i$-th largest value of $\lambda_i + (\|\mu\|^2 - \frac{2}{n} \lambda_i)(\alpha_i^\top \mathbf{1})^2$, for $i = 1, 2, \ldots, n$. Here, $(\lambda_i, \alpha_i)$ is an eigenpair of the matrix $K$. This expression provides a lower bound on the largest eigenvalue of $K_c$, obtained with $t = 1$.
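The bound can be checked numerically. In the sketch below (assumed random data), the quantities $\lambda_i + (\|\mu\|^2 - \frac{2}{n}\lambda_i)(\alpha_i^\top \mathbf{1})^2$ are verified to be the diagonal entries of $A^\top K_c A$, and the Schur–Horn bound of Theorem 8 is checked for every $t$.

```python
# Minimal sketch: lower bound of Theorem 8 on the partial sums of eigenvalues of Kc.
import numpy as np

rng = np.random.default_rng(7)
d, n = 5, 12
X = rng.normal(size=(d, n))
one = np.ones(n)
P1 = np.outer(one, one) / n

K = X.T @ X
Kc = (np.eye(n) - P1) @ K @ (np.eye(n) - P1)
mu = X.mean(axis=1)

lam, A = np.linalg.eigh(K)
lam, A = lam[::-1], A[:, ::-1]                     # non-increasing order
lam_c = np.sort(np.linalg.eigvalsh(Kc))[::-1]

# Values lambda_i + (||mu||^2 - (2/n) lambda_i) (alpha_i^T 1)^2: these are the
# diagonal entries of A^T Kc A (Lemma 7 gives its spectrum as that of Kc).
vals = lam + (mu @ mu - 2 * lam / n) * (A.T @ one) ** 2
assert np.allclose(vals, np.diag(A.T @ Kc @ A))

d_sorted = np.sort(vals)[::-1]                     # d_1 >= d_2 >= ... >= d_n
tol = 1e-10
for t in range(1, n + 1):                          # Schur-Horn bound for every t
    assert d_sorted[:t].sum() <= lam_c[:t].sum() + tol
```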
