Covariance estimation on matrix manifolds

by

Antoni Musolas

S.M., Computation for Design and Optimization, MIT (2016)

Submitted to the Department of Aeronautics and Astronautics

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Computational Science and Engineering

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2020

© Massachusetts Institute of Technology 2020. All rights reserved.

Author . . . .

Department of Aeronautics and Astronautics

April 10, 2020

Certified by . . . .

Youssef M. Marzouk

Professor of Aeronautics and Astronautics

Thesis Supervisor

Certified by . . . .

Alan Edelman

Professor of Mathematics

Thesis Committee Member

Certified by . . . .

Nicholas Roy

Professor of Aeronautics and Astronautics

Thesis Committee Member

Accepted by . . . .

Sertac Karaman

Associate Professor of Aeronautics and Astronautics

Chairman, Graduate Program Committee

Accepted by . . . .

Nicolas Hadjiconstantinou

Professor of Mechanical Engineering

Co-Director, Center for Computational Science and Engineering


Covariance estimation on matrix manifolds

by

Antoni Musolas

Submitted to the Department of Aeronautics and Astronautics on April 10, 2020, in partial fulfillment of the

requirements for the degree of

Doctor of Philosophy in Computational Science and Engineering

Abstract

The estimation of covariance matrices is a fundamental problem in multivariate analysis and uncertainty quantification. Covariance matrices are an essential modeling tool in climatology, econometrics, model reduction, biostatistics, signal processing, and geostatistics, among other applications. In practice, covariances often must be estimated from samples. While the sample covariance matrix is a consistent estimator, it performs poorly when the relative number of samples is small; improved estimators that impose structure must be considered. Yet standard parametric covariance families can be insufficiently flexible for many applications, and non-parametric approaches may not easily allow certain kinds of prior knowledge to be incorporated. In this thesis, we harness the structure of the manifold of symmetric positive-(semi)definite matrices to build families of covariance matrices out of geodesic curves. These covariance families offer more flexibility for problem-specific tailoring than classical parametric families, and are preferable to simple convex combinations. Moreover, the proposed families can be interpretable: the internal parameters may serve as explicative variables for the problem of interest. Once a covariance family has been chosen, one typically needs to select a representative member by solving an optimization problem, e.g., by maximizing the likelihood associated with a data set. Consistent with the construction of the covariance family, we propose a differential geometric interpretation of this problem: minimizing the natural distance on the covariance manifold. Our approach does not require assuming a particular probability distribution for the data.

Within this framework, we explore two different estimation settings. First, we consider problems where representative “anchor” covariance matrices are available; these matrices may result from offline empirical observations or computational simulations of the relevant spatiotemporal process at related conditions. We connect multiple anchors to build multi-parametric covariance families, and then project new observations onto this family—for instance, in online estimation with limited data. We explore this problem in the full-rank and low-rank settings. In the former, we show that the proposed natural distance-minimizing projection and maximum likelihood are locally equivalent up to second order. In the latter, we devise covariance families and minimization schemes based on generalizations of multi-linear and Bézier interpolation to the appropriate manifold.

Second, for problems where anchor matrices are unavailable, we propose a geodesic reformulation of the classical shrinkage estimator: that is, we construct a geodesic family that connects the identity (or any other target) matrix to the sample covariance matrix and minimize the expected natural distance to the true covariance. The proposed estimator inherits the properties of the geodesic distance, for instance, invariance to inversion. Leveraging previous results, we propose a solution heuristic that compares favorably with recent non-linear shrinkage estimators.

We demonstrate these covariance families and estimation approaches in a range of synthetic examples, and in applications including wind field modeling and groundwater hydrology.

Thesis Supervisor: Youssef M. Marzouk

Title: Professor of Aeronautics and Astronautics

Thesis Committee Member: Alan Edelman

Title: Professor of Mathematics

Thesis Committee Member: Nicholas Roy


Acknowledgments

I am deeply indebted to my Ph.D. advisor Youssef Marzouk for his superior guidance and exceptional mentorship throughout these years. His values and standards will continue to resonate in me for years to come.

I want to thank the members of my thesis committee, Alan Edelman and Nick Roy; my academic mentors, Juanjo Egozcue, Gordon Kaufman, and Antoni Subirà; and my collaborators, Pierre-Antoine Absil, Julien Hendrickx, Estelle Massart, and Steve Smith, for countless conversations that have enhanced the quality of this dissertation. Special thanks to “la Caixa” Banking Foundation (ID 100010434, project LCF/BQ/AN13/10280009) for making this journey possible and for introducing me to the “la Caixa” USA 2013 scholars, with some of whom I have developed long-lasting relationships. The present research has also been supported by the Air Force Office of Scientific Research (Computational Mathematics Program) and the United States Department of Energy Office of Advanced Scientific Computing Research.

To Alessio, Andrea, Angxiu, Ben, Chi, Daniele, Ferran, Mario, Mathieu, Lucio, Lutao, Olivier, Pablo, Patrick, Rémi, Ricardo and Zheng for unforgettable memories during my time at MIT.

This thesis is dedicated to my loved ones and those who love me. Your uncondi-tional support has served as a source of passion and inspiration.


Contents

1 Introduction 19

1.1 Motivation . . . 19

1.2 Objectives . . . 21

1.3 Outline. . . 21

2 Notation and definitions 23

2.1 The geometry of the symmetric positive-definite cone . . . 23

2.2 The geometry of the set of positive-semidefinite matrices . . . 25

3 Geodesically parameterized covariance estimation 29

3.1 Introduction . . . 29

3.2 Tools for covariance estimation on a geodesic family . . . 32

3.2.1 Definition of the geodesic covariance family. . . 32

3.2.2 Spectral functions . . . 33

3.3 Problem statement: covariance estimation in a geodesic family . . . . 34

3.4 Properties of the covariance estimation problem, one-parameter case . . . 36

3.5 Local analysis and comparison of the projections . . . 40

3.6 Generalization to p-parameter covariance families . . . 45

3.7 Case study: hydraulic head in an aquifer . . . 47

3.7.1 Analytical model . . . 48

3.7.2 Construction of the covariance family . . . 48

3.7.3 Regularization of a solution obtained with a reduced number of Monte Carlo instances . . . 49


3.7.4 Regularization of a noisy solution . . . 51

3.7.5 Performance of proposed algorithm for multi-parametric families . . . 53

3.8 Discussion . . . 55

4 Low-rank multi-parametric covariance estimation 59

4.1 Introduction . . . 59

4.2 Problem statement: low-rank covariance estimation . . . 61

4.3 Construction of low-rank covariance functions . . . 62

4.3.1 First-order covariance functions . . . 62

4.3.2 Piecewise first-order covariance functions . . . 64

4.3.3 Higher-order covariance functions using Bézier curves . . . 66

4.3.4 Interpolation of labelled matrices using covariance functions . . . 68

4.4 Covariance estimation using distance minimization . . . 68

4.4.1 One-parameter first-order covariance function . . . 69

4.4.2 Two-parameter first-order covariance functions . . . 70

4.4.3 Higher-order covariance functions using Bézier curves . . . 72

4.5 Case study: wind field estimation . . . 73

4.5.1 Model problem and data set . . . 74

4.5.2 One-parameter covariance families. . . 75

4.5.3 Distances to two-parameter covariance families . . . 77

4.5.4 Benchmarking first-order and higher-order covariance functions . . . 79

4.6 Discussion . . . 83

5 Geodesic shrinkage of covariance matrices 85

5.1 Introduction . . . 85

5.2 Tools for geodesic shrinkage . . . 88

5.3 Problem statement: geodesic shrinkage . . . 90

5.4 Geodesic shrinkage estimator . . . 92

5.4.1 Estimation class . . . 92

5.4.2 Formulation . . . 93


5.5 Properties of geodesic shrinkage . . . 95

5.5.1 Is the geodesic estimator consistent? . . . 96

5.5.2 Is the geodesic shrinkage problem convex? . . . 96

5.5.3 Rotation equivariance . . . 97

5.5.4 Invariance to inversion . . . 98

5.6 Monte Carlo simulations . . . 98

5.6.1 Convergence . . . 100

5.6.2 Concentration . . . 101

5.6.3 Condition number. . . 102

5.6.4 Alternative population covariance matrix . . . 102

5.7 Discussion . . . 105

6 Contributions and further research 107

6.1 Contributions . . . 107

6.2 Other applications . . . 109

6.3 Future work . . . 111

A Technical results of Chapter 3 113

B Technical results of Chapter 4 121

C Technical results of Chapter 5 125


List of Figures

3-1 Representation of the geodesic between anchors A1 and A2 and projection of the sample covariance matrix Ĉ as solution of Problems 3.3.1 to 3.3.4. We denote t̂ as the value of t that maximizes the likelihood (reverse I-projection), ť the I-projection, and t* the natural projection (cf. Lemma 3.4.2). . . . 35

3-2 Illustration of the orthogonality condition in Proposition 3.4.1. Solving Problem 3.3.4 is equivalent to finding a t* such that Ĉ is orthogonal to the unitary vector R(1 + t*). . . . 39

3-3 We draw three matrix realizations from a Wishart distribution of size n = 10 (A, A2, and C). A1 is then constructed as the minimizer of the natural distance from C to the geodesic ϕ_{A→A2}(t) (cf. construction used in Theorem 3.5.1). We show ∆̂t and ∆̌t as a function of the distance ε from A1 to C. We control ε by defining Ĉ = ϕ_{A1→C}(1/d(A1, C)); thus d(A1, Ĉ) = 1 and ε = d(A1, ϕ_{A1→Ĉ}(ε)) (cf. Equation (2.1.5)). By construction, ∆t for the natural projection is always zero. . . . 44

3-4 Unbalanced (left) versus balanced (right) trees. Green nodes represent anchor matrices and white circles are combinations of one-parameter covariance families. . . . 46


3-6 Contrasting the geodesic and “flat” one-parameter covariance families, by evaluating the natural distance to a third matrix. Left: when the anchor matrices are close (l1 = 20, l2 = 30, and l3 = 25) the two families are similar. Right: when the anchors are far apart (l1 = 20, l2 = 100, and l3 = 60), the natural distance to the flat family is not convex; outside of a certain range, where the red dashed line disappears, it is not even well defined. . . . 50

3-7 Illustration of regularizing by projection. . . 51

3-8 Using 1000 solutions obtained via the process outlined in Section 3.7.3, we plot a histogram of the initial distance to the true covariance matrix and a histogram of the distance once the sample covariance is regularized via natural projection (cf. Figure 3-7). The average distances before and after are 0.77 and 0.07, respectively. On average, we reduce the error by a regularization ratio of 11.8. . . . 52

3-9 Boxplot of regularization ratios obtained with natural projection (left) and maximum likelihood (right) for increasing magnitudes of noise. Larger values are better. Each box is the result of 500 instances. The standard deviation of the noise perturbations is α · 0.05 · √(tr(A3)/n). . . . 53

3-10 Multi-parametric case: using 1000 solutions obtained via the process outlined in Section 3.7.5, we plot a histogram of the initial distance to the true covariance matrix and a histogram of the distance obtained after natural projection. The average distances are 0.79 and 0.14, respectively. On average, we reduce the error by a regularization ratio of 6.5. . . . 54

3-11 Multi-parametric regularization for one instance of Â4, described in Section 3.7.5. Left panel: contours/colors represent the value of the distance d(Â4, ϕ_{A1→A2→A3}(t1, t2)) and triangles show iterations of Algorithm 1 until convergence. The red square corresponds to the minimizer of d(A4, ϕ_{A1→A2→A3}(t1, t2)). Right panel: contours/colors represent d(A4, ϕ_{A1→A2→A3}(t1, t2)).

4-1 Schematic representation of the sectional covariance function. The vertical lines correspond to equivalence classes and the horizontal line is the section into which the data matrices are projected. . . 63

4-2 Configuration of the points in a Euclidean setting. Example for two parameters. . . 64

4-3 Representation of the wind field. Left: two-dimensional domain, with wind field around the square obstacle represented by light blue arrows and the prevailing wind in dark blue; gray contours are the variance field. Right: notional 3-D problem, with a section of the wind field at an altitude h. . . . 75

4-4 Instantaneous snapshots of the wind velocity field for θ = 45 and W = 7. 76

4-5 Minimizing value of t (blue points) for data drawn from a range of wind headings θ, for W = 8.5 (cf. Section 4.5.2). The red line represents a “perfect” linear relationship. . . 77

4-6 Distance from A3 = C(19.7, 7) to the one-parameter covariance family built from A1 = C(16.9, 7) and A2 = C(22.5, 7), as a function of the input variable t. A1 and A2 are marked with green obstacles to identify them as the data matrices/anchors. The blue point represents the distance minimizer. . . . 78

4-7 Left: distance from A5 to ϕ^{LS}_{(A1→A2)→(A3→A4)}(t1, t2). Right: distance from A5 to ϕ^{LG}_{(A1→A2)→(A3→A4)}(t1, t2). . . . 79

4-8 Data set for the wind field problem; each dot shows the wind magnitude W and heading θ corresponding to a data covariance matrix. The red (crossed) and blue (dashed) lines represent the data used in Section 4.5.2. The rectangle illustrates the operating zone of Section 4.5.3. The blue nodes comprise the training set and the black nodes the test set for Section 4.5.4. . . . 80


4-9 Distribution of the errors obtained with the higher-order geodesic covariance family ϕ^{BG}. Crosses are training points, circles are test points. It is more difficult to recover the data when varying θ, particularly at larger W. Interpolation in the W direction, on the other hand, yields very small errors. . . . 83

5-1 Histograms of eigenvalues of a sample covariance matrix from realizations of an n-variate standard Gaussian distribution and the corresponding prediction by the Marčenko–Pastur law. Left: 1000 samples, n = 100, c = 0.1. Right: 110 samples, n = 100, c = 0.9. . . . 86

5-2 Illustration of geodesic shrinkage, where Ĉ is the sample covariance and C is the population covariance. . . . 87

5-3 Illustration of consistency of the oracle solution (Proposition 5.4.1). . 96

5-4 Illustration of convexity of the geodesic shrinkage problem (Problem 5.3.3). 97

5-5 Performance of shrinkage estimators as a function of the dimension measured as PRIAL to target (left) and PRIAL to actual (right). The concentration ratio is n/q = 0.5 for all instances.. . . 100

5-6 Performance of shrinkage estimators as a function of the concentration ratio measured as PRIAL to target (left) and PRIAL to actual (right). The dimension times the number of samples is kept as in the base case, that is n × q = 20, 000. . . . 101

5-7 Performance of shrinkage estimators as a function of the value of the lowest eigenvalue of C measured as PRIAL to target (left) and PRIAL to actual (right). The dimension and concentration ratio are n = 100 and q = 200, respectively. . . . 102

5-8 Performance of shrinkage estimators as a function of the dimension measured as PRIAL to target (left) and PRIAL to actual (right). The concentration ratio is n/q = 0.5 for all instances.. . . 103


5-9 Performance of shrinkage estimators as a function of the concentration ratio measured as PRIAL to target (left) and PRIAL to actual (right). The dimension times the number of samples is kept as in the base case, that is n × q = 20, 000. . . . 104

5-10 Performance of shrinkage estimators as a function of the value of the lowest eigenvalue of C measured as PRIAL to target (left) and PRIAL to actual (right). The dimension and concentration ratio are n = 100 and q = 200, respectively. . . . 104

C-1 Performance of shrinkage estimators as a function of the dimension measured as PRIAL to target (left) and PRIAL to actual (right). The concentration ratio is n/q = 0.5 for all instances.. . . 126

C-2 Performance of shrinkage estimators as a function of the concentration ratio measured as PRIAL to target (left) and PRIAL to actual (right). The dimension times the number of samples is kept as in the base case, that is n × q = 20, 000. . . . 126

C-3 Performance of shrinkage estimators as a function of the value of the lowest eigenvalue of C measured as PRIAL to target (left) and PRIAL to actual (right). The dimension and concentration ratio are n = 100 and q = 200, respectively. . . . 127

C-4 Performance of shrinkage estimators as a function of the dimension measured as PRIAL to target (left) and PRIAL to actual (right). The concentration ratio is n/q = 0.5 for all instances.. . . 128

C-5 Performance of shrinkage estimators as a function of the concentration ratio measured as PRIAL to target (left) and PRIAL to actual (right). The dimension times the number of samples is kept as in the base case, that is n × q = 20, 000. . . . 128


C-6 Performance of shrinkage estimators as a function of the value of the lowest eigenvalue of C measured as PRIAL to target (left) and PRIAL to actual (right). The dimension and concentration ratio are n = 100 and q = 200, respectively. . . . 129


List of Tables

4.1 Average (squared) distance separating a given test point C(θi, Wi) from the corresponding interpolation point on the different surfaces. For the methods defined on a section of the manifold, the subscript of ϕ indicates how the section was chosen: ‘one’ means that we use one of the data matrices (here, the matrix at the lower left corner of the patch) as basis of the section, while ‘arithm’ and ‘inductive’ denote, respectively, the arithmetic and inductive means of the corners of the patch. . . . 81

4.2 Average (squared) distance separating a given test point from its closest approximation on the different surfaces. For methods defined on a section of the manifold, the subscript of ϕ indicates how the section was chosen: ‘one’ means that we use one of the data matrices (here, the matrix at the lower left corner of the patch) as basis of the section, while ‘arithm’ and ‘inductive’ denote, respectively, the arithmetic and inductive means of the corners of the patch. . . 82


Chapter 1

Introduction

1.1 Motivation

One of the most fundamental problems in multivariate analysis and uncertainty quantification is the estimation of covariance matrices. Covariance matrices are an essential tool in climatology, econometrics, model order reduction, biostatistics, signal processing, and geostatistics, among other applications. As a specific example (which we shall revisit in Chapter 4), covariance matrices of wind velocity fields [77, 78, 93, 22, 140] capture the relationships among wind velocity components at different points in space. These relationships enable recursive estimation or updating of the wind field as new pointwise measurements become available. Similarly, in oil reservoirs [107, 108, 65, 66] (cf. Chapter 3), covariance matrices allow information from borehole measurements to be propagated into more accurate global estimates of the permeability field. In telecommunications [92, 106, 146], covariance matrices and their eigenvectors are paramount for discerning between signal and noise.

Covariance matrix estimation has been an active research topic in recent years. Widely used methods in spatial statistics [28, 120, 135] include variogram estimation [27, 25] (often a first step in kriging [134, 34, 56]), covariance kernels [26, 116, 123], or tapering of sample covariance matrices [55, 48]. Other regularized covariance estimation methodologies include sparse covariance and precision matrix estimation [20, 47, 21] and many varieties of shrinkage [82, 83, 126, 86]. Given that the sample covariance matrix maximises the normal likelihood, some authors have used penalised versions of this function [122, 117], sometimes through the Cholesky factors [60], and analyse rates of convergence [71]. In [113], graphical Gaussian models are used to obtain improved estimators of the covariance matrix. Other recent approaches include regularization and conditioning of the sample eigenvalues [39, 142, 121], banding [16], thresholding [15], diagonal loading [129], and reference prior methods [143]. Many of these methods make rather generic assumptions on the covariance matrix (e.g., sparsity of the precision, or some structure in the spectrum); others (e.g., Matérn covariance kernels or particular variogram models) assume that the covariance is described within some parametric family.

While standard parametric covariance families can make the estimation problem quite tractable, they can sometimes be insufficiently flexible. By construction, these approaches describe a high-dimensional object (a symmetric positive-semidefinite matrix with O(n²) degrees of freedom) using only a few generic parameters that are not problem-specific; more broadly, these parametric families may not be rich enough to capture the phenomena of interest. For instance, in wind field modeling, these covariance families are insufficiently expressive to capture the non-stationary and multiscale features of the velocity or vorticity [74, 75, 76]. Non-parametric methods (e.g., sparse precision matrix estimation, tapering, and shrinkage) can be much more flexible. However, neither approach easily allows prior knowledge—for instance, known covariance matrices at related conditions—to be incorporated.

Estimation in both settings often involves solving an optimization problem, usually maximizing the likelihood of a data set. Defining this objective function requires prescribing a specific probability distribution for the data, which may not be readily available. Moreover, maximum likelihood is not a proper notion of distance on the manifold of symmetric positive-(semi)definite matrices.

To overcome these obstacles, in this thesis, we propose to build covariance families by connecting representative covariance matrices (called “anchors”) through geodesics. The resulting covariance classes can thus be tailored to the problem of interest. Second, as an alternative to maximizing the likelihood, we also advocate for a differential geometric approach to estimation: minimizing the geodesic distance. Our approach, which can be understood as a projection, is consistent with the geometric framework employed to build the covariance family and does not require assuming a particular probability distribution for the data.

1.2 Objectives

The main objectives of this thesis are two-fold. First, we aim to construct rich classes of parametric covariance families using geodesic curves and/or (higher-order) interpolation and approximation techniques on matrix manifolds. Second, we devise new covariance estimation techniques that benefit either from the proposed parametric families or from shrinkage-type estimators based on geodesics. In short, we harness differential geometric results to create more capable and expressive models for covariance estimation.

1.3 Outline

The proposed research lies at the intersection of differential geometry and uncertainty quantification. We start with two main assumptions:

Assumption A. Availability of representative covariance matrices. We have at least two anchor matrices related to the problem of interest. To illustrate, in wind field estimation, anchors can be covariance matrices that have been obtained for particular conditions of the prevailing wind, e.g., heading and magnitude. In oil reservoirs, they can be covariance matrices of the hydraulic head in areas with different soil conditions, e.g., permeability or grain size. In some cases, these anchors can be obtained offline (solving a numerical model) to a desired accuracy and later used for online estimation.

Assumption B. All covariance matrices are full-rank. The sample covariance matrix (and the anchors) are full-rank. In some cases, it is sound to regularize a singular covariance matrix, e.g., by adding ε·Id, which mimics observation noise.


The problems we aim to tackle can be divided into three different but interconnected work-streams, depending on which assumptions are enforced.

1. Geodesically parameterized covariance estimation (Assumptions A and B). Use geodesic curves on the manifold of symmetric positive-definite matrices to devise (multi-parametric) covariance functions. Given a predefined covariance structure, develop a differential geometric alternative to likelihood maximization and reduce it to an optimization problem on matrix manifolds. (cf. Chapter 3)

2. Low-rank multi-parametric covariance estimation (Assumption A). Devise theoretically sound and computationally efficient methods to build multi-parametric covariance functions on the manifold of symmetric positive-semidefinite matrices. Similarly to above, develop a differential geometric alternative to likelihood maximization and propose algorithms for this alternative formulation. (cf. Chapter 4)

3. Geodesic shrinkage of covariance matrices (Assumption B). Propose an estimator of the covariance matrix by adjusting the eigenvalues of the sample covariance matrix (shrinkage). Our approach can be understood geometrically as constructing a geodesic curve between the standard sample covariance matrix and the identity matrix, and then selecting the point (matrix) on the line that minimizes the estimated distance to the actual covariance. (cf. Chapter 5)

The rest of this thesis is organized as follows. We start by introducing the notation and definitions in Chapter 2. In Chapter 3, we focus on full-rank geodesically parameterized covariance modelling and subsequent estimation by projection. In Chapter 4, we turn to the low-rank setting and leverage higher-order multi-parametric interpolation techniques for covariance estimation. Unlike the previous two chapters, in Chapter 5, we tackle non-parametric estimation through shrinkage-type geodesic estimation. Finally, we summarize contributions and propose future work in Chapter 6.


Chapter 2

Notation and definitions

In this chapter, we recall some results on the geometry of the symmetric positive-definite matrix manifold (Section 2.1) and describe the geometry of the set of symmetric positive-semidefinite matrices that we will employ (Section 2.2).

2.1 The geometry of the symmetric positive-definite cone

Let S(n) be the space of n × n symmetric matrices, and let S^+(n) denote the manifold of symmetric positive-definite n × n matrices. This manifold has been studied extensively in the literature (e.g., [14, 44, 130, 35]).

Let XA, YA ∈ S(n) be tangent vectors to S+(n) at A: TAS+(n). The natural metric gA is defined as the inner product in the tangent space at A:

g_A(X_A, Y_A) = tr(X_A A^{-1} Y_A A^{-1}).

We will denote the tangent vector X_A simply as X when there is no ambiguity in the choice of tangent space.


The exponential map transports a tangent vector B̄ ∈ T_A S^+(n) to the corresponding element on the manifold, and is defined as:

B = exp_A(B̄) = A^{1/2} exp_m(A^{-1/2} B̄ A^{-1/2}) A^{1/2},

where A^{1/2} is the symmetric square root. Conversely, the logarithm map transports objects from the manifold to the tangent space:

B̄ = log_A(B) = A^{1/2} log_m(A^{-1/2} B A^{-1/2}) A^{1/2},    A^{-1/2} B̄ A^{-1/2} = log_m(A^{-1/2} B A^{-1/2}),

where log_m is the matrix logarithm.
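For concreteness, the following is a minimal numerical sketch of these two maps; it is our own illustration (not from the thesis), assuming NumPy/SciPy, and the helper names exp_map and log_map are ours.

```python
import numpy as np
from scipy.linalg import expm, logm, sqrtm

def exp_map(A, B_bar):
    """Exponential map on S+(n): tangent vector B_bar at A -> point on the manifold."""
    R = np.real(sqrtm(A))                 # symmetric square root A^{1/2}
    Ri = np.linalg.inv(R)                 # A^{-1/2}
    return R @ expm(Ri @ B_bar @ Ri) @ R

def log_map(A, B):
    """Logarithm map on S+(n): point B on the manifold -> tangent vector at A."""
    R = np.real(sqrtm(A))
    Ri = np.linalg.inv(R)
    return R @ np.real(logm(Ri @ B @ Ri)) @ R

# Round trip: the two maps are mutually inverse (up to round-off).
rng = np.random.default_rng(0)
G = rng.standard_normal((4, 4))
A = G @ G.T + 4 * np.eye(4)               # a well-conditioned SPD matrix
H = rng.standard_normal((4, 4))
X = 0.1 * (H + H.T)                       # a symmetric tangent vector at A
assert np.allclose(log_map(A, exp_map(A, X)), X, atol=1e-6)
```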

Let A1 and A2 belong to this manifold. Associated with the natural metric, there exists a natural distance d(A1, A2) that is invariant with respect to matrix inversion:

d(A1, A2) = d(A1^{-1}, A2^{-1}),   (2.1.1)

and with respect to congruence via any invertible matrix Z:

d(A1, A2) = d(Z A1 Z^⊤, Z A2 Z^⊤).   (2.1.2)

Moreover, a parametrization of the geodesic, which at any point minimizes the natural distance to A1 and A2, is given by:

ϕ_{A1→A2}(t) = A1^{1/2} exp_m( t log_m(A1^{-1/2} A2 A1^{-1/2}) ) A1^{1/2},

where ϕ_{A1→A2}(t) ∈ S^+(n) for all t ∈ R. Clearly, A1^{-1/2} A2 A1^{-1/2} admits an orthogonal eigendecomposition of the form U Λ U^⊤. Notice that Λ contains the generalized eigenvalues of the pencil (A2, A1), which we denote as λ_k^{(A2,A1)}, k = 1, . . . , n. Therefore, ϕ_{A1→A2}(t) can be expressed as:

ϕ_{A1→A2}(t) = A1^{1/2} exp_m( t log_m(U Λ U^⊤) ) A1^{1/2} = A1^{1/2} U Λ^t U^⊤ A1^{1/2}.   (2.1.3)


We call A1 and A2 the anchor matrices. The geodesic can also be expressed as:

ϕ_{A1→A2}(t) = A1 (A1^{-1} A2)^t = A2 (A2^{-1} A1)^{1−t}.

Additionally, there is a closed-form expression for the natural distance between any two matrices in S+(n):

d(A1, A2) = d(A1^{-1/2} A2 A1^{-1/2}, I) = ‖log_m(A1^{-1/2} A2 A1^{-1/2})‖_F = √( Σ_{k=1}^{n} log² λ_k^{(A2,A1)} ).   (2.1.4)

From the above, we have:

d(A1, ϕ_{A1→A2}(t)) = |t| d(A1, A2).   (2.1.5)

Equation (2.1.4) appears extensively in the literature of differential geometry and inference. Up to a constant, it is known as the Fisher information metric or as Rao’s distance [11, 115, 114]. When measuring distance between covariance matrices, it is also known as the Förstner metric [46,131].
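A small numerical sketch of the geodesic (2.1.3) and the natural distance (2.1.4), checking the identity (2.1.5). This is our own illustration, assuming NumPy/SciPy; geodesic, natural_distance, and rand_spd are hypothetical helper names.

```python
import numpy as np
from scipy.linalg import sqrtm, eigh

def geodesic(A1, A2, t):
    """phi_{A1->A2}(t) = A1^{1/2} (A1^{-1/2} A2 A1^{-1/2})^t A1^{1/2}, Eq. (2.1.3)."""
    R = np.real(sqrtm(A1))
    Ri = np.linalg.inv(R)
    w, U = np.linalg.eigh(Ri @ A2 @ Ri)    # U diag(w) U^T, w = generalized eigenvalues
    return R @ (U * w**t) @ U.T @ R

def natural_distance(A1, A2):
    """d(A1, A2) = sqrt(sum_k log^2 lambda_k), lambda_k of the pencil (A2, A1), Eq. (2.1.4)."""
    lam = eigh(A2, A1, eigvals_only=True)
    return np.sqrt(np.sum(np.log(lam) ** 2))

rng = np.random.default_rng(1)
def rand_spd(n):
    G = rng.standard_normal((n, n))
    return G @ G.T + n * np.eye(n)

A1, A2 = rand_spd(5), rand_spd(5)
t = 0.37
# Eq. (2.1.5): d(A1, phi(t)) = |t| d(A1, A2)
assert np.isclose(natural_distance(A1, geodesic(A1, A2, t)),
                  abs(t) * natural_distance(A1, A2))
```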

Unlike the manifold of symmetric positive-definite matrices, the manifold of symmetric positive-semidefinite matrices does not enjoy as much structure. As shown clearly in [18] and discussed in [139], the main drawback is that there is no notion of distance that satisfies the invariance properties in Equations (2.1.1) and (2.1.2).

2.2 The geometry of the set of positive-semidefinite matrices

In this section, we define useful tools to work on the manifold S+(r, n) of positive-semidefinite matrices, with rank r and size n × n, with r < n.

Several metrics have been proposed for this manifold but, to our knowledge, none of them manages to turn it into a complete metric space with a closed-form expression for endpoint geodesics. We use the quotient geometry S^+(r, n) ≃ R_*^{n×r}/O(r) proposed in [63] and further developed in [97], with R_*^{n×r} endowed with the Euclidean metric. This geometry relies on the fact that a matrix A ∈ S^+(r, n) can be factorized as

A = Y_A Y_A^⊤, where the factor Y_A ∈ R_*^{n×r} has full column rank. The decomposition is not unique, as each factor Y_A Q, with Q ∈ O(r) an orthogonal matrix, leads to the same product. As a consequence, any symmetric positive-semidefinite matrix A is represented by an equivalence class:

[Y_A] = {Y_A Q | Q ∈ O(r)}.

In our computations, we work with representatives of the equivalence classes. For example, the geodesic between two symmetric positive-semidefinite matrices A and B will be computed based on two arbitrary representatives Y_A, Y_B of the corresponding equivalence classes. The geodesic will of course be invariant to the choice of the representatives. Moreover, this approach saves computational cost, as the representatives are of size n × r instead of n × n.

In [97], the authors propose an expression for the main tools to perform computations on S^+(r, n), endowed with this geometry. The geodesic ϕ_{A1→A2} between two symmetric positive-semidefinite matrices A1, A2, with representatives Y_{A1} and Y_{A2}, is given by:

ϕ_{A1→A2}(t) = Y_{ϕ_{A1→A2}}(t) Y_{ϕ_{A1→A2}}(t)^⊤, with Y_{ϕ_{A1→A2}}(t) = Y_{A1} + t Ẏ_{A1→A2}.   (2.2.1)

In this expression, the vector Ẏ_{A1→A2} is defined as Ẏ_{A1→A2} = Y_{A2} Q^⊤ − Y_{A1}, with Y_{A1}^⊤ Y_{A2} = HQ a polar decomposition. In the generic case where Y_{A1}^⊤ Y_{A2} is nonsingular, the polar decomposition is unique and the resulting curve t ∈ [0, 1] ↦ ϕ_{A1→A2}(t) is the unique minimizing geodesic between A1 and A2. This curve has the following properties:

1. ϕ_{A1→A2}(0) = A1, and ϕ_{A1→A2}(1) = A2.

2. For each t ∈ R, ϕ_{A1→A2}(t) ∈ S^+(≤ r, n), the set of symmetric positive-semidefinite n × n matrices with rank upper-bounded by r.

Notice that the last property suggests that this manifold is not a complete metric space for this metric, because the points in the geodesics are not necessarily of rank r. However, completeness is not central for statistical modeling, and, as we will see in Section 4.5, the fact that not all covariance matrices have the same rank will not have any consequence in practice.

We finally mention that [97] also contains expressions for the exponential and logarithm maps, on which the patchwise Bézier surfaces introduced in Section 4.3.3 below rely. We make here the following assumption:

Hypothesis 2.2.1. The data matrices are such that all the logarithm and exponential maps to which we refer in the sequel are well-defined.

Hypothesis 2.2.1 will typically be satisfied when the data matrices are sufficiently close to each other.

Instead of working directly on the quotient manifold (which involves the computation of geodesics, exponential and logarithm maps), a simpler approach consists in working on an affine section of the quotient. Consider an equivalence class [Y_A] with Y_A a representative of the class. We define the section of the quotient at Y_A as the set of points:

S_A := { Y_A (I + (Y_A^⊤ Y_A)^{-1} S) + Y_A^⊥ K | S^⊤ = S, S ≻ −Y_A^⊤ Y_A, K ∈ R^{(n−r)×r} },   (2.2.2)

where the matrix Y_A^⊥ ∈ R^{n×(n−r)} is any orthonormal basis for the orthogonal complement of Y_A, i.e., Y_A^⊤ Y_A^⊥ = 0 and (Y_A^⊥)^⊤ Y_A^⊥ = I_{n−r}. The constraint S ≻ −Y_A^⊤ Y_A guarantees that there is at most one representative of each equivalence class [Y_B] in the section, and exactly one under the generic condition that Y_A^⊤ Y_B is nonsingular.

Consider the section of the quotient at Y_{A1}. The representative in the section of any equivalence class [Y_{A2}] (with Y_{A1}^⊤ Y_{A2} nonsingular) is then Ȳ_{A2} = Y_{A2} Q^⊤, where Q is the orthogonal factor of the polar decomposition of Y_{A1}^⊤ Y_{A2}. Once all the points are projected on the section, we can simply perform Euclidean operations on the section.


Chapter 3

Geodesically parameterized covariance estimation

This chapter revises a submitted publication [105]: A. Musolas, S. T. Smith, and Y. Marzouk, Geodesically parameterized covariance estimation, arXiv preprint, (2020).

Other work by the author, not presented in this thesis but nonetheless related to this chapter, includes [103]: A. Musolas, J. Egozcue, and M. Crusells-Girona, Vulnerability models for environmental risk assessment. Application to a nuclear power plant containment building, Stochastic Environmental Research and Risk Assessment, 30 (2016), pp. 2287–2301.

3.1 Introduction

Statistical modeling of spatiotemporal phenomena often requires employing and estimating covariance matrices. In this chapter, we propose to build full-rank covariance families by connecting representative covariance matrices (called “anchors”) through geodesics. The resulting covariance classes can thus be tailored to the problem of interest. Second, as an alternative to maximizing the likelihood, we advocate for a differential geometric approach to estimation: minimizing the geodesic distance to a sample covariance matrix. Our approach, which we call natural projection, is consistent with the notion of distance employed to build the covariance family and does not require assuming a particular probability distribution for the data.

Geodesic interpolation, smoothing, and regression of matrices and related objects have been active topics of research. Several papers [9, 8, 10] use interpolation on matrix manifolds to adapt and construct reduced-order models, while others [110, 88] propose a Riemannian framework for the interpolation and regularization of tensor fields, with broad applications in imaging. Local polynomial regression in the manifold of symmetric positive-definite matrices has been explored in [145], also in the context of computer vision and medical imaging. Other authors have pursued higher-order interpolation of positive-definite [2, 54] and semi-definite [53] matrices using Bézier curves. Additionally, [99, 100] characterize the Riemannian mean of positive-definite matrices and propose a multivariate geodesic interpolation scheme for such matrices using weights. Our work also relies on geodesic interpolation, but for the purpose of building covariance families. The idea of differential geometric methods for covariance estimation has links to the broad field of information geometry [7, 124], which constructs manifolds of probability distributions and analyzes their geometric properties. In this general setting, the likelihood function has been characterized as a notion of distance in [6, 5]. Matrix nearness using Bregman divergences, and its geometric interpretations, have been discussed thoroughly in [32]. Another field that combines differential geometry and uncertainty quantification is the treatment of compositional data on the simplex [103, 38, 3, 37, 4].

As described above, our covariance families will follow from geodesic interpolation of a given set of anchor matrices. The anchors should be representative of known problem instances—e.g., empirical observations or computational simulations of the relevant spatiotemporal process at related conditions. Combining these instances into a parametric family constitutes a hybrid approach to covariance modeling that can yield much richer and more problem-specific covariances than standard kernels. Using geodesics ensures that the entire covariance family lies in the manifold of symmetric positive-definite matrices. These families can also be interpretable: the internal parameters may serve as explicative variables for the problem of interest. Another advantage of this approach is that it harnesses the asymmetry of information between online and offline stages of a problem: that is, each anchor covariance matrix can be computed offline to a desired accuracy and later used for online estimation with limited data.

Having constructed a geodesically parameterized covariance family, we also analyze different alternatives for estimation within the family—i.e., given a data set, identifying the most “representative” member. This problem is usually solved by assigning a probability distribution to the data and selecting the parameter values that maximize the resulting likelihood function. Under some conditions, this process is equivalent to minimizing a particular direction of the Kullback–Leibler (KL) divergence, known as reverse information-projection (reverse I-projection) [29]; this is further equivalent to minimizing Stein’s loss [132, 133]. Alternatively, one might minimize the opposite direction of KL divergence; this choice is known as I-projection. Consistent with the construction of the family, we instead propose to use natural projection: selecting the covariance matrix within the family that minimizes the geodesic distance to the sample covariance matrix. We will show that the other methods are essentially linear approximations of natural projection. In particular, we will show that the estimates produced by natural projection and the two forms of I-projection are locally equivalent up to second order. In contrast with other methods, however, the optimality condition for natural projection is equivalent to an orthogonality condition on the tangent space of the matrix manifold. Since natural projection does not require modeling the distribution of the data, it may be easier to apply in practice. We also show that it can yield reduced sensitivity to noise.

The plan of this chapter is as follows. In Section 3.2, we review the geometry of the manifold of symmetric positive-definite matrices and define the notion of a geodesic covariance family. We introduce natural projection alongside some standard alternatives in Section 3.3. Section 3.4 discusses properties of the optimization problems associated with each of these estimation methods. In Section 3.5, we perform local analyses that compare natural projection with existing alternatives. Section 3.6 extends the geodesic covariance family construction to general multi-parameter settings.


In Section 3.7, we demonstrate the performance of geodesic covariance families and natural projection in a case study: characterizing the spatial variations of hydraulic head in an aquifer. Conclusions follow in Section 3.8.

3.2 Tools for covariance estimation on a geodesic family

In Section 3.2.1, we introduce the idea of a geodesic covariance family. In Section 3.2.2, we present the loss functions we will consider for estimation. In Section 3.3, we introduce natural projection and contrast it with canonical approaches to parametric covariance estimation.

3.2.1 Definition of the geodesic covariance family

As described in Section 3.1, the ability to create tailored parametric covariance functions out of two (or more) covariance matrices of interest has several potential benefits. First, via the choice of anchor matrices, the covariance function can be made representative of the particular problem at hand; second, the parameters of the covariance function can become meaningful explicative variables of the spatiotemporal process. To begin, we define the notion of a one-parameter covariance family. We will generalize this notion to the multi-parameter case in Section 3.6.

Definition 3.2.1 (Covariance function and family). A one-parameter covariance function is a map ϕ: R → S^+(n); its corresponding covariance family is the image of ϕ.

The covariance family structure we will employ in Sections 3.2, 3.4, and 3.5 is a geodesic between anchor matrices. Let A1 and A2 be two elements in S+(n). Then ϕA1→A2(t) as defined in Equation (2.2.1) is a one-parameter covariance function, whose image is a covariance family. This covariance function immediately satisfies the following two properties:


1. ϕ_{A1→A2}^{-1}(t) = ϕ_{A1^{-1}→A2^{-1}}(t),

2. ϕ_{A1→A2}(t) = ϕ_{A2→A1}(1 − t).

Since the parameter t ∈ R, the covariance family is an infinitely long curve on the manifold rather than a segment between the two anchor matrices. Notice that the first property allows one to work with precision matrices and still obtain the same results. The second guarantees invariance with respect to the order of the matrices. These properties make the geodesic a compelling covariance function to be used for practical applications.

It is often useful to be able to scale the family to adapt to changes of the magnitude of the problem. Within our covariance family framework, this extra degree of freedom is achieved by building a geodesic between any matrix and the same matrix times a constant.

Remark 3.2.1 (Scaling of a covariance function). Let A1 be an element in S^+(n) and α ∈ R_+. Notice that ϕ_{A1→αA1}(t) as defined in Equation (2.2.1) is a one-parameter covariance function of the form ϕ_{A1→αA1}(t) = α^t A1. If A1 is replaced by a one-parameter covariance function, the scaling applies to the whole family.

The scaling factor α^t in Remark 3.2.1 is positive for any t in the real line. Clearly, if the scaling is applied to a one-parameter covariance function, we are left with two degrees of freedom: one that moves across the anchor matrices and another that controls the magnitude of the entries.

3.2.2 Spectral functions

In Section 3.4, we will be interested in selecting the “most representative” member of a covariance family given a data set. Doing so will entail minimizing certain loss functions: distances or divergences between distributions. All the loss functions we employ are spectral functions.

Definition 3.2.2 (Spectral function). Let A1 be a matrix in S^+(n). A function F(A1) is a spectral function if it is a differentiable and symmetric map from the eigenvalues of A1 to the real numbers; that is, it can be written as the composition of the eigenvalue function λ and a differentiable and symmetric map f, so that F(A1) = f ∘ λ(A1).

Closed-form expressions for some spectral functions that we shall use later are presented below.

Remark 3.2.2. The following notions of distance or divergence can be expressed as functions of the generalized eigenvalues (λ_k)_{k=1}^{n} of the pencil (A2, A1):

• Natural distance in S^+(n):

d(A1, A2) = √( Σ_{k=1}^{n} log² λ_k ).   (3.2.1)

• Kullback–Leibler divergence between multivariate normals:

D_KL( N(0, A1) ‖ N(0, A2) ) = Σ_{k=1}^{n} (λ_k^{-1} + log λ_k − 1)/2.   (3.2.2)

• Kullback–Leibler divergence between multivariate normals, swapping the order:

D_KL( N(0, A2) ‖ N(0, A1) ) = Σ_{k=1}^{n} (λ_k − log λ_k − 1)/2.   (3.2.3)
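All three losses above depend on the covariance pair only through the generalized eigenvalues, which makes them cheap to evaluate together. A sketch of this computation (ours, assuming NumPy/SciPy; spectral_losses is a hypothetical helper name):

```python
import numpy as np
from scipy.linalg import eigh

def spectral_losses(A1, A2):
    """Natural distance (3.2.1) and the two KL divergences (3.2.2)-(3.2.3), written
    in terms of the generalized eigenvalues lambda_k of the pencil (A2, A1)."""
    lam = eigh(A2, A1, eigvals_only=True)
    log_lam = np.log(lam)
    d_nat = np.sqrt(np.sum(log_lam ** 2))              # Eq. (3.2.1)
    kl_12 = 0.5 * np.sum(1.0 / lam + log_lam - 1.0)    # Eq. (3.2.2), KL(N(0,A1) || N(0,A2))
    kl_21 = 0.5 * np.sum(lam - log_lam - 1.0)          # Eq. (3.2.3), KL(N(0,A2) || N(0,A1))
    return d_nat, kl_12, kl_21
```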

3.3 Problem statement: covariance estimation in a geodesic family

Now we define several alternative optimization problems that each describe estimation in a geodesic covariance family. We will formally contrast these problems in the next sections. Figure 3-1 illustrates the covariance function as a geodesic between A1 and A2 on the manifold of symmetric positive-definite matrices, along with multiple projections (i.e., estimates within the family) of the sample covariance matrix Ĉ.

Let y_1, . . . , y_q be independent and identically distributed observations from some distribution with density function p_Y( · ; ϕ_{A1→A2}(t) );


Figure 3-1: Representation of the geodesic between anchors A1 and A2 and projection of the sample covariance matrix Ĉ as solution of Problems 3.3.1 to 3.3.4. We denote t̂ as the value of t that maximizes the likelihood (reverse I-projection), ť the I-projection, and t* the natural projection (cf. Lemma 3.4.2).

the maximum likelihood estimate of t is then

t_ML ∈ argmax_{t ∈ (−∞,∞)} Σ_{i=1}^{q} log p_Y(y_i; ϕ_{A1→A2}(t)).

For comparison with other techniques, we consider the specific case where p_Y is multivariate Gaussian and, without loss of generality, zero mean. In this case, the maximum likelihood estimate is as follows:

Problem 3.3.1 (Maximum likelihood with a Gaussian model).

y_i ∼ N(0, ϕ_{A1→A2}(t)) i.i.d.,   t̂ ∈ argmax_{t ∈ (−∞,∞)} Σ_{i=1}^{q} log N(y_i; 0, ϕ_{A1→A2}(t)).

Now, suppose that the sample covariance matrix Ĉ of {y_i}_{i=1}^{q} is full-rank, which is typically the case when q ≥ n. (Note that Problem 3.3.1 does not need this assumption.) In this case, the solution to Problem 3.3.1 is equivalent to that obtained by reverse I-projection, which consists of minimizing the KL divergence as follows. This equivalence is shown in Lemma A.0.4.


Problem 3.3.2 (Reverse I-projection).

t̂ ∈ argmin_{t ∈ (−∞,∞)} D_KL( N(0, Ĉ) ‖ N(0, ϕ_{A1→A2}(t)) ).   (3.3.1)

Since Problems 3.3.1 and 3.3.2 are equivalent, we will only refer to Problem 3.3.2 going forward. We will also consider I-projection, which minimizes the KL divergence but in the opposite order:

Problem 3.3.3 (I-projection).

ť ∈ argmin_{t ∈ (−∞,∞)} D_KL( N(0, ϕ_{A1→A2}(t)) ‖ N(0, Ĉ) ).

Note that the KL divergence is proportional to Stein’s loss [86]. Therefore, Problems 3.3.2 and 3.3.3 can also be cast as minimizing Stein’s loss with the appropriate order of the arguments.

Finally, we explore the possibility of covariance estimation by minimizing the geodesic distance d(·, ·) defined in Equation (2.1.4):

Problem 3.3.4 (Natural projection).

t* ∈ argmin_{t ∈ (−∞,∞)} d( ϕ_{A1→A2}(t), Ĉ ).
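Problem 3.3.4 is a smooth scalar optimization once the anchors are fixed. The following sketch (ours, not the thesis’s implementation) solves it by direct one-dimensional minimization, assuming NumPy/SciPy; natural_projection is a hypothetical helper name.

```python
import numpy as np
from scipy.linalg import sqrtm, eigh
from scipy.optimize import minimize_scalar

def natural_projection(A1, A2, C_hat):
    """Solve t* = argmin_t d(phi_{A1->A2}(t), C_hat), i.e., Problem 3.3.4."""
    R = np.real(sqrtm(A1))
    Ri = np.linalg.inv(R)
    w, U = np.linalg.eigh(Ri @ A2 @ Ri)       # phi(t) = R U diag(w^t) U^T R

    def dist_to_family(t):
        phi_t = R @ (U * w**t) @ U.T @ R
        lam = eigh(C_hat, phi_t, eigvals_only=True)   # pencil (C_hat, phi(t))
        return np.sqrt(np.sum(np.log(lam) ** 2))

    res = minimize_scalar(dist_to_family)     # smooth, unconstrained scalar problem
    return res.x, res.fun                     # (t*, distance to the family)
```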

3.4 Properties of the covariance estimation problem, one-parameter case

In this section, we characterize various properties of Problems 3.3.2 to 3.3.4. Our main results are the optimality conditions and their corresponding geometric interpretations. These results are presented in Propositions 3.4.1 to 3.4.3, proofs of which are deferred to Appendix A. First, however, we establish uniqueness of the optima and idempotence of the associated projections. Supporting results (Lemmas A.0.3 to A.0.6) can also be found in Appendix A.

Lemma 3.4.1 (Uniqueness of the solution). Each of Problems 3.3.2 to 3.3.4 has a unique solution.

Proof. Since the optimization problems are unconstrained, uniqueness follows immediately from the convexity of Problem 3.3.4 (shown in Lemma A.0.5) and of Problems 3.3.2 and 3.3.3 (shown in Lemma A.0.6).

Since the solutions are unique, though in general distinct (see below), we will use t̂ to denote the result of reverse I-projection, ť that of I-projection, and t* that of natural projection. Uniqueness also allows defining the distance between the sample covariance matrix and the geodesic as d(ϕ_{A1→A2}(t*), Ĉ).

If the sample covariance matrix already belongs to the covariance family, the most representative member of the family ought to be the sample covariance matrix itself. This result holds true for all three problems.

Lemma 3.4.2 (Idempotence of projections). If Ĉ ∈ ϕ_{A1→A2}(t), then there exists a unique t̄ such that (λ_k^{(A2,A1)})^{t̄} = λ_k^{(Ĉ,A1)}, k = 1, . . . , n, where λ_k^{(A2,A1)} and λ_k^{(Ĉ,A1)} are the k-th eigenvalues of the pencils (A2, A1) and (Ĉ, A1), respectively. Moreover, under this condition, t̄ = t* = t̂ = ť (cf. Figure 3-1), Ĉ = ϕ_{A1→A2}(t̄), with:

t̄ = ( Σ_{k=1}^{n} log λ_k^{Ĉ} − Σ_{k=1}^{n} log λ_k^{A1} ) / ( Σ_{k=1}^{n} log λ_k^{A2} − Σ_{k=1}^{n} log λ_k^{A1} ),

where λ_k^{A1}, λ_k^{A2}, and λ_k^{Ĉ} are the k-th eigenvalues of A1, A2, and Ĉ, respectively. This expression also holds when A1 = αA2, for any Ĉ and α > 0.

Proof. If Ĉ ∈ ϕ_{A1→A2}(t), then there exists a t̄ such that Ĉ = A1^{1/2} U Λ^{t̄} U^⊤ A1^{1/2}. Rearranging the terms, we obtain A1^{-1/2} Ĉ A1^{-1/2} = U Λ^{t̄} U^⊤, which is an eigendecomposition of A1^{-1/2} Ĉ A1^{-1/2}. This is equivalent to saying that Λ^{t̄} contains the λ_k^{(Ĉ,A1)}. But also, from our notation in Equation (2.2.1) we knew that Λ contains the λ_k^{(A2,A1)}.

Notice that t̄ satisfies the general form (cf. Equation (A.0.3)) of distance minimization, and similarly for the reverse I-projection (cf. Equation (A.0.4)) and the I-projection (cf. Equation (A.0.5)). Using uniqueness in Lemma 3.4.1, we conclude that t̄ = t* = t̂ = ť. The claim Ĉ = ϕ_{A1→A2}(t̄) follows by the definition of t̄. For the closed-form solution, set Ĉ = ϕ_{A1→A2}(t̄) = A1^{1/2} U Λ^{t̄} U^⊤ A1^{1/2} = A1^{1/2} (A1^{-1/2} A2 A1^{-1/2})^{t̄} A1^{1/2}. Now, take the determinant of both sides and apply its properties to obtain det(Ĉ) = det(A1) det(A1^{-1/2} A2 A1^{-1/2})^{t̄}. Applying logarithms on both sides, we obtain the desired result. Clearly, it solves Problem 3.3.4 since the distance in this case is zero. Finally, if A1 = αA2, then Λ = αI, and note that the general expression simplifies to tr( log_m(Λ^{t̄} Ĉ^{-1/2} A1 Ĉ^{-1/2}) ) = 0. Using Jacobi’s formula, it simplifies further to det(Λ^{t̄} Ĉ^{-1/2} A1 Ĉ^{-1/2}) = 1, and we come back to the same expression with determinants as above.
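The closed-form expression for t̄ reduces to a ratio of log-determinant differences, which is easy to verify numerically. A quick sketch (ours), assuming NumPy; t_bar is a hypothetical helper name, and the commented check relies on the geodesic helper sketched in Section 2.1.

```python
import numpy as np

def t_bar(A1, A2, C_hat):
    """Closed-form parameter from Lemma 3.4.2, valid when C_hat lies on the geodesic
    (or when A1 = alpha*A2): a ratio of log-determinant differences."""
    logdet = lambda M: np.linalg.slogdet(M)[1]
    return (logdet(C_hat) - logdet(A1)) / (logdet(A2) - logdet(A1))

# If C_hat = phi_{A1->A2}(t0), then t_bar recovers t0 (up to round-off), e.g.:
#   A1, A2 = rand_spd(5), rand_spd(5); t0 = 0.6
#   assert np.isclose(t_bar(A1, A2, geodesic(A1, A2, t0)), t0)
```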

The following propositions characterize the optimal solutions of Problems 3.3.2 to 3.3.4.

Proposition 3.4.1 (Natural projection for covariance estimation). The optimal parameter t* satisfies:

tr( log_m(Z Λ^{−t*}) log_m(Λ) ) = 0,   (3.4.1)

where Z = U^⊤ A1^{-1/2} Ĉ A1^{-1/2} U. This optimality equation can also be rewritten as an orthogonality condition on the tangent space:

g_{R_{A1→A2}(t*)}( A1^{-1/2} Ĉ A1^{-1/2}, R_{A1→A2}(1 + t*) ) = 0,   (3.4.2)

where R_{A1→A2}(t) = (A1^{-1/2} A2 A1^{-1/2})^t is the whitened geodesic A1^{-1/2} ϕ_{A1→A2} A1^{-1/2}.

The natural projection consists in minimizing the natural distance to a certain matrix over a curve. This is similar to what one would do in a Euclidean space, but on a manifold. On the tangent space, this operation looks like finding a point at which the projected geodesic is orthogonal to the direction of the outside matrix (Equation (3.4.2)). Figure 3-2 illustrates this relationship.
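The first-order condition (3.4.1) can be checked numerically at a candidate t*, for instance the one returned by the natural-projection sketch in Section 3.3. The following is our own sketch, assuming NumPy/SciPy; check_optimality is a hypothetical helper name.

```python
import numpy as np
from scipy.linalg import sqrtm, logm

def check_optimality(A1, A2, C_hat, t_star):
    """Evaluate tr( log_m(Z Lambda^{-t*}) log_m(Lambda) ), Eq. (3.4.1); ~0 at the optimum."""
    R = np.real(sqrtm(A1))
    Ri = np.linalg.inv(R)
    w, U = np.linalg.eigh(Ri @ A2 @ Ri)          # Lambda = diag(w)
    Z = U.T @ Ri @ C_hat @ Ri @ U
    M = np.real(logm(Z * w ** (-t_star)))        # log_m of Z @ diag(w^{-t*})
    return np.trace(M @ np.diag(np.log(w)))
```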

As mentioned in Section 3.2, the Fisher information metric between two normal distributions with known mean is proportional to the natural distance, so the former is also minimized when the latter is. Therefore, the aforementioned t* minimizes the Fisher information metric between N(0, ϕ_{A1→A2}(t*)) and N(0, Ĉ).

Figure 3-2: Illustration of the orthogonality condition in Proposition 3.4.1. Solving Problem 3.3.4 is equivalent to finding a t* such that Ĉ is orthogonal to the unitary vector R(1 + t*).

Proposition 3.4.2 (Reverse I-projection for covariance estimation). The optimal parameter t̂ satisfies:

tr( (Z Λ^{−t̂} − Id) log_m(Λ) ) = 0,   (3.4.3)

where Z = U^⊤ A1^{-1/2} Ĉ A1^{-1/2} U. This optimality equation can also be rewritten as an orthogonality condition:

g_{R_{A1→A2}(t̂)}( exp_{R_{A1→A2}(t̂)}( A1^{-1/2} Ĉ A1^{-1/2} − R_{A1→A2}(t̂) ), R_{A1→A2}(1 + t̂) ) = 0.   (3.4.4)

Notice that the matrix A1^{-1/2} Ĉ A1^{-1/2} − R_{A1→A2}(t̂) lives in a Euclidean space and not necessarily in S^+(n). Indeed, such a subtraction is the usual way of obtaining a vector between two points in a “flat” space, but it does not necessarily preserve positive-definiteness. Intuitively, the likelihood (which yields the first argument of Equation (3.4.4)) seems to be a flat notion, whereas the geodesic (yielding the second argument) is in the manifold; thus one could argue that reverse I-projection is actually inconsistent. Instead, one should either maximize the likelihood over a family produced by a convex combination of anchors (i.e., both the family and the divergence we are minimizing are in a flat space) or minimize the natural distance over a proper geodesic (as in Problem 3.3.4). Indeed, the orthogonality condition for natural projection in Proposition 3.4.1 is far more direct. We can develop similar results for Problem 3.3.3:

Proposition 3.4.3 (I-projection for covariance estimation). The optimal parameter ť satisfies:

tr( (Λ^{ť} Z^{-1} − Id) log_m(Λ) ) = 0,   (3.4.5)

where Z = U^⊤ A1^{-1/2} Ĉ A1^{-1/2} U. This optimality equation can also be rewritten as an orthogonality condition:

g_{R_{A1→A2}(−ť)}( exp_{R_{A1→A2}(−ť)}( A1^{1/2} Ĉ^{-1} A1^{1/2} − R_{A1→A2}(−ť) ), R_{A1→A2}(1 − ť) ) = 0.   (3.4.6)

Notice that Equation (3.4.3) is the first-order Taylor expansion of Equation (3.4.1) around the identity matrix if we use Z Λ^{−t} as a variable. In this sense, maximizing the likelihood (reverse I-projection) corresponds to solving a linearized version of the natural distance minimization problem. The same can be observed with the I-projection if we Taylor expand in Λ^{t} Z^{-1} (cf. Equations (3.4.1) and (3.4.5)). It suffices to adapt Equation (3.4.1) using log_m(Z Λ^{−t}) = − log_m(Λ^{t} Z^{-1}).

3.5 Local analysis and comparison of the projections

We now compare the solutions of Problems 3.3.2 to 3.3.4 when Ĉ is very close to the geodesic (in terms of natural distance).

Lemma 3.5.1 (Equality of the limit). Let A1, A2, and C be matrices in S^+(n). Assume that d(ϕ_{A1→A2}(t), C) = ε, ε > 0. In the limit of ε → 0, Problems 3.3.2 to 3.3.4 have the same solution.

Proof. Let t* be the minimizer of d(ϕ_{A1→A2}(t), C) and A* = ϕ_{A1→A2}(t*). Without loss of generality, define Ĉ such that d(A*, Ĉ) = 1 and ϕ_{A*→Ĉ}(ε) = C. By the properties of the natural distance, we know that d(A*, C) = ε. Define Z_ε = U^⊤ A1^{-1/2} ϕ_{A*→Ĉ}(ε) A1^{-1/2} U and notice that:

Z = lim_{ε→0} Z_ε = lim_{ε→0} U^⊤ A1^{-1/2} ϕ_{A*→Ĉ}(ε) A1^{-1/2} U = U^⊤ A1^{-1/2} A* A1^{-1/2} U.

By definition, A* = ϕ_{A1→A2}(t*) = A1^{1/2} U Λ^{t*} U^⊤ A1^{1/2}, thus Z = Λ^{t*}. The proof is concluded after realizing that, with this value of Z, Equations (3.4.3) and (3.4.5) hold.

Together with idempotence of the projection (Lemma 3.4.2), Lemma 3.5.1 implies continuity of ∆̂t := t̂ − t* at ε = 0. Indeed, idempotence means pointwise equivalence at ε = 0, and in the limit ε → 0, we also see ∆̂t = 0. Thus ∆̂t is continuous at that point. The same is also true for ∆̌t := ť − t*.

Theorem 3.5.1 (Natural projection versus I-projection and reverse I-projection). Let A1, A2, C ∈ S^+(n), and without loss of generality suppose that A1 is the matrix in ϕ_{A1→A2} that minimizes the distance d(ϕ_{A1→A2}(t), C). The difference between the solutions of Problems 3.3.2 and 3.3.4 as a function of ε = d(ϕ_{A1→A2}(t), C) is ∆̂t(ε), defined implicitly as:

tr( (U^⊤ V Σ^{ε} V^⊤ U Λ^{∆̂t(ε)} − Id) log_m(Λ) ) = 0,   (3.5.1)

where V Σ V^⊤ = A1^{-1/2} Ĉ A1^{-1/2} and U Λ U^⊤ = A1^{-1/2} A2 A1^{-1/2} are orthogonal eigendecompositions.

Similarly, the difference in the solutions of Problems 3.3.3 and 3.3.4 as a function of ε is ∆̌t(ε), defined implicitly as:

tr( (Λ^{−∆̌t(ε)} U^⊤ V Σ^{−ε} V^⊤ U − Id) log_m(Λ) ) = 0.   (3.5.2)

Moreover, the functions ∆̂t(ε) and ∆̌t(ε) are continuous at ε = 0, and

∆̂t′(0) = ∆̌t′(0) = −g_{A1}(C̄, Ā2) / d(A1, A2) = 0,   (3.5.3)

where C̄ = log_{A1}(C) and Ā2 = log_{A1}(A2) denote the tangent vectors at A1 pointing to C and A2.

Proof. Refer to the construction of the proof in Lemma 3.5.1 and notice that A* = A1, t* = 0 for any ε, and

Z_ε = U^⊤ A1^{-1/2} ϕ_{A1→Ĉ}(ε) A1^{-1/2} U = U^⊤ V Σ^{ε} V^⊤ U.

Then Equation (3.5.1) follows immediately from Equation (3.4.3), and Equation (3.5.2) from Equation (3.4.5). Continuity at ε = 0 follows from Lemmas 3.4.2 and 3.5.1. Now, we can take derivatives of Equation (3.5.1) and obtain the following:

tr( ( U^⊤ V Σ^{ε} log_m(Σ) V^⊤ U Λ^{∆̂t(ε)} + U^⊤ V Σ^{ε} V^⊤ U log_m(Λ) Λ^{∆̂t(ε)} ∆̂t′(ε) ) log_m(Λ) ) = 0,   (3.5.4)

which evaluated at ε = 0 results in:

∆̂t′(0) = − tr( U^⊤ V log_m(Σ) V^⊤ U log_m(Λ) ) / tr( log_m²(Λ) ),   (3.5.5)

which can be rewritten as:

∆̂t′(0) = − tr( log_m(A1^{-1/2} C A1^{-1/2}) log_m(A1^{-1/2} A2 A1^{-1/2}) ) / d(A1, A2) = 0.

Notice that ∆̂t′(0) vanishes since the numerator is Equation (3.4.2). An analogous derivation for ∆̌t′(0) provides the same result.

From Equation (3.5.3), note that ∆̂t′(0) can be understood as the inner product of the tangent vectors at A1 pointing to C and A2, normalized by the distance from the reference point to the latter matrix. The expression is analogous to the classic form of the inner product as a product of modulus and angle. In our setting, the modulus is d(A1, A2) and the angle is ∆̂t′(0).

Since ∆̂t(0) = 0 and ∆̂t′(0) = 0, a second-order Taylor series expansion around ε = 0 would be:

∆̂t(ε) = (∆̂t″(0)/2) ε² + O(ε³).

Finally, we compare I-projection and reverse I-projection, summarizing the results below:

in Theorem 3.5.1. The i-th derivatives of ˆ∆t and ˇ∆t satisfy the following: ˆ

∆t(i)(0) = ∆tˇ (i)(0), i = 1, 3, 5 . . . ˆ

∆t(i)(0) = − ˇ∆t(i)(0), i = 0, 2, 4, 6 . . . .

Thus, the Taylor expansion of the difference in the solutions of Problems3.3.2and3.3.3

as a function of  = d(ϕA1→A2(t), C), in a neighborhood of  = 0, always attains one

additional order of accuracy. In particular,

ˆ t() − ˇt() = ˆ∆t00(0)2+ O(4), where ˆ ∆t00(0) = tr  log2m(A− 1 2 1 CA −12 1 ) logm(A −12 1 A2A −12 1 )  d(A1, A2) .

Proof. Comparing the implicit derivative of ˇ∆t with Equation (3.5.4), we notice that the signs will alternate in each subsequent derivative. Theorem3.5.2is obtained after taking derivatives of Equation (3.5.4).

From Theorems 3.5.1 and 3.5.2, the difference between the solution of Problem 3.3.2 or Problem 3.3.3 and that of Problem 3.3.4 is locally of order ε². Similarly, the difference between the solutions of Problems 3.3.2 and 3.3.3 is also of order ε². Moreover, as shown in Section 3.5, given that the first derivative is zero and ∆̂t″(0) = −∆̌t″(0) (Theorem 3.5.2), the natural projection will typically fall between the I-projection and the reverse I-projection.
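A small numerical experiment (ours, not from the thesis) computes the three estimates on random SPD matrices, so the nearness of t*, t̂, and ť can be observed directly. It assumes NumPy/SciPy; project, rand_spd, and the three loss functions are hypothetical helper names.

```python
import numpy as np
from scipy.linalg import eigh, sqrtm
from scipy.optimize import minimize_scalar

def project(A1, A2, C_hat, loss):
    """Minimize loss(lam) over the geodesic, where lam are the generalized
    eigenvalues of the pencil (C_hat, phi_{A1->A2}(t))."""
    R = np.real(sqrtm(A1))
    Ri = np.linalg.inv(R)
    w, U = np.linalg.eigh(Ri @ A2 @ Ri)
    def obj(t):
        phi_t = R @ (U * w**t) @ U.T @ R
        return loss(eigh(C_hat, phi_t, eigvals_only=True))
    return minimize_scalar(obj).x

natural = lambda lam: np.sum(np.log(lam) ** 2)         # natural projection -> t*
rev_I   = lambda lam: np.sum(lam - np.log(lam))        # reverse I-projection / ML -> t-hat
I_proj  = lambda lam: np.sum(1.0 / lam + np.log(lam))  # I-projection -> t-check

rng = np.random.default_rng(3)
def rand_spd(n):
    G = rng.standard_normal((n, n))
    return G @ G.T + n * np.eye(n)

A1, A2, C = rand_spd(10), rand_spd(10), rand_spd(10)
print([project(A1, A2, C, f) for f in (natural, rev_I, I_proj)])   # the three estimates of t
```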

Besides the fact that it is a “middle point” between I-projection and reverse I-projection, there are other reasons to prefer natural projection. First, as noted earlier, it does not require assigning a probability distribution to the data. Second, the natural projection inherits the invariance properties of the natural distance, the most


Figure 3-3: We draw three matrix realizations from a Wishart distribution of size n = 10 (A, A2, and C). A1 is then constructed as the minimizer of the natural distance from C to the geodesic ϕ_{A→A2}(t) (cf. construction used in Theorem 3.5.1). We show ∆̂t and ∆̌t as a function of the distance ε from A1 to C. We control ε by defining Ĉ = ϕ_{A1→C}(1/d(A1, C)); thus d(A1, Ĉ) = 1 and ε = d(A1, ϕ_{A1→Ĉ}(ε)) (cf. Equation (2.1.5)). By construction, ∆t for the natural projection is always zero.

important of which are symmetry of the arguments and invariance to inversion; the latter property is particularly useful when working with precision matrices, e.g., to take advantage of sparsity. (Reverse) I-projection does not enjoy these properties. Third, as we showed in Section 3.4, the optimality conditions of the I-projections are first-order Taylor expansions of the natural projection; therefore, by maximizing the likelihood we are only solving a “flat” version of the geodesic problem. Finally, the natural distance is equivalent (up to a constant) to the Fisher information metric/Rao distance between two normal distributions with common mean, and thus the natural projection also minimizes these loss functions. If one uses geodesics (lines that minimize the natural distance) to build a covariance family, it is consistent to use the natural projection to select the most representative member of the family. In Section 3.7, we will show that these advantages of natural projection translate into modeling benefits in practical applications, e.g., robustness to noise-corrupted data.

