A Combinatorial Approach to Goal-Oriented
Optimal Bayesian Experimental Design
by
Fengyi Li
B.S., Texas A&M University (2017)
Submitted to the Department of Aeronautics and Astronautics
in partial fulfillment of the requirements for the degree of
Master of Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2019
© Massachusetts Institute of Technology 2019. All rights reserved.
Author . . . .
Department of Aeronautics and Astronautics
May 28, 2019
Certified by . . . .
Youssef Marzouk
Associate Professor of Aeronautics and Astronautics
Thesis Supervisor
Accepted by . . . .
Sertac Karaman
Associate Professor of Aeronautics and Astronautics
Chair, Graduate Program Committee
A Combinatorial Approach to Goal-Oriented Optimal
Bayesian Experimental Design
by
Fengyi Li
Submitted to the Department of Aeronautics and Astronautics on May 28, 2019, in partial fulfillment of the
requirements for the degree of Master of Science
Abstract
Optimal experimental design plays an important role in science and engineering. In many situations, we have many candidate observations but only a few of them can be selected due to limited resources. We then need to decide which ones to select based on our goal.
In this thesis, we study the Bayesian linear Gaussian model with a large number of observations, and propose several algorithms for solving the combinatorial problem of observation selection/optimal experimental design in a goal-oriented setting. Here, the quantity of interest (QoI) is not the model parameters, but some (vector-valued) function of the parameters. We wish to select a subset of the candidate observations that is most informative for this QoI, in the sense of reducing its uncertainty. More precisely, we seek to maximize the mutual information between the selected observations and the QoI. Finding the true optimum is NP-hard, and in this setting, the mutual information objective is in general not submodular. We thus introduce several algorithms that approximate the optimal solution, including a greedy approach, a minorize–maximize approach employing modular bounds, and certain score-based heuristics. We compare the computational costs of these algorithms, and demonstrate their performance on a synthetic data set and a real data set from a climate model.

Thesis Supervisor: Youssef Marzouk
Title: Associate Professor of Aeronautics and Astronautics
Acknowledgments
First of all, I would like to thank my advisor, Professor Youssef Marzouk, for taking me on as his graduate student two years ago. Although busy, he is always there to lend a helping hand whenever needed. He has taught me how to become a researcher rather than just a problem solver. This thesis would not have been completed without his help and guidance.
In addition, I would like to thank my lab mates, Alessio, Andrea, Ben, Chi, Daniele, Jayanth, Michael, Ricardo, and Zheng, for insightful discussions about research and interesting daily conversations. I especially thank Jayanth for guiding me through this project and sharing his code. I would also like to thank my friends outside the lab, Bai, Hanshen, Shaoxiong, Xiaoyue, Xinzhe, and Yilun, for our irregular hotpot gatherings. In particular, I thank Bai for taking so many classes with me and going through the frustrating moments before homework was due; and Hanshen for all those discussions about random stuff and for being a good game partner in playing "chiji" on countless evenings when my research seemed hopeless.
Most importantly, I would like to thank my parents for their love, encouragement and support throughout the years.
Contents
1 Introduction 13
2 Background 17
2.1 Classical optimal experimental design . . . 17
2.2 The Bayesian framework . . . 18
2.3 The selection operator . . . 19
2.4 Bayesian optimality criterion . . . 20
2.5 The generalized eigenvalue problem . . . 21
2.6 Connection to optimal low rank approximation . . . 23
2.7 Connection to canonical correlation analysis (CCA) . . . 25
2.8 The goal operator and the transformed model . . . 27
3 Approximate submodular functions 31
3.1 The greedy algorithm . . . 33
3.2 Submodularity ratio and generalized curvature . . . 34
3.3 α-approximate submodularity and ε-approximate submodularity . . . 35
3.4 The challenge in goal-oriented case . . . 36
4 Algorithms 39
4.1 The greedy algorithm . . . 39
4.2 The MM algorithm . . . 40
4.3 The generalized leverage score algorithm . . . 41
5 Numerical examples 47
5.1 Synthetic data set . . . 47
5.1.1 Calibration . . . 47
5.1.2 Correlated noise . . . 51
5.1.3 Uncorrelated noise . . . 53
5.2 Simple E3SM (Energy Exascale Earth System Model) land model (sELM) 55
5.2.1 Model description . . . 55
5.2.2 Linearized sELM . . . 58
5.2.3 Numerical results . . . 59
6 Conclusions and future work 65
A Proofs 67
A.1 Proof of Lemma 2 . . . 67
A.2 Proof of Theorem 5 . . . 71
A.3 Proof of Lemma 3 . . . 72
A.4 Proof of Theorem 6 . . . 74
A.5 Proof of Theorem 8 . . . 76
B The generalized eigenvalue problem 79
List of Figures
5-1 Spectrum of important matrices and pencils of a calibration example
with correlation length 0.05 . . . 48
5-2 Numerical results of a calibration example with correlation length 0.05 49
5-3 Spectrum of important matrices and pencils of a calibration example with correlation length 0.5 . . . 50
5-4 Numerical results of a calibration example with correlation length 0.5 51
5-5 Spectrum of important matrices and pencils with correlated noise . . 52
5-6 Numerical results with correlated noise . . . 53
5-7 Spectrum of important matrices and pencils with uncorrelated noise 54
5-8 Numerical results with uncorrelated noise . . . 55
5-9 Spectrum for linearized sELM model with H = (1/47) Σ_{i=1}^{47} X_i . . . 59
5-10 Numerical results for linearized sELM model with H = (1/47) Σ_{i=1}^{47} X_i . . 60
5-11 Selected locations shown on a map with H = (1/47) Σ_{i=1}^{47} X_i . . . 61
5-12 Spectrum for linearized sELM model with H = X_1 . . . 62
5-13 Numerical results for linearized sELM model with H = X1 . . . 62
List of Tables
Chapter 1
Introduction
The optimal experimental design (OED) problem arises in numerous scenarios and has various applications in science and engineering, such as combustion kinetics [26], sensor placement for weather prediction [34], and contaminant source identification [5]. There are numerous variations of the OED problem. In general, however, they aim to solve the following problem: how to select k out of n observations in order to "optimally" learn the underlying parameters. One particular variant is the goal-oriented OED problem. In this case, the quantity of interest (QoI) is some function of the underlying parameters rather than the parameters themselves. The optimality criteria, usually labeled by letters, are functionals of the eigenvalues of Fisher information matrices of the model [15, 41, 18]. For example, if we would like to minimize the average variance of the QoI, the A-optimality criterion is adopted. Another commonly used criterion minimizes the volume of the confidence ellipsoid, leading to the D-optimality criterion. The details of the different optimality criteria are presented in section 2.1.
One naive way to solve this problem is to exhaust all (n choose k) possible selections and choose the best k-subset. This problem is NP-hard, and exhaustive enumeration immediately becomes intractable even for moderate dimensions. For example, one needs to iterate over 4.71 × 10^13 choices to select 20 out of 50 observations using exhaustive search. Therefore, most of the effort in this field goes into developing methods for finding approximate solutions. Some common methods that have been used
in this field are the greedy algorithm, continuous relaxations, and exchange algorithms. The greedy algorithm for cardinality-constrained design optimizes the objective in each iteration by selecting the element that gives the largest improvement, conditioning on the ones already chosen. If the objective function is non-decreasing (see Def. 1 in Chapter 3) and submodular (Def. 2), the greedy algorithm yields a 1 − 1/e constant-factor performance guarantee [37]. Furthermore, if the objective is not submodular but still non-decreasing, we can define a notion of curvature (Def. 4) and a submodularity ratio (Def. 3) [7], and the performance bound can be written in terms of the curvature and submodularity ratio. More details about maximizing submodular functions can be found in [33] and Chapter 3.
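The combinatorial scale motivating these approximate methods is easy to check directly; a quick, purely illustrative sketch:

```python
from math import comb

# Number of k-subsets of n candidate observations an exhaustive search must visit.
n, k = 50, 20
n_subsets = comb(n, k)
print(f"choose {k} of {n}: {n_subsets:.3g} subsets")  # on the order of 4.7e13
```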
On the other hand, we can consider a continuous relaxation of the combinatorial problem. A standard approach is to relax the {0, 1} constraint: we let each variable lie in [0, 1] and solve the continuous problem. In many settings, continuous relaxation results in convex problems [30, 3]. In other cases, an ℓ₁ regularization term can be added to the optimization problem to encourage sparsity of the solution [5].
Fedorov's exchange algorithm [20] starts with some random set S of cardinality k, and in each iteration exchanges one element i ∈ S with an element not in the set, j ∈ S^c, so that the objective improves after the exchange. The algorithm terminates when the maximum number of iterations is reached or when no exchange can further improve the objective. Fedorov's exchange algorithm is slow, and there have been several variants of it, as described in [38].
In the Bayesian framework, a prior distribution is placed on the parameters to represent our state of knowledge before seeing the data. A posterior distribution of the parameters is then obtained by conditioning on the observations. Hence, instead of outputting a point estimator, the Bayesian framework yields a distribution, which makes it a powerful tool for uncertainty quantification. Some of the alphabetical optimality criteria used in the classical setting have natural extensions to the Bayesian setting and admit statistical interpretations. We will describe them in section 2.2. Huan et al. [26] propose a simulation-based framework for solving the Bayesian optimal experimental design problem. For sequential experimental design under the Bayesian framework, Huan et al. [27] propose approximate dynamic programming (DP) to solve the sequential design problem. Unlike the greedy approach, DP considers both the future and the feedback. More details about sequential Bayesian experimental design can be found in [43].
The design problem becomes nonlinear if we have a nonlinear forward model. Exact calculation of the optimal design criteria in this case often leads to a complicated integral, and approximations have to be adopted. Commonly used approximations include a normal approximation to the posterior distribution, or using the empirical Fisher information matrix rather than the expected one. More detailed discussion of nonlinear design can be found in [11].
Thesis contribution: we propose a one-shot method for selecting observations in OED. Although the algorithm has some limitations, its computational cost is much lower compared to some existing methods. We demonstrate its performance on a synthetic data set and a real data set.
Chapter 2
Background
2.1 Classical optimal experimental design
In the classical (frequentist) optimal experimental design setting, the efficiency of a design is measured by functionals of the eigenvalues of certain matrices. Different choices of functionals lead to different optimality criteria, hence different designs. Consider a linear model
Y = GX + ε    (2.1)

where Y ∈ R^n is the observation, X ∈ R^m is the underlying parameter, and ε ∈ R^n is the additive noise, distributed as N(0, Γ_ε). The alphabetical optimality criteria, along with their Bayesian extensions, are conventionally labeled by letters. For the classical design, we assume the noise is independent, namely, Γ_ε is diagonal with each of its diagonal elements σ_i > 0. Before introducing the design criteria, we first define the selection operator P^⊤, where each of its rows is a unit vector with all 0's except one entry being 1. If we want to select k out of n observations, then P^⊤ is in R^{k×n}. Using the selection operator, we now list some commonly used design criteria.
A(verage)-optimality minimizes the average variance of the estimator:

min_{rank(P)≤k} tr( G^⊤ P (P^⊤ Γ_ε P)^{−1} P^⊤ G )^{−1}.    (2.2)

D(eterminant)-optimality minimizes the volume (mean radius) of the confidence ellipsoid of the estimator [30]:

min_{rank(P)≤k} det( G^⊤ P (P^⊤ Γ_ε P)^{−1} P^⊤ G )^{−1}.    (2.3)

E(igenvalue)-optimality minimizes the maximum variance over all directions of the estimator:

min_{rank(P)≤k} λ_max( (G^⊤ P (P^⊤ Γ_ε P)^{−1} P^⊤ G)^{−1} ).    (2.4)

These criteria allow us to choose a particular design to fulfill different purposes subject to cardinality constraints. Some other optimality criteria can be found in [41].
2.2 The Bayesian framework
In the Bayesian setting, a prior distribution is first put on the parameters, representing our state of knowledge before observing the data. We start by writing out the posterior distribution. We assume X has a multivariate normal prior distribution and, without loss of generality, that the prior mean is 0, X ∼ N(0, Γ_X), with noise ε ∼ N(0, Γ_ε). Here, we do not assume the noise is independent, i.e., Γ_ε need not be diagonal. Using Bayes' rule, the posterior distribution of X|Y is also Gaussian, X|Y ∼ N(µ_pos(Y), Γ_pos), with mean

µ_pos(Y) = Γ_pos G^⊤ Γ_ε^{−1} Y    (2.5)

and covariance

Γ_pos = (H_l + Γ_X^{−1})^{−1}    (2.6)

where H_l = G^⊤ Γ_ε^{−1} G is the Hessian of the negative log-likelihood. It is worth mentioning that there is a variational characterization of the mean µ_pos. Define

J(µ) := (1/2) ‖Gµ − Y‖²_{Γ_ε^{−1}} + (1/2) ‖µ‖²_{Γ_X^{−1}};    (2.7)

then µ_pos is the solution to the following optimization problem [5]:

µ_pos = arg min_µ J(µ),    (2.8)

and the Hessian of J(µ) is

H_J = H_l + Γ_X^{−1} = Γ_pos^{−1}.    (2.9)
In the setting of Bayesian experimental design, one seeks the observations that minimize the posterior uncertainty of the underlying parameter X, subject to cardinality constraints. In the next section, we see how the selection operator P^⊤ can help formulate this problem.
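The posterior formulas (2.5)-(2.6) are easy to sanity-check numerically. A minimal sketch, with arbitrary random model matrices (all values here are illustrative), verifies that the information-form covariance (2.6) matches the covariance obtained by conditioning the joint Gaussian directly:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 4
G = rng.standard_normal((n, m))
Gamma_X = np.eye(m)                            # prior covariance (illustrative)
A_ = rng.standard_normal((n, n))
Gamma_eps = A_ @ A_.T + 0.1 * np.eye(n)        # correlated noise covariance

# Posterior covariance via (2.6): (G^T Gamma_eps^{-1} G + Gamma_X^{-1})^{-1}
Hl = G.T @ np.linalg.solve(Gamma_eps, G)
Gamma_pos = np.linalg.inv(Hl + np.linalg.inv(Gamma_X))

# Same object via joint-Gaussian conditioning (Woodbury identity):
#   Gamma_X - Gamma_X G^T (G Gamma_X G^T + Gamma_eps)^{-1} G Gamma_X
Gamma_Y = G @ Gamma_X @ G.T + Gamma_eps
Gamma_pos_alt = Gamma_X - Gamma_X @ G.T @ np.linalg.solve(Gamma_Y, G @ Gamma_X)
assert np.allclose(Gamma_pos, Gamma_pos_alt)

# Posterior mean (2.5) for a synthetic observation
Y = G @ rng.standard_normal(m) + rng.multivariate_normal(np.zeros(n), Gamma_eps)
mu_pos = Gamma_pos @ G.T @ np.linalg.solve(Gamma_eps, Y)
```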
2.3 The selection operator
We have introduced the selection operator P^⊤ earlier. To restate: each row of P^⊤ is a unit vector with all 0's except one entry being 1. Now, given the same setting as in section 2.2, we compute the posterior distribution of X after observing P^⊤Y. Denoting P^⊤Y by Y_P, the posterior distribution is also multivariate normal with mean

µ_{X|Y_P} = Γ_{X|Y_P} G^⊤ P (P^⊤ Γ_ε P)^{−1} P^⊤ Y    (2.10)

and covariance matrix

Γ_{X|Y_P} = ( G^⊤ P (P^⊤ Γ_ε P)^{−1} P^⊤ G + Γ_X^{−1} )^{−1}.    (2.11)

From the formulation above, what the selection operator P essentially does is select the corresponding rows and columns (a principal submatrix) of the noise covariance matrix Γ_ε, invert this submatrix, and enlarge it back to the original size by putting zeros in the unselected entries. Let A^+ denote the pseudo-inverse of a matrix A. Then we have

P (P^⊤ Γ_ε P)^{−1} P^⊤ = P P^⊤ (P P^⊤ Γ_ε P P^⊤)^+ P P^⊤.    (2.12)

Note that P P^⊤ is a square matrix of size n × n, and the covariance matrix can be written equivalently as

Γ_{X|Y_P} = ( G^⊤ P P^⊤ (P P^⊤ Γ_ε P P^⊤)^+ P P^⊤ G + Γ_X^{−1} )^{−1}.    (2.13)
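The identity (2.12), and hence the equivalence of (2.11) and (2.13), can be verified numerically. A minimal sketch with an arbitrary random model and a hypothetical selection of rows {1, 3, 6}:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 7, 3
G = rng.standard_normal((n, m))
Gamma_X = np.eye(m)
B = rng.standard_normal((n, n))
Gamma_eps = B @ B.T + 0.1 * np.eye(n)          # correlated noise (illustrative)

rows = [1, 3, 6]                               # hypothetical selected observations
P = np.zeros((n, len(rows)))
P[rows, np.arange(len(rows))] = 1.0

# Left-hand side of (2.12): invert the principal submatrix, embed with P
lhs = P @ np.linalg.inv(P.T @ Gamma_eps @ P) @ P.T
# Right-hand side: zero out unselected rows/columns, then pseudo-invert
PPt = P @ P.T                                  # n x n projector onto selected rows
rhs = PPt @ np.linalg.pinv(PPt @ Gamma_eps @ PPt) @ PPt
assert np.allclose(lhs, rhs)

# Posterior covariance via (2.11) and via the equivalent form (2.13)
cov_sel = np.linalg.inv(G.T @ lhs @ G + np.linalg.inv(Gamma_X))
cov_pad = np.linalg.inv(G.T @ rhs @ G + np.linalg.inv(Gamma_X))
assert np.allclose(cov_sel, cov_pad)
```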
2.4 Bayesian optimality criterion
The A- and D-optimality criteria have corresponding Bayesian extensions that admit statistical interpretations. The Bayesian A-optimal design minimizes the trace of the posterior covariance matrix, i.e.,

min_{rank(P)≤k} tr(Γ_{X|Y_P}) = min_{rank(P)≤k} tr( G^⊤ P (P^⊤ Γ_ε P)^{−1} P^⊤ G + Γ_X^{−1} )^{−1},    (2.14)

and the Bayesian D-optimal design minimizes the log-determinant of the posterior covariance matrix:

min_{rank(P)≤k} log det(Γ_{X|Y_P}) = min_{rank(P)≤k} log det( G^⊤ P (P^⊤ Γ_ε P)^{−1} P^⊤ G + Γ_X^{−1} )^{−1}.    (2.15)
Comparing (2.14), (2.15) with (2.2), (2.3), we see that the prior plays the role of a regularization term, in the sense that it prevents the optimization problem from being ill-conditioned. The A-optimality criterion can be interpreted as minimizing the Bayes risk of the posterior mean:

E_X E_{Y|X}[ ‖X − µ_{X|Y_P}‖² ] = tr(Γ_{X|Y_P}).

On the other hand, the D-optimality criterion minimizes the log-volume (mean radius) of the credible ellipsoid of the posterior distribution [30]. In addition, the log-determinant of the posterior covariance can be connected to the expected information gain, characterized by the expected Kullback-Leibler (KL) divergence between the prior and the posterior, using the following identity [5, 26]:

E_Y[ D_KL( N(µ_{X|Y_P}, Γ_{X|Y_P}) ‖ N(0, Γ_X) ) ] = −(1/2) log det Γ_{X|Y_P} + (1/2) log det Γ_X.    (2.16)

This gives the information-theoretic motivation for the D-optimal design. (2.16) is also equal to the mutual information between the observations and the underlying parameters,

I(Y_P; X) = (1/2) log ( det Γ_X / det Γ_{X|Y_P} ),    (2.17)

and we see that minimizing the log-determinant of the posterior covariance is equivalent to maximizing the expected information gain and the mutual information. In this thesis, we mainly consider the Bayesian D-optimal design and use the mutual information characterization.
2.5 The generalized eigenvalue problem
The symmetry of mutual information in its two arguments enables us to decompose it in the following two ways:

I(Y_P; X) = H(Y_P) − H(Y_P|X) = H(X) − H(X|Y_P)    (2.18)

where H denotes the differential entropy. For a general d-dimensional multivariate normal random variable W ∼ N(µ, Γ), the differential entropy has a closed form:

H(W) = (1/2) log det(Γ) + (d/2)(1 + log(2π)).    (2.19)

Combining (2.18) and (2.19), we can formulate the D-optimality criterion in two ways:

I(Y_P; X) = H(Y_P) − H(Y_P|X) = −(1/2) log det(P^⊤ Γ_ε P) + (1/2) log det Γ_{Y_P}    (2.20)
          = H(X) − H(X|Y_P) = −(1/2) log det Γ_{X|Y_P} + (1/2) log det Γ_X,    (2.21)

where Γ_{Y_P} = P^⊤ Γ_Y P = P^⊤ (G Γ_X G^⊤ + Γ_ε) P is the covariance matrix of the selected observations. This symmetry also lets us see the problem in two ways: (2.20) is the formulation in the data space while (2.21) is the formulation in the parameter space. Both can be viewed as a generalized eigenvalue problem (GEP).
For a matrix pencil (A, B), where A and B are full-rank square matrices of the same dimension, the GEP solves the following equation:

A v_i = λ_i B v_i    (2.22)

where λ_i is an eigenvalue and v_i is the corresponding eigenvector. See more about the GEP in Appendix B. Then (2.20) can be written as

(1/2) log det(Σ_data)    (2.23)

where Σ_data = diag(σ_data,i), and the σ_data,i are the eigenvalues of the pencil (Γ_{Y_P}, P^⊤ Γ_ε P). Similarly, (2.21) can be written as

(1/2) log det(Σ_para)    (2.24)

where Σ_para = diag(σ_para,i), and the σ_para,i are the eigenvalues of the pencil (Γ_X, Γ_{X|Y_P}).
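The two formulations can be checked against each other numerically. The sketch below (random toy model, hypothetical selection of rows {0, 2, 5}) computes the mutual information from the data-space pencil and from the parameter-space pencil and confirms they agree:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(3)
n, m = 6, 4
G = rng.standard_normal((n, m))
Gamma_X = np.eye(m)
Gamma_eps = np.diag(rng.uniform(0.5, 1.5, n))

rows = [0, 2, 5]                               # hypothetical selection
P = np.zeros((n, len(rows)))
P[rows, np.arange(len(rows))] = 1.0

Gamma_Y = G @ Gamma_X @ G.T + Gamma_eps
Gamma_YP = P.T @ Gamma_Y @ P
Gamma_eps_P = P.T @ Gamma_eps @ P

# Data-space pencil (Gamma_YP, Gamma_eps_P): eigenvalues sigma_data, eq. (2.23)
sigma_data = eigh(Gamma_YP, Gamma_eps_P, eigvals_only=True)
I_data = 0.5 * np.sum(np.log(sigma_data))

# Parameter-space pencil (Gamma_X, Gamma_X|YP): eigenvalues sigma_para, eq. (2.24)
cov_post = np.linalg.inv(
    G.T @ P @ np.linalg.solve(Gamma_eps_P, P.T @ G) + np.linalg.inv(Gamma_X))
sigma_para = eigh(Gamma_X, cov_post, eigvals_only=True)
I_para = 0.5 * np.sum(np.log(sigma_para))

assert np.isclose(I_data, I_para)              # same mutual information I(Y_P; X)
```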
This eigenvalue structure also connects the design problem to optimal low rank approximation, as presented in [47]. We introduce this in the following section.
2.6 Connection to optimal low rank approximation
The optimal low rank approximation solves the following problem:

Γ̂*_pos = arg min_{Γ̂_pos} d²_R(Γ_pos, Γ̂_pos)    (2.25)

such that

Γ̂_pos ∈ M_r = { Γ_X − K K^⊤ ≻ 0 : rank(K) ≤ r }    (2.26)

where d_R is a distance between two symmetric positive definite (SPD) matrices, defined as

d_R(A, B) = sqrt( tr[ log²(A^{−1/2} B A^{−1/2}) ] ) = sqrt( Σ_i log²(λ_i) )    (2.27)

where A, B are two SPD matrices of compatible size and the λ_i are the generalized eigenvalues of the matrix pencil (A, B). The optimization problem (2.25) is solved using the generalized eigenvalues and eigenvectors of the matrix pencil (G^⊤ Γ_ε^{−1} G, Γ_X^{−1}). We present the optimal low rank approximation theorem below (Theorem 2.3 in [47]).
Theorem 1 (Optimal posterior covariance approximation). Let (δ_i², ŵ_i) be the generalized eigenvalue-eigenvector pairs of the matrix pencil (G^⊤ Γ_ε^{−1} G, Γ_X^{−1}), with the ordering δ_i ≥ δ_{i+1}. Given Γ_pos, a minimizer Γ̂*_pos of d²_R(Γ_pos, Γ̂_pos) over Γ̂_pos ∈ M_r is given by

Γ̂*_pos = Γ_X − K K^⊤    (2.28)

where

K K^⊤ = Σ_{i≤r} δ_i² (1 + δ_i²)^{−1} ŵ_i ŵ_i^⊤.    (2.29)

The corresponding minimum loss is given by

d²_R(Γ_pos, Γ̂*_pos) = Σ_{i>r} log²( 1 / (1 + δ_i²) ).    (2.30)
Using the identities relating eigenpairs of Schur complements of covariance matrices, Theorem 1 can be reformulated as follows.

Theorem 2 (Optimal posterior covariance approximation, Schur complement version). Let (λ_i, q_i) be the eigenpairs of the pencil (G Γ_X G^⊤, Γ_Y) with the ordering λ_i ≥ λ_{i+1} and normalization q_i^⊤ G Γ_X G^⊤ q_i = 1, where Γ_Y = Γ_ε + G Γ_X G^⊤ is the covariance matrix of the marginal distribution of Y. Then a minimizer Γ*_pos of the Riemannian metric d_R between Γ_pos and an element of M_r is given by:

Γ*_pos = Γ_X − K K^⊤,   K K^⊤ = Σ_{i=1}^{r} q̂_i q̂_i^⊤,   q̂_i = √λ_i Γ_X G^⊤ q_i,

where the corresponding minimum distance is:

d²_R(Γ_pos, Γ*_pos) = Σ_{i>r} log²(1 − λ_i).

This theorem indicates that the low rank approximation using eigenpairs of the pencil (Γ_Y − Γ_ε, Γ_Y) gives the optimal approximation in the Riemannian metric over all SPD matrices.
Now consider projecting the observations onto an r-dimensional subspace, which yields the linear model:

V^⊤ Y = V^⊤ G X + V^⊤ ε    (2.31)

where V ∈ R^{n×r}, and we are interested in the following optimization problem:

max_{V ∈ R^{n×r}} I(V^⊤ Y; X) = max_{V ∈ R^{n×r}} (1/2) log det( (V^⊤ Γ_Y V)(V^⊤ Γ_ε V)^{−1} ).    (2.32)

Intuitively, this finds the basis to project onto so that the mutual information between the underlying parameters and the projected variables is maximized. Let (λ_i, v_i)_{i=1}^{r} be the top r eigenpairs of the pencil (Γ_Y, Γ_ε). It is shown in [22] that the solution to (2.32) is given by the matrix V whose columns are the eigenvectors {v_i}_{i=1}^{r}, and the optimum is given by

max_{V ∈ R^{n×r}} I(V^⊤ Y; X) = (1/2) Σ_{i=1}^{r} log λ_i.    (2.33)
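A small numerical sketch of this result, using SciPy's generalized symmetric eigensolver on an arbitrary random model (illustrative only, not code from [22]):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(4)
n, m, r = 8, 5, 2
G = rng.standard_normal((n, m))
Gamma_X = np.eye(m)
Gamma_eps = np.diag(rng.uniform(0.5, 1.5, n))
Gamma_Y = G @ Gamma_X @ G.T + Gamma_eps

def mi_projected(V):
    """I(V^T Y; X) = 1/2 log det[(V^T Gamma_Y V)(V^T Gamma_eps V)^{-1}], eq. (2.32)."""
    A = V.T @ Gamma_Y @ V
    B = V.T @ Gamma_eps @ V
    return 0.5 * np.linalg.slogdet(np.linalg.solve(B, A))[1]

# Top-r generalized eigenvectors of the pencil (Gamma_Y, Gamma_eps) maximize the MI
lam, vecs = eigh(Gamma_Y, Gamma_eps)           # eigenvalues in ascending order
V_opt = vecs[:, -r:]
I_opt = mi_projected(V_opt)
assert np.isclose(I_opt, 0.5 * np.sum(np.log(lam[-r:])))   # matches (2.33)

# Any other projection should do no better
V_rand = rng.standard_normal((n, r))
assert mi_projected(V_rand) <= I_opt + 1e-9
```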
2.7 Connection to canonical correlation analysis (CCA)
CCA, first proposed by Hotelling in 1936 [25], finds the maximal correlation over all linear transformations of two multivariate random variables. It has been used mainly for two purposes: dimension reduction and data interpretation. The former lets us consider only a small number of linear combinations of the random variables, whereas the latter allows us to find important features/directions for explaining the data [24]. For two multivariate random variables X and Y, with covariance
Σ = [ Σ_XX  Σ_XY ; Σ_YX  Σ_YY ],    (2.34)
consider the linear combinations X^⊤ w_x ∈ R and Y^⊤ w_y ∈ R. CCA solves the following:

arg max_{w_x, w_y} ρ = arg max_{w_x, w_y} corr(X^⊤ w_x, Y^⊤ w_y)    (2.35)

= arg max_{w_x, w_y} E[w_x^⊤ X Y^⊤ w_y] / sqrt( E[w_x^⊤ X X^⊤ w_x] E[w_y^⊤ Y Y^⊤ w_y] )

= arg max_{w_x, w_y} w_x^⊤ Σ_XY w_y / sqrt( w_x^⊤ Σ_XX w_x · w_y^⊤ Σ_YY w_y ).    (2.36)
The solution can be found by solving the following system of generalized eigenvalue equations:

Σ_XY Σ_YY^{−1} Σ_YX w_x = ρ² Σ_XX w_x    (2.37)
Σ_YX Σ_XX^{−1} Σ_XY w_y = ρ² Σ_YY w_y.    (2.38)

That is, (ρ², w_x) is an eigenpair of the pencil (Σ_XY Σ_YY^{−1} Σ_YX, Σ_XX) and (ρ², w_y) is an eigenpair of the pencil (Σ_YX Σ_XX^{−1} Σ_XY, Σ_YY). Moreover, w_x and w_y are related by the following equations:

Σ_XY w_y = ρ λ_x Σ_XX w_x,   Σ_YX w_x = ρ λ_y Σ_YY w_y    (2.39)

where λ_x = λ_y^{−1} = sqrt( (w_y^⊤ Σ_YY w_y) / (w_x^⊤ Σ_XX w_x) ). Therefore, in practice, we only need to solve one of (2.37) and (2.38); (2.39) gives the other.
Returning to our problem, the joint covariance of Y and X as in (2.1) is

[ Γ_Y  G Γ_X ; Γ_X G^⊤  Γ_X ].    (2.40)

Let ŵ_i be the eigenvector of the pencil (G Γ_X G^⊤, Γ_Y), analogous to (2.38), and v̂_i be the eigenvector of the pencil (Γ_X G^⊤ Γ_Y^{−1} G Γ_X, Γ_X), analogous to (2.37), corresponding to the same eigenvalue λ_i. Then ŵ_i and v̂_i maximize corr(X^⊤ v̂_i, Y^⊤ ŵ_i), and the optimal value of the correlation is √λ_i.
Another way to characterize the canonical correlations is as follows:

(W_x, W_y) = arg min_{(Ŵ_x, Ŵ_y) ∈ S_r} ‖ Ŵ_x^⊤ X − Ŵ_y^⊤ Y ‖²    (2.41)

where

S_r = { (Ŵ_x, Ŵ_y) : rank(Ŵ_x) = rank(Ŵ_y) = r, Ŵ_x^⊤ Σ_XX Ŵ_x = Ŵ_y^⊤ Σ_YY Ŵ_y = I_r }.    (2.42)

Then (2.41) is minimized when W_x and W_y have the top-r eigenvectors w_x and w_y (as described in (2.37) and (2.38)) as their columns. Details about CCA are discussed in [23, 50, 9].
2.8 The goal operator and the transformed model
Now consider a goal operator. The QoI is a linear function of X, denoted by Z = HX, where H is what we call the goal operator. In the goal-oriented design, we are interested in minimizing the uncertainty of Z. In many cases, the range of H is low-dimensional, of dimension 1 or 2, for instance; hence, inferring every aspect of X would be a waste of computational resources. In order to do inference, we would like to compute the posterior distribution of Z, the probability distribution of Y|Z, etc. One way is to note that the prior distribution of Z is N(0, Γ_Z) with Γ_Z = H Γ_X H^⊤. Hence,
the posterior distribution of Z is also multivariate normal with mean

µ_{Z,pos}(Y) = H µ_pos(Y)    (2.43)

and posterior covariance

Γ_{Z,pos} = H Γ_pos H^⊤.
Another way to see this goal-oriented problem is to introduce the transformed model [46]:

Y = G H̃ Z + δ    (2.44)

where H̃ = Γ_X H^⊤ Γ_Z^{−1}, Z ∼ N(0, Γ_Z), and δ ∼ N(0, Γ_δ). Z and δ are independent, and Γ_δ = Γ_ε + G(Γ_X − Γ_X H^⊤ Γ_Z^{−1} H Γ_X)G^⊤. Using this transformation, the results for Bayesian experimental design can be applied to the goal-oriented case, and we obtain

Γ_{Z|Y} = ( Γ_Z^{−1} + (G H̃)^⊤ Γ_δ^{−1} G H̃ )^{−1}    (2.45)
µ_{Z|Y} = H µ_{X|Y} = Γ_{Z|Y} (G H̃)^⊤ Γ_δ^{−1} Y    (2.46)

and

Γ_{Z|Y_P} = ( (G H̃)^⊤ P (P^⊤ Γ_δ P)^{−1} P^⊤ G H̃ + Γ_Z^{−1} )^{−1}    (2.47)
µ_{Z|Y_P} = H µ_{X|Y_P} = Γ_{Z|Y_P} (G H̃)^⊤ P (P^⊤ Γ_δ P)^{−1} P^⊤ Y.    (2.48)
However, it is worth mentioning that the forward operator of the transformed model is often rank deficient, in the sense that G H̃ has the same rank as H; in particular, the rank of G H̃ is d if d < m < n. Hence, although the algorithms from the classical setting can still be applied in the goal-oriented case, the rank deficiency may make the theoretical analysis difficult.
The Bayesian D-optimal design problem in the goal-oriented setting can be formulated as

min_{rank(P)≤k} log det(Γ_{Z|Y_P}) = min_{rank(P)≤k} log det( H Γ_{X|Y_P} H^⊤ )    (2.49)

or equivalently, using the transformed model:

min_{rank(P)≤k} log det(Γ_{Z|Y_P})    (2.50)
= min_{rank(P)≤k} log det( ( (G H̃)^⊤ P (P^⊤ Γ_δ P)^{−1} P^⊤ G H̃ + Γ_Z^{−1} )^{−1} ).    (2.51)
Chapter 3
Approximate submodular functions
Submodularity is a useful structure that has been widely exploited in combinatorial optimization [6, 10, 8, 29] and machine learning [49, 31], in particular in the design and analysis of approximation algorithms for NP-hard problems [37, 35, 21]. In this chapter, we present some classical results for (approximate) submodular functions, and show how to use them to analyze greedy algorithms. We will also show the challenge encountered when applying these methods in the goal-oriented setting.
We first introduce definitions of monotonicity and submodularity of a set function f , and then use these definitions to characterize the performance bound of the greedy algorithm when the objective enjoys some desired properties.
Definition 1 (Non-decreasing set functions). Let f be a set function on Ω, i.e., f : 2^Ω → R. Then f is non-decreasing if for all A, B satisfying A ⊆ B ⊆ Ω, we have f(A) ≤ f(B).
Definition 2 (Submodularity). Let f be a set function on Ω, i.e., f : 2^Ω → R. We say f is submodular if for every A, B ⊆ Ω with A ⊆ B, and every e ∈ Ω \ B, we have f(A ∪ {e}) − f(A) ≥ f(B ∪ {e}) − f(B).
From the definition, we see that the concept of submodularity describes the "diminishing returns" property of set functions: the gain from adding a new element is smaller for larger sets. Many functions in science and engineering exhibit this behavior; examples include weighted coverage functions, entropy, the mutual information of conditionally independent variables over disjoint sets, and many more [33]. When the objective function f is non-decreasing, it is natural to consider its maximum subject to the cardinality constraint that the size of the set is at most k, that is:
A* = arg max_{A ⊆ Ω, |A| ≤ k} f(A).    (3.1)
A natural heuristic to consider in this setting is the greedy algorithm. In every step, the greedy algorithm successively maximizes the increment in f. Due to its simplicity and good performance, the algorithm has been applied to many problems, such as column selection [4] and graph max-cut [8]. Relating this to our problem, the selection criterion of the greedy algorithm is to choose the observation that maximizes the mutual information conditioned on the set A of already chosen observations:
max_{v ∈ A^c} I(Z; Y_{P_v} | Y_{P_A}).

Denote Y_{P_A} by Y_A. Using the chain rule of mutual information,

I(Z; Y_v | Y_A) = I(Z; Y_{A∪v}) − I(Z; Y_A),

so in each iteration we solve v* = arg max_{v ∈ A^c} I(Z; Y_{A∪v}) − I(Z; Y_A). In the next section, we present the greedy algorithm in detail.
3.1 The greedy algorithm
Algorithm 1 The Greedy Algorithm
1: input: ground set Ω, set function f : 2^Ω → R, cardinality k
2: A_0 = ∅
3: for i = 1 : k do
4:     v* = arg max_{v ∈ A_{i−1}^c} f(A_{i−1} ∪ {v}) − f(A_{i−1})
5:     A_i = A_{i−1} ∪ {v*}
6: end for
7: return A_k
Nemhauser et al. [37] showed a classical result in 1978 for f satisfying Definition 1 and Definition 2. We restate the theorem here.
Theorem 3. Let f be a non-decreasing submodular set function with f(∅) = 0. Then Algorithm 1 returns A_k satisfying

f(A*) ≥ f(A_k) ≥ (1 − 1/e) f(A*)    (3.2)

where A* is the optimal solution to (3.1), subject to the cardinality constraint of size k.
Note that (1 − 1/e) ≈ 0.63. Hence Theorem 3 states that for f non-decreasing and submodular, the greedy algorithm gives a result whose performance is no worse than 63% of the optimum. Feige extended this result and further proved that no polynomial-time algorithm can achieve a better approximation ratio than 1 − 1/e under the cardinality constraint, assuming P ≠ NP [21]. For convenience, we denote Y_{P_A} by Y_A, and we have the following lemma.
Lemma 1. Under the assumption that H = I and the noise is uncorrelated, the mutual information I(X; Y_A) (see (2.19)) is monotone and submodular in A.

See [28] for a proof. This lemma implies that in the non-goal-oriented case with uncorrelated noise, the mutual information between the underlying parameter X and the selected observations increases as more observations are selected, and also that the incremental gain diminishes as the number of selected observations increases. This coincides with our intuition, since the more we observe, the more we know about the underlying parameter X.
Then, applying Theorem 3, we see that the greedy algorithm yields a performance bound of 1 − 1/e under this assumption. However, if the noise is correlated, or if we are dealing with a general goal operator other than the identity, then the mutual information is no longer submodular. In light of this, we need some measure to quantify how close our objective function is to being submodular.
3.2 Submodularity ratio and generalized curvature
One way to capture the closeness of a non-submodular function to a submodular one is by introducing two parameters: the submodularity ratio and the generalized curvature [14, 7].
Definition 3 (Submodularity ratio [14, 7]). The submodularity ratio of a non-negative set function f is the largest scalar γ such that

Σ_{ω ∈ V\A} ( f(A ∪ {ω}) − f(A) ) ≥ γ ( f(A ∪ V) − f(A) )

for all V, A ⊆ Ω.
Definition 4 (Generalized curvature [14, 7]). The curvature of a non-negative set function f is the smallest scalar α such that

f(A ∪ V) − f((A \ {ω}) ∪ V) ≥ (1 − α) ( f(A) − f(A \ {ω}) )

for all V, A ⊆ Ω, ω ∈ A \ V.
We note that γ = 1 if and only if f is submodular, and α = 0 if and only if f is supermodular (−f is submodular). Hence, f is modular (both submodular and supermodular) if and only if γ = 1 and α = 0. With these two parameters γ and α, we are able to give a performance bound for the greedy algorithm for a general set function f, not necessarily submodular.
Theorem 4 (Approximation guarantee of greedy [7]). Let f be a non-negative non-decreasing set function with submodularity ratio γ ∈ [0, 1] and curvature α ∈ [0, 1]. The greedy algorithm enjoys the following approximation guarantee for solving the problem max_{A ⊆ Ω, |A| ≤ k} f(A):

f(A_k) ≥ (1/α) [ 1 − ( (k − αγ)/k )^k ] f(A*) ≥ (1/α) (1 − e^{−αγ}) f(A*)

where A* is the solution to (3.1), and A_k is the output of the greedy algorithm.

It seems that if we can obtain a lower bound for γ and an upper bound for α, then we can obtain an overall performance bound for the greedy algorithm even if f is non-submodular. However, in the goal-oriented OED setting, obtaining nontrivial bounds for γ and α is not so simple. We will explain this in detail in Section 3.4.
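On a tiny ground set, γ and α can be computed by brute force directly from Definitions 3 and 4. The sketch below does this for the mutual information objective with H = I and uncorrelated noise, where Lemma 1 (below) guarantees submodularity, i.e. γ = 1; all model matrices are illustrative:

```python
import numpy as np
from itertools import chain, combinations

def all_subsets(ground):
    return chain.from_iterable(combinations(ground, r) for r in range(len(ground) + 1))

def ratio_and_curvature(f, ground):
    """Brute-force the largest gamma (Def. 3) and smallest alpha (Def. 4)."""
    gamma, alpha = 1.0, 0.0
    subsets = [set(s) for s in all_subsets(ground)]
    for A in subsets:
        for V in subsets:
            joint = f(A | V) - f(A)            # Definition 3
            if joint > 1e-12:
                single = sum(f(A | {w}) - f(A) for w in V - A)
                gamma = min(gamma, single / joint)
            for w in A - V:                    # Definition 4
                base = f(A) - f(A - {w})
                if base > 1e-12:
                    nested = f(A | V) - f((A - {w}) | V)
                    alpha = max(alpha, 1.0 - nested / base)
    return gamma, alpha

rng = np.random.default_rng(8)
n, m = 4, 3
G = rng.standard_normal((n, m))
Gamma_eps = np.diag(rng.uniform(0.5, 1.5, n))  # uncorrelated noise

def f(A):
    """I(X; Y_A) with Gamma_X = I and H = I (the setting of Lemma 1)."""
    if not A:
        return 0.0
    idx = np.array(sorted(A))
    Gs, Ge = G[idx, :], Gamma_eps[np.ix_(idx, idx)]
    prec = Gs.T @ np.linalg.solve(Ge, Gs) + np.eye(m)
    return 0.5 * np.linalg.slogdet(prec)[1]

gamma, alpha = ratio_and_curvature(f, set(range(n)))
# Lemma 1: with H = I and uncorrelated noise, f is submodular, so gamma = 1
assert gamma > 1 - 1e-8
```

With a nontrivial goal operator H, the same brute-force computation generally returns γ < 1, which is precisely the difficulty discussed in Section 3.4.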
3.3 α-approximate submodularity and ε-approximate submodularity
Another way to characterize how far a function deviates from being submodular is to define notions of α-approximate submodularity and ε-approximate submodularity, as in [13, 12]. It is shown in [12] that the D-optimal goal-oriented experimental design objective is submodular provided that d ≥ m (H is tall and skinny) and the noise is uncorrelated, which is not the case we are interested in. Performance bounds are given in [13] using α-approximate supermodularity and ε-approximate supermodularity. However, the performance bound in that paper is written in terms of a multiple of det(H^⊤ H) (equation (30) in [13]), which is 0 when d < m. The results therefore do not carry over to the case d < m using the same proof technique.
3.4 The challenge in the goal-oriented case
Unfortunately, none of the methods introduced in previous section work in the case we are interested in, due to the fact that d m < n. This relative relationship between dimensions of H, X and Y causes many analyses to break and yields trivial lower bound. We present here an attempt to utilize the result from Section 3.2. Proofs of theorems and lemmas can be found in Appendix A.
The following lemma shows that mutual information is a non-decreasing set function.

Lemma 2. I(Y_A; Z) is non-decreasing in A, i.e., I(Y_{A∪v}; Z) ≥ I(Y_A; Z).
Let f(·) = I(·; Z) = I(Z; ·). Then, to apply Theorem 4, we would like to obtain a lower bound for γ and an upper bound for α. We first examine a lower bound for γ, hence a lower bound for

∑_{ω∈V\A} [I(Y_{A∪ω}; Z) − I(Y_A; Z)] / [I(Y_{A∪V}; Z) − I(Y_A; Z)]  for all V, A ⊆ Ω.

For the denominator, we have the following result:
Theorem 5. I(Y_{A∪V}; Z) − I(Y_A; Z) ≤ ½ [log ∏_{i=1}^{|V\A|} λ_i(Γ_Y) − log ∏_{i=d−|V\A|+1}^{d} λ_i(Γ_{Y|Z})],

where λ_i(Γ) denotes the i-th largest eigenvalue of the matrix Γ.
To get a lower bound for the numerator of this ratio, we need a lemma to build up the machinery.

Lemma 3. Let

M_A = Γ_Z Γ_{Z|Y_A}^{−1} = I + Γ_Z^{1/2} (G H̃)^T P_A (P_A^T Γ_δ P_A)^{−1} P_A^T (G H̃) Γ_Z^{1/2},

and let S_{A,ω} = M_{A∪ω} − M_A. Then

2 I(Y_{A∪ω}; Z) − 2 I(Y_A; Z) = log [det(M_{A∪ω}) / det(M_A)] = log [1 + tr(M_A^{−1/2} S_{A,ω} M_A^{−1/2})].

In the proof of this lemma, (A.13) shows that when we select a new observation in each iteration, we are essentially performing a rank-1 update to the argument fed into the objective.
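The determinant identity underlying this rank-1 update is the matrix determinant lemma, det(M + s sᵀ) = det(M)(1 + sᵀ M⁻¹ s), which the following small numerical check illustrates on a random symmetric positive definite M (synthetic data, not tied to the OED matrices):

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((5, 5))
M = M @ M.T + np.eye(5)                      # symmetric positive definite
s = rng.standard_normal(5)

lhs = np.linalg.det(M + np.outer(s, s))      # determinant after rank-1 update
rhs = np.linalg.det(M) * (1 + s @ np.linalg.solve(M, s))
assert np.isclose(lhs, rhs)                  # matrix determinant lemma
```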
Theorem 6. For n ≤ min(d, m),

∑_{ω∈V\A} [I(P_{A∪ω}^T Y; Z) − I(P_A^T Y; Z)] ≥ (|V\A|/2) log(1 + λ_min((G H̃) Γ_Z^{1/2} M_∅^{−1} Γ_Z^{1/2} (G H̃)^T) / max(diag(Γ_δ))).   (3.3)

Now, combining Theorem 6 and Theorem 5, we have

∑_{ω∈V\A} [I(Y_{A∪ω}; Z) − I(Y_A; Z)] / [I(Y_{A∪V}; Z) − I(Y_A; Z)]   (3.4)

≥ [(|V\A|/2) log(1 + λ_min((G H̃) Γ_Z^{1/2} M_∅^{−1} Γ_Z^{1/2} (G H̃)^T) / max(diag(Γ_δ)))] / [½ (log ∏_{i=1}^{|V\A|} λ_i(Γ_Y) − log ∏_{i=d−|V\A|+1}^{d} λ_i(Γ_{Y|Z}))]   (3.5)

≥ [(|V\A|/2) log(1 + λ_min((G H̃) Γ_Z^{1/2} M_∅^{−1} Γ_Z^{1/2} (G H̃)^T) / max(diag(Γ_δ)))] / [(|V\A|/2) (log λ_1(Γ_Y) − log λ_n(Γ_{Y|Z}))]   (3.6)

so that

γ ≥ log(1 + λ_min((G H̃) Γ_Z^{1/2} M_∅^{−1} Γ_Z^{1/2} (G H̃)^T) / max(diag(Γ_δ))) / (log λ_1(Γ_Y) − log λ_n(Γ_{Y|Z}))  if n ≤ min(d, m), and γ ≥ 0 otherwise.   (3.7)
We see that for the interesting case n > m > d, we obtain only the trivial lower bound for γ. The nontrivial bound applies only when the number of observations is smaller than both the dimension of the goal operator and the dimension of the underlying parameters.

A lower bound for 1 − α (or an upper bound for α) can be obtained in a similar way; 1 − α can also be lower bounded by (3.7), as shown in Appendix D.1 of [7]. Plugging γ ≥ 0 into Theorem 4, we see that we obtain only a trivial bound.
Chapter 4
Algorithms
In this chapter, we compare three algorithms: the greedy algorithm, introduced in the previous chapter; Minorize-Maximize (MM) [28], which is based on iteratively maximizing a modular lower bound of the objective function; and the Generalized Leverage Score (GLS) algorithm, a one-shot method that ranks the observations based on the ℓ2 norms of the rows of the generalized eigenvector matrix of a matrix pencil.
4.1
The greedy algorithm
We present the greedy algorithm here again, with f in Algorithm 1 replaced with mutual information.
Algorithm 2 The Greedy Algorithm
1: input ground set Ω, set function f : 2Ω → R, cardinality k
2: A0 = ∅
3: for i = 1 : k do
4: v* = argmax_{v ∈ A_{i−1}^c} [I(Z; Y_{A_{i−1}∪{v}}) − I(Z; Y_{A_{i−1}})]
5: A_i = A_{i−1} ∪ {v*}
6: end for
7: return Ak
Recall that I(Z; Y_{A∪{v}}) = ½ log [det(P_{A∪{v}}^T Γ_Y P_{A∪{v}}) / det(P_{A∪{v}}^T Γ_δ P_{A∪{v}})]. The main cost of the algorithm comes from computing these determinants. The overall cost for greedy is O(nk^4).
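A direct implementation of Algorithm 2 in the linear Gaussian setting might look as follows (synthetic covariances; the helper `mi` evaluates the mutual information from the joint covariance, and each marginal gain is computed from scratch, which is what produces the O(nk⁴) count above):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, d, k = 5, 12, 2, 4
G = rng.standard_normal((n, m)); H = rng.standard_normal((d, m))
Gy, Gz, Gyz = G @ G.T + 0.1 * np.eye(n), H @ H.T, G @ H.T

def mi(A):
    """I(Z; Y_A) for the linear Gaussian model."""
    if not A:
        return 0.0
    J = np.block([[Gy[np.ix_(A, A)], Gyz[A, :]],
                  [Gyz[A, :].T,      Gz]])
    return 0.5 * (np.linalg.slogdet(Gy[np.ix_(A, A)])[1]
                  + np.linalg.slogdet(Gz)[1] - np.linalg.slogdet(J)[1])

A = []
for _ in range(k):                       # one sweep over the complement per step
    best = max((v for v in range(n) if v not in A), key=lambda v: mi(A + [v]))
    A.append(best)
print(A)                                 # selected indices, in greedy order
```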
4.2
The MM algorithm
The MM algorithm in the continuous setting is referred to as "Majorize-Minimization" (for finding a minimum) or "Minorize-Maximization" (for finding a maximum). The idea, taking majorize-minimization as an example, is that for a non-convex function, in each iteration we construct a convex surrogate function that is globally greater than the objective, and successively minimize the convex surrogate. The MM algorithm presented below is designed on the same principle: if we can find a modular lower bound of the objective, then in each iteration we can simply maximize that lower bound.
f(B) + η ∑_{e∈A} [f(B ∪ e) − f(B)]  ≤  f(A ∪ B)  ≤  f(B) + (1/γ) ∑_{e∈A} [f(B ∪ e) − f(B)],   (4.1)

where the left-hand side is a modular lower bound and the right-hand side is a modular upper bound.
This leads to the following algorithm.

Algorithm 3 The MM algorithm
1: input ground set Ω, set function f : 2Ω → R, cardinality k
2: A_0 = ∅
3: for i = 1 : k do
4: F = Γ_Y(A^c, A^c) − Γ_Y(A^c, A) Γ_Y(A, A)^{−1} Γ_Y(A, A^c)
5: S = Γ_δ(A^c, A^c) − Γ_δ(A^c, A) Γ_δ(A, A)^{−1} Γ_δ(A, A^c)
6: j = argmax_j (diag(log(F)) − diag(log(S)))_j
7: A_i = A_{i−1} ∪ {j}
8: end for
9: return Ak
Hence, the main computational cost comes from computing the matrix logarithms, which is equivalent to computing eigenvalues. The overall computational cost for MM is O(n^3 k). The performance and analysis details of this algorithm can be found in [28].
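A sketch of Algorithm 3 is given below. The helper `logm_diag` (a hypothetical name) computes the diagonal of the matrix logarithm of a symmetric positive definite matrix via an eigendecomposition, and the covariances are synthetic:

```python
import numpy as np

def logm_diag(M):
    """Diagonal of log(M) for symmetric positive definite M."""
    w, V = np.linalg.eigh(M)
    return ((V * np.log(w)) @ V.T).diagonal()   # diag(V log(w) V^T)

def mm_select(Gy, Gd, k):
    """Greedily maximize the modular score diag(log F) - diag(log S)."""
    n = Gy.shape[0]
    A = []
    for _ in range(k):
        Ac = [i for i in range(n) if i not in A]
        if A:
            # Schur complements of Gy and Gd onto the complement A^c
            F = Gy[np.ix_(Ac, Ac)] - Gy[np.ix_(Ac, A)] @ np.linalg.solve(
                    Gy[np.ix_(A, A)], Gy[np.ix_(A, Ac)])
            S = Gd[np.ix_(Ac, Ac)] - Gd[np.ix_(Ac, A)] @ np.linalg.solve(
                    Gd[np.ix_(A, A)], Gd[np.ix_(A, Ac)])
        else:
            F, S = Gy, Gd
        A.append(Ac[int(np.argmax(logm_diag(F) - logm_diag(S)))])
    return A

rng = np.random.default_rng(2)
B = rng.standard_normal((10, 6))
Gd = 0.1 * np.eye(10)                    # uncorrelated noise covariance
Gy = B @ B.T + Gd                        # marginal covariance of Y
print(mm_select(Gy, Gd, 3))
```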
4.3
The generalized leverage score algorithm
We first give the definition of the leverage score for a symmetric matrix A.
Definition 5 (Leverage score for a symmetric matrix [40]). Let V_k ∈ R^{n×k} contain the top k singular vectors of an n × n symmetric matrix A with rank(A) ≥ k. Then the (rank-k) leverage score of the i-th column/row of A is defined as ℓ_i^{(k)} = ‖[V_k]_{i,:}‖_2^2 for i = 1, …, n, where [V_k]_{i,:} denotes the i-th row of V_k.
From the definition, we see that the rank-k leverage scores are essentially the squared ℓ2 norms of the rows of the matrix of top k eigenvectors. Leverage scores have been widely used in randomized numerical linear algebra: for example, sampling the rows of a matrix with probability proportional to their leverage scores leads to the CUR decomposition [16] and thus to fast algorithms for linear regression [36, 2]; another line of research approximates leverage scores without directly computing the singular vectors or eigenvectors of the matrix [17].
For a matrix pencil (A, B), analogous to Definition 5, the weighted generalized leverage score can be defined as follows:
Definition 6 (Weighted generalized leverage score). For A, B ∈ R^{n×n}, let V_k ∈ R^{n×k} contain the top k eigenvectors of the matrix pencil (A, B), with rank(A), rank(B) ≥ k. Let Λ_k ∈ R^{k×k} be a diagonal matrix whose diagonal entries are the corresponding eigenvalues of the pencil. Then the (rank-k) weighted generalized leverage score is defined as ℓ_i^{(k)} = ‖[V_k Λ_k]_{i,:}‖_2^2.
Algorithm 4 The generalized leverage score algorithm
1: input Γ_Y, Γ_δ
2: A = ∅
3: (Λ, Q) = generalized_eig(Γ_Y − Γ_δ, Γ_Y)  % compute the generalized eigenpairs of the matrix pencil (Γ_Y − Γ_δ, Γ_Y)
4: s = row-wise ℓ2 norms of QΛ^{1/2}
5: r = {s_{i_1}, s_{i_2}, …, s_{i_n}}, where s_{i_j} is the j-th largest component of s
6: A = {i_1, i_2, …, i_n}
7: return A
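Algorithm 4 can be sketched in plain numpy by whitening the pencil with a Cholesky factor of Γ_Y (assuming Γ_Y is positive definite), which replaces the dedicated generalized eigensolver of line 3 with a standard symmetric eigendecomposition:

```python
import numpy as np

def gls_rank(Gy, Gd):
    """Rank observations by weighted generalized leverage scores of (Gy - Gd, Gy)."""
    L = np.linalg.cholesky(Gy)
    Linv = np.linalg.inv(L)
    # (Gy - Gd) q = lam * Gy q  <=>  L^{-1} (Gy - Gd) L^{-T} u = lam u, with q = L^{-T} u
    lam, U = np.linalg.eigh(Linv @ (Gy - Gd) @ Linv.T)
    Q = Linv.T @ U                            # generalized eigenvectors
    lam = np.clip(lam, 0.0, None)             # guard tiny negative eigenvalues
    scores = np.linalg.norm(Q * np.sqrt(lam), axis=1)   # row norms of Q Lambda^{1/2}
    return np.argsort(scores)[::-1]           # indices sorted by importance

rng = np.random.default_rng(4)
B = rng.standard_normal((8, 3))
Gd = np.diag(rng.uniform(0.1, 1.0, 8))        # uncorrelated noise
Gy = B @ B.T + Gd
print(gls_rank(Gy, Gd))                       # all 8 indices, most informative first
```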
We give some intuition about why this algorithm works. We consider the matrix pencil

(G Γ_X H^T Γ_Z^{−1} H Γ_X G^T, Γ_Y) = (Γ_Y − Γ_{Y|Z}, Γ_Y) = (Γ_Y − Γ_δ, Γ_Y),

and let (λ_i, q_i) be the eigenpairs of this pencil with normalization q_i^T G Γ_X H^T Γ_Z^{−1} H Γ_X G^T q_i = 1 for i = 1, …, d. Then from Theorem 2.3 in [46], we have

Γ_{Z|Y} = Γ_Z − K K^T,   K K^T = Q̂ Λ Q̂^T,   Q̂ = H Γ_X G^T Q,

where Λ is a d × d diagonal matrix whose diagonal entries are the generalized eigenvalues and Q is an n × d matrix whose columns are the corresponding generalized eigenvectors. Combining the three equations above, we have

Γ_{Z|Y} = Γ_Z − Q̂ Λ Q̂^T = Γ_Z − H Γ_X G^T Q Λ Q^T G Γ_X H^T,

where the second term is K K^T. On the other hand, from (2.45), using Woodbury's identity and expanding all the terms, we have

Γ_{Z|Y} = Γ_Z − H Γ_X G^T Γ_Y^{−1} G Γ_X H^T.
Comparing the two expressions for Γ_{Z|Y} above, we have

H Γ_X G^T (Γ_Y^{−1} − Q Λ Q^T) G Γ_X H^T = 0   (4.2)

by construction. Also, from (2.47), we have

Γ_{Z|Y_P} = Γ_Z − H Γ_X G^T P (P^T Γ_Y P)^{−1} P^T G Γ_X H^T   (4.3)
         ≈ Γ_Z − H Γ_X G^T P P^T Γ_Y^{−1} P P^T G Γ_X H^T.   (4.4)
We consider using the following approximation:

Γ̃_{Z|Y_P} = Γ_Z − H Γ_X G^T P P^T (Q Λ Q^T) P P^T G Γ_X H^T,

and we would like Γ̃_{Z|Y_P} to be as close to Γ_{Z|Y} as possible, which is equivalent to making H Γ_X G^T P P^T Q Λ Q^T P P^T G Γ_X H^T as close to H Γ_X G^T Q Λ Q^T G Γ_X H^T as possible. A natural heuristic is to eliminate the "less important" rows of Q Λ^{1/2} first, where importance is characterized by the row-wise ℓ2 norm. This idea can be seen as follows. Write Q Λ^{1/2} as the matrix with columns √λ_1 q_1, √λ_2 q_2, …, √λ_n q_n, and denote its rows by v_1, v_2, …, v_n. If P drops the first coordinate, then P P^T Q Λ^{1/2} has rows 0, v_2, …, v_n; that is, the action of P P^T on Q Λ^{1/2} is on its rows. Hence, by selecting "important" rows, or equivalently, eliminating "less important" rows, we make P P^T Q Λ Q^T P P^T close to Q Λ Q^T.
Theorem 7 (Optimal approximation of the posterior covariance of the QoI). Let (λ_i, q_i) be the eigenpairs of

(G Γ_X H^T Γ_Z^{−1} H Γ_X G^T, Γ_Y)

with the ordering λ_i ≥ λ_{i+1} > 0 and normalization q_i^T G Γ_X H^T Γ_Z^{−1} H Γ_X G^T q_i = 1, where Γ_Y := Γ_obs + G Γ_X G^T is the covariance matrix of the marginal distribution of Y. Then a minimizer Γ̂_{Z|Y} of the Riemannian metric d_R between Γ_{Z|Y} and an element of M_r^Z is given by

Γ̂_{Z|Y} = Γ_Z − K K^T,   K K^T = ∑_{i=1}^{r} q̂_i q̂_i^T,   q̂_i = H Γ_X G^T q_i √λ_i,

where the corresponding minimum distance is

d_R^2(Γ_{Z|Y}, Γ̂_{Z|Y}) = ½ ∑_{i>r} ln^2(1 − λ_i),

and M_r^Z = {Γ_Z − K K^T ≻ 0 : rank(K) ≤ r}.
Note that the joint covariance of Y and Z is given by the matrix

Σ = [ Γ_Y          G Γ_X H^T
      H Γ_X G^T    Γ_Z        ],

and the joint covariance of Y_P and Z, denoted Σ_P, is

Σ_P = [ P^T Γ_Y P       P^T G Γ_X H^T
        H Γ_X G^T P     Γ_Z            ].
Hence, we consider the matrix pencil (P^T G Γ_X H^T Γ_Z^{−1} H Γ_X G^T P, P^T Γ_Y P) and the corresponding low-rank decomposition with vectors q_i^P = H Γ_X G^T P P^T q_i. Since Γ̂_{Z|Y} is the optimal approximation in the Riemannian metric d_R, we would like q_i^P to be as close as possible to the corresponding unprojected vectors. We therefore examine the norms of the rows of the eigenvector matrix to determine which row can be eliminated with the least change in the eigenvectors; this sense of closeness is captured by the ℓ2 norm.
Then, using this approximation, we have

Γ̃_{Z|Y_P} = Γ_Z − H Γ_X G^T P P^T Q Λ Q^T P P^T G Γ_X H^T.
4.4
Theoretical consideration
Although a rigorous analysis of the performance is lacking at this point, we present a lower bound on I(Z; Y_P) for |Y_P| = k, i.e., the worst-case performance when k observations are selected arbitrarily.
Theorem 8. For any k selected observations,

I(Z; Y_P) ≥ ½ (1 + λ_min(Γ_δ^{−1})) ∑_{i=n−k+1}^{n} diag_i((G Õ) Γ_Z (G Õ)^T),

where diag_i((G Õ) Γ_Z (G Õ)^T) denotes the i-th largest diagonal element of (G Õ) Γ_Z (G Õ)^T.
Chapter 5
Numerical examples
In this chapter, we demonstrate the performance of the three algorithms (greedy, minorize-maximization (MM), and generalized leverage score (GLS)) on a synthetic data set and a real data set from a climate model.
5.1
Synthetic data set
5.1.1
Calibration
For the purpose of calibration, we first run a small-scale example for which we can compute the combinatorial optimum, so that we can compare the performance of the approximate algorithms against it. Below is an example with H ∈ R^{2×10} a matrix whose two rows are randomly drawn unit vectors (drawn from I_10), G ∈ R^{16×10} with exponentially decreasing spectrum, and Γ_X and the noise covariance generated from an exponential kernel with correlation length 0.05. Note that the noise in this case is correlated. Figure 5-1 shows the spectra of the important matrices and pencils used in this example.
Figure 5-1: Spectrum of important matrices and pencils of a calibration example with correlation length 0.05
We see that the pencil (G Γ_X H^T Γ_Z^{−1} H Γ_X G^T, Γ_Y) has only two nontrivial eigenvalues, which aligns with the rank of H. The numerical results of the approximate algorithms and the combinatorial optimum are shown in the plot below.
Figure 5-2: Numerical results of a calibration example with correlation length 0.05
The solid lines are the numerical results obtained in the goal-oriented setting. The dashed lines are obtained by selecting observations according to the non-goal-oriented criterion and evaluating the performance in the goal-oriented setting; that is, we choose the observations using the objective max_A I(X; Y_A) and evaluate the performance of the set A using I(Z; Y_A). We also plot 50 realizations of random selections, denoted by yellow crosses. We see from the plots that, although this is not guaranteed, the performance of goal-oriented design is better than that of selecting observations in the non-goal-oriented setting; the latter performs similarly to random selection. However, the performance of GLS can deviate substantially from the combinatorial optimum in some cases. In the following example, the covariance of the noise is generated from an exponential kernel with correlation length 0.5.
Figure 5-3: Spectrum of important matrices and pencils of a calibration example with correlation length 0.5
Figure 5-4: Numerical results of a calibration example with correlation length 0.5
We see from the plot above that when the cardinality of the set is small, the performance of GLS is similar to that of random selection. The greedy algorithm and MM still perform relatively well in this setting.
5.1.2
Correlated noise
We run an example on a synthetic data set with H ∈ R^{1×100} a randomly drawn unit vector, G ∈ R^{200×100} with exponentially decreasing spectrum, and Γ_X and the noise covariance generated from an exponential kernel with correlation length 0.05. Note that in this case the noise is correlated. The spectra of the important matrices are shown in Figure 5-5 below.
Figure 5-5: Spectrum of importance matrices and pencils with correlated noise
Since the rank of H is 1, the spectra of some matrices and pencils cannot be seen in the plot. We therefore list them here: λ(G H̃) = 1.1131, λ(H) = 1, and the only generalized eigenvalue of the pencil (G Γ_X H^T Γ_Z^{−1} H Γ_X G^T, Γ_Y) is 0.7581. The numerical results are shown in Figure 5-6.
Figure 5-6: Numerical results with correlated noise
From this we see that, even though the noise is correlated, the performance of GLS is much better than random selection.
5.1.3
Uncorrelated noise
Below is an example with H ∈ R^{1×100} a unit vector and G ∈ R^{200×100} with exponentially decreasing spectrum. Γ_X is generated from a squared exponential kernel with correlation length 0.05. The noise is uncorrelated, and the diagonal entries of its covariance are uniformly distributed on [0, 1].
Figure 5-7: Spectrum of importance matrices and pencils with uncorrelated noise
As in the previous part, we list the spectra of the rank-1 matrices and pencils: λ(G H̃) = 1.1907 and λ(H) = 1; the pencil (G Γ_X H^T Γ_Z^{−1} H Γ_X G^T, Γ_Y) again has a single generalized eigenvalue.
Figure 5-8: Numerical results with uncorrelated noise
Comparing Figure 5-6 with Figure 5-8, we see that the performance of selecting observations in the non-goal-oriented setting is better when the noise is uncorrelated. The three algorithms have similar performance when the noise is uncorrelated; however, when the noise is correlated, GLS usually performs worse than the other two.
5.2
Simple E3SM (Energy Exascale Earth System
Model) land model (sELM)
5.2.1
Model description
ELM is part of a DOE-sponsored earth system model development and simulation project aiming to study energy-relevant science. It is based on the Community Land Model (CLM) version 4.5, as described in [39]. Details about ELM can be found in [1]. sELM uses the CLM 4.5 schemes for allocation and respiration (chapter 13), seasonal deciduous phenology (chapter 14), decomposition (chapter 15), and mortality (chapter 17). Other processes such as nutrient cycling, fire, methane, crops, and land use change are not yet included. For photosynthesis, the aggregated canopy model (ACM) is used, as described in several papers that use the data assimilation-linked ecosystem carbon (DALEC) model; see, for example, [44]. The inputs of the model are 47 biogeophysics- and biogeochemical-cycling-related parameters, and the model outputs the gross primary production (GPP) at each of 1642 locations in the United States. 35 of the 47 input parameters, with detailed descriptions, are listed in Table 5.1 below [42, 45].
Table 5.1: Input parameter details for sELM

Parameter  | Description                                | Units          | Minimum     | Maximum
leafcn     | Leaf carbon/nitrogen (C:N) ratio           | gC gN^-1       | 12.5        | 70
frootcn    | Fine root C:N ratio                        | gC gN^-1       | 21          | 63
livewdcn   | Live wood C:N ratio                        | gC gN^-1       | 25          | 75
froot_leaf | Fine root to leaf allocation ratio         | None           | 0.3         | 2.5
flivewd    | Fraction of new wood that is live          | None           | 0.06        | 0.28
lf_flab    | Leaf litter labile fraction                | None           | 0.125       | 0.375
lf_flig    | Leaf litter lignin fraction                | None           | Constrained | Constrained
fr_flab    | Fine root labile fraction                  | None           | 0.125       | 0.375
fr_flig    | Fine root lignin fraction                  | None           | Constrained | Constrained
leaf_long  | Leaf longevity                             | Years          | 1           | 7
br_mr      | Base rate for maintenance respiration (MR) | umol m^-2 s^-1 | 1.26E-06    | 3.75E-06
q10_mr     | Temperature sensitivity for MR             |                |             |
rf_l1s1    | Respiration fraction for litter 1 → SOM1   | None           | 0.2         | 0.58
rf_l2s2    | Respiration fraction for litter 2 → SOM2   | None           | 0.275       | 0.82
rf_l3s3    | Respiration fraction for litter 3 → SOM3   | None           | 0.15        | 0.43
rf_s1s2    | Respiration fraction for SOM1 → SOM2       | None           | 0.14        | 0.42
rf_s2s3    | Respiration fraction for SOM2 → SOM3       | None           | 0.23        | 0.69
rf_s3s4    | Respiration fraction for SOM3 → SOM4       | None           | 0.28        | 0.83
k_l1       | Decay rate for litter pool 1               | d^-1           | 0.9         | 1.8
k_l2       | Decay rate for litter pool 2               | d^-1           | 0.036       | 0.112
k_l3       | Decay rate for litter pool 3               | d^-1           | 0.007       | 0.021
k_s1       | Decay rate for SOM1                        | d^-1           | 0.036       | 0.112
k_s2       | Decay rate for SOM2                        | d^-1           | 0.007       | 0.021
k_s3       | Decay rate for SOM3                        | d^-1           | 0.0007      | 0.0021
k_s4       | Decay rate for SOM4                        | d^-1           | 5.00E-05    | 1.50E-04
k_frag     | Fragmentation rate for coarse wood litter  | d^-1           | 5.00E-04    | 1.50E-03
q10_hr     | Q10 for heterotrophic respiration          | None           | 1.3         | 3.3
r_mort     | Mortality rate                             | yr^-1          | 0.0025      | 0.05
crit_dayl  | Critical day length for senescence         | Seconds        | 35,000      | 45,000
ndays_on   | Number of days for leaf on                 | Days           | 15          | 45
ndays_off  | Number of days for leaf off                | Days           | 7.5         | 22.5
fstor2tran | Fraction of storage transferred            | None           | 0.25        | 0.75
lwtop_ann  | Annual live wood turnover proportion       | yr^-1          | 0.5         | 1
stem_leaf  | New stem allocation C per leaf C           | gC gC^-1       | 0.6         | 5.3
croot_stem | New coarse root allocation C per stem C    | gC gC^-1       | 0.1         | 0.7
sELM is nonlinear, and we linearize it so that we can apply the methods we developed. The details of the linearization are introduced in the next section.
5.2.2
Linearized sELM
The method we use for linearizing the model is similar to that in [19]. We linearize the model using the following steps. We run sELM multiple times with different input parameters, from which we gather a collection of input-output pairs at each location. We perform a linear regression of the output at each grid point on predictors comprised of the input parameters. We then determine the residuals of the regression problem (one for every grid point) and compute their covariance; this covariance operator accounts for the effect of forcing. The statistical model we use is y_i = (GX)_i + ε_i, i = 1, …, #grid points, where y_i is the time-averaged output E_t[y_i] at each grid point, which we can determine. We have as many samples of E_t[y_i] as the cardinality of the ensemble set. Furthermore, we can associate a weight with each ensemble sample by assessing its variance in time, hence the connection to the spatial variability of the forcing; this leads to a weighted regression problem. We then determine the linearized forward operator G, and ε_i is the direct contribution of the forcing at each grid point to the output at that grid point. We determine ε_i as the residual of the regression problem, and subsequently determine the joint covariance of all ε_i, i = 1, …, #grid points.
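The regression step above can be sketched as follows. The ensemble here is synthetic (a random linear model standing in for actual sELM runs), and for brevity the unweighted least-squares fit omits the time-variance weighting described in the text:

```python
import numpy as np

rng = np.random.default_rng(5)
p, n_grid, n_runs = 4, 6, 200
theta = rng.uniform(size=(n_runs, p))            # sampled input parameters
G_true = rng.standard_normal((n_grid, p))        # stand-in for the true sensitivities
Y = theta @ G_true.T + 0.05 * rng.standard_normal((n_runs, n_grid))  # "model runs"

# regress each grid point's (time-averaged) output on the input parameters
coef, *_ = np.linalg.lstsq(theta, Y, rcond=None)
G = coef.T                                       # linearized forward operator, (n_grid, p)
resid = Y - theta @ G.T                          # per-run residuals epsilon_i
Gamma_eps = np.cov(resid, rowvar=False)          # joint covariance of the residuals
```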
5.2.3
Numerical results
For the numerical computation, we normalize the model so that the prior is N(0, I). The goal operator H is chosen to be the average of all 47 parameters, i.e., H = (1/47) ∑_{i=1}^{47} X_i. The spectra of the relevant matrices and pencils are shown in Figure 5-9.

Figure 5-9: Spectrum for the linearized sELM model with H = (1/47) ∑_{i=1}^{47} X_i
We list the singular/eigenvalues of the rank-1 matrices and pencils: λ(G H̃) = 24.8848 and λ(H) = 0.1459; the pencil (G Γ_X H^T Γ_Z^{−1} H Γ_X G^T, Γ_Y) has a single generalized eigenvalue. The numerical results and the selected locations are shown below.

Figure 5-10: Numerical results for the linearized sELM model with H = (1/47) ∑_{i=1}^{47} X_i

Figure 5-11: Selected locations shown on a map with H = (1/47) ∑_{i=1}^{47} X_i
We then run an example with H being X_1 (br_mr), the base rate for maintenance respiration. The spectra are shown in Figure 5-12.

Figure 5-12: Spectrum for the linearized sELM model with H = X_1

Note that λ(G H̃) = 7.6295, λ(H) = 1, and the only generalized eigenvalue of the pencil (G Γ_X H^T Γ_Z^{−1} H Γ_X G^T, Γ_Y) is 0.9426.
Figure 5-13: Numerical results for the linearized sELM model with H = X_1

Figure 5-14: Selected locations shown on a map with H = X_1
Since the noise has a fast-decaying and highly correlated covariance, GLS does not perform well, similar to random selection. The other two, the greedy algorithm and MM, perform better; however, there is a noticeable performance gap between them in both cases.
Chapter 6
Conclusions and future work
In this thesis, we studied goal-oriented optimal Bayesian experimental design. We reviewed the background of the problem and introduced the information-theoretic characterization of the design criterion. Using mutual information, the problem is formulated as a combinatorial optimization problem. We then introduced three algorithms, Greedy, Minorize-Maximize, and Generalized Leverage Score, for finding approximate solutions. The classical analysis using (approximate) submodularity breaks down in the goal-oriented setting because the forward operator of the transformed problem is rank-deficient under the assumption that n > m > d. We studied the computational cost and tested the performance of these algorithms on both synthetic and real data sets. We concluded that although the iterative algorithms have better performance, their computational cost is much larger than that of GLS. The performance of GLS is similar to that of Greedy and MM when the noise is uncorrelated or only slightly correlated; however, when the noise is highly correlated, the performance of GLS can be as poor as random selection. Due to the difficulty of obtaining theoretical guarantees for the GLS algorithm, we instead obtained a lower bound for arbitrary k-element selections.
With that being said, we point out some potential improvements and several future research directions.

• Theoretical guarantees: Although the submodularity-based analysis fails to give nontrivial theoretical guarantees for the goal-oriented problem, other approaches to obtaining a bound might succeed.
• Nonlinear problem: Currently, we consider only a linear forward operator. We may further consider the case where the forward operator or the goal operator is nonlinear.
• Non-Gaussian noise: We can also consider the case where the prior is non-Gaussian or the observations have non-Gaussian noise. Then many of the formulas and equations used in this thesis may no longer be applicable; for example, the mutual information between two non-Gaussian variables does not always have a closed-form expression.
• Optimal low-rank approximation in the Riemannian metric and canonical correlation analysis: Both problems solve a generalized eigenvalue problem with different matrix pencils, as introduced in Section 2.7, and both can be formulated as optimization problems. Hence, it would be very interesting if some intrinsic connections could be drawn between these two problems.
Appendix A
Proofs
To prove Lemma 2, we first introduce the Loewner order on the class of symmetric matrices.

Definition 7 (Loewner order). Let A and B be two symmetric matrices. We say that A ⪰ B if A − B is positive semi-definite. Similarly, we say that A ≻ B if A − B is positive definite.

Then we note that the determinant preserves this order.

Lemma 4 (Determinant preserves the monotonicity [46]). If A ⪰ B ⪰ 0, then det(A) ≥ det(B).
A.1
Proof of Lemma 2
Proof.

I(Y_{A∪v}; Z) − I(Y_A; Z) = ½ log [det Γ_Z / det Γ_{Z|Y_{A∪v}}] − ½ log [det Γ_Z / det Γ_{Z|Y_A}] = ½ log [det Γ_{Z|Y_{A∪v}}^{−1} / det Γ_{Z|Y_A}^{−1}].

It now remains to show that det Γ_{Z|Y_{A∪v}}^{−1} / det Γ_{Z|Y_A}^{−1} ≥ 1, i.e., det Γ_{Z|Y_{A∪v}}^{−1} ≥ det Γ_{Z|Y_A}^{−1}. If we can show that Γ_{Z|Y_{A∪v}}^{−1} ⪰ Γ_{Z|Y_A}^{−1} ⪰ 0, then the statement follows from Lemma 4. Note that

Γ_{Z|Y_A}^{−1} = Γ_Z^{−1} + (G H̃)^T P_A (P_A^T Γ_δ P_A)^{−1} P_A^T (G H̃).

Therefore,

Γ_{Z|Y_{A∪v}}^{−1} − Γ_{Z|Y_A}^{−1} = (G H̃)^T [P_{A∪v} (P_{A∪v}^T Γ_δ P_{A∪v})^{−1} P_{A∪v}^T − P_A (P_A^T Γ_δ P_A)^{−1} P_A^T] (G H̃).

If A = ∅, then I(Y_{A∪v}; Z) = I(Y_v; Z) ≥ I(0; Z) = I(Y_A; Z).
Now assume A ≠ ∅. Without loss of generality, write

P_{A∪v} (P_{A∪v}^T Γ_δ P_{A∪v}) P_{A∪v}^T =
[ A    b    0 … 0
  b^T  c    0 … 0
  0    0    0 … 0 ]   (A.1)

and

P_A (P_A^T Γ_δ P_A) P_A^T =
[ A  0  0 … 0
  0  0  0 … 0
  0  0  0 … 0 ]   (A.2)

Hence, forming the inverses via Schur complements,

P_{A∪v} (P_{A∪v}^T Γ_δ P_{A∪v})^{−1} P_{A∪v}^T =
[ (A − b c^{−1} b^T)^{−1}               −A^{−1} b (c − b^T A^{−1} b)^{−1}   0 … 0
  −(c − b^T A^{−1} b)^{−1} b^T A^{−1}    (c − b^T A^{−1} b)^{−1}            0 … 0
  0                                      0                                  0 … 0 ]   (A.3)

and

P_A (P_A^T Γ_δ P_A)^{−1} P_A^T =
[ A^{−1}  0  0 … 0
  0       0  0 … 0
  0       0  0 … 0 ]   (A.4)

We see that

P_{A∪v} (P_{A∪v}^T Γ_δ P_{A∪v})^{−1} P_{A∪v}^T − P_A (P_A^T Γ_δ P_A)^{−1} P_A^T   (A.5)
= [ (A − b c^{−1} b^T)^{−1} − A^{−1}      −A^{−1} b (c − b^T A^{−1} b)^{−1}   0 … 0
    −(c − b^T A^{−1} b)^{−1} b^T A^{−1}    (c − b^T A^{−1} b)^{−1}            0 … 0
    0                                      0                                  0 … 0 ]   (A.6)

and

(A − b c^{−1} b^T)^{−1} − A^{−1} = A^{−1} + A^{−1} b (c − b^T A^{−1} b)^{−1} b^T A^{−1} − A^{−1}   (A.7)
                                 = A^{−1} b (c − b^T A^{−1} b)^{−1} b^T A^{−1}.   (A.8)

Therefore,

P_{A∪v} (P_{A∪v}^T Γ_δ P_{A∪v})^{−1} P_{A∪v}^T − P_A (P_A^T Γ_δ P_A)^{−1} P_A^T
= [ A^{−1} b (c − b^T A^{−1} b)^{−1} b^T A^{−1}   −A^{−1} b (c − b^T A^{−1} b)^{−1}   0 … 0
    −(c − b^T A^{−1} b)^{−1} b^T A^{−1}            (c − b^T A^{−1} b)^{−1}            0 … 0
    0                                              0                                  0 … 0 ]   (A.9)

Since (c − b^T A^{−1} b)^{−1} ≻ 0 and

A^{−1} b (c − b^T A^{−1} b)^{−1} b^T A^{−1} − A^{−1} b (c − b^T A^{−1} b)^{−1} (c − b^T A^{−1} b) (c − b^T A^{−1} b)^{−1} b^T A^{−1} = 0,

we have P_{A∪v} (P_{A∪v}^T Γ_δ P_{A∪v})^{−1} P_{A∪v}^T − P_A (P_A^T Γ_δ P_A)^{−1} P_A^T ⪰ 0, which indicates Γ_{Z|Y_{A∪v}}^{−1} − Γ_{Z|Y_A}^{−1} ⪰ 0; hence Γ_{Z|Y_{A∪v}}^{−1} ⪰ Γ_{Z|Y_A}^{−1} ≻ 0, and the statement follows.
Remark. An alternative proof applies the data processing inequality: Z → Y_{A∪v} → Y_A is a Markov chain, and therefore I(Y_{A∪v}; Z) ≥ I(Y_A; Z).
To prove Theorem 5, we state two lemmas.
Lemma 5 (Weyl). If A ⪰ B ⪰ 0, then λ_i(A) ≥ λ_i(B). Consequently, G^T A G ⪰ G^T B G ⪰ 0, and λ_i(G^T A G) ≥ λ_i(G^T B G) for any real matrix G of compatible size.

Proof. Let C = A − B; then C ⪰ 0. Using Weyl's inequality, we see that λ_n(C) ≤ λ_i(A) − λ_i(B) ≤ λ_1(C). Since λ_n(C) ≥ 0, λ_i(A) ≥ λ_i(B). Then note that G^T (A − B) G ⪰ 0, and the result follows.
Lemma 6 (Cauchy interlacing). Suppose A ∈ R^{n×n} is symmetric. Let B ∈ R^{m×m} with m < n be a principal submatrix of A (obtained by deleting the i-th row and i-th column for some set of indices i). Suppose A has eigenvalues α_1 ≥ ⋯ ≥ α_n and B has eigenvalues β_1 ≥ ⋯ ≥ β_m. Then α_k ≥ β_k ≥ α_{n−m+k}. In particular, if m = n − 1, we have α_k ≥ β_k ≥ α_{k+1}.
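The m = n − 1 case of Lemma 6 can be illustrated numerically (random symmetric matrix, one row/column deleted):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((6, 6))
A = (A + A.T) / 2                                    # random symmetric matrix
B = np.delete(np.delete(A, 2, axis=0), 2, axis=1)    # principal submatrix, m = n - 1

a = np.sort(np.linalg.eigvalsh(A))[::-1]             # alpha_1 >= ... >= alpha_6
b = np.sort(np.linalg.eigvalsh(B))[::-1]             # beta_1  >= ... >= beta_5
assert all(a[k] >= b[k] >= a[k + 1] for k in range(5))   # interlacing
```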
A.2
Proof of Theorem 5
Proof.

2 I(Y_{A∪V}; Z) − 2 I(Y_A; Z)
= log [det(P_{A∪V}^T Γ_Y P_{A∪V}) det(P_A^T Γ_δ P_A) / (det(P_A^T Γ_Y P_A) det(P_{A∪V}^T Γ_δ P_{A∪V}))]
= log det(P_{A∪V}^T Γ_Y P_{A∪V}) − log det(P_A^T Γ_Y P_A) − [log det(P_{A∪V}^T Γ_δ P_{A∪V}) − log det(P_A^T Γ_δ P_A)].

Writing each determinant as a product of eigenvalues and splitting the products, this equals

log [∏_{i=1}^{|V\A|} λ_i(P_{A∪V}^T Γ_Y P_{A∪V}) ∏_{i=|V\A|+1}^{|A∪V|} λ_i(P_{A∪V}^T Γ_Y P_{A∪V})] − log ∏_{i=1}^{|A|} λ_i(P_A^T Γ_Y P_A)
− [log (∏_{i=1}^{|A|} λ_i(P_{A∪V}^T Γ_δ P_{A∪V}) ∏_{i=|A|+1}^{|A∪V|} λ_i(P_{A∪V}^T Γ_δ P_{A∪V})) − log ∏_{i=1}^{|A|} λ_i(P_A^T Γ_δ P_A)].

By Cauchy interlacing (Lemma 6), ∏_{i=|V\A|+1}^{|A∪V|} λ_i(P_{A∪V}^T Γ_Y P_{A∪V}) ≤ ∏_{i=1}^{|A|} λ_i(P_A^T Γ_Y P_A) and ∏_{i=1}^{|A|} λ_i(P_{A∪V}^T Γ_δ P_{A∪V}) ≥ ∏_{i=1}^{|A|} λ_i(P_A^T Γ_δ P_A), so the expression is

≤ log ∏_{i=1}^{|V\A|} λ_i(P_{A∪V}^T Γ_Y P_{A∪V}) − log ∏_{i=|A|+1}^{|A∪V|} λ_i(P_{A∪V}^T Γ_δ P_{A∪V})   (A.11)

≤ log ∏_{i=1}^{|V\A|} λ_i(Γ_Y) − log ∏_{i=d−|V\A|+1}^{d} λ_i(Γ_{Y|Z}),   (A.12)

where the final step uses Lemma 5 and interlacing once more.
A.3
Proof of Lemma 3
Proof. Recall from the proof of Lemma 2 the block forms (A.1) and (A.2) of P_{A∪ω}(P_{A∪ω}^T Γ_δ P_{A∪ω}) P_{A∪ω}^T and P_A (P_A^T Γ_δ P_A) P_A^T. Then

S_{A,ω} = M_{A∪ω} − M_A
= Γ_Z^{1/2} (G H̃)^T [P_{A∪ω} (P_{A∪ω}^T Γ_δ P_{A∪ω})^{−1} P_{A∪ω}^T − P_A (P_A^T Γ_δ P_A)^{−1} P_A^T] (G H̃) Γ_Z^{1/2}
= Γ_Z^{1/2} (G H̃)^T
  [ A^{−1} b (c − b^T A^{−1} b)^{−1} b^T A^{−1}   −A^{−1} b (c − b^T A^{−1} b)^{−1}   0 … 0
    −(c − b^T A^{−1} b)^{−1} b^T A^{−1}            (c − b^T A^{−1} b)^{−1}            0 … 0
    0                                              0                                  0 … 0 ]
  (G H̃) Γ_Z^{1/2}
= (c − b^T A^{−1} b)^{−1} Γ_Z^{1/2} (G H̃)^T [−A^{−1} b; 1; 0; …; 0] [−b^T A^{−1}, 1, 0, …, 0] (G H̃) Γ_Z^{1/2}.   (A.13)