A Combinatorial Approach to Goal-Oriented
Optimal Bayesian Experimental Design
by
Fengyi Li
B.S., Texas A&M University (2017)
Submitted to the Department of Aeronautics and Astronautics
in partial fulfillment of the requirements for the degree of
Master of Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2019
© Massachusetts Institute of Technology 2019. All rights reserved.
Author . . . .
Department of Aeronautics and Astronautics
May 28, 2019
Certified by . . . .
Youssef Marzouk
Associate Professor of Aeronautics and Astronautics
Thesis Supervisor
Accepted by . . . .
Sertac Karaman
Associate Professor of Aeronautics and Astronautics
Chair, Graduate Program Committee
A Combinatorial Approach to Goal-Oriented Optimal
Bayesian Experimental Design
by
Fengyi Li
Submitted to the Department of Aeronautics and Astronautics on May 28, 2019, in partial fulfillment of the
requirements for the degree of Master of Science
Abstract
Optimal experimental design plays an important role in science and engineering. In many situations, we have many candidate observations but only a few of them can be selected due to limited resources. We then need to decide which ones to select based on our goal.
In this thesis, we study the Bayesian linear Gaussian model with a large number of observations, and propose several algorithms for solving the combinatorial problem of observation selection/optimal experimental design in a goal-oriented setting. Here, the quantity of interest (QoI) is not the model parameters, but some (vector-valued) function of the parameters. We wish to select a subset of the candidate observations that is most informative for this QoI, in the sense of reducing its uncertainty. More precisely, we seek to maximize the mutual information between the selected observations and the QoI. Finding the true optimum is NP-hard, and in this setting, the mutual information objective is in general not submodular. We thus introduce several algorithms that approximate the optimal solution, including a greedy approach, a minorize–maximize approach employing modular bounds, and certain score-based heuristics. We compare the computational costs of these algorithms, and demonstrate their performance on a synthetic data set and a real data set from a climate model.

Thesis Supervisor: Youssef Marzouk
Title: Associate Professor of Aeronautics and Astronautics
Acknowledgments
First of all, I would like to thank my advisor, Professor Youssef Marzouk, for taking me on as his graduate student two years ago. Although busy, he is always there to lend a helping hand whenever needed. He has taught me how to become a researcher rather than just a problem solver. This thesis would not have been completed without his help and guidance.
In addition, I would like to thank my lab mates, Alessio, Andrea, Ben, Chi, Daniele, Jayanth, Michael, Ricardo, and Zheng, for insightful discussions about research and interesting daily conversations. I especially thank Jayanth for guiding me through this project and sharing his code. I would also like to thank my friends outside the lab, Bai, Hanshen, Shaoxiong, Xiaoyue, Xinzhe, and Yilun, for our irregular hotpot gatherings. In particular, I thank Bai for taking so many classes with me and going through the frustrating moments before homework was due; and Hanshen for all those discussions about random stuff and for being a good game partner in playing "chiji" on countless evenings when my research seemed hopeless.
Most importantly, I would like to thank my parents for their love, encouragement and support throughout the years.
Contents
1 Introduction 13
2 Background 17
2.1 Classical optimal experimental design . . . 17
2.2 The Bayesian framework . . . 18
2.3 The selection operator . . . 19
2.4 Bayesian optimality criterion . . . 20
2.5 The generalized eigenvalue problem . . . 21
2.6 Connection to optimal low rank approximation . . . 23
2.7 Connection to canonical correlation analysis (CCA) . . . 25
2.8 The goal operator and the transformed model . . . 27
3 Approximate submodular functions 31
3.1 The greedy algorithm . . . 33
3.2 Submodularity ratio and generalized curvature . . . 34
3.3 α-approximate submodularity and ε-approximate submodularity . . . 35
3.4 The challenge in goal-oriented case . . . 36
4 Algorithms 39
4.1 The greedy algorithm . . . 39
4.2 The MM algorithm . . . 40
4.3 The generalized leverage score algorithm . . . 41
5 Numerical examples 47
5.1 Synthetic data set . . . 47
5.1.1 Calibration . . . 47
5.1.2 Correlated noise . . . 51
5.1.3 Uncorrelated noise . . . 53
5.2 Simple E3SM (Energy Exascale Earth System Model) land model (sELM) 55
5.2.1 Model description . . . 55
5.2.2 Linearized sELM . . . 58
5.2.3 Numerical results . . . 59
6 Conclusions and future work 65
A Proofs 67
A.1 Proof of Lemma 2 . . . 67
A.2 Proof of Theorem 5 . . . 71
A.3 Proof of Lemma 3 . . . 72
A.4 Proof of Theorem 6 . . . 74
A.5 Proof of Theorem 8 . . . 76
B The generalized eigenvalue problem 79
List of Figures
5-1 Spectrum of important matrices and pencils of a calibration example
with correlation length 0.05 . . . 48
5-2 Numerical results of a calibration example with correlation length 0.05 49
5-3 Spectrum of important matrices and pencils of a calibration example with correlation length 0.5 . . . 50
5-4 Numerical results of a calibration example with correlation length 0.5 51
5-5 Spectrum of important matrices and pencils with correlated noise . . 52
5-6 Numerical results with correlated noise . . . 53
5-7 Spectrum of important matrices and pencils with uncorrelated noise 54
5-8 Numerical results with uncorrelated noise . . . 55
5-9 Spectrum for linearized sELM model with H = (1/47) Σ_{i=1}^{47} X_i . . . 59
5-10 Numerical results for linearized sELM model with H = (1/47) Σ_{i=1}^{47} X_i . . 60
5-11 Selected locations shown on a map with H = (1/47) Σ_{i=1}^{47} X_i . . . 61
5-12 Spectrum for linearized sELM model with H = X_1 . . . 62
5-13 Numerical results for linearized sELM model with H = X1 . . . 62
List of Tables
Chapter 1
Introduction
The optimal experimental design (OED) problem arises in numerous scenarios and has various applications in science and engineering, such as combustion kinetics [26], sensor placement for weather prediction [34], and contaminant source identification [5]. There are numerous variations of the OED problem. In general, however, they aim to solve the following problem: how to select k out of n observations in order to "optimally" learn the underlying parameters. One particular variant is the goal-oriented OED problem. In this case, the quantity of interest (QoI) is some function of the underlying parameters rather than the parameters themselves. The optimality criteria, usually labeled by letters, are functionals of the eigenvalues of Fisher information matrices of the model [15, 41, 18]. For example, if we would like to minimize the average variance of the QoI, the A-optimality criterion is adopted. Another commonly used criterion minimizes the volume of the confidence ellipsoid, leading to the D-optimality criterion. The details of the different optimality criteria are presented in section 2.1.
One naive way to solve this problem is to exhaust all (n choose k) possible selections and choose the best k-subset. This problem is NP-hard, and exhaustive enumeration immediately becomes intractable even for moderate dimensions. For example, one needs to iterate over 4.71 × 10^13 choices to select 20 out of 50 observations using exhaustive search. Therefore, most of the effort in this field goes into developing methods for finding approximate solutions. Some common methods that have been used
in this field are the greedy algorithm, continuous relaxations, and exchange algorithms. The greedy algorithm for cardinality-constrained design optimizes the objective in each iteration by selecting the element that gives the largest improvement, conditioning on the ones already chosen. If the objective function is non-decreasing (see Def. 1 in Chapter 3) and submodular (Def. 2), the greedy algorithm yields a 1 − 1/e constant-factor performance guarantee [37]. Furthermore, if the objective is not submodular but still non-decreasing, we can define a notion of curvature (Def. 4) and a submodularity ratio (Def. 3) [7], and the performance bound can be written in terms of the curvature and submodularity ratio. More details about maximizing submodular functions can be found in [33] and Chapter 3.
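The combinatorial scale motivating these approximate methods is easy to check directly; a quick, purely illustrative sketch:

```python
from math import comb

# Number of k-subsets of n candidate observations an exhaustive search must visit.
n, k = 50, 20
n_subsets = comb(n, k)
print(f"choose {k} of {n}: {n_subsets:.3g} subsets")  # on the order of 4.7e13
```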
On the other hand, we can consider a continuous relaxation of the combinatorial problem. A standard approach is to relax the {0, 1} constraint: we let each variable lie in [0, 1] and solve the continuous problem. In many settings, continuous relaxation results in convex problems [30, 3]. In other cases, an ℓ₁ regularization term can be added to the optimization problem to encourage sparsity of the solution [5].
Fedorov's exchange algorithm [20] starts with some random set S of cardinality k, and in each iteration exchanges one element i ∈ S with an element not in the set, j ∈ S^c, so that the objective improves after the exchange. The algorithm terminates when the maximum number of iterations is reached or when no exchange can further improve the objective. Fedorov's exchange algorithm is slow, and there have been several variants of it, as described in [38].
In the Bayesian framework, a prior distribution is placed on the parameters to represent our state of knowledge before seeing the data. A posterior distribution of the parameters is then obtained by conditioning on the observations. Hence, instead of outputting a point estimator, the Bayesian framework yields a distribution, which makes it a powerful tool for uncertainty quantification. Some of the alphabetical optimality criteria used in the classical setting have natural extensions to the Bayesian setting and admit statistical interpretations. We will describe them in section 2.2. Huan et al. [26] propose a simulation-based framework for solving the Bayesian optimal experimental design problem. For sequential experimental design under the Bayesian framework, Huan et al. [27] propose approximate dynamic programming (DP) to solve the sequential design problem. Unlike the greedy approach, DP considers both the future and the feedback. More details about sequential Bayesian experimental design can be found in [43].
The design problem becomes nonlinear if we have a nonlinear forward model. Exact calculation of the optimal design criteria in this case often leads to a complicated integral, and approximations have to be adopted. Commonly used approximations include a normal approximation to the posterior distribution, or using the empirical Fisher information matrix rather than the expected one. More detailed discussion of nonlinear design can be found in [11].
Thesis contribution: we propose a one-shot method for selecting observations in OED. Although the algorithm has some limitations, its computational cost is much lower compared to some existing methods. We demonstrate its performance on a synthetic data set and a real data set.
Chapter 2
Background
2.1 Classical optimal experimental design
In the classical (frequentist) optimal experimental design setting, the efficiency of a design is measured by functionals of the eigenvalues of certain matrices. Different choices of functionals lead to different optimality criteria, hence different designs. Consider a linear model
Y = GX + ε    (2.1)

where Y ∈ R^n is the observation, X ∈ R^m is the underlying parameter, and ε ∈ R^n is the additive noise, distributed as N(0, Γ_ε). The alphabetical optimality criteria, along with their Bayesian extensions, are conventionally labeled by letters. For the classical design, we assume the noise is independent, namely, Γ_ε is diagonal with each of its diagonal elements σ_i > 0. Before introducing the design criteria, we first define the selection operator P^⊤, where each of its rows is a unit vector with all 0's except one entry being 1. If we want to select k out of n observations, then P^⊤ is in R^{k×n}. Using the selection operator, we now list some commonly used design criteria.
A(verage)-optimality minimizes the average variance of the estimator:

min_{rank(P)≤k} tr( G^⊤ P (P^⊤ Γ_ε P)^{−1} P^⊤ G )^{−1}.    (2.2)

D(eterminant)-optimality minimizes the volume (mean radius) of the confidence ellipsoid of the estimator [30]:

min_{rank(P)≤k} det( G^⊤ P (P^⊤ Γ_ε P)^{−1} P^⊤ G )^{−1}.    (2.3)

E(igenvalue)-optimality minimizes the maximum variance over all directions of the estimator:

min_{rank(P)≤k} λ_max( (G^⊤ P (P^⊤ Γ_ε P)^{−1} P^⊤ G)^{−1} ).    (2.4)

These criteria allow us to choose a particular design to fulfill different purposes subject to cardinality constraints. Some other optimality criteria can be found in [41].
2.2 The Bayesian framework
In the Bayesian setting, a prior distribution is first put on the parameters, representing our state of knowledge before observing the data. We start by writing out the posterior distribution. We assume X has a multivariate normal prior distribution and, without loss of generality, that the prior mean is 0, X ∼ N(0, Γ_X), with noise ε ∼ N(0, Γ_ε). Here, we do not assume the noise is independent, i.e., Γ_ε need not be diagonal. Using Bayes' rule, the posterior distribution of X|Y is also Gaussian, X|Y ∼ N(µ_pos(Y), Γ_pos), with mean

µ_pos(Y) = Γ_pos G^⊤ Γ_ε^{−1} Y    (2.5)

and covariance

Γ_pos = (H_l + Γ_X^{−1})^{−1}    (2.6)

where H_l = G^⊤ Γ_ε^{−1} G is the Hessian of the negative log-likelihood. It is worth mentioning that there is a variational characterization of the mean µ_pos. Define

J(µ) := (1/2) ‖Gµ − Y‖²_{Γ_ε^{−1}} + (1/2) ‖µ‖²_{Γ_X^{−1}};    (2.7)

then µ_pos is the solution to the following optimization problem [5]:

µ_pos = arg min_µ J(µ),    (2.8)

and the Hessian of J(µ) is

H_J = H_l + Γ_X^{−1} = Γ_pos^{−1}.    (2.9)
In the setting of Bayesian experimental design, one seeks the observations that minimize the posterior uncertainty of the underlying parameter X, subject to cardinality constraints. In the next section, we see how the selection operator P^⊤ can help formulate this problem.
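The posterior formulas (2.5)-(2.6) are easy to sanity-check numerically. A minimal sketch, with arbitrary random model matrices (all values here are illustrative), verifies that the information-form covariance (2.6) matches the covariance obtained by conditioning the joint Gaussian directly:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 4
G = rng.standard_normal((n, m))
Gamma_X = np.eye(m)                            # prior covariance (illustrative)
A_ = rng.standard_normal((n, n))
Gamma_eps = A_ @ A_.T + 0.1 * np.eye(n)        # correlated noise covariance

# Posterior covariance via (2.6): (G^T Gamma_eps^{-1} G + Gamma_X^{-1})^{-1}
Hl = G.T @ np.linalg.solve(Gamma_eps, G)
Gamma_pos = np.linalg.inv(Hl + np.linalg.inv(Gamma_X))

# Same object via joint-Gaussian conditioning (Woodbury identity):
#   Gamma_X - Gamma_X G^T (G Gamma_X G^T + Gamma_eps)^{-1} G Gamma_X
Gamma_Y = G @ Gamma_X @ G.T + Gamma_eps
Gamma_pos_alt = Gamma_X - Gamma_X @ G.T @ np.linalg.solve(Gamma_Y, G @ Gamma_X)
assert np.allclose(Gamma_pos, Gamma_pos_alt)

# Posterior mean (2.5) for a synthetic observation
Y = G @ rng.standard_normal(m) + rng.multivariate_normal(np.zeros(n), Gamma_eps)
mu_pos = Gamma_pos @ G.T @ np.linalg.solve(Gamma_eps, Y)
```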
2.3 The selection operator
We have introduced the selection operator P^⊤ earlier. To restate: each row of P^⊤ is a unit vector with all 0's except one entry being 1. Now, given the same setting as in section 2.2, we compute the posterior distribution of X after observing P^⊤Y. Denoting P^⊤Y by Y_P, the posterior distribution is also multivariate normal with mean

µ_{X|Y_P} = Γ_{X|Y_P} G^⊤ P (P^⊤ Γ_ε P)^{−1} P^⊤ Y    (2.10)

and covariance matrix

Γ_{X|Y_P} = ( G^⊤ P (P^⊤ Γ_ε P)^{−1} P^⊤ G + Γ_X^{−1} )^{−1}.    (2.11)

From the formulation above, what the selection operator P essentially does is select the corresponding rows and columns (a principal submatrix) of the noise covariance matrix Γ_ε, invert this submatrix, and enlarge it back to the original size by putting zeros in the unselected entries. Let A^+ denote the pseudo-inverse of a matrix A. Then we have

P (P^⊤ Γ_ε P)^{−1} P^⊤ = P P^⊤ (P P^⊤ Γ_ε P P^⊤)^+ P P^⊤.    (2.12)

Note that P P^⊤ is a square matrix of size n × n, and the covariance matrix can be written equivalently as

Γ_{X|Y_P} = ( G^⊤ P P^⊤ (P P^⊤ Γ_ε P P^⊤)^+ P P^⊤ G + Γ_X^{−1} )^{−1}.    (2.13)
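The identity (2.12), and hence the equivalence of (2.11) and (2.13), can be verified numerically. A minimal sketch with an arbitrary random model and a hypothetical selection of rows {1, 3, 6}:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 7, 3
G = rng.standard_normal((n, m))
Gamma_X = np.eye(m)
B = rng.standard_normal((n, n))
Gamma_eps = B @ B.T + 0.1 * np.eye(n)          # correlated noise (illustrative)

rows = [1, 3, 6]                               # hypothetical selected observations
P = np.zeros((n, len(rows)))
P[rows, np.arange(len(rows))] = 1.0

# Left-hand side of (2.12): invert the principal submatrix, embed with P
lhs = P @ np.linalg.inv(P.T @ Gamma_eps @ P) @ P.T
# Right-hand side: zero out unselected rows/columns, then pseudo-invert
PPt = P @ P.T                                  # n x n projector onto selected rows
rhs = PPt @ np.linalg.pinv(PPt @ Gamma_eps @ PPt) @ PPt
assert np.allclose(lhs, rhs)

# Posterior covariance via (2.11) and via the equivalent form (2.13)
cov_sel = np.linalg.inv(G.T @ lhs @ G + np.linalg.inv(Gamma_X))
cov_pad = np.linalg.inv(G.T @ rhs @ G + np.linalg.inv(Gamma_X))
assert np.allclose(cov_sel, cov_pad)
```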
2.4 Bayesian optimality criterion
The A- and D-optimality criteria have corresponding Bayesian extensions that admit statistical interpretations. The Bayesian A-optimal design minimizes the trace of the posterior covariance matrix, i.e.,

min_{rank(P)≤k} tr(Γ_{X|Y_P}) = min_{rank(P)≤k} tr( G^⊤ P (P^⊤ Γ_ε P)^{−1} P^⊤ G + Γ_X^{−1} )^{−1},    (2.14)

and the Bayesian D-optimal design minimizes the log-determinant of the posterior covariance matrix:

min_{rank(P)≤k} log det(Γ_{X|Y_P}) = min_{rank(P)≤k} log det( G^⊤ P (P^⊤ Γ_ε P)^{−1} P^⊤ G + Γ_X^{−1} )^{−1}.    (2.15)
Comparing (2.14), (2.15) with (2.2), (2.3), we see that the prior plays the role of a regularization term, in the sense that it prevents the optimization problem from being ill-conditioned. The A-optimality criterion can be interpreted as minimizing the Bayes risk of the posterior mean:

E_X E_{Y|X}[ ‖X − µ_{X|Y_P}‖² ] = tr(Γ_{X|Y_P}).

On the other hand, the D-optimality criterion minimizes the log-volume (mean radius) of the credible ellipsoid of the posterior distribution [30]. In addition, the log-determinant of the posterior covariance can be connected to the expected information gain, characterized by the expected Kullback-Leibler (KL) divergence between the prior and the posterior, using the following identity [5, 26]:

E_Y[ D_KL( N(µ_{X|Y_P}, Γ_{X|Y_P}) ‖ N(0, Γ_X) ) ] = −(1/2) log det Γ_{X|Y_P} + (1/2) log det Γ_X.    (2.16)

This gives the information-theoretic motivation for the D-optimal design. (2.16) is also equal to the mutual information between the observations and the underlying parameters,

I(Y_P; X) = (1/2) log ( det Γ_X / det Γ_{X|Y_P} ),    (2.17)

and we see that minimizing the log-determinant of the posterior covariance is equivalent to maximizing the expected information gain and the mutual information. In this thesis, we mainly consider the Bayesian D-optimal design and use the mutual information characterization.
2.5 The generalized eigenvalue problem
The symmetry of mutual information in its two arguments enables us to decompose it in the following two ways:

I(Y_P; X) = H(Y_P) − H(Y_P|X) = H(X) − H(X|Y_P)    (2.18)

where H denotes the differential entropy. For a general d-dimensional multivariate normal random variable W ∼ N(µ, Γ), the differential entropy has a closed form:

H(W) = (1/2) log det(Γ) + (d/2)(1 + log(2π)).    (2.19)

Combining (2.18) and (2.19), we can formulate the D-optimality criterion in two ways:

I(Y_P; X) = H(Y_P) − H(Y_P|X) = −(1/2) log det(P^⊤ Γ_ε P) + (1/2) log det Γ_{Y_P}    (2.20)
          = H(X) − H(X|Y_P) = −(1/2) log det Γ_{X|Y_P} + (1/2) log det Γ_X,    (2.21)

where Γ_{Y_P} = P^⊤ Γ_Y P = P^⊤ (G Γ_X G^⊤ + Γ_ε) P is the covariance matrix of the selected observations. This symmetry also lets us see the problem in two ways: (2.20) is the formulation in the data space while (2.21) is the formulation in the parameter space. Both can be viewed as a generalized eigenvalue problem (GEP).
For a matrix pencil (A, B), where A and B are full-rank square matrices of the same dimension, the GEP solves the following equation:

A v_i = λ_i B v_i    (2.22)

where λ_i is an eigenvalue and v_i is the corresponding eigenvector. See more about the GEP in Appendix B. Then (2.20) can be written as

(1/2) log det(Σ_data)    (2.23)

where Σ_data = diag(σ_data,i), and the σ_data,i are the eigenvalues of the pencil (Γ_{Y_P}, P^⊤ Γ_ε P). Similarly, (2.21) can be written as

(1/2) log det(Σ_para)    (2.24)

where Σ_para = diag(σ_para,i), and the σ_para,i are the eigenvalues of the pencil (Γ_X, Γ_{X|Y_P}).
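The two formulations can be checked against each other numerically. The sketch below (random toy model, hypothetical selection of rows {0, 2, 5}) computes the mutual information from the data-space pencil and from the parameter-space pencil and confirms they agree:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(3)
n, m = 6, 4
G = rng.standard_normal((n, m))
Gamma_X = np.eye(m)
Gamma_eps = np.diag(rng.uniform(0.5, 1.5, n))

rows = [0, 2, 5]                               # hypothetical selection
P = np.zeros((n, len(rows)))
P[rows, np.arange(len(rows))] = 1.0

Gamma_Y = G @ Gamma_X @ G.T + Gamma_eps
Gamma_YP = P.T @ Gamma_Y @ P
Gamma_eps_P = P.T @ Gamma_eps @ P

# Data-space pencil (Gamma_YP, Gamma_eps_P): eigenvalues sigma_data, eq. (2.23)
sigma_data = eigh(Gamma_YP, Gamma_eps_P, eigvals_only=True)
I_data = 0.5 * np.sum(np.log(sigma_data))

# Parameter-space pencil (Gamma_X, Gamma_X|YP): eigenvalues sigma_para, eq. (2.24)
cov_post = np.linalg.inv(
    G.T @ P @ np.linalg.solve(Gamma_eps_P, P.T @ G) + np.linalg.inv(Gamma_X))
sigma_para = eigh(Gamma_X, cov_post, eigvals_only=True)
I_para = 0.5 * np.sum(np.log(sigma_para))

assert np.isclose(I_data, I_para)              # same mutual information I(Y_P; X)
```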
This eigenvalue structure also connects the design problem to optimal low rank approximation, as presented in [47]. We introduce this in the following section.
2.6 Connection to optimal low rank approximation
The optimal low rank approximation solves the following problem:

Γ̂*_pos = arg min_{Γ̂_pos} d²_R(Γ_pos, Γ̂_pos)    (2.25)

such that

Γ̂_pos ∈ M_r = { Γ_X − K K^⊤ ≻ 0 : rank(K) ≤ r }    (2.26)

where d_R is a distance between two symmetric positive definite (SPD) matrices, defined as

d_R(A, B) = sqrt( tr[ log²(A^{−1/2} B A^{−1/2}) ] ) = sqrt( Σ_i log²(λ_i) )    (2.27)

where A, B are two SPD matrices of compatible size and the λ_i are the generalized eigenvalues of the matrix pencil (A, B). The optimization problem (2.25) is solved using the generalized eigenvalues and eigenvectors of the matrix pencil (G^⊤ Γ_ε^{−1} G, Γ_X^{−1}). We present the optimal low rank approximation theorem below (Theorem 2.3 in [47]).
Theorem 1 (Optimal posterior covariance approximation). Let (δ_i², ŵ_i) be the generalized eigenvalue-eigenvector pairs of the matrix pencil (G^⊤ Γ_ε^{−1} G, Γ_X^{−1}), with the ordering δ_i ≥ δ_{i+1}. Given Γ_pos, a minimizer Γ̂*_pos of d²_R(Γ_pos, Γ̂_pos) over Γ̂_pos ∈ M_r is given by

Γ̂*_pos = Γ_X − K K^⊤    (2.28)

where

K K^⊤ = Σ_{i≤r} δ_i² (1 + δ_i²)^{−1} ŵ_i ŵ_i^⊤.    (2.29)

The corresponding minimum loss is given by

d²_R(Γ_pos, Γ̂*_pos) = Σ_{i>r} log²( 1 / (1 + δ_i²) ).    (2.30)
Using the identities relating eigenpairs of Schur complements of covariance matrices, Theorem 1 can be reformulated as follows.

Theorem 2 (Optimal posterior covariance approximation, Schur complement version). Let (λ_i, q_i) be the eigenpairs of the pencil (G Γ_X G^⊤, Γ_Y) with the ordering λ_i ≥ λ_{i+1} and normalization q_i^⊤ G Γ_X G^⊤ q_i = 1, where Γ_Y = Γ_ε + G Γ_X G^⊤ is the covariance matrix of the marginal distribution of Y. Then a minimizer Γ*_pos of the Riemannian metric d_R between Γ_pos and an element of M_r is given by:

Γ*_pos = Γ_X − K K^⊤,   K K^⊤ = Σ_{i=1}^{r} q̂_i q̂_i^⊤,   q̂_i = √λ_i Γ_X G^⊤ q_i,

where the corresponding minimum distance is:

d²_R(Γ_pos, Γ*_pos) = Σ_{i>r} log²(1 − λ_i).

This theorem indicates that the low rank approximation using eigenpairs of the pencil (Γ_Y − Γ_ε, Γ_Y) gives the optimal approximation in the Riemannian metric over all SPD matrices.
Now consider projecting the observations onto an r-dimensional subspace, which yields the linear model:

V^⊤ Y = V^⊤ G X + V^⊤ ε    (2.31)

where V ∈ R^{n×r}, and we are interested in the following optimization problem:

max_{V ∈ R^{n×r}} I(V^⊤ Y; X) = max_{V ∈ R^{n×r}} (1/2) log det( (V^⊤ Γ_Y V)(V^⊤ Γ_ε V)^{−1} ).    (2.32)

Intuitively, this finds the basis to project onto so that the mutual information between the underlying parameters and the projected variables is maximized. Let (λ_i, v_i)_{i=1}^{r} be the top r eigenpairs of the pencil (Γ_Y, Γ_ε). It is shown in [22] that the solution to (2.32) is given by the matrix V whose columns are the eigenvectors {v_i}_{i=1}^{r}, and the optimum is given by

max_{V ∈ R^{n×r}} I(V^⊤ Y; X) = (1/2) Σ_{i=1}^{r} log λ_i.    (2.33)
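A small numerical sketch of this result, using SciPy's generalized symmetric eigensolver on an arbitrary random model (illustrative only, not code from [22]):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(4)
n, m, r = 8, 5, 2
G = rng.standard_normal((n, m))
Gamma_X = np.eye(m)
Gamma_eps = np.diag(rng.uniform(0.5, 1.5, n))
Gamma_Y = G @ Gamma_X @ G.T + Gamma_eps

def mi_projected(V):
    """I(V^T Y; X) = 1/2 log det[(V^T Gamma_Y V)(V^T Gamma_eps V)^{-1}], eq. (2.32)."""
    A = V.T @ Gamma_Y @ V
    B = V.T @ Gamma_eps @ V
    return 0.5 * np.linalg.slogdet(np.linalg.solve(B, A))[1]

# Top-r generalized eigenvectors of the pencil (Gamma_Y, Gamma_eps) maximize the MI
lam, vecs = eigh(Gamma_Y, Gamma_eps)           # eigenvalues in ascending order
V_opt = vecs[:, -r:]
I_opt = mi_projected(V_opt)
assert np.isclose(I_opt, 0.5 * np.sum(np.log(lam[-r:])))   # matches (2.33)

# Any other projection should do no better
V_rand = rng.standard_normal((n, r))
assert mi_projected(V_rand) <= I_opt + 1e-9
```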
2.7 Connection to canonical correlation analysis (CCA)
CCA, first proposed by Hotelling in 1936 [25], finds the maximal correlation over all linear transformations of two multivariate random variables. It has been used mainly for two purposes: dimension reduction and data interpretation. The former lets us consider only a small number of linear combinations of the random variables, whereas the latter allows us to find important features/directions for explaining the data [24]. For two multivariate random variables X and Y, with covariance
Σ = [ Σ_XX  Σ_XY ; Σ_YX  Σ_YY ],    (2.34)
consider the linear combinations X^⊤ w_x ∈ R and Y^⊤ w_y ∈ R. CCA solves the following:

arg max_{w_x, w_y} ρ = arg max_{w_x, w_y} corr(X^⊤ w_x, Y^⊤ w_y)    (2.35)

= arg max_{w_x, w_y} E[w_x^⊤ X Y^⊤ w_y] / sqrt( E[w_x^⊤ X X^⊤ w_x] E[w_y^⊤ Y Y^⊤ w_y] )

= arg max_{w_x, w_y} w_x^⊤ Σ_XY w_y / sqrt( w_x^⊤ Σ_XX w_x · w_y^⊤ Σ_YY w_y ).    (2.36)
The solution can be found by solving the following system of generalized eigenvalue equations:

Σ_XY Σ_YY^{−1} Σ_YX w_x = ρ² Σ_XX w_x    (2.37)
Σ_YX Σ_XX^{−1} Σ_XY w_y = ρ² Σ_YY w_y.    (2.38)

That is, (ρ², w_x) is an eigenpair of the pencil (Σ_XY Σ_YY^{−1} Σ_YX, Σ_XX) and (ρ², w_y) is an eigenpair of the pencil (Σ_YX Σ_XX^{−1} Σ_XY, Σ_YY). Moreover, w_x and w_y are related by the following equations:

Σ_XY w_y = ρ λ_x Σ_XX w_x,   Σ_YX w_x = ρ λ_y Σ_YY w_y    (2.39)

where λ_x = λ_y^{−1} = sqrt( (w_y^⊤ Σ_YY w_y) / (w_x^⊤ Σ_XX w_x) ). Therefore, in practice, we only need to solve one of (2.37) and (2.38); (2.39) gives the other.
Returning to our problem, the joint covariance of Y and X as in (2.1) is

[ Γ_Y  G Γ_X ; Γ_X G^⊤  Γ_X ].    (2.40)

Let ŵ_i be the eigenvector of the pencil (G Γ_X G^⊤, Γ_Y), analogous to (2.38), and v̂_i be the eigenvector of the pencil (Γ_X G^⊤ Γ_Y^{−1} G Γ_X, Γ_X), analogous to (2.37), corresponding to the same eigenvalue λ_i. Then ŵ_i and v̂_i maximize corr(X^⊤ v̂_i, Y^⊤ ŵ_i), and the optimal value of the correlation is √λ_i.
Another way to characterize the canonical correlations is as follows:

(W_x, W_y) = arg min_{(Ŵ_x, Ŵ_y) ∈ S_r} ‖ Ŵ_x^⊤ X − Ŵ_y^⊤ Y ‖²    (2.41)

where

S_r = { (Ŵ_x, Ŵ_y) : rank(Ŵ_x) = rank(Ŵ_y) = r, Ŵ_x^⊤ Σ_XX Ŵ_x = Ŵ_y^⊤ Σ_YY Ŵ_y = I_r }.    (2.42)

Then (2.41) is minimized when W_x and W_y have the top-r eigenvectors w_x and w_y (as described in (2.37) and (2.38)) as their columns. Details about CCA are discussed in [23, 50, 9].
2.8 The goal operator and the transformed model
Now consider a goal operator. The QoI is a linear function of X, denoted by Z = HX, where H is what we call the goal operator. In the goal-oriented design, we are interested in minimizing the uncertainty of Z. In many cases, the range of H is low-dimensional, of dimension 1 or 2, for instance; hence, inferring every aspect of X would be a waste of computational resources. In order to do inference, we would like to compute the posterior distribution of Z, the probability distribution of Y|Z, etc. One way is to note that the prior distribution of Z is N(0, Γ_Z) with Γ_Z = H Γ_X H^⊤. Hence,
the posterior distribution of Z is also multivariate normal with mean

µ_{Z,pos}(Y) = H µ_pos(Y)    (2.43)

and posterior covariance

Γ_{Z,pos} = H Γ_pos H^⊤.
Another way to see this goal-oriented problem is to introduce the transformed model [46]:

Y = G H̃ Z + δ    (2.44)

where H̃ = Γ_X H^⊤ Γ_Z^{−1}, Z ∼ N(0, Γ_Z), and δ ∼ N(0, Γ_δ). Z and δ are independent, and Γ_δ = Γ_ε + G(Γ_X − Γ_X H^⊤ Γ_Z^{−1} H Γ_X)G^⊤. Using this transformation, the results for Bayesian experimental design can be applied to the goal-oriented case, and we obtain

Γ_{Z|Y} = ( Γ_Z^{−1} + (G H̃)^⊤ Γ_δ^{−1} G H̃ )^{−1}    (2.45)
µ_{Z|Y} = H µ_{X|Y} = Γ_{Z|Y} (G H̃)^⊤ Γ_δ^{−1} Y    (2.46)

and

Γ_{Z|Y_P} = ( (G H̃)^⊤ P (P^⊤ Γ_δ P)^{−1} P^⊤ G H̃ + Γ_Z^{−1} )^{−1}    (2.47)
µ_{Z|Y_P} = H µ_{X|Y_P} = Γ_{Z|Y_P} (G H̃)^⊤ P (P^⊤ Γ_δ P)^{−1} P^⊤ Y.    (2.48)
However, it is worth mentioning that the forward operator of the transformed model is often rank deficient, in the sense that G H̃ has the same rank as H; in particular, the rank of G H̃ is d if d < m < n. Hence, although the algorithms from the classical setting can still be applied in the goal-oriented case, the rank deficiency may make the theoretical analysis difficult.
The Bayesian D-optimal design problem in the goal-oriented setting can be formulated as

min_{rank(P)≤k} log det(Γ_{Z|Y_P}) = min_{rank(P)≤k} log det( H Γ_{X|Y_P} H^⊤ )    (2.49)

or equivalently, using the transformed model:

min_{rank(P)≤k} log det(Γ_{Z|Y_P})    (2.50)
= min_{rank(P)≤k} log det( ( (G H̃)^⊤ P (P^⊤ Γ_δ P)^{−1} P^⊤ G H̃ + Γ_Z^{−1} )^{−1} ).    (2.51)
Chapter 3
Approximate submodular functions
Submodularity is a useful structure that has been widely exploited in combinatorial optimization [6, 10, 8, 29] and machine learning [49, 31], in particular in the design and analysis of approximation algorithms for NP-hard problems [37, 35, 21]. In this chapter, we present some classical results for (approximate) submodular functions, and show how to use them to analyze greedy algorithms. We will also show the challenge encountered when applying these methods in the goal-oriented setting.
We first introduce definitions of monotonicity and submodularity of a set function f , and then use these definitions to characterize the performance bound of the greedy algorithm when the objective enjoys some desired properties.
Definition 1 (Non-decreasing set functions). Let f be a set function on Ω, i.e., f : 2^Ω → R. Then f is non-decreasing if for all A, B satisfying A ⊆ B ⊆ Ω, we have f(A) ≤ f(B).
Definition 2 (Submodularity). Let f be a set function on Ω, i.e., f : 2^Ω → R. We say f is submodular if for every A, B ⊆ Ω with A ⊆ B, and every e ∈ Ω \ B, we have f(A ∪ {e}) − f(A) ≥ f(B ∪ {e}) − f(B).
From the definition, we see that the concept of submodularity describes the "diminishing returns" property of set functions: the gain from adding a new element is smaller for larger sets. Many functions in science and engineering exhibit this behavior; examples include weighted coverage functions, entropy, the mutual information of conditionally independent variables over disjoint sets, and many more [33]. When the objective function f is non-decreasing, it is natural to consider its maximum subject to the cardinality constraint that the size of the set is at most k, that is:
A* = arg max_{A ⊆ Ω, |A| ≤ k} f(A).    (3.1)
A natural heuristic to consider in this setting is the greedy algorithm. In every step, the greedy algorithm successively maximizes the increment in f. Due to its simplicity and good performance, the algorithm has been applied to many problems, such as column selection [4] and graph max-cut [8]. Relating this to our problem, the selection criterion of the greedy algorithm is to choose the observation that maximizes the mutual information conditioned on the set A of already chosen observations:
max_{v ∈ A^c} I(Z; Y_{P_v} | Y_{P_A}).

Denote Y_{P_A} by Y_A. Using the chain rule of mutual information,

I(Z; Y_v | Y_A) = I(Z; Y_{A∪v}) − I(Z; Y_A),

so in each iteration we solve v* = arg max_{v ∈ A^c} I(Z; Y_{A∪v}) − I(Z; Y_A). In the next section, we present the greedy algorithm in detail.
3.1 The greedy algorithm
Algorithm 1 The Greedy Algorithm
1: input: ground set Ω, set function f : 2^Ω → R, cardinality k
2: A_0 = ∅
3: for i = 1 : k do
4:     v* = arg max_{v ∈ A_{i−1}^c} f(A_{i−1} ∪ {v}) − f(A_{i−1})
5:     A_i = A_{i−1} ∪ {v*}
6: end for
7: return A_k
Nemhauser et al. [37] showed a classical result in 1978 for f satisfying Definition 1 and Definition 2. We restate the theorem here.
Theorem 3. Let f be a non-decreasing submodular set function with f(∅) = 0. Then Algorithm 1 returns A_k satisfying

f(A*) ≥ f(A_k) ≥ (1 − 1/e) f(A*)    (3.2)

where A* is the optimal solution to (3.1), subject to the cardinality constraint of size k.
Note that (1 − 1/e) ≈ 0.63. Hence Theorem 3 states that for f non-decreasing and submodular, the greedy algorithm gives a result whose performance is no worse than 63% of the optimum. Feige extended this result and further proved that no polynomial-time algorithm can achieve a better approximation ratio than 1 − 1/e under the cardinality constraint, assuming P ≠ NP [21]. For convenience, we denote Y_{P_A} by Y_A, and we have the following lemma.
Lemma 1. Under the assumption that H = I and the noise is uncorrelated, the mutual information I(X; Y_A) (see (2.19)) is monotone and submodular in A.

See [28] for a proof. This lemma implies that in the non-goal-oriented case with uncorrelated noise, the mutual information between the underlying parameter X and the selected observations increases as more observations are selected, and also that the incremental gain diminishes as the number of selected observations increases. This coincides with our intuition, since the more we observe, the more we know about the underlying parameter X.
Then, applying Theorem 3, we see that the greedy algorithm yields a performance bound of 1 − 1/e under this assumption. However, if the noise is correlated, or if we are dealing with a general goal operator other than the identity, then the mutual information is no longer submodular. In light of this, we need some measure to quantify how close our objective function is to being submodular.
3.2 Submodularity ratio and generalized curvature
One way to capture the closeness of a non-submodular function to a submodular one is by introducing two parameters: the submodularity ratio and the generalized curvature [14, 7].
Definition 3 (Submodularity ratio [14, 7]). The submodularity ratio of a non-negative set function f is the largest scalar γ such that

Σ_{ω ∈ V\A} ( f(A ∪ {ω}) − f(A) ) ≥ γ ( f(A ∪ V) − f(A) )

for all V, A ⊆ Ω.
Definition 4 (Generalized curvature [14, 7]). The curvature of a non-negative set function f is the smallest scalar α such that

f(A ∪ V) − f((A \ {ω}) ∪ V) ≥ (1 − α) ( f(A) − f(A \ {ω}) )

for all V, A ⊆ Ω, ω ∈ A \ V.
We note that γ = 1 if and only if f is submodular, and α = 0 if and only if f is supermodular (−f is submodular). Hence, f is modular (both submodular and supermodular) if and only if γ = 1 and α = 0. With these two parameters γ and α, we are able to give a performance bound for the greedy algorithm for a general set function f, not necessarily submodular.
Theorem 4 (Approximation guarantee of greedy [7]). Let f be a non-negative non-decreasing set function with submodularity ratio γ ∈ [0, 1] and curvature α ∈ [0, 1]. The greedy algorithm enjoys the following approximation guarantee for solving the problem max_{A ⊆ Ω, |A| ≤ k} f(A):

f(A_k) ≥ (1/α) [ 1 − ( (k − αγ)/k )^k ] f(A*) ≥ (1/α) (1 − e^{−αγ}) f(A*)

where A* is the solution to (3.1), and A_k is the output of the greedy algorithm.

It seems that if we can obtain a lower bound for γ and an upper bound for α, then we can obtain an overall performance bound for the greedy algorithm even if f is non-submodular. However, in the goal-oriented OED setting, obtaining nontrivial bounds for γ and α is not so simple. We will explain this in detail in Section 3.4.
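On a tiny ground set, γ and α can be computed by brute force directly from Definitions 3 and 4. The sketch below does this for the mutual information objective with H = I and uncorrelated noise, where Lemma 1 (below) guarantees submodularity, i.e. γ = 1; all model matrices are illustrative:

```python
import numpy as np
from itertools import chain, combinations

def all_subsets(ground):
    return chain.from_iterable(combinations(ground, r) for r in range(len(ground) + 1))

def ratio_and_curvature(f, ground):
    """Brute-force the largest gamma (Def. 3) and smallest alpha (Def. 4)."""
    gamma, alpha = 1.0, 0.0
    subsets = [set(s) for s in all_subsets(ground)]
    for A in subsets:
        for V in subsets:
            joint = f(A | V) - f(A)            # Definition 3
            if joint > 1e-12:
                single = sum(f(A | {w}) - f(A) for w in V - A)
                gamma = min(gamma, single / joint)
            for w in A - V:                    # Definition 4
                base = f(A) - f(A - {w})
                if base > 1e-12:
                    nested = f(A | V) - f((A - {w}) | V)
                    alpha = max(alpha, 1.0 - nested / base)
    return gamma, alpha

rng = np.random.default_rng(8)
n, m = 4, 3
G = rng.standard_normal((n, m))
Gamma_eps = np.diag(rng.uniform(0.5, 1.5, n))  # uncorrelated noise

def f(A):
    """I(X; Y_A) with Gamma_X = I and H = I (the setting of Lemma 1)."""
    if not A:
        return 0.0
    idx = np.array(sorted(A))
    Gs, Ge = G[idx, :], Gamma_eps[np.ix_(idx, idx)]
    prec = Gs.T @ np.linalg.solve(Ge, Gs) + np.eye(m)
    return 0.5 * np.linalg.slogdet(prec)[1]

gamma, alpha = ratio_and_curvature(f, set(range(n)))
# Lemma 1: with H = I and uncorrelated noise, f is submodular, so gamma = 1
assert gamma > 1 - 1e-8
```

With a nontrivial goal operator H, the same brute-force computation generally returns γ < 1, which is precisely the difficulty discussed in Section 3.4.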
3.3 α-approximate submodularity and ε-approximate submodularity
Another way to characterize how far a function deviates from being submodular is to define notions of α-approximate submodularity and ε-approximate submodularity, as in [13, 12]. It is shown in [12] that the D-optimal goal-oriented experimental design objective is submodular provided that d ≥ m (H is tall and skinny) and the noise is uncorrelated, which is not the case we are interested in. Performance bounds are given in [13] using α-approximate supermodularity and ε-approximate supermodularity. However, the performance bound in that paper is written in terms of a multiple of det(H^⊤ H) (equation (30) in [13]), which is 0 when d < m. The results therefore do not carry over to the case d < m using the same proof technique.
3.4 The challenge in the goal-oriented case
Unfortunately, none of the methods introduced in previous section work in the case we are interested in, due to the fact that d m < n. This relative relationship between dimensions of H, X and Y causes many analyses to break and yields trivial lower bound. We present here an attempt to utilize the result from Section 3.2. Proofs of theorems and lemmas can be found in Appendix A.
The following lemma shows that mutual information is a non-decreasing set function.

Lemma 2. I(Y_A; Z) is non-decreasing in A, i.e., I(Y_{A∪v}; Z) ≥ I(Y_A; Z).
Let f(·) = I(·; Z) = I(Z; ·). Then, to apply Theorem 4, we would like to obtain a lower bound for γ and an upper bound for α. We first examine a lower bound for γ, hence a lower bound for

∑_{ω∈V\A} [I(Y_{A∪ω}; Z) − I(Y_A; Z)] / [I(Y_{A∪V}; Z) − I(Y_A; Z)]  for all V, A ⊆ Ω.

For the denominator, we have the following result:
Theorem 5. I(Y_{A∪V}; Z) − I(Y_A; Z) ≤ ½ [log ∏_{i=1}^{|V\A|} λ_i(Γ_Y) − log ∏_{i=d−|V\A|+1}^{d} λ_i(Γ_{Y|Z})],

where λ_i(Γ) denotes the i-th largest eigenvalue of the matrix Γ.
To get a lower bound for the numerator of this ratio, we need a lemma to build up the machinery.

Lemma 3. Let

M_A = Γ_Z Γ_{Z|Y_A}^{−1} = I + Γ_Z^{1/2} (G H̃)^T P_A (P_A^T Γ_δ P_A)^{−1} P_A^T (G H̃) Γ_Z^{1/2},

and let S_{A,ω} = M_{A∪ω} − M_A. Then

2 I(Y_{A∪ω}; Z) − 2 I(Y_A; Z) = log [det(M_{A∪ω}) / det(M_A)] = log [1 + tr(M_A^{−1/2} S_{A,ω} M_A^{−1/2})].

In the proof of this lemma, (A.13) shows that when we select a new observation in each iteration, we are essentially performing a rank-1 update to the argument fed into the objective.
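The determinant identity underlying this rank-1 update is the matrix determinant lemma, det(M + s sᵀ) = det(M)(1 + sᵀ M⁻¹ s), which the following small numerical check illustrates on a random symmetric positive definite M (synthetic data, not tied to the OED matrices):

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((5, 5))
M = M @ M.T + np.eye(5)                      # symmetric positive definite
s = rng.standard_normal(5)

lhs = np.linalg.det(M + np.outer(s, s))      # determinant after rank-1 update
rhs = np.linalg.det(M) * (1 + s @ np.linalg.solve(M, s))
assert np.isclose(lhs, rhs)                  # matrix determinant lemma
```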
Theorem 6. For n ≤ min(d, m),

∑_{ω∈V\A} [I(P_{A∪ω}^T Y; Z) − I(P_A^T Y; Z)] ≥ (|V\A|/2) log(1 + λ_min((G H̃) Γ_Z^{1/2} M_∅^{−1} Γ_Z^{1/2} (G H̃)^T) / max(diag(Γ_δ))).   (3.3)

Now, combining Theorem 6 and Theorem 5, we have

∑_{ω∈V\A} [I(Y_{A∪ω}; Z) − I(Y_A; Z)] / [I(Y_{A∪V}; Z) − I(Y_A; Z)]   (3.4)

≥ [(|V\A|/2) log(1 + λ_min((G H̃) Γ_Z^{1/2} M_∅^{−1} Γ_Z^{1/2} (G H̃)^T) / max(diag(Γ_δ)))] / [½ (log ∏_{i=1}^{|V\A|} λ_i(Γ_Y) − log ∏_{i=d−|V\A|+1}^{d} λ_i(Γ_{Y|Z}))]   (3.5)

≥ [(|V\A|/2) log(1 + λ_min((G H̃) Γ_Z^{1/2} M_∅^{−1} Γ_Z^{1/2} (G H̃)^T) / max(diag(Γ_δ)))] / [(|V\A|/2) (log λ_1(Γ_Y) − log λ_n(Γ_{Y|Z}))]   (3.6)

so that

γ ≥ log(1 + λ_min((G H̃) Γ_Z^{1/2} M_∅^{−1} Γ_Z^{1/2} (G H̃)^T) / max(diag(Γ_δ))) / (log λ_1(Γ_Y) − log λ_n(Γ_{Y|Z}))  if n ≤ min(d, m), and γ ≥ 0 otherwise.   (3.7)
We see that for the interesting case n > m > d, we obtain only the trivial lower bound for γ. The nontrivial bound applies only when the number of observations is smaller than both the dimension of the goal operator and the dimension of the underlying parameters.

A lower bound for 1 − α (or an upper bound for α) can be obtained in a similar way; 1 − α can also be lower bounded by (3.7), as shown in Appendix D.1 of [7]. Plugging γ ≥ 0 into Theorem 4, we see that we obtain only a trivial bound.
Chapter 4
Algorithms
In this chapter, we compare three algorithms: the greedy algorithm, introduced in the previous chapter; Minorize-Maximize (MM) [28], which is based on iteratively maximizing a modular lower bound of the objective function; and the Generalized Leverage Score (GLS) algorithm, a one-shot method that ranks the observations based on the ℓ2 norms of the rows of the generalized eigenvector matrix of a matrix pencil.
4.1
The greedy algorithm
We present the greedy algorithm here again, with f in Algorithm 1 replaced with mutual information.
Algorithm 2 The Greedy Algorithm
1: input ground set Ω, set function f : 2Ω → R, cardinality k
2: A0 = ∅
3: for i = 1 : k do
4: v* = argmax_{v ∈ A_{i−1}^c} [I(Z; Y_{A_{i−1}∪{v}}) − I(Z; Y_{A_{i−1}})]
5: A_i = A_{i−1} ∪ {v*}
6: end for
7: return Ak
Recall that I(Z; Y_{A∪{v}}) = ½ log [det(P_{A∪{v}}^T Γ_Y P_{A∪{v}}) / det(P_{A∪{v}}^T Γ_δ P_{A∪{v}})]. The main cost of the algorithm comes from computing these determinants. The overall cost for greedy is O(nk^4).
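A direct implementation of Algorithm 2 in the linear Gaussian setting might look as follows (synthetic covariances; the helper `mi` evaluates the mutual information from the joint covariance, and each marginal gain is computed from scratch, which is what produces the O(nk⁴) count above):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, d, k = 5, 12, 2, 4
G = rng.standard_normal((n, m)); H = rng.standard_normal((d, m))
Gy, Gz, Gyz = G @ G.T + 0.1 * np.eye(n), H @ H.T, G @ H.T

def mi(A):
    """I(Z; Y_A) for the linear Gaussian model."""
    if not A:
        return 0.0
    J = np.block([[Gy[np.ix_(A, A)], Gyz[A, :]],
                  [Gyz[A, :].T,      Gz]])
    return 0.5 * (np.linalg.slogdet(Gy[np.ix_(A, A)])[1]
                  + np.linalg.slogdet(Gz)[1] - np.linalg.slogdet(J)[1])

A = []
for _ in range(k):                       # one sweep over the complement per step
    best = max((v for v in range(n) if v not in A), key=lambda v: mi(A + [v]))
    A.append(best)
print(A)                                 # selected indices, in greedy order
```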
4.2
The MM algorithm
The MM algorithm in the continuous setting is referred to as "Majorize-Minimization" (for finding a minimum) or "Minorize-Maximization" (for finding a maximum). The idea, taking majorize-minimization as an example, is that for a non-convex function, in each iteration we construct a convex surrogate function that is globally greater than the objective, and successively minimize the convex surrogate. The MM algorithm presented below is designed on the same principle: if we can find a modular lower bound of the objective, then in each iteration we can simply maximize that lower bound.
f(B) + η ∑_{e∈A} [f(B ∪ e) − f(B)]  ≤  f(A ∪ B)  ≤  f(B) + (1/γ) ∑_{e∈A} [f(B ∪ e) − f(B)],   (4.1)

where the left-hand side is a modular lower bound and the right-hand side is a modular upper bound.
This leads to the following algorithm.

Algorithm 3 The MM algorithm
1: input ground set Ω, set function f : 2Ω → R, cardinality k
2: A_0 = ∅
3: for i = 1 : k do
4: F = Γ_Y(A^c, A^c) − Γ_Y(A^c, A) Γ_Y(A, A)^{−1} Γ_Y(A, A^c)
5: S = Γ_δ(A^c, A^c) − Γ_δ(A^c, A) Γ_δ(A, A)^{−1} Γ_δ(A, A^c)
6: j = argmax_j (diag(log(F)) − diag(log(S)))_j
7: A_i = A_{i−1} ∪ {j}
8: end for
9: return Ak
Hence, the main computational cost comes from computing the matrix logarithms, which is equivalent to computing eigenvalues. The overall computational cost for MM is O(n^3 k). The performance and analysis details of this algorithm can be found in [28].
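A sketch of Algorithm 3 is given below. The helper `logm_diag` (a hypothetical name) computes the diagonal of the matrix logarithm of a symmetric positive definite matrix via an eigendecomposition, and the covariances are synthetic:

```python
import numpy as np

def logm_diag(M):
    """Diagonal of log(M) for symmetric positive definite M."""
    w, V = np.linalg.eigh(M)
    return ((V * np.log(w)) @ V.T).diagonal()   # diag(V log(w) V^T)

def mm_select(Gy, Gd, k):
    """Greedily maximize the modular score diag(log F) - diag(log S)."""
    n = Gy.shape[0]
    A = []
    for _ in range(k):
        Ac = [i for i in range(n) if i not in A]
        if A:
            # Schur complements of Gy and Gd onto the complement A^c
            F = Gy[np.ix_(Ac, Ac)] - Gy[np.ix_(Ac, A)] @ np.linalg.solve(
                    Gy[np.ix_(A, A)], Gy[np.ix_(A, Ac)])
            S = Gd[np.ix_(Ac, Ac)] - Gd[np.ix_(Ac, A)] @ np.linalg.solve(
                    Gd[np.ix_(A, A)], Gd[np.ix_(A, Ac)])
        else:
            F, S = Gy, Gd
        A.append(Ac[int(np.argmax(logm_diag(F) - logm_diag(S)))])
    return A

rng = np.random.default_rng(2)
B = rng.standard_normal((10, 6))
Gd = 0.1 * np.eye(10)                    # uncorrelated noise covariance
Gy = B @ B.T + Gd                        # marginal covariance of Y
print(mm_select(Gy, Gd, 3))
```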
4.3
The generalized leverage score algorithm
We first give the definition of the leverage score for a symmetric matrix A.
Definition 5 (Leverage score for a symmetric matrix [40]). Let V_k ∈ R^{n×k} contain the top k singular vectors of an n × n symmetric matrix A with rank(A) ≥ k. Then the (rank-k) leverage score of the i-th column/row of A is defined as ℓ_i^{(k)} = ‖[V_k]_{i,:}‖_2^2 for i = 1, …, n, where [V_k]_{i,:} denotes the i-th row of V_k.
From the definition, we see that the rank-k leverage scores are essentially the squared ℓ2 norms of the rows of the matrix of top k eigenvectors. Leverage scores have been widely used in randomized numerical linear algebra: for example, sampling the rows of a matrix with probability proportional to their leverage scores leads to the CUR decomposition [16] and thus to fast algorithms for linear regression [36, 2]; another line of research approximates leverage scores without directly computing the singular vectors or eigenvectors of the matrix [17].
For a matrix pencil (A, B), analogous to Definition 5, the weighted generalized leverage score can be defined as follows:
Definition 6 (Weighted generalized leverage score). For A, B ∈ R^{n×n}, let V_k ∈ R^{n×k} contain the top k eigenvectors of the matrix pencil (A, B), with rank(A), rank(B) ≥ k. Let Λ_k ∈ R^{k×k} be a diagonal matrix whose diagonal entries are the corresponding eigenvalues of the pencil. Then the (rank-k) weighted generalized leverage score is defined as ℓ_i^{(k)} = ‖[V_k Λ_k]_{i,:}‖_2^2.
Algorithm 4 The generalized leverage score algorithm
1: input Γ_Y, Γ_δ
2: A = ∅
3: (Λ, Q) = generalized_eig(Γ_Y − Γ_δ, Γ_Y)  % compute the generalized eigenpairs of the matrix pencil (Γ_Y − Γ_δ, Γ_Y)
4: s = row-wise ℓ2 norms of QΛ^{1/2}
5: r = {s_{i_1}, s_{i_2}, …, s_{i_n}}, where s_{i_j} is the j-th largest component of s
6: A = {i_1, i_2, …, i_n}
7: return A
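Algorithm 4 can be sketched in plain numpy by whitening the pencil with a Cholesky factor of Γ_Y (assuming Γ_Y is positive definite), which replaces the dedicated generalized eigensolver of line 3 with a standard symmetric eigendecomposition:

```python
import numpy as np

def gls_rank(Gy, Gd):
    """Rank observations by weighted generalized leverage scores of (Gy - Gd, Gy)."""
    L = np.linalg.cholesky(Gy)
    Linv = np.linalg.inv(L)
    # (Gy - Gd) q = lam * Gy q  <=>  L^{-1} (Gy - Gd) L^{-T} u = lam u, with q = L^{-T} u
    lam, U = np.linalg.eigh(Linv @ (Gy - Gd) @ Linv.T)
    Q = Linv.T @ U                            # generalized eigenvectors
    lam = np.clip(lam, 0.0, None)             # guard tiny negative eigenvalues
    scores = np.linalg.norm(Q * np.sqrt(lam), axis=1)   # row norms of Q Lambda^{1/2}
    return np.argsort(scores)[::-1]           # indices sorted by importance

rng = np.random.default_rng(4)
B = rng.standard_normal((8, 3))
Gd = np.diag(rng.uniform(0.1, 1.0, 8))        # uncorrelated noise
Gy = B @ B.T + Gd
print(gls_rank(Gy, Gd))                       # all 8 indices, most informative first
```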
We give some intuition about why this algorithm works. We consider the matrix pencil

(G Γ_X H^T Γ_Z^{−1} H Γ_X G^T, Γ_Y) = (Γ_Y − Γ_{Y|Z}, Γ_Y) = (Γ_Y − Γ_δ, Γ_Y),

and let (λ_i, q_i) be the eigenpairs of this pencil with normalization q_i^T G Γ_X H^T Γ_Z^{−1} H Γ_X G^T q_i = 1 for i = 1, …, d. Then from Theorem 2.3 in [46], we have

Γ_{Z|Y} = Γ_Z − K K^T,   K K^T = Q̂ Λ Q̂^T,   Q̂ = H Γ_X G^T Q,

where Λ is a d × d diagonal matrix whose diagonal entries are the generalized eigenvalues and Q is an n × d matrix whose columns are the corresponding generalized eigenvectors. Combining the three equations above, we have

Γ_{Z|Y} = Γ_Z − Q̂ Λ Q̂^T = Γ_Z − H Γ_X G^T Q Λ Q^T G Γ_X H^T,

where the second term is K K^T. On the other hand, from (2.45), using Woodbury's identity and expanding all the terms, we have

Γ_{Z|Y} = Γ_Z − H Γ_X G^T Γ_Y^{−1} G Γ_X H^T.
Comparing the two expressions for Γ_{Z|Y} above, we have

H Γ_X G^T (Γ_Y^{−1} − Q Λ Q^T) G Γ_X H^T = 0   (4.2)

by construction. Also, from (2.47), we have

Γ_{Z|Y_P} = Γ_Z − H Γ_X G^T P (P^T Γ_Y P)^{−1} P^T G Γ_X H^T   (4.3)
         ≈ Γ_Z − H Γ_X G^T P P^T Γ_Y^{−1} P P^T G Γ_X H^T.   (4.4)
We consider using the following approximation:

Γ̃_{Z|Y_P} = Γ_Z − H Γ_X G^T P P^T (Q Λ Q^T) P P^T G Γ_X H^T,

and we would like Γ̃_{Z|Y_P} to be as close to Γ_{Z|Y} as possible, which is equivalent to making H Γ_X G^T P P^T Q Λ Q^T P P^T G Γ_X H^T as close to H Γ_X G^T Q Λ Q^T G Γ_X H^T as possible. A natural heuristic is to eliminate the "less important" rows of Q Λ^{1/2} first, where importance is characterized by the row-wise ℓ2 norm. This idea can be seen as follows. Write Q Λ^{1/2} as the matrix with columns √λ_1 q_1, √λ_2 q_2, …, √λ_n q_n, and denote its rows by v_1, v_2, …, v_n. If P drops the first coordinate, then P P^T Q Λ^{1/2} has rows 0, v_2, …, v_n; that is, the action of P P^T on Q Λ^{1/2} is on its rows. Hence, by selecting "important" rows, or equivalently, eliminating "less important" rows, we make P P^T Q Λ Q^T P P^T close to Q Λ Q^T.
Theorem 7 (Optimal approximation of the posterior covariance of the QoI). Let (λ_i, q_i) be the eigenpairs of

(G Γ_X H^T Γ_Z^{−1} H Γ_X G^T, Γ_Y)

with the ordering λ_i ≥ λ_{i+1} > 0 and normalization q_i^T G Γ_X H^T Γ_Z^{−1} H Γ_X G^T q_i = 1, where Γ_Y := Γ_obs + G Γ_X G^T is the covariance matrix of the marginal distribution of Y. Then a minimizer Γ̂_{Z|Y} of the Riemannian metric d_R between Γ_{Z|Y} and an element of M_r^Z is given by

Γ̂_{Z|Y} = Γ_Z − K K^T,   K K^T = ∑_{i=1}^{r} q̂_i q̂_i^T,   q̂_i = H Γ_X G^T q_i √λ_i,

where the corresponding minimum distance is

d_R^2(Γ_{Z|Y}, Γ̂_{Z|Y}) = ½ ∑_{i>r} ln^2(1 − λ_i),

and M_r^Z = {Γ_Z − K K^T ≻ 0 : rank(K) ≤ r}.
Note that the joint covariance of Y and Z is given by the matrix

Σ = [ Γ_Y          G Γ_X H^T
      H Γ_X G^T    Γ_Z        ],

and the joint covariance of Y_P and Z, denoted Σ_P, is

Σ_P = [ P^T Γ_Y P       P^T G Γ_X H^T
        H Γ_X G^T P     Γ_Z            ].
Hence, we consider the matrix pencil (P^T G Γ_X H^T Γ_Z^{−1} H Γ_X G^T P, P^T Γ_Y P) and the corresponding low-rank decomposition with vectors q_i^P = H Γ_X G^T P P^T q_i. Since Γ̂_{Z|Y} is the optimal approximation in the Riemannian metric d_R, we would like q_i^P to be as close as possible to the corresponding unprojected vectors. We therefore examine the norms of the rows of the eigenvector matrix to determine which row can be eliminated with the least change in the eigenvectors; this sense of closeness is captured by the ℓ2 norm.
Then, using this approximation, we have

Γ̃_{Z|Y_P} = Γ_Z − H Γ_X G^T P P^T Q Λ Q^T P P^T G Γ_X H^T.
4.4
Theoretical consideration
Although a rigorous analysis of the performance is lacking at this point, we present a lower bound on I(Z; Y_P) for |Y_P| = k, i.e., the worst-case performance when k observations are selected arbitrarily.
Theorem 8. For any k selected observations,

I(Z; Y_P) ≥ ½ (1 + λ_min(Γ_δ^{−1})) ∑_{i=n−k+1}^{n} diag_i((G Õ) Γ_Z (G Õ)^T),

where diag_i((G Õ) Γ_Z (G Õ)^T) denotes the i-th largest diagonal element of (G Õ) Γ_Z (G Õ)^T.
Chapter 5
Numerical examples
In this chapter, we demonstrate the performance of the three algorithms (greedy, minorize-maximization (MM), and generalized leverage score (GLS)) on a synthetic data set and a real data set from a climate model.
5.1
Synthetic data set
5.1.1
Calibration
For the purpose of calibration, we first run a small-scale example for which we can compute the combinatorial optimum, so that we can compare the performance of the approximate algorithms against it. Below is an example with H ∈ R^{2×10} a matrix whose two rows are randomly drawn unit vectors (drawn from I_10), G ∈ R^{16×10} with exponentially decreasing spectrum, and Γ_X and the noise covariance generated from an exponential kernel with correlation length 0.05. Note that the noise in this case is correlated. Figure 5-1 shows the spectra of the important matrices and pencils used in this example.
Figure 5-1: Spectrum of important matrices and pencils of a calibration example with correlation length 0.05
We see that the pencil (G Γ_X H^T Γ_Z^{−1} H Γ_X G^T, Γ_Y) has only two nontrivial eigenvalues, which aligns with the rank of H. The numerical results of the approximate algorithms and the combinatorial optimum are shown in the plot below.
Figure 5-2: Numerical results of a calibration example with correlation length 0.05
The solid lines are the numerical results obtained in the goal-oriented setting. The dashed lines are obtained by selecting observations according to the non-goal-oriented criterion and evaluating the performance in the goal-oriented setting; that is, we choose the observations using the objective max_A I(X; Y_A) and evaluate the performance of the set A using I(Z; Y_A). We also plot 50 realizations of random selections, denoted by yellow crosses. We see from the plots that, although this is not guaranteed, the performance of goal-oriented design is better than that of selecting observations in the non-goal-oriented setting; the latter performs similarly to random selection. However, the performance of GLS can deviate substantially from the combinatorial optimum in some cases. In the following example, the covariance of the noise is generated from an exponential kernel with correlation length 0.5.
Figure 5-3: Spectrum of important matrices and pencils of a calibration example with correlation length 0.5
Figure 5-4: Numerical results of a calibration example with correlation length 0.5
We see from the plot above that when the cardinality of the set is small, the performance of GLS is similar to that of random selection. The greedy algorithm and MM still perform relatively well in this setting.
5.1.2
Correlated noise
We run an example on a synthetic data set with H ∈ R^{1×100} a randomly drawn unit vector, G ∈ R^{200×100} with exponentially decreasing spectrum, and Γ_X and the noise covariance generated from an exponential kernel with correlation length 0.05. Note that in this case the noise is correlated. The spectra of the important matrices are shown in Figure 5-5 below.
Figure 5-5: Spectrum of importance matrices and pencils with correlated noise
Since the rank of H is 1, the spectra of some matrices and pencils cannot be seen in the plot. We therefore list them here: λ(G H̃) = 1.1131, λ(H) = 1, and the only generalized eigenvalue of the pencil (G Γ_X H^T Γ_Z^{−1} H Γ_X G^T, Γ_Y) is 0.7581. The numerical results are shown in Figure 5-6.
Figure 5-6: Numerical results with correlated noise
From this we see that, even though the noise is correlated, the performance of GLS is much better than random selection.
5.1.3
Uncorrelated noise
Below is an example with H ∈ R^{1×100} a unit vector and G ∈ R^{200×100} with exponentially decreasing spectrum. Γ_X is generated from a squared exponential kernel with correlation length 0.05. The noise is uncorrelated, and the diagonal entries of its covariance are uniformly distributed on [0, 1].
Figure 5-7: Spectrum of importance matrices and pencils with uncorrelated noise
As in the previous part, we list the spectra of the rank-1 matrices and pencils: λ(G H̃) = 1.1907 and λ(H) = 1; the pencil (G Γ_X H^T Γ_Z^{−1} H Γ_X G^T, Γ_Y) again has a single generalized eigenvalue.
Figure 5-8: Numerical results with uncorrelated noise
Comparing Figure 5-6 with Figure 5-8, we see that the performance of selecting observations in the non-goal-oriented setting is better when the noise is uncorrelated. The three algorithms have similar performance when the noise is uncorrelated; however, when the noise is correlated, GLS usually performs worse than the other two.
5.2
Simple E3SM (Energy Exascale Earth System
Model) land model (sELM)
5.2.1
Model description
ELM is part of a DOE-sponsored earth system model development and simulation project aiming to study energy-relevant science. It is based on the Community Land Model (CLM) version 4.5, as described in [39]. Details about ELM can be found in [1]. sELM uses the CLM 4.5 schemes for allocation and respiration (chapter 13), seasonal deciduous phenology (chapter 14), decomposition (chapter 15), and mortality (chapter 17). Other processes such as nutrient cycling, fire, methane, crops, and land use change are not yet included. For photosynthesis, the aggregated canopy model (ACM) is used, as described in several papers that use the data assimilation-linked ecosystem carbon (DALEC) model; see, for example, [44]. The inputs of the model are 47 biogeophysics- and biogeochemical-cycling-related parameters, and the model outputs the gross primary production (GPP) at each of 1642 locations in the United States. 35 of the 47 input parameters, with detailed descriptions, are listed in Table 5.1 below [42, 45].
Table 5.1: Input parameter details for sELM

Parameter  | Description                                | Units          | Minimum     | Maximum
leafcn     | Leaf carbon/nitrogen (C:N) ratio           | gC gN^-1       | 12.5        | 70
frootcn    | Fine root C:N ratio                        | gC gN^-1       | 21          | 63
livewdcn   | Live wood C:N ratio                        | gC gN^-1       | 25          | 75
froot_leaf | Fine root to leaf allocation ratio         | None           | 0.3         | 2.5
flivewd    | Fraction of new wood that is live          | None           | 0.06        | 0.28
lf_flab    | Leaf litter labile fraction                | None           | 0.125       | 0.375
lf_flig    | Leaf litter lignin fraction                | None           | Constrained | Constrained
fr_flab    | Fine root labile fraction                  | None           | 0.125       | 0.375
fr_flig    | Fine root lignin fraction                  | None           | Constrained | Constrained
leaf_long  | Leaf longevity                             | Years          | 1           | 7
br_mr      | Base rate for maintenance respiration (MR) | umol m^-2 s^-1 | 1.26E-06    | 3.75E-06
q10_mr     | Temperature sensitivity for MR             |                |             |
rf_l1s1    | Respiration fraction for litter 1 → SOM1   | None           | 0.2         | 0.58
rf_l2s2    | Respiration fraction for litter 2 → SOM2   | None           | 0.275       | 0.82
rf_l3s3    | Respiration fraction for litter 3 → SOM3   | None           | 0.15        | 0.43
rf_s1s2    | Respiration fraction for SOM1 → SOM2       | None           | 0.14        | 0.42
rf_s2s3    | Respiration fraction for SOM2 → SOM3       | None           | 0.23        | 0.69
rf_s3s4    | Respiration fraction for SOM3 → SOM4       | None           | 0.28        | 0.83
k_l1       | Decay rate for litter pool 1               | d^-1           | 0.9         | 1.8
k_l2       | Decay rate for litter pool 2               | d^-1           | 0.036       | 0.112
k_l3       | Decay rate for litter pool 3               | d^-1           | 0.007       | 0.021
k_s1       | Decay rate for SOM1                        | d^-1           | 0.036       | 0.112
k_s2       | Decay rate for SOM2                        | d^-1           | 0.007       | 0.021
k_s3       | Decay rate for SOM3                        | d^-1           | 0.0007      | 0.0021
k_s4       | Decay rate for SOM4                        | d^-1           | 5.00E-05    | 1.50E-04
k_frag     | Fragmentation rate for coarse wood litter  | d^-1           | 5.00E-04    | 1.50E-03
q10_hr     | Q10 for heterotrophic respiration          | None           | 1.3         | 3.3
r_mort     | Mortality rate                             | yr^-1          | 0.0025      | 0.05
crit_dayl  | Critical day length for senescence         | Seconds        | 35,000      | 45,000
ndays_on   | Number of days for leaf on                 | Days           | 15          | 45
ndays_off  | Number of days for leaf off                | Days           | 7.5         | 22.5
fstor2tran | Fraction of storage transferred            | None           | 0.25        | 0.75
lwtop_ann  | Annual live wood turnover proportion       | yr^-1          | 0.5         | 1
stem_leaf  | New stem allocation C per leaf C           | gC gC^-1       | 0.6         | 5.3
croot_stem | New coarse root allocation C per stem C    | gC gC^-1       | 0.1         | 0.7
sELM is nonlinear, and we linearize it so that we can apply the methods we developed. The details of the linearization are introduced in the next section.
5.2.2
Linearized sELM
The method we use for linearizing the model is similar to that in [19]. We linearize the model using the following steps. We run sELM multiple times with different input parameters, from which we gather a collection of input-output pairs at each location. We perform a linear regression of the output at each grid point on predictors comprised of the input parameters. We then determine the residuals of the regression problem (one for every grid point) and compute their covariance; this covariance operator accounts for the effect of forcing. The statistical model we use is y_i = (GX)_i + ε_i, i = 1, …, #grid points, where y_i is the time-averaged output E_t[y_i] at each grid point, which we can determine. We have as many samples of E_t[y_i] as the cardinality of the ensemble set. Furthermore, we can associate a weight with each ensemble sample by assessing its variance in time, hence the connection to the spatial variability of the forcing; this leads to a weighted regression problem. We then determine the linearized forward operator G, and ε_i is the direct contribution of the forcing at each grid point to the output at that grid point. We determine ε_i as the residual of the regression problem, and subsequently determine the joint covariance of all ε_i, i = 1, …, #grid points.
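The regression step above can be sketched as follows. The ensemble here is synthetic (a random linear model standing in for actual sELM runs), and for brevity the unweighted least-squares fit omits the time-variance weighting described in the text:

```python
import numpy as np

rng = np.random.default_rng(5)
p, n_grid, n_runs = 4, 6, 200
theta = rng.uniform(size=(n_runs, p))            # sampled input parameters
G_true = rng.standard_normal((n_grid, p))        # stand-in for the true sensitivities
Y = theta @ G_true.T + 0.05 * rng.standard_normal((n_runs, n_grid))  # "model runs"

# regress each grid point's (time-averaged) output on the input parameters
coef, *_ = np.linalg.lstsq(theta, Y, rcond=None)
G = coef.T                                       # linearized forward operator, (n_grid, p)
resid = Y - theta @ G.T                          # per-run residuals epsilon_i
Gamma_eps = np.cov(resid, rowvar=False)          # joint covariance of the residuals
```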
5.2.3
Numerical results
For the numerical computation, we normalize the model so that the prior is N(0, I). The goal operator H is chosen to be the average of all 47 parameters, i.e., H = (1/47) ∑_{i=1}^{47} X_i. The spectra of the relevant matrices and pencils are shown in Figure 5-9.

Figure 5-9: Spectrum for the linearized sELM model with H = (1/47) ∑_{i=1}^{47} X_i
We list the singular/eigenvalues of the rank-1 matrices and pencils: λ(G H̃) = 24.8848 and λ(H) = 0.1459; the pencil (G Γ_X H^T Γ_Z^{−1} H Γ_X G^T, Γ_Y) has a single generalized eigenvalue. The numerical results and the selected locations are shown below.

Figure 5-10: Numerical results for the linearized sELM model with H = (1/47) ∑_{i=1}^{47} X_i

Figure 5-11: Selected locations shown on a map with H = (1/47) ∑_{i=1}^{47} X_i
We then run an example with H being X_1 (br_mr), the base rate for maintenance respiration. The spectra are shown in Figure 5-12.

Figure 5-12: Spectrum for the linearized sELM model with H = X_1

Note that λ(G H̃) = 7.6295, λ(H) = 1, and the only generalized eigenvalue of the pencil (G Γ_X H^T Γ_Z^{−1} H Γ_X G^T, Γ_Y) is 0.9426.
Figure 5-13: Numerical results for the linearized sELM model with H = X_1

Figure 5-14: Selected locations shown on a map with H = X_1
Since the noise has a fast-decaying and highly correlated covariance, GLS does not perform well, similar to random selection. The other two, the greedy algorithm and MM, perform better; however, there is a noticeable performance gap between them in both cases.
Chapter 6
Conclusions and future work
In this thesis, we studied goal-oriented optimal Bayesian experimental design. We reviewed the background of the problem and introduced the information-theoretic characterization of the design criterion. Using mutual information, the problem is formulated as a combinatorial optimization problem. We then introduced three algorithms, Greedy, Minorize-Maximize, and Generalized Leverage Score, for finding approximate solutions. The classical analysis using (approximate) submodularity breaks down in the goal-oriented setting because the forward operator of the transformed problem is rank-deficient under the assumption that n > m > d. We studied the computational cost and tested the performance of these algorithms on both synthetic and real data sets. We concluded that although the iterative algorithms have better performance, their computational cost is much larger than that of GLS. The performance of GLS is similar to that of Greedy and MM when the noise is uncorrelated or only slightly correlated; however, when the noise is highly correlated, the performance of GLS can be as poor as random selection. Due to the difficulty of obtaining theoretical guarantees for the GLS algorithm, we instead obtained a lower bound for arbitrary k-element selections.
With that being said, we point out some potential improvements and several future research directions.

• Theoretical guarantees: Although the submodularity-based analysis fails to give nontrivial theoretical guarantees for the goal-oriented problem, other approaches to obtaining a bound might succeed.
• Nonlinear problem: Currently, we consider only a linear forward operator. We may further consider the case where the forward operator or the goal operator is nonlinear.
• Non-Gaussian noise: We can also consider the case where the prior is non-Gaussian or the observations have non-Gaussian noise. Then many of the formulas and equations used in this thesis may no longer be applicable; for example, the mutual information between two non-Gaussian variables does not always have a closed-form expression.
• Optimal low-rank approximation in the Riemannian metric and canonical correlation analysis: Both problems solve a generalized eigenvalue problem with different matrix pencils, as introduced in Section 2.7, and both can be formulated as optimization problems. Hence, it would be very interesting if some intrinsic connections could be drawn between these two problems.
Appendix A
Proofs
To prove Lemma 2, we first introduce the Loewner order on the class of symmetric matrices.

Definition 7 (Loewner order). Let A and B be two symmetric matrices. We say that A ⪰ B if A − B is positive semi-definite. Similarly, we say that A ≻ B if A − B is positive definite.

Then we note that the determinant preserves this order.

Lemma 4 (Determinant preserves the monotonicity [46]). If A ⪰ B ⪰ 0, then det(A) ≥ det(B).
A.1
Proof of Lemma 2
Proof.

I(Y_{A∪v}; Z) − I(Y_A; Z) = ½ log [det Γ_Z / det Γ_{Z|Y_{A∪v}}] − ½ log [det Γ_Z / det Γ_{Z|Y_A}] = ½ log [det Γ_{Z|Y_{A∪v}}^{−1} / det Γ_{Z|Y_A}^{−1}].

It now remains to show that det Γ_{Z|Y_{A∪v}}^{−1} / det Γ_{Z|Y_A}^{−1} ≥ 1, i.e., det Γ_{Z|Y_{A∪v}}^{−1} ≥ det Γ_{Z|Y_A}^{−1}. If we can show that Γ_{Z|Y_{A∪v}}^{−1} ⪰ Γ_{Z|Y_A}^{−1} ⪰ 0, then the statement follows from Lemma 4. Note that

Γ_{Z|Y_A}^{−1} = Γ_Z^{−1} + (G H̃)^T P_A (P_A^T Γ_δ P_A)^{−1} P_A^T (G H̃).

Therefore,

Γ_{Z|Y_{A∪v}}^{−1} − Γ_{Z|Y_A}^{−1} = (G H̃)^T [P_{A∪v} (P_{A∪v}^T Γ_δ P_{A∪v})^{−1} P_{A∪v}^T − P_A (P_A^T Γ_δ P_A)^{−1} P_A^T] (G H̃).

If A = ∅, then I(Y_{A∪v}; Z) = I(Y_v; Z) ≥ I(0; Z) = I(Y_A; Z).
Now assume A ≠ ∅. Without loss of generality, write

P_{A∪v} (P_{A∪v}^T Γ_δ P_{A∪v}) P_{A∪v}^T =
[ A    b    0 … 0
  b^T  c    0 … 0
  0    0    0 … 0 ]   (A.1)

and

P_A (P_A^T Γ_δ P_A) P_A^T =
[ A  0  0 … 0
  0  0  0 … 0
  0  0  0 … 0 ]   (A.2)

Hence, forming the inverses via Schur complements,

P_{A∪v} (P_{A∪v}^T Γ_δ P_{A∪v})^{−1} P_{A∪v}^T =
[ (A − b c^{−1} b^T)^{−1}               −A^{−1} b (c − b^T A^{−1} b)^{−1}   0 … 0
  −(c − b^T A^{−1} b)^{−1} b^T A^{−1}    (c − b^T A^{−1} b)^{−1}            0 … 0
  0                                      0                                  0 … 0 ]   (A.3)

and

P_A (P_A^T Γ_δ P_A)^{−1} P_A^T =
[ A^{−1}  0  0 … 0
  0       0  0 … 0
  0       0  0 … 0 ]   (A.4)

We see that

P_{A∪v} (P_{A∪v}^T Γ_δ P_{A∪v})^{−1} P_{A∪v}^T − P_A (P_A^T Γ_δ P_A)^{−1} P_A^T   (A.5)
= [ (A − b c^{−1} b^T)^{−1} − A^{−1}      −A^{−1} b (c − b^T A^{−1} b)^{−1}   0 … 0
    −(c − b^T A^{−1} b)^{−1} b^T A^{−1}    (c − b^T A^{−1} b)^{−1}            0 … 0
    0                                      0                                  0 … 0 ]   (A.6)

and

(A − b c^{−1} b^T)^{−1} − A^{−1} = A^{−1} + A^{−1} b (c − b^T A^{−1} b)^{−1} b^T A^{−1} − A^{−1}   (A.7)
                                 = A^{−1} b (c − b^T A^{−1} b)^{−1} b^T A^{−1}.   (A.8)

Therefore,

P_{A∪v} (P_{A∪v}^T Γ_δ P_{A∪v})^{−1} P_{A∪v}^T − P_A (P_A^T Γ_δ P_A)^{−1} P_A^T
= [ A^{−1} b (c − b^T A^{−1} b)^{−1} b^T A^{−1}   −A^{−1} b (c − b^T A^{−1} b)^{−1}   0 … 0
    −(c − b^T A^{−1} b)^{−1} b^T A^{−1}            (c − b^T A^{−1} b)^{−1}            0 … 0
    0                                              0                                  0 … 0 ]   (A.9)

Since (c − b^T A^{−1} b)^{−1} ≻ 0 and

A^{−1} b (c − b^T A^{−1} b)^{−1} b^T A^{−1} − A^{−1} b (c − b^T A^{−1} b)^{−1} (c − b^T A^{−1} b) (c − b^T A^{−1} b)^{−1} b^T A^{−1} = 0,

we have P_{A∪v} (P_{A∪v}^T Γ_δ P_{A∪v})^{−1} P_{A∪v}^T − P_A (P_A^T Γ_δ P_A)^{−1} P_A^T ⪰ 0, which indicates Γ_{Z|Y_{A∪v}}^{−1} − Γ_{Z|Y_A}^{−1} ⪰ 0; hence Γ_{Z|Y_{A∪v}}^{−1} ⪰ Γ_{Z|Y_A}^{−1} ≻ 0, and the statement follows.
Remark. An alternative proof applies the data processing inequality: Z → Y_{A∪v} → Y_A is a Markov chain, and therefore I(Y_{A∪v}; Z) ≥ I(Y_A; Z).
To prove Theorem 5, we state two lemmas.
Lemma 5 (Weyl). If A ⪰ B ⪰ 0, then λ_i(A) ≥ λ_i(B). Consequently, G^T A G ⪰ G^T B G ⪰ 0, and λ_i(G^T A G) ≥ λ_i(G^T B G) for any real matrix G of compatible size.

Proof. Let C = A − B; then C ⪰ 0. Using Weyl's inequality, we see that λ_n(C) ≤ λ_i(A) − λ_i(B) ≤ λ_1(C). Since λ_n(C) ≥ 0, λ_i(A) ≥ λ_i(B). Then note that G^T (A − B) G ⪰ 0, and the result follows.
Lemma 6 (Cauchy interlacing). Suppose A ∈ R^{n×n} is symmetric. Let B ∈ R^{m×m} with m < n be a principal submatrix of A (obtained by deleting the i-th row and i-th column for some set of indices i). Suppose A has eigenvalues α_1 ≥ ⋯ ≥ α_n and B has eigenvalues β_1 ≥ ⋯ ≥ β_m. Then α_k ≥ β_k ≥ α_{n−m+k}. In particular, if m = n − 1, we have α_k ≥ β_k ≥ α_{k+1}.
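The m = n − 1 case of Lemma 6 can be illustrated numerically (random symmetric matrix, one row/column deleted):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((6, 6))
A = (A + A.T) / 2                                    # random symmetric matrix
B = np.delete(np.delete(A, 2, axis=0), 2, axis=1)    # principal submatrix, m = n - 1

a = np.sort(np.linalg.eigvalsh(A))[::-1]             # alpha_1 >= ... >= alpha_6
b = np.sort(np.linalg.eigvalsh(B))[::-1]             # beta_1  >= ... >= beta_5
assert all(a[k] >= b[k] >= a[k + 1] for k in range(5))   # interlacing
```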
A.2
Proof of Theorem 5
Proof.

2 I(Y_{A∪V}; Z) − 2 I(Y_A; Z)
= log [det(P_{A∪V}^T Γ_Y P_{A∪V}) det(P_A^T Γ_δ P_A) / (det(P_A^T Γ_Y P_A) det(P_{A∪V}^T Γ_δ P_{A∪V}))]
= log det(P_{A∪V}^T Γ_Y P_{A∪V}) − log det(P_A^T Γ_Y P_A) − [log det(P_{A∪V}^T Γ_δ P_{A∪V}) − log det(P_A^T Γ_δ P_A)].

Writing each determinant as a product of eigenvalues and splitting the products, this equals

log [∏_{i=1}^{|V\A|} λ_i(P_{A∪V}^T Γ_Y P_{A∪V}) ∏_{i=|V\A|+1}^{|A∪V|} λ_i(P_{A∪V}^T Γ_Y P_{A∪V})] − log ∏_{i=1}^{|A|} λ_i(P_A^T Γ_Y P_A)
− [log (∏_{i=1}^{|A|} λ_i(P_{A∪V}^T Γ_δ P_{A∪V}) ∏_{i=|A|+1}^{|A∪V|} λ_i(P_{A∪V}^T Γ_δ P_{A∪V})) − log ∏_{i=1}^{|A|} λ_i(P_A^T Γ_δ P_A)].

By Cauchy interlacing (Lemma 6), ∏_{i=|V\A|+1}^{|A∪V|} λ_i(P_{A∪V}^T Γ_Y P_{A∪V}) ≤ ∏_{i=1}^{|A|} λ_i(P_A^T Γ_Y P_A) and ∏_{i=1}^{|A|} λ_i(P_{A∪V}^T Γ_δ P_{A∪V}) ≥ ∏_{i=1}^{|A|} λ_i(P_A^T Γ_δ P_A), so the expression is

≤ log ∏_{i=1}^{|V\A|} λ_i(P_{A∪V}^T Γ_Y P_{A∪V}) − log ∏_{i=|A|+1}^{|A∪V|} λ_i(P_{A∪V}^T Γ_δ P_{A∪V})   (A.11)

≤ log ∏_{i=1}^{|V\A|} λ_i(Γ_Y) − log ∏_{i=d−|V\A|+1}^{d} λ_i(Γ_{Y|Z}),   (A.12)

where the final step uses Lemma 5 and interlacing once more.
A.3
Proof of Lemma 3
Proof. Recall from the proof of Lemma 2 the block forms (A.1) and (A.2) of P_{A∪ω}(P_{A∪ω}^T Γ_δ P_{A∪ω}) P_{A∪ω}^T and P_A (P_A^T Γ_δ P_A) P_A^T. Then

S_{A,ω} = M_{A∪ω} − M_A
= Γ_Z^{1/2} (G H̃)^T [P_{A∪ω} (P_{A∪ω}^T Γ_δ P_{A∪ω})^{−1} P_{A∪ω}^T − P_A (P_A^T Γ_δ P_A)^{−1} P_A^T] (G H̃) Γ_Z^{1/2}
= Γ_Z^{1/2} (G H̃)^T
  [ A^{−1} b (c − b^T A^{−1} b)^{−1} b^T A^{−1}   −A^{−1} b (c − b^T A^{−1} b)^{−1}   0 … 0
    −(c − b^T A^{−1} b)^{−1} b^T A^{−1}            (c − b^T A^{−1} b)^{−1}            0 … 0
    0                                              0                                  0 … 0 ]
  (G H̃) Γ_Z^{1/2}
= (c − b^T A^{−1} b)^{−1} Γ_Z^{1/2} (G H̃)^T [−A^{−1} b; 1; 0; …; 0] [−b^T A^{−1}, 1, 0, …, 0] (G H̃) Γ_Z^{1/2}.   (A.13)