
10.4 Reduced Rank Gaussian Processes for Optimisation

Basic GP regression scales as O(n^3) due to the cost of inverting the covariance matrix. The inverse, C^{-1}, is required both for training (in calculating the log-likelihood) and for making predictions (mean and variance). Calculating the inverse exactly is acceptable for problems with relatively small n, but as n grows large (e.g. n > 10000) exact inverses become far too expensive. In this section we consider speeding up basic GPO by using an approximation that reduces the O(n^3) cost.

Gaussian process optimisation with expected improvement requires a model that provides a predictive mean and variance at any input point. Therefore, any approximate model we use to speed up optimisation needs to provide a predictive mean and variance. One such model is a reduced rank Gaussian process (RRGP) model [55], as discussed in chapter 5.

To reiterate, an RRGP model of n data points X_n is formed by considering a set of m ≤ n support points X_m. In this thesis, the support points form a subset of X_n, although this is not necessary in general. To simplify, let the support points be the first m points in X_n. That is, X_n = [X_m | x_{m+1} . . . x_n].

Given a training input set X_n and covariance function k(x_i, x_j), we form the matrices K_nm and K_mm, where K_ab is an a × b matrix with the (i, j)th entry defined by k(x_i, x_j). At any test point x we find the predictive mean m(x) and variance v(x) as follows:

m(x) = q(x)^T µ    (10.6)

v(x) = σ^2 + q(x)^T Σ q(x)    (10.7)

where q(x)^T = [k(x)^T | k_{**}]. The scalar k_{**} = k(x, x), and k(x) = [k(x, x_1) . . . k(x, x_m)]^T is the m × 1 vector of covariances between x and the m support points.

The (m + 1) × 1 vector µ is the mean of the posterior distribution over the augmented weights, and is equal to

µ = σ^{-2} Σ [K_nm | k]^T y    (10.8)

where k = [k(x, x_1) . . . k(x, x_n)]^T is the n × 1 vector of covariances between x and the n training inputs.

The (m + 1) × (m + 1) matrix Σ is the covariance matrix of the posterior distribution over the augmented weights, and is equal to

Σ = [ K_mm + σ^{-2} K_nm^T K_nm     k(x) + σ^{-2} K_nm^T k ]^{-1}
    [ k(x)^T + σ^{-2} k^T K_nm      k_{**} + σ^{-2} k^T k  ]          (10.9)

The computational cost of prediction is an initial O(nm^2) to compute the m × m matrix Σ_m = (K_mm + σ^{-2} K_nm^T K_nm)^{-1}, the posterior covariance over the non-augmented weights. Then, each test point has a cost of O(nm), due to the most expensive computation K_nm^T k. Once Σ_m has been calculated, it is best to calculate Σ by inversion by partitioning. Firstly, let r = k(x) + σ^{-2} K_nm^T k. Then Σ_11 = Σ_m + Σ_m r r^T Σ_m / ρ, Σ_12 = −Σ_m r / ρ = Σ_21^T, and Σ_22 = 1/ρ, where ρ = k_{**} + σ^{-2} k^T k − r^T Σ_m r.
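
To make the prediction equations concrete, the following NumPy sketch implements equations (10.6)–(10.9) using inversion by partitioning. It is a minimal illustration rather than the implementation used in this thesis: the squared exponential covariance and the names sqexp_cov, rrgp_fit and rrgp_predict are assumptions introduced here.

```python
import numpy as np

def sqexp_cov(A, B, lengthscale=1.0, signal_var=1.0):
    """Squared exponential covariance between rows of A and rows of B
    (an assumed choice; any covariance function k(xi, xj) would do)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return signal_var * np.exp(-0.5 * d2 / lengthscale**2)

def rrgp_fit(Xn, y, m, noise_var, cov=sqexp_cov):
    """One-off O(n m^2) work: the support set is the first m training inputs."""
    Xm = Xn[:m]
    Knm = cov(Xn, Xm)                      # n x m
    Kmm = cov(Xm, Xm)                      # m x m
    A = Kmm + Knm.T @ Knm / noise_var      # inverse of Sigma_m
    Sigma_m = np.linalg.inv(A)             # reused for every test point
    return dict(Xn=Xn, Xm=Xm, y=y, Knm=Knm, Sigma_m=Sigma_m,
                noise_var=noise_var, cov=cov)

def rrgp_predict(model, x):
    """Predictive mean and variance at a single test point x (O(nm) per point),
    building the augmented posterior via inversion by partitioning."""
    Xn, Xm, y = model["Xn"], model["Xm"], model["y"]
    Knm, Sigma_m, s2 = model["Knm"], model["Sigma_m"], model["noise_var"]
    cov = model["cov"]
    x = np.atleast_2d(x)
    k_n = cov(Xn, x)[:, 0]                 # n-vector of covariances with training inputs
    k_m = cov(Xm, x)[:, 0]                 # m-vector of covariances with support points
    kss = cov(x, x)[0, 0]                  # k_** = k(x, x)

    r = k_m + Knm.T @ k_n / s2             # r = k(x) + sigma^-2 Knm^T k
    rho = kss + k_n @ k_n / s2 - r @ Sigma_m @ r
    Sr = Sigma_m @ r
    # Blocks of the (m+1)x(m+1) augmented posterior covariance Sigma
    S11 = Sigma_m + np.outer(Sr, Sr) / rho
    S12 = -Sr / rho
    Sigma = np.block([[S11, S12[:, None]],
                      [S12[None, :], np.array([[1.0 / rho]])]])

    Phi = np.hstack([Knm, k_n[:, None]])   # [Knm | k], n x (m+1)
    mu = Sigma @ Phi.T @ y / s2            # equation (10.8)
    q = np.append(k_m, kss)                # q(x)^T = [k(x)^T | k_**]
    mean = q @ mu                          # equation (10.6)
    var = s2 + q @ Sigma @ q               # equation (10.7)
    return mean, var
```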

10.4.1 Reduced Rank GP Training

To construct a reduced rank GP model, we require a covariance function, usually parameterised with a set of hyperparameters, θ and σ^2, and a set of support inputs. For training, we wish to use Bayes’ theorem to infer the most likely or most probable hyperparameters and support subset.

Recall that the support subset is equal to the first m training inputs. This means a different support set simply corresponds to a reordering of the training inputs. One such reordering is described as follows. Let z be a vector of indices that reorders the original data X and y to form the training data X_n and y_n. That is, given z = [z_1 . . . z_n]^T, we reorder the original data X = [x_1 . . . x_n], y = [y_1 . . . y_n]^T to form the training data X_n = [x_{z_1} . . . x_{z_n}], y_n = [y_{z_1} . . . y_{z_n}]^T. Note that only the first m elements of z are relevant, and their ordering is arbitrary – these are the elements that define the support subset. The remainder of the training set can be ordered arbitrarily.
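
As a small illustration of this encoding, the snippet below (with hypothetical data and names) draws an index vector z and takes the first m reordered inputs as the support subset.

```python
import numpy as np

# Hypothetical illustration: z reorders the original data so that the first m
# reordered inputs form the support subset (their internal order is irrelevant).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))          # original inputs, n = 100
y = rng.standard_normal(100)               # original targets
m = 10

z = rng.permutation(len(X))                # one candidate: support subset = z[:m]
Xn, yn = X[z], y[z]                        # reordered training data
Xm = Xn[:m]                                # support subset
```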

The marginal likelihood of the hyperparameters given a support set is p(y_n | X_n, z, θ, σ^2) = N(0, Q), where Q = σ^2 I_n + K_nm K_mm^{-1} K_nm^T. The log of this is [55]:

L = -(n/2) log(2π) - (1/2) log|Q| - (1/2) y_n^T Q^{-1} y_n    (10.10)

where we can use the matrix inversion lemma to efficiently calculate

Q^{-1} = σ^{-2} I_n - σ^{-4} K_nm (K_mm + σ^{-2} K_nm^T K_nm)^{-1} K_nm^T    (10.11)

which requires inversion of an m × m matrix, rather than an n × n matrix.
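
The log marginal likelihood and the inversion lemma can be combined as in the sketch below. The helper name rrgp_log_marginal is an assumption, as is the use of the matrix determinant lemma for log|Q| (an equivalent identity not written out in the text); in practice, Cholesky factorisations would be preferred for numerical stability.

```python
import numpy as np

def rrgp_log_marginal(Xn, y, m, noise_var, cov):
    """Log marginal likelihood (10.10), using (10.11) so that only an
    m x m matrix is inverted; a sketch, not a production implementation."""
    n = len(y)
    Xm = Xn[:m]
    Knm = cov(Xn, Xm)
    Kmm = cov(Xm, Xm)
    A = Kmm + Knm.T @ Knm / noise_var                   # m x m

    # Q^{-1} y via (10.11): sigma^-2 y - sigma^-4 Knm A^{-1} Knm^T y
    Qinv_y = y / noise_var - Knm @ np.linalg.solve(A, Knm.T @ y) / noise_var**2

    # log|Q| via the matrix determinant lemma (assumed here):
    # |Q| = sigma^{2n} |A| / |Kmm|, so Q itself is never formed.
    _, logdet_A = np.linalg.slogdet(A)
    _, logdet_K = np.linalg.slogdet(Kmm)
    logdet_Q = logdet_A - logdet_K + n * np.log(noise_var)

    return -0.5 * n * np.log(2 * np.pi) - 0.5 * logdet_Q - 0.5 * y @ Qinv_y
```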

Given a support set, we can maximise L to find the maximum likelihood hyperparameters. If we wish, we can incorporate prior beliefs about the hyperparameters by adding a log prior term to equation (10.10). We can then find the maximum a posteriori (MAP) hyperparameters.

To learn the support set, ideally one would like to find the support set that maximises L. Unfortunately, the number of possible support sets grows combinatorially as the binomial coefficient C(n, m) [55], which is obviously prohibitive for large n (e.g. C(32, 5) = 201376).

Instead, we examine just a fixed number of the possible subsets, and use the best for RRGP predictions.
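
A minimal sketch of this subset search is given below, reusing the hypothetical rrgp_log_marginal helper above. For brevity it scores each candidate at fixed hyperparameters, whereas in the procedure described here the hyperparameters would be re-optimised for each candidate.

```python
import numpy as np

def choose_support_set(X, y, m, noise_var, cov, n_candidates=10, rng=None):
    """Examine a fixed number of random support subsets and keep the best,
    as measured by the log marginal likelihood."""
    rng = rng or np.random.default_rng()
    best_L, best_z = -np.inf, None
    for _ in range(n_candidates):
        z = rng.permutation(len(X))              # support subset = z[:m]
        L = rrgp_log_marginal(X[z], y[z], m, noise_var, cov)
        if L > best_L:
            best_L, best_z = L, z
    return best_z, best_L
```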

10.4.2 Reduced Rank GP Optimisation

Given a support set and hyperparameters, the RRGP model can be queried for a predictive mean and variance at any input point. This prediction can then be used to find the expected improvement EI_rrgp(x), and its maximum, resulting in reduced rank GP optimisation (RRGPO).

Note that RRGPO is the same as basic GPO, except that we use a reduced rank approximation to the full GP. Furthermore, the derivatives of equations (10.6) and (10.7) can be found, enabling the evaluation of ∂EI_rrgp(x)/∂x_i. However, this derivative is complicated, and is not evaluated in the example to follow. Instead, EI_rrgp(x) is maximised by random search.
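
The sketch below shows one way to compute and maximise the expected improvement by random search, reusing the hypothetical rrgp_predict helper above. The EI expression used is the standard closed form for maximisation against the best observed value; the exact EI_rrgp expression used in this thesis may differ in detail.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mean, var, f_best):
    """Standard EI for maximisation from a predictive mean and variance."""
    s = np.sqrt(np.maximum(var, 1e-12))
    u = (mean - f_best) / s
    return s * (u * norm.cdf(u) + norm.pdf(u))

def maximise_ei_random_search(model, f_best, bounds, n_samples=10000, rng=None):
    """Maximise EI_rrgp by random search over a box; bounds = (lower, upper)
    arrays of length D (an assumed search region)."""
    rng = rng or np.random.default_rng()
    lo, hi = bounds
    cand = rng.uniform(lo, hi, size=(n_samples, len(lo)))
    ei = np.array([expected_improvement(*rrgp_predict(model, x), f_best)
                   for x in cand])
    return cand[np.argmax(ei)]
```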

In this example, RRGPO is used to maximise a noiseless 36D hyperelliptical Gaussian f(x) = exp(-(1/2) x^T Σ x), where Σ = diag(λ_1^{-2} . . . λ_36^{-2}), and λ_1 . . . λ_36 are uniformly spaced from 0.3 to 1.0. The optimisation starts from x_0 = (√(log 400)/6) [λ_1 . . . λ_36]^T (such that f(x_0) = 0.05), and proceeds until f(x) ≥ 0.975.
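
For reference, a short sketch of this objective and starting point, assuming the diagonal form Σ = diag(λ_i^{-2}) given above, is:

```python
import numpy as np

D = 36
lam = np.linspace(0.3, 1.0, D)             # lambda_1 ... lambda_36
Sigma_diag = lam ** -2                     # diagonal of Sigma = diag(lambda_i^-2)

def f(x):
    """Noiseless 36D hyperelliptical Gaussian objective."""
    return np.exp(-0.5 * np.sum(Sigma_diag * x ** 2))

x0 = (np.sqrt(np.log(400.0)) / 6.0) * lam  # starting point
assert np.isclose(f(x0), 0.05)             # f(x0) = 0.05, as stated above
```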

At each iteration of the optimisation, an RRGP is built by finding hyperparameters and a support set that maximise the joint posterior density p(θ, σ^2, z | D).

At iteration n, if n ≤ 36, then a full GP model is used (m = n), and the support set is equivalent to the training inputs. In this case, we simply maximise the posterior density over hyperparameters. For n > m, we use a reduced rank of m = 36. In this case, the support set is not equivalent to the training input set. We examined a random selection of 10 support subsets, and the support set that resulted in the maximum posterior probability (after optimising the hyperparameters) was chosen as the support set for predictions. This is suboptimal, but as mentioned above, it is too expensive to examine all possible support sets.
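
A high-level sketch of this per-iteration model selection, reusing the hypothetical rrgp_fit and choose_support_set helpers above (and omitting the hyperparameter optimisation step), might look like:

```python
def build_model(X, y, noise_var, cov, m_max=36, n_candidates=10):
    """Per-iteration model building: use all n points as support while
    n <= m_max, otherwise the best of n_candidates random rank-m_max subsets."""
    n = len(y)
    if n <= m_max:
        # support set = all training inputs (m = n)
        return rrgp_fit(X, y, m=n, noise_var=noise_var, cov=cov)
    z, _ = choose_support_set(X, y, m_max, noise_var, cov, n_candidates)
    return rrgp_fit(X[z], y[z], m=m_max, noise_var=noise_var, cov=cov)
```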

Results are shown in figure 10.5, along with an 18D example with m = 36.

The 18D problem was solved in 69 iterations, and the 36D problem took 147 iterations. Note that this performance was attained even though a maximum of 36 support points were included in the support set for training and prediction, showing that GPO using a reduced rank approximation is feasible, and useful at least for this noiseless toy example.