
10.4 Reduced Rank Gaussian Processes for Optimisation

Basic GP regression scales as O(n^3) due to the cost of inverting the covariance matrix. The inverse, C^{-1}, is required both for training (in calculating the log-likelihood) and for making predictions (mean and variance). Calculating the inverse exactly is acceptable for problems with relatively small n, but as n grows large (e.g. n > 10000) exact inverses become far too expensive. In this section we consider speeding up basic GPO by using an approximation that reduces the O(n^3) cost.

Gaussian process optimisation with expected improvement requires a model that provides a predictive mean and variance at any input point. Therefore, any approximate model we use to speed up optimisation needs to provide a predictive mean and variance. One such model is a reduced rank Gaussian process (RRGP) model [55], as discussed in chapter 5.

To reiterate, an RRGP model of n data points X_n is formed by considering a set of m ≤ n support points X_m. In this thesis, the support points form a subset of X_n, although this is not necessary in general. To simplify, let the support points be the first m points in X_n. That is, X_n = [X_m | x_{m+1} . . . x_n].

Given a training input set X_n and covariance function k(x_i, x_j), we form the matrices K_nm and K_mm, where K_ab is an a × b matrix with the (i, j)th entry defined by k(x_i, x_j). At any test point x we find the predictive mean m(x) and variance v(x) as follows:

m(x) = q(x)^T µ    (10.6)

v(x) = σ^2 + q(x)^T Σ q(x)    (10.7)

where q(x)^T = [k(x)^T | k_{**}]. The scalar k_{**} = k(x, x), and k(x) = [k(x, x_1) . . . k(x, x_m)]^T is the m × 1 vector of covariances between x and the m support points.

The (m + 1) × 1 vector µ is the mean of the posterior distribution over the augmented weights, and is equal to

µ = σ^{-2} Σ [K_nm | k]^T y    (10.8)

where k = [k(x, x_1) . . . k(x, x_n)]^T is the n × 1 vector of covariances between x and the n training inputs.

The (m + 1) × (m + 1) matrix Σ is the covariance matrix of the posterior distribution over the augmented weights, and is equal to

Σ = [ K_mm + σ^{-2} K_nm^T K_nm     k(x) + σ^{-2} K_nm^T k ]^{-1}
    [ k(x)^T + σ^{-2} k^T K_nm      k_{**} + σ^{-2} k^T k  ]          (10.9)

The computational cost of prediction is an initial O(nm^2) to compute the m × m matrix Σ_m = (K_mm + σ^{-2} K_nm^T K_nm)^{-1}, the posterior covariance over the non-augmented weights. Then, each test point has a cost of O(nm), due to the most expensive computation K_nm^T k. Once Σ_m has been calculated, it is best to calculate Σ by inversion by partitioning. Firstly, let r = k(x) + σ^{-2} K_nm^T k. Then Σ_11 = Σ_m + Σ_m r r^T Σ_m / ρ, Σ_12 = −Σ_m r / ρ = Σ_21^T, and Σ_22 = 1/ρ, where ρ = k_{**} + σ^{-2} k^T k − r^T Σ_m r.
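
To make the prediction equations concrete, the following NumPy sketch implements equations (10.6)–(10.9) using inversion by partitioning. It is a minimal illustration rather than the implementation used in this thesis: the squared exponential covariance and the names sqexp_cov, rrgp_fit and rrgp_predict are assumptions introduced here.

```python
import numpy as np

def sqexp_cov(A, B, lengthscale=1.0, signal_var=1.0):
    """Squared exponential covariance between rows of A and rows of B
    (an assumed choice; any covariance function k(xi, xj) would do)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return signal_var * np.exp(-0.5 * d2 / lengthscale**2)

def rrgp_fit(Xn, y, m, noise_var, cov=sqexp_cov):
    """One-off O(n m^2) work: the support set is the first m training inputs."""
    Xm = Xn[:m]
    Knm = cov(Xn, Xm)                      # n x m
    Kmm = cov(Xm, Xm)                      # m x m
    A = Kmm + Knm.T @ Knm / noise_var      # inverse of Sigma_m
    Sigma_m = np.linalg.inv(A)             # reused for every test point
    return dict(Xn=Xn, Xm=Xm, y=y, Knm=Knm, Sigma_m=Sigma_m,
                noise_var=noise_var, cov=cov)

def rrgp_predict(model, x):
    """Predictive mean and variance at a single test point x (O(nm) per point),
    building the augmented posterior via inversion by partitioning."""
    Xn, Xm, y = model["Xn"], model["Xm"], model["y"]
    Knm, Sigma_m, s2 = model["Knm"], model["Sigma_m"], model["noise_var"]
    cov = model["cov"]
    x = np.atleast_2d(x)
    k_n = cov(Xn, x)[:, 0]                 # n-vector of covariances with training inputs
    k_m = cov(Xm, x)[:, 0]                 # m-vector of covariances with support points
    kss = cov(x, x)[0, 0]                  # k_** = k(x, x)

    r = k_m + Knm.T @ k_n / s2             # r = k(x) + sigma^-2 Knm^T k
    rho = kss + k_n @ k_n / s2 - r @ Sigma_m @ r
    Sr = Sigma_m @ r
    # Blocks of the (m+1)x(m+1) augmented posterior covariance Sigma
    S11 = Sigma_m + np.outer(Sr, Sr) / rho
    S12 = -Sr / rho
    Sigma = np.block([[S11, S12[:, None]],
                      [S12[None, :], np.array([[1.0 / rho]])]])

    Phi = np.hstack([Knm, k_n[:, None]])   # [Knm | k], n x (m+1)
    mu = Sigma @ Phi.T @ y / s2            # equation (10.8)
    q = np.append(k_m, kss)                # q(x)^T = [k(x)^T | k_**]
    mean = q @ mu                          # equation (10.6)
    var = s2 + q @ Sigma @ q               # equation (10.7)
    return mean, var
```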

10.4.1 Reduced Rank GP Training

To construct a reduced rank GP model, we require a covariance function, usually parameterised with a set of hyperparameters, θ and σ^2, and a set of support inputs. For training, we wish to use Bayes’ theorem to infer the most likely or most probable hyperparameters and support subset.

Recall that the support subset is equal to the first m training inputs. This means a different support set simply corresponds to a reordering of the training inputs. One such reordering is described as follows. Let z be a vector of indices that reorders the original data X and y to form the training data X_n and y_n. That is, given z = [z_1 . . . z_n]^T, we reorder the original data X = [x_1 . . . x_n], y = [y_1 . . . y_n]^T to form the training data X_n = [x_{z_1} . . . x_{z_n}], y_n = [y_{z_1} . . . y_{z_n}]^T. Note that only the first m elements of z are relevant, and their ordering is arbitrary – these are the elements that define the support subset. The remainder of the training set can be ordered arbitrarily.
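
As a small illustration of this encoding, the snippet below (with hypothetical data and names) draws an index vector z and takes the first m reordered inputs as the support subset.

```python
import numpy as np

# Hypothetical illustration: z reorders the original data so that the first m
# reordered inputs form the support subset (their internal order is irrelevant).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))          # original inputs, n = 100
y = rng.standard_normal(100)               # original targets
m = 10

z = rng.permutation(len(X))                # one candidate: support subset = z[:m]
Xn, yn = X[z], y[z]                        # reordered training data
Xm = Xn[:m]                                # support subset
```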

The marginal likelihood of the hyperparameters given a support set is p(y_n | X_n, z, θ, σ^2) = N(0, Q), where Q = σ^2 I_n + K_nm K_mm^{-1} K_nm^T. The log of this is [55]:

L = -(n/2) log(2π) - (1/2) log|Q| - (1/2) y_n^T Q^{-1} y_n    (10.10)

where we can use the matrix inversion lemma to efficiently calculate

Q^{-1} = σ^{-2} I_n - σ^{-4} K_nm (K_mm + σ^{-2} K_nm^T K_nm)^{-1} K_nm^T    (10.11)

which requires inversion of an m × m matrix, rather than an n × n matrix.
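
The log marginal likelihood and the inversion lemma can be combined as in the sketch below. The helper name rrgp_log_marginal is an assumption, as is the use of the matrix determinant lemma for log|Q| (an equivalent identity not written out in the text); in practice, Cholesky factorisations would be preferred for numerical stability.

```python
import numpy as np

def rrgp_log_marginal(Xn, y, m, noise_var, cov):
    """Log marginal likelihood (10.10), using (10.11) so that only an
    m x m matrix is inverted; a sketch, not a production implementation."""
    n = len(y)
    Xm = Xn[:m]
    Knm = cov(Xn, Xm)
    Kmm = cov(Xm, Xm)
    A = Kmm + Knm.T @ Knm / noise_var                   # m x m

    # Q^{-1} y via (10.11): sigma^-2 y - sigma^-4 Knm A^{-1} Knm^T y
    Qinv_y = y / noise_var - Knm @ np.linalg.solve(A, Knm.T @ y) / noise_var**2

    # log|Q| via the matrix determinant lemma (assumed here):
    # |Q| = sigma^{2n} |A| / |Kmm|, so Q itself is never formed.
    _, logdet_A = np.linalg.slogdet(A)
    _, logdet_K = np.linalg.slogdet(Kmm)
    logdet_Q = logdet_A - logdet_K + n * np.log(noise_var)

    return -0.5 * n * np.log(2 * np.pi) - 0.5 * logdet_Q - 0.5 * y @ Qinv_y
```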

Given a support set, we can maximise L to find the maximum likelihood hyperparameters. If we wish, we can incorporate prior beliefs about the hyperparameters by adding a log prior term to equation (10.10). We can then find the maximum a posteriori (MAP) hyperparameters.

To learn the support set, ideally one would like to find the support set that maximises L. Unfortunately, the number of possible support sets grows combinatorially as the binomial coefficient C(n, m) [55], which is obviously prohibitive for large n (e.g. C(32, 5) = 201376).

Instead, we examine just a fixed number of the possible subsets, and use the best for RRGP predictions.
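
A minimal sketch of this subset search is given below, reusing the hypothetical rrgp_log_marginal helper above. For brevity it scores each candidate at fixed hyperparameters, whereas in the procedure described here the hyperparameters would be re-optimised for each candidate.

```python
import numpy as np

def choose_support_set(X, y, m, noise_var, cov, n_candidates=10, rng=None):
    """Examine a fixed number of random support subsets and keep the best,
    as measured by the log marginal likelihood."""
    rng = rng or np.random.default_rng()
    best_L, best_z = -np.inf, None
    for _ in range(n_candidates):
        z = rng.permutation(len(X))              # support subset = z[:m]
        L = rrgp_log_marginal(X[z], y[z], m, noise_var, cov)
        if L > best_L:
            best_L, best_z = L, z
    return best_z, best_L
```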

10.4.2 Reduced Rank GP Optimisation

Given a support set and hyperparameters, the RRGP model can be queried for a predictive mean and variance at any input point. This prediction can then be used to find the expected improvement EI_rrgp(x), and its maximum, resulting in reduced rank GP optimisation (RRGPO).

Note that RRGPO is the same as basic GPO, except that we use a reduced rank approximation to the full GP. Furthermore, the derivatives of equations (10.6) and (10.7) can be found, enabling the evaluation of ∂EI_rrgp(x)/∂x_i. However, this derivative is complicated, and is not evaluated in the example to follow. Instead, EI_rrgp(x) is maximised by random search.
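
The sketch below shows one way to compute and maximise the expected improvement by random search, reusing the hypothetical rrgp_predict helper above. The EI expression used is the standard closed form for maximisation against the best observed value; the exact EI_rrgp expression used in this thesis may differ in detail.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mean, var, f_best):
    """Standard EI for maximisation from a predictive mean and variance."""
    s = np.sqrt(np.maximum(var, 1e-12))
    u = (mean - f_best) / s
    return s * (u * norm.cdf(u) + norm.pdf(u))

def maximise_ei_random_search(model, f_best, bounds, n_samples=10000, rng=None):
    """Maximise EI_rrgp by random search over a box; bounds = (lower, upper)
    arrays of length D (an assumed search region)."""
    rng = rng or np.random.default_rng()
    lo, hi = bounds
    cand = rng.uniform(lo, hi, size=(n_samples, len(lo)))
    ei = np.array([expected_improvement(*rrgp_predict(model, x), f_best)
                   for x in cand])
    return cand[np.argmax(ei)]
```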

In this example, RRGPO is used to maximise a noiseless 36D hyperelliptical Gaussian f(x) = exp(-(1/2) x^T Σ x), where Σ = diag(λ_1^{-2} . . . λ_36^{-2}), and λ_1 . . . λ_36 are uniformly spaced from 0.3 to 1.0. The optimisation starts from x_0 = (√(log 400)/6) [λ_1 . . . λ_36]^T (such that f(x_0) = 0.05), and proceeds until f(x) ≥ 0.975.
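
For reference, a short sketch of this objective and starting point, assuming the diagonal form Σ = diag(λ_i^{-2}) given above, is:

```python
import numpy as np

D = 36
lam = np.linspace(0.3, 1.0, D)             # lambda_1 ... lambda_36
Sigma_diag = lam ** -2                     # diagonal of Sigma = diag(lambda_i^-2)

def f(x):
    """Noiseless 36D hyperelliptical Gaussian objective."""
    return np.exp(-0.5 * np.sum(Sigma_diag * x ** 2))

x0 = (np.sqrt(np.log(400.0)) / 6.0) * lam  # starting point
assert np.isclose(f(x0), 0.05)             # f(x0) = 0.05, as stated above
```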

At each iteration of the optimisation, an RRGP is built by finding hyperparameters and a support set that maximise the joint posterior density p(θ, σ^2, z | D).

At iteration n, if n ≤ 36, then a full GP model is used (m = n), and the support set is equivalent to the training inputs. In this case, we simply maximise the posterior density over hyperparameters. For n > m, we use a reduced rank of m = 36. In this case, the support set is not equivalent to the training input set. We examined a random selection of 10 support subsets, and the support set that resulted in the maximum posterior probability (after optimising the hyperparameters) was chosen as the support set for predictions. This is suboptimal, but as mentioned above, it is too expensive to examine all possible support sets.
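
A high-level sketch of this per-iteration model selection, reusing the hypothetical rrgp_fit and choose_support_set helpers above (and omitting the hyperparameter optimisation step), might look like:

```python
def build_model(X, y, noise_var, cov, m_max=36, n_candidates=10):
    """Per-iteration model building: use all n points as support while
    n <= m_max, otherwise the best of n_candidates random rank-m_max subsets."""
    n = len(y)
    if n <= m_max:
        # support set = all training inputs (m = n)
        return rrgp_fit(X, y, m=n, noise_var=noise_var, cov=cov)
    z, _ = choose_support_set(X, y, m_max, noise_var, cov, n_candidates)
    return rrgp_fit(X[z], y[z], m=m_max, noise_var=noise_var, cov=cov)
```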

Results are shown in figure 10.5, along with an 18D example with m = 36.

The 18D problem was solved in 69 iterations, and the 36D problem took 147 iterations. Note that this performance was attained even though a maximum of 36 support points were included in the support set for training and prediction, showing that GPO using a reduced rank approximation is feasible, and useful at least for this noiseless toy example.