
11.2 Gradient Estimation

Consider the situation where we can query a function f(x) with any input x_i ∈ R^D, and receive back a noisy evaluation y_i = f(x_i) + ε_i, where ε_i is a zero-mean Gaussian random variable with variance σ². By taking a set of samples D = {(x_1, y_1), …, (x_n, y_n)}, we wish to estimate the gradient

\mathbf{g}(\mathbf{x}_0) = \left.\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\right|_{\mathbf{x}=\mathbf{x}_0}

at a point x_0. If sampling from f(x) incurs some cost, then we might wish to minimise the number of samples required to get an acceptable gradient estimate.


11.2.1 Derivative Processes

Consider a Gaussian process over R^D, constructed as the output of a linear filter excited by Gaussian white noise (see section 2.1, page 17). The filter is defined by an impulse response h(x), x ∈ R^D, which is convolved with a noise process u(x) to produce the output Gaussian process y(x) = u(x) ∗ h(x). Now define z_i(x) as the partial derivative of y(x) with respect to x_i. Differentiation is a linear operation, so z_i(x) is also a Gaussian process, which is clear from the following:

z_i(\mathbf{x}) = \frac{\partial y(\mathbf{x})}{\partial x_i} = \frac{\partial}{\partial x_i}\left[u(\mathbf{x}) \ast h(\mathbf{x})\right] = u(\mathbf{x}) \ast \frac{\partial h(\mathbf{x})}{\partial x_i} = u(\mathbf{x}) \ast g_i(\mathbf{x})

where g_i(x) = ∂h(x)/∂x_i is the impulse response of the filter that produces z_i(x) when excited by the noise process u(x). In other words, we generate the partial derivative process z_i(x) by exciting a linear filter g_i(x) with Gaussian white noise u(x).

By extension, we can generate higher-order and mixed partial derivatives of the process y(x) by convolving the same input noise sequence with partial derivatives of the impulse response

\frac{\partial^{n+m} y(\mathbf{x})}{\partial x_i^n\,\partial x_j^m} = u(\mathbf{x}) \ast \frac{\partial^{n+m} h(\mathbf{x})}{\partial x_i^n\,\partial x_j^m} \qquad (11.6)

assuming that all the partial derivatives exist, which is so if the impulse response h(x) is Gaussian.

The derivative processes z_1(x) … z_D(x) are derived from the same noise source as the original process y(x). Therefore, the derivative processes and the original process form a set of dependent Gaussian processes, and we can use the results of chapter 3 for their analysis. In particular, we can find the covariance function between the process and the first-derivative processes. To do so, define the covariance function cov_ij(a, b) as the covariance between z_i(a) and z_j(b), with i, j > 0 and a, b ∈ X (the input space). Furthermore, let the auto-covariance function for the process y(x) be k(a, b). Therefore,

\mathrm{cov}_{ij}(\mathbf{a},\mathbf{b}) = \frac{\partial^2 k(\mathbf{a},\mathbf{b})}{\partial a_i\,\partial b_j} \qquad (11.10)

In a similar fashion, we can show that the covariance function between y(a) and z_j(b) is

c_j(\mathbf{a},\mathbf{b}) = \frac{\partial k(\mathbf{a},\mathbf{b})}{\partial b_j} \qquad (11.11)

Equations (11.10) and (11.11) are the same as those presented by Solak et al. [73], albeit with different notation.
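As a concrete check of equations (11.10) and (11.11), consider the noise-free squared-exponential covariance used below, k(a, b) = exp(v) exp(−½ sᵀAs) with s = a − b and A symmetric. Carrying out the differentiation gives (a sketch of the computation, for reference):

```latex
c_j(\mathbf{a},\mathbf{b}) = \frac{\partial k}{\partial b_j}
  = \exp(v)\,(\mathbf{A}\mathbf{s})_j \exp\!\left(-\tfrac{1}{2}\mathbf{s}^T\mathbf{A}\mathbf{s}\right)

\mathrm{cov}_{ij}(\mathbf{a},\mathbf{b}) = \frac{\partial^2 k}{\partial a_i\,\partial b_j}
  = \exp(v)\left(A_{ij} - (\mathbf{A}\mathbf{s})_i(\mathbf{A}\mathbf{s})_j\right)
    \exp\!\left(-\tfrac{1}{2}\mathbf{s}^T\mathbf{A}\mathbf{s}\right)
```

In particular, at a = b (so s = 0) the gradient components have prior covariance exp(v) A_ij, which gives the matrix B used later in this section.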

11.2.2 Gaussian Process Gradient Estimation

For input vectors x_i and x_j, consider the squared-exponential covariance function as a function of the input separation s = x_i − x_j:

k(\mathbf{s}) = \exp(v)\exp\!\left(-\tfrac{1}{2}\mathbf{s}^T\mathbf{A}\mathbf{s}\right) + \delta_{ij}\exp(2\beta) \qquad (11.12)

where A is a D×D symmetric, positive definite matrix, and the δ_ij exp(2β) term accounts for the observation noise variance.

Given data D and hyperparameters θ = {A, v, β}, define f(x) as the mean prediction of a Gaussian process model

f(\mathbf{x}) = \mathbf{k}^T(\mathbf{x})\,\mathbf{C}^{-1}\mathbf{y} \qquad (11.13)

where kᵀ(x) = [k(x − x_1) … k(x − x_n)] and y = [y_1 … y_n]ᵀ. C is the n×n covariance matrix built from D and k(s).
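Equation (11.13) can be sketched directly in numpy. The function names and hyperparameter values below are illustrative, not from the text; the exp(2β) noise term of equation (11.12) is added only on the diagonal of C:

```python
import numpy as np

def k_se(s, A, v, beta, same_point=False):
    """Squared-exponential covariance of eq. (11.12) for separation s.
    The exp(2*beta) noise term applies only when both inputs coincide."""
    val = np.exp(v) * np.exp(-0.5 * s @ A @ s)
    if same_point:
        val += np.exp(2.0 * beta)
    return val

def gp_mean(x, X, y, A, v, beta):
    """Mean prediction f(x) = k^T(x) C^{-1} y of eq. (11.13)."""
    n = len(X)
    C = np.array([[k_se(X[i] - X[j], A, v, beta, same_point=(i == j))
                   for j in range(n)] for i in range(n)])
    kx = np.array([k_se(x - X[i], A, v, beta) for i in range(n)])
    return kx @ np.linalg.solve(C, y)
```

With small noise (β very negative) the mean prediction at a training input is close to the observed value, since C ≈ K and kᵀ(x_i) is the i-th row of K.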

f(x) can be differentiated to find the model gradient at x_0:

\mathbf{g}(\mathbf{x}_0) = \left.\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\right|_{\mathbf{x}=\mathbf{x}_0} = \mathbf{J}^T\mathbf{C}^{-1}\mathbf{y} \qquad (11.15)

where Jᵀ is the D×n transpose of the Jacobian, which has columns defined by differentiating equation (11.12):

\frac{\partial k(\mathbf{x}_0-\mathbf{x}_i)}{\partial \mathbf{x}_0} = -\exp(v)\exp\!\left(-\tfrac{1}{2}(\mathbf{x}_0-\mathbf{x}_i)^T\mathbf{A}(\mathbf{x}_0-\mathbf{x}_i)\right)\mathbf{A}(\mathbf{x}_0-\mathbf{x}_i) \qquad (11.17)

Differentiation is a linear operation, so g(x_0) is a Gaussian process. In particular, g = g(x_0) is a vector with components g_1 … g_D, with Gaussian statistics defined by a D×D covariance matrix B, where

\mathbf{B} = \exp(v)\,\mathbf{A}

which follows from evaluating equation (11.10) at a = b = x_0 for the covariance function in equation (11.12).

The D×n matrix of the covariance between our n observations and the D gradient components is Jᵀ, which follows from applying equation (11.11) to the covariance function in equation (11.12); its columns are given by equation (11.17).

The Gaussian process prior distribution over y and g is

\begin{bmatrix}\mathbf{y}\\ \mathbf{g}\end{bmatrix} \sim \mathcal{N}\left(\mathbf{0},\ \begin{bmatrix}\mathbf{C} & \mathbf{J}\\ \mathbf{J}^T & \mathbf{B}\end{bmatrix}\right) \qquad (11.23)

Using standard formulae [61], we condition on D to find the distribution

\mathbf{g}\,|\,\mathcal{D} \sim \mathcal{N}\left(\mathbf{J}^T\mathbf{C}^{-1}\mathbf{y},\ \mathbf{B}-\mathbf{J}^T\mathbf{C}^{-1}\mathbf{J}\right) \qquad (11.24)

where the mean is equal to that in equation (11.15).

In summary, given data D, we can estimate the gradient g at x_0 by the mean Jᵀ C⁻¹ y, with uncertainty described by the covariance matrix B − Jᵀ C⁻¹ J.
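The gradient estimate of equation (11.24) can be sketched as follows (a minimal numpy implementation; function names and the test hyperparameters are illustrative, and B = exp(v) A is the prior gradient covariance implied by differentiating the squared-exponential kernel):

```python
import numpy as np

def gp_gradient(x0, X, y, A, v, beta):
    """Posterior gradient estimate of eq. (11.24):
    mean J^T C^{-1} y, covariance B - J^T C^{-1} J."""
    n, D = X.shape

    def k(s):  # noise-free part of eq. (11.12)
        return np.exp(v) * np.exp(-0.5 * s @ A @ s)

    C = np.array([[k(X[i] - X[j]) for j in range(n)] for i in range(n)])
    C += np.exp(2.0 * beta) * np.eye(n)      # observation noise on the diagonal
    # Rows of J from eq. (11.17): derivative of k(x0 - xi) w.r.t. x0
    J = np.stack([-k(x0 - X[i]) * (A @ (x0 - X[i])) for i in range(n)])  # n x D
    B = np.exp(v) * A                        # prior covariance of the gradient
    mean = J.T @ np.linalg.solve(C, y)
    cov = B - J.T @ np.linalg.solve(C, J)
    return mean, cov
```

For near-noiseless samples of a linear function, the estimated gradient approaches the true slope, and the posterior covariance is smaller than the prior B.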

11.2.3 Sample Minimisation

When we estimate g using a Gaussian process model, equation (11.24) tells us that there is uncertainty in this estimate. Intuitively, we desire a gradient estimate with low uncertainty. This section describes a method to sequentially select sample points so as to systematically reduce the uncertainty in g. One way to reduce the uncertainty in the gradient estimate is to take a new sample x placed so as to minimise the entropy of p(g | D, x). We call this entropy S and derive it as follows, noting that the entropy of a multivariate Gaussian probability density with covariance matrix Ψ is [67]:

S = \frac{D}{2}\log(2\pi e) + \frac{1}{2}\log|\boldsymbol{\Psi}| \qquad (11.25)

The gradient g, conditional on sampling at x, is

\mathbf{g}\,|\,\mathcal{D},\mathbf{x} \sim \mathcal{N}\left(\mathbf{J}^T\mathbf{C}^{-1}\mathbf{y},\ \mathbf{F}\right) \qquad (11.26)

where F is derived as follows.

Firstly, define Σ as the covariance matrix for the multivariate Gaussian joint probability density function p(y_1, …, y_n, y, g_1, …, g_D), which is the same as the covariance matrix in equation (11.23) augmented by covariance entries for the test point x, namely k = k(x) and κ = cov(y, y) = exp(v) + exp(2β). Define a block partitioning of Σ as

\boldsymbol{\Sigma} = \begin{bmatrix}\boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12}\\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22}\end{bmatrix}

where Σ_11 is the (n+1)×(n+1) covariance of the observations (y_1, …, y_n, y), Σ_22 = B, and Σ_21 = Σ_12ᵀ is the D×(n+1) cross-covariance between the gradient components and the observations. The covariance matrix F to fit into equation (11.26) is then given by the standard Gaussian conditioning formula

\mathbf{F} = \mathbf{B} - \boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}

The entropy given x is therefore

S = \frac{D}{2}\log(2\pi e) + \frac{1}{2}\log|\mathbf{F}| \qquad (11.27)

We can now attempt to find an x that minimises S. Sampling at such a point is expected to maximally decrease the entropy of the gradient distribution. Hence we will maximally increase our "certainty" about the model gradient at x_0. If the model is accurate, then we expect the gradient g to be a good approximation of the problem gradient g(x_0).

A simple illustrative example is shown in figure 11.1. The top panel shows a Gaussian process model of 4 samples, taken from a function with additive Gaussian noise. The bottom panel shows the conditional entropy of the estimate of the gradient at x = 0, as a function of the next sample point x. The entropy has two major local minima, either side of x = 0, indicating the best places to take the next sample. These minima are neither too close nor too far from the point at which the gradient is to be estimated. This makes sense, as a sample that is too close cannot offer much information about the gradient because it is easily swamped by noise. A point that is too far away 'loses sight' of the local trend, and cannot contribute either. The most valuable point for contributing to the gradient estimate lies somewhere in between the two extremes, and ultimately depends on the model prior distributions and training data.
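The selection criterion can be sketched numerically: for each candidate next sample x, build the augmented covariance Σ, condition to obtain F, and evaluate the entropy of equation (11.27) up to its additive constant. The function names, hyperparameter values, and candidate grid below are illustrative, not from the text:

```python
import numpy as np

def gradient_entropies(x0, X, candidates, A, v, beta):
    """Entropy of the gradient estimate at x0 (up to the constant
    D/2 log(2*pi*e)) for each candidate next sample, eq. (11.27)."""
    n, D = X.shape

    def k(s):  # noise-free part of eq. (11.12)
        return np.exp(v) * np.exp(-0.5 * s @ A @ s)

    def cross(a):  # cov(y(a), g(x0)) from eq. (11.11): k(s) * A s, s = a - x0
        s = a - x0
        return k(s) * (A @ s)

    B = np.exp(v) * A
    ents = []
    for x in candidates:
        pts = np.vstack([X, x[None, :]])           # n + 1 input points
        S11 = np.array([[k(p - q) for q in pts] for p in pts])
        S11 += np.exp(2.0 * beta) * np.eye(n + 1)  # noise on every observation
        S21 = np.stack([cross(p) for p in pts]).T  # D x (n+1) cross-covariance
        F = B - S21 @ np.linalg.solve(S11, S21.T)
        sign, logdet = np.linalg.slogdet(F)
        ents.append(0.5 * logdet)
    return np.array(ents)
```

A candidate exactly at x_0 has zero direct covariance with the gradient, and very distant candidates decorrelate from it, so the minimum-entropy candidate lies at an intermediate separation, matching the behaviour described above.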

Figure 11.1: (Top) Gaussian process model (solid line) of 4 samples (dots) from a noisy function. The 95% confidence interval is shaded. (Bottom) Conditional entropy of the model's estimate of the gradient at x = 0, as a function of the next sample point.

11.2.4 Gradient of Entropy

Note that we can differentiate (11.27) with respect to the jth component of x to aid in the minimisation:

\frac{\partial S}{\partial x_j} = \frac{1}{2}\,\mathrm{tr}\!\left(\mathbf{F}^{-1}\frac{\partial \mathbf{F}}{\partial x_j}\right)

To further reduce this, observe that B does not depend on x, so

\frac{\partial \mathbf{F}}{\partial x_j} = -\frac{\partial \boldsymbol{\Sigma}_{21}}{\partial x_j}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12} - \boldsymbol{\Sigma}_{21}\frac{\partial \boldsymbol{\Sigma}_{11}^{-1}}{\partial x_j}\boldsymbol{\Sigma}_{12} - \boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\frac{\partial \boldsymbol{\Sigma}_{12}}{\partial x_j}

where only the entries of Σ_11 and Σ_21 involving the test point x have non-zero derivatives.

We can use algorithm 11.1 to estimate the gradient g(x_0), using n samples from f(x), by modelling f(x) with a Gaussian process model parameterised by θ. We specify a prior density p(θ) which represents our prior beliefs about θ. For example, the covariance function in equation (11.12) has hyperparameters θ = {v, β, α_1, …, α_D} with A = diag(α_1, …, α_D). In this case we might specify a Gaussian prior distribution for θ.

Figure 11.2 shows the results of estimating the gradient at a point for a 6-dimensional hyperelliptical Gaussian surface, using the gradient estimation algorithm (algorithm 11.1). Samples of the surface were corrupted by Gaussian noise with σ = 0.02, and the surface range was bounded to [0, 1].