
Discrete Process Convolution Models

In the previous sections, in an effort to reduce the computational cost of GP regression, we examined approximations to the full GP predictive distribution.

In this section, we consider yet another method of approximation, known as the discrete process convolution model.

A continuous Gaussian process can be constructed by convolving a kernel $h(t)$ with a noise process $w(t)$ that is non-zero at a number of discrete points [24, 10, 29]. The result is a discrete process convolution (DPC) model. For example, if we use the noise process $w(t) = \sum_{j=1}^{m} w_j\, \delta(t - t_j)$, the output is

$$\begin{aligned}
y(t) &= w(t) * h(t) &&(5.28)\\
     &= \sum_{j=1}^{m} w_j\, h(t - t_j) &&(5.29)
\end{aligned}$$

where $t_1 \dots t_m$ are support points and $w_1 \dots w_m$ are the Gaussian distributed weights of the latent noise process. The motivation for building such models is that the computational cost of making a prediction is $\mathcal{O}(m^2 n)$.

Upon inspection, we see that equation (5.29) is a generalised linear model with $h(t - t_1) \dots h(t - t_m)$ as its basis functions. We have a Gaussian prior distribution over the weights, $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, so from the discussion in the previous sections, $y(t)$ is a Gaussian process. If we set the support points equal to $m$ of the training inputs then the DPC model is equivalent to the SR model in section 5.2. Like the SR method, the DPC Gaussian process is a degenerate Gaussian process. To make it non-degenerate, we add an extra weight $w_*$ when making predictions at a test point $t_*$. We redefine the noise process as $w(t) = w_*\, \delta(t - t_*) + \sum_{j=1}^{m} w_j\, \delta(t - t_j)$. The resulting model is then equivalent to a reduced rank Gaussian process model as defined above.
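As an illustration, the sketch below implements the degenerate DPC model of equation (5.29) as a Bayesian generalised linear model in NumPy. The Gaussian choice of smoothing kernel $h$, the support-point locations and all numerical values are illustrative assumptions rather than values taken from the text; the non-degenerate variant would additionally append a basis function centred on each test point, as described above.

```python
import numpy as np

def h(t, centre, width=1.0):
    """Smoothing kernel h(t - t_j); a Gaussian bump is assumed here."""
    return np.exp(-0.5 * ((t - centre) / width) ** 2)

def dpc_predict(t_train, y_train, t_test, support, noise_var=1e-2, width=1.0):
    """Degenerate DPC prediction: y(t) = sum_j w_j h(t - t_j), with w ~ N(0, I).

    Cost is dominated by forming H^T H and the m x m solve,
    i.e. O(m^2 n) for m support points and n training points.
    """
    H = h(t_train[:, None], support[None, :], width)   # n  x m design matrix
    Hs = h(t_test[:, None], support[None, :], width)   # n* x m test basis
    # Posterior over the weights of a Bayesian linear model with unit prior.
    A = H.T @ H / noise_var + np.eye(len(support))     # m x m posterior precision
    w_mean = np.linalg.solve(A, H.T @ y_train) / noise_var
    w_cov = np.linalg.inv(A)
    mean = Hs @ w_mean
    var = np.sum(Hs @ w_cov * Hs, axis=1)   # degenerate: no extra test-point weight
    return mean, var

# Toy usage (all values illustrative).
rng = np.random.default_rng(0)
t_train = np.linspace(-5, 5, 50)
y_train = np.sin(t_train) + 0.1 * rng.standard_normal(50)
support = np.linspace(-5, 5, 10)
mean, var = dpc_predict(t_train, y_train, np.array([0.0, 2.5]), support)
```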

5.7 Summary

A reduced rank Gaussian process can be constructed by augmenting a generalised linear model at test time. That is, before querying a RRGP model with a test input, we add an extra weight and basis function to the model.

Using this augmentation results in a reduced rank Gaussian process that is non-degenerate. Furthermore, we can construct non-stationary Gaussian processes using augmented generalised linear models with basis functions that vary over space.

Reduced rank Gaussian processes exist as a method to reduce the computational cost of Gaussian process regression. There are of course many other methods for doing just this, as outlined in section 1.4 on page 10. However, in this chapter the focus has been on the reduced rank method, mainly because of the result in section 5.4, where it was shown that the full Gaussian process predictive distribution is recovered from a reduced rank model by setting the support set equal to the training input set. Reduced rank models are themselves obtained by augmenting generalised linear models with extra weights at test time. Overall, there is a natural link between generalised linear models, reduced rank Gaussian processes, and full rank Gaussian processes. In fact, one might like to think of a Gaussian process as an augmented generalised linear model with positive definite kernels centred on the training inputs as basis functions.

Chapter 6

Reduced Rank Dependent Gaussian Processes

Chapter 3 showed how to construct sets of dependent Gaussian processes and use them to perform multivariate regression. To do so required the inversion of $n \times n$ matrices, where $n$ was the total number of observations across all outputs. Exact inversion of these matrices has complexity $\mathcal{O}(n^3)$, and becomes prohibitive for large $n$.

The previous chapter showed how to reduce the computational cost of Gaussian process regression by using approximations known as reduced rank Gaussian processes. This chapter extends this approximation to the multiple output case by defining reduced rank dependent Gaussian processes.

6.1 Multiple Output Linear Models

Consider a generalised linear model over two outputs. We have $n_x$ observations of the first output, $\mathcal{D}_x = \{(\mathbf{x}_1, y^x_1) \dots (\mathbf{x}_{n_x}, y^x_{n_x})\}$, and $n_z$ observations of the second, $\mathcal{D}_z = \{(\mathbf{z}_1, y^z_1) \dots (\mathbf{z}_{n_z}, y^z_{n_z})\}$. For notational convenience, we combine the input vectors into matrices, $X = [\mathbf{x}_1 \dots \mathbf{x}_{n_x}]$, $Z = [\mathbf{z}_1 \dots \mathbf{z}_{n_z}]$, and combine the targets into vectors, $\mathbf{y}_x = [y^x_1 \dots y^x_{n_x}]^T$ and $\mathbf{y}_z = [y^z_1 \dots y^z_{n_z}]^T$.

The model consists of a set of $2(m_x + m_z)$ basis functions and $m_x + m_z$ weights, where $m_x \leq n_x$ and $m_z \leq n_z$. The basis functions are divided into two sets:


$2m_x$ of these are centred on the first $m_x$ training inputs in $X$, and the remaining $2m_z$ basis functions are centred on the first $m_z$ training inputs in $Z$. Overall, there are $m_x + m_z$ support points and each is associated with two basis functions, one for each output.

The weights are also divided into two sets: $m_x$ weights are collected into the vector $\mathbf{v}$ and control the contribution of the $m_x$ support points from $X$, and the remaining $m_z$ weights in the vector $\mathbf{w}$ control the contribution from the $m_z$ support points in $Z$.

Consider two test points, $\mathbf{x}_*$ and $\mathbf{z}_*$ (one for each output), and add an extra basis function and weight for each (the extra weights are $v_*$ and $w_*$ respectively). The augmented model evaluated over the training inputs produces $\mathbf{f}_x = [f_x(\mathbf{x}_1) \dots f_x(\mathbf{x}_{n_x})]^T$ and $\mathbf{f}_z = [f_z(\mathbf{z}_1) \dots f_z(\mathbf{z}_{n_z})]^T$ as follows:

$$\begin{bmatrix} \mathbf{f}_x \\ \mathbf{f}_z \end{bmatrix} = \begin{bmatrix} \Phi \;|\; \mathbf{A} \end{bmatrix} \mathbf{u}$$

where the augmented weights vector $\mathbf{u} = \begin{bmatrix} \mathbf{v}^T & \mathbf{w}^T & v_* & w_* \end{bmatrix}^T$, and the remaining components are defined in what follows.

The design matrix $\Phi$ is independent of the test points and is block partitioned as:

$$\Phi = \begin{bmatrix} \Phi_{xx} & \Phi_{xz} \\ \Phi_{zx} & \Phi_{zz} \end{bmatrix}$$

where the partitions are built by evaluating the set of $2(m_x + m_z)$ basis functions over the training inputs, as in equations (6.4) to (6.7) below. The set of basis functions is $\{k_{xx}(\mathbf{x}, \mathbf{x}_1) \dots k_{xx}(\mathbf{x}, \mathbf{x}_{m_x}),\; k_{xz}(\mathbf{x}, \mathbf{z}_1) \dots k_{xz}(\mathbf{x}, \mathbf{z}_{m_z}),\; k_{zx}(\mathbf{z}, \mathbf{x}_1) \dots k_{zx}(\mathbf{z}, \mathbf{x}_{m_x}),\; k_{zz}(\mathbf{z}, \mathbf{z}_1) \dots k_{zz}(\mathbf{z}, \mathbf{z}_{m_z})\}$.

For example,

$$\Phi_{xz} = \begin{bmatrix} k_{xz}(\mathbf{x}_1, \mathbf{z}_1) & \cdots & k_{xz}(\mathbf{x}_1, \mathbf{z}_{m_z}) \\ \vdots & & \vdots \\ k_{xz}(\mathbf{x}_{n_x}, \mathbf{z}_1) & \cdots & k_{xz}(\mathbf{x}_{n_x}, \mathbf{z}_{m_z}) \end{bmatrix}$$

and the remaining blocks $\Phi_{xx}$, $\Phi_{zx}$ and $\Phi_{zz}$ are built in the same way from $k_{xx}$, $k_{zx}$ and $k_{zz}$ respectively.

In general, the design matrix does not have to be built from positive definite functions, so the basis functions can be arbitrary. Nevertheless, in this section we use valid kernel functions $k_{xx}(\cdot,\cdot)$, $k_{zz}(\cdot,\cdot)$ and cross-kernel functions $k_{xz}(\cdot,\cdot)$, $k_{zx}(\cdot,\cdot)$ as basis functions. For example, the kernel functions could be set equal to the covariance and cross-covariance functions derived from Gaussian impulse responses as in appendix A.2.
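To make this concrete, here is a minimal NumPy sketch of one possible choice of basis functions and of the block-partitioned design matrix. The squared-exponential forms below merely stand in for the covariance and cross-covariance functions of appendix A.2 (whose exact parametrisation is not reproduced here), and one-dimensional inputs are assumed.

```python
import numpy as np

def k_auto(a, b, length=1.0):
    """Assumed squared-exponential auto-kernel, standing in for k_xx or k_zz."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def k_cross(a, b, length_a=1.0, length_b=1.5):
    """Assumed cross-kernel (e.g. k_xz): convolving two Gaussian impulse
    responses yields a Gaussian of combined width, motivating this form."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * d ** 2 / (length_a ** 2 + length_b ** 2))

def design_matrix(X, Z, X_sup, Z_sup):
    """Block-partitioned design matrix Phi over the training inputs:
    rows index the n_x + n_z training points, columns the m_x + m_z support points."""
    return np.block([[k_auto(X, X_sup),  k_cross(X, Z_sup)],    # Phi_xx, Phi_xz
                     [k_cross(Z, X_sup), k_auto(Z, Z_sup)]])    # Phi_zx, Phi_zz
```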

The $(n_x + n_z) \times 2$ matrix $\mathbf{A}$ is the contribution to the model of the two extra basis functions centred on the test points $\mathbf{x}_*$ and $\mathbf{z}_*$.
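A sketch of how $\mathbf{A}$ might be assembled from the same (assumed) kernels, with scalar test points; the column order here matches the augmented weight vector $[\mathbf{v}^T\; \mathbf{w}^T\; v_*\; w_*]^T$, which is an assumption about layout rather than something fixed by the text.

```python
import numpy as np

def extra_basis(X, Z, x_star, z_star, k_auto, k_cross):
    """(n_x + n_z) x 2 matrix A: the two extra basis functions, centred on the
    test points, evaluated at every training input. Kernels are passed in."""
    x_s = np.array([x_star])
    z_s = np.array([z_star])
    col_v = np.concatenate([k_auto(X, x_s)[:, 0],     # k_xx(x_i, x_*) on the output-x rows
                            k_cross(Z, x_s)[:, 0]])   # k_zx(z_i, x_*) on the output-z rows
    col_w = np.concatenate([k_cross(X, z_s)[:, 0],    # k_xz(x_i, z_*)
                            k_auto(Z, z_s)[:, 0]])    # k_zz(z_i, z_*)
    return np.column_stack([col_v, col_w])
```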

We define a Gaussian prior distribution over the augmented weights vector:

$$\mathbf{u} \sim \mathcal{N}\!\left(\mathbf{0},\; \begin{bmatrix} \Omega & \mathbf{B} \\ \mathbf{B}^{T} & \Lambda \end{bmatrix}^{-1}\right)$$

where $\Omega$ is formed by evaluating the basis functions at the support points, rather than at all of the training inputs, $\mathbf{B}$ is the $(m_x + m_z) \times 2$ matrix of basis functions evaluated between the support points and the two test points, and $\Lambda$ is defined below. $\Omega$ is block partitioned as:

$$\Omega = \begin{bmatrix} \Omega_{xx} & \Omega_{xz} \\ \Omega_{zx} & \Omega_{zz} \end{bmatrix}$$

The $2 \times 2$ matrix $\Lambda$ has the following elements:

$$\Lambda = \begin{bmatrix} k_{xx}(\mathbf{x}_*, \mathbf{x}_*) & k_{xz}(\mathbf{x}_*, \mathbf{z}_*) \\ k_{zx}(\mathbf{z}_*, \mathbf{x}_*) & k_{zz}(\mathbf{z}_*, \mathbf{z}_*) \end{bmatrix}$$

and is independent of the training data.

The model assumes Gaussian noise on the outputs, where each output has a separate noise variance, $\sigma_x^2$ and $\sigma_z^2$. This assumption allows us to find the likelihood function

$$\mathbf{y} \mid \mathbf{u}, \mathbf{x}_*, \mathbf{z}_*, X, Z \;\sim\; \mathcal{N}(\mathbf{F}\mathbf{u},\, \Psi) \qquad (6.12)$$

where $\mathbf{y}^T = [\mathbf{y}_x^T \; \mathbf{y}_z^T]$ and $\mathbf{F} = [\Phi \,|\, \mathbf{A}]$. The covariance matrix for this Gaussian likelihood function is

$$\Psi = \begin{bmatrix} \sigma_x^2 \mathbf{I}_{n_x} & \mathbf{0} \\ \mathbf{0} & \sigma_z^2 \mathbf{I}_{n_z} \end{bmatrix}$$
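A small sketch of how the likelihood pieces fit together, under the block layout assumed above:

```python
import numpy as np

def likelihood_parts(Phi, A, sigma_x2, sigma_z2, n_x, n_z):
    """Assemble F = [Phi | A] and the block-diagonal noise covariance Psi,
    with one noise variance per output."""
    F = np.hstack([Phi, A])
    Psi = np.diag(np.concatenate([np.full(n_x, sigma_x2),
                                  np.full(n_z, sigma_z2)]))
    return F, Psi
```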

From the likelihood function and the prior distribution over the augmented weights, we can find the posterior distribution over the augmented weights.

To simplify, define the matrix $\mathbf{Q}$ as

$$\mathbf{Q} = \begin{bmatrix} \mathbf{B}^T & \Lambda \end{bmatrix} \qquad (6.17)$$

Then the predictive distribution at the test points $\mathbf{x}_*$ and $\mathbf{z}_*$ is Gaussian, with predictive mean and covariance obtained by applying $\mathbf{Q}$ to the posterior mean and covariance of the augmented weights.
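Since this is standard Bayesian linear-model algebra, the sketch below spells out one reading of the predictive computation. It assumes the prior precision over $\mathbf{u}$ is the kernel matrix over the support and test points (the convention used for the single-output reduced rank models of the previous chapter) and that $\mathbf{Q}\mathbf{u}$ gives the function values at the two test points; it is illustrative, not a transcription of the exact equations.

```python
import numpy as np

def rr_dependent_predict(F, Psi, y, Omega, B, Lam):
    """Reduced rank dependent GP prediction sketch at one test point per output.

    F     : (n_x+n_z) x (m_x+m_z+2) augmented design matrix [Phi | A]
    Psi   : (n_x+n_z) x (n_x+n_z) noise covariance
    y     : stacked targets [y_x; y_z]
    Omega : kernel matrix over the support points
    B     : (m_x+m_z) x 2 kernels between support points and the test points
    Lam   : 2 x 2 kernel matrix at the test points
    Assumes the prior u ~ N(0, inv([[Omega, B], [B^T, Lam]])).
    """
    prior_prec = np.block([[Omega, B], [B.T, Lam]])
    Psi_inv_F = np.linalg.solve(Psi, F)
    post_cov = np.linalg.inv(F.T @ Psi_inv_F + prior_prec)   # posterior covariance of u
    post_mean = post_cov @ (Psi_inv_F.T @ y)                 # posterior mean of u
    Q = np.hstack([B.T, Lam])        # basis functions evaluated at the test points
    mean = Q @ post_mean             # predictive mean of [f(x_*), f(z_*)]
    cov = Q @ post_cov @ Q.T         # predictive covariance of the latent values
    return mean, cov
```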

An illustration of a reduced rank dependent Gaussian process is shown in figure 6.1. This example has two outputs, with $n_1 = n_2 = 10$ and $m_1 = m_2 = 8$, and noise variances $\sigma_1^2 = \sigma_2^2 = 0.025^2$. The model was constructed using Gaussian basis functions equivalent to the auto- and cross-covariance functions derived in appendix A.2. The training data for output 1 are restricted to $x > 0$, but output 2 has training inputs spread across the whole input space.

Note that the model of output 1 gives predictions with low uncertainty at certain points, even though there is no output 1 training data associated with those points. Predictions at these points have low uncertainty because of a dependency on the training data from output 2.