Scalable stochastic gradient-based inference for gaussian processes

(1)

Scalable Stochastic Gradient-based Inference for Gaussian Processes

Maurizio Filippone

EURECOM, Sophia Antipolis, France email: [email protected]

Gaussian Process (GP) Regression - Illustration

GP prior

−4 −2 0 2 4

−3

−2

−1 0 1 2 3

input

label

I Grey - Samples from the GP prior

I Green - Radial Basis Function covariance

I Full and “infinite” covariance matrix K ∈ M_∞,∞

K =

GP regression example

−4 −2 0 2 4

−3

−2

−1 0 1 2 3

input

label

●

● ●

●

I Data generated from the model

I Marginal distributions of multivari- ate Gaussian are Gaussian

I It is sufficient to look at the values of the covariance at the input locations

K =

Inference result

−4 −2 0 2 4

−3

−2

−1 0 1 2 3

input

label

●

● ●

●

● ●

●

I Blue - Mean prediction ± 2 Std devs

I Conditional distributions of multi- variate Gaussian are Gaussian

I Predictions can be made by calculat- ing conditional distributions

K =

n

Bayesian Inference for GPs

p(par|data) = p(data|par)p(par) Z

p(data|par)p(par)dpar

p(par) p(par|data)

Marginal likelihood

I Marginal likelihood

p(data|par) = Z

p(data|latent)p(latent|par)dlatent

can only be computed if p(data|latent) is Gaussian

I ... even then

log[p(data|par)] = −1

2 log |K| − 1

2y^TK⁻¹y + const.

where K = K(par) is generally an n × n dense matrix!

Stochastic Gradient Langevin Dynamics (SGLD) algorithm

I Stochastic gradient ascent optimization with injected noise η_t par⁰ = par + α_t

2 ∇]^par log[p(data|par)p(par)] + η_t η_t ∼ N(0, α_t) α_t → 0

I First phase – α_t large – Optimization phase

– Injected noise η_t is smaller than the gradient-based update – Behavior similar to stochastic gradient ascent

I Second phase – α_t small – Langevin dynamics phase

– Injected noise η_t dominates gradient-based update

3 Acceptance rate reaches one so no need to accept/reject 3 No need to evaluate p(data|par)

3 We only need stochastic gradients to obtain samples from p(par|data)

Stochastic gradients for GPs

I Marginal likelihood

log[p(data|par)] = −1

2 log |K| − 1

2y^TK⁻¹y + const.

I Derivatives wrt par

∂ log[p(data|par)]

∂par_i = −1 2Tr

K⁻¹ ∂K

∂par_i

+ 1

2y^TK⁻¹ ∂K

∂par_i K⁻¹y

I Stochastic estimate of the trace Tr

K⁻¹ ∂K

∂par_i

= Tr

K⁻¹ ∂K

∂par_iE[rr^T]

= E

r^TK⁻¹ ∂K

∂par_i r

with E[rr^T] = I - e.g., r_j drawn from {−1, 1} with p = 1/2 I Stochastic gradient

− 1 2N_r

Nr

X

i=1

r^(i)TK⁻¹ ∂K

∂par_i r⁽ⁱ⁾ + 1

2y^TK⁻¹ ∂K

∂par_i K⁻¹y

I Linear systems only!

Solving linear systems

I Linear systems:

Ks = b

I Can be solved using the Conjugate Gradient algorithm:

s = arg min

x

1

2x^TKx − x^Tb

I Iterative update s = s₀ + δ₁ + . . . + δ_T

I Requires only Covariance Matrix Vector Products (CMVPs)! O(n²) time I No need to store K! O(n) space

ULISSE - the Unbiased LInear System SolvEr

I Accelerate the solution of dense linear systems I . . . returning an unbiased estimate of the solution I Full CG solution:

s = s₀ + δ₁ + . . . + δ_l + δ_l+1 . . . + δ_T I ULISSE:

p₁

1 p₁

1 p₂ p₂

+

_p^l+1

1

+

_p^l+2

1p₂

s₀ + ₁ + . . . + _l

In this work:

p_i = exp(−βi)

I Final solution is an unbiased estimate of s!

0 1 2 3 4 5 6

1.02.03.0

log₁₀(κ)

log10(number iterations) ^CG

ULISSE q=0.1 ULISSE q=1.0

0 1 2 3 4 5 6

−8−6−4−20

log₁₀(κ)

log10(avg rel square norm)

3 Fast computation of stochastic gradients 3 Small relative error wrt exact gradients

rel square norm = kg(θ) − g(θ)˜ k² kg(θ)k²

Traditional solvers vs ULISSE

Fast CMVPs

0 1 2 3 4 5 6

12345

log₁₀(κ)

log10(number iterations) double

float anchors kdtree

0 1 2 3 4 5 6

−15−10−50

log₁₀(κ)

log10(norm residual)

Preconditioning

0 1 2 3 4 5 6

1234

log₁₀(κ)

log10(number iterations) CG

PCG double PCG float

0 1 2 3 4 5 6

−14−10−6

log₁₀(κ)

log10(norm residual)

CG vs ULISSE with β = 1

0 1 2 3 4 5 6

0.51.52.53.5

log₁₀(κ)

0 1 2 3 4 5 6

−8−6−4−2024

log₁₀(κ)

log10(avg std dev)

CG vs ULISSE with β = 100

0 1 2 3 4 5 6

0.51.52.53.5

log₁₀(κ)

0 1 2 3 4 5 6

−15−10−50

log₁₀(κ)

log10(avg std dev)

Inference Results

Concrete data n ≈ 1K

0 10000 20000 30000

1.02.54.0

Iteration number

log(σ)

0e+00 1e+04 2e+04 3e+04

1.02.54.0

Iteration number

log(σ) PSRF median

PSRF 97.5%

0 10000 20000 30000

−2.9−2.6

Iteration number

log(λ)

0e+00 1e+04 2e+04 3e+04

1.02.54.0

Iteration number

log(λ) PSRF median

PSRF 97.5%

0 10000 20000 30000

−3.5−2.5

Iteration number

log(τ)

0e+00 1e+04 2e+04 3e+04

1.02.54.0

Iteration number

log(τ) PSRF median

PSRF 97.5%

Census data n ≈ 23K

0 5000 10000 15000

0.851.05

Iteration number

log(σ)

0 5000 10000 15000

1.02.54.0

Iteration number

log(σ) PSRF median

PSRF 97.5%

0 5000 10000 15000

−1.80−1.70

Iteration number

log(λ)

0 5000 10000 15000

1.02.54.0

Iteration number

log(λ) PSRF median

PSRF 97.5%

0 5000 10000 15000

−0.95−0.75

Iteration number

log(τ)

0 5000 10000 15000

1.02.54.0

Iteration number

log(τ) PSRF median

PSRF 97.5%

Conclusions

I Novel adaptation of SGLD to infer covariance parameters in Gaussian processes

3 Accurate in characterizing the posterior distribution over covariance parameters 3 Scales with O(n) in space and with O(n²) in time

3 Massively parallelizable

3 Without assuming factorization of the likelihood (mini-batches) 3 Without considering subsets of the data or inducing points

3 Without considering subsets of the spectrum of the covariance 3 Without imposing sparsity on the covariance or its inverse

I Novel linear solver - ULISSE

3 Early stop of iterative linear solver that yields an unbiased solution 3 Can be adopted to accelerate any iterative solver

I Ongoing work

– Extension to Gaussian Markov Random Fields and other likelihoods – Tuning of a preconditioner in SGLD

– Mixed precision calculations within the Conjugate Gradient algorithm

References

[1] M. Filippone and R. Engler, Enabling scalable stochastic gradient-based inference for Gaussian processes by employing the Unbiased LInear System SolvEr (ULISSE). ICML 2015, to appear.

[2] M. Filippone and M. Girolami, Pseudo-marginal Bayesian inference for Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11):2214–2226, 2014.

[3] M. Filippone et al. Probabilistic Prediction of Neurological Disorders with a Statistical Assessment of Neuroimaging Data Modalities. Annals of Applied Statistics, 6(4):1883–1905, 2012.

[4] M. N. Gibbs, Bayesian Gaussian processes for regression and classification. PhD thesis, University of Cambridge, 1997.

[5] M. Welling and Y. W. Teh, Bayesian Learning via Stochastic Gradient Langevin Dynamics. ICML 2011, pp. 681–688. Omnipress, 2011.

1