Scalable Stochastic Gradient-based Inference for Gaussian Processes
Maurizio Filippone
EURECOM, Sophia Antipolis, France email: [email protected]
Gaussian Process (GP) Regression - Illustration
GP prior
−4 −2 0 2 4
−3
−2
−1 0 1 2 3
input
label
I Grey - Samples from the GP prior
I Green - Radial Basis Function covariance
I Full and “infinite” covariance matrix K ∈ M∞,∞
K =
GP regression example
−4 −2 0 2 4
−3
−2
−1 0 1 2 3
input
label
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
I Data generated from the model
I Marginal distributions of multivari- ate Gaussian are Gaussian
I It is sufficient to look at the values of the covariance at the input locations
K =
Inference result
−4 −2 0 2 4
−3
−2
−1 0 1 2 3
input
label
●
● ●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
I Blue - Mean prediction ± 2 Std devs
I Conditional distributions of multi- variate Gaussian are Gaussian
I Predictions can be made by calculat- ing conditional distributions
K =
n
n
Bayesian Inference for GPs
p(par|data) = p(data|par)p(par) Z
p(data|par)p(par)dpar
p(par) p(par|data)
Marginal likelihood
I Marginal likelihood
p(data|par) = Z
p(data|latent)p(latent|par)dlatent
can only be computed if p(data|latent) is Gaussian
I ... even then
log[p(data|par)] = −1
2 log |K| − 1
2yTK−1y + const.
where K = K(par) is generally an n × n dense matrix!
Stochastic Gradient Langevin Dynamics (SGLD) algorithm
I Stochastic gradient ascent optimization with injected noise ηt par0 = par + αt
2 ∇]par log[p(data|par)p(par)] + ηt ηt ∼ N(0, αt) αt → 0
I First phase – αt large – Optimization phase
– Injected noise ηt is smaller than the gradient-based update – Behavior similar to stochastic gradient ascent
I Second phase – αt small – Langevin dynamics phase
– Injected noise ηt dominates gradient-based update
3 Acceptance rate reaches one so no need to accept/reject 3 No need to evaluate p(data|par)
3 We only need stochastic gradients to obtain samples from p(par|data)
Stochastic gradients for GPs
I Marginal likelihood
log[p(data|par)] = −1
2 log |K| − 1
2yTK−1y + const.
I Derivatives wrt par
∂ log[p(data|par)]
∂pari = −1 2Tr
K−1 ∂K
∂pari
+ 1
2yTK−1 ∂K
∂pari K−1y
I Stochastic estimate of the trace Tr
K−1 ∂K
∂pari
= Tr
K−1 ∂K
∂pariE[rrT]
= E
rTK−1 ∂K
∂pari r
with E[rrT] = I - e.g., rj drawn from {−1, 1} with p = 1/2 I Stochastic gradient
− 1 2Nr
Nr
X
i=1
r(i)TK−1 ∂K
∂pari r(i) + 1
2yTK−1 ∂K
∂pari K−1y
I Linear systems only!
Solving linear systems
I Linear systems:
Ks = b
I Can be solved using the Conjugate Gradient algorithm:
s = arg min
x
1
2xTKx − xTb
I Iterative update s = s0 + δ1 + . . . + δT
I Requires only Covariance Matrix Vector Products (CMVPs)! O(n2) time I No need to store K! O(n) space
ULISSE - the Unbiased LInear System SolvEr
I Accelerate the solution of dense linear systems I . . . returning an unbiased estimate of the solution I Full CG solution:
s = s0 + δ1 + . . . + δl + δl+1 . . . + δT I ULISSE:
p1
1 p1
1 p2 p2
+
pl+11
+
pl+21p2
s0 + 1 + . . . + l
In this work:
pi = exp(−βi)
I Final solution is an unbiased estimate of s!
0 1 2 3 4 5 6
1.02.03.0
log10(κ)
log10(number iterations) CG
ULISSE q=0.1 ULISSE q=1.0
0 1 2 3 4 5 6
−8−6−4−20
log10(κ)
log10(avg rel square norm)
3 Fast computation of stochastic gradients 3 Small relative error wrt exact gradients
rel square norm = kg(θ) − g(θ)˜ k2 kg(θ)k2
Traditional solvers vs ULISSE
Fast CMVPs
0 1 2 3 4 5 6
12345
log10(κ)
log10(number iterations) double
float anchors kdtree
0 1 2 3 4 5 6
−15−10−50
log10(κ)
log10(norm residual)
Preconditioning
0 1 2 3 4 5 6
1234
log10(κ)
log10(number iterations) CG
PCG double PCG float
0 1 2 3 4 5 6
−14−10−6
log10(κ)
log10(norm residual)
CG vs ULISSE with β = 1
0 1 2 3 4 5 6
0.51.52.53.5
log10(κ)
log10(number iterations) CG
ULISSE q=0.1 ULISSE q=1.0
0 1 2 3 4 5 6
−8−6−4−2024
log10(κ)
log10(avg std dev)
CG vs ULISSE with β = 100
0 1 2 3 4 5 6
0.51.52.53.5
log10(κ)
log10(number iterations) CG
ULISSE q=0.1 ULISSE q=1.0
0 1 2 3 4 5 6
−15−10−50
log10(κ)
log10(avg std dev)
Inference Results
Concrete data n ≈ 1K
0 10000 20000 30000
1.02.54.0
Iteration number
log(σ)
0e+00 1e+04 2e+04 3e+04
1.02.54.0
Iteration number
log(σ) PSRF median
PSRF 97.5%
0 10000 20000 30000
−2.9−2.6
Iteration number
log(λ)
0e+00 1e+04 2e+04 3e+04
1.02.54.0
Iteration number
log(λ) PSRF median
PSRF 97.5%
0 10000 20000 30000
−3.5−2.5
Iteration number
log(τ)
0e+00 1e+04 2e+04 3e+04
1.02.54.0
Iteration number
log(τ) PSRF median
PSRF 97.5%
Census data n ≈ 23K
0 5000 10000 15000
0.851.05
Iteration number
log(σ)
0 5000 10000 15000
1.02.54.0
Iteration number
log(σ) PSRF median
PSRF 97.5%
0 5000 10000 15000
−1.80−1.70
Iteration number
log(λ)
0 5000 10000 15000
1.02.54.0
Iteration number
log(λ) PSRF median
PSRF 97.5%
0 5000 10000 15000
−0.95−0.75
Iteration number
log(τ)
0 5000 10000 15000
1.02.54.0
Iteration number
log(τ) PSRF median
PSRF 97.5%
Conclusions
I Novel adaptation of SGLD to infer covariance parameters in Gaussian processes
3 Accurate in characterizing the posterior distribution over covariance parameters 3 Scales with O(n) in space and with O(n2) in time
3 Massively parallelizable
3 Without assuming factorization of the likelihood (mini-batches) 3 Without considering subsets of the data or inducing points
3 Without considering subsets of the spectrum of the covariance 3 Without imposing sparsity on the covariance or its inverse
I Novel linear solver - ULISSE
3 Early stop of iterative linear solver that yields an unbiased solution 3 Can be adopted to accelerate any iterative solver
I Ongoing work
– Extension to Gaussian Markov Random Fields and other likelihoods – Tuning of a preconditioner in SGLD
– Mixed precision calculations within the Conjugate Gradient algorithm
References
[1] M. Filippone and R. Engler, Enabling scalable stochastic gradient-based inference for Gaussian processes by employing the Unbiased LInear System SolvEr (ULISSE). ICML 2015, to appear.
[2] M. Filippone and M. Girolami, Pseudo-marginal Bayesian inference for Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11):2214–2226, 2014.
[3] M. Filippone et al. Probabilistic Prediction of Neurological Disorders with a Statistical Assessment of Neuroimaging Data Modalities. Annals of Applied Statistics, 6(4):1883–1905, 2012.
[4] M. N. Gibbs, Bayesian Gaussian processes for regression and classification. PhD thesis, University of Cambridge, 1997.
[5] M. Welling and Y. W. Teh, Bayesian Learning via Stochastic Gradient Langevin Dynamics. ICML 2011, pp. 681–688. Omnipress, 2011.
1