• Aucun résultat trouvé

Scalable stochastic gradient-based inference for gaussian processes

N/A
N/A
Protected

Academic year: 2022

Partager "Scalable stochastic gradient-based inference for gaussian processes"

Copied!
1
0
0

Texte intégral

(1)

Scalable Stochastic Gradient-based Inference for Gaussian Processes

Maurizio Filippone

EURECOM, Sophia Antipolis, France email: [email protected]

Gaussian Process (GP) Regression - Illustration

GP prior

−4 −2 0 2 4

−3

−2

−1 0 1 2 3

input

label

I Grey - Samples from the GP prior

I Green - Radial Basis Function covariance

I Full and “infinite” covariance matrix K M∞,∞

K =

GP regression example

−4 −2 0 2 4

−3

−2

−1 0 1 2 3

input

label

I Data generated from the model

I Marginal distributions of multivari- ate Gaussian are Gaussian

I It is sufficient to look at the values of the covariance at the input locations

K =

Inference result

−4 −2 0 2 4

−3

−2

−1 0 1 2 3

input

label

I Blue - Mean prediction ± 2 Std devs

I Conditional distributions of multi- variate Gaussian are Gaussian

I Predictions can be made by calculat- ing conditional distributions

K =

n

n

Bayesian Inference for GPs

p(par|data) = p(data|par)p(par) Z

p(data|par)p(par)dpar

p(par) p(par|data)

Marginal likelihood

I Marginal likelihood

p(data|par) = Z

p(data|latent)p(latent|par)dlatent

can only be computed if p(data|latent) is Gaussian

I ... even then

log[p(data|par)] = −1

2 log |K| − 1

2yTK−1y + const.

where K = K(par) is generally an n × n dense matrix!

Stochastic Gradient Langevin Dynamics (SGLD) algorithm

I Stochastic gradient ascent optimization with injected noise ηt par0 = par + αt

2 ∇]par log[p(data|par)p(par)] + ηt ηt ∼ N(0, αt) αt → 0

I First phase – αt large – Optimization phase

Injected noise ηt is smaller than the gradient-based update Behavior similar to stochastic gradient ascent

I Second phase – αt small – Langevin dynamics phase

Injected noise ηt dominates gradient-based update

3 Acceptance rate reaches one so no need to accept/reject 3 No need to evaluate p(data|par)

3 We only need stochastic gradients to obtain samples from p(par|data)

Stochastic gradients for GPs

I Marginal likelihood

log[p(data|par)] = −1

2 log |K| − 1

2yTK−1y + const.

I Derivatives wrt par

∂ log[p(data|par)]

∂pari = −1 2Tr

K−1 ∂K

∂pari

+ 1

2yTK−1 ∂K

∂pari K−1y

I Stochastic estimate of the trace Tr

K−1 ∂K

∂pari

= Tr

K−1 ∂K

∂pariE[rrT]

= E

rTK−1 ∂K

∂pari r

with E[rrT] = I - e.g., rj drawn from {−1, 1} with p = 1/2 I Stochastic gradient

− 1 2Nr

Nr

X

i=1

r(i)TK−1 ∂K

∂pari r(i) + 1

2yTK−1 ∂K

∂pari K−1y

I Linear systems only!

Solving linear systems

I Linear systems:

Ks = b

I Can be solved using the Conjugate Gradient algorithm:

s = arg min

x

1

2xTKx − xTb

I Iterative update s = s0 + δ1 + . . . + δT

I Requires only Covariance Matrix Vector Products (CMVPs)! O(n2) time I No need to store K! O(n) space

ULISSE - the Unbiased LInear System SolvEr

I Accelerate the solution of dense linear systems I . . . returning an unbiased estimate of the solution I Full CG solution:

s = s0 + δ1 + . . . + δl + δl+1 . . . + δT I ULISSE:

p1

1 p1

1 p2 p2

+

pl+1

1

+

pl+2

1p2

s0 + 1 + . . . + l

In this work:

pi = exp(−βi)

I Final solution is an unbiased estimate of s!

0 1 2 3 4 5 6

1.02.03.0

log10(κ)

log10(number iterations) CG

ULISSE q=0.1 ULISSE q=1.0

0 1 2 3 4 5 6

−8−6−4−20

log10(κ)

log10(avg rel square norm)

3 Fast computation of stochastic gradients 3 Small relative error wrt exact gradients

rel square norm = kg(θ) − g(θ)˜ k2 kg(θ)k2

Traditional solvers vs ULISSE

Fast CMVPs

0 1 2 3 4 5 6

12345

log10(κ)

log10(number iterations) double

float anchors kdtree

0 1 2 3 4 5 6

−15−10−50

log10(κ)

log10(norm residual)

Preconditioning

0 1 2 3 4 5 6

1234

log10(κ)

log10(number iterations) CG

PCG double PCG float

0 1 2 3 4 5 6

−14−10−6

log10(κ)

log10(norm residual)

CG vs ULISSE with β = 1

0 1 2 3 4 5 6

0.51.52.53.5

log10(κ)

log10(number iterations) CG

ULISSE q=0.1 ULISSE q=1.0

0 1 2 3 4 5 6

−8−6−4−2024

log10(κ)

log10(avg std dev)

CG vs ULISSE with β = 100

0 1 2 3 4 5 6

0.51.52.53.5

log10(κ)

log10(number iterations) CG

ULISSE q=0.1 ULISSE q=1.0

0 1 2 3 4 5 6

−15−10−50

log10(κ)

log10(avg std dev)

Inference Results

Concrete data n ≈ 1K

0 10000 20000 30000

1.02.54.0

Iteration number

log(σ)

0e+00 1e+04 2e+04 3e+04

1.02.54.0

Iteration number

log(σ) PSRF median

PSRF 97.5%

0 10000 20000 30000

−2.9−2.6

Iteration number

log(λ)

0e+00 1e+04 2e+04 3e+04

1.02.54.0

Iteration number

log(λ) PSRF median

PSRF 97.5%

0 10000 20000 30000

−3.5−2.5

Iteration number

log(τ)

0e+00 1e+04 2e+04 3e+04

1.02.54.0

Iteration number

log(τ) PSRF median

PSRF 97.5%

Census data n ≈ 23K

0 5000 10000 15000

0.851.05

Iteration number

log(σ)

0 5000 10000 15000

1.02.54.0

Iteration number

log(σ) PSRF median

PSRF 97.5%

0 5000 10000 15000

−1.80−1.70

Iteration number

log(λ)

0 5000 10000 15000

1.02.54.0

Iteration number

log(λ) PSRF median

PSRF 97.5%

0 5000 10000 15000

−0.95−0.75

Iteration number

log(τ)

0 5000 10000 15000

1.02.54.0

Iteration number

log(τ) PSRF median

PSRF 97.5%

Conclusions

I Novel adaptation of SGLD to infer covariance parameters in Gaussian processes

3 Accurate in characterizing the posterior distribution over covariance parameters 3 Scales with O(n) in space and with O(n2) in time

3 Massively parallelizable

3 Without assuming factorization of the likelihood (mini-batches) 3 Without considering subsets of the data or inducing points

3 Without considering subsets of the spectrum of the covariance 3 Without imposing sparsity on the covariance or its inverse

I Novel linear solver - ULISSE

3 Early stop of iterative linear solver that yields an unbiased solution 3 Can be adopted to accelerate any iterative solver

I Ongoing work

Extension to Gaussian Markov Random Fields and other likelihoods Tuning of a preconditioner in SGLD

Mixed precision calculations within the Conjugate Gradient algorithm

References

[1] M. Filippone and R. Engler, Enabling scalable stochastic gradient-based inference for Gaussian processes by employing the Unbiased LInear System SolvEr (ULISSE). ICML 2015, to appear.

[2] M. Filippone and M. Girolami, Pseudo-marginal Bayesian inference for Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11):2214–2226, 2014.

[3] M. Filippone et al. Probabilistic Prediction of Neurological Disorders with a Statistical Assessment of Neuroimaging Data Modalities. Annals of Applied Statistics, 6(4):1883–1905, 2012.

[4] M. N. Gibbs, Bayesian Gaussian processes for regression and classification. PhD thesis, University of Cambridge, 1997.

[5] M. Welling and Y. W. Teh, Bayesian Learning via Stochastic Gradient Langevin Dynamics. ICML 2011, pp. 681–688. Omnipress, 2011.

1

Références

Documents relatifs

These data indicate that there is at least an additional path- way of embryogenesis gene expression affected in gia3 mutants in response to ABA during seed germination..

For the purposes of determining how the properties of satellite galaxies depend on the host properties, we produce a comparative sample of smaller neighbouring galaxies with

However, the main idea of this paper is that (1) is also the characteristic function of a particular case of Gaussian compound stochastic processes (GCP), and therefore it

The issue is that variational methods minimize the KL divergence between the sparse pos- terior approximation and the exact posterior (Titsias, 2009; Bauer et al., 2016; Hensman et

Meanwhile, workloads acquired at science-gateway level contain detailed information about users, pilot jobs, task sub-steps, bag of tasks and workflow executions.. In this work,

Relating to first international movers, we observed that coher- ence between a favorable pre-internationalization perfor- mance, measured from the aspiration-level performance

Les hypothèses de recherches sont : 1 la formation offerte en vue de l’implantation d’une LDPE en évaluation de la douleur, améliorera les connaissances du personnel soignant sur

Two 2 main and parallel ways have been developed over the years to build a stochastic calculus with respect to Gaussian processes; the Itô integral wrt Brownian motion being at