Deep Gaussian processes

(1)

Deep Gaussian Processes

Maurizio Filippone

EURECOM, Sophia Antipolis, France

August 31st, 2018

(2)

1 Introduction

2 Inference for Deep Gaussian Processes

3 Convolutional Deep Gaussian Processes

4 Conclusions


(3)

Introduction

(4)

Gaussian Processes - Priors over Functions

• Infinite Gaussian random variables with parametric and input-dependent covariance

Rasmussen and Williams, 2006


(5)

Gaussian Processes as Infinitely-Wide Shallow Neural Nets

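• Take $W^{(i)} \sim \mathcal{N}(0, \alpha_i I)$

• Central Limit Theorem implies that F is Gaussian

[Diagram: shallow network mapping X through weights W⁽⁰⁾ and activations Φ, then weights W⁽¹⁾, to F]

• F has zero mean

• $\mathrm{cov}(F) = \mathbb{E}_{p(W^{(0)}, W^{(1)})}\left[\Phi(X W^{(0)})\, W^{(1)} W^{(1)\top}\, \Phi(X W^{(0)})^\top\right]$

Neal, LNS, 1996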

(6)

Gaussian Processes as Infinitely-Wide Shallow Neural Nets

• Take $W^{(i)} \sim \mathcal{N}(0, \alpha_i I)$

• Central Limit Theorem implies that F is Gaussian

[Diagram: shallow network mapping X through weights W⁽⁰⁾ and activations Φ, then weights W⁽¹⁾, to F]

• F has zero mean

• $\mathrm{cov}(F) = \alpha_1\, \mathbb{E}_{p(W^{(0)})}\left[\Phi(X W^{(0)})\, \Phi(X W^{(0)})^\top\right]$

• Some choices of Φ lead to analytic expressions of known kernels (RBF, Matérn, arc-cosine, Brownian motion, ...)

Neal, LNS, 1996
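As a rough numerical illustration of the covariance on this slide, the sketch below estimates $\alpha_1\, \mathbb{E}_{p(W^{(0)})}[\Phi(X W^{(0)}) \Phi(X W^{(0)})^\top]$ by Monte Carlo over W⁽⁰⁾. The function name `nn_prior_cov_mc`, the choice Φ = tanh, and the values of `n_hidden` and `n_samples` are assumptions made for this example; they do not come from the slides.

```python
import numpy as np


def nn_prior_cov_mc(X, alpha0=1.0, alpha1=1.0, n_hidden=20, n_samples=500, seed=0):
    """Monte Carlo estimate of cov(F) = alpha1 * E_{W0}[Phi(X W0) Phi(X W0)^T].

    Assumes a single-output network F = Phi(X W0) W1 with
    W0 ~ N(0, alpha0 I), W1 ~ N(0, alpha1 I) and Phi = tanh (illustrative choice).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    K = np.zeros((n, n))
    for _ in range(n_samples):
        W0 = rng.normal(0.0, np.sqrt(alpha0), size=(d, n_hidden))
        Phi = np.tanh(X @ W0)        # hidden activations, shape (n, n_hidden)
        K += alpha1 * (Phi @ Phi.T)  # W1 has been integrated out analytically
    return K / n_samples


X_demo = np.linspace(-2.0, 2.0, 5).reshape(-1, 1)
K_demo = nn_prior_cov_mc(X_demo)
```

The resulting matrix is symmetric and positive semi-definite by construction, so it can be used as a GP covariance evaluated at the inputs.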

(7)

Gaussian Processes - Priors over Functions

[Figure, built up over three slides: plotted points and the n × n covariance matrix K]

(10)

Gaussian Processes - Regression example

• Inputs = X, Labels = Y

• Introduce latent variables F with covariance K = K(X, θ)

• Introduce Gaussian likelihood p(Y | F)

[Figure: GP regression example]

• Posterior

$$p(F \mid Y, X, \theta) = \frac{p(Y \mid F)\, p(F \mid X, \theta)}{\int p(Y \mid F)\, p(F \mid X, \theta)\, dF}$$

Rasmussen and Williams, 2006

(11)

Gaussian Processes - Regression example

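• Predictive distribution

$$p(F_* \mid Y, X, \theta) = \int p(F_* \mid F, \theta)\, p(F \mid Y, X, \theta)\, dF$$

[Figure: GP regression example, x against y, both axes from −3 to 3]

• Posterior

$$p(F \mid Y, X, \theta) = \frac{p(Y \mid F)\, p(F \mid X, \theta)}{\int p(Y \mid F)\, p(F \mid X, \theta)\, dF}$$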
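With a Gaussian likelihood, the posterior and the predictive distribution are available in closed form. The following is a minimal sketch under assumptions not stated on the slides: an RBF kernel, a fixed noise variance, and the helper names `rbf_kernel` and `gp_predict`.

```python
import numpy as np


def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """RBF covariance between the rows of A and the rows of B (assumed kernel)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)


def gp_predict(X, y, X_star, noise_var=0.1, lengthscale=1.0, variance=1.0):
    """Mean and covariance of p(F_* | Y, X, theta) for a Gaussian likelihood."""
    K_y = rbf_kernel(X, X, lengthscale, variance) + noise_var * np.eye(len(X))
    K_s = rbf_kernel(X, X_star, lengthscale, variance)
    K_ss = rbf_kernel(X_star, X_star, lengthscale, variance)
    L = np.linalg.cholesky(K_y)                          # K_y = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K_y^{-1} y
    V = np.linalg.solve(L, K_s)
    return K_s.T @ alpha, K_ss - V.T @ V


X_demo = np.linspace(-3.0, 3.0, 8).reshape(-1, 1)
y_demo = np.sin(X_demo).ravel()
mean_demo, cov_demo = gp_predict(X_demo, y_demo, X_demo, noise_var=1e-6)
```

With a very small noise variance, the predictive mean at the training inputs nearly interpolates the labels, which is a convenient sanity check.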

(12)

Gaussian Processes - Classification example

• Inputs = X, Labels = Y

• Introduce latent variables F with covariance K = K(X, θ)

• Introduce Bernoulli likelihood p(Y | F)

[Figure: GP classification example, vertical axis from 0 to 1]

• Posterior

$$p(F \mid Y, X, \theta) = \frac{p(Y \mid F)\, p(F \mid X, \theta)}{\int p(Y \mid F)\, p(F \mid X, \theta)\, dF}$$

(13)

Gaussian Processes - Classification example

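• Predictive distribution - needs an approximation to p(F | Y, X, θ)!

$$p(F_* \mid Y, X, \theta) = \int p(F_* \mid F, \theta)\, p(F \mid Y, X, \theta)\, dF$$

[Figure: GP classification example, vertical axis from 0 to 1]

• Posterior

$$p(F \mid Y, X, \theta) = \frac{p(Y \mid F)\, p(F \mid X, \theta)}{\int p(Y \mid F)\, p(F \mid X, \theta)\, dF}$$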

(14)

Challenges and Limitations

• Kernel design

• p(Y | X , θ) might be expensive to compute (factorize K )

• p(Y | X , θ) might not even be computable!

[Graphical model with nodes X, θ, F and Y]

• Marginal likelihood

$$p(Y \mid X, \theta) = \int p(Y \mid F)\, p(F \mid X, \theta)\, dF$$
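For a Gaussian likelihood the marginal likelihood integral has a closed form, and its cost is dominated by factorizing the n × n matrix. The sketch below assumes an RBF kernel and noise variance `noise_var`; the function names are hypothetical and chosen only for illustration.

```python
import numpy as np


def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """RBF covariance between the rows of A and the rows of B (assumed kernel)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)


def gp_log_marginal_likelihood(X, y, noise_var=0.1, lengthscale=1.0, variance=1.0):
    """log p(Y | X, theta) for a Gaussian likelihood, via a Cholesky factorization."""
    n = len(y)
    K_y = rbf_kernel(X, X, lengthscale, variance) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K_y)                          # O(n^3) time, O(n^2) memory
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K_y^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))                 # equals -0.5 * log|K_y|
            - 0.5 * n * np.log(2.0 * np.pi))


lml_single = gp_log_marginal_likelihood(np.array([[0.0]]), np.array([0.5]), noise_var=0.5)
```

For non-Gaussian likelihoods (for example the Bernoulli case above) no such closed form exists, which is the "might not even be computable" case on the slide.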

(15)

Deep Gaussian Processes for Large Representational Power

• Bypassing kernel design through composition of processes

(f ◦ g)(x)??

Neal, LNS, 1996 – Damianou and Lawrence, AISTATS, 2013

(16)

Deep Gaussian Processes for Large Representational Power

• Composition of stationary processes yields something very complex

[Graphical model with nodes X, F⁽¹⁾, F⁽²⁾ and Y, and kernel parameters θ⁽¹⁾ and θ⁽²⁾]

Neal, LNS, 1996 – Damianou and Lawrence, AISTATS, 2013 – Duvenaud et al., AISTATS, 2014

(17)

Pathologies of Deep Gaussian Processes

• Deep is not necessarily good!

• Example

[Figure panels: 1 Layer, 2 Layers, 5 Layers, 10 Layers]

[Graphical model with nodes X, F⁽¹⁾, F⁽²⁾ and Y, and kernel parameters θ⁽¹⁾ and θ⁽²⁾]

Neal, LNS, 1996 – Duvenaud et al., AISTATS, 2014 – Matthews et al., arXiv, 2018

(18)

Pathologies of Deep Gaussian Processes

• Deep is not necessarily good!

• Feeding input to each layer helps...

[Figure panels: 1 Layer, 2 Layers, 5 Layers, 10 Layers]

[Graphical model with nodes X, F⁽¹⁾, F⁽²⁾ and Y, and kernel parameters θ⁽¹⁾ and θ⁽²⁾]

Neal, LNS, 1996 – Duvenaud et al., AISTATS, 2014 – Matthews et al., arXiv, 2018
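One way to explore how depth changes the prior is to draw functions from a composition of GPs and plot them for different depths. The sketch below is only an assumption-laden illustration: it uses zero-mean layers with a one-dimensional RBF kernel and unit lengthscale, does not feed the input to each layer, and the function names are hypothetical.

```python
import numpy as np


def rbf_kernel_1d(h, lengthscale=1.0):
    """RBF covariance between all pairs of entries of the 1D array h."""
    diff = h[:, None] - h[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)


def sample_deep_gp_prior(x, n_layers, lengthscale=1.0, seed=0):
    """Draw one function from an n_layers-deep composition of zero-mean GPs."""
    rng = np.random.default_rng(seed)
    h = np.asarray(x, dtype=float)
    for _ in range(n_layers):
        K = rbf_kernel_1d(h, lengthscale)
        # An eigendecomposition is used instead of Cholesky so that sampling
        # still works when K is numerically singular.
        w, V = np.linalg.eigh(K)
        w = np.clip(w, 0.0, None)
        h = V @ (np.sqrt(w) * rng.standard_normal(len(h)))
    return h


x_grid = np.linspace(-3.0, 3.0, 100)
samples = {depth: sample_deep_gp_prior(x_grid, depth) for depth in (1, 2, 5, 10)}
```

Plotting `samples[depth]` against `x_grid` for each depth gives a picture comparable in spirit to the four panels on the slide.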

(19)

Learning Deep Gaussian Processes

• Inference requires calculating integrals of this kind:

$$p(Y \mid X, \theta) = \int p\left(Y \mid F^{(N_h)}, \theta^{(N_h)}\right) \times p\left(F^{(N_h)} \mid F^{(N_h - 1)}, \theta^{(N_h - 1)}\right) \times \ldots \times p\left(F^{(1)} \mid X, \theta^{(0)}\right) dF^{(N_h)} \ldots dF^{(1)}$$

• Extremely challenging!


(20)

The Deep Learning Revolution

• Large representational power

• Mini-batch-based learning

• Exploit GPU and distributed computing

• Automatic differentiation

• Mature development of regularization (e.g., dropout)

• Application-specific representations (e.g., convolutional)


(21)

Stochastic Gradient Optimization

$$\mathbb{E}\left\{\widetilde{\nabla}_{\mathrm{par}}\, \mathrm{LowerBound}\right\} = \nabla_{\mathrm{par}}\, \mathrm{LowerBound}$$

[Figure: two scatter plots, both axes from −4 to 4]

Robbins and Monro, AoMS, 1951

(22)

Stochastic Variational Inference - Illustration

$$\mathrm{vpar}' = \mathrm{vpar} + \frac{\alpha_t}{2}\, \widetilde{\nabla}_{\mathrm{vpar}}(\mathrm{LowerBound}), \qquad \alpha_t \to 0$$

[Figure: illustration of stochastic variational inference; one panel plots p(par | data) against Iteration]
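The update with a decreasing step size can be sketched on a toy objective. Here the objective is $\sum_i f(y_i, \mathrm{par})$ with $f = -\frac{1}{2}(y_i - \mathrm{par})^2$, standing in for the lower bound; the objective, the step-size schedule $\alpha_t = 2 / (n (t + 1))$ and the function name are all assumptions for this example.

```python
import numpy as np


def sgd_robbins_monro(y, n_iters=2000, batch_size=10, seed=0):
    """Maximise sum_i -0.5 * (y_i - par)^2 with unbiased mini-batch gradients."""
    rng = np.random.default_rng(seed)
    n = len(y)
    par = 0.0
    for t in range(n_iters):
        batch = rng.choice(y, size=batch_size, replace=True)
        # Unbiased estimate of the full gradient sum_i (y_i - par).
        grad_est = (n / batch_size) * np.sum(batch - par)
        alpha_t = 2.0 / (n * (t + 1))         # decreasing step size, alpha_t -> 0
        par = par + 0.5 * alpha_t * grad_est  # par' = par + (alpha_t / 2) * gradient estimate
    return par


data_demo = np.random.default_rng(1).normal(3.0, 1.0, size=500)
par_hat = sgd_robbins_monro(data_demo)
```

With this schedule the iterate is a running average of mini-batch means, so it settles close to the sample mean of the data, which is the maximiser of the toy objective.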

(23)

Is There Any Hope for GPs and DGPs?

• Mini-batch training is straightforward when objective factorizes over training points

objective = $\sum_i f(y_i, x_i, \mathrm{par})$

• In GPs latent variables are fully correlated:

$$p(F \mid X, \theta) = \mathcal{N}(F \mid 0, K(X, \theta)) \propto \exp\left(-\frac{1}{2} F^\top K^{-1} F\right)$$

• Naïve mini-batch approaches would totally break this!

Can we exploit what made Deep Learning successful for practical and scalable learning of (Deep) Gaussian processes?


(25)

Inference for Deep Gaussian Processes

(26)

Inference for DGPs

• Inducing points-based approximations

  • VI + Titsias (AISTATS 2009) sparse GP

    • Damianou and Lawrence, AISTATS, 2013

    • Hensman and Lawrence, arXiv, 2014

    • Salimbeni and Deisenroth, NIPS, 2017

  • EP + FITC - Bui et al., ICML, 2016

  • MCMC + Titsias (AISTATS 2009) sparse GP

    • Havasi et al., arXiv, 2018

• Random feature-based approximations

  • Gal and Ghahramani, ICML, 2016

  • Cutajar et al., ICML, 2017

(27)

Inference for DGPs

• Low-rank approximation options - $O(n m^2)$

• Let P denote a low-rank approximation to $K_y$

• The Woodbury identity exploits the low-rank structure of P

Preconditioning Kernel Matrices

K. Cutajar¹, M. A. Osborne², J. P. Cunningham³, M. Filippone¹

1 - EURECOM, Sophia Antipolis, France

2 - University of Oxford, Oxford, UK

3 - Columbia University, New York City, USA

Kernel Machines and Solving Linear Systems

• Operate in a high-dimensional, implicit feature space;

• Rely on the construction of an n × n Gram matrix K;

• Popular kernels:

– RBF: $k(x_i, x_j) = \sigma^2 \exp\left(-\tfrac{1}{2} d^2\right)$;

– Matérn: $k_{\nu = 3/2}(x_i, x_j) = \sigma^2 \left(1 + \sqrt{3}\, d\right) \exp\left(-\sqrt{3}\, d\right)$;

where $d^2 = (x_i - x_j)^\top \Lambda\, (x_i - x_j)$.

• Involve the solution of linear systems Kz = v;

• Cholesky decomposition:

– $O(n^2)$ space and $O(n^3)$ time - infeasible for large n.

Fig. 1: Kernel machines enable non-linear separation of data.

• Conjugate Gradient (CG):

– Numerical solution of linear systems constructed using matrix-vector multiplications;

– $O(t n^2)$ for t CG iterations - in theory t = n (possibly worse).

Fig. 2: CG (figure labels: z0, z)

• Preconditioned Conjugate Gradient (PCG):

– Transforms the linear system to be better conditioned, improving convergence;

– Yields a new linear system of the form $P^{-1} K z = P^{-1} v$;

– $O(t n^2)$ for t PCG iterations - in practice t ≪ n.

Fig. 3: Preconditioned CG (figure labels: z0, z)
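The CG and PCG iterations described above can be sketched as a single routine that only needs matrix-vector products with the system matrix and, optionally, a function applying $P^{-1}$. This is a textbook-style sketch, not the poster authors' implementation; the function name `pcg`, its arguments and the demo matrix are assumptions.

```python
import numpy as np


def pcg(matvec, b, precond_solve=None, tol=1e-8, max_iter=None):
    """Solve A z = b for symmetric positive definite A given only z -> A z.

    precond_solve applies P^{-1} to a vector; None gives plain CG.
    Returns the solution and the number of iterations used.
    """
    n = b.shape[0]
    if max_iter is None:
        max_iter = 10 * n
    if precond_solve is None:
        precond_solve = lambda r: r
    x = np.zeros(n)
    r = b.copy()                     # residual b - A x for x = 0
    z = precond_solve(r)
    p = z.copy()
    rz = r @ z
    b_norm = np.linalg.norm(b)
    for it in range(1, max_iter + 1):
        Ap = matvec(p)               # the only O(n^2) operation per iteration
        step = rz / (p @ Ap)
        x = x + step * p
        r = r - step * Ap
        if np.linalg.norm(r) <= tol * b_norm:
            return x, it
        z = precond_solve(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter


rng_demo = np.random.default_rng(0)
x_demo = np.sort(rng_demo.uniform(-3.0, 3.0, 50))
K_y_demo = np.exp(-0.5 * (x_demo[:, None] - x_demo[None, :]) ** 2) + 0.1 * np.eye(50)
v_demo = rng_demo.standard_normal(50)
z_cg, it_cg = pcg(lambda p: K_y_demo @ p, v_demo)
# Sanity check only: with the exact matrix as preconditioner, PCG converges immediately.
z_pcg, it_pcg = pcg(lambda p: K_y_demo @ p, v_demo,
                    precond_solve=lambda r: np.linalg.solve(K_y_demo, r))
```

In practice `precond_solve` would apply a cheap approximation to the inverse rather than an exact solve; the exact solve here only checks that the routine behaves as expected.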

Preconditioning Approaches

• Suppose we want to precondition $K_y = K + \lambda I$;

• Our choice of preconditioner should:

– Approximate $K_y$ as closely as possible;

– Be easy to invert.

• For low-rank preconditioners we employ the Woodbury inversion lemma:

$K_y$ = P = [...]

$P^{-1}$ = [...]

• For other preconditioners we solve inner linear systems once again using CG!

Preconditioner: formulation (inversion strategy)

Nyström: $P = K_{XU} K_{UU}^{-1} K_{UX} + \lambda I$, where U ⊂ X (Woodbury)

FITC: $P = K_{XU} K_{UU}^{-1} K_{UX} + \mathrm{diag}\left(K - K_{XU} K_{UU}^{-1} K_{UX}\right) + \lambda I$ (Woodbury)

PITC: $P = K_{XU} K_{UU}^{-1} K_{UX} + \mathrm{bldiag}\left(K - K_{XU} K_{UU}^{-1} K_{UX}\right) + \lambda I$ (Woodbury)

Spectral: $P_{ij} = \frac{\sigma^2}{m} \sum_{r=1}^{m} \cos\left[2 \pi s_r^\top (x_i - x_j)\right] + \lambda I_{ij}$ (Woodbury)

Partial SVD: $K = A \Lambda A^\top \Rightarrow P = A_{[\cdot, 1:m]} \Lambda_{[1:m, 1:m]} A^\top_{[1:m, \cdot]} + \lambda I$ (Woodbury)

Block Jacobi: $P = \mathrm{bldiag}(K) + \lambda I$ (Block Inverse)

SKI: $P = W K_{UU} W^\top + \lambda I$, where $K_{UU}$ is Kronecker (Inner CG)

Regularization: $P = K + \delta I + \lambda I$ (Inner CG)
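For the Nyström form $P = K_{XU} K_{UU}^{-1} K_{UX} + \lambda I$ listed above, the Woodbury identity gives $P^{-1} v = \frac{1}{\lambda}\left[v - K_{XU} \left(\lambda K_{UU} + K_{UX} K_{XU}\right)^{-1} K_{UX} v\right]$, which only requires solving an m × m system. The sketch below checks this against a dense solve; the RBF kernel, the jitter added to $K_{UU}$, the sizes and the function names are assumptions for the example.

```python
import numpy as np


def rbf_kernel(A, B, lengthscale=1.0):
    """RBF covariance between the rows of A and the rows of B (assumed kernel)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)


def nystrom_precond_solve(K_XU, K_UU, lam, v):
    """Return P^{-1} v for P = K_XU K_UU^{-1} K_UX + lam * I using Woodbury."""
    inner = lam * K_UU + K_XU.T @ K_XU  # m x m matrix; forming it costs O(n m^2)
    return (v - K_XU @ np.linalg.solve(inner, K_XU.T @ v)) / lam


rng_demo = np.random.default_rng(0)
X_demo = rng_demo.uniform(-3.0, 3.0, size=(100, 2))
U_demo = X_demo[:10]                                     # inducing inputs, U a subset of X
K_XU_demo = rbf_kernel(X_demo, U_demo)
K_UU_demo = rbf_kernel(U_demo, U_demo) + 1e-6 * np.eye(10)  # jitter for stability
lam_demo = 0.1
v_demo = rng_demo.standard_normal(100)

# Dense reference, built only to verify the low-rank solve on this small problem.
P_demo = K_XU_demo @ np.linalg.solve(K_UU_demo, K_XU_demo.T) + lam_demo * np.eye(100)
sol_demo = nystrom_precond_solve(K_XU_demo, K_UU_demo, lam_demo, v_demo)
```

A function like `nystrom_precond_solve` is what a PCG routine would call once per iteration to apply the preconditioner.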

[Figure (Figure 1 of the paper "Preconditioning Kernel Matrices"): Comparison of preconditioners for different settings of kernel parameters, for the Concrete (n = 1030, d = 8), Power plant (n = 9568, d = 4) and Protein (n = 45730, d = 9) datasets. The lengthscale l and the noise variance λ are shown on the x and y axes respectively. The top figure indicates the number of iterations required to solve the corresponding linear system using CG, whilst the bottom part of the figure shows the rate of improvement (negative - blue) or degradation (positive - red) achieved by using PCG to solve the same linear system. Parameters and results are reported in log10. Symbols added to facilitate reading in B/W print. Panels per dataset: Block Jacobi, PITC, FITC, Nyström, Spectral, Randomized SVD, Regularized, SKI; the Regularized and SKI panels for the Protein dataset are marked "Too Expensive!". Colour scale from Gain to Loss.]

[...] preconditioner requires one matrix-vector product, and we add this to the overall count of such computations. For this preconditioner, we add a diagonal offset δ to the original matrix, equivalent to two orders of magnitude greater than the noise of the process. In general, although the complexity of PCG is indeed no different from that of CG, we emphasize that experiencing a 2-fold or 5-fold (in some cases even an order of magnitude) improvement can be very substantial when plain CG takes very long to converge or when the dataset is large.

We focus on an isotropic RBF variant of the kernel in eq. 1, fixing the marginal variance σ² to one. We vary the lengthscale parameter l and the noise variance λ in log10 scale.

The top part of fig. 1 shows the number of iterations that the standard CG algorithm takes, where we have capped the number of iterations to 100,000.

The bottom part of the figure reports the improvement offered by various preconditioners measured as

$$\log_{10}\left(\frac{\#\ \text{PCG iterations}}{\#\ \text{CG iterations}}\right).$$

It is worth noting that when both CG and PCG fail to converge within the upper bound, the improvement will be marked as 0, i.e. neither a gain nor a loss within the given bound. The results plotted in fig. 1 indicate that the low-rank preconditioners (PITC, FITC and Nyström) achieve significant reductions in the number of iterations for each dataset, and all approaches work best when the lengthscale is longer, characterising smoother processes. In contrast, preconditioning seems to be less effective when the lengthscale is shorter, corresponding to a kernel matrix that is more sparse. However, for cases yielding positive results, the improvement is often in the range of an order of magnitude, which can be substantial when a large number of iterations is required by the CG algorithm.

The results also confirm that, as alluded to in the previous section, Block Jacobi preconditioning is generally a poor preconditioner, particularly when the corresponding kernel matrix is dense. The only minor improvements were observed when CG itself converges quickly, in which case preconditioning serves very little purpose either way.

The regularization approach with flexible conjugate gradient does not appear to be effective in any case either, particularly due to the substantial number of iterations required for solving an inner system at every iteration of the PCG algorithm. This implies that introducing additional small jitter to the diagonal does not necessarily make the system much easier to solve, whilst adding an overly large offset would negatively impact convergence of the outer algorithm. One could assume that tuning the value of this parameter could result in slightly better results; however, preliminary experiments in this regard yielded only minor [...]

Fig. 4: Comparison of preconditioners for different settings of kernel parameters across multiple datasets. Top: Number of iterations required to solve the corresponding linear system using CG. Bottom: Rate of improvement (blue) or degradation (red) achieved by using PCG to solve the same linear system.

Motivating Example - Gaussian Processes

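• Marginal likelihood:

$$\log[p(y \mid \mathrm{par})] = -\frac{1}{2} \log |K_y| - \frac{1}{2} y^\top K_y^{-1} y + \text{const.}$$

• Derivatives with respect to par:

$$\frac{\partial \log[p(y \mid \mathrm{par})]}{\partial \mathrm{par}_i} = -\frac{1}{2} \mathrm{Tr}\left(K_y^{-1} \frac{\partial K_y}{\partial \mathrm{par}_i}\right) + \frac{1}{2} y^\top K_y^{-1} \frac{\partial K_y}{\partial \mathrm{par}_i} K_y^{-1} y$$

• Stochastic estimate of the trace - assuming $\mathbb{E}[r r^\top] = I$, then:

$$\mathrm{Tr}\left(K_y^{-1} \frac{\partial K_y}{\partial \mathrm{par}_i}\right) = \mathrm{Tr}\left(K_y^{-1} \frac{\partial K_y}{\partial \mathrm{par}_i} \mathbb{E}[r r^\top]\right) = \mathbb{E}\left[r^\top K_y^{-1} \frac{\partial K_y}{\partial \mathrm{par}_i} r\right] \approx \frac{1}{N_r} \sum_{j=1}^{N_r} r^{(j)\top} K_y^{-1} \frac{\partial K_y}{\partial \mathrm{par}_i} r^{(j)}$$

• Linear systems only!

• Laplace approximation for non-Gaussian likelihoods may be formulated in a similar way!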
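The stochastic trace estimate above can be sketched with Rademacher probe vectors, which satisfy $\mathbb{E}[r r^\top] = I$. In this sketch a dense solve stands in for the CG or PCG solves, and the derivative is taken with respect to the marginal variance of an RBF kernel (so that it equals K when σ² = 1); these choices and the function name are assumptions for the example.

```python
import numpy as np


def stochastic_trace(K_y, dK, n_probes=2000, seed=0):
    """Estimate Tr(K_y^{-1} dK) as the average of r^T K_y^{-1} dK r over probes r."""
    rng = np.random.default_rng(seed)
    n = K_y.shape[0]
    R = rng.choice([-1.0, 1.0], size=(n, n_probes))  # Rademacher probes, one per column
    S = np.linalg.solve(K_y, dK @ R)                 # stands in for CG/PCG solves
    return np.mean(np.sum(R * S, axis=0))


x_demo = np.linspace(-3.0, 3.0, 30)
K_demo = np.exp(-0.5 * (x_demo[:, None] - x_demo[None, :]) ** 2)
K_y_demo = K_demo + 1.0 * np.eye(30)
dK_demo = K_demo                     # derivative with respect to sigma^2 at sigma^2 = 1
trace_est = stochastic_trace(K_y_demo, dK_demo)
trace_exact = np.trace(np.linalg.solve(K_y_demo, dK_demo))
```

The estimate is unbiased, and its variance shrinks as the number of probe vectors `n_probes` grows.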

Experimental Setup and Results

• Exact gradient-based optimization using Cholesky decomposition (CHOL);

• Stochastic gradient-based optimization using ADAGRAD - using CG and PCG;

• GP approximations:

– Variational learning of inducing variables (VAR);

– Fully Independent Training Conditional (FITC);

– Partially Independent Training Conditional (PITC).

[Figure panels, each plotted against log10(seconds). Classification: Spam dataset (n = 4061, d = 57) and EEG dataset (n = 14979, d = 14), showing Error Rate and Negative Test Log-Lik. Regression: Power plant dataset (n = 9568, d = 4) and Protein dataset (n = 45730, d = 9), showing RMSE and Negative Test Log-Lik. Legend: PCG, CG, CHOL, FITC, PITC, VAR.]

Fig. 5: Error and negative log likelihood on √n held out test data over time. Curves are averaged over multiple repetitions.

Conclusions

• Our solution:

✓ Exact in the limit of iterations;

✓ Straightforward to construct and easy to tune;

✓ Scalable to large datasets - no need to store K;

✓ Competitive with exact Cholesky decomposition;

✓ Superior to approximate methods.

• Ongoing work:

– Extending this work to other kernel functions and models;

– Implementation on a distributed framework;

– Exploiting PCG in the solution of f(K) z = v.

Footnote 1 - DGPs: low-rank approximation of the covariance at each layer


(29)

Scalable Expectation Propagation for DGPs

• Pseudo-inputs $Z^{(i)}$

• Inducing variables $U^{(i)}$

• VI targets $q(U^{(i)})$

• Assuming $q(U^{(i)}) \propto p(U^{(i)})\, g(U^{(i)})^N$, learn g as an average data factor

• Reduces memory and allows for factorization of the objective (output of each layer made Gaussian)

[Graphical model with nodes X, F⁽¹⁾, F⁽²⁾ and Y, kernel parameters θ, inducing variables U and pseudo-inputs Z at each layer]

Bui et al., ICML, 2016


(31)

Inducing Points for DGPs extending Titsias, AISTATS, 2009

• Pseudo-inputs $Z^{(i)}$

• Inducing variables $U^{(i)}$

• VI targets $q\left(F^{(i)}, U^{(i)} \mid F^{(i-1)}\right) = p\left(F^{(i)} \mid U^{(i)}, F^{(i-1)}\right) q\left(U^{(i)}\right)$

• Lower bound factorizes across training points...

• ...and the i-th marginal of the final layer depends only on the i-th marginals of all layers

[Graphical model with nodes X, F⁽¹⁾, F⁽²⁾ and Y, kernel parameters θ, inducing variables U and pseudo-inputs Z at each layer]

Hensman and Lawrence, arXiv, 2014 – Salimbeni and Deisenroth, NIPS, 2017


(33)

Random Feature Expansions for DGPs - Bochner’s theorem

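• Continuous shift-invariant covariance function

$$k(x_i - x_j \mid \theta) = \sigma^2 \int p(\omega \mid \theta) \exp\left(\iota\, (x_i - x_j)^\top \omega\right) d\omega$$

• Monte Carlo estimate

$$k(x_i - x_j \mid \theta) \approx \frac{\sigma^2}{N_{\mathrm{RF}}} \sum_{r=1}^{N_{\mathrm{RF}}} z(x_i \mid \tilde{\omega}_r)^\top z(x_j \mid \tilde{\omega}_r)$$

with

$$\tilde{\omega}_r \sim p(\omega \mid \theta), \qquad z(x \mid \omega) = \left[\cos(x^\top \omega), \sin(x^\top \omega)\right]^\top$$

Rahimi and Recht, NIPS, 2008 - Lázaro-Gredilla et al., JMLR, 2010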
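The Monte Carlo estimate can be checked numerically for the RBF kernel, whose spectral density $p(\omega \mid \theta)$ is Gaussian with standard deviation 1/lengthscale in each dimension. The sketch below is an illustration under that assumption; the function name and the sizes used are hypothetical.

```python
import numpy as np


def random_fourier_features(X, n_rf, lengthscale=1.0, variance=1.0, seed=0):
    """Feature matrix Phi such that Phi @ Phi.T approximates the RBF kernel matrix."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    Omega = rng.normal(0.0, 1.0 / lengthscale, size=(d, n_rf))  # omega_r ~ p(omega | theta)
    XO = X @ Omega
    return np.sqrt(variance / n_rf) * np.hstack([np.cos(XO), np.sin(XO)])


X_demo = np.random.default_rng(1).uniform(-2.0, 2.0, size=(20, 3))
Phi_demo = random_fourier_features(X_demo, n_rf=5000)
K_approx = Phi_demo @ Phi_demo.T

sq_demo = np.sum((X_demo[:, None, :] - X_demo[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-0.5 * sq_demo)
max_err = np.max(np.abs(K_approx - K_exact))
```

Each entry of `K_approx` is an average of $\cos((x_i - x_j)^\top \tilde{\omega}_r)$ over the sampled frequencies, so the error decreases at the usual Monte Carlo rate as `n_rf` grows.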


(35)

Random Feature Expansions for DGPs

• Define

$$\Phi^{(l)} = \sqrt{\frac{\sigma^2}{N_{\mathrm{RF}}^{(l)}}} \left[\cos\left(F^{(l)} \Omega^{(l)}\right), \sin\left(F^{(l)} \Omega^{(l)}\right)\right]$$

and

$$F^{(l+1)} = \Phi^{(l)} W^{(l)}$$

• We are stacking Bayesian linear models with $p\left(W^{(l)}_{\cdot i}\right) = \mathcal{N}(0, I)$

• Expansion of arc-cosine kernel yields ReLU activations!

Cutajar, Bonilla, Michiardi, Filippone, ICML, 2017
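The two definitions above amount to a forward pass through a network with trigonometric activations. The following sketch draws one function from the resulting prior; the layer widths, the unit-lengthscale RBF spectral samples for Ω, and the function names are assumptions for illustration and are not taken from the slides.

```python
import numpy as np


def rf_layer(F, Omega, W, variance=1.0):
    """One layer: Phi = sqrt(variance / N_RF) [cos(F Omega), sin(F Omega)], output Phi W."""
    n_rf = Omega.shape[1]
    FO = F @ Omega
    Phi = np.sqrt(variance / n_rf) * np.hstack([np.cos(FO), np.sin(FO)])
    return Phi, Phi @ W


def sample_dgp_rf(X, layer_dims=(2, 1), n_rf=100, seed=0):
    """Draw one function from a random-feature DGP prior, evaluated at the rows of X."""
    rng = np.random.default_rng(seed)
    F = X
    Phis = []
    for d_out in layer_dims:
        Omega = rng.standard_normal((F.shape[1], n_rf))  # RBF spectral samples, unit lengthscale
        W = rng.standard_normal((2 * n_rf, d_out))       # prior p(W_{.i}) = N(0, I)
        Phi, F = rf_layer(F, Omega, W)
        Phis.append(Phi)
    return F, Phis


X_demo = np.linspace(-3.0, 3.0, 50).reshape(-1, 1)
F_out, Phis_demo = sample_dgp_rf(X_demo)
```

Treating Ω and W as the weights of a network is what makes mini-batch training with stochastic variational inference applicable to this model.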


(37)

DGPs with random features become DNNs

[Diagram: network with nodes X, Φ⁽⁰⁾, F⁽¹⁾, Φ⁽¹⁾, F⁽²⁾ and Y, weights Ω⁽⁰⁾, W⁽⁰⁾, Ω⁽¹⁾, W⁽¹⁾ and kernel parameters θ⁽⁰⁾, θ⁽¹⁾]

We can learn the model using Stochastic Variational Inference for Bayesian DNNs!

(38)

Results - Classification

EEG dataset (n = 14979, d = 14)

[Figure: error rate and MNLL against log10(sec), comparing dgp-rbf, dgp-arc, dgp-ep, dnn and var-gp]

(39)

Results - Multiclass Classification

MNIST dataset (n = 60000, d = 784)

[Figure: error rate and MNLL against log10(sec), comparing dgp-rbf, dgp-arc, dgp-ep, dnn and var-gp]

(40)

Results - MNIST-8M

• Variant of MNIST with 8.1M images

• 99+% accuracy!

• Also, check out Krauth et al., UAI 2017

Cutajar, Bonilla, Michiardi, Filippone, ICML, 2017 – Krauth, Cutajar, Bonilla, Filippone, UAI, 2017

(41)

Results - Model (Depth) Selection

Airline dataset (n = 5M+, d = 8)

[Figure: error rate and MNLL against log10(sec), and negative lower bound (×10⁶) against number of layers, comparing DGPs with 2, 10, 20 and 30 layers and SV-DKL]

(42)

Convolutional Deep Gaussian Processes

(43)

Convolutional Nets

• Convolutional nets are widely used. . .

• . . .but they are known to be overconfident!

[Diagram labels: Convolution + ReLU, Pooling, Pooling, Fully connected layers, Output]

Guo et al., ICML, 2017

(44)

Calibration as a Measure of Quantification of Uncertainty

• Reliability diagrams

[Figure: reliability diagram, fraction of positives against predicted value, both axes from 0 to 1]


(46)

Calibration as a Measure of Quantification of Uncertainty

• Reliability diagrams - Under-confident predictions

[Figure: reliability diagram, fraction of positives against predicted value, both axes from 0 to 1]

• We can extract the Expected Calibration Error (ECE) score

• The Brier score is another measure of calibration
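Both scores can be computed directly from predicted probabilities and binary labels. The sketch below uses a binary formulation that matches the axes of the reliability diagram (mean predicted value against fraction of positives per bin); the equal-width binning, the number of bins and the function names are assumptions for this example.

```python
import numpy as np


def expected_calibration_error(p_pred, y_true, n_bins=10):
    """Weighted average over bins of |fraction of positives - mean predicted value|."""
    p_pred = np.asarray(p_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    # Equal-width bins on [0, 1]; predictions equal to 1.0 go into the last bin.
    bins = np.minimum((p_pred * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - p_pred[mask].mean())
    return float(ece)


def brier_score(p_pred, y_true):
    """Mean squared difference between predicted probabilities and binary labels."""
    p_pred = np.asarray(p_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.mean((p_pred - y_true) ** 2))


# Overconfident toy predictor: always predicts 0.9, but only half the labels are positive.
ece_over = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0])
brier_over = brier_score([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0])
```

For the toy predictor all predictions fall into one bin, so the ECE is simply the gap between the predicted value 0.9 and the observed fraction of positives 0.5.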

(47)

Calibration as a Measure of Quantification of Uncertainty

• Reliability diagrams - Overconfident predictions

[Figure: reliability diagram; x-axis: predicted value, y-axis: fraction of positives]

Reliability diagrams of modern deep CNNs look like this!

Post-calibration fixes it
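The slide does not say which post-calibration method is meant. One common choice, described by Guo et al. (ICML, 2017), is temperature scaling: divide the logits by a single scalar T fitted on held-out data. The sketch below is a minimal version under that assumption, using a grid search instead of a gradient-based optimiser; the validation logits and labels are invented for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature on a grid that minimises validation MNLL."""
    if grid is None:
        grid = [0.25 + 0.05 * i for i in range(96)]  # 0.25 up to 5.0
    return min(grid, key=lambda t: mean_nll(logit_rows, labels, t))


# Hypothetical overconfident validation logits: one confident mistake in eight.
val_logits = [[4.0, 0.0], [5.0, 0.0], [0.0, 6.0], [3.0, 0.0],
              [0.0, 4.0], [6.0, 0.0], [0.0, 5.0], [4.0, 0.0]]
val_labels = [0, 1, 1, 0, 1, 0, 1, 0]

fitted_t = fit_temperature(val_logits, val_labels)
nll_before = mean_nll(val_logits, val_labels, 1.0)
nll_after = mean_nll(val_logits, val_labels, fitted_t)
print("T:", fitted_t, "MNLL before:", nll_before, "after:", nll_after)
```

Dividing by a positive temperature leaves the predicted class unchanged, so this kind of post-calibration changes confidence but not the error rate.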



(49)

Combining Convolutional Nets with GPs

• There have been attempts to combine CNNs with GPs

• Most popular ones replace fully connected layers with GPs

[Diagram: Convolution + ReLU and Pooling blocks followed by a Gaussian Process block and the Output]

Wilson et al., NIPS, 2016 – Bradshaw et al., arXiv, 2017 – Tran et al., arXiv, 2018

(50)

Combining Convolutional Nets with GPs

• There have been attempts to combine CNNs with GPs

• Most popular ones replace fully connected layers with GPs

[Diagram: Convolution + ReLU and Pooling blocks followed by a Deep Gaussian Process block and the Output]

• Better quantification of uncertainty? NO!

Tran et al., arXiv, 2018


(52)

Existing Combinations of CNNs and GPs

• Convolutional Neural Nets - CNN

• Hybrid GPs and DNNs - gpdnn

• Stochastic Variational Deep Kernel Learning - svdkl

• Convolutional GP - cgp

[Figure: reliability diagrams (accuracy versus predictive output) for CNN, GPDNN, CGP and SVDKL; legend: cifar10-lenet, cifar10-resnet, cifar100-resnet, cifar10, cifar100]

Bradshaw et al., arXiv, 2017 – Wilson et al., NIPS, 2016 – van der Wilk et al., NIPS, 2017

(53)

Bayesian CNNs are calibrated

• Inferring the parameters of the convolutional filters recovers calibration

• Example with Monte Carlo Dropout
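To make the Monte Carlo Dropout idea concrete, the sketch below keeps dropout switched on at prediction time and averages the class probabilities over repeated stochastic forward passes. It is a minimal illustration only: it uses a tiny fully connected network with made-up weights rather than a CNN, and every name in it is hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_with_dropout(x, w_hidden, w_out, drop_prob, rng):
    """One stochastic forward pass: ReLU hidden layer, dropout, softmax output."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    # Dropout stays ON at prediction time: sample a fresh mask on every pass
    # and rescale the kept units (inverted dropout).
    keep = 1.0 - drop_prob
    hidden = [h / keep if rng.random() < keep else 0.0 for h in hidden]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return softmax(logits)


def mc_dropout_predict(x, w_hidden, w_out, drop_prob=0.5, n_samples=200, seed=0):
    """Average class probabilities over n_samples dropout masks."""
    rng = random.Random(seed)
    mean = [0.0] * len(w_out)
    for _ in range(n_samples):
        probs = forward_with_dropout(x, w_hidden, w_out, drop_prob, rng)
        mean = [m + p / n_samples for m, p in zip(mean, probs)]
    return mean


# Made-up weights: 2 inputs, 3 hidden units, 2 classes.
w_hidden = [[1.0, -0.5], [0.5, 1.0], [-1.0, 0.5]]
w_out = [[1.0, 0.5, -0.5], [-0.5, 1.0, 0.5]]

mc_probs = mc_dropout_predict([1.0, 2.0], w_hidden, w_out)
print("MC Dropout predictive probabilities:", mc_probs)
```

The averaged probabilities are what would then be fed into a reliability diagram or into the ECE and Brier scores.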

RELIABILITY DIAGRAM FOR MCD

[Figure: grid of reliability diagrams (x-axis: predictive output), one row per model and one column per panel label 1/8, 1/4, 1/2, 1. Scores printed in the panels:]

Model            Metric   1/8      1/4      1/2      1
cifar10-lenet    ECE      0.0254   0.0145   0.0270   0.0495
                 BRIER    0.3951   0.3300   0.2827   0.2404
cifar10-resnet   ECE      0.0485   0.0140   0.0295   0.0456
                 BRIER    0.3518   0.2765   0.2126   0.1799
cifar100-resnet  ECE      0.2201   0.1357   0.0492   0.0270
                 BRIER    0.8334   0.6886   0.5463   0.4615

Tran et al., arXiv, 2018

(54)

Bayesian CNNs with DGPs with Random Features

• We extended our work on Random Feature Expansions for DGPs to replace fully connected layers
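As a reminder of what a random feature expansion does, the sketch below approximates an RBF kernel with random Fourier features, so that a GP layer can be treated as a linear model on top of the features. It covers only the RBF case with a fixed lengthscale and is a simplified illustration, not the implementation used in the cited work; the function names and test points are hypothetical.

```python
import math
import random


def rbf_kernel(x, y, lengthscale=1.0):
    """Exact RBF kernel exp(-|x - y|^2 / (2 * lengthscale^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-0.5 * sq_dist / lengthscale ** 2)


def sample_frequencies(dim, n_features, lengthscale=1.0, seed=0):
    """Draw spectral frequencies from N(0, I / lengthscale^2)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0 / lengthscale) for _ in range(dim)]
            for _ in range(n_features)]


def random_features(x, freqs):
    """Map x to [cos(w.x), sin(w.x)] / sqrt(n_features) over all frequencies w."""
    projections = [sum(w * xi for w, xi in zip(row, x)) for row in freqs]
    scale = 1.0 / math.sqrt(len(freqs))
    return ([scale * math.cos(p) for p in projections]
            + [scale * math.sin(p) for p in projections])


# Two arbitrary 2-D inputs; the same frequencies must be used for both.
x, y = [0.3, -1.2], [1.0, 0.4]
freqs = sample_frequencies(dim=2, n_features=5000)

phi_x = random_features(x, freqs)
phi_y = random_features(y, freqs)
approx_k = sum(a * b for a, b in zip(phi_x, phi_y))
exact_k = rbf_kernel(x, y)
print("exact:", exact_k, "random-feature approximation:", approx_k)
```

The inner product of the two feature vectors is a Monte Carlo estimate of the kernel, and its error shrinks as the number of features grows.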

RELIABILITY DIAGRAM FOR RF

[Figure: grid of reliability diagrams (x-axis: predictive output), one row per model and one column per panel label 1/8, 1/4, 1/2, 1. Scores printed in the panels:]

Model            Metric   1/8      1/4      1/2      1
cifar10-lenet    ECE      0.1034   0.0679   0.0410   0.0205
                 BRIER    0.5104   0.4117   0.3293   0.2701
cifar10-resnet   ECE      0.0203   0.0242   0.0562   0.0725
                 BRIER    0.3239   0.2491   0.1997   0.1766
cifar100-resnet  ECE      0.0979   0.0194   0.0639   0.1156
                 BRIER    0.7468   0.6137   0.5092   0.4523

Tran et al., arXiv, 2018

(55)

Comparison with competitors

SHALLOW CONVOLUTIONAL STRUCTURE

[Figure: panels of err, mnll, ece and brier on mnist and cifar10; methods: cnn+gp(rf), cnn+gp(sorf), cnn+mcd, cnn+cal, gpdnn, cgp, svdkl]

(56)

Comparison with competitors

DEEP CONVOLUTIONAL STRUCTURE

[Figure: panels of err, mnll, ece and brier versus log10(train size) on cifar10 and cifar100; methods: cnn+gp(rf), cnn+gp(sorf), cnn+mcd, cnn+cal, gpdnn, cgp, svdkl]

(57)

Analysis of Depth of DGP

• Increasing depth of DGP slightly improves error rate...

• ...and slightly worsens calibration

[Figure: err, mnll, ece and brier versus DGP depth (2 to 10)]

(58)

Other Interesting DGP-based models

• Autoencoders

Dai et al., ICLR, 2015 – Domingues et al., Mach. Learn., 2018

• DGPs with constrained dynamics

Lorenzi and Filippone, ICML, 2018

(59)

Conclusions

(60)

Conclusions

• DGPs offer probabilistic deep learning with sensible priors

• Inference for DGPs is hard

• Model approximations

• Approximate inference

• Difficult to assess the impact of these approximations



(62)

Conclusions

• We are borrowing ideas from GPs and deep learning

• Stochastic-based approximate inference

• Low-rank process decompositions

• Algebraic/computational tricks



