Deep Gaussian Processes
Maurizio Filippone
EURECOM, Sophia Antipolis, France August 31
st, 2018
1
1 Introduction
2 Inference for Deep Gaussian Processes
3 Convolutional Deep Gaussian Processes
4 Conclusions
2
Introduction
Gaussian Processes - Priors over Functions
• Infinite Gaussian random variables with parametric and input-dependent covariance
Rasmussen and Williams, 2006
3
Gaussian Processes as Infinitely-Wide Shallow Neural Nets
• Take W (i) ∼ N (0, α i I )
• Central Limit Theorem implies that F is Gaussian
.. .
.. .
Φ
X F
W(0) W(1)
• F has zero-mean
• cov(F ) = E p(W
(0),W
(1)) [Φ(X W (0) )W (1) W (1) > Φ(X W (0) ) > ]
Neal,LNS, 1996
4
Gaussian Processes as Infinitely-Wide Shallow Neural Nets
• Take W (i) ∼ N (0, α i I )
• Central Limit Theorem implies that F is Gaussian
.. .
.. .
Φ
X F
W(0) W(1)
• F has zero-mean
• cov(F ) = α 1 E p(W
(0)) [Φ(X W (0) )Φ(X W (0) ) > ]
• Some choices of Φ lead to analytic expression of known kernels (RBF, Mat´ ern, arc-cosine, Brownian motion, . . .)
Neal,LNS, 1996
5
Gaussian Processes - Priors over Functions
●
●
●
●
●
●
●●
●
●
●
● ●●
●
●
●
●
●
●
K =
Rasmussen and Williams, 2006
6
Gaussian Processes - Priors over Functions
●
●
●
●
●
●
●●
●
●
●
● ●●
●
●
●
●
●
●
K =
Rasmussen and Williams, 2006
7
Gaussian Processes - Priors over Functions
●
●
●
●
●
●
●●
●
●
●
● ●●
●
●
●
●
●
●
K = n
n
Rasmussen and Williams, 2006
8
Gaussian Processes - Regression example
• Inputs = X Labels = Y
• Introduce latent variables F with covariance K = K (X , θ)
• Introduce Gaussian likelihood p(Y | F )
−3−2−10123
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
• Posterior p(F | Y , X , θ) ∝ p(Y | F )p(F | X , θ) R p(Y | F )p(F | X , θ)d F
Rasmussen and Williams, 2006 9
Gaussian Processes - Regression example
• Predictive distribution p(F ∗ | Y , X , θ) =
Z
p(F ∗ | F , θ)p (F | Y , X , θ)d F
−3 −2 −1 0 1 2 3
−3−2−10123
x
y
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
• Posterior p(F | Y , X , θ) ∝ p(Y | F )p(F | X , θ) R p(Y | F )p(F | X , θ)d F
Rasmussen and Williams, 2006 10
Gaussian Processes - Classification example
• Inputs = X Labels = Y
• Introduce latent variables F with covariance K = K (X , θ)
• Introduce Bernoulli likelihood p (Y | F )
0.00.20.40.60.81.0
• Posterior p(F | Y , X , θ) ∝ p(Y | F )p(F | X , θ) R p(Y | F )p(F | X , θ)d F
Rasmussen and Williams, 2006 11
Gaussian Processes - Classification example
• Predictive distribution - needs approximation to p(F | Y , X , θ)!
p(F ∗ | Y , X , θ) = Z
p(F ∗ | F , θ)p (F | Y , X , θ)d F
0.00.20.40.60.81.0
• Posterior p(F | Y , X , θ) ∝ p(Y | F )p(F | X , θ) R p(Y | F )p(F | X , θ)d F
Rasmussen and Williams, 2006
12
Challenges and Limitations
• Kernel design
• p(Y | X , θ) might be expensive to compute (factorize K )
• p(Y | X , θ) might not even be computable!
F
Y θ X
• Marginal likelihood p(Y | X , θ) =
Z
p (Y | F )p(F | X , θ)d F
13
Deep Gaussian Processes for Large Representational Power
• Bypassing kernel design through composition of processes
(f ◦ g)(x)??
Neal,LNS, 1996 – Damianou and Lawrence,AISTATS, 2013
14
Deep Gaussian Processes for Large Representational Power
• Composition of stationary processes yields something very complex
F(1)
Y θ(1) X
F(2) θ(2)
Neal,LNS, 1996 – Damianou and Lawrence,AISTATS, 2013 – Duvenaud et al.,AISTATS, 2014 15
Pathologies of Deep Gaussian Processes
• Deep is not necessarily good!
• Example
1 Layer 2 Layers
5 Layers 10 Layers
F(1)
Y θ(1) X
F(2) θ(2)
Neal,LNS, 1996 – Duvenaud et al.,AISTATS, 2014 – Matthews et al.,arXiv, 2018 16
Pathologies of Deep Gaussian Processes
• Deep is not necessarily good!
• Feeding input to each layer helps...
1 Layer 2 Layers
5 Layers 10 Layers
F(1)
Y θ(1) X
F(2) θ(2)
Neal,LNS, 1996 – Duvenaud et al.,AISTATS, 2014 – Matthews et al.,arXiv, 2018 17
Learning Deep Gaussian Processes
• Inference requires calculating integrals of this kind:
p(Y | X , θ) = Z
p
Y | F (N
h) , θ (N
h) × p
F (N
h) | F (N
h− 1) , θ (N
h− 1)
× . . . × p
F (1) | X , θ (0)
d F (N
h) . . . d F (1)
• Extremely challenging!
18
The Deep Learning Revolution
• Large representational power
• Mini-batch-based learning
• Exploit GPU and distributed computing
• Automatic differentiation
• Mature development of regularization (e.g., dropout)
• Application-specific representations (e.g., convolutional)
19
Stochastic Gradient Optimization
E n
∇ ] par LowerBound o
= ∇ par LowerBound
−4 −2 0 2 4
−4−2024
●
−4 −2 0 2 4
−4−2024
●
●
●
● ●
● ●●
●
●
●●
●
●●●
●
●
● ●
●
● ●
●
●
●
●●
●
●
●
● ●
●
●
●
●
● ●
●
●
●
●
●
● ●
●
●
●
●
●
●
●●
●
● ●
●
●
●
●
● ●
●
●
●
●
●●
●
●
Robbins and Monro,AoMS, 1951
20
Stochastic Variational Inference - Illustration
vpar 0 = vpar + α t
2 ∇ ^ vpar (LowerBound) α t → 0
21
−4 −2 0 2 4
−4−2024
●
●
●
0 20 40 60 80 100
0.00.20.40.60.81.0
Iteration
p(par | data)
5 0 5
20 15 10 5 0 5 10 15 20
4 2 0 2 4
16
14
12
10
8
Is There Any Hope for GPs and DGPs?
• Mini-batch training is straightforward when objective factorizes over training points
objective = X
i
f (y i , x i , par)
• In GPs latent variables are fully correlated p(F | X , θ) = N (F | 0, K (X , θ)) ∝ exp
− 1
2 F > K − 1 F
• Na¨ıve mini-batch approaches would totally break this!
Can we exploit what made Deep Learning successful for practical and scalable learning of (Deep) Gaussian processes?
22
Is There Any Hope for GPs and DGPs?
• Mini-batch training is straightforward when objective factorizes over training points
objective = X
i
f (y i , x i , par)
• In GPs latent variables are fully correlated p(F | X , θ) = N (F | 0, K (X , θ)) ∝ exp
− 1
2 F > K − 1 F
• Na¨ıve mini-batch approaches would totally break this!
Can we exploit what made Deep Learning successful for
practical and scalable learning of (Deep) Gaussian processes?
22Inference for Deep Gaussian
Processes
Inference for DGPs
• Inducing points-based approximations
• VI+Titsias AISTATS 2009 Sparse GP
• Damianou and Lawrence, AISTATS, 2013
• Hensman and Lawrence, arXiv, 2014
• Salimbeni and Deisenroth, NIPS, 2017
• EP+FITC - Bui et al. ICML, 2016
• MCMC+Titsias AISTATS 2009 Sparse GP
• Havasi et al., arXiv, 2018
• Random feature-based approximations
• Gal and Ghahramani, ICML 2016
• Cutajar et al., ICML 2017
23
Inference for DGPs
• Low-Rank Approximation options - O (nm 2 )
• Call P as a low rank approximation to K y
• Woodbury identity exploits low rank structure of P
Preconditioning Kernel Matrices
K. Cutajar 1 , M. A. Osborne 2 , J. P. Cunningham 3 , M. Filippone 1
1 - EURECOM, Sophia Antipolis, France
2 - University of Oxford, Oxford, UK 3 - Columbia University, New York City, USA
Kernel Machines and Solving Linear Systems
! Operate in a high-dimensional, implicit feature space;
! Rely on the construction of an n × n Gram matrix K;
! Popular kernels:
– RBF : k (x
i, x
j) = σ
2exp !
−
12d
2"
; – Matérn : k
v=32
(x
i,x
j) = σ
2! 1 + √
3d "
exp !
− √ 3d "
; where d
2= (x
i− x
j)
⊤Λ (x
i− x
j).
! Involve the solution of linear systems Kz = v;
! Cholesky Decomposition:
– O (n
2) space and O (n
3) time - unfeasible for large n.
Fig. 1:
Kernel machines enable non-linear separation of data.
! Conjugate Gradient (CG):
– Numerical solution of linear systems constructed using matrix-vector multiplications;
– O (tn
2) for t CG iterations - in theory t = n (possibly worse).
z
z0
Fig. 2:
CG
! Preconditioned Conjugate Gradient (PCG):
– Transforms linear system to be better conditioned, improving convergence;
– Yields a new linear system of the form P
−1Kz = P
−1v;
– O(tn
2) for t PCG iterations - in practice t ≪ n.
z
z0
Fig. 3:
Preconditioned CG
Preconditioning Approaches
! Suppose we want to precondition K
y= K + λI;
! Our choice of preconditioner should:
– Approximate K
yas closely as possible;
– Be easy to invert.
! For low-rank preconditioners we employ the Woodbury inversion lemma:
K
y= P =
P
−1=
! For other preconditioners we solve inner linear systems once again using CG!
Formulation Strategy
Nyström P = K
XUK
U U−1K
U X+ λI where U ⊂ X Woodbury FITC P = K
XUK
U U−1K
U X+ diag !
K − K
XUK
U U−1K
U X"
+ λI Woodbury
PITC P = K
XUK
U U−1K
U X+ bldiag !
K − K
XUK
U U−1K
U X"
+ λI Woodbury
Spectral P
ij=
σm2#
m r=1cos $
2πs
⊤r(x
i− x
j) %
+ λI
ijWoodbury
Partial SVD K = AΛA
⊤⇒ P = A
[·,1:m]Λ
[1:m,1:m]A
⊤[1:m,·]+ λI Woodbury
Block Jacobi P = bldiag (K) + λI Block Inverse
SKI P = W K
U UW
⊤+ λI where K
U Uis Kronecker Inner CG
Regularization P = K + δI + λI Inner CG
Preconditioning Kernel Matrices
Concrete Dataset Power plant Dataset Protein Dataset
(n
= 1030, d= 8)(n
= 9568, d= 4)(n
= 45730, d= 9)-3 -2 -1 0 1 2 log10(l) -2 -4 -6
log10()
-3 -2 -1 0 1 2 log10(l) -2 -4 -6
log10()
-3 -2 -1 0 1 2 log10(l) -2 -4 -6
log10()
log
10(n
it)
0 5
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + + + + + Block Jacobi
-3 -2 -1 0 1 2 -2
-4
-6 +
PITC
-3 -2 -1 0 1 2 -2
-4 -6
+
+ +
+ + + FITC
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + +
Nystrom
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + + + + + + + Spectral
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + + + + + + Randomized SVD
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + + + + + + + + + + + + + Regularized
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + + + + + + + + + + + + +
SKI
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + + + + + Block Jacobi
-3 -2 -1 0 1 2 -2
-4 -6
+ + + PITC
-3 -2 -1 0 1 2 -2
-4 -6
+ + + FITC
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + + + +
Nystrom
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + + + + + +
Spectral
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + + + + Randomized SVD
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + + + + + + + + + + + + Regularized
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + + + + + + + + + +
SKI
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + + +
+ + +
Block Jacobi
-3 -2 -1 0 1 2 -2
-4 -6
+ + +
+ + +
PITC
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + FITC
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + +
Nystrom
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + + + + + + +
Spectral
-3 -2 -1 0 1 2 -2
-4 -6 + + + + + Randomized SVD
-3 -2 -1 0 1 2 -2
-4 -6
Too Expensive!
Regularized
-3 -2 -1 0 1 2 -2
-4 -6
Too Expensive!
SKI
Loss
− 2
−1 0 1 2
Gain
Figure 1. Comparison of preconditioners for different settings of kernel parameters. The lengthscalel and the noise variance
λare shown on the x and y axes respectively. The top figure indicates the number of iterations required to solve the corresponding linear system using CG, whilst the bottom part of the figure shows the rate of improvement (negative - blue) or degradation (positive - red) achieved by using PCG to solve the same linear system. Parameters and results are reported inlog
10. Symbols added to facilitate reading in B/W print.
tioner requires one matrix-vector product, and we add this to the overall count of such computations. For this precon- ditioner, we add a diagonal offset δ to the original matrix, equivalent to two orders of magnitude greater than the noise of the process. In general, although the complexity of PCG is indeed no different from that of CG, we emphasize that experiencing a 2-fold or 5-fold (in some cases even an order of magnitude) improvement can be very substantial when plain CG takes very long to converge or when the dataset is large.
We focus on an isotropic RBF variant of the kernel in eq.1, fixing the marginal variance σ
2to one. We vary the length- scale parameter l and the noise variance λ in log
10scale.
The top part of fig. 1 shows the number of iterations that the standard CG algorithm takes, where we have capped the number of iterations to 100,000.
The bottom part of the figure reports the improvement of- fered by various preconditioners measured as
log
10! # PCG iterations
# CG iterations
"
.
It is worth noting that when both CG and PCG fail to con- verge within the upper bound, the improvement will be marked as 0, i.e. neither a gain or a loss within the given bound. The results plotted in fig. 1 indicate that the low- rank preconditioners (PITC, FITC and Nystr¨om) achieve
significant reductions in the number of iterations for each dataset, and all approaches work best when the lengthscale is longer, characterising smoother processes. In contrast, preconditioning seems to be less effective when the length- scale is shorter, corresponding to a kernel matrix that is more sparse. However, for cases yielding positive results, the improvement is often in the range of an order of mag- nitude, which can be substantial when a large number of iterations is required by the CG algorithm.
The results also confirm that, as alluded to in the previous section, Block Jacobi preconditioning is generally a poor preconditioner, particularly when the corresponding kernel matrix is dense. The only minor improvements were ob- served when CG itself converges quickly, in which case preconditioning serves very little purpose either way.
The regularization approach with flexible conjugate gradi- ent does not appear to be effective in any case either, partic- ularly due to the substantial amount of iterations required for solving an inner system at every iteration of the PCG algorithm. This implies that introducing additional small jitter to the diagonal does not necessarily make the sys- tem much easier to solve, whilst adding an overly large offset would negatively impact convergence of the outer al- gorithm. One could assume that tuning the value of this parameter could result in slightly better results; however, preliminary experiments in this regard yielded only minor
Fig. 4:Comparison of preconditioners for different settings of kernel parameters across multiple datasets. Top: Number of iterations required to solve the corresponding linear system using CG. Bottom: Rate of improvement (blue) or degradation (red) achieved by using PCG to solve the same linear system.
Motivating Example - Gaussian Processes
! Marginal likelihood:
log[p(y|par)] = − 1
2 log |K
y| − 1
2 y
⊤K
−y1y + const.
! Derivatives wrt par:
∂ log[p(y | par)]
∂par
i= − 1 2 Tr
&
K
y−1∂K
y∂par
i'
+ 1
2 y
⊤K
y−1∂K
y∂par
iK
y−1y
! Stochastic estimate of the trace - assuming E[rr
⊤] = I, then:
Tr
&
K
y−1∂K
y∂par
i'
= Tr
&
K
y−1∂K
y∂par
iE[rr
⊤] '
= E (
r
⊤K
y−1∂K
y∂par
ir )
≈ 1 N
rNr
*
i=1
r
(i)⊤K
y−1∂K
y∂par
ir
(i)! Linear systems only!
! Laplace approximation for non-Gaussian likelihoods may be formulated in a similar way!
Experimental Setup and Results
! Exact gradient-based optimization using Cholesky decomposition (CHOL);
! Stochastic gradient-based optimization using ADAGRAD - using CG and PCG;
! GP Approximations:
– Variational learning of inducing variables (VAR);
– Fully Independent Training Conditional (FITC);
– Partially Independent Training Conditional (PITC).
Classification
Spam Dataset(n= 4061,d=57)0.5 1.0 1.5 2.0 2.5 3.0 3.5
0.040.080.12
log10
(
seconds)
Error Rate
−1 0 1 2 3
152535
log10
(
seconds)
Negative Test Log−Lik
EEG Dataset(n= 14979,d=14)
1.0 1.5 2.0 2.5 3.0 3.5
0.050.150.25
log10
(
seconds)
Error Rate
0 1 2 3 4
2030405060
log10
(
seconds)
Negative Test Log−Lik
Regression
Power plant Dataset(n= 9568,d=4)0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5
0.180.200.22
log10
(
seconds)
RMSE
0 1 2 3
−40−200
log10
(
seconds)
Negative Test Log−Lik
Protein Dataset(n= 45730,d=9)
1.5 2.0 2.5 3.0 3.5 4.0
0.600.640.680.72
log10
(
seconds)
RMSE
1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
200300400
log10
(
seconds)
Negative Test Log−Lik
PCG CG CHOL FITC PITC VAR
Fig. 5:
Error and negative log likelihood on
√n
held out test data over time. Curves are averaged over multiple repetitions.
Conclusions
! Our solution:
✓ Exact in the limit of iterations;
✓ Straightforward to construct and easy to tune;
✓ Scalable to large datasets - no need to store K;
✓ Competitive with exact Cholesky decomposition;
✓ Superior to approximate methods.
! Ongoing work:
– Extending this work to other kernel functions and models;
– Implementation on a distributed framework;
– Exploiting PCG in the solution of f (K) z = v.
1
DGPs: Low-rank approximation of covariance at each layer
24
Inference for DGPs
• Low-Rank Approximation options - O (nm 2 )
• Call P as a low rank approximation to K y
• Woodbury identity exploits low rank structure of P
Preconditioning Kernel Matrices
K. Cutajar 1 , M. A. Osborne 2 , J. P. Cunningham 3 , M. Filippone 1
1 - EURECOM, Sophia Antipolis, France
2 - University of Oxford, Oxford, UK 3 - Columbia University, New York City, USA
Kernel Machines and Solving Linear Systems
! Operate in a high-dimensional, implicit feature space;
! Rely on the construction of an n × n Gram matrix K;
! Popular kernels:
– RBF : k (x
i, x
j) = σ
2exp !
−
12d
2"
; – Matérn : k
v=32
(x
i,x
j) = σ
2! 1 + √
3d "
exp !
− √ 3d "
; where d
2= (x
i− x
j)
⊤Λ (x
i− x
j).
! Involve the solution of linear systems Kz = v;
! Cholesky Decomposition:
– O (n
2) space and O (n
3) time - unfeasible for large n.
Fig. 1:
Kernel machines enable non-linear separation of data.
! Conjugate Gradient (CG):
– Numerical solution of linear systems constructed using matrix-vector multiplications;
– O (tn
2) for t CG iterations - in theory t = n (possibly worse).
z
z0
Fig. 2:
CG
! Preconditioned Conjugate Gradient (PCG):
– Transforms linear system to be better conditioned, improving convergence;
– Yields a new linear system of the form P
−1Kz = P
−1v;
– O(tn
2) for t PCG iterations - in practice t ≪ n.
z
z0
Fig. 3:
Preconditioned CG
Preconditioning Approaches
! Suppose we want to precondition K
y= K + λI;
! Our choice of preconditioner should:
– Approximate K
yas closely as possible;
– Be easy to invert.
! For low-rank preconditioners we employ the Woodbury inversion lemma:
K
y= P =
P
−1=
! For other preconditioners we solve inner linear systems once again using CG!
Formulation Strategy
Nyström P = K
XUK
U U−1K
U X+ λI where U ⊂ X Woodbury FITC P = K
XUK
U U−1K
U X+ diag !
K − K
XUK
U U−1K
U X"
+ λI Woodbury
PITC P = K
XUK
U U−1K
U X+ bldiag !
K − K
XUK
U U−1K
U X"
+ λI Woodbury
Spectral P
ij=
σm2#
m r=1cos $
2πs
⊤r(x
i− x
j) %
+ λI
ijWoodbury
Partial SVD K = AΛA
⊤⇒ P = A
[·,1:m]Λ
[1:m,1:m]A
⊤[1:m,·]+ λI Woodbury
Block Jacobi P = bldiag (K) + λI Block Inverse
SKI P = W K
U UW
⊤+ λI where K
U Uis Kronecker Inner CG
Regularization P = K + δI + λI Inner CG
Preconditioning Kernel Matrices
Concrete Dataset Power plant Dataset Protein Dataset
(n
= 1030, d= 8)(n
= 9568, d= 4)(n
= 45730, d= 9)-3 -2 -1 0 1 2 log10(l) -2 -4 -6
log10()
-3 -2 -1 0 1 2 log10(l) -2 -4 -6
log10()
-3 -2 -1 0 1 2 log10(l) -2 -4 -6
log10()
log
10(n
it)
0 5
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + + + + + Block Jacobi
-3 -2 -1 0 1 2 -2
-4
-6 +
PITC
-3 -2 -1 0 1 2 -2
-4 -6
+
+ +
+ + + FITC
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + +
Nystrom
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + + + + + + + Spectral
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + + + + + + Randomized SVD
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + + + + + + + + + + + + + Regularized
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + + + + + + + + + + + + +
SKI
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + + + + + Block Jacobi
-3 -2 -1 0 1 2 -2
-4 -6
+ + + PITC
-3 -2 -1 0 1 2 -2
-4 -6
+ + + FITC
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + + + +
Nystrom
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + + + + + +
Spectral
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + + + + Randomized SVD
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + + + + + + + + + + + + Regularized
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + + + + + + + + + +
SKI
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + + +
+ + +
Block Jacobi
-3 -2 -1 0 1 2 -2
-4 -6
+ + +
+ + +
PITC
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + FITC
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + +
Nystrom
-3 -2 -1 0 1 2 -2
-4 -6
+ + + + + + + + + + + +
Spectral
-3 -2 -1 0 1 2 -2
-4 -6 + + + + + Randomized SVD
-3 -2 -1 0 1 2 -2
-4 -6
Too Expensive!
Regularized
-3 -2 -1 0 1 2 -2
-4 -6
Too Expensive!
SKI
Loss
− 2
−1 0 1 2
Gain
Figure 1. Comparison of preconditioners for different settings of kernel parameters. The lengthscalel and the noise variance
λare shown on the x and y axes respectively. The top figure indicates the number of iterations required to solve the corresponding linear system using CG, whilst the bottom part of the figure shows the rate of improvement (negative - blue) or degradation (positive - red) achieved by using PCG to solve the same linear system. Parameters and results are reported inlog
10. Symbols added to facilitate reading in B/W print.
tioner requires one matrix-vector product, and we add this to the overall count of such computations. For this precon- ditioner, we add a diagonal offset δ to the original matrix, equivalent to two orders of magnitude greater than the noise of the process. In general, although the complexity of PCG is indeed no different from that of CG, we emphasize that experiencing a 2-fold or 5-fold (in some cases even an order of magnitude) improvement can be very substantial when plain CG takes very long to converge or when the dataset is large.
We focus on an isotropic RBF variant of the kernel in eq.1, fixing the marginal variance σ
2to one. We vary the length- scale parameter l and the noise variance λ in log
10scale.
The top part of fig. 1 shows the number of iterations that the standard CG algorithm takes, where we have capped the number of iterations to 100,000.
The bottom part of the figure reports the improvement of- fered by various preconditioners measured as
log
10! # PCG iterations
# CG iterations
"
.
It is worth noting that when both CG and PCG fail to con- verge within the upper bound, the improvement will be marked as 0, i.e. neither a gain or a loss within the given bound. The results plotted in fig. 1 indicate that the low- rank preconditioners (PITC, FITC and Nystr¨om) achieve
significant reductions in the number of iterations for each dataset, and all approaches work best when the lengthscale is longer, characterising smoother processes. In contrast, preconditioning seems to be less effective when the length- scale is shorter, corresponding to a kernel matrix that is more sparse. However, for cases yielding positive results, the improvement is often in the range of an order of mag- nitude, which can be substantial when a large number of iterations is required by the CG algorithm.
The results also confirm that, as alluded to in the previous section, Block Jacobi preconditioning is generally a poor preconditioner, particularly when the corresponding kernel matrix is dense. The only minor improvements were ob- served when CG itself converges quickly, in which case preconditioning serves very little purpose either way.
The regularization approach with flexible conjugate gradi- ent does not appear to be effective in any case either, partic- ularly due to the substantial amount of iterations required for solving an inner system at every iteration of the PCG algorithm. This implies that introducing additional small jitter to the diagonal does not necessarily make the sys- tem much easier to solve, whilst adding an overly large offset would negatively impact convergence of the outer al- gorithm. One could assume that tuning the value of this parameter could result in slightly better results; however, preliminary experiments in this regard yielded only minor
Fig. 4:Comparison of preconditioners for different settings of kernel parameters across multiple datasets. Top: Number of iterations required to solve the corresponding linear system using CG. Bottom: Rate of improvement (blue) or degradation (red) achieved by using PCG to solve the same linear system.
Motivating Example - Gaussian Processes
! Marginal likelihood:
log[p(y|par)] = − 1
2 log |K
y| − 1
2 y
⊤K
−y1y + const.
! Derivatives wrt par:
∂ log[p(y | par)]
∂par
i= − 1 2 Tr
&
K
y−1∂K
y∂par
i'
+ 1
2 y
⊤K
y−1∂K
y∂par
iK
y−1y
! Stochastic estimate of the trace - assuming E[rr
⊤] = I, then:
Tr
&
K
y−1∂K
y∂par
i'
= Tr
&
K
y−1∂K
y∂par
iE[rr
⊤] '
= E (
r
⊤K
y−1∂K
y∂par
ir )
≈ 1 N
rNr
*
i=1
r
(i)⊤K
y−1∂K
y∂par
ir
(i)! Linear systems only!
! Laplace approximation for non-Gaussian likelihoods may be formulated in a similar way!
Experimental Setup and Results
! Exact gradient-based optimization using Cholesky decomposition (CHOL);
! Stochastic gradient-based optimization using ADAGRAD - using CG and PCG;
! GP Approximations:
– Variational learning of inducing variables (VAR);
– Fully Independent Training Conditional (FITC);
– Partially Independent Training Conditional (PITC).
Classification
Spam Dataset(n= 4061,d=57)0.5 1.0 1.5 2.0 2.5 3.0 3.5
0.040.080.12
log10
(
seconds)
Error Rate
−1 0 1 2 3
152535
log10
(
seconds)
Negative Test Log−Lik
EEG Dataset(n= 14979,d=14)
1.0 1.5 2.0 2.5 3.0 3.5
0.050.150.25
log10
(
seconds)
Error Rate
0 1 2 3 4
2030405060
log10
(
seconds)
Negative Test Log−Lik
Regression
Power plant Dataset(n= 9568,d=4)0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5
0.180.200.22
log10
(
seconds)
RMSE
0 1 2 3
−40−200
log10
(
seconds)
Negative Test Log−Lik
Protein Dataset(n= 45730,d=9)
1.5 2.0 2.5 3.0 3.5 4.0
0.600.640.680.72
log10
(
seconds)
RMSE
1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
200300400
log10
(
seconds)
Negative Test Log−Lik
PCG CG CHOL FITC PITC VAR
Fig. 5:
Error and negative log likelihood on
√n
held out test data over time. Curves are averaged over multiple repetitions.
Conclusions
! Our solution:
✓ Exact in the limit of iterations;
✓ Straightforward to construct and easy to tune;
✓ Scalable to large datasets - no need to store K;
✓ Competitive with exact Cholesky decomposition;
✓ Superior to approximate methods.
! Ongoing work:
– Extending this work to other kernel functions and models;
– Implementation on a distributed framework;
– Exploiting PCG in the solution of f (K) z = v.
1
DGPs: Low-rank approximation of covariance at each layer
24
Scalable Expectation Propagation for DGPs
• Pseudo-inputs Z (i)
• Inducing variables U (i)
• VI targets q
U (i)
• Assuming q U (i)
∝ p U (i)
g U (i) N
learn g as an average data factor
• Reduces memory and allows for factorization of the objective (output of each layer made Gaussian)
F
(1)Y
θ
(1)X
F
(2)θ
(2)U
(1)Z
(1)U
(1)Z
(2)U
(1)Z
(3)θ
(3)Bui et al.,ICML, 2016
25
Scalable Expectation Propagation for DGPs
• Pseudo-inputs Z (i)
• Inducing variables U (i)
• VI targets q
U (i)
• Assuming q U (i)
∝ p U (i)
g U (i) N
learn g as an average data factor
• Reduces memory and allows for factorization of the objective (output of each layer made Gaussian)
F
(1)Y
θ
(1)X
F
(2)θ
(2)U
(1)Z
(1)U
(1)Z
(2)U
(1)Z
(3)θ
(3)Bui et al.,ICML, 2016
25
Inducing Points for DGPs extending Titsias, AISTATS, 2009
• Pseudo-inputs Z (i)
• Inducing variables U (i)
• VI targets q F (i) , U (i) | F (i − 1) p
F (i) | U (i) , F (i − 1)
q
U (i)
• Lower bound factorizes across training points. . .
• . . . and the i th marginal of the final layer depends only on the i th marginals of all layers
F
(1)Y
θ
(1)X
F
(2)θ
(2)U
(1)Z
(1)U
(1)Z
(2)U
(1)Z
(3)θ
(3)Hensman and Lawrence,arXiv, 2014 – Salimbeni and Deisenroth,NIPS, 2017
26
Inducing Points for DGPs extending Titsias, AISTATS, 2009
• Pseudo-inputs Z (i)
• Inducing variables U (i)
• VI targets q F (i) , U (i) | F (i − 1) p
F (i) | U (i) , F (i − 1)
q
U (i)
• Lower bound factorizes across training points. . .
• . . . and the i th marginal of the final layer depends only on the i th marginals of all layers
F
(1)Y
θ
(1)X
F
(2)θ
(2)U
(1)Z
(1)U
(1)Z
(2)U
(1)Z
(3)θ
(3)Hensman and Lawrence,arXiv, 2014 – Salimbeni and Deisenroth,NIPS, 2017
26
Random Feature Expansions for DGPs - Bochner’s theorem
• Continuous shift-invariant covariance function k(x i − x j | θ) = σ 2
Z
p(ω | θ) exp
ι(x i − x j ) > ω
d ω
• Monte Carlo estimate
k(x i − x j | θ) ≈ σ 2 N RF
N
RFX
r =1
z(x i | ω ˜ r ) > z(x j | ω ˜ r )
with
˜
ω r ∼ p(ω | θ)
z(x | ω) = [cos(x > ω), sin(x > ω)] >
Rahimi and Recht,NIPS, 2008 - L´azaro-Gredilla et al.,JMLR, 2010
27
Random Feature Expansions for DGPs - Bochner’s theorem
• Continuous shift-invariant covariance function k(x i − x j | θ) = σ 2
Z
p(ω | θ) exp
ι(x i − x j ) > ω
d ω
• Monte Carlo estimate
k(x i − x j | θ) ≈ σ 2 N RF
N
RFX
r =1
z(x i | ω ˜ r ) > z(x j | ω ˜ r )
with
˜
ω r ∼ p(ω | θ)
z(x | ω) = [cos(x > ω), sin(x > ω)] >
Rahimi and Recht,NIPS, 2008 - L´azaro-Gredilla et al.,JMLR, 2010
27
Random Feature Expansions for DGPs
• Define
Φ (l) = s σ 2
N RF (l) h
cos
F (l ) Ω (l)
, sin
F (l) Ω (l) i
and
F (l+1) = Φ (l) W (l)
• We are stacking Bayesian linear models with p
W (l) · i
= N (0, I)
• Expansion of arc-cosine kernel yields ReLU activations!
Cutajar, Bonilla, Michiardi, Filippone,ICML, 2017
28
Random Feature Expansions for DGPs
• Define
Φ (l) = s σ 2
N RF (l) h
cos
F (l ) Ω (l)
, sin
F (l) Ω (l) i
and
F (l+1) = Φ (l) W (l)
• We are stacking Bayesian linear models with p
W (l) · i
= N (0, I)
• Expansion of arc-cosine kernel yields ReLU activations!
Cutajar, Bonilla, Michiardi, Filippone,ICML, 2017
28
DGPs with random features become DNNs
θ
(0)θ
(1)Φ
(0)X F
(1)Φ
(1)F
(2)Y
Ω
(0)W
(0)Ω
(1)W
(1)We can learn the model using Stochastic Variational Inference for Bayesian DNNs!
Cutajar, Bonilla, Michiardi, Filippone,ICML, 2017
29
Results - Classification
EEG dataset (n = 14979, d = 14)
2 2.5 3 3.5
0 0.1 0.2
log
10(sec) Error rate
2 2.5 3 3.5
0.2 0.4
log
10(sec) MNLL
dgp-rbf dgp-arc dgp-ep dnn var-gp
Cutajar, Bonilla, Michiardi, Filippone,ICML, 2017
30
Results - Multiclass Classification
MNIST dataset (n = 60000, d = 784)
3 3.5 4 4.5
0 0.05 0.1 0.15 0.2
log
10(sec) Error rate
3 3.5 4 4.5
0 1 2 3
log
10(sec) MNLL
dgp-rbf dgp-arc dgp-ep dnn var-gp
Cutajar, Bonilla, Michiardi, Filippone,ICML, 2017
31
Results - MNIST-8M
• Variant of MNIST with 8.1M images
• 99+% accuracy!
• Also, check out Krauth et al., UAI 2017
Cutajar, Bonilla, Michiardi, Filippone,ICML, 2017 – Krauth, Cutajar, Bonilla, Filippone,UAI, 2017
32
Results - Model (Depth) Selection
Airline dataset (n = 5 M +, d = 8)
2 3 4 5
0.2 0.3 0.4 0.5
log10(sec) Error rate
2 3 4 5
0.45 0.5 0.55 0.6
log10(sec) MNLL
2 10 20 30 2.6
2.7
·106
Layers Neg. Lower Bound
2 layers 10 layers 20 layers 30 layers SV-DKL
Cutajar, Bonilla, Michiardi, Filippone,ICML, 2017
33
Convolutional Deep Gaussian
Processes
Convolutional Nets
• Convolutional nets are widely used. . .
• . . .but they are known to be overconfident!
Convolution + ReLU
Convolution + ReLU
Pooling Pooling Fully connected layers Output
Guo et al.,ICML, 2017
34
Calibration as a Measure of Quantification of Uncertainty
• Reliability diagrams
Predicted value
F ra ct io n po si tive s
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
35
Calibration as a Measure of Quantification of Uncertainty
• Reliability diagrams
Predicted value
F ra ct io n po si tive s
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
36
Calibration as a Measure of Quantification of Uncertainty
• Reliability diagrams - Under-confident predictions
Predicted value
F ra ct io n po si tive s
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
• We can extract the Expected Calibration Error (ece) score
• The brier score is another measure of calibration
37
Calibration as a Measure of Quantification of Uncertainty
• Reliability diagrams - Overconfident predictions
Predicted value
F ra ct io n po si tive s
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Reliability diagrams of modern Deep CNNs look like this! Post-calibration fixes it
38
Calibration as a Measure of Quantification of Uncertainty
• Reliability diagrams - Overconfident predictions
Predicted value
F ra ct io n po si tive s
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Reliability diagrams of modern Deep CNNs look like this!
Post-calibration fixes it
38Combining Convolutional Nets with GPs
• There have been attempts to combine CNNs with GPs
• Most popular ones replace fully connected layers with GPs
Convolution + ReLU
Convolution + ReLU
Pooling Pooling Gaussian Process Output
Wilson et al.,NIPS, 2016 – Bradshaw et al.,arXiv, 2017 – Tran et al.,arXiv, 2018
39
Combining Convolutional Nets with GPs
• There have been attempts to combine CNNs with GPs
• Most popular ones replace fully connected layers with GPs
Convolution + ReLU
Convolution + ReLU
Pooling Pooling Deep Gaussian Process Output
• Better quantification of uncertainty??
NO!
Tran et al.,arXiv, 2018
40
Combining Convolutional Nets with GPs
• There have been attempts to combine CNNs with GPs
• Most popular ones replace fully connected layers with GPs
Convolution + ReLU
Convolution + ReLU
Pooling Pooling Deep Gaussian Process Output
• Better quantification of uncertainty?? NO!
Tran et al.,arXiv, 2018
40
Existing Combinations of CNNs and GPs
• Convolutional Neural Nets - CNN
• Hybrid GPs and DNNs - gpdnn
• Stochastic Variational Deep Kernel Learning - svdkl
• Convolutional GP - cgp
0.0 0.5 1.0
predictive output 0.0
0.5 1.0
accuracy
CNN
0.0 0.5 1.0
predictive output 0.0
0.5
1.0 GPDNN
0.0 0.5 1.0
predictive output 0.0
0.5
1.0 CGP
0.0 0.5 1.0
predictive output 0.0
0.5
1.0 SVDKL
cifar10-lenet cifar10-resnet cifar100-resnet cifar10 cifar100
Bradshaw et al.,arXiv, 2017 – Wilson et al.,NIPS, 2016 – van der Wilk et al.,NIPS, 2017
41
Bayesian CNNs are calibrated
• Inferring parameters of convolutional filter recovers calibration
• Example with Monte Carlo Dropout
0.0 0.5 1.0
0.0 0.5 1.0
cifar10-lenet
ECE: 0.0254 BRIER: 0.3951
1/8
0.0 0.5 1.0
0.0 0.5 1.0
ECE: 0.0145BRIER: 0.3300
1/4
0.0 0.5 1.0
0.0 0.5 1.0
ECE: 0.0270BRIER: 0.2827
1/2
0.0 0.5 1.0
0.0 0.5 1.0
ECE: 0.0495BRIER: 0.2404
1
0.0 0.5 1.0
0.0 0.5 1.0
cifar10-resnet
ECE: 0.0485 BRIER: 0.3518
0.0 0.5 1.0
0.0 0.5 1.0
ECE: 0.0140BRIER: 0.2765
0.0 0.5 1.0
0.0 0.5 1.0
ECE: 0.0295BRIER: 0.2126
0.0 0.5 1.0
0.0 0.5 1.0
ECE: 0.0456BRIER: 0.1799
0.0 0.5 1.0
predictive output 0.0
0.5 1.0
cifar100-resnet
ECE: 0.2201 BRIER: 0.8334
0.0 0.5 1.0
predictive output 0.0
0.5 1.0
ECE: 0.1357BRIER: 0.6886
0.0 0.5 1.0
predictive output 0.0
0.5 1.0
ECE: 0.0492BRIER: 0.5463
0.0 0.5 1.0
predictive output 0.0
0.5 1.0
ECE: 0.0270BRIER: 0.4615
RELIABILITY DIAGRAM FOR MCD
Tran et al.,arXiv, 2018 42
Bayesian CNNs with DGPs with Random Features
• We extended our work on Random Feature Expansions for DGPs to replace fully connected layers
0.0 0.5 1.0
0.0 0.5 1.0
cifar10-lenet
ECE: 0.1034 BRIER: 0.5104
1/8
0.0 0.5 1.0
0.0 0.5 1.0
ECE: 0.0679BRIER: 0.4117
1/4
0.0 0.5 1.0
0.0 0.5 1.0
ECE: 0.0410BRIER: 0.3293
1/2
0.0 0.5 1.0
0.0 0.5 1.0
ECE: 0.0205BRIER: 0.2701
1
0.0 0.5 1.0
0.0 0.5 1.0
cifar10-resnet
ECE: 0.0203 BRIER: 0.3239
0.0 0.5 1.0
0.0 0.5 1.0
ECE: 0.0242BRIER: 0.2491
0.0 0.5 1.0
0.0 0.5 1.0
ECE: 0.0562BRIER: 0.1997
0.0 0.5 1.0
0.0 0.5 1.0
ECE: 0.0725BRIER: 0.1766
0.0 0.5 1.0
predictive output 0.0
0.5 1.0
cifar100-resnet
ECE: 0.0979 BRIER: 0.7468
0.0 0.5 1.0
predictive output 0.0
0.5 1.0
ECE: 0.0194BRIER: 0.6137
0.0 0.5 1.0
predictive output 0.0
0.5 1.0
ECE: 0.0639BRIER: 0.5092
0.0 0.5 1.0
predictive output 0.0
0.5 1.0
ECE: 0.1156BRIER: 0.4523
RELIABILITY DIAGRAM FOR RF
Tran et al.,arXiv, 2018 43
Comparison with competitors
SHALLOW CONVOLUTIONAL STRUCTURE
err mnll ece brier
mnist cif ar10
3.8 4.0 4.2 4.4 4.6 4.8 0.00
0.01 0.02 0.03 0.04
3.8 4.0 4.2 4.4 4.6 4.8 0.0
0.1 0.2 0.3
3.8 4.0 4.2 4.4 4.6 4.8 0.0
0.2 0.4 0.6
3.8 4.0 4.2 4.4 4.6 4.8 0.0
0.1 0.2 0.3 0.4
3.8 4.0 4.2 4.4 4.6 4.8 0.0
0.2 0.4 0.6 0.8
3.8 4.0 4.2 4.4 4.6 4.8 0
1 2 3 4
3.8 4.0 4.2 4.4 4.6 4.8 0.0
0.1 0.2 0.3 0.4
3.8 4.0 4.2 4.4 4.6 4.8 0.2
0.4 0.6 0.8 1.0
cnn+gp(rf)
cnn+gp(sorf) H cnn+mcd
N cnn+cal × gpdnn
? cgp
+ svdkl
Tran et al.,arXiv, 2018
44
Comparison with competitors
DEEP CONVOLUTIONAL STRUCTURE
err mnll ece brier
cif ar10 cif ar100
3.8 4.0 4.2 4.4 4.6 4.8 0.0
0.2 0.4 0.6 0.8
3.8 4.0 4.2 4.4 4.6 4.8 0
2 4 6 8
3.8 4.0 4.2 4.4 4.6 4.8 0.0
0.2 0.4 0.6
3.8 4.0 4.2 4.4 4.6 4.8 0.0
0.5 1.0 1.5
3.8 4.0 4.2 4.4 4.6 4.8 0.2
0.4 0.6 0.8 1.0
3.8 4.0 4.2 4.4 4.6 4.8 0
5 10 15
3.8 4.0 4.2 4.4 4.6 4.8 0.0
0.2 0.4 0.6
3.8 4.0 4.2 4.4 4.6 4.8 0.0
0.5 1.0 1.5
log
10(train size) log
10(train size) log
10(train size) log
10(train size) cnn+gp(rf)
cnn+gp(sorf) H cnn+mcd
N cnn+cal × gpdnn
? cgp
+ svdkl
Tran et al.,arXiv, 2018
45
Analysis of Depth of DGP
• Increasing depth of DGP slightly improves error rate. . .
• . . . and slightly worsen calibration
err mnll ece brier
2 4 6 8 10
0.14 0.15 0.16 0.17
2 4 6 8 10
0.44 0.46 0.48 0.50
2 4 6 8 10
0.00 0.01 0.02 0.03 0.04
2 4 6 8 10
0.21 0.22 0.23 0.24
depth depth depth depth
Tran et al.,arXiv, 2018
46
Other Interesting DGP-based models
• Autoencoders
Dai et al.ICLR, 2015 – Domingues et al.,Mach. Learn., 2018• DGPs with constrained dynamics
Lorenzi and Filippone,ICML, 201847
Conclusions
Conclusions
• DGPs offer probabilistic deep learning with sensible priors
• Inference for DGPs is hard
• Model approximations
• Approximate inference
• Difficult to assess the impact of these approximations
48
Conclusions
• DGPs offer probabilistic deep learning with sensible priors
• Inference for DGPs is hard
• Model approximations
• Approximate inference
• Difficult to assess the impact of these approximations
48
Conclusions
• We are borrowing ideas from GPs and deep learning
• Stochastic-based approximate inference
• Low-rank process decompositions
• Algebraic/computational tricks
49