Bayesian deep learning

(1)

Bayesian Deep Learning

Maurizio Filippone

EURECOM, Sophia Antipolis, France January 10^th, 2019

(2)

Outline

1 Motivation

2 Probabilistic Deep Nets Scalable Inference

Connections with (Deep) Gaussian Processes

3 Some Results 4 Conclusions

2

(3)

Motivation

(4)

Quantification of Uncertainty with Expensive Models

• Climate modeling

Kennedy and O’Hagan,J-RSS-B, 2001

3

(5)

Quantification of Uncertainty with No Models

• Classification and progression of neurodegenerative diseases

(6)

A Unified Framework

A model might be expensive to simulate/inaccurate

• Emulate model/discrepancy using a surrogate

A model might not even be available

• Replace it with a flexible statistical model

Probabilistic Deep Models for Accurate Modeling and Quantification of Uncertainty

5

(7)

A Unified Framework

(8)

A Unified Framework

5

(9)

Probabilistic Deep Nets

(10)

Learning from Data – Function Estimation

• Take these two examples

−3 −2 −1 0 1 2 3

−3−2−10123

x

y

●

−3 −2 −1 0 1 2 3

0.00.20.40.60.81.0

x

y

• We are interested in estimating a function f(x) from data

• Most problems in Machine Learning can be cast this way!

6

(11)

Deep Neural Networks

• Implement a composition of parametric functions f(x) =f^(L)

f^(L−1)

· · ·f⁽¹⁾(x)· · · with

f^(l⁾(h) =g

h^>W^(l)

f⁽¹⁾

x f⁽²⁾ f⁽³⁾ f⁽⁴⁾ y

W⁽¹⁾ W⁽²⁾ W⁽³⁾ W⁽⁴⁾

(12)

Back-propagation – Probabilistic Interpretation Loss

• Inputs:X ={x₁, . . . ,x_N}

• Labels:Y ={y₁, . . . ,y_N}

• Weights:W ={W⁽¹⁾, . . . ,W^(L)}

Quadratic Loss p(Y|X,W)∝exp(−Loss)

051015 0.00.40.8

• Back-propagation minimizes a loss function

• . . . equivalent as optimizing likelihoodp(Y|X,W)

8

(13)

Bayesian Inference

• Inputs:X ={x₁, . . . ,x_N}

• Labels:Y ={y₁, . . . ,yN}

• Weights:W ={W⁽¹⁾, . . . ,W^(L)}

p(W) p(W|Y,X)

p(W|Y,X) = p(Y|X,W)p(W) Z

p(Y|X,W)p(W)dW

(14)

Bayesian Deep Neural Networks

• Regression example

−3 −2 −1 0 1 2 3

−3−2−10123

x

y

●

−3 −2 −1 0 1 2 3

−3−2−10123

x

y

●

10

(15)

Bayesian Deep Neural Networks

• Classification example

−3 −2 −1 0 1 2 3

0.00.20.40.60.81.0

x

y

−3 −2 −1 0 1 2 3

0.00.20.40.60.81.0

x

y

(16)

Stochastic Variational Inference

• Bayesian inference is intractable due to this integral log [p(Y|X)] = log

Z

p(Y|X,W)p(W)dW

• Lower bound for log [p(Y|X)]

E_q(W₎(log [p(Y|X,W)])−KL[q(W)kp(W)] , where q(W) approximates p(W|Y,X).

• Kullback-Leibler divergenceKL – “distance” betweenq andp Optimize the lower bound wrt the parameters of q(W)

Graves,NIPS, 2011

12

(17)

Z

p(Y|X,W)p(W)dW

E_q(W₎(log [p(Y|X,W)])−KL[q(W)kp(W)] , whereq(W) approximates p(W|Y,X).

• Kullback-Leibler divergenceKL – “distance” betweenq andp

Optimize the lower bound wrt the parameters of q(W)

Graves,NIPS, 2011

(18)

Z

p(Y|X,W)p(W)dW

E_q(W₎(log [p(Y|X,W)])−KL[q(W)kp(W)] , whereq(W) approximates p(W|Y,X).

• Kullback-Leibler divergenceKL – “distance” betweenq andp Optimize the lower bound wrt the parameters of q(W)

Graves,NIPS, 2011 12

(19)

• Assume that the likelihood factorizes p(Y|X,W) =Y

k

p(yk|x_k,W)

• Doubly stochasticunbiasedestimate of the expectation term

• Mini-batch

Eq(W)(log [p(Y|X,W)])≈ n m

X

k∈Im

Eq(W)(log [p(y_k|xk,W)])

• Monte Carlo

Eq(W)(log [p(yk|xk,W)])≈ 1 NMC

N_MC

X

r=1

log[p(yk|xk,W˜r)]

with ˜W_r ∼q(W).

(20)

• Assume that the likelihood factorizes p(Y|X,W) =Y

k

p(yk|x_k,W)

• Doubly stochasticunbiasedestimate of the expectation term

• Mini-batch

Eq(W)(log [p(Y|X,W)])≈ n m

X

k∈Im

Eq(W)(log [p(y_k|xk,W)])

• Monte Carlo

Eq(W)(log [p(yk|xk,W)])≈ 1 NMC

N_MC

X

r=1

log[p(yk|xk,W˜r)]

with ˜W_r ∼q(W).

13

(21)

• Assume a factorized Gaussian approximate posterior:

q(W) =Y

ijl

q

W^(l)_ij

=Y

ijl

N

µ^(l)_ij ,(σ²)^(l)_ij

(1)

• Reparameterization trick

( ˜W^(l)_r )ij =σ^(l)_ij ε^(l)_rij +µ^(l)_ij , with ε^(l)_rij ∼ N(0,1)

• Optimization wrt µ^(l)_ij ,(σ²)^(l)_ij with automatic differentiation

Kingma and Welling,ICLR, 2014

(22)

• Assume a factorized Gaussian approximate posterior:

q(W) =Y

ijl

q

W^(l)_ij

=Y

ijl

N

µ^(l)_ij ,(σ²)^(l)_ij

(1)

• Reparameterization trick

( ˜W^(l)_r )ij =σ^(l)_ij ε^(l_rij⁾+µ^(l)_ij , with ε^(l)_rij ∼ N(0,1)

• Optimization wrtµ^(l)_ij ,(σ²)^(l)_ij with automatic differentiation

Kingma and Welling,ICLR, 2014

14

(23)

Stochastic Gradient Optimization

En

^∇_par_qLowerBoundo

=∇_par_qLowerBound

−4 −2 0 2 4

−4−2024

●

−4 −2 0 2 4

−4−2024

●

● ●

● ●●

●

●●

●

●●●

●

● ●

●

● ●

●

●●

●

● ●

●

● ●

●

● ●

●

●●

●

● ●

●

● ●

●

●●

●

(24)

Stochastic Variational Inference - Simple Illustration parq0

=parq+αt

2 ^∇_par_q(LowerBound) αt →0

16

−4 −2 0 2 4

−4−2024

●

0 20 40 60 80 100

0.00.20.40.60.81.0

Iteration

p(par | data)

5 0 5

4 2 0 2 4

2 1 0

2.0 1.5 1.0 0.5 0.0 0.5

(25)

Form of Approximating Distribution

Approximating distributionq(W) can have the following forms:

• Fully factorized

−505

• Full covariance Σ =LL^>

−505

• Normalizing Flows, Real NVPs, Stein VI - change of measure determined by det(Jacobian)

−505 −505

Rezende et al.,ICML, 2015 – Dinh et al.,ICLR, 2017 – Liu and Wang,NIPS, 2016

(26)

−505

−505 −505

Rezende et al.,ICML, 2015 – Dinh et al.,ICLR, 2017 – Liu and Wang,NIPS, 2016

17

(27)

−505

−505 −505

(28)

Initialization of SVI matters

• Initialization can be an issue

• We proposed a novel way to initialize SVI well

−10 −5 0 5 10

−2

−1 0 1 2

AFTER POOR INITIALIZATION

−10 −5 0 5 10

−2

−1 0 1 2

AFTER OUR INITIALIZATION

Rossi, Michiardi and Filippone,arXiv, 2018

18

(29)

Is There an Easier Way?

• Dropout is Variational Inference with Bernoulli-like q(W)

• At training time, apply dropout

Iteration 1 Iteration 2 Iteration 3 . . . . . .

• At test time, “sample” networks with different dropout masks

−3 −2 −1 0 1 2 3

−3−2−10123

x

y

●

(30)

Gaussian Processes as Infinitely-Wide Shallow Neural Nets

• Take W⁽ⁱ⁾ ∼ N(0, α_iI)

• Central Limit Theorem implies that F is Gaussian

...

Φ

X F

W⁽⁰⁾ W⁽¹⁾

• F has zero-mean

• cov(F) =E_p(W(0),W⁽¹⁾)[Φ(XW⁽⁰⁾)W⁽¹⁾W^(1)>Φ(XW⁽⁰⁾)^>]

Neal,LNS, 1996 – Rasmussen and Williams, 2006

20

(31)

Gaussian Processes as Infinitely-Wide Shallow Neural Nets

• Take W⁽ⁱ⁾ ∼ N(0, α_iI)

• Central Limit Theorem implies that F is Gaussian

...

Φ

X F

W⁽⁰⁾ W⁽¹⁾

• F has zero-mean

• cov(F) =α₁E_p(W(0))[Φ(XW⁽⁰⁾)Φ(XW⁽⁰⁾)^>]

• Some choices of Φ lead to analytic expression of known kernels (RBF, Mat´ern, arc-cosine, Brownian motion, . . .)

(32)

Random Feature Expansions for DGPs - Bochner’s theorem

• Continuous shift-invariant covariance function k(xi−xj|θ) =σ²

Z

p(ω|θ) exp

ι(xi−xj)^>ω

dω

• Monte Carlo estimate k(xi−xj|θ)≈ σ²

NRF NRF

X

r=1

z(xi|˜ωr)^>z(xj|ω˜r) with

˜

ωr ∼p(ω|θ)

z(x|ω) = [cos(x^>ω),sin(x^>ω)]^>

Rahimi and Recht,NIPS, 2008 - L´azaro-Gredilla et al.,JMLR, 2010

22

(33)

Random Feature Expansions for DGPs - Bochner’s theorem

• Continuous shift-invariant covariance function k(xi−xj|θ) =σ²

Z

p(ω|θ) exp

ι(xi−xj)^>ω

dω

• Monte Carlo estimate k(xi−xj|θ)≈ σ²

NRF NRF

X

r=1

z(xi|˜ωr)^>z(xj|ω˜r) with

˜

ωr ∼p(ω|θ)

z(x|ω) = [cos(x^>ω),sin(x^>ω)]^>

(34)

Random Feature Expansions for DGPs

• Define

Φ^(l)= s σ²

N_RF^(l) h

cos

F^(l⁾Ω^(l)

,sin

F^(l)Ω^(l) i

and

F^(l+1) = Φ^(l)W^(l)

• We are stacking Bayesian linear models with p

W^(l)_·i

=N (0,I)

• Expansion of arc-cosine kernel yields ReLU activations!

Cutajar, Bonilla, Michiardi, Filippone,ICML, 2017

23

(35)

Random Feature Expansions for DGPs

• Define

Φ^(l)= s σ²

N_RF^(l) h

cos

F^(l⁾Ω^(l)

,sin

F^(l)Ω^(l) i

and

F^(l+1) = Φ^(l)W^(l)

• We are stacking Bayesian linear models with p

W^(l)_·i

=N (0,I)

• Expansion of arc-cosine kernel yields ReLU activations!

(36)

Random Feature Expansions make Deep GPs become DNNs

θ⁽⁰⁾ θ⁽¹⁾

Φ⁽⁰⁾

X F⁽¹⁾ Φ⁽¹⁾ F⁽²⁾ Y

Ω⁽⁰⁾ W⁽⁰⁾ Ω⁽¹⁾ W⁽¹⁾

Cutajar, Bonilla, Michiardi, Filippone,ICML, 2017

24

(37)

Some Results

(38)

Results - Model (Depth) Selection

Airline dataset (n = 5M+, d = 8)

2 3 4 5

0.2 0.3 0.4 0.5

log10(sec) Error rate

2 3 4 5

0.45 0.5 0.55 0.6

log10(sec) MNLL

2 10 20 30 2.6

2.7

·10⁶

Layers Neg. Lower Bound

2 layers 10 layers 20 layers 30 layers SV-DKL

Cutajar, Bonilla, Michiardi, Filippone,ICML, 2017 25

(39)

Convolutional Nets

• Convolutional nets are widely used. . .

• . . .but they are known to be overconfident!

Convolution + ReLU

Pooling Pooling Fully connected layers Output

(40)

Calibration as a Measure of Quantification of Uncertainty

• Reliability diagrams

Predicted value

Fraction positives

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

27

(41)

• Reliability diagrams

Predicted value

Fraction positives

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

(42)

• Reliability diagrams - Under-confident predictions

Predicted value

Fraction positives

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

• We can extract the Expected Calibration Error (ece) score

• The brierscore is another measure of calibration

29

(43)

• Reliability diagrams - Overconfident predictions

Predicted value

Fraction positives

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Reliability diagrams of modern Deep CNNs look like this! Bayesian treatment of filters fixes it!

(44)

• Reliability diagrams - Overconfident predictions

Predicted value

Fraction positives

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Reliability diagrams of modern Deep CNNs look like this!

Bayesian treatment of filters fixes it! ₃₀

(45)

Bayesian CNNs are calibrated

• Inferring parameters of convolutional filter recovers calibration

• Example with Monte Carlo Dropout

0.0 0.5 1.0

cifar10-lenet

ECE: 0.0254 BRIER: 0.3951

1/8

0.0 0.5 1.0

0.0 0.5 1.0 ECE: 0.0145

BRIER: 0.3300

1/4

0.0 0.5 1.0

0.0 0.5 1.0 ECE: 0.0270

BRIER: 0.2827

1/2

0.0 0.5 1.0

0.0 0.5 1.0 ECE: 0.0495

BRIER: 0.2404

1

0.0 0.5 1.0

cifar10-resnet

ECE: 0.0485 BRIER: 0.3518

0.0 0.5 1.0

0.0 0.5 1.0 ECE: 0.0140

BRIER: 0.2765

0.0 0.5 1.0

0.0 0.5 1.0 ECE: 0.0295

BRIER: 0.2126

0.0 0.5 1.0

0.0 0.5 1.0 ECE: 0.0456

BRIER: 0.1799

0.0 0.5 1.0

predictive output 0.0

0.5 1.0

cifar100-resnet

ECE: 0.2201 BRIER: 0.8334

0.0 0.5 1.0

0.5 1.0 ECE: 0.1357

BRIER: 0.6886

0.0 0.5 1.0

0.5 1.0 ECE: 0.0492

BRIER: 0.5463

0.0 0.5 1.0

0.5 1.0 ECE: 0.0270

BRIER: 0.4615

RELIABILITY DIAGRAM FOR MCD

(46)

Performance Evaluation of Bayesian CNNs

• Bayesian CNNs are calibrated and achieve better performance than post calibrated CNNs

2.0 1.5

MNIST-LENET

log10(ERR) 1.5 1.0

0.5 log10(MNLL)

3 2 1

log10(ECE)

2 1

log10(BRIER)

0.8 0.6 0.4

CIFAR10-LENET 0.0

0.5

1.5 1.0 0.5

0.6 0.4 0.2

0.75 0.50

CIFAR10-RESNET

0.0 0.5

2 1

0.75 0.50 0.25

1/8 1/4 1/2 1 proportion training data 0.4

0.2

CIFAR100-RESNET

1.0

1.0 0.5

0.0

GPDNN CGP SVDKL CAL MCD CNN+GP(RF) CNN+GP(SORF)

Tran et al.,AISTATS, 2019 32

(47)

Knowing When the Model Doesn’t Know

• Training on MNIST and test on not-MNIST

0.0 0.2 0.4 0.6 0.8 1.0 0

2000 4000 6000 8000

MCD

Average= 0.026 Average= 0.383

0.0 0.2 0.4 0.6 0.8 1.0 0

2000 4000 6000 8000

I-BLM INITIALIZATION

Average= 0.035 Average= 0.526

Test onMNIST Test onNOT-MNIST

(48)

Conclusions

(49)

Conclusions

• Inference for Deep Nets is hard

• Scalable stochastic-based approximate inference but...

• ... it is difficult to assess the impact approximations on quantification of uncertainty

• The connection between Deep Nets and Deep Gaussian processes can have implications on

• Understanding Deep Learning

• Deriving sensible priors for Deep Learning

• Improving inference borrowing algebraic/computational tricks from kernel literature

• Cool stuff

• New hardware

• Bayesian compression

(50)

Conclusions

• Cool stuff

• New hardware

34

(51)

Conclusions

• Cool stuff

• New hardware

(52)

We are hiring!

We are hiring PhDs, Post-docs and Assistant Professors

35

(53)

Acknowledgments

Bayesian deep learning