Encoding the latent posterior of Bayesian Neural Networks for uncertainty quantiﬁcation

(1)

Encoding the latent posterior of Bayesian Neural Networks for uncertainty quantification

Gianni FRANCHI

with Andrei Bursuc, Emanuel Aldea, Severine Dubuisson, Isabelle Bloch U2IS, ENSTA Paris

Presentation GDR ISIS 28/07/2021

1 / 39

(2)

Encoding the latent posterior of Bayesian Neural Networks for uncertainty quantification

Plan

1 Context

2 Uncertainty and Deep learning

3 Bayesian Deep Neural Network

4 MC dropout

5 Ensembling approaches

6 LP-BNN approache

7 Conclusion

8 Bibliography

2 / 39

(3)

Encoding the latent posterior of Bayesian Neural Networks for uncertainty quantification Context

What is uncertainty in machine/deep learning

We make observations using the sensors in the world (e.g. camera) Based on the observations, we intend to learn a model that makes decisions

Given thesameobservations, the decision should be thesame However,

The worldchanges, observationschange, our sensorschange, the output should not change!

We would like to know how confident we can be about the decisions

3 / 39

(4)

Why Uncertainty is important?

Figure:Confidence histograms (top) and reliability diagrams(bottom) for a 5-layer LeNet (left) and a 110-layer ResNet (right)on CIFAR-100. [1]

4 / 39

(5)

Why Uncertainty is important?

Imagine an autonomous car with a perception system based on Deep learning without Uncertainty:

5 / 39

(6)

Why Uncertainty is important?

Imagine a medical diagnostics based on Deep learning without Uncertainty:

6 / 39

(7)

Why Uncertainty is important?

We build models for predictions, can we trust them? Are they certain?

7 / 39

(8)

Encoding the latent posterior of Bayesian Neural Networks for uncertainty quantification Uncertainty and Deep learning

Types of Uncertainty

Aleatoric: Uncertainty inherent in the observation noise (problems caused by sensor quality, natural randomness, that cannot be explained by our data).

Epistemic: Our ignorance about the correct model that generated the data (lack of knowledge about the process that generated the data).

8 / 39

(9)

Aleatoric uncertainty

Aleatoric uncertainty captures noise inherent in the observations:

For example, sensor noise or motion noise result in uncertainty.

This uncertainty cannot be reduced with more data.

However, aleatoric uncertainty could be reduced with better measurements.

9 / 39

(10)

Aleatoric uncertainty

Aleatoric uncertainty can further be categorized into homoscedastic and heteroscedastic uncertainties:

Homoscedastic uncertainty relates to the uncertainty that a particular task might cause. It stays constant for different inputs.

Heteroscedastic uncertainty depends on the inputs to the model, with some inputs potentially having more noisy outputs than others.

10 / 39

(11)

Types of Uncertainty: Case 1

¹

Let us consider a neural network model trained with several pictures of dogs. We ask the model to decide on a dog using a photo of a cat. What would you want the model to do?

1Credits: Gille Louppe

11 / 39

(12)

Types of Uncertainty: Case 2

²

We have three different types of images to classify, cat, dog, and cow, some of which may be noisy due to the limitations of the acquisition instrument.

2Credits: Gille Louppe

12 / 39

(13)

Encoding the latent posterior of Bayesian Neural Networks for uncertainty quantification Bayesian Deep Neural Network

Bayesian approach and DNN

The Goal of DNN is to findP(Y|X,ω), most of the classical approach findωthat maximize the likelihood.

ω= arg max

ω

logP(Dl|ω)

ω= arg max

ω nl

X

i=1

logP(Y_i|Xi,ω)

ω= arg max

ω

1/nl n_l

X

i=1

logP(Yi|Xi,ω) ω= arg max

ω E(X,Y)∼P(Dl)logP(Y|X,ω) ω= arg min

ω

H[P(Dl),P(Y|X,ω)]

WithH the cross entropy.

13 / 39

(14)

Bayesian approach and DNN

The Goal of DNN is to findP(Y|X,ω). In the classical bayesian

approach we findωsuch that we have the maximum a posteriori (MAP).

ω= arg max

ω

logP(ω|Dl) ω= arg max

ω

logP(Dl|ω) + logP(ω)

This leads to l2 regularization.

14 / 39

(15)

Bayesian DNN[4]

Bayesian DNN is based on marginalization instead of MAP optimization.

P(Y|X) =Eω∼P(ω|Dl)(P(Y|X,ω)) P(Y|X) =

Z

P(Y|X,ω)P(ω|D_l)dω

In practice:

P(Y|X)'X

i

(P(Y|X,ωi)) withωi∼ P(ω|Dl)

Different techniques to estimateP(ω|Dl).

15 / 39

(16)

Variational inference[4]

Variational inference approximates the posteriorP(ω|Dl)with a family of distributionsqλ(ω/Dl)The variational parameterλindexes the family of distributions. For example, ifqwere Gaussian, it would be the mean and variance of the latent variables for each datapointλ_x_i = (µ_x_i, σ²_x

i)).

Question : How can we know how well our variational posterior q_λ(w/D_l)approximates the true posteriorP(ω|D_l)?

16 / 39

(17)

Variational inference[4]

Question : How can we know how well our variational posterior q_λ(ω/D_l)approximates the true posterior P(ω|D_l)?

We can use the Kullback-Leibler divergence, which measures the information lost when usingq to approximateP :

KL(qλ(ω/Dl)|| P(ω|Dl)) = Z

ω

qλ(ω/Dl) log(qλ(ω/Dl) P(ω|Dl))

dω

= Z

ω

qλ(ω/Dl) log( qλ(ω/Dl) P(D_l)P(D_l,ω))

dω

=Eq[logq_λ(ω/D_l)]−Eq[logP(ω,D_l)] + logP(D_l) Our goal is to find the variational parametersλthat minimize this divergence. The optimal approximate posterior is thus

17 / 39

(18)

Variational inference[4]

The optimal approximate posterior is thus:

q_λ^∗(ω/D_l) = arg min_λKL(q_λ(ω/D_l)|| P(ω|D_l)).

This is impossible to compute directly due toP(Dl)that appears in the divergence. So, we consider the following function:

ELBO(λ) =Eq[logP(ω,Dl)]−Eq[logqλ(ω/Dl)]

=− Z

ω

qλ(ω/Dl) log( qλ(ω/Dl) P(ω)P(Dl|ω))

dω

=Eq[logP(D_l|ω)]−KL(q_λ(ω/D_l)|| P(ω)) Note thatKL(qλ(ω/Dl)|| P(ω|Dl)) = logP(Dl)−ELBO(λ).

18 / 39

(19)

Variational inference: Reparametrization trick[4]

Theorem: Letbe a random variable having a probability density given byq()and letω=t(λ, ). Suppose thatq_λ(ω/Dl), is such that q()d=q_λ(ω/Dl)dω. Then for a functionf differentiable with respect toω:

∂

∂λEq_λ(ω/Dl)f(ω, λ) =Eq()

∂f(ω, λ)

∂ω

∂λ +∂f(ω, λ)

∂λ

.

19 / 39

(20)

Variational inference [4]

20 / 39

(21)

Weight Uncertainty in Neural Networks [4]

³

3Image credit: Eric Ma

21 / 39

(22)

Encoding the latent posterior of Bayesian Neural Networks for uncertainty quantification MC dropout

Dropout

⁴

Dropout is an empirical technique that was proposed to avoid overfitting in CNN.

At each training step (i.e., for each sample within a mini-batch) Remove each node in the network with a probabilityp

Update the weights of the remaining nodes with backpropagation.

4Image credit: G. Louppe

22 / 39

(23)

Encoding the latent posterior of Bayesian Neural Networks for uncertainty quantification MC dropout

MC dropout [2]

They [2] propose to average the predictions of several DNN where they apply the dropout:

P(y^∗|x^∗) = 1 Nmodel

N_model

X

j=1

P(y^∗|ω(t^∗)b^j,x^∗) (1)

withb^j a vector of the same size of ω(t^∗)which is a realization of a binomial distribution.

23 / 39

(24)

Encoding the latent posterior of Bayesian Neural Networks for uncertainty quantification Ensembling approaches

Deep Ensembles[3]

They [3] propose to average the predictions of several DNN with different initial seeds:

P(y^∗|x^∗) = 1 N_model

N_model

X

j=1

P(y^∗|ω^j(t^∗),x^∗) (2)

24 / 39

(25)

Deep Ensembles[5]

Figure:t-SNE plot of predictions from checkpoints corresponding to 3 different randomly initialized trajectorie

25 / 39

(26)

Deep Ensembles[5]

Figure:Diversity versus accuracy plots for 3 models trained on CIFAR-10

26 / 39

(27)

BatchEnsemble[6]

They [6] propose to approximate the average of the predictions of several DNN with different initial seeds by using a DNN with two kinds of weights. More specifically, let the weight setωbe composed of two subsetsωslow , ωfast.

For simplicity let us consider a DNN with just one fully connected layer and let us writeω={ωj}^N_j=1^model ={Wj}^N_j=1^model andωslow=W and ωfast={Fj}^N_j=1^model. We haveWj =W ·F =W ·(rjs_j^t)

Figure:An illustration on how to generate the ensemble weights for two

ensemble members ^{27 / 39}

(28)

BatchEnsemble[6]

We have a set of weightW_j =W ·F=W ·(r_js_j^t)withW that sees all images and(r_js_j^t)that does not see all the same images. If we denoteφ an activation function then when we apply the BatchEnsemble on an image we perform:

y=φ W_j^tx

=φ (W^t·(rjs_j^t))^tx

=φ (W^t(x·rj)·sj) Similarly to Deep Ensembles, to perform inference we just perform ensembling :

P(y^∗|x^∗) = 1 Nmodel

N_model

X

j=1

P(y^∗|ω^j(t^∗),x^∗) (3)

Figure:An illustration on how to generate the ensemble weights for two ensemble members

28 / 39

(29)

Encoding the latent posterior of Bayesian Neural Networks for uncertainty quantification LP-BNN approache

From DNN to LP-BNN [7]

In classical DNNs, all weights are just one realization. In classical BNNs, all weights are one random variable, yet all the weights are independent.

In LP-BNNs, all weights of the same layer are correlated.

29 / 39

(30)

LP-BNN [7]

Similarly to Batchensemble we consider two kinds of weights:

h=a (W_share^> (xsj))ˆrj

, (4) Yet we useˆr and not r

30 / 39

(31)

LP-BNN [7]

Similarly to Batchensemble we consider two kinds of weights:

we introduce a VAE composed of a one layer encoderg_φ^enc(·)with variational parametersφand a one layer decoderg_ψ^dec(·)with parameters ψ.

the latent space of the VAE is denoted z∼ N(µ,σ²I)

We sample a latent variable zj ∼ N(µ_j,σ²_jI)and feed it to the decoder to haveˆrj =g_ψ^dec(zj).

31 / 39

(32)

LP-BNN [7]

Interestingly mixing the LPBNN is more stable than classical BNN. It brings more diversity and allows a better quantification of uncertainty.

32 / 39

(33)

Results on CIFAR

CIFAR-10 CIFAR-100

Method Acc↑ AUC↑ AUPR↑ FPR-95-TPR↓ ECE↓ cA↑ cE↓ Acc↑ ECE↓ cA↑ cE↓

MCP + cutout 96.33 0.9600 0.9767 0.115 0.0207 32.98 0.6167 80.19 0.1228 19.33 0.7844 MC dropout 95.95 0.9126 0.9511 0.282 0.0172 32.32 0.6673 75.40 0.0694 19.33 0.5830 MC dropout +cutout 96.50 0.9273 0.9603 0.242 0.0117 32.35 0.6403 77.92 0.0672 27.66 0.5909

DUQ 87.48 0.7083 0.8114 0.698 0.3983 64.89 0.2542 - - - -

DUQ Resnet18 93.36 0.8994 0.9213 0.1964 0.0131 69.01 0.5059 - - - -

EDL 85.73 0.9002 0.9198 0.247 0.0904 59.54 0.3412 - - - -

MIMO 94.96 0.9387 0.9648 0.175 0.0300 69.99 0.1846 0.7869 0.1018 0.4735 0.2832

Deep Ensembles + cutout 96.74 0.9803 0.9896 0.071 0.0093 68.75 0.1414 83.01 0.0673 47.35 0.2023 BatchEnsembles + cutout 96.48 0.9540 0.9731 0.132 0.0167 71.67 0.1928 81.27 0.0912 47.44 0.2909 LP-BNN (ours) + cutout 95.02 0.9691 0.9836 0.103 0.0094 69.51 0.1197 79.3 0.0702 48.40 0.2224

33 / 39

(34)

Results on CIFAR

0 1 2 3 4

skew intensity 0.05

0.10 0.15 0.20 0.25 0.30

ECE

Method BatchEnsemble LP-BNN Deep Ensembles

Figure:Calibration at different levels of corruption. We report ECE scores for LP-BNN, BatchEnsemble and Deep Ensembles on CIFAR-10-C.

34 / 39

(35)

LP-BNN [7]

Input image MCP BE LP-BNN

Figure:Visual assessment on two BDD-Anomaly test images containing a motorcycle (OOD class). For each image: on the first row, input image and confidence maps from MCP, BE, and LP-BNN;on the second row,

ground-truth segmentation and segmentation maps from MCP, BE, and LP-BNN. LP-BNN is less confident on the OOD objects.

35 / 39

(36)

Encoding the latent posterior of Bayesian Neural Networks for uncertainty quantification Conclusion

Conclusion

We propose a new BNN framework able to quantify uncertainty in the context of deep learning.

LP-BNNs are stable, efficient, and therefore easy to train compared to existing BNN models.

The extensive empirical comparisons on multiple tasks show that LP-BNNs reach state-of-the-art levels with substantially lower computational cost.

The code can be download in

https://github.com/giannifranchi/LP_BNN Thank you for your attention.

36 / 39

(37)

Encoding the latent posterior of Bayesian Neural Networks for uncertainty quantification Bibliography

Bibliography:

1 Guo, Chuan, et al. "On calibration of modern neural networks."

Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017.

2 Gal, Yarin, and Zoubin Ghahramani. "Dropout as a bayesian approximation: Representing model uncertainty in deep learning."

international conference on machine learning. 2016.

3 Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell.

"Simple and scalable predictive uncertainty estimation using deep ensembles." Advances in neural information processing systems.

2017.

37 / 39

(38)

Bibliography:

4 Blundell, Charles, et al. "Weight uncertainty in neural networks."

arXiv preprint arXiv:1505.05424 (2015).

5 Fort, Stanislav, Huiyi Hu, and Balaji Lakshminarayanan. "Deep Ensembles: A Loss Landscape Perspective." arXiv preprint arXiv:1912.02757 (2019).

38 / 39

(39)

Bibliography:

6 Wen, Yeming, Dustin Tran, and Jimmy Ba. "Batchensemble: an alternative approach to efficient ensemble and lifelong learning."

arXiv preprint arXiv:2002.06715 (2020).

7 Franchi, G., Bursuc, A., Aldea, E., Dubuisson, S., & Bloch, I.

(2020). Encoding the latent posterior of Bayesian Neural Networks for uncertainty quantification. arXiv preprint arXiv:2012.02818.

39 / 39