HAL Id: hal-01950660
https://hal.archives-ouvertes.fr/hal-01950660
Submitted on 11 Dec 2018
To cite this version:
Mariia Vladimirova, Julyan Arbel, Pablo Mesejo. Bayesian neural network priors at the level of units.
Bayesian Statistics in the Big Data Era, Nov 2018, Marseille, France. pp.1. ⟨hal-01950660⟩
Bayesian neural network priors at the level of units
Mariia Vladimirova¹, Julyan Arbel¹ and Pablo Mesejo²
¹ Inria Grenoble Rhône-Alpes, France
² University of Granada, Spain
Introduction
We investigate deep Bayesian neural networks with Gaussian priors on the weights and ReLU-like nonlinearities.
See Vladimirova et al. (2018).
[Figure: feed-forward network diagram. The input $x$ feeds the input layer, followed by the 1st, 2nd, ..., $\ell$-th hidden layers $h^{(1)}, h^{(2)}, \dots, h^{(\ell)}$, whose units are respectively $\mathrm{subW}(1/2)$, $\mathrm{subW}(1)$, ..., $\mathrm{subW}(\ell/2)$.]
Notations
Given an input $x \in \mathbb{R}^N$, the $\ell$-th hidden layer unit activations are defined as
$$g^{(\ell)}(x) = W^{(\ell)} h^{(\ell-1)}(x), \qquad h^{(\ell)}(x) = \varphi\big(g^{(\ell)}(x)\big).$$
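As a minimal sketch of these notations (assuming a ReLU nonlinearity; the input dimension and widths are borrowed from the illustration later in the poster), the snippet below draws one network from the Gaussian prior and computes the unit activations:

```python
# Minimal sketch of the notations above, assuming phi = ReLU.
# Input dimension and widths (50, 25, 24, 4) follow the poster's illustration.
import numpy as np

rng = np.random.default_rng(0)

def forward(x, widths, sigma_w=1.0):
    """One prior draw: returns pre-activations g^(l) and units h^(l)."""
    h, gs, hs = x, [], []
    for H in widths:
        W = rng.normal(0.0, sigma_w, size=(H, h.shape[0]))  # W^(l), i.i.d. N(0, sigma_w^2)
        g = W @ h                       # g^(l)(x) = W^(l) h^(l-1)(x)
        h = np.maximum(g, 0.0)          # h^(l)(x) = phi(g^(l)(x))
        gs.append(g); hs.append(h)
    return gs, hs

x = rng.normal(size=50)                 # input x in R^N, N = 50
gs, hs = forward(x, widths=[25, 24, 4])
print([h.shape for h in hs])            # [(25,), (24,), (4,)]
```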
Assumptions
• Gaussian prior on the weights: $W^{(\ell)}_{i,j} \sim \mathcal{N}(0, \sigma_w^2)$;
• a nonlinearity $\varphi : \mathbb{R} \to \mathbb{R}$ is said to obey the extended envelope property if there exist $c_1, c_2, d_2 \geq 0$ and $d_1 > 0$ such that
$$|\varphi(u)| \geq c_1 + d_1 |u| \quad \text{for } u \in \mathbb{R}_+ \text{ or } u \in \mathbb{R}_-,$$
$$|\varphi(u)| \leq c_2 + d_2 |u| \quad \text{for } u \in \mathbb{R}$$
(a numerical check for ReLU is sketched below).
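Here is our numerical check that ReLU satisfies the extended envelope property, with the illustrative constants $c_1 = c_2 = 0$ and $d_1 = d_2 = 1$ (valid choices for ReLU, not stated in the poster):

```python
# Our numerical check that ReLU obeys the extended envelope property,
# with constants c1 = c2 = 0, d1 = d2 = 1 (valid choices for ReLU).
import numpy as np

phi = lambda u: np.maximum(u, 0.0)
c1, d1 = 0.0, 1.0                       # lower bound, holds on R+
c2, d2 = 0.0, 1.0                       # upper bound, holds on all of R

u_pos = np.linspace(0.0, 10.0, 1001)
u_all = np.linspace(-10.0, 10.0, 2001)
assert np.all(np.abs(phi(u_pos)) >= c1 + d1 * np.abs(u_pos))
assert np.all(np.abs(phi(u_all)) <= c2 + d2 * np.abs(u_all))
print("extended envelope property holds for ReLU on the grid")
```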
Sub-Weibull
A random variable $X$ such that $P(|X| \geq x) \leq \exp\big(-x^{1/\theta}/K\big)$ for all $x \geq 0$ and for some $K > 0$ is called a sub-Weibull random variable with tail parameter $\theta > 0$: $X \sim \mathrm{subW}(\theta)$.
Moment property: $X \sim \mathrm{subW}(\theta)$ implies
$$\|X\|_k = \big(\mathbb{E}|X|^k\big)^{1/k} \asymp k^{\theta},$$
meaning that for all $k \in \mathbb{N}$ and for some constants $d, D > 0$,
$$d < \|X\|_k / k^{\theta} < D.$$
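The moment property can be checked by Monte Carlo. The sketch below (sample size is an arbitrary choice) uses a standard Gaussian, which is $\mathrm{subW}(1/2)$:

```python
# Monte Carlo check of the moment property for a standard Gaussian,
# which is subW(1/2): the ratio ||X||_k / k^(1/2) stays within constant
# bounds. Sample size is an arbitrary choice; high moments are noisy.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=10**6)              # X ~ N(0, 1) is subW(theta = 1/2)

for k in [2, 4, 8, 16]:
    norm_k = np.mean(np.abs(X)**k) ** (1.0 / k)   # ||X||_k = (E|X|^k)^(1/k)
    print(f"k={k:2d}  ||X||_k / sqrt(k) = {norm_k / k**0.5:.3f}")  # ~0.6-0.7
```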
Covariance theorem
The covariance between hidden units of the same layer is non-negative. Moreover, for any $\ell$-th hidden layer units $h^{(\ell)}$ and $\tilde{h}^{(\ell)}$, and any $s, t \in \mathbb{N}$, it holds that
$$\mathrm{Cov}\big((h^{(\ell)})^s, (\tilde{h}^{(\ell)})^t\big) \geq 0.$$
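A Monte Carlo check of this theorem for two second-layer ReLU units; the input, widths, and powers $(s, t)$ are our illustrative choices:

```python
# Monte Carlo check of the covariance theorem for two second-layer ReLU
# units; the input, widths and powers (s, t) are illustrative choices.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=20)                          # fixed input
n_draws, H, s, t = 2 * 10**4, 10, 2, 3

W1 = rng.normal(size=(n_draws, H, x.size))       # prior draws of W^(1)
h1 = np.maximum(W1 @ x, 0.0)                     # h^(1)(x), shape (n_draws, H)
W2 = rng.normal(size=(n_draws, H, H))            # prior draws of W^(2)
h2 = np.maximum(np.einsum('nij,nj->ni', W2, h1), 0.0)   # h^(2)(x)

a, b = h2[:, 0]**s, h2[:, 1]**t                  # powers of two layer-2 units
cov = np.mean(a * b) - np.mean(a) * np.mean(b)
print(f"estimated covariance = {cov:.3e}  (non-negative up to MC error)")
```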
Penalized estimation
Regularized problem:
$$\min_W R(W) + \lambda L(W), \qquad (1)$$
where $R(W)$ is a loss function, $L(W)$ is a penalty, and $\lambda > 0$ is a regularization parameter. For Bayesian models with prior distribution $\pi(W)$, the maximum a posteriori (MAP) estimator solves (1) with
$$L(W) \propto -\log \pi(W).$$
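As an illustration, in a toy linear-Gaussian model (our setup, not the poster's network), the MAP estimate coincides with the ridge solution of (1):

```python
# Toy illustration (our setup): in a linear model with Gaussian noise and a
# Gaussian prior on w, L(w) = -log pi(w) is an L2 penalty up to constants,
# so the MAP is exactly the ridge solution of problem (1).
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(100, 5))                    # design matrix
w_true = rng.normal(size=5)
y = A @ w_true + rng.normal(size=100)            # unit-variance Gaussian noise

lam = 1.0                                        # lambda = noise var / prior var
# argmin_w ||y - A w||^2 / 2 + lam ||w||^2 / 2  (normal equations)
w_map = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ y)
print(np.round(w_map, 3))
```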
[Figure: joint prior density contours of two units $(U_1, U_2)$ on $[-1, 1]^2$ at hidden layers 1, 2, 3 and 10.]
Theorem (Vladimirova et al., 2018)
The $\ell$-th hidden layer units $U^{(\ell)}$ (pre-activation $g^{(\ell)}$ or post-activation $h^{(\ell)}$) of a feed-forward Bayesian neural network with:
• Gaussian priors on the weights, and
• an activation function $\varphi$ satisfying the extended envelope condition,
have a sub-Weibull marginal prior distribution with optimal tail parameter $\theta = \ell/2$, conditional on the input $x$:
$$U^{(\ell)} \sim \mathrm{subW}(\ell/2).$$
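This can be illustrated by simulation. A minimal sketch follows (ReLU activation and all dimensions are our illustrative choices); the fitted growth exponent of $\|U^{(\ell)}\|_k$ in $k$ increases with depth, approaching $\ell/2$ from below at moderate $k$:

```python
# Monte Carlo illustration of the theorem (ReLU network; input dimension,
# widths, sample size and moment orders are illustrative choices). The
# fitted slope of log ||U^(l)||_k against log k approaches l/2 from below.
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=20)
n_draws, widths = 2 * 10**4, [20, 20, 20]
ks = np.arange(1, 9)

h = np.tile(x, (n_draws, 1))                     # h^(0) = x for each draw
for l, H in enumerate(widths, start=1):
    W = rng.normal(size=(n_draws, H, h.shape[1]))        # prior draws of W^(l)
    h = np.maximum(np.einsum('nij,nj->ni', W, h), 0.0)   # units h^(l)
    u = h[:, 0]                                          # one unit of layer l
    norms = [np.mean(u**k) ** (1.0 / k) for k in ks]     # ||U^(l)||_k
    slope = np.polyfit(np.log(ks), np.log(norms), 1)[0]
    print(f"layer {l}: fitted tail exponent ~ {slope:.2f} (theory: {l/2})")
```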
Prior distributions of layers ` = 1, 2, 3
Illustration of the units' marginal prior distributions for the first three hidden layers. Neural network parameters: $(N, H_1, H_2, H_3) = (50, 25, 24, 4)$.
[Figure: densities (left) and survival functions $P(X \geq x)$ (right) of $\mathrm{subW}(1/2)$, $\mathrm{subW}(1)$ and $\mathrm{subW}(3/2)$ distributions over values $x \in [0, 10]$.]
Proof sketch
Induction with respect to layer depth $\ell$:
$$\|h^{(\ell)}\|_k \asymp k^{\ell/2},$$
which is the moment characterization of a sub-Weibull variable.
• The extended envelope property implies $\|h^{(\ell)}\|_k \asymp \|g^{(\ell)}\|_k$ (a numerical check is sketched after this list).
• Base step: $g \sim \mathcal{N}(0, \sigma^2)$ satisfies $\|g\|_k \asymp \sqrt{k}$. Thus
$$\|h\|_k = \|\varphi(g)\|_k \asymp \|g\|_k \asymp \sqrt{k}.$$
• Inductive step: suppose $\|h^{(\ell-1)}\|_k \asymp k^{(\ell-1)/2}$.
• Lower bound: the non-negative covariance theorem, $\mathrm{Cov}\big((h^{(\ell-1)})^s, (\tilde{h}^{(\ell-1)})^t\big) \geq 0$.
• Upper bound: Hölder's inequality.
• $g_i^{(\ell)} = \sum_{j=1}^{H} W_{i,j}^{(\ell)} h_j^{(\ell-1)}$ then implies $\|h^{(\ell)}\|_k \asymp \|g^{(\ell)}\|_k \asymp k^{\ell/2}$.
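The first step can be checked directly: for ReLU and a symmetric $g$, $\mathbb{E}|\varphi(g)|^k = \mathbb{E}|g|^k / 2$, so $\|h\|_k$ and $\|g\|_k$ differ only by the bounded factor $2^{-1/k}$. A sketch with Gaussian $g$ (our choice):

```python
# Checking the moment equivalence ||h||_k ≍ ||g||_k from the proof sketch:
# for ReLU and symmetric g, E|phi(g)|^k = E|g|^k / 2, so the two norms
# differ by the bounded factor 2^(-1/k). Gaussian g is our choice here.
import numpy as np

rng = np.random.default_rng(5)
g = rng.normal(size=10**6)
h = np.maximum(g, 0.0)

for k in [1, 2, 4, 8]:
    ng = np.mean(np.abs(g)**k) ** (1.0 / k)
    nh = np.mean(h**k) ** (1.0 / k)
    print(f"k={k}: ||h||_k / ||g||_k = {nh/ng:.3f}  (theory: {2**(-1/k):.3f})")
```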
Sparsity interpretation
MAP on weights is $L_2$ regularization
The independent Gaussian prior
$$\pi(W) \propto \prod_{\ell=1}^{L} \prod_{i,j} e^{-\frac{1}{2}\big(W_{i,j}^{(\ell)}\big)^2}$$
is equivalent to the weight decay penalty, with negative log-prior
$$L(W) \propto \sum_{\ell=1}^{L} \sum_{i,j} \big(W_{i,j}^{(\ell)}\big)^2 = \|W\|_2^2.$$
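A quick sanity check (with $\sigma_w = 1$, our choice): the negative log-density of the Gaussian prior differs from $\frac{1}{2}\|W\|_2^2$ only by an additive constant, so both define the same penalty in (1):

```python
# Sanity check (sigma_w = 1, our choice): the negative log-density of the
# Gaussian prior differs from ||W||_2^2 / 2 only by an additive constant,
# so both define the same weight decay penalty in problem (1).
import numpy as np

rng = np.random.default_rng(6)
for _ in range(3):
    W = rng.normal(size=(10, 10))                # one weight matrix draw
    neg_log_prior = np.sum(0.5 * W**2 + 0.5 * np.log(2 * np.pi))
    print(neg_log_prior - 0.5 * np.sum(W**2))    # same constant every draw
```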
MAP on units induces sparsity
The joint prior distribution of all the units can be expressed by Sklar's representation theorem as
$$\pi(U) = \prod_{\ell=1}^{L} \prod_{m=1}^{H_\ell} \pi_m^{(\ell)}\big(U_m^{(\ell)}\big) \; C(F(U)),$$
where $C$ is the copula of $U$ (which characterizes all the dependence between the units) and $F$ is its cumulative distribution function. The penalty is the negative log-prior:
$$L(U) \approx \|U^{(1)}\|_2^2 + \cdots + \|U^{(L)}\|_{2/L}^{2/L}.$$
Since the exponents $2/\ell$ shrink with depth, the penalty on deeper layers is increasingly sparsity-inducing; a sketch of the penalty shapes follows.