HAL Id: hal-01950660

https://hal.archives-ouvertes.fr/hal-01950660

Submitted on 11 Dec 2018

Bayesian neural network priors at the level of units

Mariia Vladimirova, Julyan Arbel, Pablo Mesejo

To cite this version:

Mariia Vladimirova, Julyan Arbel, Pablo Mesejo. Bayesian neural network priors at the level of units.

Bayesian Statistics in the Big Data Era, Nov 2018, Marseille, France. pp. 1. ⟨hal-01950660⟩


Bayesian neural network priors at the level of units

Mariia Vladimirova¹, Julyan Arbel¹ and Pablo Mesejo²

¹ Inria Grenoble Rhône-Alpes, France
² University of Granada, Spain

Introduction

We investigate deep Bayesian neural networks with Gaussian priors on the weights and ReLU-like nonlinearities.

See Vladimirova et al. (2018).

[Figure: a feed-forward network with input x and hidden layers h^(1), h^(2), …, h^(ℓ) (input layer, 1st, 2nd, …, ℓ-th hidden layer); the units of successive layers are marked subW(1/2), subW(1), …, subW(ℓ/2).]

Notations

Given an input x ∈ R^N, the ℓ-th hidden layer unit activations are defined as

g^(ℓ)(x) = W^(ℓ) h^(ℓ−1)(x),    h^(ℓ)(x) = φ(g^(ℓ)(x)).
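As a concrete illustration of this recursion, here is a minimal NumPy sketch (mine, not the poster's; the layer widths, σ_w = 1 and the ReLU nonlinearity are illustrative choices) that draws one sample from the prior over all pre- and post-activations:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_units(x, widths, sigma_w=1.0, phi=lambda u: np.maximum(u, 0.0)):
    """Draw one prior sample of (g^(l), h^(l)) for every hidden layer l."""
    h, layers = x, []
    for H in widths:
        W = rng.normal(0.0, sigma_w, size=(H, h.shape[0]))  # W_ij ~ N(0, sigma_w^2)
        g = W @ h                 # pre-activations  g^(l) = W^(l) h^(l-1)
        h = phi(g)                # post-activations h^(l) = phi(g^(l))
        layers.append((g, h))
    return layers

x = rng.normal(size=50)           # an input x in R^N, N = 50
layers = sample_units(x, widths=[25, 24, 4])
```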

Assumptions

• Gaussian prior on the weights: W_{i,j}^(ℓ) ∼ N(0, σ_w²).

• A nonlinearity φ : R → R is said to obey the extended envelope property if there exist c₁, c₂, d₂ ≥ 0 and d₁ > 0 such that

|φ(u)| ≥ c₁ + d₁|u| for u ∈ R₊ or u ∈ R₋,
|φ(u)| ≤ c₂ + d₂|u| for u ∈ R.

Sub-Weibull

A random variable X such that P(|X| ≥ x) ≤ exp(−x^(1/θ)/K) for all x ≥ 0 and for some K > 0 is called a sub-Weibull random variable with tail parameter θ > 0:

X ∼ subW(θ).

Moment property: X ∼ subW(θ) implies

‖X‖_k = (E|X|^k)^(1/k) ≍ k^θ,

meaning that for all k ∈ N and for some constants d, D > 0,

d < ‖X‖_k / k^θ < D.
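As a worked example of the moment property (mine, not the poster's): if E ∼ Exp(1), then X = E^θ satisfies P(X ≥ x) = exp(−x^(1/θ)), i.e. X ∼ subW(θ) with K = 1, and its moments are exactly E|X|^k = Γ(1 + kθ), so the ratio ‖X‖_k / k^θ can be checked to stay bounded:

```python
import math

theta = 1.5                                       # tail parameter of X = E^theta
for k in range(1, 11):
    norm_k = math.gamma(1 + k * theta) ** (1 / k) # ||X||_k = (E|X|^k)^(1/k)
    print(k, round(norm_k / k ** theta, 3))       # stays within (d, D) for all k
```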

Covariance theorem

The covariance between hidden units of the same layer is non-negative. Moreover, for any ℓ-th hidden layer units h^(ℓ) and h̃^(ℓ), and for s, t ∈ N, it holds that

Cov((h^(ℓ))^s, (h̃^(ℓ))^t) ≥ 0.
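A Monte Carlo illustration of the theorem (my sketch; the network sizes, the input and the powers s, t are arbitrary choices): estimate the covariance between powers of two second-layer units over repeated prior draws and check it is non-negative up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda u: np.maximum(u, 0.0)
N, H1, H2 = 50, 25, 24
x = rng.normal(size=N)                            # fixed input

s, t, n_draws = 2, 3, 20000
samples = np.empty((n_draws, 2))
for i in range(n_draws):
    h1 = relu(rng.normal(size=(H1, N)) @ x)       # first hidden layer
    h2 = relu(rng.normal(size=(H2, H1)) @ h1)     # second hidden layer
    samples[i] = h2[0] ** s, h2[1] ** t           # powers of two distinct units

print(np.cov(samples.T)[0, 1])                    # >= 0 up to Monte Carlo error
```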

Penalized estimation

Regularized problem:

min_W R(W) + λ L(W),    (1)

where R(W) is a loss function, L(W) is a penalty and λ > 0. For Bayesian models with prior distribution π(W), the maximum a posteriori (MAP) estimator solves (1) with

L(W) ∝ − log π(W).
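A small numeric check of this correspondence (mine; it assumes σ_w = 1): for an isotropic Gaussian prior, −log π(W) equals the squared-L₂ penalty ‖W‖₂²/2 up to an additive constant.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
W = rng.normal(size=100)                           # a flattened weight vector

neg_log_prior = -np.sum(norm.logpdf(W, 0.0, 1.0))  # -log pi(W), sigma_w = 1
penalty = 0.5 * np.sum(W ** 2)                     # weight decay, ||W||_2^2 / 2
const = 0.5 * len(W) * np.log(2 * np.pi)           # normalizing constant
print(np.isclose(neg_log_prior, penalty + const))  # True
```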

[Figure: scatter plots of pairs of same-layer units (U1, U2) for layers 1, 2, 3 and 10.]

Theorem (Vladimirova et al., 2018)

The ℓ-th hidden layer units U^(ℓ) (pre-activations g^(ℓ) or post-activations h^(ℓ)) of a feed-forward Bayesian neural network with:

• Gaussian priors on the weights, and
• an activation function φ satisfying the extended envelope condition,

have a sub-Weibull marginal prior distribution, conditional on the input x, with optimal tail parameter θ = ℓ/2:

U^(ℓ) ∼ subW(ℓ/2).
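The theorem invites a simulation check (my sketch, not the poster's code): since ‖U^(ℓ)‖_k ≍ k^(ℓ/2), the slope of log ‖U^(ℓ)‖_k against log k should be close to ℓ/2. The estimate below is rough, as the asymptotics hold for large k while high moments are noisy at this sample size:

```python
import numpy as np

rng = np.random.default_rng(3)
relu = lambda u: np.maximum(u, 0.0)
widths = [25, 24, 4]
x = rng.normal(size=50)                            # fixed input, N = 50

n_draws = 20000
units = {l: np.empty(n_draws) for l in range(1, len(widths) + 1)}
for i in range(n_draws):
    h = x
    for l, H in enumerate(widths, start=1):
        h = relu(rng.normal(size=(H, h.shape[0])) @ h)
        units[l][i] = h[0]                         # track one unit per layer

ks = np.arange(1, 9)
for l, u in units.items():
    norms = [np.mean(np.abs(u) ** k) ** (1 / k) for k in ks]
    slope = np.polyfit(np.log(ks), np.log(norms), 1)[0]
    print(l, round(slope, 2))                      # theorem predicts about l/2
```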

Prior distributions of layers ℓ = 1, 2, 3

Illustration of the units' marginal prior distributions from the first three hidden layers. Neural network parameters: (N, H₁, H₂, H₃) = (50, 25, 24, 4).

[Figure: densities and tail probabilities P(X ≥ x) of subW(1/2), subW(1) and subW(3/2) distributions, matching the units of layers 1, 2 and 3.]

Proof sketch

Induction with respect to the layer depth ℓ:

‖h^(ℓ)‖_k ≍ k^(ℓ/2),

which is the moment characterization of a sub-Weibull variable with tail parameter θ = ℓ/2.

• The extended envelope property implies ‖h^(ℓ)‖_k ≍ ‖g^(ℓ)‖_k.

Base step: g ∼ N(0, σ²) has ‖g‖_k ≍ √k (Gaussian moment growth). Thus

‖h‖_k = ‖φ(g)‖_k ≍ ‖g‖_k ≍ √k.

Inductive step: suppose ‖h^(ℓ−1)‖_k ≍ k^((ℓ−1)/2). Then:

• Lower bound: the non-negative covariance theorem gives Cov((h^(ℓ−1))^s, (h̃^(ℓ−1))^t) ≥ 0.

• Upper bound: Hölder's inequality applied to g^(ℓ) = Σ_{j=1}^{H} W_{i,j}^(ℓ) h_j^(ℓ−1).

Together these yield ‖h^(ℓ)‖_k ≍ ‖g^(ℓ)‖_k ≍ k^(ℓ/2).

Sparsity interpretation

MAP on weights is L₂-regularization. An independent Gaussian prior

π(W) ∝ Π_{ℓ=1}^{L} Π_{i,j} exp(−(W_{i,j}^(ℓ))²/2)

is equivalent to the weight decay penalty, with negative log-prior

L(W) ∝ Σ_{ℓ=1}^{L} Σ_{i,j} (W_{i,j}^(ℓ))² = ‖W‖₂².

MAP on units induces sparsity. The joint prior distribution of all the units can be expressed via Sklar's representation theorem as

π(U) = Π_{ℓ=1}^{L} Π_{m=1}^{H_ℓ} π_m^(ℓ)(U_m^(ℓ)) · C(F(U)),

where C is the copula of U (which characterizes all the dependence between the units) and F is its cumulative distribution function. The penalty is the negative log-prior:

L(U) ≈ ‖U^(1)‖₂² + · · · + ‖U^(L)‖_{2/L}^{2/L} − log C(F(U)).

Layer   W-penalty          U-penalty
1       ‖W^(1)‖₂², L₂      ‖U^(1)‖₂², L₂
2       ‖W^(2)‖₂², L₂      ‖U^(2)‖₁, L₁
ℓ       ‖W^(ℓ)‖₂², L₂      ‖U^(ℓ)‖_{2/ℓ}^{2/ℓ}, L_{2/ℓ}

Conclusion

We prove that the marginal prior distributions of the units become heavier-tailed as depth increases. We further interpret this finding, showing that the units tend to be more sparsely represented as layers become deeper. This result provides new theoretical insight into deep Bayesian neural networks, underpinning their natural shrinkage properties and practical potential.

References

Vladimirova, M., Arbel, J., and Mesejo, P. (2018). Bayesian neural networks increasingly sparsify their units with depth. arXiv preprint arXiv:1810.05193.
