Insights into Distributed Variational Inference for Bayesian Deep Learning

Rosa Candela¹, Pietro Michiardi¹, Maurizio Filippone¹

¹ EURECOM, Sophia Antipolis, France

Distributed Inference

• Motivations:

Training complex models on large datasets requires a distributed implementation;

Current distributed architectures are not efficient and do not scale well;

Previous experimental studies only consider simple models.

• Parameter-Server architecture with a data-parallelism approach (a minimal sketch follows Fig. 1):

Data and workloads are distributed over worker nodes;

The Parameter Server maintains the globally shared parameters.

Fig. 1: Parameter-Server architecture
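To make the data-parallel scheme concrete, here is a minimal, purely illustrative Python sketch, not the actual TensorFlow implementation used in this work; the toy quadratic loss, the `worker_gradient` helper, and all numeric values are assumptions made only for the example.

```python
import numpy as np

# Toy setup: fit theta to minimize ||X @ theta - y||^2, with the data split
# across workers as in a data-parallel Parameter-Server design.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
num_workers, lr = 4, 0.1
shards = np.array_split(np.arange(len(y)), num_workers)  # one data shard per worker

theta = np.zeros(5)  # globally shared parameters, held by the Parameter Server

def worker_gradient(shard, params):
    """Gradient of the shard's squared loss, computed locally by one worker."""
    Xs, ys = X[shard], y[shard]
    return 2.0 * Xs.T @ (Xs @ params - ys) / len(shard)

for step in range(100):
    # Workers pull the current parameters and push back their gradients;
    # the Parameter Server aggregates them and applies a single update.
    grads = [worker_gradient(s, theta) for s in shards]
    theta -= lr * np.mean(grads, axis=0)
```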

• Synchronization schemes in the TensorFlow implementation (both schemes are sketched below, after Fig. 2):

Fig. 2: Different synchronization techniques
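The difference between the two synchronization schemes can be sketched as follows. This is a simplified single-process simulation, assuming stale parameter reads as a stand-in for asynchrony; it is not TensorFlow's actual distributed runtime, and the toy data, `grad` function, and step counts are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
shards = np.array_split(np.arange(len(y)), 4)  # one shard per worker
lr = 0.05

def grad(shard, params):
    Xs, ys = X[shard], y[shard]
    return 2.0 * Xs.T @ (Xs @ params - ys) / len(shard)

# Synchronous SGD: the server waits for every worker, averages the
# gradients, and applies one update per round.
theta_sync = np.zeros(5)
for step in range(100):
    theta_sync -= lr * np.mean([grad(s, theta_sync) for s in shards], axis=0)

# Asynchronous SGD: each worker's update is applied as soon as it arrives,
# so gradients may be computed from parameters that are already stale.
theta_async = np.zeros(5)
stale = theta_async.copy()
for step in range(100):
    for s in shards:
        theta_async -= lr * grad(s, stale)  # stale read stands in for the race
    stale = theta_async.copy()              # workers refresh once per round
```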

Stochastic Variational Inference for DGPs

• Stochastic Variational Inference

Intractable posterior over the model parameters:

p(\theta \mid Y, X) = \frac{p(Y \mid X, \theta)\, p(\theta)}{\int p(Y \mid X, \theta)\, p(\theta)\, d\theta};

Lower bound on the marginal likelihood, optimized with mini-batch stochastic gradients:

\log p(Y \mid X) \geq \frac{n}{m} \sum_{k \in I_m} \mathbb{E}_{q(\theta)}\left[\log p(y_k \mid \theta)\right] - D_{\mathrm{KL}}\left[q(\theta) \,\|\, p(\theta)\right],

where q(θ) approximates the posterior p(θ | Y, X);

Estimate the expectation using Monte Carlo:

\mathbb{E}_{q(\theta)}\left[\log p(y_k \mid \theta)\right] \approx \frac{1}{N_{\mathrm{MC}}} \sum_{r=1}^{N_{\mathrm{MC}}} \log p(y_k \mid \tilde{\theta}_r), \quad \tilde{\theta}_r \sim q(\theta);
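A minimal sketch of the Monte Carlo estimate above, assuming a Gaussian variational distribution q(θ) = N(μ, σ²) and a toy Gaussian likelihood; the `log_lik` function and all numeric values are illustrative assumptions, not the poster's model.

```python
import numpy as np

rng = np.random.default_rng(2)
y_k = 0.7                    # one observation from the current mini-batch
mu, sigma = 0.2, 0.5         # variational parameters of q(theta) = N(mu, sigma^2)
N_MC = 100                   # number of Monte Carlo samples

def log_lik(y, theta, noise=1.0):
    """Toy Gaussian log-likelihood log p(y | theta) with fixed noise variance."""
    return -0.5 * np.log(2 * np.pi * noise) - 0.5 * (y - theta) ** 2 / noise

# E_{q(theta)}[log p(y_k | theta)] ~= (1 / N_MC) * sum_r log p(y_k | theta_r)
theta_samples = rng.normal(mu, sigma, size=N_MC)  # theta_r ~ q(theta)
mc_estimate = np.mean(log_lik(y_k, theta_samples))
print(mc_estimate)
```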

• Deep Gaussian Processes (DGPs): deep probabilistic models;

Composition of functions (see the sketch after Fig. 3):

f(x) = \left(h^{(N_h-1)}_{\theta^{(N_h-1)}} \circ \ldots \circ h^{(0)}_{\theta^{(0)}}\right)(x);


Fig. 3: Illustration of how stochastic processes may be composed.
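The composition f = h^(Nh−1) ∘ … ∘ h^(0) can be written down directly; the layer functions below are arbitrary placeholders chosen only to show the pattern, whereas in a DGP each layer would be (an approximation of) a draw from a Gaussian process.

```python
import numpy as np
from functools import reduce

# Placeholder layer functions h^(0), ..., h^(Nh-1).
layers = [np.tanh, np.cos, lambda h: 0.5 * h]

def f(x):
    """f(x) = (h^(Nh-1) o ... o h^(0))(x): apply the layers in order."""
    return reduce(lambda h, layer: layer(h), layers, x)

print(f(np.array([0.1, 1.0, -2.0])))
```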

• DGPs with random feature expansions (a NumPy sketch follows Fig. 4):

Example: the RBF kernel approximated with trigonometric functions,

\Phi_{\mathrm{rbf}} = \sqrt{\frac{\sigma^2}{N_{\mathrm{RF}}}} \left[\cos(F\Omega), \sin(F\Omega)\right], \quad \text{with } F = \Phi W, \quad p(\Omega_{\cdot j} \mid \theta) = \mathcal{N}(0, \Lambda^{-1}), \quad \Lambda = \mathrm{diag}(l_1^2, \ldots, l_d^2);

DGPs become equivalent to Deep Neural Networks with low-rank weight matrices.

Fig. 4: Diagram of the proposed DGP model with random features.
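A sketch of the random-feature expansion above for a single layer, using NumPy; only the forward computation of Φ_rbf is shown (no inference), and the dimensions, lengthscales, and kernel variance are arbitrary example values rather than the poster's configuration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d_in, N_RF = 8, 3, 500        # mini-batch size, layer input dim, random features
sigma2 = 1.0                     # kernel variance sigma^2
lengthscales = np.ones(d_in)     # l_1, ..., l_d

# Spectral frequencies: Omega_{.j} ~ N(0, Lambda^{-1}) with Lambda = diag(l_1^2, ..., l_d^2),
# i.e. each row of Omega is scaled by the inverse lengthscale of its input dimension.
Omega = rng.normal(size=(d_in, N_RF)) / lengthscales[:, None]

F = rng.normal(size=(n, d_in))   # stand-in for the layer input F = Phi W
FO = F @ Omega
# Phi_rbf = sqrt(sigma^2 / N_RF) [cos(F Omega), sin(F Omega)]
Phi_rbf = np.sqrt(sigma2 / N_RF) * np.concatenate([np.cos(FO), np.sin(FO)], axis=1)
print(Phi_rbf.shape)             # (n, 2 * N_RF)
```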

Experimental methodology

• Use the mini-batch Stochastic Gradient Descent (SGD) algorithm;

• Study the impact of design parameters (a hypothetical sweep is sketched after this list):

Batch size

Learning rate

Number of Workers

Number of Parameter Servers

• Evaluate performance through standard metrics:

Training time

Error rate
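Such a study could be organized as a simple grid over configurations, recording the two metrics for each run. The sketch below is purely hypothetical: the parameter values and the `run_experiment` stub are placeholders, not the actual settings or harness used in the study.

```python
import itertools
import random
import time

# Hypothetical grid of design parameters (placeholder values).
batch_sizes = [64, 256, 1024]
learning_rates = [0.001, 0.01]
num_workers = [1, 2, 4, 8]
num_param_servers = [1, 2]

def run_experiment(batch_size, lr, workers, param_servers):
    """Stub standing in for one distributed training run; it would return
    the two metrics of interest: (training time in seconds, error rate)."""
    start = time.time()
    error_rate = random.random()  # placeholder result
    return time.time() - start, error_rate

results = {}
for cfg in itertools.product(batch_sizes, learning_rates, num_workers, num_param_servers):
    results[cfg] = run_experiment(*cfg)
```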

Experimental Setup and Results

• Classification task;

• MNIST dataset with 60,000 samples;

• DGP model configuration: 2 hidden layers, 500 random features, 50 GPs in the hidden layers.

Fig. 5: Training time and error rate as a function of batch size, learning rate, number of workers, and number of Parameter Servers.

Conclusions

Results of the experimental study:

• Large batch size and high learning rate:

✓ improve the convergence speed of Synchronous SGD;

✗ aggravate the degradation of model quality with Asynchronous SGD;

• Scaling up the number of workers:

Asynchronous SGD reduces the training time, at the expense of a higher error rate;

Synchronous SGD shows a smaller speed-up, but maintains a low error rate;

• Scaling up the number of Parameter Servers:

can help avoid bottlenecks at the network level;

has a negligible effect if the computation work dominates the communication task;

Future work:

• Repeat the experiments in a different distributed setting;

• Improve the distributed SGD implementation by modifying the update-collection phase.

References

[1] K. Cutajar, E. Bonilla, P. Michiardi, M. Filippone. Random Feature Expansions for Deep Gaussian Processes, ICML 2017.

[2] M. Abadi, P. Barham, J. Chen et al. TensorFlow: A system for large-scale machine learning, OSDI 2016.

