A zealous parallel gradient descent algorithm

Gilles Louppe and Pierre Geurts

Department of EE and CS, University of Liège, Belgium

Zealous parallel gradient descent algorithm

Global state

- θ: vector of parameters of the model;

- next: pid of the next thread allowed to update θ;
- counter: array of integers, such that counter[i] corresponds to the number of pending updates of thread i.

Local state

- Δθ: pending updates of θ;
- b: current mini-batch;
- pid: unique identifier.

Procedure followed by each individual thread

[Flowchart] Each thread loops as follows: get the next mini-batch b; accumulate the corresponding gradients into the pending updates, Δθ ← Δθ + (α/|b|) Σ_{z∈b} ∇θ C(θ, z); call trylock(pid). If access is granted, enter the critical section, apply the pending updates to the shared parameters (θ ← θ − Δθ; Δθ ← 0), call next(pid) and leave the critical section. If access is refused, keep going with the next mini-batch. Stop when the termination criterion is met.

Policy functions

function trylock(pid)
    counter[pid]++;
    return next == pid;
end

function next(pid)
    counter[pid] ← 0;
    next ← arg max(counter);
end
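To make the states and policy functions above concrete, here is a minimal Python sketch of how they could be put together, assuming NumPy parameters and standard Python threads. The helpers grad(θ, z) (the gradient of C at one training point), minibatches (an iterator over mini-batches) and should_stop (the termination test) are hypothetical placeholders, not part of the poster; the small internal lock only keeps the counter/next bookkeeping consistent.

import threading
import numpy as np

class SharedState:
    # Global state as on the poster: θ, next and counter. The internal lock
    # only protects this bookkeeping; it is an assumption of this sketch.
    def __init__(self, n_params, n_threads):
        self.theta = np.zeros(n_params)    # θ: vector of parameters of the model
        self.next_pid = 0                  # next: pid of the next thread allowed to update θ
        self.counter = [0] * n_threads     # counter[i]: number of pending updates of thread i
        self._lock = threading.Lock()

    def trylock(self, pid):
        # Policy function trylock: register one more pending update for this
        # thread and grant access only if it is this thread's turn.
        with self._lock:
            self.counter[pid] += 1
            return self.next_pid == pid

    def next(self, pid):
        # Policy function next: reset this thread's pending-update count and
        # hand the turn to the thread with the most pending updates.
        with self._lock:
            self.counter[pid] = 0
            self.next_pid = self.counter.index(max(self.counter))

def zealous_worker(shared, pid, minibatches, grad, alpha, should_stop):
    # Procedure followed by each individual thread (zealous variant).
    delta = np.zeros_like(shared.theta)    # Δθ: pending updates of θ (local state)
    for b in minibatches:                  # b: current mini-batch
        # Accumulate the gradients of the mini-batch into the pending updates.
        delta += (alpha / len(b)) * sum(grad(shared.theta, z) for z in b)
        if shared.trylock(pid):            # access granted: critical section
            shared.theta -= delta          # apply the pending updates to θ
            delta[:] = 0.0
            shared.next(pid)
        # Access refused: do not block on the lock; keep going with the next mini-batch.
        if should_stop():
            break

Note that gradient computations read the shared θ without synchronization, so a thread may work with slightly stale parameters; mutual exclusion on the update of θ itself comes from the trylock/next turn-taking policy rather than from blocking on a lock.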

Abstract

Parallel and distributed algorithms have become a necessity in modern machine learning tasks. In this work, we focus on parallel asynchronous gradient descent and propose a zealous variant that minimizes the idle time of processors to achieve a substantial speedup. We then experimentally study this algorithm in the context of training a restricted Boltzmann machine on a large collaborative filtering task.

Mini-batch gradient descent

Minimize E_z[C(θ, z)], where C is some (typically convex) cost function and the expectation is computed over training points z. In mini-batch gradient descent, this is achieved using the update rule

    θ_{t+1} = θ_t − (α/b) Σ_{k=1..b} ∇θ C(θ_t, z_k),

where α is some learning rate and b is the number of training points in a mini-batch.
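Purely as a reading aid (not from the poster), the update rule above corresponds to the following serial sketch, where grad(theta, z) stands for ∇θ C(θ, z) and, like minibatches, is a hypothetical placeholder:

def minibatch_gd(theta, minibatches, grad, alpha):
    # One pass of mini-batch gradient descent.
    # theta: parameter vector θ (e.g. a NumPy array); minibatches: iterable of
    # mini-batches of training points z; grad(theta, z): ∇θ C(θ, z); alpha: α.
    for batch in minibatches:
        # θ_{t+1} = θ_t − (α/b) Σ_{k=1..b} ∇θ C(θ_t, z_k), with b = len(batch)
        g = sum(grad(theta, z) for z in batch) / len(batch)
        theta = theta - alpha * g
    return theta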

Asynchronous mini-batch gradient descent

Parallel mini-batch gradient descent with shared memory [1, 2, 3]:

- Store θ in shared memory.
- Have multiple processors process asynchronously and independently multiple mini-batches.
- Update θ in mutual exclusion using a synchronization lock.

Drawbacks:

- Some delay might occur between the time gradient components are computed and the time they are eventually used to update θ. Hence, processors might use stale parameters that do not take into account the very last updates. Yet, [1, 4] showed that convergence is still guaranteed under some conditions.
- Contention might appear on the synchronization lock, causing the processors to queue and idle. This is likely to happen when updating θ takes a non-negligible amount of time or as the number of processors increases.

Keep going instead of blocking on the synchronization lock, hence solving the idling problem! This is what we address in this work.
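For contrast with the zealous variant above, a minimal sketch of the blocking scheme of [1, 2, 3] might look as follows, reusing the same hypothetical grad and minibatches placeholders; every thread acquires the lock before updating θ, which is exactly where the queueing and idling come from:

import threading

def async_worker(theta, lock, minibatches, grad, alpha):
    # Standard asynchronous worker: mini-batches are processed independently,
    # but every update of the shared θ is performed in mutual exclusion.
    for b in minibatches:
        g = sum(grad(theta, z) for z in b) / len(b)
        with lock:               # blocks (and idles) while another thread holds the lock
            theta -= alpha * g   # in-place update of the shared parameter vector (a NumPy array)

# Example wiring (all names below are placeholders):
# lock = threading.Lock()
# threads = [threading.Thread(target=async_worker,
#                             args=(theta, lock, batches[i], grad, alpha))
#            for i in range(n_threads)]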

Experimental results

Setting

- Train a restricted Boltzmann machine on a large collaborative filtering task [3, 5].
- θ counts 10M+ values, hence executing the critical section takes a fair amount of time.
- Experiments carried out on a dedicated 24-core machine.

Results

Parallel efficiency is measured as efficiency = T(1) / (n · T(n)), where T(n) denotes the training time with n cores.

[Plots comparing the zealous and asynchronous algorithms. Annotations: "Zealous with 4 cores is nearly as fast as asynchronous with 8 cores!"; "Oscillations".]
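For completeness, the efficiency measure above can be restated as a small helper; the timing values themselves are not reproduced here.

def speedup(t1, tn):
    # Speedup of n cores over a single core: T(1) / T(n).
    return t1 / tn

def efficiency(t1, tn, n):
    # Parallel efficiency: T(1) / (n * T(n)); a value of 1.0 means linear scaling.
    return t1 / (n * tn)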

References and acknowledgements

[1] A. Nedic, D. P. Bertsekas, and V. S. Borkar. Distributed asynchronous incremental subgradient methods. Studies in Computational Mathematics, 8:381–407, 2001.
[2] K. Gimpel, D. Das, and N. A. Smith. Distributed asynchronous online learning for natural language processing. In Proceedings of the Conference on Computational Natural Language Learning, 2010.
[3] G. Louppe. Collaborative filtering: Scalable approaches using restricted Boltzmann machines. Master’s thesis, University of Liège, 2010.
[4] M. Zinkevich, A. Smola, and J. Langford. Slow learners are fast. In Advances in Neural Information Processing Systems 22, pages 2331–2339, 2009.
[5] R. Salakhutdinov, A. Mnih, and G. E. Hinton. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning, page 798. ACM, 2007.

Gilles Louppe and Pierre Geurts are respectively research fellow and research associate of the FNRS Belgium. This paper presents research results of the Belgian Network BIOMAGNET (Bioinformatics and Modeling: from Genomes to Networks), funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office. The scientific responsibility rests with its authors.

Conclusions and future work

- Significant speedup over the asynchronous parallel gradient descent algorithm. Future work: corroborate the results obtained in this work with more thorough experiments.
- Updates of θ may become excessively delayed if the number of cores becomes too large, which can impair convergence. Future work: explore strategies to counter the effects of delay, and derive theoretical guarantees on the convergence of the algorithm.

