A Common GPU n-Dimensional Array for Python and C

Texte intégral


References

GpuNDArray: a common n-dimensional array on the GPU. https://github.com/inducer/compyte/wiki

Travis E. Oliphant. Python for scientific computing. Computing in Science and Engineering, 9:10-20, 2007.

James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), 2010. Oral presentation.

More references in the paper.


Future Plans

• Use in Theano/PyOpenCL/PyCUDA

• Design and implement a good C/C++ interface

• Find ways to lower the overhead

• Use the implicit looping provided by CUDA and OpenCL

• World domination!


Benchmarks

[Figure: a+1 — time (us) vs. number of elements (10^4 to 10^7), comparing pycuda with GPU ndarray in four layouts: 1d contiguous, 4d contiguous, 4d contiguous not collapsed, and 4d strided outer (2d after collapse).]

[Figure: a**2 + b**2 + 2*a*b — same axes and same five configurations.]

Speed for contiguous cases is similar to other implementations.


Element-wise dimension collapsing

• Indexing computations are expensive

• The cost is paid per dimension (irrespective of their size)

• Suppose we have some elementwise work to do on a 3d tensor B that is a view of A, but strided in the innermost dimension.

– We can merge the two outer dimensions to obtain an equivalent array that accesses the same memory but with easier indexing.
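The merging rule can be sketched in a few lines of Python. This is a hypothetical helper (`collapse_dims` is not the library's actual API), using byte strides: two adjacent dimensions can be merged whenever the outer stride equals the inner size times the inner stride.

```python
def collapse_dims(shape, strides):
    """Merge adjacent dimensions i and i+1 whenever
    strides[i] == shape[i+1] * strides[i+1], so that an elementwise
    kernel has fewer index computations to perform."""
    out_shape, out_strides = [shape[0]], [strides[0]]
    for size, stride in zip(shape[1:], strides[1:]):
        if out_strides[-1] == size * stride:
            # The outer dimension steps exactly over the inner one: merge.
            out_shape[-1] *= size
            out_strides[-1] = stride
        else:
            out_shape.append(size)
            out_strides.append(stride)
    return out_shape, out_strides

# A C-contiguous float32 array of shape (4, 5, 7): byte strides (140, 28, 4).
# Everything merges into a single 1d dimension.
print(collapse_dims([4, 5, 7], [140, 28, 4]))   # ([140], [4])

# The view A[:, :, ::2]: shape (4, 5, 4), strides (140, 28, 8).
# The two outer dimensions merge; the strided inner one cannot.
print(collapse_dims([4, 5, 4], [140, 28, 8]))   # ([20, 4], [28, 8])
```

The second case is exactly the "4d strided outer" configuration of the benchmarks: after collapse the kernel only pays indexing cost for two dimensions instead of three.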


Functionality

What we have

• data types

• dimensions

• strides, views

• broadcasting

• elementwise kernels

• partial reductions

• support for CUDA and OpenCL

Interfaces

• Python

• C++ interface similar to the NumPy C-API (depends on Python)

Missing

• assignment

• reshaping

• a clean C interface


Why has this not been done before?

• Hard and time-consuming to get right and efficient

• Certain algorithms cannot work on a general memory layout

• Indexing computations take up a significant portion of time on the GPU


Comparison of existing implementations

Package    strides  broadcast  dimensions  types       backends
Theano     yes^a    yes        any         float32     CUDA
PyCUDA     no       no         any         all         CUDA
PyOpenCL   no       no         any         all         OpenCL
CUDAMat    no       yes^b      2           float32     CUDA
Gnumpy     no       yes        any         float32^c   CUDA
Thrust     no       no         1           all         CUDA
Desired    yes      yes        any         all         both

^a as number of elements
^b via a function
^c and a hackish form of boolean


Broadcasting

• We have matrix A (size [8,8]) and we want to add a bias vector b (size [8,1]) to it.

– This doesn't fit the rules for elementwise operations, since the two objects do not have the same number of elements.

• So we make virtual copies of b along the last dimension until it has the same size as A.

– Then we can proceed as usual for elementwise.
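NumPy applies the same rule on the CPU, so its behaviour serves as a minimal illustration of the semantics being described:

```python
import numpy as np

A = np.arange(64, dtype=np.float32).reshape(8, 8)   # matrix, shape (8, 8)
b = np.ones((8, 1), dtype=np.float32)               # bias vector, shape (8, 1)

# b's size-1 last dimension is virtually repeated to size 8: no copy of
# b's memory is made, and the elementwise add then proceeds as usual.
C = A + b
print(C.shape)             # (8, 8)
print((C == A + 1).all())  # True
```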


Strides

• Strides are a way to specify how much memory to skip between consecutive elements along a dimension.

– For a contiguous matrix, the row stride is the size of one element times the number of elements per row.

• We can use strides to take a submatrix B from A without copying any memory.

– The strides stay the same, but the number of elements in each dimension is reduced.
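The same mechanics can be seen in NumPy, whose stride model this design mirrors (strides in bytes):

```python
import numpy as np

A = np.arange(36, dtype=np.float32).reshape(6, 6)
# C-contiguous float32 (6, 6): the row stride is 6 elements * 4 bytes.
print(A.strides)   # (24, 4)

# Take the middle 3x3 submatrix as a view: no memory is copied.
B = A[1:4, 2:5]
print(B.strides)   # (24, 4) -- unchanged
print(B.shape)     # (3, 3)  -- only the per-dimension counts shrink

B[0, 0] = -1.0     # writing through the view...
print(A[1, 2])     # -1.0    ...modifies A's memory
```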


Features desired

• Support for varying datatypes

• Support for an arbitrary number of dimensions

• Support for strides

• Support for broadcasting

• Compatibility with CUDA and OpenCL

Easy to develop

• It is not always a good idea to make GPU code work for all memory layouts.

– Harder to code

– Harder to make efficient

• Just call as_{contiguous,fortran}_memory() on inputs!
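The CPU analogue of that fallback in NumPy: a kernel written only for C-contiguous inputs can still accept any layout by copying when, and only when, necessary (`contiguous_only_kernel` below is a hypothetical stand-in, not part of any library):

```python
import numpy as np

def contiguous_only_kernel(x):
    # np.ascontiguousarray is a no-op when x is already C-contiguous,
    # and makes a contiguous copy otherwise -- the "easy to develop"
    # trade-off: simpler kernel code, at the price of an occasional copy.
    x = np.ascontiguousarray(x)
    assert x.flags["C_CONTIGUOUS"]
    return x.sum()        # stand-in for the real computation

A = np.arange(16, dtype=np.float32).reshape(4, 4)
print(contiguous_only_kernel(A))     # 120.0 -- no copy, A is contiguous
print(contiguous_only_kernel(A.T))   # 120.0 -- copy, A.T is a strided view
```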


Why do we need this?

• Efficient linear algebra is at the core of many scientific applications

• On the CPU, the numpy ndarray provides a standard object (for Python, at least)

Why a new implementation?

There are already a number of existing GPU computing codebases:

Theano, PyCUDA/PyOpenCL, CUDAmat, Gnumpy, Thrust, ...

But:

1. All are incompatible with each other

2. They do not support the full range of numpy ndarray features

3. None support both CUDA and OpenCL

Frédéric Bastien, Arnaud Bergeron, Andreas Klöckner, Pascal Vincent, and Yoshua Bengio. A Common GPU n-Dimensional Array for Python and C.
