Par4All: From Convex Array Regions to Heterogeneous Computing

(1)

HAL Id: hal-00744733

https://hal-mines-paristech.archives-ouvertes.fr/hal-00744733

Submitted on 2 Nov 2012

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-lished or not. The documents may come from

teaching and research institutions in France or

abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diffusion de documents

scientifiques de niveau recherche, publiés ou non,

émanant des établissements d’enseignement et de

recherche français ou étrangers, des laboratoires

publics ou privés.

Par4All: From Convex Array Regions to Heterogeneous

Computing

Mehdi Amini, Béatrice Creusillet, Stéphanie Even, Ronan Keryell, Onig

Goubier, Serge Guelton, Janice Onanian Mcmahon, François-Xavier Pasquier,

Grégoire Péan, Pierre Villalon

To cite this version:

Mehdi Amini, Béatrice Creusillet, Stéphanie Even, Ronan Keryell, Onig Goubier, et al.. Par4All: From

Convex Array Regions to Heterogeneous Computing. IMPACT 2012 : Second International Workshop

on Polyhedral Compilation Techniques HiPEAC 2012, Jan 2012, Paris, France. �hal-00744733�

(2)

Par4All: From Convex Array Regions to Heterogeneous

Computing

Mehdi Amini

1,2

_{Béatrice Creusillet}

2

_{Stéphanie Even}

2

_{Ronan Keryell}

2

Onig Goubier

2

_{Serge Guelton}

2

_{Janice Onanian McMahon}

2

François-Xavier Pasquier

2

_{Grégoire Péan}

2

Pierre Villalon

2

1

_{MINES ParisTech}

_firstname

_.

_lastname

_{@mines-paristech.fr}

2

_{HPC Project}

_firstname

_.

_lastname

_{@hpc-project.com}

Keywords

Heterogeneous computing, convex array regions, source-to-source compilation, polyhedral model, gpu, cuda, OpenCL.

ABSTRACT

Recent compilers comprise an incremental way for convert-ing software toward accelerators. For instance, the pgi Ac-celerator [14] or hmpp [3] require the use of directives. The programmer must select the pieces of source that are to be executed on the accelerator, providing optional directives that act as hints for data allocations and transfers. The compiler generates all code automatically.

Jcuda [15] offers a simpler interface to target cuda from Java_{. Data transfers are automatically generated for each} call. Arguments can be declared as IN, OUT, or INOUT to avoid useless transfers, but no piece of data can be kept in the gpu memory between two kernel launches. There have also been several initiatives to automate transforma-tions for OpenMP annotated source code to cuda [10, 11]. The gpu programming model and the host accelera-tor paradigm greatly restrict the potential of this approach, since OpenMP is designed for shared memory computer. Recent work [6, 9] adds extensions to OpenMP that account for cuda specificity. These make programs easier to write, but the developer is still responsible for designing and writ-ing communications code, and usually the programmer have to specialize his source code for a particular architecture. Unlike these approaches, Par4All [13] is an automatic par-allelizing and optimizing compiler for C and Fortran sequen-tial programs funded by the hpc Project startup. The pur-pose of this source-to-source compiler is to integrate several compilation tools into an easy-to-use yet powerful compiler that automatically transforms existing programs to target various hardware platforms. Heterogeneity is everywhere

IMPACT 2012

Second International Workshop on Polyhedral Compilation Techniques Jan 23, 2012, Paris, France

In conjunction with HiPEAC 2012. http://impact.gforge.inria.fr/impact2012

nowadays, from the supercomputers to the mobile world, and the future seems to be promised to more and more heterogeneity. Thus adapting automatically programs on targets such as multicore systems, embedded systems, high performance computers and gpus is a critical challenge. Par4All_{is mainly based on the pips [7, 1] source-to-source} compiler infrastructure and benefits from its interprocedural capabilities like memory effects, reduction detection, paral-lelism detection, but also polyhedral-based analyses such as convex array regions [4] and preconditions.

The source-to-source nature of Par4All makes it easy to integrate third-party tools into the compilation flow. For in-stance, we are using pips to identify parts that are of interest in a whole program, and we rely on the pocc [12] polyhedral loop optimizer to perform memory accesses optimizations on these parts, in order to exhibit locality for instance. The combination of pips’ analyses together and the inser-tion of other optimiser in the middle of the compilainser-tion flow is automated by Par4All using a programmable pass manager [5] to perform whole program analysis, spot paral-lel loops and generate mostly OpenMP, cuda or OpenCL code.

To that end, we mainly face two challenges: parallelism de-tection and data transfer generation. The OpenMP direc-tives generation relies on coarse grain parallelization and semantic-based reduction detection [8]. The cuda and OpenCL targets add the difficulty of data transfer man-agement. We tackle it using convex array regions that are translated into optimized, interprocedural data transfers be-tween host and accelerator as described in [2].

The demonstration will provide the assistance with a global understanding of Par4All internals compilation flow, go-ing through the interprocedural results of pips analyses, parallelism detection, data transfer generation and result-ing code execution. Several benchmark examples and some real-world scientific applications will be used as a showcase.

1. REFERENCES

[1] Mehdi Amini, Corinne Ancourt, Fabien Coelho, B´eatrice Creusillet, Serge Guelton, Fran¸cois Irigoin, Pierre Jouvelot, Ronan Keryell, and Pierre Villalon.

(3)

0.25x 0.5x 1x 2x 4x 8x 16x 32x 64x 128x 256x

2mm 3mm adi bicg correlationcovariance doitgen fdtd-2d gauss-filter gemm gemver gesummv

Polybench-2.0 OpenMP 4.1 4.0 2.9 1.2 6.1 6.1 6.1 1.6 2.6 4.5 6.6 2.6 HMPP-2.5.1 127 131 .3 13.5 13.5 1.1 9.8 .9 115 2.6 PGI-11.8 188 196 2.4 13.9 14.1 36.5 6.3 156 2.7 par4all-naive 150 150 186 185 3.0 .5 144 1.9 .4 par4all-opt 214 216 2.4 .5 310 314 48.2 9.5 4.7 211 6.5 .7 0.25x 0.5x 1x 2x 4x 8x 16x 32x 64x 128x

gramschmidtjacobi-1d jacobi-2d lu mvt symm-exp syrk syr2k hotspot99 lud99 srad99 Stars-PM Geo.Mean

Polybench-2.0 Rodinia 4.2 5.9 1.4 2.1 8.9 5.5 1.8 3.6 8.4 9.6 8.5 2.4 3.85 21.1 1.0 10.0 11.3 3.5 26.7 5.1 31.6 3.00 4.0 6.7 3.7 32.9 .3 1.0 2.15 .6 .3 6.4 .7 6.2 4.9 1.5 .5 1.4 3.6 2.22 2.7 4.9 10.7 12.5 6.6 51.3 6.6 5.2 30.5 7.3 19.5 52.0 14.43

Figure 1: Speedup relative to naive sequential version for an OpenMP version, a version with basic pgi and hmpp directives, a naive cuda version, and an optimized cuda version, all automatically generated from the naive sequential code.

PIPS Is not (only) Polyhedral Software. In First

International Workshop on Polyhedral Compilation Techniques, IMPACT, April 2011.

[2] Mehdi Amini, Fabien Coelho, Fran¸cois Irigoin, and Ronan Keryell. Static compilation analysis for host-accelerator communication optimization. In

Workshops on Languages and Compilers for Parallel Computing, LCPC, 2010.

[3] Francois Bodin and Stephane Bihan. Heterogeneous multicore parallel programming for graphics processing units. Sci. Program., 17:325–336, December 2009. [4] B´eatrice Creusillet and Francois Irigoin.

Interprocedural array region analyses. International

Journal of Parallel Programming, 24(6):513–546, 1996. [5] Serge Guelton. Building Source-to-Source compilers

for Heterogenous targets. PhD thesis, T´el´ecom Bretagne, 2011.

[6] Tianyi David Han and Tarek S. Abdelrahman. hiCUDA: a high-level directive-based language for GPU programming. In Proceedings of 2nd Workshop

on General Purpose Processing on Graphics Processing Units, pages 52–61, New York, NY, USA, 2009. ACM. [7] Fran¸cois Irigoin, Pierre Jouvelot, and R´emi Triolet.

Semantical interprocedural parallelization: an overview of the PIPS project. In International

Conference on Supercomputing, ICS, pages 244–251, 1991.

[8] Pierre Jouvelot and Babak Dehbonei. A unified semantic approach for the vectorization and parallelization of generalized reductions. In

International Conference on Supercomputing, ICS,

pages 186–194, 1989.

[9] Seyong Lee and Rudolf Eigenmann. OpenMPC: Extended OpenMP programming and tuning for GPUs. In SC ’10, pages 1–11, 2010.

[10] Seyong Lee, Seung-Jai Min, and Rudolf Eigenmann. OpenMP to GPGPU: a compiler framework for automatic translation and optimization. In PPoPP, 2009.

[11] Satoshi Ohshima, Shoichi Hirasawa, and Hiroki Honda. OMPCUDA : OpenMP execution framework for CUDA based on omni OpenMP compiler. In

Beyond Loop Level Parallelism in OpenMP: Accelerators, Tasking and More, volume 6132 of

Lecture Notes in Computer Science, pages 161–173. Springer Verlag, 2010.

[12] Louis-Noel Pouchet, C´edric Bastoul, and Uday Bondhugula. PoCC: the Polyhedral Compiler Collection, 2010. http://pocc.sf.net. [13] HPC Project. Par4All initiative for automatic

parallelization. http://www.par4all.org, 2010. [14] Michael Wolfe. Implementing the PGI accelerator

model. In Proceedings of the 3rd Workshop on

General-Purpose Computation on Graphics Processing Units, GPGPU, pages 43–50, New York, NY, USA, 2010. ACM.

[15] Yonghong Yan, Max Grossman, and Vivek Sarkar. JCUDA: A programmer-friendly interface for accelerating Java programs with CUDA. In

Proceedings of the 15th International Euro-Par Conference on Parallel Processing, 2009.