Compiling Image Processing Applications for Many-Core Accelerators

(1)

HAL Id: hal-01254412

https://hal-mines-paristech.archives-ouvertes.fr/hal-01254412

Submitted on 12 Jan 2016

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Compiling Image Processing Applications for Many-Core Accelerators

Pierre Guillou

To cite this version:

Pierre Guillou. Compiling Image Processing Applications for Many-Core Accelerators . ACACES Summer School : Eleventh International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems, Jul 2015, Fiuggi, Italy. �hal-01254412�

(2)

Compiling Image Processing Applications

for Many-Core Accelerators

Pierre Guillou

– CRI MINES ParisTech, PSL Research University

Image Processing

image analysis:

detect geometrical structures in an image

mathematical morphology:

image analysis theory and technique based

on lattices theory

Mathematical Morphology Base Operators

arithmetic operators

unary (pixel ⊗ parameter, 1 input image) binary (pixel ⊗ pixel, 2 input images) + − × ÷ min max = & | ∼

morphological operators

stencils

neighbor selection + min/max/avg

reduction operators

global max/min/sum

other operators

threshold, mask, log2, . . .

=⇒ Sigma-C agent library

The MPPA-256 Chip

Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster

I/O cluster

I/O

clu

ster

I/O

clu

ster

I/O cluster

Shared

memory

(2 MB)

System core Compute core Compute core Compute core Compute core Compute core Compute core Compute core Compute core Debug support unit Compute core Compute core Compute core Compute core Compute core Compute core Compute core Compute core

NoC Interface

Host

machine

Local

memory

(3 GB)

PCI-E

DDR

Example: Licence Plate Extraction

Input Output

Sigma-C, a Dataflow Programming Language

Agent foo

=⇒

input 0 input 1 output a g e n t foo () { // d e s c r i b e a g e n t i n t e r f a c e i n t e r f a c e { in<int> input0 , i n p u t 1 ; out<int> o u t p u t ; // d e c l a r e the s t a t e m a c h i n e spec{ i n p u t 0 [2] , input1 , o u t p u t [ 3 ] } ; }

// loop over the s t a t e

void s t a r t() e x c h a n g e ( i n p u t 0 inp0 [2] , i n p u t 1 inp1 , o u t p u t outp [3]) { outp [0] = inp0 [0]; outp [1] = inp1 ; outp [2] = inp0 [1]; } } s u b g r a p h bar () { // d e s c r i b e s u b g r a p h i n t e r f a c e i n t e r f a c e { /* ... */ } map { // i n s t a n c i a t e a g e n t s a g e n t a1 = new A g e n t 1 (); a g e n t a3 = new S u b g r a p h 3 (); // ... // c o n n e c t a g e n t s to s u b g r a p h i n t e r f a c e s c o n n e c t ( input0 , a1 . i n p u t 0 ); c o n n e c t ( a5 . output , o u t p u t 1 ); // ... // c o n n e c t a g e n t s c o n n e c t ( a1 . output0 , a2 . i n p u t ); c o n n e c t ( a3 . output , a5 . i n p u t 1 ); // ... } } Agent 1 Agent 2 Subgraph 3 Agent 4 Agent 5

Subgraph bar

Compilation Chain

original application source-to-source

compiler call graph optimisations control code streaming kernels operator library

target-specific compiler GCC compute binaries host binary

Runtime Environment

Accelerator runtime

on I/O clusters

Host runtime

Control code

Compute clusters

Unix

pip

es

PCIe

NoC

launch computation send load from HD receive write on HD transfer transfer transfer store stream images read store result computation n computation i computation 1

Optimisations

unrolling of converging loops

arithmetic operators aggregation

generation of kernel-specific convolutions

data parallelization for compute-intensive operators

Results: Execution Times and Energy Consumption

(MPPA-256 = 1, lower is better)

anr999 antibio burner deblocking lp oop retina toggle GMEAN 10−2

10−1 100 101

Image processing applications

Relative

execution

time

MPPA-256 (Sigma-C/10 W) SPoC (FPGA/25 W)

AMD 4-core (OpenCL/60 W) Quadro 600 (OpenCL/40 W) Tesla C 2050 (OpenCL/240 W)

anr999 antibio burner deblocking lp oop retina toggle GMEAN 100

101 102

Image processing applications

Relative

energy

consumption

Future Work

Other programming models:

Pthreads/OpenMP on compute clusters, communication library between clusters OpenCL via local memory pagination

Compiling Image Processing Applications for Many-Core Accelerators

Compiling Image Processing Applications

for Many-Core Accelerators

Pierre Guillou

– CRI MINES ParisTech, PSL Research University

Image Processing

image analysis:

detect geometrical structures in an image

mathematical morphology:

image analysis theory and technique based

on lattices theory

Mathematical Morphology Base Operators

arithmetic operators

morphological operators

reduction operators

other operators

=⇒ Sigma-C agent library

The MPPA-256 Chip

I/O cluster

I/O

clu

ster

I/O

clu

ster

I/O cluster

Shared

memory

(2 MB)

NoC Interface

NoC Interface

Host

machine

Local

memory

(3 GB)

PCI-E

DDR

Example: Licence Plate Extraction

Sigma-C, a Dataflow Programming Language

Agent foo

=⇒

Subgraph bar

Compilation Chain

Runtime Environment

Accelerator runtime

on I/O clusters

Host runtime

Control code

Compute clusters

Unix

pip

es

PCIe

NoC

Optimisations

unrolling of converging loops

arithmetic operators aggregation

generation of kernel-specific convolutions

data parallelization for compute-intensive operators

Results: Execution Times and Energy Consumption

Image processing applications

Relative

execution

time

Image processing applications

Relative

energy

consumption

Future Work

Other programming models:

Improve data-parallelism to take better advantage of the curent architecture

Implement more complex algorithms: watershed, arrow, labelling, minima, . . .

References

Pierre Guillou, Fabien Coelho, and François Irigoin.

Automatic Streamization of Image Processing Applications.

The 27th International Workshop on Languages and Compilers for Parallel Computing (LCPC), 2014.