HAL Id: hal-01254412
https://hal-mines-paristech.archives-ouvertes.fr/hal-01254412
Submitted on 12 Jan 2016
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
Compiling Image Processing Applications for Many-Core Accelerators
Pierre Guillou
To cite this version:
Pierre Guillou. Compiling Image Processing Applications for Many-Core Accelerators . ACACES Summer School : Eleventh International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems, Jul 2015, Fiuggi, Italy. �hal-01254412�
Compiling Image Processing Applications
for Many-Core Accelerators
Pierre Guillou
– CRI MINES ParisTech, PSL Research University
Image Processing
image analysis:
detect geometrical structures in an image
mathematical morphology:
image analysis theory and technique based
on lattices theory
Mathematical Morphology Base Operators
arithmetic operators
unary (pixel ⊗ parameter, 1 input image) binary (pixel ⊗ pixel, 2 input images) + − × ÷ min max = & | ∼
morphological operators
stencils
neighbor selection + min/max/avg
reduction operators
global max/min/sum
other operators
threshold, mask, log2, . . .
=⇒ Sigma-C agent library
The MPPA-256 Chip
Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster
I/O cluster
I/O
clu
ster
I/O
clu
ster
I/O cluster
Shared
memory
(2 MB)
System core Compute core Compute core Compute core Compute core Compute core Compute core Compute core Compute core Debug support unit Compute core Compute core Compute core Compute core Compute core Compute core Compute core Compute coreNoC Interface
NoC Interface
Host
machine
Local
memory
(3 GB)
PCI-E
DDR
Example: Licence Plate Extraction
Input Output
Sigma-C, a Dataflow Programming Language
Agent foo
=⇒
input 0 input 1 output a g e n t foo () { // d e s c r i b e a g e n t i n t e r f a c e i n t e r f a c e { in<int> input0 , i n p u t 1 ; out<int> o u t p u t ; // d e c l a r e the s t a t e m a c h i n e spec{ i n p u t 0 [2] , input1 , o u t p u t [ 3 ] } ; }// loop over the s t a t e
void s t a r t() e x c h a n g e ( i n p u t 0 inp0 [2] , i n p u t 1 inp1 , o u t p u t outp [3]) { outp [0] = inp0 [0]; outp [1] = inp1 ; outp [2] = inp0 [1]; } } s u b g r a p h bar () { // d e s c r i b e s u b g r a p h i n t e r f a c e i n t e r f a c e { /* ... */ } map { // i n s t a n c i a t e a g e n t s a g e n t a1 = new A g e n t 1 (); a g e n t a3 = new S u b g r a p h 3 (); // ... // c o n n e c t a g e n t s to s u b g r a p h i n t e r f a c e s c o n n e c t ( input0 , a1 . i n p u t 0 ); c o n n e c t ( a5 . output , o u t p u t 1 ); // ... // c o n n e c t a g e n t s c o n n e c t ( a1 . output0 , a2 . i n p u t ); c o n n e c t ( a3 . output , a5 . i n p u t 1 ); // ... } } Agent 1 Agent 2 Subgraph 3 Agent 4 Agent 5
Subgraph bar
Compilation Chain
original application source-to-sourcecompiler call graph optimisations control code streaming kernels operator library
target-specific compiler GCC compute binaries host binary
Runtime Environment
Accelerator runtime
on I/O clusters
Host runtime
Control code
Compute clusters
Unix
pip
es
PCIe
NoC
launch computation send load from HD receive write on HD transfer transfer transfer store stream images read store result computation n computation i computation 1Optimisations
unrolling of converging loops
arithmetic operators aggregation
generation of kernel-specific convolutions
data parallelization for compute-intensive operators
Results: Execution Times and Energy Consumption
(MPPA-256 = 1, lower is better)anr999 antibio burner deblocking lp oop retina toggle GMEAN 10−2
10−1 100 101
Image processing applications
Relative
execution
time
MPPA-256 (Sigma-C/10 W) SPoC (FPGA/25 W)
AMD 4-core (OpenCL/60 W) Quadro 600 (OpenCL/40 W) Tesla C 2050 (OpenCL/240 W)
anr999 antibio burner deblocking lp oop retina toggle GMEAN 100
101 102
Image processing applications
Relative
energy
consumption
Future Work
Other programming models:
Pthreads/OpenMP on compute clusters, communication library between clusters OpenCL via local memory pagination