• Aucun résultat trouvé

Compiling Image Processing Applications for Many-Core Accelerators

N/A
N/A
Protected

Academic year: 2021

Partager "Compiling Image Processing Applications for Many-Core Accelerators"

Copied!
2
0
0

Texte intégral

(1)

HAL Id: hal-01178938

https://hal-mines-paristech.archives-ouvertes.fr/hal-01178938

Submitted on 21 Jul 2015

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Compiling Image Processing Applications for Many-Core Accelerators

Pierre Guillou

To cite this version:

Pierre Guillou. Compiling Image Processing Applications for Many-Core Accelerators. Journées de seconde année de l’Ecole Doctorale, Jun 2015, Paris, France. �hal-01178938�

(2)

Compiling Image Processing Applications

for Many-Core Accelerators

Pierre Guillou

– CRI MINES ParisTech, PSL Research University

Image Processing

image analysis:

detect geometrical structures in an image

mathematical morphology:

image analysis theory and technique based

on lattices theory

Mathematical Morphology Base Operators

arithmetic operators

unary (pixel ⊗ parameter, 1 input image) binary (pixel ⊗ pixel, 2 input images) + − × ÷ min max = & | ∼

morphological operators

stencils

neighbor selection + min/max/avg

reduction operators

global max/min/sum

other operators

threshold, mask, log2, . . .

=⇒ Sigma-C agent library

The MPPA-256 Chip

Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster Compute cluster

I/O cluster

I/O

clu

ster

I/O

clu

ster

I/O cluster

Shared

memory

(2 MB)

System core Compute core Compute core Compute core Compute core Compute core Compute core Compute core Compute core Debug support unit Compute core Compute core Compute core Compute core Compute core Compute core Compute core Compute core

NoC Interface

NoC Interface

Host

machine

Local

memory

(3 GB)

PCI-E

DDR

Example: Licence Plate Extraction

Input Output

Sigma-C, a Dataflow Programming Language

Agent foo

=⇒

input 0 input 1 output a g e n t foo () { // d e s c r i b e a g e n t i n t e r f a c e i n t e r f a c e { in<int> input0 , i n p u t 1 ; out<int> o u t p u t ; // d e c l a r e the s t a t e m a c h i n e spec{ i n p u t 0 [2] , input1 , o u t p u t [ 3 ] } ; }

// loop over the s t a t e

void s t a r t() e x c h a n g e ( i n p u t 0 inp0 [2] , i n p u t 1 inp1 , o u t p u t outp [3]) { outp [0] = inp0 [0]; outp [1] = inp1 ; outp [2] = inp0 [1]; } } s u b g r a p h bar () { // d e s c r i b e s u b g r a p h i n t e r f a c e i n t e r f a c e { /* ... */ } map { // i n s t a n c i a t e a g e n t s a g e n t a1 = new A g e n t 1 (); a g e n t a3 = new S u b g r a p h 3 (); // ... // c o n n e c t a g e n t s to s u b g r a p h i n t e r f a c e s c o n n e c t ( input0 , a1 . i n p u t 0 ); c o n n e c t ( a5 . output , o u t p u t 1 ); // ... // c o n n e c t a g e n t s c o n n e c t ( a1 . output0 , a2 . i n p u t ); c o n n e c t ( a3 . output , a5 . i n p u t 1 ); // ... } } Agent 1 Agent 2 Subgraph 3 Agent 4 Agent 5

Subgraph bar

Compilation Chain

original application source-to-source

compiler call graph optimisations control code streaming kernels operator library

target-specific compiler GCC compute binaries host binary

Runtime Environment

Accelerator runtime

on I/O clusters

Host runtime

Control code

Compute clusters

Unix

pip

es

PCIe

NoC

launch computation send load from HD receive write on HD transfer transfer transfer store stream images read store result computation n computation i computation 1

Optimisations

unrolling of converging loops

arithmetic operators aggregation

generation of kernel-specific convolutions

data parallelization for compute-intensive operators

Results: Execution Times and Energy Consumption

(MPPA-256 = 1, lower is better)

anr999 antibio burner deblocking lp oop retina toggle GMEAN 10−2

10−1 100 101

Image processing applications

Relative

execution

time

MPPA-256 (Sigma-C/10 W) SPoC (FPGA/25 W)

AMD 4-core (OpenCL/60 W) Quadro 600 (OpenCL/40 W) Tesla C 2050 (OpenCL/240 W)

anr999 antibio burner deblocking lp oop retina toggle GMEAN 100

101 102

Image processing applications

Relative

energy

consumption

Future Work

Other programming models:

Pthreads/OpenMP on compute clusters, communication library between clusters OpenCL via local memory pagination

Improve data-parallelism to take better advantage of the curent architecture

Implement more complex algorithms: watershed, arrow, labelling, minima, . . .

References

Pierre Guillou, Fabien Coelho, and François Irigoin.

Automatic Streamization of Image Processing Applications.

The 27th International Workshop on Languages and Compilers for Parallel Computing (LCPC), 2014.

Références

Documents relatifs

We provide, in particular, a comparison of different more fine-grained performance measures (Section 2), a survey of the existing results (Section 4), an analysis how the

The IR of the closure compiler actually represents an excellent alternative for working at the bytecode level: it makes it possible to express code that is not directly obtainable

Dynamic binary optimization aims at applying optimization transformations with no access to source code or any form of intermediate representations during program’s load or

Le cas d'une barrière de sécurité routière est spécifié dans l'étude : une barrière a été testée expérimentalement, le programme Ls-Dyna est utilisé pour la

O objetivo proposto neste trabalho foi calcular o impacto ambiental da produção de suínos, recebendo dietas com diferentes níveis de proteína bruta (PB), por

The analyzer will essentially mark a parallelizable (dependence-free) execution superpath (which is the union of a set of execution paths) in the for- loop, along with exit points

commandes de pharmacie. Ceci représente bien sûr un soulagement pour les IDE qui jugent bien souvent cette activité fastidieuse à réaliser. Malheureusement, le gain de temps lié

This paper focuses on the platform independent subsystem that realises deployment and redeployment of J2EE modules based on the new J2EE Deployment API as a part of the