Improving Communication Patterns in Polyhedral Process Networks


HAL Id: hal-01725143

https://hal.inria.fr/hal-01725143

Submitted on 8 Mar 2018



To cite this version:

Christophe Alias. Improving Communication Patterns in Polyhedral Process Networks. HIP3ES 2018 - Sixth International Workshop on High Performance Energy Efficient Embedded Systems, Jan 2018, Manchester, United Kingdom. pp.1-6. hal-01725143


Christophe Alias

CNRS, ENS de Lyon, Inria, UCBL, Université de Lyon

christophe.alias[at]ens-lyon.fr

ABSTRACT

Embedded system performance is bounded by power consumption. The trend is to offload greedy computations to hardware accelerators such as GPUs, Xeon Phi or FPGAs. FPGA chips combine the flexibility of programmable chips with the energy efficiency of specialized hardware and appear as a natural solution. Hardware compilers from high-level languages (high-level synthesis, HLS) are required to exploit all the capabilities of FPGAs while satisfying tight time-to-market constraints. Compiler optimizations for parallelism and data locality deeply restructure the execution order of the processes, hence the read/write patterns in communication channels. This breaks most FIFO channels, which then have to be implemented with addressable buffers. Expensive hardware is required to enforce synchronizations, which often results in a dramatic performance loss. In this paper, we present an algorithm to partition the communications so that most FIFO channels can be recovered after a loop tiling, a key optimization for parallelism and data locality. Experimental results show a drastic improvement of FIFO detection for regular kernels at the cost of a small amount of additional storage. As a bonus, the storage can even be reduced in some cases.

1. INTRODUCTION

Since the end of Dennard scaling, the performance of embedded systems has been bounded by power consumption. The trend is to trade genericity (processors) for energy efficiency (hardware accelerators) by offloading critical tasks to specialized hardware. FPGA chips combine the flexibility of programmable chips with the energy efficiency of specialized hardware and appear as a natural solution. High-level synthesis (HLS) techniques are required to exploit all the capabilities of FPGAs, while satisfying tight time-to-market constraints. Parallelization techniques from high-performance compilers are progressively migrating to HLS, particularly the models and algorithms from the polyhedral model [7], a powerful framework to design compiler optimizations. Additional constraints must be fulfilled before plugging a compiler optimization into an HLS tool. Unlike software, the hardware size is bounded by the available silicon area. The bigger a parallel unit is, the less it can be duplicated, which limits the overall performance. In particular, tricky program optimizations are likely to spoil the performance if the circuit is not post-optimized carefully [5]. An important consequence is that the roofline model is no longer valid in HLS [8]. Indeed, peak performance is no longer a constant: it decreases with the operational intensity. The bigger the operational intensity, the bigger the buffer size and the less space remains for the computation itself. Consequently, it is important to produce, at source level, a precise model of the circuit that allows the resource consumption to be predicted accurately. Process networks are a natural and convenient intermediate representation for HLS [4, 13, 14, 19]. A sequential program is translated to a process network by partitioning computations into processes and flow dependences into channels. Then, the processes and buffers are factorized and mapped to hardware.

HIP3ES 2018, Sixth International Workshop on High Performance Energy Efficient Embedded Systems, Jan 22nd, 2018, Manchester, UK. In conjunction with HiPEAC 2018. https://www.hipeac.net/events/activities/7528/hip3es/

In this paper, we focus on the translation of buffers to hardware. We propose an algorithm to restructure the buffers so that they can be mapped to inexpensive FIFOs. Most often, a direct translation of a regular kernel, without optimization, produces a process network with FIFO buffers [16]. Unfortunately, data transfer optimization [3], and more generally loop tiling, deeply reorganizes the computations, hence the read/write order in channels (communication patterns). Consequently, most channels may no longer be implemented by a FIFO. Additional circuitry is required to enforce synchronizations [4, 20, 15, 17], which results in larger circuits and causes performance penalties. In this paper, we make the following contributions:

• We propose an algorithm to reorganize the communications between processes so that more channels can be implemented as FIFOs after a loop tiling. As far as we know, this is the first algorithm to recover FIFO communication patterns after a compiler optimization.

• Experimental results show that we can recover most of the FIFOs disabled by communication optimization, and more generally by any loop tiling, at almost no extra storage cost.

The remainder of this paper is structured as follows. Section 2 introduces polyhedral process networks and discusses how communication patterns are impacted by loop tiling, Section 3 describes our algorithm to reorganize channels, and Section 4 presents experimental results. Finally, Section 5 concludes this paper and draws future research directions.

2. PRELIMINARIES

This section defines the notions used in the remainder of this paper. Sections 2.1 and 2.2 introduce the basics of compiler optimization in the polyhedral model and define loop tiling. Section 2.3 defines polyhedral process networks (PPN), shows how loop tiling disables FIFO communication patterns and outlines a solution.

2.1 Polyhedral Model at a Glance

Translating a program to a process network requires splitting the computation into processes and the flow dependences into channels. The polyhedral model focuses on kernels whose computation and flow dependences can be predicted, represented and explored at compile time. The control must be predictable: for loops, and if statements with conditions on loop counters. Data structures are restricted to arrays; pointers are not allowed. Also, loop bounds, conditions and array accesses must be affine functions of surrounding loop counters and structure parameters (typically the array size). This way, the computation may be represented with Presburger sets (typically approximated with convex polyhedra, hence the name). This makes it possible to reason geometrically about the computation and to produce precise compiler analyses thanks to integer linear programming: flow dependence analysis [9], scheduling [7] or code generation [6, 12], to name a few. Most compute-intensive kernels from linear algebra and image processing fit in this category. In some cases, kernels with dynamic control can even fit in the polyhedral model after a proper abstraction [2]. Figure 1.(a) depicts a polyhedral kernel and (b) depicts the geometric representation of the computation for each assignment (• for assignment load, • for assignment compute and ◦ for assignment store). The vector $\vec{i} = (i_1, \ldots, i_n)$ of loop counters surrounding an assignment $S$ is called an iteration of $S$. The execution of $S$ at iteration $\vec{i}$ is denoted by $\langle S, \vec{i} \rangle$. The set $D_S$ of iterations of $S$ is called the iteration domain of $S$. The original execution of the iterations of $S$ follows the lexicographic order $\ll$ over $D_S$. For instance, on the statement C: $(t, i) \ll (t', i')$ iff $t < t'$ or ($t = t'$ and $i < i'$). The lexicographic order over $\mathbb{Z}^d$ is naturally partitioned by depth: $\ll \;=\; \ll_1 \uplus \ldots \uplus \ll_d$, where $(u_1, \ldots, u_d) \ll_k (v_1, \ldots, v_d)$ iff $\bigwedge_{i=1}^{k-1} u_i = v_i \,\wedge\, u_k < v_k$.
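As an illustration of the partition by depth, the following minimal Python sketch returns the depth $k$ at which two integer vectors are ordered by $\ll_k$. The function names `lex_depth` and `lex_less` are hypothetical and introduced only for this example.

```python
def lex_depth(u, v):
    """Return the 1-based depth k such that u <<_k v, or None if u << v does not hold.

    u <<_k v means: u and v agree on their first k-1 components and u[k] < v[k].
    """
    assert len(u) == len(v), "vectors must have the same dimension"
    for k, (a, b) in enumerate(zip(u, v), start=1):
        if a < b:
            return k      # first differing component is smaller in u
        if a > b:
            return None   # first differing component is larger in u
    return None           # u == v, so u << v does not hold


def lex_less(u, v):
    """Lexicographic order <<, i.e. the union of <<_1, ..., <<_d."""
    return lex_depth(u, v) is not None
```

On integer tuples, `lex_less(u, v)` coincides with Python's built-in tuple comparison `u < v`; the only extra information carried by `lex_depth` is the depth at which the two vectors first differ.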

Dataflow Analysis.

In Figure 1.(b), red arrows depict several flow dependences (read after write) between execution instances. We are interested in flow dependences relating the production of a value to its consumption, not only a write followed by a read to the same location. These flow dependences are called direct dependences. Direct dependences represent the communication of values between two computations and drive communications and synchronizations in the final process network. They are crucial to build the process network. Direct dependences can be computed exactly in the polyhedral model [9]. The result is a relation $\to$ relating each producer $\langle P, \vec{i} \rangle$ to one or more consumers $\langle C, \vec{j} \rangle$. Technically, $\to$ is a Presburger relation between vectors $(P, \vec{i})$ and vectors $(C, \vec{j})$, where assignments $P$ and $C$ are encoded as integers. For example, dependence 5 is summed up with the Presburger relation: $\{(\bullet, t-1, i) \to (\bullet, t, i),\ 0 < t \le T \wedge 0 \le i \le N\}$. Presburger relations are computable and efficient libraries exist to manipulate them [18, 10]. In the remainder, direct dependences will be referred to as flow dependences or simply dependences to simplify the presentation.
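For small, fixed values of N and T, the direct dependences of the Jacobi-1D kernel of Figure 1.(a) can also be enumerated by simulation. The sketch below is not the exact symbolic analysis of [9]: it simply replays the kernel and remembers the last writer of each array cell. The function name `direct_dependences` and the tuple encoding of execution instances are assumptions made for this example.

```python
def direct_dependences(N, T):
    """Enumerate (producer, consumer) pairs of the Jacobi-1D kernel by replaying it."""
    last_writer = {}   # array cell (t, i) -> execution instance that last wrote it
    deps = []

    for i in range(0, N + 2):                      # load(a[0, i])
        last_writer[(0, i)] = ("load", i)

    for t in range(1, T + 1):                      # compute
        for i in range(1, N + 1):
            for cell in [(t - 1, i - 1), (t - 1, i), (t - 1, i + 1)]:
                producer = last_writer.get(cell)
                # Border cells a[t-1, 0] and a[t-1, N+1] with t > 1 are never
                # written by the kernel: no flow dependence is recorded for them.
                if producer is not None:
                    deps.append((producer, ("compute", t, i)))
            last_writer[(t, i)] = ("compute", t, i)

    for i in range(1, N + 1):                      # store(a[T, i])
        deps.append((last_writer[(T, i)], ("store", i)))
    return deps


deps = direct_dependences(N=7, T=5)
```

With this encoding, the pairs whose producer and consumer are both `compute` instances with the same `i` correspond to dependence 5 of Figure 1.(b).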

2.2 Scheduling and Loop Tiling

Compiler optimizations change the execution order to fulfill multiple goals such as increasing the degree of parallelism or minimizing the communications. The new execution order is specified by a schedule. A schedule $\theta_S$ maps each execution $\langle S, \vec{i} \rangle$ to a timestamp $\theta_S(\vec{i}) = (t_1, \ldots, t_d) \in \mathbb{Z}^d$, the timestamps being ordered by the lexicographic order $\ll$. In a way, a schedule dispatches each execution instance $\langle S, \vec{i} \rangle$ into a new loop nest, $\theta_S(\vec{i}) = (t_1, \ldots, t_d)$ being the new iteration vector of $\langle S, \vec{i} \rangle$. A schedule $\theta$ induces a new execution order $\prec_\theta$ such that $\langle S, \vec{i} \rangle \prec_\theta \langle T, \vec{j} \rangle$ iff $\theta_S(\vec{i}) \ll \theta_T(\vec{j})$. Also, $\langle S, \vec{i} \rangle \preceq_\theta \langle T, \vec{j} \rangle$ means that either $\langle S, \vec{i} \rangle \prec_\theta \langle T, \vec{j} \rangle$ or $\theta_S(\vec{i}) = \theta_T(\vec{j})$. When a schedule is injective, it is said to be sequential: no two executions are scheduled at the same time. Hence everything is executed in sequence. In the polyhedral model, schedules are affine functions. They can be derived automatically from flow dependences [7]. In Figure 1, the original execution order is specified by the schedule $\theta_{load}(i) = (0, i)$, $\theta_{compute}(t, i) = (1, t, i)$ and $\theta_{store}(i) = (2, i)$. The lexicographic order ensures the execution of all the load instances (0), then all the compute instances (1) and finally all the store instances (2). Then, for each statement, the loops are executed in the specified order.

Loop tiling is a transformation which partitions the computation into tiles, each tile being executed atomically. Communication minimization [3] typically relies on loop tiling to tune the computation/communication ratio of the program beyond the peak performance/communication bandwidth ratio of the target architecture. Figure 3.(a) depicts the iteration domain of compute and the new execution order after tiling loops t and i. For presentation reasons, we depict a domain bigger than in Figure 1.(b) (with bigger N and T) and we depict only a part of the domain. In the polyhedral model, a loop tiling is specified by hyperplanes with linearly independent normal vectors $\vec{\tau}_1, \ldots, \vec{\tau}_d$, where $d$ is the number of nested loops (here $\vec{\tau}_1 = (0, 1)$ for the vertical hyperplanes and $\vec{\tau}_2 = (1, 1)$ for the diagonal hyperplanes). Roughly, hyperplanes along each normal vector $\vec{\tau}_i$ are placed at regular intervals $b_i$ (here $b_1 = b_2 = 2$) to cut the iteration domain into tiles. Then, each tile is identified by an iteration vector $(\phi_1, \ldots, \phi_d)$, $\phi_k$ being the slice number of an iteration $\vec{i}$ along normal vector $\vec{\tau}_k$: $\phi_k = \vec{\tau}_k \cdot \vec{i} \div b_k$. The result is a Presburger iteration domain, here $\hat{D} = \{(\phi_1, \phi_2, t, i),\ 2\phi_1 \le t < 2(\phi_1 + 1) \wedge 2\phi_2 \le t + i < 2(\phi_2 + 1)\}$: the polyhedral model is closed under loop tiling. In particular, the tiled domain can be scheduled. For instance, $\hat{\theta}_S(\phi_1, \phi_2, t, i) = (\phi_1, \phi_2, t, i)$ specifies the execution order depicted in Figure 3.(a): the tile with point (4,4) is executed, then the tile with point (4,8), then the tile with point (4,12), and so on. For each tile, the iterations are executed for each t, then for each i.
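To make the tiled execution order concrete, here is a minimal Python sketch that computes the tile coordinates $(\phi_1, \phi_2)$ from the constraints of $\hat{D}$ given above (slices of $t$ and of $t + i$) and sorts the compute iterations by the tiled schedule. The function names and the small domain sizes are assumptions chosen for illustration.

```python
def tile_coords(t, i, b1=2, b2=2):
    """Tile coordinates (phi1, phi2): b1*phi1 <= t < b1*(phi1+1), b2*phi2 <= t+i < b2*(phi2+1)."""
    return (t // b1, (t + i) // b2)


def tiled_schedule(t, i, b1=2, b2=2):
    """Timestamp (phi1, phi2, t, i): tiles first, then the original order inside a tile."""
    phi1, phi2 = tile_coords(t, i, b1, b2)
    return (phi1, phi2, t, i)


def tiled_order(N, T, b1=2, b2=2):
    """Compute iterations (t, i), 1 <= t <= T and 1 <= i <= N, sorted by the tiled schedule."""
    points = [(t, i) for t in range(1, T + 1) for i in range(1, N + 1)]
    # Python compares tuples lexicographically, which matches the order on timestamps.
    return sorted(points, key=lambda p: tiled_schedule(*p, b1, b2))


order = tiled_order(N=4, T=3)
```

On this small domain, iteration (3, 1) is executed before iteration (2, 4): both share $\phi_1 = 1$, but (3, 1) lies in an earlier slice of $t + i$. This kind of reordering with respect to the original $(t, i)$ order is what can break FIFO communication patterns.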

2.3 Polyhedral Process Networks

for i := 0 to N + 1
    • load(a[0, i]);
for t := 1 to T
    for i := 1 to N
        • a[t, i] := a[t − 1, i − 1] + a[t − 1, i] + a[t − 1, i + 1];
for i := 1 to N
    ◦ store(a[T, i]);

[Figure 1, panels (b) and (c): iteration domains over axes i (0 to N = 7) and t (1 to T = 5) with flow dependences numbered 1 to 7, and the process network with processes Load, C and Store and channels 1 to 7.]

(a) Jacobi 1D kernel (b) Flow dependences (c) Polyhedral process network

Figure 1: Motivating example: Jacobi-1D kernel

Given the iteration domains and the flow dependence relation $\to$, we derive a polyhedral process network by partitioning iteration domains into processes and flow dependences into channels. More formally, a polyhedral process network is a couple $(\mathcal{P}, \mathcal{C})$ such that:

• Each process $P \in \mathcal{P}$ is specified by an iteration domain $D_P$ and a sequential schedule $\theta_P$ inducing an execution order $\prec_P$ over $D_P$. Each iteration $\vec{i} \in D_P$ realizes the execution instance $\mu_P(\vec{i})$ in the program. The processes partition the execution instances in the program: $\{\mu_P(D_P) \text{ for each process } P\}$ is a partition of the program computation.

• Each channel $c \in \mathcal{C}$ is specified by a producer process $P_c \in \mathcal{P}$, a consumer process $C_c \in \mathcal{P}$ and a dataflow relation $\to_c$ relating each production of a value by $P_c$ to its consumption by $C_c$: if $\vec{i} \to_c \vec{j}$, then execution $\vec{i}$ of $P_c$ produces a value read by execution $\vec{j}$ of $C_c$. $\to_c$ is a subset of the flow dependences from $P_c$ to $C_c$, and the collection of $\to_c$ for each channel $c$ between two given processes $P$ and $C$, $\{\to_c,\ (P, C) = (P_c, C_c)\}$, is a partition of the flow dependences from $P$ to $C$.

The goal of this paper is to find a partition of the flow dependences for each producer/consumer couple $(P, C)$, such that most channels from $P$ to $C$ can be realized by a FIFO. Figure 1.(c) depicts the PPN obtained with the canonical partition of computation: each execution $\langle S, \vec{i} \rangle$ is mapped to process $P_S$ and executed at process iteration $\vec{i}$: $\mu_{P_S}(\vec{i}) = \langle S, \vec{i} \rangle$. For presentation reasons, the compute process is depicted as C. Dependences labeled k on the dependence graph in (b) are solved by channel k. To read the input values in parallel, we use a different channel per couple producer/read reference, hence this partitioning. We assume that, locally, each process executes instructions in the same order as in the original program: $\theta_{load}(i) = i$, $\theta_{compute}(t, i) = (t, i)$ and $\theta_{store}(i) = i$. Note that the leading constant (0 for load, 1 for compute, 2 for store) has disappeared: the timestamps only define an order local to their process: $\prec_{load}$, $\prec_{compute}$ and $\prec_{store}$. The global execution order is driven by the dataflow semantics: the next process operation is executed as soon as its operands are available. The next step is to detect communication patterns to figure out how to implement channels.

Communication Patterns.

A channel $c \in \mathcal{C}$ may be implemented by a FIFO iff the consumer $C_c$ reads the values from $c$ in the same order as the producer $P_c$ writes them to $c$ (in-order) and each value is read exactly once (unicity) [14, 16]. The in-order constraint can be written:

$$\text{in-order}(\to_c, \prec_P, \prec_C) \;:=\; \forall x \to_c x',\ \forall y \to_c y' :\ x' \prec_C y' \Rightarrow x \preceq_P y$$

The unicity constraint can be written:

$$\text{unicity}(\to_c) \;:=\; \forall x \to_c x',\ \forall y \to_c y' :\ x' \neq y' \Rightarrow x \neq y$$

Notice that unicity depends only on the dataflow relation $\to_c$; it is independent of the execution order of the producer process $\prec_P$ and of the consumer process $\prec_C$. Furthermore, checking $\neg\text{in-order}(\to_c, \prec_P, \prec_C)$ and $\neg\text{unicity}(\to_c)$ amounts to checking the emptiness of a convex polyhedron, which can be done by most LP solvers.

Finally, a channel may be implemented by a FIFO iff it verifies both the in-order and unicity constraints:

$$\text{fifo}(\to_c, \prec_P, \prec_C) \;:=\; \text{in-order}(\to_c, \prec_P, \prec_C) \wedge \text{unicity}(\to_c)$$
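On a finite instance, the three predicates can be checked by brute force, which is a convenient way to experiment with them. The sketch below is only an illustration, not the polyhedral emptiness check mentioned above: a dataflow relation is a list of (producer iteration, consumer iteration) pairs, and each order is given by a sequential schedule returning a timestamp. All names, as well as the `broadcast` toy relation, are assumptions made for this example.

```python
from itertools import product


def in_order(rel, theta_p, theta_c):
    """For all x -> x' and y -> y': x' before y' (consumer side) implies x not after y (producer side)."""
    return all(theta_p(x) <= theta_p(y)
               for (x, xc), (y, yc) in product(rel, repeat=2)
               if theta_c(xc) < theta_c(yc))


def unicity(rel):
    """For all x -> x' and y -> y': distinct consumers read distinct produced values."""
    return all(x != y
               for (x, xc), (y, yc) in product(rel, repeat=2)
               if xc != yc)


def fifo(rel, theta_p, theta_c):
    return in_order(rel, theta_p, theta_c) and unicity(rel)


def identity(p):
    """Original (untiled) local schedule: the timestamp is the iteration vector itself."""
    return p


# Channel 5 of the Jacobi-1D example: compute(t-1, i) -> compute(t, i).
N, T = 7, 5
chan5 = [((t - 1, i), (t, i)) for t in range(2, T + 1) for i in range(1, N + 1)]

# Hypothetical toy channel where each produced value is read twice.
broadcast = [(i // 2, i) for i in range(6)]
```

Because the schedules are sequential (injective), comparing timestamps with `<=` is enough to encode $\preceq_P$. With the original schedule, `chan5` satisfies both predicates, whereas `broadcast` is in-order but violates unicity.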

When the consumer reads the data in the same order as they are produced but a datum may be read several times, $\text{in-order}(\to_c, \prec_P, \prec_C) \wedge \neg\text{unicity}(\to_c)$, the communication pattern is said to be in-order with multiplicity: the channel may be implemented with a FIFO and a register keeping the last read value for multiple reads. However, additional circuitry is required to trigger the write of a new datum in the register [14]: this implementation is more expensive than a single FIFO. Finally, when we have neither in-order nor unicity, $\neg\text{in-order}(\to_c, \prec_P, \prec_C) \wedge \neg\text{unicity}(\to_c)$, the communication pattern is said to be out-of-order with multiplicity: significant hardware resources are required to enforce flow and anti-dependences between producer and consumer, and additional latencies may limit the overall throughput of the circuit [4, 20, 15, 17].

Consider Figure 1.(c), channel 5, implementing dependence 5 (depicted in (b)) from $\langle \bullet, t-1, i \rangle$ (write a[t − 1, i]) to $\langle \bullet, t, i \rangle$ (read a[t − 1, i]). With the schedule defined above, the data are produced ($\langle \bullet, t-1, i \rangle$) and read ($\langle \bullet, t, i \rangle$) in the same order, and only once: the channel may be implemented as a FIFO. Now, assume that process compute follows the tiled execution order depicted in Figure 3.(a). The execution order now executes the tile with point (4,4), then the tile with point (4,8), then the tile with point (4,12), and so on. In each tile, the iterations are executed for each t, then for each i. Consider the iterations depicted in red as 1, 2, 3, 4 in Figure 3.(b). With the new execution order, we execute successively 1, 2, 4, 3, whereas an in-order pattern would have required 1, 2, 3, 4. Consequently, channel 5 is no longer a FIFO. The same holds for channels 4 and 6. Now, the point is to partition dependence 5 and the others so that the FIFO communication pattern holds.

Consider Figure 3.(c). Dependence 5 is partitioned in 3 parts: red dependences crossing tiling hyperplane $\phi_1$ (direction t), blue dependences crossing tiling hyperplane t + i (direction t + i) and green dependences inside a tile. Since the execution order in a tile is the same as the original execution order (actually a subset of the original execution order), green dependences clearly verify the FIFO communication pattern. As for blue and red dependences, source and target are executed in the same order because the execution order is the same for each tile and dependence 5 happens to be short enough. In practice, this partitioning is effective to reveal FIFO channels. In the next section, we propose an algorithm to find such a partitioning.

1  split($\to_c$, $\theta_P$, $\theta_C$)
2    for k := 1 to n
3      add($\to_c \cap \{(x, y),\ \theta_P(x) \ll_k \theta_C(y)\}$);
4    add($\to_c \cap \{(x, y),\ \theta_P(x) \approx_n \theta_C(y)\}$);

5  fifoize(($\mathcal{P}$, $\mathcal{C}$))
6    for each channel c
7      $\{\to_c^1, \ldots, \to_c^{n+1}\}$ := split($\to_c$, $\theta_{P_c}$, $\theta_{C_c}$);
8      if fifo($\to_c^k$, $\prec_{\theta_{P_c}}$, $\prec_{\theta_{C_c}}$) ∀k
9        remove($\to_c$);
10       insert($\to_c^k$) ∀k;

Figure 2: Our algorithm for partitioning channels

3. OUR ALGORITHM

Figure 2 depicts our algorithm for partitioning channels given a polyhedral process network $(\mathcal{P}, \mathcal{C})$ (line 5). For each channel $c$ from a producer $P = P_c$ to a consumer $C = C_c$, the channel is partitioned by depth along the lines described in the previous section (line 7). $D_P$ and $D_C$ are assumed to be tiled with the same number of hyperplanes. $P$ and $C$ are assumed to share a schedule with the shape $\theta(\phi_1, \ldots, \phi_n, \vec{i}) = (\phi_1, \ldots, \phi_n, \vec{i})$. This case arises frequently with tiling schemes for I/O optimization [4]. If not, the next channel $\to_c$ is considered (line 6). The split is realized by procedure split (lines 1–4). A new partition is built starting from the empty set. For each depth (hyperplane) of the tiling, the dependences crossing that hyperplane are filtered and added to the partition (line 3): this gives dependences $\to_c^1, \ldots, \to_c^n$. Finally, dependences lying in a tile (source and target in the same tile) are added to the partition (line 4): this gives $\to_c^{n+1}$. $\theta_P(x) \approx_n \theta_C(y)$ means that the first $n$ dimensions of $\theta_P(x)$ and $\theta_C(y)$ (tiling coordinates $(\phi_1, \ldots, \phi_n)$) are the same: $x$ and $y$ belong to the same tile. Consider the PPN depicted in Figure 1.(c) with the tiling and schedule discussed above: process compute is tiled as depicted in Figure 3.(c) with the schedule $\theta_{compute}(\phi_1, \phi_2, t, i) = (\phi_1, \phi_2, t, i)$. Since processes load and store are not tiled, the only channels processed by our algorithm are 4, 5 and 6. split is applied on the associated dataflow relations $\to_4$, $\to_5$ and $\to_6$. Each dataflow relation is split in three parts as depicted in Figure 3.(c). For $\to_5$: $\to_5^1$ crosses hyperplane t (red), $\to_5^2$ crosses hyperplane t + i (blue) and $\to_5^3$ stays in a tile (green).
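The procedure of Figure 2 can be prototyped on a finite instance. The following Python sketch is a brute-force illustration under stated assumptions, not the symbolic implementation: relations are explicit lists of pairs, the tiled schedule uses $b_1 = b_2 = 2$ with the tile coordinates derived from $\hat{D}$ in Section 2.2, and empty parts are dropped (a choice made here, not stated in Figure 2). All identifiers are hypothetical.

```python
from itertools import product

B1, B2 = 2, 2  # tile sizes b1 and b2 of the running example


def theta(p):
    """Tiled schedule (phi1, phi2, t, i) of the compute process."""
    t, i = p
    return (t // B1, (t + i) // B2, t, i)


def fifo(rel, theta_p, theta_c):
    """Brute-force check of the in-order and unicity constraints on a finite relation."""
    in_order = all(theta_p(x) <= theta_p(y)
                   for (x, xc), (y, yc) in product(rel, repeat=2)
                   if theta_c(xc) < theta_c(yc))
    unicity = all(x != y
                  for (x, xc), (y, yc) in product(rel, repeat=2)
                  if xc != yc)
    return in_order and unicity


def split(rel, theta_p, theta_c, n):
    """Lines 1-4 of Figure 2: partition rel by the tiling depth that each dependence crosses."""
    parts = []
    for k in range(1, n + 1):
        # theta_P(x) <<_k theta_C(y): same first k-1 tile coordinates, smaller k-th one.
        parts.append([(x, y) for (x, y) in rel
                      if theta_p(x)[:k - 1] == theta_c(y)[:k - 1]
                      and theta_p(x)[k - 1] < theta_c(y)[k - 1]])
    # theta_P(x) ~_n theta_C(y): source and target lie in the same tile.
    parts.append([(x, y) for (x, y) in rel if theta_p(x)[:n] == theta_c(y)[:n]])
    return parts


def fifoize(channels, theta_p, theta_c, n):
    """Lines 5-10 of Figure 2: replace a channel by its parts when every part is a FIFO."""
    result = []
    for rel in channels:
        parts = [p for p in split(rel, theta_p, theta_c, n) if p]  # assumption: drop empty parts
        if all(fifo(p, theta_p, theta_c) for p in parts):
            result.extend(parts)
        else:
            result.append(rel)
    return result


# Channel 5 of Jacobi-1D: compute(t-1, i) -> compute(t, i), on a small domain.
N, T = 7, 5
chan5 = [((t - 1, i), (t, i)) for t in range(2, T + 1) for i in range(1, N + 1)]
parts = split(chan5, theta, theta, n=2)
channels_after = fifoize([chan5], theta, theta, n=2)
```

On this instance, `chan5` is not a FIFO under the tiled schedule, while each of the three parts returned by `split` is one, so `fifoize` replaces the channel by three FIFO channels.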

This algorithm works well for short uniform dependences $\to_c$: if fifo(c) holds before tiling, then, after tiling, the algorithm can split c in such a way that we get FIFOs. However, when dependences are longer, e.g. $(t, i) \to (t, i + 2)$, the target operations $(t, i + 2)$ reproduce the tile execution pattern, which prevents finding a FIFO. The same happens when the tile hyperplanes are “too skewed”, e.g. $\tau_1 = (1, 1)$, $\tau_2 = (2, 1)$ with dependence $(t - 1, i - 1) \to (t, i)$. Figure 3.(d) depicts the volume of data to be stored in the FIFO produced for each depth. In particular, the dotted line labeled k indicates the iterations producing data to be kept in the FIFO at depth k. The FIFO at depth 1 (dotted line labeled 1) must store N data at the same time. Similarly, the FIFO at depth 2 stores at most $b_1$ data and the FIFO at depth 3 stores at most $b_2$ data. Hence, on this example, each transformed channel requires $b_1 + b_2$ additional storage. In general, the additional storage requirements are one order of magnitude smaller than the original FIFO size and stay reasonable in practice, as shown in the next section.

4. EXPERIMENTAL EVALUATION

This section presents the experimental results obtained on the benchmarks of the polyhedral community. We demonstrate the capabilities of our algorithm at recovering FIFO communication patterns after loop tiling and we show how much additional storage is required.

Experimental Setup.

We have run our algorithm on the kernels of PolyBench/C v3.2 [11]. Tables 2 and 1 depict the results obtained for each kernel. Each kernel is tiled to reduce I/O while exposing parallelism [4] and translated to a PPN using our research compiler, Dcc (DPN C Compiler). Dcc actually produces a DPN (Data-aware Process Network), a PPN optimized for a specific tiling pattern. A DPN features additional control processes and synchronizations for I/O and parallelism, which have nothing to do with our optimization. So, we actually only consider the PPN part of our DPN. We have applied our algorithm to each channel to expose FIFO patterns. For each kernel, we compare the PPN obtained after tiling to the PPN processed by our algorithm.

Results.

Table 2 depicts the capabilities of our algorithm to find FIFO patterns. For each kernel, we provide the channel characteristics of the original tiled PPN (Before Partitioning) and after applying our algorithm (After Partitioning). We give the total number of channels (#channel), the number of FIFOs found among these channels (#fifo), the number of channels which were successfully turned into FIFOs thanks to our algorithm (#fifo-split), the ratios #fifo/#channel (%fifo) and #fifo-split/#channel (%fifo-split), the cumulative size of the FIFOs found (fifo-size) and the cumulative size of the channels found, including FIFOs (total-size). On every kernel, our algorithm succeeds in exposing more FIFO patterns (%fifo vs %fifo-split). On a significant number of kernels (11 out of 15), we even succeed in turning all the compute channels into FIFOs. On the remaining kernels, we succeed in recovering all the FIFO communication patterns disabled by the tiling. Even though our method is not complete, as discussed in Section 3, it happens that all the kernels fulfill the conditions expected by our algorithm (short dependences, tiling hyperplanes not too skewed).

[Figure 3, four panels over axes i and t: (a) Loop tiling (b) Communication pattern (c) Our solution (d) Storage requirements]

Figure 3: Impact of loop tiling on communication pattern

Table 1 depicts the additional storage required after splitting channels. For each kernel, we compare the cumulative size of the channels that were split and successfully turned into FIFOs (size-fifo-fail) to the cumulative size of the FIFOs generated by the splitting (size-fifo-split). The size unit is a datum, e.g. 4 bytes if a datum is a 32-bit float. We also quantify the additional storage required by split channels compared to the original channels (∆ := [size-fifo-split − size-fifo-fail] / size-fifo-fail). It turns out that the FIFOs generated by splitting use mostly the same data volume as the original channels. Additional storage resources are due to our sizing heuristic [1], which rounds channel sizes to a power of 2. Surprisingly, splitting can sometimes help the sizing heuristic to find a smaller size (kernel gemm), thereby reducing the storage requirements. Indeed, splitting decomposes a channel into channels of a smaller dimension, for which our sizing heuristic is more precise. In a way, our algorithm makes it possible to find a piecewise allocation function whose footprint is smaller than a single-piece allocation. We plan to exploit this side effect in the future.

kernel       size-fifo-fail   size-fifo-split   ∆
trmm         256              257               0%
gemm         512              288               -44%
syrk         8192             8193              0%
symm         800              801               0%
gemver       32               33                3%
gesummv      0                0
syr2k        8192             8193              0%
lu           528              531               1%
cholesky     273              275               1%
atax         1                1                 0%
doitgen      4096             4097              0%
jacobi-2d    8320             8832              6%
seidel-2d    49952            52065             4%
jacobi-1d    1152             1174              2%
heat-3d      148608           158992            7%

Table 1: Impact on storage requirements
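The ∆ column of Table 1 follows directly from the two size columns. As a quick sanity check, the sketch below recomputes it for a few rows copied from the table; the function name `delta` and the handling of a zero original size (returned as `None`, matching the empty cell of gesummv) are choices made for this example.

```python
def delta(size_fifo_fail, size_fifo_split):
    """Relative storage overhead of the split channels, or None when the original size is 0."""
    if size_fifo_fail == 0:
        return None
    return (size_fifo_split - size_fifo_fail) / size_fifo_fail


# (size-fifo-fail, size-fifo-split) pairs copied from Table 1.
table1 = {
    "gemm": (512, 288),
    "jacobi-2d": (8320, 8832),
    "heat-3d": (148608, 158992),
    "gesummv": (0, 0),
}

for kernel, (before, after) in table1.items():
    d = delta(before, after)
    print(kernel, "n/a" if d is None else f"{d:+.1%}")
```

Rounded to the nearest percent, these values agree with the ∆ column: a reduction for gemm and an overhead of a few percent for jacobi-2d and heat-3d.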

5. CONCLUSION

In this paper, we have proposed an algorithm to reorganize the channels of a polyhedral process network to reveal more FIFO communication patterns. Specifically, our algorithm operates on producer/consumer processes whose iteration domains have been partitioned by a loop tiling. Experimental results show that our algorithm recovers the FIFOs disabled by loop tiling with almost the same storage requirements. Our algorithm is sensitive to the dependence size and to the chosen loop tiling. In the future, we plan to design a provably complete reorganization algorithm, in the sense that a FIFO channel will be recovered whatever the dependence size and the tiling used. We also observe that splitting channels can reduce the storage requirements in some cases. We plan to investigate how such cases can be revealed automatically.

                Before Partitioning                                                    After Partitioning
kernel       #channel  #fifo  #fifo-split  %fifo  %fifo-split  fifo-size  total-size   #channel  #fifo  fifo-size  total-size
trmm         2         1      2            50%    100%         256        512          3         3      513        513
gemm         2         1      2            50%    100%         16         528          3         3      304        304
syrk         2         1      2            50%    100%         1          8193         3         3      8194       8194
symm         6         3      6            50%    100%         18         818          7         7      819        819
gemver       6         3      5            50%    83%          4113       4161         7         6      4146       4162
gesummv      6         6      6            100%   100%         96         96           6         6      96         96
syr2k        2         1      2            50%    100%         1          8193         3         3      8194       8194
lu           8         0      3            0%     37%          0          1088         11        6      531        1091
cholesky     9         3      6            33%    66%          513        1074         11        8      788        1076
atax         5         3      4            60%    80%          48         65           5         4      49         65
doitgen      3         2      3            66%    100%         8192       12288        4         4      12289      12289
jacobi-2d    10        0      10           0%     100%         0          8320         18        18     8832       8832
seidel-2d    9         0      9            0%     100%         0          49952        16        16     52065      52065
jacobi-1d    6         1      6            16%    100%         1          1153         10        10     1175       1175
heat-3d      20        0      20           0%     100%         0          148608       38        38     158992     158992

Table 2: Detailed results

6. REFERENCES

[1] C. Alias, F. Baray, and A. Darte. Bee+Cl@k: An implementation of lattice-based array contraction in the source-to-source translator Rose. In ACM Conf. on Languages, Compilers, and Tools for Embedded Systems (LCTES’07), 2007.

[2] C. Alias, A. Darte, P. Feautrier, and L. Gonnord. Multi-dimensional rankings, program termination, and complexity bounds of flowchart programs. In International Static Analysis Symposium (SAS’10), 2010.

[3] C. Alias, A. Darte, and A. Plesco. Optimizing remote accesses for offloaded kernels: Application to high-level synthesis for FPGA. In ACM SIGDA Intl. Conference on Design, Automation and Test in Europe (DATE’13), Grenoble, France, 2013.

[4] C. Alias and A. Plesco. Data-aware Process Networks. Research Report RR-8735, Inria - Research Centre Grenoble – Rhône-Alpes, June 2015.

[5] C. Alias and A. Plesco. Optimizing Affine Control with Semantic Factorizations. ACM Transactions on Architecture and Code Optimization (TACO), 14(4):27, Dec. 2017.

[6] C. Bastoul. Efficient code generation for automatic parallelization and optimization. In 2nd International Symposium on Parallel and Distributed Computing (ISPDC 2003), 13-14 October 2003, Ljubljana, Slovenia, pages 23–30, 2003.

[7] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, Tucson, AZ, USA, June 7-13, 2008, pages 101–113, 2008.

[8] B. da Silva, A. Braeken, E. H. D’Hollander, and A. Touhafi. Performance modeling for FPGAs: extending the roofline model with high-level synthesis tools. International Journal of Reconfigurable Computing, 2013:7, 2013.

[9] P. Feautrier. Dataflow analysis of array and scalar references. International Journal of Parallel Programming, 20(1):23–53, 1991.

[10] W. Kelly, V. Maslov, W. Pugh, E. Rosser, T. Shpeisman, and D. Wonnacott. The omega calculator and library, version 1.1.0. College Park, MD, 20742:18, 1996.

[11] L.-N. Pouchet. Polybench: The polyhedral benchmark suite. URL: http://www.cs.ucla.edu/~pouchet/software/polybench/ [cited July,], 2012.

[12] F. Quilleré, S. Rajopadhye, and D. Wilde. Generation of efficient nested loops from polyhedra. International Journal of Parallel Programming, 28(5):469–498, 2000.

[13] E. Rijpkema, E. F. Deprettere, and B. Kienhuis. Deriving process networks from nested loop algorithms. Parallel Processing Letters, 10(02n03):165–176, 2000.

[14] A. Turjan. Compiling nested loop programs to process networks. PhD thesis, Leiden Institute of Advanced Computer Science (LIACS), and Leiden Embedded Research Center, Faculty of Science, Leiden University, 2007.

[15] A. Turjan, B. Kienhuis, and E. Deprettere. Realizations of the extended linearization model. Domain-specific processors: systems, architectures, modeling, and simulation, pages 171–191, 2002.

[16] A. Turjan, B. Kienhuis, and E. Deprettere. Classifying interprocess communication in process network representation of nested-loop programs. ACM Transactions on Embedded Computing Systems (TECS), 6(2):13, 2007.

[17] S. van Haastregt and B. Kienhuis. Enabling automatic pipeline utilization improvement in polyhedral process network implementations. In Application-Specific Systems, Architectures and Processors (ASAP), 2012 IEEE 23rd International Conference on, pages 173–176. IEEE, 2012.

[18] S. Verdoolaege. ISL: An integer set library for the polyhedral model. In ICMS, volume 6327, pages 299–302. Springer, 2010.

[19] S. Verdoolaege. Polyhedral Process Networks, pages 931–965. Handbook of Signal Processing Systems. 2010.

[20] C. Zissulescu, A. Turjan, B. Kienhuis, and E. Deprettere. Solving out of order communication using CAM memory: an implementation. In 13th Annual Workshop on Circuits, Systems and Signal Processing (ProRISC 2002), 2002.