High-Level Synthesis for the Design of FPGA-based Signal Processing Systems


Emmanuel Casseau INRIA/IRISA

ENSSAT Université de Rennes1 Lannion, France [email protected]

Bertrand Le Gal IMS Laboratory CNRS UMR 5218

Abstract—High-level synthesis (HLS), which maps algorithms to architectures, can substantially reduce design time. While HLS tools were originally developed for ASIC technologies, HLS now draws wide interest from FPGA designers. However, the traditional resource sharing models used by most HLS techniques are very inaccurate for FPGAs: for example, multiplexers can be very expensive in such technologies. Resource usage optimizations and dedicated resource binding have to be applied. This paper presents an HLS process that takes data-width into account and combines scheduling and binding to carefully account for interconnect cost. Experimental results show that our approach achieves a significant reduction in area (34%) and dynamic power (28%) compared to a traditional synthesis.

Keywords: CAD, FPGA, high-level synthesis, VLSI design

I. INTRODUCTION

Digital signal processing (DSP) is often characterized by a large number of computations. In such applications, intensive use is made of data transfers, storage and computation, and data-widths differ from one variable to another. Hence the minimum number of bits required to represent the data varies considerably. Digital signal processors usually use 16, 32 or 64 bits to handle integer or real numbers in fixed-point representation. Stephenson [1] has shown that, in a set of benchmark programs written in C, on average 40% more bits are available than necessary for operations and 20% more for variable storage.

In the case of a hardware implementation, correctly-sized hardware resources are of major interest. For example, table 1 shows usual DSP resource costs: area, delay and dynamic power consumption, for an Altera Cyclone-III FPGA device.

Architectures based on undersized resources may cause overflow errors, whereas architectures based on oversized resources increase area, delay and power consumption. A word-length analysis makes it possible to reduce resource costs by identifying redundant bits.

Moreover, the increasingly demanding requirements of digital signal processing applications lead to the implementation of more and more complex algorithms and systems. To handle this increase in complexity and the "time to market" pressure, design methodologies based on High-Level Synthesis (HLS) can be used. High-level synthesis [2, 3] is analogous to software compilation transposed to the hardware domain. From an algorithmic description of the specification, HLS tools automate the design process and generate an RTL architecture that takes user-specified constraints into account. An HLS-based design flow thus reduces design time compared to a traditional design methodology based on handcrafted RTL architectures. However, as mentioned above, the word-lengths of data and operations in DSP applications can vary considerably throughout the processing.

Most HLS methodologies generate datapaths with a uniform (constant) word-length. The resource word-length is then chosen for the worst case, i.e. the largest data-width. Oversized architectures are thus generated. Moreover, most HLS-based design flows were developed for ASIC technologies and cannot be applied to FPGA platforms without wasting resources. For example, the area of interconnection resources (multiplexers, …) is similar to that of simple but commonly used DSP arithmetic operators such as adders or subtractors (table 1). The resource sharing overhead cannot be ignored.

Word-length (bits)            4      8     12     16     20     24     28     32     48     64

Area (logic elements)
  ADD                         4      8     12     16     20     24     28     32     48     64
  MULT                       15     50    100    163    253    379    525    691   1536   2728
  REG                         4      8     12     16     20     24     28     32     48     64
  MUX 2to1                    4      8     12     16     20     24     28     32     48     64

Critical path (ns)
  ADD                      1.56   2.17   2.23   2.73   2.70   2.89   3.24   3.52   4.51   5.45
  MULT                     2.77   3.16   3.45   3.45   5.50   6.25   6.65   7.05   8.04   9.17
  MUX 2to1                 1.05   1.30   1.32   1.51   1.45   1.67   1.89   1.97   2.35   2.35

Power (mW) at 40 MHz
  ADD                      1.35   1.67   2.66   3.35   3.45   5.85   5.79   6.81     13  14.38
  MULT                     1.43   2.29   4.28   5.45   9.24  12.07  16.62  20.79  43.31  74.61
  MUX 2to1                 1.36   2.17   2.78   3.74   3.94   4.51   5.06   5.25    8.6  11.08

Table 1. Resource costs (Altera Cyclone III)
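To make the resource sharing overhead concrete, the following sketch compares sharing one operator between two operations with simply duplicating it, using the area figures of Table 1. It is only an illustration: it assumes that sharing requires one 2-to-1 multiplexer in front of each operator input, it ignores registers and the controller, and the function names are hypothetical.

```python
# Area in logic elements, copied from Table 1 (Altera Cyclone III).
WIDTHS = [4, 8, 12, 16, 20, 24, 28, 32, 48, 64]
AREA = {
    "ADD":  [4, 8, 12, 16, 20, 24, 28, 32, 48, 64],
    "MULT": [15, 50, 100, 163, 253, 379, 525, 691, 1536, 2728],
    "MUX2": [4, 8, 12, 16, 20, 24, 28, 32, 48, 64],
}

def area(resource, width):
    """Area of one resource of the given word-length (Table 1 widths only)."""
    return AREA[resource][WIDTHS.index(width)]

def duplicated_area(resource, width):
    """Two operations, each with its own operator: no multiplexer needed."""
    return 2 * area(resource, width)

def shared_area(resource, width, operator_inputs=2):
    """Two operations on one operator.

    Assumption: one 2-to-1 multiplexer per operator input.
    """
    return area(resource, width) + operator_inputs * area("MUX2", width)

for res in ("ADD", "MULT"):
    print(res, duplicated_area(res, 16), shared_area(res, 16))
```

Under these assumptions, sharing a 16-bit adder costs more area than duplicating it (48 versus 32 logic elements), whereas sharing a 16-bit multiplier still pays off (195 versus 326).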

In this paper we propose an HLS-based design methodology targeting FPGA platforms. From a Matlab behavioral description of the application to implement, an RTL architecture is automatically generated. The behavioral description is first analyzed to determine data and operation widths. A concurrent ad hoc scheduling and binding algorithm is then used to optimize resource usage and reduce interconnect costs. Word-length information is taken into account during this step, and multiplexer cost models dedicated to FPGAs are used. The architecture is generated based on component libraries characterized for FPGA technologies (Xilinx and Altera devices).

The paper is organized as follows. Section II presents related work on word-length aware and FPGA-oriented high-level synthesis. Section III presents our design flow. Word-length analysis is presented in section IV. Section V details our high-level synthesis process. Experimental results are reported in section VI.

II. RELATED WORK AND CONTRIBUTION

A. Data-width aware high-level synthesis

Several high-level synthesis techniques have been suggested during the last two decades. However, traditional techniques usually address uniform-width resources, i.e. the same word-length for every resource (arithmetic, storage, interconnection) [4], [5]. Data-width aware design flows, including word-length analysis, scheduling and binding, are proposed in [6]–[8]. In [6], the data range propagation introduced in [1] is used to determine the minimum word-length required for each operation and memorization. Scheduling, binding and placement are first performed without considering word-length information. Word-length aware operation re-scheduling and re-binding are then performed to minimize the area cost of functional units. Once done, the data-width aware register allocation and binding tasks are performed. A traditional 3-step design flow is proposed in [7]. The data range is propagated through a data-flow graph to optimize the word-length of both operations and variables. Operations are scheduled with a list-scheduling algorithm. The binding step then aims at optimizing resource sharing while taking word-length into account.

In [8], the combined scheduling, binding, and word-length selection problem is formulated as an ILP problem. To reduce runtime, a heuristic solution is proposed in [9]. Two kinds of graphs are used: a sequencing graph which represents the data dependencies between the operations and the data, and a word-length compatibility graph which represents the compatibility between operations and the sized operators that can implement them with regard to their word-length.

B. FPGA resource reduction and high-level synthesis

Interconnection resources are very expensive in FPGAs. HLS processes focusing on FPGA platforms take particular care of interconnection cost during the binding process. In [10], a bipartite weighted matching approach is used to minimize the number of multiplexers in a datapath allocation. This approach has been enhanced for FPGA targets in [11] by the co-family based approach. In ASIC technologies, multiplexer area is linear in the port-width and in the number of input ports (figure 1a). In FPGAs, however, multiplexer area is linear in the port-width but not in the number of input ports (figure 1b). This is caused by the specific regular architecture of each FPGA target, which is based on look-up tables and internal multiplexers. Dedicated cost models are therefore required.

Figure 1a. ASIC multiplexer area cost

Figure 1b. FPGA multiplexer area cost

Figure 1. Relationship between the number of input ports and area cost for 16-bit multiplexers

In [12], a compatibility graph based binding approach is proposed to minimize the number of multiplexer inputs. In [13], an iterative binding algorithm is used to reduce the interconnect cost by using embedded memory blocks as register files on FPGAs. A simulated annealing based approach is proposed in [14] to optimize the assignment of variables to storage elements while taking multiplexer cost into account.

Pattern-based high-level synthesis techniques can also be applied to reduce interconnect cost. In [15], the proposed approach minimizes the resource cost through pattern selection and pattern-adaptive scheduling. The best patterns are chosen based on metrics which take multiplexer area into account. Multiplexer area is assumed to be linear in the number of input ports.

C. Contribution

Most word-length aware high-level synthesis approaches handle latency-constrained design flows and try to reduce the global hardware cost (design area). The best approaches are based on iterative or complex algorithms and produce efficient architectures. Unfortunately, most of them focus on ASIC technologies or rely on hardware assumptions that do not hold for FPGA platforms. For example, the interconnection cost is not of minor importance in FPGAs. Apart from the approach proposed in [16], which is dedicated to pipeline scheduling, and the approach proposed in [15], whose multiplexer cost model unfortunately assumes that area is linear in the number of input ports, HLS approaches targeting FPGAs take interconnection resource costs into account during the binding process only, i.e. during the last synthesis step. We believe this is not a good choice: some scheduling decisions would not have been made had the actual interconnection paths been known. Furthermore, the nonlinear relationship between the number of input ports and area cost in FPGAs (figure 1b) requires dedicated multiplexer cost models.

In this paper, we intend to take the actual interconnect cost into account as much as possible during the complete synthesis process. To account for FPGA features, and particularly for interconnection resource cost, our word-length aware HLS design flow combines scheduling and binding rather than processing these two synthesis steps separately. Targeted applications are digital signal processing applications. Synthesis can be constrained by throughput or by the number of arithmetic resources.

III. DESIGN FLOW OVERVIEW

Our design flow is presented in figure 2. It is based on two steps:

1. The first step performs a word-length analysis of the application.

2. The second step carries out the synthesis process taking into account FPGA platform features.

Figure 2. Analysis and synthesis design flow (blocks shown: behavioral description, input range, synthesis constraints, hardware library, compilation, internal model, upper and lower bound analysis, data-width computation, selection, allocation, combined scheduling and binding, VHDL generator, RTL architecture description)

The inputs are a Matlab behavioral description specifying the behavior of the application to implement, the synthesis constraint, and the input ranges (lower and upper values, or data word-length). The first step computes the number of bits required to represent the data. A fixed-point representation is considered, so each data item is made up of an integer part and a fractional part. The bit requirements of the integer part depend on the lower and upper bound values, which are therefore analyzed for each computation and each memorization. Once all bounds have been computed, the minimum number of bits required to implement the integer part of every data item and operation is evaluated. This value depends on the data, so the integer part word-length is non-uniform.

The bit requirements of the fractional part depend on the computation accuracy the user wants to achieve for the application. It is assumed that this accuracy has been determined beforehand, using [17] for example. The minimum number of bits required for the fractional part can thus be computed. To avoid binary point alignment, a uniform fractional part word-length is used.

The second step of the methodology relates to the high-level synthesis process. High-level synthesis is used to formally transform the behavioral description of an application into a hardware architecture according to a set of constraints. The synthesis process is performed taking into account word-length information and FPGA platform features, according to hardware libraries characterized for FPGA technologies (Xilinx and Altera devices). The architecture is generated in VHDL-RTL.

IV. WORD-LENGTH ANALYSIS

To reduce resource wastage, word-length analysis is performed before synthesis. This step aims to compute the minimum number of bits required to represent and implement each variable and operation used in the application, using a formal graph model. Usually DSP applications are regular and predictable. The modeling of such applications is generally performed using Data Flow Graph (DFG) or Signal Flow Graph (SFG) models [2], [3] which clearly exhibit the data dependencies. In the proposed design flow, the behavioral description which specifies the behavior of the application to implement is compiled to a SFG as internal representation.

Loops are unrolled and conditional structures are flattened. The intrinsic parallelism between operations can thus be easily exploited so that real-time constraints (throughput) can be satisfied. Each node (data and operations) of the model is annotated with its word-length, i.e. the minimum number of bits required to store or compute each data item. Range analysis is performed with a static method: from the user-defined input data ranges, the maximum values of the data are estimated by propagating data ranges through the DFG [1], [7]. The integer part word-length of each node is then determined from the lower-bound and upper-bound values of the data. Finally, the word-length of a particular node is made up of the integer part word-length (non-uniform) plus the fractional part word-length (uniform). Two's complement coding is used when negative values occur.
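The static range analysis described above can be sketched with interval arithmetic. This is a minimal illustration, not the implementation used in the design flow: the helper names are hypothetical, only additions and multiplications are covered, and the integer part is sized from the floor of each bound.

```python
import math

def add_range(a, b):
    """Range of x + y for x in a = (lo, hi) and y in b = (lo, hi)."""
    return (a[0] + b[0], a[1] + b[1])

def mul_range(a, b):
    """Range of x * y: the extremes are reached at the interval corners."""
    products = [p * q for p in a for q in b]
    return (min(products), max(products))

def integer_bits(lo, hi):
    """Minimum integer-part width covering [lo, hi].

    Unsigned coding when lo >= 0, two's complement otherwise.
    """
    lo, hi = math.floor(lo), math.floor(hi)
    if lo >= 0:
        return max(1, hi.bit_length())
    n = 1
    while lo < -(1 << (n - 1)) or hi > (1 << (n - 1)) - 1:
        n += 1
    return n

def word_length(value_range, fractional_bits):
    """Non-uniform integer part plus uniform fractional part."""
    return integer_bits(*value_range) + fractional_bits

x = (-128, 127)  # an 8-bit signed input
print(word_length(x, 0),
      word_length(add_range(x, x), 0),
      word_length(mul_range(x, x), 0))
```

With an 8-bit signed input, the sum needs 9 bits and the product 16 bits, which illustrates how word-lengths differ from node to node after propagation.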

It should be noted that this approach leads to pessimistic results, since it is a worst-case analysis [18]. However, with the design framework we use, the designer can manually restrict the word-length of operation or variable nodes using application knowledge: nodes of the graph can be specifically annotated. This step is performed before the data-width propagation through the representation model.

V. HIGH-LEVEL SYNTHESIS

A. High-level synthesis process overview

The synthesis process can be constrained by the designer using synthesis constraints such as the circuit throughput, the I/O chronology or the number of arithmetic resources. The synthesis starts with the selection of the hardware operators from a characterized library dedicated to the FPGA the designer targets. This step associates a delay with each operation of the internal representation. The mobility of operations is then computed from the ASAP and ALAP schedules. According to the average number of operations per cycle, the allocation step defines the minimum number of arithmetic operators required for each kind of operation in order to satisfy the designer's timing constraints. Operation scheduling is then performed. In our approach, the binding step is performed concurrently and associates an arithmetic operator, registers and interconnection resources with each scheduled operation.
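The mobility computation and the allocation lower bound mentioned above can be sketched as follows. This is an illustrative sketch under stated assumptions: the graph encoding (`succ`, `delay`), the function names and the lower bound `ceil(operations * delay / latency)` are choices made for the example, not details taken from the tool.

```python
import math

def asap(succ, delay):
    """Earliest start cycle of each node of an acyclic graph."""
    preds = {n: [] for n in succ}
    for n, successors in succ.items():
        for s in successors:
            preds[s].append(n)
    start = {}
    def visit(n):
        if n not in start:
            start[n] = max((visit(p) + delay[p] for p in preds[n]), default=0)
        return start[n]
    for n in succ:
        visit(n)
    return start

def alap(succ, delay, latency):
    """Latest start cycle of each node for a given latency constraint."""
    start = {}
    def visit(n):
        if n not in start:
            start[n] = min((visit(s) for s in succ[n]), default=latency) - delay[n]
        return start[n]
    for n in succ:
        visit(n)
    return start

def mobility(succ, delay, latency):
    """Mobility = ALAP start - ASAP start; zero means the node cannot be delayed."""
    early, late = asap(succ, delay), alap(succ, delay, latency)
    return {n: late[n] - early[n] for n in succ}

def min_operators(operations, delay_cycles, latency):
    """Assumed lower bound derived from the average number of operations per cycle."""
    return math.ceil(operations * delay_cycles / latency)

# Two 2-cycle multiplications feeding a chain of 1-cycle additions.
SUCC = {"m1": ["a1"], "m2": ["a1"], "a1": ["a2"], "a2": [], "a3": []}
DELAY = {"m1": 2, "m2": 2, "a1": 1, "a2": 1, "a3": 1}
print(mobility(SUCC, DELAY, latency=5))
print(min_operators(3, 1, 5), min_operators(2, 2, 5))
```

Because the bound is an average, it can be too optimistic for a given cycle, which is why the number of operators may have to be increased during scheduling.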

Register sharing is performed on the fly but, to save runtime and reduce the interconnection cost, it is applied locally. The architecture is composed of clusters. Each cluster is composed of one arithmetic operator, the registers connected to the inputs of this operator, and the associated interconnection resources. The input of a register is linked to one arithmetic operator only. Register sharing is applied only inside clusters rather than over the whole architecture.

B. HLS for FPGA resource reduction

High-level synthesis aims at optimizing resource sharing. Resources include arithmetic resources and registers. Because multiplexer cost cannot be ignored, interconnection resources have to be considered too. A resource can be shared if its uses do not take place simultaneously. Most high-level synthesis tools generate hardware architectures with a uniform-width datapath corresponding to the largest data word-length among the computations to be performed. Oversized architectures are thus generated. To reduce area and power consumption, a non-uniform width datapath can be synthesized based on word-length analysis. However, resource sharing is more complicated in this case: a resource cannot process or store data wider than itself, so the width of a resource has to be greater than or equal to the width of the data it handles. Architectures generated by a word-length aware high-level synthesis flow are thus made of resources which may process data of non-uniform width. At the hardware design level, when the width of a data item is smaller than that of the resource, a data expansion is required. In practice, the VHDL “resize” standard function (IEEE.NUMERIC_STD library) is used to automate this data expansion during the hardware generation step of the design flow (figure 2).
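The data expansion applied to two's complement values is a sign extension. The sketch below mimics it in Python on raw bit patterns for illustration only; `sign_extend` is a hypothetical helper, it covers expansion only, and it is not the VHDL function itself.

```python
def sign_extend(bits, from_width, to_width):
    """Expand a two's complement bit pattern from from_width to to_width bits.

    The sign bit is replicated into the new most significant bits, so the
    numeric value is preserved.
    """
    if to_width < from_width:
        raise ValueError("this sketch only covers expansion")
    sign = (bits >> (from_width - 1)) & 1
    if sign:
        bits |= ((1 << (to_width - from_width)) - 1) << from_width
    return bits

def to_signed(bits, width):
    """Interpret a bit pattern of the given width as a signed integer."""
    return bits - (1 << width) if bits >> (width - 1) else bits

narrow = 0b1011                      # -5 on 4 bits
wide = sign_extend(narrow, 4, 8)     # same value on an 8-bit resource
print(format(wide, "08b"), to_signed(wide, 8))
```

A 4-bit operand can thus be fed to an 8-bit shared operator without changing its value, at the cost of driving the extra input bits.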

Most high-level synthesis design flows have been developed for ASIC platforms. Traditionally, the scheduling step assumes that arithmetic operators are much more expensive than registers and interconnection resources. In FPGA devices, this holds for complex operators such as multipliers, but not for basic yet very common operators such as adders or subtractors (see Table 1). Such operators have the same area cost as multiplexers or registers. With FPGA platforms, the resource sharing overhead has to be accurately taken into account. For example, as seen in section II.B, FPGA multiplexer area is linear in the port-width but not in the number of input ports. A dedicated multiplexer cost model, which depends on the FPGA the designer targets, has to be used. Usually, operation scheduling and arithmetic and storage resource binding are based on estimated interconnect costs. Unfortunately, the actual interconnect cost results from interconnection binding, which is traditionally the last HLS step. Moreover, high-level synthesis methodologies which take data word-length into account usually operate on the scheduling step, focusing on sharing arithmetic resources among operations with similar operand word-lengths, and then try to optimize the binding step. Our proposed approach aims at limiting the extra cost due to the interconnections and the controller. An accurate resource sharing method is used.

We developed a joint ad hoc scheduling and binding algorithm based on word-length information and similarities between datapaths. The aim is not to minimize the arithmetic resources alone but the overall architecture, including registers and interconnection resources. For example, scheduling operations so that they share the same well-sized adder is not necessarily efficient if it requires two extra multiplexers and a register because new paths are eventually created during the binding step.

Controller cost is also to be considered. In order to share resources, a controller (FSM) is required to drive the multiplexers and the registers. The number of command signals roughly increases with the number of multiplexers (and their number of input ports) and registers. It thus increases the controller complexity (its decoding part) i.e. its area cost. It may involve a delay penalty when the critical path is linked with the controller too. Interconnection resource reduction thus makes it possible to also reduce controller costs.

C. Combined scheduling and binding

Figure 3 shows an overview of the scheduling and binding algorithm for applications under a timing constraint. A resource-constrained version can easily be derived from this algorithm. Arithmetic operation scheduling and global binding (operators, registers and interconnection resources) are performed concurrently. Compared to a traditional HLS process, the main difference is that an operation is bound as soon as it has been selected for scheduling and its binding cost (including extra interconnections and registers if required) is lower than that of the other operations ready to be scheduled.

From the annotated signal flow graph, the list of operation nodes to be scheduled is first extracted, and the nodes are sorted using the priority function of equation (1) (line 1 (L1) of figure 3).

At each clock cycle, when nodes can be scheduled (L3), the process is the following. Let N be the number of already allocated arithmetic resources. A list of N nodes is selected for scheduling and binding (L4, L8). These N nodes are extracted from the list of ready nodes. They include the N₀ nodes that cannot be delayed (zero-mobility nodes) and the N − N₀ nodes with the highest priority. Because the minimum number of arithmetic resources required for each kind of operation is computed from the average number of operations per cycle, N₀ may be greater than N. In this case, the number of allocated arithmetic resources is dynamically increased in order to meet the timing constraint¹ (L5-L7). The binding cost of the selected nodes is computed over the available resources. The minimum binding cost is used to select the particular operator which will implement each node of the node list (L9). Section V.E details the binding cost model and how the minimum binding cost is obtained. Binding is performed and the associated resources are updated if required: word-length increase, addition of one more input to a multiplexer, multiplexer or register allocation (L10). When every zero-mobility node has been scheduled and no more arithmetic resources are available, the next clock cycle is processed. The combined scheduling and binding process ends when all operation nodes have been processed.

1 In that case, no re-scheduling is performed, so that a solution is obtained in a short run-time. The set of resources is increased and the scheduling of the next operations benefits from this update.

Figure 3. Combined scheduling and binding algorithm
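The loop described in this section can be sketched as follows. This is a simplified illustration, not the algorithm of figure 3 itself: it assumes single-cycle operations of one kind, takes the ALAP dates and priorities as precomputed inputs, reduces the binding cost to a user-supplied callable `cost(op, node, done)`, and finds the minimum-cost assignment by brute force. All names are hypothetical.

```python
from itertools import permutations

def schedule_and_bind(preds, alap, priority, n_ops, cost):
    """Schedule and bind each node in the same pass.

    preds: node -> list of predecessors; alap: node -> latest start cycle;
    priority: node -> number (higher first); n_ops: operators allocated so far.
    Returns ({node: (cycle, operator)}, final number of operators).
    """
    done = {}
    cycle = 0
    while len(done) < len(preds):
        ready = [n for n in preds if n not in done
                 and all(p in done and done[p][0] < cycle for p in preds[n])]
        urgent = [n for n in ready if alap[n] <= cycle]   # zero-mobility nodes
        n_ops = max(n_ops, len(urgent))                   # dynamic allocation
        others = sorted((n for n in ready if n not in urgent),
                        key=lambda n: -priority[n])
        selected = (urgent + others)[:n_ops]
        # Minimum-cost assignment of the selected nodes to distinct operators.
        best = min(permutations(range(n_ops), len(selected)),
                   key=lambda ops: sum(cost(o, n, done)
                                       for o, n in zip(ops, selected)))
        for o, n in zip(best, selected):
            done[n] = (cycle, o)                          # scheduled and bound
        cycle += 1
    return done, n_ops

PREDS = {"a": [], "b": [], "c": ["a"], "d": ["b"], "e": ["c", "d"]}
ALAP = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}
PRIORITY = {n: 0 for n in PREDS}

def reuse_cost(op, node, done):
    """Toy cost: free if the operator already computed a predecessor."""
    return 0 if any(done[p][1] == op for p in PREDS[node]) else 1

print(schedule_and_bind(PREDS, ALAP, PRIORITY, n_ops=1, cost=reuse_cost))
```

In this example two nodes have zero mobility in the first cycle while only one operator is allocated, so the operator count is raised to two, as described for lines L5-L7.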

D. Scheduling priority function

The scheduling approach we use is based on the list-scheduling algorithm [2]. A list-based scheduling algorithm maintains a priority list of ready nodes. A ready node represents an operation which can be scheduled, i.e. whose predecessors have already been scheduled. The priority function is used to sort the ready operation nodes: nodes with higher priority are scheduled first. The priority function thus resolves resource contention among operations.

Our goal is to consider at the same time an efficient use of the parallelism of the application, word-length information and datapath cost. The scheduling priority function is thus based on the following metrics:

the operation mobility, as in traditional list-scheduling,

the operation word-length, to favor scheduling operations associated with a costly datapath first,

the number of operations which can be fired (immediate successors waiting for the result of the current operation). Scheduling nodes with a higher number of successors reduces variable lifetime [19], i.e. reduces data storage cost.

Priority= (1)

The two parameters of equation (1), indexed 1 and 2, can be adjusted by users according to their requirements, to trade off resource reduction against critical path overhead.

E. Binding cost

During the combined scheduling and binding step, weighted bipartite graphs [20] are used to select the minimum binding cost. The binding cost of a particular node over a particular arithmetic resource is thus required. Binding an operation to an operator involves the operator itself as well as the resources required to steer the input data to this operator. The binding cost thus includes the operator cost and the path cost. The path cost is made up of the interconnection and register resource costs as well as the controller cost. Because scheduling and binding are performed concurrently, these costs can be accurately computed from previously scheduled nodes and previously bound resources. Of course, the binding cost is assessed only when the operator can implement the current arithmetic operation.

The cost of binding operation node n_j to an available operator i is given by equation (2), where operator is the operator area overcost and path(i, n_j) is the path overcost.

(2)

The cost given by equation (2) for the pair (i, n_j) is an accurate measure of the operation binding cost. It takes into account the overall design overcost (operator, interconnection resources, registers, controller) required to bind an operation to a particular resource, and not only the arithmetic operator overcost.

Resource overcost - For each hardware function, we define an overcost function of (i, n_j) which computes the area overcost incurred when the particular hardware resource i is transformed so that it can compute, store or transmit the operation or variable node n_j to be bound. It relies on an area model which gives the area cost of a hardware function as a function of its word-length requirement. Because area is not always linear in the word-length (a multiplier for example, see table 1), we use polynomial models to compute resource area. These models come from our characterizations of the FPGA platforms we target. When the computed overcost for (i, n_j) is negative, i.e. resource i is oversized with respect to the word-length requirements, it is set to zero because the width of the resource does not need to be changed. The case i = 0 is particular: no resource is shared, that is to say a new resource is used. The resource overcost in this case is the complete area of the resource at the required word-length.
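The three cases of the resource overcost (widening, oversized resource, new resource) can be sketched as follows. The sketch is only illustrative: it substitutes a lookup in the multiplier areas of Table 1 for the polynomial area models, and the function name and the use of `None` for the case i = 0 are assumptions of the example.

```python
# Multiplier area in logic elements, copied from Table 1. The text uses
# polynomial area models; a table lookup stands in for them in this sketch.
WIDTHS = [4, 8, 12, 16, 20, 24, 28, 32, 48, 64]
MULT_AREA = dict(zip(WIDTHS, [15, 50, 100, 163, 253, 379, 525, 691, 1536, 2728]))

def resource_overcost(area_model, current_width, required_width):
    """Area overcost of making a resource able to handle required_width.

    current_width=None stands for the case i = 0: a new resource is
    allocated, so the overcost is its complete area.
    """
    if current_width is None:
        return area_model[required_width]
    # An oversized resource does not need to change: overcost clipped to zero.
    return max(0, area_model[required_width] - area_model[current_width])

print(resource_overcost(MULT_AREA, 16, 24))    # widen a 16-bit multiplier
print(resource_overcost(MULT_AREA, 32, 16))    # reuse an oversized multiplier
print(resource_overcost(MULT_AREA, None, 16))  # allocate a new multiplier
```

Widening a 16-bit multiplier to 24 bits costs 216 logic elements here, more than a new 16-bit multiplier (163), which shows why the overcost has to be evaluated per candidate resource.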

Interconnection resources need particular care because their area depends on both the number of input ports and the port width. As said before, in FPGAs, multiplexer area is linear in the port width but not in the number of input ports (figure 1b). The relationship between the number of input ports and the multiplexer area cost is hard to represent with a simple mathematical model and is specific to a particular FPGA target. The interconnection resource cost mux(i, n_j) is thus characterized by a linear model when only the word-length is changed, and by a technology-specific table otherwise. The underlying mux model gives the interconnection resource area cost as a function of the word-length requirement and the number of input ports.
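A cost model of this shape, linear in the port width and table-driven in the number of input ports, can be sketched as follows. Only the 2-input entry of the table is taken from Table 1 (one logic element per bit); every other entry is a hypothetical placeholder that would have to be replaced by a characterization of the targeted FPGA. The function names are also assumptions.

```python
# Logic elements per bit for a k-input multiplexer.
# Only the entry for 2 inputs comes from Table 1; all other values are
# HYPOTHETICAL placeholders standing in for a technology-specific table.
LE_PER_BIT = {2: 1, 3: 2, 4: 2, 5: 3, 6: 4, 7: 4, 8: 5}

def mux_area(inputs, width):
    """Linear in the port width, table lookup in the number of input ports."""
    return LE_PER_BIT[inputs] * width

def mux_overcost(current, data_width, add_input):
    """Overcost of steering data of data_width bits through a multiplexer.

    current is an (inputs, width) pair, or None when a new two-input
    multiplexer has to be allocated.
    """
    if current is None:
        return mux_area(2, data_width)
    inputs, width = current
    new_inputs = inputs + 1 if add_input else inputs
    new_width = max(width, data_width)   # a multiplexer is never narrowed
    return mux_area(new_inputs, new_width) - mux_area(inputs, width)

print(mux_overcost((2, 16), 24, add_input=False))  # word-length increase only
print(mux_overcost((2, 16), 16, add_input=True))   # one more input port
print(mux_overcost(None, 16, add_input=False))     # new two-input multiplexer
```

Keeping the input-port dependence in a table makes it easy to swap in the measured values of another device without touching the rest of the cost computation.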

As said previously, path overcost path(i, nj) includes interconnection and storage resources as well as controller overcosts (mux, reg and ctrl respectively in equation (5)):

Controller cost is very complex to estimate. Furthermore, its complexity will increase depending on future scheduling and binding decisions. We thus decided not to take controller cost into account, that is to say to set ctrl to 0.

To compute path overcost, three cases have to be assessed:

a register of the cluster is available and there is already a path between this register and the currently investigated input of the operator,

a register of the cluster is available but there is no path between this register and the currently investigated input of the operator,

an extra register is allocated.

1) If there is already a path between a free register and the input of the operator, overcost comes from the word-length increase (if required). Path overcost due to an existing path is defined by equation (5) for data coming from the output of the register to input l of the operator.

2) If there is no path between a free register and the input of the operator, then to steer the data to the operator either a new interconnection resource has to be allocated (i.e. a two-input multiplexer) or an already allocated interconnection resource has to be used with one more input port. In the latter case, an interconnection resource word-length increase may be required too. The register word-length increase also has to be computed if required. Path overcost due to the creation of a new path is given by equation (5). When a two-input multiplexer is allocated, mux(mux, n_j) is replaced by mux(0, n_j).

3) When the allocation of an extra register is investigated, three cases may occur to steer the data to the arithmetic resource: a) operator i was never used before, so no sharing resource is required, b) a new interconnection resource has to be allocated, or c) an already allocated interconnection resource has to be used with one more input port. Equation (5) is used to compute path overcost, with mux(mux, n_j) replaced by 0 and by mux(0, n_j) for cases a) and b) respectively. reg(reg, n_j) is replaced by reg(0, n_j).

Path overcost path(i, n_j) required to bind operation n_j to operator i takes into account the path of each input of the operation. Commutativity is exploited when the operation allows it. To compute the minimum path overcost, a weighted bipartite graph B = (S ∪ T, E) is built. Each vertex s_l ∈ S represents an input l of operator i and each vertex t_m ∈ T represents an available register reg_m of the cluster C_i or an extra register. E is the set of weighted edges e_lm between s_l and t_m. The weight w_lm associated with edge e_lm is given by path(l, n_j). The (input, register) couples providing the minimum path cost are found by minimum weighted matching for B [21]. Path overcost path(i, n_j) for operator i is the sum of the selected paths of every operator input.
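For an operator with a handful of inputs and registers, the minimum weighted matching can be illustrated by exhaustive search. The sketch below is not the matching algorithm of [21]: it enumerates every assignment of operator inputs to distinct registers, ignores commutativity, and uses made-up edge weights; the function name is hypothetical.

```python
from itertools import permutations

def min_path_matching(weights):
    """Minimum weighted matching between operator inputs and registers.

    weights[l][m] is the path overcost of feeding operator input l from
    candidate register m (an extra register is just one more column).
    Returns (total overcost, tuple giving the register chosen per input).
    """
    n_inputs, n_registers = len(weights), len(weights[0])

    def total(assignment):
        return sum(weights[l][m] for l, m in enumerate(assignment))

    best = min(permutations(range(n_registers), n_inputs), key=total)
    return total(best), best

# Two operator inputs, two cluster registers plus one extra register.
# The weights are illustrative values, not measured overcosts.
W = [
    [0, 16, 32],   # input 0: existing path from register 0
    [16, 4, 32],   # input 1: register 1 only needs a small width increase
]
print(min_path_matching(W))
```

Exhaustive search is only reasonable because an operator has very few inputs; for larger graphs a dedicated assignment algorithm such as the Hungarian method would be the usual choice.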

The cost of binding every operation node in the list of nodes selected for scheduling over every available operator is thus computed using equation (2). Finally, the minimum binding cost is used to select the operator which will implement a particular operation node n_j at clock cycle k. Again, a weighted bipartite graph B′ = (S′ ∪ T′, E′) is built. Each vertex s′_j ∈ S′ represents operation node n_j and each vertex t′_m ∈ T′ represents an available operator m or an extra operator. E′ is the set of weighted edges e′_jm between s′_j and t′_m. The weight w′_jm associated with edge e′_jm is the binding cost of (m, n_j) given by equation (2). Operations are bound to operators according to the minimum weighted matching for B′.

VI. EXPERIMENTAL RESULTS

To validate our proposed methodology, we synthesized several well-known DSP applications: FIR, FFT, DCT, IDCT, 2D DCT, SSD and SAD (sum of squared/absolute differences).

We set up three experimental synthesis flows. The first one corresponds to a traditional approach using a word-length unaware high-level synthesis flow. It produces architectures with a uniform-width datapath, i.e. sized according to the largest data-width requirements. The second one is based on the approach proposed in [6]; we call it the 3-step approach. Scheduling and binding are first performed without considering data word-length information. An operation word-length aware re-scheduling and re-binding step is then performed to minimize arithmetic unit area. Finally, word-length aware register allocation and binding is performed to minimize the area cost of registers. The third flow corresponds to our proposed approach. For a fair comparison, the same number of allocated arithmetic resources is chosen. Results are first obtained using the GraphLab² tool: from a Matlab behavioral description, data range analysis and the high-level synthesis process are performed. Logic synthesis is then performed using the Quartus II V8.1 tool. The target technology is Altera Cyclone III FPGA devices. For a comprehensive comparison of resource usage, we force multipliers to be implemented with LUTs, not hard blocks.

Area and dynamic power consumption results are presented in table 2 for the complete architecture (datapath, memory and controller). The number of allocated arithmetic resources is shown, as well as the input data-width constraint and the maximum datapath word-length required to produce error-free results for worst-case executions. The maximum throughput of the architecture which provides the lowest performance is also shown to give an idea of the performance³.

Compared to the uniform-width approach, area decreases by 7% to 58% (34% on average) and dynamic power consumption is reduced by about 29% on average. Very good results are obtained for both area and power using our approach for applications such as FIR filtering or SSD.

2 http://www.enseirb.fr/~legal/wp_graphlab

3 Note that the performance of the architectures was similar across the three approaches.


Table 2. Synthesis results (Altera Cyclone III)

This is due to the characteristics of these applications. For example, for the FIR filter, area and power savings with our approach are respectively 52% and 31% on average compared to the uniform-width approach.

The word-length of the multiplications is bounded only by the operand word-length, whereas the addition word-length also depends on the number of FIR taps because the accumulation implements an addition tree. Since a multiplier is much larger than an adder (see table 1), allocating oversized multipliers (and the associated registers and interconnection resources) with the uniform-width approach is very costly. The greater the number of FIR taps, the greater the difference between the widths of the resources with a word-length aware synthesis.

When, for each arithmetic resource, the data-widths are spread over a fairly large range up to the largest word-length, which is the case for FFT and 2D DCT for example, the area saving is smaller. Since arithmetic resources are shared, large data-width resources are allocated. In this case, the saving mainly comes from a better resource sharing process that includes interconnect cost.

Syntheses have also been performed to compare these results with the ones obtained without interconnect cost. The proposed approach was still used but, because interconnect cost is not taken into account, the multiplexer cost was removed from the path overcost computation (equation 5). Scheduling and binding choices may thus change. In this case, area increases by about 6%.
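The FIR word-length argument made earlier in this discussion can be sketched with the usual worst-case sizing rules for signed fixed-point data: a product of an a-bit and a b-bit operand fits in a+b bits, and a sum of N such products needs at most ceil(log2(N)) additional bits. The function name and the coefficient width below are assumptions chosen for illustration; the maximum data-widths reported in table 2 come from the range analysis of the tool and are not reproduced by this simple rule.

```python
from math import ceil, log2


def fir_word_lengths(data_bits, coeff_bits, taps):
    """Worst-case word-lengths (in bits) for a signed fixed-point FIR.

    Returns (multiplier output width, accumulator width).
    The multiplier width does not depend on the number of taps,
    whereas the accumulator grows with ceil(log2(taps)).
    """
    mult_bits = data_bits + coeff_bits
    acc_bits = mult_bits + ceil(log2(taps))
    return mult_bits, acc_bits


# Assumed 8-bit coefficients; only the number of taps changes.
for taps in (4, 16, 64):
    mult_bits, acc_bits = fir_word_lengths(8, 8, taps)
    print(f"{taps:2d} taps: multiplier {mult_bits} bits, accumulator {acc_bits} bits")
```

Under these assumptions the gap between the multiplier width and the widest adder grows from 2 bits at 4 taps to 6 bits at 64 taps, so a uniform-width datapath would size every multiplier for a width that only the last adders need.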

Compared to the 3-step approach, area saving reaches 11% on average and dynamic power consumption is reduced by about 9%. With the 3-step approach, multiplexer binding is performed after the word-length aware register allocation and binding step. Our proposed approach shows that the cost of resource sharing has to be taken into account as early as possible in the synthesis process. Although the cumulative register area is very similar, our combined scheduling and binding algorithm favors datapath similarities, leading to fewer multiplexers with fewer input ports. This decrease in interconnect cost also reduces controller complexity.
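The area savings quoted in this section can be recomputed from the logic element counts of table 2. The short script below is only a cross-check: the figures are copied from the table, the saving is taken as (reference - proposed) / reference, and the variable names are ours.

```python
# (uniform, 3-step, proposed) area in logic elements, copied from table 2.
# Keys are (application, input data-width in bits).
AREA = {
    ("64 taps FIR", 8): (4404, 2753, 2066),
    ("64 taps FIR", 16): (6430, 4093, 3053),
    ("64 taps FFT", 8): (73532, 63690, 58308),
    ("64 taps FFT", 16): (91000, 81025, 73205),
    ("DCT 8", 8): (3082, 2167, 1820),
    ("DCT 8", 16): (5112, 3431, 3318),
    ("iDCT 8", 8): (2764, 1892, 1813),
    ("iDCT 8", 16): (4807, 3179, 2974),
    ("2D DCT 8x8", 8): (37051, 32978, 32373),
    ("2D DCT 8x8", 16): (43717, 42863, 40397),
    ("SSD 8x8", 8): (1700, 771, 705),
    ("SSD 8x8", 16): (4476, 2180, 1954),
    ("SAD 8x8", 8): (367, 296, 244),
    ("SAD 8x8", 16): (548, 506, 426),
}


def saving(reference, proposed):
    """Relative saving of `proposed` with respect to `reference`, in percent."""
    return 100.0 * (reference - proposed) / reference


vs_uniform = [saving(u, p) for (u, _, p) in AREA.values()]
vs_3step = [saving(s, p) for (_, s, p) in AREA.values()]

avg_vs_uniform = sum(vs_uniform) / len(vs_uniform)
avg_vs_3step = sum(vs_3step) / len(vs_3step)

print(f"vs. uniform: min {min(vs_uniform):.1f}%, "
      f"max {max(vs_uniform):.1f}%, mean {avg_vs_uniform:.1f}%")
print(f"vs. 3-step:  mean {avg_vs_3step:.1f}%")
```

The means obtained this way should agree, up to rounding, with the averages listed in the last row of table 2.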

VII. CONCLUSION

In this paper, we have presented a synthesis design flow for FPGA platforms. This design flow is based on two major steps. First, a word-length analysis of the application is performed. Then high-level synthesis is completed using algorithms designed for FPGA devices. To limit the extra cost due to interconnection and steering logic, a combined ad-hoc scheduling and binding algorithm based on word-length information and similarities between datapaths is used. The reductions in area and power consumption show the effectiveness of the proposed approach: area saving is about 34% and dynamic power consumption is reduced by about 29% in comparison to a traditional high-level synthesis flow.

Future work will focus on the word-length analysis step of the methodology. Range analysis based on data range propagation through the graph leads to pessimistic results (worst-case analysis). It will be replaced by a range analysis based on analytical estimation. More accurate data scaling can then be obtained by not representing data values that can never occur in practice.
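The pessimism of range propagation can be seen on a minimal example, assuming ranges are propagated with plain interval arithmetic (an assumption on our side, used here only to illustrate the issue). Propagation treats the two operands of each operation as independent, so correlated data produce ranges that can never occur. The helper names below are hypothetical.

```python
def interval_sub(a, b):
    """Range of a - b when a and b are assumed independent."""
    return (a[0] - b[1], a[1] - b[0])


def signed_bits(lo, hi):
    """Smallest two's complement width able to represent [lo, hi]."""
    n = 1
    while not (-(1 << (n - 1)) <= lo and hi <= (1 << (n - 1)) - 1):
        n += 1
    return n


x = (-128, 127)               # 8-bit signed input
propagated = interval_sub(x, x)  # range computed for y = x - x
actual = (0, 0)               # y = x - x is always zero

print("propagated range:", propagated, "->", signed_bits(*propagated), "bits")
print("actual range:    ", actual, "->", signed_bits(*actual), "bit")
```

Here propagation allocates a 9-bit result for a value that is always zero, which is the kind of oversizing a more accurate range analysis is meant to avoid.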

REFERENCES

[1] M. Stephenson, J. Babb, S. Amarasinghe, “Bitwidth analysis with application to silicon compilation”, Proc. of ACM SIGPLAN Conf. on Programming Language Design and Implementation, pp. 108–120, 2000.

[2] S. Gupta, R. Gupta, N. Dutt, A. Nicolau, SPARK: A Parallelizing Approach to the High-Level Synthesis of Digital Circuits, Springer, 2004.

[3] P. Coussy, A. Morawiec (Eds.), High-Level Synthesis: From Algorithm to Digital Circuit, Springer, 2008.

[4] D. Gajski, N. Dutt, A. C.-H. Wu, and S. Y.-L. Lin, High-Level Synthesis: Introduction to Chip and System Design. Boston, MA: Kluwer Academic Publishers, 1992.

[5] E. Casseau, B. Le Gal, P. Bomel, C. Jego, S. Huet, E. Martin, "C-based rapid prototyping for digital signal processing", European Signal Processing Conference, EUSIPCO, 2005.

Table 2. Synthesis results (Altera Cyclone III)

Application  Throughput  Resources          Input       Maximum     Area (logic elements)      Area saving              Dyn. power (mW)            Power saving
             (MS/s)                         data-width  data-width  Uniform  3-step  Proposed  vs. uniform  vs. 3-step  Uniform  3-step  Proposed  vs. uniform  vs. 3-step
64 taps FIR  4           4(x) 4(+)          8 bits      25          4404     2753    2066      53.1%        25.0%       161      146     135       16.1%        7.5%
64 taps FIR  4           4(x) 4(+)          16 bits     33          6430     4093    3053      52.5%        25.4%       181      121     96        47.0%        20.7%
64 taps FFT  160         24(+) 26(*) 25(-)  8 bits      35          73532    63690   58308     20.7%        8.5%        5760     4562    3967      31.1%        13.0%
64 taps FFT  160         24(+) 26(*) 25(-)  16 bits     43          91000    81025   73205     19.6%        9.7%        8238     6027    5331      35.3%        11.5%
DCT 8        107         8(+) 8(*) 8(-)     8 bits      22          3082     2167    1820      40.9%        16.0%       150      131     121       19.1%        7.3%
DCT 8        107         8(+) 8(*) 8(-)     16 bits     30          5112     3431    3318      35.1%        3.3%        317      225     217       31.5%        3.6%
iDCT 8       91          8(+) 8(*) 8(-)     8 bits      22          2764     1892    1813      34.4%        4.2%        141      121     118       16.3%        2.5%
iDCT 8       91          8(+) 8(*) 8(-)     16 bits     30          4807     3179    2974      38.1%        6.4%        323      211     199       38.4%        5.7%
2D DCT 8x8   125         16(+) 32(*) 16(-)  8 bits      35          37051    32978   32373     12.6%        1.8%        3291     2570    2203      33.1%        14.3%
2D DCT 8x8   125         16(+) 32(*) 16(-)  16 bits     43          43717    42863   40397     7.6%         5.8%        4667     3241    2942      37.0%        9.2%
SSD 8x8      4           4(x) 4(+) 4(-)     8 bits      22          1700     771     705       58.5%        8.6%        46       39      36        21.7%        6.7%
SSD 8x8      4           4(x) 4(+) 4(-)     16 bits     38          4476     2180    1954      56.3%        10.4%       166      115     106       36.1%        7.8%
SAD 8x8      4           4(-) 4(+) 4(abs)   8 bits      14          367      296     244       33.5%        17.6%       26       23      21        19.9%        7.6%
SAD 8x8      4           4(-) 4(+) 4(abs)   16 bits     22          548      506     426       22.3%        15.8%       48       43      37        22.0%        13.0%
Average                                                                                        34.7%        11.3%                                  28.9%        9.3%


[6] J. Cong, Y. Fan, G. Han, Y. Lin, J. Xu, Z. Zhang, X. Cheng, “Bitwidth-aware scheduling and binding in high-level synthesis,” in the Proceedings of the ASP-DAC, Asia and South Pacific Design Automation Conference, pp. 856–861, 2005.

[7] P. Coussy, C. Chavet, P. Bomel, D. Heller, E. Senn, E. Martin, “GAUT: a high-level synthesis tool for DSP applications”, in High-Level Synthesis: From Algorithm to Digital Circuit, Springer, 2008.

[8] G. A. Constantinides, P. Y. K. Cheung, W. Luk, “Optimal datapath allocation for multiple-wordlength systems,” Electronics Letters, vol. Issue 17, pp. 1508–1509, 2000.

[9] G. Constantinides, P. Cheung, W. Luk, “Heuristic datapath allocation for multiple wordlength systems,” Proc. of the Design, Automation and Test in Europe (DATE) Conf., pp. 791–796, 2001.

[10] C-Y. Huang, Y-S. Chen, Y-L. Lin, Y-C. Hsu, "Data path allocation based on bipartite weighted matching", Proceedings of the 27th ACM/IEEE conference on Design automation, pp. 499-504, 1990.

[11] D. Chen, J. Cong, Y. Fan, “Low-power high-level synthesis for FPGA architectures”, in Proc. of the International Symposium on Low Power Electronics and Design, ISLPED '03, pp. 134-139, 2003.

[12] T. Kim, X. Liu, "Compatibility path based binding algorithm for interconnect reduction in high level synthesis", International Conference on Computer Aided Design, pp.435-441, 2007.

[13] J. Cong, Y. Fan, W. Jiang, "Platform-Based Resource Binding Using a Distributed Register-File Microarchitecture", International Conference on Computer Aided Design, pp. 709-715, 2006.

[14] A. Avakian, I. Ouaiss, “Optimizing Register Binding in FPGAs Using Simulated Annealing”, in Proceedings of the 2005 International Conference on Reconfigurable Computing and FPGAs (ReConFig'05), pp. 435-441, 2005.

[15] J. Cong, W. Jiang, "Pattern-Based Behavior Synthesis for FPGA Resource Reduction", in Proc. of the International Symposium on Field-Programmable Gate Arrays FPGA'08, pp. 107-116, 2008.

[16] S. Sun, W. Wirthlin, M. J. Neuendorffer, "FPGA Pipeline Synthesis Design Exploration Using Module Selection and Resource Sharing", IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 26, no. 2, pp. 254-265, Feb. 2007.

[17] D. Menard, R. Serizel, R. Rocher, O. Sentieys, “Noise model for accuracy constraint determination in fixed-point systems,” in the Proceedings of EURASIP Conference on Design and Architectures for Signal and Image Processing, pp. 17–25, 2007.

[18] C. Carreras, J. A. López, and O. Nieto-Taladriz, “Bit-Width Selection for Data-Path Implementations”, in Proc. of the 12th International Symposium on System Synthesis, pp. 114-119, 1999.

[19] A.M. Sllame, V. Drabek, “An efficient list-based scheduling algorithm for high-level synthesis”, Proc. Euromicro Symposium on Digital System Design, pp. 316-323, 2002.

[20] C-Y. Huang, Y. Chen, Y. Lin, Y. Hsu, "Data path allocation based on bipartite weighted matching", Proc. ACM/IEEE Design Automation Conf. (DAC), pp. 499-504, 1990.