Bit-Width Optimizations for High-Level Synthesis


ABSTRACT

In this paper, we propose a methodology that takes bit-width into account to optimize the area and power consumption of hardware architectures generated by high-level synthesis tools.

The methodology is based on a bit-width analysis that uses information supplied by the designer. This bit-width information is propagated through a graph that models the application. The resulting annotated graph enables datapath structure optimizations during high-level synthesis without dramatically increasing its processing time (complexity: O(n)). The methodology was applied to several signal and image processing applications. The resulting reductions in area and power consumption demonstrate the effectiveness of the approach. It can also be applied in a more general design context, for sizing the data of an application when the input data formats and their potential correlation are known.

General Terms

Data sizing, hardware design, high-level synthesis, optimization

1. INTRODUCTION

Multimedia applications such as video and image processing are often characterized by a large number of computations. In these data-transfer-, storage- and computation-intensive applications, the bit-width of data and computations is rarely constant throughout the application. Following how the bit-width evolves requires the circuit designer to know the application in depth in order to implement it with correctly sized hardware resources. This analysis is not trivial for the designer. The functional correctness of the architecture depends on the bit-width analysis performed for the usage profile considered. At the same time, using correctly sized hardware resources reduces area and power consumption, both of which are important in embedded system design.

The increasingly demanding requirements of digital signal processing applications such as multimedia and new generations of wireless systems have led to the definition of ever more complex algorithms and systems, which must be implemented efficiently under time-to-market constraints. Today, the electronic system design community is mainly concerned with defining efficient System-on-a-Chip (SoC) design methodologies that benefit from the high integration capabilities of current ASIC and FPGA technologies on the one hand, and manage the increasing algorithmic complexity of applications on the other hand.

To handle this increase in complexity, design methodologies [1] benefit from emerging High-Level Synthesis (HLS) tools. High-level synthesis [2], [3] is analogous to software compilation transposed to the hardware domain. HLS tools automate the generation of an RTL architecture from an algorithmic (behavioral) specification. The generated architecture respects the designer's and the system's constraints and is reliable (error-free) compared with a hand-coded design. However, usual high-level synthesis tools produce datapaths with a fixed bit-width, i.e. oversized architectures.

In this paper, we propose a methodology that takes bit-width into account to optimize the area and power consumption of hardware architectures generated by high-level synthesis tools. The paper is organized as follows. Section 2 presents related work on this topic. Section 3 presents our design flow. The models and techniques used to optimize the architecture are detailed in Section 4. The results presented in Section 5 show the effectiveness of the proposed methodology.

2. RELATED WORK

Many high-level synthesis techniques have been proposed over the past two decades. However, conventional techniques usually focus on uniform-width resources. Recently, much work has addressed bit-width optimization at the architecture level, but little of it within high-level synthesis.

Value range propagation through data-flow graphs is used in [4] to determine the minimum number of bits required for the integral part of floating-point representations and for integers. A range analysis is also performed in [5] and [6] to optimize the bit-width of data and operations. The Bitwise project [6] proposes a compiler that minimizes bit-width. Bitwise propagates integer variable ranges backwards and forwards through data-flow graphs. This method aims to remove unneeded most-significant bits (MSBs).

Cong et al. [7] developed a bit-aware flow that includes bit-width analysis, scheduling and binding. The flow is composed of four steps. First, a behavioral description in C is transformed into the Machine-SUIF intermediate representation. After compiler optimization, the bit-width analysis is performed as a stand-alone Machine-SUIF pass. The bit-width analysis introduced in [6] is used to decide the minimum bit-width. In the second step, the MCAS architectural synthesis system [8] performs scheduling, binding and placement.


In the third step, bit-aware re-scheduling and re-binding are performed to minimize the area cost of functional units (FUs), and bit-aware register allocation and binding are performed to minimize the area cost of registers. In the last step, the corresponding datapath and controller are generated.

Constantinides et al. [9] formulated the combined scheduling, binding and word-length selection problem as an ILP, and proposed a heuristic solution in [10]. Two kinds of graphs are used: a sequencing graph, which represents the data dependencies, and a word-length compatibility graph, which represents, for each operation, the sized operators that can implement it given its word-length. Word-length optimization for the high-level synthesis of digital signal processing systems was developed in [11], using word-length optimization software that takes hardware sharing into account to reduce hardware cost and minimize optimization time. This software inserts quantizers into a data-flow graph representation, partitions the resulting graph, determines the minimum required word-length for each partitioned signal, performs scheduling and binding using the minimum word-length information, and finally optimizes the word-lengths of the functional units. In [12], the potential of a precision-sensitive approach to the high-level synthesis of multi-precision DFGs was explored.

Register allocation, functional unit binding and scheduling algorithms that exploit the multi-precision nature of the DFG for area-efficient implementation were proposed there, together with an integrated methodology that exploits the interdependence of scheduling, allocation and binding. For example, an add-shift-based hardware implementation was preferred to a multiplier-based realization.

An approach based only on hardware allocation was developed in [13]. Its allocation algorithm minimizes hardware waste by fragmenting operations into their common operative kernels, which may then be executed on the same functional units.

In our approach, we use a formal model annotated with bit-width information and dynamic range values in order to extract bit-wise information that optimizes the high-level synthesis process without dramatically increasing its processing time. The optimization step is performed after the scheduling and binding tasks in order to resize the operators and registers optimally.

3. DESIGN FLOW OVERVIEW

Our methodology is based on two steps, as presented in Figure 1. First, a bit-width analysis of the application is performed according to input information provided by the designer. The goal of this first part is to compute the lower-bound and upper-bound values of each computation and storage element that will be implemented on hardware resources after synthesis. The methodology is based on a formal model that represents the application and can handle bounded data information. Once all the bounds have been computed, the bit-widths needed to represent the data and to implement the operations are evaluated. This information is then used during the high-level synthesis process.

Figure 1. Analysis and synthesis flow

The second part of the methodology relates to the high-level synthesis process. High-level synthesis is used to formally transform the application into an architecture that satisfies a set of constraints (latency, area, power, etc.). In our approach, an architecture optimization stage is performed in order to adapt the bit-widths of both operators and registers wherever possible. Because of its low complexity, O(n), the methodology can be applied to current high-complexity DSP applications.

In this paper, the high-level synthesis tool we use is GAUT¹, which synthesizes applications under a real-time constraint.

4. SYNTHESIS STEPS

4.1 Bit-width model and propagation

Our methodology focuses particularly on signal and image processing applications, which are generally regular and predictable. Such applications are generally modeled using data (or signal) flow graphs. It is worth noting that our methodology can also handle conditional branches and hierarchy, even though these features are not highlighted in this paper. This is done using a CSFG (Control and Structure Flow Graph) model defined in [14]. An example of graph representation is illustrated in Figure 2.

Each node (data and operations) of the model can be annotated with bit-width and bounds information. For each graph node, we define attributes that carry the information needed to compute, propagate and use the node's bit-width.

Definition 1 (Lower-bound and Upper-bound)

For each node n belonging to graph G, there exists a lower-bound and upper-bound pair {β_min(n), β_max(n)} → {ℜ, ℜ} such that, if n ∈ Variable(G), β_min(n) and β_max(n) are the minimal and maximal values that the data can take, and, if n ∈ Operation(G), β_min(n) and β_max(n) are the minimal and maximal values that can be produced by the operation modeled by node n.

¹ The GAUT tool can be downloaded, after a free registration, from the LESTER web site: http://web.univ-ubs.fr/gaut/


These values determine the static range of the data stored or produced at every graph node. With this information, an automatic bit-width analysis of the application becomes possible.

From the β_min and β_max bounds of the graph nodes, it is also possible to determine whether inputs and outputs require a sign bit. A sign-bit annotation is important because hardware architectures code negative values in two's complement. Depending on the sign information, the architectural implementation of operators and registers differs.

Definition 2 (Sign extension)

Each graph node n belonging to graph G has an attribute ψ(n) → {0, 1} that indicates whether node n needs information modeling the sign of the data it handles. The attribute ψ(n) is zero if the data range shows that the data are never negative, and one otherwise.

The set of defined information makes it possible to determine the optimal bit-width associated with all graph nodes, that is to say the minimum number of bits that are necessary to store or compute data.

Figure 2. Formal representation model example

Definition 3 (Bit-width implementation)

For each graph node n, there exists an optimal bit-width implementation Ω(n) → ℜ such that Ω(n) is the minimum number of bits necessary for the hardware implementation of the register or operator. The bit-width implementation Ω(n) takes into account data coding as well as sign extension if necessary. The functions that compute {Ω, ψ} from (β_min, β_max) are given in (1) for data that are always positive and in (2) for data that can be negative.

$$\{\Omega, \psi\} = \{\log_2(\beta_{max}),\ 0\} \tag{1}$$

$$\{\Omega, \psi\} = \{\log_2(2 \times \max(|\beta_{min}|, |\beta_{max}|) + 1),\ 1\} \tag{2}$$

This optimal bit-width is the data sizing constraint of the hardware implementation: the size from which the hardware resources are physically able to carry out all the computations of the application.
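As a minimal sketch of how equations (1) and (2) might be evaluated on integer bounds, the Python function below returns the pair {Ω, ψ}. Several points are assumptions of this sketch rather than statements of the paper: the result is rounded up to a whole number of bits, the unsigned case counts β_max + 1 representable values so that exact powers of two are sized correctly, and equation (2) is read as log2(2 × max(|β_min|, |β_max|) + 1). The names `ceil_log2` and `bit_width` are hypothetical.

```python
def ceil_log2(v: int) -> int:
    """Return ceil(log2(v)) for an integer v >= 1, using exact integer arithmetic."""
    if v < 1:
        raise ValueError("v must be >= 1")
    return (v - 1).bit_length()


def bit_width(beta_min: int, beta_max: int) -> tuple[int, int]:
    """Return (omega, psi) for a node whose values lie in [beta_min, beta_max]."""
    if beta_min > beta_max:
        raise ValueError("beta_min must not exceed beta_max")
    if beta_min >= 0:
        # Unsigned case, equation (1): no sign bit is needed (psi = 0).
        # Assumption: size for beta_max + 1 distinct values, rounded up.
        return ceil_log2(beta_max + 1), 0
    # Signed case, equation (2): symmetric worst case around zero (psi = 1).
    magnitude = max(abs(beta_min), abs(beta_max))
    return ceil_log2(2 * magnitude + 1), 1


if __name__ == "__main__":
    print(bit_width(0, 3))        # range 0..3, unsigned
    print(bit_width(-127, 127))   # symmetric signed range
    print(bit_width(-128, 127))   # full 8-bit two's-complement range
```

Because the signed formula assumes a range that is symmetric around zero, this sketch sizes [-128, 127] at 9 bits, one more than a two's-complement encoding strictly needs; a tool could refine this case separately.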

Once these values are available, it is possible to extract information that will be useful for the optimization of the high-level synthesis process (Figure 3).

We now detail the flow that propagates the input ranges through the graph. First, the representation model is annotated according to the input information provided by the designer. The input information can be defined in two different ways:

- Input bit-width: in this case, the designer specifies, for each input, the number of bits required to code it and whether a bit for sign extension is necessary. This information is presented as a pair [Ω, ψ].

- Input range: in this case, the designer specifies, for each input, its minimal and maximal values in the worst case. This information is presented as a pair [β_min, β_max].

Figure 3. Analysis flow

Our methodology allows these two ways of expressing bit-width to be mixed within the same application. The attribute standardisation stage is then applied to model all the information in the same form, [β_min, β_max].
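The standardisation stage can be sketched as follows. This is only an illustration under stated assumptions: Ω is taken to be the total number of bits including the sign bit, signed data are taken to use two's-complement coding, and the function names and the `("width", …)` / `("range", …)` annotation format are hypothetical.

```python
def range_from_width(omega: int, psi: int) -> tuple[int, int]:
    """Convert a designer-supplied [omega, psi] pair into [beta_min, beta_max].

    Assumption: omega is the total number of bits, sign bit included, and
    signed data use two's-complement coding.
    """
    if omega < 1 or psi not in (0, 1):
        raise ValueError("omega must be >= 1 and psi must be 0 or 1")
    if psi == 0:
        return 0, (1 << omega) - 1
    return -(1 << (omega - 1)), (1 << (omega - 1)) - 1


def standardise(inputs: dict[str, tuple[str, int, int]]) -> dict[str, tuple[int, int]]:
    """Map every input annotation to the common [beta_min, beta_max] form.

    Each entry is ("width", omega, psi) or ("range", beta_min, beta_max).
    """
    bounds = {}
    for name, (kind, a, b) in inputs.items():
        if kind == "width":
            bounds[name] = range_from_width(a, b)
        elif kind == "range":
            if a > b:
                raise ValueError(f"{name}: beta_min exceeds beta_max")
            bounds[name] = (a, b)
        else:
            raise ValueError(f"{name}: unknown annotation kind {kind!r}")
    return bounds


if __name__ == "__main__":
    annotations = {
        "pixel": ("width", 8, 1),   # signed input coded on 8 bits
        "alpha": ("range", 0, 3),   # input given directly as a range
    }
    print(standardise(annotations))
```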

The input bit-width information provided by the designer is then propagated through the graph nodes.

Definition 4 (Bound propagation function)

For each graph node n belonging to graph G, there exists a pair of functions F = {σ_min, σ_max} which, for each node type, computes the node's lower and upper bounds from the available input values. The function σ_min(e_1, …, e_n) : ℜⁿ → ℜ computes the lower bound of node n, and the function σ_max(e_1, …, e_n) : ℜⁿ → ℜ computes its upper bound.

In order to propagate the lower and upper bounds through the set of graph nodes, a suitable pair of functions F = {σ_min, σ_max} is needed for each node type. These functions are built by composing the arithmetic and logic operations that represent the behavior of the nodes before and after hardware implementation.
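To make the role of the σ_min/σ_max pairs concrete, the sketch below propagates bounds through a small graph using ordinary interval arithmetic. It is not the paper's implementation: the four node types, the list-based graph encoding and all function names are assumptions chosen for illustration. The graph is given in topological order, so each node is visited once.

```python
from itertools import product

Bounds = tuple[int, int]  # (beta_min, beta_max)


def sigma_add(a: Bounds, b: Bounds) -> Bounds:
    return a[0] + b[0], a[1] + b[1]


def sigma_sub(a: Bounds, b: Bounds) -> Bounds:
    return a[0] - b[1], a[1] - b[0]


def sigma_mul(a: Bounds, b: Bounds) -> Bounds:
    # The extremes of a product are always reached at interval corners.
    corners = [x * y for x, y in product(a, b)]
    return min(corners), max(corners)


def sigma_abs(a: Bounds) -> Bounds:
    lo, hi = a
    if lo >= 0:
        return lo, hi
    if hi <= 0:
        return -hi, -lo
    return 0, max(-lo, hi)


# One bound-propagation function per node type (hypothetical set).
SIGMA = {"add": sigma_add, "sub": sigma_sub, "mul": sigma_mul, "abs": sigma_abs}


def propagate(inputs, operations):
    """Annotate every node with its bounds.

    `operations` lists (result, node_type, operand_names) in topological
    order, so each node is visited exactly once.
    """
    bounds = dict(inputs)
    for result, node_type, operands in operations:
        bounds[result] = SIGMA[node_type](*(bounds[name] for name in operands))
    return bounds


if __name__ == "__main__":
    inputs = {"a": (-128, 127), "b": (-128, 127), "c": (0, 3)}
    graph = [
        ("d", "sub", ("a", "b")),   # d = a - b
        ("m", "abs", ("d",)),       # m = |d|
        ("p", "mul", ("m", "c")),   # p = m * c
    ]
    for node, (lo, hi) in propagate(inputs, graph).items():
        print(f"{node}: [{lo}, {hi}]")
```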

Once the range interval of each graph node has been computed, it is possible to compute the associated bit-width and, if necessary, the sign bit. Equations (3) (unsigned data) and (4) (signed data) compute the pair (bit-width, sign) from the pair (lower bound, upper bound).

$$\Omega(n) = \log_2(\beta_{max}(n)) \tag{3}$$

$$\Omega(n) = \log_2(2 \times \max(|\beta_{min}(n)|, |\beta_{max}(n)|) + \psi(n)) \tag{4}$$


The representation model is then annotated with this data in order to be considered during the synthesis process.

4.2 High-Level Synthesis Process

In this paper, the high-level synthesis tool we use is GAUT (Figure 4). The behavioural description, which specifies the application to implement, is written in a high-level language (C or behavioural VHDL).

The designer can constrain the synthesis with the target technology, the throughput, the I/O timing, etc. A compilation stage performs syntactic analysis, semantic analysis and code parallelization. The compilation produces an internal representation of the algorithm in the form of a signal flow graph (SFG).

Figure 4. Usual high-level synthesis flow

The synthesis of the datapath unit starts with the selection of the operators. Then, the allocation step defines the number of operators of each type. The operations are then scheduled. The binding stage consists in assigning each scheduled operation to an operator that is available at the time considered. After the scheduling/binding stage, hardware optimization techniques can be applied to optimize the architecture in terms of register sharing and bus usage.

Like GAUT, most high-level synthesis tools generate hardware architectures with a fixed datapath bit-width corresponding to the largest data bit-width among the computations to be performed. Oversized architectures are thus generated. Using the bit-width analysis presented previously, it is possible to size the hardware operators and registers of the generated architecture correctly, and thereby to reduce area and power consumption.

Previous high-level synthesis methodologies that take data bit-width into account operate during the selection and allocation steps (see Section 2). These approaches have to solve NP-complete problems. In order to reduce the processing time, our approach optimizes the generated architecture after the binding step, using the bit-width information that comes from the annotated representation model. The algorithm complexity is then O(n). The corresponding high-level synthesis flow is presented in Figure 5.

Figure 5. High-level synthesis flow including bitwise consideration

After the scheduling/binding step, operations are bound to hardware resources (operators and registers). For each allocated shared resource, the proposed approach consists in determining its optimal size.

Definition 5 (Sign extension requirement)

Each operator/register composing the generated hardware architecture has a list of uses. Each use corresponds to a node in the formal representation model. From the sign information of the nodes, the need for a signed hardware resource can be evaluated. If one use of a particular resource is signed, then the resource has to be signed.

Definition 6 (Bit-width requirement)

Each operator/register composing the generated hardware architecture has a list of uses. The minimal hardware requirement for a particular resource is the maximal bit-width requirement found in its list of uses.

The above definitions determine, respectively, the sign-bit requirement and the minimal bit-width requirement of each allocated component of the architecture.
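Definitions 5 and 6 can be illustrated with the short sketch below, which sizes each bound resource in a single pass over its list of uses. The `Use` record, the function names and the example binding are hypothetical; the rules are applied literally as stated above, and any interaction between signed and unsigned uses on the same resource is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Use:
    """One use of a resource, i.e. one annotated node of the graph."""
    node: str
    omega: int  # bit-width requirement of the node
    psi: int    # 1 if the node needs a sign bit, else 0


def size_resource(uses: list[Use]) -> tuple[int, int]:
    """Return (bit-width, signed flag) for one operator or register."""
    if not uses:
        raise ValueError("a bound resource must have at least one use")
    width = max(use.omega for use in uses)        # Definition 6
    signed = int(any(use.psi for use in uses))    # Definition 5
    return width, signed


def size_architecture(binding: dict[str, list[Use]]) -> dict[str, tuple[int, int]]:
    """Size every resource; each use is examined exactly once."""
    return {name: size_resource(uses) for name, uses in binding.items()}


if __name__ == "__main__":
    binding = {
        "mul_0": [Use("n1", 16, 1), Use("n7", 10, 0)],
        "reg_3": [Use("n4", 2, 0)],
    }
    print(size_architecture(binding))
```

Since the pass touches each use once, its cost grows linearly with the number of bound nodes, which is consistent with the O(n) complexity claimed for the optimization step.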

4.3 Register optimizations

Register merging algorithms are used to increase the temporal sharing of registers. The register optimization algorithm implemented in the GAUT tool takes the datapath interconnections into account [15]. The features of a typical datapath generated by GAUT are presented in Figure 6. It is based on elementary computation cells, also called "clusters". Each cluster is composed of one operator and the registers that are directly connected to it. The register optimization algorithm shares registers only within the same cluster, in order to reduce interconnection costs, which can become critical during logic synthesis if a register is placed far away from the operators it is connected to.

In our approach, the cost function determines when sharing a register is worthwhile, and it includes a bit-width difference metric.
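The exact cost function is not given here, so the following is only a hypothetical sketch of how a bit-width difference metric could enter a sharing decision. The `Register` record, the absolute-difference cost, the `max_wasted_bits` threshold and its default value are all assumptions made for illustration; only the same-cluster restriction and the idea of a bit-width difference come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Register:
    name: str
    cluster: str      # cluster (operator) the register is attached to
    width: int        # bit-width requirement of the stored data
    first_cycle: int  # first cycle in which the value is live
    last_cycle: int   # last cycle in which the value is live


def lifetimes_overlap(a: Register, b: Register) -> bool:
    return a.first_cycle <= b.last_cycle and b.first_cycle <= a.last_cycle


def sharing_cost(a: Register, b: Register) -> int:
    """Hypothetical metric: bits left unused while the narrower value is stored."""
    return abs(a.width - b.width)


def can_share(a: Register, b: Register, max_wasted_bits: int = 2) -> bool:
    """Share only inside one cluster, with disjoint lifetimes and similar widths."""
    return (
        a.cluster == b.cluster
        and not lifetimes_overlap(a, b)
        and sharing_cost(a, b) <= max_wasted_bits
    )


if __name__ == "__main__":
    r1 = Register("r1", "add_0", 8, 0, 3)
    r2 = Register("r2", "add_0", 9, 4, 6)
    r3 = Register("r3", "add_0", 16, 7, 9)
    print(can_share(r1, r2))  # similar widths, disjoint lifetimes
    print(can_share(r1, r3))  # widths differ by 8 bits
```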


Figure 6. Typical data-path

5. EXPERIMENTS

In order to evaluate our methodology, the synthesis approach was applied to two widely used signal and image processing functions: a Sum of Absolute Differences (SAD) computation and a Finite Impulse Response (FIR) filter.

5.1 Sum of Absolute Differences

The SAD computation is the basic operation used in block matching algorithms like the Full Search or Three Step Search algorithm. The formula used to compute the SAD is given in (6).

The macroblock size considered (N) is 16×16, with four levels for the transparency.

$$\Delta = \sum_{y=0}^{N} \sum_{x=0}^{N} \left| I_1(y,x) - \mathrm{Alpha}(y,x) \times I_2(y,x) \right| \tag{6}$$

where I_1(y,x) and I_2(y,x) are the pixels at position (y,x) and Alpha(y,x) is the transparency.

5.2 FIR Filter

We used a transposed FIR filter structure, which consists of multiplications and additions. The following equation describes the N-tap FIR computation:

$$y(n) = \sum_{i=0}^{N-1} x(n) \times H(n-i) \tag{7}$$

We experimented with a 512-tap FIR filter.
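As a worked illustration of why bit-widths vary along such a filter, the sketch below computes the worst-case width of the multiplier output and of each partial sum in a 512-tap accumulation chain. It assumes signed 8-bit samples and coefficients, as in the experiments reported in Section 5.3, and rounds the signed sizing formula up to whole bits; the function names are hypothetical and the figures are those of this sketch, not measurements from the paper.

```python
def signed_width(lo: int, hi: int) -> int:
    """Bits for a signed range: log2(2 * max(|lo|, |hi|) + 1), rounded up."""
    magnitude = max(abs(lo), abs(hi))
    # For an integer v >= 1, ceil(log2(v)) == (v - 1).bit_length().
    return (2 * magnitude).bit_length()


def fir_partial_sum_widths(taps: int,
                           sample: tuple[int, int] = (-128, 127),
                           coeff: tuple[int, int] = (-128, 127)) -> list[int]:
    """Return the worst-case width of the partial sum after 1, 2, ..., taps products."""
    corners = [s * c for s in sample for c in coeff]
    prod_lo, prod_hi = min(corners), max(corners)
    widths = []
    lo = hi = 0
    for _ in range(taps):
        # Each adder of the chain adds one more product to the partial sum.
        lo += prod_lo
        hi += prod_hi
        widths.append(signed_width(lo, hi))
    return widths


if __name__ == "__main__":
    widths = fir_partial_sum_widths(512)
    print("single product:", widths[0], "bits")
    print("final accumulator:", widths[-1], "bits")
```

Under these assumptions a single product needs 16 bits while the final partial sum needs 25, so a datapath sized uniformly for the widest value would over-provision most of the adders in the chain.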

5.3 Results

Two syntheses were performed. The first one uses a usual high-level synthesis process, i.e. with a fixed bit-width datapath, and the second one uses the proposed approach based on a variable bit-width datapath. For each synthesis, we report both the area of the datapath unit and the area of the datapath with its controller (complete architecture). The target technology was an FPGA, namely a Virtex-II Pro XC2VP100 device. Results were obtained using a complete design flow, i.e. high-level synthesis followed by synthesis of the RTL architecture with the Xilinx ISE 7.1i synthesis and mapping tools (Xilinx, Inc.). The dynamic power consumption of the complete architecture was obtained from Xilinx XPower 7.1i, which is designed specifically for estimating the power consumption of FPGA devices. Syntheses and power estimates were made for various throughput constraints.

For each case study, the inputs were signed and coded on 8 bits, which is a usual bit-width at the output of the analog-to-digital converters from which the data come. The coefficients H(n-i) were also signed and coded on 8 bits. The Alpha(y,x) values were non-negative integers ranging from 0 to 3, coded on 2 unsigned bits. Area results are shown in Figures 7 and 9, and power estimates in Figures 8 and 10.

Using our approach, the area of the complete architecture decreases by 17% to 43% for the SAD computation and by 30% to 40% for the 512-tap FIR. The dynamic power consumption is reduced by 28% to 34% for the SAD computation and by 10% to 22% for the 512-tap FIR.

In fact, the higher the throughput, the higher the gain.

Indeed, for low throughputs, fewer operators and registers are used to compute the whole application, i.e. they are shared more heavily. They therefore have to handle data whose bit-widths range from the smallest to the largest, which means they must be sized for the worst case (the largest bit-width). In these cases, the area and power reductions are smaller.

[Bar chart, "SAD Computation with 4 levels for the transparency": Area (LUTs) versus Throughput (Mpixels/s) at 91, 183, 320, 427 and 640, for four series: Datapath Area, Modified Datapath Area, Datapath+Controller Area, Modified Datapath+Controller Area.]

Figure 7. Synthesis results for the SAD Computation

[Bar chart, "SAD Computation with 4 levels for the transparency": Dynamic power consumption (mW) versus Throughput (Mpixels/s) at 91, 183, 320, 427 and 640, for two series: Datapath+Controller power consumption, Modified Datapath+Controller power consumption.]

Figure 8. Dynamic power consumption for the SAD Computation


[Bar chart, "512-tap FIR": Area (LUTs) versus Throughput (Msamples/s) at 213, 284, 366, 427 and 512, for four series: Datapath Area, Modified Datapath Area, Datapath+Controller Area, Modified Datapath+Controller Area.]

Figure 9. Synthesis results for the 512-tap FIR

[Bar chart, "512-tap FIR": Dynamic power consumption (mW) versus Throughput (Msamples/s) at 213, 284, 366, 427 and 512, for two series: Datapath+Controller power consumption, Modified Datapath+Controller power consumption.]

Figure 10. Dynamic power consumption for the 512-tap FIR

6. CONCLUSION

In this paper, we have presented a bit-width-aware synthesis design flow based on two steps. First, a bit-width analysis of the application is performed according to input information provided by the designer. Then, high-level synthesis is performed.

Using the results of the first step, an architecture optimization is performed in order to adapt the bit-widths of both operators and registers wherever possible. The reductions in area and power show the effectiveness of the approach. Furthermore, thanks to its low complexity, O(n), the proposed methodology can be applied to current high-complexity DSP applications.

The data and operation bit-width analysis we propose (the first step of the methodology) is not specific to high-level synthesis. It can be used in a more general context, for sizing the data of any DSP application when the input data formats and their potential correlation are known.

In the same way, the hardware resource optimization we propose can easily be integrated into other existing high-level synthesis tools, since it is performed after the main steps of the synthesis process.

7. REFERENCES

[1] E. Casseau, B. Le Gal, P. Bomel, C. Jégo, S. Huet, and E. Martin, "C-based rapid prototyping for digital signal processing", Proceedings of EUSIPCO, 2005.

[2] D. D. Gajski, N. D. Dutt, A. C.-H. Wu, and S. Y.-L. Lin, High-Level Synthesis: Introduction to Chip and System Design, Kluwer Academic Publishers, Boston, MA, 1992.

[3] J. P. Elliott, Understanding Behavioral Synthesis: A Practical Guide to High-Level Design, Kluwer Academic Publishers, 2000.

[4] A. Nayak, M. Haldar, A. Choudhary, and P. Banerjee, "Precision and error analysis of MATLAB applications during automated hardware synthesis for FPGAs", Proceedings of DATE, 2001, pp. 722-728.

[5] D.-U. Lee, A. A. Gaffar, R. C. C. Cheung, O. Mencer, W. Luk, and G. A. Constantinides, "Accuracy Guaranteed Bit-Width Optimization", IEEE Transactions on CAD, 2006.

[6] M. Stephenson, J. Babb, and S. Amarasinghe, "Bitwidth analysis with application to silicon compilation", Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 2000, pp. 108-120.

[7] J. Cong, Y. Fan, G. Han, Y. Lin, J. Xu, Z. Zhang, and X. Cheng, "Bitwidth-Aware Scheduling and Binding in High-Level Synthesis", Proceedings of the ASP-DAC, Asia and South Pacific, 2005, pp. 856-861.

[8] J. Cong, Y. Fan, G. Han, X. Yang, and Z. Zhang, "Architecture and Synthesis for On-chip Multicycle Communication", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2004.

[9] G. A. Constantinides, P. Y. K. Cheung, and W. Luk, "Optimal datapath allocation for multiple-wordlength systems", Electronics Letters, 2000, Issue 17, pp. 1508-1509.

[10] G. A. Constantinides, P. Y. K. Cheung, and W. Luk, "Heuristic datapath allocation for multiple wordlength systems", Proceedings of DATE, 2001, pp. 791-796.

[11] K.-I. Kum and W. Sung, "Word-length optimization for high-level synthesis of digital signal processing systems", IEEE Workshop on Signal Processing Systems, 1998, pp. 569-578.

[12] V. Agrawal, A. Pande, and M. Mehendale, "High Level Synthesis of Multi-Precision Data Flow Graphs", Proceedings of the 14th International Conference on VLSI Design, 2001, pp. 411-416.

[13] M. C. Molina, J. M. Mendias, and R. Hermida, "High-level allocation to minimize internal hardware wastage", Proceedings of DATE, 2003, pp. 264-269.

[14] B. Le Gal, E. Casseau, S. Huet, and E. Martin, "Pipelined Memory Controllers for DSP Applications Handling Unpredictable Data Accesses", Proceedings of VLSI, 2005, pp. 268-269.

[15] C. Jégo, E. Casseau, and E. Martin, "Real time application architectural synthesis dedicated to sub-micron technologies", IEEE International ASIC/SOC Conference, 2000, pp. 397-401.