IP Generation Targeting Multiple Bit-Width Standards


Bertrand Le Gal

IXL Laboratory, CNRS UMR 5818 ENSEIRB, Bordeaux 1 University, France

Emmanuel Casseau

R2D2 Laboratory, IRISA UMR 6074 ENSSAT, University of Rennes 1, France

Abstract: Multimedia applications such as video and image processing are computation-intensive. In these applications, the bit-width of data and operations varies throughout the application. Generating optimized architectures is not straightforward, since it requires a thorough bit-width analysis in order to size hardware resources properly. Furthermore, implementing several application profiles on the same chip makes it possible to avoid over-sized architectures or chip reconfiguration. In this paper, we propose a design methodology based on high-level synthesis which takes multiple bit-width standards into account in order to generate area- and power-optimized architectures for embedded devices. First results demonstrate the benefit of the approach.

I. INTRODUCTION

Multimedia applications such as video and image processing are often characterized by a large number of computations. In these computation-intensive applications, the bit-width of data and computations is rarely constant throughout the application.

This bit-width variation requires the circuit designer to know the application in depth in order to optimize the design with correctly sized hardware resources. This analysis is not trivial. Architectures built from undersized resources may produce wrong results for some input values, whereas architectures built from oversized resources increase area and power consumption, both of which are critical for embedded systems.

The increasingly demanding requirements of digital signal processing applications (multimedia, new generations of wireless systems, etc.) have led to more and more complex algorithms and systems. At the same time, these applications have to be implemented efficiently under the so-called "time to market" constraint. Today, the electronic system design community is mainly concerned with defining efficient System-on-a-Chip (SoC) design methodologies, in order to take advantage of the high integration capabilities of current ASIC and FPGA technologies on the one hand, and to manage the increasing algorithmic complexity of applications on the other hand. To handle this increase in complexity, methodologies [1] benefit from the emerging High-Level Synthesis (HLS) tools. High-level synthesis [2] [3] is analogous to software compilation transposed to the hardware domain. HLS tools automate the design process that generates an RTL architecture from an algorithmic description of the specification. A few HLS methodologies generate datapaths with variable bit-widths. However, they only target one application profile. Implementing several profiles on the same chip is a challenge that systems have to address in order to support multiple standards while meeting area and power cost requirements.

In this paper we propose a design methodology which takes into account the resource bit-widths required for different standards during the application datapath implementation.

The paper is organized as follows. Section II presents related work on application bit-width analysis techniques and their use in HLS design flows. Section III presents our design flow, the models and the techniques used to analyze the application bit-width. First results, reported in Section IV, show that it is possible to exploit specific multi-standard application knowledge about the input dynamics (bit-width / value range) and to integrate that knowledge into a high-level synthesis design flow in order to optimize the generated multi-purpose architectures.

II. RELATED WORK

Several high-level synthesis techniques have been proposed over the last two decades. However, conventional techniques usually rely on uniform-width resources.

Recently, much work has been carried out on bit-width analysis and optimization [4] [5] [6] [7] in system design, but only a few of these works target high-level synthesis. Some of them use emerging system-level languages such as SystemC and SpecC, which make it possible to specify variables and operations with arbitrary bit-widths in the behavioral input description.

A bit-width optimization flow for high-level synthesis of digital signal processing systems has been developed in [8] using a commercial bit-width optimization software which takes hardware sharing into account to reduce the hardware cost and minimize the optimization time. This software inserts quantizers into a data flow graph representation; these quantizers are used to balance the processing according to the data bit-width.

Constantinides et al. [9] formulated the combined scheduling, binding, and word-length selection problem as an ILP problem, and also proposed a heuristic solution in [10].

In [11], the "potential of precision sensitive approach for the high-level synthesis of multi-precision DFG" has been explored. Register allocation, functional unit binding and scheduling algorithms have been proposed to exploit the multi-precision nature of DFGs for area-efficient implementation.


Fig. 1. Analysis and synthesis design flow

Another approach, based only on hardware allocation, has been developed in [12]. An allocation algorithm has been proposed to minimize hardware waste by fragmenting operations into their common operative kernel, which may then be executed on the same functional units.

Conclusion

The approaches presented above use scheduling and binding techniques that consider the data and operation bit-width for a single use case only (the design flow can handle only one bit-width profile). The optimization complexity is generally O(n²) or greater, which leads to significant synthesis times. Furthermore, except for [12], these approaches restrict hardware resource reuse, which means that the final results may sometimes be area-inefficient.

In the next sections we present a novel technique based on multiple use case analysis, which gives the designer an efficient way to generate flexible and error-free architectures.

Our approach is based on a graph model annotated with value ranges to extract the application bit-width information, and we use a sub-optimal design flow to limit the increase in HLS processing time as much as possible.

III. CIRCUIT DESIGN FLOW

In this paper, we focus on designing multi-standard architectures based on a single behavioral description of the application. We particularly target signal and image processing applications using integer and/or fixed-point data types. Our design flow, presented in figure 1, is based on two major steps.

The goal of the first step (the bit-width analysis) is to compute lower-bound and upper-bound values for each computation and each storage element. Once all the bounds have been computed, the minimal (optimal) bit-width required to implement data, operations and storage elements is evaluated.

The second step of the methodology (the optimized architecture generation) relates to the high-level synthesis process. High-level synthesis is used to formally transform an application into a hardware architecture that satisfies a set of constraints (latency, area, power, etc.). During the synthesis process, an optimization step is performed in order to size the bit-width of operators and registers as tightly as possible.

Fig. 2. Detailed bit-width analysis flow

For our experiments, the GAUT¹ high-level synthesis tool is used as a proof of concept; however, the presented design methodology is not tied to this tool.

A. Bit-width analysis

The bit-width analysis flow is presented in figure 2. This analysis aims to compute the number of bits necessary to represent and implement each variable and operation used in the application, using a formal graph model.

Signal and image processing applications are generally regular and predictable. Such applications are generally modeled using Signal Flow Graph based models.

Such a formal representation model is used in our design flow.

Each node (data and operations) of the model can be annotated with bit-width and bound information. We define, for each graph node, attributes representing necessary information for the bit-width computation and propagation.

Definition 1 (Lower-bound and Upper-bound). For each node $n$ belonging to graph $G$, there exists a couple of lower and upper bounds $\{\beta_{min}(n), \beta_{max}(n)\} \rightarrow \{\mathbb{R}, \mathbb{R}\}$ such that, if $n \in Variable(G)$, $\beta_{min/max}(n)$ is the minimal/maximal value that the data can take, and, if $n \in Operation(G)$, $\beta_{min/max}(n)$ models the minimal/maximal value that can be produced by the operation modeled by node $n$.

For a node $n$, the couple $\{\beta_{min}(n), \beta_{max}(n)\}$ specifies the dynamic range of the produced or stored data.

Using this couple, it is possible to determine, for each graph node, the optimal bit-width of its hardware implementation.

It corresponds to the minimal number of bits required to store or compute each data item, which avoids wasting area in hardware resources and interconnections.

This information formally guarantees the functional correctness of the generated architecture as long as the input data stay within the dynamic range specified by the designer.

The implementation bit-width $\Omega(n)$ takes the data coding into account, including the sign extension, named $\psi(n)$ (if required) [13].
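As an illustration of how a couple of bounds maps to an implementation bit-width, the following Python sketch derives a bit count and a signedness flag from integer bounds. It assumes integer data and two's-complement coding for signed values; the function name `bit_width` and the use of a Boolean for the sign information are illustrative choices, not the GraphLab or GAUT implementation.

```python
def bit_width(beta_min: int, beta_max: int) -> tuple[int, bool]:
    """Return (omega, psi) for the integer range [beta_min, beta_max].

    omega: minimal number of bits; psi: True if a sign bit is needed.
    Assumption: signed values use two's-complement coding.
    """
    if beta_min > beta_max:
        raise ValueError("beta_min must not exceed beta_max")
    if beta_min >= 0:
        # Unsigned coding; keep at least one bit for the constant 0.
        return max(1, beta_max.bit_length()), False
    # n bits in two's complement cover [-2**(n-1), 2**(n-1) - 1].
    magnitude_bits = max((-beta_min - 1).bit_length(),
                         max(beta_max, 0).bit_length())
    return magnitude_bits + 1, True


print(bit_width(0, 255))     # (8, False)
print(bit_width(-128, 127))  # (8, True)
print(bit_width(-129, 127))  # (9, True)
```

Fixed-point data could be handled in the same way after scaling the bounds by the position of the binary point, which is not shown here.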

¹ The GAUT tool can be downloaded, after free registration, from the LESTER website: http://web.univ-ubs.fr/lester/www-gaut/


Fig. 3. Multiple use cases (a) input correlation (b) different execution profiles

Bit-width propagation through the representation model

In order to compute the optimal size of each data item manipulated by the application, the formal model presented above is used to propagate the dynamic range of the inputs to all the nodes of the graph. This step is performed automatically by the GraphLab² tool. The input range information can be specified by the designer in two different ways:

• Dynamic range values: the designer specifies, for each input of the application, its minimum and maximum values. This information is given as the couple $[\beta_{min}, \beta_{max}]$.

• Input bit-width: the designer specifies, for each input of the application, the minimum number of bits required to encode the data and their type (signed/unsigned). This information is given as the couple $[\Omega, \psi]$.

A heterogeneous description combining these two ways of specifying the dynamic range is possible. The input bit-width information provided by the designer is then propagated through the graph nodes using the propagation functions.
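The two input specification styles can be brought to a common form before propagation. The Python sketch below converts a bit-width and signedness description into a range, so that a heterogeneous description can be handled uniformly. The tagged-tuple format and the names `range_from_width` and `normalize_inputs` are hypothetical, and two's-complement coding is assumed for signed inputs.

```python
def range_from_width(omega: int, signed: bool) -> tuple[int, int]:
    """Largest integer range representable with omega bits."""
    if omega < 1:
        raise ValueError("omega must be at least 1")
    if signed:
        # Assumption: two's-complement coding.
        return -(1 << (omega - 1)), (1 << (omega - 1)) - 1
    return 0, (1 << omega) - 1


def normalize_inputs(spec: dict) -> dict:
    """Map every input to a (beta_min, beta_max) couple.

    spec values are ("range", beta_min, beta_max) or ("width", omega, signed).
    """
    ranges = {}
    for name, (kind, first, second) in spec.items():
        if kind == "range":
            ranges[name] = (first, second)
        elif kind == "width":
            ranges[name] = range_from_width(first, second)
        else:
            raise ValueError(f"unknown specification kind: {kind}")
    return ranges


# Heterogeneous description: one input given as a range, one as a bit-width.
inputs = normalize_inputs({"A": ("range", 0, 10), "X": ("width", 8, False)})
print(inputs)  # {'A': (0, 10), 'X': (0, 255)}
```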

Definition 2 (Bound propagation function). For each graph node $n$ belonging to graph $G$, there exists a couple of functions $F = \{\sigma_{min}, \sigma_{max}\}$ which allows, for each node type, calculating its lower bound and its upper bound from its available input values $e_i$. The function $\sigma_{min}(e_1, \ldots, e_n) : \mathbb{R}^n \rightarrow \mathbb{R}$ calculates the lower bound of node $n$, and the function $\sigma_{max}$ calculates its upper bound.

The data range propagation, from input nodes to output nodes, is performed by a low-complexity recursive algorithm using a library of propagation functions. Depending on the node type, the propagation function $F(n)$ corresponds to a simple assignment or to an arithmetic or logic operation.

Once the range of each node has been computed, the bit-width required to implement each node is computed for the synthesis step.

B. Multiple standard management

The above modeling makes it simple to consider several use cases (different standards) of the bit-width for the same application. A component generated using an HLS tool can thus support multiple standards without overflow for all the use cases specified by the designer. For a given application, this method also reduces the negative effects (range over-estimation) of the interval arithmetic technique used in our design flow, by using the designer's knowledge to refine the automatic bit-width analysis through the specification of correlated input values. This is valuable in current design practice, where the design cost has to be reduced through design reuse. Two examples of use cases are illustrated in figure 3a, which presents correlated inputs, and in figure 3b, which presents different input configurations.

² The GraphLab tool used in these experiments can be downloaded from http://bertrand.legal.free.fr/graphlab/

Fig. 4. Multiple execution profiles analysis flow

In the case of multi-profile applications, this yields a single hardware solution rather than an architecture that has to be reconfigured according to the execution profile considered. To achieve this, the set of hardware resources must be sized correctly so that the architecture can correctly execute all the use cases specified before synthesis.

The multiple profile management consists in merging different mono-profile analyses. Each input data characterization is analyzed separately, as presented in figure 4. After the formal propagation of the lower and upper bounds, the bit-width annotations $[\Omega(n), \psi(n)]$ of each node are merged to obtain its minimal bit-width (equation 1) over the set of use cases.

$$\Omega(n) = \max_{k}\big(\Omega(n_k)\big) \qquad (1)$$

with $n_k$ the node linked to use case $k$.

Compared to a solution that merges the input profiles (dynamic ranges) before performing a global analysis of the application, this solution allows a finer analysis of the bit-width requirements. It thus allows the required hardware resources to be sized optimally.
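Equation 1 can be sketched in a few lines of Python. The per-profile annotations are assumed to be dictionaries mapping node names to the bit-widths obtained from separate mono-profile analyses; the node names and widths below are made up for illustration, and the handling of the sign information is left out.

```python
def merge_profiles(profiles: list[dict[str, int]]) -> dict[str, int]:
    """Merge per-use-case bit-widths: omega(n) = max over k of omega(n_k)."""
    if not profiles:
        raise ValueError("at least one profile is required")
    nodes = set(profiles[0])
    for annotations in profiles[1:]:
        if set(annotations) != nodes:
            raise ValueError("all profiles must annotate the same nodes")
    return {n: max(p[n] for p in profiles) for n in sorted(nodes)}


# Hypothetical annotations for two use cases of the same graph.
use_case_1 = {"in": 8, "acc": 16}
use_case_2 = {"in": 12, "acc": 14}
print(merge_profiles([use_case_1, use_case_2]))  # {'acc': 16, 'in': 12}
```

Note that the widest annotation of each node may come from a different use case, as in this made-up example.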

Example 3a describes a case where inputs are correlated: depending on the value of A, the upper and lower bounds of B can be refined. Without the use case merging method, the designer would have described this use case in the typical way, with $B \in [0,10]$, which is not optimal.

Example 3b describes a case where the designer wants to implement two different standards from the same behavioral description. The first standard, named "modulator 1", is characterized by given dynamic ranges for its inputs (Ampli coded with 7 bits and X with 8 bits), and the second standard, "modulator 2", has different dynamic ranges for its inputs (Ampli coded with 1 bit and X with 10 bits). The traditional description of these two use cases using a typical approach would have been: $Ampli \in [0,128]$ and $X \in [0,2047]$.
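The benefit of analyzing each use case separately can be checked numerically on the bit-widths quoted for example 3b. The Python sketch below assumes, purely for illustration, that the modulator multiplies the unsigned inputs Ampli and X; the actual behavior of the modulator is not specified here. It compares the product width obtained per profile and merged with equation 1 against the width obtained when the input widths are merged first.

```python
def unsigned_max(bits: int) -> int:
    """Largest unsigned value representable with `bits` bits."""
    return (1 << bits) - 1


def product_width(ampli_bits: int, x_bits: int) -> int:
    """Bits needed for the largest product of two unsigned inputs."""
    return (unsigned_max(ampli_bits) * unsigned_max(x_bits)).bit_length()


# (Ampli bits, X bits) for each standard, as quoted in the text.
profiles = {"modulator 1": (7, 8), "modulator 2": (1, 10)}

# Multi-profile analysis: analyze each use case, then keep the maximum.
per_profile = {name: product_width(a, x) for name, (a, x) in profiles.items()}
merged = max(per_profile.values())

# Mono-profile analysis: merge the input widths first, then analyze once.
union = product_width(max(a for a, _ in profiles.values()),
                      max(x for _, x in profiles.values()))

print(per_profile)    # {'modulator 1': 15, 'modulator 2': 10}
print(merged, union)  # 15 17
```

Under this assumption, the multiplier output needs 15 bits with the multi-profile analysis, instead of 17 bits when the input widths are merged first.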

C. High-level synthesis

The modified high-level synthesis tool we used is GAUT. Its starting point is a behavioral description which specifies the behavior of the application to implement. High-level languages such as SystemC or behavioral VHDL can be used for the description. The synthesis process can be constrained by the designer using different parameters: the target technology, the circuit throughput, etc. The design flow implemented in this tool and the low-complexity algorithms used to generate variable bit-width architectures are described in [13].

IV. EXPERIMENTS

To evaluate the effectiveness of the proposed methodology, we generated a multi-standard architecture for Block Matching applications. This experiment compares our approach with a conventional HLS flow, which produces architectures that do not implement multi-standard components. We used a Xilinx Virtex-II Pro XC2VP100 FPGA as the target technology. Results were obtained with a complete design flow including the bit-width analysis using the GraphLab tool, the high-level synthesis process with the GAUT tool and the logic synthesis with the Xilinx ISE 7.1i tool. The dynamic power consumption was estimated with XPower 7.1 from Xilinx.

The SAD (sum of absolute differences) computation is the basic operation used in block matching algorithms such as the Three Step Search algorithm [14]. It is used in MPEG-x and H.26x encoding systems.
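For reference, the SAD of two blocks is the sum of the absolute differences between their pixels. The Python sketch below computes it for blocks given as lists of rows, and also gives the worst-case accumulator width for an n x n block of unsigned pixels, using a 16 x 16 block as an example. The function names are illustrative and transparency is not modeled.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    if len(block_a) != len(block_b) or any(
            len(ra) != len(rb) for ra, rb in zip(block_a, block_b)):
        raise ValueError("blocks must have the same dimensions")
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def sad_accumulator_bits(n: int, pixel_bits: int) -> int:
    """Bits needed for the largest SAD of an n x n block of unsigned pixels."""
    worst_case = n * n * ((1 << pixel_bits) - 1)
    return worst_case.bit_length()


print(sad([[1, 2], [3, 4]], [[4, 2], [0, 4]]))  # 6
print(sad_accumulator_bits(16, 8))   # 16
print(sad_accumulator_bits(16, 12))  # 20
```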

For practical purposes, video compression applications using motion estimation techniques are regularly implemented on different systems (computers, mobile phones, PDAs, etc.) and use different execution profiles according to the targeted standards, the application requirements and the system capabilities. These profiles differ, for example, in the macroblock size, the number of transparency levels and the data precision. To underline the relevance of the multi-profile analysis with respect to the processing bit-width, a system which manages three standards was considered (N=16):

1) The first use profile (profile 1) considers pixels coded with 8 bits, without transparency levels.

2) The second use profile considers pixels coded with 12 bits, without transparency levels.

3) The third use profile considers pixels coded with 8 bits, with 4 transparency levels (2 bits).

In order to assess the benefit of the multi-profile analysis, we compared this approach with a mono-profile one. The mono-profile experiment was based on the union of the dynamic input ranges of the three proposed standards, that is to say pixels coded with 12 bits and 4 transparency levels (2 bits).

Then, for both the mono-profile and the multi-profile experiments, the proposed design flow was applied, i.e. the bit-width analysis followed by the high-level synthesis process with non-uniform bit-width components. After logic synthesis, the area of the multi-profile architecture was 5% to 11% smaller than that of the mono-profile one, and the power consumption was 2% to 12% lower, depending on the throughput constraints given to the HLS flow.

This result shows the benefit of the multi-profile analysis for handling multiple use cases (standards, correlated inputs, etc.) optimally with a single component: the required bit-width is refined during the analysis, and the hardware resources are then sized accordingly during the synthesis.

V. CONCLUSION

In this paper, we have presented a bit-width aware synthesis design flow which makes it possible to consider different standards for the same application. For each profile, data and computation sizing is analyzed. Bit-width constraints are then merged to drive the synthesis. A single architecture is thus generated that integrates the set of different standards, avoiding over-sized architectures or chip reconfiguration.

The area and power reductions show the benefit of the approach compared to a conventional synthesis. Furthermore, thanks to its low computational complexity, O(n), the proposed methodology can be applied to current high-complexity DSP applications.

REFERENCES

[1] E. Casseau, B. Le Gal, P. Bomel, C. Jégo, S. Huet, and E. Martin, "C-based rapid prototyping for digital signal processing," in Proceedings of EUSIPCO, Antalya, Turkey, 4-8 September 2005.

[2] D. Gajski, N. Dutt, et al., High-Level Synthesis: Introduction to Chip and System Design. Kluwer Academic Publishers, 1992.

[3] J. P. Elliott, Understanding Behavioral Synthesis: A Practical Guide to High-Level Design. Kluwer Academic Publishers, 2000.

[4] A. Nayak, M. Haldar, A. Choudhary, and P. Banerjee, "Precision and error analysis of MATLAB applications during automated hardware synthesis for FPGAs," in Proceedings of DATE, 2001, pp. 722–728.

[5] D.-U. Lee, A. A. Gaffar, R. C. Cheung, O. Mencer, W. Luk, and G. A. Constantinides, "Accuracy guaranteed bit-width optimization," IEEE Transactions on Computer Aided Design, 2006.

[6] M. Stephenson, J. Babb, and S. Amarasinghe, "Bitwidth analysis with application to silicon compilation," in Proc. of PLDI, 2000, pp. 108–120.

[7] S. Kim and W. Sung, "Fixed-point error analysis and word length optimization of 8×8 IDCT architectures," Trans. on Circuits and Systems for Video Technology, vol. 8, no. 8, pp. 935–940, December 1998.

[8] K.-I. Kum and W. Sung, "Word-length optimization for high-level synthesis of digital signal processing systems," in Proceedings of the IEEE Workshop on Signal Processing Systems, 1998, pp. 569–578.

[9] G. A. Constantinides, P. Y. K. Cheung, and W. Luk, "Optimal datapath allocation for multiple-wordlength systems," Electronics Letters, Issue 17, pp. 1508–1509, 2000.

[10] G. Constantinides, P. Cheung, and W. Luk, "Heuristic datapath allocation for multiple wordlength systems," in Proceedings of the Design, Automation and Test in Europe (DATE) Conference, 2001, pp. 791–796.

[11] V. Agrawal, A. Pande, and M. Mehendale, "High level synthesis of multi-precision data flow graphs," in Proceedings of the 14th International Conference on VLSI Design, 2001, pp. 411–416.

[12] M. Molina, J. Mendias, and R. Hermida, "High-level allocation to minimize internal hardware wastage," in Proceedings of the Design, Automation and Test in Europe (DATE) Conference, 2003, pp. 264–269.

[13] B. Le Gal, C. Andriamisaina, and E. Casseau, "Bit-width aware high-level synthesis for digital signal processing systems," in Proceedings of SOCC'06, Austin, Texas, 24-27 September 2006.

[14] R. Li, B. Zeng, and M. Liou, "A new three-step search algorithm for block motion estimation," IEEE Trans. on Circuits and Systems for Video Technology, vol. 4, no. 4, pp. 438–442, August 1994.
