Using Digital Signal Processor Singularities to Minimize ROM Based Controller Area
Bertrand LE GAL, Lilian BOSSUET and Dominique DALLET
IMS Laboratory - UMR CNRS 5218
University of Bordeaux - Talence, FRANCE
Abstract—Interest in the synthesis of custom Digital Signal Processors (DSP) using automated design flows such as High-Level Synthesis has greatly increased in recent years. This trend is driven by growing processing complexity and time-to-market constraints. Designing a dedicated processor is a complex process in which tools must optimize both the datapath and its controller. In this paper, we propose a controller design flow based on mapping Finite-State Machines into memory blocks in order to limit the critical path delay in the controller. Our design flow takes DSP circuit singularities into account and provides efficient area savings compared to other approaches (from more than 18% up to 42% on real applications).
I. INTRODUCTION
Custom circuits generated using High-Level Synthesis design flows are based on generic architectures. These custom circuits, usually dedicated to Digital Signal Processing applications, are composed of two main parts: a datapath that performs the computations and a controller that commands the hardware resources. The complexity of the generated circuits increases with application functionality and performance constraints. This design complexity heavily affects the controller, which therefore limits the maximum clock frequency of the circuit.
Many controller-related issues have been addressed in research on control-intensive designs. It has been demonstrated that implementing a controller as a ROM-based design provides interesting characteristics [1, 2]. However, existing techniques were developed for control-intensive applications and must be adapted to efficiently handle the specificities of computation-intensive circuits.
In this paper, we present a new design flow for ROM-based controllers dedicated to custom digital signal processors, with the goal of saving memory area. The problem formulation and its resolution differ from approaches in the literature: we do not assume that the next-state computation part of the controller is the most complex one. In DSP circuits, the controller complexity lies in the output decoder part of the design. The main issue is to efficiently factorize the output command signals to save ROM area.
This article is organized as follows. Section II presents the literature approaches that optimize ROM-based implementations of controllers and explains the motivation behind studying this kind of solution. In Section III, we detail the area optimization algorithm and extend literature approaches to handle DSP singularities. Experimental results which validate our strategy to reduce the controller ROM area are reported in Section IV. Finally, Section V concludes this paper.
II. RELATED WORKS
The FSM can be described by a 5-tuple $(X, Y, S, f, g)$ where $X$, $Y$ and $S$ are finite collections of the input, output and state variables. The function $f$, defined as $f: X \times S \rightarrow S$, is the transition function which provides the next state $s_j$ for the given inputs and the current state $s_i$. The function $g$, defined as $g: X \times S \rightarrow Y$, is the output function which computes the machine output for the given inputs and current state.
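To make the 5-tuple concrete, the following minimal sketch models an FSM as a Python record and steps it over an input sequence. The three-state machine, its input names (`go`, `wait`) and its output names are hypothetical and serve only as an illustration; they do not come from the paper.

```python
from typing import Callable, NamedTuple


class FSM(NamedTuple):
    inputs: frozenset    # X
    outputs: frozenset   # Y
    states: frozenset    # S
    f: Callable          # transition function f: X x S -> S
    g: Callable          # output function     g: X x S -> Y


def run(fsm, start, stimulus):
    """Apply a sequence of inputs and collect the outputs produced."""
    state, trace = start, []
    for x in stimulus:
        trace.append(fsm.g(x, state))   # output for current input and state
        state = fsm.f(x, state)         # move to the next state
    return state, trace


# Hypothetical machine: advances on 'go', holds its state on 'wait'.
toy = FSM(
    inputs=frozenset({"go", "wait"}),
    outputs=frozenset({"idle", "load", "compute"}),
    states=frozenset({0, 1, 2}),
    f=lambda x, s: (s + 1) % 3 if x == "go" else s,
    g=lambda x, s: ("idle", "load", "compute")[s],
)

final_state, outputs = run(toy, 0, ["go", "wait", "go", "go"])
```

In this toy machine the output depends only on the state, which is also the situation of the counter-based controllers discussed later in this section.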
DSP controller optimization techniques have been developed considering logic-based controller implementations (Figure 1a). Logic-based controller design has been shown to be inefficient for controllers with a large number of states [3]. One way to break the dependence of the critical path on the number of FSM states is to implement the controller as a ROM-based design (Figure 1b). With such a controller architecture, the output values and the transition conditions are pre-computed (before logic synthesis) and stored in ROM. In these architectures, the critical path is constant whatever the number of states and the number of resources: it depends only on the ROM characteristics.
Figure 1. Various FSM controller implementations: (a) logic based design, (b) ROM based design (address and output), (c) ROM based design using an address counter (states).
978-1-4244-6805-8/10/$26.00 ©2010 IEEE
General methods for ROM-based controller synthesis, targeting the implementation of sequential circuits in embedded memory blocks, have been proposed in [4, 5]. These methods, dedicated to control-intensive applications, save ROM area by decomposing the memory block (corresponding to the controller) into two blocks: a semi-combinational address modifier and a smaller memory block that stores the output values. An appropriately chosen decomposition strategy may reduce the required memory size at the cost of additional logic cells. These optimizations focus only on the next-address computation part of the controller implementation. A similar approach, proposed in [5], considers the controller power consumption problem. Finally, in [6], the authors use don't-care values to simplify the state transition equations. This simplification reduces the memory size as well as the multiplexer complexity, but only for the address modifier part of the controller.
Proposed approach
Approaches in the literature focus on general FSM models with uncorrelated output signals. They assume that the next-state computation part of the controller is more complex than the output decoder. In dedicated DSP circuits, controllers do not have such characteristics: (1) the FSM models are linear (the next-state decoding equations and conditions are simple); (2) the output signal set is complex (huge numbers of states and output signals).
In custom DSP circuits, the controller output signals are used to control the datapath resources, as shown in Figure 2.
At the same time, the next-state computation part of the controller is simpler than in general FSM models because the execution path is linear. This singularity stems from the computation-intensive nature of digital signal processing applications. The controller architecture presented in Figure 1c is efficient for this kind of circuit. In this design, the next-state computation function f is implemented using an adder, a register and a multiplexer. The output decoding function g is implemented using a ROM. Each word of the ROM stores the output commands associated with one state of S.
In this paper, we propose a cluster-based methodology that efficiently reduces the ROM area associated with output signal generation.
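The behaviour of the counter-based controller of Figure 1c can be sketched in a few lines. This is an illustrative model only: the paper does not detail what the multiplexer selects, so the sketch assumes it chooses between the incremented address and a branch target, and it assumes the counter wraps around at the end of the ROM. The names `make_step` and `branch_targets` and the command words are hypothetical.

```python
def make_step(rom, branch_targets=None):
    """Build the per-cycle behaviour of a counter-based ROM controller.

    rom            -- list of command words, one per state (output function g)
    branch_targets -- hypothetical {state: target} map used by the multiplexer
    """
    branch_targets = branch_targets or {}

    def step(state, condition=False):
        outputs = rom[state]                       # ROM lookup implements g
        incremented = (state + 1) % len(rom)       # adder (wrap-around assumed)
        if condition and state in branch_targets:  # multiplexer select (assumed)
            next_state = branch_targets[state]
        else:
            next_state = incremented
        return next_state, outputs                 # next_state is latched in the register

    return step


# Toy 4-state schedule; the command words are made up for illustration.
rom = ["0001", "0110", "1000", "0000"]
step = make_step(rom, branch_targets={1: 3})

state, trace = 0, []
for condition in (False, True, False, False):
    state, outputs = step(state, condition)
    trace.append(outputs)
```

The point of the sketch is that the next-state logic stays the same size whatever the number of states, while all per-state information lives in the ROM words.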
III. AREA SAVING TECHNIQUES
ROM area increases with the number of resources and the number of execution states. Depending on the circuit complexity, these requirements can become huge. Two techniques have been proposed to reduce them, removing spatial and temporal redundancy respectively. Both exploit the fact that command signals (controller outputs) are not required by every hardware resource at every clock cycle. Undefined command values, named don't-care values, are represented by X in the example truth table (Figure 3). Don't-care values help the compaction process reduce the ROM area, as they can be modified without impact on design functionality.
Spatial redundancy - The first approach to save ROM area is column compaction [4]. This step aims at removing output signals (columns) which are logically equivalent, or can be made equivalent through assignment of don't cares.
Given a set of output columns, the smallest column set able to drive all datapath resources is obtained by compacting the given set. This problem is related to the clique-partitioning problem (grouping mutually compatible columns into as few groups as possible), which is NP-complete.
Figure 2. Circuit composed of a datapath and its controller.
Figure 3. Command signal table for ASIP circuit (signals control datapath resources).
Figure 4. Architecture for instruction compacted controllers.
Removing temporal redundancy - The second approach removes inter-instruction redundancy (reducing the ROM height). Removing temporal redundancy replaces the output computation function $g: X \times S \rightarrow Y$ by $g': T(X \times S) \rightarrow Y$, where $T: (X \times S) \rightarrow I$ is the function which associates an instruction with each controller state. This indexed relation between states and output signals requires an architectural modification: a new ROM is added to the controller design to implement the T relation. The modified controller architecture is presented in Figure 4. However, this approach may increase the ROM size in some circumstances, i.e. when instruction redundancy in the controller is low (the indexing ROM may be more expensive than the memory saved by compacting instructions).
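The trade-off of instruction compaction can be checked with a short sketch that deduplicates ROM words, builds the indexing ROM implementing T, and compares bit counts. It only merges exactly identical words (merging through don't cares is left out), and it assumes the index ROM holds one entry per state with just enough bits to address the instruction ROM. The example words are made up.

```python
def compact_instructions(words):
    """Deduplicate identical ROM words.

    Returns (instruction_rom, index_rom) where index_rom[s] is the
    instruction selected in state s (the T relation).
    """
    instruction_rom, index_of, index_rom = [], {}, []
    for word in words:
        if word not in index_of:
            index_of[word] = len(instruction_rom)
            instruction_rom.append(word)
        index_rom.append(index_of[word])
    return instruction_rom, index_rom


def rom_bits(words):
    """Size of a ROM holding equally wide words."""
    return len(words) * len(words[0])


def compacted_bits(instruction_rom, index_rom):
    """Instruction ROM plus index ROM (index width is an assumption)."""
    index_width = max(1, (len(instruction_rom) - 1).bit_length())
    return rom_bits(instruction_rom) + len(index_rom) * index_width


# High redundancy: 6 states share 3 distinct 8-bit instructions.
redundant = ["10010000", "01100011", "10010000",
             "10010000", "01100011", "00000000"]
instructions, index = compact_instructions(redundant)
before, after = rom_bits(redundant), compacted_bits(instructions, index)

# Low redundancy: every word is distinct, so the index ROM is pure overhead.
distinct = ["00", "01", "10", "11"]
d_before = rom_bits(distinct)
d_after = compacted_bits(*compact_instructions(distinct))
```

In the first case the compacted controller is smaller than the plain ROM, while in the second case the indexing ROM makes it larger, which is the circumstance described above.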
A. Proposed Cluster-Based approach
Using a single controller to drive the whole datapath (Figure 5a) is a bottleneck for the optimization process: a single conflicting bit among hundreds can forbid the merging of two rows or columns. To solve this issue, the dedicated processor circuit can be divided into independent synchronous clusters. Clusters are atomic elements composed of a computation resource, its associated storage elements and the required steering logic. Each cluster has its own characteristics (e.g. computation starting and ending states, depending on the resource usage). With such a circuit decomposition (Figure 5b), each cluster controller can be optimized without considering the others. This approach is efficient for instruction compaction.
Unfortunately, duplicating controllers in an island-style approach (Figure 5b) increases the design area: it reduces column compaction opportunities and requires an indexing ROM in each cluster. Duplicated indexing ROMs and state counters, which can be partially or fully redundant with one another, increase the circuit area. An efficient datapath controller design lies between the clustered approach and the single ROM one (Figure 5c). The cluster-merging problem is an optimization problem whose objective function can be described as follows:
$$\min \sum_{i=1}^{N} \mathrm{Area}(c_i)$$
with $N$ the number of controllers and $\mathrm{Area}(c_i)$ the controller memory size of the $i$-th controller.
To find an efficient controller solution, a weighted graph $B = (C, E)$ is built. Each vertex $c_i \in C$ represents a cluster controller and is weighted with $v_i$, the minimum memory cost of the $i$-th controller. $E \subseteq (C \times C)$ is the set of weighted edges $e_{l,m}$ between $c_l$ and $c_m$. Edges represent the merging possibility between the linked clusters. The weight $w_{l,m}$ associated with edge $e_{l,m}$ corresponds to the area saving (or loss) obtained when merging controller $c_l$ with $c_m$.
1) First Step: Creating the weighted graph
For each cluster $c_i$ with $i \in [1, N]$ in the architecture, we create a node $c_i$ in B. For each node $c_i$ we compute the associated minimum controller memory cost $v_i$. The $v_i$ weight is obtained by applying all of the optimization schemes. There are three distinct ways to obtain the best ROM design: (1) column compaction only; (2) column compaction followed by instruction compaction; (3) the same optimizations as (2) but in the opposite order. The minimum cost value is selected as the node weight ($v_i$).
Once all nodes are created, we create the edges. Each graph node is linked to all others. Each edge $e_{i,j}$ models the area saving obtained when merging controller $c_i$ with $c_j$. For each node pair $(c_i, c_j)$ we create three distinct edges, weighted using the merged controller cost obtained with each of the three optimization processes (as for the node weights). An example of such a graph is presented in Figure 6a.
2) Second Step: Removing inefficient opportunities
Once the overall weighted graph B has been constructed, we first eliminate the redundant edges linking node pairs: for each pair $(c_i, c_j)$ we keep only the edge $e_{i,j}$ corresponding to the best area reduction. When area savings are equivalent, the column-compaction-only solution is preferred to the others for critical path reasons. The result of this step is presented in Figure 6b. Finally, the inefficient merging possibilities (those which increase the design area) are removed: each edge is evaluated and those with $w_{i,j} < 0$ are removed. This model transformation is illustrated in Figure 6c.
3) Third Step: Incremental graph compaction
The graph compaction problem is solved using a greedy approach to limit the algorithmic complexity. The graph B is analyzed to find the maximum weight value $w_{i,j}$. This weight corresponds to the best controller merging opportunity (area saving). We merge the two controllers associated with $e_{i,j}$, removing nodes $c_i$ and $c_j$ from B. A new node $c_k$ is inserted in B. Edges linking $c_k$ to every $c_m \in B \setminus \{c_k\}$ are created and weighted as described in Step 1. The newly created edges are optimized as in Step 2, and the procedure is repeated until no merging opportunity remains (Figure 6d).
Figure 5. Controller approaches in DSP circuit: (a) single controller, (b) clustered controllers, (c) mixed approach.
Figure 6. Bi-partite graph models obtained during the proposed optimization process: (a) initial graph model, (b) redundant edge removing, (c) inefficient edge removing, (d) ROM factorization result.
Savings may be improved by starting the optimization from smaller clusters (i.e. one controller for each register and multiplexer element). Unfortunately, such an approach would drastically increase the optimization runtime.
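The three steps of the cluster-merging procedure can be summarized by the following sketch. It is a simplified reading of the text, not the authors' implementation: the `cost` callable is hypothetical and is assumed to already return the minimum ROM cost over the three optimization orders, so the three parallel edges per pair collapse into one; the tie-break in favour of column-compaction-only solutions is not modelled; and edges with zero saving are skipped along with negative ones. The cost values in the example are made up.

```python
from itertools import combinations


def greedy_merge(clusters, cost):
    """Greedy controller merging.

    clusters -- iterable of frozensets of resource names (one per cluster)
    cost     -- hypothetical callable giving the minimum ROM cost (in bits)
                of a single controller driving the given set of resources
    Returns a dict {merged cluster: cost}.
    """
    nodes = {c: cost(c) for c in clusters}            # Step 1: node weights v_i
    while True:
        best, best_saving = None, 0
        for a, b in combinations(nodes, 2):           # one edge per node pair
            saving = nodes[a] + nodes[b] - cost(a | b)    # edge weight w_ij
            if saving > best_saving:                  # Step 2: ignore non-saving edges
                best, best_saving = (a, b), saving
        if best is None:                              # no merging opportunity left
            return nodes
        a, b = best                                   # Step 3: merge the best pair
        merged_cost = cost(a | b)
        del nodes[a], nodes[b]
        nodes[a | b] = merged_cost                    # new node c_k


# Made-up costs for three clusters A, B, C and their possible merges.
A, B, C = frozenset("A"), frozenset("B"), frozenset("C")
costs = {A: 40, B: 30, C: 50,
         A | B: 55, A | C: 95, B | C: 70,
         A | B | C: 110}

result = greedy_merge([A, B, C], costs.__getitem__)
total = sum(result.values())
```

Here A and B are merged first (saving 15 bits), after which merging the result with C would cost more than keeping two controllers, so the procedure stops with a mixed solution of the kind shown in Figure 5c.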
IV. EXPERIMENTAL RESULTS
In this section, we present experimental results. The ROM-based optimization techniques have been integrated in the VHDL backend of the GraphLab HLS tool [8]. Experiments are based on commonly used digital signal processing applications. The results have been obtained under different synthesis constraints, which produce different controller architectures (different numbers of resources to control and different numbers of states). This procedure provides a fair evaluation of the savings obtained. Five methodologies have been compared: (1) ROM optimized using column compaction only; (2) column compaction followed by instruction compaction; (3) same as (2) but in the opposite order; (4) fully clustered approach, with column and instruction compaction for each controller; (5) proposed cluster-based approach.
Table 1 presents the ROM area savings obtained for the benchmark applications depending on the controller's number of states (and the number of resources).
The results highlight that the proposed technique always provides better area savings (at least 16.8% in Table 1) than the other methodologies. However, the memory saving obtained depends on the FSM characteristics: the number of resources, the number of controller states, and the command signal values.
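The "Saving vs proposed" column of Table 1 can be recomputed from the ROM bit counts. The sketch below does so for the first FFT configuration (91 states); the bit counts are copied from Table 1, while the function names and the formula (one minus the ratio of proposed to reference size) are our reading of the table rather than something stated explicitly in the text.

```python
def rom_kbytes(bits):
    """Convert a ROM size in bits to kBytes (1 kByte = 1024 bytes)."""
    return bits / 8 / 1024


def saving_vs_proposed(other_bits, proposed_bits):
    """Relative area saving (in %) of the proposed approach over another one."""
    return 100 * (1 - proposed_bits / other_bits)


# ROM bit counts copied from Table 1 (64 taps FFT, 91 states).
fft_91 = {
    "Without optimisation": 114724,
    "Column compaction": 63664,
    "Column and instruction compaction": 48392,
    "Fully clustered approach": 45057,
}
proposed = 37509

savings = {name: round(saving_vs_proposed(bits, proposed), 1)
           for name, bits in fft_91.items()}
proposed_kbytes = round(rom_kbytes(proposed), 1)
```

The values obtained this way agree with the percentages and the kByte size reported in the table for this configuration.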
V. CONCLUSION
In this paper, we have presented a new ROM area saving methodology whose objective is to use the digital signal processor circuit singularities to extend literature approaches.
The generated controller architecture is an efficient trade-off between a single controller and a fully clustered approach. The proposed methodology, efficient for complex circuits, can be integrated into high-level synthesis CAD tools or used independently on hand-made circuits. As the experimental results show, the controllers generated using our design flow have significantly less area: from 22% up to 42% less than a single controller design and from 17% up to 36% less than a fully clustered approach.
VI. REFERENCES
[1] W. Shiue, “Power/area/delay aware FSM synthesis and optimization,” Microelectronics Journal, vol. 36, no. 2, pp. 147–162, February 2005.
[2] K. Kuusilinna, V. Lahtinen, T. Hamalainen, and J. Saarinen, “Finite state machine encoding for VHDL synthesis,” in Computers and Digital Techniques, IEEE Proceedings -, vol. 148, no. 1, pp. 23–30, 2001.
[3] P. Bomel, E. Martin, and E. Boutillon, “Synchronization processor synthesis for latency insensitive systems,” in Proceedings of the DATE conference, pp. 896–897, 2005.
[4] S. Mitra, L. Avra, and E. McCluskey, “An output encoding problem and a solution technique,” in Proceedings of the 1997 IEEE/ACM international conference on Computer-aided design, pp. 304–307, 1997.
[5] H. Selvaraj et al., “FSM implementation in embedded memory blocks of programmable logic devices using functional decomposition,” in Proceedings of the ITCC’02 Conference, p. 355, 2002.
[6] M. Rawski, H. Selvaraj, and T. Luba, “An application of functional decomposition in ROM-based FSM implementation in FPGA devices,” Journal of Systems Architecture, vol. 51, no. 6-7, pp. 424–434, 2005.
[7] I. Garcia-Varga et al., “ROM-based finite state machine implementation in low cost FPGAs,” in IEEE International Symposium on Industrial Electronics (ISIE), pp. 2342–2347, 2007.
[8] E. Casseau and B. Le Gal, “High-Level Synthesis for the Design of FPGA-based Signal Processing Systems,” in Proceedings of SAMOS’09, Samos, Greece, July 20-23, 2009.
| Application | Technique | # of states | # of resources | # of clusters | Instruction width | # of ROM bits | ROM size (kByte) | # of instructions | Saving vs proposed |
|---|---|---|---|---|---|---|---|---|---|
| 64 taps FFT | Without optimisation | 91 | 1247 | 1 | 1247 | 114724 | 14.0 | 91 | 67.3 % |
| 64 taps FFT | Column compaction | 91 | 1247 | 1 | 692 | 63664 | 7.8 | 91 | 41.1 % |
| 64 taps FFT | Column and instruction compaction | 91 | 1247 | 1 | 692 | 48392 | 5.9 | 69 | 22.5 % |
| 64 taps FFT | Fully clustered approach | 91 | 1247 | 61 | 1043 | 45057 | 5.5 | [6, 35] | 16.8 % |
| 64 taps FFT | Proposed approach | 91 | 1247 | 13 | 852 | 37509 | 4.6 | [29, 44] | --- |
| 64 taps FFT | Without optimisation | 274 | 1069 | 1 | 1069 | 293975 | 35.9 | 274 | 80.6 % |
| 64 taps FFT | Column compaction | 274 | 1069 | 1 | 551 | 151525 | 18.5 | 274 | 62.3 % |
| 64 taps FFT | Column and instruction compaction | 274 | 1069 | 1 | 551 | 88707 | 10.8 | 157 | 35.6 % |
| 64 taps FFT | Fully clustered approach | 274 | 1069 | 54 | 873 | 74305 | 9.1 | [3, 65] | 23.2 % |
| 64 taps FFT | Proposed approach | 274 | 1069 | 10 | 677 | 57095 | 7.0 | [53, 89] | --- |
| inverse JPEG | Without optimisation | 80 | 2805 | 1 | 2805 | 227205 | 27.7 | 80 | 77.7 % |
| inverse JPEG | Column compaction | 80 | 2805 | 1 | 1221 | 98901 | 12.1 | 80 | 48.7 % |
| inverse JPEG | Column and instruction compaction | 80 | 2805 | 1 | 1221 | 88479 | 10.8 | 72 | 42.6 % |
| inverse JPEG | Fully clustered approach | 80 | 2805 | 256 | 2295 | 78307 | 9.6 | [4, 30] | 35.2 % |
| inverse JPEG | Proposed approach | 80 | 2805 | 37 | 1674 | 50749 | 6.2 | [13, 33] | --- |
| inverse JPEG | Without optimisation | 140 | 2472 | 1 | 2472 | 348552 | 42.5 | 140 | 69.4 % |
| inverse JPEG | Column compaction | 140 | 2472 | 1 | 1110 | 156510 | 19.1 | 140 | 31.9 % |
| inverse JPEG | Column and instruction compaction | 140 | 2472 | 1 | 1110 | 139737 | 17.1 | 125 | 23.7 % |
| inverse JPEG | Fully clustered approach | 140 | 2472 | 119 | 1963 | 148909 | 18.2 | [2, 76] | 28.4 % |
| inverse JPEG | Proposed approach | 140 | 2472 | 4 | 1136 | 106635 | 13.0 | [91, 94] | --- |

Table 1. ROM area saving for two applications synthesized under different timing (latency) constraints. For clustered designs, the number of instructions is given as a [min, max] range over the cluster controllers.