Using Digital Signal Processor Singularities to Minimize ROM Based Controller Area


Bertrand LE GAL, Lilian BOSSUET and Dominique DALLET

IMS Laboratory - UMR CNRS 5218

University of Bordeaux - Talence, FRANCE

Abstract - Interest in synthesizing custom Digital Signal Processors (DSP) with automated design flows such as High-Level Synthesis has increased greatly in recent years, driven by growing processing complexity and time-to-market constraints. Designing a dedicated processor is a complex process in which tools must optimize both the datapath and its controller. In this paper, we propose a controller design flow that maps Finite-State Machines into memory blocks in order to limit the critical path delay in the controller. Our design flow takes DSP circuit singularities into account and provides efficient area savings compared to other approaches (from more than 18% up to 42% on real applications).

I. INTRODUCTION

Custom circuits generated by High-Level Synthesis design flows are based on generic architectures. These circuits, usually dedicated to Digital Signal Processing applications, are composed of two main parts: a datapath that performs the computations and a controller that commands the hardware resources. The complexity of generated circuits increases with application functionality and performance constraints, and this complexity weighs heavily on the controller. As a result, the controller limits the maximum frequency of the circuit.

Many controller-related issues have been addressed in research on control-intensive designs. It has been demonstrated that implementing a controller as a ROM-based design provides interesting characteristics [1, 2]. However, existing techniques were developed for control-intensive applications and must be adapted to handle the specificities of computation-intensive circuits efficiently.

In this paper, we present a new design flow for ROM-based controllers dedicated to custom digital signal processors, with the aim of saving memory area. The problem formulation and its resolution differ from the approaches in the literature: we do not assume that the next-state computation is the most complex part of the controller. In DSP circuits, the controller complexity lies in the output decoder. The main issue is therefore to factorize the output command signals efficiently in order to save ROM area.

This article is organized as follows. Section 2 presents the approaches from the literature that optimize ROM-based controller implementations and explains the motivation for studying this kind of solution. In Section 3, we detail the area optimization algorithm and extend the literature approaches to handle DSP singularities. Experimental results that validate our strategy for reducing the controller ROM area are reported in Section 4. Finally, Section 5 concludes the paper.

II. RELATED WORKS

The FSM can be described by a 5-tuple $(X, Y, S, f, g)$, where $X$, $Y$ and $S$ are finite collections of the input, output and state variables. The transition function $f: X \times S \to S$ provides the next state $s_j$ for the given inputs and the current state $s_i$. The output function $g: X \times S \to Y$ computes the machine output for the given inputs and the current state.

DSP controller optimization techniques have been developed for logic-based controller implementations (Figure 1a). Logic-based controller design has been shown to be inefficient for controllers with a large number of states [3]. One way to break the dependence of the critical path on the number of FSM states is to implement the controller as a ROM-based design (Figure 1b). With this controller architecture, the output values and the transition conditions are pre-computed before logic synthesis and stored in ROM. The critical path is then constant whatever the number of states and the number of resources: it depends only on the ROM characteristics.

Figure 1. Various FSM controller implementations: (a) logic-based design, (b) ROM-based design (address and output), (c) ROM-based design using an address counter (states).

General methods for ROM-based controller synthesis, targeting the implementation of sequential circuits in embedded memory blocks, have been proposed in [4, 5]. These methods, dedicated to control-intensive applications, save ROM area by decomposing the memory block corresponding to the controller into two blocks: a semi-combinational address modifier and a smaller memory block that stores the output values. An appropriately chosen decomposition strategy may reduce the required memory size at the cost of additional logic cells. These optimizations focus only on the next-address computation part of the controller implementation. A similar approach, which also considers the controller power consumption problem, was proposed in [5]. Finally, in [6], the authors use don't-care values to simplify the state transition equations. This simplification reduces the memory size as well as the multiplexer complexity, but only in the address modifier part of the controller.

Proposed approach

Literature approaches focus on general FSM models with uncorrelated output signals. They assume that the next-state computation part of the controller is more complex than the output decoder. In dedicated DSP circuits, controllers do not have these characteristics: (1) the FSM models are linear (the next-state decoding equations and conditions are simple); (2) the output signal set is complex (huge numbers of states and output signals).

In custom DSP circuits, the controller output signals are used to control the datapath resources, as shown in Figure 2.

At the same time, the next-state computation part of the controller is simpler than in general FSM models because the execution path is linear. This singularity stems from the computation-intensive nature of digital signal processing applications. The controller architecture presented in Figure 1c is efficient for this kind of circuit. In this design, the next-state computation function f is implemented using an adder, a register and a multiplexer. The output decoding function g is implemented using a ROM. Each ROM word stores the output commands associated with one state S.
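As an illustration of the Figure 1c organization, the following Python sketch models the controller as a state register, an incrementer and a ROM lookup. The wrap-around behavior of the multiplexer, the reset value and the names `run_controller`, `rom` and `loop_start` are assumptions made for the example; the text does not specify them.

```python
from typing import List


def run_controller(rom: List[int], n_cycles: int, loop_start: int = 0) -> List[int]:
    """Simulate a counter-addressed ROM controller (sketch of Figure 1c).

    f (next state) is an incrementer plus a multiplexer that reloads
    `loop_start` after the last ROM word (assumed behavior).
    g (outputs) is a plain ROM lookup indexed by the state register.
    """
    state = 0  # state register, reset value assumed to be 0
    issued = []
    for _ in range(n_cycles):
        issued.append(rom[state])        # g: command word of the current state
        incremented = state + 1          # adder
        wrap = incremented == len(rom)   # multiplexer select
        state = loop_start if wrap else incremented
    return issued


# Three command words, one per state (values are arbitrary).
rom_words = [0b1010, 0b0110, 0b0001]
trace = run_controller(rom_words, 5)
```

Because the next state never depends on the ROM content, the critical path of this model is limited to the ROM access and the incrementer, whatever the number of states.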

In this paper, we propose a cluster-based methodology that efficiently reduces the ROM area associated with output signal generation.

III. AREA SAVING TECHNIQUES

ROM area increases with the number of resources and the number of execution states. Depending on the circuit complexity, the memory requirements can become huge. Two techniques have been proposed, which remove spatial and temporal redundancy respectively. Both exploit the fact that command signals (controller outputs) are not required by every hardware resource at every clock cycle. Undefined command values, named don't-care values, are represented by X in the example truth table (Figure 3). Don't-care values help the compaction process reduce the ROM area, since they can be modified without affecting design functionality.

Spatial redundancy - The first approach to save ROM area is column compaction [4]. This step removes output signals (columns) which are logically equivalent, or which can be made equivalent through the assignment of don't-care values.

Given a set of output columns, the smallest column set able to drive all the datapath resources is obtained by compacting the given set. This problem is related to the maximum clique-partitioning problem, which is NP-complete.
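Since the exact problem is NP-complete, a heuristic is needed in practice. The sketch below is a simple first-fit greedy compaction, given only as an illustration; it is not the algorithm of [4]. Columns are assumed to be strings over `0`, `1` and `X` with one character per state, and the function names are hypothetical.

```python
def merge_columns(a, b):
    """Return the merged column, or None if a and b conflict in some state."""
    merged = []
    for x, y in zip(a, b):
        if x == "X":
            merged.append(y)          # a is free here, take b's value
        elif y == "X" or x == y:
            merged.append(x)          # b is free or both agree
        else:
            return None               # 0 against 1: columns are incompatible
    return "".join(merged)


def compact_columns(columns):
    """First-fit greedy column compaction (heuristic, not optimal)."""
    kept = []      # columns actually stored in the ROM
    mapping = []   # for each original signal, index of the ROM column driving it
    for col in columns:
        for k, rep in enumerate(kept):
            candidate = merge_columns(rep, col)
            if candidate is not None:
                kept[k] = candidate   # don't-cares are now fixed for this group
                mapping.append(k)
                break
        else:
            kept.append(col)
            mapping.append(len(kept) - 1)
    return kept, mapping


columns = ["10X1", "1X01", "0110", "X11X"]
kept, mapping = compact_columns(columns)
```

In this toy example the four command signals are driven by two ROM columns. Merging against the already merged representative, rather than against the original columns, matters because compatibility through don't-cares is not transitive.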

Removing temporal redundancy - The second approach removes inter-instruction redundancy (reducing the ROM height). Removing temporal redundancy modifies the output computation function $g: X \times S \to Y$, replacing it by $g': T(X \times S) \to Y$, where $T: X \times S \to I$ is the function that associates an instruction with each controller state. This indexed relation between the states and the output signals requires an architectural modification: a new ROM is added to the controller design to implement the relation $T$. The modified controller architecture is presented in Figure 4. However, this approach may increase the ROM size in some circumstances, i.e. when instruction redundancy in the controller is low (the indexing ROM may be more expensive than the memory saved by compacting instructions).

Figure 2. Circuit composed of a datapath and its controller.

Figure 3. Command signal table for ASIP circuit (signals control datapath resources).

Figure 4. Architecture for instruction compacted controllers.
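The trade-off between the indexing ROM and the instruction ROM can be made concrete with a small sizing sketch. It assumes that T depends on the state only (linear execution path), that ROM words are fully specified (no don't-care merging), and that the indexing ROM stores ceil(log2(number of instructions)) bits per word. The function names are hypothetical.

```python
from math import ceil, log2


def instruction_compaction(rows):
    """Split a list of ROM words (one per state) into an instruction ROM
    holding the distinct words and an indexing ROM implementing T."""
    instructions = []   # distinct command words
    index_rom = []      # T: state -> instruction index
    seen = {}
    for word in rows:
        if word not in seen:
            seen[word] = len(instructions)
            instructions.append(word)
        index_rom.append(seen[word])
    return instructions, index_rom


def compacted_rom_bits(depth, width, n_instructions):
    """Total bits of the indexing ROM plus the instruction ROM."""
    index_width = max(1, ceil(log2(n_instructions)))
    return depth * index_width + n_instructions * width


instructions, index_rom = instruction_compaction(["1010", "0110", "1010", "1010"])

# Low redundancy: 4 states, 4 distinct 2-bit words.
plain_bits = 4 * 2                           # single ROM
indexed_bits = compacted_rom_bits(4, 2, 4)   # indexing ROM + instruction ROM
```

In the low-redundancy case the indexed organization needs 16 bits against 8 for the plain ROM, which is the situation the paragraph above warns about. With a depth of 92 words, 692-bit instructions and 69 distinct instructions, `compacted_rom_bits` returns 48392, which matches the corresponding Table 1 entry; the depth of 92 (one more than the 91 states listed) is inferred from the table arithmetic and is not stated in the text.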

A. Proposed Cluster-Based approach

Using a single controller for the whole datapath (Figure 5a) is a bottleneck during the optimization process: merging rows or columns can be prevented by a single bit value among hundreds. To solve this issue, the dedicated processor circuit can be divided into independent synchronous clusters. Clusters are atomic elements composed of a computation resource, its associated storage elements and the required steering logic. Each cluster has its own characteristics (e.g. computation starting and ending states, which depend on resource usage). With such a decomposition (Figure 5b), each cluster controller can be optimized without considering the others. This approach is efficient for instruction compaction.

Unfortunately, duplicating controllers in an island-style approach (Figure 5b) increases design area: it reduces the column compaction opportunities and requires an indexing ROM in each cluster. Duplicated indexing ROMs and state counters, which can be partially or fully redundant with one another, increase circuit area. An efficient datapath controller design lies between the clustered approach and the single-ROM approach (Figure 5c). The cluster-merging problem is an optimization problem whose objective function can be described as follows:

$$\min \sum_{i=1}^{N} \mathrm{Area}(c_i)$$

with N the number of controllers and Area(c_i) the memory size of the i-th controller.

To find an efficient controller solution, a weighted graph $B = (C, E)$ is built. Each vertex $c_l \in C$ represents a cluster controller. Each node $c_i$ is weighted with $v_i$, which represents the minimum memory cost of the $i$-th controller. $E \subseteq C \times C$ is the set of weighted edges $e_{l,m}$ between $c_l$ and $c_m$. Edges represent the merging possibility between the linked clusters. The weight $w_{l,m}$ associated with edge $e_{l,m}$ corresponds to the area saving (or loss) obtained by merging controller $c_l$ with $c_m$.

1) First Step: Creating the weighted graph

For each cluster $c_i$ with $i \in [1, N]$ in the architecture, we create a node $c_i$ in $B$. For each node $c_i$, we compute the minimum memory cost $v_i$ of the associated controller. The optimal weight $v_i$ is obtained after applying all the optimization schemes. There are three distinct ways to obtain the best ROM design: (1) column compaction only; (2) column compaction followed by instruction compaction; (3) the same optimizations as (2) in the opposite order. The minimum cost value is selected as the node weight $v_i$.

Once all nodes are created, we create the edges. Each graph node is linked to all the others. Each edge $e_{i,j}$ models the area saving obtained by merging controller $c_i$ with $c_j$. For each node pair $(c_i, c_j)$ we create three distinct edges, weighted using the merged controller cost obtained with each of the three optimization processes (as for the node weights). An example of such a graph is presented in Figure 6a.

2) Second Step: Removing inefficient opportunities

Once the overall weighted graph $B$ has been constructed, we first eliminate the redundant edges linking node pairs: for each pair $(c_i, c_j)$ we keep only the edge $e_{i,j}$ corresponding to the best area reduction. When area savings are equal, the column-compaction-only solution is preferred to the others for critical path reasons. The result of this step is presented in Figure 6b. Finally, the inefficient merging possibilities (those which increase the design area) are removed: each edge is evaluated, and those with $w_{i,j} < 0$ are removed. This model transformation is illustrated in Figure 6c.
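The first two steps can be sketched as follows. The sketch assumes that the edge weight is the difference between the two separate controller costs and the merged controller cost, which is one reading of "area saving (or loss)"; the strategy names, the dictionaries and the numeric costs are hypothetical.

```python
# Order matters: on an exact tie, min() returns the first entry,
# so column-only compaction is preferred (critical path reason).
STRATEGIES = ("column_only", "column_then_instruction", "instruction_then_column")


def best_strategy(costs):
    """Return the cheapest strategy for one controller or merged pair."""
    return min(STRATEGIES, key=lambda s: costs[s])


def prune_edges(node_cost, merged_costs):
    """node_cost: {cluster: v_i}, merged_costs: {(i, j): {strategy: ROM bits}}.

    Keeps one edge per pair (best strategy) and drops edges with a
    negative weight, i.e. merges that would increase the area.
    """
    edges = {}
    for (i, j), costs in merged_costs.items():
        strategy = best_strategy(costs)
        weight = node_cost[i] + node_cost[j] - costs[strategy]  # assumed definition
        if weight >= 0:
            edges[(i, j)] = (weight, strategy)
    return edges


node_cost = {"a": 100, "b": 80, "c": 60}
merged_costs = {
    ("a", "b"): {"column_only": 150, "column_then_instruction": 150,
                 "instruction_then_column": 160},
    ("a", "c"): {"column_only": 170, "column_then_instruction": 165,
                 "instruction_then_column": 168},
    ("b", "c"): {"column_only": 130, "column_then_instruction": 120,
                 "instruction_then_column": 125},
}
edges = prune_edges(node_cost, merged_costs)
```

Here the pair (a, c) is dropped because its best merged cost (165) exceeds the sum of the separate costs (160), and the tie on (a, b) is resolved in favor of column-only compaction.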

3) Third Step: Incremental graph compaction

The graph compaction problem is solved using a greedy approach to limit the algorithmic complexity. The graph $B$ is analyzed to find the maximum weight value $w_{i,j}$. This weight corresponds to the best controller merging opportunity (area saving). We merge the two controllers associated with $e_{i,j}$, removing nodes $c_i$ and $c_j$ from $B$. A new node $c_k$ is inserted in $B$. Edges linking $c_k$ to every $c_m \in B \setminus \{c_k\}$ are created and weighted as described in Step 1. The newly created edges are optimized, and the procedure is repeated until no merging opportunity remains (Figure 6d).

Figure 5. Controller approaches in DSP circuit: (a) single controller, (b) clustered controllers, (c) mixed approach.

Figure 6. Bi-partite graph models obtained during the proposed optimization process: (a) initial graph model, (b) redundant edge removing, (c) inefficient edge removing, (d) ROM factorization result.
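A minimal sketch of the greedy loop of the third step is given below. The `cost` callback stands in for the minimum ROM cost obtained with the three compaction strategies, and `toy_cost` is an invented cost model used only to make the example runnable. For simplicity the sketch re-evaluates every pair at each iteration instead of updating only the edges of the new node.

```python
from itertools import combinations


def greedy_merge(clusters, cost):
    """clusters: iterable of frozensets of resource names.
    cost: maps a frozenset of resources to its minimum ROM bits (hypothetical).
    Returns {merged cluster: cost} once no area-saving merge remains.
    """
    nodes = {c: cost(c) for c in clusters}
    while True:
        best = None
        for a, b in combinations(list(nodes), 2):
            merged = a | b
            weight = nodes[a] + nodes[b] - cost(merged)   # area saving of the merge
            if weight >= 0 and (best is None or weight > best[0]):
                best = (weight, a, b, merged)
        if best is None:
            return nodes                                   # no edge left in the graph
        _, a, b, merged = best
        del nodes[a], nodes[b]                             # remove c_i and c_j
        nodes[merged] = cost(merged)                       # insert c_k


def toy_cost(cluster):
    """Invented model: resources whose names share a first letter compact well."""
    groups = {name[0] for name in cluster}
    return 4 * len(cluster) + 10 * len(groups) ** 2


start = [frozenset({"a1"}), frozenset({"a2"}), frozenset({"b1"})]
result = greedy_merge(start, toy_cost)
total_bits = sum(result.values())
```

With this toy model the two "a" resources are merged into one controller, while "b1" keeps its own controller because any merge involving it would increase the total cost.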

Savings may be improved by starting the optimization from smaller clusters (e.g. one controller for each register and multiplexer element). Unfortunately, such an approach would drastically increase the optimization runtime.

IV. EXPERIMENTAL RESULTS

In this section, we present experimental results. The ROM-based optimization techniques have been integrated into the VHDL backend of the GraphLab HLS tool [8]. The experiments are based on commonly used digital signal processing applications. The results were obtained under different synthesis constraints, which produce different controller architectures (different numbers of resources to control and different numbers of states). This procedure provides a fair evaluation of the savings obtained. Five methodologies have been compared: (1) ROM optimized using column compaction only; (2) column compaction followed by instruction compaction; (3) the same as (2) but in the opposite order; (4) fully clustered approach, with column and instruction compaction for each controller; (5) proposed cluster-based approach.

Table 1 presents the ROM area savings obtained for the benchmark applications as a function of the number of controller states (and the number of resources).

The results highlight that the proposed technique always provides better area saving (more than 17%) than the other methodologies. However, the memory saving obtained depends on the FSM characteristics: the number of resources, the number of controller states, and the command signal values.
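The "Saving vs proposed" column of Table 1 appears to be the relative reduction in ROM bits achieved by the proposed approach with respect to each technique. The short check below recomputes it for the first configuration of the table (91 states, 1247 resources); the formula is inferred from the reported figures and the function name is hypothetical.

```python
def saving_percent(reference_bits, proposed_bits):
    """Relative ROM saving of the proposed approach, in percent (one decimal)."""
    return round(100.0 * (1.0 - proposed_bits / reference_bits), 1)


proposed_bits = 37509  # proposed approach, first configuration of Table 1
reference_bits = {
    "Without optimisation": 114724,
    "Column compaction": 63664,
    "Column and Instruction compaction": 48392,
    "Fully clustered approach": 45057,
}
savings = {name: saving_percent(bits, proposed_bits)
           for name, bits in reference_bits.items()}
```

The recomputed values agree with the percentages reported in the table for these four rows.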

V. CONCLUSION

In this paper, we have presented a new ROM area saving methodology whose objective is to use the digital signal processor circuit singularities to extend literature approaches.

The generated controller architecture is an efficient trade-off between a single controller and a fully clustered approach. The proposed methodology, efficient for complex circuits, can be integrated into high-level synthesis CAD tools or used independently on hand-made circuits. As the experimental results show, the controllers generated using our design flow have significantly less area: 22% up to 42% compared to a single-controller design and 17% up to 36% compared to a fully clustered approach.

VI. REFERENCES

[1] W. Shiue, “Power/area/delay aware FSM synthesis and optimization,” Microelectronics Journal, vol. 36, no. 2, pp. 147–162, February 2005.

[2] K. Kuusilinna, V. Lahtinen, T. Hamalainen, and J. Saarinen, “Finite state machine encoding for VHDL synthesis,” in Computers and Digital Techniques, IEEE Proceedings -, vol. 148, no. 1, pp. 23–30, 2001.

[3] P. Bomel, E. Martin, and E. Boutillon, “Synchronization processor synthesis for latency insensitive systems,” in Proceedings of the DATE conference, pp. 896–897, 2005.

[4] S. Mitra, L. Avra, and E. McCluskey, “An output encoding problem and a solution technique,” in Proceedings of the 1997 IEEE/ACM international conference on Computer-aided design, pp. 304–307, 1997.

[5] H. Selvaraj et al., “FSM implementation in embedded memory blocks of programmable logic devices using functional decomposition,” in Proceedings of the ITCC’02 Conference, p. 355, 2002.

[6] M. Rawski, H. Selvaraj, and T. Luba, “An application of functional decomposition in ROM-based FSM implementation in FPGA devices,” Journal of Systems Architecture, vol. 51, no. 6-7, pp. 424–434, 2005.

[7] I. Garcia-Varga et al., “ROM-based finite state machine implementation in low cost FPGAs,” in IEEE International Symposium on Industrial Electronics (ISIE), pp. 2342–2347, 2007.

[8] E. Casseau and B. Le Gal, “High-Level Synthesis for the Design of FPGA-based Signal Processing Systems,” in Proceedings of SAMOS’09, Samos, Greece, July 20-23, 2009.

| Application | Technique | # of states | # of resources | # of clusters | Instruction width | # of ROM bits | ROM size (kByte) | # of instructions | Saving vs proposed |
|---|---|---|---|---|---|---|---|---|---|
| 64 taps FFT | Without optimisation | 91 | 1247 | 1 | 1247 | 114724 | 14.0 | 91 | 67.3% |
| 64 taps FFT | Column compaction | 91 | 1247 | 1 | 692 | 63664 | 7.8 | 91 | 41.1% |
| 64 taps FFT | Column and Instruction compaction | 91 | 1247 | 1 | 692 | 48392 | 5.9 | 69 | 22.5% |
| 64 taps FFT | Fully clustered approach | 91 | 1247 | 61 | 1043 | 45057 | 5.5 | [6, 35] | 16.8% |
| 64 taps FFT | Proposed approach | 91 | 1247 | 13 | 852 | 37509 | 4.6 | [29, 44] | --- |
| 64 taps FFT | Without optimisation | 274 | 1069 | 1 | 1069 | 293975 | 35.9 | 274 | 80.6% |
| 64 taps FFT | Column compaction | 274 | 1069 | 1 | 551 | 151525 | 18.5 | 274 | 62.3% |
| 64 taps FFT | Fully clustered approach | 274 | 1069 | 54 | 873 | 74305 | 9.1 | [3, 65] | 23.2% |
| 64 taps FFT | Proposed approach | 274 | 1069 | 10 | 677 | 57095 | 7.0 | [53, 89] | --- |
| inverse JPEG | Without optimisation | 80 | 2805 | 1 | 2805 | 227205 | 27.7 | 80 | 77.7% |
| inverse JPEG | Column compaction | 80 | 2805 | 1 | 1221 | 98901 | 12.1 | 80 | 48.7% |
| inverse JPEG | Fully clustered approach | 80 | 2805 | 256 | 2295 | 78307 | 9.6 | [4, 30] | 35.2% |
| inverse JPEG | Proposed approach | 80 | 2805 | 37 | 1674 | 50749 | 6.2 | [13, 33] | --- |
| inverse JPEG | Without optimisation | 140 | 2472 | 1 | 2472 | 348552 | 42.5 | 140 | 69.4% |
| inverse JPEG | Column compaction | 140 | 2472 | 1 | 1110 | 156510 | 19.1 | 140 | 31.9% |
| inverse JPEG | Column and Instruction compaction | 140 | 2472 | 1 | 1110 | 139737 | 17.1 | 125 | 23.7% |
| inverse JPEG | Fully clustered approach | 140 | 2472 | 119 | 1963 | 148909 | 18.2 | [2, 76] | 28.4% |
| inverse JPEG | Proposed approach | 140 | 2472 | 4 | 1136 | 106635 | 13.0 | [91, 94] | --- |

Table 1. ROM area saving for two applications synthesized under different timing (latency) constraints
