Dynamic Memory Access Management and Address Computation Balancing for Dataflow Applications
Bertrand Le Gal, Emmanuel Casseau, Caaliph Andriamisaina, Eric Martin LESTER Laboratory - University of South Brittany
Lorient - FRANCE [email protected]
Abstract
Multimedia applications are characterized by a large number of data accesses (i.e. RAM accesses). In many digital signal-processing applications, the array access patterns are regular and periodic, so optimized pipelined memory access controllers can be generated. The computational complexity of algorithms, for instance in multidimensional signal processing applications, can be reduced by introducing execution hazards (conditional statements). In this paper we propose a behavioral synthesis design flow for data-dominated systems under timing constraints that handles both predictable and unpredictable memory access patterns in a memory sequencer. We also analyze the benefits of balancing dynamic address computations from the datapath to specialized computation units placed in the memory sequencer, thus optimizing area, latency and data locality, and thereby reducing the power consumption.
1 Introduction
Multimedia applications such as video and image processing are often characterized by a large number of data accesses. Performance depends heavily on the memory architecture (hierarchy, number of banks), together with the way data are placed and transferred. The design of the memory also has a very large impact on power consumption, which is a critical feature in embedded applications.
On the one hand, current research in multimedia applications tries to reduce the computational complexity of algorithms using ad hoc solutions composed of conditional computations. On the other hand, research on architectural implementations of these algorithms under real-time constraints tries to exploit operation parallelism, budgeting computations to respect the cadence/latency constraints.
However, optimized architectures are usually obtained for regular algorithms without execution hazards. This is also true for the memory architecture, which is more advantageous for computation-intensive applications where the memory access sequences are predictable.
For most multimedia applications the entire memory access sequence is not known a priori. This means that not all accesses to the memory are statically known during the synthesis of the application. In these cases, this prevents designers and/or High-Level Synthesis (HLS) tools from handling the repetitive sequences of the application efficiently. So, depending on the application class, the design flow and the HLS tool to be used for an area- and power-efficient design differ.
In this paper, we make the following contributions towards taking memory accesses into account in high-level synthesis. We first present the design flow based on HLS. We then extend it with a method for handling dynamic address computations, which balances these computations between the datapath and the address sequencer to improve latency and reduce data transfers from a low-power perspective. We present a new memory address sequencer architecture which can perform pipelined memory accesses for conditioned and dynamic access sequences.
The results presented in this paper support the claim that it is possible to exploit application-specific information and integrate that knowledge in a custom address generator module to reduce the overhead associated with hazardous memory accesses. This allows computation-dominated applications with not fully predictable access sequences to be optimized by well-known dataflow optimizations.
Definitions
Deterministic access sequences are access sequences where all the data accessed (for read and/or write operations) are known a priori, before or during the synthesis process. This kind of access sequence is also called static memory accesses in the literature: the memory accesses do not depend on the execution context.
Indeterminate access sequences are access sequences where not all the data accessed are known before or during the synthesis process. The memory access sequences are thus composed of static accesses together with dynamic and conditioned accesses.
Dynamic and conditioned memory accesses are computed during the execution of the application.
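The three kinds of access can be illustrated on a small kernel. The sketch below is only an illustration of the definitions: the function `kernel`, its arguments and the trace format are hypothetical and do not come from the design flow itself.

```python
# Hypothetical kernel used only to illustrate the three access kinds.
def kernel(x, idx, threshold):
    """Run a small loop and return (result, trace of accesses to x)."""
    trace = []
    acc = 0
    for i in range(4):
        # Static access: the address i is known before synthesis.
        trace.append(("static", i))
        acc += x[i]
        # Dynamic access: the address idx[i] is only known at run time.
        addr = idx[i]
        trace.append(("dynamic", addr))
        acc += x[addr]
        # Conditioned access: it only happens if a run-time test succeeds.
        if acc > threshold:
            trace.append(("conditioned", i + 4))
            acc -= x[i + 4]
    return acc, trace
```

Whatever the input data, the static part of the trace is always the address sequence 0, 1, 2, 3, whereas the dynamic and conditioned parts change with `idx` and `threshold`.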
2 Related Work
During the last decade, a lot of research has been done on datapath and memory architectural implementations, showing that memory architecture design and datapath design are interdependent. However, each one imposes constraints on the other, reducing the range of optimizations that can be applied depending on the implementation order. Two synthesis flows are usually used: (1) The memory architecture is synthesized before the datapath, which optimizes the memory cost (power consumption, area, etc.) but reduces the computation parallelism available during datapath synthesis, i.e. the application latency is increased. (2) The datapath architecture is synthesized before the memory, which optimizes application latency and the exploitation of computation parallelism. This limits the organization of the memory system and usually leads to power-costly implementations.
2.1 Area and Low-Power Memory Optimization
To tackle the complexity of low-cost (area, power) memory design, researchers have worked at different points in the design flow (with or without High-Level Synthesis). The area and power consumption of the memories are minimized before the datapath synthesis, which provides constraints for the datapath synthesis. In the same way, Wuytack [16] presents a technique that limits the memory bandwidth between datapath and memory under timing constraints. The memory architecture is defined before the detailed scheduling, so the selected architecture has to provide sufficient memory bandwidth (parallel ports) for the application to be scheduled within the timing budget. The subsequent memory allocation/assignment tasks [1, 12] have to generate a memory architecture that satisfies several accesses in parallel without data access conflicts.
2.2 Memory and High-Level Synthesis Flow
In the context of a High-Level Synthesis (HLS) assisted design flow, scheduling techniques that include memory issues can be used. Some of them actually schedule the memory accesses [9, 10]. They include precise temporal models of the memory accesses. However, they do not consider simultaneous access conflicts. Generally, memory vertices are scheduled as operative vertices by considering
Figure 1. Memory Sequencer Usage
conflicts among data accesses. This technique is used to handle off-chip memory access.
Seo [14] performs a first scheduling on a Data Flow Graph; the memory accesses are then rescheduled after the selection and memory allocation to reduce the overall memory cost. This optimization is restricted by the datapath scheduling. Park [11] and Corre [4] minimize the number of simultaneous memory accesses, taking into account which data are being accessed simultaneously, to optimize the memory access conflict graphs.
2.3 Data-Path and Memory Interfacing
In many digital signal-processing applications, the array access patterns are predictable, regular and periodic [5]. In these cases, the necessary address patterns can be efficiently generated directly by a memory address sequencer, as presented in figure 1. This sequencer can be implemented using dedicated counters, shift registers, etc., depending on timing and area constraints [7].
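As an illustration of a counter-based address sequencer for a regular pattern, the sketch below enumerates the addresses of a rectangular block stored row by row. It is a software model under simple assumptions (row-major layout, a fixed line stride); the function name and parameters are hypothetical.

```python
def block_addresses(base, stride, rows, cols):
    """Yield the addresses of a rows x cols block in a row-major memory.

    Models two nested counters: only increments and one addition per row
    are needed, so no multiplier is required in the address generator.
    """
    row_start = base
    for _ in range(rows):        # outer counter: one step per block row
        addr = row_start
        for _ in range(cols):    # inner counter: one step per element
            yield addr
            addr += 1
        row_start += stride      # jump to the start of the next row
```

For example, `list(block_addresses(10, 8, 2, 3))` gives `[10, 11, 12, 18, 19, 20]`. Because the whole sequence is fixed by `base`, `stride`, `rows` and `cols`, no address needs to be sent by the datapath.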
The sequencer generation also allows the designer to decouple the concerns of memory interfacing and static scheduling of possible memory accesses for applications with streamed data [11]. This technique is used to improve the pipeline access mode to RAM by creating specialized hardware components for generating addresses and packing and unpacking data items.
As shown in figure 1, only data are transferred from one unit to the other; these data are produced or consumed by the datapath. This reduces the number of address transfers, as the address sequences are static and known a priori. Research has demonstrated that custom address generators created from the access patterns can be optimized by algorithms to obtain optimal memory architectures. These generators can be further optimized by applying techniques that minimize bus power consumption. Various bus encoding schemes [15, 2, 8] have also been proposed to decrease the number of transitions. Moreover, Chun [3] shows the benefit of using an address sequencer for memory accesses when generating power-efficient circuits.
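To make the idea of reducing bus transitions concrete, the sketch below models a simplified form of bus-invert coding, the scheme named in the title of [15]. It is an illustration only: the decision rule ignores the state of the invert line, the bus is assumed to start at zero, and all function names are hypothetical.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def bus_invert_encode(words, width):
    """Return a list of (bus_value, invert_bit) pairs.

    A word is sent inverted when more than half of the data lines
    would otherwise toggle with respect to the previous bus value.
    """
    mask = (1 << width) - 1
    prev, encoded = 0, []
    for w in words:
        invert = hamming(prev, w) > width // 2
        value = (~w & mask) if invert else w
        encoded.append((value, int(invert)))
        prev = value
    return encoded

def bus_invert_decode(encoded, width):
    """Recover the original words from (bus_value, invert_bit) pairs."""
    mask = (1 << width) - 1
    return [(v ^ mask) if inv else v for v, inv in encoded]

def toggles(values):
    """Count line toggles for a sequence of values on a bus starting at 0."""
    total, prev = 0, 0
    for v in values:
        total += hamming(prev, v)
        prev = v
    return total

def encoded_toggles(encoded):
    """Toggles on the data lines plus toggles on the extra invert line."""
    data = toggles([v for v, _ in encoded])
    invert_line = toggles([i for _, i in encoded])
    return data + invert_line
```

On the worst-case 8-bit sequence `[0x00, 0xFF, 0x00, 0xFF]`, the plain bus toggles 24 times, while the encoded bus toggles 3 times (all on the invert line).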
2.4 Our Approach
Our approach provides a sequencer architecture handling hazardous memory accesses, i.e. dynamic address computations. We also propose a design methodology in which the memory architecture is optimized first during the datapath scheduling process and a second time during the generation of the access pattern sequencer.
3 Design Flow using High-Level Synthesis
3.1 The Design Flow
As a case study, we propose in this paper to apply a design flow using an HLS tool to generate the datapath unit.
Our starting point is a high-level description of the application and a memory mapping specifying which data are in memory (figure 2). In order to map an application onto the previously defined memory sequencer architecture, we define a graph model which takes into account the timing constraints required by dynamic address accesses. The input of our transformation method is therefore an algorithmic description using a data-flow graph that specifies the circuit functionality at the behavioural level, disregarding any potential implementation solutions and transformation methods. The definition of the memory architecture (memory mapping) can be performed in the first step of the overall design flow. Otherwise the designer may decide to let the HLS tool freely organize part (or all) of the data in memory under performance constraints. Our methodology can handle the two approaches independently in a single transformation flow.
3.1.1 Data transfer annotations
The first step of our design flow for dynamic address sequencer generation is the data-flow graph annotation: this annotation step handles the timing requirements, such as dynamic access time, for data and address transfers from one unit to another (communication unit, memory unit and datapath unit). This transformation can be guided by the memory mapping information given by the designer if the data placement has been done before synthesis in the design flow. The location annotations bring locality information on data memorization (memory, register in datapath) and operations. In this first approach, all the dynamic address computations are located in the datapath unit. The dynamic address read and write operations are constrained by precedence relations with the help of structure nodes, which handle precedence relations between accesses to the same data structure.
Figure 2. Design Flow.
3.1.2 Computation Balancing Metric
In a second step, the dynamic address computation balancing algorithm is applied to the previously annotated graph in order to move some dynamic address computations from the datapath to the memory sequencer unit. The decision metric used to select the address computations to be balanced takes into account several criteria:
1. The number of data transfers needed to calculate the address in the sequencer versus in the datapath unit,
2. The increase or decrease in the length of the critical paths, which is weighted more heavily when the design is optimized for speed,
3. The bit width of the datapath operators and registers compared to the bit width of the address computation, for area and switching optimizations,
4. The usage rate and the potential parallelism that can be exploited for computations and data transfers.
The dynamic address computation balancing script is applied in a static manner to the annotated graph. All the address computations are evaluated for balancing decisions. If a balancing decision improves the system, the graph is transformed: transfer nodes are added, others are deleted, and the locality attributes of the nodes are changed. The balancing evaluation then continues with the other nodes. An example of these transformations is shown in figure 3. This step generates an optimized graph with timing constraints that model data and dynamic address transfers.
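The balancing pass described above can be sketched as a single sweep over the address computations. The code below is a minimal illustration and not the metric used in the prototype tool: the weighted sum, the weights and every field name are assumptions, and the fourth criterion (usage rate and parallelism) is left out for brevity.

```python
from dataclasses import dataclass

@dataclass
class AddressComputation:
    name: str
    transfers_datapath: int   # transfers needed if computed in the datapath
    transfers_sequencer: int  # transfers needed if computed in the sequencer
    latency_delta: int        # critical-path change in cycles if moved (negative = shorter)
    datapath_width: int       # bit width of the datapath operators
    address_width: int        # bit width needed by the address computation
    location: str = "datapath"

def balancing_gain(c, w_transfer=1.0, w_latency=1.0, w_width=0.1):
    """Hypothetical weighted gain of moving c to the sequencer."""
    transfer_gain = c.transfers_datapath - c.transfers_sequencer
    latency_gain = -c.latency_delta
    width_gain = c.datapath_width - c.address_width
    return (w_transfer * transfer_gain
            + w_latency * latency_gain
            + w_width * width_gain)

def balance(computations, **weights):
    """Move every computation with a positive gain; return the moved names."""
    for c in computations:
        if balancing_gain(c, **weights) > 0:
            c.location = "sequencer"
    return [c.name for c in computations if c.location == "sequencer"]
```

Raising `w_latency` models a design optimized for speed, in which the critical-path criterion dominates the decision.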
3.2 Representation Model
As we target data-intensive applications, we extend a commonly used Data Flow Graph to handle the semantics of the application. This graph handles control semantics to model conditional data transfers and operations, thus allowing synthesis with mutually exclusive operations.
This graph is called Extended Data-Flow Graph (EDFG).
An extended data-flow graph (EDFG) is a finite, directed, weighted graph G = (V, E, t), where V is the vertex set of computation nodes; E ⊆ V × V is the edge set, representing precedence relations among the nodes; and t : V → Z is a function with t(v) the computation time of node v. A path in G is a connected sequence of nodes and edges.
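The definition maps directly onto a small data structure. The sketch below stores V, E and t as plain Python containers and computes the length of the longest path, which bounds the latency of any schedule. The node names and computation times in the usage note are made up for illustration, and the standard `graphlib` module (Python 3.9 or later) is assumed.

```python
from graphlib import TopologicalSorter

def critical_path_length(V, E, t):
    """Longest path in G = (V, E, t), measured as the sum of t(v) along it."""
    preds = {v: set() for v in V}
    for u, v in E:
        preds[v].add(u)          # edge (u, v): u must precede v
    finish = {}
    # static_order() yields each node after all of its predecessors.
    for v in TopologicalSorter(preds).static_order():
        start = max((finish[u] for u in preds[v]), default=0)
        finish[v] = start + t[v]
    return max(finish.values(), default=0)
```

With `V = ["A", "B", "add", "@r", "out"]`, edges `A -> add`, `B -> add`, `add -> @r`, `@r -> out` and times `add: 1`, `@r: 2` (all others 0), the function returns 3.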
The presented transformations are applied to the data-flow graph modelling the application. We first transform the application behaviour into a data-flow graph as shown in figure 3a. We use @ddressing nodes to model dynamic address accesses (@r: dynamic read and @w: dynamic write). We also define array nodes (X in the example). These nodes are necessary because we do not know a priori which data element in a vector will be read or written during a dynamic access.
Data Transfer Annotations are applied to this graph. This transformation leads to the transformed graph (figure 3b). The transfers between the datapath unit and the sequencer are modelled using Transfer Nodes Tx,y (where x is the source location and y the destination). These transformations consist in adding transfer nodes to model the transfer time between the datapath and the memory units. They also affect variable memorization: the variable nodes are split between the different units.
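The insertion of transfer nodes can be sketched as a rewrite of every edge whose endpoints are located in different units. This is a simplified illustration: the node naming scheme, the `location` dictionary and the one-cycle default transfer time are assumptions, and a value consumed by several nodes in the same remote unit gets one transfer node per edge here.

```python
def insert_transfer_nodes(edges, location, transfer_time=1):
    """Split each cross-unit edge (u, v) into u -> T[x,y] -> v.

    edges:    list of (u, v) precedence edges
    location: dict mapping each node to the unit it is placed in
    Returns (new_edges, transfer_nodes), where transfer_nodes maps each
    created transfer node to its computation time.
    """
    new_edges, transfer_nodes = [], {}
    for u, v in edges:
        src, dst = location[u], location[v]
        if src == dst:
            new_edges.append((u, v))      # same unit: no transfer needed
        else:
            t_node = f"T[{src},{dst}]:{u}->{v}"
            transfer_nodes[t_node] = transfer_time
            new_edges.append((u, t_node))
            new_edges.append((t_node, v))
    return new_edges, transfer_nodes
```

For a dynamic read `@r` placed in the sequencer and fed by an index `i` computed in the datapath, the edge `i -> @r` becomes `i -> T[datapath,sequencer] -> @r`, which makes the address transfer time visible to the scheduler.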
The Address Computation Balancing transformation can then be applied. In figure 3c, the i address computation has been balanced into the memory sequencer. Transfer nodes have been moved from i to the input nodes (A, B), transferring the addition into the sequencer datapath.
The designer can exploit this extended data-flow graph to generate the datapath architecture manually or using a high-level synthesis tool. Since dynamic address and transfer nodes model the cycle time required by the sequencer, a feasible sequencer can be generated after the datapath synthesis. Depending on the target technology, a transfer or a dynamic address access may take more than one clock cycle, so these nodes can be associated with multi-cycle operations in the Extended Data-Flow Graph.
3.2.1 Datapath and Memory Sequencer Synthesis
The designer can then implement the datapath by hand coding or by using a high-level synthesis tool, without regard to the memory implementation constraints. In this work the High-Level Synthesis tool we have used is GAUT [6]. It is an HLS tool dedicated to signal and image processing applications under real-time execution constraints. The HLS tool takes into account the time constraints required for dynamic data addressing and data transfers using the annotated Extended Data-Flow Graph model. During the synthesis process, it selects, allocates and schedules both datapath and sequencer operations at the same time, without over-constraining either unit. During the scheduling and binding process, a second optimization step can be performed for data not yet allocated in memory (data without an a priori mapping). At the end of the process, the datapath architecture is generated. In order to generate the entire sequencer, additional information, such as the access patterns including the dynamic address access time slots, is also generated.
Figure 3. Extended Data-Flow Graph.
3.2.2 Memory Sequencer Hardware Generation
The entire sequencer can be generated using the previously generated information on the access patterns. With this information the Memory Scheduler and the Memory Address Generator can be implemented using local optimizations. The internal datapath dedicated to dynamic address computation is then inserted in the sequencer architecture.
3.3 Conclusion
A prototype tool for dynamic address computation balancing has been implemented to demonstrate the feasibility of automating this important system design step in an HLS design flow.
Figure 4. Targeted Architecture
3.4 Targeted Architecture
3.4.1 Circuit Architecture
Our methodology targets custom hardware DSP dedicated to computation-intensive applications. In this case the targeted architecture is generally composed of three units:
1. The processing unit, containing the datapath and a controller, which performs the computations of the application,
2. The memory unit, which manages the pipelined accesses to memories, using preventive read accesses if necessary. It is composed of a memory sequencer and the memory banks (which can have different clocks and latencies to reduce the power consumption),
3. The communication unit, which sends and receives data to/from the rest of the system.
3.4.2 The Memory Sequencer
The memory unit is composed of the memory banks and a memory access sequencer (figure 5). The sequencer is composed of 5 different units: a memory access scheduler, a dynamic address controller, an address generator, an address translation table (logical address to physical address) and an internal datapath for dynamic address computations.
Depending on the targeted memory bank and on whether a dynamic address access is used in the current cycle, the dynamic access controller routes the correct command (read/write) and the physical address to the right memory bank. For conditional memory accesses, the read/write access decisions are taken in the scheduler using the condition result stored in the state registers (this value comes from the datapath over the data buses).
This sequencer architecture allows dynamic and conditional memory accesses to be realized using a sequencer approach. It also allows the designer to freely bind the split data vectors of the application to different memory banks in order to exploit memory access parallelism [6].
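The behaviour of the sequencer during one cycle can be modelled in a few lines. The sketch below is a behavioural illustration only: the slot dictionary format, the representation of the translation table as a mapping from logical address to (bank, physical address), and the function name are all assumptions made for the example.

```python
def sequencer_cycle(slot, state_registers, translation_table, dynamic_address=None):
    """Return (bank, physical_address, command) for one cycle, or None.

    slot: {"command": "read" | "write", "dynamic": bool,
           "address": static logical address (used when not dynamic),
           "condition": optional name of a state register}
    """
    # Scheduler: a conditional access is issued only if its condition,
    # previously received from the datapath, is true.
    condition = slot.get("condition")
    if condition is not None and not state_registers.get(condition, False):
        return None
    # Dynamic access controller: choose between the statically generated
    # address and the dynamically computed one.
    logical = dynamic_address if slot["dynamic"] else slot["address"]
    # Translation table: logical address to (bank, physical address).
    bank, physical = translation_table[logical]
    return bank, physical, slot["command"]
```

A static slot uses the address produced by the address generator, a dynamic slot uses the value computed in the sequencer datapath or received from the datapath unit, and a conditional slot is skipped when its state register is false.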
Figure 5. Memory Unit Architecture with the Dynamic Access Sequencer
The sequencer datapath is dedicated to dynamic address computations (i.e. logical and arithmetic operations such as increments, shifts, etc.). Its operators are optimized for address computation compared with those of the datapath unit, because the bit width of address operators is usually smaller. The address computation unit is composed of operators and registers. This dedicated hardware is shared between address computations during the execution of the application.
Localizing address computations in the memory sequencer provides a significant latency gain by avoiding address transfers between the units. The reduction in address traffic between the datapath and the sequencer also reduces line switching, and therefore the power consumption. Furthermore, this approach decreases the bus requirement between the datapath and the sequencer.
4 Experiments
Current research in multimedia applications tries to reduce the computational complexity of algorithms using ad hoc solutions composed of conditional computations. These algorithmic improvements may cause execution hazards to appear in the memory access sequences. This kind of technique is used in block matching, where the computational complexity of the Full Search algorithm is reduced by a Three Step Search algorithm [13]; conditional motion vector selections are then involved.
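To show where the dynamic accesses come from, the sketch below implements the classic textbook form of a three-step search, which may differ in detail from the variant of [13]. The function names, the frame representation (lists of rows) and the sum-of-absolute-differences cost are assumptions made for the illustration.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of cur at
    (bx, by) and the block of ref displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total

def three_step_search(cur, ref, bx, by, n, step=4):
    """Return (motion_vector, cost) found with step sizes 4, 2, 1."""
    height, width = len(ref), len(ref[0])
    best = (0, 0)
    best_cost = sad(cur, ref, bx, by, 0, 0, n)
    while step >= 1:
        cx, cy = best            # the centre depends on the previous step
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                mx, my = cx + dx, cy + dy
                inside = (0 <= bx + mx and bx + mx + n <= width
                          and 0 <= by + my and by + my + n <= height)
                if not inside:
                    continue
                # The addresses read from ref here are only known at run time.
                cost = sad(cur, ref, bx, by, mx, my, n)
                if cost < best_cost:
                    best, best_cost = (mx, my), cost
        step //= 2
    return best, best_cost
```

The accesses to the current block are static, whereas from the second step onwards the accesses to the reference window depend on the best candidate of the previous step, which is what makes the overall access sequence indeterminate.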
We have applied our sequencer design approach to the Three Step Search algorithm [13], which has an indeterminate access sequence. We have developed 3 architectures. The first one is based on a control-like architecture without a memory sequencer; we have considered the address transfers for the reference macro block plus the dynamic block access. The second architecture is based on our first memory sequencer architecture. We have used the application knowledge for the first 5 block transfers, which are known a priori as the reference block. The third architecture we have experimented with is based on the second, optimized sequencer architecture. In this case dynamic address computations are made in the sequencer datapath. Only the base addresses for the dynamic macro block are thus transferred.

| Architecture | MB size | Window size | Operations (DPath) | Operations (Seq.) | Transfers (data+adr.) | Bus | Memorizations (DPath) | Memorizations (Seq.) | Registers |
|---|---|---|---|---|---|---|---|---|---|
| Control | 8x8 | 24x24 | 8224 | 0 | 6349 | 75 | 13024 | 0 | 195 |
| Sequencer 1 | 8x8 | 24x24 | 8224 | 0 | 3189 | 37 | 9995 | 3190 | 126 |
| Sequencer 2 | 8x8 | 24x24 | 6640 | 1584 | 1725 | 22 | 8403 | 3190 | 115 |
| Control | 16x16 | 48x48 | 32344 | 0 | 25483 | 107 | 51497 | 0 | 284 |
| Sequencer 1 | 16x16 | 48x48 | 32344 | 0 | 13767 | 51 | 39293 | 12768 | 176 |
| Sequencer 2 | 16x16 | 48x48 | 25976 | 6368 | 6895 | 29 | 32909 | 12768 | 155 |

Table 1. Experimental results for a 3 Step Search Application
Results are shown in table 1. An approach using a sequencer for deterministic memory accesses may be used with data-dominated applications containing few control operations. The results show that the number of address transfers is reduced, compared to a conventional approach, when computation units in the sequencer are used to compute the addresses. This is useful with pipelined architectures when address transfers take more than one clock cycle. Moreover, the number of memorizations (and of the associated registers) is also decreased.
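The relative savings can be read directly from the Transfers column of Table 1. The short script below only restates those figures as percentages with respect to the Control architecture; the dictionary layout and the function name are illustrative.

```python
# Transfers (data + address) taken from Table 1.
transfers = {
    ("8x8", "Control"): 6349,
    ("8x8", "Sequencer 1"): 3189,
    ("8x8", "Sequencer 2"): 1725,
    ("16x16", "Control"): 25483,
    ("16x16", "Sequencer 1"): 13767,
    ("16x16", "Sequencer 2"): 6895,
}

def transfer_reduction(mb_size, architecture):
    """Reduction in transfers, in percent, relative to the Control architecture."""
    base = transfers[(mb_size, "Control")]
    saved = base - transfers[(mb_size, architecture)]
    return round(100 * saved / base, 1)
```

For the 8x8 macro block, the first sequencer removes about 49.8% of the transfers and the second about 72.8%; for the 16x16 macro block the figures are about 46.0% and 72.9%.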
5 Conclusion
In this paper we have presented a new design flow based on High-Level Synthesis and our new sequencer architecture. We show that the design flow allows the designer to freely optimize each unit of the design using a joint memory sequencer and datapath synthesis. This is made possible by the graph model defined in the paper, which allows memory operations (transfer, dynamic addressing) to be expressed in the same way as conditional ones.
References
[1] F. Balasa, F. Catthoor, and H. DeMan. Dataflow-driven memory allocation for multi-dimensional signal processing systems. In Proc. of ICCAD, pages 31–35, November 1994.
[2] W.-C. Cheng and M. Pedram. Power-optimal encoding for a DRAM address bus. IEEE Trans. Very Large Scale Integr. Syst., 10(2):109–118, 2002.
[3] T. K. Chun-Gi Lyuh and K.-W. Kim. Coupling-aware high-level interconnect synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23(1):157–164, January 2004.
[4] G. Corre, E. Senn, P. Bomel, N. Julien, and E. Martin. Memory accesses management during high level synthesis. In Proc. of CODES+ISSS ’04, pages 42–47. ACM Press, 2004.
[5] P. D. D. Grant and I. Finlay. Synthesis of address generators. In Proceedings of ICCAD 89, pages 116–119, 1989.
[6] N. J. Gwenole Corre, Eric Senn and E. Martin. A memory aware behavioral synthesis tool for real-time VLSI circuits. In Proceedings of GLSVLSI, pages 82–85, April 2004.
[7] S. Hettiaratchi, P. Cheung, and T. Clarke. Performance-area trade-off of address generators for address decoder-decoupled memory. In Proc. of DATE, page 902, 2002.
[8] T. Lv, J. Henkel, H. Lekatsas, and W. Wolf. A dictionary-based en/decoding scheme for low-power data buses. IEEE Trans. Very Large Scale Integr. Syst., 11(5):943–951, 2003.
[9] E. S. N. Passos and L.-F. Chao. Multi-dimensional interleaving for time-and-memory design optimization. In Proc. of ICCD’95, pages 440–445, October 1995.
[10] A. Nicolau and S. Novack. Trailblazing a hierarchical approach to percolation scheduling. In Proc. ICPP’93, pages 120–124, 1993.
[11] J. Park and P. C. Diniz. Synthesis of pipelined memory access controllers for streamed data applications on FPGA-based computing engines. In Proc. of ISSS, pages 221–226, 2001.
[12] S. Ramprasad, N. R. Shanbhag, and I. N. Hajj. A coding framework for low-power address and data busses. IEEE Trans. Very Large Scale Integr. Syst., 7(2):212–221, 1999.
[13] B. R. Li and M. L. Liou. A new three-step search algorithm for block motion estimation. IEEE Trans. on Circuits and Systems for Video Technology, 4(4):438–442, August 1994.
[14] J. Seo, T. Kim, and P. R. Panda. Memory allocation and mapping in high-level synthesis: an integrated approach. IEEE Trans. Very Large Scale Integr. Syst., 11(5):928–938, 2003.
[15] M. R. Stan and W. P. Burleson. Bus-invert coding for low-power I/O. IEEE Trans. Very Large Scale Integr. Syst., 3(1):49–58, 1995.
[16] S. Wuytack, F. Catthoor, G. de Jong, B. Lin, and H. de Man. Flow graph balancing for minimizing the required memory bandwidth. In ISSS ’96: Proc. of the International Symposium on System Synthesis, page 127. IEEE Computer Society, 1996.