Dynamic Memory Access Management and Address Computation Balancing for Dataflow Applications


Bertrand Le Gal, Emmanuel Casseau, Caaliph Andriamisaina, Eric Martin

LESTER Laboratory - University of South Brittany

Lorient - FRANCE [email protected]

Abstract

Multimedia applications are characterized by a large number of data accesses (i.e. RAM accesses). In many digital signal-processing applications, the array access patterns are regular and periodic, so optimized pipelined memory access controllers can be generated. However, reducing the computational complexity of algorithms, for instance in multidimensional signal processing applications, often relies on conditional statements, which introduce execution hazards. In this paper we propose a behavioral synthesis design flow for data-dominated systems under timing constraints that handles both predictable and unpredictable memory access patterns in a memory sequencer. We also analyze the benefits of balancing dynamic address computations from the datapath to specialized computation units placed in the memory sequencer, thus optimizing area, latency and data locality, and thereby reducing the power consumption.

1 Introduction

Multimedia applications such as video and image processing are often characterized by a large number of data accesses. Performance depends heavily on the memory architecture (hierarchy, number of banks) and on the way data are placed and transferred. The design of the memory also has a strong impact on power consumption, which is a critical feature in embedded applications.

On the one hand, current research in multimedia applications tries to reduce the computational complexity of algorithms using ad hoc solutions based on conditional computations. On the other hand, research on architectural implementations of these algorithms under real-time constraints tries to exploit operation parallelism, budgeting computations to meet the throughput and latency constraints.

However, optimized architectures are usually obtained for regular algorithms without execution hazards. This is also true for the memory architecture, which is most advantageous for computation-intensive applications where the memory access sequences are predictable.

For most multimedia applications the entire memory access sequence is not known a priori: not all memory accesses are statically known when the application is synthesized. In these cases, designers and High-Level Synthesis (HLS) tools cannot efficiently handle the repetitive sequences of the application. Consequently, the design flow and the HLS tool to be used for an area- and power-efficient design depend on the application class.

In this paper, we make the following contributions towards taking memory accesses into account in high-level synthesis. We first present the design flow based on HLS.

We then extend it with a method for handling dynamic address computations, which balances these computations between the datapath and the address sequencer to improve latency and reduce data transfers in a low-power perspective. We present a new memory address sequencer architecture which can perform pipelined memory accesses for conditioned and dynamic access sequences.

The results presented in this paper support the claim that it is possible to exploit application-specific information and integrate that knowledge in a custom address generator module to reduce the overhead associated with hazardous memory accesses. This allows computation-dominated applications whose access sequences are not fully predictable to be optimized by well-known dataflow optimizations.

Definitions

Deterministic access sequences are access sequences where all the data (for read and/or write operations) are known a priori, before or during the synthesis process. This kind of access sequence is also called static memory accesses in the literature: the memory accesses do not depend on the execution context.

Indeterminate access sequences are access sequences where not all the data are known before or during the synthesis process. The memory access sequences are thus composed of static accesses together with dynamic and conditioned accesses.

Dynamic and conditioned memory accesses are computed during the execution of the application.
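As an illustration of these definitions, the following Python sketch contrasts the three kinds of access by returning the address trace each one produces. The function names and the sample data are assumptions made for this example only; they are not taken from the application studied in this paper.

```python
def static_trace(n):
    """Static accesses: addresses 0..n-1 are fixed before execution."""
    return list(range(n))


def dynamic_trace(index_table):
    """Dynamic accesses: each address is a value only available at run time
    (here it is read from a hypothetical table of computed indices)."""
    return [addr for addr in index_table]


def conditioned_trace(x, threshold):
    """Conditioned accesses: element i + 1 is read only when the run-time
    test x[i] > threshold succeeds, so the trace depends on the data."""
    trace = []
    for i in range(len(x) - 1):
        trace.append(i)            # this read always happens
        if x[i] > threshold:
            trace.append(i + 1)    # this read depends on a condition
    return trace
```

Only the first trace can be generated entirely at synthesis time; the other two need information produced while the application runs.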

2 Related Work

During the last decade, a lot of research has been done on datapath and memory architectural implementations, showing that memory architecture design and datapath design are interdependent. Each one imposes constraints on the other, which narrows the range of optimizations that can be applied depending on the implementation order. Two synthesis flows are usually used: (1) The memory architecture is synthesized before the datapath. This optimizes the memory cost (power consumption, area, etc.) but reduces the computation parallelism available during datapath synthesis, i.e. the application latency is increased. (2) The datapath architecture is synthesized before the memory. This optimizes application latency and the exploitation of computation parallelism, but limits the organization of the memory system and usually leads to power-costly implementations.

2.1 Area and Low-Power Memory Optimization

To tackle the complexity of low-cost (area, power) memory design, researchers have worked at different points in the design flow (with or without High-Level Synthesis). The area and power consumption of the memories are minimized before the datapath synthesis, which then provides constraints for the datapath synthesis. In the same way, Wuytack [16] presents a technique that limits the memory bandwidth required between datapath and memory under timing constraints. The memory architecture is defined before the detailed scheduling, so the selected architecture has to provide sufficient memory bandwidth (parallel ports) for the application to be scheduled within the timing budget. The subsequent memory allocation/assignment tasks [1, 12] have to generate a memory architecture that satisfies several accesses in parallel without data access conflicts.

2.2 Memory and High-Level Synthesis Flow

In the context of a High-Level Synthesis (HLS) assisted design flow, scheduling techniques that include memory issues can be used. Some of them actually schedule the memory accesses [9, 10] and include precise temporal models of these accesses. However, they do not consider simultaneous access conflicts. Generally, memory vertices are scheduled as operative vertices by considering conflicts among data accesses. This technique is used to handle off-chip memory access.

Figure 1. Memory Sequencer Usage

Seo [14] performed a first scheduling on a Data Flow Graph; the memory accesses are then rescheduled after the selection and memory allocation to reduce the overall memory cost. This optimization is restricted by the datapath scheduling. Park [11] and Corre [4] minimize the number of simultaneous memory accesses taking into account which data is being accessed simultaneously to optimize the memory access conflict graphs.

2.3 Data-Path and Memory Interfacing

In many digital signal-processing applications, the array access patterns are predictable, regular and periodic [5]. In these cases, the necessary address patterns can be efficiently generated directly by a memory address sequencer, as presented in figure 1. This sequencer can be implemented using dedicated counters, shift registers, etc., depending on timing and area constraints [7].
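To make the idea of a counter-based address sequencer concrete, here is a minimal Python model, given as a sketch under stated assumptions rather than as the architecture of [5] or [7]. It assumes a row-major memory layout and a rectangular block scan; the function name `block_addresses` and its parameters are hypothetical. Two accumulators stand in for the hardware counters, so no multiplication is needed.

```python
def block_addresses(base, row_stride, rows, cols):
    """Yield the addresses of a rows x cols block stored row-major.

    row_base plays the role of a row counter that advances by row_stride;
    addr plays the role of a column counter that advances by one.
    """
    row_base = base
    for _ in range(rows):
        addr = row_base
        for _ in range(cols):
            yield addr
            addr += 1              # column counter increment
        row_base += row_stride     # row counter increment
```

Because every value produced depends only on `base`, `row_stride`, `rows` and `cols`, the whole sequence is known before execution, which is what allows such a generator to be fixed at synthesis time.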

The sequencer generation also allows the designer to decouple the concerns of memory interfacing and static scheduling of possible memory accesses for applications with streamed data [11]. This technique is used to improve the pipeline access mode to RAM by creating specialized hardware components for generating addresses and packing and unpacking data items.

As shown in figure 1, only data are transferred from one unit to the other; these data are produced or consumed by the datapath. Address transfers are avoided because the address sequences are static and known a priori. Research has demonstrated that custom address generators created from the access patterns can be optimized algorithmically to obtain optimal memory architectures. These generators can be further optimized by applying techniques that minimize bus power consumption. Various bus encoding schemes [15, 2, 8] have also been proposed to decrease the number of transitions. Moreover, Chun [3] shows the benefit of using an address sequencer for memory accesses when generating power-efficient circuits.


2.4 Our Approach

Our approach provides a sequencer architecture handling hazardous memory accesses, i.e. dynamic address computations. We also propose a design methodology in which the memory architecture is optimized first during the datapath scheduling process and a second time during the generation of the access pattern sequencer.

3 Design Flow using High-Level Synthesis

3.1 The Design Flow

As a case study, we propose in this paper to apply a design flow using an HLS tool to generate the datapath unit.

Our starting point is a high-level description of the application and a memory mapping specifying which data is in memory (figure 2). In order to map an application onto the previously defined memory sequencer architecture, we define a graph model which takes into account the timing constraints due to dynamic address accesses. Therefore, the input of our transformation method is an algorithmic description using a data-flow graph that specifies the circuit functionality at the behavioural level, disregarding any potential implementation solutions and transformation methods. The definition of the memory architecture (memory mapping) can be performed in the first step of the overall design flow. Otherwise the designer may decide to let the HLS tool freely organize part or all of the data in memory under performance constraints. Our methodology can handle both approaches in a single transformation flow.

3.1.1 Data transfer annotations

The first step of our design flow for dynamic address sequencer generation is the data-flow graph annotation. This annotation step handles the timing requirements, such as dynamic access time, for data and address transfers from one unit to another (communication unit, memory unit and datapath unit). This transformation can be guided by the memory mapping information given by the designer if the data placement has been done before the synthesis in the design flow. The location annotations bring locality information on data storage (memory, register in the datapath) and on operations. In this first approach, all the dynamic address computations are located in the datapath unit. The dynamic address read and write operations are constrained by precedence relations with the help of structure nodes, which handle the precedence relations between accesses to the same data structure.

Figure 2. Design Flow.

3.1.2 Computation Balancing Metric

In a second step, the dynamic address computation balancing algorithm is applied to the previously annotated graph in order to move some dynamic address computations from the datapath to the memory sequencer unit. The decision metric used to select the address computations to be balanced takes into account several criteria:

1. The number of data transfers needed to calculate the address in the sequencer versus the datapath unit,

2. The increase or decrease in the length of the critical paths, which is given a higher weight when the design is optimized for speed,

3. The bit width of the datapath operators and registers compared to the bit width of the address computation, for area and switching optimizations,

4. The usage rate and the potential parallelism that can be exploited for computations and data transfers.

The dynamic address computation balancing script is applied statically to the annotated graph. All the address computations are evaluated for balancing. If a balancing decision improves the system, the graph is transformed: transfer nodes are added, others are deleted and the locality attributes of the nodes are changed. The evaluation then continues with the remaining nodes. An example of these transformations is shown in figure 3. This step generates an optimized graph with timing constraints that model data and dynamic address transfers.
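The paper does not give a formula for the decision metric, so the following Python sketch is only one possible way to combine the first three criteria into a single score. The weighted sum, the weights, the `Candidate` fields and the function names are all assumptions introduced for illustration; the fourth criterion (usage rate and parallelism) is left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One dynamic address computation considered for balancing (hypothetical model)."""
    name: str
    transfers_in_datapath: int    # transfers needed if computed in the datapath
    transfers_in_sequencer: int   # transfers needed if computed in the sequencer
    critical_path_delta: int      # cycles added (+) or removed (-) when moved
    datapath_bits: int            # bit width of the datapath operators
    address_bits: int             # bit width needed by the address computation


def balancing_gain(c, w_transfer=1.0, w_latency=1.0, w_width=0.1):
    """Positive when moving the computation to the sequencer looks beneficial.
    The linear weighting is an assumption, not the metric used by the authors."""
    transfer_gain = c.transfers_in_datapath - c.transfers_in_sequencer
    latency_gain = -c.critical_path_delta
    width_gain = c.datapath_bits - c.address_bits
    return w_transfer * transfer_gain + w_latency * latency_gain + w_width * width_gain


def select_for_sequencer(candidates, **weights):
    """Evaluate every candidate once and keep those with a positive gain."""
    return [c.name for c in candidates if balancing_gain(c, **weights) > 0]
```

Raising `w_latency` in this sketch mimics a design optimized for speed, where a change in critical path length weighs more heavily on the decision.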


3.2 Representation Model

As we target data-intensive applications, we extend a commonly used Data Flow Graph to handle the semantics of the application. This graph handles control semantics to model conditional data transfers and operations, thus allowing synthesis with mutually exclusive operations.

This graph is called Extended Data-Flow Graph (EDFG).

An extended data-flow graph (EDFG) is a finite, directed, weighted graph $G = (V, E, t)$, where $V$ is the vertex set of computation nodes; $E \subseteq V \times V$ is the edge set, representing precedence relations among the nodes; and $t : V \to \mathbb{Z}$ is a function with $t(v)$ the computation time of node $v$. A path in $G$ is a connected sequence of nodes and edges.
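The definition above maps directly onto a small data structure. The Python sketch below stores the node times t(v) and the precedence edges, and computes the longest path by summed node time. The class name, the method names and the assumption that the graph is acyclic are ours; the control semantics that make the graph "extended" are not modelled here.

```python
from dataclasses import dataclass, field


@dataclass
class EDFG:
    """Minimal model of G = (V, E, t); V is the key set of t."""
    t: dict = field(default_factory=dict)     # node -> computation time t(v)
    edges: set = field(default_factory=set)   # (u, v) precedence pairs

    def add_node(self, v, time):
        self.t[v] = time

    def add_edge(self, u, v):
        self.edges.add((u, v))

    def successors(self, u):
        return [b for (a, b) in self.edges if a == u]

    def critical_path(self):
        """Length of the longest path, as a sum of node times.
        Assumes the graph is acyclic."""
        memo = {}

        def longest_from(v):
            if v not in memo:
                tail = max((longest_from(s) for s in self.successors(v)), default=0)
                memo[v] = self.t[v] + tail
            return memo[v]

        return max((longest_from(v) for v in self.t), default=0)
```

For instance, two inputs feeding a one-cycle addition followed by a two-cycle dynamic read give a critical path of three cycles in this model.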

The presented transformations are applied to the data-flow graph modelling the application. We first transform the application behaviour into a data-flow graph, as shown in figure 3a. We use addressing nodes (@) to model dynamic address accesses (@r: dynamic read and @w: dynamic write).

We also define array nodes (X in the example). These nodes are necessary because we do not know a priori which data element in a vector will be read or written during a dynamic access.

Data Transfer Annotations are applied on this graph.

This transformation leads to the transformed graph (figure 3b). The transfers between the datapath unit and the sequencer are modelled using transfer nodes Tx,y (where x is the source location and y is the destination).

These transformations consist in adding transfer nodes to model the transfer time between the datapath and the memory units. They also affect variable storage: the variable nodes are split between the different units.
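The insertion of transfer nodes can be sketched as a pass over the edges of the graph: whenever an edge connects two nodes located in different units, a transfer node is placed between them. The Python function below is a simplified illustration of that idea, not the annotation algorithm of the tool. The edge-list representation, the node naming scheme and the "bus" location given to transfer nodes are assumptions.

```python
def insert_transfer_nodes(edges, location):
    """Insert a transfer node on every edge that crosses a unit boundary.

    edges:    list of (u, v) precedence pairs
    location: dict mapping each node to the unit that holds it
    Returns the new edge list and the updated location map.
    """
    new_edges = []
    new_location = dict(location)
    for u, v in edges:
        src, dst = location[u], location[v]
        if src == dst:
            new_edges.append((u, v))      # same unit: no transfer needed
            continue
        t = f"T[{src},{dst}]({u})"        # hypothetical name for T_{x,y}
        new_location[t] = "bus"           # assumed location label
        if (u, t) not in new_edges:       # one transfer per value and destination
            new_edges.append((u, t))
        new_edges.append((t, v))
    return new_edges, new_location
```

In this sketch a value consumed several times in the same destination unit is transferred only once, which is a design choice of the example rather than a statement about the original tool.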

The Address Computation Balancing transformation can then be applied. In figure 3c, the computation of address i has been balanced into the memory unit. Transfer nodes have been moved from i to the input nodes (A, B), transferring the addition into the sequencer datapath.

The designer can exploit this extended data-flow graph to generate his datapath architecture manually or using a high-level synthesis tool. Since dynamic address and transfer nodes model necessary cycle time for the sequencer, this allows the generation of a feasible sequencer after the datapath synthesis. Depending on the target technology, the transfer cycle count and the dynamic address access process may take more than one clock cycle, so these nodes can be associated with multi-cycle operations in the Extended Data Flow Graph.

3.2.1 Datapath and Memory Sequencer Synthesis

The designer can then implement the datapath by hand coding or using a high-level synthesis tool, without regard to the memory implementation constraints. In this work the High-Level Synthesis tool we have used is GAUT [6]. It is an HLS tool dedicated to signal and image processing applications under real-time execution constraints. The HLS tool takes into account the time constraints required for dynamic data addressing and data transfers using the annotated Extended Data-Flow Graph model. During the synthesis process, it selects, allocates and schedules both datapath and sequencer operations at the same time without over-constraining either unit. During the scheduling and binding process, a second optimization step can be performed for unallocated data in memory (data without an a priori mapping). At the end of the process, the datapath architecture is generated. In order to generate the entire sequencer, other information such as the access patterns, including the dynamic address access time slots, is generated.

Figure 3. Extended Data-Flow Graph.

3.2.2 Memory Sequencer Hardware Generation

The entire sequencer can be generated using the previously generated information on the access patterns. With this information the Memory Scheduler and the Memory Address Generator can be implemented using local optimizations.

The internal datapath dedicated for dynamic address computation is then inserted in the sequencer architecture.

3.3 Conclusion

A prototype tool for dynamic address computation balancing has been implemented to demonstrate the feasibility of automating this important system design step in an HLS design flow.


Figure 4. Targeted Architecture

3.4 Targeted Architecture

3.4.1 Circuit Architecture

Our methodology targets custom hardware DSPs dedicated to computation-intensive applications. In this case the targeted architecture is generally composed of three units:

1. The processing unit containing the datapath and a controller which performs the computations of the application,

2. The memory unit, which manages the pipelined accesses to memories, using preventive read accesses if necessary. It is composed of a memory sequencer and the memory banks (which can have different clocks and latencies to reduce the power consumption).

3. The communication unit, which sends and receives data from/to the rest of the system.

3.4.2 The Memory Sequencer

The memory unit is composed of the memory banks and a memory access sequencer (figure 5). It is composed of 5 different units: a memory access scheduler, a dynamic address controller, an address generator, an address translation table (logical address to physical address) and an internal datapath for dynamic address computations.

Depending on the targeted memory bank and the dynamic address access usage in the current cycle, the dynamic access controller routes the correct commands (read/write) and the physical address to the right memory bank. For conditional memory accesses, the read/write access decisions are taken in the scheduler using the condition result memorized in the state registers (this value is coming from the datapath using the data buses).
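The routing role described above can be modelled in a few lines of Python. The sketch below is a behavioural illustration only: the class name `SequencerModel`, the dictionary used as translation table and the named condition flags are assumptions, and timing, pipelining and the address generator itself are not represented.

```python
class SequencerModel:
    """Behavioural model of address translation and conditional access issue."""

    def __init__(self, translation_table):
        # logical address -> (bank index, physical address); hypothetical layout
        self.table = translation_table
        # condition results received from the datapath over the data buses
        self.state = {}

    def set_condition(self, name, value):
        """Store a condition result in the state registers."""
        self.state[name] = bool(value)

    def issue(self, logical_addr, command, condition=None):
        """Return (bank, command, physical address), or None when the access
        is conditional and its condition is false or not yet received."""
        if condition is not None and not self.state.get(condition, False):
            return None
        bank, physical = self.table[logical_addr]
        return (bank, command, physical)
```

A typical use would store a comparison result with `set_condition` and then call `issue` with that condition name, so that the write or read reaches its bank only when the stored flag is true.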

This sequencer architecture allows dynamic and conditional memory accesses to be realized using a sequencer approach. This also allows the designer to bind freely the split data vectors of the application in different memory banks for memory access parallelism exploitation [6].

Figure 5. Memory Unit Architecture with the Dynamic Access Sequencer

The sequencer datapath is dedicated to dynamic address computations (i.e. logical and arithmetic operations such as increments, shifts, etc.). Its operators are optimized for address computation compared with those of the datapath unit, because the bit width of address operators is usually smaller. The address computation unit is composed of operators and registers. This dedicated hardware is shared between address computations during the execution of the application.

Localizing address computations in the memory sequencer provides a significant latency gain by avoiding address transfers between the units. The reduction in address traffic between the datapath and the sequencer also reduces line switching, and hence power consumption. Furthermore, this approach decreases the bus requirements between the datapath and the sequencer.

4 Experiments

Current research in multimedia applications tries to reduce the computational complexity of algorithms using ad hoc solutions based on conditional computations. These algorithmic improvements may cause execution hazards to appear in the memory access sequences. This kind of technique is used in block matching, where the computational complexity of the Full Search algorithm is reduced by the Three Step Search algorithm [13]; conditional motion vector selections are then involved.
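To show where the data-dependent accesses come from, the following Python sketch implements the classic three-step search with a sum of absolute differences (SAD) cost. It is given for illustration and may differ from the exact variant of [13] used in the experiments; the function names, the frame representation as lists of rows and the handling of candidates outside the frame are assumptions.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n current block at
    (bx, by) and the reference block displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def three_step_search(cur, ref, bx, by, n, step=4):
    """Return the motion vector (dx, dy) found by a 4-2-1 three-step search."""
    height, width = len(ref), len(ref[0])
    best_dx, best_dy = 0, 0
    while step >= 1:
        cx, cy = best_dx, best_dy          # centre chosen at the previous step
        best_cost = None
        for oy in (-step, 0, step):
            for ox in (-step, 0, step):
                dx, dy = cx + ox, cy + oy
                inside = (0 <= bx + dx and bx + dx + n <= width
                          and 0 <= by + dy and by + dy + n <= height)
                if not inside:
                    continue               # skip candidates outside the frame
                cost = sad(cur, ref, bx, by, dx, dy, n)
                if best_cost is None or cost < best_cost:
                    best_cost, best_dx, best_dy = cost, dx, dy
        step //= 2
    return best_dx, best_dy
```

The reference addresses read at each step depend on `cx` and `cy`, which are only known once the previous step has finished. That dependency is what makes part of the access sequence dynamic, while the accesses to the current block stay static. Note that the search is a heuristic and does not always find the displacement with the globally smallest SAD.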

We have applied our sequencer design approach to the Three Step Search algorithm [13], which has an indeterminate access sequence. We have developed three architectures. The first one is based on a control-like architecture without a memory sequencer; we have considered the address transfers for the reference macroblock plus the dynamic block accesses. The second architecture is based on our first memory sequencer architecture. We have used application knowledge for the first 5 block transfers, which are known a priori as the reference block. The third architecture we have experimented with is based on the second, optimized sequencer architecture. In this case dynamic address computations are made in the sequencer datapath. Only the base addresses for the dynamic macroblock are thus transferred.

Architecture  MB size  Window size  Operations      Transfers    Bus  Memorizations   Registers
                                    (DPath / Seq.)  (data+adr.)       (DPath / Seq.)
Control       8x8      24x24        8224 / 0        6349         75   13024 / 0       195
Sequencer 1   8x8      24x24        8224 / 0        3189         37   9995 / 3190     126
Sequencer 2   8x8      24x24        6640 / 1584     1725         22   8403 / 3190     115
Control       16x16    48x48        32344 / 0       25483        107  51497 / 0       284
Sequencer 1   16x16    48x48        32344 / 0       13767        51   39293 / 12768   176
Sequencer 2   16x16    48x48        25976 / 6368    6895         29   32909 / 12768   155

Table 1. Experimental results for a 3 Step Search Application

Results are shown in table 1. An approach using a sequencer for deterministic memory accesses may be used with data-dominated applications containing few control operations. The results show that the number of address transfers is reduced when computation units in the sequencer compute the addresses, compared with a conventional approach. This is useful with pipelined architectures when address transfers take more than one clock cycle. Moreover, the number of memorizations (and of the associated registers) is also decreased.
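The size of the reduction can be read off Table 1 with a short calculation. The Python snippet below takes the transfer counts (data plus addresses) reported for each architecture and expresses them as a percentage reduction relative to the control architecture; the dictionary layout and the helper name are ours, and the figures assume the Transfers column is read as the single value following the two operation counts in each row.

```python
# Transfer counts (data + addresses) from Table 1, per macroblock size.
transfers = {
    "8x8": {"Control": 6349, "Sequencer 1": 3189, "Sequencer 2": 1725},
    "16x16": {"Control": 25483, "Sequencer 1": 13767, "Sequencer 2": 6895},
}


def reduction_percent(baseline, value):
    """Percentage of transfers removed relative to the baseline."""
    return 100.0 * (baseline - value) / baseline


for mb_size, row in transfers.items():
    for arch in ("Sequencer 1", "Sequencer 2"):
        gain = reduction_percent(row["Control"], row[arch])
        print(f"{mb_size} {arch}: {gain:.1f}% fewer transfers than Control")
```

With these figures the first sequencer removes a little under half of the transfers, and moving the dynamic address computations into the sequencer datapath brings the reduction to roughly 73% for both macroblock sizes.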

5 Conclusion

In this paper we have presented a new design flow based on High-Level Synthesis and our new sequencer architecture. We show that the design flow allows the designer to freely optimize each unit of the design using a joint memory sequencer and datapath synthesis. This is made possible by the graph model defined in the paper, which allows memory operations (transfers, dynamic addressing) to be expressed in the same way as conditional ones.

References

[1] F. Balasa, F. Catthoor, and H. DeMan. Dataflow-driven memory allocation for multi-dimensional signal processing systems. In Proc. of ICCAD, pages 31–35, November 1994.

[2] W.-C. Cheng and M. Pedram. Power-optimal encoding for a DRAM address bus. IEEE Trans. Very Large Scale Integr. Syst., 10(2):109–118, 2002.

[3] T. K. Chun-Gi Lyuh and K.-W. Kim. Coupling-aware high-level interconnect synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23(1):157–164, January 2004.

[4] G. Corre, E. Senn, P. Bomel, N. Julien, and E. Martin. Memory accesses management during high level synthesis. In Proc. of CODES+ISSS ’04, pages 42–47. ACM Press, 2004.

[5] P. D. D. Grant and I. Finlay. Synthesis of address generators. In Proc. of ICCAD 89, pages 116–119, 1989.

[6] N. J. Gwenole Corre, Eric Senn and E. Martin. A memory aware behavioral synthesis tool for real-time VLSI circuits. In Proc. of GLSVLSI, pages 82–85, April 2004.

[7] S. Hettiaratchi, P. Cheung, and T. Clarke. Performance-area trade-off of address generators for address decoder-decoupled memory. In Proc. of DATE, page 902, 2002.

[8] T. Lv, J. Henkel, H. Lekatsas, and W. Wolf. A dictionary-based en/decoding scheme for low-power data buses. IEEE Trans. Very Large Scale Integr. Syst., 11(5):943–951, 2003.

[9] E. S. N. Passos and L.-F. Chao. Multi-dimensional interleaving for time-and-memory design optimization. In Proc. of ICCD’95, pages 440–445, October 1995.

[10] A. Nicolau and S. Novack. Trailblazing a hierarchical approach to percolation scheduling. In Proc. of ICPP’93, pages 120–124, 1993.

[11] J. Park and P. C. Diniz. Synthesis of pipelined memory access controllers for streamed data applications on FPGA-based computing engines. In Proc. of ISSS, pages 221–226, 2001.

[12] S. Ramprasad, N. R. Shanbhag, and I. N. Hajj. A coding framework for low-power address and data busses. IEEE Trans. Very Large Scale Integr. Syst., 7(2):212–221, 1999.

[13] B. R. Li and M. L. Liou. A new three-step search algorithm for block motion estimation. IEEE Trans. on Circuits and Systems for Video Technology, 4(4):438–442, August 1994.

[14] J. Seo, T. Kim, and P. R. Panda. Memory allocation and mapping in high-level synthesis: an integrated approach. IEEE Trans. Very Large Scale Integr. Syst., 11(5):928–938, 2003.

[15] M. R. Stan and W. P. Burleson. Bus-invert coding for low-power I/O. IEEE Trans. Very Large Scale Integr. Syst., 3(1):49–58, 1995.

[16] S. Wuytack, F. Catthoor, G. de Jong, B. Lin, and H. de Man. Flow graph balancing for minimizing the required memory bandwidth. In ISSS ’96: Proc. of the International Symposium on System Synthesis, page 127. IEEE Computer Society, 1996.