
5.3 VLSI ARCHITECTURES FOR LIFTING-BASED DWT

5.3.8 A Generalized Two-Dimensional Architecture

Generally, two-dimensional wavelet filters are separable functions. A straightforward approach to two-dimensional implementation is therefore to apply the one-dimensional DWT first row-wise (to produce the L and H subbands) and then column-wise, producing the four subbands LL, LH, HL, and HH at each level of decomposition, as shown in Figure 4.3(a) in Chapter 4. Processor utilization is a concern in a direct implementation of this approach, because all the rows of the image must be processed before the column-wise processing can begin. As a result, it requires a memory buffer of the order of the image size and hence increases the total computation delay. The alternative approach, which reduces these inefficiencies, is to begin the column processing as soon as a sufficient number of rows have been filtered. The column-wise processing is then performed on the available lines to produce wavelet coefficients row-wise. A similar approach can be adopted for implementing the two-dimensional lifting scheme as well.

Fig. 5.15 Partial schedule for the (5, 3) filter implementation.
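As a rough illustration of the straightforward row-column approach described above, the following Python sketch applies a one-dimensional (5, 3) lifting transform to every row and then to every column. The function names `lift53_1d` and `dwt53_2d`, the integer rounding (the form commonly used for reversible coding), the mirrored boundary handling, and the convention that the first subband letter refers to the row pass are assumptions made for this sketch; none of them is taken from the architecture discussed in this section.

```python
def lift53_1d(x):
    """One level of an integer (5, 3) lifting transform on a list of ints.

    Returns (low, high). Requires len(x) >= 2. Boundaries are mirrored,
    which is an assumption of this sketch.
    """
    n = len(x)
    y = list(x)

    def mirror(i):
        # Reflect indices that fall outside [0, n-1] back into range.
        if i < 0:
            return -i
        if i >= n:
            return 2 * (n - 1) - i
        return i

    # First lifting step: odd positions become high-pass values.
    for i in range(1, n, 2):
        y[i] = x[i] - (x[mirror(i - 1)] + x[mirror(i + 1)]) // 2
    # Second lifting step: even positions become low-pass values.
    for i in range(0, n, 2):
        y[i] = x[i] + (y[mirror(i - 1)] + y[mirror(i + 1)] + 2) // 4
    return y[0::2], y[1::2]


def dwt53_2d(image):
    """Row pass over the whole image, then column pass (one level)."""
    rows = [lift53_1d(list(r)) for r in image]
    L = [lo for lo, _ in rows]
    H = [hi for _, hi in rows]

    def column_pass(block):
        cols = [lift53_1d(list(c)) for c in zip(*block)]
        low = [list(r) for r in zip(*[lo for lo, _ in cols])]
        high = [list(r) for r in zip(*[hi for _, hi in cols])]
        return low, high

    LL, LH = column_pass(L)   # row low-pass, then column low/high
    HL, HH = column_pass(H)   # row high-pass, then column low/high
    return LL, LH, HL, HH
```

Because the column pass in this sketch starts only after every row has been filtered, it needs a buffer as large as the image, which is the inefficiency that starting the column processing early is meant to avoid.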

The two-dimensional architecture proposed in [27] computes both the forward and inverse lifting-based DWT in the traditional row-column fashion. However, the data are scheduled so that column processing can start as soon as enough data are available after row-wise processing, as explained earlier, in order to minimize the computation delay. As shown in Figure 5.16, the architecture consists of a row module, a column module, and two memory modules (MEM1, MEM2). The row module consists of two processors, RP1 and RP2, along with a register file REG1. Similarly, the column module consists of two processors, CP1 and CP2, along with a register file REG2. REG1 stores the intermediate data between the two lifting steps computed by RP1 and RP2, and REG2 stores the intermediate data between the two lifting steps computed by CP1 and CP2. Keeping these intermediate results locally in the register files avoids writing them to memory and reading them back again. The internal logic of all four processors RP1, RP2, CP1, and CP2 is the same, as shown in Figure 5.14.

Fig. 5.16 Block diagram of the two-dimensional architecture.
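The role of the register file between the two processors of a module can be sketched in software. In the minimal Python model below, the first processor performs the first lifting step and leaves each result in a two-entry buffer standing in for REG1, and the second processor reads it from there to perform the second lifting step. The function name `row_module`, the (5, 3) integer arithmetic, and the mirrored boundaries are assumptions for illustration; the sketch does not model the datapath of Figure 5.14 or any cycle timing.

```python
from collections import deque


def row_module(x):
    """Stream one row (len >= 2) through two chained lifting processors."""
    n = len(x)
    reg1 = deque(maxlen=2)      # stand-in for REG1: the last two high-pass values
    low, high = [], []
    for k in range(n // 2):
        # RP1, first lifting step: high-pass value at odd position 2k+1.
        right = x[2 * k + 2] if 2 * k + 2 < n else x[2 * k]   # mirrored edge
        d = x[2 * k + 1] - (x[2 * k] + right) // 2
        reg1.append(d)          # kept locally instead of going through memory
        high.append(d)
        # RP2, second lifting step: low-pass value at even position 2k.
        d_prev = reg1[0] if len(reg1) == 2 else d             # mirrored edge
        low.append(x[2 * k] + (d_prev + d + 2) // 4)
    if n % 2:                   # trailing even sample of an odd-length row
        low.append(x[-1] + (2 * reg1[-1] + 2) // 4)
    return low, high
```

Only the two most recent first-step results are ever needed by the second step in this model, which is why a small local buffer suffices and no round trip through MEM1 or MEM2 is required for them.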

When the DWT requires two lifting steps (as in (5, 3) wavelet filters), processors RP1 and RP2 read the data from MEM1, perform the computation along the rows, and write the data into MEM2. We refer to this mode of operation as the 2M architecture mode. Processor CP1 reads the data from MEM2, performs the column-wise DWT along alternate rows, and writes the HH and LH subbands into MEM2 and an external memory (Ext.MEM). Processor CP2 reads the data from MEM2, performs the column-wise DWT along the rows that CP1 did not work on, and writes the LL subband to MEM1 and the HL subband to Ext.MEM. The data flow is shown in Figure 5.17(a).

Fig. 5.17 Data flow for (a) 2M, (b) 4M architectures.


Fig. 5.18 Two-dimensional data-access patterns for the row and column modules for the (5, 3) filter with N = 5 in [27].

When the DWT requires four lifting steps (as in (9, 7) wavelet filters), we say the architecture is in 4M architecture mode, and it operates in two passes. In the first pass, the row-wise computation is performed. RP1 and RP2 read the data from MEM1, execute the first two lifting stages, and write the result into MEM2. CP1 and CP2 execute the next two lifting stages and write the results to MEM2. In the second pass, the transform is computed along the columns. At the end of the second pass, CP1 writes the HH and LH subbands to Ext.MEM, while CP2 writes the LL subband to MEM1 and the HL subband to Ext.MEM. The data flow is shown in Figure 5.17(b).
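The two-pass operation can be sketched as follows: each pass chains the first two lifting stages (the role of RP1 and RP2) with the next two (the role of CP1 and CP2), first along the rows and then along the columns. The lifting constants are the commonly cited values for the (9, 7) filter and are not given in this section; the function names, the mirrored boundaries, the omission of the final scaling step, and the convention that the first subband letter refers to the row pass are likewise assumptions of this sketch.

```python
# Commonly cited (9, 7) lifting constants (assumption: not stated in this section).
ALPHA, BETA = -1.586134342, -0.05298011854
GAMMA, DELTA = 0.8829110762, 0.4435068522


def lifting_step(y, start, coeff):
    """In place: add coeff * (left + right) to every second sample from `start`."""
    n = len(y)
    for i in range(start, n, 2):
        left = y[i - 1] if i - 1 >= 0 else y[i + 1]    # mirrored edge
        right = y[i + 1] if i + 1 < n else y[i - 1]    # mirrored edge
        y[i] += coeff * (left + right)


def stages_1_2(x):
    """First two lifting stages (the role of RP1 and RP2)."""
    y = [float(v) for v in x]
    lifting_step(y, 1, ALPHA)   # odd samples
    lifting_step(y, 0, BETA)    # even samples
    return y


def stages_3_4(y):
    """Next two lifting stages (the role of CP1 and CP2)."""
    y = list(y)
    lifting_step(y, 1, GAMMA)   # odd samples
    lifting_step(y, 0, DELTA)   # even samples
    return y


def dwt97_two_pass(image):
    """Pass 1 along rows, pass 2 along columns; output is left unscaled."""
    pass1 = [stages_3_4(stages_1_2(row)) for row in image]
    cols = [stages_3_4(stages_1_2(col)) for col in zip(*pass1)]
    out = [list(r) for r in zip(*cols)]
    LL = [r[0::2] for r in out[0::2]]
    HL = [r[1::2] for r in out[0::2]]
    LH = [r[0::2] for r in out[1::2]]
    HH = [r[1::2] for r in out[1::2]]
    return LL, LH, HL, HH
```

In both passes the odd-sample stages produce the high-pass values and the even-sample stages produce the low-pass values, so the same four stages are simply reused with the data direction changed.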

In the 2M architecture mode, the latency and memory requirements would be very large if the column transform were started only after all the rows of the two-dimensional block had been transformed. To overcome this, the column processors also need to compute row-wise. This is illustrated in Figure 5.18 for the (5, 3) filter with N = 5. The first processor RP1 computes the high-pass (odd) elements $y_{0,1}, y_{0,3}, \ldots$ along the rows, while the second processor RP2 calculates the low-pass (even) elements $y_{0,0}, y_{0,2}, y_{0,4}, \ldots$, also along the rows. Here $y_{i,j}$ denotes the element in the $i$th row and $j$th column of the two-dimensional block. The processor CP1 calculates the high-pass and low-pass elements $z_{1,0}, z_{1,1}, \ldots, z_{3,0}, z_{3,1}, \ldots$ along the odd rows, and CP2 calculates the high-pass and low-pass elements $z_{0,0}, z_{0,1}, \ldots, z_{2,0}, z_{2,1}, \ldots, z_{4,0}, z_{4,1}, \ldots$ along the even rows, as shown in Figure 5.18.
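The assignment of elements to processors described above can be written out explicitly. The short Python sketch below lists, for an N x N block, which (row, column) positions each processor produces; the function name `assign_elements` and the dictionary layout are illustrative choices, not part of the architecture in [27].

```python
def assign_elements(n):
    """Map each processor to the (row, column) positions it computes."""
    return {
        # Row module: RP1 takes odd columns (high-pass), RP2 even columns (low-pass).
        "RP1": [(i, j) for i in range(n) for j in range(1, n, 2)],
        "RP2": [(i, j) for i in range(n) for j in range(0, n, 2)],
        # Column module: CP1 takes all elements of odd rows, CP2 of even rows.
        "CP1": [(i, j) for i in range(1, n, 2) for j in range(n)],
        "CP2": [(i, j) for i in range(0, n, 2) for j in range(n)],
    }
```

For N = 5, RP1 starts with positions (0, 1) and (0, 3), RP2 with (0, 0), (0, 2), and (0, 4), CP1 covers rows 1 and 3, and CP2 covers rows 0, 2, and 4.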

It should be noted that the processors CP1 and CP2 start their computations as soon as the required elements are generated by RP1 and RP2. Essentially, the processor RP1 calculates the high-pass values and RP2 calculates the low-pass values along all the rows, whereas CP1 and CP2 calculate both high-pass and low-pass values along the odd rows and even rows, respectively.
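To give a feel for how early the column processors can start, the following Python sketch lists which row-transformed rows a column step reads. It assumes the two-step (5, 3) lifting structure with mirrored boundaries and at least two rows: an odd output row needs itself and its two neighbouring rows, and an even output row additionally needs whatever its two neighbouring odd rows need. The function names are hypothetical, and the sketch does not reproduce the cycle-level schedules of Tables 5.5 to 5.7.

```python
def rows_needed(i, n_rows):
    """Row-transformed rows read when computing column output row i."""

    def mirror(r):
        # Reflect row indices that fall outside [0, n_rows-1].
        if r < 0:
            return -r
        if r >= n_rows:
            return 2 * (n_rows - 1) - r
        return r

    if i % 2 == 1:
        # First lifting step (odd rows, the role of CP1).
        return sorted({mirror(i - 1), i, mirror(i + 1)})
    # Second lifting step (even rows, the role of CP2): depends on the
    # two neighbouring odd rows and therefore on everything they read.
    needed = {i}
    for odd in (mirror(i - 1), mirror(i + 1)):
        needed |= set(rows_needed(odd, n_rows))
    return sorted(needed)


def earliest_start(i, n_rows):
    """Index of the last row that must be filtered before row i can be computed."""
    return max(rows_needed(i, n_rows))
```

For N = 5, `rows_needed(1, 5)` gives rows 0 to 2, so in this model the first column step can begin once three of the five rows have been filtered rather than all of them.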

In Table 5.5, we present a snapshot of the schedule of the data and their computation in the first 14 clock cycles for the RP1 and RP2 processors. Similarly, we present a part of the schedule of the data and their computation for the processors CP1 and CP2 in Tables 5.6 and 5.7 respectively.

In the 4M Architecture mode, all four processors perform either a row transform or a column transform at any given instant. Specifically, the processors RP1 and CP1 compute the high-pass values along the rows in the first pass and along the columns in the second pass, whereas processors RP2 and CP2 compute the low-pass values.

Table 5.5 Partial Schedule of Processors RP1 and RP2 for the (5,3) Filter.

The memory modules, MEM1 and MEM2, are both dual-port memories with one read port and one write port, and each supports two simultaneous accesses per cycle.

Table 5.6 Partial Schedule of Processor CP1 for the (5, 3) Filter.

Column headings: Cycle, Adder1, Shift, Adder2.
