

5.3 VLSI ARCHITECTURES FOR LIFTING-BASED DWT

5.3.3 Flipping Architecture

While conventional lifting-based architectures require fewer arithmetic operations than the convolution-based approach for DWT, they sometimes have long critical paths. For instance, the critical path of the lifting-based architecture for the (9, 7) filter is 4Tm + 8Ta, while that of the convolution implementation is Tm + 2Ta, where Tm and Ta denote the delays of a multiplier and an adder, respectively. One way of improving this is by pipelining, as has been demonstrated in [28, 29, 30]. However, this increases the number of registers significantly. For instance, to pipeline the lifting-based (9, 7) filter such that the critical path is Tm + Ta, six additional registers are required.

Recently Huang, Tseng, and Chen [36] proposed a very efficient way of solving this timing accumulation problem. The basic idea is to remove the multiplications along the critical path by scaling the remaining paths by the inverse of the multiplier coefficients. Figures 5.9(a)-(c) describe how scaling at each level can reduce the multiplications in the critical path. Figure 5.9(d) further splits the three-input addition nodes into two 2-input adders. The critical path is now Tm + 5Ta. Note that the flipping transformation changes the round-off noise considerably. Techniques to address the resulting precision and noise problems have also been presented in [36].
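The idea behind flipping can be seen on a single predict step, y = x1 + a(x0 + x2): scaling the x1 branch by 1/a lets the additions accumulate without a multiplier on the path, with the remaining scale factor merged into the output scaling. A minimal numeric sketch (the coefficient is the usual (9,7) predict value; everything else is illustrative):

```python
# One predict step of (9,7) lifting: y = x1 + a*(x0 + x2).
# Conventional form: the multiply by `a` sits on the accumulation path.
# Flipped form: the x1 branch is pre-scaled by 1/a, so only additions
# accumulate; the single multiply by `a` is deferred (and in practice
# merged with the final 1/K output scaling).
a = -1.586134342  # (9,7) predict coefficient

x0, x1, x2 = 3.0, 7.0, 5.0
y_conventional = x1 + a * (x0 + x2)
y_flipped = a * ((x1 / a) + (x0 + x2))  # multiplier moved off the add path

assert abs(y_conventional - y_flipped) < 1e-9
```

The two forms are algebraically identical; what changes is where the multiplier sits in the data path, which is exactly what shortens the critical path to Tm + 5Ta.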

5.3.4 A Register Allocation Scheme for Lifting

Chang, Lee, Peng, and Lee [31] proposed a programmable architecture that maps the data dependency diagram of lifting-based DWT using four 3-input MACs (multiply-adder calculators), nine registers, and a register allocation scheme. The algorithm consists of two phases, as shown in Figure 5.10.

122 VLSI ARCHITECTURES FOR DISCRETE WAVELET TRANSFORMS

Fig. 5.9 A flipping architecture proposed in [36]: (a) original architecture, (b)-(c) scaling the coefficients to reduce the number of multiplications, (d) splitting the three-input addition nodes into two-input adders.

We explain below the data-flow principle of the architecture in terms of the register allocation of the nodes in the data dependency diagram, as proposed in [31].

Fig. 5.10 Data-flow and register allocation of the data dependency diagram of lifting (Phase 1 and Phase 2; first and second stages; LP and HP outputs).

From the data-flow in Figure 5.10, it is obvious that the architecture has two phases (Phase 1 and Phase 2). These two phases operate in alternate fashion. The sequential computation and register allocation in Phase 1 of the data dependency diagram shown in Figure 5.10 are in the following order:

R0 ← x_{2i-1}; R2 ← x_{2i};
R3 ← R0 + a(R1 + R2);
R4 ← R1 + b(R5 + R3);
R8 ← R5 + c(R6 + R4);
Output_LP ← R6 + d(R7 + R8); Output_HP ← R8.

Similarly, the sequential computation and register allocation in phase 2 of the data dependency diagram of lifting are as follows:

R0 ← x_{2i+1}; R1 ← x_{2i+2};
R5 ← R0 + a(R2 + R1);
R6 ← R2 + b(R3 + R5);
R7 ← R3 + c(R4 + R6);
Output_LP ← R4 + d(R8 + R7); Output_HP ← R7.

As explained above, two samples are input in each phase and two samples (one LP and one HP) are output at the end of every phase until the input data are exhausted. The output samples are also stored in a temporary buffer for use in the vertical filtering of the two-dimensional implementation of the lifting-based discrete wavelet transform.
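The two-phase register updates can be simulated directly and checked against a straightforward four-stage (9,7) lifting computation. The sketch below is a behavioral model under our own assumptions (coefficient values, warm-up handling, and variable names are ours, not from [31]); after a short pipeline warm-up the HP/LP streams match the direct computation on interior samples:

```python
# Behavioral model of the two-phase register allocation scheme,
# checked against a direct four-stage (9,7) lifting computation.
a, b, c, d = -1.586134342, -0.052980118, 0.882911075, 0.443506852

def two_phase(x):
    """Run the Phase 1 / Phase 2 register updates on stream x; return (hp, lp)."""
    R = [0.0] * 9
    R[1] = x[0]                      # prime the left even sample
    hp, lp, t, phase = [], [], 1, 1
    while t + 1 < len(x):
        if phase == 1:               # consume x[2i-1], x[2i]
            R[0], R[2] = x[t], x[t + 1]
            R[3] = R[0] + a * (R[1] + R[2])
            R[4] = R[1] + b * (R[5] + R[3])
            R[8] = R[5] + c * (R[6] + R[4])
            lp.append(R[6] + d * (R[7] + R[8]))
            hp.append(R[8])
        else:                        # consume x[2i+1], x[2i+2]
            R[0], R[1] = x[t], x[t + 1]
            R[5] = R[0] + a * (R[2] + R[1])
            R[6] = R[2] + b * (R[3] + R[5])
            R[7] = R[3] + c * (R[4] + R[6])
            lp.append(R[4] + d * (R[8] + R[7]))
            hp.append(R[7])
        t, phase = t + 2, 3 - phase
    return hp, lp

x = [float((31 * i) % 17) for i in range(20)]
hp, lp = two_phase(x)

# Direct lifting for comparison (interior samples only, no boundary handling).
n = len(x) // 2
d1 = {k: x[2*k+1] + a * (x[2*k] + x[2*k+2]) for k in range(n - 1)}
s1 = {k: x[2*k] + b * (d1[k-1] + d1[k]) for k in range(1, n - 1)}
d2 = {k: d1[k] + c * (s1[k] + s1[k+1]) for k in range(1, n - 2)}
s2 = {k: s1[k] + d * (d2[k-1] + d2[k]) for k in range(2, n - 2)}

# After the pipeline warm-up, the register streams agree with direct lifting.
assert all(abs(hp[k] - d2[k-1]) < 1e-9 for k in range(2, len(hp)))
assert all(abs(lp[k] - s2[k-1]) < 1e-9 for k in range(3, len(lp)))
```

The comparison also makes the latency visible: the first two HP outputs and first three LP outputs are warm-up values, after which one valid HP/LP pair emerges per phase.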


5.3.5 A Recursive Architecture for Lifting

According to the multiresolution decomposition principle of DWT, at every stage the low-pass subband is further decomposed recursively by applying the same analysis filters. The total number of output samples to be processed for an L-level DWT is

N + N/2 + N/4 + ... + N/2^(L-1) < 2N,

where N is the number of samples in the input signal.
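The bound is easy to verify numerically; a quick sketch (function name ours):

```python
# Workload across L levels of DWT: N + N/2 + ... + N/2^(L-1),
# which always stays below 2N.
def total_samples(N, L):
    return sum(N // 2**level for level in range(L))

N = 1024
for L in range(1, 11):
    assert total_samples(N, L) < 2 * N

assert total_samples(1024, 3) == 1024 + 512 + 256  # = 1792, below 2048
```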

Most of the traditional DWT architectures compute the second level of decomposition upon completion of the first level of decomposition and so on.

Hence the ith level of decomposition is performed after completion of the (i-1)th level at stage i in the recursion. However, the number of samples to be processed in each level is always half of that in the previous level. As a result, it is possible to process multiple levels of decomposition simultaneously.

This is the basic principle of the recursive architecture for DWT computation, which was first proposed for a convolution-based DWT in [18]. Later the same principle was applied to develop a recursive architecture for lifting-based DWT by Liao, Cockburn, and Mandal [34, 35]. Here, computations in higher levels of decomposition are initiated as soon as enough intermediate data in the low-frequency subband are available. The architecture proposed by Liao et al. for a three-level decomposition of an input signal using the Daubechies-4 DWT is shown in Figure 5.11. The same principle can be extended to other wavelet filters as well.

Fig. 5.11 Recursive architecture for lifting.

The basic circuit elements used in this architecture for arithmetic computation are delay elements, multipliers, and multiply-accumulators (MACs). The MAC is designed using a multiplier, an adder, and two shifters. The multiplexers M1 and M2 select the even and odd samples of the input data as needed by the lifting scheme. S1, S2, and S3 are the control signals for the data flow of the architecture. For the first level of computation the select signal (S1) of each multiplexer is set to 0, and it is set to 1 during the second

or third level of computation. The switches S2 and S3 select the input data for the second and third levels of computation. The multiplexer M3 selects the delayed samples for each level of decomposition based on the clock signals shown in Figure 5.11. The total time required by this recursive architecture to compute an L-level DWT is

T = N + Td + 2(1 + 2 + ... + 2^(L-2)) = N + Td + 2^L - 2,

where Td is the circuit delay from input to output.
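The closed form follows from a geometric-sum identity, which can be checked mechanically (a sanity check only; N and Td drop out since they appear identically on both sides):

```python
# Identity behind the timing expression: 2*(1 + 2 + ... + 2^(L-2)) = 2^L - 2,
# so the per-level start-up overheads collapse to 2^L - 2 extra cycles.
for L in range(2, 12):
    assert 2 * sum(2**k for k in range(L - 1)) == 2**L - 2
```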

5.3.6 A DSP-Type Architecture for Lifting

A filter-independent DSP-type parallel architecture has been proposed by Martina, Masera, Piccinini, and Zamboni in [37]. The architecture consists of Nt = max_i{ks_i, kt_i} MAC (multiply-accumulate) units, where ks_i and kt_i are the lengths of the primal and dual lifting filters s_i and t_i, respectively, in step i of the lifting factorization. The architecture is shown in Figure 5.12.

Fig. 5.12 Parallel MAC architecture for lifting.

The architecture essentially computes the following two streams in each lifting step.

a_out[j] = a_in[j] - floor( sum_k d_in[j - k] . s_i[k] + 1/2 ),
d_out[j] = d_in[j] - floor( sum_k a_out[j - k] . t_i[k] + 1/2 ),


where a_in and d_in are two input substreams formed by the even and odd samples of the original input signal stream x. Note that the streams a_in and d_in are not processed together in this architecture; while one is processed, the other has to be delayed enough to guarantee a consistent subtraction at the end of the lifting step. The architecture is designed to compute n_t simultaneous partial convolution products selected by the multiplexer (MUX), where n_t is the length of the filter for the lifting step currently being executed. After n_t clock cycles, the first filtered sample is available for the rounding operation at the output of the first unit MAC1, and subsequent samples are obtained in consecutive clock cycles from the subsequent MAC units (MAC2, ..., MAC_Nt). The "programmable delay" is a buffer that guarantees the subtraction consistency needed to produce the corresponding a_out[j] and d_out[j] samples at the output. The ROUND unit in Figure 5.12 computes the floor function shown in the lifting equations, and the SUB unit performs the corresponding subtraction operations. The input sample stream (a two-dimensional image) is stored in a RAM in four sub-sampled blocks in order to properly address the row-wise and column-wise processing of the image for the 2-D lifting DWT implementation. The memory addressing scheme and its access patterns are discussed in great detail in [37].
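The per-step computation can be sketched as a short behavioral model of the MAC array followed by the ROUND and SUB units: a partial convolution of one substream with the step filter, rounded by floor(. + 1/2), then subtracted from the other substream. Function and variable names, and the example filter, are ours for illustration, not from [37]:

```python
import math

# Behavioral model of one lifting step: convolve d_in with the step filter s,
# round with floor(. + 1/2) (the ROUND unit), subtract from a_in (the SUB unit).
def lifting_step(a_in, d_in, s):
    out = []
    for j in range(len(a_in)):
        acc = sum(d_in[j - k] * s[k]
                  for k in range(len(s)) if 0 <= j - k < len(d_in))
        out.append(a_in[j] - math.floor(acc + 0.5))
    return out

a_in = [10, 12, 9, 14]   # even-indexed substream
d_in = [3, -2, 5, 1]     # odd-indexed substream
s = [0.5, 0.5]           # a two-tap step filter, chosen only for illustration
a_out = lifting_step(a_in, d_in, s)  # -> [8, 11, 7, 11]
```

A dual call with the roles of the streams exchanged (and the t_i filter) produces d_out, mirroring the second equation.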

5.3.7 A Generalized and Highly Programmable Architecture for Lifting

The architecture proposed by Andra, Chakrabarti, and Acharya [25, 26, 27] is an example of a highly programmable architecture that can support a large set of wavelet filters, including the (5,3), (9,7), C(13,7), S(13,7), (2,6), (2,10), and (6,10) filters. In this architecture, each stage of the data dependency diagram in Figure 5.6 is assigned to a processor. For wavelet filters requiring only two lifting stages (such as the (5, 3) wavelet filter), this maps to a two-processor architecture. For wavelet filters with four lifting stages (such as the (9, 7) wavelet filter), this maps to a four-processor architecture. Figure 5.13 describes the assignment of computation to processors P1 and P2 for the (5, 3) wavelet filter.

The processor architecture consists of adders, multipliers, and shifters that are interconnected in a manner that supports the computational structure of the specific filter. Figure 5.14 describes the processor architectures for computation of the lifting steps. All the lifting steps for DWT and IDWT are essentially of the form y_i = a(x_{i-1} + x_{i+1}) + x_i, where a is a constant multiplication factor. For the (5, 3) filter, the multiplication factors in both lifting stages are powers of 2, and hence the multiplications can be executed by simple shift operations. As a result, the processor for computation of the (5, 3) filter consists of two adders and a shifter, whereas the processor for computation of the (9, 7) filter consists of two adders and a multiplier.
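The shift-for-multiply substitution is easy to see on a single (5, 3) step with a = -0.5 (sample values here are arbitrary):

```python
# (5,3) lifting step y = x1 + a*(x0 + x2) with a = -0.5:
# the multiply reduces to one arithmetic right shift of the even-sample sum,
# which is why the (5,3) processor needs only adders and a shifter.
x0, x1, x2 = 10, 7, 6
y_multiply = x1 + (-0.5) * (x0 + x2)   # multiplier form
y_shift = x1 - ((x0 + x2) >> 1)        # shifter replaces the multiplier
assert y_shift == y_multiply
```

When the even-sample sum is odd, the shift computes the floor of the halved sum, which is precisely the rounding used in the integer-to-integer (5, 3) transform.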

Figure 5.15 describes part of the schedule for the (5, 3) wavelet filter to transform a row (i.e., in one dimension). The schedules are generated by mapping the dependency graph onto the resource-constrained architecture. It is

Fig. 5.13 Processor assignment for the (5, 3) wavelet filter.

Fig. 5.14 Processor architecture for the (5, 3) and (9, 7) filters.

assumed that the delays of the adder, the shifter, and the multiplier are 1, 1, and 4 time units, respectively. For example, Adder1 of P1 adds the elements (x0, x2) in the second cycle and stores the sum in register RA1. The shifter reads this sum in the next (third) cycle, carries out the required number of shifts (one right shift, since a = -0.5), and stores the result in register RS. The second adder (Adder2) reads the value in RS and combines it with the element x1 by subtraction to generate y1 in the next cycle. To process N = 9 data samples, processor P1 takes four cycles. Adder1 in processor P2 starts computation in the sixth cycle. The gaps in the schedules for P1 and P2 are required to store the zeroth element of each row.
