Design and Implementation of a Radix-4 Complex Division Unit with Prescaling


HAL Id: ensl-00379147

https://hal-ens-lyon.archives-ouvertes.fr/ensl-00379147v2

Submitted on 13 Jan 2010



Pouya Dormiani, Milos Ercegovac, Jean-Michel Muller

To cite this version:

Pouya Dormiani, Milos Ercegovac, Jean-Michel Muller. Design and Implementation of a Radix-4 Complex Division Unit with Prescaling. 20th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP'09), Jul 2009, Boston, United States. ensl-00379147v2


Pouya Dormiani

Computer Science Department University of California at Los Angeles

Los Angeles, CA 90024, USA

Miloš D. Ercegovac

4731H Boelter Hall Computer Science Department University of California at Los Angeles

Los Angeles, CA 90024, USA

Jean-Michel Muller

CNRS-Laboratoire CNRS-ENSL-INRIA-UCBL LIP École Normale Supérieure de Lyon

46 Allée d'Italie 69364 Lyon Cedex 07, France

Abstract—We present a design and implementation of a radix-4 complex division unit with prescaling of the operands.

Specifically, we extend the treatment of the residual bound and errors due to the use of truncated redundant representation. The requirements for prescaling tables are simplified and a detailed specification of the table design is given. All principal components used in the design are described and the proposed optimizations are explained. The target platform for implementation was an Altera Stratix II FPGA [15] for which we report timing and area requirements. For a precision of 36 bits, the implementation uses 1185 ALUTs, achieving a latency of 157 ns. The maximum clock frequency is 173.49 MHz.

I. INTRODUCTION

Complex division is used in applications such as signal processing (e.g., the complex SVD), multiantenna systems (MIMO-type) [1], GPS [2], astronomy [3], and non-linear RF measurement [4]. Unlike for complex multipliers [10], [12], its implementation has been commonly provided in software.

To improve its performance, a hardware implementation is considered. With that objective, a hardware-oriented algorithm and the corresponding theory for general radix-r complex-valued division based on a digit-recurrence algorithm were introduced in [6]. A high-level design of a complex divider is discussed in [7] without implementation details. In this paper we focus on the design and implementation of a radix-4 complex-valued division unit with the quotient-digit set {−3, ..., 3}. The operands and the result are in fractional fixed-point form. We also refine some of the derivation results from [6] to improve the implementation.

Specifically, with the dividend $z = z^R + i z^I$ and divisor $d = d^R + i d^I$, $i = \sqrt{-1}$, the design discussed computes $q = z/d$. A high-level description of the algorithm is

Initialization: $j = 0$

$$w[0] = z \tag{1}$$

Recurrence iterations: $j = 1, \ldots, n$

$$q_{j+1} = \mathrm{Sel}(4w[j], y) \tag{2}$$

$$w[j+1] = 4w[j] - q_{j+1}\,y \tag{3}$$

Result:

$$q = \frac{z}{d} = 0.q^R_1 q^R_2 q^R_3 \ldots q^R_n + i\,0.q^I_1 q^I_2 q^I_3 \ldots q^I_n \tag{4}$$

The recurrence for complex division corresponds to the conventional real-valued division discussed in [5], and similar conditions such as containment and continuity as well as bounded residuals apply. The complex residual is $w[j] = w^R[j] + i\,w^I[j]$. The quotient digits are $q_{j+1} = q^R_{j+1} + i\,q^I_{j+1}$, with the real and imaginary components $q^R_{j+1}, q^I_{j+1} \in \{-3, \ldots, 3\}$. These signed digits can be converted during the iterations using on-the-fly conversion [5] to obtain a conventional representation of the result. The complex residual recurrence decomposes into two separate recurrences for the real and imaginary parts, which can be computed in parallel:

$$w^R[j+1] = 4w^R[j] - q^R_{j+1} d^R + q^I_{j+1} d^I \tag{5}$$

$$w^I[j+1] = 4w^I[j] - q^R_{j+1} d^I - q^I_{j+1} d^R \tag{6}$$

where $w^R[0] = z^R$ and $w^I[0] = z^I$. The quotient-digit selection in the complex domain is a two-dimensional problem because both $q^R_{j+1}$ and $q^I_{j+1}$ must be selected in such a way that the real and imaginary residuals $(w^R[j], w^I[j])$ remain bounded. This is much more difficult than the single-digit selection used in the real case. We solve this problem by scaling the operands by a factor $K$ such that $Kz/Kd = x/y$, where $y = Kd \approx 1$. Consequently, $y^R \approx 1$ and $y^I \approx 0$, and the selection of $q^R_{j+1}$ and $q^I_{j+1}$ can be performed on the real and the imaginary shifted residuals separately, in a manner similar to real-valued division selection. To determine the prescaling factor $K$, we assume that

$$\|Kd - 1\|_\infty < \varepsilon_s \tag{7}$$

where $\|\alpha\|_\infty = \max(|\alpha^R|, |\alpha^I|)$.

After the prescaling step the recurrences are

$$w^R[j+1] = 4w^R[j] - q^R_{j+1} y^R + q^I_{j+1} y^I \tag{8}$$

$$w^I[j+1] = 4w^I[j] - q^R_{j+1} y^I - q^I_{j+1} y^R \tag{9}$$

where $w^R[0] = x^R$ and $w^I[0] = x^I$. Because the scaling makes $y^I \approx 0$ and $y^R \approx 1 - \varepsilon_s$, the selection of the real part of the quotient digit can be performed by rounding the shifted real residual and taking its integer part. The imaginary part of the quotient digit is selected similarly. Moreover, we can use estimates with $\sigma$ fractional positions of the shifted residuals $4w^R[j]$ and $4w^I[j]$ in the selection. Consequently, the residuals can be computed in redundant form to keep the cycle time short. The selection functions are

$$q^R_{j+1} = \mathrm{Sel}(\mathrm{est}(4w^R[j], \sigma)) = \mathrm{sign}(4w^R[j]) \times \left\lfloor |\mathrm{est}(4w^R[j], \sigma)| + \tfrac{1}{2} \right\rfloor \tag{10}$$

$$q^I_{j+1} = \mathrm{Sel}(\mathrm{est}(4w^I[j], \sigma)) = \mathrm{sign}(4w^I[j]) \times \left\lfloor |\mathrm{est}(4w^I[j], \sigma)| + \tfrac{1}{2} \right\rfloor \tag{11}$$

The selection function Sel satisfies

$$|\mathrm{Sel}(\mathrm{est}(x, \sigma)) - x| < \tfrac{1}{2} + 2^{-\sigma} \tag{12}$$

Here $\mathrm{est}(x, \sigma)$ is $x$ truncated to $\sigma$ fractional positions, with an error bound

$$\mathrm{est}_{ERR}(x, \sigma) = |x - \mathrm{est}(x, \sigma)| < 2^{-\sigma}$$

If $x$ is in carry-save form, $x = x_C + x_S$, then truncating the carry and sum vectors to $\sigma + 1$ fractional bits results in the same maximum error, i.e., $\mathrm{est}_{ERR}(x, \sigma) < 2^{-\sigma}$ and $\mathrm{est}_{ERR}(x_C, \sigma + 1) + \mathrm{est}_{ERR}(x_S, \sigma + 1) < 2^{-\sigma}$.
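The selection by rounding in (10) to (12) can be illustrated with a minimal Python sketch using exact rational arithmetic. The function names `est`, `est_carry_save`, and `sel` are chosen here for illustration; truncation is modeled as a floor (as for two's complement values), and the sign is taken from the estimate rather than from the full-precision residual, which is an assumption of this sketch.

```python
from fractions import Fraction
from math import floor


def est(x, sigma):
    """Truncate x to sigma fractional bits (floor, as two's complement truncation does)."""
    scale = 1 << sigma
    return Fraction(floor(x * scale), scale)


def est_carry_save(x_c, x_s, sigma):
    """Estimate x = x_c + x_s from carry and sum vectors truncated to sigma + 1 bits each."""
    return est(x_c, sigma + 1) + est(x_s, sigma + 1)


def sel(e):
    """Round the estimate to the nearest integer: sign(e) * floor(|e| + 1/2)."""
    sign = -1 if e < 0 else 1
    return sign * floor(abs(e) + Fraction(1, 2))
```

Under these assumptions, `sel(est(x, 4))` differs from `x` by less than 1/2 + 2^-4 for any rational `x`, which is the property stated in (12).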

Using (10), (11) and (12), a bound on the residual is deduced which ensures that the digit $(q^R_{j+1}, q^I_{j+1})$ selected by rounding is in the digit set {−3, ..., 3}. Namely,

$$\|w[j]\|_\infty \le \frac{1}{4}\left(3 + \frac{1}{2} + 2^{-\sigma}\right) \tag{13}$$

As shown in [6], assuming that the scaling error is $\varepsilon_s$ and $a = 3$, the residual is bounded by

$$\|w[j]\|_\infty < 2 \times 3 \times \varepsilon_s + \frac{1}{2} + 2^{-\sigma} \tag{14}$$

Consequently,

$$6\varepsilon_s + \frac{1}{2} + 2^{-\sigma} \le \frac{1}{4}\left(3 + \frac{1}{2} + 2^{-\sigma}\right) \tag{15}$$

Satisfying this condition guarantees convergence of the digit-recurrence algorithm and allows the choice of $\varepsilon_s$ and $\sigma$ to optimize the implementation characteristics.
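As a worked illustration of condition (15), the following Python sketch solves it for the largest admissible scaling error at a given σ. It is only a sketch: the helper names are hypothetical, and the formulas are taken directly from (13) and (15) for radix 4 and a = 3.

```python
from fractions import Fraction


def residual_bound(sigma):
    """Right-hand side of (13) and (15): (3 + 1/2 + 2^-sigma) / 4."""
    return (3 + Fraction(1, 2) + Fraction(1, 2 ** sigma)) / 4


def max_scaling_error(sigma):
    """Largest eps_s with 6*eps_s + 1/2 + 2^-sigma <= residual_bound(sigma)."""
    slack = residual_bound(sigma) - Fraction(1, 2) - Fraction(1, 2 ** sigma)
    return slack / 6
```

For σ = 4 this gives a residual bound of 57/64 and a scaling error of at most 7/128, so the prescaled divisor must satisfy ‖Kd − 1‖∞ ≤ 7/128 for (15) to hold.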

II. DESIGN

The design of the complex division unit consists of several components: the prescaling module, the recurrence modules for the real and imaginary parts, the on-the-fly converters to obtain conventional representations, and a simple controller.

A high-level block diagram of the design is shown in Fig. 1, with the timing shown in Fig. 2. The prescaling module in Fig. 1 performs a ROM look-up using a short estimate of the divisor $d$ as an address; the ROM stores $K \approx 1/d$. It then computes the complex product $Kz$, which is used to initialize $w[0]$ in the recurrence modules. The prescaling module computes $Kd$ in parallel with the initialization of the recurrence modules; $Kd$ is then used to perform the iterations of the recurrence. The initial delay of the prescaling module can be amortized by overlapping the prescaling of the next operation with the digit-recurrence iterations of the current operation; this, however, has not been done in the current implementation. Detailed design of the prescaling is discussed in Section II-A.

Fig. 1. High-level block diagram of the complex division unit. [Blocks: Prescale (inputs z, d), Real Rec., Imag. Rec., and two OFC units producing q^R and q^I from the digits q^R_j and q^I_j.]

Fig. 2. Timing relationships between modules. [Prescale: look-up of prescaling values, prescale z, prescale d, then idle. Real Rec. and Imag. Rec.: w^R[0], w^I[0], w^R[1], w^I[1], ..., w^R[n], w^I[n].]

The two recurrence modules (one for the real recurrence and one for the imaginary) perform nearly identical operations which can be mapped to the same hardware. Detailed design of the recurrence module is discussed in Section II-C.

A. Prescaling

Prescaling consists of several steps: obtaining the factor $K$ from a table based on a short-precision estimate of $d$, and computing $Kz$ and $Kd$.

We define a function $\mathrm{rnd}(a, b)$ which returns the value of $a$ rounded to $b$ fractional places, such that $|a - \mathrm{rnd}(a, b)| \le \frac{1}{2} 2^{-b}$. The factor $K$ can be determined by using a short estimate of $d$ to $q$ fractional positions, i.e., $\mathrm{rnd}(d^R, q)$ and $\mathrm{rnd}(d^I, q)$, as an address to a ROM which stores the corresponding values of $K$ with a precision of $t$ fractional positions,

$$K^R = \mathrm{rnd}(1/\mathrm{rnd}(d^R, q), t), \qquad K^I = \mathrm{rnd}(1/\mathrm{rnd}(d^I, q), t)$$

Error analysis for the choices of parameters $q$ and $t$ is performed in [13]. These effects are used in (15) to guarantee convergence of the algorithm. Radix 4 with digit set {−3, ..., 3} offers the most favorable choice of parameters, minimizing the number of bits required for the ROM among radices 4, 8, and 16; only radix 2 has lower memory requirements. Over-redundant digit sets are another design choice, but we decided to restrict our design to the maximally redundant digit set, which allows faithful rounding [6].

TABLE I
Memory (ROM) requirements for different radix (r), digit set {−a, ..., a}, precision of residual estimate for selection (σ), precision of d used to perform table look-up (q), and precision of table entries (t).

r        a    σ    q    t    KBits (approx.)
2        1    4    5    5    7.5
4        2    5    7    7    146
4        3    4    6    6    33
8, 16    see [13]            ≥ 146

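The value of the divisor is in the usual range

$$\frac{1}{2} \le \|d\|_\infty < 1 \tag{16}$$

noting that larger values can be scaled to this range. Its estimate $\mathrm{rnd}(d, q)$ can be represented as two two's complement numbers for the real and imaginary parts:

$$\mathrm{rnd}(d, q) = \mathrm{rnd}(d^R, q) + i\,\mathrm{rnd}(d^I, q) \tag{17}$$

$$\mathrm{rnd}(d^R, q) = \kappa^R_0.\kappa^R_1 \kappa^R_2 \kappa^R_3 \ldots \kappa^R_{q-1} \kappa^R_q \tag{18}$$

$$\mathrm{rnd}(d^I, q) = \kappa^I_0.\kappa^I_1 \kappa^I_2 \kappa^I_3 \ldots \kappa^I_{q-1} \kappa^I_q \tag{19}$$

An additional bit $\kappa_{-1}$ is required (to represent +1) because $\|\mathrm{rnd}(d, q)\|_\infty \le 1$; this will be handled as a special case.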

To reduce the number of address bits, the table can store the corresponding values for $|\mathrm{rnd}(d^R, q)|$ and $|\mathrm{rnd}(d^I, q)|$,

$$|\mathrm{rnd}(d^R, q)| = 0.\alpha^R_1 \alpha^R_2 \alpha^R_3 \ldots \alpha^R_q \tag{20}$$

$$|\mathrm{rnd}(d^I, q)| = 0.\alpha^I_1 \alpha^I_2 \alpha^I_3 \ldots \alpha^I_q \tag{21}$$

which eliminates the need for bits $\kappa^R_0$ and $\kappa^I_0$ (the sign) when forming an address. Likewise, since $\|d\|_\infty \ge \frac{1}{2}$ we know that either $\alpha^R_1 = 1$ or $\alpha^I_1 = 1$ [6] (or both). Had an address been formed using

$$\alpha^R_1 \alpha^R_2 \alpha^R_3 \ldots \alpha^R_q \; \alpha^I_1 \alpha^I_2 \alpha^I_3 \ldots \alpha^I_q$$

then the address would require $2q$ bits. Given that

$$\gamma(d^R, d^I) = \frac{1}{d^R + i d^I} = \frac{d^R - i d^I}{(d^R)^2 + (d^I)^2} \tag{22}$$

$$\gamma(d^I, d^R) = -\gamma(d^R, d^I) \tag{23}$$

we can check whether $\alpha^R_1 = 1$. If so, the address is formed via

$$\alpha^R_2 \alpha^R_3 \ldots \alpha^R_q \; \alpha^I_1 \alpha^I_2 \alpha^I_3 \ldots \alpha^I_q$$

Otherwise, it must be true that $\alpha^I_1 = 1$, so the address is formed as

$$\alpha^I_2 \alpha^I_3 \ldots \alpha^I_q \; \alpha^R_1 \alpha^R_2 \alpha^R_3 \ldots \alpha^R_q$$

and the results obtained from the table look-up are negated based on (23). This reduces the number of address bits to $2q - 1$ (halving the memory required) while introducing little additional overhead.
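The address formation just described can be sketched in Python as follows. This is an illustrative model, not the authors' RTL: `form_address` is a hypothetical helper that takes the q fractional bits of each magnitude as an integer, drops the leading bit that is known to be 1, and reports whether the real and imaginary parts were swapped.

```python
def form_address(alpha_r, alpha_i, q):
    """Form the (2q - 1)-bit ROM address from two q-bit magnitude fractions.

    alpha_r and alpha_i hold the bits alpha_1 ... alpha_q of |rnd(d^R, q)| and
    |rnd(d^I, q)| as unsigned integers (alpha_1 is the most significant bit).
    Returns (address, swapped).
    """
    msb = 1 << (q - 1)      # position of alpha_1
    low_mask = msb - 1      # selects alpha_2 ... alpha_q
    if alpha_r & msb:
        # alpha^R_1 = 1: drop it and append all q bits of the imaginary part.
        return ((alpha_r & low_mask) << q) | alpha_i, False
    if not alpha_i & msb:
        raise ValueError("neither leading bit is set; the divisor is out of range")
    # Otherwise alpha^I_1 = 1: swap the roles of the two parts.
    return ((alpha_i & low_mask) << q) | alpha_r, True
```

With q = 6 the address always fits in 11 bits, and swapping the two inputs yields the same address with the `swapped` flag set.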

Extra care must be taken with the aforementioned approach. Although the divisor is assumed to satisfy $\|d\|_\infty < 1$, it is not true that $-1 < \mathrm{rnd}(d^R, q) < 1$; in fact $-1 \le \mathrm{rnd}(d^R, q) \le 1$ (the same holds for $\mathrm{rnd}(d^I, q)$). The two's complement representation of the rounded divisor shown in equations (18) and (19) has range $[-1, 1)$. Negating −1 in two's complement with the given representation is a special case; recalling that +1 is also a special case, the input is divided into two cases: $\|\mathrm{rnd}(d, q)\|_\infty < 1$, and $\|\mathrm{rnd}(d, q)\|_\infty = 1$ (a component equal to ±1).

Another special case occurs when negating the results obtained from the table look-up due to the swapping discussed earlier. For positive values of $d^R$ and $d^I$, the real part of $1/d$ is positive and the imaginary part is negative. Since $1/2 \le \|\mathrm{rnd}(d, q)\|_\infty \le 1$ and the table only stores values for positive $d^R$ and $d^I$, we have $0 \le K^R \le 2$ and $-2 \le K^I \le 0$. Therefore the table should only contain the magnitude of the value, which can be represented in $2 + t$ bits. This presents no anomalies if 3 integer bits are used for the negated values, i.e., the ROM stores $2 + t$ bits, but the negated value has $3 + t$ bits.

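Here we describe the operation of the table incorporating the special cases:

$$\mathrm{rnd}(d^R, q) = \kappa^R_{-1} \kappa^R_0.\kappa^R_1 \kappa^R_2 \kappa^R_3 \ldots \kappa^R_q$$

$$\mathrm{rnd}(d^I, q) = \kappa^I_{-1} \kappa^I_0.\kappa^I_1 \kappa^I_2 \kappa^I_3 \ldots \kappa^I_q$$

$$A^R = |\mathrm{rnd}(d^R, q)| = \alpha^R_0.\alpha^R_1 \alpha^R_2 \alpha^R_3 \ldots \alpha^R_q$$

$$A^I = |\mathrm{rnd}(d^I, q)| = \alpha^I_0.\alpha^I_1 \alpha^I_2 \alpha^I_3 \ldots \alpha^I_q$$

$$A = \begin{cases} \alpha^R_2 \alpha^R_3 \ldots \alpha^R_q \, \alpha^I_1 \alpha^I_2 \alpha^I_3 \ldots \alpha^I_q & \text{if } \alpha^R_1 = 1, \\ \alpha^I_2 \alpha^I_3 \ldots \alpha^I_q \, \alpha^R_1 \alpha^R_2 \alpha^R_3 \ldots \alpha^R_q & \text{otherwise} \end{cases}$$

$$A_s = \begin{cases} \alpha^I_1 \alpha^I_2 \alpha^I_3 \ldots \alpha^I_q & \text{if } A^R = \pm 1, \\ \alpha^R_1 \alpha^R_2 \alpha^R_3 \ldots \alpha^R_q & \text{otherwise} \end{cases}$$

$$(U^R, U^I) = \begin{cases} (1/2, -1/2) & \text{if } A^R = 1,\ A^I = 1, \\ (1/2, 1/2) & \text{if } A^R = 1,\ A^I = -1, \\ (-1/2, -1/2) & \text{if } A^R = -1,\ A^I = 1, \\ (-1/2, 1/2) & \text{if } A^R = -1,\ A^I = -1, \\ \mathrm{ROM}_s[A_s] & \text{if } A^R = \pm 1 \text{ and } A^I \ne \pm 1, \text{ or } A^I = \pm 1 \text{ and } A^R \ne \pm 1, \\ \mathrm{ROM}[A] & \text{otherwise} \end{cases}$$

$$neg^R = \begin{cases} 1 & \text{if real and imaginary swapped,} \\ 0 & \text{otherwise} \end{cases}$$

$$neg^I = \begin{cases} 1 & \text{if real and imaginary not swapped,} \\ 0 & \text{otherwise} \end{cases}$$

$$K^R = (-1)^{neg^R} U^R, \qquad K^I = (-1)^{neg^I} U^I$$

From Table I, we have $q = 6$ and $t = 6$ for radix $r = 4$ and $a = 3$.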

Fig. 3. Prescaling ROM. The ABS block computes the absolute value of a two's complement number. Blocks rnd(., 6) round their argument to the sixth fractional position. NEG blocks negate their argument, a two's complement number.

• ROM: This ROM has 11 address bits and is 16 bits wide, which can be mapped to 8 Altera Stratix II M4K RAM blocks, constituting less than one percent of the total block memory bits in an EP2S60F672C3 device.

• ROM_s: This ROM has 6 address bits and is 16 bits wide, which was mapped to logic and registers.
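The table dimensions above can be related to the KBits column of Table I with a short Python sketch. The sizing formula is our reading rather than something stated in the text: it assumes a main table with 2q − 1 address bits, a special-case table with q address bits, a word holding two magnitudes of 2 + t bits each, and 1 KBit = 1024 bits. Under those assumptions it reproduces the three listed rows.

```python
def rom_bits(q, t):
    """Return (main_table_bits, special_table_bits) for look-up precision q, entry precision t."""
    word = 2 * (2 + t)                     # U^R and U^I, each 2 + t bits wide
    main = (1 << (2 * q - 1)) * word       # ROM addressed by A (2q - 1 bits)
    special = (1 << q) * word              # ROM_s addressed by A_s (q bits)
    return main, special


def rom_kbits(q, t):
    """Total table size in KBits, assuming 1 KBit = 1024 bits."""
    return sum(rom_bits(q, t)) / 1024
```

For q = t = 6 the word is 16 bits wide, the main table holds 2048 words, and the total is 33 KBits.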

A schematic corresponding to the described look-up scheme is shown in Fig. 3.

The other two parts of the prescaling step involve computing $Kz$ and $Kd$, which are used to initialize and carry out the digit-recurrence algorithm. Once $K$ is determined, $x$ and $y$ can be computed via

$$x = (K^R + iK^I)(z^R + iz^I) = (K^R z^R - K^I z^I) + i(K^I z^R + K^R z^I)$$

$$y = (K^R + iK^I)(d^R + id^I) = (K^R d^R - K^I d^I) + i(K^I d^R + K^R d^I)$$
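A minimal Python sketch of the two prescaling products follows, using exact rationals. The helper `prescale_factor` is hypothetical: it models the table content as the complex reciprocal of the rounded divisor, rounded componentwise to t bits, which is an assumption about how the ROM would be filled and ignores the special cases and the magnitude/sign handling.

```python
from fractions import Fraction
from math import floor


def rnd(a, b):
    """Round a to b fractional bits (round half up), so |a - rnd(a, b)| <= 2^-(b+1)."""
    scale = 1 << b
    return Fraction(floor(a * scale + Fraction(1, 2)), scale)


def cmul(a_r, a_i, b_r, b_i):
    """Complex product (a_r + i a_i)(b_r + i b_i) using four real multiplications."""
    return a_r * b_r - a_i * b_i, a_i * b_r + a_r * b_i


def prescale_factor(d_r, d_i, q=6, t=6):
    """Assumed table content: reciprocal of the rounded divisor, rounded to t bits."""
    e_r, e_i = rnd(d_r, q), rnd(d_i, q)
    m = e_r * e_r + e_i * e_i
    return rnd(e_r / m, t), rnd(-e_i / m, t)
```

For example, with d = 3/4 + i 3/10 the sketch yields K = 74/64 − i 29/64 and a prescaled divisor y within 7/128 of 1 in both components, while x/y equals z/d exactly because both operands are scaled by the same K.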

Since multipliers are costly in hardware, the complex-valued products are computed one at a time. Coincidentally, $y = Kd$ is not required until after the residuals have been initialized with $x = Kz$, which can be computed in the previous cycles. Figure 4 shows the block diagram of the scaling module. The module uses several signals to control the data path: en_inputs, en_pres, en_sc, and sel_mul. Control signals en_x are clock enable signals to registers that control when data is latched. Clock enables on registers are used to facilitate multi-cycle paths, which are necessary due to the larger delay of the prescaling logic.

In Fig. 4, en_inputs controls when the inputs to the complex division unit are latched, such that the values can be retained throughout the course of the operation. This arrangement is not necessarily unique and depends on how the module is interfaced to other logic. For example, if the external logic feeds the arguments to the complex division unit in two cycles, sending $(z^R, z^I)$ in the first and $(d^R, d^I)$ in the second, then only 2 register banks are required for the inputs as opposed to 4. The current design reflects the assumption that the module receives its arguments in the same cycle, i.e., as $(z^R, z^I, d^R, d^I)$.

Signal en_pres controls storing of the results of the prescaling ROM look-up, which are retained throughout the course of the operation.

Signals sel_mul and en_sc are used to share the multipliers so that prescaling of the dividend and the divisor occurs in separate prescaling cycles. Although the prescaled value $x = Kz$ is also fed through the registers controlled by en_sc, its value is not retained but overwritten in the next cycle by $y = Kd$. The same enable signal (en_sc) is used once more to ensure that the value of $y$ is retained in these registers, which feed the recurrence modules discussed in Section II-C.

B. Bounds of Values

It is important to characterize the bounds of the inputs to the complex division module in addition to the bounds of the prescaled values which predetermine the width of inputs to the recurrence modules.

The input $d$ is in the range $1/2 \le \|d\|_\infty < 1$, and the convergence analysis further requires $\|Kd - 1\|_\infty < \varepsilon_s$. This implies that the prescaled value $y$ satisfies

$$\max(|y^R - 1|, |y^I|) < \varepsilon_s \;\Rightarrow\; |y^R - 1| < \varepsilon_s, \quad |y^I| < \varepsilon_s$$

Since $|y^R| < 1 + \varepsilon_s$, its representation in two's complement requires 2 integer bits and $n$ fractional bits.

Fig. 4. Prescaling module. The Prescaling ROM block is the module shown in Fig. 3. [Inputs: dividend (z^R, z^I) and divisor (d^R, d^I); control signals: en_inputs, en_pres, en_sc, sel_mul.]

Likewise, the constraint (13) determines the maximum value that the residual can take. For our design point $\sigma = 4$, which means that the residual is bounded by

$$\|w[j]\|_\infty \le \frac{1}{4}\left(3 + \frac{1}{2} + 2^{-4}\right) = \frac{57}{64} \;\Rightarrow\; |w^R[0]| = |x^R| \le \frac{57}{64}, \quad |w^I[0]| = |x^I| \le \frac{57}{64}$$

Therefore, the prescaled value $(x^R, x^I)$ requires only a single integer bit and $n$ fractional bits. We are interested in determining a bound on $z$, which we can derive from the bound on $w$,

$$\|w[0]\|_\infty = \|Kz\|_\infty \le 2\|K\|_\infty \|z\|_\infty \le \frac{57}{64} \tag{24}$$

Since $\|K\|_\infty \le 2$, then

$$\|z\|_\infty \le \frac{57}{256} \tag{25}$$

requiring only $n - 1$ fractional bits, with the most significant bit having weight $2^{-2}$.

C. Digit-Recurrence Iterations

The digit-recurrence iterations compute the residuals (5), (6) and perform quotient-digit selection based on a short non-redundant estimate of the residuals, as shown in Eqs. (10) and (11).
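The iterations can be modeled at the arithmetic level with the Python sketch below. It is a behavioral model under simplifying assumptions, not the hardware described in this section: the residual is kept as an exact rational instead of in carry-save form, the operands x and y are assumed to be already prescaled, and the function and variable names are illustrative.

```python
from fractions import Fraction
from math import floor

SIGMA = 4  # fractional bits of the selection estimate (the paper's design point)


def est(x, sigma=SIGMA):
    """Truncate x to sigma fractional bits."""
    scale = 1 << sigma
    return Fraction(floor(x * scale), scale)


def sel(e):
    """Selection by rounding: sign(e) * floor(|e| + 1/2)."""
    sign = -1 if e < 0 else 1
    return sign * floor(abs(e) + Fraction(1, 2))


def divide_prescaled(x_r, x_i, y_r, y_i, n):
    """Run n radix-4 iterations of the prescaled recurrences (8) and (9).

    Returns the accumulated quotient (q_r, q_i) and the list of digit pairs.
    """
    w_r, w_i = Fraction(x_r), Fraction(x_i)
    q_r = q_i = Fraction(0)
    digits = []
    for j in range(1, n + 1):
        d_r, d_i = sel(est(4 * w_r)), sel(est(4 * w_i))
        # Both residuals are updated from the previous values in parallel.
        w_r, w_i = (4 * w_r - d_r * y_r + d_i * y_i,
                    4 * w_i - d_r * y_i - d_i * y_r)
        q_r += Fraction(d_r, 4 ** j)
        q_i += Fraction(d_i, 4 ** j)
        digits.append((d_r, d_i))
    return q_r, q_i, digits
```

When y is close enough to 1 for (15) to hold, every selected digit stays in {−3, ..., 3} and the accumulated quotient approaches x/y by two bits per iteration.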

The recurrences in (5) and (6) are structurally the same. Namely,

$$w[j+1] = 4w[j] + \sigma_1 y^R + \sigma_2 y^I \tag{26}$$

The residuals are computed in redundant form in order to reduce the cycle time by eliminating the need for long carry chains. In our implementation we used a carry-save form. The operation is expressed as

$$(w_C[j+1], w_S[j+1]) = \mathrm{ADD}_{[6:2]}(4w_C[j],\, 4w_S[j],\, \sigma_1^1 y^R,\, 2\sigma_1^2 y^R,\, \sigma_2^1 y^I,\, 2\sigma_2^2 y^I) \tag{27}$$

where $\mathrm{ADD}_{[6:2]}(a, b, c, d, e, f)$ is a [6:2] carry-save adder taking 6 inputs and producing a carry vector and a sum vector, shown in Fig. 5. The digits $\sigma_1$ and $\sigma_2$ are in the digit set {−3, ..., 3}, so we implement this digit multiplication by decomposing $\sigma_k = 2\sigma_k^2 + \sigma_k^1$, where $\sigma_k^i \in \{-1, 0, 1\}$. Multiplying by negative one is achieved by inverting the input and adding a carry-in to the reduction module. A block diagram of the structure used to compute the real recurrence is shown in Fig. 6.

Fig. 6. Real recurrence module. Blocks ×4 shift their argument left by 2 binary places. Blocks MG compute $\sigma$ times their argument using the $\sigma_k^i$ decomposition discussed. The CPA module is a carry-propagate adder which computes a short non-redundant estimate of the residual. The Sel module takes this estimate as its argument and outputs the next quotient digit.
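The carry-save update (27) can be modeled on integers with the Python sketch below. It is an illustration under stated assumptions: the word width `W` is arbitrary, values are fixed-point numbers held as W-bit two's complement integers, the [6:2] adder is built from four rows of full adders, and the four carry-ins are injected into the vacant least significant positions of the shifted carry vectors. The actual slice structure of Fig. 5 may differ in detail.

```python
W = 16                 # word width in bits (assumed for illustration)
MASK = (1 << W) - 1


def decompose(sigma):
    """Split sigma in {-3, ..., 3} into (s1, s2) with sigma = 2*s2 + s1, s1, s2 in {-1, 0, 1}."""
    sign = -1 if sigma < 0 else 1
    mag = abs(sigma)
    return sign * (mag & 1), sign * (mag >> 1)


def multiple(s, y):
    """Return (vector, carry_in) representing s*y mod 2^W for s in {-1, 0, 1}."""
    if s == 1:
        return y & MASK, 0
    if s == -1:
        return ~y & MASK, 1      # invert and add a carry-in
    return 0, 0


def csa(a, b, c, cin):
    """One row of full adders: a + b + c + cin == s + cy (mod 2^W)."""
    s = a ^ b ^ c
    cy = ((((a & b) | (a & c) | (b & c)) << 1) | cin) & MASK
    return s, cy


def add_6_2(a, b, c, d, e, f, cins):
    """[6:2] reduction with four carry-ins; returns (sum_vector, carry_vector)."""
    s1, c1 = csa(a, b, c, cins[0])
    s2, c2 = csa(d, e, f, cins[1])
    s3, c3 = csa(s1, c1, s2, cins[2])
    return csa(s3, c3, c2, cins[3])


def residual_step(w_c, w_s, sigma1, sigma2, y_r, y_i):
    """Compute (w_C[j+1], w_S[j+1]) as in (27), modulo 2^W."""
    s11, s12 = decompose(sigma1)
    s21, s22 = decompose(sigma2)
    terms = [multiple(s11, y_r), multiple(s12, y_r << 1),
             multiple(s21, y_i), multiple(s22, y_i << 1)]
    vectors = [v for v, _ in terms]
    cins = [c for _, c in terms]
    s, cy = add_6_2((w_c << 2) & MASK, (w_s << 2) & MASK, *vectors, cins)
    return cy, s
```

For the real recurrence one would call `residual_step` with `sigma1 = -q_r` and `sigma2 = q_i`; the sum of the two returned vectors equals 4w + σ₁y^R + σ₂y^I modulo 2^W.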

Fig. 5. [6:2] adder module. The adder consists of three different slices: the least significant slice, which sums 6 arguments and takes 4 carry-ins; the repeat slice, which sums 6 arguments, takes 4 lateral carries, and produces 4 lateral carries to the subsequent slice; and the most significant slice.

Digit selection is performed by taking a short-precision estimate of the residual and rounding it to the nearest integer via a small CPA and table. In the discussion that follows we generally say residual without referring specifically to the real or imaginary part; the analysis holds for both residuals $w^R$ and $w^I$. In Section II-B we determined that the residual has a single integer bit and $n$ fractional bits, i.e., it is of the form $w = w_0.w_1 w_2 \ldots w_n$ with value $\sum_{i=0}^{n} w_i 2^{-i}$. In redundant form $w = w_c + w_s$, where $(w_c, w_s) = (C_0.C_1 C_2 C_3 \ldots C_n,\; S_0.S_1 S_2 S_3 \ldots S_n)$. Recalling that selection is performed via

$$q_{j+1} = \mathrm{Sel}(\mathrm{est}(4w[j], \sigma))$$

where $\sigma = 4$ as determined in Section II-A, we know that

$$\mathrm{est}(4w[j], 4) = w_0 w_1 w_2 . w_3 w_4 w_5 w_6 = \sum_{i=0}^{6} w_i 2^{-i+2}, \qquad \mathrm{est}_{ERR}(4w[j], 4) < 2^{-4}$$

Now, since $w$ is in redundant form,

$$\mathrm{est}(4w_c[j], 5) = C_0 C_1 C_2 . C_3 C_4 C_5 C_6 C_7$$

$$\mathrm{est}(4w_s[j], 5) = S_0 S_1 S_2 . S_3 S_4 S_5 S_6 S_7$$

$$g = \mathrm{est}(4w_c[j], 5) + \mathrm{est}(4w_s[j], 5) \tag{28}$$

$$\mathrm{est}_{ERR}(4w_c[j], 5) + \mathrm{est}_{ERR}(4w_s[j], 5) < 2^{-4}$$

which gives us the short-precision estimate $g$ of the residual. It is important to realize that $g \ne \mathrm{est}(4w[j], 4)$ in general, but that they commit the same maximum error $2^{-4}$ in their approximation of $4w[j]$. The addition in equation (28) requires the CPA that we have been referring to during this discussion:

$$g_{-2} g_{-1} g_0 . g_1 \ldots g_5 = \mathrm{CPA}(C_0 C_1 C_2 . C_3 \ldots C_7,\; S_0 S_1 S_2 . S_3 \ldots S_7) \tag{29}$$

To round $g$ and take the integer part one can use a small table, as in Table II, by introducing an additional variable $g_z = g_2 + g_3 + g_4 + g_5$ (i.e., the logical OR of bits $g_2$ through $g_5$). This table is a function of 5 bits and produces three bits of output (for the encoding of $q_{j+1}$) and will efficiently map to LUTs.

TABLE II
Rounding to integer part.

g−2   g−1   g0   g1   gz   qj+1
0     0     0    0    -    0
0     0     0    1    -    1
0     0     1    0    -    1
0     0     1    1    -    2
0     1     0    0    -    2
0     1     0    1    -    3
0     1     1    0    -    3
1     0     0    1    1    -3
1     0     1    0    -    -3
1     0     1    1    0    -3
1     0     1    1    1    -2
1     1     0    0    -    -2
1     1     0    1    1    -1
1     1     1    0    -    -1
1     1     1    1    1    0
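The rows of Table II can be cross-checked against the rounding rule of (10) with the Python sketch below. The encoding is an assumption made for illustration: g is an 8-bit two's complement pattern g₋₂ g₋₁ g₀ . g₁ ... g₅ with 5 fractional bits, and only the rows printed in the table are transcribed, so `table_digit` returns `None` for patterns that match no row.

```python
from fractions import Fraction
from math import floor

# Rows of Table II: (bits g-2 g-1 g0 g1, required gz or None for don't-care, digit).
TABLE_II = [
    ("0000", None, 0), ("0001", None, 1), ("0010", None, 1), ("0011", None, 2),
    ("0100", None, 2), ("0101", None, 3), ("0110", None, 3),
    ("1001", 1, -3), ("1010", None, -3), ("1011", 0, -3), ("1011", 1, -2),
    ("1100", None, -2), ("1101", 1, -1), ("1110", None, -1), ("1111", 1, 0),
]


def table_digit(g_bits):
    """Look up an 8-bit pattern of g in Table II; return None if no row matches."""
    top = format(g_bits >> 4, "04b")          # g-2 g-1 g0 g1
    gz = 1 if g_bits & 0x0F else 0            # logical OR of g2 ... g5
    for pattern, want_gz, digit in TABLE_II:
        if pattern == top and want_gz in (None, gz):
            return digit
    return None


def rounded_digit(g_bits):
    """Digit obtained by sign(g) * floor(|g| + 1/2) on the same pattern."""
    signed = g_bits - 256 if g_bits & 0x80 else g_bits
    g = Fraction(signed, 32)                  # 5 fractional bits
    sign = -1 if g < 0 else 1
    return sign * floor(abs(g) + Fraction(1, 2))
```

Every pattern covered by a row of the table yields the same digit as the arithmetic rounding rule.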

D. Optimizing the Recurrence Implementation

A straightforward implementation of the recurrence is shown in Fig. 7. There are several opportunities for its optimization:

• Since the residual is in the range (−57/64, 57/64), only one integer bit is required to store the value of the residual. Based on this observation there is no need to find the sum of bits with weight greater than $2^0 = 1$.

• The recurrence implementation can be optimized in the most significant bits by using the non-redundant value computed for selection in the adder instead of the redundant form stored in the registers. Although adding a short CPA delay to the most significant bits seems counter-intuitive as an optimization, it turns out that for all fitting attempts to the Stratix II architecture this path was not the critical path; paths with routing delays dominated the critical path (short carry chains do not exhibit routing delays, as there are dedicated carry paths in Adaptive Logic Modules [15]). Since using the non-redundant portion did not introduce a new critical path and reduced the number of input bits, it served as a pragmatic optimization technique. The non-redundant approximation $g$ computed for selection can be used in the addition as opposed to using the [6:2] adders. This simplifies the [6:2] adder to a [5:2] adder requiring 4 lateral carries (as opposed to a conventional [5:2] adder, which only requires 3 lateral carries), which we denote as $[5:2]^4$; the lateral carries come from the previous [6:2] adder. The interface between the [6:2] adder, the $[5:2]^4$ adder and the XOR slice is shown in Fig. 9.

• The $[5:2]^4$ adder produces both $s_1^{j+1}$ and $c_0^{j+1}$; it is unnecessary to produce $s_0^{j+1}$ with the same module since we will discard $c_{-1}^{j+1}$. Bit $s_0^{j+1}$ is just the sum modulo 2 of all bits of weight 1 plus the lateral carries, which can be computed via exclusive-ORs (XOR).

Applying all mentioned optimizations we get the improved design shown in Fig. 8.

Fig. 7. First implementation of recurrence reduction. Each rectangular box represents some functional block, where the bits inside show the inputs to that block and the bits beneath show the corresponding outputs. There are 5 bits produced by the [6:2] adder in this figure which have a shaded square background to signify that these output bits do not drive any logic and are left "open".

III. DESIGN METHODOLOGY AND RESULTS

A. Methodology

The proposed designs were written at the RTL level using VHDL and simulated for functional correctness with Modelsim-Altera Edition 8.1. They were mapped to an Altera Stratix II architecture using Quartus II 8.1 flow tools. The Quartus Classic Timing Analyzer was used to determine the timing characteristics of the circuit in addition to placing constraints on ROM look-up and prescaling registers to inform the tool of multi-cycle paths.

The multiplies performed in the prescaling module map to the Altera DSP blocks for precisions up to 36 bits; these blocks support up to 36×36 multiplication. It does not really make sense to go beyond this precision, as the current design choice is targeted at an architecture which supports fast multipliers. For larger precisions it seems more sensible to design an efficient custom rectangular multiplier.

Fig. 8. Optimized implementation of recurrence reduction. The effective reduction is visualized: the shaded circles signify input bits that were removed as they were deemed unnecessary.

B. Implementation Area and Delay Characteristics

The results show the number of ALUTs (Adaptive LUTs [15]), of which there are two in every ALM (Adaptive Logic Module), the basic building block for logic in Altera Stratix II devices. The DSP blocks on Stratix II architectures support either eight 9×9 multiplies, four 18×18 multiplies, or one 36×36 multiply. The proposed design is limited by the availability of multiplication units, and therefore we have only reported results for two design points, one utilizing a single DSP block with four 18×18 multipliers and the other using four DSP blocks each performing 36×36 multiplies. The results include on-the-fly conversion costs.
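The timing rows of Table III are related by simple arithmetic, which the following Python sketch makes explicit. It assumes that the reported latency is the total cycle count (ROM look-up, prescaling, and iterations) multiplied by the critical path, and that the maximum frequency is the reciprocal of the critical path; the function names are illustrative.

```python
def latency_ns(lookup_cycles, prescale_cycles, iteration_cycles, critical_path_ns):
    """Total latency, assuming every cycle lasts one critical-path delay."""
    total_cycles = lookup_cycles + prescale_cycles + iteration_cycles
    return total_cycles * critical_path_ns


def fmax_mhz(critical_path_ns):
    """Maximum clock frequency in MHz for a given critical path in ns."""
    return 1000.0 / critical_path_ns
```

With 3 look-up cycles and 2*4 = 8 prescaling cycles, the two design points give (11 + 8) × 5.685 ns ≈ 108 ns and (11 + 16) × 5.764 ns ≈ 156 ns, in agreement with the table.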

Fig. 9. Interface of the [6:2] adder, the $[5:2]^4$ adder and the XOR slice.

TABLE III
Results for precision 16 and 36 complex division units implemented on an Altera Stratix II FPGA.

Precision [bits]               16        36
ALUTs                          566       1185
DSP block (9-bit elements)     8         36
Registers (FFs)                318       598
M4K RAM blocks                 8         8
Critical path (ns)             5.685     5.764
Max. frequency (MHz)           175.90    173.49
Prescaling look-up (cycles)    3         3
Prescaling (cycles)            2*4       2*4
Total prescaling (cycles)      11        11
Iterations (cycles)            8         16
Total time (latency) (ns)      108       156

The most common scenario we foresee a designer will face when determining the usefulness of a complex division unit is comparing its performance to a software-based solution. One such software solution, presented in [14], is based on the following,

$$\frac{a + jb}{c + jd} = \begin{cases} \dfrac{a + b(d/c)}{c + d(d/c)} + j\,\dfrac{b - a(d/c)}{c + d(d/c)} & \text{if } |c| \ge |d| \\[2ex] \dfrac{b + a(c/d)}{d + c(c/d)} + j\,\dfrac{b(c/d) - a}{d + c(c/d)} & \text{if } |d| \ge |c| \end{cases} \tag{30}$$

which requires significantly more arithmetic operations: 4 conventional divisions and 3 multiplications. A complex divider has been described in [16], implementing Smith's formula with a pipelined multiplier, divider, and adder for an 8-bit precision (+4 guard bits). The scheme uses a small number of Xilinx Virtex-II slices and operates at 100 MHz. Another design for a complex divider is proposed in [17]. It uses an algorithm similar to SRT division. It also has an efficient implementation, with a latency for 15-bit precision of about 600 ns and a throughput of 1.6 MHz. These two approaches are not comparable to our higher-radix approach in terms of speed. They have the advantage that there is no prescaling and no tables for prescaling factors. The radix-2 complex online arithmetic developed in [9] is not directly comparable to our implementation.

IV. CONCLUSIONS AND FUTURE WORK

We presented the design and implementation of a radix-4 complex division unit with a single prescaling table. The implementation on an Altera Stratix II FPGA device requires 1185 ALUTs, with a critical path of 5.764 ns and a maximum frequency of 173.49 MHz. The prescaling table requires 2K words of 16 bits. To our knowledge no comparable implementation exists at this time, and our results provide a point of reference for other hardware-based designs. In future work we plan to explore the use of multipartite tables to reduce the table requirements, in addition to developing specialized rectangular multipliers to enable higher-radix designs.

Acknowledgments. We thank Altera Corporation for providing the tools and FPGA devices used in this research.

REFERENCES

[1] A. F. Molisch. Wireless Communications. John Wiley and Sons Ltd., 2005.

[2] J. X., L. Guo, Y. Chen, and J. Zhang. Study of GPS Adaptive Antenna Technology Based on Complex Number AACA. IEEE International Conference on Wireless Communications, Networking and Mobile Computing, 2008, pp. 1-4.

[3] S.R. Dicker et al. CMB observations with the Jodrell Bank - IAC interferometer at 33 GHz. Mon. Not. R. Astron. Soc., 2000, 00:1-12.

[4] G. Vandersteen et al. Comparison of arithmetic functions with respect to Boolean circuits. In 58th ARFTG Conference Digest RF Measurements for a Wireless World, 2001, pp. 466-470.

[5] M.D. Ercegovac and T. Lang. Digital Arithmetic. Morgan Kaufmann Publishers, San Francisco, 2004.

[6] M.D. Ercegovac and J.-M. Muller. Complex Division with Prescaling of Operands. IEEE International Conference on Application-Specific Systems, Architectures and Processors, pp. 293-303, 2003.

[7] M.D. Ercegovac and J.-M. Muller. Design of a complex divider. Proc. SPIE on Advanced Signal Processing Algorithms, Architectures, and Implementations XII, pp. 51-59, 2004.

[8] M.D. Ercegovac and J.-M. Muller. Complex Square Root with Operand Prescaling. IEEE International Conference on Application-Specific Systems, Architectures and Processors, pp. 293-303, 2004.

[9] R.D. McIlhenny. Complex Number On-line Arithmetic for Reconfigurable Hardware: Algorithms, Implementations, and Applications. Ph.D. Dissertation, Computer Science Department, University of California, 2002.

[10] V. Oklobdzija, D. Villeger and T. Soulas. An Integrated Multiplier for Complex Numbers. J. of VLSI Signal Processing, vol. 7, no. 3, pp. 213-222, May 1994.

[11] A.F. Tenca and M.D. Ercegovac. Design of high-radix digit slices for online computations. In SPIE Conference on High-Speed Computing, Digital Signal Processing, and Filtering Using Reconfigurable Logic, Bellingham, 1996.

[12] B.W.Y. Wei, H. Du, and H. Chen. A Complex-Number Multiplier Using Radix-4 Digits. Proc. 12th IEEE Symposium on Computer Arithmetic, pp. 84-90, 1995.

[13] P. Dormiani, M.D. Ercegovac, and J.-M. Muller. On the Design and Implementation of Complex-valued Division Unit with Operands Prescaling. Computer Science Department, UCLA, Internal Report, 2009.

[14] R.L. Smith. Algorithm 116: Complex division. Communications of the ACM, 5(8):435, 1962.

[15] http://www.altera.com/

[16] F. Edman and V. Oewall. Fixed-point Implementation of a Robust Complex Valued Divider Architecture. Proceedings of ECCTD05, Cork, Ireland, August 2005.

[17] J. Liu, B. Weaver and Y. Zakharov. FPGA Implementation of Multiplication-Free Complex Division. Electronic Letters, 17th January 2008, Vol. 44, No. 2.