HAL Id: ensl-00379147
https://hal-ens-lyon.archives-ouvertes.fr/ensl-00379147v2
Submitted on 13 Jan 2010
Design and Implementation of a Radix-4 Complex Division Unit with Prescaling
Pouya Dormiani, Milos Ercegovac, Jean-Michel Muller
To cite this version:
Pouya Dormiani, Milos Ercegovac, Jean-Michel Muller. Design and Implementation of a Radix-4
Complex Division Unit with Prescaling. 20th IEEE International Conference on Application-specific
Systems, Architectures and Processors (ASAP’09), Jul 2009, Boston, United States. ⟨ensl-00379147v2⟩
Design and Implementation of a Radix-4 Complex Division Unit with Prescaling
Pouya Dormiani
Computer Science Department University of California at Los Angeles
Los Angeles, CA 90024, USA Email: [email protected]
Miloš D. Ercegovac
4731H Boelter Hall Computer Science Department University of California at Los Angeles
Los Angeles, CA 90024, USA Email: [email protected]
Jean-Michel Muller
CNRS-Laboratoire CNRS-ENSL-INRIA-UCBL LIP Ecole Normale Supérieure de Lyon
46 Allée d’Italie 69364 Lyon Cedex 07, France Email: [email protected]
Abstract—We present a design and implementation of a radix-4 complex division unit with prescaling of the operands.
Specifically, we extend the treatment of the residual bound and errors due to the use of truncated redundant representation. The requirements for prescaling tables are simplified and a detailed specification of the table design is given. All principal components used in the design are described and the proposed optimizations are explained. The target platform for implementation was an Altera Stratix II FPGA [15] for which we report timing and area requirements. For a precision of 36 bits, the implementation uses 1185 ALUTs, achieving a latency of 157 ns. The maximum clock frequency is 173.49 MHz.
I. INTRODUCTION
Complex division is used in applications such as signal processing (e.g., the complex SVD), multiantenna systems (MIMO-type) [1], GPS [2], astronomy [3], and non-linear RF measurement [4]. Unlike for complex multipliers [10], [12], its implementation has been commonly provided in software.
To improve its performance, a hardware implementation is considered. With that objective, a hardware-oriented algorithm and the corresponding theory for general radix-r complex-valued division based on a digit-recurrence algorithm were introduced in [6]. A high-level design of a complex divider is discussed in [7] without implementation details. In this paper we focus on the design and implementation of a radix-4 complex-valued division unit with the quotient-digit set {−3, . . . , 3}. The operands and the result are in fractional fixed-point form. We also refine some of the derivation results from [6] to improve the implementation.
Specifically, with the dividend $z = z^R + i z^I$ and divisor $d = d^R + i d^I$, $i = \sqrt{-1}$, the design discussed computes $q = z/d$. A high-level description of the algorithm is

Initialization: $j = 0$
$$w[0] = z \tag{1}$$
Recurrence iterations: $j = 1, \ldots, n$
$$q_{j+1} = \mathrm{Sel}(4w[j], y) \tag{2}$$
$$w[j+1] = 4w[j] - q_{j+1}\,y \tag{3}$$
Result:
$$q = \frac{z}{d} = 0.q^R_1 q^R_2 q^R_3 \ldots q^R_n + i\,0.q^I_1 q^I_2 q^I_3 \ldots q^I_n \tag{4}$$
The recurrence for complex division corresponds to the conventional real-valued division discussed in [5], and similar conditions such as containment, continuity, and bounded residuals apply. The complex residual is $w[j] = w^R[j] + i\,w^I[j]$. The quotient digits are $q_{j+1} = q^R_{j+1} + i\,q^I_{j+1}$, with the real and imaginary components $q^R_{j+1}, q^I_{j+1} \in \{-3, \ldots, 3\}$. These signed digits can be converted during the iterations using on-the-fly conversion [5] to obtain a conventional representation of the result. The complex residual recurrence decomposes into two separate recurrences for the real and imaginary parts, which can be computed in parallel:
$$w^R[j+1] = 4w^R[j] - q^R_{j+1} d^R + q^I_{j+1} d^I \tag{5}$$
$$w^I[j+1] = 4w^I[j] - q^R_{j+1} d^I - q^I_{j+1} d^R \tag{6}$$
where $w^R[0] = z^R$ and $w^I[0] = z^I$. The quotient-digit selection in the complex domain is a two-dimensional problem because both $q^R_{j+1}$ and $q^I_{j+1}$ must be selected in such a way that the real and imaginary residuals $(w^R[j], w^I[j])$ remain bounded. This is much more difficult than the single-digit selection used in the real case. We solve this problem by scaling the operands by a factor $K$ such that $Kz/Kd = x/y$, where $y = Kd \approx 1$. Consequently, $y^R \approx 1$ and $y^I \approx 0$, and the selection of $q^R_{j+1}$ and $q^I_{j+1}$ can be performed on the real and the imaginary shifted residuals separately, in a manner similar to real-valued division selection. To determine the prescaling factor $K$, we assume that
$$\|Kd - 1\|_\infty < s \tag{7}$$
where $\|\alpha\|_\infty = \max(|\alpha^R|, |\alpha^I|)$.
After the prescaling step, the recurrences are
$$w^R[j+1] = 4w^R[j] - q^R_{j+1} y^R + q^I_{j+1} y^I \tag{8}$$
$$w^I[j+1] = 4w^I[j] - q^R_{j+1} y^I - q^I_{j+1} y^R \tag{9}$$
where $w^R[0] = x^R$ and $w^I[0] = x^I$. Because the scaling makes $y^I \approx 0$ and $y^R \approx 1 - s$, the selection of the real part of the quotient digit can be performed by rounding the shifted real residual and taking its integer part. The same holds for the selection of the imaginary part of the quotient digit. Moreover, we can use estimates with $\sigma$ fractional positions of the shifted residuals $4w^R[j]$ and $4w^I[j]$ in the selection. Consequently, the residuals can be computed in redundant form to keep the cycle time short. The selection functions are
$$q^R_{j+1} = \mathrm{Sel}(\mathrm{est}(4w^R[j], \sigma)) = \mathrm{sign}(4w^R[j]) \times \left\lfloor |\mathrm{est}(4w^R[j], \sigma)| + \tfrac{1}{2} \right\rfloor \tag{10}$$
$$q^I_{j+1} = \mathrm{Sel}(\mathrm{est}(4w^I[j], \sigma)) = \mathrm{sign}(4w^I[j]) \times \left\lfloor |\mathrm{est}(4w^I[j], \sigma)| + \tfrac{1}{2} \right\rfloor \tag{11}$$
The selection function $\mathrm{Sel}$ satisfies
$$|\mathrm{Sel}(\mathrm{est}(x, \sigma)) - x| < \tfrac{1}{2} + 2^{-\sigma} \tag{12}$$
Here $\mathrm{est}(x, \sigma)$ is $x$ truncated to $\sigma$ fractional positions, with an error bound
$$\mathrm{estERR}(x, \sigma) = |x - \mathrm{est}(x, \sigma)| < 2^{-\sigma}$$
If $x$ is in carry-save form, $x = x_C + x_S$, then truncating the carry and sum vectors to $\sigma + 1$ fractional bits results in the same maximum error, i.e., $\mathrm{estERR}(x, \sigma) < 2^{-\sigma}$ and $\mathrm{estERR}(x_C, \sigma+1) + \mathrm{estERR}(x_S, \sigma+1) < 2^{-\sigma}$.
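The selection rule (10)/(11) and the bound (12) can be checked numerically. The following is a minimal sketch, assuming that truncation to $\sigma$ fractional bits behaves as a floor (as it does for two's complement values); the names `est`, `sign`, and `sel` are illustrative and exact rational arithmetic stands in for the hardware datapath.

```python
from fractions import Fraction
from math import floor

def est(x, sigma):
    """Truncate x to sigma fractional bits (floor, as for two's complement)."""
    scale = 2 ** sigma
    return Fraction(floor(x * scale), scale)

def sign(x):
    """Return -1, 0 or +1 according to the sign of x."""
    return (x > 0) - (x < 0)

def sel(x, sigma):
    """Quotient digit for a shifted residual x = 4w[j], following (10)/(11)."""
    return sign(x) * floor(abs(est(x, sigma)) + Fraction(1, 2))

# Example: sigma = 4, shifted residual 4w = 2.5 rounds to digit 3.
digit = sel(Fraction(5, 2), 4)
```

In this model the difference between the selected digit and the shifted residual stays strictly below $1/2 + 2^{-\sigma}$ for every input tried, which is the property the residual bound relies on.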
Using (10), (11) and (12), a bound on the residual is deduced which ensures that the digit $(q^R_{j+1}, q^I_{j+1})$ selected by rounding is in the digit set $\{-3, \ldots, 3\}$. Namely,
$$\|w[j]\|_\infty \le \frac{1}{4}\left(3 + \frac{1}{2} + 2^{-\sigma}\right) \tag{13}$$
As shown in [6], assuming that the scaling error is $s$ and $a = 3$, the residual is bounded by
$$\|w[j]\|_\infty < 2 \times 3 \times s + \frac{1}{2} + 2^{-\sigma} \tag{14}$$
Consequently,
$$6s + \frac{1}{2} + 2^{-\sigma} \le \frac{1}{4}\left(3 + \frac{1}{2} + 2^{-\sigma}\right) \tag{15}$$
Satisfying this condition guarantees convergence of the digit-recurrence algorithm and allows the choice of $s$ and $\sigma$ to optimize the implementation characteristics.
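As a worked instance of condition (15), the largest admissible scaling error $s$ for a given $\sigma$ can be computed exactly. This is a small sketch for radix 4 with $a = 3$ only; the function names are illustrative and not taken from the design.

```python
from fractions import Fraction

def residual_limit(sigma):
    """Right-hand side of (13) and (15): (3 + 1/2 + 2^-sigma) / 4."""
    return Fraction(1, 4) * (3 + Fraction(1, 2) + Fraction(1, 2 ** sigma))

def max_scaling_error(sigma):
    """Largest s satisfying 6s + 1/2 + 2^-sigma <= residual_limit(sigma)."""
    slack = residual_limit(sigma) - Fraction(1, 2) - Fraction(1, 2 ** sigma)
    return slack / 6

s_max = max_scaling_error(4)   # Fraction(7, 128) for sigma = 4
```

For $\sigma = 4$ this gives $s \le 7/128 \approx 0.055$, which is the accuracy the prescaling table has to deliver under this condition.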
II. DESIGN
The design of the complex division unit consists of several components: the prescaling module, the recurrence modules for the real and imaginary parts, the on-the-fly converters to obtain conventional representations, and a simple controller.
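The on-the-fly converters are not detailed in this paper. As a hedged illustration, the sketch below follows the standard on-the-fly conversion scheme as we understand it from [5], keeping two values Q and QM = Q − 1 so that no carry propagation is needed when a negative digit arrives. Python integers stand in for the concatenated registers, and the function name is hypothetical.

```python
def on_the_fly_convert(digits, radix=4):
    """Convert signed digits (most significant first) to an integer Q.

    The quotient value is Q * radix**(-len(digits)).
    Invariant after every step: qm == q - 1.
    """
    q, qm = 0, -1
    for dgt in digits:
        # Appending a digit never needs a borrow: negative digits are
        # appended to QM instead of Q.
        new_q = radix * q + dgt if dgt >= 0 else radix * qm + (radix + dgt)
        new_qm = radix * q + (dgt - 1) if dgt > 0 else radix * qm + (radix - 1 + dgt)
        q, qm = new_q, new_qm
    return q

value = on_the_fly_convert([1, -3, 2, 0, -1])   # 95, i.e. 95 / 4**5
```

In hardware each update is a register load plus a digit concatenation, which is why the conversion can run alongside the iterations.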
A high-level block diagram of the design is shown in Fig. 1, with the timing shown in Fig. 2. The prescaling module in Fig. 1 performs a ROM look-up using a short estimate of the value of the divisor d as an address; the ROM stores K = 1/d. It then computes the complex product Kz, which is used to initialize w[0] in the recurrence modules. The prescaling module computes Kd in parallel with the initialization of the recurrence modules; Kd is then used to perform the iterations of the recurrence. The initial delay of the prescaling module can be amortized by overlapping the prescaling of the next operation with the digit-recurrence iterations of the current operation; this, however, has not been done in the current implementation. Detailed design of the prescaling is discussed in Section II-A.

Fig. 1. High-level block diagram of the complex division unit.

Fig. 2. Timing relationships between modules.
The two recurrence modules (one for the real recurrence and one for the imaginary) perform nearly identical operations which can be mapped to the same hardware. Detailed design of the recurrence module is discussed in Section II-C.
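The overall data flow (prescale, then iterate both recurrences with digit selection by rounding) can be modelled in software. The sketch below is a behavioural model under stated assumptions, not the hardware: exact rationals replace the carry-save residuals, and `prescale_factor` assumes that the table entry is the complex reciprocal of the rounded divisor, rounded componentwise to t fractional bits. All function names are illustrative.

```python
from fractions import Fraction
from math import floor

def rnd(a, b):
    """Round a to b fractional bits."""
    return Fraction(round(a * 2 ** b), 2 ** b)

def cmul(u, v):
    """Complex product of (real, imag) pairs."""
    return (u[0] * v[0] - u[1] * v[1], u[0] * v[1] + u[1] * v[0])

def sel(x, sigma=4):
    """Digit selection by rounding a truncated estimate, as in (10)/(11)."""
    e = Fraction(floor(x * 2 ** sigma), 2 ** sigma)
    return ((x > 0) - (x < 0)) * floor(abs(e) + Fraction(1, 2))

def prescale_factor(d, q=6, t=6):
    """Assumed table content: 1/rnd(d, q), rounded to t fractional bits."""
    dr, di = rnd(d[0], q), rnd(d[1], q)
    m = dr * dr + di * di
    return (rnd(dr / m, t), rnd(-di / m, t))

def complex_divide(z, d, n=8):
    """Return (quotient, y) after n radix-4 iterations."""
    k = prescale_factor(d)
    w = cmul(k, z)                      # x = Kz initialises the residual
    y = cmul(k, d)                      # y = Kd, close to 1
    quot = (Fraction(0), Fraction(0))
    for j in range(1, n + 1):
        digit = (sel(4 * w[0]), sel(4 * w[1]))
        assert all(-3 <= c <= 3 for c in digit)
        qy = cmul(digit, y)
        w = (4 * w[0] - qy[0], 4 * w[1] - qy[1])   # recurrences (8), (9)
        quot = (quot[0] + Fraction(digit[0], 4 ** j),
                quot[1] + Fraction(digit[1], 4 ** j))
    return quot, y

z = (Fraction(1, 5), Fraction(-1, 10))
d = (Fraction(3, 5), Fraction(7, 10))
quotient, y = complex_divide(z, d)
```

For this operand pair the model keeps every digit inside {−3, . . . , 3} and the quotient agrees with z/d to within $4^{-n}$ in each component.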
A. Prescaling
Prescaling consists of several steps: obtaining the factor $K$ from a table based on a short-precision estimate of $d$, and computing $Kz$ and $Kd$.
We define a function $\mathrm{rnd}(a, b)$ which returns the value of $a$ rounded to $b$ fractional places, such that $|a - \mathrm{rnd}(a, b)| \le \frac{1}{2} 2^{-b}$. The factor $K$ can be determined by using a short estimate of $d$ to $q$ fractional positions, i.e., $\mathrm{rnd}(d^R, q), \mathrm{rnd}(d^I, q)$, as an address to a ROM which stores the corresponding values of $K$ with a precision of $t$ fractional positions,
$$K^R = \mathrm{rnd}(1/\mathrm{rnd}(d^R, q), t) \qquad K^I = \mathrm{rnd}(1/\mathrm{rnd}(d^I, q), t)$$
Error analysis for the choices of parameters $q$ and $t$ is performed in [13]. These effects are accounted for in (15) to guarantee convergence of the algorithm. Radix 4 with digit set $\{-3, \ldots, 3\}$ offers the most favorable choice of parameters by minimizing the number of bits required for the ROM among radices 4, 8, and 16; only radix 2 has lower memory requirements.
Over-redundant digit sets are another design choice, but we decided to restrict our design to the maximally redundant digit set, which allows faithful rounding [6].
TABLE I
MEMORY (ROM) REQUIREMENTS FOR DIFFERENT RADIX (r), DIGIT SET {−a, . . . , a}, PRECISION OF RESIDUAL ESTIMATE FOR SELECTION (σ), PRECISION OF d USED TO PERFORM TABLE LOOK-UP (q), AND PRECISION OF TABLE ENTRIES (t).

r       a    σ    q    t    KBits (approx.)
2       1    4    5    5    7.5
4       2    5    7    7    146
4       3    4    6    6    33
8, 16   see [13]            ≥ 146
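The value of the divisor is in the usual range
$$\tfrac{1}{2} \le \|d\|_\infty < 1 \tag{16}$$
noting that larger values can be scaled to this range. Its estimate $\mathrm{rnd}(d, q)$ can be represented as two two's complement numbers for the real and imaginary parts
$$\mathrm{rnd}(d, q) = \mathrm{rnd}(d^R, q) + i\,\mathrm{rnd}(d^I, q) \tag{17}$$
$$\mathrm{rnd}(d^R, q) = \kappa^R_0.\kappa^R_1 \kappa^R_2 \kappa^R_3 \ldots \kappa^R_{q-1} \kappa^R_q \tag{18}$$
$$\mathrm{rnd}(d^I, q) = \kappa^I_0.\kappa^I_1 \kappa^I_2 \kappa^I_3 \ldots \kappa^I_{q-1} \kappa^I_q \tag{19}$$
An additional bit $\kappa_{-1}$ is required (to represent $+1$) as $\|\mathrm{rnd}(d, q)\|_\infty \le 1$, which will be handled as a special case.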
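To reduce the number of address bits, the table can store corresponding values for $|\mathrm{rnd}(d^R, q)|$ and $|\mathrm{rnd}(d^I, q)|$,
$$|\mathrm{rnd}(d^R, q)| = 0.\alpha^R_1 \alpha^R_2 \alpha^R_3 \ldots \alpha^R_q \tag{20}$$
$$|\mathrm{rnd}(d^I, q)| = 0.\alpha^I_1 \alpha^I_2 \alpha^I_3 \ldots \alpha^I_q \tag{21}$$
which eliminates the need for bits $\kappa^R_0$ and $\kappa^I_0$ (the signs) to be used when forming an address. Likewise, since $\|d\|_\infty \ge \frac{1}{2}$, we know that either $\alpha^R_1 = 1$ or $\alpha^I_1 = 1$ [6] (or both). Had an address been formed using
$$\alpha^R_1 \alpha^R_2 \alpha^R_3 \ldots \alpha^R_q\, \alpha^I_1 \alpha^I_2 \alpha^I_3 \ldots \alpha^I_q$$
then the address would require $2q$ bits. Given that
$$\gamma(d^R, d^I) = \frac{1}{d^R + i d^I} = \frac{d^R - i d^I}{(d^R)^2 + (d^I)^2} \tag{22}$$
$$\gamma(d^I, d^R) = -\gamma(d^R, d^I) \tag{23}$$
we could check whether $\alpha^R_1 = 1$; if so, the address is formed via
$$\alpha^R_2 \alpha^R_3 \ldots \alpha^R_q\, \alpha^I_1 \alpha^I_2 \alpha^I_3 \ldots \alpha^I_q$$
otherwise, it must be true that $\alpha^I_1 = 1$, so the address is formed as
$$\alpha^I_2 \alpha^I_3 \ldots \alpha^I_q\, \alpha^R_1 \alpha^R_2 \alpha^R_3 \ldots \alpha^R_q$$
and the results obtained from the table look-up are negated based on (23). This reduces the number of address bits to $2q - 1$ (halving the memory required) while introducing little additional overhead.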
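The address formation just described can be sketched as follows. This is an illustrative model, assuming the two magnitudes are passed as q-bit integers holding only the fractional bits $\alpha_1 \ldots \alpha_q$; the function name and the returned `swapped` flag are hypothetical, and the special cases with magnitude 1 are not handled here.

```python
def form_address(alpha_r, alpha_i, q=6):
    """Build the (2q - 1)-bit ROM address from two q-bit magnitudes.

    Returns (address, swapped); swapped is True when the imaginary
    magnitude supplies the leading field because alpha^R_1 is 0.
    """
    msb = 1 << (q - 1)       # position of alpha_1
    low = msb - 1            # mask for alpha_2 .. alpha_q
    if alpha_r & msb:
        return ((alpha_r & low) << q) | alpha_i, False
    if alpha_i & msb:
        return ((alpha_i & low) << q) | alpha_r, True
    raise ValueError("neither magnitude has its leading bit set")

address, swapped = form_address(0b100110, 0b101101)   # (429, False)
```

Dropping the known leading 1 is what brings the address down from 2q to 2q − 1 bits, i.e. 11 bits for q = 6.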
Extra care must be taken with the aforementioned approach: although the divisor is assumed to be bounded by $-1 < \|d\|_\infty < 1$, it is not true that $-1 < \mathrm{rnd}(d^R, q) < 1$; in fact $-1 \le \mathrm{rnd}(d^R, q) \le 1$ (the same holds for $\mathrm{rnd}(d^I, q)$). The two's complement representation of the rounded divisor shown in equations (18) and (19) has range $[-1, 1)$. Negating $-1$ in two's complement with the given representation is a special case; recalling that $+1$ is also a special case, the input is divided into two cases: $\|\mathrm{rnd}(d, q)\|_\infty < 1$ and $\|\mathrm{rnd}(d, q)\|_\infty = \pm 1$.
Another special case occurs when negating the results obtained from the table look-up due to the swapping discussed earlier. For positive values of $d^R$ and $d^I$, the real part of $1/d$ is positive and the imaginary part is negative: the sign of the real part follows $d^R$ and the sign of the imaginary part is opposite to that of $d^I$. Since $1/2 \le \|\mathrm{rnd}(d, q)\|_\infty \le 1$ and the table only stores values for positive $d^R$ and $d^I$, we have $0 \le K^R \le 2$ and $-2 \le K^I \le 0$. Therefore the table should only contain the magnitude of the value, which can be represented in $2 + t$ bits. This presents no anomalies if 3 integer bits are used for the negated values, i.e., the ROM will store $2 + t$ bits, but the negated value will be $3 + t$ bits.
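Here we describe the operation of the table incorporating the special cases,
$$\mathrm{rnd}(d^R, q) = \kappa^R_{-1} \kappa^R_0.\kappa^R_1 \kappa^R_2 \kappa^R_3 \ldots \kappa^R_q$$
$$\mathrm{rnd}(d^I, q) = \kappa^I_{-1} \kappa^I_0.\kappa^I_1 \kappa^I_2 \kappa^I_3 \ldots \kappa^I_q$$
$$A^R = |\mathrm{rnd}(d^R, q)| = \alpha^R_0.\alpha^R_1 \alpha^R_2 \alpha^R_3 \ldots \alpha^R_q$$
$$A^I = |\mathrm{rnd}(d^I, q)| = \alpha^I_0.\alpha^I_1 \alpha^I_2 \alpha^I_3 \ldots \alpha^I_q$$
$$A = \begin{cases} \alpha^R_2 \alpha^R_3 \ldots \alpha^R_q\, \alpha^I_1 \alpha^I_2 \alpha^I_3 \ldots \alpha^I_q & \text{if } \alpha^R_1 = 1, \\ \alpha^I_2 \alpha^I_3 \ldots \alpha^I_q\, \alpha^R_1 \alpha^R_2 \alpha^R_3 \ldots \alpha^R_q & \text{otherwise} \end{cases}$$
$$A_s = \begin{cases} \alpha^I_1 \alpha^I_2 \alpha^I_3 \ldots \alpha^I_q & \text{if } A^R = \pm 1, \\ \alpha^R_1 \alpha^R_2 \alpha^R_3 \ldots \alpha^R_q & \text{otherwise} \end{cases}$$
$$(U^R, U^I) = \begin{cases} (1/2, -1/2) & \text{if } A^R = 1,\ A^I = 1, \\ (1/2, 1/2) & \text{if } A^R = 1,\ A^I = -1, \\ (-1/2, -1/2) & \text{if } A^R = -1,\ A^I = 1, \\ (-1/2, 1/2) & \text{if } A^R = -1,\ A^I = -1, \\ \mathrm{ROM}_s[A_s] & \text{if } A^R = \pm 1 \text{ and } A^I \ne \pm 1, \text{ or } A^I = \pm 1 \text{ and } A^R \ne \pm 1, \\ \mathrm{ROM}[A] & \text{otherwise} \end{cases}$$
$$neg_R = \begin{cases} 1 & \text{if real and imaginary swapped}, \\ 0 & \text{otherwise} \end{cases} \qquad neg_I = \begin{cases} 1 & \text{if real and imaginary not swapped}, \\ 0 & \text{otherwise} \end{cases}$$
$$K^R = (-1)^{neg_R}\, U^R \qquad K^I = (-1)^{neg_I}\, U^I$$
From Table I, we have $q = 6$ and $t = 6$ for radix $r = 4$ and $a = 3$.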
Fig. 3. Prescaling ROM. The ABS block computes the absolute value of a two's complement number. Blocks rnd(., 6) round their argument to the sixth fractional position. NEG blocks negate their argument, a two's complement number.
• ROM: This ROM has 11 address bits and is 16 bits wide, which can be mapped to 8 Altera Stratix II M4K RAM blocks, constituting less than one percent of the total block memory bits in an EP2S60F672C3 device.
• ROMs: This ROM has 6 address bits and is 16 bits wide, which was mapped to logic and registers.
A schematic corresponding to the described look-up scheme is shown in Fig. 3.
The other two parts of the prescaling step involve computing $Kz$ and $Kd$, which will be used to initialize and carry out the digit-recurrence algorithm. Once $K$ is determined, $x$ and $y$ can be computed via
$$x = (K^R + iK^I)(z^R + iz^I) = (K^R z^R - K^I z^I) + i(K^I z^R + K^R z^I)$$
$$y = (K^R + iK^I)(d^R + id^I) = (K^R d^R - K^I d^I) + i(K^I d^R + K^R d^I)$$
Since multipliers are costly in hardware, the complex-valued products will be computed one at a time. Coincidentally, y = Kd is not required until after the residuals have been initialized with x = Kz, which can be computed in the previous cycles. Figure 4 shows the block diagram for the scaling module. The module uses several signals to control the data path: en_inputs, en_pres, en_sc, and sel_mul. Control signals en_x are clock enable signals to registers, which control when data is latched. Clock enables on registers are used to facilitate multi-cycle paths, which are necessary due to the larger delay of the prescaling logic.
In Fig. 4, en_inputs controls when the inputs to the complex division unit are latched so that the values can be retained throughout the course of the operation. This arrangement is not necessarily unique and depends on how the module is interfaced to other logic. For example, if the external logic feeds the arguments to the complex division unit in two cycles, sending (zR, zI) in the first and (dR, dI) in the second, then only 2 register banks are required for the inputs as opposed to 4. The current design reflects the assumption that the module receives its arguments in the same cycle, i.e., as (zR, zI, dR, dI).
Signal en_pres controls storing of the results of the prescaling ROM look-up, which are retained throughout the course of the operation.
Signals sel_mul and en_sc are used to share the multipliers so that prescaling of the dividend and the divisor occurs in separate prescaling cycles. Although the prescaled value x = Kz is also fed through the registers controlled by en_sc, its value is not retained but overwritten in the next cycle by y = Kd. The same enable signal (en_sc) is used once more to ensure that the value of y is retained in these registers, which feed the recurrence modules discussed in Section II-C.
B. Bounds of Values
It is important to characterize the bounds of the inputs to the complex division module in addition to the bounds of the prescaled values which predetermine the width of inputs to the recurrence modules.
The input $d$ is in the range $1/2 \le \|d\|_\infty < 1$, and our convergence analysis further constrains $\|Kd - 1\|_\infty < s$. This implies that the prescaled value $y$ satisfies
$$\max(|y^R - 1|, |y^I|) < s \;\Rightarrow\; |y^R - 1| < s, \quad |y^I| < s$$
Since $|y^R| < 1 + s$, its representation in two's complement would require 2 integer bits and $n$ fractional bits.
Fig. 4. Prescaling module. The Prescaling ROM block above is the module shown in Fig. 3.
Likewise, the constraint (14) determines the maximum value that the residual could possibly take. For our design point, $\sigma = 4$, which means that the residual is bounded by
$$\|w[j]\|_\infty \le \frac{1}{4}\left(3 + \frac{1}{2} + 2^{-4}\right) = 57/64 \;\Rightarrow\; |w^R[0]| = |x^R| \le 57/64, \quad |w^I[0]| = |x^I| \le 57/64$$
Therefore, the prescaled value $(x^R, x^I)$ requires only a single integer bit and $n$ fractional bits. We are interested in determining a bound on $z$, which we can derive from the bound on $w$,
$$\|w[0]\|_\infty = \|Kz\|_\infty \le 2\|K\|_\infty \|z\|_\infty \le 57/64 \tag{24}$$
Since $\|K\|_\infty \le 2$,
$$\|z\|_\infty \le 57/256 \tag{25}$$
requiring only $n - 1$ fractional bits, with the most significant bit having weight $2^{-2}$.
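The factor 2 in (24) comes from the complex product: each component of Kz is a sum of two real products. A short check with exact rationals, using a hypothetical worst-case pair of operands chosen for illustration, shows that the bound 57/64 is reached when $\|K\|_\infty = 2$ and $\|z\|_\infty = 57/256$.

```python
from fractions import Fraction

def cmul(u, v):
    """Complex product of (real, imag) pairs."""
    return (u[0] * v[0] - u[1] * v[1], u[0] * v[1] + u[1] * v[0])

def inf_norm(u):
    """Infinity norm used in the paper: max(|real|, |imag|)."""
    return max(abs(u[0]), abs(u[1]))

k = (Fraction(2), Fraction(-2))                  # assumed extreme value of K
z = (Fraction(57, 256), Fraction(57, 256))       # largest z allowed by (25)
x = cmul(k, z)                                   # x = Kz
```

Here `inf_norm(x)` equals 57/64, so a larger dividend could push the initial residual outside the range that one integer bit can hold.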
C. Digit-Recurrence Iterations
The digit-recurrence iterations compute the residuals (5) and (6) and perform quotient-digit selection based on a short non-redundant estimate of the residuals, as shown in Eq. (10) and (11).
The recurrences in (5) and (6) are structurally the same. Namely,
$$w[j+1] = 4w[j] + \sigma_1 y^R + \sigma_2 y^I \tag{26}$$

Fig. 6. Real recurrence module. Blocks ×4 shift their argument left by 2 binary places. Blocks MG compute σ times their argument using the $\sigma^i_k$ decomposition discussed. The CPA module is a carry-propagate adder which computes a short non-redundant estimate of the residual. The Sel module takes this estimate as its argument and outputs the next quotient digit.

The residuals are computed in redundant form in order to reduce the cycle time by eliminating the need for long carry chains. In our implementation we used a carry-save form. The operation is expressed as
$$(w_C[j+1], w_S[j+1]) = \mathrm{ADD}_{[6:2]}(4w_C[j],\, 4w_S[j],\, \sigma^1_1 y^R,\, 2\sigma^2_1 y^R,\, \sigma^1_2 y^I,\, 2\sigma^2_2 y^I) \tag{27}$$
where $\mathrm{ADD}_{[6:2]}(a, b, c, d, e, f)$ is a [6:2] carry-save adder taking 6 inputs and producing a carry vector and a sum vector, shown in Fig. 5. The digits $\sigma_1$ and $\sigma_2$ are in the digit set $\{-3, \ldots, 3\}$, so we implement this digit multiplication by decomposing $\sigma_k = 2\sigma^2_k + \sigma^1_k$, where $\sigma^i_k \in \{-1, 0, 1\}$ (the superscript is an index, not a power). Multiplying by negative one is achieved by inverting the input and adding a carry-in to the reduction module. A block diagram of the structure used to compute the real recurrence is shown in Fig. 6.
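The digit decomposition and the carry-save reduction can be illustrated with a functional model. The sketch below is an assumption-laden simplification: it builds the 6-to-2 reduction from four generic 3:2 carry-save adders on Python integers and does not reproduce the slice structure or lateral carries of Fig. 5. All names are illustrative.

```python
def decompose(digit):
    """Split a digit in {-3..3} into (s1, s2), each in {-1, 0, 1},
    such that digit == 2 * s2 + s1."""
    sgn = (digit > 0) - (digit < 0)
    mag = abs(digit)
    return sgn * (mag & 1), sgn * (mag >> 1)

def csa(a, b, c):
    """3:2 carry-save adder: returns (sum, carry) with sum + carry == a + b + c."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def add_6_2(a, b, c, d, e, f):
    """Reduce six operands to a (carry, sum) pair without carry propagation."""
    s1, c1 = csa(a, b, c)
    s2, c2 = csa(d, e, f)
    s3, c3 = csa(s1, c1, s2)
    s4, c4 = csa(s3, c3, c2)
    return c4, s4

carry, total = add_6_2(13, 7, 250, 3, 99, 42)   # carry + total == 414
```

Only the final conversion of the (carry, sum) pair needs a carry-propagate addition, which is why the recurrence can keep its cycle time short.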
Digit selection is performed by taking a short-precision estimate of the residual and rounding it to the nearest integer via a small CPA and table. In the discussion that follows we generally say residual without referring specifically to the real or imaginary part; the analysis holds for both residuals $w^R$ and $w^I$. In Section II-B we determined that the residual has a single integer bit and $n$ fractional bits, i.e., it is of the form $w = w_0.w_1 w_2 \ldots w_n$ with value $\sum_{i=0}^{n} w_i 2^{-i}$. In redundant form, $w = w_c + w_s$ where $(w_c, w_s) = (C_0.C_1 C_2 C_3 \ldots C_n,\ S_0.S_1 S_2 S_3 \ldots S_n)$. Recalling that selection is performed via,

Fig. 5. [6:2] adder module. The adder consists of three different slices: the least significant slice, which sums 6 arguments and takes 4 carry-ins; the repeat slice, which sums 6 arguments, takes 4 lateral carries, and produces 4 lateral carries to the subsequent slice; and the most significant slice.
$$q_{j+1} = \mathrm{Sel}(\mathrm{est}(4w[j], \sigma))$$
where $\sigma = 4$ as determined in Section II-A, we know that
$$\mathrm{est}(4w[j], 4) = w_0 w_1 w_2.w_3 w_4 w_5 w_6 = \sum_{i=0}^{6} w_i 2^{-i+2}, \qquad \mathrm{estERR}(4w[j], 4) < 2^{-4}$$
Now, since $w$ is in redundant form,
$$\mathrm{est}(4w_c[j], 5) = C_0 C_1 C_2.C_3 C_4 C_5 C_6 C_7$$
$$\mathrm{est}(4w_s[j], 5) = S_0 S_1 S_2.S_3 S_4 S_5 S_6 S_7$$
$$g = \mathrm{est}(4w_c[j], 5) + \mathrm{est}(4w_s[j], 5) \tag{28}$$
$$\mathrm{estERR}(4w_c[j], 5) + \mathrm{estERR}(4w_s[j], 5) < 2^{-4}$$
which gives us the short-precision estimate of the residual, $g$.
It is important to realize that $g \ne \mathrm{est}(4w[j], 4)$ in general, but that they commit the same maximum error $2^{-4}$ in their approximation of $w[j]$. The addition in equation (28) requires the CPA that we have been referring to during this discussion.
$$g_{-2}\, g_{-1}\, g_0.g_1 \ldots g_5 = \mathrm{CPA}(C_0 C_1 C_2.C_3 \ldots C_7,\ S_0 S_1 S_2.S_3 \ldots S_7) \tag{29}$$
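The error claim for the carry-save estimate (28) can be checked on fixed-point integers. In this sketch the residual width `N` is an assumed parameter, values are stored as integers scaled by $2^N$, and truncation is modelled with arithmetic shifts; the helper names are illustrative.

```python
N = 16  # assumed number of fractional bits in each carry-save vector

def truncate(v, frac_bits, n=N):
    """Keep only frac_bits fractional bits of v (an integer scaled by 2**n)."""
    drop = n - frac_bits
    return (v >> drop) << drop       # arithmetic shift floors negative values too

def estimate_g(wc, ws, n=N):
    """g = est(4*wc, 5) + est(4*ws, 5), following (28); result scaled by 2**n."""
    return truncate(4 * wc, 5, n) + truncate(4 * ws, 5, n)

wc, ws = 12345, -6789                 # arbitrary carry and sum vectors
error = 4 * (wc + ws) - estimate_g(wc, ws)
```

Each truncation loses less than $2^{-5}$, so the combined error of g stays in $[0, 2^{-4})$, matching the single-vector estimate with four fractional bits.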
To round g and take the integer part, one can use a small table as in Table II by introducing an additional variable gz = g2 + g3 + g4 + g5 (i.e., the logical OR of bits g2 through g5).
This table is a function of 5 bits and produces three bits of output (for the encoding of qj+1), and it maps efficiently to LUTs.
TABLE II
ROUNDING TO INTEGER PART.

g−2   g−1   g0   g1   gz   qj+1
0     0     0    0    -    0
0     0     0    1    -    1
0     0     1    0    -    1
0     0     1    1    -    2
0     1     0    0    -    2
0     1     0    1    -    3
0     1     1    0    -    3
1     0     0    1    1    -3
1     0     1    0    -    -3
1     0     1    1    0    -3
1     0     1    1    1    -2
1     1     0    0    -    -2
1     1     0    1    1    -1
1     1     1    0    -    -1
1     1     1    1    1    0
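The rows of Table II can be regenerated from the rounding rule $\mathrm{sign}(g) \times \lfloor |g| + 1/2 \rfloor$. The sketch below assumes $g_{-2}$ is the two's complement sign bit with weight −4 and models a nonzero $g_z$ by a tail of $2^{-5}$ (any nonzero tail below $g_1$ gives the same digit); the function name is illustrative.

```python
from fractions import Fraction
from math import floor

def select_digit(g_m2, g_m1, g0, g1, gz):
    """Digit selected for estimate bits g_-2 g_-1 g_0 . g_1 and the OR gz of g_2..g_5."""
    value = Fraction(-4 * g_m2 + 2 * g_m1 + g0) + Fraction(g1, 2)
    if gz:
        value += Fraction(1, 32)     # representative nonzero tail
    sgn = (value > 0) - (value < 0)
    return sgn * floor(abs(value) + Fraction(1, 2))

digit = select_digit(1, 0, 1, 1, 1)   # estimate just above -2.5 gives -2
```

Applied to the two input patterns that Table II does not list, the same rule yields −2 for (1, 1, 0, 1, 0) and −1 for (1, 1, 1, 1, 0); this is what the formula gives, not a statement about the implemented table.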
D. Optimizing the Recurrence Implementation
A straightforward implementation of the recurrence is shown in Fig. 7. There are several opportunities for its optimization:
• Since the residual is in the range (−57/64, 57/64), only one integer bit is required to store the value of the residual. Based on this observation, there is no need to find the sum of bits with weight greater than $2^0 = 1$.
• The recurrence implementation can be optimized in the most significant bits by using the non-redundant value computed for selection in the adder instead of the redundant form stored in the registers. Although adding a short CPA delay to the most significant bits seems counter-intuitive as an optimization, for all fitting attempts to the Stratix II architecture this path was not the critical path; paths with routing delays dominated the critical path (short carry chains do not exhibit routing delays, as there are dedicated carry paths in Adaptive Logic Modules [15]). Since using the non-redundant portion did not introduce a new critical path and reduced the number of input bits, it served as a pragmatic optimization technique. The non-redundant approximation g computed for selection can be used in the addition as opposed to using the [6:2] adders. This simplifies the [6:2] adder to a [5:2] adder requiring 4 lateral carries (as opposed to a conventional [5:2] adder, which only requires 3 lateral carries), which we denote as $[5:2]_4$; the lateral carries come from the previous [6:2] adder. The interface between the [6:2] adder, the $[5:2]_4$ adder, and the XOR slice is shown in Fig. 9.
• The $[5:2]_4$ adder produces both $s^{j+1}_1$ and $c^{j+1}_0$; it is unnecessary to produce $s^{j+1}_0$ with the same module since we will discard $c^{j+1}_{-1}$. Bit $s^{j+1}_0$ is just the sum modulo 2 of all bits of weight 1 plus the lateral carries, which can be computed via exclusive-ORs (XOR).

Fig. 7. First implementation of recurrence reduction. Each rectangular box represents a functional block, where the bits inside show the inputs to that block and the bits beneath show the corresponding outputs. There are 5 bits produced by the [6:2] adder in this figure which have a shaded square background to signify that these output bits do not drive any logic and are left “open”.
Applying all mentioned optimizations we get an improved design shown in Fig. 8.
III. DESIGN METHODOLOGY AND RESULTS
A. Methodology
The proposed designs were written at the RTL level using VHDL and simulated for functional correctness with Modelsim-Altera Edition 8.1. They were mapped to an Altera Stratix II architecture using Quartus II 8.1 flow tools. The Quartus Classic Timing Analyzer was used to determine the timing characteristics of the circuit, in addition to placing constraints on ROM look-up and prescaling registers to inform the tool of multi-cycle paths.
The multiplies performed in the prescaling module map to the Altera DSP blocks for precisions up to 36 bits; these modules support up to 36×36 multiplication. It does not really make sense to go beyond this precision, as the current design choice is targeted at an architecture which supports fast multipliers. For larger precisions it seems more sensible to design an efficient custom rectangular multiplier.

Fig. 8. Optimized implementation of recurrence reduction. The effective reduction is visualized: the shaded circles signify input bits that were removed as they were deemed unnecessary.
B. Implementation Area and Delay Characteristics
The results show the number of ALUTs (Adaptive LUTs [15]), of which there are two in every ALM (Adaptive Logic Module), the basic building block for logic in Altera Stratix II devices. The DSP blocks on Stratix II architectures support either eight 9×9 multiplies, four 18×18 multiplies, or one 36×36 multiply. The proposed design is limited by the availability of multiplication units, and therefore we have only reported results for two design points: one utilizing a single DSP block with four 18×18 multipliers and the other using four DSP blocks each performing 36×36 multiplies. The results include on-the-fly conversion costs.
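As a rough consistency check on the figures in Table III, the total latency can be estimated from the cycle counts. The sketch assumes, as a simplification not stated in the text, that the latency is the total number of cycles multiplied by the critical path; the function name is illustrative.

```python
def latency_ns(prescale_cycles, iteration_cycles, critical_path_ns):
    """Estimated latency: all cycles run at the critical-path clock period."""
    return (prescale_cycles + iteration_cycles) * critical_path_ns

latency_16 = latency_ns(11, 8, 5.685)    # about 108 ns
latency_36 = latency_ns(11, 16, 5.764)   # about 156 ns
```

Under this assumption the estimates round to the total times listed in Table III for the 16-bit and 36-bit design points.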
The most common scenario we foresee a designer will face when determining the usefulness of a complex division unit is when comparing performance to a software based solution.
One such software solution, presented in [14], is based on the following:
$$\frac{a + jb}{c + jd} = \begin{cases} \dfrac{a + b(d/c)}{c + d(d/c)} + j\,\dfrac{b - a(d/c)}{c + d(d/c)} & \text{if } |c| \ge |d| \\[2ex] \dfrac{b + a(c/d)}{d + c(c/d)} + j\,\dfrac{b(c/d) - a}{d + c(c/d)} & \text{if } |d| \ge |c| \end{cases} \tag{30}$$
which requires significantly more arithmetic operations: 4 conventional divisions and 3 multiplications. A complex divider implementing Smith's formula with a pipelined multiplier, divider, and adder for an 8-bit precision (+4 guard bits) has been described in [16]. The scheme uses a small number of Xilinx Virtex-II slices and operates at 100 MHz. Another design for a complex divider is proposed in [17]. It uses an algorithm similar to SRT division. It also has an efficient implementation, with a latency of about 600 ns for 15-bit precision and a throughput of 1.6 MHz. These two approaches are not comparable to our higher-radix approach in terms of speed. They have the advantage that there is no prescaling and no table for prescaling factors. The radix-2 complex online arithmetic developed in [9] is not directly comparable to our implementation.

Fig. 9. Interface of the [6:2] adder, $[5:2]_4$ adder and the XOR slice.

TABLE III
RESULTS FOR PRECISION 16 AND 36 COMPLEX DIVISION UNITS IMPLEMENTED ON AN ALTERA STRATIX II FPGA.

Precision [bits]               16       36
ALUTs                          566      1185
DSP Block (9-bit elements)     8        36
Registers (FFs)                318      598
M4K RAM blocks                 8        8
Critical path (ns)             5.685    5.764
Max. frequency (MHz)           175.90   173.49
Prescaling look-up (cycles)    3        3
Prescaling (cycles)            2*4      2*4
Total prescaling (cycles)      11       11
Iterations (cycles)            8        16
Total time (latency) (ns)      108      156
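For reference, the software baseline based on Smith's method [14] can be sketched in a few lines. This is an illustrative floating-point version written from the standard form of the method; the function name is hypothetical and no handling of a zero divisor is included.

```python
def smith_divide(a, b, c, d):
    """Return (real, imag) of (a + jb) / (c + jd) using Smith's method."""
    if abs(c) >= abs(d):
        r = d / c
        den = c + d * r
        return (a + b * r) / den, (b - a * r) / den
    r = c / d
    den = d + c * r
    return (b + a * r) / den, (b * r - a) / den

real, imag = smith_divide(1.0, 2.0, 3.0, 4.0)   # (1 + 2j) / (3 + 4j)
```

Branching on the larger divisor component keeps the ratio r at most 1 in magnitude, which is the reason the method is preferred over the textbook formula in software.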
IV. CONCLUSIONS AND FUTURE WORK
We presented the design and implementation of a radix-4 complex division unit with a single prescaling table. The implementation on an Altera Stratix II FPGA device requires 1185 ALUTs, with a critical path of 5.764 ns and a maximum frequency of 173.49 MHz. The prescaling table requires 2K words of 16 bits. To our knowledge, no comparable implementation exists at this time, and our results establish a point of reference for other hardware-based designs. In future work we plan to explore the use of multipartite tables to reduce the table requirements, in addition to developing specialized rectangular multipliers to enable higher-radix designs.
Acknowledgments. We thank Altera Corporation for providing the tools and FPGA devices used in this research.
REFERENCES
[1] A. F. Molisch. Wireless Communications. John Wiley and Sons Ltd., 2005.
[2] J. X., L. Guo, Y. Chen, and J. Zhang. Study of GPS Adaptive Antenna Technology Based on Complex Number AACA. IEEE International Conference on Wireless Communications, Networking and Mobile Computing, 2008, pp. 1-4.
[3] S.R. Dicker et al. CMB observations with the Jodrell Bank - IAC interferometer at 33 GHz. Mon. Not. R. Astron. Soc., 2000, 00:1-12.
[4] G. Vandersteen et al. Comparison of arithmetic functions with respect to Boolean circuits. In 58th ARFTG Conference Digest RF Measurements for a Wireless World, 2001, pp. 466-470.
[5] M.D. Ercegovac and T. Lang. Digital Arithmetic. Morgan Kaufmann Publishers, San Francisco, 2004.
[6] M.D. Ercegovac and J.-M. Muller. Complex Division with Prescaling of Operands. IEEE International Conference on Application-Specific Systems, Architectures and Processors, pp. 293-303, 2003.
[7] M.D. Ercegovac and J.-M. Muller. Design of a complex divider. Proc. SPIE on Advanced Signal Processing Algorithms, Architectures, and Implementations XII, pp. 51-59, 2004.
[8] M.D. Ercegovac and J.-M. Muller. Complex Square Root with Operand Prescaling. IEEE International Conference on Application-Specific Systems, Architectures and Processors, pp. 293-303, 2004.
[9] R.D. McIlhenny. Complex Number On-line Arithmetic for Reconfigurable Hardware: Algorithms, Implementations, and Applications. Ph.D. Dissertation, Computer Science Department, University of California, 2002.
[10] V. Oklobdzija, D. Villeger and T. Soulas. An Integrated Multiplier for Complex Numbers. J. of VLSI Signal Processing, vol. 7, no. 3, pp. 213-222, May 1994.
[11] A.F. Tenca and M.D. Ercegovac. Design of high-radix digit slices for online computations. In SPIE Conference on High-Speed Computing, Digital Signal Processing, and Filtering Using Reconfigurable Logic, Bellingham, 1996.
[12] B.W.Y. Wei, H. Du, and H. Chen. A Complex-Number Multiplier Using Radix-4 Digits. Proc. 12th IEEE Symposium on Computer Arithmetic, pp. 84-90, 1995.
[13] P. Dormiani, M.D. Ercegovac, and J.-M. Muller. On the Design and Implementation of Complex-valued Division Unit with Operands Prescaling. Computer Science Department, UCLA, Internal Report, 2009.
[14] R.L. Smith. Algorithm 116: Complex division. Communications of the ACM, 5(8):435, 1962.
[15] http://www.altera.com/
[16] F. Edman and V. Oewall. Fixed-point Implementation of a Robust Complex Valued Divider Architecture. Proceedings of ECCTD05, Cork, Ireland, August 2005.
[17] J. Liu, B. Weaver and Y. Zakharov. FPGA Implementation of Multiplication-Free Complex Division. Electronics Letters, 17th January 2008, Vol. 44, No. 2.