Figure 9 Vector Multiply Unit
similar to the process used for double-precision multiplication. However, in single-precision multiplication, only one multiplier chip is needed to produce the result, and the pack chips do not need to sum the partial product. Integer multiplication is slightly different from floating point multiplication because it does not need to be accumulated or rounded. Thus, the correct product is produced by one multiplier. The result bypasses the accumulation and rounding logic and proceeds directly into the packing logic to be sent to the vector register unit.
The exponent handling for both multiplication and division is performed by the same logic on the packing chips. Depending on the instruction being executed, the exponents are either added (multiplication) or subtracted (division). The result of this operation is then piped to the next stage, and the position of the hidden bit is determined. If the fractional portion of the data must be shifted to ensure the hidden bit is in the correct position, the exponent is incremented or decremented accordingly. The normalize count (i.e., shift count) is used to select the correct final exponent. Overflow and underflow exceptions can be detected and reported only after the final exponent is selected. If an exception is detected, a reserved operand is written to the appropriate vector register element.
The first stage of the exponent logic also checks for divide by zero and reserved operand exceptions.
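The exponent path described above can be illustrated with a minimal sketch. It is not the logic on the packing chips; it assumes the standard VAX conventions (a fraction f with 1/2 <= f < 1, an excess-128 exponent for F_floating and D_floating, an excess-1024 exponent for G_floating, and a biased exponent of 0 denoting zero). The function name and the string used to stand for a reserved operand are hypothetical.

```python
from fractions import Fraction

# Assumed VAX conventions: value = fraction * 2**(exponent - bias).
BIAS = {"F": 128, "D": 128, "G": 1024}
EXP_MAX = {"F": 255, "D": 255, "G": 2047}
RESERVED_OPERAND = "reserved operand"   # hypothetical stand-in for the result


def final_exponent(op, fmt, exp_a, frac_a, exp_b, frac_b):
    """Return the final biased exponent of a*b or a/b, or RESERVED_OPERAND."""
    bias, exp_max = BIAS[fmt], EXP_MAX[fmt]
    # First stage: divide-by-zero check (a biased exponent of 0 means zero).
    if op == "div" and exp_b == 0:
        return RESERVED_OPERAND
    if exp_a == 0 or exp_b == 0:
        return 0                        # zero operand gives a zero result
    if op == "mul":
        exp = exp_a + exp_b - bias      # exponents are added
        frac = Fraction(frac_a) * Fraction(frac_b)
    else:
        exp = exp_a - exp_b + bias      # exponents are subtracted
        frac = Fraction(frac_a) / Fraction(frac_b)
    # Normalize: shift the fraction until the leading (hidden) bit is in
    # place, adjusting the exponent by the shift count.
    while frac < Fraction(1, 2):
        frac *= 2
        exp -= 1
    while frac >= 1:
        frac /= 2
        exp += 1
    # Overflow and underflow are only known once the final exponent is chosen.
    if exp < 1 or exp > exp_max:
        return RESERVED_OPERAND
    return exp
```

For example, 1.0 is held as fraction 1/2 with biased exponent 129 in F_floating, so multiplying 1.0 by 1.0 yields an unnormalized fraction of 1/4 and a one-bit normalize shift that restores the exponent to 129.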
Division
Vector division is a variable-cycle function. The number of cycles depends on the format of the operands. The custom divider is capable of producing six quotient bits per cycle. Therefore, F_floating point division is performed in 7 cycles, G_floating point in 12 cycles, and D_floating point in 13 cycles. Because of the variable number of cycles in a divide instruction, no other instruction can execute in the V-box while a divide is in process. Also, because of the iterative nature of division (i.e., one division must be completed before another can be started), the instruction cannot be pipelined.
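The relationship between operand format and cycle count can be sketched as follows. The fraction widths are those of the VAX formats including the hidden bit. The fixed overhead of three cycles is an assumption, chosen only because it reproduces the counts quoted above; the text does not break the cycles down this way.

```python
import math

QUOTIENT_BITS_PER_CYCLE = 6     # stated in the text

# Fraction widths of the VAX floating formats, counting the hidden bit.
FRACTION_BITS = {"F": 24, "G": 53, "D": 56}

# Assumption: a fixed number of non-iterative cycles (unpack, round, pack).
# This value is inferred from the quoted totals, not taken from the source.
ASSUMED_OVERHEAD_CYCLES = 3


def divide_cycles(fmt):
    """Estimate the cycles for one divide in the given format."""
    iterations = math.ceil(FRACTION_BITS[fmt] / QUOTIENT_BITS_PER_CYCLE)
    return iterations + ASSUMED_OVERHEAD_CYCLES
```

Under this model, the difference between formats comes entirely from the number of six-bit iterations needed to cover the fraction.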
Vol. 2 No. 4 Fall 1990 Digital Technical Journal
As a vector divide instruction executes, two 64-bit elements are received from the vector register unit each cycle and are latched in the divide unpack chip. The elements are unpacked, and the fractional portion of the elements is sent to the custom divider in 32-bit slices. The exponent portion is sent to the shared exponent logic on the packing chips, as described in the Multiplication section. During this cycle, time-critical values, such as complemented element values and first-cycle quotient bits, are calculated and forwarded to the custom divider.
When the divider receives the data, it uses an iterative algorithm to produce six quotient bits per cycle. The quotient bits produced are then sent to the packing chips, which may have to increment the quotient, depending on the value of subsequent quotient bits. The divider instructs the quotient accumulation logic whether or not incrementing is necessary. The partial quotient, once decided, is held in a bank of latches until all the quotient bits are received. When the entire quotient is available, the result is rounded, normalized, and packed by using the same logic path as multiplication. A multiplexer switches this packing logic between the multiplication and division logic.
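To make the idea of producing six quotient bits per iteration concrete, the following is a minimal sketch of a plain radix-64 digit recurrence on integers. It is not the algorithm of the custom divider, which is not described here, and it does not model the increment signal sent to the quotient accumulation logic. The function name is hypothetical.

```python
def divide_six_bits_per_step(dividend, divisor, steps):
    """Return floor(dividend * 64**steps / divisor), six bits per step."""
    if divisor <= 0 or dividend < 0:
        raise ValueError("expects dividend >= 0 and divisor > 0")
    # Integer part of the quotient and the first partial remainder.
    quotient, remainder = divmod(dividend, divisor)
    for _ in range(steps):
        remainder <<= 6                                 # scale by 64
        digit, remainder = divmod(remainder, divisor)   # 0 <= digit <= 63
        quotient = (quotient << 6) | digit              # accumulate 6 bits
    return quotient
```

Each pass through the loop corresponds to one iteration of the divider: the partial quotient grows by six bits and is held until the requested number of bits has been produced.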
Performance Characteristics
As of this writing, testing of the vector performance of the VAX 9000 system has only just begun. However, some preliminary results are presented in Table 3. We expect that these results will improve as testing continues and more code is optimized to take advantage of the chaining and overlapping provided by the V-box.
Chaining and Overlapping
Table 3 VAX 9000 Model 210 Preliminary
Vector Processing on the VAX 9000 System
Because of the design of the vector register unit, the V-box can concurrently execute a vector add-class instruction, vector multiply instruction, and vector memory instruction. Unlike the VAX 6000 Model 400 system, vector register conflicts between these instructions have little effect on overlapping. With the VAX 9000 system, a conflict only delays the execution of the subsequent vector instruction by one or two cycles at most.
However, the overlapping behavior of the V-box is sensitive to the issue order of vector instructions.
If two vector instructions executed by the same V-box unit are issued one after the other, the second instruction is delayed until the V-box unit has finished executing the first. In addition, vector instructions issued after a vector memory instruction or divide instruction do not begin execution until the previous instruction completes. A general rule in scheduling code for the VAX 9000 V-box is to generate, whenever possible, instruction triples, where the first two instructions are a vector add-class and a vector multiply instruction and the last instruction is a vector memory or vector divide instruction. Failing that, at least one vector add-class or vector multiply instruction should be issued before a vector memory or vector divide instruction.
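The issue-order rules above can be expressed as a coarse model that splits an instruction stream into groups of instructions that may overlap. This is a sketch of the stated rules only, not a cycle-accurate description of the V-box. The unit assignments for the add-class and memory instructions follow the text; treating divide as its own class, and the function and table names, are assumptions made for illustration.

```python
# Unit that executes each instruction (divide modeled as its own class).
UNIT = {
    "VVADDD": "add", "VVSUBD": "add", "VSLSSD": "add", "VVMERGE": "add",
    "VSMULD": "mul",
    "VLDQ": "mem", "VSTQ": "mem",
    "VVDIVD": "div",
}


def overlap_groups(instructions):
    """Split an issue-ordered list into groups that may execute overlapped."""
    groups = []
    for op in instructions:
        unit = UNIT[op]
        current = groups[-1] if groups else None
        must_wait = (
            current is None
            or unit == "div"                            # nothing overlaps a divide
            or UNIT[current[-1]] in ("mem", "div")      # issued after memory/divide
            or unit in {UNIT[o] for o in current}       # that unit is still busy
        )
        if must_wait:
            groups.append([op])
        else:
            current.append(op)
    return groups
```

For an instruction order of VLDQ, VLDQ, VSMULD, VVADDD, VSTQ, the model returns three groups: each VLDQ alone, followed by the overlapped triple VSMULD, VVADDD, VSTQ. The triple is exactly the add-class, multiply, memory pattern recommended by the scheduling rule.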
The following code examples demonstrate the usage of the VAX vector instruction set and the overlapping behavior of the VAX 9000 V-box. (Note: It should be assumed in the examples that all arrays are 8-byte double precision.)
In the following DAXPY inner loop example, the first two VLDQ instructions do not overlap. However, the VSMULD, VVADDD, and VSTQ instructions
The first two VLDQ instructions do not overlap in the following MERGE example,
Do i = 1, 64
vectorizes as:
However, the VVSUBD instruction does overlap with the VSTQ instruction. Both the VSLSSD (VSCMP) and VVMERGE instructions are executed by the vector add unit. Therefore, these two instructions do not overlap. However, the VVMERGE instruction does overlap with the VSTQ instruction.
In an IF-THEN-ELSE example, such as the
Nothing overlaps the first VLDQ instruction, but the VSLSSD instruction does overlap the second VLDQ instruction. Nothing can overlap with the VVDIVD instruction. Thus, the VSTQ instruction does not begin execution until the VVDIVD instruction completes. The remaining VSTQ instruction waits for the first VSTQ instruction to complete.
In the following scatter-gather example, none of the instructions is overlapped. The VSEQLD and IOTA instructions do not overlap.
This lack of overlap occurs because the IOTA instruction is actually done with microcode on the E-box, and the IOTA instruction cannot begin execution until the VSEQLD instruction has computed all the new vector mask register bits. The vector register access instructions (MFVCR and MTVLR) take only a few cycles and do not significantly affect the overlapping of other vector instructions.
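The dependence between the compare and IOTA steps can be seen in a small functional sketch. It models only the data flow: a compare that fills the mask, an IOTA step that turns the set mask bits into a compressed list of offsets plus a count, and a gather that uses those offsets. The helper names are hypothetical, and offsets are expressed in elements rather than bytes for simplicity.

```python
def compare_equal_mask(scalar, vector):
    """Model of a vector-scalar equality compare: one mask bit per element."""
    return [scalar == element for element in vector]


def iota(mask, stride=1):
    """Model of IOTA: compressed offsets of the set mask bits, and their count."""
    offsets = [i * stride for i, bit in enumerate(mask) if bit]
    return offsets, len(offsets)


def gather(array, offsets):
    """Model of a gather load using the offset vector."""
    return [array[offset] for offset in offsets]
```

Because `iota` consumes the whole mask, it cannot start until the compare has produced every mask bit, which mirrors the reason given above for the lack of overlap. The count it returns plays the role of the value that is read with MFVCR and written to the vector length register with MTVLR.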
Summary
By taking advantage of key features of the VAX vector architecture, such as instruction overlapping, imprecise exceptions, and asynchronous interaction with the scalar processor, the vector processor of the VAX 9000 system provides supercomputing performance for computationally intensive applications. Through the use of barber poling, the vector processor can overlap two vector arithmetic instructions with one memory instruction to deliver a peak double-precision performance of 125 MFLOPS.
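As a rough check on the peak figure, assume that each of the two overlapped arithmetic instructions delivers one double-precision result per cycle and that the cycle time is 16 ns. The cycle time is an assumption; it is not stated in this section.

```latex
\text{peak rate}
  = \frac{2\ \text{floating point operations per cycle}}{16 \times 10^{-9}\ \text{s per cycle}}
  = 125 \times 10^{6}\ \text{operations per second}
  = 125\ \text{MFLOPS}
```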
Acknowledgments
The authors wish to acknowledge the technical contributions of the following individuals to the VAX vector architecture and the VAX 9000 V-box
References
1. Russell, "The CRAY-1 Computer System," ACM Proceedings, vol. 21, no. 1 (January 1978): 63-72.
2. VAX Vector Processing Handbook (Maynard: Digital Equipment Corporation, Order No. EC-H0419-46/89, 1989).
3. R. Brunner, VAX Architecture Reference Manual (Bedford: Digital Press, Order No. EY-F576E-DP, 1990).
4. D. Fenwick et al., "A VLSI Implementation of the VAX Vector Architecture," Proceedings of COMPCON '90 (IEEE, Spring 1990).
5. CRAY-2 Computer System Functional Description (Cray Research, Inc., 1985).
6. W. Buchholz, "The IBM System/370 Vector Architecture," IBM Systems Journal, vol. 25, no. 1 (1986): 51-62.
7. D. Marshall and J. McElroy, "VAX 9000 Packaging - The Multichip Unit," Proceedings of COMPCON '90 (IEEE, Spring 1990).
8. M. Adiletta et al., "Semiconductor Technology in a High-performance VAX System," Digital Technical Journal, vol. 2, no. 4 (Fall 1990, this issue): 43-60.
James B. McElroy Frank J. Swiatowiec