


2 System Design

Mark Smotherman

Clemson University

Dezső Sima

Budapest Polytechnic

Kevin Skadron
David Tarjan

University of Virginia

Tzi-cker Chiueh

State University of New York at Stony Brook

Binu Mathew

Apple Inc.

Ali Ibrahim

Advanced Micro Devices

2.1 Superscalar Processors
Introduction · Instruction-Level Parallelism · Superscalar Terminology · Example Machines

2.2 Register Renaming Techniques
Introduction · Overview of the Rename Process · Design Space of Register Renaming Techniques · Layout of the Rename Buffers · Layout of the Register Mapping · Basic Alternatives and Possible Implementation Schemes of Register Renaming

2.3 Predicting Branches in Computer Programs
What Is Branch Prediction and Why Is It Needed? · Software Techniques · Hardware Techniques · Sources of Mispredictions · Comparison of Hardware Prediction Strategies · Summary

2.4 Network Processor Architecture
Introduction · Design Issues · Architectural Support for Network Packet Processing · Example Network Processors · Conclusion

2.5 Stream Processors and Their Applications for the Wireless Domain
Introduction · Stream Virtual Machine · Time and Space Multiplexing · Stream Processor Implementations · Stream Processing for Wireless Systems · WCDMA Physical Layer · 4G · Computational Complexity and Power Consumption · Current Solutions · Stream Processor Based Wireless SoCs · Conclusions

2.1 Superscalar Processors
Mark Smotherman

2.1.1 Introduction

A superscalar processor is designed to fetch, decode, execute, and complete multiple instructions each clock cycle from a single instruction stream. The term superscalar originated in the 1980s to distinguish it as a form of parallelism set apart from vector and array processors as well as an advance beyond pipelined scalar processors. The rationale for such a design can be illustrated by considering the basic computer performance equation, execution time = IC × CPI × CCT; that is, execution time is a function of the instruction count (IC) multiplied by the cycles per instruction (CPI) multiplied by the clock cycle time (CCT).

The goal of a pipelined scalar processor is to strive toward a minimum CPI of 1.0, while simultaneously


reducing, or at least limiting any expansion of, the instruction count and the clock cycle time. The result is reduced overall execution time. The goal of a superscalar design is to attain a fractional CPI, or, stated as the reciprocal, the goal is to attain a throughput measured in instructions per cycle (IPC) greater than 1.0. With similar reductions, or at least limits on the expansion, of the instruction count and the clock cycle time, the result is an even larger reduction in the execution time than for scalar pipelining alone.
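To make the arithmetic concrete, consider an invented illustrative case: a program of one billion instructions on a processor with a 1 ns clock cycle time.

\[
T = IC \times CPI \times CCT
\]
\[
T_{\text{scalar}} = 10^{9} \times 1.0 \times 1\,\text{ns} = 1.0\,\text{s}
\qquad
T_{\text{superscalar}} = 10^{9} \times 0.5 \times 1\,\text{ns} = 0.5\,\text{s}
\]

Sustaining an IPC of 2.0 (CPI = 0.5) halves the execution time, provided the instruction count and the clock cycle time do not grow in the process.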

A VLIW (very long instruction word) processor is also designed for fetching, decoding, executing, and completing multiple operations each clock cycle. The difference between a superscalar processor and a VLIW processor is one of implementation and architecture. Superscalar design can be applied as an implementation technique to an existing sequential instruction set, while VLIW design requires that the instruction set architecture, the implementation, and the compilers be specifically designed to support the packaging of multiple independent operations into long instruction words.

Proponents of the VLIW approach rightly contend that VLIW design reduces the control complexity within a processor; however, the corresponding drawbacks are a loss of wide-scale program portability at the binary level and a lack of flexibility. With regard to portability, the control logic added by a superscalar processor is used to dynamically determine opportunities for parallel execution within a conventional instruction stream. Thus, superscalar processors dynamically schedule parallel execution of the instructions of existing executable program files, whereas recompilation into a static representation of parallel execution is a requirement for programs to run on VLIW processors. With regard to flexibility, superscalar processors can easily respond to dynamic events, such as cache misses. Dynamic events present a difficulty for VLIW designs. For example, early VLIW designs avoided data caches so that memory access time would be a known quantity for use in compiler scheduling.

A hybrid approach, called EPIC (explicitly parallel instruction computing), is exemplified in the Itanium Processor Family (IPF, also known as the IA-64) and is an attempt to gain the best of both approaches.

Explicit dependence information is incorporated into the instruction formats to reduce the control logic complexity, and some scheduling of dynamic behavior is incorporated to provide flexibility.

2.1.2 Instruction-Level Parallelism

Superscalar processors attempt to identify and exploit parallelism in the instruction stream. That is, instructions that are independent should be executed in parallel. We briefly review the concept of dependencies. More details can be found in the Shen and Lipasti text on superscalar processor design [1].

2.1.2.1 Dependencies

Dependencies limit the parallelism between instructions because they must be enforced so that the results of program execution will be correct. Indeed, much of the control logic in a superscalar processor is devoted to identifying dependencies, so that execution will produce the same results as if the instruction stream was being executed on a purely sequential computer. Dependencies can be categorized in three ways.

2.1.2.2 Data Dependencies

Data dependencies exist between two instructions when the order between the two instructions must be maintained for execution to be correct. The most obvious data dependency is the true data dependency (or RAW: read-after-write dependency) in which the result of one instruction is used as an input operand for the second instruction. To preserve correctness, the first instruction must be executed before the second. The storage location in which the operand is passed can be either a memory location or a CPU register. A forwarding path can also be used to provide the operand to the second instruction without having to wait on the storage update.

Two other cases arise when the second instruction writes to the common storage location. An output dependency (or WAW: write-after-write dependency) occurs when both instructions write to the same storage. To preserve correctness, the result of the second instruction must be the final value of the storage. An anti-dependency (or WAR: write-after-read dependency) occurs when the first instruction reads an input operand from the storage location that will be written with the result of the second instruction. To preserve correctness, the first instruction must obtain its input operand before that value is overwritten by a new value from the second instruction. Both these cases are called false data dependencies because they arise from the reuse of storage locations.

See Fig. 2.1 for examples of data dependencies based on register storage. Similar examples using load and store instructions could be given based on memory storage.
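To make the three categories concrete, the sketch below classifies the dependencies between an ordered pair of instructions in the three-operand register format of Fig. 2.1 (two source registers, one destination). The struct layout and function names are illustrative, not drawn from any real design.

    #include <stdio.h>
    #include <stdbool.h>

    /* One instruction in the three-operand format of Fig. 2.1:
       two source registers and one destination register. */
    typedef struct { int src1, src2, dst; } Instr;

    /* Report the data dependencies that require "first before second". */
    void classify(Instr first, Instr second) {
        bool raw = (second.src1 == first.dst) || (second.src2 == first.dst);
        bool war = (second.dst == first.src1) || (second.dst == first.src2);
        bool waw = (second.dst == first.dst);
        if (raw) printf("true (RAW) dependency\n");
        if (war) printf("anti (WAR) dependency\n");
        if (waw) printf("output (WAW) dependency\n");
        if (!raw && !war && !waw) printf("independent\n");
    }

    int main(void) {
        Instr add = { 1, 2, 3 };  /* add r1, r2, r3 : r3 <- r1 + r2 */
        Instr sub = { 3, 4, 5 };  /* sub r3, r4, r5 : r5 <- r3 - r4 */
        classify(add, sub);       /* prints: true (RAW) dependency  */
        return 0;
    }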

2.1.2.3 Control Dependencies

A control dependency occurs when an instruction depends on a conditional branch instruction. It is not known whether the instruction is to be executed or not until the branch is resolved. Thus, the branch must be executed (or predicted) before the instruction execution.

2.1.2.4 Structural Dependencies

A structural dependency occurs when two instructions need the same resource. If the resource is not duplicated, the instructions must execute sequentially, one after the other, rather than in parallel. The resource for which the instructions contend might be an adder, a bus, a register file port, or some other component.

2.1.2.5 Studies of Instruction-Level Parallelism

In the early 1970s, two studies on decoding and executing multiple instructions per cycle were published, one by Tjaden and Flynn on a design of a multiple-issue IBM 7094 [2] and the other by Riseman and Foster on the effect of branches in CDC 3600 programs [3]. The conclusion in both papers was that only a small amount of instruction-level parallelism existed in sequential programs, only 1.86 and 1.72 instructions per cycle on average, as determined by the respective studies. Thus, these studies clearly demonstrated the limiting effect of data and control dependencies on instruction-level parallelism, and the result was to encourage researchers to look for parallelism in other arenas, such as vector processors and multiprocessors. However, the study by Riseman and Foster did examine the effect of relaxing the control dependencies and found increasing levels of parallelism, up to 51 instructions per cycle, as branches were eliminated (albeit in an impractical way). Later studies, in which false data dependencies as well as control dependencies were eliminated, found much more available parallelism, with the highest published estimate being 90 instructions per cycle by Nicolau and Fisher as part of their VLIW research [4].

2.1.2.6 Techniques to Increase Instruction-Level Parallelism

Just as the limit studies indicated, performance can be increased if dependencies can be eliminated or reduced. Let us consider the dependencies in the reverse order from their enumeration above. First, many structural dependencies can be avoided by providing duplicate copies of necessary resources. Even scalar pipelines provide two paths for memory access (separate instruction and data caches) and multiple adders (branch target adder and main ALU). Superscalar processors have even more resource requirements, and it is not unusual to find duplicated function units and even multiple ports to the data cache (e.g., true multiporting, multiple banks, or accessing a single-ported cache multiple times per cycle).
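As a minimal sketch of how issue logic might detect a structural dependency, suppose each function unit carries a busy bit; the unit mix and names below are invented for the example.

    #include <stdbool.h>

    enum { ALU0, ALU1, FP_ADD, NUM_UNITS };  /* illustrative unit mix */
    static bool busy[NUM_UNITS];

    /* Try to claim a free unit from the set that can execute this
       instruction; return -1 to signal a structural-hazard stall. */
    int acquire_unit(const int candidates[], int n) {
        for (int i = 0; i < n; i++) {
            if (!busy[candidates[i]]) {
                busy[candidates[i]] = true;
                return candidates[i];
            }
        }
        return -1;  /* all suitable units busy: instructions serialize */
    }

An integer add, for example, would pass {ALU0, ALU1} as its candidates; duplicating the ALU is precisely what lets two such adds issue in the same cycle.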

True dependency:
    add r1, r2, r3   ; r3 ← r1 + r2
    sub r3, r4, r5   ; r5 ← r3 − r4

Anti-dependency:
    add r1, r2, r3   ; r3 ← r1 + r2
    sub r4, r5, r1   ; r1 ← r4 − r5

Output dependency:
    add r1, r2, r3   ; r3 ← r1 + r2
    sub r4, r5, r3   ; r3 ← r4 − r5

FIGURE 2.1 Example data dependencies.

Control dependencies are eliminated by compiler techniques of unrolling loops and performing optimizations such as ‘‘if conversion’’ (using conditional or predicated execution of instructions so that a control-dependent instruction is transformed into a data-dependent instruction). However, the main approach to reducing the impact of control dependencies is the use of sophisticated branch prediction. For example, the Pentium 4 keeps the history of over 4000 branches [5]. Branch prediction techniques allow instructions from the predicted path to begin before the branch is resolved and execute in a speculative manner. Of course, if a prediction is incorrect, there must be a way to recover and restart execution along the correct path.
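A common hardware scheme is a table of two-bit saturating counters indexed by the branch address. The sketch below is illustrative only; its 4096-entry table echoes the scale of the Pentium 4 figure cited above but is not that processor's actual predictor organization.

    #include <stdint.h>
    #include <stdbool.h>

    #define TABLE_SIZE 4096              /* illustrative table size */
    static uint8_t counters[TABLE_SIZE]; /* 0,1 = not taken; 2,3 = taken */

    /* Predict a branch from the counter selected by its address. */
    bool predict(uint32_t pc) {
        return counters[(pc >> 2) % TABLE_SIZE] >= 2;
    }

    /* Train the counter with the resolved outcome, saturating at 0 and 3
       so that a single anomalous outcome does not flip the prediction. */
    void update(uint32_t pc, bool taken) {
        uint8_t *c = &counters[(pc >> 2) % TABLE_SIZE];
        if (taken)  { if (*c < 3) (*c)++; }
        else        { if (*c > 0) (*c)--; }
    }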

False data dependencies can be eliminated or reduced by better compiler techniques (e.g., register and memory allocation algorithms that avoid reuse) or by the use of register and memory renaming hardware on the processor. Register renaming can be accomplished in the hardware by incorporating a larger set of physical registers than are available in the instruction set architecture. Thus, as each instruction is decoded, that instruction’s architectural destination register is mapped to a new physical register, and future uses of that architectural register will be mapped to the assigned physical register.

Hardware renaming is especially important for older instruction sets that have few architectural registers and for legacy and shrink-wrapped programs that for one reason or another will not be recompiled.
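The sketch below shows this decode-time mapping, assuming an invented configuration of 32 architectural and 64 physical registers and a simple free list; checkpointing and the reclamation of physical registers, which any real renamer must also handle, are omitted.

    #define NUM_ARCH 32
    #define NUM_PHYS 64

    static int map[NUM_ARCH];        /* architectural -> physical mapping */
    static int free_list[NUM_PHYS];  /* stack of free physical registers  */
    static int free_top;

    void rename_init(void) {
        for (int a = 0; a < NUM_ARCH; a++) map[a] = a;  /* identity at reset */
        free_top = 0;
        for (int p = NUM_ARCH; p < NUM_PHYS; p++)
            free_list[free_top++] = p;
    }

    /* Rename one decoded instruction (assumes a free register exists). */
    void rename_instr(int src1, int src2, int dst,
                      int *psrc1, int *psrc2, int *pdst) {
        *psrc1 = map[src1];              /* sources read the current mapping  */
        *psrc2 = map[src2];
        *pdst  = free_list[--free_top];  /* destination gets a fresh register */
        map[dst] = *pdst;                /* later readers of dst now see it   */
    }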

True data dependencies have been viewed as the fundamental limit for program execution; however, value prediction has been proposed in the past few years as somewhat of an analog of branch prediction, in which paths within the instruction stream that depend on easily predicted source values or addresses can be started earlier. As with branch prediction, there must be a way to recover from mispredictions.
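One simple published variant is last-value prediction, in which the predicted result of an instruction is the value it produced on its previous execution. A sketch, with an illustrative table size:

    #include <stdint.h>

    #define VP_ENTRIES 1024                 /* illustrative table size */
    static uint32_t last_value[VP_ENTRIES];

    /* Guess that the instruction will produce its previous result. */
    uint32_t predict_value(uint32_t pc) {
        return last_value[(pc >> 2) % VP_ENTRIES];
    }

    /* Record the actual result for use as the next prediction. */
    void train_value(uint32_t pc, uint32_t actual) {
        last_value[(pc >> 2) % VP_ENTRIES] = actual;
    }

An instruction dependent on the predicted one can begin executing with the guessed value; as with branch prediction, the pipeline must recover whenever the actual result differs from the guess.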

Another method to reduce the impact of true data dependencies is the use of some form of multi-threading in which instructions from multiple threads are interleaved on a single processor; of course, instructions from different threads are independent by definition.

2.1.3 Superscalar Terminology

2.1.3.1 Program Order

Figure 2.2 illustrates various terms used in describing superscalar processors. The adjectives in-order and out-of-order refer to the ordering of instructions as compared to the program. Figure 2.2a shows a simple scalar processor with four stages, and the actions of moving an instruction from one stage to the next are named fetch, issue, and complete, respectively. Instructions are decoded and issued in program order. A simple extension to the scalar pipeline is the use of multiple pipelines, as shown in Fig. 2.2b.

This can allow a scalar processor to specialize its execution pipelines (or function units; e.g., integer vs. floating point), or, as of interest here, it can produce an in-order superscalar, in which multiple instructions are fetched, decoded, and issued in program order. The stage-to-stage terminology is the same as for the simple in-order scalar processor.

2.1.3.2 Instruction Completion and Precise Exceptions

All processors that attempt to execute instructions in parallel must deal with variations in instruction execution times. That is, some instructions, such as those involving simple integer arithmetic or logic operations, will need only one cycle for execution, while others, such as floating-point instructions, will need multiple cycles for execution. If these different instructions are started at the same time, as in a superscalar processor, or even in adjacent cycles, as in a scalar pipelined processor with separate function units or execution pipelines, a simple instruction can complete earlier than a longer running instruction that appears earlier in the instruction stream. This is called out-of-order completion. We can, of course, prevent out-of-order completion by techniques such as adding delay stages so that all execution paths have the same number of stages.

If we choose to allow a subsequent, simple instruction to write its result to storage before a longer running instruction completes, we may violate a data dependency. Dependency checking hardware can eliminate this problem while still allowing some out-of-order completions; however, dependency checking will not solve the problem of an inconsistent state of storage (registers or memory) if the longer running instruction causes an exception. To handle this exception and to be able to resume the program, we must know the precise state of the storage, that is, we must know which instructions, before the one causing the exception, have not been completed and which instructions, after the one causing the exception, have been completed. To resume at a given point in program order, the processor must restore a consistent state with all previous instructions completed and no subsequent instructions completed. A standard technique to handle this is to provide a form of buffering for the results of instructions, usually called a reorder buffer. This method is depicted in Fig. 2.2c. Instructions completing out-of-order can place their results in preassigned entries in this buffer (assignments are made when instructions are in the decode stage); and, when available, the results are retired out of this buffer in program order. (This action is alternatively called commit, completion, or graduation in some processors.) If an exception occurs, or for that matter, a branch or value misprediction occurs, instructions before the one causing the exception or misprediction are allowed to retire and then the contents of the reorder buffer beyond that instruction are flushed. Execution can then be resumed with a consistent state of storage.
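A sketch of the in-order retire logic for such a buffer, organized as a circular queue with invented field names, might look like the following.

    #include <stdbool.h>

    #define ROB_SIZE 64

    typedef struct {
        bool valid;      /* entry allocated at decode            */
        bool done;       /* result produced (possibly out of order) */
        bool exception;  /* instruction raised an exception      */
        /* ... destination identifier, result value, etc. ...   */
    } RobEntry;

    static RobEntry rob[ROB_SIZE];
    static int head, tail;   /* head = oldest entry, tail = next free */

    /* Retire completed instructions in program order, stopping at the
       first unfinished entry or at an exception. */
    void retire(void) {
        while (rob[head].valid && rob[head].done) {
            if (rob[head].exception) {
                /* flush the faulting instruction and everything younger */
                for (int i = head; i != tail; i = (i + 1) % ROB_SIZE)
                    rob[i].valid = false;
                tail = head;
                /* ... vector to the exception handler ... */
                return;
            }
            /* commit this result to architectural state here */
            rob[head].valid = false;
            head = (head + 1) % ROB_SIZE;
        }
    }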

Other techniques for dealing with exceptions include the use of a ''run slow'' mode bit to switch between in-order and out-of-order instruction completion (IBM RS/6000), the use of a history buffer (Motorola 88110), the use of exception barrier instructions (DEC Alpha), delaying instruction issue until exceptions from previous instructions are guaranteed not to occur (called safe instruction recognition in the Intel Pentium), and the use of a future file (UltraSPARC-III).

2.1.3.3 Instruction Issue

Up to this point, instructions have been described to issue, that is, start execution, in program order. Consider the case of a long-running dependent instruction pair followed by independent instructions. It would be advantageous if the compiler would statically schedule independent instructions between the two instructions of the dependent pair; however, not all programs will be so scheduled. An alternative is to provide dynamic instruction scheduling in the hardware, also known as out-of-order execution.

FIGURE 2.2 (Pipeline stages shown: Icache, Decode, Execute, Write back.)

This requires providing some form of buffering for instructions subsequent to decode, as depicted in Fig. 2.2d, and the necessary control logic to identify and issue instructions that are ready to execute.

The execution of ready instructions in an out-of-order processor is a form of data flow execution, and the dynamic scheduling of the portion of the instruction stream held in the instruction buffer has been called restricted data flow. The buffer for the instructions can take the form of a centralized instruction window or a decentralized set of reservation stations, in which a subset of the instruction buffers is located at each function unit. In Fig. 2.2d, the action of placing a decoded instruction into this buffer is called dispatch, and the term for the start of execution (i.e., choosing and routing instructions from the instruction buffer to execution units) remains issue. Unfortunately, some authors and processor manuals make the issue and dispatch terms synonymous, and some even reverse the above meanings, so the reader is advised to always read the context carefully when encountering these two terms.
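A sketch of the issue step for a centralized instruction window, with illustrative sizes and fields, choosing up to two ready instructions per cycle without regard to program order:

    #include <stdbool.h>

    #define WINDOW_SIZE 32
    #define ISSUE_WIDTH 2

    typedef struct {
        bool occupied;                /* entry holds a dispatched instruction */
        bool src1_ready, src2_ready;  /* operands available yet?              */
    } WinEntry;

    static WinEntry window[WINDOW_SIZE];

    /* Issue: select up to ISSUE_WIDTH entries whose operands are all
       ready, regardless of program order (restricted data flow). */
    int issue_ready(int issued[ISSUE_WIDTH]) {
        int n = 0;
        for (int i = 0; i < WINDOW_SIZE && n < ISSUE_WIDTH; i++) {
            if (window[i].occupied &&
                window[i].src1_ready && window[i].src2_ready) {
                issued[n++] = i;
                window[i].occupied = false;  /* leaves the window on issue */
            }
        }
        return n;   /* number of instructions sent to execution units */
    }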

Figures 2.3 and 2.5 illustrate superscalar processing with in-order execution and out-of-order execution, respectively, for two iterations of a short loop that adds a value to each element in an array of floating-point values. Throughput rates of two instructions per cycle are assumed for each stage-to-stage action, and branch prediction and operand forwarding are assumed. Loads and stores have two-cycle execution, floating-point add has four-cycle execution, and integer add and branch have single-cycle execution each. The stages for the in-order processor in Fig. 2.3 are FDEW: fetch, decode, execute, and write back. The stages for the out-of-order processor in Fig. 2.5 are FDIECR: fetch, decode, issue (or instruction window), execute, complete (or reorder buffer), and retire (or write back). Renaming is assumed for the out-of-order processor.

Without the ability to dispatch dependent instructions into an instruction window or to reservation stations, the decoder in the in-order processor in Fig. 2.3 stalls from cycle 3 to cycle 5, at which point the floating-point add can be issued. Similar decoder stalls can be observed at cycles 6, 13, and 16. Because of the data dependencies within the loop body, the throughput is less than one instruction per cycle. The overall effect is that the two iterations cannot be finished within 20 cycles. To take better advantage of the in-order processor, a compiler or assembly language programmer would need to unroll the loop. For example, unrolling by a factor of two would lead to the execution diagram in Fig. 2.4.

    load 0(r1), f0       F D E E W
    addf f0, f2, f4      F D E E E E W
    store f4, 0(r1)      F D E E W
    add r1, #4, r1       F D E W
    bne r1, r2, loop     F D E W
    load 0(r1), f0       F D E E W
    addf f0, f2, f4      F D E E E E W
    store f4, 0(r1)      F D E E
    add r1, #4, r1       F D
    bne r1, r2, loop     F

FIGURE 2.3 Superscalar execution with in-order execution (two loop iterations over cycles 1 through 20; F = fetch, D = decode, E = execute, W = write back; the last three instructions have not completed by cycle 20).

    load 0(r1), f0       F D E E W
    load 4(r1), f2       F D E E W
    addf f0, f4, f6      F D E E E E W
    addf f2, f4, f8      F D E E E E W
    store f6, 0(r1)      F D E E W
    store f8, 4(r1)      F D E E W
    add r1, #8, r1       F D E W
    bne r1, r2, loop     F D E W

FIGURE 2.4 Superscalar execution with in-order execution and compiler loop unrolling (one unrolled iteration updating two array elements; stage letters as in Fig. 2.3).

In this case, one iteration of the unrolled loop, with two element updates, finishes in cycle 12. There is still some minor stalling at the decoder, as seen in cycles 4 and 6, but the overall performance has improved greatly.

For the out-of-order superscalar processor illustrated in Fig. 2.5, each instruction has to traverse more stages, but dependent instructions can be buffered so that the decoder and execution units can bypass these instructions and uncover independent instructions that will be ready to execute. This can be observed in cycle 4, in which the integer add is issued before any of the three previous instructions complete. (The WAR dependency between the add and store instructions is handled by register renaming.) The integer add completes in cycle 6 but the result has to wait in the reorder buffer until it can retire in-order in cycle 15. Even without compiler unrolling, the two iterations finish by cycle 20.

In fact, the use of renaming and dynamic scheduling allows the processor to do hardware unrolling of the loop, as seen in cycle 7 in which the load instruction of the second iteration is issued before the first

