
SUPERPIPELINED AND SUPERSCALAR ARCHITECTURES

In the document COMPLETE DIGITAL DESIGN (Page 182-186)

Advanced Microprocessor Concepts

7.5 SUPERPIPELINED AND SUPERSCALAR ARCHITECTURES

At any given time, semiconductor process technology presents an intrinsic limitation on how fast a logic gate can switch on and off and at what frequency a flip-flop can run. Other than relying on semiconductor process advances to improve microprocessor and system throughput, certain basic techniques have been devised to extract more processing power from silicon with limited switching

FIGURE 7.7 Location of TLB. (Diagram labels: MPU, L1 Cache, TLB, MMU, L2 Cache/DRAM, PID/VPN, Page Offset, Physical Page Address, TLB Miss.)

-Balch.book Page 161 Thursday, May 15, 2003 3:46 PM


delays. Throughput can be enhanced in a serial manner by trying to execute a desired function faster.

If each function is executed at a faster clock frequency, more functions can be executed in a given time period. An alternative parallel approach can be taken whereby multiple functions are executed simultaneously, thereby improving performance over time. These two approaches can be complementary in practice. Different logic implementations make use of serial and parallel enhancement techniques in the proportions and manners that are best suited to the application at hand.

A logic function is represented by a set of Boolean equations that are then implemented as discrete gates. During one clock cycle, the inputs to the equations are presented to a collection of gates via a set of input flops, and the results are clocked into output flops on the next rising edge. The propagation delays of the gates and their interconnecting wires largely determine the shortest clock period at which the logic function can reliably operate.

Pipelining, called superpipelining when taken to an extreme, is a classic serial throughput enhancement technique. Pipelining is the process of breaking a Boolean equation into several smaller equations and then calculating the partial results during sequential clock cycles. Smaller equations require fewer gates, which have a shorter total propagation delay relative to the complete equation.

The shorter propagation delay enables the logic to run faster. Instead of calculating the complete result in a single 40 ns cycle, for example, the result may be calculated in four successive cycles of 10 ns each. At first glance, it may not seem that anything has been gained, because the calculation still takes 40 ns to complete. The power of pipelining is that different stages in the pipeline are operating on different calculations each cycle. Using an example of an adder that is pipelined across four cycles, partial sums are calculated at each stage and then passed to the next stage. Once a partial sum is passed to the next stage, the current stage is free to calculate the partial sum of a completely new addition operation. Therefore, a four-stage pipelined adder takes four cycles to produce a result, but it can work on four separate calculations simultaneously, yielding an average throughput of one calculation every cycle—a four-times throughput improvement.
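The latency/throughput tradeoff can be illustrated numerically. This Python sketch uses the hypothetical timings above (one 40 ns combinational cycle versus four 10 ns pipeline stages) to compare total completion time for a batch of additions; the function names are illustrative only.

```python
def unpipelined_time_ns(n_ops, cycle_ns=40):
    # Each operation occupies the full 40 ns cycle by itself.
    return n_ops * cycle_ns

def pipelined_time_ns(n_ops, stages=4, cycle_ns=10):
    # The first result emerges after 'stages' cycles (pipeline fill),
    # then one result completes every cycle thereafter.
    return (stages + (n_ops - 1)) * cycle_ns

# A single addition is no faster either way: 40 ns in both cases.
print(unpipelined_time_ns(1), pipelined_time_ns(1))        # 40 40
# A long run of additions approaches the four-times improvement.
print(unpipelined_time_ns(1000), pipelined_time_ns(1000))  # 40000 10030
```

Note that the pipelined total never quite reaches a full 4x speedup because of the fixed fill latency, which is why pipelining pays off only when many operations flow through back to back.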

Pipelining does not come for free, because additional logic must be created to handle the complexity of tracking partial results and merging them into successively more complete results. Pipelining a 32-bit unsigned integer adder can be done as shown in Fig. 7.8 by adding eight bits at a time and then passing the eight-bit sum and carry bit up to the next stage. From a Boolean equation perspective, each stage only incurs the complexity of an 8-bit adder instead of a 32-bit adder, enabling it to run faster. An array of pipeline registers is necessary to hold the partial sums that have been calculated by previous stages and the as-yet-to-be-calculated portions of the operands. The addition results ripple through the pipeline on each rising clock edge and are accumulated into a final 32-bit result as operand bytes are consumed by the adders. There is no feedback in this pipelined adder, meaning that, once a set of operands passes through a stage, that stage no longer has any involvement in the operation and can be reused to begin or continue a new operation.
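The byte-at-a-time scheme can be sketched behaviorally in Python (the function name and sequential loop are illustrative; real hardware performs the four stages in separate clocked register stages, one stage per operation per cycle):

```python
def pipelined_add32(a, b):
    """Add two 32-bit unsigned integers 8 bits at a time, as a
    four-stage pipeline would: each stage consumes one operand byte
    plus the carry bit produced by the stage below it."""
    result = 0
    carry = 0
    for stage in range(4):                   # stages 1..4, LSB byte first
        a_byte = (a >> (8 * stage)) & 0xFF
        b_byte = (b >> (8 * stage)) & 0xFF
        partial = a_byte + b_byte + carry
        carry = partial >> 8                 # carry bit passed up the pipe
        result |= (partial & 0xFF) << (8 * stage)
    return result                            # carry out of bit 31 is dropped

print(hex(pipelined_add32(0x0000FFFF, 0x00000001)))  # 0x10000
```

The carry variable here plays the role of the single carry bit that Fig. 7.8 passes between stages; the accumulated `result` corresponds to the array of pipeline registers holding already-computed sum bytes.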

Pipelining increases the overall throughput of a logic block but does not usually decrease the calculation latency. High-performance microprocessors often take advantage of pipelining to varying degrees. Some microprocessors implement superpipelining whereby a simple RISC instruction may have a latency of a dozen or more clock cycles. This high degree of pipelining allows the microprocessor to execute an average of one instruction each clock cycle, which becomes very powerful at operating frequencies measured in hundreds of megahertz and beyond.

Superpipelining a microprocessor introduces complexities that arise from the interactions between consecutive instructions. One instruction may contain an operand that is calculated by the previous instruction. If not handled correctly, this common circumstance can result in the wrong value being used in a subsequent instruction or a loss of performance where the pipeline is frequently stalled to allow one instruction to complete before continuing with others. Branches can also cause havoc with a superpipelined architecture, because the decision to take a conditional branch may nullify the few instructions that have already been loaded into the pipeline and partially executed. Depending on how the microprocessor is designed, various state information that has already been modified by these partially executed instructions may have to be rolled back as if the instructions were never fetched. Branches can therefore cause the pipeline to be flushed, reducing the throughput of the microprocessor, because there will be a gap in time during which new instructions advance through the pipeline stages and finally emerge at the output.
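The cost of these flushes can be estimated with a simple model. Assuming a pipeline that normally completes one instruction per cycle and loses roughly depth − 1 refill cycles per flush (the specific numbers below are hypothetical, not from the text), a sketch:

```python
def avg_cpi(depth, branch_freq, flush_rate):
    """Average cycles per instruction for a pipeline that ideally
    completes one instruction per cycle, where branch_freq is the
    fraction of instructions that are branches and flush_rate is the
    fraction of those branches that flush the pipeline, each flush
    costing (depth - 1) refill cycles."""
    return 1 + branch_freq * flush_rate * (depth - 1)

# A 12-deep superpipeline, 20% branches, every branch flushing:
print(avg_cpi(12, 0.20, 1.0))            # 3.2
# The same pipeline if prediction avoids 90% of the flushes:
print(round(avg_cpi(12, 0.20, 0.1), 2))  # 1.22
```

The deeper the pipeline, the larger the refill gap, which is why the branch prediction techniques discussed later in this section matter most for superpipelined designs.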

Traditional microprocessor architecture specifies that instructions are executed serially in the order explicitly defined by the programmer. Microprocessor designers have long observed that, within a given sequence of instructions, small sets of instructions can be executed in parallel without changing the result that would be obtained had they been executed in the traditional serial manner. Superscalar microprocessor architecture has emerged as a means to execute multiple instructions simultaneously within a single microprocessor that is operating on a normal sequence of instructions. A superscalar architecture contains multiple independent execution units, some of which may be identical, that are organized and replicated according to statistical studies of which instructions are executed more often and how easily they can be made parallel without excessive restrictions and dependencies. Arithmetic execution units are prime targets for replication, because calculations with floating-point numbers and large integers require substantial logic and time to fully complete. A superscalar microprocessor may contain two integer ALUs and separate FPUs for floating-point addition and multiplication operations. Floating-point operations are the most complex instructions that many microprocessors execute, and they tend to have long latencies. Most floating-point applications contain a mix of addition and multiplication operations, making them well suited to an architecture with individual FPUs that each specialize in one type of operation.
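As a sketch of the restrictions involved, the following hypothetical check decides whether two adjacent instructions could be dispatched to separate execution units in the same cycle; the (destination, sources) encoding and register names are invented for illustration:

```python
def can_dual_issue(i1, i2):
    """Each instruction is a (dest, srcs) tuple. The pair may issue
    together only if i2 does not read i1's result (read-after-write),
    the two do not write the same register (write-after-write), and
    i1 does not read the register i2 overwrites (write-after-read)."""
    d1, s1 = i1
    d2, s2 = i2
    return d1 not in s2 and d1 != d2 and d2 not in s1

# add r1,r2,r3 and mul r4,r5,r6 are independent: both ALUs stay busy.
print(can_dual_issue(("r1", ("r2", "r3")), ("r4", ("r5", "r6"))))  # True
# mul r4,r1,r5 consumes r1 from the add, so it must wait a cycle.
print(can_dual_issue(("r1", ("r2", "r3")), ("r4", ("r1", "r5"))))  # False
```

Real issue logic must perform checks like these across every pairing of candidate instructions and available execution units, every cycle, which is part of the design complexity discussed below.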

FIGURE 7.8 Pipelined adder. (Diagram labels: 32-Bit, Stage 1 through Stage 4, 8-Bit.)



Managing parallel execution units in a superscalar microprocessor is a complex task, because the microprocessor wants to execute instructions as fast as they can be fetched—yet it must do so in a manner consistent with the instructions’ serial interdependencies. These dependencies can become more complicated to resolve when superscalar and superpipelining techniques are combined to create a microprocessor with multiple execution units, each of which is implemented with a deep pipeline. In such chips, the instruction decode logic handles the complex task of examining the pipelines of the execution units to determine when the next instruction is free of dependencies, allowing it to begin execution.
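A minimal sketch of this decode-stage check, assuming the decoder tracks which destination registers are still pending in the execution pipelines (the encoding is illustrative, not any particular machine's):

```python
def can_issue(instr, in_flight):
    """instr is a (dest, srcs) tuple; in_flight is the set of registers
    that instructions already in the execution pipelines will still
    write. The decoder holds instr back until none of its source
    registers (or its destination) are pending."""
    dest, srcs = instr
    return dest not in in_flight and not any(s in in_flight for s in srcs)

pending = {"r1", "r7"}                           # results still in the pipes
print(can_issue(("r4", ("r2", "r3")), pending))  # True: no overlap
print(can_issue(("r5", ("r1", "r2")), pending))  # False: r1 not ready yet
```

In hardware this bookkeeping is typically a scoreboard of per-register valid bits rather than a set, but the decision being made each cycle is the same.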

Related to superpipelining and superscalar methods are the techniques of branch prediction, speculative execution, and instruction reordering. Deep pipelines are subject to performance-degrading flushes each time a branch instruction comes along. To reduce the frequency of pipeline flushes due to branch instructions, some microprocessors incorporate branch prediction logic that attempts to make a preliminary guess as to whether the branch will be taken. These guesses are made based on the history of previous branches. The exact algorithms that perform branch prediction vary by implementation and are not always disclosed by the manufacturer, to protect their trade secrets. When the branch prediction logic makes its guess, the instruction fetch and decode logic can speculatively execute the instruction stream that corresponds to the predicted branch result. If the prediction logic is correct, a costly pipeline flush is avoided. If the prediction is wrong, performance will temporarily degrade until the pipeline can be restarted. Hopefully, a given branch prediction algorithm improves performance rather than degrading it by having a worse record than would exist with no prediction at all!
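One widely used history-based scheme (not necessarily what any particular manufacturer implements) is a two-bit saturating counter per branch, sketched here in Python:

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counter: states 0-1 predict
    not-taken, states 2-3 predict taken. Two consecutive wrong
    guesses are needed to flip a strongly held prediction."""
    def __init__(self):
        self.counters = {}                      # branch address -> state

    def predict(self, pc):
        return self.counters.get(pc, 0) >= 2    # True means "taken"

    def update(self, pc, taken):
        c = self.counters.get(pc, 0)
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)

bp = TwoBitPredictor()
for outcome in [True, True, True, False, True]:  # a mostly taken loop branch
    bp.update(0x400, outcome)
# The single not-taken outcome did not flip the trained prediction.
print(bp.predict(0x400))   # True
```

The two-bit hysteresis is what makes this scheme effective for loop branches: the one not-taken outcome at loop exit does not cause a misprediction on the next entry into the loop.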

The problem with branch prediction is that it is sometimes wrong, and the microprocessor must back out of any state changes that have resulted from an incorrectly predicted branch. Speculative execution can be taken a step farther in an attempt to eliminate the penalty of a wrong branch prediction by executing both possible branch results. To do this, a superscalar architecture is needed that has enough execution units to speculatively execute extra instructions whose results may not be used. It is a foregone conclusion that one of the branch results will not be valid. There is substantial complexity involved in such an approach because of the duplicate hardware that must be managed and the need to rapidly swap to the correct instruction stream that is already in progress when the result of a branch is finally known.

A superscalar microprocessor will not always be able to keep each of its execution units busy, because of dependencies across sequential instructions. In such a case, the next instruction to be pushed into the execution pipeline must be held until an in-progress instruction completes. Instruction reordering logic reduces the penalty of such instruction stalls by attempting to execute instructions outside the order in which they appear in the program. The microprocessor can prefetch a set of instructions ahead of those currently executing, enabling it to look ahead in the sequence and determine whether a later instruction can be safely executed without changing the behavior of the instruction stream. For such reordering to occur, an instruction must not have any dependencies on those that are being temporarily skipped over. Such dependencies include not only operands but branch possibilities as well. Reordering can occur in a situation in which the ALUs are busy calculating results that are to be used by the next instruction in the sequence, and their latencies are preventing the next instruction from being issued. A load operation that is immediately behind the stalled instruction can be executed out of order if it does not operate on any registers that are being used by the instructions ahead of it. Such reordering boosts throughput by taking advantage of otherwise idle execution cycles.
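The reordering decision can be sketched as a scan of a small prefetch window for the first instruction whose registers do not conflict with anything still in flight (the instruction encoding, window, and busy-register set are all illustrative, and the model ignores some hazards real hardware must also track):

```python
def pick_next(window, busy_regs):
    """Scan the prefetch window in program order and return the index
    of the first instruction (dest, srcs, is_branch) that can issue
    now. Skipped instructions contribute their destinations to the
    blocked set so a later instruction cannot consume a result that
    has not been produced, and no reordering occurs past a branch."""
    blocked = set(busy_regs)
    for i, (dest, srcs, is_branch) in enumerate(window):
        if dest not in blocked and not any(s in blocked for s in srcs):
            return i            # safe to issue out of order
        if is_branch:
            break               # never reorder past an unresolved branch
        blocked.add(dest)       # later instrs must wait for this result
    return None                 # nothing ready: the pipeline stalls

window = [
    ("r3", ("r1", "r2"), False),  # needs r1, still being computed: stalls
    ("r5", ("r4",), False),       # independent load: can go out of order
]
print(pick_next(window, busy_regs={"r1"}))  # 1
```

This mirrors the load example in the text: the stalled add at index 0 waits for r1, while the independent load at index 1 fills the otherwise idle execution cycle.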

All of the aforementioned throughput improvement techniques come at a cost of increased design complexity and cost. However, it has been widely noted that the cost of a transistor on an IC is asymptotically approaching zero as tens of millions of transistors are squeezed onto chips that cost only several hundred dollars. Once designed, the cost of implementing deep pipelines, multiple execution units, and the complex logic that coordinates the actions of both continues to decrease over time.


