**2.2 Register Renaming Techniques * Dezső Sima**

2.2.1 Introduction

Register renaming (or renaming for short) is a widely used technique in advanced instruction-level parallel processors (ILP-processors) to remove false data dependencies between register operands of subsequent instructions in a straight-line code sequence [1–3]. False data dependencies are write-after-read (WAR) or write-after-write (WAW) dependencies (see Appendix A). Once register renaming has removed false data dependencies, more instructions are, on average, available for parallel execution per cycle, which increases processor performance.
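As a small illustration of the dependency classes named above, the following Python sketch classifies the register dependencies between two instructions. The function name `dependencies` and the `(dest, sources)` tuple representation are assumptions made for this example only; RAW (read-after-write) is the true dependency that renaming cannot remove.

```python
from typing import Set, Tuple

# Hypothetical representation: (destination register, set of source registers).
Instr = Tuple[str, Set[str]]

def dependencies(earlier: Instr, later: Instr) -> Set[str]:
    """Return the register dependencies of `later` on `earlier`."""
    d1, s1 = earlier
    d2, s2 = later
    found = set()
    if d1 in s2:
        found.add("RAW")  # true dependency: later reads what earlier writes
    if d2 in s1:
        found.add("WAR")  # false dependency: later overwrites a source of earlier
    if d1 == d2:
        found.add("WAW")  # false dependency: both write the same register
    return found
```

Only the WAR and WAW results are candidates for removal by renaming; a RAW result reflects a real data flow that must be preserved.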

*Portions of this chapter reprinted with permission from Sima, D., The design space of register renaming techniques, IEEE Micro, 20, Sept./Oct., 70, 2000, [35] © IEEE.

The principle of register renaming is straightforward. The processor removes false data dependencies by first writing instruction results into dynamically allocated buffers, called rename buffers, rather than into the specified destination registers, and forwards these results into the originally specified architectural registers at a later stage of instruction execution. For instance, in the case of the following WAW dependency:

i1: add r1, r2, r3; [r1 ← (r2) + (r3)]

i2: mul r1, r4, r5; [r1 ← (r4) × (r5)]

the processor renames the destination register of i2, that is r1, say to r33. After renaming register r1, the instruction i2 becomes

i2′: mul r33, r4, r5; [r33 ← (r4) × (r5)]

and the processor writes the result of i2′ into r33 instead of r1. This resolves the previous WAW dependency between i1 and i2. In subsequent instructions, however, references to the source register r1 must be redirected to the rename buffer r33 as long as the renaming remains valid. In the next section we give a detailed description of the whole rename process.
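The redirection of later source references can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular processor: the function `rename`, the tuple layout `(opcode, dest, src1, src2)`, and the choice to give every destination a fresh buffer starting at r33 are assumptions of the sketch (the example above renames only the destination of i2).

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, str, str, str]  # (opcode, dest, src1, src2), hypothetical layout

def rename(instrs: List[Instr], first_rename_reg: int = 33) -> List[Instr]:
    """Give each destination a fresh rename buffer and redirect later sources."""
    mapping: Dict[str, str] = {}   # architectural register -> current rename buffer
    next_free = first_rename_reg
    renamed = []
    for op, rd, rs1, rs2 in instrs:
        # Look up sources first, so they see only renamings made by earlier instructions.
        s1 = mapping.get(rs1, rs1)
        s2 = mapping.get(rs2, rs2)
        new_rd = f"r{next_free}"   # allocate a fresh buffer for the destination
        next_free += 1
        mapping[rd] = new_rd
        renamed.append((op, new_rd, s1, s2))
    return renamed

program = [
    ("add", "r1", "r2", "r3"),
    ("mul", "r1", "r4", "r5"),
    ("sub", "r6", "r1", "r7"),     # reads the r1 written by the mul
]
for instr in rename(program):
    print(instr)
```

After renaming, the add and the mul write different buffers, so the WAW dependency is gone, while the sub reads the buffer allocated to the mul, which preserves the true dependency.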

A precursor to register renaming was introduced in 1967 by Tomasulo in the IBM 360/91 [4], a scalar supercomputer of that time, which pioneered both pipelining and shelving. The 360/91 renamed floating-point registers in order to preserve the logical consistency of the program execution, rather than to increase processor performance by removing false data dependencies.

Tjaden and Flynn [5] were the first to suggest the use of register renaming for removing false data dependencies, in 1970. They proposed to rename load type instructions, without using the term "register renaming." This specific term was introduced a few years later, in 1975, by Keller [6], who extended renaming to cover all instructions that include a destination register. He also described a possible hardware implementation of this technique. Because of the complexity of its implementation, however, about two decades passed until register renaming came into widespread use in superscalars in the middle of the 1990s.

Early superscalar models of significant processor lines, such as the PA 7100, SuperSPARC, Alpha 21064, R8000, and the Pentium, typically did not yet use renaming as indicated in Fig. 2.6.

Renaming appeared gradually, first in a restricted form, called partial renaming, in the beginning of the 1990s, in the IBM RS/6000 (POWER1), POWER2, PowerPC 601, and in NexGen's Nx586 processors, as depicted in Fig. 2.6. Partial renaming restricts renaming to one or to a few data types, such as floating-point loads or floating-point instructions, as detailed in Section 2.2.3.2. Full renaming emerged later, beginning in 1992, first in the high-end models of the IBM mainframe family ES/9000, then in the PowerPC 603. Subsequently, renaming spread into virtually all superscalar processors with the notable exception of Sun's UltraSPARC line. At present, register renaming is considered to be a standard feature of superscalar processors.

2.2.2 Overview of the Rename Process

The rename process itself is considerably complex. It consists of a number of rename-specific tasks: renaming the destination and the source registers, fetching renamed source operands, updating the rename buffers, releasing allocated rename buffers, recovering the rename process after incorrectly speculated execution, etc. In addition, each of the rename-specific tasks may be implemented in a number of different ways. Furthermore, specific features of the underlying microarchitecture also affect the rename process. Therefore, each concrete description of the rename process is tied to the particular renaming technique employed and to the underlying microarchitecture. Thus, before describing the rename process we need to be specific about both the possible renaming techniques and the types of microarchitectures considered.

FIGURE 2.6 Chronology of the introduction of renaming in commercial superscalar processors, 1990–2000 (RISC and CISC processor lines, with partial and full renaming marked separately). As date of introduction we indicate the first year of volume production. Following the model designation we also show the dispatch rate of the processors (in brackets). Concerning the dispatch rate of CISC processors we note that one x86 instruction can be considered to be equivalent to 1.3–1.9 RISC instructions. In this figure we give references to the processors which make use of renaming. (From Gwennap, L., Microprocessor Report, 9, 14, 1, 1995.)

Notes to Figure 2.6: PPC designates PowerPC. The Nx586 has scalar issue for CISC instructions but a three-way superscalar core for converted RISC instructions. The dispatch rate of the POWER2 and P2SC is 6 along the sequential path while only 4 immediately after a branch.

Concerning renaming techniques, in a subsequent section we show that there are nine basic alternatives available. In our description of the rename process, we need to presume one of them. Our choice is the one where (1) renaming is implemented by using rename register files (RRF) and (2) architectural registers are mapped to rename registers by means of mapping tables. Both terms are explained in a subsequent section; for now we note that RRFs, split into separate fixed-point (FX) and floating-point (FP) RRFs, temporarily store the instruction results produced by the execution units, while the FX- and FP-mapping tables hold the actual mappings of the FX- and FP-architectural registers to the associated rename registers, as indicated in the section on Layout of Register Mapping.

As far as the underlying microarchitecture is concerned, there are two design aspects that affect the implementation of the rename process: (1) whether or not the processor uses shelving (indirect issue, dynamic instruction scheduling, queued issue; see related box) and (2) assuming the use of shelving, what kind of operand fetch policy is employed (see related box). As recent superscalars predominantly make use of shelving, we take this design option for granted throughout this section. Regarding the operand fetch policy, which is one design aspect of shelving, we take into account both alternatives, since superscalar processors make use of both policies. Thus, we describe the rename process in the subsequent two sections in two scenarios, first assuming the dispatch-bound fetch policy and then the issue-bound fetch policy. In both scenarios, we describe the rename process by focusing only on a small part of the microarchitecture, which is just enough to highlight the implementation of specific tasks of the rename process.

2.2.2.1 Process of Renaming, Assuming Dispatch-Bound Operand Fetching

The considered part of the microarchitecture executes FX-instructions and consists of an architectural register file (ARF), a rename register file (RRF), a mapping table, a reservation station (RS), and an execution unit (EU), as shown in Fig. 2.7.

Our subsequent description of the rename process is embedded into the general framework of instruction processing. Here, we distinguish the following four processing phases: (1) decoded instructions are dispatched into the RSs; (2) executable instructions are issued from the RSs to the EUs; (3) the EUs perform the prescribed operations and generate the results of the instructions, at which point the instructions are said to be finished; and finally, (4) the processor completes (commits, retires) instructions in an in-order fashion, irreversibly updating the program state with the results of the instructions.

Assuming the processor core as shown in Fig. 2.7 and dispatch-bound operand fetching, the rename process is carried out as follows:

I. During instruction dispatch, three rename-related tasks must be performed:

a. Destination registers of dispatched instructions (Rd) need to be renamed.

b. Source registers (Rs1 and Rs2) should be renamed in order to redirect the source references to the associated rename registers.

c. Required source operands need to be fetched.

1. Renaming the destination registers of dispatched instructions: To rename the destination register of a dispatched instruction, first a free rename register needs to be allocated to the dispatched instruction. This task is accomplished by means of the mapping table, which keeps track of the actual mappings of the architectural registers to the rename registers. Renaming of the destination register results in writing the identifier of the allocated rename register (Rd′) into the corresponding mapping table entry, and also forwarding this identifier into the corresponding field of the RS. Typically, the processor uses the index of the allocated rename register as Rd′.

2. Renaming the source registers: Source registers for which a valid renaming exists also need to be renamed. This is carried out by accessing the mapping table with the source register identifiers (Rs1, Rs2) as indices, and fetching the identifiers of the allocated rename registers (designated as Rs1′, Rs2′). If there is no valid renaming for a particular source identifier, the required source operand will be accessed from the ARF by using the original source register identifier (Rs1 or Rs2).

3. Fetching the source operands: Finally, the referenced source operands need to be fetched. However, with renaming, requested source operands may be in one of two possible locations. If there is a valid renaming, the requested operand needs to be fetched from the RRF, else from the ARF. To fetch a requested operand, the processor usually accesses both the RRF and the ARF simultaneously to shorten the access time. If only the ARF hits, the referenced source register is actually not renamed and the accessed value is the required one. If, however, a valid renaming exists for a particular source register, both register files hit and the processor will give preference to the operand fetched from the RRF. In this case, the RRF may deliver either a valid operand value (Op1/Op2), if it has already been produced by a preceding instruction, or the index of the rename register that will hold the requested value after its generation (Rs1′/Rs2′), if the required result has not yet been calculated. Thus, for each referenced source register either the requested operand value (Op1/Op2) or the appropriate rename register identifier (Rs1′/Rs2′) will be written into the RS.

The valid bits associated with the source operand fields (V1/V2) indicate whether the related operand field holds a valid source operand value (Op1/Op2) or a rename register identifier (Rs1′/Rs2′).

II. Issuing is not at all rename specific. Assuming in-order issuing, the processor inspects the valid bits of the source operands (V1 and V2) of the oldest instruction kept in the RS. If both valid bits of this instruction are set and the EU is also free, the instruction is forwarded to the EU for execution.

III. After the EU has finished the execution of an instruction, both the RS and the RRF need to be updated with the generated result. To update the RS, the generated results and their identifiers (Rd′) are broadcast to all the source register entries held in the RS. Through an associative search, all source register identifiers (Rs1′, Rs2′) that are waiting for the new result are located. The processor substitutes matching identifiers with the result value and sets the associated valid bits (V1 or V2) to indicate availability. We note that this task is performed in basically the same way with and without renaming. There is, however, a slight difference with renaming, as in this case the search key is the renamed destination register identifier (Rd′) rather than the original destination register identifier (Rd) that is used without renaming. The second task is to update the rename register file. This is done simply by writing the new result into the RRF using the identifier accompanying the result produced (Rd′) and setting the associated valid bit (V) to signal availability.

IV. When an instruction completes, the processor permanently updates the ARF, and thus the program state, with the content of the associated rename register. This is done by writing the result of the completed instruction from the associated rename register to the addressed destination register. At this stage of the instruction execution, the resources bound to the established renaming become free. Therefore, the related entry in the mapping table needs to be deleted and the rename register involved can be reclaimed for further use. This is so since (i) after completion, the result of the instruction, that is, the content of the rename register, has already been written into the addressed destination register, and (ii) after finishing the instruction, the generated result has already been transferred to all instructions waiting for this operand in the RS.

FIGURE 2.7 Processor core providing shelving with dispatch-bound operand fetching and renaming. [Block diagram: at dispatch, decoded instructions (Rd, Rs1, Rs2) access the mapping table, the ARF, and the RRF; each RS entry holds OC, Rd′, Op1/Rs1′, V1, Op2/Rs2′, V2; at issue, the valid bits are checked and OC, Rd′, Op1, Op2 are forwarded to the EU; the result and Rd′ update the RS and the RRF (with bypassing), and completion updates the ARF.]
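Steps I to IV can be traced with a small behavioural model in Python. This is a sketch under stated assumptions, not a hardware description: the class and method names (`Core`, `dispatch`, `issue`, `finish`, `complete`), the two toy operations, and the register file sizes are all invented for illustration. One further assumption is labelled in the code: a mapping table entry is deleted at completion only if it still points to the completing rename register, so that a younger renaming of the same architectural register is not lost. Stalling when no rename register is free is omitted.

```python
import operator
from dataclasses import dataclass
from typing import Dict, List, Optional

OPS = {"add": operator.add, "mul": operator.mul}  # toy EU operations (assumption)

@dataclass
class RenameReg:
    value: int = 0
    valid: bool = False          # V bit: has the result been produced?
    arch: Optional[int] = None   # architectural register (Rd) being renamed

@dataclass
class RSEntry:
    oc: str                      # operation code
    rd_r: int                    # Rd': index of the allocated rename register
    ops: List[int]               # per slot: operand value or rename index (Rs')
    valid: List[bool]            # V1, V2: True if the slot holds a value

class Core:
    def __init__(self, n_arch: int = 8, n_rename: int = 4):
        self.arf = [0] * n_arch
        self.rrf = [RenameReg() for _ in range(n_rename)]
        self.free = list(range(n_rename))
        self.mapping: Dict[int, int] = {}   # mapping table: Rd -> Rd'
        self.rs: List[RSEntry] = []

    def dispatch(self, oc: str, rd: int, rs1: int, rs2: int) -> int:
        ops, valid = [], []
        for rs in (rs1, rs2):               # rename sources and fetch operands
            idx = self.mapping.get(rs)
            if idx is None:                 # not renamed: value comes from the ARF
                ops.append(self.arf[rs]); valid.append(True)
            elif self.rrf[idx].valid:       # renamed, result already produced
                ops.append(self.rrf[idx].value); valid.append(True)
            else:                           # renamed, result pending: store Rs'
                ops.append(idx); valid.append(False)
        rd_r = self.free.pop(0)             # allocate Rd' (no stall handling here)
        self.rrf[rd_r] = RenameReg(arch=rd)
        self.mapping[rd] = rd_r
        self.rs.append(RSEntry(oc, rd_r, ops, valid))
        return rd_r

    def issue(self) -> Optional[RSEntry]:
        """In-order issue: only the oldest RS entry is inspected."""
        if self.rs and all(self.rs[0].valid):
            return self.rs.pop(0)
        return None

    def finish(self, entry: RSEntry) -> None:
        result = OPS[entry.oc](*entry.ops)
        for e in self.rs:                   # associative search keyed by Rd'
            for i in range(2):
                if not e.valid[i] and e.ops[i] == entry.rd_r:
                    e.ops[i], e.valid[i] = result, True
        reg = self.rrf[entry.rd_r]          # update the RRF and set its V bit
        reg.value, reg.valid = result, True

    def complete(self, rd_r: int) -> None:
        reg = self.rrf[rd_r]
        self.arf[reg.arch] = reg.value      # permanent update of the program state
        if self.mapping.get(reg.arch) == rd_r:  # assumption: keep younger renamings
            del self.mapping[reg.arch]
        self.free.append(rd_r)              # reclaim the rename register
```

A short usage trace: after `core.dispatch("add", 1, 2, 3)` and `core.dispatch("mul", 5, 1, 4)`, the second RS entry holds a rename register index with V1 cleared in its first operand slot; calling `finish` on the issued add replaces that index with the result, which is the RS update described in step III.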

During renaming, rename registers take on a sequence of states, as indicated in Fig. 2.8.

During initialization, the processor sets all rename registers into the "available" state. When the processor allocates a rename register to a dispatched instruction, the state of the allocated register is changed to "allocated, not valid" and its valid bit is reset. When this instruction finishes, the newly produced result is written into the associated rename register, and its state is set to "allocated, valid."

Finally, when the instruction completes, the result held temporarily in the rename register is written into the specified architectural register. Thus, the allocated rename register can be reclaimed, and its state is changed back to "available." Nevertheless, it can happen that an exception or faulty speculative execution requires not-yet-completed instructions to be flushed. In this case, a recovery procedure is needed: the state of the rename registers concerned is changed from "allocated, not valid" or "allocated, valid" to "available," and the corresponding mappings between architectural and rename registers are deleted.

FIGURE 2.8 State transition diagram of the rename registers, assuming the use of a rename register file (RRF). Transitions: initialization → available; available → allocated, not valid (allocate, if instruction is dispatched); allocated, not valid → allocated, valid (update, if instruction is finished); allocated, valid → available (reclaim, if instruction is completed); allocated, not valid or allocated, valid → available (reclaim, if instruction is canceled).
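The state transitions of Fig. 2.8 can be written down as a small transition table. The following Python sketch is illustrative only; the event names (`dispatch`, `finish`, `complete`, `cancel`) and the helper `step` are assumptions of the sketch, chosen to mirror the transition labels of the figure.

```python
from enum import Enum

class State(Enum):
    AVAILABLE = "available"
    ALLOCATED_NOT_VALID = "allocated, not valid"
    ALLOCATED_VALID = "allocated, valid"

# (current state, event) -> next state, following the transitions of Fig. 2.8
TRANSITIONS = {
    (State.AVAILABLE, "dispatch"): State.ALLOCATED_NOT_VALID,       # allocate
    (State.ALLOCATED_NOT_VALID, "finish"): State.ALLOCATED_VALID,   # update
    (State.ALLOCATED_VALID, "complete"): State.AVAILABLE,           # reclaim
    (State.ALLOCATED_NOT_VALID, "cancel"): State.AVAILABLE,         # recovery
    (State.ALLOCATED_VALID, "cancel"): State.AVAILABLE,             # recovery
}

def step(state: State, event: str) -> State:
    """Return the next state, rejecting transitions the diagram does not contain."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"illegal transition: {event!r} in state {state.value!r}"
        ) from None
```

Encoding the diagram as a table makes the illegal cases explicit: for instance, a register in the "available" state cannot receive a result, so `step(State.AVAILABLE, "finish")` raises an error.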

2.2.2.2 Process of Renaming, Assuming Issue-Bound Operand Fetching

Assuming basically the same processor core as before, but using the issue-bound operand fetching, the rename process is carried out as follows (see Fig. 2.9):

1. During instruction dispatch, both the destination register (Rd) and the source registers (Rs1 and Rs2) are renamed in the same way as described for dispatch-bound operand fetching. But now, beyond the operation code (OC) and the renamed destination register identifier (Rd′), the renamed source register identifiers (Rs1′ and Rs2′) are written into the RS rather than the operand values (Op1, Op2, if available) as with dispatch-bound operand fetching.

2. During issuing, two tasks need to be performed: (a) the instruction held in the last entry of the RS needs to be checked to see whether it is executable. If so, and if the EU is also free, this instruction needs to be forwarded for execution to the EU. (b) During forwarding of the instruction, its operands need to be fetched either from the RRF or from the ARF in the same way as described in connection with the dispatch-bound operation.

3. When the EU finishes its operation, the generated result is used to update the RRF. Updating is performed by writing the result into the allocated rename register using the supplied register identifier (Rd′) as an index into the RRF and setting the associated valid bit (V-bit).

FIGURE 2.9 Processor core providing shelving with issue-bound operand fetching and renaming. [Block diagram: at dispatch, decoded instructions (OC, Rd, Rs1, Rs2) are renamed through the mapping table and written into the RS as OC, Rd′, Rs1′, Rs2′; at issue, the availability of (Rs1′) and (Rs2′) is checked and Op1, Op2 are fetched from the RRF or the ARF (with bypassing); the result and Rd′ update the RRF.]

4. Finally, when the processor completes an instruction, the temporary result held in the associated rename register is written into the architectural register specified in the destination field of the instruction. The only tasks remaining are to delete the corresponding entry in the mapping table and to reclaim the rename register associated with the completed instruction.

Reclaiming the rename register is, however, a far more complex task now than with dispatch-bound operand fetching. Notice that if operands are fetched dispatch bound, (a) dispatched instructions immediately access their operands and (b) missing operands are, after their generation, immediately forwarded from the EU to the instructions waiting for these operands in the RS. In this case, after completing an instruction, the allocated rename register can be reclaimed immediately. However, if operands are fetched during issuing, the RS is not automatically updated with the produced results. As a consequence, after an instruction completes, the RS may still contain instructions that will require the contents of the rename register allocated to the just-completed instruction. Thus, when instructions complete, their allocated rename registers cannot be reclaimed immediately, as they can with dispatch-bound operand fetching.

To resolve this problem, one possible solution is to maintain a counter for each rename register that keeps track of the number of outstanding references to this register. The counter is incremented each time a source operand of a dispatched instruction addresses this particular rename register, and decremented during issuing each time a source operand is fetched from this register. After all outstanding fetch requests for a particular rename register are satisfied, as indicated by a counter value of zero, and the associated instruction has been completed, the register becomes eligible for reclaiming.

At first sight, it may seem that this intricate reclaim process could be avoided if, during completion, the RS were searched for all renamed source operand identifiers (Rs1′, Rs2′) that refer to the rename buffer allocated to the completing instruction (Rd′), and matching identifiers were remapped to the associated architectural register (Rd). Unfortunately, this idea is not applicable, since there is no guarantee that the addressed architectural register will not be rewritten before the instructions needing its content are issued.
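The counter-based reclaim condition can be illustrated with a minimal Python sketch. The class `RenameRegister` and its method names are hypothetical, introduced only for this example; the sketch models the bookkeeping for a single rename register and nothing else.

```python
class RenameRegister:
    """Reclaim bookkeeping for one rename register under issue-bound fetching."""

    def __init__(self) -> None:
        self.pending_reads = 0   # readers dispatched but not yet issued
        self.completed = False   # has the producing instruction completed?

    def on_reader_dispatched(self) -> None:
        # A dispatched instruction names this register as a source operand.
        self.pending_reads += 1

    def on_operand_fetched(self) -> None:
        # An issuing instruction fetches its operand from this register.
        if self.pending_reads == 0:
            raise RuntimeError("operand fetched without a matching dispatch")
        self.pending_reads -= 1

    def on_producer_completed(self) -> None:
        self.completed = True

    def reclaimable(self) -> bool:
        # Both conditions must hold: no outstanding reads, producer completed.
        return self.completed and self.pending_reads == 0
```

The design point is that completion alone is not sufficient: with two dispatched readers, the register stays allocated after its producer completes and becomes reclaimable only once the second reader has fetched its operand.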

During the rename process, rename registers will take the same states and the same state transitions will also occur as described earlier in connection with Fig. 2.8. The only difference is that now rename registers are reclaimed according to modified conditions, as discussed previously.

We emphasize that other basic alternatives of register renaming differ mainly in two aspects: (1) the processor can hold renamed values in structures other than rename register files and (2) the processor can use a different scheme for mapping the architectural registers to rename registers than assumed above.

In addition, the processor should be able to rename not just one instruction per cycle but all dispatched instructions. Nevertheless, despite these differences, the previous descriptions in the two characteristic scenarios give a good background about how the rename process is carried out in any of the possible implementation schemes.

2.2.3 Design Space of Register Renaming Techniques

2.2.3.1 Overview

The design space of register renaming has four main dimensions: the scope of register renaming, the layout of the rename registers, the implementation technique of register mapping, and the rename rate, as indicated in Fig. 2.10. These aspects are discussed in the subsequent sections. For the presentation of the design space we make use of DS trees [3,36].

2.2.3.2 Scope of Register Renaming

The scope of register renaming indicates how extensively the processor makes use of renaming. In this respect, we distinguish between partial and full renaming. Partial renaming is restricted to one or to only
