Instruction Scheduler (ISCHED) - RT620 hyperSPARC Central Processing Unit

RT620 hyperSPARC Central Processing Unit

3.4 Instruction Scheduler (ISCHED)

Rather than attempting to execute all instruction combinations simultaneously, the SPARe instruction set has been divided into two major groups to simplify the detection of conditions for executing simultaneous instructions. These two groups are instructions which must be executed sequentially, and instructions which are eligible for simultaneous execution.

T E C H N O L O G Y ,

Table 3-1. Instruction Grouping

Group-Id Group-Name Instructions

A Load/Store LDSB, LDSH, LD, LDUB, LDUH, LDD

ADDcc. ANDcc. ORcc. XORcc.

SUBcc. ANDNcc, ORNcc. XORNcc ADDXcc, SUBXcc

TADDcc, TSUBcc, MULScc

SLL, SRL. SRA, SETHI C FP instruction FADDs, FADDd, FSUBs, FSUBd

(FP add/multiply PTO (fp data conversions) except fcmp) FMULs, FSMULd, FMULd

FDIVs, FOlVd, FSQRTs, FSQRTd RDY, RDPSR, RDWIM, RDTBR WRY, WRPSR, WRWIM, WRTBR LDSBA, LDSHA, LDA, LDUBA, LDUHA, LDDA,LDFSR

STA, STBA, STHA, STDA, STDFQ, STFSR, SWAPA,

LDSTUBA

Ticc. FLUSH. CPOPS, CBccc, illegal, LDC. LDDC, LDCSR, STC, SIDC, STCSR, SIDCQ

FCMPs, FCMPd, FCMPEs, FCMPEd

In order to simplify this process of group classification, instruction decoding has been broken down into two levels. These two levels are global (scheduling) decoding and local (execution unit) decoding.

During the global level of Decode, each instruction in the Decode buffer is characterized as either belonging to a group which must always be launched one at a time (referred to as anfgroup) or belonging to a group which is eligible for simultaneous launch (sometimes referred to as a non-fgroup). The instruction groups eligible for simultaneous launch can be conceptualized as being mapped to one of four execution units.

During global Decode, resource conflicts, data dependencies and operand forwarding are detected and re-solved. After group characterization is performed by the instruction decoder, local decoding is performed.

If an instruction is recognized as a floating -point instruction, local decoding for that instruction is performed by the FPSCHED.

T E C H N O L O G Y ,

~~~~~~~~~~~~~~R~T~6~2~O~h~yp=e~r~SP~~~R~C~C~P~U

3.4.1 Single Instruction Launch Group

During global Decode, each instruction in the instruction packet is classified as either a single launch in-struction, or a multiple launch instruction. If either instruction belongs to the single step group, then it must be executed by itself. In other words, single step instructions cause the execution of the packet to be split (i.e., to be executed in sequence). The instructions in this group are:

• CALL, JMPL, RETT, SAVE, RESTORE, TADDccTV, TSUBccTV, UMUL, UMULcc, SMUL, SMULcc, UDIV, UDIVcc, SDIV, SDIVcc, Ticc, FLUSH, RDPSR, RDTBR, RDWIM, RDY, RDASR, WRPSR, WRTBR, WRWIM, WRY, WRASR, LDSBA, LDSHA, LDUBA, LDUHA, LDA, LDDA, LDSTUBA, SWAPA, LDFSR, STFSR, STDFQ, STBA, STHA, STA, STDA, FCMP, (CBccc, LDC, LDDC, LDCSR, STC, STDC, STCSR, STDCQ).

• all other coprocessor instructions (copland cop2).

t

• any other unimplemented instruction or reserved opcode detected at global Decode time (except for unimplemented fp instructions).

All other instructions are eligible to be executed simultaneously.

Special Case: There is a special case regarding single step instructions when a write special register or load floating-point status register instruction has been executed. The execution of write special register or load floating-point status register instructions cause packet splitting to be enforced for the next three instruc-tions regardless of the group to which the three succeeding instrucinstruc-tions belong.

3.4.2 Multiple Instruction Launch Group

If the instruction scheduler detects a packet which does not contain a single launch instruction, does not con-tain two instructions of the same group (for an exception to this, see Table 3-2) does not have an intra-packet data conflict, and does not have an inter-packet data conflict, then the instruction scheduler will launch both instructions in parallel.

If there is an intra-packet data conflict, the instruction scheduler will split the packet. If there is an inter-pack-et data conflict, the instruction scheduler will delay the slot-b instruction if the conflict exists only for the slot-b instruction. The scheduler will delay both instructions if the conflict exists for the slot-a instruction.

The concept of intra-packet and inter-packet dependencies is discussed in Section 3.4.3 under the notes for case vii.

As mentioned previously, the groupfinstructions force packet splitting. The following table lists combina-tions of instruccombina-tions in the Decode packet that are eligible for simultaneous execution.

t The RT620 does not support a coprocessor interface. The RT620 enters a coprocessor disabled trap upon encountering a coprocessor instruction.

T E e H N O ' 0 G Y.

,$

=============R=T=6=2=O=h:!:yp=e=r=SP=l\=R=C=C=P=U Table 3-2. Instruction Combinations Eligible for Simultaneous Execution

Slot-a Slot-b Slot-a Slot-b

group A group B group C group A

group A group C group C group B

group A group E group C group C [IJ

group B group A group C group E

group B [2J group B [2J group E group A

group B group C group E group B

group B group E group E group C

Notes: 1. Assuming the FPQ is not full, from the perspective of the instruction scheduler, the combination of two group C instructions in a packet behaves as though both instructions are launched simultaneously.

2. A special case exists called Fast Constant (refer to section 3.4.4.1) which permits two group B instructions in a packet to be executed at the same time. Otherwise, two group B instructions in an instruction packet must be executed singly.

3.4.3 Interlocks and Dependencies

There are situations where certain activities of the processor are temporarily suspended. These suspensions come in two forms:

1. A total freeze on activities in both the integer unit and the floating-point unit (an interlock caused by an external hold being applied by the external memory subsystem or IMBNA asserted and data ac-cess required.)

2. A delay in launching additional instructions (i.e., a delay caused by the lack of available computing resources or available data to continue launching instructions).

More specifically, instruction launch delays can be caused by:

i. A miss in the ICACHE.

ii. Contention for integer register file read ports.

Whenever a store instruction is executed, the pipeline maintains this information and the scheduler checks to see if a store instruction is in the Execute stage. If a fetched instruction requires the Load/

Store adder (e.g., LD, ST, LDSTUB or SWAP), that instruction is delayed until the store in execute has advanced to the Cache stage. The reason for this is that a store instruction actually has three source operands. The first two operands (rs 1 and rs2, or rs 1 and the signed immediate field) are used to calculate the effective address. The rd field contains the third source, that is the register containing the data to be written to memory. Since the integer register file has only two read ports which can be used by the LSU, the third operand access must be deferred to the Execute stage of the Store instruc-tion. When a Load or Store instruction is in Decode, it requires access to the register ports to obtain its effective address source operands. This access conflicts with that of the store instruction which is in the Execute stage. Therefore, the second Load-Store instruction will be delayed.

A JMPL, RETT or FLUSH that follows a store instruction must be delayed for one clock to avoid bus contention. A JMPL, RETT or FLUSH instruction that follows a LDSTUB or SWAP instruction will be delayed for two clocks to avoid bus contention.

T E e H N 0 L 0 • y.

,$ ============;;;;;R;;;;;;T;;;;;;6;;;;;;2;;;;;O;;;;;;h;:;;:yp~e;;;;;;r;;;;;;SP;;;;;i<\;;;;;R;C~C;;;;;;P;;;;;;;U

iii. Contention for execution units.

iv. Arbitrary internal delays.

There are cases involving Bicc, FBfcc, CALL, JMPL, and RETI that result in delays until the next instruction is available. For the cases of JMPL and RETI, these instructions are both delayed control transfer instructions. Because it is possible for these control transfer instructions to corne alone or in pairs and instructions are fetched two at a time, the location of a delayed control transfer instruction (slot-a vs. slot-b) and availability of delay slot instructions determine when a change in the flow of program control takes effect. When a Bicc, FBfcc, CALL, JMPL or RETI instruction is in Decode and the delay slot instruction is not available, the control transfer instruction will be delayed until the delay slot instruction is available.

A specified delay (called an "internal nop" or inop) is inserted after any JMPL or RETI instruction.

This provides sufficient time to generate the new target instruction address and only fetch the delay slot instruction packet. An internal nop (inop) has the same effect as a nop instruction; there is no operation and no program accessible machine state is altered.

In the case of FLUSH instructions, the processor must perform an ICACHE lookup and a Writeback (FLUSH). This requires that three inops be inserted after any FLUSH instruction before a new in-struction fetch can be performed. Also note that a FLUSH inin-struction is delayed until any inin-struction or data access that is in progress is completed.

v. The presence of an exception that has not yet trapped.

The scheduler also tracks the occurrence of exceptions that are detected at various pipeline stages. If an exception has been detected, the pipeline stages advance until the trap is taken; but additional in-structions are neither fetched nor launched until the trap is taken.

vi. A full fp pre-queue while additional fp instructions are "waiting" in the DBUF (Decode buffer).

The floating-point unit provides the scheduler with status information regarding free space in the fp pre-queue. When fp instructions exist in the integer unit decoder and there are no free entries in the pre-queue for off-loading fp instructions, additional instruction fetch and launch cannot continue.

Instruction fetch and launch resumes when instructions in the floating-point unit launch, thus mak-ing a free entry available in the fp pre-queue.

vii. Data dependencies (integer or fp).

Data dependencies need to be checked in two directions. The first involves checking for data depen-dencies between instructions in the same packet (called "intra-packet" dependepen-dencies). The second direction involves dependency checking between instructions which have not yet been launched and instructions currently advancing through the pipeline stages but have not yet written results back to the register ftle (called "inter-packet" dependencies). Dependency checking covers both the integer unit and floating-point unit and is limited to detecting register conflicts.

When intra-packet dependencies are detected the second instruction will be delayed at least one clock. For inter-packet comparisons, either one instruction will be delayed (if only the slot-b instruc-tion is involved) or both instrucinstruc-tions will be delayed (if the slot-a instrucinstruc-tion is involved).

viii. FP compare instructions in the pre-queue and an FBfcc instruction in the instruction (not fp) decoder.

The fp compare instructions and Idfsr are the only fp instructions that can affect the fp condition codes. The fp Branch is delayed until new fp condition codes are available (in the ex2 stage of an fp compare instruction and the Update stage of a LDFSR instruction).

T E C H N O L O G Y ,

ix. An LDFSR or a STFSR in the instruction decoder and a non-empty FPQ.

x. Integer multiply instructions cause further instruction launch to be held for 18 cycles.

The integer multiply instructions require 17 clock cycles to complete execution. Instruction launches can continue after cycle 17.

xi. Execution of an integer divide instruction causes further instruction launch to be held for 37 cycles.

The integer divide instructions require 37 cycles to complete execution. Instruction launches can continue after the execution is complete.

Instruction Execution Notes:

Integer Unit Forwarding: ALU results are computed in one clock. The results move through a series of buffers as the instruction moves through each pipeline stage. If an instruction in Decode stage requires the result from an ALU operation before the result is written back to the destination register, this result is for-warded to the instruction in Decode stage as an operand input. However, if the instruction in Decode stage requires the result of a load operation in the Execute stage or Cache stage, a delay will occur until the operand is available. In the floating-point unit, since more than one clock is required to execute an instruction, for-warding is not possible for intermediate pipeline stages, so delays are performed in much the same way as for load instructions.

Integer Unit Dependency Checking: Since dependency checking needs to be performed for both the inte-ger unit and floating-point unit, responsibility for the checks is divided between the two units. In the cases where inter-packet dependencies are resolved by the FPSCHED, dependencies detected by the FPSCHED are signaled to the integer unit scheduler. For more information on fp instruction scheduling, see Section 3.5.1.

There are two possible outcomes for inter-packet dependencies. When the slot-a instruction encounters a data dependency, both the slot-a and slot-b instructions are delayed. When the slot-a instruction does not encounter a data dependency and the slot-b instruction does encounter a data dependency, the slot-a instruc-tion is launched and only the slot-b instrucinstruc-tion is delayed. This preserves the execuinstruc-tion order of programs and minimizes the complexity of the exception handling logic.

3.4.4 Special Features

There are three special cases in which multiple instruction launch is allowed that otherwise would have to be executed sequentially. These three cases are called Fast Constant, Fast Index, and Fast Branch.

3.4.4.1 Fast Constant

When there is a need to construct 32-bit constants, SPARC typically accomplishes this by executing two ALU instructions. For example:

sethi %hi(const), %rx or %rx, %lo(const), %rx

In these instruction sequences, no ADD or logical OR need actually be performed. These operations can be accomplished through simple concatenation. Therefore, when the SETHI is in slot-a and the add/or is in slot-b, it is possible to construct the constant in one step rather than two sequential steps by simply concate-nating the high and low immediate fields of the instructions on the operand 2 input bus to the ALU. During a Fast Constant, register %gO (constant zero) is selected as the ALU input operand 1 and the concatenated 32-bit constant is selected as the input operand 2. The operation performed by the ALU will be an OR opera-tion. The result can then be passed through the ALU.

T ' C H NO LOG Y,

,$

=============R;;;;;;T;;;;;;6;;;;;;2;;;;;;O;;;;;;h;;;;;yp=e;;;;;;r;;;;;;SP;;;;;;1\;;;;;;R;;;;;;C=C;;;;;;P=U 3.4.4.2 Fast Index

There is a need to rapidly establish base addresses for array indexing, especially for programs which use complex data structures. This is typically accomplished by executing two instructions sequentially, for ex-ample:

sethi %hi(const), %rx ld [ %rx + %ry] , %rz

sethi %hi(const), %rx ld [%rx + imm], %rz

sethi %hi(const), %rx st %rz, [%rx + %ry]

sethi %hi(const), %rx st %rz, [%rx + imm]

In this instance, although the two instructions do not require the same execution unit, there appears to be a data dependency. An add is performed and the results of the SETHI need to be written to the register file.

However, when the SETHI is in slot-a and the load or store (including a fp load or store) is in slot-b, it is possible to provide the "shifted" constant to both the operand 2 input of the ALU and the operand 1 input of the LSU in one step rather than two sequential steps. The ALU processes the SETHI in the normal fashion.

The LSU calculates the effective address by adding the operand 1 input (generated by the instruction decod-er, not the fp decoder) with the second operand (specified by the instruction).

3.4.4.3 Fast Branch

In other words, instruction pairs of the type:

sethi %hi(const), %rx ldd [%rx + imm], %rz are executed as

sethi %hi(const), %rx

Idd [%hi(const) + imm]/ %rz

Ordinarily, fetching a condition-code-altering instruction and conditional Branch in the same packet would cause single instruction launch to occur. Consider the case where slot-a contains an ALU instruction which sets condition codes (e.g., ANDcc) and slot-b contains a Branch on Condition Code (e.g., BNE). Note that Branch Always (e.g., BA) and Branch Never (e.g., BN) do not depend on condition codes.

Ordinarily, the Branch instruction that uses the condition codes cannot be launched at the same time as the ALU instruction that sets the condition codes. This is because the time at which they are needed to evaluate whether the branch should be taken or not has already passed by the time the condition codes are set. Also, since the condition codes are not written back to the PSR immediately, a dependency issue exists.

Delays are avoided for most intra-packet and inter-packet branch condition code dependencies through the application of two techniques:

1. condition code forwarding

2. special handling of Branch under the Fast Branch situation

T E C H N O L O G Y ,

Condition code forwarding works similarly to operand forwarding. That is, if condition codes are required the clock after an ALUcc instruction has been launched, the newly generated condition codes are provided even though the PSR has not yet been updated.

In the special case where the ALUcc and Bicc instructions are contained in the same packet (i.e., there is an intra-packet condition code dependency and the delay slot instruction is available in the Fetch stage), the situation is referred to as a Fast Branch condition. An assumption is made that code generation is biased such that the Branch will be taken more often than not taken. Rather than split the packet and potentially waste a clock cycle, the new target address fetch is attempted. If the Branch is taken, the new target is available by the time the delay slot instruction has been executed. Otherwise, the delay slot instruction packet is executed and the original instruction stream is restored. Section 3.8.12.3 describes the pipeline operation for the Fast Branch case.

Fast Branch does not apply to floating-point Branch instructions. Fast Branch also does not apply when the slot-a instruction is a TADDccTV, TSUBccTV, UMULcc, SMULcc, UDIV cc, or SDIV cc instruction, be-cause these instructions force packet splitting.

Note that when a UMULcc, SMULcc, UDIV cc or SDIV cc instruction is being executed, the condition codes will NOT be available until after the instruction completes its Execute stages. The condition codes can be forwarded through the Cache and Writeback stages for these instructions. Sections 3.8.3.1 and 3.8.3.2 pro-vide a description of the pipeline stages for multiply and dipro-vide instructions.

Dans le document SPARC RISC USER'S GUIDE (Page 103-110)