Instruction Fetch Timing

3.8 Instruction Pipelines

3.8.1 Instruction Fetch Timing

Every instruction execution begins with an instruction fetch. The instruction can be fetched from one of three locations: the on-chip instruction cache, external cache (e-Cache), or main memory. Figure 3-31, Figure 3-32 and Figure 3-33 describe each of these cases when the ICACHE is enabled. Figure 3-34 and Figure 3-35 show timing of instruction fetch with e-Cache hit and e-Cache miss with the ICACHE disabled.

T , C H N O ' 0 G Y,

,$

=============R;;;;;T;;;;;6;;;;;2;;;;;O;;;;;h;;;;;yp:;;e;;;;;r;;;;;SP;;;;;l\;;;;;R;;;;;C=C;;;;;P=U

Figure 3-31. Timing for Instruction Fetch with ICACHE Hit

Address

Figure 3-32. Timing for Instruction Fetch with ICACHE Miss and e-Cache Hit

T E C H N O L O G Y ,

Figure 3-33. Timing for Instruction Fetch with ICACHE Miss and e-Cache Miss

Address

Figure 3-34. Timing for Instruction Fetch with ICACHE Disabled and e-Cache Hit

Address Generated IMA Address sent out on external bus IMD instruction available ICACHEhit detected

PHOLD (from external memory subsystem) IMDS

(from external memory subsystem)

Al e-Cache lookup

A2 e-Cache lookup

~~

^______

~

^___^LA^__2 ________

~~

______________

--J)~~O---o 0 0

__ -L ____

~~~-L---~---~~----~

--

---: ---:- ---

---S 5 --

---

---: ---

o 0 0

Sf

⁰ ⁰ ⁰

In In

--+----~----~ ~--7----~Sfr~----~----~

~---~Ss_J:

SD

Figure 3-35. Timing for Instruction Fetch with ICACHE Disabled and e-Cache Miss

F D E C w u

G

O : Fetch

o 0

clacki :clocki+l: clocki+2: clocki+3 :c1ocki+4 :clocki+S I

Figure 3-36. Typical Integer Instruction Pipeline 3.8.2 Integer Instruction Pipeline

The instructions in the Arithmetic Logic Unit (ALU), Load/Store Unit (LSU), and Program Counter Unit (PCU) groups follow a 6-stage pipeline denoted as: Fetch (F), Decode (D), Execute (E), Cache (C),

Write-T E e H N 0 L O G ' ,

,~ =============R;;:T;;:6=2=O;;:h==yp=e=r=SP;;:i\.=R;;:C~C;;:P:;;;;;;;U

back (W) and Update (U). All instructions share common instruction fetch activities and common global Decode activities. The local Decode and other Execute, Cache, and Write back stage activities vary from instruction to instruction. Figure 3-36 describes the typical integer instruction pipeline stages.

Note that there are Execute, Cache, and Writeback stage forwarding mechanisms for the ALU and there are Cache and Writeback stage forwarding mechanisms for the LSU. The Update stage is when the register file is actually written.

The following text and illustrations are intended to convey basic timing information associated with the pipeline stage activities for each of the major integer instructions. The functional activities are slightly dif-ferent for load, store, and Atomic Load-Store instructions, so each is treated as a separate case.

Table 3-5. Integer Unit Cycle Per Instruction (CPI)*

Instruction CPI

Single-cycle ALU iustructions: ADD, AND, OR, XOR, SUB, ANON, ORN, 1 XNOR, ADDX SUBX, ADDcc, ANDcc, ORcc, XORcc, SUBcc, ANDNcc,

ORNcc, XORNcc, ADDXcc, SUBXcc, TADDcc, TSUBcc, MULScc, SLL, SRL, SRA, SETHI

Integer multiply instructions: UMUL, UMULcc, SMUL, SMULcc 17 Integer divide instructions: UDIV, UDIVcc, SDIV, SDIVcc 37

Branch instructions: Bicc, FBfcc 1**

Control transfer instructions: CALL, JMPL, RETT, SAVE, RESTORE, Ticc 1 Control register access instructions: ROY, RDPSR, ROWIM, RDTBR, WRY, 1 WRPSR, WRWIM, WRTBR

Load instructions: LOSB, LOSH, LO, LOUB, LOUH, LDO, LDF, LDDF *** 1 Store instructions: ST, STB, STH, STD, STF, STDF *** 2

Atomic Load-Store instructions: SWAP, LDSTUB 3

Miscellaneous: UNIMP, FLUSH 1

* These CPI values assume the worst case situation of single instruction launch. The CPI decreases if dual instruction launch occurs. Also, pipeline enhancements such as Fast Branch or Fast Constant affect CPI.

** Assuming the delay slot is filled with a useful instruction and single instruction launch.

*** These instruction groups also include the alternate space version of the instruction (for example: LDA, STA).

3.8.3 ALU Instructions

Each ALU instruction proceeds through a series of processing activities. Figure 3-37 shows processing ac-tivities for typical ALU instructions. Note that integer multiply and divide instructions follow a slightly different pipeline. The first stage activity is the instruction fetch which involves looking up the address of the next ALU instruction in the ICACHE and making it available for Decode. During the second stage (De-code), the ALU instruction is globally decoded during the first half of the clock. By the end of the second half of the clock, local Decode is performed. During local Decode, appropriate control signals for the ALU execution unit are set up and operand access is performed. The decision to launch or delay the instruction is also made at this time. Operand access involves retrieving operand data either from the register file, ob-taining results currently in the pipeline (through forwarding), or extracting immediate data from the instruction itself.

During the Execute stage, the ALU operation result is generated. If the instruction causes updates to the con-dition codes, the new concon-dition codes are generated during the Execute stage. There are several exceptions that can occur when certain integer instructions are executed (e.g., TADDccTV). These exceptions are de-tected during the Decode or Execute stages (tag overflow, or illegal instruction).

T E C H N O L O G Y ,

F D

D E c w ^U

, ___ , r - - -r~:u~ts-~;e~ ~ - - -~

, available for operand forwarding

'GJ

^:^' ^gen^cc

*applies to TADDccTV and TSUBccTV only

Figure 3-37. Typical ALU Instruction Pipeline

<

^Shif)

r----

I add-shift

I I

^cycles ^I

I _ _ _ _ J

G

El7

G

<e~~l*)

c w

, ,

,r---,

'I I

: I result available I : I for forwarding I

'I I

: L _ _ _ - , _ _ _ J U

*applies to SMULcc and UMULcc only

Figure 3-38. Typical Integer Multiply Instruction Pipeline

T E e H N 0 LOG Y,

,$ =============R==T:;::6;;;::2==O:;::h=yp~e;;;::r;SP:;::i<\.:;::R:;::C;;;;;:;::;;C:;::P=U

3.8.3.1 Integer MUltiply Instructions

The Fetch and Decode stage activities for integer mUltiply are identical to those for other ALU instructions.

Integer multiply instructions require 17 Execute stages to generate results. Each stage of the 17 clock cycles involves addition of inputs 1 and 2, shifting the mUltiplier, updating the intermediate partial product and selecting the next input 1. If the instruction updates condition codes, the new condition codes are generated once the last addition has been performed. The results and condition codes are forwarded and additional instructions can be launched after Execute stage 17.

3.8.3.2 Integer Divide Instructions

The Fetch and Decode stage activities for integer divide are identical to those for other ALU instructions.

Integer divide instructions require more than one Execute stage to generate results. For a signed divide, the two's complement of the dividend input is generated during the next two cycles if it is negative. These cycles are idle cycles for other cases. In the next cycle, the dividend and divisor are compared to check for overflow.

If there is no overflow, the non-restoring algorithm is executed for 32 cycles. Each iteration consists of a shift followed by an add. Following the last add, for a signed divide, overflow is again checked for during the later half ofthe cycle and the two's complement of the quotient is generated during the next cycle if nec-essary. If the instruction updates condition codes, the new condition codes are generated once the last addition has been performed for an unsigned divide or following the two's complement (if necessary) for the signed divide. The results and condition codes are forwarded and additional instructions can be launched after Execute stage 37.

F D EI E2 E3 E4

@@8G'

" :88:

• for signed divide only

** for signed divide only if negative dividend .** for UDIVee and SDrVee only

E5 E36

• • • I

E37 c w

·_---.

" ,

result available for forwarding

I·----~----

... ...

-8 ^LI ^tz:;\'

OVf*'@C'

check eval

, ,

Figure 3-39. Typical Integer Divide Instruction Pipeline

F o E c

~ ~

w U

'memory alignment checks are NOT performed for LDSB and LDUB instructions

Figure 3-40. Typical Load Instruction Pipelines

o E c w

~ ~

Figure 3-41. Typical Store Instruction Pipeline

T E e H N 0 LOG Y,

,$

============:;;;:R:;;;:T:;;;:6:;;;:2:;;;:O:;;;:h;!:;y!:;pe:;;;:r:;;;:SP:;;;:i\:;;;:R:;;;:C=C:;;;:P=V 3.8.4 Load Instructions

Load instructions come in several varieties, including LD, LDF and LDFSR (word), LDD and LDDF (double), LDUB (unsigned byte), LDSB (signed byte), LDUH (unsigned halfword), and LDSH (signed half). Each integer unit Load also has a corresponding alternate address space form. Figure 3-40 illustrates the pipeline stages which Load instructions encounter.

Global Decode, local Decode and operand access are performed during the Decode stage. These activities are similar to those employed for ALU instructions. During Decode, it is possible to detect that a privilege violation or illegal instruction has been encountered for the alternate address space form of Load instruc-tions.

During the Execute stage, the effective address is calculated. After this, memory alignment is checked (e.g., if a word data fetch is required but the address calculated does not have "00" in its two least significant bits, a memory alignment exception must be generated). The address is placed on the 1MB address lines during the rising edge of the Cache stage (along with the appropriate IMSIZE, IMASI, and IMTYPE 0 signals).

If a memory alignment exception has occurred, the memory access is nullified by asserting PNULL on the cycle following the address. This provides the cache management unit with the data address to lookup in the external cache for this data access request.

Assuming there is a hit for the data in the external cache, the data is delivered prior to the Writeback stage.

Data alignment is performed during the Cache stage and the data is updated in the register file after the Write-back stage.

3.8.5 Store Instructions

Store instructions include all access size and alternate address space forms available for Load instructions.

As with load instructions, alternate address forms of the store instruction must be decoded so as to detect privilege violations and illegal instructions. Figure 3-41 shows timing for a typical Store instruction.

The effective address is calculated and memory alignment checks are performed during the Execute stage.

If a memory alignment exception is detected (i.e., access not word aligned), a memory alignment exception is generated. During the execute stage, the source register is accessed for the purpose of writing its contents to memory. During the Cache stage, the address is placed on the 1MB (along with the appropriate IMSIZE, IMASI, and IMTYPE < 0 > signals) and data alignment is performed on the source data. If a memory align-ment exception has occured, PNULL is asserted in the cycle following the address cycle in order to nullify the memory access.

In the case of a Load instruction, the cache management unit latches the address from the CPU and lookup does not require more than one clock (assuming an external cache hit occurs). In the case of store instruc-tions, the address must be held on the 1MB through the Cache and Writeback stages to guarantee Writeback to the data cache (one cycle is required to determine cache-hit and access and protection checks, and a second is required to perform the Writeback).

3.8.6 Atomic Load-Store Instructions

The Atomic Load-Store instructions are non-interruptible sequences of a load memory access followed by a store data access. Therefore, the Atomic Load-Store instruction timing follows that shown in Figure 3-42.

One difference (which is not obvious from Figure 3-42) between an ordinary Load and Store instruction pair and an Atomic Load-Store instruction is that the IMTYPE < 1 > (LOCK) signal is asserted by the IBIU while an Atomic Load-Store instruction is in progress.

F D E c ^W ^u

* memory alignment check applies only to SWAP and SWAPA instructions

Figure 3-42. Typical Atomic Load-Store Instruction Pipeline

F D E

8

, cc : eva!

c ^W u

This timing assumes that the delay slot instruction is not in the same instruction packet as the Branch instruction. Other-wise, the Fast Branch timing case applies (Refer to Section 3.8.123).

Figure 3-43. Typical Branch Instruction Pipeline

T E e H N O ' 0 G Y,

,~

=============R=T=6=2=O=h;;:;yp=e=r=SP=1\=R=C=C=P=U

F ^D E c w u

Figure 3-44. Typical CALL Instruction Pipeline 3.8.7 Branch Instructions

Branch instruction timing departs significantly from that applied to ALU, Load, and Store instructions.

Figure 3-43 illustrates Branch instruction timing for integer and fp branches (for the non-Fast Branch case).

The Fetch stage is performed in the usual fashion. Three activities occur during Decode. First, the condition codes are evaluated to determine whether the branch will be taken or not. Second the effective address of the target must be calculated using the current PC and the displacement extracted from the instruction by the instruction decoder. Finally, since SPARC permits nullification of delay slot instructions, this must be determined during Decode. The first two activities proceed in parallel. The third activity cannot be deter-mined until the condition codes have been evaluated except in the special case of a Fast Branch. Refer to Section 3,8.12.3 for pipeline descriptions for Fast Branch events.

If the branch is taken, the new target address is placed on the address lines (both for ICACHE and for the 1MB along with appropriate IMSIZE, IMASI, and IMTYPE < 0 > signals) and the new instruction stream Fetch begins. If the branch is not taken, instruction fetch continues along the previous instruction stream.

3.8.8 CALL Instructions

The CALL instruction pipeline activities are shown in Figure 3-44. The CALL instruction is very similar to branches except in three respects. First, CALL belongs to the single instruction launch group, so a "packet split" always occurs (the concepts of "instruction groups" and "packet splits" are discussed in Section 3.4).

Second, no condition code evaluation is performed because for CALL instructions, the "branch" is always taken. Finally, the CALL instruction requires that the PC of the CALL instruction be recorded in the register file as described in Chapter 12, SPARC Instruction Set.

3.8.9 JMPL/RETT/FLUSH Instructions

Although JMPL, RETT, and FLUSH perform significantly different functions, they are allfgroup instruc-tions and require generation of target addresses. They follow nearly identical timing. Figure 3-45, Figure 3-46 and Figure 3-47 illustrate the JMPLjRETT/FLUSH instruction timings.

T E C H N O L O G Y ,

F D E c w u

:r---' :

( insert ') '\ inop / '

1 \ . _ _ _ _ 1

Figure 3-45. Typical JMPL Instruction Pipeline

F D E c w ^u

• delay slot instruction

Figure 3-46. Typical RETT Instruction Pipeline

T E C H N O L O G Y ,

~==========================R=T=6=2=O=h~yp=e=r=SP=~=R=C==C=P==U

F o E c w u

Figure 3-47. Typical FLUSH Iustructiou Pipeline

The Decode, Execute, and Cache stages of a JMPLIRETTIFLUSH instruction are similar to those of the load instruction. An operand access is performed, an effective address is calculated from these operands, and memory alignment errors are checked. For all these instructions, the address is supplied for lookup in the instruction cache and external caches. The differences between these instructions are shown in the timing diagrams, and include:

• JMPL requires saving the instruction PC back to the register file.

• RETT performs window underflow checks and updates several fields in the PSR.

• FLUSH does not update the program counter with the target address but either takes an exception or flushes the corresponding packet entry in the cache line when an ICACHE hit is detected, depending on the state of the Instruction Cache Control Register (ICCR). When determining which packet of an ICACHE line is to be flushed, the last 3 bits of the address are ignored. Therefore, no misalignment ex-ception can occur for a FLUSH instruction.

Dans le document SPARC RISC USER'S GUIDE (Page 153-165)

3.8 Instruction Pipelines

3.8.1 Instruction Fetch Timing

,$

~~

~

~~

~~~-L---~---~~----~

--

---: ---:- ---

---S 5 --

---

---: ---

Sf

In In

~---~Ss_J:

SD

G

,~ =============R;;:T;;:6=2=O;;:h==yp=e=r=SP;;:i\.=R;;:C~C;;:P:;;;;;;;U

'GJ

<

r----

I I

G

G

<e~~l*)

,r---,

,$ =============R==T:;::6;;;::2==O:;::h=yp~e;;;::r;SP:;::i<\.:;::R:;::C;;;;;:;::;;C:;::P=U

@@8G'

" :88:

... ...

-8 LI tz:;\'

OVf*'@C'

~ ~

~ ~

,$

8

,~

:r---' :

~==========================R=T=6=2=O=h~yp=e=r=SP=~=R=C==C=P==U

-8 ^LI ^tz:;\'