Current Microprocessors
Pipeline
Efficient Utilization of Hardware Blocks
• Execution steps for an instruction:
  1. Send instruction address (IA)
  2. Instruction Fetch (IF)
  3. Store instruction (SI)
  4. Decode Instruction, fetch operands (DI)
  5. Address Computation (AC)
  6. Memory Access (MEM)
  7. Execution (EX)
  8. Write Back (WB)
• Example: ADD R5, R4, #3 (16-bit encoding: 0001 101 100 1 00011)
Only one block used every cycle
Efficient Utilization of Hardware Blocks
for (i = 0; i < 100; i++) { a[i] = a[i] + 5; }
...
05 LOOP LDR R1, R0, #30
06      ADD R1, R1, #5
07      STR R1, R0, #30
08      ADD R0, R0, #1
09      ADD R3, R0, R2
0A      BRn LOOP
...
[Diagram: pipelined execution — instructions 05 to 0A flow through the IA, IF, SI, DI, AC, MEM, EX, WB stages in successive cycles.]
• All blocks are used
• One instruction terminates each cycle
Pipeline
• Program execution is up to 8 times faster (a small worked example follows the diagram below)
• The execution time of an individual instruction is barely changed
[Pipeline timing diagram: LDR R1,R0,#30 / ADD R1,R1,#5 / STR R1,R0,#30 / ADD R0,R0,#1 progress through IA, IF, SI, DI, AC, MEM, EX, WB over cycles 0–13; all blocks are used.]
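As a rough illustration of the speedup claim above, here is a minimal sketch of my own (not from the slides), assuming an ideal 8-stage pipeline with no hazards and one instruction entering per cycle.

#include <stdio.h>

int main(void) {
    const int stages = 8;        /* IA, IF, SI, DI, AC, MEM, EX, WB */
    const long n_instr = 1000;   /* arbitrary instruction count */

    /* Non-pipelined: each instruction occupies all 8 blocks one after the other. */
    long sequential = n_instr * stages;

    /* Ideal pipeline: fill the pipe once, then one instruction completes per cycle. */
    long pipelined = stages + (n_instr - 1);

    printf("sequential: %ld, pipelined: %ld, speedup: %.2f\n",
           sequential, pipelined, (double)sequential / pipelined);
    /* For large n_instr the speedup approaches the number of stages, i.e. 8. */
    return 0;
}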
Pipeline Implementation
• All information (data and control) stored in pipeline registers
[Datapath diagram: the IA, IF, SI, DI, AC, MEM, EX and WB stages, with the PC, instruction memory (IR), register file, address computation, data memory and ALU separated by pipeline registers.]
Pipeline Implementation
[Diagram: contents of each pipeline register — the PC (16 bits), the fetched 16-bit instruction, the opcode and two 16-bit source registers, the computed 16-bit address, the 16-bit data read from memory, and the 16-bit result to write back.]
Trends
• The Pentium IV pipeline has 20 stages, plus 8 stages for x86 → µinstruction conversion
• The initial motivation for the pipeline was:
  – Efficiently exploiting the architecture blocks
  – Increasing the instruction execution rate
• The current motivation is increasing the clock frequency:
  – Stages are split into sub-stages
  – The maximum duration of a sub-stage is reduced
  ⇒ Enables higher clock frequencies
• Sustained performance is difficult to reach
Pipeline Hazards
• It is not always possible to issue/commit one instruction per cycle
• When an instruction cannot proceed, this is a pipeline hazard
• Three types of pipeline hazards:
  – Resource hazards
  – Data hazards
  – Control hazards
• A hazard induces a pipeline stall
• The control circuit injects one or several bubbles
• Performance metric: IPC (Instructions Per Cycle)
  – IPC = 1 is the optimum; hazards make IPC < 1 (see the sketch below)
[Diagram: a bubble inserted between Inst-1 and Inst-2 as they advance through the pipeline stages.]
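A minimal sketch of my own (the bubble count is an arbitrary assumption) showing how stall bubbles reduce the IPC: every injected bubble is a cycle in which no instruction completes.

#include <stdio.h>

int main(void) {
    long instructions = 1000;
    long bubbles = 250;          /* assumed number of stall cycles injected */

    /* With an ideal pipeline, one instruction completes per cycle;
       every bubble is a cycle in which nothing completes. */
    long cycles = instructions + bubbles;
    double ipc = (double)instructions / cycles;

    printf("IPC = %.2f (optimum is 1.0)\n", ipc);  /* 0.80 in this example */
    return 0;
}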
Pipeline Hazard
[Diagram: Inst 1 and Inst 2 in the pipeline datapath; each instruction takes 8 cycles to traverse the IA–WB stages.]
Resource Hazards
• Two instructions access the same block in the same cycle
• Solutions:
  – Accept the pipeline stall (SPARC)
  – Increase resources (e.g. memory ports)
[Diagram: in the same cycle, 05/MEM and 09/IF both need the memory — a resource conflict between the LDR's memory access and the instruction fetch of ADD R3,R0,R2.]
Resource Hazards
[Diagram: the same loop with stall cycles (●) inserted to resolve the conflict; in cycles where an instruction makes no memory access, there is no resource hazard.]
Data Hazards
• Data dependence between two instructions:
  LDR R1 ← R0, #30
  ADD R1 ← R1, #5
• An instruction can fetch its operands only when they are available
[Diagram: the ADD is stalled in DI until the LDR completes its WB; only then does R1 contain the expected value and the ADD can fetch its operands.]
Forwarding
• Data is often available in the processor before it is written to the register file:
  ⇒ Immediately pass the data to the block expecting it = forwarding (see the sketch after the diagram)
• The pipeline stalls only when absolutely necessary
[Diagram: with forwarding, the LDR's result is passed directly to the ADD; the data is available in time and the pipeline stall is avoided.]
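A minimal sketch of my own (stage names and fields are illustrative) of the operand selection that forwarding requires: a mux in front of the ALU picks either the register-file value or a value bypassed from a later pipeline stage.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical bypass sources; all names are illustrative, not from the slides. */
typedef struct {
    bool     ex_writes_src;    /* instruction currently in EX writes our source register */
    uint16_t ex_value;         /* its result, available before it reaches WB */
    bool     wb_writes_src;    /* instruction currently in WB writes our source register */
    uint16_t wb_value;
    uint16_t regfile_value;    /* value currently stored in the register file */
} operand_sources_t;

/* Select the most recent value for a source operand: the forwarding mux. */
static uint16_t forward_operand(const operand_sources_t *s) {
    if (s->ex_writes_src) return s->ex_value;   /* closest (youngest) producer wins */
    if (s->wb_writes_src) return s->wb_value;
    return s->regfile_value;                    /* no in-flight producer: use the register file */
}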
Implementing forwarding
• Forwarding requires:
– Additional data paths
– Adding/Increasing size of muxes
– Modifying control circuit (detect/activate forwarding)
[Diagram: forwarding path between two dependent ADD instructions across the MEM, EX and WB stages.]
Forwarding
• Forwarding cannot avoid all pipeline stalls:
  ADD R1 ← R1, #5
  STR R1 → R0, #30
[Diagram: the STR needs R1 in its MEM stage, but the ADD only produces R1 in its EX stage; even with forwarding the STR must stall.]
Multi-Cycle Instructions
• Example: floating-point instructions
  – FPADD: 2 execution cycles
  – FPMUL: 5 execution cycles
[Diagram: FPMUL F2←F1,F0 occupies EX for 5 cycles while FPADD F2←F1,F3 and FPADD F4←F2,F3 follow; their WB stages no longer occur in program order.]
• New data dependences:
  – Registers are written out of order
• New resource conflicts (register file ports)
Out-of-Order Write
[Diagram: FPMUL F2←F1,F0, FPADD F5←F1,F3 and FPADD F4←F2,F3 in the pipeline; the shorter FPADDs reach WB before the FPMUL, and a resource hazard occurs if an FPADD and the FPMUL need the same block in the same cycle.]
Pipeline and Exceptions
• The pipeline makes exception management harder. Example:
  – LDR takes a page fault in MEM
  – ADD takes a page fault in IF
• Precise exception on instruction I:
  – All instructions before I finish normally
  – All instructions after I can be interrupted, then re-executed from the start after the exception is handled
[Diagram: the ADD that follows the LDR reaches IF (and faults) before the LDR reaches MEM (and faults); the exceptions are detected out of program order.]
• It is necessary to implement precise exceptions
Pipeline and Exceptions
• An exception vector for each pipeline register
• After an exception is flagged, the instruction performs no further state modification
• In WB, exceptions are handled in the same order as the instructions
[Diagram: the pipeline datapath with an exception vector carried along in each pipeline register from IF to WB.]
Multi-Cycle Instructions and Exceptions
• Example:
  – FPADD raises a NaN exception in EX
• Processor state may already have been modified before the FPADD's exception is detected
• Out-of-order state modification must be forbidden
[Diagram: FPMUL F2←F1,F0 followed by FPADD F4←F1,F3; the FPADD's exception is raised in EX while the longer FPMUL is still executing.]
Control Hazards
• Branch: the destination (and possibly the condition value) must be known before fetching the next instruction
  LOOP LDR R1 ← R0, #30
       ...
       BRn LOOP
[Diagram: the fetch of the LDR at LOOP is stalled; the branch destination address only becomes available at the end of the address computation stage, and the condition bits (n, z, p) at the end of a later stage.]
Current Microprocessors
Branch Prediction
Branch Prediction
• Usually, the branch destination address is constant (except for RET and indirect branches)
• Predict the destination address (a small sketch of such a table follows the diagram):
  – store the destination address in a table at each branch execution
  – the table is indexed by the branch instruction's PC
  – when the PC is sent to memory, it is also sent to the table
  – the table says whether the instruction is a branch, and gives the destination address
[Diagram: a table of known branch instructions, indexed by the PC; it returns whether the instruction is a branch and the predicted destination address.]
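A minimal sketch of my own (table size and tagging are assumptions) of such a branch table: it is indexed with low-order PC bits, tagged with the full PC, and updated each time a branch executes.

#include <stdint.h>
#include <stdbool.h>

#define BTB_ENTRIES 1024   /* assumed size, power of two */

typedef struct {
    bool     valid;
    uint32_t tag;          /* PC of the branch stored in this entry */
    uint32_t target;       /* last observed destination address */
} btb_entry_t;

static btb_entry_t btb[BTB_ENTRIES];

/* Look up the PC; returns true and the predicted target if it is a known branch. */
static bool btb_lookup(uint32_t pc, uint32_t *target) {
    btb_entry_t *e = &btb[pc % BTB_ENTRIES];
    if (e->valid && e->tag == pc) { *target = e->target; return true; }
    return false;  /* not a known branch: fall through to the next sequential PC */
}

/* Update the table whenever a branch executes and its destination is known. */
static void btb_update(uint32_t pc, uint32_t target) {
    btb_entry_t *e = &btb[pc % BTB_ENTRIES];
    e->valid = true; e->tag = pc; e->target = target;
}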
• If the PC corresponds to a branch, update the PC:
  PC = predicted destination address
• Example: JMPR LABEL
[Diagrams: without address prediction, Inst 1 at the destination of JMPR LABEL can only be fetched once the destination address has been computed; with address prediction, Inst 1 is fetched immediately after the JMPR and the stall disappears.]
Prediction Errors
• Detect the error (the destination address is always computed)
• Squash the speculatively fetched instructions
• Speculated instructions only modify the machine state after the check
• Squashing costs 1 cycle
[Diagram: after JMPR, Inst 1–3 are fetched along the mispredicted path; when the prediction error is detected they are squashed — the speculated instructions have not modified the machine state — and Inst 1 is re-fetched from the correct path.]
Recovery After Misprediction
[Diagram: cycle-by-cycle view of the pipeline after the JMPR Label misprediction — the wrong-path instructions Inst 1–3 (err) are flushed (X) from the pipeline registers and the correct-path instructions Inst 1 (OK), Inst 2 (OK) are fetched in their place.]
Condition Prediction
• Conditional branches
• The condition value must be predicted
• The condition value can change often from one branch execution to the next ⇒ prediction is difficult
• Example: branch taken
[Diagrams: with address prediction only, the instruction after BR LABEL waits until the condition is computed; with both condition and address prediction, it is fetched immediately.]
Prediction Strategies
• Static prediction:
  – Always taken
    • Works well for loops
  – Compile-time analysis
    • EPIC/IA-64; limited by static analysis
  – Hit rate: ≈70% to 90%
...
05 LOOP LDR R1, R0, #30
06      ADD R1, R1, #5
07      STR R1, R0, #30
08      ADD R0, R0, #1
09      ADD R3, R0, R2
0A      BRn LOOP
...
Prediction Strategies
• Dynamic prediction:
  – Commonplace in processors
  – Recent mechanisms: hit rate up to 99% for certain applications
  – Principle: learn the behavior of individual branches
• A first mechanism: local history
  – one 4-state automaton (2-bit counter) per branch (see the sketch below)
[Diagram: a table of 2^n entries indexed by n bits of the PC; each entry is a 2-bit counter with states 11 (Taken), 10 (Weakly Taken), 01 (Weakly Not Taken), 00 (Not Taken). The prediction is read from the counter, and the table is updated with the actual outcome: taken (1) moves the counter up, not taken (0) moves it down.]
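A minimal sketch of my own (the table size is an assumption) of this local predictor with one 2-bit saturating counter per entry.

#include <stdint.h>
#include <stdbool.h>

#define PHT_ENTRIES 1024   /* 2^n entries, n = 10 assumed */

/* One 2-bit saturating counter per entry:
   0 = Not Taken, 1 = Weakly Not Taken, 2 = Weakly Taken, 3 = Taken. */
static uint8_t counters[PHT_ENTRIES];

static bool predict(uint32_t pc) {
    return counters[pc % PHT_ENTRIES] >= 2;   /* predict taken in states 2 and 3 */
}

static void update(uint32_t pc, bool taken) {
    uint8_t *c = &counters[pc % PHT_ENTRIES];
    if (taken)  { if (*c < 3) (*c)++; }        /* saturate at 3 */
    else        { if (*c > 0) (*c)--; }        /* saturate at 0 */
}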
Example
• 100 iterations of the loop:
  – iteration i=0: BRn taken
  – iteration i=1: BRn taken
  – ...
  – iteration i=99: BRn not taken
...
05 LOOP LDR R1, R0, #30
06      ADD R1, R1, #5
07      STR R1, R0, #30
08      ADD R0, R0, #1
09      ADD R3, R0, R2
0A      BRn LOOP
...
[Diagram: state of the 2-bit counter for BRn across iterations — starting from a not-taken state, the successive taken outcomes (i=0, 1, 2, 3, ...) drive it to the Taken state; the final iteration (i=99, not taken) moves it back down.]
Improving Dynamic Prediction
• A small increase in hit rate can have a significant impact on overall processor performance
• To improve prediction accuracy, use the behavior of the preceding branches: global history (a sketch follows the diagram below)
if (a == 1) a = 0;  /* Branch B1 */
...
if (b == 0) b = 1;  /* Branch B2 */
...
if (a == b) ...;    /* Branch B3 */
[Diagram: a shift register records the outcomes of the last p branches (taken → 1, not taken → 0); the p history bits are combined with n bits of the PC to index a table of 2^(n+p) counters. With p = 2, branch B3 gets a different table entry for each combination of the outcomes of B1 and B2, and its own outcome becomes predictable.]
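A minimal sketch of my own (sizes are assumptions) of a global-history predictor: the outcomes of the last p branches are kept in a shift register and concatenated with PC bits to index the table of counters.

#include <stdint.h>
#include <stdbool.h>

#define HIST_BITS   2                      /* p: global history length */
#define PC_BITS     10                     /* n: PC bits used in the index */
#define TABLE_SIZE  (1u << (HIST_BITS + PC_BITS))

static uint8_t  counters[TABLE_SIZE];      /* 2-bit counters, as before */
static uint32_t history;                   /* outcomes of the last p branches, 1 = taken */

static uint32_t index_of(uint32_t pc) {
    uint32_t pc_part = pc & ((1u << PC_BITS) - 1);
    return (history << PC_BITS) | pc_part; /* concatenate history and PC bits */
}

static bool gh_predict(uint32_t pc) {
    return counters[index_of(pc)] >= 2;
}

static void gh_update(uint32_t pc, bool taken) {
    uint8_t *c = &counters[index_of(pc)];
    if (taken) { if (*c < 3) (*c)++; } else { if (*c > 0) (*c)--; }
    history = ((history << 1) | (taken ? 1u : 0u)) & ((1u << HIST_BITS) - 1);
}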
Impact of Branch Prediction on Processor Performance
• On average, 1 instruction out of 5 is a branch
• In current pipelines, the misprediction cost is 5 to 10 cycles
• Example: 8 cycles between fetch and branch resolution + 4 cycles to restart the pipeline
  ⇒ 12 cycles of penalty
• 1 instruction / cycle (1000 instructions):
  – 50% wrong predictions: 1000 * (0.8*1 + 0.2*(0.5*1 + 0.5*12)) = 2100 cycles
  – 20% wrong predictions: 1000 * (0.8*1 + 0.2*(0.8*1 + 0.2*12)) = 1440 cycles
  – 5% wrong predictions: 1000 * (0.8*1 + 0.2*(0.95*1 + 0.05*12)) = 1110 cycles
• 5 instructions / cycle (1000 instructions):
  – 50% wrong predictions: 200 * (0.5*1 + 0.5*12) = 1300 cycles
  – 20% wrong predictions: 200 * (0.8*1 + 0.2*12) = 640 cycles
  – 5% wrong predictions: 200 * (0.95*1 + 0.05*12) = 310 cycles
Current Microprocessors
Caches
Memory Access Latency
• Processor cycle time << memory access time
[Chart: processor vs. DRAM performance, 1980–2000, log scale — µProc improves ≈60%/year (2x every 1.5 years, Moore's law), DRAM ≈9%/year (2x every 10 years); the processor/DRAM gap grows by ≈50% per year.]
Cache Memory
• Fast (SRAM) but small (cost) memory: ≈1 to 3 cycles (pipelined)
• The processor sends its memory requests to the cache
  – Data in the cache: hit
  – Data not in the cache: miss
• Performance:
  – Hit rate
  – Average memory access time (illustrated below)
[Diagram: the processor accesses the cache, which in turn accesses memory.]
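As a rough illustration of the average-memory-access-time metric, here is a small sketch of my own with assumed latencies: AMAT = hit time + miss rate × miss penalty.

#include <stdio.h>

int main(void) {
    double hit_time     = 2.0;    /* cycles, assumed cache hit latency */
    double miss_penalty = 100.0;  /* cycles, assumed memory access latency */
    double miss_rate    = 0.05;   /* 5% of accesses miss */

    /* Average memory access time = hit time + miss rate * miss penalty */
    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.1f cycles\n", amat);  /* 7.0 cycles in this example */
    return 0;
}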
1-bit SRAM Cell
• SRAM = Static Random Access Memory
• Writing:
  – bit = value; bit' = value'
  – selection = 1
• Reading:
  – selection = 0
  – precharge: bit = VDD, bit' = VDD
  – selection = 1
  – stored value:
    • 1 (VDD) → the voltage decreases on bit'
    • 0 (VSS) → the voltage decreases on bit
[Diagram: a 1-bit SRAM cell connected to the bit and bit' lines through selection transistors, with a sense amplifier reading the stored value.]
Caches and Locality Properties
• Most programs have strong locality properties
• Temporal locality: an address A referenced at time t has a strong probability of being referenced again within a short time interval
• Spatial locality: when address A is referenced at time t, there is a strong probability that a neighboring address is referenced within a short time interval
for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    y[i] = y[i] + a[i][j] * x[j];
  }
}
• y[i]: spatial and temporal locality
• a[i][j]: spatial locality
• x[j]: spatial and temporal locality
Data and Instructions Locality
• Locality properties hold for instructions as well
• Temporal locality: simply keep the address in the cache
• Spatial locality: load addresses by blocks
...
05 LOOP LDR R1, R0, #30
06      ADD R1, R1, #5
07      STR R1, R0, #30
08      ADD R0, R0, #1
09      ADD R3, R0, R2
0A      BRn LOOP
...
• Loop ⇒ instructions are reused ⇒ temporal locality
• Consecutive instructions ⇒ spatial locality
Cache Architecture
[Diagram: the cache holds data blocks (lines) and the corresponding memory addresses (tags); the address of the processor request is compared against the tags — on a match (yes) the data is returned from the cache, otherwise (no) the request goes to memory.]
Address Mapping
• Mapping done in hardware
  ⇒ The programmer does not manage it
  ⇒ Transparent for the programmer
• The mapping is a simple function of the address (see the sketch below):
  – # of the line in the cache
  – # of the byte in the line
• CS: cache size in bytes
• LS: line size in bytes
• # byte: the log2(LS) least significant bits of the address
• # line: the next log2(CS/LS) bits of the address
[Diagram: the address bits split into a tag, the line number in the cache and the byte number in the line.]
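A minimal sketch of my own of this decomposition for a direct-mapped cache, using the CS = 32 bytes, LS = 8 bytes example of the next slide: the byte number comes from the log2(LS) low bits, the line number from the next log2(CS/LS) bits, and the remaining high bits form the tag.

#include <stdint.h>
#include <stdio.h>

#define CS 32u   /* cache size in bytes */
#define LS 8u    /* line size in bytes  */

int main(void) {
    uint16_t addr = 0x6454;               /* 0110 0100 0101 0100 */

    unsigned byte_in_line = addr & (LS - 1);              /* log2(LS) = 3 low bits      */
    unsigned line         = (addr / LS) & (CS / LS - 1);  /* next log2(CS/LS) = 2 bits  */
    unsigned tag          = addr / CS;                     /* remaining high bits        */

    printf("byte=%u line=%u tag=0x%x\n", byte_in_line, line, tag);
    /* For this address: byte = 4 (100), line = 2 (10), as in the reading example. */
    return 0;
}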
Cache Line
• Data is fetched by blocks
• An address is a byte address
• Block:
  – The most significant bits of the byte addresses are identical
  – Only the least significant bits vary
• Example (16-bit address, 8-byte block):
  – 0...010100010 and 0...010100111: same cache line
  – 0...010101111: a distinct line
Reading Data
• Example:
  – CS = 32 bytes
  – LS = 8 bytes
• Requested address (16 bits): 0110010001010100
  – # line: 10
  – # byte in line: 100
• A request can have a variable size:
  – byte, half-word, word...
  – request = address + number of bytes
  – address = address of the first byte
  – example: 2 bytes (a 16-bit word)
[Diagram: the selected cache line; the 2 requested bytes are sent to the processor.]
Associativity
• Physical memory size >> cache size
• The mapping function can create data conflicts
• Conflicts are reduced by increasing the cache associativity
for (i = 0; i < N; i++) {
  for (j = 0; j < 2; j++) {
    x[j] = y[j];
  }
}
@x = 0110010001001000
@y = 0000100000010000
[Diagram: 4-line cache (horizontal view); the accesses to x and y map to conflicting cache lines — a cache conflict.]
Associative Cache Structure
[Diagram: a 2-way associative cache — the request address is compared in parallel against the tags of both banks; a multiplexer selects the data from the matching bank.]
Associative Cache Operations
• Degree of associativity n
• A data item can be stored in n different entries
• The set of possible entries for an address = a set
• Upon a cache miss, one block of the set must be chosen for replacement (see the sketch below):
  – LRU: Least Recently Used
  – Random
  – Pseudo-LRU: the most recently used line is not replaced; random choice among the others
  – FIFO: First In First Out
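A minimal sketch of my own (sizes and the LRU bookkeeping are simplified assumptions) of a set-associative lookup with LRU replacement.

#include <stdint.h>
#include <stdbool.h>

#define WAYS      2        /* degree of associativity n */
#define SETS      64       /* number of sets (assumed)  */
#define LINE_SIZE 8        /* bytes per line            */

typedef struct {
    bool     valid;
    uint32_t tag;
    uint32_t last_use;     /* timestamp for LRU */
} way_t;

static way_t    cache[SETS][WAYS];
static uint32_t now;       /* global access counter */

/* Returns true on a hit; on a miss, replaces the LRU way of the set. */
static bool access_cache(uint32_t addr) {
    uint32_t set = (addr / LINE_SIZE) % SETS;
    uint32_t tag = addr / (LINE_SIZE * SETS);
    now++;

    for (int w = 0; w < WAYS; w++) {        /* in hardware, all ways are compared in parallel */
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            cache[set][w].last_use = now;
            return true;                    /* hit */
        }
    }

    int victim = 0;                         /* miss: pick the least recently used way */
    for (int w = 1; w < WAYS; w++)
        if (cache[set][w].last_use < cache[set][victim].last_use) victim = w;

    cache[set][victim] = (way_t){ .valid = true, .tag = tag, .last_use = now };
    return false;
}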
Writing Data (Write-Through)
[Diagram: on a write, the data is written both into the matching cache line and to memory.]
Writing Data (Write-Back)
[Diagram: on a write, the data is written only into the matching cache line; memory is updated later. On a cache miss, the line is first brought into the cache.]
Virtual Memory / Physical Memory: TLB
• The processor uses virtual addresses
• Data has an address in physical memory
  ⇒ Virtual/physical address translation
  ⇒ TLB (Translation Lookaside Buffer) — a small sketch follows the diagram
• A cache of address translations
• 1 TLB entry = 1 page
• The TLB is often fully associative (n = number of lines)
[Diagram: the TLB is looked up with the virtual address of the processor request; the virtual page number is compared against all entries in parallel and the matching entry provides the physical address.]
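A minimal sketch of my own (page size and number of entries are assumptions) of the translation performed by a fully associative TLB: the virtual page number is compared against every entry and the page offset is kept unchanged.

#include <stdint.h>
#include <stdbool.h>

#define PAGE_SIZE   4096u   /* assumed page size */
#define TLB_ENTRIES 32      /* assumed number of entries */

typedef struct {
    bool     valid;
    uint32_t vpn;           /* virtual page number   */
    uint32_t pfn;           /* physical frame number */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Returns true on a TLB hit and writes the physical address;
   on a miss, the page table would have to be walked (not shown). */
static bool tlb_translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn    = vaddr / PAGE_SIZE;
    uint32_t offset = vaddr % PAGE_SIZE;

    for (int i = 0; i < TLB_ENTRIES; i++) {      /* fully associative: compare all entries */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *paddr = tlb[i].pfn * PAGE_SIZE + offset;
            return true;
        }
    }
    return false;  /* TLB miss */
}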
Summary
• Performed simultaneously:
  – Virtual/physical address translation
  – Cache access using the virtual address
[Diagram: the pipeline with an instruction cache & TLB feeding the IA/IF stages and a data cache & TLB used by the MEM stage.]
Several Cache Levels
Example: Alpha 21164
• Instruction cache: 8 KB, 1-way
• Data cache: 8 KB, 1-way
• On-chip shared cache: 96 KB, 3-way
• Off-chip shared cache: ≈1–8 MB, 1-way
• Then memory
Impact of Cache Misses on Processor Performance
• On average, 1 instruction out of 3 is a load/store
• Instruction cache misses also occur
• Cache hierarchies reduce the average memory latency
• 1 GHz processor, 100 ns memory access time (100 cycles)
• 1000 instructions:
  – 50% cache misses: 1000 * (0.67*1 + 0.33*(0.5*1 + 0.5*100)) = 17335 cycles
  – 5% cache misses: 1000 * (0.67*1 + 0.33*(0.95*1 + 0.05*100)) = 2633 cycles
  – 0% cache misses: 1000 cycles
Current Microprocessors
Superscalar Execution
Superscalar Processor
• Pipeline: at most one instruction completes per cycle
• Superscalar of degree n: up to n instructions complete per cycle (in practice, n ≈ 4)
• Requirements for a superscalar processor:
  – An uninterrupted flow of instructions
  – Determine which instructions can execute in parallel
  – Propagate data among instructions (the result of instruction i is an operand of instruction j)
  – Several functional units
• Constraint: precise interrupts
• Superscalar implementations share a lot of features
Superscalar Processor Architecture
[Figure: generic superscalar pipeline architecture exploiting instruction-level parallelism (ILP, fine-grain parallelism). Source: Proceedings of the IEEE.]
Instruction Fetch
• Instruction flow disruptions:
  – branches
  – instruction cache misses
• Avoiding disruptions:
  – Number of fetched instructions ≥ 4: possibly from several cache lines (multi-ported cache)
  – A buffer stores the pre-fetched instructions
[Diagram: cache lines 0–3, 4–7 and 8–11 feed an instruction buffer; after a branch predicted taken, only the useful instructions (e.g. 2–4, then 10–11) are kept in the buffer.]
Dependences
• Find the dependences between instructions
• Avoid "false" dependences due to register aliasing
• RAW (Read After Write): true dependence
• WAW (Write After Write): out-of-order write (false dependence)
• WAR (Write After Read): out-of-order write (false dependence)
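A small illustration of my own of the three dependence types, written as C assignments over pseudo-registers r1–r3.

/* Illustration of the three dependence types (not from the slides). */
void dependences(void) {
    int r1, r2, r3 = 7;
    r1 = r3 + 1;     /* (1) writes r1                                            */
    r2 = r1 + 5;     /* (2) RAW on r1: true dependence, needs the result of (1)  */
    r1 = r3 + 4;     /* (3) WAW on r1 with (1), and WAR on r1 with (2):          */
                     /*     false dependences, removed by register renaming      */
    (void)r2;
}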
Register Aliasing: Renaming
• Binary compatibility limits the number of architectural registers
• Technology allows more physical registers
• Physical registers + ReOrder Buffer (ROB)
• Each instruction is mapped to a ROB entry
• A table maps each logical register to either a physical register or a ROB entry (a sketch of the table follows the diagram)
  ⇒ Eliminates register aliasing
  ⇒ Exposes the true dependences
L2:
move r3, r7
lw r8, (r3)
add r3, r3, 4
[Diagram: successive states of the mapping table for logical registers r3 and r8 — initially they point to physical registers; after the move, r3 points to ROB entry 6; after the lw, r8 points to ROB entry 7; after the add, r3 points to ROB entry 8. Each ROB entry will hold the value produced by its instruction (move, lw, add).]
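A minimal sketch of my own (structure simplified) of the mapping table: each logical register points either to a physical register or to the ROB entry of the in-flight instruction that will produce its newest value.

#include <stdint.h>
#include <stdbool.h>

#define NUM_LOGICAL 32

typedef struct {
    bool in_rob;       /* true: newest value will come from a ROB entry */
    int  index;        /* physical register number, or ROB entry number */
} map_entry_t;

static map_entry_t map[NUM_LOGICAL];   /* one entry per logical register */
static int next_rob_entry;             /* ROB entry allocated to the next instruction */

/* Rename one instruction: its sources read the current mapping
   (true dependences), its destination is redirected to a fresh ROB entry
   (removing WAW/WAR false dependences). */
static int rename(int src1, int src2, int dest,
                  map_entry_t *s1_out, map_entry_t *s2_out) {
    *s1_out = map[src1];                /* where operand 1 will come from */
    *s2_out = map[src2];                /* where operand 2 will come from */

    int rob = next_rob_entry++;
    map[dest].in_rob = true;            /* dest now maps to this instruction's ROB entry */
    map[dest].index  = rob;
    return rob;
}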
Dispatch
• After dependence analysis, instructions are dispatched
• Reservation stations: a buffer in front of each functional unit
• An instruction executes when:
  – all its operands are available
  – the functional unit is available
• Tomasulo's algorithm (a sketch of a reservation-station entry follows the diagram)
• Example: add r3, r3, 4
[Diagram: the reservation stations in front of the ALUs — each entry holds, for its two sources, either the data itself (valid) or the ROB entry it is waiting for, plus the operation and the ROB entry where the result will go (ROB6 for the move, ROB7 for the lw, ROB8 for the add).]
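A minimal sketch of my own (field names are illustrative) of one reservation-station entry, and of the broadcast that wakes up waiting entries when a result appears on the common data bus.

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     busy;
    int      operation;              /* e.g. ADD, LW */
    /* For each source: either the value is already there (valid) or we
       wait for the ROB entry that will produce it. */
    bool     valid1, valid2;
    uint32_t data1,  data2;
    int      waits_on1, waits_on2;   /* ROB entry numbers when not valid        */
    int      result_rob;             /* ROB entry that will receive the result  */
} rs_entry_t;

/* An entry is ready to issue when both operands are available. */
static bool ready(const rs_entry_t *e) {
    return e->busy && e->valid1 && e->valid2;
}

/* Result broadcast on the common data bus: every waiting entry captures it. */
static void broadcast(rs_entry_t *rs, int n, int rob_entry, uint32_t value) {
    for (int i = 0; i < n; i++) {
        if (rs[i].busy && !rs[i].valid1 && rs[i].waits_on1 == rob_entry) {
            rs[i].data1 = value; rs[i].valid1 = true;
        }
        if (rs[i].busy && !rs[i].valid2 && rs[i].waits_on2 == rob_entry) {
            rs[i].data2 = value; rs[i].valid2 = true;
        }
    }
}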
Issue
• Ready instructions are sent to the functional units
• Results are propagated through buses:
  – to the physical storage (ROB entries)
  – to the reservation stations
• Pending instructions whose operands become available are issued immediately
  ⇒ The internal execution model is closer to dataflow than to von Neumann
[Diagram: the ROB, the reservation stations and the functional units connected by the common data bus.]
Commit
• An instruction can only commit when it reaches the head of the ROB
  ⇒ commit order = program order
• The logical architecture state is only modified at commit
  ⇒ precise interrupts are possible in out-of-order processors
• Logical state: registers and memory
• When an instruction leaves the ROB:
  – its result is written into a register
  – OR its data is sent to memory
• In a superscalar processor of degree n, n instructions can commit simultaneously
• If the instruction at the head of the ROB has not completed, the processor stalls
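A minimal sketch of my own (fields simplified, helper functions assumed) of the in-order commit step: up to n completed instructions are retired per cycle from the head of the circular ROB, and only then do they update the architectural state.

#include <stdbool.h>
#include <stdint.h>

#define ROB_SIZE 64
#define DEGREE   4            /* superscalar degree n */

typedef struct {
    bool     busy;            /* entry allocated                    */
    bool     completed;       /* result has been produced           */
    bool     is_store;        /* commit sends data to memory        */
    int      dest_reg;        /* architectural destination register */
    uint32_t value;
} rob_entry_t;

static rob_entry_t rob[ROB_SIZE];
static int head;              /* oldest instruction in program order */

extern void write_arch_register(int reg, uint32_t value);  /* assumed helpers */
extern void write_memory(uint32_t value);

/* Called once per cycle: retire at most DEGREE instructions, in order. */
static void commit(void) {
    for (int i = 0; i < DEGREE; i++) {
        rob_entry_t *e = &rob[head];
        if (!e->busy || !e->completed)
            return;                      /* head not finished: the processor stalls here */
        if (e->is_store) write_memory(e->value);
        else             write_arch_register(e->dest_reg, e->value);
        e->busy = false;
        head = (head + 1) % ROB_SIZE;    /* circular buffer */
    }
}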
Pentium IV
[Figure: Pentium IV block diagram (source: Tom's Hardware).]
Trace Cache
[Diagram: with a conventional instruction cache, fetching the dynamic path A, B, D requires several accesses guided by the predictor; a trace cache stores the dynamic sequence (A B D, or A B C) as a single line, selected by the fetch address and n branch predictions.]
Pentium IV
[Figure: Pentium IV pipeline with its trace cache (source: Tom's Hardware).]
Ideally...
[Chart: ideal instruction issues per cycle — gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doducd 118.7, tomcatv 150.1.]
Factoring in main constraints…
[Chart: instructions per cycle for gcc, espresso, li, fpppp, doducd and tomcatv once the main constraints are factored in (y-axis 0–25), for instruction windows of 64, 32, 16, 8 and 4 instructions.]
In reality…
• Processors can issue 4 instructions per cycle
• But at most 2.6 instructions per cycle are achieved; 2 on average