Academic year: 2022
Current Microprocessors

Pipeline

Efficient Utilization of Hardware Blocks

• Execution steps for an instruction:

1. Send Instruction Address (IA)
2. Instruction Fetch (IF)
3. Store Instruction (SI)
4. Decode Instruction, fetch operands (DI)
5. Address Computation (AC)
6. Memory Access (ME)
7. Execution (EX)
8. Write Back (WB)

ADD R5, R4, #3

[Figure: binary encoding of ADD R5, R4, #3 — opcode, destination register DR, source register, immediate]

Only one block used every cycle

Efficient Utilization of Hardware Blocks

for (i=0; i < 100; i++) { a[i] = a[i] + 5; }

...

05 LOOP LDR R1, R0, #3

06 ADD R1, R1, #5

07 STR R1, R0, #30

08 ADD R0, R0, #1

09 ADD R3, R0, R2

0A BRn LOOP

...

[Figure: pipeline diagram — instructions 05 to 0A advance through the IA/IF/SI/DI/AC/MEM/EX/WB stages in successive cycles]

• All blocks used
• One instruction terminates each cycle

Pipeline

• Program execution up to 8 times faster.

• Instruction execution time barely changed.

[Figure: cycle-by-cycle pipeline table for LDR R1,R0,#30 / ADD R1,R1,#5 / STR R1,R0,#30 / ADD R0,R0,#1 — once the pipeline is full, every stage is occupied every cycle]

All blocks used

Pipeline Implementation

• All information (data and control) stored in pipeline registers

[Figure: pipelined datapath — PC and its incrementer, instruction memory, IR, register read, address computation, ALU, data memory and register write-back, with pipeline registers between the IA/IF/SI/DI/AC/MEM/EX/WB stages]

Pipeline Implementation

[Figure: information carried by the pipeline registers from stage to stage — 16-bit PC, instruction, opcode + two 16-bit register values, 16-bit address, 16-bit data, 16-bit result]

Trends

• The Pentium IV pipeline has 20 stages, plus 8 stages for x86 → µinstruction conversion.

• The initial motivations for the pipeline were:

– Efficiently exploiting architecture blocks
– Increasing the instruction execution rate

• The current motivation is increasing clock frequency:

– Split stages into sub-stages
– The maximum duration of a sub-stage is reduced

⇒ Enables clock frequency increase

• Sustained performance is difficult to reach

Pipeline Hazards

• Not always possible to issue/commit one instruction per cycle

• When an instruction cannot proceed it is a pipeline hazard

• Three types of pipeline hazards:

– Resource hazard
– Data hazard
– Control hazard

• Hazard induces a pipeline stall

• The control circuit injects one or several bubbles

• Performance metric: IPC (Instructions Per Cycle)

IPC = 1 at best; IPC < 1 when hazards occur

[Figure: bubbles injected between Inst-1 and Inst-2 as they advance through the pipeline]

Pipeline Hazard

[Figure: pipelined datapath and execution table for Inst 1 and Inst 2 — each instruction traverses the pipeline in 8 cycles]

Resource Hazards

• Access same block at same cycle

• Solutions:

– Accept the pipeline stall (SPARC)
– Increase resources (memory ports)

[Figure: 05/MEM and 09/IF fall in the same cycle — the loop's LDR memory access and the fetch of instruction 09 conflict on the memory]

Resource conflict

Resource Hazards

[Figure: execution table for the same loop — in the cycle where the ADD makes no memory access in MEM, instruction fetch proceeds without conflict]

No memory access ⇒ no resource hazard

Data Hazards

• Data dependence between two instructions

LDR R1 ← R0,#30
ADD R1 ← R1,#5

• An instruction can fetch operands only when available

[Figure: ADD's DI stage is delayed until LDR writes R1 back in WB (cycle 7) — only then does R1 contain the value produced by LDR]

Forwarding

• Data is often available in the processor before it is written to the register:

⇒ immediately pass the data to the expecting block = forwarding

• Pipeline stall only when data absolutely necessary

[Figure: with forwarding, LDR's result is passed directly to ADD's stage — data available, pipeline stall avoided]

Implementing forwarding

• Forwarding requires:

– Additional data paths
– Adding muxes, or increasing their size
– Modifying the control circuit (detect/activate forwarding)

[Figure: forwarding paths feeding the ALU inputs from the MEM/EX/WB pipeline registers]

Forwarding

• Forwarding cannot avoid all pipeline stalls:

ADD R1 ← R1,#5
STR R1 → R0,#30

[Figure: execution table for LDR / ADD / STR — even with forwarding, an instruction that needs a loaded value must still wait one cycle for the LDR's MEM stage]

Multi-Cycle Instructions

• Example: floating-point instructions
– FPADD: 2 execution cycles
– FPMUL: 5 execution cycles

[Figure: FPMUL F2←F1,F0 occupies EX for 5 cycles while the following FPADD instructions need only 2 — the FPADDs reach WB first]

• New data dependences:
– registers written out of order

• New resource conflicts (register-bank ports)

[Figure: execution table — FPADD F4←F2,F3 and FPADD F5←F1,F3 overtake FPMUL F2←F1,F0 and write their registers out of order]

Resource hazard if FPADD and FPMUL are in the same block

Pipeline and Exceptions

• The pipeline makes exception management harder. Example:

– LDR has a page fault in MEM
– ADD has a page fault in IF

• Precise exception on instruction I:

– All instructions before I finish normally
– All instructions after I can be interrupted, then re-executed from the start after the exception is handled

[Figure: LDR's page fault is detected in MEM (cycle 5) after the following ADD's page fault in IF — exceptions are detected out of program order]

• Necessary to implement precise exceptions

Pipeline and Exceptions

• An exception vector accompanies each pipeline register
• After an exception, no more state modification
• In WB, exceptions are dealt with in the same order as instructions

[Figure: pipelined datapath with an exception vector propagated alongside each pipeline register down to WB]

Multi-Cycle Instructions and Exceptions

• Example:
– FPADD raises a NaN exception in EX

• Processor state is modified before FPADD's exception is detected

• ⇒ Forbid out-of-order state modification

[Figure: FPMUL F2←F1,F0 followed by FPADD F4←F1,F3 — FPADD's exception is raised in EX while FPMUL still occupies its long EX sequence]

Control Hazards

• Branch: the destination (and possibly the condition value) must be known before fetching the next instruction

LOOP LDR R1 ← R0,#30
     ...
     BRn LOOP

[Figure: execution table for BRn LOOP followed by LDR — the branch destination address is available at the end of AC, the condition (bits n,z,p) only at the end of EX; the next fetch waits until then]

Current Microprocessors

Branch Prediction


Branch Prediction

• Usually, the branch destination address is constant (except for RET and indirect branches)

• Predict the destination address:

– store destination addresses in a table at each branch execution
– the table is indexed by the branch instruction's PC
– when the PC is sent to memory, it is also sent to the table
– the table says whether it is a branch, and gives the destination address

[Figure: prediction table — entries hold known branch instructions and their destination addresses; looked up with the PC, it answers "it is / is not a branch" and supplies the predicted address]

• If PC corresponds to branch, update PC:

PC = Destination address

• Example: conditional branch

[Figure: execution tables for JMPR LABEL — without address prediction, Inst 1 is fetched only once the destination is computed at the end of AC; with address prediction, Inst 1 is fetched the very next cycle]

Error Prediction

• Detect the error (the destination address is always computed)

• Squash the speculatively fetched instructions

• Speculated instructions modify the machine state only after the check

• Branch squashing costs 1 cycle

[Figure: the misprediction is detected when JMPR computes its destination; the wrongly fetched Inst 1–3 (err) are squashed before modifying the machine state, and fetch restarts with Inst 1 (OK)]

Recovery After Misprediction

[Figure: pipeline snapshots during recovery — while JMPR Label advances, the wrongly fetched Inst 1–3 (err) are turned into bubbles (X) in the pipeline registers, and fetch resumes at Label with Inst 1 (OK), Inst 2 (OK), ...]

Condition Prediction

• Conditional branches

• The condition value must be predicted

• The condition value can change often from one execution of the branch to another ⇒ prediction is difficult

• Example: branch taken

[Figure: execution tables for BR LABEL — with address prediction only, Inst 1 waits until the condition is computed in EX; with condition and address prediction, Inst 1 is fetched the very next cycle]

Prediction Strategies

• Static prediction:

– Always taken
• Works well with loops
– Compile-time analysis
• EPIC/IA-64; limitations of static analysis
– Hit rate: ≈ 70% to 90%

...

05 LOOP LDR R1, R0, #3

06 ADD R1, R1, #5

07 STR R1, R0, #30

08 ADD R0, R0, #1

09 ADD R3, R0, R2


Prediction Strategies

• Dynamic prediction:

– Commonplace in processors
– Recent mechanisms: hit rate up to 99% for certain applications
– Principle: learn individual branch behavior

• A first mechanism: local history — one 4-state automaton per branch

[Figure: table of 2^n entries indexed by n bits of the PC; each entry is a 2-bit automaton — Taken (11), Weakly Taken (10), Weakly Not Taken (01), Not Taken (00) — whose state gives the prediction and is updated with the actual condition: taken (1), not taken (0)]

Example

• 100 iterations:
– iteration i=0: BRn taken
– iteration i=1: BRn taken
– ...
– iteration i=99: BRn not taken

...
05 LOOP LDR R1, R0, #3
06 ADD R1, R1, #5
07 STR R1, R0, #30
08 ADD R0, R0, #1
09 ADD R3, R0, R2
0A BRn LOOP
...

[Figure: automaton state for the BRn entry — from the initial Not Taken state, the taken outcomes of iterations i=0,1,2,... walk it up to Taken, where it stays until the final not-taken iteration i=99]

Improving Dynamic Prediction

• A small hit-rate increase can have a significant impact on overall processor performance

• To improve prediction accuracy, use the behavior of preceding branches: global history

if (a == 1) a = 0; /* Branch B1 */
...
if (b == 0) b = 1; /* Branch B2 */
...
if (a == b) ...;   /* Branch B3 */

[Figure: with p = 2 history bits, branch B3 gets four table entries (PC B3 with history 00, 01, 10, 11), one per outcome pattern of B1 and B2 (taken = 1, not taken = 0 in the history)]

• The history of the last p branches is concatenated with n bits of the PC
• The index (p+n bits) selects among 2^(n+p) entries; the entry gives the prediction and is updated with the outcome

Impact of Branch Prediction on Processor Performance

• On average, 1 instruction out of 5 is a branch

• In current pipelines, a misprediction costs many cycles; here:

– 8 cycles between fetch and branch resolution
– 4 cycles to restart the pipeline

⇒ 12-cycle penalty

• 1 instruction / cycle (1000 instructions):

– 50% wrong predictions:
1000 * (0.8*1 + 0.2*(0.5*1 + 0.5*12)) = 2100 cycles
– 20% wrong predictions:
1000 * (0.8*1 + 0.2*(0.8*1 + 0.2*12)) = 1440 cycles
– 5% wrong predictions:
1000 * (0.8*1 + 0.2*(0.95*1 + 0.05*12)) = 1110 cycles

• 5 instructions / cycle (1000 instructions, 200 base cycles):

– 50% wrong predictions:
200 * (0.5*1 + 0.5*12) = 1300 cycles
– 20% wrong predictions:
200 * (0.8*1 + 0.2*12) = 640 cycles
– 5% wrong predictions:
200 * (0.95*1 + 0.05*12) = 310 cycles

Current Microprocessors

Caches


Memory Access Latency

• Processor cycle time << memory access time

[Figure: performance, 1980–2000 (log scale) — µProc improves 60%/year (2× every 1.5 years, Moore's law) while DRAM improves 9%/year (2× every 10 years); the DRAM/processor gap grows ≈50% per year]

Cache Memory

• Fast (SRAM) but small (cost) memory: 1 to 3 cycles (pipelined)

• The processor sends memory requests to the cache:

– Data in cache: hit
– Data not in cache: miss

• Performance:

– Hit rate
– Average memory access time

[Figure: processor ↔ cache ↔ memory]

1-bit SRAM Cell

• SRAM=Static Random Access Memory.

• Writing:
– bit=value; bit'=value'
– selection=1

• Reading:
– selection=0; precharge bit=VDD, bit'=VDD
– selection=1
– stored value:
• 1: voltage decreases on bit'
• 0: voltage decreases on bit

[Figure: 1-bit SRAM cell — stored value, selection line, bit and bit' lines (transistor width W), sense amplifier]


Caches and Locality Properties

• Most programs have strong locality properties

• Temporal locality: an address A referenced at time t has a strong probability of being referenced again within a short time interval

• Spatial locality: when address A is referenced at time t, there is a strong probability that a neighboring address is referenced within a short time interval

for (i=0; i<N; i++) {
  for (j=0; j<N; j++) {
    y[i] = y[i] + a[i][j] * x[j]
  }
}

• y[i]: spatial and temporal locality
• a[i][j]: spatial locality
• x[j]: spatial and temporal locality

Data and Instructions Locality

• Locality properties hold for instructions as well

• Temporal locality: just keep the address in the cache

• Spatial locality: load addresses by blocks

...
05 LOOP LDR R1, R0, #3
06 ADD R1, R1, #5
07 STR R1, R0, #30
08 ADD R0, R0, #1
09 ADD R3, R0, R2
0A BRn LOOP
...

Loop: instructions are reused ⇒ temporal locality
Consecutive instructions ⇒ spatial locality

Cache Architecture

[Figure: cache structure — a data array and a tag array (data addresses in memory), organized in cache blocks (lines); the processor request (memory address) is compared (=) against the tags: Yes ⇒ hit, data latched to the processor; No ⇒ miss, request forwarded to memory]

Address Mapping

• Mapping done in hardware
⇒ the programmer does not manage it; transparent for the programmer

• Mapping is a simple function of the address:
– # line in cache
– # byte in line

• Cache of CS bytes, lines of LS bytes
• # byte in line: the log2(LS) least significant bits of the address
• # line: the next log2(CS/LS) bits of the address

[Figure: address split into Tag | # line in cache | # byte in cache line]

Cache Line

• Data is fetched by blocks

• An address is a byte address

• Block:

– the most significant bits of its bytes' addresses are identical
– only the least significant bits vary

16-bit address, 8-byte block: block address | byte address

Example:
0...010100010 and 0...010100111: same cache line
0...010101111: consecutive address, but a distinct line

Reading Data

• Example:

– CS = 32 bytes – LS= 8 bytes

• Requested address (16 bits):

– 0110010001010100 – # line: 10

– # byte in line: 100

• A request can have a variable size:

– byte, half-word, word...

– request = address + nb of bytes

– address = addresse of first byte

– example: 2 bytes (16-bit word)

0 12 3

0 1 2 3 4 5 6 7

2 bytes sent to processor


Associativity

• Physical memory size >> cache size

• The mapping function can breed data conflicts

• Reduce conflicts by increasing cache associativity

for (i=0; i<N; i++) {
  for (j=0; j<2; j++) {
    x[j] = y[j]
  }
}

@x = 0110010001001000
@y = 0000100000010000

[Figure: x[j] and y[j] map to the same cache line — each access evicts the other: cache conflict]

Associative Cache Structure

[Figure: 2-way associative cache — two banks, each with its own tag comparison (=); on the request 0000100000010000 both tags (01100100010, 00001000000) are checked in parallel, and a MUX selects the hitting bank's data]

Associative Cache Operations

• Degree of associativity n

• A datum can be stored in any of n different entries

• The set of its possible blocks is called a set

• Upon a cache miss, choose the block/bank to replace:

– LRU: Least Recently Used
– Random
– Pseudo-LRU: the most recently used line is not replaced; random among the others
– FIFO: First In First Out


Writing Data (Write-Through)

[Figure: on a write, the data is written into the cache line and propagated to memory at the same time]

Writing Data (Write-Back)

[Figure: on a write, the data is written only into the cache line; memory is updated when the modified line is evicted on a cache miss]

Virtual Memory/Physical Memory: TLB

• The processor uses virtual addresses

• Data have an address in physical memory

⇒ virtual/physical address translation
⇒ TLB (Translation Lookaside Buffer)

• A cache of address translations

• 1 TLB entry = 1 page

• The TLB is often fully associative (n = number of lines)

[Figure: the processor request (virtual address) is compared in parallel against all TLB entries, which map virtual addresses to physical addresses]


Summary

• Simultaneously:

– Virtual/physical address translation
– Cache access using the virtual address

[Figure: pipeline with an instruction cache & TLB on the fetch side and a data cache & TLB on the memory-access side]

Several Cache Levels

[Figure: example, Alpha 21164 — processor with an 8 KB 1-way instruction cache and an 8 KB 1-way data cache, a 96 KB 3-way on-chip shared cache, an off-chip shared cache of ≈1–8 MB (1-way), then memory]

Impact of Cache Misses on Processor Performance

• On average, 1 out of 3 instructions is a load/store

• Plus instruction cache misses

• Cache hierarchies reduce the average memory latency

• Example: 1 GHz processor, 100 ns memory access ⇒ 100-cycle miss cost

• 1000 instructions:

– 50% cache misses:
1000 * (0.67*1 + 0.33*(0.5*1 + 0.5*100)) = 17335 cycles
– 5% cache misses:
1000 * (0.67*1 + 0.33*(0.95*1 + 0.05*100)) = 2633 cycles
– 0% cache misses:
1000 cycles


Current Microprocessors

Superscalar Execution

Superscalar Processor

• Pipeline: at most one instruction completes per cycle

• Superscalar of degree n: up to n instructions complete per cycle (in practice, n ≤ 4)

• Requirements for a superscalar processor:

– An uninterrupted flow of instructions
– Determine which instructions can execute in parallel
– Propagate data among instructions (the result of instruction i is an operand of instruction j)
– Several functional units

• Constraint: precise interrupts

• Superscalar implementations share many features

Superscalar Processor Architecture

[Figure: generic superscalar pipeline architecture (source: Proceedings of the IEEE)]

Instruction-Level Parallelism (ILP)

(fine-grain parallelism)

Instruction Fetch

• Instruction-flow disruptions:

– branches
– instruction cache misses

• Avoiding disruptions:

– number of fetched instructions ≥ 4: possibly several cache lines (multi-port cache)
– a buffer stores pre-fetched instructions

[Figure: fetch across a branch predicted taken — cache lines 0–3, 4–7, 8–11 feed the instruction buffer; the slots after the taken branch (X) are discarded]

Dependences

• Find instruction dependences

• Avoid "false" dependences due to register aliasing

• RAW (Read After Write): true dependence

• WAW (Write After Write): out-of-order write (false dependence)

• WAR (Write After Read): anti-dependence (false dependence)

Register Aliasing: Renaming

• Binary compatibility limits the number of logical registers

• Technology allows more physical registers

• Physical registers + ReOrder Buffer (ROB)

• Each instruction is mapped to a ROB entry

• A table maps each logical register to either a physical register or a ROB entry

⇒ Eliminates register aliasing
⇒ Finds true dependences

L2:
  move r3, r7
  lw   r8, (r3)
  add  r3, r3, 4

[Figure: rename-table evolution for the sequence above — r3 initially maps to physical register 7; after renaming, move's result is ROB6 (r3 → ROB6), lw's is ROB7 (r8 → ROB7), add's is ROB8 (r3 → ROB8); each ROB entry later receives the value produced by its instruction]

Dispatch

• After dependence analysis, instructions are dispatched

• Reservation stations: a buffer in front of each functional unit

• An instruction executes when:

– all its operands are available
– a functional unit is available

• Tomasulo's algorithm

[Figure: reservation stations for the ALUs, for the move/lw/add sequence — each entry holds the operation, two sources (either a value such as r7 or an immediate, or a ROB tag such as ROB6), validity bits, and the destination ROB entry; add's first source stays invalid until move's result (ROB6) is broadcast, after which both operands are valid]

Issue

• Ready instructions are sent to the functional units

• Results are propagated through buses:

– to physical storage (ROB entries)
– to the reservation stations

• Pending instructions can then issue immediately

⇒ The internal model is closer to dataflow than to von Neumann

[Figure: a common data bus broadcasts results from the functional units back to the ROB and the reservation stations]

Commit

• An instruction can commit only when it reaches the head of the ROB

⇒ commit order = program order

• The logical architectural state is only modified at commit

⇒ precise interrupts are possible in out-of-order processors

• Logical state: registers and memory

• When an instruction leaves the ROB:

– its result is written into a register
– OR its data is sent to memory

• Superscalar processor of degree n: n instructions can commit simultaneously

• If the instruction at the head of the ROB has not completed, the processor stalls

Pentium IV

[Figure: Pentium IV microarchitecture (source: Tom's Hardware)]

Trace Cache

[Figure: instruction cache vs. trace cache — a classic instruction cache stores the static blocks A, B, D; the trace cache, driven by n predictions from the branch predictor (taken/not taken), stores the dynamic sequence A, B, C as a single line]


Ideally...

[Figure: ideal instruction issues per cycle for gcc, espresso, li, fpppp, doducd, tomcatv — ranging from ≈18 to ≈150 issues/cycle]

Factoring in main constraints…

[Figure: number of instructions per cycle for the same programs with instruction windows of 64, 32, 16, 8 and 4 — all at most ≈25 instructions/cycle]

In reality…

• Processors issue up to 4 instructions/cycle

• Sustained: at most 2.6 instructions per cycle; 2 on average
