
Execution Core Detail

From the document Intel Architecture Optimization (pages 31-36)

To successfully implement parallelism, information on execution units' latency is required. Also important is information on the layout of execution units in the pipelines and on which µops execute in which pipeline. This section details the operation of the execution core, including instruction latency and throughput, execution units and ports, caches, and store buffers.

Instruction Latency and Throughput

The core's ability to exploit parallelism can be enhanced by ordering instructions so that their operands are ready and their corresponding execution units are free when they reach the reservation stations. Knowing instruction latencies helps in scheduling instructions appropriately. Some execution units are not pipelined, so µops cannot be dispatched to them in consecutive cycles and their throughput is less than one per cycle. Table 1-1 lists the Pentium II and Pentium III processors' execution units, their latencies, and their issue throughput.
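To see why latencies matter for scheduling, here is a minimal Python sketch (mine, not from the manual) that models summing n floating-point terms using the FADD numbers from Table 1-1 (latency 3 cycles, throughput 1/cycle). A single dependent accumulator chain is latency-bound; splitting the sum across independent accumulators hides the latency until the issue rate becomes the limit. The model is idealized: it ignores the final reduction of the partial sums and all other stalls.

```python
import math

FADD_LATENCY = 3    # cycles until a result is usable (Table 1-1)
ISSUE_INTERVAL = 1  # FADD throughput: one new µop can start each cycle

def chain_cycles(n, accumulators=1):
    """Cycles to sum n terms with `accumulators` independent chains.
    Each chain is bound by FADD latency; chains overlap up to the
    adder's issue rate (idealized model, other stalls ignored)."""
    per_chain = math.ceil(n / accumulators)
    latency_bound = per_chain * FADD_LATENCY
    issue_bound = n * ISSUE_INTERVAL
    return max(latency_bound, issue_bound)

# One accumulator: every add waits 3 cycles on the previous result.
print(chain_cycles(96, 1))  # 288 cycles, latency-bound
# Three accumulators fully cover the 3-cycle latency.
print(chain_cycles(96, 3))  # 96 cycles, throughput-bound
```

With three accumulators the adder starts a new, independent FADD every cycle, which is exactly the "operands ready, unit free" condition the text describes.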


Intel Architecture Optimization Reference Manual

Table 1-1  Pentium II and Pentium III Processors Execution Units

Port  Execution Units                              Latency/Throughput
0     Integer Arithmetic Unit                      Latency 1, Throughput 1/cycle
0     Integer Shift Unit                           Latency 1, Throughput 1/cycle
0     Integer Mul                                  Latency 4, Throughput 1/cycle
0     Floating Point Unit: FADD                    Latency 3, Throughput 1/cycle
0     FMUL                                         Latency 5, Throughput 1/2 cycle (1)
0     FDIV                                         Latency: single-precision 18 cycles,
                                                   double-precision 32 cycles,
                                                   extended-precision 38 cycles;
                                                   Throughput non-pipelined
0     MMX technology ALU Unit                      Latency 1, Throughput 1/cycle
0     MMX technology Multiplier Unit               Latency 3, Throughput 1/cycle
0     Streaming SIMD Extensions: Multiply,         See Appendix D, "Streaming SIMD
      Divide, Square Root, Move instructions       Extensions Throughput and Latency"
1     Integer ALU Unit                             Latency 1, Throughput 1/cycle
1     MMX technology ALU Unit                      Latency 1, Throughput 1/cycle
1     MMX technology Shift Unit                    Latency 1, Throughput 1/cycle
1     Streaming SIMD Extensions: Adder,            See Appendix D, "Streaming SIMD
      Reciprocal and Reciprocal Square Root,       Extensions Throughput and Latency"
      Shuffle/Move instructions

continued

Processor Architecture Overview


1. The FMUL unit cannot accept a second FMUL in the cycle after it has accepted the first; FMUL is pipelined to start once every two clock cycles. This is not the same as only being able to issue FMULs on even clock cycles.

2. A load that gets its data from a store to the same address can dispatch in the same cycle as the store, so in that sense the latency of the store is 0. The store itself takes three cycles to complete, but that latency affects only how soon a store-buffer entry is freed for use by another µop.

Execution Units and Ports

Each cycle, the core may dispatch zero or one µop on a port to any of the five pipelines (shown in Figure 1-2), for a maximum issue bandwidth of five µops per cycle. Each pipeline contains several execution units, and µops are dispatched to the pipeline that corresponds to their type of operation. For example, an integer arithmetic logic unit (ALU) and the floating-point execution units (adder, multiplier, and divider) share a pipeline. Knowing which µops execute in the same pipeline can be useful in ordering instructions to avoid resource conflicts.
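A simple way to reason about port conflicts is to count how many µops an iteration needs on each port: since each port dispatches at most one µop per cycle, the busiest port sets a floor on cycles per iteration. The sketch below (my illustration, with a hypothetical µop mix) models only this port-pressure bound and ignores dependencies and latencies.

```python
from collections import Counter

def min_cycles_per_iteration(uop_ports):
    """Lower bound on cycles per loop iteration from port pressure alone:
    each of the five ports (0-4) dispatches at most one µop per cycle,
    so the most heavily used port sets the floor. Dependencies and
    latencies, which can only raise the real count, are ignored."""
    per_port = Counter(uop_ports)
    return max(per_port.values())

# Hypothetical iteration: two port-0 FP µops, one port-1 ALU µop,
# two port-2 loads, and one store (address on port 3, data on port 4).
uops = [0, 0, 1, 2, 2, 3, 4]
print(min_cycles_per_iteration(uops))  # 2: ports 0 and 2 are each needed twice
```

Moving one of the port-0 µops to an equivalent port-1 instruction (where one exists, e.g. an MMX ALU operation) would not help here, because port 2 still needs two cycles; the model makes such trade-offs easy to check.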

Table 1-1  Pentium II and Pentium III Processors Execution Units (continued)

Port  Execution Units                              Latency/Throughput
2     Load Unit                                    Latency 3 on a cache hit,
                                                   Throughput 1/cycle
2     Streaming SIMD Extensions Load               See Appendix D, "Streaming SIMD
      instructions                                 Extensions Throughput and Latency"
3     Store Address Unit                           Latency 0 or 3 (not on critical
                                                   path), Throughput 1/cycle (2)
3     Streaming SIMD Extensions Store              See Appendix D, "Streaming SIMD
      instructions                                 Extensions Throughput and Latency"
4     Store Data Unit                              Latency 1, Throughput 1/cycle
4     Streaming SIMD Extensions Store              See Appendix D, "Streaming SIMD
      instructions                                 Extensions Throughput and Latency"


Caches of the Pentium II and Pentium III Processors

The on-chip cache subsystem of Pentium II and Pentium III processors consists of two 16-Kbyte, four-way set associative caches with a cache line length of 32 bytes. The caches employ a write-back mechanism and a pseudo-LRU (least recently used) replacement algorithm. The data cache consists of eight banks interleaved on four-byte boundaries.
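The geometry above determines how an address maps into the cache. This Python sketch (my illustration, derived from the stated parameters) computes the set and bank for an address: 16 Kbytes divided by 4 ways of 32-byte lines gives 128 sets, and the eight data-cache banks are selected by the four-byte-aligned address bits.

```python
CACHE_SIZE = 16 * 1024  # bytes, per cache (text above)
WAYS = 4                # four-way set associative
LINE_SIZE = 32          # bytes per cache line
NUM_BANKS = 8           # data-cache banks, interleaved on 4-byte boundaries

NUM_SETS = CACHE_SIZE // (WAYS * LINE_SIZE)  # 128 sets

def cache_set(addr):
    """Set index: line number modulo the number of sets."""
    return (addr // LINE_SIZE) % NUM_SETS

def cache_bank(addr):
    """Bank index: 4-byte word number modulo the number of banks."""
    return (addr // 4) % NUM_BANKS

print(NUM_SETS)                                 # 128
# Addresses 4 KB apart (128 sets * 32 bytes) land in the same set,
# so five such addresses compete for the four ways of one set.
print(cache_set(0x1000) == cache_set(0x2000))   # True
```

This is why data structures strided by multiples of 4 Kbytes can thrash a set despite the cache being mostly empty.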

Figure 1-2 Execution Units and Ports in the Out-Of-Order Core



Level two (L2) caches have been off chip but in the same package. They are 128K or more in size. L2 latencies are in the range of 4 to 10 cycles. An L2 miss initiates a transaction across the bus to memory chips. Such an access requires on the order of at least 11 additional bus cycles, assuming a DRAM page hit. A DRAM page miss incurs another three bus cycles. Each bus cycle equals several processor cycles, for example, one bus cycle for a 100 MHz bus is equal to four processor cycles on a 400 MHz processor. The speed of the bus and sizes of L2 caches are implementation dependent, however. Check the specifications of a given system to understand the precise characteristics of the L2 cache.
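The bus-cycle arithmetic above can be made concrete. The sketch below (mine, using only the example figures from the text: at least 11 bus cycles for an L2 miss with a DRAM page hit, 3 more on a page miss, and a 100 MHz bus feeding a 400 MHz processor) converts those costs into processor cycles.

```python
def to_processor_cycles(bus_cycles, cpu_mhz=400, bus_mhz=100):
    """Convert bus cycles to processor cycles. With the text's example
    ratio, one 100 MHz bus cycle equals four 400 MHz processor cycles."""
    return bus_cycles * (cpu_mhz // bus_mhz)

L2_MISS_BUS_CYCLES = 11     # at least, assuming a DRAM page hit
DRAM_PAGE_MISS_EXTRA = 3    # additional bus cycles on a DRAM page miss

print(to_processor_cycles(L2_MISS_BUS_CYCLES))                          # 44
print(to_processor_cycles(L2_MISS_BUS_CYCLES + DRAM_PAGE_MISS_EXTRA))   # 56
```

So even the best-case memory access costs on the order of 44 processor cycles on this configuration, roughly an order of magnitude more than an L2 hit; as the text notes, the actual ratio is implementation dependent.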

Store Buffers

Pentium II and Pentium III processors have twelve store buffers. These processors temporarily store each write (store) to memory in a store buffer.

The store buffer improves processor performance by allowing the processor to continue executing instructions without having to wait until a write to memory and/or cache is complete. It also allows writes to be delayed for more efficient use of memory-access bus cycles.

Writes stored in the store buffer are always written to memory in program order. Pentium II and Pentium III processors use processor ordering to maintain consistency between the order in which data is read (loaded) and written (stored) in a program and the order in which the processor actually carries out the reads and writes. With this type of ordering, reads can be carried out speculatively and in any order, reads can pass buffered writes, and writes to memory are always carried out in program order.

Write hits cannot pass write misses, so performance of critical loops can be improved by scheduling the writes to memory. When you expect to see write misses, schedule the write instructions in groups no larger than twelve, and schedule other instructions before scheduling further write instructions.
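The grouping advice can be illustrated with a toy occupancy model (my assumption, not the manual's): one write is issued per cycle, each write miss holds one of the twelve store buffers until it drains, and misses drain to memory one at a time with a hypothetical fixed drain time. Issuing a write when all twelve buffers are full stalls.

```python
STORE_BUFFERS = 12  # Pentium II / Pentium III store buffers (text above)

def stall_cycles(n_writes, drain=10):
    """Stall cycles for n_writes back-to-back write misses under the toy
    model: drains are serialized at `drain` cycles each (assumed figure),
    and issue blocks whenever all twelve buffers are occupied."""
    in_flight = []   # completion cycle of each occupied store buffer
    drain_free = 0   # cycle at which the drain engine is next idle
    cycle = stalls = 0
    for _ in range(n_writes):
        in_flight = [t for t in in_flight if t > cycle]  # retire drained stores
        while len(in_flight) >= STORE_BUFFERS:           # all buffers busy
            cycle += 1
            stalls += 1
            in_flight = [t for t in in_flight if t > cycle]
        drain_free = max(drain_free, cycle) + drain
        in_flight.append(drain_free)
        cycle += 1  # one write issued per cycle
    return stalls

# A group of twelve misses never fills all buffers before issue completes;
# a longer unbroken run of misses does, and begins to stall.
print(stall_cycles(12))  # 0
print(stall_cycles(24) > 0)  # True
```

In the model, breaking a run of 24 misses into two groups of 12 with enough other work between them eliminates the stalls, which matches the text's scheduling recommendation.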


Streaming SIMD Extensions of the Pentium III
