Advanced Microprocessor Concepts

7.8 PERFORMANCE METRICS

Evaluating the throughput potential of a microprocessor or a complete computer is not as simple as finding out how fast the microprocessor’s clock runs. System performance varies widely according to the applications being used and the computing hardware on which they run. Applications vary widely in their memory and I/O usage, both being properties whose performance is directly tied to the hardware architecture. We can consider three general sections of a computer and how each influences the speed at which an application is executed: the microprocessor, the memory system, and the I/O resources.

The usable address space of a microprocessor is an important consideration, because applications vary in their memory needs. Some embedded applications can fit into several kilobytes of memory, making an 8-bit computer with 64 kB or less of address space quite adequate. More complex embedded applications start to look like applications that run on desktop computers. If large data arrays are called for, or if a multitasking system is envisioned whereby multiple tasks each contain megabytes of program memory and high-level data structures, a 32-bit microprocessor with hundreds of megabytes of usable address space may be necessary. At the very high end, microprocessors have transitioned to 64-bit architectures with gigabytes of directly addressable memory. A microprocessor’s address space can always be expanded externally by banking methods, but banking comes at a penalty of increased time to switch banks and the complexity of making an application aware of the banking scheme.
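To make the banking penalty concrete, the following C sketch shows how a bank-switched access might look on a hypothetical 8-bit system with a 16-kB banked window. The bank-select register address, window location, and bank size are assumptions chosen for illustration, not taken from any particular device.

    /* Hypothetical bank-switching scheme: an 8-bit latch selects which
     * 16-kB bank of a larger external memory appears in the CPU window. */
    #include <stdint.h>

    #define BANK_SELECT_REG (*(volatile uint8_t *)0xFF00u) /* assumed latch address */
    #define BANK_WINDOW     ((volatile uint8_t *)0x8000u)  /* assumed window base   */
    #define BANK_SIZE       0x4000u                        /* 16 kB per bank        */

    /* Read a byte from an address beyond the CPU's native 64-kB space. */
    uint8_t banked_read(uint32_t ext_addr)
    {
        BANK_SELECT_REG = (uint8_t)(ext_addr / BANK_SIZE); /* the switching penalty */
        return BANK_WINDOW[ext_addr % BANK_SIZE];          /* access within window  */
    }

Every access that crosses a bank boundary pays for the extra store to the bank-select register, which is exactly the time and software-complexity penalty mentioned above.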

Any basic type of application can run on almost any microprocessor. The question is how fast and efficiently a particular microprocessor is able to handle the application. The instruction set is an important attribute that should be considered when designing a computer system. If a floating-point intensive application is envisioned, it should probably be run on a microprocessor that contains an FPU, and the number of floating-point execution units and their execution latencies are important attributes to investigate. An integer-only microprocessor could most likely run the floating-point application by performing software emulation of floating-point operations, but its performance would probably be rather dismal. For smaller-scale computers and applications, these types of questions are still valid. If an application needs to perform frequent bit manipulations for testing and setting various flags, a microprocessor that directly supports bit manipulation may be better suited than a generic architecture with only logical AND/OR type instructions.
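The bit-manipulation point can be illustrated in C. On a microcontroller with native bit set/clear/test instructions, each function below can compile to a single instruction; on a generic architecture, the same source expands into a load, a logical AND or OR, and a store. The flag layout is an invented example.

    #include <stdint.h>

    #define FLAG_READY (1u << 3)            /* example flag bit   */

    volatile uint8_t status_flags;          /* shared status byte */

    void mark_ready(void)                   /* set one flag bit   */
    {
        status_flags |= FLAG_READY;
    }

    int is_ready(void)                      /* test one flag bit  */
    {
        return (status_flags & FLAG_READY) != 0;
    }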

Once a suitable instruction set has been identified, a microprocessor’s ability to actually fetch and execute the instructions can become an important part of system performance. On smaller systems, there are few variables in instruction fetch and execution: each instruction is fetched and executed sequentially. Superscalar microprocessors, however, must include effective instruction analysis logic to properly utilize all the extra logic that has been put onto the chip and that you are paying for. If the multiple execution units cannot be kept busy a sufficient fraction of the time, your application will not enjoy the benchmark performance claims of the manufacturer. Vendors of high-performance microprocessors devote much time to instruction profiling and analysis of instruction sequences. Their results improve performance on most applications, but there are always a few niche applications with uncommon properties that can cause certain microprocessors to fall off in performance. It pays to keep in mind that common industry benchmarks of microprocessor performance do not always tell the whole story. These tests have been around for a long time, and microprocessor designers have learned how to optimize their hardware to perform well on the standard benchmarks. An application that behaves substantially differently from a benchmark test may not show the same level of performance as advertised by the manufacturer.

The microprocessor’s memory interface is a critical contributor to its performance. Whether a small 8-bit microprocessor or a 64-bit behemoth, the speed with which instructions can be fetched and data can be loaded and stored affects the execution time of an application. The necessary bandwidth of a memory interface is relative: it is proportional to the sum of the instruction and data bandwidths of an application. From the instruction perspective, it is clear that the microprocessor needs to keep itself busy with a steady stream of instructions. Data bandwidth, however, is very much a function of the application. Some applications may perform frequent load/store operations, whereas others may operate more on data retained within the microprocessor’s register set. To the extent that load/store operations detract from the microprocessor’s ability to fetch and execute new instructions, they will reduce overall throughput.
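The two loops below, written in C as an illustrative sketch, make the distinction concrete: the first demands one data load per iteration, while the second keeps its working value in a register and generates essentially no data traffic, leaving the memory interface free for instruction fetch.

    #include <stddef.h>
    #include <stdint.h>

    /* Load/store intensive: one data load every iteration. */
    uint32_t sum_array(const uint32_t *a, size_t n)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += a[i];                    /* data bandwidth consumed here   */
        return sum;
    }

    /* Register resident: no data memory traffic inside the loop. */
    uint32_t churn(uint32_t x, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            x = x * 1664525u + 1013904223u; /* operates entirely in registers */
        return x;
    }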

Clock frequency becomes a defining attribute of a microprocessor once its instruction set, execution capabilities, and memory interface are understood from a performance perspective. Without these supporting attributes, clock speed alone does not define the capabilities of a microprocessor.

A 500-MHz single-issue, or nonsuperscalar, microprocessor could be easily outperformed by a 200-MHz four-issue superscalar design. Additionally, there may be multiple relevant clocks to consider in a complex microprocessor. Microprocessors whose internal processing cores are decoupled from the external memory bus by an integrated cache are often specified with at least two clocks: the core clock and the bus interface clock. It is necessary to understand the effect of both clocks on the processing core’s throughput. A fast core can be potentially starved for instructions and data by a slow interface. Once a microprocessor’s resources have been quantified, clock frequency becomes a multiplier to determine how many useful operations per second can be expected. Metrics such as instructions per second (IPS) or floating-point operations per second (FLOPS) are specified by multiplying the average number of instructions executed per cycle by how many cycles occur each second. Whereas high-end microprocessors were once measured in MIPS and MFLOPS, GIPS and GFLOPS performance levels are now attainable.
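The arithmetic is simple enough to show directly. This short C program, using illustrative numbers rather than any particular device’s ratings, multiplies average instructions per cycle (IPC) by clock frequency to produce an IPS figure:

    #include <stdio.h>

    int main(void)
    {
        double ipc      = 2.5;    /* assumed average instructions per cycle */
        double clock_hz = 200e6;  /* assumed 200-MHz core clock             */
        double ips      = ipc * clock_hz;
        printf("%.0f MIPS\n", ips / 1e6);  /* prints: 500 MIPS */
        return 0;
    }

A four-issue design that averages 2.5 instructions per cycle at 200 MHz delivers 500 MIPS, matching or exceeding the 500-MHz single-issue part mentioned above, whose average IPC in practice falls below one because of pipeline stalls.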

As already mentioned, memory bandwidth and, consequently, memory architecture hold key roles in determining overall system performance. Memory system architecture encompasses all memory external to the microprocessor’s core, including any integrated caches that it may contain.

When dealing with an older-style microprocessor with a memory interface that does not stress current memory technologies, memory architecture may not be subject to much variability and may not be a bottleneck at all. It is not hard to find flash, EPROM, and SRAM devices today with access times of 50 ns and under. A moderately sized memory array constructed from these components could provide an embedded microprocessor with full-speed random access as long as the memory transaction rate is 20 MHz or less (one access per 50-ns cycle). Many 8-, 16-, and even some 32-bit embedded microprocessors can fit comfortably within this performance window. As such, computers based on these devices can have simple memory architectures without suffering performance degradation.

Memory architecture starts to get more complicated when higher levels of performance are desired. Once the microprocessor’s program and data fetch latency becomes faster than main memory’s random access latency, caching and bandwidth improvement techniques become critical to sustaining system throughput. Random access latency is the main concern. A large memory array can be made to deliver adequate bandwidth given a sufficient width. As a result of the limited operating frequency of SDRAM devices, high-end workstation computers have been known to connect multiple memory chips in parallel to create 256-bit and even 512-bit wide interfaces. Using 512-Mb DDR SDRAMs, each organized as 32M × 16 and running at 167 MHz, 16 devices in parallel would yield a 1-GB memory array with a burst bandwidth of 167 MHz × 2 words per clock × 256 bits/word = 85.5 Gbps! This is a lot of bandwidth, but relative to a microprocessor core that operates at 1 GHz or more with a 32- or 64-bit data path, such a seemingly powerful memory array may just barely be able to keep up.
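The burst-bandwidth figure can be reproduced with a few lines of C; the numbers below come directly from the example in the text.

    #include <stdio.h>

    int main(void)
    {
        double clock_hz      = 167e6;    /* SDRAM interface clock        */
        double xfers_per_clk = 2.0;      /* DDR: two transfers per clock */
        double width_bits    = 16 * 16;  /* 16 devices x 16 bits = 256   */
        double burst_bps     = clock_hz * xfers_per_clk * width_bits;
        printf("%.1f Gbps\n", burst_bps / 1e9);  /* prints: 85.5 Gbps */
        return 0;
    }

Note that this is burst bandwidth only; the sustained figure is lower once row-activation and refresh overhead are taken into account, and random access latency is unchanged.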

While bandwidth can be increased by widening the interface, random access latency does not go away. Therefore, there is more to a memory array than its raw size. The bandwidth of the array, which is the product of its interface frequency and width, and its latency are important metrics in understanding the impact of cache misses, especially when dealing with applications that exhibit poor locality.

Caching reduces the negative effect of high random access latencies on a microprocessor’s throughput. However, caches and wide arrays cannot completely balance the inequality between the bandwidth and latency that the microprocessor demands and that which is provided by SDRAM technology. Cache size, type, and latency and main memory bandwidth are therefore important metrics that contribute to overall system performance. An application’s memory characteristics determine how costly a memory architecture is needed to maintain adequate performance. Applications that operate on smaller sets of data with higher degrees of locality will be less reliant on a large cache and fast memory array, because they will have fewer cache misses. Those applications with opposite memory characteristics will increase the memory architecture’s effect on the computer’s overall performance. In fact, by the nature of the application being run, caching effects can become more significant than the microprocessor’s core clock frequency. In some situations, a 500-MHz microprocessor with a 2-MB cache can outperform a 1-GHz microprocessor with a 256-kB cache. It is important to understand these considerations, because money may be better spent on either a faster microprocessor or a larger cache according to the needs of the intended applications.
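A back-of-the-envelope model shows how the 500-MHz/2-MB part can win. Using the standard effective-CPI relationship, CPI_eff = CPI_base + miss rate × miss penalty, with illustrative numbers that are assumptions rather than measurements:

    #include <stdio.h>

    /* Effective throughput in MIPS given a base CPI, a cache miss rate
     * (misses per instruction), and a miss penalty in core cycles. */
    static double mips(double clock_hz, double cpi_base,
                       double miss_rate, double penalty_cycles)
    {
        double cpi_eff = cpi_base + miss_rate * penalty_cycles;
        return clock_hz / cpi_eff / 1e6;
    }

    int main(void)
    {
        /* 500-MHz core, 2-MB cache: assume 1% misses, 100-cycle penalty. */
        printf("500 MHz / 2 MB:  %.0f MIPS\n", mips(500e6, 1.0, 0.01, 100));
        /* 1-GHz core, 256-kB cache: assume 5% misses; the same DRAM
         * latency costs twice as many of the faster core's cycles. */
        printf("1 GHz / 256 kB:  %.0f MIPS\n", mips(1e9, 1.0, 0.05, 200));
        return 0;
    }

Under these assumptions, the slower processor sustains 250 MIPS against roughly 91 MIPS for the faster one, purely because of cache behavior.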

I/O performance affects system throughput in two ways: the latency of executing transactions and the degree to which such execution blocks the microprocessor from performing other work. In a computer in which the microprocessor operates with a substantially higher bandwidth than individual I/O interfaces, it is desirable to decouple the microprocessor from the slower interface as much as possible. Most I/O controllers provide a natural degree of decoupling. A typical UART, for example, absorbs one or more bytes in rapid succession from a microprocessor and then transmits them at a slower serial rate. Likewise, the UART assembles one or more whole incoming bytes that the microprocessor can read at an instantaneous bandwidth much higher than the serial rate. Network and disk adapters often contain buffers of several kilobytes that can be rapidly filled or drained by the microprocessor. The microprocessor can then continue with program execution while the adapter logic handles the data at whatever lower bandwidth is inherent to the physical interface.
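The following C sketch shows the UART decoupling described above. The register addresses and the status bit are hypothetical; a real UART defines its own register map, but the pattern is the same.

    #include <stdint.h>

    #define UART_STATUS  (*(volatile uint8_t *)0x4000u) /* assumed address */
    #define UART_TX_DATA (*(volatile uint8_t *)0x4001u) /* assumed address */
    #define TX_FIFO_FULL (1u << 0)                      /* assumed bit     */

    /* The CPU fills the FIFO at bus speed and waits only when it is full;
     * the UART drains the bytes at the much slower serial rate. */
    void uart_send(const uint8_t *buf, unsigned len)
    {
        while (len--) {
            while (UART_STATUS & TX_FIFO_FULL)
                ;                           /* block only on a full FIFO */
            UART_TX_DATA = *buf++;          /* near-instantaneous write  */
        }
    }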

Inherent decoupling provided by an I/O controller is sufficient for many applications. When dealing with very I/O-intensive applications, such as a large server, multiple I/O controllers may interact with each other and memory simultaneously in a multimaster bus configuration. In such a context, the microprocessor sets up block data transfers by programming multiple I/O and DMA controllers and then resumes work processing other tasks. Each I/O and DMA controller is a potential bus master that can arbitrate for access to the memory system and the I/O bus (if there is a separate I/O bus).
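As a sketch of that setup step, the fragment below programs one channel of a hypothetical memory-mapped DMA controller; the register layout and bit names are invented for illustration. The essential point is that the CPU writes a few registers and returns to other work while the controller masters the bus itself.

    #include <stdint.h>

    typedef struct {
        volatile uint32_t src_addr;    /* transfer source       */
        volatile uint32_t dst_addr;    /* transfer destination  */
        volatile uint32_t byte_count;  /* block length          */
        volatile uint32_t control;     /* start and IRQ control */
    } dma_channel_t;

    #define DMA_START      (1u << 0)
    #define DMA_IRQ_ENABLE (1u << 1)

    /* Kick off a block copy; completion is signaled by interrupt. */
    void dma_start_copy(dma_channel_t *ch,
                        uint32_t src, uint32_t dst, uint32_t nbytes)
    {
        ch->src_addr   = src;
        ch->dst_addr   = dst;
        ch->byte_count = nbytes;
        ch->control    = DMA_START | DMA_IRQ_ENABLE;  /* arbitration begins */
    }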

As the number of simultaneous bus masters increases, contention can develop, which may cause performance degradation resulting from excessive waiting time by each potential bus master. This contention can be reduced by modifying the I/O bus architecture. A first step is to decouple the I/O bus from the memory bus into one or more segments, enabling data transfers within a given I/O segment to proceed without conflicting with a memory transfer or one contained within other I/O segments.

PCI is an example of such a solution. At a more advanced level, the I/O system can be turned into a switched network in which individual I/O controllers or small segments of I/O controllers are connected to a dedicated port on an I/O switch that enables each port to communicate with any other port simultaneously, as long as multiple ports do not contend for access to the same port. This is a fairly expensive solution that is implemented in high-end servers for which I/O performance is a key contributor to overall system throughput.

The question of how fast a computer performs does not depend solely on how many megahertz the microprocessor runs at or how much RAM it has. Performance is highly application specific and is dominated by how many cycles per second the microprocessor is kept busy with useful instructions and data.
