A Review of Key Ideas in Power-Aware Microarchitectures

An Architect’s View

3.3 A Review of Key Ideas in Power-Aware Microarchitectures

In our review of power-efficient design concepts at the microarchitecture level, our primary attention will be on dynamic (also known as active or switching) power, governed by the formula CV²af. Recall that C refers to the switching capacitance, V is the supply voltage, a is the activity factor (0 < a < 1), and f is the operating clock frequency. Power reduction ideas must therefore focus on one or more of these basic parameters. Reducing active power generally results in reduction of on-chip temperatures, and this indirectly causes leakage power to go down as well. Similarly, any increase in efficiency directed at lowering the latch count (e.g., by reducing the basic pipeline depth, or by reducing the number of back-end execution pipes within a given functional unit) also results in area and leakage reduction as a side benefit. However, later in this section, we also deal with the problem of mitigating leakage power directly, by providing microarchitectural support to what are primarily circuit-level mechanisms.

3.3.1 Power Efficiency at the Processor Core Level

In this section, we examine the key ideas that have been proposed in terms of microarchitectural support for power efficiency, at the level of a single processor core.

The effective (average) value of C can be reduced by (1) using area-efficient designs for various macros; (2) using adaptive structures, which change in effective size, latency, or communication bandwidth depending on the needs of the input workload; (3) selectively “gating off” the clock for unused or idle units; and (4) reducing or eliminating “speculative waste” resulting from executing instructions in mis-speculated branch paths or prefetching useless instructions and data into caches, based on wrong guesses.

The average value of V can be reduced via dynamic voltage scaling, i.e., by reducing the voltage as and when required or possible. Microarchitectural support is not required in this case, unless the mechanisms that detect “idle” periods or temperature overruns rely on counter-based “proxies” specially architected for this purpose. Hence, in this chapter, we do not dwell on dynamic voltage scaling methods. (Note again, however, that since reducing V also requires (or results in) a reduction of the operating frequency, f, the net power reduction is cubic; thus, dynamic voltage and frequency scaling (DVFS), though arguably not a microarchitectural technique per se, is the most effective means of power reduction.) Deciding when and how to apply DVFS, as a function of the input workload characteristics and overall operating environment, on the other hand, is very much a microarchitectural issue. It is a problem that is increasingly relevant in the era of variability-tolerant, power-efficient multi-core chip design, described briefly in Section 3.4.
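The cubic effect of scaling voltage and frequency together can be checked with a small sketch of the CV²af formula. The operating-point values below are hypothetical, and the sketch assumes that frequency is scaled by the same factor as voltage, which is only a first-order approximation.

```python
def dynamic_power(C, V, a, f):
    """Dynamic (switching) power: C * V^2 * a * f."""
    return C * V**2 * a * f


def dvfs_power_ratio(s, C=1.0e-9, V=1.2, a=0.3, f=1.0e9):
    """Power after scaling both V and f by s, relative to the baseline.

    The default C, V, a, f values are illustrative assumptions only;
    they cancel out in the ratio.
    """
    baseline = dynamic_power(C, V, a, f)
    scaled = dynamic_power(C, s * V, a, s * f)
    return scaled / baseline


# Scaling V and f to 80% leaves roughly 0.8**3 of the dynamic power.
ratio = dvfs_power_ratio(0.8)
```

Under these assumptions a 20% reduction in voltage and frequency removes roughly half of the dynamic power, at the cost of a 20% lower clock rate.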

The average value of the activity factor, a, can be reduced by (1) the use of clock-gating, where the normally free-running, synchronous clock is disabled in selected units or subunits within the system based on information or predictions about current or future activity in those regions; or (2) the use of data representations and instruction schedules that result in reduced switching. Microarchitectural support is provided in the form of added mechanisms to (1) detect, predict, and control the generation of the applied gating signals or (2) aid in power-efficient data and instruction encodings. Compiler support for generating power-efficient instruction scheduling and data partitioning, or special instructions for “nap/doze/sleep” control, if applicable, must also be considered under this category.

While clock-gating helps eliminate (or drastically reduce) active or switching power when a given macro, subunit, or unit is idle, power-gating can be used to also eliminate the residual leakage power of that idle entity. In this case, as described in detail later on, the power supply voltage V is itself gated off from the target circuit block, with the help of a header or footer transistor. Here, the need for microarchitectural support in the form of predictive control of the gating signal is even stronger because of the relatively large performance overheads that would be incurred without such support. There are other techniques like adaptive body biasing that are also targeted at leakage power control, and these too require some degree of microarchitectural support. However, these techniques are most relevant to bulk-CMOS designs (as opposed to SOI–bulk-CMOS technology), and are predominantly device- and circuit-level methods. As such, we do not dwell on them in this chapter.

Lastly, the average value of the design frequency, f, can be controlled or reduced by using (1) variable, multiple, or locally asynchronous (self-timed) clocks, e.g., in GALS [12] designs; (2) clock throttling, where the frequency is reduced dynamically in response to power or temperature overrun indicators; or (3) reduced pipeline depths in the baseline microarchitecture definition.

We consider power-aware microarchitectural constructs that use C, a, or f as the primary lever for reducing active power, and those that use the supply voltage V as the lever for reducing leakage power. In any such proposed processor architecture, the efficacy of the particular power reduction method that is used must be assessed by understanding the net performance impact. Here, depending on the application domain (or market), a PDP, EDP, or ED²P metric for evaluating and comparing power–performance efficiencies must be used. (See the earlier discussion in Section 3.2.)

3.3.1.1 Optimal Pipeline Depth

A fundamental question concerns pipeline depth. Is a deeply pipelined, high-frequency (“speed demon”) design better than an IPC-centric, low-frequency (“brainiac”) design? In the context of this chapter, “better” must be judged in terms of power–performance efficiency.

Let us consider, first, a simple, hazard-free, linear pipeline flow process, with k stages. Let the time for the total logic (without latches) to compute one answer be T. Assuming that the k stages into which the logic is partitioned are of equal delay, the time per stage and thus the time per computation becomes (see Chapter 2 in Ref. [13])

$$ t = \frac{T}{k} + D \tag{3.10} $$

where D is the delay added due to the staging latch. The inverse of t determines the clocking rate or frequency of operation. Similarly, if the energy spent (per cycle, per second, or over the duration of the program run) in the logic is W and the corresponding energy spent per level of staging latches is L, then the total energy equation for the k-stage pipelined version is roughly given as follows:

$$ E = Lk + W \tag{3.11} $$

The energy equation assumes that the clock is free-running; i.e., on every cycle, each level of staging latches is clocked to enable the advancement of operations along the pipeline. (Later, we shall consider the effect of clock-gating.) The energy of Equation 3.11 and the performance implied by Equation 3.10 (i.e., 1/t) are plotted as functions of k in Figure 3.2a and b, respectively.

As the number of stages increases, the energy or power consumed increases linearly, while the performance also increases, but not as fast. To consider the PDP-based power–performance efficiency, we compute the ratio

$$ \frac{\text{Power}}{\text{Performance}} = (Lk + W)\left(\frac{T}{k} + D\right) = LT + WD + \frac{LDk^{2} + WT}{k} \tag{3.12} $$

Figure 3.3 shows the general shape of this curve as a function of k. Differentiating the right-hand side of Equation 3.12 with respect to k and setting the derivative to zero, one can solve for the optimum value of k at which the power–performance efficiency is maximized; i.e., the minimum of the curve in Figure 3.3 can be shown to occur when

$$ k(\text{opt}) = \sqrt{\frac{WT}{LD}} \tag{3.13} $$
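For reference, the differentiation step can be written out explicitly. This is only a worked restatement of Equations 3.12 and 3.13 in the same symbols; the name R(k) for the power/performance ratio is introduced here for convenience and is not used elsewhere in the text.

```latex
\begin{align*}
R(k) &= LT + WD + LDk + \frac{WT}{k} \\
\frac{dR}{dk} &= LD - \frac{WT}{k^{2}} = 0
  \;\Longrightarrow\; k^{2} = \frac{WT}{LD} \\
\frac{d^{2}R}{dk^{2}} &= \frac{2WT}{k^{3}} > 0 \qquad (k > 0)
\end{align*}
```

The positive second derivative confirms that the stationary point is a minimum of the ratio, i.e., a maximum of power–performance efficiency.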

FIGURE 3.2 Power and performance curves for idealized pipeline flow: (a) energy E (per unit of time), Lk + W, versus number of stages k; (b) performance (operations per second), 1/(T/k + D), versus number of stages k.

FIGURE 3.3 Power–performance ratio curve for idealized pipeline flow: power/performance versus number of stages k, with the minimum at k(opt).

Larson [14] first published the above analysis, albeit from a cost/performance perspective.

This analysis shows that, at least for the simplest, hazard-free pipeline flow, the highest frequency operating point achievable in a given technology may not be the most energy-efficient. Rather, the optimal number of stages (and hence operating frequency) is expected to be at a point which increases for greater W or T and decreases for greater L or D. For a prior generation POWER4-class (0.18 μm) superscalar processor operating at around 1 GHz [15], the floating-point arithmetic unit is estimated to yield values of T = 7.5 ns, D = 0.15 ns, W = 0.15 W, and L = 0.1 W.

This yields k(opt) ≈ 8 (rounded down from about 8.66), if we use the idealized formalism (Equation 3.13).
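The arithmetic behind this estimate can be reproduced with a short sketch that evaluates Equations 3.12 and 3.13 for the quoted FPU values. The function and variable names are illustrative choices, not taken from the cited work.

```python
import math


def k_opt(W, T, L, D):
    """Optimal stage count from Equation 3.13: sqrt(W*T / (L*D))."""
    return math.sqrt((W * T) / (L * D))


def power_perf_ratio(k, W, T, L, D):
    """Power/performance ratio of Equation 3.12 for a k-stage pipeline."""
    return (L * k + W) * (T / k + D)


# FPU estimates quoted in the text (T and D in ns; W and L in watts).
FPU = dict(W=0.15, T=7.5, L=0.1, D=0.15)

k_star = k_opt(**FPU)                       # sqrt(75)
ratio_at_opt = power_perf_ratio(k_star, **FPU)
ratio_at_8 = power_perf_ratio(8, **FPU)
ratio_at_9 = power_perf_ratio(9, **FPU)
```

Because the curve is quite flat near its minimum, the integer depths on either side of k(opt) give ratios within a fraction of a percent of the optimum in this idealized model.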

For real superscalar machines, the number of latches in the overall design tends to go up much more sharply with k than the linear assumption in the above model. This tends to make k(opt) even smaller. Also, in real pipeline flow with hazards, e.g., in the presence of branch-related stalls and disruptions, performance actually peaks at a certain value of k before decreasing [3,16,17] (instead of the asymptotically increasing behavior shown in Figure 3.2b). This effect would also decrease the effective value of k(opt). (However, k(opt) increases if we use EDP or ED²P metrics instead of the PDP metric used here.) In a detailed simulation-based analysis of a POWER4-class superscalar machine, it has been shown [18,19] that the optimal pipeline depth using an ED²P metric like (BIPS)³/watt (where BIPS is the standard performance metric of billions of instructions completed per second) is around 18 FO4* per pipe stage for SPEC2000 workloads. For commercial workloads like TPC-C, the optimal point is shown to shift to shallower pipelines (25–28 FO4). In contrast, note that if one considered a power-unaware, performance-only metric like BIPS, the optimal pipeline depth for SPEC2000 is around 10 FO4 per stage. For TPC-C, the performance-only optimal point is reported to be fairly flat across the 10–14 FO4 points.
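The effect of the metric choice on k(opt) can be illustrated within the idealized hazard-free model. The sketch below assumes the usual forms of the metrics, with Lk + W treated as power and T/k + D as delay, so that PDP, EDP, and ED²P correspond to power × delay, power × delay², and power × delay³. The grid-search bounds and step size are arbitrary choices.

```python
def metric(k, m, W, T, L, D):
    """Power x delay**m for a k-stage pipeline (m=1 PDP, m=2 EDP, m=3 ED^2P)."""
    return (L * k + W) * (T / k + D) ** m


def k_opt_numeric(m, W, T, L, D, k_max=500.0, step=0.01):
    """Grid search for the real-valued k in (0, k_max] that minimizes the metric."""
    n = int(k_max / step)
    candidates = (i * step for i in range(1, n + 1))
    return min(candidates, key=lambda k: metric(k, m, W, T, L, D))


# FPU estimates quoted in the text (T and D in ns; W and L in watts).
FPU = dict(W=0.15, T=7.5, L=0.1, D=0.15)

optima = {
    name: k_opt_numeric(m, **FPU)
    for name, m in (("PDP", 1), ("EDP", 2), ("ED2P", 3))
}
```

In this model the optimum moves to progressively deeper pipelines as the delay term is weighted more heavily. The absolute depths obtained for EDP and ED²P are artifacts of the hazard-free, linear-latch assumptions and should not be read as design targets.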

3.3.1.2 Vector/SIMD Processing Support

Vector/SIMD modes of parallelism present in current architectures afford a power-efficient method of extending performance for vectorizable codes. Fundamentally, this is because for doing the work of fetching and processing a single (vector) instruction, a large amount of data is processed in a parallel or pipelined manner. If we consider an SIMD machine with p k-stage functional pipelines (see Figure 3.4), then looking at the pipelines alone, one sees a p-fold increase of performance, with a p-fold increase in power, assuming full utilization and hazard-free flow, as before. Thus, an SIMD pipeline unit offers the potential of scalable growth in performance, with commensurate growth in power; i.e., at constant power–performance efficiency. If, however, one includes the front-end instruction cache and fetch/dispatch unit that are shared across the p SIMD pipelines, then power–performance efficiency can actually grow with p. This is because the energy behavior of the instruction cache (memory) and the fetch/decode path remains essentially invariant with p, while net performance grows linearly with p.

In a superscalar machine with a vector/SIMD extension, the overall power-efficiency increase is limited by the fraction of code that runs in vector/SIMD mode (per Amdahl’s law).
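A minimal sketch of this argument follows. It assumes a fixed front-end power shared by all pipelines and a per-pipeline power that is independent of p; the numeric values and function names are hypothetical and chosen only to show the trend.

```python
def simd_efficiency(p, pipe_power, frontend_power, pipe_perf=1.0):
    """Performance per watt for p SIMD pipelines sharing one front end."""
    performance = p * pipe_perf                  # grows linearly with p
    power = frontend_power + p * pipe_power      # front-end term is invariant with p
    return performance / power


def amdahl_speedup(vector_fraction, p):
    """Amdahl's law: overall speedup when only vector_fraction of the work scales by p."""
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / p)


# Hypothetical values: each pipeline draws 1.0 W, the shared front end 2.0 W.
eff_1 = simd_efficiency(1, pipe_power=1.0, frontend_power=2.0)
eff_4 = simd_efficiency(4, pipe_power=1.0, frontend_power=2.0)

# With half of the code vectorizable, four pipelines give well under a 4x speedup.
speedup = amdahl_speedup(0.5, 4)
```

With no shared front end (frontend_power set to zero), the efficiency is the same for every p, which corresponds to the constant power–performance efficiency case described for the pipelines alone.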

*Fan-out-of-four (FO4) delay is defined as the delay of one inverter driving four copies of an equally sized inverter. The amount of logic and latch overhead per pipeline stage is often measured in terms of FO4 delay.

3.3.1.3 Clock-Gating: Power Reduction Potential and Microarchitectural Support

Clock-gating refers to circuit-level control [21,22] for disabling the clock to a given set of latches, a macro, a bus, a cache or register file access path, or an entire unit on a particular machine cycle.

Figure 3.5 depicts a typical clocking arrangement used in pipelined dataflow logic within a high-end microprocessor. The bank of latches is clocked via an AND gate that is enabled by a valid-bit signal from the previous pipeline stage. A stall-bit from the next pipeline stage is used to recirculate the current data during a pipeline stall. The latches are clocked only when there is valid data available from the previous stage or when the data needs to be held. In alternate designs, the stall-bit can also be used to gate the clock to further improve clock-gating efficiency.
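The valid-bit scheme can be illustrated with a small behavioral model that counts how often each latch bank is clocked. This is a simplified sketch under stated assumptions: the pipeline never stalls, the valid bits themselves advance every cycle and are not counted, and the function name and input format are hypothetical.

```python
def count_latch_clocks(valid_stream, n_stages):
    """Count latch-bank clock pulses in an n-stage, stall-free pipeline.

    valid_stream[c] is True when a valid operation enters the first stage
    on cycle c. Returns (gated, free_running) pulse counts.
    """
    valid = [False] * n_stages               # valid bit held at each stage output
    gated = 0
    cycles = len(valid_stream) + n_stages    # extra cycles to drain the pipeline
    for c in range(cycles):
        incoming = valid_stream[c] if c < len(valid_stream) else False
        # Valid bit presented to each stage by its predecessor this cycle.
        inputs = [incoming] + valid[:-1]
        # A stage's data latches are clocked only when its input is valid.
        gated += sum(inputs)
        valid = inputs
    free_running = cycles * n_stages
    return gated, free_running


# Two valid operations separated by two bubbles, through a 3-stage pipeline.
gated, free_running = count_latch_clocks([True, False, False, True], 3)
```

In this example each valid operation clocks each of the three stages exactly once, so the gated design issues far fewer latch-bank clock pulses than the free-running one.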

In current generation server-class microprocessors, about 70% of the active (switching) power is consumed by the clock distribution network and its latch load alone. As reported in Ref. [23], the major part of the clock power is dissipated close to the leaf nodes of the clock tree that drive latch banks. Since a clock-gated latch keeps its current data value stable, clock-gating prevents signal transitions of invalid data from propagating down the pipeline, thereby reducing switching power in the combinational logic between latches. In addition to reducing dynamic power, clock-gating can also reduce static (leakage) power. As already explained, leakage current in CMOS devices is exponentially dependent on temperature. The temperature reduction brought on by clock-gating can therefore significantly reduce the leakage power as well.

We define clock-gating efficiency (CGE) for a given input workload as follows:

$$ \text{CGE} = \left[1 - \frac{\text{average clock-gated power}}{\text{maximum unconstrained power}}\right] \times 100\% $$

so that higher numbers imply greater levels of power reduction.
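As a quick illustration, the CGE definition can be expressed as a one-line function. The power values in the example are hypothetical and serve only to show the calculation.

```python
def clock_gating_efficiency(avg_gated_power, max_unconstrained_power):
    """CGE in percent: [1 - (gated power / unconstrained power)] * 100."""
    if max_unconstrained_power <= 0:
        raise ValueError("maximum unconstrained power must be positive")
    return (1.0 - avg_gated_power / max_unconstrained_power) * 100.0


# Hypothetical example: 30 W average with clock-gating versus 50 W unconstrained.
cge = clock_gating_efficiency(30.0, 50.0)
```

A workload that leaves a unit fully busy gives a CGE near zero, while a workload that leaves it mostly idle pushes the CGE toward 100%.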

Figure 4 in Ref. [24] shows the computed CGE across various workloads (SPEC2000 suite) using the Turandot/PowerTimer [23] power–performance simulator. Such microarchitecture-level analysis points to opportunities of power savings in a processor, since idle periods of a particular resource (e.g., a pipeline stage) can be identified and quantified.

Figure 3.6a and b show the opportunities available within several units (and in particular, the instruction fetch unit, IFU) of the same example processor in the context of the TPC-C trace segment referred to above. Figure 3.6a depicts the instruction frequency mix of the trace segment used.

This shows that the FPU operations are a very tiny fraction of the total number of instructions in the trace. Therefore, with proper detection and control mechanisms architected in hardware, the FPU could essentially be “gated off” in terms of clock delivery for most of such an execution.

Figure 3.6b shows the fraction of total cycles spent in various modes within the IFU. I-fetch was on hold for about 48% of the cycles, and the fraction of useful fetch cycles was only 28%. Again, this points to great opportunities either in terms of clock-gating or dynamic I-fetch throttling (see Section 3.3.1.8).

FIGURE 3.5 Clock gating mechanism. (Signals shown: valid signal from previous stage, stall signal from next stage, clock, and data-in.)

Microarchitectural support for conventional clock-gating can be provided in at least three ways: (1) dynamic detection of idle modes in various clocked units or regions within a processor or system, (2) static or dynamic prediction of such idle modes, and (3) using “data valid” bits within a pipeline flow path to selectively enable/disable the clock applied to the pipeline stage latches. If static prediction is used, the compiler inserts special nap/doze/sleep/wake-type instructions where appropriate, to aid the hardware in generating the necessary gating signals. Methods 1 and 2 result in coarse-grain clock-gating, where entire units, macros, or regions can be gated off to save power, whereas method 3 results in fine-grain clock-gating, where unutilized pipe segments can be gated off during normal execution within a particular unit, like the FPU. The detailed circuit-level implementation of gated clocks, the potential performance degradation, inductive noise problems, etc., are not discussed in this chapter. However, these are very important issues that must be dealt with adequately in an actual design.

Referring back to Figures 3.2 and 3.3, note that since (fine-grain) clock-gating effectively causes a fraction of the latches to be gated off, we may model this by assuming that the effective value of L decreases when such clock-gating is applied. This has the effect of increasing k(opt); i.e., the operating frequency for the most power-efficient pipeline operation can be increased in the presence of clock-gating. This is an added benefit.
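This shift in k(opt) can be sketched by scaling L in Equation 3.13. The sketch assumes, purely for illustration, that the effective latch energy is proportional to the fraction of latches that remain clocked; the parameter name gated_fraction is hypothetical.

```python
import math


def k_opt_with_gating(W, T, L, D, gated_fraction):
    """k(opt) from Equation 3.13 with L scaled by the un-gated latch fraction."""
    if not 0.0 <= gated_fraction < 1.0:
        raise ValueError("gated_fraction must be in [0, 1)")
    effective_L = L * (1.0 - gated_fraction)   # assumed linear scaling of latch energy
    return math.sqrt((W * T) / (effective_L * D))


# FPU estimates quoted earlier (T and D in ns; W and L in watts).
FPU = dict(W=0.15, T=7.5, L=0.1, D=0.15)

k_free_running = k_opt_with_gating(gated_fraction=0.0, **FPU)   # sqrt(75)
k_half_gated = k_opt_with_gating(gated_fraction=0.5, **FPU)     # sqrt(150)
```

Under this assumption, gating off half of the latch energy on average raises k(opt) by a factor of the square root of two.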

FIGURE 3.6 (a) Instruction frequency mix for a typical commercial workload trace segment: BRU 10.1%, CRU 2.3%, FXU 48.5%, LSU 38.9%, FPU 0.2%. (b) Stall profile in the instruction fetch unit (IFU) for the commercial workload, plotted as percent of net cycles (axis range 0 to 60); bar labels: Hold (idle), ICMiss, Imiss, ICWrite, Redirect, BIQFull, Prefb, Brn flush, >2 Br, Useful fetches.

In recently reported work [24], the limits of CGE have been examined and then stretched by adding two new advances: transparent pipeline clock-gating (TCG) [25] and elastic pipeline clock-gating (ECG) [26].

TCG introduces a new way of clock-gating pipelines. In traditional clock-gating, latches are held opaque to avoid data races between adjacent latch stages; thus, N clock pulses are needed to propagate a single data item through an N-stage pipeline, even if at a given clock cycle all other (i.e., N − 1) stages have invalid input data. In a transparent clock-gated pipeline, latches are held transparent by default. TCG is based on the concept of data separation. Assume that a pair of data items A and B simultaneously moves through a TCG pipeline. A data race between A and B is avoided by clocking or gating a latch stage opaque, such that the opaque latch stage acts as a barrier separating the two data items from each other. The number of clock pulses required for data item A to move through an N-stage pipeline is no longer dependent only on N, but also on the number of clock cycles that separate A from the closest upstream data item B. For an N-stage pipeline, where B follows n clock cycles behind A, only floor(N/n) clock pulses have to be generated to move A safely through the pipeline.

ECG is a different technique that achieves further efficiency by exploiting the inherent storage redundancy afforded by a traditional master–slave latch pair. ECG lets stall signals propagate backward in pipeline flow logic in a stage-by-stage fashion, without incurring the leakage power and area overhead of explicitly inserted stall buffers. Logic-level details of TCG and ECG are available in the originally published papers [24–26]. As reported there, TCG enables a clock power reduction of about 50% over traditional stage-level clock-gating under commercial (TPC-C) class workloads. Even under heavy floating-point workloads, where fewer bubbles are available in the pipeline, the clock power in the floating-point pipeline can be reduced by 34%. Significant reductions in dynamic stall power (27%) and leakage power (44%) afforded by ECG in an FPU design have also been reported in the published literature.
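The clock-pulse counts quoted for traditional and transparent clock-gating can be compared with a short sketch. It simply encodes the N and floor(N/n) expressions from the text; the function names are illustrative, and the sketch does not model the latch-level behavior described in the cited papers.

```python
def clock_pulses_traditional(n_stages):
    """Pulses needed to move one data item through an opaque-latch pipeline."""
    if n_stages < 1:
        raise ValueError("n_stages must be at least 1")
    return n_stages


def clock_pulses_tcg(n_stages, separation):
    """Pulses needed under TCG when the next item follows `separation` cycles behind.

    Encodes the floor(N/n) expression quoted in the text.
    """
    if n_stages < 1 or separation < 1:
        raise ValueError("n_stages and separation must be at least 1")
    return n_stages // separation


# An 8-stage pipeline with the upstream item 3 cycles behind.
traditional = clock_pulses_traditional(8)
transparent = clock_pulses_tcg(8, 3)
```

When items follow one another on every cycle (a separation of one), the two counts coincide, which matches the observation that the savings come from bubbles in the pipeline.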

3.3.1.4 Predictive Power-Gating

As previously indicated, leakage power is a major (if not dominant) component of total power dissipation in current and future CMOS microprocessors. Cutting off the power supply (Vdd) to major circuit blocks to conserve idle power (sleep mode) is not a new concept, especially for battery-powered mobile systems. However, dynamically effecting such gating, on a unit-by-unit basis, as a function of input workload demand is not a design technique that has seen widespread usage yet, especially in server-class microprocessors. The main reasons have been the perceived risks or negative effects arising from (1) performance and area overheads, (2) inductive noise on the power supply grid, and (3) potential design tools and verification concerns. Advances in circuit design have minimized the area and cycle-time delay overhead concerns in recent industrial practices. Microarchitectural predictive techniques [27–29] have recently matured to the point where they equip designers with the tools needed to minimize any architectural performance overheads as well. The inductive noise concerns do persist, but there are known solution approaches in the realm of power distribution networks and package design that are expected to mature and help mitigate those concerns. The design tools and verification challenge will, of course, remain a difficult roadblock, but here too solutions are expected to emerge eventually.

3.3.1.5 Variable Bit-Width Operands

One of the techniques proposed for reducing dynamic power consists of exploiting the behavior of data in programs, which is characterized by the frequent presence of small values. Such values can be represented as and operated upon as short bit-vectors. Thus, by using only a small part of the processing datapath, power can be reduced without loss of performance. Brooks and Martonosi [30] analyzed the potential of this approach in the context of 64-bit processor implementations (e.g., the Compaq Alpha
