
Floating-Point Formats and Environment

3.4 The New IEEE 754-2008 Standard

3.4.7 Operations specified by the standard

Preferred exponent for arithmetic operations in the decimal format

Let Q(x) be the quantum exponent of a floating-point number x. Since some numbers in the decimal format have several possible representations (the set of their representations is a cohort), the standard specifies for each operation which exponent is preferred for representing the result of a calculation. The rule to be followed is:

• if the result of an operation is inexact, the cohort member of smallest exponent is used;

• if the result of an operation is exact, then if the result's cohort includes a member with the preferred exponent (see below), that very member is returned; otherwise, the member with the exponent closest to the preferred exponent is returned.

¹⁴ This implies that the language standards must specify what that literal meaning is: order of operations, destination formats of operations, etc.

The preferred quantum exponents for the most common operations are:

• x + y and x − y: min(Q(x), Q(y));

• x × y: Q(x) + Q(y);

• x / y: Q(x) − Q(y);

• FMA(x, y, z) (i.e., xy + z using an FMA): min(Q(x) + Q(y), Q(z));

• √x: ⌊Q(x)/2⌋.
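The rules above can be observed with Python's decimal module. This is only a sketch under an assumption: that module implements the General Decimal Arithmetic specification rather than an IEEE 754-2008 decimal interchange format, but its choice of exponent agrees with the preferred exponents listed above in the cases shown. The helper q is our own name for Q.

```python
from decimal import Decimal, getcontext

def q(d: Decimal) -> int:
    """Quantum exponent Q(d) of a finite Decimal (helper name is ours)."""
    return d.as_tuple().exponent

x, y, z = Decimal("1.20"), Decimal("3.4"), Decimal("5")   # Q = -2, -1, 0

print(x + y, q(x + y))                  # exact: exponent min(Q(x), Q(y))
print(x * y, q(x * y))                  # exact: exponent Q(x) + Q(y)
print(x.fma(y, z), q(x.fma(y, z)))      # exact: exponent min(Q(x) + Q(y), Q(z))

six, two = Decimal("6.00"), Decimal("2")
print(six / two, q(six / two))          # exact: exponent Q(x) - Q(y)

four = Decimal("4.00")
print(four.sqrt(), q(four.sqrt()))      # exact: exponent floor(Q(x) / 2)

# Exact result whose cohort has no member with the preferred exponent (0 here):
# the member with the closest exponent is returned.
quarter = Decimal(1) / Decimal(4)
print(quarter, q(quarter))

# Inexact result: the cohort member of smallest exponent is used, which here
# means that all the digits of the working precision are filled.
third = Decimal(1) / Decimal(3)
print(third, q(third), getcontext().prec)
```

Note that 1.20 and 1.2 are two members of the same cohort: they compare equal, yet they lead to different result exponents.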

scaleB and logB

For designing fast software for the elementary functions, or for efficiently scaling variables (for instance, to write robust code for computing functions such as √(x² + y²)), it is sometimes very useful to have functions x·βⁿ and ⌊log_β |x|⌋, where β is the radix of the floating-point system, n is an integer, and x is a floating-point number. This is the purpose of the functions scaleB and logB:

• scaleB(x, n) is equal to x·βⁿ, correctly rounded¹⁵ (following the rounding direction attribute);

• when x is finite and nonzero, logB(x) equals ⌊log_β |x|⌋. When the output format of logB is a floating-point format, logB(NaN) is NaN, logB(±∞) is +∞, and logB(±0) is −∞.
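As an illustration of the scaling use case, here is a minimal sketch for radix 2 in Python. The names logb, scaleb, and scaled_hypot are ours, not standard library functions: they are built on math.frexp and math.ldexp, and we assume finite binary64 inputs and that the platform's ldexp behaves as expected (no claim is made here about its rounding when the result is subnormal).

```python
import math

def logb(x: float) -> float:
    """Analogue of logB for binary64: floor(log2(|x|)), with the special cases above."""
    if math.isnan(x):
        return math.nan
    if math.isinf(x):
        return math.inf
    if x == 0.0:
        return -math.inf
    _, e = math.frexp(x)        # x = m * 2**e with 0.5 <= |m| < 1
    return float(e - 1)

def scaleb(x: float, n: int) -> float:
    """Analogue of scaleB for binary64: x * 2**n."""
    return math.ldexp(x, n)

def scaled_hypot(x: float, y: float) -> float:
    """sqrt(x*x + y*y) with a power-of-two scaling that avoids spurious overflow.

    Assumes x and y are finite.
    """
    if x == 0.0 and y == 0.0:
        return 0.0
    n = int(max(logb(x), logb(y)))          # exponent of the larger operand
    xs, ys = scaleb(x, -n), scaleb(y, -n)   # larger operand now lies in [1, 2)
    return scaleb(math.sqrt(xs * xs + ys * ys), n)

x, y = 3e200, 4e200
print(math.sqrt(x * x + y * y))   # the squares overflow, so the naive formula gives inf
print(scaled_hypot(x, y))         # close to 5e200
```

Scaling by a power of the radix is what makes this approach attractive: as long as no subnormal number is produced, the scaled operands are exact, so the only rounding errors are those of the central computation.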

Operations with NaNs

We have seen in Sections 3.4.2 and 3.4.3 that in the binary interchange formats, the p − 2 least significant bits of a NaN are not defined, and that in the decimal interchange formats, the trailing significand bits of a NaN are not defined. These bits can be used for encoding the payload of the NaN, i.e., some information that can be transmitted through the arithmetic operation for diagnosis purposes. To preserve that diagnosis information, it is required that for an operation with quiet NaN inputs, other than minimum or maximum operations, the returned result should be one of these input NaNs. Also, the sign of a NaN is not interpreted.
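Payload propagation is easy to observe with Python's decimal module, which we use here as a stand-in (an assumption: it follows the General Decimal Arithmetic specification, not an IEEE 754-2008 interchange encoding). A quiet NaN written as NaN123 carries the payload 123.

```python
from decimal import Decimal

nan_a = Decimal("NaN123")   # quiet NaN with payload 123
nan_b = Decimal("NaN456")   # quiet NaN with payload 456

r1 = nan_a + Decimal(1)     # one NaN input: that NaN is returned
r2 = nan_a * nan_b          # two NaN inputs: one of them is returned

print(r1, r1.is_qnan())
print(r2, r2.is_qnan())
```

With binary formats the same experiment depends on the hardware, since the payload is propagated (or not) by the floating-point unit itself.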

¹⁵ In most cases, x·βⁿ is exactly representable so that there is no rounding at all, but requiring correct rounding is the simplest way of defining what should be returned if the result is outside the normal range.

Miscellaneous

The standard defines many very useful operations; see [187]. Examples are nextUp(x) (the smallest floating-point number in the format of x that is greater than x), maxNum(x, y) (the maximum of x and y), and class(x) (which tells whether x is a signaling NaN, a quiet NaN, −∞, a negative normal number, a negative subnormal number, −0, +0, a positive subnormal number, a positive normal number, or +∞).
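Two of these operations can be sketched for binary64 in Python. The function names next_up and fp_class are ours. We assume Python 3.9 or later for math.nextafter, and we assume the usual binary64 encoding in which the most significant bit of the trailing significand distinguishes quiet NaNs from signaling ones.

```python
import math
import struct
import sys

def next_up(x: float) -> float:
    """Analogue of nextUp: the smallest binary64 number greater than x."""
    return math.nextafter(x, math.inf)

def fp_class(x: float) -> str:
    """Analogue of class(x), decoded from the binary64 bit pattern of x."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits >> 63
    biased_exp = (bits >> 52) & 0x7FF
    trailing = bits & ((1 << 52) - 1)
    if biased_exp == 0x7FF:
        if trailing == 0:
            return "negativeInfinity" if sign else "positiveInfinity"
        return "quietNaN" if trailing >> 51 else "signalingNaN"
    prefix = "negative" if sign else "positive"
    if biased_exp == 0:
        return prefix + ("Zero" if trailing == 0 else "Subnormal")
    return prefix + "Normal"

print(next_up(1.0) - 1.0 == sys.float_info.epsilon)
print(next_up(0.0))
for v in (-math.inf, -1.0, -5e-324, -0.0, 0.0, 5e-324, 1.0, math.inf, math.nan):
    print(repr(v), fp_class(v))
```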

3.4.8 Comparisons

Floating-point data represented in different formats specified by the standard must be comparable if these formats have the same radix: the standard does not require that comparing a decimal and a binary number should be possible without a preliminary conversion.¹⁶ Exactly as in IEEE 754-1985, four relations are possible: less than, equal, greater than, and unordered, and a comparison is delivered either as one of these four relations, or as a Boolean response to some predicate that gives the desired comparison.
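The two ways of delivering a comparison can be related by a short sketch: the function relation below (a name of our own) recovers one of the four relations from the Boolean predicates <, >, and == that Python provides for its binary64 floats.

```python
import math

def relation(x: float, y: float) -> str:
    """Return which of the four mutually exclusive relations holds between x and y."""
    if x < y:
        return "less than"
    if x > y:
        return "greater than"
    if x == y:
        return "equal"
    return "unordered"      # reached only when at least one operand is a NaN

print(relation(1.0, 2.0))
print(relation(0.0, -0.0))            # the two zeros compare equal
print(relation(math.nan, math.nan))   # a NaN is unordered even with itself
```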

3.4.9 Conversions

Concerning input and output conversions (that is, conversions between an external decimal or hexadecimal character sequence and an internal binary or decimal format), the new standard has requirements that are much stronger than those of IEEE 754-1985. They are described as follows.

1. Conversions between an external decimal character sequence and a supported decimal format: Input and output conversions are correctly rounded (according to the applicable rounding direction).

2. Conversions between an external hexadecimal character sequence and a supported binary format: Input and output conversions are also correctly rounded (according to the applicable rounding direction), but such conversions are optional. They have been specified to allow any binary number to be represented exactly by a finite character sequence.

3. Conversions between an external decimal character sequence and a supported binary format: first, for each supported binary format, define a value p₁₀ as the minimum number of decimal digits in the decimal external character sequence that allows for an error-free write-read cycle, as explained in Section 2.7. Table 3.25, which gives the value of p₁₀ for the various basic binary formats of the standard, is directly derived from Table 2.3 (page 44).

format    binary32    binary64    binary128
p₁₀       9           17          36

Table 3.25: Minimum number of decimal digits in the decimal external character sequence that allows for an error-free write-read cycle, for the various basic binary formats of the standard. See Section 2.7, page 40, for further explanation.

¹⁶ Such comparisons appear extremely rarely in programs designed by sensible beings, and would be very tricky to implement without preliminary conversion. Also, if we really need such a comparison, we do not lose much information by performing a preliminary conversion.

Assume that the binary and decimal numbers to be compared are x₂ (in format F₂) and y₁₀ (in format F₁₀). Define y₂ as y₁₀ correctly rounded to format F₂. Then x₂ ≥ y₁₀ implies x₂ ≥ y₂, and x₂ ≤ y₁₀ implies x₂ ≤ y₂.

Then, define a value H so that H is preferably unbounded, and in any case, H is larger than or equal to 3 plus the largest value of p₁₀ for all supported binary formats.

The conversions must be correctly rounded to and from external character sequences with any number of significant digits between 1 and H (which implies that these conversions must always be correctly rounded if H is unbounded).

For output conversions, if the external decimal format has more than H significant digits, then the binary value is correctly rounded to H decimal digits and trailing zeros are appended to fill the output format.

For input conversions, if the external decimal format has more than H significant digits, then the internal binary number is obtained by first correctly rounding the value to H significant digits (according to the applicable rounding direction), then by correctly rounding the resulting decimal value to the target binary format (with the applicable rounding direction). In the directed rounding directions, these rules allow intervals to be respected.
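The hexadecimal and decimal conversions of items 2 and 3 can be tried out with a short Python sketch for binary64. Two assumptions are made. First, the closed form used in p10 below, 1 + ⌈p log₁₀ 2⌉ for a binary format of precision p, is the usual one and reproduces the values of p₁₀ given above, but it is not stated in this section. Second, we rely on the interpreter's decimal conversions being correctly rounded to nearest, which is the case for mainstream CPython builds.

```python
import math

def p10(p: int) -> int:
    """Decimal digits sufficient for an error-free write-read cycle at binary precision p."""
    return 1 + math.ceil(p * math.log10(2))

for name, p in (("binary32", 24), ("binary64", 53), ("binary128", 113)):
    print(name, p10(p))

x = 0.1 + 0.2                 # a binary64 number that is not the double nearest 0.3

s17 = format(x, ".17g")       # p10 = 17 significant digits
s16 = format(x, ".16g")       # one digit too few
print(s17, float(s17) == x)   # the write-read cycle recovers x
print(s16, float(s16) == x)   # the write-read cycle returns a neighbor of x

h = x.hex()                   # external hexadecimal character sequence
print(h, float.fromhex(h) == x)
```

The hexadecimal sequence is exact by construction, whereas the decimal one recovers x only because enough digits were written.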

More details are given in the standard [187].

From the document Handbook of Floating-Point Arithmetic (pages 116-119)