
2.2 Mathematical and Computational Concepts

2.2.2 Binary Numbers and Round-off

Appreciation of the finite arithmetic in scientific computing is very important, and sloppy handling of arithmetic precision often leads to erroneous results or even disasters in large scientific computations. It has been reported, for example, that the Patriot missile failure in Dhahran, Saudi Arabia, on February 25, 1991, which resulted in 28 deaths, is ultimately attributable to poor handling of rounding errors. Similarly, the explosion of an Ariane 5 rocket just after lift-off on its maiden voyage off French Guiana, on June 4, 1996, was ultimately the consequence of a simple overflow (the conversion from a 64-bit floating point value to a 16-bit signed integer value).

While we are familiar and more comfortable with the base-10 arithmetic system, a computer is restricted to a binary numbering system. The number 126, for example, has the representation

126 = 1×10^2 + 2×10^1 + 6×10^0

in the base-10 system, or equivalently

01111110₂ = 0×2^7 + 1×2^6 + 1×2^5 + 1×2^4 + 1×2^3 + 1×2^2 + 1×2^1 + 0×2^0

in the base-2 system. This is the fixed-point representation of the number.
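As a quick aside (an illustration added here, not from the original text), the standard C++ library can display this binary representation directly via std::bitset:

#include <bitset>
#include <iostream>
using namespace std;

int main(){
  // Display the 8-bit binary representation of 126
  bitset<8> bits(126);
  cout << "126 in base 2: " << bits << endl;   // prints 01111110
  return 0;
}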

In computing we call each place in a binary number a digit or a bit, and we call a group of 8 bits a byte. Similarly, we call 1,024 bytes a kilobyte (1 KB) and 1,048,576 bytes a megabyte (1 MB), and so on. An equivalent way to write the number 126 in scientific notation is:


+.126 × 10^3

where the three components are the sign (+), the fraction (.126), and the exponent (3).

Therefore, in the computer we need to store the sign, the fraction and the exponent separately. To this end, there is a standard notation adopted by IEEE (Institute of Electrical and Electronics Engineers) for binary arithmetic, which is used in most computers (the old Cray computers did not follow this convention). There are two types of floating point numbers, depending on the number of binary digits (bits) we store: specifically, in single precision (float type in C++) we have 8 bits for the exponent and 23 bits for the fraction, whereas in double precision (double type in C++) we have 11 bits for the exponent and 52 bits for the fraction. In both cases we need to also reserve one bit for the sign. What this means simply is that there is a lower and an upper bound on the size of numbers we can deal with in the computer. In single precision this range extends from 2^-126 to 2^128, and in double precision from 2^-1022 to 2^1024, so clearly the latter allows greater flexibility in dealing with very small or very large numbers. The lower limit in this range determines an underflow while the upper limit determines an overflow. What value a variable takes on when it overflows/underflows depends both on the variable type and on the computing architecture on which you are running. Even this large range, however, may not be sufficient in applications, and one may need to extend it by using the so-called double extended precision (long double in C++), which can store up to a total of 128 bits. In practice, it is more efficient to use adaptive arithmetic only when it is needed, for example in refining the mesh down to very small length scales to resolve small vortices in a flow simulation.
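These bit counts and range limits can be verified on any given machine with std::numeric_limits; the following is a minimal sketch (an addition, not from the original text). Note that the digits member counts the implicit leading bit of the fraction, so it reports 24 and 53 rather than 23 and 52.

#include <iostream>
#include <limits>
using namespace std;

int main(){
  // Fraction width in bits (including the implicit leading 1),
  // and the smallest/largest normalized positive values
  cout << "float  digits: " << numeric_limits<float>::digits
       << "  min: " << numeric_limits<float>::min()     // ~1.17549e-38, i.e. 2^-126
       << "  max: " << numeric_limits<float>::max()     // ~3.40282e+38, i.e. just below 2^128
       << endl;
  cout << "double digits: " << numeric_limits<double>::digits
       << "  min: " << numeric_limits<double>::min()    // ~2.22507e-308, i.e. 2^-1022
       << "  max: " << numeric_limits<double>::max()    // ~1.79769e+308, i.e. just below 2^1024
       << endl;
  return 0;
}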

The finite arithmetic in computing implies that the effective zero in the computer is about 6×10^-8 for single precision and 10^-16 for double precision. We can determine the value of this machine epsilon by finding the value of 1/2^p such that, to the computer,

1.0 + 1/2^p = 1.0.

Software Suite

This is accomplished by increasing the value of p incrementally, and monitoring the point at which the computer cannot distinguish between the value 1 and the value 1 + 1/2^p. This procedure is implemented for both single precision (float) and double precision (double) variables in the following two functions:

float FloatMachineEps(){
  float fmachine_e, ftest;

  // Start with epsilon = 1.0 and halve it until 1.0 + epsilon
  // is indistinguishable from 1.0 in single precision
  fmachine_e = 1.0;
  ftest = 1.0 + fmachine_e;

  while(1.0 != ftest){
    fmachine_e = fmachine_e/2.0;
    ftest = 1.0 + fmachine_e;
  }

  return fmachine_e;
}

double DoubleMachineEps(){
  double dmachine_e, dtest;

  // The same halving procedure, now in double precision
  dmachine_e = 1.0;
  dtest = 1.0 + dmachine_e;

  while(1.0 != dtest){
    dmachine_e = dmachine_e/2.0;
    dtest = 1.0 + dmachine_e;
  }

  return dmachine_e;
}

Now, a natural question is “How do I use these functions?” For starters, we would write the following program which uses both functions:

#include <iostream>
using namespace std;

float FloatMachineEps();
double DoubleMachineEps();

int main(int argc, char * argv[]){
  float fep;
  double dep;

  fep = FloatMachineEps();
  dep = DoubleMachineEps();

  cout << "Machine epsilon for single precision is: " << fep << endl;
  cout << "Machine epsilon for double precision is: " << dep << endl;

  return 0;
}

The machine zero values obtained by running the program above on a Pentium-4 processor are given in Table 2.8.

Key Concept

Notice the structure of this code:

1. Function declarations
2. "main" function
3. Function definitions

Variable Type    Machine Zero
float            5.96046e-08
double           1.11022e-16

Table 2.8: Machine zero for float and double precision on a Pentium-4 processor.
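As a cross-check (an addition to the text, not part of the book's listings), the C++ standard library reports machine epsilon directly through std::numeric_limits. Its epsilon() is defined as the gap between 1.0 and the next representable number, which is exactly twice each value in Table 2.8, since the loops above return the first perturbation that is lost entirely when added to 1.0:

#include <iostream>
#include <limits>
using namespace std;

int main(){
  // Gap between 1.0 and the next representable number
  cout << "float  epsilon: " << numeric_limits<float>::epsilon()    // 1.19209e-07 = 2^-23
       << endl;
  cout << "double epsilon: " << numeric_limits<double>::epsilon()   // 2.22045e-16 = 2^-52
       << endl;
  return 0;
}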

This code example demonstrates two important concepts. First, it demonstrates that in computing, it is important to understand how arithmetic works on your machine. Secondly, this example demonstrates that with very little programming, a user can investigate the machine upon which she is running. We must be mindful that no computer can accomplish infinite precision arithmetic; it is limited to finite precision. Finite precision arithmetic is explained as follows: when the exact value of a basic operation, e.g. addition of two numbers, is not represented with a sufficient number of digits, it is approximated with the closest floating point number. The approximation error incurred is referred to as the round-off error.

It is for this reason that such a fundamental property of addition as the associative property is not always satisfied in the computer. For example, with ε a number close to machine epsilon,

1.0 + (1.0 + ε) ≠ (1.0 + 1.0) + ε

as on the left-hand side a very small number is added to a large number, and that change may not be represented exactly (due to round-off) in the computer.
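The following short program (an illustrative sketch, not from the book; the exact output assumes IEEE round-to-nearest double arithmetic) makes the failure of associativity concrete, using a perturbation just above double machine epsilon:

#include <iomanip>
#include <iostream>
using namespace std;

int main(){
  double eps = 2.3e-16;   // slightly larger than double machine epsilon

  double left  = 1.0 + (1.0 + eps);   // inner sum rounds up to 1 + 2^-52;
                                      // the outer sum then rounds back down to 2.0
  double right = (1.0 + 1.0) + eps;   // 2.0 + eps rounds up to the next double

  cout << setprecision(17);
  cout << "left  = " << left  << endl;   // 2
  cout << "right = " << right << endl;   // 2.0000000000000004
  cout << "equal: " << boolalpha << (left == right) << endl;   // false
  return 0;
}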
