The Table Maker's Dilemma.

Texte intégral

(1)The Table Maker’s Dilemma. Vincent Lefevre, Jean-Michel Muller, Arnaud Tisserand. To cite this version: Vincent Lefevre, Jean-Michel Muller, Arnaud Tisserand. The Table Maker’s Dilemma.. [Research Report] LIP RR-1998-12, Laboratoire de l’informatique du parallélisme. 1998, 2+16p. �hal-02101765�. HAL Id: hal-02101765 https://hal-lara.archives-ouvertes.fr/hal-02101765 Submitted on 17 Apr 2019. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.. L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés..

(2) Laboratoire de l’Informatique du Parallélisme. cole Normale Suprieure de Lyon Unit de recherche associe au CNRS no 1398. SPI. The Table Maker's Dilemma Vincent Lef vre Jean-Michel Muller Arnaud Tisserand. Feb. 1998. Research Report No 98-12. École Normale Supérieure de Lyon 46 Allée d’Italie, 69364 Lyon Cedex 07, France Téléphone : +33(0)4.72.72.80.00 Télécopieur : +33(0)4.72.72.80.80 Adresse électronique : [email protected].

(3) The Table Maker's Dilemma Vincent Lef vre Jean-Michel Muller Arnaud Tisserand Feb. 1998. Abstract The Table Maker's Dilemma is the problem of always getting correctly rounded results when computing the elementary functions. After a brief presentation of this problem, we present new developments that have helped us to solve this problem for the double-precision exponential function in a small domain. These new results show that this problem can be solved, at least for the doubleprecision format, for the most usual functions.. Keywords: Table Maker's Dilemma, Elementary Functions, Correct Rounding, Floating-Point Arithmetic.. Rsum Le dilemme du concepteur de tables est le problme de toujours fournir des r sultats arrondis correctement lors du calcul de fonctions l mentaires. Aprs une brve pr sentation du problme, nous pr sentons de nouveaux r sultats qui permettent de r soudre ce problme pour l'exponentielle en double pr cision dans un petit domaine. Ces r sultats montrent que le problme peut tre r solu, au moins pour le format double pr cision, pour la plupart des fonctions usuelles.. Mots-cls: Dilemme du constructeur de tables, fonctions l mentaires, arrondi correct, arithm tique virgule ottante..

(4) 1. Introduction. The IEEE-754 standard for oating-point arithmetic 2, 11] requires that the results of the arithmetic operations should always be correctly rounded. That is, once a rounding mode is chosen among the four possible ones, the system must behave as if the result were rst computed exactly, with innite precision, then rounded. There is no similar requirement for the elementary functions. The possible rounding modes are: round towards ;1: r(x) is the largest machine number less than or equal to x round towards +1: (x) is the smallest machine number greater than or equal to x round towards 0: Z (x) is equal to r(x) if x 0, and to (x) if x < 0 round to the nearest: N (x) is the machine number which is the closest to x (with a special convention if x is exactly between two consecutive machine numbers). Throughout this paper, machine number means a number that is exactly representable in the oating-point format being considered. The rst 3 modes are called directed modes. A very interesting discussion on the current status of the IEEE-754 standard is given by Kahan 13]. Our ultimate goal is to provide correctly rounded elementary functions at a reasonable cost. Requiring correctly rounded results not only improves the accuracy of computations: it is the best way to make numerical software portable. Moreover, as noticed by Agarwal et al. 1], correct rounding facilitates the preservation of useful mathematical properties such as monotonicity, symmetry1, and some identities. Throughout the paper, we want to implement a function f (f being sine, cosine, exponential, natural logarithm or arctangent) in a radix-2 oating-point number system, with n mantissa bits. We assume that from any real number x and any integer m (with m > n), we are able to compute an approximation of f(x) with an error on its mantissa y less than or equal to 2;m+1 . This can be achieved with the presently known methods, using polynomial or rational approximations or Cordic-like algorithms, provided that a very careful range reduction is performed 19, 12, 22, 7, 17]. The intermediate computations can be carried out using a larger xed-point or oating-point format. Therefore the problem is to get an n-bit mantissa oating-point correctly rounded result from the mantissa y of an approximation of f(x), with error 2;m+1 . This is not possible if y has the form: in rounding to the nearest mode, m }|bits z { 1:xxxxx : ::xxx 1000000 : : :000000xxx ::: | {z } n bits 1 Only when the rounding mode itself is symmetric, i.e., for rounding to the nearest and rounding toward zero.. 1.

(5) or. m bits. z }| { 1:xxxxx {z: : :xxx } 0111111 : : :111111xxx : : : |. n bits in rounding towards 0, +1 or ;1 modes, m }|bits z { 1:xxxxx | {z: ::xxx} 0000000 : : :000000xxx : : : n bits or m }|bits z { 1:xxxxx {z: : :xxx } 1111111 : : :111111xxx : : :: | n bits This problem is known as the Table Maker's Dilemma (TMD) 11]. For example, assuming a oating-point arithmetic with 6-bit mantissa, sin(11:1010) = 0:0 111011 01111110 : : : a problem may occur with rounding to the nearest if the sine function is not computed accurately enough. Although we deal with radix-2 oating-point systems, the same problem obviously occurs with other radices2 . For instance, in a radix 10 oating-point number system, with 4 digit mantissas, sin(73:47) = ;0: 9368 0000058 : : : therefore a problem may occur with a directed rounding mode. Similar problems with radix 10, directed rounding modes, and the cosine function are given in Table 1. n 4 5 6 7 8. x cos(x) 1:149 0:409400000428 :: : 1:6406 ;0:69747000000219 :: : 1:41162 0:15850500000180 :: : 1:225910 0:33808970000005907 :: : 1:9509449 ;0:37105844000000012 :: :. Table 1: Worst cases for n-digit mantissa, radix-10, oating-point numbers between 1 and 2, directed rounding modes and the cosine function. Our problem is to know if there is a maximum value for m, and to estimate it. If this value is not too large, then computing correctly rounded results will become possible. It is worth noticing that even if the maximum value of m is quite large, it is not necessary to always evaluate f(x) with large accuracy. The key point is 2. Other radices are actually used. For example, most pocket computers use radix 10.. 2.

(6) that if the intermediate approximation y is not accurate enough, we are in one of the four cases listed above, so we can be aware of the problem. This leads to a solution rst suggested by Ziv 23]: we rst start to approximate f(x) with a value of m slightly larger than n (say, n + 10). In most cases this will suce to get a correctly rounded result. If this does not suce, we continue with a larger value of m, and so on. We increase m until we have enough accuracy to be able to correctly round f(x). In a practical implementation of this strategy, the rst step may be hardwired, and the other ones are written in software. It is very important that the rst steps (say, the rst two steps) be fast. The other ones are unlikely to occur, so they may be slower without signicantly changing the average computational delay. In 1882, Lindemann showed that the exponential of a nonzero (possibly complex) algebraic number is not algebraic 3]. From this we easily deduce that the sine, cosine, exponential, or arctangent of a machine number dierent from zero cannot be a machine number (and cannot be exactly between two consecutive machine numbers), and the logarithm of a machine number dierent from 1 cannot be a machine number (and cannot be exactly between two consecutive machine numbers). Therefore, for any non-trivial machine number x (the cases x = 0 and x = 1 are obviously handled), there exists m such that the TMD cannot occur. This is not always true for functions such as 2x , log2 x, or xy . This is why we do not consider them in this paper3 . Since there is a nite number of machine numbers x, there exists a value of m such that for any x the TMD cannot occur. Schulte and Swartzlander 20, 21] proposed algorithms for producing correctly rounded results for the functions 1=x, px, 2x and log2 x in single-precision. Those functions are not discussed here, but Schulte and Swartzlander's result helped us to start our study. To nd the correct value of m, they performed an exhaustive search for n = 16 and n = 24. For n = 16, they found m = 35 for log2 x and m = 29 for 2x, and for n = 24, they found m = 51 for log2 x and m = 48 for 2x . We performed similar exhaustive searches for very small values of n and a few dierent exponents only, and always found m close to twice n. For instance, the largest value of m for cosines of single-precision numbers between 1 and 2 is obtained for 3 numbers, including 8 791 717 x = 1:000011000100110101001012 = 8 388 608 whose cosine equals 0:0 111111111001111010111000 | {z } 0 111111111111111111111111 | {z } 000011101 : :: 24. 24. For this number m = 49. One would like to extrapolate those gures and nd m 2n. In 16], two of us showed that if the bits of f(x) after the n-th position can be viewed as if they were random sequences of zeroes and ones, with probability 1 for 0 as well as for 1, then for n-bit mantissa normalized oating-point input 2 numbers, assuming that ne dierent exponents are considered, m is close to 3 However, these functions satisfy some properties that might probably be used. For instance, if x is a machine number and if x is not an integer, then 2x cannot be a rational number.. 3.

(7) 2n + log2(ne ) with very high probability4. Similar probabilistic studies have been done previously by Dunham 9], and by Gal and Bachelis 10]. Of course, probabilistic arguments do not constitute a proof: they can only give an estimate of the accuracy required during the intermediate computation to get a correctly rounded result. There are a few results from number theory that can be used to get an upper bound on the maximum value of m. Unfortunately, such bounds are very large. In practice, they appear to be much larger than the actual maximum value of m. For instance, using a theorem due to Nesterenko and Waldschmidt 18], we could show 16] that getting correctly rounded results in double-precision could be done with m 1 000 000. Although computing functions with 1 000 000 digits is feasible (on current machines, this would require less than half an hour using Brent's algorithms 5] for the functions and Zuras' algorithms 24] for multiplication), this cannot be viewed as a satisfactory solution. Moreover, after the probabilistic arguments, the actual bound is likely to be around 110. Therefore, we decided to study how could an exhaustive search of the worst cases be possible.. 2. Exhaustive tests. For a given elementary function f, a given oating-point format (in practice, IEEE-754 double precision), a given rounding mode, and a given range (ideally, the range of all machine numbers for which the function is mathematically dened, but this may be dicult in some cases), we want to nd worst cases. A worst case is a machine number x, belonging to the considered range, for which the distance between f(x) and a machine number (for directed rounding modes) or the distance between f(x) and the middle point of two consecutive machine numbers (for rounding to the nearest) is minimal. Here, distance means mantissa distance, that is, the distance between f(x) and y is jf(x) ; yj 2blog2 jf (x)jc : To nd such worst cases within reasonable time, we need to workout ecient lters. A lter is an algorithm that allows to eliminate most cases, that is, that quickly eliminates any number x for which the distance between f(x) and a machine number or the middle point of two consecutive machine numbers is far from being minimal. If the lter is adequately designed, a very small fraction of the set of the considered machine numbers will not have been ltered out, so that we will be able to deal with these numbers later on, during a second step, using more accurate yet much slower techniques. We consider here two dierent lters. Both are based on very low degree (consequently, valid on very small domains only) polynomial approximations of the functions being considered. 4 If all possible exponents are considered, this formula can be simpli ed: with a p-bit

(8) oating-point number system, it becomes n + p. And yet, we prefer the formula 2n + log2 (ne) because in practice, we seldom have to deal with all possible exponents: when an argument is very small (i.e., its exponents are negative and have a rather large absolute value), simple Taylor expansions suce to nd a correctly rounded result. When an argument is very large, the value of the functionmay be too large to be representable(for instancewith the exponential function), or so close to a limit value that correctly rounding it becomes obvious (for instance with the arctangent function).. 4.

(9) The exhaustive search is restricted to a given interval outside this interval other methods can be chosen. Indeed, if x is small enough (less than 2;53 for double precision), an order-1 Taylor expansion can be used, and if x is large enough, the exponential gives an over ow, and the values of trigonometric functions do not make much sense any longer for most applications (although we prefer to always return correctly rounded results). A similar work for the single-precision oating-point numbers has already been done by various authors 20, 16]. Our methods have rst been applied to the exponential function with double-precision arguments in the interval 21 1). We are extending them to other intervals and other functions. Results concerning other intervals cannot be deduced from the results in 21 1), thus these intervals will have to be considered later. Concerning the double-precision normalized oating-point numbers (n = 53), there are 252 possible mantissas (the rst bit is always 1), which is a large number. Thus one of the most important concerns is that the test program should be as fast as possible, even though it is no longer portable (we only used SparcStations). Indeed, one cycle per test corresponds to approximately two years of computation on a 100 MHz workstation: saving any cycle is important! 2.1 First ltering strategy. 2.1.1 Algorithm (general idea). The TMD occurs for an argument x if the binary expansion of f(x) has a sequence of consecutive 0's or 1's starting from a given position, where the position and the length of the sequence depend on the nal and the intermediate precisions. The problem consists in checking for all x whether these bits are all 0's or all 1's, and in this case, nding the maximal value of m for which the TMD occurs. As explained previously, to get fast tests, we apply a two-step strategy similar to Ziv's 23]. The rst step (that is, the lter) must be very fast and eliminate most arguments the second one, that may be much slower, consists in testing again the arguments that have failed at the rst step, by approximating f(x) with higher precision. The rst step consists in testing a given subsequence of the binary expansion (bits of weight 2;54 to 2;(M ;1)) of an approximation of each f(x) with error 2;M . In our case, we have chosen M = 86 from the algorithm and the processor characteristics. The test fails if and only if these bits are the same, and the argument will have to be checked during the second step. In the opposite case, one can easily show that the bits 54 to M of the exact result cannot all be equal. In the following, we assume f(x) = ex .. 2.1.2 First step: overall presentation. To perform the rst step, the exponential function is approximated by a polynomial. The chosen polynomial must have a low degree for the following reasons: on the one hand, to reduce the computation time, on the other hand, to limit possible rounding errors. As a consequence, the approximation is valid on a small interval only. Let us consider an interval ;2r;53 2r;53), where r is a positive integer (it will be about 15), in which we know a polynomial approximation. We can use 5.

(10) the formula. et+x = et :ex where x is in this interval and t has the form (2`+1)2r;53, to test every argument in the range 1=2 to 1. The main idea consists in computing a polynomial approximation of the exponential in the intervals t ; 2r;53 t + 2r;53 ), then evaluating the obtained polynomial at consecutive values by the nite dierence method 14], brie y recalled in Section 2.1.4. This method is attractive, for it only requires additions (two for each argument in the case of a polynomial of degree two, which was chosen) and the computations can be performed modulo 2;53 (the rst tested bit having weight 2;54). Thus the algorithm consists of two parts: the computation of the et 's with error 2;s , where t will have the form (2` + 1)2r;53 (and will be between 1=2 and 1), the computation of the et+x 's with error 2;M , where x is in ;2r;53 2r;53), knowing et with error 2;s. For our tests, we chose r = 16, M = 86 and s = 88. These parameters have been determined from an error analysis (giving relations between r, M and s) and hardware parameters (register width and disk size).. 2.1.3 Computing the et's. We seek to compute u` = ey+`z with error 2;s, where y = 1=2+2r;53 , z = 2r;52 and ` is an integer such that 0 ` < 251;r . The following method allows to compute a new term with only one multiplication (using the formula ea+b = ea :eb ), with a balanced computation tree to avoid losing too much precision, and without needing to store too many values u` , the disk storage being limited. We write ` in base 2: ` = `50;r `49;r : : : `0 . We have ey+`z = ey :e`00 :e`11 : : : e`5050;;r where ei = (ez )2 to simplify, ey and the ei 's are precomputed. The problem now consists in computing for all I f0 1 : : : hg containing h: Y PI = ei r. i. i 2I. where the ei 's are the above precomputed values (eh = ey ). For that, we partition f0 1 : : : hg into two subsets that have the same size (or almost) to balance, and we apply this method recursively on each subset (we stop the recursion when the set has only one element) then we calculate all the products xy, where x is in the rst subset and y is in the second subset. The last products xy are computed later, just before testing the corresponding interval. These computations can be performed sequentially on one machine: they only need several minutes.. 6.

(11) x. P(x). 0 1 1 1 2 5 3 19 4 49 5 101 6 181 7 295 8 449 9 649 10 901 11 1211. P(x) = P (x + 1) ;P(x) 0 4 14 30 52 80 114 154 200 252 310. 2P(x) = P(x + 1) ;P(x). 4 10 16 22 28 34 40 46 52 58. 3P(x) = 2P (x + 1) ;2P(x). 6 6 6 6 6 6 6 6 6. Table 2: Illustration of the nite dierence method, with P (x) = x3 ; x2 + 1, x0 = 0 and h = 1. Once 4 consecutive values of P are evaluated, we compute 3 values of P , 2 values of 2P and one value of 3P using subtractions. After this, evaluating a new value P (xi) requires 3 additions only.. 2.1.4 Computing and testing the et+x 's. The exponential ex is approximated by 1+x+ 12 x2 on the interval ;2r;53 2r;53]. The computation of the et+x 's using that polynomial approximation is performed using the nite dierence method 14]. This method allows to calculate consecutive values of a polynomial of degree d with only d additions for each value. It is probably known by most readers, so we just brie y recall it, to make the paper self-contained. Assume we wish to evaluate a degree-d polynomial P at regularly-spaced values x0, x1 = x0 + h, x2 = x0 + 2h, . . . , xi = x0 + ih, . . . . Dene the values: P (xi) = P (xi+1) ; P (xi), 2P(xi) = P (xi+1) ; P(xi), 3P(xi) = 2P(xi+1) ; 2P(xi), ... d P (xi) = d;1 P (xi+1) ; d;1P (xi). It is quite easy to show that for any i, d P(xi) is a constant that only depends on P. This is illustrated in Table. 2 in the case P(x) = x3 ; x2 + 1, x0 = 0 and h = 1. This shows that once the rst d + 1 values of P , then the rst d values of P, then the rst d ; 1 values of 2P, . . . then the (constant) value of d P are computed, it suces to perform d additions to get a new value of P. In our problem (with h = 1 and x0 = 0), the polynomial is not given by its coecients nor by its rst values P (0), P (1), . . . , P (d), but by the elements P(0), P (0), 2 P(0), . . . . These values are more suited to our computations. 7.

(12) They are the coecients of the polynomial in the base X(X ; 1) X(X ; 1)(X ; 2) ::: : 1 X 2 3! The test of the bits cannot be performed with only one instruction on a Sparc processor without modications. This test consists in testing whether a number is in a given interval centered on 0 modulo 2;53. With the nite dierence method, we can translate the interval so that its lower bound is 0, and we can use an unsigned comparison to test whether the number is in the interval thus we nally need only one instruction. Thus the rst step takes 5 cycles per argument on average (two 64-bit additions and one 32-bit comparison), the time required by the other computations branch instructions being negligible.. 2.1.5 Parallelizing on a computer network. The total amount of CPU time required for our computations is several years. We wanted to quickly get the results of the tests (within a few months). Therefore we had to parallelize the computations. We used the network of 120 workstations of our department. These workstations often have a small load. We sought to use each machine at its maximum without disturbing its user in particular, the process uses very little memory and there are very few communications (e.g. le system accesses).. 2.1.6 Second step. The second step consists in a more precise test for the arguments that failed during the rst step. The exponential is computed with a higher precision, chosen so that the probability that the test fails for an argument is very low. We chose a variation of De Lugish's algorithm 8] for computing the exponential (since it contains no multiplication). The computations were performed on 128-bit integers (four 32-bit integers). The algorithm was implemented in assembly language (which is, for the present purpose, simpler than in C language).. 2.1.7 Improvements. The algorithm given above (rst ltering strategy) has been used during the summer of 1996. We present the results in Section 3. Since 1996, we have developed another ltering strategy, presented in Section 2.2.1. Before examining that strategy, let us notice some remarks that may help to accelerate the tests. First, we can test both a function and its inverse at the same time. As we would need to test twice as many numbers as before, it seems that we would save nothing, but this is no longer true in combination with the following methods. Since testing a function and testing its inverse are equivalent (the exponential of a machine number a is close to a machine number b if and only if the logarithm of b is close to the machine number a), we can choose the function that is the fastest to test, i.e. the function for which the number of points to test is the smallest in the given domain of course, this choice may depend on the considered domain. 8.

(13) We can approximate a degree-2 polynomial by several degree-1 polynomials in subintervals (which is not equivalent to directly approximating the function by degree-1 polynomials). By doing this, the tests would require 3 cycles per argument in average instead of 5 cycles. But we may do better: now we have a function that may be simple enough (a translation of a linear function) to nd an attractive algorithm based on advanced mathematical properties. This is done in the next section. 2.2 Second ltering strategy: towards faster tests. 2.2.1 Faster tests. As previously, we wish to know how close can f(x) be to a machine number (or the middle point of two consecutive machine numbers), where x is a machine number. Our second approach is based on the 3-distance theorem 4]. Let us start from a circle and an angle . We build a sequence of points x1, x2 ,. . . on the circle (see Fig. 1) as follows: x0 is chosen anywhere on the circle xi+1 is obtained by rotating xi of angle . The 3-distance theorem says that, at any time during this process (that is, for any given n, when the points x0, x1 , x2,. . . , xn are built), the distance between two points that are neighbors on the circle can take at most 3 possible values. This is illustrated on Fig. 2. Moreover, there are innitely many values of n (number of points) for which the distance between two points that are neighbors on the circle can take two values. x1 α x2 x0. x3. Figure 1: The 3-distance theorem. Construction of the rst points. To use this result, we will do the following: the initial domain is cut into subdomains small enough to make sure that, within an acceptable error, function f can be approximated by a linear approximation on each sub-domain. We also assume that all numbers in a sub-domain I, as well as in exp(I), have the same exponent 9.

(14) x 10 x. x6. 1. x5. x2. x9 x0. x7. x4 x3. x8. Figure 2: The 3-distance theorem. Construction of more points. There are at most 3 possible distances between consecutive points. now, let us focus on a given sub-domain. We scale the problem (input oating-point numbers, linear approximation) so that oating-point numbers now correspond to integers. Therefore our problem becomes the following: given a rectangular grid (whose points have integral coordinates belonging to a given range), a straight line, and a (small) real number , are there points of the grid that are within a distance from the line ? we slightly modify the problem again: we translate the line down by a distance . Our initial problem is equivalent to: are there points of the grid that are above the (translated) line, and at a distance less than 2 from it ? After these modications, assuming that the translated straight line is of equation y = ax ; b, our problem becomes: is there an integer x, belonging to the considered domain, and another integer i such that 0 i ; ax + b 2 ? Now, we compute modulo 1, and we look for an integer x such that b ; ax modulo 1 is less than 2. Now we can understand in what this problem is related to the 3-distance theorem. We compute modulo 1: the reals modulo one can be viewed as the points of a circle, a given point being arbitrarily chosen for representing 0. Adding a value a modulo 1 to some number z is equivalent to rotating the point that represents z of an angle 2a. Therefore, by choosing x0 = b and = ;2a, building the values b ; ax (where x is an integer) is equivalent to building the values x1 , x2, x3. . . of the 3-distance theorem. Now, we give a fast algorithm that computes the lower bound on the distance d between a segment and the points of a regular, rectangular grid. More details are given in 15]. This algorithm is related to the 3-distance theorem and is an extension to Euclid's algorithm, which is used to compute the development of a real number (the slope of the segment) into a continued fraction. We have seen that, in the 3-distance theorem,that there are innitely many values of n (number of points) for which the distance between two points that are neighbors on the circle can take two values. Let us call i and i these distances the i-th time there are two distances. The algorithm given below directly computes the sequences i and i , without having to build the points xi . 10.

(15) In the following, fxg denotes the positive fractional part of the real number x, i.e., fxg = x ; bxc. As previously , let y = ax ; b be the equation of the segment, where x is restricted to the interval 0 N ; 1]. The following algorithm gives a lower bound on fb ; axg, where x is an integer in 0 N ; 1]. Initialization: = fag = 1 ; fag d = fbg u = v = 1 Innite loop:. if (d < ) while ( < ) if (u + v N) exit. = ; u = u + v if (u + v N) exit = ; v = v + u else d = d ; while ( < ) if (u + v N) exit = ; v = v + u if (u + v N) exit. = ; u = u + v. Returned value: d. and contain the lengths that appear in the 3-distance theorem u and v contain the number of intervals (arcs) of respective lengths and (u+v is the total number of intervals, i.e., the number of points). During the computations, d is the temporary lower bound (for the current number of points u+ v), which becomes the nal lower bound when the algorithm stops.. 3. Results. In 1996, we tested the exponential function with double-precision arguments in the interval 21 1), using the rst ltering strategy (see Section 2.1). We needed up to 121 machines during three months for the rst step. The second step was carried out in less than one hour on one machine. The very same results have been obtained in January 1997, with the second ltering strategy (see Section 2.2.1) . Thanks to this better strategy, we needed a few machines only. The computation was approximately 150 times faster than with the rst strategy (we tested 30 arguments per cycle on average). Among all the 220 = 1 048 576 intervals, each containing 232 values, 2 097 626 exceptions have been found. From the probabilistic approach, the estimated number of exceptions was to be 221 = 2 097 152. This shows that, in this case, the probabilistic estimate is excellent. Our experiments allowed us to perform an in-depth check of the probabilistic hypotheses. For each double-precision number x, let us dene an integer kx such that the mantissa of ex has the following form: b|0 b1b2{z: : :b52} b|53000{z: : :00} 1 : : : 53 bits k bits 11.

(16) or. b|0 b1b2{z: : :b52} b|53111{z: : :11} 0 : : :: 53 bits k bits From the probabilistic hypotheses, kx k0 for given numbers x and k0 with a probability of 22;k0 . In the following table, we give the following numbers as a function of k0 (with k0 40) for 21 x < 1: the actual number of arguments for which kx = k0 the actual number of arguments for which kx k0 the estimated (after the probabilistic hypotheses) number of arguments for which kx k0: 252 22;k0 , i.e. 254;k0 . k0 kx = k0 kx k0 estimate 40 8185 16427 16384.0 41 4071 8242 8192.0 42 2113 4171 4096.0 43 999 2058 2048.0 44 541 1059 1024.0 45 258 518 512.0 46 123 260 256.0 47 63 137 128.0 48 43 74 64.0 49 14 31 32.0 50 5 17 16.0 51 7 12 8.0 52 1 5 4.0 53 2 4 2.0 54 1 2 1.0 55 1 1 0.5 We see that the probabilistic estimate is still very good. Now, let us focus on the worst case, for the exponential of double precision numbers between 1=2 and 1. This worst case is constituted by the exception for kx = 55, namely x = 0:110101100110011111011111001000110101101001110111100002 471 483 227 223 279 = 562 949 953 421312 with exp x = 10:010011111000010111001001011110 000011110111001110000 0 |11111 {z : : :1111} 01 : : :: 54. Therefore the value of m for the exponential function in 21 1) in doubleprecision is 109 = 53+55+1. We are computing the worst cases for the logarithm 12.

(17) function. For instance, one of the worst cases for logarithms of double precision numbers between 1 + 2;29 and 1 + 351040 2;29, and directed rounding, is. with. x = 1:00000000001000001001110000000111011011110110100001012 4 505 840534353 541 = 4 503 599627370 496 lnx = 2;11 1:00000100110011111001111101100000 10010011111111 000000000000 | {z: : :000000000 } 10110 : : :: 42. 4 Conclusion We have shown, using as an example the case of the exponential function in 21 1), that correctly rounding the double-precision elementary functions is an achievable goal. Moreover, if all the values of m have the same order of magnitude as the value we obtained for the exponential function (this is likely to be true), always computing correctly rounded functions will not be too expensive. As said in the introduction, correct rounding will facilitate the preservation of useful mathematical properties. And yet, to be honest, we must say that there are a few examples for which correct rounding may prevent from preserving a useful property. Let us consider the following example 17]. Assume that we use an IEEE-754 single-precision arithmetic. The machine number which is closest to arctan(230) is 13 176 795 = 1:57079637050628662109375 > : 8 388 608 2 Therefore, if the arctangent function is correctly rounded in round-to-nearest mode, we get an arctangent larger5 than =2. A consequence of this is that in our system, ; ; tan arctan 230 = ;2:2877 107 : In this case, the range preservation property j arctan xj is less than =2 for any x is not satised due to the use of correct rounding in round-to-nearest mode. This is a very peculiar case. One can easily show the following property Theorem 1 If a function f is such that: f(x) is dened for any x 2 a b], c = minx2 ab] f(x) is a machine number, d = maxx2 ab] f(x) is a machine number, then if f(x) is computed with correct rounding6 for a machine number x 2 a b], the computed value will belong to c d]. 5 6. But equal to the machine representation of Assuming this is possible. 13. =2..

(18) This property shows that the problem that occurred in the previous example will not appear frequently with the usual functions. For example, if they are correctly rounded (in any rounding mode), sines, cosines and hyperbolic tangents will always be between ;1 and +1. Our programs were written for Sparc-based machines, but a small portion of the code only is written in assembly code, so that getting programs for other machines would be fairly easy. Of course, it is very unlikely that somebody will be able to perform exhaustive tests for quadruple-precision in the near future, but all the results of our experiments shows that the estimates obtained from the probabilistic hypotheses are very good, so that adding a few more digits to 2n + log2 (ne ) for the sake of safety will most likely ensure correct rounding (it is important to notice that if correct rounding is impossible for one argument, we can be aware of that, so a ag can be raised). Therefore we really think that in the next ten years, libraries and / or circuits providing correctly rounded double-precision elementary functions will be available. Now, it is time to think about what should appear in a oating-point standard including the elementary functions. Among the various issues that should be discussed, one can cite: should we provide correctly rounded results in all the range of the function being evaluated (that is, for all machine numbers belonging to the domain where the function is mathematically dened), or in a somewhat limited range only ? Although we would prefer the rst solution, it might lead to more complex (and therefore time-consuming) calculations when computing trigonometric functions of huge arguments even if we provide correctly rounded results, should we allow a faster mode, for which faithful rounding 6] only is provided ? Faithful rounding means that the returned result is always one of the two machine numbers that are nearest to the exact result. This is not a rounding mode in the usual sense (for instance, it is not monotonic). Should we allow roundings that are intermediate between faithful rounding and correct rounding, for instance, should we allow instead of rounding to the nearest, results that are within 1=2ulp + from the exact result, for some very small ?. References 1] R. C. Agarwal, J. C. Cooley, F. G. Gustavson, J. B. Shearer, G. Slishman, and B. Tuckerman. New scalar and vector elementary functions for the IBM system/370. IBM Journal of Research and Development, 30(2):126 144, March 1986. 2] American National Standards Institute and Institute of Electrical and Electronic Engineers. IEEE standard for binary oating-point arithmetic. ANSI/IEEE Standard, Std 754-1985, New York, 1985. 3] G. A. Baker. Essentials of Pad approximants. Academic Press, New York, 1975. 4] V. Berth . Three distance theorems and combinatorics on words. Technical report, Institut de Math matiques de Luminy, Marseille, France, 1997. 14.

(19) 5] R. P. Brent. Fast multiple precision evaluation of elementary functions. Journal of the ACM, 23:242251, 1976. 6] M. Daumas and D. W. Matula. Rounding of oating-point intervals. In Proceedings of SCAN-93, Wien, Austria, June 1993. 7] M. Daumas, C. Mazenc, X. Merrheim, and J. M. Muller. Modular range reduction: A new algorithm for fast and accurate computation of the elementary functions. Journal of Universal Computer Science, 1(3):162175, March 1995. 8] B. DeLugish. A class of algorithms for automatic evaluation of functions and computations in a digital computer. PhD thesis, Dept. of Computer Science, University of Illinois, Urbana, 1970. 9] C. B. Dunham. Feasibility of perfect function evaluation. SIGNUM Newsletter, 25(4):2526, October 1990. 10] S. Gal and B. Bachelis. An accurate elementary mathematical library for the IEEE oating point standard. ACM Transactions on Mathematical Software, 17(1):2645, March 1991. 11] D. Goldberg. What every computer scientist should know about oatingpoint arithmetic. ACM Computing Surveys, 23(1):547, March 1991. 12] W. Kahan. Minimizing q*m-n, text accessible electronically at http://http.cs.berkeley.edu/ wkahan/. At the beginning of the le "nearpi.c", 1983. 13] W. Kahan. Lecture notes on the status of IEEE-754. Postscript le accessible electronically through the Internet at the address http://http.cs.berkeley.edu/ wkahan/ieee754status/ieee754.ps, 1996. 14] D. Knuth. The Art of Computer Programming, volume 2. Addison Wesley, Reading, MA, 1973. 15] V. Lefvre. An algorithm that computes a lower bound on the distance between a segment and z 2. Research report RR9718, LIP, Ecole Normale Sup rieure de Lyon, 1997. Available at http://www.ens-lyon.fr/LIP/research reports.us.html. 16] J. M. Muller and A. Tisserand. Towards exact rounding of the elementary functions. In Alefeld, Frommer, and Lang, editors, Scientic Computing and Validated Numerics (Proceedings of SCAN'95), Wuppertal, Germany, 1996. Akademie Verlag. 17] J.M. Muller. Elementary Functions, Algorithms and Implementation. Birkhauser, Boston, 1997. 18] Y. V. Nesterenko and M. Waldschmidt. On the approximation of the values of exponential function and logarithm by algebraic numbers (in Russian). Mat. Zapiski, 2:2342, 1996. 19] M. Payne and R. Hanek. Radian reduction for trigonometric functions. SIGNUM Newsletter, 18:1924, 1983. 15.

(20) 20] M. Schulte and E. E. Swartzlander. Exact rounding of certain elementary functions. In E. E. Swartzlander, M. J. Irwin, and G. Jullien, editors, Proceedings of the 11th IEEE Symposium on Computer Arithmetic, pages 138145, Windsor, Canada, June 1993. IEEE Computer Society Press, Los Alamitos, CA. 21] M. J. Schulte and E. E. Swartzlander. Hardware designs for exactly rounded elementary functions. IEEE Transactions on Computers, 43(8):964973, August 1994. 22] R. A. Smith. A continued-fraction analysis of trigonometric argument reduction. IEEE Transactions on Computers, 44(11):13481351, November 1995. 23] A. Ziv. Fast evaluation of elementary mathematical functions with correctly rounded last bit. ACM Transactions on Mathematical Software, 17(3):410 423, September 1991. 24] D. Zuras. More on squaring and multiplying large integers. IEEE Transactions on Computers, 43(8):899908, August 1994.. 16.

(21)