|  This section reviews some general aspects of floating-point numbers. See the presentation by J. D. Darcy and other FP references for more in-depth discussions.

A floating-point number is represented in binary as

    ±b_0.b_1b_2b_3...b_(n-1) × 2^exponent

where b_i represents the i-th bit of the n bits of the significand (also called the mantissa). In addition, there is a bit to indicate the sign. The floating-point value is calculated as

    (-1)^s · (b_0 + b_1·2^-1 + b_2·2^-2 + b_3·2^-3 + ... + b_(n-1)·2^-(n-1)) · 2^exponent

where s is the sign bit. Floating-point numbers involve a number of complications with which processor designers must deal.
For fractional numbers and for very large or very small numbers, advanced processors provide floating point representations. In the bit representation for the Java float type:

    | 1 bit | 8 bits   | 23 bits     |
    | Sign  | exponent | significand |

and for the double type:

    | 1 bit | 11 bits  | 52 bits     |
    | Sign  | exponent | significand |

Floating point numbers on computers involve a number of complications:

Approximations - The limited number of places in the significand means that only a finite number of fractional values can be represented exactly. Similarly, the finite width of the exponent limits the upper and lower size of the numbers.
 
 
Round-off - Arithmetic operations will often produce results whose fractional values must be rounded off. A round-off (or truncation) algorithm must be chosen by the designer of the language. Round-off errors can have a significant impact on a long calculation as they accumulate.
 
 
Overflows/Underflows - Similarly, a calculation may result in a number that is larger or smaller than the floating point type can represent. Again, the language designer must select a strategy for handling such situations.
 
 
Decimal-Binary Conversion - The computer represents numbers in base 2. This can result in loss of precision, since a binary fraction often cannot exactly represent a given finite decimal fraction (0.1, for example). All finite binary fractions, however, can be converted to finite decimal fractions.
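
The bit fields and the decimal-binary issue described above can be inspected directly. The following is a minimal sketch, not part of the original notes: the class name FloatBits and its method names are hypothetical, chosen only for illustration. It relies on the standard library calls Float.floatToIntBits and the BigDecimal(double) constructor, and assumes the 1/8/23 bit layout of the float type with an exponent bias of 127.

```java
import java.math.BigDecimal;

// Hypothetical helper class for illustration; not part of the original text.
class FloatBits {

    // Sign bit: the top bit of the 32-bit pattern (0 = positive, 1 = negative).
    static int signBit(float x) {
        return Float.floatToIntBits(x) >>> 31;
    }

    // Unbiased exponent: the 8 stored exponent bits minus the bias of 127.
    // (Meaningful for normal numbers; zero, subnormals, infinities and NaN
    // use reserved exponent patterns.)
    static int exponent(float x) {
        return ((Float.floatToIntBits(x) >>> 23) & 0xFF) - 127;
    }

    // The 23 stored significand bits. For normal numbers the leading bit
    // b_0 = 1 is implicit and is not stored.
    static int storedSignificand(float x) {
        return Float.floatToIntBits(x) & 0x7FFFFF;
    }

    // Exact decimal expansion of the binary value actually held in a double.
    static BigDecimal exactValue(double x) {
        return new BigDecimal(x);
    }

    public static void main(String[] args) {
        float x = 0.1f;
        System.out.println("sign        = " + signBit(x));
        System.out.println("exponent    = " + exponent(x));
        System.out.println("significand = " + Integer.toBinaryString(storedSignificand(x)));

        // 0.5 is a finite binary fraction, so it is stored exactly;
        // 0.1 is not, so the stored value differs slightly from 0.1.
        System.out.println("0.5 is stored as " + exactValue(0.5));
        System.out.println("0.1 is stored as " + exactValue(0.1));
    }
}
```

Running main shows that 0.5 is held exactly, while the value stored for 0.1 is a long decimal fraction slightly larger than 0.1. That long fraction is itself finite, in line with the remark that every finite binary fraction converts to a finite decimal fraction.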

Java & Floating Point

To handle these FP issues, Java follows the IEEE 754 standard in most cases. In this standard:

Round-off - The result is the binary value nearest to the exact (or higher precision intermediate) value. If two binary values are equally close, the even value is chosen; that is, the one with its last bit equal to 0.

Overflows/Underflows - Overflows are represented by positive or negative infinity values, while underflows produce zero (or a reduced-precision subnormal value). Similarly, undefined results, such as 0/0, use a Not-a-Number (NaN) representation.

No exceptions are thrown for any of these cases.

Note that even simple calculations with FP can provide
surprising results. For example, the following code

    float f = 0.0f;
    for (int i = 1; i <= 10; i++) {
        f += 0.1f;   // 0.1f is only the nearest float to 0.1
    }

does not result in exactly f = 1.0 (even if double is chosen for f) because, as mentioned above, 0.1 is not exact in binary format.

For similar reasons,
avoid equality (a == b) tests between two floating point variables. Instead, test with <, <=, >=, or >. However, in some situations it may be sensible to test for equality to 0.0 to avoid dividing by zero.

In Java the float representation has a 24-bit significand (23 stored bits plus an implicit leading bit) and double has a 53-bit significand (52 stored bits plus an implicit leading bit). This means that float gives 6 to 9 digits of decimal precision while double gives 15 to 17 digits of decimal precision.

In general, it is far safer to do calculations in double type. This helps to reduce round-off errors that lower the precision during the intermediate calculations. (You can always cast the final value to float if that is a more convenient size, such as for I/O or storage.)
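
The behaviours discussed in this section can be checked with a short program. The sketch below is illustrative only: the class FpPitfalls, its method names, and the tolerance of 1e-9 are assumptions made for this example, not part of the original notes. It repeats the sum of ten 0.1 values in both float and double, compares results with inequality tests against a tolerance instead of ==, and produces the special values that IEEE 754 arithmetic returns without throwing an exception.

```java
// Hypothetical demonstration class; all names here are illustrative only.
class FpPitfalls {

    // Adds 0.1f to itself n times using float arithmetic.
    static float sumFloat(int n) {
        float f = 0.0f;
        for (int i = 0; i < n; i++) {
            f += 0.1f;
        }
        return f;
    }

    // The same sum carried out in double arithmetic.
    static double sumDouble(int n) {
        double d = 0.0;
        for (int i = 0; i < n; i++) {
            d += 0.1;
        }
        return d;
    }

    // Tolerance test built from <= instead of ==:
    // true when a and b differ by at most tol.
    static boolean nearlyEqual(double a, double b, double tol) {
        return Math.abs(a - b) <= tol;
    }

    // Overflow: the result is too large for double and becomes Infinity.
    static double overflow() {
        return Double.MAX_VALUE * 2.0;
    }

    // Underflow: the result is too small for double and becomes 0.0.
    static double underflow() {
        return Double.MIN_VALUE / 2.0;
    }

    // Undefined result: 0.0 / 0.0 yields NaN rather than an exception.
    static double undefined() {
        double zero = 0.0;
        return zero / zero;
    }

    public static void main(String[] args) {
        System.out.println("float sum  == 1.0 ? " + (sumFloat(10) == 1.0f));
        System.out.println("double sum == 1.0 ? " + (sumDouble(10) == 1.0));
        System.out.println("double sum within 1e-9 of 1.0 ? "
                + nearlyEqual(sumDouble(10), 1.0, 1e-9));
        System.out.println("overflow  -> " + overflow());
        System.out.println("underflow -> " + underflow());
        System.out.println("0.0 / 0.0 -> " + undefined());
    }
}
```

Both equality tests report false, while the tolerance test reports true. The double sum also lands much closer to 1.0 than the float sum, which is one reason to prefer double for intermediate calculations. A suitable tolerance depends on the magnitude of the values involved; the fixed value used here is only a placeholder.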

Remember the difference between precision and accuracy:

Precision - how fine a distinction can be made between two close values.
Accuracy - how close the value is to the correct value.

References & Web Resources

Latest update: Oct. 15, 2004 |