Talk:Floating-point arithmetic/Removed sections

Content removed from Floating point, refactored here from the talk page...

Computer representation

This section describes some general issues, but mostly follows the IEEE standard.

To represent a floating-point number in a computer datum, the exponent has to be encoded into a bit field. Since the exponent could be negative, one could use two's complement representation. Instead, a fixed constant is added to the exponent, with the intention that the result be a positive number capable of being packed into a fixed bit field. For the common 32 bit "single precision" or "float" format of the IEEE standard, this constant is 127, so the exponent is said to be represented in "excess 127" format. The result of this addition is placed in an 8 bit field.

Since the leftmost significand bit of a (normalized) floating-point number is always 1, that bit is not actually placed in the computer datum. The computer's hardware acts as though that "1" had been provided. This is the "implicit bit" or "hidden bit" of the IEEE standard. Because of this, a 24 bit significand is actually placed in a 23 bit field. However, because the number zero can't be normalized, it requires special treatment, described below.

Finally, a sign bit is required. This is set to 1 to indicate that the entire floating-point number is negative, or 0 to indicate that it is positive. (In the past, some computers have used a kind of two's complement encoding for the entire number, rather than simple "sign/magnitude" format.)

The entire floating-point number is packed into a 32 bit word, with the sign bit leftmost, followed by the exponent in excess 127 format in the next 8 bits, followed by the significand (without the hidden bit) in the rightmost 23 bits.

For the approximation to π, we have

sign=0 ; e=1 ; s=110010010000111111011011 (including the hidden bit)
e+127 = 128 = 10000000 in (8 bit) binary
final 32 bit result = 0 10000000 10010010000111111011011 = 0x40490FDB

As noted above, this number is not really π, but is exactly 3.1415927410125732421875.
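The packing described above can be checked with a short Python sketch. It relies on the standard `struct` module to round a value to single precision; the helper name `float32_fields` is made up for this illustration and is not part of any library.

```python
import math
import struct
from decimal import Decimal

def float32_fields(x):
    """Round x to IEEE single precision and split the 32-bit word into its fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # reinterpret the 4 bytes as an integer
    sign = bits >> 31                 # leftmost bit
    biased_exp = (bits >> 23) & 0xFF  # next 8 bits, excess-127
    fraction = bits & 0x7FFFFF        # rightmost 23 bits, hidden bit not stored
    return bits, sign, biased_exp, fraction

bits, sign, biased_exp, fraction = float32_fields(math.pi)
print(f"{bits:08X}")                                  # 40490FDB
print(sign, f"{biased_exp:08b}", f"{fraction:023b}")  # 0 10000000 10010010000111111011011
print(biased_exp - 127)                               # 1, the unbiased exponent e

# The exact decimal value of the stored single precision number.
stored = struct.unpack(">f", struct.pack(">f", math.pi))[0]
print(Decimal(stored))                                # 3.1415927410125732421875
```

Big-endian format codes (`">f"`, `">I"`) are used only so that the hexadecimal output reads left to right in the same order as the bit fields in the text.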

In the common 64 bit "double precision" or "double" format of the IEEE standard, the offset added to the exponent is 1023, and the result is placed into an 11 bit field. The precision is 53 bits. After removal of the hidden bit, 52 bits remain. The result comprises 1+11+52=64 bits. The approximation to π is

sign=0 ; e=1 ; s=11001001000011111101101010100010001000010110100011000 (including the hidden bit)
e+1023 = 1024 = 10000000000 in (11 bit) binary
final 64 bit result = 0 10000000000 1001001000011111101101010100010001000010110100011000 = 0x400921FB54442D18

This number is exactly 3.141592653589793115997963468544185161590576171875.
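The double precision figures can be reproduced the same way. This is a minimal sketch that assumes Python's `float` is an IEEE 64 bit double, which is the case on common platforms; the variable names are illustrative only.

```python
import math
import struct
from decimal import Decimal

# Reinterpret the 8 bytes of the double nearest to pi as a 64-bit integer.
(bits64,) = struct.unpack(">Q", struct.pack(">d", math.pi))

sign64 = bits64 >> 63                  # 1 bit
biased_exp64 = (bits64 >> 52) & 0x7FF  # 11 bits, excess-1023
fraction64 = bits64 & ((1 << 52) - 1)  # 52 bits, hidden bit not stored

print(f"{bits64:016X}")                # 400921FB54442D18
print(biased_exp64 - 1023)             # 1, the unbiased exponent e
print(Decimal(math.pi))                # exact decimal expansion of the stored double
```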

Overflow, underflow, and zero

The necessity to pack the offset exponent into a fixed-size bit field places limits on the exponent. For the standard 32 bit format, e+127 must fit into an 8 bit field, so -127 ≤ e ≤ 128. The values -127 and +128 are reserved for special meanings, so the actual range for normalized floating-point numbers is -126 ≤ e ≤ 127. This means that the smallest normalized number is

e=-126 ; s=100000000000000000000000

which is about 1.18 × 10^-38, and is represented in hexadecimal as 00800000. The largest representable number is

e=+127 ; s=111111111111111111111111

which is about 3.4 × 10³⁸, and is represented in hexadecimal as 7F7FFFFF. For double precision the range is about 2.2 × 10^-308 to 1.8 × 10³⁰⁸.
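As a quick check of these limits, the two hexadecimal patterns can be decoded back into numbers. The helper `from_bits32` is a hypothetical name introduced for this sketch; the double precision limits come from Python's `sys.float_info`, assuming `float` is an IEEE double.

```python
import struct
import sys

def from_bits32(bits):
    """Interpret a 32-bit integer as an IEEE single precision number."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

smallest_normal = from_bits32(0x00800000)  # e = -126, significand 1.000...0
largest_finite = from_bits32(0x7F7FFFFF)   # e = +127, significand 1.111...1

print(smallest_normal)     # about 1.18e-38
print(largest_finite)      # about 3.4e+38

# Corresponding limits for double precision.
print(sys.float_info.min)  # about 2.2e-308
print(sys.float_info.max)  # about 1.8e+308
```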

Any floating-point computation that gives a result (after rounding to a representable value) higher than the upper limit is said to overflow. Under the IEEE standard, such a result is set to a special value "infinity", which has the appropriate sign bit, the reserved exponent +128, and a significand field of all zeros (a nonzero significand with this exponent denotes "not a number" instead). Such numbers are generally printed as "+INF" or "-INF". The "infinity" value is also produced when a nonzero number is divided by zero.

Floating-point hardware is generally designed to handle operands of infinity in a reasonable way, such as

(+INF) + (+7) = (+INF)
(+INF) × (-2) = (-INF)
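The same behaviour can be observed from Python, as a rough illustration. One caveat: Python itself raises `ZeroDivisionError` for `1.0 / 0.0` rather than returning infinity, which is a choice of the language and not of the floating-point format, so division by zero is left out of this sketch.

```python
import math
import struct

inf = float("inf")

print(inf + 7)    # inf
print(inf * -2)   # -inf

# Overflow in double precision: the product exceeds the largest finite double.
overflowed = 1e308 * 10
print(overflowed, math.isinf(overflowed))

# Single precision bit pattern of +infinity: sign 0, exponent field all ones,
# significand field all zeros.
(inf_bits,) = struct.unpack(">I", struct.pack(">f", inf))
print(f"{inf_bits:08X}")  # 7F800000
```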

A floating-point computation that (after rounding) gives a nonzero result lower than the lower limit is said to underflow. This could happen, for example, if 10^-25 is multiplied by 10^-25 in single precision. Under the IEEE standard, the reserved exponent -127 is used, and the significand is set as follows.

First, if the number is zero, it is represented by an exponent of -127 and a significand field of all zeros. This means that zero is represented in hexadecimal as 00000000.

Otherwise, if normalizing the number would lead to an exponent of -127 or less, it is only normalized until the exponent is -127. That is, instead of shifting the significand bits left until the leftmost bit is 1, they are shifted until the exponent reaches -127. For example, the smallest non-underflowing number is

e=-126 ; s=1.00000000000000000000000 (about 1.18 × 10^-38)

A number 1/16th as large would be

e=-130 ; s=1.00000000000000000000000 (about 7.3 × 10^-40)

If it is partially normalized, one gets

e=-127 ; s=0.00100000000000000000000

This does not have a leading bit of 1, so using the "hidden bit" mechanism won't work. What is done is to store the significand without removing the leftmost bit, since there is no guarantee that it is 1. This means that the precision is only 23 bits, not 24. The exponent of -127 is stored in the usual excess 127 format, that is, all zeros. The final representation is

0 00000000 00010000000000000000000 = 00080000 in hexadecimal

Whenever the exponent is -127 (that is, all zeros in the datum), the bits are interpreted in this special format. Such a number is said to be "denormalized" (a "denorm" for short), or, in more modern terminology, "subnormal".

The smallest possible (subnormal) nonzero number is

0 00000000 00000000000000000000001 = 00000001 in hexadecimal

e=-127 ; s=0.0000000000000000000001

which is 2^-149, or about 1.4 × 10^-45.

The handling of the number zero can be seen to be a completely ordinary case of a subnormal number.
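The subnormal bit patterns above can be decoded to confirm the values. In this sketch the helpers `from_bits32` and `to_bits32` are hypothetical names, and the single precision product is approximated by computing in double precision and then rounding to 32 bits, which is an assumption of the example and not how single precision hardware would proceed.

```python
import struct

def from_bits32(bits):
    """Interpret a 32-bit integer as an IEEE single precision number."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def to_bits32(x):
    """Round x to single precision and return its 32-bit pattern."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

partially_normalized = from_bits32(0x00080000)  # 1/16 of the smallest normalized number
smallest_subnormal = from_bits32(0x00000001)
zero = from_bits32(0x00000000)

print(partially_normalized)  # about 7.3e-40
print(smallest_subnormal)    # about 1.4e-45
print(zero)                  # 0.0

# 1e-25 * 1e-25 is about 1e-50, far below the smallest subnormal,
# so rounding it to single precision underflows all the way to zero.
underflowed_bits = to_bits32(1e-25 * 1e-25)
print(f"{underflowed_bits:08X}")  # 00000000
```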

The creation of denormalized numbers is often called "gradual underflow". As numbers get extremely small, significand bits are slowly sacrificed. The alternative is "sudden underflow", in which any number that can't be normalized is simply set to zero by the computer hardware. Gradual underflow is difficult for computer hardware to handle, so hardware often uses software to assist it, through interrupts. This can create a performance penalty, and where performance is critical, sudden underflow might be used instead.
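The difference between the two policies can be sketched in double precision. The function `flush_to_zero` below is a hypothetical software model of sudden underflow, written only for this illustration; it does not change any hardware mode, and the example assumes the Python interpreter runs with gradual underflow enabled, as is usual.

```python
import sys

def flush_to_zero(x):
    """Hypothetical model of sudden underflow: any subnormal result becomes zero."""
    if x != 0.0 and abs(x) < sys.float_info.min:
        return 0.0
    return x

a = 1.5 * sys.float_info.min  # a small normalized number
b = sys.float_info.min        # the smallest normalized number

gradual = a - b               # half of the smallest normal, a subnormal
sudden = flush_to_zero(a - b)

print(gradual)                # nonzero, about 1.1e-308
print(sudden)                 # 0.0
```

With gradual underflow, two distinct numbers always have a nonzero difference; in the sudden underflow model, `a` and `b` differ yet their difference is reported as zero.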

End of piece NickyMcLean 00:25, 28 September 2006 (UTC)