## Monday, September 8, 2008

### Comparison of Float and Double Precision

C++ supports two primitive floating point types: `float` and `double`. These are based on the IEEE 754 standard, which defines a binary standard for 32-bit floating point and 64-bit double precision floating point binary-decimal numbers. IEEE 754 represents floating point numbers as base 2 decimal numbers in scientific notation. An IEEE floating point number dedicates 1 bit to the sign of the number, 8 bits to the exponent, and 23 bits to the mantissa, or fractional part. The exponent is interpreted as a signed integer, allowing negative as well as positive exponents. The fraction is represented as a binary (base 2) decimal, meaning the highest-order bit corresponds to a value of ½ (2-1), the second bit ¼ (2-2), and so on. For double-precision floating point, 11 bits are dedicated to the exponent and 52 bits to the mantissa. The layout of IEEE floating point values is shown in Figure 1. Because any given number can be represented in scientific notation in multiple ways, floating point numbers are normalized so that they are represented as a base 2 decimal with a 1 to the left of the decimal point, adjusting the exponent as necessary to make this requirement hold. So, for example, the number 1.25 would be represented with a mantissa of 1.01 and an exponent of 0:
`(-1) `

The number 10.0 would be represented with a mantissa of 1.01 and an exponent of 3:
`(-1)`

here are some sample floating point representations:
```0      0x00000000
1.0    0x3f800000
0.5    0x3f000000
3      0x40400000
+inf   0x7f800000
-inf   0xff800000
+NaN   0x7fc00000 or 0x7ff00000
in general: number = (sign ? -1:1) * 2^(exponent) * 1.(mantissa bits)```

As a programmer, it is important to know certain characteristics of your FP representation. These are listed below, with example values for both single- and double-precision IEEE floating point numbers:

 Property Value for float Value for double Largest representable number 3.402823466e+38 1.7976931348623157e+308 Smallest number without losing precision 1.175494351e-38 2.2250738585072014e-308 Smallest representable number(*) 1.401298464e-45 5e-324 Mantissa bits 23 52 Exponent bits 8 11 Epsilon(**) 1.1929093e-7 2.220446049250313e-16

Note that all numbers in the text of this article assume single-precision floats; doubles are included above for comparison and reference purposes.

(*)

Just to make life interesting, here we have yet another special case. It turns out that if you set the exponent bits to zero, you can represent numbers other than zero by setting mantissa bits. As long as we have an implied leading 1, the smallest number we can get is clearly 2^-126, so to get these lower values we make an exception. The "1.m" interpretation disappears, and the number's magnitude is determined only by bit positions; if you shift the mantissa to the right, the apparent exponent will change (try it!). It may help clarify matters to point out that 1.401298464e-45 = 2^(-126-23), in other words the smallest exponent minus the number of mantissa bits.

However, as I have implied in the above table, when using these extra-small numbers you sacrifice precision. When there is no implied 1, all bits to the left of the lowest set bit are leading zeros, which add no information to a number (as you know, you can write zeros to the left of any number all day long if you want). Therefore the absolute smallest representable number (1.401298464e-45, with only the lowest bit of the FP word set) has an appalling mere single bit of precision!

(**)

Epsilon is the smallest x such that 1+x > 1. It is the place value of the least significant bit when the exponent is zero (i.e., stored as 0x7f).

#### 1 comment:

1. [...] have every write a post and mentioned a little bit about the binary format of IEEE 754 in Comparison of Float and Double Precision. In this post I made a table to indicate the details of bit values for the IEEE 754 32bit [...]