Sunday, July 02, 2006

IEEE 754

That's the floating point number format standard. What do you know about how floating point numbers are stored? (We are not talking about arithmetic.) My experience with programmers is that many of them don't know anything about these formats. Sure, they know that real numbers are stored as sign, fraction and exponent triplets, but nothing more. After all, who needs to know how computers actually work these days, right? Well, if you think like that, I'm sorry for you. This line of thought may be practical (I hate that fact), but it's not at all part of the hacker/geek spirit.

Here's a brief description of how floating point numbers are stored, according to IEEE 754. I thought there were three formats for floats, but as it turns out, there are four: 32, 43, 64 and 80 bits. (Who's ever heard of a 43-bit float? Honestly? As far as I can tell, that's the minimum width the standard allows for its single-extended format.) The most widely used one is the 64-bit, or "double precision" format. This format (like all the others) has a single sign bit (the MSB), with "0" for positive and "1" for negative numbers. Then comes an 11-bit exponent field, and after that a 52-bit fraction part, for a total of 64 bits (it's 8 and 23 bits for the 32-bit single-precision format).

As you know, floating point numbers are stored in some form of scientific notation, with base 2. Basically, that means the value is the result of multiplying an M by two to the power of E. But that's not the whole story. First of all, you have to realize that every number you write in binary has a '1' as its leftmost bit. Think about it! Without zero-padding on the left, every number must have a non-zero digit at its left, and in radix 2, that means a 1. So we omit that 1 and save one bit. In our scientific notation, the M must be normalized to be greater than or equal to 1, and less than 2. Second of all, E is not the stored exponent field, but that field minus 1023. That is, if you want 2 as E, you have to set the exponent field to 1025. This way, you can accommodate negative exponents as well.
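To make that layout concrete, here's a small Python sketch (the helper name decode_double is my own) that uses the standard struct module to pull the three fields out of a double:

```python
import struct

def decode_double(x):
    """Split a Python float (an IEEE 754 double) into its three bit fields."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    sign = bits >> 63                   # 1 bit: the MSB
    exponent = (bits >> 52) & 0x7FF     # 11 bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)   # 52 bits, the leading 1 is implicit
    return sign, exponent, fraction

# 1.0 = +1.0 * 2^0, so the exponent field holds 0 + 1023
assert decode_double(1.0) == (0, 1023, 0)
# -2.5 = -1.25 * 2^1: exponent field is 1 + 1023, fraction encodes the .25
assert decode_double(-2.5) == (1, 1024, 1 << 50)
```

The struct round-trip ("pack as double, unpack as unsigned 64-bit integer") is just a way to reinterpret the same eight bytes without changing them.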
You might wonder why the common two's complement method was not used. Well, I don't know all the reasons, but one of them is probably this nice side effect: with this encoding, and the field layout being what it is (exponent first, then fraction), you can compare normalized floating point values just as if they were integers, using the bit pattern! (Not considering the sign bit, of course.) More precisely, the final number is calculated like this:
value = (-1)^sign * 2^(exponent - 1023) * (1 + fraction / 2^52)
The above is used when 0 < exponent < 2047; such a value is called a normal number. There are also a few special-case values that you need to know about:
  • If exponent is 0 and fraction is 0, the value is ±0.0;
  • If exponent is 0 but fraction is nonzero, the value is denormal and equal to ±fraction / 2^(1022+52);
  • If exponent is 2047 and fraction is 0, the value is ±∞;
  • If exponent is 2047 and fraction is nonzero, the value is ±NaN (Not a Number. There are 2^52 - 1 of these!)
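The formula, the special cases, and the bit-pattern comparison trick above can all be checked with the same struct round-trip (the helper names are mine):

```python
import math
import struct

def fields(x):
    """Split a double into its (sign, exponent, fraction) bit fields."""
    (b,) = struct.unpack("<Q", struct.pack("<d", x))
    return b >> 63, (b >> 52) & 0x7FF, b & ((1 << 52) - 1)

def from_bits(b):
    """Build a double from a raw 64-bit pattern."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]

# Normal numbers follow the formula from the text exactly:
for x in (1.0, -2.5, math.pi, 1e-300):
    sign, exponent, fraction = fields(x)
    assert 0 < exponent < 2047  # all of these are normal numbers
    assert x == (-1) ** sign * 2.0 ** (exponent - 1023) * (1 + fraction / 2 ** 52)

# exponent = 0, fraction = 0: signed zero
assert from_bits(0) == 0.0
assert math.copysign(1.0, from_bits(1 << 63)) == -1.0

# exponent = 0, fraction nonzero: denormal, fraction / 2^(1022+52)
assert from_bits(1) == 2.0 ** -(1022 + 52)  # smallest positive double

# exponent = 2047: infinity (fraction 0) or NaN (fraction nonzero)
assert from_bits(0x7FF << 52) == math.inf
assert math.isnan(from_bits((0x7FF << 52) | 1))

# The comparison trick: for positive doubles, bigger value, bigger bit pattern
samples = [2.0 ** -1074, 1e-300, 0.1, 1.0, 1e300]
assert samples == sorted(samples, key=lambda v: struct.unpack("<Q", struct.pack("<d", v))[0])
```

Note how the ordering trick even holds across the denormal/normal boundary, which is exactly what the chosen field layout buys you.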
The single-precision format is similar to the above, except that the bias is 127 instead of 1023 (subtracted from an 8-bit exponent field), the fraction is 23 bits, and the special exponent value is 255 rather than 2047.
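The same experiment works for single precision, using struct's "f" format and the smaller field widths (again, the helper name is my own):

```python
import struct

def decode_single(x):
    """Split a 32-bit float into sign (1 bit), exponent (8), fraction (23)."""
    (b,) = struct.unpack("<I", struct.pack("<f", x))
    return b >> 31, (b >> 23) & 0xFF, b & ((1 << 23) - 1)

sign, exponent, fraction = decode_single(1.0)
assert (sign, exponent, fraction) == (0, 127, 0)   # bias is 127, not 1023
# the value formula is the same, with 127 and 23 in place of 1023 and 52:
assert 1.0 == (-1) ** sign * 2.0 ** (exponent - 127) * (1 + fraction / 2 ** 23)
```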
