Floating point notation is essentially the same as scientific notation, only
translated to binary. There are three fields: the sign (which is the sign of
the number), the exponent (some representations have used a separate exponent
sign and exponent magnitude; IEEE format does not), and a significand
(mantissa).
As we discuss the details of the format, you'll find that the motivations
used to select some features seem like they should have driven other features
in directions other than what was actually used. It seems inconsistent to me,
too...
One other thing to mention is that the IEEE floating point format actually
specifies several different formats: a ``single-precision'' format that takes
32 bits (i.e., one word in most machines) to
represent a value, and a ``double-precision'' format that allows for both
greater precision and greater range, but uses 64 bits. We'll be talking about
single-precision here.
The one-bit sign is 0 for positive, or 1 for negative. The representation is
sign-magnitude.
In integers, we use 2's complement for negative numbers because it makes
the arithmetic ``just work;'' we can add two numbers together without regard to
whether they are positive or negative, and get the right answer. This won't
work for floating point numbers because the exponents need to be manipulated;
if we used a 2's complement representation for the entire word we'd have to
reconstruct the exponent any time we wanted to add or subtract, so it wouldn't
gain us anything; in fact, trying to do arithmetic involving a negative number
would involve converting it to positive first.
All the same, using the same negative representation for integer and
floating point has been done: the CDC 6600, which used 1's complement
arithmetic for integers, also represented floating point numbers by taking the
1's complement of the entire word. The CDC Cyber 205 left the exponent alone,
and represented negatives by taking the 2's complement of the mantissa.
The exponent gives a power of two, rather than a power of ten as in
scientific notation (again, there have been floating point formats using a
power of eight or sixteen; IEEE uses two).
The eight-bit exponent uses excess-127 notation.
What this means is that the exponent is represented in the field by a number
127 greater than its value. Why? Because it lets us use an integer comparison
to tell if one floating point number is larger than another, so long as both
are the same sign.
Of course, this is only a benefit if we use the same registers for both
integers and floating point numbers, which has become quite rare today. By the
time you've moved two operands from floating point registers to integer
registers and then performed a comparison, you might as well have just done a
floating point compare. Also, an integer compare will fail to give the right
answer for comparisons involving negative numbers: in sign-magnitude, a larger
bit pattern means a larger magnitude, so two negative numbers compare in the
reverse of their true order.
The use of excess-127, instead of excess-128, is also a head-scratcher. Most previous floating point formats using an
excess representation for the exponent used an excess that was a power of two;
this allowed conversion from exponent representation to exponent value (and
vice versa) by simply inverting a bit. I have yet to come across a good
explanation for the use of excess-127.
Using a binary exponent gives us an unexpected benefit. In scientific
notation, we always work with a ``normalized'' number: a number whose mantissa
is at least 1 but less than 10. If a binary floating point number is normalized, it must
have the form 1.f: the most significant bit must be a 1. Well, if we know
what it is, we don't need to explicitly represent it, right? So we just store
the fraction part in the word, and put in the ``1.'' when we're actually inside
the floating point unit. Sometimes this is called using a ``phantom bit'' or a
``hidden bit.''
Since we're going to fill a 32-bit word, the fraction is 23 bits, but
represents a 24-bit significand.
A note on mantissas: a ``mantissa'' is the fractional part of the
logarithm of a number. For instance, if we take log_{10} 73.2, we get
1.864511. The mantissa is .864511. I've also seen the word used to mean the fractional part of any decimal number; in the above
example, using this definition, the mantissa would be .2. The term is also
frequently used to mean the significand of a floating
point number; we're going to try to be consistent and use the term ``significand.''
The value represented by an IEEE floating point number is
(-1)^{s} * 1.f * 2^{exp-127}
Let's think a minute about just how we do arithmetic operations in
scientific notation:
Addition and subtraction: shift one number's mantissa (adjusting its exponent as you go) until both exponents match, then add or subtract the mantissas.
Multiplication and division: multiply or divide the mantissas, and add or subtract the exponents.
Let's add 2.5 + 4.75. First, convert 2.5 to binary. The integer part, 2, is
converted by repeated division by 2, collecting the remainder bits (read them
bottom to top):

Old   Old/2   Bit
2     1       0
1     0       1

So we get 10. The fraction part, .5, is converted by repeated multiplication
by 2; each step's integer part is the next bit:

Old   Bit   New
.5    1     0

So the fraction part is .1.
The number we're converting is 10.1, which is 1.01
x 2^{1}. The exponent is 127+1 = 128_{10}, or 10000000_{2},
and the fraction is 0100_{2}.
2.5:
    Sign:        0
    Exponent:    10000000
    Significand: 1.01

4.75:
    Sign:        0
    Exponent:    10000001
    Significand: 1.0011

· Match the exponents by denormalizing the smaller number: 1.01 x 2^{1} becomes 0.101 x 2^{2}.
· Add the significands: 0.1010 + 1.0011 = 1.1101. The result is already normalized, and its exponent is the shared 10000001.
· Put the result together:
0 10000001 110100_{2}, or 40e80000_{16}. One small point to notice here is that I didn't ever have
to figure out what the exponents meant; I just had to compare them.
· Convert the result back to decimal. Since the exponent field is 10000001, its
value is 129-127 = 2. So the number's value is 1.1101 x 2^{2}, or 111.01.
The integer part, 111_{2}, is found with new = old*2 + bit, working left to
right:

Old   Bit   New
0     1     1
1     1     3
3     1     7

The fraction part, .01_{2}, is found with new = (old + bit)/2, working right
to left:

Old   Bit   New
0     1     .5
.5    0     .25

So the integer part is 7, the fraction part is .25, and we get a final
result of 7.25.
Let's run an example of multiplication in
floating point. We'll use the same two numbers that we used for addition: 40200000 * 40980000.
First, we find the contents of the sign,
exponent, and significand fields. As before, this
gives us

40200000:
    Sign:        0
    Exponent:    10000000
    Significand: 1.01

40980000:
    Sign:        0
    Exponent:    10000001
    Significand: 1.0011
So now we apply the standard multiplication
algorithm.
· Determine the exponent by adding the operands'
exponents together. The only catch here is that we've left the exponents in
excess-127 notation; if we just add them, we'll get

e_{1} + 127 + e_{2} + 127 = e_{1} + e_{2} + 254

so we have to add the exponents and
subtract 127 (yes, we could have subtracted 127 from the exponent fields, added
them, and added 127 to the result. But the answer would have been the same, and
we would have gone to some extra work).

10000000 + 10000001 - 01111111 = 10000010
· Multiply the significands
using the standard multiplication algorithm:

       1.0011
     x   1.01
     --------
      .010011
      .00000
     1.0011
     --------
     1.011111

· Renormalize. If we'd wound up with two places to
the left of the binary point we would have had to shift one place to the right,
and add one to the exponent; 1.011111 is already normalized, so there's nothing
to do.
· Reconstruct the answer as an IEEE floating point
number:

0 10000010 01111100 = 413e0000
This time let's divide 42340000 / 41100000. We break the
numbers up into fields as before:

42340000:
    Sign:        0
    Exponent:    10000100
    Significand: 1.01101

41100000:
    Sign:        0
    Exponent:    10000010
    Significand: 1.001

· Determine the exponent by subtracting the
operands' exponents. This time the excesses will cancel out, so we need to add
the excess back in; we get

10000100 - 10000010 + 01111111 = 10000001
· Perform the standard fractional division
operation. Note: check my math here!

            1.01
           _______
     1.001 )1.01101
            1.001
            ------
             .0100
             .0000
            ------
             .01001
             .01001
            ------
                  0

So, our 24-bit significand is 1.01 followed by zeros.
· Renormalize. Our result is already normalized,
so we don't need to do this.
· Reconstruct the answer as an IEEE floating point
number:

0 10000001 0100 = 40a00000
IEEE FP uses a normalized representation where
possible, and also extends its range at the expense of normalization with denormalized numbers.
These ``denormals'' extend the range of the representation (at a cost in
precision for really small numbers). They have an exp field of 0, and represent
0.f * 2^{-126}: there is no hidden bit.
An exp field of ff is used for other goodies: if the fraction
field is 0, the value is ±infinity; any other fraction is Not A Number (NaN).
So we can express everything possible in the
format like this:
Sign   Exponent   Fraction   Represents            Notes
1      ff         != 0       NaN
1      ff         0          -infinity
1      01-fe      anything   -1.f * 2^{exp-127}
1      00         != 0       -0.f * 2^{-126}       (denormal)
1      00         0          -0                    (special case of last line)
0      00         0          0                     (special case of next line)
0      00         != 0       0.f * 2^{-126}        (denormal)
0      01-fe      anything   1.f * 2^{exp-127}
0      ff         0          infinity
0      ff         != 0       NaN
There are actually two classes of NaNs: if the most significant fraction bit is 1, it's a
"Quiet NaN" (QNaN),
identifying an indeterminate result. QNaNs can be
used in arithmetic, and propagate freely (so nothing breaks, but when you're
done you get a NaN instead of an answer). If the most significant fraction bit
is 0, it's a "Signaling NaN" (SNaN), which raises an exception as soon as you
try to operate on it.
Operations on the "special cases" are
well defined by the IEEE standard. Any operation involving a QNaN results in a QNaN; other
operations give results of:
Operation                  Result
n / ±Infinity              0
±Infinity × ±Infinity      ±Infinity
±nonzero / 0               ±Infinity
Infinity + Infinity        Infinity
±0 / ±0                    NaN
Infinity - Infinity        NaN
±Infinity / ±Infinity      NaN
±Infinity × 0              NaN
Double precision works just like single
precision, except it's 64 bits. The exponent is 11
bits (in excess-1023 notation), and the fraction is 52 bits.