What is overflow and floating point overflow?
I feel like I don't understand the concepts of overflow and underflow, so I am asking this question to clarify them. I need to understand this at the most basic level, with bits. Let's work with a simplified 1-byte floating-point representation: 1 sign bit, 3 exponent bits, and 4 mantissa bits:
0 000 0000
The maximum value the exponent field can store is 111_2 = 7; subtracting the offset K = 2^2 - 1 = 3 gives 4, but that field value is reserved for Infinity and NaN. So the exponent of the maximum finite number is 3, which is 110 in biased binary.
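The bias arithmetic above can be checked with a small sketch (a toy calculation assuming the 1-3-4 format described, not any standard library facility):

```python
# Toy 8-bit float: 1 sign bit, 3 exponent bits, 4 mantissa bits.
EXP_BITS = 3
bias = 2 ** (EXP_BITS - 1) - 1        # K = 2^2 - 1 = 3
all_ones = 2 ** EXP_BITS - 1          # 111_2 = 7, reserved for Inf/NaN
max_exp_field = all_ones - 1          # 110_2 = 6, largest usable field
max_exponent = max_exp_field - bias   # 6 - 3 = 3

print(bias, all_ones, max_exp_field, max_exponent)  # 3 7 6 3
```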
So the bit pattern for the maximum number is:
0 110 1111 // positive
1 110 1111 // negative
When the exponent field is zero, the number is subnormal and has an implicit leading 0 instead of 1. So the bit pattern for the minimum (subnormal) number is:
0 000 0001 // positive
1 000 0001 // negative
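A small decoder for this toy format (my own sketch, assuming the implicit-leading-1 rule for normal numbers and implicit 0 for subnormals, with a subnormal exponent of 1 − K) shows what these patterns are worth:

```python
def decode_toy_float(bits: str) -> float:
    """Decode an 'S EEE MMMM' pattern of the toy 1-3-4 format (bias K = 3)."""
    s, e, m = bits.split()
    sign = -1.0 if s == '1' else 1.0
    exp_field = int(e, 2)
    fraction = int(m, 2) / 16.0            # 4 fraction bits
    bias = 3
    if exp_field == 0:                     # subnormal: implicit leading 0
        return sign * fraction * 2.0 ** (1 - bias)
    return sign * (1.0 + fraction) * 2.0 ** (exp_field - bias)

print(decode_toy_float('0 110 1111'))  # (1 + 15/16) * 2^3 = 15.5, the maximum
print(decode_toy_float('0 000 0001'))  # (1/16) * 2^-2 = 0.015625, the minimum
```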
I found these descriptions for single-precision floating point:
Negative numbers less than −(2 − 2^−23) × 2^127 (negative overflow)
Negative numbers greater than −2^−149 (negative underflow)
Positive numbers less than 2^−149 (positive underflow)
Positive numbers greater than (2 − 2^−23) × 2^127 (positive overflow)
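These single-precision limits can be verified by reinterpreting the known bit patterns of the largest finite float32 and the smallest positive subnormal (a sketch using Python's standard `struct` module to pack the raw bits):

```python
import struct

def f32_from_bits(u: int) -> float:
    """Reinterpret a 32-bit integer as an IEEE-754 single-precision float."""
    return struct.unpack('<f', struct.pack('<I', u))[0]

flt_max = f32_from_bits(0x7f7fffff)      # largest finite float32
flt_min_sub = f32_from_bits(0x00000001)  # smallest positive subnormal

print(flt_max == (2 - 2**-23) * 2**127)  # True
print(flt_min_sub == 2**-149)            # True
```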
Of these, I only understand positive overflow, which results in +Infinity; an example would be something like this:
0 110 1111 + 0 110 1111 = 0 111 0000
Can anyone demonstrate three other cases of overflow and underflow using the bit patterns described above?
Of course the following is implementation dependent, but if the numbers behave as IEEE-754 specifies, floating-point numbers do not wrap around to wildly wrong answers the way integers do: adding two positive numbers should never produce a negative number.
Instead, overflow means that the result is "too big to be represented". Depending on the rounding mode, the result usually either gets clamped to the maximum finite float (round toward zero, RTZ) or becomes Infinity (round to nearest even, RNE):
0 110 1111 * 0 110 1111 = 0 111 0000
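The same thing can be demonstrated with Python's built-in doubles, which use round-to-nearest-even: multiplying the largest finite value by 2 produces a result too big to represent, which rounds to +Infinity.

```python
import math
import sys

big = sys.float_info.max   # largest finite double, about 1.8e308
result = big * 2.0         # too big to represent -> +inf under RNE
print(result, math.isinf(result))  # inf True
```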
(Note that integer overflow could likewise be avoided in hardware by applying a similar clamping operation; it is just not the convention to do so.)
When dealing with floating-point numbers, the term underflow means that the result is "too small to be represented", which usually rounds to 0.0:
0 000 0001 * 0 000 0001 = 0 000 0000
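Again with Python doubles: halving the smallest positive subnormal (2^−1074) yields a value below the smallest representable number, which rounds to 0.0.

```python
tiny = 2.0 ** -1074   # smallest positive subnormal double
result = tiny * 0.5   # too small to represent -> rounds to 0.0
print(result)  # 0.0
```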
Note that I have also heard the term underflow used when the result is a very large negative number, but that is not the best term for it. When the result is negative and too large in magnitude to represent, it is better called "negative overflow":
0 110 1111 * 1 110 1111 = 1 111 0000
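The double-precision equivalent: negating the largest finite value and doubling it gives a result too large in magnitude to represent, which rounds to −Infinity.

```python
import math
import sys

result = -sys.float_info.max * 2.0  # negative, too large in magnitude -> -inf
print(result)  # -inf
```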