What is overflow and underflow in floating point

I feel like I don't really understand the concepts of overflow and underflow, and I am asking this question to clarify them. I need to understand them at the most basic level, with bits. Let's work with a simplified 1-byte floating point representation: 1 sign bit, 3 exponent bits and 4 mantissa bits:

0 000 0000

      

The maximum exponent we can store is 111_2 = 7. Subtracting the offset (bias) K = 2^2 - 1 = 3 gives 4, which is reserved for Infinity and NaN. The exponent for the maximum number is therefore 3, which is 110 in biased binary.

So the bit pattern for the maximum number is:

0 110 1111 // positive
1 110 1111 // negative
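To check these patterns, here is a minimal sketch of a decoder for the toy format. The function name `decode` is made up for illustration, and it assumes the format follows the usual IEEE-754 conventions (bias 3, all-ones exponent reserved for Infinity and NaN, zero exponent meaning subnormal).

```python
def decode(bits: str) -> float:
    """Decode an 'S EEE MMMM' pattern of the 8-bit toy format (hypothetical helper)."""
    s = bits.replace(" ", "")
    sign = -1.0 if s[0] == "1" else 1.0
    exp = int(s[1:4], 2)      # biased exponent, 0..7
    man = int(s[4:], 2)       # 4 mantissa bits, 0..15
    bias = 3                  # K = 2^2 - 1

    if exp == 7:              # reserved: Infinity (mantissa 0) or NaN
        return sign * float("inf") if man == 0 else float("nan")
    if exp == 0:              # subnormal: implicit leading 0, exponent fixed at 1 - bias
        return sign * (man / 16) * 2.0 ** (1 - bias)
    return sign * (1 + man / 16) * 2.0 ** (exp - bias)  # normal: implicit leading 1


print(decode("0 110 1111"))  # 15.5
print(decode("1 110 1111"))  # -15.5
print(decode("0 111 0000"))  # inf
```

Under these assumptions the largest finite magnitude is 1.9375 × 2^3 = 15.5.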

      

When the exponent is zero, the number is subnormal and has an implicit leading 0 instead of 1. So the bit pattern for the minimum (smallest-magnitude nonzero) number is:

0 000 0001 // positive
1 000 0001 // negative
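The value behind that pattern can be worked out with a few lines of arithmetic. This is only a sketch, and it assumes the IEEE-754 convention that subnormals use the exponent 1 - bias (here 1 - 3 = -2) rather than 0 - bias; the constant names are illustrative.

```python
BIAS = 3            # K = 2^2 - 1
MANTISSA_BITS = 4

# Subnormal: 0.0001_2 x 2^(1 - BIAS), i.e. only the lowest mantissa bit is set.
smallest_subnormal = (1 / 2 ** MANTISSA_BITS) * 2.0 ** (1 - BIAS)

# Smallest normal number: 0 001 0000 = 1.0000_2 x 2^(1 - BIAS).
smallest_normal = 1.0 * 2.0 ** (1 - BIAS)

print(smallest_subnormal)  # 0.015625
print(smallest_normal)     # 0.25
```

So any nonzero result with magnitude well below 2^-6 = 0.015625 has no representation in this format.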

      

I found these descriptions for single precision floating point:

Negative numbers less than −(2 − 2^−23) × 2^127 (negative overflow)
Negative numbers greater than −2^−149 (negative underflow)
Positive numbers less than 2^−149 (positive underflow)
Positive numbers greater than (2 − 2^−23) × 2^127 (positive overflow)
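The two limits in that list can be checked against real single-precision bit patterns. The sketch below uses Python's standard `struct` module to reinterpret 32-bit integers as IEEE-754 binary32 values; the helper name `f32_from_bits` is an assumption made for this example.

```python
import struct


def f32_from_bits(bits: int) -> float:
    """Reinterpret a 32-bit unsigned integer as an IEEE-754 single (hypothetical helper)."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


max_finite = f32_from_bits(0x7F7FFFFF)     # 0 11111110 111...1 : largest finite single
min_subnormal = f32_from_bits(0x00000001)  # 0 00000000 000...1 : smallest positive subnormal

print(max_finite == (2 - 2 ** -23) * 2.0 ** 127)  # True
print(min_subnormal == 2.0 ** -149)               # True
```

These are the single-precision counterparts of `0 110 1111` and `0 000 0001` in the toy format.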

      

Of these, I only understand positive overflow, which results in +Infinity, and an example would be something like this:

0 110 1111 + 0 110 1111 = 0 111 0000 

      

Can anyone demonstrate three other cases of overflow and underflow using the bit patterns described above?



1 answer


Of course the following is implementation dependent, but if the numbers behave anything like what IEEE-754 specifies, floating point numbers do not overflow and underflow to a wildly wrong answer the way integers do. For example, you should never end up with two positive numbers being multiplied and producing a negative number.

Instead, overflow means that the result is "too large to represent". Depending on the rounding mode, it usually becomes either the largest finite float (round toward zero, RTZ) or Inf (round to nearest even, RNE):

0 110 1111 * 0 110 1111 = 0 111 0000
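As a rough illustration of how the two rounding modes differ at the top of the range, the sketch below models only the overflow region of the toy format. The function `store_large` and its constants are hypothetical, it assumes IEEE-754-style rounding, and it does not round in-range values onto the representable grid.

```python
import math

MAX_FINITE = 15.5        # 0 110 1111 in the toy format
HALF_ULP_AT_MAX = 0.25   # spacing at exponent 3 is 2^3 / 16 = 0.5

def store_large(exact: float, mode: str) -> float:
    """What the toy format would hold for a value near or beyond MAX_FINITE.

    mode is "RNE" (round to nearest, ties to even) or "RTZ" (round toward zero).
    Values with magnitude <= MAX_FINITE are returned unchanged (not modelled).
    """
    mag = abs(exact)
    if mag <= MAX_FINITE:
        result = mag
    elif mode == "RTZ":
        result = MAX_FINITE                  # clamp to the largest finite value
    elif mode == "RNE":
        # Below the halfway point to the next (unrepresentable) value 16,
        # the nearest value is still MAX_FINITE; at or above it, Infinity.
        result = MAX_FINITE if mag < MAX_FINITE + HALF_ULP_AT_MAX else math.inf
    else:
        raise ValueError("mode must be 'RNE' or 'RTZ'")
    return math.copysign(result, exact)

exact = 15.5 * 15.5                  # 240.25, the product from the example above
print(store_large(exact, "RNE"))     # inf
print(store_large(exact, "RTZ"))     # 15.5
```

The same function applied to -240.25 gives -inf and -15.5, since only the sign bit differs.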

      

(Note that integer overflow as you know it could have been avoided in hardware by applying a similar clamping operation; it's just not the convention to do so.)



When dealing with floating point numbers, the term underflow means that the number is "too small to represent", which usually results in 0.0:

0 000 0001 * 0 000 0001 = 0 000 0000
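The underflow case can be sketched the same way. Near zero the toy format can only hold multiples of 2^-6, so rounding to nearest amounts to rounding to a whole number of those steps. The helper `store_tiny` is hypothetical, assumes round-to-nearest-even, and is only meant for magnitudes below 0.25 (the subnormal range).

```python
import math

STEP = 2.0 ** -6   # 0 000 0001, the smallest positive subnormal of the toy format

def store_tiny(exact: float) -> float:
    """Round a value with |exact| < 0.25 onto the toy format's subnormal grid (RNE)."""
    steps = round(abs(exact) / STEP)       # Python's round() rounds ties to even
    return math.copysign(steps * STEP, exact)

product = STEP * STEP                      # 2^-12, far below half of STEP
print(store_tiny(product))                 # 0.0   (positive underflow)
print(store_tiny(-product))                # -0.0  (negative underflow)
```

The negative case keeps its sign bit, giving the pattern `1 000 0000`, which is what "negative underflow" in the question's list refers to.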

      

Note that I have also heard the term underflow used for overflow to a very large negative number, but that is not the best term for it. When the result is negative and too large in magnitude to represent, it is "negative overflow":

0 110 1111 * 1 110 1111 = 1 111 0000
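The same behaviour can be observed with ordinary Python floats, which are IEEE-754 doubles on essentially all current platforms (an assumption worth stating). This short sketch multiplies the largest finite double by itself and by its negation, mirroring the toy-format examples.

```python
import math
import sys

biggest = sys.float_info.max    # largest finite double

pos = biggest * biggest         # too large and positive
neg = biggest * -biggest        # too large in magnitude and negative

print(pos)                      # inf
print(neg)                      # -inf
print(neg == -math.inf)         # True
```

Note that the sign is preserved: the result is -inf, not some unrelated positive value as a wrapped integer might give.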

      
