Do IEEE-754 float, double and quad guarantee an exact representation of -2, -1, -0, 0, 1, 2?

All in the title: do IEEE-754 float, double and quad provide an exact representation of -2, -1, -0, 0, 1, 2?


3 answers


IEEE-754 guarantees an exact representation of all integers as long as the number of significant binary digits does not exceed the width of the significand (mantissa).
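A quick way to convince yourself is a round-trip check. The following is a minimal sketch, assuming a C99 compiler where float and double are IEEE-754 binary32 and binary64 (the common case): each of the small integers from the question is converted to float and double and back, and nothing changes.

```c
#include <assert.h>
#include <stdio.h>

int main(void) {
    const int values[] = { -2, -1, 0, 1, 2 };
    for (size_t i = 0; i < sizeof values / sizeof values[0]; ++i) {
        float  f = (float)values[i];   /* fits easily in a 24-bit significand */
        double d = (double)values[i];  /* fits easily in a 53-bit significand */
        assert((int)f == values[i] && (int)d == values[i]);
    }
    puts("all values round-trip exactly");
    return 0;
}
```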



+7




An easy way to get the answer for any decimal number is to convert its absolute value to binary (a 24-bit significand for float, 53 bits for double, 113 bits for quad), then convert back to decimal and see whether you get the same value.

For integers the answer is obvious: you lose nothing unless the value is too large to fit into the given number of bits.

More interesting is the conversion of rational values with a non-integer part. There you can lose precision when converting to a fixed-width binary representation, because the value may have a periodic binary expansion (0.1 does, for example) that must be truncated; converting back to decimal then yields a slightly different value (or you lose precision again if you round the decimal output).
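A minimal sketch of that round trip, assuming an IEEE-754 double (the usual case): printing the stored value with extra decimal digits shows whether the decimal constant survived the conversion to binary.

```c
#include <stdio.h>

int main(void) {
    /* 0.5 = 2^-1 has an exact binary representation, 0.1 does not. */
    printf("%.20f\n", 0.5);  /* prints 0.50000000000000000000 */
    printf("%.20f\n", 0.1);  /* prints 0.10000000000000000555... */
    return 0;
}
```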




Since you are working with IEEE floats, read the Wikipedia page first, then, when you feel ready for more, proceed to its first external link, What Every Computer Scientist Should Know About Floating-Point Arithmetic.

+3




IEEE 754 floating point numbers can be used to store integers of specific ranges. For example:

  • binary32, implemented in C/C++ as float, provides 24 bits of precision and can therefore represent 16-bit integers such as short int with full precision;
  • binary64, implemented in C/C++ as double, provides 53 bits of precision and can represent 32-bit integers such as int exactly;
  • Intel's non-standard 80-bit extended precision, implemented as long double by some x86/x64 compilers, provides 64 significant bits and can represent 64-bit integers exactly, e.g. long int (on LP64 systems such as Unix) or long long int (on LLP64 systems such as Windows);
  • binary128, implemented as compiler-specific types such as __float128 (GCC) or _Quad (Intel C/C++), provides 113 bits in the significand and can therefore represent 64-bit integers exactly.
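A rough illustration of these limits, assuming float and double map to IEEE-754 binary32 and binary64 (true on virtually all current platforms): 2^24 + 1 no longer fits in a 24-bit significand and 2^53 + 1 no longer fits in a 53-bit one, so both constants round to the neighbouring representable value.

```c
#include <stdio.h>

int main(void) {
    float  f = 16777217.0f;        /* 2^24 + 1: rounds to 16777216 */
    double d = 9007199254740993.0; /* 2^53 + 1: rounds to 9007199254740992 */
    printf("%.1f\n", f);           /* 16777216.0 */
    printf("%.1f\n", d);           /* 9007199254740992.0 */
    return 0;
}
```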

The fact that double covers an extended range of exact integers, exceeding the range of 32-bit integers, is exploited in JavaScript, which has no dedicated integer numeric type and instead uses double-precision floating point to represent integers.

One peculiarity of floating point numbers is that they have a separate sign bit, so there are things like positive and negative zero, which are not possible in two's complement integer notation.
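A small demonstration of signed zero, assuming IEEE-754 doubles: the two zeros compare equal, but the sign bit is still observable.

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double pz = 0.0, nz = -0.0;
    printf("%d\n", pz == nz);                               /* 1: +0.0 and -0.0 compare equal */
    printf("%d %d\n", signbit(pz) != 0, signbit(nz) != 0);  /* 0 1: the sign bits differ */
    printf("%f\n", 1.0 / nz);                               /* -inf: the sign shows through */
    return 0;
}
```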

+2








