Do the IEEE-754 float, double and quad formats guarantee an exact representation of -2, -1, -0, 0, 1, 2?
An easy way to get the answer for any decimal number is to convert the absolute value to binary (24 bits for float, 53 bits for double, 113 bits for quad), then go back to the decimal value and see if you get the same value.
For integers, the answer is obvious: you lose nothing unless the value is too large to fit into the given number of bits.
More interesting is the conversion of rational values with a non-integer part. There you can lose precision when converting to binary with a fixed width (0.1, for instance, has a periodic binary expansion), and when converting back to decimal you can end up with a value different from the one you started with (or lose precision again if you round it).
Since you are working with IEEE floats, read the Wikipedia page first, then, when you feel ready for more, proceed to its first external link, What Every Computer Scientist Should Know About Floating-Point Arithmetic.
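The round-trip idea above can be sketched in C as follows. This is only an approximation of the method: instead of starting from a decimal string, it starts from a double, narrows it to float and widens it back. The helper name exact_in_float is made up for illustration, and the input is assumed to lie within the range of float.

```c
#include <stdbool.h>

/* Hypothetical helper: returns true if the double value x survives a
   round trip through float unchanged, i.e. x is exactly representable
   in binary32. Assumes x is finite and within the range of float. */
bool exact_in_float(double x)
{
    /* volatile forces a real 32-bit store, so extra register precision
       on some targets cannot hide the rounding */
    volatile float narrowed = (float)x;
    return (double)narrowed == x;
}
```

With this sketch, exact_in_float returns true for each of -2.0, -1.0, -0.0, 0.0, 1.0 and 2.0, and false for 0.1, whose double approximation needs more than 24 significant bits.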
IEEE 754 floating-point numbers can store integers exactly within certain ranges. For example:
- binary32, implemented in C/C++ as float, provides 24 bits of precision and can therefore represent 16-bit integers, e.g. short int, with full precision;
- binary64, implemented in C/C++ as double, provides 53 bits of precision and can represent 32-bit integers, e.g. int, exactly;
- the non-standard Intel 80-bit extended precision format, implemented as long double by some x86/x64 compilers, provides 64 significant bits and can represent 64-bit integers exactly, e.g. long int (on LP64 systems such as Unix) or long long int (on LLP64 systems such as Windows);
- binary128, implemented as compiler-specific types such as __float128 (GCC) or _Quad (Intel C/C++), provides 113 bits in the mantissa and can therefore represent 64-bit integers exactly.
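The float and double limits in the list above can be checked with a small C sketch. It assumes float and double are IEEE 754 binary32 and binary64 (so FLT_MANT_DIG is 24 and DBL_MANT_DIG is 53); the function names are made up for illustration, and the round-trip helpers assume the input is far from the limits of long long.

```c
#include <float.h>
#include <stdbool.h>

/* Every integer in [-2^p, 2^p] is exactly representable when the type
   has p bits of precision. Assumes FLT_RADIX == 2, as in IEEE 754. */
long long float_int_limit(void)  { return 1LL << FLT_MANT_DIG; }  /* 2^24 */
long long double_int_limit(void) { return 1LL << DBL_MANT_DIG; }  /* 2^53 */

/* True if n converts to float and back unchanged.
   Assumes |n| is well below 2^63, so the cast back cannot overflow. */
bool int_exact_in_float(long long n)
{
    volatile float f = (float)n;   /* volatile: force a real 32-bit store */
    return (long long)f == n;
}

/* Same check for double, under the same assumption on n. */
bool int_exact_in_double(long long n)
{
    volatile double d = (double)n;
    return (long long)d == n;
}
```

Under these assumptions, every 16-bit value passes int_exact_in_float and every 32-bit value passes int_exact_in_double, while the first integers to fail are 2^24 + 1 for float and 2^53 + 1 for double.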
The fact that double covers a range of integers wider than that of 32-bit integers is exploited in JavaScript, which does not have a special integer numeric type but instead uses double-precision floating point to represent integers.
One of the peculiarities of floating-point numbers is that they have a separate sign bit, so there are both a positive and a negative zero, which is not possible in two's complement integer representation.
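A minimal C sketch of the signed-zero behaviour, assuming an IEEE 754 double; the function names are made up for illustration. The two zeros compare equal, so the sign bit has to be inspected explicitly with the standard signbit macro.

```c
#include <math.h>
#include <stdbool.h>

/* +0.0 and -0.0 are distinct bit patterns but compare equal. */
bool zeros_compare_equal(void)
{
    return 0.0 == -0.0;
}

/* True only for negative zero: the value is zero and the sign bit is set. */
bool is_negative_zero(double x)
{
    return x == 0.0 && signbit(x);
}
```

Here is_negative_zero(-0.0) returns true and is_negative_zero(0.0) returns false, even though zeros_compare_equal() returns true.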