Adding double value to unsigned 64 bit value gives strange results
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    uint64_t length = 0x4f56aa5d4b2d8a80;
    uint64_t new_length = 0;

    new_length = length + 119.000000;
    printf("new length 0x%" PRIx64 "\n", new_length);
    new_length = length + 238.000000;
    printf("new length 0x%" PRIx64 "\n", new_length);
    return 0;
}
With the above code, I am adding two different double values to an unsigned 64-bit integer, but I get the same result in both cases. The program output is shown below:
$./a.out
new length 0x4f56aa5d4b2d8c00
new length 0x4f56aa5d4b2d8c00
I would expect two different results, but that is not the case. I also tried casting the uint64_t value to double, as in
new_length = (double)length + 119.000000;
But that doesn't help either. Any idea on what the problem might be?
Floating-point arithmetic has limited precision. As the magnitude of a number grows, its low-order digits can no longer be represented.
0x4f56aa5d4b2d8a80 is a very large number (roughly 5.7 × 10^18).
What's going on in
new_length = length + 119.000000;
is that length is converted to double to do the addition. The rounding in that conversion is pretty dramatic because the value is so big. The result is then converted back to the integral type uint64_t when it is assigned to new_length.
When you evaluate
new_length = length + 238.000000;
the rounded result happens to end up the same.
What you really want to do is
new_length = length + (uint64_t)238.0;
This will give you the answer you want. The double is converted to an integral type first, so the addition is done exactly in integer arithmetic.
Since you are adding a floating-point operand, both operands are implicitly converted to double, and the addition is done using floating-point arithmetic.
However, double is not precise enough to hold either of the following values exactly:
0x4f56aa5d4b2d8a80 + 119.0 (requires 63 bits of precision)
0100111101010110101010100101110101001011001011011000101011110111
 <-------------------63 bits of precision---------------------->
0x4f56aa5d4b2d8a80 + 238.0 (requires 62 bits of precision)
0100111101010110101010100101110101001011001011011000101101101110
 <-------------------62 bits of precision--------------------->
IEEE 754 double precision has only 53 bits of precision.
As a result, both of them are rounded to the same final value:
0x4f56aa5d4b2d8c00 (53 bits of precision)
0100111101010110101010100101110101001011001011011000110000000000
 <-----------------53 bits of precision-------------->
If you want to avoid this rounding, you should avoid floating-point arithmetic by converting the operands to integers (or by just using the integer literals 119 and 238).