Adding a double value to an unsigned 64-bit value gives strange results

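#include <inttypes.h>   /* PRIx64 */
#include <stdint.h>     /* uint64_t */
#include <stdio.h>      /* printf */

int main(int argc, char *argv[])
{
    uint64_t length = 0x4f56aa5d4b2d8a80;
    uint64_t new_length = 0;

    /* length is converted to double before the addition */
    new_length = length + 119.000000;

    printf("new length  0x%" PRIx64 "\n", new_length);

    new_length = length + 238.000000;

    printf("new length  0x%" PRIx64 "\n", new_length);

    return 0;
}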

      

With the above code, I am adding two different double values to an unsigned 64-bit integer. I get the same result in both cases. The program output is shown below:

$./a.out
new length  0x4f56aa5d4b2d8c00
new length  0x4f56aa5d4b2d8c00

      

I would expect two different results, but this is not the case. I also tried casting the uint64_t value to double, as in

new_length = (double)length + 119.000000;

      

But that doesn't help either. Any idea on what the problem might be?

+3




2 answers


Floating-point arithmetic is not exact. As the magnitude of a number increases, the precision available for its low-order digits decreases.

0x4f56aa5d4b2d8a80 is a very large number.

Here is what happens in

new_length = length + 119.000000;

length is converted to double to do the addition. Because the value is so large, this conversion rounds it quite dramatically. The result is then converted back to the integral type uint64_t when it is assigned to new_length.

When you call



new_length = length + 238.000000; 

      

It happens that the rounded result ends up the same.

What you really want to do is

new_length = length + (uint64_t)238.0; 

      

This will give you the answer you want. It converts the double to an integral type first, so the addition is done exactly in integer arithmetic.

+3




Since you are adding a floating-point operand, the integer operand is implicitly converted to double, and the addition is done using floating-point arithmetic.

However, double is not precise enough to hold either of the following values exactly:

0x4f56aa5d4b2d8a80 + 119.0  (requires 63 bits of precision)

0100111101010110101010100101110101001011001011011000101011110111
 <-------------------63 bits of precision---------------------->


0x4f56aa5d4b2d8a80 + 238.0  (requires 62 bits of precision)

0100111101010110101010100101110101001011001011011000101101101110
 <-------------------62 bits of precision--------------------->

      

IEEE standard double precision has 53 bits of precision.



As a result, both of them are rounded to the same final value:

0x4f56aa5d4b2d8c00  (53 bits of precision)

0100111101010110101010100101110101001011001011011000110000000000
 <-----------------53 bits of precision-------------->

      


If you want to avoid this rounding, you should avoid floating-point arithmetic by converting the operands to integers (or just use the integer literals 119 and 238).

+7

