Divide unsigned long by size_t and assign the result to double

I need to divide an unsigned long int by a size_t (returned by size() on a container), like this:

vector<string> mapped_samples;
vector<double> mean;
vector<unsigned long> feature_sum;
/* elaboration here */
mean.at(index) = feature_sum.at(index) /mapped_samples.size();

      

But this way integer division happens and I lose the fractional part, which is not what I want.

So I can do:

 mean.at(index) = feature_sum.at(index) / double(mapped_samples.size());

      

But this way feature_sum.at(index) (a temporary copy) is automatically converted to double, and I might lose precision. How can I resolve this issue? Should I be using some library?

There could be a loss of precision when converting an unsigned long to a double (because an unsigned long value can be too large for a double to represent exactly). The unsigned long value is the sum of the features (all positive values). There can be 1,000,000 feature samples or more, and the sum of the feature values can be enormous. The maximum value of a feature is 2000, so the sum can reach 2000 * 1000000 or more.

(I am using C++11)



3 answers


You can try std::div, along these lines:



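// std::div has only signed overloads, so cast (assumes the values fit in long long)
const auto n = static_cast<long long>(mapped_samples.size());
auto dv = std::div(static_cast<long long>(feature_sum.at(index)), n);

double mean = dv.quot + dv.rem / double(n);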

      



You can use:



// Grab the integral part of the division
auto v1 = feature_sum.at(index)/mapped_samples.size();

// Grab the remainder of the division
auto v2 = feature_sum.at(index)%mapped_samples.size();

// Dividing 1.0*v2 is unlikely to lose precision
mean.at(index) = v1 + static_cast<double>(v2)/mapped_samples.size();
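
As a minimal sketch, the quotient/remainder idea above could be wrapped in a standalone function. The name split_mean and the use of std::uint64_t for the sum are assumptions made for illustration; they do not appear in the original post.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the original post): mean of `sum` over
// `count` items, built from the integer quotient and remainder.
double split_mean(std::uint64_t sum, std::size_t count) {
    assert(count != 0 && "count must be non-zero");
    const std::uint64_t quot = sum / count;  // integral part, computed exactly
    const std::uint64_t rem = sum % count;   // remainder, always < count
    // Only the fractional part goes through a floating-point division.
    return static_cast<double>(quot)
         + static_cast<double>(rem) / static_cast<double>(count);
}
```

It would be called as mean.at(index) = split_mean(feature_sum.at(index), mapped_samples.size()). Note that the final addition still rounds to double, so this is not guaranteed to be more accurate than a plain floating-point division.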

      



You can't do better (if you want to store the result as double) than a simple

std::uint64_t x=some_value, y=some_other_value;
auto mean = double(x)/double(y);

      

since the relative accuracy of the properly rounded form of the exact result, computed using float128,

auto improved = double(float128(x)/float128(y))

      

is usually the same (for typical input; there may be rare inputs where an improvement is possible). Both have a relative error dictated by the length of the mantissa of double (53 bits). So the simple answer is: either use a more precise type than double for your mean, or forget about this problem.
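
As a rough illustration of the 53-bit limit mentioned above, the following sketch checks whether an integer converts to double without rounding. The function name exactly_representable is hypothetical, and the code assumes an IEEE 754 double with a 53-bit significand.

```cpp
#include <cstdint>
#include <limits>

static_assert(std::numeric_limits<double>::digits == 53,
              "sketch assumes an IEEE 754 double with a 53-bit significand");

// Hypothetical helper: true if converting x to double loses no bits.
// An integer is exactly representable when, after dropping trailing zero
// bits, the remaining odd part fits into the 53-bit significand.
bool exactly_representable(std::uint64_t x) {
    if (x == 0) return true;
    while ((x & 1u) == 0) x >>= 1;  // trailing zeros go into the exponent
    return x < (std::uint64_t{1} << std::numeric_limits<double>::digits);
}
```

Under that assumption, a sum such as 2000 * 1000000 lies far below 2^53, so converting it to double is exact and the only rounding happens in the division itself.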


To see the relative accuracy, suppose

x=a*(1+e);   // a=double(x)
y=b*(1+f);   // b=double(y)

      

where e, f are of order 2^-53.

Then the "correct" quotient is, to first order in e and f,

(x/y) = (a/b) * (1 + e - f)

      

Converting this to double incurs another relative error of order 2^-53, that is, of the same order as the error of (a/b), the result of the naive

mean = double(x)/double(y).

      

Of course, e and f can conspire to cancel, in which case higher accuracy can be obtained by the methods suggested in the other answers, but generally the accuracy cannot be improved.
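
One way to probe this claim empirically is to compare the naive result against a wider reference type. The sketch below is only an illustration: the names naive_mean and relative_difference are hypothetical, and it assumes that long double is wider than double, which holds for GCC and Clang on x86-64 but not for MSVC.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Naive approach from the answer above: convert both operands, then divide.
double naive_mean(std::uint64_t sum, std::size_t count) {
    return static_cast<double>(sum) / static_cast<double>(count);
}

// Relative difference between the naive result and a long double reference.
// Assumes sum > 0 and count > 0. Where long double has the same width as
// double (e.g. MSVC), the difference is always 0 and the check is vacuous.
double relative_difference(std::uint64_t sum, std::size_t count) {
    const long double reference =
        static_cast<long double>(sum) / static_cast<long double>(count);
    const long double naive = naive_mean(sum, count);
    return static_cast<double>(std::fabs((naive - reference) / reference));
}
```

For sums of the size described in the question, the difference should stay at or below the order of 2^-53, in line with the argument above.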
