Divide unsigned long by size_t and assign the result to double

Question

I need to divide an unsigned long int by a size_t (the number of elements returned by size()) like this:

vector<string> mapped_samples;
vector<double> mean;
vector<unsigned long> feature_sum;
/* elaboration here */
mean.at(index) = feature_sum.at(index) /mapped_samples.size();

but this way integer division happens and I lose the fractional part, which is not acceptable.

So I can do:

 mean.at(index) = feature_sum.at(index) / double(mapped_samples.size());

But this way feature_sum.at(index) (a temporary copy) is automatically converted to double, and I might lose precision. How can I resolve this issue? Should I be using some library?

Precision could be lost when converting an unsigned long to a double, because an unsigned long value can be larger than what a double can represent exactly. The unsigned long value is a sum of feature values (all positive). There can be 1,000,000 feature samples or more, and the sum of the feature values can be enormous. The maximum value of a feature is 2000, so the sum can reach 2000 * 1000000 or more.

(I am using C++11)

c++ precision c++11 arbitrary-precision

Umbert May 24 '17 at 16:23

3 answers

Severin Pappadeux · Answer 1 · 2017-05-24T16:30:44+0000

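You can try std::div (declared in <cstdlib>). It is overloaded only for signed integer types, so the unsigned operands have to be cast first; the cast below assumes both values fit in a long long.

Something along these lines:

auto dv = std::div(static_cast<long long>(feature_sum.at(index)),
                   static_cast<long long>(mapped_samples.size()));

double mean = dv.quot + dv.rem / double(mapped_samples.size());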

R Sahu · Answer 2 · 2017-05-24T16:31:55+0000

You can use:

// Grab the integral part of the division
auto v1 = feature_sum.at(index)/mapped_samples.size();

// Grab the remainder of the division
auto v2 = feature_sum.at(index)%mapped_samples.size();

// v2 is smaller than the sample count, so converting it to double is unlikely to lose precision
mean.at(index) = v1 + static_cast<double>(v2)/mapped_samples.size();
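The quotient-and-remainder idea above can be wrapped in a small function. This is only a sketch: the name split_mean and its signature are hypothetical, not from the question, and it assumes the sample count is non-zero.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (name and signature are illustrative only):
// mean of `sum` over `count` samples, computed as integral part plus fractional part.
double split_mean(unsigned long sum, std::size_t count) {
    assert(count != 0 && "count must be non-zero");
    const auto quotient  = sum / count;  // integral part, exact integer division
    const auto remainder = sum % count;  // always smaller than count
    // Only the small remainder takes part in the floating-point division.
    return static_cast<double>(quotient)
         + static_cast<double>(remainder) / static_cast<double>(count);
}
```

With the vectors from the question, the call would look like mean.at(index) = split_mean(feature_sum.at(index), mapped_samples.size());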

Walter · Answer 3 · 2017-05-24T17:00:02+0000

You can't do better (if you want to store the result as a double) than the simple

std::uint64_t x=some_value, y=some_other_value;
auto mean = double(x)/double(y);

since the relative accuracy of the truncated form of the correct result computed using float128

auto improved = double(float128(x)/float128(y))

is typically the same (for typical input; there may be rare inputs where an improvement is possible). Both have a relative error dictated by the length of the mantissa of a double (53 bits). So the simple answer is: either use a more precise type than double for your mean, or forget about this issue.
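To make the 53-bit limit concrete, here is a minimal sketch that checks whether a given 64-bit integer survives a round trip through double. The function name converts_exactly is an assumption for illustration, and the sketch assumes an IEEE 754 double with a 53-bit mantissa and the default round-to-nearest mode.

```cpp
#include <cstdint>

// Hypothetical helper: true when v converts to double and back unchanged.
bool converts_exactly(std::uint64_t v) {
    const double d = static_cast<double>(v);
    // 2^64 as a double. Converting a double this large back to uint64_t
    // would be undefined behaviour, so treat it as "not exact".
    if (d >= 18446744073709551616.0) {
        return false;
    }
    return static_cast<std::uint64_t>(d) == v;
}
```

Under those assumptions, a sum of the size mentioned in the question (2000 * 1000000) converts exactly, while 2^53 + 1 is the first integer that does not.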

To see the relative accuracy, suppose

x=a*(1+e);   // a=double(x)
y=b*(1+f);   // b=double(y)

where e, f are of order 2^-53.

Then the "correct" quotient is, to first order in e and f,

(x/y) = (a/b) * (1 + e - f)

Converting this quotient to double incurs another relative error of order 2^-53, that is, of the same order as the error of (a/b), the result of the naive

mean = double(x)/double(y).

Of course, e and f can conspire to cancel, in which case higher accuracy can be obtained by the methods suggested in the other answers, but typically the accuracy cannot be improved.
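As a rough way to check this claim on concrete inputs, the sketch below computes the mean both ways and reports the relative gap between them. All three function names are hypothetical, and the quotient-and-remainder variant is only one possible reading of the approach from the other answers; y is assumed to be non-zero.

```cpp
#include <cmath>
#include <cstdint>

// Naive approach: convert both operands to double, then divide.
double naive_mean(std::uint64_t x, std::uint64_t y) {
    return static_cast<double>(x) / static_cast<double>(y);
}

// Quotient-and-remainder approach (assumes y != 0).
double quot_rem_mean(std::uint64_t x, std::uint64_t y) {
    return static_cast<double>(x / y)
         + static_cast<double>(x % y) / static_cast<double>(y);
}

// Relative difference between the two results (absolute difference if the mean is 0).
double relative_gap(std::uint64_t x, std::uint64_t y) {
    const double a = naive_mean(x, y);
    const double b = quot_rem_mean(x, y);
    if (b == 0.0) {
        return std::fabs(a);
    }
    return std::fabs(a - b) / std::fabs(b);
}
```

For inputs of the size described in the question the gap is zero; for much larger sums the two results can differ, but only by a few units in the last place of a double.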
