Divide unsigned long by size_t and assign the result to double
I need to divide an unsigned long int by a size_t (returned by a container's size()) like this:
vector<string> mapped_samples;
vector<double> mean;
vector<unsigned long> feature_sum;
/* elaboration here */
mean.at(index) = feature_sum.at(index) /mapped_samples.size();
But this way integer division happens and I lose the fractional part, which is not what I want.
So I can do:
mean.at(index) = feature_sum.at(index) / double(mapped_samples.size());
But this way feature_sum.at(index) (a temporary copy) is automatically converted to double, and I might lose precision. How can I resolve this issue? Should I be using some library?
My concern is that precision could be lost when converting an unsigned long to a double (I am worried that an unsigned long value can be too large for a double to hold exactly). The unsigned long value is a sum of features (all positive values). There can be 1,000,000 samples or more, and the sum of feature values can be enormous. The maximum feature value is 2000, so the sum can reach 2000 * 1000000 or more.
(I am using C++11)
You can use:
// Grab the integral part of the division
auto v1 = feature_sum.at(index)/mapped_samples.size();
// Grab the remainder of the division
auto v2 = feature_sum.at(index)%mapped_samples.size();
// v2 is smaller than size(), so converting it to double is unlikely to lose precision
mean.at(index) = v1 + static_cast<double>(v2)/mapped_samples.size();
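The same idea can be wrapped in a small function. This is only a minimal sketch: the name mean_from_sum is hypothetical (it does not appear in the question), and it assumes count is greater than zero, since dividing by zero is undefined.

```cpp
#include <cstddef>

// Hypothetical helper (not from the original post): mean of an integer sum
// over `count` samples. Assumes count > 0.
double mean_from_sum(unsigned long sum, std::size_t count) {
    // Integer part of the division, computed exactly in integer arithmetic
    const auto quotient = sum / count;
    // Remainder of the division, always smaller than count
    const auto remainder = sum % count;
    // Only the small remainder takes part in the floating-point division
    return static_cast<double>(quotient)
         + static_cast<double>(remainder) / static_cast<double>(count);
}
```

With the vectors from the question, the call would look like mean.at(index) = mean_from_sum(feature_sum.at(index), mapped_samples.size());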
You can't do better (if you want to store the result as a double) than the simple
std::uint64_t x=some_value, y=some_other_value;
auto mean = double(x)/double(y);
since the relative accuracy of the correct result, computed with float128 and then truncated to double,
auto improved = double(float128(x)/float128(y));
is usually the same (for typical input; there may be rare inputs where an improvement is possible). Both have a relative error dictated by the length of the mantissa of double (53 bits). So the simple answer is: either use a more precise type than double for your mean, or forget about this issue.
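To get a feel for when the conversion loses anything at all, a round-trip check can help. The sketch below assumes an IEEE 754 double with a 53-bit mantissa, as described above; the names kMaxExactInDouble and converts_exactly are made up for illustration.

```cpp
#include <cstdint>

// 2^53: with a 53-bit mantissa, every unsigned integer up to this value
// has an exact double representation (assumes IEEE 754 double).
constexpr std::uint64_t kMaxExactInDouble = std::uint64_t{1} << 53;

// Hypothetical check: does `value` survive a round trip through double unchanged?
bool converts_exactly(std::uint64_t value) {
    const double d = static_cast<double>(value);
    // Values near the top of the range can round up to 2^64, which does not
    // fit back into uint64_t; such a value cannot have converted exactly.
    if (d >= 18446744073709551616.0) {  // 2^64
        return false;
    }
    return static_cast<std::uint64_t>(d) == value;
}
```

Under that assumption, the sum mentioned in the question (2000 * 1000000, about 2e9) is far below 2^53 (about 9e15), so converting it to double would be exact.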
To see the relative accuracy, suppose
x=a*(1+e); // a=double(x)
y=b*(1+f); // b=double(y)
where e and f are of order 2^-53.
Then the "correct" quotient is, to first order in e and f,
(x/y) = (a/b) * (1 + e - f)
Converting this quotient to double results in another relative error of order 2^-53, that is, of the same order as the error of (a/b), the result of the naive
mean = double(x)/double(y).
Of course, e and f can conspire to cancel, in which case higher accuracy can be obtained by the methods suggested in other answers, but in general the accuracy cannot be improved.