Should I combine multiplication and division when working with floating point values?
I am aware of the precision issues with floats and doubles; here is why I am asking:
If I have a formula like: (a/PI)*180.0
(where PI is a constant)
Should I combine the division and multiplication so that only one division is used: a/0.017453292519943295769236
to avoid losing precision?
Does this mean the result is more accurate when fewer steps are used to calculate it?
Short answer
Yes, you should fold as many constant multiplications and divisions as possible into a single operation. It is (generally (*)) both faster and more accurate.
Neither π, π / 180, nor their reciprocals can be represented exactly in floating point. For this reason, the computation will involve at least one approximated constant (in addition to the rounding introduced by each operation). Since every operation introduces its own rounding error, performing the entire calculation in a single operation can be expected to be more accurate.
Is division or multiplication better in this case?
It is also a matter of luck whether π / 180 happens to be representable in floating-point format with better or worse relative accuracy than 180 / π.
My compiler provides additional precision with the type long double, so I can use it as a reference to answer this question for double:
~ $ cat t.c
#define PIL 3.141592653589793238462643383279502884197L
#include <stdio.h>
int main() {
long double heop = 180.L / PIL;
long double pohe = PIL / 180.L;
printf("relative acc. of π/180: %Le\n", (pohe - (double) pohe) / pohe);
printf("relative acc. of 180/π: %Le\n", (heop - (double) heop) / heop);
}
~ $ gcc t.c && ./a.out
relative acc. of π/180: 1.688893e-17
relative acc. of 180/π: -3.469703e-17
In ordinary programming practice, nobody would bother and would simply multiply by the floating-point representation of 180 / π, since multiplication is much faster than division. As it turns out, in the binary64 format to which double is almost always mapped, π / 180 can be represented with better relative accuracy than 180 / π, so π / 180 is the constant to use to optimize accuracy: a / ((double) (π / 180)). With this formula, the total relative error will be approximately the sum of the relative error of the constant (1.688893e-17) and the relative error of the division (which depends on the value of a, but will never be greater than 2^-53).
Alternative methods for faster and more accurate results
Note that division is so expensive that you can get a more accurate result faster by using one multiplication and one fma: let heop1 be the best double approximation of 180 / π, and heop2 the best double approximation of 180 / π − heop1. Then the best value for the result can be computed as:
double r = fma(a, heop1, a * heop2);
The fact that the above produces the best double approximation of the exact computation is a theorem (in fact, a theorem with exceptions; details can be found in the Handbook of Floating-Point Arithmetic). But even when the real constant you want to multiply a by is one of the exceptions to the theorem, the above calculation is still very accurate and differs from the best double approximation only for a few exceptional values of a.
If, like mine, your compiler provides more precision for long double than for double, you can also use a single long double multiplication:
// this is more accurate than a double division:
double r = (double)((long double) a * 57.295779513082320876798L);
This is not as good as the fma-based solution, but it is good enough that for most values of a it yields the optimal double approximation of the exact computation.
Counterexample to the general claim that operations should be grouped into one
(*) The claim that it is better to fold constants together is only statistically true for most constants.
If you wanted to multiply a by, say, the real constant 0.0000001 * DBL_MIN, you would be better off multiplying first by 0.0000001 and then by DBL_MIN; the final result (which can be a normal number if a is greater than 10,000,000 or so) would be more accurate than if you had multiplied by the best double representation of 0.0000001 * DBL_MIN. This is because the relative accuracy with which 0.0000001 * DBL_MIN can be represented as a single double (it is a subnormal) is much worse than the accuracy with which 0.0000001 can be represented.