Should I combine multiplication and division when working with floating point values?

I am aware of the precision issues with floats and doubles; this is why I am asking:

If I have a formula like: (a/PI)*180.0

(where PI is a constant)

Should I combine the division and the multiplication, so that only a single division remains: a / 0.017453292519943295769236

to avoid losing precision?

Does this mean the result is more accurate when fewer operations are used to compute it?



1 answer


Short answer

Yes, you should combine as many constant multiplications and divisions as possible into one operation. It is (generally(*)) both faster and more accurate.

Neither π, π / 180, nor their inverses are exactly representable in floating point. For this reason, the computation will involve at least one approximated constant (in addition to the approximation introduced by each of the operations involved).

Since each floating-point operation introduces its own rounding error, performing the entire computation in a single operation can be expected to be more accurate.

Is division or multiplication better in this case?

It is also a matter of "luck" whether π / 180 can be represented in floating point with better or worse relative accuracy than 180 / π.

My compiler provides extra precision for the type long double, so I can use it as a reference to answer this question for double:

~ $ cat t.c
#define PIL 3.141592653589793238462643383279502884197L

#include <stdio.h>

int main() {

  long double heop = 180.L / PIL;
  long double pohe = PIL / 180.L;
  printf("relative acc. of π/180: %Le\n", (pohe - (double) pohe) / pohe);
  printf("relative acc. of 180/π: %Le\n", (heop - (double) heop) / heop);
}
~ $ gcc t.c && ./a.out 
relative acc. of π/180: 1.688893e-17
relative acc. of 180/π: -3.469703e-17

      

In ordinary programming practice, nobody would bother and would simply multiply by the (floating-point representation of) 180 / π, since multiplication is much faster than division. As it turns out, for the binary64 format that double almost always maps to, π / 180 can be represented with better relative accuracy than 180 / π, so π / 180 is the constant to use to optimize accuracy: a / ((double) (π / 180)). With this formula, the total relative error will be approximately the sum of the relative error of the constant (1.688893e-17) and the relative error of the division (which depends on the value of a, but is never larger than 2^-53).

Alternative methods for faster and more accurate results



Note that division is so expensive that you can get a more accurate result faster by using one multiplication and one fma: let heop1 be the best double approximation of 180 / π, and heop2 the best double approximation of 180 / π − heop1. Then the best value for the result can be computed as:

double r = fma(a, heop1, a * heop2);

      

The fact that the above computes the absolute best double approximation of the real computation is a theorem (with a few exceptions, actually; details can be found in the Handbook of Floating-Point Arithmetic). But even when the real constant you want to multiply by to obtain a double result is one of the exceptions to the theorem, the above computation is still very accurate, differing from the best double approximation only for a few exceptional values of a.


If, like mine, your compiler provides more precision for long double than for double, you can also use a single long double multiplication:

// this is more accurate than the double division:
double r = (double) ((long double) a * 57.295779513082320876798L);

      

This is not as good as the fma-based solution, but it is good enough that, for most values of a, it produces the optimal double approximation of the real computation.

Counterexample to the general claim that operations should be grouped into one

(*) The claim that it is better to group constants into a single one is only true for most constants, statistically speaking.

If you wanted to multiply a by, say, the real constant 0.0000001 * DBL_MIN, you would be better off multiplying first by 0.0000001 and then by DBL_MIN; the final result (which may be a normal number if a is greater than 10,000,000 or so) would be more accurate than if you had multiplied by the best double representation of 0.0000001 * DBL_MIN. This is because the relative accuracy with which 0.0000001 * DBL_MIN can be represented as a single double is much worse than the relative accuracy with which 0.0000001 can be represented.


