Removing slow int64 division from a fixed point atan2() approximation
I made a function to compute a fixed point approximation of atan2(y, x). The problem is that of the roughly 83 cycles it takes to run the whole function, about 70 cycles (compiled with gcc 4.9.1 mingw-w64, -O3, on an AMD FX-6100) are spent entirely on one simple 64-bit integer division! And, unfortunately, neither operand of that division is constant. Can I speed up the division itself? Is there any way to get rid of it?
I think I need that division because, since I approximate atan2(y, x) with a 1D lookup table, I have to normalise the distance of the point represented by x, y to something like the unit circle or the unit square (I chose the unit diamond, which is a unit square rotated by 45°, and which gives fairly even precision across the positive quadrant). So the division computes (|y| - |x|) / (|y| + |x|). Note that the divisor fits in 32 bits, and the numerator is a 32-bit value shifted left by 29 bits, so the division produces 29 fractional bits. Using floating point division is not an option, as this function must not use floating point arithmetic.
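To make the mapping concrete, here is a plain floating point sketch of the same idea (purely illustrative, with a made-up name ref_atan2_turns; the real function of course has to stay integer-only):

#include <math.h>

// reference: atan2(y, x) in turns, via the unit-diamond normalisation
double ref_atan2_turns(double y, double x)
{
    double ya = fabs(y), xa = fabs(x), d = ya + xa;
    if (d == 0.)
        return 0.;                                   // both inputs zero
    // project (xa, ya) onto the unit diamond |x| + |y| = 1; p sweeps 0..1
    // across the first quadrant and is what would index the 1D LUT
    double p = 0.5 * (1. - (ya - xa) / d);           // equals xa / (xa + ya)
    double r = atan2(1. - p, p) / 6.283185307179586; // first-quadrant angle in turns
    if (x < 0.) r = 0.5 - r;                         // mirror for the left half-plane
    return (y < 0.) ? -r : r;                        // negate for the lower half-plane
}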
Any ideas? I can't think of anything to improve this (and I can't figure out why the division alone takes 70 cycles). Here's the complete function for reference:
int32_t fpatan2(int32_t y, int32_t x) // does the equivalent of atan2(y, x)/2pi, y and x are integers, not fixed point
{
#include "fpatan.h" // includes the atan LUT as generated by tablegen.exe, the entry bit precision (prec), LUT size power (lutsp) and how many max bits |b-a| takes (abdp)
const uint32_t outfmt = 32; // final output format in s0.outfmt
const uint32_t ofs=30-outfmt, ds=29, ish=ds-lutsp, ip=30-prec, tp=30+abdp-prec, tmask = (1<<ish)-1, tbd=(ish-tp); // ds is the division shift, the shift for the index, bit precision of the interpolation, the mask, the precision for t and how to shift from p to t
const uint32_t halfof = 1UL<<(outfmt-1); // represents 0.5 in the output format, which since it is in turns means half a circle
const uint32_t pds=ds-lutsp; // division shift and post-division shift
uint32_t lutind, p, t, d;
int32_t a, b, xa, ya, xs, ys, div, r;
xs = x >> 31; // equivalent of fabs()
xa = (x^xs) - xs;
ys = y >> 31;
ya = (y^ys) - ys;
d = ya+xa;
if (d==0) // if both y and x are 0 then they add up to 0 and we must return 0
return 0;
// the following does 0.5 * (1. - (y-x) / (y+x))
// (y+x) is u1.31, (y-x) is s0.31, div is in s1.29
div = ((int64_t) (ya-xa)<<ds) / d; // '/d' normalises distance to the unit diamond, immediate result of division is always <= +/-1^ds
p = ((1UL<<ds) - div) >> 1; // before shift the format is s2.29. position in u1.29
lutind = p >> ish; // index for the LUT
t = (p & tmask) >> tbd; // interpolator between two LUT entries
a = fpatan_lut[lutind];
b = fpatan_lut[lutind+1];
r = (((b-a) * (int32_t) t) >> abdp) + (a<<ip); // linear interpolation of a and b by t in s0.32 format
// Quadrants
if (xs) // if x was negative
r = halfof - r; // r = 0.5 - r
r = (r^ys) - ys; // if y was negative then r is negated
return r;
}
Based on @Iwillnotexist Idonotexist's suggestion of using lzcnt, a reciprocal and multiplication, I implemented a division function that runs in about 23.3 cycles and with quite high precision, 1 part in 19 million, using 1.5 kB of LUT; for example, one of the worst cases is 1428769848 / 1080138864, which gives 1.3227648959 instead of 1.3227649663.
I came up with an interesting technique while researching this. I was really trying to think of something that could be both fast and accurate, since not even a quadratic approximation of 1/x over [0.5, 1.0] combined with an interpolated difference LUT would be good enough. Then I had the idea of doing it the other way around: I made a lookup table that contains quadratic coefficients fitted to the curve over a short segment representing 1/128th of the [0.5, 1.0] range, which gives a very small error. Using the 7 most significant bits of what x represents in the [0.5, 1.0] range as the LUT index, I directly get the coefficients that best fit the segment that x falls into.
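For illustration, here is a minimal sketch of one way such a coefficient table could be generated (not necessarily what tablegen.exe actually does; the 3-point fit and the rounding are my assumptions): fit an exact quadratic through the endpoints and midpoint of 1/x on each of the 128 segments, then scale the coefficients to the 0.27 format that fpdiv() expects.

#include <stdio.h>
#include <math.h>

int main(void)
{
    const int lutsp = 7, prec = 27;               // 2^7 = 128 segments, 0.27 format
    const int segs = 1 << lutsp;
    const double scale = (double) (1 << prec);
    for (int i = 0; i < segs; i++)
    {
        double x0 = 0.5 + 0.5 * i / segs;         // segment start
        double x1 = 0.5 + 0.5 * (i + 1) / segs;   // segment end
        double xm = 0.5 * (x0 + x1), h = xm - x0; // midpoint and half-width
        double y0 = 1. / x0, ym = 1. / xm, y1 = 1. / x1;
        // Newton divided differences, then expand to c2*x^2 + c1*x + c0
        double a1 = (ym - y0) / h;
        double a2 = (y1 - 2. * ym + y0) / (2. * h * h);
        double c2 = a2;
        double c1 = a1 - a2 * (x0 + xm);
        double c0 = y0 - a1 * x0 + a2 * x0 * xm;
        printf("%ld, %ld, %ld,\n", lround(c0 * scale), lround(c1 * scale), lround(c2 * scale));
    }
    return 0;
}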
Here's the complete code, using the lookup tables ffo_lut.h and fpdiv.h:
#include "ffo_lut.h"
static INLINE int32_t log2_ffo32(uint32_t x) // returns the number of bits up to the most significant set bit so that 2^return > x >= 2^(return-1)
{
int32_t y;
y = x>>21; if (y) return ffo_lut[y]+21;
y = x>>10; if (y) return ffo_lut[y]+10;
return ffo_lut[x];
}
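On GCC or Clang the same count can also be obtained without a table, assuming x != 0 (just a sketch; plain static inline is used here instead of the INLINE macro from the original headers):

#include <stdint.h>

static inline int32_t log2_ffo32_clz(uint32_t x) // same result as log2_ffo32 for x != 0
{
    return 32 - __builtin_clz(x);                // __builtin_clz is undefined for x == 0
}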
// Usage note: for fixed point inputs make outfmt = desired format + format of x - format of y
// The caller must make sure not to divide by 0. Division by 0 causes a crash due to a negative-index table lookup
static INLINE int64_t fpdiv(int32_t y, int32_t x, int32_t outfmt) // ~23.3 cycles, max error (by division) 53.39e-9
{
#include "fpdiv.h" // includes the quadratic coefficients LUT (1.5 kB) as generated by tablegen.exe, the format (prec=27) and LUT size power (lutsp)
const int32_t *c;
int32_t xa, xs, p, sh;
uint32_t expon, frx, lutind;
const uint32_t ish = prec-lutsp-1, cfs = 31-prec, half = 1L<<(prec-1); // the shift for the index, the shift for 31-bit xa, the value of 0.5
int64_t out;
int64_t c0, c1, c2;
// turn x into xa (|x|) and sign of x (xs)
xs = x >> 31;
xa = (x^xs) - xs;
// decompose |x| into frx * 2^expon
expon = log2_ffo32(xa);
frx = (xa << (31-expon)) >> cfs; // the fractional part is now in 0.27 format
// lookup the 3 quadratic coefficients for c2*x^2 + c1*x + c0 then compute the result
lutind = (frx - half) >> ish; // range becomes [0, 2^26 - 1], in other words 0.26, then >> (26-lutsp) so the index is lutsp bits
lutind *= 3; // 3 entries for each index
c = &fpdiv_lut[lutind]; // c points to the correct c0, c1, c2
c0 = c[0]; c1 = c[1]; c2 = c[2];
p = (int64_t) frx * frx >> prec; // x^2
p = c2 * p >> prec; // c2 * x^2
p += c1 * frx >> prec; // + c1 * x
p += c0; // + c0, p = (1.0 , 2.0] in 2.27 format
// apply the necessary bit shifts and reapplies the original sign of x to make final result
sh = expon + prec - outfmt; // calculates the final needed shift
out = (int64_t) y * p; // format is s31 + 1.27 = s32.27
if (sh >= 0)
out >>= sh;
else
out <<= -sh;
out = (out^xs) - xs; // if x was negative then out is negated
return out;
}
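For reference, a hypothetical usage sketch (assuming fpdiv() and its tables are compiled in; the values are only examples, including the worst case quoted above):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    // plain 32-bit integers, result with 29 fractional bits (as used in fpatan2)
    int64_t r1 = fpdiv(1428769848, 1080138864, 29);
    // two Q16.16 fixed point inputs; outfmt = 16 (wanted) + 16 (x) - 16 (y) = 16
    int64_t r2 = fpdiv(3 << 16, 2 << 16, 16);     // 3.0 / 2.0
    printf("%.9f\n", r1 / (double) (1 << 29));    // ~1.322764896
    printf("%.4f\n", r2 / (double) (1 << 16));    // 1.5000
    return 0;
}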
I think ~23.3 cycles is about as good as this is going to get, but if you have any ideas to shave off a few cycles please let me know.
As for the fpatan2() question, the solution is to replace this line:
div = ((int64_t) (ya-xa)<<ds) / d;
with this line:
div = fpdiv(ya-xa, d, ds);
Unfortunately, a ~70 cycle latency is typical of 64-bit integer division on x86 CPUs. Floating point division typically has around half that latency or less. The increased cost comes from the fact that modern CPUs only have dividers in their floating point execution units (they are very expensive in terms of silicon area), so integers have to be converted to floating point and back again. So simply substituting a floating point division for the integer one is unlikely to help. You would need to restructure your code to use floating point throughout to take advantage of the faster floating point division.
If you can restructure your code, you may also be able to use the approximate floating point reciprocal instruction RCPSS, if you don't need an exact answer. It has a latency of about 5 cycles.
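For example, via intrinsics (a sketch assuming SSE is available; approx_recip is a made-up helper): _mm_rcp_ss gives roughly 12 bits of precision, and one Newton-Raphson step brings it close to full single precision.

#include <xmmintrin.h>

static inline float approx_recip(float x)
{
    float r = _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x))); // ~12-bit approximate 1/x
    return r * (2.0f - x * r);                          // one Newton-Raphson refinement
}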
The instruction hogging your time:
div = ((int64_t) (ya-xa)<<ds) / d;
exhibits at least two problems. The first is that the variable name div shadows the standard library function div(); but that is a minor point with no real effect here. The second is that, according to the rules of the C language, both operands are first converted to a common type, which is int64_t, and the division for that type is then expanded into a CPU instruction that divides a 128-bit dividend by a 64-bit divisor (!). An excerpt from the disassembly of a cut-down version of your function:
21: 48 89 c2 mov %rax,%rdx
24: 48 c1 fa 3f sar $0x3f,%rdx ## this is sign bit extension
28: 48 f7 fe idiv %rsi
Yes, this division takes about 70 cycles and cannot be optimised away (well, actually it can, but an approach via a reciprocal of the divisor, for example, requires a multiplication producing a 192-bit product). But if you are sure this division can be done with a 64-bit dividend and a 32-bit divisor without overflowing (i.e. the quotient fits in 32 bits), and I agree that ya-xa is always smaller in absolute value than ya+xa, then it can be sped up by using an explicit inline assembly request:
uint64_t tmp_num = ((int64_t) (ya-xa))<<ds;   // 64-bit dividend
asm("idivl %[d]" :                            // divides edx:eax by d
    [a] "=a" (div1) :                         // quotient comes back in eax
    "[a]" (tmp_num), "d" (tmp_num >> 32),     // low and high halves of the dividend
    [d] "q" (d) :
    "cc");
it's quick and dirty and should be checked thoroughly, but I hope the idea is clear. The resulting assembly now looks like this:
18: 48 98 cltq
1a: 48 c1 e0 1d shl $0x1d,%rax
1e: 48 89 c2 mov %rax,%rdx
21: 48 c1 ea 20 shr $0x20,%rdx
27: f7 f9 idiv %ecx
This looks like a huge win, because according to Intel's optimization manual a 64/32 division takes up to 25 clock cycles on the Core family, instead of the ~70 you see for the 128/64 division.
Minor improvements can be added; for example, the shifts can be done even more economically, in parallel:
uint32_t diff = ya - xa;
uint32_t lowpart = diff << 29;
uint32_t highpart = diff >> 3;
asm("idivl %[d]" :
[a] "=a" (div1) :
"[a]" (lowpart), "d" (highpart), [d] "q" (d) :
"cc");
that leads to:
18: 89 d0 mov %edx,%eax
1a: c1 e0 1d shl $0x1d,%eax
1d: c1 ea 03 shr $0x3,%edx
22: f7 f9 idiv %ecx
but these are minor gains compared to the division itself.
In conclusion, I really doubt this routine is worth implementing in C. The language is quite uneconomical with this kind of integer arithmetic, forcing useless widening conversions and wasted upper halves. The whole routine is worth moving to assembler.
Given an implementation of fpatan(), you can simply implement fpatan2() in terms of it.
Assuming constants are defined for pi and pi/2:
int32_t fpatan2( int32_t y, int32_t x )
{
    fixed theta ;
if( x == 0 )
{
theta = y > 0 ? fixed_half_pi : -fixed_half_pi ;
}
else
{
theta = fpatan( y / x ) ;
if( x < 0 )
{
theta += ( y < 0 ) ? -fixed_pi : fixed_pi ;
}
}
return theta ;
}
Note that fixed point libraries are easy to get wrong. You might take a look at Optimizing Math-Intensive Applications with Fixed-Point Arithmetic. The use of C++ in the library discussed there makes the code much simpler; in most cases you can just replace the float or double keyword with fixed. It does not, however, have an atan2() implementation; the code above is adapted from my implementation for that library.