x87 FPU: compute e^x for a single-precision float x (IEEE 754)
I am trying to compute the function e^x (where x is a single-precision float) on the x87. For this I use the Taylor series (as a reminder: e^x = 1 + x^1/1! + x^2/2! + ... + x^n/n!).
When I work with the x87, I can calculate all intermediate values in extended precision (80 bits instead of 32 bits).
My implementation so far: I keep the current summand and the running sum in two separate FPU registers (plus some other less significant state). A loop updates the summand and adds it to the total. In rough pseudocode:
loop:
;update my summand
n = n + 1
summand = summand / n
summand = summand * x
;add the summand to the total sum
sum = sum + summand
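For reference, the loop above can be sketched in scalar C (a hypothetical stand-in for the x87 code, not the original; the exit test shown here is the simplest possible one, stopping once adding the term no longer changes the sum):

```c
#include <math.h>

/* Hypothetical C version of the Taylor-series loop above.
 * The loop exits once adding the summand no longer changes
 * the running sum at working precision. */
static double exp_taylor(double x)
{
    double sum = 1.0;     /* partial sum, starts at the n = 0 term */
    double summand = 1.0; /* current term x^n / n! */
    for (int n = 1; n < 200; n++) {
        summand = summand / n;   /* summand /= n  */
        summand = summand * x;   /* summand *= x  */
        double next = sum + summand;
        if (next == sum)         /* term too small to matter: exit */
            break;
        sum = next;
    }
    return sum;
}
```

For example, `exp_taylor(1.0)` converges to e to full double precision in about 18 iterations.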
Now my problem is the loop exit. I want to break out of the loop once adding a summand would no longer affect the value of the sum in single precision, but every way I can find to implement such an exit condition is very complicated and costs many instructions' worth of calculation time.
My only promising idea so far is this: 1. Get the exponents of the sum and the summand via FXTRACT. If (exp(sum) - exp(summand)) > 23, the summand can no longer affect the single-precision representation (because the single-precision mantissa has 23 bits), so exit. 2. Compare the summand with 0; if it is zero, it obviously won't affect the result either.
My question is: does anyone have a better idea for the exit condition than mine?
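To illustrate idea 1, here is a C sketch of the same exponent test, with frexpf() standing in for FXTRACT (the function name and zero handling are illustrative, not from the original code):

```c
#include <math.h>

/* Sketch of the proposed exit test: compare the binary exponents
 * of sum and summand, as FXTRACT would expose them.  The threshold
 * 23 is the single-precision mantissa width. */
static int summand_negligible(float sum, float summand)
{
    int exp_sum, exp_term;
    if (summand == 0.0f)       /* idea 2: a zero term has no effect */
        return 1;
    frexpf(sum, &exp_sum);     /* extract binary exponents */
    frexpf(summand, &exp_term);
    return (exp_sum - exp_term) > 23;
}
```

For instance, a summand of 1e-10 against a sum of 1.0 is reported as negligible, while a summand of 0.01 is not.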
Perhaps the most efficient way to compute e^x using the x87 floating-point unit is the following sequence of instructions:
; Computes e^x via the formula 2^(x * log2(e))
fldl2e ; st(0) = log2(e) <-- load log2(e)
fmul [x] ; st(0) = x * log2(e)
fld1 ; st(0) = 1 <-- load 1
; st(1) = x * log2(e)
fld st(1) ; st(0) = x * log2(e) <-- make copy of intermediate result
; st(1) = 1
; st(2) = x * log2(e)
fprem ; st(0) = partial remainder((x * log2(e)) / 1) <-- call this "rem"
; st(1) = 1
; st(2) = x * log2(e)
f2xm1 ; st(0) = 2^(rem) - 1
; st(1) = 1
; st(2) = x * log2(e)
faddp st(1), st(0) ; st(0) = 2^(rem) - 1 + 1 = 2^(rem)
; st(1) = x * log2(e)
fscale ; st(0) = 2^(rem) * 2^(trunc(x * log2(e)))
; st(1) = x * log2(e)
fstp st(1)
The result remains in st(0).
If you already have the input x on the floating-point stack, you can slightly change the sequence of instructions:
; st(0) == x
fldl2e
fmulp st(1), st(0)
fld1
fld st(1)
fprem
f2xm1
faddp st(1), st(0)
fscale
fstp st(1)
This multiplies x by log2(e), finds the remainder of dividing that product by 1, raises 2 to the power of that remainder (f2xm1 also subtracts 1, so we add 1 back), and finally scales the result by the truncated value of x * log2(e).
An alternative implementation, essentially the one proposed by fuz and resembling the code MSVC generates for the exp function from the C standard library, is this:
; st(0) == x
fldl2e
fmulp st(1), st(0)
fld st(0)
frndint
fsub st(1), st(0)
fxch ; pairable with FSUB on Pentium (P5)
f2xm1
fld1
faddp st(1), st(0)
fscale
fstp st(1)
The main difference is the use of frndint and fsub to get a value in the -1.0 to +1.0 range required by f2xm1, as opposed to using fprem to get the remainder after division by 1.
To get an idea of the relative costs of these instructions, we'll pull data from Agner Fog's instruction tables. A "?" indicates that the corresponding data is not available.
Instruction AMD K7 AMD K8 AMD K10 Bulldozer Piledriver Ryzen
-------------------------------------------------------------------------------------------
FLD/FLD1/FLDL2E [all very fast, 1-cycle instructions, with a reciprocal throughput of 1]
FADD(P)/FSUB(P) 1/4/1 1/4/1 1/4/1 1/5-6/1 1/5-6/1 1/5/1
FMUL(P) 1/4/1 1/4/1 1/4/1 1/5-6/1 1/5-6/1 1/5/1
FPREM 1/7-10/8 1/7-10/8 1/?/7 1/19-62/? 1/17-60/? 2/?/12-50
FRNDINT 5/10/3 5/10/3 6/?/37 1/4/1 1/4/1 1/4/3
FSCALE 5/8/? 5/8/? 5/9/29 8/52/? 8/44/5 8/9/4
F2XM1 8/27/? 53/126/? 8/65/30? 10/64-71/? 10/60-73/? 10/50/?
The notation used above is "ops / latency / reciprocal throughput".
Instruction 8087 287 387 486 P5 P6 PM Nehalem
-------------------------------------------------------------------------------------------
FLD 17-22 17-22 14 4 1/0 1/? 1/1 1/1
FLD1 15-21 15-21 24 4 2/0 2/? 2/? 2/?
FLDL2E 15-21 15-21 40 8 5/2 2/? 2/? 2/?
FADD(P)/FSUB(P) 70-100 70-100 23-34 8-20 3/2 1/3 1/3 1/3
FMUL(P) 90-145 90-145 29-57 16 3/2 1/5 1/5 1/5
FPREM 15-190 15-190 74-155 70-138 16-64 (2) 23/? 26/37 25/14
FRNDINT 16-50 16-50 66-80 21-30 9-20 (0) 30/? 15/19 17/22
FSCALE 32-38 32-38 67-86 30-32 20-32 (5) 56/? 28/43 24/12
F2XM1 310-630 310-630 211-476 140-279 13-57 (2) 17-48/66 ~20/? 19/58
For pre-P5, the value is simply the number of cycles. For P5, the notation is the number of cycles, with the value in parentheses being the number of cycles that can overlap with other FP instructions. For P6 and later, the notation is μops/latency.
Clearly f2xm1 is a slow, expensive instruction, but it is used by both implementations and is hard to avoid. And as slow as it is, it is still much faster than the Taylor-series loop.
(One possible way of dealing with the slow f2xm1 instruction, if you're willing to trade code size for speed, would be a lookup-table approach. You can find several articles published on this subject if you search, although most of them are behind paywalls. :-( There is one public link here. In fact, this is probably what an optimized math library written in C would do for its exp implementation. Ben Voigt wrote similar code in C#. The main question here is, as always, whether you are optimizing for size or for speed. In other words, do you call this function often and need it to be as fast as possible? Or do you call it infrequently and just need it to be precise, without consuming too much space in the binary and potentially slowing down other hot code by evicting it from the cache?)
But what about fprem vs. frndint?
Well, on AMD CPUs, fprem consistently decodes to fewer operations than frndint does, although frndint is faster than fprem on modern AMD CPUs. I'd still say it doesn't matter, because on those processors you would be writing SSE2/AVX code rather than using the legacy x87 FPU. (These instructions also get slower on later Intel microarchitectures such as Sandy Bridge, Haswell, etc., but the same argument applies there, so I ignored them altogether when compiling the data.) What really matters is how this code performs on older processors, and there the version using fprem looks like the clear winner.
On Intel processors, however, the opposite is true: frndint is usually faster than fprem, with the exception of the P6 family (Pentium Pro, Pentium II, and Pentium III).
But we're not just talking about frndint vs. fprem; we're really talking about frndint + fsub vs. fprem. fsub is a relatively fast instruction, but it becomes difficult to predict how this code will perform without actually executing and timing it. Cycle counts can only tell us so much, and the fprem sequence is shorter (fewer instructions and, more importantly, fewer bytes: 18 versus 22), which can mean meaningful speed improvements.
Either implementation is fine if you don't want to benchmark, or if you want something generic that works well on all processors. Neither is going to set the world on fire in terms of performance, but as I said earlier, both will be significantly faster than a loop that evaluates the Taylor series: there, the looping overhead is what kills your performance.