x87 FPU: computing e^x for a single-precision float x (IEEE 754)

I am trying to compute the function e^x (x is a single-precision float) on the x87. For this I use the Taylor series (as a reminder: e^x = 1 + x^1/1! + x^2/2! + ... + x^n/n!).

When I work with the x87, I can calculate all intermediate values in extended precision (80 bits instead of 32 bits).

My implementation so far: I keep a summand and a running sum in two separate FPU registers (plus some other less significant state). A loop updates the summand and adds it to the sum. In rough pseudocode:

loop:
    ; update the summand
    n = n + 1
    summand = summand / n
    summand = summand * x
    ; add the summand to the running sum
    sum = sum + summand


Now my problem is the loop exit. I want to break out of the loop once adding a summand would no longer affect the value of the sum in single precision. However, I've found that implementing such an exit condition is complicated and costs many instructions, i.e. computation time.

My only idea so far is this: 1. Get the exponents of the sum and the summand via FXTRACT. If exp(sum) - exp(summand) > 23, the summand can no longer affect the single-precision representation of the sum (because the single-precision significand has 23 explicit bits), so exit. 2. Compare the summand with 0; if it is zero, it obviously won't affect the result either.
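This exit strategy can be sketched in C, with frexp standing in for FXTRACT and double standing in for the x87's 80-bit temporaries (the helper name exp_taylor and the exact loop shape are my own illustration, not from the question):

```c
#include <math.h>

/* Sketch of the proposed loop exit: stop once the summand's binary
   exponent is more than 23 below the sum's, or once it underflows to 0.
   frexp() extracts the exponent, much like FXTRACT would on the x87. */
float exp_taylor(float x) {
    double sum  = 1.0;   /* running sum, starts at the n = 0 term */
    double term = 1.0;   /* current summand x^n / n! */
    int n = 0;
    for (;;) {
        n++;
        term = term * x / n;   /* x^n / n! from x^(n-1) / (n-1)! */
        sum += term;
        if (term == 0.0)       /* condition 2: summand underflowed */
            break;
        int es, et;
        frexp(sum,  &es);      /* binary exponent of the sum */
        frexp(term, &et);      /* binary exponent of the summand */
        if (es - et > 23)      /* condition 1: summand below the last
                                  bit of the single-precision result */
            break;
    }
    return (float)sum;
}
```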

My question is: does anyone have a better idea for the loop exit condition than mine?

1 answer


Perhaps the most efficient way to compute e^x using the x87 floating-point unit is the following instruction sequence:

; Computes e^x via the formula 2^(x * log2(e))
fldl2e                  ; st(0) = log2(e)        <-- load log2(e)
fmul    [x]             ; st(0) = x * log2(e)
fld1                    ; st(0) = 1              <-- load 1
                        ; st(1) = x * log2(e)
fld     st(1)           ; st(0) = x * log2(e)    <-- make copy of intermediate result
                        ; st(1) = 1
                        ; st(2) = x * log2(e)
fprem                   ; st(0) = partial remainder((x * log2(e)) / 1)  <-- call this "rem"
                        ; st(1) = 1
                        ; st(2) = x * log2(e)
f2xm1                   ; st(0) = 2^(rem) - 1
                        ; st(1) = 1
                        ; st(2) = x * log2(e)
faddp   st(1), st(0)    ; st(0) = 2^(rem) - 1 + 1 = 2^(rem)
                        ; st(1) = x * log2(e)
fscale                  ; st(0) = 2^(rem) * 2^(trunc(x * log2(e)))
                        ; st(1) = x * log2(e)
fstp    st(1)


The result is left in st(0).

If you already have the input x on the floating-point stack, you can use a slightly modified sequence of instructions:

; st(0) == x
fldl2e                  ; st(0) = log2(e), st(1) = x
fmulp   st(1), st(0)    ; st(0) = x * log2(e)
fld1                    ; st(0) = 1, st(1) = x * log2(e)
fld     st(1)           ; st(0) = x * log2(e), st(1) = 1, st(2) = x * log2(e)
fprem                   ; st(0) = partial remainder((x * log2(e)) / 1)
f2xm1                   ; st(0) = 2^(rem) - 1
faddp   st(1), st(0)    ; st(0) = 2^(rem), st(1) = x * log2(e)
fscale                  ; st(0) = 2^(rem) * 2^(trunc(x * log2(e)))
fstp    st(1)           ; result in st(0)


This multiplies x by log2(e), finds the remainder after dividing that product by 1, raises 2 to that value (f2xm1 also subtracts 1, so we add 1 back afterwards), and finally scales the result by 2 raised to the integer part of x * log2(e).


An alternative implementation, essentially the one proposed by fuz and resembling the code MSVC generates for the exp function from the C standard library, is this:

; st(0) == x
fldl2e                   ; st(0) = log2(e), st(1) = x
fmulp    st(1), st(0)    ; st(0) = x * log2(e)
fld      st(0)           ; st(0) = st(1) = x * log2(e)
frndint                  ; st(0) = round(x * log2(e)), st(1) = x * log2(e)
fsub     st(1), st(0)    ; st(1) = fractional part, in [-0.5, +0.5]
fxch                     ; swap st(0) and st(1); pairable with FSUB on Pentium (P5)
f2xm1                    ; st(0) = 2^(frac) - 1
fld1                     ; st(0) = 1
faddp    st(1), st(0)    ; st(0) = 2^(frac), st(1) = round(x * log2(e))
fscale                   ; st(0) = 2^(frac) * 2^(round(x * log2(e)))
fstp     st(1)           ; result in st(0)


The main difference is the use of frndint and fsub to get a value into the -1.0 to +1.0 range required by f2xm1, as opposed to using fprem to get the remainder after division by 1.

To get an idea of the relative costs of these instructions, we'll pull data from Agner Fog's instruction tables. A "?" indicates that the corresponding data is not available.



Instruction      AMD K7      AMD K8      AMD K10    Bulldozer    Piledriver    Ryzen
-------------------------------------------------------------------------------------------
FLD/FLD1/FLDL2E  [all very fast, 1-cycle instructions, with a reciprocal throughput of 1]
FADD(P)/FSUB(P)  1/4/1       1/4/1       1/4/1      1/5-6/1      1/5-6/1       1/5/1
FMUL(P)          1/4/1       1/4/1       1/4/1      1/5-6/1      1/5-6/1       1/5/1
FPREM            1/7-10/8    1/7-10/8    1/?/7      1/19-62/?    1/17-60/?     2/?/12-50
FRNDINT          5/10/3      5/10/3      6/?/37     1/4/1        1/4/1         1/4/3
FSCALE           5/8/?       5/8/?       5/9/29     8/52/?       8/44/5        8/9/4
F2XM1            8/27/?      53/126/?    8/65/30?   10/64-71/?   10/60-73/?    10/50/?


The notation used above is "ops / latency / reciprocal throughput".

Instruction      8087     287      387      486      P5         P6        PM      Nehalem
-------------------------------------------------------------------------------------------
FLD              17-22    17-22    14       4        1/0        1/?       1/1      1/1
FLD1             15-21    15-21    24       4        2/0        2/?       2/?      2/?
FLDL2E           15-21    15-21    40       8        5/2        2/?       2/?      2/?
FADD(P)/FSUB(P)  70-100   70-100   23-34    8-20     3/2        1/3       1/3      1/3
FMUL(P)          90-145   90-145   29-57    16       3/2        1/5       1/5      1/5
FPREM            15-190   15-190   74-155   70-138   16-64 (2)  23/?      26/37    25/14
FRNDINT          16-50    16-50    66-80    21-30    9-20 (0)   30/?      15/19    17/22
FSCALE           32-38    32-38    67-86    30-32    20-32 (5)  56/?      28/43    24/12
F2XM1            310-630  310-630  211-476  140-279  13-57 (2)  17-48/66  ~20/?    19/58


For pre-P5, the value is the number of cycles. For P5, the notation is cycles, with the value in parentheses being the overlap with other FP instructions. For P6 and later, the notation is uops/latency.

Clearly, f2xm1 is a slow, expensive instruction, but it is used by both implementations and is hard to avoid. As slow as it is, though, it is still far faster than implementing the Taylor-series loop.

(One possible way of dealing with the slow f2xm1 instruction, if you're willing to trade code size for speed, would be a lookup-table approach. You can find several articles published on this subject if you search, although most of them are behind paywalls. :-( There is one publicly accessible link here. This is probably what an optimized math library written in C would do for its exp implementation. Ben Voigt wrote similar code in C#. The main question here is, as always: are you optimizing for size or for speed? In other words, do you call this function often and need it to be as fast as possible? Or do you call it infrequently and just need it to be precise, without consuming too much space in the binary and potentially slowing down other hot code by evicting it from the cache?)

But what about fprem vs. frndint?

Well, on AMD CPUs, fprem pretty consistently decodes to fewer operations than frndint, although frndint is faster than fprem on modern AMD CPUs. I'd say this doesn't really matter, though, because on those processors you would be writing SSE2/AVX code rather than using the legacy x87 FPU. (These instructions also get slower on later Intel microarchitectures such as Sandy Bridge, Haswell, etc., but the same argument applies there, so I ignored those entirely when compiling the data.) What really matters is how this code performs on older processors, and on older AMD processors the version using fprem looks like the clear winner.

On Intel processors, however, it's the opposite: frndint is usually faster than fprem, with the exception of the P6 (Pentium Pro, Pentium II, and Pentium III).

But we're not just talking about frndint vs. fprem; we're really talking about frndint + fsub vs. fprem. fsub is a relatively fast instruction, but it becomes difficult to predict how this code will actually perform without running and timing it. Cycle counts can only tell us so much, and the fprem sequence is shorter (fewer instructions and, more importantly, fewer bytes: 18 versus 22), which can mean significant speed improvements.

Either implementation is fine if you don't want to benchmark, or if you want something generic that works reasonably well on all processors. Neither is going to set the world on fire performance-wise, but as I said earlier, both will be significantly faster than a loop computing the Taylor series. The overhead there is what will kill your performance.
