LUT versus Newton-Raphson division for IEEE-754 32-bit floating point

What are the trade-offs when implementing IEEE-754 32-bit floating-point division using a LUT versus the Newton-Raphson method?

By trade-offs, I mean memory size, number of instructions, etc.

I have a small memory (130 words of 16 bits each). I store the upper 12 bits of the mantissa (including the hidden bit) in one memory location and the lower 12 bits in another.

I am currently using Newton-Raphson division, but I wonder what the trade-offs would be if I changed my method. Here is a link to my algorithm: Newton's method for finding the reciprocal of a floating point number for division

Thank you; please explain your reasoning.



2 answers


Each Newton-Raphson step roughly doubles the number of bits of precision, so if you can determine the number of bits of precision you expect from a LUT of a given size, you should be able to determine how many NR steps you need to achieve the desired precision. The Cray-1 used NR as the final stage of its reciprocal computation. While looking into this, I found a fairly detailed paper on the subject: An Accurate, High Speed Implementation of Division by Reciprocal Approximation, from the 9th IEEE Symposium on Computer Arithmetic (6-8 September 1989).





The trade-off is pretty simple. A LUT uses additional memory in the hope of reducing the number of instructions and saving time. Whether or not this is effective will depend heavily on the details of the processor, caching in particular.

For Newton-Raphson, you change X / Y into X * (1 / Y) and use your iteration to find 1 / Y. At least in my experience, if you need full precision it is rarely useful; its main strength is finding a lower-precision result (say, 16 bits) quickly.



The usual division method is the bit-at-a-time method. While that particular answer deals with integers, for floating point you do essentially the same thing, except that you also subtract the exponents. A floating-point number is basically A * 2^N, where A is the mantissa value and N is the exponent. So you take two numbers, A * 2^N / B * 2^M, and divide as (A / B) * 2^(N-M), with A and B treated as (essentially) integers in this case. The only real difference is that with floating point you usually want to round rather than truncate the result. That basically just means computing (at least) one extra bit of precision in the quotient, then rounding up if that extra bit is one.

The most common method using lookup tables is SRT division. This is most often done in hardware, so I would probably Google for something like "Verilog SRT" or "VHDL SRT". Implementing it in C++ need not be terribly difficult. Where the method described in the linked answer produces one bit per iteration, SRT can be written to produce 2, 4, etc., bits per iteration. If memory serves, though, the size of the table grows quadratically with the number of bits produced per iteration, which is why you rarely see much more than 4 in practice.







