The sum of two "np.longdouble" gives a large numerical error

Good morning,

I am reading two numbers from a FITS file (representing integers and floating point numbers of the same number), converting them to long doubles (128 bits on my machine) and then summing them up.

The result is not as accurate as I would expect from using 128 bit floats. Here is the code:

a_int = np.longdouble(read_header_key(fits_file, 'I'))
print "I %.25f" % a_int, type(a_int)
a_float = np.longdouble(read_header_key(fits_file, 'F'))
print "F %.25f" % a_float, a_float.dtype
a = a_int + a_float
print "TOT %.25f" % a, a.dtype

      

and here is the answer I get:

I 55197.0000000000000000000000000 <type 'numpy.float128'>
F 0.0007660185200000000195833 float128
TOT 55197.0007660185219720005989075 float128

      

The result deviates from what was expected (55197.0007660185200000000195833) after just 11 decimal digits (16 significant digits in total). I would expect much better accuracy from 128 bit floats. What am I doing wrong?

This result was reproduced on a Mac machine and on a 32-bit Linux machine (in this case the dtype was float96, but the values ​​were exactly the same)

Thanks in advance for your help!

Matteo

+3


source to share


2 answers


The problem is with what you are printing np.longdouble

. When you format with %f

, Python passes the result to a float (64 bit) before printing.

Here:



>>> a_int = np.longdouble(55197)
>>> a_float = np.longdouble(76601852) / 10**11
>>> b = a_int + a_float
>>> '%.25f' % b
'55197.0007660185219720005989075'
>>> '%.25f' % float(b)
'55197.0007660185219720005989075'
>>> b * 10**18
5.5197000766018519998e+22

      

Note that on my machine I get a little more precision with longdouble

compared to normal double

(20 decimal places instead of 15). As such, it might be worth seeing if the module Decimal

might be more appropriate for your application. Decimal

handles arbitrary precision floating point numbers without loss of precision.

+3


source


I am assuming the modifier %f

creates a float from a longdouble object and uses that when creating a format string.

>>> import numpy as np
>>> np.longdouble(55197)
55197.0
>>> a = np.longdouble(55197)
>>> b = np.longdouble(0.0007660185200000000195833)
>>> a
55197.0
>>> b
0.00076601852000000001958
>>> a + b
55197.00076601852
>>> type(a+b)
<type 'numpy.float128'>
>>> a + b == 55197.00076601852
False

      



As a side note, it repr

doesn't even print enough props to restore an object. This is simply because you cannot have a float literal enough to jump to yours longdouble

.

+3


source







All Articles