The sum of two "np.longdouble" gives a large numerical error
Good morning,
I am reading two numbers from a FITS file (representing integers and floating point numbers of the same number), converting them to long doubles (128 bits on my machine) and then summing them up.
The result is not as accurate as I would expect from using 128 bit floats. Here is the code:
a_int = np.longdouble(read_header_key(fits_file, 'I'))
print "I %.25f" % a_int, type(a_int)
a_float = np.longdouble(read_header_key(fits_file, 'F'))
print "F %.25f" % a_float, a_float.dtype
a = a_int + a_float
print "TOT %.25f" % a, a.dtype
and here is the answer I get:
I 55197.0000000000000000000000000 <type 'numpy.float128'>
F 0.0007660185200000000195833 float128
TOT 55197.0007660185219720005989075 float128
The result deviates from what was expected (55197.0007660185200000000195833) after just 11 decimal digits (16 significant digits in total). I would expect much better accuracy from 128 bit floats. What am I doing wrong?
This result was reproduced on a Mac machine and on a 32-bit Linux machine (in this case the dtype was float96, but the values ββwere exactly the same)
Thanks in advance for your help!
Matteo
source to share
The problem is with what you are printing np.longdouble
. When you format with %f
, Python passes the result to a float (64 bit) before printing.
Here:
>>> a_int = np.longdouble(55197)
>>> a_float = np.longdouble(76601852) / 10**11
>>> b = a_int + a_float
>>> '%.25f' % b
'55197.0007660185219720005989075'
>>> '%.25f' % float(b)
'55197.0007660185219720005989075'
>>> b * 10**18
5.5197000766018519998e+22
Note that on my machine I get a little more precision with longdouble
compared to normal double
(20 decimal places instead of 15). As such, it might be worth seeing if the module Decimal
might be more appropriate for your application. Decimal
handles arbitrary precision floating point numbers without loss of precision.
source to share
I am assuming the modifier %f
creates a float from a longdouble object and uses that when creating a format string.
>>> import numpy as np
>>> np.longdouble(55197)
55197.0
>>> a = np.longdouble(55197)
>>> b = np.longdouble(0.0007660185200000000195833)
>>> a
55197.0
>>> b
0.00076601852000000001958
>>> a + b
55197.00076601852
>>> type(a+b)
<type 'numpy.float128'>
>>> a + b == 55197.00076601852
False
As a side note, it repr
doesn't even print enough props to restore an object. This is simply because you cannot have a float literal enough to jump to yours longdouble
.
source to share