rolling_mean instability in pandas
I am upgrading from our current environment (Python 2.7.3 64-bit, pandas 0.9) to a new one (Python 2.7.6, pandas 0.14.1), and some of my regression tests are failing. I tracked it down to the behavior of pandas.stats.moments.rolling_mean.
Here's an example to reproduce the error:
import pandas as pd
data = [
1.0,
0.99997000000000003,
0.99992625131299995,
0.99992500140499996,
0.99986125618599997,
0.99981126312299995,
0.99976377208800005,
0.99984375318999996]
ser = pd.Series(data, index=pd.date_range('2008-05-28', '2008-06-06', freq='B'))
print "rolling mean: %.17f" % pd.stats.moments.rolling_mean(ser, window=5, min_periods=1)['2008-06-06']
print "sum divide: %.17f" % (ser['2008-6-1':'2008-6-6'].sum()/5)
In my original environment, I get the following output:
rolling mean: 0.99984100919839991
sum divide: 0.99984100919839991
but in my new environment the output is now:
rolling mean: 0.99984100919840002
sum divide: 0.99984100919839991
As you can see, the rolling mean now gives a slightly different number. It is a small difference, but the errors compound downstream and end up being non-trivial.
Does anyone know what might be causing this, or if there is a workaround?
The difference between the results of the two approaches is accumulated rounding error, which is larger in the sum-divide computation. The rolling-mean computation used to suffer from a similar problem, but it appears that internal improvements to its algorithm over the last few versions have made its result more accurate.
First, let me demonstrate that the new rolling mean is more accurate. We can see this by running the sum-divide method twice, each time at a different precision:
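To see how the accumulation strategy alone can flip the last bit, here is a minimal standard-library sketch comparing a naive left-to-right sum against a correctly-rounded one (this is only an illustration, not pandas's actual internal algorithm):

```python
import math

# The five float64 values from '2008-6-1' onward in the question's series.
window = [0.99992500140499996,
          0.99986125618599997,
          0.99981126312299995,
          0.99976377208800005,
          0.99984375318999996]

naive = 0.0
for x in window:             # left-to-right accumulation: rounds at
    naive += x               # every intermediate step

accurate = math.fsum(window) # correctly-rounded sum (Shewchuk's algorithm)

print("naive /5: %.17f" % (naive / 5))
print("fsum  /5: %.17f" % (accurate / 5))
```

Both values agree to roughly 15 significant digits; any disagreement in the final printed digits is exactly the kind of last-bit discrepancy seen in the question.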
In [166]: ser1 = pd.Series(data, index=pd.date_range('2008-05-28', '2008-06-06', freq='B'))
In [167]: type(ser1[0])
Out[167]: numpy.float64
In [168]: print "sum divide: %.17f" % (ser1['2008-6-1':'2008-6-6'].sum()/5)
sum divide: 0.99984100919839991
In [169]: import numpy as np
In [170]: ser2 = pd.Series(data, index=pd.date_range('2008-05-28', '2008-06-06', freq='B'), dtype=np.float128)
In [171]: print "sum divide: %.17f" % (ser2['2008-6-1':'2008-6-6'].sum()/5)
sum divide: 0.99984100919840002
Using the higher precision of np.float128 yields a value much closer to the new rolling-mean result. This demonstrates that the new rolling_mean implementation is more accurate than the previous one.
It also suggests a possible workaround for your problem: use more precision in your calculations by constructing your series with dtype np.float128. This improves the accuracy of the sum-divide approach, while leaving the rolling-mean result unchanged:
In [185]: pd.stats.moments.rolling_mean(ser1, window=5, min_periods=1) == pd.stats.moments.rolling_mean(ser2, window=5, min_periods=1)
Out[185]:
2008-05-28 True
2008-05-29 True
2008-05-30 True
2008-06-02 True
2008-06-03 True
2008-06-04 True
2008-06-05 True
2008-06-06 True
Freq: B, dtype: bool
Note that although this brings the results of the two approaches closer together, and they even appear to be identical:
In [194]: print "sum divide: %.60f" % (ser2['2008-6-1':'2008-6-6'].sum()/5)
sum divide: 0.999841009198400021418251526483800262212753295898437500000000
In [195]: print "rolling mean: %.60f" % pd.stats.moments.rolling_mean(ser2, window=5, min_periods=1)['2008-06-06']
rolling mean: 0.999841009198400021418251526483800262212753295898437500000000
they are still not bit-for-bit equal:
In [196]: pd.stats.moments.rolling_mean(ser2, window=5, min_periods=1)['2008-06-06'] == ser2['2008-6-1':'2008-6-6'].sum()/5
Out[196]: False
In [197]: pd.stats.moments.rolling_mean(ser2, window=5, min_periods=1)['2008-06-06'] - ser2['2008-6-1':'2008-6-6'].sum()/5
Out[197]: 4.4398078963281406573e-17
but hopefully the margin of error, which is somewhat smaller now, is acceptable for your use case.
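Given differences on the order of one ulp, exact == comparisons between such results are fragile; a tolerance-based comparison is the usual remedy. A sketch using the two values printed above (math.isclose needs Python 3.5+; numpy.isclose offers the same on older stacks):

```python
import math

rolling = 0.99984100919840002  # rolling-mean value from above
sumdiv  = 0.99984100919839991  # sum-divide value from above

print(rolling == sumdiv)                             # exact: False
print(math.isclose(rolling, sumdiv, rel_tol=1e-12))  # tolerant: True
```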
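As a side note for readers on newer stacks: pd.stats.moments.rolling_mean was deprecated in pandas 0.18 and later removed; the equivalent computation uses the .rolling() accessor. A sketch with the same data as above:

```python
import pandas as pd

data = [1.0, 0.99997000000000003, 0.99992625131299995,
        0.99992500140499996, 0.99986125618599997, 0.99981126312299995,
        0.99976377208800005, 0.99984375318999996]
ser = pd.Series(data, index=pd.date_range('2008-05-28', '2008-06-06', freq='B'))

# Modern equivalent of pd.stats.moments.rolling_mean(ser, window=5, min_periods=1)
rm = ser.rolling(window=5, min_periods=1).mean()
print("rolling mean: %.17f" % rm['2008-06-06'])
```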