rolling_mean instability in pandas

I am upgrading from our current environment (Python 2.7.3 64-bit, pandas 0.9) to a new one (Python 2.7.6, pandas 0.14.1), and some of my regression tests are now failing. I tracked it down to the behavior of pandas.stats.moments.rolling_mean.

Here's an example to reproduce the error:

import pandas as pd
data = [
    1.0,
    0.99997000000000003,
    0.99992625131299995,
    0.99992500140499996,
    0.99986125618599997,
    0.99981126312299995,
    0.99976377208800005,
    0.99984375318999996]
ser = pd.Series(data, index=pd.date_range('2008-05-28', '2008-06-06', freq='B'))

print "rolling mean: %.17f" % pd.stats.moments.rolling_mean(ser, window=5, min_periods=1)['2008-06-06']
print "sum divide:   %.17f" % (ser['2008-6-1':'2008-6-6'].sum()/5)


In my original environment, I get the following output:

rolling mean: 0.99984100919839991
sum divide:   0.99984100919839991


but in my new environment the output is now:

rolling mean: 0.99984100919840002
sum divide:   0.99984100919839991


As you can see, the rolling mean now gives a slightly different number. It is a small difference, but these errors compound in later calculations and eventually become non-trivial.

Does anyone know what might be causing this, or if there is a workaround?


1 answer


The difference between the two approaches comes from accumulated floating-point rounding error, which is larger in the sum-then-divide computation. The rolling mean used to suffer from a similar problem, but internal improvements to its algorithm over the last few versions have made its result more accurate.
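To see why two mathematically equivalent computations can disagree in their last bits, here is a minimal sketch (not the actual pandas internals) showing that float64 addition is not associative, so the order in which the window values are accumulated changes the rounded result:

# float64 addition is not associative: grouping changes the last bits
a, b, c = 0.1, 0.2, 0.3
print "%.17f" % ((a + b) + c)   # 0.60000000000000009
print "%.17f" % (a + (b + c))   # 0.59999999999999998

The rolling-mean code and the naive sum-then-divide accumulate the same values in different orders, so their last bits can legitimately differ.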

First, let me show that the new rolling-mean result is the more accurate one. We can do that by applying the sum-divide approach twice, each time with a different precision:

In [165]: import numpy as np

In [166]: ser1 = pd.Series(data, index=pd.date_range('2008-05-28', '2008-06-06', freq='B'))

In [167]: type(ser1[0])
Out[167]: numpy.float64

In [168]: print "sum divide:   %.17f" % (ser1['2008-6-1':'2008-6-6'].sum()/5)
sum divide:   0.99984100919839991

In [169]: ser2 = pd.Series(data, index=pd.date_range('2008-05-28', '2008-06-06', freq='B'), dtype=np.float128)

In [170]: print "sum divide:   %.17f" % (ser2['2008-6-1':'2008-6-6'].sum()/5)
sum divide:   0.99984100919840002


Using the higher-precision np.float128 produces a value close to the new rolling-mean result. This shows that the new rolling-mean implementation is more accurate than the previous one.

It also points to a possible workaround for your problem: use more precision in your calculations by constructing your Series with dtype=np.float128. This improves the accuracy of the sum-divide approach, but leaves the rolling mean unchanged:

In [185]: pd.stats.moments.rolling_mean(ser1, window=5, min_periods=1) == pd.stats.moments.rolling_mean(ser2, window=5, min_periods=1)
Out[185]: 
2008-05-28    True
2008-05-29    True
2008-05-30    True
2008-06-02    True
2008-06-03    True
2008-06-04    True
2008-06-05    True
2008-06-06    True
Freq: B, dtype: bool
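Applied to your original reproduction script, the workaround amounts to constructing the Series with the wider dtype (a sketch; np.float128 availability depends on your platform and compiler):

import numpy as np
import pandas as pd

# same data list as in the question, but stored as extended-precision floats
ser128 = pd.Series(data, index=pd.date_range('2008-05-28', '2008-06-06', freq='B'),
                   dtype=np.float128)

print "rolling mean: %.17f" % pd.stats.moments.rolling_mean(ser128, window=5, min_periods=1)['2008-06-06']
print "sum divide:   %.17f" % (ser128['2008-6-1':'2008-6-6'].sum() / 5)

Both lines should then print 0.99984100919840002, i.e. the value of the new rolling mean, as the %.60f comparison below also shows.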




Note, however, that although this brings the results of the two approaches closer together, and they appear to be identical:

In [194]: print "sum divide:   %.60f" % (ser2['2008-6-1':'2008-6-6'].sum()/5)
sum divide:   0.999841009198400021418251526483800262212753295898437500000000

In [195]: print "rolling mean: %.60f" % pd.stats.moments.rolling_mean(ser2, window=5, min_periods=1)['2008-06-06']
rolling mean: 0.999841009198400021418251526483800262212753295898437500000000


as far as the processor is concerned, they are still different:

In [196]: pd.stats.moments.rolling_mean(ser2, window=5, min_periods=1)['2008-06-06'] == ser2['2008-6-1':'2008-6-6'].sum()/5
Out[196]: False

In [197]: pd.stats.moments.rolling_mean(ser2, window=5, min_periods=1)['2008-06-06'] - ser2['2008-6-1':'2008-6-6'].sum()/5
Out[197]: 4.4398078963281406573e-17


but hopefully the margin of error, which is now slightly smaller, is acceptable for your use case.
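If the remaining difference still breaks exact-equality assertions in your regression tests, one option (my suggestion, not something specific to pandas) is to compare with a tolerance instead of bit-for-bit equality, for example with numpy.allclose:

import numpy as np

old_result = 0.99984100919839991   # value recorded under pandas 0.9
new_result = 0.99984100919840002   # value produced by pandas 0.14.1

# absolute tolerance of 1e-12 easily covers the ~1e-16 discrepancy
print np.allclose(old_result, new_result, rtol=0, atol=1e-12)   # True

That keeps the tests stable across versions while still catching genuinely wrong results.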
