# `rolling_mean` instability in pandas

I am upgrading from our current environment (Python 2.7.3 64-bit, pandas 0.9) to a new one (Python 2.7.6, pandas 0.14.1), and some of my regression tests now fail. I tracked the failures down to the behavior of `pandas.stats.moments.rolling_mean`.

Here's an example to reproduce the error:

```python
import pandas as pd

data = [
    1.0,
    0.99997000000000003,
    0.99992625131299995,
    0.99992500140499996,
    0.99986125618599997,
    0.99981126312299995,
    0.99976377208800005,
    0.99984375318999996,
]
ser = pd.Series(data, index=pd.date_range('2008-05-28', '2008-06-06', freq='B'))

# Rolling mean over the last 5 business days vs. an explicit sum over the same window
print "rolling mean: %.17f" % pd.stats.moments.rolling_mean(ser, window=5, min_periods=1)['2008-06-06']
print "sum divide:   %.17f" % (ser['2008-6-1':'2008-6-6'].sum()/5)
```

In my original environment, I get the following output:

```
rolling mean: 0.99984100919839991
sum divide:   0.99984100919839991
```

but in my new environment the output is now:

```
rolling mean: 0.99984100919840002
sum divide:   0.99984100919839991
```

As you can see, the rolling mean now gives a slightly different number. The difference is small, but the errors compound and eventually become non-trivial.

Does anyone know what might be causing this, or if there is a workaround?


The two approaches give different results because of accumulated rounding error, which is larger in the sum-divide calculation. The rolling mean calculation used to suffer from a similar problem, but internal improvements to its algorithm over the last few versions appear to have made its result more accurate.
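To illustrate how accumulated rounding error can make two mathematically equivalent rolling means disagree in the last bits, here is a minimal pure-Python sketch. The two functions (`rolling_mean_direct` and `rolling_mean_running`) are hypothetical helpers written for this illustration; they are not the algorithm used by any pandas version, only two common ways to compute the same quantity.

```python
# Data from the question.
data = [
    1.0,
    0.99997000000000003,
    0.99992625131299995,
    0.99992500140499996,
    0.99986125618599997,
    0.99981126312299995,
    0.99976377208800005,
    0.99984375318999996,
]


def rolling_mean_direct(values, window):
    """Re-add every value of each window from scratch (hypothetical helper)."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1):i + 1]
        total = 0.0
        for v in chunk:
            total += v
        out.append(total / len(chunk))
    return out


def rolling_mean_running(values, window):
    """Keep one running sum: add the newest value, drop the oldest (hypothetical helper)."""
    out = []
    total = 0.0
    count = 0
    for i, v in enumerate(values):
        total += v
        count += 1
        if i >= window:
            # The window is full, so remove the value that just left it.
            total -= values[i - window]
            count -= 1
        out.append(total / count)
    return out


direct = rolling_mean_direct(data, 5)
running = rolling_mean_running(data, 5)
for d, r in zip(direct, running):
    print("%.17f  %.17f  diff=%.3g" % (d, r, r - d))
```

Until the window fills up, both functions perform exactly the same additions, so their results are identical. After that, the running sum carries rounding errors from values that have already left the window, so any disagreement can only appear from that point on and stays in the last few bits.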

First of all, let's establish that the new rolling mean is more accurate. We'll do this by running the sum-divide approach twice, each time with a different precision:

```
In : import numpy as np

In : ser1 = pd.Series(data, index=pd.date_range('2008-05-28', '2008-06-06', freq='B'))

In : type(ser1[0])
Out: numpy.float64

In : print "sum divide:   %.17f" % (ser1['2008-6-1':'2008-6-6'].sum()/5)
sum divide:   0.99984100919839991

In : ser2 = pd.Series(data, index=pd.date_range('2008-05-28', '2008-06-06', freq='B'), dtype = np.float128)

In : print "sum divide:   %.17f" % (ser2['2008-6-1':'2008-6-6'].sum()/5)
sum divide:   0.99984100919840002
```

Using the higher precision of `np.float128` gives a value that matches the new rolling mean result. This shows that the new rolling mean is more accurate than the old one.
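Since `np.float128` is not available in every NumPy build, the same check can be sketched with the standard library alone. The snippet below is an illustration, not part of the original session: it uses `fractions.Fraction` to compute the exact mean of the five stored window values and then measures how far each of the two printed results is from it. The names `old_result` and `new_result` are labels chosen here for the two outputs quoted in the question.

```python
from fractions import Fraction

# The five values in the window ending 2008-06-06 (last five entries of data).
window = [
    0.99992500140499996,
    0.99986125618599997,
    0.99981126312299995,
    0.99976377208800005,
    0.99984375318999996,
]

# Fraction(float) converts the stored binary value exactly, so this mean
# carries no rounding error at all.
exact_mean = sum(Fraction(v) for v in window) / 5

old_result = 0.99984100919839991  # "sum divide" output quoted in the question
new_result = 0.99984100919840002  # new rolling_mean output quoted in the question

err_old = abs(Fraction(old_result) - exact_mean)
err_new = abs(Fraction(new_result) - exact_mean)

print("sum divide   abs error = %.3e" % float(err_old))
print("rolling mean abs error = %.3e" % float(err_new))
```

Comparing `err_old` and `err_new` shows which of the two doubles lies closer to the exact mean, without relying on extended-precision hardware types.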

It also suggests a possible workaround for your problem: use more precision in your calculations by creating your series with dtype `np.float128`. This improves the accuracy of the sum-divide approach but does not change the result of the rolling mean:

```
In : pd.stats.moments.rolling_mean(ser1, window=5, min_periods=1) == pd.stats.moments.rolling_mean(ser2, window=5, min_periods=1)
Out:
2008-05-28    True
2008-05-29    True
2008-05-30    True
2008-06-02    True
2008-06-03    True
2008-06-04    True
2008-06-05    True
2008-06-06    True
Freq: B, dtype: bool
```

Note that although this brings the results of the two approaches closer together, to the point that they print identically:

```
In : print "sum divide:   %.60f" % (ser2['2008-6-1':'2008-6-6'].sum()/5)
sum divide:   0.999841009198400021418251526483800262212753295898437500000000

In : print "rolling mean: %.60f" % pd.stats.moments.rolling_mean(ser2, window=5, min_periods=1)['2008-06-06']
rolling mean: 0.999841009198400021418251526483800262212753295898437500000000
```

as far as the processor is concerned, they are still different:

```
In : pd.stats.moments.rolling_mean(ser2, window=5, min_periods=1)['2008-06-06'] == ser2['2008-6-1':'2008-6-6'].sum()/5
Out: False

In : pd.stats.moments.rolling_mean(ser2, window=5, min_periods=1)['2008-06-06'] - ser2['2008-6-1':'2008-6-6'].sum()/5
Out: 4.4398078963281406573e-17
```

but hopefully the margin of error, which is slightly smaller now, is acceptable for your use case.
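If the remaining difference is acceptable, one option for the regression tests is to compare with a tolerance instead of exact equality. The sketch below is an assumption about how such a check could look, not something taken from the original tests: `almost_equal` is a hypothetical helper and the default `rel_tol` of `1e-12` is an arbitrary choice that should be tuned to the application.

```python
def almost_equal(a, b, rel_tol=1e-12):
    """Return True when a and b agree to within rel_tol, relative to the larger magnitude.

    Hypothetical helper; rel_tol=1e-12 is an arbitrary illustrative default.
    """
    return abs(a - b) <= rel_tol * max(abs(a), abs(b))


old_value = 0.99984100919839991  # result under pandas 0.9
new_value = 0.99984100919840002  # result under pandas 0.14.1

print(old_value == new_value)               # exact comparison
print(almost_equal(old_value, new_value))   # tolerance-based comparison
```

A relative tolerance scales with the magnitude of the values, which tends to be more robust than a fixed absolute threshold when the series contains both small and large numbers.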
