Pandas cumsum with conditional lag product?
I'm trying to get a cumulative total that changes according to the product of another variable and the lagged value of the total itself (sounds a bit like math gibberish, I know .. please bear with me)
Here's a setup example:
import pandas as pd
df = pd.DataFrame([1,1,1.004878,1,1.043394],columns=['xx'])
df['n'] = 1000000.0
What's going on:
xx n
0 1.000000 1000000
1 1.000000 1000000
2 1.004878 1000000
3 1.000000 1000000
4 1.043394 1000000
Now we need to multiply xx by a lagged value of n, iteratively, and then take the running result of that:
cs = pd.Series([0.0] * len(df))
cs.iloc[0] = df['n'].iloc[0]
for i, e in enumerate(df.iterrows()):
    if i == 0: continue
    cs.iloc[i] = df['xx'].iloc[i] * cs.iloc[i - 1]
This brings up the following:
0 1000000.000000
1 1000000.000000
2 1004878.000000
3 1004878.000000
4 1048483.675932
dtype: float64
Question: Is there a way to do this in pandas / numpy that doesn't require iterating over each row? If not, are there any tricks of the trade for optimizing code like the above when you are forced to iterate? Can a creatively generated index help here? Performance matters: this has to run on 10,000+ rows, across multiple datasets.
First, your for loop can be simplified to:
for i in range(1, len(df)):
    cs.iloc[i] = df['xx'].iloc[i] * cs.iloc[i - 1]
(more math gibberish) Each item in cs[1:] is the product of all previous items in df['xx'] (a cumulative product), multiplied by the first item in the n column of df:
>>> df
xx n
0 1.000000 1000000
1 1.000000 1000000
2 1.004878 1000000
3 1.000000 1000000
4 1.043394 1000000
>>> a = df['xx']
>>> a
0 1.000000
1 1.000000
2 1.004878
3 1.000000
4 1.043394
Name: xx, dtype: float64
>>> a = a.cumprod()
>>> a
0 1.000000
1 1.000000
2 1.004878
3 1.004878
4 1.048484
Name: xx, dtype: float64
>>> a = a * df['n'][0]
>>> a
0 1000000.000000
1 1000000.000000
2 1004878.000000
3 1004878.000000
4 1048483.675932
Name: xx, dtype: float64
>>> np.all(a == cs)
True
>>>
a = df['xx'].cumprod() * df['n'][0]
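For completeness, the one-liner can be checked against the original loop in a self-contained script (a sketch, not from the original answer, using current pandas indexing):

```python
import numpy as np
import pandas as pd

# Example frame from the question.
df = pd.DataFrame([1, 1, 1.004878, 1, 1.043394], columns=['xx'])
df['n'] = 1000000.0

# Loop version of the recurrence: cs[0] = n[0], cs[i] = xx[i] * cs[i - 1].
cs = pd.Series([0.0] * len(df))
cs.iloc[0] = df['n'].iloc[0]
for i in range(1, len(df)):
    cs.iloc[i] = df['xx'].iloc[i] * cs.iloc[i - 1]

# Vectorized equivalent from the answer.
a = df['xx'].cumprod() * df['n'].iloc[0]

assert np.allclose(a, cs)
```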
This is not a trick. It only works because df['xx'][0] is 1, which makes cs[0] = df['n'][0] a shortcut for cs[0] = df['xx'][0] * df['n'][0]. If the first value were anything else, cumprod alone wouldn't work.
Expanding each item of cs gives:
cs[0] = df['n'][0]
cs[1] = df['xx'][1] * df['n'][0]
cs[2] = df['xx'][2] * df['xx'][1] * df['n'][0]
cs[3] = df['xx'][3] * df['xx'][2] * df['xx'][1] * df['n'][0]
cs[4] = df['xx'][4] * df['xx'][3] * df['xx'][2] * df['xx'][1] * df['n'][0]
Since df['xx'][0] is equal to one, df['xx'][0] * df['n'][0] == df['n'][0], so the expansion can be rewritten as:
cs[0] = df['xx'][0] * df['n'][0]
cs[1] = df['xx'][1] * df['xx'][0] * df['n'][0]
cs[2] = df['xx'][2] * df['xx'][1] * df['xx'][0] * df['n'][0]
cs[3] = df['xx'][3] * df['xx'][2] * df['xx'][1] * df['xx'][0] * df['n'][0]
cs[4] = df['xx'][4] * df['xx'][3] * df['xx'][2] * df['xx'][1] * df['xx'][0] * df['n'][0]
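A side note not in the original answer: the expansion also suggests how to handle the case where df['xx'][0] is not 1. Since cs[i] only ever multiplies in xx[1] through xx[i], dividing the cumulative product by the first factor recovers the same result. A sketch with a hypothetical first value of 1.02:

```python
import numpy as np
import pandas as pd

# Hypothetical variant of the frame where the first xx is not 1.
df = pd.DataFrame({'xx': [1.02, 1.0, 1.004878, 1.0, 1.043394],
                   'n': [1000000.0] * 5})

# The recurrence itself is unchanged: cs[0] = n[0], cs[i] = xx[i] * cs[i - 1].
cs = pd.Series([0.0] * len(df))
cs.iloc[0] = df['n'].iloc[0]
for i in range(1, len(df)):
    cs.iloc[i] = df['xx'].iloc[i] * cs.iloc[i - 1]

# Dividing out the first factor generalizes the cumprod shortcut.
z = df['xx'].cumprod() / df['xx'].iloc[0] * df['n'].iloc[0]

assert np.allclose(z, cs)
```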
If you were to tweak the terms of the problem a bit, so that after each iteration 0.05% is subtracted from the last computed n value (before the next iteration), does cumprod still work?
If you followed the element expansion exercise, you should see that the new condition amounts to also multiplying by the cumulative product of a scale-factor array. There are two ways to do this; both produce some minor floating-point differences from the values computed in the loop. Again, this relies on the first item in df['xx'] being 1.
for i in range(1, len(df)):
    cs.iloc[i] = df['xx'].iloc[i] * (0.9995 * cs.iloc[i - 1])
>>> k = np.array([1.0] + [0.9995] * (len(df) - 1))
>>> k
array([ 1. , 0.9995, 0.9995, 0.9995, 0.9995])
>>> z = df['xx'] * k
>>> z
0 1.000000
1 0.999500
2 1.004376
3 0.999500
4 1.042872
Name: xx, dtype: float64
>>> z = z.cumprod() * df['n'][0]
>>> cs - z
0 0.000000e+00
1 0.000000e+00
2 0.000000e+00
3 0.000000e+00
4 -1.164153e-10
dtype: float64
>>>
>>> z = df['xx'].cumprod() * df['n'][0]
>>> z *= k.cumprod()
>>> cs - z
0 0.000000e+00
1 0.000000e+00
2 -1.164153e-10
3 0.000000e+00
4 0.000000e+00
dtype: float64
>>>
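Both variants above can be verified end to end with a short script (a sketch assuming the same setup and the 0.9995 scale factor):

```python
import numpy as np
import pandas as pd

# Example frame from the question.
df = pd.DataFrame([1, 1, 1.004878, 1, 1.043394], columns=['xx'])
df['n'] = 1000000.0

# Loop with the 0.05% haircut applied to the previous value.
cs = pd.Series([0.0] * len(df))
cs.iloc[0] = df['n'].iloc[0]
for i in range(1, len(df)):
    cs.iloc[i] = df['xx'].iloc[i] * (0.9995 * cs.iloc[i - 1])

# Scale-factor array: no haircut on the first element.
k = np.array([1.0] + [0.9995] * (len(df) - 1))

# Variant 1: fold k into xx before taking the cumulative product.
z1 = (df['xx'] * k).cumprod() * df['n'].iloc[0]

# Variant 2: apply the cumulative product of k after the fact.
z2 = df['xx'].cumprod() * df['n'].iloc[0] * k.cumprod()

assert np.allclose(z1, cs)
assert np.allclose(z2, cs)
```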