Pandas cumsum with conditional lag product?

I'm trying to get a cumulative sum that changes according to the product of another variable and the lagging value of the sum (sounds a bit like math gibberish, I know .. please bear with me)

Here's a setup example:

import pandas as pd
df = pd.DataFrame([1,1,1.004878,1,1.043394],columns=['xx'])
df['n'] = 1000000.0

      

What's going on:

       xx        n
0  1.000000  1000000
1  1.000000  1000000
2  1.004878  1000000
3  1.000000  1000000
4  1.043394  1000000

      

Now we need to multiply xx

by a lagged value n

, iteratively, and then take the cumulative sum of that value:

cs = pd.Series([0.0] * len(df))
cs[0] = df.ix[0]['n']
for i,e in enumerate(df.iterrows()):
    if i == 0: continue
    cs[i] = df.ix[i]['xx'] * cs[(i - 1)]

      

This brings up the following:

0    1000000.000000
1    1000000.000000
2    1004878.000000
3    1004878.000000
4    1048483.675932
dtype: float64

      

Question: Is there a way to do this in pandas / numpy that doesn't require iterating over each line? If not, are there any tricks in trading to optimize the code like above when you are forced to iterate? Can a creatively generated index help in this case? Performance is a 10,000+ row issue, across multiple datasets.

+3


source to share


2 answers


First, your for loop can be simplified to:

for i in xrange(1, len(df)):
    cs[i] = df.ix[i]['xx'] * cs[(i - 1)]

      

(more math gibberish) Each item in cs[1:]

is the product of all previous items in df['xx']

(cumulative product) multiplied by the first item in the columnn

df

>>> df
         xx        n
0  1.000000  1000000
1  1.000000  1000000
2  1.004878  1000000
3  1.000000  1000000
4  1.043394  1000000
>>> a = df['xx']
>>> a
0    1.000000
1    1.000000
2    1.004878
3    1.000000
4    1.043394
Name: xx, dtype: float64
>>> a = a.cumprod()
>>> a
0    1.000000
1    1.000000
2    1.004878
3    1.004878
4    1.048484
Name: xx, dtype: float64
>>> a = a * df['n'][0]
>>> a
0    1000000.000000
1    1000000.000000
2    1004878.000000
3    1004878.000000
4    1048483.675932
Name: xx, dtype: float64
>>> np.all(a == cs)
True
>>> 

a = df['xx'].cumprod() * df['n'][0]

      

This is not a trick. It only works because it df['xx'][0]

is 1. If it was some other value, AND cs[0] = df.ix[0]['n']

was not just a shortcut, it cumprod

didn't work.

Expanding each item cs

gives



cs[0] = df['n'][0]
cs[1] = df['xx'][1] * df['n'][0]
cs[2] = df['xx'][2] * df['xx'][1] * df['n'][0]
cs[3] = df['xx'][3] * df['xx'][2] * df['xx'][1] * df['n'][0]
cs[4] = df['xx'][4] * df['xx'][3] * df['xx'][2] * df['xx'][1] * df['n'][0]

      

Since it df['xx'][0]

is equal to one and df['xx'][0] * df['n'][0] == df['n'][0]

, then:

cs[0] = df['xx'][0] * df['n'][0]
cs[1] = df['xx'][1] * df['xx'][0] * df['n'][0]
cs[2] = df['xx'][2] * df['xx'][1] * df['xx'][0] * df['n'][0]
cs[3] = df['xx'][3] * df['xx'][2] * df['xx'][1] * df['xx'][0] * df['n'][0]
cs[4] = df['xx'][4] * df['xx'][3] * df['xx'][2] * df['xx'][1] * df['xx'][0] * df['n'][0]

      

If you were to tweak the terms of the problem a bit, where after each iteration I need to subtract 0.05% from the last computed n value (before the next iteration), does cumprod work?

If you followed the element expansion exercise, you should see that the new condition results in multiplication by the cumulative product of the scale factor array. There are two ways to do this - both results in some minor floating point errors from the computed in the loop. Again, you need to consider that the first item in df['xx']

is one of them.

for i in xrange(1, len(df)):
    cs[i] = df.ix[i]['xx'] * (.9995 * cs[(i - 1)])

>>> k
array([ 1.    ,  0.9995,  0.9995,  0.9995,  0.9995])
>>> z = df['xx'] * k
>>> z
0    1.000000
1    0.999500
2    1.004376
3    0.999500
4    1.042872
Name: xx, dtype: float64
>>> z = z.cumprod() * df['n'][0]
>>> cs - z
0    0.000000e+00
1    0.000000e+00
2    0.000000e+00
3    0.000000e+00
4   -1.164153e-10
dtype: float64
>>> 
>>> z = df['xx'].cumprod() * df['n'][0]
>>> z *= k.cumprod()
>>> cs - z
0    0.000000e+00
1    0.000000e+00
2   -1.164153e-10
3    0.000000e+00
4    0.000000e+00
dtype: float64
>>> 

      

+4


source


I'm not sure I understand what "n" is supposed to do (is it always = 1,000,000?), But it's pretty simple to match your results above with cumprod:



In [60]: df.xx.cumprod() * 1e6

Out[60]: 0    1000000.000000
         1    1000000.000000
         2    1004878.000000
         3    1004878.000000
         4    1048483.675932

      

+1


source







All Articles