How to speed up Pandas shifting multi-level data across groups?

I'm trying to shift a Pandas DataFrame's column data within groups formed by the first index level. Here's some demo code:

In [8]: df = mul_df(5,4,3)

In [9]: df
Out[9]:
                 COL000  COL001  COL002
STK_ID RPT_Date
A0000  B000     -0.5505  0.7445 -0.3645
       B001      0.9129 -1.0473 -0.5478
       B002      0.8016  0.0292  0.9002
       B003      2.0744 -0.2942 -0.7117
A0001  B000      0.7064  0.9636  0.2805
       B001      0.4763  0.2741 -1.2437
       B002      1.1563  0.0525 -0.7603
       B003     -0.4334  0.2510 -0.0105
A0002  B000     -0.6443  0.1723  0.2657
       B001      1.0719  0.0538 -0.0641
       B002      0.6787 -0.3386  0.6757
       B003     -0.3940 -1.2927  0.3892
A0003  B000     -0.5862 -0.6320  0.6196
       B001     -0.1129 -0.9774  0.7112
       B002      0.6303 -1.2849 -0.4777
       B003      0.5046 -0.4717 -0.2133
A0004  B000      1.6420 -0.9441  1.7167
       B001      0.1487  0.1239  0.6848
       B002      0.6139 -1.9085 -1.9508
       B003      0.3408 -1.3891  0.6739

In [10]: grp = df.groupby(level=df.index.names[0])

In [11]: grp.shift(1)
Out[11]:
                 COL000  COL001  COL002
STK_ID RPT_Date
A0000  B000         NaN     NaN     NaN
       B001     -0.5505  0.7445 -0.3645
       B002      0.9129 -1.0473 -0.5478
       B003      0.8016  0.0292  0.9002
A0001  B000         NaN     NaN     NaN
       B001      0.7064  0.9636  0.2805
       B002      0.4763  0.2741 -1.2437
       B003      1.1563  0.0525 -0.7603
A0002  B000         NaN     NaN     NaN
       B001     -0.6443  0.1723  0.2657
       B002      1.0719  0.0538 -0.0641
       B003      0.6787 -0.3386  0.6757
A0003  B000         NaN     NaN     NaN
       B001     -0.5862 -0.6320  0.6196
       B002     -0.1129 -0.9774  0.7112
       B003      0.6303 -1.2849 -0.4777
A0004  B000         NaN     NaN     NaN
       B001      1.6420 -0.9441  1.7167
       B002      0.1487  0.1239  0.6848
       B003      0.6139 -1.9085 -1.9508

      

The code for mul_df() is attached here: How to speed up Pandas sum of multi-level data?

Now I want to run grp.shift(1) on a large dataframe:
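For readers without the linked code, here is a rough stand-in for mul_df() (hypothetical; the real definition is in the linked question) that produces a frame of the same shape, written for current pandas:

```python
import numpy as np
import pandas as pd

def mul_df(level1, level2, ncols):
    # Hypothetical stand-in: a two-level (STK_ID, RPT_Date) frame of random floats
    index = pd.MultiIndex.from_product(
        [['A%04d' % i for i in range(level1)],
         ['B%03d' % j for j in range(level2)]],
        names=['STK_ID', 'RPT_Date'])
    columns = ['COL%03d' % k for k in range(ncols)]
    return pd.DataFrame(np.random.randn(level1 * level2, ncols),
                        index=index, columns=columns)

df = mul_df(5, 4, 3)
```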

In [1]: df = mul_df(5000,30,400)
In [2]: grp = df.groupby(level=df.index.names[0])
In [3]: timeit grp.shift(1)
1 loops, best of 3: 5.23 s per loop

      

5.23 s is too slow. How can I speed it up?

(My computer config: dual-core Pentium T4200 @ 2.00GHz, 3.00 GB RAM, Windows XP, Python 2.7.4, NumPy 1.7.1, Pandas 0.11.0, numexpr 2.0.1, Anaconda 1.5.0 (32-bit))


3 answers


How about shifting the whole DataFrame and then setting the first row of each group to NaN?



dfs = df.shift(1)
# The global shift leaked each group's last row into the next group's
# first row; blank those first rows (the very first row is already NaN).
dfs.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
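A quick sanity check, on small illustrative data, that this matches the slow group-wise shift:

```python
import numpy as np
import pandas as pd

# A small two-level frame standing in for mul_df() output
idx = pd.MultiIndex.from_product(
    [['A0000', 'A0001', 'A0002'], ['B000', 'B001', 'B002', 'B003']],
    names=['STK_ID', 'RPT_Date'])
df = pd.DataFrame(np.random.randn(12, 3),
                  index=idx, columns=['COL000', 'COL001', 'COL002'])

# Fast version: one global shift, then blank the first row of each group
dfs = df.shift(1)
dfs.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan

# Same answer as the slow per-group shift (NaNs in matching
# positions compare equal under .equals)
assert dfs.equals(df.groupby(level=0).shift(1))
```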

      



The problem is that the shift operation is not Cython-optimized, so it falls back to Python. Compare it with a Cythonized groupby operation:

In [84]: %timeit grp.shift(1)
1 loops, best of 3: 1.77 s per loop

In [85]: %timeit grp.sum()
1 loops, best of 3: 202 ms per loop

      



I added an issue for this: https://github.com/pydata/pandas/issues/4095



I asked a similar question and added an answer that works for shifting in any direction and by any magnitude: pandas: setting the last N rows of a multi-index to NaN for speeding up groupby with shift

Code (including test setup):

#
# the function to use in apply: blank out the rows where the whole-frame
# shift pulled data across a group boundary
#
def replace_shift_overlap(grp, col, N, value):
    if N > 0:
        grp[col][:N] = value   # forward shift: the first N rows are invalid
    else:
        grp[col][N:] = value   # backward shift: the last |N| rows are invalid
    return grp


# test setup: a few date-indexed groups of random data
# (pandas 0.11-era API: df.sort() and xrange)
length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in xrange(0, groups):
    tmpdf = pd.DataFrame({'date': rng1,
                          'category': int(10000000 * abs(np.random.randn())),
                          'colA': np.random.randn(length),
                          'colB': np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)

df.sort(columns=['category', 'date'], inplace=True)
df.set_index(['category', 'date'], inplace=True, drop=True)

# shift the whole column at once...
shiftBy = -1
df['tmpShift'] = df['colB'].shift(shiftBy)

#
# ...then use apply to clean up the group boundaries
#
df = df.groupby(level=0).apply(replace_shift_overlap, 'tmpShift', shiftBy, np.nan)

# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift', 1, inplace=True)
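The boundary fix can be verified against the slow per-group shift; a minimal sketch for the backward case (shiftBy = -1), written for current pandas without the apply step:

```python
import numpy as np
import pandas as pd

# Illustrative two-level frame with three groups of four rows
idx = pd.MultiIndex.from_product([['g1', 'g2', 'g3'], range(4)],
                                 names=['category', 'i'])
df = pd.DataFrame({'colB': np.arange(12.0)}, index=idx)

shiftBy = -1                      # backward shift by one row
tmp = df['colB'].shift(shiftBy)   # one fast whole-column shift

# The last row of each group now holds the next group's data;
# blank those rows (same idea as replace_shift_overlap)
last_rows = df.groupby(level=0).size().cumsum().values - 1
tmp.iloc[last_rows] = np.nan

# Matches the slow per-group shift
assert tmp.equals(df.groupby(level=0)['colB'].shift(shiftBy))
```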

      

EDIT: Note that the initial sort really eats into the effectiveness of this approach, so in some cases the original answer is more effective.







