Pandas: setting the last N rows of each multi-index group to NaN to speed up grouping with shift

I am trying to speed up my groupby .apply + shift. Thanks to this previous question and its answers, "How to speed up Pandas multilevel shifting of frames across groups?", I can confirm that the approach really does speed things up when you have many groups.

Based on that question, I now have the following code to set the first entry of each group (level 0 of the multi-index) to NaN. With that in place I can do my shifts globally over the whole frame rather than per group.

df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan

      

But I want to look forward, not backward, and perform calculations over N rows. So I'm trying to use similar code to set the last N rows of each group to NaN, but I'm obviously missing some important indexing knowledge, as I just can't figure it out.

I suppose I want to convert this so that each entry is a range rather than a single integer. How should I do it?

# cumulative group sizes, i.e. one past the last row of each group (the first group is dropped by [1:])
df.groupby(level=0).size().cumsum()[1:]
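To illustrate what "each entry is a range" could look like, here is a minimal sketch on a made-up toy frame. The names `toy`, `ends` and `tail_positions` are hypothetical, and the sketch assumes the frame is sorted by the group level and that every group has at least N rows (otherwise the positions spill into the previous group).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: groups "a", "b", "c" with 3, 4 and 2 rows.
idx = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("a", 3),
     ("b", 1), ("b", 2), ("b", 3), ("b", 4),
     ("c", 1), ("c", 2)],
    names=["category", "step"],
)
toy = pd.DataFrame({"tmpShift": np.arange(9, dtype=float)}, index=idx)

N = 2  # trailing rows to blank per group (assumed <= smallest group size)

# One past the last row position of each group.
ends = toy.groupby(level=0).size().cumsum().to_numpy()

# Broadcast each end against the offsets N..1 to get the last N positions per group.
tail_positions = (ends[:, None] - np.arange(N, 0, -1)).ravel()

# Blank only the shifted column at those positions.
toy.iloc[tail_positions, toy.columns.get_loc("tmpShift")] = np.nan
```

The broadcasting step is what turns each single boundary integer into a short range of row positions.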

      

Test setup (for the backward shift), if you want to try it:

import numpy as np
import pandas as pd

length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in range(groups):
    tmpdf = pd.DataFrame({'date':rng1,'category':int(10000000*abs(np.random.randn())),'colA':np.random.randn(length),'colB':np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)

df.sort_values(by=['category','date'], inplace=True)
df.set_index(['category','date'], inplace=True, drop=True)
df['tmpShift'] = df['colB'].shift(1)
# Row positions where every group after the first starts.
# Note: this blanks every column in those rows, not just tmpShift.
starts = df.groupby(level=0).size().cumsum().iloc[:-1].to_numpy()
df.iloc[starts] = np.nan

# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop(columns='tmpShift', inplace=True)

      

Thanks!

1 answer


I ended up using groupby like this (written to work forwards or backwards):

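def replace_tail(grp, col, N, value):
    # N > 0 overwrites the first N rows of the group (pairs with shift(N));
    # N < 0 overwrites the last |N| rows (pairs with shift(N) for negative N).
    grp = grp.copy()
    col_pos = grp.columns.get_loc(col)
    if N > 0:
        grp.iloc[:N, col_pos] = value
    elif N < 0:
        grp.iloc[N:, col_pos] = value
    return grp

# group_keys=False keeps the original index instead of prepending the group key again
df = df.groupby(level=0, group_keys=False).apply(replace_tail, 'tmpShift', 2, np.nan)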
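If the per-group apply itself becomes the bottleneck, a boolean mask built from `cumcount` might avoid it. This is only a sketch under assumptions: `blank_edge_rows` and the toy frame are hypothetical names made up for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two groups of four rows each.
idx = pd.MultiIndex.from_product([["a", "b"], [1, 2, 3, 4]],
                                 names=["category", "step"])
toy = pd.DataFrame({"colB": np.arange(8, dtype=float)}, index=idx)


def blank_edge_rows(frame, col, n, value=np.nan):
    """Hypothetical helper: overwrite the first n rows (n > 0) or the
    last |n| rows (n < 0) of every level-0 group in column `col`."""
    grouped = frame.groupby(level=0)
    if n > 0:
        mask = grouped.cumcount() < n                  # 0, 1, 2, ... from the group start
    elif n < 0:
        mask = grouped.cumcount(ascending=False) < -n  # 0, 1, 2, ... from the group end
    else:
        return frame.copy()
    out = frame.copy()
    out.loc[mask.to_numpy(), col] = value
    return out


shift_by = -1
toy["tmpShift"] = toy["colB"].shift(shift_by)       # global shift, leaks across groups
masked = blank_edge_rows(toy, "tmpShift", shift_by)  # blank the leaked rows
```

The sign convention follows `replace_tail` above, so the same `shift_by` value can be passed to both the shift and the masking step.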

      



So the final code:

import numpy as np
import pandas as pd

def replace_tail(grp, col, N, value):
    # N > 0 overwrites the first N rows of the group; N < 0 overwrites the last |N| rows.
    grp = grp.copy()
    col_pos = grp.columns.get_loc(col)
    if N > 0:
        grp.iloc[:N, col_pos] = value
    elif N < 0:
        grp.iloc[N:, col_pos] = value
    return grp


length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in range(groups):
    tmpdf = pd.DataFrame({'date':rng1,'category':int(10000000*abs(np.random.randn())),'colA':np.random.randn(length),'colB':np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)

df.sort_values(by=['category','date'], inplace=True)
df.set_index(['category','date'], inplace=True, drop=True)
shiftBy = -1
df['tmpShift'] = df['colB'].shift(shiftBy)
# group_keys=False keeps the original index instead of prepending the group key again
df = df.groupby(level=0, group_keys=False).apply(replace_tail, 'tmpShift', shiftBy, np.nan)

# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop(columns='tmpShift', inplace=True)
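As a sanity check, the global-shift-then-blank trick can be compared against a plain per-group shift. The sketch below uses made-up data (the `check` frame and fixed category labels are assumptions for illustration) and blanks the leaked rows with a `cumcount` mask rather than `replace_tail`, so it only covers a negative `shift_by`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three groups of five daily rows each.
rng = np.random.default_rng(0)
idx = pd.MultiIndex.from_product(
    [[101, 202, 303], pd.date_range("1990-01-01", periods=5, freq="D")],
    names=["category", "date"],
)
check = pd.DataFrame({"colB": rng.standard_normal(len(idx))}, index=idx)

shift_by = -1  # look forward by one row

# Reference result: shift inside each group.
expected = check.groupby(level=0)["colB"].shift(shift_by)

# Shortcut: one global shift, then blank the rows whose values came
# from the following group (the last |shift_by| rows of each group).
fast = check["colB"].shift(shift_by)
leaked = check.groupby(level=0).cumcount(ascending=False) < abs(shift_by)
fast = fast.mask(leaked)

# Raises AssertionError if the two approaches disagree.
pd.testing.assert_series_equal(fast, expected)
```

If the assertion passes on your own data as well, the faster version can stand in for the per-group shift.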

      
