Faster alternative to grouby / shift
A file under the general heading "groupby is slow if you have a lot of groups"
nobs = 9999 df = DataFrame( { 'id' : np.arange(nobs) / 3, 'yr' : np.tile( np.array([2007,2008,2009]), nobs/3 ), 'val': np.random.randn(nobs) } ) df = df.sort(['id','yr']) A = df.groupby('id').shift() B = df.shift()
A is what I want, but it takes about 1.5 seconds here and my actual use case has about 100 times of observations. For reference, computing A is about 1000 times slower than computing B.
This is what the letters A and B look like:
In [599]: A.head(6)
Out[599]:
val yr
0 NaN NaN
1 -0.839041 2007
2 -1.089094 2008
3 NaN NaN
4 -0.068383 2007
5 0.555293 2008
In [600]: B.head(6)
Out[600]:
id val yr
0 NaN NaN NaN
1 0 -0.839041 2007
2 0 -1.089094 2008
3 0 0.050604 2009
4 1 -0.068383 2007
5 1 0.555293 2008
I would have liked the general solution to speed up A, but if it wasn't, the workaround would be great. As you can see, B is actually the same as A, except that the first value of each group is not really valid and must be converted to NaN. It can be done with groupby / rank, but everything about groupby seems to be slow, so I need a non-groupby method.
Is there a way to replicate the rank function through sorting or indexing? It looks like the information should be embedded there, but I don't know how to extract it into a new variable.
(Edited to add the following)
Here is the solution from the link provided by Jeff below (original answer by HYRY). I modified it a bit to work with the example here. On my computer it runs at almost the same speed as the DSM solution.
B.iloc[df.groupby('id').size().cumsum()[:-1]] = np.nan
source to share
This isn't the most elegant piece of code, but as a workaround for your case, how about something like:
def fast(df, col):
A = df.sort(col).shift()
last = A.iloc[-1].copy()
A.loc[A[col] != A[col].shift(-1)] = np.nan
A.iloc[-1] = last
return A.drop(col, axis=1)
def slow(df, col):
A = df.sort(col).groupby(col).shift()
return A
which gives
>>> %timeit s = slow(df, "id")
1 loops, best of 3: 2.09 s per loop
>>> %timeit f = fast(df, "id")
100 loops, best of 3: 3.51 ms per loop
>>> slow(df, "id").equals(fast(df, "id"))
True
source to share