Faster alternative to groupby / shift

File this under the general heading "groupby is slow if you have a lot of groups".

import numpy as np
from pandas import DataFrame

nobs = 9999

df = DataFrame({'id':  np.arange(nobs) // 3,
                'yr':  np.tile(np.array([2007, 2008, 2009]), nobs // 3),
                'val': np.random.randn(nobs)})

df = df.sort_values(['id', 'yr'])

A = df.groupby('id').shift()
B = df.shift()

A is what I want, but it takes about 1.5 seconds here, and my actual use case has about 100 times as many observations. For reference, computing A is about 1000 times slower than computing B.

Here is what A and B look like:

In [599]: A.head(6)
Out[599]: 
        val    yr
0       NaN   NaN
1 -0.839041  2007
2 -1.089094  2008
3       NaN   NaN
4 -0.068383  2007
5  0.555293  2008

In [600]: B.head(6)
Out[600]: 
   id       val    yr
0 NaN       NaN   NaN
1   0 -0.839041  2007
2   0 -1.089094  2008
3   0  0.050604  2009
4   1 -0.068383  2007
5   1  0.555293  2008

I would prefer a general solution that speeds up A, but failing that, a workaround would be great. As you can see, B is really the same as A, except that the first row of each group is not actually valid and needs to be set to NaN (and B still carries the id column, which A does not). That could be done with groupby / rank, but everything involving groupby seems to be slow, so I need a non-groupby method.

Is there a way to replicate what rank gives me through sorting or indexing? It seems like that information should already be embedded there, but I don't know how to extract it into a new variable.
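
One direction that seems plausible (only a sketch, not something I have benchmarked): since df is already sorted by id, the first row of each group is simply where the id column differs from its own shift, and that boolean mask can be applied to B without any groupby:

first_of_group = df['id'] != df['id'].shift()  # True on the first row of each id
B = df.shift()
B.loc[first_of_group.values] = np.nan          # invalidate the first row of each group
B = B.drop('id', axis=1)                       # match the columns of A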

(Edited to add the following)

Here is the solution from the link Jeff provided below (original answer by HYRY). I modified it a bit to work with the example here. On my computer it runs at almost the same speed as DSM's solution.

B.iloc[df.groupby('id').size().cumsum()[:-1]] = np.nan
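
The positions that line blanks out are just the cumulative group sizes, i.e. the positional index of the first row of every group after the first one (row 0 is already NaN from the shift). If the goal is to avoid groupby entirely, the same positions can be computed straight from the sorted id values; this is only a sketch of the idea, not part of the linked answer:

ids = df['id'].values
starts = np.flatnonzero(ids[1:] != ids[:-1]) + 1  # first rows of groups 2, 3, ...
B.iloc[starts] = np.nan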

1 answer


This isn't the most elegant piece of code, but as a workaround for your case, how about something like:

def fast(df, col):
    # One frame-wide shift on the sorted data instead of a per-group shift.
    A = df.sort_values(col).shift()
    # The mask below also flags the very last row (its look-ahead is NaN),
    # so save that row and restore it afterwards.
    last = A.iloc[-1].copy()
    # After the shift, A[col] at row i holds id[i-1] while A[col].shift(-1)
    # holds id[i]; they differ on the first row of each group (and,
    # spuriously, on the last row handled above).
    A.loc[A[col] != A[col].shift(-1)] = np.nan
    A.iloc[-1] = last
    return A.drop(col, axis=1)

def slow(df, col):
    A = df.sort_values(col).groupby(col).shift()
    return A

which gives

>>> %timeit s = slow(df, "id")
1 loops, best of 3: 2.09 s per loop
>>> %timeit f = fast(df, "id")
100 loops, best of 3: 3.51 ms per loop
>>> slow(df, "id").equals(fast(df, "id"))
True
