Defining a custom pandas aggregation function with Cython
I have a large DataFrame in pandas with three columns: 'col1' is a string, and 'col2' and 'col3' are numpy.int64. I need to do a groupby and then apply a custom aggregation function using apply, like this:
    import pandas

    df = pandas.read_csv(...)
    groups = df.groupby('col1').apply(my_custom_function)
Each group can be thought of as a numpy array with two integer columns, 'col2' and 'col3'. To understand what I am doing, you can think of each row ('col2', 'col3') as a time interval; I am checking whether any intervals overlap. First, I sort the array by the first column, and then I compare the second-column value at index i with the first-column value at index i + 1: if the former is greater, the two intervals overlap.
FIRST QUESTION. My idea is to use Cython to define a custom aggregation function. Is this a good idea?
I tried the following definition in a .pyx file:
    cimport numpy as c_np

    def c_my_custom_function(my_group_df):
        cdef Py_ssize_t l = len(my_group_df.index)
        if l < 2:
            return False
        # Typed memoryview over the two integer columns, sorted by 'col2'
        cdef c_np.int64_t[:, :] temp_array
        temp_array = my_group_df[['col2', 'col3']].sort_values(by='col2').values
        cdef Py_ssize_t i
        for i in range(l - 1):
            # Interval i ends after interval i + 1 starts: overlap
            if temp_array[i, 1] > temp_array[i + 1, 0]:
                return True
        return False
I also defined a pure Python / pandas version:
    def my_custom_function(my_group_df):
        l = len(my_group_df.index)
        if l < 2:
            return False
        # Sort the intervals by their start ('col2') before comparing neighbours
        temp_array = my_group_df[['col2', 'col3']].sort_values(by='col2').values
        for i in range(l - 1):
            if temp_array[i, 1] > temp_array[i + 1, 0]:
                return True
        return False
SECOND QUESTION. I've timed both versions and they take about the same time. The Cython version doesn't seem to speed anything up. What's happening?
BONUS QUESTION: Do you see a better way to implement this algorithm?
A vectorized numpy test could be:

    np.any(temp_array[:-1, 1] > temp_array[1:, 0])
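For illustration, the one-liner above can be wrapped into a small standalone function. This is only a sketch: the name `has_overlap` and the sample arrays are made up for the example, and it assumes each row is a (start, end) pair of integers, as in the question.

```python
import numpy as np

def has_overlap(intervals):
    """Return True if any two (start, end) rows overlap (hypothetical helper)."""
    intervals = np.asarray(intervals)
    if len(intervals) < 2:
        return False
    # Sort the rows by their start value (column 0).
    ordered = intervals[np.argsort(intervals[:, 0], kind="stable")]
    # Compare every end with the start of the next interval in one pass.
    return bool(np.any(ordered[:-1, 1] > ordered[1:, 0]))

# Made-up sample data: three intervals each, given in unsorted order.
disjoint = np.array([[10, 12], [1, 3], [5, 8]], dtype=np.int64)
clashing = np.array([[1, 6], [10, 12], [5, 8]], dtype=np.int64)

print(has_overlap(disjoint))
print(has_overlap(clashing))
```

With the strict `>` comparison, intervals that merely touch (one ends exactly where the next starts) are not counted as overlapping, which matches the loop in the question. In recent pandas versions it should be possible to call such a function per group with something like `df.groupby('col1').apply(lambda g: has_overlap(g[['col2', 'col3']].to_numpy()))`.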
Whether it is better than iterating in Python or Cython depends on where the True occurs, if it occurs at all. If the loop returns early, iteration is clearly better, and the Cython version won't have much of an advantage. Also, the test step will be faster than the sort step.
But if the iteration usually runs all the way to the end, the vectorized test will be faster than iterating in Python, and faster than the sort. It may still be slower than a properly coded Cython loop.
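To make the two cases concrete, here is a rough sketch that puts the early-exit loop and the vectorized test side by side on already sorted data. The function names, the array size, and the synthetic intervals are all assumptions chosen for the example, not taken from the question.

```python
import numpy as np

def overlap_loop(sorted_intervals):
    """Early-exit loop: stops at the first overlapping neighbour pair."""
    for i in range(len(sorted_intervals) - 1):
        if sorted_intervals[i, 1] > sorted_intervals[i + 1, 0]:
            return True
    return False

def overlap_vectorized(sorted_intervals):
    """Vectorized test: always compares every neighbour pair."""
    return bool(np.any(sorted_intervals[:-1, 1] > sorted_intervals[1:, 0]))

n = 10_000  # arbitrary size for the illustration
starts = np.arange(n, dtype=np.int64) * 10

# No overlaps anywhere: the loop has to walk the entire array.
no_overlap = np.column_stack([starts, starts + 5])

# Same data, but the first interval runs into the second one,
# so the loop can return at i == 0.
early_overlap = no_overlap.copy()
early_overlap[0, 1] = 15

print(overlap_loop(no_overlap), overlap_vectorized(no_overlap))
print(overlap_loop(early_overlap), overlap_vectorized(early_overlap))
```

Both functions give the same answer on both inputs; only the amount of work differs. Timing them with `timeit` on data shaped like your real groups is the most reliable way to see which case dominates.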