Defining a custom pandas aggregation function with Cython

I have a large pandas DataFrame with three columns: 'col1' is a string, while 'col2' and 'col3' are numpy.int64. I need to do a groupby and then apply a custom aggregation function using apply, like this:

import pandas
df = pandas.read_csv(...)
groups = df.groupby('col1').apply(my_custom_function)

      

Each group can be thought of as a numpy array with two integer columns, 'col2' and 'col3'. To understand what I am doing, think of each row ('col2', 'col3') as a time interval; I am checking whether any of the intervals overlap. I first sort the array by the first column, then check whether the second-column value at index i is greater than the first-column value at index i + 1, which means the two intervals overlap.

FIRST QUESTION. My idea is to use Cython to define the custom aggregation function. Is this a good idea?

I tried the following definition in a .pyx file:

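cimport numpy as c_np

def c_my_custom_function(my_group_df):
    cdef Py_ssize_t l = len(my_group_df.index)
    if l < 2:
        return False

    # Typed memoryview over the sorted (col2, col3) values; requires int64 data.
    cdef c_np.int64_t[:, :] temp_array
    # sort_values replaces DataFrame.sort(columns=...), which newer pandas removed.
    temp_array = my_group_df[['col2', 'col3']].sort_values(by='col2').values
    cdef Py_ssize_t i

    # An interval overlaps its successor if it ends after the successor starts.
    for i in range(l - 1):
        if temp_array[i, 1] > temp_array[i + 1, 0]:
            return True
    return False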

      

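I also defined a pure Python / pandas version:

def my_custom_function(my_group_df):
    l = len(my_group_df.index)
    if l < 2:
        return False

    # sort_values replaces DataFrame.sort(columns=...), which newer pandas removed.
    temp_array = my_group_df[['col2', 'col3']].sort_values(by='col2').values

    # An interval overlaps its successor if it ends after the successor starts.
    for i in range(l - 1):
        if temp_array[i, 1] > temp_array[i + 1, 0]:
            return True
    return False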

      

SECOND QUESTION. I timed the two versions and both take the same amount of time. The Cython version doesn't seem to speed anything up. What's happening?

BONUS QUESTION: Do you see a better way to implement this algorithm?

1 answer


A vectorized numpy test could be:

np.any(temp_array[:-1,1]>temp_array[1:,0])
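
As a minimal sketch, the one-line test can be wrapped in a function with the same shape as my_custom_function from the question. The name has_overlap_vectorized and the sample data are made up for this illustration, and sort_values is assumed to be available (pandas 0.17 or later).

```python
import numpy as np
import pandas as pd


def has_overlap_vectorized(my_group_df):
    # Hypothetical helper name; mirrors my_custom_function from the question.
    if len(my_group_df.index) < 2:
        return False
    # Sort intervals by start, then pull out a plain 2-D integer array.
    temp_array = my_group_df[['col2', 'col3']].sort_values(by='col2').values
    # Compare every interval's end with the next interval's start in one pass.
    return bool(np.any(temp_array[:-1, 1] > temp_array[1:, 0]))


# Made-up sample data: group 'a' has overlapping intervals, group 'b' does not.
df = pd.DataFrame({
    'col1': ['a', 'a', 'b', 'b'],
    'col2': [1, 3, 1, 6],
    'col3': [5, 8, 4, 9],
})
groups = df.groupby('col1')[['col2', 'col3']].apply(has_overlap_vectorized)
```

Here groups is a Series indexed by the 'col1' values, with one boolean per group. Unlike the loop, this version always scans the whole group and cannot return early.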

      



Whether it is better than a Python or Cython iteration depends on where the True occurs, if it occurs at all. If the function returns early in the iteration, the iteration is clearly better, and the Cython version won't have much of an advantage. Also, the test step will be faster than the sort step.

But if the iteration usually runs all the way through, then the vectorized test will be faster than the Python iteration, and faster than the sort. It may, though, be slower than a properly coded Cython iteration.
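
For the case where every group is scanned to the end, one possible way to stay fully vectorized is to skip apply and run the test on the whole frame at once. This is only a sketch under assumptions, not something taken from the answer itself: the helper name groups_with_overlap and the sample data are hypothetical, and the approach relies on groupby(...).shift to line up each interval with the previous one in the same group. Whether it is actually faster should be timed on real data.

```python
import pandas as pd


def groups_with_overlap(df):
    # Hypothetical helper: one boolean per 'col1' group, True if any overlap.
    # Sort once for the whole frame: by group, then by interval start.
    s = df.sort_values(by=['col1', 'col2'])
    # End of the previous interval within the same group (NaN for the first row).
    prev_end = s.groupby('col1')['col3'].shift(1)
    # Comparisons against NaN are False, so a group's first row never counts.
    overlap = prev_end > s['col2']
    # Collapse the row-level flags to one flag per group.
    return overlap.groupby(s['col1']).any()


# Made-up sample data: group 'a' has overlapping intervals, group 'b' does not.
df = pd.DataFrame({
    'col1': ['a', 'b', 'a', 'b'],
    'col2': [3, 1, 1, 6],
    'col3': [8, 4, 5, 9],
})
result = groups_with_overlap(df)
```

The returned Series has one entry per 'col1' value, so it can stand in for the output of the groupby/apply call in the question.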
