Python pandas: simple apply statements are incredibly slow

I have a pandas DataFrame with approx. 250,000 rows. I am trying to create a new column like this:

df['new_field'] = df.apply(lambda x: x.field2 if x.field1 > 0 else 0, axis=1)

This works, but it takes about 15 seconds to execute this one line!

I've optimized it this way:

import numba
import numpy as np

@numba.jit(nopython=True)
def mycalc(field1, field2, out):
    # Explicit loop, compiled to machine code by numba
    for i in range(field1.size):
        if field1[i] > 0:
            out[i] = field2[i]
        else:
            out[i] = 0
    return out

df['new_field'] = mycalc(df.field1.to_numpy(), df.field2.to_numpy(), np.zeros(df.field1.size))

and now it takes 0.25 seconds.

My question is, is there a better way to do this?

The timing of the numba solution is great, but the whole approach seems clunky: I would expect such a trivial calculation to be done efficiently in one line. Also, with numba in nopython mode, I have to allocate the output array outside numba and pass it in, because as I understand it numba cannot create new arrays in nopython mode.

Some of the data comes from SQL, and the more I use pandas, the more I feel I'm better off doing as much as possible in SQL, because the difference in speed is crazy. Of course I expect SQL to be faster when working with GBs of data, but 15 seconds for this trivial calculation on 250,000 rows is too much.

Thanks!





1 answer


You can use np.where:

df['new_field'] = np.where(df['field1'] > 0, df['field2'], 0)

This tests your boolean condition and returns df['field2'] where it is True, else 0.
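A quick way to convince yourself the two approaches agree: on a toy DataFrame (made-up field1/field2 values standing in for the real 250,000-row data), np.where produces exactly what the row-wise apply does:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data
df = pd.DataFrame({'field1': [3, -1, 0, 7], 'field2': [10, 20, 30, 40]})

# Vectorized: the condition, both branches, and the selection all run in C
df['new_field'] = np.where(df['field1'] > 0, df['field2'], 0)

# Same result as the slow row-wise apply
expected = df.apply(lambda x: x.field2 if x.field1 > 0 else 0, axis=1)
assert (df['new_field'] == expected).all()
```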



or in pandas style:

df['new_field'] = df['field2'].where(df['field1'] > 0, 0)
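The same toy check (made-up column values, just for illustration) works for the Series.where form, which keeps the field2 values where the condition is True and fills the rest with 0:

```python
import pandas as pd

df = pd.DataFrame({'field1': [3, -1, 0, 7], 'field2': [10, 20, 30, 40]})

# Series.where keeps values where the condition holds, else fills with 0
df['new_field'] = df['field2'].where(df['field1'] > 0, 0)
```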
