Python pandas: trivial apply statements are incredibly slow
I have a pandas DataFrame with approximately 250,000 rows. I am trying to create a new column like this:
    df['new_field'] = df.apply(lambda x: x.field2 if x.field1 > 0 else 0, axis=1)
This works, but it takes about 15 seconds to execute that one line!
I've optimized it this way:
    import numba
    import numpy as np

    @numba.jit(nopython=True)
    def mycalc(field1, field2, out):
        for i in range(field1.size):  # range instead of the Python-2-only xrange
            if field1[i] > 0:
                out[i] = field2[i]
            else:
                out[i] = 0
        return out

    # .values replaces the as_matrix() call, which was removed in pandas 1.0
    df['new_field'] = mycalc(df.field1.values, df.field2.values, np.zeros(df.field1.size))
and now it takes 0.25 seconds.
My question is, is there a better way to do this?
The numba timing is great, but the whole approach feels hacky: I would expect such a trivial calculation to be done efficiently in one line. Also, with numba in nopython mode, I need to allocate the output array outside numba and pass it in, because as I understand it, numba cannot create new arrays in nopython mode.
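For what it's worth, a fully vectorized one-liner does exist for this pattern: np.where (or the equivalent pandas Series.where) evaluates the condition over the whole column at once, with no per-row Python calls. A minimal sketch, assuming field1 and field2 are numeric columns (the small DataFrame here is a hypothetical stand-in for the real 250,000-row one):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real DataFrame.
df = pd.DataFrame({'field1': [-1.0, 2.0, 0.0, 3.0],
                   'field2': [10.0, 20.0, 30.0, 40.0]})

# Vectorized: pick field2 where field1 > 0, otherwise 0.
df['new_field'] = np.where(df['field1'] > 0, df['field2'], 0)

# Equivalent pandas form: keep field2 where the condition holds, else fill with 0.
df['new_field_alt'] = df['field2'].where(df['field1'] > 0, 0)
```

Both forms run as compiled loops over the whole array, so they are typically in the same ballpark as the numba version and orders of magnitude faster than apply(axis=1).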
Some of the data comes from SQL, and the more I use pandas, the more I feel I'm better off doing as much as possible in SQL, because the speed difference is crazy. Of course I expect SQL to be faster when working with GBs of data, but 15 seconds for this trivial calculation on 250,000 rows is too much.
Thanks!