Python pandas: ordinary apply statements are incredibly slow
I have a pandas DataFrame with approximately 250,000 rows. I am trying to create a new column like this:
df['new_field'] = df.apply(lambda x: x.field2 if x.field1 > 0 else 0, axis=1)
This works, but it takes about 15 seconds to execute this single line!
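For reference, here is a minimal, self-contained reproduction of that pattern on a toy DataFrame (the sample data is made up; the column names match the question). The row-wise `apply` calls the lambda once per row in Python, which is exactly why it is slow at scale:

```python
import pandas as pd

df = pd.DataFrame({"field1": [1, -2, 3, 0], "field2": [10, 20, 30, 40]})
# apply with axis=1 invokes the lambda for every row individually.
df["new_field"] = df.apply(lambda x: x.field2 if x.field1 > 0 else 0, axis=1)
print(df["new_field"].tolist())  # → [10, 0, 30, 0]
```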
I've optimized it this way:
import numba
import numpy as np

@numba.jit(nopython=True)
def mycalc(field1, field2, out):
    for i in range(field1.size):
        if field1[i] > 0:
            out[i] = field2[i]
        else:
            out[i] = 0
    return out

df['new_field'] = mycalc(df.field1.to_numpy(), df.field2.to_numpy(), np.zeros(df.field1.size))
and now it takes 0.25 seconds.
My question is, is there a better way to do this?
The timing of the numba solution is great, but the whole approach seems clumsy: I would expect such a trivial operation to be done efficiently in one line. Also, with numba in nopython mode, I have to allocate the output array outside numba and pass it in, because as I understand it, numba cannot create new arrays in nopython mode.
Some of the data comes from SQL, and the more I use pandas, the more I feel I'm better off doing as much as possible in SQL, because the difference in speed is dramatic. Of course I expect SQL to be faster when working with gigabytes of data, but 15 seconds for this trivial calculation on 250,000 rows is too much.
Thanks!
You can use np.where:

df['new_field'] = np.where(df['field1'] > 0, df['field2'], 0)

The above tests your boolean condition and returns df['field2'] where it is True, else 0.
Or in pandas style:

df['new_field'] = df['field2'].where(df['field1'] > 0, 0)
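To see that the two vectorized forms agree, here is a small sketch on a toy DataFrame (the sample values are mine; column names are from the question). `np.where` returns a plain NumPy array, while `Series.where` returns a Series that keeps the original index:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"field1": [1, -2, 3, 0], "field2": [10, 20, 30, 40]})

# NumPy version: evaluates the condition over the whole column at once.
via_np = np.where(df["field1"] > 0, df["field2"], 0)
# pandas version: keeps field2 where the condition holds, fills 0 elsewhere.
via_pd = df["field2"].where(df["field1"] > 0, 0)

assert (via_np == via_pd.to_numpy()).all()
print(via_pd.tolist())  # → [10, 0, 30, 0]
```

Both run the comparison as a single vectorized pass over the column, which is why they are orders of magnitude faster than a row-wise `apply`.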