Pandas: speed of df.loc[x, 'column']
I have a pandas DataFrame with about 100 rows, from which I need to efficiently select values from a column for a given index. At the moment I am using df.loc[index, 'col'] for this, but it seems to be relatively slow:
df = pd.DataFrame({'col': range(100)}, index=range(100))
%timeit df.loc[random.randint(0, 99), 'col']
#100000 loops, best of 3: 19.3 µs per loop
What seems to be much faster (about 10x) is to convert the DataFrame into a dictionary and then query that instead:
d = df.to_dict()
%timeit d['col'][random.randint(0, 99)]
#100000 loops, best of 3: 2.5 µs per loop
Is there a way to get similar performance using normal DataFrame methods, without explicitly creating the dict? Should I be using something other than .loc?
Or is it just a situation where I would be better off using this workaround?
A dict really seems to be the fastest option:
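import random, timeit
import numpy as np
import pandas as pd
df = pd.DataFrame({'col': range(100)}, index=range(100))
df_dict = df.to_dict()
df_numpy = np.array(df)
print(timeit.timeit("df.loc[random.randint(0, 99), 'col']", number=100000, globals=globals()))
# Note: DataFrame.get_value was removed in pandas 1.0; df.at[row, 'col'] is its replacement.
print(timeit.timeit("df.get_value(random.randint(0, 99), 'col')", number=100000, globals=globals()))
# Note: this indexes the array twice (the inner result is used as a fancy index), not once.
print(timeit.timeit('df_numpy[df_numpy[random.randint(0, 99)]]', number=100000, globals=globals()))
print(timeit.timeit("df_dict['col'][random.randint(0, 99)]", number=100000, globals=globals()))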
Result (total seconds for 100,000 calls, in the same order as the print statements):
4.859706375747919
1.8850274719297886
1.4855970665812492
0.6550335008651018
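Since get_value is no longer available in pandas 1.0 and later, here is a minimal sketch of the same comparison using the scalar accessor df.at in its place. The names col_dict, col_array and candidates are made up for this illustration, and the NumPy variant assumes the index is exactly 0..99, so that positions and labels coincide. No timings are quoted because they depend on the machine and the pandas version.

```python
import random
import timeit

import pandas as pd

df = pd.DataFrame({'col': range(100)}, index=range(100))

# Hypothetical helper names, not part of the original answer.
col_dict = df['col'].to_dict()    # plain {index label: value} mapping
col_array = df['col'].to_numpy()  # 1-D array; position == label only because the index is 0..99

candidates = {
    'loc': lambda i: df.loc[i, 'col'],    # general label-based indexer
    'at': lambda i: df.at[i, 'col'],      # scalar-only label-based accessor
    'numpy': lambda i: col_array[i],      # positional lookup in the raw array
    'dict': lambda i: col_dict[i],        # plain Python dict lookup
}

# Sanity check: every approach returns the same value for every row.
for i in range(100):
    assert len({int(lookup(i)) for lookup in candidates.values()}) == 1

# Time each approach on random single-value lookups.
timings = {}
for name, lookup in candidates.items():
    timings[name] = timeit.timeit(lambda: lookup(random.randint(0, 99)), number=10000)
    print(f"{name:>5}: {timings[name]:.4f} s")
```

If the lookups have to stay inside pandas, df.at is the accessor intended for single values; the dict and array variants trade that convenience for less per-call overhead.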
If performance is a factor to consider, NumPy arrays may be a better choice than a pandas DataFrame. I reproduced your example to compare the two:
import numpy as np
import pandas as pd
import timeit, random
df = pd.DataFrame({'col': range(100)}, index=range(100))
print(timeit.timeit('df.loc[random.randint(0, 99), "col"]', number=10000, globals=globals()))
ds_numpy = np.array(df)
# Note: this indexes the array twice (the inner result is used as a fancy index), not once.
print(timeit.timeit('ds_numpy[ds_numpy[random.randint(0, 99)]]', number=10000, globals=globals()))
Results:
$ python test_pandas_vs_numpy.py
0.1583892970229499
0.05918855100753717
In this case, using a NumPy array instead of a pandas DataFrame gives a performance advantage.
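One caveat about the NumPy expression timed above: it indexes the array with the result of another lookup, so it is not a single scalar access. The sketch below shows what each form returns for one row. The names nested and scalar are only for illustration, and the nested form works here only because the stored values happen to be valid row positions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': range(100)}, index=range(100))
ds_numpy = np.array(df)          # shape (100, 1): one row per DataFrame row

i = 42
nested = ds_numpy[ds_numpy[i]]   # inner lookup yields array([42]), which then fancy-indexes the rows
scalar = ds_numpy[i, 0]          # single scalar lookup: row i, column 0

print(nested.shape)  # (1, 1)
print(scalar)        # 42
```

For a fair comparison with df.loc[i, 'col'], the single lookup ds_numpy[i, 0] is the closer equivalent.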