Speed up Pandas filtering
I have a 37,456,153 row x 3 column Pandas dataframe consisting of the following columns: [Timestamp, Span, Elevation]. Each Timestamp value has approximately 62,000 rows of Span and Elevation data, which looks like this (for example, when filtering for Timestamp = 17210):
        Timestamp       Span  Elevation
94614       17210  -0.019766     36.571
94615       17210  -0.019656     36.453
94616       17210  -0.019447     36.506
94617       17210  -0.018810     36.507
94618       17210  -0.017883     36.502
...           ...        ...        ...
157188      17210  91.004000     33.493
157189      17210  91.005000     33.501
157190      17210  91.010000     33.497
157191      17210  91.012000     33.500
157192      17210  91.013000     33.503
As seen above, the Span data is not equally spaced, which is what I really need. So I came up with the following code to convert it to an equally spaced format. I know the start and end locations I would like to analyze, and I defined a delta parameter as my increment. I created a numpy array named mesh that contains the equally spaced Span values I would like to end up with. Finally, I decided to iterate over the dataframe for a given Timestamp (17300 in the code) to test how fast it runs. The for loop in the code calculates the average Elevation for a +/- 0.5*delta range at each increment.
My problem: it takes 603 ms to filter the dataframe and calculate the average Elevation for a single iteration. For these parameters I have to go through 9101 iterations, which gives about 1.5 hours of computation time to complete the loop. Moreover, this is for a single Timestamp value, and I have 600 of them (900 hours to do everything?!).
Is there a way to speed up this loop? Thanks a lot for any input!
import numpy as np

# MESH GENERATION
start = 0
end = 91
delta = 0.01

mesh = np.linspace(start, end, num=int(end / delta) + 1)

elevation_list = []

# Loop below will take forever to run, any idea about how to optimize it?!
for current_loc in mesh:
    average_elevation = np.average(df[(df.Timestamp == 17300) &
                                      (df.Span > current_loc - delta / 2) &
                                      (df.Span < current_loc + delta / 2)].Elevation)
    elevation_list.append(average_elevation)
You can vectorize the whole thing using np.searchsorted. I am not much of a pandas user, but something like this should work, and it runs reasonably fast on my system. Using chrisb's dummy data:
In [8]: %%timeit
   ...: mesh = np.linspace(start, end, num=int(end / delta) + 1)
   ...: midpoints = (mesh[:-1] + mesh[1:]) / 2
   ...: idx = np.searchsorted(midpoints, df.Span)
   ...: averages = np.bincount(idx, weights=df.Elevation, minlength=len(mesh))
   ...: averages /= np.bincount(idx, minlength=len(mesh))
   ...:
100 loops, best of 3: 5.62 ms per loop
This is about 3500x faster than your code:
In [12]: %%timeit
    ...: mesh = np.linspace(start, end, num=int(end / delta) + 1)
    ...: elevation_list = []
    ...: for current_loc in mesh:
    ...:     average_elevation = np.average(df[(df.Span > current_loc - delta / 2) &
    ...:                                       (df.Span < current_loc + delta / 2)].Elevation)
    ...:     elevation_list.append(average_elevation)
    ...:
1 loops, best of 3: 19.1 s per loop
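Since you eventually need this for all ~600 Timestamp values, here is a rough sketch (my own addition, not benchmarked against your real data) of how the same searchsorted/bincount trick could be wrapped in a groupby over Timestamp; only the column names come from your question, everything else is an assumption:

import numpy as np
import pandas as pd

def bucket_means(group, midpoints, n_bins):
    # Bucket each Span value and average Elevation per bucket (NaN for empty buckets).
    idx = np.searchsorted(midpoints, group.Span.values)
    sums = np.bincount(idx, weights=group.Elevation.values, minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    with np.errstate(invalid='ignore'):
        return pd.Series(sums / counts)

mesh = np.linspace(start, end, num=int(end / delta) + 1)
midpoints = (mesh[:-1] + mesh[1:]) / 2
# One row of bucket averages per Timestamp value.
all_averages = df.groupby('Timestamp').apply(bucket_means, midpoints, len(mesh))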
EDIT: How does it work? In midpoints we store a sorted list of the bucket boundaries. We then do a binary search with searchsorted on that sorted list and get idx, which basically tells us to which bucket each data point belongs. All that is left is to group all the values in each bucket. That's what bincount is for. Given an array of ints, it counts how many times each number appears. Given an array of ints and a corresponding array of weights, instead of adding 1 to the bucket's tally it adds the corresponding value from weights. With two calls to bincount you get the sum and the number of items per bucket: divide them and you get the bucket average.
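To make the mechanics concrete, here is a tiny made-up example (the numbers are purely illustrative, not taken from the question's data):

import numpy as np

midpoints = np.array([0.5, 1.5, 2.5])            # boundaries between 4 buckets
span      = np.array([0.2, 0.4, 1.1, 2.9, 3.7])  # data points to be bucketed
elevation = np.array([10., 20., 30., 40., 50.])  # value carried by each point

idx = np.searchsorted(midpoints, span)
# idx -> array([0, 0, 1, 3, 3]): the bucket index of each point

sums = np.bincount(idx, weights=elevation, minlength=4)   # [30., 30.,  0., 90.]
counts = np.bincount(idx, minlength=4)                    # [ 2,   1,   0,   2 ]
# per-bucket average = sums / counts -> [15., 30., nan, 45.]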
Here's an idea - might still be too slow, but I thought I'd share it. First, some bogus data.
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'Timestamp': 17210,
                        'Span': np.linspace(-1, 92, num=60000),
                        'Elevation': np.linspace(33., 37., num=60000)})
Then take the mesh array you created, turn it into a dataframe, and add a shifted column, so that each row of the dataframe represents one step of the new, equally spaced range.
mesh_df = pd.DataFrame(mesh, columns=['Equal_Span'])
mesh_df['Equal_Span_Prev'] = mesh_df['Equal_Span'].shift(1)
mesh_df = mesh_df.dropna()
Then I want to join the big dataframe to this one, based on each record falling between the two Equal_Span columns. There may be a way to do that in pandas, but cartesian-style joins seem much easier to express in SQL, so I first ship all the data into an in-memory sqlite database. If you run into memory problems, I would make this a file-based database.
import sqlite3
con = sqlite3.connect(':memory:')
df.to_sql('df', con, index=False)
mesh_df.to_sql('mesh_df', con, index=False)
Here's the basic query. It took about 1 minute 30 seconds on my test data, so it will probably take a long time on the full dataset.
join_df = pd.read_sql("""SELECT a.Timestamp, a.Span, a.Elevation, b.Equal_Span
FROM df a, mesh_df b
WHERE a.Span BETWEEN b.Equal_Span_Prev AND b.Equal_Span""", con)
But once the data is in this form, it is easy and quick to get the desired mean.
join_df.groupby(['Timestamp','Equal_Span'])['Elevation'].mean()
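For what it's worth, the "way in pandas" hinted at above might be pd.cut, which bins Span directly without going through SQL. This is just a sketch I have not benchmarked against the sqlite approach, reusing the mesh array from the question:

import pandas as pd

# Bin Span into the (Equal_Span_Prev, Equal_Span] intervals defined by mesh,
# labelling each bin by its right edge, then average Elevation per bin.
equal_span = pd.cut(df['Span'], bins=mesh, labels=mesh[1:]).rename('Equal_Span')
binned_means = df.groupby(['Timestamp', equal_span])['Elevation'].mean()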