Speed up Pandas filtering

I have a 37,456,153 row x 3 column Pandas dataframe consisting of the following columns: [Timestamp, Span, Elevation]. Each Timestamp value has approximately 62,000 rows of Span and Elevation data, which look like this (filtering for Timestamp = 17210, as an example):

        Timestamp       Span  Elevation
94614       17210  -0.019766     36.571
94615       17210  -0.019656     36.453
94616       17210  -0.019447     36.506
94617       17210  -0.018810     36.507
94618       17210  -0.017883     36.502
...           ...        ...        ...
157188      17210  91.004000     33.493
157189      17210  91.005000     33.501
157190      17210  91.010000     33.497
157191      17210  91.012000     33.500
157192      17210  91.013000     33.503

As seen above, the Span data is not equally spaced, which is what I really need it to be. So I came up with the code below to convert it to an equally spaced format. I know the start and end locations I would like to analyze. Then I defined the parameter delta as my increment. I created a numpy array named mesh, which holds the equally spaced Span locations I would like to end up with. Finally, I decided to iterate over the dataframe for a single Timestamp (17300 in the code) to test how fast it would run. The for loop in the code computes the average Elevation within a +/- 0.5*delta window around each increment.

My problem: it takes 603 ms to filter the dataframe and compute the average Elevation for a single iteration. With these parameters I have to go through 9,101 iterations, which works out to roughly 1.5 hours of computation for this loop (0.603 s x 9101 ≈ 5,500 s). Moreover, this is for a single Timestamp value, and I have 600 of them (900 hours to do everything?!).

Is there any way to speed up this loop? Thanks a lot for any input!

# MESH GENERATION
import numpy as np

start = 0
end = 91
delta = 0.01

# Equally spaced Span locations from start to end, every delta
mesh = np.linspace(start, end, num=int(end/delta) + 1)
elevation_list = []

# Loop below will take forever to run, any idea about how to optimize it?!
for current_loc in mesh:
    # Average Elevation of all points within +/- delta/2 of the current mesh location
    average_elevation = np.average(df[(df.Timestamp == 17300) &
                                      (df.Span > current_loc - delta/2) &
                                      (df.Span < current_loc + delta/2)].Elevation)
    elevation_list.append(average_elevation)

2 answers


You can vectorize the whole thing using np.searchsorted. I'm not much of a pandas user, but something like this should work, and it runs pretty fast on my system. Using chrisb's dummy data:

In [8]: %%timeit
   ...: mesh = np.linspace(start, end, num=(end/delta + 1))
   ...: midpoints = (mesh[:-1] + mesh[1:]) / 2
   ...: idx = np.searchsorted(midpoints, df.Span)
   ...: averages = np.bincount(idx, weights=df.Elevation, minlength=len(mesh))
   ...: averages /= np.bincount(idx, minlength=len(mesh))
   ...: 
100 loops, best of 3: 5.62 ms per loop  

      

This is about 3500x faster than your code:



In [12]: %%timeit
    ...: mesh = np.linspace(start, end, num=(end/delta + 1))
    ...: elevation_list =[]
    ...: for current_loc in mesh:
    ...:     average_elevation = np.average(df[(df.Span > current_loc - delta/2) & 
    ...:                                       (df.Span < current_loc + delta/2)].Elevation)
    ...:     elevation_list.append(average_elevation)
    ...: 
1 loops, best of 3: 19.1 s per loop

      


EDIT: How does it work? In midpoints we keep a sorted list of bucket boundaries. We then do a binary search with searchsorted against that sorted list and get idx, which basically tells us which bucket each data point falls into. All that is left is to aggregate the values in each bucket. That is what bincount is for: given an array of ints, it counts how many times each number appears; given an array of ints and a corresponding array of weights, instead of adding 1 to the tally of a bucket it adds the corresponding value from weights. With two calls to bincount you get the sum and the count of items per bucket: divide them and you get the bucket mean.
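
To make the mechanics concrete, here is a small hand-checkable toy example of the same searchsorted/bincount pattern (my illustration, not part of the original answer):

import numpy as np

# Six samples falling into three buckets centred on 0.0, 1.0 and 2.0
span = np.array([0.1, 0.2, 0.9, 1.1, 1.4, 2.2])
elev = np.array([10., 20., 30., 40., 50., 60.])
mesh = np.array([0., 1., 2.])

midpoints = (mesh[:-1] + mesh[1:]) / 2                        # [0.5, 1.5]
idx = np.searchsorted(midpoints, span)                        # [0, 0, 1, 1, 1, 2]
sums = np.bincount(idx, weights=elev, minlength=len(mesh))    # [30., 120., 60.]
counts = np.bincount(idx, minlength=len(mesh))                # [2, 3, 1]
print(sums / counts)                                          # [15. 40. 60.]

One caveat: a bucket that receives no samples produces a 0/0 in the final division and comes out as NaN, so with sparse data you may want to mask or ignore empty buckets.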



Here's an idea - might still be too slow, but I thought I'd share it. First, some bogus data.

df = pd.DataFrame(data={'Timestamp': 17210, 
                        'Span': np.linspace(-1, 92, num=60000), 
                        'Elevation': np.linspace(33., 37., num=60000)})

      

Then take the mesh array you created, turn it into a dataframe, and add a shifted column, so that each row in the dataframe represents one step of the new evenly spaced range.

mesh_df = pd.DataFrame(mesh, columns=['Equal_Span'])
mesh_df['Equal_Span_Prev'] = mesh_df['Equal_Span'].shift(1)
mesh_df = mesh_df.dropna()

      

Then I want to join this to the big dataframe, matching each record whose Span falls between the two Equal_Span columns. There might be a way to do this in pandas, but cartesian-style joins like this seem much easier to express in SQL, so I first send all the data to an in-memory sqlite database. If you run into memory problems, I would make this a file-based database.



import sqlite3
con = sqlite3.connect(':memory:')
df.to_sql('df', con, index=False)
mesh_df.to_sql('mesh_df', con, index=False)

      

Here's the main query. It took about 1m30s on my test data, so it will probably take quite a while on the full dataset.

join_df = pd.read_sql("""SELECT a.Timestamp, a.Span, a.Elevation, b.Equal_Span
                         FROM df a, mesh_df b
                         WHERE a.Span BETWEEN b.Equal_Span_Prev AND b.Equal_Span""", con)

      

But once the data is in this form, it's easy and quick to get the desired mean.

join_df.groupby(['Timestamp','Equal_Span'])['Elevation'].mean()
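
As an aside (my own sketch, not part of either answer): the "way in pandas" hinted at above can be approximated with pd.cut, which assigns each Span value to one of the intervals defined by mesh, much like the SQL BETWEEN join, and then a groupby gives the mean per interval. Span values outside [start, end] fall outside every bin and are simply dropped.

# Pandas-only sketch of the same interval binning, assuming the df and mesh defined above
df['Equal_Span'] = pd.cut(df['Span'], bins=mesh)
# Depending on the pandas version, bins with no samples may show up as NaN or be omitted
pandas_means = df.groupby(['Timestamp', 'Equal_Span'])['Elevation'].mean()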

      
