Faster rolling_apply on a Pandas DataFrame?

Building on this question, which provided a clever solution for applying a function across multiple columns of a DataFrame, I'm wondering whether that solution can be optimized for speed.

Environment: Python 2.7.8, Pandas 0.14.1, NumPy 1.8.

Here's a setup example:

import pandas as pd
import numpy as np

def meanmax(ii, df):
    # Look up this window's rows by integer position.
    xdf = df.iloc[map(int, ii)]
    # Average the two per-column window maxima.
    n = max(xdf['A']) + max(xdf['B'])
    return n / 2.0

df = pd.DataFrame(np.random.randn(2500, 2) / 10000,
                  index=pd.date_range('2001-01-01', periods=2500),
                  columns=['A', 'B'])
df['ii'] = range(len(df))  # integer positions to hand to rolling_apply

res = pd.rolling_apply(df.ii, 26, lambda x: meanmax(x, df))


Note that the function isn't pairwise, so something like rolling_mean(df['A'] + df['B'], 26) won't work.

However, I can do something like:

res2 = (pd.rolling_max(df['A'],26) + pd.rolling_max(df['B'],26)) / 2
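As a quick sanity check (a sketch assuming the setup above has been run; the first 25 entries of each result are NaN from the warm-up window), the two results agree where both are defined:

mask = ~np.isnan(res)  # skip the NaN warm-up entries
assert np.allclose(res[mask], res2[mask])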


This is about 3000x faster:

%timeit res = pd.rolling_apply(df.ii, 26, lambda x: meanmax(x, df))
1 loops, best of 3: 1 s per loop

%timeit res2 = (pd.rolling_max(df['A'],26) + pd.rolling_max(df['B'],26)) / 2
1000 loops, best of 3: 325 µs per loop


Is there anything better than, or equivalent to, the second option above, given the example function, while still using rolling_apply? The second option is faster, but it doesn't use rolling_apply, which can be applied to a wider set of problems.

Edit: adjusted the runtimes.

+3




2 answers


Computing a general rolling function over an array of size n with a window of size m takes roughly O(n*m) time. The built-in rolling_xxx methods use some pretty smart algorithms to keep the runtime well below that, and can often guarantee O(n) time, which, if you think about it, is pretty impressive.

rolling_min and rolling_max in particular borrow their implementation from bottleneck, which cites Richard Harter as the source of the algorithm, although I found what I believe is an earlier description of the same algorithm in this article.
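Roughly, the trick (a minimal sketch of the monotonic-deque idea; bottleneck's actual implementation is compiled code and differs in detail) looks like this:

from collections import deque

def sliding_max(x, m):
    # O(n) windowed maximum: keep a deque of indices whose values are
    # decreasing, so the front is always the current window's maximum.
    q, out = deque(), []
    for i, v in enumerate(x):
        while q and x[q[-1]] <= v:  # drop candidates dominated by v
            q.pop()
        q.append(i)
        if q[0] <= i - m:           # evict the index that left the window
            q.popleft()
        if i >= m - 1:              # window is full: emit its maximum
            out.append(x[q[0]])
    return out

Each index enters and leaves the deque at most once, which is where the O(n) bound comes from.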



So, after the history lesson: it's quite likely that you won't be able to have your cake and eat it too. rolling_apply is very convenient, but it will almost always sacrifice performance relative to a problem-specific algorithm. In my experience, one of the more enjoyable parts of using the scientific Python stack is coming up with efficient ways to do computations by creatively combining fast primitives. Your own solution calling rolling_max twice is a good example of this. So sit back and enjoy the ride, knowing that you always have rolling_apply to fall back on if you, or the good people of SO, can't come up with a cleverer solution.

+7




You won't be able to match the speed of rolling_max, but you can often shave off an order of magnitude or so by dropping down to numpy via .values:

def meanmax_np(ii, df):
    # Index raw numpy arrays to skip per-window pandas overhead.
    ii = ii.astype(int)
    n = df["A"].values[ii].max() + df["B"].values[ii].max()
    return n / 2.0


gives me



>>> %timeit res = pd.rolling_apply(df.ii, 26, lambda x: meanmax(x, df))
1 loops, best of 3: 701 ms per loop
>>> %timeit res_np = pd.rolling_apply(df.ii, 26, lambda x: meanmax_np(x, df))
10 loops, best of 3: 31.2 ms per loop
>>> %timeit res2 = (pd.rolling_max(df['A'],26) + pd.rolling_max(df['B'],26)) / 2
1000 loops, best of 3: 247 µs per loop


So although this is still ~100x slower than the hand-optimized case, it's much faster than the original. Sometimes I only need something to be ten times faster for it to stop being the dominant time sink, and that's enough.
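If you need more generality than rolling_max but your per-window function is still a plain reduction over each window, one more sketch worth trying (it uses as_strided, whose views share memory with the original array, so treat them as read-only) is to build all the windows as a 2-D view and vectorize the reduction:

from numpy.lib.stride_tricks import as_strided

def rolling_windows(x, m):
    # View a 1-D array as overlapping length-m windows without copying.
    n, s = x.shape[0], x.strides[0]
    return as_strided(x, shape=(n - m + 1, m), strides=(s, s))

a, b = df['A'].values, df['B'].values
res3 = (rolling_windows(a, 26).max(axis=1) +
        rolling_windows(b, 26).max(axis=1)) / 2.0

res3 should match the non-NaN tail of res (i.e. res[25:]) up to floating-point noise.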

+3








