Pandas: Rolling Correlation Against a Fixed Patch for Pattern Matching

Happy New Year.

I am looking for a way to calculate the correlation between a rolling window and a fixed window ('patch') with Pandas. The ultimate goal is pattern matching.

From what I read in the docs (and hopefully I didn't miss anything), corr() and corrwith() do not allow you to fix one of the Series/DataFrames in place.

The best clunky solution I have found so far is below. When run on 50K rows with a patch of 30 samples, the processing time goes into the Ctrl+C range.

Any suggestions and alternatives are much appreciated. Thanks.

Running the code below should make it very clear what I am trying to do:

import numpy as np
import pandas as pd
from pandas import DataFrame

# Create test DataFrame df and a patch to be found.
n = 10
rng = pd.date_range('1/1/2000 00:00:00', periods=n, freq='5min')
df = DataFrame(np.random.rand(n, 1), columns=['a'], index=rng)

n = 4
rng = pd.date_range('1/1/2000 00:10:00', periods=n, freq='5min')
patch = DataFrame(np.arange(n), columns=['a'], index=rng)

print()
print('    *** Start corr example ***')
# To avoid the automatic alignment between df and patch,
# I need to reset the index.
patch.reset_index(inplace=True, drop=True)
# Cannot do:
#    df.reset_index(inplace=True, drop=True)

df['corr'] = np.nan

for i in range(df.shape[0]):
    window = df.iloc[i : i + patch.shape[0]]
    # If the slice has only two rows, it describes a line between
    # two points; correlating it with two points of the patch
    # starts giving misleading values like 1 or -1.
    if window.shape[0] != patch.shape[0]:
        break
    else:
        # I need to reset_index for the window, which is less
        # efficient than doing it outside the for loop, where
        # the patch has its reset_index done once. If I did
        # df.reset_index up there, I would still get automatic
        # realignment, just by the new index.
        window = window.reset_index(drop=True)

        # On top of the obvious inefficiency of this method,
        # I cannot corrwith() between specific columns only;
        # corrwith() runs for all of them.
        # Alternatively I could create a new DataFrame with
        # only the needed columns:
        #     df_col = DataFrame(df.a)
        #     patch_col = DataFrame(patch.a)
        # Or I could join the patch to df and shift it.
        corr = window.corrwith(patch)

        print()
        print('===========================')
        print('window:')
        print(window)
        print('---------------------------')
        print('patch:')
        print(patch)
        print('---------------------------')
        print('Corr for this window')
        print(corr)
        print('============================')

        # .iloc avoids the chained-assignment pitfall of
        # df['corr'][i] = ...
        df.iloc[i, df.columns.get_loc('corr')] = corr.a

print()
print('    *** End corr example ***')
print(" Please inspect var 'df'")
print()

      



1 Answer


Excessive use of reset_index is a clear signal that we are fighting Pandas' indexing and auto-alignment. Oh, how much easier it would be if we could just forget about the index! Indeed, that is exactly what NumPy is designed for. (Generally speaking: use Pandas when you need alignment or grouping by index; use NumPy when doing calculations on N-dimensional arrays.)
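The auto-alignment is easy to demonstrate in isolation (a minimal sketch, not part of the original post): two Series with identical values but disjoint index labels correlate to NaN, because corr() aligns on the index before computing anything.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0, 4.0])
# Same values, but different index labels.
s2 = pd.Series([1.0, 2.0, 3.0, 4.0], index=[10, 11, 12, 13])

# corr() aligns on the index first; with no overlapping labels
# there is nothing left to correlate, so the result is NaN.
print(s1.corr(s2))  # nan

# After dropping the labels, the values line up positionally
# and the correlation is a perfect 1.0.
print(s1.corr(s2.reset_index(drop=True)))  # 1.0
```

This is exactly why the question's code has to sprinkle reset_index everywhere.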

Using NumPy makes the computation much faster, because we can remove the for-loop and express everything done inside it as a single computation on a NumPy array of rolling windows.

We can look inside pandas/core/frame.py at DataFrame.corrwith to see how the computation is performed, then translate it into equivalent code on NumPy arrays, adjusted so that it operates on the entire array of rolling windows at once rather than one window at a time, while keeping the patch constant. (Note: the Pandas corrwith method handles NaNs. To keep the code a little simpler, I assume there are no NaNs in the inputs.)

import numpy as np
import pandas as pd
from pandas import DataFrame
import numpy.lib.stride_tricks as stride
np.random.seed(1)

n = 10
rng = pd.date_range('1/1/2000 00:00:00', periods=n, freq='5min')
df = DataFrame(np.random.rand(n, 1), columns=['a'], index=rng)

m = 4
rng = pd.date_range('1/1/2000 00:10:00', periods=m, freq='5min')
patch = DataFrame(np.arange(m), columns=['a'], index=rng)

def orig(df, patch):
    patch.reset_index(inplace=True, drop=True)

    df['corr'] = np.nan

    for i in range(df.shape[0]):
        window = df.iloc[i : i + patch.shape[0]]
        if window.shape[0] != patch.shape[0]:
            break
        window = window.reset_index(drop=True)
        corr = window.corrwith(patch)
        df.iloc[i, df.columns.get_loc('corr')] = corr.a

    return df

def using_numpy(df, patch):
    left = df['a'].values
    right = patch['a'].values
    k = len(right)
    itemsize = left.itemsize
    # View `left` as overlapping rolling windows of length k (no copy).
    left = stride.as_strided(left, shape=(len(df) - k + 1, k),
                             strides=(itemsize, itemsize))

    # Demean the windows and the patch, then apply the Pearson
    # correlation formula row-wise across all windows at once.
    ldem = left - left.mean(axis=1)[:, None]
    rdem = right - right.mean()

    num = (ldem * rdem).sum(axis=1)
    dom = (k - 1) * np.sqrt(left.var(axis=1, ddof=1) * right.var(ddof=1))
    correl = num / dom

    df['corr'] = np.nan
    df.iloc[:len(correl), df.columns.get_loc('corr')] = correl
    return df

expected = orig(df.copy(), patch.copy())
result = using_numpy(df.copy(), patch.copy())

print(expected)
print(result)

      

This confirms that orig and using_numpy produce the same values:

assert np.allclose(expected['corr'].dropna(), result['corr'].dropna())
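As an aside (a sketch, not part of the original answer): for moderate sizes there is also a pandas-native middle ground, rolling().apply() against the fixed patch values. The callable still runs once per window in Python, so it remains far slower than the strided version, but it sidesteps both the explicit loop and the reset_index dance. The raw=True flag (available in newer pandas) hands each window to the callable as a plain ndarray.

```python
import numpy as np
import pandas as pd

np.random.seed(1)
n, m = 10, 4
df = pd.DataFrame({'a': np.random.rand(n)})
patch_values = np.arange(m, dtype=float)

# Correlate every length-m window of df['a'] with the fixed patch.
# rolling().apply labels each result at the window's *last* row,
# so shift by m-1 to label it at the window's first row instead,
# matching the convention used by orig/using_numpy above.
roll = df['a'].rolling(m).apply(
    lambda w: np.corrcoef(w, patch_values)[0, 1], raw=True)
df['corr'] = roll.shift(-(m - 1))
print(df)
```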

      




Technical note: to build the array of rolling windows in a memory-friendly manner, I used a stride trick I learned here.
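The trick is easiest to see on a tiny array in isolation (a sketch assuming a contiguous float array; as_strided on a non-contiguous array or with wrong strides can read garbage memory):

```python
import numpy as np

a = np.arange(6.0)  # [0., 1., 2., 3., 4., 5.]
m = 4
itemsize = a.itemsize

# Each "row" is a view into the same buffer, offset by one element:
# nothing is copied, so this stays cheap even for 50K rows.
windows = np.lib.stride_tricks.as_strided(
    a, shape=(len(a) - m + 1, m), strides=(itemsize, itemsize))
print(windows)
# [[0. 1. 2. 3.]
#  [1. 2. 3. 4.]
#  [2. 3. 4. 5.]]

# NumPy >= 1.20 ships a safer spelling of the same idea:
#     windows = np.lib.stride_tricks.sliding_window_view(a, m)
```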




Here is a benchmark using n, m = 1000, 4 (many rows and a tiny patch, creating lots of windows):

In [77]: %timeit orig(df.copy(), patch.copy())
1 loops, best of 3: 3.56 s per loop

In [78]: %timeit using_numpy(df.copy(), patch.copy())
1000 loops, best of 3: 1.35 ms per loop

      

That is a speedup of roughly 2600x.
