Pandas: Rolling Correlation with a Fixed Patch for Pattern Matching
Happy New Year.
I am looking for a way to calculate the correlation of a rolling window and a fixed window ('patch') with pandas. The ultimate goal is pattern matching.
From what I read in the docs (and hopefully I am not missing something), corr() or corrwith() does not allow you to lock one of the Series/DataFrames in place.
The best clumsy solution I could come up with is shown below. When it is run on 50K rows with a patch of 30 samples, the processing time goes into the Ctrl+C range.
I would be very grateful for suggestions and alternatives. Thanks.
Please run the code below and it should become clear what I am trying to do:
import numpy as np
import pandas as pd
from pandas import Series
from pandas import DataFrame
# Create test DataFrame df and a patch to be found.
n = 10
rng = pd.date_range('1/1/2000 00:00:00', periods=n, freq='5min')
df = DataFrame(np.random.rand(n, 1), columns=['a'], index=rng)
n = 4
rng = pd.date_range('1/1/2000 00:10:00', periods=n, freq='5min')
patch = DataFrame(np.arange(n), columns=['a'], index=rng)
print
print ' *** Start corr example ***'
# To avoid the automatic alignment between df and patch,
# I need to reset the index.
patch.reset_index(inplace=True, drop=True)
# Cannot do:
# df.reset_index(inplace=True, drop=True)
df['corr'] = np.nan
for i in range(df.shape[0]):
    window = df[i : i+patch.shape[0]]
    # If the slice has only two rows, I have a line between two points.
    # When I corr() that with two points in patch, I start getting
    # misleading values like 1 or -1.
    if window.shape[0] != patch.shape[0]:
        break
    else:
        # I need to reset_index for the window, which is less efficient
        # than doing it outside the for loop, where the patch has its
        # reset_index done once. If I did df.reset_index up there,
        # I would still get automatic realignment, just by the new
        # integer index.
        window.reset_index(inplace=True, drop=True)
        # On top of the obvious inefficiency of this method, I cannot
        # corrwith() between specific columns of the DataFrame;
        # corrwith() runs for all of them.
        # Alternatively I could create a new DataFrame with only the
        # needed columns:
        #     df_col = DataFrame(df.a)
        #     patch_col = DataFrame(patch.a)
        # Alternatively I could join the patch to df and shift it.
        corr = window.corrwith(patch)

        print
        print '==========================='
        print 'window:'
        print window
        print '---------------------------'
        print 'patch:'
        print patch
        print '---------------------------'
        print 'Corr for this window'
        print corr
        print '============================'

        df['corr'][i] = corr.a
print
print ' *** End corr example ***'
print " Please inspect var 'df'"
print
The excessive use of reset_index is a clear signal that we are fighting Pandas' indexing and automatic alignment. Oh, how much easier life would be if we could just forget about the index! And that is exactly what NumPy is designed for. (Generally speaking, use Pandas when you need alignment or grouping by index; use NumPy when doing calculations on N-dimensional arrays.)
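As a minimal sketch of that auto-alignment (toy, made-up data), two frames with disjoint indexes give NaN from corrwith(), because the frames are aligned by index before anything is computed:

import pandas as pd

# Toy illustration: the two frames share no index labels, so corrwith()
# aligns them on an empty intersection and the result is NaN.
a = pd.DataFrame({'a': [1.0, 2.0, 3.0]}, index=[0, 1, 2])
b = pd.DataFrame({'a': [3.0, 2.0, 1.0]}, index=[10, 11, 12])
print(a.corrwith(b))    # a   NaN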
Using NumPy makes the computation much faster because we can remove the for-loop and express all the work done inside it as a single computation on a NumPy array of rolling windows.
We can look inside DataFrame.corrwith in pandas/core/frame.py to see how the computation is performed, then translate it into the corresponding code on NumPy arrays, making the necessary adjustments so that the calculation is carried out on an entire array of rolling windows at once rather than one window at a time, while the patch stays constant. (Note: the Pandas corrwith method handles NaNs. To keep the code a little simpler, I assume there are no NaNs in the inputs.)
import numpy as np
import pandas as pd
from pandas import Series
from pandas import DataFrame
import numpy.lib.stride_tricks as stride

np.random.seed(1)

n = 10
rng = pd.date_range('1/1/2000 00:00:00', periods=n, freq='5min')
df = DataFrame(np.random.rand(n, 1), columns=['a'], index=rng)

m = 4
rng = pd.date_range('1/1/2000 00:10:00', periods=m, freq='5min')
patch = DataFrame(np.arange(m), columns=['a'], index=rng)

def orig(df, patch):
    patch.reset_index(inplace=True, drop=True)
    df['corr'] = np.nan
    for i in range(df.shape[0]):
        window = df[i : i+patch.shape[0]]
        if window.shape[0] != patch.shape[0]:
            break
        else:
            window.reset_index(inplace=True, drop=True)
            corr = window.corrwith(patch)
            df['corr'][i] = corr.a
    return df

def using_numpy(df, patch):
    # n and m are taken from the module-level definitions above.
    # View the 'a' column as an (n-m+1, m) array of overlapping windows
    # without copying the data.
    left = df['a'].values
    itemsize = left.itemsize
    left = stride.as_strided(left, shape=(n-m+1, m),
                             strides=(itemsize, itemsize))
    right = patch['a'].values

    # Pearson correlation of every window with the patch, computed for all
    # windows at once: covariance divided by the product of the standard
    # deviations.
    ldem = left - left.mean(axis=1)[:, None]
    rdem = right - right.mean()
    num = (ldem * rdem).sum(axis=1)
    dom = (m - 1) * np.sqrt(left.var(axis=1, ddof=1) * right.var(ddof=1))
    correl = num/dom

    # Store the correlations against the first n-m+1 rows of df.
    df.ix[:len(correl), 'corr'] = correl
    return df

expected = orig(df.copy(), patch.copy())
result = using_numpy(df.copy(), patch.copy())

print(expected)
print(result)
This confirms that the values generated by orig and using_numpy are the same:
assert np.allclose(expected['corr'].dropna(), result['corr'].dropna())
Technical note:
To create the array of rolling windows in a memory-friendly manner, I used a stride trick I learned here.
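As a minimal sketch of that trick (toy data, separate from the benchmark below), as_strided can present a 1-D array as overlapping windows without copying anything:

import numpy as np
import numpy.lib.stride_tricks as stride

# Toy example: view a 1-D array as an (n-m+1, m) array of overlapping
# windows; only the strides change, no data are copied.
a = np.arange(6, dtype=float)          # [0. 1. 2. 3. 4. 5.]
m = 3
windows = stride.as_strided(a, shape=(a.size - m + 1, m),
                            strides=(a.itemsize, a.itemsize))
print(windows)
# [[0. 1. 2.]
#  [1. 2. 3.]
#  [2. 3. 4.]
#  [3. 4. 5.]]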
Here is a benchmark using n, m = 1000, 4
(many rows and a tiny patch, to create lots of windows):
In [77]: %timeit orig(df.copy(), patch.copy())
1 loops, best of 3: 3.56 s per loop
In [78]: %timeit using_numpy(df.copy(), patch.copy())
1000 loops, best of 3: 1.35 ms per loop
That is a speedup of roughly 2600x.
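For completeness, a hedged alternative sketch (not benchmarked here, and assuming a pandas version that supports rolling().apply(raw=True)): the same rolling correlation against a fixed patch can be written with a rolling apply, which is shorter but typically far slower than the vectorized NumPy version, because the Python callback runs once per window.

import numpy as np
import pandas as pd

m = 4
patch_values = np.arange(m, dtype=float)   # assumed fixed patch, as above
s = pd.Series(np.random.rand(10))

# np.corrcoef returns the 2x2 correlation matrix; element [0, 1] is the
# correlation between the window and the patch.
rolled = s.rolling(window=m).apply(
    lambda w: np.corrcoef(w, patch_values)[0, 1], raw=True)

# rolling() labels each window by its right edge; shift back so each value
# lines up with the start of its window, as in the loop version above.
corr_at_window_start = rolled.shift(-(m - 1))
print(corr_at_window_start)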