Rate of change over the last n hours using pandas timeseries
I would like to add, for each existing column of a time-indexed pandas DataFrame, a column containing the rate of change over the last n hours. I have accomplished this with the following code, but it is too slow for my needs (probably because of the explicit loop over every index of every column?).
Is there a (computationally) faster way to do this?
import numpy as np

roc_hours = 12
tol = 1e-10  # minimum spread before bothering to fit a slope

for c in list(ts.columns):
    c_roc = c + ' +++ RoC ' + str(roc_hours) + 'h'
    ts[c_roc] = np.nan
    for i in ts.index[np.isfinite(ts[c])]:
        # trailing window: every observation from i - roc_hours up to i
        df = ts[c][i - np.timedelta64(roc_hours, 'h'):i]
        X = (df.index.values - df.index.values.min()).astype('int64') * 2.77778e-13  # ns -> hours back
        Y = df.values
        if Y.std() > tol and X.shape[0] > 1:
            fit = np.polyfit(X, Y, 1)
            ts.loc[i, c_roc] = fit[0]  # slope = rate of change per hour
        else:
            ts.loc[i, c_roc] = 0
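Not part of the original question, but for comparison: newer pandas versions can express the same trailing window with a time-based .rolling, which moves the window slicing into pandas itself. A minimal sketch, assuming ts.index is sorted and a pandas version that supports offset windows and apply(..., raw=False); the trailing_slope helper is a name I made up, not from the post:

import numpy as np
import pandas as pd

def trailing_slope(window):
    # *window* arrives as a Series holding the trailing roc_hours of data
    y = window.dropna()
    if len(y) < 2 or y.std() <= tol:
        return 0.0
    x = (y.index - y.index.min()) / pd.Timedelta(hours=1)  # hours into the window
    return np.polyfit(x, y.values, 1)[0]

for c in list(ts.columns):
    ts[c + ' +++ RoC ' + str(roc_hours) + 'h'] = (
        ts[c].rolling(str(roc_hours) + 'h').apply(trailing_slope, raw=False)
    )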
Edit
Input data frame ts is irregularly sampled and may contain NaNs. The first few lines of ts:
+---------------------+-------------------+------+------+--------------------+-------------------+------------------+
| WCT                 | a                 | b    | c    | d                  | e                 | f                |
+---------------------+-------------------+------+------+--------------------+-------------------+------------------+
| 2011-09-04 20:00:00 |                   |      |      |                    |                   |                  |
| 2011-09-04 21:00:00 |                   |      |      |                    |                   |                  |
| 2011-09-04 22:00:00 |                   |      |      |                    |                   |                  |
| 2011-09-04 23:00:00 |                   |      |      |                    |                   |                  |
| 2011-09-05 02:00:00 | 93.0              | 97.0 | 20.0 | 209.0              | 85.0              | 98.0             |
| 2011-09-05 03:00:00 | 74.14285714285714 | 97.0 | 20.0 | 194.14285714285717 | 74.42857142857143 | 98.0             |
| 2011-09-05 04:00:00 | 67.5              | 98.5 | 20.0 | 176.0              | 75.0              | 98.0             |
| 2011-09-05 05:00:00 | 72.0              | 98.5 | 20.0 | 176.0              | 75.0              | 98.0             |
| 2011-09-05 07:00:00 | 80.0              | 93.0 | 19.0 | 186.0              | 71.0              | 97.0             |
| 2011-09-05 08:00:00 | 80.0              | 93.0 | 19.0 | 186.0              | 71.0              | 97.0             |
| 2011-09-05 09:00:00 | 78.5              | 98.0 | 19.0 | 186.0              | 71.0              | 97.0             |
| 2011-09-05 10:00:00 | 73.0              | 98.0 | 19.0 | 186.0              | 71.0              | 97.0             |
| 2011-09-05 11:00:00 | 77.0              | 98.0 | 18.0 | 175.0              | 87.0              | 97.0999984741211 |
| 2011-09-05 12:00:00 | 78.0              | 98.0 | 19.0 | 163.0              | 57.0              | 98.4000015258789 |
| 2011-09-05 15:00:00 | 78.0              | 98.0 | 19.0 | 163.0              | 57.0              | 98.4000015258789 |
+---------------------+-------------------+------+------+--------------------+-------------------+------------------+
Edit 2
After profiling, the bottleneck is the slicing step df = ts[c][i - np.timedelta64(roc_hours, 'h'):i]. This line pulls out the observations with timestamps between now - roc_hours and now. It is very convenient syntax, but accounts for most of the computation time.
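One way around that hot spot (my suggestion, not from the original post) is to compute every window's start position once with numpy.searchsorted and then slice by integer position instead of by label. A sketch for a single column c, assuming ts.index is sorted:

import numpy as np

idx = ts.index.values                                   # datetime64[ns]
starts = idx.searchsorted(idx - np.timedelta64(roc_hours, 'h'), side='left')
vals = ts[c].values
slopes = np.zeros(len(idx))
for j in range(len(idx)):
    y = vals[starts[j]:j + 1]                           # trailing window, by position
    x = (idx[starts[j]:j + 1] - idx[starts[j]]) / np.timedelta64(1, 'h')
    ok = np.isfinite(y)
    if ok.sum() > 1 and y[ok].std() > tol:
        slopes[j] = np.polyfit(x[ok], y[ok], 1)[0]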
Answer
Works on my dataset, didn't check on yours:
import pandas as pd
from numpy import polyfit
from matplotlib import style
style.use('ggplot')

# ... acquire a dataframe named *water* with a column *value*
WINDOW = 10

ax = water.value.plot()

# rolling mean for comparison (.rolling replaces the old
# pd.rolling_mean / pd.rolling_apply helpers)
roll = water.value.rolling(WINDOW).mean()
roll.plot(ax=ax)

def lintrend(df):
    # fit a straight line to the window and return its slope
    df = df.tolist()
    m, b = polyfit(range(len(df)), df, 1)
    return m

linny = water.value.rolling(WINDOW).apply(lintrend, raw=True)
linny.plot(ax=ax)
Casting the window to a list inside lintrend seems inelegant, since the rolling apply already hands it over as a numpy.ndarray. Suggestions?
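One possible answer to that last question (my suggestion, not part of the original answer): numpy.polyfit accepts an ndarray directly, so the list round-trip can simply be dropped:

import numpy as np

def lintrend(arr):
    # with raw=True the window is already a numpy.ndarray,
    # and polyfit consumes it without any conversion
    m, b = np.polyfit(np.arange(len(arr)), arr, 1)
    return m

linny = water.value.rolling(WINDOW).apply(lintrend, raw=True)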