Optimize your code to find the median of the last 4-6 days for each row in the DataFrame

Given temporal data, I would like to calculate the median of a specific variable over the last 4-6 days. The median over the last 1-3 days can be calculated using pandas.DataFrame.rolling, but I couldn't find how to use rolling to calculate the median over the last 4-6 days.

import pandas as pd
import numpy as np
import datetime
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='6H')
df['timestamp'] = df.timestamp.astype(pd.Timestamp)
np.random.seed(1)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
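For context, the 1-3 day case that already works can be spelled out with a time-based rolling window (a sketch; the column name `past1d-3d_var_median` is my own):

```python
import numpy as np
import pandas as pd

# Rebuild the question's sample frame (seed 1, 6-hour frequency).
np.random.seed(1)
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='6h')
df['var'] = np.random.randn(len(df))

# Time-based window: for each row, the median of `var` over (t - 3 days, t].
s = df.set_index('timestamp')['var']
df['past1d-3d_var_median'] = s.rolling('3D').median().to_numpy()
```

The offset string `'3D'` makes the window a fixed amount of time rather than a fixed number of rows, which is what makes it robust to gaps and to multiple points per day.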

The data looks like this. My real data has gaps in time and may have more than one data point per day.

              timestamp       var
0   2011-01-01 00:00:00  1.624345
1   2011-01-01 06:00:00 -0.611756
2   2011-01-01 12:00:00 -0.528172
3   2011-01-01 18:00:00 -1.072969
4   2011-01-02 00:00:00  0.865408
5   2011-01-02 06:00:00 -2.301539
6   2011-01-02 12:00:00  1.744812
7   2011-01-02 18:00:00 -0.761207
8   2011-01-03 00:00:00  0.319039
9   2011-01-03 06:00:00 -0.249370
10  2011-01-03 12:00:00  1.462108

Desired output:

              timestamp       var  past4d-6d_var_median
0   2011-01-01 00:00:00  1.624345                   NaN # no data in past 4-6 days
1   2011-01-01 06:00:00 -0.611756                   NaN # no data in past 4-6 days
2   2011-01-01 12:00:00 -0.528172                   NaN # no data in past 4-6 days
3   2011-01-01 18:00:00 -1.072969                   NaN # no data in past 4-6 days
4   2011-01-02 00:00:00  0.865408                   NaN # no data in past 4-6 days
5   2011-01-02 06:00:00 -2.301539                   NaN # no data in past 4-6 days
6   2011-01-02 12:00:00  1.744812                   NaN # no data in past 4-6 days
7   2011-01-02 18:00:00 -0.761207                   NaN # no data in past 4-6 days
8   2011-01-03 00:00:00  0.319039                   NaN # no data in past 4-6 days
9   2011-01-03 06:00:00 -0.249370                   NaN # no data in past 4-6 days
10  2011-01-03 12:00:00  1.462108                   NaN # no data in past 4-6 days
11  2011-01-03 18:00:00 -2.060141                   NaN # no data in past 4-6 days
12  2011-01-04 00:00:00 -0.322417                   NaN # no data in past 4-6 days
13  2011-01-04 06:00:00 -0.384054                   NaN # no data in past 4-6 days
14  2011-01-04 12:00:00  1.133769                   NaN # no data in past 4-6 days
15  2011-01-04 18:00:00 -1.099891                   NaN # no data in past 4-6 days
16  2011-01-05 00:00:00 -0.172428                   NaN # only 4 data points in past 4-6 days
17  2011-01-05 06:00:00 -0.877858             -0.528172
18  2011-01-05 12:00:00  0.042214             -0.569964
19  2011-01-05 18:00:00  0.582815             -0.528172
20  2011-01-06 00:00:00 -1.100619             -0.569964
21  2011-01-06 06:00:00  1.144724             -0.528172
22  2011-01-06 12:00:00  0.901591             -0.388771
23  2011-01-06 18:00:00  0.502494             -0.249370

My current code:

def findPastVar2(df, var='var', window=3, method='median'):
    # window = number of past days to skip; the selection below keeps rows
    # between `window` and `2*window` days before row i
    for i in range(len(df)):
        delta = df['timestamp'] - df['timestamp'].loc[i]
        pastVar2 = df[var].loc[(delta < datetime.timedelta(days=-window)) &
                               (delta >= datetime.timedelta(days=-window * 2))]
        if pastVar2.shape[0] >= 5:  # require at least 5 data points
            if method == 'median':
                df.loc[i, 'past{}d-{}d_{}_median'.format(window + 1, window * 2, var)] = np.median(pastVar2.values)
    return df

Current speed:

In [35]: %timeit df2 = findPastVar2(df)
1 loop, best of 3: 821 ms per loop


I edited the post so that it clearly shows the expected result under the at-least-5-data-points rule, and I set a random seed so that everyone receives the same input and can reproduce the same output. As far as I know, plain rolling and shift don't work when there are multiple data points on the same day.

3 answers


Here we go:



df.set_index('timestamp', inplace=True)
df['past4d-6d_var_median'] = (df['var'].rolling('3D', min_periods=3).median()
                                       .shift(freq=pd.Timedelta('4d')).shift(-1))

df['past4d-6d_var_median']
Out[55]: 
timestamp
2011-01-01 00:00:00         NaN
2011-01-01 06:00:00         NaN
2011-01-01 12:00:00         NaN
2011-01-01 18:00:00         NaN
2011-01-02 00:00:00         NaN
2011-01-02 06:00:00         NaN
2011-01-02 12:00:00         NaN
2011-01-02 18:00:00         NaN
2011-01-03 00:00:00         NaN
2011-01-03 06:00:00         NaN
2011-01-03 12:00:00         NaN
2011-01-03 18:00:00         NaN
2011-01-04 00:00:00         NaN
2011-01-04 06:00:00         NaN
2011-01-04 12:00:00         NaN
2011-01-04 18:00:00         NaN
2011-01-05 00:00:00         NaN
2011-01-05 06:00:00   -0.528172
2011-01-05 12:00:00   -0.569964
2011-01-05 18:00:00   -0.528172
2011-01-06 00:00:00   -0.569964
2011-01-06 06:00:00   -0.528172
2011-01-06 12:00:00   -0.569964
2011-01-06 18:00:00   -0.528172
2011-01-07 00:00:00   -0.388771
2011-01-07 06:00:00   -0.249370
2011-01-07 12:00:00   -0.388771
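To see why this relies on a regular grid, the chained call can be unpacked step by step (my own sketch on the question's 6-hour sample data):

```python
import numpy as np
import pandas as pd

np.random.seed(1)
s = pd.Series(np.random.randn(100),
              index=pd.date_range('1/1/2011', periods=100, freq='6h'))

r = s.rolling('3D', min_periods=3).median()   # median over (t - 3d, t]
r4 = r.shift(freq=pd.Timedelta('4d'))         # relabel: value at t now summarizes (t - 7d, t - 4d]
out = r4.shift(-1)                            # step the values one 6-hour row closer to t
```

The final `shift(-1)` moves values by one row, not by a fixed amount of time, so the effective look-back window drifts as soon as the spacing becomes irregular - which is exactly the limitation the asker points out.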


Since this is done row by row, and the timestamps are irregular, the windows will contain varying numbers of points, which calls for an iterative approach like the one you started. But if we make the timestamps the index:

# setup the df:
import numpy as np
import pandas as pd

df = pd.DataFrame(index=pd.date_range('1/1/2011', periods=100, freq='12H'))
df['var'] = np.random.randn(len(df))


Here I chose a 12-hour interval, but it could be whatever is available, even irregular. Using a modified function with a window for the median, together with an offset (here a positive Delta looks backwards), gives you the flexibility you need:



def GetMedian(df, var='var', window='2D', Delta='3D'):
    for Ti in df.index:
        # keep rows strictly between Delta and Delta + window before Ti
        Vals = df[(df.index < Ti - pd.Timedelta(Delta)) &
                  (df.index > Ti - pd.Timedelta(Delta) - pd.Timedelta(window))]
        df.loc[Ti, 'Medians'] = Vals[var].median()
    return df
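A usage sketch (the function is restated here so the snippet runs on its own; with the defaults it keeps points strictly between 3 and 5 days back, so `window` and `Delta` would need adjusting to hit the question's exact 4-6 day band):

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame(index=pd.date_range('1/1/2011', periods=100, freq='12h'))
df['var'] = np.random.randn(len(df))

def GetMedian(df, var='var', window='2D', Delta='3D'):
    for Ti in df.index:
        # keep rows strictly between Delta and Delta + window before Ti
        Vals = df[(df.index < Ti - pd.Timedelta(Delta)) &
                  (df.index > Ti - pd.Timedelta(Delta) - pd.Timedelta(window))]
        df.loc[Ti, 'Medians'] = Vals[var].median()
    return df

df = GetMedian(df)
```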


This is much faster:

%timeit GetMedian(df)
84.8 ms ± 3.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)



min_periods should be 2 instead of 5, because you shouldn't count the window size itself (5 - 3 = 2).

import pandas as pd
import numpy as np
import datetime
np.random.seed(1)  # set random seed for easier comparison
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='D')
df['timestamp'] = df.timestamp.astype(pd.Timestamp)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))

def first():
    df['past4d-6d_var_median'] = [np.nan] * 3 + df['var'].rolling(window=3, min_periods=2).median()[:-3].tolist()
    return df


%timeit -n1000 first()
1000 loops, best of 3: 6.23 ms per loop


My first try did not use shift(), but then I saw Noobie's answer. The next version uses shift(), which is much faster than the previous one.

def test():
    df['past4d-6d_var_median'] = df['var'].rolling(window=3, min_periods=2).median().shift(3)
    return df
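The claim that both variants agree can be checked directly (a sketch on a standalone series, since both functions in the answer mutate the shared frame):

```python
import numpy as np
import pandas as pd

np.random.seed(1)
s = pd.Series(np.random.randn(100),
              index=pd.date_range('1/1/2011', periods=100, freq='D'))

med = s.rolling(window=3, min_periods=2).median()

# Variant 1: drop the last 3 medians and pad the front with NaN.
first = pd.Series([np.nan] * 3 + med[:-3].tolist(), index=s.index)

# Variant 2: the same alignment via shift(3).
second = med.shift(3)
```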

      
%timeit -n1000 test()
1000 loops, best of 3: 1.66 ms per loop


The second is about 4 times faster than the first.

These two functions produce the same result that looks like this:

df2 = test()
df2
              timestamp       var  past4d-6d_var_median
0   2011-01-01 00:00:00  1.624345                   NaN
1   2011-01-02 00:00:00 -0.611756                   NaN
2   2011-01-03 00:00:00 -0.528172                   NaN
3   2011-01-04 00:00:00 -1.072969                   NaN
4   2011-01-05 00:00:00  0.865408              0.506294
5   2011-01-06 00:00:00 -2.301539             -0.528172
6   2011-01-07 00:00:00  1.744812             -0.611756
..                  ...       ...                   ...
93  2011-04-04 00:00:00 -0.638730              1.129484
94  2011-04-05 00:00:00  0.423494              1.129484
95  2011-04-06 00:00:00  0.077340              0.185156
96  2011-04-07 00:00:00 -0.343854             -0.375285
97  2011-04-08 00:00:00  0.043597             -0.375285
98  2011-04-09 00:00:00 -0.620001              0.077340
99  2011-04-10 00:00:00  0.698032              0.077340

