Optimize your code to find the median of the last 4-6 days for each row in the DataFrame
Given temporal data, I would like to calculate the median of a specific variable over the last 4-6 days for each row. The median over the last 1-3 days can be calculated with pd.DataFrame.rolling, but I couldn't find a way to use rolling to calculate the median over the last 4-6 days.
import pandas as pd
import numpy as np
import datetime
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='6H')
np.random.seed(1)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
The data looks like this. My real data has gaps in time and may have multiple data points in a single day.
timestamp var
0 2011-01-01 00:00:00 1.624345
1 2011-01-01 06:00:00 -0.611756
2 2011-01-01 12:00:00 -0.528172
3 2011-01-01 18:00:00 -1.072969
4 2011-01-02 00:00:00 0.865408
5 2011-01-02 06:00:00 -2.301539
6 2011-01-02 12:00:00 1.744812
7 2011-01-02 18:00:00 -0.761207
8 2011-01-03 00:00:00 0.319039
9 2011-01-03 06:00:00 -0.249370
10 2011-01-03 12:00:00 1.462108
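For reference, the last-1-3-days median mentioned above can be computed directly with a time-based rolling window. A sketch (the column name past1d-3d_var_median is just illustrative; the setup is restated so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

# Restate the setup above so this snippet runs standalone.
np.random.seed(1)
df = pd.DataFrame({'timestamp': pd.date_range('1/1/2011', periods=100, freq='6h')})
df['var'] = np.random.randn(len(df))

# Time-based window: for each row, the median of all points in (t - 3 days, t].
df['past1d-3d_var_median'] = df.rolling('3D', on='timestamp')['var'].median()
```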
Desired output:
timestamp var past4d-6d_var_median
0 2011-01-01 00:00:00 1.624345 NaN # no data in past 4-6 days
1 2011-01-01 06:00:00 -0.611756 NaN # no data in past 4-6 days
2 2011-01-01 12:00:00 -0.528172 NaN # no data in past 4-6 days
3 2011-01-01 18:00:00 -1.072969 NaN # no data in past 4-6 days
4 2011-01-02 00:00:00 0.865408 NaN # no data in past 4-6 days
5 2011-01-02 06:00:00 -2.301539 NaN # no data in past 4-6 days
6 2011-01-02 12:00:00 1.744812 NaN # no data in past 4-6 days
7 2011-01-02 18:00:00 -0.761207 NaN # no data in past 4-6 days
8 2011-01-03 00:00:00 0.319039 NaN # no data in past 4-6 days
9 2011-01-03 06:00:00 -0.249370 NaN # no data in past 4-6 days
10 2011-01-03 12:00:00 1.462108 NaN # no data in past 4-6 days
11 2011-01-03 18:00:00 -2.060141 NaN # no data in past 4-6 days
12 2011-01-04 00:00:00 -0.322417 NaN # no data in past 4-6 days
13 2011-01-04 06:00:00 -0.384054 NaN # no data in past 4-6 days
14 2011-01-04 12:00:00 1.133769 NaN # no data in past 4-6 days
15 2011-01-04 18:00:00 -1.099891 NaN # no data in past 4-6 days
16 2011-01-05 00:00:00 -0.172428 NaN # only 4 data in past 4-6 days
17 2011-01-05 06:00:00 -0.877858 -0.528172
18 2011-01-05 12:00:00 0.042214 -0.569964
19 2011-01-05 18:00:00 0.582815 -0.528172
20 2011-01-06 00:00:00 -1.100619 -0.569964
21 2011-01-06 06:00:00 1.144724 -0.528172
22 2011-01-06 12:00:00 0.901591 -0.388771
23 2011-01-06 18:00:00 0.502494 -0.249370
My current code:
def findPastVar2(df, var='var', window=3, method='median'):
    # window = number of past days
    for i in range(len(df)):
        pastVar2 = df[var].loc[
            (df['timestamp'] - df['timestamp'].loc[i] < datetime.timedelta(days=-window)) &
            (df['timestamp'] - df['timestamp'].loc[i] >= datetime.timedelta(days=-window*2))
        ]
        if pastVar2.shape[0] >= 5:  # require at least 5 data points
            if method == 'median':
                df.loc[i, 'past{}d-{}d_{}_median'.format(window+1, window*2, var)] = np.median(pastVar2.values)
    return df
Current speed:
In [35]: %timeit df2 = findPastVar2(df)
1 loop, best of 3: 821 ms per loop
I edited the post to clearly show the expected result with the at-least-5-data-points requirement, and I set a random seed so that everyone receives the same input and can reproduce the same output. As far as I know, plain rolling and shift don't work when there can be multiple data points on the same day.
Here we go:
df.set_index('timestamp', inplace = True)
df['var'] = df['var'].rolling('3D', min_periods=3).median().shift(freq=pd.Timedelta('4d')).shift(-1)
df['var']
Out[55]:
timestamp
2011-01-01 00:00:00 NaN
2011-01-01 06:00:00 NaN
2011-01-01 12:00:00 NaN
2011-01-01 18:00:00 NaN
2011-01-02 00:00:00 NaN
2011-01-02 06:00:00 NaN
2011-01-02 12:00:00 NaN
2011-01-02 18:00:00 NaN
2011-01-03 00:00:00 NaN
2011-01-03 06:00:00 NaN
2011-01-03 12:00:00 NaN
2011-01-03 18:00:00 NaN
2011-01-04 00:00:00 NaN
2011-01-04 06:00:00 NaN
2011-01-04 12:00:00 NaN
2011-01-04 18:00:00 NaN
2011-01-05 00:00:00 NaN
2011-01-05 06:00:00 -0.528172
2011-01-05 12:00:00 -0.569964
2011-01-05 18:00:00 -0.528172
2011-01-06 00:00:00 -0.569964
2011-01-06 06:00:00 -0.528172
2011-01-06 12:00:00 -0.569964
2011-01-06 18:00:00 -0.528172
2011-01-07 00:00:00 -0.388771
2011-01-07 06:00:00 -0.249370
2011-01-07 12:00:00 -0.388771
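A variant of the same window-shifting idea that keeps the original var column intact and matches the question's half-open window [t-6D, t-3D): compute a left-closed 3-day rolling median, then shift the result 3 days forward in time. A sketch assuming unique timestamps (min_periods=5 enforces the at-least-5-points rule; the setup is restated so it runs standalone):

```python
import numpy as np
import pandas as pd

# Restate the question's setup so the snippet runs standalone.
np.random.seed(1)
df = pd.DataFrame({'timestamp': pd.date_range('1/1/2011', periods=100, freq='6h')})
df['var'] = np.random.randn(len(df))

s = df.set_index('timestamp')['var']
# closed='left' makes each window [t-3D, t); shifting by 3 days moves it to
# [t-6D, t-3D) -- the question's "past 4-6 days" -- while min_periods=5
# yields NaN whenever fewer than 5 points fall in the window.
med = s.rolling('3D', closed='left', min_periods=5).median().shift(freq='3D')
df['past4d-6d_var_median'] = med.reindex(df['timestamp']).to_numpy()
```

This reproduces the NaN at row 16 (only 4 points in range) and the first real value -0.528172 at row 17 of the desired output.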
Since this has to be done for each row, and with irregular timestamps the window will contain a varying number of points, it calls for an iterative approach like the one you started. But if we make the index a time series:
# setup the df:
df = pd.DataFrame(index = pd.date_range('1/1/2011', periods=100, freq='12H'))
df['var'] = np.random.randn(len(df))
In this case I chose a 12-hour interval, but it could be whatever is available, even irregular. Using a modified function with a window for the median, together with an offset (here a positive Delta looks backwards), gives you the flexibility you need:
def GetMedian(df, var='var', window='2D', Delta='3D'):
    for Ti in df.index:
        Vals = df[(df.index < Ti - pd.Timedelta(Delta)) &
                  (df.index > Ti - pd.Timedelta(Delta) - pd.Timedelta(window))]
        df.loc[Ti, 'Medians'] = Vals[var].median()
    return df
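With the defaults the selected window is the open interval (Ti - 5D, Ti - 3D); for the question's 4-6 day lookback you would pass window='3D' and Delta='3D', which gives (Ti - 6D, Ti - 3D), open at both ends because of the strict inequalities. A standalone usage sketch (the function is repeated so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

def GetMedian(df, var='var', window='2D', Delta='3D'):
    # Same function as above, repeated so this snippet runs on its own.
    for Ti in df.index:
        Vals = df[(df.index < Ti - pd.Timedelta(Delta)) &
                  (df.index > Ti - pd.Timedelta(Delta) - pd.Timedelta(window))]
        df.loc[Ti, 'Medians'] = Vals[var].median()
    return df

np.random.seed(1)
df = pd.DataFrame(index=pd.date_range('1/1/2011', periods=100, freq='12h'))
df['var'] = np.random.randn(len(df))
df = GetMedian(df, window='3D', Delta='3D')  # median over (Ti-6D, Ti-3D)
```

Rows whose lookback window is empty (or all-NaN) simply get NaN from .median().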
This is much faster:
%timeit GetMedian(df)
84.8 ms ± 3.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
min_periods should be 2 instead of 5, because you shouldn't count the window size itself (5 - 3 = 2).
import pandas as pd
import numpy as np
import datetime
np.random.seed(1) # set random seed for easier comparison
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='D')
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
def first():
    df['past4d-6d_var_median'] = [np.nan]*3 + df.rolling(window=3, min_periods=2).median()[:-3]['var'].tolist()
    return df
%timeit -n1000 first()
1000 loops, best of 3: 6.23 ms per loop
My first try did not use shift(), but then I saw Noobie's answer. I wrote the next version with shift(), which is much faster than the previous one.
def test():
    df['past4d-6d_var_median'] = df['var'].rolling(window=3, min_periods=2).median().shift(3)
    return df
%timeit -n1000 test()
1000 loops, best of 3: 1.66 ms per loop
The second is about 4 times faster than the first.
These two functions produce the same result that looks like this:
df2 = test()
df2
timestamp var past4d-6d_var_median
0 2011-01-01 00:00:00 1.624345 NaN
1 2011-01-02 00:00:00 -0.611756 NaN
2 2011-01-03 00:00:00 -0.528172 NaN
3 2011-01-04 00:00:00 -1.072969 NaN
4 2011-01-05 00:00:00 0.865408 0.506294
5 2011-01-06 00:00:00 -2.301539 -0.528172
6 2011-01-07 00:00:00 1.744812 -0.611756
... ... ... ...
93 2011-04-04 00:00:00 -0.638730 1.129484
94 2011-04-05 00:00:00 0.423494 1.129484
95 2011-04-06 00:00:00 0.077340 0.185156
96 2011-04-07 00:00:00 -0.343854 -0.375285
97 2011-04-08 00:00:00 0.043597 -0.375285
98 2011-04-09 00:00:00 -0.620001 0.077340
99 2011-04-10 00:00:00 0.698032 0.077340
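That the two versions agree can be checked with a small standalone sketch that restates both computations inline:

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({'timestamp': pd.date_range('1/1/2011', periods=100, freq='D')})
df['var'] = np.random.randn(len(df))

# Version 1: prepend NaNs and drop the tail, as in first() above.
a = pd.Series([np.nan] * 3 + df['var'].rolling(window=3, min_periods=2).median()[:-3].tolist())
# Version 2: shift the rolling median by 3 rows, as in test() above.
b = df['var'].rolling(window=3, min_periods=2).median().shift(3)

assert np.allclose(a, b, equal_nan=True)
```

Note that shifting by 3 rows only matches the intended 4-6 day window because this data has exactly one point per day.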