Filling data gaps with average data from adjacent days
Imagine a data frame with several variables measured every 30 minutes. Each of the time series in this data frame has gaps, possibly in different places. These gaps should be filled with an average taken from the neighbouring days, say +/- 2 days. For example, if the measurement for day 4 at 07:30 is NaN, I want to replace it with the mean of the measurements at 07:30 on days 2, 3, 5 and 6. Note that it is also possible that, say, day 5 at 07:30 is NaN as well - in that case it should simply be excluded from the mean that replaces the missing value on day 4 (should that be possible with np.nanmean?).
I'm not sure how to do this. Right now I would probably loop over every row and column of the data frame and write a really ugly hack along the lines of np.mean(df.ix[[i-48, i, i+48], "A"]), but I feel like there should be a more pythonic / pandas-y way?
Example dataset:
import numpy as np
import pandas as pd
# generate a 1-week time series
dates = pd.date_range(start="2014-01-01 00:00", end="2014-01-07 00:00", freq="30min")
df = pd.DataFrame(np.random.randn(len(dates),3), index=dates, columns=("A", "B", "C"))
# generate some artificial gaps
df.ix["2014-01-04 10:00":"2014-01-04 11:00", "A"] = np.nan
df.ix["2014-01-04 12:30":"2014-01-04 14:00", "B"] = np.nan
df.ix["2014-01-04 09:30":"2014-01-04 15:00", "C"] = np.nan
print df["2014-01-04 08:00":"2014-01-04 16:00"]
A B C
2014-01-04 08:00:00 0.675720 2.186484 -0.033969
2014-01-04 08:30:00 -0.897217 1.332437 -2.618197
2014-01-04 09:00:00 0.299395 0.837023 1.346117
2014-01-04 09:30:00 0.223051 0.913047 NaN
2014-01-04 10:00:00 NaN 1.395480 NaN
2014-01-04 10:30:00 NaN -0.800921 NaN
2014-01-04 11:00:00 NaN -0.932760 NaN
2014-01-04 11:30:00 0.057219 -0.071280 NaN
2014-01-04 12:00:00 0.215810 -1.099531 NaN
2014-01-04 12:30:00 -0.532563 NaN NaN
2014-01-04 13:00:00 -0.697872 NaN NaN
2014-01-04 13:30:00 -0.028541 NaN NaN
2014-01-04 14:00:00 -0.073426 NaN NaN
2014-01-04 14:30:00 -1.187419 0.221636 NaN
2014-01-04 15:00:00 1.802449 0.144715 NaN
2014-01-04 15:30:00 0.446615 1.013915 -1.813272
2014-01-04 16:00:00 -0.410670 1.265309 -0.198607
[17 rows x 3 columns]
(An even more sophisticated version would also exclude from the averaging any values that were themselves created by averaging, but that does not need to be part of the answer, as I think it might make things too complicated at this point.)
Edit: a sample solution I'm not very happy with:
# specify the columns of df where gaps should be filled
cols = ["A", "B", "C"]
for col in cols:
    for i, idx in enumerate(df.index):
        if np.isnan(df.ix[idx, col]):
            # replace with the mean of the same time of day one day
            # before and after (48 half-hour rows = 1 day)
            df.ix[idx, col] = np.nanmean(df.ix[[i - 48, i + 48], col].values)
There are two things about this solution that I don't like:
- If at any point a single row is missing or duplicated, this breaks. In the last line I would really like to subtract "one day" by label, no matter whether that happens to be 47, 48 or 49 rows away. It would also be nice to widen the range (e.g. -3 days to +3 days) without writing out the index list by hand (see the sketch after this list).
- I would like to get rid of the loops if possible.
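A rough sketch of the label-based lookup I have in mind for the first point (just my own attempt, the function name is made up; it still loops over cells, so it does not help with the second point):

import numpy as np
from datetime import timedelta

def adjacent_day_mean(df, ts, col, window=2):
    # timestamps of the same time of day on the +/- `window` surrounding days
    neighbours = [ts + timedelta(days=d)
                  for d in range(-window, window + 1) if d != 0]
    # reindex() gives NaN for timestamps that are not in the index,
    # and np.nanmean ignores those
    return np.nanmean(df[col].reindex(neighbours).values)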
There should be a faster, more concise way to do this. The main thing is to use shift() instead of a loop. A simple version:
df[ df.isnull() ] = np.nanmean( [ df.shift(-48), df.shift(48) ] )
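(For the 30-minute grid above, 48 rows are exactly one day, which is why shift(+/-48) lines up each timestamp with the same clock time on the neighbouring days. A quick check, my own addition:)

print df.index[48] - df.index[0]   # -> one day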
That turned out to be quite hard to generalize, but the following seems to work:
df[ df.isnull() ] = np.nanmean( [ df.shift(x).values for x in
                                  range(-48*window, 48*(window+1), 48) ], axis=0 )
I'm not sure, but I suspect there may be a bug in nanmean, and that this is also why you were losing values yourself. It seems that nanmean cannot handle the NaNs properly if you feed it DataFrames directly, but if I convert to arrays (with .values) and use axis=0, then it seems to work.
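Wrapped up as a small helper (my own sketch - the function name and the periods_per_day argument are not part of the answer), the generalized version could look like:

import numpy as np

def fill_from_adjacent_days(df, window=1, periods_per_day=48):
    # nan-mean over the same time of day on the +/- `window` surrounding
    # days; rows are assumed to be evenly spaced (here: 30-minute steps)
    shifted = [df.shift(x).values
               for x in range(-periods_per_day * window,
                              periods_per_day * (window + 1),
                              periods_per_day)]
    out = df.copy()
    out[out.isnull()] = np.nanmean(shifted, axis=0)
    return out

so that df = fill_from_adjacent_days(df, window=1) does the same thing as the one-liner above, without modifying the original frame.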
Check results for window = 1:
print df.ix["2014-01-04 12:30":"2014-01-04 14:00", "B"]
print df.ix["2014-01-03 12:30":"2014-01-03 14:00", "B"]
print df.ix["2014-01-05 12:30":"2014-01-05 14:00", "B"]
2014-01-04 12:30:00 0.940193 # was nan, now filled
2014-01-04 13:00:00 0.078160
2014-01-04 13:30:00 -0.662918
2014-01-04 14:00:00 -0.967121
2014-01-03 12:30:00 0.947915 # day before
2014-01-03 13:00:00 0.167218
2014-01-03 13:30:00 -0.391444
2014-01-03 14:00:00 -1.157040
2014-01-05 12:30:00 0.932471 # day after
2014-01-05 13:00:00 -0.010899
2014-01-05 13:30:00 -0.934391
2014-01-05 14:00:00 -0.777203
As for the problem of missing rows, it will depend on your data, but if you precede the above with
df = df.resample('30min')
that will give you a row of NaNs for every missing timestamp, and then you can fill those in along with all the other NaNs. This is probably the simplest and fastest way, if it works for your data.
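(A side note that is my own addition, not part of the answer: in current pandas, resample() returns a Resampler object, so you need an explicit .asfreq() - or simply DataFrame.asfreq - to materialize one row per 30-minute slot; the effect is the same, missing timestamps become NaN rows.)

import pandas as pd

dates = pd.to_datetime(["2014-01-01 00:00", "2014-01-01 00:30",
                        "2014-01-01 01:30"])   # the 01:00 row is missing
df_gappy = pd.DataFrame({"A": [1.0, 2.0, 3.0]}, index=dates)
df_regular = df_gappy.asfreq("30min")          # inserts a NaN row at 01:00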
Alternatively, you could do something with groupby. My groupby-fu is weak, but to give you a flavor of it, something like
df.groupby( df.index.hour ).fillna(method='pad')
would correctly deal with the missing-rows issue, but not the others.
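To give a hedged sketch of how that could be fleshed out (my own variant, not the answer's code): group by time of day and fill each gap with the mean of that slot across all days - coarser than the +/- window approach above, since it uses every day rather than just the adjacent ones.

def fill_with_timeofday_mean(df):
    # group 07:30 with 07:30 on every other day, and so on
    grouped = df.groupby(df.index.time)
    # within each group, replace NaNs with that group's column mean
    return grouped.transform(lambda col: col.fillna(col.mean()))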