Filling data gaps with average data from adjacent days

Imagine a data frame with several variables measured every 30 minutes. Each time series within this data frame has gaps, possibly in different locations. These gaps should be filled with an average of the values at the same time of day on adjacent days, say +/- 2 days. For example, if I have no data on the 4th day at 07:30, I want to replace the NaN record with the average of the measurements at 07:30 on the 2nd, 3rd, 5th and 6th days. Note that it is also possible that, for example, day 5, 07:30 is NaN as well; in this case it should be excluded from the mean that replaces the missing measurement on day 4 (should this be possible with np.nanmean?).

I'm not sure how to do this. Right now, I would probably loop over every row and column in the data frame and write a really bad hack along the lines of np.mean(df["A"].iloc[[i-48, i, i+48]]), but I feel like there should be a more pythonic / pandas-y way?

Example dataset:

import numpy as np
import pandas as pd

# generate a 1-week time series
dates = pd.date_range(start="2014-01-01 00:00", end="2014-01-07 00:00", freq="30min")
df = pd.DataFrame(np.random.randn(len(dates), 3), index=dates, columns=["A", "B", "C"])

# generate some artificial gaps
df.loc["2014-01-04 10:00":"2014-01-04 11:00", "A"] = np.nan
df.loc["2014-01-04 12:30":"2014-01-04 14:00", "B"] = np.nan
df.loc["2014-01-04 09:30":"2014-01-04 15:00", "C"] = np.nan

print(df.loc["2014-01-04 08:00":"2014-01-04 16:00"])

                            A         B         C
2014-01-04 08:00:00  0.675720  2.186484 -0.033969
2014-01-04 08:30:00 -0.897217  1.332437 -2.618197
2014-01-04 09:00:00  0.299395  0.837023  1.346117
2014-01-04 09:30:00  0.223051  0.913047       NaN
2014-01-04 10:00:00       NaN  1.395480       NaN
2014-01-04 10:30:00       NaN -0.800921       NaN
2014-01-04 11:00:00       NaN -0.932760       NaN
2014-01-04 11:30:00  0.057219 -0.071280       NaN
2014-01-04 12:00:00  0.215810 -1.099531       NaN
2014-01-04 12:30:00 -0.532563       NaN       NaN
2014-01-04 13:00:00 -0.697872       NaN       NaN
2014-01-04 13:30:00 -0.028541       NaN       NaN
2014-01-04 14:00:00 -0.073426       NaN       NaN
2014-01-04 14:30:00 -1.187419  0.221636       NaN
2014-01-04 15:00:00  1.802449  0.144715       NaN
2014-01-04 15:30:00  0.446615  1.013915 -1.813272
2014-01-04 16:00:00 -0.410670  1.265309 -0.198607

[17 rows x 3 columns]

      

(An even more sophisticated approach would also exclude from the averaging any measurements that were themselves created by averaging, but this does not need to be part of the answer, as I think it might make things too complicated at this point.)

/ edit: A sample solution I'm not very happy with:

# specify the columns of df where gaps should be filled
cols = ["A", "B", "C"]
for col in cols:
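    for i in range(len(df)):
        if np.isnan(df[col].iloc[i]):
            # replace with the mean of the same slot on the adjacent days (48 rows = 1 day)
            neighbours = [j for j in (i - 48, i + 48) if 0 <= j < len(df)]
            df.loc[df.index[i], col] = np.nanmean(df[col].iloc[neighbours])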

      

There are two things about this solution that I don't like:

  • If a row is missing or duplicated anywhere, it won't work. In the last line, I would like to always subtract "one day", no matter whether that is 47, 48, or 49 rows. Also, it would be nice to be able to widen the range (e.g. -3 days to +3 days) without manually writing out the list of indices.
  • I would like to get rid of the loops if possible.
1 answer


This should be a faster, more concise way to do it. The main thing is to use the shift() function instead of a loop. A simple version:

df[df.isnull()] = np.nanmean([df.shift(-48).values, df.shift(48).values], axis=0)

      

Generalizing this to a wider window turned out to be trickier, but this seems to work (window is the number of days on each side):

df[df.isnull()] = np.nanmean([df.shift(x).values for x in
                              range(-48*window, 48*(window+1), 48)], axis=0)

      

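I'm not sure, but I suspect the issue lies in how nanmean is called, and this may also be why you are losing values yourself. It seems that nanmean does not give an elementwise result if you feed it DataFrames directly. Without an axis argument, np.nanmean averages over all elements and returns a single number. But if I convert to arrays (with .values) and use axis=0, the mean is taken across the shifted copies, cell by cell, and it seems to work.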
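To make the role of axis=0 concrete, here is a small illustration with made-up arrays (they stand in for the day-before and day-after copies and are not taken from the data above). It only shows how np.nanmean behaves with and without the axis argument:

```python
import numpy as np

# Two hypothetical "shifted copies" with one missing value
day_before = np.array([[1.0, 2.0], [3.0, np.nan]])
day_after = np.array([[3.0, 6.0], [5.0, 8.0]])

# Without axis, all elements are pooled and a single scalar comes back
overall = np.nanmean([day_before, day_after])

# With axis=0, the mean is taken across the stacked arrays, cell by cell,
# and a NaN in one array is simply left out of that cell's mean
per_cell = np.nanmean([day_before, day_after], axis=0)

print(overall)   # 4.0
print(per_cell)  # [[2. 4.] [4. 8.]]
```

Assigning the scalar form to the masked frame would fill every gap with the same number, which is presumably not what is wanted here.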

Check results for window = 1:

print(df.loc["2014-01-04 12:30":"2014-01-04 14:00", "B"])
print(df.loc["2014-01-03 12:30":"2014-01-03 14:00", "B"])
print(df.loc["2014-01-05 12:30":"2014-01-05 14:00", "B"])

2014-01-04 12:30:00    0.940193     # was nan, now filled
2014-01-04 13:00:00    0.078160
2014-01-04 13:30:00   -0.662918
2014-01-04 14:00:00   -0.967121

2014-01-03 12:30:00    0.947915     # day before
2014-01-03 13:00:00    0.167218
2014-01-03 13:30:00   -0.391444
2014-01-03 14:00:00   -1.157040

2014-01-05 12:30:00    0.932471     # day after
2014-01-05 13:00:00   -0.010899
2014-01-05 13:30:00   -0.934391
2014-01-05 14:00:00   -0.777203
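For reuse, the shift-based fill can be wrapped in a small function. The sketch below is one possible arrangement rather than a definitive implementation: the name fill_gaps_from_adjacent_days and the parameters window and periods_per_day are made up for this illustration, and it assumes a regular index with exactly periods_per_day rows per day. It uses deterministic data so the result can be checked by hand:

```python
import warnings

import numpy as np
import pandas as pd


def fill_gaps_from_adjacent_days(df, window=2, periods_per_day=48):
    """Fill NaNs with the NaN-aware mean of the same slot on +/- `window` days.

    Hypothetical helper; assumes exactly `periods_per_day` rows per day.
    """
    offsets = [k * periods_per_day for k in range(-window, window + 1) if k != 0]
    # Stack the shifted copies: shape (n_offsets, n_rows, n_cols)
    stacked = np.stack([df.shift(k).to_numpy(dtype=float) for k in offsets])
    with warnings.catch_warnings():
        # Cells with no valid neighbour at all stay NaN; silence that warning
        warnings.simplefilter("ignore", category=RuntimeWarning)
        means = np.nanmean(stacked, axis=0)
    neighbours = pd.DataFrame(means, index=df.index, columns=df.columns)
    # fillna only touches cells that are NaN in the original frame
    return df.fillna(neighbours)


# Deterministic demo: each value equals its row position
dates = pd.date_range("2014-01-01", periods=7 * 48, freq="30min")
demo = pd.DataFrame({"A": np.arange(len(dates), dtype=float)}, index=dates)
demo.loc[pd.Timestamp("2014-01-04 10:00"), "A"] = np.nan  # row 164
demo.loc[pd.Timestamp("2014-01-05 10:00"), "A"] = np.nan  # row 212

filled = fill_gaps_from_adjacent_days(demo, window=2)
# Row 164 is filled from rows 68, 116 and 260 (row 212 is NaN and skipped)
print(filled.loc[pd.Timestamp("2014-01-04 10:00"), "A"])  # 148.0
```

Because all the means are computed from shifted copies of the original frame before anything is written back, a value that was itself filled never feeds into another average, which is the refinement mentioned as an aside in the question.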

      



As for the problem of missing or duplicated rows, it will depend on your data, but if you precede the above with

df = df.resample('30min').mean()

you get a row of NaNs for every missing timestamp (and duplicated timestamps are averaged into one row), and you can then fill them in just like all the other NaNs. This is probably the easiest and fastest way if it works.
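As a small made-up example of what resampling does to a ragged index, the sketch below drops one row and duplicates another, then resamples to a regular 30-minute grid. The frame names ragged and regular are hypothetical:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2014-01-01", periods=6, freq="30min")
ragged = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=dates)

# Remove the 01:00 row and add a second 02:00 row with a different value (7.0)
ragged = pd.concat([ragged.drop(dates[2]), ragged.iloc[[4]] + 2.0]).sort_index()

# Regularize: one row per 30-minute slot
regular = ragged.resample("30min").mean()

print(regular)
# 01:00 is now present as NaN; 02:00 holds the mean of 5.0 and 7.0
```

After this step there are again exactly 48 rows per full day, so a row-based shift of 48 lines up with "one day".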

Alternatively, you can do something with groupby. My groupby-fu is weak, but to give you a taste of it, something like:

df.groupby(df.index.time).ffill()

groups the rows by time of day, so each gap is padded with the last valid value from the same time on an earlier day. This correctly addresses the issue with missing rows, but not the others.
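Another way to make the offsets independent of row counts is to shift by calendar time instead of by rows: move the index by whole days and look the values up again at the original timestamps. The sketch below is only one way this could be done. The name fill_by_calendar_day is made up for this illustration, and it assumes a unique DatetimeIndex (duplicated timestamps would have to be collapsed first, for example by resampling):

```python
import warnings

import numpy as np
import pandas as pd


def fill_by_calendar_day(df, window=2):
    """Fill NaNs from the same clock time on +/- `window` days (hypothetical helper).

    Works with missing rows; assumes the index has no duplicates.
    """
    neighbours = []
    for k in range(-window, window + 1):
        if k == 0:
            continue
        moved = df.copy()
        moved.index = df.index + pd.Timedelta(days=k)  # relabel rows k days later
        # Look up, for each original timestamp, the value from k days earlier
        neighbours.append(moved.reindex(df.index).to_numpy(dtype=float))
    with warnings.catch_warnings():
        # Cells with no valid neighbour stay NaN; silence that warning
        warnings.simplefilter("ignore", category=RuntimeWarning)
        means = np.nanmean(np.stack(neighbours), axis=0)
    return df.fillna(pd.DataFrame(means, index=df.index, columns=df.columns))


# Demo: each value equals its original row position, and one row is missing
dates = pd.date_range("2014-01-01", periods=5 * 48, freq="30min")
demo = pd.DataFrame({"A": np.arange(len(dates), dtype=float)}, index=dates)
demo = demo.drop(pd.Timestamp("2014-01-02 18:00"))  # breaks the 48-rows-per-day rule
gap = pd.Timestamp("2014-01-03 12:00")
demo.loc[gap, "A"] = np.nan

filled = fill_by_calendar_day(demo, window=1)
# Neighbours are 2014-01-02 12:00 (72.0) and 2014-01-04 12:00 (168.0)
print(filled.loc[gap, "A"])  # 120.0
```

Here a row-based shift of 48 would pick up 11:30 on the previous day, because the dropped row sits between the gap and its neighbour, while the calendar-based lookup still finds 12:00.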
