Pandas padding values based on datetime index and column

Question

I have a Pandas data frame with two sets of dates: a DatetimeIndex for the index, and a column named date2 that contains datetime objects, along with value and id columns. For some ids I am missing the row where date2 equals the index; in that case I want to add the row and fill its values from the previous DatetimeIndex entry for the same id. date1 (the index) represents the current point in time, and date2 represents the last date. Each df[df.id == id] can be thought of as its own data frame, but the data is stored in one giant data frame of 500K rows.

Example: given

            date2      id   value
index
2006-01-24  2006-01-26  3   3       
2006-01-25  2006-01-26  1   1
2006-01-25  2006-01-26  2   2
2006-01-26  2006-01-26  2   2.1
2006-01-27  2006-02-26  4   4

In this example, there is no row where index == date2 for id 1, id 3 and id 4. I would like to fill in each missing row with the values from the previous index entry for that id.

I would like to return:

            date2      id   value
index
2006-01-24  2006-01-26  3   3               
2006-01-25  2006-01-26  1   1
2006-01-25  2006-01-26  2   2
2006-01-26  2006-01-26  1   1    #<---- row added
2006-01-26  2006-01-26  2   2.1
2006-01-26  2006-01-26  3   3    #<---- row added
2006-01-27  2006-02-26  4   4
2006-02-26  2006-02-26  4   4    #<---- row added
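
For anyone who wants to reproduce this, here is a minimal sketch that builds the example frame and lists the ids that still need a row. The variable names (df, has_final_row, missing_ids) and the index name "index" are my own assumptions for illustration, not part of the original data.

```python
import pandas as pd

# Rebuild the example frame; the index is named "index" (an assumption).
df = pd.DataFrame(
    {
        "date2": pd.to_datetime(
            ["2006-01-26", "2006-01-26", "2006-01-26", "2006-01-26", "2006-02-26"]
        ),
        "id": [3, 1, 2, 2, 4],
        "value": [3.0, 1.0, 2.0, 2.1, 4.0],
    },
    index=pd.DatetimeIndex(
        ["2006-01-24", "2006-01-25", "2006-01-25", "2006-01-26", "2006-01-27"],
        name="index",
    ),
)

# Compare as plain arrays so the duplicated index labels cause no alignment issues.
on_last_date = df.index.to_numpy() == df["date2"].to_numpy()

# ids that already have a row where the index equals date2
has_final_row = df.loc[on_last_date, "id"].unique()

# ids that still need a row added
missing_ids = sorted(set(df["id"]) - set(has_final_row))
```

Here only id 2 already has its index == date2 row, so missing_ids holds the ids for which a row should be added.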

+3

python pandas

pyCthon May 05 '15 at 21:31

2 answers

It's not very clean, but this is a possible solution. First, I've moved the index into a column named date1:

In [228]: df
Out[228]: 
       date1      date2  id  value
0 2006-01-24 2006-01-26   3    3.0
1 2006-01-25 2006-01-26   1    1.0
2 2006-01-25 2006-01-26   2    2.0
3 2006-01-26 2006-01-26   2    2.1

Then I group by each date pair and add the missing ids to the pairs where the two dates match. This involves splitting the DataFrame into a list of sub-frames and using concat to merge them back together.

In [229]: dfs = []
     ...: for (date1, date2), df_gb in df.groupby(['date1','date2']):
     ...:     if date1 == date2:
     ...:         to_add = list(set([1,2,3]) - set(df_gb['id']))
     ...:         missing = pd.DataFrame({'id': to_add, 'date1': date1, 'date2': date2, 'value': np.nan})
     ...:         df_gb = pd.concat([df_gb, missing], ignore_index=True)
     ...:     dfs.append(df_gb)

In [231]: df = pd.concat(dfs, ignore_index=True)

In [232]: df
Out[232]: 
       date1      date2  id  value
0 2006-01-24 2006-01-26   3    3.0
1 2006-01-25 2006-01-26   1    1.0
2 2006-01-25 2006-01-26   2    2.0
3 2006-01-26 2006-01-26   2    2.1
4 2006-01-26 2006-01-26   1    NaN
5 2006-01-26 2006-01-26   3    NaN

Finally, I sorted and filled in the missing values.

In [233]: df = df.sort_values(['id', 'date1', 'date2'])

In [234]: df = df.ffill()

In [236]: df.sort_values(['date1', 'date2'])
Out[236]: 
       date1      date2  id  value
0 2006-01-24 2006-01-26   3    3.0
1 2006-01-25 2006-01-26   1    1.0
2 2006-01-25 2006-01-26   2    2.0
4 2006-01-26 2006-01-26   1    1.0
3 2006-01-26 2006-01-26   2    2.1
5 2006-01-26 2006-01-26   3    3.0
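
One caveat with the sort-then-fill step: a plain forward fill does not stop at id boundaries, so an id whose first row is NaN would inherit the last value of the previous id. That doesn't happen in the example above, but here is a small sketch of the difference on made-up data (the three-row frame and the names plain and per_id are hypothetical). Filling within each id via groupby is one way to guard against it.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: id 2 has no earlier row to fill from.
df = pd.DataFrame(
    {
        "date1": pd.to_datetime(["2006-01-25", "2006-01-26", "2006-01-26"]),
        "date2": pd.to_datetime(["2006-01-26"] * 3),
        "id": [1, 1, 2],
        "value": [1.0, np.nan, np.nan],
    }
)
df = df.sort_values(["id", "date1", "date2"])

# Plain forward fill: the NaN for id 2 is filled from id 1.
plain = df["value"].ffill()

# Forward fill within each id: the NaN for id 2 is left as NaN.
per_id = df.groupby("id")["value"].ffill()
```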

+2

chrisb 06 May '15 at 1:02


JohnE · Accepted Answer · 2015-05-09T20:54:00+0000

I'm a little reluctant to answer because it seems that @chrisb may have successfully answered the original question, which was later changed. However, Chris hasn't updated his answer in a few days and this answer takes a different approach, so I'm going to +1 Chris and add this one.

First, just create a new dataframe from the original with 'index' = 'date2'. These are the rows that will be appended to the existing dataframe (note that "index" here is a column, not an index):

df2 = df[ df['index'] != df['date2'] ].copy()  # copy so the assignments below don't touch df
df2['index'] = df2['date2']
df2['value'] = np.nan

        index       date2  id  value
0  2006-01-26  2006-01-26   3    NaN
1  2006-01-26  2006-01-26   1    NaN
2  2006-01-26  2006-01-26   2    NaN
4  2006-02-26  2006-02-26   4    NaN

Now just append all of these rows, then drop the ones we don't need (those where a row with "index" = "date2" already exists, as for id = 2 here):

df3 = pd.concat([df, df2])
df3 = df3.drop_duplicates(['index','date2','id'])
df3 = df3.reset_index(drop=True).sort_values(['id','index','date2'])
df3['value'] = df3['value'].ffill()

        index       date2  id  value
1  2006-01-25  2006-01-26   1    1.0
6  2006-01-26  2006-01-26   1    1.0
2  2006-01-25  2006-01-26   2    2.0
3  2006-01-26  2006-01-26   2    2.1
0  2006-01-24  2006-01-26   3    3.0
5  2006-01-26  2006-01-26   3    3.0
4  2006-01-27  2006-02-26   4    4.0
7  2006-02-26  2006-02-26   4    4.0
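
For convenience, the steps above can be wrapped into one function that starts from the original DatetimeIndex layout. This is only a sketch under a few assumptions: the function name pad_to_date2 is made up, the index is assumed to be named "index" (or unnamed), and the forward fill is done per id with groupby rather than over the whole frame, which is a small departure from the code above.

```python
import numpy as np
import pandas as pd


def pad_to_date2(df):
    """Add a row at index == date2 for every (date2, id) pair that lacks one."""
    flat = df.reset_index()  # the DatetimeIndex becomes the "index" column
    # Candidate rows: one per existing row whose index is not yet date2.
    extra = flat[flat["index"] != flat["date2"]].copy()
    extra["index"] = extra["date2"]
    extra["value"] = np.nan
    out = pd.concat([flat, extra], ignore_index=True)
    # Originals come first, so an existing index == date2 row is kept.
    out = out.drop_duplicates(["index", "date2", "id"])
    out = out.sort_values(["id", "index", "date2"])
    # Fill each added row from the previous row of the same id.
    out["value"] = out.groupby("id")["value"].ffill()
    return out.sort_values(["index", "id"]).set_index("index")


# The example frame from the question.
df = pd.DataFrame(
    {
        "date2": pd.to_datetime(
            ["2006-01-26", "2006-01-26", "2006-01-26", "2006-01-26", "2006-02-26"]
        ),
        "id": [3, 1, 2, 2, 4],
        "value": [3.0, 1.0, 2.0, 2.1, 4.0],
    },
    index=pd.DatetimeIndex(
        ["2006-01-24", "2006-01-25", "2006-01-25", "2006-01-26", "2006-01-27"],
        name="index",
    ),
)
result = pad_to_date2(df)
```

On 500K rows this stays vectorized (one concat, one drop_duplicates, two sorts and a grouped fill), so it should avoid the Python-level loop over groups.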
