Calculating Date Difference in Pandas GroupBy Object
I have a Pandas DataFrame with the following format:
In [0]: df
Out[0]:
   col1  col2       date
0     1     1 2015-01-01
1     1     2 2015-01-09
2     1     3 2015-01-10
3     2     1 2015-02-10
4     2     2 2015-02-10
5     2     3 2015-02-25
In [1]: df.dtypes
Out[1]:
col1             int64
col2             int64
date    datetime64[ns]
dtype: object
We want to find the value of col2 corresponding to the largest date difference (between consecutive rows within each group, sorted by date), grouped by col1. Assume there are no groups of size 1.
Desired result
In [2]: output
Out[2]:
col1 col2
1 1 # This is because the difference between 2015-01-09 and 2015-01-01 is the greatest
2 2 # This is because the difference between 2015-02-25 and 2015-02-10 is the greatest
The real df has many values of col1, which we need to group by to perform the calculation. Is this possible by applying a function to the groupby object, as in the following? Note that the dates are already in ascending order.
gb = df.groupby('col1')
gb.apply(right_maximum_date_difference)
Here's something that's almost your DataFrame (I used integers instead of copying the dates):
import pandas as pd
df = pd.DataFrame({
    'col1': [1, 1, 1, 2, 2, 2],
    'col2': [1, 2, 3, 1, 2, 3],
    'date': [1, 9, 10, 10, 10, 25]
})
With this, define:
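def max_diff_date(g):
    g = g.sort_values('date')
    # Gap from each row to the next one; the last row of the group gets NaN.
    gaps = g['date'].diff().shift(-1)
    # idxmax skips NaN and returns the label of the row that starts the largest gap.
    return g.loc[gaps.idxmax(), 'col2']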
and you have:
>>> df.groupby(df.col1).apply(max_diff_date)
col1
1 1
2 2
dtype: int64
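If you would rather keep the real dates and avoid apply, the same idea can be sketched with a grouped shift. This is only a sketch: it assumes the rows are already sorted by date within each col1 group (as the question states), and the names next_date, gap_seconds and first_rows are my own.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 1, 1, 2, 2, 2],
    'col2': [1, 2, 3, 1, 2, 3],
    'date': pd.to_datetime(['2015-01-01', '2015-01-09', '2015-01-10',
                            '2015-02-10', '2015-02-10', '2015-02-25']),
})

# Assumption: rows are already sorted by date within each col1 group.
# Date of the following row in the same group (NaT for the last row of a group).
next_date = df.groupby('col1')['date'].shift(-1)

# Gap to the next row, as float seconds so that missing gaps become NaN.
gap_seconds = (next_date - df['date']).dt.total_seconds()

# For each group, the index label of the row that starts the largest gap.
first_rows = gap_seconds.groupby(df['col1']).idxmax()

result = df.loc[first_rows.to_numpy(), ['col1', 'col2']].reset_index(drop=True)
print(result)
```

For the sample data this picks col2 = 1 for col1 = 1 and col2 = 2 for col1 = 2, matching the desired result in the question.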
I would try a slightly different approach: pivot the table so that you have one column for each value of col1, containing the dates, with the values of col2 as the index. Then you can use the .diff method to get the differences between consecutive cells. Note that pivot raises an error if there are duplicate (col1, col2) pairs; whether that can happen is not clear from the question.
import pandas as pd
df = pd.DataFrame({'col1': [1, 1, 1, 2, 2, 2],
                   'col2': [1, 2, 3, 1, 2, 3],
                   'date': pd.to_datetime(['2015-01-01', '2015-01-09', '2015-01-10',
                                           '2015-02-10', '2015-02-10', '2015-02-25'])})
p = df.pivot(columns='col1', index='col2', values='date')
p
col1          1          2
col2
1    2015-01-01 2015-02-10
2    2015-01-09 2015-02-10
3    2015-01-10 2015-02-25
p.diff().shift(-1).idxmax()
col1
1 1
2 2
The .shift(-1) moves each difference up one row, so that idxmax returns the first of the two consecutive dates with the largest difference rather than the second.
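To make the role of each step easier to see, here is a small sketch that keeps the intermediate results. The names d and shifted are mine, and it assumes the same sample data as above with no duplicate (col1, col2) pairs.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 1, 2, 2, 2],
                   'col2': [1, 2, 3, 1, 2, 3],
                   'date': pd.to_datetime(['2015-01-01', '2015-01-09', '2015-01-10',
                                           '2015-02-10', '2015-02-10', '2015-02-25'])})

# One column per col1 value, indexed by col2, holding the dates.
p = df.pivot(columns='col1', index='col2', values='date')

# d.loc[k] is the gap between row k and the row before it (NaT in the first row).
d = p.diff()

# After shifting up, shifted.loc[k] is the gap between row k and the row after it.
shifted = d.shift(-1)

# For each col1, the col2 label of the row that starts the largest gap.
result = shifted.idxmax()
print(d)
print(shifted)
print(result)
```

Without the shift, idxmax on d would return the col2 value of the later row of each pair (2 and 3 for this data) instead of the earlier one.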