Calculating Date Difference in Pandas GroupBy Object

Question

I have a Pandas DataFrame with the following format:

In [0]: df
Out[0]: 
       col1  col2       date
 0     1     1          2015-01-01
 1     1     2          2015-01-09
 2     1     3          2015-01-10
 3     2     1          2015-02-10
 4     2     2          2015-02-10
 5     2     3          2015-02-25

In [1]: df.dtypes
Out[1]:
 col1             int64
 col2             int64
 date    datetime64[ns]
 dtype: object

We want to find the value of col2 corresponding to the largest date difference (between consecutive items in groups sorted by date), grouped by col1. Suppose there are no groups of size 1.

Desired result

In [2]: output
Out[2]:
col1   col2
1      1         # This is because the difference between 2015-01-09 and 2015-01-01 is the greatest
2      2         # This is because the difference between 2015-02-25 and 2015-02-10 is the greatest

The real df has many values of col1, which we need to group by to perform the calculation. Is this possible by applying a function to each group, as in the following? Note that the dates are already in ascending order.

gb = df.groupby('col1')
gb.apply(right_maximum_date_difference)

python pandas time-series

invoker 08 june 15 at 18:07

2 answers

I would try a slightly different approach: pivot the table so that you have one column for each value of col1, containing the dates, with the values of col2 as the index. Then you can use the .diff method to get the differences between consecutive cells. This will fail if there are duplicate (col1, col2) pairs; whether that can happen is not clear from the question.

import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 1, 2, 2, 2],
          'col2': [1, 2, 3, 1, 2, 3],
          'date': pd.to_datetime(['2015-01-01', '2015-01-09', '2015-01-10', 
                                  '2015-02-10', '2015-02-10', '2015-02-25'])})
p = df.pivot(columns='col1', index='col2', values='date')
p
col1          1          2
col2
1    2015-01-01 2015-02-10
2    2015-01-09 2015-02-10
3    2015-01-10 2015-02-25

p.diff().shift(-1).idxmax() 

col1
1       1
2       2

.shift(-1) moves each difference up one row, so that it lines up with the first of the two consecutive dates. That is the row you want idxmax to report for the largest difference.
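To see why the shift is needed, it may help to look at the intermediate results one step at a time. The following is only a sketch that rebuilds the same df and p as above; the names gaps, aligned and result are introduced here for illustration and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 1, 1, 2, 2, 2],
    'col2': [1, 2, 3, 1, 2, 3],
    'date': pd.to_datetime(['2015-01-01', '2015-01-09', '2015-01-10',
                            '2015-02-10', '2015-02-10', '2015-02-25']),
})
p = df.pivot(columns='col1', index='col2', values='date')

# Row k holds date[k] - date[k-1]; the first row has no predecessor, so it is NaT.
gaps = p.diff()

# Move every difference up one row: row k now holds date[k+1] - date[k],
# and the last row becomes NaT.
aligned = gaps.shift(-1)

# For each col1 column, the col2 label of the largest gap (NaT entries are skipped).
result = aligned.idxmax()

print(aligned)
print(result)
```

Without the shift, idxmax would return the col2 value of the later date in each pair rather than the earlier one.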

JoeCondron 08 june 15 at 19:14

Ami tavory · Accepted Answer · 2015-06-08T18:23:40+0000

Here's something that's almost your dataframe (I used plain integers instead of copying the dates):

import pandas as pd

df = pd.DataFrame({
    'col1': [1, 1, 1, 2, 2, 2],
    'col2': [1, 2, 3, 1, 2, 3],
    'date': [1, 9, 10, 10, 10, 25]
})

With this, define:

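def max_diff_date(g):
    g = g.sort_values('date')
    # Gap between each row and the previous one; drop the first row, which has none.
    gaps = g['date'].diff().iloc[1:]
    # The position of the largest gap is also the position of its earlier row.
    return g['col2'].iloc[gaps.to_numpy().argmax()]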

and you have:

>> df.groupby(df.col1).apply(max_diff_date)
col1
1    1
2    2
dtype: int64
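The function above is called once per group. As a possible alternative, here is a minimal sketch that works on the real datetime column and avoids apply. It assumes, as the question states, that the rows are already in ascending date order within each col1 group. The names next_date, gap_seconds, idx and result are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 1, 1, 2, 2, 2],
    'col2': [1, 2, 3, 1, 2, 3],
    'date': pd.to_datetime(['2015-01-01', '2015-01-09', '2015-01-10',
                            '2015-02-10', '2015-02-10', '2015-02-25']),
})

# Assumption: rows are already in ascending date order within each col1 group.
# Date of the following row in the same group (NaT for the last row of a group).
next_date = df.groupby('col1')['date'].shift(-1)

# Gap from each row to the next one, in seconds (NaN where there is no next row).
gap_seconds = (next_date - df['date']).dt.total_seconds()

# Row label of the largest gap in each group; NaN entries are skipped.
idx = gap_seconds.groupby(df['col1']).idxmax()

# Look up col2 at those rows and index the result by col1.
result = pd.Series(df.loc[idx, 'col2'].to_numpy(), index=idx.index, name='col2')
print(result)
```

Because the gap is computed against the next row within the same group, no extra shift is needed, and repeated col2 values within a group do not cause an error.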
