Calculating Date Difference in Pandas GroupBy Object

I have a Pandas DataFrame with the following format:

In [0]: df
Out[0]: 
       col1  col2       date
 0     1     1          2015-01-01
 1     1     2          2015-01-09
 2     1     3          2015-01-10
 3     2     1          2015-02-10
 4     2     2          2015-02-10
 5     2     3          2015-02-25

In [1]: df.dtypes
Out[1]:
 col1             int64
 col2             int64
 date    datetime64[ns]
 dtype: object

      

I want to find the value of col2 corresponding to the largest date difference (between consecutive rows within each group, sorted by date), grouped by col1. Assume there are no groups of size 1.

Desired result

In [2]: output
Out[2]:
col1   col2
1      1         # This is because the difference between 2015-01-09 and 2015-01-01 is the greatest
2      2         # This is because the difference between 2015-02-25 and 2015-02-10 is the greatest

      

The real df has many values of col1 that need to be grouped for the calculation. Is this possible by applying a function to the following? Note that the dates are already in ascending order.

gb = df.groupby('col1')
gb.apply(right_maximum_date_difference)

      



2 answers


Here's something that's almost your dataframe (I avoided copying dates):

import pandas as pd

df = pd.DataFrame({
    'col1': [1, 1, 1, 2, 2, 2],
    'col2': [1, 2, 3, 1, 2, 3],
    'date': [1, 9, 10, 10, 10, 25]
})

      

With this, define:



def max_diff_date(g):
    g = g.sort_values('date')
    # gap from each row to the next one; the last row has no successor (NaN)
    gaps = g['date'].shift(-1) - g['date']
    return g.loc[gaps.idxmax(), 'col2']

      

and you have:

>>> df.groupby(df.col1).apply(max_diff_date)
col1
1    1
2    2
dtype: int64
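
Since the frame above swaps the dates for plain integers, here is a minimal sketch of the same idea on the actual datetime values from the question. It assumes a reasonably recent pandas; the helper name col2_before_largest_gap is made up for illustration, and selecting the two columns before apply is just one way to keep the grouping column out of the function.

```python
import pandas as pd

# The frame from the question, with real datetime64 values.
df = pd.DataFrame({
    'col1': [1, 1, 1, 2, 2, 2],
    'col2': [1, 2, 3, 1, 2, 3],
    'date': pd.to_datetime(['2015-01-01', '2015-01-09', '2015-01-10',
                            '2015-02-10', '2015-02-10', '2015-02-25']),
})

def col2_before_largest_gap(g):
    # Hypothetical helper name. Sort by date, then measure the gap from
    # each row to the next one; the last row has no successor (NaT).
    g = g.sort_values('date')
    gaps = g['date'].shift(-1) - g['date']
    # idxmax skips NaT and returns the label of the row that starts the largest gap.
    return g.loc[gaps.idxmax(), 'col2']

# Pass only the columns the helper needs to each group.
result = df.groupby('col1')[['col2', 'date']].apply(col2_before_largest_gap)
print(result)
```

Subtracting datetimes gives timedeltas, so nothing else has to change compared with the integer version.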

      



I would try a slightly different tack: pivot the table so that you have a column for each value in col1 containing the dates, with the values of col2 as the index. Then you can use the .diff method to get the differences between consecutive cells. This might not work if there are duplicate (col1, col2) pairs, though, which is not clear from the question.
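
To make the caveat about duplicates concrete, here is a small sketch with a hypothetical frame dup (not from the question) in which the pair col1=1, col2=1 occurs twice. As far as I know, DataFrame.pivot raises a ValueError in that situation, because two dates would have to go into the same cell.

```python
import pandas as pd

# Hypothetical frame: the pair (col1=1, col2=1) appears twice.
dup = pd.DataFrame({
    'col1': [1, 1, 1],
    'col2': [1, 1, 2],
    'date': pd.to_datetime(['2015-01-01', '2015-01-09', '2015-01-10']),
})

try:
    dup.pivot(columns='col1', index='col2', values='date')
    pivot_failed = False
except ValueError as exc:
    # pivot cannot place two dates into the same (col2, col1) cell
    pivot_failed = True
    print(exc)
```

If your real data can contain such duplicates, the groupby/apply approach is probably the safer choice.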

import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 1, 2, 2, 2],
                   'col2': [1, 2, 3, 1, 2, 3],
                   'date': pd.to_datetime(['2015-01-01', '2015-01-09', '2015-01-10',
                                           '2015-02-10', '2015-02-10', '2015-02-25'])})
p = df.pivot(columns='col1', index='col2', values='date')
p
col1          1          2
col2
1    2015-01-01 2015-02-10
2    2015-01-09 2015-02-10
3    2015-01-10 2015-02-25

p.diff().shift(-1).idxmax() 

col1
1       1
2       2

      



The .shift(-1) accounts for the fact that you want the first of the two consecutive dates that have the largest difference.
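
To see what the shift does, it may help to look at the intermediate frames. This is only a sketch that rebuilds the same data; the names d and shifted are mine.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 1, 2, 2, 2],
                   'col2': [1, 2, 3, 1, 2, 3],
                   'date': pd.to_datetime(['2015-01-01', '2015-01-09', '2015-01-10',
                                           '2015-02-10', '2015-02-10', '2015-02-25'])})
p = df.pivot(columns='col1', index='col2', values='date')

# Each row holds "this date minus the previous date"; the first row is NaT.
d = p.diff()
print(d)

# After shifting up by one, each row holds "next date minus this date",
# so the gap is attributed to the earlier of the two dates; the last row is NaT.
shifted = d.shift(-1)
print(shifted)

# Without the shift, idxmax would point at the later date of each pair.
print(d.idxmax())
print(shifted.idxmax())
```

Note that this relies on col2 being in date order within each col1 group, which holds for the data in the question.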
