Pandas concatenate / concatenate 2 dataframes using date as index

Question

Pandas concatenate / concatenate 2 dataframes using date as index

I have 2 large data frames with date as index. To just give an example, let's say they look like this (the amount of data on a particular date in the first dataframe is not the same as the second):

df1:

      Date    X    Y
2000-01-01   x1   y1
2000-01-01   x2   y2
2000-01-02   x3   y3
2000-01-03   x4   y4
2000-01-03   x5   y5
2000-01-03   x6   y6

df2:

      Date  X_2  Y_2
2000-01-01   X1   Y1
2000-01-01   X2   Y2
2000-01-01   X3   Y3
2000-01-03   X4   Y4
2000-01-03   X5   Y5

The result should look like this (I only want to combine the data with the dates that appear in both data files):

      Date    X    Y  X_2  Y_2
2000-01-01   x1   y1   X1   Y1
2000-01-01   x2   y2   X2   Y2
2000-01-01  NaN  NaN   X3   Y3
2000-01-03   x4   y4   X4   Y4
2000-01-03   x5   y5   X5   Y5
2000-01-03   x6   y6  NaN  NaN

I have tried different combinations of code and I keep getting duplicated data like this:

      Date    X    Y  X_2  Y_2
2000-01-01   x1   y1   X1   Y1
2000-01-01   x1   y1   X2   Y2
2000-01-01   x1   y1   X3   Y3
2000-01-01   x2   y2   X1   Y1
2000-01-01   x2   y2   X2   Y2
2000-01-01   x2   y2   X3   Y3

I've tried for example. result = pd.merge(df1,df2, how='inner', on='Date')

What should I do to get the result I want?

+3

python python-2.7 pandas

Patricia Jul 26 15 at 16:27

source to share

1 answer

unutbu · Accepted Answer · 2015-07-26T16:55:21+0000

Use cumcount

for the number of items in each group, grouped by Date

:

In [107]: df1['count'] = df1.groupby('Date').cumcount()

In [108]: df1
Out[108]: 
         Date   X   Y  count
0  2000-01-01  x1  y1      0
1  2000-01-01  x2  y2      1
2  2000-01-02  x3  y3      0
3  2000-01-03  x4  y4      0
4  2000-01-03  x5  y5      1
5  2000-01-03  x6  y6      2

In [109]: df2['count'] = df2.groupby('Date').cumcount()

In [110]: df2
Out[110]: 
         Date X_2 Y_2  count
0  2000-01-01  X1  Y1      0
1  2000-01-01  X2  Y2      1
2  2000-01-01  X3  Y3      2
3  2000-01-03  X4  Y4      0
4  2000-01-03  X5  Y5      1

By adding a column count

, you can now combine both with Date

and with count

, which brings you closer to your desired result:

In [111]: pd.merge(df1, df2, on=['Date', 'count'], how='outer')
Out[111]: 
         Date    X    Y  count  X_2  Y_2
0  2000-01-01   x1   y1      0   X1   Y1
1  2000-01-01   x2   y2      1   X2   Y2
2  2000-01-02   x3   y3      0  NaN  NaN
3  2000-01-03   x4   y4      0   X4   Y4
4  2000-01-03   x5   y5      1   X5   Y5
5  2000-01-03   x6   y6      2  NaN  NaN
6  2000-01-01  NaN  NaN      2   X3   Y3

The lines you want to delete can be characterized as those where count is 0 and X or X_2 is NaN. Therefore, you can remove these lines with a boolean mask like this:

mask = (result['count'] == 0) & pd.isnull(result).any(axis=1)
result = result.loc[~mask]

import pandas as pd

df1 = pd.DataFrame({'Date': ['2000-01-01',
  '2000-01-01',
  '2000-01-02',
  '2000-01-03',
  '2000-01-03',
  '2000-01-03'],
 'X': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6'],
 'Y': ['y1', 'y2', 'y3', 'y4', 'y5', 'y6']})

df2 = pd.DataFrame({'Date': ['2000-01-01',
  '2000-01-01',
  '2000-01-01',
  '2000-01-03',
  '2000-01-03'],
 'X_2': ['X1', 'X2', 'X3', 'X4', 'X5'],
 'Y_2': ['Y1', 'Y2', 'Y3', 'Y4', 'Y5']})


df1['count'] = df1.groupby('Date').cumcount()
df2['count'] = df2.groupby('Date').cumcount()
result = pd.merge(df1, df2, on=['Date', 'count'], how='outer')
mask = (result['count'] == 0) & pd.isnull(result).any(axis=1)
result = result.loc[~mask]
result = result.drop('count', axis=1)

gives

         Date    X    Y  count  X_2  Y_2
0  2000-01-01   x1   y1      0   X1   Y1
1  2000-01-01   x2   y2      1   X2   Y2
3  2000-01-03   x4   y4      0   X4   Y4
4  2000-01-03   x5   y5      1   X5   Y5
5  2000-01-03   x6   y6      2  NaN  NaN
6  2000-01-01  NaN  NaN      2   X3   Y3

Another way to restrict the merge to only those dates that are common to both df1

and df2

find the intersection of df1['Date']

and df2['Date']

, then apply pd.merge

to sub-DataFrames df1

and df2

that contain only those dates:

import numpy as np
dates = np.intersect1d(df1['Date'], df2['Date'])
mask1 = df1['Date'].isin(dates)
mask2 = df2['Date'].isin(dates)
result = pd.merge(df1.loc[mask1], df2.loc[mask2], on=['Date', 'count'], how='outer')

Pandas concatenate / concatenate 2 dataframes using date as index

More articles: