Pandas concatenate / concatenate 2 dataframes using date as index
I have 2 large data frames with date as index. To just give an example, let's say they look like this (the amount of data on a particular date in the first dataframe is not the same as the second):
df1:
Date X Y
2000-01-01 x1 y1
2000-01-01 x2 y2
2000-01-02 x3 y3
2000-01-03 x4 y4
2000-01-03 x5 y5
2000-01-03 x6 y6
df2:
Date X_2 Y_2
2000-01-01 X1 Y1
2000-01-01 X2 Y2
2000-01-01 X3 Y3
2000-01-03 X4 Y4
2000-01-03 X5 Y5
The result should look like this (I only want to combine the data with the dates that appear in both data files):
Date X Y X_2 Y_2
2000-01-01 x1 y1 X1 Y1
2000-01-01 x2 y2 X2 Y2
2000-01-01 NaN NaN X3 Y3
2000-01-03 x4 y4 X4 Y4
2000-01-03 x5 y5 X5 Y5
2000-01-03 x6 y6 NaN NaN
I have tried different combinations of code and I keep getting duplicated data like this:
Date X Y X_2 Y_2
2000-01-01 x1 y1 X1 Y1
2000-01-01 x1 y1 X2 Y2
2000-01-01 x1 y1 X3 Y3
2000-01-01 x2 y2 X1 Y1
2000-01-01 x2 y2 X2 Y2
2000-01-01 x2 y2 X3 Y3
I've tried for example. result = pd.merge(df1,df2, how='inner', on='Date')
What should I do to get the result I want?
source to share
Use cumcount
for the number of items in each group, grouped by Date
:
In [107]: df1['count'] = df1.groupby('Date').cumcount()
In [108]: df1
Out[108]:
Date X Y count
0 2000-01-01 x1 y1 0
1 2000-01-01 x2 y2 1
2 2000-01-02 x3 y3 0
3 2000-01-03 x4 y4 0
4 2000-01-03 x5 y5 1
5 2000-01-03 x6 y6 2
In [109]: df2['count'] = df2.groupby('Date').cumcount()
In [110]: df2
Out[110]:
Date X_2 Y_2 count
0 2000-01-01 X1 Y1 0
1 2000-01-01 X2 Y2 1
2 2000-01-01 X3 Y3 2
3 2000-01-03 X4 Y4 0
4 2000-01-03 X5 Y5 1
By adding a column count
, you can now combine both with Date
and with count
, which brings you closer to your desired result:
In [111]: pd.merge(df1, df2, on=['Date', 'count'], how='outer')
Out[111]:
Date X Y count X_2 Y_2
0 2000-01-01 x1 y1 0 X1 Y1
1 2000-01-01 x2 y2 1 X2 Y2
2 2000-01-02 x3 y3 0 NaN NaN
3 2000-01-03 x4 y4 0 X4 Y4
4 2000-01-03 x5 y5 1 X5 Y5
5 2000-01-03 x6 y6 2 NaN NaN
6 2000-01-01 NaN NaN 2 X3 Y3
The lines you want to delete can be characterized as those where count is 0 and X or X_2 is NaN. Therefore, you can remove these lines with a boolean mask like this:
mask = (result['count'] == 0) & pd.isnull(result).any(axis=1)
result = result.loc[~mask]
import pandas as pd
df1 = pd.DataFrame({'Date': ['2000-01-01',
'2000-01-01',
'2000-01-02',
'2000-01-03',
'2000-01-03',
'2000-01-03'],
'X': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6'],
'Y': ['y1', 'y2', 'y3', 'y4', 'y5', 'y6']})
df2 = pd.DataFrame({'Date': ['2000-01-01',
'2000-01-01',
'2000-01-01',
'2000-01-03',
'2000-01-03'],
'X_2': ['X1', 'X2', 'X3', 'X4', 'X5'],
'Y_2': ['Y1', 'Y2', 'Y3', 'Y4', 'Y5']})
df1['count'] = df1.groupby('Date').cumcount()
df2['count'] = df2.groupby('Date').cumcount()
result = pd.merge(df1, df2, on=['Date', 'count'], how='outer')
mask = (result['count'] == 0) & pd.isnull(result).any(axis=1)
result = result.loc[~mask]
result = result.drop('count', axis=1)
gives
Date X Y count X_2 Y_2
0 2000-01-01 x1 y1 0 X1 Y1
1 2000-01-01 x2 y2 1 X2 Y2
3 2000-01-03 x4 y4 0 X4 Y4
4 2000-01-03 x5 y5 1 X5 Y5
5 2000-01-03 x6 y6 2 NaN NaN
6 2000-01-01 NaN NaN 2 X3 Y3
Another way to restrict the merge to only those dates that are common to both
df1
and df2
find the intersection of df1['Date']
and
df2['Date']
, then apply pd.merge
to sub-DataFrames df1
and df2
that contain only those dates:
import numpy as np
dates = np.intersect1d(df1['Date'], df2['Date'])
mask1 = df1['Date'].isin(dates)
mask2 = df2['Date'].isin(dates)
result = pd.merge(df1.loc[mask1], df2.loc[mask2], on=['Date', 'count'], how='outer')
source to share