Pandas - concatenate two dataframes with different number of rows
I have the following two data frames:
DF:
            value
period
2000-01-01    100
2000-04-01    200
2000-07-01    300
2000-10-01    400
2001-01-01    500
df1:
            value
period
2000-07-01    350
2000-10-01    450
2001-01-01    550
2001-04-01    600
2001-07-01    700
This is the desired output:
DF:
            value
period
2000-01-01    100
2000-04-01    200
2000-07-01    350
2000-10-01    450
2001-01-01    550
2001-04-01    600
2001-07-01    700
I have called set_index(['period']) on both data frames. I've also tried several things, including concat and a where expression after creating a new column, but nothing works as expected. The first data frame is the primary one; the second holds the updates. The update should replace the corresponding values in the first data frame and, at the same time, add any new records.
How can I do this?
You can also use combine_first. If the dtype of either index is object, convert it with to_datetime first. This works well if df1.index is always in df.index:
print (df.index.dtype)
object
print (df1.index.dtype)
object
df.index = pd.to_datetime(df.index)
df1.index = pd.to_datetime(df1.index)
df = df1.combine_first(df)
# cast back to int if the column should stay integer
#df = df1.combine_first(df).astype(int)
print (df)
            value
period
2000-01-01  100.0
2000-04-01  200.0
2000-07-01  350.0
2000-10-01  450.0
2001-01-01  550.0
2001-04-01  600.0
2001-07-01  700.0
If not, filter df1 with intersection first, which keeps only the labels of df1 that also appear in df:
df = df1.loc[df1.index.intersection(df.index)].combine_first(df)
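As a rough illustration of how the two combine_first variants differ, here is a minimal self-contained sketch. The frames are rebuilt by hand from the sample data in the question, and the names combined and filtered are only illustrative. On this data the filtered variant leaves out the rows that exist only in df1, so it does not produce the desired output from the question.

```python
import pandas as pd

# Sample data from the question, with a datetime index named "period".
df = pd.DataFrame(
    {"value": [100, 200, 300, 400, 500]},
    index=pd.DatetimeIndex(
        ["2000-01-01", "2000-04-01", "2000-07-01", "2000-10-01", "2001-01-01"],
        name="period",
    ),
)
df1 = pd.DataFrame(
    {"value": [350, 450, 550, 600, 700]},
    index=pd.DatetimeIndex(
        ["2000-07-01", "2000-10-01", "2001-01-01", "2001-04-01", "2001-07-01"],
        name="period",
    ),
)

# Plain combine_first: df1 takes priority, gaps are filled from df.
# The result covers the union of both indexes (7 rows).
combined = df1.combine_first(df).astype(int)

# Filtered variant: df1 is first restricted to labels that also exist in df,
# so the rows found only in df1 (2001-04-01, 2001-07-01) are left out (5 rows).
filtered = df1.loc[df1.index.intersection(df.index)].combine_first(df).astype(int)

print(combined)
print(filtered)
```

The astype(int) call is only there because combine_first may return float columns, as in the output shown earlier.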
Another solution uses numpy.setdiff1d and concat:
import numpy as np
df = pd.concat([df.loc[np.setdiff1d(df.index, df1.index)], df1])
print (df)
            value
period
2000-01-01    100
2000-04-01    200
2000-07-01    350
2000-10-01    450
2001-01-01    550
2001-04-01    600
2001-07-01    700
Is this what you want?
In [151]: pd.concat([df1, df.loc[df.index.difference(df1.index)]]).sort_index()
Out[151]:
            value
period
2000-01-01    100
2000-04-01    200
2000-07-01    350
2000-10-01    450
2001-01-01    550
2001-04-01    600
2001-07-01    700
PS: make sure both indexes are of the same type. It is better to convert them to datetime dtype using pd.to_datetime().
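A minimal sketch of that conversion followed by the concat approach, assuming both frames start out with string (object) indexes as reported in the question. The sample data is rebuilt by hand and the name result is only illustrative.

```python
import pandas as pd

# Sample data from the question; the index labels start out as plain strings.
df = pd.DataFrame(
    {"value": [100, 200, 300, 400, 500]},
    index=pd.Index(
        ["2000-01-01", "2000-04-01", "2000-07-01", "2000-10-01", "2001-01-01"],
        name="period",
    ),
)
df1 = pd.DataFrame(
    {"value": [350, 450, 550, 600, 700]},
    index=pd.Index(
        ["2000-07-01", "2000-10-01", "2001-01-01", "2001-04-01", "2001-07-01"],
        name="period",
    ),
)

# Convert both indexes so that they share the same datetime type.
df.index = pd.to_datetime(df.index)
df1.index = pd.to_datetime(df1.index)

# Keep all of df1, add the rows of df whose labels are missing from df1,
# then restore chronological order.
result = pd.concat([df1, df.loc[df.index.difference(df1.index)]]).sort_index()
print(result)
```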
I used pd.concat() to concatenate the data frames, then dropped the duplicates to get the result. This assumes period is a regular column rather than the index.
df_con = pd.concat([df, df1])
df_con.drop_duplicates(subset="period",keep="last",inplace=True)
print(df_con)
       period  value
0  2000-01-01    100
1  2000-04-01    200
0  2000-07-01    350
1  2000-10-01    450
2  2001-01-01    550
3  2001-04-01    600
4  2001-07-01    700
To set "period" back as the index, call set_index:
print(df_con.set_index("period"))
            value
period
2000-01-01    100
2000-04-01    200
2000-07-01    350
2000-10-01    450
2001-01-01    550
2001-04-01    600
2001-07-01    700
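If period is already the index, as in the question, a similar idea can be sketched with Index.duplicated in place of drop_duplicates. This is a variation on the approach above, not the code from the answer; the sample frames are rebuilt by hand and the names stacked and deduped are only illustrative.

```python
import pandas as pd

# Sample data from the question, with "period" already set as the index.
df = pd.DataFrame(
    {"value": [100, 200, 300, 400, 500]},
    index=pd.DatetimeIndex(
        ["2000-01-01", "2000-04-01", "2000-07-01", "2000-10-01", "2001-01-01"],
        name="period",
    ),
)
df1 = pd.DataFrame(
    {"value": [350, 450, 550, 600, 700]},
    index=pd.DatetimeIndex(
        ["2000-07-01", "2000-10-01", "2001-01-01", "2001-04-01", "2001-07-01"],
        name="period",
    ),
)

# Stack the primary frame first and the updates second.
stacked = pd.concat([df, df1])

# For repeated index labels keep only the last occurrence (the row from df1),
# then sort by period.
deduped = stacked[~stacked.index.duplicated(keep="last")].sort_index()
print(deduped)
```

Putting df1 last in the concat call is what makes keep="last" prefer the updated values.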