Pandas - concatenate two dataframes with different number of rows

I have the following two data frames:

DF:

              value
period
2000-01-01    100
2000-04-01    200
2000-07-01    300
2000-10-01    400
2001-01-01    500

      

df1:

              value
period
2000-07-01    350
2000-10-01    450
2001-01-01    550
2001-04-01    600
2001-07-01    700

      

This is the desired output:

DF:

              value
period
2000-01-01    100
2000-04-01    200
2000-07-01    350
2000-10-01    450
2001-01-01    550
2001-04-01    600
2001-07-01    700

      

I have called set_index(['period']) on both data frames. I've also tried several things, including concat and a where expression after creating a new column, but nothing works as expected. The first data frame is the primary one; the second contains updates. It should replace the corresponding values in the first one and at the same time add new records where available.

How can I do this?



4 answers


You can also use combine_first. If the dtype of either index is object, convert it with to_datetime first. This works well if df1.index is always in df.index:

print(df.index.dtype)
object

print(df1.index.dtype)
object

df.index = pd.to_datetime(df.index)
df1.index = pd.to_datetime(df1.index)

df = df1.combine_first(df)
# cast back to int if integer columns are needed
# df = df1.combine_first(df).astype(int)
print(df)
            value
period           
2000-01-01  100.0
2000-04-01  200.0
2000-07-01  350.0
2000-10-01  450.0
2001-01-01  550.0
2001-04-01  600.0
2001-07-01  700.0
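For anyone who wants to reproduce this, here is a minimal self-contained sketch that rebuilds the two frames from the question and applies combine_first. The values come from the question; the name `result` is my own and only used for illustration.

```python
import pandas as pd

# Frames from the question, with "period" as a DatetimeIndex.
df = pd.DataFrame(
    {"value": [100, 200, 300, 400, 500]},
    index=pd.Index(
        pd.to_datetime(
            ["2000-01-01", "2000-04-01", "2000-07-01", "2000-10-01", "2001-01-01"]
        ),
        name="period",
    ),
)
df1 = pd.DataFrame(
    {"value": [350, 450, 550, 600, 700]},
    index=pd.Index(
        pd.to_datetime(
            ["2000-07-01", "2000-10-01", "2001-01-01", "2001-04-01", "2001-07-01"]
        ),
        name="period",
    ),
)

# df1 takes priority; rows that exist only in df fill the gaps.
result = df1.combine_first(df)
print(result)
```

Depending on the pandas version, the value column may come back as float, which is why the answer mentions astype(int).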

      

If it is not, filter df1 by the index intersection first. Note that this keeps only the df1 rows whose labels already exist in df:

df = df1.loc[df1.index.intersection(df.index)].combine_first(df)
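To see what the intersection filter does to the frames from the question, here is a small sketch. The names `common` and `updated_only` are hypothetical; the data is taken from the question.

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [100, 200, 300, 400, 500]},
    index=pd.to_datetime(
        ["2000-01-01", "2000-04-01", "2000-07-01", "2000-10-01", "2001-01-01"]
    ),
)
df1 = pd.DataFrame(
    {"value": [350, 450, 550, 600, 700]},
    index=pd.to_datetime(
        ["2000-07-01", "2000-10-01", "2001-01-01", "2001-04-01", "2001-07-01"]
    ),
)

# Labels present in both frames.
common = df1.index.intersection(df.index)

# Only the shared labels are taken from df1, so existing rows are
# updated but the two rows that exist only in df1 are not added.
updated_only = df1.loc[common].combine_first(df)
print(updated_only)
```

With the question's data this gives five rows rather than seven, so the unfiltered combine_first is the one that matches the desired output there.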

      




Another solution with numpy.setdiff1d and concat:

df = pd.concat([df.loc[np.setdiff1d(df.index, df1.index)], df1])
print(df)
            value
period           
2000-01-01    100
2000-04-01    200
2000-07-01    350
2000-10-01    450
2001-01-01    550
2001-04-01    600
2001-07-01    700

      



Is this what you want?

In [151]: pd.concat([df1, df.loc[df.index.difference(df1.index)]]).sort_index()
Out[151]:
            value
period
2000-01-01    100
2000-04-01    200
2000-07-01    350
2000-10-01    450
2001-01-01    550
2001-04-01    600
2001-07-01    700

      



PS: make sure both indexes are of the same type. It is better to convert them to datetime dtype using pd.to_datetime().
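A rough sketch of that check, assuming one index holds date strings and the other holds timestamps. The tiny frames and the name `merged` are made up for illustration and are not the question's data.

```python
import pandas as pd

# Hypothetical frames: df has string labels, df1 has Timestamp labels.
df = pd.DataFrame(
    {"value": [100, 200]},
    index=pd.Index(["2000-01-01", "2000-04-01"], name="period"),
)
df1 = pd.DataFrame(
    {"value": [250, 300]},
    index=pd.Index(pd.to_datetime(["2000-04-01", "2000-07-01"]), name="period"),
)

# Bring both indexes to datetime dtype before comparing labels.
if df.index.dtype != df1.index.dtype:
    df.index = pd.to_datetime(df.index)
    df1.index = pd.to_datetime(df1.index)

# Same approach as above: df1 plus the df rows that df1 does not cover.
merged = pd.concat([df1, df.loc[df.index.difference(df1.index)]]).sort_index()
print(merged)
```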



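Another option with append and index.duplicated (DataFrame.append was removed in pandas 2.0; use pd.concat([df1, df]) there):

d1 = df1.append(df)  # pandas >= 2.0: d1 = pd.concat([df1, df])
d1[~d1.index.duplicated()]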

            value
period           
2000-07-01    350
2000-10-01    450
2001-01-01    550
2001-04-01    600
2001-07-01    700
2000-01-01    100
2000-04-01    200
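Since append is gone in recent pandas, here is a sketch of the same idea with pd.concat, plus a sort_index so the rows come out in date order. The names `stacked` and `result` are my own; the data is from the question.

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [100, 200, 300, 400, 500]},
    index=pd.Index(
        pd.to_datetime(
            ["2000-01-01", "2000-04-01", "2000-07-01", "2000-10-01", "2001-01-01"]
        ),
        name="period",
    ),
)
df1 = pd.DataFrame(
    {"value": [350, 450, 550, 600, 700]},
    index=pd.Index(
        pd.to_datetime(
            ["2000-07-01", "2000-10-01", "2001-01-01", "2001-04-01", "2001-07-01"]
        ),
        name="period",
    ),
)

# df1 goes first, so for shared labels the df copy is the one
# flagged as a duplicate and dropped.
stacked = pd.concat([df1, df])
result = stacked[~stacked.index.duplicated(keep="first")].sort_index()
print(result)
```

Because nothing is aligned or filled here, the value column keeps its integer dtype.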

      



I used the pd.concat() function to concatenate the dataframes, then dropped the duplicates to get the result. This assumes "period" is still a regular column rather than the index.

df_con = pd.concat([df, df1])
df_con.drop_duplicates(subset="period", keep="last", inplace=True)
print(df_con)

       period  value
0  2000-01-01    100
1  2000-04-01    200
0  2000-07-01    350
1  2000-10-01    450
2  2001-01-01    550
3  2001-04-01    600
4  2001-07-01    700

      

To make "period" the index again, call set_index:

print(df_con.set_index("period"))

            value
period           
2000-01-01    100
2000-04-01    200
2000-07-01    350
2000-10-01    450
2001-01-01    550
2001-04-01    600
2001-07-01    700

      
