Subtracting data with an unequal number of rows

I have two dataframes like this:

import pandas as pd
import numpy as np

np.random.seed(0)

df1 = pd.DataFrame(np.random.randint(10, size=(5, 4)), index=list('ABCDE'), columns=list('abcd'))
df2 = pd.DataFrame(np.random.randint(10, size=(2, 4)), index=list('CE'), columns=list('abcd'))

   a  b  c  d
A  5  0  3  3
B  7  9  3  5
C  2  4  7  6
D  8  8  1  6
E  7  7  8  1

   a  b  c  d
C  5  9  8  9
E  4  3  0  3

      

The index of df2 is always a subset of the index of df1, and the column names are identical.

I want to create a third dataframe df3 = df1 - df2. Doing this directly gives:

     a    b    c    d
A  NaN  NaN  NaN  NaN
B  NaN  NaN  NaN  NaN
C -3.0 -5.0 -1.0 -3.0
D  NaN  NaN  NaN  NaN
E  3.0  4.0  8.0 -2.0

      

I do not want NaNs in the output; I want the corresponding values from df1 instead. Is there a sensible way to use e.g. fillna with the values of df1 for the rows that are not contained in df2?

A workaround would be to subtract only the required rows, for example:

sub_ind = df2.index
df3 = df1.copy()
df3.loc[sub_ind, :] = df1.loc[sub_ind, :] - df2.loc[sub_ind, :]

      

which gives me the desired output

   a  b  c  d
A  5  0  3  3
B  7  9  3  5
C -3 -5 -1 -3
D  8  8  1  6
E  3  4  8 -2

      

but maybe there is an easier way to achieve this?



3 answers


If you use the sub method instead of the - operator, you can pass a fill value:



df1.sub(df2, fill_value=0)
Out: 
     a    b    c    d
A  5.0  0.0  3.0  3.0
B  7.0  9.0  3.0  5.0
C -3.0 -5.0 -1.0 -3.0
D  8.0  8.0  1.0  6.0
E  3.0  4.0  8.0 -2.0
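
As the output above shows, the result comes back as floats. A minimal sketch of casting it back to integers, assuming no NaN is left after the fill (the name df3_int is only illustrative, and the frames are rebuilt from the literal values in the question so the snippet runs on its own):

```python
import pandas as pd

# Same values as in the question, written out so no random seed is needed.
df1 = pd.DataFrame(
    [[5, 0, 3, 3], [7, 9, 3, 5], [2, 4, 7, 6], [8, 8, 1, 6], [7, 7, 8, 1]],
    index=list("ABCDE"), columns=list("abcd"),
)
df2 = pd.DataFrame(
    [[5, 9, 8, 9], [4, 3, 0, 3]],
    index=list("CE"), columns=list("abcd"),
)

# Rows missing from df2 are treated as 0 before the subtraction.
df3 = df1.sub(df2, fill_value=0)

# Cast back to integers; this is only safe because no NaN remains here.
df3_int = df3.astype(int)
print(df3_int)
```

Note that fill_value is applied to whichever side is missing a value. If a cell is missing in both frames, the result for that cell is still NaN and the integer cast would fail.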

      



I think this is what you want:

(df1-df2).fillna(df1)

Out[40]: 
     a    b    c    d
A  5.0  0.0  3.0  3.0
B  7.0  9.0  3.0  5.0
C -3.0 -5.0 -1.0 -3.0
D  8.0  8.0  1.0  6.0
E  3.0  4.0  8.0 -2.0

      



Subtract the dataframes as usual, wrap the result in parentheses, and call pandas.DataFrame.fillna on it. Or, spelled out in two steps:

diff = df1-df2
diff.fillna(df1, inplace=True)
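
A quick way to convince yourself that this matches the workaround from the question is to compute both and compare them. This is only a sketch: the frames are rebuilt from the literal values in the question, and the names expected and same are illustrative.

```python
import pandas as pd

# Same values as in the question.
df1 = pd.DataFrame(
    [[5, 0, 3, 3], [7, 9, 3, 5], [2, 4, 7, 6], [8, 8, 1, 6], [7, 7, 8, 1]],
    index=list("ABCDE"), columns=list("abcd"),
)
df2 = pd.DataFrame(
    [[5, 9, 8, 9], [4, 3, 0, 3]],
    index=list("CE"), columns=list("abcd"),
)

# Subtract with alignment, then fill the NaN rows from df1.
diff = (df1 - df2).fillna(df1)

# The workaround from the question: subtract only the shared rows.
expected = df1.copy()
expected.loc[df2.index] = df1.loc[df2.index] - df2.loc[df2.index]

# Compare element-wise (diff is float, expected is int, values should agree).
same = bool((diff == expected).all().all())
print(same)
```

One thing to keep in mind: fillna(df1) replaces every NaN in the difference, including any that came from a NaN already present in df2, so it can hide missing data in the inputs.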

      



Here is an option using reindex and its fill_value parameter. The main differences between this answer and @ayhan's answer:

  • You can control the fill value on only one of the dataframes, or on both
  • This can be generalized by calling reindex with a custom index that joins the indexes of df1 and df2
  • You have better control over dtype preservation (the result stays int)


df1 - df2.reindex(df1.index, fill_value=0)

   a  b  c  d
A  5  0  3  3
B  7  9  3  5
C -3 -5 -1 -3
D  8  8  1  6
E  3  4  8 -2
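
The generalization mentioned in the list above could look roughly like this. It is a sketch under an assumption that goes beyond the question: df2_extra is a hypothetical variant of df2 with an extra row F that does not exist in df1, so both frames need to be reindexed over the union of the two indexes.

```python
import pandas as pd

# df1 as in the question.
df1 = pd.DataFrame(
    [[5, 0, 3, 3], [7, 9, 3, 5], [2, 4, 7, 6], [8, 8, 1, 6], [7, 7, 8, 1]],
    index=list("ABCDE"), columns=list("abcd"),
)
# Hypothetical: df2 plus a row "F" that is absent from df1.
df2_extra = pd.DataFrame(
    [[5, 9, 8, 9], [4, 3, 0, 3], [1, 1, 1, 1]],
    index=list("CEF"), columns=list("abcd"),
)

# Custom index joining both frames.
full_index = df1.index.union(df2_extra.index)

# Each side gets its own fill value; 0 is used for both here.
left = df1.reindex(full_index, fill_value=0)
right = df2_extra.reindex(full_index, fill_value=0)

# No NaN is ever introduced, so the integer dtype is kept.
df3 = left - right
print(df3)
```

Because each reindex call takes its own fill_value, the two sides can be padded differently, which a single fill_value on sub does not allow.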

      
