What does the pandas sub operator do?

This comes straight from the tutorial, and I can't figure it out even after reading the documentation.

In [14]: df = DataFrame({'one' : Series(randn(3), index=['a', 'b', 'c']),
   ....:                 'two' : Series(randn(4), index=['a', 'b', 'c', 'd']),
   ....:                 'three' : Series(randn(3), index=['b', 'c', 'd'])})
   ....: 

In [15]: df
Out[15]: 
        one     three       two
a -0.626544       NaN -0.351587
b -0.138894 -0.177289  1.136249
c  0.011617  0.462215 -0.448789
d       NaN  1.124472 -1.101558

In [16]: row = df.ix[1]

In [17]: column = df['two']

In [18]: df.sub(row, axis='columns')
Out[18]: 
        one     three       two
a -0.487650       NaN -1.487837
b  0.000000  0.000000  0.000000
c  0.150512  0.639504 -1.585038
d       NaN  1.301762 -2.237808

      

Why does the second row turn to 0? Is it sub-stituted with 0?

Also, when I use row = df.ix[0], the entire second column turns into NaN. Why?

2 answers


sub means subtract, so let's walk through this:

In [44]:
import numpy as np
import pandas as pd
# create some data
df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
                    'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
                    'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
df
Out[44]:
        one     three       two
a -1.536737       NaN  1.537104
b  1.486947 -0.429089 -0.227643
c  0.219609 -0.178037 -1.118345
d       NaN  1.254126 -0.380208
In [45]:
# take a copy of the 2nd row (.ix was removed in pandas 1.0; .iloc selects by position)
row = df.iloc[1]
row
Out[45]:
one      1.486947
three   -0.429089
two     -0.227643
Name: b, dtype: float64
In [46]:
# now subtract the 2nd row from every row
df.sub(row, axis='columns')
Out[46]:
        one     three       two
a -3.023684       NaN  1.764747
b  0.000000  0.000000  0.000000
c -1.267338  0.251052 -0.890702
d       NaN  1.683215 -0.152565

      

What is probably confusing you is what happens when you specify "columns" as the axis. The data you passed is a Series whose index holds the column names (one, three, two). With axis='columns', pandas aligns that index with the DataFrame's column names and then performs the subtraction row by row. In other words, we subtracted the 2nd row from every row, which explains why the second row is now all 0: it is the 2nd row minus itself.
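To make the label alignment concrete, here is a minimal sketch using a small frame with made-up values (the numbers are hypothetical, chosen only so the arithmetic is easy to check by eye). It suggests that the order of the entries in the Series does not matter, only their labels, and that the plain `-` operator aligns the same way.

```python
import pandas as pd

# Small deterministic frame; the values are hypothetical, picked for easy arithmetic.
df = pd.DataFrame(
    {
        "one": [1.0, 2.0, 4.0],
        "three": [10.0, 20.0, 40.0],
        "two": [100.0, 200.0, 400.0],
    },
    index=["a", "b", "c"],
)

row = df.iloc[1]  # second row: a Series indexed by 'one', 'three', 'two'

# axis='columns' matches the Series index against the column labels,
# then subtracts the Series from every row of the frame.
result = df.sub(row, axis="columns")
print(result)

# Alignment is by label, not by position: reordering the Series changes nothing.
shuffled = row[["two", "one", "three"]]
same = df.sub(shuffled, axis="columns")
print(result.equals(same))

# The plain operator aligns a Series against the columns by default as well.
print(result.equals(df - row))
```

Row b comes out as all zeros here for the same reason as in the session above.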



In [47]:
# now take a copy of the first row (.iloc replaces the removed .ix)
row = df.iloc[0]
row
Out[47]:
one     -1.536737
three         NaN
two      1.537104
Name: a, dtype: float64
In [48]:
# perform the same op
df.sub(row, axis='columns')
Out[48]:
        one  three       two
a  0.000000    NaN  0.000000
b  3.023684    NaN -1.764747
c  1.756346    NaN -2.655449
d       NaN    NaN -1.917312

      

So why do we now have a column where every value is NaN? Because the first row holds NaN under three, and any arithmetic operation involving NaN returns NaN:

In [55]:

print(1 + np.nan)
print(1 * np.nan)
print(1 / np.nan)
print(1 - np.nan)
nan
nan
nan
nan
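If the all-NaN column is not what you want, one possible workaround is to fill the missing entry in the row before subtracting. The sketch below uses a tiny made-up frame and assumes that treating the missing value as 0 makes sense for your data, which may not be the case.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame with a NaN in the first row, as in the question.
df = pd.DataFrame(
    {"one": [1.0, 2.0], "three": [np.nan, 5.0], "two": [3.0, 7.0]},
    index=["a", "b"],
)

row = df.iloc[0]  # 'three' is NaN in this row

# The NaN in the subtracted row propagates down the whole 'three' column.
with_nan = df.sub(row, axis="columns")
print(with_nan)

# Assumption: a missing value in the row can be treated as 0.
# Cells that were already NaN in df stay NaN.
filled = df.sub(row.fillna(0), axis="columns")
print(filled)
```

Filling only the row leaves the frame itself untouched, so the original missing cell at (a, three) is still NaN in the result.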

      

What it does is subtract each value in the second row from all the values in its column. That is, it takes the value at position ("b", "one") and subtracts it from all the values in column one; it takes the value at position ("b", "two") and subtracts it from all the values in column two; and it takes the value at position ("b", "three") and subtracts it from all the values in column three. So, for example, the result at ("c", "one") is 0.011617 - (-0.138894) ≈ 0.150512 (the displayed values are rounded). All of the values in row "b" are zero because this is the row you are subtracting, so on that row you subtract it from itself, giving zero.
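The column-by-column description can be checked with a short sketch built from the numbers shown in the question's Out[15]. Those numbers are already rounded to six decimals, so the recomputed results may differ from the question's Out[18] in the last digit.

```python
import numpy as np
import pandas as pd

# Values copied from the question's Out[15] (rounded as displayed there).
df = pd.DataFrame(
    {
        "one": [-0.626544, -0.138894, 0.011617, np.nan],
        "three": [np.nan, -0.177289, 0.462215, 1.124472],
        "two": [-0.351587, 1.136249, -0.448789, -1.101558],
    },
    index=["a", "b", "c", "d"],
)

# Subtract row 'b' from every row, aligning on the column labels.
result = df.sub(df.loc["b"], axis="columns")

# Rebuild the same thing one column at a time: each column minus its 'b' entry.
manual = pd.DataFrame({col: df[col] - df.loc["b", col] for col in df.columns})

print(result.round(6))
print(np.allclose(result, manual, equal_nan=True))
```

Selecting the row with df.loc["b"] instead of a position is a choice made for this sketch; it picks the same row as position 1 in this frame.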



Regarding the second part of your question: the first row contains a NaN in its second column. So the subtraction takes NaN away from every value in that column, which turns them all into NaN (since anything minus NaN is NaN).
