What does the pandas sub operator do?
This comes straight from the tutorial, and I can't figure it out even after reading the documentation.
In [14]: df = DataFrame({'one' : Series(randn(3), index=['a', 'b', 'c']),
....: 'two' : Series(randn(4), index=['a', 'b', 'c', 'd']),
....: 'three' : Series(randn(3), index=['b', 'c', 'd'])})
....:
In [15]: df
Out[15]:
one three two
a -0.626544 NaN -0.351587
b -0.138894 -0.177289 1.136249
c 0.011617 0.462215 -0.448789
d NaN 1.124472 -1.101558
In [16]: row = df.ix[1]
In [17]: column = df['two']
In [18]: df.sub(row, axis='columns')
Out[18]:
one three two
a -0.487650 NaN -1.487837
b 0.000000 0.000000 0.000000
c 0.150512 0.639504 -1.585038
d NaN 1.301762 -2.237808
Why does the second row turn into 0? Does sub mean "substitute with 0"?
Also, when I use row = df.ix[0], the entire second column turns into NaN. Why?
sub means subtract, so let's walk through it:
In [44]:
import numpy as np
import pandas as pd

# create some data
df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
df
Out[44]:
one three two
a -1.536737 NaN 1.537104
b 1.486947 -0.429089 -0.227643
c 0.219609 -0.178037 -1.118345
d NaN 1.254126 -0.380208
In [45]:
# take a copy of 2nd row
row = df.ix[1]
row
Out[45]:
one 1.486947
three -0.429089
two -0.227643
Name: b, dtype: float64
In [46]:
# now subtract the 2nd row row-wise
df.sub(row, axis='columns')
Out[46]:
one three two
a -3.023684 NaN 1.764747
b 0.000000 0.000000 0.000000
c -1.267338 0.251052 -0.890702
d NaN 1.683215 -0.152565
So what is probably confusing you is what happens when you specify 'columns' as the axis to operate on. We subtracted the values of the 2nd row from every row, which explains why the second row is now all 0. The data you passed is a Series, and its index is aligned with the DataFrame's columns, i.e. values are matched by column name, so the subtraction is carried out row by row.
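To make the alignment visible, here is a minimal sketch with small made-up values (the numbers and the variable names `by_method`, `by_operator`, `shuffled` are only illustrative, not from the tutorial). It uses `.iloc` because `.ix` has been removed from recent pandas releases. It also suggests that the order of the Series entries should not matter, since matching is done by label:

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so the arithmetic is easy to follow.
df = pd.DataFrame(
    {
        "one": [1.0, 2.0, 3.0, np.nan],
        "three": [np.nan, 10.0, 20.0, 30.0],
        "two": [5.0, 6.0, 7.0, 8.0],
    },
    index=["a", "b", "c", "d"],
)

row = df.iloc[1]  # second row, label "b"

# The Series index (one, three, two) is matched against the column labels,
# then the matching value is subtracted from every entry of that column.
by_method = df.sub(row, axis="columns")
by_operator = df - row  # the plain operator aligns the same way

# Reordering the Series does not change the result: alignment is by label.
shuffled = row[["two", "one", "three"]]
by_shuffled = df.sub(shuffled, axis="columns")

print(by_method)
```

Row "b" comes out as all zeros because it is being subtracted from itself, exactly as in the transcript above.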
In [47]:
# now take a copy of the first row
row = df.ix[0]
row
Out[47]:
one -1.536737
three NaN
two 1.537104
Name: a, dtype: float64
In [48]:
# perform the same op
df.sub(row, axis='columns')
Out[48]:
one three two
a 0.000000 NaN 0.000000
b 3.023684 NaN -1.764747
c 1.756346 NaN -2.655449
d NaN NaN -1.917312
So why do we now have a column where every value is NaN? This is because any arithmetic operation involving NaN produces NaN:
In [55]:
print(1 + np.nan)
print(1 * np.nan)
print(1 / np.nan)
print(1 - np.nan)
nan
nan
nan
nan
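The same propagation can be reproduced on a whole column. The sketch below again uses made-up values; the `fillna(0)` step is only one possible workaround and assumes that treating the missing entry as 0 makes sense for your data, which the original question does not say:

```python
import numpy as np
import pandas as pd

# Hypothetical values; the first row has NaN in column "three".
df = pd.DataFrame(
    {
        "one": [1.0, 2.0, 3.0, np.nan],
        "three": [np.nan, 10.0, 20.0, 30.0],
        "two": [5.0, 6.0, 7.0, 8.0],
    },
    index=["a", "b", "c", "d"],
)

first_row = df.iloc[0]

# NaN propagates: every entry of column "three" becomes NaN.
with_nan = df.sub(first_row, axis="columns")
print(with_nan["three"].isna().all())  # True

# Assumption for illustration: treat the missing value in the row as 0.
filled = df.sub(first_row.fillna(0), axis="columns")
print(filled["three"])
```

With the filled row, column "three" is left unchanged (its own NaN in row "a" is still NaN), while the other columns are shifted as before.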
What it does is subtract each value in the second row from all the values in its column. That is, it takes the value at position ("b", "one") and subtracts it from all the values in column one; it takes the value at position ("b", "two") and subtracts it from all the values in column two; and it takes the value at position ("b", "three") and subtracts it from all the values in column three. So, for example, the result at ("c", "one") is 0.011617 - (-0.138894) = 0.150512 (up to rounding of the displayed values). All of the values in row "b" are zero because that is the row you are subtracting, so in that row each value is subtracted from itself, giving zero.
Regarding the second part of your question: if you select the first row, it contains NaN in column three. So the subtraction subtracts NaN from all values in that column (the second one displayed), which turns them all into NaN (since anything minus NaN is NaN).
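The arithmetic can be checked with a short sketch. It rebuilds the frame from the rounded values printed in the question, so the last digit may differ slightly from the tutorial output; the names `result`, `manual` and `by_index` are just illustrative. The final lines show the `axis='index'` counterpart with the `column` variable from the question, as a contrast:

```python
import numpy as np
import pandas as pd

# Rounded values copied from the question's Out[15].
df = pd.DataFrame(
    {
        "one": [-0.626544, -0.138894, 0.011617, np.nan],
        "three": [np.nan, -0.177289, 0.462215, 1.124472],
        "two": [-0.351587, 1.136249, -0.448789, -1.101558],
    },
    index=["a", "b", "c", "d"],
)

row = df.iloc[1]     # row "b"
column = df["two"]

result = df.sub(row, axis="columns")

# Column "one" of the result is column "one" minus the single value at ("b", "one").
manual = df["one"] - df.loc["b", "one"]
print(result.loc["c", "one"], manual["c"])

# Contrast: axis="index" aligns the Series with the row labels instead,
# so column "two" is subtracted from every column and becomes all zeros.
by_index = df.sub(column, axis="index")
print(by_index)
```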