Pandas logical ambiguity of DataFrame selection

EDIT: fixed values in tables.

Let's say I have a pandas dataframe df:

>>>df
          a         b         c
0  0.016367  0.289944 -0.891527
1  1.130206  0.899758 -0.276587
2  1.390528 -1.472802  0.128979
3  0.023598 -0.931329  0.158143
4  1.401183 -0.162357 -0.959156
5 -0.127765  1.142039 -0.734434

      

So now I'm trying to do boolean indexing:

>>>df[df > 0.5]
          a         b         c
0       NaN       NaN       NaN
1  1.130206  0.899758       NaN
2  1.390528       NaN       NaN
3       NaN       NaN       NaN
4  1.401183       NaN       NaN
5       NaN  1.142039       NaN

>>>df[df < 0]
          a         b         c
0       NaN       NaN -0.891527
1       NaN       NaN -0.276587
2       NaN -1.472802       NaN
3       NaN -0.931329       NaN
4       NaN -0.162357 -0.959156
5 -0.127765       NaN -0.734434

      

So now I try to use the logical OR of these two conditions as the indexing condition:

>>>df[df > 0.5 or df < 0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Ben\Anaconda\lib\site-packages\pandas\core\generic.py", line 692, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

      

I researched this a bit, and it mainly comes down to the numpy devs deciding that certain conditions might be ambiguous depending on the case. What I don't get is why checking whether the values are > 0.5 is valid and checking whether they are < 0 is valid, but checking whether they are > 0.5 or < 0 is INVALID. I've also tried mixing boolean syntaxes, but the error is inescapable. Can anyone explain why doing an OR creates an ambiguous case?

+2




3 answers


It is not possible for user-defined types to override the behavior of 'and' and 'or' in Python. That is, Numpy cannot say that it wants [0, 1, 1] and [1, 1, 0] to be [0, 1, 0]. This has to do with how the short-circuiting of 'and' works (see the documentation); in essence, the short-circuiting behavior of 'and' and 'or' means that these operations must work on the separate truth values of their two arguments; they cannot combine their two operands in a way that uses the data in both at once (for example, to compare the elements component-wise, as would be natural for Numpy).
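The point about truth values can be seen without numpy or pandas at all. The sketch below uses a hypothetical toy class, Mask, which is not part of any library: 'and' can only ask the left operand for a single truth value through __bool__, while & hands both operands to __and__.

```python
class Mask:
    """Hypothetical toy container standing in for an array-like type."""

    def __init__(self, values):
        self.values = list(values)

    def __bool__(self):
        # `and` / `or` can only ask for this one truth value.
        raise ValueError("The truth value of a Mask is ambiguous.")

    def __and__(self, other):
        # `&` receives both operands, so it can combine them element by element.
        return Mask(a & b for a, b in zip(self.values, other.values))


left = Mask([0, 1, 1])
right = Mask([1, 1, 0])

print((left & right).values)  # [0, 1, 0]

try:
    left and right  # Python calls bool(left) before it ever looks at `right`
except ValueError as exc:
    print("and failed:", exc)
```

Raising in __bool__ mirrors what the traceback in the question shows pandas doing; there is no hook that would let Mask see both operands of 'and'.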



The solution is to use the bitwise operators & and |. However, you need to be careful with them, as their precedence is not what you might expect: they bind more tightly than comparisons such as > and <.
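To make the precedence caveat concrete, here is a small sketch that only inspects how Python parses the expressions, so it needs no DataFrame. It uses the standard ast module; ast.unparse is assumed to be available, which requires Python 3.9 or later.

```python
import ast

# The unparenthesized expression one might write by analogy with `or`.
expr = ast.parse("df > 0.5 | df < 0", mode="eval").body

# It parses as one chained comparison, not as two comparisons joined by `|`.
print(type(expr).__name__)               # Compare
print(ast.unparse(expr.comparators[0]))  # 0.5 | df

# With parentheses, `|` becomes the top-level operation.
fixed = ast.parse("(df > 0.5) | (df < 0)", mode="eval").body
print(type(fixed).__name__)              # BinOp
```

In other words, without parentheses Python reads the expression as df > (0.5 | df) < 0, which is not the intended condition.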

+3




You need to use the bitwise or operator and put the conditions in parentheses:

df[(df > 0.5) | (df < 0)]

      

The reason is that comparing arrays is ambiguous: some of the values in the array may satisfy the condition while others do not, so there is no single truth value to return.

If you called any on the comparison result, it would evaluate to True wherever at least one value satisfies the condition.
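As a rough illustration of the reductions the error message suggests, the sketch below builds a small made-up frame (not the data from the question) and shows that a boolean DataFrame has no single truth value until any or all reduces it explicitly.

```python
import pandas as pd

# Hypothetical small frame, chosen only so the results are easy to check.
df = pd.DataFrame({"a": [0.2, 1.3], "b": [-0.7, 0.1]})

mask = (df > 0.5) | (df < 0)   # element-wise boolean DataFrame

try:
    bool(mask)                 # what `or` or `if` would ask for
except ValueError as exc:
    print(exc)

per_column = mask.any()        # Series: does any row in each column match?
anywhere = mask.any().any()    # single value: does any cell match?
everywhere = mask.all().all()  # single value: do all cells match?

print(per_column)
print(anywhere, everywhere)    # True False
```

Which reduction is right depends on the question being asked; for indexing, the unreduced mask is what df[...] expects.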



The parentheses are required due to operator precedence.

Example:

In [23]:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 5))
df
Out[23]:
          0         1         2         3         4
0  0.320165  0.123677 -0.202609  1.225668  0.327576
1 -0.620356  0.126270  1.191855  0.903879  0.214802
2 -0.974635  1.712151  1.178358  0.224962 -0.921045
3 -1.337430 -1.225469  1.150564 -1.618739 -1.297221
4 -0.093164 -0.928846  1.035407  1.766096  1.456888
In [24]:

df[(df > 0.5) | (df < 0)]
Out[24]:
          0         1         2         3         4
0       NaN       NaN -0.202609  1.225668       NaN
1 -0.620356       NaN  1.191855  0.903879       NaN
2 -0.974635  1.712151  1.178358       NaN -0.921045
3 -1.337430 -1.225469  1.150564 -1.618739 -1.297221
4 -0.093164 -0.928846  1.035407  1.766096  1.456888

      

+1




Since the boolean operators cannot be overridden in Python, numpy and pandas override the bitwise operators instead.

This means you need to use the bitwise or operator:

df[(df > 0.5) | (df < 0)]

      

+1








