Pandas logical ambiguity of DataFrame selection
EDIT: fixed values in tables.
Let's say I have a pandas dataframe df:
>>>df
a b c
0 0.016367 0.289944 -0.891527
1 1.130206 0.899758 -0.276587
2 1.390528 -1.472802 0.128979
3 0.023598 -0.931329 0.158143
4 1.401183 -0.162357 -0.959156
5 -0.127765 1.142039 -0.734434
So now I'm trying to do boolean indexing:
>>>df[df > 0.5]
a b c
0 NaN NaN NaN
1 1.130206 0.899758 NaN
2 1.390528 NaN NaN
3 NaN NaN NaN
4 1.401183 NaN NaN
5 NaN 1.142039 NaN
>>>df[df < 0]
a b c
0 NaN NaN -0.891527
1 NaN NaN -0.276587
2 NaN -1.472802 NaN
3 NaN -0.931329 NaN
4 NaN -0.162357 -0.959156
5 -0.127765 NaN -0.734434
So now I'm trying to use the logical OR of these two conditions as an indexing condition:
>>>df[df > 0.5 or df < 0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Ben\Anaconda\lib\site-packages\pandas\core\generic.py", line 692, in __nonzero__
.format(self.__class__.__name__))
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I researched this a bit; the gist is that the NumPy devs decided that the truth value of some conditions might be ambiguous depending on the case. What I don't get is why checking if the value is > 0.5 is valid and checking if it's < 0 is valid, but checking if it's > 0.5 or < 0 is INVALID. I've also tried mixing the boolean syntax around, but this error is inescapable. Can anyone explain why doing OR creates an ambiguous case?
It is not possible for user-defined types to redefine the behavior of and and or in Python. That is, NumPy cannot make [0, 1, 1] and [1, 1, 0] evaluate to [0, 1, 0]. This has to do with how these operations short-circuit (see documentation); in essence, the short-circuiting behavior of and and or means that they must work by taking the truth value of each argument separately. They cannot combine their two operands in a way that uses the data in both operands at once (for example, to compare them element by element, as would be natural for NumPy).
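To make the point above concrete, here is a minimal sketch in plain Python. The Vec class is a hypothetical toy container, not anything from NumPy or pandas; it only illustrates that & can be overloaded to see both operands, while and asks the left operand alone for its truth value.

```python
class Vec:
    """Toy element-wise container (hypothetical, for illustration only)."""

    def __init__(self, values):
        self.values = list(values)

    def __and__(self, other):
        # `&` dispatches here, so both operands are available at once.
        return Vec(bool(x) and bool(y) for x, y in zip(self.values, other.values))

    def __bool__(self):
        # `and` / `or` call this on the left operand by itself. Like NumPy
        # and pandas, this toy class refuses to guess between "any" and "all".
        raise ValueError("The truth value of a Vec is ambiguous.")


left = Vec([0, 1, 1])
right = Vec([1, 1, 0])

# Element-wise AND works because __and__ sees both operands.
elementwise = (left & right).values

# `left and right` never reaches __and__; it calls bool(left) and fails.
try:
    left and right
    and_error = None
except ValueError as exc:
    and_error = str(exc)

print(elementwise)
print(and_error)
```

There is no hook comparable to __and__ for the and keyword itself, which is why array libraries raise an error rather than pick an interpretation.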
The solution is to use the bitwise & and | operators. However, you need to be careful with them, as their precedence is not what you might expect: they bind more tightly than comparison operators such as > and <.
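As a rough illustration of the precedence issue, the sketch below uses the standard-library ast module to show how Python parses the unparenthesized expression, then repeats the effect with plain integers. The integer values are arbitrary and chosen only for the demonstration.

```python
import ast

# Parse (without evaluating) the expression written without parentheses.
expr = ast.parse("df > 0.5 | df < 0", mode="eval").body

# Python reads it as one chained comparison: df > (0.5 | df) < 0
is_chained_comparison = isinstance(expr, ast.Compare) and len(expr.ops) == 2
middle_is_bitor = (
    isinstance(expr.comparators[0], ast.BinOp)
    and isinstance(expr.comparators[0].op, ast.BitOr)
)

# The same effect with plain integers:
unparenthesized = 5 > 3 | 4 < 1    # parsed as 5 > (3 | 4) < 1
parenthesized = (5 > 3) | (4 < 1)  # True | False

print(is_chained_comparison, middle_is_bitor)
print(unparenthesized, parenthesized)
```

Wrapping each comparison in its own parentheses, as in (df > 0.5) | (df < 0), forces the comparisons to be evaluated first.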
You need to use the bitwise | operator and put the conditions in parentheses:
df[(df > 0.5) | (df < 0)]
The reason is that, when comparing arrays, it is ambiguous what the truth value should be when only some of the values in the array satisfy the condition. If you called the any method on the result, it would evaluate to True.
The parentheses are required due to operator precedence.
Example:
In [23]:
df = pd.DataFrame(np.random.randn(5, 5))  # assumes: import numpy as np; import pandas as pd
df
Out[23]:
0 1 2 3 4
0 0.320165 0.123677 -0.202609 1.225668 0.327576
1 -0.620356 0.126270 1.191855 0.903879 0.214802
2 -0.974635 1.712151 1.178358 0.224962 -0.921045
3 -1.337430 -1.225469 1.150564 -1.618739 -1.297221
4 -0.093164 -0.928846 1.035407 1.766096 1.456888
In [24]:
df[(df > 0.5) | (df < 0)]
Out[24]:
0 1 2 3 4
0 NaN NaN -0.202609 1.225668 NaN
1 -0.620356 NaN 1.191855 0.903879 NaN
2 -0.974635 1.712151 1.178358 NaN -0.921045
3 -1.337430 -1.225469 1.150564 -1.618739 -1.297221
4 -0.093164 -0.928846 1.035407 1.766096 1.456888
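The remark about any can be sketched as follows, using the values from the question's df. This is only an illustration of reducing a boolean mask to a single truth value explicitly; the variable names are made up for the example.

```python
import pandas as pd

# Data copied from the question's df.
df = pd.DataFrame({
    "a": [0.016367, 1.130206, 1.390528, 0.023598, 1.401183, -0.127765],
    "b": [0.289944, 0.899758, -1.472802, -0.931329, -0.162357, 1.142039],
    "c": [-0.891527, -0.276587, 0.128979, 0.158143, -0.959156, -0.734434],
})

# Element-wise OR of the two conditions, each in its own parentheses.
mask = (df > 0.5) | (df < 0)

# Explicit reductions say which question is being asked.
any_match = bool(mask.any().any())       # does at least one cell match?
all_match = bool(mask.all().all())       # does every cell match?
match_count = int(mask.to_numpy().sum()) # how many cells match?

# Asking for the truth value of the whole mask is what raises the error.
try:
    bool(mask)
    truth_error = None
except ValueError as exc:
    truth_error = str(exc)

print(any_match, all_match, match_count)
print(truth_error)
```

Since some cells match and others do not, any and all give different answers, which is exactly the ambiguity the error message refers to.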