Why does the order of comparison matter for this apply / lambda inequality?
Sorry, this is not a great title. A simple example:
(pandas version 0.16.1)
import pandas as pd
df = pd.DataFrame({'x': range(1, 5), 'y': [1, 1, 1, 9]})
Works great:
df.apply( lambda x: x > x.mean() )
       x      y
0  False  False
1  False  False
2   True  False
3   True   True
Should this work the same?
df.apply( lambda x: x.mean() < x )
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-467-6f32d50055ea> in <module>()
----> 1 df.apply( lambda x: x.mean() < x )
C:\Users\ei\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\frame.pyc in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
3707 if reduce is None:
3708 reduce = True
-> 3709 return self._apply_standard(f, axis, reduce=reduce)
3710 else:
3711 return self._apply_broadcast(f, axis)
C:\Users\ei\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\frame.pyc in _apply_standard(self, func, axis, ignore_failures, reduce)
3797 try:
3798 for i, v in enumerate(series_gen):
-> 3799 results[i] = func(v)
3800 keys.append(v.name)
3801 except Exception as e:
<ipython-input-467-6f32d50055ea> in <lambda>(x)
----> 1 df.apply( lambda x: x.mean() < x )
C:\Users\ei\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\ops.pyc in wrapper(self, other, axis)
586 return NotImplemented
587 elif isinstance(other, (np.ndarray, pd.Index)):
--> 588 if len(self) != len(other):
589 raise ValueError('Lengths must match to compare')
590 return self._constructor(na_op(self.values, np.asarray(other)),
TypeError: ('len() of unsized object', u'occurred at index x')
As a counterexample, both of these work:
df.mean() < df
df > df.mean()
EDIT
Finally found the bug report for this: Issue 9369.
As stated in the issue:
left = 0 > s works (like a Python scalar). So I think this is being treated as a 0-dimensional array (it's an np.int64), and not as a scalar when called. I'll mark it as a bug. Feel free to dig in.
The problem occurs when a NumPy scalar type (such as np.int64 or np.float64) is on the left side of the comparison operator. A simple fix, as @santon pointed out in his answer, is to convert the number to a Python scalar instead of using the NumPy scalar.
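As a minimal sketch of that conversion (the variable names here are made up for illustration, and np.float64(2.5) merely stands in for the value that mean() would return on the x column), the NumPy scalar can be unwrapped with .item() or float() before it is placed on the left side of the comparison:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# Stand-in for the NumPy scalar that s.mean() would produce (assumption for illustration).
numpy_mean = np.float64(2.5)

# .item() unwraps a NumPy scalar into the matching built-in Python type.
python_mean = numpy_mean.item()

print(type(numpy_mean).__name__)   # float64
print(type(python_mean).__name__)  # float

# With a plain Python float on the left, the comparison is handled by the Series.
result = (python_mean < s).tolist()
print(result)                      # [False, False, True, True]
```

float(numpy_mean) gives the same plain float; .item() has the advantage of also working for integer scalars such as np.int64.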
Old:
I tried this in pandas 0.16.2.
I did the following on your original df:
In [22]: df['z'] = df['x'].mean() < df['x']
In [23]: df
Out[23]:
   x  y      z
0  1  1  False
1  2  1  False
2  3  1   True
3  4  9   True
In [27]: df['z'].mean() < df['z']
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-27-afc8a7b869b4> in <module>()
----> 1 df['z'].mean() < df['z']
C:\Anaconda3\lib\site-packages\pandas\core\ops.py in wrapper(self, other, axis)
586 return NotImplemented
587 elif isinstance(other, (np.ndarray, pd.Index)):
--> 588 if len(self) != len(other):
589 raise ValueError('Lengths must match to compare')
590 return self._constructor(na_op(self.values, np.asarray(other)),
TypeError: len() of unsized object
This seems like a bug. I can compare a boolean column with an integer column and vice versa; the problem only occurs when comparing the mean of a boolean column with that boolean column (although I don't think it makes much sense to take mean() of booleans):
In [24]: df['z'] < df['x']
Out[24]:
0 True
1 True
2 True
3 True
dtype: bool
In [25]: df['z'] < df['x'].mean()
Out[25]:
0 True
1 True
2 True
3 True
Name: z, dtype: bool
In [26]: df['x'].mean() < df['z']
Out[26]:
0 False
1 False
2 False
3 False
Name: z, dtype: bool
I also reproduced the issue in pandas 0.16.1; it can be triggered with:
In [10]: df['x'].mean() < df['x']
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-10-4e5dab1545af> in <module>()
----> 1 df['x'].mean() < df['x']
/opt/anaconda/envs/np18py27-1.9/lib/python2.7/site-packages/pandas/core/ops.pyc in wrapper(self, other, axis)
586 return NotImplemented
587 elif isinstance(other, (np.ndarray, pd.Index)):
--> 588 if len(self) != len(other):
589 raise ValueError('Lengths must match to compare')
590 return self._constructor(na_op(self.values, np.asarray(other)),
TypeError: len() of unsized object
In [11]: df['x'] < df['x'].mean()
Out[11]:
0 True
1 True
2 False
3 False
Name: x, dtype: bool
It looks like this was a bug that was fixed in pandas 0.16.2 (except when mixing booleans with integers). I suggest upgrading pandas with:
pip install pandas --upgrade
I think it has more to do with how the operator is overloaded than with the operator itself. With an overloaded operator, if the data types on the left and right differ, the order matters. (Python has a somewhat tricky way of deciding which overloaded method to use.) You can make your code work by casting the result of mean() (which is a numpy.float64) to a plain float:
df.apply( lambda x: float(x.mean()) < x )
For some reason, the pandas code seems to treat numpy.float64 like an array, which may be why it fails.
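To illustrate the dispatch rule in isolation, here is a toy sketch in pure Python. ToyColumn and EagerScalar are hypothetical classes invented for this example; they do not reproduce pandas or NumPy internals. They only show that for a < b Python asks the left operand first and falls back to the right operand's reflected method only when the left one returns NotImplemented:

```python
class ToyColumn:
    """Hypothetical stand-in for a column of values (not a pandas Series)."""

    def __init__(self, values):
        self.values = list(values)

    def __gt__(self, other):
        # Element-wise "value > other".
        return [v > other for v in self.values]

    def __lt__(self, other):
        # Element-wise "value < other".
        return [v < other for v in self.values]


class EagerScalar:
    """Hypothetical scalar that handles '<' itself and never defers."""

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        # It never returns NotImplemented, so the right operand's
        # reflected __gt__ is never consulted.
        raise TypeError("EagerScalar cannot compare with %s" % type(other).__name__)


col = ToyColumn([1, 2, 3, 4])

# float.__lt__ does not know ToyColumn and returns NotImplemented,
# so Python falls back to col.__gt__(2.5).
plain = 2.5 < col
print(plain)  # [False, False, True, True]

# The left operand gets the first chance and fails before ToyColumn is asked.
error_message = None
try:
    EagerScalar(2.5) < col
except TypeError as exc:
    error_message = str(exc)
print(error_message)  # EagerScalar cannot compare with ToyColumn
```

A plain float behaves like the first case, which is consistent with float(x.mean()) < x working: the float steps aside and the column's own comparison method does the work.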