Why does operand order matter in a comparison? / Lambda inequality?

Sorry, this is not a great title. Simple example:

(pandas version 0.16.1)

import pandas as pd
df = pd.DataFrame({ 'x':range(1,5), 'y':[1,1,1,9] })

      

Works great:

df.apply( lambda x: x > x.mean() )

       x      y
0  False  False
1  False  False
2   True  False
3   True   True

      

Shouldn't this work the same?

df.apply( lambda x: x.mean() < x )
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-467-6f32d50055ea> in <module>()
----> 1 df.apply( lambda x: x.mean() < x )

C:\Users\ei\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\frame.pyc in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
   3707                     if reduce is None:
   3708                         reduce = True
-> 3709                     return self._apply_standard(f, axis, reduce=reduce)
   3710             else:
   3711                 return self._apply_broadcast(f, axis)

C:\Users\ei\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\frame.pyc in _apply_standard(self, func, axis, ignore_failures, reduce)
   3797             try:
   3798                 for i, v in enumerate(series_gen):
-> 3799                     results[i] = func(v)
   3800                     keys.append(v.name)
   3801             except Exception as e:

<ipython-input-467-6f32d50055ea> in <lambda>(x)
----> 1 df.apply( lambda x: x.mean() < x )

C:\Users\ei\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\ops.pyc in wrapper(self, other, axis)
    586             return NotImplemented
    587         elif isinstance(other, (np.ndarray, pd.Index)):
--> 588             if len(self) != len(other):
    589                 raise ValueError('Lengths must match to compare')
    590             return self._constructor(na_op(self.values, np.asarray(other)),

TypeError: ('len() of unsized object', u'occurred at index x')

      

As a counterexample, these both work:

df.mean() < df

df > df.mean()

      



2 answers


EDIT

Finally found the bug report for this: Issue 9369.

As stated in the issue:

left = 0 > s works (e.g. with a Python scalar). So I think this is being treated as a 0-dimensional array (it's an np.int64) and not as a scalar when called. I'll mark it as a bug. Feel free to dig in.

The problem occurs when a numpy scalar type (such as np.int64 or np.float64) is on the left side of the comparison operator. A simple workaround, as @santon pointed out in his answer, is to convert the number to a Python scalar instead of using the numpy scalar.
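The "len() of unsized object" message in the traceback can be reproduced without pandas. The following is a minimal sketch, assuming only that numpy is installed; the variable names are illustrative and are not taken from the pandas source. It shows that a numpy scalar wrapped in an array becomes a 0-dimensional array with no length, which appears to be what the comparison wrapper receives as `other`, and that `.item()` gives back a plain Python number:

```python
import numpy as np

# Stand-in for the value returned by Series.mean(): a numpy scalar.
mean_value = np.float64(2.5)

# Wrapping the scalar in an array gives a 0-dimensional array.
zero_dim = np.asarray(mean_value)
print(zero_dim.ndim)

# A 0-dimensional array has no length, so len() raises TypeError.
try:
    len(zero_dim)
    len_error = None
except TypeError as exc:
    len_error = str(exc)
print(len_error)

# .item() converts a numpy scalar to the matching built-in Python type.
plain_float = mean_value.item()
plain_int = np.int64(3).item()
print(type(plain_float), type(plain_int))
```

Passing `plain_float` or `plain_int` as the left operand instead of the numpy scalar is the workaround described above.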


Old:

I tried this in pandas 0.16.2.

I did the following on your original df -



In [22]: df['z'] = df['x'].mean() < df['x']

In [23]: df
Out[23]:
   x  y      z
0  1  1  False
1  2  1  False
2  3  1   True
3  4  9   True

In [27]: df['z'].mean() < df['z']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-27-afc8a7b869b4> in <module>()
----> 1 df['z'].mean() < df['z']

C:\Anaconda3\lib\site-packages\pandas\core\ops.py in wrapper(self, other, axis)
    586             return NotImplemented
    587         elif isinstance(other, (np.ndarray, pd.Index)):
--> 588             if len(self) != len(other):
    589                 raise ValueError('Lengths must match to compare')
    590             return self._constructor(na_op(self.values, np.asarray(other)),

TypeError: len() of unsized object

      

This seems like a bug. I can compare a boolean Series to an int Series and vice versa; the problem only occurs when comparing the mean of a boolean Series with that boolean Series (although I don't think it makes sense to take mean() of booleans).

In [24]: df['z'] < df['x']
Out[24]:
0    True
1    True
2    True
3    True
dtype: bool

In [25]: df['z'] < df['x'].mean()
Out[25]:
0    True
1    True
2    True
3    True
Name: z, dtype: bool

In [26]: df['x'].mean() < df['z']
Out[26]:
0    False
1    False
2    False
3    False
Name: z, dtype: bool

      


I also reproduced the issue in pandas 0.16.1. It can be triggered with:

In [10]: df['x'].mean() < df['x']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-4e5dab1545af> in <module>()
----> 1 df['x'].mean() < df['x']

/opt/anaconda/envs/np18py27-1.9/lib/python2.7/site-packages/pandas/core/ops.pyc in wrapper(self, other, axis)
    586             return NotImplemented
    587         elif isinstance(other, (np.ndarray, pd.Index)):
--> 588             if len(self) != len(other):
    589                 raise ValueError('Lengths must match to compare')
    590             return self._constructor(na_op(self.values, np.asarray(other)),

TypeError: len() of unsized object

In [11]: df['x'] < df['x'].mean()
Out[11]: 
0     True
1     True
2    False
3    False
Name: x, dtype: bool

      

This looks like a bug that was fixed in pandas 0.16.2 (except when mixing booleans with integers). I suggest upgrading pandas with:

pip install pandas --upgrade

      




I think it has more to do with how the operator is overloaded than with the operator itself. With an overloaded operator, if the data types on the left and right differ, the order matters. (Python has a tricky way of figuring out which overloaded method to use.) You can make your code work by casting the result of mean() (which is a numpy.float64) to a plain float:

df.apply( lambda x: float(x.mean()) < x )

      



For some reason, it seems like the pandas code is treating numpy.float64 like an array, and maybe that's why it fails.
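To illustrate the dispatch rule mentioned in this answer, here is a minimal pure-Python sketch. The `Column` class is a hypothetical stand-in for a Series and is not pandas code. When the left operand's `__lt__` returns NotImplemented, Python falls back to the right operand's reflected method, `__gt__`:

```python
class Column:
    """Hypothetical stand-in for a Series; not pandas code."""

    def __init__(self, values):
        self.values = list(values)

    def __gt__(self, other):
        # Handles `column > other`, and also `other < column` when the
        # left operand's __lt__ returns NotImplemented.
        return [v > other for v in self.values]

    def __lt__(self, other):
        # Handles `column < other`, and also `other > column` by reflection.
        return [v < other for v in self.values]


column = Column([1, 2, 3, 4])

# Calls Column.__gt__(column, 2.5) directly.
direct = column > 2.5

# float.__lt__(2.5, column) returns NotImplemented, so Python falls back
# to the reflected call Column.__gt__(column, 2.5).
reflected = 2.5 < column

print(direct)
print(reflected)
```

With a plain float on the left, both spellings end up in the same method. A numpy scalar on the left implements its own comparison logic, so the call may take a different path before it reaches the right-hand object, which would be consistent with the cast to float avoiding the error.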
