Why does the order of comparison matter? / Lambda inequality?

Question

Sorry, that's not a great title. Simple example:

(pandas version 0.16.1)

import pandas as pd
df = pd.DataFrame({ 'x':range(1,5), 'y':[1,1,1,9] })

Works great:

df.apply( lambda x: x > x.mean() )

       x      y
0  False  False
1  False  False
2   True  False
3   True   True

Shouldn't this work the same way?

df.apply( lambda x: x.mean() < x )
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-467-6f32d50055ea> in <module>()
----> 1 df.apply( lambda x: x.mean() < x )

C:\Users\ei\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\frame.pyc in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
   3707                     if reduce is None:
   3708                         reduce = True
-> 3709                     return self._apply_standard(f, axis, reduce=reduce)
   3710             else:
   3711                 return self._apply_broadcast(f, axis)

C:\Users\ei\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\frame.pyc in _apply_standard(self, func, axis, ignore_failures, reduce)
   3797             try:
   3798                 for i, v in enumerate(series_gen):
-> 3799                     results[i] = func(v)
   3800                     keys.append(v.name)
   3801             except Exception as e:

<ipython-input-467-6f32d50055ea> in <lambda>(x)
----> 1 df.apply( lambda x: x.mean() < x )

C:\Users\ei\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\ops.pyc in wrapper(self, other, axis)
    586             return NotImplemented
    587         elif isinstance(other, (np.ndarray, pd.Index)):
--> 588             if len(self) != len(other):
    589                 raise ValueError('Lengths must match to compare')
    590             return self._constructor(na_op(self.values, np.asarray(other)),

TypeError: ('len() of unsized object', u'occurred at index x')

For a counter example, these both work:

df.mean() < df

df > df.mean()

python pandas

JohnE Jul 17 15 at 16:53

2 answers

I think it has more to do with how the operator is overloaded than with the operator itself. With an overloaded operator, if the data types on the left and right differ, the order matters. (Python has a tricky way of figuring out which overloaded function to use.) You can make your code work by casting the result of mean() (which is a numpy.float64) to a plain float:

df.apply( lambda x: float(x.mean()) < x )

For some reason, the pandas code seems to treat the numpy.float64 like an array, and maybe that's why it fails.
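
To illustrate the "tricky way" Python picks an overloaded comparison, here is a minimal sketch in plain Python. The classes Scalar and Column are hypothetical stand-ins invented for this example (they are not numpy or pandas types), and the sketch only shows the general dispatch rule, not the pandas bug itself: the left operand's method is asked first, and the right operand's reflected method is used only if the left one returns NotImplemented.

```python
class Scalar:
    """Hypothetical stand-in for a scalar on the left-hand side."""

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        # Only knows how to compare with another Scalar; otherwise defer.
        if isinstance(other, Scalar):
            return self.value < other.value
        return NotImplemented


class Column:
    """Hypothetical stand-in for a column of values on the right-hand side."""

    def __init__(self, values):
        self.values = list(values)

    def __gt__(self, other):
        # Reflected counterpart of `other < self`: compare element by element.
        if isinstance(other, Scalar):
            return [v > other.value for v in self.values]
        return NotImplemented


col = Column([1, 2, 3, 4])

# Scalar.__lt__ is tried first and returns NotImplemented,
# so Python falls back to Column.__gt__(col, Scalar(2.5)).
result = Scalar(2.5) < col
print(result)
```

If the left operand handles the comparison itself instead of deferring, the right operand never gets a say, which is one way the two orderings can end up behaving differently.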

santon Jul 17 15 at 17:15

Anand s kumar · Accepted Answer · 2015-07-17T17:08:03+0000

EDIT

Finally found a bug report for this - Issue 9369

As stated in the issue -

left = 0 > s works (e.g. with a Python scalar). So I think this is being treated as a 0-dimensional array (it's an np.int64), and not as a scalar, when called. I'll mark it as a bug. Feel free to dig in.

The problem occurs when a numpy scalar type (like np.int64 or np.float64, etc.) is on the left side of the comparison operator. A simple fix for this, as @santon pointed out in his answer, is to convert the number to a Python scalar instead of using the numpy scalar.
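
A minimal sketch of that workaround, assuming the df from the question. The variable names m, plain, left and right are made up for this example. On pandas versions where the bug is fixed, both orderings already work, so the sketch only shows that the conversion gives the same answer as the scalar-on-the-right form.

```python
import pandas as pd

df = pd.DataFrame({'x': range(1, 5), 'y': [1, 1, 1, 9]})

m = df['x'].mean()   # a numpy scalar
plain = float(m)     # converted to a built-in Python float
print(type(m), type(plain))

# Scalar on the left, after converting it to a Python float.
left = df.apply(lambda col: float(col.mean()) < col)

# Scalar on the right, as in the form that worked in the question.
right = df.apply(lambda col: col > col.mean())

print(left.equals(right))
```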

Old:

I have tried in Pandas 0.16.2.

I did the following on your original df -

In [22]: df['z'] = df['x'].mean() < df['x']

In [23]: df
Out[23]:
   x  y      z
0  1  1  False
1  2  1  False
2  3  1   True
3  4  9   True

In [27]: df['z'].mean() < df['z']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-27-afc8a7b869b4> in <module>()
----> 1 df['z'].mean() < df['z']

C:\Anaconda3\lib\site-packages\pandas\core\ops.py in wrapper(self, other, axis)
    586             return NotImplemented
    587         elif isinstance(other, (np.ndarray, pd.Index)):
--> 588             if len(self) != len(other):
    589                 raise ValueError('Lengths must match to compare')
    590             return self._constructor(na_op(self.values, np.asarray(other)),

TypeError: len() of unsized object

This seems like a bug. I can compare a boolean to an int and vice versa; the problem only occurs when comparing the mean of a boolean column with that boolean column (although I don't think it makes much sense to take mean() of booleans).

In [24]: df['z'] < df['x']
Out[24]:
0    True
1    True
2    True
3    True
dtype: bool

In [25]: df['z'] < df['x'].mean()
Out[25]:
0    True
1    True
2    True
3    True
Name: z, dtype: bool

In [26]: df['x'].mean() < df['z']
Out[26]:
0    False
1    False
2    False
3    False
Name: z, dtype: bool

I tried and reproduced the issue in Pandas 0.16.1; it can also be reproduced with -

In [10]: df['x'].mean() < df['x']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-4e5dab1545af> in <module>()
----> 1 df['x'].mean() < df['x']

/opt/anaconda/envs/np18py27-1.9/lib/python2.7/site-packages/pandas/core/ops.pyc in wrapper(self, other, axis)
    586             return NotImplemented
    587         elif isinstance(other, (np.ndarray, pd.Index)):
--> 588             if len(self) != len(other):
    589                 raise ValueError('Lengths must match to compare')
    590             return self._constructor(na_op(self.values, np.asarray(other)),

TypeError: len() of unsized object

In [11]: df['x'] < df['x'].mean()
Out[11]: 
0     True
1     True
2    False
3    False
Name: x, dtype: bool

It looks like this was a bug that was fixed in Pandas version 0.16.2 (except when mixing booleans with integers). I suggest upgrading your Pandas version using -

pip install pandas --upgrade
