Pandas DataFrame combine_first and update methods have strange behavior

I am running into a strange problem (or is it intended?) where combine_first or update causes values stored as bool to be upcast to float64 if the supplied argument does not include the boolean columns.

Example workflow in ipython:

In [144]: test = pd.DataFrame([[1,2,False,True],[4,5,True,False]], columns=['a','b','isBool', 'isBool2'])

In [145]: test
Out[145]:
   a  b isBool isBool2
0  1  2  False    True
1  4  5   True   False


In [147]: b = pd.DataFrame([[45,45]], index=[0], columns=['a','b'])

In [148]: b
Out[148]:
    a   b
0  45  45

In [149]: test.update(b)

In [150]: test
Out[150]:
    a   b  isBool  isBool2
0  45  45       0        1
1   4   5       1        0

      

Is this the intended behavior of update? I would have thought that update would not touch columns that are not specified in its argument.
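To check whether a given pandas version is affected, the session above can be turned into a small script that compares dtypes before and after update. The final astype call is a workaround sketch that is not part of the original post; it assumes the upcast values can be cast back to the original dtypes without loss, which holds for this example.

```python
import pandas as pd

test = pd.DataFrame(
    [[1, 2, False, True], [4, 5, True, False]],
    columns=["a", "b", "isBool", "isBool2"],
)
b = pd.DataFrame([[45, 45]], index=[0], columns=["a", "b"])

# Remember the dtypes before the update so they can be compared and restored.
dtypes_before = test.dtypes.copy()

test.update(b)

# Columns whose dtype changed; on an affected version this includes the bool columns.
changed = [col for col in test.columns if test[col].dtype != dtypes_before[col]]
print("columns with changed dtype:", changed)

# Workaround (assumption, not from the original post): cast back to the original dtypes.
test = test.astype(dtypes_before.to_dict())
print(test.dtypes)
```

On a version that contains the fix, the list of changed columns should not contain isBool or isBool2, and the final cast does nothing for them.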


EDIT: I explored a little further, and the plot thickens. If I insert one more command, test.update([]), before running test.update(b), the boolean columns are preserved, at the cost of the numeric columns being upcast to object. This also applies to DSM's simplified example.

Based on the pandas source code, it looks like in the test.update([]) case the reindex_like method creates a DataFrame of dtype object, whereas reindex_like on b creates a DataFrame of dtype float64. Since object is more general, subsequent operations work with bools. Unfortunately, running np.log on the numeric columns then fails with an AttributeError.
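The np.log failure can be reproduced with numpy alone. This is a minimal sketch that uses a plain object array as a stand-in for a numeric column upcast to object. Depending on the numpy version the exception may be an AttributeError or a TypeError, so both are caught. Casting back to float64 first is a suggested workaround, not something from the original post.

```python
import numpy as np

# Stand-in for a numeric column that has been upcast to dtype object.
values = np.array([45, 4], dtype=object)

try:
    np.log(values)
    failed = False
except (AttributeError, TypeError) as exc:
    # numpy looks for a .log method on each Python object and finds none.
    failed = True
    print("np.log failed on the object array:", type(exc).__name__)

# Workaround (assumption): cast to a numeric dtype before applying the ufunc.
logged = np.log(values.astype(np.float64))
print(logged.dtype)
```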



2 answers


This is a bug: update should not touch unspecified columns. It is fixed here: https://github.com/pydata/pandas/pull/3021





Before the update, the DataFrame b is reindexed with reindex_like (a here is the DataFrame being updated, test in the question), so that b becomes:

In [5]: b.reindex_like(a)
Out[5]: 
    a   b  isBool  isBool2
0  45  45     NaN      NaN
1 NaN NaN     NaN      NaN
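The reindexing step can be reproduced on its own. This sketch reuses the frames from the question (with test in place of a) and counts the missing values that reindex_like introduces. It only illustrates the alignment and does not claim to mirror the internals of update.

```python
import pandas as pd

test = pd.DataFrame(
    [[1, 2, False, True], [4, 5, True, False]],
    columns=["a", "b", "isBool", "isBool2"],
)
b = pd.DataFrame([[45, 45]], index=[0], columns=["a", "b"])

# Align b to the index and columns of test; positions b does not cover become NaN.
aligned = b.reindex_like(test)

print(aligned)
print(aligned.dtypes)
print("missing values:", int(aligned.isna().sum().sum()))
```

Because NaN is a float, the columns a and b of the aligned frame can no longer be stored as integers.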

      

numpy.where is then used to update the DataFrame.

The problem is that with numpy.where, if the two inputs have different dtypes, the more general one is used. For example:

In [20]: np.where(True, [True], [0])
Out[20]: array([1])

In [21]: np.where(True, [True], [1.0])
Out[21]: array([ 1.])

      



Since NaN in numpy is a floating-point value, the result is a float array as well:

In [22]: np.where(True, [True], [np.nan])
Out[22]: array([ 1.])

      

So after the update, your 'isBool' and 'isBool2' columns will be float.

I have added this issue to the pandas issue tracker.
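The promotion can be confirmed as a standalone script. This sketch uses arrays rather than Python scalars and compares the result dtype of numpy.where with what np.result_type reports for the two input dtypes.

```python
import numpy as np

cond = np.array([True, False])
bools = np.array([True, False])
nans = np.array([np.nan, np.nan])

# Where cond is True take the bool value, otherwise take NaN.
result = np.where(cond, bools, nans)

# The common dtype numpy picks for a bool array and a float64 array.
promoted = np.result_type(bools.dtype, nans.dtype)

print(result, result.dtype, promoted)
```

The bool value that was selected survives only as 1.0, which is the same effect seen in the isBool columns after update.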
