Pandas DataFrame combine_first and update methods have strange behavior
I am running into a strange problem (or is it intended?) where combine_first or update causes values stored as bool to be upcast to float64 if the supplied argument does not include the boolean columns.
Example workflow in ipython:
In [144]: test = pd.DataFrame([[1,2,False,True],[4,5,True,False]], columns=['a','b','isBool', 'isBool2'])
In [145]: test
Out[145]:
a b isBool isBool2
0 1 2 False True
1 4 5 True False
In [147]: b = pd.DataFrame([[45,45]], index=[0], columns=['a','b'])
In [148]: b
Out[148]:
a b
0 45 45
In [149]: test.update(b)
In [150]: test
Out[150]:
a b isBool isBool2
0 45 45 0 1
1 4 5 1 0
Is this the intended behavior of the update function? I would have thought that if nothing is specified for them, update would not touch the other columns.
EDIT: I started tinkering a little more. The plot thickens. If I insert one more command, test.update([]), before running test.update(b), the boolean behavior works, at the cost of the numbers being upcast to object. This also applies to DSM's simplified example.
Based on the pandas source code, it looks like in the test.update([]) case the reindex_like method creates a DataFrame of dtype object, while reindex_like on b creates a DataFrame of dtype float64. Since object is more general, subsequent operations work with bools. Unfortunately, running np.log on the numeric columns will then fail with an AttributeError.
This is a bug: update should not touch unspecified columns. It is fixed here: https://github.com/pydata/pandas/pull/3021
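For pandas versions that predate the fix, one possible workaround is to record the dtypes before calling update and cast back afterwards. This is only a sketch: the helper name update_keep_dtypes is made up for illustration and is not part of pandas, and it assumes the updated values can be cast back to the original dtypes.

```python
import pandas as pd


def update_keep_dtypes(target, other):
    """Hypothetical helper: update a copy of `target`, then restore its dtypes."""
    original_dtypes = target.dtypes
    result = target.copy()
    result.update(other)  # may upcast columns on affected pandas versions
    # Cast every column back to the dtype it had before the update.
    return result.astype(original_dtypes.to_dict())


test = pd.DataFrame([[1, 2, False, True], [4, 5, True, False]],
                    columns=['a', 'b', 'isBool', 'isBool2'])
b = pd.DataFrame([[45, 45]], index=[0], columns=['a', 'b'])

fixed = update_keep_dtypes(test, b)
print(fixed)
print(fixed.dtypes)
```

Working on a copy keeps the original dataframe untouched, which makes it easy to compare the dtypes before and after.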
Before the update, the dataframe b is expanded with reindex_like to match the shape of the original dataframe (a here), so that b becomes
In [5]: b.reindex_like(a)
Out[5]:
a b isBool isBool2
0 45 45 NaN NaN
1 NaN NaN NaN NaN
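The alignment step can be reproduced on its own. A minimal sketch, assuming that a in the session above refers to the original dataframe (called test in the question):

```python
import pandas as pd

test = pd.DataFrame([[1, 2, False, True], [4, 5, True, False]],
                    columns=['a', 'b', 'isBool', 'isBool2'])
b = pd.DataFrame([[45, 45]], index=[0], columns=['a', 'b'])

# Give b the same index and columns as test.
# Rows and columns that b does not have are filled with NaN.
expanded = b.reindex_like(test)
print(expanded)
print(expanded.dtypes)  # inspect the dtypes that result from the NaN fill
```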
update then uses numpy.where to merge the two dataframes.
The catch is that with numpy.where, if the two inputs have different dtypes, the more general one is used. For example:
In [20]: np.where(True, [True], [0])
Out[20]: array([1])
In [21]: np.where(True, [True], [1.0])
Out[21]: array([ 1.])
Since NaN in numpy is a floating-point value, this also returns a floating-point array.
In [22]: np.where(True, [True], [np.nan])
Out[22]: array([ 1.])
So after the update, your 'isBool' and 'isBool2' columns will be float.
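The same promotion can be checked on a single column. The sketch below does not call pandas' internal code; it only imitates the step described above with numpy.where, with the all-NaN fill column written out by hand.

```python
import numpy as np

is_bool = np.array([False, True])   # original column, dtype bool
fill = np.array([np.nan, np.nan])   # what b contributes for this column: only NaN

# Keep the original value wherever the fill is NaN.
merged = np.where(np.isnan(fill), is_bool, fill)
print(merged, merged.dtype)

# The common dtype of bool and float64 is float64, which is why the bools are lost.
print(np.result_type(np.bool_, np.float64))
```

Every value in the result comes from the boolean column, yet the output dtype is float64 because numpy.where picks one dtype that can hold both inputs.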
I added this issue to the pandas issue tracker.