Max / min of dates across pandas columns that include NaN values
I am trying to create a new column in a pandas DataFrame holding the maximum (or minimum) date from two other date columns. But when there is a NaN in either of these columns, the entire min / max column becomes NaN. What gives? With numeric columns this works great, but with dates the new column is all NaN. Here's some sample code to illustrate the problem:
from datetime import date
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[np.nan, date(2000,11,1)],
                        [date(2000,12,1), date(2000,9,1)],
                        [date(2000,4,1), np.nan],
                        [date(2000,12,2), np.nan]], columns=['col1','col2'])
df['col3'] = df[['col1','col2']].max(axis=1)
I know it can be done with loc and a combination of <, >, isnull and so on. But how can I make it work with the regular max / min functions?
You are storing date objects in your columns; if you convert them to datetime, it works as expected:
In[10]:
df['col1'] = pd.to_datetime(df['col1'])
df['col2'] = pd.to_datetime(df['col2'])
df
Out[10]:
col1 col2 col3
0 NaT 2000-11-01 NaN
1 2000-12-01 2000-09-01 NaN
2 2000-04-01 NaT NaN
3 2000-12-02 NaT NaN
In[11]:
df['col3'] = df[['col1','col2']].max(axis=1)
df
Out[11]:
col1 col2 col3
0 NaT 2000-11-01 2000-11-01
1 2000-12-01 2000-09-01 2000-12-01
2 2000-04-01 NaT 2000-04-01
3 2000-12-02 NaT 2000-12-02
If you just did:
df['col3'] = df['col1'].max()
this raises a TypeError: '>=' not supported between instances of 'float' and 'datetime.date'
The NaN values are floats, so each column holds a mix of float and date objects, and the row-wise max returns NaN. If you didn't have missing values it would work as expected; if you do have missing values, you have to convert the dtype to datetime so that the missing values are converted to NaT and max works correctly.
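Putting the steps above together, here is a minimal sketch of the conversion followed by the row-wise max and min. The names `date_cols`, `latest` and `earliest` are made up for illustration and are not from the question. It assumes that NaT is skipped by default once the columns are datetime, which is what the output above shows.

```python
from datetime import date

import numpy as np
import pandas as pd

# Same sample data as in the question: date objects mixed with NaN.
df = pd.DataFrame(
    data=[[np.nan, date(2000, 11, 1)],
          [date(2000, 12, 1), date(2000, 9, 1)],
          [date(2000, 4, 1), np.nan],
          [date(2000, 12, 2), np.nan]],
    columns=['col1', 'col2'])

# Before conversion both columns have the generic object dtype.
print(df.dtypes)

# Convert each column to datetime; NaN becomes NaT.
date_cols = ['col1', 'col2']  # illustrative name
for col in date_cols:
    df[col] = pd.to_datetime(df[col])

# Row-wise max / min now skip the NaT entries.
df['latest'] = df[date_cols].max(axis=1)     # illustrative column name
df['earliest'] = df[date_cols].min(axis=1)   # illustrative column name
print(df)
```

The same pattern should extend to more than two columns by adding them to `date_cols`.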