Pandas: parsing 24:00 instead of 00:00
I have a dataset where the hour is written as [0100:2400] instead of [0000:2300]. For example,
pd.to_datetime('201704102300', format='%Y%m%d%H%M')
returns
Timestamp('2017-04-10 23:00:00')
But
pd.to_datetime('201704102400', format='%Y%m%d%H%M')
gives me an error:
ValueError: unconverted data remains: 0
How can I fix this problem?
I could manually tweak the data, as suggested in another SO post, but shouldn't pandas already handle this case?
UPDATE:
And how do I do it in a scalable way for a dataframe? For example, the data looks like this:
Pandas follows strptime-style parsing, where %H only accepts hours 00 to 23, so if you need something non-standard you can roll your own.
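To see where the error comes from without pandas involved, here is a minimal sketch using the standard library's datetime.strptime with the same format string. The variable names (fmt, error) are only for illustration. A plausible reading of the leftover "0" is that %H consumes just the "2" and %M then takes "40", but treat that as an assumption about the parser's internals rather than documented behavior.

```python
from datetime import datetime

fmt = '%Y%m%d%H%M'

# Hour 23 is inside the range that %H accepts, so this parses fine.
print(datetime.strptime('201704102300', fmt))

# Hour 24 is outside that range, so parsing fails with a ValueError.
error = None
try:
    datetime.strptime('201704102400', fmt)
except ValueError as exc:
    error = exc
    print('ValueError:', exc)
```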
Code:
import datetime as dt
import pandas as pd

def my_to_datetime(date_str):
    # Hours 00-23 parse normally.
    if date_str[8:10] != '24':
        return pd.to_datetime(date_str, format='%Y%m%d%H%M')
    # Rewrite hour 24 as 00, then roll over to the next day.
    date_str = date_str[0:8] + '00' + date_str[10:]
    return (pd.to_datetime(date_str, format='%Y%m%d%H%M')
            + dt.timedelta(days=1))

print(my_to_datetime('201704102400'))
Results:
2017-04-11 00:00:00
For a column in a pandas.DataFrame:
df['time'] = df.time.apply(my_to_datetime)
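As a self-contained sketch of the column case, the snippet below applies the helper to a small frame. The sample values are made up for illustration (the question's actual data is not shown), and the column is assumed to hold strings rather than integers, since the helper slices by character position.

```python
import datetime as dt
import pandas as pd

def my_to_datetime(date_str):
    # Same helper as above: treat hour 24 as 00:00 of the following day.
    if date_str[8:10] != '24':
        return pd.to_datetime(date_str, format='%Y%m%d%H%M')
    date_str = date_str[0:8] + '00' + date_str[10:]
    return (pd.to_datetime(date_str, format='%Y%m%d%H%M')
            + dt.timedelta(days=1))

# Hypothetical sample data; the values must be strings.
df = pd.DataFrame({'time': ['201704102400', '201602282400', '201704102359']})

df['time'] = df['time'].apply(my_to_datetime)
print(df)
```

If the column was read in as integers, converting it first with df['time'].astype(str) should be enough for the slicing to work.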
Vectorized solution using the pd.to_datetime(DataFrame) method:
Source DF:
In [27]: df
Out[27]:
           time
0  201704102400
1  201602282400
2  201704102359
Solution:
In [28]: pat = r'(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})(?P<hour>\d{2})(?P<minute>\d{2})'
In [29]: pd.to_datetime(df['time'].str.extract(pat, expand=True))
Out[29]:
0 2017-04-11 00:00:00
1 2016-02-29 00:00:00
2 2017-04-10 23:59:00
dtype: datetime64[ns]
Explanation:
In [30]: df['time'].str.extract(pat, expand=True)
Out[30]:
   year month day hour minute
0  2017    04  10   24     00
1  2016    02  28   24     00
2  2017    04  10   23     59
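pat is the regex pattern argument passed to the Series.str.extract() function. Its named groups become the year, month, day, hour and minute columns, which pd.to_datetime then assembles into datetimes.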
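The same steps as a plain script, for anyone not working in IPython. This is a sketch under a couple of assumptions: the sample frame mirrors the one shown above, and the explicit astype(int) is my addition to make the numeric intent visible (the original passes the extracted strings straight through). The names parts and parsed are illustrative.

```python
import pandas as pd

# Sample data mirroring the frame shown above.
df = pd.DataFrame({'time': ['201704102400', '201602282400', '201704102359']})

# Raw string so the \d escapes reach the regex engine untouched.
pat = (r'(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})'
       r'(?P<hour>\d{2})(?P<minute>\d{2})')

# Each named group becomes a column; cast the pieces to integers.
parts = df['time'].str.extract(pat, expand=True).astype(int)

# to_datetime assembles datetimes from the component columns, and
# an hour of 24 carries over into the next day.
parsed = pd.to_datetime(parts)
print(parsed)
```

Note that the second row lands on 2016-02-29, so the rollover also respects leap years.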
UPDATE: Timing
In [37]: df = pd.concat([df] * 10**4, ignore_index=True)
In [38]: df.shape
Out[38]: (30000, 1)
In [39]: %timeit df.time.apply(my_to_datetime)
1 loop, best of 3: 4.1 s per loop
In [40]: %timeit pd.to_datetime(df['time'].str.extract(pat, expand=True))
1 loop, best of 3: 475 ms per loop
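To rerun the comparison outside IPython, something like the following sketch could be used. It relies on the standard timeit module instead of %timeit, uses a smaller frame (3,000 rows, my choice to keep it quick), and first checks that both approaches agree. The function names with_apply and with_extract are hypothetical, and the timings will differ by machine and pandas version.

```python
import datetime as dt
import timeit

import pandas as pd

def my_to_datetime(date_str):
    # Row-wise helper from the first answer.
    if date_str[8:10] != '24':
        return pd.to_datetime(date_str, format='%Y%m%d%H%M')
    date_str = date_str[0:8] + '00' + date_str[10:]
    return (pd.to_datetime(date_str, format='%Y%m%d%H%M')
            + dt.timedelta(days=1))

pat = (r'(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})'
       r'(?P<hour>\d{2})(?P<minute>\d{2})')

small = pd.DataFrame({'time': ['201704102400', '201602282400', '201704102359']})
big = pd.concat([small] * 10**3, ignore_index=True)

def with_apply():
    return big['time'].apply(my_to_datetime)

def with_extract():
    return pd.to_datetime(big['time'].str.extract(pat, expand=True))

# Sanity check: both approaches should yield the same timestamps.
same = list(with_apply()) == list(with_extract())
print('same results:', same)

# Total seconds for three runs of each approach.
print('apply  :', timeit.timeit(with_apply, number=3))
print('extract:', timeit.timeit(with_extract, number=3))
```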