Pandas: parsing 24:00 instead of 00:00

I have a dataset where the hour is written in the range [0100:2400] instead of [0000:2300].

For example,

pd.to_datetime('201704102300', format='%Y%m%d%H%M')

      

returns

Timestamp('2017-04-10 23:00:00')

      

But

pd.to_datetime('201704102400', format='%Y%m%d%H%M')

      

gives me an error:

ValueError: unconverted data remains: 0

How can I fix this problem?

I can manually tweak the data as suggested in another SO post, but shouldn't pandas already handle this case?

UPDATE:

And how do I do it in a scalable way for a whole DataFrame column?

2 answers


Pandas uses the system strptime, so if you need something non-standard you have to roll your own.

Code:

import pandas as pd
import datetime as dt

def my_to_datetime(date_str):
    if date_str[8:10] != '24':
        return pd.to_datetime(date_str, format='%Y%m%d%H%M')

    date_str = date_str[0:8] + '00' + date_str[10:]
    return pd.to_datetime(date_str, format='%Y%m%d%H%M') + \
           dt.timedelta(days=1)

print(my_to_datetime('201704102400'))

      

Results:



2017-04-11 00:00:00

      

For a column in a pandas.DataFrame:

df['time'] = df.time.apply(my_to_datetime)
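If calling the function once per row is too slow, the same 24-to-00 rewrite can be sketched on the whole column at once. This is only a minimal sketch: the Series `s` and the names `is_24`, `fixed` and `result` are made up for illustration, and it assumes every value is a 12-character string in the YYYYmmddHHMM layout from the question.

```python
import pandas as pd

# Sample values in the question's YYYYmmddHHMM layout (illustrative data).
s = pd.Series(['201704102400', '201602282400', '201704102359'])

# Flag rows whose hour field (characters 8-9) is '24'.
is_24 = s.str[8:10] == '24'

# Rewrite '24' to '00' on the flagged rows only; other rows are kept as-is.
fixed = s.where(~is_24, s.str[:8] + '00' + s.str[10:])

# Parse everything in one call, then push the flagged rows forward one day.
result = (
    pd.to_datetime(fixed, format='%Y%m%d%H%M')
    + pd.to_timedelta(is_24.astype(int), unit='D')
)

print(result)
```

The day shift is applied as a timedelta, so month ends and leap days (such as 2016-02-28 24:00) roll over correctly.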

      



Vectorized solution using the pd.to_datetime(DataFrame) method:

Source DF:

In [27]: df
Out[27]:
           time
0  201704102400
1  201602282400
2  201704102359

      

Solution:

In [28]: pat = r'(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})(?P<hour>\d{2})(?P<minute>\d{2})'

In [29]: pd.to_datetime(df['time'].str.extract(pat, expand=True))
Out[29]:
0   2017-04-11 00:00:00
1   2016-02-29 00:00:00
2   2017-04-10 23:59:00
dtype: datetime64[ns]

      



Explanation:

In [30]: df['time'].str.extract(pat, expand=True)
Out[30]:
   year month day hour minute
0  2017    04  10   24     00
1  2016    02  28   24     00
2  2017    04  10   23     59

      

pat is the regex pattern passed to the Series.str.extract() function. Its named groups become the column names of the extracted DataFrame.
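As for why an hour of 24 is accepted here, my understanding is that when pd.to_datetime is given a DataFrame of components, it builds the date from year/month/day and then adds the remaining components as time offsets, so 24 hours simply rolls over to the next day. The sketch below compares the assembled result with a hand-written version of that sum. The names `s`, `parts`, `assembled` and `manual` are illustrative, and the explicit astype(int) is my own addition to keep the arithmetic unambiguous.

```python
import pandas as pd

# Same sample values as in the answer above.
s = pd.Series(['201704102400', '201602282400', '201704102359'])
pat = r'(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})(?P<hour>\d{2})(?P<minute>\d{2})'

# Split each string into named integer components.
parts = s.str.extract(pat, expand=True).astype(int)

# Let pandas assemble the timestamps from the component columns.
assembled = pd.to_datetime(parts)

# Hand-written equivalent: the date plus hour and minute offsets.
manual = (
    pd.to_datetime(parts[['year', 'month', 'day']])
    + pd.to_timedelta(parts['hour'], unit='h')
    + pd.to_timedelta(parts['minute'], unit='min')
)

# Check that both approaches agree on every row.
print((assembled == manual).all())
```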

UPDATE: Timing

In [37]: df = pd.concat([df] * 10**4, ignore_index=True)

In [38]: df.shape
Out[38]: (30000, 1)

In [39]: %timeit df.time.apply(my_to_datetime)
1 loop, best of 3: 4.1 s per loop

In [40]: %timeit pd.to_datetime(df['time'].str.extract(pat, expand=True))
1 loop, best of 3: 475 ms per loop

      
