Parsing timestamps of varying length in Python

I have 180,000 lines of timestamps that I would like to parse into datetime format, for example:

YYYY-MM-DD HH:MM:SS

      

Below are sample timestamps (note that hours 1 through 9 have no leading zero):

19-May-14 3:36:00 PM PDT
19-May-14 10:37:00 PM PDT 

      

I parsed these dates using parse_dates as part of pandas.read, but I found this method slow (usually ~80 seconds). I also tried the dateutil parser, with similar results.

I would like to parse timestamps faster, but I am having problems with different widths of timestamps. I found this SO solution which is similar to my problem, but failed to adapt the method to timestamps of different lengths.

Can anyone recommend a workable adaptation of the linked solution, or a better method?

Thanks.

+3




4 answers


This solution builds on the accepted answer to the linked question and assumes the time zone is exactly 3 characters long (its specific meaning is ignored).


You can extract the year, month and day depending on their relative position to the beginning of the line like this:

month_abbreviations = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4,
                       'May': 5, 'Jun': 6, 'Jul': 7, 'Aug': 8,
                       'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
day = int(line[0:2])
month = month_abbreviations[line[3:6]]
year = 2000 + int(line[7:9]) # this should be adapted to your specific use-case

      

You can extract minutes, seconds and AM / PM based on their relative position to the end of the line as shown below:

AM_PM = line[-6:-4]
second = int(line[-9:-7])
minute = int(line[-12:-10])

      

You can extract the hour based on its relative position to the beginning and end of the line:



hour = int(line[10:-13])

      

Then you can convert the hour to 24-hour time according to the AM_PM value. Note that 12 AM is hour 0 and 12 PM is hour 12, so the hour is taken modulo 12 first:

hour = hour % 12 + (12 if AM_PM == 'PM' else 0)

      

According to my measurements, this is slightly faster than using a dict, but not by much:

hour_shifter = {(0, 'AM'): 0, (0, 'PM'): 12,
                (1, 'AM'): 1, (1, 'PM'): 13,
                ...
                (11, 'AM'): 11, (11, 'PM'): 23,
                (12, 'AM'): 0, (12, 'PM'): 12}
hour = hour_shifter[(hour, AM_PM)]

      

Now you can instantiate the datetime object:

import datetime
datetime.datetime(year, month, day, hour, minute, second)
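
Putting the slices together, a minimal sketch of a complete parser could look like the following. The name parse_line is hypothetical, the sketch assumes every year is 20xx and that the time zone is a 3-character abbreviation, and it strips trailing whitespace first because the end-relative slices would otherwise be shifted (the second sample line in the question ends with a space).

```python
import datetime

MONTHS = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4,
          'May': 5, 'Jun': 6, 'Jul': 7, 'Aug': 8,
          'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}


def parse_line(line):
    """Parse 'DD-Mon-YY H:MM:SS AM TZ' by position; the time zone is ignored."""
    line = line.strip()             # trailing whitespace would shift the end-relative slices
    day = int(line[0:2])
    month = MONTHS[line[3:6]]
    year = 2000 + int(line[7:9])    # assumption: all years are 20xx
    hour = int(line[10:-13])        # 1 or 2 digits, bounded from both ends
    minute = int(line[-12:-10])
    second = int(line[-9:-7])
    am_pm = line[-6:-4]
    # 12-hour to 24-hour clock: 12 AM -> 0, 12 PM -> 12
    hour = hour % 12 + (12 if am_pm == 'PM' else 0)
    return datetime.datetime(year, month, day, hour, minute, second)


print(parse_line('19-May-14 3:36:00 PM PDT'))
print(parse_line('19-May-14 10:37:00 PM PDT '))
```

Applied to all lines, this would be something like [parse_line(l) for l in lines], which avoids the format guessing that general-purpose parsers do for every row.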

      

+2




How about using a regular expression? Can you provide your data file for testing?

import re

patt = re.compile(r'(?P<day>\d\d)-(?P<month>\w+)-(?P<year>\d\d)'
                  r' (?P<hour>\d{1,2}):(?P<minute>\d\d):(?P<second>\d\d)'
                  r' (?P<noon>\w\w) (?P<tz>\w+)')

# dates is an iterable of timestamp strings
for date in dates:
    res = patt.match(date)
    print(res.groupdict())

      



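Then convert the day, month, year, etc. to integers and create a timezone object. Note that pytz expects zone names such as 'US/Pacific'; an abbreviation like 'PDT' is not a valid pytz name and has to be mapped to one first:

from pytz import timezone
# timezone('PDT') raises UnknownTimeZoneError, so map the abbreviation to a zone name
tz_names = {'PDT': 'US/Pacific', 'PST': 'US/Pacific'}
tz = timezone(tz_names[res.groupdict()['tz']])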
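
The conversion step could be sketched roughly as follows. The names MONTHS and parse_with_regex are hypothetical, the sketch assumes every year is 20xx, and it simply discards the captured time zone and returns a naive datetime.

```python
import datetime
import re

PATT = re.compile(r'(?P<day>\d\d)-(?P<month>\w+)-(?P<year>\d\d)'
                  r' (?P<hour>\d{1,2}):(?P<minute>\d\d):(?P<second>\d\d)'
                  r' (?P<noon>\w\w) (?P<tz>\w+)')

MONTHS = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4,
          'May': 5, 'Jun': 6, 'Jul': 7, 'Aug': 8,
          'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}


def parse_with_regex(date):
    """Turn one timestamp string into a naive datetime (time zone ignored)."""
    res = PATT.match(date)
    if res is None:
        raise ValueError('unrecognised timestamp: %r' % date)
    g = res.groupdict()
    # 12-hour to 24-hour clock: 12 AM -> 0, 12 PM -> 12
    hour = int(g['hour']) % 12 + (12 if g['noon'] == 'PM' else 0)
    return datetime.datetime(2000 + int(g['year']),   # assumption: years are 20xx
                             MONTHS[g['month']], int(g['day']),
                             hour, int(g['minute']), int(g['second']))


print(parse_with_regex('19-May-14 3:36:00 PM PDT'))
```

Compiling the pattern once outside the loop matters here, since it is reused for every one of the 180,000 lines.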

      

0




First, some questions.

  • You show the hour as 1 or 2 characters. Does the day vary in width too, or is it always 2 characters?
  • What do you want to do with the time zone? Discard it?
  • How do you handle years that look like they are from the 1900s? Do you need to deal with future dates? Are you sure the year 48 means 1948 and not 2048?

Here's what I would try. First, create lookup dictionaries for the year and month.

months = {'Jan': '01', 'Feb': '02', ... 'Dec': '12'} 
years = {}
for i in range(50, 100):
    years[str(i)] = '19' + str(i)
for i in range(0, 50):
    # zero-pad so the key is '05', not '5', and the value is '2005', not '205'
    years['%02d' % i] = '20%02d' % i

      

Loop over each entry and

  • split each line on spaces
  • extract the day, month and year substrings from the date field. Look up the year and month in the dictionaries. Use the day as is.
  • Split the hour, minutes and seconds out of the time field. Minutes and seconds are fine in text form.
  • Convert the hour to an integer. Add 12 if the 3rd field from the split operation is "PM" (leave 12 PM as 12, and turn 12 AM into 0); normalize the case first if necessary.
  • Assemble everything in your target format. Pad the hour with a leading zero if it is only one character.

It might be worth checking whether the year dictionary is actually faster than converting two-digit years to ints, checking the value, and adding 1900 or 2000 depending on the cutoff you choose. I would expect the dictionary to win, but it's hard to say.
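
The steps above can be sketched as follows. This is only an illustration under stated assumptions: the function name reformat is hypothetical, the cutoff of 50 for the century is the one used in the dictionary above, and the time zone field is dropped.

```python
MONTHS = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
          'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
          'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}

# Two-digit year -> four-digit year; 50-99 map to 19xx, 00-49 to 20xx (assumed cutoff)
YEARS = {}
for i in range(50, 100):
    YEARS['%02d' % i] = '19%02d' % i
for i in range(0, 50):
    YEARS['%02d' % i] = '20%02d' % i


def reformat(line):
    """Return 'YYYY-MM-DD HH:MM:SS' for a 'DD-Mon-YY H:MM:SS AM TZ' line."""
    date_part, time_part, am_pm = line.split()[:3]   # the time zone field is dropped
    day, month, year = date_part.split('-')
    hour, minute, second = time_part.split(':')
    # 12-hour to 24-hour clock: 12 AM -> 0, 12 PM -> 12
    hour = int(hour) % 12 + (12 if am_pm.upper() == 'PM' else 0)
    return '%s-%s-%s %02d:%s:%s' % (YEARS[year], MONTHS[month], day.zfill(2),
                                    hour, minute, second)


print(reformat('19-May-14 3:36:00 PM PDT'))
```

Only the hour is converted to an integer; everything else stays in text form, which is the point of the lookup dictionaries.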

0




Assuming the "14" in your date string is 2014:

import datetime

month_abbr = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
              'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}

def format_date(date_str):
    # Fields: date, time, AM/PM marker, time zone (the time zone is ignored)
    date_part, time_part, am_pm = date_str.split()[:3]
    day, month, year = date_part.split('-')
    hour, minute, sec = time_part.split(':')
    # Convert the 12-hour clock to 24-hour: 12 AM -> 0, 12 PM -> 12
    hour = int(hour) % 12 + (12 if am_pm == 'PM' else 0)
    return datetime.datetime(int(year) + 2000, month_abbr[month],
                             int(day), hour, int(minute), int(sec))


date_str = '19-May-14 3:36:00 PM PDT'
#date_str = '19-May-14 10:37:00 PM PDT'
formatted_date = format_date(date_str)
print(formatted_date)
# prints: 2014-05-19 15:36:00

      

The default string format of a datetime object is YYYY-MM-DD HH:MM:SS, so you don't need to specify a format in this case. If you need a different one in the future, look at the strftime function in datetime.

If the two-digit year can fall in either the 1900s or the 2000s, you need to (1) know which century applies before parsing the date string and (2) tweak the above code to add 1900 or 2000 to the year accordingly.
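
One way to make that choice is a fixed pivot year. The sketch below is an assumption, not part of the code above: both the helper name expand_year and the pivot value of 50 are hypothetical and should be chosen to suit the data.

```python
def expand_year(two_digit_year, pivot=50):
    """Map a two-digit year to four digits: values >= pivot become 19xx, the rest 20xx."""
    yy = int(two_digit_year)
    return (1900 if yy >= pivot else 2000) + yy


print(expand_year('14'))
print(expand_year('75'))
```

In format_date, the expression int(year)+2000 would then become expand_year(year).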

0








