Parsing unformatted dates in Python

Question

I have text taken from different sites from which I want to extract dates. As you might imagine, the dates vary significantly in how they are formatted, and look something like this:

Posted: 10/01/2014 
Published on August 1st 2014
Last modified on 5th of July 2014
Posted by Dave on 10-01-14

I would like to know if anyone knows of a Python library [or API] that would help with this (other than e.g. regex, which will be my fallback). I could probably remove the "posted on" parts relatively easily, but getting the rest into a consistent form isn't easy.

+3

python python-2.7 parsing

kyrenia Apr 16 15 at 18:46

2 answers

You can use the Arrow library:

arrow.get('10/01/2014', ['MM/DD/YYYY', 'MM-DD-YYYY'])

It takes two arguments: the string to parse, and a list of formats to try. The string has to match one of the listed formats.
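
For comparison, the same "try each format in turn" idea can be sketched with only the standard library. This is not Arrow's implementation, just an illustration of the approach; the format list and the helper name parse_with_formats are made up for the example. Note that strptime must match the whole string, so prefixes such as "Posted: " would have to be stripped first.

```python
from datetime import datetime

# Hypothetical format list for illustration; extend it to match your data.
CANDIDATE_FORMATS = ["%m/%d/%Y", "%m-%d-%y"]


def parse_with_formats(text, formats=CANDIDATE_FORMATS):
    """Return the first successful strptime result, or None if no format fits."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            # This format did not match; try the next one.
            continue
    return None
```

Returning None for unmatched input makes a failure explicit, which is the property the accepted answer below has to work around with dateutil.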

0

dizballanze Apr 16 15 at 18:59

kyrenia · Accepted Answer · 2015-04-16T19:59:38+0000

My solution using dateutil

Following Lucas' suggestion, I used the dateutil package (it seemed much more flexible than Arrow) with the fuzzy option, which basically ignores anything that is not a date.

Caveat about fuzzy parsing using dateutil

The main thing to note is that, as mentioned in the thread "Failure when parsing a date using dateutil", if it cannot parse the day, month, or year, it takes the value from the default (which is the current day if not specified), and as far as I can tell, no flag is returned to indicate that it fell back to the default.

This means that "random text" returns today's date (2015-04-16 at the time of writing), which could cause problems.
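
A small sketch of that silent fill-in, assuming dateutil is installed. The input string here is made up for the demonstration; it contains a month and a year but no day, so the day is copied from whichever default is passed.

```python
from datetime import datetime
from dateutil.parser import parse

text = "August 2014"  # illustrative input with no day of the month

# The missing day is copied from the supplied default, with no warning.
with_default_a = parse(text, fuzzy=True, default=datetime(2001, 1, 1))
with_default_b = parse(text, fuzzy=True, default=datetime(2002, 2, 2))

print(with_default_a.date())  # 2014-08-01
print(with_default_b.date())  # 2014-08-02
```

The two results differ only in the field that was never in the text, which is what makes the two-pass comparison below possible.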

Solution

Since I really want to know when it fails, rather than have the date filled in from the default, I ended up running the parser twice with different defaults and then checking whether a field took the default value in both runs. If it did not, I assumed that field had been parsed correctly.

from datetime import datetime
from dateutil.parser import parse

def extract_date(text):
    # Parse twice with different defaults. A field that matches the
    # default in both runs was not found in the text.
    date = {}
    date_1 = parse(text, fuzzy=True, default=datetime(2001, 1, 1))
    date_2 = parse(text, fuzzy=True, default=datetime(2002, 2, 2))

    if date_1.day == 1 and date_2.day == 2:
        date["day"] = "XX"
    else:
        date["day"] = date_1.day

    if date_1.month == 1 and date_2.month == 2:
        date["month"] = "XX"
    else:
        date["month"] = date_1.month

    if date_1.year == 2001 and date_2.year == 2002:
        date["year"] = "XXXX"
    else:
        date["year"] = date_1.year

    return date

print(extract_date("Posted: by dave August 1st"))

Obviously this is a bit of a botch (so if anyone has a more elegant solution, please share), but it correctly parsed the four examples I had above [although it assumed the US format for the date 10/01/2014 rather than the UK format], and returned XX appropriately for missing data.
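
On the US versus UK ambiguity: dateutil's parse accepts a dayfirst flag, so one possible way to spot ambiguous dates is to parse both ways and compare. A rough sketch, not part of the solution above; the helper names parse_both_orders and is_ambiguous are made up for illustration.

```python
from dateutil.parser import parse


def parse_both_orders(text):
    """Parse text twice: month-first (dateutil's default) and day-first."""
    month_first = parse(text, fuzzy=True)
    day_first = parse(text, fuzzy=True, dayfirst=True)
    return month_first.date(), day_first.date()


def is_ambiguous(text):
    """True when the two readings disagree, e.g. for 10/01/2014."""
    month_first, day_first = parse_both_orders(text)
    return month_first != day_first
```

For "10/01/2014" the two readings are 1 October and 10 January, so it is flagged as ambiguous; for "25/12/2014" both readings agree, because 25 cannot be a month.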
