Parsing Text - Date Recognizer
Does anyone know if there is a Python text parser that recognizes inline dates? For example, given the sentence
"bla bla bla bla 12 Jan 14 bla bla bla 01/04/15 bla bla bla"
the parser would pick out two dates. I know of some Java tools, but is there one for Python? Would NLTK be overkill?
Thanks.
Here is an attempt at a non-deterministic (read: exhaustive) approach to finding dates in tokenized text. It enumerates all the ways to partition a sentence (as a list of tokens) into consecutive sections whose sizes range from minps to maxps.
Each partition is fed to a parser, which outputs a list of parsed dates together with the range of tokens each date was parsed from.
Each parse result is scored with the sum of the squared token counts of its dates (so one date parsed from 4 tokens is preferred over 2 dates parsed from two tokens each).
Finally, it finds and returns the parse with the best score.
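As a small worked example of that scoring rule, here is a sketch that assumes each parsed date is represented only by its (start, end) token indices. The name span_score is illustrative and is not part of the code below.

```python
def span_score(spans):
    # Sum of squared span lengths; end indices are exclusive.
    return sum((end - start) ** 2 for start, end in spans)

one_long = [(3, 7)]            # one date covering 4 tokens
two_short = [(3, 5), (5, 7)]   # two dates covering 2 tokens each

print(span_score(one_long))    # 16
print(span_score(two_short))   # 8
```

Squaring the lengths is what makes the single 4-token match win even though both parses cover the same four tokens.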
Three building blocks of the algorithm:
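from dateutil.parser import parse as parsedate

def partition(lst, minps, maxps, i=0):
    # Yield every way to split lst into consecutive sections of
    # minps..maxps tokens, as lists of (start_index, section) pairs.
    if lst == []:
        yield []
    else:
        for l in range(minps, maxps+1):
            if l > len(lst): continue
            for z in partition(lst[l:], minps, maxps, i+l):
                yield [(i, lst[:l])] + z

def parsedates(p):
    # Try to parse each section of a partition as a date.
    for x in p:
        i, pi = x
        try:
            d = parsedate(' '.join(pi))
            # output: (startIndex, endIndex, parsedDate)
            if d: yield i, i+len(pi), d
        except Exception:
            # sections that do not parse as a date are skipped
            pass

def score(p):
    # Sum of squared section lengths, so longer matches win.
    score = 0
    for pi in p:
        score += (pi[1]-pi[0])**2
    return score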
Finding the parse with the best score:
def bestparse(toks, maxps=3):
    bestscore = 0
    best = []  # stays empty when no date is found
    for ps in partition(toks, 1, maxps):
        l = list(parsedates(ps))
        s = score(l)
        if s > bestscore:
            bestscore = s
            best = l
    return best
Some tests:
l = ['bla', 'bla', 'bla', '12', 'Jan', '14', 'bla', 'bla', 'bla', '01/04/15', 'bla', 'bla']
for bpi in bestparse(l):
    print('found date %s at tokens %s' % (bpi[2], ','.join(map(str, range(*bpi[:2])))))

found date 2014-01-12 00:00:00 at tokens 3,4,5
found date 2015-01-04 00:00:00 at tokens 9
l = ['Fred', 'was', 'born', 'on', '23/1/99', 'at', '23:30']
for bpi in bestparse(l, 5):
    print('found date %s at tokens %s' % (bpi[2], ','.join(map(str, range(*bpi[:2])))))

found date 1999-01-23 23:30:00 at tokens 3,4,5,6
Beware that this can be very computationally expensive, because the number of partitions grows exponentially with the number of tokens. You might want to run it on one short phrase at a time rather than on the entire document. You can even split long phrases into chunks.
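To get a feel for the cost, the following sketch counts how many partitions would be enumerated for a given number of tokens without generating them. count_partitions is a hypothetical helper introduced here for illustration; it assumes the same rule as the partitioning step, namely consecutive sections of minps to maxps tokens.

```python
from functools import lru_cache

def count_partitions(n, minps, maxps):
    # Number of ways to split n tokens into consecutive sections
    # whose sizes all lie between minps and maxps (inclusive).
    @lru_cache(maxsize=None)
    def count(remaining):
        if remaining == 0:
            return 1
        return sum(count(remaining - size)
                   for size in range(minps, maxps + 1)
                   if size <= remaining)
    return count(n)

print(count_partitions(12, 1, 3))  # 927
print(count_partitions(7, 1, 5))   # 61
```

Every one of those partitions triggers a round of date parsing, which is why keeping the input to short phrases matters.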
Another area for improvement is the partitioning function. If you have prior information, for example the maximum number of dates a single sentence can contain, the number of ways to partition it can be reduced significantly.
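One possible way to use such prior information is sketched below. It is only one interpretation: it assumes the limit applies to the number of multi-token sections, since single tokens are cheap to try on their own. The names partition_limited and max_multi are hypothetical and do not come from the code above.

```python
def partition_limited(tokens, maxps, max_multi, start=0):
    # Yield partitions as lists of (start_index, section) pairs,
    # allowing at most max_multi sections longer than one token.
    if not tokens:
        yield []
        return
    for size in range(1, min(maxps, len(tokens)) + 1):
        if size > 1 and max_multi == 0:
            break  # no multi-token sections left to spend
        remaining = max_multi - (1 if size > 1 else 0)
        for rest in partition_limited(tokens[size:], maxps, remaining, start + size):
            yield [(start, tokens[:size])] + rest

toks = ['a', 'b', 'c', 'd']
print(len(list(partition_limited(toks, 3, 4))))  # 7, effectively unrestricted
print(len(list(partition_limited(toks, 3, 1))))  # 6, drops the 2+2 split
```

The saving is small for four tokens but grows quickly with sentence length, because most partitions of a long sentence contain many multi-token sections.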