Parsing Text - Date Recognizer

Does anyone know if there is a Python text parser that recognizes inline dates? For example, given the sentence

"bla bla bla bla 12 Jan 14 bla bla bla 01/04/15 bla bla bla"

the parser would pick out two dates. I know of some Java tools, but is there one for Python? Would NLTK be overkill?

Thanks.

1 answer


Here is an attempt at a non-deterministic (read: exhaustive) solution to the problem of finding where the dates are in tokenized text. It enumerates all the ways to partition a sentence (as a list of tokens) into consecutive parts of between minps and maxps tokens each.

Each partition is passed to a parser, which outputs a list of the dates it parsed, each with the range of tokens it was parsed from.

Each parser result is scored by the sum of the squares of the token counts of its dates (so one date parsed from 4 tokens is preferred over 2 dates parsed from two tokens each).
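As a quick check of that arithmetic, here is a minimal sketch of the scoring rule applied to the two cases just mentioned. `span_score` is a hypothetical helper used only for illustration; it takes (start, end) token ranges:

```python
def span_score(spans):
    """Sum of squared span lengths; each span is a (start, end) token range."""
    return sum((end - start) ** 2 for start, end in spans)

one_long = [(0, 4)]             # one date covering 4 tokens
two_short = [(0, 2), (2, 4)]    # two dates covering 2 tokens each

print(span_score(one_long))     # 16
print(span_score(two_short))    # 8
```

Squaring the lengths is what makes the single longer date win, since both cases cover the same four tokens.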

Finally, it finds and outputs the parse with the best score.

Three building blocks of the algorithm:

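from dateutil.parser import parse as parsedate

def partition(lst, minps, maxps, i=0):
    """Yield every way to split lst into consecutive parts of minps..maxps tokens.

    Each part is a tuple (start_index, tokens).
    """
    if not lst:
        yield []
        return
    for l in range(minps, min(maxps, len(lst)) + 1):
        for z in partition(lst[l:], minps, maxps, i + l):
            yield [(i, lst[:l])] + z

def parsedates(p):
    """Yield (start_index, end_index, parsed_date) for each part that parses as a date."""
    for i, pi in p:
        try:
            d = parsedate(' '.join(pi))
        except Exception:
            # this part could not be parsed as a date; skip it
            continue
        if d:
            yield i, i + len(pi), d

def score(p):
    """Sum of squared token counts, so one long date span outscores several short ones."""
    return sum((end - start) ** 2 for start, end, _ in p)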

Finding the parse with the best score:

def bestparse(toks, maxps=3):
    """Return the highest-scoring list of (start, end, date) tuples, or [] if no date is found."""
    bestscore = 0
    best = []
    for ps in partition(toks, 1, maxps):
        l = list(parsedates(ps))
        s = score(l)
        if s > bestscore:
            bestscore = s
            best = l
    return best

Some tests:

l=['bla', 'bla', 'bla', '12', 'Jan', '14', 'bla', 'bla', 'bla', '01/04/15', 'bla', 'bla']
for bpi in bestparse(l):
    print('found date %s at tokens %s' % (bpi[2], ','.join(map(str, range(*bpi[:2])))))


found date 2014-01-12 00:00:00 at tokens 3,4,5
found date 2015-01-04 00:00:00 at tokens 9

l=['Fred', 'was', 'born', 'on', '23/1/99', 'at', '23:30']
for bpi in bestparse(l, 5):
    print('found date %s at tokens %s' % (bpi[2], ','.join(map(str, range(*bpi[:2])))))


found date 1999-01-23 23:30:00 at tokens 3,4,5,6
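The tests above start from token lists. A minimal sketch of producing such a list from a raw sentence, assuming whitespace-separated words; `tokenize` is a hypothetical helper, not part of the code above:

```python
import string

def tokenize(sentence):
    """Split on whitespace and trim punctuation from the ends of each token."""
    stripped = (tok.strip(string.punctuation) for tok in sentence.split())
    # Drop tokens that consisted of punctuation only.
    return [tok for tok in stripped if tok]

toks = tokenize("Fred was born on 23/1/99, at 23:30.")
print(toks)  # ['Fred', 'was', 'born', 'on', '23/1/99', 'at', '23:30']
```

Only the ends of each token are trimmed, so separators inside a token such as 23/1/99 or 23:30 are kept.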

Beware that this can be very computationally expensive, so you might want to run it on one short phrase at a time rather than on the entire document. You can even split long phrases into chunks.
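To get a feel for the cost, the number of partitions can be counted without generating them. This is a rough sketch; `count_partitions` is a hypothetical helper that counts the ways to split n tokens into consecutive parts of minps to maxps tokens:

```python
def count_partitions(n, minps=1, maxps=3):
    """Count the ways to split n tokens into consecutive parts of minps..maxps tokens."""
    counts = [1] + [0] * n          # counts[0] = 1: the empty list has one (empty) split
    for size in range(1, n + 1):
        # The first part takes k tokens; the remaining size - k are split recursively.
        counts[size] = sum(counts[size - k]
                           for k in range(minps, min(maxps, size) + 1))
    return counts[n]

print(count_partitions(12))   # 927
print(count_partitions(24))   # 1389537
```

Doubling the sentence length from 12 to 24 tokens takes the count from under a thousand to over a million, and every partition is handed to the parser, which is why short phrases are the practical input.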

Another point for improvement is the partition function. If you have prior information, for example that a sentence contains no more than one date, the number of ways to split it can be reduced significantly.
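If at most one date is expected per sentence, one possible shortcut is to skip partitioning entirely and slide a window over the tokens, trying the longest spans first. This is only a sketch under that assumption: `best_single_date` and `strptime_parser` are hypothetical names, and the stand-in parser with its two fixed formats is there only to keep the example self-contained (`dateutil.parser.parse` could be passed in its place).

```python
from datetime import datetime

def strptime_parser(text, formats=("%d %b %y", "%m/%d/%y")):
    """Stand-in parser: try a few fixed formats, raise ValueError if none match."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError("not a date: %r" % text)

def best_single_date(tokens, parse, max_span=3):
    """Return (start, end, date) for the longest span that parses, or None."""
    for size in range(min(max_span, len(tokens)), 0, -1):    # longest spans first
        for start in range(len(tokens) - size + 1):
            try:
                date = parse(' '.join(tokens[start:start + size]))
            except (ValueError, OverflowError):
                continue                                      # not a date; try the next span
            return start, start + size, date
    return None

tokens = ['Fred', 'was', 'born', 'on', '12', 'Jan', '14', 'in', 'Leeds']
print(best_single_date(tokens, strptime_parser))
```

This makes at most len(tokens) * max_span parse attempts, so the cost grows linearly with sentence length instead of with the number of partitions.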
