Extracting related values from text using NLP

Question

I want to extract the Cardinal (CD) values associated with units and store them in a dictionary. For example, if the text contains tokens like "20 kilograms", it should extract it and store it in the dictionary.

Example:

For the input text "10-inch frying pan offers excellent heat conductivity and distribution", the output dictionary should look like this: {"dimension": "10-inch"}

For the input text "This bucket contains 5 litres of water", the output should look like this: {"volume": "5 litres"}

import nltk

line = 'This bucket holds 5 litres of water.'
tokenized = nltk.word_tokenize(line)
tagged = nltk.pos_tag(tokenized)

The code above gives this result:

[('This', 'DT'), ('bucket', 'NN'), ('holds', 'VBZ'), ('5', 'CD'), ('litres', 'NNS'), ('of', 'IN'), ('water', 'NN'), ('.', '.')]

Is there a way to extract the CD and unit-of-measure (UOM) values from the text?

+3

python nlp nltk

Vaulstein Dec 15 '14 at 15:39

2 answers

Hm, not sure if this helps, but I wrote something like this in JavaScript. Here: http://github.com/redaktor/nlp_compromise

It might be a little undocumented, but the guys are now migrating it to the 2.0 branch.

This should be easy to port to Python; see "What is the difference between Python and Javascript regular expressions?"

Q: Have you checked Python's NLTK? http://www.nltk.org
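
For what a regex-based port might look like in Python, here is a minimal sketch. It is not taken from nlp_compromise; the UNIT_CATEGORIES table, the "weight" key, and the extract_quantities name are assumptions made up for illustration, and the unit list would need extending for real text.

```python
import re

# Hypothetical lookup table (an assumption, not part of any library):
# unit spelling -> dictionary key.
UNIT_CATEGORIES = {
    "inch": "dimension", "inches": "dimension",
    "litre": "volume", "litres": "volume",
    "liter": "volume", "liters": "volume",
    "kilogram": "weight", "kilograms": "weight",
}

# Longest unit names first so "inches" is tried before "inch".
_UNITS = "|".join(sorted(UNIT_CATEGORIES, key=len, reverse=True))

# A number, then a hyphen or one whitespace character, then a known unit word.
QUANTITY_RE = re.compile(
    r"\b(\d+(?:\.\d+)?)([-\s])(" + _UNITS + r")\b", re.IGNORECASE
)


def extract_quantities(text):
    """Return {category: matched text} for each number+unit pair found."""
    result = {}
    for match in QUANTITY_RE.finditer(text):
        category = UNIT_CATEGORIES[match.group(3).lower()]
        # A later match in the same category overwrites an earlier one.
        result[category] = match.group(0)
    return result


print(extract_quantities("10-inch frying pan offers excellent heat conductivity"))
print(extract_quantities("This bucket holds 5 litres of water."))
```

Matching on the raw text also catches hyphenated forms like "10-inch", which a tokenizer may keep as a single token rather than a CD followed by a noun.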

+1

sebilasse 02 Sep 15 at 17:45

bogs · Accepted Answer · 2014-12-16T18:04:59+0000

Not sure how flexible this process should be. You can play with nltk.RegexpParser and come up with some good patterns:

import nltk

sentence = 'This bucket holds 5 litres of water.'

# Chunk a cardinal number (CD) followed by a plural noun (NNS).
parser = nltk.RegexpParser(
    """
    INDICATOR: {<CD><NNS>}
    """)

print(parser.parse(nltk.pos_tag(nltk.word_tokenize(sentence))))

Output:

(S
  This/DT
  bucket/NN
  holds/VBZ
  (INDICATOR 5/CD litres/NNS)
  of/IN
  water/NN
  ./.)

You can also build an annotated corpus and train a chunker on it.
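
To get from the tagged tokens to the dictionary the question asks for, one option is to pair each CD token with the noun that follows it and look the unit up in a table. The sketch below works directly on the (word, tag) list shown in the question, so it runs without NLTK; UNIT_CATEGORIES, the "weight" key, and quantities_from_tags are hypothetical names introduced here, not NLTK APIs.

```python
# Hypothetical lookup table (an assumption for illustration):
# unit word -> dictionary key.
UNIT_CATEGORIES = {
    "inch": "dimension", "inches": "dimension",
    "litre": "volume", "litres": "volume",
    "kilogram": "weight", "kilograms": "weight",
}


def quantities_from_tags(tagged):
    """Pair each CD token with the NN/NNS token that immediately follows it."""
    result = {}
    # Walk over adjacent (token, next token) pairs.
    for (word, tag), (next_word, next_tag) in zip(tagged, tagged[1:]):
        if tag == "CD" and next_tag in ("NN", "NNS"):
            category = UNIT_CATEGORIES.get(next_word.lower())
            if category is not None:  # skip nouns that are not known units
                result[category] = word + " " + next_word
    return result


# The tagged output from the question.
tagged = [('This', 'DT'), ('bucket', 'NN'), ('holds', 'VBZ'), ('5', 'CD'),
          ('litres', 'NNS'), ('of', 'IN'), ('water', 'NN'), ('.', '.')]
print(quantities_from_tags(tagged))  # {'volume': '5 litres'}
```

The same lookup could be applied to the INDICATOR subtrees produced by the parser above instead of scanning the flat list.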
