Extracting related values from text using NLP
I want to extract cardinal (CD) values together with their units of measurement (UOM) and store them in a dictionary. For example, if the text contains tokens like "20 kilograms", they should be extracted and stored in the dictionary.
Examples:
- For the input text "10-inch frying pan offers excellent heat conductivity and distribution", the output dictionary should look like this:
{"dimension": "10-inch"}
- For the input text "This bucket contains 5 litres of water", the output should look like this:
{"volume": "5 litres"}
import nltk
line = 'This bucket holds 5 litres of water.'
tokenized = nltk.word_tokenize(line)
tagged = nltk.pos_tag(tokenized)
Printing tagged gives this result:
[('This', 'DT'), ('bucket', 'NN'), ('holds', 'VBZ'), ('5', 'CD'), ('litres', 'NNS'), ('of', 'IN'), ('water', 'NN'), ('.', '.')]
Is there a way to extract CD and UOM values from text?
I'm not sure how flexible this needs to be, but you can experiment with nltk.RegexpParser and come up with patterns that fit your data:
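import nltk
sentence = 'This bucket holds 5 litres of water.'
# Chunk grammar: a cardinal number (CD) followed by a plural noun (NNS).
parser = nltk.RegexpParser(
"""
INDICATOR: {<CD><NNS>}
""")
# Tokenize, POS-tag, then chunk the sentence and print the resulting tree.
print(parser.parse(nltk.pos_tag(nltk.word_tokenize(sentence))))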
Output:
(S
This/DT
bucket/NN
holds/VBZ
(INDICATOR 5/CD litres/NNS)
of/IN
water/NN
./.)
You can also create a corpus and prepare a chunker.
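As a minimal sketch of the step from tagged tokens to the dictionary asked for in the question, the function below pairs each CD token with the word that follows it and looks the unit up in a small table. The table UNIT_CATEGORIES, the function name extract_measurements and the hyphenated-token pattern are illustrative assumptions, not part of NLTK. The input is a list of (token, tag) pairs in the shape nltk.pos_tag returns, so the sketch itself needs no NLTK data.

```python
import re

# Hypothetical unit-to-category table (an assumption); extend it for your own data.
UNIT_CATEGORIES = {
    "inch": "dimension", "inches": "dimension",
    "litre": "volume", "litres": "volume",
    "liter": "volume", "liters": "volume",
    "kilogram": "weight", "kilograms": "weight",
}

# Matches a single token such as '10-inch' or '2.5-litre'.
HYPHENATED = re.compile(r"^(\d+(?:\.\d+)?)-([A-Za-z]+)$")


def extract_measurements(tagged):
    """Return {category: 'number unit'} for measurements found in (token, tag) pairs."""
    found = {}
    for i, (token, tag) in enumerate(tagged):
        # Case 1: a cardinal followed by a known unit, e.g. ('5', 'CD'), ('litres', 'NNS').
        if tag == "CD" and i + 1 < len(tagged):
            unit = tagged[i + 1][0]
            category = UNIT_CATEGORIES.get(unit.lower())
            if category:
                found[category] = f"{token} {unit}"
        # Case 2: one hyphenated token such as '10-inch', whatever tag it received.
        match = HYPHENATED.match(token)
        if match:
            category = UNIT_CATEGORIES.get(match.group(2).lower())
            if category:
                found[category] = token
    return found


tagged = [('This', 'DT'), ('bucket', 'NN'), ('holds', 'VBZ'), ('5', 'CD'),
          ('litres', 'NNS'), ('of', 'IN'), ('water', 'NN'), ('.', '.')]
print(extract_measurements(tagged))  # {'volume': '5 litres'}
```

A later measurement in the same category overwrites an earlier one here; if a text can contain several, storing a list per category would be the obvious change.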
Hm, not sure if this helps, but I wrote something along these lines in JavaScript: http://github.com/redaktor/nlp_compromise
It is sparsely documented, and the maintainers are currently migrating it to the 2.0 branch.
It should be easy to port to Python; see the question "What is the difference between Python and Javascript regular expressions?" for the differences to watch for.
Q: Have you checked Python's NLTK? http://www.nltk.org