Extracting related values from text using NLP

I want to extract cardinal (CD) values together with their associated units and store them in a dictionary. For example, if the text contains tokens like "20 kilograms", the pair should be extracted and stored in the dictionary.

Example:

  • for the text input: "10-inch frying pan offers excellent heat conductivity and distribution", the output dictionary should look like this: {"dimension": "10-inch"}

  • for the text input: "This bucket contains 5 litres of water", the output should look like this: {"volume": "5 litres"}

    import nltk

    line = 'This bucket holds 5 litres of water.'
    tokenized = nltk.word_tokenize(line)
    tagged = nltk.pos_tag(tokenized)
    print(tagged)

The code above gives this result:

[('This', 'DT'), ('bucket', 'NN'), ('holds', 'VBZ'), ('5', 'CD'), ('litres', 'NNS'), ('of', 'IN'), ('water', 'NN'), ('.', '.')]

      

Is there a way to extract the CD and UOM (unit of measure) values from the text?



2 answers


Not sure how flexible this process should be. You can play with nltk.RegexpParser and come up with some good patterns:

import nltk

sentence = 'This bucket holds 5 litres of water.'

parser = nltk.RegexpParser(
    """
    INDICATOR: {<CD><NNS>}
    """)

print(parser.parse(nltk.pos_tag(nltk.word_tokenize(sentence))))

      

Output:



(S
  This/DT
  bucket/NN
  holds/VBZ
  (INDICATOR 5/CD litres/NNS)
  of/IN
  water/NN
  ./.)

      

You can also build an annotated corpus and train a chunker on it.
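To get from the tagged tokens to the dictionary the question asks for, one option is to pair each CD token with the noun that follows it and look the noun up in a unit table. The sketch below works on the tagged list from the question so that it runs without the NLTK data files; `UNIT_CATEGORIES` and `extract_measurements` are hypothetical names introduced here, and the unit table is only an assumption about which units matter.

```python
# Hypothetical lookup table (an assumption, not part of NLTK):
# unit word -> key to use in the output dictionary.
UNIT_CATEGORIES = {
    "litre": "volume", "litres": "volume",
    "liter": "volume", "liters": "volume",
    "kilogram": "weight", "kilograms": "weight",
    "inch": "dimension", "inches": "dimension",
}


def extract_measurements(tagged):
    """Pair each CD token with the noun right after it, keyed by unit category."""
    result = {}
    # Walk over adjacent (token, tag) pairs.
    for (word, tag), (next_word, next_tag) in zip(tagged, tagged[1:]):
        if tag == "CD" and next_tag in ("NN", "NNS"):
            category = UNIT_CATEGORIES.get(next_word.lower())
            if category is not None:  # skip nouns that are not known units
                result[category] = f"{word} {next_word}"
    return result


# Tagged output copied from the question.
tagged = [('This', 'DT'), ('bucket', 'NN'), ('holds', 'VBZ'), ('5', 'CD'),
          ('litres', 'NNS'), ('of', 'IN'), ('water', 'NN'), ('.', '.')]
print(extract_measurements(tagged))  # {'volume': '5 litres'}
```

A hyphenated form such as "10-inch" is likely to come out of the tokenizer as a single token rather than a CD followed by a noun, so it would need separate handling.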



Hm, not sure if this helps, but I wrote it in JavaScript. Here: http://github.com/redaktor/nlp_compromise

It might be a little undocumented, but the guys are now migrating it to the 2.0 branch.



This should be easy to port to Python, given the question "What is the difference between Python and Javascript regular expressions?"

Q: Have you checked Python's NLTK? http://www.nltk.org
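For a regex-only approach in Python, a minimal sketch could look like the following. It is not a port of the library linked above; `UNITS`, `MEASUREMENT` and `extract_with_regex` are hypothetical names, and the unit list is an assumption to be extended. Unlike matching on CD tags, it also catches hyphenated forms like "10-inch".

```python
import re

# Hypothetical unit table (an assumption): category -> spellings to recognise.
UNITS = {
    "dimension": ["inches", "inch"],
    "volume": ["litres", "liters", "litre", "liter"],
    "weight": ["kilograms", "kilogram"],
}

# Map each spelling back to its category.
CATEGORY_OF = {unit: cat for cat, units in UNITS.items() for unit in units}

# A number, an optional space or hyphen, then a unit word.
# Longest spellings come first so "litres" wins over "litre".
_alternatives = "|".join(sorted(CATEGORY_OF, key=len, reverse=True))
MEASUREMENT = re.compile(
    r"\b(\d+(?:\.\d+)?)[ -]?(" + _alternatives + r")\b",
    re.IGNORECASE,
)


def extract_with_regex(text):
    """Return {category: matched text} for every number+unit found in text."""
    result = {}
    for match in MEASUREMENT.finditer(text):
        unit = match.group(2).lower()
        result[CATEGORY_OF[unit]] = match.group(0)
    return result


print(extract_with_regex("10-inch frying pan offers excellent heat conductivity"))
# {'dimension': '10-inch'}
print(extract_with_regex("This bucket holds 5 litres of water."))
# {'volume': '5 litres'}
```

If a text mentions several values of the same category, this sketch keeps only the last one; storing a list per category would avoid that.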
