Identifying names in a string
I would like to find a good way to identify the names of people, places, etc. in user searches on my site. For example, if a user asks "how old is George Washington", I need to know from a predefined list that George Washington is a person.
Some of the lists will be global and some will be user-specific. For example, if a user asks "how old is John Smith", I may only want to identify the specific John Smith who is that user's associate, and I would not want to identify him as a person if he is not an associate.
Is there an NLP library, or a way of scanning these lists, that would let me take advantage of Soundex matching, mature NLP features, misspelling tolerance, etc.? I could write this by hand, but I would rather use something mature. Thank you.
What you need is called Named Entity Recognition.
One of the best programs available for this is the Stanford NER tagger from the Stanford NLP group: http://nlp.stanford.edu/software/CRF-NER.shtml (written in Java)
If you are on a different platform, there are good open source projects in Ruby and Python. Search for Named Entity Recognition.
The natural language processing (NLP) task you are looking for is called Named Entity Recognition (NER).
Besides Stanford CRF-NER (in Java), a popular Python choice is the Natural Language Toolkit (NLTK), which is often used as a baseline for NER tasks.
You can try installing NLTK (its tokenizer, tagger, and chunker models have to be downloaded first with nltk.download()) and then run the following code:
>>> from nltk.tokenize import word_tokenize
>>> from nltk.tag import pos_tag
>>> from nltk.chunk import ne_chunk
>>> sentence = "How old is John Smith?"
>>> ne_chunk(pos_tag(word_tokenize(sentence)))
Tree('S', [('How', 'WRB'), ('old', 'JJ'), ('is', 'VBZ'), Tree('PERSON', [('John', 'NNP'), ('Smith', 'NNP')]), ('?', '.')])
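A statistical tagger like the one above does not know about your predefined or per-user lists, so you may still want a lookup step against them. The following is only a minimal sketch of that idea, using the standard library's difflib for misspelling tolerance rather than Soundex. The names GLOBAL_NAMES and find_known_names, the labels, and the 0.85 threshold are all hypothetical choices made for illustration, not part of NLTK or Stanford NER.

```python
from difflib import SequenceMatcher

# Hypothetical global list; a real one would come from your own data.
GLOBAL_NAMES = {"George Washington": "PERSON", "Mount Vernon": "PLACE"}


def find_known_names(query, names, threshold=0.85):
    """Return (name, label, score) for each listed name that fuzzily
    matches some run of consecutive words in the query.

    The threshold is an assumed value and would need tuning.
    """
    # Split on whitespace and strip surrounding punctuation from each word.
    tokens = [t.strip("?.,!\"'") for t in query.split()]
    tokens = [t for t in tokens if t]

    matches = []
    for name, label in names.items():
        width = len(name.split())
        best = 0.0
        # Slide a window with as many words as the name has.
        for i in range(len(tokens) - width + 1):
            window = " ".join(tokens[i:i + width])
            score = SequenceMatcher(None, window.lower(), name.lower()).ratio()
            best = max(best, score)
        if best >= threshold:
            matches.append((name, label, round(best, 2)))
    return matches


# Per-user names are merged in only for the user who owns them.
user_names = {"John Smith": "PERSON"}
combined = {**GLOBAL_NAMES, **user_names}

print(find_known_names("how old is George Washingtn", GLOBAL_NAMES))
print(find_known_names("how old is John Smith?", combined))
```

Keeping the global and per-user lists as separate dictionaries means "John Smith" is only recognized for the user whose list contains him. For large lists, comparing every name against every window will be slow, so an index or a phonetic key such as Soundex would probably be needed.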