Lemmatize plural nouns using nltk and wordnet
I want to lemmatize using the following code:
from nltk import word_tokenize, sent_tokenize, pos_tag
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet

lmtzr = WordNetLemmatizer()
POS = pos_tag(text)  # text is a list of tokens

def get_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag to the POS constant the WordNet lemmatizer understands
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default to noun for any other tag

# For the token at index i:
lmtzr.lemmatize(text[i], get_wordnet_pos(POS[i][1]))
The problem is that the POS tagger tags "procaspases" as "NNS", but how do I convert NNS to a WordNet POS? "procaspases" is still "procaspases" even after lemmatization.
+3
2 answers
NLTK's WordNet lemmatizer handles most plurals, including irregular ones; it does not just strip the ending.
import nltk
from nltk.stem.wordnet import WordNetLemmatizer

Lem = WordNetLemmatizer()
phrase = 'cobblers ants women boys needs finds binaries hobbies busses wolves'
words = phrase.split()
for word in words:
    lemword = Lem.lemmatize(word)
    print(lemword)
Output (each word is printed on its own line): cobbler ant woman boy need find binary hobby bus wolf
+4
I can lemmatize things easily using wordnet.morphy:
>>> from nltk.corpus import wordnet
>>> wordnet.morphy('cats')
u'cat'
Note that "procaspases" is not in WordNet ("caspases" is, however, and morphy will give "caspase" as its lemma), so your lemmatizer probably just doesn't recognize it. Unless you are having trouble lemmatizing other words, the word is probably just unknown to the implementation.
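If you still need a lemma for words WordNet doesn't know, one possible workaround is to try the dictionary lookup first and fall back to a few naive suffix rules. This is only a rough sketch, not something NLTK provides: `lemmatize_noun`, `FALLBACK_RULES`, `KNOWN` and `stub_lookup` are hypothetical names, and the stub stands in for `wordnet.morphy` so the example runs without the WordNet data installed.

```python
# Hypothetical fallback rules for noun plurals, tried in order.
# These are illustrative assumptions, not NLTK's own rule set.
FALLBACK_RULES = [
    ("ies", "y"),    # hobbies -> hobby
    ("ches", "ch"),  # churches -> church
    ("shes", "sh"),  # bushes -> bush
    ("xes", "x"),    # boxes -> box
    ("s", ""),       # procaspases -> procaspase
]


def lemmatize_noun(word, lookup):
    """Return lookup(word) if it knows the word, else apply naive suffix rules.

    `lookup` is any callable that returns a lemma or None.
    """
    lemma = lookup(word)
    if lemma is not None:
        return lemma
    # Leave very short words and words ending in "ss" (e.g. "glass") alone.
    if len(word) <= 3 or word.endswith("ss"):
        return word
    for suffix, replacement in FALLBACK_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


# Stand-in for wordnet.morphy: knows "caspases" but not "procaspases".
KNOWN = {"cats": "cat", "caspases": "caspase"}


def stub_lookup(word):
    return KNOWN.get(word)


print(lemmatize_noun("cats", stub_lookup))         # cat
print(lemmatize_noun("procaspases", stub_lookup))  # procaspase
```

With NLTK installed you could pass something like `lambda w: wordnet.morphy(w, wordnet.NOUN)` as `lookup` instead of the stub. The suffix rules will get some words wrong, so treat the fallback as a last resort for domain terms rather than a general lemmatizer.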
+3