How to find out whether a word exists in English using NLTK

This question has been asked many times, but I have not found a suitable answer. I need to use a corpus in NLTK to determine whether a word is an English word.

I tried to do:

wordnet.synsets(word)


This returns no synsets for many common words. Using an English word list and doing a file search is not an option, and using Enchant is not an option either. If there is another library that can do this, please point me to its API. If not, tell me which corpus in NLTK contains all English words.



2 answers


NLTK includes several corpora that are nothing more than word lists. The Words Corpus is the /usr/share/dict/words file from Unix, used by some spell checkers. We can use it to find unusual or misspelled words in a text, as shown below:

import nltk

def unusual_words(text):
    # Words that appear in the text but not in NLTK's Words Corpus
    text_vocab = set(w.lower() for w in text.split() if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab - english_vocab
    return sorted(unusual)
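The same set-difference idea can be sketched without downloading any NLTK data; here a tiny inline vocabulary stands in for `nltk.corpus.words.words()` (an assumption for illustration only; use the real corpus in practice):

```python
# Minimal sketch of the set-difference check. The inline "vocab" set is a
# stand-in for nltk.corpus.words.words(); swap the real corpus in for real use.
def unusual_words(text, vocab):
    text_vocab = set(w.lower() for w in text.split() if w.isalpha())
    return sorted(text_vocab - vocab)

vocab = {"the", "cat", "sat", "on", "mat"}
print(unusual_words("The catt sat on the matt", vocab))
# prints ['catt', 'matt']
```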




You can then check a word's membership in english_vocab:

>>> import nltk
>>> english_vocab = set(w.lower() for w in nltk.corpus.words.words())
>>> 'a' in english_vocab
True
>>> 'this' in english_vocab
True
>>> 'nothing' in english_vocab
True
>>> 'nothingg' in english_vocab
False
>>> 'corpus' in english_vocab
True
>>> 'Terminology'.lower() in english_vocab
True
>>> 'sorted' in english_vocab
True




I tried the above approach, but it was missing many words that should exist, so I tried WordNet instead. I think it has a more comprehensive vocabulary.



from nltk.corpus import wordnet

if wordnet.synsets(word):
    pass  # Do something
else:
    pass  # Do some other thing
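The WordNet check can be wrapped in a small predicate. This is a sketch, not part of the answer above: the name `is_english_word` is my own, and the lookup function is injected so the example runs without NLTK data; in practice you would pass `nltk.corpus.wordnet.synsets` as the lookup.

```python
def is_english_word(word, synsets_fn):
    # True when the lookup returns at least one synset for the word
    return bool(synsets_fn(word.lower()))

# Stand-in lookup so the sketch runs without NLTK installed; in practice,
# use nltk.corpus.wordnet.synsets here instead.
fake_synsets = {"dog": ["dog.n.01"], "nothing": ["nothing.n.01"]}
lookup = lambda w: fake_synsets.get(w, [])

print(is_english_word("Dog", lookup))   # prints True
print(is_english_word("dogg", lookup))  # prints False
```

Injecting the lookup also makes it easy to combine sources later (e.g. fall back to the Words Corpus when WordNet has no synsets).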

