Peter Norvig's word segmentation: how can I segment a string that contains misspelled words?
I am trying to understand how Peter Norvig's spelling corrector works.
In the accompanying Jupyter notebook, he explains how to segment a sequence of characters that has no spaces separating the words. The segmentation works correctly when every word in the sequence is spelled correctly:
>>> segment("deeplearning")
['deep', 'learning']
But when a word (or several words) in the sequence is spelled incorrectly, it does not work correctly:
>>> segment("deeplerning")
['deep', 'l', 'erning']
Unfortunately, I have no idea how to fix this and get the segment() function to work on concatenated misspelled words.
Does anyone have any ideas how to deal with this problem?
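For reference, this is roughly how I understand segment to work; a condensed sketch, with made-up toy counts standing in for the counts the notebook derives from big.txt and a simplistic penalty for unknown words:

from functools import lru_cache

# Toy unigram counts; the real notebook derives these from a large corpus (big.txt).
COUNTS = {"deep": 50, "learning": 40, "the": 500, "data": 30}
TOTAL = sum(COUNTS.values())

def Pword(word):
    # Unigram probability; unknown words get a small constant penalty
    # (my simplification) so that some segmentation always exists.
    return COUNTS.get(word, 0.1) / TOTAL

def Pwords(words):
    # Probability of a sequence of words, assuming independence.
    p = 1.0
    for w in words:
        p *= Pword(w)
    return p

def splits(text, maxlen=20):
    # All ways to split text into a non-empty first word and the rest.
    return [(text[:i], text[i:]) for i in range(1, min(len(text), maxlen) + 1)]

@lru_cache(maxsize=None)
def segment(text):
    # Choose the split whose words are most probable under the unigram model.
    if not text:
        return []
    candidates = ([first] + segment(rest) for first, rest in splits(text))
    return max(candidates, key=Pwords)

print(segment("deeplearning"))  # ['deep', 'learning']
# A misspelled chunk like "deeplerning" has no known words covering "lerning",
# so the segmentation degrades, which is the behaviour described above.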
This can be achieved with Peter Norvig's algorithm and only minor modifications. The trick is to add the space character to the alphabet and to treat every space-separated bigram as a single dictionary word. With a space in the alphabet, "deeplerning" is within two edits of the dictionary entry "deep learning" (insert a space, then insert an 'a'), so the edits2 candidates can reach it.
Since big.txt doesn't contain the deep learning bigram, we'll have to add some more text to our dictionary. I will use the wikipedia library (pip install wikipedia) to get more text.
import re

import nltk
import wikipedia as wiki
from nltk.tokenize import word_tokenize

# Unigrams from Norvig's original corpus.
unigrams = re.findall(r"\w+", open("big.txt").read().lower())

# Add the text of Wikipedia pages related to "Deep Learning".
for title in wiki.search("Deep Learning"):
    try:
        page = wiki.page(title).content.lower()
        # Drop non-ASCII characters before tokenizing (decode back to str for Python 3).
        page = page.encode("ascii", errors="ignore").decode("ascii")
        unigrams = unigrams + word_tokenize(page)
    except Exception:
        # Skip pages that fail to load (e.g. disambiguation errors).
        continue
I will create a new dictionary with all unigrams and bigrams:
# Write one dictionary entry per line: all unigrams, then all
# space-separated bigrams (each bigram will count as a single "word").
with open("new_dict.txt", "w") as fo:
    for u in unigrams:
        fo.write(u + "\n")
    bigrams = list(nltk.bigrams(unigrams))
    for b in bigrams:
        fo.write(" ".join(b) + "\n")
Now just add the space character to the letters variable in edits1, change big.txt to new_dict.txt, and replace this function:

def words(text): return re.findall(r'\w+', text.lower())

with this one, so that each line of the dictionary (including the space-separated bigrams) counts as a single word:

def words(text): return text.split("\n")

Now correction("deeplerning") returns 'deep learning'!
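For clarity, here is roughly how the pieces fit together once the changes are applied. This is a condensed sketch based on Norvig's spell.py, not my exact code; the three changes described above are marked in comments:

from collections import Counter

def words(text): return text.split("\n")               # changed: one dictionary entry per line

WORDS = Counter(words(open("new_dict.txt").read()))    # changed: use the new dictionary

def P(word, N=sum(WORDS.values())):
    return WORDS[word] / N

def correction(word):
    return max(candidates(word), key=P)

def candidates(word):
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

def known(words):
    return set(w for w in words if w in WORDS)

def edits1(word):
    letters = "abcdefghijklmnopqrstuvwxyz "             # changed: space added to the alphabet
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

print(correction("deeplerning"))  # 'deep learning'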
This trick works well if you need a spelling corrector for a specific domain. If the domain is large, you can try adding only the most common unigrams/bigrams to your dictionary.
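For example, one way to keep the dictionary small is to write only the n-grams above a frequency cutoff; the cutoffs below are arbitrary example values, not tuned numbers:

from collections import Counter

import nltk

# Count unigram and bigram frequencies, then keep only the most common ones.
unigram_counts = Counter(unigrams)
bigram_counts = Counter(" ".join(b) for b in nltk.bigrams(unigrams))

with open("new_dict.txt", "w") as fo:
    for u, _ in unigram_counts.most_common(10000):
        fo.write(u + "\n")
    for b, _ in bigram_counts.most_common(20000):
        fo.write(b + "\n")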
This question might help as well.