Peter Norvig's word segmentation: how can I segment a string that contains misspelled words?
I am trying to understand how Peter Norvig's spelling corrector works.
In the accompanying Jupyter notebook, he explains how to segment a sequence of characters that has no spaces separating the words. The segmentation works correctly when every word in the sequence is spelled correctly:
>>> segment("deeplearning")
['deep', 'learning']
But when a word (or several words) in the sequence is spelled incorrectly, it does not work correctly:
>>> segment("deeplerning")
['deep', 'l', 'erning']
Unfortunately, I have no idea how to fix this and get the segment() function to work on concatenated misspelled words.
Does anyone have any ideas how to deal with this problem?
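For reference, this is roughly how I understand segment to work; a condensed sketch, with made-up toy counts standing in for the counts the notebook derives from big.txt and a simplistic penalty for unknown words:

from functools import lru_cache

# Toy unigram counts; the real notebook derives these from a large corpus (big.txt).
COUNTS = {"deep": 50, "learning": 40, "the": 500, "data": 30}
TOTAL = sum(COUNTS.values())

def Pword(word):
    # Unigram probability; unknown words get a small constant penalty
    # (my simplification) so that some segmentation always exists.
    return COUNTS.get(word, 0.1) / TOTAL

def Pwords(words):
    # Probability of a sequence of words, assuming independence.
    p = 1.0
    for w in words:
        p *= Pword(w)
    return p

def splits(text, maxlen=20):
    # All ways to split text into a non-empty first word and the rest.
    return [(text[:i], text[i:]) for i in range(1, min(len(text), maxlen) + 1)]

@lru_cache(maxsize=None)
def segment(text):
    # Choose the split whose words are most probable under the unigram model.
    if not text:
        return []
    candidates = ([first] + segment(rest) for first, rest in splits(text))
    return max(candidates, key=Pwords)

print(segment("deeplearning"))  # ['deep', 'learning']
# A misspelled chunk like "deeplerning" has no known words covering "lerning",
# so the segmentation degrades, which is the behaviour described above.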
This can be achieved with Peter Norvig's algorithm and only minor modifications. The trick is to add the space character to the alphabet and to treat every space-separated bigram as a single dictionary word. With a space in the alphabet, "deeplerning" is within two edits of the dictionary entry "deep learning" (insert a space, then insert an 'a'), so the edits2 candidates can reach it.
Since big.txt doesn't contain the deep learning bigram, we'll have to add some more text to our dictionary. I will use the wikipedia library (pip install wikipedia) to get more text.
import re

import nltk
import wikipedia as wiki
from nltk.tokenize import word_tokenize

# Unigrams from Norvig's original corpus.
unigrams = re.findall(r"\w+", open("big.txt").read().lower())

# Add the text of Wikipedia pages related to "Deep Learning".
for title in wiki.search("Deep Learning"):
    try:
        page = wiki.page(title).content.lower()
        # Drop non-ASCII characters before tokenizing (decode back to str for Python 3).
        page = page.encode("ascii", errors="ignore").decode("ascii")
        unigrams = unigrams + word_tokenize(page)
    except Exception:
        # Skip pages that fail to load (e.g. disambiguation errors).
        continue
I will create a new dictionary with all unigrams and bigrams:
# Write one dictionary entry per line: all unigrams, then all
# space-separated bigrams (each bigram will count as a single "word").
with open("new_dict.txt", "w") as fo:
    for u in unigrams:
        fo.write(u + "\n")
    bigrams = list(nltk.bigrams(unigrams))
    for b in bigrams:
        fo.write(" ".join(b) + "\n")
Now just add the space character to the letters variable in edits1, change big.txt to new_dict.txt, and replace this function:

def words(text): return re.findall(r'\w+', text.lower())

with this one, so that each line of the dictionary (including the space-separated bigrams) counts as a single word:

def words(text): return text.split("\n")

Now correction("deeplerning") returns 'deep learning'!
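For clarity, here is roughly how the pieces fit together once the changes are applied. This is a condensed sketch based on Norvig's spell.py, not my exact code; the three changes described above are marked in comments:

from collections import Counter

def words(text): return text.split("\n")               # changed: one dictionary entry per line

WORDS = Counter(words(open("new_dict.txt").read()))    # changed: use the new dictionary

def P(word, N=sum(WORDS.values())):
    return WORDS[word] / N

def correction(word):
    return max(candidates(word), key=P)

def candidates(word):
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

def known(words):
    return set(w for w in words if w in WORDS)

def edits1(word):
    letters = "abcdefghijklmnopqrstuvwxyz "             # changed: space added to the alphabet
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

print(correction("deeplerning"))  # 'deep learning'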
This trick works well if you need a spelling corrector for a specific domain. If the domain is large, you can try adding only the most common unigrams/bigrams to your dictionary.
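For example, one way to keep the dictionary small is to write only the n-grams above a frequency cutoff; the cutoffs below are arbitrary example values, not tuned numbers:

from collections import Counter

import nltk

# Count unigram and bigram frequencies, then keep only the most common ones.
unigram_counts = Counter(unigrams)
bigram_counts = Counter(" ".join(b) for b in nltk.bigrams(unigrams))

with open("new_dict.txt", "w") as fo:
    for u, _ in unigram_counts.most_common(10000):
        fo.write(u + "\n")
    for b, _ in bigram_counts.most_common(20000):
        fo.write(b + "\n")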
This question might help as well.