Quick lookup in a dictionary

I need to perform a spell-check operation in Python along these lines:

I have a huge list of words (call it a lexicon) and some text (call it a sample). I have to look up every word of the sample in the lexicon; if a word cannot be found, it is flagged as a misspelling.

In short, a brute-force spell checker. However, searching the lexicon linearly for each word in the sample would be slow. What's the best way to do this?

The complicating factor is that neither the sample nor the lexicon is in English. The language has more than 300 characters instead of 26, and the text is stored in Unicode.

Suggestions for any algorithm, data structure, or parallelization approach would be helpful. Algorithms that trade a little accuracy for speed are fine, since I don't need 100% accuracy. I know about Norvig's spelling corrector, but it seems to be specific to English.

+3




8 answers


You can use a set of Unicode strings:

s = set([u"rabbit", u"lamb", u"calf"])

and use the in operator to check whether a word exists:

>>> u"rabbit" in s
True
>>> u"wolf" in s
False




This lookup is essentially O(1), so the size of the dictionary doesn't matter.

Edit: here's the complete code for the checker (requires Python 2.6 or later):

from io import open
import re

with open("dictionary", encoding="utf-8") as f:
    words = set(line.strip() for line in f)

with open("document", encoding="utf-8") as f:
    # re.UNICODE makes \w match word characters beyond ASCII,
    # which matters for an alphabet of 300+ characters.
    for w in re.findall(r"\w+", f.read(), re.UNICODE):
        if w not in words:
            print "Misspelled:", w.encode("utf-8")

(The print assumes your terminal is using UTF-8.)

+6




Use a tree structure (a trie) to store the words, so that each path from the root to a leaf spells one word. If your traversal fails to reach a leaf, or reaches a leaf before the end of the word, you have a word that is not in your vocabulary.



Aside from the benefits that Emil mentions in the comments, note that it also allows you to do things like backtracking to find alternative spellings.
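
For illustration, here is a minimal sketch of such a trie in pure Python, built from nested dicts with a sentinel key marking the end of a word (the class and names are only illustrative, not part of the answer above):

# Minimal trie: nested dicts, with a sentinel key marking the end of a word.
_END = u"__end__"   # assumed never to occur as a real character in the lexicon

class Trie(object):
    def __init__(self, words=()):
        self.root = {}
        for w in words:
            self.add(w)

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = True          # a complete word ends here

    def __contains__(self, word):
        node = self.root
        for ch in word:
            if ch not in node:     # traversal cannot follow this character
                return False
            node = node[ch]
        return _END in node        # did we stop on a complete word?

lexicon = Trie([u"rabbit", u"lamb", u"calf"])
print u"rabbit" in lexicon         # True
print u"rab" in lexicon            # False (only a prefix)

For a plain membership test a set is usually simpler and faster; the trie pays off when you also want prefix queries or backtracking for suggestions.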

+1




Try it with a set, as everyone is telling you. Set lookups were optimized in Python's C code by experienced programmers, so you are unlikely to do better in your own application.

Unicode is not a problem: set and dictionary keys can be Unicode or English text, it doesn't matter. Your only concern is Unicode normalization, since the same word with diacritics can have several different code-point representations that would not compare equal. If this is an issue for your language, first make sure the lexicon is stored in a normalized form, and then normalize each word before you check it, for example with unicodedata.normalize('NFC', word).
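
A small sketch of what that could look like around a set lookup, following the file-reading style of the first answer (NFC is an arbitrary choice here; NFD works too, as long as both sides use the same form):

import unicodedata
from io import open

def norm(word):
    # Normalize so that precomposed and decomposed forms compare equal.
    return unicodedata.normalize('NFC', word)

with open("dictionary", encoding="utf-8") as f:
    words = set(norm(line.strip()) for line in f)

def is_known(word):
    return norm(word) in words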

+1




Sets are what you want here. Create a set of all the words in the dictionary, then use the membership operator to check whether each word is present in the dictionary or not.

Here's a simplified example:

>>> dictionary = {'Python', 'check-like', 'will', 'perform', 'follows:', 'spelling', 'operation'}
>>> for word in "I will have to perform a spelling check-like operation in Python as follows:".split():
...     if word in dictionary:
...         print "Found {0} in the dictionary".format(word)
...     else:
...         print "{0} not present in the dictionary".format(word)
... 
I not present in the dictionary
Found will in the dictionary
have not present in the dictionary
to not present in the dictionary
Found perform in the dictionary
a not present in the dictionary
Found spelling in the dictionary
Found check-like in the dictionary
Found operation in the dictionary
in not present in the dictionary
Found Python in the dictionary
as not present in the dictionary
Found follows: in the dictionary
>>> 


0




The average time complexity of a hashed lookup in a Python dict is O(1), so you can use a "dictionary without values", i.e. a set.

0




This is what Python dictionaries and sets are for! :) Store your lexicon in a dictionary if each word carries some extra value (such as a frequency), or in a set if you just need to check for existence. Looking up an item in either is O(1), so it will be pretty darn fast.

lex = set(('word1', 'word2', .....))   # your lexicon

for w in words:                        # words: the tokens of your sample
    if w not in lex:
        print "Error: %s" % w
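
If you do want frequencies (say, for ranking suggestions later), the same idea works with a plain dict; a small sketch with made-up words and counts:

# word -> frequency; membership tests on a dict are also O(1) on average
lex = {u'word1': 120, u'word2': 3}

sample = [u'word1', u'word3']          # tokens of the text being checked
for w in sample:
    if w not in lex:
        print "Error: %s" % w
    # else: lex[w] gives the frequency, useful for ranking suggestions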


0




First you need to build an index of your lexicon. You could roll your own indexing system, but it is better to use a full-text search engine; I can recommend Apache Lucene or Sphinx, both fast and open source. You can then send search queries from Python to the search engine and flag the words that come back with no hits.
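
A hedged sketch of the query side only, assuming Solr (the HTTP search server built on Lucene) is running locally with the lexicon indexed in a core named "lexicon" under a field named "word" (host, core, and field names are all assumptions, not part of this answer):

import json
import urllib
import urllib2

def in_lexicon(word):
    # Exact-match query against the assumed "word" field; rows=0 because we
    # only need the hit count, not the matching documents themselves.
    params = urllib.urlencode({
        'q': (u'word:"%s"' % word).encode('utf-8'),
        'wt': 'json',
        'rows': 0,
    })
    url = 'http://localhost:8983/solr/lexicon/select?' + params
    response = json.load(urllib2.urlopen(url))
    return response['response']['numFound'] > 0

For a lexicon that fits in memory, the in-process set shown in the other answers will be much faster than going over HTTP; an external index mainly pays off if you also want fuzzy or prefix queries.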

0




Here is a post I wrote about checking things like this. It simulates a Google-style suggestion / spell checker.

http://blog.mattalcock.com/2012/12/5/python-spell-checker/

Hope it helps.
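
For reference, here is a minimal sketch of the one-edit candidate generation this kind of checker (and Norvig's essay) is built on; nothing about it is English-specific once the alphabet is derived from your own lexicon instead of being hard-coded to a-z:

def edits1(word, alphabet):
    # All strings one edit away: deletes, transposes, replaces, inserts.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

lexicon = set([u"rabbit", u"lamb", u"calf"])
alphabet = set(c for w in lexicon for c in w)   # works just as well for 300+ characters
print edits1(u"rabit", alphabet) & lexicon      # set([u'rabbit'])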

0








