Finding words from random input letters in Python: which algorithm to use, and does code for this already exist?

I am trying to code a word descrambler like this one here and was wondering what algorithms I should use to implement it. Also, if someone can point me to existing code for this, that would be great too. Basically the functionality will be similar to a Boggle solver, but without the matrix: it just searches for all possible words in a string of characters. I already have sufficient dictionaries.

I was planning to do this in Python or Ruby. Thanks in advance for your help!

+2




5 answers


I would use a trie. Here's an implementation in Python: http://jtauber.com/2005/02/trie.py (credit to James Tauber)
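Not the linked implementation, just a minimal sketch of the trie idea in Python (the TrieNode/insert/lookup names are mine):

class TrieNode:
    def __init__(self):
        self.children = {}    # letter -> child node
        self.is_word = False  # True if the path from the root spells a word

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def lookup(root, word):
    node = root
    for ch in word:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return node.is_word

Walking the trie letter by letter also lets you abandon a branch as soon as no dictionary word starts with the letters tried so far.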



+3




I may be missing some understanding of the game, but barring certain complications in the rules, such as the introduction of "joker" (wildcard) letters, missing or extra letters, multiple words, etc., I think the following ideas would help turn the problem into a somewhat relatively uninteresting one. :-(

The main idea is to index words by the ordered sequence of their letters. For example, "computer" gets the key "cemoprtu". Whatever letters the random draw provides are sorted the same way and used as the key to find possible matches. Using a trie, as suggested by perimosocordiae, as the underlying storage for these sorted keys (with the associated word IDs in the leaf nodes), word lookup can be done in O(n) time, where n is the number of letters (or better, on average, since non-existent words fail early).
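A minimal sketch of that sorted-key indexing, using a plain dict instead of a trie for brevity (words.txt stands in for whatever dictionary file you have):

from collections import defaultdict

index = defaultdict(list)
with open("words.txt") as f:           # one word per line
    for line in f:
        word = line.strip().lower()
        key = "".join(sorted(word))    # "computer" -> "cemoprtu"
        index[key].append(word)

draw = "pmcoteru"                      # a random draw of letters
print(index["".join(sorted(draw))])    # ['computer']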

To further help with indexing, we can have multiple tables/dictionaries, one per word length. Also, depending on the statistics, vowels and consonants could be handled separately. Another trick would be to use a custom sort order, placing the most selective letters first.



Additional twists to the game (for example, finding words made from a subset of the letters) mostly amount to iterating over the power set of these letters and checking the dictionary for each combination.

A few heuristics can be introduced to prune some of the combinations (for example, combinations without vowels [and above a certain length] cannot be solutions, etc.). These heuristics should be used carefully so that checking them costs less than the lookups they save.
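A sketch of that power-set search against the sorted-key index built above, with a simple no-vowel pruning heuristic (find_words and min_len are my own names):

from itertools import combinations

VOWELS = set("aeiouy")

def find_words(index, letters, min_len=3):
    letters = sorted(letters.lower())
    results, seen = set(), set()
    for size in range(min_len, len(letters) + 1):
        for combo in combinations(letters, size):
            key = "".join(combo)          # already sorted
            if key in seen:
                continue                  # repeated letters yield duplicate subsets
            seen.add(key)
            if not VOWELS & set(key):     # heuristic: no vowels, no word
                continue
            results.update(index.get(key, ()))
    return results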

+2




For your dictionary index, build a map (Map[Bag[Char], List[String]]). It should be a hash map so you get O(1) word lookups. A Bag[Char] is an identifier for a word that is unique up to character order. It is essentially a hash map from Char to Int, where the Char is a given character in the word and the Int is the number of times that character appears in the word.

Example:

{'a'=>3, 'n'=>1, 'g'=>1, 'r'=>1, 'm'=>1} => ["anagram"]
{'s'=>3, 't'=>1, 'r'=>1, 'e'=>2, 'd'=>1} => ["stressed", "desserts"]

To find words, take each combination of characters from the input string, form its bag, and look it up in this map. The complexity of this algorithm is O(2^n) in the length of the input string. Notably, the complexity does not depend on the size of the dictionary.
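Python's Counter is one way to build such a bag, but it is not hashable; a frozenset of its (char, count) pairs is, so a sketch of the index might look like this (words.txt is again a stand-in dictionary file):

from collections import Counter, defaultdict

def bag(word):
    # hashable multiset of characters: frozenset({('s', 3), ('t', 1), ...})
    return frozenset(Counter(word).items())

index = defaultdict(list)
with open("words.txt") as f:
    for line in f:
        word = line.strip().lower()
        index[bag(word)].append(word)

print(index[bag("desserts")])   # e.g. ['stressed', 'desserts']

The lookup side then enumerates character combinations of the input string and keys each combination by its bag.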

+2




It sounds like Rabin-Karp string search would be a good choice. If you use a rolling hash function, then at each position you need one hash-value update and one dictionary lookup. You also need a good way to cope with different word lengths, for example truncating all words to the length of the shortest word in the set and re-checking possible matches. Splitting the word set into separate length ranges reduces false positives at the expense of more hashing work.
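A minimal sketch of that rolling-hash idea for one fixed word length (all names are mine; a real implementation would tune the base and modulus):

from collections import defaultdict

def rabin_karp_search(text, words):
    # 'words' must all share one length; split your dictionary by
    # length and call this once per length to handle mixed sizes.
    k = len(next(iter(words)))
    base, mod = 256, (1 << 61) - 1

    def fingerprint(s):
        v = 0
        for ch in s:
            v = (v * base + ord(ch)) % mod
        return v

    by_hash = defaultdict(set)
    for w in words:
        by_hash[fingerprint(w)].add(w)

    top = pow(base, k - 1, mod)            # weight of the outgoing character
    hits = []
    v = fingerprint(text[:k])
    for i in range(len(text) - k + 1):
        window = text[i:i + k]
        if window in by_hash.get(v, ()):   # hash hit, verified by set lookup
            hits.append((i, window))
        if i + k < len(text):
            # slide the window: remove text[i], append text[i+k]
            v = ((v - ord(text[i]) * top) * base + ord(text[i + k])) % mod
    return hits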

+1




There are two ways to do this. One is to check every candidate permutation of the letters in the word to see whether the candidate is in your word dictionary. That is an O(N!) operation, where N is the word length.

The other way is to check each candidate word in your dictionary to see whether it is contained within the given word. This can be sped up by aggregating the dictionary: instead of each candidate word, you check all words that are anagrams of each other at once, because if any one of them is contained in your word, all of them are.

So, start by creating a dictionary whose key is a sorted string of letters and whose value is a list of words that are anagrams of the key:

>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> with open(r"c:\temp\words.txt", "r") as f:
        for line in f:
            if line[0].isupper(): continue   # skip proper nouns
            word = line.strip()
            key = "".join(sorted(word.lower()))
            d[key].append(word)

Now we need a function that tells whether a word contains a candidate. This function assumes that both the word and the candidate are sorted, so that it can walk through them letter by letter and bail out quickly when it finds they don't match.

>>> def contains(sorted_word, sorted_candidate):
        wchars = iter(sorted_word)
        for cc in sorted_candidate:
            # advance through the word until we reach or pass cc
            while True:
                try:
                    wc = next(wchars)
                except StopIteration:
                    return False         # ran out of word letters
                if wc < cc: continue     # not there yet, keep advancing
                if wc == cc: break       # matched this candidate letter
                return False             # passed cc: no match possible
        return True

Now find all candidate keys in the dictionary that are contained in the word, and gather all their values into one list:

>>> w = sorted("mythopoetic")
>>> result = []
>>> for k in d.keys():
        if contains(w, k): result.extend(d[k])
>>> len(result)
429
>>> sorted(result)[:20]
['c', 'ce', 'cep', 'ceti', 'che', 'chetty', 'chi', 'chime', 'chip', 'chit', 'chitty', 'cho', 'chomp', 'choop', 'chop', 'chott', 'chyme', 'cipo', 'cit', 'cite']

This last step takes about a quarter of a second on my laptop; there are 195K keys in my dictionary (I'm using the BSD Unix words file).

+1








