
Finding anagrams of words in a comma-separated string in Python

>>> line = "cat,ant,ate,abc,tan,act,tea"
>>> words = line.split(",")
>>> words
['cat', 'ant', 'ate', 'abc', 'tan', 'act', 'tea']
>>> sorted_words = map(tuple, [sorted(eachword) for eachword in words])
>>> sorted_words
[('a', 'c', 't'), ('a', 'n', 't'), ('a', 'e', 't'), ('a', 'b', 'c'), ('a', 'n', 't'), ('a', 'c', 't'), ('a', 'e', 't')]
>>> repeated_words = set(sorted_words)
>>> repeated_words
set([('a', 'b', 'c'), ('a', 'e', 't'), ('a', 'c', 't'), ('a', 'n', 't')])
>>> for repeated_word in repeated_words:
...     for index in [i for i, x in enumerate(sorted_words) if sorted_words.count(x) > 1 and x == repeated_word]:
...         print words[index],
...     print '\t'



ate tea     
cat act     
ant tan

      

I'm able to get the anagrams grouped per word, but I'd like to know whether there is a better approach that solves the problem with lower complexity. Please also help me work out the complexity of the approach above.



1 answer


The big efficiency problem is the if sorted_words.count(x) > 1 test that you do for each element.

Let's go through it step by step. Say we have N elements, K unique elements, and an average word length of M.

  • Sorting each element of the list and putting the results in another list is O(MlogM) time per element, or O(NMlogM) total.
  • Building the set from that new list is O(N).
  • For each word in the set, for each word in the list, counting how many times that word appears in the list. This is the big one: counting how many times something appears in a list takes O(N) time, and you do it K*N times, so O(N^2 * K) total.
  • For each word in the set, building the list of all matching indices where the count > 1. This takes O(NK) time.

You can fix that O(N^2 * K) part simply by hoisting the count out of the list comprehension. Let's say you did that, without spelling out exactly how (it's pretty easy). Now your time is O(NMlogM + N + NK). Assuming M << K, that's O(NK).
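For example, one way to do the hoisting (a sketch of one possibility, not the only fix): tally every sorted word in a single pass with collections.Counter, so the comprehension only does constant-time lookups:

>>> from collections import Counter
>>> counts = Counter(sorted_words)  # one pass over the list instead of a count() per element
>>> for repeated_word in repeated_words:
...     for index in [i for i, x in enumerate(sorted_words) if counts[x] > 1 and x == repeated_word]:
...         print words[index],
...     print '\t'

Each comprehension still scans the whole list once per unique word, which is where the remaining O(NK) comes from.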


To get rid of that remaining O(NK), you want to build a mapping from each sorted word to its original words, so that you can look up the original words in constant time.

For example:

>>> repeated_words = {}
>>> for word in words:
...     sorted_word = tuple(sorted(word))
...     repeated_words.setdefault(sorted_word, set()).add(word)
>>> repeated_words
{('a', 'b', 'c'): {'abc'},
 ('a', 'c', 't'): {'act', 'cat'},
 ('a', 'e', 't'): {'ate', 'tea'},
 ('a', 'n', 't'): {'ant', 'tan'}}
>>> for sorted_word, anagrams in repeated_words.viewitems():
...     if len(anagrams) > 1:
...         print(' '.join(anagrams))
tea ate
act cat
ant tan

      

Now our first two steps are O(NMlogM + N), but the third step is O(K) instead of O(KN), because we just iterate the dict's entries once instead of doing a linear search over the list for each word.



So our total time is O(NMlogM).
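As an aside, collections.defaultdict expresses the same grouping a little more directly than setdefault; this sketch is equivalent, with identical complexity:

>>> from collections import defaultdict
>>> repeated_words = defaultdict(set)
>>> for word in words:
...     repeated_words[tuple(sorted(word))].add(word)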

(If the order of the anagrams in each group matters, or if there might be actual duplicate words, you can map each sorted word to a list of the original words rather than a set. That doesn't affect performance here, because the only things we ever do with this list/set are append/add and iterate; I just used a set because conceptually the order doesn't matter and there shouldn't be any duplicates. A sketch of the list version follows.)
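For instance (a sketch; the only change from the set version above is the container and the insertion method):

>>> repeated_words = {}
>>> for word in words:
...     repeated_words.setdefault(tuple(sorted(word)), []).append(word)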


But we can do even better. It probably doesn't matter given that M << K, but...

Why do we need to sort the words at all? Because two words are anagrams if their sorted letters are the same. But two words are also anagrams if their sets of letters are the same, provided there are no duplicate letters, and there aren't any in your example. (Even if there were, you could handle it with a "multiset", e.g. a Counter made immutable and hashable... although then the comparisons are no longer quite constant time, since they depend on the average number of repeated letters. Let's ignore that complication, since it doesn't arise in your example, but we could handle it if needed.)

>>> repeated_words = {}
>>> for word in words:
...     letter_set = frozenset(word)
...     repeated_words.setdefault(letter_set, set()).add(word)
>>> repeated_words
{frozenset({'a', 'b', 'c'}): {'abc'},
 frozenset({'a', 'e', 't'}): {'ate', 'tea'},
 frozenset({'a', 'n', 't'}): {'ant', 'tan'},
 frozenset({'a', 'c', 't'}): {'act', 'cat'}}
>>> for letter_set, anagrams in repeated_words.viewitems():
...     if len(anagrams) > 1:
...         print(' '.join(anagrams))
tea ate
ant tan
act cat

      

And now our total time is just O(NM) instead of O(NMlogM).

Again, this last improvement is probably not worth doing given M << K (especially if you also need the multiset handling, because the time we'd spend working out how to express the complexity of Counter.__eq__, and writing and explaining a FrozenCounter, is probably more than the time it would ever save the program :)).
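For completeness, a minimal sketch of what such a FrozenCounter could look like (it is not a standard-library class, just the hypothetical named above; this version is hashable but not truly immutable, which is enough for use as a dict key as long as you never mutate it afterwards):

>>> from collections import Counter
>>> class FrozenCounter(Counter):
...     def __hash__(self):
...         # hash the letter multiset as a frozenset of (letter, count) pairs,
...         # so two anagrams (equal counters) always hash equally
...         return hash(frozenset(self.items()))
...
>>> repeated_words = {}
>>> for word in words:
...     repeated_words.setdefault(FrozenCounter(word), set()).add(word)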
