Could this Python code be more efficient?

I wrote some code to find out how many substrings of a string are anagram pairs. The search function anagram(anagramSolution)

has O (N) complexity. The subscript function has complexity less than N square. But this code is the problem here. Could it be more optimized?

for i in range(T):
    x = raw_input()
    alist = get_all_substrings(x)

    for k, j in itertools.combinations(alist,2):
        if(len(k) == len(j)):
            if(anagramSolution(k,j)):
                counter +=1

    counterlist.append(counter)
    counter = 0

      

alist

can contain thousands of elements (subsets). The main problem is the loop. It will take a long time to go through all the items. Is there a faster or more efficient way to do this?

+3


source to share


2 answers


Define the anagram class of a string as a collection of counts of how many times each letter appears in a string. For example, it 'banana'

has an anagram class a: 3, b: 1, n: 2

. Two strings are anagrams of each other if they have the same class of anagrams. We can count how many substrings of a string belong to each class of anagrams, and then calculate the number of pairs by calculating (n choose 2)

anagrams with n substrings for each class:

from collections import Counter

anagram_class_counts = Counter()

for substring in get_all_substrings(x):
    anagram_class_counts[frozenset(Counter(substring).viewitems())] += 1

anagram_pair_count = sum(x*(x-1)/2 for x in anagram_class_counts.viewvalues())

      

frozenset(Counter(substring).viewitems())

creates a hashable representation of a string anagram class.

  • Counter

    takes an iteration and builds a display representing how many times each item has appeared, so
  • Counter(substring)

    creates a mapping representing the string anagram class.
  • viewitems()

    gives a set-like set of letters: count count and
  • frozenset

    turns this into an immutable set that can be used like the dict key.


These steps together take time proportional to the size of the substring; on average, substrings are about a third of the size of the entire string, so on average, processing each substring takes O(len(x))

time. Substrings O(len(x)**2)

, so processing all substrings takes O(len(x)**3)

time.

If there are substrings x

with the same anagram class, they can be paired in x*(x-1)/2

ways, so it sum

goes through the number of occurrences of each anagram class and calculates the number of pairs. This takes time O(len(x)**2)

, since it has to go through each anagram class once, and there can be no more anagram classes than substrings.

Overall, this algorithm takes O(len(x)**3)

time, which is not good, but much better than the original. There is still room to optimize this, for example by calculating the anagram classes in such a way as to take advantage of overlap between substrings or by using a more efficient representation of the anagram class.

+5


source


I don't think you can completely avoid iteration for this problem, but at least you can make the task at hand smaller in 0 (2 ^ nC2 / 2 ^ n).

You want to group substrings into their respective lengths before you start iterating when you add a lot more cases to check.

The current method compares all pairs from a set that takes 2 ^ nC2 = comparisons. This is a huge amount (2^n)! / ((2^n-2)! * 2!)

.

If we first compose a list of 1-n substrings and then compare, we spend:



  • 2 ^ n operations going through all substrings Operations
  • nC2 going through a substring of length 1
  • ...

That is, we are doing logarithmically better.

Edit: I figured out that strings are not sets and substrings are not subsets, but this only affects the number of subsets and does not affect the main argument.

+1


source







All Articles