Could this Python code be more efficient?
I wrote some code to find out how many substrings of a string are anagram pairs. The search function anagram(anagramSolution)
has O (N) complexity. The subscript function has complexity less than N square. But this code is the problem here. Could it be more optimized?
for i in range(T):
x = raw_input()
alist = get_all_substrings(x)
for k, j in itertools.combinations(alist,2):
if(len(k) == len(j)):
if(anagramSolution(k,j)):
counter +=1
counterlist.append(counter)
counter = 0
alist
can contain thousands of elements (subsets). The main problem is the loop. It will take a long time to go through all the items. Is there a faster or more efficient way to do this?
source to share
Define the anagram class of a string as a collection of counts of how many times each letter appears in a string. For example, it 'banana'
has an anagram class a: 3, b: 1, n: 2
. Two strings are anagrams of each other if they have the same class of anagrams. We can count how many substrings of a string belong to each class of anagrams, and then calculate the number of pairs by calculating (n choose 2)
anagrams with n substrings for each class:
from collections import Counter
anagram_class_counts = Counter()
for substring in get_all_substrings(x):
anagram_class_counts[frozenset(Counter(substring).viewitems())] += 1
anagram_pair_count = sum(x*(x-1)/2 for x in anagram_class_counts.viewvalues())
frozenset(Counter(substring).viewitems())
creates a hashable representation of a string anagram class.
-
Counter
takes an iteration and builds a display representing how many times each item has appeared, so -
Counter(substring)
creates a mapping representing the string anagram class. -
viewitems()
gives a set-like set of letters: count count and -
frozenset
turns this into an immutable set that can be used like the dict key.
These steps together take time proportional to the size of the substring; on average, substrings are about a third of the size of the entire string, so on average, processing each substring takes O(len(x))
time. Substrings O(len(x)**2)
, so processing all substrings takes O(len(x)**3)
time.
If there are substrings x
with the same anagram class, they can be paired in x*(x-1)/2
ways, so it sum
goes through the number of occurrences of each anagram class and calculates the number of pairs. This takes time O(len(x)**2)
, since it has to go through each anagram class once, and there can be no more anagram classes than substrings.
Overall, this algorithm takes O(len(x)**3)
time, which is not good, but much better than the original. There is still room to optimize this, for example by calculating the anagram classes in such a way as to take advantage of overlap between substrings or by using a more efficient representation of the anagram class.
source to share
I don't think you can completely avoid iteration for this problem, but at least you can make the task at hand smaller in 0 (2 ^ nC2 / 2 ^ n).
You want to group substrings into their respective lengths before you start iterating when you add a lot more cases to check.
The current method compares all pairs from a set that takes 2 ^ nC2 = comparisons. This is a huge amount (2^n)! / ((2^n-2)! * 2!)
.
If we first compose a list of 1-n substrings and then compare, we spend:
- 2 ^ n operations going through all substrings Operations
- nC2 going through a substring of length 1
- ...
That is, we are doing logarithmically better.
Edit: I figured out that strings are not sets and substrings are not subsets, but this only affects the number of subsets and does not affect the main argument.
source to share