Performance of my function that removes keys from a dictionary when they are substrings of other keys
I am curious why deleting a line in my code results in a significant performance improvement. The function itself takes a dictionary and removes all keys that are substrings of other keys.
The line that slows down my code is:
if sub in reduced_dict and sub2 in reduced_dict:
Here's my function:
import time
from collections import defaultdict

def reduced(dictionary):
    reduced_dict = dictionary.copy()
    # Group the keys by length so short keys are only ever compared
    # against strictly longer ones.
    len_dict = defaultdict(list)
    for key in dictionary:
        len_dict[len(key)].append(key)
    start_time = time.time()
    for key, subs in len_dict.items():
        for key2, subs2 in len_dict.items():
            if key2 > key:
                for sub in subs:
                    for sub2 in subs2:
                        if sub in reduced_dict and sub2 in reduced_dict:  # removing this line gives a significant performance boost
                            if sub in sub2:
                                reduced_dict.pop(sub, 0)
    print time.time() - start_time
    return reduced_dict
The function performs the sub in sub2 substring check many times. I assumed that by first verifying that both keys were still present (i.e. not already removed), I would skip redundant comparisons and save time. That doesn't seem to be the case. Why is the constant-time dictionary lookup slowing me down?
I'm a beginner, so I'm interested in the concepts.
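To isolate just the cost of the membership test, here is a minimal timing sketch with made-up keys (not my sample dictionary):

import timeit

setup = "d = {str(i) * 10: None for i in range(1000)}; keys = list(d)"

bare_loop = """
for a in keys:
    for b in keys:
        pass
"""

with_check = """
for a in keys:
    for b in keys:
        if a in d and b in d:  # the suspect line
            pass
"""

# The difference between the two timings is the overhead the
# membership test adds per innermost iteration.
print(timeit.timeit(bare_loop, setup, number=1))
print(timeit.timeit(with_check, setup, number=1))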
In case you are wondering whether the membership check ever returns False: it does. I tested that with the following:
def reduced(dictionary):
    reduced_dict = dictionary.copy()
    len_dict = defaultdict(list)
    for key in dictionary:
        len_dict[len(key)].append(key)
    start_time = time.time()
    for key, subs in len_dict.items():
        for key2, subs2 in len_dict.items():
            if key2 > key:
                for sub in subs:
                    for sub2 in subs2:
                        if sub not in reduced_dict or sub2 not in reduced_dict:
                            print 'not present'  # this line prints many thousands of times
                        if sub in sub2:
                            reduced_dict.pop(sub, 0)
    print time.time() - start_time
    return reduced_dict
For 14805 keys in the function's input dictionary:
- 19.6360001564 sec. without the line
- 33.1449999809 sec. with the line
Here are three sample dictionaries: the largest sample with 14805 keys, a medium sample, and a smaller sample.
I plotted runtime in seconds (Y) against input size in number of keys (X) for the first 14,000 keys of the largest sample dictionary. All of these functions seem to have exponential complexity:
- John Zwinck's answer to this question
- Matt: my algorithm from this question, without the comparison-dictionary line
- Matt exponential: my first attempt at solving this problem; it took 76 seconds
- Matt comparison: the algorithm in this question, with the dict comparison line
- tdelaney's solution to this question (both algorithm 1 and algorithm 2)
- georg's solution from the linked question I asked
The accepted answer runs in seemingly linear time.
I am surprised that there seems to be some magic input size at which the runtime of the dict lookup equals that of the string search.
For your sample corpus, or any corpus in which most keys are short, it is much faster to enumerate all possible substrings:
def reduced(dictionary):
    keys = set(dictionary.iterkeys())
    subkeys = set()
    for key in keys:
        # Generate every proper substring of the key and record the
        # ones that are themselves keys.
        for n in range(1, len(key)):
            for i in range(len(key) + 1 - n):
                subkey = key[i:i+n]
                if subkey in keys:
                    subkeys.add(subkey)
    return {k: v
            for (k, v) in dictionary.iteritems()
            if k not in subkeys}
It takes about 0.2s on my system (i7-3720QM 2.6GHz).
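As a quick sanity check, here is a hypothetical toy input (not the poster's 14805-key sample) run through the function above:

# 'cat' and 'talo' are both substrings of 'catalog', so both are removed;
# 'dog' is not a substring of any other key and survives.
d = {'cat': 1, 'catalog': 2, 'dog': 3, 'talo': 4}
print(reduced(d))  # -> {'catalog': 2, 'dog': 3}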
You create len_dict, but even though it groups keys of the same size, you still have to traverse everything multiple times to compare. Your basic idea is right - sort by size and only compare against keys of the same size or larger - but there are other ways to do it. Below, I just built a plain list sorted by key length and then iterated over it backwards, so I could trim the list as I went. I'm curious how its runtime compares to yours. It does your smaller dictionary example in 0.49 seconds.
(I hope this actually worked!)
def myfilter(d):
    items = d.items()
    items.sort(key=lambda x: len(x[0]))
    # Walk backwards from the second-longest key so deleting items[i]
    # never disturbs the part of the list still to be visited.
    for i in range(len(items)-2, -1, -1):
        k = items[i][0]
        for k_fwd, v_fwd in items[i+1:]:
            if k in k_fwd:
                del items[i]
                break
    return dict(items)
EDIT
A significant speed increase from not unpacking k_fwd, v_fwd (after running both a few more times, it wasn't really a speedup. Something else must have been taking time on my PC for a while).
def myfilter(d):
    items = d.items()
    items.sort(key=lambda x: len(x[0]))
    for i in range(len(items)-2, -1, -1):
        k = items[i][0]
        for kv_fwd in items[i+1:]:
            if k in kv_fwd[0]:
                del items[i]
                break
    return dict(items)
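If it helps to see it run, here is a hypothetical toy input (not part of the original answer):

# 'ab' is inside 'cab', so it gets filtered out; 'x' and 'cab' remain.
d = {'ab': 1, 'cab': 2, 'x': 3}
print(myfilter(d))  # -> {'cab': 2, 'x': 3}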
I would do it differently. Here is a generator function that yields only the "good" keys. This avoids creating a dict that may then be largely destroyed key by key. I also have only two levels of nested loops, plus some simple optimizations to find matches sooner and avoid searching for impossible matches.
def reduced_keys(dictionary):
    keys = dictionary.keys()
    keys.sort(key=len, reverse=True)  # longest first for max hit chance
    for key1 in keys:
        found_in_key2 = False
        for key2 in keys:
            if len(key2) <= len(key1):  # no remaining keys are long enough to match
                break
            if key1 in key2:
                found_in_key2 = True
                break
        if not found_in_key2:
            yield key1
If you want to make an actual dict using this you can:
{ key: d[key] for key in reduced_keys(d) }
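For instance, with a hypothetical toy dict (not from the original answer):

# 'a' and 'ab' are substrings of 'abc' and are dropped; 'x' is kept.
d = {'a': 1, 'ab': 2, 'abc': 3, 'x': 4}
print(sorted(reduced_keys(d)))                    # ['abc', 'x']
print({key: d[key] for key in reduced_keys(d)})   # {'abc': 3, 'x': 4}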