Executing my function to remove keys from a dictionary that are substrings of other keys

I am curious why deleting a line in my code results in a significant performance improvement. The function itself takes a dictionary and removes all keys that are substrings of other keys.

The line that slows down my code is:

if sub in reduced_dict and sub2 in reduced_dict:

      

Here's my function:

def reduced(dictionary):
    reduced_dict = dictionary.copy()
    len_dict = defaultdict(list)
    for key in dictionary:
        len_dict[len(key)].append(key)
    start_time = time.time()
    for key, subs in len_dict.items():
        for key2, subs2 in len_dict.items():
            if key2 > key:
                for sub in subs:
                    for sub2 in subs2:
                        if sub in reduced_dict and sub2 in reduced_dict: # Removing this line gives a significant performance boost
                            if sub in sub2:
                                reduced_dict.pop(sub, 0)
    print time.time() - start_time
    return reduced_dict

      

The function checks if sub is in sub2 many times. I assumed that if I had verified that this comparison had already been made, I would have saved time. It doesn't seem to be the case. Why is the constant time function for looking up the dictionary slowing me down?

I'm a beginner so I'm interested in concepts.

When I tested, if the string in question ever returns False, it seems to be there. I have tested this with the following

def reduced(dictionary):
    reduced_dict = dictionary.copy()
    len_dict = defaultdict(list)
    for key in dictionary:
        len_dict[len(key)].append(key)
    start_time = time.time()
    for key, subs in len_dict.items():
        for key2, subs2 in len_dict.items():
            if key2 > key:
                for sub in subs:
                    for sub2 in subs2:
                        if sub not in reduced_dict or sub2 not in reduced_dict:
                            print 'not present' # This line prints many thousands of times
                        if sub in sub2:
                            reduced_dict.pop(sub, 0)
    print time.time() - start_time
    return reduced_dict

      

For 14805 keys in the function input dictionary:

  • 19.6360001564 sec. without line
  • 33.1449999809 sec. with line

Here are 3 dictionaries. Largest dictionary sample with 14805 keys , medium dictionary sample, and smaller dictionary sample

I have a graphical time in seconds (Y) and an input size in # of keys (X) for the first 14,000 keys in the largest sample dictionary. All of these functions seem to have exponential complexity.

  • John Zwinck's answer for this question
  • Matt my algorithm for this question without Comparision Dictionary
  • Matt's exponent is my first attempt at solving this problem. It took 76 seconds
  • Matt comparison is the algorithm in this question with a dict comparison line
  • tdelaney solution for this question. Algorithm 1 and 2 are ok
  • georg solution from the linked question I asked

enter image description here

The accepted answer runs in seemingly linear time.

Marcelo cantos

I am surprised to find that a magic relation exists for input size, where the runtime for a dict search = string search.

+3


source to share


3 answers


For a sample body, or any body in which most keys are small, it is much faster to check all possible subsections:

def reduced(dictionary):
    keys = set(dictionary.iterkeys())
    subkeys = set()
    for key in keys:
        for n in range(1, len(key)):
            for i in range(len(key) + 1 - n):
               subkey = key[i:i+n]
               if subkey in keys:
                   subkeys.add(subkey)

    return {k: v
            for (k, v) in dictionary.iteritems()
            if k not in subkeys}

      



It takes about 0.2s on my system (i7-3720QM 2.6GHz).

+2


source


You create len_dict, but even if it groups keys of the same size, you still have to traverse everything multiple times to compare. Your baseline is right - sort by size and only compare the same amount or more, but there are other ways to do this. Below, I just created a normal list, sorted by key size and then looped back so that I can crop the recorder as I went. I'm curious how his run time compares to yours. This is your little narrator example in 0.49 seconds.

(I hope this actually worked!)

def myfilter(d):
    items = d.items()
    items.sort(key=lambda x: len(x[0]))
    for i in range(len(items)-2,-1,-1):
        k = items[i][0]
        for k_fwd,v_fwd in items[i+1:]:
            if k in k_fwd:
                del items[i]
                break
    return dict(items)

      



EDIT

Significant speed increase without unpacking k_fwd, v_fwd (after running both a few times it wasn't really a speedup. Something else must have had time on my PC for a while).

def myfilter(d):
    items = d.items()
    items.sort(key=lambda x: len(x[0]))
    for i in range(len(items)-2,-1,-1):
        k = items[i][0]
        for kv_fwd in items[i+1:]:
            if k in kv_fwd[0]:
                del items[i]
                break
    return dict(items)

      

+1


source


I would do it differently. There is a generator function here that only gives you the "good" keys. This avoids the creation of a dictate that can be largely destroyed by key by key. I also only have two levels of back loops and some simple optimizations to find matches faster and avoid finding impossible matches.

def reduced_keys(dictionary):
    keys = dictionary.keys()
    keys.sort(key=len, reverse=True) # longest first for max hit chance                                                                                                     
    for key1 in keys:
        found_in_key2 = False
        for key2 in keys:
            if len(key2) <= len(key1): # no more keys are long enough to match                                                                                              
                break
            if key1 in key2:
                found_in_key2 = True
                break
        if not found_in_key2:
            yield key1

      

If you want to make an actual dict using this you can:

{ key: d[key] for key in reduced_keys(d) }

      

+1


source







All Articles