Performance of my function that removes keys from a dictionary when they are substrings of other keys
I am curious why deleting a line in my code results in a significant performance improvement. The function itself takes a dictionary and removes all keys that are substrings of other keys.
The line that slows down my code is:
if sub in reduced_dict and sub2 in reduced_dict:
Here's my function:
import time
from collections import defaultdict

def reduced(dictionary):
    reduced_dict = dictionary.copy()
    # Group the keys by length so short keys are only ever compared
    # against strictly longer ones.
    len_dict = defaultdict(list)
    for key in dictionary:
        len_dict[len(key)].append(key)
    start_time = time.time()
    for key, subs in len_dict.items():
        for key2, subs2 in len_dict.items():
            if key2 > key:
                for sub in subs:
                    for sub2 in subs2:
                        if sub in reduced_dict and sub2 in reduced_dict:  # removing this line gives a significant performance boost
                            if sub in sub2:
                                reduced_dict.pop(sub, 0)
    print time.time() - start_time
    return reduced_dict
The function performs the sub in sub2 substring check many times. I assumed that by first verifying that both keys were still present (i.e. not already removed), I would skip redundant comparisons and save time. That doesn't seem to be the case. Why is the constant-time dictionary lookup slowing me down?
I'm a beginner, so I'm interested in the concepts.
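To isolate just the cost of the membership test, here is a minimal timing sketch with made-up keys (not my sample dictionary):

import timeit

setup = "d = {str(i) * 10: None for i in range(1000)}; keys = list(d)"

bare_loop = """
for a in keys:
    for b in keys:
        pass
"""

with_check = """
for a in keys:
    for b in keys:
        if a in d and b in d:  # the suspect line
            pass
"""

# The difference between the two timings is the overhead the
# membership test adds per innermost iteration.
print(timeit.timeit(bare_loop, setup, number=1))
print(timeit.timeit(with_check, setup, number=1))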
In case you are wondering whether the membership check ever returns False: it does. I tested that with the following:
def reduced(dictionary):
    reduced_dict = dictionary.copy()
    len_dict = defaultdict(list)
    for key in dictionary:
        len_dict[len(key)].append(key)
    start_time = time.time()
    for key, subs in len_dict.items():
        for key2, subs2 in len_dict.items():
            if key2 > key:
                for sub in subs:
                    for sub2 in subs2:
                        if sub not in reduced_dict or sub2 not in reduced_dict:
                            print 'not present'  # this line prints many thousands of times
                        if sub in sub2:
                            reduced_dict.pop(sub, 0)
    print time.time() - start_time
    return reduced_dict
For 14805 keys in the function's input dictionary:
- 19.6360001564 sec. without the line
- 33.1449999809 sec. with the line
Here are three sample dictionaries: the largest sample with 14805 keys, a medium sample, and a smaller sample.
I plotted runtime in seconds (Y) against input size in number of keys (X) for the first 14,000 keys of the largest sample dictionary. All of these functions seem to have exponential complexity:
- John Zwinck's answer to this question
- Matt: my algorithm from this question, without the comparison-dictionary line
- Matt exponential: my first attempt at solving this problem; it took 76 seconds
- Matt comparison: the algorithm in this question, with the dict comparison line
- tdelaney's solution to this question (both algorithm 1 and algorithm 2)
- georg's solution from the linked question I asked
The accepted answer runs in seemingly linear time.
I am surprised that there seems to be some magic input size at which the runtime of the dict lookup equals that of the string search.
For your sample corpus, or any corpus in which most keys are short, it is much faster to enumerate all possible substrings:
def reduced(dictionary):
    keys = set(dictionary.iterkeys())
    subkeys = set()
    for key in keys:
        # Generate every proper substring of the key and record the
        # ones that are themselves keys.
        for n in range(1, len(key)):
            for i in range(len(key) + 1 - n):
                subkey = key[i:i+n]
                if subkey in keys:
                    subkeys.add(subkey)
    return {k: v
            for (k, v) in dictionary.iteritems()
            if k not in subkeys}
It takes about 0.2s on my system (i7-3720QM 2.6GHz).
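As a quick sanity check, here is a hypothetical toy input (not the poster's 14805-key sample) run through the function above:

# 'cat' and 'talo' are both substrings of 'catalog', so both are removed;
# 'dog' is not a substring of any other key and survives.
d = {'cat': 1, 'catalog': 2, 'dog': 3, 'talo': 4}
print(reduced(d))  # -> {'catalog': 2, 'dog': 3}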
You create len_dict, but even though it groups keys of the same size, you still have to traverse everything multiple times to compare. Your basic idea is right - sort by size and only compare against keys of the same size or larger - but there are other ways to do it. Below, I just built a plain list sorted by key length and then iterated over it backwards, so I could trim the list as I went. I'm curious how its runtime compares to yours. It does your smaller dictionary example in 0.49 seconds.
(I hope this actually worked!)
def myfilter(d):
    items = d.items()
    items.sort(key=lambda x: len(x[0]))
    # Walk backwards from the second-longest key so deleting items[i]
    # never disturbs the part of the list still to be visited.
    for i in range(len(items)-2, -1, -1):
        k = items[i][0]
        for k_fwd, v_fwd in items[i+1:]:
            if k in k_fwd:
                del items[i]
                break
    return dict(items)
EDIT
A significant speed increase from not unpacking k_fwd, v_fwd (after running both a few more times, it wasn't really a speedup. Something else must have been taking time on my PC for a while).
def myfilter(d):
    items = d.items()
    items.sort(key=lambda x: len(x[0]))
    for i in range(len(items)-2, -1, -1):
        k = items[i][0]
        for kv_fwd in items[i+1:]:
            if k in kv_fwd[0]:
                del items[i]
                break
    return dict(items)
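If it helps to see it run, here is a hypothetical toy input (not part of the original answer):

# 'ab' is inside 'cab', so it gets filtered out; 'x' and 'cab' remain.
d = {'ab': 1, 'cab': 2, 'x': 3}
print(myfilter(d))  # -> {'cab': 2, 'x': 3}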
I would do it differently. Here is a generator function that yields only the "good" keys. This avoids creating a dict that may then be largely destroyed key by key. I also have only two levels of nested loops, plus some simple optimizations to find matches sooner and avoid searching for impossible matches.
def reduced_keys(dictionary):
    keys = dictionary.keys()
    keys.sort(key=len, reverse=True)  # longest first for max hit chance
    for key1 in keys:
        found_in_key2 = False
        for key2 in keys:
            if len(key2) <= len(key1):  # no remaining keys are long enough to match
                break
            if key1 in key2:
                found_in_key2 = True
                break
        if not found_in_key2:
            yield key1
If you want to make an actual dict using this you can:
{ key: d[key] for key in reduced_keys(d) }
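For instance, with a hypothetical toy dict (not from the original answer):

# 'a' and 'ab' are substrings of 'abc' and are dropped; 'x' is kept.
d = {'a': 1, 'ab': 2, 'abc': 3, 'x': 4}
print(sorted(reduced_keys(d)))                    # ['abc', 'x']
print({key: d[key] for key in reduced_keys(d)})   # {'abc': 3, 'x': 4}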