Optimizing Python dictionary lookup speeds by reducing key size?

I don't understand what is going on behind the scenes of a dictionary lookup. Does the size of the key factor into the speed of finding that key?

The current dictionary keys are 10-20 characters long, alphanumeric.

I need to do hundreds of searches per minute.

If I replace them with shorter key IDs of 1 to 4 digits, will I get faster lookup times? That would mean adding another value to each element the dictionary holds, so overall the dictionary would be larger.

I would also need to change the program to look up the ID and then get the URL associated with that ID.
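For concreteness, here is a rough sketch of the two designs I'm weighing (the keys and URLs below are just made-up examples):

```python
# Hypothetical sketch of the two designs (keys and URLs are made up).

# Current: long alphanumeric keys map straight to URLs.
urls_by_key = {
    "a1b2c3d4e5f6g7h8": "https://example.com/page/1",
    "q9w8e7r6t5y4u3i2": "https://example.com/page/2",
}

# Proposed: short integer IDs map to URLs, plus an extra mapping
# from the original keys to those IDs.
id_by_key = {"a1b2c3d4e5f6g7h8": 1, "q9w8e7r6t5y4u3i2": 2}
urls_by_id = {1: "https://example.com/page/1", 2: "https://example.com/page/2"}

key = "a1b2c3d4e5f6g7h8"
url_direct = urls_by_key[key]               # one lookup
url_indirect = urls_by_id[id_by_key[key]]   # two lookups
assert url_direct == url_indirect
```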

Perhaps I'm just adding complexity to the program for little benefit?

+2




2 answers


Dictionaries are hash tables, so a key lookup consists of the following steps (a toy sketch in code follows the list):

  • Hash the key.
  • Reduce the hash to the size of the table.
  • Index the table with the result.
  • Compare the stored key with the search key.
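
Here's a toy sketch of those four steps, just to make them concrete; this is not how CPython actually implements dicts (real dicts use open addressing with a perturbed probe sequence, not chained buckets):

```python
# Toy illustration of the four lookup steps above.
TABLE_SIZE = 8
table = [[] for _ in range(TABLE_SIZE)]        # list of buckets

# Insert a couple of entries the same way a lookup will find them.
for k, v in [("alpha", 1), ("beta", 2)]:
    table[hash(k) % TABLE_SIZE].append((k, v))

def toy_lookup(key):
    h = hash(key)                 # 1. hash the key
    index = h % TABLE_SIZE        # 2. reduce the hash to the size of the table
    bucket = table[index]         # 3. index the table with the result
    for stored_key, value in bucket:
        if stored_key == key:     # 4. compare the stored key with the search key
            return value
    raise KeyError(key)

assert toy_lookup("alpha") == 1
```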

This is usually amortized constant time, and you don't need to worry about anything more detailed than that. There are two potential problems, but they don't come up often.


Hashing a key takes linear time in the length of the key. For, e.g., huge strings, this can be a problem. However, if you look at the source code for most of the important types, including [str/unicode](https://hg.python.org/cpython/file/default/Objects/unicodeobject.c), you can see that they cache the hash the first time it is computed. So, unless you are inputting (or randomly creating, or whatever) a string to look up once and then throw away, this is unlikely to be a problem in most real-world programs.

Also, 20 characters is really pretty short; you can probably do millions of such hashes per second, not hundreds.



From a quick test on my computer, hashing 20 random letters takes 973ns, hashing a 4-digit number takes 94ns, and hashing a value I have already hashed once takes 77ns. Yes, those are nanoseconds.
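
If you want to reproduce numbers like these, something along these lines with timeit will do; the exact figures depend on your machine and Python version:

```python
import timeit

N = 100_000

# Pre-build distinct 20-character strings so the timed statement hashes each
# string for the first time (no cached hash yet); next(it) adds minor overhead.
setup = (
    "import random, string; "
    "keys = [''.join(random.choices(string.ascii_letters, k=20)) "
    "for _ in range(100_000)]; it = iter(keys)"
)
t_first = timeit.timeit("hash(next(it))", setup=setup, number=N)

# Hashing a small int is cheap (a small int essentially hashes to itself).
t_int = timeit.timeit("hash(1234)", number=N)

# Hashing the same string again reuses the hash cached on the str object.
t_cached = timeit.timeit("hash(s)", setup="s = 'x' * 20", number=N)

print(f"first hash of a 20-char str: {t_first / N * 1e9:.0f} ns")
print(f"hash of a 4-digit int:       {t_int / N * 1e9:.0f} ns")
print(f"re-hash of a cached str:     {t_cached / N * 1e9:.0f} ns")
```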


Meanwhile, "Index table with result" is a bit of a hoax. What happens if two different key hashes have the same index? Then "compare the verified key" will fail, and ... what will happen next? The CPython implementation uses probing for this. The exact algorithm is pretty well explained in the source . But you will notice that given the truly pathological data, you can do a linear search for each item. It will never come - unless someone can attack your program by clearly creating pathological data, in which case it will definitely come.

Switching from 20-character strings to 4-digit numbers wouldn't help with that, either. If I'm crafting keys to DoS your system via dictionary collisions, I don't care what your actual keys look like, just what they hash to.
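
To see what pathological collisions look like, here's a small experiment with a deliberately broken hash; the BadKey class is purely illustrative:

```python
import timeit

class BadKey:
    """A key whose instances all hash to the same value (for illustration only)."""
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return 42                          # every instance collides
    def __eq__(self, other):
        return isinstance(other, BadKey) and self.value == other.value

good = {i: None for i in range(1000)}           # normal, well-distributed int keys
bad = {BadKey(i): None for i in range(1000)}    # every key lands in one probe chain

t_good = timeit.timeit(lambda: 999 in good, number=1000)
t_bad = timeit.timeit(lambda: BadKey(999) in bad, number=1000)
print(f"normal keys:    {t_good:.4f}s")
print(f"colliding keys: {t_bad:.4f}s")   # orders of magnitude slower
```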


More generally, premature optimization is the root of all evil. That quote is sometimes taken to overstate the point; Knuth was arguing that the most important thing is to find the 3% of cases where optimization matters, not that optimization is always a waste of time. Either way, the point is: if you don't know in advance where your program is too slow (and if you think you know in advance, you're usually wrong...), profile it, then find the part where you get the most bang for your buck. Optimizing one arbitrary piece of your code is unlikely to have any measurable effect.
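
A minimal profiling sketch with the standard-library cProfile; the workload function here is just a placeholder for whatever your program really does:

```python
import cProfile
import pstats

def workload():
    # Stand-in for real work: build a dict of long keys and look them all up.
    urls = {f"key-{i:016d}": f"https://example.com/{i}" for i in range(10_000)}
    return sum(len(urls[f"key-{i:016d}"]) for i in range(10_000))

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Show where the time actually goes before deciding what to optimize.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```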

+6




Python dictionaries are implemented as hash maps under the hood. Key length can affect performance if, for example, the cost of the hash function depends on the key length. But the overall performance impact will be completely negligible.



So, I would say there is little to gain from the added complexity.

+1








