Why do we use linked lists to resolve collisions in hash tables?

I was wondering why many languages (Java, C++, Python, Perl, etc.) implement hash tables using linked lists to handle collisions, rather than arrays?
I mean, instead of a bucket holding a linked list, why not use an array?
If the problem is the size of the array, then we have too many collisions, which means we already have a problem with the hash function, not with how we handle collisions. Am I misunderstanding something?

+3




5 answers


I mean, instead of a bucket holding a linked list, why not use an array?

There are pros and cons to everything, depending on many factors.

The two biggest problems with arrays are:

  • changing capacity involves copying all content to another memory area

  • you need to choose between:

    a) arrays of Element*s, which add one extra level of indirection during table operations and one extra memory allocation per non-empty bucket, with the corresponding heap-management overhead

    b) arrays of Elements, so that pre-existing Element iterators/pointers/references are invalidated by some operations on other nodes (e.g. insert) (a linked list - or 2a above, for that matter - never needs to invalidate them)

... I will ignore a few smaller design choices regarding indirection with arrays ...

Practical ways to reduce the copying in 1. include keeping excess capacity (i.e. currently unused memory for anticipated or already-erased elements), and - if sizeof(Element) is much greater than sizeof(Element*) - you are pushed towards arrays of Element*s (with the "2a" problems) rather than Element[]s / 2b.
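As a rough sketch of the two array layouts described in 2a/2b (Element and the type aliases are illustrative names, not from any library):

#include <vector>

struct Element { int key; int value; };

// 2a: the bucket holds pointers - one extra indirection on every access and
// one extra heap allocation per stored element, but the Element objects
// themselves never move, so pointers/references to them stay valid while
// the bucket grows.
using BucketOfPointers = std::vector<Element*>;

// 2b: the bucket holds the elements directly - better cache locality and no
// per-element allocation, but growing the vector relocates the Elements,
// invalidating outstanding iterators/pointers/references to them.
using BucketOfElements = std::vector<Element>;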




Several other answers claim that erasing from arrays is more expensive than from linked lists, but the opposite is often true: searching contiguous Elements is faster than scanning a linked list (fewer steps in code, more cache-friendly), and once found, you can copy the last Element or Element* in the array over the one being erased and then decrement the size.
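A minimal sketch of that erase technique, assuming an unordered array-style bucket (so element order within a bucket doesn't matter) and illustrative names:

#include <cstddef>
#include <utility>
#include <vector>

// Overwrite the erased slot with the last element, then shrink the bucket by
// one. Nothing else moves and no other bucket is touched.
template <typename Element>
void unordered_erase(std::vector<Element>& bucket, std::size_t index) {
    if (index != bucket.size() - 1)
        bucket[index] = std::move(bucket.back());
    bucket.pop_back();
}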




If the problem is the size of the array, then we have too many collisions, which means we already have a problem with the hash function, not with how we handle collisions. Am I misunderstanding something?

To answer that, let's look at what happens with a great hash function. Packing a million items into a million buckets using a cryptographic-strength hash, a few runs of my program counting how many buckets had 0, 1, 2, etc. items hash to them gave ...

0=367790 1=367843 2=184192 3=61200 4=15370 5=3035 6=486 7=71 8=11 9=2
0=367664 1=367788 2=184377 3=61424 4=15231 5=2933 6=497 7=75 8=10 10=1
0=367717 1=368151 2=183837 3=61328 4=15300 5=3104 6=486 7=64 8=10 9=3

      

If we increase this to 100 million items - still with a load factor of 1.0:

0=36787653 1=36788486 2=18394273 3=6130573 4=1532728 5=306937 6=51005 7=7264 8=968 9=101 10=11 11=1

      

We can see that the ratios are quite stable. Even with a load factor of 1.0 (the default maximum for C++'s unordered_set and unordered_map), 36.8% of buckets can be expected to be empty, another 36.8% to hold one Element, 18.4% to hold 2 elements, and so on. For any given array-resizing logic you can easily get a sense of how often it would need to resize (and potentially copy elements). You are right that it doesn't look bad, and it might be better than linked lists if you do lots of lookups or iterations - for this idealistic case of a cryptographic hash.
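A minimal sketch (my own reconstruction, not the answerer's program) of how counts like these can be reproduced; std::mt19937_64 stands in for a strong hash here, since a cryptographic hash behaves like a uniform random mapping. The stable ratios are just the Poisson distribution with mean 1: about 36.8% (e^-1) of buckets empty, 36.8% with one item, 18.4% with two, and so on.

#include <cstdio>
#include <map>
#include <random>
#include <vector>

int main() {
    constexpr std::size_t n = 1000000;          // 1M items into 1M buckets: load factor 1.0
    std::mt19937_64 rng(12345);                 // stand-in for uniformly distributed hash values
    std::vector<unsigned> bucket_size(n, 0);

    for (std::size_t i = 0; i < n; ++i)
        ++bucket_size[rng() % n];               // "hash" each item into one of n buckets

    std::map<unsigned, std::size_t> histogram;  // bucket size -> number of such buckets
    for (unsigned s : bucket_size)
        ++histogram[s];

    for (const auto& [size, count] : histogram)
        std::printf("%u=%zu ", size, count);
    std::printf("\n");
}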

But good-quality hashing is relatively expensive in CPU time, so the hash functions shipped with general-purpose hash tables are often very weak: for example, it is very common for C++ Standard Library implementations of std::hash<int> to return their argument unchanged, and MS Visual C++'s std::hash<std::string> picks 10 characters evenly spaced along the string to incorporate into the hash value, no matter how long the string is.
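This is easy to check for yourself; on common implementations (e.g. libstdc++, MSVC) the following prints each key back as its own hash value, though the Standard doesn't require that, so treat it as an observation rather than a guarantee:

#include <functional>
#include <iostream>

int main() {
    // On many implementations std::hash<int> is the identity function, so
    // consecutive integer keys land in consecutive buckets.
    for (int key : {0, 1, 2, 42, 1000000})
        std::cout << key << " -> " << std::hash<int>{}(key) << '\n';
}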

Clearly, implementation experience is that this combination of weak-but-fast hash functions with linked lists (or trees) to handle the higher collision rate works out faster on average - and has fewer user-antagonising cases of obnoxiously bad performance - for everyday keys and requirements.

+2




Strategy 1

Use (small) arrays which get instantiated and subsequently filled once collisions happen. One heap operation for the array allocation, then room for N-1 more entries. If no collision ever hits that bucket again, the capacity for N-1 entries is wasted. The list wins if collisions are rare: no excess memory is allocated just for the probability of more overflows on that bucket. Removing items is also more expensive: either mark deleted spots in the array, or move the entries behind them to the front. And what if the array is full? A linked list of arrays, or resize the array?

One potential benefit of the array approach would be to do a sorted insert and then a binary search on retrieval. The linked list approach cannot compete with that. But whether it pays off depends on the write/retrieve ratio: the less often you write, the more it can pay off.
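A minimal sketch of such a sorted-array bucket (Entry and the function names are illustrative):

#include <algorithm>
#include <vector>

struct Entry { int key; int value; };

// Keep the bucket sorted by key so lookups can binary-search instead of
// scanning linearly; inserts pay for that by shifting later entries.
void bucket_insert(std::vector<Entry>& bucket, Entry e) {
    auto pos = std::lower_bound(bucket.begin(), bucket.end(), e.key,
                                [](const Entry& lhs, int k) { return lhs.key < k; });
    bucket.insert(pos, e);   // O(bucket size) shift, plus a possible reallocation
}

const Entry* bucket_find(const std::vector<Entry>& bucket, int key) {
    auto pos = std::lower_bound(bucket.begin(), bucket.end(), key,
                                [](const Entry& lhs, int k) { return lhs.key < k; });
    return (pos != bucket.end() && pos->key == key) ? &*pos : nullptr;
}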

Strategy 2

Use lists. You pay for what you get: 1 collision = 1 heap operation. There is no eager guess (and memory price to pay) that "more will come". Linear search within the collision lists. Cheaper deletion (not counting the free() here). One major motivation to consider arrays instead of lists would be to reduce the number of heap operations. Amusingly, the general assumption seems to be that they are cheap, but few actually know how much time an allocation takes compared to, say, traversing the list in search of a match.
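And the corresponding sketch for the list strategy - exactly one allocation per stored entry and nothing reserved in advance (again with illustrative names):

struct Node {
    int   key;
    int   value;
    Node* next;
};

// Push-front insert into a bucket's chain: one heap allocation per collision,
// no capacity guessed ahead of time.
void chain_insert(Node*& bucket_head, int key, int value) {
    bucket_head = new Node{key, value, bucket_head};
}

// Lookup is a linear walk along the chain.
Node* chain_find(Node* bucket_head, int key) {
    for (Node* n = bucket_head; n != nullptr; n = n->next)
        if (n->key == key)
            return n;
    return nullptr;
}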



Strategy 3

Use neither arrays nor lists, but store the hash table's overflow entries elsewhere. The last time I mentioned this, I got frowned upon a bit. Benefit: zero memory allocations. It probably works best if the table's fill grade is really low and there are only a few collisions.
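One way to read this strategy - an assumption on my part, since the answer doesn't say where "elsewhere" is - is open-addressing-style probing, where a colliding entry goes into the next free slot of the already-allocated table:

#include <cstddef>
#include <optional>
#include <vector>

struct Slot { bool used = false; int key = 0; int value = 0; };

// Linear probing: on a collision, walk forward (wrapping around) until a free
// slot is found. After the table itself is created, no further allocations
// ever happen. Assumes the table never becomes completely full.
bool probe_insert(std::vector<Slot>& table, std::size_t hash, int key, int value) {
    for (std::size_t i = 0; i < table.size(); ++i) {
        Slot& s = table[(hash + i) % table.size()];
        if (!s.used || s.key == key) { s = {true, key, value}; return true; }
    }
    return false;   // table full
}

std::optional<int> probe_find(const std::vector<Slot>& table, std::size_t hash, int key) {
    for (std::size_t i = 0; i < table.size(); ++i) {
        const Slot& s = table[(hash + i) % table.size()];
        if (!s.used) return std::nullopt;   // reached an empty slot: key absent
        if (s.key == key) return s.value;
    }
    return std::nullopt;
}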

Summary

In fact, there are many options and tradeoffs. Generic hash table implementations, for example those in standard libraries, cannot make any assumptions about the write/read ratio, hash key quality, use cases, etc. If, on the other hand, all of these characteristics of a hash table's usage are known (and if it is worth the effort), it is entirely possible to create an optimized hash table implementation that addresses the tradeoffs required by that application.

+1




The reason is that the expected length of these lists is tiny, with zero, one, or two elements in the vast majority of cases. However, these lists can also get arbitrarily long in the worst case of a very bad hash function. And while this worst case is not the one that hash tables are optimized for, they should still be able to handle it gracefully.

Now, for an array-based approach you need to set a minimum size for the array. And if that initial array size is anything other than zero, you already have significant overhead due to all the empty buckets. A minimum array size of two would mean that you waste half of your space. And you would need to implement logic to reallocate the arrays when they become full, because you cannot set an upper limit on the list length; you must be able to handle the worst case.

Under these constraints, the list-based approach is significantly more efficient: it only has the allocation overhead of the node objects, most accesses have the same amount of indirection as the array-based approach, and it is easier to write.

I am not saying that it is impossible to write an array-based implementation, but it is significantly more complex and less efficient than the list-based approach.

+1




why do many languages (Java, C++, Python, Perl, etc.) implement hash tables using linked lists to handle collisions instead of arrays?

I'm pretty sure, at least for most of these "many" languages:

The original developers of hash tables for these languages just followed the classic description of the algorithm from Knuth's book (or another algorithms book) and didn't even consider such subtle implementation options.

Some observations:

  • Even choosing separate chaining for collision resolution instead of, say, open addressing for "the most generic hash table implementation" is a seriously questionable choice. My personal conviction is that it is not the right choice.

  • When the hash table load factor is rather low (which should be the choice in nearly 99% of hash table uses), the difference between the proposed approaches can hardly affect the overall performance of the data structure (as Mascar explained at the beginning of his answer, and as delnan sensibly noted in the comments). Since generic hash table implementations in these languages are not designed for high density, linked lists versus arrays is not an issue for them.

  • Coming back to the actual question of the topic, I see no conceptual reason why linked lists should be better than arrays. I can well imagine that arrays are actually faster on modern hardware and consume less memory with modern memory allocators in modern language environments / operating systems, especially when the hash table key is a primitive or a copied structure. You can find some arguments supporting this opinion here: http://en.wikipedia.org/wiki/Hash_table#Separate_chaining_with_other_structures

    But the only way to find the right answer (for a specific CPU, OS, memory allocator, virtual machine and its garbage collection algorithm, and the hash table use case / workload!) is to implement both approaches and compare them.

Am I misunderstanding something?

No, you aren't missing anything; your question is legitimate. This is a case of justified confusion, where something is done a certain way not for a serious reason, but rather by happenstance.

+1




If implemented with arrays, insertion would be costly due to reallocation, which does not happen with a linked list.

Coming to deletion, we have to search through the entire array, then either mark the entry as deleted or shift the remaining elements. (In the first case, this makes insertion even harder, since we then have to look for empty slots.)

To improve the worst-case time complexity from O(n) to O(log n), once the number of items in a hash bucket exceeds a certain threshold, the bucket switches from a linked list of entries to a balanced tree (in Java).
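Java's HashMap does this since Java 8, converting a bin to a red-black tree once it grows past a small threshold (TREEIFY_THRESHOLD, which is 8). As a rough illustration of the idea - not Java's actual code, and written in C++ here to match the other examples - a bucket can start as a list and switch to an ordered tree past a threshold:

#include <cstddef>
#include <list>
#include <map>
#include <utility>

// Illustrative only: a bucket that begins as a linked list (O(n) lookup) and
// converts itself to a balanced tree (O(log n) lookup) once it grows past a
// threshold. Duplicate-key handling is omitted for brevity.
struct Bucket {
    static constexpr std::size_t kTreeifyThreshold = 8;   // Java uses 8 (TREEIFY_THRESHOLD)

    std::list<std::pair<int, int>> chain;   // used while the bucket is small
    std::map<int, int> tree;                // used once the bucket is "treeified"
    bool treeified = false;

    void insert(int key, int value) {
        if (treeified) { tree[key] = value; return; }
        chain.emplace_back(key, value);
        if (chain.size() > kTreeifyThreshold) {            // convert list -> tree
            for (auto& kv : chain) tree[kv.first] = kv.second;
            chain.clear();
            treeified = true;
        }
    }

    int* find(int key) {
        if (treeified) {
            auto it = tree.find(key);
            return it == tree.end() ? nullptr : &it->second;
        }
        for (auto& kv : chain)
            if (kv.first == key) return &kv.second;
        return nullptr;
    }
};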

+1








