Why does LLVM choose an open-addressing hash table for the llvm::StringMap implementation?

Many sources say that open addressing, the hash-collision handling approach used in llvm::StringMap, scales poorly: it is said to be inferior to separate chaining when the load factor is high (as you might imagine).

But if the load factor is low, open addressing wastes a lot of memory, because I have to allocate bucket_count * sizeof(Record) bytes of memory even though most buckets hold no record.

So my question is: why does LLVM choose open addressing over separate chaining? Is it simply because of the speed advantage gained from cache locality (records being stored in the buckets themselves)?

Thanks :)

Edit: The C++11 standard requirements on std::unordered_set and std::unordered_map effectively imply a chaining approach, not open addressing. Why does LLVM choose a collision-handling method that could not even satisfy the C++ standard? Are there special use cases for llvm::StringMap that justify this deviation? (Edit: this slide deck compares the performance of several LLVM data structures to that of their STL counterparts.)

Another question, by the way: how does llvm::StringMap avoid recomputing the hash values of the keys as the table grows? The manual says:

Hash table growth does not recompute the hash values for strings already in the table.

+3




1 answer


Let's take a look at the implementation. There we see that the table is stored as parallel arrays: an array of indirection pointers to the entries plus an array of cached 32-bit hash codes; that is, struct-of-arrays rather than an array of structs.

Effectively:

struct StringMap {
 uint32_t hashcode[CAPACITY];        // cached 32-bit hash per slot
 StringMapEntry *hashptr[CAPACITY];  // pointer to the corresponding entry
};


Except that the capacity is dynamic, and the load factor appears to be maintained between 37.5% and 75% of capacity.

For N records at a load factor of F, this gives N/F 32-bit integers plus N/F pointers for the open-addressing implementation, versus N*(1+1/F) pointers plus N 32-bit integers for an equivalent chained implementation. On a typical 64-bit system, the open-addressing version ranges from ~4% larger to ~30% smaller.

However, as you rightly suspected, the main advantage here is cache effects. Beyond the ambient reduction in cache pressure from the shrunken data, collision resolution boils down to a linear re-probe over consecutive 32-bit hash codes, without touching any additional memory. Rejecting a colliding slot is therefore much faster than in the chained case, where following a link means a load from what is likely uncached storage, so a significantly higher load factor can be tolerated. On the other hand, one extra likely cache miss is taken on the lookup into the pointer table, but that is a constant cost that does not degrade with load, equivalent to a single chained collision.



Effectively:

StringMapEntry *StringMap::lookup(const char *text) {
    uint32_t hash_value = hash_function(text);   // computed once, up front
    for (uint32_t *scan = &hashcode[hash_value % CAPACITY]; *scan != SENTINEL; ++scan) {
        if (hash_value == *scan) {
            StringMapEntry *entry = hashptr[scan - hashcode];
            if (!std::strcmp(entry->text, text))
                return entry;
        }
    }
    return nullptr;
}


Plus subtleties such as wrap-around at the end of the table.

As for the second question, the optimization is precomputing and caching the hash keys. This wastes some storage, but avoids the costly work of examining a potentially long, variable-length string unless a match is all but certain. And in degenerate cases, a complicated template instantiation name can be hundreds of characters long.

A further optimization in RehashTable is the use of a power-of-two table size instead of a prime. This ensures that growing the table effectively brings one additional bit of the hash code into play, and de-interleaves the doubled table into two consecutive target arrays, effectively making the operation a cache-friendly linear scan.

+4








