Fastest way to search in a sorted static array

I am looking for the fastest way to search a sorted, fixed array of 32-bit keys. The size and contents of the array are static and will never change. The array holds ~1000-10000 unique elements. The search range is much wider (~100000 possible values), so many of the searched values will not be found. I'm only interested in exact matches.

The search proceeds as follows (a rough sketch of the driver loop is shown after the list):

  • Generate ~100 keys. These keys are in order of relevance, so they cannot simply be sorted.
  • Look up those ~100 keys in a collection of static arrays (usually 50 to 300 of them).
  • Stop searching once enough matches have been found (hence it's important not to sort the keys, so that the most relevant results are kept).
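
Roughly, the driver loop looks like the sketch below (illustrative only, in Java for brevity; StaticArraySearcher and findMatches are made-up names standing in for whatever per-array search ends up being used):

import java.util.ArrayList;
import java.util.List;

// Placeholder for a per-array exact-match search (currently binary search).
interface StaticArraySearcher {
    boolean contains(int key);
}

// Check keys in relevance order against every array, stopping as soon as
// enough matches have been collected.
static List<Integer> findMatches(int[] keysByRelevance,
                                 List<StaticArraySearcher> arrays,
                                 int enough) {
    List<Integer> matches = new ArrayList<>();
    for (int key : keysByRelevance) {        // most relevant keys first
        for (StaticArraySearcher array : arrays) {
            if (array.contains(key)) {
                matches.add(key);
                break;                        // one hit per key is enough here
            }
        }
        if (matches.size() >= enough) {
            return matches;                   // early exit preserves relevance order
        }
    }
    return matches;
}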

A potentially interesting property of the keys is that even when they are not close in integer value, most of them differ from their nearest neighbor in only a few bits (~1-4).

Most of the answers I've found point to binary search, but none deal with the static-array case, which probably opens up some optimization possibilities.

I have full control over the data structure; right now it's a fixed, sorted array, but I can change that if it's not optimal. I could also store precomputed information, since the data never changes, as long as it doesn't take up unreasonable memory.

The goal is to be efficient in both CPU and memory, although CPU is the priority here.

Using C++, although it probably won't affect the answer.

1 answer


Given that your static arrays never change, and that you have infinite preprocessing power, I think the best approach would be to create a specific hash function for each of your arrays.

My approach is to define a parameterized hash function (code in Java):

import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Function;

// Builds a randomly parameterized mixing function that maps a key to a
// bucket index in [0, sz).
private static Function<Long, Integer> createHashFunction(int sz) {
    int mvLeft = ThreadLocalRandom.current().nextInt(30);
    int mvRight = ThreadLocalRandom.current().nextInt(16);
    int mvLeft2 = ThreadLocalRandom.current().nextInt(10);
    int mvRight2 = ThreadLocalRandom.current().nextInt(16);
    int mvLeft3 = ThreadLocalRandom.current().nextInt(16);
    int mvRight3 = ThreadLocalRandom.current().nextInt(20);
    return (key) -> {
        // These operations are totally random and have no mathematical
        // background beneath them!
        key = ~key + (key << mvLeft);
        key = key ^ (key >>> mvRight);
        key = key + (key << mvLeft2);
        key = key ^ (key >>> mvRight2);
        key = key + (key << mvLeft3);
        key = key ^ (key >>> mvRight3);
        // sz is the size of the target array. (Beware: Math.abs(Long.MIN_VALUE)
        // is negative; if that ever bites, just retry with new parameters.)
        return (int) (Math.abs(key) % sz);
    };
}

For each target array, search for the combination of parameters that yields the smallest maximum bucket size.
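
A minimal sketch of that parameter search, reusing createHashFunction from above (findGoodHash, tableSize, and maxBucket are names I made up for illustration):

import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Keep generating random parameter sets until one yields a hash whose
// largest bucket is small enough for the given array.
static Function<Long, Integer> findGoodHash(long[] keys, int tableSize, int maxBucket) {
    while (true) {
        Function<Long, Integer> candidate = createHashFunction(tableSize);
        Map<Integer, Integer> bucketSizes = new HashMap<>();
        int largest = 0;
        for (long key : keys) {
            largest = Math.max(largest, bucketSizes.merge(candidate.apply(key), 1, Integer::sum));
            if (largest > maxBucket) {
                break;                        // this candidate is already too lumpy
            }
        }
        if (largest <= maxBucket) {
            return candidate;                 // keep these parameters for this array
        }
    }
}

Since the parameters are found offline, the endless retry loop is harmless; only the winning parameter set needs to be stored alongside each array.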



Some testing (input arrays of 10k elements, filled with random values):

  • Mapping hashes into [0..262k] yields buckets of at most 2 elements. Over 5k random arrays, a single-threaded version finds suitable hash functions at a rate of ~100 arrays/second.

Given that with a max bucket size of 2 both values can be packed into a single 64-bit integer, this approach costs only one memory hop, and the hashing itself uses only the simplest CPU operations (xor, addition, and shifts), plus a final bit comparison, all of which should be extremely fast.
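
To make the single memory hop concrete, a packed table and its lookup could look like the sketch below (my illustration, not code from the answer; it assumes the hash has already been verified to give buckets of at most 2 keys, that the keys array holds the 32-bit values zero-extended into longs, and that the pattern 0xFFFFFFFF never occurs as a real key):

import java.util.Arrays;
import java.util.function.Function;

// Each slot packs up to two 32-bit keys into one long; -1L marks both
// halves as empty (assumes 0xFFFFFFFF is never a valid key).
static long[] buildTable(long[] keys, Function<Long, Integer> hash, int tableSize) {
    long[] table = new long[tableSize];
    Arrays.fill(table, -1L);
    for (long key : keys) {
        int slot = hash.apply(key);
        long packed = table[slot];
        if ((int) packed == -1) {                     // low half is free
            table[slot] = (packed & ~0xFFFFFFFFL) | (key & 0xFFFFFFFFL);
        } else {                                      // bucket size <= 2: high half is free
            table[slot] = (key << 32) | (packed & 0xFFFFFFFFL);
        }
    }
    return table;
}

// Lookup: hash, one memory read, then compare both 32-bit halves.
static boolean contains(long[] table, Function<Long, Integer> hash, int key) {
    long packed = table[hash.apply(key & 0xFFFFFFFFL)];
    return (int) packed == key || (int) (packed >>> 32) == key;
}

The query touches exactly one slot of the table and then does two register compares, which is where the speed comes from.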

However, your data may not be that well-behaved, and a bucket size of 3 may be required, which makes a single long long unusable for holding a bucket's items. In that case, you can try to find a more principled hash function instead of the random mess I wrote.
