Spatially efficient data structure for storing and searching across a large set of (evenly spaced) integers

Question

Spatially efficient data structure for storing and searching across a large set of (evenly spaced) integers

I have to store in memory and look at one million evenly spaced integers. My workload is extremely intense. My current implementation is using HashSet (Java). I see good search performance, but memory usage is not ideal (tens of MB).
Could you imagine a more efficient data (memory) structure?
Edit: The solution should support a small amount of data-break padding.

Background:
The integer problem outlined above is a simplification of the following problem:
I have a million-line set (my "Dictionary") and I want to tell if the Dictionary contains a given string or not. The vocabulary is too large to fit into memory, so I'm willing to sacrifice tiny precision to reduce memory footprint. I'll do this by switching to a dictionary containing each String Hashcode (integer) value instead of the actual characters. I am assuming the probability of collision for each row is 1M/2^32

.

+3

java memory-management hashcode data-structures

Gili nachum 17 Mar 12 at 21:17

source to share

6 answers

Jerry coffin · Answer 1 · 2012-03-17T22:47:11+0000

While Jon Skeet's answer provides good savings for a small investment, I think you can do better. Since your numbers are fairly distributed, you can use interpolation search for faster searches (roughly O (log log N) instead of O (log N)). For a million items, you can probably plan around 4 comparisons, not around 20.

If you want to do a little more work to cut the memory footprint (roughly) in half, you can create it as a two-level lookup table, basically a kind of simple trie version.

enter image description here

You have broken a 32 bit integer (presumably) into two 16 bit chunks. You should use the first 16 bits as an index on the first level of the lookup table. At this level, you will have 65536 pointers, one for each possible 16-bit value for that part of your integer. This will take you to the second level of the table. For this part, we performed a binary or interpolation search between the selected pointer and the next - that is, all the second level values that had the same value in the first 16 bits.

However, when we look in the second table, we already know the 16 bits of the value - so instead of storing all 32 bits of the value, we only need to store the remaining 16 bits of the value.

This means that instead of the second level, which is 4 megabytes, we have reduced it to 2 megabytes. Along with this, we need a first level table, but it is only 65536x4 = 256 KB.

This will almost certainly improve the binary search speed of the entire dataset. In the worst case (using binary search for the second level) we can have up to 17 comparisons (1 + log₂65536). The average will be better than that, though since we only have a million items, each second level partition can only have 1_000_000 / 65536 = ~ 15 items, which gives roughly 1 + log₂(16) = 5 comparisons. The use of interpolation search in the second level may decrease a bit, but when you start with only 5 comparisons, you don't have much room left for really significant improvements. If the second level only has ~ 15 items on average, the type of search you use won't matter much - even a linear search will be quite fast.

Of course, if you wanted, you could take it a step further and use a 4-level table instead (one for each byte in integer size). However, it may be an open question whether it will save you even more to be worth it. At least right away, I immediately assume that you will be doing a fair amount of extra work for quite minimal savings (just storing the trailing bytes from a million integers obviously takes 1 megabyte, and the three table levels leading up to that would be clearly take up a lot more, so you'll double the number of tiers to save about half a megabyte.If you're in a situation where saving only a little more will make a big difference, go for it, but otherwise, I doubt the return will justify the extra investment.

Jon Skeet · Answer 2 · 2012-03-17T21:19:34+0000

It looks like you could just keep sorted int[]

and then do a binary search. With a million values, that's ~ 20 comparisons to get any value - would that be fast enough?

Gareth rees · Answer 3 · 2012-03-18T10:19:52+0000

If you're willing to accept a small chance of a false positive in exchange for a significant reduction in memory usage, then a Bloom filter might be what you need.

The Bloom filter consists of k hash functions and an n-bit table, initially empty. To add an item to the table, feed it to each of the x hash functions (getting a number between 0 and n-1) and set the appropriate bit. To check if an element is in the table, pass it to each of the x hash functions and see if all the corresponding k bits are set.

Bloom filter with 1% false positive rate requires about 10 bits per element; the false positive rate decreases rapidly as you add more bits per element.

Here's an open source Java implementation.

M Platvoet · Answer 4 · 2012-03-17T21:44:16+0000

You might want to take a look at BitSet . The technology used in Lucene is even faster than the standard Java implementation because it neglects some of the standard boundary checks.

amit · Answer 5 · 2012-03-17T21:27:23+0000

There are several implementations IntHashSet

for the available primitives.

A quick googling got me this one . There is also an apache [open source] IntHashSet implementation . I would prefer the apache implementation, although it has some overhead [it is implemented as IntToIntMap ]

Artur Biesiadowski · Answer 6 · 2016-07-30T17:44:27+0000

I think you can revisit the original problem (having an effective wordlist) rather than trying to optimize the "optimization".

I would suggest looking at the Radix / Trie tree.

https://en.wikipedia.org/wiki/Radix_tree or https://en.wikipedia.org/wiki/Trie

Basically you store some sort of tree with string prefixes, branching out every time there is a choice in the dictionary. It has some interesting side effects (allows efficient filtering on prefixes), can save some memory for strings with longer common prefixes, and is fast enough.

Example of a Radix tree

Some examples of implementation:

https://lucene.apache.org/core/4_0_0/analyzers-stempel/org/egothor/stemmer/Trie.html

https://github.com/rkapsi/patricia-trie

https://github.com/npgall/concurrent-trees

Here's an interesting comparison of different implementations, with a lot of focus on performance rather than memory usage, but it might be useful

http://bhavin.directi.com/to-trie-or-not-to-trie-a-comparison-of-efficient-data-structures/

Spatially efficient data structure for storing and searching across a large set of (evenly spaced) integers

More articles: