Intersection algorithm for two unsorted small arrays

I am looking for an algorithm to intersect two small, unsorted arrays under very specific conditions.

  • The array elements are of an integer type.
  • A significant fraction of the time (about 30-40%?), one or both arrays are empty.
  • The arrays are usually very small, typically 1-3 elements; I don't expect more than 10.
  • The intersection function will be called very often.
  • I am not opposed to a platform-specific solution; I am working on x86 / Windows / C++.

Both the brute-force and the sort-then-intersect solutions are not that bad, but I don't think they are fast enough. Is there a better solution?

+3



4 answers


Since the arrays hold primitive types and are short enough to fit in cache lines, a fast implementation should focus on the tactical mechanics of comparison rather than on big-O complexity. For example, avoid hash tables: they usually involve hashing and indirection, and they always carry a lot of management overhead.
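As a rough illustration of that kind of low-overhead approach, here is a minimal brute-force sketch. The function name `intersect_small`, the caller-supplied output buffer, and the assumption that each input holds distinct values are mine, not part of the question.

```cpp
#include <cstddef>

// Brute-force intersection for tiny arrays: no hashing and no allocation.
// Assumes (hypothetically) that each input holds distinct values and that
// out has room for at least na elements. Returns the number of matches.
std::size_t intersect_small(const int* a, std::size_t na,
                            const int* b, std::size_t nb, int* out) {
    if (na == 0 || nb == 0) return 0;  // cheap exit for the frequent empty case
    std::size_t n = 0;
    for (std::size_t i = 0; i < na; ++i) {
        for (std::size_t j = 0; j < nb; ++j) {
            if (a[i] == b[j]) {
                out[n++] = a[i];
                break;  // a[i] is matched; move on to the next element of a
            }
        }
    }
    return n;
}
```

With 1-3 elements per array the inner loop runs at most a handful of comparisons, all on data that is already in cache.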

If you have two sorted arrays, the intersection is O(n + m). You say sort-then-intersect is brute force, but you can't do it any faster.
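The O(n + m) pass over two sorted arrays can be sketched as a two-pointer walk. The name `intersect_sorted` and the output-buffer convention are illustrative assumptions; both inputs are assumed to be sorted in ascending order already.

```cpp
#include <cstddef>

// Two-pointer intersection of two ascending-sorted arrays.
// Writes matches to out, which must hold at least min(na, nb) elements,
// and returns the number of matches. Each step advances at least one
// index, so the loop runs at most na + nb times.
std::size_t intersect_sorted(const int* a, std::size_t na,
                             const int* b, std::size_t nb, int* out) {
    std::size_t i = 0, j = 0, n = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j]) {
            ++i;
        } else if (b[j] < a[i]) {
            ++j;
        } else {
            out[n++] = a[i];  // equal: record once and advance both sides
            ++i;
            ++j;
        }
    }
    return n;
}
```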



If the arrays are kept sorted, you of course gain even more, since you say you invoke the intersection often.

The intersection itself can be done using SSE.
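One possible way to use SSE here, offered only as a sketch: broadcast each element of one array and compare it against the whole of the other array in a single instruction. This assumes 32-bit elements, at most 4 elements per array, distinct values within each array, and an x86 target with SSE2. The function name `intersect_sse2` is hypothetical, and this version works on unsorted input.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (x86 only)
#include <cstddef>
#include <cstdint>

// Intersects a (na <= 4) with b (nb <= 4). Writes matches to out
// (capacity >= na) and returns the number of matches.
std::size_t intersect_sse2(const std::int32_t* a, std::size_t na,
                           const std::int32_t* b, std::size_t nb,
                           std::int32_t* out) {
    if (na == 0 || nb == 0) return 0;
    // Copy b into a full 4-lane buffer; unused lanes hold padding zeros.
    std::int32_t padded[4] = {0, 0, 0, 0};
    for (std::size_t j = 0; j < nb; ++j) padded[j] = b[j];
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(padded));
    const int valid = (1 << nb) - 1;  // one bit per real lane of b
    std::size_t n = 0;
    for (std::size_t i = 0; i < na; ++i) {
        // Compare a[i] against all four lanes of b at once.
        const __m128i eq = _mm_cmpeq_epi32(_mm_set1_epi32(a[i]), vb);
        // One bit per lane; mask off the padding lanes so a zero in a
        // cannot match the zero padding.
        const int mask = _mm_movemask_ps(_mm_castsi128_ps(eq)) & valid;
        if (mask != 0) out[n++] = a[i];
    }
    return n;
}
```

Whether this beats a plain scalar loop for 1-3 elements is something to measure; the setup cost may outweigh the saved comparisons.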

+3



Here's a potential optimization: check whether every element of both arrays is non-negative and less than 32 (or 64, or maybe even 16). If so, fill two bitmaps of that size (of type uint32_t, etc.) and intersect them with a binary AND, &. If this is not the case, resort to sorting.
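A minimal sketch of that bitmap fast path, assuming 32-bit masks and `int` elements. The helper names `to_bitmap` and `intersect_bitmap` are made up for illustration; the caller is expected to fall back to another method when the function returns false.

```cpp
#include <cstddef>
#include <cstdint>

// Builds a 32-bit membership mask for a. Returns false if any value
// falls outside [0, 32), in which case the fast path does not apply.
bool to_bitmap(const int* a, std::size_t n, std::uint32_t& bits) {
    bits = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] < 0 || a[i] >= 32) return false;
        bits |= std::uint32_t{1} << a[i];
    }
    return true;
}

// Intersects via a single bitwise AND. On success, bit v of result is
// set exactly when v occurs in both arrays.
bool intersect_bitmap(const int* a, std::size_t na,
                      const int* b, std::size_t nb, std::uint32_t& result) {
    std::uint32_t ma = 0, mb = 0;
    if (!to_bitmap(a, na, ma) || !to_bitmap(b, nb, mb)) return false;
    result = ma & mb;
    return true;
}
```

If the arrays can be stored as bitmaps in the first place, the conversion step disappears and the intersection is just the AND.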



Or, instead of sorting, use the efficient integer set representation due to Briggs and Torczon, which takes O(m + n) to construct and supports O(min(m, n)) intersection. It should be much faster than a hash table, with better bounds than sorting.
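For reference, a sparse set in the style of Briggs and Torczon can be sketched roughly as below. It assumes the values come from a known, bounded universe [0, U). The class and function names are illustrative, and this version zero-initializes its arrays (via std::vector) rather than relying on uninitialized memory.

```cpp
#include <cstddef>
#include <vector>

// Sparse set over the universe [0, U): dense_ lists the members,
// sparse_[v] gives the position of v inside dense_.
class SparseSet {
public:
    explicit SparseSet(std::size_t universe)
        : dense_(universe), sparse_(universe), size_(0) {}

    bool contains(std::size_t v) const {
        return v < sparse_.size() && sparse_[v] < size_ &&
               dense_[sparse_[v]] == v;
    }
    // Values outside the universe and duplicates are silently ignored.
    void insert(std::size_t v) {
        if (v >= sparse_.size() || contains(v)) return;
        dense_[size_] = v;
        sparse_[v] = size_++;
    }
    void clear() { size_ = 0; }  // O(1): the arrays are left untouched
    std::size_t size() const { return size_; }
    std::size_t at(std::size_t i) const { return dense_[i]; }

private:
    std::vector<std::size_t> dense_;
    std::vector<std::size_t> sparse_;
    std::size_t size_;
};

// Walks the smaller set and probes the larger one: O(min(m, n)).
// out must be a distinct set with the same universe as a and b.
void intersect_sparse(const SparseSet& a, const SparseSet& b, SparseSet& out) {
    const SparseSet& fewer = a.size() <= b.size() ? a : b;
    const SparseSet& more = a.size() <= b.size() ? b : a;
    out.clear();
    for (std::size_t i = 0; i < fewer.size(); ++i) {
        if (more.contains(fewer.at(i))) out.insert(fewer.at(i));
    }
}
```

The memory cost is proportional to the universe size, so this only pays off when the value range is small and the sets can be reused across calls.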

+2



To determine the intersection of the two sets, you must examine every element at least once, so the best possible class of solutions is O(n + m), where n is the number of elements in one set and m is the number of elements in the other.

You can achieve this with a hash table. Given that your elements are integers, you can expect to find a fast hash function. A simple algorithm would be:

  • Iterate over the first set and add all of its elements to the hash table.
  • Iterate over the second set and, for each element, check whether it exists in the hash table; if so, add it to the intersection set or just print it.

This is O(n + m) provided that hash insertion and hash lookup are O(1).

Given that you know the sets are often empty, you can optimize this by first checking whether either set is empty and, if so, simply returning an empty set. This assumes, of course, that you know the sizes in advance and can obtain them without iterating over the sets. If so, you can optimize further by always reading and hashing the smaller set first, so that the hash table's memory usage is the smaller of the two.
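Put together, the steps and optimizations above might look like the following sketch, which uses `std::unordered_set` as the hash table. The function name `intersect_hashed` is an assumption for illustration.

```cpp
#include <unordered_set>
#include <vector>

// Hash-based intersection: hashes the smaller input, then probes it
// with the elements of the larger one.
std::vector<int> intersect_hashed(const std::vector<int>& a,
                                  const std::vector<int>& b) {
    std::vector<int> out;
    if (a.empty() || b.empty()) return out;  // early exit, sizes known in O(1)

    const std::vector<int>& fewer = a.size() <= b.size() ? a : b;
    const std::vector<int>& more = a.size() <= b.size() ? b : a;

    std::unordered_set<int> seen(fewer.begin(), fewer.end());
    for (int v : more) {
        // erase() returns 1 if v was present; removing it ensures each
        // common value is reported only once.
        if (seen.erase(v) > 0) out.push_back(v);
    }
    return out;
}
```

The result follows the order of the larger input. Note that building the table allocates memory, which may dominate the cost for arrays of only a few elements.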

+1



OK, since your arrays are quite small, insertion sort would be the fastest way to sort them; some C++ standard library implementations switch std::sort to insertion sort for ranges of fewer than about 16 elements. You can then walk both sorted arrays with iterators to compare elements and compute the intersection.

There may be other algorithms that run faster on large inputs; however, their overhead is likely to be too high for arrays of only 3-4 elements each.
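A short sketch of this approach, using a hand-written insertion sort followed by `std::set_intersection` from the standard library. The names `insertion_sort` and `sort_then_intersect` are illustrative; the inputs are taken by value so the caller's arrays stay unsorted.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Plain insertion sort; cheap for a handful of elements.
void insertion_sort(std::vector<int>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        const int key = v[i];
        std::size_t j = i;
        while (j > 0 && v[j - 1] > key) {
            v[j] = v[j - 1];  // shift larger elements one slot to the right
            --j;
        }
        v[j] = key;
    }
}

// Sorts copies of both inputs, then intersects the sorted ranges.
std::vector<int> sort_then_intersect(std::vector<int> a, std::vector<int> b) {
    insertion_sort(a);
    insertion_sort(b);
    std::vector<int> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}
```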

+1

