How to efficiently search for elements in a large vector

I have a vector<unsigned> of size 90,000 * 9,000 (about 810 million elements). I need to check many times whether an element exists in this vector or not.

To do this, I stored the vector in sorted form with std::sort() and then looked up elements with std::binary_search(). However, when profiling with perf, I found that looking up items in the vector<unsigned> is the slowest operation.

Can anyone suggest a few data structures in C/C++ which I can use for efficient search of elements at this size (90,000 * 9,000)?

I do the insert (a single bulk insert) only once. The rest of the time I only do searches, so the main overhead here is the search.
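For reference, a minimal sketch of this setup (the identifiers and data here are illustrative, not the original code):

    #include <algorithm>
    #include <vector>

    // Sorted once after the single bulk insert; each lookup is then O(log n).
    bool contains(const std::vector<unsigned>& sorted, unsigned x) {
        return std::binary_search(sorted.begin(), sorted.end(), x);
    }

    int main() {
        std::vector<unsigned> data = {42, 7, 19, 7, 100};  // illustrative data
        std::sort(data.begin(), data.end());
        bool found = contains(data, 19);  // true
        (void)found;
    }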

+3


source


4 answers


You have 810 million values out of 4 billion possible values (assuming a 32-bit unsigned). That is 1/5 of the entire range, and stored as a vector it uses 3.2 GB. This means you are actually better off with a std::vector<bool> of 4 billion bits. This gives you O(1) lookup in a smaller space (0.5 GB).



(In theory, unsigned could be only 16 bits. unsigned long is at least 32 bits; std::uint32_t may be what you want.)
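A minimal sketch of this bit-vector approach, assuming 32-bit unsigned values and a 64-bit build (the identifiers are illustrative):

    #include <vector>

    // One bit per possible 32-bit value: 2^32 bits = 512 MB.
    std::vector<bool> present(1ULL << 32, false);

    // Bulk insert, done once.
    void insert_all(const std::vector<unsigned>& values) {
        for (unsigned v : values)
            present[v] = true;
    }

    // O(1) membership test.
    bool contains(unsigned v) {
        return present[v];
    }

    int main() {
        insert_all({3, 14, 159});
        bool found = contains(14);  // true
        (void)found;
    }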

+10


source


Depending on the actual data structure backing the vector, a contains operation can take O(n) or O(1). It is usually O(n) if the container is backed by an array or a linked list, in which case contains is a full scan in the worst case. You have avoided the full scan by sorting and using binary search, which is O(log n). Log n is pretty good complexity; only O(1) is better. So your choices are:



  • Cache lookup results: this can be a good compromise if you look up the same element many times
  • Replace the vector with another data structure that has an efficient contains operation, such as a hash table or a set. Note that you may lose other features, such as the ordering of items
  • Use two data structures: one for contains operations, and the original vector for everything else you use it for (see the sketch after this list)
  • Use a third data structure that offers a tradeoff, such as a Bloom filter
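For the two-structure option, a minimal sketch that keeps the original vector and adds a hash set used only for membership tests (the names are illustrative):

    #include <unordered_set>
    #include <vector>

    struct Lookup {
        std::vector<unsigned> items;         // original vector, keeps order and duplicates
        std::unordered_set<unsigned> index;  // hash set used only for contains

        // Bulk insert, done once.
        void insert_all(const std::vector<unsigned>& values) {
            items = values;
            index.insert(values.begin(), values.end());
        }

        // Average O(1) membership test.
        bool contains(unsigned v) const {
            return index.count(v) != 0;
        }
    };

    int main() {
        Lookup lookup;
        lookup.insert_all({42, 7, 19});
        bool found = lookup.contains(7);  // true
        (void)found;
    }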
+3


source


However, when profiling using perf, I find that finding elements in a vector is the slowest operation.

This is half of the information you need; the other half is: how fast is it compared to other algorithms and containers? Perhaps std::vector<> is actually the fastest, or perhaps it is the slowest. You will need to benchmark and profile several different designs to find out.

For example, the following naive tests use random integers on containers of size 1000 * 9000 (I would get seg-faults at larger sizes for the maps, presumably due to a 32-bit memory limit).

If you need to count the number of unique integers:

  • std::vector<unsigned> = 500 ms
  • std::map<unsigned, unsigned> = 1700 ms
  • std::unordered_map<unsigned, unsigned> = 3700 ms

If you just need to check for unique integers:

  • std::vector<bool> = 15 ms
  • std::bitset<> = 50 ms
  • std::set<unsigned> = 350 ms

Note that we are not too interested in the exact values, but rather in the relative comparison between containers. std::map<> is relatively slow, which is not surprising given the number of dynamic allocations and the non-locality of its data. The bit sets are by far the fastest, but they don't work when you need to count non-unique integers.

I would suggest running a similar test using your exact sizes and container contents, since these may affect the results. It may turn out that std::vector<> is the best solution after all, but now you have the data to back up that design choice.
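A minimal sketch of how such a timing comparison might look (the seed, sizes, and container choices are illustrative, not the original test code):

    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <set>
    #include <vector>

    // Runs f() and returns the elapsed wall-clock time in milliseconds.
    template <typename F>
    long long time_ms(F&& f) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    }

    int main() {
        std::mt19937 rng(12345);
        std::uniform_int_distribution<unsigned> dist;

        std::vector<unsigned> values(1000u * 9000u);
        for (auto& v : values) v = dist(rng);

        std::vector<bool> bits(1ULL << 32, false);
        for (unsigned v : values) bits[v] = true;

        std::set<unsigned> tree(values.begin(), values.end());

        unsigned long hits = 0;  // accumulated and printed so lookups are not optimized away
        long long bits_ms = time_ms([&] { for (unsigned v : values) hits += bits[v]; });
        long long tree_ms = time_ms([&] { for (unsigned v : values) hits += tree.count(v); });

        std::printf("vector<bool>: %lld ms, std::set: %lld ms (hits=%lu)\n",
                    bits_ms, tree_ms, hits);
    }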

+2


source


If you don't need to iterate over the collection in sorted order, then since C++11 you can use std::unordered_set<yourtype>. All you need to do is provide a hash function and an equality comparison for yourtype. Element access time is then amortized O(1), as opposed to O(log n) for a sorted vector.
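A minimal sketch, with a hypothetical yourtype, showing where the hash and equality hooks go:

    #include <cstddef>
    #include <functional>
    #include <unordered_set>

    struct yourtype {
        unsigned id;
        bool operator==(const yourtype& other) const { return id == other.id; }
    };

    struct yourtype_hash {
        std::size_t operator()(const yourtype& t) const {
            return std::hash<unsigned>{}(t.id);
        }
    };

    int main() {
        std::unordered_set<yourtype, yourtype_hash> s;
        s.insert({42});
        bool found = s.count({42}) != 0;  // amortized O(1), vs O(log n) for a sorted vector
        (void)found;
    }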

+1


source






