How to efficiently search for elements in a large vector

I have a vector<unsigned> of size 90,000 * 9,000 (about 810 million elements). I need to check many times whether an element exists in this vector or not.

To do this, I stored the vector in sorted form with std::sort() and then looked up elements with std::binary_search(). However, when profiling with perf, I found that looking up items in the vector<unsigned> is the slowest operation.

Can anyone suggest a few data structures in C/C++ which I can use for efficient search of elements at this size (90,000 * 9,000)?

I do the insert (a single bulk insert) only once. The rest of the time I only do searches, so the main overhead here is the search.
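For reference, a minimal sketch of this setup (the identifiers and data here are illustrative, not the original code):

    #include <algorithm>
    #include <vector>

    // Sorted once after the single bulk insert; each lookup is then O(log n).
    bool contains(const std::vector<unsigned>& sorted, unsigned x) {
        return std::binary_search(sorted.begin(), sorted.end(), x);
    }

    int main() {
        std::vector<unsigned> data = {42, 7, 19, 7, 100};  // illustrative data
        std::sort(data.begin(), data.end());
        bool found = contains(data, 19);  // true
        (void)found;
    }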

+3


source


4 answers


You have 810 million values out of 4 billion possible values (assuming a 32-bit unsigned). That is 1/5 of the entire range, and stored as a vector it uses 3.2 GB. This means you are actually better off with a std::vector<bool> of 4 billion bits. This gives you O(1) lookup in a smaller space (0.5 GB).



(In theory, unsigned could be only 16 bits. unsigned long is at least 32 bits; std::uint32_t may be what you want.)
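A minimal sketch of this bit-vector approach, assuming 32-bit unsigned values and a 64-bit build (the identifiers are illustrative):

    #include <vector>

    // One bit per possible 32-bit value: 2^32 bits = 512 MB.
    std::vector<bool> present(1ULL << 32, false);

    // Bulk insert, done once.
    void insert_all(const std::vector<unsigned>& values) {
        for (unsigned v : values)
            present[v] = true;
    }

    // O(1) membership test.
    bool contains(unsigned v) {
        return present[v];
    }

    int main() {
        insert_all({3, 14, 159});
        bool found = contains(14);  // true
        (void)found;
    }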

+10


source


Depending on the actual data structure backing the vector, a contains operation can take O(n) or O(1). It is usually O(n) if the container is backed by an array or a linked list, in which case contains is a full scan in the worst case. You have avoided the full scan by sorting and using binary search, which is O(log n). Log n is pretty good complexity; only O(1) is better. So your choices are:



  • Cache lookup results: this can be a good compromise if you look up the same element many times
  • Replace the vector with another data structure that has an efficient contains operation, such as a hash table or a set. Note that you may lose other features, such as the ordering of items
  • Use two data structures: one for contains operations, and the original vector for everything else you use it for (see the sketch after this list)
  • Use a third data structure that offers a tradeoff, such as a Bloom filter
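For the two-structure option, a minimal sketch that keeps the original vector and adds a hash set used only for membership tests (the names are illustrative):

    #include <unordered_set>
    #include <vector>

    struct Lookup {
        std::vector<unsigned> items;         // original vector, keeps order and duplicates
        std::unordered_set<unsigned> index;  // hash set used only for contains

        // Bulk insert, done once.
        void insert_all(const std::vector<unsigned>& values) {
            items = values;
            index.insert(values.begin(), values.end());
        }

        // Average O(1) membership test.
        bool contains(unsigned v) const {
            return index.count(v) != 0;
        }
    };

    int main() {
        Lookup lookup;
        lookup.insert_all({42, 7, 19});
        bool found = lookup.contains(7);  // true
        (void)found;
    }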
+3


source


However, when profiling using perf, I find that finding elements in a vector is the slowest operation.

This is half of the information you need; the other half is: how fast is it compared to other algorithms and containers? Perhaps std::vector<> is actually the fastest, or perhaps it is the slowest. You will need to benchmark and profile several different designs to find out.

For example, the following naive tests use random integers on containers of size 1000 * 9000 (I would get seg-faults at larger sizes for the maps, presumably due to a 32-bit memory limit).

If you need to count the number of unique integers:

  • std::vector<unsigned> = 500 ms
  • std::map<unsigned, unsigned> = 1700 ms
  • std::unordered_map<unsigned, unsigned> = 3700 ms

If you just need to check for unique integers:

  • std::vector<bool> = 15 ms
  • std::bitset<> = 50 ms
  • std::set<unsigned> = 350 ms

Note that we are not too interested in the exact values, but rather in the relative comparison between containers. std::map<> is relatively slow, which is not surprising given the number of dynamic allocations and the non-locality of its data. The bit sets are by far the fastest, but they don't work when you need to count non-unique integers.

I would suggest running a similar test using your exact sizes and container contents, since these may affect the results. It may turn out that std::vector<> is the best solution after all, but now you have the data to back up that design choice.
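A minimal sketch of how such a timing comparison might look (the seed, sizes, and container choices are illustrative, not the original test code):

    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <set>
    #include <vector>

    // Runs f() and returns the elapsed wall-clock time in milliseconds.
    template <typename F>
    long long time_ms(F&& f) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    }

    int main() {
        std::mt19937 rng(12345);
        std::uniform_int_distribution<unsigned> dist;

        std::vector<unsigned> values(1000u * 9000u);
        for (auto& v : values) v = dist(rng);

        std::vector<bool> bits(1ULL << 32, false);
        for (unsigned v : values) bits[v] = true;

        std::set<unsigned> tree(values.begin(), values.end());

        unsigned long hits = 0;  // accumulated and printed so lookups are not optimized away
        long long bits_ms = time_ms([&] { for (unsigned v : values) hits += bits[v]; });
        long long tree_ms = time_ms([&] { for (unsigned v : values) hits += tree.count(v); });

        std::printf("vector<bool>: %lld ms, std::set: %lld ms (hits=%lu)\n",
                    bits_ms, tree_ms, hits);
    }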

+2


source


If you don't need to iterate over the collection in sorted order, then since C++11 you can use std::unordered_set<yourtype>. All you need to do is provide a hash function and an equality comparison for yourtype. Element access time is then amortized O(1), as opposed to O(log n) for a sorted vector.
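A minimal sketch, with a hypothetical yourtype, showing where the hash and equality hooks go:

    #include <cstddef>
    #include <functional>
    #include <unordered_set>

    struct yourtype {
        unsigned id;
        bool operator==(const yourtype& other) const { return id == other.id; }
    };

    struct yourtype_hash {
        std::size_t operator()(const yourtype& t) const {
            return std::hash<unsigned>{}(t.id);
        }
    };

    int main() {
        std::unordered_set<yourtype, yourtype_hash> s;
        s.insert({42});
        bool found = s.count({42}) != 0;  // amortized O(1), vs O(log n) for a sorted vector
        (void)found;
    }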

+1


source






