Sort a huge array with few distinct keys

I want to sort a huge array, for example 10^8 records of type X, with no more than N different keys, where N is ~10^2. Since I don't know the range or interval of the elements, counting sort is not an option. So far, my guess is to use a hash map for the counting:

std::unordered_map< X, unsigned > counts;
for (auto x : input)
    counts[x]++;


This works fine and is 4x faster than 3-way quicksort, but I'm impatient and it's still not fast enough.

I wonder: am I missing something? Can I make better use of the fact that N is known in advance? Or can I customize the hash map to my needs?


EDIT: An additional precondition is that the input sequence is far from sorted and that all keys occur with roughly the same frequency.
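For context, the overall approach implied here is: count the keys, then rebuild the array from the counts. A minimal sketch of that (the key type and function name are illustrative, assuming X is hashable and less-than comparable):

#include <algorithm>
#include <unordered_map>
#include <utility>
#include <vector>

using X = int;  // illustrative key type; the question leaves X unspecified

// Count every distinct key, sort the ~10^2 (key, count) pairs, then rewrite
// the array in order by emitting each key `count` times.
void sort_by_counting(std::vector<X>& input) {
    std::unordered_map<X, unsigned> counts;
    for (const X& x : input)
        ++counts[x];

    std::vector<std::pair<X, unsigned>> pairs(counts.begin(), counts.end());
    std::sort(pairs.begin(), pairs.end());   // cheap: only ~N distinct keys

    auto out = input.begin();
    for (const auto& kv : pairs)
        out = std::fill_n(out, kv.second, kv.first);
}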





3 answers


STL implementations are often not perfect in terms of performance (no holy wars, please).

If you know a guaranteed and reasonable upper bound N on the number of unique elements, then you can trivially implement your own open-addressing hash table with 2^s slots, where 2^s is a power of two comfortably larger than N. Here's how I usually do it myself:

int size = 1;
while (size < 3 * N) size <<= 1;
//Note: at least 3X size factor, size = power of two
//count == -1 means an empty slot
std::vector<std::pair<X, int>> table(size, std::make_pair(X(), -1));
auto GetHash = [size](X val) -> int { return std::hash<X>()(val) & (size-1); };

for (auto x : input) {
  int cell = GetHash(x);
  bool ok = false;
  //linear probing: scan occupied slots until we find x or hit an empty slot
  for (; table[cell].second >= 0; cell = (cell + 1) & (size-1)) {
    if (table[cell].first == x) { //match found -> stop
      ok = true;
      break;
    }
  }
  if (!ok) {             //match not found -> add entry on free place
    table[cell].first = x;
    table[cell].second = 0;
  }
  table[cell].second++;  //increment counter
}


On MSVC 2013, with int used as the type X, this improves the time from 0.62 seconds to 0.52 seconds compared to your code.

Alternatively, we can choose a faster hash function. Note, however, that the choice of hash function depends heavily on the properties of the input. Let's take Knuth's multiplicative hash:



auto GetHash = [size](X val) -> int { return (val*2654435761) & (size-1); };


This further improves the time to 0.34 seconds.

In conclusion: do you really want to reimplement the standard data structures just to get a 2x speedup?

Notes: The speedup may be completely different on a different compiler / machine. You might have to do some hacks if your type X is not POD.





Counting sort would indeed be best, but it's not applicable because the range/interval is unknown.

The counting step seems easy to parallelize with a fork-join approach, e.g. using boost::thread; see the sketch below.
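A minimal fork-join sketch, using std::thread instead of boost::thread (the function name, chunking scheme, and illustrative key type are mine):

#include <algorithm>
#include <cstddef>
#include <thread>
#include <unordered_map>
#include <vector>

using X = int;  // illustrative key type

// Each thread counts its own chunk into a private map; the partial maps are
// then merged, which is cheap because there are only ~N distinct keys.
std::unordered_map<X, unsigned> parallel_count(const std::vector<X>& input,
                                               unsigned num_threads) {
    std::vector<std::unordered_map<X, unsigned>> partial(num_threads);
    std::vector<std::thread> workers;
    const std::size_t chunk = input.size() / num_threads + 1;

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end = std::min(input.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i)
                ++partial[t][input[i]];
        });
    }
    for (auto& w : workers) w.join();

    std::unordered_map<X, unsigned> total;
    for (const auto& p : partial)
        for (const auto& kv : p)
            total[kv.first] += kv.second;
    return total;
}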



You can also try a more efficient, hand-crafted hash table. unordered_map typically uses separate chaining (linked lists) to cope with potentially bad hash functions, and the memory overhead of those linked lists can hurt performance once the table no longer fits into the L1 cache. Closed hashing (open addressing) uses less memory. Some optimization hints:

  • Closed hashing with linear probing and no support for deletion.
  • hash table size a power of two, so the index can be computed with a bit mask instead of a modulo (a division takes many cycles, and there is typically only one hardware divider per core); see the sketch after this list.
  • low load factor (number of entries divided by table size) to minimize collisions. This is a trade-off between memory usage and the number of collisions; a load factor above 0.5 should be avoided. A hash table of size 256 seems a good fit for 100 entries.
  • cheap hash function. You didn't specify the type X, so perhaps a cheaper hash function would outweigh a few extra collisions.
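A small illustration of the size and indexing choices above (the helper names are mine):

#include <cstddef>
#include <cstdint>

// Pick the smallest power-of-two table size that keeps the load factor
// at or below 0.5, e.g. 100 entries -> 256 slots.
std::size_t pick_table_size(std::size_t max_entries) {
    std::size_t size = 1;
    while (size < 2 * max_entries) size <<= 1;
    return size;
}

// With a power-of-two size, the slot index is a single AND instead of a
// division: h % size == h & (size - 1).
std::size_t slot_index(std::uint64_t h, std::size_t size) {
    return h & (size - 1);
}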




I would store the elements in a sorted vector: with around 100 distinct keys, only about 1 in 10^6 records actually triggers an insertion into the vector. Lookups are then an efficient, cache-friendly binary search over the vector.
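A minimal sketch of that idea (the function name and pair layout are illustrative, assuming X is less-than comparable):

#include <algorithm>
#include <utility>
#include <vector>

using X = int;  // illustrative key type

// Keep (key, count) pairs in a vector sorted by key. Every lookup is a binary
// search; an insertion only happens the first time a key is seen (~100 times
// in total), so the O(n) cost of vector::insert is negligible overall.
std::vector<std::pair<X, unsigned>> count_with_sorted_vector(
        const std::vector<X>& input) {
    std::vector<std::pair<X, unsigned>> counts;
    for (const X& x : input) {
        auto it = std::lower_bound(
            counts.begin(), counts.end(), x,
            [](const std::pair<X, unsigned>& p, const X& key) { return p.first < key; });
        if (it != counts.end() && it->first == x)
            ++it->second;
        else
            counts.insert(it, {x, 1u});
    }
    return counts;  // already sorted by key
}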









