Fast 'group by / count' of std::vector<std::u16string> into std::map<std::u16string, int>

I have a function that reads ~10,000 words into a vector; I then want to group the words into a map to count how many times each word appears.

While the code works, it can sometimes take 2 seconds to build the map.

NB: Unfortunately, I cannot change the read function; I have to work with a std::vector<std::u16string>.

std::vector<std::u16string> vValues;
vValues.push_back( ... );
...

std::map<std::u16string, int> mValues;
for( auto it = vValues.begin(); it != vValues.end(); ++it )
{
  if( mValues.find( *it ) == mValues.end() )
  {
    mValues[*it] = 1;
  }
  else
  {
    ++mValues[*it];
  }
}


How could I speed up the grouping, i.e. counting how many times each word appears in the vector?

+3




3 answers


If you call std::map::operator[] on a new key, the key's value is value-initialized (to 0 for PODs like int). So your loop can be simplified to:

for (auto it = vValues.begin(); it != vValues.end(); ++it)
    ++mValues[*it];


If *it is not yet a key, its value defaults to 0 and is then immediately incremented to 1.



If the key already exists, it just increments.

Also, it doesn't look like you need the map to be ordered, so you could use std::unordered_map instead of std::map. Its insertion is constant time on average rather than logarithmic, which will speed things up even more.
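A minimal sketch of that approach (the helper name count_words is mine, not from the question):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Count how often each word occurs, using an unordered_map for
// average-constant-time insertion and lookup.
std::unordered_map<std::u16string, int>
count_words(const std::vector<std::u16string>& words)
{
    std::unordered_map<std::u16string, int> counts;
    counts.reserve(words.size()); // at most words.size() distinct keys
    for (const auto& w : words)
        ++counts[w]; // operator[] value-initializes a new entry to 0
    return counts;
}
```

reserve() is optional, but it avoids rehashing while the table grows.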

+4




std::vector<std::u16string> vValues;
vValues.push_back( ... );
...

std::sort( vValues.begin(), vValues.end() );
struct counted {
  std::u16string value;
  std::size_t count;
};
std::vector<counted> result;
auto it = vValues.begin();
while (it != vValues.end()) {
  auto r = std::equal_range( it, vValues.end(), *it );
  result.push_back({ *it, static_cast<std::size_t>(r.second - r.first) });
  it = r.second;
}


Then result will contain a {value, count} entry for each distinct value, and it will be sorted.

Since all the work is done in contiguous containers, it should be faster than your implementation.

If you are not allowed to mutate vValues, one thing you could do is build a vector of gsl::span<char16_t> referring into it, sort that, and then build result in a similar way. (If you don't have gsl::span, it is not hard to write one.)

Even with the extra copying into result, it may still be faster than the original solution.

Using gsl::span<char16_t const> in counted would also save some allocations (reusing the storage in vValues by tying their lifetimes together).
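If C++17 is available, std::u16string_view can play the same non-owning role as gsl::span<char16_t const>; here is a sketch under that assumption (the names counted_view and count_sorted_views are illustrative, not from the answer):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

struct counted_view {
    std::u16string_view value; // non-owning; valid only while vValues lives
    std::size_t count;
};

// Sort non-owning views of the strings, then count runs of equal values,
// leaving the original vector untouched.
std::vector<counted_view>
count_sorted_views(const std::vector<std::u16string>& vValues)
{
    std::vector<std::u16string_view> views(vValues.begin(), vValues.end());
    std::sort(views.begin(), views.end());

    std::vector<counted_view> result;
    auto it = views.begin();
    while (it != views.end()) {
        auto r = std::equal_range(it, views.end(), *it);
        result.push_back({ *it,
                           static_cast<std::size_t>(r.second - r.first) });
        it = r.second;
    }
    return result;
}
```

Each std::u16string converts implicitly to std::u16string_view, so only pointers and lengths are copied and sorted, never the character data.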



One major issue is that if your strings are very long, determining that two strings are equal is expensive, and if they share long common prefixes, determining that they differ can also be costly. The equal_range calls do log(n) comparisons per element, and the sort does n log(n) comparisons; sometimes sorting (hash of string, string) pairs can be faster than sorting plain strings, since unequal hashes make unequal strings cheap to detect.
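A sketch of that (hash, string) trick (sort_by_hash is an illustrative name): precompute each string's hash once, then sort the pairs, so most comparisons between unequal strings are settled by the cheap integer comparison.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Sort (hash, string) pairs. std::pair compares the hash first, so long
// unequal strings are usually distinguished without touching their
// characters; equal strings have equal hashes and stay adjacent.
std::vector<std::pair<std::size_t, std::u16string>>
sort_by_hash(std::vector<std::u16string> vValues)
{
    std::hash<std::u16string> hasher;
    std::vector<std::pair<std::size_t, std::u16string>> keyed;
    keyed.reserve(vValues.size());
    for (auto& s : vValues)
        keyed.emplace_back(hasher(s), std::move(s));
    std::sort(keyed.begin(), keyed.end());
    return keyed;
}
```

Runs of equal strings remain adjacent after the sort, so the run-counting pass from the answer above still applies unchanged.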

Live example with 4 different versions. Just change test1 to test2, test3 or test4.

test3 is the fastest in every test I've done:

std::unordered_map<std::string, int> test3(std::vector<std::string> vValues)
{
  std::unordered_map<std::string, int> mValues;
  for( auto it = vValues.begin(); it != vValues.end(); ++it )
  {
    ++mValues[std::move(*it)];
  }
  return mValues;
}



+4




And there is an alternative. You might want to keep a non-owning view of the strings, and if you can't control the format of your inputs, Yakk's gsl::span suggestion might work. It comes from the Guidelines Support Library.

std::unordered_map<std::u16string, unsigned> hash_corpus;
// constexpr float heuristic_parameter = ?;
// hash_corpus.max_load_factor(heuristic_parameter);
/* The maximum possible number of entries in the hash table is the size of
 * the input vector.
 */
hash_corpus.reserve(corpus.size());

// Paul McKenzie suggested this trick in the comments:
for ( const std::u16string& s : corpus)
    ++hash_corpus[s]; // If the key is not in the table, [] inserts with value 0.


0








