What's the data structure for adding / searching / counting rows?

I am trying to figure out which data structure quickly supports the following operations:

  • Add a line (if it's not there, add it if it's there), add a counter for the word)
  • Count the given line (find the line and then read the counter)

I am discussing a hash table or three. In my opinion the hash table is quick to scan and add as long as you avoid collisions. If I didn't know that my inputs ahead of schedule would be better, can you go better?

+2


source to share


3 answers


It really depends on the types of strings you are going to use as "keys". If you are using heavily variable strings, plus you don't have a good hashing algorithm for your strings, then the trie can outperform the hash.

However, given a good hash, the search will be faster than the trie. (Given a very bad hash, the opposite is true.) If you don't know your inputs but have a good hashing algorithm, I personally prefer to use a hash.



Also, most modern languages ​​/ frameworks have very good hashing algorithms, so chances are you can implement a good hash lookup with very little work that will work well enough.

+2


source


Three won't buy you much; they are only interesting when prefixes are important. Hash tables are simpler and are usually part of your language standard library if it is not directly part of the language itself (Ruby, Python, etc.). Here's the simplest way to do it in Ruby:

strings = %w(some words that may be repeated repeated)
counts = Hash.new(0)
strings.each { |s| counts[s] += 1 }
#counts => {"words"=>1, "be"=>1, "repeated"=>2, "may"=>1, "that"=>1, "some"=>1}

      



Addenda : For C ++, you can probably use the Boost hash implementation .

+1


source


One of them is fast enough.

There is no need to completely avoid collisions.

Looking a little closer at performance, usually hash tables are faster than trees, but I doubt the program started too slowly in real life because it used tree instead of HT and some trees are faster than some hash tables.

What else can we say, well, hash tables are more common than trees.

One of the advantages of complex trees is that they have predictable access times. With hash tables and simple binary trees, the performance you see is data dependent and with HT characteristic is highly dependent on the quality of the implementation and its configuration relative to the size of the dataset.

0


source







All Articles