Number of characters using distributed programming

I have a huge file (only contains ascii characters) and I need to find the character that appears the most frequently.

My approach:

  • Split the file and distribute it to multiple processing nodes.
  • Each node will count characters and generate an array of character counts [256].
  • The parent node will get the entire count array from all nodes and calculate the most frequently used character.

But I'm wondering if the nodes need to pass the entire count array to just calculate the most frequent character? Is there a way to reduce the amount of processed data transferred between nodes.

Note. I'm new to distributed programming, so I'm trying to get familiar with fundamental techniques.

+3


source to share


2 answers


If you want to know the exact number of the most frequent character then yes, each node will have to return ALL counts, perhaps one node will count 1 million 'a' and another one only 1 instance. To get an accurate total, you will need all accounts.



Also (unrelated) item 1 says you are going to "split and distribute the file". Does this imply reading it on one machine and sending it over the network? In this case, you've already read part of the file into memory, so you can scan it right away while the caches are still warm. Of course, if you have a preallocated file, it doesn't matter.

+1


source


If you allow each node process eg. 1 MiB followed by 1 KiB of response (256 times 4 bytes for int

) is negligible.



And BTW look at mapreduce , especially... The "hello world" map-reduce is a word count - almost exactly what you're looking for.

+4


source