What logic should a custom partitioner use in the map phase to solve this problem?

Suppose the word distribution in a file is heavily skewed: roughly 99% of the words start with "A" and the remaining 1% start with "B" through "Z". If you need to count the number of words starting with each letter, how would you efficiently distribute your keys?



1 answer


SOLUTION 1: I think the way to go is a combiner, not a partitioner. The combiner will sum the local counts of words starting with the letter "A" and then emit a partial sum (rather than always emitting 1) to the reducers.
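The combiner's aggregation step can be sketched in plain Java like this (the class and method names are illustrative, not the Hadoop API; a real combiner would extend `Reducer` and run on each map task's output):

```java
import java.util.HashMap;
import java.util.Map;

class LetterCombiner {
    // Aggregate one map task's output locally: instead of emitting
    // ("A", 1) once per word, emit a single ("A", partialSum) record
    // per first letter, drastically shrinking the shuffle for "A".
    static Map<Character, Integer> combine(String[] words) {
        Map<Character, Integer> partialSums = new HashMap<>();
        for (String word : words) {
            if (word.isEmpty()) continue;
            char letter = Character.toUpperCase(word.charAt(0));
            partialSums.merge(letter, 1, Integer::sum); // local sum, not always 1
        }
        return partialSums; // each entry becomes one record sent to the reducers
    }
}
```

With 99% of words starting with "A", each map task now sends one partial sum for "A" instead of millions of ("A", 1) pairs.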

SOLUTION 2: However, if you insist on using a custom partitioner for this, you can simply route words starting with the letter "A" to a separate reducer from all other words, i.e., dedicate one reducer exclusively to words starting with the letter "A".
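A minimal sketch of that routing logic, mirroring the contract of Hadoop's `Partitioner#getPartition` but written as a standalone method (the class name is hypothetical; it assumes at least two reducers are configured):

```java
class FirstLetterPartitioner {
    // Reducer 0 handles only the heavy "A" key group; all other
    // words are hashed across the remaining reducers 1..n-1.
    static int getPartition(String word, int numReduceTasks) {
        if (!word.isEmpty() && Character.toUpperCase(word.charAt(0)) == 'A') {
            return 0; // dedicated reducer for words starting with "A"
        }
        // mask to a non-negative hash, then spread over reducers 1..numReduceTasks-1
        return 1 + (word.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1);
    }
}
```

Note that this isolates the skew rather than removing it: reducer 0 still receives ~99% of the data, so the other reducers finish early while reducer 0 does almost all the work.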



SOLUTION 3: Also, if you don't mind "cheating" a bit, you can define a counter for words starting with the letter "A" and increment it during the map phase. Then simply drop those words (no need to send them over the network) and use the default partitioner for the other words. When the job finishes, retrieve the counter's value.
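The map-side filtering can be sketched as follows; in real Hadoop the increment would be `context.getCounter(group, name).increment(1)`, and here a plain static `long` stands in for the job counter (all names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

class AFilterMapper {
    static long aCounter = 0; // stands in for a Hadoop job counter

    // Words starting with "A" only bump the counter and are dropped;
    // everything else is emitted to the shuffle as usual.
    static List<String> mapFilter(String[] words) {
        List<String> emitted = new ArrayList<>();
        for (String word : words) {
            if (!word.isEmpty() && Character.toUpperCase(word.charAt(0)) == 'A') {
                aCounter++;        // counted map-side, never shuffled
            } else {
                emitted.add(word); // sent over the network to reducers
            }
        }
        return emitted;
    }
}
```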

SOLUTION 4: If you don't mind "cheating" even more, define 26 counters, one for each letter, and increment the appropriate one in the map phase according to the first letter of the current word. You don't need reducers at all (set the number of reducers to 0), which avoids all sorting and shuffling. When the job finishes, retrieve the values of all the counters.
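A plain-Java sketch of this map-only variant, with a `Map` simulating the job's 26 counters (the class name is hypothetical; a real job would read the counters from the completed `Job` object):

```java
import java.util.HashMap;
import java.util.Map;

class LetterCounters {
    // Map-only counting: every word increments one of 26 counters by its
    // first letter; nothing is emitted, so there is no sort or shuffle.
    static Map<Character, Long> countByFirstLetter(String[] words) {
        Map<Character, Long> counters = new HashMap<>();
        for (String word : words) {
            if (word.isEmpty()) continue;
            char letter = Character.toUpperCase(word.charAt(0));
            if (letter >= 'A' && letter <= 'Z') {
                counters.merge(letter, 1L, Long::sum); // counter increment
            }
        }
        return counters; // retrieved after the job ends, like job counters
    }
}
```

The usual caveat with this trick is that Hadoop counters are meant for small amounts of job metadata, which is exactly why it works here: 26 values is tiny, regardless of how skewed the input is.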







