Sorting values before they are sent to the reducer

I am thinking about building a small test application in Hadoop to get the hang of the system.

The application I have in mind will be in the statistics domain. I want to get "the 10 worst values for each key" out of my reducer function (where I have to assume the possibility of a huge number of values for some keys).

What I have planned is that the values going into my reducer will basically consist of the combination "actual value" and "quality of the actual value". Based on the quality, I "just" want to take the 10 worst/best values and output them from the reducer.

How do I do that (assuming there are a huge number of values for a specific key)? Is there a way to sort all values before they are sent to the reducer (and simply stop reading the input once I have read the first 10), or does this need to be done differently?

Can someone here point me to a piece of example code that I can take a look at?


Update: I found two interesting JIRA issues, HADOOP-485 and HADOOP-686.

Does anyone have a snippet of code showing how to use these in the Hadoop 0.20 API?



3 answers


It sounds like you want to use a Combiner, which defines what to do with the values you create on the map side before they are sent to the Reducer, but after they have been grouped by key. The combiner is often set to the reducer class itself (so you reduce on the map side, and then reduce again on the reduce side).

Take a look at how the WordCount example uses the combiner to pre-aggregate partial counts:

http://wiki.apache.org/hadoop/WordCount
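As a rough sketch (not verbatim from the wiki), the 0.20 driver wiring looks something like this, assuming the TokenizerMapper and IntSumReducer classes from the linked WordCount example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        // The reducer class doubles as the combiner: partial counts are
        // summed on the map side before anything crosses the network.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```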


Update: Here is what I had in mind for your problem; I may have misunderstood what you are trying to do.

Each mapper emits pairs <key, {score, data}>.

The combiner gets a partial set of these pairs, <key, [set of {score, data}]>, performs a local sort (still on the mapper nodes), and emits pairs <key, [sorted set of the top 10 local {score, data}]>.

The reducer will then receive <key, [set of top-10-sets]>; all it has to do is perform the merge step of a sort-merge (no sorting needed) over the members of the value sets, and stop merging once the first 10 values have been pulled.
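A minimal sketch of such a top-10 combiner/reducer, under two assumptions of mine: the mapper emits values as tab-separated "score<TAB>data" Text strings, and "worst" means lowest score (equal scores collapse in this simplified version):

```java
import java.io.IOException;
import java.util.TreeMap;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WorstTenReducer extends Reducer<Text, Text, Text, Text> {
    private static final int N = 10;

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Keep the N lowest scores seen so far; entries with equal scores
        // overwrite each other, which is acceptable for a sketch.
        TreeMap<Double, Text> worst = new TreeMap<Double, Text>();
        for (Text value : values) {
            double score = Double.parseDouble(value.toString().split("\t", 2)[0]);
            worst.put(score, new Text(value)); // copy: Hadoop reuses the Text object
            if (worst.size() > N) {
                worst.remove(worst.lastKey()); // evict the least-bad entry
            }
        }
        for (Text kept : worst.values()) {
            context.write(key, kept);
        }
    }
}
```

Registered via both job.setCombinerClass(WorstTenReducer.class) and job.setReducerClass(WorstTenReducer.class), each map task then prunes a key's values down to roughly 10 per spill before they are shipped to the reducers.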


Update 2

So now that we know that the score is cumulative, and as a result you cannot filter the data before using combiners, the only option is to do what you suggested: get a secondary sort. You have found the right tickets; there is an example of how to do this in Hadoop 0.20 in src/examples/org/apache/hadoop/examples/SecondarySort.java (or, if you don't want to download the whole source tree, you can look at the example patch at https://issues.apache.org/jira/browse/HADOOP-4545).
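As a rough sketch of what that pattern looks like in the 0.20 API (CompositeKey, NaturalKeyPartitioner, CompositeKeyComparator, and NaturalKeyGroupingComparator are placeholder names for classes you would write along the lines of SecondarySort.java, not Hadoop classes):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TopTenDriver {
    public static Job configure(Configuration conf) throws Exception {
        Job job = new Job(conf, "top 10 per key");
        job.setJarByClass(TopTenDriver.class);
        // The map output key bundles the natural key with the score, so
        // the framework's shuffle sort orders the values for us:
        job.setMapOutputKeyClass(CompositeKey.class);
        // Partition on the natural key only, so every score for one key
        // ends up on the same reducer:
        job.setPartitionerClass(NaturalKeyPartitioner.class);
        // Full sort order: natural key first, then score:
        job.setSortComparatorClass(CompositeKeyComparator.class);
        // Grouping: one reduce() call per natural key, values pre-sorted:
        job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);
        return job;
    }
}
```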



This ultimately sounds like the secondary sort problem. Take a look at "Hadoop: The Definitive Guide" from O'Reilly if you like; you can also access it online. It describes a pretty good implementation.

I also implemented it myself. Basically it works like this: the partitioner will take care that all key-value pairs with the same key go to the same reducer. Nothing special here. But there is also a GroupingComparator, which forms the groups. One group is actually passed as an iterator to a single reduce() call. So a partition can contain multiple groups. But the number of partitions should be equal to the number of reducers. Grouping also allows some sorting to be done, since it implements a compareTo method.
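For instance, the grouping comparator could look like the following sketch, where CompositeKey is an assumed WritableComparable that bundles the natural key (as a String) with the score:

```java
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical: groups on the natural key only, so every composite key
// that shares a natural key lands in the same reduce() call.
public class NaturalKeyGroupingComparator extends WritableComparator {
    protected NaturalKeyGroupingComparator() {
        super(CompositeKey.class, true); // true = instantiate keys for comparison
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        CompositeKey left = (CompositeKey) a;
        CompositeKey right = (CompositeKey) b;
        return left.getNaturalKey().compareTo(right.getNaturalKey());
    }
}
```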



With this method you can control that the 10 best/worst/highest/lowest values for a key reach the reducer first. So after you have read those 10 values, you can leave the reduce method without any further iteration.
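A sketch of such an early-exit reduce method, reusing the hypothetical CompositeKey from above (values arrive pre-sorted thanks to the grouping setup):

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FirstTenReducer extends Reducer<CompositeKey, Text, Text, Text> {
    private static final int LIMIT = 10;

    @Override
    protected void reduce(CompositeKey key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int emitted = 0;
        for (Text value : values) {
            context.write(new Text(key.getNaturalKey()), value);
            if (++emitted == LIMIT) {
                return; // leave reduce() without iterating over the rest
            }
        }
    }
}
```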

Hope this was helpful :-)



If I understand the question correctly, you will need to use the TotalOrderPartitioner.
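For completeness, here is a hedged sketch of wiring it up. In 0.20 the TotalOrderPartitioner lives in the old mapred API; the input format, paths, and sampling parameters below are my own assumptions:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.lib.InputSampler;
import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

public class TotalOrderDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(TotalOrderDriver.class);
        conf.setInputFormat(KeyValueTextInputFormat.class); // Text keys
        conf.setMapOutputKeyClass(Text.class);
        conf.setNumReduceTasks(4);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        // Split points for the partitioner are stored in this file:
        TotalOrderPartitioner.setPartitionFile(conf, new Path("/tmp/partitions"));
        conf.setPartitionerClass(TotalOrderPartitioner.class);
        // Sample the input (1% frequency, at most 10000 samples) to pick
        // split points, so keys come out globally ordered across reducers:
        InputSampler.writePartitionFile(conf,
                new InputSampler.RandomSampler<Text, Text>(0.01, 10000));
    }
}
```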
