Hadoop streaming: how to set up partitioning?

I am very new to Hadoop streaming and am having some difficulty with partitioning.

Depending on what is found in the line, my mapper function emits

key1, 0, somegeneralvalues # some kind of "header" line where linetype = 0

or

key1, 1, value1, value2, othervalues... # "data" line, different values, linetype = 1

To reduce properly, I need to group all rows sharing the same key1 and sort them by linetype (0 or 1), then by value1 and value2, for example:

1 0 foo bar...  # header first
1 1 888 999.... # data line, with lower value1
1 1 999 111.... # several data lines may follow; they should be sorted by value1, value2
------------    # possible partition here, and only here in this example
2 0 baz foobar....   
2 1 123 888... 
2 1 123 999...
2 1 456 111...  
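To make the target ordering concrete, here is a small local Python sketch (illustrative only, not Hadoop code) that sorts sample mapper output by key1, then linetype, then value1 and value2:

```python
# Sample mapper output lines: "key1 linetype value1 value2 ..."
lines = [
    "2 1 123 999",
    "1 1 999 111",
    "2 0 baz foobar",
    "1 0 foo bar",
    "2 1 456 111",
    "1 1 888 999",
    "2 1 123 888",
]

def sort_key(line):
    # Sort by key1, then linetype (so the 0 "header" precedes the
    # 1 "data" lines), then value1, then value2 -- all compared as
    # strings, like Hadoop's default text comparator.
    key1, linetype, v1, v2 = line.split()[:4]
    return (key1, linetype, v1, v2)

for line in sorted(lines, key=sort_key):
    print(line)
```

Each key1 group comes out contiguous with its header line first; the remaining question is how to make Hadoop keep whole groups on one reducer.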


Is there a way to achieve this partitioning? So far I have tried to play with parameters like

-partitioner,'org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner'
-D stream.num.map.output.key.fields=4 # please use 4 fields to sort data
-D mapred.text.key.partitioner.options=-k1,1 # please make partitions based on first key


or alternatively

-D num.key.fields.for.partition=1 # Seriously, please group by key1 !


which still only brought rage and despair.

In case it's worth mentioning: my scripts work correctly if I run them locally as cat input | mapper | sort | reducer, and I am using the Amazon Elastic MapReduce ruby client, so I pass parameters with

--arg '-D','options' for the ruby script.


Any help would be much appreciated! Thanks in advance.



2 answers


After reading this post, I suggest modifying your mapper to return pairs whose "keys" include your key value, your linetype value, and your value1/value2 values, all concatenated together. You would retain the "value" portion of the pair. For example, you could return the following pairs to represent your first two examples:

<'10foobar',{'0','foo','bar'}>
<'11888999',{'1','888','999'}>
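A minimal sketch of such a mapper in Python (the function name and field layout are my own assumptions, based on the examples above):

```python
def map_line(line):
    # Build a composite key by concatenating key1, linetype, value1
    # and value2; keep the linetype and values as the "value" part.
    key1, linetype, v1, v2 = line.split()[:4]
    composite_key = key1 + linetype + v1 + v2
    value = [linetype, v1, v2]
    return composite_key, value

print(map_line("1 0 foo bar"))
print(map_line("1 1 888 999"))
```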




Now, if you use a single reducer, all of your records will be sent to the same reduce task and sorted lexicographically based on their "key". This fulfills your requirement that pairs be sorted by key, then linetype, then value1, and finally value2, and you can still access those values separately in the "value" part of the pair. I'm not very familiar with the various built-in partitioner/sort classes, but I would expect that you can just use the defaults and get this to work.



Thanks to ryanbwork, I was able to solve this problem. Yay!

The correct idea was to create a key consisting of a concatenation of values. To go a little further, it is also possible to create a key that looks like

<'1.0.foo.bar', {'0','foo','bar'}>
<'1.1.888.999', {'1','888','999'}>




Parameters can then be passed to hadoop so that it partitions on the first "part" of the key. If I am interpreting things correctly, it looks like

-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
-D stream.map.output.field.separator=. # I added some "." in the key
-D stream.num.map.output.key.fields=4  # 4 "sub-fields" are used to sort
-D num.key.fields.for.partition=1      # only one field is used to partition
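As a sanity check, here is a small Python simulation (my own sketch, not Hadoop code) of what those options request: sorting on the first four "."-separated sub-fields while partitioning on the first sub-field only:

```python
keys = [
    "2.1.123.999",
    "1.1.999.111",
    "2.0.baz.foobar",
    "1.0.foo.bar",
]

NUM_REDUCERS = 2

def partition(key):
    # num.key.fields.for.partition=1: route on the first sub-field only,
    # so every record sharing key1 reaches the same reducer.
    return hash(key.split(".")[0]) % NUM_REDUCERS

def sort_key(key):
    # stream.num.map.output.key.fields=4: compare the first 4 sub-fields.
    return key.split(".")[:4]

buckets = {}
for k in sorted(keys, key=sort_key):
    buckets.setdefault(partition(k), []).append(k)

for reducer, ks in sorted(buckets.items()):
    print(reducer, ks)
```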


This solution, based on what ryanbwork said, allows using more than one reducer while still ensuring correct partitioning and sorting of the data.
