ReduceByKey processes each flatMap output without aggregating value on key in GraphX

I have a problem running GraphX

val adjGraph= adjGraph_CC.vertices 
   .flatMap { case (id, (compID, adjSet)) => (mapMsgGen(id, compID, adjSet)) } 
      // mapMsgGen will generate a list  of msgs each msg has the form K->V

   .reduceByKey((fst, snd) =>mapMsgMerg(fst, snd)).collect   
      // mapMsgMerg will merge each two msgs  passed to it 

      

what I was expecting from reduceByKey is to group all flatMap output with a key (K) and process a list of values ​​(Vs) for each key (K) using the provided function.

every output of flatMap happens (using mapMsgGen function), which is a list of K-> V pairs (not the same K usually), is processed immediately by mapBsKMMggMergg function and before the whole flatMap finishes.

need some clarification please I don't understand what is going wrong, or I understand that flatMap and reduceByKey are wrong?

Hello,

Maher

+3


source to share


1 answer


It is not necessary to produce all the output flatMap

before launching reduceByKey

. In fact, if you are not using intermediate output flatMap

, it is best not to create one and possibly save some memory.

If yours flatMap

outputs a list containing 'k' -> v1

and 'k' -> v2

, there is no reason to wait until the entire list is created to go through v1

and v2

before mapMsgMerge

. Once these two tuples are outputted, v1

both v2

can be concatenated like mapMsgMerge(v1, v2)

and v1

and v2

discarded if no intermediate list is used.



I don't know the details of Spark's scheduler well enough to tell if this behavior is guaranteed, but it looks like an instance of the original paper causing operations to be "pipelined".

+1


source







All Articles