ReduceByKey processes each flatMap output without aggregating value on key in GraphX

Question

ReduceByKey processes each flatMap output without aggregating value on key in GraphX

I have a problem running GraphX

val adjGraph= adjGraph_CC.vertices 
   .flatMap { case (id, (compID, adjSet)) => (mapMsgGen(id, compID, adjSet)) } 
      // mapMsgGen will generate a list  of msgs each msg has the form K->V

   .reduceByKey((fst, snd) =>mapMsgMerg(fst, snd)).collect   
      // mapMsgMerg will merge each two msgs  passed to it

what I was expecting from reduceByKey is to group all flatMap output with a key (K) and process a list of values (Vs) for each key (K) using the provided function.

every output of flatMap happens (using mapMsgGen function), which is a list of K-> V pairs (not the same K usually), is processed immediately by mapBsKMMggMergg function and before the whole flatMap finishes.

need some clarification please I don't understand what is going wrong, or I understand that flatMap and reduceByKey are wrong?

Hello,

Maher

+3

scala mapreduce apache-spark spark-graphx

Maher turifi Dec 24. 14 at 22:39

source to share

1 answer

mrmcgreg · Accepted Answer · 2014-12-25T20:41:31+0000

It is not necessary to produce all the output flatMap

before launching reduceByKey

. In fact, if you are not using intermediate output flatMap

, it is best not to create one and possibly save some memory.

If yours flatMap

outputs a list containing 'k' -> v1

and 'k' -> v2

, there is no reason to wait until the entire list is created to go through v1

and v2

before mapMsgMerge

. Once these two tuples are outputted, v1

both v2

can be concatenated like mapMsgMerge(v1, v2)

and v1

and v2

discarded if no intermediate list is used.

I don't know the details of Spark's scheduler well enough to tell if this behavior is guaranteed, but it looks like an instance of the original paper causing operations to be "pipelined".

ReduceByKey processes each flatMap output without aggregating value on key in GraphX

More articles: