ReduceByKey processes each flatMap output without aggregating value on key in GraphX
I have a problem running GraphX
val adjGraph= adjGraph_CC.vertices
.flatMap { case (id, (compID, adjSet)) => (mapMsgGen(id, compID, adjSet)) }
// mapMsgGen will generate a list of msgs each msg has the form K->V
.reduceByKey((fst, snd) =>mapMsgMerg(fst, snd)).collect
// mapMsgMerg will merge each two msgs passed to it
what I was expecting from reduceByKey is to group all flatMap output with a key (K) and process a list of values (Vs) for each key (K) using the provided function.
every output of flatMap happens (using mapMsgGen function), which is a list of K-> V pairs (not the same K usually), is processed immediately by mapBsKMMggMergg function and before the whole flatMap finishes.
need some clarification please I don't understand what is going wrong, or I understand that flatMap and reduceByKey are wrong?
Hello,
Maher
source to share
It is not necessary to produce all the output flatMap
before launching reduceByKey
. In fact, if you are not using intermediate output flatMap
, it is best not to create one and possibly save some memory.
If yours flatMap
outputs a list containing 'k' -> v1
and 'k' -> v2
, there is no reason to wait until the entire list is created to go through v1
and v2
before mapMsgMerge
. Once these two tuples are outputted, v1
both v2
can be concatenated like mapMsgMerge(v1, v2)
and v1
and v2
discarded if no intermediate list is used.
I don't know the details of Spark's scheduler well enough to tell if this behavior is guaranteed, but it looks like an instance of the original paper causing operations to be "pipelined".
source to share