Understanding the Spark Scala reduceByKey function definition
The function reduceByKey for a Pair RDD in Spark has the following definition:

def reduceByKey(func: (V, V) => V): RDD[(K, V)]

I understand that reduceByKey takes a function argument and applies it to the values for each key. I am trying to figure out how to read this definition, where the function takes two values as input, i.e. (V, V) => V. Shouldn't it be V => V, like the function mapValues, where a function is applied to a value V to get a U that may be of the same or a different type:

def mapValues[U](f: (V) => U): RDD[(K, U)]

Is this because reduceByKey applies to all values (for the same key) at once, while mapValues applies to each value (regardless of the key) one at a time? In that case, shouldn't it be defined as something like (V1, V2) => V?
... Shouldn't it be V => V, like mapValues ...

No, they are completely different. Recall the invariant of map functions: they return an Iterable (List, Array, etc.) with the same length as the original collection. Reduce functions, on the other hand, aggregate or combine all elements; in this case reduceByKey combines the values for each key using the supplied function. This definition comes from the mathematical concept of a monoid. You can see it this way: you combine the first two elements of the list using the function, and the result of this operation, which must be of the same type as the elements, is then combined with the third, and so on, until you end up with a single element.
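The left-to-right pairwise combining described above can be sketched with plain Scala collections, no Spark required. The helper reduceByKeyLocal below is hypothetical (not a Spark API); it only emulates the semantics of reduceByKey to show why the function must take two values of type V and return one:

```scala
// Sketch: emulating reduceByKey's (V, V) => V semantics on plain Scala
// collections. reduceByKeyLocal is a hypothetical helper, not part of Spark.
object ReduceByKeyDemo {
  def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(func: (V, V) => V): Map[K, V] =
    pairs
      .groupBy(_._1) // gather all values belonging to the same key
      .map { case (k, kvs) =>
        // reduce folds the values pairwise, left to right:
        // func(func(func(v1, v2), v3), v4) ...
        k -> kvs.map(_._2).reduce(func)
      }

  def main(args: Array[String]): Unit = {
    val data = Seq((1, 1), (1, 2), (1, 4), (3, 5), (3, 7))
    println(reduceByKeyLocal(data)(_ + _)) // Map(1 -> 7, 3 -> 12), key order may vary
  }
}
```

Because each intermediate result is fed back in as the left argument of the next call, both parameters and the return value must share the same type V, which is exactly what the signature (V, V) => V expresses.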
mapValues transforms the value of every pair in the RDD by applying f: (V) => U, while reduceByKey reduces all pairs with the same key to a single pair by applying f: (V, V) => V:
val data = Array((1,1),(1,2),(1,4),(3,5),(3,7))
val rdd = sc.parallelize(data)
rdd.mapValues(x=>x+1).collect
// Array((1,2),(1,3),(1,5),(3,6),(3,8))
rdd.reduceByKey(_+_).collect
// Array((1,7),(3,12))