Understanding the Spark Scala reduceByKey function definition
The function reduceByKey for a Pair RDD in Spark has the following definition:

def reduceByKey(func: (V, V) => V): RDD[(K, V)]

I understand that reduceByKey takes a function argument and applies it to the values for each key. I am trying to figure out how to read this definition, where the function takes two values as input, i.e. (V, V) => V. Shouldn't it be V => V, like the function mapValues, where a function is applied to a value V to get a U that may be of the same or a different type:

def mapValues[U](f: (V) => U): RDD[(K, U)]

Is this because reduceByKey applies to all values (for the same key) at once, while mapValues applies to each value (regardless of the key) one at a time? In that case, shouldn't it be defined as something like (V1, V2) => V?
... Shouldn't it be V => V, like mapValues ...

No, they are completely different. Recall the invariant of map functions: they return an Iterable (List, Array, etc.) with the same length as the original collection. Reduce functions, on the other hand, aggregate or combine all elements; in this case reduceByKey combines the values for each key using the supplied function. This definition comes from the mathematical concept of a monoid. You can see it this way: you combine the first two elements of the list using the function, and the result of this operation, which must be of the same type as the elements, is then combined with the third, and so on, until you end up with a single element.
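The left-to-right pairwise combining described above can be sketched with plain Scala collections, no Spark required. The helper reduceByKeyLocal below is hypothetical (not a Spark API); it only emulates the semantics of reduceByKey to show why the function must take two values of type V and return one:

```scala
// Sketch: emulating reduceByKey's (V, V) => V semantics on plain Scala
// collections. reduceByKeyLocal is a hypothetical helper, not part of Spark.
object ReduceByKeyDemo {
  def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(func: (V, V) => V): Map[K, V] =
    pairs
      .groupBy(_._1) // gather all values belonging to the same key
      .map { case (k, kvs) =>
        // reduce folds the values pairwise, left to right:
        // func(func(func(v1, v2), v3), v4) ...
        k -> kvs.map(_._2).reduce(func)
      }

  def main(args: Array[String]): Unit = {
    val data = Seq((1, 1), (1, 2), (1, 4), (3, 5), (3, 7))
    println(reduceByKeyLocal(data)(_ + _)) // Map(1 -> 7, 3 -> 12), key order may vary
  }
}
```

Because each intermediate result is fed back in as the left argument of the next call, both parameters and the return value must share the same type V, which is exactly what the signature (V, V) => V expresses.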
mapValues transforms the value of every pair in the RDD by applying f: (V) => U, while reduceByKey reduces all pairs with the same key to a single pair by applying f: (V, V) => V:
val data = Array((1,1),(1,2),(1,4),(3,5),(3,7))
val rdd = sc.parallelize(data)
rdd.mapValues(x=>x+1).collect
// Array((1,2),(1,3),(1,5),(3,6),(3,8))
rdd.reduceByKey(_+_).collect
// Array((1,7),(3,12))