Convert an RDD to a pair RDD

This is a beginner's question.

Is it possible to convert an RDD of dynamic size, for example (key, 1, 2, 3, 4, 5, 5, 666, 789, ...), into a pair RDD, for example (key, (1, 2, 3, 4, 5, 5, 666, 789, ...))?

I think it should be super easy, but I can't figure out how to do it.

The point is, I would like to sum all the values, but not the key.
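To make the goal concrete, here is a plain-Scala sketch (illustrative data only, no Spark) of the shape I'm after:

```scala
// Illustrative only: split a flat record into (key, values) and sum the values
val record = Seq(7, 1, 2, 3)            // the first element plays the role of the key
val pair = (record.head, record.tail)   // (7, Seq(1, 2, 3))
val total = record.tail.sum             // sum of everything except the key
```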

Any help is appreciated.

I am using Spark 1.2.0

EDIT: prompted by the answer, let me explain my use case. I have N (unknown at compile time) different (key, value) pair RDDs that need to be concatenated and whose values need to be summed by key. Is there a better way than the one I was thinking of?





1 answer


First of all, if you just want to sum all the integers but the first, the simplest way is:

val rdd = sc.parallelize(List(1, 2, 3))
rdd.cache()
val first = rdd.first()        // the key is the first element
val result = rdd.sum() - first // sum of everything except the key

      

On the other hand, if you want access to the index of each item, you can use the RDD's zipWithIndex method, like this:

  val indexed = rdd.zipWithIndex()                 // pairs each element with its index
  indexed.cache()
  val key = indexed.first()._1                     // the element at index 0 is the key
  val values = indexed.filter(_._2 != 0).map(_._1) // all the remaining elements
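For intuition, plain Scala collections have the same zipWithIndex method, pairing each element with its position; here is a small sketch with made-up data (no Spark required):

```scala
// Plain-Scala model of the zipWithIndex approach
val xs = List("key", "10", "20")
val indexed = xs.zipWithIndex                      // List(("key",0), ("10",1), ("20",2))
val key = indexed.head._1                          // the element at index 0
val values = indexed.filter(_._2 != 0).map(_._1)   // everything after the key
```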

      

But in your case it seems overkill.

One more thing I would add: it looks like dubious design to put the key as the first element of your RDD. Why not just use (key, rdd) pairs in your driver program? It's quite difficult to reason about the order of elements in an RDD, and I can't think of a natural situation in which the key would end up as the first element (though I don't know your use case, so I can only guess).

EDIT



If you have a single RDD of key-value pairs and you want to sum the values by key, do this:

val result = rdd.reduceByKey(_ + _)
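
For intuition, reduceByKey behaves like grouping by key and folding each group's values; here is a plain-Scala model with made-up data (no Spark required):

```scala
// Plain-Scala model of what reduceByKey(_ + _) computes
val pairs = List(("a", 1), ("b", 2), ("a", 3), ("b", 4))
val summed = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
```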

      

If you have many RDDs of key-value pairs, you can simply union them before counting:

  val list = List(pairRDD0, pairRDD1, pairRDD2)
  // another pairRDD arrives at runtime
  val newList = anotherPairRDD0 :: list
  val pairRDD = newList.reduce(_ union _)
  val resultSoFar = pairRDD.reduceByKey(_ + _)
  // another pairRDD arrives at runtime
  val result = resultSoFar.union(anotherPairRDD1).reduceByKey(_ + _)
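
The same idea in plain Scala, with made-up data, to show why union followed by reduceByKey sums across all the RDDs:

```scala
// Plain-Scala model: concatenate N pair collections, then sum by key
val rdds = List(List(("a", 1)), List(("a", 2), ("b", 3)), List(("b", 4)))
val combined = rdds.reduce(_ ++ _)
val result = combined.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
```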

      

EDIT



I edited the example. As you can see, you can add further RDDs as each one appears at runtime. This works because reduceByKey returns an RDD of the same type, so you can repeat the operation (of course, you will have to keep performance in mind).









