Spark Scala: Iterable to individual key-value pairs
I have a problem in Spark (Scala) converting an Iterable (CompactBuffer) into individual pairs. I want to create a new RDD of key-value pairs built from the elements of each CompactBuffer.
It looks like this:
CompactBuffer(Person2, Person5)
CompactBuffer(Person2, Person5, Person7)
CompactBuffer(Person1, Person5, Person11)
A CompactBuffer can hold more than just 3 people. Basically, I want a new RDD that contains the individual pair combinations from each CompactBuffer, like this (I also want to avoid pairs where the key and the value are identical):
Array[
<Person2, Person5>
<Person5, Person2>
<Person2, Person7>
<Person7, Person2>
<Person5, Person7>
<Person7, Person5>
<Person1, Person5>
<Person5, Person1>
<Person1, Person11>
<Person11, Person1>
<Person5, Person11>
<Person11, Person5>]
Can anyone help me?
Thank you in advance
Here's something that produces the pairs (and removes duplicates). I couldn't figure out how to use CompactBuffer directly, so this uses ArrayBuffer instead, since the source for CompactBuffer describes it as a more efficient ArrayBuffer. You may need to convert the CompactBuffer inside the flatMap to something that supports .combinations (for example with .toSeq).
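To illustrate the conversion step, here is a minimal sketch in plain Scala (no Spark needed) that turns any Iterable into ordered pairs. The names PairsSketch and orderedPairs are made up for this example and are not part of Spark. It assumes that calling .toSeq on the Iterable is acceptable for your buffer sizes.

```scala
// Hypothetical helper, not part of Spark: Iterable -> ordered pairs.
object PairsSketch {
  def orderedPairs[A](items: Iterable[A]): List[(A, A)] =
    items.toSeq.distinct   // Seq supports .combinations; distinct rules out (x, x)
      .combinations(2)     // every unordered 2-element subset
      .flatMap { case Seq(x, y) => List((x, y), (y, x)) } // emit both orientations
      .toList

  def main(args: Array[String]): Unit = {
    // Stand-in for one CompactBuffer, typed only as Iterable
    val buffer: Iterable[String] = Iterable("Person2", "Person5", "Person7")
    orderedPairs(buffer).foreach(println)
  }
}
```

Inside a Spark job the same function body could be passed to flatMap over the grouped values, assuming the values arrive as Iterable.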
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ArrayBuffer

object sparkapp extends App {
  // Sample data standing in for the CompactBuffers from the question
  val data = List(
    ArrayBuffer("Person2", "Person5"),
    ArrayBuffer("Person2", "Person5", "Person7"),
    ArrayBuffer("Person1", "Person5", "Person11"))

  val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
  val sc = new SparkContext(conf)
  val dataRDD = sc.makeRDD(data, 1)

  // For each buffer, take every 2-element combination and emit it in both orders;
  // distinct then drops pairs that more than one buffer produced
  val pairs = dataRDD.flatMap(
    ab => ab.combinations(2)
      .flatMap { case Seq(x, y) => List((x, y), (y, x)) }
  ).distinct

  pairs.foreach(println)
  sc.stop()
}
Output:
(Person7,Person2)
(Person7,Person5)
(Person5,Person2)
(Person11,Person1)
(Person11,Person5)
(Person2,Person7)
(Person5,Person7)
(Person1,Person11)
(Person2,Person5)
(Person5,Person11)
(Person1,Person5)
(Person5,Person1)
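The reason distinct is needed can be checked without Spark: Person2 and Person5 appear together in two of the buffers, so their pair is generated twice. The sketch below runs the same pipeline on plain Scala Lists in place of the RDD. DistinctCheck is a made-up name for this example.

```scala
// Local stand-in for the RDD pipeline, using plain Scala collections.
object DistinctCheck {
  val buffers: List[List[String]] = List(
    List("Person2", "Person5"),
    List("Person2", "Person5", "Person7"),
    List("Person1", "Person5", "Person11"))

  // Every 2-element combination of each buffer, emitted in both orders
  val allPairs: List[(String, String)] = buffers.flatMap { b =>
    b.combinations(2).flatMap { case Seq(x, y) => List((x, y), (y, x)) }
  }

  // Converting to a Set plays the role of .distinct on the RDD
  val uniquePairs: Set[(String, String)] = allPairs.toSet

  def main(args: Array[String]): Unit =
    println(s"before dedup: ${allPairs.size}, after dedup: ${uniquePairs.size}")
}
```

The buffers yield 2 + 6 + 6 = 14 pairs before deduplication and 12 after, which matches the 12 lines of output above.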