How to convert a Spark RDD to a Map in Scala
I have an RDD of strings, org.apache.spark.rdd.RDD[String] = MappedRDD[18], and I want to convert it to a map with unique IDs. I tried 'val vertexMAp = vertices.zipWithUniqueId', but that gave me another RDD, of type 'org.apache.spark.rdd.RDD[(String, Long)]', whereas I want a 'Map[String, Long]'. How can I convert my 'org.apache.spark.rdd.RDD[(String, Long)]' to a 'Map[String, Long]'?
Thanks
PairRDDFunctions has a built-in method, collectAsMap, that returns the key-value pairs of a pair RDD as a map.
val vertexMAp = vertices.zipWithUniqueId.collectAsMap
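For reference, here is a minimal self-contained sketch of the same call, assuming a local SparkContext and a tiny sample dataset. The object name, the helper toIdMap, and the sample vertex names are made up for illustration. Two details are worth knowing: collectAsMap returns a scala.collection.Map, so .toMap is needed if you want an immutable Map[String, Long], and if the RDD contains duplicate strings only one ID per key survives in the resulting map. The IDs from zipWithUniqueId are unique but not necessarily consecutive.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // only needed on older Spark versions (before 1.3, as far as I know)
import org.apache.spark.rdd.RDD

object CollectAsMapSketch {
  // Hypothetical helper: assigns a unique Long to every element and
  // pulls the result back to the driver as an immutable Map.
  // Only safe when the whole RDD fits into the driver's memory.
  def toIdMap(vertices: RDD[String]): Map[String, Long] =
    vertices.zipWithUniqueId().collectAsMap().toMap

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so the sketch runs without a cluster
    val conf = new SparkConf().setAppName("collect-as-map-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val vertices = sc.parallelize(Seq("vertexX", "vertexY", "vertexZ"))
      val vertexMap = toIdMap(vertices)
      vertexMap.foreach { case (name, id) => println(s"$name -> $id") }
    } finally {
      sc.stop()
    }
  }
}
```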
It is important to remember that an RDD is a distributed data structure. You can picture it as "chunks" of your data spread across the cluster. When you call collect (or collectAsMap), you force all of these chunks to be sent to the driver, so they all need to fit into the driver's memory.
From the comments, it looks like in your case you need to deal with a large dataset. Creating a map from it will not work, because the map will not fit into the driver's memory; you will get OOM (out-of-memory) exceptions if you try.
You probably need to keep the dataset as an RDD. If you are creating the map in order to look up elements, you can use lookup on the pair RDD instead, for example:
import org.apache.spark.SparkContext._ // import implicit conversions to support PairRDDFunctions
val vertexMap = vertices.zipWithUniqueId
val vertexYId = vertexMap.lookup("vertexY") // returns a Seq[Long] with every ID stored under this key
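Note that lookup returns a Seq[Long] rather than a single value, and it is empty when the key is absent. Here is a small sketch of how that could be handled, assuming a local SparkContext. The object name, the helper idOf, and the sample data are hypothetical. Each lookup runs a Spark job, so this suits occasional queries rather than many lookups in a loop; caching the pair RDD avoids recomputing it on every call.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // only needed on older Spark versions (before 1.3, as far as I know)
import org.apache.spark.rdd.RDD

object LookupSketch {
  // Hypothetical helper: returns the first ID stored under `name`,
  // or None when the key is not present in the pair RDD.
  def idOf(vertexIds: RDD[(String, Long)], name: String): Option[Long] =
    vertexIds.lookup(name).headOption

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so the sketch runs without a cluster
    val conf = new SparkConf().setAppName("lookup-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val vertices = sc.parallelize(Seq("vertexX", "vertexY", "vertexZ"))
      // Keep the (name, id) pairs distributed; cache them because
      // every lookup triggers a separate job over this RDD.
      val vertexIds = vertices.zipWithUniqueId().cache()

      println(idOf(vertexIds, "vertexY"))   // Some(id) for an existing key
      println(idOf(vertexIds, "noSuchKey")) // None for a missing key
    } finally {
      sc.stop()
    }
  }
}
```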