How to convert a Spark RDD to a Scala Map

I have an RDD of strings, org.apache.spark.rdd.RDD[String] = MappedRDD[18], and I want to convert it to a map with unique IDs. I did

val vertexMap = vertices.zipWithUniqueId

but it gave me another RDD, of type org.apache.spark.rdd.RDD[(String, Long)], whereas I want a Map[String, Long]. How can I convert my org.apache.spark.rdd.RDD[(String, Long)] to a Map[String, Long]?

Thanks

3 answers


PairRDDFunctions has a built-in method, collectAsMap, that returns the key-value pairs of a pair RDD as a map on the driver.

val vertexMap = vertices.zipWithUniqueId.collectAsMap

It is important to remember that an RDD is a distributed data structure. You can picture it as "chunks" of your data spread across the cluster. When you call collect, you force all of these chunks to be sent to the driver, and for that to work they need to fit into the driver's memory.

From the comments, it looks like in your case you need to deal with a large dataset. Creating a map from it will not work because it will not fit into the driver's memory; you will get OOM exceptions if you try.

You probably need to keep the dataset as an RDD. If you are creating the map in order to look up items, you can use lookup on the pair RDD instead, for example:

import org.apache.spark.SparkContext._  // import implicit conversions to support PairRDDFunctions

val vertexMap = vertices.zipWithUniqueId
val vertexYId = vertexMap.lookup("vertexY")  // Seq[Long] holding the ID(s) stored under "vertexY"
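If the map was only meant for translating names to IDs inside a larger job, a join keeps everything distributed. The following is a minimal sketch and not part of the original answer: the edges RDD, the sample vertex names and the local master setting are all assumptions made up for illustration. On older Spark releases the SparkContext._ import shown above may also be needed for join to resolve.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Local context for experimentation only (assumption); a real job would get its master from spark-submit.
val sc = new SparkContext(new SparkConf().setAppName("vertex-ids").setMaster("local[2]"))

// Hypothetical sample data standing in for the real vertices and edges.
val vertices: RDD[String] = sc.parallelize(Seq("vertexX", "vertexY", "vertexZ"))
val edges: RDD[(String, String)] =
  sc.parallelize(Seq(("vertexX", "vertexY"), ("vertexY", "vertexZ")))

// (name, id) pairs; the IDs are unique but not necessarily consecutive.
val vertexIds: RDD[(String, Long)] = vertices.zipWithUniqueId()

// First join on the source name gives (src, (dst, srcId)); re-key the result by destination name.
val keyedByDst: RDD[(String, Long)] =
  edges.join(vertexIds).map { case (_, (dst, srcId)) => (dst, srcId) }

// Second join on the destination name gives (dst, (srcId, dstId)); keep only the ID pair.
val edgeIds: RDD[(Long, Long)] =
  keyedByDst.join(vertexIds).map { case (_, (srcId, dstId)) => (srcId, dstId) }

// Only a small preview is brought to the driver, never the whole dataset.
val preview: Array[(Long, Long)] = edgeIds.take(5)
preview.foreach(println)

sc.stop()
```

The two joins cost a shuffle each, but nothing larger than the preview ever has to fit in the driver's memory.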

      



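Collect the RDD to the "local" (driver) machine and then convert the resulting Array[(String, Long)] to a Map. This only works when the data fits into the driver's memory.

import org.apache.spark.rdd.RDD

val rdd: RDD[String] = ???

val map: Map[String, Long] = rdd.zipWithUniqueId().collect().toMap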



You don't need to convert. The implicits for PairRDDFunctions detect an RDD of two-element tuples and make the PairRDDFunctions methods available on it automatically.
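To illustrate, here is a small sketch in which the result of zipWithUniqueId is used with pair methods directly, without any explicit wrapping. The sample names and the local master are assumptions for illustration only. Note that collectAsMap returns a scala.collection.Map, so a final toMap is what yields the immutable Map[String, Long] asked for in the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Local context for experimentation only (assumption).
val sc = new SparkContext(new SparkConf().setAppName("pair-rdd-implicits").setMaster("local[2]"))

// Hypothetical sample data.
val vertices: RDD[String] = sc.parallelize(Seq("vertexX", "vertexY", "vertexZ"))
val withIds: RDD[(String, Long)] = vertices.zipWithUniqueId()

// keys, lookup and collectAsMap are defined on PairRDDFunctions,
// yet they can be called on the RDD[(String, Long)] itself.
val names: Array[String] = withIds.keys.collect()
val idsForY: Seq[Long] = withIds.lookup("vertexY")

// Collecting is acceptable here only because the sample is tiny.
val asMap: Map[String, Long] = withIds.collectAsMap().toMap

println(asMap)
sc.stop()
```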
