Spark: how to split RDD[T] into Seq[RDD[T]] and preserve order

How can I efficiently split an RDD[T] into a Seq[RDD[T]] / Iterable[RDD[T]] with n elements each and keep the original ordering?

I would like to write something like this

RDD(1, 2, 3, 4, 5, 6, 7, 8, 9).split(3)


which should lead to something like

Seq(RDD(1, 2, 3), RDD(4, 5, 6), RDD(7, 8, 9))


Does Spark have such a function? If not, what is the way to achieve this?

val parts = rdd.count() / n   // RDD has no .length; count() gives the number of elements
val rdds = rdd.zipWithIndex().map { case (t, i) => (i - (i % parts), t) }
  .groupByKey().sortByKey().values.collect()  // collect the groups (in key order) to the driver,
  .map(iter => sc.parallelize(iter.toSeq))    // then re-distribute each group as its own RDD


Doesn't look very fast.
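For what it's worth, one way to avoid the collect-then-parallelize round trip is to keep everything on the cluster and build each part as a filtered view of the indexed RDD. This is only a sketch under the assumption that n roughly equal parts in the original order are what is wanted; splitRdd and partSize are names introduced here:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Sketch: split an RDD into n parts of roughly equal size, preserving
// the original order, without collecting the elements to the driver.
def splitRdd[T: ClassTag](rdd: RDD[T], n: Int): Seq[RDD[T]] = {
  val count    = rdd.count()
  val partSize = math.ceil(count.toDouble / n).toLong
  val indexed  = rdd.zipWithIndex().cache()   // zipWithIndex keeps the original order
  (0 until n).map { i =>
    indexed
      .filter { case (_, idx) => idx >= i * partSize && idx < (i + 1) * partSize }
      .map { case (t, _) => t }
  }
}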



1 answer


You can technically do what you suggest. However, it doesn't really make sense in the context of using a compute cluster for distributed big-data processing; it goes against the whole point of Spark. If you do a groupByKey and then try to extract each group into a separate RDD, you are effectively pulling all of the data distributed across the RDD back to the driver and then re-distributing each group to the cluster. If the driver cannot hold the entire dataset, it cannot perform this operation at all.

You cannot load a large data file onto the driver node from the local file system. You have to move your file to a distributed file system like HDFS or S3. You can then load one large data file into your cluster with val lines = sc.textFile(...), which gives you the lines of the file as an RDD. When you do this, each worker in the cluster loads only a portion of the file, which is possible because the data is already spread across the cluster on the distributed file system.
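For example (a minimal sketch; the application name and HDFS path here are made up):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("example"))
// Each worker reads only its portion of the file, because the file
// already lives distributed across HDFS.
val lines = sc.textFile("hdfs:///data/large-file.txt")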

If you need to organize the data into "batches" that are meaningful for your processing, you can key the data with an appropriate batch identifier, for example: val batches = lines.keyBy( line => lineBatchID(line) )



Each batch can then be summarized at the batch level and these summaries can be consolidated into a single result.
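A rough sketch of that pattern, assuming lineBatchID is your own application-specific function and, purely for illustration, the batch-level summary is just a line count:

// Key each line by its batch identifier (lineBatchID is application-specific).
val batches = lines.keyBy(line => lineBatchID(line))

// Summarize each batch on the cluster, e.g. count the lines per batch...
val batchSummaries = batches.mapValues(_ => 1L).reduceByKey(_ + _)

// ...then consolidate the per-batch summaries into a single result.
val total = batchSummaries.values.reduce(_ + _)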

For testing Spark code, it is fine to load a small sample data file onto a single machine. But when it comes to the complete dataset, you must use a distributed file system in conjunction with a Spark cluster to process that data.

