How can I map RDDs locally?

Question

How can I map RDDs locally?

As a follow-up to my previous question , how can I localize the local RDD, i.e. collect data into a local thread without actually using it collect

(since the data is too large).

Specifically, I want to write something like

from subprocess import Popen, PIPE
with open('out','w') as out:
    with open('err','w') as err:
        myproc = Popen([.....],stdin=PIPE,stdout=out,stderr=err)
myrdd.iterate_locally(lambda x: myproc.stdin.write(x+'\n'))

How do I implement this iterate_locally

?

DOES NOT work: collect

return value is too high:

myrdd.collect().foreach(lambda x: myproc.stdin.write(x+'\n'))
Does NOT work: foreach

executes its argument in distributed mode, not locally

myrdd.foreach(lambda x: myproc.stdin.write(x+'\n'))

on this topic:

Iskra: Best Practice to Extract Big Data from RDD to Local Machine

+3

apache-spark pyspark

sds May 27 '15 at 16:28

source to share

2 answers

David Griffin · Answer 1 · 2015-05-27T17:25:48+0000

How about RDD.foreachPartition

? You can work with data in batches, for example:

myRdd.foreachPartition(it => it.collect.foreach(...))

If you look at the history of requests for objects , it RDD.foreachPartition

was created to move this middle ground.

Holden · Answer 2 · 2015-05-27T20:26:36+0000

Your best bet is likely to save the data to a source that your local computer can access and then replicate.

If that's not an option, and assuming your local machine can handle data with one partition at a time, you can selectively return one partition at a time (I have to cache the data first) and then do something like:

rdd.cache()
for partition in range(0, rdd.numPartitions):
  data = rdd.mapPartitionsWithIndex(lambda index, itr: [(index, list(itr))]
  localData = data.filter(lambda x: x[0] == partition).collect
  # Do worker here

How can I map RDDs locally?

More articles: