How can I map RDDs locally?

As a follow-up to my previous question , how can I localize the local RDD, i.e. collect data into a local thread without actually using it collect

(since the data is too large).

Specifically, I want to write something like

from subprocess import Popen, PIPE
with open('out','w') as out:
    with open('err','w') as err:
        myproc = Popen([.....],stdin=PIPE,stdout=out,stderr=err)
myrdd.iterate_locally(lambda x: myproc.stdin.write(x+'\n'))

      

How do I implement this iterate_locally

?

  • DOES NOT work: collect

    return value is too high:

    myrdd.collect().foreach(lambda x: myproc.stdin.write(x+'\n'))

  • Does NOT work: foreach

    executes its argument in distributed mode, not locally

    myrdd.foreach(lambda x: myproc.stdin.write(x+'\n'))

on this topic:

+3


source to share


2 answers


How about RDD.foreachPartition

? You can work with data in batches, for example:

myRdd.foreachPartition(it => it.collect.foreach(...))

      



If you look at the history of requests for objects , it RDD.foreachPartition

was created to move this middle ground.

+1


source


Your best bet is likely to save the data to a source that your local computer can access and then replicate.

If that's not an option, and assuming your local machine can handle data with one partition at a time, you can selectively return one partition at a time (I have to cache the data first) and then do something like:



rdd.cache()
for partition in range(0, rdd.numPartitions):
  data = rdd.mapPartitionsWithIndex(lambda index, itr: [(index, list(itr))]
  localData = data.filter(lambda x: x[0] == partition).collect
  # Do worker here

      

+1


source







All Articles