How can I map RDDs locally?
As a follow-up to my previous question, how can I map an RDD locally, i.e. collect the data into a local stream without actually using collect (since the data is too large)?
Specifically, I want to write something like
from subprocess import Popen, PIPE
with open('out','w') as out:
    with open('err','w') as err:
        myproc = Popen([.....],stdin=PIPE,stdout=out,stderr=err)
myrdd.iterate_locally(lambda x: myproc.stdin.write(x+'\n'))
How do I implement this iterate_locally?
- Does NOT work: the return value of collect is too large:
myrdd.collect().foreach(lambda x: myproc.stdin.write(x+'\n'))
- Does NOT work: foreach executes its argument in distributed mode, not locally:
myrdd.foreach(lambda x: myproc.stdin.write(x+'\n'))
How about RDD.foreachPartition? You can work with the data in batches, for example:
myRdd.foreachPartition(it => it.collect.foreach(...))
If you look at the feature request history, RDD.foreachPartition was created to cover this middle ground.
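One caveat worth keeping in mind: foreachPartition still runs its function on the executors, so it cannot write to a process started on the driver. If the goal is to pull records to the driver without materializing the whole RDD, PySpark's RDD.toLocalIterator() may be a closer fit, since it fetches one partition at a time. Below is a minimal sketch of the iterate_locally helper from the question built on it. The FakeRDD class is a hypothetical stand-in so the snippet runs without a Spark installation; with a real RDD you would pass the RDD itself.

```python
def iterate_locally(rdd, fn):
    """Apply fn to every record on the driver, one record at a time.

    Assumes `rdd` exposes toLocalIterator(), as PySpark RDDs do.
    Memory use on the driver is bounded by the largest partition.
    """
    for record in rdd.toLocalIterator():
        fn(record)


class FakeRDD:
    """Hypothetical stand-in for a PySpark RDD, for illustration only."""

    def __init__(self, partitions):
        self._partitions = partitions

    def toLocalIterator(self):
        # Yield records partition by partition, mimicking the real method.
        for partition in self._partitions:
            yield from partition


seen = []
iterate_locally(FakeRDD([["a", "b"], ["c"]]), seen.append)
```

With a real RDD the call would look like iterate_locally(myrdd, lambda x: myproc.stdin.write(x+'\n')), using the names from the question.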
Your best bet is likely to save the data to a source that your local machine can access and then iterate over it.
If that's not an option, and assuming your local machine can handle one partition of data at a time, you can selectively bring back one partition at a time (I would cache the data first) and then do something like:
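rdd.cache()
for partition in range(0, rdd.getNumPartitions()):
    # Tag each partition's records with the partition index
    data = rdd.mapPartitionsWithIndex(lambda index, itr: [(index, list(itr))])
    # Bring back only the partition we are currently working on
    localData = data.filter(lambda x: x[0] == partition).collect()
    # Do work here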
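To connect this back to the question, here is a rough sketch of what the "do work" step could look like when the work is feeding a subprocess. It is not Spark-specific: partitions is any iterable of lists of strings (for example, a generator that collects one partition per step), and the function name feed_partitions, the sample data, and the child command are all made up for illustration.

```python
import os
import subprocess
import sys
import tempfile


def feed_partitions(partitions, argv, out_path):
    """Write every record to the child's stdin, one line per record.

    Only one partition needs to be in memory at a time. The child's
    stdout goes to a file, as in the question, so the pipe cannot fill up.
    """
    written = 0
    with open(out_path, "w") as out:
        proc = subprocess.Popen(argv, stdin=subprocess.PIPE, stdout=out, text=True)
        for records in partitions:
            for record in records:
                proc.stdin.write(record + "\n")
                written += 1
        proc.stdin.close()  # signal end of input to the child
        proc.wait()
    return written


# Illustrative stand-ins: three small "partitions" and a child that upper-cases its input.
fake_partitions = [["a", "b"], ["c"], ["d", "e"]]
child = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "out")
    written = feed_partitions(fake_partitions, child, out_path)
    with open(out_path) as f:
        result = f.read()
```

Keeping the Popen handle on the driver and looping over partitions there avoids the problem from the question, where foreach would try to use the process handle on remote executors.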