Python: multiprocessing, pathos, and whatnot
I must apologize in advance, because this question is quite general and may not be clear enough. The question is: how would you run in parallel a Python function that itself uses a process pool for subtasks and does a lot of heavy I/O? Is that even a valid task?
I will try to provide more information. I have a procedure, let's call it test_reduce(), that I need to run in parallel. I tried several ways to do this (see below), and I seem to lack the knowledge to understand why all of them fail.
This procedure test_reduce() does a lot of things. Some of them are more relevant to the question than others (and I list them below):
- It uses the multiprocessing module (sic!), namely a pool.Pool instance;
- It uses a MongoDB connection;
- It relies heavily on the numpy and scikit-learn libs;
- It uses callbacks and lambdas;
- It uses the dill lib to pickle some things.
I first tried using multiprocessing.dummy.Pool (which seems to be a thread pool). I don't know what is specific about this pool and why it is, eh, "dummy"; the whole thing worked, and I got my results. The problem is CPU usage. For parallelized sections of test_reduce() it was 100% for all cores; for synchronous sections it was around 40-50% most of the time. I can't say there was any increase in overall speed for this kind of "parallel" execution.
Then I tried to use a multiprocessing.pool.Pool instance to map this procedure over my data. It failed with the following:
File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
raise self._value
cPickle.PicklingError: Can't pickle <type 'thread.lock'>: attribute lookup thread.lock failed
I guessed that cPickle was to blame, and found the pathos lib, which uses a much more advanced pickler, dill. However, it also fails:
File "/local/lib/python2.7/site-packages/dill/dill.py", line 199, in load
obj = pik.load()
File "/usr/lib/python2.7/pickle.py", line 858, in load
dispatch[key](self)
File "/usr/lib/python2.7/pickle.py", line 1083, in load_newobj
obj = cls.__new__(cls, *args)
TypeError: object.__new__(generator) is not safe, use generator.__new__()
Now, this error is completely obscure to me. I get no output on stdout from my procedure when it runs in a pool, so it's hard to guess what's going on. The only thing I know is that test_reduce() succeeds when no multiprocessing is used.
So how would you run something heavy and complex in parallel?