Python: multiprocessing, pathos, and whatnot

I must apologize in advance because this question is quite general and may not be clear enough. The question is: how would you run in parallel a Python function that itself uses a process pool for some subtasks and also does a lot of heavy I/O? Is this even a valid task?

Let me try to provide more detail. I have a procedure, let's say test_reduce(), that I need to run in parallel. I tried several ways to do this (see below), and I seem to lack the knowledge to understand why they all fail.

This procedure test_reduce() does a lot of things, some more relevant to the question than others (I list them below; a rough sketch of its shape follows the list):

  • It uses the multiprocessing module (sic!), namely a pool.Pool instance,
  • It uses a MongoDB connection,
  • It relies mainly on the numpy and scikit-learn libs,
  • It uses callbacks and lambdas,
  • It uses the dill lib to pickle some things.
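
To make the list concrete, here is a minimal, purely hypothetical sketch of the shape of such a procedure; chunk_sum, partitions, and collector are invented names, not my real code:

    import multiprocessing

    import numpy as np


    def chunk_sum(chunk):
        # CPU-bound subtask handed to the inner pool
        return np.asarray(chunk).sum()


    def test_reduce(partitions):
        collector = []
        inner_pool = multiprocessing.Pool(processes=4)
        for part in partitions:
            # the callback (a lambda) runs in the parent process,
            # so it does not itself have to be picklable
            inner_pool.apply_async(chunk_sum, (part,),
                                   callback=lambda r: collector.append(r))
        inner_pool.close()
        inner_pool.join()
        return sum(collector)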

I first tried multiprocessing.dummy.Pool (which seems to be a thread pool). I don't know what is specific about this pool and why it is, eh, "dummy"; in any case, the whole thing worked and I got my results. The problem is CPU load: for the parallel sections of test_reduce() it was 100% on all cores, while for the synchronous sections it hovered around 40-50% most of the time. I cannot say the overall speed improved with this kind of "parallel" execution.
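
For what it's worth, multiprocessing.dummy is documented as a thread-backed clone of the process Pool API. Threads sidestep pickling entirely (everything stays in one process), but pure-Python sections are serialized by the GIL, which would explain the 40-50% load; the 100% phases were presumably numpy/scikit-learn routines that release the GIL. A tiny sketch of that pool:

    from multiprocessing.dummy import Pool as ThreadPool

    pool = ThreadPool(4)
    # lambdas work here: nothing is pickled between threads
    print(pool.map(lambda x: x * x, range(10)))
    pool.close()
    pool.join()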

Then I tried to use a multiprocessing.pool.Pool instance to map the procedure over my data. It failed with the following:

File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
    raise self._value
cPickle.PicklingError: Can't pickle <type 'thread.lock'>: attribute lookup thread.lock failed
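
As I understand it, this PicklingError means the callable (or its arguments) drags an unpicklable object, such as a lock held by a pool or a database connection, into the workers. A hypothetical minimal reproduction on Python 2.7, not my actual code:

    import multiprocessing
    import threading


    class Job(object):
        def __init__(self):
            # unpicklable, much like a pool or a pymongo connection
            # held by the task object
            self.lock = threading.Lock()

        def __call__(self, x):
            return x * x


    if __name__ == '__main__':
        pool = multiprocessing.Pool(2)
        # raises cPickle.PicklingError: Can't pickle <type 'thread.lock'>
        pool.map(Job(), range(4))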


I assumed cPickle was to blame and found the pathos lib, which uses a much more advanced pickler, dill. However, it also fails:

File "/local/lib/python2.7/site-packages/dill/dill.py", line 199, in load
    obj = pik.load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1083, in load_newobj
    obj = cls.__new__(cls, *args)
TypeError: object.__new__(generator) is not safe, use generator.__new__()


Now, this error is not clear at all. I get no output on stdout from my procedure when it runs in a pool, so it's hard to guess what's going on. The only thing I know is that test_reduce() succeeds when run without multiprocessing.

So how would you run something heavy and complex in parallel?



1 answer


So, thanks to @MikeMcKerns' answer, I found out how to get the pathos lib working. I needed to get rid of all the pymongo cursors, which (being generators) could not be pickled by dill; this solved the problem, and I was able to run my code in parallel.
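
A hypothetical sketch of both halves of this: dill choking on a generator, and the fix of draining the cursor into a plain list before the data crosses the pool boundary. The commented-out collection.find(query) call and the toy test_reduce are stand-ins for the real code:

    import dill
    from pathos.multiprocessing import ProcessingPool


    def gen():
        yield 1

    try:
        # generators (and pymongo cursors, which behave like them)
        # do not survive pickling, even with dill
        dill.loads(dill.dumps(gen()))
    except TypeError as exc:
        print('generator did not survive pickling: %s' % exc)


    def test_reduce(chunk):
        # stand-in for the real procedure
        return sum(doc['value'] for doc in chunk)


    # the actual fix: drain the cursor into a list first,
    # e.g. docs = list(collection.find(query))
    docs = [{'value': i} for i in range(100)]  # fake documents for the sketch
    chunks = [docs[i::4] for i in range(4)]

    pool = ProcessingPool(nodes=4)
    print(pool.map(test_reduce, chunks))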


