Why is multiprocessing copying my data if I don't touch it?
I tracked down an out-of-memory error and was horrified to find that Python's multiprocessing appears to copy large arrays, even if I have no intention of using them.
Why is Python (on Linux) doing this? I thought copy-on-write would protect me from any extra copying. My guess is that whenever I reference an object, some kind of hook is called and only then is a copy made.
Is the correct way to solve this for an arbitrary data type, like a 30 gigabyte custom dictionary, to use a Monitor? Is there a way to build Python so that it doesn't have this stupidity?
import numpy as np
import psutil
from multiprocessing import Process

mem = psutil.virtual_memory()
large_amount = int(0.75 * mem.available)

def florp():
    print("florp")

def bigdata():
    return np.ones(large_amount, dtype=np.int8)

if __name__ == '__main__':
    foo = bigdata()  # Allocated 0.75 of the ram, no problems
    p = Process(target=florp)
    p.start()  # Out of memory because bigdata is copied?
    print("Wow")
    p.join()
Running:
[ebuild R ] dev-lang/python-3.4.1:3.4::gentoo USE="gdbm ipv6 ncurses readline ssl threads xml -build -examples -hardened -sqlite -tk -wininst" 0 KiB
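The problem was that by default Linux checks the worst-case memory usage of the forked process, which can exceed the amount of memory actually available, and refuses the fork up front. This happens even if the Python code in the child never touches the variables. To get the expected COW behavior you need to change the kernel's overcommit policy system-wide. Of the three modes, 0 is the default heuristic, 1 always allows the allocation, and 2 enforces strict accounting, so it is mode 1 that lets the fork go through:
sysctl vm.overcommit_memory=1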
See https://www.kernel.org/doc/Documentation/vm/overcommit-accounting
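Before changing the setting it can help to check which policy the machine is currently using. The following is a minimal sketch, assuming a Linux system that exposes the value at /proc/sys/vm/overcommit_memory; the helper names describe_overcommit and read_overcommit are made up for illustration, and the mode descriptions paraphrase the kernel document linked above.

```python
# Hypothetical helpers for inspecting the kernel overcommit policy.
OVERCOMMIT_MODES = {
    0: "heuristic (default): obvious overcommits are refused",
    1: "always overcommit: requests are not refused up front",
    2: "strict accounting: commit limit is swap plus a share of RAM",
}


def describe_overcommit(raw):
    """Turn the raw file contents (e.g. '0\\n') into (mode, description)."""
    mode = int(raw.strip())
    return mode, OVERCOMMIT_MODES.get(mode, "unknown mode")


def read_overcommit(path="/proc/sys/vm/overcommit_memory"):
    """Read the current policy; the default path is Linux-specific."""
    with open(path) as f:
        return describe_overcommit(f.read())


if __name__ == "__main__":
    try:
        print(read_overcommit())
    except OSError:
        print("could not read the overcommit setting (not Linux?)")
```

Reading the value needs no special privileges; changing it with sysctl requires root.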
I would expect this kind of behavior: when you pass code to Python to compile, anything that is not guarded behind a function or object is immediately exec'ed for evaluation.
In your case, bigdata=np.ones(large_amount,dtype=np.int8) has to be evaluated. Unless your actual code behaves differently, the fact that florp() never touches the data has nothing to do with it.
To see a direct example:
>>> f = 0/0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: integer division or modulo by zero
>>> def f():
...     return 0/0
...
>>>
To apply this to your code, put bigdata=np.ones(large_amount,dtype=np.int8) behind a function and call it when you need it; otherwise Python tries to be helpful by having this variable available to you at runtime.
If bigdata does not change, you can write a function that gets or sets it on an object that you keep around for the lifetime of the process.
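A minimal sketch of that get-or-set idea, with the allocation hidden behind a function and cached on a long-lived object. The names LazyData and make_bigdata are hypothetical, and a small bytearray stands in for the large NumPy array so the example runs anywhere.

```python
class LazyData:
    """Builds an expensive value on first access and then keeps it."""

    def __init__(self, factory):
        self._factory = factory
        self._value = None
        self._built = False

    def get(self):
        # Only the first call pays for the allocation.
        if not self._built:
            self._value = self._factory()
            self._built = True
        return self._value


def make_bigdata(n=1000):
    # Stand-in for np.ones(large_amount, dtype=np.int8) from the question.
    return bytearray(b"\x01" * n)


# Nothing is allocated here; the data is built when get() is first called.
bigdata = LazyData(make_bigdata)

if __name__ == "__main__":
    print(len(bigdata.get()))
```

This only defers the allocation: a child process forked after get() has been called still inherits the data.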
Edit: the coffee has just kicked in. When you create a new process, the child gets its own copy of all of the parent's objects (on Linux, a copy-on-write copy made by fork). You can avoid this by using threads, or a mechanism that lets you share memory between processes, such as shared memory maps or shared ctypes.
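As a rough illustration of the shared ctypes route, the sketch below allocates a buffer with multiprocessing.RawArray and lets a child process write into it in place. The helper names make_shared_bytes and fill_ones are made up for this example, and the buffer is kept tiny; it is a sketch of the mechanism, not a drop-in fix for the code in the question.

```python
import ctypes
from multiprocessing import Process, RawArray


def make_shared_bytes(n):
    """Allocate n zero-initialised int8 values in shared memory (no lock)."""
    return RawArray(ctypes.c_int8, n)


def fill_ones(shared, start, stop):
    """Worker: write into the shared buffer in place."""
    for i in range(start, stop):
        shared[i] = 1


if __name__ == "__main__":
    data = make_shared_bytes(1000)
    # Pass the buffer as an argument so the child receives the same memory.
    p = Process(target=fill_ones, args=(data, 0, 500))
    p.start()
    p.join()
    # The parent sees the child's writes because the memory is shared.
    print(sum(data))
```

If NumPy is available, np.frombuffer(data, dtype=np.int8) should give an array view over the same buffer without copying it. RawArray has no lock, so concurrent writers need their own synchronisation.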