Optimize network-bound multiprocessing code
I have a function that I am calling with multiprocessing.Pool
Like this:
from multiprocessing import Pool

def ingest_item(id):
    # makes a lot of network calls
    # and adds a bunch of rows to a remote db
    return None

if __name__ == '__main__':
    p = Pool(12)
    thing_ids = range(1000000)
    p.map(ingest_item, thing_ids)
There are over 1 million items in the pool.map list. Each ingest_item() call dispatches requests to third-party services and adds data to a remote PostgreSQL database.
On a 12-core machine, only about 1000 items get processed per 24 hours, with low CPU and RAM consumption.
How can I make it faster?
Could switching to Threads make sense since the bottleneck appears to be network calls?
Thanks in advance!
First: remember that this is a network-bound task. You should expect CPU and RAM usage to be low, because the network is orders of magnitude slower than your 12-core machine.
That said, spawning one process per request is wasteful. If you run into problems from starting too many processes, you can try pycurl, as suggested here: Library or tool to download multiple files in parallel
This pycurl example looks very similar to your task https://github.com/pycurl/pycurl/blob/master/examples/retriever-multi.py
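Since the work is I/O-bound, another lightweight option is a thread pool inside a single process: threads sit idle cheaply while waiting on the network, so you can run far more than 12 workers. A minimal sketch using the standard library's concurrent.futures, with ingest_item stubbed out (the real function would do the network calls and database writes):

```python
from concurrent.futures import ThreadPoolExecutor

def ingest_item(item_id):
    # Stub: the real version would call third-party services
    # and write to the remote database.
    return item_id

def run_all(ids, workers=50):
    # Threads share one process; 50+ workers are cheap for
    # I/O-bound work where each call mostly waits on the network.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ingest_item, ids))
```

The worker count here is an assumption to tune: raise it until the third-party services or your bandwidth become the limit, not your CPU.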
Using threads is unlikely to significantly improve performance on its own: no matter how you split the work, every request still has to go over the network.
To improve performance, check whether the third-party services offer some kind of bulk or high-volume request API.
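If such a bulk API exists, the usual pattern is to group the one million ids into fixed-size batches and send one request per batch instead of one per item. A small sketch of the batching helper (the batch size and the bulk endpoint are assumptions, not anything from the original post):

```python
def chunked(seq, size):
    # Yield successive fixed-size batches from a sequence,
    # e.g. for one bulk API call per batch instead of one
    # network round trip per item.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]
```

Each yielded batch would then be passed to whatever bulk endpoint the service exposes.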
If your workload allows it, you could also try some kind of caching. From your description of the task, though, it sounds like that would be ineffective, since you are mostly sending data rather than requesting it. What you can cache instead are open connections (if you don't already): keeping connections alive avoids repeating the slow TCP handshake for every request. This kind of connection reuse is common in web browsers (like Chrome).
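One way to reuse connections safely when combined with worker threads is to keep one long-lived session/connection object per thread. A minimal sketch using threading.local; the factory argument is a hypothetical callable that would open the real HTTP session or database connection:

```python
import threading

_local = threading.local()

def get_session(factory):
    # Create the session once per thread and reuse it on every
    # later call, instead of opening a new connection (and paying
    # a TCP handshake) for each request.
    if not hasattr(_local, "session"):
        _local.session = factory()
    return _local.session
```

Each worker thread then calls get_session(...) inside ingest_item and always gets the same already-open object back.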
Disclaimer: I have no Python experience