Python: simple cluster computing / multiprocessing on a LAN?

I am trying to implement a cluster computing solution in Python 3 for the following problem: I have a lot of raw data files on a NAS (> 2 TB, several thousand files; every single file can be processed in memory, and the NAS performance is good enough), and there is some Python code that processes the data, creates one output file for each input file, and saves it to the NAS as well.

Processing time is mainly limited by the performance of the local network (the NAS is in another building that is well connected to ours, but the local network inside our building only runs at 100 Mbps) and to a lesser extent by the CPU. I have several (4-5) Windows PCs (a list of IP addresses) on the local network that I would like to use as a small and simple compute cluster. So far I have done this manually, by splitting the files into groups and tweaking my script on each computer, and it worked fine. I would now like to automate this step: start a process on the "master" PC that creates a queue of tasks, each executing the same code for one file (reading from and writing to a network drive, perhaps using some kind of "map" command), and distributes these tasks to the PCs in the cluster. Hence:

  • Each task can work independently and use its own environment. Communication between tasks / nodes is not required, and no shared variables are needed.
  • In particular, I do not want the results to be sent back to the master PC; each PC should write its own results directly to the NAS.
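To make the requirements above concrete, here is a minimal sketch of what I currently do by hand, written as code. Everything in it is an assumption for illustration: the names `my_share`, `process_file` and `run_node`, the `.out` suffix, and the `data.upper()` line that stands in for my real processing. Each PC gets a fixed slice of the file list and writes its results itself, so nothing is shared between nodes. What I am looking for is a way to replace the fixed split with a real task queue.

```python
import os
import tempfile
from pathlib import Path


def my_share(files, node_index, n_nodes):
    """Return the files for one PC: every n_nodes-th file, starting at node_index."""
    return sorted(files)[node_index::n_nodes]


def process_file(in_path, out_dir):
    """Process one input file and write one output file; no shared state."""
    in_path, out_dir = Path(in_path), Path(out_dir)
    out_path = out_dir / (in_path.stem + ".out")  # hypothetical naming scheme
    if out_path.exists():
        return out_path  # already produced by an earlier run
    data = in_path.read_bytes()  # each file fits in memory
    result = data.upper()  # placeholder for the real processing
    # Write under a temporary name first, then rename, so that a
    # half-written file never appears under the final name.
    fd, tmp_name = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(result)
    os.replace(tmp_name, out_path)
    return out_path


def run_node(in_dir, out_dir, node_index, n_nodes):
    """Process this PC's share of the files found in in_dir."""
    files = [p for p in Path(in_dir).iterdir() if p.is_file()]
    return [process_file(p, out_dir) for p in my_share(files, node_index, n_nodes)]
```

With 5 PCs, each one would call `run_node(in_dir, out_dir, i, 5)` with its own index `i` from 0 to 4. The drawback of this fixed split is that a slow PC holds up its whole share, which is why I would prefer a queue that hands out one file at a time.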

Given my very limited experience in cluster computing / parallel processing, I would like to implement this as simply as possible. After some research I am a little overwhelmed by the number of available modules / packages, for example those listed on this page: https://wiki.python.org/moin/ParallelProcessing . A lot of the packages appear to be either complete overkill or unmaintained for years.

If you could recommend a good package for this purpose, that would be much appreciated. Is there a simple solution that I have not found yet? Alternatively, is it possible to solve this at all using only standard modules such as threading, multiprocessing, queue, etc.? Is there sample code for a similar implementation somewhere?

Also, it would be nice to be able to do the following, although it is not required:

  • Bonus1: Be able to use the "main" PC for computing.
  • Bonus2: Run multiple instances on each PC. The number of instances may differ from PC to PC because of different hardware configurations.
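For Bonus2, this is roughly what I imagine on each PC, as a sketch only. The host names and counts in `INSTANCES_PER_HOST` are made up, and `instances_for` and `run_parallel` are hypothetical helper names. It uses `concurrent.futures.ProcessPoolExecutor` from the standard library to run several worker processes on one machine.

```python
import socket
from concurrent.futures import ProcessPoolExecutor

# Hypothetical table: host name -> number of parallel instances on that PC.
INSTANCES_PER_HOST = {"pc-fast": 4, "pc-old": 1}


def instances_for(host=None, table=INSTANCES_PER_HOST, default=1):
    """Look up how many instances to run on a host (default: this machine)."""
    host = host or socket.gethostname()
    return table.get(host, default)


def run_parallel(task, items, n_instances):
    """Run task(item) for every item using n_instances worker processes."""
    with ProcessPoolExecutor(max_workers=n_instances) as pool:
        return list(pool.map(task, items))
```

On Windows, `task` would have to be a module-level function, and the call to `run_parallel` would have to sit under `if __name__ == "__main__":`, because worker processes are started by re-importing the script. For Bonus1, the "main" PC would simply run the same worker code as the others.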