What is some sample code that demonstrates multi-core acceleration in Python on Windows?

I am using Python 3 on Windows and am trying to build a toy example demonstrating how using multiple processor cores can speed up computation. My toy example is rendering the Mandelbrot fractal.

So far:

  • I avoided threads, since the global interpreter lock (GIL) prevents true multi-core execution in this context
  • I ruled out example code that relies on fork(), which is available on Linux but not on Windows
  • I tried the "multiprocessing" package: I declare p = Pool(8) (8 is my core count) and use p.starmap(..) to delegate work. This is supposed to spawn several subprocesses that Windows automatically schedules onto different cores; a minimal sketch of that setup follows this list
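For reference, here is a minimal runnable sketch of that Pool/starmap setup (the add() worker is my own toy, just to show the mechanics and the main-guard that Windows requires):

    from multiprocessing import Pool, cpu_count

    def add(a, b):
        # starmap unpacks each (a, b) tuple into two arguments
        return a + b

    if __name__ == "__main__":  # required on Windows: child processes re-import this module
        with Pool(cpu_count()) as p:
            print(p.starmap(add, [(0, 1), (2, 3), (4, 5)]))  # -> [1, 5, 9]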

However, I cannot demonstrate any speedup, whether because of overhead or a lack of actual multiprocessing. So pointers to toy examples with an obvious speedup would be very helpful :-)

Edit: Thanks! This pushed me in the right direction, and I now have a working example that roughly doubles the speed on a 4-core processor.
A copy of my code with comments is here: https://pastebin.com/c9HZ2vAV

I decided to use Pool(), but will later try the "Process" alternative pointed out by @16num (a rough sketch of that approach follows the code below). Here is my sample code for Pool():

    from multiprocessing import Pool, cpu_count
    from functools import partial

    # calculatePixel, data, height and width are defined earlier in the script
    p = Pool(cpu_count())

    # map only passes a single argument to the worker; "partial" binds the extra
    # dataarray argument, and starmap unpacks each (j, k) tuple into two arguments
    partial_calculatePixel = partial(calculatePixel, dataarray=data)
    koord = []
    for j in range(height):
        for k in range(width):
            koord.append((j, k))

    # Runs the calls to calculatePixel in the pool. "hmm" collects the output
    hmm = p.starmap(partial_calculatePixel, koord)
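For comparison, here is a rough sketch of what the Process-based alternative might look like (the worker() function and the queue plumbing are my own illustration, not @16num's actual code):

    from multiprocessing import Process, Queue, cpu_count

    def worker(rows, out):
        # hypothetical stand-in for the per-row pixel computation
        out.put([(j, j * j) for j in rows])

    if __name__ == "__main__":
        out = Queue()
        n = cpu_count()
        rows = list(range(100))
        chunks = [rows[i::n] for i in range(n)]  # deal the rows out round-robin
        procs = [Process(target=worker, args=(chunk, out)) for chunk in chunks]
        for p in procs:
            p.start()
        results = [out.get() for _ in procs]  # drain the queue before join() to avoid a deadlock
        for p in procs:
            p.join()

Unlike Pool, this leaves the chunking and result collection to you, which is why Pool is usually the more convenient choice for embarrassingly parallel work.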

      

1 answer


It's very easy to demonstrate the speedup from multiprocessing:

import multiprocessing
import time

# cross-platform high-resolution clock (time.clock was removed in Python 3.8,
# so use time.perf_counter on all platforms)
get_timer = time.perf_counter

def cube_function(num):
    time.sleep(0.01)  # simulate ~10 ms of work for the CPU core to cube the number
    return num**3

if __name__ == "__main__":  # multiprocessing guard
    # we'll test multiprocessing with pools from one to the number of CPU cores on the system
    # it won't show significant improvements after that and it will soon start going
    # downhill due to the underlying OS thread context switches
    for workers in range(1, multiprocessing.cpu_count() + 1):
        pool = multiprocessing.Pool(processes=workers)
        # let's 'warm up' our pool so process startup doesn't affect our measurements
        pool.map(cube_function, range(multiprocessing.cpu_count()))
        # now to business: we have 10000 numbers to cube via our expensive function
        print("Cubing 10000 numbers over {} processes:".format(workers))
        timer = get_timer()  # time measuring starts now
        results = pool.map(cube_function, range(10000))  # map our range to the cube_function
        timer = get_timer() - timer  # get our delta time as soon as it finishes
        print("\tTotal: {:.2f} seconds".format(timer))
        print("\tAvg. per process: {:.2f} seconds".format(timer / workers))
        pool.close()  # let's clear out our pool for the next run
        time.sleep(1)  # wait a second to make sure everything is cleaned up

      

Of course, we're only simulating ~10 ms of computation per number here; you can replace cube_function with anything CPU-taxing to demonstrate a real-world speedup, as sketched below.
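For example, a genuinely CPU-bound stand-in (my own toy function, loosely in the spirit of the question's Mandelbrot pixel; the constants and iteration cap are arbitrary):

def cpu_heavy(num):
    # naive Mandelbrot-style escape iteration: real CPU work instead of sleeping
    c = complex(num / 10000.0, 0.5)
    z = 0j
    for _ in range(100000):
        z = z * z + c
        if abs(z) > 2.0:
            break
    return num ** 3

# drop-in replacement for the measured line:
# results = pool.map(cpu_heavy, range(10000))

With the sleep-based simulation above, the expected results are: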



Cubing 10000 numbers over 1 processes:
        Total: 100.01 seconds
        Avg. per process: 100.01 seconds
Cubing 10000 numbers over 2 processes:
        Total: 50.02 seconds
        Avg. per process: 25.01 seconds
Cubing 10000 numbers over 3 processes:
        Total: 33.36 seconds
        Avg. per process: 11.12 seconds
Cubing 10000 numbers over 4 processes:
        Total: 25.00 seconds
        Avg. per process: 6.25 seconds
Cubing 10000 numbers over 5 processes:
        Total: 20.00 seconds
        Avg. per process: 4.00 seconds
Cubing 10000 numbers over 6 processes:
        Total: 16.68 seconds
        Avg. per process: 2.78 seconds
Cubing 10000 numbers over 7 processes:
        Total: 14.32 seconds
        Avg. per process: 2.05 seconds
Cubing 10000 numbers over 8 processes:
        Total: 12.52 seconds
        Avg. per process: 1.57 seconds

      

Now, why isn't it 100% linear? Well, first, it takes a while to spawn the subprocesses and to pass data to them and back; there is some context-switching cost; other tasks use my processors from time to time; and time.sleep() is not quite accurate (nor can it be on anything other than a real-time OS). But the results are roughly in line with the values expected for parallel processing.
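One practical knob if the per-item work is small: Pool.map accepts an optional chunksize argument that batches items per task, cutting down on parent-to-worker round-trips (the value 100 below is just an illustrative guess):

# batching items per task reduces IPC overhead for large inputs
results = pool.map(cube_function, range(10000), chunksize=100)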







