Is it possible to run multiple instances of a CUDA program on a multi-GPU machine?

Background:

I wrote a CUDA program that does character-sequence processing. The program processes all character sequences in parallel, with the condition that all sequences are of the same length. I sort my data into groups, each consisting entirely of sequences of the same length, and the program processes one group at a time.

Question:

I am running my code on a Linux machine with 4 GPUs and would like to use all 4 by running 4 instances of my program (one per GPU). Is it possible for the program to select a GPU that is not being used by another CUDA application? I don't want to hardcode anything that could cause problems in the future when the program is run on other hardware with more or fewer GPUs.

2 answers


The environment variable CUDA_VISIBLE_DEVICES is your friend.

I am assuming you have as many terminals open as you have GPUs. Let's say your application is called myexe.

Then, in one terminal, you can do:

CUDA_VISIBLE_DEVICES="0" ./myexe

      

In the following terminal:

CUDA_VISIBLE_DEVICES="1" ./myexe

      

etc.

The first instance will then run on the first GPU enumerated by CUDA. The second instance will run on the second GPU (only), and so on.

Assuming bash, and for a given terminal session, you can make this "persistent" by exporting the variable:

export CUDA_VISIBLE_DEVICES="2"

Thereafter, all CUDA applications run in that session will observe only the third enumerated GPU (enumeration starts at 0), and they will observe that GPU as if it were device 0 in their session.

This means you do not need to make any changes to your application for this method, assuming your application is using the default GPU or GPU 0.
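
If you want to confirm what a given process actually sees, a small probe like the following can help. This is a minimal sketch of my own (not part of the original answer); compile it with nvcc and run it under different CUDA_VISIBLE_DEVICES settings:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count); // counts only the GPUs visible to this process
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("device %d: %s\n", i, prop.name); // indices always start at 0
    }
    return 0;
}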

You can also extend this to make multiple GPUs available, for example:

export CUDA_VISIBLE_DEVICES="2,4"

      

means that the GPUs that would ordinarily enumerate as 2 and 4 will now be the only GPUs "visible" in that session, and they will enumerate as 0 and 1.
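
To launch one instance per GPU without opening a terminal for each, a simple shell loop works as well (a sketch assuming bash, 4 GPUs, and the myexe name from above):

for i in 0 1 2 3; do
    CUDA_VISIBLE_DEVICES="$i" ./myexe &   # each instance sees exactly one GPU
done
wait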

In my opinion, this approach is the simplest. Choosing a GPU that is "not in use" is problematic because:

  • you need a definition of "in use"
  • a GPU that was in use at one instant may no longer be in use an instant later (and vice versa)
  • most importantly, a GPU that is "not in use" can become "in use" asynchronously, which exposes you to race conditions

So the best advice (IMO) is to manage the GPUs explicitly. Otherwise, you'll need some form of job scheduler (outside the scope of this question, IMO) that can query for unused GPUs and "reserve" one in an orderly fashion before another app tries to do so.

There is a better (more automatic) way that we use in PIConGPU, which runs on huge (and heterogeneous) clusters. See the implementation here: https://github.com/ComputationalRadiationPhysics/picongpu/blob/909b55ee24a7dcfae8824a22b25c5aef6bd098de/src/libPMacc/include/Environment.hpp#L169

Basically: call cudaGetDeviceCount to get the number of GPUs, iterate over them, and call cudaSetDevice on each to make it the current device, checking whether that worked. The check might need to include a trivial test operation (such as creating a stream), due to a bug in CUDA that let cudaSetDevice succeed while all subsequent calls failed because the device was actually in use. Note: you may need to set the GPUs to exclusive-process mode, so that each GPU can be used by only one process at a time. If you don't have enough data for one "batch", you may want the opposite: multiple processes submitting work to the same GPU. So tune according to your needs.
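
Here is a minimal sketch of that probe loop (the function name is my own, and the harmless cudaFree(0) call is just a cheap way to force context creation). It assumes the GPUs have been put in exclusive-process mode, e.g. with nvidia-smi -c EXCLUSIVE_PROCESS as root:

#include <cstdio>
#include <cuda_runtime.h>

// Try to claim the first available GPU. With devices in exclusive-process
// compute mode, context creation fails on a GPU that another process
// already owns, so cudaFree(0) acts as the probe.
int claimFreeDevice() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;
    for (int dev = 0; dev < count; ++dev) {
        if (cudaSetDevice(dev) != cudaSuccess) continue;
        if (cudaFree(0) == cudaSuccess) return dev; // context created: GPU is ours
        cudaGetLastError(); // clear the error and try the next device
    }
    return -1; // every GPU is busy
}

int main() {
    int dev = claimFreeDevice();
    if (dev < 0) { fprintf(stderr, "no free GPU\n"); return 1; }
    printf("claimed GPU %d\n", dev);
    // ... launch the sequence-processing kernels here ...
    return 0;
}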



Other ideas: run an MPI application with as many processes per node as there are GPUs, and use each process's local rank number as its device number. This would also help in applications like yours that have different datasets to distribute: for example, MPI rank 0 could process the length-1 sequences and MPI rank 1 the length-2 sequences, and so on (see the sketch below).
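
A sketch of that rank-to-GPU mapping (assuming MPI-3 for the shared-memory sub-communicator; the variable names are my own):

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // Ranks on the same node get consecutive local ranks via a
    // shared-memory sub-communicator (MPI-3).
    MPI_Comm local;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &local);
    int localRank = 0;
    MPI_Comm_rank(local, &localRank);

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    cudaSetDevice(localRank % deviceCount); // one GPU per local rank

    // ... each rank processes its own group of equal-length sequences ...

    MPI_Comm_free(&local);
    MPI_Finalize();
    return 0;
}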
