How to copy memory between different GPUs in CUDA

I am currently working with two GTX 650s. My program uses a simple client/server structure: worker threads are distributed across the two GPUs, and the server thread has to collect the result vectors from the client threads, so I need to copy memory between the two GPUs. Unfortunately, the simple P2P program in the CUDA samples doesn't work, because my cards don't have TCC drivers. After spending two hours searching on Google and SO, I can't find an answer. One source says I should use cudaMemcpyPeer, and another says I should use cudaMemcpy with cudaMemcpyDefault. Is there an easier way to get this done than copying to the host and then copying back to a device? I know this must be documented somewhere, but I cannot find it. Thank you for your help.



1 answer

Transferring data from one GPU to another often requires "staging" through host memory. The exception is when the GPUs and the system topology support peer-to-peer (P2P) access and P2P has been explicitly enabled; in that case, data can be transferred directly over the PCIe bus from one GPU to the other.
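The query-and-enable sequence can be sketched with a small helper like the following (illustrative only; the function name and the device ordinals devA/devB are assumptions):

// Sketch: enable P2P access from devA to devB if the topology allows it.
// Returns 1 if P2P was enabled, 0 otherwise.
#include <cuda_runtime.h>

int tryEnablePeer(int devA, int devB)
{
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, devA, devB); // can devA map devB's memory?
    if (canAccess) {
        cudaSetDevice(devA);                 // peer access is enabled per calling device
        cudaDeviceEnablePeerAccess(devB, 0); // flags argument must currently be 0
    }
    return canAccess;
}

Note that enabling is one-directional: for bidirectional access, the same call has to be made from each device with the other as the peer, as the full example below does.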

In either case (whether P2P is available or not), the typical CUDA runtime API call would be cudaMemcpyPeer / cudaMemcpyPeerAsync, as demonstrated in the CUDA p2pBandwidthLatencyTest sample code.
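For reference, the asynchronous variant takes the same device arguments plus a stream. A minimal sketch (the wrapper name and all parameters are illustrative; the pointers are assumed to have been allocated on the indicated devices):

#include <cuda_runtime.h>

// Sketch: asynchronous peer copy. d_dst resides on dstDev, d_src on srcDev.
void peerCopyAsync(void *d_dst, int dstDev,
                   const void *d_src, int srcDev,
                   size_t nbytes, cudaStream_t stream)
{
    // Works whether or not P2P is enabled; without P2P the runtime
    // stages the transfer through host memory internally.
    cudaMemcpyPeerAsync(d_dst, dstDev, d_src, srcDev, nbytes, stream);
}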

On Windows, one of the requirements for P2P support is that the driver support both devices in TCC mode. For the most part, TCC mode is not available for GeForce GPUs (an exception was recently made for GeForce Titan GPUs using the drivers and runtime available in the CUDA 7.5RC toolkit).

Therefore, on Windows, these GPUs cannot take advantage of direct P2P transfers. Nevertheless, a nearly identical call sequence can be used to transfer the data. The CUDA runtime detects the situation and allocates an intermediate host buffer under the hood. The transfer then happens in two steps: from the source device to the intermediate buffer, and from the intermediate buffer to the destination device.
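What the runtime does internally is roughly equivalent to a manual two-step copy through a host buffer, which could be sketched like this (a simplified sketch; all names are illustrative, and the real driver path is smarter, as note 2 below explains):

#include <cuda_runtime.h>

// Sketch: manual staged copy, approximating what the runtime does without P2P.
void stagedCopy(void *d_dst, int dstDev,
                const void *d_src, int srcDev, size_t nbytes)
{
    void *h_stage;
    cudaMallocHost(&h_stage, nbytes); // pinned buffer, required for DMA transfers
    cudaSetDevice(srcDev);
    cudaMemcpy(h_stage, d_src, nbytes, cudaMemcpyDeviceToHost);
    cudaSetDevice(dstDev);
    cudaMemcpy(d_dst, h_stage, nbytes, cudaMemcpyHostToDevice);
    cudaFreeHost(h_stage);
}

Unlike this naive sketch, the runtime chunks the transfer so the second hop can begin before the first has finished, partially overlapping the two copies.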

Below is a fully worked example showing how to transfer data from one GPU to another using P2P access if available:

$ cat t850.cu
#include <stdio.h>
#include <math.h>

#define SRC_DEV 0
#define DST_DEV 1

#define DSIZE (8*1048576)

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

int main(int argc, char *argv[]){

  int disablePeer = 0;
  if (argc > 1) disablePeer = 1;
  int devcount;
  cudaGetDeviceCount(&devcount);
  cudaCheckErrors("cuda failure");
  int srcdev = SRC_DEV;
  int dstdev = DST_DEV;
  if (devcount <= max(srcdev,dstdev)) {printf("not enough cuda devices for the requested operation\n"); return 1;}
  int *d_s, *d_d, *h;
  int dsize = DSIZE*sizeof(int);
  h = (int *)malloc(dsize);
  if (h == NULL) {printf("malloc fail\n"); return 1;}
  for (int i = 0; i < DSIZE; i++) h[i] = i;
  int canAccessPeer = 0;
  if (!disablePeer) cudaDeviceCanAccessPeer(&canAccessPeer, srcdev, dstdev);
  cudaSetDevice(srcdev);
  cudaMalloc(&d_s, dsize);
  cudaMemcpy(d_s, h, dsize, cudaMemcpyHostToDevice);
  if (canAccessPeer) cudaDeviceEnablePeerAccess(dstdev,0);
  cudaSetDevice(dstdev);
  cudaMalloc(&d_d, dsize);
  cudaMemset(d_d, 0, dsize);
  if (canAccessPeer) cudaDeviceEnablePeerAccess(srcdev,0);
  cudaCheckErrors("cudaMalloc/cudaMemset fail");
  if (canAccessPeer) printf("Timing P2P transfer");
  else printf("Timing ordinary transfer");
  printf(" of %d bytes\n", dsize);
  cudaEvent_t start, stop;
  cudaEventCreate(&start); cudaEventCreate(&stop);
  cudaEventRecord(start);
  cudaMemcpyPeer(d_d, dstdev, d_s, srcdev, dsize);
  cudaCheckErrors("cudaMemcpyPeer fail");
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float et;
  cudaEventElapsedTime(&et, start, stop);
  cudaMemcpy(h, d_d, dsize, cudaMemcpyDeviceToHost);
  cudaCheckErrors("cudaMemcpy fail");
  for (int i = 0; i < DSIZE; i++) if (h[i] != i) {printf("transfer failure\n"); return 1;}
  printf("transfer took %fms\n", et);
  return 0;
}

$ nvcc -arch=sm_20 -o t850 t850.cu
$ ./t850
Timing P2P transfer of 33554432 bytes
transfer took 5.135680ms
$ ./t850 disable
Timing ordinary transfer of 33554432 bytes
transfer took 7.274336ms



  1. Passing any command-line parameter disables P2P usage, even if it is available.
  2. The above results are for a system where P2P access is possible and the two GPUs are connected by a PCIe Gen2 link, capable of about 6 GB/s of bandwidth in one direction. The P2P transfer time is consistent with this (32 MB / 5 ms ~= 6 GB/s). The transfer time without P2P is longer, but not twice as long. This is because, for transfers to/from the intermediate buffer, the outgoing transfer can begin once some of the data has arrived in the buffer; the driver/runtime exploits this to partially overlap the two copies.

Please note that, in general, P2P support can vary by GPU or GPU family. The ability to run P2P on one GPU type or family does not necessarily indicate that it will work on another GPU type or family, even in the same system/setup. The final arbiter of GPU P2P support is the runtime itself, queried via cudaDeviceCanAccessPeer. P2P support can also vary by system and other factors; nothing said here is a guarantee of P2P support for any particular GPU in any particular installation.
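A quick way to see what a given system actually supports is to query every device pair, along the lines of what the p2pBandwidthLatencyTest sample prints (a standalone sketch; the output format is illustrative):

#include <cuda_runtime.h>
#include <stdio.h>

// Sketch: print the P2P access matrix for all device pairs.
int main()
{
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            if (i == j) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, i, j);
            printf("device %d -> device %d : %s\n", i, j,
                   can ? "P2P capable" : "staged only");
        }
    return 0;
}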

Note: the TCC driver requirement on Windows has been relaxed in recent drivers. With recent drivers, it should be possible to exchange P2P data between devices in WDDM mode, provided the other requirements are met.


