Why is my CUDA implementation only as fast as my CPU implementation?

I created some code to do a 2D convolution on a 1300x1300 grayscale image with a 15x15 kernel, in standard C++ and in CUDA. Both versions:

CPU:

#include <iostream>
#include <exception>

#define N 1300
#define K 15
#define K2 ((K - 1) / 2)


template<int mx, int my>
inline int index(int x, int y)
{
  return x*my + y;
}

int main() {
  double *image  = new double[N * N];
  double *kernel = new double[K * K];
  double *result = new double[N * N];

  for (int x=0; x<N; ++x)
  for (int y=0; y<N; ++y)
  {
    double r = 0;
    for(int i=0; i<K; ++i)
    for(int j=0; j<K; ++j)
    {
      if (x + i - K2 >= 0 and
          x + i - K2 < N  and
          y + j - K2 >= 0 and
          y + j - K2 < N)
      {
        r +=  kernel[index<K,K>(i,j)] * image[index<N,N>(x+i-K2, y+j-K2)];
      }
    }
    result[index<N,N>(x, y)] = r;
  }

  delete[] image;
  delete[] kernel;
  delete[] result;
}

      

GPU:

#include <iostream>
#include <exception>

// ignore, just for error handling
struct ErrorHandler {
  int d_line;
  char const *d_file;
  ErrorHandler(int line, char const *file) : d_line(line), d_file(file) {};
};

#define EH ErrorHandler(__LINE__, __FILE__)

ErrorHandler operator<<(ErrorHandler eh, cudaError_t err)
{
  if (err != cudaSuccess)
  {
    std::cerr << cudaGetErrorString( err ) << " in " << eh.d_file << " at line " << eh.d_line << '\n';
    throw std::exception();
  }
  return eh;
}
// end.

#define N 1300
#define K 15
#define K2 ((K - 1) / 2)


template<int mx, int my>
__device__ inline int index(int x, int y)
{
  return x*my + y;
}

__global__ void kernelkernel(double *image, double *kernel, double *result)
{
  int x = blockIdx.x;
  int y = blockIdx.y; // becomes: int y = threadIdx.x;

  double r = 0;
  for(int i=0; i<K; ++i)
  for(int j=0; j<K; ++j)
  {
    if (x + i - K2 >= 0 and
        x + i - K2 < N  and
        y + j - K2 >= 0 and
        y + j - K2 < N)
    {
      r +=  kernel[index<K,K>(i,j)] * image[index<N,N>(x+i-K2, y+j-K2)];
    }
  }
  result[index<N,N>(x, y)] = r;
}

int main() {
  double *image      = new double[N * N];
  double *kernel    = new double[K * K];
  double *result      = new double[N * N];

  double *image_cuda;
  double *kernel_cuda;
  double *result_cuda;
  EH << cudaMalloc((void **) &image_cuda,  N*N*sizeof(double));
  EH << cudaMalloc((void **) &kernel_cuda, K*K*sizeof(double));
  EH << cudaMalloc((void **) &result_cuda, N*N*sizeof(double));

  EH << cudaMemcpy(image_cuda,     image,     N*N*sizeof(double), cudaMemcpyHostToDevice);
  EH << cudaMemcpy(kernel_cuda,    kernel,    K*K*sizeof(double), cudaMemcpyHostToDevice);

  dim3 grid   ( N, N );
  kernelkernel<<<grid, 1>>>(image_cuda, kernel_cuda, result_cuda);
  // replace previous 2 statements with: 
  // kernelkernel<<<N, N>>>(image_cuda, kernel_cuda, result_cuda);
  EH << cudaMemcpy(result, result_cuda, N*N*sizeof(double), cudaMemcpyDeviceToHost);

  cudaFree( image_cuda );
  cudaFree( kernel_cuda );
  cudaFree( result_cuda );

  delete[] image;
  delete[] kernel;
  delete[] result;
}

      

I would expect the CUDA code to be much faster:

$ nvprof ./gpuversion
==17806== NVPROF is profiling process 17806, command: ./gpuversion
==17806== Profiling application: ./gpuversion
==17806== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
99.89%  3.83149s         1  3.83149s  3.83149s  3.83149s  kernelkernel(double*, double*, double*)
  0.07%  2.6420ms         1  2.6420ms  2.6420ms  2.6420ms  [CUDA memcpy DtoH]
  0.04%  1.5111ms         2  755.54us     736ns  1.5103ms  [CUDA memcpy HtoD]

      

and

$ time ./cpuversion
real    0m3.382s
user    0m3.371s
sys     0m0.012s

      

The difference between them is statistically insignificant. The CUDA kernel takes about 3-4 seconds; why isn't it much faster? Is my code running in parallel?

PS: I'm new to CUDA, so I might be missing something trivial.

Solution

What I learned is that CUDA does not let you access memory willy-nilly from blocks. I take the general CUDA programming strategy to be:

  • allocate and copy memory from RAM to the GPU using cudaMalloc and cudaMemcpy
  • divide the workload between blocks and threads so that the memory accessed by different blocks does not overlap much.
  • if there is overlap between the memory used by blocks, start each block by copying that memory into a __shared__ array. Note that:
    • the size of this array must be known at compile time
    • its size is limited
    • this memory is shared by all threads within ONE block, so __shared__ double foo[10] allocates 10 doubles for each BLOCK.
  • copy the memory needed by one block into these shared variables inside the kernel, of course using the different threads to do this "efficiently" (see the minimal sketch right after this list)
  • synchronize the threads so that all data is there before using it
  • process the data, and write the result to the kernel's output array
  • synchronize again; I'm not sure why, but everyone on the internet seems to do it :S
  • copy the GPU memory back to RAM
  • free the GPU memory
This gives the following code. It is MEX code for Matlab that computes structural similarity (SSIM), which also works with a sliding kernel, but over two images and with a different aggregate than the dot product.

// author: Herbert Kruitbosch, CC: be nice, include my name in documentation/papers/publications when used
#include <matrix.h>
#include <mex.h>

#include <cmath>
#include <iostream>
#include <fstream>
#include <stdio.h>

static void HandleError(
  cudaError_t err,
  const char *file,
  int line )
{
  if (err != cudaSuccess)
  {
    printf( "%s in %s at line %d\n", cudaGetErrorString( err ), file, line );
    exit( EXIT_FAILURE );
  }
}

#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))
#define TILE_WIDTH 31

__device__ inline double sim(double v0, double v1, double c)
{
  return (c + 2*v0*v1) / (c + v1*v1 + v0*v0);
}

__device__ inline int index(int rows, int cols, int row, int col)
{
  return row + col*rows;
}

__global__ void ssimkernel(double *test, double *reference, const double * __restrict__ kernel, double *ssim, int k, int rows, int cols, int tile_batches_needed)
{
  int radius = k / 2;
  int block_width = TILE_WIDTH - k + 1;
  __shared__ double tile_test     [TILE_WIDTH][TILE_WIDTH];
  __shared__ double tile_reference[TILE_WIDTH][TILE_WIDTH];



  for(int offset=0; offset < tile_batches_needed; ++offset)
  {
    int dest = block_width*block_width*offset + threadIdx.y * block_width + threadIdx.x;
    int destRow = dest / TILE_WIDTH;
    int destCol = dest % TILE_WIDTH;
    int srcRow = blockIdx.y * block_width + destRow - radius;
    int srcCol = blockIdx.x * block_width + destCol - radius;
    int src  = srcCol * rows + srcRow;
    if (destRow < TILE_WIDTH)
    {
      if (srcRow >= 0 and srcRow < rows and
          srcCol >= 0 and srcCol < cols)
      {
        tile_test     [destRow][destCol] = test     [src];
        tile_reference[destRow][destCol] = reference[src];
      }
      else
      {
        tile_test     [destRow][destCol] = 0;
        tile_reference[destRow][destCol] = 0;
      }
    }
  }
  __syncthreads();

  double mean_test = 0;
  double mean_reference = 0;
  for(int i=0; i<k; ++i)
  for(int j=0; j<k; ++j)
  {
    double w = kernel[i * k + j];
    mean_test      +=  w * tile_test     [threadIdx.y+i][threadIdx.x+j];
    mean_reference +=  w * tile_reference[threadIdx.y+i][threadIdx.x+j];
  }

  double var_test = 0;
  double var_reference = 0;
  double correlation = 0;
  for(int i=0; i<k; ++i)
  for(int j=0; j<k; ++j)
  {
    double w = kernel[i * k + j];
    double a = (tile_test     [threadIdx.y+i][threadIdx.x+j] - mean_test     );
    double b = (tile_reference[threadIdx.y+i][threadIdx.x+j] - mean_reference);
    var_test      += w * a * a;
    var_reference += w * b * b;
    correlation   += w * a * b;
  }

  int destRow = blockIdx.y * block_width + threadIdx.y;
  int destCol = blockIdx.x * block_width + threadIdx.x;
  if (destRow < rows and destCol < cols)
    ssim[destCol * rows + destRow] = sim(mean_test, mean_reference, 0.01) * (0.03 + 2*correlation) / (0.03 + var_test + var_reference);

  __syncthreads();
}


template<typename T>
inline T sim(T v0, T v1, T c)
{
  return (c + 2*v0*v1) / (c + v1*v1 + v0*v0);
}

inline int upperdiv(int a, int b) {
  return (a + b - 1) / b;
}

void mexFunction(int nargout, mxArray *argout[], int nargin, const mxArray *argin[])
{
  mwSize rows = mxGetDimensions(argin[0])[0];
  mwSize cols = mxGetDimensions(argin[0])[1];
  mwSize k    = mxGetDimensions(argin[2])[0];
  mwSize channels = mxGetNumberOfDimensions(argin[0]) <= 2 ? 1 : mxGetDimensions(argin[0])[2];
  mwSize dims[] = {rows, cols, channels};
  argout[0] = mxCreateNumericArray(3, dims, mxDOUBLE_CLASS, mxREAL);

  double *test      = (double *)mxGetData(argin[0]);
  double *reference = (double *)mxGetData(argin[1]);
  double *gaussian  = (double *)mxGetData(argin[2]);
  double *ssim      = (double *)mxGetData(argout[0]);

  double *test_cuda;
  double *reference_cuda;
  double *gaussian_cuda;
  double *ssim_cuda;
  HANDLE_ERROR( cudaMalloc((void **) &test_cuda,      rows*cols*sizeof(double)) );
  HANDLE_ERROR( cudaMalloc((void **) &reference_cuda, rows*cols*sizeof(double)) );
  HANDLE_ERROR( cudaMalloc((void **) &gaussian_cuda,  k*k*sizeof(double)) );
  HANDLE_ERROR( cudaMalloc((void **) &ssim_cuda,      rows*cols*sizeof(double)) );
  HANDLE_ERROR( cudaMemcpy(gaussian_cuda,  gaussian,  k*k*sizeof(double), cudaMemcpyHostToDevice) );

  int block_width = TILE_WIDTH - k + 1;
  int tile_batches_needed = upperdiv(TILE_WIDTH*TILE_WIDTH, block_width*block_width);

  for(int c=0; c<channels; ++c)
  {
    HANDLE_ERROR( cudaMemcpy(test_cuda,      test      + rows*cols*c, rows*cols*sizeof(double), cudaMemcpyHostToDevice) );
    HANDLE_ERROR( cudaMemcpy(reference_cuda, reference + rows*cols*c, rows*cols*sizeof(double), cudaMemcpyHostToDevice) );
    dim3 dimGrid(upperdiv(cols, block_width), upperdiv(rows, block_width), 1);
    dim3 dimBlock(block_width, block_width, 1);

    ssimkernel<<<dimGrid, dimBlock>>>(test_cuda, reference_cuda, gaussian_cuda, ssim_cuda, k, rows, cols, tile_batches_needed);

    HANDLE_ERROR( cudaMemcpy(ssim + rows*cols*c, ssim_cuda, rows*cols*sizeof(double), cudaMemcpyDeviceToHost) );
  }
  cudaFree( test_cuda );
  cudaFree( reference_cuda );
  cudaFree( gaussian_cuda );
  cudaFree( ssim_cuda );
}

      

+3




3 answers


kernelkernel<<<grid, 1>>>

      

This is a serious problem; threads on NVIDIA GPUs run in warps of 32 threads. However, you only assigned a single thread to each block, which means that 31 of those 32 threads sit idle while one thread runs. And in general, for kernels where you have the flexibility, you usually want several warps per block, not just one.

You can get an immediate speedup by using N blocks of N threads each, instead of N^2 blocks of 1 thread.
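Concretely, that is the change already hinted at in the comments of the question's code: keep one block per row and give each block one thread per column. (Note, as the next paragraph points out, that N = 1300 exceeds the usual 1024 threads-per-block limit, so this exact launch will fail on most devices and N has to be split up.)

// inside kernelkernel: take the column from the thread index instead of the block index
int x = blockIdx.x;
int y = threadIdx.x;   // was: int y = blockIdx.y;

// launch N blocks of N threads instead of an N x N grid of 1-thread blocks
kernelkernel<<<N, N>>>(image_cuda, kernel_cuda, result_cuda);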



Actually, N may be too large here, since there is an upper limit on the number of threads per block. But you can choose a suitable M so that you use N/M threads per block and N*M blocks.

In practice, you will probably get the best results by picking some M (I would guess 256 is likely to be close to optimal) and launching with L = ceiling(N*N/M) blocks of M threads each. Then each thread reconstructs an index in [0, M*L) from its block and thread id, and the threads whose index lies in [0, N*N) go on to split that index into an x and a y coordinate and do the work.
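A minimal sketch of that scheme, with M = 256 as an assumed value and kernelkernel_flat as a hypothetical name; the body is the same convolution as in the question's kernel:

#define M 256   // assumed number of threads per block

__global__ void kernelkernel_flat(double *image, double *kernel, double *result)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;   // index in [0, M*L)
  if (idx >= N * N)        // threads beyond the end of the image do nothing
    return;
  int x = idx / N;         // split the flat index into coordinates
  int y = idx % N;
  // ... same convolution loop over i and j as in the question's kernel ...
}

// L = ceiling(N*N / M) blocks of M threads each
int L = (N * N + M - 1) / M;
kernelkernel_flat<<<L, M>>>(image_cuda, kernel_cuda, result_cuda);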

+8




Accessing global memory in the kernel is expensive due to its latency. A global memory request (both read and write) requires hundreds of clock cycles. You want to minimize the number of global memory accesses and access them in contiguous blocks.
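For instance (a sketch with hypothetical kernels; it assumes the launch exactly covers the array), consecutive threads of a warp should read consecutive addresses so the hardware can coalesce the reads into a few transactions:

// coalesced: consecutive threads in a warp read consecutive doubles,
// so the warp's 32 reads are served by a few wide memory transactions
__global__ void read_coalesced(const double *data, double *out)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  out[i] = data[i];
}

// strided: consecutive threads read addresses far apart,
// so each thread needs its own memory transaction
__global__ void read_strided(const double *data, double *out, int stride)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  out[i] = data[i * stride];
}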

If each piece of data is accessed only once, there is not much you can do about the latency, but that is rarely the case. It is definitely not the case in your code, where the kernel array is accessed by all threads in the same pattern, and a lot of image is also accessed by multiple threads.

The solution is to start the kernel by fetching data from high-latency global memory into low-latency shared memory. Shared memory is a block of memory on the multiprocessor, and its latency is comparable to that of registers. Most simple kernels therefore follow this structure (a minimal sketch follows the list):

  • Each thread fetches data from global memory into shared memory. You want to fetch the data in contiguous sequences if possible, since global memory is accessed through transactions. If there is not enough data for all threads to fetch, leave some of them idle.

  • The threads work on the data in shared memory.

  • Data is written from shared memory back to global memory in the same pattern as in step 1.
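As a minimal sketch of these three steps, here is a simple 1D moving average with hypothetical names; RADIUS, BLOCK and stencil1d are assumptions, not anything from the code above:

#define RADIUS 7     // assumed stencil radius
#define BLOCK  256   // assumed threads per block

__global__ void stencil1d(const double *in, double *out, int n)
{
  __shared__ double tile[BLOCK + 2 * RADIUS];

  int gid = blockIdx.x * blockDim.x + threadIdx.x;   // element this thread produces

  // step 1: fetch this block's data plus its halo into shared memory (coalesced reads)
  for (int s = threadIdx.x; s < BLOCK + 2 * RADIUS; s += blockDim.x)
  {
    int g = blockIdx.x * blockDim.x + s - RADIUS;
    tile[s] = (g >= 0 and g < n) ? in[g] : 0.0;
  }
  __syncthreads();

  if (gid < n)
  {
    // step 2: work on the data in shared memory
    double sum = 0;
    for (int i = -RADIUS; i <= RADIUS; ++i)
      sum += tile[threadIdx.x + RADIUS + i];

    // step 3: write back to global memory in the same contiguous pattern as step 1
    out[gid] = sum / (2 * RADIUS + 1);
  }
}

// launch: stencil1d<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);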

Shared memory is shared by all threads within a thread block. This brings us to the second big problem in your code: you are not using thread blocks at all. Threads in the same block run on the same multiprocessor, share its shared memory, can synchronize with each other, and so on. You need to organize your threads into blocks well to make the most of this.



The grid of blocks is just a mechanism that lets more blocks run in a single call. All the benefits of parallel instruction execution and shared memory access exist within a block. The grid of blocks is just "yeah, sorry, my data is so big that a single block won't do; just run many of them".

You are doing exactly the opposite: each of your blocks has a single thread, which means that in each step only one thread from each warp runs on the multiprocessor (depending on your device's compute capability and the number of warp schedulers, this means something like 2 threads per multiprocessor at most).

You will need to restructure your threads to mirror the data access patterns, and to prefetch data into shared memory. This will give you the performance boost you expect.


The above is a short description. Refer to the CUDA Programming Guide for details on block organization, shared memory, and global memory transactions.

0




If you use global memory in CUDA, all the data accesses get synchronized in a queue, and you end up with an almost linear solution rather than a parallel one.
Also, transferring a large data set from RAM to GPU memory takes a long time (bus speed is limited).
So I think you need to somehow distribute your data across the compute units of your GPU (putting some of it into shared memory). Check out this one to see how to improve GPU memory usage in a case similar to yours.

-1








