GPU and CPU concurrency: producer-consumer with a bounded buffer

Consider the following problem:

You have a computing environment with a single GPU and a single CPU. On the GPU, a program performs calculations on an array of 1e6 floating point values. This computation step is repeated n times (process 1). After each computation step, the array is transferred from device memory to host memory. Once the transfer completes, the data is analyzed by a sequential algorithm on the CPU (process 2).
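To make the pipeline concrete, this is roughly what the serial version looks like (run_kernel, analyze_on_cpu and n_steps are placeholder names for my actual code):

#include <cuda_runtime.h>

void run_kernel(float *d_data, int n);        // GPU computation step (process 1)
void analyze_on_cpu(float *h_data, int n);    // sequential CPU analysis (process 2)

void serial_pipeline(float *d_data, float *h_data, int n, int n_steps) {
    for (int i = 0; i < n_steps; ++i) {
        run_kernel(d_data, n);                            // compute on the GPU
        cudaMemcpy(h_data, d_data, n * sizeof(float),
                   cudaMemcpyDeviceToHost);               // blocking copy to the host
        analyze_on_cpu(h_data, n);                        // GPU sits idle during this call
    }
}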

At the moment the program runs serially: process 1 waits for process 2 to finish and vice versa. I would like to know how to overlap processes 1 and 2 so that the overall execution time of the program is shortened.

I know that CUDA kernel launches are asynchronous, and I know that there are asynchronous copy operations from pinned host memory. However, in this case I need to wait for the GPU to finish before the CPU can start working on that output. How can this information be conveyed?
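This is the part I understand so far, sketched with placeholder names (my_kernel, d_data, h_pinned); what I do not see is how to wait for the copy without stalling the thread that should already be launching the next GPU step:

#include <cuda_runtime.h>

__global__ void my_kernel(float *data, int n);   // placeholder kernel

void one_step(float *d_data, float *h_pinned, int n, cudaStream_t stream) {
    my_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);   // returns immediately
    cudaMemcpyAsync(h_pinned, d_data, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);             // needs pinned host memory
    cudaStreamSynchronize(stream);   // blocks this CPU thread until kernel and copy are done
    // only now is h_pinned safe to hand to the CPU analysis
}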

I tried adapting multithreaded CPU producer/consumer code, but it didn't work. What I ended up with are two serialized CPU threads, one managing the GPU workload and one the CPU workload. Here is the buffer they share; the GPU still ends up waiting for the CPU to finish before it can continue ...

#include <mutex>
#include <condition_variable>

#include "ProducerConsumerBuffer.hpp"

// Single-slot producer/consumer buffer: the GPU-managing thread fills c_bridge,
// the CPU-analysis thread swaps it into c_CPU for processing.
ProducerConsumerBuffer::ProducerConsumerBuffer(int capacity_in, int n)
    : capacity(capacity_in), count(0) {
    c_bridge = new float[n];
    c_CPU = new float[n];
}

ProducerConsumerBuffer::~ProducerConsumerBuffer() {
    delete[] c_bridge;
    delete[] c_CPU;
}

// Producer side, called by the thread that manages the GPU.
void ProducerConsumerBuffer::upload(device_pointers *d, params &p, streams *s) {
    std::unique_lock<std::mutex> l(lock);

    // Wait until the single slot is free (the consumer has fetched the last result).
    not_full.wait(l, [this]() { return count != 1; });

    // Device-to-host copy into c_bridge; must have completed before count is bumped.
    copy_GPU_to_CPU(d, c_bridge, p, s);
    count++;

    not_empty.notify_one();
}

// Consumer side, called by the thread that runs the sequential CPU analysis.
void ProducerConsumerBuffer::fetch() {
    std::unique_lock<std::mutex> l(lock);

    // Wait until the slot holds a fresh result.
    not_empty.wait(l, [this]() { return count != 0; });

    // Hand the new data to the CPU side and free the slot for the next upload.
    std::swap(c_bridge, c_CPU);
    count--;

    not_full.notify_one();
}
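For context, this is roughly how the two threads drive the buffer (gpu_step, cpu_analysis and n_steps are placeholders for my actual code):

#include <thread>

void gpu_step(device_pointers *d, params &p, streams *s);   // one GPU computation step
void cpu_analysis(ProducerConsumerBuffer &buf);             // serial analysis of c_CPU

void run(ProducerConsumerBuffer &buf, device_pointers *d, params &p, streams *s, int n_steps) {
    std::thread gpu_thread([&]() {
        for (int i = 0; i < n_steps; ++i) {
            gpu_step(d, p, s);      // launch the kernel for step i and wait for it
            buf.upload(d, p, s);    // copy the result into c_bridge; blocks if the CPU is behind
        }
    });

    std::thread cpu_thread([&]() {
        for (int i = 0; i < n_steps; ++i) {
            buf.fetch();            // blocks until a result is available, swaps it into c_CPU
            cpu_analysis(buf);      // sequential CPU analysis of the fetched data
        }
    });

    gpu_thread.join();
    cpu_thread.join();
}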

      

I was hoping CUDA streams would be a way to do this, but I think they only order work on the device side. Do I need to use MPI, or is there another way to synchronize work on a heterogeneous computing platform? I have read that OpenCL supports this, since all computing devices are organized into one "context". Isn't something similar possible with CUDA?

In case my serial CPU operation takes 4 times longer than the GPU operation, I was planning to create 4 consumer threads.

Any insight would be greatly appreciated!

EDIT: The CPU function contains serial code that is not parallelizable.


1 answer


It is not possible to do what you want without using multiple threads or processes, or without significantly complicating your CPU algorithm to achieve acceptable scheduling latency. You need to be able to issue commands to the GPU with low latency and at the right rate to keep it supplied with work, but the CPU workload does not sound negligible and has to run somewhere while that command loop is executing.



Because of this, to keep both the CPU and the GPU busy and to achieve the highest throughput and lowest latency, you should split the GPU command part and the expensive CPU computation part into separate threads, with shared memory as the preferred form of IPC between the two. You may be able to simplify things if the dedicated CPU processing thread is driven in a CUDA-like style, with its own event objects analogous to cudaEvent_t, so that the GPU command thread controls the CPU thread as well: one command thread and two subordinate workers (GPU and CPU).
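As a rough sketch of that structure, a command thread drives the GPU and hands each finished buffer to a dedicated CPU worker through a small shared-memory queue (gpu_kernel, cpu_consume and WorkQueue are illustrative names, not a complete implementation):

#include <cuda_runtime.h>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

__global__ void gpu_kernel(float *data, int n);      // your GPU computation step
void cpu_consume(const std::vector<float> &data);    // your serial CPU analysis

// Small shared-memory queue used as IPC between the command thread and the CPU worker.
struct WorkQueue {
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::vector<float>> items;
    bool done = false;

    void push(std::vector<float> v) {
        { std::lock_guard<std::mutex> l(m); items.push(std::move(v)); }
        cv.notify_one();
    }
    bool pop(std::vector<float> &out) {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [this] { return !items.empty() || done; });
        if (items.empty()) return false;   // finished and drained
        out = std::move(items.front());
        items.pop();
        return true;
    }
    void finish() {
        { std::lock_guard<std::mutex> l(m); done = true; }
        cv.notify_all();
    }
};

// Command thread: launches each GPU step, waits briefly for its copy, queues the result.
void command_thread(float *d_data, float *h_pinned, int n, int n_steps, WorkQueue &q) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    for (int i = 0; i < n_steps; ++i) {
        gpu_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
        cudaMemcpyAsync(h_pinned, d_data, n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);                           // wait for this step's result
        q.push(std::vector<float>(h_pinned, h_pinned + n));      // hand the data to the worker
        // The next iteration launches the next kernel immediately; the CPU worker
        // analyzes the previous result in parallel.
    }
    q.finish();
    cudaStreamDestroy(stream);
}

// Dedicated CPU worker: runs the expensive serial analysis, overlapped with GPU work.
void cpu_worker(WorkQueue &q) {
    std::vector<float> item;
    while (q.pop(item))
        cpu_consume(item);
}

With this layout, if the serial CPU analysis really takes about four times as long as one GPU step, you can start four cpu_worker threads on the same queue, provided each step's analysis is independent of the others.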
