Convert a parallel program from OpenMP to OpenCL

I'm wondering how to convert the following OpenMP program to an OpenCL program.

The parallel algorithm implemented using openMP looks like this:

#pragma omp parallel
{
    int thread_id = omp_get_thread_num();

    //double mt_probThreshold = mt_nProbThreshold_;
    double mt_probThreshold = nProbThreshold;

    int mt_nMaxCandidate = mt_nMaxCandidate_;
    double mt_nMinProb = mt_nMinProb_;

    int has_next = 1;
    std::list<ScrBox3d> mt_detected;
    ScrBox3d sample;
    while (has_next) {
#pragma omp critical
        {  // '{' is very important and defines the block of code that needs the lock.
            // Don't remove this pair of '{' and '}'.
            if (piter_ == box_.end()) {
                has_next = 0;
            } else {
                sample = *piter_;
                ++piter_;
            }
        }  // '}' is very important and defines the block of code that needs the lock.

        if (has_next) {
            this->SetSample(&sample, thread_id);
            //UpdateSample(sample, thread_id); // May be necessary for more sophisticated features
            sample._prob = (float)this->Prob(true, thread_id, mt_probThreshold);
            //sample._prob = (float)_clf->LogLikelihood(thread_id);
            InsertCandidate(mt_detected, sample, mt_probThreshold, mt_nMaxCandidate, mt_nMinProb);
        }
    }

#pragma omp critical
    {  // '{' is very important and defines the block of code that needs the lock.
        // Don't remove this pair of '{' and '}'.
        if (mt_detected_.size() == 0) {
            mt_detected_ = mt_detected;
            //mt_nProbThreshold_ = mt_probThreshold;
            nProbThreshold = mt_probThreshold;
        } else {
            for (std::list<ScrBox3d>::iterator it = mt_detected.begin();
                 it != mt_detected.end(); ++it)
                InsertCandidate(mt_detected_, *it, /*mt_nProbThreshold_*/nProbThreshold,
                                mt_nMaxCandidate_, mt_nMinProb_);
        }
    }  // '}' is very important and defines the block of code that needs the lock.
}  // parallel section end

      

My question is: can this section be implemented with OpenCL? I followed a series of OpenCL tutorials and got them working; I wrote that code in .cu files (I had previously installed the CUDA toolkit). This case is more complicated, though, because there are a lot of header files, template classes, and object-oriented code.

How can I convert this section from OpenMP to OpenCL? Should I create a new .cu file?

Any advice could help. Thanks in advance.

Edit:

Using the VS profiler, I noticed that most of the execution time is spent in the InsertCandidate() function, so I am thinking of writing a kernel to run this function on the GPU. The most expensive operation in this function is the for loop. But, as you can see, each iteration contains three if statements, which can cause branch divergence and therefore serialization even when executed on the GPU.

for( iter = detected.begin(); iter != detected.end(); iter++ )
    {
        if( nCandidate == nMaxCandidate-1 )
            nProbThreshold = iter->_prob;

        if( box._prob >= iter->_prob )
            break;
        if( nCandidate >= nMaxCandidate && box._prob <= nMinProb )
            break;
        nCandidate ++;
    }

      

In conclusion, can this program be converted to OpenCL?



1 answer


You might be able to convert your sample code to OpenCL, but I noticed a couple of problems with doing so.

  • Start with the parallel execution itself: every worker fetches its next sample inside a critical section, so more workers may not help at all.
  • Adding work to be processed at run time is a fairly recent feature in OpenCL. You will either have to use OpenCL 2.0, or know in advance how much work will be added and preallocate the memory for the new data structures. The calls to InsertCandidate may be the part that "cannot" be converted to OpenCL.
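
A minimal sketch of one way around both points, assuming the samples can be copied out of the list into a contiguous array before the parallel part starts. Each worker then addresses its sample by index instead of advancing a shared iterator under a lock, and every result has a preallocated slot. Box and placeholder_prob are hypothetical stand-ins for ScrBox3d and this->Prob(...), whose real definitions are not shown in the question; the sketch also assumes the scoring function needs no per-thread state.

```cpp
#include <list>
#include <vector>

// Hypothetical stand-in for ScrBox3d; only the _prob member appears in the question.
struct Box {
    float feature;  // assumed input to the placeholder scorer
    float _prob;    // result slot, one per sample
};

// Placeholder for this->Prob(...); the real function is not shown in the question.
inline float placeholder_prob(const Box& b) { return 0.5f * b.feature; }

// Copies the list into a contiguous array and scores every element by index.
// In an OpenCL kernel the index i would come from get_global_id(0).
std::vector<Box> score_all(const std::list<Box>& boxes) {
    std::vector<Box> flat(boxes.begin(), boxes.end());  // size known up front
    const long n = static_cast<long>(flat.size());
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (long i = 0; i < n; ++i) {
        // Each iteration writes only to its own slot, so no critical section is needed.
        flat[i]._prob = placeholder_prob(flat[i]);
    }
    return flat;
}
```

With the scores stored in a flat array, the candidate selection that InsertCandidate performs could then run on the host as a separate pass over the results.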


If the function is expensive enough, you can batch the calls to this->Prob(...). You should be able to queue a batch of calls by storing their parameters in a suitable data structure. By "batch" I mean at least hundreds of calls, but ideally thousands or more. Again, this is only worth it if this->Prob() behaves the same way across all calls and is complex enough to justify the round trip to and from the OpenCL device.
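
The batching idea could look roughly like the sketch below. It is written as plain C++ with no OpenCL calls: ProbArgs, eval_one and ProbBatch are hypothetical names, and the three float parameters are an assumption, since the question does not show what Prob actually reads. The loop inside flush() marks the place where a real port would copy the buffer to the device, launch one kernel over the whole batch, and read the results back.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical parameter record for one deferred Prob call.
struct ProbArgs {
    float x, y, z;
};

// Placeholder for the real scoring function, which the question does not show.
inline float eval_one(const ProbArgs& a) { return a.x + a.y + a.z; }

// Collects call parameters and evaluates them together.
class ProbBatch {
public:
    explicit ProbBatch(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
        pending_.reserve(capacity);
    }

    // Queues one call; returns true once the batch is full and should be flushed.
    bool push(const ProbArgs& args) {
        pending_.push_back(args);
        return pending_.size() >= capacity_;
    }

    // Evaluates every queued call in one pass and empties the queue.
    // Results are returned in the order the calls were queued.
    std::vector<float> flush() {
        std::vector<float> results;
        results.reserve(pending_.size());
        for (const ProbArgs& a : pending_) {
            results.push_back(eval_one(a));  // stand-in for one work-item
        }
        pending_.clear();
        return results;
    }

    std::size_t size() const { return pending_.size(); }

private:
    std::size_t capacity_;
    std::vector<ProbArgs> pending_;
};
```

A caller would push parameters until push() returns true, call flush(), and feed the returned scores to InsertCandidate on the host. The capacity is the tuning knob: it should be large enough that the transfer cost is amortized over many evaluations.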
