Using popcnt on GPU

I need to calculate (a & b).count() over a large set (> 10000) of bit vectors (std::bitset<N>), where N ranges from 2^10 to 2^16.

const size_t N = 2048;
std::vector<std::bitset<N>> bits(100000);
load_from_file(bits);
std::vector<std::vector<char>> distances(bits.size(), std::vector<char>(bits.size()));
for(size_t i = 0; i < bits.size(); i++){
    for(size_t j = 0; j < bits.size(); j++){
        distances[i][j] = (bits[i] & bits[j]).count();
    }
}


I currently rely on multithreading and SSE/AVX to compute the distances. Fortunately, I can use vpand from AVX to compute the &, but my code still uses a popcnt (%rax) loop to count the bits.
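For illustration, here is a minimal sketch of the kind of AND + popcnt inner loop described above (this is an assumed shape requiring AVX2 and the popcnt instruction, not the actual implementation):

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Hypothetical helper: popcount(a & b) over nwords 64-bit words,
// where nwords is a multiple of 4 (256 bits per iteration).
inline uint64_t and_popcnt(const uint64_t* a, const uint64_t* b, size_t nwords)
{
    uint64_t total = 0;
    for (size_t i = 0; i < nwords; i += 4) {
        // vpand: AND 256 bits at a time.
        __m256i v = _mm256_and_si256(
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i)),
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i)));
        // There is no vector popcount instruction here, so spill the four
        // 64-bit lanes and use scalar popcnt on each.
        alignas(32) uint64_t w[4];
        _mm256_store_si256(reinterpret_cast<__m256i*>(w), v);
        total += _mm_popcnt_u64(w[0]) + _mm_popcnt_u64(w[1])
               + _mm_popcnt_u64(w[2]) + _mm_popcnt_u64(w[3]);
    }
    return total;
}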

Is it possible to compute (a & b).count() on my GPU (nVidia 760m)? Ideally I would just pass two chunks of memory of N bits each. I looked at using Thrust, but I couldn't find a popcnt function.

EDIT: Current CPU implementation.

double validate_pooled(const size_t K) const{
    int right = 0;
    const size_t num_examples = labels.size();
    threadpool tp;
    std::vector<std::future<bool>> futs;
    for(size_t i = 0; i < num_examples; i++){
        futs.push_back(tp.enqueue(&kNN<N>::validate_N, this, i, K));
    }
    for(auto& fut : futs)
        if(fut.get()) right++;

    return right / (double) num_examples;
}

bool validate_N(const size_t cmp, const size_t n) const{
    const size_t num_examples = labels.size();
    std::vector<char> dists(num_examples, -1);
    for(size_t i = 0; i < num_examples; i++){
        if(i == cmp) continue;
        dists[i] = (bits[cmp] & bits[i]).count();
    }
    typedef std::unordered_map<std::string,size_t> counter;
    counter counts;
    for(size_t i = 0; i < n; i++){
        auto iter = std::max_element(dists.cbegin(), dists.cend());
        size_t idx = std::distance(dists.cbegin(), iter);
        dists[idx] = -1; // Remove the top result.
        counts[labels[idx]] += 1;
    }
    auto iter = std::max_element(counts.cbegin(), counts.cend(),
            [](const counter::value_type& a, const counter::value_type& b){ return a.second < b.second; });

    return labels[cmp] == iter->first;
}


EDIT: This is what I came up with. However, it's brutally slow. I'm not sure if there is anything wrong.

#include <bitset>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>

template<size_t N>
struct popl
{
    typedef unsigned long word_type;
    std::bitset<N> _cmp;

    popl(const std::bitset<N>& cmp) : _cmp(cmp) {}

    __device__
    int operator()(const std::bitset<N>& x) const
    {
        int pop_total = 0;
        #pragma unroll
        for(size_t i = 0; i < N/64; i++)
            pop_total += __popcll(x._M_w[i] & _cmp._M_w[i]); // _M_w is a libstdc++ internal

        return pop_total;
    }
};

int main(void) {
    const size_t N = 2048;

    thrust::host_vector<std::bitset<N> > h_vec;
    load_bits(h_vec);

    thrust::device_vector<std::bitset<N> > d_vec = h_vec;
    thrust::device_vector<int> r_vec(h_vec.size(), 0);
    for(size_t i = 0; i < h_vec.size(); i++){
        r_vec[i] = thrust::transform_reduce(d_vec.cbegin(), d_vec.cend(), popl<N>(d_vec[i]), 0, thrust::maximum<int>());
    }

    return 0;
}



2 answers


CUDA has population-count intrinsics for 32-bit and 64-bit types: __popc() and __popcll().

These could be used directly in a CUDA kernel, or via Thrust (in a functor) passed, for example, to thrust::transform_reduce.
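As a minimal sketch (an assumed helper, not code from this answer), __popcll can be used in a Thrust functor over raw 64-bit words rather than std::bitset, whose internals (_M_w) are a libstdc++ implementation detail. Note the reduction here is thrust::plus over the words of one pair, unlike the thrust::maximum over whole vectors in the question's attempt:

#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>

struct and_popc
{
    const unsigned long long* a; // device pointers to the two bit vectors,
    const unsigned long long* b; // stored as N/64 words each

    and_popc(const unsigned long long* a_, const unsigned long long* b_)
        : a(a_), b(b_) {}

    __device__ int operator()(int i) const
    {
        // AND word i of the two vectors, then count the set bits.
        return __popcll(a[i] & b[i]);
    }
};

// popcount(a & b) for two bit vectors already resident on the device.
int popcount_and(const thrust::device_vector<unsigned long long>& a,
                 const thrust::device_vector<unsigned long long>& b)
{
    return thrust::transform_reduce(
        thrust::device,
        thrust::counting_iterator<int>(0),
        thrust::counting_iterator<int>((int)a.size()),
        and_popc(thrust::raw_pointer_cast(a.data()),
                 thrust::raw_pointer_cast(b.data())),
        0, thrust::plus<int>());
}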

If this is the only function you want to run on the GPU, it may be difficult to get a net "win", because of the "cost" of transferring the data to and from the GPU. The total input data set is about 1 GB (100,000 vectors of 65,536 bits each, i.e. 100,000 × 65,536 / 8 ≈ 0.8 GB), but by my calculations the output data set is 10-40 GB (100,000 × 100,000 results × 1-4 bytes per result).

Either the CUDA kernel or the Thrust function, along with the data layout, should be crafted carefully so that execution is limited only by memory bandwidth. The cost of the data transfer could also be mitigated, perhaps to a large extent, by overlapping the copy and compute operations, mainly on the output data.

At first glance, this problem appears somewhat similar to the problem of computing Euclidean distances between sets of vectors, so that question/answer may be of interest from a CUDA perspective.

EDIT: Adding some code that I used to investigate this. I can get a significant speedup (~25x including data copy time) over a naive single-threaded CPU implementation, but I don't know how fast the CPU version would be using "multithreading and SSE/AVX", so it would be interesting to see more of your implementation or to get some performance metrics. I also don't think the CUDA code I have here is highly optimized; it's just a "first cut".



In this case, as a proof of concept, I focused on a small problem size: N = 2048, with 10,000 bitsets. For this small problem size, I can fit enough of the bit vectors in shared memory, for a "small" threadblock size, to take advantage of shared memory. So this particular approach would have to be modified for larger N.

$ cat t581.cu
#include <iostream>
#include <vector>
#include <bitset>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>

#define nTPB 128
#define OUT_CHUNK 250
#define N_bits 2048
#define N_vecs 10000
const size_t N = N_bits;

// Each thread owns one bit vector; the vectors are stored word-transposed
// (word i of vector v at in[i*numvecs+v]) so that global loads coalesce.
__global__ void comp_dist(unsigned *in, unsigned *out, unsigned numvecs, unsigned start_idx, unsigned end_idx){
  __shared__ unsigned sdata[(N/32)*nTPB];
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < numvecs)
    // Stage this thread's vector in shared memory.
    for (int i = 0; i < (N/32); i++)
      sdata[(i*nTPB)+threadIdx.x] = in[(i*numvecs)+idx];
  __syncthreads();
  int vidx = start_idx;
  if (idx < numvecs)
    // Compare against vectors [start_idx, end_idx) for this output chunk.
    while (vidx < end_idx) {
      unsigned sum = 0;
      for (int i = 0; i < N/32; i++)
        sum += __popc(sdata[(i*nTPB)+ threadIdx.x] & in[(i*numvecs)+vidx]);
      out[((vidx-start_idx)*numvecs)+idx] = sum;
      vidx++;}
}

void cpu_test(std::vector<std::bitset<N> > &in, std::vector<std::vector<unsigned> > &out){

  for (int i=0; i < in.size(); i++)
    for (int j=0; j< in.size(); j++)
      out[i][j] = (in[i] & in[j]).count();
}

int check_data(unsigned *d1, unsigned start_idx, std::vector<std::vector<unsigned> > &d2){
  for (int i = start_idx; i < start_idx+OUT_CHUNK; i++)
    for (int j = 0; j<N_vecs; j++)
      if (d1[((i-start_idx)*N_vecs)+j] != d2[i][j]) {std::cout << "mismatch at " << i << "," << j << " was: " << d1[((i-start_idx)*N_vecs)+j] << " should be: " << d2[i][j] << std::endl;  return 1;}
  return 0;
}

unsigned long long get_time_usec(){
  timeval tv;
  gettimeofday(&tv, 0);
  return (unsigned long long)(((unsigned long long)tv.tv_sec*1000000ULL)+(unsigned long long)tv.tv_usec);
}

int main(){

  unsigned long long t1, t2;
  std::vector<std::vector<unsigned> > distances;
  std::vector<std::bitset<N> > bits;

  for (int i = 0; i < N_vecs; i++){
    std::vector<unsigned> dist_row(N_vecs, 0);
    distances.push_back(dist_row);
    std::bitset<N> data;
    for (int j =0; j < N; j++) data[j] = rand() & 1;
    bits.push_back(data);}
  t1 = get_time_usec();
  cpu_test(bits, distances);
  t1 = get_time_usec() - t1;
  unsigned *h_data = new unsigned[(N/32)*N_vecs];
  memset(h_data, 0, (N/32)*N_vecs*sizeof(unsigned));
  for (int i = 0; i < N_vecs; i++)
    for (int j = 0; j < N; j++)
        if (bits[i][j]) h_data[(i)+((j/32)*N_vecs)] |= 1U<<(31-(j&31));

  unsigned *d_in, *d_out1, *d_out2, *h_out1, *h_out2;
  cudaMalloc(&d_in, (N/32)*N_vecs*sizeof(unsigned));
  cudaMalloc(&d_out1, N_vecs*OUT_CHUNK*sizeof(unsigned));
  cudaMalloc(&d_out2, N_vecs*OUT_CHUNK*sizeof(unsigned));
  cudaStream_t stream1, stream2;
  cudaStreamCreate(&stream1);
  cudaStreamCreate(&stream2);
  h_out1 = new unsigned[N_vecs*OUT_CHUNK];
  h_out2 = new unsigned[N_vecs*OUT_CHUNK];
  t2 = get_time_usec();
  cudaMemcpy(d_in, h_data, (N/32)*N_vecs*sizeof(unsigned), cudaMemcpyHostToDevice);
  for (int i = 0; i < N_vecs; i += 2*OUT_CHUNK){
    comp_dist<<<(N_vecs + nTPB - 1)/nTPB, nTPB, 0, stream1>>>(d_in, d_out1, N_vecs, i, i+OUT_CHUNK);
    cudaStreamSynchronize(stream2);
    if (i > 0) if (check_data(h_out2, i-OUT_CHUNK, distances)) return 1;
    comp_dist<<<(N_vecs + nTPB - 1)/nTPB, nTPB, 0, stream2>>>(d_in, d_out2, N_vecs, i+OUT_CHUNK, i+2*OUT_CHUNK);
    cudaMemcpyAsync(h_out1, d_out1, N_vecs*OUT_CHUNK*sizeof(unsigned), cudaMemcpyDeviceToHost, stream1);
    cudaMemcpyAsync(h_out2, d_out2, N_vecs*OUT_CHUNK*sizeof(unsigned), cudaMemcpyDeviceToHost, stream2);
    cudaStreamSynchronize(stream1);
    if (check_data(h_out1, i, distances)) return 1;
    }
  cudaDeviceSynchronize();
  t2 = get_time_usec() - t2;
  std::cout << "cpu time: " << ((float)t1)/(float)1000 << "ms gpu time: " << ((float) t2)/(float)1000 << "ms" << std::endl;
  return 0;
}
$ nvcc -O3 -arch=sm_20 -o t581 t581.cu
$ ./t581
cpu time: 20324.1ms gpu time: 753.76ms
$


CUDA 6.5, Fedora 20, Xeon X5560, Quadro5000 (cc2.0) GPU. The above test case includes verification of the results data between the distances produced on the CPU and on the GPU. I've also broken this down into a chunked algorithm, with the results data transfer (and verification) overlapped with the compute operations, to make it more easily extendable to cases where there is a very large amount of output (e.g. 100,000 bitsets). I haven't actually run this through the profiler yet, however.

EDIT 2: Here's the "Windows version" of the code:

#include <iostream>
#include <vector>
#include <bitset>
#include <stdlib.h>
#include <time.h>


#define nTPB 128
#define OUT_CHUNK 250
#define N_bits 2048
#define N_vecs 10000
const size_t N = N_bits;

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)



__global__ void comp_dist(unsigned *in, unsigned *out, unsigned numvecs, unsigned start_idx, unsigned end_idx){
  __shared__ unsigned sdata[(N/32)*nTPB];
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < numvecs)
    for (int i = 0; i < (N/32); i++)
      sdata[(i*nTPB)+threadIdx.x] = in[(i*numvecs)+idx];
  __syncthreads();
  int vidx = start_idx;
  if (idx < numvecs)
    while (vidx < end_idx) {
      unsigned sum = 0;
      for (int i = 0; i < N/32; i++)
        sum += __popc(sdata[(i*nTPB)+ threadIdx.x] & in[(i*numvecs)+vidx]);
      out[((vidx-start_idx)*numvecs)+idx] = sum;
      vidx++;}
}

void cpu_test(std::vector<std::bitset<N> > &in, std::vector<std::vector<unsigned> > &out){

  for (unsigned i=0; i < in.size(); i++)
    for (unsigned j=0; j< in.size(); j++)
      out[i][j] = (in[i] & in[j]).count();
}

int check_data(unsigned *d1, unsigned start_idx, std::vector<std::vector<unsigned> > &d2){
  for (unsigned i = start_idx; i < start_idx+OUT_CHUNK; i++)
    for (unsigned j = 0; j<N_vecs; j++)
      if (d1[((i-start_idx)*N_vecs)+j] != d2[i][j]) {std::cout << "mismatch at " << i << "," << j << " was: " << d1[((i-start_idx)*N_vecs)+j] << " should be: " << d2[i][j] << std::endl;  return 1;}
  return 0;
}

unsigned long long get_time_usec(){

  return (unsigned long long)((clock()/(float)CLOCKS_PER_SEC)*(1000000ULL));
}

int main(){

  unsigned long long t1, t2;
  std::vector<std::vector<unsigned> > distances;
  std::vector<std::bitset<N> > bits;

  for (int i = 0; i < N_vecs; i++){
    std::vector<unsigned> dist_row(N_vecs, 0);
    distances.push_back(dist_row);
    std::bitset<N> data;
    for (int j =0; j < N; j++) data[j] = rand() & 1;
    bits.push_back(data);}
  t1 = get_time_usec();
  cpu_test(bits, distances);
  t1 = get_time_usec() - t1;
  unsigned *h_data = new unsigned[(N/32)*N_vecs];
  memset(h_data, 0, (N/32)*N_vecs*sizeof(unsigned));
  for (int i = 0; i < N_vecs; i++)
    for (int j = 0; j < N; j++)
        if (bits[i][j]) h_data[(i)+((j/32)*N_vecs)] |= 1U<<(31-(j&31));

  unsigned *d_in, *d_out1, *d_out2, *h_out1, *h_out2;
  cudaMalloc(&d_in, (N/32)*N_vecs*sizeof(unsigned));
  cudaMalloc(&d_out1, N_vecs*OUT_CHUNK*sizeof(unsigned));
  cudaMalloc(&d_out2, N_vecs*OUT_CHUNK*sizeof(unsigned));
  cudaCheckErrors("cudaMalloc fail");
  cudaStream_t stream1, stream2;
  cudaStreamCreate(&stream1);
  cudaStreamCreate(&stream2);
   cudaCheckErrors("cudaStrem fail");
  h_out1 = new unsigned[N_vecs*OUT_CHUNK];
  h_out2 = new unsigned[N_vecs*OUT_CHUNK];
  t2 = get_time_usec();
  cudaMemcpy(d_in, h_data, (N/32)*N_vecs*sizeof(unsigned), cudaMemcpyHostToDevice);
   cudaCheckErrors("cudaMemcpy fail");
  for (int i = 0; i < N_vecs; i += 2*OUT_CHUNK){
    comp_dist<<<(N_vecs + nTPB - 1)/nTPB, nTPB, 0, stream1>>>(d_in, d_out1, N_vecs, i, i+OUT_CHUNK);
    cudaCheckErrors("cuda kernel loop 1 fail");
    cudaStreamSynchronize(stream2);
    if (i > 0) if (check_data(h_out2, i-OUT_CHUNK, distances)) return 1;
    comp_dist<<<(N_vecs + nTPB - 1)/nTPB, nTPB, 0, stream2>>>(d_in, d_out2, N_vecs, i+OUT_CHUNK, i+2*OUT_CHUNK);
    cudaCheckErrors("cuda kernel loop 2 fail");
    cudaMemcpyAsync(h_out1, d_out1, N_vecs*OUT_CHUNK*sizeof(unsigned), cudaMemcpyDeviceToHost, stream1);
    cudaMemcpyAsync(h_out2, d_out2, N_vecs*OUT_CHUNK*sizeof(unsigned), cudaMemcpyDeviceToHost, stream2);
    cudaCheckErrors("cuda kernel loop 3 fail");
    cudaStreamSynchronize(stream1);
    if (check_data(h_out1, i, distances)) return 1;
    }
  cudaDeviceSynchronize();
  cudaCheckErrors("cuda kernel loop 4 fail");
  t2 = get_time_usec() - t2;
  std::cout << "cpu time: " << ((float)t1)/(float)1000 << "ms gpu time: " << ((float) t2)/(float)1000 << "ms" << std::endl;
  return 0;
}


I have added CUDA error checking to this code. Be sure to build a release project in Visual Studio, not debug. When I run this on a Windows 7 laptop with a Quadro 1000M GPU, I get about 35 seconds for the CPU execution and about 1.5 seconds for the GPU.


OpenCL 1.2 has popcount, which seems to do what you want. It can operate on a vector, up to ulong16, i.e. 1024 bits at a time. Note, however, that NVIDIA drivers only support OpenCL 1.1, which does not include this function.



Of course, you could just write a function or table lookup to compute it fairly quickly, so an OpenCL 1.1 implementation is also feasible and would most likely still run at the device's memory bandwidth.
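For example, the classic bit-twiddling popcount (shown here as plain C++; the same body works as OpenCL 1.1 kernel code, where the OpenCL 1.2 built-in popcount() is unavailable):

// Software popcount for a 32-bit word, no popcnt instruction required.
inline unsigned popcount32(unsigned x)
{
    x = x - ((x >> 1) & 0x55555555u);                 // sum bit pairs
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); // sum nibbles
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 // sum bytes
    return (x * 0x01010101u) >> 24;                   // add the 4 byte sums
}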
