Different results for CUDA addition on the host and on the GPU

I have a function that takes a color image and returns a gray version of it. If I run serial code on the host everything works fine. If I run it on a device, the result is slightly different (one pixel in 1000 is +1 or -1 compared to the correct value).

I think it has something to do with conversions, but I don't know for sure. This is the code I'm using:

    __global__ void rgb2gray_d(unsigned char *deviceImage, unsigned char *deviceResult, const int height, const int width) {
        /* calculate the global thread id */
        int threadsPerBlock  = blockDim.x * blockDim.y;
        int threadNumInBlock = threadIdx.x + blockDim.x * threadIdx.y;
        int blockNumInGrid   = blockIdx.x  + gridDim.x  * blockIdx.y;

        int globalThreadNum = blockNumInGrid * threadsPerBlock + threadNumInBlock;
        int i = globalThreadNum;

        float grayPix = 0.0f;
        float r = static_cast< float >(deviceImage[i]);
        float g = static_cast< float >(deviceImage[(width * height) + i]);
        float b = static_cast< float >(deviceImage[(2 * width * height) + i]);
        grayPix = (0.3f * r) + (0.59f * g) + (0.11f * b);

        deviceResult[i] = static_cast< unsigned char >(grayPix);
    }

    void rgb2gray(unsigned char *inputImage, unsigned char *grayImage, const int width, const int height, NSTimer &timer) {

        unsigned char *deviceImage;
        unsigned char *deviceResult;

        int initialBytes = width * height * 3;
        int endBytes = width * height * sizeof(unsigned char);

        unsigned char grayImageSeq[endBytes];

        cudaMalloc((void**) &deviceImage, initialBytes);
        cudaMalloc((void**) &deviceResult, endBytes);
        cudaMemset(deviceResult, 0, endBytes);
        cudaMemset(deviceImage, 0, initialBytes);

        cudaError_t err = cudaMemcpy(deviceImage, inputImage, initialBytes, cudaMemcpyHostToDevice);

        // Convert the input image to grayscale
        rgb2gray_d<<<width * height / 256, 256>>>(deviceImage, deviceResult, height, width);
        cudaDeviceSynchronize();

        cudaMemcpy(grayImage, deviceResult, endBytes, cudaMemcpyDeviceToHost);

        ////// Sequential
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                float grayPix = 0.0f;
                float r = static_cast< float >(inputImage[(y * width) + x]);
                float g = static_cast< float >(inputImage[(width * height) + (y * width) + x]);
                float b = static_cast< float >(inputImage[(2 * width * height) + (y * width) + x]);

                grayPix = (0.3f * r) + (0.59f * g) + (0.11f * b);
                grayImageSeq[(y * width) + x] = static_cast< unsigned char >(grayPix);
            }
        }

        // compare sequential and CUDA results and print pixels that differ
        for (int i = 0; i < endBytes; i++) {
            if (grayImage[i] != grayImageSeq[i])
                cout << i << "-" << static_cast< unsigned int >(grayImage[i]) <<
                        " should be " << static_cast< unsigned int >(grayImageSeq[i]) << endl;
        }

        cudaFree(deviceImage);
        cudaFree(deviceResult);
    }

I should mention that I allocate width * height * 3 bytes for the input image because the original image is a CImg, which stores the channels as separate planes (all red values, then all green, then all blue).
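
For clarity, here is a minimal sketch of the planar layout that the indexing above assumes (plain C++; the helper names are my own, not from the original code):

    #include <cstddef>

    // Planar layout assumed by the kernel and the sequential loop:
    //   buffer[0            .. W*H - 1]   = red plane
    //   buffer[W*H          .. 2*W*H - 1] = green plane
    //   buffer[2*W*H        .. 3*W*H - 1] = blue plane
    inline size_t planarIndex(int x, int y, int channel, int width, int height) {
        return static_cast<size_t>(channel) * width * height + y * width + x;
    }

    // Example: read one pixel's channels from a planar buffer.
    inline void readPixel(const unsigned char *img, int x, int y, int width, int height,
                          unsigned char &r, unsigned char &g, unsigned char &b) {
        r = img[planarIndex(x, y, 0, width, height)];
        g = img[planarIndex(x, y, 1, width, height)];
        b = img[planarIndex(x, y, 2, width, height)];
    }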

I am working on a GeForce GTX 480.

2 answers


Finally I found the answer. The CUDA compiler automatically contracts multiplications and additions into fused multiply-add (FMA) instructions, in both single and double precision. Using the document linked below, section 4.4, I was able to fix this. Instead of doing

    grayPix = (0.3f * r) + (0.59f * g) + (0.11f * b);

Now I do

    grayPix = __fadd_rn(__fadd_rn(__fmul_rn(0.3f, r), __fmul_rn(0.59f, g)), __fmul_rn(0.11f, b));

This prevents the compiler from contracting the separate multiplications and additions into fused multiply-add (FMA) instructions.

Floating Point and IEEE 754 Compliance for NVIDIA GPUs
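
As an illustrative host-side check (my own sketch, not part of the original answer), the program below compares the separately rounded expression against std::fmaf, which rounds a*b + c in a single step the way an FMA instruction does. The exact contraction the CUDA compiler picks may differ, so this only demonstrates that fusing can change the truncated byte. Compiling with nvcc --fmad=false is another way to disable the contraction for the whole kernel instead of using the __fmul_rn/__fadd_rn intrinsics.

    #include <cmath>
    #include <cstdio>

    int main() {
        int diffs = 0;
        // Walk every possible (r, g, b) byte combination.
        for (int r = 0; r < 256; ++r)
            for (int g = 0; g < 256; ++g)
                for (int b = 0; b < 256; ++b) {
                    // Host-style evaluation: each multiply and add is rounded separately.
                    float separate = (0.3f * r) + (0.59f * g) + (0.11f * b);
                    // FMA-style evaluation: std::fmaf rounds a*b + c only once,
                    // similar to the GPU's fused multiply-add instruction.
                    float fused = std::fmaf(0.11f, (float)b,
                                            std::fmaf(0.59f, (float)g, 0.3f * (float)r));
                    if ((unsigned char)separate != (unsigned char)fused)
                        ++diffs;
                }
        std::printf("(r, g, b) combinations where the truncated gray byte differs: %d\n", diffs);
        return 0;
    }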

Floating point math can produce slightly different results in device code versus host code.

There are many possibilities as to why this is so. You should bear in mind that these two functions are compiled by two different compilers into two different binaries running on two different floating point implementations.

For example, if floating point calculations are performed in a different order, round-off errors can lead to different results.



In addition, when performing 32-bit (float) or 64-bit (double) floating point calculations on an x86 CPU, the math may be carried out by the x87 FPU, which internally uses 80-bit precision; the result is then truncated to 32 bits for the float data type or 64 bits for the double data type.

The GPU's ALUs, by contrast, use 32-bit precision for the floating point math (assuming you are using the float data type).

An excellent article on the topic of floating point and arithmetic representations can be found here.
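
To illustrate how the precision of the intermediate arithmetic alone can move the truncated byte, here is a small sketch of my own (only a stand-in: it uses double where an x87 FPU would carry 80-bit extended precision) that evaluates the same weighted sum with narrow and wide intermediates and counts where the casts disagree.

    #include <cstdio>

    int main() {
        int diffs = 0;
        for (int r = 0; r < 256; ++r)
            for (int g = 0; g < 256; ++g)
                for (int b = 0; b < 256; ++b) {
                    // Single-precision evaluation, as the GPU does with float operands.
                    float f = (0.3f * r) + (0.59f * g) + (0.11f * b);
                    // Wider intermediate precision (double here), standing in for an
                    // FPU that carries extra bits before the final truncation.
                    double d = (0.3f * (double)r) + (0.59f * (double)g) + (0.11f * (double)b);
                    if ((unsigned char)f != (unsigned char)d)
                        ++diffs;
                }
        std::printf("inputs where the truncated gray byte differs: %d\n", diffs);
        return 0;
    }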
