Using CUDA registers

Question

Using CUDA registers

The CUDA manual specifies the number of 32-bit registers per processor. Does this mean that:

Does a double variable take two registers?
Does a pointer variable take two registers? - There must be more than one Fermi register with 6 GB of memory, right?

If the answer to question 2 is yes, then it is better to use fewer pointer variables and more indices int

.

E. g., This kernel code:

float* p1;               // two regs
float* p2 = p1 + 1000;   // two regs
int i;                   // one reg
for ( i = 0; i < n; i++ )
{
    CODE THAT USES p1[i] and p2[i]
}

it theoretically requires more registers than this kernel code:

float* p1;               // two regs
int i;                   // one reg
int j;                   // one reg
for ( i = 0, j = 1000; i < n; i++, j++ )
{
    CODE THAT USES p1[i] and p1[j]
}

+3

gpu cuda

user2052436 13 Feb 13 at 17:53

source to share

1 answer

talonmies · Accepted Answer · 2013-02-13T18:17:56+0000

A short answer to your three questions:

Yes.
Yes, as long as the code is compiled for the 64-bit host operating system. The size of the device pointer always matches the size of the host application pointer in CUDA.
Not.

To elaborate on point 3, consider the following two simple memory copy kernels:

__global__
void debunk(float *in, float *out, int n)
{
    int i = n * (threadIdx.x + blockIdx.x*blockDim.x);

    for(int j=0; j<n; j++) {
        out[i+j] = in[i+j];
    }
}

__global__
void debunk2(float *in, float *out, int n)
{
    int i = n * (threadIdx.x + blockIdx.x*blockDim.x);
    float *x = in + i;
    float *y = out + i;

    for(int j=0; j<n; j++, x++, y++) {
        *x = *y;
    }
}

In your opinion debunk

should use fewer registers because it only has two local integer variables whereas it debunk2

has two additional pointers. And yet, when I compile them using the CUDA 5 release toolchain:

$ nvcc -m64 -arch=sm_20 -c -Xptxas="-v"  pointer_size.cu 
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6debunkPfS_i' for 'sm_20'
ptxas info    : Function properties for _Z6debunkPfS_i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 8 registers, 52 bytes cmem[0]
ptxas info    : Compiling entry function '_Z7debunk2PfS_i' for 'sm_20'
ptxas info    : Function properties for _Z7debunk2PfS_i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 8 registers, 52 bytes cmem[0]

They compile with the same number of registers. And if you analyze the output of the tool chain, you can see that apart from the setup code, the streams of the final instructions are almost identical. There are a number of reasons for this, but it basically boils down to two simple rules:

Trying to figure out the number of registers from C code (or even PTX assembler) is mostly useless
Trying to guess at a very complex compiler and assembler is also mostly useless.

Using CUDA registers

More articles: