CUDA error: Too much shared data (0x4018 bytes, 0x4000 max): Where do the extra 0x18 bytes come from?

I am trying to implement this CUDA example: http://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc/ Since I have 0x4000 bytes, I tried to use TILE_DIM = 128

so

__shared__ unsigned char tile[TILE_DIM][TILE_DIM];

will be 0x4000 bytes = 16384 bytes = 128 * 128 bytes.

However, this gives me the following error:

CUDACOMPILE : ptxas error : Entry function '_Z18transposeCoalescedPh' uses too much shared data (0x4018 bytes, 0x4000 max)

So I have 0x18 (24) extra bytes in shared memory. Where do they come from and can they be removed?

I can compile for compute capability 2.0 or above to make the error go away (my hardware is compute capability 3.0), but this will use memory from the L1 cache, which is presumably slower.



1 answer


So I have 0x18 (24) extra bytes in shared memory. Where do they come from and can they be removed?

Referring to the programming guide:

The total amount of shared memory required for a block is equal to the sum of the amount of statically allocated shared memory, the amount of dynamically allocated shared memory, and, for devices of compute capability 1.x, the amount of shared memory used to pass the kernel's arguments (see __noinline__ and __forceinline__).



As long as you compile for the cc1.x architecture, you cannot eliminate the use of shared memory to carry kernel parameters.
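The accounting behind the error message can be sketched in host-side C++ (which also compiles as CUDA host code). The split of the 24 bytes is an assumption, not something the quoted passage states: on cc 1.x, 16 bytes of shared memory are commonly reported as reserved for built-in variables such as gridDim and blockDim, and the kernel's single unsigned char* argument (the "Ph" in the mangled name) takes 8 bytes on a 64-bit build. All names below are hypothetical.

```cpp
#include <cstddef>

// Tile size from the question: 128 x 128 elements of unsigned char.
constexpr std::size_t TILE_DIM = 128;
constexpr std::size_t kStaticTileBytes =
    TILE_DIM * TILE_DIM * sizeof(unsigned char);   // 16384 = 0x4000

// Assumption: bytes cc 1.x reserves in shared memory for built-in variables.
constexpr std::size_t kCc1xReservedBytes = 16;

// Per-block shared memory limit reported by ptxas in the question.
constexpr std::size_t kCc1xLimitBytes = 0x4000;

// Shared memory a cc 1.x kernel needs: static data, the reserved area,
// and the kernel arguments, which cc 1.x passes through shared memory.
constexpr std::size_t cc1xSharedBytes(std::size_t staticBytes,
                                      std::size_t argBytes) {
    return staticBytes + kCc1xReservedBytes + argBytes;
}

// One 8-byte pointer argument (64-bit build assumed).
static_assert(cc1xSharedBytes(kStaticTileBytes, 8) == 0x4018,
              "matches the size ptxas reports");
static_assert(cc1xSharedBytes(kStaticTileBytes, 8) > kCc1xLimitBytes,
              "so the full 0x4000-byte tile cannot fit on cc 1.x");
```

Under these assumptions, the tile alone already consumes the whole 0x4000-byte budget, so any argument or reserved space pushes the kernel over the limit.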

I think the solution, as you already pointed out, is to compile for cc2.0 or cc3.0 architecture. It is not clear why you would not want to do this.
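A rough host-side check of why retargeting helps. It assumes the per-block shared memory limits from the compute capability tables (16 KB on cc 1.x, 48 KB on cc 2.x and 3.x) and that cc 2.0 and later pass kernel arguments through constant memory instead of shared memory. The function names and the overhead parameter are hypothetical, and only major versions 1 to 3 are modeled.

```cpp
#include <cstddef>

// Per-block shared memory limit in bytes for major compute capability 1, 2 or 3.
constexpr std::size_t sharedLimitBytes(int ccMajor) {
    return ccMajor == 1 ? 16 * 1024 : 48 * 1024;
}

// True if a kernel with `staticBytes` of __shared__ data fits.
// `cc1xOverheadBytes` (arguments plus reserved space) only counts on cc 1.x.
constexpr bool fitsInSharedMemory(int ccMajor,
                                  std::size_t staticBytes,
                                  std::size_t cc1xOverheadBytes) {
    return staticBytes + (ccMajor == 1 ? cc1xOverheadBytes : 0)
           <= sharedLimitBytes(ccMajor);
}

// The 128 x 128 unsigned char tile with the 0x18-byte overhead from the error.
static_assert(!fitsInSharedMemory(1, 128 * 128, 0x18), "too big for cc 1.x");
static_assert(fitsInSharedMemory(2, 128 * 128, 0x18), "fits on cc 2.x");
static_assert(fitsInSharedMemory(3, 128 * 128, 0x18), "fits on cc 3.x");
```

On cc 2.x and 3.x, shared memory and L1 are carved out of the same on-chip storage, so the tile still lives in on-chip shared memory after retargeting.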
