Caching some, but not all, (CUDA) memory accesses

I just noticed that a (CUDA) kernel can perform uncached memory accesses (see for example this answer here on SO).

Can this be done ...

  • For a specific kernel individually?
  • At runtime rather than at compile time?
  • For writes only, rather than for reads/writes?


2 answers


  • Only if you compile that kernel separately, because this is an instruction-level feature that is enabled during code generation. You can also use the inline PTX assembler to emit ld.global.cg instructions for a specific load operation within a kernel (see details here).
  • No, this is a feature at the PTX instruction level. You could JIT-compile a version of the code containing non-cached memory loads at runtime, but that is still technically compilation. You could probably use some template tricks and separate compilation to have the runtime hold two versions of the same code, built with and without caching, and choose between those versions at runtime. You could also use the same tricks to get two versions of a given kernel, with or without the inline PTX for the non-cached loads (see here for one possible way to achieve this).
  • These non-caching instructions bypass the L1 cache, going to the L2 cache. They apply to loads only (all writes invalidate the L1 cache and store through to L2).
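The inline-PTX approach from the first bullet can be sketched as follows. This is an illustrative example (the kernel and its parameters are assumptions, not from the answer); it shows how ld.global.cg can bypass L1 for one specific load while other accesses in the same kernel stay on the normal caching path.

```cuda
// Hypothetical kernel: one load bypasses L1 via ld.global.cg,
// the other loads/stores use the default caching behavior.
__global__ void scale_kernel(const float* __restrict__ in,
                             float* __restrict__ out,
                             const float* factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v;
    // ld.global.cg: cache at L2 only, bypassing L1.
    // "l" passes the 64-bit address, "=f" receives the float.
    asm("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(in + i));

    // This read and the store below go through the normal cache path.
    out[i] = v * factor[0];
}
```

Compiling with nvcc and inspecting the generated SASS/PTX should show the .cg qualifier only on that one load.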


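The templating trick mentioned in the second bullet might look like the following sketch. All names here are illustrative: two instantiations of the same kernel template are compiled, one with a cached load and one with an ld.global.cg load, and the host picks between them at launch time.

```cuda
// Compile-time switch between a cached and a non-cached load path.
template <bool BYPASS_L1>
__global__ void copy_kernel(const float* __restrict__ in,
                            float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v;
    if (BYPASS_L1)
        // Non-cached variant: load bypasses L1 via inline PTX.
        asm("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(in + i));
    else
        // Cached variant: ordinary load through L1.
        v = in[i];
    out[i] = v;
}

// Host-side selection at runtime between the two compiled versions.
void launch_copy(const float* in, float* out, int n, bool bypassL1)
{
    int block = 256, grid = (n + block - 1) / block;
    if (bypassL1)
        copy_kernel<true><<<grid, block>>>(in, out, n);
    else
        copy_kernel<false><<<grid, block>>>(in, out, n);
}
```

Since BYPASS_L1 is a template parameter, each instantiation is dead-code-eliminated down to a single load path; the runtime branch exists only on the host.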


I don't know whether this was possible before, but CUDA 8.0 gives you the ability to fine-tune caching for specific reads/writes. See the PTX manual for details.

For example, to make this code always go all the way to main memory when reading:

const float4 val = input[i];

you can write the following:

float4 val;
const float4* myinput = input+i;
asm("ld.global.cv.v4.f32 {%0, %1, %2, %3}, [%4];" : "=f"(val.x), "=f"(val.y), "=f"(val.z), "=f"(val.w) : "l"(myinput));

I was able to speed up one of my cache-bound kernels by about 20% by using non-cached reads and writes for data that, by design, was only accessed once.
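A sketch of what such a streaming kernel might look like, assuming a copy-like access pattern (the kernel name and shape are assumptions, not the answerer's actual code). It combines the .cv load shown above with a .wt (write-through) store from the same PTX cache-operator family, so data touched exactly once neither pollutes nor depends on the caches:

```cuda
// Hypothetical streaming kernel: every element is read once and
// written once, using non-cached loads and write-through stores.
__global__ void stream_copy(const float4* __restrict__ in,
                            float4* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 v;
    // ld.global.cv: don't cache; fetch from memory again each time.
    asm("ld.global.cv.v4.f32 {%0, %1, %2, %3}, [%4];"
        : "=f"(v.x), "=f"(v.y), "=f"(v.z), "=f"(v.w)
        : "l"(in + i));
    // st.global.wt: write through the caches to memory.
    asm("st.global.wt.v4.f32 [%0], {%1, %2, %3, %4};"
        :: "l"(out + i), "f"(v.x), "f"(v.y), "f"(v.z), "f"(v.w)
        : "memory");
}
```

Whether this actually helps is workload-dependent; it pays off mainly when the data has no reuse at all, as in the single-access case described above.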
