Caching some, but not all, (CUDA) memory accesses

I just noticed that a (CUDA) kernel can perform uncached memory accesses (see for example this answer here on SO).

Can this be done ...

  • For a specific kernel individually?
  • At runtime rather than at compile time?
  • For writes only, rather than for reads/writes?


2 answers


  • Only if you compile that kernel separately, because this is an instruction-level feature that is enabled during code generation. You can also use the inline PTX assembler to emit ld.global.cg instructions for a specific load operation within a kernel (see details here).
  • No, this is a feature at the PTX instruction level. You could JIT-compile a version of the code containing non-cached memory loads at runtime, but that is still technically compilation. You could probably use some template tricks and separate compilation to have the runtime hold two versions of the same code, built with and without caching, and choose between those versions at runtime. You could also use the same tricks to get two versions of a given kernel, with or without the inline PTX for the non-cached loads (see here for one possible way to achieve this).
  • These non-caching instructions bypass the L1 cache, going to the L2 cache. They apply to loads only (all writes invalidate the L1 cache and store through to L2).
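The inline-PTX approach from the first bullet can be sketched as follows. This is an illustrative example (the kernel and its parameters are assumptions, not from the answer); it shows how ld.global.cg can bypass L1 for one specific load while other accesses in the same kernel stay on the normal caching path.

```cuda
// Hypothetical kernel: one load bypasses L1 via ld.global.cg,
// the other loads/stores use the default caching behavior.
__global__ void scale_kernel(const float* __restrict__ in,
                             float* __restrict__ out,
                             const float* factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v;
    // ld.global.cg: cache at L2 only, bypassing L1.
    // "l" passes the 64-bit address, "=f" receives the float.
    asm("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(in + i));

    // This read and the store below go through the normal cache path.
    out[i] = v * factor[0];
}
```

Compiling with nvcc and inspecting the generated SASS/PTX should show the .cg qualifier only on that one load.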


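The templating trick mentioned in the second bullet might look like the following sketch. All names here are illustrative: two instantiations of the same kernel template are compiled, one with a cached load and one with an ld.global.cg load, and the host picks between them at launch time.

```cuda
// Compile-time switch between a cached and a non-cached load path.
template <bool BYPASS_L1>
__global__ void copy_kernel(const float* __restrict__ in,
                            float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v;
    if (BYPASS_L1)
        // Non-cached variant: load bypasses L1 via inline PTX.
        asm("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(in + i));
    else
        // Cached variant: ordinary load through L1.
        v = in[i];
    out[i] = v;
}

// Host-side selection at runtime between the two compiled versions.
void launch_copy(const float* in, float* out, int n, bool bypassL1)
{
    int block = 256, grid = (n + block - 1) / block;
    if (bypassL1)
        copy_kernel<true><<<grid, block>>>(in, out, n);
    else
        copy_kernel<false><<<grid, block>>>(in, out, n);
}
```

Since BYPASS_L1 is a template parameter, each instantiation is dead-code-eliminated down to a single load path; the runtime branch exists only on the host.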


I don't know whether this was possible before, but CUDA 8.0 gives you the ability to fine-tune caching for specific reads/writes. See the PTX manual for details.

For example, to make this code always go all the way to main memory when reading:

const float4 val = input[i];

you can write the following:

float4 val;
const float4* myinput = input+i;
asm("ld.global.cv.v4.f32 {%0, %1, %2, %3}, [%4];" : "=f"(val.x), "=f"(val.y), "=f"(val.z), "=f"(val.w) : "l"(myinput));

I was able to speed up one of my cache-bound kernels by about 20% by using non-cached reads and writes for data that, by design, was only accessed once.
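A sketch of what such a streaming kernel might look like, assuming a copy-like access pattern (the kernel name and shape are assumptions, not the answerer's actual code). It combines the .cv load shown above with a .wt (write-through) store from the same PTX cache-operator family, so data touched exactly once neither pollutes nor depends on the caches:

```cuda
// Hypothetical streaming kernel: every element is read once and
// written once, using non-cached loads and write-through stores.
__global__ void stream_copy(const float4* __restrict__ in,
                            float4* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 v;
    // ld.global.cv: don't cache; fetch from memory again each time.
    asm("ld.global.cv.v4.f32 {%0, %1, %2, %3}, [%4];"
        : "=f"(v.x), "=f"(v.y), "=f"(v.z), "=f"(v.w)
        : "l"(in + i));
    // st.global.wt: write through the caches to memory.
    asm("st.global.wt.v4.f32 [%0], {%1, %2, %3, %4};"
        :: "l"(out + i), "f"(v.x), "f"(v.y), "f"(v.z), "f"(v.w)
        : "memory");
}
```

Whether this actually helps is workload-dependent; it pays off mainly when the data has no reuse at all, as in the single-access case described above.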
