How is access to the same global memory address done by threads from different cores?

Question

How is access to the same global memory address done by threads from different cores?

If many threads in a warp want to read the same address in global memory, that data is broadcast, right?

If many threads in a warp want to write to the same address in global memory, the writes are serialized, but it is impossible to predict the order, right?

But here is the main question: what if many threads in different warps, in different blocks, want to write to the same address in global memory? What is the GPU going to do? Serialize all accesses to this address? Is there any guarantee of data consistency?

With Hyper-Q, you can launch many streams containing kernels. If I have a location in memory and multiple threads on different cores want to write or read that address, what is the GPU going to do? Does it serialize the accesses of all threads from different cores, or does it do nothing, so that inconsistencies will occur? Is there any guarantee of data consistency when multiple cores read or write the same address?


cuda

CarcaraH 22 jan. 13 at 4:05


1 answer

Robert Crovella · Answer 1 · 2013-01-22T04:50:06+0000

It is preferred that you ask one question per post.

If many threads in a warp want to read the same address in global memory, that data is broadcast, right?

Yes, this is true for Fermi (CC2.0) and beyond.

If many threads in a warp want to write to the same address in global memory, the writes are serialized, but it is impossible to predict the order, right?

Right. The order is undefined.

What if many threads in different warps, in different blocks, want to write to the same address in global memory? What is the GPU going to do? Serialize all accesses to this address?

If the accesses are concurrent, they are serialized. Again, the order is undefined.

Is there any guarantee of data consistency?

I'm not sure what you mean by data consistency. In any case, what else could a GPU do besides serialize concurrent writes? I'm surprised this is treated as such a complex concept; I see no obvious alternative.

If I have a location in memory and multiple threads on different cores want to write or read that address, what will the GPU do? Does it serialize the accesses of all threads from different cores, or does it do nothing, so that inconsistencies will occur? Is there any guarantee of data consistency when multiple cores read or write the same address?

It does not matter where the simultaneous writes to global memory come from: the same warp, different warps, different blocks, or different cores. Concurrent writes are serialized in an undefined order. Again, I would like to know what you mean by "data consistency". Simultaneous reads and writes also produce undefined behavior: a read can return the initial value of the memory location or any of the values that were written.

The end result of concurrent writes to any GPU memory location is undefined. If all concurrent writes store the same value, the final value at that location will be that value. Otherwise, the final value will be one of the values that were written; which one is undefined. Also, most of your questions and statements don't make sense to me. (What do you mean by data consistency?) You shouldn't expect anything rational from this kind of programming. The GPU should be programmed as a machine for distributed, independent work, not as a globally synchronous machine. Note that "undefined" also means that the results may differ from one kernel run to the next, even if the inputs are identical.
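To make the point above concrete, here is a minimal sketch (not part of the original answer) that contrasts a plain write with an atomic update. The kernel names racyWrite and atomicCount, the launch sizes, and the host code are assumptions chosen for illustration. In the first kernel every thread stores its own index to one address, so per the description above the surviving value is one of the written indices and may change between runs. In the second kernel each thread uses atomicAdd, a read-modify-write that the hardware performs indivisibly, so the final count is well defined even though the order of the increments is not.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread writes its own global index to the
// same address. Which index survives is undefined.
__global__ void racyWrite(int *slot) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    *slot = tid;
}

// Hypothetical kernel: every thread increments one shared counter.
// atomicAdd makes each increment indivisible; only the order is undefined.
__global__ void atomicCount(unsigned int *counter) {
    atomicAdd(counter, 1u);
}

int main() {
    const int blocks = 64, threads = 256;   // illustrative launch sizes

    int *dSlot = nullptr;
    unsigned int *dCounter = nullptr;
    cudaMalloc(&dSlot, sizeof(int));
    cudaMalloc(&dCounter, sizeof(unsigned int));
    cudaMemset(dSlot, 0xFF, sizeof(int));            // all bytes 0xFF, i.e. -1
    cudaMemset(dCounter, 0, sizeof(unsigned int));

    racyWrite<<<blocks, threads>>>(dSlot);
    atomicCount<<<blocks, threads>>>(dCounter);

    int hSlot = 0;
    unsigned int hCounter = 0;
    // cudaMemcpy on the default stream waits for the kernels above to finish.
    cudaMemcpy(&hSlot, dSlot, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&hCounter, dCounter, sizeof(unsigned int), cudaMemcpyDeviceToHost);

    printf("racy slot      = %d (some index in [0, %d))\n", hSlot, blocks * threads);
    printf("atomic counter = %u (expected %d)\n", hCounter, blocks * threads);

    cudaFree(dSlot);
    cudaFree(dCounter);
    return 0;
}
```

Atomics only make the individual update well defined; they do not impose any ordering between threads, so the advice about structuring the work independently still applies.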

Simultaneous or near-simultaneous reading and writing of global memory from different blocks (whether on the same or different cores) is especially dangerous on Fermi devices (cc2.x) because of the independent, non-coherent L1 caches that sit between the SMs (the streaming multiprocessors, where thread blocks execute) and the L2 cache (which is device-wide and therefore coherent). Trying to create synchronized behavior between thread blocks using global memory as a vehicle is difficult at best and not recommended. I suggest you look for ways to restructure your algorithm so that the work is done independently.
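As one possible illustration of structuring the work independently (an assumed example, not something from the original answer), the sketch below sums an array without any two blocks touching the same global address. Each block reduces its slice in shared memory and writes one partial result to its own element of a hypothetical partial array; the host then adds the partial sums after the kernel has finished. The names blockSum, kThreads, and partial are made up for this sketch, and the reduction loop assumes the block size equals kThreads and is a power of two.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;   // assumed block size; must be a power of two

// Hypothetical kernel: each block sums its own slice of the input and
// writes the result to partial[blockIdx.x]. No address is shared between blocks.
__global__ void blockSum(const int *in, int *partial, int n) {
    __shared__ int scratch[kThreads];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    scratch[threadIdx.x] = (gid < n) ? in[gid] : 0;
    __syncthreads();

    // Tree reduction within the block; __syncthreads() orders each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            scratch[threadIdx.x] += scratch[threadIdx.x + stride];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        partial[blockIdx.x] = scratch[0];
    }
}

int main() {
    const int n = 1 << 16;
    const int blocks = (n + kThreads - 1) / kThreads;

    std::vector<int> hIn(n, 1);              // all ones, so the sum should be n
    int *dIn = nullptr, *dPartial = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dPartial, blocks * sizeof(int));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    blockSum<<<blocks, kThreads>>>(dIn, dPartial, n);

    // The copy back waits for the kernel, so every partial sum is complete.
    std::vector<int> hPartial(blocks);
    cudaMemcpy(hPartial.data(), dPartial, blocks * sizeof(int), cudaMemcpyDeviceToHost);

    long long total = 0;
    for (int p : hPartial) total += p;
    printf("total = %lld (expected %d)\n", total, n);

    cudaFree(dIn);
    cudaFree(dPartial);
    return 0;
}
```

The kernel boundary acts as the only global synchronization point here, which avoids relying on cache behavior between blocks while the kernel is running.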
