Difference between read and write prefetch
The gcc docs talk about the difference between read prefetch and write prefetch. What's the technical difference?
At the CPU level, a software prefetch (as opposed to one triggered by the hardware itself) is a convenient way to hint to the CPU that a line is about to be accessed and that you want it fetched in advance to hide the latency.
If the access is going to be a simple read, you want a regular prefetch, which behaves much like a normal load from memory (except that it does not block the CPU if it misses, does not fault if the address is invalid, and has various other benefits depending on the microarchitecture).
However, if you intend to write to that line and it also exists in another core, a simple read operation is not sufficient. This is due to MESI-based cache coherence protocols. A core must own a line before modifying it in order to preserve coherency (if the same line were modified in multiple cores, you could not guarantee the order of those changes and might even lose some of them, which is not allowed for normal WB memory types). Instead, a write operation starts by acquiring ownership of the line and snooping it out of any other core / socket that might hold a copy. Only then can the write take place. A read operation (demand or prefetch) would leave the line in the other cores in a shared state, which is good if the line is going to be read multiple times by many cores, but does not help if your core later writes to it.
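The ownership rule described above can be sketched as a toy model of a single cache line. This is only an illustration: the names (line_read, line_own, line_write, NCORES) are hypothetical, and real hardware implements these transitions in the cache controllers, often with extra states. It shows that a read leaves other copies shared, while a write must first invalidate them, and that taking ownership early (which is what a write prefetch is meant to do) moves the invalidation off the store itself.

```c
#include <assert.h>

/* Toy MESI model of ONE cache line as seen by NCORES cores.
   All names here are hypothetical and for illustration only. */
enum mesi { INVALID, SHARED, EXCLUSIVE, MODIFIED };
#define NCORES 4

struct line { enum mesi state[NCORES]; };

/* Read (demand load or read prefetch) by `core`. */
void line_read(struct line *l, int core) {
    assert(core >= 0 && core < NCORES);
    if (l->state[core] != INVALID) return;        /* already cached: hit */
    int others = 0;
    for (int c = 0; c < NCORES; c++) {
        if (c != core && l->state[c] != INVALID) {
            l->state[c] = SHARED;                 /* M/E holders downgrade to S */
            others = 1;
        }
    }
    l->state[core] = others ? SHARED : EXCLUSIVE;
}

/* Take ownership (what a write prefetch aims for): invalidate every
   other copy. Returns how many copies had to be invalidated. */
int line_own(struct line *l, int core) {
    assert(core >= 0 && core < NCORES);
    int invalidated = 0;
    for (int c = 0; c < NCORES; c++) {
        if (c != core && l->state[c] != INVALID) {
            l->state[c] = INVALID;
            invalidated++;
        }
    }
    if (l->state[core] != MODIFIED) l->state[core] = EXCLUSIVE;
    return invalidated;
}

/* Write by `core`: ownership first, then modify. The return value is
   the number of invalidations that sat on the store's critical path. */
int line_write(struct line *l, int core) {
    int invalidated = line_own(l, core);
    l->state[core] = MODIFIED;
    return invalidated;
}
```

In this model, a write that follows a plain read by two cores has to invalidate one copy at store time, whereas a write that follows line_own finds nothing left to invalidate.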
To allow useful prefetching for lines that will later be written, most CPU vendors support special prefetches for writing. On x86, both Intel and AMD support the prefetchW instruction, which should have the effect of a write (i.e., acquiring sole ownership of the line and invalidating any other copy of it). Note that not all processors support this (even within the same family, not all generations have it), and not all compiler versions enable it.
Here's an example (with gcc 4.8.2) - note that you need to enable it explicitly here -
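#include <emmintrin.h>
int main() {
    long long int a[100];
    /* __builtin_prefetch(addr, rw, locality): rw 0 = read, 1 = write;
       locality 0 (no temporal locality) .. 3 (high temporal locality) */
    __builtin_prefetch (&a[0], 0, 0);
    __builtin_prefetch (&a[16], 0, 1);
    __builtin_prefetch (&a[32], 0, 2);
    __builtin_prefetch (&a[48], 0, 3);
    __builtin_prefetch (&a[64], 1, 0);  /* prefetch for write */
    return 0;
}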
Compiled with gcc -O3 -mprfchw prefetchw.c -c, the disassembly is:
0000000000000000 <main>:
0: 48 81 ec b0 02 00 00 sub $0x2b0,%rsp
7: 48 8d 44 24 88 lea -0x78(%rsp),%rax
c: 0f 18 00 prefetchnta (%rax)
f: 0f 18 98 80 00 00 00 prefetcht2 0x80(%rax)
16: 0f 18 90 00 01 00 00 prefetcht1 0x100(%rax)
1d: 0f 18 88 80 01 00 00 prefetcht0 0x180(%rax)
24: 0f 0d 88 00 02 00 00 prefetchw 0x200(%rax)
2b: 31 c0 xor %eax,%eax
2d: 48 81 c4 b0 02 00 00 add $0x2b0,%rsp
34: c3 retq
If you play with the third argument (the locality hint) of the write prefetch, you will notice that hint levels are ignored for prefetchW, as it does not support temporal level hints. By the way, if you remove the -mprfchw flag, gcc will turn it into a normal read prefetch (I haven't tried different -march / mattr settings, maybe some of them include it as well).
The difference has to do with whether you expect the memory to be only read soon or also written. In the latter case, the processor can optimize differently. Remember that prefetch is only a hint, so GCC can ignore it.
To quote the GCC prefetch project page:
Some prefetch instructions distinguish between memory that is expected to be read and memory that is expected to be written. When data is to be written, a prefetch instruction can move the block into the cache so that the expected store will be to the cache. Prefetching for write generally brings the data into the cache in an exclusive or modified state.
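As a rough sketch of how the two kinds of hint might be used together, the loop below prefetches the source array for reading and the destination array for writing. It assumes GCC or Clang (for __builtin_prefetch); the function name add_into and the distance PREFETCH_AHEAD are made up for illustration, and the right distance depends on the machine. Whether the write hint actually becomes a write prefetch depends on the target and compiler flags.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed prefetch distance in elements; a guess that needs tuning. */
#define PREFETCH_AHEAD 16

/* Hypothetical helper: dst[i] += src[i] for i in [0, n).
   src is only read, dst is read and then written. */
void add_into(long long *dst, const long long *src, size_t n) {
    assert(n == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* rw = 0: the line will only be read */
            __builtin_prefetch(&src[i + PREFETCH_AHEAD], 0, 3);
            /* rw = 1: the line will be written, so ask for ownership */
            __builtin_prefetch(&dst[i + PREFETCH_AHEAD], 1, 3);
        }
        dst[i] += src[i];
    }
}
```

The bounds check keeps the prefetched addresses inside the arrays. Since the prefetches are only hints, removing them does not change the result, only (possibly) the timing.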