Confusion about memory reordering
I am trying to run a simple task (get the x2APIC ID of the current processor) on every available hardware thread. To do this, I wrote the following code, which works on the machines I tested it on (see here for a complete MWE that compiles on Linux as C++11).
void print_x2apic_id()
{
    uint32_t r1, r2, r3, r4;
    std::tie(r1, r2, r3, r4) = cpuid(11, 0);
    std::cout << r4 << std::endl;
}

int main()
{
    const auto _ = std::ignore;
    auto nprocs = ::sysconf(_SC_NPROCESSORS_ONLN);
    auto set = ::cpu_set_t{};
    std::cout << "Processors online: " << nprocs << std::endl;
    for (auto i = 0; i != nprocs; ++i) {
        CPU_SET(i, &set);
        check(::sched_setaffinity(0, sizeof(::cpu_set_t), &set));
        CPU_CLR(i, &set);
        print_x2apic_id();
    }
}
Output on one machine (compiled with g++ 4.9.0):
0 2 4 6 32 34 36 38
Each iteration printed a different x2APIC ID, so everything works as expected. Now the problems start. I replaced the call to print_x2apic_id with the following code:
uint32_t r4;
std::tie(_, _, _, r4) = cpuid(11, 0);
std::cout << r4 << std::endl;
This causes the same identifier to be printed for each iteration of the loop:
36 36 36 36 36 36 36 36
My guess is that the compiler concluded that the call to cpuid does not depend on the loop iteration (although it really does), and then "optimized" the code by hoisting the CPUID call out of the loop. To fix this, I made r4 atomic:
std::atomic<uint32_t> r4;
std::tie(_, _, _, r4) = cpuid(11, 0);
std::cout << r4 << std::endl;
This failed to fix the problem. Surprisingly, the following does fix it:
std::atomic<uint32_t> r1;
uint32_t r2, r3, r4;
std::tie(r1, r2, r3, r4) = cpuid(11, 0);
std::cout << r4 << std::endl;
... Okay, now I'm confused.
Edit: Replacing the asm statement in the cpuid function with asm volatile also fixes the problem, but I don't see why it is needed.
My questions
- Shouldn't inserting an acquire fence before the call to cpuid and a release fence after it be sufficient to prevent the compiler from reordering memory operations?
- Why did converting r4 to std::atomic<uint32_t> not work? And why does storing the first three outputs in r1, r2, and r3, instead of ignoring them, make the program work?
- How can I write the loop correctly using the minimum amount of synchronization?
I have reproduced the issue with optimization enabled (-O). You are correct in suspecting compiler optimization. CPUID acts as a full memory barrier (for the processor itself), but the compiler generates code that does not call cpuid inside the loop, because it treats it as a constant function: the same inputs are assumed to always give the same outputs. asm volatile prevents the compiler from doing this by telling it that the statement has side effects.
See this answer for more details: fooobar.com/questions/127409 / ...
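For illustration, here is a minimal sketch of what a cpuid wrapper using asm volatile might look like. The question does not show its own cpuid function, so the signature, the register constraints, and the "memory" clobber below are my assumptions, not the asker's code. It uses GCC/Clang extended asm and only builds for x86 targets.

```cpp
#include <cstdint>
#include <tuple>

// Hypothetical wrapper: the question's actual cpuid() is not shown.
// Returns (eax, ebx, ecx, edx) for the given leaf and subleaf.
inline std::tuple<std::uint32_t, std::uint32_t, std::uint32_t, std::uint32_t>
cpuid(std::uint32_t leaf, std::uint32_t subleaf)
{
    std::uint32_t eax, ebx, ecx, edx;
    // "volatile" marks the statement as having side effects, so the
    // compiler may not treat it as a pure function of (leaf, subleaf):
    // it will not hoist it out of a loop, merge identical statements,
    // or drop it when some outputs are unused.
    // The "memory" clobber is an extra precaution (my addition): it makes
    // the statement a compiler-level barrier for memory accesses.
    asm volatile("cpuid"
                 : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                 : "a"(leaf), "c"(subleaf)
                 : "memory");
    return std::make_tuple(eax, ebx, ecx, edx);
}
```

With a wrapper like this, the std::tie(_, _, _, r4) = cpuid(11, 0) variant from the question should execute CPUID once per loop iteration, after each sched_setaffinity call, without needing std::atomic or explicit fences.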