Confusion about memory reordering

I am trying to run a simple task (getting the x2APIC ID of the current processor) on every available hardware thread. To do this, I wrote the following code, which works on the machines I tested it on (see here for a complete MWE that compiles on Linux as C++11).

void print_x2apic_id()
{
        uint32_t r1, r2, r3, r4;
        std::tie(r1, r2, r3, r4) = cpuid(11, 0);
        std::cout << r4 << std::endl;
}

int main()
{
        const auto _ = std::ignore;
        auto nprocs = ::sysconf(_SC_NPROCESSORS_ONLN);
        auto set = ::cpu_set_t{};
        std::cout << "Processors online: " << nprocs << std::endl;

        for (auto i = 0; i != nprocs; ++i) {
                CPU_SET(i, &set);
                check(::sched_setaffinity(0, sizeof(::cpu_set_t), &set));
                CPU_CLR(i, &set);
                print_x2apic_id();
        }
}

      

Output on one machine (when compiled with g++ 4.9.0):

0
2
4
6
32
34
36
38

      

Each iteration prints a different x2APIC ID, so everything works as expected. Now the problems start. I replaced the call to print_x2apic_id with the following code:

uint32_t r4;
std::tie(_, _, _, r4) = cpuid(11, 0);
std::cout << r4 << std::endl;

      

This causes the same identifier to be printed for each iteration of the loop:

36
36
36
36
36
36
36
36

      

My guess is that the compiler decided that the call to cpuid is independent of the loop iteration (although it really isn't). The compiler then "optimized" the code by hoisting the CPUID call out of the loop. To fix this, I made r4 atomic:

std::atomic<uint32_t> r4;
std::tie(_, _, _, r4) = cpuid(11, 0);
std::cout << r4 << std::endl;

      

This failed to fix the problem. Surprisingly, the following does fix it:

std::atomic<uint32_t> r1;
uint32_t r2, r3, r4;
std::tie(r1, r2, r3, r4) = cpuid(11, 0);
std::cout << r4 << std::endl;

      

... Okay, now I'm confused.

Edit: Replacing the asm statement in the cpuid function with asm volatile also fixes the problem, but I don't see why it should be needed.

My questions

  • Shouldn't inserting an acquire fence before the call to cpuid and a release fence after it be sufficient to prevent the compiler from reordering memory operations?
  • Why didn't converting r4 to std::atomic<uint32_t> work? And why does storing the first three outputs in r1, r2 and r3 instead of ignoring them make the program work?
  • How can I write the loop correctly using the minimum amount of synchronization?

1 answer


I have reproduced the issue with optimization enabled (-O). You are right to suspect a compiler optimization. CPUID does act as a full memory barrier (for the processor itself), but the problem is that the compiler generates code that does not call the cpuid function inside the loop, since it treats it as a constant function. asm volatile prevents the compiler from doing this by telling it that the statement has side effects.
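To make this concrete, here is a minimal sketch of what a cpuid wrapper marked asm volatile could look like, together with the loop from the question. It assumes x86-64 Linux with GCC or Clang. The wrapper body is my guess, since the question does not show its own definition of cpuid, and collect_x2apic_ids is a hypothetical helper name. The "memory" clobber is an extra, conservative choice on my part that the fix does not strictly need; it additionally keeps the compiler from moving memory accesses across the statement.

```cpp
#include <sched.h>   // sched_getaffinity, sched_setaffinity, cpu_set_t, CPU_* (glibc)
#include <unistd.h>  // sysconf

#include <cstdint>
#include <tuple>
#include <vector>

// Assumed wrapper: the question's real cpuid() is not shown.
// `volatile` tells the compiler the statement has side effects, so it is
// not treated as a pure function of (leaf, subleaf) and hoisted out of loops.
inline std::tuple<uint32_t, uint32_t, uint32_t, uint32_t>
cpuid(uint32_t leaf, uint32_t subleaf)
{
        uint32_t eax, ebx, ecx, edx;
        asm volatile("cpuid"
                     : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                     : "a"(leaf), "c"(subleaf)
                     : "memory");  // conservative extra: compiler-level barrier
        return std::make_tuple(eax, ebx, ecx, edx);
}

// Hypothetical helper: pin the calling thread to each online CPU in turn
// and read EDX of CPUID leaf 11 (the x2APIC ID) there.
inline std::vector<uint32_t> collect_x2apic_ids()
{
        std::vector<uint32_t> ids;

        // Remember the current affinity mask so it can be restored afterwards.
        cpu_set_t saved;
        CPU_ZERO(&saved);
        if (::sched_getaffinity(0, sizeof(saved), &saved) != 0)
                return ids;

        const long nprocs = ::sysconf(_SC_NPROCESSORS_ONLN);
        for (long i = 0; i < nprocs; ++i) {
                cpu_set_t set;
                CPU_ZERO(&set);
                CPU_SET(static_cast<int>(i), &set);
                // Skip CPUs this process is not allowed to run on.
                if (::sched_setaffinity(0, sizeof(set), &set) != 0)
                        continue;
                ids.push_back(std::get<3>(cpuid(11, 0)));
        }

        ::sched_setaffinity(0, sizeof(saved), &saved);
        return ids;
}
```

With this wrapper, ignoring the first three outputs via std::ignore should no longer matter, because the compiler has to emit the CPUID instruction on every iteration regardless of which results are used. No std::atomic variables or fences are involved.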



See this answer for more details: fooobar.com/questions/127409 / ...
