Is it possible to reorder memory in an OoOE processor?

We know that an OoOE processor can reorder two instructions. For example, suppose there are two global variables shared between different threads:

int data;
bool ready;

The writer thread produces data and then sets a flag, ready, to allow readers to consume that data:

data = 6;
ready = true;

Now, on an OoOE processor, these two instructions can be reordered in flight (fetched and executed out of order). But what about the final commit/write-back of the results? That is, will the stores themselves happen in order?

From what I've learned, it totally depends on the processor's memory model. For example, x86/64 has a strong memory model in which store reordering is prohibited. In contrast, ARM typically has a weak model in which store reordering can occur (along with a few other kinds of reordering).

Also, my gut feeling tells me that I'm right, because otherwise we wouldn't need the store barrier between these two instructions that typical multi-threaded programs use.

But here's what Wikipedia says:

... In the above scheme, the OoOE processor avoids the stall that occurs in step (2) of the in-order processor when the instruction is not completely ready to be processed due to missing data.

OoOE processors fill these "slots" in time with other instructions that are ready, then re-order the results at the end to make it appear that the instructions were processed as normal.

I'm confused. Is it saying that the results must be committed in order? Can the stores to data and ready really be reordered on an OoOE processor?

4 answers


The consistency model (or memory model) of an architecture determines which memory operations can be reordered. The idea is always to get the best performance out of the code while maintaining the semantics expected by the programmer. This is what the Wikipedia passage is getting at: to the programmer, memory operations appear to execute in order, even though they may actually have been reordered. Reordering is generally safe when the code is single-threaded, since the processor can easily detect potential violations.

On x86, the general model is that stores are not reordered with other stores. However, the processor does use out-of-order execution (OoOE), so instructions are constantly being reordered. Typically, a processor has several additional hardware structures to support OoOE, such as a reorder buffer and a load/store queue. The reorder buffer ensures that all instructions appear to retire in order, so that interrupts and exceptions break the program at a precise point. The load/store queue functions similarly, in that it can restore the order of memory operations required by the memory model. The load/store queue also disambiguates addresses, so the processor can identify when operations touch the same or different addresses.

Back to OoOE: the processor has on the order of tens to hundreds of instructions in flight in every cycle. Loads and stores compute their addresses, etc. The processor may prefetch cache lines for upcoming accesses (which may involve cache coherence traffic), but it cannot actually access a line for either a read or a write until it is safe to do so according to the memory model.



Store barriers, memory fences, etc. tell both the compiler and the processor about further restrictions on reordering memory operations. The compiler is part of the memory model implementation, since some languages, like Java, have a precise memory model, while others, like C, follow the looser rule that memory accesses should appear as if they were executed in order.

In conclusion: yes, data and ready can be reordered under OoOE. Whether they actually are depends on the memory model. So if you need a specific order, provide appropriate directives using barriers, etc., so that the compiler, the processor, and so on won't choose a different order to improve performance.


In a modern processor, the store itself is asynchronous (think of it as submitting the change to the L1 cache and continuing execution; the cache system then propagates the change asynchronously). So stores to two objects that lie in different cache lines can become visible out of order from another processor's perspective.

Moreover, even the store instructions themselves can execute out of order. For example, when two objects are stored "at the same time" but the cache line of one object is held or locked by another core or bus master, the other object may be committed earlier.



So, to properly share data across threads, you need some sort of memory barrier, or transactional memory such as TSX, found in newer processors.


I think you are misinterpreting "to make it appear that the instructions were processed as normal". This means that if I have:

add r1 + 7 -> r2
move r3 -> r1

and the order of these operations is effectively reversed by out-of-order execution, the value that participates in the add will still be the r1 value that was present before the move. The CPU renames registers and/or delays register stores to ensure that the "meaning" of the sequential instruction stream is unchanged.

It says nothing about the order in which the stores become visible from another processor.



The simple answer is YES on some types of processors.

Before your code even reaches the processor, it faces an earlier issue: compiler reordering.

data = 6;
ready = true;

The compiler is free to reorder these statements because, as far as it knows, they do not affect each other (the compiler is not aware of threads).

Now down to the processor level:

1) An out-of-order processor may process these instructions in a different order, including reordering stores.

2) Even if the CPU executes them in order, the memory controller may not commit them in order, because it may need to flush or bring in cache lines, or perform an address translation, before it can write them.

3) Even if it doesn't, another processor in the system may not see them in the same order. To observe them, it may need to obtain the modified cache lines from the core that wrote them. That core may not be able to release one cache line earlier than the other if it is held by a different core, or if there is contention for that line among multiple cores, and the observer's own out-of-order execution may read one value before the other.

4) Finally, speculative execution on other cores may read the value of data before ready was set by the writing core, and by the time it gets around to reading ready, the flag has been set, but data has also been changed again.

All of these problems are addressed with memory barriers. Platforms with weakly ordered memory must use memory barriers to enforce memory consistency for thread synchronization.







