ARM atomic performance
I am running the same code on an Intel processor and an ARM processor (Mac / iOS, compiler: Clang). While profiling the app, I noticed that on iOS / ARM atomic operations are among the top 3 entries, while on Intel they are not even in the top 10. Is it true that atomic operations are much slower on ARM (relatively speaking, of course)?
Note that, because of implementation details, the profiler is not necessarily showing you the whole story.
Because ARM is a load/store architecture, any atomic operation is at least 4 instructions: load-exclusive, <operation> [1], store-exclusive, and a conditional branch to retry if necessary. Every other core is completely oblivious to this and carries on with its own work.
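The shape of that sequence can be sketched in portable C++ as an explicit retry loop. This is only an illustration, not what Clang necessarily emits for your code: `fetch_add_retry` and `check_fetch_add_retry` are hypothetical names, and the mapping of each line to a load-exclusive / store-exclusive instruction is an assumption about how a compiler typically lowers `compare_exchange_weak` on ARM targets that use exclusive loads and stores.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: fetch-and-add written as an explicit retry loop.
// It mirrors the four-step shape described above:
//   load, <operation>, conditional store, branch back on failure.
inline int fetch_add_retry(std::atomic<int>& target, int delta) {
    // Step 1: read the current value (stands in for the load-exclusive).
    int observed = target.load(std::memory_order_relaxed);

    // Steps 2-4: compute observed + delta, try to store it, retry on failure.
    // compare_exchange_weak is allowed to fail spuriously, and on any failure
    // it writes the freshly observed value back into `observed`.
    while (!target.compare_exchange_weak(observed, observed + delta,
                                         std::memory_order_seq_cst,
                                         std::memory_order_relaxed)) {
        // Nothing to do here: `observed` was refreshed, so just loop again.
    }
    return observed;  // value held before our addition, like fetch_add
}

// Usage sketch: single-threaded sanity check of the helper above.
inline void check_fetch_add_retry() {
    std::atomic<int> counter{5};
    const int previous = fetch_add_retry(counter, 3);
    assert(previous == 5);
    assert(counter.load() == 8);
}
```

Each of those steps is a separate instruction that a sampling profiler can land on, which is one reason the operation is so visible in an ARM profile.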
On x86, however, where instructions can operate directly on memory, atomics are usually implemented by putting a LOCK prefix on a single instruction. This means 2 things: first, you can never be interrupted inside your atomic "routine", since it is a single instruction. Second, no other core can access memory while the bus is locked, so it effectively stalls the execution of everything else until it completes [2]. Together, these mean that a sampling profiler will rarely, if ever, catch an atomic operation "in progress", no matter how long it actually takes.
[1] OK, an atomic swap needs only 3 instructions, but anything else has one or more instructions in the middle.
[2] This is slightly less relevant on modern cores, which lock their own cache line rather than everything, so as not to affect other cores accessing unrelated areas, but hardware cache coherency will still prevent anyone else from interfering.
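Since a sampling profiler may attribute the cost differently on the two platforms, one way to compare them is to time the atomic operations directly on each. The following is a minimal sketch under stated assumptions: `time_atomic_increments` is a hypothetical helper, it measures only the uncontended single-threaded case, and wall-clock timing over a loop is a rough estimate rather than a precise benchmark.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper: returns the average wall-clock nanoseconds spent per
// atomic increment of `counter`. The counter is passed in by reference so the
// caller can observe the result afterwards.
inline double time_atomic_increments(std::atomic<std::uint64_t>& counter,
                                     std::uint64_t iterations) {
    assert(iterations > 0);  // avoid dividing by zero below

    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        // Sequentially consistent read-modify-write, the default for std::atomic.
        counter.fetch_add(1, std::memory_order_seq_cst);
    }
    const auto stop = std::chrono::steady_clock::now();

    const double total_ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    return total_ns / static_cast<double>(iterations);
}
```

Running the same helper with the same iteration count on the Intel and the ARM device gives a per-operation figure that does not depend on which instruction the profiler happened to sample. Contended access from several threads would need a separate measurement.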