Practical Cost of Uncontended Thread Locks

To begin with, I should clarify that this question is about uncontended thread locks. I am well aware of the very high cost of threads entering highly contended locks: being blocked and suspended, forced to wait in line, incurring context switches, and so on. So I am not asking about the cost of contended locks.

"Cheap"?

I've been told repeatedly that locks are "really cheap" when uncontended, only to find the opposite in practice when profiling. I suppose it depends on how we define "really cheap".

So my request is for details that might help me understand the cost of entering and exiting an uncontended lock in more absolute terms: clock-cycle ranges for different kinds of locks (even if somewhat theoretical), whether it comes down to memory and cache access, and so on. I'm a fairly low-level coder, but not quite at the machine/assembly level (though I'm trying to improve my knowledge there).

For example, is the cost comparable to a heap allocation with a general-purpose allocator? Some consider that cheap, but I find it one of the most expensive things. How does it compare to a mispredicted branch? Does it vary a lot, as with a memory load, which can be very cheap when it hits a cache line but quite expensive when it misses all the way to DRAM?

Up front, I want to make clear that I am not asking this in foresight, obsessing over the micro-efficiency of something I have not yet measured. On the contrary, I am asking in hindsight, having spent many years on large-scale production codebases where uncontended thread locks turned out to be a significant hotspot far more often than I expected. So I want to understand their performance a little better, in a more absolute and precise sense, especially to help me be more cost-conscious in design decisions.

Also, my standards for what constitutes "cheap" can be quite high, since I usually work on the internals of data structures. For example, many people consider heap allocation relatively cheap, and I would agree if we allocate one handle per entire data structure. But if we are inside a data structure and pay that overhead for every element we insert into it, it can get extremely costly. So my idea of "expensive" and "cheap" may be quite different from others'.

Unusual code

One of the codebases I worked in had a very long legacy (decades). It was pretty much designed to run single-threaded only, with many practices that made even a lot of basic functionality thread-unsafe (often not even reentrant). Some of the more ambitious developers wanted to make this codebase increasingly multi-threaded in a modernized form, and of course we ran into plenty of terrible problems. The team's response: sprinkle thread locks all over the place as the bugs flooded in.

I was one of the few using profilers at the time, and I constantly encountered hotspots revolving around thread locks that were only ever used in a fully single-threaded, contention-free context. In the beginning the codebase used platform-specific code for locking, and since I mainly used Windows for development/testing/profiling, the locks were critical sections from the Windows API. We later started using Qt to reduce portability headaches, and the critical-section hotspots were replaced by bottlenecks in QMutex. Then we started adopting parts of Intel Threading Building Blocks, and I saw a few hotspots in tbb::mutex (though not many; I'm not sure whether that was because we didn't use it as much or because it was more efficient than the previous two solutions: it was a massive codebase spanning many millions of lines of code).

And here is the gross part. At one point I noticed that the main bottleneck was a QMutex that was completely uncontended. It was only ever used in a single-threaded context, and the locking existed only for thread safety in case it was ever used from multiple threads. So my coworker "optimized" it like this (pseudocode):

// Skip the lock entirely on the main thread, on the assumption
// that the protected resource is only ever modified there.
if (thread_id != main_thread_id)
    mutex.lock();
... // read the shared resource
if (thread_id != main_thread_id)
    mutex.unlock();

And it actually eliminated our hotspot and improved performance dramatically, making the user who reported the slowdown quite happy with the results! But I think I vomited a little when I saw it. The change rested on the assumption that it was safe because the code in question only read a resource that could only be modified from the main thread.

This is where I started to wonder more about the true cost of uncontended thread locks, given that code as weird as the above, which checks the thread ID and branches around the lock, can actually eliminate significant real-world bottlenecks.

So my final question is: how expensive is an uncontended thread lock (in terms at least somewhat more precise than "cheap")?

If I go on intuition from the cases I have seen (with full awareness that this may be completely wrong), I would say the locks we were dealing with "felt" like they were in the ballpark of 100 cycles (not as expensive as malloc, but getting close to it). As for hardware/OS details: I'm mostly interested in broad answers, since we always deal with multi-platform projects, but my specific interest would be x86/x64 on Windows, OS X, and Linux.


1 answer


FWIW: if everything is implemented optimally, a mutex's uncontended path can be implemented, AFAIR, as two atomic increment/decrement operations (the Interlocked*() family in Windows-speak); these, in turn, translate (on x86) into instructions with the LOCK prefix, which cause the bus to be locked.
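
To make this concrete, here is a minimal sketch of my own (not how Windows, Qt, or TBB actually implement their locks) of a lock whose uncontended fast path is a single atomic read-modify-write on entry and a single release store on exit; on x86 the compare-exchange compiles to a LOCK-prefixed instruction:

#include <atomic>

// Toy lock illustrating the uncontended fast path only. A real
// mutex would block the thread on contention (futex on Linux,
// WaitOnAddress or critical sections on Windows) instead of
// spinning as this sketch does.
class ToyLock {
    std::atomic<int> state{0};  // 0 = unlocked, 1 = locked
public:
    void lock() {
        int expected = 0;
        // Uncontended case: one LOCK cmpxchg and we are done.
        while (!state.compare_exchange_weak(expected, 1,
                                            std::memory_order_acquire)) {
            expected = 0;  // contended: spin (a real lock would sleep)
        }
    }
    void unlock() {
        // On x86 this release store is a plain mov; other lock
        // designs use an atomic decrement (LOCK dec) here instead.
        state.store(0, std::memory_order_release);
    }
};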

Bus locking, in turn, is implemented differently on different hardware, and MAY behave differently on single-socket single-core, single-socket multi-core, multi-socket FSB-based, and NUMA systems. In practice, though, I've seen numbers around 100 cycles for multi-socket machines and tens of cycles for a single socket. NB: these are extremely rough numbers; don't take them as given until you have done your own measurements (on exactly the target hardware) using something like RDTSC.
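
A minimal sketch of such a measurement, using std::mutex as a stand-in for whatever lock you actually care about (loop overhead and rdtsc's reference-clock behavior make this rough, which is exactly why you should measure on your own target hardware):

#include <cstdint>
#include <cstdio>
#include <mutex>
#include <x86intrin.h>  // __rdtsc(); MSVC users: <intrin.h>

int main() {
    std::mutex m;
    const int kIters = 1000000;

    m.lock(); m.unlock();  // warm up the lock and caches

    uint64_t start = __rdtsc();
    for (int i = 0; i < kIters; ++i) {
        m.lock();    // single thread, so always uncontended
        m.unlock();
    }
    uint64_t elapsed = __rdtsc() - start;

    printf("~%llu cycles per uncontended lock+unlock pair\n",
           (unsigned long long)(elapsed / kIters));
    return 0;
}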



PS: The provided snippet (with if (thread_id != main_thread_id)) is potentially dangerous even if the data can only be modified from the main thread.
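
A minimal sketch of why, with hypothetical names, assuming the scenario from the question: once the main thread skips the mutex, its writes no longer synchronize with readers that do take it, so the lock protects nothing.

#include <mutex>
#include <vector>

std::vector<int> shared_table;  // modified only from the main thread
std::mutex table_mutex;

// Main thread, with the "optimization" applied: no lock is taken.
void main_thread_append(int value) {
    shared_table.push_back(value);  // may reallocate mid-read below
}

// Worker thread, dutifully taking the lock as the original code did.
int worker_read_last() {
    std::lock_guard<std::mutex> guard(table_mutex);
    // The lock cannot protect against a writer that never takes it:
    // if main_thread_append runs concurrently, this is a data race
    // (undefined behavior in C++), and we may read the vector
    // mid-resize.
    return shared_table.empty() ? -1 : shared_table.back();
}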
