Is lock-free multithreading slower than a single-threaded program?
I considered parallelizing the program so that in the first phase it sorts the items into buckets modulo the number of parallel workers, which avoids collisions in the second phase. Each thread of the parallel program uses std::atomic::fetch_add
to reserve space in the output array, and then uses std::atomic::compare_exchange_weak
to update the current bucket head pointer. So it is lock-free. However, I had doubts about the performance of multiple threads contending for a single atomic (the one we call fetch_add
on; since the number of bucket heads equals the number of threads, there is not much contention on them on average), so I decided to measure it. Here is the code:
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

std::atomic<int64_t> gCounter(0);
const int64_t gnAtomicIterations = 10 * 1000 * 1000;

void CountingThread() {
  for (int64_t i = 0; i < gnAtomicIterations; i++) {
    gCounter.fetch_add(1, std::memory_order_acq_rel);
  }
}

void BenchmarkAtomic() {
  const uint32_t maxThreads = std::thread::hardware_concurrency();
  std::vector<std::thread> thrs;
  thrs.reserve(maxThreads + 1);

  for (uint32_t nThreads = 1; nThreads <= maxThreads; nThreads++) {
    auto start = std::chrono::high_resolution_clock::now();
    for (uint32_t i = 0; i < nThreads; i++) {
      thrs.emplace_back(CountingThread);
    }
    for (uint32_t i = 0; i < nThreads; i++) {
      thrs[i].join();
    }
    auto elapsed = std::chrono::high_resolution_clock::now() - start;
    double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    printf("%d threads: %.3lf Ops/sec, counter=%lld\n", (int)nThreads,
           (nThreads * gnAtomicIterations) / nSec,
           (long long)gCounter.load(std::memory_order_acquire));

    thrs.clear();
    gCounter.store(0, std::memory_order_release);
  }
}

int __cdecl main() {
  BenchmarkAtomic();
  return 0;
}
And here's the output:
1 threads: 150836387.770 Ops/sec, counter=10000000
2 threads: 91198022.827 Ops/sec, counter=20000000
3 threads: 78989357.501 Ops/sec, counter=30000000
4 threads: 66808858.187 Ops/sec, counter=40000000
5 threads: 68732962.817 Ops/sec, counter=50000000
6 threads: 64296828.452 Ops/sec, counter=60000000
7 threads: 66575046.721 Ops/sec, counter=70000000
8 threads: 64487317.763 Ops/sec, counter=80000000
9 threads: 63598622.030 Ops/sec, counter=90000000
10 threads: 62666457.778 Ops/sec, counter=100000000
11 threads: 62341701.668 Ops/sec, counter=110000000
12 threads: 62043591.828 Ops/sec, counter=120000000
13 threads: 61933752.800 Ops/sec, counter=130000000
14 threads: 62063367.585 Ops/sec, counter=140000000
15 threads: 61994384.135 Ops/sec, counter=150000000
16 threads: 61760299.784 Ops/sec, counter=160000000
The CPU is an 8-core, 16-thread Ryzen 1800X @ 3.9 GHz. The total operations per second across all threads drops sharply up to 4 threads, then decreases slowly and fluctuates slightly.
So, is this phenomenon common across other processors and compilers? Is there a workaround (other than falling back to a single thread)?
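For reference, the reservation scheme I describe above would look roughly like this (an illustrative sketch with assumed names and sizes, not the real program):

#include <atomic>
#include <cstdint>

constexpr uint32_t kWorkers = 16;        // assumed number of parallel workers
constexpr int64_t kMaxItems = 1 << 16;   // assumed capacity of the output array

struct Item { int64_t key; int64_t next; };   // next: slot index of the next item in the bucket, -1 = end of list
Item gOutput[kMaxItems];                      // shared output array
std::atomic<int64_t> gNextFreeSlot{0};        // slot allocator shared by all threads
std::atomic<int64_t> gBucketHead[kWorkers];   // one head per bucket, each initialized to -1 before use

void PushToBucket(int64_t key) {
  const int64_t slot = gNextFreeSlot.fetch_add(1, std::memory_order_relaxed);  // reserve space in the output array
  gOutput[slot].key = key;
  std::atomic<int64_t>& head = gBucketHead[key % kWorkers];  // bucket chosen modulo the worker count
  int64_t oldHead = head.load(std::memory_order_relaxed);
  do {
    gOutput[slot].next = oldHead;                            // link the new item to the current head
  } while (!head.compare_exchange_weak(oldHead, slot,
                                       std::memory_order_release,
                                       std::memory_order_relaxed));
}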
Lock-free multithreading is no slower than a single-threaded program. What makes it slow is data contention. The example you provided is in fact a highly contended artificial program. In a real program you will do a lot of work between each access to the shared data and therefore have fewer cache invalidations, and so on. This CppCon talk by Jeff Preshing may explain some of your questions better than I did.
Addition: try changing CountingThread to add an occasional sleep, to pretend you are busy with something other than incrementing the atomic variable gCounter. Then go ahead and play with the value in the if statement to see how it affects the results of your program.
void CountingThread() {
  for (int64_t i = 0; i < gnAtomicIterations; i++) {
    // take a nap every 10000th iteration to simulate work on something
    // unrelated to access to shared resource
    if (i % 10000 == 0) {
      std::chrono::milliseconds timespan(1);
      std::this_thread::sleep_for(timespan);
    }
    gCounter.fetch_add(1, std::memory_order_acq_rel);
  }
}
In general, every time you call gCounter.fetch_add
, the cache line holding that data is invalidated in the other cores' caches. This forces those cores to reach for the data in a cache further away from them. This effect is the main contributor to the slowdown of your program.
local  L1 CACHE hit,                              ~4 cycles (   2.1 -  1.2 ns )
local  L2 CACHE hit,                             ~10 cycles (   5.3 -  3.0 ns )
local  L3 CACHE hit, line unshared               ~40 cycles (  21.4 - 12.0 ns )
local  L3 CACHE hit, shared line in another core ~65 cycles (  34.8 - 19.5 ns )
local  L3 CACHE hit, modified in another core    ~75 cycles (  40.2 - 22.5 ns )
remote L3 CACHE (Ref: Fig.1 [Pg. 5])        ~100-300 cycles ( 160.7 - 30.0 ns )
local  DRAM                                                   ~60 ns
remote DRAM                                                  ~100 ns
The table above is taken from Approximate Costs of Accessing Various Caches and Main Memory?
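One common way to reduce this cache line ping-pong is to give each thread its own counter on its own cache line and combine them only at the end. A rough sketch (assuming 64-byte cache lines and C++17; the names are made up for illustration):

#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Each counter lives on its own cache line, so increments by different threads
// never invalidate each other's lines.
struct alignas(64) PaddedCounter {
  std::atomic<int64_t> value{0};
};

int64_t CountSharded(uint32_t nThreads, int64_t nIterPerThread) {
  std::vector<PaddedCounter> counters(nThreads);
  std::vector<std::thread> thrs;
  for (uint32_t t = 0; t < nThreads; t++) {
    thrs.emplace_back([&counters, t, nIterPerThread] {
      for (int64_t i = 0; i < nIterPerThread; i++) {
        counters[t].value.fetch_add(1, std::memory_order_relaxed);  // touches only this thread's line
      }
    });
  }
  for (std::thread& th : thrs) th.join();
  int64_t total = 0;
  for (PaddedCounter& c : counters) total += c.value.load(std::memory_order_relaxed);  // combine once at the end
  return total;
}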
Lock-free does not mean that you can exchange data between threads at no cost. Lock-free means that you do not wait for other threads to unlock a mutex before you can read shared data. In fact, even lock-free programs use locking mechanisms to prevent data corruption.
Just follow a simple rule. Try to access the shared data as little as possible to get the most out of multi-core programming.
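For example, the counting loop from the benchmark could accumulate into a local variable and touch the shared atomic only once per thread (an illustrative variant, not code from the question):

#include <atomic>
#include <cstdint>

std::atomic<int64_t> gCounter(0);                        // same shared counter as in the benchmark above
const int64_t gnAtomicIterations = 10 * 1000 * 1000;

// Illustrative variant of CountingThread: accumulate locally, publish once.
void CountingThreadLocalSum() {
  int64_t local = 0;
  for (int64_t i = 0; i < gnAtomicIterations; i++) {
    local++;                                             // no shared data touched in the hot loop
  }
  gCounter.fetch_add(local, std::memory_order_acq_rel);  // a single contended operation per thread
}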
It depends on your specific workload.
See Amdahl's law:

                              100 % (whole workload in percent)
speedup = -------------------------------------------------------------------------
          (sequential workload in %) + (parallel workload in %) / (count of workers)
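For example, if 10% of the workload is sequential and 90% is parallel, then with 8 workers the speedup is at most 100 / (10 + 90 / 8) ≈ 4.7, no matter how fast each worker runs.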
The parallel workload in your program is 0%
, hence the speedup is 1
, i.e. no speedup at all. (You are synchronizing to increment the same memory location, so only one thread can increment the cell at any given time.)
A rough explanation of why it performs even worse than speedup=1
:
With only one thread, the cache line containing gCounter
stays in that processor's cache.
With multiple threads scheduled on different processors or cores, the cache line containing gCounter
keeps bouncing between the caches of the different CPU cores.
So the difference is somewhat comparable to incrementing a register with just one thread versus accessing memory for each increment operation. (It is sometimes faster than a memory access because there are cache-to-cache transfers in modern processor architectures.)
Like most very broad "which is faster" questions, the only completely general answer is: it depends.
A good mental model is that when parallelizing an existing task, the execution time of the parallel version over N
threads will consist of roughly three contributions:
- A still-serial part, common to both the serial and parallel algorithms, i.e., work that was not parallelized, such as setup or teardown work, or work that did not run in parallel because the task was not partitioned exactly 1.
- A parallel part that was effectively parallelized among the N workers.
- An overhead component that represents extra work done in the parallel algorithm that does not exist in the serial version. Partitioning the work, delegating it to worker threads, and combining the results almost always incurs some small overhead, but in some cases the overhead can swamp the real work.
So you have these three contributions; let's call them T1p
, T2p
and T3p
respectively. Now, the T1p
component exists and takes the same time in both the sequential and parallel algorithms, so we can ignore it, since it cancels out when determining which version is slower.
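Put differently (using these terms loosely, and assuming the parallel part splits evenly): the parallel runtime is roughly T1p + T2p + T3p, while the serial version spends about N * T2p on the work that got parallelized, so the parallel version wins only when the overhead T3p is smaller than the (N - 1) * T2p saved on the parallel portion.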
Of course, had you used coarser-grained synchronization, such as incrementing a local variable in each thread and only periodically (perhaps just once at the very end) updating the shared variable, the situation would change.
1 This also includes the case where the workload was well partitioned, but some threads were doing more work per unit of time, which is common with modern processors and modern operating systems.